# T81-558: Applications of Deep Neural Networks
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), School of Engineering and Applied Science, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

**Module 8 Assignment: Building a Kaggle Submission File**

**Student Name: Julia Huang**


# Assignment Instructions

For this assignment you will use the [**reg-33-data.csv**](http://data.heatonresearch.com/data/t81-558/datasets/reg-33-data.csv) dataset to train a neural network and [**reg-33-eval.csv**](http://data.heatonresearch.com/data/t81-558/datasets/reg-33-eval.csv) as the test set for building a submission file (similar to Kaggle).  The preprocessing/feature engineering code used for this assignment will be identical to [Assignment 5](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/assignments/assignment_yourname_class5.ipynb), and you are encouraged to use your Assignment 5 code as a starting point.  Refer to [Module 8](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class8_kaggle.ipynb) for instructions on producing a Kaggle-type submission file.  

The dataframe that you submit should have two columns: *id* and *target*.  The *id* column should match up with the *id* column of the test data file.  The *target* column is your prediction.  It is unlikely that the mean of *target* will match exactly with mine.

Note: my version generated the warning below.  Your difference should be lower than 300; lower is better.  **Warning: The mean of column target differs from the solution file by 121.90476911730366.**
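
The requirements above (exactly two columns, ids matching the test file, one prediction per row) can be checked locally before submitting. The following is a minimal sketch; `check_submission` and the demo values are hypothetical and not part of the course code.

```python
import pandas as pd

def check_submission(df_submit, expected_ids):
    """Return a list of problems found in a submission frame (empty if none).

    Hypothetical helper for illustration; the grading server runs its own checks.
    """
    problems = []
    if list(df_submit.columns) != ["id", "target"]:
        problems.append("columns must be exactly ['id', 'target']")
        return problems
    if len(df_submit) != len(expected_ids):
        problems.append("row count differs from the test file")
    elif not (df_submit["id"].values == pd.Series(expected_ids).values).all():
        problems.append("id values do not match the test file order")
    if df_submit["target"].isna().any():
        problems.append("target contains missing predictions")
    return problems

# Made-up demo data, only to show the call.
demo = pd.DataFrame({"id": [10810, 10811, 10812], "target": [1.0, 2.5, 3.0]})
print(check_submission(demo, [10810, 10811, 10812]))
```

An empty list means none of these basic problems were found; it says nothing about how accurate the predictions are.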

# Assignment Submit Function

You will submit the 10 programming assignments electronically.  The following submit function can be used to do this.  My server will perform a basic check of each assignment and let you know if it sees any basic problems. 

**It is unlikely that you will need to modify this function.**

In [0]:
import base64
import os
import numpy as np
import pandas as pd
import requests

# This function submits an assignment.  You can submit an assignment as many times as you like; only the final
# submission counts.  The parameters are as follows:
# data - Pandas dataframe output.
# key - Your student key that was emailed to you.
# no - The assignment class number, should be 1 through 10.
# source_file - The full path to your Python or IPYNB file.  This must have "_class1" as part of its name.
#               The number must match your assignment number.  For example "_class2" for class assignment #2.
def submit(data,key,no,source_file=None):
    if source_file is None and '__file__' not in globals(): raise Exception('Must specify a filename when running in a Jupyter notebook.')
    if source_file is None: source_file = __file__
    suffix = '_class{}'.format(no)
    if suffix not in source_file: raise Exception('{} must be part of the filename.'.format(suffix))
    with open(source_file, "rb") as image_file:
        encoded_python = base64.b64encode(image_file.read()).decode('ascii')
    ext = os.path.splitext(source_file)[-1].lower()
    if ext not in ['.ipynb','.py']: raise Exception("Source file is {}; must be .py or .ipynb".format(ext))
    r = requests.post("https://api.heatonresearch.com/assignment-submit",
        headers={'x-api-key':key}, json={'csv':base64.b64encode(data.to_csv(index=False).encode('ascii')).decode("ascii"),
        'assignment': no, 'ext':ext, 'py':encoded_python})
    if r.status_code == 200:
        print("Success: {}".format(r.text))
    else: print("Failure: {}".format(r.text))
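
The submit function sends the dataframe as CSV text encoded in Base64. The sketch below reproduces only that encoding step on a made-up two-row frame and decodes it again, to show what is placed in the `csv` field; it makes no network request, and the `demo` frame is an assumption for illustration.

```python
import base64
import io

import pandas as pd

# Made-up frame standing in for a real submission.
demo = pd.DataFrame({"id": [1, 2], "target": [0.5, 1.5]})

# Same encoding chain the submit function applies to `data`.
encoded = base64.b64encode(demo.to_csv(index=False).encode("ascii")).decode("ascii")

# Reverse the steps to confirm nothing is lost along the way.
decoded = pd.read_csv(io.StringIO(base64.b64decode(encoded).decode("ascii")))
print(decoded.equals(demo))
```

Because the text is encoded as ASCII, a frame containing non-ASCII characters would raise an error at the `.encode("ascii")` step.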

# Google CoLab Instructions

If you are using Google CoLab, it will be necessary to mount your GDrive so that you can send your notebook during the submit process.  Running the following code will map your GDrive to /content/drive.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
!ls /content/drive/My\ Drive/Colab\ Notebooks

assignment_hjulia_class7.ipynb	assignment_jhuang_class6.ipynb
assignment_jhuang_class1.ipynb	assignment_jhuang_class8.ipynb
assignment_jhuang_class2.ipynb	assignment_juliahuang_class3.ipynb
assignment_jhuang_class3.ipynb	Untitled
assignment_jhuang_class4.ipynb	Untitled0.ipynb
assignment_jhuang_class5.ipynb


# Assignment #8 Sample Code

The following code provides a starting point for this assignment.

In [0]:
import os
import pandas as pd
from scipy.stats import zscore
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
import io
import requests
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping
from sklearn import metrics

# This is your student key that I emailed to you at the beginning of the semester.
key = "Yg3Uc8sn118A6HaWAFSKG5g1Th1nOyw34jLD5Uh8"

# You must also identify your source file.  (modify for your local setup)
file='/content/drive/My Drive/Colab Notebooks/assignment_jhuang_class8.ipynb'  # Google CoLab
# file='C:\\Users\\jeffh\\projects\\t81_558_deep_learning\\assignments\\assignment_yourname_class8.ipynb'  # Windows
#file='/Users/jheaton/projects/t81_558_deep_learning/assignments/assignment_yourname_class8.ipynb'  # Mac/Linux

# Begin assignment
df = pd.read_csv("https://data.heatonresearch.com/data/t81-558/datasets/reg-33-data.csv")


# Encode the feature vector
ids = df['id']
df.drop('id', axis=1, inplace=True)

# Generate dummies for convention
df = pd.concat([df,pd.get_dummies(df['convention'],prefix="convention")],axis=1)
df.drop('convention', axis=1, inplace=True)

# Generate dummies for cat2
df = pd.concat([df,pd.get_dummies(df['cat2'],prefix="cat2")],axis=1)
df.drop('cat2', axis=1, inplace=True)

# Generate dummies for usage
df = pd.concat([df,pd.get_dummies(df['usage'],prefix="usage")],axis=1)
df.drop('usage', axis=1, inplace=True)

# Generate dummies for region
df = pd.concat([df,pd.get_dummies(df['region'],prefix="region")],axis=1)
df.drop('region', axis=1, inplace=True)

# Generate dummies for code
df = pd.concat([df,pd.get_dummies(df['code'],prefix="code")],axis=1)
df.drop('code', axis=1, inplace=True)

# Generate dummies for item
df = pd.concat([df,pd.get_dummies(df['item'],prefix="item")],axis=1)
df.drop('item', axis=1, inplace=True)

# Generate dummies for country
df = pd.concat([df,pd.get_dummies(df['country'],prefix="country")],axis=1)
df.drop('country', axis=1, inplace=True)

# Missing values for height
med = df['height'].median()
df['height'] = df['height'].fillna(med)

# Missing values for length
med = df['length'].median()
df['length'] = df['length'].fillna(med)

# Standardize ranges
df['height'] = zscore(df['height'])
df['max'] = zscore(df['max'])
df['number'] = zscore(df['number'])
df['length'] = zscore(df['length'])
df['power'] = zscore(df['power'])
df['weight'] = zscore(df['weight'])


# Convert to numpy - Regression
x_columns = df.columns.drop('target')
x = df[x_columns].values
y = df['target'].values
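
The preprocessing cell above repeats the same two lines for every categorical column and for every column with missing values. One way to shorten it is a pair of loops. This is only a sketch: `encode_dummies`, `fill_median`, and the toy frame are hypothetical names and data, not part of the assignment code.

```python
import pandas as pd

def encode_dummies(df, columns):
    """Replace each named column with its dummy columns, appended at the end."""
    for name in columns:
        dummies = pd.get_dummies(df[name], prefix=name)
        df = pd.concat([df.drop(columns=name), dummies], axis=1)
    return df

def fill_median(df, columns):
    """Fill missing values in each named column with that column's median."""
    df = df.copy()
    for name in columns:
        df[name] = df[name].fillna(df[name].median())
    return df

# Toy frame (made-up values) with one categorical and one numeric column.
toy = pd.DataFrame({"region": ["a", "b", "a"], "height": [1.0, None, 3.0]})
out = fill_median(encode_dummies(toy, ["region"]), ["height"])
print(out)
```

For the assignment data, the column lists would be the ones handled one at a time above (`convention`, `cat2`, `usage`, `region`, `code`, `item`, `country` for dummies; `height` and `length` for medians).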


In [0]:
x_train, x_test, y_train, y_test = train_test_split(    
    x, y, test_size=0.25, random_state=42)
    
model = Sequential()
model.add(Dense(25, input_dim=x.shape[1], activation='relu')) # Hidden 1
model.add(Dense(10, activation='relu')) # Hidden 2
model.add(Dense(1)) # Output
model.compile(loss='mean_squared_error', optimizer='adam')
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5, 
                        verbose=1, mode='auto', restore_best_weights=True)
model.fit(x_train,y_train,validation_data=(x_test,y_test),
          verbose=2,callbacks=[monitor],epochs=1000)
  
# Predict
pred = model.predict(x_test)

Train on 8106 samples, validate on 2703 samples
Epoch 1/1000
8106/8106 - 1s - loss: 9414560477.1971 - val_loss: 9324467494.9256
Epoch 2/1000
8106/8106 - 0s - loss: 9328169142.6677 - val_loss: 9147636510.5912
Epoch 3/1000
8106/8106 - 0s - loss: 9020320610.8502 - val_loss: 8694988207.1180
Epoch 4/1000
8106/8106 - 0s - loss: 8403527205.1399 - val_loss: 7913570664.2753
Epoch 5/1000
8106/8106 - 0s - loss: 7460486135.9151 - val_loss: 6822997009.2371
Epoch 6/1000
8106/8106 - 0s - loss: 6247232641.8633 - val_loss: 5519071310.6090
Epoch 7/1000
8106/8106 - 0s - loss: 4890012385.6186 - val_loss: 4147879320.1036
Epoch 8/1000
8106/8106 - 0s - loss: 3548201337.2415 - val_loss: 2874756837.5760
Epoch 9/1000
8106/8106 - 0s - loss: 2375533054.8631 - val_loss: 1834349862.4047
Epoch 10/1000
8106/8106 - 0s - loss: 1482373426.3884 - val_loss: 1106553805.2830
Epoch 11/1000
8106/8106 - 0s - loss: 902657525.1675 - val_loss: 679409857.7521
Epoch 12/1000
8106/8106 - 0s - loss: 583309836.5221 - val_loss: 46333788
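
The training cell stops early once `val_loss` fails to improve by at least `min_delta` for `patience` consecutive epochs. The following is a simplified pure-Python sketch of that rule, not the Keras implementation; `early_stopping_epoch` and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=1e-3):
    """Return (stop_epoch, best_epoch) for a list of validation losses.

    Simplified illustration of patience-based stopping. stop_epoch is None
    if training would run through every supplied epoch.
    """
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Counts as an improvement: remember it and reset the counter.
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Made-up losses: the best value is at epoch 2, followed by five epochs without improvement.
losses = [10.0, 8.0, 7.0, 7.5, 7.2, 7.1, 7.3, 7.4, 6.0]
print(early_stopping_epoch(losses))
```

With `restore_best_weights=True`, as in the cell above, the model keeps the weights from the best epoch rather than from the epoch where training stopped.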

In [0]:
import numpy as np

# Measure RMSE error.  RMSE is common for regression.
score = np.sqrt(metrics.mean_squared_error(y_test, pred))
print("Final score (RMSE): {}".format(score))

Final score (RMSE): 1729.560935683772
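
RMSE is the square root of the mean squared difference between predictions and targets. A minimal NumPy sketch of the same calculation follows; the `rmse` helper and the sample values are assumptions for illustration. Both inputs are flattened first because `model.predict` returns an array of shape `(n, 1)` while `y_test` has shape `(n,)`, and subtracting those directly would broadcast to an `(n, n)` matrix.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up values: only the last prediction is off, by 4.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [[1.0], [2.0], [3.0], [8.0]]  # column shape, like model.predict output
print(rmse(y_true, y_pred))  # squared errors 0, 0, 0, 16 -> mean 4 -> RMSE 2.0
```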


In [0]:
df = pd.read_csv("https://data.heatonresearch.com/data/t81-558/datasets/reg-33-eval.csv")


# Encode the feature vector
ids = df['id']
df.drop('id', axis=1, inplace=True)

# Generate dummies for convention
df = pd.concat([df,pd.get_dummies(df['convention'],prefix="convention")],axis=1)
df.drop('convention', axis=1, inplace=True)

# Generate dummies for cat2
df = pd.concat([df,pd.get_dummies(df['cat2'],prefix="cat2")],axis=1)
df.drop('cat2', axis=1, inplace=True)

# Generate dummies for usage
df = pd.concat([df,pd.get_dummies(df['usage'],prefix="usage")],axis=1)
df.drop('usage', axis=1, inplace=True)

# Generate dummies for region
df = pd.concat([df,pd.get_dummies(df['region'],prefix="region")],axis=1)
df.drop('region', axis=1, inplace=True)

# Generate dummies for code
df = pd.concat([df,pd.get_dummies(df['code'],prefix="code")],axis=1)
df.drop('code', axis=1, inplace=True)

# Generate dummies for item
df = pd.concat([df,pd.get_dummies(df['item'],prefix="item")],axis=1)
df.drop('item', axis=1, inplace=True)

# Generate dummies for country
df = pd.concat([df,pd.get_dummies(df['country'],prefix="country")],axis=1)
df.drop('country', axis=1, inplace=True)

# Missing values for height
med = df['height'].median()
df['height'] = df['height'].fillna(med)

# Missing values for length
med = df['length'].median()
df['length'] = df['length'].fillna(med)

# Standardize ranges
df['height'] = zscore(df['height'])
df['max'] = zscore(df['max'])
df['number'] = zscore(df['number'])
df['length'] = zscore(df['length'])
df['power'] = zscore(df['power'])
df['weight'] = zscore(df['weight'])


# Convert to numpy - Regression
x_columns = df.columns
x = df[x_columns].values

pred = model.predict(x)
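
The evaluation cell above rebuilds the dummy columns and z-scores from the evaluation file alone. That works only when the evaluation data happens to contain the same categories as the training data, and it scales each numeric column by evaluation-set statistics rather than training-set ones. A possible alternative, shown here as a sketch on made-up frames rather than as the assignment's required method, is to reuse the training mean and standard deviation and to reindex the evaluation columns to the training layout.

```python
import pandas as pd

# Made-up stand-ins for the training and evaluation files.
train = pd.DataFrame({"power": [1.0, 2.0, 3.0], "region": ["a", "b", "c"]})
evald = pd.DataFrame({"power": [2.0, 4.0], "region": ["a", "a"]})

train_x = pd.get_dummies(train, columns=["region"])
eval_x = pd.get_dummies(evald, columns=["region"])

# Standardize both frames with the TRAINING mean and standard deviation.
# ddof=0 matches the default of scipy.stats.zscore.
mean = train_x["power"].mean()
std = train_x["power"].std(ddof=0)
train_x["power"] = (train_x["power"] - mean) / std
eval_x["power"] = (eval_x["power"] - mean) / std

# Give the evaluation frame exactly the training columns, in the same order.
# Categories absent from the evaluation data become all-zero columns.
eval_x = eval_x.reindex(columns=train_x.columns, fill_value=0)
print(list(eval_x.columns))
```

In this toy case the evaluation data lacks regions `b` and `c`, so without the reindex step the model would receive two fewer input columns than it was trained on.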

In [0]:
df_submit = pd.DataFrame(pred)
df_submit.insert(0,'id',ids)
df_submit.columns = ['id','target']

df_submit.to_csv("reg_submit.csv", index=False) # Write submit file locally

df_submit

| | id | target |
|---|---|---|
| 0 | 10810 | 74290.671875 |
| 1 | 10811 | 101452.421875 |
| 2 | 10812 | 402.904297 |
| 3 | 10813 | 42895.277344 |
| 4 | 10814 | 41186.191406 |
| ... | ... | ... |
| 995 | 11805 | 121670.882812 |
| 996 | 11806 | 88694.882812 |
| 997 | 11807 | 94123.062500 |
| 998 | 11808 | 76196.718750 |
