# T81-558: Applications of Deep Neural Networks
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), School of Engineering and Applied Science, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

**Module 5 Assignment: K-Fold Cross-Validation**

**Student Name: Julia Huang**

# Assignment Instructions

For this assignment you will use the **reg-33-data.csv** dataset.  This is a dataset that I generated specifically for this semester.  You can find the CSV file on my data site, at this location: [reg-33-data.csv](https://data.heatonresearch.com/data/t81-558/datasets/reg-33-data.csv).

You will train 5 neural networks, one for each fold of a 5-fold cross-validation, and return the out-of-sample predictions.  You will submit these predictions to the **submit** function.  See [Assignment #1](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/assignments/assignment_yourname_class1.ipynb) for details on how to submit an assignment or check that one was submitted.
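
As a small illustration of what "out of sample" means here, the sketch below splits a toy array into 5 folds with scikit-learn's `KFold` and counts how often each row lands in a test set. The row count of 23 and the names `x_demo`, `times_tested`, and `test_sizes` are made up for this example and are not part of the assignment.

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 23  # toy row count, chosen only for illustration
x_demo = np.arange(n_rows).reshape(-1, 1)

kf = KFold(5, shuffle=True, random_state=42)
times_tested = np.zeros(n_rows, dtype=int)
test_sizes = []

for train_idx, test_idx in kf.split(x_demo):
    # Within a fold, no row is in both the training and the test set.
    assert np.intersect1d(train_idx, test_idx).size == 0
    times_tested[test_idx] += 1
    test_sizes.append(len(test_idx))

print(test_sizes)    # size of each fold's test set
print(times_tested)  # how many times each row was used for testing
```

Every row is tested exactly once, by a model that never saw it during training, which is why the 5 sets of test predictions together cover the whole dataset.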

Complete the following tasks:

* Normalize all numerics to zscores and all text/categoricals to dummies.  Do not normalize the *target*.
* Your target (y) is the field named *target*.
* If you find any missing values (NA's), replace them with the median values for that column.
* Use 5-fold cross-validation and return out-of-sample predictions.  Your RMSE will not be as good as in assignment #4, because #4 was overfit.
* Your submission should contain the id (column name *id*), your prediction (column name *pred*), the expected value (from the **reg-33-data.csv** dataset, column name *y*), and the absolute value of the difference between the expected and predicted values (column name *diff*).
* You might get warnings about the means of your columns differing from mine.  Do not worry about small differences. My RMSE was around 9,000. There is a large range in y, so the RMSE will be higher on this data set.
* Your submitted dataframe will have these columns: id, y, pred, diff.
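
The preprocessing tasks above can be sketched on a toy table. This is a minimal sketch, not the reference solution: the `toy` data and the `preprocess` helper are hypothetical, and the helper assumes that every non-numeric column is categorical, which may not hold for every column in the real dataset. The z-score uses the population standard deviation (`ddof=0`), which matches the default of `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric feature, one text feature, and a target.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "height": [10.0, np.nan, 14.0, 12.0],
    "region": ["a", "b", "a", "c"],
    "target": [100.0, 250.0, 175.0, 300.0],
})

def preprocess(df, target="target", id_col="id"):
    """Median-fill and z-score numeric features, dummy-encode text features."""
    out = df.drop(columns=[id_col])
    features = out.columns.drop(target)  # the target is never normalized
    numeric_cols = out[features].select_dtypes(include="number").columns
    text_cols = features.difference(numeric_cols)
    for col in numeric_cols:
        out[col] = out[col].fillna(out[col].median())
        out[col] = (out[col] - out[col].mean()) / out[col].std(ddof=0)
    return pd.get_dummies(out, columns=list(text_cols))

prepared = preprocess(toy)
print(prepared)
```

Filling missing values before computing the z-score matters, because otherwise the NA rows would stay NA after normalization.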


# Assignment Submit Function

You will submit the 10 programming assignments electronically.  The following submit function can be used to do this.  My server will perform a basic check of each assignment and let you know if it sees any basic problems. 

**It is unlikely that you should need to modify this function.**
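
Before sending anything, it can help to run the same filename checks locally and confirm the dataframe has the columns this assignment asks for. The helper below is a hypothetical sketch (the name `precheck_submission` and the `REQUIRED_COLUMNS` list are assumptions for this assignment only); it mirrors the suffix and extension checks in the submit function and does not reproduce whatever checks the server performs.

```python
import os

# Columns requested by this assignment; other assignments may differ.
REQUIRED_COLUMNS = ["id", "y", "pred", "diff"]

def precheck_submission(columns, no, source_file):
    """Return a list of problems; an empty list means the inputs look fine."""
    problems = []
    suffix = "_class{}".format(no)
    if suffix not in source_file:
        problems.append("{} must be part of the filename.".format(suffix))
    ext = os.path.splitext(source_file)[-1].lower()
    if ext not in [".ipynb", ".py"]:
        problems.append("Source file is {} must be .py or .ipynb".format(ext))
    missing = [c for c in REQUIRED_COLUMNS if c not in columns]
    if missing:
        problems.append("Missing columns: {}".format(missing))
    return problems

ok = precheck_submission(["id", "y", "pred", "diff"], 5,
                         "assignment_yourname_class5.ipynb")
bad = precheck_submission(["id", "y", "pred"], 5,
                          "assignment_yourname_class1.ipynb")
print(ok)
print(bad)
```

With a real dataframe, the first argument would be `list(df_submit.columns)`.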

In [0]:
import base64
import os
import numpy as np
import pandas as pd
import requests

# This function submits an assignment.  You can submit an assignment as many times as you like; only the final
# submission counts.  The parameters are as follows:
# data - Pandas dataframe output.
# key - Your student key that was emailed to you.
# no - The assignment class number, should be 1 through 10.
# source_file - The full path to your Python or IPYNB file.  This must have "_class1" as part of its name.
#               The number must match your assignment number.  For example "_class2" for class assignment #2.
def submit(data,key,no,source_file=None):
    if source_file is None and '__file__' not in globals(): raise Exception('Must specify a filename when running in a Jupyter notebook.')
    if source_file is None: source_file = __file__
    suffix = '_class{}'.format(no)
    if suffix not in source_file: raise Exception('{} must be part of the filename.'.format(suffix))
    with open(source_file, "rb") as image_file:
        encoded_python = base64.b64encode(image_file.read()).decode('ascii')
    ext = os.path.splitext(source_file)[-1].lower()
    if ext not in ['.ipynb','.py']: raise Exception("Source file is {} must be .py or .ipynb".format(ext))
    r = requests.post("https://api.heatonresearch.com/assignment-submit",
        headers={'x-api-key':key}, json={'csv':base64.b64encode(data.to_csv(index=False).encode('ascii')).decode("ascii"),
        'assignment': no, 'ext':ext, 'py':encoded_python})
    if r.status_code == 200:
        print("Success: {}".format(r.text))
    else: print("Failure: {}".format(r.text))

# Google CoLab Instructions

If you are using Google CoLab, it will be necessary to mount your GDrive so that you can send your notebook during the submit process.  Running the following code will map your GDrive to /content/drive.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
!ls /content/drive/My\ Drive/Colab\ Notebooks

# Assignment #5 Sample Code

The following code provides a starting point for this assignment.

### data_preprocessing


In [0]:
# Below is just a suggestion of how to begin.

import io
import os

import numpy as np
import pandas as pd
import requests
from scipy.stats import zscore
from sklearn import metrics
from sklearn.model_selection import KFold
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.models import Sequential

# This is your student key that I emailed to you at the beginning of the semester.
key = "Yg3Uc8sn118A6HaWAFSKG5g1Th1nOyw34jLD5Uh8"  # This is an example key and will not work.

# You must also identify your source file.  (modify for your local setup)
# file='/resources/t81_558_deep_learning/assignment_yourname_class1.ipynb'  # IBM Data Science Workbench
# file='C:\\Users\\jeffh\\projects\\t81_558_deep_learning\\t81_558_class1_intro_python.ipynb'  # Windows
# file='/Users/jheaton/projects/t81_558_deep_learning/assignments/assignment_yourname_class5.ipynb'  # Mac/Linux
# file = "C:\\Users\\jeffh\\Dropbox\\school\\teaching\\wustl\\classes\\T81_558_deep_learning\\solutions\\assignment_solution_class5.ipynb"
file='/content/drive/My Drive/Colab Notebooks/assignment_jhuang_class5.ipynb' 

# Begin assignment
df = pd.read_csv("https://data.heatonresearch.com/data/t81-558/datasets/reg-33-data.csv")


# Encode the feature vector
ids = df['id']
df.drop('id', axis=1, inplace=True)

# Generate dummies for convention
df = pd.concat([df,pd.get_dummies(df['convention'],prefix="convention")],axis=1)
df.drop('convention', axis=1, inplace=True)

# Generate dummies for cat2
df = pd.concat([df,pd.get_dummies(df['cat2'],prefix="cat2")],axis=1)
df.drop('cat2', axis=1, inplace=True)

# Generate dummies for usage
df = pd.concat([df,pd.get_dummies(df['usage'],prefix="usage")],axis=1)
df.drop('usage', axis=1, inplace=True)

# Generate dummies for region
df = pd.concat([df,pd.get_dummies(df['region'],prefix="region")],axis=1)
df.drop('region', axis=1, inplace=True)

# Generate dummies for code
df = pd.concat([df,pd.get_dummies(df['code'],prefix="code")],axis=1)
df.drop('code', axis=1, inplace=True)

# Generate dummies for item
df = pd.concat([df,pd.get_dummies(df['item'],prefix="item")],axis=1)
df.drop('item', axis=1, inplace=True)

# Generate dummies for country
df = pd.concat([df,pd.get_dummies(df['country'],prefix="country")],axis=1)
df.drop('country', axis=1, inplace=True)

# Missing values for height
med = df['height'].median()
df['height'] = df['height'].fillna(med)

# Missing values for length
med = df['length'].median()
df['length'] = df['length'].fillna(med)

# Standardize ranges
df['height'] = zscore(df['height'])
df['max'] = zscore(df['max'])
df['number'] = zscore(df['number'])
df['length'] = zscore(df['length'])
df['power'] = zscore(df['power'])
df['weight'] = zscore(df['weight'])
# df['target'] = zscore(df['target'])

# Convert to numpy - Regression
x_columns = df.columns.drop('target')
x = df[x_columns].values
# y = df['current'].values
y = df['target'].values

x_temp = x.copy()
y_temp = y.copy()

### Cross_validation

In [0]:
print(x.shape)
print(y.shape)

(10809, 256)
(10809,)


In [0]:

# Cross-Validate
kf = KFold(5, shuffle=True, random_state=42) # KFold (not StratifiedKFold) because this is regression
    
oss_y = []
oss_pred = []

fold = 0
for train, test in kf.split(x):
    fold+=1
    print(f"Fold #{fold}")
          
    x_train = x[train]
    y_train = y[train]
    x_test = x[test]
    y_test = y[test]
    
    model = Sequential()
    model.add(Dense(20, input_dim=x.shape[1], activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    
    model.fit(x_train,y_train,validation_data=(x_test,y_test),verbose=0,epochs=500)
    
    # Flatten the (n, 1) Keras output to (n,) so it lines up with y_test.
    pred = model.predict(x_test).flatten()
    
    oss_y.append(y_test)
    oss_pred.append(pred)    
      
    # Measure this fold's RMSE
    score = np.sqrt(metrics.mean_squared_error(pred,y_test))
    print(f"Fold score (RMSE): {score}")
  
# Build the prediction list and calculate the error.
oss_y = np.concatenate(oss_y)
oss_pred = np.concatenate(oss_pred)
diff = np.absolute(oss_pred - oss_y)
score = np.sqrt(metrics.mean_squared_error(oss_pred,oss_y))
print(f"Final, out of sample score (RMSE): {score}")    

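# Wrap the out-of-sample results in DataFrames.  These rows follow the shuffled
# fold order, not the original row order of df, so ossDF is not row-aligned.
oss_y = pd.DataFrame(oss_y)
oss_pred = pd.DataFrame(oss_pred)
diff = pd.DataFrame(diff)
ossDF = pd.concat( [df, oss_y, oss_pred, diff],axis=1 )
# ossDF.to_csv(ossDF1_write,index=False)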






Fold #1
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Fold score (RMSE): 1380.3402588951717
Fold #2
Fold score (RMSE): 1056.7969327369344
Fold #3


KeyboardInterrupt: ignored
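
A Keras model with a single output unit returns predictions as a column of shape (n, 1), while `y_test` has shape (n,). When computing an element-wise difference such as `diff`, it is worth confirming that both arrays have the same shape first. The toy arrays below are made up for illustration and show what NumPy broadcasting does when the shapes differ.

```python
import numpy as np

y_test = np.array([10.0, 20.0, 30.0])      # shape (3,), like y[test]
pred = np.array([[11.0], [18.0], [33.0]])  # shape (3, 1), like model.predict(x_test)

# Subtracting (3, 1) and (3,) broadcasts to a full (3, 3) matrix.
wrong = np.absolute(pred - y_test)

# Flattening the predictions first keeps the result element-wise.
right = np.absolute(pred.flatten() - y_test)

print(wrong.shape)
print(right.shape, right)
```

On the full dataset the broadcast version would produce a square matrix with one row and one column per sample, which is both wrong and very large.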

In [0]:
oss_y.rename(columns={0: 'y'}, inplace=True)
oss_pred.rename(columns={0: 'pred'}, inplace=True)
# pd.concat([oss_y, oss_pred], axis=1)

In [0]:
# oss_y and oss_pred are in shuffled fold order, while ids and df['target'] are in
# the original row order.  Look up each row's prediction by its target value.
# This assumes the target values are unique.
fold_order = pd.concat([oss_y, oss_pred], axis=1)
map_y_to_pred = dict(zip(fold_order['y'], fold_order['pred']))

temp = pd.concat([ids, df[['target']]], axis=1)
temp.rename(columns={'target': 'y'}, inplace=True)
temp['pred'] = temp['y'].map(map_y_to_pred)
temp['diff'] = abs(temp['y'] - temp['pred'])
temp.head()

df_submit = temp
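
An alternative that avoids matching rows by value is to write each fold's predictions straight into an array indexed by the original row position. The sketch below is only an illustration under stated assumptions: the data are synthetic, the names `x_demo`, `y_demo`, `ids_demo`, `oos_pred`, and `submit_demo` are hypothetical, and an ordinary least-squares fit stands in for the neural network so the example runs quickly.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Synthetic regression data (not the assignment dataset).
rng = np.random.default_rng(0)
n = 50
x_demo = rng.normal(size=(n, 3))
y_demo = x_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n)
ids_demo = pd.Series(np.arange(1, n + 1), name="id")

oos_pred = np.full(n, np.nan)  # one slot per original row
kf = KFold(5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(x_demo):
    # Stand-in model: least squares instead of a neural network.
    coef, *_ = np.linalg.lstsq(x_demo[train_idx], y_demo[train_idx], rcond=None)
    # Store predictions at the rows' original positions.
    oos_pred[test_idx] = x_demo[test_idx] @ coef

submit_demo = pd.DataFrame({"id": ids_demo, "y": y_demo, "pred": oos_pred})
submit_demo["diff"] = (submit_demo["y"] - submit_demo["pred"]).abs()
print(submit_demo.head())
```

Because `test_idx` holds positions in the original array, the predictions stay aligned with the ids and targets even when the folds are shuffled or target values repeat.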

In [0]:
submit(source_file=file,data=df_submit,key=key,no=5)

Success: Submitted Assignment #5 for hjulia:
This is your first submission of this assignment.


