# T81-558: Applications of Deep Neural Networks
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), School of Engineering and Applied Science, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

**Module 4 Assignment: Classification and Regression Neural Network**

**Student Name: Julia Huang**

# Assignment Instructions

For this assignment you will use the **crx.csv** dataset.  This is a public dataset that can be found [here](https://archive.ics.uci.edu/ml/datasets/credit+approval). You should use the CSV file on my data site, at this location: [crx.csv](https://data.heatonresearch.com/data/t81-558/crx.csv) because it includes column headers.  This is a dataset that is usually used for binary classification. There are 15 attributes, plus a target column that contains only + or -.  Some of the columns have missing values.

For this assignment you will train a neural network and return the predictions.  You will submit these predictions to the **submit** function.  See [Assignment #1](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/assignments/assignment_yourname_class1.ipynb) for details on how to submit an assignment or check that one was submitted.

Complete the following tasks:

* Your task is to replace missing values in columns *a2* and *a14* with values estimated by a neural network (one neural network for *a2* and another for *a14*).
* Your submission file will contain the same headers as the source CSV: *a1*, *a2*, *s3*, *a4*, *a5*, *a6*, *a7*, *a8*, *a9*, *a10*, *a11*, *a12*, *a13*, *a14*, *a15*, and *a16*.
* You should only need to modify *a2* and *a14*.
* Neural networks can be much more powerful at filling missing variables than median and mean.
* Train two neural networks to predict *a2* and *a14*.  
* The y (target) for training the two nets will be *a2* and *a14*, depending on which you are trying to fill.
* The x for training the two nets will be 's3','a8','a9','a10','a11','a12','a13','a15'.  These are chosen because it is important not to use any columns with missing values, also it could cause unwanted bias if we include the ultimate target (*a16*).
* ONLY predict new values for missing values in *a2* and *a14*.
* You will likely get this small warning:  Warning: The mean of column a14 differs from the solution file by 0.20238937709643778. (might not matter if small)



# Assignment Submit Function

You will submit the 10 programming assignments electronically.  The following submit function can be used to do this.  My server will perform a basic check of each assignment and let you know if it sees any basic problems. 

**It is unlikely that should need to modify this function.**

In [0]:
import base64
import os
import numpy as np
import pandas as pd
import requests

# This function submits an assignment.  You can submit an assignment as much as you like, only the final
# submission counts.  The paramaters are as follows:
# data - Pandas dataframe output.
# key - Your student key that was emailed to you.
# no - The assignment class number, should be 1 through 1.
# source_file - The full path to your Python or IPYNB file.  This must have "_class1" as part of its name.  
# .             The number must match your assignment number.  For example "_class2" for class assignment #2.
def submit(data,key,no,source_file=None):
    if source_file is None and '__file__' not in globals(): raise Exception('Must specify a filename when a Jupyter notebook.')
    if source_file is None: source_file = __file__
    suffix = '_class{}'.format(no)
    if suffix not in source_file: raise Exception('{} must be part of the filename.'.format(suffix))
    with open(source_file, "rb") as image_file:
        encoded_python = base64.b64encode(image_file.read()).decode('ascii')
    ext = os.path.splitext(source_file)[-1].lower()
    if ext not in ['.ipynb','.py']: raise Exception("Source file is {} must be .py or .ipynb".format(ext))
    r = requests.post("https://api.heatonresearch.com/assignment-submit",
        headers={'x-api-key':key}, json={'csv':base64.b64encode(data.to_csv(index=False).encode('ascii')).decode("ascii"),
        'assignment': no, 'ext':ext, 'py':encoded_python})
    if r.status_code == 200:
        print("Success: {}".format(r.text))
    else: print("Failure: {}".format(r.text))

# Google CoLab Instructions

If you are using Google CoLab, it will be necessary to mount your GDrive so that you can send your notebook during the submit process.  Running the following code will map your GDrive to /content/drive.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
!ls /content/drive/My\ Drive

 00012.MTS
 1162567958.jpg
'423 Assignment1.docx.gdoc'
'701 Welcome Letter.pdf'
'a-d case 3.gsheet'
'Assignment1 (1).pdf'
'Assignment 1_ANS_2016.pdf'
 Assignment1.docx
 Assignment1.pdf
 Assignment1.pdf.gdoc
'Assignment 2 investment.xls'
'Assignment 2 investment.xls.gsheet'
'Assignment 3-cloud computing.gdoc'
 CA_DMV_handbook_Chinese.pdf
'Colab Notebooks'
'Copy of Helicanus.gslides'
'diagram question 2 - H Julia.pptx'
'diagram question 2.pptx'
 Final_project_marketing.gdoc
'GRE files.rar'
'GRE填空机经1100题难度分级版（第一版） (1).pdf'
 GRE填空机经1100题难度分级版（第一版）.pdf
 GRE资料-LZD.zip
 HHH.gsheet
 image.jpg
'intended v incidental.pptx'
'marketing final ppt.gslides'
'marketing final ppt - Line chart 1.gsheet'
'marketing final ppt.pdf'
'Memory music analysis.gdoc'
'movie 2.zip'
'MSBA Resource List.docx'
'mus 110 exam2.gdoc'
'music final exam .gdoc'
 MVI_9879.MOV
 MVI_9880.MOV
 MVI_9882.MOV
 MVI_9887.MOV
 MVI_9888.MOV
 MVI_9891.MOV
 MVI_9899.MOV
'My Movie 2.mp4'
'My Movie 3.mp4'
'My Movie 5'
'My Movie.mp4'
'not

# Assignment #4 Sample Code

The following code provides a starting point for this assignment.

In [0]:
import os
import pandas as pd
from scipy.stats import zscore
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping
import pandas as pd
import io
import requests
import numpy as np
from sklearn import metrics

# This is your student key that I emailed to you at the beginnning of the semester.
key = "Yg3Uc8sn118A6HaWAFSKG5g1Th1nOyw34jLD5Uh8"  # This is an example key and will not work.

# You must also identify your source file.  (modify for your local setup)
# file='/resources/t81_558_deep_learning/assignment_yourname_class1.ipynb'  # IBM Data Science Workbench
# file='C:\\Users\\jeffh\\projects\\t81_558_deep_learning\\t81_558_class1_intro_python.ipynb'  # Windows
#file='/Users/jheaton/projects/t81_558_deep_learning/assignments/assignment_hjulia_class4.ipynb'  # Mac/Linux
file='/content/drive/My Drive/Colab Notebooks/assignment_jhuang_class4.ipynb' 
#file = "C:\\Users\\jeffh\\Dropbox\\school\\teaching\\wustl\\classes\\T81_558_deep_learning\\solutions\\assignment_solution_class4.ipynb"


import os
import pandas as pd
from scipy.stats import zscore
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping
import pandas as pd
import io
import requests
import numpy as np
from sklearn import metrics



# Begin assignment
df = pd.read_csv("https://data.heatonresearch.com/data/t81-558/crx.csv",na_values=['?'])


def fill_missing_numeric(df,current,target):
    df_original = df.copy(deep=True)
    
    print(df[['s3', 'a8', 'a9', 'a10', 'a11', 'a12', 'a13', 'a15']].head())
    df['a9'] = pd.factorize(df.a9)[0]
    df['a10'] = pd.factorize(df.a10)[0]
    df['a12'] = pd.factorize(df.a12)[0]
    df['a13'] = pd.factorize(df.a13)[0]
  
    x = df[['s3', 'a8', 'a9', 'a10', 'a11', 'a12', 'a13', 'a15']].values
    y = df[current].values
    
    print(df[['s3', 'a8', 'a9', 'a10', 'a11', 'a12', 'a13', 'a15']].head())
          
    test_idx = df[current].isnull()
    train_idx = [not a for a in df[current].isnull().values]
#     df_test = df[is_null]
#     df_train = df[is_not_null]
#     print(df_train)
    x_train, y_train = x[train_idx], y[train_idx]
  
    x_train, x_valid, y_train, y_valid = train_test_split(    
      x_train, y_train, test_size=0.25, random_state=42)
  
    x_test, y_test = x[test_idx], y[test_idx]
    
    model = Sequential()
    model.add(Dense(25, input_dim=x.shape[1], activation='relu')) # Hidden 1
    model.add(Dense(10, activation='relu')) # Hidden 2
    model.add(Dense(1)) # Output
    model.compile(loss='mean_squared_error', optimizer='adam')
    monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5, verbose=1, mode='auto',
        restore_best_weights=True)
    model.fit(x_train,y_train,validation_data=(x_valid,y_valid),callbacks=[monitor],verbose=2,epochs=5)
    
    preds = model.predict(x_test)
    preds = preds.reshape(1, -1)[0]
#     print(preds)
#     print(len(df[test_idx]), len(preds))
#     df[test_idx][current] = np.array(preds)
    
    print(df_original.head())
    for idx, pred in zip(df.index[test_idx].values, preds):
      print(idx, pred)
      df_original.iloc[idx].loc[current] = pred
    print(df_original.head())
    return df_original

df_submit = fill_missing_numeric(df,'a2','a16')
df_submit = fill_missing_numeric(df,'a14','a16')


submit(source_file=file,data=df_submit,key=key,no=4)

      s3    a8 a9 a10  a11 a12 a13  a15
0  0.000  1.25  t   t    1   f   g    0
1  4.460  3.04  t   t    6   f   g  560
2  0.500  1.50  t   f    0   f   g  824
3  1.540  3.75  t   t    5   t   g    3
4  5.625  1.71  t   f    0   f   s    0
      s3    a8  a9  a10  a11  a12  a13  a15
0  0.000  1.25   0    0    1    0    0    0
1  4.460  3.04   0    0    6    0    0  560
2  0.500  1.50   0    1    0    0    0  824
3  1.540  3.75   0    0    5    1    0    3
4  5.625  1.71   0    1    0    0    1    0
Train on 508 samples, validate on 170 samples
Epoch 1/5
508/508 - 0s - loss: 1593.6172 - val_loss: 3003.7991
Epoch 2/5
508/508 - 0s - loss: 1140.8766 - val_loss: 13672.1313
Epoch 3/5
508/508 - 0s - loss: 1352.7570 - val_loss: 1057.5915
Epoch 4/5
508/508 - 0s - loss: 1027.7766 - val_loss: 7109.3938
Epoch 5/5
508/508 - 0s - loss: 931.0018 - val_loss: 1048.6843
  a1     a2     s3 a4 a5 a6 a7    a8 a9 a10  a11 a12 a13    a14  a15 a16
0  b  30.83  0.000  u  g  w  v  1.25  t   t    1   f   g  202.

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


86 1.0957069
92 6.0262036
97 1.1450014
254 4.4816504
286 0.9279249
329 3.5648246
445 10.201503
450 4.7119546
500 6.13141
515 9.244883
608 3.0759432
  a1     a2     s3 a4 a5 a6 a7    a8 a9 a10  a11 a12 a13    a14  a15 a16
0  b  30.83  0.000  u  g  w  v  1.25  t   t    1   f   g  202.0    0   +
1  a  58.67  4.460  u  g  q  h  3.04  t   t    6   f   g   43.0  560   +
2  a  24.50  0.500  u  g  q  h  1.50  t   f    0   f   g  280.0  824   +
3  b  27.83  1.540  u  g  w  v  3.75  t   t    5   t   g  100.0    3   +
4  b  20.17  5.625  u  g  w  v  1.71  t   f    0   f   s  120.0    0   +
      s3    a8  a9  a10  a11  a12  a13  a15
0  0.000  1.25   0    0    1    0    0    0
1  4.460  3.04   0    0    6    0    0  560
2  0.500  1.50   0    1    0    0    0  824
3  1.540  3.75   0    0    5    1    0    3
4  5.625  1.71   0    1    0    0    1    0
      s3    a8  a9  a10  a11  a12  a13  a15
0  0.000  1.25   0    0    1    0    0    0
1  4.460  3.04   0    0    6    0    0  560
2  0.500  1.50   0