# Neural network that replaces missing values

For this assignment, you will use the **crx.csv** dataset.  This is a public dataset that you can find [here](https://archive.ics.uci.edu/ml/datasets/credit+approval). You should use the CSV file on my data site, at this location: [crx.csv](https://data.heatonresearch.com/data/t81-558/crx.csv), because it includes column headers.  The primary use for this dataset is binary classification. There are 15 attributes, plus a target column that contains only + or -.  Some of the columns have missing values.

You should train a neural network and return the predictions.  You will submit these predictions to the **submit** function.  See [Assignment #1](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/assignments/assignment_yourname_class1.ipynb) for details on how to submit an assignment or check that one was submitted.

Complete the following tasks:

* Your task is to replace missing values in columns *a2* and *a14* with values estimated by a neural network (one neural network for *a2* and another for *a14*).
* Your submission file will contain the same headers as the source CSV: *a1*, *a2*, *s3*, *a4*, *a5*, *a6*, *a7*, *a8*, *a9*, *a10*, *a11*, *a12*, *a13*, *a14*, *a15*, and *a16*.
* You should only need to modify *a2* and *a14*.
* Neural networks can often fill missing values more accurately than the column median or mean, because they can use the other columns of each row.
* Train two neural networks to predict *a2* and *a14*.  
* The *y* (target) for training the two nets will be *a2* and *a14*, depending on which you are trying to fill.
* The x for training the two nets will be 's3','a8','a9','a10','a11','a12','a13','a15'.  These are chosen because it is important not to use any columns with missing values; also, it could cause unwanted bias if we include the ultimate target (*a16*).
* ONLY predict new values for missing values in *a2* and *a14*.
* You will likely get this small warning:  Warning: The mean of column a14 differs from the solution file by 0.20238937709643778. (might not matter if small)
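
The steps in the list above can be sketched on a tiny made-up frame. This is a minimal sketch under stated assumptions, not the assignment solution: the toy values are invented, and an ordinary least-squares fit from NumPy stands in for the neural network so that the example runs without TensorFlow. The point it illustrates is that the one-hot features are built from every row, so the rows with a missing target keep their features and can be selected with the same mask.

```python
import numpy as np
import pandas as pd

# Toy data (invented for illustration): 'a2' has two missing values.
toy = pd.DataFrame({
    "a2": [30.0, np.nan, 24.0, 28.0, np.nan, 32.0],
    "s3": [0.0, 4.5, 0.5, 1.5, 5.5, 4.0],
    "a9": ["t", "t", "f", "t", "f", "f"],
})

# Encode the predictors on ALL rows so known and missing rows share the same columns.
x_all = pd.get_dummies(toy[["s3", "a9"]], columns=["a9"]).astype(float)
missing = toy["a2"].isna()

# Fit on the rows where the target is known.
# Least squares is a stand-in here; the assignment uses a neural network instead.
coef, *_ = np.linalg.lstsq(
    x_all[~missing].values, toy.loc[~missing, "a2"].values, rcond=None
)

# Predict ONLY the missing rows and write them back; known values are untouched.
filled = toy.copy()
filled.loc[missing, "a2"] = x_all[missing].values @ coef
print(filled)
```

Swapping the least-squares fit for a Keras model changes only the fit and predict lines; the mask and the write-back stay the same.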

In [4]:
import os
import pandas as pd
from scipy.stats import zscore
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping
import io
import requests
import numpy as np
from sklearn import metrics

# My solution 

df = pd.read_csv("https://data.heatonresearch.com/data/t81-558/crx.csv",na_values=['?'])
df_submit = pd.read_csv("https://data.heatonresearch.com/data/t81-558/crx.csv",na_values=['?'])

# Finding the indices of the NaN values inside the columns

indicesa2 = df.index[(np.isnan(df.a2))]
indicesa14 = df.index[(np.isnan(df.a14))]

# Dropping NaN rows --> we will feed the neural network with these dataframes.

dfa2 = df.drop(indicesa2,axis = 0).reset_index(drop = True)
dfa14 =  df.drop(indicesa14,axis = 0).reset_index(drop = True)


# categorical values, we should assign numerical values to all the categorical variables (dummy variables)
# in this dataframe (df), before we 'choose' the training data set.


# Categorical columns/variables: 'a1','a4','a5','a6','a7','a9','a10','a12','a13','a16'
#
# We will train the two nets using 's3','a8','a9','a10','a11','a12','a13','a15'
# We want to predict a2 and a14.
#
# From these particular variables 'a9','a10','a12','a13' are categorical variables.
# 
# We should turn these variables into dummy variables.
#
# We use get_dummies -> it converts a categorical variable into dummy/indicator variables.
# It creates one new binary variable per category and sets it to one or zero depending on
# the category of each row: example -> a9 has the values f and t. get_dummies creates two binary
# variables, one representing the fs (1 if f, 0 otherwise) and the other representing the ts.

dfa2 = dfa2.drop(df.columns.difference(['a2','a14','s3','a8','a9','a10','a11','a12','a13','a15']), axis = 1) # we drop the variables we don't use.
dfa14 = dfa14.drop(df.columns.difference(['a2','a14','s3','a8','a9','a10','a11','a12','a13','a15']), axis = 1) 


for i in ['a9','a10','a12','a13']:
    dummies = pd.get_dummies(dfa2[i],prefix = i)
    dfa2 = pd.concat([dfa2,dummies],axis=1)
    dfa2.drop(i, axis=1, inplace=True)
    
for i in ['a9','a10','a12','a13']:
    dummies = pd.get_dummies(dfa14[i],prefix = i)
    dfa14 = pd.concat([dfa14,dummies],axis=1)
    dfa14.drop(i, axis=1, inplace=True)

# We express our variables in a common scale
# It seems this improves the convergence of the gradient descent

# Standardize ranges

#df['a2'] = zscore(df['a2'])
#df['s3'] = zscore(df['s3'])
#df['a8'] = zscore(df['a8'])
#df['a14'] = zscore(df['a14'])
#df['a15'] = zscore(df['a15'])

# Extracting the columns we want to estimate.

ya2 = dfa2.a2.values # dependent variable for a2
ya14 = dfa14.a14.values # dependent variable for a14

# We extract the training set - this set should not have NaN values inside the columns a2 or a14.

# The code below finds the NaN values inside an array and inverts the mask, so that True marks
# the values which are not NaN. This operation gives us the non-NaN rows (with respect to df.a2).

#indicesa2 = (np.invert(np.isnan(df.a2.array)))  #indices for a2 and a14 not having nan values in both columns
#indicesa14 = (np.invert(np.isnan(df.a14.array)))

# Extracting and reindexing the training set.

x_trdsa2 = dfa2.drop(['a2','a14'],axis = 1).reset_index(drop=True)
x_trdsa14 = dfa14.drop(['a2','a14'],axis = 1).reset_index(drop=True)

# We will train the two nets using 's3','a8','a9','a10','a11','a12','a13','a15'

# Training a neural network filling a2


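# get_dummies may return bool columns, so we cast everything to a numeric array for Keras.
xa2 = x_trdsa2.values.astype(np.float32)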

x_train, x_test, y_train, y_test = train_test_split(xa2, ya2, test_size=0.25, random_state=42)


# This is our neural network

model = Sequential()
model.add(Dense(50, input_dim=xa2.shape[1], activation='relu')) # Hidden 1
model.add(Dense(20, activation='relu')) # Hidden 2
model.add(Dense(1)) # Output
model.compile(loss='mean_squared_error', optimizer='adam')
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3,patience=5, verbose=1, mode='auto', restore_best_weights=True)
model.fit(x_train,y_train,validation_data=(x_test,y_test), callbacks=[monitor],verbose=2,epochs=1000)

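# Predicting ONLY missing values for a2
# dfa2 no longer contains the rows where a2 is NaN, so xa2 cannot be indexed with indicesa2.
# We rebuild the features of those rows from the full dataframe and align them with the training columns.

x_all2 = pd.get_dummies(df[['s3','a8','a9','a10','a11','a12','a13','a15']], columns=['a9','a10','a12','a13'])
x_missing2 = x_all2.reindex(columns=x_trdsa2.columns, fill_value=0).loc[indicesa2].values.astype(np.float32)
preda2_missing = model.predict(x_missing2)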

preda2_missing

# Training a neural network filling a14

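# get_dummies may return bool columns, so we cast everything to a numeric array for Keras.
xa14 = x_trdsa14.values.astype(np.float32)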


x_train, x_test, y_train, y_test = train_test_split(xa14, ya14, test_size=0.25, random_state=42)


# This is our neural network

model = Sequential()
model.add(Dense(150, input_dim=xa14.shape[1], activation='relu')) # Hidden 1
model.add(Dense(750, activation='relu')) # Hidden 2
model.add(Dense(1)) # Output
model.compile(loss='mean_squared_error', optimizer='adam')
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3,patience=10, verbose=1, mode='auto', restore_best_weights=True)
model.fit(x_train,y_train,validation_data=(x_test,y_test), callbacks=[monitor],verbose=2,epochs=1000)


# Predicting ONLY missing values for a14

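# dfa14 no longer contains the rows where a14 is NaN, so xa14 cannot be indexed with indicesa14.
# We rebuild the features of those rows from the full dataframe and align them with the training columns.
x_all14 = pd.get_dummies(df[['s3','a8','a9','a10','a11','a12','a13','a15']], columns=['a9','a10','a12','a13'])
x_missing14 = x_all14.reindex(columns=x_trdsa14.columns, fill_value=0).loc[indicesa14].values.astype(np.float32)
pred_a14_missings = model.predict(x_missing14)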

# Filling with predicted values

# model.predict returns an array of shape (n, 1); flatten it to assign one value per missing row.
df_submit.loc[indicesa2, 'a2'] = preda2_missing.flatten()
df_submit.loc[indicesa14, 'a14'] = pred_a14_missings.flatten()


# If we predicted every value, we could just overwrite the columns a2 and a14
# with the result. In this case the assignment explicitly says that we should
# only calculate predictions for the NaN values.

#for i in range(len(df_submit.a2)):
#        print(df_submit.a2[i])



Epoch 1/1000
16/16 - 0s - loss: 172314.8750 - val_loss: 238846.3125
Epoch 2/1000
16/16 - 0s - loss: 5685.2969 - val_loss: 16274.4395
Epoch 3/1000
16/16 - 0s - loss: 3013.0027 - val_loss: 6432.2256
Epoch 4/1000
16/16 - 0s - loss: 989.6352 - val_loss: 909.8517
Epoch 5/1000
16/16 - 0s - loss: 819.8442 - val_loss: 2988.8982
Epoch 6/1000
16/16 - 0s - loss: 956.0715 - val_loss: 1783.9739
Epoch 7/1000
16/16 - 0s - loss: 754.6507 - val_loss: 949.6714
Epoch 8/1000
16/16 - 0s - loss: 576.7494 - val_loss: 9344.8623
Epoch 9/1000
Restoring model weights from the end of the best epoch.
16/16 - 0s - loss: 1040.7046 - val_loss: 3944.6360
Epoch 00009: early stopping
Epoch 1/1000
16/16 - 0s - loss: 67593.3047 - val_loss: 82462.7969
Epoch 2/1000
16/16 - 0s - loss: 61913.8164 - val_loss: 130712.0625
Epoch 3/1000
16/16 - 0s - loss: 48679.2422 - val_loss: 611331.5000
Epoch 4/1000
16/16 - 0s - loss: 54509.9922 - val_loss: 78531.1406
Epoch 5/1000
16/16 - 0s - loss: 42145.4180 - val_loss: 171116.5938
Epoch 6/1

Unnamed: 0,a1,a2,s3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15,a16
0,b,30.83,0.000,u,g,w,v,1.25,t,t,1,f,g,202.0,0,+
1,a,58.67,4.460,u,g,q,h,3.04,t,t,6,f,g,43.0,560,+
2,a,24.50,0.500,u,g,q,h,1.50,t,f,0,f,g,280.0,824,+
3,b,27.83,1.540,u,g,w,v,3.75,t,t,5,t,g,100.0,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120.0,0,+
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,b,21.08,10.085,y,p,e,h,1.25,f,f,0,f,g,260.0,0,-
686,a,22.67,0.750,u,g,c,v,2.00,f,t,2,t,g,200.0,394,-
687,a,25.25,13.500,y,p,ff,ff,2.00,f,t,1,t,g,200.0,1,-
688,b,17.92,0.205,u,g,aa,v,0.04,f,f,0,f,g,280.0,750,-


In [7]:
# We filled a2 and a14.

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df_submit)

      a1         a2      s3   a4   a5   a6   a7      a8 a9 a10  a11 a12 a13  \
0      b  30.830000   0.000    u    g    w    v   1.250  t   t    1   f   g   
1      a  58.670000   4.460    u    g    q    h   3.040  t   t    6   f   g   
2      a  24.500000   0.500    u    g    q    h   1.500  t   f    0   f   g   
3      b  27.830000   1.540    u    g    w    v   3.750  t   t    5   t   g   
4      b  20.170000   5.625    u    g    w    v   1.710  t   f    0   f   s   
5      b  32.080000   4.000    u    g    m    v   2.500  t   f    0   t   g   
6      b  33.170000   1.040    u    g    r    h   6.500  t   f    0   t   g   
7      a  22.920000  11.585    u    g   cc    v   0.040  t   f    0   f   g   
8      b  54.420000   0.500    y    p    k    h   3.960  t   f    0   f   g   
9      b  42.500000   4.915    y    p    w    v   3.165  t   f    0   t   g   
10     b  22.080000   0.830    u    g    c    h   2.165  f   f    0   t   g   
11     b  29.920000   1.835    u    g    c    h   4.
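
One way to check the requirement that only the missing cells of *a2* and *a14* change is to compare the filled frame against the original. This is a sketch only: the helper name `only_missing_changed` and the toy frames below are assumptions made for illustration and are not part of the assignment.

```python
import numpy as np
import pandas as pd

def only_missing_changed(original, filled, column):
    """Return True if `filled` matches `original` wherever `original` was known
    and no NaN remains in `filled[column]`."""
    known = original[column].notna()
    same_known = np.allclose(original.loc[known, column], filled.loc[known, column])
    none_left = not filled[column].isna().any()
    return bool(same_known and none_left)

# Toy frames (invented values) standing in for the original and the filled data.
original = pd.DataFrame({"a14": [202.0, np.nan, 280.0]})

good = original.copy()
good.loc[original["a14"].isna(), "a14"] = 150.0  # only the NaN cell is filled

bad = good.copy()
bad.loc[0, "a14"] = 0.0  # a known value was overwritten by mistake

print(only_missing_changed(original, good, "a14"))  # True
print(only_missing_changed(original, bad, "a14"))   # False
```

In this notebook, the same check could be run as `only_missing_changed(df, df_submit, 'a2')` and again for `'a14'`, since `df` still holds the values as read from the CSV.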

In [None]:
# Writing a function which does the work

# predictors is a list of strings with the names of the columns we use as predictors;
# target is a string with the name of the column in df that we want to fill;
# dummy is a list with the names of the categorical predictors (it can be empty).

def fill_nan(df, predictors, target, dummy):
    # We build the features from every row, so the rows with a missing target keep their features.
    x_all = df[predictors]
    if dummy:
        x_all = pd.get_dummies(x_all, columns=dummy)
    x_all = x_all.astype(np.float32)
    missing = df[target].isna()

    # We train only on the rows where the target is known.
    x = x_all[~missing].values
    y = df.loc[~missing, target].values
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

    # This is our neural network
    model = Sequential()
    model.add(Dense(150, input_dim=x.shape[1], activation='relu')) # Hidden 1
    model.add(Dense(750, activation='relu')) # Hidden 2
    model.add(Dense(1)) # Output
    model.compile(loss='mean_squared_error', optimizer='adam')
    monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3,patience=10, verbose=1, mode='auto', restore_best_weights=True)
    model.fit(x_train,y_train,validation_data=(x_test,y_test), callbacks=[monitor],verbose=2,epochs=1000)

    # Predicting ONLY missing values for target
    filled = df.copy()
    filled.loc[missing, target] = model.predict(x_all[missing].values).flatten()
    return filled

