
### Problem Statement

The primary use for this dataset is binary classification. There are 15 attributes, plus a target column that contains only + or -. Some of the columns have missing values.



Complete the following tasks:

Your task is to replace missing values in columns a2 and a14 with values estimated by a neural network (one neural network for a2 and another for a14).

Your submission file will contain the same headers as the source CSV: a1, a2, s3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15, and a16.
You should only need to modify a2 and a14.

Neural networks can estimate missing values more accurately than a column median or mean, because they use the other columns as inputs.
Train two neural networks to predict a2 and a14.

The y (target) for training the two nets will be a2 and a14, depending on which you are trying to fill.

The x for training the two nets will be 's3','a8','a9','a10','a11','a12','a13','a15'. These are chosen because it is important not to use any columns with missing values; also, it could cause unwanted bias if we include the ultimate target (a16).

ONLY predict new values for missing values in a2 and a14.
You will likely get this small warning: Warning: The mean of column a14 differs from the solution file by 0.20238937709643778. (might not matter if small)

https://github.com/jeffheaton/t81_558_deep_learning/blob/df29ce2413c1ef32acaf99764c54b1b529cd8779/assignments/assignment_yourname_class4.ipynb


### Import packages 

In [1]:
import os
import io
import requests
import numpy as np
import pandas as pd
from scipy.stats import zscore
from sklearn import metrics
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

In [2]:
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping

### Load data 

In [3]:
file_path = "https://data.heatonresearch.com/data/t81-558/crx.csv"

In [4]:
# Read the CSV file into a data frame, treating "NA" and "?" as missing values
df = pd.read_csv(file_path, na_values=["NA", "?"])

In [5]:
df.head()

Unnamed: 0,a1,a2,s3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15,a16
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202.0,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43.0,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280.0,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100.0,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120.0,0,+


In [6]:
# Check for missing values 
df.isna().sum()

a1     12
a2     12
s3      0
a4      6
a5      6
a6      9
a7      9
a8      0
a9      0
a10     0
a11     0
a12     0
a13     0
a14    13
a15     0
a16     0
dtype: int64

In [7]:
# Create the index as a column so the values can be updated later when submitting (not needed)
# df['index1'] = df.index

### Create dataframe with only required columns 

In [8]:
# Only these columns are needed, since we have to fill missing records in columns a2 and a14.
# The remaining columns with missing values are removed, as is a16, which is the output column.
df_ = df[['a2', 's3', 'a8', 'a9', 'a10', 'a11', 'a12', 'a13', 'a14', 'a15']].copy()

In [9]:
df_.head(2)

Unnamed: 0,a2,s3,a8,a9,a10,a11,a12,a13,a14,a15
0,30.83,0.0,1.25,t,t,1,f,g,202.0,0
1,58.67,4.46,3.04,t,t,6,f,g,43.0,560


In [10]:
# Count the unique values per column to see which columns have to be one-hot encoded
for col in df_:
    print(col, df_[col].nunique())

a2 349
s3 215
a8 132
a9 2
a10 2
a11 23
a12 2
a13 3
a14 170
a15 240


### Convert to onehot vector

In [11]:
df_ = pd.get_dummies(data=df_, columns=['a9', 'a10', 'a12', 'a13'])

In [12]:
df_.columns.values

array(['a2', 's3', 'a8', 'a11', 'a14', 'a15', 'a9_f', 'a9_t', 'a10_f',
       'a10_t', 'a12_f', 'a12_t', 'a13_g', 'a13_p', 'a13_s'], dtype=object)
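
The 0/1 columns in this notebook's output depend on the pandas version. In pandas 2.0 and later, `get_dummies` returns boolean columns by default. The sketch below, on a toy frame with illustrative values rather than rows from crx.csv, passes `dtype=int` so the encoded frame converts to a purely numeric array.

```python
import pandas as pd

# Toy frame; values are illustrative, not rows from crx.csv.
toy = pd.DataFrame({
    "a8": [1.25, 3.04, 0.25],
    "a9": ["t", "t", "f"],
    "a13": ["g", "g", "s"],
})

# dtype=int forces 0/1 integer columns regardless of the pandas version.
encoded = pd.get_dummies(toy, columns=["a9", "a13"], dtype=int)

print(list(encoded.columns))
print(encoded.to_numpy().dtype)
```

Without `dtype=int`, mixing boolean and float columns can produce an object-typed array from `.values`, which a Keras model may not accept.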

### Split the records 

In [13]:
# Create data frames:
#    df_na     -> all records with at least one missing value; only used to build the a2 and a14 frames
#    df_na_a2  -> records with a missing a2 value (a14 column removed, as suggested)
#    df_na_a14 -> records with a missing a14 value (a2 column removed, as suggested)
#    df_nona   -> records with no missing values; used as the training set

df_na = df_[df_.isnull().any(axis=1)]
# .copy() makes each frame independent, which avoids SettingWithCopyWarning on later updates
df_na_a2 = df_na[df_na['a2'].isnull()].drop(columns='a14').copy()
df_na_a14 = df_na[df_na['a14'].isnull()].drop(columns='a2').copy()
df_nona = df_[~df_.isnull().any(axis=1)]


### Sanity Check before processing 

In [14]:
# check shape of the data frames 
print( df.shape, df_na.shape, df_na_a2.shape, df_na_a14.shape, df_nona.shape)

(690, 16) (24, 15) (12, 14) (13, 14) (666, 15)


In [15]:
# look at missing a2 records 
df_na_a2

Unnamed: 0,a2,s3,a8,a11,a15,a9_f,a9_t,a10_f,a10_t,a12_f,a12_t,a13_g,a13_p,a13_s
83,,3.5,3.0,0,0,0,1,1,0,0,1,1,0,0
86,,0.375,0.875,0,0,0,1,1,0,0,1,0,0,1
92,,5.0,8.5,0,0,0,1,1,0,1,0,1,0,0
97,,0.5,0.835,0,0,0,1,1,0,0,1,0,0,1
254,,0.625,0.25,0,2010,1,0,1,0,1,0,1,0,0
286,,1.5,0.0,2,105,1,0,0,1,0,1,1,0,0
329,,4.0,0.085,0,0,1,0,1,0,0,1,1,0,0
445,,11.25,0.0,0,5200,1,0,1,0,1,0,1,0,0
450,,3.0,7.0,0,1,1,0,1,0,1,0,1,0,0
500,,4.0,5.0,3,2279,0,1,0,1,0,1,1,0,0


In [16]:
# look at missing a14 records 
df_na_a14

Unnamed: 0,s3,a8,a11,a14,a15,a9_f,a9_t,a10_f,a10_t,a12_f,a12_t,a13_g,a13_p,a13_s
71,4.0,12.5,0,,0,0,1,1,0,0,1,1,0,0
202,2.75,2.25,6,,600,0,1,0,1,1,0,1,0,0
206,0.0,0.0,0,,0,1,0,1,0,1,0,0,1,0
243,7.5,2.71,5,,26726,0,1,0,1,1,0,1,0,0
270,0.0,0.0,0,,0,1,0,1,0,1,0,0,1,0
278,13.5,0.0,0,,0,1,0,1,0,1,0,1,0,0
330,0.0,0.0,0,,0,1,0,1,0,1,0,0,1,0
406,8.125,0.165,2,,18,1,0,0,1,1,0,1,0,0
445,11.25,0.0,0,,5200,1,0,1,0,1,0,1,0,0
456,0.0,0.0,0,,0,1,0,1,0,1,0,0,1,0


# Update missing values 

## Function for updating missing values

### Takes the training data frame (no missing records): df_nona

Parameters:

*   training data frame, i.e. with no missing records (df_nona)
*   missing-record data frame (a2 or a14)
*   column to be updated (a2 or a14)

Process:

*   training records: from the data frame with no missing records, x_train is every feature column except a2 and a14, and y_train is the column to be updated
*   test records: the respective missing-column data frame (a2 / a14), split into x and y the same way
*   calls the neural network function to predict the value of the missing column
*   updates the missing values in the respective data frame


In [18]:
def fill_missing_numeric(df_nona, df_na_col, nan_col):
    """Train on the complete rows and fill the missing nan_col values in df_na_col."""
    # Feature columns: no missing values, and neither a2 nor a14
    col_list = ['s3', 'a8', 'a11', 'a15', 'a9_f', 'a9_t',
                'a10_f', 'a10_t', 'a12_f', 'a12_t', 'a13_g', 'a13_p', 'a13_s']

    # Pandas to NumPy
    x_train = df_nona[col_list].values
    y_train = df_nona[nan_col].values

    x_test = df_na_col[col_list].values
    y_test = df_na_col[nan_col].values   # all NaN: these are the values to predict

    # Call the neural network
    pred_test = nerual_network(x_train, y_train, x_test, y_test)

    # Write the predictions into the missing-record data frame
    update_missing_values(df_na_col, y_test, pred_test, nan_col)

### Builds neural network to predict missing col values

Input:

*   x & y train / test data generated by the calling function

Process:

*   creates a model with 2 hidden layers and an output layer
*   uses regression because a2 and a14 are numerical values, even though the overall problem is classification
*   returns the predictions, which the calling function passes on to update the respective missing data frame

**NOTE: EARLY STOPPING NOT WORKING - gives error message**
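
One plausible cause of the early-stopping failure, offered as an assumption rather than a diagnosis: the validation targets passed to `fit` come from the rows being imputed, so `y_test` is entirely NaN. Any loss computed against NaN is NaN, and NaN never compares as smaller than a previous best, so a callback monitoring `val_loss` never records an improvement. The NumPy sketch below reproduces only that arithmetic. The predictions are made up, and `best_so_far` is a hypothetical stand-in for the callback's internal state.

```python
import numpy as np

y_val = np.array([np.nan, np.nan, np.nan])   # targets of rows whose value is missing
pred = np.array([30.0, 25.0, 41.0])          # any finite predictions (illustrative)

val_loss = np.mean((pred - y_val) ** 2)      # mean squared error against NaN targets
best_so_far = np.inf                         # hypothetical "best loss seen" tracker

improved = val_loss < best_so_far            # comparisons with NaN are always False

print(val_loss, improved)
```

Holding out part of the complete rows, for example with the already imported `train_test_split`, and passing that split as `validation_data` would give the callback a finite loss to monitor.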






In [19]:
def nerual_network(x_train, y_train, x_test, y_test):
    model = Sequential()
    model.add(Dense(25, input_dim=x_train.shape[1], activation='relu'))  # Hidden 1
    model.add(Dense(10, activation='relu'))                              # Hidden 2
    model.add(Dense(1))                                                  # Output (regression)
    model.compile(loss='mean_squared_error', optimizer='adam')
    # Early-stopping callback: defined here but not passed to fit()
    monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5,
                            verbose=1, mode='auto', restore_best_weights=True)
    # y_test is all NaN for the rows being imputed, so val_loss carries no information
    model.fit(x_train, y_train, validation_data=(x_test, y_test),
              verbose=0, epochs=10)
    pred = model.predict(x_train)
    pred_test = model.predict(x_test)
    # RMSE on the training rows
    score = np.sqrt(metrics.mean_squared_error(y_train, pred))
    print(f"Final Score (RMSE): {score}")

    return pred_test
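
The inputs span very different ranges: `a15` reaches 26726 in the rows shown earlier, while the one-hot columns are 0 or 1. This can slow the training of a small dense network. A possible preprocessing step, not part of the original notebook, is to standardize the features with statistics computed on the training rows only and reuse them for the rows to be imputed. The helper name `standardize` is hypothetical, and the third column of the toy matrix is an illustrative one-hot flag.

```python
import numpy as np

def standardize(x_train, x_test):
    """Z-score both arrays using the mean and std of x_train only."""
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)
    std[std == 0] = 1.0            # avoid division by zero for constant columns
    return (x_train - mean) / std, (x_test - mean) / std

# Toy feature matrix: columns play the role of s3, a15 and a one-hot flag.
x_train = np.array([[0.0, 0.0, 1.0],
                    [4.46, 560.0, 1.0],
                    [0.5, 824.0, 0.0],
                    [1.54, 3.0, 1.0]])
x_test = np.array([[3.5, 0.0, 1.0]])

x_train_s, x_test_s = standardize(x_train, x_test)
print(x_train_s.mean(axis=0).round(6), x_train_s.std(axis=0).round(6))
```

Using the training statistics for both arrays keeps the rows to be imputed on the same scale the network saw during training.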

### Update missing values in data frame 

Process:

*   loops through y_test and updates each missing value in the respective data frame
*   because y_test and the missing data frame have the same number of records, iloc can be used to update each record
*   for columns a2 and a14, the get_loc method is used to find the column position for iloc
*   returns the respective updated missing-record data frame
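
The loop described above could also be written as a single masked assignment. This is a sketch on a toy frame with made-up predictions, not the notebook's data. `pred_test` mimics the `(n, 1)` array that `model.predict` returns for a one-unit output layer.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the rows being imputed; index labels mimic df_na_a2.
df_u = pd.DataFrame({"a2": [np.nan, 40.0, np.nan], "s3": [3.5, 0.375, 5.0]},
                    index=[83, 86, 92])
pred_test = np.array([[31.2], [28.7], [45.1]])   # illustrative predictions, shape (n, 1)

mask = df_u["a2"].isna()                          # True only where a value is missing
# Assign predictions to the missing rows only; existing values stay untouched.
df_u.loc[mask, "a2"] = pred_test.ravel()[mask.to_numpy()]

print(df_u)
```

The mask guarantees that only missing entries are overwritten, which is the requirement stated in the problem statement.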





In [20]:
def update_missing_values(df_u, y_test, pred_test, nan_col):
    print(df_u.shape)
    col_num = df_u.columns.get_loc(nan_col)   # positional index of the column to fill
    print(col_num)
    for count, value in enumerate(y_test):
        if np.isnan(value):
            # pred_test has shape (n, 1), so [count][0] is the scalar prediction
            print('iloc', [count, 0], 'update', pred_test[count][0])
            df_u.iloc[count, col_num] = pred_test[count][0]
        else:
            print('No Update', value)
    return df_u

In [27]:
# call functions to fill a2 and a14 data frame 
fill_missing_numeric(df_nona, df_na_a2, 'a2')  
fill_missing_numeric(df_nona, df_na_a14, 'a14')  

Final Score (RMSE): 25.84077968862533
(12, 14)
0
iloc [0, 0] update 12.814308
iloc [1, 0] update 6.6051793
iloc [2, 0] update 21.194366
iloc [3, 0] update 6.658072
iloc [4, 0] update 1.9763159
iloc [5, 0] update 2.812849
iloc [6, 0] update 10.108476
iloc [7, 0] update 5.4059577
iloc [8, 0] update 17.54561
iloc [9, 0] update 2.9111378
iloc [10, 0] update 23.355381
iloc [11, 0] update 11.634661




Final Score (RMSE): 240.48307716726632
(13, 14)
3
iloc [0, 0] update 31.066908
iloc [1, 0] update 8.056461
iloc [2, 0] update 5.930389
iloc [3, 0] update 23.65385
iloc [4, 0] update 5.930389
iloc [5, 0] update 27.590565
iloc [6, 0] update 5.930389
iloc [7, 0] update 20.412878
iloc [8, 0] update 12.717214
iloc [9, 0] update 5.930389
iloc [10, 0] update 5.930389
iloc [11, 0] update 5.930389
iloc [12, 0] update 19.062737




### Check if records are populated 

In [28]:
# check if missing a2 values are populated
df_na_a2

Unnamed: 0,a2,s3,a8,a11,a15,a9_f,a9_t,a10_f,a10_t,a12_f,a12_t,a13_g,a13_p,a13_s
83,12.814308,3.5,3.0,0,0,0,1,1,0,0,1,1,0,0
86,6.605179,0.375,0.875,0,0,0,1,1,0,0,1,0,0,1
92,21.194366,5.0,8.5,0,0,0,1,1,0,1,0,1,0,0
97,6.658072,0.5,0.835,0,0,0,1,1,0,0,1,0,0,1
254,1.976316,0.625,0.25,0,2010,1,0,1,0,1,0,1,0,0
286,2.812849,1.5,0.0,2,105,1,0,0,1,0,1,1,0,0
329,10.108476,4.0,0.085,0,0,1,0,1,0,0,1,1,0,0
445,5.405958,11.25,0.0,0,5200,1,0,1,0,1,0,1,0,0
450,17.54561,3.0,7.0,0,1,1,0,1,0,1,0,1,0,0
500,2.911138,4.0,5.0,3,2279,0,1,0,1,0,1,1,0,0


In [29]:
# check if missing a14 values are populated
df_na_a14

Unnamed: 0,s3,a8,a11,a14,a15,a9_f,a9_t,a10_f,a10_t,a12_f,a12_t,a13_g,a13_p,a13_s
71,4.0,12.5,0,31.066908,0,0,1,1,0,0,1,1,0,0
202,2.75,2.25,6,8.056461,600,0,1,0,1,1,0,1,0,0
206,0.0,0.0,0,5.930389,0,1,0,1,0,1,0,0,1,0
243,7.5,2.71,5,23.653851,26726,0,1,0,1,1,0,1,0,0
270,0.0,0.0,0,5.930389,0,1,0,1,0,1,0,0,1,0
278,13.5,0.0,0,27.590565,0,1,0,1,0,1,0,1,0,0
330,0.0,0.0,0,5.930389,0,1,0,1,0,1,0,0,1,0
406,8.125,0.165,2,20.412878,18,1,0,0,1,1,0,1,0,0
445,11.25,0.0,0,12.717214,5200,1,0,1,0,1,0,1,0,0
456,0.0,0.0,0,5.930389,0,1,0,1,0,1,0,0,1,0
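
A quick plausibility check, sketched here with values copied from the outputs in this notebook, is to compare imputed values with the range of observed values in the same column. The observed `a14` values are the five shown in `df.head()`, and the imputed ones are the first three rows of the table above. On the full data the same comparison would use `df_nona['a14']`.

```python
import pandas as pd

observed = pd.Series([202.0, 43.0, 280.0, 100.0, 120.0], name="a14")   # from df.head()
imputed = pd.Series([31.066908, 8.056461, 5.930389], name="a14")       # first imputed rows

# Flag imputed values that fall outside the observed min/max range.
outside = (imputed < observed.min()) | (imputed > observed.max())

print(f"observed mean: {observed.mean():.1f}, imputed mean: {imputed.mean():.1f}")
print(f"imputed values outside observed range: {outside.sum()} of {len(imputed)}")
```

This five-row sample is too small to draw conclusions from. If the same pattern holds on the full column, however, the network may be undertrained (it ran for only 10 epochs) or may benefit from scaled inputs.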


## Submission file 

It is not clear whether one file or two files have to be submitted.

It is also not clear whether only the missing-record file or the all-records file has to be submitted.

Hence updating the df_na file.

In [30]:
# .copy() makes df_out independent of df, so the updates below raise no SettingWithCopyWarning
df_out = df[(df['a2'].isnull()) | (df['a14'].isnull())].copy()


In [31]:
# Index labels are shared with df, so .loc writes each filled value to the matching row
for i, rows in df_na_a2.iterrows():
    df_out.loc[i, 'a2'] = rows.a2

for i, rows in df_na_a14.iterrows():
    df_out.loc[i, 'a14'] = rows.a14


In [32]:
df_out

Unnamed: 0,a1,a2,s3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15,a16
71,b,34.83,4.0,u,g,d,bb,12.5,t,f,0,t,g,31.066908,0,-
83,a,12.814308,3.5,u,g,d,v,3.0,t,f,0,t,g,300.0,0,-
86,b,6.605179,0.375,u,g,d,v,0.875,t,f,0,t,s,928.0,0,-
92,b,21.194366,5.0,y,p,aa,v,8.5,t,f,0,f,g,0.0,0,-
97,b,6.658072,0.5,u,g,c,bb,0.835,t,f,0,t,s,320.0,0,-
202,b,24.83,2.75,u,g,c,v,2.25,t,t,6,f,g,8.056461,600,+
206,a,71.58,0.0,,,,,0.0,f,f,0,f,p,5.930389,0,+
243,a,18.75,7.5,u,g,q,v,2.71,t,t,5,f,g,23.653851,26726,+
254,b,1.976316,0.625,u,g,k,v,0.25,f,f,0,f,g,380.0,2010,-
270,b,37.58,0.0,,,,,0.0,f,f,0,f,p,5.930389,0,+
