# Neural Network Basic Classification Problem
The steps below give a broad overview of the data source, how to obtain the data, and how to prepare it for a neural network classification problem, followed by brief steps for building the classification model.
1. Read the data dictionary and description from here:

* https://oehha.ca.gov/media/downloads/calenviroscreen/document/calenviroscreen40resultsdatadictionaryf2021.zip

 Download the data using gdown, read the Excel file using pandas, print the first 10 rows using df.head() and use df.info() to examine the data types and missing values.

2. Simplify the raw dataframe so that you only keep the columns you need. The `X` variable will be the following columns: `Population`, `Ozone` through `Solid Waste Pctl`, and `Asthma` through `Linguistic Isolation Pctl`. The `y` variable will be `Poverty`. Examine the quality of each column and use your judgement about dropping rows or imputing missing values. Add text cells and lots of comments so we can understand your logic/justification!

3. Recode the target variable to a 1 if greater than the mean value of poverty, otherwise make it a 0. Use this recoded variable as the target variable! Now it is a classification problem.

4. Make two interesting plots or tables and a description for EDA.

5. Do a 90/10 split for X_train, X_test, y_train, y_test.

6. Use the StandardScaler() on train and apply to test partition. Do not scale the target variable!

7. Build a model using the Sequential API (like we do in class) with at least 2 dense layers with the relu activation function, and with dropout in between each dense layer (use a number between 0.1 and 0.5). Compile the model using an appropriate optimizer. Use early stopping with patience of at least 10 and restore the best weights once the model converges. You can choose whatever batch size you would like to.

8. Fit the model for 100000 epochs with a batch size of your choice, using X_test and y_test as the validation data. **Don’t forget the early stopping callback!**

9. Evaluate the model using learning curves, error metrics and confusion matrices for each partition.

10. Calculate what a baseline prediction would be for the train and test partitions (a mean only model). Did your model do better than the baseline predictions? If so, you have a useful model!

# Q1. Reading Data and Importing Modules
Read Data and Import Modules

In [2]:
# import modules
# for general data analysis/plotting
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# for data prep
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

#report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# neural net modules
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.callbacks import EarlyStopping






In [None]:
import requests
import pandas as pd

# File ID and URL
file_id = '1_8vmQwSZ02ZOMw_IPHY5OWkwpXQLVW2E'
url = f'https://drive.google.com/uc?id={file_id}'

# Download the file
response = requests.get(url)

# Save the content to a local file
with open('CalEnviroScreen.xlsx', 'wb') as f:
    f.write(response.content)

# Read the Excel file into a DataFrame
df_CalEnviroScreen = pd.read_excel('CalEnviroScreen.xlsx')


In [3]:
# https://drive.google.com/file/d/1_8vmQwSZ02ZOMw_IPHY5OWkwpXQLVW2E/view?usp=sharing
# Alternative download via gdown. The leading "!" runs it as a shell command from the notebook
# (gdown must be installed first; older gdown releases need --id before the file ID).
!gdown 1_8vmQwSZ02ZOMw_IPHY5OWkwpXQLVW2E -O CalEnviroScreen.xlsx
df_CalEnviroScreen = pd.read_excel('CalEnviroScreen.xlsx')

In [None]:
# read data
df = pd.read_excel('CalEnviroScreen.xlsx')
df.info()

Printing First 10 Records

In [None]:
df.head(10)

Missing values across all the columns

In [None]:
df.isnull().sum()

The total number of missing cells across all columns is 2724.

In [None]:
df.isnull().sum().sum() # total number of missing cells across all columns (not rows)
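Since `df.isnull().sum().sum()` counts missing cells rather than rows, the two quantities are easy to mix up. The sketch below contrasts them on a made-up frame (the `toy` data is purely illustrative and is not taken from CalEnviroScreen).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the CalEnviroScreen data
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, 7.0],
})

# Total number of missing cells (what .isnull().sum().sum() returns)
missing_cells = int(toy.isnull().sum().sum())

# Number of rows that contain at least one missing value
rows_with_missing = int(toy.isnull().any(axis=1).sum())

print(missing_cells, rows_with_missing)
```

A row with two missing values is counted twice in the first number but only once in the second.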

# Q2. Subsetting and Column info
Simplify the raw dataframe so that you only keep the columns you need. The `X` variable will be the following columns: `Population`, `Ozone` through `Solid Waste Pctl`, and `Asthma` through `Linguistic Isolation Pctl`. The `y` variable will be `Poverty`. Examine the quality of each column and use your judgement about dropping rows or imputing missing values. Add text cells and lots of comments so we can understand your logic/justification!


In [None]:
#Getting columns names index
column_name=df.columns
print('Column Names:', pd.DataFrame(column_name))

Subsetting the dataset with desired columns. The `X` variable will be the following columns: `Population`, `Ozone` through `Solid Waste Pctl`, and `Asthma` through `Linguistic Isolation Pctl`. The `y` variable will be `Poverty`.

## X and y dataset subsetting

In [None]:
df1=df[df.columns[[1] + list(range(11, 35)) + [7]+ list(range(38,48))+[48]]] #Getting the required columns

#subsetting the X from the df
X=df1.drop('Poverty',axis=1)

#Separating and Creating a y variable
y=df1['Poverty']
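Positional indices like the ones above depend on the column order of the spreadsheet. As a hedged alternative, the same kind of selection can be written with label slices. The miniature frame below is hypothetical and only mimics the column order; the column names are the ones given in the task description.

```python
import pandas as pd

# Hypothetical miniature frame; only the column order matters here
toy = pd.DataFrame(
    [[1, 10, 0.1, 0.2, 0.3, 5, 6, 7]],
    columns=["Tract", "Population", "Ozone", "PM2.5", "Solid Waste Pctl",
             "Asthma", "Linguistic Isolation Pctl", "Poverty"],
)

X_toy = pd.concat(
    [
        toy[["Population"]],
        toy.loc[:, "Ozone":"Solid Waste Pctl"],            # label slices include both ends
        toy.loc[:, "Asthma":"Linguistic Isolation Pctl"],
    ],
    axis=1,
)
y_toy = toy["Poverty"]

print(list(X_toy.columns))
```

Label slices keep working if columns are added before the selected ranges, at the cost of failing loudly when a column is renamed.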



In [None]:
X.columns

In [None]:
X.describe()

## Missing Values (Desired Columns)

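Let's analyse the missing values in the desired columns.

There aren't that many missing values. The selected columns contain 1412 missing cells in total. Dividing by the 8035 rows gives 17.5%, which is an upper bound on the share of rows with a missing value, since one row can have several. I have chosen to impute these values rather than drop the rows.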

In [None]:
X.isnull().sum().sum()

In [None]:
1412/8035

## KNN Imputation for Missing Values

Here we impute the missing values using a KNN imputer. I have separated the columns that contain null values and recorded the indices of the affected rows, so that we can verify afterwards that the values have been imputed and the indices are preserved.

In [None]:
#Getting the names of columns with NULL values
null_columns_X = X.columns[X.isnull().any()]
print('Names of Columns having Null values:\n', null_columns_X)

# Getting the indices of null values
null_indices = X[X.isnull().any(axis=1)].index

# Displaying the top 5 null indices
top_5_null_indices = null_indices[:5]
print("\nTop 5 Null Indices:\n")
print(top_5_null_indices)

In [None]:
#Seeing the records before imputation
X.loc[top_5_null_indices] #Notice the last two columns

In [None]:
#Imputing the missing values in the X
#KNN imputation
from sklearn.impute import KNNImputer

knn_imputer = KNNImputer(n_neighbors=10,weights='distance')

# Creating a new DataFrame with the selected columns (which has missing values)
data_with_missing_values_X = X.loc[:, null_columns_X]

# Imputing the missing values
imputed_data_X = knn_imputer.fit_transform(data_with_missing_values_X)

In [None]:
# Creating a new DataFrame with the imputed values
# setting the same index as X
# This is very important: otherwise the new DataFrame gets its own default index,
# which causes problems when we write the values back into the original X dataset
imputed_df_X = pd.DataFrame(imputed_data_X, columns=null_columns_X,index=X.index)

# Replacing the imputed columns in the original DataFrame with the imputed values
X[null_columns_X]=imputed_df_X

#lets view the imputed data with same indices we saw earlier
X.loc[top_5_null_indices]
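To make the index-preservation point concrete, here is a minimal sketch on a made-up frame with a non-default index. The values, the `toy` name and `n_neighbors=2` are assumptions chosen for illustration; they are not the notebook's settings.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy frame with a non-default index on purpose
toy = pd.DataFrame(
    {"u": [1.0, 2.0, np.nan, 4.0], "v": [10.0, np.nan, 30.0, 40.0]},
    index=[100, 101, 102, 103],
)

imputer = KNNImputer(n_neighbors=2, weights="distance")

# fit_transform returns a plain NumPy array, so the index must be re-attached
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

print(filled)
```

Without `index=toy.index`, the new frame would be indexed 0 to 3 and an assignment back into the original frame would align on the wrong labels.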


# Q3. Recoding Target Variable
Recode the target variable to a 1 if greater than the mean value of poverty, otherwise make it a 0. Use this recoded variable as the target variable! Now it is a classification problem.


In [None]:
#Recoding the y variable to a categorical variable
mean_poverty = y.mean()
y= np.where(y > mean_poverty, 1, 0)

#lets see the length of y
len(y)
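One detail worth checking: if `Poverty` itself contains missing values (the cells above do not check this), `np.where` silently recodes them to 0, because any comparison with NaN is False. The sketch below shows the effect on made-up numbers and one possible way to drop such rows first; the names `y_toy`, `naive` and `clean` are illustrative only.

```python
import numpy as np
import pandas as pd

y_toy = pd.Series([10.0, 40.0, np.nan, 30.0])   # hypothetical values
threshold = y_toy.mean()                        # NaN is skipped: (10 + 40 + 30) / 3

# NaN > threshold evaluates to False, so the missing value becomes class 0
naive = np.where(y_toy > threshold, 1, 0)

# Alternative: keep only rows where the target is known, then recode
keep = y_toy.notna()
clean = (y_toy[keep] > threshold).astype(int)

print(naive.tolist(), clean.tolist())
```

If rows are dropped this way, the same mask has to be applied to `X` so that features and target stay aligned.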

# Q4. Plots and Tables
Make two interesting plots or tables and a description of why you made the table and what you see.


## 1. Correlation of Variable with Target Variable (Original)

Let's see how the variables are correlated with the target variable (the original one, before recoding).

The figure below indicates that ```CES 3.0 Score``` and ```Asthma Pctl``` are strongly positively correlated with ```Poverty```.

**Insights on relation to recoded values:**

Since we assigned 1 to records whose ```Poverty``` is greater than the mean of Poverty, the NN model might learn weights such that higher values of these two features push the prediction towards 1.

@Dave please correct me if I am wrong in taking a shot at inferring what might be going on inside the model.

In [None]:
# Calculate the correlation of each feature in X with the original (un-recoded) Poverty values
corr_matrix = X.corrwith(df1['Poverty'])

# Sort the correlation matrix in descending order
sorted_corr_matrix = corr_matrix.sort_values(ascending=False)

# Create a heatmap of the sorted correlation matrix
fig, ax = plt.subplots(figsize=(6, 8))
sns.heatmap(sorted_corr_matrix.to_frame(), annot=True, cmap='coolwarm', cbar=True, ax=ax)
plt.title('Correlation Heatmap (Highest to Lowest)')
plt.tight_layout()
plt.show()

## 2. Bar plot of Target Variable

The counts of the two target classes (1 and 0) are plotted as a bar plot to show whether the dataset is balanced or imbalanced.

The graph shows that the dataset is nearly balanced.

It is important to have a balanced dataset in order to predict both classes without bias. In the case of imbalanced data, the class with more data would tend to dominate the predictions.

In [None]:
value_counts = np.bincount(y)

# Create a bar plot
plt.bar(range(len(value_counts)), value_counts)
plt.xlabel('Category')
plt.ylabel('Count')
plt.title('Count of Categories (1 and 0)')
plt.xticks(range(len(value_counts)))
for i, count in enumerate(value_counts):
    plt.text(i, count, str(count), ha='center', va='bottom')

plt.show()

## 3. Correlation (all variables) Plot

This plot shows which features are correlated with each other and how strongly. Note that the Pctl columns are percentiles of their counterparts (for example, PM2.5 has PM2.5 Pctl). Since these columns largely repeat information already in the data, it occurred to me that they could be removed before building a model. The thought behind this is that keeping features that carry the same information, just coded differently, might make the model more complex or bias it towards some features. As an experiment, I run the same model with these features removed towards the end of this Colab notebook.


In [None]:
# Calculating the correlation matrix for all features
corr_matrix = X.corr()

# Sort the correlation matrix in descending order
sorted_corr_matrix = corr_matrix.unstack().sort_values(ascending=False)

# Removing the correlation values with itself (diagonal elements)
sorted_corr_matrix = sorted_corr_matrix[~(sorted_corr_matrix.index.get_level_values(0) == sorted_corr_matrix.index.get_level_values(1))]

# Creating a heatmap of the sorted correlation matrix
fig, ax = plt.subplots(figsize=(30, 30))
sns.heatmap(sorted_corr_matrix.unstack(), annot=True, cmap='coolwarm', cbar=True, ax=ax, annot_kws={"size":16},fmt='.2f')
plt.title('Correlation Heatmap (Highest to Lowest)',fontsize=20)
plt.tight_layout()
plt.show()
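A 30x30 heatmap is hard to scan for redundant features, so a short list of strongly correlated pairs can help. The sketch below is one way to get such a list, assuming a made-up frame in which one column is a linear stand-in for a "Pctl" counterpart; the 0.9 cutoff is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Hypothetical data: "PM2.5 Pctl" is a perfectly correlated stand-in for a percentile column
toy = pd.DataFrame({
    "PM2.5": base,
    "PM2.5 Pctl": base * 2.0 + 1.0,
    "Traffic": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

pairs = upper.stack().sort_values(ascending=False)
strong = pairs[pairs > 0.9]

print(strong)
```

On the real data, the same steps applied to `X.corr()` would list candidate columns to drop for the experiment described above.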

# Q5. Splitting Data
Do a 90/10 split for X_train, X_test, y_train, y_test where the random seed is equal to your 7 digit studentID number.

In [None]:
#splitting dataset in train and test
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.1, random_state=3069860)

The train and test splits are also saved in separate variables (X2_train and X2_test) for a second model that I wanted to experiment with, in which the correlated features (mainly the corresponding Pctl metrics) are removed. Later, in the Appendix, the columns containing Pctl are dropped and the same model (defined under a different name) is run on this data to compare performance.

In [None]:
#Saving copies of the unscaled train and test DataFrames for another model (in Appendix)
X2_train=X_train.copy()
X2_test=X_test.copy()

In [None]:
#making sure that the data is split correctly
print(X_train.shape,y_train.shape)
print(X_test.shape,y_test.shape)
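With only 10% of the rows in the test partition, the class shares in train and test can drift apart by chance. A hedged option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses made-up data and an arbitrary seed of 0.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 60 of class 0 and 40 of class 1
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 60 + [1] * 40)

# stratify=y_toy keeps the 60/40 class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=0, stratify=y_toy
)

print(X_tr.shape, X_te.shape, y_tr.mean(), y_te.mean())
```

Both partitions end up with the same share of class 1, which also makes the baseline comparison between them cleaner.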

# Q6. Scaling
Use the StandardScaler() on train and apply to test partition. Do not scale the target variable!


In [None]:
standard_scaler = StandardScaler()
X_train = standard_scaler.fit_transform(X_train)
X_test = standard_scaler.transform(X_test)
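What "fit on train, apply to test" means can be written out by hand. The sketch below reproduces the arithmetic with NumPy on made-up numbers: the mean and standard deviation come from the train rows only and are then reused for the test rows.

```python
import numpy as np

# Hypothetical values, two features
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[4.0, 40.0]])

# Statistics are estimated on the train partition only
mu = train.mean(axis=0)
sigma = train.std(axis=0)   # population std (ddof=0), which StandardScaler also uses

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma   # test rows reuse the train statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0), test_scaled)
```

The scaled test partition does not have mean 0 and standard deviation 1 in general, which is expected: it is measured on the train scale.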

# Q7. Build and Compile the Model
Build a model using the Sequential API (like we do in class) with at least 2 dense layers with the relu activation function, and with dropout in between each dense layer (use a number between 0.1 and 0.5). Compile the model using an appropriate optimizer. Use early stopping with patience of at least 10 and restore the best weights once the model converges. You can choose whatever batch size you would like to.

In [None]:
X_train.shape[1]

In [None]:
model = Sequential()
model.add(Dense(128, input_shape=(X_train.shape[1],), activation='relu')) # (features,)
model.add(Dropout(0.15)) # dropout rate between 0.1 and 0.5
model.add(Dense(64, activation='relu')) # hidden layer
model.add(Dropout(0.2)) # dropout rate between 0.1 and 0.5
model.add(Dense(32, activation='relu')) # hidden layer
model.add(Dropout(0.2)) # dropout rate between 0.1 and 0.5
model.add(Dense(1, activation='sigmoid')) # output node
model.summary() # see what your model looks like
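The parameter counts printed by `model.summary()` can be checked by hand: each Dense layer has one weight per input-unit pair plus one bias per unit, and Dropout layers add no parameters. The sketch below does this arithmetic, assuming the 36 input features mentioned in the conclusion of this notebook.

```python
# Assumption: 36 input features, as stated in the conclusion of this notebook
n_features = 36
layer_units = [128, 64, 32, 1]   # Dense layer sizes used in the model above

params = []
fan_in = n_features
for units in layer_units:
    # weights (fan_in * units) plus biases (units); Dropout contributes nothing
    params.append(fan_in * units + units)
    fan_in = units

total = sum(params)
print(params, total)
```

If the summary shows different numbers, the number of input columns is probably not what was assumed here.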


In [None]:
# compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Q8. Fitting the Model
Fit the model for 100000 epochs with a batch size of your choice, using X_test and y_test as the validation data. **Don’t forget the early stopping callback!**

In [None]:
es = EarlyStopping(monitor='val_accuracy',
                   mode='max', # don't minimize the accuracy!
                   patience=20,
                   restore_best_weights=True)

# now we just update our model fit call
history = model.fit(X_train,
                    y_train,
                    callbacks=[es],
                    epochs=100000, # you can set this to a big number!
                    batch_size=256*2*2*2,
                    validation_data=(X_test,y_test),
                    shuffle=True,
                    verbose=1)
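The effect of `patience` and `restore_best_weights` can be illustrated without Keras. The function below is a simplified imitation of the stopping rule for a metric that should be maximized, not Keras's actual implementation; `early_stop_epoch` and the score list are made up for this sketch.

```python
def early_stop_epoch(val_scores, patience):
    """Return (stopped_epoch, best_epoch), both 1-indexed, for a metric to maximize."""
    best_score = float("-inf")
    best_epoch = 0
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            # strict improvement: remember this epoch and reset the counter
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # ran out of epochs before patience was exhausted
    return len(val_scores), best_epoch


# Hypothetical validation accuracies per epoch
scores = [0.70, 0.80, 0.85, 0.84, 0.85, 0.83, 0.82]
stopped, best = early_stop_epoch(scores, patience=3)
print(stopped, best)
```

Training stops at epoch 6, three epochs after the last strict improvement, and the weights kept would be those from epoch 3. This is why the epoch count can be set very high in the fit call.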

# Q9. Evaluating the Model
Evaluate the model using learning curves, error metrics and confusion matrices for each partition (like we do in class). You should largely be able to copy and paste this from class notebooks. Add a few bullet points about what you see (did your model learn nice and gently?). If you don't have text cells here, you will lose points.


In [None]:
# learning curve

# accuracy
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']


epochs = range(1, len(acc) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, acc, 'bo', label='Training accuracy')
# b is for "solid blue line"
plt.plot(epochs, val_acc, 'b', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

#loss
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

The learning curves show that the model learnt the pattern quickly: the loss dropped and the accuracy rose sharply.

The confusion matrices and classification reports show that the model performs well on the train dataset, achieving an F1 score of 0.95. It also performs well on the test dataset, with an F1 score of 0.90. Dropout was used to limit overfitting, and the modest gap between the train and test F1 scores suggests that the model has not overfit badly and predicts the test data well. That being said, the accuracy on train is higher than on test, and the loss on train is lower than on test.


# Q10. Baseline Model and Classification Report
Calculate what a baseline prediction would be for the train and test partitions (a mean only model). Did your model do better than the baseline predictions? If so, you have a useful model!

In [None]:
#Baseline prediction for train: share of class 0 in the partition
count_train=pd.Series(np.ravel(y_train)).value_counts()

baseline_pred_train=count_train[0]/(count_train[0]+count_train[1])
print('Baseline prediction for train:\n', baseline_pred_train)

#Baseline prediction for test (computed from the test counts, not the train counts)
count_test=pd.Series(np.ravel(y_test)).value_counts()

baseline_pred_test=count_test[0]/(count_test[0]+count_test[1])
print('Baseline prediction for test:\n', baseline_pred_test)
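For a binary target, a "mean only" baseline amounts to always predicting the more frequent class. The sketch below is one way to express that as an accuracy figure that can be compared directly with the model's accuracy; the helper name `majority_baseline_accuracy` and the label arrays are made up for illustration.

```python
import numpy as np


def majority_baseline_accuracy(y_fit, y_eval):
    """Pick the most frequent class in y_fit and score it as a constant prediction on y_eval."""
    majority = int(np.bincount(np.ravel(y_fit).astype(int)).argmax())
    accuracy = float(np.mean(np.ravel(y_eval) == majority))
    return majority, accuracy


# Hypothetical labels
y_tr_toy = np.array([0, 0, 0, 1, 1])
y_te_toy = np.array([1, 0, 1, 0])

maj, acc_tr = majority_baseline_accuracy(y_tr_toy, y_tr_toy)
_, acc_te = majority_baseline_accuracy(y_tr_toy, y_te_toy)
print(maj, acc_tr, acc_te)
```

The majority class is chosen from the train labels in both calls, so the test score reflects a rule that never saw the test partition.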

In [None]:
## seeing how the model did! on train
preds_train = np.round(model.predict(X_train),0)

matrix = confusion_matrix(y_train,preds_train)
matrix

In [None]:
## seeing how the model did! on test
preds_test = np.round(model.predict(X_test),0)

matrix = confusion_matrix(y_test,preds_test)
matrix

In [None]:
#Printing the classification report on the train  data
print('Classification Report on Train:\n',classification_report(y_train , preds_train))

#Printing the classification report on the test data
print('Classification Report on Test:\n',classification_report(y_test, preds_test))
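The numbers in the classification report can be derived from the confusion matrix by hand. The sketch below does this for class 1 on a made-up 2x2 matrix in scikit-learn's layout (rows are true labels, columns are predictions); the counts are hypothetical and are not this model's results.

```python
import numpy as np

# Hypothetical counts: rows = true class, columns = predicted class
cm = np.array([[50, 10],
               [5, 35]])

tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of the predicted 1s, how many were right
recall = tp / (tp + fn)      # of the true 1s, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(round(precision, 3), round(recall, 3), round(f1, 3), accuracy)
```

Accuracy is the figure to compare against a majority-class baseline, while F1 summarizes precision and recall for one class.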

# Conclusion/Insights

Extending the insights from the model evaluation, here are some thoughts about this assignment:

1. After comparing the two models (same hyperparameters; the second model, run in the appendix below, excludes the Pctl metrics), it was observed that the first model (where the correlated features were present) had a tendency to overfit the train data as the network got more complex with more layers and units. The latter model (where the correlated features were removed) gave fairly consistent performance, with similar numbers on train and test. Overfitting was not observed in the latter model (run in the appendix at the end).

2. It was also observed that the more complex the model is (more layers and units, with dropout), the better the predictions on train and, to some extent, on test data.


3. The F1 score of the model on test data is 0.90. Compared with the baseline prediction, we can conclude that the model performs better and has recognized patterns in the data.

4. In both models (above and in the appendix), the validation accuracy is lower than the training accuracy, and the validation loss is higher than the training loss.

5. The first model (where correlated features such as Pctl are present) has 36 features in total. After removing the 17 correlated features, model_2 has 19 features. This might explain the consistency of model_2 between train and test data, in that the model is not fed a lot of noise. (@Dave please correct me if this inference is wrong; I would love to know more and learn from your feedback.)

## **Appendix (Other model)**

Checking the performance of the same model when the Pctl versions of the features are removed.

In [None]:
#selecting columns without the Pctl metric
X_columns_wo_Pctl = [col for col in X.columns if "Pctl" not in col]
X2_train=X2_train[X_columns_wo_Pctl]
X2_test=X2_test[X_columns_wo_Pctl]
#let's view the first 5 records
X2_train.head(5)

### Scaling
Use the StandardScaler() on train and apply to test partition. Do not scale the target variable!


In [None]:
standard_scaler = StandardScaler()
X2_train = standard_scaler.fit_transform(X2_train)
X2_test = standard_scaler.transform(X2_test)

### Build and Compile the Model
Build a model using the Sequential API (like we do in class) with at least 2 dense layers with the relu activation function, and with dropout in between each dense layer (use a number between 0.1 and 0.5). Compile the model using an appropriate optimizer. Use early stopping with patience of at least 10 and restore the best weights once the model converges. You can choose whatever batch size you would like to.

In [None]:
X2_train.shape[1]

In [None]:
model_2 = Sequential()
model_2.add(Dense(128, input_shape=(X2_train.shape[1],), activation='relu')) # (features,)
model_2.add(Dropout(0.15)) # dropout rate between 0.1 and 0.5
model_2.add(Dense(64, activation='relu')) # hidden layer
model_2.add(Dropout(0.2)) # dropout rate between 0.1 and 0.5
model_2.add(Dense(32, activation='relu')) # hidden layer
model_2.add(Dropout(0.2)) # dropout rate between 0.1 and 0.5
model_2.add(Dense(1, activation='sigmoid')) # output node
model_2.summary() # see what your model looks like


In [None]:
# compile the model
model_2.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

### Fitting the Model
Fit the model for 100000 epochs with a batch size of your choice, using X_test and y_test as the validation data. **Don’t forget the early stopping callback!**

In [None]:
es = EarlyStopping(monitor='val_accuracy',
                   mode='max', # don't minimize the accuracy!
                   patience=20,
                   restore_best_weights=True)

# now we just update our model fit call
history_2 = model_2.fit(X2_train,
                    y_train,
                    callbacks=[es],
                    epochs=100000, # you can set this to a big number!
                    batch_size=256*2*2*2,
                    validation_data=(X2_test,y_test),
                    shuffle=True,
                    verbose=1)

### Evaluating the Model
Evaluate the model using learning curves, error metrics and confusion matrices for each partition (like we do in class). You should largely be able to copy and paste this from class notebooks. Add a few bullet points about what you see (did your model learn nice and gently?). If you don't have text cells here, you will lose points.


In [None]:
# learning curve

# accuracy
acc_2 = history_2.history['accuracy']
val_acc_2 = history_2.history['val_accuracy']


epochs = range(1, len(acc_2) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, acc_2, 'bo', label='Training accuracy')
# b is for "solid blue line"
plt.plot(epochs, val_acc_2, 'b', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

#loss
loss_2 = history_2.history['loss']
val_loss_2 = history_2.history['val_loss']

epochs = range(1, len(acc_2) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss_2, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss_2, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()


### Baseline Model and Evaluation on Test Data
Calculate what a baseline prediction would be for the train and test partitions (a mean only model). Did your model do better than the baseline predictions? If so, you have a useful model!

In [None]:
#Baseline prediction for train: share of class 0 in the partition
count_train=pd.Series(np.ravel(y_train)).value_counts()

baseline_pred_train=count_train[0]/(count_train[0]+count_train[1])
print('Baseline prediction for train:\n', baseline_pred_train)

#Baseline prediction for test (computed from the test counts, not the train counts)
count_test=pd.Series(np.ravel(y_test)).value_counts()

baseline_pred_test=count_test[0]/(count_test[0]+count_test[1])
print('Baseline prediction for test:\n', baseline_pred_test)

In [None]:
## seeing how the model did! on train
preds2_train = np.round(model_2.predict(X2_train),0)

matrix = confusion_matrix(y_train,preds2_train)
matrix

In [None]:
## seeing how the model did! on test
preds2_test = np.round(model_2.predict(X2_test),0)

matrix = confusion_matrix(y_test,preds2_test)
matrix

In [None]:
#Printing the classification report on the train  data
print('Classification Report on Train:\n',classification_report(y_train , preds2_train))

#Printing the classification report on the test data
print('Classification Report on Test:\n',classification_report(y_test, preds2_test))