# Machine Learning Models

The goal of this workbook is to compare different machine learning models on our dataset. We want to see which models perform best and which can most accurately predict new data.

In this workbook we will load our cleaned CSV from our ETL process, then compare the following models:

- Logistic Regression
- K Nearest Neighbors
- Random Forest
- Neural Network
    
After analyzing each of these models to see which is the best predictor, we will use a <b>Correlation Matrix</b> to see which of the input factors from our data has the most predictive power. This is an additional path our group wanted to explore to gain more insight into our data.
    
While there are plenty of other machine learning models we could have explored, these are the ones our team was most interested in experimenting with. We also selected them because we believe they have a high likelihood of being effective given the nature of our dataset and our predictive goals.

In [1]:
# Import General Dependencies for this Workbook
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sn
import os
from sklearn import preprocessing

In [None]:
# Set file path to the cleaned data csv from data cleanup process
file = "data.csv"

# Read to a df
data_df = pd.read_csv(file)
# data_df.head()

In [None]:
print (data_df)
print (data_df.dtypes)

# ------------------------------

# Start Emerson Code

## Logistic Regression

Logistic Regression is a statistical method for predicting binary outcomes from data.
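As a rough illustration of how a fitted logistic regression turns inputs into a binary prediction, the sketch below applies the logistic (sigmoid) function to a weighted sum of features. The weights, bias, and feature values are made up for illustration; they are not taken from the model trained in this workbook.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Probability of the positive class for one row of features."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 if the probability reaches the threshold, otherwise 0."""
    return int(predict_probability(features, weights, bias) >= threshold)


# Hypothetical weights for two made-up features
example_weights = [0.8, -0.4]
example_bias = -0.1
example_probability = predict_probability([1.0, 2.0], example_weights, example_bias)
example_label = predict_label([1.0, 2.0], example_weights, example_bias)
print(f"Probability: {example_probability:.3f}, label: {example_label}")
```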

In [None]:
# The data contains an "income" column stating whether an individual earns at most $50k or more than $50k.
# Remove "income" from the input set so the model is not given the correct answer.
# Assign X (data) and y (target)
X = data_df.drop("income", axis=1)
y = data_df["income"]
print(X.shape, y.shape)

In [None]:
X = pd.get_dummies(X)

In [None]:
# Split the data into a training (75% of values) and testing (25% of values) split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
# Generate the Logistic Regression Model
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier

In [None]:
# Fit the model
classifier.fit(X_train, y_train)

In [None]:
# Score the model on the training and testing data
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")
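An accuracy score is easier to interpret next to the accuracy of always predicting the most common class. The sketch below computes that baseline; the labels are made up for illustration and are not read from `data.csv`.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Made-up labels for illustration only
example_labels = ["<=50K", "<=50K", "<=50K", ">50K"]
baseline = majority_baseline_accuracy(example_labels)
print(f"Majority-class baseline: {baseline:.2f}")
```

In the workbook itself, the same function could be applied to `y_test` and compared with the testing score.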

Predict with new data

In [None]:
# Predict the income class for each row in the test set
predictions = classifier.predict(X_test)
print(f"Classes: {classifier.classes_}")
print(f"Predicted classes for the test set: {predictions}")

In [None]:
# Predict into a df
predictions = classifier.predict(X_test)
pd.DataFrame({"Prediction": predictions, "Actual": y_test})
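A table of predictions against actual values can be summarized by counting each (actual, predicted) pair, which shows where the model is wrong as well as how often. This is a minimal sketch with made-up labels; in the workbook the two lists would be `y_test` and `predictions`.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))


# Made-up labels for illustration only
example_actual = ["<=50K", ">50K", "<=50K", ">50K", "<=50K"]
example_predicted = ["<=50K", "<=50K", "<=50K", ">50K", ">50K"]

counts = confusion_counts(example_actual, example_predicted)
for (actual_label, predicted_label), n in sorted(counts.items()):
    print(f"actual={actual_label:>6} predicted={predicted_label:>6} count={n}")
```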

In [None]:
# Logistic Regression variable for charts
logistic_accuracy = classifier.score(X_test, y_test)

## K Nearest Neighbors (KNN) Model

In [None]:
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import os
from sklearn import preprocessing

In [None]:
# Set file path to the cleaned data csv from data cleanup process
file = "data.csv"

# Read to a df
data_df = pd.read_csv(file)
data_df.head()

In [None]:
# setting "y" to income 
y = data_df["income"]
target_names = ["negative", "positive"]

In [None]:
# dropping "income" from the data as this has the correct answer 
X = data_df.drop("income", axis=1)
X.head()

In [None]:
X = pd.get_dummies(X)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
from sklearn.preprocessing import StandardScaler

# Create a StandardScaler model and fit it to the training data

X_scaler = StandardScaler().fit(X_train)

In [None]:
# Transform the training and testing data using the X_scaler model

X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
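The scaling step above can be pictured as follows: the mean and standard deviation are computed from the training data only, and the same two numbers are then applied to both splits. This is a sketch for a single numeric column with made-up values; treating a constant column by dividing by 1 is a choice made for this example.

```python
from statistics import fmean, pstdev


def fit_scaler(train_column):
    """Learn the mean and population standard deviation from training data."""
    mean = fmean(train_column)
    std = pstdev(train_column)
    if std == 0:
        std = 1.0  # sketch's choice: leave a constant column unscaled
    return mean, std


def transform(column, mean, std):
    """Apply statistics learned on the training data to any column."""
    return [(value - mean) / std for value in column]


# Made-up values for illustration only
train_ages = [20.0, 30.0, 40.0]
test_ages = [50.0]

age_mean, age_std = fit_scaler(train_ages)
scaled_train = transform(train_ages, age_mean, age_std)
scaled_test = transform(test_ages, age_mean, age_std)
print(scaled_train, scaled_test)
```

Fitting on the training split alone keeps information about the test rows out of the preprocessing step.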

In [None]:
# Loop through different k values to see which has the highest accuracy
# Note: We only use odd numbers because we don't want any ties
train_scores = []
test_scores = []
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    train_score = knn.score(X_train_scaled, y_train)
    test_score = knn.score(X_test_scaled, y_test)
    train_scores.append(train_score)
    test_scores.append(test_score)
    print(f"k: {k}, Train/Test Score: {train_score:.3f}/{test_score:.3f}")
    
    
plt.plot(range(1, 20, 2), train_scores, marker='o', label="Training")
plt.plot(range(1, 20, 2), test_scores, marker="x", label="Testing")
plt.xlabel("k neighbors")
plt.ylabel("Accuracy score")
plt.legend()
plt.show()
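To make the note about odd values of k concrete, here is a small sketch of the voting step behind a k-nearest-neighbors prediction: find the k closest training points and take the most common label among them. With two classes and an odd k, the vote cannot end in a tie. The points and labels are made up for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Made-up, already-scaled points for illustration only
example_points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)]
example_point_labels = ["<=50K", "<=50K", "<=50K", ">50K", ">50K"]

near_low = knn_predict(example_points, example_point_labels, (1, 1), k=3)
near_high = knn_predict(example_points, example_point_labels, (5, 4), k=3)
print(near_low, near_high)
```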


In [None]:
# Note that k=13 seems to be the best choice for this dataset
knn = KNeighborsClassifier(n_neighbors=13)
knn.fit(X_train_scaled, y_train)
print('k=13 Test Acc: %.3f' % knn.score(X_test_scaled, y_test))

In [None]:
# K Nearest Neighbors variable for charts
knn_accuracy = knn.score(X_test_scaled, y_test)

# End Emerson Code

# ------------------------------

# Start Sofanit Code

## Random Forest Model

In [2]:
# Import dependencies for the Random Forest model
from sklearn.model_selection import train_test_split
import pandas as pd
import os
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier


In [3]:
# Set file path to the cleaned data csv from data cleanup process
file = "data.csv"

# Read to a df
data_df = pd.read_csv(file)
data_df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_number,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [4]:
data_df.shape

(32561, 15)

In [5]:
target = data_df["income"]
target_names = ["<=50K", ">50K"]
target

0         <=50K
1         <=50K
2         <=50K
3         <=50K
4         <=50K
          ...  
32556     <=50K
32557      >50K
32558     <=50K
32559     <=50K
32560      >50K
Name: income, Length: 32561, dtype: object

In [6]:
#labelencode the target
labelencoder = preprocessing.LabelEncoder()
target_binary = labelencoder.fit_transform(target)
print (target_binary)


[0 0 0 ... 0 0 1]
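Label encoding replaces each distinct class with an integer. The sketch below mimics that idea by sorting the distinct labels and numbering them in order, which is consistent with the output above where `<=50K` maps to 0. The example labels are made up, and the leading space mirrors how the labels appear later in this workbook.

```python
def fit_label_mapping(labels):
    """Assign an integer to each distinct label, in sorted order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}


# Made-up labels for illustration only
example_target = [" <=50K", " >50K", " <=50K"]
label_mapping = fit_label_mapping(example_target)
encoded_target = [label_mapping[label] for label in example_target]
print(label_mapping, encoded_target)
```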


In [7]:
data_df = data_df.drop("income", axis=1)

In [8]:
feature_names = data_df.columns
data_df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_number,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba


In [9]:
data_df.shape


(32561, 14)

In [10]:
data_binary_encoded = pd.get_dummies(data_df)
data_binary_encoded.head()

Unnamed: 0,age,fnlwgt,education_number,capital_gain,capital_loss,hours_per_week,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,...,native_country_ Portugal,native_country_ Puerto-Rico,native_country_ Scotland,native_country_ South,native_country_ Taiwan,native_country_ Thailand,native_country_ Trinadad&Tobago,native_country_ United-States,native_country_ Vietnam,native_country_ Yugoslavia
0,39,77516,13,2174,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,50,83311,13,0,0,13,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,38,215646,9,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,53,234721,7,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,28,338409,13,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
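The encoded table above comes from one-hot encoding: each categorical column is replaced by one 0/1 column per category, named `<column>_<category>`. The sketch below reproduces the idea for a single column on made-up rows; it is not how `pd.get_dummies` is implemented.

```python
def one_hot_encode(rows, column):
    """Replace `column` with one 0/1 column per category found in `rows`."""
    categories = sorted({row[column] for row in rows})
    encoded_rows = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded_rows.append(new_row)
    return encoded_rows


# Made-up rows for illustration only
example_rows = [
    {"age": 39, "workclass": " State-gov"},
    {"age": 38, "workclass": " Private"},
]
encoded_rows = one_hot_encode(example_rows, "workclass")
for encoded_row in encoded_rows:
    print(encoded_row)
```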


In [11]:
#import train_test_Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data_binary_encoded, target, random_state=42)
X_train.columns.to_list()


['age',
 'fnlwgt',
 'education_number',
 'capital_gain',
 'capital_loss',
 'hours_per_week',
 'workclass_ ?',
 'workclass_ Federal-gov',
 'workclass_ Local-gov',
 'workclass_ Never-worked',
 'workclass_ Private',
 'workclass_ Self-emp-inc',
 'workclass_ Self-emp-not-inc',
 'workclass_ State-gov',
 'workclass_ Without-pay',
 'education_ 10th',
 'education_ 11th',
 'education_ 12th',
 'education_ 1st-4th',
 'education_ 5th-6th',
 'education_ 7th-8th',
 'education_ 9th',
 'education_ Assoc-acdm',
 'education_ Assoc-voc',
 'education_ Bachelors',
 'education_ Doctorate',
 'education_ HS-grad',
 'education_ Masters',
 'education_ Preschool',
 'education_ Prof-school',
 'education_ Some-college',
 'marital_status_ Divorced',
 'marital_status_ Married-AF-spouse',
 'marital_status_ Married-civ-spouse',
 'marital_status_ Married-spouse-absent',
 'marital_status_ Never-married',
 'marital_status_ Separated',
 'marital_status_ Widowed',
 'occupation_ ?',
 'occupation_ Adm-clerical',
 'occupation_

In [12]:
#import the RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200)
rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.859231052696229

As we can see, the RandomForestClassifier reaches about 86% accuracy on the test data, which suggests the model predicts income reasonably well. In the next steps, we will rank the features by importance in order to investigate which feature was the greatest indicator of whether income is above or below 50,000 dollars.

In [13]:
featuresPriority = sorted(zip(rf.feature_importances_, feature_names), reverse=True)
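One thing to check here: `rf` was fit on the one-hot-encoded columns, so `rf.feature_importances_` holds one value per encoded column, while `feature_names` holds the column names from before encoding. Because `zip` stops at the shorter sequence, the two may not line up. A possible alternative, sketched below with made-up importances, is to pair each importance with its encoded column name and then sum the values back to the original feature. The helper name and the example numbers are hypothetical.

```python
def aggregate_importances(encoded_columns, importances, original_features):
    """Sum encoded-column importances back to the feature they came from."""
    totals = {name: 0.0 for name in original_features}
    # Check longer names first so "education_number" is not absorbed by "education"
    by_length = sorted(original_features, key=len, reverse=True)
    for column, importance in zip(encoded_columns, importances):
        for name in by_length:
            if column == name or column.startswith(name + "_"):
                totals[name] += importance
                break
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up importances for illustration only
example_columns = ["age", "education_number", "education_ Bachelors",
                   "education_ HS-grad", "workclass_ Private"]
example_importances = [0.30, 0.20, 0.10, 0.12, 0.28]
example_features = ["age", "workclass", "education", "education_number"]

ranked = aggregate_importances(example_columns, example_importances, example_features)
for feature, total in ranked:
    print(f"{feature}: {total:.2f}")
```

In the workbook, the first two arguments would be `X_train.columns` and `rf.feature_importances_`.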

In [14]:

featuresPriority_df= pd.DataFrame(featuresPriority)
featuresPriority_df

Unnamed: 0,0,1
0,0.157075,workclass
1,0.147755,age
2,0.092927,education
3,0.083265,marital_status
4,0.060562,fnlwgt
5,0.030449,education_number
6,0.010846,capital_gain
7,0.008669,hours_per_week
8,0.006346,capital_loss
9,0.005904,race


In [15]:
featuresPriority_df = featuresPriority_df.rename(columns={0:'Accuracy %' , 1:'Income Determinants'})
featuresPriority_df

Unnamed: 0,Accuracy %,Income Determinants
0,0.157075,workclass
1,0.147755,age
2,0.092927,education
3,0.083265,marital_status
4,0.060562,fnlwgt
5,0.030449,education_number
6,0.010846,capital_gain
7,0.008669,hours_per_week
8,0.006346,capital_loss
9,0.005904,race


In [16]:
# import plotly.express as px

# fig = px.bar(featuresPriority_df, x = 'Income Determinants', y ='Accuracy %',  color =["red", "goldenrod",'#0d0887', '#46039f', '#7201a8', '#9c179e', '#bd3786', '#d8576b', '#ed7953', '#fb9f3a', '#fdca26', '#f0f921','blue','red'], labels={'Income Determinants', 'Accuracy %'}, title = "Factors That Contribute To Income Greater Than Or Equal To 50,000 Dollars Per Year")

# fig.update_layout(barmode='stack', xaxis={'categoryorder':'total descending'},showlegend= False)
# fig.show()

In [17]:
featuresPriority

[(0.1570754474215206, 'workclass'),
 (0.14775494810575307, 'age'),
 (0.0929270562211249, 'education'),
 (0.08326468700137904, 'marital_status'),
 (0.06056208375814542, 'fnlwgt'),
 (0.030448740941204616, 'education_number'),
 (0.010845686390150315, 'capital_gain'),
 (0.008669322422642638, 'hours_per_week'),
 (0.0063455466086725565, 'capital_loss'),
 (0.005903614938510791, 'race'),
 (0.005675269830955827, 'relationship'),
 (0.0045551949625135, 'native_country'),
 (0.001964548467953317, 'occupation'),
 (1.1231797476043326e-06, 'sex')]

In [18]:
data_binary_encoded

Unnamed: 0,age,fnlwgt,education_number,capital_gain,capital_loss,hours_per_week,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,...,native_country_ Portugal,native_country_ Puerto-Rico,native_country_ Scotland,native_country_ South,native_country_ Taiwan,native_country_ Thailand,native_country_ Trinadad&Tobago,native_country_ United-States,native_country_ Vietnam,native_country_ Yugoslavia
0,39,77516,13,2174,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,50,83311,13,0,0,13,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,38,215646,9,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,53,234721,7,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,28,338409,13,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,257302,12,0,0,38,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
32557,40,154374,9,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
32558,58,151910,9,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
32559,22,201490,9,0,0,20,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [19]:
data_df = data_df.drop(["occupation","sex","native_country","relationship"], axis=1)

In [20]:
data_binary_encoded_2 = pd.get_dummies(data_df)
data_binary_encoded_2.head()

Unnamed: 0,age,fnlwgt,education_number,capital_gain,capital_loss,hours_per_week,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,...,marital_status_ Married-civ-spouse,marital_status_ Married-spouse-absent,marital_status_ Never-married,marital_status_ Separated,marital_status_ Widowed,race_ Amer-Indian-Eskimo,race_ Asian-Pac-Islander,race_ Black,race_ Other,race_ White
0,39,77516,13,2174,0,40,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
1,50,83311,13,0,0,13,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
2,38,215646,9,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,53,234721,7,0,0,40,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0
4,28,338409,13,0,0,40,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0


In [21]:
#import train_test_Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data_binary_encoded_2, target, random_state=42)
y_train[0]

' <=50K'

In [22]:
#import the RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200)
rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)


0.8463333742783442

In [25]:
# Load this to a new variable
random_forest_accuracy = rf.score(X_test, y_test)
random_forest_accuracy 

0.8463333742783442

## Neural Network Model

In [None]:
# Set the seed value for the notebook so the results are reproducible
from numpy.random import seed
seed(42)

In [None]:
# No synthetic data is generated here; the network is trained on the encoded dataset split in the next cell
from sklearn.datasets import make_classification  # imported but not used below

# X, y = data_binary_encoded                         

# y = y.reshape(-1, 1)

# print(X.shape)
# print(y.shape)

In [None]:
#import train_test_Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data_binary_encoded, target, random_state=42)
X_train.columns.to_list()

In [None]:
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical
# Step #1 label-encode the dataset
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train_encoded = label_encoder.transform(y_train)
y_test_encoded = label_encoder.transform(y_test)


In [None]:
# Step 2: Convert encoded labels to one-hot-encoding
y_train_categorical = to_categorical(y_train_encoded)
y_test_categorical = to_categorical(y_test_encoded)
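Step 2 turns each integer label into a vector with a single 1 in the position of its class. The sketch below shows the idea on made-up labels with a hypothetical helper; it is not the Keras implementation.

```python
def to_one_hot(encoded_labels, num_classes=None):
    """Turn integer labels into one-hot vectors."""
    if num_classes is None:
        num_classes = max(encoded_labels) + 1
    return [
        [1.0 if index == label else 0.0 for index in range(num_classes)]
        for label in encoded_labels
    ]


# Made-up integer labels for illustration only
example_one_hot = to_one_hot([0, 1, 0])
print(example_one_hot)
```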

In [None]:
from sklearn.preprocessing import StandardScaler

X_scaler = StandardScaler().fit(X_train)

In [None]:
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)


In [None]:
X_train_scaled.shape[1]

In [None]:
#data_binary_encoded
from tensorflow.keras.models import Sequential

model = Sequential()

In [None]:
from tensorflow.keras.layers import Dense
number_inputs = X_train_scaled.shape[1]
number_hidden_nodes = 4
model.add(Dense(units=number_hidden_nodes,
                activation='relu', input_dim=number_inputs))

In [None]:
y_train_categorical.shape[1]

In [None]:
X_train_scaled

In [None]:
number_classes =  y_train_categorical.shape[1]
model.add(Dense(units=number_classes, activation='softmax'))
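The `softmax` activation on the output layer converts the layer's raw scores into probabilities that sum to 1, one per class. A minimal sketch of that calculation, using made-up scores:

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(scores)
    # Subtracting the largest score avoids overflow without changing the result
    exps = [math.exp(score - largest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up output scores for the two income classes
example_probs = softmax([2.0, 0.0])
print(example_probs)
```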

In [None]:
# Use categorical crossentropy for categorical data and mean squared error for regression
# Note: the output layer in this example uses softmax for classification (categorical)
# If the output layer activation were `linear`, `mse` would be the usual choice for loss
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
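For a single row, categorical crossentropy is the negative log of the probability the model assigned to the true class, so confident correct predictions give a small loss and confident wrong ones a large loss. The sketch below illustrates this with made-up probabilities; the small `eps` guard against `log(0)` is a choice made for this example.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Loss for one row: y_true is one-hot, y_pred holds class probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


# Made-up predictions for a row whose true class is the first one
confident_right = categorical_crossentropy([1.0, 0.0], [0.9, 0.1])
confident_wrong = categorical_crossentropy([1.0, 0.0], [0.1, 0.9])
print(f"{confident_right:.3f} vs {confident_wrong:.3f}")
```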

In [None]:
# Fit (train) the model
model.fit(
    X_train_scaled,
    y_train_categorical,
    epochs=100,
    shuffle=True,
    verbose=2
)

In [None]:
# Evaluate the model using the testing data
model_loss, model_accuracy = model.evaluate( X_test_scaled, y_test_categorical, verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

In [None]:
# Neural Network Model variable for charts
# model.evaluate returns [loss, accuracy] for the metrics compiled above
neural_network_accuracy = model.evaluate(X_test_scaled, y_test_categorical)
print(neural_network_accuracy[1])

In [None]:
nnetwork_accuracy = neural_network_accuracy[1]
print(nnetwork_accuracy)

# End Sofanit Code

# ------------------------------

# Pass in output from models for chart

In [None]:
# # Logistic Regression variable for charts
# logistic_accuracy = classifier.score(X_test, y_test)
# print(logistic_accuracy) 

In [None]:
# # K Nearest Neighbors variable for charts
# knn_accuracy = knn.score(X_test_scaled, y_test)
# print(knn_accuracy)


In [None]:
# # Random Forest Model variable for charts 
# random_forest_accuracy = rf.score(X_test, y_test)
# print(f"Random Forest Accuracy: {random_forest_accuracy}")


In [None]:
# Print out all of the models' accuracy scores
print(f"Logistic Regression Accuracy: {logistic_accuracy}")
print(f"K Nearest Neighbor Accuracy: {knn_accuracy}")
print(f"Random Forest Accuracy: {random_forest_accuracy}")
print(f"Neural Network Accuracy: {nnetwork_accuracy}")


In [None]:
# line chart 
x_values = ["Logistic Regression", "K-Nearest Neighbors", "Random Forest", "Neural Network"]
y_values = [logistic_accuracy, knn_accuracy, random_forest_accuracy, nnetwork_accuracy]
plt.xticks(rotation=75)
plt.plot(x_values,y_values, marker="o")
plt.xlabel("Models")
plt.ylabel("Accuracy")
plt.title("Accuracy per Model Type")
plt.ylim(.75,.9)
# Save before calling show(); saving afterwards can write an empty figure
plt.savefig("model_comparison.png")
plt.show()


## Correlation Matrix

In [None]:
data_df.head()

In [None]:
# Focus in on inputs more likely to cause an effect
# Ask DOM about this

#strong_inputs_df = data_df[["age", "workclass", "education", "marital_status", "occupation", "hours_per_week"]]
#strong_inputs_df

In [None]:
# Create the correlation on the data frame
# numeric_only=True (pandas 1.5+) limits the calculation to numeric columns
corrMatrix = data_df.corr(numeric_only=True)
#corrMatrix = strong_inputs_df.corr()
corrMatrix

In [None]:
# Use Seaborn and Matplotlib to plot the correlation matrix
sn.heatmap(corrMatrix, annot=True)

plt.show()
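If the notebook is run top to bottom, `income` has already been dropped from `data_df` by this point, so the matrix above describes relationships between the inputs only. To relate a numeric input to the target, one option is to correlate it with a 0/1 encoding of income. The sketch below computes a Pearson correlation by hand on made-up values; the column and flag names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(variance_x * variance_y)


# Made-up values for illustration only: 1 means income above 50K
example_hours = [20, 40, 40, 60]
example_income_flag = [0, 0, 1, 1]

hours_vs_income = pearson(example_hours, example_income_flag)
print(f"Correlation between hours and income flag: {hours_vs_income:.3f}")
```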