# Machine Learning Models

The goal of this workbook is to compare different machine learning models on our dataset. We want to see which models perform best and which can most accurately predict new data.

In this workbook we will load our cleaned CSV from our ETL process, then compare the following models:

- Logistic Regression
- K Nearest Neighbors
- Random Forest
- Neural Network
    
After analyzing each of these models to see which is the best predictor, we will use a <b>Correlation Matrix</b> to see which of the input factors in our data have the most predictive power. This is an additional path our group wanted to explore to gain more insight into our data.
    
` While there are plenty of other machine learning models we could have explored, these are the handful that our team was most interested in experimenting with. In addition, we selected these models because we believe they have a high likelihood of being effective given the nature of our dataset and predictive goals. `

In [None]:
# Import General Dependencies for this Workbook
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import os

In [None]:
# Set file path to the cleaned data csv from data cleanup process
file = "data"

# Read to a df
data_df = pd.read_csv(file)
data_df.head()

# ------------------------------

# Start Emerson Code

## Logistic Regression

Logistic Regression is a statistical method for predicting binary outcomes from data.
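
As a rough illustration of the idea, the sketch below shows the logistic (sigmoid) function that turns a weighted sum of features into a probability, which is then thresholded into a binary class. The weights, bias, and feature values are made up for illustration and are not taken from our dataset or from a fitted model.

```python
import numpy as np

def sigmoid(z):
    """Map any real value to the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias, for illustration only
weights = np.array([0.8, -0.5])
bias = 0.1

# Two hypothetical examples with two features each
examples = np.array([[2.0, 1.0], [-1.0, 3.0]])

# Linear score, then probability of the positive class
scores = examples @ weights + bias
probabilities = sigmoid(scores)

# Predict class 1 when the probability is at least 0.5
predicted_classes = (probabilities >= 0.5).astype(int)
print(probabilities, predicted_classes)
```

In scikit-learn, `LogisticRegression` learns the weights and bias from the training data, and `predict` applies the same kind of threshold.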

In [None]:
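# Split the data into a training and testing split
from sklearn.model_selection import train_test_split

# Assumes "income" is the target column and one-hot encodes the remaining features
X = pd.get_dummies(data_df.drop("income", axis=1))
y = data_df["income"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)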

In [None]:
# Generate the Logistic Regression Model
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier

In [None]:
# Fit the model
classifier.fit(X_train, y_train)

In [None]:
# Score the model on the training and testing data
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")

Predict with new data

In [None]:
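# Predict the class of new examples
# `new_data` must have the same columns as X_train; the first test row is used here as a stand-in
new_data = X_test[:1]
predictions = classifier.predict(new_data)
print(f"Possible classes: {classifier.classes_}")
print(f"The new point was classified as: {predictions}")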

In [None]:
# Predict into a df
predictions = classifier.predict(X_test)
pd.DataFrame({"Prediction": predictions, "Actual": y_test})

## K Nearest Neighbors (KNN) Model

# End Emerson Code

# ------------------------------

# Start Sofanit Code

## Random Forest Model

In [None]:
# Import dependencies for the Random Forest model
from sklearn.model_selection import train_test_split
import pandas as pd
import os

In [None]:
# Set file path to the cleaned data csv from data cleanup process
file = "data"

# Read to a df
data_df = pd.read_csv(file)
data_df.head()

In [None]:
target = data_df["income"]
target_names= [">=50","<=50"]

In [None]:
data_df = data_df.drop("income", axis=1)

In [None]:
feature_names = data_df.columns
data_df.head()

In [None]:
data_binary_encoded = pd.get_dummies(data_df)
data_binary_encoded.head()

In [None]:
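# Split the encoded features and target into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data_binary_encoded, target, random_state=42)
# Use positional indexing; label 0 may not be in the shuffled training split
y_train.iloc[0]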


In [None]:
# Import, train, and score the RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200)
rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)


The RandomForestClassifier reaches about 86% accuracy on the test set, which suggests the model fits this dataset well. In the next steps, we rank the features by importance to investigate which feature is the strongest indicator of whether income is above or below 50,000 dollars.
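
Because the model is trained on one-hot encoded columns, each categorical feature is spread across several columns. One possible way to compare the original features is to sum the importances of the encoded columns that came from the same source column. The sketch below illustrates this on a small made-up dataset; the column names, values, and the `__` separator are assumptions for illustration, not part of our data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Small made-up dataset; column names and values are hypothetical
toy_df = pd.DataFrame({
    "age": [25, 47, 38, 52, 29, 61, 33, 45],
    "workclass": ["private", "gov", "private", "self", "gov", "private", "self", "gov"],
})
toy_target = [0, 1, 0, 1, 0, 1, 0, 1]

# One-hot encode with a separator that does not appear in the column names
encoded = pd.get_dummies(toy_df, prefix_sep="__")

toy_rf = RandomForestClassifier(n_estimators=50, random_state=42)
toy_rf.fit(encoded, toy_target)

# Map each encoded column back to its source column and sum the importances
importances = pd.Series(toy_rf.feature_importances_, index=encoded.columns)
source_columns = [name.split("__")[0] for name in encoded.columns]
grouped = importances.groupby(source_columns).sum().sort_values(ascending=False)
print(grouped)
```

The grouped values still sum to 1, so they can be read as each original feature's share of the total importance.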

In [None]:
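# Pair each importance with the encoded column the model was trained on
# (the model saw the one-hot encoded columns, not the original ones)
featuresPriority = sorted(zip(rf.feature_importances_, X_train.columns), reverse=True)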

In [None]:

featuresPriority

In [None]:
# Separate the sorted (importance, feature) pairs into lists for plotting
values = [importance for importance, _ in featuresPriority]
labels = [name for _, name in featuresPriority]


In [None]:
import matplotlib.pyplot as plt

# Bar chart of feature importances, highest first
fig, ax = plt.subplots(figsize=(12, 6))
ax.bar(labels, values)
plt.xticks(rotation=45)
# feature_importances_ are relative importance scores, not accuracy percentages
plt.ylabel("Feature Importance", fontsize=12)
plt.xlabel("Features that Contribute to Income Prediction", fontsize=12)
plt.title("Features that Predict Income Greater Than or Less Than 50K")

plt.show()

In [None]:
data_binary_encoded

In [None]:
data_df = data_df.drop(["occupation","sex","native_country","relationship"], axis=1)

In [None]:
data_binary_encoded_2 = pd.get_dummies(data_df)
data_binary_encoded_2.head()

In [None]:
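# Split the reduced feature set and target into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data_binary_encoded_2, target, random_state=42)
# Use positional indexing; label 0 may not be in the shuffled training split
y_train.iloc[0]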

In [None]:
# Retrain and score the RandomForestClassifier on the reduced feature set
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200)
rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)


## Neural Network Model

In [None]:
# Set the seed value for the notebook so the results are reproducible
from numpy.random import seed
seed(42)

In [None]:
# Split the encoded features and target into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data_binary_encoded, target, random_state=42)
y_train


In [None]:
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical
# Step 1: Label-encode the target values
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train_encoded = label_encoder.transform(y_train)
y_test_encoded = label_encoder.transform(y_test)


In [None]:
# Step 2: Convert encoded labels to one-hot-encoding
y_train_categorical = to_categorical(y_train_encoded)
y_test_categorical = to_categorical(y_test_encoded)

In [None]:
from sklearn.preprocessing import StandardScaler

X_scaler = StandardScaler().fit(X_train)

In [None]:
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)


In [None]:
X_train_scaled.shape[1]

In [None]:
from tensorflow.keras.models import Sequential

model = Sequential()

In [None]:
from tensorflow.keras.layers import Dense
number_inputs = X_train_scaled.shape[1]
number_hidden_nodes = 4
model.add(Dense(units=number_hidden_nodes,
                activation='relu', input_dim=number_inputs))

In [None]:
y_train_categorical.shape[1]

In [None]:
number_classes =  y_train_categorical.shape[1]
model.add(Dense(units=number_classes, activation='softmax'))

In [None]:
model.summary()

In [None]:
# Use categorical crossentropy for categorical data and mean squared error for regression
# The output layer here uses softmax for classification, so categorical crossentropy applies
# If the output layer activation were `linear`, `mse` would be the usual choice of loss
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

In [None]:
# Fit (train) the model
model.fit(
    X_train_scaled,
    y_train_categorical,
    epochs=100,
    shuffle=True,
    verbose=2
)

In [None]:
# Evaluate the model using the testing data
model_loss, model_accuracy = model.evaluate(
    X_test_scaled, y_test_categorical, verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

# End Sofanit Code

# ------------------------------

## Correlation Matrix
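
A minimal sketch of how a correlation matrix can be used to rank input columns against a target is shown below. The data is randomly generated and the column names (`hours_per_week`, `noise`, `income_flag`) are hypothetical; with our data, the same calls would be applied to the numeric or encoded columns of the cleaned DataFrame.

```python
import numpy as np
import pandas as pd

# Made-up numeric data; column names and values are hypothetical
rng = np.random.default_rng(42)
hours = rng.uniform(20, 60, size=200)
toy_df = pd.DataFrame({
    "hours_per_week": hours,
    "noise": rng.normal(size=200),
    # Target loosely driven by hours_per_week so a correlation is visible
    "income_flag": (hours + rng.normal(scale=5, size=200) > 40).astype(int),
})

# Pairwise Pearson correlations between all numeric columns
corr_matrix = toy_df.corr()

# Rank the input columns by absolute correlation with the target
target_corr = corr_matrix["income_flag"].drop("income_flag")
print(target_corr.abs().sort_values(ascending=False))
```

Pearson correlation only captures linear relationships, so a low value does not by itself rule out a feature that a non-linear model could still use.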