# Lab 8: Implement Your Machine Learning Project Plan

In this lab assignment, you will implement the machine learning project plan you created in the written assignment. You will:

1. Load your data set and save it to a Pandas DataFrame.
2. Perform exploratory data analysis on your data to determine which feature engineering and data preparation techniques you will use.
3. Prepare your data for your model and create features and a label.
4. Fit your model to the training data and evaluate your model.
5. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.

### Import Packages

Before you get started, import a few packages.

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
# import seaborn as sns

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need for this task.

In [2]:
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

## Part 1: Load the Data Set


You have chosen to work with one of four data sets. The data sets are located in a folder named "data." The file names of the four data sets are as follows:

* The "adult" data set that contains Census information from 1994 is located in file `adultData.csv`
* The Airbnb NYC "listings" data set is located in file `airbnbListingsData.csv`
* The World Happiness Report (WHR) data set is located in file `WHR2018Chapter2OnlineData.csv`
* The book review data set is located in file `bookReviewsData.csv`



<b>Task:</b> In the code cell below, load your data with `pd.read_csv()`, as you have done throughout this course, and save it to DataFrame `df`.

In [3]:
filename = os.path.join(os.getcwd(), "data", 'adultData.csv')
df = pd.read_csv(filename, header=0)
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex_selfID,capital-gain,capital-loss,hours-per-week,native-country,income_binary
0,39.0,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Non-Female,2174,0,40.0,United-States,<=50K
1,50.0,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,0,0,13.0,United-States,<=50K
2,38.0,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Non-Female,0,0,40.0,United-States,<=50K
3,53.0,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Non-Female,0,0,40.0,United-States,<=50K
4,28.0,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40.0,Cuba,<=50K


## Part 2: Exploratory Data Analysis

The next step is to inspect and analyze your data set with your machine learning problem and project plan in mind. 

This step will help you determine which data preparation and feature engineering techniques to apply in order to build a balanced modeling data set for your problem and model. These data preparation techniques may include:
* addressing missingness, such as replacing missing values with means
* renaming features and labels
* finding and replacing outliers
* performing winsorization if needed
* performing one-hot encoding on categorical features
* performing vectorization for an NLP problem
* addressing class imbalance in your data sample to promote fair AI
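
A few of the techniques listed above can be sketched on a small made-up DataFrame. This is only an illustration under stated assumptions: the column names and values are hypothetical and do not come from any of the course data sets, and winsorization is approximated with `Series.clip` at the 5th and 95th percentiles, which is one possible choice of cutoffs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not one of the course data sets.
toy = pd.DataFrame({
    "hours": [40.0, 38.0, np.nan, 45.0, 99.0, 40.0],
    "sector": ["private", "public", "private", None, "public", "private"],
})

# Address missingness: replace missing numeric values with the column mean.
toy["hours"] = toy["hours"].fillna(toy["hours"].mean())

# Winsorize: clip values to the 5th and 95th percentiles (assumed cutoffs).
lower, upper = toy["hours"].quantile([0.05, 0.95])
toy["hours_wins"] = toy["hours"].clip(lower=lower, upper=upper)

# One-hot encode the categorical column; a missing category becomes an all-zero row.
toy_encoded = pd.get_dummies(toy, columns=["sector"])

print(toy_encoded)
```

Which of these steps you need depends on what your exploratory analysis shows; not every data set calls for all of them.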


Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.
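
As a reminder of what these inspection techniques look like in code, here is a minimal sketch on a hypothetical toy DataFrame. The column names `age`, `sector`, and `label` are placeholders chosen for illustration; substitute the columns of your own data set.

```python
import pandas as pd

# Hypothetical toy data standing in for a real data set.
toy = pd.DataFrame({
    "age": [25, 38, 52, 41, 29, 60],
    "sector": ["private", "public", "private", "private", "public", "private"],
    "label": [0, 0, 1, 0, 0, 1],
})

# Data type of each column.
print(toy.dtypes)

# Key statistics for the numeric columns.
print(toy.describe())

# Apply a data filter with a boolean mask.
over_40 = toy[toy["age"] > 40]
print(over_40)

# Class balance: proportion of examples in each class.
class_share = toy["label"].value_counts(normalize=True)
print(class_share)
```

If one class accounts for a much larger share than the others, accuracy alone can paint an overly optimistic picture of a classifier.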


<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. 

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [6]:
print("Shape of the DataFrame:", df.shape)
print("Column names:", df.columns)
df.info()  # info() prints its summary directly and returns None

print(df.describe())

print(df.isnull().sum())

print("Class distribution:")
print(df['income_binary'].value_counts())

# sns.countplot(x='income_binary', data=df)
# plt.title("Class Distribution")
# plt.show()

# correlation_matrix = df.corr()
# plt.figure(figsize=(12,8))
# sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
# plt.title("Correlation Heatmap")
# plt.show()

numerical_features =['age', 'fnlwgt', 'education-num','capital-gain','capital-loss', 'hours-per-week']
# for feature in numerical_features:
#     plt.figure(figsize=(8,5))
#     sns.boxplot(x='income_binary', y = feature, data=df)
#     plt.title(f"Boxplot of {feature} vs Target")
#     plt.show()
    
# sns.pairplot(df[numerical_features + ['income_binary']], hue='income_binary')
# plt.title("Pairplot of Numerical Features")
# plt.show()

Shape of the DataFrame: (32561, 15)
Column names: Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex_selfID',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income_binary'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             32399 non-null  float64
 1   workclass       30725 non-null  object 
 2   fnlwgt          32561 non-null  int64  
 3   education       32561 non-null  object 
 4   education-num   32561 non-null  int64  
 5   marital-status  32561 non-null  object 
 6   occupation      30718 non-null  object 
 7   relationship    32561 non-null  object 
 8   race            32561 non-null  object 
 9   sex_selfID      32561 non-null  object 
 10  capital-gain    32561 non-null

## Part 3: Implement Your Project Plan

<b>Task:</b> Use the rest of this notebook to carry out your project plan. You will:

1. Prepare your data for your model and create features and a label.
2. Fit your model to the training data and evaluate your model.
3. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.


Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit.
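
One way the preparation and fitting steps could be organised is with a scikit-learn `Pipeline` and `ColumnTransformer`, so that imputation, scaling, and one-hot encoding are learned from the training split only. The sketch below is an illustration, not the required approach: the toy data, the column names, and the rule used to generate the label are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names and the label rule are illustrative only.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 70, size=n).astype(float),
    "hours": rng.integers(10, 60, size=n).astype(float),
    "sector": rng.choice(["private", "public", "self"], size=n),
})
toy.loc[toy.index % 25 == 0, "age"] = np.nan  # inject some missing values
y = (toy["hours"] > 35).astype(int)

numeric_cols = ["age", "hours"]
categorical_cols = ["sector"]

# Numeric columns: mean imputation followed by standardization.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    toy, y, test_size=0.2, random_state=42
)
model.fit(X_train, y_train)  # preprocessing statistics come from X_train only
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.4f}")
```

Because the preprocessing lives inside the pipeline, the same object can be passed to cross-validation or grid search without re-fitting the scaler by hand.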

In [7]:
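# Drop rows with a missing label, then encode the label as integers.
df.dropna(subset=['income_binary'], inplace=True)

label_encoder = LabelEncoder()
df['income_binary'] = label_encoder.fit_transform(df['income_binary'])

categorical_features = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex_selfID', 'native-country']

# One-hot encode the categorical features (computed here, but not included in X below).
encoded_features = pd.get_dummies(df[categorical_features], drop_first=True)

#X = pd.concat([df[numerical_features], encoded_features], axis=1)
X = df[numerical_features]
y = df['income_binary']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then apply it to both sets.
# Assigning into the split DataFrames triggers pandas' SettingWithCopyWarning;
# calling .copy() on X_train and X_test first would avoid it.
scaler = StandardScaler()
X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features])
X_test[numerical_features] = scaler.transform(X_test[numerical_features])

# Fill missing values with training-set means so that test-set statistics are not used.
train_means = X_train.mean()
X_train = X_train.fillna(train_means)
X_test = X_test.fillna(train_means)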

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In [8]:
# X (features) and y (label) were created above; the label column is not part of X.
# Train several classifiers that predict y from X and compare their test accuracy.

classifiers = {
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'Support Vector Machine': SVC()
}

results = {}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    results[name] = accuracy

for name, accuracy in results.items():
    print(f"{name}: Accuracy = {accuracy:.4f}")



Random Forest: Accuracy = 0.8052
Gradient Boosting: Accuracy = 0.8449
Support Vector Machine: Accuracy = 0.8256
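
Accuracy on a single train/test split can be sensitive to how the split falls and to class imbalance. The notebook imports `cross_val_score` but does not use it; the sketch below shows how it could be used to compare candidate models with 5-fold cross-validation and the F1 score. It runs on synthetic data from `make_classification` (an assumption made so that the example is self-contained), so the printed numbers say nothing about the adult data set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced data used purely for illustration.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, weights=[0.75, 0.25], random_state=0
)

candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

cv_scores = {}
for name, clf in candidates.items():
    # Mean F1 score of the positive class across 5 folds.
    scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="f1")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean F1 = {scores.mean():.4f} (std {scores.std():.4f})")
```

Reporting the standard deviation alongside the mean gives a rough sense of whether a difference between two models is larger than the fold-to-fold variation.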


In [7]:
np.isinf(y_train).any()

False

In [11]:
#Improve your model by performing model selection and/or feature selection techniques to find best model for your problem.
param_grid = {
    'n_estimators':[50,100],
    'max_depth': [5,10],
    'min_samples_split': [2],
    'min_samples_leaf': [1]
}
# param_grid = {
#     'n_estimators':[100, 200],
#     'max_depth': [10, 20],
#     'min_samples_split': [2, 5],
#     'min_samples_leaf': [1,2]
# }
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

best_rf_model = grid_search.best_estimator_
# best_score_ is the mean cross-validated accuracy on the training data, not test accuracy.
best_rf_accuracy = grid_search.best_score_

print("Best Random Forest Model:")
print(best_rf_model)
print(f"Best Random Forest Accuracy: {best_rf_accuracy:.4f}")

Best Random Forest Model:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=10, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=50,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
Best Random Forest Accuracy: 0.8368
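
The score reported by `GridSearchCV.best_score_` is a mean cross-validation score computed on the training data, so it is not directly comparable to the test accuracies printed earlier. A minimal sketch of evaluating the tuned model on a held-out test set follows; it uses synthetic data and a reduced parameter grid (both assumptions made to keep the example small and self-contained).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data used purely for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [20, 50], "max_depth": [5, 10]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# Mean cross-validated accuracy of the best parameter combination.
cv_accuracy = search.best_score_
# Accuracy of the refit best estimator on data it has never seen.
test_accuracy = accuracy_score(y_te, search.best_estimator_.predict(X_te))

print("Best parameters:", search.best_params_)
print(f"Mean CV accuracy: {cv_accuracy:.4f}")
print(f"Held-out test accuracy: {test_accuracy:.4f}")
```

Comparing the tuned model with the untuned baselines on the same test set is what shows whether the search actually helped.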


In [11]:
X_train.fillna(X_train.mean())

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
5514,-0.408869,0.080051,1.133702,-0.252704,-0.217998,0.778213,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
19777,-0.189068,-0.981653,0.357049,-0.252704,4.457168,0.778213,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
10781,1.422808,0.126197,-1.972910,-0.252704,-0.217998,-0.032529,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
32240,-1.288074,-0.090935,0.357049,-0.252704,-0.217998,0.453916,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
9876,-0.848471,0.856334,-0.031277,-0.252704,-0.217998,-0.032529,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29802,0.616870,1.612662,1.133702,-0.252704,-0.217998,-0.032529,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
5390,-0.555403,-0.404294,-0.807930,-0.252704,-0.217998,-1.572939,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
860,-1.507875,0.252063,-1.196257,-0.252704,-0.217998,-1.654013,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
15795,0.836671,-1.287628,-0.419604,-0.252704,-0.217998,3.534737,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
