**Part 1: Data Exploration**

In [26]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split # splits data into training and test sets
from sklearn.preprocessing import StandardScaler # standardizes features to zero mean and unit variance
from sklearn.linear_model import Perceptron # perceptron classifier
from sklearn.linear_model import LogisticRegression # logistic regression classifier
from sklearn.tree import DecisionTreeClassifier # decision tree classifier
from sklearn.svm import SVC # support vector classifier
from sklearn.ensemble import RandomForestClassifier # random forest classifier
from sklearn.neighbors import KNeighborsClassifier # k-nearest neighbors classifier
from sklearn.metrics import accuracy_score, classification_report # evaluation metrics
from sklearn.preprocessing import OrdinalEncoder # encodes categories as integers


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/adult-income-classification/train.csv
/kaggle/input/adult-income-classification/test.csv


In [22]:
train = pd.read_csv('/kaggle/input/adult-income-classification/train.csv') #import the training dataset as "train"

**Part 2: Preprocessing and Feature Engineering**

We clean the dataset by first replacing each "?" in "train" with a missing value (pd.NA) and storing the result as "train2". We then drop every row that contains a missing value, and finally use the "drop" function with axis=1 to remove the "id" column.

In [23]:
train2 = train.replace('?', pd.NA, inplace=False)
train2.dropna(inplace=True)
train2 = train2.drop('id', axis=1)
train2.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,78,Private,111189,7th-8th,4,Never-married,Machine-op-inspct,Not-in-family,White,Female,0,0,35,Dominican-Republic,0
1,49,Self-emp-inc,122066,Some-college,10,Divorced,Sales,Not-in-family,White,Male,0,0,25,United-States,0
2,62,Self-emp-not-inc,168682,7th-8th,4,Married-civ-spouse,Sales,Husband,White,Male,0,0,5,United-States,0
3,18,Private,110230,10th,6,Never-married,Other-service,Own-child,White,Male,0,0,11,United-States,0
5,22,Private,218215,Some-college,10,Never-married,Adm-clerical,Own-child,White,Female,0,0,35,United-States,0
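
Before dropping rows, it can help to check how many "?" placeholders each column holds and how many rows the cleaning step removes. The sketch below uses a small made-up frame ("toy" and its values are illustrative assumptions, not the competition data) and applies the same replace-then-drop steps.

```python
import pandas as pd

# Toy frame standing in for the training data (values are made up)
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["Private", "?", "State-gov", "Private"],
    "occupation": ["Sales", "?", "Adm-clerical", "?"],
})

# Count the "?" placeholders in each column before replacing them
placeholder_counts = (toy == "?").sum()

# Same cleaning steps as above: replace "?" with a missing value, then drop rows
cleaned = toy.replace("?", pd.NA).dropna()
rows_dropped = len(toy) - len(cleaned)

print(placeholder_counts)
print("rows dropped:", rows_dropped)
```

If a large share of rows is dropped, imputing the missing values may be worth considering instead.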


Next we separate the independent variables, stored as "X", from the target variable "y", which is the output the models will predict.

In [24]:
X = train2.drop('income',axis=1) #independent variables
y = train2['income'] #output

We need to convert the categorical variables into numerical representations. Ordinal encoding assigns an integer to each category.

In [30]:
X_enc = OrdinalEncoder().fit_transform(X) #assigns integers to categories
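
To see what the encoder does, here is a minimal sketch on a made-up single-column frame ("toy" is an illustrative assumption). OrdinalEncoder sorts the categories it finds and numbers them from 0, so the integers reflect alphabetical order rather than any meaningful ranking. Note also that every column passed to the encoder is treated as categorical, including numeric ones.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column standing in for one categorical feature (values are made up)
toy = pd.DataFrame({"workclass": ["Private", "State-gov", "Private", "Self-emp-inc"]})

enc = OrdinalEncoder()
codes = enc.fit_transform(toy)

# Categories are stored in sorted order; each value is replaced by its position
print(enc.categories_[0])
print(codes.ravel())
```

Keeping the fitted encoder in a variable, as "enc" does here, makes it possible to apply the same mapping to other data later with enc.transform.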

Now we split the dataset into 75% training and 25% testing, stratifying on "y" so that both parts keep the same class proportions.

In [31]:
X_train, X_test, y_train, y_test = train_test_split(X_enc, y, test_size=0.25, random_state=1, stratify=y) #splitting the dataset
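
The effect of stratify can be checked on synthetic data. In this sketch, "X_toy" and "y_toy" are made-up arrays with an 80/20 class balance; after a stratified split, the share of the positive class should be close to 20% in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 80 of class 0 and 20 of class 1 (made up)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1, stratify=y_toy
)

# Compare the split sizes and the share of class 1 in each part
print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```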

We need to ensure that the training and test datasets are scaled in the same way so that the machine learning algorithms perform effectively. The scaler is therefore fit on "X_train" only and then applied to both sets.


In [50]:
sc = StandardScaler().fit(X_train) #StandardScaler computes the mean and standard deviation for each feature in "X_train"
Xtrain_stnd=sc.transform(X_train) #transforms (scales) the training data
Xtest_stnd=sc.transform(X_test) #transforms (scales) the test data X_test 
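
A small numeric sketch of the same fit-on-train, transform-both pattern, using made-up one-column arrays ("X_tr_toy" and "X_te_toy" are illustrative assumptions). The test values are shifted and scaled with the training mean and standard deviation, so they need not end up with zero mean themselves.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data
X_tr_toy = np.array([[1.0], [2.0], [3.0]])
X_te_toy = np.array([[2.0], [4.0]])

# Fit on the training data only, then apply the same statistics to both sets
sc_toy = StandardScaler().fit(X_tr_toy)
X_tr_std = sc_toy.transform(X_tr_toy)
X_te_std = sc_toy.transform(X_te_toy)

print("training mean:", sc_toy.mean_[0], "training std:", sc_toy.scale_[0])
print("scaled train:", X_tr_std.ravel())
print("scaled test:", X_te_std.ravel())
```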

**Part 3: Model Training and Evaluation**

Here we build several models (perceptron, logistic regression, support vector machine (SVM), decision tree, k-nearest neighbors (KNN), and random forest) and evaluate the accuracy of each model's predictions. We use the standardized training and test sets "Xtrain_stnd" and "Xtest_stnd" that we just created.

In [56]:
#Perceptron model

precept = Perceptron(eta0=0.1, random_state=1).fit(Xtrain_stnd, y_train) #building perceptron model
ypre_pred = precept.predict(Xtest_stnd) #testing the model

accuracy = accuracy_score(y_test, ypre_pred) #calculating accuracy 
print("Accuracy:", accuracy)

Accuracy: 0.7963085764809903


In [54]:
#Logistic regression model

lr = LogisticRegression(C=100.0, solver='saga', max_iter=4000, multi_class='ovr').fit(Xtrain_stnd, y_train) #building logistic regression model
ylr_pred = lr.predict(Xtest_stnd) #testing the model

accuracy = accuracy_score(y_test, ylr_pred) #calculating accuracy 
print("Accuracy:", accuracy)

Accuracy: 0.8186339522546419


In [55]:
#Support Vector Machines (SVM) model

svm = SVC(kernel='linear', C=0.05, random_state=1).fit(Xtrain_stnd, y_train) #building Support Vector Machines (SVM) model
ysvm_pred = svm.predict(Xtest_stnd) #testing the model

accuracy = accuracy_score(y_test, ysvm_pred) #calculating accuracy 
print("Accuracy:", accuracy)

Accuracy: 0.8023872679045093


In [58]:
#Decision Tree model

dectree = DecisionTreeClassifier(criterion='gini', max_depth=5, random_state=1).fit(Xtrain_stnd, y_train) #building Decision Tree model
yddectree_pred = dectree.predict(Xtest_stnd) #testing the model

accuracy = accuracy_score(y_test, yddectree_pred) #calculating accuracy 
print("Accuracy:", accuracy)

Accuracy: 0.8473695844385499


In [59]:
#K-nearest neighbors (KNN)

knn = KNeighborsClassifier(n_neighbors=5, p=1, metric='euclidean').fit(Xtrain_stnd, y_train) #building K-nearest neighbors (KNN) model; p only applies to the Minkowski metric, so it has no effect with metric='euclidean'
yknn_pred = knn.predict(Xtest_stnd) #testing the model

accuracy = accuracy_score(y_test, yknn_pred) #calculating accuracy 
print("Accuracy:", accuracy)

Accuracy: 0.8271441202475686


In [60]:
#Random Forest model

rfc = RandomForestClassifier(n_estimators=100, random_state=1, n_jobs=2).fit(Xtrain_stnd, y_train)  #building Random Forest model
yrfc_pred = rfc.predict(Xtest_stnd) #testing the model

accuracy = accuracy_score(y_test, yrfc_pred) #calculating accuracy 
print("Accuracy:", accuracy)

Accuracy: 0.8546640141467727
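
The fit, predict, and score steps repeated in the cells above can also be written as one loop, which makes the comparison easier to read. The sketch below is self-contained and runs on synthetic data from make_classification rather than the income dataset. The names "models" and "results" are illustrative, and the hyperparameters are examples only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (not the income dataset)
X_syn, y_syn = make_classification(n_samples=400, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=1, stratify=y_syn
)

# Scale with statistics from the training part only
sc_syn = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = sc_syn.transform(X_tr), sc_syn.transform(X_te)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=1),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=1),
}

# Fit each model and record its test accuracy
results = {}
for name, model in models.items():
    model.fit(X_tr_s, y_tr)
    results[name] = accuracy_score(y_te, model.predict(X_te_s))

# Print the models from highest to lowest accuracy
for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```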


**Part 4: Model Deployment and Submission**

We now use the Random Forest model as our final model because it produced the highest accuracy (0.8546640141467727) of all the models tested. We also need to produce a submission CSV file with the predictions for the "test" dataset.

In [108]:
test4 = pd.read_csv('/kaggle/input/adult-income-classification/test.csv') #import test dataset


test6 = test4.drop('id', axis=1) # drop the "id" column

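X_TestEnc = OrdinalEncoder().fit_transform(test6) #assigns integers to categories (note: this is a new encoder fit on the test set, so its category codes may not match those learned from the training data)
sc3 = StandardScaler().fit(X_TestEnc)  #StandardScaler computes the mean and standard deviation for each feature in "X_TestEnc" (note: these are test-set statistics, not those of "X_train")
sc2 = sc3.transform(X_TestEnc) #transforms (scales) the X_TestEnc data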



yrfc2_pred = rfc.predict(sc2) #predicting labels for the test dataset



submission_df = pd.DataFrame({ #creating a DataFrame called submission_df with two columns, "id" and "label"
    "id": test4['id'],  
    "label": yrfc2_pred        #"label" contains the predictions in "yrfc2_pred" made by the Random Forest model
}) 

submission_df.to_csv("submission.csv", index=False) #saves the contents of "submission_df" to a CSV file named "submission.csv" without including the index column
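
The submission cell above fits a new encoder and scaler on the test file. One possible alternative, sketched here under stated assumptions, is to wrap the encoder, scaler, and model in a single scikit-learn pipeline so that the transformations learned from the training data are reused unchanged at prediction time. The frames "train_toy" and "test_toy" are made up, and the choice of unknown_value=-1 for categories never seen during training is an assumption, not part of the original notebook.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Made-up stand-ins for the cleaned training data and the test file
train_toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private", "Self-emp-inc", "State-gov", "Private"],
    "education": ["10th", "Bachelors", "Bachelors", "10th", "Masters", "Masters"],
    "income": [0, 1, 1, 0, 1, 0],
})
test_toy = pd.DataFrame({
    "id": [101, 102],
    "workclass": ["Private", "Federal-gov"],  # "Federal-gov" never appears in train_toy
    "education": ["Masters", "10th"],
})

# Encoder, scaler, and model are all fit on the training data in one step
pipe = make_pipeline(
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    StandardScaler(),
    RandomForestClassifier(n_estimators=10, random_state=1),
)
pipe.fit(train_toy.drop("income", axis=1), train_toy["income"])

# predict() reuses the fitted encoder and scaler; nothing is refit on the test data
preds = pipe.predict(test_toy.drop("id", axis=1))
submission_toy = pd.DataFrame({"id": test_toy["id"], "label": preds})
print(submission_toy)
```

With this layout, the category-to-integer mapping and the scaling statistics used for the submission are the same ones the model saw during training.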