# **Build a classifier to predict whether income of a person exceeds $50K/yr based on census data. Also known as "Census Income" dataset.**

Use neural networks, SVM, Random Forest, and Logistic Regression, and compare their performance.

**Import the required packages and libraries**


1. pandas is a Python library for data analysis.
2. sklearn (scikit-learn) is a machine learning library for Python. It provides classes and functions for algorithms such as support vector machines, logistic regression, and random forests.

   a. LabelEncoder converts the values in a given column into numeric form.

   b. StandardScaler transforms each feature so that it has a mean of 0 and a standard deviation of 1.

   c. The sklearn.metrics module implements functions for assessing prediction error, such as accuracy scores, classification reports, and confusion matrices.

   d. train_test_split is a function in sklearn.model_selection that splits data arrays into two subsets: training data and testing data.

3. Keras is a Python library for neural networks.

4. joblib provides utilities for efficiently saving and loading Python objects that make use of NumPy data structures.
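
The preprocessing helpers listed above can be tried on a tiny example before applying them to the census data. The following is a minimal sketch; the column values, ages, and labels are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy stand-in for one categorical column (hypothetical values).
workclass = np.array(["Private", "State-gov", "Private", "Self-emp", "State-gov", "Private"])

# LabelEncoder assigns an integer to each distinct value, following the sorted class order.
encoder = LabelEncoder()
encoded = encoder.fit_transform(workclass)
print(encoded)  # → [0 2 0 1 2 0]

# Toy numeric feature and binary labels (hypothetical values).
ages = np.array([[25.0], [38.0], [52.0], [41.0], [29.0], [60.0]])
labels = np.array([0, 0, 1, 1, 0, 1])

# Hold out part of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    ages, labels, test_size=0.33, random_state=0
)

# Fit the scaler on the training split only, then reuse its statistics on the test split.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.mean(), X_train_scaled.std())
```

Fitting the scaler on the training split alone keeps information about the test split out of the preprocessing step.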





In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
import sklearn.metrics
import keras
import joblib
import matplotlib.pyplot as plt

**Dataset Preparation**

1. Read the dataset file.
2. Preprocess the data by converting alphanumeric values in certain columns to numerical/ordinal values.
3. Split the dataset into training and test sets.
4. Perform feature scaling.

In [None]:
def readData():
  features = pd.read_csv('/kaggle/input/adult-dataset/adult.csv',header=None) #reading the dataset from the local Kaggle input path
  features = features.rename(columns={14 : 'class'}) #renaming the result column
  print("The dataset-")
  print(features)
  n=features.shape[1]  #number of columns in the dataframe
  colnames_numerics_only = features.select_dtypes(include=np.number).columns.tolist() #forming a list of columns containing numerical values
  # Label encoding converts each categorical (string) value into an integer code,
  # which gives the models below the numeric input they require.
  label_encoder = LabelEncoder() 
  #this loop uses label_encoder to convert all non-numerical columns to numerical 
  for col in features.columns:
    if col not in colnames_numerics_only:
      features[col] = label_encoder.fit_transform(features[col]) 
  labels = features.pop('class')
  features.fillna(features.mean(),inplace=True)
  X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.20,random_state=5)
  scaler = StandardScaler()
  scaler.fit(X_train)
  X_train = scaler.transform(X_train)
  X_test = scaler.transform(X_test)
  return X_train,y_train, X_test, y_test

In [None]:
X_train,y_train, X_test, y_test=readData()

# Building a neural network classifier

Neural networks generally do not need to be programmed with explicit rules describing what to expect from the input. Instead, the network learns from many labeled examples supplied during training, using the labels as an answer key to discover which characteristics of the input determine the correct output. Once it has processed enough examples, the network can be applied to new, unseen inputs.

Here we build a neural network classifier with 2 hidden layers of 24 neurons each.
1. Use the tanh activation function for the hidden layers.
2. Use the sigmoid activation function for the output layer.
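
To make the layer stack concrete, here is a minimal NumPy sketch of a single forward pass through two tanh hidden layers of 24 units and one sigmoid output unit. The weights are random rather than trained, and the input width of 14 features is an assumption for illustration; the helper names `sigmoid` and `dense` are hypothetical and not part of Keras.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    # Squashes any real number into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def dense(x, weights, bias, activation):
    # One fully connected layer: affine transform followed by an activation.
    return activation(x @ weights + bias)

n_features = 14  # assumed input width, for illustration only
batch = rng.normal(size=(5, n_features))  # 5 fake, already-scaled examples

# Randomly initialised (untrained) parameters for each layer.
W1, b1 = rng.normal(scale=0.1, size=(n_features, 24)), np.zeros(24)
W2, b2 = rng.normal(scale=0.1, size=(24, 24)), np.zeros(24)
W3, b3 = rng.normal(scale=0.1, size=(24, 1)), np.zeros(1)

h1 = dense(batch, W1, b1, np.tanh)     # hidden layer 1, values in [-1, 1]
h2 = dense(h1, W2, b2, np.tanh)        # hidden layer 2, values in [-1, 1]
probs = dense(h2, W3, b3, sigmoid)     # output layer, one probability per example

# Apply the 0.5 threshold to turn probabilities into class labels.
predicted = (probs >= 0.5).astype(int)
print(probs.shape, predicted.ravel())
```

Training adjusts the weight matrices so that these output probabilities move toward the true labels; the sketch only shows how data flows through the layers.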

In [None]:
# Train and evaluate
def train_and_evaluate(X_train, Y_train, X_test, Y_test):
    global accuracyNN
    m=X_train.shape[0]  #number of training examples 
    n=X_train.shape[1]  #number of features
    inputs = keras.layers.Input(shape=(n,), dtype='float32', name='input_layer') # Input layer (one value per feature)
    outputs = keras.layers.Dense(24, activation='tanh', name='hidden_layer1')(inputs) # Hidden layer
    outputs = keras.layers.Dense(24, activation='tanh', name='hidden_layer2')(outputs) # Hidden layer
    outputs = keras.layers.Dense(1, activation='sigmoid', name='output_layer')(outputs) # Output layer 
    # Create a model from input layer and output layers
    model = keras.models.Model(inputs=inputs, outputs=outputs, name='neural_network')
    # Compile the model (binary_crossentropy if 2 classes)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # Train the model on the train set (output debug information)
    model.fit(X_train, Y_train, epochs=100, verbose=1)#batch_size=1
    # Save the model (Make sure that the folder exists)
    model.save("my_nn_model.h5")
    # Evaluate on training data
    print('\n-- Training data --')
    predictions = model.predict(X_train)
    #now make the predicted output of those training instances as 1 which have value higher than 0.5(the chosen threshold)
    predictions[predictions>=0.5]=1
    predictions[predictions<0.5]=0
    accuracy = sklearn.metrics.accuracy_score(Y_train, predictions)
    print('Accuracy: {0:.2f}'.format(accuracy * 100.0))
    print('Classification Report:')
    print(sklearn.metrics.classification_report(Y_train, predictions))
    print('Confusion Matrix:')
    print(sklearn.metrics.confusion_matrix(Y_train, predictions))
    print('')
    # Evaluate on test data
    print('\n---- Test data ----')
    predictions = model.predict(X_test)
    predictions[predictions>=0.5]=1
    predictions[predictions<0.5]=0
    predictions=np.asarray(predictions).astype('int32')
    Y_test=np.asarray(Y_test).astype('int32')
    accuracyNN = sklearn.metrics.accuracy_score(Y_test, predictions)
    print('Accuracy: {0:.2f}'.format(accuracyNN * 100.0))
    print('Classification Report:')
    print(sklearn.metrics.classification_report(Y_test,predictions))
    print('Confusion Matrix:')
    print(sklearn.metrics.confusion_matrix(Y_test, predictions))

# The main entry point for this module
def main():
    X_train,Y_train,X_test,Y_test=readData()
    train_and_evaluate(X_train, Y_train, X_test, Y_test)
# Tell python to run main method
if __name__ == "__main__": 
  main()

# Building an SVM classifier

Support-vector machines (SVMs, also support-vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis.  An SVM maps training examples to points in space so as to maximise the width of the gap between the two categories. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall. In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.
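
The effect of the kernel trick can be seen on data that no straight line can separate. The sketch below uses scikit-learn's synthetic `make_circles` data (not the census data) and compares a linear kernel with the RBF kernel; the sample size, noise level, and split are arbitrary choices for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: the classes are not linearly separable in 2D.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A linear kernel can only draw a straight boundary.
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)

# The RBF kernel implicitly works in a higher-dimensional space,
# which allows a curved boundary in the original 2D space.
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print("linear kernel accuracy:", linear_acc)
print("rbf kernel accuracy:   ", rbf_acc)
```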

In [None]:
def train_and_evaluate_SVM(X_train, y_train, X_test, y_test):
  global accuracySVM
  svclassifier = SVC(kernel='rbf')  #using SVM with radial basis function kernel
  svclassifier.fit(X_train, y_train)  #training
  filename = 'svm_model.sav'
  joblib.dump(svclassifier, filename)  #saving the svm model in a file
  y_pred = svclassifier.predict(X_test)  #predicting the output on the test set
  accuracySVM = sklearn.metrics.accuracy_score(y_test, y_pred)
  print("Accuracy on test data: ",accuracySVM)
  print(sklearn.metrics.confusion_matrix(y_test,y_pred))
  print(sklearn.metrics.classification_report(y_test,y_pred))

def main():
    X_train,Y_train,X_test,Y_test=readData()
    train_and_evaluate_SVM(X_train, Y_train, X_test, Y_test)
# Tell python to run main method
if __name__ == "__main__": 
  main()

# Building a Logistic Regression classifier

Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. Generally, logistic regression means binary logistic regression having binary target variables.
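
As a small illustration of the logistic function at work, the sketch below turns a linear score into a probability and then into a class label. The weights, bias, and inputs are hypothetical numbers chosen for the example, not fitted values.

```python
import numpy as np

def logistic(z):
    # Maps a real-valued score to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical model parameters for two features.
weights = np.array([0.8, -0.5])
bias = 0.1

# Three hypothetical examples, one per row.
x = np.array([[1.0, 2.0], [3.0, 0.5], [-2.0, 1.0]])

z = x @ weights + bias           # linear scores: -0.1, 2.25, -2.0
probabilities = logistic(z)      # probability of the positive class
predicted_class = (probabilities >= 0.5).astype(int)

print(probabilities)
print(predicted_class)  # → [0 1 0]
```

A score of zero maps to a probability of exactly 0.5, so the 0.5 threshold corresponds to the sign of the linear score.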

In [None]:
def train_and_evaluate_Log(X_train, y_train, X_test, y_test):
  global accuracyLog
  logisticRegr = LogisticRegression()
  logisticRegr.fit(X_train, y_train)  #training
  filename = 'log_model.sav'
  joblib.dump(logisticRegr, filename) #saving the logistic regression model in a file
  y_pred = logisticRegr.predict(X_test) #predicting the output on the test set
  accuracyLog = sklearn.metrics.accuracy_score(y_test, y_pred)
  print("Accuracy on test data: ",accuracyLog)
  print(sklearn.metrics.confusion_matrix(y_test,y_pred))
  print(sklearn.metrics.classification_report(y_test,y_pred))
def main():
    X_train,Y_train,X_test,Y_test=readData()
    train_and_evaluate_Log(X_train, Y_train, X_test, Y_test)

# Tell python to run main method
if __name__ == "__main__": 
  main()

# Building a Random Forest classifier

The random forest is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree.
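
The "prediction by committee" idea can be inspected directly in scikit-learn, where the forest's class probabilities are the average of its trees' probabilities. The following sketch uses synthetic data from `make_classification` (not the census data); the dataset shape is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

# bootstrap=True enables bagging; max_features="sqrt" limits the features
# considered at each split, which is the source of feature randomness.
forest = RandomForestClassifier(
    n_estimators=10,
    criterion="entropy",
    bootstrap=True,
    max_features="sqrt",
    random_state=42,
)
forest.fit(X, y)

# Collect each tree's class probabilities and average them across trees.
tree_probs = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
averaged = tree_probs.mean(axis=0)

# The committee decision is the class with the highest averaged probability.
committee_vote = forest.classes_[averaged.argmax(axis=1)]

print(len(forest.estimators_), "trees")
print(np.allclose(averaged, forest.predict_proba(X)))
```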

In [None]:
def train_and_evaluate_RF(X_train, y_train, X_test, y_test):
  global accuracyRF
  RFclassifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 42) # 10 decision trees used in this classifier
  RFclassifier.fit(X_train, y_train)  #training  
  filename = 'rf_model.sav'
  joblib.dump(RFclassifier, filename)  #save the model
  y_pred = RFclassifier.predict(X_test) #predict on test set
  accuracyRF = sklearn.metrics.accuracy_score(y_test, y_pred)
  print("Accuracy on test data: ",accuracyRF)
  print(sklearn.metrics.confusion_matrix(y_test,y_pred))
  print(sklearn.metrics.classification_report(y_test,y_pred))

def main():
    X_train,Y_train,X_test,Y_test=readData()
    train_and_evaluate_RF(X_train, Y_train, X_test, Y_test)
# Tell python to run main method
if __name__ == "__main__": 
  main()


**Comparing all the classifier models**

In [None]:
y = np.array([(accuracyNN),(accuracySVM),(accuracyLog),(accuracyRF)]) 
x = ['Neural Network','SVM','Log Regression', 'Random Forest']
plt.bar(x,y)
plt.title('Performance comparison')
plt.xlabel('Classifier')
plt.ylabel('Accuracy')
plt.show()