# CSCI 5622 Assignment 5, Kaggle Contest - Code and Description

**Name: Vandana Sridhar**

> **Imports**

In [29]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

>**Load the test and train datasets**

In [30]:
names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
         'marital-status', 'occupation', 'relationship', 'race', 'sex',
         'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
         'income']
traindata = pd.read_csv("train.data", names=names)
testdata = pd.read_csv("test.data", names=names)

>**Preprocessing using Label Binarizer, separating string data from numerical data**

In [31]:
x = traindata.drop("income", axis=1)
y = traindata["income"]

In [32]:
xtest=testdata.drop("income",axis=1)

In [33]:
y_binarizer = LabelBinarizer()
y = y_binarizer.fit_transform(y).ravel()
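
> To make the shapes concrete, here is a small sketch of what `LabelBinarizer` does to a two-class target. The `labels` array is made-up illustration data, not rows from `train.data`.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical two-class labels, standing in for the income column.
labels = np.array(["<=50K", ">50K", "<=50K", ">50K"])

binarizer = LabelBinarizer()
column = binarizer.fit_transform(labels)  # shape (4, 1): one column for a two-class target
flat = column.ravel()                     # shape (4,): the 1-D form classifiers expect

# With three or more classes the result has one column per class instead.
three_class = LabelBinarizer().fit_transform(["a", "b", "c"])  # shape (3, 3)

print(column.shape, flat, three_class.shape)
```

> The classes are stored in sorted order in `binarizer.classes_`, and for a two-class target the second class is the one encoded as 1.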

In [19]:
#x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [34]:
numberfeatures = []
stringfeatures = []

# Classify each feature column by the type of its first value.
for col in x.columns:
    if isinstance(x[col].iloc[0], str):
        stringfeatures.append(col)
    else:
        numberfeatures.append(col)

class feature_elimination(TransformerMixin, BaseEstimator):
    """Pipeline step that keeps only the named columns of a DataFrame."""

    def __init__(self, f_names):
        self.f_names = f_names

    def fit(self, x, y=None):
        return self  # nothing to learn, the column names are fixed

    def transform(self, x):
        return x[self.f_names].values


class feature_binarization(TransformerMixin, BaseEstimator):
    """Pipeline step that binarizes every column with its own LabelBinarizer."""

    def fit(self, x, y=None):
        # Rebuild the list on every fit. A class-level list would be shared by
        # all instances and would grow each time fit() is called.
        self.b = [LabelBinarizer().fit(x[:, i]) for i in range(x.shape[1])]
        return self

    def transform(self, x):
        # Binarize each column and stack the results side by side.
        return np.hstack([a.transform(x[:, f]) for f, a in enumerate(self.b)])


> **Built a Pipeline**

In [35]:
numberPip = Pipeline([('selector', feature_elimination(numberfeatures)),('std_scaler', StandardScaler())])
stringPip = Pipeline([ ('selector', feature_elimination(stringfeatures)),('preprocessing_step', feature_binarization())])
Pip_create = FeatureUnion(transformer_list=[('numberpipeline', numberPip),('stringpipeline', stringPip)])

xtrain_new = Pip_create.fit_transform(x)
xtest_new = Pip_create.transform(xtest)



In [57]:
#print("new dataset \n ", xtrain_new[0, :])
#print("new testset \n", xtest_new[0, :])
#print(xtrain_new.shape)
#print(xtest_new.shape)

>**Running a classifier and building the model, obtained predictions**

In [36]:
# no of estimators = 150, 200, 300, 500 ..... 
#classifier1 = AdaBoostClassifier( n_estimators= 150, learning_rate = 1)
#classifier1 = AdaBoostClassifier( n_estimators= 200, learning_rate = 1)
#svc=SVC(probability=True, kernel='rbf')
#classifier1 = AdaBoostClassifier( n_estimators= 200, base_estimator=svc, learning_rate = 1)
#classifier1 = AdaBoostClassifier( n_estimators= 250, learning_rate = 1)
#classifier2 = GradientBoostingClassifier( n_estimators= 160, learning_rate = 0.1)
#classifier2 = GradientBoostingClassifier( n_estimators= 190, learning_rate = 0.1)
classifier2 = GradientBoostingClassifier(n_estimators=150, learning_rate=0.1)
model = classifier2.fit(xtrain_new, y)
ytest_predict = model.predict(xtest_new)

In [37]:
ytest_predict


array([0, 0, 0, ..., 1, 0, 1])

>**Document predictions to file**

In [27]:
# to_csv returns None when it is given a file path, so there is nothing to assign.
pd.DataFrame(ytest_predict, columns=['category']).to_csv('prediction7_gbc150.csv')

# Code Description Writeup

> My best score for this assignment was 87.387% accuracy on Kaggle. <br/>

>**APPROACH 1** :

> My first approach was to encode the categorical data myself. The columns workclass, education, marital status, occupation, relationship, race, sex and native country hold string data, so I mapped each distinct string to a number, and I mapped question marks to -1. The first classifier I ran was a decision tree with the maximum leaf nodes parameter set to 145. It scored 83.783% accuracy, which was a good start.<br/>
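> The code for this first attempt is not part of the notebook above, so the following is only a minimal sketch of the idea. The `toy` DataFrame, the `encode_column` helper and the particular number assigned to each string are assumptions made for illustration; only the -1 code for question marks and `max_leaf_nodes=145` come from the description.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up rows that mimic a few of the dataset's columns.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53, 28, 37],
    "workclass": ["Private", "?", "State-gov", "Private", "?", "State-gov"],
    "income": ["<=50K", ">50K", "<=50K", ">50K", "<=50K", ">50K"],
})


def encode_column(series):
    """Map each distinct string to an integer code; '?' becomes -1."""
    values = sorted(v for v in series.unique() if v != "?")
    mapping = {value: code for code, value in enumerate(values)}
    mapping["?"] = -1
    return series.map(mapping), mapping


encoded = toy.copy()
encoded["workclass"], workclass_mapping = encode_column(toy["workclass"])

X_toy = encoded[["age", "workclass"]]
y_toy = (encoded["income"] == ">50K").astype(int)

# Decision tree capped at 145 leaves, as in the first approach.
tree = DecisionTreeClassifier(max_leaf_nodes=145, random_state=0)
tree.fit(X_toy, y_toy)
print(workclass_mapping)
```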

>**APPROACH 2** :

> I then read quite a bit about preprocessing data and building pipelines. My next step was to use LabelBinarizer from the sklearn.preprocessing module, which binarizes labels in a one-vs-all manner. It is a type of one-hot encoding, and I used it for the y labels. With fit_transform(), the binary targets are transformed into a column vector of 0s and 1s, and .ravel() flattens that column into a one-dimensional array.<br/>
>Next, I iterated through the feature columns and separated the string features from the numerical features.<br/>
>I then had to devise a pipeline for this problem. A pipeline has fit and transform functions: it is fit on the training data, and the fitted pipeline is then used to transform the test data. Pipelines streamline the preprocessing steps and make it easier to model a problem. Sklearn's Pipeline class provides this functionality.<br/>
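> The fit-on-train, transform-on-test pattern can be seen in a very small sketch. The `train` and `test` arrays are invented numbers; the point is that the test values are scaled with the mean and standard deviation learned from the training values.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Invented single-feature data.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

pipe = Pipeline([("std_scaler", StandardScaler())])

train_scaled = pipe.fit_transform(train)  # learns mean=2 and the spread from train
test_scaled = pipe.transform(test)        # reuses those statistics, learns nothing new

print(train_scaled.ravel())
print(test_scaled.ravel())
```

> The test value 2.0 maps to 0 because it equals the training mean, even though it is not the mean of the test set.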

>The feature_elimination class is the selector step of each pipeline: its fit() does nothing, and its transform() returns only the columns whose names it was given. The feature_binarization class does the encoding: its fit() fits one LabelBinarizer per feature in a one-vs-all manner, and its transform() applies them, concatenates the results and returns the final matrix of binary features.<br/>
>Two pipelines are built, one for the number data and one for the string data. Each pipeline is a list of named steps: a selector that picks out the number features or the string features, followed by a transformation. For the numerical features the transformation is StandardScaler(), which standardizes each feature by removing its mean and scaling it to unit variance. For the string features it is feature binarization.<br/>
>To combine the two pipelines, FeatureUnion joins them together, so the combined transformer applies both and places their outputs side by side.<br/>
>I used the combined pipeline to fit the training data and then transform the test data. The preprocessed data is stored as xtrain_new and xtest_new respectively.<br/>
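> For comparison, the same kind of preprocessing can be sketched with scikit-learn's built-in `ColumnTransformer` and `OneHotEncoder` instead of the custom selector and binarizer classes. This is an alternative, not the code used for the submission, and the `toy_train` and `toy_test` frames are made-up rows.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows with two numerical and two string features.
toy_train = pd.DataFrame({
    "age": [25, 40, 55, 30],
    "hours-per-week": [40, 50, 20, 45],
    "sex": ["Male", "Female", "Female", "Male"],
    "race": ["White", "Black", "White", "Other"],
})
toy_test = pd.DataFrame({
    "age": [35],
    "hours-per-week": [60],
    "sex": ["Female"],
    "race": ["Asian"],  # category that never appears in toy_train
})

number_cols = ["age", "hours-per-week"]
string_cols = ["sex", "race"]

preprocess = ColumnTransformer(
    [
        ("numbers", StandardScaler(), number_cols),
        # handle_unknown="ignore" encodes unseen categories as all zeros.
        ("strings", OneHotEncoder(handle_unknown="ignore"), string_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

train_matrix = preprocess.fit_transform(toy_train)  # 2 scaled + 2 sex + 3 race columns
test_matrix = preprocess.transform(toy_test)

print(train_matrix.shape, test_matrix.shape)
```

> One difference from the LabelBinarizer approach is that OneHotEncoder produces one column per category, so a two-category feature such as sex takes two columns here rather than one.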

>**CLASSIFIERS**

>I used the decision tree classifier at first and then moved on to AdaBoost and GradientBoostingClassifier from the sklearn.ensemble module.<br/>
> I experimented mostly with the number of estimators for both algorithms and used the default learning rate for each.<br/>
> For AdaBoost, I increased the number of estimators from 150 to 250, and my accuracy rose from 84.520% to 86.752%. <br/>
>I also compared two setups: training on the entire training set, and training on only the training portion of a train/test split. The split gave no improvement in accuracy, so I kept using the entire training set for the model.<br/>
>I finally used GradientBoostingClassifier with 150 estimators, which got my accuracy to 87.387% on Kaggle.<br/>
>I tried varying the number of estimators for gradient boosting a couple of times, but the accuracy either stayed the same or dropped. <br/>
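> One way to compare several estimator counts without a Kaggle submission for each is to hold out part of the training data and score the model at different boosting stages. The sketch below is an illustration under assumptions, not the procedure used for the scores above: the data comes from `make_classification`, and the names `X_demo`, `y_demo` and `candidates` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed features and binarized labels.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit once with the largest estimator count under consideration.
gbc = GradientBoostingClassifier(n_estimators=250, learning_rate=0.1, random_state=0)
gbc.fit(X_fit, y_fit)

# staged_predict yields the predictions after 1, 2, ..., 250 boosting stages,
# so a single fitted model can be scored at several estimator counts.
candidates = (150, 200, 250)
val_accuracy = {}
for stage, y_pred in enumerate(gbc.staged_predict(X_val), start=1):
    if stage in candidates:
        val_accuracy[stage] = accuracy_score(y_val, y_pred)

best_n = max(val_accuracy, key=val_accuracy.get)
print(val_accuracy, best_n)
```

> Once a count has been chosen this way, the final model can still be refit on the entire training set before predicting on the test data.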

>**CONCLUSION**

>Hence, by applying the preprocessing steps, building a pipeline and using GradientBoostingClassifier with 150 estimators, I got my best score of 87.387%.