In this kernel I encode the categorical features from [a past competition](https://www.kaggle.com/c/cat-in-the-dat) in order to build models that predict a binary (0/1) target. I am using [this notebook](https://www.kaggle.com/shahules/an-overview-of-encoding-techniques) to learn about various encoding approaches, but my main goal is to learn something about preprocessing and to achieve a good prediction using simple classifiers, rather than to compare all the encoding techniques.

(In retrospect, I realized that this notebook has some repeated lines and that my coding and writing could be neater, but I'm fine with this style for a start.)

First things first ...

In [None]:
# importing libraries

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O 

print("list of files under the input directory:\n")

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# reading the train data
train_data = pd.read_csv("/kaggle/input/cat-in-the-dat/train.csv")
print("\ntrain data info looks like:\n")
train_data.info()
train_data.head(10)

In [None]:
# reading the test data
test_data = pd.read_csv("/kaggle/input/cat-in-the-dat/test.csv")
print("test data info looks like:\n")
test_data.info()
test_data.head(5)

In [None]:
# defining the target
X = train_data.drop(['target'],axis=1)
print('the shape of X is {}'.format(X.shape))
y = train_data['target']
print('the shape of y is {}'.format(y.shape))
X.head()

Now I encode the data using `LabelEncoder` and then fit a logistic regression model to predict the target.

In [None]:
from sklearn.preprocessing import LabelEncoder

my_encoder = LabelEncoder()

X_encoded = X.copy()
for c in X.columns:
    if (X[c].dtype == 'object'):
        X_encoded[c] = my_encoder.fit_transform(X[c])

X_encoded.head()
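
To make the encoding step concrete, here is a minimal sketch of what `LabelEncoder` does to a single column. The toy column is made up for illustration and is not taken from the competition data. Note that the resulting integer codes imply an ordering (0 < 1 < 2) that the original categories may not have, and models such as logistic regression will treat them as numeric values.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column, made up for illustration (not from the competition data)
toy = pd.Series(["Red", "Green", "Blue", "Green"])

encoder = LabelEncoder()
codes = encoder.fit_transform(toy)

# classes_ holds the sorted unique values; each value is replaced by its index
print(encoder.classes_.tolist())  # ['Blue', 'Green', 'Red']
print(codes.tolist())             # [2, 1, 0, 1]
```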

In [None]:
# dividing the train data to 75% train set and 25% evaluation set
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X_encoded,y,random_state=1,test_size=0.25)
print('the shape of X_train is {}'.format(X_train.shape))
print('the shape of X_test is {}'.format(X_test.shape))
print('the shape of y_train is {}'.format(y_train.shape))
print('the shape of y_test is {}'.format(y_test.shape))

In [None]:
from sklearn.linear_model import  LogisticRegression

model_LogR = LogisticRegression(max_iter=500, C=0.10)
model_LogR.fit(X_train,y_train)
y_pre_LogR = model_LogR.predict(X_test)

from sklearn.metrics import accuracy_score
print('Accuracy : ',accuracy_score(y_test,y_pre_LogR))

With the default regularization (C=1), the above model yields a classification accuracy of 0.69004. Here's how the regularization strength affects the accuracy (in scikit-learn, a smaller C means stronger regularization):
* C=1000, 0.68990
* C=100, 0.69004
* C=10, 0.69009
* C=1, 0.69004
* C=0.1, 0.68954
* C=0.01, 0.68954
* C=0.001, 0.69
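
A sweep like the one above can be run in a single loop instead of editing the cell by hand. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption made so the block runs on its own); in this notebook the `X_train`, `X_test`, `y_train`, `y_test` from the split above would take its place. The names `Xd_train`, `results`, and so on are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption for this sketch, not the competition data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=1, test_size=0.25)

# Fit one model per C value and record its accuracy on the held-out set
results = {}
for C in [1000, 100, 10, 1, 0.1, 0.01, 0.001]:
    model = LogisticRegression(max_iter=500, C=C)
    model.fit(Xd_train, yd_train)
    results[C] = accuracy_score(yd_test, model.predict(Xd_test))

for C, acc in results.items():
    print('C={}, accuracy={:.5f}'.format(C, acc))
```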

Next I try `KNeighborsClassifier`.


In [None]:
from sklearn.neighbors import  KNeighborsClassifier

model_KNN = KNeighborsClassifier(n_neighbors=1)
model_KNN.fit(X_train,y_train)
y_pre_KNN = model_KNN.predict(X_test)

from sklearn.metrics import accuracy_score
print('Accuracy : ',accuracy_score(y_test,y_pre_KNN))

For KNN, the following accuracies are achieved using different N values:
* N=5, 0.63161
* N=10, 0.67569
* N=20, 0.68756
* N=40, 0.69322
* N=80, 0.69421
* N=200, 0.69425

Accuracy seems to plateau as N increases. I also tried `weights='distance'` with N=200, and it didn't change the accuracy.

Next, I try `RandomForestClassifier`.

In [None]:
from sklearn.ensemble import RandomForestClassifier

model_RF = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=1)
model_RF.fit(X_train,y_train)
y_pre_RF = model_RF.predict(X_test)

from sklearn.metrics import accuracy_score
print('Accuracy : ',accuracy_score(y_test,y_pre_RF))

With random forest, the following accuracies are achieved:
* n=100, depth=5, 0.69826
* n=200, depth=5, 0.69873
* n=200, depth=7, 0.70930
* n=500, depth=10, 0.72189

It does better than the previous models, and it seems that the accuracy can be pushed higher by increasing n and depth, though it takes longer to compute.

Next, I tried `SVC` with rbf and linear kernels but had to stop them due to long run times. I decided to stick to the above classifiers (logistic regression, KNN, and random forests) for now and explore the effect of another encoding method, `OneHotEncoder`.

In [None]:
from sklearn.preprocessing import OneHotEncoder

my_encoder_OH = OneHotEncoder()
my_encoder_OH.fit(X)

X_encoded_OH = my_encoder_OH.transform(X)

print('the shape of X_encoded_OH is {}'.format(X_encoded_OH.shape))

# dividing the train data to 75% train set and 25% evaluation set
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X_encoded_OH,y,random_state=1,test_size=0.25)
print('the shape of X_train is {}'.format(X_train.shape))
print('the shape of X_test is {}'.format(X_test.shape))
print('the shape of y_train is {}'.format(y_train.shape))
print('the shape of y_test is {}'.format(y_test.shape))

In [None]:
from sklearn.linear_model import  LogisticRegression

model_LogR = LogisticRegression(max_iter=5000, C=0.01)
model_LogR.fit(X_train,y_train)
y_pre_LogR = model_LogR.predict(X_test)

from sklearn.metrics import accuracy_score
print('Accuracy : ',accuracy_score(y_test,y_pre_LogR))

Logistic regression achieves better accuracy with the one-hot encoding. Here are the values for different regularization strengths:
* C=10, 0.75421
* C=1, 0.75918
* C=0.1, 0.76377
* C=0.01, 0.75722
* C=0.001, 0.73924

Next is KNN.
 

In [None]:
from sklearn.neighbors import  KNeighborsClassifier

model_KNN = KNeighborsClassifier(n_neighbors=1)
model_KNN.fit(X_train,y_train)
y_pre_KNN = model_KNN.predict(X_test)

from sklearn.metrics import accuracy_score
print('Accuracy : ',accuracy_score(y_test,y_pre_KNN))

The KNN runs take too long, perhaps because calculating distances in a 316000-dimensional space takes a lot of CPU power. For N=5, the accuracy is 0.67650, which is better than with the previous encoding.

Next I try random forests.

In [None]:
from sklearn.ensemble import RandomForestClassifier

model_RF = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=1)
model_RF.fit(X_train,y_train)
y_pre_RF = model_RF.predict(X_test)

from sklearn.metrics import accuracy_score
print('Accuracy : ',accuracy_score(y_test,y_pre_RF))

With random forest, the following accuracies are achieved:
* n=100, depth=5, 0.69425
* n=200, depth=5, 0.69425
* n=200, depth=7, 0.69425
* n=500, depth=10, 0.69425
* n=1000, depth=10, 0.69425

I'm not sure why the RF accuracy doesn't improve; it stays at 0.69425 for every setting.
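
One possible explanation (an assumption, not verified on the competition data here) is that X still contains the `id` column, and `OneHotEncoder` turns every distinct id into its own column. That would account for most of the roughly 316000 dimensions, and with so many nearly empty columns a shallow forest might end up predicting the majority class every time. The toy frame below is made up for illustration and only shows how a unique column inflates the encoded shape.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame, made up for illustration: 'id' is unique per row, 'color' is a real category
toy = pd.DataFrame({
    'id': [0, 1, 2, 3, 4, 5],
    'color': ['Red', 'Green', 'Blue', 'Green', 'Red', 'Blue'],
})

# Encoding every column gives one output column per distinct id plus one per color
with_id = OneHotEncoder().fit_transform(toy)

# Dropping the id first leaves only the three color columns
without_id = OneHotEncoder().fit_transform(toy.drop(['id'], axis=1))

print('with id:', with_id.shape)        # (6, 9)
print('without id:', without_id.shape)  # (6, 3)
```

If this is what is happening, dropping `id` from X before encoding would be worth trying.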

At this point, the best result is from logistic regression with C=0.1. I will repeat that below and output the submission file. I realized, though, that I need to fit my encoder to all the data (train and test) rather than just X, so I do that below and make the prediction. The experiments up to this point with the three classifiers were on X only, so the hyperparameters will probably need to be tuned again. For now I am focusing on making my first prediction and will try to improve things later on. I especially need to attend to cross-validation.
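
As a sketch of where the cross-validation could go, the block below wraps the encoder and the classifier in a scikit-learn pipeline and scores it with `cross_val_score`. It is not the approach used in the next cell: instead of fitting the encoder on train and test together, it uses `OneHotEncoder(handle_unknown='ignore')`, which encodes categories unseen during fitting as all zeros. The toy data and the names `toy_X`, `toy_y`, and `pipeline` are made up for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data, made up for illustration (stand-in for X and y)
toy_X = pd.DataFrame({
    'color': ['Red', 'Green', 'Blue', 'Green', 'Red', 'Blue'] * 10,
    'size': ['S', 'M', 'L', 'L', 'S', 'M'] * 10,
})
toy_y = pd.Series([0, 1, 0, 1, 0, 1] * 10)

# The pipeline refits the encoder on the training part of every fold,
# so the held-out part never influences the encoding
pipeline = make_pipeline(
    OneHotEncoder(handle_unknown='ignore'),
    LogisticRegression(max_iter=5000, C=0.1),
)

scores = cross_val_score(pipeline, toy_X, toy_y, cv=5, scoring='accuracy')
print('fold accuracies:', scores)
print('mean accuracy: {:.5f}'.format(scores.mean()))
```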

In [None]:
# creating a dataframe of all samples
X_test_actual=test_data.copy()
print('the shape of X_test_actual is {}'.format(X_test_actual.shape))
X_all = pd.concat([X, X_test_actual])
print('the shape of X_all is {}'.format(X_all.shape))

# encoding the dataframes
from sklearn.preprocessing import OneHotEncoder

my_encoder_OH_all = OneHotEncoder()
my_encoder_OH_all.fit(X_all)
X_test_actual_OH = my_encoder_OH_all.transform(X_test_actual) 
print('the shape of X_test_actual_OH is {}'.format(X_test_actual_OH.shape))
X_OH = my_encoder_OH_all.transform(X)
print('the shape of X_OH is {}'.format(X_OH.shape))
print('the shape of y is {}'.format(y.shape))

# fitting logistic regression
from sklearn.linear_model import  LogisticRegression

model_LogR = LogisticRegression(max_iter=5000, C=0.1)
model_LogR.fit(X_OH,y)
y_pre_LogR = model_LogR.predict(X_test_actual_OH)
print('the shape of y_pre_LogR is {}'.format(y_pre_LogR.shape))

output = pd.DataFrame({'id': X_test_actual.id, 'target': y_pre_LogR})
output.to_csv('my_submission_v1.csv', index=False)
print("Your submission was successfully saved!")
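
The cell above submits hard 0/1 labels from `predict`. If the competition is scored on area under the ROC curve (an assumption here; the evaluation page should confirm it), submitting probabilities from `predict_proba` may score better, because the ranking of the samples is preserved. The sketch below uses random toy arrays as stand-ins for `X_OH`, `y`, `X_test_actual_OH`, and the test ids; all `toy_*` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-ins, made up for illustration (not the competition data)
rng = np.random.RandomState(1)
toy_train = rng.rand(40, 3)
toy_target = (toy_train[:, 0] > 0.5).astype(int)
toy_test = rng.rand(5, 3)
toy_ids = np.arange(100, 105)

model = LogisticRegression(max_iter=5000, C=0.1)
model.fit(toy_train, toy_target)

# predict_proba returns one column per class in model.classes_;
# column 1 is the probability of the positive class (1)
proba = model.predict_proba(toy_test)[:, 1]

output = pd.DataFrame({'id': toy_ids, 'target': proba})
print(output)
```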