# Otto Group Product Challenge Classification

 

In this notebook, we are going to explore various classification techniques using the Otto Group Product Challenge classification dataset.

> This dataset describes the 93 obfuscated details of more than 61,000 products grouped into 9 product categories (e.g. fashion, electronics, etc.). Input attributes are counts of different events of some kind. The goal is to evaluate which classification algorithm provides best accuracy in predicting the category and make predictions for new products.

Many thanks to Rohan K (@rohankarthik) for your [notebook](https://www.kaggle.com/rohankarthik/classification-using-otto-group-dataset). I have created a simpler version of it that measures the accuracy of each classification model. However, I have some questions about the results, listed in the sections below, and would appreciate help from anyone who can answer them. This should be a good learning opportunity for data science beginners like me. Thank you very much!

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 50)
sns.set_style('darkgrid')

In [None]:
train = pd.read_csv('../input/otto-group-product-classification-challenge/train.csv')

In [None]:
X = train.drop(['target'],axis = 1)
y = train['target']

In [None]:
train.describe(include = 'all')

In [None]:
test_to_predict = pd.read_csv('../input/otto-group-product-classification-challenge/test.csv')

The following code cell shows that we have a class imbalance in the target column of the `train` dataset.

In [None]:
sns.countplot(x = train.target)
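The plot shows the imbalance visually. To put numbers on it, the per-class shares can be computed directly. The sketch below uses a small made-up label series (`toy_target` is a hypothetical stand-in for `train['target']`, and its values are not taken from the Otto data):

```python
import pandas as pd

# Hypothetical labels standing in for train['target'].
toy_target = pd.Series(["Class_2"] * 5 + ["Class_6"] * 3 + ["Class_1"] * 2)

# Fraction of rows per class, sorted from most to least frequent.
shares = toy_target.value_counts(normalize=True)

# Ratio of the largest class share to the smallest one.
imbalance_ratio = shares.max() / shares.min()

print(shares)
print(f"majority/minority ratio: {imbalance_ratio:.1f}")
```

On the real data, the same two lines applied to `train['target']` should give the share of each of the product categories.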

Next, we split the `train` dataset further into training and test sets so that we can compare the actual categories with the predicted ones and compute the accuracy of each classification model.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 7)
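Because the classes are imbalanced, it may be worth asking `train_test_split` to preserve the class proportions in both parts. The split above does not do this. A minimal sketch on toy data, assuming an 80/20 class mix (`X_toy` and `y_toy` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 80 rows of class "a", 20 rows of class "b".
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array(["a"] * 80 + ["b"] * 20)

# stratify=y_toy asks for the same class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=7, stratify=y_toy
)

n_a = int((y_te == "a").sum())
n_b = int((y_te == "b").sum())
print(f"test split: {n_a} rows of 'a', {n_b} rows of 'b'")
```

For the notebook's data, the corresponding change would be adding `stratify = y` to the existing call.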

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
y_train.shape

In [None]:
y_test.shape

Reset the index of `y_test` to consecutive integers starting at 0 so that the predicted and actual values can be printed side by side.

In [None]:
# reset_index avoids hard-coding the length of the test set (12376 rows here)
y_test = y_test.reset_index(drop=True)

## Using KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=25, weights='distance')
classifier.fit(X_train,y_train)

In [None]:
y_predicted = classifier.predict(X_test)
df_y_pred = pd.Series(y_predicted)
pd.concat((df_y_pred,y_test), axis = 1)

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
cm = confusion_matrix(y_test,df_y_pred)
print('Confusion Matrix:')
print(cm)

print('Accuracy Score:')
accuracy_score(y_test,df_y_pred)

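An accuracy of 99.99% on the held-out split looks too good to be true. I first suspected overfitting, but overfitting would normally lower accuracy on unseen data rather than raise it, so something else, such as a feature that leaks the target, may be responsible. Can anyone help me understand why this is happening?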
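One candidate worth ruling out is the `id` column, which `X` still contains because only `target` is dropped. If the rows of `train.csv` happen to be ordered by class, `id` alone would nearly determine the label. The sketch below illustrates that effect on synthetic data only. The sorted labels, the noise features `feat_1` to `feat_5`, and the helper `holdout_accuracy` are assumptions made for the illustration, not properties verified on the Otto file:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_classes, per_class = 9, 100
n_rows = n_classes * per_class

# Synthetic stand-in: count features are pure noise, labels are sorted by class,
# and 'id' simply numbers the rows, so 'id' encodes the label almost perfectly.
toy = pd.DataFrame(
    rng.poisson(1.0, size=(n_rows, 5)),
    columns=[f"feat_{i}" for i in range(1, 6)],
)
toy.insert(0, "id", np.arange(1, n_rows + 1))
toy["target"] = np.repeat([f"Class_{k}" for k in range(1, n_classes + 1)], per_class)


def holdout_accuracy(features):
    """Fit a decision tree on an 80/20 split and return the test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, toy["target"], test_size=0.2, random_state=7
    )
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    return model.score(X_te, y_te)


acc_with_id = holdout_accuracy(toy.drop(columns=["target"]))
acc_without_id = holdout_accuracy(toy.drop(columns=["target", "id"]))
print(f"with id: {acc_with_id:.3f}, without id: {acc_without_id:.3f}")
```

If the real data behaves the same way, building the features with `train.drop(['id', 'target'], axis = 1)` and re-running the models would show how much of the score came from `id`.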

## Using DecisionTree

In [None]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train,y_train)

In [None]:
df_y_pred = pd.Series(classifier.predict(X_test))
pd.concat((df_y_pred,y_test), axis = 1)

In [None]:
cm = confusion_matrix(y_test, df_y_pred)
print('Confusion Matrix:')
print(cm)

print('Accuracy Score:')
accuracy_score(y_test, df_y_pred)

Again, an accuracy of 99.99% on the held-out split looks suspiciously high. Can anyone please explain why this is occurring and how it can be addressed?
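A single train/test split gives only one accuracy estimate. One common way to get a more robust picture, and to see whether limiting tree depth changes anything, is stratified k-fold cross-validation. This is a hedged sketch on synthetic data from `make_classification`. The dataset, the depth value of 5, and the `results` dictionary are illustrative assumptions, not tuned choices for the Otto data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem standing in for (X, y).
X_toy, y_toy = make_classification(
    n_samples=500, n_features=20, n_informative=8, n_classes=3, random_state=0
)

# 5 shuffled folds that each keep the class proportions of y_toy.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for depth in (None, 5):  # None = grow the tree fully, 5 = depth-limited tree
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=0)
    results[depth] = cross_val_score(tree, X_toy, y_toy, cv=cv, scoring="accuracy")
    print(f"max_depth={depth}: mean={results[depth].mean():.3f}, std={results[depth].std():.3f}")
```

A large gap between training accuracy and the cross-validated accuracy would point to overfitting. Near-perfect cross-validated accuracy would instead suggest that some feature reveals the label.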

## Using Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train,y_train)

In [None]:
df_y_pred = pd.Series(classifier.predict(X_test))
pd.concat((df_y_pred,y_test),axis = 1)

In [None]:
cm = confusion_matrix(y_test, df_y_pred)
print('Confusion Matrix:')
print(cm)

print('Accuracy Score:')
accuracy_score(y_test,df_y_pred)

The accuracy is 98.47%, which again looks suspiciously high for held-out data. Any guidance, please?

## Using Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(solver = 'saga',warm_start = True, l1_ratio=0.3, penalty = 'elasticnet',C=1,random_state = 0,max_iter = 500)
classifier.fit(X_train,y_train)

In [None]:
df_y_pred = pd.Series(classifier.predict(X_test))
pd.concat((df_y_pred,y_test), axis = 1)

In [None]:
cm = confusion_matrix(y_test,df_y_pred)
print('Confusion Matrix:')
print(cm)

print('Accuracy Score:')
accuracy_score(y_test,df_y_pred)

An accuracy of 34.8% seems very low. The confusion matrix shows a pattern in which most items are incorrectly predicted as Class_6 or Class_8. Can anyone please explain what can be inferred from this, and how Logistic Regression could be made to predict better?
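One possible contributor is feature scale. The scikit-learn documentation notes that the `saga` solver converges quickly only when features are on approximately the same scale, and the features here are unscaled. A minimal sketch of scaling inside a pipeline follows. It uses synthetic data, and the inflated first column is a made-up stand-in for a large-valued feature. Whether scaling actually fixes the score on the Otto data is not verified here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for (X, y).
X_toy, y_toy = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=3, random_state=0
)
# Inflate one column to mimic a feature whose values dwarf the others.
X_toy[:, 0] *= 10_000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=7, stratify=y_toy
)

# The scaler is fit on the training split only, then reused for the test split.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="saga", max_iter=1000, random_state=0),
)
pipe.fit(X_tr, y_tr)
acc = pipe.score(X_te, y_te)
print(f"accuracy with scaling: {acc:.3f}")
```

Putting the scaler in the pipeline keeps test-set statistics out of the fit. The elastic-net settings from the cell above could be passed to `LogisticRegression` in the same place.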

## Using Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train,y_train)

In [None]:
df_y_pred = pd.Series(classifier.predict(X_test))
pd.concat((df_y_pred,y_test), axis=1)

In [None]:
cm = confusion_matrix(y_test,df_y_pred)
print('Confusion Matrix:')
print(cm)

print('Accuracy Score:')
accuracy_score(y_test,df_y_pred)

An accuracy of 81% seems like a decent fit. Can anyone confirm?
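Whether a given accuracy is "decent" depends on what a trivial model would score on the same data, especially with imbalanced classes. A hedged sketch of such a baseline uses scikit-learn's `DummyClassifier`, which always predicts the most frequent class. The 70/30 toy labels are made up for illustration:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data: features are irrelevant to the dummy model, labels are 70% / 30%.
X_toy = np.zeros((100, 1))
y_toy = np.array(["Class_2"] * 70 + ["Class_6"] * 30)

# Always predicts the majority class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)
print(f"majority-class baseline accuracy: {baseline_acc:.2f}")
```

Fitting the same baseline on `X_train, y_train` and scoring it on `X_test, y_test` would give the floor against which each model's accuracy can be compared.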

## Using Support Vector Machines

In [None]:
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', probability = True, random_state = 0)
classifier.fit(X_train,y_train)

In [None]:
df_y_pred = pd.Series(classifier.predict(X_test))
pd.concat((df_y_pred,y_test),axis = 1)

In [None]:
cm = confusion_matrix(y_test,df_y_pred)
print('Confusion Matrix:')
print(cm)

print('Accuracy Score:')
accuracy_score(y_test,df_y_pred)

An accuracy of 99.82% again looks suspiciously high for held-out data. Can anyone please explain what is causing this and how it can be addressed?
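Since the SVC above is created with `probability = True`, its `predict_proba` output could also be scored with a probability-based metric such as multi-class log loss, which to my knowledge is the metric the Otto competition itself used. The sketch below shows how `log_loss` behaves on two hand-written predictions. The labels and probabilities are invented for illustration:

```python
import math

import numpy as np
from sklearn.metrics import log_loss

# Two samples, two classes; each row of proba sums to 1.
y_true = ["Class_1", "Class_2"]
proba = np.array([
    [0.9, 0.1],  # true class Class_1 gets probability 0.9
    [0.2, 0.8],  # true class Class_2 gets probability 0.8
])

# labels fixes the column order of proba.
loss = log_loss(y_true, proba, labels=["Class_1", "Class_2"])

# Log loss is the mean negative log-probability assigned to the true class.
expected = -(math.log(0.9) + math.log(0.8)) / 2
print(f"log loss: {loss:.4f}")
```

With the fitted model, `log_loss(y_test, classifier.predict_proba(X_test), labels=classifier.classes_)` would be the analogous call. Lower values are better, and confident wrong predictions are penalised heavily.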