In this notebook, we are going to explore various classification techniques using the Otto Group Product Classification Challenge dataset.

From machinelearningmastery:
> This dataset describes the 93 obfuscated details of more than 61,000 products grouped into 10 product categories (e.g. fashion, electronics, etc.). Input attributes are counts of different events of some kind. The goal is to make predictions for new products as an array of probabilities for each of the 10 categories and models are evaluated using multiclass logarithmic loss (also called cross entropy).
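
To make the evaluation metric concrete, here is a minimal sketch of multiclass logarithmic loss. The three-class labels and probabilities below are made up for illustration and are not taken from the Otto data. The loss is the mean negative log of the probability assigned to the true class, and the manual value is compared with `sklearn.metrics.log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy example: 4 samples, 3 classes (values are made up for illustration)
y_true_demo = np.array([0, 2, 1, 2])
y_prob_demo = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
])

# Multiclass log loss: mean of -log(probability assigned to the true class)
true_class_prob = y_prob_demo[np.arange(len(y_true_demo)), y_true_demo]
manual_logloss = -np.mean(np.log(true_class_prob))

# The same quantity computed by scikit-learn
sklearn_logloss = log_loss(y_true_demo, y_prob_demo, labels=[0, 1, 2])

print(manual_logloss, sklearn_logloss)
```

A confident wrong prediction is penalized heavily, which is why every model in this notebook is scored on `predict_proba` output rather than on hard labels.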

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 50)
sns.set_style('darkgrid')

In [None]:
train = pd.read_csv('/kaggle/input/otto-group-product-classification-challenge/train.csv')
train.head(7)

In [None]:
test = pd.read_csv('/kaggle/input/otto-group-product-classification-challenge/test.csv')
test.head(7)

The following code cell shows that we have a class imbalance in the target column of the `train` dataset.

In [None]:
sns.countplot(x = train.target)

The following 3 code cells are due to [@nagomiso](https://www.kaggle.com/nagomiso/feature-extraction-tfidf).

In [None]:
class_to_order = dict()
order_to_class = dict()

for idx, col in enumerate(train.target.unique()):
    order_to_class[idx] = col
    class_to_order[col] = idx

train["target_ord"] = train["target"].map(class_to_order).astype("int16")
feature_columns = [col for col in train.columns if col.startswith("feat_")]
target_column = ["target_ord"]

In [None]:
order_to_class

In [None]:
class_to_order

We are now going to see how skewed each feature is. This will help us in further analysis.

In [None]:
from scipy.stats import skew

In [None]:
# Compute the skewness of every feature column
# (named `skew_values` so it does not shadow `scipy.stats.skew` imported above)
skew_values = train[feature_columns].skew()

skew_df = pd.DataFrame({'Feature': feature_columns, 'Skewness': skew_values.values})
skew_df.plot(kind='bar', figsize=(18, 10))

We are now going to apply the [QuantileTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html#sklearn.preprocessing.QuantileTransformer) from scikit-learn. I first used `StandardScaler` but found that there was no change in the skew value of the features. 

If anyone knows why I didn't see any change, please drop a comment!
***
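
A likely explanation for the unchanged skew: `StandardScaler` only shifts each feature by its mean and divides it by its standard deviation. Skewness is the standardized third moment, so it is invariant under any shift and positive rescaling. The sketch below illustrates this on synthetic right-skewed counts. The Poisson data and the column name `feat_demo` are assumptions made for illustration, not the Otto features.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic, right-skewed count data standing in for one feature column
counts = pd.DataFrame({'feat_demo': rng.poisson(lam=0.5, size=5000).astype(float)})

# StandardScaler computes (x - mean) / std, which is an affine transformation
standardized = pd.DataFrame(StandardScaler().fit_transform(counts), columns=counts.columns)

# Any other shift plus positive rescaling behaves the same way
affine = 3.0 * counts['feat_demo'] + 7.0

skew_raw = counts['feat_demo'].skew()
skew_std = standardized['feat_demo'].skew()
skew_affine = affine.skew()
print(f'raw: {skew_raw:.6f}, standardized: {skew_std:.6f}, affine: {skew_affine:.6f}')
```

All three values agree up to floating-point error, so changing the shape of a distribution requires a non-linear transformation.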

From the scikit-learn website:
> QuantileTransformer applies a non-linear transformation such that the probability density function of each feature will be mapped to a uniform or Gaussian distribution. In this case, all the data, including outliers, will be mapped to a uniform distribution with the range $[0, 1]$, making outliers indistinguishable from inliers.

> RobustScaler and QuantileTransformer are robust to outliers in the sense that adding or removing outliers in the training set will yield approximately the same transformation. But contrary to RobustScaler, QuantileTransformer will also automatically collapse any outlier by setting them to the a priori defined range boundaries (0 and 1). This can result in saturation artifacts for extreme values.

> To map to a Gaussian distribution, set the parameter `output_distribution='normal'`.
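
The saturation behaviour described in the quote can be seen in a small sketch. The uniform training column below is synthetic and chosen only for illustration. After fitting, values far outside the fitted range land on the same boundaries as the observed minimum and maximum.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
# Synthetic training column with values in [0, 10) (illustrative only)
X_fit = rng.uniform(0, 10, size=(1000, 1))

qt_demo = QuantileTransformer(n_quantiles=100, output_distribution='uniform')
qt_demo.fit(X_fit)

# Two extreme outliers plus the observed minimum and maximum
extremes = np.array([[-1e6], [X_fit.min()], [X_fit.max()], [1e6]])
mapped = qt_demo.transform(extremes).ravel()
print(mapped)
```

The outliers at `-1e6` and `1e6` become indistinguishable from the smallest and largest training values, which is the collapse onto the range boundaries that the documentation warns about.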

In [None]:
from sklearn.preprocessing import QuantileTransformer
# Fit on the training features only, then apply the same fitted mapping to the test set
qt = QuantileTransformer(output_distribution='normal')
train[feature_columns] = qt.fit_transform(train[feature_columns])
test[feature_columns] = qt.transform(test[feature_columns])

Let us now check the skew values of the features.

In [None]:
skew = []
for i in train[feature_columns].columns:
    skew.append(train[str(i)].skew())
    
skew_df = pd.DataFrame({'Feature': train[feature_columns].columns, 'Skewness': skew})
skew_df.plot(kind='bar',figsize=(18,10))

We are now going to remove the features whose absolute skew value is greater than 3.75 (my arbitrary choice).

In [None]:
# check features for skew
skew_feats = train[feature_columns].skew().sort_values(ascending=False)
skewness = pd.DataFrame({'Skew': skew_feats})
skewness = skewness[abs(skewness) > 3.75].dropna()
skewed_features = skewness.index.values.tolist()
skewed_features

In [None]:
train_new = train.drop(skewed_features, axis = 1)
train_new

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

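# Use a 1-D Series for the target; a one-column DataFrame makes many
# scikit-learn estimators emit a DataConversionWarning on fit
X_train, X_valid, y_train, y_valid = train_test_split(
    train_new.drop(['id', 'target', 'target_ord'], axis = 1),
    train_new['target_ord'],
    test_size = 0.275,
    random_state = 7,
    stratify = train_new['target_ord']
)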

## Using KNN

From the scikit-learn documentation:

> Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.

> The $k$-neighbors classification in KNeighborsClassifier is the most commonly used technique. The optimal choice of $k$ is highly data-dependent: in general a larger $k$ suppresses the effects of noise, but makes the classification boundaries less distinct.

> The basic nearest neighbors classification uses uniform weights: that is, the value assigned to a query point is computed from a simple majority vote of the nearest neighbors. Under some circumstances, it is better to weight the neighbors such that nearer neighbors contribute more to the fit. This can be accomplished through the weights keyword. The default value, `weights = uniform`, assigns uniform weights to each neighbor. `weights = distance` assigns weights proportional to the inverse of the distance from the query point. Alternatively, a user-defined function of the distance can be supplied to compute the weights.
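
The difference between the two weighting schemes can be shown on a toy problem. The three 1-D training points and the query below are made up for illustration. One class-0 point sits close to the query, while two class-1 points sit further away.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D data: one class-0 point near the query, two class-1 points further away
X_toy = np.array([[0.0], [2.0], [2.1]])
y_toy = np.array([0, 1, 1])
query = np.array([[0.5]])

uniform_knn = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X_toy, y_toy)
distance_knn = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X_toy, y_toy)

# Uniform: class 1 wins the vote 2 to 1.
# Distance: class 0 gets weight 1/0.5 = 2.0, class 1 gets 1/1.5 + 1/1.6 (about 1.29).
pred_uniform = uniform_knn.predict(query)[0]
pred_distance = distance_knn.predict(query)[0]
print(pred_uniform, pred_distance)
```

With uniform weights the majority class wins regardless of distance, while inverse-distance weights let the single nearby neighbor outvote the two distant ones.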

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knc = KNeighborsClassifier(n_neighbors = 25, weights = 'distance')
knc.fit(X_train, y_train)
yhat = knc.predict(X_valid)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

> A confusion matrix $C$ is such that $C_{i, j}$ is equal to the number of observations known to be in group $i$ and predicted to be in group $j$.
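
As a small illustration of this definition, the sketch below builds a confusion matrix from made-up labels for three classes (not the Otto targets).

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for three classes
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
print(cm_toy)
```

Here `cm_toy[0, 1]` is 1 because exactly one observation of class 0 was predicted as class 1, and the diagonal holds the correctly classified counts.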

In [None]:
result = confusion_matrix(y_valid, yhat)
print("Confusion Matrix:")
print(result)

The `classification_report` function builds a text report showing the main classification metrics.

$$\text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}$$
$$\text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$$
$$\text{F1-score} = \text{harmonic mean of precision and recall} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
***
$$\text{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}$$

The macro average is the unweighted mean of the per-label scores. The weighted average is the mean of the per-label scores weighted by support (the number of true instances of each label).
***
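
The formulas above can be checked by hand on a small example. The sketch below uses made-up labels for three classes, derives per-class precision, recall and F1 from the confusion matrix, and compares them with `precision_recall_fscore_support` from scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Made-up true and predicted labels for three classes
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)

true_pos = np.diag(cm_toy)
precision = true_pos / cm_toy.sum(axis=0)  # column sums = TP + FP
recall = true_pos / cm_toy.sum(axis=1)     # row sums = TP + FN
f1 = 2 * precision * recall / (precision + recall)

support = cm_toy.sum(axis=1)               # true instances per class
macro_f1 = f1.mean()                       # unweighted mean over classes
weighted_f1 = np.average(f1, weights=support)

# Reference values from scikit-learn (per class, no averaging)
sk_precision, sk_recall, sk_f1, sk_support = precision_recall_fscore_support(
    y_true_toy, y_pred_toy
)
print(precision, recall, f1, macro_f1, weighted_f1)
```

In this toy case every class has the same support, so the macro and weighted averages coincide. On an imbalanced target like the one in this notebook they generally differ.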

In [None]:
result1 = classification_report(y_valid, yhat)
print("Classification Report:")
print(result1)

In [None]:
yhat_KNN = knc.predict_proba(X_valid)
logloss_KNN = log_loss(y_valid, yhat_KNN)
print('Log loss using KNN classifier:', logloss_KNN)

## Using DecisionTree

From the scikit-learn documentation:
> Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.
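
To see what "simple decision rules" look like, here is a minimal sketch that fits a shallow tree and prints its rules as text. It uses the small iris dataset bundled with scikit-learn purely for illustration. That dataset and the name `toy_tree` are assumptions of this example, not part of the Otto workflow.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A depth-2 tree keeps the printed rule set short
toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(iris.data, iris.target)

# export_text renders the learned threshold rules as indented plain text
rules = export_text(toy_tree, feature_names=list(iris.feature_names))
print(rules)
```

Each leaf predicts a single class for every sample that reaches it, which is the piecewise constant behaviour mentioned in the quote.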

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dtree = DecisionTreeClassifier(criterion='entropy', max_depth=8,
                              min_samples_leaf = 6, max_leaf_nodes = 40,
                              splitter = 'best')
dtree.fit(X_train, y_train)
yhat_tree = dtree.predict_proba(X_valid)
logloss_DTree = log_loss(y_valid, yhat_tree)
print('Log loss using Decision Tree: ', logloss_DTree)

In [None]:
# Plot decision tree
from IPython.display import Image as PImage
from subprocess import check_call
from PIL import Image, ImageDraw, ImageFont
from sklearn import tree

tree.plot_tree(dtree)

Let us now export the tree in Graphviz format. Thanks to [@dmilla](https://www.kaggle.com/dmilla/introduction-to-decision-trees-titanic-dataset) for the code cell.

In [None]:
# Export our trained model as a .dot file (only the top 3 levels are drawn)
with open("otto.dot", 'w') as f:
    tree.export_graphviz(dtree, out_file=f, max_depth = 3, impurity = True,
                         feature_names = X_train.columns.tolist(),
                         class_names = train_new.target.unique().tolist(),
                         rounded = True, filled = True)

# Convert .dot to .png to allow display in web notebook (requires the Graphviz `dot` binary)
check_call(['dot','-Tpng','otto.dot','-o','otto.png'])

# Open the chart with PIL (an ImageDraw handle is created, but nothing is drawn on it here)
img = Image.open("otto.png")
draw = ImageDraw.Draw(img)
img.save('sample-out.png')
PImage("sample-out.png")

## Using Logistic Regression

From Wikipedia:
> In statistics, the logistic model is used to model the probability of a certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events such as determining whether an image contains a cat, dog, lion, etc. Each object being detected in the image would be assigned a probability between 0 and 1, with a sum of one.
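
The "probability between 0 and 1, with a sum of one" property comes from the softmax function used in multinomial logistic regression. The sketch below is a plain NumPy illustration with made-up scores. The helper `softmax` is written for this example and is not scikit-learn's internal implementation.

```python
import numpy as np

def softmax(scores):
    """Convert a 2-D array of per-class scores into row-wise probabilities."""
    # Subtract the row maximum for numerical stability (does not change the result)
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Made-up linear scores for 2 samples and 3 classes
scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 0.5]])
probs = softmax(scores)
print(probs)
print(probs.sum(axis=1))
```

Each row sums to one, the largest score receives the largest probability, and equal scores yield equal probabilities.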

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
lr = LogisticRegression(solver = 'saga', warm_start = True,
                        penalty = 'elasticnet', l1_ratio = 0.3,
                        random_state = 5, C = 1, max_iter = 500)
lr.fit(X_train, y_train)

yhat = lr.predict(X_valid)
yhat_lr = lr.predict_proba(X_valid)
logloss_lr = log_loss(y_valid, yhat_lr)
print('Log loss using Logistic Regression:', logloss_lr)

Let us use a custom function to plot our confusion matrix.

In [None]:
import itertools
def plot_confusion_matrix(cm, classes, normalize = False, title = 'Confusion matrix', cmap = plt.cm.Blues):
    '''
    This function prints and plots the confusion matrix. Normalization can be applied by setting normalize = True.
    '''
    if normalize:
        cm = cm.astype('float')/cm.sum(axis=1)[:,np.newaxis]
        print('Normalized Confusion matrix')
    else:
        print('Confusion matrix without normalization')
    
    print(cm)
    plt.imshow(cm, interpolation = 'nearest', cmap = cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation = 45)
    plt.yticks(tick_marks, classes)
    
    if normalize:
        fmt = '.2f'
    else:
        fmt = 'd'
    
    thresh = cm.max()/2
    
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), horizontalalignment = 'center',
                color = 'white' if cm[i, j] > thresh else 'black')
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
cnf_matrix = confusion_matrix(y_valid, yhat, labels = train_new.target_ord.unique().tolist())
np.set_printoptions(precision = 2)

plt.figure()
plot_confusion_matrix(cnf_matrix, classes = train_new.target.unique().tolist())

In [None]:
print('Classification Report:')
print(classification_report(y_valid, yhat))

## Using Support Vector Machines

From tutorialspoint:
> An SVM model is a representation of different classes in a hyperplane in multidimensional space. The hyperplane is generated in an iterative manner by SVM so that the error can be minimized. The goal of SVM is to divide the datasets into classes to find a `maximum margin(al) hyperplane`.

In [None]:
from sklearn.svm import SVC

In [None]:
# Named `svc` so the estimator does not shadow the `sklearn.svm` module
svc = SVC(kernel = 'rbf', probability = True, random_state = 7)
svc.fit(X_train, y_train)

yhat = svc.predict(X_valid)
yhat_svm = svc.predict_proba(X_valid)
logloss_svm = log_loss(y_valid, yhat_svm)
print('Log loss using Support Vector Machines:', logloss_svm)

In [None]:
cnf_matrix = confusion_matrix(y_valid, yhat, labels = train_new.target_ord.unique().tolist())

plt.figure()
plot_confusion_matrix(cnf_matrix, classes = train_new.target.unique().tolist())

In [None]:
print('Classification Report:')
print(classification_report(y_valid, yhat))

## Using XGBClassifier

In [None]:
from xgboost import XGBClassifier

In [None]:
xgb_params = {'n_estimators': 2500,
             'max_depth': 5,
             'learning_rate': 0.01,
             'min_child_weight': 4,
             'colsample_bytree': 0.4,
             'subsample': 0.4,
             'reg_alpha': 0.6,
             'reg_lambda': 0.6
             }
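# XGBoost >= 1.6 expects `early_stopping_rounds` in the constructor
# (passing it to `fit` was deprecated and later removed)
xgb = XGBClassifier(**xgb_params, early_stopping_rounds = 5)
xgb.fit(X_train, y_train,
       eval_set = [(X_train, y_train), (X_valid, y_valid)],
       verbose = False)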

In [None]:
#To calculate log-loss, we need the probability of each prediction
yhat_xgbc = xgb.predict_proba(X_valid)
logloss_XGBC = log_loss(y_valid, yhat_xgbc)
print("Log loss using XGB Classifier:", logloss_XGBC)

### Using the XGBoost Feature Importance Plot

In [None]:
from xgboost import plot_importance
# Plot feature importance
ax = plot_importance(xgb, max_num_features=12, show_values=True) 
fig = ax.figure
fig.set_size_inches(10, 3)
plt.show()

In [None]:
results = xgb.evals_result()

In [None]:
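# Plot learning curves (validation_0 = training split, validation_1 = validation split)
plt.plot(results['validation_0']['mlogloss'], label='train')
plt.plot(results['validation_1']['mlogloss'], label='validation')
plt.xlabel('Boosting round')
plt.ylabel('mlogloss')
plt.legend()
plt.show()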

## Using AdaBoostClassifier

In [None]:
from sklearn.ensemble import AdaBoostClassifier

In [None]:
abc = AdaBoostClassifier(n_estimators = 1000, random_state = 0, learning_rate = 0.12)
abc.fit(X_train, y_train)

yhat_ABC = abc.predict_proba(X_valid)
logloss_ABC = log_loss(y_valid, yhat_ABC)
print('Log loss using Ada Boost Classifier:', logloss_ABC)

## Using CatBoostClassifier

The following parameters have been set by trial and error from [Parameter tuning](https://catboost.ai/docs/concepts/parameter-tuning.html) and [Speeding up training](https://catboost.ai/docs/concepts/speed-up-training.html).

In [None]:
from catboost import CatBoostClassifier

In [None]:
CBC_params = {
                'iterations': 5000, 
                'od_wait': 250,
                'use_best_model': True,
                'loss_function': 'MultiClass',
                'eval_metric': 'MultiClass',
                'leaf_estimation_method': 'Newton',
                'bootstrap_type': 'Bernoulli',
                'subsample': 0.4,
                'learning_rate': 0.05,
                'l2_leaf_reg': 0.5, #L2 Regularization
                'random_strength': 10, #amount of randomness to use for scoring splits when tree structure is selected
                'depth': 6, #Tree depth
                'min_data_in_leaf': 3, #minimum number of training samples in a leaf
                'leaf_estimation_iterations': 4, #Earlier = 7
                'task_type': 'GPU',
                'border_count': 128, #Number of splits for numerical features
                'grow_policy': 'SymmetricTree'
            }

In [None]:
cbc = CatBoostClassifier(**CBC_params)
cbc.fit(X_train, y_train,
       eval_set = [(X_valid, y_valid)],
       early_stopping_rounds = 20,
       verbose = False)

In [None]:
yhat_CBC = cbc.predict_proba(X_valid)
logloss_CBC = log_loss(y_valid, yhat_CBC)
print('Log loss using CatBoost Classifier:', logloss_CBC)