
## Cars

### Data Preparation

In [1]:
data_balancing = False
test_train_split = False

#### Read the data

In [2]:
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE

In [3]:
df = pd.read_csv("train.arff.csv")

#### Balance the data

The target variable is unevenly distributed: some classes occur far more often than others. To improve model performance, we balance the data before training. We use `SMOTE` in this case; `ADASYN` is another oversampling method. Both are available in the `imbalanced-learn` package, which we use here.

In [4]:
if data_balancing:
    # One-hot encode every column except the last, which holds the label.
    # SMOTE interpolates between rows, so the dummies are encoded as floats.
    covariates = df.columns[0:df.shape[1]-1]
    df_covar_dumm = pd.get_dummies(df.loc[:, covariates], dtype=float)
    label = df.columns.tolist()[-1]
    # Oversample the minority classes using the 5 nearest neighbours.
    smote = SMOTE(sampling_strategy='auto', random_state=42, k_neighbors=5)
    data_resampled, label_resampled = smote.fit_resample(
        df_covar_dumm, df.loc[:, label])
    # Rebuild a single frame: encoded covariates first, label as the last column.
    balancd_data = pd.DataFrame(data_resampled, columns=df_covar_dumm.columns)
    balancd_data[label] = np.asarray(label_resampled)
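
To make the oversampling step more concrete, here is a minimal sketch of the interpolation idea behind SMOTE, written with NumPy only. It is a simplified illustration, not the `imbalanced-learn` implementation: the function name `smote_like_oversample`, the toy points, and the brute-force neighbour search are all assumptions made for this example.

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k_neighbors=5, random_state=42):
    """Create n_new synthetic rows by interpolating between minority-class
    points and their nearest minority-class neighbours (simplified SMOTE)."""
    rng = np.random.default_rng(random_state)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k_neighbors, n - 1)
    # Pairwise Euclidean distances (brute force; fine for a toy example).
    dists = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_minority.shape[1]))
    for row in range(n_new):
        i = rng.integers(n)                 # pick a random minority point
        j = neighbours[i, rng.integers(k)]  # pick one of its k nearest neighbours
        gap = rng.random()                  # position along the segment i -> j
        synthetic[row] = X_minority[i] + gap * (X_minority[j] - X_minority[i])
    return synthetic

# Hypothetical toy data: 4 minority points that we want to grow to 10.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like_oversample(X_min, n_new=6, k_neighbors=2)
X_balanced_minority = np.vstack([X_min, X_new])
```

Because each synthetic row lies on a segment between two real rows, one-hot columns can end up with fractional values. `imbalanced-learn` also ships variants aimed at categorical features (`SMOTENC`, `SMOTEN`), which may be worth considering for data like this.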

#### Split the data

In [5]:
if test_train_split:
    # Randomly draw 70% of the rows for training; the rest form the test set.
    df_train = balancd_data.sample(axis=0, frac=0.7, random_state=42, weights=None)
    # Select the test rows by index label so the split is correct for any index.
    df_test = balancd_data.drop(index=df_train.index)
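
The 70/30 split can be sketched on a small frame as follows. The toy frame and its column names are hypothetical; the point is that dropping the training index labels yields the complement of the training set even when the index does not run from 0 to n-1, a case where positional lookups would pick the wrong rows.

```python
import pandas as pd

# Hypothetical toy frame with a non-default index.
toy = pd.DataFrame(
    {"feature": range(10), "label": list("aabbaabbab")},
    index=range(100, 110),
)

toy_train = toy.sample(frac=0.7, random_state=42)  # 70% of the rows
toy_test = toy.drop(index=toy_train.index)         # the remaining 30%
```

If class proportions should be preserved in both parts, `sklearn.model_selection.train_test_split` with its `stratify` argument is one alternative.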

#### Using Imbalanced Data

In [6]:
if not data_balancing:
    covariates = df.columns[0:df.shape[1]-1]
    df_covar_dumm = pd.get_dummies(df.loc[:, covariates])
    label = df.columns.tolist()[-1]
    df_covar_dumm.loc[:, label] = df.loc[:,label]
    balancd_data = df_covar_dumm

if not test_train_split:
    df_train = balancd_data
    df_test = balancd_data
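
As a small illustration of what the encoding step produces, the sketch below one-hot encodes a toy frame and re-attaches the label as the last column. The column names and values are hypothetical stand-ins, not the contents of `train.arff.csv`.

```python
import pandas as pd

# Hypothetical toy data: two categorical features and a label column.
toy = pd.DataFrame({
    "safety": ["low", "med", "high", "med"],
    "doors": ["2", "4", "2", "3"],
    "class": ["unacc", "acc", "vgood", "acc"],
})

covariates = toy.columns[:-1]  # every column except the last
label = toy.columns[-1]        # the last column holds the label

# Each categorical feature becomes one indicator column per distinct value.
encoded = pd.get_dummies(toy.loc[:, covariates])
encoded[label] = toy[label]
```

Here the two features expand into six indicator columns (three values each), followed by the label.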

### Model Building

In this section we build four models: logistic regression, a decision tree, a random forest, and a multi-layer perceptron. For each model we calculate the level-wise classification metrics: accuracy, precision, and recall.

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report
from io import StringIO

# Every column except the last is a one-hot encoded feature; the last is the label.
X_train = df_train.iloc[:, :-1]
y_train = df_train.iloc[:, -1]
X_test = df_test.iloc[:, :-1]
y_test = df_test.iloc[:, -1]

#### Logistic Regression

In [8]:
model_logreg = LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None,
                                  random_state=42, solver='lbfgs', max_iter=100, multi_class='multinomial', verbose=0, warm_start=False, n_jobs=-1)

_ = model_logreg.fit(X_train, y_train)

#### Decision Trees

In [9]:
# Fully grown tree (no depth limit) that splits on information gain (entropy).
model_dt = DecisionTreeClassifier(criterion='entropy', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
                                  max_features=None, random_state=42, max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None)

_ = model_dt.fit(X_train, y_train)

#### Random Forest

In [10]:
# 100 entropy-based trees; each split considers sqrt(n_features) candidate features.
model_rf = RandomForestClassifier(n_estimators=100, criterion='entropy', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='sqrt',
                                  max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=1, random_state=42, verbose=0, warm_start=False, class_weight=None)

_ = model_rf.fit(X_train, y_train)

#### Multi Layer Perceptron

In [11]:
model_mlp = MLPClassifier(hidden_layer_sizes=(100, ), activation='relu', solver='adam', alpha=0.0001, batch_size='auto', learning_rate='adaptive', learning_rate_init=0.003, power_t=0.5, max_iter=200,
                          shuffle=True, random_state=42, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08)

_ = model_mlp.fit(X_train, y_train)

### Accuracy Metrics

In [12]:
predictions_logreg = model_logreg.predict(X_test)
predictions_dt = model_dt.predict(X_test)
predictions_rf = model_rf.predict(X_test)
predictions_mlp = model_mlp.predict(X_test)

In [22]:
from sklearn.metrics import precision_score, accuracy_score, recall_score

def class_wise_accuracy(y_true, y_pred, average=None, labels=None):
    """Accuracy within each true class, returned in the order of `labels`.

    `average` is unused; it is accepted so that the signature matches
    precision_score and recall_score.
    """
    # Work on plain arrays so the boolean masks are positional.
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if labels is None:
        labels = sorted(set(y_true))
    accuracy_by_class = []
    for each_level in labels:
        relevant_cases = y_true == each_level
        if not relevant_cases.any():
            # No true instances of this level: accuracy is undefined.
            accuracy_by_class.append(float("nan"))
            continue
        accuracy_by_class.append(
            accuracy_score(y_true[relevant_cases], y_pred[relevant_cases]))
    return accuracy_by_class
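
A quick way to see what this class-wise accuracy measures is to compute it on a handful of labels and compare it with per-class recall. The sketch below uses hypothetical toy labels and recomputes the quantity directly with NumPy rather than calling the function above.

```python
import numpy as np
from sklearn.metrics import recall_score

# Hypothetical toy labels, not taken from the car data.
y_true = np.array(["acc", "acc", "good", "unacc", "unacc", "unacc"])
y_pred = np.array(["acc", "unacc", "good", "unacc", "unacc", "acc"])
labels = ["acc", "good", "unacc"]

# Accuracy restricted to the rows whose true label is `lvl`.
per_class_accuracy = [float(np.mean(y_pred[y_true == lvl] == lvl)) for lvl in labels]

# Per-class recall from scikit-learn, in the same label order.
per_class_recall = recall_score(y_true, y_pred, average=None, labels=labels)
```

On these labels both computations give the same three numbers, since accuracy restricted to one true class is the share of that class that was recovered.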

In [23]:
# order of reporting
label_order = ['acc', 'vgood', 'good', 'unacc']
label_order.sort()

# variable naming
models = ["logreg","dt","rf","mlp"]
metrics = ["precision","recall","accuracy"]

# calculate metrics
metric_methods = [precision_score,recall_score,class_wise_accuracy]
predictions_all = [predictions_logreg, predictions_dt, predictions_rf, predictions_mlp]
# [print(each.shape) for each in predictions_all]

all_metrics = [each_metric_method(y_test, each_model_prediction,average=None, labels=label_order) for each_metric_method in metric_methods for each_model_prediction in predictions_all]

metrics_df = pd.DataFrame(all_metrics).T
metrics_df.index = label_order
metrics_df.columns = [each_model+"_"+each_metric for each_metric in metrics for each_model in models ]
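
The column names line up with the metric values only because both comprehensions iterate in the same order: metric in the outer loop, model in the inner loop. A reduced sketch with two models and two metrics (names shortened for illustration) shows the resulting order.

```python
models = ["logreg", "dt"]
metrics = ["precision", "recall"]

# Outer loop over metrics, inner loop over models.
column_names = [
    each_model + "_" + each_metric
    for each_metric in metrics
    for each_model in models
]
```

Swapping the two `for` clauses in only one of the comprehensions would silently attach the wrong names to the columns, so the two should be kept in step.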

### Results

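The table below compares the classification metrics for each level and model. We calculate three metrics for each model and level, where a level is one of the unique values of the variable we are predicting: "unacc", "acc", "good", "vgood". Because `test_train_split` is `False`, the models are evaluated on the same data they were trained on, so the scores describe fit to the training data rather than performance on unseen data. The metrics are:
<ul>
<li><i><b>Accuracy: </b></i>as computed here for each level, the fraction of instances whose true label is that level that are classified correctly; defined this way it coincides with recall, which is why the accuracy and recall columns of the table match</li>
<li><i><b>Recall: </b></i>the proportion of instances that actually belong to a level that are predicted as that level: true positives (TP) divided by all actual positives (TP + FN)</li>
<li><i><b>Precision: </b></i>the proportion of instances predicted as a level that actually belong to it: true positives divided by all positive predictions (TP + FP)</li>
</ul>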

<img src="Precisionrecall.png" width = 300px>
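
The definitions above can be worked through by hand for a single level. The sketch below counts true positives, false positives, and false negatives in plain Python; the helper name `precision_recall` and the toy labels are hypothetical and only serve to illustrate the formulas.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one level, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # correctly flagged
    fp = sum(t != positive and p == positive for t, p in pairs)  # wrongly flagged
    fn = sum(t == positive and p != positive for t, p in pairs)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical toy labels, not taken from the car data.
y_true = ["acc", "acc", "good", "unacc", "unacc", "unacc"]
y_pred = ["acc", "unacc", "good", "unacc", "unacc", "acc"]

precision_acc, recall_acc = precision_recall(y_true, y_pred, positive="acc")
```

For the level "acc" there is one true positive, one false positive, and one false negative, so both precision and recall come out at 0.5.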

In [24]:
metrics_df

| level | logreg_precision | dt_precision | rf_precision | mlp_precision | logreg_recall | dt_recall | rf_recall | mlp_recall | logreg_accuracy | dt_accuracy | rf_accuracy | mlp_accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| acc | 0.756494 | 1.0 | 1.0 | 1.0 | 0.882576 | 1.0 | 1.0 | 1.0 | 0.882576 | 1.0 | 1.0 | 1.0 |
| good | 0.64 | 1.0 | 1.0 | 1.0 | 0.307692 | 1.0 | 1.0 | 1.0 | 0.307692 | 1.0 | 1.0 | 1.0 |
| unacc | 0.973203 | 1.0 | 1.0 | 1.0 | 0.947805 | 1.0 | 1.0 | 1.0 | 0.947805 | 1.0 | 1.0 | 1.0 |
| vgood | 0.910714 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
