# Lab 2 - Classification tree and Kaggle competition

In this lab you will participate in a Kaggle competition:

https://www.kaggle.com/t/a85442f2e6f744f2b2cd06140001f127

In this competition you will train a model, evaluate it, and submit its predictions. Use a decision tree; how you preprocess the data and which model parameters you choose is up to you.

First, click the link and join the competition.

## Set up:
Using the 'kaggle.json' file from the last lab, load the competition's data:

(remember to upload it to your environment)

You should have 3 files (the archive downloaded below extracts them as 'train.csv', 'test.csv' and 'sample_submission.csv'):

```
WineQT_train_set.csv
WineQT_test_set.csv
WineQT_sample_submission.csv - an example of the submission file
```

In [1]:
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
!kaggle competitions download -c decision-trees-lab-2-wine-classification
! unzip /content/decision-trees-lab-2-wine-classification.zip

Downloading decision-trees-lab-2-wine-classification.zip to /content
  0% 0.00/67.9k [00:00<?, ?B/s]
100% 67.9k/67.9k [00:00<00:00, 37.2MB/s]
Archive:  /content/decision-trees-lab-2-wine-classification.zip
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


In [2]:
import pandas as pd

test_set = pd.read_csv('/content/test.csv', index_col='Id')

train_set = pd.read_csv('/content/train.csv', index_col='Id')
train_set.head(10)

Id,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,8.0,0.5,0.39,2.2,0.073,30.0,39.0,0.99572,3.33,0.77,12.1,6
1,9.3,0.3,0.73,2.3,0.092,30.0,67.0,0.99854,3.32,0.67,12.8,6
2,7.1,0.51,0.03,2.1,0.059,3.0,12.0,0.9966,3.52,0.73,11.3,7
3,8.1,0.87,0.22,2.6,0.084,11.0,65.0,0.9973,3.2,0.53,9.8,5
4,8.5,0.36,0.3,2.3,0.079,10.0,45.0,0.99444,3.2,1.36,9.5,6
5,9.9,0.51,0.44,2.2,0.111,30.0,134.0,0.9982,3.11,0.54,9.6,5
6,7.2,0.87,0.0,2.3,0.08,6.0,18.0,0.99552,3.34,0.6,11.3,6
7,7.5,0.43,0.32,1.8,0.066,18.0,40.0,0.9956,3.3,0.43,9.7,6
8,11.6,0.38,0.55,2.2,0.084,17.0,40.0,1.0008,3.17,0.73,9.8,6
9,7.8,0.78,0.09,2.2,0.049,13.0,29.0,0.99682,3.51,0.49,9.5,5


In [3]:
train_set.shape

(2056, 12)

## Task 1:
Decide how to preprocess the data. Think about missing values, categorical values, normalization, etc. I recommend analyzing missing values, feature distributions, means and standard deviations to determine how to preprocess.

(You can find information about the data and features on the competition page)
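The checks suggested above can be sketched roughly as follows. The small DataFrame is made-up stand-in data (not the competition data) and `summarize` is a hypothetical helper introduced only for illustration; in the lab you would pass `train_set` and `'quality'` instead.

```python
import pandas as pd

def summarize(df: pd.DataFrame, target: str) -> dict:
    """Collect a few quick diagnostics that inform preprocessing decisions."""
    features = df.drop(columns=[target])
    return {
        # Missing values per column: non-zero counts call for imputation or dropping.
        "missing": df.isna().sum(),
        # Non-numeric feature columns would need encoding before training.
        "categorical": features.select_dtypes(exclude="number").columns.tolist(),
        # Per-feature statistics: very different scales hint at normalization.
        "stats": features.describe().T[["mean", "std", "min", "max"]],
        # Class frequencies: strong imbalance changes how accuracy should be read.
        "class_counts": df[target].value_counts().sort_index(),
    }

# Made-up stand-in data, only for illustration.
toy = pd.DataFrame({
    "alcohol": [9.5, 10.1, None, 12.8],
    "pH": [3.2, 3.3, 3.5, 3.1],
    "quality": [5, 5, 6, 7],
})
report = summarize(toy, target="quality")
print(report["missing"])
print(report["class_counts"])
```

Returning the pieces in a dictionary keeps each diagnostic separately inspectable in a notebook cell.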

In [4]:
train_set.isna().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

In [5]:
train_set[train_set.duplicated()]

Id,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality


In [6]:
from sklearn.model_selection import train_test_split

X = train_set.drop('quality', axis=1)
y = train_set['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

In [7]:
from sklearn.preprocessing import StandardScaler

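columns = X.columns
scaler = StandardScaler()
# Scaling statistics are computed over all labelled rows plus the competition test set.
scaler.fit(pd.concat([X, test_set]))

# transform() returns NumPy arrays, so the rebuilt frames get a fresh 0..n-1 index.
X_train = pd.DataFrame(scaler.transform(X_train), columns=columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=columns)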

X_train.head()

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
0,2.092135,-0.436803,2.4716,1.011561,0.496594,-0.304076,-0.674329,1.470367,-0.924708,1.216748,-0.231682
1,-0.750237,-1.123922,0.70898,-1.024463,-1.029685,-1.104631,-1.00879,-1.510448,0.555039,0.558283,0.449081
2,-0.631805,0.307577,-0.09221,-0.458901,0.898246,-0.204007,-0.066217,-0.228259,-0.501923,-0.465997,0.449081
3,-0.987102,1.796335,-1.213877,-0.911351,-0.748529,-1.104631,-0.947979,-0.666614,1.964321,-1.197625,0.157325
4,1.381542,-1.181182,0.815806,-1.137575,-1.029685,-0.904493,-0.765545,-1.510448,-1.065636,0.48512,0.93534
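Fitting a scaler on rows that are later used for validation lets those rows influence the preprocessing. One common way to avoid this, sketched here on synthetic data from `make_classification` (an assumption, not the wine data), is to wrap the scaler and the tree in a `Pipeline` so that the scaler is fit on the training portion only. Tree splits depend on the ordering of feature values rather than on their scale, so standardization is not usually expected to change a decision tree's accuracy much; the names `X_demo`, `y_demo` and `pipe` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; in the lab this would be X and y taken from train_set.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=5, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42
)

# The scaler is fit inside the pipeline, so it only ever sees the rows passed to fit().
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
])
pipe.fit(X_tr, y_tr)

val_accuracy = pipe.score(X_val, y_val)
print(f"Validation accuracy: {val_accuracy:.3f}")
```

A pipeline can also be passed directly to `GridSearchCV`, in which case the scaler is refit on the training part of every cross-validation fold.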


## Task 2:
Create a decision tree model - select its parameters as you see fit.

Here you can use any tools and libraries you know to find and select the best model.

In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score

clf = DecisionTreeClassifier()

In [9]:
params = {'criterion': ['gini', 'entropy'],
          'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
          'min_samples_split': [2, 5, 10, 20, 30, 40, 50]}

In [10]:
grid_search = GridSearchCV(clf, params, cv=5)
grid_search.fit(X_train, y_train)

print("Best parameters GS:", grid_search.best_params_)
best_dt_gs = grid_search.best_estimator_

Best parameters GS: {'criterion': 'entropy', 'max_depth': 4, 'min_samples_split': 20}


In [11]:
random_search = RandomizedSearchCV(clf, params, cv=5)
random_search.fit(X_train, y_train)

print("Best parameters RS:", random_search.best_params_)
best_dt_rs = random_search.best_estimator_

Best parameters RS: {'min_samples_split': 30, 'max_depth': 4, 'criterion': 'entropy'}
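`RandomizedSearchCV` evaluates only `n_iter` sampled combinations (10 by default) out of the 140 in the grid above, so without a fixed seed its result can change from run to run. The sketch below, on synthetic data and with a smaller illustrative grid (both assumptions), shows how passing `random_state` makes the sampled candidates repeatable; `run_search` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, only for illustration.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=11, n_informative=5, random_state=0
)

# A reduced grid to keep the example fast: 2 * 5 * 3 = 30 combinations.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 4, 5, 6],
    "min_samples_split": [2, 10, 20],
}

def run_search(seed: int) -> dict:
    """Run a randomized search whose sampled candidates are fixed by `seed`."""
    search = RandomizedSearchCV(
        DecisionTreeClassifier(random_state=0),  # fixed seed for tie-breaking in the tree
        param_grid,
        n_iter=8,           # number of sampled parameter combinations
        cv=3,
        random_state=seed,  # controls which combinations are sampled
    )
    search.fit(X_demo, y_demo)
    return search.best_params_

first = run_search(seed=42)
second = run_search(seed=42)
print(first)
print("Same result on both runs:", first == second)
```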


In [12]:
import numpy as np

scores_gs = cross_val_score(best_dt_gs, X_train, y_train, cv=5)
scores_rs = cross_val_score(best_dt_rs, X_train, y_train, cv=5)

print(f"Mean cross-validation scores: GS - {np.mean(scores_gs)}, RS - {np.mean(scores_rs)}")
print(f"Standard deviation of cross-validation scores: GS - {np.std(scores_gs)}, RS - {np.std(scores_rs)}")

Mean cross-validation scores: GS - 0.5816216216216217, RS - 0.581081081081081
Standard deviation of cross-validation scores: GS - 0.01738155155763623, RS - 0.017093392757666904


In [13]:
from sklearn.metrics import accuracy_score

gs_pred = best_dt_gs.predict(X_test)
rs_pred = best_dt_rs.predict(X_test)

# accuracy_score expects (y_true, y_pred); accuracy is symmetric, so the values are unchanged.
print(f"Accuracy scores: GS - {accuracy_score(y_test, gs_pred)}, RS - {accuracy_score(y_test, rs_pred)}")

Accuracy scores: GS - 0.558252427184466, RS - 0.558252427184466
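A single accuracy number does not show which quality classes the model gets right. As a rough sketch, a confusion matrix and per-class recall can be computed as below; the label arrays are made up for illustration, and in the lab you would pass `y_test` and `gs_pred` instead.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels, only for illustration.
y_true = np.array([3, 5, 5, 5, 6, 6, 7, 8])
y_hat = np.array([5, 5, 5, 6, 6, 6, 7, 7])

labels = [3, 5, 6, 7, 8]

# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true, y_hat, labels=labels)

# Fraction of each true class that was predicted correctly.
per_class_recall = recall_score(
    y_true, y_hat, labels=labels, average=None, zero_division=0
)

print(cm)
for label, recall in zip(labels, per_class_recall):
    print(f"quality {label}: recall {recall:.2f}")
```

In this toy example the rare classes 3 and 8 have recall 0 even though overall accuracy looks moderate, which is the kind of pattern the matrix makes visible.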


In [14]:
dec_tree = best_dt_gs

## Task 3:
Create a predictions file. Use your trained model to make predictions on the test set and save the output to a CSV file named 'sample_submission.csv'.

Make sure to keep the ids of the test set (column 'Id') in your submission file. An example of this file, named 'WineQT_sample_submission.csv', is available on the competition page on Kaggle and among the files you loaded.

After you create and save your submission file, upload it to Kaggle competition by clicking on the submission button and choosing the file. You can submit your predictions up to **5 times a day.**

In [15]:
test_set = pd.read_csv('/content/test.csv', index_col='Id')
test_set.head(10)

Id,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
2056,7.2,0.51,0.01,2.0,0.077,31.0,54.0,0.99748,3.39,0.59,9.8
2057,7.2,0.755,0.15,2.0,0.102,14.0,35.0,0.99586,3.33,0.68,10.0
2058,8.4,0.46,0.4,2.0,0.065,21.0,50.0,0.99774,3.08,0.65,9.5
2059,8.0,0.47,0.4,1.8,0.056,14.0,25.0,0.9948,3.3,0.65,11.7
2060,6.5,0.34,0.32,2.1,0.044,8.0,94.0,0.99356,3.23,0.48,12.8
2061,6.1,0.32,0.25,2.3,0.073,11.0,86.0,0.99464,3.16,0.7,11.2
2062,6.7,0.64,0.05,1.8,0.054,6.0,14.0,0.99456,3.35,0.58,10.9
2063,12.5,0.37,0.59,1.8,0.079,3.0,16.0,0.9994,3.16,0.68,10.5
2064,6.3,0.47,0.32,1.9,0.069,18.0,85.0,0.9958,3.39,0.55,14.0
2065,7.9,0.18,0.4,1.7,0.066,23.0,99.0,0.99914,3.31,0.62,10.0


In [16]:
columns = test_set.columns
index = test_set.index

test_set = pd.DataFrame(scaler.transform(test_set), columns=columns)

test_set.head()

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
0,-0.691021,-0.093243,-1.374115,-0.458901,-0.186215,1.397104,0.146622,0.418315,0.555039,-0.392834,-0.62069
1,-0.691021,1.309626,-0.626337,-0.458901,0.817916,-0.304076,-0.431084,-0.469355,0.132254,0.265631,-0.426186
2,0.019572,-0.379543,0.70898,-0.458901,-0.668198,0.39641,0.024999,0.56078,-1.629349,0.046143,-0.912446
3,-0.217293,-0.322283,0.70898,-0.685126,-1.029685,-0.304076,-0.73514,-1.050175,-0.079139,0.046143,1.227096
4,-1.105534,-1.066662,0.281679,-0.345788,-1.511668,-0.904493,1.362844,-1.729625,-0.572387,-1.197625,2.296867


In [17]:
y_pred = dec_tree.predict(test_set)
np.unique(y_pred, return_counts=True)

(array([5, 6, 7]), array([673, 432, 267]))

In [18]:
np.unique(y_train, return_counts=True)

(array([3, 4, 5, 6, 7, 8]), array([ 12,  48, 756, 700, 299,  35]))
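The two outputs above show that the tree predicts only qualities 5, 6 and 7, while the training labels also contain the rare classes 3, 4 and 8. One option that could be tried is `class_weight='balanced'`, which reweights samples inversely to class frequency. The sketch below uses synthetic imbalanced data (an assumption, not the wine data) and only compares the predicted class counts; it makes no claim about the effect on the competition score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data with roughly 80% / 15% / 5% class proportions.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=6, n_classes=3,
    weights=[0.8, 0.15, 0.05], random_state=0,
)

plain = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_demo, y_demo)
balanced = DecisionTreeClassifier(
    max_depth=4, class_weight="balanced", random_state=0
).fit(X_demo, y_demo)

# Compare how often each class is predicted by the two trees.
predicted_counts = {}
for name, model in [("plain", plain), ("balanced", balanced)]:
    classes, counts = np.unique(model.predict(X_demo), return_counts=True)
    predicted_counts[name] = dict(zip(classes.tolist(), counts.tolist()))
    print(name, predicted_counts[name])
```

Whether the reweighting helps depends on the metric: it tends to trade accuracy on the common classes for recall on the rare ones.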

In [19]:
results_df = pd.DataFrame({'quality': y_pred})
results_df.index = index
results_df.head()

Id,quality
2056,5
2057,6
2058,6
2059,6
2060,6


In [20]:
results_df.to_csv('sample_submission.csv', index=True)
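Before uploading, it can help to compare the submission against the sample file. The following is a minimal sketch with toy frames; `check_submission` is a hypothetical helper, and the layout (an 'Id' index with a single 'quality' column) is assumed from `results_df` above rather than taken from the competition's rules.

```python
import pandas as pd

def check_submission(submission: pd.DataFrame, sample: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means the layouts match."""
    problems = []
    if submission.index.name != sample.index.name:
        problems.append("index name differs")
    if list(submission.columns) != list(sample.columns):
        problems.append("columns differ")
    if not submission.index.equals(sample.index):
        problems.append("ids differ")
    if submission.isna().any().any():
        problems.append("missing predictions")
    return problems

# Toy frames standing in for the sample file and two candidate submissions.
ids = pd.Index([2056, 2057, 2058], name="Id")
sample = pd.DataFrame({"quality": [0, 0, 0]}, index=ids)
good = pd.DataFrame({"quality": [5, 6, 6]}, index=ids)
bad = pd.DataFrame({"quality": [5, 6]}, index=pd.Index([2056, 2057], name="Id"))

print(check_submission(good, sample))
print(check_submission(bad, sample))
```

In the lab, the two frames would be read with `pd.read_csv(..., index_col='Id')` from the submission you saved and from the sample file.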

## Task 4:
Convert the notebook to HTML:


1.   Download the notebook from the File tab
2.   Open a command prompt in the directory containing the file
3.   Run the command:
```
 jupyter nbconvert --to html notebook.ipynb 
```

Or directly from Colab:
```
 !jupyter nbconvert --to html notebook.ipynb 
```


In [24]:
!jupyter nbconvert --to html ../Lab2.ipynb 

This application is used to convert notebook files (*.ipynb)
        to various other formats.


Options
The options below are convenience aliases to configurable class-options,
as listed in the "Equivalent to" description-line of the aliases.
To see all configurable class-options for some <cmd>, use:
    <cmd> --help-all

--debug
    set log level to logging.DEBUG (maximize logging output)
    Equivalent to: [--Application.log_level=10]
--show-config
    Show the application's configuration (human-readable format)
    Equivalent to: [--Application.show_config=True]
--show-config-json
    Show the application's configuration (json format)
    Equivalent to: [--Application.show_config_json=True]
--generate-config
    generate default config file
    Equivalent to: [--JupyterApp.generate_config=True]
-y
    Answer yes to any questions instead of prompting.
    Equivalent to: [--JupyterApp.answer_yes=True]
--execute
    Execute the notebook prior to export.
    Equivalent to: [--ExecutePr