# Hands-on introduction to ML training
In this notebook, we will select the best model and the best hyperparameters for a wine-quality dataset.

### Step 1: Load and explore data
The first step is choosing the data source. Here we will use an existing dataset. We will:
1. Create a folder named 'data'
2. Download the file from a public GitHub repo with the Python package "requests" and save it as WineQT.csv in the data folder.

In [1]:
%config IPCompleter.greedy=True #Helps with auto-complete

import numpy as np
import pandas as pd
import os

try:
    os.mkdir('data')
except OSError as error:
    print(error)

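import csv
import requests

url = 'https://raw.githubusercontent.com/techno-nerd/ML_101_Course/main/06%20Models%20and%20Hyperparameters/data/WineQT.csv'
r = requests.get(url)
r.raise_for_status()  #Stop early if the download failed
#newline='' lets the csv module control line endings (avoids extra blank rows on Windows)
with open('data/WineQT.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for line in r.iter_lines():
        writer.writerow(line.decode('utf-8').split(','))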

[Errno 17] File exists: 'data'


In [4]:
df = pd.read_csv('data/WineQT.csv')

In [5]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1146 entries, 0 to 1145
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1143 non-null   float64
 1   volatile acidity      1143 non-null   float64
 2   citric acid           1143 non-null   float64
 3   residual sugar        1143 non-null   float64
 4   chlorides             1143 non-null   float64
 5   free sulfur dioxide   1143 non-null   float64
 6   total sulfur dioxide  1143 non-null   float64
 7   density               1143 non-null   float64
 8   pH                    1143 non-null   float64
 9   sulphates             1143 non-null   float64
 10  alcohol               1143 non-null   float64
 11  quality               1143 non-null   float64
 12  Id                    1143 non-null   float64
dtypes: float64(13)
memory usage: 116.5 KB
None


In [6]:
print(df[:5])

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality   Id  
0      9.4      5.0  0.0  
1      9.8      5.0  1.0  
2    

In [7]:
print(df['quality'].value_counts())

quality
5.0    483
6.0    462
7.0    143
4.0     33
8.0     16
3.0      6
Name: count, dtype: int64
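The counts above are heavily skewed toward scores 5 and 6. A small sketch that puts numbers on the imbalance, using the counts copied from the output above (the variable names are illustrative):

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series(
    {5.0: 483, 6.0: 462, 7.0: 143, 4.0: 33, 8.0: 16, 3.0: 6},
    name="count",
)

# Share of each quality score as a percentage of all labelled rows.
share = (counts / counts.sum() * 100).round(1)
print(share)

# Ratio between the most frequent and the least frequent class.
imbalance_ratio = counts.max() / counts.min()
print(f"Largest class is {imbalance_ratio:.1f}x the smallest")
```

On the real DataFrame, `df['quality'].value_counts(normalize=True)` gives the same proportions directly.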


### Step 2: Data preparation

There are a few tasks we need to do before we can train a model on this data:
1. Improve the representation of wines with quality 3, 4 and 8 by duplicating their rows (oversampling)
2. Drop rows with null values

Then, we will split the data the same way as last time:
1. Split the data into a training set (80%) and a test set (20%)
2. Separate the input features (the measurements for each wine) from the target variable ("quality")

In [8]:
#Duplicating class "3" 15 times

temp = df[df['quality'] == 3]
print(temp.shape)
for i in range(1, 16):
    df = pd.concat([df, temp], axis=0, ignore_index=True)

print(df['quality'].value_counts())

(6, 13)
quality
5.0    483
6.0    462
7.0    143
3.0     96
4.0     33
8.0     16
Name: count, dtype: int64


In [9]:
#Duplicating class "8" 7 times

temp = df[df['quality'] == 8]
print(temp.shape)
for i in range(1, 8):
    df = pd.concat([df, temp], axis=0, ignore_index=True)

print(df['quality'].value_counts())

(16, 13)
quality
5.0    483
6.0    462
7.0    143
8.0    128
3.0     96
4.0     33
Name: count, dtype: int64


In [10]:
#Duplicating class "4" 3 times

temp = df[df['quality'] == 4]
print(temp.shape)
for i in range(1, 4):
    df = pd.concat([df, temp], axis=0, ignore_index=True)

print(df['quality'].value_counts())

(33, 13)
quality
5.0    483
6.0    462
7.0    143
4.0    132
8.0    128
3.0     96
Name: count, dtype: int64
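The three duplication cells above follow the same pattern, so they could be folded into one helper. A minimal sketch, assuming a hypothetical `oversample` function and a tiny made-up DataFrame in place of the wine data:

```python
import pandas as pd


def oversample(frame, column, value, extra_copies):
    """Return frame plus `extra_copies` additional copies of rows where column == value."""
    subset = frame[frame[column] == value]
    return pd.concat([frame] + [subset] * extra_copies, axis=0, ignore_index=True)


# Tiny stand-in for the wine data (made-up values, for illustration only).
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.1, 11.0],
    "quality": [5.0, 5.0, 3.0, 8.0],
})

# Same multipliers as the cells above: 15 extra copies of class 3, 7 of class 8.
for value, extra in {3.0: 15, 8.0: 7}.items():
    toy = oversample(toy, "quality", value, extra)

print(toy["quality"].value_counts())
```

Building the list of copies once and calling `pd.concat` a single time avoids re-copying the growing DataFrame on every loop iteration.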


In [12]:
#Count missing values per column
print(f"Total Records: {df.shape[0]}")
for column, missing in df.isnull().sum().items():
    print(f"{column}: {missing}")

Total Records: 1447
fixed acidity: 3
volatile acidity: 3
citric acid: 3
residual sugar: 3
chlorides: 3
free sulfur dioxide: 3
total sulfur dioxide: 3
density: 3
pH: 3
sulphates: 3
alcohol: 3
quality: 3
Id: 3


In [13]:
#Deleting all rows with missing values
df.dropna(inplace=True)
print(df.shape[0])

1444
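Every column reports the same number of missing values, and `dropna` removed exactly that many rows, which suggests the gaps come from a few fully empty rows. A quick way to check this, sketched on a made-up three-row DataFrame (not the wine data):

```python
import numpy as np
import pandas as pd

# Made-up stand-in: the middle row is empty in every column.
toy = pd.DataFrame({
    "pH": [3.51, np.nan, 3.26],
    "alcohol": [9.4, np.nan, 9.8],
    "quality": [5.0, np.nan, 5.0],
})

# Missing values per column, computed in one vectorised call.
print(toy.isnull().sum())

# Rows where every column is missing.
empty_rows = toy.isnull().all(axis=1)
print("Fully empty rows:", int(empty_rows.sum()))

# how="all" drops only the fully empty rows and keeps partially filled ones.
cleaned = toy.dropna(how="all")
print(cleaned.shape)
```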


In [14]:
features = df.drop(['quality'], axis=1)

In [15]:
import sklearn.model_selection as ms

train_features, test_features, train_labels, test_labels = ms.train_test_split(features, df['quality'], test_size=0.2)
print(train_features.shape)
print(test_features.shape)
print(train_labels.shape)
print(test_labels.shape)

(1155, 12)
(289, 12)
(1155,)
(289,)
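In the cells above, the rare classes are duplicated before the split, so copies of the same wine can land in both the training and the test set, and the `Id` column stays among the features. One possible alternative is to split first and oversample only the training set. This is a sketch on synthetic data with assumed column values, not the notebook's actual pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the wine data: 200 rows with imbalanced labels.
toy = pd.DataFrame({
    "alcohol": rng.normal(10.4, 1.0, 200),
    "pH": rng.normal(3.3, 0.15, 200),
    "quality": [5.0] * 120 + [6.0] * 60 + [3.0] * 20,
    "Id": np.arange(200, dtype=float),
})

# Drop the target and the row identifier, which is not a measurement.
X = toy.drop(columns=["quality", "Id"])
y = toy["quality"]

# Split first; stratify keeps the class shares similar in both sets,
# and random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Oversample the rare class in the training set only (3 extra copies).
rare = y_train == 3.0
X_train_bal = pd.concat([X_train] + [X_train[rare]] * 3, ignore_index=True)
y_train_bal = pd.concat([y_train] + [y_train[rare]] * 3, ignore_index=True)

print(y_train_bal.value_counts())
print(y_test.value_counts())
```

With this ordering the test set contains only original rows, so its score reflects wines the model has not seen during training.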


### Step 3: Model Selection and Training

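To decide on the best model, we will answer two questions:
1. Which model type gives the highest accuracy?
2. Which parameter values work best for that model?

For each classifier, `GridSearchCV` tries every combination in the parameter grid using 3-fold cross-validation on the training set, and we then score the best estimator on the test set. Finally, we will train the best model with the best parameters and see how well it performs.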
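The same comparison can also be written as a single loop over candidate models. A minimal sketch, assuming synthetic data from `make_classification` in place of the wine features; the `candidates` and `results` dictionaries are illustrative names:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem standing in for the wine data.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Each candidate pairs an estimator with the grid of parameters to try.
candidates = {
    "decision_tree": (
        DecisionTreeClassifier(random_state=0),
        {"max_depth": [None, 3, 5, 7], "min_samples_leaf": [1, 3, 5]},
    ),
    "knn": (
        KNeighborsClassifier(),
        {"weights": ["uniform", "distance"], "n_neighbors": [3, 5, 10]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=3, scoring="accuracy")
    search.fit(X_train, y_train)
    # best_score_ is the mean cross-validated accuracy of the best combination.
    results[name] = (search.best_score_, search.best_params_)
    print(name, round(search.best_score_, 3), search.best_params_)

# Pick the model type by cross-validated score, not by test score.
best_name = max(results, key=lambda n: results[n][0])
print("Selected:", best_name)
```

Choosing by `best_score_` keeps the test set out of the selection step, so it can still give an unbiased estimate of the chosen model.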

#### Decision Tree Classifier

In [17]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
params_grid = {'max_depth': [None, 3, 5, 7],
               'min_samples_leaf': [1, 3, 5]}

grid_search = GridSearchCV(model, params_grid, cv=3, scoring='accuracy')
grid_search.fit(train_features, train_labels)

In [18]:
best_model = grid_search.best_estimator_
print("Best model:", best_model)

best_params = grid_search.best_params_
print("Best parameters:", best_params)

test_score = best_model.score(test_features, test_labels)
print("Test Score: ", test_score)

Best model: DecisionTreeClassifier()
Best parameters: {'max_depth': None, 'min_samples_leaf': 1}
Test Score:  0.740484429065744


#### KNN Classifier

In [19]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
params_grid = {'weights': ['uniform', 'distance'],
               'n_neighbors': [3, 5, 10]
               }
grid_search = GridSearchCV(model, params_grid, cv=3, scoring='accuracy')
grid_search.fit(train_features, train_labels)

In [20]:
best_model = grid_search.best_estimator_
print("Best model:", best_model)

best_params = grid_search.best_params_
print("Best parameters:", best_params)

test_score = best_model.score(test_features, test_labels)
print("Test Score: ", test_score)

Best model: KNeighborsClassifier(n_neighbors=3, weights='distance')
Best parameters: {'n_neighbors': 3, 'weights': 'distance'}
Test Score:  0.6747404844290658
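KNN classifies by distance, so columns with large numeric ranges (for example total sulfur dioxide or Id, compared with density) dominate the result when features are left unscaled. One common remedy is to standardise the features inside a pipeline. This is a sketch on synthetic data; the step names `scale` and `knn` are arbitrary choices:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)
# Inflate one column so its scale dominates raw distances (illustration only).
X[:, 0] *= 1000

# The scaler is fitted inside each cross-validation fold, on training folds only.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
grid = {"knn__n_neighbors": [3, 5, 10], "knn__weights": ["uniform", "distance"]}

search = GridSearchCV(pipe, grid, cv=3, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```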


#### Linear Regression

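Linear Regression has very few parameters that can be tuned, so it will simply be trained and evaluated. Because it predicts a continuous value, the predictions are rounded to the nearest whole number before being compared with the quality labels.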

In [21]:
from sklearn.linear_model import LinearRegression

regr = LinearRegression()
regr.fit(train_features, train_labels)

In [22]:
def accuracy(labels, predictions):
    """Return the fraction of predictions that exactly match the labels."""
    correct = (labels == predictions).sum()
    return correct / labels.size

In [25]:
regr_pred = regr.predict(test_features)
print(regr_pred[:2])
regr_pred = np.round(regr_pred, 0)
regr_acc = accuracy(test_labels, regr_pred)

print("Linear Regression: ", regr_acc)

[4.6951415 5.8985488]
Linear Regression:  0.42214532871972316
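Rounded regression outputs can fall outside the range of quality scores, and exact-match accuracy hides how close a wrong prediction was. A small sketch of clipping the rounded values and reporting mean absolute error alongside accuracy, using made-up labels and predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Made-up labels and raw regression outputs, for illustration only.
labels = np.array([5.0, 6.0, 3.0, 8.0, 5.0])
raw_pred = np.array([4.6951415, 5.8985488, 2.2, 8.7, 6.6])

# Round to the nearest score, then clip to the scores seen in the data (3 to 8).
rounded = np.clip(np.round(raw_pred), 3, 8)

# accuracy_score computes the same exact-match fraction as the accuracy() helper.
print("Accuracy:", accuracy_score(labels, rounded))

# Mean absolute error shows how far off the raw predictions are on average.
print("MAE:", mean_absolute_error(labels, raw_pred))
```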


### Training Final Model

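Now that we know the best model and parameters (a decision tree with `max_depth=None` and `min_samples_leaf=1`), we will train it on the whole training set. The score can differ slightly from the grid-search result above because the tree is built without a fixed `random_state`.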

In [26]:
dtree = DecisionTreeClassifier(max_depth=None, min_samples_leaf=1)
dtree.fit(train_features, train_labels)

In [28]:
dtree_pred = dtree.predict(test_features)
dtree_acc = accuracy(test_labels, dtree_pred)

print("Final Decision Tree: ", dtree_acc)

Final Decision Tree:  0.7508650519031141
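A single accuracy number does not show how the model does on the rarer quality scores. A per-class breakdown can help here. This is a sketch on synthetic imbalanced data, not the wine data, with a fixed `random_state` so the run is repeatable:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem with imbalanced classes (roughly 70% / 20% / 10%).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pred = tree.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, pred)
print(cm)

# Precision, recall and F1 for each class.
print(classification_report(y_test, pred, zero_division=0))
```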
