# **Classification**

#### Book used: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (third edition)

#### Chapter 3 exercises

## 1. An MNIST Classifier With Over 97% Accuracy

Exercise: _Try to build a classifier for the MNIST dataset that achieves over 97% accuracy on the test set. Hint: the `KNeighborsClassifier` works quite well for this task; you just need to find good hyperparameter values (try a grid search on the `weights` and `n_neighbors` hyperparameters)._

Importing the dataset and creating training and testing sets:

In [1]:
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', as_frame=False)

In [2]:
X, y = mnist.data, mnist.target

In [3]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

Trying a `KNeighborsClassifier`:

In [4]:
from sklearn.neighbors import KNeighborsClassifier

knn_classifier = KNeighborsClassifier()
knn_classifier.fit(X_train, y_train)

In [5]:
initial_accuracy = knn_classifier.score(X_test, y_test)
initial_accuracy

0.9688

Checking the hyperparameters:

In [6]:
hyperparams = knn_classifier.get_params()
print(hyperparams)

{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': None, 'n_neighbors': 5, 'p': 2, 'weights': 'uniform'}


Running a grid search over different values of `n_neighbors` and `weights` (using only the first 10,000 training instances to speed things up):

In [7]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    {"weights": ["uniform", "distance"],
     "n_neighbors": [3, 4, 5, 6, 7, 8, 9, 10]}
]

grid_search = GridSearchCV(knn_classifier, param_grid, cv=3, scoring="accuracy")
grid_search.fit(X_train[:10000], y_train[:10000])

In [8]:
grid_search.best_params_

{'n_neighbors': 4, 'weights': 'distance'}

In [9]:
grid_search.best_score_

0.9397994088551026
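Beyond `best_params_` and `best_score_`, it can help to see how every combination ranked. The following is a minimal sketch, not part of the original notebook: it assumes scikit-learn's small bundled digits dataset (`load_digits`) as a stand-in for MNIST so it runs in seconds, and the names `X_small`, `y_small`, `search`, and `ranking` are made up for illustration.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the 8x8 digits dataset stands in for MNIST to keep this fast
X_small, y_small = load_digits(return_X_y=True)

param_grid = {"weights": ["uniform", "distance"], "n_neighbors": [3, 4, 5]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3, scoring="accuracy")
search.fit(X_small, y_small)

# cv_results_ is a dict of arrays; a DataFrame makes it easy to sort and read
results = pd.DataFrame(search.cv_results_)
ranking = results[
    ["param_n_neighbors", "param_weights", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(ranking)
```

The same pattern should apply to the `grid_search` object above, since `cv_results_` is available on any fitted `GridSearchCV`.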

Retraining the best estimator on the full training set and evaluating it on the test set:

In [10]:
grid_search.best_estimator_.fit(X_train, y_train)
test_accuracy = grid_search.score(X_test, y_test)
test_accuracy

0.9714

## 2. Data Augmentation

Exercise: _Write a function that can shift an MNIST image in any direction (left, right, up, or down) by one pixel. You can use the `shift()` function from the `scipy.ndimage` module. For example, `shift(image, [2, 1], cval=0)` shifts the image two pixels down and one pixel to the right. Then, for each image in the training set, create four shifted copies (one per direction) and add them to the training set. Finally, train your best model on this expanded training set and measure its accuracy on the test set. You should observe that your model performs even better now! This technique of artificially growing the training set is called _data augmentation_ or _training set expansion_._

In [11]:
from scipy.ndimage import shift

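def shift_image(image, x, y):
    # Reshape the flat 784-pixel vector into a 28x28 image
    image = image.reshape(28, 28)
    # shift() takes offsets in [rows, columns] order, i.e. [down, right]
    shifted_image = shift(image, [y, x], cval=0, mode="constant")
    # Flatten back to the original 1D layout
    return shifted_image.reshape(-1)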
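The remaining step the exercise describes (four shifted copies per image, appended to the training set) could look roughly like the sketch below. This is an assumption about how the step might be done, not the finished solution: it uses a tiny random array (`X_demo`, `y_demo`) in place of the real MNIST `X_train` and `y_train`, and it renames the shift parameters to `dx` and `dy` for clarity.

```python
import numpy as np
from scipy.ndimage import shift

def shift_image(image, dx, dy):
    # Same idea as the function above: reshape, shift by [rows, cols], flatten
    image = image.reshape(28, 28)
    shifted = shift(image, [dy, dx], cval=0, mode="constant")
    return shifted.reshape(-1)

# Assumption: 5 random "images" stand in for the 60,000 MNIST training images
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 256, size=(5, 784)).astype(np.float64)
y_demo = np.array(["0", "1", "2", "3", "4"])

X_parts = [X_demo]
y_parts = [y_demo]
# One copy per direction: right, left, down, up
for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
    X_parts.append(np.array([shift_image(img, dx, dy) for img in X_demo]))
    y_parts.append(y_demo)  # shifting does not change the label

X_augmented = np.concatenate(X_parts)
y_augmented = np.concatenate(y_parts)

# Shuffle so the shifted copies are not grouped together
perm = rng.permutation(len(X_augmented))
X_augmented, y_augmented = X_augmented[perm], y_augmented[perm]

print(X_augmented.shape, y_augmented.shape)
```

With the real data, the expanded set would be five times the size of the original training set, so fitting and scoring a k-nearest-neighbors model on it will take noticeably longer.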

**In progress**

## 3. Tackle the Titanic Dataset

Exercise: _Tackle the Titanic dataset. A great place to start is on [Kaggle](https://www.kaggle.com/c/titanic). Alternatively, you can download the data from https://homl.info/titanic.tgz and unzip this tarball like you did for the housing data in Chapter 2. This will give you two CSV files: _train.csv_ and _test.csv_ which you can load using `pandas.read_csv()`. The goal is to train a classifier that can predict the `Survived` column based on the other columns._

Reading the datasets, which I downloaded from Kaggle:

In [12]:
import pandas as pd

train_data = pd.read_csv("./datasets/titanic/train.csv")
test_data = pd.read_csv("./datasets/titanic/test.csv")

train_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


Separating labels from features:

In [18]:
y_train = train_data["Survived"]
X_train = train_data.drop(columns="Survived")

In [22]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Pclass       891 non-null    int64  
 2   Name         891 non-null    object 
 3   Sex          891 non-null    object 
 4   Age          714 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Ticket       891 non-null    object 
 8   Fare         891 non-null    float64
 9   Cabin        204 non-null    object 
 10  Embarked     889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 76.7+ KB


In [20]:
X_train.describe()

| | PassengerId | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


The data contains some null values that need to be handled: `Age`, `Cabin`, and `Embarked` all have missing entries. I'll select the numerical attributes and use `SimpleImputer` with the median strategy to replace the null values in those columns with each column's median:

In [23]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")

In [24]:
import numpy as np

num_columns = X_train.select_dtypes(include=[np.number])
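To illustrate what the median strategy does, here is a small self-contained sketch on a made-up DataFrame (`toy` is hypothetical, loosely modeled on a few Titanic rows). It is one way the imputation step might continue, not necessarily how this notebook goes on.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per numerical column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.2833, np.nan, 53.1],
    "Sex": ["male", "female", "female", "female"],
})

# Keep only the numerical columns; the median is undefined for text columns
num_toy = toy.select_dtypes(include=[np.number])

imputer = SimpleImputer(strategy="median")
# fit_transform returns a NumPy array, so wrap it back into a DataFrame
imputed = pd.DataFrame(
    imputer.fit_transform(num_toy),
    columns=num_toy.columns,
    index=num_toy.index,
)

print(imputer.statistics_)  # the per-column medians learned during fit
print(imputed)
```

The medians are computed from the non-null values only, and the fitted imputer can later be applied to the test set with `transform()` so that both sets are filled with the training medians.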