# Week-13: Classification Task in Python

<font size='4'>

* Welcome back!
* Today, we will go over a classification example using the `scikit-learn` package in Python.

## Import data

<font size='4'>

* `Scikit-learn`, imported as `sklearn`, is an open-source, robust library for machine learning in Python.
* It was created to streamline the process of implementing machine learning and statistical models in Python.
* The package comes with standard machine learning datasets, which you can import without downloading them from an external website or database.
* Since we will go over a classification example, we will be using the [`wine dataset`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html#sklearn.datasets.load_wine) (Click it for more information).

In [1]:
import numpy as np
import pandas as pd
import sklearn


In [2]:
# load data


<font size='4'>

* Executing the above code returns a dictionary-like object (a structure we just learned about!) that contains the data and its metadata.
    * **Metadata**: Data that describes the data itself, such as a data dictionary.
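As a minimal sketch of what this dictionary-like object looks like, the snippet below loads the wine dataset with `load_wine()` and inspects a few of its keys. The variable name `wine` is an assumption chosen for illustration.

```python
from sklearn.datasets import load_wine

# load_wine() returns a Bunch, which behaves like a dictionary
wine = load_wine()

# Keys can be listed just like a regular dictionary
print(list(wine.keys()))

# Entries are available by key or by attribute
print(wine.data.shape)        # feature matrix
print(wine["target"].shape)   # class labels
print(wine.feature_names)     # column names (metadata)
print(wine.target_names)      # class names (metadata)
```

The `DESCR` key holds a longer text description of the dataset, which you can view with `print(wine.DESCR)`.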

In [3]:
# Convert the data to a Pandas dataframe


# Add the target label (add a new column)


# Preview
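
One possible way to fill in the cell above is sketched here. The names `wine` and `df` are assumptions for illustration; the notebook may use different ones.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()

# Convert the feature matrix to a pandas DataFrame
df = pd.DataFrame(wine.data, columns=wine.feature_names)

# Add the target label as a new column
df["target"] = wine.target

# Preview the first five rows
print(df.head())
```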


### Exploratory Data Analysis

<font size='4'>

* Before conducting any data analysis, always check the quality of the dataset with exploratory data analysis.
* You can call the `.info()` method to print a summary of each column.

<font size='4'>

* There are 178 data samples with 14 columns including the target column (output that we would like to predict).
* Luckily, the `Non-Null Count` column shows that our dataset has no missing values.
* All features are `float64`; only the target column has a different data type.
* The dataset consumes 19.6 KB of memory.
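
The observations above can be reproduced with a short sketch like the following. It rebuilds the DataFrame so that it runs on its own; `df` and `n_missing` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df["target"] = wine.target

# Summary of columns, non-null counts, dtypes, and memory usage
df.info()

# Explicit check for missing values across the whole table
n_missing = int(df.isna().sum().sum())
print("Missing values:", n_missing)
```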

<font size='4'>

* Since `target` is a categorical variable, we check the frequency and proportion of each class using the `.value_counts()` method.
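
A minimal sketch of this check is shown below. Passing `normalize=True` returns proportions instead of raw counts; the names `counts` and `proportions` are chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df["target"] = wine.target

# Frequency of each class
counts = df["target"].value_counts()
print(counts)

# Proportion of each class
proportions = df["target"].value_counts(normalize=True)
print(proportions.round(3))
```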

## Data preprocessing

<font size='4'>

* Preprocessing is important prior to applying machine learning algorithms.
    * Check missing values, outliers, duplicates, errors, and data types.
    * Avoid "Garbage in, garbage out."
* Machine learning models typically require numerical inputs.
* Another practice is to standardize the input (via Z-transform) to make predictors more comparable.
    * We do not want predictor A to have a magnitude of 10000 while predictor B has a magnitude of 0.1.
    * We can achieve this using the `StandardScaler()` class.
    * Each value is transformed column by column: $x_{scaled} = (x - x_{mean})/x_{sd}$.
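
To make the formula concrete, the sketch below applies the Z-transform by hand with NumPy and compares the result with `StandardScaler`. Note that `StandardScaler` divides by the population standard deviation, which corresponds to `ddof=0` in NumPy. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X = load_wine().data

# Manual Z-transform, applied column by column (axis=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

# The same transformation with scikit-learn
X_sklearn = StandardScaler().fit_transform(X)

# Both approaches should agree up to floating-point error
print(np.allclose(X_manual, X_sklearn))
```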

In [4]:
# Remember to import StandardScaler (imports should be written at the beginning of the Jupyter notebook).
# Split data into features (input) and label (output)


# Always make a copy to avoid changing the raw data (when you have enough memory).
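
A possible way to complete this cell is sketched below. The names `features` and `label` are assumptions; the last lines only demonstrate that editing the copy leaves the raw DataFrame untouched.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df["target"] = wine.target

# Split data into features (input) and label (output), working on copies
features = df.drop(columns="target").copy()
label = df["target"].copy()

# Changing the copy does not affect the raw data
original_value = df.iloc[0, 0]
features.iloc[0, 0] = -1.0
print(df.iloc[0, 0] == original_value)
```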



<font size='4'>
<img src="figures/sklearn_flowchart_1.png" alt="drawing" width="900"/>
    
* In this flowchart, no output data are involved. We use `.transform()` in the last step.

In [5]:
# Instantiate scaler and fit on features


# apply changes to training data
# and update parameters (in this case, no model parameters are available, so this is optional)


# apply changes to any data and assign the result to another variable


# view the transformed output
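
The steps listed in this cell could be written roughly as follows. This is a sketch under the assumption that the scaler is fitted on the full feature matrix, as in the flowchart; variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Instantiate the scaler and fit it on the features
# (fitting learns the column means and standard deviations)
scaler = StandardScaler()
scaler.fit(X)

# fit_transform() fits and transforms in a single call
X_fit_transformed = StandardScaler().fit_transform(X)

# transform() applies the already-fitted scaler to any data
X_scaled = scaler.transform(X)

# View the first rows of the transformed output
print(X_scaled[:5].round(2))
```

After fitting, the learned statistics are stored in `scaler.mean_` and `scaler.scale_`.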


## Model training

<font size='4'>

* As we previously mentioned, we need training and testing sets when we perform classification tasks.
* If the rows of the input data are independent of each other, we can randomly select the training and testing sets.
* The sklearn package has a built-in function called `train_test_split()`.
* A common choice is to use 70% of the data for training and 30% for testing. The exact ratio varies depending on the volume of the data.
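
A minimal sketch of a 70/30 split is shown below. The `random_state` value is an arbitrary assumption added only to make the split reproducible.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Check that the splits have the expected sizes
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```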

In [6]:
# remember to import train_test_split at the beginning


# Check the splits are correct



### Model building

<font size='4'>

* Sklearn has numerous built-in classification methods. We will demonstrate a few of them, including
    * logistic regression
    * support vector machine
    * decision tree classifier
    
* <img src="figures/sklearn_flowchart_2.png" alt="drawing" width="900"/>

    * In this flowchart, both input and output (in the training set) are involved, so `.predict()` is used in the final step.

In [7]:
# Instantiating the models
# Sometimes, you need to modify the parameters of each model.
# If you do not, the model will use its default values.
# We write one model together; the remaining two are completed by students.


# Training the models 


# Making predictions with each model
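
One way the three steps in this cell might look is sketched below. The dictionary names, `max_iter=1000`, and the `random_state` values are assumptions for illustration, not settings taken from the notebook.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)

# Instantiate the models (parameters not listed keep their defaults)
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "support_vector_machine": SVC(),
    "decision_tree": DecisionTreeClassifier(random_state=42),
}

# Train each model, then predict on the test set
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)

print(predictions["logistic_regression"][:10])
```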


In [8]:
# You can view the predicted class probabilities for each sample and each model.
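
As a sketch, the probabilities can be obtained with `.predict_proba()`, shown here for logistic regression only. Not every classifier provides this method with its default settings (`SVC`, for instance, does not), so treat this as one example rather than a general recipe. Variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)

log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# One row per test sample, one column per class
proba = log_reg.predict_proba(X_test)
print(proba[:5].round(3))

# Each row is a probability vector, so it sums to 1
print(proba.sum(axis=1)[:5])
```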


## Model Evaluation

<font size='4'>

* In our case, we will use `classification_report()` to build a text report showing the main classification metrics, such as `precision`, `recall`, `f1-score`, and `accuracy`.

In [9]:
# from sklearn.metrics import classification_report (put it at the beginning)
# Store model predictions in a dictionary
# this makes it easier to iterate through each model
# and print the results. 
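
The idea described in the comments above could be sketched like this. The models are retrained here so that the snippet runs on its own; the parameter choices and names are assumptions for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, wine.target, test_size=0.3, random_state=42
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "support_vector_machine": SVC(),
    "decision_tree": DecisionTreeClassifier(random_state=42),
}

# Store model predictions in a dictionary
predictions = {
    name: model.fit(X_train, y_train).predict(X_test)
    for name, model in models.items()
}

# Iterate through each model and print its report
for name, y_pred in predictions.items():
    print(f"=== {name} ===")
    print(classification_report(y_test, y_pred, target_names=list(wine.target_names)))
```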


<font size='4'>
The reported averages include:

* samples average (reported only for multilabel classification),
* macro average (the unweighted mean of the per-label metric),
* weighted average (the support-weighted mean of the per-label metric).
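
To illustrate the difference between the macro and weighted averages, the sketch below computes both by hand for the F1 score and compares them with scikit-learn's results. The labels `y_true` and `y_pred` are small made-up arrays, not results from the wine models.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_fscore_support

# Hypothetical labels, used only to illustrate the averaging
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 0])

# Per-label metrics; support is the number of true samples per label
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)

# Macro average: plain mean over labels
macro_f1 = f1.mean()

# Weighted average: mean over labels, weighted by support
weighted_f1 = np.average(f1, weights=support)

print(macro_f1, f1_score(y_true, y_pred, average="macro"))
print(weighted_f1, f1_score(y_true, y_pred, average="weighted"))
```

When the classes are balanced, the two averages are close; they diverge as the class sizes become more unequal.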

Which method performs the best?