<a href="https://colab.research.google.com/github/sigvehaug/MLwPython/blob/master/Course_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About this Introduction to the Winter School

A basic introduction to performing typical machine learning tasks with Python.

PD Dr. Sigve Haug, 2021. https://github.com/sigvehaug/MLwPython

Based on notebooks by Mykhailo Vladymyrov & Aris Marcolongo, 2020. https://github.com/neworldemancer/DSF5

This work is licensed under <a href="https://creativecommons.org/share-your-work/public-domain/cc0/">CC0</a>.


# Machine Learning (ML) in Context




Societal development may be seen as a process boosted by technological revolutions:

* 10k BC    Agricultural/Neolithic
* 1760-1840 Industrial 
* 1850-1900 Electromagnetic
* 1960-2000 Information
* 2010-     Biocognitive (gene design and artificial and extended intelligence)

AI has a huge automation potential, which is already being realised. Humans are outsourcing cognitive (brain) tasks. The full impact on society is hard to foresee. Some scholars even talk about the end of humanity (singularity). When one talks about AI, one normally means machine learning algorithms.

In the CAS Applied Data Science we are here:
* M1 Data Management and Acquisition
* M2 Statistical Inference
* M3 **Machine Learning with focus on Deep Learning**
* M4 Best Practices and Ethics
* M5 Electives



In the scientific/data science process we are in the loops of this diagram here:
<img src="https://github.com/sigvehaug/MLwPython/raw/master/figures/2013-sciencemethod.png" width="60%"/>

In the future, the full process may be taken over by a combination of machine learning algorithms.

# What is ML?

Unlike classical algorithms, which are created by humans to analyze some data:

<img src="https://github.com/neworldemancer/DSF5/raw/master/figures/alg_1.png" width="60%"/>

in machine learning the data itself is used to define the algorithm:

<img src="https://github.com/neworldemancer/DSF5/raw/master/figures/alg_2.png" width="60%"/>

A definition of ML (Tom Mitchell, 1997):

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."



# ML Tasks



A typical categorisation of ML tasks, with some application examples.

<img src="https://github.com/sigvehaug/MLwPython/raw/master/figures/ML-Categories.jpg" width="60%"/>

## Classification versus regression

The two main tasks handled by (supervised) ML are regression and classification.
In regression we aim at modeling the relationship between the system's response (dependent variable) and one or more explanatory variables (independent variables).

Examples of regression would be predicting the temperature for each day of the year, or the expenses of a household as a function of the number of children and adults.

In classification the aim is to identify which class a data point belongs to. For example, the species of an iris plant based on the size of its petals, or whether an email is spam or not based on its content.
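
The difference between the two tasks can be sketched on the iris dataset that ships with scikit-learn. This is only an illustration: the choice of `LogisticRegression` as classifier and of petal width as regression target are assumptions made for this example, not part of the course material.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression

iris = load_iris()
X = iris.data          # columns: sepal length, sepal width, petal length, petal width
species = iris.target  # class labels 0, 1, 2

# Classification: predict the discrete species from the two petal measurements
petals = X[:, 2:4]
clf = LogisticRegression(max_iter=1000)
clf.fit(petals, species)
print('Predicted classes:', clf.predict(petals[:3]))

# Regression: predict the continuous petal width from the petal length
petal_length = X[:, 2:3]   # keep 2D shape (n_samples, 1)
petal_width = X[:, 3]
reg = LinearRegression()
reg.fit(petal_length, petal_width)
print('Predicted petal widths:', reg.predict(petal_length[:3]))
```

The classifier returns one of a fixed set of labels, whereas the regressor returns a real number for each sample.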

In this introduction we don't cover clustering and reinforcement learning.



## ML Algorithms

ML researchers have designed, tested and implemented thousands of ML algorithms. Which one to choose depends on the task, the data and the cost (time, ethics, etc.). When doing ML, one needs to know the main categories and which tools/implementations to use, try a few algorithms, fine-tune the best one and bring it into production for one's task.

A good way to get an overview is just to look at the list of implementations in scikit-learn: https://scikit-learn.org/stable/user_guide.html

This is only one way to make an ML cheat sheet:

<img src="https://github.com/sigvehaug/MLwPython/raw/master/figures/cheatsheet.png" width="60%"/>




# ML Performance


## Overfitting and underfitting

When doing ML, the goal is to achieve both low bias and low variance, since the model then generalises best to unseen data. High bias is related to underfitting and high variance to overfitting.

<img src="https://github.com/neworldemancer/DSF5/raw/master/figures/Bias_variance_1.png" width="35%"/>

<img src="https://github.com/neworldemancer/DSF5/raw/master/figures/Bias_variance_2.png" width="60%"/>
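
Under- and overfitting can be illustrated with polynomial fits of increasing degree to noisy data. The sketch below uses synthetic data; the sine function, the noise level and the degrees 1, 4 and 12 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_function(x):
    # Hypothetical ground truth used to generate the synthetic data
    return np.sin(2 * np.pi * x)

# Small noisy training set and a larger noisy test set
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = true_function(x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.sort(rng.uniform(0, 1, 200))
y_test = true_function(x_test) + rng.normal(0, 0.2, x_test.size)

results = {}
for degree in (1, 4, 12):
    # Least-squares polynomial fit of the given degree
    coeffs = np.polyfit(x_train, y_train, degree)
    mse_train = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    mse_test = np.mean((y_test - np.polyval(coeffs, x_test)) ** 2)
    results[degree] = (mse_train, mse_test)
    print(f'degree {degree:2d}: train MSE = {mse_train:.3f}, test MSE = {mse_test:.3f}')
```

The training error can only decrease as the degree grows. A model that is too simple (degree 1) has a large error on both datasets (underfitting), while a very flexible model (degree 12) typically shows a small training error together with a larger test error (overfitting).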

## Training and test data

To measure model performance in an unbiased way, we need to use different data than the data the model was trained on. For this we use the 'train-test' split: e.g. 20% of the available data is reserved for testing model performance, and the remaining 80% is used for the actual model training.
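
A minimal sketch of such a split with scikit-learn's `train_test_split`, using a made-up dataset of 100 samples. The fixed `random_state` is only there to make the shuffling reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 100 samples with 2 features each, and one target value per sample
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Reserve 20% of the samples for testing; the samples are shuffled before splitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

Every sample ends up in exactly one of the two parts, so the test samples are never seen during training.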

## Performance Measures

### Regression:
* Mean Square Error: $\textrm{MSE}=\frac{1}{n}\sum_i(y_i - \hat y(\bar x_i))^2$
* Mean Absolute Error: $\textrm{MAE}=\frac{1}{n}\sum_i|y_i - \hat y(\bar x_i)|$
* Median Absolute Deviation: $\textrm{MAD}=\textrm{median}(|y_i - \hat y(\bar x_i)|)$
* Fraction of the explained variance: $R^2=1-\frac{\sum_i(y_i - \hat y(\bar x_i))^2}{\sum_i(y_i - \bar y)^2}$, where $\bar y=\frac{1}{n}\sum_i y_i$
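
The measures above can be computed directly with NumPy and compared with the corresponding functions in `sklearn.metrics`. The four values in `y_true` and `y_pred` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             median_absolute_error, r2_score)

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # made-up true values
y_pred = np.array([2.5, 0.0, 2.0, 8.0])    # made-up model predictions

residuals = y_true - y_pred

# Direct implementations of the formulas
mse = np.mean(residuals ** 2)
mae = np.mean(np.abs(residuals))
mad = np.median(np.abs(residuals))
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# The same measures from scikit-learn
print('MSE:', mse, mean_squared_error(y_true, y_pred))
print('MAE:', mae, mean_absolute_error(y_true, y_pred))
print('MAD:', mad, median_absolute_error(y_true, y_pred))
print('R2: ', r2, r2_score(y_true, y_pred))
```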




### Classification:
* Confusion matrix 

<img src="https://github.com/neworldemancer/DSF5/raw/master/figures/confusion_mtr.png" width="60%"/>

* Accuracy $=\frac{\textrm{TP} + \textrm{TN}}{\textrm{TP} + \textrm{FP} + \textrm{FN} + \textrm{TN}}$
* Precision $=\frac{\textrm{TP}}{\textrm{TP} + \textrm{FP}}$ 
* Recall $=\frac{\textrm{TP}}{\textrm{TP} + \textrm{FN}}$
* F1 $=2\frac{\textrm{Precision} \cdot \textrm{Recall}}{\textrm{Precision} + \textrm{Recall}} = \frac{2 \textrm{TP}}{2 \textrm{TP} + \textrm{FP} + \textrm{FN}}$
* Threat score (TS), or Intersection over Union: $\mathrm{IoU}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}+\mathrm{FP}}$
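
As a sketch, the measures above can be computed from the four entries of a binary confusion matrix and checked against `sklearn.metrics`. The labels below are made up for illustration, with class 1 taken as the positive class.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, jaccard_score)

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])   # made-up true labels
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])   # made-up predicted labels

# For binary labels (0, 1) the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
iou = tp / (tp + fn + fp)

print('Accuracy: ', accuracy, accuracy_score(y_true, y_pred))
print('Precision:', precision, precision_score(y_true, y_pred))
print('Recall:   ', recall, recall_score(y_true, y_pred))
print('F1:       ', f1, f1_score(y_true, y_pred))
print('IoU:      ', iou, jaccard_score(y_true, y_pred))
```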


During model optimization, the measure being optimized must in most cases be differentiable. To this end, a measure of the similarity of distributions (e.g. cross-entropy) is usually employed.
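
A small sketch of why a smooth measure is useful: accuracy only changes when a prediction crosses the decision threshold, whereas the binary cross-entropy changes continuously with the predicted probabilities. The helper `cross_entropy` and the two sets of probabilities are made up for this example.

```python
import numpy as np
from sklearn.metrics import log_loss

def cross_entropy(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy, averaged over the samples."""
    p_pred = np.clip(p_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

y_true = np.array([1, 0, 1, 1])
p_confident = np.array([0.9, 0.1, 0.8, 0.7])   # predicted probability of class 1
p_hesitant = np.array([0.6, 0.4, 0.6, 0.6])

for name, p in (('confident', p_confident), ('hesitant', p_hesitant)):
    accuracy = np.mean((p > 0.5) == y_true)
    print(f'{name}: accuracy = {accuracy:.2f}, '
          f'cross-entropy = {cross_entropy(y_true, p):.3f}, '
          f'sklearn log_loss = {log_loss(y_true, p):.3f}')
```

Both sets of predictions have the same accuracy, but the cross-entropy is lower for the more confident correct predictions, so it can guide a gradient-based optimizer.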

# ML Data / Experience


Data is any sequence of symbols. For the current ML tools (Python scikit-learn, TensorFlow, etc.) data must be numeric and arranged in a table (dataframe) or an array with one, two, or more dimensions (tensor). So any data, such as images, videos, sound or text, must be turned into such a table before processing it with ML. Cleaning and preprocessing data into these tables is normally the most time-consuming part of data science and ML projects.
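
As a rough sketch of what "turning data into a table" can mean, the example below flattens a tiny grayscale image into one row of features and maps a text column with quality grades to numbers. The image, the table `df_demo` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# A tiny 2x3 grayscale "image": pixel values are already numbers,
# so it only needs to be flattened into one row with 6 features
image = np.array([[0, 128, 255],
                  [64, 32, 16]])
image_row = image.reshape(1, -1)
print(image_row.shape)

# A small table with a text column holding ordered quality grades
df_demo = pd.DataFrame({'quality': ['Fa', 'Gd', 'TA', 'Ex'],
                        'area': [50.0, 80.0, 65.0, 120.0]})

# Map the ordered grades to increasing integers
quality_to_number = {'Po': 0, 'Fa': 1, 'TA': 2, 'Gd': 3, 'Ex': 4}
df_demo['quality'] = df_demo['quality'].map(quality_to_number)

# Purely numeric table of shape (n_samples, n_features)
table = df_demo.to_numpy(dtype=np.float32)
print(table)
```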

And, if your data is rubbish, the results will be rubbish, too.

# The `scikit-learn` Interface - some words

In this course we will primarily use the scikit-learn module. You can find extensive documentation with examples in the user guide: https://scikit-learn.org/stable/user_guide.html

The module contains a large number of different machine learning methods, and here we will cover only a few of them. What is great about scikit-learn is that it has a uniform and consistent interface.

All the different ML approaches are implemented as classes that share the same main methods:

* `fitter = ...`: Create the object.
* `fitter.fit(X, y[, sample_weight])`: Fit the model.
* `y_pred = fitter.predict(X)`: Predict using the fitted model.
* `s = fitter.score(X, y[, sample_weight])`: Return an appropriate measure of model performance.

This allows one to easily replace one approach with another and find the best one for the problem at hand, by simply using another regression/classification object, while the rest of the code can remain the same.

It is useful to know that generally in scikit-learn the input data is represented as a design matrix $X$ of dimensions `n_samples x n_features`, whereas the supervised labels/values are stored in a matrix $Y$ of dimensions `n_samples x n_targets`.
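
The uniform interface can be sketched as follows: two different regressors are created, fitted and scored with exactly the same calls. The synthetic data and the choice of `DecisionTreeRegressor` as the second model are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: design matrix X of shape (n_samples, n_features) and targets y
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

scores = {}
for fitter in (LinearRegression(),
               DecisionTreeRegressor(max_depth=5, random_state=0)):
    fitter.fit(X, y)               # fit the model
    y_pred = fitter.predict(X)     # predict with the fitted model
    scores[type(fitter).__name__] = fitter.score(X, y)   # R2 for regressors

print(scores)
```

Only the line that creates the object changes between the two models. Note that the scores here are computed on the training data, so they do not measure how well the models generalise.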

# A first example - linear regression for house prices 

In many cases the scalar value of interest, the dependent variable, is (or can be approximated as) a linear combination of the independent variables.

In linear regression the estimator is sought in the form: $$\hat{y}(w, x) = w_0 + w_1 x_1 + \dots + w_p x_p$$

The parameters $w = (w_1,..., w_p)$ and $w_0$ are designated as `coef_` and `intercept_` in `sklearn`.
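
As a quick check of this notation, one can fit a linear model to synthetic data generated with known parameters and read them back from `coef_` and `intercept_`. The values $w_0 = 1$, $w_1 = 2$, $w_2 = -3$ and the noise level are arbitrary choices for this sketch.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data: y = w_0 + w_1 * x_1 + w_2 * x_2 + small noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(500, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0, 0.01, size=500)

reg = linear_model.LinearRegression()
reg.fit(X, y)

print('coef_      (w_1, w_2):', reg.coef_)
print('intercept_ (w_0):     ', reg.intercept_)
```

The fitted values should be close to the parameters used to generate the data.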

Reference: https://scikit-learn.org/stable/modules/linear_model.html

## Load libraries

In [None]:
# Scikit-learn (formerly scikits.learn and also known as sklearn) is a free 
# software machine learning library for the Python programming language. 
# It features various classification, regression and clustering algorithms, 
# and is designed to interoperate with the Python numerical and scientific 
# libraries NumPy and SciPy. (from wiki)

from sklearn import linear_model
from sklearn.model_selection import train_test_split

# common visualization module
from matplotlib import pyplot as plt

# numeric module
import numpy as np
# data analysis module
import pandas as pd

%matplotlib inline

## Load and preprocess dataset

Subset of the Ames Houses dataset: http://jse.amstat.org/v19n3/decock.pdf

In [None]:
def house_prices_dataset(return_df=False, price_max=400000, area_max=40000):
#  path = 'data/AmesHousing.csv'
  path = 'https://raw.githubusercontent.com/sigvehaug/MLwPython/master/data/AmesHousing.csv'
  df = pd.read_csv(path, na_values=('NaN', ''), keep_default_na=False)
  
  # Clean up the column names
  rename_dict = {k:k.replace(' ', '').replace('/', '') for k in df.keys()}
  df.rename(columns=rename_dict, inplace=True)
  
  # Select the columns to be used and make feature and target dataframe
  useful_fields = ['LotArea',
                  'Utilities', 'OverallQual', 'OverallCond',
                  'YearBuilt', 'YearRemodAdd', 'ExterQual', 'ExterCond',
                  'HeatingQC', 'CentralAir', 'Electrical',
                  '1stFlrSF', '2ndFlrSF','GrLivArea',
                  'FullBath', 'HalfBath',
                  'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
                  'Functional','PoolArea',
                  'YrSold', 'MoSold'
                  ]
  target_field = 'SalePrice'

  df.dropna(axis=0, subset=useful_fields+[target_field], inplace=True)

  cleanup_nums = {'Street':      {'Grvl': 0, 'Pave': 1},
                  'LotFrontage': {'NA':0},
                  'Alley':       {'NA':0, 'Grvl': 1, 'Pave': 2},
                  'LotShape':    {'IR3':0, 'IR2': 1, 'IR1': 2, 'Reg':3},
                  'Utilities':   {'ELO':0, 'NoSeWa': 1, 'NoSewr': 2, 'AllPub': 3},
                  'LandSlope':   {'Sev':0, 'Mod': 1, 'Gtl': 3},
                  'ExterQual':   {'Po':0, 'Fa': 1, 'TA': 2, 'Gd': 3, 'Ex':4},
                  'ExterCond':   {'Po':0, 'Fa': 1, 'TA': 2, 'Gd': 3, 'Ex':4},
                  'BsmtQual':    {'NA':0, 'Po':1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex':5},
                  'BsmtCond':    {'NA':0, 'Po':1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex':5},
                  'BsmtExposure':{'NA':0, 'No':1, 'Mn': 2, 'Av': 3, 'Gd': 4},
                  'BsmtFinType1':{'NA':0, 'Unf':1, 'LwQ': 2, 'Rec': 3, 'BLQ': 4, 'ALQ':5, 'GLQ':6},
                  'BsmtFinType2':{'NA':0, 'Unf':1, 'LwQ': 2, 'Rec': 3, 'BLQ': 4, 'ALQ':5, 'GLQ':6},
                  'HeatingQC':   {'Po':0, 'Fa': 1, 'TA': 2, 'Gd': 3, 'Ex':4},
                  'CentralAir':  {'N':0, 'Y': 1},
                  'Electrical':  {'':0, 'NA':0, 'Mix':1, 'FuseP':2, 'FuseF': 3, 'FuseA': 4, 'SBrkr': 5},
                  'KitchenQual': {'Po':0, 'Fa': 1, 'TA': 2, 'Gd': 3, 'Ex':4},
                  'Functional':  {'Sal':0, 'Sev':1, 'Maj2': 2, 'Maj1': 3, 'Mod': 4, 'Min2':5, 'Min1':6, 'Typ':7},
                  'FireplaceQu': {'NA':0, 'Po':1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex':5},
                  'PoolQC':      {'NA':0, 'Fa': 1, 'TA': 2, 'Gd': 3, 'Ex':4},
                  'Fence':       {'NA':0, 'MnWw': 1, 'GdWo': 2, 'MnPrv': 3, 'GdPrv':4},
                  }

  df_X = df[useful_fields].copy()                              
  df_X.replace(cleanup_nums, inplace=True)  # convert ordinal categorical variables to numerical
  df_Y = df[target_field].copy()

  # Convert to numpy arrays and return only rows with values below given maxima 
  x = df_X.to_numpy().astype(np.float32)
  y = df_Y.to_numpy().astype(np.float32)

  if price_max>0:
    idxs = y<price_max
    x = x[idxs]
    y = y[idxs]

  if area_max>0:
    idxs = x[:,0]<area_max
    x = x[idxs]
    y = y[idxs]

  return (x, y, df) if return_df else (x,y)

In [None]:
x, y, df = house_prices_dataset(return_df=True)
print(x.shape, y.shape)
df.head()

In [None]:
plt.plot(x[:, 0], y, '.r')
plt.xlabel('Area / ft^2')
plt.ylabel('Price / USD');

## Train/fit the ML model

In [None]:
# Make train/test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Fit the model
reg = linear_model.LinearRegression()
reg.fit(x_train, y_train)

## Evaluate and plot the results

In [None]:
# Evaluate MSE, MAE, and R2 on train and test datasets

# Prediction:
y_p_train = reg.predict(x_train)
y_p_test = reg.predict(x_test)

# MSE: mean of the squared residuals
print('Train MSE = %.3e' % np.mean((y_train - y_p_train)**2))
print('Test MSE = %.3e' % np.mean((y_test - y_p_test)**2))
# MAE: mean of the absolute residuals
print('Train MAE = %5.2f' % np.mean(np.abs(y_train - y_p_train)))
print('Test MAE = %5.2f' % np.mean(np.abs(y_test - y_p_test)))
# R2
print('Train R2 = %5.2f' % reg.score(x_train, y_train))
print('Test R2 = %5.2f' % reg.score(x_test, y_test))

# Plot y vs predicted y for test and train parts
plt.figure(figsize=(10,10))
plt.plot(y_train, y_p_train, 'b.', label='Train')
plt.plot(y_test, y_p_test, 'r.', label='Test')

plt.plot([0], [0], 'w.')  # dummy to have origin
plt.xlabel('True Price')
plt.ylabel('Predicted Price')
#plt.gca().set_aspect('equal')
plt.legend()
plt.show()

# Hands-on / Hackathon Session

Use the linear regression example as a template and fit house prices with a decision tree model. You can write, copy and paste your solutions below. Have fun! Do you see overfitting in your solution?

In [None]:
# Import libraries
#from sklearn import tree


# The rest of the Winter School

Now we have seen how to use the Python library scikit-learn for ML. Tomorrow and the rest of the week, you will learn about deep neural networks and use TensorFlow to perform ML.