# Hands-on Introduction to Python And Machine Learning

Instructor: Tak-Kei Lam

(Readers are assumed to have a little programming background.)


## Some categories of popular machine learning algorithms

### 1. Regression
In short, regression is about "curve fitting" while taking some constraints into account. The idea is to derive a general model that describes the data using some function (a line, a curve, etc.).

Example:
- Linear regression

#### Linear regression

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd

# Load the dataset
pokemons = pd.read_csv("pokemon.csv")

# Use two features: HP and Attack
pokemons_HP = pokemons['HP']
pokemons_Attack = pokemons['Attack']

# .values (used below) converts a pandas Series into a NumPy array

# Split the data into training/test sets
last_index = -int(0.20*len(pokemons_HP))
pokemons_HP_train = pokemons_HP[:last_index].values
pokemons_HP_test = pokemons_HP[last_index:].values 

# Split the targets into training/testing sets
last_index = -int(0.20*len(pokemons_Attack))
pokemons_Attack_train = pokemons_Attack[:last_index].values 
pokemons_Attack_test = pokemons_Attack[last_index:].values 

# reshape each 1-D array into a single column (shape (n, 1)) so that the data can be used by sklearn
pokemons_HP_train = np.reshape(pokemons_HP_train, (-1, 1))
pokemons_HP_test = np.reshape(pokemons_HP_test, (-1, 1))
pokemons_Attack_train = np.reshape(pokemons_Attack_train, (-1, 1))
pokemons_Attack_test = np.reshape(pokemons_Attack_test, (-1, 1))

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(pokemons_HP_train, pokemons_Attack_train)

# Make predictions using the testing set
pokemons_Attack_pred = regr.predict(pokemons_HP_test)

print('Coefficients: \n', regr.coef_)
print('Intercept: \n', regr.intercept_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(pokemons_Attack_test, pokemons_Attack_pred))
# Coefficient of determination (R^2): 1 is perfect prediction
print('R^2 score: %.2f' % r2_score(pokemons_Attack_test, pokemons_Attack_pred))

# Plot outputs
plt.figure(figsize=(4,4), dpi=100)
plt.scatter(pokemons_HP_test, pokemons_Attack_test,  color='black')
plt.plot(pokemons_HP_test, pokemons_Attack_pred, color='blue', linewidth=3)
plt.xlabel("HP")
plt.ylabel("Attack")
plt.show()
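
The cell above holds out the last 20% of the rows as the test set. If the rows of the CSV happen to be ordered in some way (an assumption, since the file is not shown here), those rows may not be representative of the rest. A minimal sketch of a shuffled split follows. `shuffled_split` is a hypothetical helper name, and the arrays are synthetic stand-ins, not the real Pokemon data.

```python
import numpy as np

def shuffled_split(features, targets, test_fraction=0.20, seed=0):
    """Shuffle the rows, then hold out `test_fraction` of them for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))  # random row order
    # same rounding rule as the manual split: int(0.20 * number of rows)
    split = len(features) - int(test_fraction * len(features))
    train_idx, test_idx = order[:split], order[split:]
    return features[train_idx], features[test_idx], targets[train_idx], targets[test_idx]

# synthetic stand-ins for the HP and Attack columns (not real data)
hp = np.arange(10, 110, 10)   # 10 values: 10, 20, ..., 100
attack = hp * 2

hp_train, hp_test, attack_train, attack_test = shuffled_split(hp, attack)
print(len(hp_train), len(hp_test))
```

The same indices are applied to the features and the targets, so each HP value stays paired with its Attack value. scikit-learn also provides `train_test_split` in `sklearn.model_selection` for this kind of split.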

#### Procedures of linear regression
(The explanation considers only two variables, but the concept extends readily to multiple variables.)
1. Assume the data can be modelled by a straight line. What we need to do is figure out its slope and intercept. That is, the model has the form $y = b_0 + b_1x$.
2. Assume the line passes through the point $(\bar{x}, \bar{y})$, i.e. the mean of the x-coordinates and the mean of the y-coordinates.
3. Find the distance between each data point and the mean along the x-axis (and along the y-axis). These distances measure how far each data point deviates from the means; the error of a data point is how far it deviates from the model.
4. Try to minimize the overall error over all data points by making the errors "more even". There are many methods to achieve that. For example, we can use *Ordinary Least Squares*.

Example:
Suppose we have the data: $\{(0.9, 1), (1, 0.9), (2.1, 1.8), (1.9, 2.05)\}$

$\bar{x} = \text{mean}(x) = \frac{0.9 + 1 + 2.1 + 1.9}{4} = 1.475$

$\bar{y} = \text{mean}(y) = \frac{1 + 0.9 + 1.8 + 2.05}{4} = 1.4375$
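
To make the idea of "minimizing the overall error" concrete, here is a small sketch in plain Python. It tries a few candidate slopes for a line forced through $(\bar{x}, \bar{y})$ and reports the sum of squared vertical errors for each. The candidate slopes are arbitrary guesses chosen for illustration, and `sum_of_squared_errors` is a hypothetical helper name.

```python
data = [(0.9, 1), (1, 0.9), (2.1, 1.8), (1.9, 2.05)]

x_mean = sum(x for x, _ in data) / len(data)
y_mean = sum(y for _, y in data) / len(data)

def sum_of_squared_errors(slope):
    """Total squared vertical error of the line through (x_mean, y_mean) with this slope."""
    intercept = y_mean - slope * x_mean
    return sum((y - (intercept + slope * x)) ** 2 for x, y in data)

# candidate slopes are arbitrary guesses, not the optimal value
candidates = (0.0, 0.5, 1.0, 1.5)
for slope in candidates:
    print(f"slope={slope:.1f}  SSE={sum_of_squared_errors(slope):.4f}")

best_guess = min(candidates, key=sum_of_squared_errors)
print("best of the guesses:", best_guess)
```

Among these four guesses, a slope of 1.0 gives the smallest error. Ordinary least squares finds the slope that minimizes this quantity exactly, instead of by guessing.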

In [None]:
# let's plot some graphs to explain the idea

import numpy as np
import matplotlib.pyplot as plt

data = [(0.9,1),(1,0.9),(2.1,1.8),(1.9,2.05)]

data_x  = np.array([d[0] for d in data]) # we use numpy array for the convenience of performing operations such as (vector-vector)
data_y  = np.array([d[1] for d in data])

xmean = np.array([1.475 for i in range(0,4)])
y = [i for i in range(0,4)]
ymean = np.array([1.4375 for i in range(0,4)])
x = [i for i in range(0,4)]

plt.figure(figsize=(10,8), dpi=100)

# ------------------------- first plot
plt.subplot(2,2,1)
plt.scatter(data_x, data_y,  color='black')
plt.xlim(0, 3)
plt.ylim(0, 3)
plt.title('Data points')

# ------------------------- second plot
plt.subplot(2,2,2)
plt.scatter(data_x, data_y,  color='black')

# plot mean of x
plt.plot(xmean, y, label='x mean') 
# plot mean of y
plt.plot(x, ymean, label='y mean') 
plt.legend()

# plot x error
for i in range(0, len(data)):
    plt.plot([data_x[i], xmean[0]], [data_y[i], data_y[i]], '--')
# plot y error
for i in range(0, len(data)):
    plt.plot([data_x[i], data_x[i]], [data_y[i], ymean[0]], '--') 

plt.xlim(0, 3)
plt.ylim(0, 3)
plt.title('Distances of the data points from the means')

# ------------------------- third plot
plt.subplot(2,2,3)
plt.scatter(data_x, data_y,  color='black')

# plot mean of x
plt.plot(xmean, y, label='x mean') 
# plot mean of y
plt.plot(x, ymean, label='y mean') 


# plot x error
for i in range(0, len(data)):
    plt.plot([data_x[i], xmean[0]], [data_y[i], data_y[i]], '--')
# plot y error
for i in range(0, len(data)):
    plt.plot([data_x[i], data_x[i]], [data_y[i], ymean[0]], '--') 
    
# plot linear regression guesses
# (separate names keep the x and y lists used for the mean lines intact)
guess_x = np.linspace(0,4,4)
guess_y = (guess_x - xmean[0])/((3-xmean[0])/(3-ymean[0]))+ymean[0]
plt.plot(guess_x, guess_y, '--', label='guess 1')
guess_y = (guess_x - xmean[0])/((2-xmean[0])/(3-ymean[0]))+ymean[0]
plt.plot(guess_x, guess_y, '--', label='guess 2')
guess_y = (guess_x - xmean[0])/((2-xmean[0])/(1.5-ymean[0]))+ymean[0]
plt.plot(guess_x, guess_y, '--', label='guess 3')
plt.text(2.5, 2.75, '?')
plt.text(1.75, 2.75, '?')
plt.text(2.75, 1.75, '?')
plt.legend()

plt.xlim(0, 3)
plt.ylim(0, 3)
plt.title('Guesses of linear regression')

plt.tight_layout()
plt.show()

Assume we have designed the model. We can check how each data point is different from what the model predicts. We can obtain the expected y-coordinate $y_i^\prime$, where $y_i^\prime = b_0 + b_1 x_i$ for every data point with $x_i$.

The difference between the actual y-coordinate $y_i$ and the expected y-coordinate $y_i^\prime$ is then $y_i - y_i^\prime = y_i - b_0 - b_1 x_i$. The idea of *ordinary least squares* is to minimize the sum of $(y_i - b_0 - b_1 x_i)^2$.

Let $f = \sum{(y_i - b_0 - b_1 x_i)^2}$. We can use differentiation to find $b_0$ and $b_1$ such that $f$ is minimum.

Set:
- $\frac{\partial{f}}{\partial{b_0}} = -2 \sum{(y_i - b_0 -b_1 x_i)} = 0$
- $\frac{\partial{f}}{\partial{b_1}} = -2 \sum{(y_i - b_0 -b_1 x_i) x_i} = 0$

Then (after many steps...), we can obtain:
- $b_0 = \bar{y} - b_1\bar{x}$ (hey, why?)
- $b_1 = \frac{\sum{x_i(y_i - \bar{y})}}{\sum{x_i(x_i - \bar{x})}} = \frac{\sum{(x_i-\bar{x})(y_i - \bar{y})}}{\sum{(x_i - \bar{x})(x_i - \bar{x})}}$ (hey, 1$^{\text{st}}$ why: why are the two expressions equivalent? 2$^{\text{nd}}$ why: why do we prefer the latter?)

 
Therefore:

$\text{slope of the line} = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sum{(x_i - \bar{x})(x_i - \bar{x})}}$
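
Before working out why the two expressions for $b_1$ are equivalent, it can help to check numerically that they agree. The sketch below evaluates both forms on the four example data points in plain Python. The variable names `b1_first`, `b1_second` and `b0` are chosen here for illustration.

```python
data = [(0.9, 1), (1, 0.9), (2.1, 1.8), (1.9, 2.05)]
n = len(data)
x_mean = sum(x for x, _ in data) / n
y_mean = sum(y for _, y in data) / n

# first form: sum of x_i (y_i - y_mean) over sum of x_i (x_i - x_mean)
b1_first = (sum(x * (y - y_mean) for x, y in data)
            / sum(x * (x - x_mean) for x, _ in data))

# second form: sum of (x_i - x_mean)(y_i - y_mean) over sum of (x_i - x_mean)^2
b1_second = (sum((x - x_mean) * (y - y_mean) for x, y in data)
             / sum((x - x_mean) ** 2 for x, _ in data))

# intercept from b_0 = y_mean - b_1 * x_mean
b0 = y_mean - b1_second * x_mean

print(f"b1 (first form)  = {b1_first:.4f}")
print(f"b1 (second form) = {b1_second:.4f}")
print(f"b0               = {b0:.4f}")
```

Both forms give the same slope up to floating-point rounding. A numeric check like this is not a proof, but it is a quick way to catch a mistyped formula.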

In [None]:
# find the linear regression line of a set of data

from sklearn import linear_model

diff_x = data_x - xmean # if we do not use numpy array, we cannot do (vector - vector)
diff_y = data_y - ymean

# remember our model is in this form? y = b_0 + b_1 * x
# slope is b_1 in our linear regression model

slope = np.sum(diff_x * diff_y) / np.sum(diff_x * diff_x) 

# after we have calculated the slope, we can find the y-intercept b_0 by substituting the slope and (x mean, y mean) into the formula
y_intercept = ymean[0] - slope * xmean[0]


# let's plot the graphs
plt.figure(figsize=(10,8), dpi=100)
plt.scatter(data_x, data_y,  color='black')

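# plot mean of x (a vertical line at x = x mean)
plt.plot([xmean[0], xmean[0]], [0, 3], label='x mean')
# plot mean of y (a horizontal line at y = y mean)
plt.plot([0, 3], [ymean[0], ymean[0]], label='y mean')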

# plot x error
for i in range(0, len(data)):
    plt.plot([data_x[i], xmean[0]], [data_y[i], data_y[i]], '--')
# plot y error
for i in range(0, len(data)):
    plt.plot([data_x[i], data_x[i]], [data_y[i], ymean[0]], '--') 
    
# evaluate the manually derived line at a few x positions
line_x = np.linspace(0, 4, 4)
line_y = y_intercept + slope * line_x
plt.plot(line_x, line_y, label='Manual linear regression')

# let's try scikit-learn's linear regression, fitted on the original data points
regr = linear_model.LinearRegression()
regr.fit(np.reshape(data_x, (-1, 1)), data_y)
y_pred = regr.predict(np.reshape(line_x, (-1, 1)))
plt.plot(line_x, y_pred, '--', label='Linear regression by scikit-learn')

plt.legend()
plt.xlim(0, 3)
plt.ylim(0, 3)
plt.title('Linear regression');
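
The plot compares the two lines visually. A numeric comparison can make the agreement explicit. In the sketch below, `np.polyfit` with degree 1 is swapped in as an independent least-squares fit (it is not scikit-learn), so that the example depends on NumPy only. The names `fit_slope` and `fit_intercept` are chosen here for illustration.

```python
import numpy as np

data = [(0.9, 1), (1, 0.9), (2.1, 1.8), (1.9, 2.05)]
data_x = np.array([d[0] for d in data])
data_y = np.array([d[1] for d in data])

# closed-form ordinary least squares, using the slope formula derived above
diff_x = data_x - data_x.mean()
diff_y = data_y - data_y.mean()
slope = np.sum(diff_x * diff_y) / np.sum(diff_x * diff_x)
y_intercept = data_y.mean() - slope * data_x.mean()

# independent check: a degree-1 polyfit returns [slope, intercept]
fit_slope, fit_intercept = np.polyfit(data_x, data_y, 1)

print("slope:    ", slope, fit_slope)
print("intercept:", y_intercept, fit_intercept)
```

The two pairs of numbers should match up to floating-point rounding, since both approaches minimize the same sum of squared errors.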


#### Use cases
- Finding the relationship between two continuous variables
    - The higher the HP, the higher the Attack?
- Prediction
    - Who is the champion of World Cup 2018?
- Classification
    - See whether a data point is closer to a regression line A or another regression line B?

**Exercise**:
- Try to use scikit-learn's linear regression to model the following data points:

In [1]:
data = [(0, 0), (1, 1), (2, 2), (3, 3)] # where each data point is a pair of 2D coordinates. You may regard the first member as the x value, and the second member as the y value on a 2D Cartesian plane

- What are the intercept and the coefficients? Are they what you expect?