# Understanding Supervised Learning

_Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs._

![Image](../images/ml_map.jpg)

## Before we start:

- __Data exploration:__ shape, descriptive statistics (numeric, categorical, timestamp), visualization, domain knowledge

- __Data transformations:__ arbitrary, modeling (joins, feature engineering), performance (outliers, scaling, encoding)
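
The performance-oriented transformations named above (outliers, scaling, encoding) can be sketched on a toy table. This is a minimal illustration, not the notebook's own preprocessing: the `pitch` column, the values, and the 1.5 × IQR clipping rule are all assumptions made for the example.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one numeric column with an outlier, one categorical column
df = pd.DataFrame({
    "speed": [38.0, 40.0, 41.0, 39.0, 120.0],  # 120.0 is an obvious outlier
    "pitch": ["fast", "curve", "fast", "slider", "curve"],
})

# Outliers: clip values that fall more than 1.5 * IQR outside the quartiles
q1, q3 = df["speed"].quantile([0.25, 0.75])
iqr = q3 - q1
df["speed_clipped"] = df["speed"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Scaling: rescale to zero mean and unit variance
scaler = StandardScaler()
df["speed_scaled"] = scaler.fit_transform(df[["speed_clipped"]]).ravel()

# Encoding: one indicator column per category
encoded = pd.get_dummies(df, columns=["pitch"])
print(encoded.columns.tolist())
```

In a real workflow the scaler would be fitted on the training rows only and then applied to the test rows, so that no test information reaches the model.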

![Image](https://media.giphy.com/media/ZThQqlxY5BXMc/giphy.gif)

In [None]:
# Imports

import matplotlib.pyplot as plt
import random
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

%matplotlib inline

In [None]:
# Read data

file = '../data/baseball_100_ok.csv'
data = pd.read_csv(file)
data.head()

In [None]:
# Shape

data.shape

In [None]:
# Descriptive statistics

data.describe()

In [None]:
# Visualization

fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(16,8))
ax.set(xlabel='Speed (m/s)', 
       ylabel='Time (s)',
       title='Baseball Speed-Time Relation')
ax.scatter(data['speed'], data['time'], c='grey')
plt.show()

### Domain knowledge

![Image](../images/velocity.png)

![Image](https://media.giphy.com/media/l0HlIJQUdby5FzlZe/giphy.gif)
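
Assuming the relation in play here is the usual speed = distance / time (the figure itself is not reproduced in text), a quick sanity check is to recompute speed from the other two columns. The numbers below are hypothetical; only the column names mirror the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical measurements: distance in metres, time in seconds
toy = pd.DataFrame({
    "distance": [18.4, 18.4, 18.4],
    "time": [0.46, 0.50, 0.40],
})

# Assumed physical relation: speed = distance / time
toy["speed"] = toy["distance"] / toy["time"]
print(toy)
```

Under that assumption speed is inversely proportional to time, so a model that is linear in `time` and `distance` can only approximate the relation over the observed range.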

In [None]:
# Features (X) and target (y)

X = data[['time','distance']]
y = data['speed']
print(X.shape,y.shape)

In [None]:
X

In [None]:
y

---

## Train-Test Split:

![Image](../images/train-test-split.jpg)
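
A train-test split amounts to shuffling the rows and holding part of them back. The sketch below does this by hand with NumPy; `manual_train_test_split` is a hypothetical helper written for illustration, and it will not reproduce the exact rows that scikit-learn's `train_test_split` selects.

```python
import numpy as np

def manual_train_test_split(X, y, test_size=0.2, seed=42):
    """Shuffle row indices and hold out the first `test_size` fraction as the test set."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(np.ceil(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy arrays: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = manual_train_test_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

The key property is that every row lands in exactly one of the two sets, so the test rows stay unseen during training.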

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"X_train: {X_train.shape}, X_test: {X_test.shape}, y_train: {y_train.shape}, y_test: {y_test.shape}")
print(f"X_train: {type(X_train)}, X_test: {type(X_test)}, y_train: {type(y_train)}, y_test: {type(y_test)}")

In [None]:
X_train.describe()

In [None]:
y_train.describe()

In [None]:
X_test.describe()

In [None]:
y_test.describe()

---

## Models

![Image](../images/models.png)
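
The scikit-learn models used in the following cells share the same pattern: define the estimator, call `fit`, then call `predict`. A minimal sketch on hypothetical noise-free data shows that `fit` returns the estimator itself, while the learned parameters of a linear model live in `coef_` and `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data generated from y = 3*x1 - 2*x2 + 5
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 5

demo_model = LinearRegression()
fitted = demo_model.fit(X_demo, y_demo)  # fit() returns the estimator itself

print(fitted is demo_model)
print(demo_model.coef_, demo_model.intercept_)
```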

### _Linear Regression (test-in-training)_ 
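
Why training on rows that are later used for testing is misleading can be sketched on synthetic data. Everything below (data, coefficients, the `rmse` helper) is an assumption made for illustration. For ordinary least squares, the model fitted on all rows can never score worse on the test rows than the model fitted on the training rows alone, which is exactly why that score is too optimistic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical noisy linear data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 2))
y_demo = X_demo @ np.array([1.5, -0.5]) + rng.normal(scale=1.0, size=30)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

def rmse(model, X_eval, y_eval):
    """Root mean squared error of `model` on the given rows."""
    return mean_squared_error(y_eval, model.predict(X_eval)) ** 0.5

leaky = LinearRegression().fit(X_demo, y_demo)  # has seen the test rows
honest = LinearRegression().fit(Xtr, ytr)       # has not seen the test rows

leaky_rmse = rmse(leaky, Xte, yte)
honest_rmse = rmse(honest, Xte, yte)
print(f"leaky: {leaky_rmse:.4f}, honest: {honest_rmse:.4f}")
```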

In [None]:
%%time

# Model definition

model = LinearRegression()
print(type(model))

In [None]:
%%time

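# Model training (deliberately fitted on ALL rows, test rows included)

fitted_model = model.fit(X, y)  # fit() returns the estimator itself; the weights are in coef_ / intercept_
print(type(fitted_model))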

In [None]:
%%time

# Model predictions

predictions = model.predict(X_test)
print(type(predictions))

In [None]:
# RMSE

tricky_error = round(mean_squared_error(y_test, predictions)**0.5, 5)
print(f"Speed predictions error is: +/- {tricky_error} m/s (Mean speed is around: 40 m/s and Std is around: 2 m/s)")

### _Linear Regression (the-real-stuff)_ 
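
Both variants are scored with the root mean squared error (RMSE): the square root of the average squared difference between true and predicted values, expressed in the same units as the target. A minimal sketch with hypothetical numbers, checked against the `mean_squared_error(...)**0.5` expression used in this notebook:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted speeds (m/s)
y_true_demo = np.array([40.0, 42.0, 38.0, 41.0])
y_pred_demo = np.array([39.0, 43.0, 38.0, 43.0])

# RMSE by hand: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))

# Same quantity via scikit-learn
rmse_sklearn = mean_squared_error(y_true_demo, y_pred_demo) ** 0.5

print(rmse_manual, rmse_sklearn)
```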

In [None]:
%%time

# Model definition

model = LinearRegression()
print(type(model))

In [None]:
%%time

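# Model training (training rows only)

fitted_model = model.fit(X_train, y_train)  # fit() returns the estimator itself; the weights are in coef_ / intercept_
print(type(fitted_model))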

In [None]:
%%time

# Model predictions

predictions = model.predict(X_test)
print(type(predictions))

In [None]:
# RMSE

real_error = round(mean_squared_error(y_test, predictions)**0.5, 5)
print(f"Speed predictions error is: +/- {real_error} m/s (Mean speed is around: 40 m/s and Std is around: 2 m/s)")

In [None]:
# RMSE comparison

print(f"The real_error is {round(real_error/tricky_error, 2)} times the tricky_error")

---

### _Random Forest Regressor (test-in-training)_ 
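
Tree ensembles can fit their training rows very closely, so the gap between a leaky and an honest evaluation tends to be larger than for a linear model. The sketch below uses hypothetical noisy data; the sizes, seeds, and `n_estimators=50` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data: the target depends on the first feature plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = X_demo[:, 0] + rng.normal(scale=1.0, size=200)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Same model settings, different training rows
leaky = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)
honest = RandomForestRegressor(n_estimators=50, random_state=0).fit(Xtr, ytr)

leaky_rmse = mean_squared_error(yte, leaky.predict(Xte)) ** 0.5
honest_rmse = mean_squared_error(yte, honest.predict(Xte)) ** 0.5
print(f"leaky: {leaky_rmse:.4f}, honest: {honest_rmse:.4f}")
```

The leaky score mostly reflects how well the forest memorized rows it had already seen, not how it would perform on new pitches.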

In [None]:
%%time

# Model definition

model = RandomForestRegressor(random_state=42)  # fixed seed so the error comparison is reproducible
print(type(model))

In [None]:
%%time

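# Model training (deliberately fitted on ALL rows, test rows included)

fitted_model = model.fit(X, y)  # fit() returns the fitted estimator itself, not a weights object
print(type(fitted_model))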

In [None]:
%%time

# Model predictions

predictions = model.predict(X_test)
print(type(predictions))

In [None]:
# RMSE

tricky_error = round(mean_squared_error(y_test, predictions)**0.5, 5)
print(f"Speed predictions error is: +/- {tricky_error} m/s (Mean speed is around: 40 m/s and Std is around: 2 m/s)")

### _Random Forest Regressor (the-real-stuff)_ 

In [None]:
%%time

# Model definition

model = RandomForestRegressor(random_state=42)  # fixed seed so the error comparison is reproducible
print(type(model))

In [None]:
%%time

# Model training (training rows only)

fitted_model = model.fit(X_train, y_train)  # fit() returns the fitted estimator itself, not a weights object
print(type(fitted_model))

In [None]:
%%time

# Model predictions

predictions = model.predict(X_test)
print(type(predictions))

In [None]:
# RMSE

real_error = round(mean_squared_error(y_test, predictions)**0.5, 5)
print(f"Speed predictions error is: +/- {real_error} m/s (Mean speed is around: 40 m/s and Std is around: 2 m/s)")

In [None]:
# RMSE comparison

print(f"The real_error is {round(real_error/tricky_error, 2)} times the tricky_error")

---