# Lab 4 - ML Experiments in Python
- **Author:** Emily Aiken ([emilyaiken@berkeley.edu](mailto:emilyaiken@berkeley.edu)) (based on labs from Dimitris Papadimitriou and Guanghua Chi)
- **Date:** February 16, 2022
- **Course:** INFO 251: Applied machine learning

## Learning objectives:
At the end of this lab, you will be able to...
- Implement random train-test splits and cross validation in numpy and pandas
- Use built-in scikit-learn functionality for train-test splits and cross validation
- Understand the difference between cross validation and the bootstrap
- Understand the difference between normalization and standardization

## Topics:
1. Vectorized computation, revisited
2. Train-test splits
3. Cross validation
4. Bootstrapping
5. Normalization and standardization

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, KFold, cross_validate
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler, StandardScaler

## 1. Vectorized Computation, Revisited

Challenge: For each point in a random set of 2D points, find the index of the closest other point in terms of Euclidean distance. *Use a maximum of one for loop.*

In [None]:
np.random.seed(2)
points = np.random.rand(100, 2)

fig, ax = plt.subplots(1)
ax.scatter(points[:, 0], points[:, 1])
plt.show()

In [None]:
# Your code here
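
One possible approach to the challenge above, offered as a sketch rather than the reference solution: broadcasting builds the full matrix of pairwise squared distances with no explicit loop. The names `diffs`, `sq_dists`, and `closest` are illustrative and not part of the original lab.

```python
import numpy as np

np.random.seed(2)
points = np.random.rand(100, 2)

# Pairwise coordinate differences via broadcasting: shape (100, 100, 2)
diffs = points[:, np.newaxis, :] - points[np.newaxis, :, :]

# Squared Euclidean distances: shape (100, 100). The square root is
# monotonic, so skipping it does not change which point is closest.
sq_dists = (diffs ** 2).sum(axis=2)

# Every point is at distance 0 from itself, so mask out the diagonal
np.fill_diagonal(sq_dists, np.inf)

# For each row, the column index of the smallest remaining distance
closest = sq_dists.argmin(axis=1)
print(closest[:10])
```

This uses memory proportional to the square of the number of points, which is fine for 100 points but may not scale to very large sets.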

## 2. Train-Test Splits

In [None]:
df = pd.read_csv('auto.csv')
df.head()

#### Option 1: Shuffling in numpy
This is what you should be using for your problem sets! 

In [None]:
train_percent = .80
train_number = int(train_percent*len(df))
print('Total examples: %i' % len(df))
print('Number of training examples: %i' % train_number)
print('Number of testing examples: %i' % (len(df) - train_number))

np.random.seed(1)

ids = np.arange(0, len(df), 1)
ids = np.random.permutation(ids)
df_shuffled = df.iloc[ids]
train = df_shuffled[:train_number]
test = df_shuffled[train_number:]

#### Option 2: Shuffling in pandas
Arguably more idiomatic, but less flexible. Don't use on PS3 or PS4, but could be useful in future data science applications.

In [None]:
df_shuffled = df.sample(frac=1, random_state=12, replace=False)
train = df_shuffled[:train_number]
test = df_shuffled[train_number:]

#### Option 3: Built-in functionality in scikit-learn
Often most convenient -- but don't use built-in sklearn functionality in PS3 or PS4.

In [None]:
train, test = train_test_split(df, train_size=train_percent, shuffle=True, random_state=12)

In [None]:
x = df[['mpg', 'cylinders', 'year']]
y = df['weight']
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=train_percent, shuffle=True, 
                                                    random_state=12)

## 3. Cross Validation

#### Option 1: Cross validation in numpy
**Challenge**: Extract five random cross validation splits from our dataframe using numpy's shuffle()

In [None]:
# Your code here
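
A minimal sketch of one way to approach this challenge, assuming a small synthetic dataframe (`df_demo`, hypothetical) in place of the auto data. It shuffles the row positions in place with `np.random.shuffle`, cuts them into five folds with `np.array_split`, and uses each fold once as the test set.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the auto dataframe (23 rows, so folds are uneven)
rng_data = np.random.RandomState(0)
df_demo = pd.DataFrame({'mpg': rng_data.rand(23), 'weight': rng_data.rand(23)})

np.random.seed(1)
n_folds = 5

# Shuffle row positions in place, then cut them into n_folds chunks
ids = np.arange(len(df_demo))
np.random.shuffle(ids)
folds = np.array_split(ids, n_folds)

splits = []
for k in range(n_folds):
    # Fold k is the test set; all other folds together form the training set
    test_ids = folds[k]
    train_ids = np.concatenate([folds[j] for j in range(n_folds) if j != k])
    splits.append((df_demo.iloc[train_ids], df_demo.iloc[test_ids]))
    print(splits[-1][0].shape, splits[-1][1].shape)
```

`np.array_split` is used instead of `np.split` because it accepts a number of rows that is not divisible by the number of folds.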

#### Option 2: Cross validation in pandas
**Challenge**: Extract five random cross validation splits from our dataframe using pandas sample()

In [None]:
# Your code here
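
A sketch of one possible pandas version, again on a hypothetical stand-in dataframe (`df_demo`). The rows are shuffled once with `sample(frac=1)`, and each shuffled row is then assigned a fold label by position; the label scheme is an illustrative choice, not the lab's prescribed solution.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the auto dataframe
df_demo = pd.DataFrame({'mpg': np.arange(23, dtype=float),
                        'weight': np.arange(23, dtype=float) * 2})

n_folds = 5

# Shuffle all rows once, without replacement
df_shuffled = df_demo.sample(frac=1, random_state=12, replace=False)

# Assign shuffled rows to folds 0..n_folds-1 in round-robin order
fold_labels = np.arange(len(df_shuffled)) % n_folds

pandas_splits = []
for k in range(n_folds):
    test_k = df_shuffled[fold_labels == k]    # rows in fold k
    train_k = df_shuffled[fold_labels != k]   # rows in every other fold
    pandas_splits.append((train_k, test_k))
    print(train_k.shape, test_k.shape)
```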

#### Option 3: Built-in functionality in scikit-learn
Once again, do not use scikit-learn on PS3 or PS4.

In [None]:
# KFold -- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
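# random_state is only accepted when shuffle=True; without shuffling, folds are taken in row order
kf = KFold(n_splits=5, shuffle=True, random_state=12)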
for train_idx, test_idx in kf.split(df):
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    print(train.shape, test.shape)

In [None]:
# Cross validate -- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html
x = df[['mpg', 'cylinders', 'year']]
y = df['weight']
model = LinearRegression()
cross_validate(model, x, y, scoring='r2', cv=5)
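
`cross_validate` returns a dictionary of arrays with one entry per fold. The sketch below shows how the scores might be read out, using synthetic data (`X_demo`, `y_demo`, both hypothetical) in place of the auto dataset. Note that passing an integer such as `cv=5` with a regressor uses unshuffled K-fold splits, so an explicit `KFold` object is passed here to get shuffled folds.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Hypothetical synthetic regression data: linear signal plus a little noise
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 3)
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(200)

# Shuffled 5-fold splitter, reproducible through random_state
cv = KFold(n_splits=5, shuffle=True, random_state=12)
results = cross_validate(LinearRegression(), X_demo, y_demo, scoring='r2', cv=cv)

# The result is a dict; 'test_score' holds one R^2 value per fold
print(sorted(results.keys()))
print('Per-fold R^2:', results['test_score'])
print('Mean R^2: %.3f' % results['test_score'].mean())
```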

## 4. Bootstrapping
We can think of train-test splits and cross validation as drawing samples without replacement: Once an observation is assigned to a split, it can't be redrawn into another split. 

*Bootstrapping* is the opposite: We draw samples with replacement, so the same observation can appear more than once in a sample. Bootstrapping is mostly used for estimating confidence intervals around point estimates. More to come on bootstrapping in later lectures and labs!


#### Option 1: Random draws with numpy

In [None]:
np.random.seed(1)

n_bootstraps = 100
sample_size = 50
for i in range(n_bootstraps):
    sample_ids = np.random.choice(len(df), size=sample_size, replace=True)
    bootstrap_sample = df.iloc[sample_ids]
    print(bootstrap_sample['mpg'].mean())

#### Option 2: Random draws with pandas

In [None]:
np.random.seed(2)

n_bootstraps = 100
sample_size = 50

for i in range(n_bootstraps):
    bootstrap_sample = df.sample(sample_size, replace=True)
    print(bootstrap_sample['mpg'].mean())
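
To connect the loops above to confidence intervals, here is a rough sketch that stores the bootstrap means instead of printing them and then takes percentiles. It assumes the percentile method and a synthetic array (`values`, hypothetical) in place of `df['mpg']`; later lectures may use a different construction.

```python
import numpy as np

np.random.seed(1)

# Hypothetical stand-in for df['mpg']
values = np.random.normal(loc=23.0, scale=8.0, size=400)

n_bootstraps = 1000
boot_means = np.empty(n_bootstraps)
for i in range(n_bootstraps):
    # Draw len(values) observations with replacement and record the mean
    sample = np.random.choice(values, size=len(values), replace=True)
    boot_means[i] = sample.mean()

# Percentile interval: the middle 95% of the bootstrap means
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print('Sample mean: %.2f' % values.mean())
print('95%% percentile interval: (%.2f, %.2f)' % (lower, upper))
```

Unlike the loops above, each bootstrap sample here has the same size as the original data, which is the usual choice when the goal is an interval for a statistic of the full sample.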

## 5. Normalization and Standardization

In [None]:
df_numeric = df[['mpg', 'cylinders', 'weight']]
train, test = train_test_split(df_numeric, test_size=0.2, random_state=1, shuffle=True)

mpg_train = train['mpg'].values.flatten()
mpg_test = test['mpg'].values.flatten()

#### Standardization
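Standardization rescales the values in a vector to have mean 0 and unit variance. The mean and standard deviation are computed on the training set only and then applied to both sets, so the standardized test values will have mean and standard deviation close to, but not exactly, 0 and 1.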

In [None]:
# Standardization by hand in numpy
mean_mpg, std_mpg = np.mean(mpg_train), np.std(mpg_train)

mpg_train_standardized = (mpg_train - mean_mpg)/std_mpg
mpg_test_standardized = (mpg_test - mean_mpg)/std_mpg

print(mpg_train_standardized.mean(), mpg_train_standardized.std())
print(mpg_test_standardized.mean(), mpg_test_standardized.std())

In [None]:
# Standardization with scikit-learn
scaler = StandardScaler()
train_standardized = scaler.fit_transform(train)
test_standardized = scaler.transform(test)
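
As a quick sanity check, the by-hand and scikit-learn approaches should agree, because `np.std` and `StandardScaler` both use the population standard deviation (`ddof=0`) by default. The sketch below illustrates this on synthetic arrays (`train_demo`, `test_demo`, both hypothetical).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature train and test sets
rng = np.random.RandomState(1)
train_demo = rng.normal(20, 5, size=(80, 1))
test_demo = rng.normal(20, 5, size=(20, 1))

# By hand: statistics come from the training set only
mean_, std_ = train_demo.mean(axis=0), train_demo.std(axis=0)
by_hand = (test_demo - mean_) / std_

# scikit-learn: fit on train, then transform test with the stored statistics
scaler = StandardScaler().fit(train_demo)
with_sklearn = scaler.transform(test_demo)

print(np.allclose(by_hand, with_sklearn))
```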

#### Normalization
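Normalization rescales the values in a vector to a range of [0, 1]. The minimum and maximum are taken from the training set, so test values below the training minimum or above the training maximum will fall outside [0, 1].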

In [None]:
# Normalization by hand in numpy
max_mpg, min_mpg = np.max(mpg_train), np.min(mpg_train)

mpg_train_normalized = (mpg_train - min_mpg)/(max_mpg - min_mpg)
mpg_test_normalized = (mpg_test - min_mpg)/(max_mpg - min_mpg)

print(mpg_train_normalized.min(), mpg_train_normalized.max())
print(mpg_test_normalized.min(), mpg_test_normalized.max())

In [None]:
# Normalization in scikit-learn
scaler = MinMaxScaler()
train_normalized = scaler.fit_transform(train)
test_normalized = scaler.transform(test)
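
A small sketch of what happens when the scaler is fit on the training set and the test set contains more extreme values. The toy arrays `train_demo` and `test_demo` are hypothetical and chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: training values span 10 to 30
train_demo = np.array([[10.0], [20.0], [30.0]])
# Test values include one below and one above the training range
test_demo = np.array([[5.0], [25.0], [40.0]])

# Fit on train only; the stored min and max are 10 and 30
scaler = MinMaxScaler().fit(train_demo)
test_scaled = scaler.transform(test_demo)

# (x - 10) / (30 - 10) for each test value
print(test_scaled.ravel())
```

The scaled test values come out at roughly -0.25, 0.75, and 1.5, so two of the three land outside [0, 1]. This is expected behavior rather than a bug, since refitting the scaler on the test set would leak test information into preprocessing.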