# Lecture 3 Demo: ML Fundamentals

## Imports

In [7]:
# import the libraries
import os
import sys
sys.path.append(os.path.join("code"))
from plotting_functions import *
from utils import *

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%matplotlib inline

DATA_DIR = os.path.join("data/")
pd.set_option("display.max_colwidth", 200)

<br><br>

## Data

Let's bring back the King County housing sale prediction data from the course introduction video. You can download the data from [Kaggle](https://www.kaggle.com/harlfoxem/housesalesprediction).

In [9]:
housing_df = pd.read_csv(DATA_DIR + 'kc_house_data.csv')
housing_df

|  | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.00 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.7210 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.00 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.00 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.00 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21608 | 263000018 | 20140521T000000 | 360000.0 | 3 | 2.50 | 1530 | 1131 | 3.0 | 0 | 0 | ... | 8 | 1530 | 0 | 2009 | 0 | 98103 | 47.6993 | -122.346 | 1530 | 1509 |
| 21609 | 6600060120 | 20150223T000000 | 400000.0 | 4 | 2.50 | 2310 | 5813 | 2.0 | 0 | 0 | ... | 8 | 2310 | 0 | 2014 | 0 | 98146 | 47.5107 | -122.362 | 1830 | 7200 |
| 21610 | 1523300141 | 20140623T000000 | 402101.0 | 2 | 0.75 | 1020 | 1350 | 2.0 | 0 | 0 | ... | 7 | 1020 | 0 | 2009 | 0 | 98144 | 47.5944 | -122.299 | 1020 | 2007 |
| 21611 | 291310100 | 20150116T000000 | 400000.0 | 3 | 2.50 | 1600 | 2388 | 2.0 | 0 | 0 | ... | 8 | 1600 | 0 | 2004 | 0 | 98027 | 47.5345 | -122.069 | 1410 | 1287 |


## Exploratory Data Analysis

Is this a classification problem or a regression problem? 

In [None]:
# How many data points do we have? 
housing_df.shape[0]

In [None]:
# What are the columns in the dataset? 
housing_df.columns


Let's explore some features, starting with the `describe()` method.

In [None]:
housing_df.describe(include='all')

Do we need to keep all the columns? 

In [None]:
housing_df['id'].unique().shape[0]

In [None]:
housing_df['zipcode'].value_counts()

In [None]:
# The `date` column is stored as strings; parse a few sample values with an explicit format
dates = pd.to_datetime(['20141013T000000', '20141209T000000', '20150218T000000'], format='%Y%m%dT%H%M%S')
dates
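
The `date` column holds strings such as `20141013T000000`. As a minimal sketch, assuming we wanted numeric features from it (this demo does not do so), the parsed values expose fields such as the year and month. The column names `sale_year` and `sale_month` are hypothetical.

```python
import pandas as pd

# Sample strings copied from the `date` column shown above
raw_dates = ["20141013T000000", "20141209T000000", "20150218T000000"]

# Parse with an explicit format, then wrap in a Series to use the .dt accessor
parsed = pd.Series(pd.to_datetime(raw_dates, format="%Y%m%dT%H%M%S"))

# Hypothetical derived columns; the names are illustrative only
date_features = pd.DataFrame(
    {"sale_year": parsed.dt.year, "sale_month": parsed.dt.month}
)
print(date_features)
```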

In [None]:
# What are the value counts of the `waterfront` feature? 
housing_df['waterfront'].value_counts()

In [None]:
# What are the value counts of the `yr_renovated` feature? 
housing_df['yr_renovated'].value_counts()

There are many more opportunities to clean the data, but we'll stop here. 

Let's create `X` and `y`. 

In [None]:
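# One reasonable choice (an assumption): drop the target, the identifier, the raw date string, and zipcode
X = housing_df.drop(columns=["id", "date", "zipcode", "price"])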

In [None]:
y = housing_df["price"]  # the sale price is the prediction target

<br><br>

## Baseline model 

In [None]:
# Train a DummyRegressor model 

from sklearn.dummy import DummyRegressor # Import DummyRegressor 

# Create a class object for the sklearn model.
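dummy_regr = DummyRegressor()  # default strategy="mean"

# fit the dummy regressor
dummy_regr.fit(X, y)

# score the model (for regressors, `score` returns R^2)
dummy_regr.score(X, y)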


How should we interpret the score here? 
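
For regressors, `score` returns the coefficient of determination, $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. `DummyRegressor` with its default `strategy="mean"` always predicts the mean of the targets it was fit on, so on that same data the numerator equals the denominator and the score is 0. A minimal sketch on made-up numbers (not the housing data) that checks this by hand:

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Toy data, invented purely for illustration
X_toy = np.arange(6).reshape(-1, 1)
y_toy = np.array([10.0, 12.0, 9.0, 15.0, 11.0, 13.0])

dummy = DummyRegressor()  # default strategy="mean"
dummy.fit(X_toy, y_toy)
y_hat = dummy.predict(X_toy)  # the same value (the mean of y_toy) for every row

# R^2 computed by hand from its definition
ss_res = np.sum((y_toy - y_hat) ** 2)
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, dummy.score(X_toy, y_toy))
```

A model is only useful if it scores clearly above this baseline.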

In [None]:
# predict on X using the model
dummy_regr.predict(X)


<br><br>

## Decision tree model 

In [None]:
# Train a decision tree model 

from sklearn.tree import DecisionTreeRegressor # Import DecisionTreeRegressor 

# Create a class object for the sklearn model.
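dt_regr = DecisionTreeRegressor(random_state=123)

# fit the decision tree regressor 
dt_regr.fit(X, y)

# score the model (R^2 on the same data it was trained on)
dt_regr.score(X, y)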


We are getting a perfect (or nearly perfect) $R^2$ score on the training data. Should we be happy with this model and deploy it? Why or why not?

What's the depth of this model? 
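
An unrestricted decision tree keeps splitting until its leaves are pure, so it can memorize the training set. The sketch below uses synthetic random data (not the housing data; the variable names are illustrative) in which the target is pure noise, and shows that the training score is still perfect. `get_depth()` reports how deep the tree had to grow to do that.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(123)
X_noise = rng.normal(size=(200, 3))
y_noise = rng.normal(size=200)  # pure noise: there is nothing real to learn

tree = DecisionTreeRegressor(random_state=123)  # no max_depth limit
tree.fit(X_noise, y_noise)

train_r2 = tree.score(X_noise, y_noise)  # scored on the data it memorized
tree_depth = tree.get_depth()
print(train_r2, tree_depth)
```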

<br><br>

## Data splitting 

Let's split the data and  
- Train on the train split 
- Score on the test split

In [None]:
# Split the data 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123  # test_size=0.2 is an assumed choice
)

In [None]:
# Instantiate a class object 
dt = DecisionTreeRegressor(random_state=123)

# Train a decision tree on X_train, y_train
dt.fit(X_train, y_train)

# Score on the train set
dt.score(X_train, y_train)


In [None]:
# Score on the test set
dt.score(X_test, y_test)


### Activity: Discuss the following questions in your group

- Why is there a large gap between train and test scores? 
- What would be the effect of increasing or decreasing `test_size`?
- Why are we setting the `random_state`? Is it a good idea to try a bunch of values for the `random_state` and pick the one which gives the best scores? 
- Would it be possible to further improve the scores? 
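
To get a feel for the `random_state` question, here is a sketch on synthetic data generated with `make_regression` (not the housing data) that repeats the split with several seeds. Any differences in the test scores come only from which rows land in the test split, so picking the seed with the best score would amount to selecting a lucky split.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (illustration only)
X_syn, y_syn = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)

test_scores = {}
for seed in [1, 2, 3, 4, 5]:
    # Only the split changes between iterations; the model settings stay fixed
    X_fit, X_holdout, y_fit, y_holdout = train_test_split(
        X_syn, y_syn, test_size=0.2, random_state=seed
    )
    model = DecisionTreeRegressor(max_depth=5, random_state=123)
    model.fit(X_fit, y_fit)
    test_scores[seed] = model.score(X_holdout, y_holdout)

for seed, score in test_scores.items():
    print(seed, round(score, 3))
```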

<br><br>

## Hyperparameter optimization 

Let's try out different tree depths. 

In [None]:
# max_depth=1
dt = DecisionTreeRegressor(max_depth=1, random_state=123) 
dt.fit(X_train, y_train)

In [None]:
# Visualize your decision stump
from sklearn.tree import plot_tree 
plot_tree(dt, feature_names = X.columns.tolist(), impurity=False, filled=True, fontsize=10);

In [None]:
dt.score(X_train, y_train) # Score on the train set

In [None]:
dt.score(X_test, y_test) # Score on the test set

- How do these scores compare to the previous scores? 

Let's try depth 10. 

In [None]:
dt = DecisionTreeRegressor(max_depth=10, random_state=123)  # max_depth=10
dt.fit(X_train, y_train)

In [None]:
dt.score(X_train, y_train) # Score on the train set

In [None]:
dt.score(X_test, y_test) # Score on the test set

Any improvements? Which depth should we pick? 

<br><br>

## Single validation set

We are using the test data again and again. How about creating a validation set to pick the right depth and assessing the final model on the test set?   

In [None]:
# Create a validation set 
X_tr, X_valid, y_tr, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=123)

In [None]:
tr_scores = []
valid_scores = []
depths = np.arange(1, 35, 2)

for depth in depths:  
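    # Create and fit a decision tree model for the given depth  
    dt = DecisionTreeRegressor(max_depth=depth, random_state=123)
    dt.fit(X_tr, y_tr)

    # Calculate and append r2 scores on the training and validation sets
    tr_scores.append(dt.score(X_tr, y_tr))
    valid_scores.append(dt.score(X_valid, y_valid))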

    
results_single_valid_df = pd.DataFrame(
    {"train_score": tr_scores, "valid_score": valid_scores}, index=depths
)
results_single_valid_df

In [None]:
results_single_valid_df[['train_score', 'valid_score']].plot(ylabel='r2 scores');

What depth gives the "best" validation score? 

In [None]:
# What depth gives the "best" validation score?
best_depth = results_single_valid_df['valid_score'].idxmax() 
best_depth

## Cross-validation

In [None]:
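from sklearn.model_selection import cross_validate

depths = np.arange(1, 35, 2)

cv_train_scores = []
cv_valid_scores = []
for depth in depths: 
    # Create a decision tree model for the given depth   
    dt = DecisionTreeRegressor(max_depth = depth, random_state=123)

    # Carry out cross-validation; cross_validate fits the model on each fold.
    # cv=5 is an assumed choice for the number of folds.
    scores = cross_validate(dt, X_train, y_train, cv=5, return_train_score=True)
    cv_train_scores.append(scores["train_score"].mean())
    cv_valid_scores.append(scores["test_score"].mean())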


In [None]:
results_df = pd.DataFrame(
    {"train_score": cv_train_scores, "valid_score": cv_valid_scores},
    index=depths,
)
results_df

In [None]:
results_df[['train_score', 'valid_score']].plot(ylabel='r2 score', title='Housing price prediction depth vs. r2 score');

What's the "best" depth with cross-validation? 

In [None]:
best_depth = results_df['valid_score'].idxmax()
best_depth

### Discuss the following questions in your group

1. At which depth(s) are we underfitting? At which depth(s) are we overfitting?
2. Above, we chose the depth that gives us the best cross-validation score. Is it always a good idea to select this depth? What if a simpler model with a smaller `max_depth` gives nearly the same cross-validation score?
3. If our main concern is test scores, why don't we use the test set during training?
4. Do you trust our hyperparameter optimization process? In other words, do you believe we've found the best possible depth?
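
For question 2, one possible heuristic (an assumption for illustration, not something this demo prescribes) is to take the smallest depth whose mean validation score is within a small tolerance of the best one. The scores below are made-up numbers, and `tolerance` is a hypothetical parameter.

```python
import pandas as pd

# Made-up mean validation scores indexed by max_depth (illustration only)
toy_cv_scores = pd.Series(
    [0.30, 0.55, 0.68, 0.700, 0.705, 0.69, 0.66],
    index=[1, 3, 5, 7, 9, 11, 13],
)

tolerance = 0.01  # hypothetical threshold for "nearly the same" score

# Depth with the single highest validation score
top_depth = toy_cv_scores.idxmax()

# Smallest depth whose score is within `tolerance` of the best score
close_enough = toy_cv_scores[toy_cv_scores >= toy_cv_scores.max() - tolerance]
simplest_depth = close_enough.index.min()

print(top_depth, simplest_depth)
```

With these numbers the highest score is at depth 9, but depth 7 is within the tolerance and gives a simpler tree.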

<br><br>

## Assessing on the test set

In [None]:
# Train a model with the best depth on the full training data
dt_final = DecisionTreeRegressor(max_depth=best_depth, random_state=123)
dt_final.fit(X_train, y_train)
dt_final.score(X_train, y_train)

In [None]:
dt_final.score(X_test, y_test)

How do these scores compare to the scores when we used a single validation set? 

### Learned model 

In [None]:
# What's the depth of the model? 
dt_final.get_depth()

In [None]:
# plot_tree(dt_final, feature_names = X_train.columns.tolist(), impurity=False, filled=True);

In [None]:
# Which features are the most important ones?
dt_final.feature_importances_

Let's examine feature importances. 

In [None]:
df = pd.DataFrame( 
    data = {
        "features": dt_final.feature_names_in_,
        "feature_importances": dt_final.feature_importances_
    }
)
df.sort_values("feature_importances", ascending=False)

<br><br>

## Concepts we revised in this demo

- Exploratory data analysis
- Baselines
- Data splitting: train, test, validation sets
- Cross-validation
- Underfitting, overfitting, the fundamental tradeoff
- The golden rule of supervised ML

## Typical steps to build a supervised machine learning model

- Ensure the data is appropriate for your task (e.g., labeled data, suitable features).
- Split the data into training and testing sets.
- Perform exploratory data analysis (EDA) on the training data to understand distributions, identify patterns, and detect potential issues.
- Preprocess and encode features (e.g., handle missing values, scale features, encode categorical variables).
    - coming up 
- Build a baseline model to establish a performance benchmark.
- Train multiple candidate models on the training data.
    - coming up  
- Select promising models and perform hyperparameter tuning using cross-validation.
    - coming up 
- Evaluate the generalization performance of the best model on the test set.
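
The steps above can be sketched end to end as follows. This is only an illustration on synthetic data from `make_regression`, so EDA and preprocessing are skipped, and a decision tree is the single candidate model. The depth grid, fold count, and all variable names are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for a real labeled dataset (illustration only)
X_syn, y_syn = make_regression(n_samples=500, n_features=8, noise=15.0, random_state=0)

# Split first, so the test set stays untouched until the end
X_tr_syn, X_te_syn, y_tr_syn, y_te_syn = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=123
)

# Baseline: cross-validated R^2 of always predicting the training-fold mean
baseline_cv = cross_validate(DummyRegressor(), X_tr_syn, y_tr_syn, cv=5)["test_score"].mean()

# Hyperparameter tuning with cross-validation on the training split only
candidate_depths = [1, 3, 5, 7, 9]
mean_valid_scores = []
for depth in candidate_depths:
    model = DecisionTreeRegressor(max_depth=depth, random_state=123)
    cv_result = cross_validate(model, X_tr_syn, y_tr_syn, cv=5)
    mean_valid_scores.append(cv_result["test_score"].mean())
chosen_depth = candidate_depths[int(np.argmax(mean_valid_scores))]

# Refit on the full training split and evaluate once on the test split
final_model = DecisionTreeRegressor(max_depth=chosen_depth, random_state=123)
final_model.fit(X_tr_syn, y_tr_syn)
test_r2 = final_model.score(X_te_syn, y_te_syn)

print(baseline_cv, chosen_depth, test_r2)
```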


<br><br>