# Putting it all together
In this section we'll look at an example of fitting many models on a data set and choosing the overall best one via test error.

Process:
1. Read the data in (normally we would explore the data next; we'll skip that part here)
2. Split the data into a training and test set
3. For each model type, select a **best** model using the training set (we'll use cross-validation but you could split the training set into a training and validation set instead)
4. Compare the best models on the test set. Select the model with the lowest test error, also weighing simplicity
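
The steps above can be sketched end to end on a small synthetic data set. This is only an illustration under stated assumptions: the data come from `make_regression` rather than the diamonds file, and each model type has a single fixed configuration instead of a tuned one.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Step 1 (stand-in): synthetic data instead of a real file (assumption)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Step 2: hold out a test set that is not touched during model selection
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 3: score each candidate with 5-fold CV on the training set only
candidates = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=4, random_state=0),
}
cv_rmse = {}
for name, model in candidates.items():
    neg_mse = cross_val_score(model, X_tr, y_tr, cv=5,
                              scoring="neg_mean_squared_error")
    cv_rmse[name] = np.sqrt(-neg_mse.mean())

# Step 4: refit on the full training set and compare on the test set
test_rmse = {}
for name, model in candidates.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    test_rmse[name] = np.sqrt(np.mean((y_te - pred) ** 2))

print(cv_rmse)
print(test_rmse)
```

The rest of this section follows the same pattern, with the diamonds data and with tuning inside step 3.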

## 1. Read in the `diamonds` data set
The data comes from [Kaggle](https://www.kaggle.com/datasets/shivam2503/diamonds).

In [3]:
import pandas as pd
import numpy as np
diamonds = pd.read_csv("data/diamonds.csv")

In [4]:
print(diamonds.columns)
diamonds.head()

# Notice an index column, "Unnamed: 0"
# We'll remove it and try to predict the price of our diamonds (our response variable)

Index(['Unnamed: 0', 'carat', 'cut', 'color', 'clarity', 'depth', 'table',
       'price', 'x', 'y', 'z'],
      dtype='object')


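|   | Unnamed: 0 | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 2 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 3 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 4 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 5 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |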


In [5]:
# Drop first column
diamonds = diamonds.drop(diamonds.columns[0], axis = 1)
diamonds.head()

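|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |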
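
Dropping by position works here, but it silently removes the wrong column if the file layout changes. As a hedged alternative, the sketch below drops the column by name, or avoids creating it by reading the first CSV column as the index. The two-row CSV text is a made-up stand-in for `data/diamonds.csv`.

```python
import io
import pandas as pd

# Tiny stand-in for the real file: the first header field is empty,
# which is what produces the "Unnamed: 0" column (hypothetical data)
csv_text = ",carat,price\n1,0.23,326\n2,0.21,326\n"

raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))            # the unnamed column shows up here

# Option 1: drop the column by name instead of by position
by_name = raw.drop(columns="Unnamed: 0")

# Option 2: treat the first column as the row index when reading
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(by_name.columns), list(indexed.columns))
```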


Let's create dummy variables so we can include `cut` and `color`.  We'll also remove the `clarity` variable.

In [6]:
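# dtype=int gives 0/1 columns; recent pandas versions return True/False by default
cut_dummies = pd.get_dummies(diamonds.cut, dtype=int)
color_dummies = pd.get_dummies(diamonds.color, dtype=int)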
diamonds = diamonds.drop(["clarity", "cut", "color"], axis = 1)
diamonds = diamonds.join(cut_dummies).join(color_dummies)
diamonds.head()

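|   | carat | depth | table | price | x | y | z | Fair | Good | Ideal | Premium | Very Good | D | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0.21 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0.23 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0.29 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0.31 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |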
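
To make the encoding concrete, here is a small sketch on a made-up four-row column. It shows that the dummy columns for one variable always sum to 1 in each row, which is why some analysts drop one level per variable (`drop_first=True`) when the model has an intercept. The notebook keeps every level; that should not change the fitted predictions, but the individual coefficients are then not uniquely determined.

```python
import pandas as pd

# Hypothetical toy column, not the diamonds data
toy = pd.DataFrame({"cut": ["Ideal", "Premium", "Good", "Premium"]})

# One 0/1 column per level, ordered alphabetically
dummies = pd.get_dummies(toy["cut"], dtype=int)
print(dummies)

# Every row has exactly one level switched on
row_sums = dummies.sum(axis=1)

# Dropping the first level removes the redundancy with the intercept
reduced = pd.get_dummies(toy["cut"], dtype=int, drop_first=True)
print(reduced)
```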


Now let's check the summary statistics to make sure none of the dummy variables is too rare. The mean of a 0/1 column is the proportion of rows in that category.

In [7]:
diamonds.describe()

|   | carat | depth | table | price | x | y | z | Fair | Good | Ideal | Premium | Very Good | D | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 |
| mean | 0.79794 | 61.749405 | 57.457184 | 3932.799722 | 5.731157 | 5.734526 | 3.538734 | 0.029848 | 0.090953 | 0.399537 | 0.255673 | 0.22399 | 0.125603 | 0.181628 | 0.1769 | 0.209344 | 0.153949 | 0.100519 | 0.052058 |
| std | 0.474011 | 1.432621 | 2.234491 | 3989.439738 | 1.121761 | 1.142135 | 0.705699 | 0.170169 | 0.287545 | 0.489808 | 0.436243 | 0.416919 | 0.331404 | 0.385541 | 0.381588 | 0.406844 | 0.360903 | 0.300694 | 0.222146 |
| min | 0.2 | 43.0 | 43.0 | 326.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.4 | 61.0 | 56.0 | 950.0 | 4.71 | 4.72 | 2.91 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.7 | 61.8 | 57.0 | 2401.0 | 5.7 | 5.71 | 3.53 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 1.04 | 62.5 | 59.0 | 5324.25 | 6.54 | 6.54 | 4.04 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 5.01 | 79.0 | 95.0 | 18823.0 | 10.74 | 58.9 | 31.8 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
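
The `mean` row can be turned into approximate row counts, which are easier to judge than proportions. The sketch below copies the dummy-variable means from the output above and multiplies by the reported row count; because the means are rounded to six decimals, the counts are approximate.

```python
import pandas as pd

n_rows = 53940  # the "count" row of describe()

# Means of the dummy columns, copied from the describe() output
level_share = pd.Series({
    "Fair": 0.029848, "Good": 0.090953, "Ideal": 0.399537,
    "Premium": 0.255673, "Very Good": 0.22399,
    "D": 0.125603, "E": 0.181628, "F": 0.1769, "G": 0.209344,
    "H": 0.153949, "I": 0.100519, "J": 0.052058,
})

# Proportion times row count gives an approximate number of rows per level
approx_counts = (level_share * n_rows).round().astype(int)
rarest = level_share.idxmin()

print(approx_counts.sort_values())
print("Rarest level:", rarest)
```

Even the rarest level, `Fair`, covers roughly 1,600 diamonds, so none of the dummy variables looks problematic.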


Note: ideally we would explore the data more and consider transformations of variables and other feature engineering.

## 2. Training and Test Split
First, let's import everything we'll need from `sklearn`.

In [8]:
from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import PolynomialFeatures

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
  diamonds.drop("price", axis = 1), # predictors (every column except price)
  diamonds["price"],                # response
  test_size=0.20, 
  random_state=42)
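
As a quick sanity check on what the split does, the sketch below splits an array of row numbers of the same length as the data (a stand-in, not the real predictors). It illustrates two properties: fixing `random_state` makes the split reproducible, and no row lands in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 53940                 # row count reported by describe()
row_ids = np.arange(n_rows)    # stand-in for the real rows (assumption)

# Two calls with the same seed return the same partition
train_a, test_a = train_test_split(row_ids, test_size=0.20, random_state=42)
train_b, test_b = train_test_split(row_ids, test_size=0.20, random_state=42)

same_split = np.array_equal(test_a, test_b)
no_overlap = len(np.intersect1d(train_a, test_a)) == 0

print(len(train_a), len(test_a), same_split, no_overlap)
```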

## 3. Fit and Select Models on Training Data

### MLR Models

In [35]:
# Full model
cv_full_model = cross_validate(
    LinearRegression(), 
    X_train, 
    y_train, 
    cv = 5, 
    scoring = "neg_mean_squared_error")
cv_numeric_model = cross_validate( # numeric only model, no dummy variables
    LinearRegression(), 
    X_train[["carat", "depth", "table", "x", "y", "z"]], 
    y_train, 
    cv = 5,
    scoring = "neg_mean_squared_error")
cv_dummy_model = cross_validate(     # only use our dummy variables
    LinearRegression(), 
    X_train.iloc[:, 6:], 
    y_train, 
    cv = 5,
    scoring = "neg_mean_squared_error")
# Degree 2 by default: keeps the original variables and adds their pairwise products.
# The variables are not standardized first.
poly = PolynomialFeatures(interaction_only=True, include_bias = False)
cv_full_interaction_model = cross_validate(
    LinearRegression(), 
    poly.fit_transform(X_train), # main effects plus all two-way (pairwise) interactions
    y_train, 
    cv = 5,
    scoring = "neg_mean_squared_error")
cv_numeric_interaction_model = cross_validate( # interactions among the numeric variables only, plus dummy main effects
    LinearRegression(), 
    np.concatenate((poly.fit_transform(X_train[["carat", "depth", "table", "x", "y", "z"]]), X_train.iloc[:, 6:].to_numpy()), axis = 1), 
    y_train, 
    cv = 5,
    scoring = "neg_mean_squared_error")

In [36]:
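# Note: sum() adds the five fold MSEs, so each value is sqrt(5) times the usual CV RMSE
# (the square root of the mean fold MSE). The ranking of the models is unaffected.
print(np.sqrt(-sum(cv_full_model['test_score'])), 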
      np.sqrt(-sum(cv_numeric_model['test_score'])), 
      np.sqrt(-sum(cv_dummy_model['test_score'])), 
      np.sqrt(-sum(cv_full_interaction_model['test_score'])), 
      np.sqrt(-sum(cv_numeric_interaction_model['test_score']))) 

3116.874695185036 3352.484818653061 8736.954205420192 5961.673260726069 7807.238090666276
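
The CV summaries in this section take the square root of the summed fold MSEs. With k folds that is sqrt(k) times the more common summary, the square root of the mean fold MSE. The sketch below checks this with made-up fold scores and then rescales the first printed value so it is on the same scale as a test RMSE.

```python
import numpy as np

# Hypothetical fold scores in sklearn's "neg_mean_squared_error" convention
fold_neg_mse = np.array([-4.0, -9.0, -16.0, -25.0, -36.0])
k = len(fold_neg_mse)

root_of_sum = np.sqrt(-fold_neg_mse.sum())    # what this section prints
root_of_mean = np.sqrt(-fold_neg_mse.mean())  # usual CV RMSE

# The two summaries differ by a constant factor of sqrt(k)
print(root_of_sum, np.sqrt(k) * root_of_mean)

# Rescale the full-model value printed above (5 folds)
printed_full_model = 3116.874695185036
cv_rmse_full_model = printed_full_model / np.sqrt(5)
print(cv_rmse_full_model)
```

Because every model here uses 5 folds, the factor is the same for all of them and the comparison between models is unchanged. After rescaling, the full model's CV RMSE is roughly 1,394.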


In [37]:
# The full main-effects model has the lowest CV error above, so refit it on the whole training set
mlr_best = LinearRegression().fit(X_train, y_train)

### Regression Tree Model

In [38]:
parameters = {'max_depth': range(2,20), # maximum depth of the tree
              'min_samples_leaf':[10, 50, 100, 250]}
tree_model = GridSearchCV(DecisionTreeRegressor(),
                            parameters, 
                            cv = 5, 
                            scoring='neg_mean_squared_error') \
                          .fit(X_train, y_train)

In [39]:
print(tree_model.best_estimator_)

DecisionTreeRegressor(max_depth=12, min_samples_leaf=50)


In [40]:
rtree_cv = cross_validate(tree_model.best_estimator_,
                          X_train,
                          y_train,
                          cv = 5,
                          scoring='neg_mean_squared_error')

In [41]:
print(np.sqrt(-sum(rtree_cv['test_score'])))

2690.9211179869594
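
A fitted `GridSearchCV` object already stores the cross-validation results for every parameter combination, so the best CV score can also be read directly from it. The sketch below shows this on synthetic data with a deliberately tiny grid; the data and grid values are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption)
X, y = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=1)

grid = {"max_depth": [2, 4], "min_samples_leaf": [5, 20]}
search = GridSearchCV(DecisionTreeRegressor(random_state=1),
                      grid,
                      cv=5,
                      scoring="neg_mean_squared_error").fit(X, y)

# best_score_ is the mean fold score of the winning combination
best_cv_rmse = np.sqrt(-search.best_score_)
n_candidates = len(search.cv_results_["params"])

print(search.best_params_, best_cv_rmse, n_candidates)

# With the default refit=True, best_estimator_ is already fitted on all of X, y
print(search.best_estimator_.predict(X[:3]))
```

Note that `best_score_` averages over folds, so `best_cv_rmse` is on the mean-based scale rather than the summed scale printed in this section.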


In [42]:
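# GridSearchCV refits the best estimator on the full training set by default (refit=True),
# so this explicit fit is optional
rtree_best = tree_model.best_estimator_.fit(X_train, y_train)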

### Random Forest Model (Includes Bagged Tree as a Special Case)

In [43]:
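# max_features equal to the number of predictors corresponds to bagged trees,
# so bagging is one of the candidates in this grid
parameters = {"max_features" : range(1, X_train.shape[1]+1)}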
rf_tune = GridSearchCV(RandomForestRegressor(n_estimators = 500),
                          parameters,
                          cv = 5,
                          scoring='neg_mean_squared_error') \
                          .fit(X_train, y_train)

In [44]:
print(rf_tune.best_estimator_)

RandomForestRegressor(max_features=6, n_estimators=500)
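
The "special case" in the heading can be seen directly: a random forest whose `max_features` equals the number of predictors considers every predictor at every split, which is bagging. The sketch below fits both variants on synthetic data with a small number of trees; the data, seeds, and tree counts are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption)
X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=2)
p = X.shape[1]

# All p predictors are candidates at each split: bagged trees
bagged = RandomForestRegressor(n_estimators=50, max_features=p,
                               oob_score=True, random_state=2).fit(X, y)

# Only 2 randomly chosen predictors are candidates at each split
forest = RandomForestRegressor(n_estimators=50, max_features=2,
                               oob_score=True, random_state=2).fit(X, y)

# Each underlying tree records how many features it considers per split
print(bagged.estimators_[0].max_features_, forest.estimators_[0].max_features_)

# Out-of-bag R^2 gives a rough comparison without a separate validation set
print(bagged.oob_score_, forest.oob_score_)
```

In the notebook's search the selected value is 6 out of 18 predictors, which happens to match the common p/3 rule of thumb for regression forests.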


In [45]:
rf_cv = cross_validate(rf_tune.best_estimator_,
                       X_train,
                       y_train,
                       cv = 5,
                       scoring='neg_mean_squared_error')

In [46]:
print(np.sqrt(-sum(rf_cv['test_score'])))

2600.496110484433


In [47]:
rf_best = rf_tune.best_estimator_.fit(X_train, y_train)

## 4. Compare on the Test Set

In [48]:
from sklearn.metrics import mean_squared_error
mlr_pred = mlr_best.predict(X_test)
rtree_pred = rtree_best.predict(X_test)
rf_pred = rf_best.predict(X_test)

print(np.sqrt(mean_squared_error(y_test, mlr_pred)),
      np.sqrt(mean_squared_error(y_test, rtree_pred)),
      np.sqrt(mean_squared_error(y_test, rf_pred)))

1395.9382680292672 1169.0015805156052 1129.4158279264027
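
To make the final comparison explicit, the sketch below puts the three test RMSE values printed above into a dictionary, picks the smallest, and reports how much worse the other models are in percentage terms. The dictionary labels are just names chosen for this example.

```python
# Test RMSE values copied from the output above
test_rmse = {
    "MLR": 1395.9382680292672,
    "Regression tree": 1169.0015805156052,
    "Random forest": 1129.4158279264027,
}

# Model with the lowest test error
best_name = min(test_rmse, key=test_rmse.get)

# Percentage by which each model's RMSE exceeds the best one
pct_above_best = {
    name: 100 * (value / test_rmse[best_name] - 1)
    for name, value in test_rmse.items()
}

print("Lowest test RMSE:", best_name)
for name, pct in pct_above_best.items():
    print(f"{name}: {pct:.1f}% above the best")
```

The random forest has the lowest test error. The single regression tree is only about 3.5% behind, so if interpretability matters it is a reasonable choice under the simplicity consideration from step 4, while the linear model is more than 20% behind.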
