**Note: Please make your own copy of this notebook to run and execute, thank you!**

1.   Go to the menu tab on the top left corner
2.   Click on "File"
3.   Under the File tab menu click on "Save a copy in Drive..."

# Model Evaluation

---

### Data Science Pipeline for Predictive Machine Learning Models

1. Import Libraries

2. Load the Data

3. Analyze the Dataset

4. Clean, Transform, and Prepare Data

5. Split the Data into Training and Testing Datasets

6. Choose Performance Metric for Model

7. Initialize our ML Model

8. Train and Validate ML Algorithm with Multiple Parameters to Find Optimal Model

9. Make Predictions

In [0]:
# Toy example to load, analyze, clean, prepare, train and tune or model
# NOTE: In reality we should throughly inspect and analyze our data 

# Import libraries
import sklearn
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import r2_score
from sklearn.metrics import make_scorer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import Normalizer

# Load dataset
boston = load_boston()
features, prices = boston['data'], boston['target']

# Descriptive Analysis of the data
features = pd.DataFrame(features)
features.describe()

# Clean, transform, and prepare data (Not something we should automate - really should analze and understand dataset to make sure we have all the relevant data)
normal_features = Normalizer().fit_transform(features) # Normalize the features for easier processing
new_features = SelectKBest(f_classif, k=3).fit_transform(normal_features, prices) # Select the top 3 most relevant features (Should look into dimensionality reduction/feature engineering)

# Split the data into training, test, features, and labels
features_train, features_test, prices_train, prices_test = train_test_split(new_features, prices, test_size=0.2, random_state=42)

# Choose a performance metric (R2 evaluates how close the data matched the data)
scoring_fnc = make_scorer(r2_score)

# Initialize our machine learning model (Decision Tree model that does regression)
regressor = DecisionTreeRegressor()

# Train and validate best model
cv_sets = ShuffleSplit(prices_train.shape[0], test_size = 0.20, random_state = 0) # Takes our Training data and creates cross-validation sets
params = {'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]} # Parameters for our model to tune (the depth of the decision tree to make a decision)
grid = GridSearchCV(regressor, param_grid = params, scoring = scoring_fnc) # Tune our model to find the right complexity
grid = grid.fit(features_train, prices_train) # Train our models with the training data
best_classifer = grid.best_estimator_ # Get the best model

# Make new predictions (run several times and notice the test results are highly varied)
best_predictions = best_classifer.predict(features_test)
print("Final R2 score on the testing data: {:.4f}".format(r2_score(prices_test, best_predictions)))


Final R2 score on the testing data: 0.7913
