### Boston Housing Data

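To better understand the metrics used in regression, we will work with the Boston housing dataset.

First, import the dataset in the cell below and split it into training and test sets for use in the later steps.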

In [9]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import numpy as np
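# tests2 is the answer-checking module; with this import commented out,
# the t.* checks further down raise NameError
#import tests2 as t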

boston = load_boston()
y = boston.target
X = boston.data

X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.33, random_state=42)
X_train

array([[1.02330e+01, 0.00000e+00, 1.81000e+01, ..., 2.02000e+01,
        3.79700e+02, 1.80300e+01],
       [6.71910e-01, 0.00000e+00, 8.14000e+00, ..., 2.10000e+01,
        3.76880e+02, 1.48100e+01],
       [1.44550e-01, 1.25000e+01, 7.87000e+00, ..., 1.52000e+01,
        3.96900e+02, 1.91500e+01],
       ...,
       [1.50100e-02, 8.00000e+01, 2.01000e+00, ..., 1.70000e+01,
        3.90940e+02, 5.99000e+00],
       [1.11604e+01, 0.00000e+00, 1.81000e+01, ..., 2.02000e+01,
        1.09850e+02, 2.32700e+01],
       [2.28760e-01, 0.00000e+00, 8.56000e+00, ..., 2.09000e+01,
        7.08000e+01, 1.06300e+01]])
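
To make the split concrete, here is a minimal NumPy-only sketch of what a shuffled 67/33 split does to 506 rows, the size of the Boston dataset. It is an illustration under assumptions, not scikit-learn's implementation: the helper `shuffle_split` is hypothetical, and the arrays are placeholders rather than the real data.

```python
import math
import numpy as np

def shuffle_split(X, y, test_size=0.33, seed=42):
    """Hypothetical helper: shuffle row indices, then carve off the test rows."""
    n_samples = len(X)
    n_test = math.ceil(test_size * n_samples)  # round the test share up
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Placeholder arrays with the Boston dataset's shape (506 rows, 13 features)
X_demo = np.arange(506 * 13, dtype=float).reshape(506, 13)
y_demo = np.arange(506, dtype=float)

X_tr, X_te, y_tr, y_te = shuffle_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)  # (339, 13) (167, 13)
```

With 506 rows this leaves 339 rows for training and 167 for testing, so each test-set metric later in the notebook is computed over 167 houses.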

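> **Step 1:** Before we start, let's quickly check which models can be used for regression problems. Match each entry in the model dictionary below to the letter for the problem type it applies to.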

In [10]:
# When can you use the model - use each option as many times as necessary
a = 'regression'
b = 'classification'
c = 'both regression and classification'

models = {
    'decision trees': c,
    'random forest': c,
    'adaptive boosting': c,
    'logistic regression': b,
    'linear regression': a
}

#checks your answer, no need to change this code
t.q1_check(models)

NameError: name 't' is not defined

> **Step 2:** Now import from sklearn the models identified above that can be used for regression.

In [16]:
# Import models from sklearn - notice you will want to use 
# the regressor version (not classifier) - googling to find 
# each of these is what we all do!
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression

> **Step 3:** Now that you have imported the 4 models that can be used for regression problems, instantiate each of them.

In [17]:
# Instantiate each of the models you imported
# For now use the defaults for all the hyperparameters
dt_model = DecisionTreeRegressor()
rf_model = RandomForestRegressor()
ada_model = AdaBoostRegressor()
log_model = LogisticRegression()  # a classifier, so it is never fit on the continuous target
li_model = LinearRegression()

> **Step 4:** Fit your models on the training data.

In [18]:
# Fit each of your models using the training data
dt_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)
ada_model.fit(X_train, y_train)
# Skipped: LogisticRegression is a classifier and rejects a continuous target
#log_model.fit(X_train, y_train)
li_model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
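
Of the four fitted models, linear regression is the easiest to reproduce by hand. With `fit_intercept=True`, as in the output above, it fits ordinary least squares with an intercept term. The sketch below shows that idea with `np.linalg.lstsq` on synthetic data. The data, coefficients, and variable names are made up for illustration, and this is not a claim about how sklearn solves the problem internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2*x1 - 3*x2 + 5 plus small noise (not the Boston data)
X_demo = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.01, size=200)
y_demo = 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + 5.0 + noise

# Append a column of ones so the intercept is estimated as one more coefficient
X_design = np.column_stack([X_demo, np.ones(len(X_demo))])

# Least-squares solution of X_design @ w ~ y_demo
w, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)
coef, intercept = w[:2], w[2]
print(coef, intercept)  # close to [2, -3] and 5
```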

> **Step 5:** Use each of your models to predict on the test data.


In [35]:
dt_preds = dt_model.predict(X_test)
rf_preds = rf_model.predict(X_test)
ada_preds = ada_model.predict(X_test)
li_preds = li_model.predict(X_test)

> **Step 6:** Now for the material specific to this lesson. Use the dictionary below to mark each metric as a regression metric or a classification metric.

In [21]:
# problem-type options for each metric
a = 'regression'
b = 'classification'
c = 'both regression and classification'

# Match each metric to the problem type it is used for
metrics = {
    'precision': b,
    'recall': b,
    'accuracy': b,
    'r2_score': a,
    'mean_squared_error': a,
    'area_under_curve': b, 
    'mean_absolute_error': a 
}

#checks your answer, no need to change this code
t.q6_check(metrics)

NameError: name 't' is not defined

> **Step 6:** Now that you have identified the metrics that can be used for regression problems, use sklearn to import them.

In [22]:
# Import the metrics from sklearn
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error


> **Step 7:** Similar to what you did with classification models, let's make sure you are comfortable with how exactly each of these metrics is being calculated.  We can then match the value to what sklearn provides.

In [28]:
def r2(actual, preds):
    '''
    INPUT:
    actual - numpy array or pd series of actual y values
    preds - numpy array or pd series of predicted y values
    OUTPUT:
    returns the r-squared score as a float
    '''
    sse = np.sum((actual-preds)**2)
    sst = np.sum((actual - np.mean(actual))**2)
    return 1 - sse/sst

# Check solution matches sklearn
print(r2(y_test, dt_preds))
print(r2_score(y_test, dt_preds))
print("Since the above match, we can see that we have correctly calculated the r2 value.")

0.7471537117785088
0.7471537117785088
Since the above match, we can see that we have correctly calculated the r2 value.
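
To get a feel for the scale of r-squared, the following sketch evaluates the same formula on three hand-made sets of predictions. The arrays are illustrative values, not notebook data. A perfect model scores 1, always predicting the mean scores 0, and a model that does worse than the mean scores below 0.

```python
import numpy as np

def r2(actual, preds):
    sse = np.sum((actual - preds) ** 2)            # residual sum of squares
    sst = np.sum((actual - np.mean(actual)) ** 2)  # total sum of squares
    return 1 - sse / sst

actual = np.array([3.0, 5.0, 7.0, 9.0])

perfect = actual.copy()
baseline = np.full_like(actual, actual.mean())  # always predict the mean
worse = actual[::-1].copy()                     # reversed, worse than the mean

print(r2(actual, perfect))   # 1.0
print(r2(actual, baseline))  # 0.0
print(r2(actual, worse))     # -3.0
```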


> **Step 8:** Your turn: fill in the function below and check that your result matches sklearn's built-in mean_squared_error.

In [30]:
def mse(actual, preds):
    '''
    INPUT:
    actual - numpy array or pd series of actual y values
    preds - numpy array or pd series of predicted y values
    OUTPUT:
    returns the mean squared error as a float
    '''
    
    return np.sum((actual-preds)**2)/len(actual)


# Check your solution matches sklearn
print(mse(y_test, dt_preds))
print(mean_squared_error(y_test, dt_preds))
print("If the above match, you are all set!")

19.13502994011976
19.13502994011976
If the above match, you are all set!
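
As a small worked example of the same calculation, the sketch below computes the mean squared error for four made-up values, once as sum over length and once with `np.mean`, which is an equivalent spelling. It also takes the square root (RMSE), which puts the error back in the units of the target. The arrays are illustrative; only the decision tree's MSE is copied from the output above.

```python
import math
import numpy as np

actual = np.array([3.0, -0.5, 2.0, 7.0])
preds = np.array([2.5, 0.0, 2.0, 8.0])

squared_errors = (actual - preds) ** 2  # [0.25, 0.25, 0.0, 1.0]
mse_sum = np.sum(squared_errors) / len(actual)
mse_mean = np.mean(squared_errors)      # same value, shorter spelling
print(mse_sum, mse_mean)                # 0.375 0.375

# RMSE is the square root of MSE, in the same units as the target
rmse = math.sqrt(mse_mean)
tree_rmse = math.sqrt(19.13502994011976)  # decision tree MSE printed above
print(rmse, tree_rmse)                    # tree_rmse is about 4.37
```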


> **Step 9:** Now one last time: complete the function for mean absolute error, then check your function against the sklearn metric to ensure they match.

In [31]:
def mae(actual, preds):
    '''
    INPUT:
    actual - numpy array or pd series of actual y values
    preds - numpy array or pd series of predicted y values
    OUTPUT:
    returns the mean absolute error as a float
    '''
    
    return np.sum(np.abs(actual-preds))/len(actual)

# Check your solution matches sklearn
print(mae(y_test, dt_preds))
print(mean_absolute_error(y_test, dt_preds))
print("If the above match, you are all set!")

3.0485029940119763
3.0485029940119763
If the above match, you are all set!
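
The two error metrics react differently to large misses, because squaring magnifies them. The sketch below compares MAE and MSE on two made-up sets of predictions: one where every prediction is off by 1, and one where a single prediction is off by 10. The arrays and names are illustrative only.

```python
import numpy as np

def mse(actual, preds):
    return np.mean((actual - preds) ** 2)

def mae(actual, preds):
    return np.mean(np.abs(actual - preds))

actual = np.array([10.0, 20.0, 30.0, 40.0])
small_errors = actual + np.array([1.0, -1.0, 1.0, -1.0])  # every prediction off by 1
one_outlier = actual + np.array([0.0, 0.0, 0.0, 10.0])    # one prediction off by 10

print(mae(actual, small_errors), mse(actual, small_errors))  # 1.0 1.0
print(mae(actual, one_outlier), mse(actual, one_outlier))    # 2.5 25.0
```

Moving from the first case to the second, MAE grows by a factor of 2.5 while MSE grows by a factor of 25.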


> **Step 10:** Which model performed best on each metric? On a fixed test set, r2 and mse always pick the same best model, because r2 equals 1 minus the mse divided by the variance of the actual values, which is the same for every model; mae may pick a different one. Use the dictionary and space below to match each metric to its best model.

In [32]:
#match each metric to the model that performed best on it
a = 'decision tree'
b = 'random forest'
c = 'adaptive boosting'
d = 'linear regression'


best_fit = {
    'mse': b,
    'r2': b,
    'mae': b
}

#Tests your answer - don't change this code
t.check_ten(best_fit)

NameError: name 't' is not defined
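
The note in Step 10 about r2 and mse agreeing can be checked on a toy case. For a fixed set of actual values, r2 is 1 minus mse divided by the variance of those values, so the model with the lowest mse always has the highest r2. The mae carries no such guarantee. The sketch below uses two made-up prediction sets; the names `steady` and `spiky` are hypothetical and chosen so that the metrics disagree.

```python
import numpy as np

actual = np.array([10.0, 20.0, 30.0, 40.0])

# Two made-up sets of predictions for the same targets
preds = {
    "steady": actual + 3.0,                            # every prediction off by 3
    "spiky": actual + np.array([0.0, 0.0, 0.0, 8.0]),  # one prediction off by 8
}

def mse(a, p):
    return np.mean((a - p) ** 2)

def mae(a, p):
    return np.mean(np.abs(a - p))

def r2(a, p):
    # Same quantity as 1 - sse/sst, written in terms of MSE
    return 1 - mse(a, p) / np.var(a)

best_mse = min(preds, key=lambda name: mse(actual, preds[name]))
best_r2 = max(preds, key=lambda name: r2(actual, preds[name]))
best_mae = min(preds, key=lambda name: mae(actual, preds[name]))
print(best_mse, best_r2, best_mae)  # steady steady spiky
```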

In [33]:
def print_metrics(y_true, preds, model_name=None):
    '''
    INPUT:
    y_true - the y values that are actually true in the dataset (numpy array or pandas series)
    preds - the predictions for those values from some model (numpy array or pandas series)
    model_name - (str - optional) a name associated with the model if you would like to add it to the print statements 
    
    OUTPUT:
    None - prints the mse, mae, r2
    '''
    if model_name is None:
        print('Mean Squared Error: ', format(mean_squared_error(y_true, preds)))
        print('Mean Absolute Error: ', format(mean_absolute_error(y_true, preds)))
        print('R2 Score: ', format(r2_score(y_true, preds)))
        print('\n\n')
    
    else:
        print('Mean Squared Error ' + model_name + ' :' , format(mean_squared_error(y_true, preds)))
        print('Mean Absolute Error ' + model_name + ' :', format(mean_absolute_error(y_true, preds)))
        print('R2 Score ' + model_name + ' :', format(r2_score(y_true, preds)))
        print('\n\n')

In [36]:
# Print Decision Tree scores
print_metrics(y_test, dt_preds, 'tree')

# Print Random Forest scores
print_metrics(y_test, rf_preds, 'random forest')

# Print AdaBoost scores
print_metrics(y_test, ada_preds, 'adaboost')

# Linear Regression scores
print_metrics(y_test, li_preds, 'linear reg')


Mean Squared Error tree : 19.13502994011976
Mean Absolute Error tree : 3.0485029940119763
R2 Score tree : 0.7471537117785088



Mean Squared Error random forest : 10.75214994610778
Mean Absolute Error random forest : 2.1911796407185617
R2 Score random forest : 0.8579233367921637



Mean Squared Error adaboost : 14.552095369061185
Mean Absolute Error adaboost : 2.7293361262950757
R2 Score adaboost : 0.807711651801615



Mean Squared Error linear reg : 20.724023437339717
Mean Absolute Error linear reg : 3.148255754816822
R2 Score linear reg : 0.7261570836552481
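
To read the winner off these numbers without comparing by eye, one option is to collect the printed scores in a dictionary and pick the best model per metric, remembering that lower is better for the two error metrics and higher is better for r2. The sketch below copies the scores from the output above; the dictionary layout and variable names are illustrative.

```python
# Test-set scores copied from the printed output above
scores = {
    "decision tree": {"mse": 19.13502994011976, "mae": 3.0485029940119763, "r2": 0.7471537117785088},
    "random forest": {"mse": 10.75214994610778, "mae": 2.1911796407185617, "r2": 0.8579233367921637},
    "adaptive boosting": {"mse": 14.552095369061185, "mae": 2.7293361262950757, "r2": 0.807711651801615},
    "linear regression": {"mse": 20.724023437339717, "mae": 3.148255754816822, "r2": 0.7261570836552481},
}

# Errors should be minimised, r2 maximised
lower_is_better = {"mse": True, "mae": True, "r2": False}

best = {}
for metric, lower in lower_is_better.items():
    pick = min if lower else max
    best[metric] = pick(scores, key=lambda name: scores[name][metric])

print(best)
# {'mse': 'random forest', 'mae': 'random forest', 'r2': 'random forest'}
```

In this run the random forest wins on all three metrics. The tree-based models were instantiated without a fixed `random_state`, so rerunning the notebook can give slightly different numbers.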



