<a href="https://colab.research.google.com/github/villafue/Machine_Learning_Notes/blob/master/Supervised_Learning/Supervised%20Learning%20with%20Scikit-Learn/Regression/Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Which of the following is a regression problem?
Andy introduced regression to you using the Boston housing dataset. But regression models can be used in a variety of contexts to solve a variety of different problems.

Given below are four example applications of machine learning. Your job is to pick the one that is best framed as a regression problem.

Incorrect Answers:

An e-commerce company using labeled customer data to predict whether or not a customer will purchase a particular item.

- There are only two outcomes here: Either the customer will purchase the item, or they will not. This is a classification task.

A healthcare company using data about cancer tumors (such as their geometric measurements) to predict whether a new tumor is benign or malignant.

- There are only two outcomes here: Either the tumor is benign, or it is malignant. This is a classification task.

A restaurant using review data to ascribe positive or negative sentiment to a given review.

- The target variable here is the sentiment of a review: It can be either positive or negative. This is not a task suited to regression.

Correct Answer:

A bike share company using time and weather data to predict the number of bikes being rented at any given hour.

- Great work! The target variable here - the number of bike rentals at any given hour - is quantitative, so this is best framed as a regression problem.

# Importing data for supervised learning
In this chapter, you will work with Gapminder data that we have consolidated into one CSV file available in the workspace as 'gapminder.csv'. Specifically, your goal will be to use this data to predict the life expectancy in a given country based on features such as the country's GDP, fertility rate, and population. As in Chapter 1, the dataset has been preprocessed.

Since the target variable here is quantitative, this is a regression problem. To begin, you will fit a linear regression with just one feature: 'fertility', which is the average number of children a woman in a given country gives birth to. In later exercises, you will use all the features to build regression models.

Before that, however, you need to import the data and get it into the form needed by scikit-learn. This involves creating feature and target variable arrays. Furthermore, since you are going to use only one feature to begin with, you need to do some reshaping using NumPy's .reshape() method. Don't worry too much about this reshaping right now, but it is something you will have to do occasionally when working with scikit-learn so it is useful to practice.
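To make the reshaping step concrete, here is a minimal sketch on made-up toy data (the arrays `x` and `y` below are illustrative assumptions, not the Gapminder columns). It shows how `.reshape(-1, 1)` turns a 1-D array into a single-column 2-D array, which is the layout scikit-learn expects for the feature matrix.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data, made up for illustration: five samples of a single feature
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

print(x.shape)  # (5,)

# -1 lets NumPy infer the number of rows; 1 fixes a single column
X = x.reshape(-1, 1)
print(X.shape)  # (5, 1)

# The 2-D feature array can now be passed to a scikit-learn estimator
reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)
```

Note that scikit-learn also accepts a 1-D target array, so reshaping `y` as the exercise does is harmless but not strictly required for fitting.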

Instructions:

1. Import numpy and pandas as their standard aliases.
2. Read the file 'gapminder.csv' into a DataFrame df using the read_csv() function.
3. Create array X for the 'fertility' feature and array y for the 'life' target variable.
4. Reshape the arrays by using the .reshape() method and passing in -1 and 1.

Conclusion:

Great work! Notice the differences in shape before and after applying the .reshape() method. Getting the feature and target variable arrays into the right format for scikit-learn is an important precursor to model building.

In [None]:
# Import numpy and pandas
import numpy as np
import pandas as pd

# Read the CSV file into a DataFrame: df
df = pd.read_csv('gapminder.csv')

# Create arrays for features and target variable
y = df.life.values
X = df.fertility.values

# Print the dimensions of X and y before reshaping
print("Dimensions of y before reshaping: {}".format(y.shape))
print("Dimensions of X before reshaping: {}".format(X.shape))

# Reshape X and y
y = y.reshape(-1,1)
X = X.reshape(-1,1)

# Print the dimensions of X and y after reshaping
print("Dimensions of y after reshaping: {}".format(y.shape))
print("Dimensions of X after reshaping: {}".format(X.shape))

'''
<script.py> output:
    Dimensions of y before reshaping: (139,)
    Dimensions of X before reshaping: (139,)
    Dimensions of y after reshaping: (139, 1)
    Dimensions of X after reshaping: (139, 1)
'''

# Exploring the Gapminder data
As always, it is important to explore your data before building models. On the right, we have constructed a heatmap showing the correlation between the different features of the Gapminder dataset, which has been pre-loaded into a DataFrame as df and is available for exploration in the IPython Shell. Cells that are in green show positive correlation, while cells that are in red show negative correlation. Take a moment to explore this: Which features are positively correlated with life, and which ones are negatively correlated? Does this match your intuition?

Then, in the IPython Shell, explore the DataFrame using pandas methods such as .info(), .describe(), .head().

In case you are curious, the heatmap was generated using Seaborn's heatmap function and the following line of code, where df.corr() computes the pairwise correlation between columns:

```
sns.heatmap(df.corr(), square=True, cmap='RdYlGn')
```
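As a rough illustration of what the correlation matrix behind such a heatmap contains, the sketch below builds a small synthetic DataFrame (the column names `a`, `up`, and `down` are made up) and inspects the signs of the pairwise correlations. It does not use the Gapminder data and skips the plotting step.

```python
import numpy as np
import pandas as pd

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "up": 2.0 * a + rng.normal(scale=0.5, size=200),     # rises with a
    "down": -3.0 * a + rng.normal(scale=0.5, size=200),  # falls as a rises
})

# Pairwise Pearson correlation between all columns
corr = toy.corr()
print(corr.round(2))

# Sorting one column of the matrix shows which features move with or against it
print(corr["a"].sort_values())
```

Positive entries correspond to the green cells of the heatmap and negative entries to the red ones.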

Once you have a feel for the data, consider the statements below and select the one that is not true. After this, Hugo will explain the mechanics of linear regression in the next video and you will be on your way building regression models!

Wrong Answers:

1. The DataFrame has 139 samples (or rows) and 9 columns

 - The DataFrame does indeed have 139 rows and 9 columns, as seen by using df.info(). Remember, life is a column as well, even though we are using it as our target variable.

```
In [1]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 139 entries, 0 to 138
Data columns (total 9 columns):
population         139 non-null float64
fertility          139 non-null float64
HIV                139 non-null float64
CO2                139 non-null float64
BMI_male           139 non-null float64
GDP                139 non-null float64
BMI_female         139 non-null float64
life               139 non-null float64
child_mortality    139 non-null float64
dtypes: float64(9)
memory usage: 9.9 KB
```
2. life and fertility are negatively correlated.

  - This is a true statement. Look at the heatmap: the cell corresponding to life and fertility is red, indicating a negative correlation.

3. The mean of life is 69.602878.

 - Using df.describe() shows that the mean of life is indeed 69.602878.
```
In [2]: df.describe()
Out[2]: 
         population   fertility         HIV         CO2    BMI_male  \
count  1.390000e+02  139.000000  139.000000  139.000000  139.000000   
mean   3.549977e+07    3.005108    1.915612    4.459874   24.623054   
std    1.095121e+08    1.615354    4.408974    6.268349    2.209368   
min    2.773150e+05    1.280000    0.060000    0.008618   20.397420   
25%    3.752776e+06    1.810000    0.100000    0.496190   22.448135   
50%    9.705130e+06    2.410000    0.400000    2.223796   25.156990   
75%    2.791973e+07    4.095000    1.300000    6.589156   26.497575   
max    1.197070e+09    7.590000   25.900000   48.702062   28.456980   

                 GDP  BMI_female        life  child_mortality  
count     139.000000  139.000000  139.000000       139.000000  
mean    16638.784173  126.701914   69.602878        45.097122  
std     19207.299083    4.471997    9.122189        45.724667  
min       588.000000  117.375500   45.200000         2.700000  
25%      2899.000000  123.232200   62.200000         8.100000  
50%      9938.000000  126.519600   72.000000        24.000000  
75%     23278.500000  130.275900   76.850000        74.200000  
max    126076.000000  135.492000   82.600000       192.000000
```
4. GDP and life are positively correlated.

 - This is a true statement. Look at the heatmap: the cell corresponding to GDP and life is green, indicating a positive correlation.

Correct Answer:

fertility is of type int64

- Good job! As seen by using df.info(), fertility, along with all the other columns, is of type float64, not int64.





# Fit & predict for regression
Now, you will fit a linear regression and predict life expectancy using just one feature. You saw Andy do this earlier using the 'RM' feature of the Boston housing dataset. In this exercise, you will use the 'fertility' feature of the Gapminder dataset. Since the goal is to predict life expectancy, the target variable here is 'life'. The array for the target variable has been pre-loaded as y and the array for 'fertility' has been pre-loaded as X_fertility.

A scatter plot with 'fertility' on the x-axis and 'life' on the y-axis has been generated. As you can see, there is a strongly negative correlation, so a linear regression should be able to capture this trend. Your job is to fit a linear regression and then predict the life expectancy, overlaying these predicted values on the plot to generate a regression line. You will also compute and print the R^2 score using scikit-learn's .score() method.

**Instructions**

1. Import LinearRegression from sklearn.linear_model.
2. Create a LinearRegression regressor called reg.
3. Set up the prediction space to range from the minimum to the maximum of X_fertility. This has been done for you.
4. Fit the regressor to the data (X_fertility and y) and compute its predictions using the .predict() method and the prediction_space array.
5. Compute and print the R^2 score using the .score() method.
6. Overlay the plot with your linear regression line. This has been done for you, so hit 'Submit Answer' to see the result!

**Hint**

1. You can import x from y using the command from y import x.
2. Use the function LinearRegression() to create the regressor.
3. Use the .fit() method on reg with X_fertility and y as arguments to fit the model.
4. Use the .predict() method on reg with prediction_space as the argument to compute the predictions.
5. Use the .score() method with X_fertility and y as arguments to compute the R^2 score.

**Conclusion**

Fantastic! Notice how the line captures the underlying trend in the data. And the performance is quite decent for this basic regression model with only one feature!
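For readers wondering what the .score() method returns for a regressor, the sketch below recomputes R^2 by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and compares it with `reg.score()`. The data is synthetic and only loosely shaped like the fertility example; the slope, intercept, and noise level are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a negative linear trend (illustrative values only)
rng = np.random.default_rng(42)
X = rng.uniform(1, 8, size=(100, 1))
y = 80 - 4 * X[:, 0] + rng.normal(scale=5, size=100)

reg = LinearRegression().fit(X, y)
y_hat = reg.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The two values should agree up to floating-point error
print(r2_manual, reg.score(X, y))
```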

In [None]:
# Import LinearRegression
from sklearn.linear_model import LinearRegression

# Create the regressor: reg
reg = LinearRegression()

# Create the prediction space
prediction_space = np.linspace(min(X_fertility), max(X_fertility)).reshape(-1,1)
'''
In [3]: prediction_space
Out[3]: 
array([[1.28      ],
       [1.40877551],
       [1.53755102],
       [1.66632653],
       [1.79510204],
       [1.92387755],
       [2.05265306],
       [2.18142857],
       [2.31020408],
       [2.43897959],
       [2.5677551 ],
       [2.69653061],
       [2.82530612],
       [2.95408163],
       [3.08285714],
       [3.21163265],
       [3.34040816],
       [3.46918367],
       [3.59795918],
       [3.72673469],
       [3.8555102 ],
       [3.98428571],
       [4.11306122],
       [4.24183673],
       [4.37061224],
       [4.49938776],
       [4.62816327],
       [4.75693878],
       [4.88571429],
       [5.0144898 ],
       [5.14326531],
       [5.27204082],
       [5.40081633],
       [5.52959184],
       [5.65836735],
       [5.78714286],
       [5.91591837],
       [6.04469388],
       [6.17346939],
       [6.3022449 ],
       [6.43102041],
       [6.55979592],
       [6.68857143],
       [6.81734694],
       [6.94612245],
       [7.07489796],
       [7.20367347],
       [7.33244898],
       [7.46122449],
       [7.59      ]])
'''
# Fit the model to the data
reg.fit(X_fertility, y)

'''
In [4]: X_fertility
Out[4]: 
array([[2.73],
       [6.43],
       [2.24],
       [1.4 ],
       [1.96],
       [1.41],
       [1.99],
       [1.89],
       [2.38],
       [1.83],
       [1.42],
       [1.82],
       [2.91],
       [5.27],
       [2.51],
       [3.48],
       [2.86],
       [1.9 ],
       [1.43],
       [6.04],
       [6.48],
       [3.05],
       [5.17],
       [1.68],
       [6.81],
       [1.89],
       [2.43],
       [5.05],
       [5.1 ],
       [1.91],
       [4.91],
       [1.43],
       [1.5 ],
       [1.89],
       [3.76],
       [2.73],
       [2.95],
       [2.32],
       [5.31],
       [5.16],
       [1.62],
       [2.74],
       [1.85],
       [1.97],
       [4.28],
       [5.8 ],
       [1.79],
       [1.37],
       [4.19],
       [1.46],
       [4.12],
       [5.34],
       [5.25],
       [2.74],
       [3.5 ],
       [3.27],
       [1.33],
       [2.12],
       [2.64],
       [2.48],
       [1.88],
       [2.  ],
       [2.92],
       [1.39],
       [2.39],
       [1.34],
       [2.51],
       [4.76],
       [1.5 ],
       [1.57],
       [3.34],
       [5.19],
       [1.42],
       [1.63],
       [4.79],
       [5.78],
       [2.05],
       [2.38],
       [6.82],
       [1.38],
       [4.94],
       [1.58],
       [2.35],
       [1.49],
       [2.37],
       [2.44],
       [5.54],
       [2.05],
       [2.9 ],
       [1.77],
       [2.12],
       [2.72],
       [7.59],
       [6.02],
       [1.96],
       [2.89],
       [3.58],
       [2.61],
       [4.07],
       [3.06],
       [2.58],
       [3.26],
       [1.33],
       [1.36],
       [2.2 ],
       [1.34],
       [1.49],
       [5.06],
       [5.11],
       [1.41],
       [5.13],
       [1.28],
       [1.31],
       [1.43],
       [7.06],
       [2.54],
       [1.42],
       [2.32],
       [4.79],
       [2.41],
       [3.7 ],
       [1.92],
       [1.47],
       [3.7 ],
       [5.54],
       [1.48],
       [4.88],
       [1.8 ],
       [2.04],
       [2.15],
       [6.34],
       [1.38],
       [1.87],
       [2.07],
       [2.11],
       [2.46],
       [1.86],
       [5.88],
       [3.85]])

In [5]: y
Out[5]: 
array([[75.3],
       [58.3],
       [75.5],
       [72.5],
       [81.5],
       [80.4],
       [70.6],
       [72.2],
       [68.4],
       [75.3],
       [70.1],
       [79.4],
       [70.7],
       [63.2],
       [67.6],
       [70.9],
       [61.2],
       [73.9],
       [73.2],
       [59.4],
       [57.4],
       [66.2],
       [56.6],
       [80.7],
       [54.8],
       [78.9],
       [75.1],
       [62.6],
       [58.6],
       [79.7],
       [55.9],
       [76.5],
       [77.8],
       [78.7],
       [61. ],
       [74. ],
       [70.1],
       [74.1],
       [56.7],
       [60.4],
       [74. ],
       [65.7],
       [79.4],
       [81. ],
       [57.5],
       [62.2],
       [72.1],
       [80. ],
       [62.7],
       [79.5],
       [70.8],
       [58.3],
       [51.3],
       [63. ],
       [61.7],
       [70.9],
       [73.8],
       [82. ],
       [64.4],
       [69.5],
       [76.9],
       [79.4],
       [80.9],
       [81.4],
       [75.5],
       [82.6],
       [66.1],
       [61.5],
       [72.3],
       [77.6],
       [45.2],
       [61. ],
       [72. ],
       [80.7],
       [63.4],
       [51.4],
       [74.5],
       [78.2],
       [55.8],
       [81.4],
       [63.6],
       [72.1],
       [75.7],
       [69.6],
       [63.2],
       [73.3],
       [55. ],
       [60.8],
       [68.6],
       [80.3],
       [80.2],
       [75.2],
       [59.7],
       [58. ],
       [80.7],
       [74.6],
       [64.1],
       [77.1],
       [58.2],
       [73.6],
       [76.8],
       [69.4],
       [75.3],
       [79.2],
       [80.4],
       [73.4],
       [67.6],
       [62.2],
       [64.3],
       [76.4],
       [55.9],
       [80.9],
       [74.8],
       [78.5],
       [56.7],
       [55. ],
       [81.1],
       [74.3],
       [67.4],
       [69.1],
       [46.1],
       [81.1],
       [81.9],
       [69.5],
       [59.7],
       [74.1],
       [60. ],
       [71.3],
       [76.5],
       [75.1],
       [57.2],
       [68.2],
       [79.5],
       [78.2],
       [76. ],
       [68.7],
       [75.4],
       [52. ],
       [49. ]])
'''

# Compute predictions over the prediction space: y_pred
y_pred = reg.predict(prediction_space)

'''
In [3]: y_pred
Out[3]: 
array([[77.26904851],
       [76.69678573],
       [76.12452294],
       [75.55226016],
       [74.97999737],
       [74.40773459],
       [73.83547181],
       [73.26320902],
       [72.69094624],
       [72.11868345],
       [71.54642067],
       [70.97415788],
       [70.4018951 ],
       [69.82963232],
       [69.25736953],
       [68.68510675],
       [68.11284396],
       [67.54058118],
       [66.9683184 ],
       [66.39605561],
       [65.82379283],
       [65.25153004],
       [64.67926726],
       [64.10700447],
       [63.53474169],
       [62.96247891],
       [62.39021612],
       [61.81795334],
       [61.24569055],
       [60.67342777],
       [60.10116498],
       [59.5289022 ],
       [58.95663942],
       [58.38437663],
       [57.81211385],
       [57.23985106],
       [56.66758828],
       [56.0953255 ],
       [55.52306271],
       [54.95079993],
       [54.37853714],
       [53.80627436],
       [53.23401157],
       [52.66174879],
       [52.08948601],
       [51.51722322],
       [50.94496044],
       [50.37269765],
       [49.80043487],
       [49.22817208]])
'''

# Print R^2 
print(reg.score(X_fertility, y))

'''
<script.py> output:
    0.6192442167740035
'''

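# Plot regression line
# (pyplot is pre-imported in the exercise environment; import it when running elsewhere)
import matplotlib.pyplot as plt
plt.plot(prediction_space, y_pred, color='black', linewidth=3)
plt.show()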


# Train/test split for regression
As you learned in Chapter 1, train and test sets are vital to ensure that your supervised learning model is able to generalize well to new data. This was true for classification models, and is equally true for linear regression models.

In this exercise, you will split the Gapminder dataset into training and testing sets, and then fit and predict a linear regression over all features. In addition to computing the R^2 score, you will also compute the Root Mean Squared Error (RMSE), which is another commonly used metric to evaluate regression models. The feature array X and target variable array y have been pre-loaded for you from the DataFrame df.

**Instructions**

1. Import LinearRegression from sklearn.linear_model, mean_squared_error from sklearn.metrics, and train_test_split from sklearn.model_selection.

2. Using X and y, create training and test sets such that 30% is used for testing and 70% for training. Use a random state of 42.

3. Create a linear regression regressor called reg_all, fit it to the training set, and evaluate it on the test set.

4. Compute and print the R^2 score using the .score() method on the test set.

5. Compute and print the RMSE. To do this, first compute the Mean Squared Error using the mean_squared_error() function with the arguments y_test and y_pred, and then take its square root using np.sqrt().

**Conclusion:**

Excellent! Using all features has improved the model score. This makes sense, as the model has more information to learn from. However, there is one potential pitfall to this process. Can you spot it? You'll learn about this, as well as how to better validate your models, in the next video!

In [None]:
# Import necessary modules
from sklearn.linear_model import LinearRegression  
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

# Create the regressor: reg_all
reg_all = LinearRegression()

# Fit the regressor to the training data
reg_all.fit(X_train, y_train)

# Predict on the test data: y_pred
y_pred = reg_all.predict(X_test)

# Compute and print R^2 and RMSE
print("R^2: {}".format(reg_all.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))

'''
<script.py> output:
    R^2: 0.838046873142936
    Root Mean Squared Error: 3.2476010800377213
'''
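As a small cross-check on the RMSE computation above, the following sketch works through the arithmetic on four made-up values (not Gapminder data) and compares the hand calculation with the `mean_squared_error()` route used in the exercise.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted values, for illustration only
y_true = np.array([70.0, 65.0, 80.0, 55.0])
y_pred = np.array([72.0, 63.0, 79.0, 59.0])

# Errors are [-2, 2, 1, -4]; squared errors are [4, 4, 1, 16]
errors = y_true - y_pred

# Mean squared error is 25 / 4 = 6.25, so the RMSE is 2.5
rmse_manual = np.sqrt(np.mean(errors ** 2))

# Same quantity via scikit-learn's MSE followed by a square root
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

Because the RMSE is in the same units as the target, a value of 2.5 here would mean predictions are off by roughly 2.5 units on a typical sample.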

# 5-fold cross-validation
Cross-validation is a vital step in evaluating a model. It maximizes the amount of data that is used to train the model, as during the course of training, the model is not only trained, but also tested on all of the available data.

In this exercise, you will practice 5-fold cross-validation on the Gapminder data. By default, scikit-learn's cross_val_score() function uses R^2 as the metric of choice for regression. Since you are performing 5-fold cross-validation, the function will return 5 scores. Your job is to compute these 5 scores and then take their average.

The DataFrame has been loaded as df and split into the feature/target variable arrays X and y. The modules pandas and numpy have been imported as pd and np, respectively.

**Instructions**

1. Import LinearRegression from sklearn.linear_model and cross_val_score from sklearn.model_selection.

2. Create a linear regression regressor called reg.

3. Use the cross_val_score() function to perform 5-fold cross-validation on X and y.

4. Compute and print the average cross-validation score. You can use NumPy's mean() function to compute the average.

**Conclusion**

Great work! Now that you have cross-validated your model, you can more confidently evaluate its predictions.

In [None]:
# Import the necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Create a linear regression object: reg
reg = LinearRegression()

# Compute 5-fold cross-validation scores: cv_scores
cv_scores = cross_val_score(reg, X, y, cv=5)

# Print the 5-fold cross-validation scores
print(cv_scores)

print("Average 5-Fold CV Score: {}".format(np.mean(cv_scores)))

'''
<script.py> output:
    [0.81720569 0.82917058 0.90214134 0.80633989 0.94495637]
    Average 5-Fold CV Score: 0.8599627722793232
'''
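To see roughly what cross_val_score() does in the cell above, the sketch below runs the five folds by hand on synthetic data and compares the result with the library call. It assumes that, for a regressor and an integer `cv`, cross_val_score() splits the data with an unshuffled `KFold` and scores each fold with the estimator's own .score() method; the data and coefficients are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

# Manual 5-fold loop: fit on four folds, score (R^2) on the held-out fold
manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    reg = LinearRegression().fit(X[train_idx], y[train_idx])
    manual_scores.append(reg.score(X[test_idx], y[test_idx]))

# Library call for comparison
library_scores = cross_val_score(LinearRegression(), X, y, cv=5)

print(np.allclose(manual_scores, library_scores))
print(np.mean(library_scores))
```

Each sample ends up in the held-out fold exactly once, which is why the averaged score depends less on any single split than one train/test split does.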

# Regularization I: Lasso
In the video, you saw how Lasso selected out the 'RM' feature as being the most important for predicting Boston house prices, while shrinking the coefficients of certain other features to 0. Its ability to perform feature selection in this way becomes even more useful when you are dealing with data involving thousands of features.

In this exercise, you will fit a lasso regression to the Gapminder data you have been working with and plot the coefficients. Just as with the Boston data, you will find that the coefficients of some features are shrunk to 0, with only the most important ones remaining.

The feature and target variable arrays have been pre-loaded as X and y.

**Instructions**

1. Import Lasso from sklearn.linear_model.
2. Instantiate a Lasso regressor with an alpha of 0.4 and specify normalize=True.
3. Fit the regressor to the data and compute the coefficients using the coef_ attribute.
4. Plot the coefficients on the y-axis and column names on the x-axis. This has been done for you, so hit 'Submit Answer' to view the plot!

**Conclusion**

Great work! According to the lasso algorithm, it seems like 'child_mortality' is the most important feature when predicting life expectancy.

In [None]:
# Import Lasso
from sklearn.linear_model import Lasso

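# Instantiate a lasso regressor: lasso
# Note: the normalize argument was removed in scikit-learn 1.2,
# so this call only runs on older versions
lasso = Lasso(alpha=0.4, normalize=True)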

# Fit the regressor to the data
# Compute and print the coefficients
lasso_coef = lasso.fit(X, y).coef_

print(lasso_coef)

# Plot the coefficients
plt.plot(range(len(df_columns)), lasso_coef)
plt.xticks(range(len(df_columns)), df_columns.values, rotation=60)
plt.margins(0.02)
plt.show()

'''
<script.py> output:
    [-0.         -0.         -0.          0.          0.          0.
     -0.         -0.07087587]
'''
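Newer scikit-learn releases no longer accept the `normalize` argument, so the cell above may fail outside the original exercise environment. One possible substitute, sketched below on synthetic data, is to scale the features with `StandardScaler` inside a pipeline before fitting Lasso. This is not an exact replacement: the scaling differs from what normalize=True did, so the alpha of 0.5 used here is an arbitrary assumption and the alpha of 0.4 from the exercise is not directly comparable.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first of five features actually drives the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Scale features, then fit Lasso; make_pipeline names each step after its class
model = make_pipeline(StandardScaler(), Lasso(alpha=0.5))
model.fit(X, y)

# Coefficients are expressed in units of the standardized features
coef = model.named_steps["lasso"].coef_
print(coef)
```

On this toy data the irrelevant features are shrunk to exactly zero while the informative one survives, which mirrors the feature-selection behavior described for the Gapminder fit.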

# Regularization II: Ridge
Lasso is great for feature selection, but when building regression models, Ridge regression should be your first choice.

Recall that lasso performs regularization by adding to the loss function a penalty term of the absolute value of each coefficient multiplied by some alpha. This is also known as L1 regularization because the regularization term is the L1 norm of the coefficients. This is not the only way to regularize, however.

If instead you took the sum of the squared values of the coefficients multiplied by some alpha, as in Ridge regression, the penalty would be based on the L2 norm of the coefficients (strictly, its square). In this exercise, you will practice fitting ridge regression models over a range of different alphas and plot the cross-validated R^2 scores for each. To do so, you will use the following function that we have defined for you, which plots the R^2 score as well as the standard error for each alpha:

```
def display_plot(cv_scores, cv_scores_std):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.plot(alpha_space, cv_scores)

    std_error = cv_scores_std / np.sqrt(10)

    ax.fill_between(alpha_space, cv_scores + std_error, cv_scores - std_error, alpha=0.2)
    ax.set_ylabel('CV Score +/- Std Error')
    ax.set_xlabel('Alpha')
    ax.axhline(np.max(cv_scores), linestyle='--', color='.5')
    ax.set_xlim([alpha_space[0], alpha_space[-1]])
    ax.set_xscale('log')
    plt.show()
```
Don't worry about the specifics of how the above function works. The motivation behind this exercise is for you to see how the R^2 score varies with different alphas, and to understand the importance of selecting the right value for alpha. You'll learn how to tune alpha in the next chapter.
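For reference, the two penalties described in this section can be written side by side. The notation is an assumption made for illustration: $a_i$ denotes the $i$-th coefficient, $n$ the number of features, and "OLS loss" the ordinary least-squares loss that is being regularized.

$$
\text{Lasso (L1):}\quad \text{OLS loss} + \alpha \sum_{i=1}^{n} |a_i|
$$

$$
\text{Ridge (L2):}\quad \text{OLS loss} + \alpha \sum_{i=1}^{n} a_i^2
$$

The ridge term is the squared L2 norm of the coefficient vector, while the lasso term is its L1 norm. In both cases a larger alpha penalizes large coefficients more heavily, and alpha equal to 0 recovers plain least squares.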

**Instructions**

1. Instantiate a Ridge regressor and specify normalize=True.
2. Inside the for loop:
   - Specify the alpha value for the regressor to use.
   - Perform 10-fold cross-validation on the regressor with the specified alpha. The data is available in the arrays X and y.
   - Append the average and the standard deviation of the computed cross-validated scores. NumPy has been pre-imported for you as np.
3. Use the display_plot() function to visualize the scores and standard deviations.

**Conclusion**

Great work! Notice how the cross-validation scores change with different alphas. Which alpha should you pick? How can you fine-tune your model? You'll learn all about this in the next chapter!

In [None]:
# Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Setup the array of alphas and lists to store scores
alpha_space = np.logspace(-4, 0, 50)
ridge_scores = []
ridge_scores_std = []

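# Create a ridge regressor: ridge
# Note: the normalize argument was removed in scikit-learn 1.2,
# so this call only runs on older versions
ridge = Ridge(normalize=True)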

# Compute scores over range of alphas
for alpha in alpha_space:

    # Specify the alpha value to use: ridge.alpha
    ridge.alpha = alpha
    
    # Perform 10-fold CV: ridge_cv_scores
    ridge_cv_scores = cross_val_score(ridge, X, y, cv = 10)
    
    # Append the mean of ridge_cv_scores to ridge_scores
    ridge_scores.append(np.mean(ridge_cv_scores))
    
    # Append the std of ridge_cv_scores to ridge_scores_std
    ridge_scores_std.append(np.std(ridge_cv_scores))

# Display the plot
display_plot(ridge_scores, ridge_scores_std)
