#### Intro to Regression

regression is another type of supervised learning problem (classification was the one we just learned)

in regression tasks the target variable is a continuously varying variable, such as GDP or the price of a house 

if the target variable is quantitative then it's a regression problem

In [None]:
# load the data from a csv using pandas' read_csv() function
import pandas as pd

boston = pd.read_csv('boston.csv')
# and view the head
print(boston.head())

In [None]:
# scikit-learn wants features and target values in distinct arrays, X and y, so split the dataframe
# the .values attribute returns the NumPy arrays that we'll use
X = boston.drop('MEDV', axis=1).values # drop the target
y = boston['MEDV'].values # keep only the target

In [None]:
# predicting house value from a single feature, the average number of rooms in a block
X_rooms = X[:, 5] # slice out the rooms column (index 5, i.e. the 6th column in this case)
# check the types and you'll see that both are NumPy arrays
type(X_rooms), type(y)

In [None]:
# turn the NumPy arrays into the desired shape, keep the first dimension but add another dimension of size 1 to X
y = y.reshape(-1, 1)
X_rooms = X_rooms.reshape(-1, 1)
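
to see what reshape(-1, 1) does, here's a small sketch on a made-up array (the values are illustrative, not taken from the boston data), scikit-learn expects the feature array X to be 2-D, with one row per sample and one column per feature

```python
import numpy as np

# a toy 1-D feature array standing in for X_rooms (values are made up)
rooms = np.array([6.5, 5.9, 7.1, 6.0])
print(rooms.shape)

# -1 lets NumPy infer the number of rows, 1 forces a single column
rooms_2d = rooms.reshape(-1, 1)
print(rooms_2d.shape)
```

the data itself doesn't change, only the shape goes from one flat axis to a column with one row per sample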

In [None]:
# now plot house value as a function of number of rooms
import matplotlib.pyplot as plt

plt.scatter(X_rooms, y)
plt.ylabel('Value of house /1000 ($)') # label the y axis
plt.xlabel('Number of rooms') # label the x axis
plt.show()
# you would see that, as expected, more rooms tends to go with higher prices

In [None]:
# now fit a regression model to the data
import numpy as np
from sklearn.linear_model import LinearRegression

# instantiate LinearRegression as reg
reg = LinearRegression()
# fit the regressor to the data
reg.fit(X_rooms, y)
# check out the regressor's predictions over the range of the data
prediction_space = np.linspace(X_rooms.min(), X_rooms.max()).reshape(-1, 1)

In [None]:
# plot the line with a scatter plot
plt.scatter(X_rooms, y, color='blue')
plt.plot(prediction_space, reg.predict(prediction_space), color='black', linewidth=3)
plt.show();

#### The Basics of Linear Regression

in linear regression you want to fit a line to the data, a line in two dimensions is always in the form y=ax+b:
- y is the target
- x is the single feature
- a and b are the parameters of the model that we want to learn

the question of fitting comes down to how to choose a and b 

one method is to find an error function (also called a loss or cost function) for any given line and then choose the line that minimizes the error function

but what will our loss function be? you want the line to be as close to the actual data points as possible, which means minimizing the vertical distance between the fit and the data, so for each data point we calculate the vertical distance between it and the line, and this distance is called the residual

if you tried to minimize the plain sum of the residuals, a large positive residual would cancel out a large negative residual, that's why our goal is to minimize the **sum of the squares** of the residuals, this is our loss function and minimizing it is called **ordinary least squares (OLS)**, it's the same as minimizing the mean squared error of the predictions on the training set, when you call fit on a linear regression model in scikit-learn it performs OLS under the hood
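
a minimal sketch of why the residuals get squared, using made-up toy data and a hypothetical `residuals` helper (neither is from the lesson), the least squares solve uses NumPy's `np.linalg.lstsq` here rather than scikit-learn, just to keep the arithmetic visible

```python
import numpy as np

# toy data, made up for illustration: y is roughly 2*x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

def residuals(a, b):
    """vertical distances between the data and the line y = a*x + b"""
    return y - (a * x + b)

# a flat line through the mean of y: a poor fit, yet the residuals cancel out
flat = residuals(0.0, y.mean())
print(flat.sum())         # positives and negatives cancel
print((flat ** 2).sum())  # the squared loss exposes the bad fit

# OLS: choose a and b that minimize the sum of squared residuals
A = np.column_stack([x, np.ones_like(x)])  # columns: x and a constant for the intercept
(a_hat, b_hat), *_ = np.linalg.lstsq(A, y, rcond=None)
sse_ols = (residuals(a_hat, b_hat) ** 2).sum()
print(a_hat, b_hat, sse_ols)
```

the flat line has a residual sum of essentially zero despite fitting badly, while the sum of squares clearly separates it from the OLS line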

if you have 2 features and 1 target, the model takes the form y=a~1~x~1~+a~2~x~2~+b (a plane rather than a line), so fitting a linear regression model means learning 3 parameters: a~1~, a~2~, and b, higher dimensions just add more terms to the equation, so you learn a coefficient (a~i~) for each feature as well as the intercept (b)

the scikit-learn API works the same way, you pass two arrays, one containing the features and the other containing the target variable

the default scoring method (metric) for linear regression is called R squared, this metric quantifies the proportion of the variance in the target variable that is predicted from the feature variables

you could also compute RMSE (root mean squared error) which is another commonly used metric to evaluate regression models
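
both metrics are short formulas, here's a sketch that computes them by hand on made-up numbers and checks them against `r2_score` and `mean_squared_error` from `sklearn.metrics` (the arrays and variable names are illustrative only)

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# made-up true values and predictions, just to show the arithmetic
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_hat) ** 2)          # variation the model fails to explain
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

# RMSE: square the errors, average them, then take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(r2_manual, r2_score(y_true, y_hat))
print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
```

R squared is unitless, while RMSE is in the same units as the target, which makes it easier to read as a typical prediction error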

using all features will usually improve the model score, which makes sense, but this can lead to overfitting, where the model won't generalize to other data, so in the next lesson we'll learn how to better evaluate our models

In [None]:
# linear regression on all features
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# instantiate the regressor
reg_all = LinearRegression()

# fit it on the training set
reg_all.fit(X_train, y_train)

# predict on the test set
y_pred = reg_all.predict(X_test)

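# compute r squared, pass it the test features and the test target
print(reg_all.score(X_test, y_test))

# with many features there's no single regression line to draw over prediction_space,
# so plot predicted vs. actual values instead
plt.scatter(y_test, y_pred)
plt.xlabel('Actual value')
plt.ylabel('Predicted value')
plt.show()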
# you'll almost never use linear regression right out of the box like this but will use regularization
# regularization will place further constraints on the model coefficients but this is the first step toward using regularized linear models 

In [None]:
# exercise example, split the dataset into train and test, fit a linear regression over all features, compute R^2^,
# and RMSE (root mean squared error)
from sklearn.metrics import mean_squared_error

print("R^2: {}".format(reg_all.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE: {}".format(rmse))

#### Cross Validation

you're now more familiar with train test splits and computing model performance metrics on the test set, but there's a potential pitfall in this process
you're computing R^2^ on the test set so the result will depend on the way the data was split up, the data in the test set may have peculiarities (a few outliers, say) that mean the R^2^ computed from it isn't representative of the model's ability to generalize to unseen data
to combat this dependence on an arbitrary split you can use cross validation

cross validation is a vital step in validating a model because it makes the most of the available data, across the folds every observation is used for training and also, exactly once, for testing

**cross validation** 
- start by splitting the dataset into 5 folds (groups)
- hold out the first fold as a test set 
- fit the model on the remaining 4 folds
- predict on the test set
- compute the metric of interest
- repeat these steps with the second fold as the test set, then the third, the fourth, and the fifth

you'll then have 5 values of R^2^ and then you can compute statistics of interest such as the mean, median, and 95% confidence intervals
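
the steps above can be sketched as an explicit loop, this assumes synthetic data from `make_regression` as a stand-in for the housing data and uses `KFold` without shuffling, which is what `cross_val_score` does for a regressor when you pass an integer `cv`

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# synthetic data standing in for X and y (an assumption, not the boston data)
X_demo, y_demo = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    model = LinearRegression()
    # fit on the 4 training folds
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # R^2 on the single held-out fold
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

auto_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5)
print(np.round(manual_scores, 3))
print(np.round(auto_scores, 3))
print(np.mean(manual_scores), np.std(manual_scores))
```

the manual loop and `cross_val_score` should give the same five scores, the helper just saves you from writing the loop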

using more folds is more computationally expensive since you're fitting and predicting more times, you could put %timeit in front of the cross_val_score() call to compare how long something like 10 folds takes compared to just 3 folds

In [None]:
# cross-validation using scikit-learn
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# instantiate our model (the regressor)
reg = LinearRegression()

# call cross_val_score with the regressor, the feature data, the target, and the number of folds
cv_results = cross_val_score(reg, X, y, cv=5) 
# this returns an array of CV scores, the score reported (one for each fold) is R^2^ because
# that's the default for linear regression
print(cv_results)

# compute the mean
np.mean(cv_results)

#### Regularized Regression

fitting a linear regression means minimizing a loss function to choose a coefficient for each feature variable, if you allow these coefficients/parameters to become super large you can end up overfitting, because of this it's common to use **regularization**, which changes the loss function so that it penalizes large coefficients

one type of regularization is **ridge regression**, which uses the standard OLS loss function plus the sum of the squared values of the coefficients multiplied by some constant alpha, this makes it so coefficients with a large magnitude are penalized, whether positive or negative

alpha is a parameter that we need to choose in order to fit and predict, we'll select the alpha that makes our model perform the best, it's similar to picking K in KNN, this is called hyperparameter tuning and we'll see more of this later

this alpha value is also called **lambda** and can be thought of as a parameter that controls model complexity, when alpha=0 we get back OLS, which can lead to overfitting because large coefficients aren't penalized, while a really high alpha can overpenalize the coefficients and lead to a model that's too simple and underfits
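
to make the effect of alpha concrete, here's a sketch that solves the ridge problem directly with NumPy on made-up data, it assumes no intercept and a hypothetical helper `ridge_coefs`, so it illustrates the penalty and is not how scikit-learn's `Ridge` is implemented

```python
import numpy as np

# made-up data: 3 features with known true coefficients plus a little noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([4.0, -3.0, 0.5]) + rng.normal(scale=0.5, size=50)

def ridge_coefs(X, y, alpha):
    """closed-form minimizer of ||y - Xw||^2 + alpha * ||w||^2 (no intercept)"""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alphas = [0.0, 1.0, 100.0, 10000.0]
sq_norms = []
for alpha in alphas:
    w = ridge_coefs(X_demo, y_demo, alpha)
    sq_norms.append(float(np.sum(w ** 2)))
    print(alpha, np.round(w, 3), round(sq_norms[-1], 3))
```

with alpha=0 this is plain OLS, and as alpha grows the coefficients shrink toward 0, which is the underfitting end of the trade-off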

**lasso regression** is another type of regularized regression, the loss function is the standard OLS loss function plus the sum of the absolute values of the coefficients multiplied by some constant alpha, a cool feature of lasso regression is that it can be used to select important features of a dataset, this happens because it tends to shrink the coefficients of the less important features to exactly 0, the features whose coefficients aren't shrunk to 0 are the ones selected by the lasso, lasso regression is a great sanity check and is a good way to communicate important results to non-technical colleagues
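
a small sketch of the feature selection behaviour, assuming made-up data where only the first two of five features affect the target (the data, the alpha value, and the `_demo` names are all illustrative choices)

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# only the first two features actually influence the target
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

lasso_demo = Lasso(alpha=0.5)
lasso_demo.fit(X_demo, y_demo)
print(lasso_demo.coef_)

# indices of the features whose coefficients were not shrunk to 0
selected = np.flatnonzero(lasso_demo.coef_)
print(selected)
```

the coefficients of the three irrelevant features come out as exactly 0, while the two real ones survive (shrunk somewhat toward 0 by the penalty)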

lasso regression is great for feature selection but when building regression models you should make ridge regression your first choice

the next thing you'll want to learn is which alpha should you pick? how can you fine tune the model? 

In [None]:
# ridge regression in scikit-learn
from sklearn.linear_model import Ridge

# test/train split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
#fit on the training, predict on the test
ridge = Ridge(alpha=0.1, normalize=True) #normalize ensures all the variables are on the same scale (this argument was deprecated in scikit-learn 1.0 and removed in 1.2)
ridge.fit(X_train, y_train)
ridge_pred = ridge.predict(X_test)
ridge.score(X_test, y_test)

In [None]:
# lasso regression in scikit-learn
from sklearn.linear_model import Lasso

# test/train split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
#fit on the training, predict on the test
lasso = Lasso(alpha=0.1, normalize=True) #normalize ensures all the variables are on the same scale (this argument was deprecated in scikit-learn 1.0 and removed in 1.2)
lasso.fit(X_train, y_train)
lasso_pred = lasso.predict(X_test)
lasso.score(X_test, y_test)
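
newer scikit-learn releases no longer accept the `normalize` argument (it was deprecated in 1.0 and removed in 1.2), one common replacement is to scale the features with `StandardScaler` inside a pipeline, below is a minimal sketch assuming synthetic data from `make_regression` in place of the boston data, note that `StandardScaler` doesn't scale in exactly the way `normalize=True` did, so the same alpha won't give identical coefficients

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the features and target (an assumption, not the boston data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# the scaler is fit on the training data only, then reused to transform the test data
ridge_pipe = make_pipeline(StandardScaler(), Ridge(alpha=0.1))
ridge_pipe.fit(X_tr, y_tr)
score = ridge_pipe.score(X_te, y_te)
print(score)
```

swapping `Ridge(alpha=0.1)` for `Lasso(alpha=0.1)` in the pipeline gives the lasso equivalent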

In [None]:
# lasso for feature selection in scikit-learn
from sklearn.linear_model import Lasso

# store the feature names in the variable names
names = boston.drop('MEDV', axis=1).columns
# instantiate the regressor and fit it to the data
lasso = Lasso(alpha=0.1)
# extract the coef attribute and store it in a variable
lasso_coef = lasso.fit(X, y).coef_
# plot the coefficients as a function of feature names to see the figure
_ = plt.plot(range(len(names)), lasso_coef)
_ = plt.xticks(range(len(names)), names, rotation=60)
_ = plt.ylabel('Coefficients')
plt.show()
# you'll see that the most important predictor for the target variable is the one at the highest point (the largest coefficient)

In [None]:
# example exercise, plot a bunch of alphas
# Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Setup the array of alphas and lists to store scores
alpha_space = np.logspace(-4, 0, 50)
ridge_scores = []
ridge_scores_std = []

# Create a ridge regressor: ridge
ridge = Ridge(normalize=True) # note: normalize was deprecated in scikit-learn 1.0 and removed in 1.2

# Compute scores over range of alphas
for alpha in alpha_space:

    # Specify the alpha value to use: ridge.alpha
    ridge.alpha = alpha
    
    # Perform 10-fold CV: ridge_cv_scores
    ridge_cv_scores = cross_val_score(ridge, X, y, cv=10)
    
    # Append the mean of ridge_cv_scores to ridge_scores
    ridge_scores.append(np.mean(ridge_cv_scores))
    
    # Append the std of ridge_cv_scores to ridge_scores_std
    ridge_scores_std.append(np.std(ridge_cv_scores))

# Display the plot (display_plot is a helper that isn't defined in these notes)
display_plot(ridge_scores, ridge_scores_std)
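
since `display_plot` isn't defined anywhere in these notes, here's a hypothetical stand-in so the cell above can run, it's a guess at what such a helper would do (mean CV score against alpha on a log axis, with a band of plus or minus one standard error), the `alphas` and `n_folds` parameters and their defaults are assumptions chosen to match `alpha_space` and `cv=10` above

```python
import numpy as np
import matplotlib.pyplot as plt

def display_plot(cv_scores, cv_scores_std, alphas=None, n_folds=10):
    """hypothetical helper: mean CV score vs. alpha with a +/- standard error band"""
    cv_scores = np.asarray(cv_scores)
    std_error = np.asarray(cv_scores_std) / np.sqrt(n_folds)
    if alphas is None:
        # assumed to match alpha_space = np.logspace(-4, 0, 50) from the cell above
        alphas = np.logspace(-4, 0, len(cv_scores))

    fig, ax = plt.subplots()
    ax.plot(alphas, cv_scores)
    ax.fill_between(alphas, cv_scores + std_error, cv_scores - std_error, alpha=0.2)
    ax.axhline(np.max(cv_scores), linestyle='--', color='.5')  # best mean score
    ax.set_xscale('log')
    ax.set_xlabel('Alpha')
    ax.set_ylabel('CV Score +/- Std Error')
    return ax

# made-up scores just to exercise the function
demo_scores = np.linspace(0.70, 0.60, 50)
demo_std = np.full(50, 0.05)
ax = display_plot(demo_scores, demo_std)
```

define the function before running the alpha loop, and call `plt.show()` afterwards if the figure doesn't appear on its own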