## Regression
Regression is about taking a set of input features and creating a model that predicts numerical values. _Examples of regression models include the types of analysis done in financial markets to predict pricing, estimating how the features of an automobile affect its gas mileage, or predicting anything else that is measured on a (usually broad) continuous range._

In this example we will examine Bitcoin prices and see if we can predict the value. The data comes from a set of Coinbase trades from December of 2014 to January of 2018 and is available from Kaggle. _See the references section of this chapter for links._

Throughout this activity, we will:

* Transform and prepare the data so that we can utilize it in a regression analysis. This will include resampling data into a set of points summarized by the day.
* Create a summary of the data and look for distinguishing features.
* Create a regressor for the data which can be used to predict the price of Bitcoin.
* Assess the accuracy and performance of the model.

### Linear Regression
Linear regression is the simplest and most widely used algorithm for building regression models. The algorithm plots the dataset as a set of points with the target variable on the y axis. It then attempts to fit a straight line (or plane) to the points using a variant of the equation $y=m*x+b$.
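
As a minimal sketch of the line-fitting idea, the snippet below fits $y=m*x+b$ to made-up data using NumPy's `polyfit`. The data, the noise level, and the variable names are illustrative assumptions and are not part of the Bitcoin dataset.

```python
import numpy as np

# Toy data generated from y = 2*x + 1 with a little noise (made-up numbers)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.1, size=x.size)

# Least-squares fit of a degree-1 polynomial: returns slope m, then intercept b
m, b = np.polyfit(x, y, deg=1)
print(f"m = {m:.2f}, b = {b:.2f}")
```

The recovered slope and intercept should land close to the values used to generate the data, which is what "fitting a straight line" means in practice.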


### Evaluating the Accuracy of a Regressor

* Mean squared error (MSE): the standard evaluation metric for regression. It is the average squared difference between the true value of the target variable and the value predicted by the model.
* [R2 score (coefficient of determination)](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html). The proportion of the variation in the target that the model explains. The best possible score is 1, and the score can be negative because a model can do arbitrarily worse than one that always predicts the mean of the target.
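
Both metrics can be written out by hand, which may make their meaning clearer. The sketch below uses made-up numbers and hypothetical helper names (`mse`, `r2`); it is meant to illustrate the definitions, not to replace the `sklearn.metrics` functions used later.

```python
import numpy as np

def mse(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)

def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
good = [1.1, 1.9, 3.2, 3.8]        # close to the truth
mean_only = [2.5, 2.5, 2.5, 2.5]   # always predicts the mean
bad = [4.0, 3.0, 2.0, 1.0]         # predicts the opposite trend

for name, pred in [("good", good), ("mean_only", mean_only), ("bad", bad)]:
    print(name, mse(y_true, pred), r2(y_true, pred))
```

Always predicting the mean gives an R2 of 0, and the reversed predictions give a negative R2, which is why 0 is not the lower bound.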


### Import Dependencies

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import ensemble, linear_model, model_selection, preprocessing, svm

# Import tools that we can use to evaluate the accuracy of the model
from sklearn.metrics import mean_squared_error, r2_score
from yellowbrick.regressor import PredictionError, ResidualsPlot

### Data Preparation
Initial data preparation:

* load data from CSV file
* convert the unix time stamp to a datetime so that it is easier to resample
* reset the index
* rename columns

In [None]:
%%time
# Load the minute-interval data (it is resampled to days in a later cell)
bit_df = pd.read_csv('../input/coinbase/coinbaseUSD_1-min_data_2014-12-01_to_2018-01-08.csv',
  low_memory=False)  # malformed lines raise an error by default
bit_df['Timestamp'] = bit_df.Timestamp.astype('int', errors='ignore')

# Convert unix time to datetime so that it is easier to resample
bit_df['date'] = pd.to_datetime(bit_df.Timestamp, unit='s', errors='coerce')

# Use the datetime as the index
bit_df = bit_df.set_index('date')

# Rename columns so easier to code
bit_df = bit_df.rename(columns={'Open':'open', 'High': 'hi', 'Low': 'lo',
   'Close': 'close', 'Volume_(BTC)': 'vol_btc',
   'Volume_(Currency)': 'vol_cur',
   'Weighted_Price': 'wp', 'Timestamp': 'ts'})

# Coerce to numeric data types (safeguard against corrupt data)
bit_df['hi'] = pd.to_numeric(bit_df.hi, errors='coerce')
bit_df['lo'] = pd.to_numeric(bit_df.lo, errors='coerce')
bit_df['close'] = pd.to_numeric(bit_df.close, errors='coerce')
bit_df['open'] = pd.to_numeric(bit_df.open, errors='coerce')
bit_df['ts'] = pd.to_numeric(bit_df.ts, errors='coerce')

Resample the initial data:

In [None]:
# Resample to one row per day and keep only the most recent 1000 days
bit_df = bit_df.resample('D').agg({'open': 'mean', 'hi': 'mean',
    'lo': 'mean', 'close': 'mean', 'vol_btc': 'sum',
    'vol_cur': 'sum', 'wp': 'mean', 'ts': 'min'}).iloc[-1000:]

# Drop last row as it is not complete
bit_df = bit_df.iloc[:-1]
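
To see what the resampling step does in isolation, here is a small sketch on made-up minute-level data. The frame `toy` and its values are hypothetical; only the `resample(...).agg(...)` pattern mirrors the cell above.

```python
import pandas as pd

# Two days of made-up minute-level data (2 * 24 * 60 rows), indexed by datetime
idx = pd.date_range("2018-01-01", periods=2 * 24 * 60, freq="min")
toy = pd.DataFrame({"close": range(len(idx)), "vol_btc": 1.0}, index=idx)

# One row per calendar day: average the price, total the volume
daily = toy.resample("D").agg({"close": "mean", "vol_btc": "sum"})
print(daily)
```

Prices are averaged because a daily "typical" value is wanted, while volumes are summed because they accumulate over the day.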

In [None]:
# Display the data to view values
bit_df

In [None]:
# Show the first rows, transposed so that each column is easier to read
bit_df.head().T

In [None]:
# Create a description of the data
bit_df.describe()

In [None]:
# Plot every column over time
bit_df.plot(figsize=(14,10))

In [None]:
bit_df.close.plot(figsize=(14,10))

#### Exercise: Load Data
Exercises associated with this example look at predicting the size of forest fires based on meteorological data.

* Load data from a CSV file
* Inspect, summarize, and plot the dataset


### Can we predict tomorrow's close based on today's info?
We will use a row of data for input. We will call the input X and the prediction y. This is called "supervised learning" as we will feed in both X and y to train the model.

Let's use a model called Linear Regression. This performs better if we *standardize* the data (0 mean, 1 std).

For 2 dimensions this takes the form of:

$y = m*x + b$

$m$ is the slope (or coefficient) and $b$ is the intercept.

Let's see if we can predict the price from the ts (timestamp) component alone.

In [None]:
bit_df.plot(kind='scatter', x='ts', y='open', figsize=(14,10))

In [None]:
# Create our input (X) and our labelled data (y) to train our model
X = bit_df[['ts']].iloc[:-1]  # drop last row because it represents the value trying to be predicted
y = bit_df.close.shift(-1).iloc[:-1]
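
The `shift(-1)` call is what turns "today's row" into a predictor of "tomorrow's close". A tiny sketch on made-up prices shows the alignment; the frame `toy` and the column `next_close` are hypothetical names used only for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {"close": [10.0, 11.0, 12.0, 13.0]},
    index=pd.date_range("2018-01-01", periods=4, freq="D"),
)

# shift(-1) moves each value up one row, so every row sees the next day's close
toy["next_close"] = toy.close.shift(-1)
print(toy)

# The last row has no "tomorrow", so its target is NaN and the row is dropped
X_toy = toy[["close"]].iloc[:-1]
y_toy = toy.next_close.iloc[:-1]
```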

In [None]:
# Train a model and predict output if it were given X
lr_model = linear_model.LinearRegression()
lr_model.fit(X, y)
pred = lr_model.predict(X)

In [None]:
# Plot the real data, our prediction (blue), and the line rebuilt from the coefficient and intercept (green, shifted up by 100)
ax = bit_df.plot(kind='scatter', x='ts', y='open', color='black', figsize=(14,10))
ax.plot(X, pred, color='blue')  # matplotlib plot
ax.plot(X, X*lr_model.coef_ + lr_model.intercept_+ 100, linestyle='--', color='green')

In [None]:
# Vertical distance between line and point is the error. *Ordinary Least Squares*
# regression tries to minimize the square of the distance.
mean_squared_error(y, pred)

In [None]:
# R2 score has a maximum of 1 (it is a fraction, not a percentage)
# 0 - the model explains none of the variation
# 1 - 100% of the variation is explained by the model
# negative - the model is worse than always predicting the mean
print(r2_score(y, pred))

# Note that the .score method gives the same value
print(lr_model.score(X, y))

#### Exercise: Regression

* Use linear regression to predict `area` from the other columns.
* Calculate the predictive model's score.


### Visualize Performance of the Model: Actual and Predicted Values
You can plot the actuals and the predicted values. It looks like the model does a pretty poor job of describing the data.

In [None]:
# Prediction error plot from Yellowbrick
# plot of actual (blue) vs predicted (black dash)
# ideally would be around 45 degree line
fig, ax = plt.subplots(figsize=(10, 8))
err_viz = PredictionError(lr_model)

# Model is already fit
#err_viz.fit(X, y)
err_viz.score(X, y)
err_viz.poof()

In [None]:
# plot result
y_df = pd.DataFrame(y)
y_df['pred'] = pred
y_df['err'] = y_df.pred - y_df.close
(y_df
 #.iloc[-50:]
 .plot(figsize=(14,10))
)

#### Exercise: Visualize the Errors

* Plot the actual y and predicted y against one another to compare the accuracy of the model


### Improve the Accuracy of the Model: Try More Features
In an attempt to get a better model we are going to use more features to make a prediction. Many machine learning estimators require "standardization" of the data and will perform badly if the individual features do not more or less look like normally distributed data: Gaussian distributions with a zero mean and unit variance.
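
Standardization itself is a short calculation: subtract each column's mean and divide by its standard deviation. The sketch below does this by hand with NumPy on made-up numbers; `X_toy` is a hypothetical array, and the population standard deviation (`ddof=0`) is used on the assumption that it matches what `StandardScaler` computes.

```python
import numpy as np

# Two made-up features on very different scales
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 3000.0],
                  [3.0, 5000.0]])

mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)           # population std (ddof=0)
X_scaled = (X_toy - mean) / std   # each column now has mean 0 and std 1

print(X_scaled)
```

After scaling, both columns span the same range, so neither dominates simply because of its units.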

In [None]:
# drop last row because we don't know what future is
X = (bit_df.drop(['close'], axis=1).iloc[:-1])
y = bit_df.close.shift(-1).iloc[:-1]
cols = X.columns

In [None]:
# The describe method on a dataframe gives a statistical summary of the columns
X.describe()

In [None]:
# We are going to scale the data so that volume and ts don't get more
# weight than other values
ss = preprocessing.StandardScaler()
ss.fit(X)
X = ss.transform(X)
X = pd.DataFrame(X, columns=cols)

In [None]:
# We can now see that the data has a mean close to 0
# and a std of 1
X.describe()

In [None]:
# Initialize a linear regression model using the standardized data
lr_model2 = linear_model.LinearRegression()
lr_model2.fit(X, y)
pred = lr_model2.predict(X)
lr_model2.score(X, y)

In [None]:
# plot result
y_df = pd.DataFrame(y)
y_df['pred'] = pred
y_df['err'] = y_df.pred - y_df.close
y_df.plot(figsize=(14,10))

In [None]:
# plot result
y_df = pd.DataFrame(y)
y_df['pred'] = pred
y_df['err'] = y_df.pred - y_df.close
y_df.iloc[-50:].plot(figsize=(14,10))

In [None]:
# our scores get worse with recent data
lr_model2.score(X[-50:], y[-50:])

In [None]:
lr_model2.coef_

In [None]:
list(zip(X.columns, lr_model2.coef_))

In [None]:
# These coefficients correspond to the columns in X
pd.DataFrame(list(zip(X.columns, lr_model2.coef_)), columns=['Feature', 'Coeff'])

In [None]:
bit_df.plot(kind='scatter', x='wp', y='close', figsize=(14,10))

In [None]:
bit_df.plot(kind='scatter', x='vol_cur', y='close', figsize=(14,10))

#### Exercise: Regression
* Try scaling the input and using the log of the area and see if you get a better score.
* Examine the coefficients


### Training/Test Split
In fact we were cheating: predicting things that the model has already seen serves little purpose. The model could just memorize the data and get a perfect score, but it wouldn't *generalize* to unseen data.

To see how it will perform in the real world we will train on a portion of the data and test on a portion that it hasn't seen.
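
The memorization problem can be made concrete with a deliberately naive "model" that only looks up the closest training point. This is a sketch on made-up data; `memorizer_predict` is a hypothetical helper, and the manual 70/30 shuffle stands in for the split performed by `train_test_split`.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 3 * x + rng.normal(scale=2.0, size=200)  # noisy linear relationship

# Shuffle the row indices and hold out 30% for testing
order = rng.permutation(len(x))
n_test = int(0.3 * len(x))
test_idx, train_idx = order[:n_test], order[n_test:]

def memorizer_predict(x_new):
    """Return the y of the closest training x (a pure lookup table)."""
    distances = np.abs(x_new[:, None] - x[train_idx][None, :])
    return y[train_idx][distances.argmin(axis=1)]

train_mse = np.mean((memorizer_predict(x[train_idx]) - y[train_idx]) ** 2)
test_mse = np.mean((memorizer_predict(x[test_idx]) - y[test_idx]) ** 2)
print(train_mse, test_mse)
```

The lookup table reproduces the training targets exactly, so its training error says nothing about how it behaves on the held-out rows.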

In [None]:
# Split the data into a set for training and testing: X = Feature, Y = Target
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=.3, random_state=42)

In [None]:
# Train the model on the training data, evaluate on the testing data
lr_model2 = linear_model.LinearRegression()
lr_model2.fit(X_train, y_train)
lr_model2.score(X_test, y_test)

In [None]:
y_df2 = pd.DataFrame(y_test)
y_df2['pred'] = lr_model2.predict(X_test)
y_df2['err'] = y_df2.pred - y_df2.close
(
y_df2
 #   .iloc[-50:]
    .plot(figsize=(14,10))
)

In [None]:
# Yellowbrick version
fig, ax = plt.subplots(figsize=(10, 10))
err_viz2 = PredictionError(lr_model2)           # Wrap the already-fit model
err_viz2.score(X_test, y_test)                  # Score on the unseen test data
err_viz2.poof()                                 # Draw/show/poof the data

#### Exercise: Regression with Train/Test Split
Split the data into test and training data. What is the score on the test data?

### Visualize Errors with Residual Plots
**A residual is the difference between the prediction and the actual.** If we plot predicted value against residuals, we should get a random distribution. If not, a different model would be better given the data. _A pattern in the residuals implies that there is structure in the data, such as a non-linear relationship, that the model does not capture._
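
A quick way to see what a "pattern in the residuals" looks like is to fit a straight line to data that is actually curved. The sketch below uses made-up, noise-free data and the same sign convention as this section (prediction minus actual); all names are illustrative.

```python
import numpy as np

x = np.linspace(-1, 1, 21)
y = x ** 2                      # curved relationship, no noise

# Fit a straight line even though the data is not linear
m, b = np.polyfit(x, y, deg=1)
pred = m * x + b
resid = pred - y                # prediction minus actual

# Residuals are negative at both ends and positive in the middle: a clear
# pattern rather than a random scatter
print(resid[0], resid[10], resid[-1])
```

A systematic sign change like this suggests that a model able to bend, rather than a straight line, would suit the data better.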

In [None]:
def residual_plot(model, X_train, y_train, X_test, y_test):
    fig = plt.figure(figsize=(14,10))
    ax = plt.subplot(111)
    plt.scatter(model.predict(X_train),
                model.predict(X_train) - y_train,
                c='b', alpha=.3,
                label='train')
    plt.scatter(model.predict(X_test),
                model.predict(X_test) - y_test,
                color='green', alpha=.3,
                label='test')
    plt.title('Residual Plot - Train (blue), Test (green)')
    plt.ylabel('Residual')
    ax.legend()

In [None]:
residual_plot(lr_model2, X_train, y_train, X_test, y_test)

In [None]:
# Yellowbrick version
fig, ax = plt.subplots(figsize=(10, 10))
res_viz = ResidualsPlot(lr_model2)
res_viz.fit(X_train, y_train)
res_viz.score(X_test, y_test)
res_viz.poof()

#### Exercises: Residual Plot
Make a residual plot of your test and train data


### Other Models: SVM, Random Forest, and Huber
Linear regression is not the only model that can be used for regression:

* **SVM**: Support vector machines include both linear and non-linear variations. In classification, the main idea is to find the line (or plane or dividing shape) that separates the targets/classes optimally. Instead of measuring the distance to all points, SVMs try to find the largest margin between only the points on either side of the decision line. Rather than worry about points that are far away from the decision boundary (e.g., the obvious ones), the algorithm focuses on the points that are closest to the line. It then seeks to place the line so that the distance to those points is as great as possible. The regression variant, Epsilon-Support Vector Regression (`SVR`), adapts the same idea to numeric targets by ignoring errors smaller than a margin epsilon.

SVMs use a trick (the kernel trick) to map points that are not linearly related into a higher-dimensional space. The algorithm then tries to find a linear boundary in the warped space.

* **Random Forest**: Random forests rely upon the use of decision trees. Decision trees are based on a series of branch points that help to make a decision. When using a decision tree algorithm, you allow the computer to figure out (based on the training data) which variables are the most important. It then puts these at the top of the tree and gradually uses less important variables in subsequent branches until a path to target outcomes has been plotted.

In decision trees, the topmost branches have an enormous influence on the quality of the tree. If new data doesn't follow the same distribution as the training set, then the model doesn't generalize quite as well.

Random forests build a collection of decision trees and apply these to new observations. The forest then combines the outputs of the individual trees as a set of "votes": it returns the majority vote in the case of classification or the mean value when performing regression.

- Random forests have a degree of immunity to unimportant features
- They are also able to cope with noisy datasets or those with missing values

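* **Huber**: A regression algorithm that is useful for datasets with outliers. It penalizes small errors quadratically (like ordinary least squares) but large errors only linearly, so outliers pull on the fitted line far less than they would under a purely squared loss.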
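
To illustrate why a Huber-style loss is less sensitive to outliers, the sketch below compares the textbook Huber loss with the squared loss on a few residuals. The function names are hypothetical, and the threshold `delta=1.35` is chosen only to echo the `epsilon` default that appears later in this notebook; the exact objective optimized by `HuberRegressor` includes additional scaling terms and is not reproduced here.

```python
import numpy as np

def squared_loss(r):
    """Half the squared residual: grows quadratically everywhere."""
    return 0.5 * np.asarray(r, float) ** 2

def huber_loss(r, delta=1.35):
    """Quadratic for |r| <= delta, linear beyond it."""
    r = np.abs(np.asarray(r, float))
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

residuals = np.array([0.5, 1.0, 10.0])   # the last one is an outlier
print(squared_loss(residuals))
print(huber_loss(residuals))
```

For the small residuals the two losses agree, while for the outlier the Huber loss is much smaller, so a single extreme point contributes far less to the total being minimized.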

In [None]:
# drop last row because we don't know what future is

X = (bit_df
         .drop(['close'], axis=1)
         .iloc[:-1])
y = bit_df.close.shift(-1).iloc[:-1]
cols = X.columns

ss = preprocessing.StandardScaler()
ss.fit(X)
X = ss.transform(X)
X = pd.DataFrame(X, columns=cols)

X_train, X_test, y_train, y_test = model_selection.\
    train_test_split(X, y, test_size=.3, random_state=42)

# Create an SVM model using the Epsilon-Support Vector Regression    
svm_model = svm.SVR(kernel='linear')
svm_model.fit(X_train, y_train)
svm_model.score(X_test, y_test)    

In [None]:
def train_reg_model(model, df):
    # drop last row because we don't know what future is

    X = (df
             .drop(['close'], axis=1)
             .iloc[:-1])
    y = df.close.shift(-1).iloc[:-1]
    cols = X.columns

    ss = preprocessing.StandardScaler()
    ss.fit(X)
    X = ss.transform(X)
    X = pd.DataFrame(X, columns=cols)

    X_train, X_test, y_train, y_test = model_selection.\
        train_test_split(X, y, test_size=.3, random_state=42)

    #svm_model = svm.SVR(kernel='linear')
    model.fit(X_train, y_train)
    return model.score(X_test, y_test), X_test, y_test, X_train, y_train    

# Generate a random forest model
rf_reg = ensemble.RandomForestRegressor()
score, X_test, y_test, X_train, y_train = train_reg_model(rf_reg, bit_df)
print(score)    

In [None]:
def error_plot(X_test, y_test, model):
    y_df3 = pd.DataFrame(y_test)
    y_df3['pred'] = model.predict(X_test)
    y_df3['err'] = y_df3.pred - y_df3.close
    (
    y_df3
     #   .iloc[-50:]
        .plot(figsize=(14,10))
    )
error_plot(X_test, y_test, rf_reg)

In [None]:
# Yellowbrick version
fig, ax = plt.subplots(figsize=(10, 10))
err_viz3 = PredictionError(rf_reg)
err_viz3.score(X_test, y_test)
err_viz3.poof()

In [None]:
residual_plot(rf_reg, X_train, y_train, X_test, y_test)

In [None]:
# Generate a Huber Regressor
huber_reg = linear_model.HuberRegressor()
huber_reg.fit(X_train, y_train)
huber_reg.score(X_test, y_test)

In [None]:
error_plot(X_test, y_test, huber_reg)

In [None]:
# Yellowbrick version
fig, ax = plt.subplots(figsize=(10, 10))
err_viz4 = PredictionError(huber_reg)
err_viz4.score(X_test, y_test)
err_viz4.poof()

In [None]:
residual_plot(huber_reg, X_train, y_train, X_test, y_test)

In [None]:
huber_reg

In [None]:
linear_model.HuberRegressor(
    alpha=0.0001, epsilon=1.35, fit_intercept=True, max_iter=100,
    tol=1e-05, warm_start=False)

#### Exercises: Other Models
Try using another model (`RandomForestRegressor` or `SVR`) and assess the accuracy of the new model