# Section 8: Machine Learning  

**In this section we will cover three machine learning models that can be used for regression. Regression falls under supervised learning, one of the main categories of machine learning; the others are unsupervised learning and reinforcement learning. Deep learning is sometimes listed alongside these, but it is a family of neural-network techniques that can be used within any of them. In machine learning, regression takes the inputs, x, and attempts to predict the output, y, with as little error as possible by finding the relationship, or equation, between the inputs and the output. Because the training data already contains the known outputs, regression is a supervised task. Here we will use the daily prices and trading volume of a stock as the input and attempt to predict the high of the stock, our output, 5 market days later. We will use scikit-learn for this. Scikit-learn, imported in code as `sklearn`, is a Python library that provides a wide range of tools for machine learning and data analysis.**  

[Documentation for sklearn](http://scikit-learn.org/stable/)  


**First, let's load our data set.**

In [2]:
import pandas as pd
# Load dataset into dataframe
stock_data = pd.read_csv("daily_adjusted_HAL.csv", index_col='timestamp')

# Reverse order of dataframe
stock_data = stock_data.iloc[::-1]

# This is the target variable
# The high price that is five market days ahead is saved into a separate series
target = stock_data['forecast_high_5']

# Here we separate the input from the output
# The expected output is dropped from the dataframe
stock_data = stock_data.drop('forecast_high_5', axis=1)
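
**The `forecast_high_5` column is already present in the CSV. As a rough sketch of how such a column could be built, assuming a column named `high` (a hypothetical name) and rows sorted oldest first, each day's target is the high from five rows further down:**

```python
import pandas as pd

# Made-up miniature price table; the column name "high" is an assumption
prices = pd.DataFrame(
    {"high": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]},
    index=pd.date_range("2018-01-01", periods=7, freq="B"),  # business days
)

horizon = 5
# shift(-horizon) moves every value up by `horizon` rows,
# so each row receives the high recorded 5 market days later
prices["forecast_high_5"] = prices["high"].shift(-horizon)

# The last `horizon` rows have no future value yet, so they are dropped
prices = prices.dropna(subset=["forecast_high_5"])
print(prices)
```

**Row order matters here: with the newest rows first, the same shift would look backward in time instead of forward.**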

**In machine learning, we do something called a train/test split. This divides the data into a portion used to train the model and a portion held back to test it. Evaluating a model on the same data it was trained on can hide overfitting. Overfitting is when the model learns a function that is too specific to the data it was trained on, so it gives inaccurate results on new data. It shows up as high performance on the training data but poor performance on data the model has not seen.**

[Documentation for train test split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [3]:
# test train split
from sklearn.model_selection import train_test_split

# The test train split function takes the input dataframe and the target variable and splits it into sets
# X_train is the input used to train the model
# X_test is the input used for testing
# y_train is the output used for training
# y_test is the output variable used for testing
X_train, X_test, y_train, y_test = train_test_split(stock_data, target, test_size = .25, random_state = 0)
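
**The call above shuffles the rows before splitting, so the test set contains days scattered throughout the history. For time-ordered data, one alternative (not what this section uses) is to keep the rows in order and hold out the most recent 25%, which `train_test_split` supports through `shuffle=False`. A minimal sketch on made-up data:**

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 8 "days" with one feature each, ordered oldest to newest
X = np.arange(8).reshape(-1, 1)
y = np.arange(8) * 10

# Random split, as in the cell above: test rows can come from anywhere in the history
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print("random split test rows:", sorted(X_test.ravel().tolist()))

# Chronological split: with shuffle=False the last 25% of rows become the test set
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X, y, test_size=0.25, shuffle=False
)
print("chronological test rows:", X_test_c.ravel().tolist())
```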

**Now that our data has been split, it is time to train each of our models on the training set.**

**Linear regression is one of the most popular regression models. It models the output as a weighted sum of the input variables plus an intercept; with a single input this is a best-fit straight line (also known as the regression line).**  
[Documentation for Linear Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

In [4]:
# Importing a LinearRegression model
from sklearn.linear_model import LinearRegression

# The linear model we will be using
linearRegressionClassifier = LinearRegression()

# This trains the linear model on the training data
linearRegressionClassifier.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
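
**To see what `fit` learns, here is a small sketch on made-up data that follows y = 2x + 1 exactly. After fitting, the slope and intercept of the regression line can be read from the model's `coef_` and `intercept_` attributes:**

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on the line y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0

model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]        # weight learned for the single input
intercept = model.intercept_  # value predicted when the input is 0
print(slope, intercept)

# Predicting for an unseen input applies the same equation: 2 * 4 + 1
prediction = model.predict(np.array([[4.0]]))[0]
print(prediction)
```

**With six inputs, as in our stock data, `coef_` holds one weight per column instead of a single slope.**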

**Next we will train support vector regression (SVR), the regression counterpart of the support vector machine, a model best known for classification, so that we can compare its results with the linear regression model.**  
[Documentation for SVR](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html)

In [5]:
# Importing svm
from sklearn import svm

# The SVR model we will be using
svm_classifier = svm.SVR()

# This trains the model on the training data
svm_classifier.fit(X_train, y_train)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
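
**SVR with the `rbf` kernel relies on distances between rows, so a column with very large values, such as trading volume, can dominate columns measured in dollars. One common remedy, which this section does not apply, is to standardize the inputs before fitting. A hedged sketch using scikit-learn's `StandardScaler` and `make_pipeline` on made-up data (the column meanings are assumptions):**

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Made-up inputs: a price-like column (tens) and a volume-like column (millions)
price = rng.uniform(40.0, 50.0, size=200)
volume = rng.uniform(5e6, 1e7, size=200)
X = np.column_stack([price, volume])
# Made-up target that depends only on the price-like column
y = price + 1.0

# The scaler is fit on the training rows inside the pipeline, so SVR sees
# columns with mean 0 and unit variance instead of raw dollars and shares
scaled_svr = make_pipeline(StandardScaler(), SVR())
scaled_svr.fit(X[:150], y[:150])

predictions = scaled_svr.predict(X[150:])
print(predictions[:3])
```

**The same pipeline idea applies to any distance-based model; whether it would improve the results in this section is untested here.**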

**K nearest neighbors, or KNN, is an algorithm that makes its prediction by averaging the target values of the k training points closest to the input. There are versions for both classification and regression; here we will use the regression version.**  
[Documentation for KNN regressor](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)

In [6]:
# Importing our model
from sklearn.neighbors import KNeighborsRegressor

# The KNNR model we will be using, with our K value as 4
neighbors_classifier = KNeighborsRegressor(n_neighbors=4)

# Here we train our model
neighbors_classifier.fit(X_train, y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=4, p=2,
          weights='uniform')
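
**The idea behind KNN regression fits in a few lines. The sketch below is a simplified one-dimensional version written for illustration (the function `knn_predict` is hypothetical and not part of scikit-learn): it finds the k training inputs closest to the query and averages their outputs.**

```python
def knn_predict(train_x, train_y, query, k):
    """Average the outputs of the k training inputs closest to `query`."""
    # Indices of the training points, ordered from nearest to farthest
    by_distance = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    nearest = by_distance[:k]
    return sum(train_y[i] for i in nearest) / k


train_x = [1.0, 2.0, 3.0, 10.0, 11.0]
train_y = [10.0, 20.0, 30.0, 100.0, 110.0]

# The three inputs nearest to 2.2 are 2.0, 3.0 and 1.0,
# so the prediction is (20 + 30 + 10) / 3
prediction = knn_predict(train_x, train_y, query=2.2, k=3)
print(prediction)  # 20.0
```

**`KNeighborsRegressor` applies the same idea to rows with several columns, using Euclidean distance by default (`metric='minkowski'` with `p=2` in the output above).**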

**Now that our models have been trained on our training data, it's time to make predictions on the testing data.**

In [7]:
# These return an array of the predicted output values, the high price, for each model
linear_prediction = linearRegressionClassifier.predict(X_test)
svm_prediction = svm_classifier.predict(X_test)
neighbors_prediction = neighbors_classifier.predict(X_test)

**Now we will use several different metrics to measure how accurate each model was. Each metric is based on the error, which is the difference between the actual value and the value the model predicted.**

**Mean absolute error is the average of the absolute values of the errors.**  
[Documentation for mean absolute error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html)

In [8]:
# Importing the mean absolute error function
from sklearn.metrics import mean_absolute_error

# Storing the MAE for the linear regression model into a variable
linear_abs_error = mean_absolute_error(y_test, linear_prediction)

# Storing the MAE for the SVR model into a variable
svm_abs_error = mean_absolute_error(y_test, svm_prediction)

# Storing the MAE for the KNN regression model into a variable
neighbors_abs_error = mean_absolute_error(y_test, neighbors_prediction)

print("The mean absolute error for linear regression is ", linear_abs_error)
print("The mean absolute error for SVM regression is ", svm_abs_error)
print("The mean absolute error for KNN regression is ", neighbors_abs_error)

The mean absolute error for linear regression is  1.51743830799
The mean absolute error for SVM regression is  3.76900858586
The mean absolute error for KNN regression is  3.82642727273
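
**As a check on what the metric means, the sketch below computes the mean absolute error by hand on four made-up values. This is the same quantity `mean_absolute_error` returns:**

```python
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 11.0, 16.0]

# Error for each point: actual value minus predicted value
errors = [a - p for a, p in zip(actual, predicted)]  # [-1.0, 0.0, 3.0, 0.0]

# Taking absolute values stops positive and negative errors from cancelling out
mae = sum(abs(e) for e in errors) / len(errors)
print(mae)  # 1.0
```

**The result is in the same units as the target, so the linear model's value of about 1.52 means its predicted highs missed by roughly 1.52 on average.**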


**Mean squared error is the average of the squares of the errors. Because each error is squared, large errors are penalized more heavily than small ones.**  
[Documentation for mean squared error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html)

In [9]:
# Importing the mean squared error function
from sklearn.metrics import mean_squared_error

# Storing the MSE for the linear regression model into a variable
linear_squared_error = mean_squared_error(y_test, linear_prediction)

# Storing the MSE for the SVR model into a variable
svm_squared_error = mean_squared_error(y_test, svm_prediction)

# Storing the MSE for the KNN regression model into a variable
neighbors_squared_error = mean_squared_error(y_test, neighbors_prediction)

print("The mean squared error for linear regression is ", linear_squared_error)
print("The mean squared error for SVM regression is ", svm_squared_error)
print("The mean squared error for KNN regression is ", neighbors_squared_error)

The mean squared error for linear regression is  3.29780037985
The mean squared error for SVM regression is  22.7025350257
The mean squared error for KNN regression is  22.7173076455
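
**Squaring makes this metric more sensitive to large misses than mean absolute error. In the made-up example below, two sets of errors have the same mean absolute error, but the set with one large miss has a much higher mean squared error:**

```python
def mean_abs(errors):
    """Average of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)


def mean_sq(errors):
    """Average of the squared errors."""
    return sum(e ** 2 for e in errors) / len(errors)


even_errors = [1.0, 1.0, 1.0, 1.0]   # four small misses
one_big_miss = [0.0, 0.0, 0.0, 4.0]  # three perfect predictions, one large miss

print(mean_abs(even_errors), mean_abs(one_big_miss))  # 1.0 1.0
print(mean_sq(even_errors), mean_sq(one_big_miss))    # 1.0 4.0
```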


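**Mean squared logarithmic error is the average of the squared differences between the logarithms of the actual and predicted values (each with 1 added). It therefore reflects the ratio between the actual value and the predicted value rather than their absolute difference.**  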
[Documentation for mean squared log error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_log_error.html)

In [10]:
# Importing the mean squared log error function
from sklearn.metrics import mean_squared_log_error

# Storing the MSLE for the linear regression model into a variable
linear_log_error = mean_squared_log_error(y_test, linear_prediction)

# Storing the MSLE for the SVR model into a variable
svm_log_error = mean_squared_log_error(y_test, svm_prediction)

# Storing the MSLE for the KNN regression model into a variable
neighbors_log_error = mean_squared_log_error(y_test, neighbors_prediction)

print("The mean squared log error for linear regression is ", linear_log_error)
print("The mean squared log error for svm regression is ", svm_log_error)
print("The mean squared log error for KNN regression is ", neighbors_log_error)

The mean squared log error for linear regression is  0.00137982633362
The mean squared log error for svm regression is  0.0091205426452
The mean squared log error for KNN regression is  0.00928045777612
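
**A short sketch of the calculation on made-up numbers, using the formula scikit-learn documents for this metric: the mean of the squared differences between log(1 + actual) and log(1 + predicted). Both predictions below are 10% too high, so they receive nearly the same penalty even though their absolute errors differ by a factor of ten:**

```python
import math


def squared_log_error(actual, predicted):
    # log1p(x) computes log(1 + x), which stays defined when x is 0
    return (math.log1p(actual) - math.log1p(predicted)) ** 2


small = squared_log_error(100.0, 110.0)    # absolute error of 10
large = squared_log_error(1000.0, 1100.0)  # absolute error of 100
print(small, large)

# The metric itself is the mean of the per-point values
msle = (small + large) / 2
print(msle)
```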


**The median absolute error is similar to the mean absolute error but is less affected by outliers in the data.**  
[Documentation for median absolute error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.median_absolute_error.html)

In [11]:
# Importing the median absolute error function
from sklearn.metrics import median_absolute_error

# Storing the Median absolute error for the linear regression model into a variable
linear_median_error = median_absolute_error(y_test, linear_prediction)

# Storing the Median absolute error for the svr model into a variable
svm_median_error = median_absolute_error(y_test, svm_prediction)

# Storing the Median absolute error for the KNN model into a variable
neighbors_median_error = median_absolute_error(y_test, neighbors_prediction)

print("The median absolute error for linear regression is ", linear_median_error)
print("The median absolute error for svm regression is ", svm_median_error)
print("The median absolute error for KNN regression is ", neighbors_median_error)

The median absolute error for linear regression is  1.37974014166
The median absolute error for svm regression is  3.00675
The median absolute error for KNN regression is  3.1725
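
**The made-up errors below show the difference: a single outlier pulls the mean far from the typical error but leaves the median unchanged. Python's `statistics` module is used here only to keep the sketch short:**

```python
import statistics

# Absolute errors for five predictions; the last one is an outlier
abs_errors = [1.0, 1.0, 2.0, 2.0, 100.0]

mean_error = statistics.mean(abs_errors)      # pulled up by the outlier
median_error = statistics.median(abs_errors)  # middle value, unaffected by the outlier's size

print(mean_error)    # 21.2
print(median_error)  # 2.0
```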


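**The R2 score measures how much of the variation in the actual values the model's predictions explain. A score of 1 is a perfect fit, a score of 0 is no better than always predicting the mean of the actual values, and a negative score is worse than that.**  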
[Documentation for R2 score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html)

In [12]:
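# Importing the R2 score function
from sklearn.metrics import r2_score

# Storing the R2 score for the linear regression model into a variable
linear_r2 = r2_score(y_test, linear_prediction)

# Storing the R2 score for the SVR model into a variable
svm_r2 = r2_score(y_test, svm_prediction)

# Storing the R2 score for the KNN regression model into a variable
neighbors_r2 = r2_score(y_test, neighbors_prediction)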

print("The r2 score for linear regression is ", linear_r2)
print("The r2 score for svm regression is ", svm_r2)
print("The r2 score for KNN regression is ", neighbors_r2)

The r2 score for linear regression is  0.846068981629
The r2 score for svm regression is  -0.0596834051766
The r2 score for KNN regression is  -0.0603729449121
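
**To show how the score can fall below zero, as it does for the SVR and KNN models above, the sketch below computes R2 by hand as 1 minus the ratio of the model's squared error to the squared error of always predicting the mean. The data are made up:**

```python
def r2(actual, predicted):
    mean_actual = sum(actual) / len(actual)
    # Squared error of the model's predictions
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of a baseline that always predicts the mean
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0, 5.0]

print(r2(actual, [1.0, 2.0, 3.0, 4.0, 5.0]))  # 1.0, perfect predictions
print(r2(actual, [3.0, 3.0, 3.0, 3.0, 3.0]))  # 0.0, same as predicting the mean
print(r2(actual, [5.0, 4.0, 3.0, 2.0, 1.0]))  # -3.0, worse than predicting the mean
```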


**The linear regression model outperformed both of the other models on every metric, so we will use it for our prediction. I have selected April 2, 2018 as the input date and will try to predict the high price 5 market days later, on April 9, 2018.**

In [18]:
# The stock prices for the selected dates. Our goal here is for the prediction to be close to 47.38
# Timestamp     Open   High   Low    Close  Adj Close  Volume
# Apr 02, 2018  46.69  46.70  45.20  46.09  46.09      9,091,600
# Apr 09, 2018  47.25  47.38  46.48  46.56  46.56      8,553,100

# The inputs for April 2 are entered as a 2D list (one row of six values) so they can be passed to the model
april_02_18 = [[46.69, 46.70, 45.20, 46.09, 46.09, 9091600]]

# The input is entered into the model and a prediction is made
april_09_18_predict = linearRegressionClassifier.predict(april_02_18)

print(april_09_18_predict)

[ 46.57805929]


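**We wanted 47.38 but ended up with 46.58, a miss of about 0.80. This is not an ideal result, although it is smaller than the model's mean absolute error of about 1.52 on the test set. Given the simplicity of our model and how unpredictable stock prices can be, this is not surprising.**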
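
**The prediction above passes a plain list, so it only works if the six values are in the same order as the columns the model was trained on. One way to make that explicit, sketched here on made-up data with hypothetical column names, is to build a one-row DataFrame and reorder it to match the training columns:**

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up training data; the column names "open" and "volume" are assumptions
X_train = pd.DataFrame(
    {"open": [1.0, 2.0, 3.0, 4.0], "volume": [10.0, 30.0, 20.0, 40.0]}
)
y_train = 2.0 * X_train["open"] + 0.5  # target depends only on "open"

model = LinearRegression().fit(X_train, y_train)

# The new row is written in a different column order on purpose,
# then reordered to match the columns used for training
new_day = pd.DataFrame([{"volume": 25.0, "open": 5.0}])[X_train.columns]

prediction = model.predict(new_day)
print(prediction)
```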