
In [0]:
import pandas as pd                 # data frame
import numpy as np                  # matrix manipulation
import matplotlib.pyplot as plt     # plotting
from sklearn import linear_model    # linear & logistic regression
from sklearn import preprocessing   # polynomial features for polynomial regression
from sklearn import model_selection # splitting data into train and test data

### Least Squares
We'll use the method of least squares to model Google's stock price over a span of 20 days.

**NOTE**: We're fitting our least squares model to fewer than 20 data points, so we probably can't predict the price of Google's stock in 2019, but we may be able to predict the price in March 2016.

In [0]:
# read in the data
url = 'https://github.com/vt-ai-ml/fall2019-meetings/raw/master/data/google_stock.csv'
df = pd.read_csv(url)

df[0:3]

In [0]:
plt.figure(figsize=(8,4))
plt.ylabel('Stock Price (Low)')
plt.xlabel('Days (Day 0 = Feb 1, 2016)')
plt.title("Google stock price in Feb 2016")
plt.plot(df['Date'], df['Price'], color='blue', marker='o')

### Linear Regression
$ y = mx + b $  
$ y = \beta_{0} + \beta_{1}x$
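
As a minimal sketch of what a least squares fit computes, the snippet below solves for $\beta_{0}$ and $\beta_{1}$ directly with NumPy. The `x` and `y` values are made up for illustration and are not taken from the Google stock data.

```python
import numpy as np

# Made-up illustrative points that lie exactly on y = 2x + 1 (not the stock data)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Design matrix with a column of ones for the intercept beta_0
X = np.column_stack([np.ones_like(x), x])

# Least squares solution minimizing ||X @ beta - y||^2
beta, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
beta_0, beta_1 = beta

print('intercept:', beta_0, 'slope:', beta_1)
```

Because the points here lie exactly on a line, the recovered intercept and slope match the line they were generated from. With noisy data such as stock prices, the fit is instead the line with the smallest sum of squared errors.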

In [0]:
dates = df['Date'].values.reshape(-1,1)   # column of shape (n_samples, 1), since lm.fit() expects 2D input
prices = df['Price'].values.reshape(-1,1)

lm = linear_model.LinearRegression()
lm.fit(dates, prices)
predictions = lm.predict(dates)

today = lm.predict(np.reshape([1300], (1,-1)))  # let's predict the price 1300 days from Feb 1 2016
march = lm.predict(np.reshape([30], (1,-1)))    # let's predict the price in March
print('Today\'s prediction:', today, 'actual:', 1300)
print('Mar 1 2016 prediction:', march, 'actual:', 700)

In [0]:
plt.figure(figsize=(8,4))
plt.plot(df['Date'], predictions, color='red')
plt.plot(df['Date'], df['Price'], color='blue', marker='o')

### Polynomial Regression (degree = 2)
$ y = \beta_{0} + \beta_{1}x + \beta_{2}x^2$
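
To make the feature expansion concrete, here is a small sketch of what `PolynomialFeatures(2)` produces for a single input column. The three input values are hypothetical day indices chosen for illustration.

```python
import numpy as np
from sklearn import preprocessing

# Three hypothetical day indices, shaped (n_samples, 1) as scikit-learn expects
x = np.array([[1.0], [2.0], [3.0]])

poly = preprocessing.PolynomialFeatures(2)
expanded = poly.fit_transform(x)

# Each row is [1, x, x^2]; the leading 1 is the bias column
print(expanded)
```

Fitting an ordinary linear model on these expanded columns is what turns linear regression into polynomial regression: the model stays linear in the coefficients $\beta_{0}, \beta_{1}, \beta_{2}$.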

In [0]:
# add an x^2 to our input
poly = preprocessing.PolynomialFeatures(2)
poly_dates = poly.fit_transform(dates)

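lm.fit(poly_dates, prices)   # refit the same model object on the expanded features
predictions2 = lm.predict(poly_dates)

# poly is already fitted, so transform() is enough for new inputs
today = lm.predict(poly.transform(np.reshape([1300], (1,-1))))
march = lm.predict(poly.transform(np.reshape([30], (1,-1))))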
print('Today\'s prediction:', today, 'actual:', 1300)
print('Mar 1 2016 prediction:', march, 'actual:', 700)

In [0]:
plt.figure(figsize=(8,4))
plt.plot(df['Date'], predictions2, color='red')
plt.plot(df['Date'], df['Price'], color='blue', marker='o')

### Polynomial Regression (degree = 3)
$ y = \beta_{0} + \beta_{1}x + \beta_{2}x^2 + \beta_{3}x^3$
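
One way to keep the feature expansion and the regression together is a scikit-learn pipeline. This is a sketch on made-up data, not the notebook's method; the name `cubic_model` and the sample points are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up points on y = x^3 - 2x + 5 (not the stock data)
x = np.arange(-3, 4, dtype=float).reshape(-1, 1)
y = x**3 - 2 * x + 5

# The pipeline expands the features and fits the linear model in one step
cubic_model = make_pipeline(PolynomialFeatures(3), LinearRegression())
cubic_model.fit(x, y)

# New inputs are expanded automatically before prediction
new_x = np.array([[5.0]])
pred = cubic_model.predict(new_x)
print(pred)
```

The prediction at `x = 5` matches the generating cubic only because this data is noise-free. On real prices, a higher-degree fit can swing sharply outside the training range, so far-out predictions deserve extra caution.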

In [0]:
poly = preprocessing.PolynomialFeatures(3)
poly_dates = poly.fit_transform(dates)

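lm.fit(poly_dates, prices)   # refit the same model object on the expanded features
predictions3 = lm.predict(poly_dates)

# poly is already fitted, so transform() is enough for new inputs
today = lm.predict(poly.transform(np.reshape([1300], (1,-1))))
march = lm.predict(poly.transform(np.reshape([30], (1,-1))))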
print('Today\'s prediction:', today, 'actual:', 1300)
print('Mar 1 2016 prediction:', march, 'actual:', 700)

In [0]:
plt.figure(figsize=(8,4))
plt.plot(df['Date'], predictions3, color='red')
plt.plot(df['Date'], df['Price'], color='blue', marker='o')

---
## Logistic Regression

<img src="https://github.com/vt-ai-ml/fall2019-meetings/raw/master/data/logistic.png" align="right" style="width:350px;height:250px;">

We will predict whether someone has heart disease given their age, sex, cholesterol, and maximum heart rate.

$$ y = 
\begin{cases}
1 & \text{if }  \beta_{0} + \beta_{1}x > 0\\
0 & \text{else }
\end{cases} $$
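
The threshold rule above can be read as passing the linear score through the sigmoid function and predicting 1 when the resulting probability exceeds 0.5. The sketch below illustrates this with hypothetical coefficients `beta_0` and `beta_1`; they are not fitted to the heart disease data.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients chosen for illustration only
beta_0, beta_1 = -3.0, 1.5

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
z = beta_0 + beta_1 * x          # linear score
prob = sigmoid(z)                # estimated P(y = 1 | x)
label = (z > 0).astype(int)      # same as thresholding prob at 0.5

print(prob)
print(label)
```

With several features, as in the heart disease model, the score becomes $\beta_{0} + \beta_{1}x_{1} + \dots + \beta_{k}x_{k}$ and the same threshold applies.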

dataset: https://www.kaggle.com/ronitf/heart-disease-uci

* male = 1, female = 0
* target = 0 (absence of heart disease)

In [0]:
# read in the data
url = 'https://github.com/vt-ai-ml/fall2019-meetings/raw/master/data/heart_disease.csv'
heart_df = pd.read_csv(url)

heart_df[0:3]

In [0]:
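# hold out 25% of the rows for testing; random_state makes the split reproducible
features = heart_df[['age', 'sex', 'cholesterol', 'max_heart_rate']]
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    features, heart_df['target'], test_size=0.25, random_state=3)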

logreg = linear_model.LogisticRegression()
logreg.fit(x_train, y_train)
heart_prediction = logreg.predict(x_test)

In [0]:
plt.figure(figsize=(8,4))
plt.scatter(range(0,len(heart_prediction)), abs(heart_prediction - y_test), color='red')
plt.title('How many predictions did we get correct? (0 = correct, 1 = wrong)')

print('score:', logreg.score(x_test, y_test))
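
For a classifier, `score` reports mean accuracy, which is the fraction of test predictions that match the true labels. The sketch below checks this on synthetic data; the data, the 150/50 split, and the name `clf` are assumptions for illustration and unrelated to the heart disease dataset.

```python
import numpy as np
from sklearn import linear_model

# Synthetic one-feature data (not the heart disease dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = linear_model.LogisticRegression()
clf.fit(X[:150], y[:150])

pred = clf.predict(X[150:])
manual_accuracy = np.mean(pred == y[150:])  # fraction of correct predictions

print('manual:', manual_accuracy, 'score:', clf.score(X[150:], y[150:]))
```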

In [0]:
# try it out
age, sex, cholesterol, max_heart_rate = 77, 1, 304, 165
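# one-row DataFrame with the same column names used for training
example = pd.DataFrame([[age, sex, cholesterol, max_heart_rate]],
                       columns=['age', 'sex', 'cholesterol', 'max_heart_rate'])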

logreg.predict(example)