<center> <img src="res/ds3000.png"> </center>

<center> <h1> Week 11 - Day 1 </h1> </center>

<center> <h2> Part 1: Simple Linear Regression</h2></center>

## Outline
1. <a href='#1'>Simple Linear Regression</a>
2. <a href='#2'>Data Preparation</a>
3. <a href='#3'>Training the Regression Model</a>
4. <a href='#4'>Predicting Temperatures</a>
5. <a href='#5'>Testing the Model</a>
6. <a href='#6'>Regression Model Metrics</a>
7. <a href='#7'>Visualizing the Simple Linear Regression Model</a>

<a id="1"></a>

## 1. Simple Linear Regression
* Simple linear regression is the simplest regression algorithm
* Given a collection of numeric values representing a predictor variable and an outcome variable, simple linear regression describes the relationship between these variables with a straight line, known as **the regression line**


### 1.1. Time Series Data for Average Boston November Temperature
* Data obtained from https://www.ncdc.noaa.gov/cag/
    * the November average temperatures for Boston from 1936 through 2018
* Three columns per observation:
    * Date: A value of the form 'YYYYMM' (such as '201011'). MM is always 11 because we downloaded data for only November of each year.
    * Value: A floating-point Fahrenheit temperature.
    * Anomaly: The difference between the value for the given date and the average value for all dates (not used in this case study)

In [None]:
import pandas as pd
df = pd.read_csv('res/boston_weather.csv')
df.columns = ["Date", "Temperature", "Anomaly"]
df.head()

<a id="2"></a>

## 2.  Data Preparation
* Need to do some data munging to get rid of the month digits in dates and delete the Anomaly column

In [None]:
df.head()

In [None]:
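#keep only the year: integer-divide YYYYMM by 100 (e.g. 201011 -> 2010); assumes Date was read as integers
df["Date"] = df["Date"] // 100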

In [None]:
#delete the Anomaly column, which this case study does not use
df = df.drop(columns="Anomaly")

In [None]:
df.head()

### 2.1. Transforming the Date Column
* For simple linear regression, select one feature (here, the Date) as the predictor variable
    * A column in a DataFrame is a one-dimensional Series
    * Scikit-learn estimators require training and testing data to be two-dimensional
    * Need to transform the Series of n elements into a two-dimensional array with n rows and one column

In [None]:
features = df["Date"].values.reshape(-1, 1)
features.shape

In [None]:
features[:5]

* df["Date"].values returns a NumPy array containing the Date column's values
* reshape(-1, 1) tells reshape to infer the number of rows from the number of columns (1) and the number of elements (83) in the array
* The transformed array will have 83 rows and one column
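
As a small illustration of the reshape step, the sketch below uses a made-up five-element array (`years` is a hypothetical stand-in, not the course data) to show how `reshape(-1, 1)` turns a one-dimensional array into a single-column two-dimensional one.

```python
import numpy as np

# Hypothetical sample of years standing in for the Date column's values
years = np.array([1936, 1937, 1938, 1939, 1940])
print(years.shape)    # one-dimensional

# -1 asks NumPy to infer the row count from the array's size and the single column
column = years.reshape(-1, 1)
print(column.shape)   # two-dimensional: one row per year, one column
```

The same call works for any length, because the row count is inferred rather than hard-coded.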

In [None]:
target = df["Temperature"]
target

### 2.2. Correlation Analysis
* Use SciPy's pearsonr() function, which returns the correlation coefficient and a p-value
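
To make the correlation coefficient less of a black box, here is a minimal plain-Python sketch of the Pearson formula (covariance divided by the product of the standard deviations). The helper `pearson_r` and the sample values are hypothetical and only for illustration; it does not compute the p-value.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values, not the Boston data
sample_years = [2000, 2001, 2002, 2003, 2004]
sample_temps = [40.0, 41.0, 39.5, 42.0, 43.0]
r_sample = pearson_r(sample_years, sample_temps)
print(f"r = {r_sample:.3f}")
```

A value near 1 or -1 indicates a strong linear relationship; a value near 0 indicates a weak one.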

In [None]:
from scipy import stats

r, p = stats.pearsonr(df["Date"], df["Temperature"])
print("correlation coefficient: ", r, "p-value:", p)

<a id="3"></a>

## 3. Training the Regression Model
* Use the LinearRegression estimator
* To find the best-fitting regression line for the data, the LinearRegression estimator uses ordinary least squares: it solves directly for the slope and intercept that minimize the sum of the squares of the data points' vertical distances from the line
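
For a single predictor, the slope and intercept that minimize the sum of squared distances have a closed form. The sketch below computes them in plain Python; `fit_line` and the sample points are hypothetical and are meant only to show the idea, not how scikit-learn is implemented internally.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of products of deviations / sum of squared x deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The least-squares line passes through the point (mean_x, mean_y)
    b = mean_y - m * mean_x
    return m, b

# Made-up points lying exactly on y = 2x + 1
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
m, b = fit_line(xs, ys)
print(m, b)
```

With noisy data the points no longer lie on the line, but the same formulas give the line with the smallest total squared error.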

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


#split data into training and testing sets (by default, 75% training and 25% testing)
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=3000)

#select a regression estimator and create the model by fitting the training data
model = LinearRegression().fit(X=X_train, y=y_train)


### 3.1. Regression Line Equation
* Once the model is fitted, the estimator calculates the **slope** and **intercept** 
* We can make **predictions** with 

\begin{equation}
y = m x + b
\end{equation}

* Slope is the estimator’s **`coef_`** attribute (**m** in the equation) 
* Intercept is the estimator’s **`intercept_`** attribute (**b** in the equation)

In [None]:
slope = model.coef_
slope

In [None]:
slope = slope[0]
slope

In [None]:
intercept = model.intercept_
intercept

#### The equation:
\begin{equation}
y = 0.035601 x - 26.65185
\end{equation}

<a id="4"></a>

## 4. Predicting Temperatures

In [None]:
def predict_temp(x):
    #apply the regression line equation y = m x + b using the fitted slope and intercept
    y = slope * x + intercept
    return y

In [None]:
predict_temp(2019)

<a id="5"></a>

## 5. Testing the Model
* Test the model using the data in **`X_test`** and check some of the **predictions**

In [None]:
predicted = model.predict(X_test)

In [None]:
expected = y_test

In [None]:
predicted[:5]

In [None]:
expected[:5]

In [None]:
for p, e in zip(predicted[::5], expected[::5]):  # check every 5th element
    print(f'predicted: {p:.2f}, expected: {e:.2f}')

<a id="6"></a>

## 6. Regression Model Metrics
* **Metrics for regression estimators** include the **coefficient of determination** (**$R^{2}$ score**), which is at most 1.0
    * **1.0**: estimator **perfectly predicts** the **dependent variable's value**, given independent variables' values
    * **0.0**: model does **no better than always predicting the mean** of the expected values
    * Negative scores are possible on test data when the model does worse than predicting the mean
* Calculate with arrays representing the **expected** and **predicted results**
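
The $R^{2}$ score follows the standard definition $1 - SS_{res}/SS_{tot}$. The plain-Python sketch below applies that definition to made-up values (the helper `r_squared` and the numbers are hypothetical) to show what scores of 1.0 and 0.0 mean, and that the score can drop below zero.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Squared error of the predictions
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of always predicting the mean
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up temperatures, not the Boston data
actual = [40.0, 42.0, 44.0, 46.0]

perfect = r_squared(actual, actual)                       # predictions match exactly
mean_only = r_squared(actual, [43.0] * 4)                 # always predict the mean
worse = r_squared(actual, [46.0, 44.0, 42.0, 40.0])       # predictions run the wrong way
print(perfect, mean_only, worse)
```

The third case shows why a low or negative score on test data signals that the fitted line explains little of the variation.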

In [None]:
from sklearn import metrics
metrics.r2_score(expected, predicted)

<a id="7"></a>

## 7. Visualizing the Simple Linear Regression Model
* Scatter plots are commonly used to visualize regression models
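* Use Plotly Express's scatter() function
* Setting trendline="ols" draws an ordinary least squares regression line on the graph; this option requires the statsmodels package to be installed
    * The trendline is fitted to all rows of the DataFrame, while our model was fitted to the training set only, so the two lines can differ slightly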

In [None]:
import plotly.express as px
import plotly.graph_objects as go

#produce the scatter plot
graph = px.scatter(df, x="Date", y="Temperature", template="none", color="Temperature", opacity=.8, trendline="ols")

#add a line shape on top of the scatter plot using two points (1936, predict_temp(1936)) and (2018, predict_temp(2018))
#because predict_temp uses the regression line equation, this will plot the regression line on the graph

graph.update_layout(
    
    shapes=[    
        go.layout.Shape(
            type="line",
            x0=1936, y0=predict_temp(1936),
            x1=2018, y1=predict_temp(2018),
            line=dict(color="coral", width=3, dash="solid")
        )
    ]
)

graph.update_traces(marker={"size":12})

#need to change y-axis; otherwise, plotly will auto-scale, leading to confusion
graph.update_layout(yaxis = dict(range = [20,60]))

graph.show()

In [None]:
import plotly.express as px
import plotly.graph_objects as go

#produce the scatter plot
graph = px.scatter(df, x="Date", y="Temperature", template="none", color="Temperature", opacity=.8)

#add a line shape on top of the scatter plot using two points (1936, predict_temp(1936)) and (2018, predict_temp(2018))
#because predict_temp uses the regression line equation, this will plot the regression line on the graph

graph.update_layout(
    
    shapes=[    
        go.layout.Shape(
            type="line",
            x0=1936, y0=predict_temp(1936),
            x1=2018, y1=predict_temp(2018),
            line=dict(color="coral", width=2, dash="solid")
        )
    ]
)

graph.update_traces(marker={"size":12})

#need to change y-axis; otherwise, plotly will auto-scale, leading to confusion
graph.update_layout(yaxis = dict(range = [20,60]))

graph.show()

In [None]:
%matplotlib notebook

In [None]:
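import seaborn as sns

#set the style and palette before plotting so they apply to this figure
sns.set_style('whitegrid')
sns.set_palette("colorblind")

#regplot returns a matplotlib Axes and draws the scatter plot with a fitted regression line
ax = sns.regplot(data=df, x="Date", y="Temperature")
ax.set_ylim(20, 60)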