# Machine Learning in Python

by [Piotr Migdał](http://p.migdal.pl/)

Inkubator Uniwersytetu Warszawskiego

## 3.  Linear regression

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

%matplotlib inline

## Bicycles data

This dataset contains the number of bicycles counted on several streets in Warsaw, together with weather data.

source:

- Monika Pawłowska (code: https://github.com/pawlowska/shiny-server)

- original source: http://rowery.um.warszawa.pl/pomiary-ruchu-rowerowego

In [None]:
bicycles_weather_data = pd.read_csv("data/dane_zsumowane_z_pogoda.csv", index_col=0)

In [None]:
bicycles_weather_data.head()

In [None]:
bicycles_weather_data.describe()

## Linear regression

Linear regression models a linear relationship between a dependent variable y and one or more explanatory variables X. The case of one explanatory variable is called **simple linear regression**. With more than one explanatory variable, it is called **multiple linear regression**.

$$
y = a x + b
$$
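
The formula above covers the simple case. For multiple linear regression with $n$ explanatory variables, the same model gets one coefficient per variable (the indexed symbols $a_i$ and $x_i$ are notation introduced here for illustration):

$$
y = a_1 x_1 + a_2 x_2 + \dots + a_n x_n + b
$$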

!["xkcd"](https://imgs.xkcd.com/comics/linear_regression.png)

Analytical solutions exist:

- [Ordinary least squares](https://en.wikipedia.org/wiki/Ordinary_least_squares)

- [Ridge regression](https://en.wikipedia.org/wiki/Ridge_regression)

but they are not always computationally efficient, especially for large datasets!
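
To make the ordinary least squares idea concrete, here is a minimal sketch that fits $y = a x + b$ with NumPy alone. The data is synthetic (generated around a made-up line, not taken from the bicycle dataset) and all variable names are chosen only for this illustration. It uses `np.linalg.lstsq`, which solves the least-squares problem numerically.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the bicycle dataset):
# points scattered around the line y = 2x + 1.
rng = np.random.default_rng(0)
x = rng.uniform(-10, 25, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=200)

# Design matrix with a column of ones, so that the intercept b is fitted too.
A = np.column_stack([x, np.ones_like(x)])

# Least-squares solution of A @ [a, b] ≈ y.
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

print("slope a = {:.2f}, intercept b = {:.2f}".format(a, b))
```

The recovered slope and intercept should be close to the values used to generate the data.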

Correlation:

!["source: wikipedia.org"](https://upload.wikimedia.org/wikipedia/commons/d/d4/Correlation_examples2.svg)
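
The Pearson correlation coefficient measures how close a relationship is to a straight line, not whether a relationship exists at all. The sketch below computes it with `np.corrcoef` for three synthetic cases; the data and the names `y_strong`, `y_weak` and `y_nonlinear` are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
noise = rng.normal(size=500)

# Three synthetic relationships (assumptions for illustration).
y_strong = 3.0 * x + 0.1 * noise   # almost perfectly linear in x
y_weak = 0.3 * x + noise           # linear trend buried in noise
y_nonlinear = x ** 2               # depends on x, but not linearly

correlations = {}
for name, values in [("strong", y_strong),
                     ("weak", y_weak),
                     ("nonlinear", y_nonlinear)]:
    # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
    correlations[name] = np.corrcoef(x, values)[0, 1]
    print("{}: r = {:.2f}".format(name, correlations[name]))
```

The last case is fully determined by `x`, yet its correlation is close to zero, because the dependence is not linear.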

Materials:

 - https://en.wikipedia.org/wiki/Linear_regression
 
 - http://setosa.io/ev/ordinary-least-squares-regression
 
 - http://onlinestatbook.com/2/regression/intro.html
 
 - https://www.youtube.com/watch?v=KsVBBJRb9TE
 


In [None]:
bicycles_weather_data.plot(x='temp_avg', y=['Marszałkowska', 'Banacha', 'Wysockiego'], style='o', figsize=(7,8))
plt.xlim([-20,25])

scikit-learn documentation:

- http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

- http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html

In [None]:
from sklearn.linear_model import LinearRegression

We make a linear regression object.

In [None]:
linreg = LinearRegression()

In [None]:
street = 'Banacha'
bicycles_weather_subset = bicycles_weather_data[['temp_avg', street]]
bicycles_weather_subset = bicycles_weather_subset.dropna()

In [None]:
bicycles_weather_subset.plot()

In [None]:
X = bicycles_weather_subset[['temp_avg']]  # a DataFrame, not a Series
y = bicycles_weather_subset[street]
linreg.fit(X, y)

# The square root of the mean squared error is the root-mean-square error (RMSE).
print("Root-mean-square error: {:.1f}".format(np.sqrt(np.mean((linreg.predict(X) - y) ** 2))))

In [None]:
linreg.intercept_

In [None]:
linreg.coef_

See more: [Root-mean-square error (RMSE)](https://en.wikipedia.org/wiki/Root-mean-square_deviation)
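
The RMSE is the square root of the mean squared error (MSE), so it is expressed in the same units as the target variable. A minimal sketch of the difference, using a few made-up numbers (not values from the dataset):

```python
import numpy as np

# Made-up true values and predictions (assumptions for illustration).
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 190.0, 270.0])

# Squared errors are 100, 100, 100 and 400, so their mean is 175.
mse = np.mean((y_pred - y_true) ** 2)

# Taking the square root brings the error back to the units of y.
rmse = np.sqrt(mse)

print("MSE:  {:.1f}".format(mse))
print("RMSE: {:.1f}".format(rmse))
```

Here the MSE is in squared units, while the RMSE can be read directly as a typical prediction error.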

We can plot the fitted line on top of the data.

In [None]:
bicycles_weather_data.plot(x='temp_avg', y=street, style='o', figsize=(7,8))
plt.plot(X['temp_avg'], linreg.predict(X), color='k', linewidth=3)
plt.xlim([-10,20])

In [None]:
# Keep the coefficients as a labelled Series, then plot them as horizontal bars.
coefficients = pd.Series(linreg.coef_, index=['temp_avg'])
coefficients.plot(kind='barh')

### Exercises

* Add other parameters (`deszcz` (rain), `snieg` (snow), etc.) as explanatory variables.
* Find the street in the dataset that is closest to where you live or work and perform linear regression for it.
* Extra: scale variables by `np.sqrt` or `np.log`