# Linear regression

by Dominik Krzemiński & Piotr Migdał

for El Passion, 2017

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
plt.style.use('ggplot')
sns.set_style('whitegrid')

%matplotlib inline

## Bicycles data

Let's read the data from CSV files.

In [None]:
# The first argument is the path to the data file.
# `delimiter` sets the field separator: usually a comma, sometimes a semicolon or a tab.
# `na_values` lists strings to treat as missing data; here missing entries are written as "NA".
# `index_col=0` uses the first column, which already numbers the rows, as the index,
# so pandas does not create a second one.
bicycles_data = pd.read_csv("data/warsaw-bicycles.csv", delimiter=",", na_values="NA", index_col=0)
weather_data = pd.read_csv("data/weather.csv", delimiter=",", na_values="NA", index_col=0)
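
To see what these `read_csv` arguments do without needing the data files, here is a minimal sketch that reads a small CSV from a string. The column names and values are made up for illustration and are not taken from the real datasets.

```python
import io
import pandas as pd

# Hypothetical CSV contents; columns and values are invented for this sketch.
csv_text = """id,date,count
1,2015-01-01,120
2,2015-01-02,NA
3,2015-01-03,95
"""

toy_data = pd.read_csv(io.StringIO(csv_text), delimiter=",", na_values="NA", index_col=0)

print(toy_data)
# The "NA" entry is read as a missing value (NaN).
n_missing = toy_data["count"].isna().sum()
print("Missing values in 'count':", n_missing)
```

The `id` column becomes the index, so it no longer appears among the regular columns.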

In [None]:
bicycles_data.head()

The `describe` method provides summary statistics for each numeric column.

In [None]:
bicycles_data.describe()

Now let's look at the weather dataset in the same way.

In [None]:
weather_data.head()

In [None]:
weather_data.describe()

Do the statistics for `state` make any sense to you?
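
As a hint, here is a small sketch under the assumption that `state` stores numeric codes for weather categories (the codes below are invented). For such a column, counting the categories is usually more informative than averaging them.

```python
import pandas as pd

# Hypothetical category codes, e.g. 1 = sunny, 2 = cloudy, 3 = rain (an assumption).
toy_weather = pd.DataFrame({"state": [1, 1, 3, 2, 1, 3]})

# describe() treats the codes as ordinary numbers; their mean has no clear meaning.
print(toy_weather["state"].describe())

# value_counts() reports how often each category occurs.
state_counts = toy_weather["state"].value_counts()
print(state_counts)
```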

### Data engineering

Now we work on the weather dataset to extract the day of the week of each measurement.

In [None]:
weather_data["date"] = pd.to_datetime(weather_data["date"], format="%Y-%m-%d")

In [None]:
import calendar

In [None]:
weather_data["dayname"] = weather_data["date"].apply(lambda x: calendar.day_name[x.weekday()])
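
The same day names can also be obtained with the vectorized `.dt.day_name()` accessor in pandas, which avoids the Python-level `apply`. A minimal sketch on two made-up dates:

```python
import calendar
import pandas as pd

# Two example dates, parsed the same way as the `date` column.
dates = pd.Series(pd.to_datetime(["2015-06-01", "2015-06-06"], format="%Y-%m-%d"))

# The approach used in this notebook: look up each weekday number in the calendar module.
via_apply = dates.apply(lambda x: calendar.day_name[x.weekday()])

# Vectorized alternative provided by pandas.
via_dt = dates.dt.day_name()

print(via_apply.tolist())
print(via_dt.tolist())
```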

In [None]:
weather_data.head()

In [None]:
len(weather_data)

In [None]:
len(bicycles_data)

We clearly see that there are more weather measurements than bicycle counts, so we need to restrict the weather data to the date range covered by the bicycle data.

In [None]:
# first and last date in the bicycle data (this assumes the rows are sorted by date)
bicycles_date_min, bicycles_date_max = bicycles_data["Data"].iloc[0], bicycles_data["Data"].iloc[-1]

In [None]:
weather_data_filtered = weather_data.query("'{}'<=date<='{}'".format(bicycles_date_min, bicycles_date_max))

In [None]:
weather_data_filtered = weather_data_filtered.reset_index(drop=True)

In [None]:
# shift the index to start at 1 so that it lines up with the index of bicycles_data
weather_data_filtered.index += 1

Now `weather_data_filtered` should have the same number of rows as `bicycles_data`. As an exercise, check this with `len`.
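
The filtering steps above can be sketched on toy data. This version uses a boolean mask built with `Series.between` instead of `query`; the frames below are invented, and only the column names `date` and `Data` follow the notebook.

```python
import pandas as pd

# Toy stand-ins for the two datasets (values are made up).
toy_weather = pd.DataFrame({
    "date": pd.date_range("2015-01-01", periods=10, freq="D"),
    "value": range(10),
})
toy_bicycles = pd.DataFrame({"Data": ["2015-01-03", "2015-01-04", "2015-01-05"]})

# First and last date covered by the bicycle counts.
date_min = pd.to_datetime(toy_bicycles["Data"].iloc[0])
date_max = pd.to_datetime(toy_bicycles["Data"].iloc[-1])

# Keep only weather rows inside that range (both ends inclusive).
mask = toy_weather["date"].between(date_min, date_max)
toy_filtered = toy_weather[mask].reset_index(drop=True)
toy_filtered.index += 1  # start numbering at 1

print(len(toy_filtered), len(toy_bicycles))
```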

In [None]:
weather_data_filtered.head()

Now we are ready to concatenate the two datasets.

In [None]:
bicycles_weather_data = pd.concat([bicycles_data, weather_data_filtered], axis=1)
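
Note that `pd.concat` with `axis=1` matches rows by index label, not by position, which is why the index of the weather data was reset and shifted first. A small sketch with invented values shows what happens when the indices do not fully overlap:

```python
import pandas as pd

# Toy frames with partially overlapping index labels (values are made up).
left = pd.DataFrame({"count": [10, 20, 30]}, index=[1, 2, 3])
right = pd.DataFrame({"temp": [5.0, 7.5]}, index=[2, 3])

combined = pd.concat([left, right], axis=1)

# Row 1 has no match in `right`, so its `temp` is filled with NaN.
print(combined)
```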

Some columns are no longer useful, so we can drop them.

In [None]:
bicycles_weather_data.drop(['Data', 'state', 'startTyg', 'startM'], axis=1, inplace=True)

In [None]:
bicycles_weather_data.rename(columns={'value': 'temp'}, inplace=True)

All in all, we end up with a dataset that looks like this:

In [None]:
bicycles_weather_data.head()

In [None]:
#bicycles_weather_data.to_csv("data/bicycles_weather.csv")
bicycles_weather_data = pd.read_csv("data/bicycles_weather.csv", index_col=0)

## Linear regression

Linear regression models a linear relationship between a dependent variable $y$ and one or more explanatory variables. The case of one explanatory variable is called **simple linear regression**. With more than one explanatory variable, it is called **multiple linear regression**. In the simple case the model is a straight line with slope $a$ and intercept $b$:

$$
y = a x + b
$$

!["xkcd"](https://imgs.xkcd.com/comics/linear_regression.png)

Analytical solutions exist:

- [Ordinary least squares](https://en.wikipedia.org/wiki/Ordinary_least_squares)

- [Ridge regression](https://en.wikipedia.org/wiki/Ridge_regression)

but they are not always efficient to compute, for example on very large datasets!
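
For simple linear regression the ordinary least squares solution can be written out directly. The sketch below uses synthetic data (a true slope of 2 and intercept of 1 plus noise, all chosen for illustration) and compares the closed-form result with `np.polyfit`.

```python
import numpy as np

# Synthetic data: y = 2x + 1 plus Gaussian noise (parameters chosen for illustration).
rng = np.random.default_rng(0)
x_toy = np.linspace(0, 10, 50)
y_toy = 2.0 * x_toy + 1.0 + rng.normal(scale=0.5, size=x_toy.size)

# Closed-form OLS for one explanatory variable:
# slope = covariance(x, y) / variance(x), intercept = mean(y) - slope * mean(x)
x_centered = x_toy - x_toy.mean()
y_centered = y_toy - y_toy.mean()
a_hat = np.sum(x_centered * y_centered) / np.sum(x_centered ** 2)
b_hat = y_toy.mean() - a_hat * x_toy.mean()

# Reference fit of a degree-1 polynomial; returns [slope, intercept].
a_ref, b_ref = np.polyfit(x_toy, y_toy, 1)

print("closed form: a={:.3f}, b={:.3f}".format(a_hat, b_hat))
print("np.polyfit:  a={:.3f}, b={:.3f}".format(a_ref, b_ref))
```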

Correlation:

!["source: wikipedia.org"](https://upload.wikimedia.org/wikipedia/commons/d/d4/Correlation_examples2.svg)
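
The Pearson correlation coefficient shown in the figure can be computed with `np.corrcoef`. A minimal sketch on hand-made arrays: a perfectly increasing line, a perfectly decreasing line, and a symmetric pattern with no linear trend.

```python
import numpy as np

x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

y_up = 3.0 * x_toy + 2.0                         # exact increasing line
y_down = -0.5 * x_toy + 1.0                      # exact decreasing line
y_sym = np.array([1.0, -1.0, 0.0, -1.0, 1.0])    # symmetric, no linear trend

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation.
r_up = np.corrcoef(x_toy, y_up)[0, 1]
r_down = np.corrcoef(x_toy, y_down)[0, 1]
r_sym = np.corrcoef(x_toy, y_sym)[0, 1]

print(r_up, r_down, r_sym)
```

The slope of the line does not affect the coefficient: only the direction and the amount of scatter matter.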

Materials:

- https://en.wikipedia.org/wiki/Linear_regression

- http://onlinestatbook.com/2/regression/intro.html

- https://www.youtube.com/watch?v=KsVBBJRb9TE

In [None]:
bicycles_weather_data.plot(x='temp', y=['Marszałkowska', 'Banacha', 'Wysockiego'], style='o', figsize=(7,8))
plt.gca().invert_xaxis()

scikit-learn documentation:

- http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

- http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html

In [None]:
from sklearn import linear_model

We create a linear regression object. With `fit_intercept=True` the model estimates the intercept $b$ as well as the slope $a$.

In [None]:
linreg = linear_model.LinearRegression(fit_intercept=True)

In [None]:
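street = 'Banacha'
# keep only the temperature and the counts for the chosen street
bicycles_weather_subset = bicycles_weather_data[['temp', street]]
# drop days on which either value is missing
bicycles_weather_subset = bicycles_weather_subset.dropna()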

In [None]:
x = bicycles_weather_subset['temp'].to_frame()
y = bicycles_weather_subset[street].to_frame()
linreg.fit(x, y)

print('Coefficients:\n a={:.3f}, b={:.3f}'.format(linreg.coef_[0][0], linreg.intercept_[0]))

# use y.values so the subtraction is between two NumPy arrays and np.mean returns a scalar
print("Mean squared error: %.2f"
      % np.mean((linreg.predict(x) - y.values) ** 2))
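
The mean squared error is the average of the squared differences between predicted and observed values. A tiny sketch with made-up numbers that can be checked by hand:

```python
import numpy as np

# Made-up observed and predicted values.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

residuals = y_pred - y_true          # [-1, 0, 2]
mse = np.mean(residuals ** 2)        # (1 + 0 + 4) / 3

print("Mean squared error: %.2f" % mse)
```

Because the errors are squared, one large miss contributes more than several small ones.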

We can plot the fitted line on top of the data.

In [None]:
bicycles_weather_data.plot(x='temp', y=street, style='o', figsize=(7,8))
plt.plot(x, linreg.predict(x), color='k', linewidth=3)
plt.gca().invert_xaxis()

### Exercises

(a) Find your street (or the one closest to it) in the dataset and perform a linear regression.

(b) Find the street with the smallest _mean squared error_ of the fit.