# Machine Learning - part 1

by Dominik Krzemiński & Piotr Migdał

for El Passion, 2017

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
plt.style.use('ggplot')
sns.set_style('whitegrid')

%matplotlib inline

![Image of Yaktocat](https://lh4.googleusercontent.com/2gl_RypBFZInaLETtoLvGYjZTJ8MyGG1b3G56HunV4ruqi0DPQSAwxWWlYYW1u7XgKobvbTSY5Rxz0pioWaXgwxYJEebAKg3mnLG8B8V37VtwDFbYBmlS2y2o2uGJ2lufMrRCdU)

## Bicycles data

Let's read the data from the CSV files.

In [None]:
# the first argument is the path to the data file
# delimiter: the column separator, usually a comma, sometimes a semicolon or a tab
# na_values: the marker used for missing data, here "NA"
# index_col=0: the first column already numbers the rows, so pandas uses it as the index
bicycles_data = pd.read_csv("data/warsaw-bicycles.csv", delimiter=",", na_values="NA", index_col=0)
weather_data = pd.read_csv("data/weather.csv", delimiter=",", na_values="NA", index_col=0)
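
To see what `na_values` and `index_col` do, here is a minimal sketch on a tiny in-memory CSV. The column names and numbers are made up for illustration and are not taken from the course files.

```python
import io

import pandas as pd

# Hypothetical CSV content: the first (unnamed) column numbers the rows,
# and one count is missing, marked as "NA".
csv_text = """,Data,Banacha
1,2017-01-01,120
2,2017-01-02,NA
3,2017-01-03,340
"""

toy = pd.read_csv(io.StringIO(csv_text), delimiter=",", na_values="NA", index_col=0)

print(toy)
print(toy["Banacha"].isna().sum())  # number of missing counts
```

The missing entry is read as `NaN`, and the row numbers from the file become the index instead of an ordinary column.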

In [None]:
bicycles_data.head()

The `describe` method provides summary statistics for each numeric column.

In [None]:
bicycles_data.describe()

Now we look at the weather dataset in the same way.

In [None]:
weather_data.head()

In [None]:
weather_data.describe()

Do the statistics for `state` make any sense? Think about what this column represents.
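
If `state` is a numeric code for a weather category (an assumption here, worth checking against the data source), its mean or standard deviation has no natural interpretation. A small sketch with made-up codes shows that counting the categories is usually more informative:

```python
import pandas as pd

# Hypothetical weather codes, e.g. 1 = sunny, 2 = cloudy, 3 = rainy (made up for illustration).
toy_weather = pd.DataFrame({"state": [1, 1, 3, 2, 3, 3]})

# The mean of category codes is a number, but it does not describe any real weather state.
print(toy_weather["state"].mean())

# Counting how often each code occurs keeps the categories apart.
state_counts = toy_weather["state"].value_counts()
print(state_counts)
```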

### Data engineering

Next we parse the dates in the weather dataset and extract the day of the week for each measurement.

In [None]:
weather_data["date"] = pd.to_datetime(weather_data["date"], format="%Y-%m-%d")

In [None]:
import calendar

In [None]:
weather_data["dayname"] = weather_data["date"].apply(lambda x: calendar.day_name[x.weekday()])
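
The same two steps can be tried on a few hand-picked dates. This is only a sketch: the dates below are made up, and the day names are in English as long as the default locale is in use.

```python
import calendar

import pandas as pd

# Three made-up dates; 2017-01-02 was a Monday.
dates = pd.to_datetime(pd.Series(["2017-01-02", "2017-01-03", "2017-01-08"]), format="%Y-%m-%d")

# weekday() returns 0 for Monday up to 6 for Sunday; calendar.day_name maps it to a name.
daynames = dates.apply(lambda d: calendar.day_name[d.weekday()])
print(daynames.tolist())
```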

In [None]:
weather_data.head()

In [None]:
len(weather_data)

In [None]:
len(bicycles_data)

There are more weather measurements than bicycle counts, so we need to restrict the weather dataset to the date range covered by the bicycle data.

In [None]:
bicycles_date_min, bicycles_date_max = bicycles_data["Data"].tolist()[0], bicycles_data["Data"].tolist()[-1]

In [None]:
weather_data_filtered = weather_data.query("'{}'<=date<='{}'".format(bicycles_date_min, bicycles_date_max))

In [None]:
weather_data_filtered = weather_data_filtered.reset_index(drop=True)

In [None]:
weather_data_filtered.index += 1 
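
The filtering and re-indexing steps can be sketched on a toy table. The values and the date range below are invented, and a boolean mask with `between` is used here as an alternative to `query`:

```python
import pandas as pd

# Hypothetical daily weather measurements covering five days.
toy_weather = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=5, freq="D"),
    "value": [2.0, 3.5, 1.0, -0.5, 4.0],
})

# Hypothetical first and last dates of the shorter dataset.
date_min, date_max = pd.Timestamp("2017-01-02"), pd.Timestamp("2017-01-04")

# Keep only rows whose date lies in the closed interval [date_min, date_max].
mask = toy_weather["date"].between(date_min, date_max)
toy_filtered = toy_weather[mask].reset_index(drop=True)

# Shift the index to start at 1, matching a dataset whose rows are numbered from 1.
toy_filtered.index += 1
print(toy_filtered)
```

Resetting the index matters because the later concatenation matches rows by index label, not by position.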

Now `weather_data_filtered` should have the same number of rows as `bicycles_data`. As an exercise, check this with `len`.

In [None]:
weather_data_filtered.head()

Now we are ready to concatenate the two datasets.

In [None]:
bicycles_weather_data = pd.concat([bicycles_data, weather_data_filtered], axis=1)
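
With `axis=1`, `pd.concat` places the tables side by side and matches rows by index label. A minimal sketch with made-up numbers shows why the indices of both tables have to agree:

```python
import pandas as pd

# Two made-up tables sharing the index labels 1..3.
counts = pd.DataFrame({"Banacha": [120, 95, 340]}, index=[1, 2, 3])
weather = pd.DataFrame({"value": [2.0, 3.5, 1.0]}, index=[1, 2, 3])

# axis=1 puts the columns next to each other, aligning rows by index label.
combined = pd.concat([counts, weather], axis=1)
print(combined)

# If the labels do not match, the unmatched cells are filled with NaN.
shifted = weather.copy()
shifted.index = [2, 3, 4]
misaligned = pd.concat([counts, shifted], axis=1)
print(misaligned)
```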

Some columns are no longer useful, so we can drop them.

In [None]:
bicycles_weather_data.drop(['Data', 'state', 'startTyg', 'startM'], axis=1, inplace=True)

All in all, we end up with a dataset that looks like this:

In [None]:
bicycles_weather_data.head()

In [None]:
#bicycles_weather_data.to_csv("data/bicycles_weather.csv")
bicycles_weather_data = pd.read_csv("data/bicycles_weather.csv", index_col=0)

## Linear regression

Materials:

- https://en.wikipedia.org/wiki/Linear_regression

- http://onlinestatbook.com/2/regression/intro.html

- https://www.youtube.com/watch?v=KsVBBJRb9TE



In [None]:
bicycles_weather_data.plot(x='value', y=['Marszałkowska', 'Banacha', 'Wysockiego'], style='o', figsize=(7,8))
plt.gca().invert_xaxis()

scikit-learn documentation:

- http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

- http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html

In [None]:
from sklearn import linear_model

We create a linear regression object.

In [None]:
linreg = linear_model.LinearRegression(fit_intercept=True)

In [None]:
bicycles_weather_data = bicycles_weather_data.dropna()

In [None]:
x = bicycles_weather_data['value'].to_frame()
y = bicycles_weather_data['Banacha'].to_frame()
linreg.fit(x, y)

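# slope of the fitted line (one coefficient, because there is one feature)
print('Coefficients: \n', linreg.coef_)

# y.values converts the DataFrame to an array, so the mean is a single number
print("Mean squared error: %.2f"
      % np.mean((linreg.predict(x) - y.values) ** 2))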
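
As a sanity check of the fitting and error computation, here is a hedged sketch on synthetic data (not the bicycle counts) where the true slope and intercept are known. It also compares the hand-computed mean squared error with `sklearn.metrics.mean_squared_error`:

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# Synthetic data: y = 2 * x + 1 plus small Gaussian noise.
rng = np.random.RandomState(0)
x_toy = np.linspace(0, 10, 50).reshape(-1, 1)  # scikit-learn expects a 2-D feature array
y_toy = 2 * x_toy.ravel() + 1 + rng.normal(scale=0.5, size=50)

model = linear_model.LinearRegression(fit_intercept=True)
model.fit(x_toy, y_toy)
y_pred = model.predict(x_toy)

# Mean squared error computed by hand and with the scikit-learn helper.
mse_manual = np.mean((y_pred - y_toy) ** 2)
mse_sklearn = mean_squared_error(y_toy, y_pred)

print("slope: %.2f, intercept: %.2f" % (model.coef_[0], model.intercept_))
print("MSE: %.3f (manual), %.3f (sklearn)" % (mse_manual, mse_sklearn))
```

The recovered slope and intercept should be close to 2 and 1, and both error values should agree.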

We can plot the fitted line on top of the data.

In [None]:
bicycles_weather_data.plot(x='value', y='Banacha', style='o', figsize=(7,8))
plt.plot(x, linreg.predict(x), color='k', linewidth=3)
plt.gca().invert_xaxis()

### Exercise

(a) Find the street whose fit has the smallest _mean squared error_.

## Logistic regression

Materials:

- https://en.wikipedia.org/wiki/Logistic_regression

- http://www.statisticssolutions.com/what-is-logistic-regression/

About the data: 100 volunteers provided semen samples, which were analyzed according to the WHO 2010 criteria. Sperm concentration is related to socio-demographic data, environmental factors, health status, and life habits.

(source: https://archive.ics.uci.edu/ml/datasets/Fertility)

In [None]:
fertility_data = pd.read_csv("data/fertility_Diagnosis.csv", header=None)

Fertility data description:
    
- (0) Season in which the analysis was performed. 1) winter, 2) spring, 3) summer, 4) fall. (-1, -0.33, 0.33, 1) 

- (1) Age at the time of analysis. 18-36 (0, 1) 

- (2) Childhood diseases (i.e., chicken pox, measles, mumps, polio) 1) yes, 2) no. (0, 1) 

- (3) Accident or serious trauma 1) yes, 2) no. (0, 1) 

- (4) Surgical intervention 1) yes, 2) no. (0, 1) 

- (5) High fevers in the last year 1) less than three months ago, 2) more than three months ago, 3) no. (-1, 0, 1) 

- (6) Frequency of alcohol consumption 1) several times a day, 2) every day, 3) several times a week, 4) once a week, 5) hardly ever or never (0, 1) 

- (7) Smoking habit 1) never, 2) occasional 3) daily. (-1, 0, 1) 

- (8) Number of hours spent sitting per day. 1-16 (0, 1) 

Output:

- (9) Diagnosis: normal (N), altered (O)
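
The ranges in parentheses suggest that the raw values were rescaled before publication. One plausible reading (an assumption, since the description does not give the formula) is a linear min-max rescaling. The helper `rescale` below is hypothetical and only illustrates that reading:

```python
import numpy as np


def rescale(values, old_min, old_max, new_min, new_max):
    """Linearly map values from [old_min, old_max] to [new_min, new_max]."""
    values = np.asarray(values, dtype=float)
    fraction = (values - old_min) / (old_max - old_min)
    return new_min + fraction * (new_max - new_min)


# Season codes 1..4 mapped to the interval [-1, 1].
seasons = rescale([1, 2, 3, 4], 1, 4, -1, 1)
print(np.round(seasons, 2))

# Ages 18..36 mapped to the interval [0, 1].
ages = rescale([18, 27, 36], 18, 36, 0, 1)
print(ages)
```

Under this reading the four seasons land on -1, -0.33, 0.33 and 1, which matches the values listed for feature (0).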

In [None]:
fertility_data.head()

Here we separate features from target classes.

In [None]:
fer_x = fertility_data[list(range(0,9))]
fer_y = fertility_data[9]

We split the data into training and test sets.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(fer_x, fer_y, test_size=0.4, random_state=0)
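
A quick sketch of what the split does, using random stand-in data with the same shape as the fertility table (100 rows, 9 features). The class proportions are invented. The `stratify` argument is an optional extra not used in this notebook; it keeps the class balance similar in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fertility table: 100 rows, 9 features.
rng = np.random.RandomState(0)
X_toy = rng.uniform(size=(100, 9))
y_toy = rng.choice(["N", "O"], size=100, p=[0.8, 0.2])

# test_size=0.4 holds out 40% of the rows; random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.4, random_state=0)
print(X_tr.shape, X_te.shape)

# stratify keeps the class proportions similar in both parts,
# which can help when one class is rare.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=0, stratify=y_toy
)
print((y_tr_s == "O").mean(), (y_te_s == "O").mean())
```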

Creating a logistic regression model is as simple as in the linear regression case.

In [None]:
logreg = linear_model.LogisticRegression()

In [None]:
logreg.fit(X_train, y_train)

In [None]:
logreg.score(X_test, y_test)
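
For a classifier, `score` returns the accuracy: the fraction of test samples that are classified correctly. The sketch below uses synthetic, imbalanced data (not the fertility dataset) to show this, and compares the result with a simple baseline that always predicts the most frequent class:

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Synthetic two-class data in which class "O" is the rarer one.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.where(X_toy[:, 0] + 0.5 * rng.normal(size=200) > 1.0, "O", "N")

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.4, random_state=0)

clf = linear_model.LogisticRegression()
clf.fit(X_tr, y_tr)

# score() is the fraction of test samples whose predicted class matches the true one.
accuracy = clf.score(X_te, y_te)
manual_accuracy = np.mean(clf.predict(X_te) == y_te)

# Baseline: always predict the most frequent class of the training set.
labels, counts = np.unique(y_tr, return_counts=True)
majority = labels[np.argmax(counts)]
baseline_accuracy = np.mean(y_te == majority)

print("model: %.2f, baseline: %.2f" % (accuracy, baseline_accuracy))
```

When one class dominates, a high accuracy alone says little, so it is worth comparing the score against such a baseline.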

### Exercise

(a) Is it possible to obtain a similar score using only selected features from the fertility dataset?