# Machine Learning in Python

by [Piotr Migdał](http://p.migdal.pl/) & Dominik Krzemiński

for El Passion, 2017

## 2. Data exploration

* [Pandas](http://pandas.pydata.org/) and [Seaborn](https://seaborn.pydata.org/)
* on data from Warsaw bike usage

Data from:

* [Dane z liczników rowerowych w Warszawie](http://greenelephant.pl/shiny/rowery/) (Data from bicycle counters in Warsaw) by Monika Pawłowska (code: [github.com/pawlowska/shiny-server](https://github.com/pawlowska/shiny-server))
* original source: http://rowery.um.warszawa.pl/pomiary-ruchu-rowerowego

![](imgs/rowery_ECC2014205.gif)

In [None]:
# tabular data manipulation
import pandas as pd

# plots
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# reading CSV data
days = pd.read_csv("data/bicycles_weather.csv", index_col=0, parse_dates=[0])
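
To see what `index_col=0` and `parse_dates=[0]` do, here is a minimal sketch on an in-memory CSV. The three sample rows and the name `sample` are made up for illustration; the real file has many more columns.

```python
import io
import pandas as pd

# Hypothetical three-row sample, not the real bicycle data.
csv_text = """date,temp,weekday
2016-01-01,-3.5,4
2016-01-02,-6.0,5
2016-01-03,-8.2,6
"""

# index_col=0 makes the first column the index;
# parse_dates=[0] converts that column from strings to timestamps.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=[0])

print(type(sample.index).__name__)  # DatetimeIndex
print(sample.loc[pd.Timestamp("2016-01-02"), "temp"])
```

With a date index, rows can be selected by timestamp and time-series plots get a proper date axis.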

In [None]:
# first entries
days.head()

In [None]:
# last entries
days.tail()

In [None]:
# rows and columns
days.shape

In [None]:
# columns
days.columns

In [None]:
# index
days.index[:5]

In [None]:
# dtype and non-null count for each column
days.info()

In [None]:
# selecting column and looking at its first 8 entries
days['weekday'].head(8)

In [None]:
# counting occurrences
days['weekday'].value_counts()

In [None]:
# maximum value of each column
days.max()

In [None]:
# number of missing values in each row (that is, per day), plotted over time
days.isnull().sum(axis=1).plot()
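
The `axis` argument decides whether missing values are counted per row or per column. A small sketch on a made-up frame (the name `toy` and its column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few gaps.
toy = pd.DataFrame({"counter_a": [10.0, np.nan, 30.0],
                    "counter_b": [np.nan, np.nan, 5.0],
                    "temp": [1.0, 2.0, 3.0]})

# axis=1 sums across columns: one number per row.
missing_per_row = toy.isnull().sum(axis=1)

# axis=0 (the default) sums down the rows: one number per column.
missing_per_column = toy.isnull().sum(axis=0)

print(missing_per_row.tolist())     # [1, 2, 0]
print(missing_per_column.to_dict())
```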

In [None]:
# temperature by day
days['temp'].plot()

In [None]:
# plotting a series
days['Most Gdanski - ścieżka rowerowa'].plot()

In [None]:
# plotting a 1-column DataFrame (spot the difference!)
days[['Most Gdanski - ścieżka rowerowa']].plot()
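
The difference comes from the type returned: single brackets give a `Series`, double brackets give a `DataFrame`, and a `DataFrame` plot adds a legend with the column name by default. A sketch on made-up data (`toy` and `counter_a` are hypothetical names):

```python
import pandas as pd

# Hypothetical data standing in for the bicycle counters.
toy = pd.DataFrame({"counter_a": [10, 20, 30], "temp": [1.0, 2.0, 3.0]})

as_series = toy["counter_a"]    # single brackets: one column as a Series
as_frame = toy[["counter_a"]]   # list of names: a DataFrame with one column

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
```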

In [None]:
# number of days with each temperature
days['temp'].hist(bins=25)

In [None]:
# scatter plot of bike count vs temperature, skipping days with missing values
some_data = days[['Most Gdanski - ścieżka rowerowa', 'temp']].dropna()
plt.plot(some_data['temp'], some_data['Most Gdanski - ścieżka rowerowa'], 'o')

In [None]:
# filling missing values
days.fillna(0).head()
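
Filling gaps with zeros is convenient for plotting, but it changes summary statistics, since pandas otherwise skips missing values. A sketch on a made-up series (the values and the name `counts` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical daily counts with one missing day.
counts = pd.Series([100.0, np.nan, 300.0])

# By default, mean() ignores NaN: (100 + 300) / 2
mean_skipping_gaps = counts.mean()

# After fillna(0), the gap counts as a real zero: (100 + 0 + 300) / 3
mean_with_zeros = counts.fillna(0).mean()

print(mean_skipping_gaps)  # 200.0
print(round(mean_with_zeros, 2))  # 133.33
```

Whether a zero is a reasonable stand-in depends on why the value is missing, for example whether a counter was not yet installed or simply recorded no traffic.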

In [None]:
# scatter plots of pairs
sns.pairplot(data=days.fillna(0),
             vars=['Most Gdanski - ścieżka rowerowa',
                   'Dworzec Wilenski (al. Solidarności)',
                   'temp'],
             hue='weekday',
             dropna=True,
             height=4)  # 'size' was renamed to 'height' in seaborn 0.9

In [None]:
# creating a column, via:

# series method
days['is_weekend'] = days['weekday'].isin([5, 6])

# applying lambda function
days['is_weekend'] = days['weekday'].apply(lambda x: x in [5, 6])

# using a dictionary to map values
days['is_weekend'] = days['weekday'].map({0: False, 1: False, 2: False,
                               3: False, 4: False, 5: True, 6: True})

# now, use it for a nicer plot
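
The three approaches give the same result here. A sketch checking that on a made-up series, assuming weekdays are coded 0 to 6 with 5 and 6 as the weekend:

```python
import pandas as pd

# Hypothetical weekday codes (0 = Monday, ..., 6 = Sunday is an assumption).
weekday = pd.Series([0, 3, 5, 6, 4], name="weekday")

via_isin = weekday.isin([5, 6])
via_apply = weekday.apply(lambda x: x in [5, 6])
via_map = weekday.map({0: False, 1: False, 2: False,
                       3: False, 4: False, 5: True, 6: True})

print(via_isin.tolist())  # [False, False, True, True, False]
print(via_isin.tolist() == via_apply.tolist() == via_map.tolist())  # True

# One difference: map() returns NaN for values missing from the dictionary,
# while isin() returns False for them.
print(pd.Series([7]).map({5: True, 6: True}).isnull().tolist())  # [True]
```

`isin` is vectorized and the shortest to write; `apply` and `map` are more general and handle rules that are not simple membership tests.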

In [None]:
# correlation heatmap
sns.heatmap(days.corr())

In [None]:
# averages of y per various factors
# (catplot is the current name of factorplot, renamed in seaborn 0.9)
sns.catplot(data=days.fillna(0),
            y='Most Gdanski - ścieżka rowerowa',
            x='weekday',
            kind='bar')

In [None]:
# show the docstring (IPython help syntax)
sns.catplot?

In [None]:
# one panel per month, four panels per row
sns.catplot(data=days.fillna(0),
            y='Most Gdanski - ścieżka rowerowa',
            x='weekday',
            col='month',
            col_wrap=4,
            kind='bar')

In [None]:
# same data, another plot
sns.catplot(data=days.fillna(0),
            y='Most Gdanski - ścieżka rowerowa',
            x='month',
            hue='weekday',
            kind='bar',
            ci=None,
            height=6)

In [None]:
# violin plot
sns.catplot(data=days.fillna(0),
            y='Most Gdanski - ścieżka rowerowa',
            x='weekday',
            kind='violin')

In [None]:
# SQL-like operations (GROUP BY): mean of every column per weekday
days.groupby('weekday').mean()
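
A sketch of the same grouping on made-up data, roughly what `SELECT weekday, AVG(...) ... GROUP BY weekday` would do in SQL. The frame `toy` and the column `rides` are hypothetical:

```python
import pandas as pd

# Hypothetical data: a few days with weekday codes, ride counts and temperature.
toy = pd.DataFrame({"weekday": [0, 0, 5, 5, 6],
                    "rides": [100, 200, 40, 60, 30],
                    "temp": [10.0, 12.0, 8.0, 10.0, 9.0]})

# Mean of every remaining column within each weekday group.
per_weekday = toy.groupby("weekday").mean()
print(per_weekday)

# Several aggregates for one column at once.
rides_summary = toy.groupby("weekday")["rides"].agg(["mean", "count"])
print(rides_summary)
```

The grouping column becomes the index of the result, so `per_weekday.loc[5, "rides"]` gives the mean for that weekday.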

## Further reading

* Data manipulation
    * [An Introduction to Scientific Python – Pandas](http://www.datadependence.com/2016/05/scientific-python-pandas/)
    * [Pandas exercises](https://github.com/guipsamora/pandas_exercises)
    * [Top Pandas functions used in GitHub repos](https://kozikow.com/2016/07/01/top-pandas-functions-used-in-github-repos/)
* Charts
    * [Overview of Python Visualization Tools](http://pbpython.com/visualization-tools-1.html)
    * [A Dramatic Tour through Python’s Data Visualization Landscape](http://blog.yhat.com/posts/python-data-viz-landscape.html)
    * [Pandas Visualization](http://pandas.pydata.org/pandas-docs/stable/visualization.html)
    * [Matplotlib tutorial - Nicolas P. Rougier](http://www.labri.fr/perso/nrougier/teaching/matplotlib/)