
# Working with data

## Reproducible research

- replication crisis
- open research: share data, code and models
- code your data workflow !

## Data workflow


<img src="img/data-science-explore.png" width="70%">

Reproduced from [R for data science](https://r4ds.had.co.nz/)

# Good programming habits

## Getting help

- read the docs, e.g. the [sklearn documentation](https://scikit-learn.org/stable/index.html) is really good
- programming Q&A : [StackOverflow](https://stackoverflow.com/)
- ML-Stats Q&A : [CrossValidated](https://stats.stackexchange.com/)
- [Jupyter gallery](https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks)
- cheatsheets !

## Improve your programming skills 

- fight against software entropy !
- [code smell](https://en.wikipedia.org/wiki/Code_smell), [spaghetti code](https://en.wikipedia.org/wiki/Spaghetti_code), [big ball of mud](https://en.wikipedia.org/wiki/Big_ball_of_mud)
- "kiss" : keep it stupid simple
- "dry" : don't repeat yourself
- use abstractions 
    - refactor using functions [FP](https://docs.python.org/3.7/howto/functional.html) 
    - and classes [OOP](https://python-textbok.readthedocs.io/en/1.0/Object_Oriented_Programming.html)
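
A minimal sketch of the "dry" principle: the same min-max formula is first copy-pasted, then factored into one function. The names `rescale`, `heights` and `weights` and the numbers are made up for illustration.

```python
# Hypothetical measurements, only for illustration.
heights = [150.0, 160.0, 170.0]
weights = [50.0, 65.0, 80.0]

# Before: the same min-max formula is copy-pasted for each variable.
heights_scaled = [(h - min(heights)) / (max(heights) - min(heights)) for h in heights]
weights_scaled = [(w - min(weights)) / (max(weights) - min(weights)) for w in weights]


# After: the formula lives in one function, so a fix is made only once.
def rescale(values):
    """Rescale a list of numbers linearly to the interval [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


print(rescale(heights))
print(rescale(weights))
```

Both versions compute the same values; the second one is shorter to read and has a single place to change if the formula needs a fix.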


The Zen of Python

In [10]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


## Monitoring

- painfully slow program ?
    - IO-bound : reading and writing from disk
    - CPU-bound : computing 
- program crashes ?
    - out-of-memory errors
- monitor CPU, RAM, and IO usage
    - OS monitor or command line `top`, `iostat`
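
One rough way to tell the two cases apart from inside Python is to compare wall-clock time with CPU time. This is a sketch using only the standard library; `measure`, `waiting_task` and `computing_task` are hypothetical names, and `time.sleep` merely stands in for slow disk or network access.

```python
import time


def measure(func):
    """Return (wall-clock seconds, CPU seconds) spent running func()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


def waiting_task():
    # Stand-in for slow disk or network access: the process just waits.
    time.sleep(0.2)


def computing_task():
    # Pure computation: the CPU is busy the whole time.
    sum(i * i for i in range(200_000))


wall_io, cpu_io = measure(waiting_task)
wall_cpu, cpu_cpu = measure(computing_task)
print(f"waiting:   wall={wall_io:.3f}s cpu={cpu_io:.3f}s")
print(f"computing: wall={wall_cpu:.3f}s cpu={cpu_cpu:.3f}s")
```

If the wall-clock time is much larger than the CPU time, the program spends most of its time waiting, which hints at an IO-bound problem; if both are close, it is likely CPU-bound.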

## Terminal 

<img src="img/terminal.png" width="10%" style="display:inline"> Don't be afraid to use it ;) There are many excellent tutorials 


- use tab-completion !
- explore the filesystem : `cd`, `pwd`, `ls`
- copy files : `cp`
- connect to remote: `ssh user@host`
- copy files from/to remote: `scp`

## Version control

Does your project look like this ?

<img src="img/version-control.png" alt="" width="50%">

## Version control 


<img src="img/Octocat.png" width="10%" style="display:inline">  <img src="img/gitlab-logo.png" width="10%" style="display:inline"> Use version control software like `git` to:

- back up your code on a remote host, e.g. [github](https://github.com/), [gitlab](https://about.gitlab.com/)
- write code collaboratively : **repository**
- download and modify someone else's project : **clone**
- track the history of changes : **commit logs**
- switch between parallel versions of your software : **branches**
- easily revert to an earlier state, **merge** different versions

# Prepare data

## Get and store data

- if you are lucky, the data can be fetched from an API or a database:
    - datasets for ML: [mldata](https://www.mldata.io/datasets/), [UCI repository](http://archive.ics.uci.edu/ml/datasets.php) and [kaggle](https://www.kaggle.com/datasets)
    - weather data: [openweathermap](https://openweathermap.org/api)
    - geodata: [openstreetmap](https://wiki.openstreetmap.org/wiki/Databases_and_data_access_APIs)
    - genomic data: [UCSC](https://genome-euro.ucsc.edu)
- it's a good idea to store your data on a local/server database

## API demo

- we query an API with an HTTP request : URL and query string
- the API usually returns [JSON](https://en.wikipedia.org/wiki/JSON) data

**Example** http://api.openweathermap.org/data/2.5/weather?q=Paris,france&appid=YOUR_APP_ID
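
A sketch of the same request in Python, using only the standard library. `build_url` and `fetch_json` are hypothetical helper names, and `sample_body` is a hand-written stand-in for a response, not real API output; the actual fields returned depend on the API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://api.openweathermap.org/data/2.5/weather"


def build_url(city, app_id):
    """Assemble the request URL: base URL + "?" + encoded query string."""
    query = urlencode({"q": city, "appid": app_id}, safe=",")
    return f"{BASE_URL}?{query}"


def fetch_json(url):
    """Send the HTTP GET request and decode the JSON body (needs network access)."""
    with urlopen(url) as response:
        return json.load(response)


url = build_url("Paris,france", "YOUR_APP_ID")
print(url)

# Hand-written stand-in for a response body, NOT real API output.
sample_body = '{"name": "Paris", "main": {"temp": 288.15}}'
data = json.loads(sample_body)  # JSON text -> nested Python dicts
print(data["name"], data["main"]["temp"])
```

With a valid key in place of `YOUR_APP_ID`, `fetch_json(url)` would return the decoded response as nested dictionaries and lists.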

## Tidy data

Hadley Wickham's definition:
> 1.  Each variable forms a column.
> 2.  Each observation forms a row.
> 3.  Each type of observational unit forms a table.

**Reference** https://vita.had.co.nz/papers/tidy-data.pdf

## Clean data

Unfortunately it will often take a lot of time :(

- duplicated rows
- inconsistent records
- abnormal values
- reshape data to make it tidy
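
A small pandas sketch of the first three items, on made-up records. The column names, the values and the plausible temperature range are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical raw records, only for illustration.
raw = pd.DataFrame({
    "city": ["Paris", "Paris", "paris ", "Lyon", "Lyon"],
    "temp": [15.0, 15.0, 16.0, 14.0, 999.0],
})

clean = (
    raw
    .assign(city=raw["city"].str.strip().str.title())  # inconsistent records
    .drop_duplicates()                                   # duplicated rows
)
# Abnormal values: keep only temperatures in a plausible range (assumed bounds).
clean = clean[clean["temp"].between(-50, 60)].reset_index(drop=True)
print(clean)
```

In practice the rules for what counts as "inconsistent" or "abnormal" come from knowledge of the data, which is why this step takes time.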

## Data wrangling in python

<img src="img/pandas-logo.png" width="70%"> 

In [6]:
import pandas as pd
df = pd.read_csv("cycle-share-dataset/station.csv")
df.head(4)

Unnamed: 0,station_id,name,lat,long,install_date,install_dockcount,modification_date,current_dockcount,decommission_date
0,BT-01,3rd Ave & Broad St,47.618418,-122.350964,10/13/2014,18,,18,
1,BT-03,2nd Ave & Vine St,47.615829,-122.348564,10/13/2014,16,,16,
2,BT-04,6th Ave & Blanchard St,47.616094,-122.341102,10/13/2014,16,,16,
3,BT-05,2nd Ave & Blanchard St,47.61311,-122.344208,10/13/2014,14,,14,


## Group-by


We group rows by values in column "col" and compute an **aggregate function** on each group.
<img src="img/groupby.png" width="70%"> 

### Demo

In [None]:
# groupby 
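
A possible version of this demo, as a sketch on a made-up table (the `value` column and its contents are hypothetical):

```python
import pandas as pd

# Hypothetical data, only for illustration.
df = pd.DataFrame({
    "col": ["a", "b", "a", "b", "a"],
    "value": [1, 2, 3, 4, 5],
})

# Split rows by the values of "col", then apply aggregate functions per group.
summary = df.groupby("col")["value"].agg(["sum", "mean"])
print(summary)
```

The result has one row per distinct value of "col" and one column per aggregate function.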

## Merge / Join

<img src="img/merge.png" width="50%"> 

### Demo

In [7]:
trips = pd.read_csv("data/small_trips.csv")
trips.head()

Unnamed: 0,date,trips
0,2018-10-01,421
1,2018-10-02,212
2,2018-10-03,183


In [8]:
weather = pd.read_csv("data/small_weather.csv")
weather.head()

Unnamed: 0,date,event,temp
0,2018-10-01,sun,15
1,2018-10-03,rain,7


In [9]:
trips_joined = pd.merge(trips, weather, on="date", how="left")
trips_joined

Unnamed: 0,date,trips,event,temp
0,2018-10-01,421,sun,15.0
1,2018-10-02,212,,
2,2018-10-03,183,rain,7.0


## Pivot

<img src="img/pivot.png" width="80%"> 

### Demo

In [None]:
# pivot 
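
A possible version of this demo, as a sketch on a made-up long table (station names and trip counts are hypothetical):

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, station) pair.
long_df = pd.DataFrame({
    "date": ["2018-10-01", "2018-10-01", "2018-10-02", "2018-10-02"],
    "station": ["A", "B", "A", "B"],
    "trips": [10, 20, 30, 40],
})

# Values of "station" become column headers, "date" becomes the row index.
wide_df = long_df.pivot(index="date", columns="station", values="trips")
print(wide_df)
```

`pivot` raises an error if an (index, column) pair appears more than once; `pivot_table` with an aggregate function is the usual alternative in that case.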

## Melt

The "opposite" of pivot 
<img src="img/melt.png" width="300"> 

### Demo

In [None]:
# melt 
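
A possible version of this demo, as a sketch on a made-up wide table (station names and trip counts are hypothetical):

```python
import pandas as pd

# Hypothetical wide-format data: one column per station.
wide_df = pd.DataFrame({
    "date": ["2018-10-01", "2018-10-02"],
    "A": [10, 30],
    "B": [20, 40],
})

# Column headers "A" and "B" become values of a new "station" column.
long_df = wide_df.melt(id_vars="date", var_name="station", value_name="trips")
print(long_df)
```

The result has one row per (date, station) pair, which is the tidy layout.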

## Prepare data for machine learning

- the features $X$ and target $y$ must be **numeric arrays**

Handling special data types:
- categorical values : [one-hot encoding](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features)
- text : [bag of words representation](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)
- images : RGB array
- missing values: [imputation](https://scikit-learn.org/stable/modules/impute.html#impute)
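
A sketch of two of these steps on the rows of the merged table from the Merge / Join demo. It uses plain pandas rather than the scikit-learn transformers linked above, and mean imputation is just one simple choice among many.

```python
import pandas as pd

# Rows of the merged table from the Merge / Join demo (date column omitted).
trips_joined = pd.DataFrame({
    "trips": [421, 212, 183],
    "event": ["sun", None, "rain"],
    "temp": [15.0, None, 7.0],
})

# Categorical column -> one 0/1 column per category (one-hot encoding).
# A missing category becomes a row of zeros.
event_onehot = pd.get_dummies(trips_joined["event"], prefix="event", dtype=float)

# Missing value -> mean of the observed values (a simple imputation choice).
temp_imputed = trips_joined["temp"].fillna(trips_joined["temp"].mean())

X = pd.concat([event_onehot, temp_imputed], axis=1).to_numpy()
y = trips_joined["trips"].to_numpy()
print(X)
print(y)
```

Here the target $y$ is the number of trips and the features $X$ are the encoded weather event and the temperature.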

## Exercise

For the following ML tasks, how would you construct the $X$ and $y$ arrays ?

- **automatic video description**
    - raw data = a bunch of MP4 videos and movie scripts (txt files)

- **generate LaTeX files or C code**  
    - raw data = a book written in LaTeX or the Linux C source code
    - see https://karpathy.github.io/2015/05/21/rnn-effectiveness/

# Visualize data

## Misleading graph



## Grammar of graphics

## Let's make a simple chart

## Imperative (Matplotlib)

## Declarative (Altair)

## Interactivity

# Big data

## Big data is ...

1. marketing bullshit
2. a technical difficulty for data processing
3. a blessing/curse for machine learning

## As a technical difficulty

Here big data means "too big to be stored on a single disk". Hence:
    
- most people/companies *don't* have big data
- you need distributed storage, e.g. [HDFS](https://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_distributed_file_system)
- and distributed computing, e.g. [MapReduce](https://en.wikipedia.org/wiki/MapReduce)

## As a curse for machine learning

Say $N$ = number of observations and $P$ = number of features. Here big data can mean:

- wide data $P\gg N$, too few observations for many features
    - realm of **high-dimensional statistics**
    - we can only afford very simple models and seek robustness

## As a blessing for machine learning

Say $N$ = number of observations and $P$ = number of features. Here big data can mean:

- long data $N \gg P$, many observations
    - can fit very complex models (deep learning)
    - if the data does not fit in memory, we must use **online** or **mini-batch** learning
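
A minimal sketch of mini-batch learning with NumPy: a linear regression is fitted by gradient steps on one small batch at a time, so the full dataset never has to sit in memory. The generator `batches` simulates chunks read from disk; the function names, the synthetic data and the learning rate are assumptions for illustration.

```python
import numpy as np


def batches(n_batches, batch_size, rng, true_w):
    """Yield (X, y) chunks one at a time, as if streamed from disk."""
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, true_w.size))
        y = X @ true_w + 0.01 * rng.normal(size=batch_size)
        yield X, y


def sgd_linear_regression(batch_iter, n_features, lr=0.05):
    """Fit weights with one gradient step of the squared error per batch."""
    w = np.zeros(n_features)
    for X, y in batch_iter:
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the mean squared error
        w -= lr * grad
    return w


rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])  # weights used to generate the synthetic data
w = sgd_linear_regression(batches(500, 32, rng, true_w), n_features=3)
print(w)
```

Only one batch of 32 rows is held in memory at any time; the estimate `w` is refined as more batches arrive.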

# References 

Programming skills for data science

- [R for data science](https://r4ds.had.co.nz/)
- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)