
# Working with data
<hr>


To view this notebook as slides:

1. install RISE:

```bash
conda install -c conda-forge rise
```

2. click <div class="btn btn-default"> <i class="fa-bar-chart fa"></i> </div>  to enter the RISE slideshow  

## Reproducible research

- replication crisis
- open research: share data, code and models
- code your data workflow !

## Data workflow


<img src="img/data-science-explore.png" width="800">

Reproduced from [R for data science](https://r4ds.had.co.nz/)

# Good programming habits

## Getting help

- read the docs, e.g. the [scikit-learn documentation](https://scikit-learn.org/stable/index.html) is really good
- programming Q&A : [StackOverflow](https://stackoverflow.com/)
- ML-Stats Q&A : [CrossValidated](https://stats.stackexchange.com/)
- [Jupyter gallery](https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks)
- cheatsheets !

## Improve your programming skills 

- fight against software entropy !
- [code smell](https://en.wikipedia.org/wiki/Code_smell), [spaghetti code](https://en.wikipedia.org/wiki/Spaghetti_code), [big ball of mud](https://en.wikipedia.org/wiki/Big_ball_of_mud)
- "kiss" : keep it simple, stupid
- "dry" : don't repeat yourself
- use abstractions 
    - refactor using functions [FP](https://docs.python.org/3.7/howto/functional.html) 
    - and classes [OOP](https://python-textbok.readthedocs.io/en/1.0/Object_Oriented_Programming.html)
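
As a minimal sketch of the "dry" idea, the copy-pasted scaling logic below is factored into a single function. The names `normalize`, `heights` and `weights` are hypothetical and chosen only for illustration.

```python
# Before refactoring, the same min-max scaling is copy-pasted for every column:
# heights_scaled = [(h - min(heights)) / (max(heights) - min(heights)) for h in heights]
# weights_scaled = [(w - min(weights)) / (max(weights) - min(weights)) for w in weights]


def normalize(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical example data.
heights = [150, 160, 170, 180]
weights = [50, 65, 80]

# After refactoring, the logic lives in one place and is reused.
heights_scaled = normalize(heights)
weights_scaled = normalize(weights)
```

A bug fix or a change of scaling rule now has to be made only once, inside `normalize`.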


The Zen of Python

```python
import this
```

## Monitoring

- painfully slow program ?
    - IO-bound : reading and writing from disk
    - CPU-bound : computing 
- program crashes ?
    - out-of-memory errors
- monitor CPU, RAM, and IO usage
    - macOS Activity Monitor, or `top` and `iostat` in the shell
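
Monitoring can also be done from inside Python. The sketch below uses only the standard library (`time.perf_counter` and `tracemalloc`) to measure the run time and peak memory of one function call. The helper `profile` and the workload `sum_of_squares` are hypothetical names introduced for this example.

```python
import time
import tracemalloc


def profile(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds, peak bytes allocated)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) sizes in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing, even if func raised an exception.
        tracemalloc.stop()
    return result, elapsed, peak


def sum_of_squares(n):
    # Hypothetical CPU-bound workload used only for the demonstration.
    return sum(i * i for i in range(n))


result, elapsed, peak = profile(sum_of_squares, 100_000)
print(f"result={result}, time={elapsed:.4f}s, peak memory={peak / 1024:.1f} KiB")
```

Note that `tracemalloc` only sees memory allocated by Python itself and slows the program down while it is active, so the numbers are indicative rather than exact.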

## Terminal

Don't be afraid to use it ;) There are many excellent tutorials.
- use tab-completion !
- explore the filesystem : `cd`, `pwd`, `ls`
- copy files : `cp`
- connect to a remote host: `ssh user@host`
- copy files from/to a remote host: `scp`

## Version control

Does your project look like this ?
<img src="img/version-control.png" alt="" width="400"/>

## Version control

Use version control software such as `git` to:

 <div class="row">
  <div class="col-sm-8">
      <small>
       <ul>
       <li>back up your code on a remote host, e.g. <a href="https://github.com/">GitHub</a> or
           <a href="https://about.gitlab.com/">GitLab</a></li>
       <li>write code collaboratively: <strong>repository</strong></li>
       <li>download and modify someone else's project: <strong>clone</strong></li>
       <li>track the history of changes: <strong>commit logs</strong></li>
       <li>switch between parallel versions of your software: <strong>branches</strong></li>
       <li>easily revert to an earlier state and merge different versions</li>
       </ul>
       </small>
  </div>
  <div class="col-sm-2">
  </div>
  <div class="col-sm-2">
      <img src="img/Octocat.png" width="100"> 
      <img src="img/gitlab-logo.png" width="100">
  </div>
</div> 


There are graphical tools, e.g. [GitHub Desktop](https://desktop.github.com/), if you don't like the command line.

# Prepare data

## Get and store data

- if you are lucky, data can be fetched from an API or a database:
    - datasets for ML: [mldata](https://www.mldata.io/datasets/), [UCI repository](http://archive.ics.uci.edu/ml/datasets.php) and [kaggle](https://www.kaggle.com/datasets)
    - weather data: [openweathermap](https://openweathermap.org/api)
    - geodata: [openstreetmap](https://wiki.openstreetmap.org/wiki/Databases_and_data_access_APIs)
    - genomic data: [UCSC](https://genome-euro.ucsc.edu)
- it's a good idea to store your data on a local/server database
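
A minimal sketch of storing data in a local database, using the `sqlite3` module from the standard library. The table name `weather`, its columns and the records are hypothetical and stand in for data fetched from an API; an in-memory database keeps the example self-contained.

```python
import sqlite3

# Hypothetical records, standing in for data fetched from an API.
records = [
    ("Paris", "2020-01-01", 5.2),
    ("Paris", "2020-01-02", 6.1),
    ("Lyon", "2020-01-01", 3.8),
]

# ":memory:" keeps the demo self-contained; pass a file path to persist the data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS weather (city TEXT, day TEXT, temperature REAL)"
)
# Parameterized placeholders (?) let sqlite3 handle quoting safely.
conn.executemany("INSERT INTO weather VALUES (?, ?, ?)", records)
conn.commit()

# Query the stored data back: mean temperature per city.
rows = conn.execute(
    "SELECT city, AVG(temperature) FROM weather GROUP BY city ORDER BY city"
).fetchall()
conn.close()
print(rows)
```

Once the data sits in a database, later analysis steps can query exactly the subset they need instead of re-downloading or re-parsing raw files.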

## Tidy data

Hadley Wickham's definition:
> 1.  Each variable forms a column.
> 2.  Each observation forms a row.
> 3.  Each type of observational unit forms a table.

Reference: https://vita.had.co.nz/papers/tidy-data.pdf
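
To make the definition concrete, here is a small sketch that reshapes a "wide" table into tidy form with plain Python. The data and the helper `melt` are hypothetical and written only for illustration; in the wide table the treatment names are stored as column headers, whereas in the tidy table each (patient, treatment) observation gets its own row.

```python
# Hypothetical "wide" table: one row per patient, one column per treatment.
wide = [
    {"patient": "A", "treatment_a": 12, "treatment_b": 7},
    {"patient": "B", "treatment_a": 9, "treatment_b": 4},
]


def melt(rows, id_key, var_name, value_name):
    """Turn each non-id column of every row into its own (variable, value) row."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue
            tidy.append({id_key: row[id_key], var_name: key, value_name: value})
    return tidy


# Each observation (one patient under one treatment) now forms a row,
# and each variable (patient, treatment, result) forms a column.
tidy = melt(wide, id_key="patient", var_name="treatment", value_name="result")
for row in tidy:
    print(row)
```

In practice the same reshaping is usually done with a library; in pandas, `DataFrame.melt` takes analogous `id_vars`, `var_name` and `value_name` arguments.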

## Clean data