# Preamble

This notebook is an example of how to use Jupyter notebooks to document and illustrate applied research in machine learning (often referred to these days as _data science_).

I follow, to a large extent, the tutorials and walkthroughs in the following book, which I *highly* recommend getting a copy of:

```
Hands-On Machine Learning with Scikit-Learn and TensorFlow
Aurélien Géron
O'Reilly Media, 2017
```

The notebooks for the book [can be found here](https://github.com/ageron/handson-ml).

In [1]:
import numpy as np
import os

# To plot figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

In [2]:
PROJECT_ROOT_DIR = "../.."

# Get the Data

In [3]:
import pandas as pd

We load the training dataset directly as a `DataFrame`. Note that we read it from a gzip-compressed file; pandas infers the compression from the `.gz` extension.

In [4]:
X_data_path = os.path.join(PROJECT_ROOT_DIR, 'X_preprocessed_data.csv.gz')

In [5]:
# See Issue #6 so we do not have to transpose the dataset
df_X = pd.read_csv(X_data_path, header=None)
df_X.head()

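| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 500 | 501 | 502 | 503 | 504 | 505 | 506 | 507 | 508 | 509 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15.07 | 15.12 | 14.63 | 14.75 | 8407500.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 81.22 | 81.93 | 80.94 | 81.89 | 296853.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 78.24 | 79.07 | 78.125 | 79.07 | 4632684.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 236.64 | 238.6924 | 235.75 | 238.16 | 552207.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 89.04 | 89.48 | 88.91 | 89.16 | 554948.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |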
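
As a minimal sketch of why no explicit decompression step is needed: `pd.read_csv` takes a `compression` argument whose default, `'infer'`, selects gzip from the `.gz` suffix. The example below round-trips a small synthetic frame through a temporary file; both the data and the file name `example.csv.gz` are hypothetical stand-ins for `X_preprocessed_data.csv.gz`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (hypothetical values).
rng = np.random.default_rng(0)
small = pd.DataFrame(rng.random((4, 3)))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv.gz")

    # Write without header or index so the file holds only numeric rows.
    # to_csv also infers gzip from the .gz suffix.
    small.to_csv(path, header=False, index=False)

    # compression="infer" is the default, so the suffix selects gzip.
    inferred = pd.read_csv(path, header=None)

    # Equivalent call with the codec spelled out.
    explicit = pd.read_csv(path, header=None, compression="gzip")

print(inferred.shape)
print(inferred.equals(explicit))
```

With `header=None`, pandas labels the columns with integers starting at 0, which matches the column labels in the output above.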


In [6]:
df_X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 735000 entries, 0 to 734999
Columns: 510 entries, 0 to 509
dtypes: float64(510)
memory usage: 2.8 GB
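
The reported footprint follows from the shape and dtype: 735000 rows times 510 columns times 8 bytes per `float64` value. The sketch below redoes that arithmetic and then shows, on a small synthetic frame, how a cast to `float32` would halve the memory. The cast is only a suggestion, not something the notebook does, and it assumes reduced precision is acceptable: `float32` represents integers exactly only up to 16777216, which is below the largest value in column 4.

```python
import numpy as np
import pandas as pd

n_rows, n_cols = 735_000, 510  # shape reported by df_X.info()

# float64 uses 8 bytes per value; dividing by 2**30 gives binary gigabytes.
bytes_float64 = n_rows * n_cols * np.dtype("float64").itemsize
size_label = f"{bytes_float64 / 2**30:.1f}"
print(size_label, "GiB")

# Small synthetic stand-in to show the effect of a float32 cast.
demo = pd.DataFrame(np.zeros((1_000, n_cols)))
demo32 = demo.astype("float32")

full = int(demo.memory_usage(index=False).sum())
half = int(demo32.memory_usage(index=False).sum())
print(full, half)
```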


The `describe()` method returns summary statistics for each column, which gives us some insight into the structure of the dataset we are working with.

In [7]:
df_X.describe()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 500 | 501 | 502 | 503 | 504 | 505 | 506 | 507 | 508 | 509 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 735000.0 | 735000.0 | 735000.0 | 735000.0 | 735000.0 | 735000.0 | 735000.0 | 735000.0 | 735000.0 | 735000.0 | ... | 735000.0 | 735000.0 | 735000.0 | 735000.0 | 735000.0 | 735000.0 | 735000.0 | 735000.0 | 735000.0 | 735000.0 |
| mean | 58.352279 | 58.808949 | 57.971927 | 58.439323 | 4813515.0 | 0.002381 | 0.001905 | 0.002381 | 0.001905 | 0.001905 | ... | 0.001905 | 0.002381 | 0.002381 | 0.002381 | 0.001905 | 0.002381 | 0.002381 | 0.001905 | 0.002381 | 0.001905 |
| std | 53.207832 | 53.65959 | 52.923152 | 53.365529 | 12217880.0 | 0.048737 | 0.043602 | 0.048737 | 0.043602 | 0.043602 | ... | 0.043602 | 0.048737 | 0.048737 | 0.048737 | 0.043602 | 0.048737 | 0.048737 | 0.043602 | 0.048737 | 0.043602 |
| min | 2.58 | 2.6 | 2.58 | 2.59 | 88425.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 32.3075 | 32.6875 | 32.095 | 32.26 | 1021299.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 46.02 | 46.365 | 45.66 | 46.05 | 2059844.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 68.28 | 68.92355 | 67.549275 | 68.1525 | 4449050.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 703.0 | 708.0 | 698.9 | 705.62 | 231771600.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
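
With 510 columns, the displayed summary elides everything between columns 9 and 500. One way to work with the full summary, sketched here on synthetic data, is to transpose it so that each original column becomes a row that can be filtered like any other. The frame `demo` and its layout (two continuous columns followed by four zero/one columns) are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 2 continuous columns followed by 4 zero/one columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    np.hstack([
        rng.uniform(2.0, 700.0, size=(100, 2)),
        rng.integers(0, 2, size=(100, 4)).astype(float),
    ])
)

# One row per original column; the statistics become the columns.
summary = demo.describe().T

# Columns whose observed range is exactly [0, 1] are candidate indicators.
is_unit_range = (summary["min"] == 0.0) & (summary["max"] == 1.0)
candidates = list(summary.index[is_unit_range])

print(summary.loc[is_unit_range, ["mean", "std", "min", "max"]])
print("candidate indicator columns:", candidates)
```

Applied to `df_X`, the same filter would count how many of the 510 columns share the 0-to-1 range visible in the displayed ones.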


I am not sure what the 510 columns in the dataset represent. That number seems a bit excessive to me (or has the data already been processed into sequences for the RNN?). In any case, it would be good to have a notebook like this one

```
https://github.com/ageron/handson-ml/blob/master/02_end_to_end_machine_learning_project.ipynb
```

so we can study the data before applying any algorithms to it.
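
As a possible first step in such a study, here is a minimal sketch of one check that might help answer the question about the 510 columns. It assumes, without confirmation from the source, that the first five columns are continuous features and that the remaining 505 form a one-hot encoding. The summary statistics (minimum 0, maximum 1, means near 0.002) are compatible with that reading but do not prove it. The helper `looks_one_hot` and the synthetic frame are hypothetical.

```python
import numpy as np
import pandas as pd


def looks_one_hot(block: pd.DataFrame) -> bool:
    """Return True if every value is 0 or 1 and each row sums to exactly 1."""
    values = block.to_numpy()
    only_zero_one = np.isin(values, (0.0, 1.0)).all()
    one_per_row = (values.sum(axis=1) == 1.0).all()
    return bool(only_zero_one and one_per_row)


# Synthetic stand-in: 5 continuous columns plus a 7-way one-hot block.
rng = np.random.default_rng(0)
n_rows, n_continuous, n_categories = 50, 5, 7
codes = rng.integers(0, n_categories, size=n_rows)
demo = pd.DataFrame(
    np.hstack([
        rng.uniform(2.0, 700.0, size=(n_rows, n_continuous)),
        np.eye(n_categories)[codes],
    ])
)

indicator_ok = looks_one_hot(demo.iloc[:, n_continuous:])
continuous_ok = looks_one_hot(demo.iloc[:, :n_continuous])
print(indicator_ok, continuous_ok)

# If the block is one-hot, idxmax recovers a single category code per row.
recovered = demo.iloc[:, n_continuous:].idxmax(axis=1) - n_continuous
```

On the real data the call would be `looks_one_hot(df_X.iloc[:, 5:])`, where the split point 5 is itself an assumption. A `False` result would point toward another explanation, such as pre-built sequences.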