# Python Primer: starting ML

## <div style="color: #db366d"> Day 1.4 </div>

Python has become the de facto language for ML. Fluency in Python is almost expected of anyone who claims to be working on ML.

We have seen how to work with data. Once we're happy with how our data looks, we can now proceed to do the "interesting" stuff.

As with data processing, the ML community has provided Python with some very powerful open-source libraries, for example:

<img src="../images/sklearn-logo.png" width="300px" style="float: left; padding: 5px 0 10px 0" />

<p style="padding: 20px 0 0 0">
scikit-learn is an active community project that provides "off-the-shelf" ML tools for classical techniques like Support Vector Machines (SVMs), random forests, k-Means, etc.
</p>

<br style="clear:both" />

<img src="../images/tensorflow-logo.png" width="300px" style="float: left; padding: 5px 0 10px 0" />

<p style="padding: 30px 0 0 0">
TensorFlow is Google's open-source library for low-level ML algorithm design. It is generally aimed at more advanced ML programmers who want to build their own algorithms.
</p>

<br style="clear:both" />

<img src="../images/keras-logo.png" width="300px" style="float: left; padding: 5px 0 10px 0" />

<p style="padding: 15px 0 0 0">
Keras is another active community project that focuses on deep learning (neural networks). It can be configured to use various low-level ML APIs, but it is most frequently paired with TensorFlow as the underlying engine.
</p>

<br style="clear:both" />

<p style="padding: 15px 0 0 0">
We will use scikit-learn and Keras to help us understand the use of various key ML techniques.
</p>

# Case 1: Learning the Boston housing dataset
Here we will introduce the basic steps needed to apply ML to a data-centric problem. Assuming you already have data to draw from, there are 5 steps:

1. Load & prep data
2. Split data into Train & Test data
3. Fit the chosen ML model to the Train data
4. Find predictions from the Test data
5. Analyze prediction accuracy

(improve, rinse and repeat)

### Step 1. Load & prep data
We now have the necessary foundation to load and prep all sorts of data (we may still need some help from Google of course).

In this case study exercise, we will work with the Boston Housing Dataset from sklearn.

Let's start by loading the raw dataset...

In [5]:
# import the sample dataset lib from sklearn
from sklearn import datasets

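# load data directly from sklearn sample datasets
# dataset is an sklearn custom type
# note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release
dataset = datasets.load_boston()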
print('The datasets from sklearn are formatted as the type', type(dataset))

# let's see what this object gives us
print('\nThe info I can get from this Bunch object are:',dataset.keys())

The datasets from sklearn are formatted as the type <class 'sklearn.utils.Bunch'>

The info I can get from this Bunch object are: dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
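
A Bunch behaves like a dictionary whose keys can also be read as attributes. The sketch below builds a tiny hand-made Bunch (the name `toy` and all of its values are made up for illustration) so this can be tried without loading any dataset.

```python
from sklearn.utils import Bunch

# a tiny hand-made Bunch; the field names mirror some of the keys printed above
toy = Bunch(data=[[1.0, 2.0], [3.0, 4.0]],
            target=[10.0, 20.0],
            feature_names=['a', 'b'])

# dictionary-style and attribute-style access return the same object
print(toy['data'] is toy.data)

# the available fields can be listed like dictionary keys
print(list(toy.keys()))
```

The same two access styles should work on the `dataset` object loaded above, e.g. `dataset['data']` and `dataset.data`.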


#### EXERCISE: Explore the data by completing the code below.

In [23]:
# TODO: print the number of (rows, cols) in the input data matrix (the X variable)

# TODO: print the number of rows in the output targets (the y variable)

# TODO: print the feature names

# TODO: print the data DESCRiption to see what each feature name means

Now we convert the Bunch type to a pandas DataFrame type, so that it is much easier to work with.

In [26]:
import pandas as pd

# construct the df and use the feature_names as the col names
df = pd.DataFrame(dataset.data, columns = dataset.feature_names)

# add a last col to the df as the output
df['PRICE'] = dataset.target

# see if everything looks correct
df.head()

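|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |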
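
Beyond `head()`, a few other pandas calls help check that a freshly built DataFrame looks right. The sketch below uses a made-up three-row dataset (`toy_data`, `toy_names` and `toy_target` are hypothetical stand-ins for `dataset.data`, `dataset.feature_names` and `dataset.target`) and repeats the same construction steps.

```python
import numpy as np
import pandas as pd

# toy stand-ins for dataset.data, dataset.feature_names and dataset.target
toy_data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_names = ['F1', 'F2']
toy_target = np.array([5.0, 7.0, 9.0])

# same construction as above: features first, then the output as a last col
toy_df = pd.DataFrame(toy_data, columns=toy_names)
toy_df['PRICE'] = toy_target

print(toy_df.shape)          # (rows, cols), including the PRICE column
print(toy_df.isna().sum())   # number of missing values in each column
print(toy_df.describe())     # summary statistics for each column
```

The same three calls can be run on `df` itself to look for missing values or suspicious ranges before moving on.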


### Step 2. Split data into Train & Test data
Since our data is finite, we need to (1) reserve a large portion for training the chosen ML model and (2) keep a small portion for testing how well the model works. A 75% / 25% split between train and test samples is a common rule of thumb.
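
To see what such a split does, here is a minimal sketch on made-up data. The arrays `X_toy` and `y_toy` and the choice `random_state=0` are assumptions for illustration only; they are not part of the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data: 8 samples with 2 features each, and one target per sample
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.arange(8)

# test_size=0.25 keeps 25% of the rows for testing;
# random_state makes the random shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0)

print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```

With 8 samples, 6 rows go to training and 2 to testing. Each input row stays paired with its own target, because both arrays are shuffled together.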

#### EXERCISE: complete the following code to split our df from above into training and testing datasets. We want to achieve a 75% / 25% split as mentioned above.

In [29]:
# import the necessary function to split data
from sklearn.model_selection import train_test_split

# TODO: use train_test_split function to obtain 4 datasets:
# - X_train : input training data
# - X_test  : input test data
# - y_train : output training data
# - y_test  : output test data
# Q: why is X in caps and y in small letters?

# TODO: print the shapes of all the outputs to check we have done it correctly

### Step 3. Fit the chosen ML model to the Train data
Here we'll train the ML model, a.k.a. model fitting. For simplicity, let's choose a linear regression model.
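
As a rough illustration of what fitting means for linear regression, the sketch below trains a model on made-up, noise-free points from the line y = 2x + 1. The names `X_line`, `y_line` and `line_model` are hypothetical and unrelated to the housing data.

```python
import numpy as np
from sklearn import linear_model

# toy data generated from y = 2*x + 1, with no noise
X_line = np.array([[0.0], [1.0], [2.0], [3.0]])
y_line = 2.0 * X_line[:, 0] + 1.0

# create the model object, then fit it to the inputs and outputs
line_model = linear_model.LinearRegression()
line_model.fit(X_line, y_line)

# the learned slope and intercept should be close to 2 and 1
print(line_model.coef_, line_model.intercept_)
```

Fitting here just means finding the slope(s) and intercept that best match the training data. With real, noisy data the recovered values will not be this clean.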

#### EXERCISE: complete the following code to initialize a pre-made linear regression model and train it with our datasets

In [34]:
# import the pre-made model structures from sklearn
from sklearn import linear_model

# TODO: create a LinearRegression model object for us to work with
# hint: use linear_model.___

# TODO: train the model by fitting it onto X_train & y_train

### Step 4. Find predictions from the Test data
With the trained model, we can now make predictions. We will first try it on the items from the test portion of our data.
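
To make the prediction step less of a black box, the sketch below checks that, for a linear regression model, `predict` gives the same result as a weighted sum of the inputs plus an intercept. All data and names here (`X_fit`, `y_fit`, `X_new`, `plane_model`) are made up for illustration.

```python
import numpy as np
from sklearn import linear_model

# fit on noise-free toy data generated from y = 3*x1 - x2 + 4
X_fit = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 3.0]])
y_fit = 3.0 * X_fit[:, 0] - X_fit[:, 1] + 4.0
plane_model = linear_model.LinearRegression().fit(X_fit, y_fit)

# two unseen rows; predict returns one value per input row
X_new = np.array([[1.0, 1.0], [4.0, 2.0]])
y_new = plane_model.predict(X_new)

# for a linear model this matches the weighted sum plus the intercept
by_hand = X_new @ plane_model.coef_ + plane_model.intercept_
print(y_new, by_hand)
```

Both arrays should agree, and each has one entry per row of `X_new`.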

#### EXERCISE: complete the following code to use the model to make predictions

In [None]:
# TODO: use the trained model to predict outputs from X_test
# hint: it is just a one liner with one function call

### Step 5. Analyze prediction accuracy
The last step is to analyze how well the model has performed.
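
One common measure for regression is the root mean squared error (RMSE): the square root of the average squared difference between expected and predicted values. The sketch below computes it on three made-up numbers (`y_true` and `y_pred` are hypothetical, not results from the housing model), once with sklearn and once directly with numpy.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# made-up expected and predicted values
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# the errors are 1, 0 and -2, so the mean squared error is (1 + 0 + 4) / 3
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)

# the same quantity computed directly with numpy
rmse_by_hand = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse, rmse_by_hand)
```

RMSE is in the same units as the target, so a lower value means predictions that are closer to the expected values on average.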

#### EXERCISE: complete the code below to calculate the root mean squared error (RMSE) of your predictions
#### STRETCH GOAL: plot the predicted values against the expected test values to view the accuracy

In [37]:
# import the necessary modules and functions
import numpy as np
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt # for plotting only

# TODO: calculate the RMSE
# hint: numpy has the square root function

# TODO: plot a scatter to visually compare test vs predicted outputs