# 1. Intro to Machine Learning

### Objective

* Use the pandas-profiling package to do EDA
* Know that you cannot do machine learning in scikit-learn with missing data
* Understand basic strategies for filling missing data
* Fill in missing data
* Extract data into NumPy arrays
* Learn the three-step process for doing machine learning in Scikit-Learn
    1. Import the model
    2. Instantiate the model
    3. Train the model
* Input data must be 2D array
* Make predictions
* Measure performance by calculating accuracy

## Typical Workflow for Beginners
* Find dataset
    * [Kaggle Datasets][1]
    * [data.world][2]
    * [data.gov][3]
    * [UCI Machine Learning Repository][10]
* Read data into Pandas
* Clean data
* Exploratory data analysis with basic statistics and visualizations
* Define Problem
* Train and Evaluate model with Scikit-Learn

### Resources

* [Hands-On Machine Learning with Scikit-Learn and TensorFlow][6], very popular book
* [An Introduction to Statistical Learning][8] by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
* My Solutions to [Introduction to Statistical Learning][11] using Python
* Full college class on [Applied Machine Learning][7] by Andreas Mueller, core contributor to Scikit-Learn 
* Tutorial in [Jupyter Notebooks][5] from Andreas Mueller
* My article on the [new workflow from Pandas to Scikit-Learn][9]

[1]: https://www.kaggle.com/datasets
[2]: https://data.world/
[3]: https://www.data.gov/
[5]: https://github.com/amueller/scipy-2016-sklearn
[6]: https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291
[7]: http://www.cs.columbia.edu/~amueller/comsw4995s18/schedule/
[8]: http://www-bcf.usc.edu/~gareth/ISL/data.html
[9]: https://medium.com/dunder-data/from-pandas-to-scikit-learn-a-new-exciting-workflow-e88e2271ef62
[10]: https://archive.ics.uci.edu/ml/index.php
[11]: https://github.com/tdpetrou/Machine-Learning-Books-With-Python

# Heart Disease Dataset

We will use the [heart disease][1] dataset from the ISLR book. Let's read it in and take a peek at the data.

[1]: https://archive.ics.uci.edu/ml/datasets/heart+Disease

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.options.display.max_colwidth = 200
%matplotlib inline

In [None]:
heart = pd.read_csv('../data/heart.csv')
heart.head()

### Understand the columns with help from the data dictionary
Always find or create a data dictionary when beginning a project. In this instance, it is provided for you.

In [None]:
dd = pd.read_csv('../data/heart_data_dictionary.csv')
dd

### Examine the data types and ensure they match the data dictionary

In [None]:
heart.dtypes

In [None]:
heart.shape

# Skipping EDA
Typically, you should complete a thorough exploratory data analysis before commencing machine learning. For these introductory notebooks, we will skip this step and jump right into machine learning. This allows us to focus on the mechanics of machine learning.

### Use Pandas Profiling instead
As a quick substitute for manual EDA, we can use the [pandas-profiling package][1]. Install it from the command line with `conda install pandas-profiling`. It has only one major object, `ProfileReport`. Simply pass the DataFrame to it and it will provide you with lots of basic descriptions of the dataset.

[1]: https://github.com/pandas-profiling/pandas-profiling

In [None]:
# Currently there are a lot of warnings that pandas-profiling emits when imported.
# Ignore them with the following

import warnings
warnings.filterwarnings('ignore')

import pandas_profiling as pf
pf.ProfileReport(heart)

## Identify the type of machine learning problem - Supervised or Unsupervised
Before beginning, we need to identify the type of machine learning problem we have. In this problem, we are predicting whether or not someone has heart disease and therefore are doing supervised learning.


## Identifying the target variable
We are interested in predicting whether someone has heart disease or not, which is the `disease` column in our DataFrame.


## Classification or Regression
The data dictionary shows two classes of disease, 0 and 1, corresponding to no and yes. Thus, we have a **classification** problem. From the profile report, we see that 46% of the observations have heart disease and 54% do not.

# Minimum data preparation
It is common to do lots of data preparation, but in this notebook, we only do the minimum necessary to enable scikit-learn models to work for us. 

### Check for missing values
Most scikit-learn estimators, including the one used in this notebook, raise an error if the data contains any missing values. Let's check for them now.

In [None]:
heart.isna().sum()

## Must fill missing values
We have a couple of columns that are missing values. The simplest thing we can do to resolve this issue is to drop the rows (or columns) containing the missing values. Calling `heart.dropna()` will drop every row with a missing value in it. If we want to keep those rows, we need to fill the missing entries with some value instead.
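
To make the cost of dropping concrete, here is a small sketch on a made-up DataFrame. The `toy` data and its values are invented for illustration and are not taken from the heart dataset. It compares dropping rows against dropping columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the heart dataset
toy = pd.DataFrame({
    'age': [63, 67, 41, 56],
    'ca': [0.0, np.nan, 2.0, 1.0],
    'thal': ['normal', 'fixed', None, 'normal'],
})

rows_dropped = toy.dropna()        # drop every row with at least one missing value
cols_dropped = toy.dropna(axis=1)  # drop every column with at least one missing value

print(rows_dropped.shape)  # (2, 3): half of the rows are gone
print(cols_dropped.shape)  # (4, 1): only the fully observed column survives
```

Either choice throws away observed data along with the missing entries, which is why filling is often preferred on small datasets.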

### Strategies for filling missing values
There are a number of strategies that have been developed to fill in missing values. This notebook focuses on the simplest strategies. The strategy used to fill in a missing value depends on the type of data in the column.

#### Filling in missing values for categorical columns
A common strategy is to use the **most frequent** value for categorical columns. 

A different strategy is to randomly select one of the non-missing values in the column. This preserves the distribution of the values in that column and is sometimes called **hot-deck** imputation.
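
The two strategies can be sketched on a small made-up column. The Series `s`, its values, and the seed are assumptions chosen for illustration. The hot-deck part shown here is the simplest variant (uniform random draws from the observed values), not a definitive implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries
s = pd.Series(['a', 'b', 'a', None, 'c', None, 'a'])

# Strategy 1: fill with the most frequent value
most_frequent = s.mode()[0]
filled_mode = s.fillna(most_frequent)

# Strategy 2: simple hot-deck, draw replacements at random from the observed values
rng = np.random.default_rng(0)   # seeded so the draw is repeatable
observed = s.dropna().to_numpy()
missing = s.isna()
filled_hot = s.copy()
filled_hot[missing] = rng.choice(observed, size=missing.sum())

print(filled_mode.value_counts())
print(filled_hot.value_counts())
```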

#### Issues with these strategies
Filling missing values with the most frequent value for that column might bias our results significantly. If the most frequent value is only slightly more frequent than the second most frequent value, then this value can be significantly overrepresented.

#### Filling in missing values for continuous columns
Continuous data allows for other strategies, the simplest being to fill with the mean or median of the column. Hot-deck imputation can work as well.
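
As a rough illustration of how the mean and median can differ, the sketch below uses a made-up continuous column with one extreme value. The name `chol` and the numbers are invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical continuous column with one outlier and one missing value
chol = pd.Series([200.0, 210.0, 190.0, np.nan, 600.0])

mean_fill = chol.mean()      # 300.0, pulled upward by the outlier
median_fill = chol.median()  # 205.0, much less affected by the outlier

filled = chol.fillna(median_fill)
print(filled.tolist())
```

When a column is skewed or has outliers, the median is often the safer of the two choices.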

#### More advanced strategies
There are many more advanced strategies that have been developed with most of them relying on using machine learning to fill in the missing values. These will be discussed in a different notebook.

### Filling the missing values in the heart dataset - know the type of variable 
In this dataset, the column `ca` has numeric values but is actually categorical (it represents the number of major vessels colored by fluoroscopy). It would not make any sense to use the mean to fill in the value here. Let's use `value_counts` to find the most common number.

In [None]:
vc_ca = heart['ca'].value_counts()
vc_ca

In [None]:
ca_fill = vc_ca.index[0]
ca_fill

### Do the same for `thal`
The `thal` column is also categorical, so we can again compute the most frequent value.

In [None]:
vc_thal = heart['thal'].value_counts()
vc_thal

In [None]:
thal_fill = vc_thal.index[0]
thal_fill

### Fill the missing values with `fillna`
Pass a dictionary to the `fillna` method mapping the column name to the value you would like to fill it with.

In [None]:
heart = heart.fillna({'ca': ca_fill, 'thal': thal_fill})
heart.head()

## Verify there are no missing values

In [None]:
heart.isna().sum()

### Clean-up: Change data type of `ca` to int
Because there was missing data in the `ca` column, its data type was float. We can now change it to an int.

In [None]:
# current data type
heart['ca'].dtype

In [None]:
# convert to int and check
heart['ca'] = heart['ca'].astype('int')
heart['ca'].dtype

# Extract data into NumPy arrays
Scikit-learn was built to integrate directly with NumPy and has traditionally (until version 0.20) had weak integration with Pandas. 

For now, all data will be taken out of Pandas DataFrames and put into NumPy arrays. By convention (as done in the Scikit-Learn documentation), use **`X`** and **`y`** as Python variable names for the arrays. Use the **`values`** DataFrame/Series attribute to retrieve the underlying NumPy arrays.
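
A quick sketch of what `values` returns, using a made-up three-row DataFrame (the `toy` data is an assumption for illustration). It also shows `to_numpy()`, the method that newer pandas documentation recommends over `values`.

```python
import pandas as pd

# Hypothetical toy data, NOT the heart dataset
toy = pd.DataFrame({'max_hr': [150, 108, 129], 'disease': [0, 1, 1]})

from_series = toy['max_hr'].values     # a Series gives a 1D array
from_frame = toy[['max_hr']].values    # a one-column DataFrame gives a 2D array
same_data = toy['max_hr'].to_numpy()   # equivalent to .values for this column

print(type(from_series), from_series.shape)
print(type(from_frame), from_frame.shape)
```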

## Begin by using a single feature
It is possible to do machine learning with every single feature in the model, but when first beginning, it is good to keep things simple and use a single feature. 

## Must use a numeric column!
You cannot do machine learning directly in Scikit-Learn with string columns. You can only use numeric columns without any missing values. In order to use string columns, you must encode the strings as numeric values (more on this in later notebooks).

## Use `max_hr` column as feature
We pick `max_hr`, which, according to the profile report, was the only variable negatively correlated with heart disease.

### Extract data into NumPy arrays with the `values` attribute
We extract the data from Pandas to a NumPy array:

In [None]:
X = heart['max_hr'].values
y = heart['disease'].values

### Verify we are in NumPy and output a few of the values

In [None]:
type(X)

In [None]:
type(y)

In [None]:
X[:5]

In [None]:
y[:5]

# Ready for machine learning in 3 steps
All machine learning models in scikit-learn use the same three-step process to train.

1. Import the model
2. Instantiate the model
3. Train the model

## Step 1: Import the model from Scikit-Learn
The scikit-learn library is structured differently than Pandas. It keeps all of its functionality tucked away in separate modules. By convention, we directly import the object we want by referencing the module where it is located. In this case, we will import one of the simplest classification models - Logistic Regression.

In [None]:
# step 1. Import the model
from sklearn.linear_model import LogisticRegression

#### Wait, why is the word regression in the name? Isn't this classification?
Unfortunately, the name "Logistic Regression" is very confusing. Despite having the word "regression" in the name, it is used for classification and not regression. The name comes from what the model computes internally: a continuous value that is always between 0 and 1 and is interpreted as the **probability** that an observation belongs to a given class. The predicted class is the one with the higher probability.
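
To show where that probability comes from, here is a minimal sketch of the logistic (sigmoid) function applied to a single feature. The `intercept` and `slope` values are invented for illustration and are not the coefficients scikit-learn would learn from the heart data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for a single-feature model
intercept, slope = 6.0, -0.04
max_hr = np.array([100, 160, 200])

prob = sigmoid(intercept + slope * max_hr)  # probability of class 1
label = (prob >= 0.5).astype(int)           # threshold at 0.5 to get a class

print(prob.round(3))
print(label)
```

The continuous output is the "regression" part; thresholding it is what turns the model into a classifier.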

## Step 2: Instantiate the model (estimator)
In step 1, when we import a model, we have been handed a blueprint. It is not built and not ready to use. We must instantiate it (create an instance of it) in order to actually use it. Scikit-Learn uses the term **estimator** to refer to each model. You can also use the phrase **instantiate the estimator**.

### "Constructing our machine learning vehicle"
An additional phrase I like to use for this step is **constructing our machine learning vehicle** which really emphasizes what is happening. Below, the variable `logr` is being assigned the result of our machine learning vehicle construction. It is our physical object that will do the machine learning and predicting.

In [None]:
logr = LogisticRegression()

### Instantiating with default values
Go back into the last code cell and press **shift + tab + tab** with your cursor inside the parentheses. Notice all the parameters and their default values. When we constructed our model, we used these default values to build it. All these parameters are called **hyperparameters** in machine learning. These are 'specifications' to which our model was built. We can change them or **tune** them to construct our model in a different way.
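
The hyperparameters can also be inspected in code. The sketch below uses `get_params`, which every scikit-learn estimator provides. The values `C=0.1` and `max_iter=500` are arbitrary choices for illustration, not recommendations.

```python
from sklearn.linear_model import LogisticRegression

default_model = LogisticRegression()
params = default_model.get_params()  # dict mapping hyperparameter name to value

print(params['C'], params['max_iter'])

# Build a second model with two hyperparameters changed from their defaults
tuned_model = LogisticRegression(C=0.1, max_iter=500)
print(tuned_model.get_params()['C'], tuned_model.get_params()['max_iter'])
```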

# Step 3: Train the model
To train the model, we must give it some data. Our data is stored in the `X` and `y` NumPy arrays. All scikit-learn estimators use the **`fit`** method to train the model.

In [None]:
logr.fit(X, y)

# A very annoying Gotcha!
The call to `fit` above raises a `ValueError`. Scikit-Learn forces you to use a 2-dimensional array for your input values, and when we selected one column above as our **`X`** array, it was a single dimension. Verify this with the `ndim` attribute.

In [None]:
X.ndim

In [None]:
X.shape

## Use the help from the error message
The error message gives us explicit advice on how to transform our input data. We need to call the `reshape` method. It will transform the data from a single dimensional array with 303 values into a two dimensional array with 303 rows and 1 column.

In [None]:
X = X.reshape(-1, 1)
X.ndim

In [None]:
X.shape

In [None]:
X[:5]
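
To see what `reshape(-1, 1)` does in isolation, here is a sketch on a small made-up array. The `-1` tells NumPy to infer that dimension from the array's length.

```python
import numpy as np

a = np.array([10, 20, 30, 40])

col = a.reshape(-1, 1)        # -1 lets NumPy infer the number of rows (4 here)
also_col = a[:, np.newaxis]   # an equivalent way to add the second dimension
flat = col.ravel()            # back to one dimension

print(a.shape, col.shape, flat.shape)  # (4,) (4, 1) (4,)
```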

# Step 3 (again): Train the model
Now that we have two-dimensional data for our input, scikit-learn will no longer complain.

In [None]:
logr.fit(X, y)

### Our model is trained. What does that mean?
Every machine learning model has an objective that training tries to optimize. Logistic regression chooses its parameters to minimize a loss function called the log loss, which penalizes confident wrong predictions (scikit-learn also adds a regularization penalty by default). A low log loss usually goes along with high accuracy, but accuracy itself is not what is being optimized, and 100% accuracy is rarely achievable.

### Training is an iterative process
During training, scikit-learn repeatedly adjusts the values of the model parameters in order to reduce the loss. This is an **iterative process**: a numerical optimization routine, called a solver, determines how to change the parameters during each iteration.

Since the solver rarely lands exactly on the minimum, there is a stopping criterion that gets triggered when an iteration no longer makes progress beyond a small tolerance (the `tol` hyperparameter) or when the maximum number of iterations (`max_iter`) is reached. This iterative process takes place during the execution of the `fit` method.
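
The iteration controls are exposed on the estimator. A minimal sketch on made-up data (the arrays `X_toy` and `y_toy` and the seed are assumptions for illustration): `max_iter` caps the number of iterations, `tol` sets the stopping tolerance, and the fitted attribute `n_iter_` reports how many iterations the solver actually used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data with overlapping classes, NOT the heart dataset
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 1))
y_toy = (X_toy[:, 0] + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression(max_iter=1000, tol=1e-4)
model.fit(X_toy, y_toy)

print(model.n_iter_)               # iterations used before the solver stopped
print(model.coef_, model.intercept_)  # the learned parameters
```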

## Make a prediction
All supervised learning estimators in scikit-learn have a **`predict`** method. Let's use it to make some predictions about heart disease.

### Use a `max_hr` value that is within the range found in the dataset
It only makes sense to predict with values that lie within the range we have seen in the dataset. Let's find the minimum and maximum value in the dataset.

In [None]:
X.min(), X.max()

### Remember the gotcha
We have to pass the `predict` method a 2D array, just like we did when we trained it.

In [None]:
a = np.array([100, 150, 200])
a

### Use the same trick to reshape a 1D array

In [None]:
a = a.reshape(-1, 1)
a

### Make the prediction
Call the `predict` method with the NumPy array to make the prediction.

In [None]:
logr.predict(a)

## Interpretation of prediction
The returned array holds the classes that our model predicts. In this case it predicts heart disease for the first input (a `max_hr` of 100), and no heart disease for the last two (`max_hr` of 150 and 200).

## Make a prediction for all inputs
We can make predictions on all of the inputs by passing the `predict` method our original input array.

In [None]:
logr.predict(X)

## Measure performance by calculating the accuracy
That's great that we have a prediction, but we need to measure its performance. To do so, we must have the true outcome of each patient. Using the above data, the outcome is stored in the `y` variable. The simplest way to measure performance for a classification problem is to calculate the **accuracy** - which is defined as the percentage of the predictions that are correct. 

In this case, we are correct when the model predicts 0 when the true outcome is 0 or when the model predicts 1 and the true outcome is 1.
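
The definition can be written out directly in NumPy. The two arrays below are made up for illustration and are not predictions from the heart model.

```python
import numpy as np

# Hypothetical true outcomes and predictions
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])

correct = y_pred == y_true   # True wherever the prediction matches the outcome
accuracy = correct.mean()    # fraction of True values

print(correct)
print(accuracy)  # 0.6, i.e. 3 of 5 predictions are correct
```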

## Use the `score` method to calculate accuracy
All supervised learning estimators have a `score` method. For classifiers such as logistic regression, the `score` method makes a prediction and calculates the accuracy. Pass it the input data and the expected output. We achieve 67% accuracy.

In [None]:
logr.score(X, y)
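
As a sanity check that `score` is a shortcut for predicting and then computing accuracy, here is a sketch on made-up data. The arrays `X_toy` and `y_toy` are assumptions for illustration; `accuracy_score` comes from `sklearn.metrics`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical toy data, NOT the heart dataset
X_toy = np.array([[90], [110], [130], [150], [170], [190]])
y_toy = np.array([1, 1, 0, 1, 0, 0])

model = LogisticRegression().fit(X_toy, y_toy)
predictions = model.predict(X_toy)

via_score = model.score(X_toy, y_toy)             # predict + accuracy in one call
via_manual = (predictions == y_toy).mean()        # accuracy by hand
via_metric = accuracy_score(y_toy, predictions)   # accuracy from sklearn.metrics

print(via_score, via_manual, via_metric)
```

All three numbers agree, so which one to use is a matter of convenience.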

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">Select a different variable besides `max_hr` and repeat the three step process to train a single-feature logistic regression model. Keep trying other numeric columns. Can you beat 67% accuracy? Can you define a function that automates this process?  Have the function accept the string name of the column to train on and return the accuracy.</span>