# Machine Learning Crash Course

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed for any analysis needed in the notebook.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns # for pretty visualizations
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

# Day 1

## What is ML, DS, and AI?
>  "When you’ve written the same code 3 times, write a function. When you’ve given the same in-person advice 3 times, write a blog post" -@drob

At a high level the difference between data science (DS), machine learning (ML), and artificial intelligence (AI) can be explained in the following 3 points:
* DS produces **insights**
* ML produces **predictions**
* AI produces **actions**

Of course the fields overlap heavily, and each successive step relies on the previous one in many ways, but these are the main differentiators when comparing and relating the three fields.




## ML General Process (High Level)
1. Exploratory Analysis
    * First, "get to know" the data. This step should be quick, efficient, and decisive.

2. Data Cleaning
    * Then, clean your data to avoid many common pitfalls. Better data beats fancier algorithms. This can include transforming your data for easier use and enhanced performance in algorithms.

3. Feature Engineering
    * Next, help your algorithms "focus" on what's important by creating new features. This step matters a great deal: combining existing features in interesting ways can surface insights, account for confounding variables, and set your solution apart from others.

4. Algorithm Selection
    * Choose the best, most appropriate algorithms without wasting your time. This step can also include ensembling methods and bagging.

5. Model Training
    * Finally, train your models. This step is pretty formulaic once you've done the first 4.

# Day 2

## Exploratory Analysis
You are a commander with scarce resources.
Exploratory analysis is sending your scouts and spies ahead to learn where best to deploy your forces.

Exploratory analysis is the essential step of looking at what data we have and really understanding it.
* You’ll gain valuable hints for Data Cleaning (which can make or break your models).
* You’ll think of ideas for Feature Engineering (which can take your models from good to great).
* You’ll get a "feel" for the dataset, which will help you communicate results and deliver greater impact.

## Basic Information
1. How many observations (rows) do you have?
2. How many features (columns) do you have?
3. What are the data types? Numerical, categorical, date times?
4. Do you have a target variable (what you are trying to predict)?
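
The questions above map onto a handful of pandas attributes. Here is a minimal sketch on a tiny made-up DataFrame (the column names and values are illustrative assumptions, not the beer dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset
toy = pd.DataFrame({
    "name": ["Pale Ale", "Stout", "Cider"],
    "abv": [0.05, 0.07, 0.045],
    "ounces": [12, 16, 12],
})

n_rows, n_cols = toy.shape   # observations (rows) and features (columns)
print(n_rows, n_cols)
print(toy.dtypes)            # data type of each column

# Names of the numerical columns only
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

Whether a target variable exists is something only you can answer from the problem statement; the code can only tell you what columns are available.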

## Let's Display Some Observations
We can display some actual observations (rows) in our dataset just to get a feel for what the data looks like.
The common methods used to do so are:




In [None]:
# First,
# Let us import our dataset into a pandas dataframe (like an Excel table with rows and columns)
df = pd.read_csv('../input/beers.csv')

# Use the variable df that holds the object containing your data
df.head(5) # will print out the first 5 rows
df.tail(5) # will print out the 5 last rows

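* Notice that only the output of tail(5) is shown for the previous cell (a block of code in this notebook). A notebook cell displays only the value of its last expression, so the result of head() is not shown. Wrap each call in print() if you want to see both.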

We can also use Pandas **slicing** functionality:

In [None]:
df[10:20] # uses pandas slicing to get a specified number of observations

## Obtaining subsets of data in Pandas
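* Label based = loc (select rows and columns by their index and column labels)
* Position based = iloc (select rows and columns by their integer position)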
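
A small sketch of the difference, using a made-up DataFrame with a non-default index so that labels and positions are easy to tell apart (the data is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with string labels as the index
toy = pd.DataFrame({"abv": [0.05, 0.07, 0.045]}, index=["a", "b", "c"])

by_label = toy.loc["b", "abv"]    # row labelled "b", column labelled "abv"
by_position = toy.iloc[1, 0]      # second row, first column

label_slice = toy.loc["a":"b"]    # label slices include the end label
position_slice = toy.iloc[0:2]    # position slices exclude the end position
print(by_label, by_position, len(label_slice), len(position_slice))
```

Both lookups reach the same cell, and both slices return two rows, but for different reasons: loc includes the end label while iloc stops before the end position.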

In [None]:
df.info() # allows us to see information about our variables and their data types

This is not the stage where we are doing intense analysis. That is for later.
We are just getting a feel for the data right now.
* Do the columns make sense?
* What types of values are we seeing in the various columns?
* Are there a lot of missing values we will need to address?
* Are the values on the right scale, or will we need to normalize the data so that features are on a similar scale?


## Distributions
Next we can look at the distributions of our different features.
Typically we can look at the histograms to see what is going on such as:
* Unexpected distributions
* Outliers that may affect our analysis (some may not make sense and could be candidates of data entry error for example)
* Features that are binary
* Boundaries that are not clear or are illogical
* Measurement errors

Start making notes about anything that sticks out and dig deeper into those potential problems. This will come in handy in the Data Cleansing phase of the project.

In [None]:
sns.set(color_codes=True) # setup seaborn for visualizations
sns.histplot(df['abv'].dropna(), kde=True) # histogram of the 'abv' feature; histplot replaces distplot, which is deprecated in newer seaborn releases
plt.show() # used to show the histogram in the output of the notebook

In [None]:
#scatterplot
sns.set()
df.columns
cols = ['abv', 'ibu']
sns.pairplot(df[cols].dropna(), height=3.5) # 'size' was renamed to 'height' in newer seaborn releases
                                            # use dropna() (or fillna() with the mean or median, depending on outliers) so sns can plot without errors
plt.show()

## Distribution of Categorical Features
A **class** is a unique value for a feature.
To see the distribution of categorical features, we can create a bar chart of the unique classes within a feature.
For instance, we can do a count of each 'style' of beer in this example data set (Cider, Belgian IPA, etc).

We want to look for sparse classes (very low counts compared to the other classes). Sparse classes can cause *overfitting*, or they may not affect the model much at all.

We should take note of sparse classes in this step, since they could lend us a hand in the feature engineering step later on.
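
Counting classes is a one-liner in pandas. The sketch below uses a handful of made-up style labels, and the cutoff for "sparse" is an arbitrary illustrative value:

```python
import pandas as pd

# Hypothetical style labels; a real dataset would have many more classes
styles = pd.Series(["IPA", "IPA", "IPA", "Stout", "Stout", "Cider"], name="style")

counts = styles.value_counts()   # observations per class, largest first
min_count = 2                    # illustrative threshold, not a rule

# Classes with fewer observations than the threshold
sparse = counts[counts < min_count].index.tolist()
print(counts)
print(sparse)
```

Calling counts.plot(kind="bar") would turn the same counts into the bar chart described above.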

## Segmentations
Box plots allow us to see the interaction between categorical values and numerical values.


In [None]:
segment_ex = sns.boxplot(x="style", y="abv", data=df)

Whoa, that is a lot of different beer styles. The plot is too noisy to make a useful segmentation box plot. We could filter or slice on the styles with the highest ABV values to make it cleaner (which is also handy if you are looking to purchase beer that will make the party more fun).

For experimentation, though, let us just look at 'abv' on its own.

In [None]:
sns.boxplot(x = df['abv'])

We can see the quartiles and median bar for the Abv percentages for each beer in our dataset. It looks like the median Abv% is around 5.7% or so (visually).

## Correlations
First things first. You have probably heard this many times before, but it is always worth repeating when dealing with data:
> Correlation does not equal causation.
> Correlation != causation.

Correlation ranges from -1 to 1. Values near 0 indicate little linear relationship, while values near -1 or 1 indicate a strong negative or positive one.

Note: in statistics, the Pearson correlation coefficient r is the quantity that gives us these values. Its square, r^2, ranges from 0 to 1 and drops the sign.

If r is 0.98, the two variables are very strongly positively correlated.
Say we have a dataset of children of different ages and their heights. Intuitively, age and height will be positively correlated.
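
For the age and height example, here is a minimal sketch with made-up numbers. NumPy's corrcoef returns the Pearson coefficient r, which lies between -1 and 1; squaring it gives r^2, which lies between 0 and 1:

```python
import numpy as np

# Hypothetical ages (years) and heights (cm) for six children
age = np.array([4, 6, 8, 10, 12, 14])
height = np.array([102, 115, 127, 138, 150, 163])

r = np.corrcoef(age, height)[0, 1]   # Pearson correlation coefficient
r_squared = r ** 2                   # squared correlation, always non-negative
print(round(r, 3), round(r_squared, 3))
```

Because the invented heights grow almost linearly with age, r comes out very close to 1.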

We can use correlation heatmaps to see correlation relationships between the different variables in our dataset.

In [None]:
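# Compute the correlation matrix (numeric columns only; text columns cannot be correlated)
corr = df.select_dtypes(include='number').corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool) # np.bool was removed from NumPy; the built-in bool works everywhere
mask[np.triu_indices_from(mask)] = True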

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

Wow, there isn't much in this dataset to work with. Next time, we will need to use a better dataset that can offer more insights.
But, we will work with what we have for now.

Maybe we want to see what correlates with the alcohol content in the beer. As expected, IBU (International Bitterness Units, a gauge of a beer's bitterness) correlates the most with abv. Keep the warning above in mind, though: this does not show that alcohol causes bitterness. One plausible explanation is that stronger styles tend to be more heavily hopped, but the data alone cannot confirm it.

Questions to ask ourselves when looking at correlations between variables:
* Which features are strongly correlated with the target variable?
* Are there interesting or unexpected strong correlations between other features?

Again, our aim is to gain intuition about the data, which will help us throughout the rest of the workflow.

## What about missing data?
We can create a table of total missing values and the percent of the missing values within each feature.


In [None]:
#missing data
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

In [None]:
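#dealing with missing data
# drop() returns a new DataFrame; assigning the result of a call with inplace=True would set df to None
df = df.drop(columns=missing_data[missing_data['Total'] > 1].index) # drop features with more than one missing value
if 'style' in df.columns: # 'style' may itself have been dropped by the line above
    df = df.dropna(subset=['style']) # drop the rows where 'style' is missing
df.isnull().sum().max() # just checking that there's no missing data left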

## Ending the Exploratory Phase
We should now have a decent understanding of the dataset, some notes for data cleaning, and possibly some ideas for feature engineering.
As we become more advanced in the DS method, further steps can be added for more in depth analysis in this phase.

# Day 3

## Data Cleansing
> Garbage in = garbage out

A clean data set with simple algorithms is better than a messy dataset with advanced algorithms. 

## Unwanted Observations
First, we need to remove any **duplicates** from our dataset.
Next, we need to delete any irrelevant observations or even features, if needed. This is when our notes about the sparse classes and other observations from the exploratory phase will come in handy. Doing these steps before feature engineering is not only logical, but will save us copious amounts of time and headache.

## Structural Errors
Check for typos or inconsistent capitalization (e.g. you have two classes 'Style' and 'style' that should be joined into one class).
Finally, check for mislabeled classes, i.e. separate classes that should really be the same:
* e.g. If 'N/A' and 'Not Applicable' appear as two separate classes, you should combine them.
* e.g. 'IT' and 'information_technology' should be a single class.
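
One way to apply these fixes in pandas is sketched below. The labels are made up, and the mapping of which classes to merge is an assumption you would replace with your own:

```python
import pandas as pd

# Hypothetical messy labels
dept = pd.Series(["IT", "information_technology", "N/A", "Not Applicable", "it "])

cleaned = (
    dept.str.strip()    # drop stray whitespace
        .str.lower()    # make capitalization consistent
        .replace({"information_technology": "it", "n/a": "not applicable"})  # merge mislabeled classes
)
print(cleaned.value_counts())
```

Five inconsistent labels collapse into two classes: 'it' and 'not applicable'.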

## Unwanted Outliers
Outliers are innocent until proven guilty. We should never remove an outlier just because it’s a "big number." That big number could be very informative for your model.
We can’t stress this enough: you must have a good reason for removing an outlier, such as suspicious measurements that are unlikely to be real data.

## Missing Data
Missing data is like missing a puzzle piece. If you drop it, that’s like pretending the puzzle slot isn’t there. If you impute it, that’s like trying to squeeze in a piece from somewhere else in the puzzle.

"Missingness" is almost always informative in itself, and you should tell your algorithm if a value was missing.

The best way to handle missing data for categorical features is to simply label them as ’Missing’!
* We are essentially adding a new class for the feature.
* This tells the algorithm that the value was missing.
* This also gets around the technical requirement for no missing values.

For missing numeric data, you should flag and fill the values.

* Flag the observation with an indicator variable of missingness.
* Then, fill the original missing value with 0 just to meet the technical requirement of no missing values.

By using this technique of flagging and filling, we are essentially allowing the algorithm to estimate the optimal constant for missingness, instead of just filling it in with the mean.
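
A minimal sketch of both techniques on a made-up DataFrame (the column names 'style' and 'ibu' mirror the beer data, but the values are invented, and 'ibu_missing' is a hypothetical name for the indicator):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each kind of feature
toy = pd.DataFrame({
    "style": ["IPA", None, "Stout"],
    "ibu": [60.0, np.nan, 35.0],
})

toy["style"] = toy["style"].fillna("Missing")          # categorical: add a 'Missing' class
toy["ibu_missing"] = toy["ibu"].isnull().astype(int)   # numeric: flag first...
toy["ibu"] = toy["ibu"].fillna(0)                      # ...then fill with 0
print(toy)
```

The order matters: the indicator has to be created before the fill, otherwise there is nothing left to flag.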

## Conclusion
After properly completing the Data Cleansing step, we should have a robust dataset that avoids many of the most common pitfalls.

This can really save us from a ton of headaches down the road, so please don't rush this step.

# Day 4

## Feature Engineering
To start, feature engineering is very open-ended. There are literally infinite options for new features to create.
Plus, you’ll need domain knowledge to add informative features instead of just more noise.

This is a skill that you’ll develop naturally with time and practice, but you can give yourself a big head-start if you have a framework in place.
A feature engineering framework simply consists of "heuristics" that you can rely on to spark ideas.

If you're a beginner, heuristics can help you know where to start looking... and if you're experienced, heuristics can help you get unstuck.

## Interaction features
Let's set aside the beer dataset for a second to illustrate this step.

#### Example (real-estate)
* Let's say we already had a feature called 'num_schools', i.e. the number of schools within 5 miles of a property.
* Let's say we also had the feature 'median_school', i.e. the median quality score of those schools.
* However, we might suspect that what's really important is having many school options, but only if they are good.
* Well, to capture that interaction, we could simply create a new feature 'school_score' = 'num_schools' x 'median_school'
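
In pandas, the interaction feature from this example is a single column multiplication. The property values below are made up for illustration:

```python
import pandas as pd

# Hypothetical properties; column names follow the example above
homes = pd.DataFrame({
    "num_schools": [1, 4, 4],
    "median_school": [9.0, 3.0, 8.0],
})

# Many schools only count for much when their quality is also high
homes["school_score"] = homes["num_schools"] * homes["median_school"]
print(homes)
```

The third property (four schools with a median score of 8) ends up with by far the highest 'school_score', which is the behaviour the interaction was meant to capture.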

## Sparse classes
There's no formal rule for how many observations each class needs.
It also depends on the size of your dataset and the number of other features you have.
As a rule of thumb, we recommend combining classes until each one has at least ~50 observations. As with any "rule" of thumb, use this as a guideline (not actually as a rule).
Let's take a look at the real-estate example:

[Chart: 'exterior_walls' class counts before grouping sparse classes]
To begin, we can group similar classes. In the chart above, the 'exterior_walls' feature has several classes that are quite similar.
* We might want to group 'Wood Siding', 'Wood Shingle', and 'Wood' into a single class. In fact, let's just label all of them as 'Wood'.

Next, we can group the remaining sparse classes into a single 'Other' class, even if there's already an 'Other' class.
* We'd group 'Concrete Block', 'Stucco', 'Masonry', 'Other', and 'Asbestos shingle' into just 'Other'.

After combining sparse classes, we have fewer unique classes, but each one has more observations.
Often, an eyeball test is enough to decide if you want to group certain classes together.
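
The two grouping steps can be sketched as follows. The 'exterior_walls' values are invented and the threshold of 3 is only there to suit the tiny sample (the rule of thumb above suggests something closer to 50 on real data):

```python
import pandas as pd

# Hypothetical 'exterior_walls' values
walls = pd.Series(
    ["Brick", "Brick", "Brick", "Wood Siding", "Wood Shingle", "Wood",
     "Stucco", "Masonry", "Other"],
    name="exterior_walls",
)

# Step 1: merge similar classes into one
walls = walls.replace({"Wood Siding": "Wood", "Wood Shingle": "Wood"})

# Step 2: fold the remaining sparse classes into 'Other'
min_count = 3                                        # illustrative threshold
counts = walls.value_counts()
sparse = counts[counts < min_count].index
walls = walls.where(~walls.isin(sparse), "Other")    # keep non-sparse values, replace the rest
print(walls.value_counts())
```

After both steps, the toy feature has three classes with three observations each.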

## Dummy Variables
Dummy variables are a set of binary (0 or 1) variables that each represent a single class from a categorical feature.

The information you represent is exactly the same, but this numeric representation allows you to pass the technical requirements for algorithms.
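
A short sketch using pandas' get_dummies on a made-up categorical column:

```python
import pandas as pd

# Hypothetical categorical feature
toy = pd.DataFrame({"exterior_walls": ["Brick", "Wood", "Other", "Wood"]})

# One 0/1 column per class; dtype=int gives integers rather than booleans
dummies = pd.get_dummies(toy, columns=["exterior_walls"], dtype=int)
print(dummies)
```

Each row has exactly one 1 across the new columns, marking the class that observation belonged to.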

## Remove unused
Finally, remove unused or redundant features from the dataset.

Unused features are those that don’t make sense to pass into our machine learning algorithms. Examples include:
* ID columns
* Features that wouldn't be available at the time of prediction
* Other text descriptions

Redundant features would typically be those that have been replaced by other features that you’ve added during feature engineering.

# Day 5

## Algorithm Selection
Often you can frame your problem as a regression, classification, or clustering task.

## Flaws of linear regression
To introduce the reasoning for some of the advanced algorithms, let's start by discussing basic linear regression. Linear regression models are very common, yet deeply flawed. Although you can sometimes get away with some fast insights using a quick and dirty linear regression model, it is usually not the best model.

Simple linear regression models fit a "straight line" (technically a hyperplane depending on the number of features, but it's the same idea). In practice, they rarely perform well. We actually recommend skipping them for most machine learning problems.

Their main advantage is that they are easy to interpret and understand. However, our goal is not to study the data and write a research report. Our goal is to build a model that can make accurate predictions.

In this regard, simple linear regression suffers from two major flaws:

* It's prone to overfit with many input features.
* It cannot easily express non-linear relationships.

Let's take a look at how we can address the first flaw.

## Regularization
This is the first "advanced" tactic for improving model performance. It’s considered pretty "advanced" in many ML courses, but it’s really pretty easy to understand and implement.

The first flaw of linear models is that they are prone to overfitting when there are many input features.
Let's take an extreme example to illustrate why this happens:
* Let's say you have 100 observations in your training dataset.
* Let's say you also have 100 features.
* If you fit a linear regression model with all of those 100 features, you can perfectly "memorize" the training set.
* Each coefficient would simply memorize one observation. This model would fit the training data perfectly, but perform poorly on unseen data.
* It hasn’t learned the true underlying patterns; it has only memorized the noise in the training data.

Regularization is a technique used to prevent overfitting by artificially penalizing model coefficients.
* It can discourage large coefficients (by dampening them).
* It can also remove features entirely (by setting their coefficients to 0).
* The "strength" of the penalty is tunable. (More on this tomorrow...)
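
The extreme 100-observations, 100-features example can be reproduced with scikit-learn. In this sketch the features and the target are pure random noise (an assumption made so that there is no real pattern to learn):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# 100 observations, 100 features, and a target that is pure noise
X_train = rng.normal(size=(100, 100))
y_train = rng.normal(size=100)

# Fresh data drawn the same way, never seen during fitting
X_new = rng.normal(size=(100, 100))
y_new = rng.normal(size=100)

model = LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)   # R^2 on the data the model was fit on
new_r2 = model.score(X_new, y_new)         # R^2 on unseen data
print(train_r2, new_r2)
```

The training R^2 is essentially 1 even though the target is noise, while the score on new data is far worse. That gap is what overfitting looks like.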

## Regularized Regression
There are 3 common types of regularized linear regression algorithms:

**LASSO Regression**
Lasso, or LASSO, stands for Least Absolute Shrinkage and Selection Operator.
* Lasso regression penalizes the absolute size of coefficients.
* Practically, this leads to coefficients that can be exactly 0.
* Thus, Lasso offers automatic feature selection because it can completely remove some features.
* Remember, the "strength" of the penalty should be tuned.
* A stronger penalty leads to more coefficients pushed to zero.

**Ridge Regression**
* Ridge regression penalizes the squared size of coefficients.
* Practically, this leads to smaller coefficients, but it doesn't force them to 0.
* In other words, Ridge offers feature shrinkage.
* Again, the "strength" of the penalty should be tuned.
* A stronger penalty leads to coefficients pushed closer to zero.

**Elastic Net Regression**
Elastic-Net is a compromise between Lasso and Ridge.
* Elastic-Net penalizes a mix of both absolute and squared size.
* The ratio of the two penalty types should be tuned.
* The overall strength should also be tuned.

Oh and in case you’re wondering, there’s no "best" type of penalty. It really depends on the dataset and the problem. We recommend trying different algorithms that use a range of penalty strengths as part of the tuning process, which we'll cover in detail tomorrow.
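
To see the difference between the three penalties, the sketch below fits each one to synthetic data in which only 5 of 20 features carry signal. The dataset and the penalty strength (alpha=1.0) are illustrative assumptions, not tuned values:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Synthetic data: 20 features, only 5 of which carry signal
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

models = {
    "lasso": Lasso(alpha=1.0),                           # absolute-size penalty
    "ridge": Ridge(alpha=1.0),                           # squared-size penalty
    "elastic_net": ElasticNet(alpha=1.0, l1_ratio=0.5),  # mix of both
}

zero_counts = {}
for name, model in models.items():
    model.fit(X, y)
    zero_counts[name] = int(np.sum(model.coef_ == 0))    # coefficients set exactly to 0
print(zero_counts)
```

Lasso removes some of the uninformative features entirely, whereas Ridge only shrinks coefficients and leaves none at exactly zero.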

## Decision Trees
Awesome, we’ve just seen 3 algorithms that can protect linear regression from overfitting. But if you remember, linear regression suffers from two main flaws:
* It's prone to overfit with many input features.
* It cannot easily express non-linear relationships.

How can we address the second flaw?

Well, we need to move away from linear models to do so.... we need to bring in a new category of algorithms.

Decision trees model data as a "tree" of hierarchical branches. They make branches until they reach "leaves" that represent predictions.

Due to their branching structure, decision trees can easily model nonlinear relationships.

* For example, let's say for Single Family homes, larger lots command higher prices.
* However, let's say for Apartments, smaller lots command higher prices (i.e. it's a proxy for urban / rural).
* This reversal of correlation is difficult for linear models to capture unless you explicitly add an interaction term (i.e. you can anticipate it ahead of time).
* On the other hand, decision trees can capture this relationship naturally.
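
The lot-size reversal can be simulated to compare the two model families. Everything in this sketch is made up: the prices, the noise level, and the tree depth of 4 are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 400
is_apartment = rng.integers(0, 2, size=n)   # 0 = single family home, 1 = apartment
lot_size = rng.uniform(0, 1, size=n)

# Made-up prices: a larger lot raises house prices but lowers apartment prices
price = np.where(is_apartment == 1, 300 - 100 * lot_size, 100 + 100 * lot_size)
price = price + rng.normal(scale=5, size=n)

X = np.column_stack([is_apartment, lot_size])
X_train, X_test, y_train, y_test = train_test_split(X, price, random_state=0)

# No interaction term is supplied, so the linear model gets a single slope for lot size
linear_r2 = LinearRegression().fit(X_train, y_train).score(X_test, y_test)
tree_r2 = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train).score(X_test, y_test)
print(linear_r2, tree_r2)
```

The tree can split on property type first and then treat lot size differently in each branch, so it scores noticeably higher on the held-out data than the linear model.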

Unfortunately, decision trees suffer from a major flaw as well. If you allow them to grow limitlessly, they can completely "memorize" the training data, just from creating more and more and more branches.

**As a result, individual unconstrained decision trees are very prone to being overfit.**
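
A quick sketch of that memorization on synthetic noisy data (the sine-shaped signal, the noise level, and the depth limit of 3 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.5, size=300)   # signal plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)                  # no growth limits
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)  # constrained

deep_train_r2 = deep.score(X_train, y_train)
deep_test_r2 = deep.score(X_test, y_test)
print("deep:   ", deep_train_r2, deep_test_r2)
print("shallow:", shallow.score(X_train, y_train), shallow.score(X_test, y_test))
```

The unconstrained tree reproduces the training targets exactly, noise included, and its score drops sharply on the held-out data.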

So, how can we take advantage of the flexibility of decision trees while preventing them from overfitting the training data?

## Ensembles
Ensembles are machine learning methods for combining predictions from multiple separate models. There are a few different methods for ensembling, but the two most common are bagging and boosting.

**Bagging**
Bagging attempts to reduce the chance of overfitting complex models.
* It trains a large number of "strong" learners in parallel.
* A strong learner is a model that's relatively unconstrained.
* Bagging then combines all the strong learners together in order to "smooth out" their predictions.

**Boosting**
Boosting attempts to improve the predictive flexibility of simple models.
* It trains a large number of "weak" learners in sequence.
* A weak learner is a constrained model (e.g. you could limit the max depth of each decision tree).
* Each one in the sequence focuses on learning from the mistakes of the one before it.
* Boosting then combines all the weak learners into a single strong learner.

While bagging and boosting are both ensemble methods, they approach the problem from opposite directions. Bagging uses complex base models and tries to "smooth out" their predictions, while boosting uses simple base models and tries to "boost" their aggregate complexity.

Ensembling is a general term, but when the base models are decision trees, they have special names: random forests and boosted trees!

**Random Forests**
Random forests train a large number of "strong" decision trees and combine their predictions through bagging.

In addition, there are two sources of "randomness" for random forests:

1. At each split, a tree is only allowed to choose from a random subset of features (which makes the trees less alike).
2. Each tree is only trained on a random sample of observations, typically drawn with replacement (a process called bootstrap resampling).

In practice, random forests tend to perform very well right out of the box.

* They often beat many other models that take up to weeks to develop.
* They are the perfect "swiss-army-knife" algorithm that almost always gets good results.
* They don’t have many complicated parameters to tune.

**Boosted Trees**
Boosted trees train a sequence of "weak", constrained decision trees and combine their predictions through boosting.

* Each tree is allowed a maximum depth, which should be tuned.
* Each tree in the sequence tries to correct the prediction errors of the one before it.

In practice, boosted trees tend to have the highest performance ceilings.

* They often beat many other types of models after proper tuning.
* They are more complicated to tune than random forests.
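
Both ensembles are available in scikit-learn. The sketch below fits them to make_friedman1, a synthetic nonlinear regression problem that ships with the library and is used here only for illustration; the hyperparameter values are untuned assumptions:

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic nonlinear regression problem
X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: many unconstrained trees whose predictions are averaged
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# Boosting: a sequence of shallow trees, each correcting its predecessors
boosted = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                    random_state=0).fit(X_train, y_train)

forest_r2 = forest.score(X_test, y_test)
boosted_r2 = boosted.score(X_test, y_test)
print(forest_r2, boosted_r2)
```

Which of the two wins depends on the dataset and on how carefully the boosted model is tuned, so treat a single run like this as a starting point rather than a verdict.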

## Conclusion
Whew, that was a lot! If you need to, feel free to let it sink in a bit and then re-read the lesson.

Key takeaway: The most effective algorithms typically offer a combination of regularization, automatic feature selection, ability to express nonlinear relationships, and/or ensembling. Those algorithms include:
* Lasso regression
* Ridge regression
* Elastic-Net
* Random forest
* Boosted tree

# Day 6

## Building Our Models
The majority of our time in data science and machine learning is spent on:
1. Exploring the data.
2. Cleaning the data.
3. Engineering new features.

Now, it is time to fasten our seatbelts and build our models. This is the *fun* part.

## Split dataset
Let’s start with a crucial but sometimes overlooked step: Spending your data.

Think of your data as a limited resource.

* You can spend some of it to train your model (i.e. feed it to the algorithm).
* You can spend some of it to evaluate (test) your model.
* But you can’t reuse the same data for both!

If you evaluate your model on the same data you used to train it, your model could be very overfit and you wouldn’t even know! A model should be judged on its ability to predict new, unseen data.

Therefore, you should have separate **training** and **test** subsets of your dataset.
Training sets are used to fit and tune your models. Test sets are put aside as "unseen" data to evaluate your models.
* You should always split your data before doing anything else.
* This is the best way to get reliable estimates of your models’ performance.
* After splitting your data, don’t touch your test set until you’re ready to choose your final model!

Comparing test vs. training performance allows us to detect overfitting: if the model performs very well on the training data but poorly on the test data, then it’s overfit.
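
With scikit-learn, the split is a single call to train_test_split. The arrays below are placeholders, and the 80/20 ratio is a common convention rather than a requirement:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and target with 100 observations
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # hold out 20% of the observations as the test set
    random_state=0,    # fixed seed so the split is reproducible
)
print(X_train.shape, X_test.shape)
```

Every observation lands in exactly one of the two subsets, so the test set stays unseen during training.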

## Hyperparameters
So far, we’ve been casually talking about "tuning" models, but now it’s time to treat the topic more formally.

When we talk of tuning models, we specifically mean tuning hyperparameters.

There are two types of parameters in machine learning algorithms: model parameters and hyperparameters.

The key distinction is that model parameters can be learned directly from the training data while hyperparameters cannot.

**Model parameters**
Model parameters are learned attributes that define individual models.
* e.g. regression coefficients
* e.g. decision tree split locations

They can be learned directly from the training data.

**Hyperparameters**
Hyperparameters express "higher-level" structural settings for algorithms.
* e.g. strength of the penalty used in regularized regression
* e.g. the number of trees to include in a random forest

They are decided before fitting the model because they can't be learned from the data.
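
The distinction shows up directly in scikit-learn's API. In this sketch (synthetic data, arbitrary penalty strength), alpha is a hyperparameter passed to the constructor, while coef_ only exists after fitting:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

model = Ridge(alpha=1.0)             # hyperparameter: chosen by us before fitting
model.fit(X, y)

print(model.get_params()["alpha"])   # unchanged by fitting
print(model.coef_)                   # model parameters: learned from the data
```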

## Cross-validation
Next, it’s time to introduce a concept that will help us tune our models: cross-validation.

Cross-validation is a method for getting a reliable estimate of model performance using only your training data.

There are several ways to cross-validate. The most common one, 10-fold cross-validation, breaks your training data into 10 equal parts (a.k.a. folds), essentially creating 10 miniature train/test splits.

These are the steps for 10-fold cross-validation:
1. Split your data into 10 equal parts, or "folds".
2. Train your model on 9 folds (e.g. the first 9 folds).
3. Evaluate it on the 1 remaining "hold-out" fold.
4. Perform steps (2) and (3) 10 times, each time holding out a different fold.
5. Average the performance across all 10 hold-out folds.

The average performance across the 10 hold-out folds is your final performance estimate, also called your cross-validated score. Because you created 10 mini train/test splits, this score is usually pretty reliable.
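
The steps above can be written out by hand with scikit-learn's KFold. The data is synthetic and Ridge is just a stand-in for whatever model is being evaluated:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Synthetic stand-in for a training set
X_train, y_train = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)   # step 1: make 10 folds
fold_scores = []
for fit_idx, holdout_idx in kfold.split(X_train):          # step 4: repeat for each fold
    model = Ridge(alpha=1.0).fit(X_train[fit_idx], y_train[fit_idx])   # step 2: train on 9 folds
    score = model.score(X_train[holdout_idx], y_train[holdout_idx])    # step 3: score the hold-out fold
    fold_scores.append(score)

cv_score = np.mean(fold_scores)                            # step 5: average the 10 scores
print(len(fold_scores), cv_score)
```

scikit-learn's cross_val_score wraps this loop in a single call; writing it out once makes it clearer what that helper does.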

## Fit and tune models
Now that we've split our dataset into training and test sets, and we've learned about hyperparameters and cross-validation, we're ready to fit and tune our models.

Basically, all we need to do is perform the entire cross-validation loop detailed above on each set of hyperparameter values we'd like to try.

The high-level pseudo-code looks like this:

In [None]:
# Pseudocode for the hyperparameter loop using cross-validation (written as comments so the cell runs without error)
# For each algorithm (e.g. regularized regression, random forest, etc.):
#   For each set of hyperparameter values to try:
#     Perform cross-validation using the training set.
#     Calculate cross-validated score.

At the end of this process, you will have a cross-validated score for each set of hyperparameter values... for each algorithm.

Then, we'll pick the best set of hyperparameters within each algorithm:

In [None]:
# Pseudocode (written as comments so the cell runs without error)
# For each algorithm:
#   Keep the set of hyperparameter values with best cross-validated score.
#   Re-train the algorithm on the entire training set (without cross-validation).

It's kind of like the Hunger Games... each algorithm sends its own "representatives" (i.e. model trained on the best set of hyperparameter values) to the final selection.
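
One way to turn the two loops into runnable code is scikit-learn's GridSearchCV, which cross-validates every hyperparameter combination and, by default, re-trains the best one on the full training set. The algorithms, grids, and synthetic data below are illustrative assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a training set
X_train, y_train = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Illustrative algorithms and grids; real grids depend on the problem
search_space = {
    "lasso": (Lasso(), {"alpha": [0.01, 0.1, 1.0]}),
    "forest": (RandomForestRegressor(random_state=0), {"n_estimators": [50, 100]}),
}

fitted = {}
for name, (estimator, grid) in search_space.items():   # for each algorithm
    search = GridSearchCV(estimator, grid, cv=10)       # 10-fold CV for each set of values
    search.fit(X_train, y_train)                        # refit=True retrains the winner on all training data
    fitted[name] = search
    print(name, search.best_params_, search.best_score_)
```

After the loop, fitted holds one tuned "representative" per algorithm, available as best_estimator_ on each search object.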

## Select winner
By now, you'll have 1 "best" model for each algorithm that has been tuned through cross-validation. Most importantly, you've only used the training data so far.

Now it’s time to evaluate each model and pick the best one, Hunger Games style.

Because you've saved your test set as a truly unseen dataset, you can now use it to get a reliable estimate of each model's performance.

There are a variety of performance metrics you could choose from. We won't spend too much time on them here, but in general:
* For regression tasks, we recommend Mean Squared Error (MSE) or Mean Absolute Error (MAE). (Lower values are better)
* For classification tasks, we recommend Area Under ROC Curve (AUROC). (Higher values are better)

The process is very straightforward:
* For each of your models, make predictions on your test set.
* Calculate performance metrics using those predictions and the "ground truth" target variable from the test set.
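
The metrics named above are all available in sklearn.metrics. This sketch uses tiny hand-written predictions (hypothetical values) so the arithmetic can be checked by eye:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, roc_auc_score

# Regression: hypothetical ground truth and predictions
y_true = [3, 5, 2]
y_pred = [2, 5, 4]
mse = mean_squared_error(y_true, y_pred)    # (1 + 0 + 4) / 3
mae = mean_absolute_error(y_true, y_pred)   # (1 + 0 + 2) / 3

# Classification: hypothetical labels and predicted scores
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auroc = roc_auc_score(labels, scores)       # 3 of the 4 positive/negative pairs are ranked correctly
print(mse, mae, auroc)
```

In a real project, y_true and labels would come from the test set and the predictions from each candidate model.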

Finally, use these questions to help you pick the winning model:
* Which model had the best performance on the test set? (performance)
* Does it perform well across various performance metrics? (robustness)
* Did it also have (one of) the best cross-validated scores from the training set? (consistency)
* Does it solve the original business problem? (win condition)

# Day 7

**Recap**
* On day 1, you saw a bird's-eye view of the entire machine learning workflow.
* Then, on day 2, you learned our framework for fast, efficient, and decisive exploratory analysis.
* Day 3 was all about data cleaning, which is perhaps the most important step of all!
* Next, on day 4, we shared our favorite heuristics for feature engineering.
* On day 5, we discussed regularization and ensembles, and you learned about 5 algorithms that leverage those mechanisms.
* And yesterday on day 6, we walked through a proven formula for training excellent models after the other steps have been completed correctly.

Now, we will talk about next steps in our DS/ML education.

Strike while the iron is hot! Go on and tackle a problem with a different dataset while it's fresh in your head.
Maybe split it up in 7 days like this tutorial, one section per day. We will do the same methods, but since it is a different, unique dataset and problem, we will most definitely learn something different in approach and in the details.

It is my recommendation (although everyone learns differently) to skip the textbooks and jump into projects ASAP because it's much faster to learn in context, i.e. "learning by doing."

Plus, it will be easier to stay motivated and continue progressing.

Onwards!



Sources:
* [1] http://varianceexplained.org/r/ds-ml-ai/
* [2] https://tryolabs.com/blog/2017/03/16/pandas-seaborn-a-guide-to-handle-visualize-data-elegantly/
* [3] https://elitedatascience.com/data-cleaning