# Data Prep


## Agenda
Machine Learning Workflow:
* Feature Engineering

## Machine Learning Workflow

<img src='imgs/data-science-explore.png' width=800>

- Iterative process
- Non-linear process
- Lots of judgement and refining along the way
- Lots of time spent in data prep
- "Big data": a lot of time can be spent in data retrieval

Source: Practical Machine Learning with Python, Apress/Springer

## Machine Learning Workflow

<img src='imgs/data-science-explore.png' width=600>

### Data Retrieval
- We've covered a lot of this
- SQL, APIs, Web Scraping, csv, Excel...
- Could include combining some of the above
- Also called "Data Ingestion"

## Machine Learning Workflow

<img src='imgs/data-science-explore.png' width=600>

### Data Preparation
- **Processing and Wrangling**: Covered this too, `pandas` etc.
- **Feature extraction and engineering**: What features (i.e., variables, `x`) do I need for my problem?
- **Feature scaling and selection**: To be covered

## Machine Learning Workflow

<img src='imgs/data-science-explore.png' width=600>

### Modeling (i.e., machine learning)
- `scikit-learn` is the main general-purpose package
- Other packages for deep learning
- Supervised vs. unsupervised learning
- "Build a model"

## Machine Learning Workflow

<img src='imgs/data-science-explore.png' width=600>

### Machine Learning Algorithm
- **"Algorithm"**: series of steps based on rules that a computer takes to calculate something
- Within supervised:
    - Regression: `y` is a continuous number (e.g., price)
    - Classification: `y` is discrete (e.g., customer retained or not)
- Examples: decision trees, linear regression, neural networks
    

## Machine Learning Workflow

<img src='imgs/data-science-explore.png' width=600>

### Model Evaluation & Tuning
- Our first model will probably not be the best one; we need to compare candidates and pick
- **Evaluation**: Using metrics to pick the best model for the use case
- **Tuning**: Besides picking between algorithms, there are 'knobs' / settings to 'tune' a model for a specific algorithm
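
A minimal sketch of the evaluation step, assuming a classification problem and accuracy as the metric. The labels, predictions, and model names below are hypothetical and only illustrate comparing candidates on held-out data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Hypothetical held-out labels and predictions from two candidate models.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
predictions = {
    "decision_tree": [1, 0, 1, 0, 0, 1, 1, 0],
    "neural_network": [1, 0, 1, 1, 0, 1, 1, 0],
}

# Score every candidate with the same metric, then keep the best one.
scores = {name: accuracy(y_true, y_pred) for name, y_pred in predictions.items()}
best_model = max(scores, key=scores.get)
print(scores, best_model)
```

Accuracy is only one possible metric; which one is appropriate depends on the use case.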

## Machine Learning Workflow

<img src='imgs/data-science-explore.png' width=600>

### Deployment & Monitoring
- We picked a model and it's ready for use by our users
- Be careful about concept drift
- Models sometimes need to be re-trained

## Types of Questions


| Type of question | Description | Example |
|:---|:--------------------------|:----------------|
| **Descriptive** | Summarize a characteristic of a set of data| Proportion of males, the mean number of servings of fresh fruits and vegetables per day |
| **Exploratory** | Analyze the data to see if there are patterns, trends, or relationships between variables; “hypothesis-generating” analyses|If you had a general thought that diet was linked somehow to viral illnesses, start by examining relationships between a range of dietary factors and viral illnesses|
| **Inferential** | Testing a hypothesis statistically: analyzing a sample of the population and generalizing the findings to the whole population | Is there a higher incidence of cancer for women than for men? |
| **Predictive**  | Predicting a value, not necessarily figuring out why| Predicting cancer diagnosis from x-rays using computer vision|
| **Causal**      | Whether changing one factor will change another factor | Does changing diet lead to higher incidence of cancer?|
| **Mechanistic** | Understanding *how* one factor changes another | How does diet lead to higher incidence of cancer? |

## Feature Engineering

<blockquote>
Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.<br> 
- Jason Brownlee    
</blockquote>    

- If your features (i.e., your representation of the data) are bad, no model, however fancy, is going to help.

A quote by Pedro Domingos, from [A Few Useful Things to Know About Machine Learning](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf):

## Feature Engineering

<blockquote>
... At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used. 
</blockquote>


A quote by Andrew Ng, from [Machine Learning and AI via Brain simulations](https://ai.stanford.edu/~ang/slides/DeepLearning-Mar2013.pptx):

<blockquote>
Coming up with features is difficult, time-consuming, requires expert knowledge. "Applied machine learning" is basically feature engineering.
</blockquote>

- Better features: more flexibility, we can get by with simple models, better accuracy 

## Feature Engineering

#### Better features usually help more than a better model.
- Good features would ideally:
    - capture the most important aspects of the problem
    - allow learning from few examples
    - generalize to new scenarios
    
    
#### The best features may be dependent on the model you use.

#### The best features are very domain and problem specific.

### Feature Engineering: Examples

<br>
<br>

**Example 1:** Taking a date and extracting the week number, weekday, month, etc.

Sales are often seasonal, so these calendar features can carry signal that the raw date hides.
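
A minimal sketch of Example 1 with `pandas`, assuming a hypothetical `sales_df` with a `date` column. The data and the column names are made up for illustration.

```python
import pandas as pd

# Hypothetical daily sales records; column names are illustrative only.
sales_df = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-02", "2023-07-15", "2023-12-24"]),
    "units_sold": [12, 30, 55],
})

# Derive calendar features from the raw timestamp.
sales_df["week"] = sales_df["date"].dt.isocalendar().week.astype(int)  # ISO week number
sales_df["weekday"] = sales_df["date"].dt.dayofweek  # Monday=0 ... Sunday=6
sales_df["month"] = sales_df["date"].dt.month
sales_df["is_weekend"] = sales_df["weekday"] >= 5

print(sales_df)
```

A model cannot easily learn "weekend" or "December" from a raw timestamp, but it can use these derived columns directly.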

### Feature Engineering: Examples

<br>
<br>

**Example 2:** Taking freeform text (e.g., tweets) and extracting the number of words, counts of individual words, punctuation, etc.

Text "metadata" can sometimes help with sentiment analysis.
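
A minimal sketch of Example 2 using only the standard library. The helper `text_metadata`, its feature names, and the sample text are hypothetical; real projects would pick the counts that suit the problem.

```python
import string


def text_metadata(text):
    """Return simple count-based features for one piece of free-form text."""
    words = text.split()
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "n_punctuation": sum(ch in string.punctuation for ch in text),
        "n_exclamations": text.count("!"),
        "n_uppercase_words": sum(w.isupper() for w in words),
    }


# Hypothetical tweet-like text.
features = text_metadata("I LOVE this phone!!! Best purchase ever.")
print(features)
```

Counts such as exclamation marks or all-caps words say nothing about the topic, but they can hint at how strongly the writer feels.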

### Feature Engineering: Examples

<br>
<br>

**Example 3:** Taking geographical coordinates and getting continent, country, urban vs. rural etc.

Housing price can depend on features extracted from geographical coordinates; coordinates aren't that useful themselves.
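
Getting the continent, country, or urban vs. rural label from coordinates needs an external lookup (reverse geocoding), which is not shown here. As a simpler stand-in, the sketch below derives a different coordinate-based feature: the great-circle (haversine) distance from each house to a reference point. The reference point and house coordinates are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    earth_radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# Hypothetical city centre and house locations as (latitude, longitude).
city_centre = (47.60, -122.30)
houses = [(47.60, -122.30), (47.70, -122.20), (48.50, -121.00)]

# New feature: distance of each house from the city centre.
dist_to_centre_km = [haversine_km(lat, lon, *city_centre) for lat, lon in houses]
print(dist_to_centre_km)
```

A single distance column like this is often easier for a model to use than raw latitude and longitude.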

### Feature Engineering: Examples

<br>
<br>

**Example 4:** Combining features to increase the predictive power.

The dataset below already contains BMI. Suppose it did not, and we only had height and weight, which on their own are not that useful for predicting whether a person has diabetes. Their combination, BMI, is. With weight in pounds and height in inches:

\begin{align}
\text{BMI} = \frac{\text{weight (lb)}}{\text{height (in)}^2} \times 703
\end{align}

```python
import pandas as pd

# Load the diabetes dataset and preview the first five rows.
diabetes_df = pd.read_csv('data/diabetes.csv')
diabetes_df.head()
```

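| Unnamed: 0 | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |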
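
Because `diabetes.csv` already ships with a BMI column, the sketch below applies the formula to a hypothetical frame instead. The frame `people_df`, its columns `weight_lb` and `height_in`, and the values are assumptions and do not come from the dataset. The factor 703 assumes pounds and inches.

```python
import pandas as pd

# Hypothetical measurements in US units; these are NOT from diabetes.csv.
people_df = pd.DataFrame({
    "weight_lb": [150.0, 200.0],
    "height_in": [65.0, 70.0],
})

# Combine two raw columns into one engineered feature.
people_df["BMI"] = people_df["weight_lb"] / people_df["height_in"] ** 2 * 703

print(people_df.round(1))
```

With metric units (kilograms and metres) the same ratio is used without the 703 factor.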


### Feature Engineering: Examples

<br>
<br>

**Example 5:** Dimensionality Reduction

You will see what I mean after lunch.