# Modeling with `scikit-learn`

## Applied Review

### `scikit-learn` in Python's Data Science Ecosystem

### `pandas` DataFrame vs. `numpy` Array

## Machine Learning Overview

* `scikit-learn` is a package for **machine learning**

* You may hear this called modeling, predictive analytics, data mining, artificial intelligence, etc.

* But **machine learning** is a catch-all term for these things

* Despite being broad, most machine learning methods and problems fall into one of two categories: **supervised learning** and **unsupervised learning**

### Supervised Learning

* Generally speaking, **supervised learning** problems are focused on predicting a truth -- a known outcome that has been recorded for past observations

* Another way to think of this is *learning a function* -- given the inputs, attempt to predict the output

* Examples:
  * Predicting a customer's satisfaction with a product they haven't tried before
  * Predicting next month's sales
  * Predicting the outcome of a sporting event
  * Predicting whether a photo contains a cucumber or a zucchini

* Each of these supervised learning examples falls into one of two distinct categories: **regression** or **classification**

#### Regression

* **Regression** is focused on predicting a *continuous output*

<div class='question'>
    <strong>Question:</strong> Which of our examples are regression problems?
</div>

  * Predicting a customer's satisfaction with a product they haven't tried before
  * Predicting next month's sales
  * Predicting the outcome of a sporting event
  * Predicting whether a photo contains a cucumber or a zucchini

* While *regression* sounds like *linear regression*, there are a variety of algorithms for regression:
  * Regression - Linear, Polynomial, Ridge, Lasso, ElasticNet
  * Tree-based - Decision Trees, Random Forest, Gradient Boosting
  * Neural Networks
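
As a minimal sketch of the idea that different regression algorithms share one workflow, the example below fits a linear model and a random forest to the same data. The synthetic data (`y = 3x + 2` plus noise) and all variable names are assumptions made for illustration, not part of the course material.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 3x + 2 plus small noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.5, size=200)

# Two different regression algorithms, same fit/predict interface.
linear = LinearRegression().fit(X, y)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Predict a continuous output for a new, unseen input.
X_new = np.array([[4.0]])
linear_pred = linear.predict(X_new)[0]
forest_pred = forest.predict(X_new)[0]
print(f"linear: {linear_pred:.2f}, forest: {forest_pred:.2f}")
```

Both predictions should land near `3 * 4 + 2 = 14`, since that is the relationship the synthetic data was built from.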

#### Classification

* **Classification** is focused on predicting a *discrete output* (categories)

* Our other two examples predict a category rather than a number, so they are classification problems:
  * Predicting the outcome of a sporting event
  * Predicting whether a photo contains a cucumber or a zucchini

* Note that there can be two (yes/no) or more categories

* Just like regression, classification has a variety of algorithms:
  * Regression - Logistic (Logit), Multinomial, Ordinal, Probit
  * Nearest Neighbors
  * Tree-based - Decision Trees, Random Forest, Gradient Boosting
  * Neural Networks
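
A minimal sketch of two of the classifiers listed above, assuming a made-up two-class dataset. The data, the class centers, and the variable names are hypothetical and chosen only so the classes are easy to separate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (an assumption): class 0 centered at (0, 0), class 1 at (4, 4).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, size=(100, 2)),
    rng.normal(4, 1, size=(100, 2)),
])
y = np.array([0] * 100 + [1] * 100)

# Two classification algorithms with the same fit/predict interface.
logit = LogisticRegression().fit(X, y)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Predict a discrete label for two new points, one near each class center.
X_new = np.array([[0.0, 0.0], [4.0, 4.0]])
logit_labels = logit.predict(X_new)
knn_labels = knn.predict(X_new)

# Logistic regression can also report a probability for each category.
logit_probs = logit.predict_proba(X_new)
print(logit_labels, knn_labels)
```

Unlike the regression case, the prediction here is a category label, and `predict_proba` returns one column per category.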

### Unsupervised Learning

* While **supervised learning** problems are focused on predicting a truth, **unsupervised learning** has no truth data

* In other words, there is no output -- therefore, there's no function to learn

* Instead, **unsupervised learning** is focused on learning patterns between cases/observations

#### Clustering

* *Clustering* is focused on grouping observations into categories based on similarities -- these categories are not set or known beforehand

* An example of this is customer segmentation -- placing customers in segments based on their buying patterns

* A few key clustering algorithms:
  * KMeans
  * Hierarchical
  * Spectral
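
The customer segmentation idea can be sketched with `KMeans`. The two features (orders per month, average basket size) and the customer data are hypothetical, generated only to illustrate the workflow.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [orders per month, average basket size].
rng = np.random.default_rng(0)
low_spenders = rng.normal([2, 20], 1.0, size=(50, 2))
high_spenders = rng.normal([10, 80], 1.0, size=(50, 2))
X = np.vstack([low_spenders, high_spenders])

# No labels are passed to fit: the segments are learned from X alone.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
segments = kmeans.labels_

print(segments[:5], segments[-5:])
print(kmeans.cluster_centers_)
```

Note that the cluster numbers themselves are arbitrary: the algorithm groups similar customers together, but it is up to the analyst to interpret what each segment means.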

#### Dimension Reduction 

* *Dimension reduction* is focused on reducing the number of dimensions needed to represent an observation or case

* An example is taking an observation with 150 variables and capturing most of its variability in just 10 variables -- this helps avoid the curse of dimensionality and collinearity

* A few key dimension reduction algorithms:
  * Principal Component Analysis
  * Non-negative Matrix Factorization
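
The 150-to-10 example above can be sketched with `PCA`. The data here is synthetic and built on purpose from 10 hidden factors, which is an assumption that makes the reduction work cleanly. Real data will rarely compress this well.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption): 150 observed variables that are
# noisy linear mixtures of only 10 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 10))
mixing = rng.normal(size=(10, 150))
X = latent @ mixing + rng.normal(0, 0.1, size=(500, 150))

# Represent each 150-variable observation with 10 components.
pca = PCA(n_components=10).fit(X)
X_reduced = pca.transform(X)

# Share of the original variance retained by the 10 components.
explained = pca.explained_variance_ratio_.sum()
print(X.shape, X_reduced.shape, round(explained, 3))
```

The `explained_variance_ratio_` attribute is a common way to judge how many components are worth keeping.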

#### Text Vectorization

* While text mining and natural language processing are their own branch of data science, *text vectorization* is an example of unsupervised learning

* Text vectorization is focused on creating numeric representations of text by learning from words and their surroundings

* These methods are increasingly popular -- they're used heavily in chatbots and digital assistants

* Two common algorithms are:
  * Word2Vec
  * Doc2Vec
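
Word2Vec and Doc2Vec are not part of `scikit-learn`, so the sketch below swaps in a much simpler technique, bag-of-words counting with `CountVectorizer`, only to show what "a numeric representation of text" looks like. Unlike Word2Vec, it does not learn from a word's surroundings. The two example sentences are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents (an assumption for illustration).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Learn the vocabulary, then count how often each word appears per document.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

# vocabulary_ maps each word to its column index in the count matrix.
vocab = sorted(vectorizer.vocabulary_)
counts = X.toarray()

print(vocab)
print(counts)
```

Each document becomes one row of word counts, which can then be fed to any of the algorithms discussed earlier.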

## Building a Supervised Learning Regression Model

### Setting Up the Problem

### Preparing Data

#### Data Partitioning

##### Train vs. Test

##### Cross-validation

#### Feature Engineering

### Training the Model

### Validating the Model

#### Error Rates

#### Variable Considerations

##### Coefficients