# Messing Around with Scikit-Learn

In [1]:
import IPython.core.display as disp

The plan for today:

 - **Part 1**: I will regurgitate some of the scikit-learn documentation at you, in ipython notebook form
 - **Part 2**: Shannon will give an example of scikit-learn in the wild

## The basics: What is scikit-learn?

Scikit-learn, or `sklearn`, is a machine learning package for Python. It is built on the familiar scientific Python tools (numpy, scipy, matplotlib, etc.) and thus plays very nicely with data in the form of numpy arrays. A cursory googling shows people are also working on [better integrating sklearn with other popular pythonic data structures like pandas DataFrames](https://github.com/paulgb/sklearn-pandas).
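To make the "plays nicely with numpy arrays" point concrete, here is a minimal sketch that hands plain numpy arrays to an estimator. The data are synthetic and made up purely for illustration, and `LinearRegression` is just one convenient estimator to demonstrate with. It assumes a reasonably recent scikit-learn install.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data invented for this sketch: y = 3*x0 - 2*x1 + 1
rng = np.random.RandomState(0)
X = rng.rand(100, 2)                      # shape (n_samples, n_features)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0   # shape (n_samples,)

# Estimators accept numpy arrays directly; no conversion step is needed.
model = LinearRegression()
model.fit(X, y)

# Fitted parameters come back as numpy objects too.
coef = model.coef_
intercept = model.intercept_
print(coef, intercept)
```

The same `fit(X, y)` pattern, with `X` shaped `(n_samples, n_features)`, carries over to most other estimators in the package.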

This notebook is really just an aggregation of some of the things from the sklearn documentation that I thought were interesting and/or useful. The [documentation](http://scikit-learn.org/stable/documentation.html) is quite well organized in my opinion, so if you are curious about the layout of scikit-learn or whether it has the specific capabilities you're interested in, go check it out!

## When should you use scikit-learn?

Use it when you have data with many samples that you can formulate as a learning problem. In other words, you'd like to find patterns within that data and/or predict properties of unknown data. The [documentation](http://scikit-learn.org/stable/tutorial/basic/tutorial.html) breaks down learning problems into the following categories:

 - **Supervised learning**
   
   You have input data and corresponding properties (labels or target values), and you'd like to learn the
   relationship between them so you can predict the properties of unknown data. This can fall into a couple of
   different sub-categories:
   
    - **Classification**: The output falls into two or more discrete categories; classifying hand-written digits,
    for example.
    
    - **Regression**: Similar to classification, but the output is a continuous variable rather than 
    group membership. An example of a regression problem might be predicting shoe size based on age, 
    weight, and gender.
        
 - **Unsupervised learning**
 
   You have input data, but no corresponding labels or dependent variable. Instead of trying to use the data to 
   predict some output for unknown data, you're more interested in finding structure in the data that is not 
   immediately apparent. For example, you might be interested in 
   detecting clusters in the input data, or in reducing dimensionality to identify the components that account for
   the most variability in your data or to aid in visualization.
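Here is a rough sketch of both kinds of problem on the hand-written digits dataset that ships with `sklearn`. The particular estimators and settings (`SVC` with `gamma=0.001`, `PCA` down to two components, `KMeans` with ten clusters) are my own picks for illustration rather than recommendations, and the imports assume a recent scikit-learn where `train_test_split` lives in `sklearn.model_selection`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()
X, y = digits.data, digits.target   # X: (n_samples, 64) pixel values, y: digit labels

# --- Supervised: classification ---
# Hold out some labeled data so we can check predictions on samples the model never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
clf = SVC(gamma=0.001)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)   # fraction of held-out digits labeled correctly

# --- Unsupervised: no labels are passed to either estimator below ---
# Dimensionality reduction: project the 64 pixel features onto 2 components for plotting.
X_2d = PCA(n_components=2).fit_transform(X)

# Clustering: group the samples into 10 clusters using only the pixel values.
cluster_ids = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)

print(accuracy, X_2d.shape, cluster_ids[:10])
```

Note that the cluster ids are arbitrary integers; they are not digit labels, so cluster 3 need not correspond to the digit 3.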
   
### Summary
`sklearn` may be a good fit for cases where:

1. You have moderately sized data (>50 samples, but not "big data")
  - For instance, the fit method of the `svm.SVC` classifier scales worse than quadratically with the number of
    samples, so the documentation recommends limiting this classifier to datasets with fewer than a few tens of
    thousands of samples.
<img src=http://cdn.meme.am/instances/500x/47510205.jpg>

2. You want to explore data with off-the-shelf machine learning algorithms that offer good performance and a low learning curve
  - `sklearn` implements quite a few regression, classification, and unsupervised learning methods, but is not
     all-inclusive. The goal of `sklearn` is to make popular machine learning approaches accessible rather than 
     implementing every algorithm out there.
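If your dataset is larger than `svm.SVC` comfortably handles, one simple workaround (my suggestion, not something the documentation prescribes) is to fit on a random subsample. The sketch below uses synthetic data from `make_classification`, and `MAX_SVC_SAMPLES` is a hypothetical cap that I have set deliberately low so the example runs quickly; a realistic cap would sit somewhere in the tens of thousands.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Hypothetical cap, kept small here only so the sketch runs fast.
MAX_SVC_SAMPLES = 2000

# Synthetic stand-in for a dataset that is "too big" for SVC.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

rng = np.random.RandomState(0)
if X.shape[0] > MAX_SVC_SAMPLES:
    # Draw row indices without replacement so no sample is used twice.
    keep = rng.choice(X.shape[0], size=MAX_SVC_SAMPLES, replace=False)
    X_fit, y_fit = X[keep], y[keep]
else:
    X_fit, y_fit = X, y

clf = SVC()
clf.fit(X_fit, y_fit)

# Prediction is much cheaper than fitting, so scoring on all samples is fine.
full_accuracy = clf.score(X, y)
print(X_fit.shape, full_accuracy)
```

Subsampling throws away data, of course, so it is worth checking whether the score changes much as you raise the cap.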

### Drawbacks and Caveats

`sklearn` is not a silver bullet for all types of data and every application. The documentation readily states that the goal of `sklearn` is to do a few things 