# Introduction to Data Science
## A Machine Learning Review
***

Over the course of the semester we have discussed many data science techniques and explored a lot of Python code for putting them to use. The goal of this notebook is to review some of these concepts while tying them together. We will also review some of the Python code we used to apply what we've learned to real data.

## 1. A (Quick) Discussion on Python and Package Control
Python makes use of *many* packages to do a wide range of tasks. Some of these packages are maintained by the same people who work on the Python programming language. Others are created by third-party teams. There are packages for basic tasks like simple math and telling time. Other packages are used mainly for handling data, scientific computing, or machine learning.

### 1.1 Data Structures
In addition to storing strings, integers, and decimal numbers (floats) in Python, we have been using two main data structures: Python *lists* and *dictionaries*. I want to explain the difference between these two very briefly.

Lists and dictionaries both map a key to a value: given a key (a location), you can retrieve a value. A list uses ordered integer positions as its keys. A dictionary uses arbitrary keys, such as strings, which have no positional order. For example,

In [1]:
my_list = [5, 2, 6, 1, 7]
my_list_2 = ['bob', 'natalia', 'panos', 'michelle', 'foster']
print "First list: %s" % str(my_list)
print "Second list: %s" % str(my_list_2)
print "The first (0'th) element in the first list is: %d" % my_list[0]
print "The first (0'th) element in the second list is: %s" % my_list_2[0]
print "The last element in the first list is : %d" % my_list[-1]
print "The last element in the second list is : %s" % my_list_2[-1]

First list: [5, 2, 6, 1, 7]
Second list: ['bob', 'natalia', 'panos', 'michelle', 'foster']
The first (0'th) element in the first list is: 5
The first (0'th) element in the second list is: bob
The last element in the first list is : 7
The last element in the second list is : foster


In [2]:
my_dictionary = {'bob': 5, 'natalia': 2, 'panos': 6, 'michelle': 1, 'foster': 7}
print my_dictionary
print "The value for 'bob' is %d" % my_dictionary['bob']
print "The value for 'panos' is %d" % my_dictionary['panos']

{'michelle': 1, 'bob': 5, 'foster': 7, 'panos': 6, 'natalia': 2}
The value for 'bob' is 5
The value for 'panos' is 6


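Notice that the list printed in order while the dictionary did not! I **cannot** refer to the first item of a dictionary by position. In the Python 2 used here a dictionary has no order at all. Since Python 3.7 dictionaries remember insertion order, but you still look values up by key, not by position.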
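Here is a minimal sketch of that difference. It assumes Python 3 (so `print` is a function, unlike in the Python 2 cells above), and the variable names are only illustrative. A list is indexed by position, a dictionary by key, and if you want a predictable order out of a dictionary you can sort its keys yourself.

```python
names = ['bob', 'natalia', 'panos', 'michelle', 'foster']
scores = {'bob': 5, 'natalia': 2, 'panos': 6, 'michelle': 1, 'foster': 7}

# Lists are indexed by integer position.
first_name = names[0]

# Dictionaries are indexed by key, not by position.
bob_score = scores['bob']

# scores[0] would raise a KeyError because 0 is not one of the keys.
has_position_zero = 0 in scores

# To walk a dictionary in a predictable order, sort the keys explicitly.
ordered_pairs = [(name, scores[name]) for name in sorted(scores)]

print(first_name)
print(bob_score)
print(has_position_zero)
print(ordered_pairs)
```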

### 1.2 Packages

#### 1.2.1. General Use
These packages are used mainly to coordinate and structure your Python code. You can use `time` and `datetime` to keep track of how long it takes to run certain tasks or to format dates and times. The `os` and `sys` packages let you make calls to the computer and access programs outside of Python (e.g. the command line!). You can use `math` to do mathematical operations slightly more advanced than addition, subtraction, etc. (e.g. logarithms and square roots). The `re` package lets you search text with regular expressions.

In [20]:
import time
import datetime
import os
import sys
import math
import re
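As a rough sketch of what a few of these packages look like in use (assuming Python 3; the example date and sentence are made up for illustration):

```python
import datetime
import math
import re
import time

# Time a small task with time.time().
start = time.time()
total = sum(math.sqrt(i) for i in range(100000))
elapsed = time.time() - start
print("Summing square roots took %.4f seconds" % elapsed)

# Format a date with datetime (an arbitrary example date).
day = datetime.date(2015, 5, 1)
formatted = day.strftime("%Y-%m-%d")
print(formatted)

# math covers operations beyond basic arithmetic, such as logarithms.
log_value = math.log(math.e ** 2)
print(log_value)

# re finds patterns in text; here, every run of digits.
numbers = re.findall(r"\d+", "rob is 25 and foster is 91")
print(numbers)
```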

#### 1.2.2. Data Handling
Python comes with packages for reading `csv` and `json` files natively. If you want to use something with more features, `pandas` is useful for creating data frames (a common data structure used in data science and machine learning). Some of you may be dealing with HTML data from web pages and will find Beautiful Soup 4 (`bs4`) useful.

In [2]:
import csv
import json
import pandas as pd
# import bs4

You may notice that when I imported pandas I decided to call it `pd`. This isn't necessary but is commonly used to give long packages a shorter name so that typing and reading them is easier.
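For the built-in `csv` and `json` packages, here is a small sketch. It assumes Python 3, and the data is typed in as strings (wrapped in `io.StringIO` for the csv case) instead of being read from a file, so the names and values are placeholders.

```python
import csv
import io
import json

# A tiny csv "file" held in memory; the first line is the header.
csv_text = "name,age\nrob,25\nfoster,91\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# The csv package returns every value as a string, so convert as needed.
ages = [int(row['age']) for row in rows]
print(ages)

# JSON parses into Python lists and dictionaries, with types preserved.
json_text = '[{"name": "rob", "age": 25}, {"name": "foster"}]'
records = json.loads(json_text)
print(records[0]['age'])

# Unlike csv rows, JSON records do not need to share the same keys.
print('age' in records[1])
```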

#### 1.2.3. Scientific Computing
The `numpy` and `scipy` packages are probably two of the most popular Python packages. They give you the ability to use arrays and matrices (both dense and sparse). They also provide a ton of basic operations (max, min, argmax, argmin, etc.). For those of you with Matlab experience, you may notice a lot of similarity, as many of the function names and conventions in scipy and numpy resemble Matlab's.

In [5]:
import numpy as np
import scipy

Many people have asked why I use `np.max([1,2,3,4])` instead of just using Python's default function, `max([1,2,3,4])`. The answer is... I just happened to use the numpy version :) You can use whichever you like.
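A quick sketch of the comparison, assuming Python 3 and a small made-up array: on a flat list the two give the same answer, while the numpy version also understands multi-dimensional arrays and an `axis` argument.

```python
import numpy as np

values = [1, 2, 3, 4]

# On a plain list both functions agree.
builtin_max = max(values)
numpy_max = np.max(values)
print(builtin_max == numpy_max)

# numpy can also take the max along one axis of a 2-D array.
grid = np.array([[1, 9],
                 [7, 3]])
column_max = np.max(grid, axis=0)  # max of each column
row_max = np.max(grid, axis=1)     # max of each row
print(column_max)
print(row_max)

# argmax returns the position of the largest value in the flattened array.
flat_argmax = int(np.argmax(grid))
print(flat_argmax)
```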

#### 1.2.4. Machine Learning
The package we have been using all semester to do machine learning, scikit-learn (`sklearn`), is one of the most popular machine learning packages currently in use. Throughout the semester you have probably noticed that we have been using a *ton* of different functions and features. The documentation on sklearn is vast.

In [6]:
import sklearn

You may have also noticed that I often do something like

In [7]:
from sklearn import metrics
print "The accuracy is %.2f" % metrics.accuracy_score([1,1], [1,1])

The accuracy is 1.00


But I could also do something like this

In [8]:
from sklearn.metrics import accuracy_score
print "The accuracy is %.2f" % accuracy_score([1,1], [1,1])

The accuracy is 1.00


There is no correct way of doing it. It's just a matter of preference.

## 2. The Data Science Workflow
&nbsp;
<div style="float: left; width: 50%">
We've talked about the "data science workflow" a lot throughout the semester, but I just want to remind everyone of what it looks like.

<ol style="padding: 20px 0;">
<li>Business understanding</li>
<li>Data understanding</li>
<li>Data Preparation</li>
<li>Modeling</li>
<li>Evaluation</li>
<li>Deployment</li>
</ol>

While you have been working a lot on the business understanding phase of your project recently (as well as the others, I hope!), today we are going to focus a bit more on summarizing the hands-on skills you have learned.
</div>
<div style="float: left; width: 40%">
<img src="images/workflow.png" width="100%"/>
</div>

## 3. Data Exploration and Cleaning
We talked about this in our very first class and went on to mention it a few more times throughout the semester. However, I'd like to review some of it again given some common questions I've been getting.

### 3.1. Structured Data
Almost all of the data we have dealt with so far can be called *structured* data. This means that every record in the data set is organized and structured in some machine readable way. The two most popular ways of storing structured data are:

- **.csv or .tsv** - Can be thought of as rows and columns, where each column will represent a single feature. All rows must have something for each column.
- **JSON** - Looks similar to Python dictionaries. Each record can hold an unordered collection of `key: value` pairs, and different records may have different keys.

The layout of any of these data types might seem straightforward, but there can be tons of complications. A file ending in `.csv` is *not* guaranteed to be well structured. It is still just a text file.

In [3]:
!head data/strings_ugly.csv

rob,What if, in the middle, there is a (")?,25
foster,Some string here,91



In [19]:
data = pd.read_csv("data/strings_ugly.csv")
data

Unnamed: 0,rob,What if,in the middle,"there is a ("")?",25
0,foster,Some string here,91,,


That doesn't look anywhere close to what it should be. We can explicitly tell it to expect three columns.

In [None]:
# This is going to kill the notebook :(
data = pd.read_csv("data/strings_ugly.csv", names=['name', 'sentence', 'age'])
data

Again, that can't be right. If you look at the data you'll see that there are commas inside one of the fields. We can wrap that field in quotes so the parser treats it as a single value.

In [5]:
!head data/strings_quoted.csv

rob,"What if, in the middle, there is a (")?",25
foster,"Some string here",91



In [7]:
data = pd.read_csv("data/strings_quoted.csv", names=['name', 'sentence', 'age'], quotechar="\"")
data

Unnamed: 0,name,sentence,age
0,rob,"What if, in the middle, there is a ()?""",25
1,foster,Some string here,91


Getting closer, but it looks like we also have quotes in the string. We have to escape them.

In [9]:
!head data/strings_escaped.csv

rob,"What if, in the middle, there is a (\")?",25
foster,"Some string here",91



In [18]:
data = pd.read_csv("data/strings_escaped.csv", names=['name', 'sentence', 'age'], quotechar="\"", escapechar="\\")
data

Unnamed: 0,name,sentence,age
0,rob,"What if, in the middle, there is a ("")?",25
1,foster,Some string here,91


This can go on for a very long time until you find all the small nuances of your data file. Notice that we keep adding levels of complexity to our parser. Doing this at the command line is very tricky, which is why pandas and `read_csv()` are so nice. Unfortunately, a lot of the problems we just saw are solved by editing the raw data to conform to some kind of standard. Hopefully, most of your project data is already in a usable state!
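The same quoting and escaping rules can be sketched with the built-in `csv` package. This is only an illustration under assumptions: it is written for Python 3, and the two rows are typed in as a string instead of being read from `data/strings_escaped.csv`.

```python
import csv
import io

# The two rows from the escaped file, held in memory as one string.
raw_text = (
    'rob,"What if, in the middle, there is a (\\")?",25\n'
    'foster,"Some string here",91\n'
)

# quotechar wraps fields that contain commas;
# escapechar marks a quote character that belongs to the field itself.
reader = csv.reader(
    io.StringIO(raw_text),
    quotechar='"',
    escapechar='\\',
    doublequote=False,
)
rows = list(reader)

for name, sentence, age in rows:
    print("%s | %s | %s" % (name, sentence, age))
```

Each row comes back as exactly three fields, with the comma and the quote kept inside the sentence.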

### 3.2. Engineering

### 3.3. Unstructured Data
What about data that is highly unstructured? For example, web pages are a jumble of HTML tags. The formats between pages can be drastically different.
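As a very rough sketch of pulling text out of HTML, here is an example using Python 3's built-in `html.parser` instead of Beautiful Soup. The `TextExtractor` class and the sample page are made up for illustration, and real pages will need far more care.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the visible text between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        text = data.strip()
        if text:
            self.chunks.append(text)


page = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>"

parser = TextExtractor()
parser.feed(page)
parser.close()

extracted = " ".join(parser.chunks)
print(extracted)
```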

## 4. Modeling
We've covered two different methods of modeling: supervised and unsupervised learning.

### 4.1. Supervised
Most of what we've done so far this semester involves having **labeled** data. For these data, we have a set of records where we know the value of the target variable. This allows us to learn some relationship between our feature set and the target variable. We've covered five machine learning algorithms that can do this. Here is a brief, and in no way comprehensive, overview.

<table>
<tr><td>Model</td>
<td>Overview</td>
<td>Pros</td>
<td>Cons</td>
<td>Use Case</td></tr>

<tr><td>Tree Based</td>
 <td>Will create splits on any feature that gives maximum **information gain**.</td>
 <td>- Can create multiple separating planes</td>
 <td>- Separating planes must be perpendicular to a feature axis<br />
     - Prone to overfitting</td>
 <td>- Data with many categorical features</td></tr>
 
<tr><td>Logistic Regression</td>
 <td>Creates a hyperplane that can separate the data with the smallest **loss**.</td>
 <td></td>
 <td>- No closed form solution</td>
 <td>- Data with many records</td></tr>


<tr><td>SVM</td>
 <td>Creates a hyperplane that can separate the data with the maximal **margin**.</td>
 <td>- Kernel trick</td>
 <td>- No closed form solution</td>
 <td></td></tr>

<tr><td>Naive Bayes</td>
 <td>Uses simple counts to calculate **conditional probabilities**.</td>
 <td>- Fast training</td>
 <td>- Treats all features as independent</td>
 <td>- Text data</td></tr>

<tr><td>k-NN</td>
 <td>Creates a **cluster** of the k-closest records and assigns majority label.</td>
 <td>- Works with any number of labels</td>
 <td></td>
 <td></td></tr>
</table>
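To make the **information gain** idea from the tree row a bit more concrete, here is a minimal sketch, assuming Python 3 and entropy (in bits) as the impurity measure. The function names and the toy labels are made up for illustration, and this is not how sklearn implements it internally.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ['yes', 'yes', 'no', 'no']

# A split that separates the labels perfectly.
perfect_split = [['yes', 'yes'], ['no', 'no']]

# A split that leaves each group as mixed as the parent.
useless_split = [['yes', 'no'], ['yes', 'no']]

gain_perfect = information_gain(parent, perfect_split)
gain_useless = information_gain(parent, useless_split)
print(gain_perfect)
print(gain_useless)
```

A tree prefers the first split because it removes all of the uncertainty about the label, while the second removes none.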

All of these algorithms have an implementation in sklearn. Some algorithms, like SVM, have multiple implementations. Let's import one implementation of each.

In [21]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
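Every one of these classes follows the same `fit` / `predict` pattern. Here is a small sketch, assuming Python 3 and a made-up toy data set with two well-separated groups. The settings (`random_state=0`, `n_neighbors=3`) are arbitrary choices for the example, not recommendations.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier

# Toy data: label 0 sits near (0, 0), label 1 sits near (5, 5).
X_train = [[0, 0], [0, 1], [1, 0], [1, 1],
           [5, 5], [5, 6], [6, 5], [6, 6]]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]

# Two new records, one from each group.
X_new = [[0, 0], [6, 6]]

models = {
    'tree': DecisionTreeClassifier(random_state=0),
    'logistic regression': LogisticRegression(),
    'linear SVM': LinearSVC(),
    'naive Bayes': BernoulliNB(),
    'k-NN': KNeighborsClassifier(n_neighbors=3),
}

predictions = {}
for name in sorted(models):
    model = models[name]
    model.fit(X_train, y_train)  # learn from the labeled records
    predictions[name] = model.predict(X_new).tolist()
    print("%s predicts %s" % (name, predictions[name]))
```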

### 4.2. Unsupervised
A

#### 4.2.1. Distance Measurements
A

#### 4.2.2. Clustering
A

## 5. Validation
A

### 5.1. Accuracy
A

### 5.2. Receiver Operating Characteristic (ROC)
A

### 5.3. Cross Validation (CV)