# Introduction to Data Science
## A Machine Learning Review
***

Over the course of the semester we have discussed many data science techniques and explored a lot of Python code for putting them to use. The goal of this notebook is to review some of these concepts while tying them together. We will also review some of the Python code we used to apply what we've learned to real data.

## 1. A (Quick) Discussion on Python and Package Control
Python makes use of *many* packages to do a wide range of tasks. Some of these packages are maintained by the same people who work on the Python programming language. Others are created by third-party teams. There are packages for basic tasks like simple math and telling time. Other packages are used mainly for handling data, scientific computing, or machine learning.

### 1.1 Data Structures
In addition to storing strings, integers, and decimal numbers (floats) in Python, we have been using two main data structures: Python *lists* and *dictionaries*. I want to explain the difference between these two very briefly.

Lists and dictionaries are both key-value stores: given a key (location) you can retrieve a value. A list uses ordered integer positions as its keys. A dictionary uses unordered keys that you choose yourself (strings, for example). For example,

In [7]:
my_list = [5, 2, 6, 1, 7]
print(my_list)
print("The first (0'th) element in the list is: %d" % my_list[0])
print("The last element in the list is : %d" % my_list[-1])

[5, 2, 6, 1, 7]
The first (0'th) element in the list is: 5
The last element in the list is : 7


In [8]:
my_dictionary = {'bob': 5, 'nikita': 2, 'panos': 6, 'michelle': 1, 'foster': 7}
print(my_dictionary)
print("The value for 'bob' is %d" % my_dictionary['bob'])
print("The value for 'panos' is %d" % my_dictionary['panos'])

{'michelle': 1, 'bob': 5, 'foster': 7, 'nikita': 2, 'panos': 6}
The value for 'bob' is 5
The value for 'panos' is 6


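Notice that the list printed in order while the dictionary did not! I **cannot** refer to the first item of a dictionary by position. There is no order to rely on! (This output comes from Python 2. Since Python 3.7, dictionaries remember insertion order, but you still look values up by key, not by position.)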
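
To make the difference concrete, here is a small sketch that reuses the values from the cells above. The names `scores_list` and `scores_dict` are only illustrative. It shows that position `0` is a valid index for a list but not a valid key for this dictionary.

```python
# Illustrative names; the values mirror the examples above.
scores_list = [5, 2, 6, 1, 7]
scores_dict = {'bob': 5, 'nikita': 2, 'panos': 6, 'michelle': 1, 'foster': 7}

# A list is indexed by position, so position 0 exists for any non-empty list.
first_in_list = scores_list[0]

# A dictionary is indexed by key. 0 is not one of its keys, so the lookup fails.
try:
    scores_dict[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# The `in` operator and .get() are safe ways to ask about a key.
has_bob = 'bob' in scores_dict
missing_default = scores_dict.get('alice', 0)

print("First list element: %d" % first_in_list)
print("Did scores_dict[0] fail? %s" % lookup_failed)
print("Is 'bob' a key? %s" % has_bob)
print("Default for a missing key: %d" % missing_default)
```

Using `.get()` with a default is handy when you are not sure a key exists and do not want your code to stop with an error.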

### 1.2 Packages

#### 1.2.1. General Use
These packages are used mainly to coordinate and structure your Python code. You can use `time` and `datetime` to keep track of how long it takes to run certain tasks or to format dates and times. The `os` and `sys` packages let you make calls to the computer and access programs outside of Python (e.g. the command line!). You can use `math` to do mathematical operations slightly more advanced than addition, subtraction, etc. (e.g. exponentials, logarithms, and square roots).

In [1]:
import time
import datetime
import os
import sys
import math
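
As a rough sketch of how these packages are typically used, the cell below times a small task, formats a date, asks the operating system for the working directory, and calls a `math` function. The task being timed and the example date are arbitrary choices made for illustration.

```python
import time
import datetime
import os
import math

# Time a small task: read the clock before and after.
start = time.time()
total = sum(math.sqrt(i) for i in range(100000))
elapsed = time.time() - start

# Format a fixed date (an arbitrary example) as year-month-day.
day = datetime.date(2015, 5, 1)
formatted = day.strftime("%Y-%m-%d")

# Ask the operating system for the current working directory.
cwd = os.getcwd()

print("Summing 100000 square roots took %.4f seconds" % elapsed)
print("Formatted date: %s" % formatted)
print("Working directory: %s" % cwd)
print("e squared is about %.3f" % math.exp(2))
```

The elapsed time will differ from run to run and from machine to machine, which is exactly why measuring it is useful.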

#### 1.2.2. Data Handling
Python comes with packages for reading `csv` and `json` files natively. If you want to use something with more features, `pandas` is useful for creating data frames (a common data structure used in data science and machine learning). Some of you may be dealing with HTML data from web pages and will find Beautiful Soup 4 (`bs4`) useful.

In [5]:
import csv
import json
import pandas as pd
# import bs4

You may notice that when I imported pandas I decided to call it `pd`. This isn't necessary, but it is a common way to give long package names a shorter alias so that typing and reading them is easier.
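
Here is a minimal sketch of the built-in `csv` and `json` readers. To keep it self-contained, it parses a made-up list of lines and a made-up JSON string instead of real files; in practice you would pass an open file object.

```python
import csv
import json

# csv.DictReader accepts any iterable of text lines, so a list stands in for a file.
csv_lines = ["name,score", "bob,5", "nikita,2"]
rows = list(csv.DictReader(csv_lines))

# json.loads parses a JSON string into Python lists and dictionaries.
json_text = '[{"name": "bob", "score": 5}, {"name": "nikita", "score": 2}]'
records = json.loads(json_text)

# CSV values always arrive as strings; JSON keeps numbers as numbers.
print("CSV score for bob: %r" % rows[0]["score"])
print("JSON score for bob: %r" % records[0]["score"])
```

Because every CSV value is read as a string, remember to convert numeric columns (for example with `int()` or `float()`) before doing math on them.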

#### 1.2.3. Scientific Computing
The `numpy` and `scipy` packages are two of the most popular Python packages. They give you the ability to use arrays and matrices (both dense and sparse), along with a ton of basic operations (max, min, argmax, argmin, etc.). For those of you with Matlab experience, you may notice a lot of similarity, as numpy and scipy offer much of the same array-oriented functionality.

In [4]:
import numpy as np
import scipy

Many people have asked why I use `np.max([1,2,3,4])` instead of just using Python's default function, `max([1,2,3,4])`. The answer is... I just happened to use the numpy version :) You can use whichever you like.
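
A short sketch, using made-up numbers, of how the two compare. On a plain list they agree. The numpy versions add extras such as `argmax`/`argmin` and the ability to work along one axis of a 2-D array.

```python
import numpy as np

values = [3, 9, 4, 1]

# On a plain list, the built-in and the numpy version agree.
builtin_max = max(values)
numpy_max = np.max(values)

# argmax / argmin return the position of the largest / smallest value.
largest_at = np.argmax(values)
smallest_at = np.argmin(values)

# The numpy version also understands 2-D arrays, e.g. the max of each column.
grid = np.array([[1, 8], [6, 2]])
column_max = np.max(grid, axis=0)

print("Built-in max: %d, numpy max: %d" % (builtin_max, numpy_max))
print("Largest value is at position %d" % largest_at)
print("Smallest value is at position %d" % smallest_at)
print("Max of each column: %s" % column_max.tolist())
```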

#### 1.2.4. Machine Learning
The package we have been using all semester to do machine learning, scikit-learn (`sklearn`), is one of the most popular machine learning packages currently in use. Throughout the semester you have probably noticed that we have been using a *ton* of different functions and features. The documentation on sklearn is vast.

In [None]:
import sklearn

You may have also noticed that I often do something like

In [11]:
from sklearn import metrics
print("The accuracy is %.2f" % metrics.accuracy_score([1,1], [1,1]))

The accuracy is 1.00


But I could also do something like this

In [13]:
from sklearn.metrics import accuracy_score
print("The accuracy is %.2f" % accuracy_score([1,1], [1,1]))

The accuracy is 1.00


There is no correct way of doing it. It's just a matter of preference.
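
As a small sketch of why this is only a matter of preference, the cell below uses the standard-library `math` module as a stand-in for `sklearn` (so it runs without any extra installs) and checks that every import style gives you the very same function.

```python
import math                 # style 1: import the whole package
import math as m            # style 2: import it under a shorter alias
from math import sqrt       # style 3: import one function directly

# All three names refer to the same function object.
same_function = (math.sqrt is m.sqrt) and (m.sqrt is sqrt)

print("math.sqrt(16) = %.1f" % math.sqrt(16))
print("m.sqrt(16)    = %.1f" % m.sqrt(16))
print("sqrt(16)      = %.1f" % sqrt(16))
print("Same function behind all three names? %s" % same_function)
```

The main trade-off is readability: `math.sqrt` makes it obvious where a function comes from, while a bare `sqrt` is shorter to type.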

## 2. The Data Science Workflow
&nbsp;
<div style="float: left; width: 50%">
We've talked about the "data science workflow" a lot throughout the semester, but I just want to remind everyone of what it looks like.

<ol style="padding: 20px 0;">
<li>Business understanding</li>
<li>Data understanding</li>
<li>Data preparation</li>
<li>Modeling</li>
<li>Evaluation</li>
<li>Deployment</li>
</ol>

While you have been working a lot on the business understanding phase of your project recently (as well as the others, I hope!), today we are going to focus a bit more on summarizing the hands-on skills you have learned.
</div>
<div style="float: left; width: 40%">
<img src="images/workflow.png" width="100%"/>
</div>

## 3. Data Cleaning
We talked about this in our very first class and mentioned it a few more times throughout the semester. However, I'd like to review some of it again given some common questions I've been getting.

### 3.1. Structured Data
Almost all of the data we have dealt with so far can be called *structured* data. This means that every record in the data set is organized in the same way. It might be comma or tab separated, and we know that each column corresponds to one feature. Another popular type of structured data is JSON data.

### 3.2. Unstructured Data
A

### 3.3. Command Line Interface (CLI)
I won't say much here other than that the command line gives you a great way to start exploring your data. Remember you can use `head` and `tail` to start looking at files. Using `cut`, `sort`, `uniq`, and `wc` will take care of many common tasks you have during the exploration portion of your projects.

### 3.4. Python
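As a rough Python counterpart to the command-line tools above, the sketch below mimics `head`, `tail`, `wc -l`, and a `cut | sort | uniq -c` pipeline. The sample lines are made up for illustration; with a real file you would read the lines from an open file object instead.

```python
from collections import Counter

# Made-up sample data: name, city, value. A real file would replace this list.
lines = [
    "bob,nyc,5",
    "nikita,nyc,2",
    "panos,athens,6",
    "michelle,nyc,1",
    "foster,boston,7",
]

# Like `head -n 2`: the first two lines.
first_two = lines[:2]

# Like `tail -n 1`: the last line.
last_one = lines[-1:]

# Like `wc -l`: the number of lines.
line_count = len(lines)

# Like `cut -d',' -f2 | sort | uniq -c`: count each value in the second column.
cities = [line.split(",")[1] for line in lines]
city_counts = Counter(cities)

print("First two lines: %s" % first_two)
print("Last line: %s" % last_one)
print("Number of lines: %d" % line_count)
for city, count in sorted(city_counts.items()):
    print("%s appears %d time(s)" % (city, count))
```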


## 4. Modeling
A

### 4.1. Supervised
A

#### 4.1.1. Tree Based Models
A

#### 4.1.2. Logistic Regression
A

#### 4.1.3. SVM
A

#### 4.1.4. Naive Bayes
A

### 4.2. Unsupervised
A

#### 4.2.1. Distance Measurements
A

#### 4.2.2. Clustering
A

## 5. Validation
A

### 5.1. Accuracy
A

### 5.2. Receiver Operating Characteristic (ROC)
A

### 5.3. Cross Validation (CV)