# Finals Review

In [1]:
from pandas import Series, DataFrame
import pandas as pd
import statsmodels.api as sm
%pylab inline

Populating the interactive namespace from numpy and matplotlib


## Exam Structure

* Type up your answers in one Jupyter notebook
* **All** material covered so far in class lectures is fair game!
    * Basic python
    * Pandas
    * Classification and Clustering
* Exam is open-notes, open-book, open-laptop, open-Google.

* Some coding questions
    * Give a brief explanation
    * Show the code
    * Don't write huge blocks of code; use functions
    * Show the result
* Some conceptual questions
    * Give your answer
    * and a brief explanation

## How to think about a Pandas question

### Do something _if some condition holds_ ==> **a mask**

```
mask = (some condition)
df[mask]
```
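
As a concrete illustration, here is a minimal sketch on a made-up dataframe. The column names `district`, `dem`, and `rep` and all the numbers are assumptions for the example, not course data.

```python
import pandas as pd

# Hypothetical data: one row per district, with made-up counts.
df = pd.DataFrame({
    'district': ['A', 'B', 'C', 'D'],
    'dem': [120, 80, 95, 60],
    'rep': [100, 90, 95, 75],
})

# The mask is a boolean Series with one True/False entry per row.
mask = (df['dem'] > df['rep'])

# Indexing with the mask keeps only the rows where it is True.
more_dems = df[mask]
print(more_dems)
```

The same pattern works for assignment, e.g. `df.loc[mask, 'some column'] = value` changes only the rows selected by the mask.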

There are two common cases:

1. all the information you need for the condition exists in some columns
    * _"if there are more democrats than republicans"_
    * and you have a column for number of democrats, and another for number of republicans

    `mask = (df['dem'] > df['rep'])`

2. the information you need for the condition is in different pieces
    * _"if higher rating for this particular Dennys/Wendys/etc. than for the chain in general"_
    * DataFrame df has rating for each restaurant

For the second case:

* Create all the pieces:
    * e.g., average rating for each chain ==> **groupby or pivot_table**
    * say, output is a dataframe called _averageRatings_, with columns being _restaurantName_ and _avgRating_
* **Merge** this into the dataframe
    
```
pd.merge(df, averageRatings,
         left_on='restaurantName',
         right_on='restaurantName')
```

* Now, all information is available in the columns of the merged dataframe
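
The steps for the second case can be sketched end to end as follows. The data is made up, and the `rating` column name is an assumption; only `restaurantName` and `avgRating` come from the description above.

```python
import pandas as pd

# Hypothetical data: one row per restaurant location.
df = pd.DataFrame({
    'restaurantName': ['Dennys', 'Dennys', 'Wendys', 'Wendys', 'Wendys'],
    'rating': [3.0, 4.0, 2.0, 3.0, 4.0],
})

# Piece 1: average rating per chain, as a dataframe with columns
# restaurantName and avgRating.
averageRatings = (df.groupby('restaurantName')['rating']
                    .mean()
                    .reset_index()
                    .rename(columns={'rating': 'avgRating'}))

# Merge the averages back so every row carries its chain's average.
merged = pd.merge(df, averageRatings, on='restaurantName')

# Now the condition only involves columns of one dataframe: a mask.
mask = (merged['rating'] > merged['avgRating'])
above_average = merged[mask]
print(above_average)
```

When the key column has the same name on both sides, `on='restaurantName'` is shorthand for passing the same name to `left_on` and `right_on`.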

### Do something for every row

* If the information is in ONE column only
    * _"convert account value column from EURO to USD"_
    * **map**

* If the information exists in multiple columns of the dataframe
    * Can we do it via Series operations?
        * _"multiply the price-per-part column by number-of-parts column"_
        > <code>df['price per part'] * df['number of parts']</code>  
    * If it is too complicated for Series operations
        * transpose the dataframe (rows become columns, and vice versa)
        * then handle each former row as a column (see the next section)
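
The first two situations might look like this in code. This is only a sketch: the data is invented, and `EUR_TO_USD` is a placeholder rate, not a real quote.

```python
import pandas as pd

# Placeholder exchange rate (an assumption for illustration only).
EUR_TO_USD = 1.10

# Hypothetical data.
df = pd.DataFrame({
    'account value': [100.0, 250.0],
    'price per part': [2.5, 4.0],
    'number of parts': [10, 3],
})

# One column only: map applies a function to each value of a Series.
df['account value USD'] = df['account value'].map(lambda eur: eur * EUR_TO_USD)

# Multiple columns, simple arithmetic: Series operations line up row by row.
df['total price'] = df['price per part'] * df['number of parts']
print(df)
```

For row-wise logic that does not fit Series operations, pandas also accepts `df.apply(func, axis=1)`, which passes each row to `func` without an explicit transpose.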

### Do something for each column

* Use **apply**
    * _"Compute maximum - minimum votes for each candidate"_
    * df has one column for each candidate

In [2]:
def functionToApply(s):
    # s is a Series, corresponding to one column of the dataframe
    # For example, s = Series corresponding to Trump
    return s.max() - s.min()

> df.apply(functionToApply)
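
Here is a small runnable version of the same idea. The vote counts and the second candidate column are made up for the example.

```python
import pandas as pd

# Hypothetical vote counts: one row per county, one column per candidate.
df = pd.DataFrame({
    'Trump': [100, 250, 175],
    'Clinton': [300, 120, 180],
})

def functionToApply(s):
    # s is one column of df, passed in as a Series
    return s.max() - s.min()

# apply calls the function once per column and returns a Series
# indexed by the column names.
spread = df.apply(functionToApply)
print(spread)
```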

### Do something for only a few columns

> df[['col1', 'col2', 'col3']].apply(functionToApply)
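
A quick sketch of why the column selection matters, assuming a made-up dataframe that also has a non-numeric `notes` column we want to leave out:

```python
import pandas as pd

# Hypothetical data; 'notes' is a text column that should be skipped.
df = pd.DataFrame({
    'col1': [1, 5, 3],
    'col2': [10, 40, 20],
    'col3': [7, 7, 7],
    'notes': ['a', 'b', 'c'],
})

def functionToApply(s):
    return s.max() - s.min()

# Double brackets select a sub-dataframe; apply then sees only those columns.
result = df[['col1', 'col2', 'col3']].apply(functionToApply)
print(result)
```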

### Do something for each TYPE

* The types in question are the **values** of a particular column
    * _"Compute average votes for Democrats"_
    * df has a column called _Party_, whose values are _Democrats_ or _Republicans_
    * **groupby** or **pivot_table**

* The types are not the values of any column
    * _"Compute the average rating of each kind of chain restaurant"_
    * df only has _restaurantName_ and _rating_
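
The first case can be sketched both ways. The `votes` column name and the numbers are assumptions for the example.

```python
import pandas as pd

# Hypothetical data: the types (parties) are values of the Party column.
df = pd.DataFrame({
    'Party': ['Democrats', 'Republicans', 'Democrats', 'Republicans'],
    'votes': [100, 80, 140, 120],
})

# groupby: split rows by Party, then average the votes within each group.
by_groupby = df.groupby('Party')['votes'].mean()

# pivot_table: the same computation, returned as a dataframe.
by_pivot = pd.pivot_table(df, index='Party', values='votes', aggfunc='mean')

print(by_groupby)
print(by_pivot)
```

`groupby` returns a Series here, while `pivot_table` returns a dataframe with `Party` as the index; pick whichever shape is easier for the next step.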

For the second case:

* Create a new dataframe with columns _restaurantName_ and _chainOrNot_
    * If you get this as a Series, you can create a new dataframe with this Series as the only column
* **Merge** this with the old dataframe
    * So we now have _restaurantName_, _rating_, and _chainOrNot_ as columns
* Now, do **groupby** or **pivot_table**
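
These steps could look like the sketch below. It assumes the set of chain names is known from somewhere else; the `chains` set, the labels `'chain'` and `'independent'`, and all data are hypothetical.

```python
import pandas as pd

# Hypothetical data: df only has restaurantName and rating.
df = pd.DataFrame({
    'restaurantName': ['Dennys', 'Wendys', 'Joes Diner', 'Dennys'],
    'rating': [3.0, 4.0, 5.0, 4.0],
})

# Assumption: we know by some other means which names are chains.
chains = {'Dennys', 'Wendys'}

# Build the type information as a Series indexed by restaurant name...
names = df['restaurantName'].unique()
chain_series = pd.Series(
    ['chain' if n in chains else 'independent' for n in names],
    index=names, name='chainOrNot')

# ...and turn it into a dataframe with that Series as the only column.
lookup = chain_series.to_frame()

# Merge: match df's restaurantName column against the lookup's index.
merged = pd.merge(df, lookup, left_on='restaurantName', right_index=True)

# Now chainOrNot is an ordinary column, so groupby works as before.
avg_by_type = merged.groupby('chainOrNot')['rating'].mean()
print(avg_by_type)
```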

## The basic concepts for all the classifiers

| Classifier | Idea |
|------------|------|
| K Nearest Neighbors | Neighbors (in terms of feature values) tend to have the same class |
| Naive Bayes | Each feature value **independently** affects likelihoods |
| Logistic Regression | The positive and negative classes can be separated by a line/plane/hyperplane |
| Decision Trees | A bunch of axis-aligned splits, where each split looks at only one feature |

### Coding it up

* Create design matrices
* Create the train and test sets 
    * *train_test_split()* if the training and testing sets are not provided separately
* Create model
> model = neighbors.KNeighborsClassifier(n_neighbors=15)
* Fit the model to the data
> model.fit(X_train, y_train)
* Predict on test data
> model.predict(X_test)
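
Put together, the steps above might look like this. It is a sketch on synthetic data from `make_classification`, which stands in for real design matrices; the sample size, `test_size`, and `random_state` values are arbitrary choices for the example.

```python
from sklearn import neighbors
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the design matrices: X (features) and y (labels).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Hold out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Create the model, fit it on the training set, predict on the test set.
model = neighbors.KNeighborsClassifier(n_neighbors=15)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Fraction of test rows classified correctly.
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Swapping in another classifier usually only changes the "create model" line, since scikit-learn models share the same `fit` and `predict` methods.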

## Instructor Survey

* We'll have it at the end of class on Wednesday next week
    * right after project presentations

* Need one volunteer
    * Please pick up survey packets from the Dean’s office (GSB 2.104)
    * Please bring your Photo ID!
    * The packet has instructions on it…