# Classification Warmup

1. Use `pydataset` to load the `voteincome` dataset.

In [23]:
import pandas as pd
from pydataset import data

In [2]:
data('voteincome', show_doc=True)  # prints the documentation and returns None, so no assignment is needed

voteincome

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Sample Turnout and Demographic Data from the 2000 Current Population Survey

### Description

This data set contains turnout and demographic data from a sample of
respondents to the 2000 Current Population Survey (CPS). The states
represented are South Carolina and Arkansas. The data represent only a sample
and results from this example should not be used in publication.

### Usage

    data(voteincome)

### Format

A data frame containing 7 variables ("state", "year", "vote", "income",
"education", "age", "female") and 1500 observations.

`state`

a factor variable with levels equal to "AR" (Arkansas) and "SC" (South
Carolina)

`year`

an integer vector

`vote`

an integer vector taking on values "1" (Voted) and "0" (Did Not Vote)

`income`

an integer vector ranging from "4" (Less than \$5000) to "17" (Greater than
\$75000) denoting family income. See the CPS codebook for more info

In [3]:
df = data('voteincome')

In [4]:
df.head()

Unnamed: 0,state,year,vote,income,education,age,female
1,AR,2000,1,9,2,73,0
2,AR,2000,1,11,2,24,0
3,AR,2000,0,12,2,24,1
4,AR,2000,1,16,4,40,0
5,AR,2000,1,10,4,85,1


In [5]:
df.shape

(1500, 7)

2. Drop the `state` and `year` columns.

In [6]:
df = df.drop(columns=['state', 'year'])

In [7]:
df.head()

Unnamed: 0,vote,income,education,age,female
1,1,9,2,73,0
2,1,11,2,24,0
3,0,12,2,24,1
4,1,16,4,40,0
5,1,10,4,85,1


In [8]:
df.shape

(1500, 5)

3. Split the data into train, validate, and test datasets. 
- We will be predicting whether or not someone votes based on the remaining features.

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
def split_data(df):
    '''
    Take in a DataFrame and split it into train, validate, and test DataFrames,
    stratifying on the target variable, vote.
    Return train, validate, test DataFrames.
    '''
    train_validate, test = train_test_split(df, test_size=.2, random_state=123, stratify=df.vote)
    train, validate = train_test_split(train_validate, 
                                       test_size=.3, 
                                       random_state=123, 
                                       stratify=train_validate.vote)
    return train, validate, test

In [11]:
train, validate, test = split_data(df)

In [12]:
train.shape, validate.shape, test.shape

((840, 5), (360, 5), (300, 5))
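The shapes above follow from the two-step split: 20% of the rows are held out for test, then 30% of what remains goes to validate. The sketch below reproduces that arithmetic with plain integers (it does not call scikit-learn) and, as a rough check on stratification, compares the share of voters in train and test using the class counts shown in the classification reports in this notebook. The helper name `voter_share` is made up for illustration.

```python
# Two-step split arithmetic: 20% test, then 30% of the remainder for validate.
n_total = 1500
n_test = n_total * 20 // 100                 # 300
n_train_validate = n_total - n_test          # 1200
n_validate = n_train_validate * 30 // 100    # 360
n_train = n_train_validate - n_validate      # 840

fractions = {
    "train": n_train / n_total,
    "validate": n_validate / n_total,
    "test": n_test / n_total,
}
print(n_train, n_validate, n_test, fractions)

# Class counts (support) copied from the classification reports in this notebook.
train_counts = {0: 122, 1: 718}
test_counts = {0: 43, 1: 257}


def voter_share(counts):
    """Fraction of rows whose vote value is 1."""
    return counts[1] / (counts[0] + counts[1])


# With stratify=..., the two shares should be close to each other.
print(round(voter_share(train_counts), 3), round(voter_share(test_counts), 3))
```

So the overall proportions are roughly 56% train, 24% validate, and 20% test, with about 85% voters in both train and test.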

In [13]:
# Create X and y versions of each split: y is a Series with just the target variable and X holds all the features.

X_train = train.drop(columns=['vote'])
y_train = train.vote

X_validate = validate.drop(columns=['vote'])
y_validate = validate.vote

X_test = test.drop(columns=['vote'])
y_test = test.vote

4a. Fit a k-neighbors classifier on the training data. Use 4 for your number of neighbors. 

In [14]:
from sklearn.neighbors import KNeighborsClassifier
# create the model
knn = KNeighborsClassifier(n_neighbors=4)

#fit the model
knn.fit(X_train, y_train)

KNeighborsClassifier()

In [15]:
# Make Predictions
y_pred = knn.predict(X_train)

In [16]:
# Estimate Probability
y_pred_proba = knn.predict_proba(X_train)

4b. How accurate is your model? 

In [17]:
# Compute Accuracy
print('Accuracy of KNN classifier on training set: {:.2f}'
     .format(knn.score(X_train, y_train)))                      

Accuracy of KNN classifier on training set: 0.87


In [18]:
from sklearn.metrics import confusion_matrix
# Print confusion Matrix

print(confusion_matrix(y_train, y_pred))

[[ 76  46]
 [ 65 653]]


In [19]:
from sklearn.metrics import classification_report
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.54      0.62      0.58       122
           1       0.93      0.91      0.92       718

    accuracy                           0.87       840
   macro avg       0.74      0.77      0.75       840
weighted avg       0.88      0.87      0.87       840



4c. How does it perform on the validate data?

In [30]:
print('Accuracy of KNN classifier on validate set: {:.2f}'
     .format(knn.score(X_validate, y_validate)))

Accuracy of KNN classifier on validate set: 0.77


5. Try out these values for k: 1, 2, 3, and 4. 

In [31]:
metrics = []

# loop through different values of k
for k in range(1, 5):
            
    # create the model
    knn = KNeighborsClassifier(n_neighbors=k)

    # fit the model (remember: fit only on the training data)
    knn.fit(X_train, y_train)

    # use the model (calculate accuracy on train and validate)
    train_accuracy = knn.score(X_train, y_train)
    validate_accuracy = knn.score(X_validate, y_validate)
    
    output = {
        "k": k,
        "train_accuracy": train_accuracy,
        "validate_accuracy": validate_accuracy
    }
    
    metrics.append(output)

# make a dataframe
results = pd.DataFrame(metrics)

In [32]:
results

Unnamed: 0,k,train_accuracy,validate_accuracy
0,1,0.971429,0.827778
1,2,0.902381,0.736111
2,3,0.892857,0.797222
3,4,0.867857,0.769444


Which gives the best accuracy? 

- The best accuracy for train is the model with 1 neighbor at 97%. With k=1 each training row is usually its own nearest neighbor, so a high train score is expected and the model is likely overfit.

Which gives the best accuracy on the validate data set?

- The best accuracy for validate is also the model with 1 neighbor, at 83%.

- Models with 1 and 2 neighbors had the largest gaps between train and validate, dropping about 14 and 17 percentage points respectively.
- Models with 3 and 4 neighbors had the smallest gaps, each dropping about 10 percentage points.
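The train-to-validate gaps can be checked directly from the results table. This is a minimal sketch in plain Python that copies the accuracy values from the table above into a list of dicts (the name `results_rows` is made up here) and subtracts validate accuracy from train accuracy for each k.

```python
# Accuracy values copied from the results table above.
results_rows = [
    {"k": 1, "train_accuracy": 0.971429, "validate_accuracy": 0.827778},
    {"k": 2, "train_accuracy": 0.902381, "validate_accuracy": 0.736111},
    {"k": 3, "train_accuracy": 0.892857, "validate_accuracy": 0.797222},
    {"k": 4, "train_accuracy": 0.867857, "validate_accuracy": 0.769444},
]

# Gap between train and validate accuracy for each k.
for row in results_rows:
    row["gap"] = row["train_accuracy"] - row["validate_accuracy"]
    print(f"k={row['k']}: gap={row['gap']:.3f}")

largest_gap = max(results_rows, key=lambda r: r["gap"])
best_validate = max(results_rows, key=lambda r: r["validate_accuracy"])
print("largest gap at k =", largest_gap["k"])
print("best validate accuracy at k =", best_validate["k"])
```

With the `results` DataFrame from the loop above, the same column could be computed as `results.train_accuracy - results.validate_accuracy`.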

6a. View the classification report for your best model on the test split.

In [34]:
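knn = KNeighborsClassifier(n_neighbors=1)
# Caution: this fits the model on the test split, so the scores below are
# measured on rows the model has already seen and are not an honest estimate.
knn.fit(X_test, y_test)
y_pred = knn.predict(X_test)
y_pred_proba = knn.predict_proba(X_test)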

print('Accuracy of KNN classifier on test set: {:.2f}'
     .format(knn.score(X_test, y_test)))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy of KNN classifier on test set: 1.00
[[ 42   1]
 [  0 257]]
              precision    recall  f1-score   support

           0       1.00      0.98      0.99        43
           1       1.00      1.00      1.00       257

    accuracy                           1.00       300
   macro avg       1.00      0.99      0.99       300
weighted avg       1.00      1.00      1.00       300



6b. What do you notice?

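The accuracy is 100%, which is suspicious. The cell above fits the model on the test split and then scores it on that same split. With 1 neighbor, each row is its own nearest neighbor, so the score says nothing about unseen data. An honest test score requires fitting on `X_train` and only predicting on `X_test`.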
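The effect of scoring a 1-neighbor model on the rows it was fit on can be reproduced on toy data. The sketch below is a from-scratch 1-nearest-neighbor written in plain Python, not scikit-learn's implementation. The points, labels, and helper names (`predict_1nn`, `accuracy`) are all made up for illustration, and the labels are assigned at random so there is nothing real to learn.

```python
import random


def predict_1nn(fit_points, fit_labels, query):
    """Return the label of the fitted point closest to query (squared Euclidean distance)."""
    nearest = min(
        range(len(fit_points)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(fit_points[i], query)),
    )
    return fit_labels[nearest]


def accuracy(fit_points, fit_labels, eval_points, eval_labels):
    """Share of eval rows whose 1-NN prediction matches the true label."""
    hits = sum(
        predict_1nn(fit_points, fit_labels, point) == label
        for point, label in zip(eval_points, eval_labels)
    )
    return hits / len(eval_points)


rng = random.Random(123)
points = [(rng.random(), rng.random()) for _ in range(200)]
labels = [rng.randint(0, 1) for _ in range(200)]  # random labels: no real signal

fit_points, fit_labels = points[:150], labels[:150]
held_out_points, held_out_labels = points[150:], labels[150:]

# Fit and score on the same rows: every row is at distance 0 from itself.
same_data_accuracy = accuracy(held_out_points, held_out_labels,
                              held_out_points, held_out_labels)

# Fit on one set of rows, score on rows the model has never seen.
honest_accuracy = accuracy(fit_points, fit_labels,
                           held_out_points, held_out_labels)

print(same_data_accuracy, honest_accuracy)
```

Even with random labels, the same-data score is perfect, while the held-out score is not. In the notebook, the equivalent fix would be `knn.fit(X_train, y_train)` followed by `knn.score(X_test, y_test)`.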

7a. Within our problem space, what does accuracy mean? 

- TP = Predict Vote / Actually Votes
- TN = Predict Not Vote / Actually Doesn't Vote
- FP = Predict Vote / Actually Doesn't Vote
- FN = Predict Not Vote / Actually Votes

- Accuracy is the rate of TP + TN out of ALL the outcomes (TP + TN + FP + FN)
    - It is the share of predictions that matched what actually happened. 
    - In this scenario it is the number of times we correctly predicted that someone does or doesn't vote, based on their income, education, age, and the `female` indicator, out of ALL the predictions we made. 
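As a worked example, accuracy can be computed by hand from the training-set confusion matrix printed in step 4b. This sketch assumes scikit-learn's layout for `confusion_matrix`, where rows are actual classes and columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]` for labels 0 and 1. The always-predict-vote baseline is added here only for comparison and is not part of the original exercise.

```python
# Training-set confusion matrix from step 4b: rows = actual, columns = predicted.
confusion = [[76, 46],
             [65, 653]]
(tn, fp), (fn, tp) = confusion

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Comparison point: predict "votes" for every respondent.
# Every actual voter (TP + FN) would then be counted as correct.
baseline_accuracy = (tp + fn) / total

print(f"accuracy: {accuracy:.3f}")
print(f"always-vote baseline: {baseline_accuracy:.3f}")
```

Because most respondents in this sample voted, the baseline is already close to the model's training accuracy, which is one reason to look at precision and recall as well.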

7b. Precision? 

- Precision is the rate of TP out of TP + FP
    - In this case it is the number of times we predicted someone would vote and they actually voted, out of all the times we predicted they would vote. 

7c. Recall?

- Recall is the rate of TP out of TP + FN
    - In this case it is the number of times we predicted someone would vote and they actually voted, out of all the people who actually voted, whether we predicted it or not. 
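Precision and recall for the "voted" class can be checked the same way against the training-set confusion matrix from step 4b. This is a small sketch that assumes the `[[TN, FP], [FN, TP]]` layout scikit-learn uses for labels 0 and 1.

```python
# Counts read from the training-set confusion matrix [[76, 46], [65, 653]].
tn, fp, fn, tp = 76, 46, 65, 653

# Precision: of everyone predicted to vote, how many actually voted.
precision = tp / (tp + fp)

# Recall: of everyone who actually voted, how many were predicted to vote.
recall = tp / (tp + fn)

print(f"precision: {precision:.2f}")
print(f"recall: {recall:.2f}")
```

These values agree with the class 1 row of the training classification report (0.93 precision, 0.91 recall).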