In [None]:
import pandas as pd
import numpy as np
import seaborn as sns # Import seaborn and matplotlib for visualizing
import matplotlib.pyplot as plt
from itertools import islice
import json

### sklearn to predict party affiliation
Here we will work with a dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records) consisting of votes made by US House of Representatives Congressmen.

The goal will be to predict their party affiliation ('Democrat' or 'Republican') based on how they voted on certain key issues. 

### Load the data

In [None]:
head_list = ['party', 'infants', 'water', 'budget', 'physician', 'salvador',
       'religious', 'satellite', 'aid', 'missile', 'immigration', 'synfuels',
       'education', 'superfund', 'crime', 'duty_free_exports', 'eaa_rsa']
df = pd.read_csv('data/house-votes.csv', names=head_list)

### Inspect the data

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.head()

<span class="mark">**What do you notice as you inspect?**</span>

In [None]:
df.shape

In [None]:
df == '?'

### Cleaning data

Our EDA shows that certain data points are labeled with a '?'. These denote missing values. Real-world data can be very messy (recall the EDA lecture from previous classes), and missing values can be encoded in different ways.

We will use NaN to encode missing values during data cleaning because it is an efficient and simple way of representing missing data internally, and it lets us take advantage of pandas methods such as `.dropna()` and `.fillna()`, as well as scikit-learn's imputation transformer `SimpleImputer` (which replaced the older `Imputer` class).

In [None]:
# Convert '?' to NaN
df[df == '?'] = np.nan

# Print the number of NaNs
print(df.isnull().sum())

# Print shape of original DataFrame
print("Shape of Original DataFrame: {}".format(df.shape))

# Drop all rows that contain at least one missing value
df = df.dropna()

# Print shape of new DataFrame
print("Shape of DataFrame After Dropping All Rows with Missing Values: {}".format(df.shape))
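Dropping rows is the simplest option, but it discards every record with even one missing vote. As a minimal sketch of the imputation alternative mentioned above, the cell below fills each missing vote with the most frequent value in its column. The `toy` DataFrame and its values are hypothetical stand-ins, and the 0/1 vote encoding is an assumption, not the real voting data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for the voting records (assumed 0/1 votes)
toy = pd.DataFrame({
    'party': ['democrat', 'republican', 'democrat', 'republican'],
    'education': [0, 1, '?', 1],
    'crime': [0, '?', 0, 1],
})

# Encode '?' as NaN, as in the cleaning cell above
toy[toy == '?'] = np.nan

# Vote columns are now a mix of numbers and NaN, so cast them to float
vote_cols = ['education', 'crime']
toy[vote_cols] = toy[vote_cols].astype(float)

# Fill each vote column with its most frequent value instead of dropping rows
imputer = SimpleImputer(strategy='most_frequent')
toy[vote_cols] = imputer.fit_transform(toy[vote_cols])

print(toy.isnull().sum())
print("Shape after imputation: {}".format(toy.shape))
```

Imputation keeps every row, at the cost of inventing votes that were never cast, so it is worth comparing both approaches.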

## Visual EDA
Let's explore the data visually a bit more.

Our previous EDA told us that all the features in this dataset are binary; that is, they are either 0 or 1 (true/false). Seaborn's `countplot` is therefore a useful way to visualize the data.

In [None]:
# count plot of the education bill
sns.countplot(x='education', hue='party', data=df, palette='RdBu')
plt.xticks([0,1], ['No', 'Yes'])
plt.show()

NOTE: the resulting plot shows the difference in voting behavior between the two parties for the 'education' bill, with each party colored differently. We manually set the palette to 'RdBu', as the Republican party has traditionally been associated with red and the Democratic party with blue.
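The bar heights in a count plot are just per-group counts, so the same comparison can be read as a table. The sketch below uses `pd.crosstab` on a hypothetical `toy` DataFrame (made-up values, assumed 0/1 encoding) to show the numbers a count plot of one bill would draw.

```python
import pandas as pd

# Hypothetical toy votes (1 = yes, 0 = no); not the real dataset
toy = pd.DataFrame({
    'party': ['democrat', 'democrat', 'democrat', 'republican', 'republican'],
    'education': [0, 0, 1, 1, 1],
})

# Rows are parties, columns are vote values, cells are the counts
# that a count plot with hue='party' would draw as bars
counts = pd.crosstab(toy['party'], toy['education'])
print(counts)
```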

**What can we interpret?**


**<span class="mark">TODO</span>**: 
* Explore the voting behavior further by generating countplots for the 'satellite' and 'missile' bills and then answer the question below

In [None]:
# Your code below -- satellite bills

In [None]:
# Your code below -- missile bills

**TO ANSWER**:
Of these two bills, which do Democrats vote resoundingly in favor of, compared to Republicans?
* a) satellite
* b) missile
* c) Both satellite and missile
* d) Neither satellite nor missile

<span class="girk">Lab part 2</span>

### Classifier: fitting K-NN
Having explored the Congressional voting records dataset, it is time now to build your first classifier. Next, you will fit a k-Nearest Neighbors classifier to the voting dataset.

k-nearest neighbor:
https://scikit-learn.org/stable/modules/neighbors.html#classification

NOTE: 
* The features need to be in an array where 
    * each column is a feature and 
    * each row a different observation or data point - in this case, a Congressman's voting record. 
* The target needs to be a single column with the same number of observations as the feature data.
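The layout described in the note above can be checked quickly before fitting. The following sketch uses a hypothetical `toy` DataFrame (invented values, assumed 0/1 encoding) to show the expected shapes of the feature array and the target.

```python
import pandas as pd

# Hypothetical toy data: 4 observations, 2 vote features (not the real dataset)
toy = pd.DataFrame({
    'party': ['democrat', 'republican', 'democrat', 'republican'],
    'education': [0, 1, 0, 1],
    'crime': [0, 1, 1, 1],
})

# Target: a single column; features: everything except the target
y_toy = toy['party'].values
X_toy = toy.drop('party', axis=1).values

print(X_toy.shape)  # (number of observations, number of features)
print(y_toy.shape)  # (number of observations,)

# The feature array and the target must have the same number of rows
assert X_toy.shape[0] == y_toy.shape[0]
```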

In [None]:
# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier 

# Create arrays for the features and the response variable
y = df['party'].values
X = df.drop('party', axis=1).values

# Create a k-NN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the data
knn.fit(X, y)

### Predict on new data
Having fit a k-NN classifier, you can now use it to predict the label of a new data point.

For now, we will generate a random unlabeled data point as X_new. We will next use the classifier to predict the label for this new data point. 

In [None]:
X_new = np.random.rand(1,16)

In [None]:
# Predict the labels for the training data X: y_pred
y_pred = knn.predict(X)

# Predict and print the label for the new data point X_new
new_prediction = knn.predict(X_new)
print("Prediction: {}".format(new_prediction)) 

**Question**:

Did your model predict 'democrat' or 'republican'? How sure can you be of its predictions? In other words, how can you measure its performance?
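One common way to approach the performance question is to hold out part of the labeled data and compute accuracy on it. The sketch below is only an illustration on synthetic data: the arrays `X_toy` and `y_toy`, the labeling rule, and the 70/30 split are all assumptions made for the example, not properties of the voting dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical synthetic data: 100 voters, 16 binary votes each
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(100, 16))

# Made-up labeling rule so there is a pattern to learn:
# the label depends only on the first vote
y_toy = np.where(X_toy[:, 0] == 1, 'republican', 'democrat')

# Hold out 30% of the data; stratify keeps class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Fit on the training split only, then score on the unseen test split
knn_toy = KNeighborsClassifier(n_neighbors=6)
knn_toy.fit(X_train, y_train)
accuracy = knn_toy.score(X_test, y_test)
print("Held-out accuracy: {:.2f}".format(accuracy))
```

Accuracy measured on the training data itself tends to be optimistic, which is why the score here is computed on data the classifier did not see during fitting.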

<span class="mark">**TODO**:</span>
    
Try with another new data point. Did your model predict 'democrat' or 'republican'?