<a href="https://colab.research.google.com/github/raj-vijay/ml/blob/master/04_Voting_data_k-NN_Prediction_Scikit_Learn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**k-Nearest Neighbours on Congressional Voting Dataset**

The Congressional Voting Records dataset comes from the UC Irvine (UCI) Machine Learning Repository. It records how members of the US House of Representatives voted on 16 key issues. The goal here is to predict each member's party affiliation ('democrat' or 'republican') from those votes using the k-Nearest Neighbours (k-NN) algorithm.

https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records

In [0]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

**Load dataset**

In [2]:
# The file has no header row, so the column names are supplied explicitly:
# the party label followed by the 16 recorded votes
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data',
    sep=',',
    names=['party', 'infants', 'water', 'budget', 'physician', 'salvador',
           'religious', 'satellite', 'aid', 'missile', 'immigration',
           'synfuels', 'education', 'superfund', 'crime',
           'duty_free_exports', 'eaa_rsa'])

print(df)

          party infants water budget  ... superfund crime duty_free_exports eaa_rsa
0    republican       n     y      n  ...         y     y                 n       y
1    republican       n     y      n  ...         y     y                 n       ?
2      democrat       ?     y      y  ...         y     y                 n       n
3      democrat       n     y      y  ...         y     n                 n       y
4      democrat       y     y      y  ...         y     y                 y       y
..          ...     ...   ...    ...  ...       ...   ...               ...     ...
430  republican       n     n      y  ...         y     y                 n       y
431    democrat       n     n      y  ...         n     n                 n       y
432  republican       n     ?      n  ...         y     y                 n       y
433  republican       n     n      n  ...         y     y                 n       y
434  republican       n     y      n  ...         y     y                 ? 

**Pre-process the dataset**

In [3]:
# Encode the votes: '?' (missing) becomes NaN, 'y' becomes 1, 'n' becomes 0
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
df[df == '?'] = np.nan
df[df == 'y'] = 1
df[df == 'n'] = 0
print(df)

          party infants water budget  ... superfund crime duty_free_exports eaa_rsa
0    republican       0     1      0  ...         1     1                 0       1
1    republican       0     1      0  ...         1     1                 0     NaN
2      democrat     NaN     1      1  ...         1     1                 0       0
3      democrat       0     1      1  ...         1     0                 0       1
4      democrat       1     1      1  ...         1     1                 1       1
..          ...     ...   ...    ...  ...       ...   ...               ...     ...
430  republican       0     0      1  ...         1     1                 0       1
431    democrat       0     0      1  ...         0     0                 0       1
432  republican       0   NaN      0  ...         1     1                 0       1
433  republican       0     0      0  ...         1     1                 0       1
434  republican       0     1      0  ...         1     1               NaN 

In [0]:
# Drop every row that contains at least one missing vote
df = df.dropna()
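
The encoding and row-dropping steps above can be checked on a small example before running them on the full dataset. The sketch below is illustrative only: the three-column frame `toy` and the names `vote_map`, `encoded` and `complete` are hypothetical and not part of the original notebook. It uses `Series.map`, which turns any value missing from the mapping (here '?') into NaN.

```python
import pandas as pd

# Hypothetical toy frame in the same 'y' / 'n' / '?' encoding as the voting data
toy = pd.DataFrame({
    'party':   ['republican', 'democrat', 'democrat', 'republican'],
    'infants': ['n', '?', 'y', 'n'],
    'water':   ['y', 'y', '?', 'n'],
})

vote_cols = ['infants', 'water']
vote_map = {'y': 1.0, 'n': 0.0}  # '?' is not in the map, so it becomes NaN

encoded = toy.copy()
encoded[vote_cols] = toy[vote_cols].apply(lambda col: col.map(vote_map))

# Count the missing votes per column before deciding to drop rows
missing_per_column = encoded.isna().sum()
print(missing_per_column)

# Keep only the members with a recorded vote on every issue
complete = encoded.dropna()
print(len(toy), 'rows before,', len(complete), 'rows after dropna()')
```

Counting the missing values first shows how many rows `dropna()` will discard, which matters because k-NN is then fitted only on the members with a complete voting record.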

**Import k-NN (k Nearest Neighbours) algorithm**

In [0]:
from sklearn.neighbors import KNeighborsClassifier

**k-Nearest Neighbors: Fit**

The following cell builds arrays for the features and the target, creates an instance of a k-NN classifier with 6 neighbors (by specifying the n_neighbors parameter) and then fits it to the data.

The arrays X and Y hold the features and the target variable respectively.

The .drop() method is used to drop the target variable 'party' from the feature array X, and the .values attribute is used to ensure X and Y are NumPy arrays.

Without using .values, X and Y are a DataFrame and a Series respectively.

The scikit-learn API also accepts them in that form, as long as they are of the right shape: X must be two-dimensional (one row per sample, one column per feature) and Y must contain one label per row of X.

In [7]:
# Create arrays for the features and the response variable
Y = df['party'].values
X = df.drop('party', axis=1).values

# Create a k-NN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors = 6)

# Fit the classifier to the data
knn.fit(X, Y)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=6, p=2,
                     weights='uniform')
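
The point about .values can be illustrated with a small sketch. The frame `toy` and the names `knn_frame` and `knn_array` are hypothetical and chosen for illustration; the example only assumes that `KNeighborsClassifier` accepts both pandas objects and NumPy arrays of the right shape, as described above.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two vote columns and a party label
toy = pd.DataFrame({
    'party':  ['democrat', 'democrat', 'democrat',
               'republican', 'republican', 'republican'],
    'budget': [1, 1, 1, 0, 0, 0],
    'crime':  [0, 0, 1, 1, 1, 1],
})

X_frame = toy.drop('party', axis=1)   # DataFrame, shape (6, 2)
y_series = toy['party']               # Series, length 6

# Fit once on the pandas objects and once on the underlying NumPy arrays
knn_frame = KNeighborsClassifier(n_neighbors=3).fit(X_frame, y_series)
knn_array = KNeighborsClassifier(n_neighbors=3).fit(X_frame.values, y_series.values)

# Both classifiers see the same numbers, so their predictions agree
pred_frame = knn_frame.predict(X_frame)
pred_array = knn_array.predict(X_frame.values)
same = bool((pred_frame == pred_array).all())
print(X_frame.shape, y_series.shape, same)
```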

**k-Nearest Neighbors: Predict**

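Having fit the k-NN classifier, we can now use it to predict labels. As a first check, the cell below predicts the labels of the same data the classifier was trained on, so the output shows how the model labels examples it has already seen, not how well it generalises.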

In [38]:
# Predict the labels for the training data X
y_pred = knn.predict(X)
print(y_pred)

['republican' 'republican' 'democrat' 'democrat' 'democrat' 'democrat'
 'democrat' 'republican' 'democrat' 'republican' 'democrat' 'republican'
 'democrat' 'republican' 'republican' 'republican' 'democrat' 'democrat'
 'democrat' 'democrat' 'democrat' 'democrat' 'republican' 'republican'
 'republican' 'republican' 'republican' 'republican' 'democrat'
 'republican' 'republican' 'republican' 'democrat' 'democrat' 'republican'
 'democrat' 'republican' 'republican' 'democrat' 'republican' 'republican'
 'republican' 'republican' 'republican' 'republican' 'democrat' 'democrat'
 'democrat' 'democrat' 'democrat' 'democrat' 'democrat' 'democrat'
 'democrat' 'democrat' 'republican' 'democrat' 'democrat' 'republican'
 'democrat' 'republican' 'republican' 'democrat' 'republican' 'republican'
 'republican' 'democrat' 'democrat' 'republican' 'republican' 'republican'
 'democrat' 'republican' 'democrat' 'republican' 'democrat' 'democrat'
 'republican' 'republican' 'republican' 'republican' 'republican
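
Predictions on the rows a classifier was fitted to tend to give an optimistic picture of its quality. A common alternative is to hold out part of the data and score the model on it. The sketch below is an illustration under stated assumptions, not the notebook's method: it uses synthetic 0/1 votes (`X_demo`, `y_demo`) instead of the real dataset, and the split size, random seed and voting probabilities are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the voting data: 200 members, 16 yes/no votes
rng = np.random.default_rng(42)
n_members, n_votes = 200, 16
y_demo = rng.choice(['democrat', 'republican'], size=n_members)

# Assumption for illustration: each party votes 'yes' with a different probability
p_yes = np.where(y_demo[:, None] == 'democrat', 0.8, 0.2)
X_demo = (rng.random((n_members, n_votes)) < p_yes).astype(int)

# Hold out 30% of the members; stratify keeps the party balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

knn_demo = KNeighborsClassifier(n_neighbors=6)
knn_demo.fit(X_train, y_train)

# score() returns the fraction of correctly predicted labels (accuracy)
train_accuracy = knn_demo.score(X_train, y_train)
test_accuracy = knn_demo.score(X_test, y_test)
print('train accuracy:', train_accuracy)
print('test accuracy: ', test_accuracy)
```

With the real data, the same pattern would apply to the X and Y arrays built earlier; the held-out accuracy is the more trustworthy of the two numbers.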

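To test the model's ability to predict unseen data, a new random data point is created: a single row of 16 values between 0 and 1, one per feature. These values are arbitrary fractions, whereas the real votes in this dataset are encoded as 0 or 1, so the point serves only to demonstrate the prediction call.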

In [36]:
X_new = np.array([(0.696469, 0.286139, 0.226851, 0.551315, 0.719469, 0.423106, 0.980764, 0.68483, 0.480932, 0.392118, 0.343178, 0.72905, 0.438572, 0.059678, 0.398044, 0.737995)])
X_new.shape

(1, 16)

The k-NN prediction is now carried out on the new, unseen data point.

In [35]:
# Predict and print the label for the new data point X_new
new_prediction = knn.predict(X_new)
print("Prediction: {}".format(new_prediction))

Prediction: ['democrat']
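
To see why k-NN assigns a particular label, it can help to look at the neighbours that took part in the vote. The sketch below uses the `kneighbors` and `predict_proba` methods of `KNeighborsClassifier` on a hypothetical six-row toy dataset (`X_toy`, `y_toy`) with a made-up query point; none of these names or values come from the original notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: three votes per member, encoded as 0 / 1
X_toy = np.array([[1, 1, 0],
                  [1, 0, 0],
                  [1, 1, 1],
                  [0, 0, 1],
                  [0, 1, 1],
                  [0, 0, 0]])
y_toy = np.array(['democrat', 'democrat', 'democrat',
                  'republican', 'republican', 'republican'])

knn_toy = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# A made-up query point, fractional like X_new above
query = np.array([[0.9, 0.8, 0.1]])

# kneighbors returns the distances to, and row indices of, the nearest training points
distances, indices = knn_toy.kneighbors(query)
neighbour_labels = y_toy[indices[0]]

# predict_proba gives the share of neighbours voting for each class,
# in the order listed in knn_toy.classes_
probabilities = knn_toy.predict_proba(query)
prediction = knn_toy.predict(query)

print('neighbour rows:  ', indices[0])
print('neighbour labels:', neighbour_labels)
print('classes:         ', knn_toy.classes_)
print('vote shares:     ', probabilities[0])
print('prediction:      ', prediction[0])
```

Here the three closest training rows are all labelled 'democrat', so the majority vote, and therefore the prediction, is 'democrat'.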
