In [1]:
import math
import operator 
from sklearn.datasets import load_wine
import nltk
import random 
from collections import Counter
import pandas as pd
import string
import numpy as np
import sklearn
from scipy.sparse import csr_matrix  
from sklearn.model_selection import train_test_split

## K Nearest Neighbors Classification

### Theory

K nearest neighbors (KNN) is a simple but powerful classification algorithm, and the idea behind the model is straightforward. Given a positive integer K and a test instance $x_0$, KNN starts by identifying the K points in the training data that are closest to $x_0$; this set is denoted $N_0$. The measure of closeness is chosen to suit the problem.

Each of the K nearest neighbors has a class label. To determine the class of $x_0$, the algorithm estimates the conditional probability of class j as the fraction of points in $N_0$ whose labels equal j:

\begin{equation}
Pr( Y = j| X = x_0 ) = \frac{1}{K} \sum_{i \in N_0} I(y_i =j)
\end{equation}

It then assigns the test observation $x_0$ to the class with the largest estimated probability.
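
To make the estimate concrete, here is a minimal sketch that computes the class fractions for a hypothetical neighborhood $N_0$ with K = 5. The label values and the helper name `class_probabilities` are illustrative assumptions and are not part of the implementation later in this tutorial.

```python
from collections import Counter

def class_probabilities(neighbor_labels):
    """Estimate Pr(Y = j | X = x_0) as the fraction of neighbors with label j."""
    k = len(neighbor_labels)
    counts = Counter(neighbor_labels)
    return {label: count / k for label, count in counts.items()}

# Hypothetical labels of the K = 5 nearest neighbors of x_0
labels_in_N0 = [1, 0, 1, 1, 2]
probs = class_probabilities(labels_in_N0)
prediction = max(probs, key=probs.get)  # class with the largest fraction
print(probs)       # {1: 0.6, 0: 0.2, 2: 0.2}
print(prediction)  # 1
```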

### Characteristics

The KNN algorithm is flexible because it assumes nothing about the data beyond a distance measure that can be computed consistently between any two instances. It is an instance-based, lazy learning algorithm.

Instance-based algorithms are those that model the problem using rows of the data to make predictive decisions. KNN is an extreme form of instance-based method, since all training observations are retained as part of the model.

Lazy learning refers to the fact that KNN does not build a model until the time a classification is required. A disadvantage of this is that it can be computationally expensive to repeat the same or similar searches over larger training datasets.
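
The sketch below illustrates what "lazy" means in practice: fitting only stores the data, and all distance computations happen at query time. The class `LazyKNN` and its toy data are hypothetical and exist only for this illustration.

```python
class LazyKNN:
    """Illustrative lazy learner: fit() only memorizes the data."""

    def fit(self, rows, labels):
        # No parameters are estimated; the training set itself is the model.
        self.rows = list(rows)
        self.labels = list(labels)
        return self

    def predict_one(self, x, k=3):
        # All the work happens at query time: one distance per stored row.
        order = sorted(
            range(len(self.rows)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(self.rows[i], x)),
        )
        votes = [self.labels[i] for i in order[:k]]
        return max(set(votes), key=votes.count)

model = LazyKNN().fit(
    [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]],
    ["a", "a", "b", "b"],
)
print(model.predict_one([0.2, 0.1], k=3))  # a
```

Every call to `predict_one` scans the whole training set, which is the source of the query-time cost described above.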

### Implementation

First we write a function to calculate Euclidean distance. We will use this distance measure when we apply KNN to a sample dataset later on.

In [2]:
def euclideanDistance(x1, x2):
    distance = 0
    for i in range(len(x1)):
        distance += pow((x1[i] - x2[i]), 2)
    return math.sqrt(distance)
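
For longer vectors, the same quantity can be computed without an explicit Python loop. The following is a minimal sketch assuming the inputs are numeric sequences of equal length; the name `euclidean_distance_np` is an illustrative choice, not part of the original code.

```python
import numpy as np

def euclidean_distance_np(x1, x2):
    """Vectorized Euclidean distance between two equal-length vectors."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(np.sqrt(np.dot(diff, diff)))

print(euclidean_distance_np([0, 0], [3, 4]))  # 5.0
```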

Using the distance function, we can find the k elements of the training set that are closest to our test instance.

In [3]:
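def getNeighbors(train, X, k):
    distances = []
    for i in range(len(train)):
        # Compare feature columns only; the last element of each row is the class label
        dist = euclideanDistance(train[i][:-1], X[:-1])
        distances.append((train[i], dist))
    # Sort by distance and keep the k closest training rows
    distances.sort(key = operator.itemgetter(1))
    neighbors = []
    for i in range(k):
        neighbors.append(distances[i][0])
    return neighbors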

Once we have located the most similar neighbors for a test instance, we can predict the response based on those neighbors. The way to accomplish this is to let each neighbor vote for its class attribute, and take the majority vote as the prediction.
getVote is a function that returns the majority vote from a number of neighbors. It assumes the class is the last attribute of each neighbor.

In [4]:
def getVote(neighbors):
    classVotes = {}
    for i in range(len(neighbors)):
        response = neighbors[i][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]
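
The same tally can be written more compactly with `collections.Counter`. This is a sketch under the assumption that the labels have already been extracted from the neighbor rows; the name `get_vote_counter` is hypothetical. It also makes the tie-breaking rule explicit, which matters when k is even.

```python
from collections import Counter

def get_vote_counter(labels):
    """Return the most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

print(get_vote_counter([2.0, 0.0, 2.0, 1.0, 2.0]))  # 2.0
print(get_vote_counter([1.0, 0.0, 0.0, 1.0]))       # 1.0 (tie, first seen wins)
```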

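The following function reports the accuracy of our predictions, that is, the fraction of test instances whose predicted class matches the true class stored in the last attribute.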

In [5]:
def getAccuracy(test, predictions):
    correct = 0
    for i in range(len(test)):
        if test[i][-1] == predictions[i]:
            correct += 1
    return (correct/(len(test)))
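
When the true labels are available as a separate sequence, accuracy reduces to the mean of an element-wise comparison. The sketch below assumes two equal-length label sequences; `accuracy_np` is an illustrative name.

```python
import numpy as np

def accuracy_np(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

print(accuracy_np([0, 1, 2, 1], [0, 1, 1, 1]))  # 0.75
```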

### Example Using Wine Dataset

The wine dataset is a small sample dataset that ships with Python's scikit-learn library. We will use it to demonstrate the KNN classification algorithm implemented above. 
This dataset contains the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. There are 178 instances, each recording the quantities of 13 constituents measured in a wine from one of the three cultivars. 
The attributes are:
- Alcohol 
- Malic acid 
- Ash 
- Alcalinity of ash 
- Magnesium 
- Total phenols 
- Flavanoids 
- Nonflavanoid phenols 
- Proanthocyanins 
- Color intensity 
- Hue 
- OD280/OD315 of diluted wines 
- Proline 

For more information on this dataset, see [here](https://archive.ics.uci.edu/ml/datasets/wine).

As we are using Euclidean distance as our measure of similarity, it is very important to normalize the values of our input features. Otherwise, features with a large numeric range dominate the distance and skew the prediction results. 
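
The effect of scale can be seen on a toy example. The rows below are hypothetical two-feature values (loosely modeled on alcohol and proline), not taken from the dataset; the scaling step divides each column by its maximum, which is the same normalization applied to the wine features in the next cell.

```python
import numpy as np

# Hypothetical rows: [alcohol, proline]
a = np.array([13.0, 1000.0])
b = np.array([14.0, 1000.0])   # differs from a only in alcohol
c = np.array([13.0, 1010.0])   # differs from a only in proline

# On the raw scale the proline difference dominates, simply because
# proline values are numerically larger.
print(np.linalg.norm(a - b), np.linalg.norm(a - c))  # 1.0 10.0

# Divide each column by its maximum so both features lie in (0, 1].
data = np.vstack([a, b, c])
scaled = data / data.max(axis=0)
d_ab = np.linalg.norm(scaled[0] - scaled[1])
d_ac = np.linalg.norm(scaled[0] - scaled[2])
print(d_ab > d_ac)  # True
```

After scaling, the alcohol difference (about 7% of its range) outweighs the proline difference (about 1%), which reverses the ranking obtained on the raw values.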

In [6]:
dataset = load_wine()
wine_features = dataset['data']
wine_features_normed = wine_features / wine_features.max(axis=0)
wine_target = np.array([dataset['target']])
wine_dataset = np.concatenate((wine_features_normed, wine_target.T), axis = 1)

In [7]:
# split into test and train datasets 
trainingSet = []
testSet = []
split = 0.67
# iterate over every row, including the last one
for x in range(len(wine_dataset)):
    if random.random() < split:
        trainingSet.append(wine_dataset[x])
    else:
        testSet.append(wine_dataset[x])
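
The split above draws from Python's global random generator, so the train and test sets differ on every run. If reproducibility matters, one option is to use a seeded generator. The helper `split_rows`, its default seed, and the stand-in rows are assumptions made for this sketch.

```python
import random

def split_rows(rows, train_fraction=0.67, seed=0):
    """Randomly assign each row to train or test, reproducibly for a given seed."""
    rng = random.Random(seed)   # private generator; leaves global random state alone
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(178))   # stand-in for the 178 wine rows
train_a, test_a = split_rows(rows, seed=42)
train_b, test_b = split_rows(rows, seed=42)
print(train_a == train_b, len(train_a) + len(test_a))  # True 178
```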

In [8]:
predictions=[]
k = 5
for x in range(len(testSet)):
    neighbors = getNeighbors(trainingSet, testSet[x], k)
    result = getVote(neighbors)
    predictions.append(result)
    #print('predicted =' + repr(result) + ', actual =' + repr(testSet[x][-1]))
accuracy = getAccuracy(testSet, predictions)
print(accuracy)

1.0


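In this run, KNN with k = 5 classified every test instance correctly. The dataset is a well-posed problem with well-behaved class structure and no missing values, which helps. Two caveats apply to the exact figure: the split is random and unseeded, so accuracy varies from run to run, and each row stores its class label in the last column, so distances must be computed over the 13 feature columns only. If the label column is included in the distance, it dominates the normalized features and inflates accuracy.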

### Using KNN on Sparse Matrices for Text Classification

With a little tweaking of our original code, we can use KNN to classify text. The example here is taken from the tweets dataset used in homework 3, text processing. Using KNN, we can classify tweets into two groups according to their owner's political affiliation. 
First we follow the same steps as in homework 3 to generate a sparse bag-of-words TF-IDF feature matrix. 

In [9]:
lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()
stopwords=nltk.corpus.stopwords.words('english')

In [10]:
def process(text, lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()):
    text = text.lower()
    text = " " + text + " "
    text = text.replace("'s", ' ')
    text = text.replace("-", ' ')
    text = text.replace("'", '')
    
    text_map =  string.punctuation.replace("'","")
    blank = " " * len(text_map)
    
    translator = str.maketrans(text_map, blank)
    text = text.translate(translator)
    
    tokens = nltk.word_tokenize(text)
    
    token_list = []
    for token in tokens:
        try:
            token_list.append(lemmatizer.lemmatize(token))
        except Exception:
            # skip any token the lemmatizer cannot handle
            continue
    
    return token_list

In [11]:
tweets = pd.read_csv("tweets_train.csv", na_filter=False)

In [12]:
def process_all(df, lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()):
    res = df.copy()
    tokens = []
    for i in range(len(res['text'])):
        tokens.append(process(res['text'].iloc[i], lemmatizer))
    res['text'] = tokens
    return res

processed_tweets = process_all(tweets)

In [13]:
def get_rare_words(processed_tweets):
    words = []
    for i in range(len(processed_tweets)):
        words += processed_tweets['text'][i]
    
    countDict = Counter(words)
    rare_words = [word for word, occurrences in countDict.items() if occurrences <= 1]
    return sorted(rare_words)

rare_words = get_rare_words(processed_tweets)

In [14]:
def create_features(processed_tweets, rare_words):
    stopwords = nltk.corpus.stopwords.words('english')
    stopwords += rare_words
    
    tfidf = sklearn.feature_extraction.text.TfidfVectorizer(stop_words = stopwords)
    
    tweets_string = processed_tweets['text'].apply(lambda x: str(" ".join(x)))
    X = tfidf.fit_transform(tweets_string)
    
    return tfidf, X

(tfidf, X) = create_features(processed_tweets, rare_words)

In [15]:
def create_labels(processed_tweets):
    screen_name = processed_tweets['screen_name']
    rep = ['realDonaldTrump', 'mike_pence','GOP']
    label = []
    for i in range(len(screen_name)):
        if screen_name[i] in rep:
            label.append(0)
        else:
            label.append(1)
    return np.array(label)

y = create_labels(processed_tweets)

With X as the sparse bag-of-words TF-IDF matrix and y as the vector of class labels, we can split our dataset to generate a train and a test set. 

In [16]:
X_train, X_test, y_train, y_test \
    = train_test_split(X, y, test_size = 0.33, random_state = 244)

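As X_train and X_test are CSR sparse matrices, we want to keep them sparse when calculating distances. The squared Euclidean distance between two vectors $a$ and $b$ can be expanded as 
\begin{equation}
\|a - b\|^2 = \|a\|^2 - 2\, a \cdot b + \|b\|^2
\end{equation}

so we can first compute the squared norm of every row (an element-wise square followed by a row sum), and then use a single sparse dot product to obtain all pairwise squared distances. Ranking neighbors by squared distance gives the same order as ranking by distance. 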
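
The expansion can be checked numerically on small dense arrays before trusting it on sparse data. The sketch below uses random stand-in matrices (the shapes and the seed are arbitrary assumptions) and compares the expanded form against a direct pairwise computation.

```python
import numpy as np

rng = np.random.default_rng(0)
test = rng.random((3, 4))    # 3 hypothetical test rows
train = rng.random((5, 4))   # 5 hypothetical training rows

# Expanded form: ||a||^2 + ||b||^2 - 2 a.b, computed for all pairs at once
test_sq = (test ** 2).sum(axis=1)
train_sq = (train ** 2).sum(axis=1)
expanded = test_sq[:, None] + train_sq[None, :] - 2 * test @ train.T

# Direct form: subtract every pair of rows, then square and sum
direct = ((test[:, None, :] - train[None, :, :]) ** 2).sum(axis=2)

print(np.allclose(expanded, direct))  # True
```

Only the dot product involves both matrices, which is why the expanded form suits sparse inputs: the element-wise square and the product can both be computed without materializing dense copies of the feature matrices.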

In [17]:
def kNN_Sparse(train, test, k):
    # squared L2 norm of every row, as flat 1-D arrays
    train_sq = np.asarray(train.multiply(train).sum(axis=1)).ravel()
    test_sq = np.asarray(test.multiply(test).sum(axis=1)).ravel()

    # pairwise dot products, shape (num_test, num_train)
    dot = test.dot(train.transpose()).toarray()

    # squared Euclidean distance: ||a||^2 + ||b||^2 - 2 a.b
    distance = test_sq[:, None] + train_sq[None, :] - 2 * dot

    # indices of the k nearest training rows for each test row
    neighbors_index = np.argsort(distance, axis=1)[:, :k]
    # matching squared distances, shape (num_test, k)
    neighbors = np.take_along_axis(distance, neighbors_index, axis=1)

    return neighbors, neighbors_index

As above, we allow each neighbor to vote for its class, and assign the majority vote to the test instance. We can also evaluate the accuracy of this model by comparing our predictions with y_test.

In [18]:
predictions=[]
k = 5
for i in range(X_test.shape[0]):
    neighbors,neighbors_index = kNN_Sparse(X_train, X_test[i:i+1], k)
    for j in neighbors_index.tolist():
        response = y_train[j]
        data = Counter(response)
        result = max(response, key = data.get)
        predictions.append(result)
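
If the neighbor indices for several test rows are available at once, the vote can be taken for all of them in one pass. The following sketch assumes non-negative integer labels (as produced by `create_labels`); the helper `vote_from_indices` and the demo arrays are hypothetical.

```python
import numpy as np

def vote_from_indices(neighbors_index, y_train):
    """Majority label for each row of neighbor indices (labels must be ints >= 0)."""
    neighbor_labels = np.asarray(y_train)[np.asarray(neighbors_index)]  # (num_test, k)
    # bincount tallies votes per class; argmax breaks ties toward the smaller label
    return np.array([np.bincount(row).argmax() for row in neighbor_labels])

# Hypothetical labels for six training rows and neighbor indices for two test rows
y_train_demo = np.array([0, 1, 1, 0, 1, 0])
index_demo = np.array([[0, 1, 2], [3, 5, 4]])
print(vote_from_indices(index_demo, y_train_demo))  # [1 0]
```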

In [19]:
def getAccuracy_sparse(testSet, predictions):
    correct = 0
    for i in range(len(testSet)):
        if testSet[i] == predictions[i]:
            correct += 1
    return (correct/(len(testSet)))

getAccuracy_sparse(y_test,predictions)

0.8446312839376423

Our accuracy with k = 5 is 0.84. This is not a bad result for such a simple and straightforward lazy learning model. 

### Citations 

The Theory and Characteristics sections of this tutorial draw on the [textbook](http://www-bcf.usc.edu/~gareth/ISL/) _An Introduction to Statistical Learning with Applications in R_ (ISLR) by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, 7th edition. 

The idea for using KNN on text classification comes from the paper _KNN with TF-IDF based Framework for Text Categorization_ by Bruno Trstenjak, Sasa Mikac, and Dzenana Donko, linked [here](https://www.sciencedirect.com/science/article/pii/S1877705814003750). 

The wine dataset is from [UCI's machine learning repository](https://archive.ics.uci.edu/ml/datasets/wine).