In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# 1. Introduction

In this notebook, our primary goal is simple: 
#### To clearly understand the difference between traditional K-NN Algorithm and Approximate KNN and discuss when we can choose one over the other. 
However, I will not be discussing EDA and different visualizations in this notebook since there are plenty of great notebooks on this topic already. Instead, we will look at some ways to preprocess textual data and discuss the implications of each one. 

## What is KNN? 

KNN stands for K-Nearest Neighbors Algorithm. It belongs to a subclass of machine learning algorithms known as "instance-based learning".
In instance-based algorithms, no real "learning" takes place during the training phase. Instead, the instances are simply stored in a suitable format. During the testing phase, each new, unseen data point is compared with the stored training data, and a prediction is generated based on some similarity metric.

![](http://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1531424125/KNN_final1_ibdm8a.png)

In KNN, we compare each new test point with all the training data and fetch the top-K closest data points. We then check the classes of these top-K neighbours and return the majority label as the prediction for the test point. The visual above gives a good idea of how KNN works. 

In this notebook, we store each tweet as a vector and then use **cosine similarity** to measure how close two vectors are. 

Image Source: [Datacamp](https://stats.stackexchange.com/questions/287425/why-do-you-need-to-scale-data-in-knn)
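
As a small illustration of the similarity metric (a sketch with made-up 3-dimensional vectors, not the tweet embeddings used later), cosine similarity is the dot product of two vectors divided by the product of their lengths. The helper name `cosine_sim` is only for this example.

```python
import numpy as np

def cosine_sim(a, b):
    # dot product divided by the product of the vector lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy vectors, chosen only for illustration
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a
c = np.array([0.0, 0.0, 3.0])   # orthogonal to a

sim_ab = cosine_sim(a, b)
sim_ac = cosine_sim(a, c)
print(sim_ab, sim_ac)
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.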

## What is Approximate KNN?

Approximate KNN is a variation of KNN that limits the number of training samples each new test point is compared with before returning a result. With exact KNN, if we had 1 million training samples, we would have to compute the cosine similarity against all 1 million of them for every new test point, just to find the top-K closest neighbours. This can be computationally expensive. 


### Algorithm  
1. Randomly initialize a set of hyperplanes in the same vector space as the vectors. 
2. These hyperplanes divide the vector space into subspaces. 
3. For each vector in the training data, find out which subspace it belongs to. 
4. For each test point, find the training vectors in the same subspace and treat them as its neighbours. 
5. Calculate the distance between the test point and each vector in the new (smaller) set of neighbours. 
6. Return the top-K closest labels within the subspace. 
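
The steps above can be sketched on toy data. This is a minimal illustration with made-up 2-D points and two hand-picked hyperplanes (the names `planes`, `points` and `bucket_of` are hypothetical); the notebook's actual implementation appears in Section 3.

```python
import numpy as np
from collections import defaultdict

# two hyperplanes through the origin, described by their normal vectors
# (hand-picked here instead of random so the result is easy to follow)
planes = np.array([[1.0, 0.0],    # splits left / right
                   [0.0, 1.0]])   # splits bottom / top

points = {
    "a": np.array([2.0, 3.0]),
    "b": np.array([1.0, 5.0]),
    "c": np.array([-2.0, 1.0]),
    "d": np.array([-1.0, -4.0]),
}

def bucket_of(v):
    # one bit per plane: 1 if v is on or in front of the plane, else 0
    return tuple(int(np.dot(p, v) >= 0) for p in planes)

# steps 1-3: assign every training vector to its subspace
buckets = defaultdict(list)
for name, v in points.items():
    buckets[bucket_of(v)].append(name)

# steps 4-6: a query is only compared against vectors in its own subspace
query = np.array([3.0, 1.0])
candidates = buckets[bucket_of(query)]
print(candidates)
```

Only the points that fall in the same quadrant as the query are kept as candidates; the other points are never compared with it.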

### Challenges
1. How do we store the subspace information? 
2. What if there are not enough samples per subspace? 

In [None]:
# We use this package for preprocessing tweets 
!pip install tweet-preprocessor

In [None]:
# Standard import statements
import string
from collections import defaultdict
from collections import Counter
from time import time

from sklearn.metrics import f1_score, accuracy_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

import preprocessor as p
from gensim.models import KeyedVectors
from tqdm.notebook import tqdm
import spacy
nlp = spacy.load("en_core_web_sm")

# 2. Preprocessing

To extract any sort of meaning from our raw tweet data, we first need to clean it and store it in a format that is usable by our machine learning algorithm. In this notebook, we perform the following steps on our tweet data: 

1. Remove mentions, URLs, hashtags, numbers and reserved words (such as RT) 
2. Remove punctuation
3. Remove unusual, non-ASCII characters
4. Lemmatization: replace each word with its lemma 
5. Use pre-trained word embeddings to get word vectors

*Note: In Step 3, we remove some weird text like ÂÃÂª that is present in the data ([Reference](https://www.kaggle.com/c/nlp-getting-started/discussion/125723#748840))*
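
Steps 2 and 3 need nothing beyond the standard library. The sketch below uses a hypothetical helper `strip_noise` and a made-up sample string; steps 1, 4 and 5 depend on tweet-preprocessor, spaCy and the pre-trained vectors and are handled in the cells that follow.

```python
import string

def strip_noise(text: str) -> str:
    # step 3: drop every character that cannot be encoded as ASCII
    ascii_only = text.encode("ascii", "ignore").decode("ascii")
    # step 2: delete punctuation, then lowercase
    table = str.maketrans("", "", string.punctuation)
    return ascii_only.translate(table).lower()

# "\u00c2\u00aa" stands in for the mis-encoded characters found in the data
sample = "Huge storm incoming!!! Stay safe, everyone \u00c2\u00aa"
cleaned = strip_noise(sample)
print(cleaned)
```

Note that the non-ASCII characters are removed outright rather than replaced, so any whitespace around them is left behind.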

In [None]:
# the preprocessor package takes care of step 1 
p.set_options(p.OPT.HASHTAG, p.OPT.MENTION, p.OPT.NUMBER, p.OPT.URL, p.OPT.RESERVED)

# the pretrained word vectors for step 5
model = KeyedVectors.load("/kaggle/input/gensim-embeddings-dataset/glove.twitter.27B.200d.gensim", mmap="r")

In [None]:
# load in the dataset
train_df = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv", encoding='latin1')
test_df = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv", encoding='latin1')
train_df.info()

In [None]:
def clean_tweet(tweet: str) -> str:
    trans = str.maketrans("","",string.punctuation)
    processed_tweet = p.clean(tweet)
    recoded_tweet = processed_tweet.encode('ascii','ignore').decode('utf-8','ignore') 
    depunct_tweet = recoded_tweet.translate(trans).lower()
    doc = nlp(depunct_tweet)
    lemmas = [y.lemma_ if y.lemma_ != "-PRON-" else y.orth_ for y in doc]
    cleaned_tweet = str.join(" ", lemmas)
    
    return cleaned_tweet

In [None]:
# Check the cleaned output on some random tweets 
for x in (train_df.sample(5)['text']):
    print("RAW   : ", x)
    print("CLEAN : ", clean_tweet(x),"\n") 


In [None]:
# Clean all the tweets in the dataset
train_df['clean_tweet'] = [clean_tweet(x) for x in tqdm(train_df['text'])]

### Information Loss

Since the main focus of this notebook is understanding approximate KNN, we will not train our own word embeddings. Instead, we will use pre-trained word embeddings available [here](https://www.kaggle.com/bertcarremans/glovetwitter27b100dtxt). The downside is that words missing from the vocabulary of the pre-trained model cannot be vectorized. Let us see just how much data we lose this way.

In [None]:
# Listing out words that cannot be vectorized 
no_vecs = []
for x in train_df['clean_tweet']:
    no_vecs.extend([y not in model for y in x.split()])

print("Total words in the corpus: ", len(no_vecs))
print("Total words without vectors: ", sum(no_vecs))
print(f"Information lost: {sum(no_vecs)/len(no_vecs)*100:0.2f}%")

In [None]:
# document vector = average vector of words in the document
def doc_to_vec(doc):
    vec = np.array([model[x] for x in doc.split() if x in model])
    if len(vec) == 0:
        # no known words: fall back to an all-zero vector of the embedding size
        return np.zeros(model.vector_size)
    return np.mean(vec, axis=0)

# 3. Implementation


## KNN Algorithm
We implement the algorithm as described in the Introduction. Even though I have implemented it from scratch for the sake of understanding, we could also use the sklearn package to get the predictions, since its implementations are highly optimized and easy to use. If you are curious about why we should or should not use sklearn packages, you can read a bit more about it in [this forum post](https://www.kaggle.com/questions-and-answers/208675).
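
For comparison, here is a minimal sketch of the scikit-learn route, assuming `KNeighborsClassifier` with `metric="cosine"` and brute-force search as a stand-in for the from-scratch version below. The toy arrays are made up for illustration and are not the tweet vectors.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# made-up 2-D training vectors with binary labels
X_toy = np.array([[1.0, 0.1], [0.9, 0.2], [1.0, 0.0],
                  [0.1, 1.0], [0.2, 0.9], [0.0, 1.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# cosine distance = 1 - cosine similarity; brute force compares against every sample
clf = KNeighborsClassifier(n_neighbors=3, metric="cosine", algorithm="brute")
clf.fit(X_toy, y_toy)

toy_preds = clf.predict(np.array([[0.8, 0.1], [0.1, 0.7]]))
print(toy_preds)
```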

In [None]:
# simple KNN Classifier  
def knn_predict(vec, corpus, targets, n=5):
    if len(corpus) != len(targets):
        raise ValueError(
            "Corpus and targets must have the same length: "
            f"len(corpus)={len(corpus)}, len(targets)={len(targets)}"
        )
    vec = vec.reshape(1,-1)
    corpus = np.vstack(corpus)
    targets = np.array(targets)

    sims = cosine_similarity(vec, corpus).flatten()
    top_n_idx = np.argsort(sims)[::-1][:n]
    top_n = targets[top_n_idx]
    pred = Counter(top_n).most_common()[0][0]
    
    return pred

In [None]:
# Compute vectors for each cleaned tweet
corpus = np.array([doc_to_vec(x) for x in tqdm(train_df['clean_tweet'])])
train_df['vector'] = list(corpus)

In [None]:
# Split the data into train and test sets 
X_train, X_test, y_train, y_test = train_test_split(
    train_df['vector'], 
    train_df['target'], 
    test_size=0.1, 
    random_state=42
)

In [None]:
def knn_fit_predict(X_train, y_train, X_test, verbose=False):
    start = time()
    iterator = tqdm(X_test) if verbose else X_test
    preds = [knn_predict(x, X_train, y_train) for x in iterator]
    if verbose:
        print(f"That took {time()-start:0.2f} seconds")
    return preds

In [None]:
# Run KNN on the whole Test Dataset
preds = knn_fit_predict(X_train, y_train, X_test, verbose=True)
print(f"Accuracy: {accuracy_score(y_test, preds):0.2f}")
print(f"F1 Score: {f1_score(y_test, preds):0.2f}")

### Results
The model performed reasonably well, considering that we did not have to do any fancy feature engineering or use complicated neural networks. However, these results are not very reliable since how the dataset was split may have introduced some bias. As we will see later in this notebook, using K-Fold Cross Validation can help us get more reliable results. 

## Approximate KNN

Now we come to the main part of this notebook, where we think about how to implement each step of the algorithm described above and discuss how to address the challenges.

### Challenge 1
*How do we store the information about subspaces?*

The answer is "**Locality-Sensitive Hashing**" (LSH). 
The key idea is to map vectors that belong to the same subspace (steps 2 and 3 of the algorithm) to the same hash value. To accomplish this, we calculate the hash as follows:

##### Consider a set of random hyperplanes given by their normal vectors "P" and a vector "v" 
#### Sign(p, v)
1. Sign(p, v) = 1 if v lies on or in front of the plane p, and 0 otherwise.
2. The position of v with respect to p is given by the sign of their dot product.
3. If the dot product is 0, v lies on the plane described by the normal p: return 1.
4. If the dot product is positive, v lies on the side the normal p points to: return 1.
5. If the dot product is negative, v lies on the side opposite to the normal p: return 0.

#### Calculate Hash
1. For each p in P, calculate Sign(p, v).
2. Step 1 yields an array of 0's and 1's whose length equals the number of planes. 
3. Interpret this array as a binary number and convert it to decimal. 
4. Return this decimal as the hash value of v with respect to the set of planes P.

For example, if we have 5 random planes and the vector v lies in front of planes 1, 3 and 4, on plane 5 and behind plane 2, we get the array [1,0,1,1,1]. Reading it as a binary number, with plane 1 as the most significant bit, gives:

![](https://latex.codecogs.com/gif.latex?hash%3D2%5E4%20*%201%20&plus;%202%5E3%20*0%20&plus;%202%5E2%20*1%20&plus;%202%5E1%20*%201%20&plus;%202%5E0%20*%201%20%3D%2023)
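
The same arithmetic can be checked in a few lines. This sketch only reproduces the worked example above; the variable names are illustrative.

```python
# sign bits for planes 1..5 from the example above
signs = [1, 0, 1, 1, 1]

# read the bits as a binary number, plane 1 being the most significant bit
hash_value = 0
for bit in signs:
    hash_value = hash_value * 2 + bit

print(hash_value)  # 23
```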


In [None]:
def get_sign(p, v):
    # 1 if v lies on or in front of the plane with normal p, else 0
    return 1 if np.dot(p, v) >= 0 else 0

def get_multi_planar_hashes(planes, v):
    # calculate the hash of v with respect to a set of planes;
    # the first plane is the most significant bit, as in the worked example
    h = [get_sign(plane, v) for plane in planes]
    hash_value = sum(2**i * x for i, x in enumerate(reversed(h)))
    return hash_value

def build_hash_table(corpus, planes):
    # build a hashtable and assign each training data point to a hash value
    hash_table = defaultdict(list)
    for c, vec in corpus.items():
        hash_value = get_multi_planar_hashes(planes, vec)
        hash_table[hash_value].append(c)
    return hash_table

def get_classmates(vec, planes, hash_table, n=None):
    # return the IDs of the training vectors that share a hash with vec
    hash_val = get_multi_planar_hashes(planes, vec)
    neighbor_idx = hash_table.get(hash_val, [])
    if n is not None:
        neighbor_idx = neighbor_idx[:n]
    return neighbor_idx



### Challenge 2

What if there are not enough samples per subspace? This can happen when the random planes split the vector space in such a way that a new test sample does not have any other train samples that have the same hash. 

To overcome this and also make the algorithm more robust, we initialize multiple sets of random hyperplanes and find out the neighbours for the vector v in each of the subspaces that it resides in (one subspace per set). We then use the combined set of training samples with the same hash as v in each of these subspaces as the neighbours of v. 


For example, suppose we have three sets of planes P1, P2 and P3. 
For each set, we build a hash table and find the neighbours of v with respect to that set. 

Suppose we get the following neighbours for v:
* P1 -> [A, B, D, E]
* P2 -> [C, D, F]
* P3 -> [D, E, F, G] 

Then the set of neighbours for v will be [A, B, C, D, E, F, G]. 

There is still a chance that we end up with no neighbours at all, meaning every set of random hyperplanes isolated v from the rest of the data. This would most likely indicate an outlier, a rare word or a typo. Provided that our training data is large enough and representative of real data, this is quite unlikely, but we still need to be prepared for it. When it happens, we simply fall back to basic KNN and compare v with all samples to find the top-K neighbours.
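
The combination step can be sketched with plain Python sets. The neighbour lists below are the ones from the example, and `combine_candidates` is a hypothetical helper, not part of the implementation that follows.

```python
def combine_candidates(per_set_neighbours, all_ids):
    # union of the neighbours found in each set of planes
    combined = set()
    for neighbours in per_set_neighbours:
        combined |= set(neighbours)
    # fall back to the full training set when every bucket was empty
    return combined if combined else set(all_ids)

all_ids = ["A", "B", "C", "D", "E", "F", "G", "H"]
found = combine_candidates(
    [["A", "B", "D", "E"], ["C", "D", "F"], ["D", "E", "F", "G"]],
    all_ids,
)
print(sorted(found))

# a vector that was isolated by every set of planes
isolated = combine_candidates([[], [], []], all_ids)
print(sorted(isolated))
```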

In [None]:
def get_random_planes_matrix(num_dim, num_planes=10, num_sets=5):
    # randomly generate a set of hyperplanes
    random_planes_matrix = [
        np.random.normal(size=(num_planes,num_dim)) 
        for _ in range(num_sets)
    ]   
    return random_planes_matrix

def get_hash_tables(corpus, planes_matrix):
    # return an array of multiple hash tables per set of planes
    tables = [build_hash_table(corpus, p) for p in planes_matrix]
    return tables 


In [None]:
# Putting it all together
def approx_knn_fit_predict(
    X_train, 
    y_train, 
    X_test, 
    random_state=None, 
    verbose=False,
    num_planes=10,
    num_sets=10,
    num_classmates=100):
    start = time()
    
    if random_state is not None:
        np.random.seed(random_state)
    planes_matrix = get_random_planes_matrix(200, num_planes=num_planes, num_sets=num_sets)
    tables = get_hash_tables(X_train, planes_matrix)
    
    # one candidate set per test vector, filled with the union over all sets of planes
    candidates = [set() for _ in range(len(X_test))]
    for planes, table in zip(planes_matrix, tables):
        for i, x in enumerate(X_test): 
            classmates = get_classmates(x, planes, table, n=num_classmates)
            candidates[i].update(classmates)

    preds = []
    iterator = zip(X_test, candidates)
    iterator = tqdm(iterator, total=len(X_test)) if verbose else iterator
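    for x, c in iterator:
        if len(c) == 0:
            # no candidates in any hash table: fall back to exact KNN on all training data
            corpus, y = X_train, y_train
        else:
            # the hash tables store index labels, so select rows by label
            idx = list(c)
            corpus, y = X_train.loc[idx], y_train.loc[idx]
        preds.append(knn_predict(x, corpus, y))
        
    if verbose:
        print(f"That took {(time() - start):.2f} seconds")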
    
    return preds

In [None]:
preds = approx_knn_fit_predict(X_train, y_train, X_test, random_state=42, verbose=True)
print(f"Accuracy: {accuracy_score(y_test, preds):0.2f}")
print(f"F1 Score: {f1_score(y_test, preds):0.2f}")

#### Result
Well, that was much faster! And if we check carefully, we realise that most of the time taken for the prediction was for calculating the hash tables. In a real use-case, we would only be calculating these hash tables once and for all subsequent test samples, we would only compare it with a small subset of training data. 
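
To make the "build once, query many times" point concrete, here is a rough sketch that separates index construction from lookup. `ToyLSHIndex` is a hypothetical class written for this illustration, with a single set of planes and random toy vectors; it is not how the functions above are organised.

```python
import numpy as np
from collections import defaultdict

class ToyLSHIndex:
    def __init__(self, vectors, num_planes=4, seed=0):
        rng = np.random.default_rng(seed)
        # one set of random hyperplanes, stored as rows of normal vectors
        self.planes = rng.normal(size=(num_planes, vectors.shape[1]))
        self.table = defaultdict(list)
        # the expensive part: hash every training vector exactly once
        for i, key in enumerate(self._keys(vectors)):
            self.table[key].append(i)

    def _keys(self, vectors):
        # sign bit of each vector against each plane, packed into a tuple
        bits = (vectors @ self.planes.T >= 0).astype(int)
        return [tuple(row) for row in bits]

    def candidates(self, v):
        # the cheap part: hash the query and read off its bucket
        key = self._keys(v.reshape(1, -1))[0]
        return self.table.get(key, [])

rng = np.random.default_rng(1)
train = rng.normal(size=(1000, 8))
index = ToyLSHIndex(train)            # built once
cands = index.candidates(train[0])    # reused for every query
print(len(cands), "of", len(train), "vectors share the bucket")
```

Each query costs one hash computation plus a comparison with its bucket, rather than a comparison with the whole training set.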

However, we see that the accuracy and the F1 score took a bit of a hit. Before we can arrive at any conclusions about whether or not this is worth it, let's generate some more reliable results. 

# 4.  Cross-Validation

To get reliable results, we will use K-Fold Cross-Validation to measure the average time, accuracy and F1 score. 

How it works: We divide the training dataset into K "buckets". In each round, we use one bucket as the validation data and the remaining buckets as training data, and record the F1 score and accuracy. The averages over all K rounds are closer to the true scores we can expect on new, unseen data. Doing so also helps reduce any bias introduced by a single train-test split (i.e., a model may perform better on certain subsets of the data than on others).
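
The splitting itself is simple enough to sketch by hand. The hypothetical `k_fold_indices` below is a simplified stand-in for scikit-learn's `KFold` (no shuffling), shown only to make the bucket rotation explicit.

```python
def k_fold_indices(n_samples, k):
    # split range(n_samples) into k nearly equal, contiguous buckets
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        valid = indices[start:start + size]                   # this round's validation bucket
        train = indices[:start] + indices[start + size:]      # everything else
        folds.append((train, valid))
        start += size
    return folds

folds = k_fold_indices(10, 5)
for train_idx, valid_idx in folds:
    print("train:", train_idx, "valid:", valid_idx)
```

Every sample appears in exactly one validation bucket, so each one is used for validation once and for training K-1 times.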

In [None]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)

knn_time = []
knn_acc =[]
knn_f1 = []

approx_knn_time = []
approx_knn_acc = []
approx_knn_f1 = []

for i, (train_i, valid_i) in enumerate(kf.split(train_df)):
    print("Running Fold ",i+1)
    X_train = train_df.iloc[train_i]['vector']
    y_train = train_df.iloc[train_i]['target']
    
    X_valid = train_df.iloc[valid_i]['vector']
    y_valid = train_df.iloc[valid_i]['target']
        
    print("Running KNN", end="...")
    start = time()
    preds = knn_fit_predict(X_train, y_train, X_valid)
    acc = accuracy_score(y_valid, preds)
    f1 = f1_score(y_valid, preds)
    knn_acc.append(acc)
    knn_f1.append(f1)
    knn_time.append(time()-start)
    print("Done")
    
    print("Running Approximate KNN",end="...")
    start = time()
    preds = approx_knn_fit_predict(X_train, y_train, X_valid, random_state=42)
    acc = accuracy_score(y_valid, preds)
    f1 = f1_score(y_valid, preds)
    approx_knn_acc.append(acc)
    approx_knn_f1.append(f1)
    approx_knn_time.append(time()-start)
    print("Done")
    print()

#### Results
As we can see below, when we ran KNN and approximate KNN on identical training and validation data, approximate KNN yielded results that were quite close to KNN while being much faster. Though the time saved might seem small on a dataset containing only 7000 samples, the benefit grows with dataset size: the hash tables only have to be calculated ONCE, after which every new test point is compared with just a fraction of the dataset. 

In [None]:
print("KNN Results")
print(f"Average train time: {np.mean(knn_time):0.2f}")
print(f"Average accuracy: {np.mean(knn_acc):0.2f}")
print(f"Average f1 score: {np.mean(knn_f1):0.2f}")
print()
print("Approx KNN Results")
print(f"Average train time: {np.mean(approx_knn_time):0.2f}")
print(f"Average accuracy: {np.mean(approx_knn_acc):0.2f}")
print(f"Average f1 score: {np.mean(approx_knn_f1):0.2f}")

# 5. GRID Search
This is a bit of overkill considering the scope of this notebook, but since we've come this far, we might as well push it to its limits! 

An added advantage of Approximate KNN is that we can run a brute-force search to find good values for the number of planes, the number of sets and the number of neighbours per subspace to consider. If we run the following code, we get the optimal parameters as:

* `num_planes` = 12
* `num_sets` = 8
* `num_classmates` = 1000


However, this takes a long time to run (about 4400 seconds or about 73 minutes) so I have commented it out. But feel free to try it out yourself and ask any questions you have about it.  
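
The three nested loops in the commented-out cell can also be written with `itertools.product`. The sketch below only enumerates the grid; evaluating each combination would still require a cross-validated call to `approx_knn_fit_predict`. The names `param_grid` and `combos` are illustrative.

```python
from itertools import product

param_grid = {
    "num_planes": [4, 6, 8, 10, 12],
    "num_sets": [2, 4, 8, 16, 24],
    "num_classmates": [10, 50, 100, 1000],
}

# one dict per combination, in the same order as the nested loops
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(combos), "parameter sets to evaluate")
print(combos[0])
```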

In [None]:
# num_planes_choices = [4, 6, 8, 10, 12]
# num_sets_choices = [2, 4, 8, 16, 24]
# num_classmates_choices = [10, 50, 100, 1000]

# progress = 0
# total = len(num_planes_choices)*len(num_sets_choices)*len(num_classmates_choices)

# best_f1 = 0
# best_f1_params = None 
# best_time = np.inf

# margin = 0.03
# grid_search_start = time()

# for num_planes in num_planes_choices:
#     for num_sets in num_sets_choices:
#         for num_classmates in num_classmates_choices:
               
#             planes_matrix = get_random_planes_matrix(200, num_planes=num_planes, num_sets=num_sets)
#             params = [num_planes, num_sets, num_classmates]
#             cv_f1 = []
#             progress += 1
#             print(f"Now processing param set {progress}/{total}",end='\r')
#             for i, (train_i, valid_i) in enumerate(kf.split(train_df)):
#                 X_train = train_df.iloc[train_i]['vector']
#                 y_train = train_df.iloc[train_i]['target']

#                 X_valid = train_df.iloc[valid_i]['vector']
#                 y_valid = train_df.iloc[valid_i]['target']
                
#                 start = time()
#                 preds = approx_knn_fit_predict(
#                     X_train, 
#                     y_train, 
#                     X_valid, 
#                     random_state=42,     
#                     num_planes=num_planes,
#                     num_sets=num_sets,
#                     num_classmates=num_classmates)
#                 f1 = f1_score(y_valid, preds)
#                 time_taken = time() - start
#                 cv_f1.append(f1)
                
#             avg_f1 = np.mean(cv_f1)
#             if avg_f1 > 0.7 and time_taken < best_time:
#                 print("\nBest Time Updated to: ", time_taken)
#                 print("F1 Score: ", avg_f1)
#                 best_f1_params = params 
#                 best_time = time_taken 


# print(f"Done. That took {time()-grid_search_start} seconds")            

In [None]:
# best_f1_params  # only defined when the grid search above is uncommented

In [None]:
# Split the data into train and test sets 
X_train, X_test, y_train, y_test = train_test_split(
    train_df['vector'], 
    train_df['target'], 
    test_size=0.1, 
    random_state=42
)

# Running the algorithm with the optimal parameters
preds = approx_knn_fit_predict(
    X_train, 
    y_train, 
    X_test, 
    num_planes=12,
    num_sets =4, 
    num_classmates=1000,
    random_state=42, 
    verbose=True)

print(f"Accuracy: {accuracy_score(y_test, preds):0.2f}")
print(f"F1 Score: {f1_score(y_test, preds):0.2f}")

#### Results
Great! Even though that took a while, it looks like it was worth it at the end. 
We finally end up with an **accuracy of 78% and an F1 score of 0.73** with only a few seconds of train time!

# 6. Conclusion
To wrap things up, let's go over our key takeaways from this notebook:

1. Using tweet-preprocessor and spaCy to clean up textual data 
2. Using Pre-trained word embeddings to save time and effort 
3. Understand what KNN is and how to implement it from scratch 
4. Understand what Approximate KNN is and how it differs from KNN 
5. Understand Locality Sensitive Hashing 
6. Perform Cross-Validation to get reliable results 
7. Perform Grid Search to get the optimum hyperparameters

While the impact of these results may not be evident with this small dataset, approximate KNN scales quite well to larger data, resulting in fast predictions. In the end, the choice between the two boils down to the use case. Different applications value the trade-off between speed and accuracy quite differently, so a comparative study often helps to get a better idea of the available options. 



# 7. Future Work 
Throughout this notebook, we applied only a few simple preprocessing steps since achieving a top score was not the focus of this notebook. Some of the ways you could expand on this model to generate better results are: 

1. There is a strong signal present in the "keywords" column of the dataset. Find some way to include this information in the document embeddings. 
2. Use custom word embeddings (using Keras Embedding Layers) so that we do not lose data 
3. Scale the word vectors by their corresponding TF-IDF values to better represent important words  
4. Use sentence-level embeddings from models that can capture contextual information. Simple averaged word embeddings do not. 
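
As a rough sketch of idea 3, the plain average in `doc_to_vec` could be replaced by a weighted average. The example below uses made-up 2-D "embeddings" and a three-document toy corpus, and reduces the weighting to the IDF part (term frequency enters through repeated words); `idf` and `weighted_doc_vec` are hypothetical helpers, not part of this notebook.

```python
import math
import numpy as np

# made-up 2-D "embeddings" and a tiny corpus, for illustration only
toy_vectors = {"fire": np.array([1.0, 0.0]), "the": np.array([0.0, 1.0])}
docs = [["the", "fire"], ["the", "cat"], ["the", "dog"]]

def idf(word):
    # inverse document frequency: rarer words get a larger weight
    # (assumes the word occurs in at least one document)
    df = sum(word in d for d in docs)
    return math.log(len(docs) / df)

def weighted_doc_vec(doc):
    words = [w for w in doc if w in toy_vectors]
    weights = np.array([idf(w) for w in words])
    if len(words) == 0 or weights.sum() == 0:
        return np.zeros(2)
    vecs = np.array([toy_vectors[w] for w in words])
    # weighted average of the word vectors
    return (weights[:, None] * vecs).sum(axis=0) / weights.sum()

vec = weighted_doc_vec(["the", "fire"])
print(vec)
```

Because "the" occurs in every toy document, its weight is zero and the document vector is driven entirely by the rarer word.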

*This was my second ever notebook on Kaggle and I worked hard on it, so thanks for stopping to give it a read! I'm always trying to improve, so I would love to hear your feedback and criticism on my work. Please feel free to leave some comments below. Happy Learning!*

In [None]:
# generating submission file 
test_df['clean_tweet'] = [clean_tweet(x) for x in test_df['text']]
test_df['vector'] = [doc_to_vec(x) for x in test_df['clean_tweet']]

preds = approx_knn_fit_predict(
    train_df['vector'], 
    train_df['target'], 
    test_df['vector'], 
    num_planes=12,
    num_sets =8, 
    num_classmates=1000,
    random_state=42, 
    verbose=True)

results = pd.DataFrame(zip(test_df['id'], preds), columns =['id', 'target'])
results.to_csv('submission.csv', index=False)