# Nearest Neighbors

When exploring a large set of documents -- such as Wikipedia, news articles, StackOverflow, etc. -- it can be useful to get a list of related material. To find relevant documents you typically
* Decide on a notion of similarity
* Find the documents that are most similar 

In the assignment you will
* Gain intuition for different notions of similarity and practice finding similar documents. 
* Explore the tradeoffs with representing documents using raw word counts and TF-IDF
* Explore the behavior of different distance metrics by looking at the Wikipedia pages most similar to President Obama’s page.

## Load Wikipedia dataset

In [1]:
import numpy as np
import pandas as pd
import sklearn
import nltk
import re
import codecs

In [2]:
people = pd.read_csv('people_wiki.csv')
people

Unnamed: 0,URI,name,text
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...
...,...,...,...
59066,<http://dbpedia.org/resource/Olari_Elts>,Olari Elts,olari elts born april 27 1971 in tallinn eston...
59067,<http://dbpedia.org/resource/Scott_F._Crago>,Scott F. Crago,scott francis crago born july 26 1963 twin bro...
59068,<http://dbpedia.org/resource/David_Cass_(footb...,David Cass (footballer),david william royce cass born 27 march 1962 in...
59069,<http://dbpedia.org/resource/Keith_Elias>,Keith Elias,keith hector elias born february 3 1972 in lac...


## Extract word count vectors

In [3]:
# add word_count column
people['word_count'] = ''

for i in range(len(people)):
    d = {}
    
    if pd.isna(people['text'][i]):
        people["word_count"][i] = d
    else:
        word_list = [word.strip(",.") for word in people['text'][i].lower().split()]
        for word in word_list:
            if word not in d:
                d[word] = 1
            d[word] += 1
        people["word_count"][i] = d

people

Unnamed: 0,URI,name,text,word_count
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...,"{'digby': 2, 'morrell': 6, 'born': 2, '10': 2,..."
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...,"{'alfred': 2, 'j': 2, 'lewy': 4, 'aka': 2, 'sa..."
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...,"{'harpdog': 3, 'brown': 3, 'is': 8, 'a': 8, 's..."
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...,"{'franz': 2, 'rottensteiner': 4, 'born': 2, 'i..."
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"{'henry': 2, 'krvits': 2, 'born': 2, '30': 2, ..."
...,...,...,...,...
59066,<http://dbpedia.org/resource/Olari_Elts>,Olari Elts,olari elts born april 27 1971 in tallinn eston...,"{'olari': 3, 'elts': 4, 'born': 2, 'april': 2,..."
59067,<http://dbpedia.org/resource/Scott_F._Crago>,Scott F. Crago,scott francis crago born july 26 1963 twin bro...,"{'scott': 2, 'francis': 2, 'crago': 6, 'born':..."
59068,<http://dbpedia.org/resource/David_Cass_(footb...,David Cass (footballer),david william royce cass born 27 march 1962 in...,"{'david': 2, 'william': 2, 'royce': 2, 'cass':..."
59069,<http://dbpedia.org/resource/Keith_Elias>,Keith Elias,keith hector elias born february 3 1972 in lac...,"{'keith': 2, 'hector': 2, 'elias': 5, 'born': ..."


## Find nearest neighbors 

Let's start by finding the nearest neighbors of the Barack Obama page using the word count vectors to represent the articles and Euclidean distance to measure distance. 

**Reference** <br>
https://scikit-learn.org/stable/modules/neighbors.html#neighbors  <br>
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html <br>
http://mlbernauer.github.io/R/20160131-document-retrieval-sklearn.html

In [4]:
import sklearn
import numpy as np
import nltk
import re
import codecs

nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\vanch\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Clean Text

In [5]:
# Function that cleans text by removing '\x0c' and '\n' characters
# as well as all non-alpha characters and finally converts everything
# to lower case
def clean_text(text):
    stop_words = ['\x0c', '\n']
    for i in stop_words:
        text.replace(i, ' ')
    clean_text = re.sub('[^a-zA-Z]+', ' ', text)
    return clean_text.lower()

In [6]:
people['clean_text'] = people['text'].apply(clean_text)
people

Unnamed: 0,URI,name,text,word_count,clean_text
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...,"{'digby': 2, 'morrell': 6, 'born': 2, '10': 2,...",digby morrell born october is a former austral...
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...,"{'alfred': 2, 'j': 2, 'lewy': 4, 'aka': 2, 'sa...",alfred j lewy aka sandy lewy graduated from un...
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...,"{'harpdog': 3, 'brown': 3, 'is': 8, 'a': 8, 's...",harpdog brown is a singer and harmonica player...
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...,"{'franz': 2, 'rottensteiner': 4, 'born': 2, 'i...",franz rottensteiner born in waidmannsfeld lowe...
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"{'henry': 2, 'krvits': 2, 'born': 2, '30': 2, ...",henry krvits born december in tallinn better k...
...,...,...,...,...,...
59066,<http://dbpedia.org/resource/Olari_Elts>,Olari Elts,olari elts born april 27 1971 in tallinn eston...,"{'olari': 3, 'elts': 4, 'born': 2, 'april': 2,...",olari elts born april in tallinn estonia is an...
59067,<http://dbpedia.org/resource/Scott_F._Crago>,Scott F. Crago,scott francis crago born july 26 1963 twin bro...,"{'scott': 2, 'francis': 2, 'crago': 6, 'born':...",scott francis crago born july twin brother to ...
59068,<http://dbpedia.org/resource/David_Cass_(footb...,David Cass (footballer),david william royce cass born 27 march 1962 in...,"{'david': 2, 'william': 2, 'royce': 2, 'cass':...",david william royce cass born march in forest ...
59069,<http://dbpedia.org/resource/Keith_Elias>,Keith Elias,keith hector elias born february 3 1972 in lac...,"{'keith': 2, 'hector': 2, 'elias': 5, 'born': ...",keith hector elias born february in lacey town...


### Use NLTK word_tokenize() and PorterStemmer() to tokenize and stem tex

In [7]:
# Function that takes text, tokenizes it and 
# returns list of stemmed tokens
def tokenize_and_stem(text):
    tokens = nltk.word_tokenize(text)
    stemmer = nltk.stem.porter.PorterStemmer()
    return [i for i in [stemmer.stem(t) for t in tokens] if len(i) > 2]

In [8]:
# Import the TfidfVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

text_tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=0, max_features=200000,
               stop_words='english', use_idf=True, tokenizer=tokenize_and_stem)

In [9]:
# Compute tfidf for text
text_tfidf = text_tfidf_vectorizer.fit_transform(people['clean_text'])

  'stop_words.' % sorted(inconsistent))


In [10]:
# Get feature names for text
text_tfidf_feature = text_tfidf_vectorizer.get_feature_names()

### Write function to get the top-k features associated with a text

In [11]:
# Function for returning the top_k features
def get_top_features(rownum, weights, features, top_k=10):
    weight_vec = weights.toarray()[rownum,:]
    top_idx = np.argsort(weight_vec)[::-1][:top_k]
    return [features[i] for i in top_idx]

### Build Nearest Neighbors model

In [12]:
# Build model to return 5 closest neighbors
from sklearn.neighbors import NearestNeighbors

# Create the k-NN model using k=10
nn_text = NearestNeighbors(n_neighbors=10, algorithm='brute', metric='minkowski', p=2)

# Fit the models to the TF-IDF weights matrix
nn_text_fitted = nn_text.fit(text_tfidf)