# Using K-Nearest Neighbours with the brute-force algorithm, we recommend the top 10 articles that are closest to a given article.

In [8]:
import json 
import pandas as pd
import numpy as np
from pandas import json_normalize  # flattens nested JSON into a DataFrame (pandas.io.json.json_normalize is deprecated since pandas 1.0)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

In [3]:
%%time

#df = pd.read_json(r'/opt/models/codefordemocracy/data/immi_articles.json', orient='columns')
df = pd.read_json(r'/opt/trainml/input/immigration_articles.json/immigration_articles.json', orient='columns')
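# Drop the index metadata columns; df_clean keeps '_index' and '_source'.
df_clean = df.drop(['_id', '_score', '_type'], axis=1)
df_src = df.drop(['_index', '_id', '_score', '_type'], axis=1)
# Flatten the nested '_source' records into one column per field.
df_src_norm = json_normalize(df_src['_source'])
# Keep only the fields needed for the recommendations.
df_src_norm_limited = df_src_norm.filter(items=[
    'url', 'site_name', 'extracted.title', 'extracted.date',
    'extracted.text', 'metadata.keywords', 'metadata.description',
])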



CPU times: user 45.2 s, sys: 1.49 s, total: 46.7 s
Wall time: 46.7 s


In [4]:
%%time


extracted_text = df_src_norm_limited['extracted.text']
# dropna() returns a new Series; the result is only displayed here, not assigned,
# so extracted_text keeps all of its rows and its original index.
extracted_text.dropna()




CPU times: user 20.7 ms, sys: 0 ns, total: 20.7 ms
Wall time: 19.3 ms


0        HOW many times in the past few months have we ...
1        Val Morgan has been an integral part of the UK...
2                                                         
3        Val Morgan has been an integral part of the UK...
4        Joe Biden has chosen U.S. Sen. Kamala D. Harri...
                               ...                        
49999    Rep. Adam Schiff said Sunday on CBS's "Face th...
50000    Sidney Powell, author of 'Licensed to Lie,' sa...
50001    Trump Delivers Remarks on Federal Judicial Con...
50002    President Donald Trump delivered remarks on ta...
50003    Panelists discuss preparations for the 2020 U....
Name: extracted.text, Length: 50004, dtype: object

In [20]:
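# Keep only words that appear in at most 70% of the documents (max_df=0.7)
# and in at least 10 documents (min_df=10); drop English stop words.
count_vect = CountVectorizer(max_df=0.7, min_df=10, stop_words='english')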
#count_vect = CountVectorizer(max_df=0.8, min_df=2, stop_words=my_stop_words)

#doc_term_matrix = count_vect.fit_transform(df_src_norm_limited['extracted.text'].values.astype('U'))
doc_term_matrix = count_vect.fit_transform(extracted_text.values.astype('U'))

doc_term_matrix

<50004x31897 sparse matrix of type '<class 'numpy.int64'>'
	with 9346374 stored elements in Compressed Sparse Row format>
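
To make the effect of `max_df` and `min_df` concrete, here is a minimal sketch on a hypothetical four-document corpus (the documents and the `toy_` names are illustrative assumptions, not part of the dataset). A word that occurs in every document is dropped by `max_df`, and words that occur only once are dropped by `min_df`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical four-document corpus, for illustration only.
toy_docs = [
    "immigration policy debate",
    "immigration visa rules",
    "immigration court ruling on visa",
    "immigration policy and court ruling",
]

# max_df=0.7 drops terms found in more than 70% of the documents;
# min_df=2 drops terms found in fewer than 2 documents.
toy_vect = CountVectorizer(max_df=0.7, min_df=2)
toy_matrix = toy_vect.fit_transform(toy_docs)

toy_vocab = sorted(toy_vect.vocabulary_)
print(toy_vocab)         # the terms that survived both filters
print(toy_matrix.shape)  # (number of documents, number of kept terms)
```

Here "immigration" appears in all four documents and is removed, which is the same mechanism that shrinks the vocabulary of the real corpus to 31897 terms.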

In [21]:
from sklearn.neighbors import NearestNeighbors

model = NearestNeighbors(metric='euclidean', algorithm='brute')
model.fit(doc_term_matrix)

NearestNeighbors(algorithm='brute', metric='euclidean')
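
As a rough sketch of what `algorithm='brute'` amounts to, the search computes the distance from the query to every row and keeps the smallest ones. The example below uses hypothetical toy vectors (an assumption for illustration) and compares a hand-written NumPy search with scikit-learn's result.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy count vectors (rows = documents), for illustration only.
toy_counts = np.array([
    [2.0, 0.0, 1.0],
    [2.0, 1.0, 1.0],
    [0.0, 3.0, 0.0],
    [5.0, 0.0, 4.0],
])
query = toy_counts[0]

# Brute force by hand: Euclidean distance from the query to every row, then sort.
all_dists = np.sqrt(((toy_counts - query) ** 2).sum(axis=1))
manual_order = np.argsort(all_dists, kind="stable")[:3]

# The same search through scikit-learn.
toy_model = NearestNeighbors(metric="euclidean", algorithm="brute").fit(toy_counts)
sk_dists, sk_idx = toy_model.kneighbors(query.reshape(1, -1), n_neighbors=3)

print(manual_order, all_dists[manual_order])
print(sk_idx[0], sk_dists[0])
```

Fitting a brute-force model only stores the data, so the cost is paid at query time, when every row is compared with the query.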

## Here we pass one article, doc_term_matrix[i], and get back its top n neighbors.

In [22]:
distances, indices = model.kneighbors(doc_term_matrix[0], n_neighbors=10) # 1st arg: word count vector

In [23]:
#print (doc_term_matrix.join(neighbors, on='id').sort('distance')[['id','name','distance']])
print(distances)

[[ 0.         29.68164416 30.69201851 30.69201851 32.04684072 32.31098884
  32.40370349 32.49615362 32.92415527 33.61547263]]


In [15]:
print(indices)

[[    0   508 22632 22943 42559 17544 19351 45024 49568 36827]]
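
The first neighbor returned is the query article itself (distance 0), so a recommendation list would normally skip it. The sketch below shows one possible way to pair indices with distances and titles. The helper `top_neighbors`, the toy vectors, and the toy titles are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def top_neighbors(model, vectors, row, titles, n=3):
    """Return the n nearest rows to `row`, excluding `row` itself."""
    # Ask for one extra neighbor because the query usually matches itself.
    dist, idx = model.kneighbors(vectors[row:row + 1], n_neighbors=n + 1)
    result = pd.DataFrame({"row": idx[0], "distance": dist[0]})
    result = result[result["row"] != row].head(n)
    result["title"] = [titles[i] for i in result["row"]]
    return result.reset_index(drop=True)

# Hypothetical toy data, for illustration only.
toy_vectors = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 5.0],
    [1.0, 0.0, 2.0],
])
toy_titles = ["a", "b", "c", "d"]

toy_model = NearestNeighbors(metric="euclidean", algorithm="brute").fit(toy_vectors)
recs = top_neighbors(toy_model, toy_vectors, row=0, titles=toy_titles, n=2)
print(recs)
```

In this notebook the same helper could presumably be called with `model`, `doc_term_matrix`, and the `extracted.title` column converted to a list, assuming that column is populated for every row.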


In [18]:
print(extracted_text[22632])

Care workers and nurses from overseas should be exempt from salary caps in the Government's new points-based immigration system after Brexit, say two thirds of the public.

An ICMP poll for think tank British Future found some 63 per cent of people said that there should be exceptions to a salary threshold for people moving to the UK to do important jobs that need doing, such as nurses and care workers.

That sentiment was shared across political divides, with agreement from 62 per cent of 2019 Conservative voters and 67 per cent of Labour voters; 58 per cent of Leave voters and 73 per cent of Remain voters.

It is the key area where the public divides from ministers. More than three-quarters of the public would be happy for the numbers of high-skilled workers coming to the UK from the EU (79 per cent) or outside the EU (77 per cent) to stay the same or increase.

Some 64 per cent would like to see the numbers of international students coming to the UK either remain the same or increas

# Implementing TF-IDF

## To retrieve articles that are more relevant, we should focus on rare words that do not appear in every article. TF-IDF (term frequency–inverse document frequency) is a feature representation that penalizes words that are too common. Let us compute the TF-IDF vectors and repeat the nearest neighbor search.
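
To illustrate how common words are penalized, here is a small sketch on a hypothetical three-document corpus (an assumption for illustration). With scikit-learn's default settings (`smooth_idf=True`), the inverse document frequency of a term is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents that contain the term.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus, for illustration only.
toy_docs = [
    "immigration policy debate",
    "immigration visa rules",
    "immigration court ruling",
]

toy_tfidf = TfidfVectorizer()
toy_weights = toy_tfidf.fit_transform(toy_docs)

# Map each term to its learned IDF weight.
idf = {term: toy_tfidf.idf_[col] for term, col in toy_tfidf.vocabulary_.items()}

print(idf["immigration"])  # appears in every document: lowest possible IDF
print(idf["debate"])       # appears in one document: higher IDF
```

The word that occurs in every document receives the minimum weight, so it contributes less to distances than the words that distinguish one article from another.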

In [5]:
vectorizer = TfidfVectorizer()
tfidf_vector = vectorizer.fit_transform(extracted_text)


In [6]:
tfidf_vector.shape

(50004, 129090)

In [10]:
%%time
model_tf_idf = NearestNeighbors(metric='euclidean', algorithm='brute')
model_tf_idf.fit(tfidf_vector)

CPU times: user 38.2 ms, sys: 11.9 ms, total: 50.1 ms
Wall time: 47.4 ms


NearestNeighbors(algorithm='brute', metric='euclidean')

In [15]:
tf_distances, tf_indices = model_tf_idf.kneighbors(tfidf_vector[0], n_neighbors=10)

In [16]:
print(tf_indices)

[[    0   508 17544 22943 22632  8909 19755 39906 45093 40580]]


In [17]:
print(tf_distances)

[[0.         0.77291921 0.85913338 0.87155565 0.87155565 0.87890538
  0.89874159 0.9045648  0.90524147 0.90653462]]


In [24]:
print(extracted_text[17544])

Almost two-thirds of Britons are still concerned about their country’s high levels of immigration, citing pressure on the National Health Service (NHS) and schools as especially worrying.

Deltapoll research commissioned by the Migration Watch UK think tank found that some 65 per cent of Britons “agree that recent levels of overseas net migration to the UK are a source of major concern for the public”, compared to just 22 per cent who think is not.

A majority of supporters of the Conservative Party, opposition Labour Party, and even the left-progressive Liberal Democrats which the Labour Party would likely have to rely on to form a government in the event of a hung parliament after national elections in December, all told pollsters immigration was a cause of “substantial concern”.

The majority among Conservative supporters was 75 per cent, among Labour supporters 62 per cent, and among Lib Dem voters 53 per cent.

By region, 65 per cent of respondents even in the hyper-diverse, multi

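### Note: Word-count features are proportional to word frequencies, and so are TF-IDF weights before normalization. While TF-IDF penalizes very common words, longer articles tend to have longer unnormalized vectors simply because they have more words in them. scikit-learn's TfidfVectorizer rescales each row to unit L2 norm by default (norm='l2'), so the TF-IDF vectors above already have unit length. To remove the length bias regardless of how the vectors were built, we turn to cosine distances: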

## Cosine Similarity

### Cosine distances let us compare word distributions of two articles of varying lengths. Let us train a new nearest neighbor model, this time with cosine distances.

In [27]:
model2_tf_idf = NearestNeighbors(algorithm='brute', metric='cosine')
model2_tf_idf.fit(tfidf_vector)
cs_distances, cs_indices = model2_tf_idf.kneighbors(tfidf_vector[0], n_neighbors=10)

In [28]:
print(cs_indices)

[[    0   508 17544 22632 22943  8909 19755 39906 45093 40580]]
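
The cosine neighbors are the same ten articles as the Euclidean TF-IDF neighbors. Only rows 22632 and 22943, which are tied at distance 0.87155565, swap places. A likely explanation is that `TfidfVectorizer` normalizes each row to unit length by default (`norm='l2'`), and for unit vectors the squared Euclidean distance equals twice the cosine distance, so both metrics rank neighbors identically. The sketch below checks this on a hypothetical toy corpus (an assumption for illustration).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

# Hypothetical toy corpus with documents of different lengths.
toy_docs = [
    "immigration policy debate",
    "immigration policy debate on visa rules and court ruling today",
    "visa rules change",
]

toy_vecs = TfidfVectorizer().fit_transform(toy_docs)  # default norm='l2'

# Every row should have Euclidean length 1.
row_norms = np.asarray(np.sqrt(toy_vecs.multiply(toy_vecs).sum(axis=1)))

# Distances from the first document to all documents, under both metrics.
eu = euclidean_distances(toy_vecs[0], toy_vecs)
cs = cosine_distances(toy_vecs[0], toy_vecs)

print(np.allclose(row_norms, 1.0))    # rows are unit length
print(np.allclose(eu ** 2, 2 * cs))   # squared Euclidean == 2 * cosine distance
```

To see a real difference between the two metrics, one option would be to compare them on the raw count vectors in `doc_term_matrix`, which are not normalized.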


In [None]:
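## For now, there is no major difference in results between cosine distance and Euclidean distance on the TF-IDF vectors: the same ten neighbors come back, with only the tied pair 22632 and 22943 swapping places. Need to do 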