In [1]:
import json
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np     
import random

## Import the data
 
We have five JSON files to parse. Each file is a batch of articles and their associated metadata from The Guardian's Content API (CAPI). With a free developer key, we are limited to 200 articles per request, so five downloads give us 1000 articles to work with. If you'd like to query the API yourself, this repo contains an example query in a .txt file. You can sign up for your own developer key at https://open-platform.theguardian.com/
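As a rough illustration of what five requests of 200 articles might look like, here is a minimal sketch that only builds the request URLs and sends nothing. The endpoint and the parameter names (`api-key`, `page`, `page-size`, `show-fields`) are assumptions based on the public Content API and may differ from the example query stored in the repo; `build_query` and `"YOUR_KEY"` are hypothetical placeholders.

```python
from urllib.parse import urlencode

# Assumed endpoint; check the Open Platform documentation before relying on it
BASE_URL = "https://content.guardianapis.com/search"

def build_query(api_key, page, page_size=200):
    """Return the URL for one page of search results (hypothetical helper)."""
    params = {
        "api-key": api_key,
        "page": page,
        "page-size": page_size,
        # Ask only for the fields used in this notebook
        "show-fields": "headline,trailText,body",
    }
    return BASE_URL + "?" + urlencode(params)

# Five pages of 200 articles each; "YOUR_KEY" is a placeholder
urls = [build_query("YOUR_KEY", page) for page in range(1, 6)]
print(urls[0])
```

Saving each response as `search_0.json` to `search_4.json` would then match the file names used below.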

The JSON files contain many fields that could be useful for the purposes of machine learning. However, we will focus on the following fields for now:

- trailText
- headline
- body (this is the data we will use for machine learning; the other two merely provide context)

We can experiment with the rest of the metadata (tags, author, etc.) later. Let's initialise our lists and append each article's relevant fields to them.
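To make the nesting concrete, here is a minimal sketch that pulls fields out of a made-up response with the same `response` -> `results` -> `fields` layout as the downloaded files. The sample data and the `extract` helper are illustrative assumptions; the helper falls back to an empty string when a field is missing, which the cells below do not do.

```python
import json

# A made-up response with the same nesting as the downloaded files
sample = json.loads("""
{"response": {"results": [
  {"fields": {"headline": "First headline", "trailText": "First trail", "body": "<p>First body</p>"}},
  {"fields": {"headline": "Second headline", "trailText": "Second trail"}}
]}}
""")

def extract(articles, field):
    """Return one value per article, or an empty string if the field is missing."""
    return [a.get("fields", {}).get(field, "") for a in articles]

articles = sample["response"]["results"]
headlines = extract(articles, "headline")
bodies = extract(articles, "body")
print(headlines)
print(bodies)
```

Using `.get` keeps the three lists the same length even if an article lacks one of the fields.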

In [2]:
files = ["search_0.json", "search_1.json", "search_2.json", "search_3.json", "search_4.json"]

results = []

for file in files:

    # Open the JSON file and parse it into a dictionary;
    # the context manager closes the file afterwards
    with open("articles/" + file) as f:
        data = json.load(f)

    # The articles live under response -> results
    # print(data["response"]["results"][0])
    articles = data["response"]["results"]

    # Collect each article (not the whole batch) in one flat list
    for item in articles:
        results.append(item)

print(len(results))

1000


In [3]:
files = ["search_0.json", "search_1.json", "search_2.json", "search_3.json", "search_4.json"]

headline = []
trailText = []
body = []

for file in files:

    # Open the JSON file and parse it into a dictionary;
    # the context manager closes the file afterwards
    with open("articles/" + file) as f:
        data = json.load(f)

    # This is the data that we actually care about
    articles = data['response']['results']

    for article in articles:
        fields = article["fields"]
        headline.append(fields["headline"])
        trailText.append(fields["trailText"])
        body.append(fields["body"])



Sanity check the lengths

In [4]:
len(headline), len(trailText), len(body)

(1000, 1000, 1000)

## Preprocessing the data

We should now clean the data as some of the fields contain HTML (which we don't want). 

In [5]:
# body[1]

In [6]:
# As per the recommendation from @freylis, compile the pattern once only
htmlRemover = re.compile('<.*?>')
newlineRemover = '\n'

def cleanhtml(raw_html):
    # Strip anything that looks like an HTML tag, then drop newlines
    cleantext = htmlRemover.sub('', raw_html)
    cleantext = cleantext.replace(newlineRemover, '')
    return cleantext
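
The tag-stripping regex leaves HTML entities such as `&amp;` untouched, and removing tags and newlines without a replacement can glue the last word of one paragraph to the first word of the next. A minimal sketch of one possible variation, assuming we want entities decoded and words kept apart; `clean_html_text` and `TAG_PATTERN` are hypothetical names, not part of the notebook:

```python
import html
import re

TAG_PATTERN = re.compile(r"<.*?>")

def clean_html_text(raw_html):
    """Strip tags, decode entities, and collapse whitespace (hypothetical helper)."""
    # Replace each tag with a space so neighbouring words stay separate
    text = TAG_PATTERN.sub(" ", raw_html)
    # Turn entities such as &amp; back into plain characters
    text = html.unescape(text)
    # Collapse runs of whitespace, including newlines, into single spaces
    return " ".join(text.split())

sample = "<p>Fish &amp; chips</p>\n<p>It&#39;s <b>good</b></p>"
print(clean_html_text(sample))
```

For anything beyond simple markup, a proper HTML parser is usually a safer choice than a regex.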

  

In [7]:
cleanTrailText = []
cleanBody = []

for trail, text in zip(trailText, body):
    cleanTrailText.append(cleanhtml(trail))
    cleanBody.append(cleanhtml(text))

In [8]:
# cleanBody[1]

## Transform the articles into a sparse matrix based on similarity

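But first, of course, we need to perform a tf-idf transformation. TfidfVectorizer L2-normalises each row by default, so multiplying the tf-idf matrix by its transpose gives the cosine similarity between every pair of articles.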

In [9]:
vect = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vect.fit_transform(cleanBody)
pairwise_similarity = tfidf * tfidf.T
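
To show what this product computes, here is a minimal pure-Python sketch on three made-up sentences. It uses smoothed idf and unit-length vectors, which is my understanding of the TfidfVectorizer defaults (`smooth_idf=True`, `norm="l2"`), but it splits on whitespace only and removes no stop words, so its numbers will not match scikit-learn's exactly.

```python
import math
from collections import Counter

docs = [
    "gun laws debated after school shooting",
    "officials share details of school shooting",
    "minister denies allegations",
]
tokenised = [d.split() for d in docs]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(term for tokens in tokenised for term in set(tokens))

def tfidf_vector(tokens):
    """Smoothed tf-idf weights for one document, scaled to unit length."""
    counts = Counter(tokens)
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * v[t] for t, w in u.items() if t in v)

vectors = [tfidf_vector(tokens) for tokens in tokenised]

# With unit-length vectors, the dot product is the cosine similarity
similarity = [[dot(u, v) for v in vectors] for u in vectors]

for row in similarity:
    print([round(s, 3) for s in row])
```

Each document scores 1 against itself, and documents that share no terms score 0.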

In [10]:
tfidf

<1000x40735 sparse matrix of type '<class 'numpy.float64'>'
	with 339334 stored elements in Compressed Sparse Row format>

In [11]:
# Mask the diagonal (each article's similarity with itself)
# so that we don't just recommend the same article back
arr = pairwise_similarity.toarray()
np.fill_diagonal(arr, np.nan)
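
The diagonal masking can be checked on a small example. This is a minimal sketch on a made-up 4x4 similarity matrix; `top_k` is a hypothetical helper showing one way to get several recommendations per article instead of just the best one.

```python
import numpy as np

# Toy similarity matrix for four articles (values are made up)
sim = np.array([
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.6],
    [0.3, 0.4, 0.6, 1.0],
])

# Hide each article's similarity with itself
masked = sim.copy()
np.fill_diagonal(masked, np.nan)

# Best single match per article, ignoring the NaN diagonal
best = np.nanargmax(masked, axis=1)

def top_k(row, k):
    """Indices of the k highest non-NaN scores in a row, best first (hypothetical helper)."""
    # Push NaN entries to the bottom of the ranking
    filled = np.where(np.isnan(row), -np.inf, row)
    return np.argsort(filled)[::-1][:k]

print(best)
print(top_k(masked[0], 2))
```

Without the mask, every article's best match would be itself, with a score of 1.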

## Did it work? Let's find out!

Let's draw 10 random articles and find the most similar for each one.
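
When drawing random articles, two details matter: every index must stay below the number of articles, and independent draws can pick the same article twice. A minimal sketch, assuming we want ten distinct articles; `rng`, `n_articles` and `sampled` are illustrative names, and the seed is only there to make the draw repeatable.

```python
import random

n_articles = 1000  # number of articles loaded above

# Seeding is optional; it just makes the draw repeatable
rng = random.Random(42)

# sample() picks without replacement, so all ten indices are distinct
# and each one lies in range(n_articles)
sampled = rng.sample(range(n_articles), 10)
print(sampled)
```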

In [12]:
for x in range(1, 11):

    # Pick a random article index (randrange excludes the upper bound)
    input_idx = random.randrange(len(cleanBody))

    # Index of the most similar article, ignoring the NaN diagonal
    result_idx = np.nanargmax(arr[input_idx])
    print("{}:".format(x))
    print("SAVE FOR LATER ARTICLE: {}".format(headline[input_idx]))
    print("{}".format(trailText[input_idx]))

    print("\n")
    # print(cleanBody[result_idx])

    print("RECOMMENDED ARTICLE: {}".format(headline[result_idx]))
    print("{}".format(trailText[result_idx]))
    similarity_score = round(arr[input_idx, result_idx], 3)
    print("Similarity score: {}".format(similarity_score))
    print("___________")

1:
SAVE FOR LATER ARTICLE: Macron accused of betraying pledge to stamp out violence against women
Campaigners protest against ‘government of shame’ after minister accused of rape by two women is kept in place


RECOMMENDED ARTICLE: French minister refuses to stand down over rape allegations
Damien Abad, appointed to the new government on Friday, denies ‘deeply wounding’ accusations
Similarity score: 0.569
___________
2:
SAVE FOR LATER ARTICLE: Biden calls for action on gun laws after 21 killed in Texas school shooting – as it happened
Three children, aged eight and 10, have been named; US president Joe Biden called for ‘common sense’ legislation after school massacre


RECOMMENDED ARTICLE:  Texas school shooting: gunman was inside for 40 minutes, officials say – updates as they happened
Texas governor and local officials share more details about the shooting at the Robb elementary school in Uvalde
Similarity score: 0.798
___________
3:
SAVE FOR LATER ARTICLE: Cancelling Socrates: how t