### Article Selection
API calls are made through the NY Times Developer platform. They extract metadata and article snippets from the NY Times archives based on the search conditions supplied. The calls primarily use the news_desk attribute to import a specific number of articles from each section. The section topics are listed on the developer website, with over 100 topics available. Peripheral topics such as city/regional news, obituaries, job advertisements, classifieds, booming, crosswords, etc. are excluded so that only general topics are used.

In [62]:
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

news_desk = pd.read_csv("data/news_desk.csv")

news_desk.head()

Unnamed: 0,Section
0,Adventure Sports
1,Arts & Leisure
2,Arts
3,Automobiles
4,Blogs
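
As a minimal sketch of the exclusion step described above, the section list can be filtered against a set of peripheral desks before any API calls are made. The names in `PERIPHERAL` and the sample `sections` list are illustrative assumptions, not the actual contents of `data/news_desk.csv`.

```python
# Hypothetical set of peripheral desks to exclude (illustrative names only).
PERIPHERAL = {"Obituaries", "Classifieds", "Jobs", "Booming", "Crosswords & Games", "Metro"}

def general_sections(sections, excluded=PERIPHERAL):
    """Return the sections that are not in the excluded set, preserving order."""
    return [s for s in sections if s not in excluded]

sections = ["Arts", "Automobiles", "Obituaries", "Business", "Classifieds"]
kept = general_sections(sections)
print(kept)
```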


### API Calls
The API-calling code builds a basic framework for extracting articles in bulk from the NY Times website. Because of the rate limit imposed by NY Times, a decorator wraps the request function so that API requests do not exceed 1 call/second. There is also a daily limit of 1000 calls. In addition, the requests are spread across multiple threads, which speeds things up a bit. A brief description of the Lock() function used for rate limiting can be found here.
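
A rough budget check helps confirm that a bulk pull fits within both limits. The sketch below is only arithmetic: the figure of 100 desks is a round number taken from the "over 100 topics" mentioned earlier, while 7 pages per desk and 0.9 calls/second are the values used in the notebook code. The function name `call_budget` is hypothetical.

```python
import math

def call_budget(n_desks, pages_per_desk, calls_per_second, daily_limit):
    """Return (total calls, minimum minutes, days needed) for a bulk pull."""
    total_calls = n_desks * pages_per_desk
    minutes = total_calls / calls_per_second / 60
    days = math.ceil(total_calls / daily_limit)
    return total_calls, minutes, days

total, minutes, days = call_budget(n_desks=100, pages_per_desk=7,
                                   calls_per_second=0.9, daily_limit=1000)
print(total, round(minutes, 1), days)
```

Under these assumptions the pull stays below the daily limit and is bounded below by roughly a quarter of an hour of waiting, regardless of how many threads are used.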

In [1]:
import time, threading

def rate_limited(max_per_second):
  '''Decorator that prevents a function from being called more than max_per_second times per second'''
  lock = threading.Lock()
  minInterval = 1.0 / float(max_per_second)
  def decorate(func):
    lastTimeCalled = [0.0]
    def rateLimitedFunction(*args, **kargs):
      # Hold the lock while waiting so that calls from concurrent threads are spaced out too
      with lock:
        # time.clock() was removed in Python 3.8; use a monotonic clock instead
        elapsed = time.monotonic() - lastTimeCalled[0]
        leftToWait = minInterval - elapsed
        if leftToWait > 0:
          time.sleep(leftToWait)
        lastTimeCalled[0] = time.monotonic()
      ret = func(*args, **kargs)
      return ret
    return rateLimitedFunction
  return decorate


from threading import Thread
import requests

@rate_limited(0.9)
def process_id(id):
    try:
        r = requests.get(url % id)
        json_data = r.json()
        print('Appended '+str(page_index.index(id))+ ' out of '+ str(len(page_index)))
        return json_data
    except Exception:
        # Network or JSON decoding failure: return an empty placeholder for this page
        json_data = ''
        print('Skipping...')
        return json_data

def process_range(id_range, store=None):
    if store is None:
        store = {}
    for id in id_range:
        store[id] = process_id(id)
    return store


def threaded_process_range(nthreads, id_range):
    store = {}
    threads = []

    for i in range(nthreads):
        ids = id_range[i::nthreads]
        t = Thread(target=process_range, args=(ids,store))
        threads.append(t)

    [t.start() for t in threads]
    [t.join() for t in threads]
    return store
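
The slice `id_range[i::nthreads]` used above hands each thread every `nthreads`-th page, starting at its own offset. A small sketch of that partitioning, with a hypothetical helper name `split_round_robin`:

```python
def split_round_robin(items, n_workers):
    """Give worker i every n_workers-th item, starting at offset i."""
    return [items[i::n_workers] for i in range(n_workers)]

pages = list(range(7))
chunks = split_round_robin(pages, 2)
print(chunks)  # [[0, 2, 4, 6], [1, 3, 5]]
```

Every page lands in exactly one chunk, so the threads never request the same page twice.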

news_desk = list(news_desk['Section'])
base_url = 'https://api.nytimes.com/svc/search/v2/articlesearch.json?api-key=a141a689509b459ba12e2b93b83883fd'

article_raw = []
for nd in news_desk[65:67]:
    param_url = '&fq=news_desk:'+str(nd)+'&sort=newest&page=%s'
    url = base_url + param_url
    print(str(nd)+":")
    page_index = list(range(7))
    try:
        articles_1 = threaded_process_range(2, page_index)
        articles_2 = [articles_1[k]['response']['docs'] for k in page_index if (type(articles_1[k]) is dict) and ('response' in articles_1[k])]
        articles_3 = [item for sublist in articles_2 for item in sublist]
        articles_4 = [{key:item[key] for key in ['web_url','pub_date']} for item in articles_3]
        articles_5 = pd.DataFrame(articles_4)
        articles_5['news_desk'] = str(nd)
        article_raw.append(articles_5)
    except:
        print('Skipping...')

url_data = pd.concat(article_raw).reset_index(drop = True)
#url_data.to_csv('data/url_data.csv', index = False)
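
The loop above concatenates the desk name straight into the query string, which can misbehave for names containing spaces or `&` (for example "Arts & Leisure"). A hedged alternative is to let `urllib.parse.urlencode` do the escaping. The parameter names (`api-key`, `fq`, `sort`, `page`) come from the notebook; the quoted `news_desk:("...")` filter form is an assumption about the search syntax and should be checked against the NY Times documentation, and `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

BASE = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

def build_search_url(api_key, news_desk, page):
    """Build an Article Search URL with a percent-encoded news_desk filter."""
    params = {
        "api-key": api_key,
        # Assumed filter syntax: quotes keep a multi-word desk name as one term
        "fq": 'news_desk:("%s")' % news_desk,
        "sort": "newest",
        "page": page,
    }
    return BASE + "?" + urlencode(params)

url = build_search_url("YOUR_API_KEY", "Arts & Leisure", 0)
print(url)
```

Passing the key as an argument also keeps it out of the source file, for instance by reading it from an environment variable.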


### Web Scraping
Using the metadata obtained from the API calls, the article text is scraped from each URL with the BeautifulSoup package. A decorator is used along with the scrape function to limit the request rate. URLs for videos/slideshows are excluded from the dataset as they do not have text data. The article content is enclosed within 'p' tags, and all such tags are collected to extract the full article content. The frequency distribution of article length is then inspected to exclude articles for which scraping failed to return meaningful content. Such articles show up as high bars at small article lengths in the graph.
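
The extraction rule (prefer story paragraphs, fall back to every 'p' tag if none are found) can be tried offline on an inline HTML string. The sketch below deliberately swaps BeautifulSoup for the standard-library `html.parser` so that it runs without extra packages; `ParagraphCollector` and `extract_text` are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element together with its class attribute."""
    def __init__(self):
        super().__init__()
        self.paragraphs = []  # list of (class_string, text) tuples
        self._in_p = False
        self._cls = ""
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._cls = dict(attrs).get("class") or ""
            self._buf = []

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.paragraphs.append((self._cls, "".join(self._buf)))
            self._in_p = False

def extract_text(html, story_class="story-body-text story-content"):
    """Join story paragraphs; fall back to all paragraphs if that yields nothing."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    text = " ".join(t for c, t in parser.paragraphs if c == story_class)
    if text == "":
        text = " ".join(t for _, t in parser.paragraphs)
    return text

html = ('<p class="ad">Subscribe now</p>'
        '<p class="story-body-text story-content">First paragraph.</p>'
        '<p class="story-body-text story-content">Second paragraph.</p>')
story_text = extract_text(html)
fallback_text = extract_text('<p>Only plain paragraphs here.</p>')
print(story_text)
print(fallback_text)
```

The fallback is what lets boilerplate such as advertisements into the corpus, which is why the length-based filtering step is needed afterwards.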

In [10]:
from bs4 import BeautifulSoup

@rate_limited(1)
def extract_content(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    
    name_box = soup.findAll('p', attrs={'class': 'story-body-text story-content'})
    content = [x.text for x in name_box]
    content_final = ' '.join(content)
    
    if(content_final == ''):
        name_box = soup.findAll('p')
        content = [x.text for x in name_box]
        content_final = ' '.join(content)

    return(content_final)

url_data = pd.read_csv('data/url_data.csv')
url_data['video_flag'] = url_data['web_url'].str.contains('/video/')
url_data['slideshow_flag'] = url_data['web_url'].str.contains('/slideshow/')
url_data = url_data.loc[(url_data['video_flag'] == False) & (url_data['slideshow_flag'] == False),]
url_data = url_data.drop(['video_flag','slideshow_flag'], axis = 1)

content = []
for index, i in enumerate(url_data['web_url']):
    try:
        print(index)
        a = extract_content(i)    
        content.append(a)
    except:
        content.append('')
        print('Skipping...')
        
content_data = url_data
content_data['content'] = content
content_data.to_csv('data/content_data.csv', index = False)
##
import pandas as pd
content_data = pd.read_csv('data/content_data.csv')
content_data['length'] = content_data['content'].str.strip().str.len()

import bokeh.plotting as bp
from bokeh.io import show, output_notebook
from bokeh.models import HoverTool
import numpy as np

array = list(content_data['length'][content_data['length'].values < 1000])
hist, edges = np.histogram(array, bins=50)

source = bp.ColumnDataSource(pd.DataFrame({
    'left' : edges[:-1],
    'top' : hist,
    'right' : edges[1:],
    'bottom' : 0,
    'data_value' : hist
}))

p = bp.figure(width = 550, height = 450)
p.quad(top='top', bottom='bottom', left='left', right='right', line_color="white", source = source)
p.add_tools(HoverTool(tooltips= [("Value", "@data_value")]))
p.xaxis.axis_label = 'Length of Article'
p.yaxis.axis_label = 'Frequency'
output_notebook()
show(p)



In the frequency distribution of article lengths below 1000, there is a spike in the first three bars. These bars correspond to articles containing only advertisement content or filler text with no relevant information. These articles will be filtered from the corpus.
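
To illustrate how a cut-off can be read from the histogram, the sketch below bins a handful of made-up article lengths and then applies a threshold. The bin width of 20 is an assumption that roughly matches 50 bins over lengths below 1000, so three bars cover about 60 characters; `length_histogram` is a hypothetical helper.

```python
from collections import Counter

def length_histogram(lengths, bin_width):
    """Count how many lengths fall into each [k*bin_width, (k+1)*bin_width) bin."""
    return Counter((n // bin_width) * bin_width for n in lengths)

# Hypothetical character counts for scraped articles
lengths = [12, 15, 40, 55, 58, 900, 2400, 5100, 7300]
hist = length_histogram(lengths, bin_width=20)

# Keep only articles longer than the first three bars
kept = [n for n in lengths if n > 60]
print(sorted(hist.items())[:3])
print(kept)
```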

In [11]:
# Filtering articles with length less than 60 (first three bars)
content_data = content_data.loc[content_data['length'] > 60,]

# Filtering articles with very large length (removing outliers)
content_data = content_data.loc[content_data['length'] < 160000,]

# Reset the index
content_data = content_data.reset_index(drop = True)

# Article length distribution over entire corpus
array = content_data['length'].values
hist, edges = np.histogram(array, bins=100)

source = bp.ColumnDataSource(pd.DataFrame({
    'left' : edges[:-1],
    'top' : hist,
    'right' : edges[1:],
    'bottom' : 0,
    'data_value' : hist
}))

p = bp.figure(width = 550, height = 450)
p.quad(top='top', bottom='bottom', left='left', right='right', line_color="white", source = source)
p.add_tools(HoverTool(tooltips= [("Value", "@data_value")]))
p.xaxis.axis_label = 'Length of Article'
p.yaxis.axis_label = 'Frequency'
output_notebook()
show(p)

### Topic Modelling
The first step after cleaning the article corpus is to use a generative model to find the topic distribution across all articles. We will use the Latent Dirichlet Allocation (LDA) model to generate this distribution. A document-term matrix is created using CountVectorizer, which keeps alphanumeric tokens of at least 3 characters and removes tokens that appear in fewer than 30 documents or in more than 20% of documents. The text is pre-processed by removing stopwords and punctuation and by lemmatizing words according to their part of speech.
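
The document-frequency filter can be illustrated without scikit-learn. The sketch below mimics the idea behind `min_df` (an absolute document count) and `max_df` (a fraction of documents) on a toy corpus; it is not CountVectorizer itself, and `filter_vocabulary` and the sample documents are hypothetical.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w\w+\b")  # tokens of three or more word characters

def filter_vocabulary(docs, min_df, max_df):
    """Keep tokens found in at least min_df documents and at most max_df * len(docs)."""
    doc_freq = Counter()
    for doc in docs:
        # Count each token once per document
        doc_freq.update(set(TOKEN.findall(doc.lower())))
    max_count = max_df * len(docs)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)

docs = ["the senate vote on tax",
        "the team won the game",
        "the senate passed the tax bill",
        "the game went to overtime",
        "the market fell"]
vocab = filter_vocabulary(docs, min_df=2, max_df=0.5)
print(vocab)
```

Here "the" is dropped for being too common, single-occurrence words are dropped for being too rare, and two-letter words never become tokens.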

In [12]:
import gensim
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
import string

def pre_process(text):    
    stopwords = set(nltk.corpus.stopwords.words('english'))
    punctuation = set(string.punctuation)
    lemmatizer = WordNetLemmatizer()
    tokenizer = RegexpTokenizer(r'[0-9a-zA-Z]+')

    def convert_tag(tag):
        """
        Convert the tag given by nltk.pos_tag to the tag used by wordnet
        """
        tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
        try:
            return tag_dict[tag[0]]
        except KeyError:
            return 'a'

    cl_text = (" ").join(tokenizer.tokenize(text))
    cl_text = (" ").join([s for s in cl_text.lower().split() if s not in stopwords])
    cl_text = ("").join([s for s in cl_text if s not in punctuation])
    cl_text = nltk.word_tokenize(cl_text)
    pos = nltk.pos_tag(cl_text)
    pos = [convert_tag(t[1]) for t in pos]
    cl_text = [lemmatizer.lemmatize(cl_text[i], pos[i]) for i in range(len(cl_text))]
    return cl_text

content_data['content_clean'] = content_data['content'].apply(pre_process)

# Using CountVectorizer to keep tokens of three or more characters, removing stop_words,
# removing tokens that don't appear in at least 30 documents,
# removing tokens that appear in more than 20% of the documents
vect = CountVectorizer(min_df=30, max_df=0.2, stop_words='english', token_pattern='(?u)\\b\\w\\w\\w+\\b')

X = vect.fit_transform(content_data['content_clean'].apply(lambda x: (" ").join(x)))

# Convert sparse matrix to gensim corpus.
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

# Mapping from word IDs to words (To be used in LdaModel's id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())

# Creating dictionary object from the corpus
dct = gensim.corpora.Dictionary.from_corpus(corpus, id_map)

# Determining topic overlap
from sklearn.metrics.pairwise import cosine_similarity
def topic_sim(lda_model, num_topic):
    topic_word = lda_model.get_topics()
    avg_sim = []
    for i in range(num_topic):
        arr1 = topic_word[i]
        sim = []
        for j in range(num_topic):
            arr2 = topic_word[j]
            sim_value = cosine_similarity(arr1.reshape(1,-1),arr2.reshape(1,-1))
            sim.append(sim_value)
        avg_sim.append(np.mean(sim))
    return(np.mean(avg_sim))        

x = np.linspace(1, 100, 10).astype(int)
topic_overlap = []
for i in x:
    ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = i, 
                                               id2word = id_map, passes = 25, random_state = 34)
    topic_overlap.append(topic_sim(ldamodel, i))


p = bp.figure(width=400, height=400)
p.line(list(x), topic_overlap, line_width=2)
p.circle(list(x), topic_overlap, fill_color="white", size=8)
p.xaxis.axis_label = 'Number of Topics'
p.yaxis.axis_label = 'Topic Similarity'
output_notebook()
show(p)

#### Number of Topics
The above graph of similarity against number of topics can be used to determine the number of topics for LDA modelling. The optimal number of topics is the smallest number that produces maximum diversity (that is, minimum similarity) among the constituent topics. Using the elbow method, the number of topics used will be **35**. Topic similarity is calculated as the cosine similarity between the word-level probability distributions of the topics.
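
A minimal sketch of the overlap measure, using plain Python lists in place of the LDA topic-word matrix. Unlike `topic_sim` in the notebook, which also compares each topic with itself, this version averages only over distinct topic pairs; the toy distributions and the name `mean_topic_overlap` are assumptions for illustration.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mean_topic_overlap(topic_word):
    """Average cosine similarity over all distinct topic pairs."""
    pairs = list(combinations(topic_word, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

distinct = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]  # topics with different top words
similar = [[0.5, 0.4, 0.1], [0.4, 0.5, 0.1]]   # topics sharing most of their mass
print(mean_topic_overlap(distinct) < mean_topic_overlap(similar))
```

A lower value means the topics put their probability mass on different words, which is the diversity the elbow plot is tracking.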

In [43]:
# Generating topic distribution for the entire corpus

ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 35, id2word = id_map, passes = 40)

def topic_corpus():
    bow_corpus = [dct.doc2bow(content_data.loc[i,'content_clean']) for i in range(content_data.shape[0])]
    topic_corpus = []
    for i in range(len(bow_corpus)):
        topic_dist = ldamodel[bow_corpus[i]]
        topic_dist = {x[0]:x[1] for x in topic_dist}
        topic_corpus.append(topic_dist)
    topic_corpus = pd.DataFrame(topic_corpus).fillna(0)
    return(topic_corpus)

topic_corpus = topic_corpus()

def query_article_topic(urls):
    query_content = []
    for at in urls:
        try:
            query_content.append(pre_process(extract_content(at)))
        except:
            query_content.append([''])
        
    bow = [dct.doc2bow(i) for i in query_content]
    query_corpus = []
    for i in range(len(bow)):
        topic_dist = ldamodel[bow[i]]
        topic_dist = {x[0]:x[1] for x in topic_dist}
        query_corpus.append(topic_dist)
    query_corpus.append({i:0 for i in range(35)})    
    query_corpus = pd.DataFrame(query_corpus).fillna(0)
    return(query_corpus.iloc[:-1,])

### Semantic Similarity
**LDA** ([read here](http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/)) provides a good way of clustering documents with similar topic distributions. Since this method relies only on a bag-of-words model, it ignores the context/meaning of the document. As a result, two documents that are highly similar by topic may be very different in context, so LDA alone may not work well for a recommender system where the user generally prefers new articles with similar context. Thus, topic similarity should be combined with a model that also focuses on the semantic similarity of the articles.

**WordNet** is a taxonomy of hypernym relationships and synonym sets. It is a good resource for finding semantic similarity between words/sentences but has some limitations. It is weak at identifying nuances between sentences. It also struggles to compare adjectives and adverbs, since their taxonomies are very short and words from different taxonomies cannot be compared. This method, though useful for finding similarity, is not very accurate and does not take into account changing/new word meanings. A detailed description of WordNet can be found [here](https://www.codeproject.com/Articles/11835/WordNet-based-semantic-similarity-measurement).

**Word embeddings** are a relatively new area of text analytics with some good algorithms for semantic similarity. Word2Vec and GloVe are the two most popular examples. Word2Vec offers two learning algorithms, skip-gram and Continuous Bag of Words. The basic premise of the model is to represent the meaning of a word through the contexts in which it appears. A neural word embedding is a predictive model that learns word vectors by predicting context words from a centre word, or the centre word from its context. This model is very effective at comparing word/sentence similarities and also trains on the corpus quickly, making it ideal to implement here.
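
To make the centre word / context word idea concrete, the sketch below lists the training pairs a skip-gram model would be asked to predict for a short token sequence. It only generates pairs and does not train anything; `skipgram_pairs` and the example sentence are hypothetical.

```python
def skipgram_pairs(tokens, window):
    """Return (centre, context) pairs for every token within `window` positions."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((centre, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

pairs = skipgram_pairs(["senate", "passes", "tax", "bill"], window=1)
print(pairs)
```

Words that keep appearing with the same neighbours end up with similar vectors, which is what makes the embeddings usable for semantic similarity.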

In [87]:
from gensim.models import Doc2Vec
# TaggedDocument replaces LabeledSentence, which is no longer available in recent gensim releases
from gensim.models.doc2vec import TaggedDocument

doc1 = extract_content("https://www.nytimes.com/2018/02/23/sports/olympics/cross-country-skiing-food.html?rref=collection%2Fsectioncollection%2Fsports&action=click&contentCollection=sports&region=rank&module=package&version=highlights&contentPlacement=1&pgtype=sectionfront")
doc2 = extract_content("https://www.nytimes.com/2018/02/24/us/politics/trump-russia-inquiry-mueller.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news")

def pre_process_doc2vec(text):    
    stopwords = set(nltk.corpus.stopwords.words('english'))
    punctuation = set(string.punctuation).union(set(['“','”','—','’','‘']))
    punctuation.remove('-')
    cl_text = (" ").join([s for s in text.lower().split() if s not in stopwords])
    cl_text = ("").join([s for s in cl_text if s not in punctuation])
    cl_text = nltk.word_tokenize(cl_text)
    return cl_text

class LabeledLineSentence(object):
    def __init__(self, doc_list):
        self.doc_list = doc_list
    def __iter__(self):
        # Tag each document with its position so its vector can be looked up later
        for idx, doc in enumerate(self.doc_list):
            yield TaggedDocument(words = pre_process_doc2vec(doc), tags = [idx])
            
it = LabeledLineSentence(list(content_data['content']))

# Recent gensim releases use vector_size (formerly size) and model.epochs (formerly model.iter)
model = Doc2Vec(vector_size=50, min_count=5, alpha=0.025, min_alpha=0.025, workers=8)
model.build_vocab(it)

for epoch in range(10):
    model.train(it, total_examples = model.corpus_count, epochs = model.epochs)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay
    
model.save("data/doc2vec.model")
model = Doc2Vec.load("data/doc2vec.model")

pca_data = pd.DataFrame(columns = list(range(50)))
for i in range(len(model.docvecs)):
    x = pd.DataFrame(model.docvecs[i].reshape(1,-1), columns = list(range(50)))
    pca_data = pd.concat([pca_data, x], axis = 0)
pca_data = pca_data.reset_index(drop = True)    

from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
pca_fln = pca.fit_transform(pca_data)
pca_fln = pd.DataFrame(pca_fln)
pca_fln['news_desk'] = content_data['news_desk']
pca_fln.columns = ['col1','col2','col3']

pca_fln = pca_fln.loc[pca_fln['col3'].isin(['Politics', 'Business', 'Sports','Culture','World','Health','Technology','Retail','Wealth','Travel']), ]

source = bp.ColumnDataSource.from_df(pca_fln)

from bokeh.palettes import d3
import bokeh
palette = d3['Category10'][len(pca_fln['col3'].unique())]
color_map = bokeh.models.CategoricalColorMapper(factors=list(pca_fln['col3'].unique()), palette=palette)

p = bp.figure(title = "Visualization of Article Semantics using PCA")
p.scatter('col1','col2',source = source, color = {'field': 'col3', 'transform': color_map},
          legend_field = 'col3', alpha = 0.4, size = 10)
p.border_fill_color = "whitesmoke"
p.add_tools(HoverTool(tooltips= [("Article:","@col3")]))
p.xaxis.axis_label = 'Dimension 1'
p.yaxis.axis_label = 'Dimension 2'
p.xgrid.grid_line_color = None
p.ygrid.grid_line_color = None
show(p)
# docvec = model.docvecs[106]
# similar_doc = model.docvecs.most_similar(4412) 
# docvec = model.infer_vector(pre_process_doc2vec(doc2))
# model.docvecs.most_similar([docvec])

In [79]:
from ipywidgets import widgets, HBox, VBox, Layout, Button, Label
from IPython.display import display, HTML, clear_output
import requests
import webbrowser
import pandas as pd
import numpy as np

def reset_pref(sender):
    column_name = [str(i) for i in range(35)]
    df = pd.DataFrame(columns = ['name'] + column_name)
    df.to_csv('data/df.csv', index = False)

htmlscript_ipywidget_disable_closing = '''<script>
disable = true
function disable_ipyw_close(){
    if(disable){
        $('div.widget-area > div.prompt > button.close').hide()
    }
    else{
        $('div.widget-area > div.prompt > button.close').show()    
    }
    disable = !disable
}
$( document ).ready(disable_ipyw_close);
</script>

<form action="javascript:disable_ipyw_close()"><input style="opacity: 0.5" type="submit" value="Disable ipywidget closing"></form>'''

wodget_hide = HTML('''<script>
code_show=true; 
function code_toggle() {
    if (code_show){
        $('div.cell.code_cell.rendered.selected div.input').hide();
    } else {
        $('div.cell.code_cell.rendered.selected div.input').show();
    }
    code_show = !code_show
} 

$( document ).ready(code_toggle);
</script>

To show/hide this cell's raw code input, click <a href="javascript:code_toggle()">here</a>.''')

widget_hide = HTML('''<script>$('div.output_area > div.output_subarea.jupyter-widgets-view > div.p-Widget.p-Panel.jupyter-widgets.widget-container.widget-box.widget-vbox').fadeOut();</script>''')
    
def rec_generate(sender):
    if(text_user.value == ''):
        js = "<script>alert('Enter your username for recommendations!');</script>"
        display(HTML(js))
        return
    pref_data = pd.read_csv('data/df.csv')
    usr = text_user.value
    usr_data = pref_data.loc[pref_data['name'] == usr,]
    usr_arr = usr_data.iloc[0].values[1:]
    usr_arr = usr_data.iloc[0].values[1:]/sum(usr_arr)
    
    from sklearn.metrics.pairwise import cosine_similarity
    def dot_product(arr1, arr2):
        return(cosine_similarity(arr1.reshape(1,-1),arr2.reshape(1,-1))[0][0])
    score = []
    for i in range(topic_corpus.shape[0]):
        score.append(dot_product(topic_corpus.iloc[i,].values, usr_arr))
    
    sorted_index = sorted(range(len(score)), key=lambda k: score[k], reverse = True)[:10]

    def get_headline(url):
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        name_box = soup.findAll('h1')
        return(name_box[0].text)

    def get_snippet(url):
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        name_box = soup.findAll('p')
        for i in name_box:
            try:
                if(len(i.text)>150):
                    snipp = i.text
                    return(snipp)
            except:
                return("No Snippet Available")
        # No paragraph was long enough to serve as a snippet
        return("No Snippet Available")

    url = [content_data['web_url'][i] for i in sorted_index]
    hdls = [get_headline(i) for i in url]
    snipps = [get_snippet(i) for i in url]
    ui_fill(hdls, snipps, url)
    
def ui_fill(headlines, snippets, urls):
    
    display(widget_hide)
    
    def on_value_change(change):
        if(text_user.value == ''):
            js = "<script>alert('Enter your name before giving preferences');</script>"
            display(HTML(js))            
            return

        scale = {'Less':-1, 'Neutral':0, 'More':1}
        pref_data = pd.read_csv('data/df.csv')
        user_list = list(pref_data['name'].values)
        usr = text_user.value
        article_topics = query_article_topic(urls)
        article_topics.to_csv('data/waste.csv', index = False)
        
        if(usr not in user_list):
            df = pd.DataFrame(columns = ['name'] + [str(i) for i in range(35)])
            df.loc[0] = [usr] + [0 for i in range(35)]
            pref_data = pd.concat([pref_data, df], axis = 0)
            pref_data = pref_data.fillna(0)
            pref_data.to_csv('data/df.csv', index = False)

        pref_data = pd.read_csv('data/df.csv')
        ind = items_rate.index(change['owner'])
        pref_change = scale[change['new']] - scale[change['old']]               
        for i in range(35):
            pref_data.loc[pref_data['name'] == usr, str(i)] += pref_change*article_topics.loc[ind, i]
        pref_data.to_csv('data/df.csv', index = False)
    
    def redirect_link(sender):
        ind = items_hl.index(sender)
        link = urls[ind]
        webbrowser.open(link)
        
    items_hl = [Button(description=w, border = 'solid', layout = Layout(width='80%', height='40px')) for w in headlines]
    for hl in items_hl:
        hl.style.button_color = 'SkyBlue'
        hl.style.font_weight = 'bold'
        hl.on_click(redirect_link)
        
    items_sn = [widgets.Textarea(w, disabled = True, layout = Layout(width = '99.5%', height = '50px')) for w in snippets]

    items_rate = [widgets.SelectionSlider(options=['Less', 'Neutral', 'More'], value = 'Neutral', 
                                          description='Rate Article '+str(i+1), disabled=False, 
                                          continuous_update=False, orientation='horizontal', readout=True) 
                  for i in range(len(headlines))]
    for rt in items_rate:
        rt.observe(on_value_change, names = 'value')

    items = [VBox([HBox([items_hl[i], items_rate[i]]), items_sn[i]]) for i in range(len(headlines))]
    box = VBox(items, layout = Layout(border = 'solid'))
    display(box)

    
def onclick(sender):
    
    if(text_query.value == ''):
        js = "<script>alert('Input cannot be blank');</script>"
        display(HTML(js))
        return
    
    base_url = 'https://api.nytimes.com/svc/search/v2/articlesearch.json?api-key=c77ddf1d1b594f76b2773928f324615f'
    param_url = '&q='+str(text_query.value)+'&page=0'
    url = base_url + param_url
    
    r = requests.get(url)
    json_data = r.json()
    article_meta_data = json_data['response']['docs']
    
    headlines = []
    urls = []
    snippets = []
    for artc in article_meta_data:
        url = artc['web_url']
        headline = artc['headline']['main']
        snippet = artc['snippet']
        headlines.append(headline)
        urls.append(url)
        snippets.append(snippet)

    ui_fill(headlines, snippets, urls)
    
text_query = widgets.Text(placeholder='Type a query for NY Times article listing')
button_query = widgets.Button(description = 'Search articles')
query = HBox([text_query, button_query])
query = VBox([Label(""), query, Label("")])
button_query.on_click(onclick)


text_user = widgets.Text(placeholder='Enter your name', value = 'saket')
user_input = HBox([Label("Username:"),text_user])
rec_button = widgets.Button(description = 'Get recommendations', layout = Layout(width = '150px'))
reset_button = widgets.Button(description = 'Reset Preferences')
user = VBox([user_input,HBox([Label("->->->->->"),rec_button,reset_button])])
rec_button.on_click(rec_generate)
reset_button.on_click(reset_pref)

display(HBox([query, Label(""),Label(""), user]))
# HTML(htmlscript_ipywidget_disable_closing)
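
Stripped of the widget code, the recommendation logic above comes down to two steps: a rating change shifts the user's topic weights by the rated article's topic mix, and articles are then ranked by cosine similarity to those weights. The sketch below reproduces that logic on three-topic toy vectors; the function names and numbers are hypothetical, and the notebook itself works with 35 topics stored in `data/df.csv`.

```python
import math

def update_preferences(prefs, article_topics, old, new):
    """Shift the user's topic weights by the rating change times the article's topic mix."""
    scale = {"Less": -1, "Neutral": 0, "More": 1}
    delta = scale[new] - scale[old]
    return [p + delta * t for p, t in zip(prefs, article_topics)]

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def recommend(prefs, corpus_topics, top_n):
    """Return indices of the top_n articles most similar to the preference vector."""
    order = sorted(range(len(corpus_topics)),
                   key=lambda i: cosine(prefs, corpus_topics[i]), reverse=True)
    return order[:top_n]

# A new user rates an article that is mostly about topic 0 as "More"
prefs = update_preferences([0.0, 0.0, 0.0], [0.8, 0.2, 0.0], old="Neutral", new="More")
corpus_topics = [[0.1, 0.1, 0.8], [0.7, 0.3, 0.0], [0.3, 0.6, 0.1]]
top = recommend(prefs, corpus_topics, top_n=2)
print(prefs, top)
```

Using the difference between the new and old rating means that moving a slider back to "Neutral" undoes its earlier contribution.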

In [75]:
ldamodel.print_topics(num_topics = 35)
# ldamodel[dct.doc2bow(pre_process(extract_content("https://www.nytimes.com/2018/01/31/travel/guillermo-de-toro-mexico.html")))]

[(0,
  '0.016*"republican" + 0.015*"election" + 0.012*"democrat" + 0.010*"party" + 0.010*"vote" + 0.008*"trump" + 0.008*"political" + 0.008*"committee" + 0.008*"democratic" + 0.007*"official"'),
 (1,
  '0.018*"saudi" + 0.017*"club" + 0.015*"wine" + 0.014*"player" + 0.014*"soccer" + 0.014*"league" + 0.010*"australia" + 0.010*"spain" + 0.010*"team" + 0.008*"international"'),
 (2,
  '0.014*"book" + 0.008*"read" + 0.005*"word" + 0.004*"writer" + 0.004*"age" + 0.004*"character" + 0.004*"novel" + 0.003*"picture" + 0.003*"idea" + 0.003*"love"'),
 (3,
  '0.011*"river" + 0.009*"mountain" + 0.009*"road" + 0.008*"park" + 0.008*"town" + 0.008*"century" + 0.008*"mile" + 0.006*"south" + 0.006*"diamond" + 0.006*"visit"'),
 (4,
  '0.016*"war" + 0.009*"nation" + 0.008*"force" + 0.007*"official" + 0.007*"political" + 0.006*"control" + 0.006*"syria" + 0.006*"leader" + 0.006*"iraq" + 0.006*"minister"'),
 (5,
  '0.064*"los" + 0.039*"del" + 0.022*"kelly" + 0.021*"las" + 0.020*"est" + 0.012*"texas" + 0.010*"