# Simple Article Recommendation Engine

## Introduction

This program is a simple article recommendation engine built on word2vec. The program takes an input of one article, and return a list of the 5 top article recommendations based on euclidean distance. 


## How the program works

The similarity and relationships between articles are modeled based on word vector. I tested both the 300 dimension Stanford (GloVe)[https://nlp.stanford.edu/projects/glove/] trained on 2014 Wikipedia and the a context free model (BERT](https://github.com/google-research/bert). 

For GloVe, each word is represented as a vector of 300-floating point mumbers. This vector captures the meaning of the word regarding other words within its pretrained corpus (the 2014 Wikipedia corpus), and is learned from a neural network that captures the word-word co-occurrence probabilities among the 300-d space. 

For each document, we calculate the centroid of the document's cloud of word vectors by dividing the sum of the vectors by the number of words in the article. The distances between each articles are measured by the euclidean distance between their centroid. 


## How to run the program 

The program reads in a database of word vectors and a corpus of text articles then organizing them into a handy table (list of lists) for processing.

The [GloVe](https://nlp.stanford.edu/projects/glove/) pre-trained word vectors we use here is the 300 dimension vectors trained on Wikipedia data.


In [74]:
import sys
import string
import os
import numpy as np
import codecs
import re
from sklearn.feature_extraction import stop_words

In [2]:
# From scikit learn that got words from:
# http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words

ENGLISH_STOP_WORDS = frozenset([
    "a", "about", "above", "across", "after", "afterwards", "again", "against",
    "all", "almost", "alone", "along", "already", "also", "although", "always",
    "am", "among", "amongst", "amoungst", "amount", "an", "and", "another",
    "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are",
    "around", "as", "at", "back", "be", "became", "because", "become",
    "becomes", "becoming", "been", "before", "beforehand", "behind", "being",
    "below", "beside", "besides", "between", "beyond", "bill", "both",
    "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con",
    "could", "couldnt", "cry", "de", "describe", "detail", "do", "done",
    "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else",
    "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone",
    "everything", "everywhere", "except", "few", "fifteen", "fifty", "fill",
    "find", "fire", "first", "five", "for", "former", "formerly", "forty",
    "found", "four", "from", "front", "full", "further", "get", "give", "go",
    "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter",
    "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his",
    "how", "however", "hundred", "i", "ie", "if", "in", "inc", "indeed",
    "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter",
    "latterly", "least", "less", "ltd", "made", "many", "may", "me",
    "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly",
    "move", "much", "must", "my", "myself", "name", "namely", "neither",
    "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone",
    "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on",
    "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our",
    "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps",
    "please", "put", "rather", "re", "same", "see", "seem", "seemed",
    "seeming", "seems", "serious", "several", "she", "should", "show", "side",
    "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone",
    "something", "sometime", "sometimes", "somewhere", "still", "such",
    "system", "take", "ten", "than", "that", "the", "their", "them",
    "themselves", "then", "thence", "there", "thereafter", "thereby",
    "therefore", "therein", "thereupon", "these", "they", "thick", "thin",
    "third", "this", "those", "though", "three", "through", "throughout",
    "thru", "thus", "to", "together", "too", "top", "toward", "towards",
    "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us",
    "very", "via", "was", "we", "well", "were", "what", "whatever", "when",
    "whence", "whenever", "where", "whereafter", "whereas", "whereby",
    "wherein", "whereupon", "wherever", "whether", "which", "while", "whither",
    "who", "whoever", "whole", "whom", "whose", "why", "will", "with",
    "within", "without", "would", "yet", "you", "your", "yours", "yourself",
    "yourselves"])

In [61]:
def load_glove(filename):
    """
    Read all lines from the indicated file and return a dictionary
    mapping word:vector where vectors are of numpy `array` type.
    GloVe file lines are of the form:

    the 0.418 0.24968 -0.41242 0.1217 ...

    So split each line on spaces into a list; the first element is the word
    and the remaining elements represent factor components. The length of the vector
    should not matter; read vectors of any length.
    """
    f = open(filename)
    contents = f.readlines()
    f.close()

    dictionary = {}
    for line in contents:
        word = line.rstrip().split(" ")
        dictionary[word[0]] = np.array(word[1:], dtype="float")
    return dictionary


In [None]:
contents = ["the 0.418 0.24968 -0.41242 0.1217", "wikipedia -0.67679 -0.11589 -0.22071 -0.22887"]

dictionary = {}
for line in contents:
    word = contents[line].rstrip().split(" ")
#     dictionary[word[0]] = np.array(word[1:], dtype="float")
#     print(dictionary)


In [40]:
contents = ["the 0.418 0.24968 -0.41242 0.1217", "wikipedia -0.67679 -0.11589 -0.22071 -0.22887"]

dictionary = {}
for line in range(len(contents)):
    word = contents[line].rstrip().split(" ")
    dictionary[word[0]] = np.array(word[1:], dtype="float")
    print(dictionary)
    

{'the': array([ 0.418  ,  0.24968, -0.41242,  0.1217 ])}
{'the': array([ 0.418  ,  0.24968, -0.41242,  0.1217 ]), 'wikipedia': array([-0.67679, -0.11589, -0.22071, -0.22887])}


In [23]:
# this is over 1GB

gloves = load_glove('word2vec/glove.6B.300d.txt')

In [44]:
# gloves['wikipedia']

In [48]:
def filelist(root):
    """Return a fully-qualified list of filenames under root directory"""
    allfiles = []
    for path, subdirs, files in os.walk(root):
        for name in files:
            allfiles.append(os.path.join(path, name))
    return allfiles


In [49]:
allfiles = filelist('data/bbc')
allfiles[0:5]

['data/bbc/.DS_Store',
 'data/bbc/COPYRIGHT',
 'data/bbc/entertainment/289.txt',
 'data/bbc/entertainment/262.txt',
 'data/bbc/entertainment/276.txt']

In [103]:
def get_text(filename):
    """
    Load and return the text of a text file, assuming latin-1 encoding as that
    is what the BBC corpus uses.  Use codecs.open() function not open().
    """
    f = codecs.open(filename, encoding='latin-1', mode='r')
    s = f.read()
    f.close()
    return s

In [104]:
s = get_text(allfiles[6])
s

'Ocean\'s Twelve raids box office\n\nOcean\'s Twelve, the crime caper sequel starring George Clooney, Brad Pitt and Julia Roberts, has gone straight to number one in the US box office chart.\n\nIt took $40.8m (Â£21m) in weekend ticket sales, according to studio estimates. The sequel follows the master criminals as they try to pull off three major heists across Europe. It knocked last week\'s number one, National Treasure, into third place. Wesley Snipes\' Blade: Trinity was in second, taking $16.1m (Â£8.4m). Rounding out the top five was animated fable The Polar Express, starring Tom Hanks, and festive comedy Christmas with the Kranks.\n\nOcean\'s Twelve box office triumph marks the fourth-biggest opening for a December release in the US, after the three films in the Lord of the Rings trilogy. The sequel narrowly beat its 2001 predecessor, Ocean\'s Eleven which took $38.1m (Â£19.8m) on its opening weekend and $184m (Â£95.8m) in total. A remake of the 1960s film, starring Frank Sinatra 

In [91]:
import re

def words(text):
    clean_text = text.lower()
    print ("[lower]", clean_text)

    clean_text = re.sub('[' + string.punctuation + '0-9\\r\\t\\n]', ' ', text)
    print ("[regex]", clean_text)

    clean_text = clean_text.split(" ")
    print ("[split on space]", clean_text)
    
    clean_text = [w for w in clean_text if len(w) > 2]
    print ("[ignore words < 3]", clean_text)
    
    clean_text = [w for w in clean_text if not w in stop_words.ENGLISH_STOP_WORDS]
#     print ("[stop words]", clean_text)
    
    return clean_text

# wlist = words(s)
# wlist

In [105]:
# a function to clean the texts

def words(text):
    """
    Given a string, return a list of words normalized as follows.
    Split the string to make words first by using regex compile() function
    and string.punctuation + '0-9\\r\\t\\n]' to replace all those
    char with a space character.
    Split on space to get word list.
    Ignore words < 3 char long.
    Lowercase all words
    Remove English stop words
    """
    clean_text = text.lower()
#     print ("[lower]", clean_text)

    clean_text = re.sub('[' + string.punctuation + '0-9\\r\\t\\n]', ' ', text)
#     print "[regex]", clean_text

    clean_text = clean_text.split(" ")
#     print "[split]", clean_text
    
    clean_text = [w for w in clean_text if len(w) > 2]
#     print ("[ignore words < 3]", clean_text)

    clean_text = [w for w in clean_text if not w in ENGLISH_STOP_WORDS]
    return clean_text

wlist = words(s)
wlist

['Ocean',
 'Twelve',
 'raids',
 'box',
 'office',
 'Ocean',
 'Twelve',
 'crime',
 'caper',
 'sequel',
 'starring',
 'George',
 'Clooney',
 'Brad',
 'Pitt',
 'Julia',
 'Roberts',
 'gone',
 'straight',
 'number',
 'box',
 'office',
 'chart',
 'took',
 'weekend',
 'ticket',
 'sales',
 'according',
 'studio',
 'estimates',
 'The',
 'sequel',
 'follows',
 'master',
 'criminals',
 'try',
 'pull',
 'major',
 'heists',
 'Europe',
 'knocked',
 'week',
 'number',
 'National',
 'Treasure',
 'place',
 'Wesley',
 'Snipes',
 'Blade',
 'Trinity',
 'second',
 'taking',
 'Rounding',
 'animated',
 'fable',
 'The',
 'Polar',
 'Express',
 'starring',
 'Tom',
 'Hanks',
 'festive',
 'comedy',
 'Christmas',
 'Kranks',
 'Ocean',
 'Twelve',
 'box',
 'office',
 'triumph',
 'marks',
 'fourth',
 'biggest',
 'opening',
 'December',
 'release',
 'films',
 'Lord',
 'Rings',
 'trilogy',
 'The',
 'sequel',
 'narrowly',
 'beat',
 'predecessor',
 'Ocean',
 'Eleven',
 'took',
 'opening',
 'weekend',
 'total',
 'remake',


In [118]:
# a function to calculate the centroid 

def doc2vec(text, gloves):
    """
    Return the word vector centroid for the text. Sum the word vectors
    for each word and then divide by the number of words. Ignore words
    not in gloves.
    """
    
    # get clean word list 
    wlist = words(text)
    
    # we only have word vectors for the words in gloves 
    ingloves = [w for w in wlist if w in gloves]
    
    vcorpus = np.zeros(shape=(300,))
    for w in ingloves:
        vcorpus = vcorpus + gloves[w]
        
    # the number of words in only the words in gloves 
    centroid = vcorpus/len(ingloves)
    return centroid

centroid = doc2vec(s, gloves)
len(centroid)

300

In [110]:
ingloves = {'the': [ 0.418  ,  0.24968, -0.41242,  0.1217 ], 'wikipedia': [-0.67679, -0.11589, -0.22071, -0.22887]}

vcorpus = []
for w in ingloves:
        vcorpus.append(ingloves[w])
        
vcorpus = np.array(vcorpus)
vcorpus = vcorpus.transpose()
svcorpus = [sum(i) for i in vcorpus]
svcorpus

[-0.25879, 0.13379000000000002, -0.63313, -0.10716999999999999]

In [124]:
def load_articles(articles_dirname, gloves):
    filist = filelist(articles_dirname)
    table = []
    for f in filist:
        fname = f[len(articles_dirname)+1:]
        text = get_text(f)
        title = f.readline()
        article = f.read()

        table.append((fname, title, article))
    return table 

table = load_articles('data/bbc', gloves)
table[6]

AttributeError: 'str' object has no attribute 'readline'

In [121]:
def load_articles(articles_dirname, gloves):
    """
    Load all .txt files under articles_dirname and return a table (list of lists/tuples)
    where each record is a list of:

      [filename, title, article-text-minus-title, wordvec-centroid-for-article-text]

    We use gloves parameter to compute the word vectors and centroid.

    The filename is stripped of the prefix of the articles_dirname pulled in as
    script parameter sys.argv[2]. E.g., filename will be "business/223.txt"
    """
    filist = filelist(articles_dirname)
    table = []
    for f in filist:
        fname = f[len(articles_dirname)+1:]
        
        # split the title and article by splitting at \n
        # and then join the article again 
        text = get_text(f)
        lines = text.split('\n')
        title = lines[0]
        article = lines[1:]
        article = "\n".join(article)
        
        centriold = doc2vec(text, gloves)
        table.append((fname, title, article, centriold))
    return table 

table = load_articles('data/bbc', gloves)
table[6]



('entertainment/074.txt',
 "Ocean's Twelve raids box office",
 '\nOcean\'s Twelve, the crime caper sequel starring George Clooney, Brad Pitt and Julia Roberts, has gone straight to number one in the US box office chart.\n\nIt took $40.8m (Â£21m) in weekend ticket sales, according to studio estimates. The sequel follows the master criminals as they try to pull off three major heists across Europe. It knocked last week\'s number one, National Treasure, into third place. Wesley Snipes\' Blade: Trinity was in second, taking $16.1m (Â£8.4m). Rounding out the top five was animated fable The Polar Express, starring Tom Hanks, and festive comedy Christmas with the Kranks.\n\nOcean\'s Twelve box office triumph marks the fourth-biggest opening for a December release in the US, after the three films in the Lord of the Rings trilogy. The sequel narrowly beat its 2001 predecessor, Ocean\'s Eleven which took $38.1m (Â£19.8m) on its opening weekend and $184m (Â£95.8m) in total. A remake of the 1960s 

In [15]:
def distances(article, articles):
    """
    Compute the euclidean distance from article to every other article and return
    a list of (distance, a) tuples for all a in articles. The article is one
    of the elements (tuple) from the articles list.
    """
    dist = []
    for a in articles:
        if a != article:
            distance = np.linalg.norm(a[3] - article[3])
            dis_pair = (distance, a)
            dist.append(dis_pair)
    return dist


In [16]:
eucli_dis = distances(table[5], table)
eucli_dis


[(2.2831370026958453,
  ('.DS_Store',
   '\x00\x00\x00\x01Bud1\x00\x00 \x00\x00\x00\x08\x00\x00\x00 \x00\x00\x00\x10\x0c\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x01\x00\x00\x10\x00\x00n\x00e\x00s\x00sbwsp\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00

In [101]:
def recommended(article, articles, n):
    """
    Return a list of the n articles (records with filename, title, etc...)
    closest to article's word vector centroid. The article is one of the elements
    (tuple) from the articles list.
    """
    dist = distances(article, articles)
    rank = sorted(dist, key=lambda x:x[0])[:n]
    rec = []
    for a in rank:
        rec.append(a[1])
    return rec

rec = recommended(table[5], table, 5)
rec

[('entertainment/297.txt',
  'Fockers keeps US box office lead',
  "\nFilm comedy Meet The Fockers, sequel to Meet The Parents, has topped the US box office for a third week.\n\nThe movie, which stars Ben Stiller, Robert De Niro, Dustin Hoffman and Barbra Streisand, took $28.5m (Â£15.2m), making a total of 204.3m (Â£109m). Michael Keaton thriller White Noise was second with $24m (Â£12.8m) despite negative reviews. Martin Scorsese's Howard Hughes biopic The Aviator came third with $7.6m (Â£4m) and a total of $42.9m (Â£22.8m). Children's adventure Lemony Snicket's A Series of Unfortunate Events, starring Jim Carrey and Meryl Streep, fell two places to fourth position with $7.4m (Â£3.9m).\n\nIt is also the ninth Jim Carrey film to hit the $100m (Â£53.2m) mark, according to Exhibitor Relations. And Fat Albert, a live-action adaptation of Bill Cosby's TV cartoon, fell one place to number five, with takings over the three days beginning on Friday of $5m (Â£2.6m). Ocean's Twelve took sixth pl

In [93]:
a = [(1,2), (3,3), (2,4)]
b = sorted(a, key=lambda x: x[0])
b

[(1, 2), (2, 4), (3, 3)]