# Document retrieval from Wikipedia data

## Fire up GraphLab Create
(See [Getting Started with SFrames](../Week%201/Getting%20Started%20with%20SFrames.ipynb) for setup instructions)

In [1]:
import graphlab

In [2]:
# Limit number of worker processes. This preserves system memory, which prevents hosted notebooks from crashing.
graphlab.set_runtime_config('GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS', 4)

[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1519443950.log

# Load some text data: Wikipedia pages on people

In [3]:
people = graphlab.SFrame('people_wiki.gl/')

The data contains three columns: a link to the Wikipedia article, the name of the person, and the text of the article.

In [4]:
people.head()

| URI | name | text |
|---|---|---|
| <http://dbpedia.org/resource/Digby_Morrell> ... | Digby Morrell | digby morrell born 10 october 1979 is a former ... |
| <http://dbpedia.org/resource/Alfred_J._Lewy> ... | Alfred J. Lewy | alfred j lewy aka sandy lewy graduated from ... |
| <http://dbpedia.org/resource/Harpdog_Brown> ... | Harpdog Brown | harpdog brown is a singer and harmonica player who ... |
| <http://dbpedia.org/resource/Franz_Rottensteiner> ... | Franz Rottensteiner | franz rottensteiner born in waidmannsfeld lower ... |
| <http://dbpedia.org/resource/G-Enka> ... | G-Enka | henry krvits born 30 december 1974 in tallinn ... |
| <http://dbpedia.org/resource/Sam_Henderson> ... | Sam Henderson | sam henderson born october 18 1969 is an ... |
| <http://dbpedia.org/resource/Aaron_LaCrate> ... | Aaron LaCrate | aaron lacrate is an american music producer ... |
| <http://dbpedia.org/resource/Trevor_Ferguson> ... | Trevor Ferguson | trevor ferguson aka john farrow born 11 november ... |
| <http://dbpedia.org/resource/Grant_Nelson> ... | Grant Nelson | grant nelson born 27 april 1971 in london ... |
| <http://dbpedia.org/resource/Cathy_Caruth> ... | Cathy Caruth | cathy caruth born 1955 is frank h t rhodes ... |


In [None]:
len(people)

# Explore the dataset and check out the text it contains

## Exploring the entry for President Obama

In [None]:
obama = people[people['name'] == 'Barack Obama']

In [None]:
obama

In [None]:
obama['text']

## Exploring the entry for actor George Clooney

In [None]:
clooney = people[people['name'] == 'George Clooney']
clooney['text']

# Get the word counts for the Obama article

In [None]:
obama['word_count'] = graphlab.text_analytics.count_words(obama['text'])

In [None]:
print(obama['word_count'])
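
For intuition, a bag-of-words count like the one above can be sketched in plain Python. This is only an illustration: `count_words_sketch` and the sample sentence are hypothetical, and the sketch assumes lowercasing plus whitespace splitting, which may differ from the tokenization that `graphlab.text_analytics.count_words` actually applies.

```python
from collections import Counter


def count_words_sketch(text):
    """Hypothetical helper: lowercase the text, split on whitespace, tally tokens."""
    return dict(Counter(text.lower().split()))


# Made-up sentence, not taken from the dataset.
sample = "the president spoke and the senate listened"
sample_counts = count_words_sketch(sample)
print(sample_counts)
```

Each key is a word and each value is the number of times it occurs, which is the same dictionary shape stored in the `word_count` column.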

## Sort the word counts for the Obama article

### Turning the dictionary of word counts into a table

In [None]:
obama_word_count_table = obama[['word_count']].stack('word_count', new_column_name = ['word','count'])

### Sorting the word counts to show most common words at the top

In [None]:
obama_word_count_table.head()

In [None]:
obama_word_count_table.sort('count',ascending=False)

The most common words are uninformative ones such as "the", "in", and "and".
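
The stack-and-sort step can be mimicked in plain Python to show what happens to the data. This is a rough sketch with made-up counts (not the real Obama counts); the list of dictionaries stands in for the SFrame rows that `stack` produces.

```python
# Made-up word counts for illustration only.
word_count = {'the': 40, 'in': 30, 'obama': 9, 'law': 6}

# "Stack": emit one (word, count) row per dictionary entry.
rows = [{'word': w, 'count': c} for w, c in word_count.items()]

# Sort so the most frequent words come first.
rows_sorted = sorted(rows, key=lambda r: r['count'], reverse=True)

for row in rows_sorted:
    print('%s %d' % (row['word'], row['count']))
```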

# Compute TF-IDF for the corpus

To give more weight to informative words, we weight each word by its TF-IDF score.

In [None]:
people['word_count'] = graphlab.text_analytics.count_words(people['text'])
people.head()

In [None]:
tfidf = graphlab.text_analytics.tf_idf(people['word_count'])

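# Earlier versions of GraphLab Create returned an SFrame rather than a single SArray
# This notebook was created using GraphLab Create version 1.7.1
# Compare parsed versions: a plain string comparison would misorder e.g. '1.10' and '1.6.1'
from distutils.version import LooseVersion
if LooseVersion(graphlab.version) <= LooseVersion('1.6.1'):
    tfidf = tfidf['docs']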

tfidf

In [None]:
people['tfidf'] = tfidf
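
To make the weighting concrete, here is a minimal sketch of TF-IDF on a toy corpus. It assumes the common definition, word count times log(N / number of documents containing the word); the exact variant used by `graphlab.text_analytics.tf_idf` is not spelled out in this notebook, and `tf_idf_sketch` and the toy documents are hypothetical.

```python
import math

# Toy corpus of word-count dictionaries (made-up data).
docs = [
    {'the': 3, 'president': 2},
    {'the': 2, 'footballer': 1},
    {'the': 4, 'senate': 1, 'president': 1},
]


def tf_idf_sketch(docs):
    """Weight each count by log(N / document frequency) of its word."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = {}
    for doc in docs:
        for word in doc:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return [
        {word: count * math.log(float(n_docs) / doc_freq[word])
         for word, count in doc.items()}
        for doc in docs
    ]


toy_tfidf = tf_idf_sketch(docs)
print(toy_tfidf[0])
```

Because "the" occurs in every toy document, its weight is log(1) = 0, while rarer words such as "footballer" keep a positive weight. This is the effect that pushes uninformative words down the ranking.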

## Examine the TF-IDF for the Obama article

In [None]:
obama = people[people['name'] == 'Barack Obama']

In [None]:
obama[['tfidf']].stack('tfidf',new_column_name=['word','tfidf']).sort('tfidf',ascending=False)

The words with the highest TF-IDF scores are much more informative.

# Manually compute distances between a few people

Let's manually compare the distances between the articles for a few famous people.  

In [None]:
clinton = people[people['name'] == 'Bill Clinton']

In [None]:
beckham = people[people['name'] == 'David Beckham']

## Is Obama closer to Clinton than to Beckham?

We will use cosine distance, which is given by

(1 - cosine_similarity),

so a smaller value means the two articles are more similar. We find that the article about President Obama is closer to the one about former President Clinton than to the one about footballer David Beckham.

In [None]:
graphlab.distances.cosine(obama['tfidf'][0],clinton['tfidf'][0])

In [None]:
graphlab.distances.cosine(obama['tfidf'][0],beckham['tfidf'][0])
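
The cosine distance between two sparse word-weight dictionaries can be sketched as follows. The helper `cosine_distance_sketch` and the toy vectors are hypothetical, and the sketch assumes both vectors have at least one nonzero weight; it is meant to illustrate the formula, not to reproduce `graphlab.distances.cosine` exactly.

```python
import math


def cosine_distance_sketch(a, b):
    """1 - cosine similarity for two dicts mapping word -> weight.

    Assumes neither vector is all zeros (otherwise the norm is 0).
    """
    # Only words present in both dictionaries contribute to the dot product.
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up TF-IDF-like vectors.
politician_a = {'president': 2.0, 'senate': 1.0}
politician_b = {'president': 1.0, 'senate': 2.0}
athlete = {'football': 3.0, 'club': 1.0}

dist_politicians = cosine_distance_sketch(politician_a, politician_b)
dist_athlete = cosine_distance_sketch(politician_a, athlete)
print(dist_politicians)
print(dist_athlete)
```

The two toy politicians share vocabulary and end up close together, while the athlete shares no words with them and sits at the maximum distance of 1 for non-negative weights.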

# Build a nearest neighbor model for document retrieval

We now create a nearest-neighbors model and apply it to document retrieval.  

In [None]:
knn_model = graphlab.nearest_neighbors.create(people,features=['tfidf'],label='name')

# Applying the nearest-neighbors model for retrieval

## Who is closest to Obama?

In [None]:
knn_model.query(obama)

As we can see, President Obama's article is closest to the one about his vice president, Biden, and to those about other politicians.
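
Conceptually, a nearest-neighbors query compares the query document against every document in the corpus and returns the closest ones. The brute-force sketch below uses cosine distance, as in the manual comparison earlier; this is an assumption for illustration, since the distance and search method that `graphlab.nearest_neighbors.create` selects by default are not stated in this notebook. The names `cosine_distance`, `query_sketch`, and the toy corpus are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity for two dicts mapping word -> weight."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def query_sketch(corpus, query_name, k=3):
    """Return the k (distance, name) pairs closest to the query document."""
    query_vec = corpus[query_name]
    scored = [(cosine_distance(query_vec, vec), name)
              for name, vec in corpus.items()]
    # The query itself is part of the corpus, so it ranks first
    # with a distance of (approximately) zero.
    return sorted(scored)[:k]


# Made-up TF-IDF-like vectors keyed by a label.
corpus = {
    'politician_a': {'president': 2.0, 'senate': 1.0},
    'politician_b': {'president': 1.0, 'senate': 2.0},
    'athlete': {'football': 3.0, 'club': 1.0},
    'singer': {'album': 2.0, 'president': 0.5},
}

neighbors = query_sketch(corpus, 'politician_a', k=3)
for distance, name in neighbors:
    print('%s %.3f' % (name, distance))
```

A full scan like this costs one distance computation per document for every query, which is why library implementations may use faster index structures on large corpora.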

## Other examples of document retrieval

In [None]:
swift = people[people['name'] == 'Taylor Swift']

In [None]:
knn_model.query(swift)

In [None]:
jolie = people[people['name'] == 'Angelina Jolie']

In [None]:
knn_model.query(jolie)

In [None]:
arnold = people[people['name'] == 'Arnold Schwarzenegger']

In [None]:
knn_model.query(arnold)