## In this case study, we will retrieve similar documents using the TF-IDF (term frequency-inverse document frequency) algorithm and a k-nearest-neighbors model

### Start GraphLab and load data

In [85]:
import graphlab

In [86]:
data = graphlab.SFrame('people_wiki.gl/')

### data contains pages on people from Wikipedia. It has three columns: URI, name and text

In [87]:
data

URI,name,text
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...


### So, our data contains text documents for 59,071 people from Wikipedia

## Explore the dataset and check out the text it contains

In [88]:
tom = data[data['name'] == 'Tom Cruise']

In [89]:
tom['text']

dtype: str
Rows: ?
['tom cruise born thomas cruise mapother iv july 3 1962 is an american actor and filmmaker he has been nominated for three academy awards and has won three golden globe awards he started his career at age 19 in the 1981 film endless love after portraying supporting roles in taps 1981 and the outsiders 1983 his first leading role was in the romantic comedy risky business released in august 1983 cruise became a fullfledged movie star after starring as pete maverick mitchell in the action drama top gun 1986 he has since 1996 been well known for his role as secret agent ethan hunt in the mission impossible film series which has a fifth film set for release in 2015one of the biggest movie stars in hollywood cruise starred in several more successful films in the 1980s including the dramas the color of money 1986 cocktail 1988 rain man 1988 and born on the fourth of july 1989 in the 1990s he starred in a number of successful films including the romance far and away 1992 the

## Get the word counts for the Tom Cruise article

In [90]:
tom['word_count'] = graphlab.text_analytics.count_words(tom['text'])

In [91]:
tom['word_count']

dtype: dict
Rows: 1
[{'thomas': 1, 'portraying': 1, 'fullfledged': 1, 'money': 1, 'over': 1, 'mission': 1, 'nominated': 1, 'including': 3, '1962': 1, 'cruise': 8, 'its': 1, 'fifth': 1, '2015one': 1, 'hollywoods': 1, 'taps': 1, 'sky': 2, 'edge': 1, 'ethan': 1, '2014': 1, 'has': 4, '2010': 1, '2013': 1, '2012': 2, 'good': 1, 'fourth': 2, '1990': 1, 'far': 1, 'horror': 2, 'impossible': 1, 'interview': 1, 'jack': 1, 'report': 2, 'day': 1, 'romantic': 3, 'worlds': 1, 'comedy': 2, 'roles': 1, 'knight': 1, 'dramas': 1, 'hunt': 1, 'gun': 1, 'samurai': 2, 'domestically': 1, 'collateral': 1, 'release': 1, 'starred': 3, 'set': 1, 'disaster': 1, '200': 1, 'series': 1, 'globe': 1, 'postapocalyptic': 1, 'born': 3, 'erotic': 1, 'year': 1, 'stanley': 1, 'empire': 1, 'best': 5, 'pete': 1, 'fiction': 5, 'for': 14, 'movie': 3, 'away': 1, 'since': 1, 'vampire': 2, 'legal': 1, '1999in': 1, '3': 1, 'won': 3, 'epic': 1, 'maverick': 1, 'outsiders': 1, 'supporting': 2, 'shut': 1, 'million': 2, 'august': 1, 'bu
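
To make the word-count step concrete, here is a minimal plain-Python sketch of what a word counter does. It is not GraphLab's implementation: the helper name `count_words` and the sample sentence are illustrative, and it assumes the text is already lowercased and stripped of punctuation, as the Wikipedia text in this dataset appears to be.

```python
from collections import Counter


def count_words(text):
    """Map each whitespace-separated token in `text` to its frequency."""
    return dict(Counter(text.split()))


# Hypothetical sample sentence, not taken from the dataset.
sample = "tom cruise is an american actor and cruise is a filmmaker"
word_count = count_words(sample)
print(word_count["cruise"])  # 2
```

The result is a dictionary per document, which is the same shape as the `word_count` column shown above.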

## Sort the word counts for the Tom Cruise article

### Turn dictionary into table

In [92]:
tom_word_count = tom[['word_count']].stack('word_count', new_column_name = ['word','count'])

In [93]:
tom_word_count.sort('count',ascending=False)

word,count
the,39
in,27
for,14
of,13
and,12
film,10
a,10
cruise,8
actor,7
he,6
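
The stack-and-sort step above can be mimicked on an ordinary dictionary. The sketch below is only an analogy for `stack` followed by `sort`; the helper name `stack_and_sort` is hypothetical, and the counts are a handful of values copied from the table above.

```python
def stack_and_sort(word_count):
    """Turn a {word: count} dict into (word, count) rows, largest count first."""
    return sorted(word_count.items(), key=lambda row: row[1], reverse=True)


# A few counts taken from the Tom Cruise output above.
counts = {"cruise": 8, "the": 39, "actor": 7, "in": 27, "for": 14}
rows = stack_and_sort(counts)
for word, count in rows:
    print(word, count)
```

Each dictionary entry becomes one row, which is why a single article expands into a table with one line per distinct word.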


### The most common words are uninformative ones like "the", "in", "for",... To account for this, we will use the TF-IDF algorithm. Check out this Wiki page for more information: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

## Compute TF-IDF for the corpus

To give more weight to informative words, we weigh them by their TF-IDF scores.
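
As a rough illustration of the weighting, the sketch below computes TF-IDF in plain Python on a toy corpus. It assumes the common form tf × log(N / df), where N is the number of documents and df is the number of documents containing the word; GraphLab's exact formula may differ in details. The function name `tf_idf` and the toy documents are illustrative only.

```python
import math
from collections import Counter


def tf_idf(word_counts):
    """Weight each count by log(N / df), assuming the plain tf * idf form."""
    n_docs = len(word_counts)
    doc_freq = Counter()
    for counts in word_counts:
        doc_freq.update(counts.keys())  # each document counts once per word
    return [
        {word: tf * math.log(float(n_docs) / doc_freq[word])
         for word, tf in counts.items()}
        for counts in word_counts
    ]


# Toy corpus (made up): "the" appears in every document.
toy_corpus = [
    {"the": 3, "actor": 2, "film": 1},
    {"the": 2, "singer": 2, "album": 1},
    {"the": 4, "president": 1, "senator": 1},
]
toy_tfidf = tf_idf(toy_corpus)
print(toy_tfidf[0])
```

Under this assumed formula, a word that occurs in every document gets weight zero no matter how often it appears, while a word confined to one document keeps a positive weight.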

In [94]:
# Let's add a word count feature to the whole data set

data['word_count'] = graphlab.text_analytics.count_words(data['text'])
data.head()

URI,name,text,word_count
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...,"{'selection': 1, 'carltons': 1, 'being': ..."
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...,"{'precise': 1, 'thomas': 1, 'closely': 1, ..."
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...,"{'just': 1, 'issued': 1, 'mainly': 1, 'nominat ..."
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...,"{'all': 1, 'bauforschung': 1, ..."
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"{'they': 1, 'gangstergenka': 1, ..."
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...,"{'currently': 1, 'less': 1, 'being': 1, ..."
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...,"{'exclusive': 2, 'producer': 1, 'show' ..."
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...,"{'taxi': 1, 'salon': 1, 'gangs': 1, 'being': 1, ..."
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...,"{'houston': 1, 'frankie': 1, 'labels': 1, ..."
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...,"{'phenomenon': 1, 'deborash': 1, 'both' ..."


In [95]:
tfidf = graphlab.text_analytics.tf_idf(data['word_count'])

In [97]:
# Add tfidf feature to data set
data['tfidf'] = tfidf

In [98]:
data.head(2)

URI,name,text,word_count,tfidf
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...,"{'selection': 1, 'carltons': 1, 'being': ...","{'selection': 3.836578553093086, ..."
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...,"{'precise': 1, 'thomas': 1, 'closely': 1, ...","{'precise': 6.44320060695519, ..."


## Examine the TF-IDF for the Tom Cruise article

In [99]:
tom = data[data['name'] == 'Tom Cruise']
tom

URI,name,text,word_count,tfidf
<http://dbpedia.org/resource/Tom_Cruise> ...,Tom Cruise,tom cruise born thomas cruise mapother iv ju ...,"{'thomas': 1, 'portraying': 1, ...","{'thomas': 3.3202734635624696, ..."


In [100]:
tom[['tfidf']].stack('tfidf',new_column_name=['word','tfidf']).sort('tfidf',ascending=False)

word,tfidf
cruise,47.0443993226
thriller,24.5557467907
actor,20.7062815758
film,20.3311391706
fiction,18.205652744
drama,16.9304652736
vanilla,15.3813170464
magnolia,15.1705960151
grossed,14.5952318702
romantic,14.5140812799


### Words with the highest TF-IDF are much more informative than the most frequent words

## Build a nearest neighbor model for document retrieval

In [101]:
knn_model = graphlab.nearest_neighbors.create(data,features=['tfidf'],label='name')

## Applying the nearest-neighbors model for retrieval

### Who is closest to Tom Cruise using TF-IDF features?

In [102]:
knn_model.query(tom)

query_label,reference_label,distance,rank
0,Tom Cruise,0.0,1
0,Billy Bob Thornton,0.786458333333,2
0,Matthew McConaughey,0.798165137615,3
0,Julia Roberts,0.802547770701,4
0,Nicole Kidman,0.805555555556,5


### As we can see, Tom Cruise's article is closest to that of Billy Bob Thornton, who is also an American actor, followed by those of other movie stars

## Manually compute distances between a few people

Let's manually compare the distances between the articles for a few famous people

In [105]:
obama = data[data['name'] == 'Barack Obama']
clinton = data[data['name'] == 'Bill Clinton']
beckham = data[data['name'] == 'David Beckham']

### Is Obama closer to Clinton than to Beckham?

We would expect the article on Obama to be closer to that of former US president Bill Clinton than to that of the footballer David Beckham. Let's check that.

In [110]:
graphlab.distances.cosine(obama['tfidf'][0],clinton['tfidf'][0])

0.8339854936884276

In [111]:
graphlab.distances.cosine(obama['tfidf'][0],beckham['tfidf'][0])

0.9791305844747478

Thus, as expected, Bill Clinton's article is more similar to Obama's article than David Beckham's is: its cosine distance is smaller (0.834 versus 0.979).
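
For readers who want to see what the cosine distance measures, here is a minimal sketch for sparse vectors stored as dictionaries, computed as 1 minus the cosine similarity. The function name `cosine_distance` and the toy vectors are illustrative, and the sketch assumes both vectors are non-empty (a zero vector would cause a division by zero).

```python
import math


def cosine_distance(x, y):
    """Return 1 - cos(angle) between two sparse {word: weight} vectors."""
    dot = sum(weight * y[word] for word, weight in x.items() if word in y)
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    return 1.0 - dot / (norm_x * norm_y)


# Toy TF-IDF-like vectors (made up for illustration).
politician_a = {"president": 2.0, "senate": 1.0}
politician_b = {"president": 1.0, "election": 1.0}
footballer = {"football": 3.0, "midfielder": 1.0}

print(cosine_distance(politician_a, politician_b))
print(cosine_distance(politician_a, footballer))
```

Vectors that share no words have distance 1, and the distance shrinks toward 0 as the weighted vocabularies overlap more, which is why a smaller value means a more similar article.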

## Compare top words according to word counts to TF-IDF

Now, take a particular famous person, Elton John. What are the 3 words in his article with the highest word counts? What are the 3 words in his article with the highest TF-IDF? These results illustrate why TF-IDF is useful for finding important words.

In [112]:
john = data[data['name'] == 'Elton John']

In [113]:
john[['tfidf']].stack('tfidf',new_column_name=['word','tfidf']).sort('tfidf',ascending=False)

word,tfidf
furnish,18.38947184
elton,17.48232027
billboard,17.3036809575
john,13.9393127924
songwriters,11.250406447
tonightcandle,10.9864953892
overallelton,10.9864953892
19702000,10.2933482087
fivedecade,10.2933482087
aids,10.262846934


In [115]:
john[['word_count']].stack('word_count',new_column_name=['word','count']).sort('count',ascending=False)

word,count
the,27
in,18
and,15
of,13
a,10
has,9
john,7
he,7
on,6
award,5


### Cosine distance between Elton John's and Victoria Beckham's articles

In [116]:
VBeckham = data[data['name'] == 'Victoria Beckham']

In [117]:
graphlab.distances.cosine(john['tfidf'][0],VBeckham['tfidf'][0])

0.9567006376655429

### Cosine distance between Elton John's and Paul McCartney's articles

In [118]:
mc = data[data['name'] == 'Paul McCartney']

In [119]:
graphlab.distances.cosine(john['tfidf'][0],mc['tfidf'][0])

0.8250310029221779

## Build a nearest neighbor model based on word count for document retrieval

In [143]:
knn_word_count_model = graphlab.nearest_neighbors.create(data, features=['word_count'], label='name', 
                                                         distance='cosine')

## Build a nearest neighbor model based on tf-idf for document retrieval

In [144]:
knn_tfidf_model = graphlab.nearest_neighbors.create(data, features=['tfidf'], label='name', distance='cosine')
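
Conceptually, a nearest-neighbors query with cosine distance ranks every reference document by its distance to the query. The brute-force sketch below illustrates that idea in plain Python; it is not how GraphLab implements the model, and the names `query`, `cosine_distance` and the toy vectors are all hypothetical.

```python
import math


def cosine_distance(x, y):
    """1 - cosine similarity for sparse {word: weight} vectors."""
    dot = sum(w * y.get(word, 0.0) for word, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    return 1.0 - dot / (norm_x * norm_y)


def query(reference, query_vector, k=5):
    """Return the k (label, distance) pairs nearest to `query_vector`."""
    scored = [(label, cosine_distance(query_vector, vector))
              for label, vector in reference.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy reference set (made up): two singers and one footballer.
toy = {
    "singer_a": {"album": 2.0, "singer": 1.0, "tour": 1.0},
    "singer_b": {"album": 1.0, "singer": 2.0},
    "footballer": {"club": 2.0, "goal": 1.0},
}
neighbors = query(toy, toy["singer_a"], k=2)
print(neighbors)
```

Because the query document is itself part of the reference set, it comes back at rank 1 with a distance of essentially zero. Values on the order of 1e-16 in the query outputs, including slightly negative ones, are floating-point rounding rather than real differences.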

## Applying the nearest-neighbors model for retrieval

### Who is closest to Elton John using word count features?

In [138]:
knn_word_count_model.query(john)

query_label,reference_label,distance,rank
0,Elton John,2.22044604925e-16,1
0,Cliff Richard,0.16142415259,2
0,Sandro Petrone,0.16822542751,3
0,Rod Stewart,0.168327165587,4
0,Malachi O'Doherty,0.177315545979,5


### Who is closest to Elton John using tfidf features?

In [141]:
knn_tfidf_model.query(john)

query_label,reference_label,distance,rank
0,Elton John,-2.22044604925e-16,1
0,Rod Stewart,0.717219667893,2
0,George Michael,0.747600998969,3
0,Sting (musician),0.747671954431,4
0,Phil Collins,0.75119324879,5


### Who is closest to Victoria Beckham using word count features?

In [145]:
knn_word_count_model.query(VBeckham)

query_label,reference_label,distance,rank
0,Victoria Beckham,-2.22044604925e-16,1
0,Mary Fitzgerald (artist),0.207307036115,2
0,Adrienne Corri,0.214509782788,3
0,Beverly Jane Fry,0.217466468741,4
0,Raman Mundair,0.217695474992,5


### Who is closest to Victoria Beckham using tfidf features?

In [146]:
knn_tfidf_model.query(VBeckham)

query_label,reference_label,distance,rank
0,Victoria Beckham,1.11022302463e-16,1
0,David Beckham,0.548169610263,2
0,Stephen Dow Beckham,0.784986706828,3
0,Mel B,0.809585523409,4
0,Caroline Rush,0.819826422919,5


## That makes sense, as Victoria Beckham is the wife of David Beckham :)