# Foundations Week 4 Assignment
## Retrieving Wikipedia articles

- Execute document retrieval code in the IPython notebook
- Load and transform real text data
- Compare results with words and TF-IDF (Term Frequency - Inverse Document Frequency)
- Set the distance function in the retrieval
- Build a document retrieval model using nearest neighbor search

In [22]:
import graphlab

In [23]:
# Import our Wikipedia subset into an SFrame object
people = graphlab.SFrame('people_wiki.gl/')

In [24]:
# Calculate word counts for each wiki text entry
people['word_count'] = graphlab.text_analytics.count_words(people['text'])
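
As a rough illustration of what a word-count column holds, the sketch below builds a `{word: count}` dictionary in plain Python. The helper name `count_words_sketch` is hypothetical, and the lowercasing plus whitespace split is an assumption chosen because the article text in this data set already appears lowercased and free of punctuation; it is not a reproduction of GraphLab's implementation.

```python
from collections import Counter


def count_words_sketch(text):
    """Return a {word: count} dict for lowercased, whitespace-separated tokens."""
    return dict(Counter(text.lower().split()))


# Made-up sentence used only for illustration
counts = count_words_sketch("elton john is an english singer and john is a pianist")
print(counts["john"])  # 2
```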

In [25]:
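# Calculate TF-IDF based on that word count:
# TF-IDF(w, d) = tf(w, d) * log(N / f(w))
# where tf(w, d) is the count of word w in document d, N is the number of
# documents, and f(w) is the number of documents that contain w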
people_tfidf = graphlab.text_analytics.tf_idf(people['word_count'])
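
The formula in the comment above can be worked through by hand on a toy corpus. This is a minimal sketch, assuming `f(w)` means the number of documents that contain `w` and that the logarithm is the natural log; the function name `tf_idf_sketch` and the two toy documents are made up for illustration.

```python
import math


def tf_idf_sketch(docs):
    """docs: list of {word: count} dicts. Returns a list of {word: tf-idf} dicts."""
    n_docs = len(docs)
    # f(w): number of documents in which each word appears at least once
    doc_freq = {}
    for counts in docs:
        for word in counts:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    # tf(w, d) * log(N / f(w)) for every word in every document
    return [
        {word: tf * math.log(n_docs / doc_freq[word]) for word, tf in counts.items()}
        for counts in docs
    ]


toy_docs = [{"the": 3, "elton": 2}, {"the": 1, "beckham": 1}]
weights = tf_idf_sketch(toy_docs)
print(weights[0]["the"])  # 0.0, because "the" occurs in every document
```

A word that appears in every document gets a weight of zero, which is why common words such as "the" drop out of the TF-IDF rankings later in this notebook.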

In [26]:
people['tfidf'] = people_tfidf['docs']

In [28]:
# Take a quick peek, because last time I put the output of tf_idf() into an SArray,
# which was dumb because it put an SArray into every row of the column I was creating.
# We don't want nested SArrays here.
people.head()

URI,name,text,word_count,tfidf
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...,"{'since': 1, 'carltons': 1, 'being': 1, '2005' ...","{'since': 1.455376717308041, ..."
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...,"{'precise': 1, 'thomas': 1, 'closely': 1, ...","{'precise': 6.44320060695519, ..."
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...,"{'just': 1, 'issued': 1, 'mainly': 1, 'nominat ...","{'just': 2.7007299687108643, ..."
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...,"{'all': 1, 'bauforschung': 1, ...","{'all': 1.6431112434912472, ..."
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"{'legendary': 1, 'gangstergenka': 1, ...","{'legendary': 4.280856294365192, ..."
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...,"{'now': 1, 'currently': 1, 'less': 1, 'being' ...","{'now': 1.96695239252401, 'currently': ..."
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...,"{'exclusive': 2, 'producer': 1, 'tribe': ...","{'exclusive': 10.455187230695827, ..."
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...,"{'taxi': 1, 'salon': 1, 'gangs': 1, 'being': 1, ...","{'taxi': 6.0520214560945025, ..."
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...,"{'houston': 1, 'frankie': 1, 'labels': 1, ...","{'houston': 3.935505942157149, ..."
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...,"{'phenomenon': 1, 'deborash': 1, ...","{'phenomenon': 5.750053426395245, ..."


## Question 1:
What are the 3 words in the 'Elton John' article with the highest word counts?

*Answer:*
- the
- in
- and

What are the 3 words in the 'Elton John' article with the highest TF-IDF?

*Answer:*
- furnish
- elton
- billboard

In [46]:
elton = people[people['name'] == 'Elton John']

In [49]:
elton[['word_count']].stack('word_count', new_column_name=['word', 'count']).sort('count', ascending=False)

word,count
the,27
in,18
and,15
of,13
a,10
has,9
he,7
john,7
on,6
since,5


In [47]:
elton[['tfidf']].stack('tfidf', new_column_name=['word', 'tfidf']).sort('tfidf', ascending=False)

word,tfidf
furnish,18.38947184
elton,17.48232027
billboard,17.3036809575
john,13.9393127924
songwriters,11.250406447
overallelton,10.9864953892
tonightcandle,10.9864953892
19702000,10.2933482087
fivedecade,10.2933482087
aids,10.262846934
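
The stack-and-sort calls above amount to ranking a dictionary's entries by value. A minimal plain-Python sketch of the same idea follows; the helper name `top_words` is hypothetical, and the sample dictionary copies the first five word counts from the table above.

```python
def top_words(scores, k=3):
    """Return the k (word, score) pairs with the highest scores, largest first."""
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)[:k]


# First five rows of the word-count table for the 'Elton John' article
word_counts = {"the": 27, "in": 18, "and": 15, "of": 13, "a": 10}
print(top_words(word_counts))  # [('the', 27), ('in', 18), ('and', 15)]
```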


## Question 2:
Using graphlab.distances.cosine(), what's the distance between the articles on 'Elton John' and 'Victoria Beckham'? *0.9567*

What's the distance between 'Elton John' and 'Paul McCartney'? *0.8250*

Which of these two is closer to Elton John? *Paul McCartney*

In [39]:
# Create SFrames with the subset of data we need
# Note: I read somewhere this isn't actually duplicating any data
vbeckham = people[people['name'] == 'Victoria Beckham']
pmccartney = people[people['name'] == 'Paul McCartney']

In [35]:
graphlab.distances.cosine(elton['tfidf'][0], vbeckham['tfidf'][0])

0.9567006376655429

In [40]:
graphlab.distances.cosine(elton['tfidf'][0], pmccartney['tfidf'][0])

0.8250310029221779
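
For intuition about the numbers above, cosine distance between two sparse `{word: weight}` dictionaries can be sketched as one minus the cosine similarity. This is an illustrative plain-Python version, not GraphLab's code; the function name `cosine_distance_sketch` is hypothetical, and it assumes both vectors have at least one nonzero weight.

```python
import math


def cosine_distance_sketch(a, b):
    """Cosine distance (1 - cosine similarity) between two {word: weight} dicts.

    Assumes neither vector is all zeros; otherwise the division would fail.
    """
    # Only words present in both dicts contribute to the dot product
    dot = sum(value * b[word] for word, value in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


# Two vectors with no words in common are at the maximum distance of 1.0
print(cosine_distance_sketch({"piano": 1.0}, {"fashion": 1.0}))  # 1.0
```

A smaller value means the two articles share more of their heavily weighted words. Floating-point rounding can leave a tiny nonzero result, positive or negative, when a vector is compared with itself, which is the likely source of the values near `2.22e-16` in the query outputs below.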

## Build Nearest Neighbor Models
- Using word counts as features
- Using TF-IDF as features

In [41]:
# Build a nearest neighbors model using our 'people' data set, word counts as features, 
# and a cosine function for measurement
wc_model = graphlab.nearest_neighbors.create(people, label='name', features=['word_count'], distance='cosine')
# Build a nearest neighbors model using our 'people' data set, tf-idf as features, 
# and a cosine function for measurement
tfidf_model = graphlab.nearest_neighbors.create(people, label='name', features=['tfidf'], distance='cosine')

PROGRESS: Starting brute force nearest neighbors model training.
PROGRESS: Starting brute force nearest neighbors model training.
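
The "brute force" in the progress messages suggests that a query is compared against every reference article. A minimal sketch of that idea in plain Python follows, assuming cosine distance over `{word: weight}` dictionaries; the function names and the toy corpus labels are hypothetical and are not part of GraphLab.

```python
import math


def cosine_distance_sketch(a, b):
    """Cosine distance between two nonzero {word: weight} dicts."""
    dot = sum(value * b[word] for word, value in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_neighbors_sketch(query, corpus, k=5):
    """corpus: {label: {word: weight}}. Returns k (label, distance) pairs, closest first."""
    # Brute force: compute the distance from the query to every reference vector
    distances = [(label, cosine_distance_sketch(query, vec)) for label, vec in corpus.items()]
    return sorted(distances, key=lambda pair: pair[1])[:k]


# Toy corpus with made-up labels and weights
toy_corpus = {
    "singer_a": {"piano": 1.0, "album": 1.0},
    "singer_b": {"piano": 1.0},
    "designer": {"fashion": 1.0},
}
neighbors = nearest_neighbors_sketch(toy_corpus["singer_a"], toy_corpus, k=2)
print([label for label, _ in neighbors])  # ['singer_a', 'singer_b']
```

As in the query results in this notebook, the closest match to an item that is itself in the corpus is the item itself, so the first useful neighbor is the one at rank 2.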


## Question 3:
- What’s the most similar article, other than itself, to the one on ‘Elton John’ using word count features? *Cliff Richard*
- What’s the most similar article, other than itself, to the one on ‘Elton John’ using TF-IDF features? *Rod Stewart*
- What’s the most similar article, other than itself, to the one on ‘Victoria Beckham’ using word count features? *Mary Fitzgerald (artist)*
- What’s the most similar article, other than itself, to the one on ‘Victoria Beckham’ using TF-IDF features? *David Beckham*

In [42]:
wc_model.query(elton)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 3.599ms      |
PROGRESS: | Done         |         | 100         | 217.489ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Elton John,2.22044604925e-16,1
0,Cliff Richard,0.16142415259,2
0,Sandro Petrone,0.16822542751,3
0,Rod Stewart,0.168327165587,4
0,Malachi O'Doherty,0.177315545979,5


In [43]:
tfidf_model.query(elton)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 6.482ms      |
PROGRESS: | Done         |         | 100         | 218.026ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Elton John,-2.22044604925e-16,1
0,Rod Stewart,0.717219667893,2
0,George Michael,0.747600998969,3
0,Sting (musician),0.747671954431,4
0,Phil Collins,0.75119324879,5


In [44]:
wc_model.query(vbeckham)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 4.247ms      |
PROGRESS: | Done         |         | 100         | 174.281ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Victoria Beckham,-2.22044604925e-16,1
0,Mary Fitzgerald (artist),0.207307036115,2
0,Adrienne Corri,0.214509782788,3
0,Beverly Jane Fry,0.217466468741,4
0,Raman Mundair,0.217695474992,5


In [45]:
tfidf_model.query(vbeckham)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 4.051ms      |
PROGRESS: | Done         |         | 100         | 245.625ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Victoria Beckham,1.11022302463e-16,1
0,David Beckham,0.548169610263,2
0,Stephen Dow Beckham,0.784986706828,3
0,Mel B,0.809585523409,4
0,Caroline Rush,0.819826422919,5
