# Quiz 4 - Document Retrieval from Wikipedia Data

In [2]:
import turicreate

In [3]:
people = turicreate.SFrame('~/Courses/u-wash-machine-learning/machine-learning-case-study/data/people_wiki.sframe')

## **1. Compare top words according to word counts to TF-IDF**  
In the notebook we covered in the module, we explored two document representations: word counts and TF-IDF.  Now, take a particular famous person, 'Elton John'. 

**What are the 3 words in his articles with highest word counts?**  
**What are the 3 words in his articles with highest TF-IDF?**   

These results illustrate why TF-IDF is useful for finding important words.  Save these results to answer the quiz at the end.
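
To make the contrast concrete, here is a minimal pure-Python sketch of both representations on a hypothetical three-document toy corpus (not the Wikipedia data). It assumes the common weighting tf × ln(N / df), where N is the number of documents and df is the number of documents containing the word. The helper names `count_words` and `tf_idf` below are illustrative stand-ins, not the Turi Create implementations.

```python
import math
from collections import Counter


def count_words(text):
    """Return a word -> raw count mapping (naive whitespace tokenizer)."""
    return Counter(text.lower().split())


def tf_idf(docs):
    """Return one word -> weight dict per document, using tf * ln(N / df)."""
    counts = [count_words(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    return [
        {w: tf * math.log(n_docs / doc_freq[w]) for w, tf in c.items()}
        for c in counts
    ]


# Hypothetical toy corpus, for illustration only.
docs = [
    "the singer wrote the song and the album",
    "the striker scored in the final",
    "the senator spoke in the debate",
]

counts = count_words(docs[0])
weights = tf_idf(docs)[0]

top_count = counts.most_common(1)[0][0]   # most frequent word in document 0
top_tfidf = max(weights, key=weights.get)  # highest-weighted word in document 0
print(top_count, top_tfidf)
```

A word that appears in every document, such as "the" here, gets weight tf × ln(1) = 0, so it drops out of the TF-IDF ranking even though it dominates the raw counts.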

In [4]:
elton = people[people['name'] == 'Elton John']

In [5]:
elton

URI,name,text
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...


In [8]:
elton['word_count'] = turicreate.text_analytics.count_words(elton['text'])

**What are the 3 words in his articles with highest word counts?** 

1. the
2. in
3. and

In [11]:
elton.stack('word_count', new_column_name=['word','count']).sort('count',ascending=False)

URI,name,text,word,count
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...,the,27.0
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...,in,18.0
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...,and,15.0
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...,of,13.0
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...,a,10.0
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...,has,9.0
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...,john,7.0
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...,he,7.0
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...,on,6.0
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...,award,5.0


**What are the 3 words in his articles with highest TF-IDF?**
1. Furnish
2. Elton
3. Billboard

In [35]:
people['tfidf'] = turicreate.text_analytics.tf_idf(people['text'])
people['word_count'] = turicreate.text_analytics.count_words(people['text'])

In [39]:
# Re-select the row so it picks up the new 'tfidf' and 'word_count' columns computed on people above.
elton = people[people['name'] == 'Elton John']
elton[['tfidf']].stack('tfidf', new_column_name=['word', 'tfidf']).sort('tfidf', ascending=False)

word,tfidf
furnish,18.38947183999428
elton,17.482320270031995
billboard,17.30368095754203
john,13.93931279239831
songwriters,11.25040644703154
tonightcandle,10.986495389225194
overallelton,10.986495389225194
19702000,10.293348208665249
fivedecade,10.293348208665249
aids,10.262846934045534


## **2. Measuring distance**  

**What’s the cosine distance between the articles on ‘Elton John’ and ‘Victoria Beckham’?  
What’s the cosine distance between the articles on ‘Elton John’ and ‘Paul McCartney’?  
Which one of the two is closest to Elton John?  
Does this result make sense to you?**
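
As a reference for what is being measured, cosine distance is 1 minus the cosine of the angle between two vectors, so 0 means the vectors point in the same direction and 1 means they share no words. The following is a minimal sketch for sparse word-weight dictionaries. The function name `cosine_distance` and the toy vectors are hypothetical, and this is not the Turi Create implementation.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two sparse vectors stored as word -> weight dicts."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical toy vectors, for illustration only.
doc_a = {"piano": 2.0, "album": 1.0}
doc_b = {"piano": 1.0, "album": 0.5}   # same direction as doc_a, half the length
doc_c = {"football": 3.0}              # no words in common with doc_a

print(cosine_distance(doc_a, doc_b))
print(cosine_distance(doc_a, doc_c))
```

Because the measure depends only on direction, `doc_a` and `doc_b` are at distance (approximately) zero despite their different lengths, which is why cosine distance is less sensitive to article length than Euclidean distance.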

In [44]:
vbeckham = people[people['name'] == 'Victoria Beckham']
mccartney = people[people['name'] == 'Paul McCartney']

In [45]:
print('Cosine distance to Victoria Beckham is', turicreate.distances.cosine(elton['tfidf'][0], vbeckham['tfidf'][0]))
print('Cosine distance to Paul McCartney is', turicreate.distances.cosine(elton['tfidf'][0], mccartney['tfidf'][0]))

Cosine distance to Victoria Beckham is 0.9567006376655429
Cosine distance to Paul McCartney is 0.8250310029221779

Paul McCartney is closer to Elton John: a smaller cosine distance means more similar articles, and 0.825 < 0.957.


## **3. Building nearest neighbors models with different input features and setting the distance metric** 
In the sample notebook, we built a nearest neighbors model for retrieving articles using TF-IDF as features and using the default setting in the construction of the nearest neighbors model.  Now, you will build two nearest neighbors models:

* Using word counts as features
* Using TF-IDF as features

1. What’s the most similar article, other than itself, to the one on ‘Elton John’ using word count features?<br>
Cliff Richard

2. What’s the most similar article, other than itself, to the one on ‘Elton John’ using TF-IDF features?<br>
Rod Stewart

3. What’s the most similar article, other than itself, to the one on ‘Victoria Beckham’ using word count features?<br>
Mary Fitzgerald (artist)

4. What’s the most similar article, other than itself, to the one on ‘Victoria Beckham’ using TF-IDF features?<br>
David Beckham
David Beckham
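
Conceptually, a nearest neighbors query with a cosine metric ranks every reference document by its distance to the query document. The brute-force sketch below illustrates that idea on hypothetical toy vectors. The names `cosine_distance` and `query` and the labels are assumptions made for illustration, and the real model may use a different search strategy internally.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two sparse word -> weight dicts."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def query(reference, query_vec, k=5):
    """Return the k closest reference items as (label, distance, rank) tuples."""
    ranked = sorted(
        (cosine_distance(query_vec, vec), label)
        for label, vec in reference.items()
    )
    return [
        (label, dist, rank)
        for rank, (dist, label) in enumerate(ranked[:k], start=1)
    ]


# Hypothetical toy reference set, for illustration only.
reference = {
    "musician_a": {"piano": 3, "album": 2, "tour": 1},
    "musician_b": {"piano": 2, "album": 2, "band": 1},
    "athlete_a": {"football": 3, "club": 2, "tour": 1},
}

results = query(reference, reference["musician_a"])
for label, dist, rank in results:
    print(rank, label, dist)
```

The query document itself comes back at rank 1 with a distance of approximately zero, possibly off by a tiny floating-point amount, which is why the questions above ask for the most similar article other than itself.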

In [40]:
knn_model_wc = turicreate.nearest_neighbors.create(people, features=['word_count'],label='name', distance='cosine')

In [46]:
knn_model_tfidf = turicreate.nearest_neighbors.create(people, features=['tfidf'],label='name', distance='cosine')

In [41]:
knn_model_wc.query(elton)

query_label,reference_label,distance,rank
0,Elton John,2.220446049250313e-16,1
0,Cliff Richard,0.1614241525896703,2
0,Sandro Petrone,0.1682254275104111,3
0,Rod Stewart,0.168327165587061,4
0,Malachi O'Doherty,0.177315545978884,5


In [47]:
knn_model_tfidf.query(elton)

query_label,reference_label,distance,rank
0,Elton John,-2.220446049250313e-16,1
0,Rod Stewart,0.7172196678927374,2
0,George Michael,0.7476009989692847,3
0,Sting (musician),0.7476719544306141,4
0,Phil Collins,0.7511932487904706,5


In [48]:
knn_model_wc.query(vbeckham)

query_label,reference_label,distance,rank
0,Victoria Beckham,-2.220446049250313e-16,1
0,Mary Fitzgerald (artist),0.2073070361150499,2
0,Adrienne Corri,0.2145097827875479,3
0,Beverly Jane Fry,0.2174664687407927,4
0,Raman Mundair,0.2176954749915048,5


In [49]:
knn_model_tfidf.query(vbeckham)

query_label,reference_label,distance,rank
0,Victoria Beckham,1.1102230246251563e-16,1
0,David Beckham,0.5481696102632145,2
0,Stephen Dow Beckham,0.7849867068283364,3
0,Mel B,0.8095855234085036,4
0,Caroline Rush,0.81982642291868,5
