# Class 14 - Starter Code

Natural Language Processing and Topic Modeling

In [8]:
# spacy is used for pre-processing and traditional NLP
import spacy
from spacy.en import English
nlp_toolkit = English()

# Gensim is used for LDA and word2vec
from gensim.models.word2vec import Word2Vec

# Twitter Lab

In this exercise, we will compare some of the classical NLP tools from the last class with these more modern latent variable techniques.  We will do this by comparing information extraction on Twitter using two different methods.

> NOTE:  There is a pre-existing file of captured tweets you can use.  It is located in the class repo for lesson-14.  However, you can also collect your own tweets following the instructions in twitter-instructions.md.

In [2]:
# Loading the twitter data
tweets = [unicode(tweet, errors='ignore') for tweet in \
          open('../../assets/dataset/captured-tweets.txt', 'r')]
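
The `unicode(...)` built-in only exists in Python 2. As a minimal sketch, the same loading step can be written so it runs on both Python 2 and 3. The helper name `load_tweets` is hypothetical, and the UTF-8 encoding is an assumption, since the notebook does not state how the file was saved.

```python
import io

def load_tweets(path, encoding='utf-8'):
    """Read one tweet per line, silently dropping undecodable bytes."""
    # io.open accepts encoding/errors on both Python 2 and Python 3.
    # errors='ignore' skips bytes that cannot be decoded instead of raising.
    with io.open(path, 'r', encoding=encoding, errors='ignore') as handle:
        return [line for line in handle]

# Usage with the path from the cell above (assumes the file exists):
# tweets = load_tweets('../../assets/dataset/captured-tweets.txt')
```

One design difference to be aware of: the original call relies on Python 2's default encoding (typically ASCII), so it discards every non-ASCII byte, while decoding as UTF-8 keeps accented characters and emoji.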

# Part 1: Using `spacy`

Use `spacy` to write a function that filters tweets down to those where Google is announcing a product. How might we do this? One way is to find tweets that mention 'Google' as an entity and contain a verb describing an action such as 'announce'.

In [5]:
# Use spacy to parse each tweet
parsed_tweets = []
for tweet in tweets:
    parsed_tweets.append(nlp_toolkit(tweet))

### 1.a
Write a function that can take a sentence parsed by spacy and identify if it mentions a company named 'Google'. Remember, spacy can find entities and code them as ORG if they are a company.

### 1.b
BONUS: Make this function work for any company.

Hint: https://spacy.io/docs#examples-entities

In [9]:
# Write a function that can take a sentence parsed by `spacy` and 
# identify if it mentions a company named 'Google'. 
# Remember, `spacy` can find entities and code them `ORG` if they are a company.
def mentions_company(parsed, company='Google'):
    for entity in parsed.ents:
        if entity.text == company and entity.label_ == 'ORG':
            return True
    return False
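
The function above requires an exact, case-sensitive match on a single name. A possible variation, sketched here under assumptions, checks several company names at once and ignores case. `Entity` and `Doc` are hypothetical stand-ins that expose only the `.ents`, `.text` and `.label_` attributes used above, so the sketch runs without loading a spaCy model. `mentions_any_company` is an illustrative name, not part of spaCy.

```python
from collections import namedtuple

# Hypothetical stand-ins for spaCy objects (not the real spaCy classes).
Entity = namedtuple('Entity', ['text', 'label_'])
Doc = namedtuple('Doc', ['ents'])

def mentions_any_company(parsed, companies):
    """Return True if any ORG entity matches one of the names, ignoring case."""
    wanted = set(name.lower() for name in companies)
    return any(ent.label_ == 'ORG' and ent.text.lower() in wanted
               for ent in parsed.ents)

doc = Doc(ents=[Entity('google', 'ORG'), Entity('CES', 'ORG')])
print(mentions_any_company(doc, ['Google', 'Ford']))   # True
print(mentions_any_company(doc, ['Lenovo']))           # False
```

Because it only reads `.ents`, `.text` and `.label_`, the same function should also accept the parsed tweets produced by `nlp_toolkit`.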

In [10]:
# For each tweet, use parsed tweet to check your function
for i, parsed_tweet in enumerate(parsed_tweets):
    if mentions_company(parsed_tweet, 'Google'):
        print parsed_tweet
        if i>10:
            break

Google Play Gift Card Code

Claim your Google Play Gift Card Code... https://t.co/ySYH1x5kQl #amazon #itunes #googl https://t.co/ayDI4X1FKO

King of dark fantasy. Summon today. App Store: https://t.co/XeEuOEaXEG Google Play: https://t.co/vYQdbhrNEb #DarkSummoner



### 1.c
Write a function that can take a sentence parsed by spacy and return the verbs of the sentence (preferably lemmatized).

Hint: https://spacy.io/docs#examples-pos-tags

In [18]:
# Write a function that can take a sentence parsed by `spacy` 
# and return the verbs of the sentence (preferably lemmatized)
def get_actions(parsed):
    actions = [el.lemma_ 
                for el in parsed 
                if el.pos == spacy.parts_of_speech.VERB
               ]
    return actions

In [19]:
# For each tweet, use parsed tweet to check your function
for i, parsed_tweet in enumerate(parsed_tweets):
    print get_actions(parsed_tweet)
    if i>10:
        break

[u'make']
[]
[]
[u'google']
[]
[]
[u'use', u'learn']
[u'be', u'need', u'help', u'be', u'lack']
[]
[u'claim']
[u'come']
[]


### 1.d

For each tweet that mentions Google, parse it using spacy and print it out if the tweet has 'release' or 'announce' as a verb.

In [20]:
for i, parsed_tweet in enumerate(parsed_tweets):
    if mentions_company(parsed_tweet, 'Google'):
        actions = get_actions(parsed_tweet)
        if 'release' in actions or 'announce' in actions:
            print(parsed_tweet)
            
# Some tweets contain the HTML entity '&amp;' instead of '&'; we would have to replace it

Google &amp; Ford rumored to announce partnership at CES https://t.co/zOgm1NjHhD https://t.co/Gzx81ujqVC

Lenovo And Google To Officially Announce Project Tango On January 7 At CES https://t.co/Qvmc34T5QA

Lenovo And Google To Officially Announce Project Tango On January 7 At CES https://t.co/GNmL0uw9xl

Google's Project Ara Spiral is expected to be released next year January https://t.co/prycPMuGsG

#Lenovo And #Google To Officially Announce #ProjectTango On January 7 At CES https://t.co/04seCBKv16

RT @GizmoChina: Lenovo And Google To Officially Announce Project Tango On January 7 At CES https://t.co/GNmL0uw9xl

Google and Ford to announce partnership on self-driving cars at CES - Fudzilla (blog) https://t.co/6woe56G22Q

Google and Ford to announce partnership on self-driving cars at CES - Fudzilla (blog) https://t.co/4hERVJ4zZK
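
The first tweet in the output above still contains the HTML entity `&amp;`. One way to clean this up, sketched here with the standard library only, is to unescape HTML entities before parsing. The sample string is copied from that tweet with the URLs omitted.

```python
try:
    from html import unescape          # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2 fallback
    unescape = HTMLParser().unescape

raw = u'Google &amp; Ford rumored to announce partnership at CES'
print(unescape(raw))
```

Applying `unescape` to every tweet right after loading would let `spacy` see a plain `&` rather than the tokens of `&amp;`.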



### 1.e
Write a function that identifies countries.  HINT: the entity label for countries is GPE (or "GeoPolitical Entity").

Hint: https://spacy.io/docs#annotation-ner

In [40]:
# Write a function that identifies countries - HINT: the entity label for 
# countries is GPE (or GeoPolitical Entity)
def mentions_country(parsed, country):
    for entity in parsed.ents:
        if entity.text == country and entity.label_ == 'GPE':
            return True
    return False

In [41]:
for i, parsed_tweet in enumerate(parsed_tweets):
    if mentions_country(parsed_tweet, 'Iran'):
        print parsed_tweet
        if i>1000:
            break

RT @f396: Iran blames America, Britain and 'Zionists' for Nimr execution - https://t.co/BwXEicgAOA via https://t.co/UjStGmTT2f

RT @f396: Saudi Arabia severs diplomatic ties with Iran over embassy fire - https://t.co/r0iZugJa3v via https://t.co/UjStGmTT2f

RT @f396: 'Iran has a long record in attacking foreign diplomatic missions,' Saudi ... - https://t.co/3gaSRB3osT via https://t.co/UjStGmTT2f

# # # # Saudi Arabia cuts ties with Iran - Mail &amp; Guardian Online  https://t.co/vxCisN0Hrh



### 1.f
Re-run the filter to find tweets that mention the country 'Iran' and use 'announce' or 'release' as a verb.

In [46]:
for i, parsed_tweet in enumerate(parsed_tweets):
    if mentions_country(parsed_tweet, 'Iran'):
        actions = get_actions(parsed_tweet)
        if 'release' in actions or 'announce' in actions:
            print(parsed_tweet)
        else:
            print("nothing")
            

# Part 2: Using `gensim`

Build a `word2vec` model of the tweets we have collected using `gensim`.

### 2.a
First take the collection of tweets and tokenize them using spacy.
Think about how this should be done. 
Should you normalize everything to a single case (for example, lower-case)?  
Should you remove punctuation or symbols? 
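
To make the questions above concrete, here is a minimal sketch of one possible normalization step applied to a list of token strings. The helper `normalize_tokens` and its flags are hypothetical, and whether lowercasing or dropping punctuation helps is something to test on the data, not a given.

```python
import string

def normalize_tokens(tokens, lowercase=True, drop_punct=True):
    """Optionally lowercase tokens and drop punctuation-only tokens."""
    cleaned = []
    for tok in tokens:
        tok = tok.strip()
        if not tok:
            continue  # skip whitespace-only tokens such as '\n'
        if drop_punct and all(ch in string.punctuation for ch in tok):
            continue  # skip tokens made only of punctuation, e.g. '#' or '!'
        cleaned.append(tok.lower() if lowercase else tok)
    return cleaned

tokens = [u'I', u'make', u'Small', u'Tourmaline', u'!', u'#', u'Android', u'\n']
print(normalize_tokens(tokens))
```

Lowercasing merges 'Android' and 'android' into one vocabulary entry, which gives the model more examples per word, but any later query then has to be lowercased as well ('iran' and not 'Iran').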

In [47]:
# Lemmatize the verbs for easier searching and keep symbols and punctuation
split_tweets = [[x.text if x.pos != spacy.parts_of_speech.VERB else x.lemma_ 
                 for x in nlp_toolkit(t)] for t in tweets]

In [48]:
print tweets[0]

I made a(n) Small Tourmaline in Paradise Island! https://t.co/cAoW1b6DRc #Gameinsight #Androidgames #Android



In [49]:
print split_tweets[0]

[u'I', u'make', u'a(n', u')', u'Small', u'Tourmaline', u'in', u'Paradise', u'Island', u'!', u'https://t.co/cAoW1b6DRc', u'#', u'Gameinsight', u'#', u'Androidgames', u'#', u'Android', u'\n']


### 2.b
Build a word2vec model.  
Experiment with the window size as well: this is the number of surrounding words used as context when modeling a word. What do you think is appropriate for Twitter? 
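
To see what the window size controls, the sketch below lists the context words that fall within a given window of each target word. It illustrates the idea of a context window only and is not how gensim implements training. `context_pairs` is a hypothetical helper.

```python
def context_pairs(tokens, window):
    """Yield (target, context) pairs within `window` positions of each token."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                yield target, tokens[j]

tweet = ['Google', 'and', 'Ford', 'to', 'announce', 'partnership']
pairs = list(context_pairs(tweet, window=2))
# Context words seen for 'Ford' with a window of 2
print([ctx for tgt, ctx in pairs if tgt == 'Ford'])
```

Tweets are short, so a large window quickly covers the whole tweet, while a small window restricts each word to its immediate neighbours.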


In [50]:
# Build a `word2vec` model
model = Word2Vec(split_tweets, size=100, window=4, min_count=5, workers=4)

### 2.c
Test your word2vec model with a few similarity functions.  
Find words similar to 'Syria'.  
Find words similar to 'war'.  
Find words similar to 'Iran'.  
Find words similar to 'Verizon'. 
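
The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch of what that number means, the function below computes it in plain Python. The three vectors are made up for illustration and do not come from the trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (invented values, for illustration only)
iran = [0.9, 0.1, 0.2]
syria = [0.8, 0.2, 0.2]
xbox = [0.1, 0.9, 0.3]
print(cosine_similarity(iran, syria) > cosine_similarity(iran, xbox))  # True
```

A score of 1 means the two vectors point in the same direction. When nearly every score is close to 1, as can happen with a small corpus, the ranking may carry little information.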

In [51]:
model.most_similar(positive=['Verizon'])

[(u'Microsoft', 0.9997143745422363),
 (u'after', 0.9997115135192871),
 (u'iPhone', 0.9997094869613647),
 (u'here', 0.9996896982192993),
 (u'Xbox', 0.9996877908706665),
 (u'6', 0.9996792674064636),
 (u'call', 0.9996715784072876),
 (u'4', 0.9996697902679443),
 (u'Your', 0.9996682405471802),
 (u'Black', 0.9996664524078369)]

In [52]:
model.most_similar(positive=['war'])

[(u'under', 0.9996192455291748),
 (u'Facebook', 0.999558687210083),
 (u'children', 0.9995566010475159),
 (u'=', 0.9995386600494385),
 (u'No', 0.9995381832122803),
 (u'Syrian', 0.9995209574699402),
 (u'+', 0.9995180368423462),
 (u'new', 0.9994900822639465),
 (u'Iraq', 0.9994814395904541),
 (u'use', 0.9994798898696899)]

In [53]:
model.most_similar(positive=['war', 'Iran'])

[(u'opposition', 0.9995629787445068),
 (u'Syria', 0.9995148777961731),
 (u"'s", 0.9992072582244873),
 (u'Paris', 0.99920654296875),
 (u'defeat', 0.9990522861480713),
 (u'=', 0.9988870024681091),
 (u'|', 0.9988430738449097),
 (u'https', 0.9988420605659485),
 (u'HumanRights', 0.9988245368003845),
 (u'android', 0.9988110065460205)]

In [54]:
model.most_similar(positive=['Iran'])

[(u'Syria', 0.9981015920639038),
 (u'regime', 0.9979629516601562),
 (u'opposition', 0.9974390268325806),
 (u'Paris', 0.997266411781311),
 (u'News', 0.9971316456794739),
 (u'democratic', 0.9971151351928711),
 (u'France', 0.9970508813858032),
 (u'UK', 0.9969102144241333),
 (u'defeat', 0.9968844652175903),
 (u'StopExecutionsIran', 0.9967482089996338)]

In [None]:
model.most_similar(positive=['war', 'Iraq'])

# Part 3: Comparing `spacy` and `gensim`
Filter tweets to those that mention 'Iran' or similar entities and 'war' or similar entities.
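
One simple way to read "or similar entities" is to expand each query word into a small set of neighbours and keep a tweet only if it contains a word from both sets. The sketch below assumes this interpretation. The sets are hand-copied from a few of the `most_similar` results earlier in this notebook, and which neighbours to keep is a judgment call. `mentions_any` is a hypothetical helper.

```python
def mentions_any(tokens, vocabulary):
    """Return True if at least one token is in the given set of words."""
    return any(tok in vocabulary for tok in tokens)

# Hand-picked neighbour sets (an assumption, not chosen by the model)
iran_like = set(['Iran', 'Syria', 'regime', 'opposition'])
war_like = set(['war', 'Syrian', 'Iraq'])

tweet_a = ['#', 'Iran', '#', 'News', 'Starvation', 'have', 'become',
           'tool', 'of', 'war', 'in', '#', 'Syria']
tweet_b = ['I', 'make', 'Small', 'Tourmaline', 'in', 'Paradise', 'Island', '!']

for tweet in (tweet_a, tweet_b):
    keep = mentions_any(tweet, iran_like) and mentions_any(tweet, war_like)
    print(keep)
```

This keeps the first tweet and drops the second. It is a cruder filter than thresholding on similarity scores, but it is easier to inspect because the accepted words are listed explicitly.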

### 3.a
Using `spacy`

In [55]:
# Using spacy
for i, parsed_tweet in enumerate(parsed_tweets):
    if mentions_country(parsed_tweet, 'Iran') \
    or mentions_country(parsed_tweet, 'Iraq'):
        if 'attack' in get_actions(parsed_tweet) \
        or 'war' in parsed_tweet.text:
            print(parsed_tweet)

RT @f396: 'Iran has a long record in attacking foreign diplomatic missions,' Saudi ... - https://t.co/3gaSRB3osT via https://t.co/UjStGmTT2f

RT @iran_policy: Saleh Hamid: In reality #Iran IRGC Quds Force Gen. Qassem Soleimani is now out of action in #Syria war following his injur

#Iran #News Starvation has become tool of war in #Syria https://t.co/8fvRtxinW2  https://t.co/JSyjMqjtnq

#Iran provoked by U.S. into reacting angrily, then we claim they r evil. Iran not attacked another country for 400  https://t.co/3lTO2hgFPr

RT @Mojahedineng: #Iran #News Starvation has become tool of war in #Syria https://t.co/F0NnT87DMc https://t.co/f3J70v47aL

Iran-Saudi sectarian proxy wars set to explode, Israeli experts say - Middle East - Jerusalem Post

RT @iran_policy: Saleh Hamid: Right now many differences exist between #Iran regime + Russia in #Syria war. Iran feels it gives casualties 

RT @Mojahedineng: #Iran #News EUs foreign policy chief warned Iran on renewed tension with Saudi Arabia ht

### 3.b
Using `gensim`

In [56]:
# Using gensim
for i, split_tweet in enumerate(split_tweets):
    similarity_to_iran = max([model.similarity('Iran', tok) for tok in split_tweet if tok in model.vocab]+[0])
    similarity_to_war = max([model.similarity('war', tok) for tok in split_tweet if tok in model.vocab]+[0])
    if similarity_to_iran > 0.999 and similarity_to_war > 0.999:
        print (similarity_to_iran, similarity_to_war)
        print ' '.join(split_tweet)

(0.99999999999999967, 0.99942548344164106)
RT @f396 : Iran blame America , Britain and ' Zionists ' for Nimr execution - https://t.co/BwXEicgAOA via https://t.co/UjStGmTT2f 

(0.99999999999999967, 0.99942548344164106)
RT @f396 : Saudi Arabia sever diplomatic ties with Iran over embassy fire - https://t.co/r0iZugJa3v via https://t.co/UjStGmTT2f 

(0.99999999999999967, 0.99942548344164106)
RT @f396 : ' Iran have a long record in attack foreign diplomatic missions , ' Saudi ... - https://t.co/3gaSRB3osT via https://t.co/UjStGmTT2f 

(0.99999999999999967, 0.99914102868996546)
Iran : 4 prisoners in Gohadasht Prison begin their second week of hunger Strike https://t.co/qldbF6bv3D # iraq # LeMonde # google 

(0.99999999999999967, 0.99961941183948189)
RT @Donyayeazad : List of women execute under Rouhani regime ! https://t.co/Gr3BjFB5HW   # stopexecutioniran # No2Rouhani # Iran # Syria https:// 

(0.99999999999999967, 0.99942087590630091)
do nt let Rouhani , president of Execution visit 2 Euro

# Part 4: [Bonus] Your Own Analysis
Build your own analysis using the Twitter data above.
Alternatively, collect your own tweets to analyze by following the instructions in `twitter-instructions.md`.