# Exercise 2
    Find all the mentions of world countries in the whole corpus, using the pycountry utility (HINT: remember that there will be different surface forms for the same country in the text, e.g., Switzerland, switzerland, CH, etc.) Perform sentiment analysis on every email message using the demo methods in the nltk.sentiment.util module. Aggregate the polarity information of all the emails by country, and plot a histogram (ordered and colored by polarity level) that summarizes the perception of the different countries. Repeat the aggregation + plotting steps using different demo methods from the sentiment analysis module -- can you find substantial differences?

In [None]:
from os import path
from collections import Counter
import pycountry
import pandas as pd
import matplotlib.pyplot as plt
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
%matplotlib inline

In [None]:
# Read the csv file
print('Reading csv file...')
df = pd.read_csv(path.join('hillary-clinton-emails', 'emails.csv'))
print('is done!')

### Data
For this part, we use only the `ExtractedBodyText` column. Consider an email in which Clinton replies to a message: the raw text contains the whole exchange, whereas the extracted body contains only her reply. Restricting the analysis to the extracted body therefore keeps the sentiment scores tied to Hillary Clinton's own comments.

In [None]:
raw_text = df['ExtractedBodyText'].dropna().reset_index(drop=True)

### Text cleaning
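Cleaning the raw text matters for every later step. The function below lowercases each email, keeps only tokens of at least three word characters, and removes English stop words. One side effect: two-letter country codes such as `ch` or `uk` do not survive this step, so the country search on the cleaned text can only match three-letter codes and names.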

In [None]:
# Build the tokenizer and the stop-word set once instead of on every call
clean_tokenizer = nltk.tokenize.RegexpTokenizer(r'\w{3,}')
stop_words = set(nltk.corpus.stopwords.words('english'))

def text_cleaning(text):
    # Convert all words to lower case
    text = text.lower()
    # Tokenize the text, keeping only words with at least 3 characters
    tokens = clean_tokenizer.tokenize(text)
    # Remove common English stop words
    return ' '.join(token for token in tokens if token not in stop_words)

In [None]:
clean_text = raw_text.map(text_cleaning)

### Sentiment analysis
We first score each cleaned email with NLTK's `SentimentIntensityAnalyzer` and keep the compound polarity score. We then apply two further analyzers adapted from the demo functions in `nltk.sentiment.util`.

In [None]:
sentim_analyzer = SentimentIntensityAnalyzer()
plr_scores = clean_text.map(sentim_analyzer.polarity_scores)
compound_scores = plr_scores.map(lambda x: x['compound'])
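
The compound score is a number between -1 and 1, while the demo-based analyzers return categorical labels. To compare them, the compound score can be turned into a label as well. The sketch below assumes a cut-off of 0.05 on either side of zero, a commonly used convention rather than something fixed by this notebook; `compound_to_label` is a hypothetical helper name and the scores are made-up examples.

```python
def compound_to_label(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse polarity label.

    The threshold is an assumption; adjust it to taste.
    """
    if compound >= threshold:
        return 'Positive'
    if compound <= -threshold:
        return 'Negative'
    return 'Neutral'

# Made-up scores standing in for values from compound_scores
example_scores = [0.72, -0.31, 0.0, 0.04]
example_labels = [compound_to_label(score) for score in example_scores]
print(example_labels)
```

In the notebook this could be applied as `compound_scores.map(compound_to_label)`.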

#### 1. demo_liu_hu_lexicon:
A basic example of sentiment classification using the Liu and Hu opinion lexicon. The function counts the positive and negative words in the text and assigns the polarity that occurs more often; words that do not appear in the lexicon count as neutral, and a tie yields a neutral label. The demo function in NLTK only prints its result, so we define a variant that returns the label instead. The code is adapted from the NLTK sentiment utilities available [here](http://www.nltk.org/_modules/nltk/sentiment/util.html).

In [None]:
from nltk.corpus import opinion_lexicon
from nltk.tokenize import treebank

# Load the word lists once as sets; calling opinion_lexicon.positive()
# for every token, as the demo does, is very slow on a large corpus
POSITIVE_WORDS = set(opinion_lexicon.positive())
NEGATIVE_WORDS = set(opinion_lexicon.negative())
liu_hu_tokenizer = treebank.TreebankWordTokenizer()

def demo_liu_hu_lexicon(sentence):
    pos_words = 0
    neg_words = 0
    for word in liu_hu_tokenizer.tokenize(sentence):
        word = word.lower()
        if word in POSITIVE_WORDS:
            pos_words += 1
        elif word in NEGATIVE_WORDS:
            neg_words += 1
        # Words outside the lexicon are neutral and do not affect the counts

    if pos_words > neg_words:
        return 'Positive'
    if pos_words < neg_words:
        return 'Negative'
    return 'Neutral'

In [None]:
liu_hu_scores = clean_text.map(demo_liu_hu_lexicon)
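
The lexicon-based analyzer returns labels rather than numbers, so the labels need a numeric form before they can be averaged per country. A minimal sketch, assuming a +1 / 0 / -1 encoding; `LABEL_TO_SCORE` and `labels_to_scores` are hypothetical names, not part of the notebook.

```python
# Assumed encoding of the three labels returned by demo_liu_hu_lexicon
LABEL_TO_SCORE = {'Positive': 1, 'Neutral': 0, 'Negative': -1}

def labels_to_scores(labels):
    """Convert an iterable of polarity labels to numeric scores."""
    return [LABEL_TO_SCORE[label] for label in labels]

# Made-up labels standing in for values from liu_hu_scores
example_numeric = labels_to_scores(['Positive', 'Negative', 'Neutral', 'Positive'])
print(example_numeric)
```

With a pandas Series, the same mapping can be written as `liu_hu_scores.map(LABEL_TO_SCORE)`.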

#### 2. demo_sent_subjectivity:
Classifies a text as subjective or objective using a stored `SentimentAnalyzer`. As with the previous demo function, the NLTK version only prints its result, so we define a variant that returns the label. The code is adapted from the NLTK sentiment utilities available [here](http://www.nltk.org/_modules/nltk/sentiment/util.html).

In [None]:
from nltk.classify import NaiveBayesClassifier
from nltk.data import load
from nltk.sentiment.util import demo_subjectivity  # needed for the fallback below
from nltk.tokenize import regexp

# Load (or train) the subjectivity analyzer once rather than once per email
try:
    subj_analyzer = load('sa_subjectivity.pickle')
except LookupError:
    print('Cannot find the sentiment analyzer you want to load.')
    print('Training a new one using NaiveBayesClassifier.')
    subj_analyzer = demo_subjectivity(NaiveBayesClassifier.train, True)

def demo_sent_subjectivity(text):
    word_tokenizer = regexp.WhitespaceTokenizer()
    # Tokenize and convert to lower case
    tokenized_text = [word.lower() for word in word_tokenizer.tokenize(text)]
    return subj_analyzer.classify(tokenized_text)

In [None]:
sent_sub_scores = clean_text.map(demo_sent_subjectivity)

### Countries mentioned in emails
In this section, we find all countries which are mentioned in each email.

In [None]:
countries_dict = {country.alpha_2: [country.alpha_2.lower(),
                                    country.alpha_3.lower(),
                                    country.name.split(",")[0].lower()]
                  for country in pycountry.countries}
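
The comprehension above keeps only the part of each name before the first comma. The sketch below shows the effect on a few hand-written names in the ISO 3166 style; the sample is an illustration typed in by hand, not output from `pycountry`, and `sample_names` is a hypothetical variable. It also shows that two different countries can end up with the same short form.

```python
# Hand-written sample of ISO 3166 style names (assumption: illustrative only)
sample_names = {
    'CH': 'Switzerland',
    'IR': 'Iran, Islamic Republic of',
    'KP': "Korea, Democratic People's Republic of",
    'KR': 'Korea, Republic of',
}

# Same transformation as in countries_dict: text before the comma, lowercased
short_forms = {code: name.split(',')[0].lower()
               for code, name in sample_names.items()}
print(short_forms)

# Group codes by short form to spot names that are no longer unique
codes_by_form = {}
for code, form in short_forms.items():
    codes_by_form.setdefault(form, []).append(code)
collisions = {form: codes for form, codes in codes_by_form.items()
              if len(codes) > 1}
print(collisions)
```

A shared short form means a single mention is attributed to every country that carries it.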

Some country names are rarely written in their official form, so we add the common short forms and variants to the dictionary.

In [None]:
countries_dict['GB'].extend(['uk', 'united kingdom', 'great britain'])
countries_dict['US'].extend(['u.s.', 'u.s.a'])
countries_dict['RU'].append('russia')
countries_dict['KP'].append('north korea')
countries_dict['KR'].append('south korea')
countries_dict['SY'].append('syria')

Some alpha_2 and alpha_3 codes coincide with common English words or email jargon (e.g., are, pm, can). We exclude all of these from the search.

In [None]:
# Note the comma after 'co': without it, Python joins 'co' and 'ie' into 'coie'
excluded_words = ['am', 'as', 'at', 'bf', 'cc', 'cv', 'ee', 'eh', 'gf', 'gg', 'id', 'co',
                  'ie', 'im', 'in', 'is', 'it', 'no', 'np', 'pm', 'tf', 'to', 'us',
                  'arm', 'can', 'com', 'col', 'mac', 'and', 'are', 'ago']

In [None]:
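def search_for_countries(text):
    words = set(text.split())  # split the email once, not once per country
    result = []
    for key, values in countries_dict.items():
        for value in values:
            if len(value.split()) == 1:
                # Single-word forms must match a whole token and not be excluded
                found = value in words and value not in excluded_words
            else:
                # Multi-word forms are matched as substrings of the email
                found = value in text
            if found:
                result.append(key)
                break
    return result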

In [None]:
countries_lst = clean_text.map(search_for_countries)
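
Multi-word names are matched as plain substrings, which also accepts partial words at either end of the phrase. Whether that is desirable depends on the case, so the sketch below shows a whole-word alternative based on regular expressions. `contains_phrase` is a hypothetical helper, not part of the notebook, and the example text is made up.

```python
import re

def contains_phrase(phrase, text):
    """Return True if phrase occurs in text delimited by word boundaries."""
    pattern = r'\b' + re.escape(phrase) + r'\b'
    return re.search(pattern, text) is not None

example_text = 'meeting south korean delegation'

# Plain substring test, as used for multi-word names in search_for_countries
substring_hit = 'south korea' in example_text
# Whole-word test: 'korean' is a different token than 'korea'
whole_word_hit = contains_phrase('south korea', example_text)

print(substring_hit, whole_word_hit)
```

Here the substring test arguably gives the more useful answer, since "south korean" still refers to the country; the whole-word test is the stricter option when substring matches produce false positives.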

### Plot the results!

In [None]:
# remaining
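
The plotting cell above is left as a placeholder. One possible way to carry out the aggregation and plotting asked for in the exercise is sketched below. It assumes that the polarity of a country is the mean score of all emails mentioning it, and that scores lie in [-1, 1]; `mean_polarity_by_country` and `plot_country_polarity` are hypothetical helper names, and the toy data is made up.

```python
from collections import defaultdict

def mean_polarity_by_country(country_lists, scores):
    """Average the score of every email over the countries it mentions.

    Returns a dict of country code -> mean score, ordered by ascending score.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for countries, score in zip(country_lists, scores):
        for code in countries:
            totals[code] += score
            counts[code] += 1
    means = {code: totals[code] / counts[code] for code in totals}
    return dict(sorted(means.items(), key=lambda item: item[1]))

def plot_country_polarity(means):
    """Bar chart ordered and colored by polarity (assumes scores in [-1, 1])."""
    import matplotlib.pyplot as plt  # imported here so aggregation works without it

    codes = list(means)
    values = list(means.values())
    # Rescale [-1, 1] to [0, 1] for the red-yellow-green colormap
    colors = plt.get_cmap('RdYlGn')([(value + 1) / 2 for value in values])
    fig, ax = plt.subplots(figsize=(12, 4))
    ax.bar(codes, values, color=colors)
    ax.set_xlabel('Country')
    ax.set_ylabel('Mean polarity')
    ax.tick_params(axis='x', rotation=90)
    plt.show()

# Toy data: three emails, the last one mentions no country
toy_means = mean_polarity_by_country([['US', 'FR'], ['US'], []], [0.5, -0.1, 0.9])
print(toy_means)
```

With the notebook's variables, the call would be `plot_country_polarity(mean_polarity_by_country(countries_lst, compound_scores))`. Repeating it with a numeric encoding of the other analyzers' labels gives the comparison the exercise asks for.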