# In Class 11
## Author: Quentin Smith
## INET 4061

## Date Due: 4/12/20


### Introduction:

Complete steps 1-7 from https://www.digitalocean.com/community/tutorials/how-to-perform-sentiment-analysis-in-python-3-using-the-natural-language-toolkit-nltk

Then answer questions based on steps 3, 4, 5, and 7. 

## Step 1: Installing NLTK and Downloading the Data

In [None]:
#------Install nltk if not already downloaded-----
#pip install nltk==3.3

import nltk

#------get data and store locally-----
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\hv2486co\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

## Step 2: Tokenizing the Data

In [None]:
#------- punkt is a pre-trained model that helps tokenize words and sentences ------
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hv2486co\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
from nltk.corpus import twitter_samples

#------Create variables for positive, negative, and text-------
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
text = twitter_samples.strings('tweets.20150430-223406.json')

#--------tokenized() returns each positive tweet as a list of tokens--------
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')

print(tweet_tokens[0])

['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':)']


## Step 3. Normalize the Data

In [None]:
# wordnet is a lexical database for the english language that helps the script determine the base word
nltk.download('wordnet')

# resource to determine the context of a word in a sentence
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\hv2486co\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\hv2486co\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [None]:
#full list of tags https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
from nltk.tag import pos_tag
from nltk.corpus import twitter_samples

#tokenizes and tags tweets
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
print(pos_tag(tweet_tokens[1]))

[('@Lamb2ja', 'NN'), ('Hey', 'NNP'), ('James', 'NNP'), ('!', '.'), ('How', 'NNP'), ('odd', 'JJ'), (':/', 'NNP'), ('Please', 'NNP'), ('call', 'VB'), ('our', 'PRP$'), ('Contact', 'NNP'), ('Centre', 'NNP'), ('on', 'IN'), ('02392441234', 'CD'), ('and', 'CC'), ('we', 'PRP'), ('will', 'MD'), ('be', 'VB'), ('able', 'JJ'), ('to', 'TO'), ('assist', 'VB'), ('you', 'PRP'), (':)', 'VBP'), ('Many', 'JJ'), ('thanks', 'NNS'), ('!', '.')]


In [None]:
from nltk.tag import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer

#The function lemmatize_sentence gets the part-of-speech tag of each token in a tweet, maps it to a WordNet pos value, and lemmatizes the token accordingly
def lemmatize_sentence(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = []
    for word, tag in pos_tag(tokens):
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_sentence
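
The tag-to-pos rule inside `lemmatize_sentence` can be pulled out into a small helper so it is easy to check on its own. This is only a minimal sketch: `penn_to_wordnet_pos` is a hypothetical name, not part of NLTK, and it mirrors the notebook's choice to treat anything that is not a noun or verb as an adjective.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to the single-letter pos used in lemmatize_sentence."""
    if tag.startswith('NN'):
        return 'n'   # nouns: NN, NNS, NNP, NNPS
    if tag.startswith('VB'):
        return 'v'   # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    return 'a'       # everything else falls back to adjective, as in the notebook


for tag in ['NNP', 'VBG', 'JJ', 'IN']:
    print(tag, '->', penn_to_wordnet_pos(tag))
```

Note that a preposition such as `IN` also maps to `'a'` under this rule, which is a simplification rather than a real part-of-speech match.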

## Step 4: Remove Noise from the Data

In [None]:
#Noise includes hyperlinks, twitter handles in replies, and punctuation and special characters

import re, string

def remove_noise(tweet_tokens, stop_words = ()):

    cleaned_tokens = []
    # Create the lemmatizer once instead of once per token
    lemmatizer = WordNetLemmatizer()

    for token, tag in pos_tag(tweet_tokens):
        # Strip hyperlinks (raw strings avoid invalid-escape warnings for \( and \))
        token = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
                       r'(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', token)
        # Strip Twitter handles
        token = re.sub(r"(@[A-Za-z0-9_]+)", "", token)

        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        token = lemmatizer.lemmatize(token, pos)

        # Keep only non-empty tokens that are neither punctuation nor stop words
        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens
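
The two regular-expression substitutions in `remove_noise` can be tried in isolation, without NLTK. The sketch below reuses the same two patterns; `strip_links_and_handles` is a hypothetical helper name and the sample tokens (including the example.com URL) are made up for illustration.

```python
import re

# Same patterns as in remove_noise, written as raw strings
URL_PATTERN = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
    r'(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)
HANDLE_PATTERN = re.compile(r'@[A-Za-z0-9_]+')


def strip_links_and_handles(tokens):
    """Remove hyperlinks and Twitter handles, dropping tokens that end up empty."""
    cleaned = []
    for token in tokens:
        token = URL_PATTERN.sub('', token)
        token = HANDLE_PATTERN.sub('', token)
        if token:
            cleaned.append(token)
    return cleaned


sample = ['@France_Inte', 'check', 'https://example.com/page', ':)']
print(strip_links_and_handles(sample))
```

Only `check` and `:)` survive here, because the handle and the link are each replaced by an empty string and then dropped.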

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hv2486co\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

#print(remove_noise(tweet_tokens[0], stop_words))

positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

#clean both pos and neg tweet list
positive_cleaned_tokens_list = []
negative_cleaned_tokens_list = []


for tokens in positive_tweet_tokens:
    positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

for tokens in negative_tweet_tokens:
    negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

In [None]:
#---- compare the original and cleaned versions of the tweet at index 500----

#print(positive_tweet_tokens[500])
#print(positive_cleaned_tokens_list[500])

## Step 5. Determining Word Density

In [None]:
#---- generator that takes a list of tokenized tweets and yields every token from all of the tweets in turn
def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

all_pos_words = get_all_words(positive_cleaned_tokens_list)
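
Because `get_all_words` is a generator, its result can be consumed only once. The sketch below shows this with toy data and `collections.Counter` (which NLTK's `FreqDist` is built on); the toy tweets are hypothetical and stand in for `positive_cleaned_tokens_list`.

```python
from collections import Counter


def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token


# Hypothetical cleaned tweets, for illustration only
toy_tweets = [['thanks', ':)'], ['love', 'thanks']]

words = get_all_words(toy_tweets)
counts = Counter(words)    # counting consumes the generator
leftover = list(words)     # nothing is left on a second pass

print(counts.most_common(1))
print(leftover)
```

If the words are needed again, calling `get_all_words` a second time (or storing `list(words)` up front) avoids an unexpectedly empty result.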

In [None]:
from nltk import FreqDist

#find frequency of top ten words
freq_dist_pos = FreqDist(all_pos_words)
print(freq_dist_pos.most_common(10))

[(':)', 3691), (':-)', 701), (':d', 658), ('thanks', 388), ('follow', 357), ('love', 333), ('...', 290), ('good', 283), ('get', 263), ('thank', 253)]


## Step 6. Preparing Data for the Model

In [None]:
#---- convert tweets from a list of cleaned tokens to dictionaries with the tokens as keys and True as values
def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)

positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)
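
To see what the classifier actually receives, the conversion can be run on a couple of toy tweets. This is a small sketch with made-up token lists; the function itself is the one defined in this step.

```python
def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)


# Hypothetical cleaned tweets, for illustration only
toy_tweets = [['top', 'engage', 'member', ':)'], ['sad', 'sad', ':(']]

features = list(get_tweets_for_model(toy_tweets))
print(features[0])
print(features[1])
```

Each tweet becomes a "bag of words" dictionary, so a repeated token such as `sad` appears only once and word order is lost.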

In [None]:
import random


#attach a positive or negative label to each tweet

positive_dataset = [(tweet_dict, "Positive")
                     for tweet_dict in positive_tokens_for_model]

negative_dataset = [(tweet_dict, "Negative")
                     for tweet_dict in negative_tokens_for_model]

#creates dataset by joining the positive and negative datasets
dataset = positive_dataset + negative_dataset


random.shuffle(dataset)

train_data = dataset[:7000]
test_data = dataset[7000:]
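
`random.shuffle` gives a different split on every run, so the accuracy can change slightly between runs. One way to make the split repeatable is to shuffle with a seeded generator. This is a sketch under assumptions: `shuffle_and_split` is a hypothetical helper, and the toy dataset stands in for the real one (where a cut at 7000 would be a 70/30 split if the combined dataset holds 10,000 tweets).

```python
import random


def shuffle_and_split(dataset, train_size, seed=None):
    """Return (train, test) after shuffling a copy of dataset with a seeded RNG."""
    rng = random.Random(seed)
    shuffled = list(dataset)      # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


# Hypothetical labeled data in the same (features, label) shape as the notebook
toy = [({'w%d' % i: True}, 'Positive' if i % 2 == 0 else 'Negative')
       for i in range(10)]

train, test = shuffle_and_split(toy, train_size=7, seed=0)
print(len(train), len(test))
```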

## Step 7. Building and Testing the Model

In [None]:
from nltk import classify
from nltk import NaiveBayesClassifier

classifier = NaiveBayesClassifier.train(train_data)

print("Accuracy is:", classify.accuracy(classifier, test_data))

# show_most_informative_features prints its table and returns None, so it is not wrapped in print()
classifier.show_most_informative_features(10)

Accuracy is: 0.9946666666666667
Most Informative Features
                      :( = True           Negati : Positi =   2079.7 : 1.0
                      :) = True           Positi : Negati =    981.3 : 1.0
                     sad = True           Negati : Positi =     55.4 : 1.0
                follower = True           Positi : Negati =     20.4 : 1.0
                    glad = True           Positi : Negati =     19.5 : 1.0
                     x15 = True           Negati : Positi =     19.1 : 1.0
                    blog = True           Positi : Negati =     17.5 : 1.0
               community = True           Positi : Negati =     17.5 : 1.0
                followed = True           Negati : Positi =     13.5 : 1.0
                 perfect = True           Positi : Negati =     12.9 : 1.0


In [None]:
from nltk.tokenize import word_tokenize

# test custom tweets
custom_tweet = "I ordered just once from TerribleCo, they screwed up, never used the app again."

custom_tokens = remove_noise(word_tokenize(custom_tweet))

print(classifier.classify(dict([token, True] for token in custom_tokens)))

Positive


In [None]:
custom_tweet = 'Congrats #SportStar on your 7th best goal from last season winning goal of the year :) #Baller #Topbin #oneofmanyworldies'

custom_tokens = remove_noise(word_tokenize(custom_tweet))

print(classifier.classify(dict([token, True] for token in custom_tokens)))

Positive


In [None]:
custom_tweet = 'Thank you for sending my baggage to CityX and flying me to CityY at the same time. Brilliant service. #thanksGenericAirline'

custom_tokens = remove_noise(word_tokenize(custom_tweet))

print(classifier.classify(dict([token, True] for token in custom_tokens)))

Positive


# Question 1

Step 3
1. List 10 proper nouns that exist in the negative data set.

In [None]:
#full list of tags https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
from nltk.tag import pos_tag
from nltk.corpus import twitter_samples
from nltk.stem.wordnet import WordNetLemmatizer

#tokenizes and tags tweets
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
#print(pos_tag(tweet_tokens[1]))



#Redefines lemmatize_sentence to print each token tagged as a proper noun (NNP or NNPS); this version returns None, so printing its result shows None
def lemmatize_sentence(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = []
    for word, tag in pos_tag(tokens):
        if tag.startswith('NNP'):
            print('Proper Noun', word)

print(pos_tag(tweet_tokens[0]))
print(lemmatize_sentence(tweet_tokens[0]))

print(pos_tag(tweet_tokens[1]))
print(lemmatize_sentence(tweet_tokens[1]))
    

[('#FollowFriday', 'JJ'), ('@France_Inte', 'NNP'), ('@PKuchly57', 'NNP'), ('@Milipol_Paris', 'NNP'), ('for', 'IN'), ('being', 'VBG'), ('top', 'JJ'), ('engaged', 'VBN'), ('members', 'NNS'), ('in', 'IN'), ('my', 'PRP$'), ('community', 'NN'), ('this', 'DT'), ('week', 'NN'), (':)', 'NN')]
Proper Noun @France_Inte
Proper Noun @PKuchly57
Proper Noun @Milipol_Paris
None
[('@Lamb2ja', 'NN'), ('Hey', 'NNP'), ('James', 'NNP'), ('!', '.'), ('How', 'NNP'), ('odd', 'JJ'), (':/', 'NNP'), ('Please', 'NNP'), ('call', 'VB'), ('our', 'PRP$'), ('Contact', 'NNP'), ('Centre', 'NNP'), ('on', 'IN'), ('02392441234', 'CD'), ('and', 'CC'), ('we', 'PRP'), ('will', 'MD'), ('be', 'VB'), ('able', 'JJ'), ('to', 'TO'), ('assist', 'VB'), ('you', 'PRP'), (':)', 'VBP'), ('Many', 'JJ'), ('thanks', 'NNS'), ('!', '.')]
Proper Noun Hey
Proper Noun James
Proper Noun How
Proper Noun :/
Proper Noun Please
Proper Noun Contact
Proper Noun Centre
None


## Ten Proper Nouns (tagged NNP in the first two positive tweets)
1. @France_Inte
2. @PKuchly57
3. @Milipol_Paris
4. Hey
5. James
6. How
7. :/
8. Please
9. Contact
10. Centre
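
Collecting the proper nouns can also be done with a function that returns a list instead of printing. The sketch below is illustrative only: `proper_nouns` is a hypothetical helper, and the tagged tokens are copied from part of the `pos_tag` output shown above so the example runs without NLTK. For the negative data set, the tagged list would instead come from running `pos_tag` over tokens from `negative_tweets.json`.

```python
def proper_nouns(tagged_tokens):
    """Return the words whose tag marks a proper noun (NNP or NNPS)."""
    return [word for word, tag in tagged_tokens if tag.startswith('NNP')]


# Part of the tagged first tweet, as printed by pos_tag above
tagged = [('#FollowFriday', 'JJ'), ('@France_Inte', 'NNP'),
          ('@PKuchly57', 'NNP'), ('@Milipol_Paris', 'NNP'),
          ('for', 'IN'), ('community', 'NN'), (':)', 'NN')]

print(proper_nouns(tagged))
```

The tagger's labels are taken at face value here, so tokens such as handles or capitalized ordinary words can still show up as "proper nouns".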

# Question 2

Step 4
2. Pick a random tweet (tweet_tokens[PICK A NUMBER]) and explain the difference you see in the output between the lemmatize_sentence function and the remove_noise function


In [None]:
#---- compare the output versions of the first tweet in the list (index 0)----
print('Original Output: ', tweet_tokens[0])

print('\nLemmatized output', lemmatize_sentence(tweet_tokens[0]))

print('\nRemove Noise Output ',remove_noise(tweet_tokens[0], stop_words))


Original Output:  ['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':)']
Proper Noun @France_Inte
Proper Noun @PKuchly57
Proper Noun @Milipol_Paris

Lemmatized output None

Remove Noise Output  ['#followfriday', 'top', 'engage', 'member', 'community', 'week', ':)']



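Lemmatization normalizes each word to its base form; in our case that mostly means removing affixes. Some examples from the original output are being --> be, engaged --> engage, members --> member. (The lemmatized output above shows None because lemmatize_sentence was redefined in Question 1 to print proper nouns, so it no longer returns a list.)

The remove_noise function lemmatizes in the same way. In addition, it strips hyperlinks and Twitter handles by replacing them with an empty string, drops tokens that are empty, punctuation, or stop words, and lowercases what is left.

So going from lemmatize_sentence to remove_noise, we see that the Twitter handles are gone (@France_Inte, @PKuchly57, @Milipol_Paris), the stop words are removed (for, being, in, my, this), and the remaining tokens are lowercased (#FollowFriday --> #followfriday). The emoticon :) is kept because the token as a whole does not appear in string.punctuation.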
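
The punctuation check in `remove_noise` is a substring test against `string.punctuation`, which explains why some symbol tokens are dropped and others are kept. A minimal sketch, with `is_dropped_as_punctuation` as a hypothetical helper name:

```python
import string


def is_dropped_as_punctuation(token):
    # Same test remove_noise uses: is the token a substring of string.punctuation?
    return token in string.punctuation


for token in ['!', ':', ':)', '...', '#followfriday']:
    print(repr(token), is_dropped_as_punctuation(token))
```

Single characters such as `!` and `:` are dropped, while multi-character tokens such as `:)` and `...` pass the check and stay in the cleaned output.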

# Question 3

Step 5
3. What are the top 10 most negative words by frequency?

In [None]:
def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

all_neg_words = get_all_words(negative_cleaned_tokens_list)

#find frequency of top ten words
freq_dist_neg = FreqDist(all_neg_words)
print(freq_dist_neg.most_common(10))

[(':(', 4585), (':-(', 501), ("i'm", 343), ('...', 332), ('get', 325), ('miss', 291), ('go', 275), ('please', 275), ('want', 246), ('like', 218)]


# Question 4

Step 7
4. Is “community” a positive or negative word under informative features?

In [None]:
# show_most_informative_features prints its table and returns None, so it is not wrapped in print()
classifier.show_most_informative_features(10)

Most Informative Features
                      :( = True           Negati : Positi =   2079.7 : 1.0
                      :) = True           Positi : Negati =    981.3 : 1.0
                     sad = True           Negati : Positi =     55.4 : 1.0
                follower = True           Positi : Negati =     20.4 : 1.0
                    glad = True           Positi : Negati =     19.5 : 1.0
                     x15 = True           Negati : Positi =     19.1 : 1.0
                    blog = True           Positi : Negati =     17.5 : 1.0
               community = True           Positi : Negati =     17.5 : 1.0
                followed = True           Negati : Positi =     13.5 : 1.0
                 perfect = True           Positi : Negati =     12.9 : 1.0


"Community" is a positive word: it is associated with positive tweets at a ratio of 17.5 : 1.0.
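
A ratio in the informative-features table compares how often a token shows up in one class versus the other. The sketch below illustrates the idea with an unsmoothed ratio and made-up counts; `likelihood_ratio` is a hypothetical helper, and NLTK's own computation smooths the probabilities, so its numbers will not match this simplified version exactly.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b):
    """How many times more often a feature appears in class A than in class B."""
    rate_a = count_a / total_a
    rate_b = count_b / total_b
    return rate_a / rate_b


# Hypothetical counts, not taken from the notebook's data:
# the token appears in 35 of 3500 positive tweets and 2 of 3500 negative tweets
ratio = likelihood_ratio(count_a=35, total_a=3500, count_b=2, total_b=3500)
print(round(ratio, 1))
```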

# Question 5

5. Go to Twitter (assuming you have it, otherwise you can google it).  Find a tweet from a famous person of your choice and copy it into the custom_tweet object and run your classifier.  Was it positive or negative?

In [None]:
# From Jaden Smith


# test custom tweets
custom_tweet = "How Can Mirrors Be Real If Our Eyes Aren't Real"

custom_tokens = remove_noise(word_tokenize(custom_tweet))

print(classifier.classify(dict([token, True] for token in custom_tokens)))

Negative


# Question 6
6. What did you learn from this exercise?

Here is a list of things that I learned:
1. There is a lot of data preprocessing that goes along with text analysis. 
2. The process follows logical steps and relies on lexical resources, such as WordNet and a stop-word list, to cross-reference words.
3. Just because the classifier says a tweet is negative doesn't mean it is actually a negative sentiment. Case in point the famous tweet I used, in my mind, does not have a negative sentiment. It is more neutral when I read it. 
4. Sentiment analysis is still subjective and could be improved.

# Bonus:

In [None]:
from nltk import classify
from nltk import DecisionTreeClassifier

classif = DecisionTreeClassifier.train(train_data)

print("Accuracy is:", classify.accuracy(classif, test_data))



Accuracy is: 0.995


Decision Tree Classifier Accuracy is: 0.995
Naive Bayes Classifier Accuracy is: 0.9946666666666667

It looks like they have nearly the same accuracy (0.995 versus about 0.9947). Hopefully I ran the Decision Tree Classifier correctly. What did differ is the time it took to run each classifier: Naive Bayes takes less than a minute or so, while Decision Tree takes around 20 minutes.
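
The run times above are rough estimates. A small timing wrapper would give measured numbers instead. This is a sketch only: `time_call` is a hypothetical helper, and it is demonstrated on the built-in `sum` so it runs without NLTK. In the notebook it could wrap a call such as `NaiveBayesClassifier.train` with `train_data` as the argument.

```python
import time


def time_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload for illustration only
result, seconds = time_call(sum, range(1_000_000))
print(f"sum finished in {seconds:.4f} s")
```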