# Sentiment Analysis

Sentiment analysis is the process of analyzing natural-language text and predicting the sentiment it expresses: positive, negative, or neutral.

nltk (Natural Language Toolkit) is a platform for writing Python programs that analyze natural language. It provides modules for tokenizing, stemming, tagging, parsing, and more. nltk also ships a set of sample tweets that you can use for analysis.

In [29]:
import nltk
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\saksh\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [30]:
from nltk.corpus import twitter_samples
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
text = twitter_samples.strings('tweets.20150430-223406.json')

In [31]:
print(positive_tweets[0])
print()
print(text[0])

#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

RT @KirkKus: Indirect cost of the UK being in the EU is estimated to be costing Britain £170 billion per year! #BetterOffOut #UKIP


# Tokenizing: 
Tokenizing is the process of separating text into individual tokens (words, emoticons, links, @mentions, etc.). It makes it possible to classify each word and work out its meaning.

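punkt is a pre-trained tokenizer model that ships with nltk. word_tokenize, which is used later to tokenize custom text, depends on it. The twitter_samples corpus also provides its own tokenized() method, which the next cells use.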
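
To make the idea concrete, here is a minimal sketch of a tweet-aware tokenizer that uses only Python's re module. It is not how nltk tokenizes: the pattern and the name `simple_tweet_tokenize` are assumptions chosen for illustration, and a real tokenizer handles many more cases.

```python
import re

# Hypothetical, simplified pattern (not nltk's). Alternatives are tried left to right.
TOKEN_PATTERN = re.compile(
    r"https?://\S+"            # links
    r"|[@#]\w+"                # @mentions and #hashtags
    r"|[:;=][-']?[)(DPp/\\|]"  # a few common emoticons such as :) :-( :D
    r"|\w+(?:'\w+)?"           # words, optionally with an apostrophe (e.g. i'm)
    r"|[^\w\s]"                # any other single punctuation mark
)

def simple_tweet_tokenize(text):
    """Split a tweet into links, mentions, hashtags, emoticons, words and punctuation."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tweet_tokenize(
    "Dang that is some rad @AbzuGame #fanart ! :D https://t.co/bI8k8tb9ht"
)
print(tokens)
```

Putting the link and emoticon alternatives first keeps them from being broken up by the more general word and punctuation rules.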

In [32]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\saksh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [33]:
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')[0]
print(tweet_tokens)

['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':)']


# Normalization: 
Normalization is the process of converting a word into its canonical form, so that words with the same meaning but different forms are grouped together.
Example: run, running, ran.

There are two popular methods of performing normalization in NLP:
1. Stemming
2. Lemmatization

Stemming: 
wikipedia - stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form.
Stemming normalizes text by stripping the affixes from a word, without checking that what remains is a real word.

Lemmatization:
wikipedia - Lemmatisation in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item.
Lemmatization is similar to stemming, but it takes the context of a word (such as its part of speech) into account and returns a dictionary form, the lemma.
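
The difference between the two approaches can be sketched with a toy example. The helpers `naive_stem` and `naive_lemmatize`, the suffix list, and the lookup table are all hypothetical and written only for illustration; nltk's stemmers and WordNetLemmatizer are far more thorough.

```python
# Hypothetical helpers for illustration only, not nltk's implementation.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# A tiny hand-written lookup table standing in for a real dictionary such as WordNet.
LEMMAS = {"running": "run", "runs": "run", "ran": "run"}

def naive_lemmatize(word):
    """Return the dictionary form if the word is known, else the word unchanged."""
    return LEMMAS.get(word, word)

for word in ("run", "runs", "running", "ran"):
    print(word, "->", naive_stem(word), "|", naive_lemmatize(word))
```

The suffix stripper turns "running" into "runn" and leaves the irregular form "ran" untouched, while the dictionary lookup maps every form to "run".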


WordNet: WordNet is a lexical database of English that is widely used in NLP. The lemmatizer uses it to determine the base form of a word.

averaged_perceptron_tagger: A pre-trained model that assigns a part-of-speech tag to each token in a text. The tags are used to determine the role of a word in a sentence.

In [34]:
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\saksh\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\saksh\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

Tagging: Tagging is the process of assigning a part of speech to each word.
List of all tags used in pos_tag: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

pos_tag: A function that determines the part of speech of each word from its context in the sentence and tags it accordingly.

In [35]:
from nltk.tag import pos_tag
print(pos_tag(tweet_tokens))

[('#FollowFriday', 'JJ'), ('@France_Inte', 'NNP'), ('@PKuchly57', 'NNP'), ('@Milipol_Paris', 'NNP'), ('for', 'IN'), ('being', 'VBG'), ('top', 'JJ'), ('engaged', 'VBN'), ('members', 'NNS'), ('in', 'IN'), ('my', 'PRP$'), ('community', 'NN'), ('this', 'DT'), ('week', 'NN'), (':)', 'NN')]
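
The tags above follow the Penn Treebank scheme, whereas the WordNet lemmatizer expects a single letter such as 'n', 'v' or 'a'. The sketch below shows the simple mapping this notebook relies on when it lemmatizes; the function name `penn_to_wordnet` is hypothetical. WordNet also has 'r' for adverbs, which this simplified mapping ignores.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter (simplified)."""
    if tag.startswith("NN"):
        return "n"  # nouns: NN, NNS, NNP, NNPS
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    return "a"      # everything else is treated as an adjective

# A few (word, tag) pairs taken from the pos_tag output above.
tagged = [("being", "VBG"), ("top", "JJ"), ("engaged", "VBN"), ("members", "NNS")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```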


In [36]:
from nltk.stem.wordnet import WordNetLemmatizer

def lemmatize_sentence(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = []
    for word, tag in pos_tag(tokens):
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_sentence
print(lemmatize_sentence(tweet_tokens))

['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'be', 'top', 'engage', 'member', 'in', 'my', 'community', 'this', 'week', ':)']


# Removing Noise:

Noise is anything in a text that adds no meaning to it, so it is better to remove it before analysis.
To remove noise we will use re, Python's regular expression module. Documentation: https://docs.python.org/3/library/re.html
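
To see what the filtering does in isolation, here is a small sketch that applies the same two regular expressions and the punctuation check to a hand-written token list. It leaves out lemmatization and stop words on purpose, so it needs nothing beyond the standard library.

```python
import re
import string

tokens = ['Dang', 'that', 'is', 'some', 'rad', '@AbzuGame', '#fanart', '!', ':D',
          'https://t.co/bI8k8tb9ht']

cleaned = []
for token in tokens:
    # A token that is a hyperlink becomes an empty string.
    token = re.sub(r'^https?:\/\/.*[\r\n]*', '', token)
    # An @mention becomes an empty string as well.
    token = re.sub(r'(@[A-Za-z0-9_]+)', '', token)
    # Drop empty tokens and bare punctuation marks.
    if len(token) > 0 and token not in string.punctuation:
        cleaned.append(token)

print(cleaned)
```

The length check has to come first: an empty string counts as being "in" string.punctuation, so without it the link and the mention would only be dropped by accident.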

In [37]:
import re, string

def remove_noise(tweet_tokens, stop_words=()):
    # Create the lemmatizer once instead of once per token.
    lemmatizer = WordNetLemmatizer()
    cleaned_tokens = []

    for token, tag in pos_tag(tweet_tokens):
        # Remove hyperlinks.
        token = re.sub(r'^https?:\/\/.*[\r\n]*', '', token)
        # Remove @mentions.
        token = re.sub(r'(@[A-Za-z0-9_]+)', '', token)

        # Map the Penn Treebank tag to a WordNet part of speech.
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        token = lemmatizer.lemmatize(token, pos)

        # Keep the token only if it is non-empty, not punctuation, and not a stop word.
        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

In [38]:
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
print(remove_noise(tweet_tokens, stop_words))

['#followfriday', 'top', 'engage', 'member', 'community', 'week', ':)']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\saksh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [39]:
positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

positive_cleaned_token_list = []
negative_cleaned_token_list = []

for i in positive_tweet_tokens:
    positive_cleaned_token_list.append(remove_noise(i, stop_words))
for j in negative_tweet_tokens:
    negative_cleaned_token_list.append(remove_noise(j, stop_words))
    
# Comparison between the original and cleaned tweets.
print(positive_tweet_tokens[500], positive_cleaned_token_list[500])

['Dang', 'that', 'is', 'some', 'rad', '@AbzuGame', '#fanart', '!', ':D', 'https://t.co/bI8k8tb9ht'] ['dang', 'rad', '#fanart', ':d']


# Determine Word Density:

Here we find the frequency of each word across all the tweets.

In [40]:
def compile_all_words(cleaned_token_list):
    for tweet in cleaned_token_list:
        for token in tweet:
            yield token
# yield turns this function into a generator: it produces one token at a time instead of building and returning a whole list.
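
One consequence of using a generator is worth a quick sketch: it can be consumed only once. The function name `flatten` and the sample data are made up for illustration.

```python
def flatten(token_lists):
    """Yield every token from every tweet, one at a time."""
    for tweet in token_lists:
        for token in tweet:
            yield token

tweets = [["dang", "rad"], ["top", "engage", "member"]]
words = flatten(tweets)

first_pass = list(words)   # consumes the generator
second_pass = list(words)  # nothing is left to yield

print(first_pass)
print(second_pass)
```

So if the flattened words are needed a second time, call the generator function again rather than reusing the exhausted object.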

In [41]:
positive_words = compile_all_words(positive_cleaned_token_list)

In [42]:
from nltk import FreqDist
freq_pos_words = FreqDist(positive_words)
print(freq_pos_words.most_common(20))
# Note: sorted(freq_pos_words) would order the tokens alphabetically, not by frequency,
# so most_common() is the right call here.

[(':)', 3691), (':-)', 701), (':d', 658), ('thanks', 388), ('follow', 357), ('love', 333), ('...', 290), ('good', 283), ('get', 263), ('thank', 253), ('u', 245), ('day', 242), ('like', 229), ('see', 195), ('happy', 192), ("i'm", 183), ('great', 175), ('hi', 173), ('go', 167), ('back', 163)]
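
nltk's FreqDist subclasses Python's collections.Counter, so the same counting can be sketched with the standard library alone. The token lists here are invented for illustration.

```python
from collections import Counter

cleaned_tweets = [[":)", "thanks", "follow"], ["love", ":)"], ["thanks", ":)"]]

# Count every token across all tweets.
counts = Counter(token for tweet in cleaned_tweets for token in tweet)

print(counts.most_common(2))
```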


# Data Transformation:
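
nltk's NaiveBayesClassifier.train expects a list of (feature dictionary, label) pairs, so each cleaned tweet has to be turned into a dictionary before training. The sketch below shows that shape on invented data; `to_features`, the sample tweets, and the fixed seed are assumptions for illustration, and the 70/30 proportion mirrors the 7000/3000 split used in this notebook.

```python
import random

def to_features(tokens):
    """Turn a token list into the {token: True} dictionary the classifier expects."""
    return {token: True for token in tokens}

positive = [["love", ":)"], ["great", "day"]]
negative = [["sad", ":("], ["miss", "you"]]

dataset = ([(to_features(t), "Positive") for t in positive]
           + [(to_features(t), "Negative") for t in negative])

random.seed(0)           # assumption: fixed seed so the shuffle is repeatable
random.shuffle(dataset)  # mix the labels before splitting

split = int(len(dataset) * 0.7)
train_data, test_data = dataset[:split], dataset[split:]
print(len(train_data), len(test_data))
```

Shuffling before the split matters because the dataset is built by concatenating all positive samples followed by all negative ones.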


In [43]:
import random
def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)

positive_tokens_for_model = get_tweets_for_model(positive_cleaned_token_list)
negative_tokens_for_model = get_tweets_for_model(negative_cleaned_token_list)

positive_dataset = [(i, "Positive") for i in positive_tokens_for_model]

negative_dataset = [(j, "Negative") for j in negative_tokens_for_model]

dataset = positive_dataset + negative_dataset

random.shuffle(dataset)

train_data = dataset[:7000]
test_data = dataset[7000:]

from nltk import classify
from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_data)

print("Accuracy is:", classify.accuracy(classifier, test_data))

classifier.show_most_informative_features(10)  # prints the table itself and returns None

Accuracy is: 0.9966666666666667
Most Informative Features
                      :( = True           Negati : Positi =   2058.0 : 1.0
                      :) = True           Positi : Negati =   1660.5 : 1.0
                follower = True           Positi : Negati =     25.2 : 1.0
                     bam = True           Positi : Negati =     20.5 : 1.0
                     sad = True           Negati : Positi =     17.7 : 1.0
                  arrive = True           Positi : Negati =     15.7 : 1.0
                    damn = True           Negati : Positi =     15.5 : 1.0
                     x15 = True           Negati : Positi =     15.5 : 1.0
                followed = True           Negati : Positi =     14.9 : 1.0
              appreciate = True           Positi : Negati =     13.8 : 1.0
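
To give a feel for what the classifier does with those features, here is a minimal from-scratch sketch of the Naive Bayes idea. It is not nltk's implementation: the function names, the toy training data, and the add-one smoothing are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """Count how many tweets of each label contain each token."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in samples:
        label_counts[label] += 1
        for token in set(tokens):
            feature_counts[label][token] += 1
            vocab.add(token)
    return label_counts, feature_counts, vocab

def classify(model, tokens):
    """Pick the label with the highest log prior plus summed log likelihoods."""
    label_counts, feature_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior of the label
        for token in set(tokens):
            if token not in vocab:
                continue  # tokens never seen in training carry no evidence
            # Add-one smoothing (an assumption) keeps unseen pairs from giving log(0).
            score += math.log((feature_counts[label][token] + 1) / (n + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

samples = [
    (["love", ":)"], "Positive"),
    (["great", "day", ":)"], "Positive"),
    (["sad", ":("], "Negative"),
    (["miss", "you", ":("], "Negative"),
]
model = train(samples)
print(classify(model, ["love", "this", ":)"]), classify(model, ["so", "sad", ":("]))
```

A token such as ":(" that shows up almost only under one label shifts the score strongly toward that label, which is why the emoticons dominate the table above.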


# Test on random text:

In [44]:
from nltk.tokenize import word_tokenize

text1 = 'The movie got a good start and seemed to be very interesting. I would rate this movie four stars:). Great characters and acting'
text2 = 'The story had a confusing plot and their was no chemistry between the characters :/. It could have a better usage of language and vocabulary too.'

text1 = remove_noise(word_tokenize(text1))
text2 = remove_noise(word_tokenize(text2))
print(classifier.classify(dict([token,True] for token in text1)), classifier.classify(dict([token,True] for token in text2)))

Positive Negative


# Steps Performed:
1. Tokenize text
2. Normalize text
3. Remove noise
4. Transform data
5. Training and testing sets
6. Train model
7. Test model

It was simple, wasn't it :).

# Conclusion:
This was our classifier model (a Naive Bayes classifier), which helps to identify the sentiment of a text (it may be a review of a movie or a product, or a general tweet). We used only positive and negative sentiments to make it easier to understand.
The model classifies a text as either positive or negative. You can also build your own lists of positive, negative and neutral words to implement the analysis.
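
As a rough sketch of that word-list idea, the snippet below scores a token list against two small hand-written sets. The sets, the function name `lexicon_sentiment`, and the scoring rule are all hypothetical; a usable lexicon would need far more entries.

```python
# Hypothetical word lists for illustration only.
POSITIVE_WORDS = {"good", "great", "love", "happy", ":)"}
NEGATIVE_WORDS = {"sad", "bad", "confusing", "damn", ":("}

def lexicon_sentiment(tokens):
    """Add one point per positive token and subtract one per negative token."""
    score = sum((t in POSITIVE_WORDS) - (t in NEGATIVE_WORDS) for t in tokens)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

print(lexicon_sentiment(["great", "characters", "and", "acting"]))
print(lexicon_sentiment(["confusing", "plot"]))
print(lexicon_sentiment(["the", "movie"]))
```

Unlike the trained classifier, this approach can return a neutral label, but it knows nothing about words outside its lists.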

# Special Thanks:
All credit goes to https://www.digitalocean.com/community/users/sdaityari for making it simple to understand.