# Sentiment Analysis using Python and NLTK

This notebook is a text classification exercise: deciding whether a tweet expresses positive or negative sentiment. It builds a binary classifier that chooses between two labels, Positive or Negative.

##  Text Classification

Classification works by learning from labeled feature sets, or training data, to later classify an unlabeled feature set. A labeled feature set is simply a tuple that looks like (feat, label), while an unlabeled feature set is a feat by itself. A feature set is basically a key-value mapping of feature names to feature values. In text classification, the feature names are usually words and the values are all True. Because documents may contain unknown words, and the number of possible words may be very large, words that don't occur in the text are omitted instead of being included in a feature set with the value False.  

An instance is another term for a feature set. It represents a single occurrence of a combination of features. A labeled feature set is an instance with a known class label that we can use for training or evaluation. To summarize, (feat, label) is a labeled feature set, or labeled instance. feat is a feature set, normally represented as a key-value dictionary. When feat does not have an associated label, it is also called an unlabeled feature set, or instance.
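
As a small illustration of the terms above, the sketch below builds a sparse feature set in which only the words that occur are stored, each with the value True. The helper name `bag_of_words` is a hypothetical choice for this example and is not defined elsewhere in this notebook.

```python
def bag_of_words(words):
    """Map each word that occurs to True; absent words are simply left out."""
    return {word: True for word in words}

# An unlabeled feature set (instance): just the feature dictionary
feat = bag_of_words(['love', 'this', 'car'])

# A labeled feature set (labeled instance): a (feat, label) tuple
labeled_feat = (feat, 'positive')

print(feat)          # {'love': True, 'this': True, 'car': True}
print(labeled_feat)  # ({'love': True, 'this': True, 'car': True}, 'positive')
```

The `create_featureset` function later in this notebook uses a dense variant instead, storing an explicit False for every known word that is absent.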

Here are the steps in training the classifier:  
    - Extract features from the text.  
    - Train the classifier with the features.  

### 1) Create tweet list - Positive and Negative tweets

In [1]:
## Positive Tweets list
pos_tweets = [('I love this car', 'positive'),
('This view is amazing', 'positive'),
('I feel great this morning', 'positive'),
('I am so excited about the concert', 'positive'),
('He is my best friend', 'positive')]

In [2]:
## Negative Tweets list
neg_tweets = [('I do not like this car', 'negative'),
('This view is horrible', 'negative'),
('I feel tired this morning', 'negative'),
('I am not looking forward to the concert', 'negative'),
('He is my enemy', 'negative')]

### 2) Process the tweets
    - Combine the positive and negative tweets into a single list
    - Read the tweets and extract the individual words of each tweet
    - Exclude words that are shorter than 3 characters
    - Convert each word to lowercase
    - The final result is a list of processed tweets, each holding the words of the tweet along with its sentiment
    - A second, flat list of all the words is also built so that word frequencies can be computed later

In [3]:
## Combine both and create a new list
all_tweets = pos_tweets + neg_tweets
all_tweets

[('I love this car', 'positive'),
 ('This view is amazing', 'positive'),
 ('I feel great this morning', 'positive'),
 ('I am so excited about the concert', 'positive'),
 ('He is my best friend', 'positive'),
 ('I do not like this car', 'negative'),
 ('This view is horrible', 'negative'),
 ('I feel tired this morning', 'negative'),
 ('I am not looking forward to the concert', 'negative'),
 ('He is my enemy', 'negative')]

In [4]:
def process_tweets(tweetlist):
    """Return (list of (words, sentiment) pairs, flat list of all words)."""
    processed_tweets = []
    just_tweets = []
    for tweet, sentiment in tweetlist:
        tweets = []
        for words in tweet.split():
            # Keep only words of 3 or more characters, lowercased
            if len(words) >= 3:
                tweets.append(words.lower())
        just_tweets += tweets
        processed_tweets.append((tweets, sentiment))
    return (processed_tweets, just_tweets)

In [5]:
processed_tweets, just_tweets = process_tweets(all_tweets)
processed_tweets

[(['love', 'this', 'car'], 'positive'),
 (['this', 'view', 'amazing'], 'positive'),
 (['feel', 'great', 'this', 'morning'], 'positive'),
 (['excited', 'about', 'the', 'concert'], 'positive'),
 (['best', 'friend'], 'positive'),
 (['not', 'like', 'this', 'car'], 'negative'),
 (['this', 'view', 'horrible'], 'negative'),
 (['feel', 'tired', 'this', 'morning'], 'negative'),
 (['not', 'looking', 'forward', 'the', 'concert'], 'negative'),
 (['enemy'], 'negative')]

In [6]:
print(just_tweets) ## All the words from the tweets except those shorter than 3 characters

['love', 'this', 'car', 'this', 'view', 'amazing', 'feel', 'great', 'this', 'morning', 'excited', 'about', 'the', 'concert', 'best', 'friend', 'not', 'like', 'this', 'car', 'this', 'view', 'horrible', 'feel', 'tired', 'this', 'morning', 'not', 'looking', 'forward', 'the', 'concert', 'enemy']


### 3) Create test tweets and process them

In [7]:
test_tweets = [('I feel happy this morning', 'positive'),
               ('Larry is my friend.', 'positive' ),
               ('I do not like that man.', 'negative'),
               ('My house is not great.', 'negative'),
               ('Your song is annoying.', 'negative')]

In [8]:
processed_tweets_test, just_tweets_test = process_tweets(test_tweets)
processed_tweets_test

[(['feel', 'happy', 'this', 'morning'], 'positive'),
 (['larry', 'friend.'], 'positive'),
 (['not', 'like', 'that', 'man.'], 'negative'),
 (['house', 'not', 'great.'], 'negative'),
 (['your', 'song', 'annoying.'], 'negative')]

In [9]:
print(just_tweets_test)

['feel', 'happy', 'this', 'morning', 'larry', 'friend.', 'not', 'like', 'that', 'man.', 'house', 'not', 'great.', 'your', 'song', 'annoying.']
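
The test tokens above keep their trailing punctuation ('friend.', 'man.', 'great.', 'annoying.'), so they cannot match training words such as 'friend' or 'great'. One possible variant is sketched below, under the assumption that stripping leading and trailing punctuation from each token is acceptable. The names `process_tweets_stripped` and `min_length` are hypothetical; the rest of this notebook continues to use the output of the original `process_tweets`.

```python
import string

def process_tweets_stripped(tweetlist, min_length=3):
    """Like process_tweets, but strips surrounding punctuation from each token."""
    processed, all_words = [], []
    for tweet, sentiment in tweetlist:
        words = []
        for token in tweet.split():
            # Remove punctuation at both ends, then lowercase
            word = token.strip(string.punctuation).lower()
            if len(word) >= min_length:
                words.append(word)
        all_words += words
        processed.append((words, sentiment))
    return processed, all_words

sample = [('Larry is my friend.', 'positive'),
          ('My house is not great.', 'negative')]
processed_sample, words_sample = process_tweets_stripped(sample)
print(processed_sample)
# [(['larry', 'friend'], 'positive'), (['house', 'not', 'great'], 'negative')]
```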


### 4) List of distinct words ordered by frequency of appearance

In [10]:
import nltk
from nltk import FreqDist

In [11]:
freqdist_tweets = FreqDist(just_tweets)
freqdist_tweets.most_common()

[('this', 6),
 ('car', 2),
 ('view', 2),
 ('feel', 2),
 ('morning', 2),
 ('the', 2),
 ('concert', 2),
 ('not', 2),
 ('love', 1),
 ('amazing', 1),
 ('great', 1),
 ('excited', 1),
 ('about', 1),
 ('best', 1),
 ('friend', 1),
 ('like', 1),
 ('horrible', 1),
 ('tired', 1),
 ('looking', 1),
 ('forward', 1),
 ('enemy', 1)]

#### This indicates that the word 'this' has the highest frequency, occurring 6 times, followed by car, view, feel, etc.

## 5) Train a Naive Bayes classifier to label a tweet as positive or negative

NaiveBayesClassifier uses Bayes' theorem to predict the probability that a given feature set belongs to a particular label. The formula is:  

Probability of A given B = P(A|B)  
Probability of B given A = P(B|A)  
P(A|B) = P(A ∩ B) / P(B) = P(B|A) * P(A) / P(B)  

    The probability of a label given the feature set, i.e. the probability of the label after observing the features, is:  
    P(label | features) = P(label) * P(features | label) / P(features) 
    
    - P(label): This is the prior probability of the label occurring, which is the likelihood that a random feature set will have the label. This is based on the number of training instances with the label compared to the total number of training instances. For example, if 60/100 training instances have the label, the prior probability of the label is 60%.
    - P(features | label): This is the likelihood: the probability of observing the given feature set in an instance that has that label. This is based on which features have occurred with each label in the training data.
    - P(features): This is the probability of the given feature set occurring, regardless of label. It is the likelihood of a random feature set being the same as the given feature set, and is based on the observed feature sets in the training data. For example, if the given feature set occurs twice in 100 training instances, this probability is 2%.
    - P(label | features): This is the posterior probability, which tells us how likely it is that the given features should have that label. If this value is high, then we can be reasonably confident that the label is correct for the given features.
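
The formula above can be checked with a small worked example. The sketch below uses hypothetical priors and likelihoods chosen only for illustration, and it makes one assumption that differs from the description above: P(features) is obtained by summing P(label) * P(features | label) over all labels, rather than by counting observed feature sets. Exact fractions are used so the arithmetic is easy to follow.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return P(label | features) for every label via Bayes' theorem."""
    # Joint probability P(label) * P(features | label) for each label
    joint = {label: priors[label] * likelihoods[label] for label in priors}
    # P(features) is the sum of the joint probabilities over all labels
    evidence = sum(joint.values())
    return {label: joint[label] / evidence for label in joint}

# Hypothetical numbers, chosen only for illustration
priors = {'positive': Fraction(60, 100), 'negative': Fraction(40, 100)}
likelihoods = {'positive': Fraction(1, 10), 'negative': Fraction(3, 10)}

result = posterior(priors, likelihoods)
print(result['positive'], result['negative'])  # 1/3 2/3
```

Even though 'positive' has the larger prior here, the features are three times as likely under 'negative', so the posterior favors 'negative'.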

## Training the NaiveBayes Classifier
    1) Feature Extraction: Choosing the right, relevant features. The output has to be a dictionary containing relevant information about the words.
    2) Training: Passing the labeled feature sets to the classifier.

In [12]:
from nltk.classify import NaiveBayesClassifier

## 5.1) Feature Extraction

Text feature extraction is the process of transforming what is essentially a list of words into a feature set that is usable by a classifier.

In [13]:
word_features = [feat for (feat, count) in freqdist_tweets.most_common()] ## Same words as set(just_tweets), but as a list ordered by frequency
print(word_features)

['this', 'car', 'view', 'feel', 'morning', 'the', 'concert', 'not', 'love', 'amazing', 'great', 'excited', 'about', 'best', 'friend', 'like', 'horrible', 'tired', 'looking', 'forward', 'enemy']


## 5.2) Create Featuresets

For training, we need to first create a list of labeled feature sets. This list should be of the form [(featureset, label)], where the featureset variable is a dict and label is the known class label for the featureset.

In [14]:
def create_featureset(document):
    words = set(document)
    featureset = {}
    for w in word_features:
        featureset[w] = (w in words)
    return featureset

In [15]:
create_featureset(['not', 'great', 'car'])

{'about': False,
 'amazing': False,
 'best': False,
 'car': True,
 'concert': False,
 'enemy': False,
 'excited': False,
 'feel': False,
 'forward': False,
 'friend': False,
 'great': True,
 'horrible': False,
 'like': False,
 'looking': False,
 'love': False,
 'morning': False,
 'not': True,
 'the': False,
 'this': False,
 'tired': False,
 'view': False}

## 5.3) Train the classifier
Create the training set from all the processed input tweets.

In [16]:
training_set = [(create_featureset(tweet), category) for (tweet, category) in processed_tweets]
training_set

[({'about': False,
   'amazing': False,
   'best': False,
   'car': True,
   'concert': False,
   'enemy': False,
   'excited': False,
   'feel': False,
   'forward': False,
   'friend': False,
   'great': False,
   'horrible': False,
   'like': False,
   'looking': False,
   'love': True,
   'morning': False,
   'not': False,
   'the': False,
   'this': True,
   'tired': False,
   'view': False},
  'positive'),
 ({'about': False,
   'amazing': True,
   'best': False,
   'car': False,
   'concert': False,
   'enemy': False,
   'excited': False,
   'feel': False,
   'forward': False,
   'friend': False,
   'great': False,
   'horrible': False,
   'like': False,
   'looking': False,
   'love': False,
   'morning': False,
   'not': False,
   'the': False,
   'this': True,
   'tired': False,
   'view': True},
  'positive'),
 ({'about': False,
   'amazing': False,
   'best': False,
   'car': False,
   'concert': False,
   'enemy': False,
   'excited': False,
   'feel': True,
   'forward': F

In [17]:
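## Note: test_tweets holds raw strings, so set(document) in create_featureset yields
## single characters and every feature below is False; processed_tweets_test holds the word lists.
testing_set = [(create_featureset(tweet), category) for (tweet, category) in test_tweets]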
testing_set

[({'about': False,
   'amazing': False,
   'best': False,
   'car': False,
   'concert': False,
   'enemy': False,
   'excited': False,
   'feel': False,
   'forward': False,
   'friend': False,
   'great': False,
   'horrible': False,
   'like': False,
   'looking': False,
   'love': False,
   'morning': False,
   'not': False,
   'the': False,
   'this': False,
   'tired': False,
   'view': False},
  'positive'),
 ({'about': False,
   'amazing': False,
   'best': False,
   'car': False,
   'concert': False,
   'enemy': False,
   'excited': False,
   'feel': False,
   'forward': False,
   'friend': False,
   'great': False,
   'horrible': False,
   'like': False,
   'looking': False,
   'love': False,
   'morning': False,
   'not': False,
   'the': False,
   'this': False,
   'tired': False,
   'view': False},
  'positive'),
 ({'about': False,
   'amazing': False,
   'best': False,
   'car': False,
   'concert': False,
   'enemy': False,
   'excited': False,
   'feel': False,
   'forw

In [18]:
classifier = NaiveBayesClassifier.train(training_set)

In [19]:
classifier.show_most_informative_features()

Most Informative Features
                     not = False          positi : negati =      1.6 : 1.0
                 excited = False          negati : positi =      1.2 : 1.0
                horrible = False          positi : negati =      1.2 : 1.0
                   about = False          negati : positi =      1.2 : 1.0
                    love = False          negati : positi =      1.2 : 1.0
                   great = False          negati : positi =      1.2 : 1.0
                 amazing = False          negati : positi =      1.2 : 1.0
                 forward = False          positi : negati =      1.2 : 1.0
                  friend = False          negati : positi =      1.2 : 1.0
                    best = False          negati : positi =      1.2 : 1.0


In [20]:
classifier.labels()

['positive', 'negative']

In [21]:
print("Classifier accuracy percent:",(nltk.classify.accuracy(classifier, testing_set))*100)

Classifier accuracy percent: 40.0


#### The accuracy is low because we don't have many examples to classify the new tweets accurately
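
Accuracy here is the share of test instances whose predicted label matches the known label. A minimal sketch in plain Python, assuming the classifier predicts 'positive' for all five test tweets (consistent with the per-tweet predictions printed in section 5.4):

```python
# Known labels of the five test tweets, in order
gold = ['positive', 'positive', 'negative', 'negative', 'negative']
# Assumption: the classifier answers 'positive' for every test tweet
predicted = ['positive'] * len(gold)

# Count the positions where prediction and known label agree
correct = sum(1 for g, p in zip(gold, predicted) if g == p)
accuracy_percent = 100 * correct / len(gold)
print(accuracy_percent)  # 40.0
```

Two of the five test tweets are positive, so always answering 'positive' gives 2/5, i.e. 40%.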

## 5.4) Testing the new tweets 

In [22]:
tweet = 'This is the greatest song Ive ever heard'
print(classifier.classify(create_featureset(tweet.split())))

positive


In [23]:
test_sentence = ["tired"]
classifier.classify(create_featureset(test_sentence))

'negative'

In [24]:
test_tweets[0][0]

'I feel happy this morning'

In [25]:
for i in range(len(test_tweets)):
    print(test_tweets[i],'- Sentiment: ', classifier.classify(create_featureset(test_tweets[i][0])))

('I feel happy this morning', 'positive') - Sentiment:  positive
('Larry is my friend.', 'positive') - Sentiment:  positive
('I do not like that man.', 'negative') - Sentiment:  positive
('My house is not great.', 'negative') - Sentiment:  positive
('Your song is annoying.', 'negative') - Sentiment:  positive
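
Every test tweet above is labeled positive. One likely contributor is that `create_featureset` receives the raw tweet string here (and when `testing_set` is built), so `set(document)` yields single characters and every word feature is False. The sketch below reproduces this with a shortened, hypothetical vocabulary; `demo_features` and `demo_featureset` are illustrative names that mirror the logic of `word_features` and `create_featureset` so the example runs on its own.

```python
# Shortened vocabulary, for illustration only
demo_features = ['feel', 'morning', 'this', 'not', 'great']

def demo_featureset(document):
    """Same logic as create_featureset, using the demo vocabulary."""
    words = set(document)
    return {w: (w in words) for w in demo_features}

tweet = 'I feel happy this morning'

# Passing the raw string: set(tweet) is a set of single characters
from_string = demo_featureset(tweet)
# Passing a token list: set(...) is a set of whole words
from_tokens = demo_featureset(tweet.lower().split())

print(any(from_string.values()))  # False
print(from_tokens)
# {'feel': True, 'morning': True, 'this': True, 'not': False, 'great': False}
```

Passing the token lists from `processed_tweets_test` would give word-level features; how much that changes the results on this small corpus is not shown here.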


## 5.5) Conclusion

The classifier is not identifying the tweets correctly. Possible reasons might be:
    1) Too few training examples: the corpus contains only a handful of words associated with positive or negative sentiment, and common words like 'this' carry no sentiment.
    