# Sentiment Analysis

Sentiment analysis determines the sentiment expressed in a text or document and assigns that text to a class. In this notebook the classification is binary: each text is labeled either positive or negative.

# Introduction

In this notebook I will show you how to write a Python program that analyzes the sentiment of text using the Natural Language Toolkit (NLTK), a Python library for natural language processing. I draw a word cloud of the text and build a Naive Bayes classifier, then measure how well it scores.

# Importing libraries

In [None]:
# here i am importing important libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("fivethirtyeight")
import nltk
from nltk.corpus import stopwords
from nltk.classify import SklearnClassifier
from sklearn.model_selection import train_test_split
from textblob import TextBlob
from wordcloud import WordCloud,STOPWORDS
from subprocess import check_output
import re

# Reading Dataset

In [None]:
# here I am reading the dataset
data = pd.read_csv("../input/first-gop-debate-twitter-sentiment/Sentiment.csv")
# here I am printing the first five rows of the dataset
data.head()

In [None]:
# here I am printing the shape of the dataset
data.shape

In [None]:
# here i have decided to use only sentiment and text columns for doing sentiment analysis
data = data[["text","sentiment"]]

In [None]:
# here i am printing first five line of my dataset
data.head()

In [None]:
# here I am cleaning the text column
def cleantxt(text):
    text = re.sub(r'@[A-Za-z0-9]+', '', text)  # remove @mentions
    text = re.sub(r'#', '', text)  # remove the # symbol, keep the hashtag word
    text = re.sub(r'RT[\s]+', '', text)  # remove the RT marker
    text = re.sub(r'https?:\/\/\S+', '', text)  # remove hyperlinks (\S+ matches the rest of the URL)
    text = re.sub(r':+', '', text)  # remove : symbols
    text = re.sub(r'--+', '', text)  # remove runs of dashes
    text = re.sub(r'http', '', text)  # remove any leftover "http" fragments
    return text
data["text"] = data["text"].apply(cleantxt)

In [None]:
# here we are printing the first five line of cleaned data
data.head()
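
To see what this kind of cleaning does to a single tweet, here is a small standalone sketch. The function name `clean_tweet`, the exact patterns, and the sample tweet are my own assumptions for illustration; they are not part of the dataset or of the cell above.

```python
import re

def clean_tweet(text):
    """Strip links, retweet markers, mentions and stray symbols from a tweet."""
    text = re.sub(r'https?://\S+', '', text)   # remove whole URLs first
    text = re.sub(r'\bRT\s+', '', text)        # remove the retweet marker
    text = re.sub(r'@\w+', '', text)           # remove @mentions
    text = re.sub(r'#', '', text)              # drop '#', keep the hashtag word
    text = re.sub(r':+|--+', '', text)         # drop colons and runs of dashes
    return re.sub(r'\s+', ' ', text).strip()   # collapse leftover whitespace

# made-up sample tweet, for illustration only
sample = "RT @user: Great #GOPDebate tonight -- watch https://t.co/abc123"
print(clean_tweet(sample))  # prints: Great GOPDebate tonight watch
```

Removing the URL before the colon is stripped keeps the link pattern intact, which is why the order of the substitutions matters.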

Now I add two more columns to the dataset: subjectivity and polarity.

Subjectivity: subjective sentences generally express personal opinion, emotion, or judgment, whereas objective sentences convey factual information. Subjectivity is a float in the range [0, 1], where 0 is very objective and 1 is very subjective.

Polarity: polarity is a float in the range [-1, 1], where 1 means a positive statement and -1 means a negative statement.
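
A polarity score can be turned into a label by thresholding it. The sketch below is only one possible mapping: the function name `polarity_to_label` and the cut-off at 0 (with an optional `neutral_band`) are my assumptions, not something TextBlob prescribes.

```python
def polarity_to_label(polarity, neutral_band=0.0):
    """Map a polarity score in [-1, 1] to a sentiment label.

    neutral_band is a hypothetical tolerance: scores whose absolute
    value does not exceed it are treated as Neutral.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1, 1]")
    if polarity > neutral_band:
        return "Positive"
    if polarity < -neutral_band:
        return "Negative"
    return "Neutral"

print(polarity_to_label(0.5))   # prints: Positive
print(polarity_to_label(-0.2))  # prints: Negative
print(polarity_to_label(0.0))   # prints: Neutral
```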


In [None]:
# here i am creating function to get subjectivity
def getSubjectivity(text):
    return TextBlob(text).sentiment.subjectivity
# here i am creating function to get polarity
def getPolarity(text):
    return TextBlob(text).sentiment.polarity
# here i am creating two new column of subjectivity and polarity
data["subjectivity"] = data["text"].apply(getSubjectivity)
data["polarity"] = data["text"].apply(getPolarity)


In [None]:
# here we are printing the first ten rows of data after adding the two new columns
data.head(10)

First, I split the dataset into a training set and a test set; the test set is 10% of the original dataset. For this analysis I dropped the neutral tweets from the training set, as my goal was only to differentiate positive and negative tweets.

In [None]:
# here I am splitting the dataset into train and test data
train,test = train_test_split(data,test_size=0.1)
# here i am removing neutral text
train = train[train.sentiment != "Neutral"]
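
To make the hold-out idea concrete, here is a plain Python sketch of a shuffled 90/10 split followed by dropping the neutral rows from the training part. It is not how `train_test_split` is implemented; the helper name `holdout_split`, the fixed `seed`, and the toy rows are assumptions for illustration.

```python
import random

def holdout_split(rows, test_fraction=0.1, seed=42):
    """Shuffle rows reproducibly and hold out a fraction as the test set."""
    rows = list(rows)                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)      # seeded shuffle gives a repeatable split
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]    # (train, test)

# toy (text, sentiment) rows, made up for illustration
labels = ["Positive", "Negative", "Neutral"] * 10
rows = [("tweet %d" % i, label) for i, label in enumerate(labels)]

train_rows, test_rows = holdout_split(rows)
# drop neutral tweets from the training part only
train_rows = [row for row in train_rows if row[1] != "Neutral"]

print(len(test_rows))  # prints: 3
```

Fixing the seed means the same tweets end up in the test set on every run, which makes scores comparable between runs.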

As a next step I separated the Positive and Negative tweets of the training set so that their words can be inspected easily. The text had already been cleaned of hashtag symbols, mentions, and links, so it was ready for a WordCloud visualization, which draws the most frequent words at the largest size. Note that the cloud below is generated from the text of all tweets rather than from the two subsets separately.

In [None]:
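# here I am selecting the positive tweets of the training set
# (the labels in this notebook are capitalised: "Positive", "Negative", "Neutral")
train_pos = train[train["sentiment"]=="Positive"]
train_pos = train_pos["text"]
# here I am selecting the negative tweets of the training set
train_neg = train[train["sentiment"]=="Negative"]
train_neg = train_neg["text"]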

In [None]:
# here I am drawing a WordCloud of the text of all tweets
allwords = ' '.join([twts for twts in data["text"]])
wordcloud = WordCloud(width=2500,
                      height=2000,stopwords=STOPWORDS,background_color="white",random_state=21
                     ).generate(allwords)
plt.figure(1,figsize=(10,10))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
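
A word cloud sizes words by how often they occur, so the same information can be checked as plain counts. The sketch below uses only the standard library; the helper name `top_words`, the tiny stopword set, and the sample tweets are assumptions for illustration.

```python
import re
from collections import Counter

def top_words(texts, stopwords=frozenset(), n=5):
    """Return the n most frequent non-stopword words across texts."""
    counts = Counter()
    for text in texts:
        # lowercase and keep runs of letters/apostrophes as words
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts.most_common(n)

# made-up tweets, for illustration only
tweets = [
    "the debate was great",
    "great answers in the debate",
    "the moderators were unfair",
]
print(top_words(tweets, stopwords={"the", "was", "in", "were"}, n=2))
```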

In [None]:
# here I am removing hashtags, mentions, links and stopwords from the
# training set after doing the visualisation
text = []
stopwords_set = set(stopwords.words("english"))

for index, row in train.iterrows():
    # lowercase each token and drop tokens shorter than three characters
    words_filtered = [e.lower() for e in row.text.split() if len(e) >= 3]
    words_cleaned = [word for word in words_filtered
        if 'http' not in word
        and not word.startswith('@')
        and not word.startswith('#')
        and word != 'rt']  # tokens are already lowercased, so compare with 'rt'
    words_without_stopwords = [word for word in words_cleaned if word not in stopwords_set]
    text.append((words_without_stopwords, row.sentiment))

test_pos = test[ test['sentiment'] == 'Positive']
test_pos = test_pos['text']
test_neg = test[ test['sentiment'] == 'Negative']
test_neg = test_neg['text']
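
Here is the same token filter applied to one tweet, as a standalone sketch. The function name `filter_tokens`, the four-word stopword set, and the sample tweet are assumptions for illustration; the real cell uses the NLTK English stopword list.

```python
def filter_tokens(tweet, stopwords):
    """Lowercase a tweet's tokens and drop short tokens, links, mentions,
    hashtags, the retweet marker and stopwords."""
    tokens = [t.lower() for t in tweet.split() if len(t) >= 3]
    return [
        t for t in tokens
        if 'http' not in t
        and not t.startswith(('@', '#'))
        and t != 'rt'
        and t not in stopwords
    ]

# made-up tweet and a tiny hypothetical stopword set
sample = "RT @user The debate was #awesome and very fair http://t.co/x"
print(filter_tokens(sample, {"the", "was", "and", "very"}))
# prints: ['debate', 'fair']
```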

As a next step I extracted the so-called features with NLTK: first I computed a frequency distribution of the training words, then took its keys as the feature vocabulary.

In [None]:
# Extracting word features
def get_words_in_text(text):
    all_words = []  # named all_words to avoid shadowing the built-in all()
    for (words, sentiment) in text:
        all_words.extend(words)
    return all_words

def get_word_features(wordlist):
    wordlist = nltk.FreqDist(wordlist)
    features = wordlist.keys()
    return features
w_features = get_word_features(get_words_in_text(text))

def extract_features(document):
    document_words = set(document)
    features = {}
    for word in w_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
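
To show what these feature dictionaries look like, here is a tiny standalone sketch with a three-word vocabulary. The helper names `build_vocabulary` and `to_features` and the toy documents are assumptions for illustration; they mirror the idea above without using NLTK.

```python
def build_vocabulary(labeled_docs):
    """Collect the distinct words of all documents, in first-seen order."""
    vocabulary, seen = [], set()
    for words, _label in labeled_docs:
        for word in words:
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary

def to_features(words, vocabulary):
    """Mark, for every vocabulary word, whether the document contains it."""
    present = set(words)
    return {'contains(%s)' % word: (word in present) for word in vocabulary}

# toy training documents, made up for illustration
docs = [(["great", "debate"], "Positive"), (["bad", "debate"], "Negative")]
vocab = build_vocabulary(docs)
print(vocab)                                   # prints: ['great', 'debate', 'bad']
print(to_features(["great", "night"], vocab))
# prints: {'contains(great)': True, 'contains(debate)': False, 'contains(bad)': False}
```

Words that never appeared in training (here "night") simply produce no feature, so the classifier ignores them.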

Using the NLTK Naive Bayes classifier, I classified the tweets based on the extracted word features.

In [None]:
# here I am training the Naive Bayes classifier
training_set = nltk.classify.apply_features(extract_features,text)
classifier = nltk.NaiveBayesClassifier.train(training_set)
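
For intuition about what a Naive Bayes classifier does with boolean features, here is a from-scratch sketch. It is not NLTK's implementation: the function names, the add-one smoothing (`alpha=1.0`), and the toy training data are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_featuresets, alpha=1.0):
    """Estimate log-priors and smoothed P(feature is True | label)."""
    label_counts = Counter()
    true_counts = defaultdict(Counter)  # label -> feature -> times seen True
    feature_names = set()
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name, value in features.items():
            feature_names.add(name)
            if value:
                true_counts[label][name] += 1
    total = sum(label_counts.values())
    model = {"priors": {}, "p_true": {}, "features": sorted(feature_names)}
    for label, n in label_counts.items():
        model["priors"][label] = math.log(n / total)
        # additive smoothing keeps every probability strictly between 0 and 1
        model["p_true"][label] = {
            name: (true_counts[label][name] + alpha) / (n + 2 * alpha)
            for name in feature_names
        }
    return model

def classify(model, features):
    """Return the label with the highest log-posterior score."""
    best_label, best_score = None, -math.inf
    for label, log_prior in model["priors"].items():
        score = log_prior
        for name in model["features"]:
            p = model["p_true"][label][name]
            score += math.log(p if features.get(name, False) else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# toy training data, made up for illustration
toy_train = [
    ({"contains(great)": True, "contains(awful)": False}, "Positive"),
    ({"contains(great)": True, "contains(awful)": False}, "Positive"),
    ({"contains(great)": False, "contains(awful)": True}, "Negative"),
    ({"contains(great)": False, "contains(awful)": True}, "Negative"),
]
toy_model = train_naive_bayes(toy_train)
print(classify(toy_model, {"contains(great)": True, "contains(awful)": False}))
# prints: Positive
```

Working with sums of log-probabilities instead of products avoids numerical underflow when the vocabulary has thousands of features.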

In [None]:
# here I measure how the classifier scored on the test set
neg_cnt = 0
pos_cnt = 0
for obj in test_neg:
    # lowercase the tokens so they match the lowercased training vocabulary
    res = classifier.classify(extract_features(obj.lower().split()))
    if res == 'Negative':
        neg_cnt = neg_cnt + 1
for obj in test_pos:
    res = classifier.classify(extract_features(obj.lower().split()))
    if res == 'Positive':
        pos_cnt = pos_cnt + 1

# report correctly classified tweets out of the total for each class
print('[Negative]: %s/%s ' % (neg_cnt, len(test_neg)))
print('[Positive]: %s/%s ' % (pos_cnt, len(test_pos)))
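
The two counts above can also be summarized as an overall accuracy and a per-class recall. Here is a minimal sketch; the helper name `evaluate` and the gold/predicted pairs are made up for illustration and are not results from this notebook.

```python
from collections import Counter

def evaluate(pairs):
    """Compute accuracy and per-class recall from (gold, predicted) pairs."""
    totals, correct = Counter(), Counter()
    for gold, predicted in pairs:
        totals[gold] += 1
        if gold == predicted:
            correct[gold] += 1
    if not totals:
        raise ValueError("need at least one (gold, predicted) pair")
    accuracy = sum(correct.values()) / sum(totals.values())
    recall = {label: correct[label] / totals[label] for label in totals}
    return accuracy, recall

# hypothetical predictions, not taken from the dataset
toy_pairs = [
    ("Negative", "Negative"),
    ("Negative", "Negative"),
    ("Negative", "Positive"),
    ("Positive", "Positive"),
    ("Positive", "Negative"),
]
accuracy, recall = evaluate(toy_pairs)
print(accuracy)            # prints: 0.6
print(recall["Positive"])  # prints: 0.5
```

Per-class recall is worth reporting alongside accuracy when the classes are imbalanced, because a classifier that always predicts the larger class can still reach a high accuracy.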

# Conclusion

Thanks for reading. I hope you liked this sentiment analysis and found it helpful. If you have any questions or suggestions, feel free to write them down in the comment section.