# Sentiment Analysis and Classification Of Pakistan's General Elections 2018

This script scrapes <b>100+</b> tweets about Pakistan's 2018 general elections from [Twitter](https://twitter.com/search?q=%23PakistanElections2018&src=typd) and performs sentiment analysis on them.

The project was built in the following steps:
  - Generated a dataset by scraping Twitter data with Selenium WebDriver.
  - Preprocessed the tweets by cleaning them and removing information that is extraneous to the sentiment analysis.
  - Used a bag-of-words model to count how often each word appears in the tweets.
  - Classified the tweets by party (PTI or PMLN) and scored them against lists of positive and negative words to see the dominant opinion about the election results.
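
As a rough illustration of the bag-of-words step, the sketch below counts word frequencies over a few made-up tweets using only the standard library. The sample tweets and the `bag_of_words` helper are assumptions for illustration; the notebook itself uses `nltk.FreqDist` for the same purpose.

```python
from collections import Counter

def bag_of_words(tweets):
    """Count how often each whitespace-separated word occurs across all tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.split())
    return counts

# Hypothetical, already-cleaned tweets (not taken from the scraped data)
sample_tweets = [
    "pti wins the election",
    "pti leading in early results",
    "the results are delayed",
]
counts = bag_of_words(sample_tweets)
print(counts["pti"], counts["the"], counts["results"])
```

Word order is discarded in this representation; only the per-word counts are kept, which is all the later scoring step needs.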

In [2]:
# enable intellisense in jupyter notebook
%config IPCompleter.greedy=True

In [3]:
from selenium import webdriver
from parsel import Selector
from textblob import TextBlob
import collections
import nltk
import re
from nltk.tokenize import WordPunctTokenizer,word_tokenize
from nltk.stem import PorterStemmer
ps = PorterStemmer()
#total tweets
total_tweets=[]

# Scraping Tweets from Twitter

In [5]:
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

try:
    driver = webdriver.Chrome()

    # search Twitter for the hashtag #PakistanElections2018
    driver.get("https://twitter.com/search?q=%23PakistanElections2018&src=typd")

    # scroll to the end of the page repeatedly so that more tweets are loaded
    for i in range(30):
        # find_element(By.TAG_NAME, ...) replaces find_element_by_tag_name, which Selenium 4 removed
        htmlElement = driver.find_element(By.TAG_NAME, "html")
        htmlElement.send_keys(Keys.END)
        time.sleep(3)  # give the page time to load the next batch of tweets
except Exception as e:
    print(e)

In [6]:
# extract the <p> tags containing tweets
sel = Selector(driver.page_source)
elems=sel.xpath("//*[@class='TweetTextSize  js-tweet-text tweet-text']").extract()

# Cleansing Tweets ( Preprocessing )

<b>Tokenizing and Stemming</b>

In [7]:
counter = 1
pmln_counter=0
pti_counter=0
pmln_word_list = ["nawaz","shahbaz","sharif","pmln","maryam","nawaz sharif","shahbaz sharif"]
pti_word_list = ["imran khan","pti","captain","tehreek","insaaf","insaf"]
pti_tweets = []
pmln_tweets = []
for tweetContainerElement in elems:
    stemmed_words = []
    sel2 = Selector(tweetContainerElement)
    arrayOfTextInTweetParagraph=sel2.xpath(".//text()").extract()
    separator=' '
    mergedTweet = separator.join(arrayOfTextInTweetParagraph)
    mergedTweet = mergedTweet.lower()
    
    # clean the tweet; mentions and URLs are removed first, while their punctuation is still present
    mergedTweet = re.sub(r'@[A-Za-z0-9]+', '', mergedTweet)            # @mentions
    mergedTweet = re.sub(r'https?://[A-Za-z0-9./]+', '', mergedTweet)  # URLs
    # replace everything that is not a letter (digits, '#', '&', quotes, punctuation) with a space
    mergedTweet = re.sub(r'[^a-zA-Z]', ' ', mergedTweet)
    # collapse runs of whitespace into a single space
    mergedTweet = re.sub(r'\s+', ' ', mergedTweet)
    words = word_tokenize(mergedTweet)
    #for word in words:
    #    stemmed_words.append(ps.stem(word))
    final_cleansed_tweet = (" ".join(words)).strip()
    print(str(counter)+" Tweets written successfully")
    counter = counter + 1
    for pmln_wd in pmln_word_list:
        if ((any(pmln_wd in wd for wd in final_cleansed_tweet.split())) and not (any("pti" in wd for wd in final_cleansed_tweet.split()))):
            pmln_counter=pmln_counter+1
            print(str(pmln_counter)+":PMLN TWEETS")
            pmln_tweets.append(final_cleansed_tweet)
            break
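    # A tweet counts as a PTI tweet when some word contains "pti" and no word contains "pmln".
    # Only the literal "pti" is tested here; the other keywords in pti_word_list are not consulted.
    tweet_words = final_cleansed_tweet.split()
    if any("pti" in wd for wd in tweet_words) and not any("pmln" in wd for wd in tweet_words):
        pti_counter=pti_counter+1
        print(str(pti_counter)+":PTI TWEETS")
        pti_tweets.append(final_cleansed_tweet)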
    total_tweets.append(final_cleansed_tweet)
# remove duplicate tweets once, after all tweets have been processed
total_tweets = list(set(total_tweets))

1 Tweets written successfully
2 Tweets written successfully
1:PTI TWEETS
3 Tweets written successfully
4 Tweets written successfully
5 Tweets written successfully
6 Tweets written successfully
7 Tweets written successfully
8 Tweets written successfully
9 Tweets written successfully
2:PTI TWEETS
10 Tweets written successfully
11 Tweets written successfully
12 Tweets written successfully
13 Tweets written successfully
14 Tweets written successfully
1:PMLN TWEETS
15 Tweets written successfully
16 Tweets written successfully
17 Tweets written successfully
18 Tweets written successfully
3:PTI TWEETS
19 Tweets written successfully
20 Tweets written successfully
21 Tweets written successfully
22 Tweets written successfully
23 Tweets written successfully
4:PTI TWEETS
24 Tweets written successfully
25 Tweets written successfully
26 Tweets written successfully
5:PTI TWEETS
27 Tweets written successfully
28 Tweets written successfully
29 Tweets written successfully
30 Tweets written successfully


278 Tweets written successfully
76:PTI TWEETS
279 Tweets written successfully
280 Tweets written successfully
281 Tweets written successfully
77:PTI TWEETS
282 Tweets written successfully
283 Tweets written successfully
284 Tweets written successfully
285 Tweets written successfully
286 Tweets written successfully
78:PTI TWEETS
287 Tweets written successfully
288 Tweets written successfully
14:PMLN TWEETS
289 Tweets written successfully
290 Tweets written successfully
291 Tweets written successfully
292 Tweets written successfully
293 Tweets written successfully
294 Tweets written successfully
295 Tweets written successfully
79:PTI TWEETS
296 Tweets written successfully
297 Tweets written successfully
298 Tweets written successfully
299 Tweets written successfully
80:PTI TWEETS
300 Tweets written successfully
301 Tweets written successfully
302 Tweets written successfully
303 Tweets written successfully
304 Tweets written successfully
305 Tweets written successfully
306 Tweets written 
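
The PMLN keyword check in the cell above tests whether a keyword is a substring of a single word, so a two-word entry such as "nawaz sharif" can never match. One possible alternative, sketched below, matches each keyword as a whole word or phrase with a regular expression. The helpers `build_pattern` and `tag_party` and the sample tweets are assumptions for illustration and are not part of the original notebook; using them would change the counts shown above.

```python
import re

pmln_word_list = ["nawaz", "shahbaz", "sharif", "pmln", "maryam", "nawaz sharif", "shahbaz sharif"]
pti_word_list = ["imran khan", "pti", "captain", "tehreek", "insaaf", "insaf"]

def build_pattern(keywords):
    """Compile one regex that matches any keyword as a whole word or phrase."""
    # Longest keywords first so multi-word phrases are tried before their parts
    ordered = sorted(keywords, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(k) for k in ordered) + r")\b")

PMLN_RE = build_pattern(pmln_word_list)
PTI_RE = build_pattern(pti_word_list)

def tag_party(cleaned_tweet):
    """Return 'pti', 'pmln', or None when the tweet mentions both parties or neither."""
    has_pti = PTI_RE.search(cleaned_tweet) is not None
    has_pmln = PMLN_RE.search(cleaned_tweet) is not None
    if has_pti and not has_pmln:
        return "pti"
    if has_pmln and not has_pti:
        return "pmln"
    return None

# Hypothetical cleaned tweets
print(tag_party("imran khan addresses the nation"))
print(tag_party("maryam speaks to supporters"))
print(tag_party("nawaz sharif and pti both mentioned"))
```

Tweets that mention both parties are left untagged here, because a single sentiment score cannot be attributed to either side.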

# Writing Cleansed Tweets to a Text File

In [8]:
try:
    with open("electionTweets.txt","w",encoding="utf-8") as textFile:
        for tweet in total_tweets:
            textFile.write(tweet+"\n")
        print("TOTAL TWEETS WRITTEN TO FILE SUCCESSFULLY")
except Exception as e:
    print(e)
try:
    with open("ptiElectionTweets.txt","w",encoding="utf-8") as textFile:
        for tweet in pti_tweets:
            textFile.write(tweet+"\n")
        print("PTI TWEETS WRITTEN TO FILE SUCCESSFULLY")
except Exception as e:
    print(e)
try:
    with open("pmlnElectionTweets.txt","w",encoding="utf-8") as textFile:
        for tweet in pmln_tweets:
            textFile.write(tweet+"\n")
        print("PMLN TWEETS WRITTEN TO FILE SUCCESSFULLY")
except Exception as e:
    print(e)

TOTAL TWEETS WRITTEN TO FILE SUCCESSFULLY
PTI TWEETS WRITTEN TO FILE SUCCESSFULLY
PMLN TWEETS WRITTEN TO FILE SUCCESSFULLY
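
The three nearly identical `try` blocks above could be folded into one helper. The sketch below is one way to do that; the name `write_tweets`, the demo file name, and the sample tweets are assumptions for illustration.

```python
import os
import tempfile

def write_tweets(path, tweets):
    """Write one tweet per line to `path`; return True on success, False on an OS error."""
    try:
        with open(path, "w", encoding="utf-8") as text_file:
            for tweet in tweets:
                text_file.write(tweet + "\n")
        return True
    except OSError as e:
        print(e)
        return False

# Demonstration in a temporary directory with made-up tweets
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demoTweets.txt")
    ok = write_tweets(demo_path, ["first tweet", "second tweet"])
    with open(demo_path, encoding="utf-8") as f:
        lines = f.read().splitlines()
print(ok, lines)
```

In the notebook this would be called once per list, for example `write_tweets("ptiElectionTweets.txt", pti_tweets)`.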


In [9]:
try:
    driver.close()
    print("SELENIUM DRIVER CLOSED AFTER SCRAPING")
except Exception as e:
    print(e)

SELENIUM DRIVER CLOSED AFTER SCRAPING


# Loading Dictionary Of Positive and Negative Words

In [10]:
positive_words=[]
negative_words=[]
try:
    with open("positive-words.txt","r",encoding="utf-8") as textFile:
        positive_words = textFile.readlines()
    print("+ve words loaded")
except Exception as e:
    print(e)
try:
    with open("negative-words.txt","r",encoding="ISO-8859-1") as textFile2:
        negative_words = textFile2.readlines()
    print("-ve words loaded")
except Exception as e:
    print(e)
positive_words = [word.replace("\n","").replace("\ufeff","") for word in positive_words]
negative_words = [word.replace("\n","").replace("\ufeff","") for word in negative_words]
print(positive_words)
print("\n\n")
print(negative_words)
"""
Documents
"""
documents_positive = [(positive_words,"pos")]
documents_negative = [(negative_words,"neg")]

+ve words loaded
-ve words loaded
['a+', 'abound', 'abounds', 'abundance', 'abundant', 'accessable', 'accessible', 'acclaim', 'acclaimed', 'acclamation', 'accolade', 'accolades', 'accommodative', 'accomodative', 'accomplish', 'accomplished', 'accomplishment', 'accomplishments', 'accurate', 'accurately', 'achievable', 'achievement', 'achievements', 'achievible', 'acumen', 'adaptable', 'adaptive', 'adequate', 'adjustable', 'admirable', 'admirably', 'admiration', 'admire', 'admirer', 'admiring', 'admiringly', 'adorable', 'adore', 'adored', 'adorer', 'adoring', 'adoringly', 'adroit', 'adroitly', 'adulate', 'adulation', 'adulatory', 'advanced', 'advantage', 'advantageous', 'advantageously', 'advantages', 'adventuresome', 'adventurous', 'advocate', 'advocated', 'advocates', 'affability', 'affable', 'affably', 'affectation', 'affection', 'affectionate', 'affinity', 'affirm', 'affirmation', 'affirmative', 'affluence', 'affluent', 'afford', 'affordable', 'affordably', 'afordable', 'agile', 'agi
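
Since every later step only asks whether a word is in the lexicon, a set is a natural container for it. The sketch below loads lexicon lines into a set and skips blank lines and lines starting with `;`. Treating `;` as a comment marker is an assumption (the output above suggests the files used here are already clean), and `load_lexicon` and the sample lines are hypothetical.

```python
def load_lexicon(lines):
    """Return a set of lowercase words, skipping blank lines and ';' comment lines."""
    words = set()
    for line in lines:
        # Drop a byte-order mark if present, then surrounding whitespace
        word = line.replace("\ufeff", "").strip().lower()
        if word and not word.startswith(";"):
            words.add(word)
    return words

# Hypothetical file contents, as readlines() would return them
sample_lines = ["\ufeff; comment header\n", "\n", "a+\n", "abound\n", "Abundant\n"]
lexicon = load_lexicon(sample_lines)
print(sorted(lexicon))
```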

# Word Weighting

In [11]:
from nltk import FreqDist
pti_words = []
pmln_words= []
for tweet in pti_tweets:
    for word in tweet.split():
        pti_words.append(word)
for tweet in pmln_tweets:
    for word in tweet.split():
        pmln_words.append(word)

"""
Frequency Distribution of Words
"""
word_freqs_pti = FreqDist(pti_words)
word_freqs_pmln = FreqDist(pmln_words)
"""
Word Features
"""
word_features_pti = list(word_freqs_pti.keys())[:3000]

def find_features(document):
    # map each PTI word feature to whether it appears in the positive-word lexicon
    features = {}
    for feature in word_features_pti:
        features[feature] = (feature in positive_words)
    return features

In [12]:
pti_sent_counter = 0
pmln_sent_counter= 0
pmln_pos_dict={}
pmln_neg_dict={}
pti_pos_dict={}
pti_neg_dict={}
print("POSITIVE WORDS FREQUENCY IN PTI TWEETS")
print("\n\n---------------------------\n\n")
for word in positive_words:
    if int(word_freqs_pti[str(word)]) > 0:
        print(word+":"+str(word_freqs_pti[word]))
        pti_sent_counter = pti_sent_counter + (int(word_freqs_pti[str(word)])*1) 
        pti_pos_dict[str(word)] = int(word_freqs_pti[str(word)])
print("NEGATIVE WORDS FREQUENCY IN PTI TWEETS")
print("\n\n---------------------------\n\n")
for word in negative_words:
    if int(word_freqs_pti[str(word)]) > 0:
        print(word+":"+str(word_freqs_pti[word]))
        pti_sent_counter = pti_sent_counter + (int(word_freqs_pti[str(word)])*-1)
        pti_neg_dict[str(word)] = int(word_freqs_pti[str(word)])
print("POSITIVE WORDS FREQUENCY IN PMLN TWEETS")
print("\n\n---------------------------\n\n")
for word in positive_words:
    if int(word_freqs_pmln[str(word)]) > 0:
        print(word+":"+str(word_freqs_pmln[word]))
        pmln_sent_counter = pmln_sent_counter + (int(word_freqs_pmln[str(word)])*1)
        pmln_pos_dict[str(word)] = int(word_freqs_pmln[str(word)])
print("NEGATIVE WORDS FREQUENCY IN PMLN TWEETS")
print("\n\n---------------------------\n\n")
for word in negative_words:
    if int(word_freqs_pmln[str(word)]) > 0:
        print(word+":"+str(word_freqs_pmln[word]))
        pmln_sent_counter = pmln_sent_counter + (int(word_freqs_pmln[str(word)])*-1)
        pmln_neg_dict[str(word)] = int(word_freqs_pmln[str(word)])

POSITIVE WORDS FREQUENCY IN PTI TWEETS


---------------------------


appreciate:1
beautiful:2
best:6
better:3
celebrate:2
clean:1
clear:1
congratulation:1
congratulations:3
contribution:1
dawn:16
defeat:2
defeated:3
dignified:1
enthusiasm:1
fascinating:2
favor:1
favour:2
good:2
grateful:1
handsome:1
healthy:1
holy:1
humility:1
important:1
interesting:2
jubilant:1
lead:6
leading:10
leads:5
like:3
loving:1
luck:3
peace:2
promises:1
ready:2
relaxed:1
satisfying:1
stable:1
strong:1
succeed:1
success:1
successfully:1
supported:1
supporter:1
sweet:1
thank:1
trophy:1
victory:8
welcome:1
well:4
willing:2
win:5
winner:1
winning:3
wins:6
won:3
NEGATIVE WORDS FREQUENCY IN PTI TWEETS


---------------------------


agony:1
allegations:1
ax:1
blatant:1
breaking:2
complaining:1
concerned:1
conflict:1
corruption:2
delay:1
difficult:1
disappointed:1
dispute:1
dust:1
fake:2
fear:1
foul:1
hate:1
issues:2
killed:1
miss:2
opposition:1
paranoia:1
poverty:2
puppet:1
reject:1
rejects:1
scarcely:1
scramble:
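
The scoring in the cell above adds 1 for every occurrence of a positive word and subtracts 1 for every occurrence of a negative word. A compact sketch of the same idea follows; the function name `net_sentiment` and the sample inputs are assumptions for illustration.

```python
from collections import Counter

def net_sentiment(word_counts, positive_words, negative_words):
    """Return (score, pos_hits, neg_hits), where score is the total number of
    positive-word occurrences minus the total number of negative-word occurrences."""
    pos_hits = {w: c for w, c in word_counts.items() if w in positive_words}
    neg_hits = {w: c for w, c in word_counts.items() if w in negative_words}
    score = sum(pos_hits.values()) - sum(neg_hits.values())
    return score, pos_hits, neg_hits

# Hypothetical word counts and lexicons
word_counts = Counter("pti wins victory victory fake delay the".split())
positive = {"wins", "victory"}
negative = {"fake", "delay"}
score, pos_hits, neg_hits = net_sentiment(word_counts, positive, negative)
print(score, pos_hits, neg_hits)
```

Iterating over the tweet vocabulary rather than over the whole lexicon keeps the loop short, since the lexicons are much larger than the set of words found in the tweets.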

# Results

In [13]:
listofTuples = sorted(pti_pos_dict.items() ,  key=lambda x: x[1])
 
# Iterate over the sorted sequence
for elem in listofTuples :
    print(elem[0] , " ::" , elem[1] )
print(list(reversed(list(listofTuples)))[0:3])

appreciate  :: 1
clean  :: 1
clear  :: 1
congratulation  :: 1
contribution  :: 1
dignified  :: 1
enthusiasm  :: 1
favor  :: 1
grateful  :: 1
handsome  :: 1
healthy  :: 1
holy  :: 1
humility  :: 1
important  :: 1
jubilant  :: 1
loving  :: 1
promises  :: 1
relaxed  :: 1
satisfying  :: 1
stable  :: 1
strong  :: 1
succeed  :: 1
success  :: 1
successfully  :: 1
supported  :: 1
supporter  :: 1
sweet  :: 1
thank  :: 1
trophy  :: 1
welcome  :: 1
winner  :: 1
beautiful  :: 2
celebrate  :: 2
defeat  :: 2
fascinating  :: 2
favour  :: 2
good  :: 2
interesting  :: 2
peace  :: 2
ready  :: 2
willing  :: 2
better  :: 3
congratulations  :: 3
defeated  :: 3
like  :: 3
luck  :: 3
winning  :: 3
won  :: 3
well  :: 4
leads  :: 5
win  :: 5
best  :: 6
lead  :: 6
wins  :: 6
victory  :: 8
leading  :: 10
dawn  :: 16
[('dawn', 16), ('leading', 10), ('victory', 8)]
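
The top three entries can also be obtained with `collections.Counter.most_common`, as sketched below with a few of the counts copied from the output above. This is an alternative to sorting and reversing, not the notebook's own approach; when counts are tied, the two approaches may order the tied words differently.

```python
from collections import Counter

# A handful of counts copied from the output above
pti_pos_counts = Counter({"dawn": 16, "leading": 10, "victory": 8, "best": 6, "well": 4})

# most_common(n) returns the n highest counts in descending order
top_three = pti_pos_counts.most_common(3)
print(top_three)
```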


# PTI Top Positive Words Frequency
![Positive Words Frequency](pos-pti-words.png)

In [14]:
listofTuples = sorted(pti_neg_dict.items() ,  key=lambda x: x[1])
 
# Iterate over the sorted sequence
for elem in listofTuples :
    print(elem[0] , " ::" , elem[1] )
print(list(reversed(list(listofTuples)))[0:3])

agony  :: 1
allegations  :: 1
ax  :: 1
blatant  :: 1
complaining  :: 1
concerned  :: 1
conflict  :: 1
delay  :: 1
difficult  :: 1
disappointed  :: 1
dispute  :: 1
dust  :: 1
fear  :: 1
foul  :: 1
hate  :: 1
killed  :: 1
opposition  :: 1
paranoia  :: 1
puppet  :: 1
reject  :: 1
rejects  :: 1
scarcely  :: 1
scramble  :: 1
slowed  :: 1
struggle  :: 1
turmoil  :: 1
villains  :: 1
worry  :: 1
breaking  :: 2
corruption  :: 2
fake  :: 2
issues  :: 2
miss  :: 2
poverty  :: 2
terrorism  :: 2
upset  :: 2
[('upset', 2), ('terrorism', 2), ('poverty', 2)]


# PTI Top Negative Words Frequency
![Negative Words Frequency](neg-pti-words.png)

In [15]:
listofTuples = sorted(pmln_pos_dict.items() ,  key=lambda x: x[1])
 
# Iterate over the sorted sequence
for elem in listofTuples :
    print(elem[0] , " ::" , elem[1] )
print(list(reversed(list(listofTuples)))[0:3])

celebrate  :: 1
good  :: 1
hot  :: 1
like  :: 1
luck  :: 1
success  :: 1
well  :: 1
win  :: 1
won  :: 1
work  :: 1
[('work', 1), ('won', 1), ('win', 1)]


# PMLN Top Positive Words Frequency
![Positive Words Frequency](pos-pmln-words.png)

In [16]:
listofTuples = sorted(pmln_neg_dict.items() ,  key=lambda x: x[1])
 
# Iterate over the sorted sequence
for elem in listofTuples :
    print(elem[0] , " ::" , elem[1] )
print(list(reversed(list(listofTuples)))[0:3])

allegations  :: 1
blatant  :: 1
boycott  :: 1
cry  :: 1
deprived  :: 1
failed  :: 1
foul  :: 1
irony  :: 1
issues  :: 1
manipulate  :: 1
opposition  :: 1
protest  :: 1
protested  :: 1
reject  :: 1
rival  :: 1
sad  :: 1
complaints  :: 2
crowded  :: 2
delays  :: 2
breaking  :: 3
rejects  :: 3
[('rejects', 3), ('breaking', 3), ('delays', 2)]


# PMLN Top Negative Words Frequency
![Negative Words Frequency](neg-pmln-words.png)

# Evaluation

In [17]:
print("PTI CHANCE OF WINNING:"+str(pti_sent_counter))
print("PMLN CHANCE OF WINNING:"+str(pmln_sent_counter))

PTI CHANCE OF WINNING:94
PMLN CHANCE OF WINNING:-18
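
These totals are raw sums, so a party with more tweets can reach a larger magnitude simply through volume. One possible refinement, sketched below, divides each total by the number of tweets in its group. The helper `per_tweet_score` is an assumption, and the tweet counts used (80 and 14) are the last values visible in the truncated log earlier in this notebook, so they may not be the final counts.

```python
def per_tweet_score(net_score, tweet_count):
    """Average net sentiment per tweet; returns 0.0 for an empty group."""
    if tweet_count == 0:
        return 0.0
    return net_score / tweet_count

# Scores from the output above; tweet counts are assumed (see the note before this block)
pti_avg = per_tweet_score(94, 80)
pmln_avg = per_tweet_score(-18, 14)
print("PTI per-tweet score:", pti_avg)
print("PMLN per-tweet score:", pmln_avg)
```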


# Conclusion

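<p>With a net sentiment score of 94 for PTI against -18 for PMLN, the scraped tweets are far more positive towards PTI, which suggests that Twitter users expected PTI to win the 2018 elections.</p>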