# Sentiment Analysis on the Tweets for the Celebrities Born Today

### Final Project - Mastering Python
### Author: Sammy Yu
### Date: May 30, 2016

### I Introduction
Every day, the [IMDB](http://m.imdb.com/feature/bornondate) web site lists 10 celebrities whose birthday falls on that day. For each celebrity in the list, four pieces of information are shown: a small photo, the name, the profession (such as "actor"), and the celebrity's best-known work (for example, a movie).

The project extracts this data from the web site by web scraping. The extracted data is then passed to a separate module, which runs a sentiment analysis on the tweets that Twitter users post about those celebrities.

The sentiment analysis reports one overall result for each celebrity: "POSITIVE", "NEGATIVE", "SEMI-POSITIVE" or "NEUTRAL". "POSITIVE" means that the Twitter users mostly use favorable words about the celebrity. "NEGATIVE" means that the celebrity receives few favorable words from the Twitter users. "SEMI-POSITIVE" means that the celebrity receives some favorable words, but fewer than in the "POSITIVE" case. "NEUTRAL" means that the words about the celebrity are roughly 50% positive and 50% negative.
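As a rough sketch, the rule that picks one of the four labels can be written as a small function. It mirrors the branching in the script in Section V, where each neutral tweet adds 0.5 to the neutral tally. The function name `overall_sentiment` is introduced here for illustration only, and the label is spelled "SEMI-POSITIVE" as in the text above.

```python
def overall_sentiment(tot_pos, tot_neg, tot_neu):
    """Map the per-celebrity tallies to one of the four overall labels."""
    # positive tweets outnumber both other groups
    if tot_pos > tot_neg and tot_pos > tot_neu:
        return "POSITIVE"
    # negative tweets outnumber positive ones and at least match the neutral tally
    if tot_neg > tot_pos and tot_neg >= tot_neu:
        return "NEGATIVE"
    # the neutral tally outweighs positive and negative tweets combined
    if tot_pos + tot_neg < tot_neu:
        return "SEMI-POSITIVE"
    return "NEUTRAL"


print(overall_sentiment(12, 3, 2.5))  # POSITIVE
```

Note that "SEMI-POSITIVE" is reached only when the neutral tally dominates, so the label depends on how heavily neutral tweets are weighted.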

### II Problem Statement
Build an application that pulls the data of the 10 celebrities born today from the IMDB web site, and then pulls streaming data from Twitter. The application must meet the following requirements:
* The user enters the total number of tweets to be pulled for all 10 celebrities. Once that number is reached, the application should terminate itself.
* Once the tweets are fetched, the application finds the sentiment for each celebrity's name and finally compares them.
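To make the second requirement concrete, here is a minimal sketch of word-list sentiment scoring for a single tweet, the same idea the script in Section V applies to every stored tweet. The function name `classify_tweet` and the tiny word sets are assumptions made for this example; the actual script reads its words from `positive_words.txt` and `negative_words.txt`.

```python
from string import punctuation


def classify_tweet(text, positive_words, negative_words):
    """Return (positive_count, negative_count, label) for one tweet."""
    cleaned = text.lower()
    # strip punctuation so that "love!" is counted as "love"
    for p in punctuation:
        cleaned = cleaned.replace(p, "")

    pos = neg = 0
    # split() without arguments drops empty tokens caused by repeated spaces
    for word in cleaned.split():
        if word in positive_words:
            pos += 1
        elif word in negative_words:
            neg += 1

    if pos > neg:
        label = "Positive"
    elif pos == neg:
        label = "Neutral"
    else:
        label = "Negative"
    return pos, neg, label


# hypothetical stand-in word lists, much smaller than real sentiment lexicons
positive = {"great", "love", "happy"}
negative = {"bad", "awful", "hate"}

print(classify_tweet("Happy birthday! Love your movies.", positive, negative))
```

Counting these labels over all tweets that mention one celebrity gives the tallies from which the overall result is derived.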


### III Solution Flow Diagram

![](./Flow.png)

### IV Tools and Packages Used

The application is written in Python 2.7.x. The tools and packages used are listed below:

#### Tools:
* PyCharm Community Edition
* Jupyter Notebook

#### Packages:
* BeautifulSoup
* Selenium
* Tweepy
* re
* matplotlib
* sys
* codecs
* csv
* string

### V Solution: Python Script with proper commenting


In [None]:
## import required packages
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import sys
import codecs
import csv
from string import punctuation
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from selenium import webdriver
import re

In [None]:
## class used to scrape the celebrity data from the IMDB web site
class imdb_celebrities:
    
    def get_celebrities(self):
        celebrities = {}
        url = "http://m.imdb.com/feature/bornondate"
        # Selenium needs the Firefox web browser to render the page
        driver = webdriver.Firefox()
        try:
            driver.get(url)
            html = driver.page_source
        finally:
            # close the browser even if the page fails to load
            driver.quit()

        soup = BeautifulSoup(html, "html.parser")
        posterList = soup.findChildren('section', 'posters list')
        iterCelebrity = iter(posterList[0].findChildren('a', 'poster '))

        # go through the list of celebrities who were born today
        for lnk in iterCelebrity:
            img = lnk.find('img')['src']
            label = lnk.find('div', 'label')
            name = label.find('span', 'title').string
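            # detail looks like "profession, best work"; the work part may be missing
            detail = label.find('div', 'detail').string.split(',', 1)
            profession = detail[0].strip()
            bestwork = detail[1].strip() if len(detail) > 1 else ''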
            celebrities[name] = (img, profession, bestwork)

        return celebrities


In [None]:
class tweetlistener(StreamListener):
    """ a listener class for the tweet stream """
    
    def on_status(self,status):
        """
        when a new tweet for a celebrity comes, this
        function is triggered
        """
        global counter,Total_tweet_count,filedict

        print "----------NEW TWEET ARRIVED!-----------"
        print "Tweet Text : %s" % status.text
        for item in filedict.items():
            regex = item[1][2]
            if regex.search(status.text) is not None:
                outfile = item[1][1]
                # keep each tweet on one line; the analysis later splits the file on '\n'
                outfile.write(status.text.replace("\n", " "))
                outfile.write(u"\n")
                print "Author's Screen name : %s" % status.author.screen_name
                print "Time of creation : %s" % status.created_at
                print "Source of Tweet : %s" % status.source
                break

        counter += 1
        # when the number of tweets reaches the limit specified
        # by the user at the beginning of the application run,
        # sentiment analysis starts
        if counter >= Total_tweet_count:
            senti1 = senti()
            senti1.open_sentiment_files()
            # start the analysis for the celebrities one at a time
            for item in filedict.items():
                senti1.sentiment_analysis(item)
            # draw the pie charts once, after every celebrity has been analyzed
            drawing()

            # when all analyses are done, quit the program
            sys.exit()

    def on_error(self, status):
        """ error handler during the streaming """
        
        drawing()
        print "Reconnected too soon. Terminating the program."
        print status
        sys.exit()

In [None]:
## Sentiment analyzer class
class senti():

    def open_sentiment_files(self):
        """ open the positive and negative words files """
        global positive_words,negative_words

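        # load each word list as a set and skip blank lines, so that an
        # empty token can never be counted as a sentiment word
        with open("positive_words.txt") as pos_file:
            positive_words = set(line.strip() for line in pos_file if line.strip())
        with open("negative_words.txt") as neg_file:
            negative_words = set(line.strip() for line in neg_file if line.strip())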

    def sentiment_analysis(self, dictItem):
        """ 
        analyze the sentiment in the tweets for each celebrity
        The dictItem parameter is an item of filedict: the key is a
        celebrity's name and the value is a tuple consisting of
        (filename, outfile, regex)
        """
        global all_figs,positive_words,negative_words,celebrities

        # key of the dictionary item
        indiv = dictItem[0]
        # get the matching files for the celebrity
        file2 = dictItem[1][0]
        outfile = dictItem[1][1]

        outfile.close()
        positive_counts = []
        negative_counts = []
        conclusion = []
        tot_pos = 0
        tot_neu = 0
        tot_neg = 0
        tweets = codecs.open(file2, 'r', "utf-8").read()
        tweet_list_dup = []

        tweets_list = tweets.split('\n')

        for tweet in tweets_list:
            positive_counter = 0
            negative_counter = 0
            tweet = tweet.encode("utf-8")
            tweet_list_dup.append(tweet)
            tweet_processed = tweet.lower()

            for p in list(punctuation):
                tweet_processed = tweet_processed.replace(p, '')

            # split() without arguments ignores repeated whitespace and empty tokens
            words = tweet_processed.split()
            for word in words:
                if word in positive_words:
                    positive_counter += 1
                elif word in negative_words:
                    negative_counter += 1

            positive_counts.append(positive_counter)
            negative_counts.append(negative_counter)
            if positive_counter > negative_counter:
                conclusion.append("Positive")
                tot_pos += 1
            elif positive_counter == negative_counter:
                conclusion.append("Neutral")
                tot_neu += 0.5
            else:
                conclusion.append("Negative")
                tot_neg +=1

        output = zip(tweet_list_dup, positive_counts, negative_counts,conclusion)

        print "******** Overall Analysis **************"
        print "Celebrity: " + indiv
        print "Image: " + celebrities[indiv][0]
        print "Profession: " + celebrities[indiv][1]
        print "Best Work: " + celebrities[indiv][2]

        if tot_pos > tot_neg and tot_pos > tot_neu:
            print "Overall Sentiment on Twitter: POSITIVE"
        elif tot_neg > tot_pos and tot_neg > tot_neu:
            print "Overall Sentiment on Twitter: NEGATIVE"
        elif tot_neg == tot_neu and tot_neg > tot_pos:
            print "Overall Sentiment on Twitter: NEGATIVE"
        elif tot_pos + tot_neg < tot_neu:
            print "Overall Sentiment on Twitter: SEMI-POSITIVE"
        else:
            print "Overall Sentiment on Twitter: NEUTRAL"


        print "%%%%%%%%%%%% End of processing - " + indiv + "   %%%%%%%%%%%%%%%%%%%%%"
        print ""

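        file1 = filePath + '\\tweet_sentiment_' + indiv + '.csv'
        # 'wb' is the mode the csv module expects on Python 2;
        # the with block closes the file once the rows are written
        with open(file1, 'wb') as csv_file:
            csv.writer(csv_file).writerows(output)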
        # remember the tallies and the name for the pie chart
        all_figs.append([tot_pos, tot_neg, tot_neu, indiv])


In [None]:
def drawing():
    """ function used to draw a pie chart of the sentiment analysis results """
    
    global all_figs, filePath
    for one_fig in all_figs:
        sentiments = {}
        sentiments["Positive"] = one_fig[0]
        sentiments["Negative"] = one_fig[1]
        sentiments["Neutral"] = one_fig[2]
        all_total = one_fig[0] + one_fig[1] + one_fig[2]

        sizes = [sentiments['Positive'] / float(all_total), sentiments['Negative'] / float(all_total),
                 sentiments['Neutral'] / float(all_total)]

        plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True)
        plt.axis('equal')

        plt.title('sentiment for the word - ' + str(one_fig[3]))
        fig_name = filePath + "\\fig_" + str(one_fig[3]) + ".png"
        # Save the figures
        plt.savefig(fig_name)
        plt.close()
    plt.show()


In [None]:
def search_tweets():
    """ function used to pull the tweets of the celebrities """
    
    global search_words_list,counter,auth,indiv,filePath,filedict
    # raw string so the backslash is never read as an escape sequence
    filePath = r"G:\Output Files"
    # set up the authorization with Twitter APIs
    # (replace the placeholders with your own credentials and keep them out of shared code)
    consumer_key = 'YOUR_CONSUMER_KEY'
    consumer_secret = 'YOUR_CONSUMER_SECRET'
    access_token = 'YOUR_ACCESS_TOKEN'
    access_secret = 'YOUR_ACCESS_SECRET'
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)

    # create a streaming object
    twitterStream = Stream(auth, tweetlistener())
    counter = 0
    filedict = {}
    one_list = []
    # create a list of the celebrities' name as the search keywords
    for indiv in search_words_list:
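        # use the full name so that two celebrities never share one file
        filename = filePath + "\\test_" + indiv + ".txt"
        outfile = codecs.open(filename, 'w', "utf-8")
        # escape the name so characters such as '.' match literally, and ignore case
        regex = re.compile(re.escape(indiv), re.IGNORECASE)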
        # remember the file name, file handler and regex object for each celebrity
        filedict[indiv] = (filename, outfile, regex)
        # add each celebrity's name to the filter list
        one_list.append(indiv)

    # the one_list should have multiple celebrity names
    twitterStream.filter(track=one_list,languages = ["en"], async=True)


In [None]:
def main():
    """ entry point function to start the application """
    
    global Total_tweet_count,search_words_list,all_figs,labels,colors,celebrities

    print "Getting celebrities..."
    imdb = imdb_celebrities()
    celebrities = imdb.get_celebrities()
    search_words_list = celebrities.keys()
    print "Celebrities to process: " + ", ".join(search_words_list)

    Total_tweet_count = int(raw_input("Enter total tweets to be pulled for all celebrities: "))

    labels = ['Positive','Negative','Neutral']
    colors = ['yellowgreen','lightcoral','gold']
    all_figs= []
    search_tweets()


In [None]:
## Start the application by calling the main() function
main()

### VI Challenges Faced during the project

1. When working on the data scraping against the IMDB web site, I found that the celebrity data is generated dynamically with JavaScript. In this case, the BeautifulSoup package by itself cannot read the data from the web page. After some research, I found that the Selenium package resolves the problem quite easily, and the code to use it is very simple. The downside of this solution is that its performance may not be the best among the available options. It also requires a web browser such as Firefox to be installed in order to work.

2. Another challenge was dealing with the connection between the application and the Twitter APIs. Twitter allows only one connection at a time to its API with the same account credentials. If I open multiple connections at almost the same time, the older connection is automatically disconnected, the waiting time for any new connection increases exponentially, and the client IP may eventually even be banned. Because of this, I cannot create a new connection for each celebrity I want to stream. Nor could I disconnect an existing connection before creating a new one, since I did not find an API call that disconnects a stream other than terminating the application.

   The solution I came up with is to initialize only one connection. When streaming, I pass the names of all 10 celebrities as a list to the stream filter, so the stream fetches the tweets that match any one of the names. Since each tweet can be about any of the celebrities, I use a regular expression to find out whether the text of the tweet contains the name of one of them. If it does, the tweet is appended to that celebrity's file for later analysis and drawing. Once the streaming is done, the program runs the sentiment analysis for each celebrity and reports the results one at a time.

   The downside of this approach is that we need enough tweets from the stream, e.g. more than 1000 in total, so that the tweets are more evenly distributed among all celebrities.
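The single-connection approach from the second challenge can be sketched without any Twitter dependency: one compiled pattern per name, and each incoming tweet goes to the first celebrity whose name it contains. The helper names `build_matchers` and `route_tweet`, the sample names and the sample tweets are all made up for this illustration. Escaping the names and ignoring case are choices made for this sketch.

```python
import re


def build_matchers(names):
    """Compile one case-insensitive pattern per celebrity name."""
    # re.escape makes characters such as '.' match literally
    return [(name, re.compile(re.escape(name), re.IGNORECASE)) for name in names]


def route_tweet(text, matchers):
    """Return the first celebrity whose name appears in the tweet, or None."""
    for name, pattern in matchers:
        if pattern.search(text):
            return name
    return None


# hypothetical names and tweets, used only to exercise the routing
names = ["Jane Doe", "John Q. Public"]
matchers = build_matchers(names)
buckets = {name: [] for name in names}

stream = [
    "Happy birthday jane doe!",
    "Nothing relevant here",
    "John Q. Public was great in that movie",
]
for text in stream:
    name = route_tweet(text, matchers)
    if name is not None:
        buckets[name].append(text)

for name in names:
    print("%s: %d tweet(s)" % (name, len(buckets[name])))
```

In the application the buckets are files on disk rather than lists, but the routing step is the same: all names share one stream, and the pattern match decides which celebrity a tweet belongs to.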