# Twitter Feels

## Topical sentiment analysis from recent Twitter posts

This notebook:

* Builds (or reuses) a model built from positive and negative tweets
* Fetches the recent history of posted tweets by one or more keywords
* Analyzes the retrieved tweets for the overall sentiment (positive or negative) for those keywords and reports it
* Takes a new potential tweet and predicts its sentiment

### An example of using Naive Bayes classification.

Let's teach a Naive Bayes classifier to "read" tweets about a topic and guess whether each tweet is positive or negative toward that topic.
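
To make the idea concrete, here is a minimal, self-contained sketch of a Naive Bayes classifier over word-presence features. It is an illustration only: the function names (`train_naive_bayes`, `classify_tokens`) and the four toy tweets are made up for this example, and it uses simple add-one smoothing, which is not necessarily the estimator NLTK uses internally.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Count, per label, how many documents contain each token.

    labeled_docs is a list of (tokens, label) pairs (hypothetical toy format).
    """
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        label_counts[label] += 1
        for token in set(tokens):
            token_counts[label][token] += 1
            vocabulary.add(token)
    return label_counts, token_counts, vocabulary


def classify_tokens(model, tokens):
    """Return the label with the highest log-probability for the tokens."""
    label_counts, token_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        # Start from the prior P(label)
        score = math.log(doc_count / total_docs)
        # Add log P(token present | label) for each known token,
        # with add-one smoothing (an assumption of this sketch)
        for token in set(tokens) & vocabulary:
            score += math.log((token_counts[label][token] + 1) / (doc_count + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration
toy_tweets = [
    (["love", "great"], "Positive"),
    (["great", "happy"], "Positive"),
    (["hate", "awful"], "Negative"),
    (["awful", "sad"], "Negative"),
]
toy_model = train_naive_bayes(toy_tweets)
print(classify_tokens(toy_model, ["great", "day"]))
```

Tokens the model has never seen (such as "day" here) are simply ignored, so the decision rests on the words that appeared in training.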

We start by importing the libraries we need and, optionally, recreating the source text file of tweets. Recreating the file takes a long time; if you skip that step, the build uses the most recent file of tweets.

In [1]:
# Import the necessary classes from the tweepy library
# (StreamListener exists only in tweepy versions before 4.0)
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
from datetime import datetime

Run the cell below to update the tweet data file, 'fetched_tweets.txt', or skip it: it runs for a long time.

In [2]:
############################################################################
#
#    Recreating tweets file; skip this cell or be prepared to wait!
#
############################################################################
now = datetime.now() # current date and time

# Twitter API credentials, read from environment variables so that secrets
# are not stored in the notebook. The variable names are assumptions;
# rename them to match your own environment.
import os
access_token = os.environ["TWITTER_ACCESS_TOKEN"]
access_token_secret = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]
consumer_key = os.environ["TWITTER_CONSUMER_KEY"]
consumer_secret = os.environ["TWITTER_CONSUMER_SECRET"]

# This is a basic listener that appends the text of received tweets to 'fetched_tweets.txt'.
class StdOutListener(StreamListener):
    
    def __init__(self, api=None):
        super(StdOutListener, self).__init__()
        self.num_tweets = 0
        
    def on_status(self, data):
        # print(data)
        startDate = datetime.today()
        if self.num_tweets < 10000:
            if data.created_at > startDate:
                with open('fetched_tweets.txt','a') as tf:
                    if hasattr(data, "retweeted_status"):  # Check if Retweet
                        try:
                            thisStatus = data.retweeted_status.extended_tweet["full_text"]
                        except AttributeError:
                            thisStatus = data.retweeted_status.text
                    else:
                        try:
                            thisStatus = data.extended_tweet["full_text"]
                        except AttributeError:
                            thisStatus = data.text
                    # Write the tweet text followed by a newline
                    tf.write(thisStatus)
                    tf.write("\n")
            self.num_tweets += 1
            return True
        else:
            return False

    def on_error(self, status):
        print(status)


if __name__ == '__main__':

    #This handles Twitter authentication and the connection to Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l, tweet_mode='extended', include_rts=False)

    ##############################
    #
    # UPDATE THE KEYWORDS BELOW!
    #
    ##############################
    
    keywords = ['Trump','corona','virus']
    # Start a fresh file whose first line records the keywords and today's date
    keyargs = keywords + [now.strftime("%Y-%m-%d")]
    with open("fetched_tweets.txt", "w") as f:
        f.write("|||".join(keyargs))
        f.write("\n")
    
    # Filter the Twitter stream to capture English-language tweets that match the keywords above
    stream.filter(languages=["en"],track=keywords)

Now, reuse the existing model OR import our training data, then retrain and test the model.
The dataset includes 5,000 positive tweets and 5,000 negative tweets.
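
Retraining holds out part of the shuffled data for testing, and the two parts should not overlap. The following is a minimal sketch of a disjoint 80/20 split; `split_dataset` is a hypothetical helper, and the placeholder feature dicts stand in for the real tweet features.

```python
import random


def split_dataset(dataset, train_fraction=0.8, seed=0):
    """Shuffle a copy of the dataset and cut it into two disjoint parts."""
    shuffled = list(dataset)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder data: 5,000 "Positive" and 5,000 "Negative" examples,
# each with a single made-up token as its feature
placeholder_dataset = [
    ({"token_%d" % i: True}, "Positive" if i < 5000 else "Negative")
    for i in range(10000)
]

train_data, test_data = split_dataset(placeholder_dataset)
print(len(train_data), len(test_data))
```

Slicing both parts at the same index is what keeps test examples out of the training set, so the reported accuracy reflects tweets the model has not seen.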

In [3]:
import nltk
import collections
import pickle
nltk.download('twitter_samples')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt')

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import twitter_samples, stopwords
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from nltk import FreqDist, classify, NaiveBayesClassifier

import re, string, random

def remove_noise(tweet_tokens, stop_words = ()):

    cleaned_tokens = []
    lemmatizer = WordNetLemmatizer()  # create once and reuse for every token

    for token, tag in pos_tag(tweet_tokens):
        # Strip URLs and @mentions
        token = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
                       r'(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', token)
        token = re.sub(r"(@[A-Za-z0-9_]+)", "", token)

        # Map the part-of-speech tag to the form the lemmatizer expects
        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        token = lemmatizer.lemmatize(token, pos)

        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)

if __name__ == "__main__":
    # Set to False to retrain the classifier from the NLTK sample tweets
    useExistingModel = True

    if not useExistingModel:

        positive_tweets = twitter_samples.strings('positive_tweets.json')
        negative_tweets = twitter_samples.strings('negative_tweets.json')
        #text = twitter_samples.strings('tweets.20150430-223406.json')
        #tweet_tokens = twitter_samples.tokenized('positive_tweets.json')[0]

        stop_words = stopwords.words('english')

        positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
        negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

        positive_cleaned_tokens_list = []
        negative_cleaned_tokens_list = []

        for tokens in positive_tweet_tokens:
            positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

        for tokens in negative_tweet_tokens:
            negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

        all_pos_words = get_all_words(positive_cleaned_tokens_list)

        freq_dist_pos = FreqDist(all_pos_words)
        print(freq_dist_pos.most_common(10))

        positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
        negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)

        positive_dataset = [(tweet_dict, "Positive")
                             for tweet_dict in positive_tokens_for_model]

        negative_dataset = [(tweet_dict, "Negative")
                             for tweet_dict in negative_tokens_for_model]

        dataset = positive_dataset + negative_dataset

        random.shuffle(dataset)

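        # Train on the first 8,000 tweets and test on the remaining 2,000,
        # so the test set does not overlap the training set
        train_data = dataset[:8000]
        test_data = dataset[8000:]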

        classifier = NaiveBayesClassifier.train(train_data)

        print("Accuracy is:", classify.accuracy(classifier, test_data))

        # show_most_informative_features prints its own table and returns None
        classifier.show_most_informative_features(10)
        
        # Save the trained classifier so later runs can reuse it
        with open('tweet_classifier.pickle','wb') as f:
            pickle.dump(classifier, f)

        # custom_tweet = "I ordered just once from TerribleCo, they screwed up, never used the app again."

        # custom_tokens = remove_noise(word_tokenize(custom_tweet))

        # print(custom_tweet, classifier.classify(dict([token, True] for token in custom_tokens)))

    else:
        # Reuse the previously trained and pickled classifier
        with open('tweet_classifier.pickle','rb') as f:
            classifier = pickle.load(f)
    
    counter = 0
    keyargs = ""
    date = ""
    
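    # Counts start at 1, which keeps the ratio below from dividing by zero
    my_dict = {'Negative': 1, 'Positive': 1}
    with open('fetched_tweets.txt','r') as tf:
        for x in tf:
            if counter == 0:
                # The first line is a header (keywords and date joined by "|||"),
                # not a tweet, so parse it and skip classification
                keyargs = x.strip().split("|||")
                date = keyargs.pop()
                counter += 1
                continue
            custom_tweet = x
            custom_tokens = remove_noise(word_tokenize(custom_tweet))
            sentiment = classifier.classify(dict([token, True] for token in custom_tokens))
            #if counter < 30:
            #    print ("first sentiment is ", sentiment)
            counter += 1
            my_dict[sentiment] += 1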
            
    sentimentRatio = 0
    if my_dict['Positive'] > my_dict['Negative']:
        sentimentRatio = my_dict['Positive'] / my_dict['Negative']
    elif my_dict['Negative'] > my_dict['Positive']:
        sentimentRatio = -1 * (my_dict['Negative'] / my_dict['Positive'])
    
    print(sentimentRatio)
    

    

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/toddrimes/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/toddrimes/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/toddrimes/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/toddrimes/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /Users/toddrimes/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


2.274706867671692
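
The number printed above is a signed ratio: positive when positive tweets outnumber negative ones, negative in the opposite case, and 0 on a tie. The same rule can be sketched as a small function; `signed_sentiment_ratio` is a hypothetical name, and the sketch assumes both counts are at least 1, as they are in the cell above.

```python
def signed_sentiment_ratio(positive_count, negative_count):
    """Return larger/smaller count, negated when negatives dominate.

    Assumes both counts are at least 1, so no division by zero occurs.
    """
    if positive_count > negative_count:
        return positive_count / negative_count
    if negative_count > positive_count:
        return -(negative_count / positive_count)
    return 0


print(signed_sentiment_ratio(30, 10))
print(signed_sentiment_ratio(10, 40))
```

With this convention, a value of 3.0 means three positive tweets for every negative one, and -4.0 means four negative tweets for every positive one.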


Finally, let's report the overall sentiment for our keywords, based on the sign and size of the ratio printed above:

In [4]:
from IPython.display import display, Markdown
NEGATIVE = sentimentRatio < 0
POSITIVE = sentimentRatio > 0

rSentimentRatio = round(sentimentRatio)

# keyargs holds the keywords read from the header of 'fetched_tweets.txt',
# so this cell also works when the fetch cell was skipped
if NEGATIVE:
    display(Markdown("""# <span style="color:red">NEGATIVE</span> sentiment
    Overall, the tweets about {} are negative at a ratio of {}:1
    """.format(keyargs, abs(rSentimentRatio))))
if POSITIVE:
    display(Markdown("""# <span style="color:green">POSITIVE</span> sentiment
    Overall, the tweets about {} are positive at a ratio of {}:1
    """.format(keyargs, rSentimentRatio)))

# <span style="color:green">POSITIVE</span> sentiment
    Overall, the tweets about ['Trump', 'corona', 'virus'] are positive at a ratio of 2:1
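
The last goal listed at the top, predicting the sentiment of a new potential tweet, could be sketched as follows. This is an assumption-laden illustration, not the notebook's own code: `predict_sentiment` is a hypothetical helper, it defaults to plain whitespace tokenization instead of the `remove_noise(word_tokenize(...))` pipeline, and `KeywordStub` is a made-up stand-in for the trained classifier so that the example runs on its own.

```python
def predict_sentiment(text, classifier, tokenize=str.split):
    """Turn text into {token: True} features and ask the classifier for a label.

    classifier is any object with a classify(features) method, such as the
    trained model from this notebook. tokenize defaults to whitespace
    splitting (an assumption of this sketch).
    """
    tokens = [token.lower() for token in tokenize(text)]
    features = {token: True for token in tokens}
    return classifier.classify(features)


class KeywordStub:
    """Made-up stand-in for a trained classifier, for demonstration only."""

    def classify(self, features):
        return "Negative" if "never" in features else "Positive"


print(predict_sentiment("I will never order from them again", KeywordStub()))
```

In the notebook itself, one would pass the trained `classifier` and a tokenizer such as `lambda t: remove_noise(word_tokenize(t))`, so that new tweets are cleaned the same way as the training data.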
    