In [1]:
import pandas as pd
import os
import urllib.request 
import shutil
import zipfile

In [2]:
base_folder = os.getcwd()

In [3]:
# Unzip the dataset into a temporary folder (recreated from scratch on every run)
temporary_folder = os.path.join(base_folder, 'tmp')
if os.path.exists(temporary_folder):
    shutil.rmtree(temporary_folder)
os.makedirs(temporary_folder)

local_file_name = os.path.join(base_folder, "training_dataset", "trainingandtestdata.zip")


with zipfile.ZipFile(local_file_name, 'r') as zip_ref:
    zip_ref.extractall(temporary_folder)

### Load Training Dataset

The following function loads the training file and splits it into training and test datasets.

It accepts the following arguments:

* **sample_size**: the number of rows to load from the file. The whole file has 1.6 million rows, which is impractical to work with on a local machine. For the final training with the whole dataset, a Hadoop cluster is advised. If this argument is omitted, the function uses every row in the file
* **test_size_frac**: the fraction of rows that will be reserved for testing the model

In both cases the function returns two lists of dicts: one for training and another for testing.

**Note**: we convert the Pandas DataFrame to a list of dicts because the nltk package does not work with Pandas DataFrames directly
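
To make the two arguments concrete, here is a minimal sketch of the same sample-then-split idea on a toy DataFrame. The toy data and the `random_state` values are assumptions added for illustration (the notebook function does not fix a seed, so its splits change on every run).

```python
import pandas as pd

# Toy stand-in for the tweets file (hypothetical data, for illustration only)
toy = pd.DataFrame({
    "polarity": [0, 4] * 10,
    "tweet": [f"tweet number {i}" for i in range(20)],
})

# Step 1: keep a random sample of rows (the role of sample_size)
sample = toy.sample(10, random_state=42)

# Step 2: reserve a fraction of the sample for testing (the role of test_size_frac)
test_part = sample.sample(frac=0.5, random_state=42)

# Step 3: every row that was not picked for testing stays in the training set
train_part = sample.drop(test_part.index)

# Step 4: convert both parts to lists of dicts, one dict per row
train_records = train_part.to_dict("records")
test_records = test_part.to_dict("records")

print(len(train_records), len(test_records))
```

Dropping by index guarantees that no row ends up in both lists, which is what keeps the evaluation honest.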

In [4]:
def load_training_dataset(sample_size = None, test_size_frac = 0.5):
    training_dataset_path = os.path.join(
        temporary_folder, 
        "training.1600000.processed.noemoticon.csv")

    training_dataset = pd.read_csv(
        training_dataset_path, 
        encoding="latin-1", 
        on_bad_lines="warn",  # pandas >= 1.3; replaces warn_bad_lines / error_bad_lines
        header=None, 
        names=["polarity", "tweet_id", "date", "query", "user", "tweet"])
    if sample_size is not None:
        training_dataset = training_dataset.sample(sample_size)

    #training_dataset = training_dataset[["tweet_id", "polarity", "tweet"]]
    
    testing_dataset = training_dataset.sample(frac = test_size_frac)

    training_dataset = training_dataset.drop(testing_dataset.index)
 
    return training_dataset.to_dict("records"), testing_dataset.to_dict("records")

In [5]:
# Load test and training dataset for exploration
training_data, testing_data = load_training_dataset(sample_size = None, test_size_frac=.5)

In [6]:
pd.DataFrame(training_data).head(10)

Unnamed: 0,polarity,tweet_id,date,query,user,tweet
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
3,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew
4,0,1467811592,Mon Apr 06 22:20:03 PDT 2009,NO_QUERY,mybirch,Need a hug
5,0,1467811795,Mon Apr 06 22:20:05 PDT 2009,NO_QUERY,2Hood4Hollywood,@Tatiana_K nope they didn't have it
6,0,1467812416,Mon Apr 06 22:20:16 PDT 2009,NO_QUERY,erinx3leannexo,spring break in plain city... it's snowing
7,0,1467812579,Mon Apr 06 22:20:17 PDT 2009,NO_QUERY,pardonlauren,I just re-pierced my ears
8,0,1467812723,Mon Apr 06 22:20:19 PDT 2009,NO_QUERY,TLeC,@caregiving I couldn't bear to watch it. And ...
9,0,1467812784,Mon Apr 06 22:20:20 PDT 2009,NO_QUERY,bayofwolves,"@smarrison i would've been the first, but i di..."


Column definitions:

0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)

1 - the id of the tweet

2 - the date of the tweet

3 - the query. If there is no query, then this value is NO_QUERY.

4 - the user who posted the tweet

5 - the text of the tweet

In [7]:
# We noticed that there is not a single case of Neutral (2) polarity
pd.DataFrame(training_data + testing_data).polarity.value_counts()

4    800000
0    800000
Name: polarity, dtype: int64

### Pre-process Tweets

The following class prepares the dataset by:

* Extracting the text from HTML (for the training dataset provided, we already have the text, but we want to avoid using any HTML tag for classification)
* Converting all words to lower case
* Replacing any URL with the "URL" constant (so that it can be removed in a later step)
* Replacing any user mention with "USERTAGGING" (so that it can be removed in a later step)
* Removing the "#" from hashtags
* Removing punctuation (it has little or no weight on classification, as it can be used with both polarities)
* Tokenizing (creating a list of words)
* And finally, removing words that have little or no weight on classification (and can even create biases):
    * Stop words: common words that are used regardless of the sentiment (things like it, that, a, the)
    * The two constants that we used to replace user mentions and URLs
    
**Note**: we are creating a class for this process because we want to "pickle" it (serialize it and save it as a file) for use in the implementation of the streaming process 
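
As a quick illustration of the steps above, here is a standard-library-only sketch applied to one made-up tweet. It makes several simplifying assumptions: HTML extraction is skipped, `str.split` stands in for the NLTK tokenizer, and `TOY_STOPWORDS` is a tiny hypothetical stop word list rather than the NLTK one. The names `clean_text` and `tokenize_and_filter` are ours.

```python
import re
from string import punctuation

# Tiny hypothetical stop word list plus the two placeholder constants
TOY_STOPWORDS = {"the", "a", "it", "that", "USERTAGGING", "URL"}

def clean_text(text):
    """Lower-case the text, mask URLs and mentions, strip '#' and punctuation."""
    text = text.lower()
    text = re.sub(r"(www\.\S+)|(https?://\S+)", "URL", text)
    text = re.sub(r"@\S+", "USERTAGGING", text)
    text = re.sub(r"#(\S+)", r"\1", text)
    for p in punctuation:
        text = text.replace(p, "")
    return text

def tokenize_and_filter(text):
    """Split on whitespace and drop stop words and placeholder constants."""
    return [word for word in text.split() if word not in TOY_STOPWORDS]

raw = "@bob Loving the new #Python release!!! http://example.com/x"
cleaned = clean_text(raw)
tokens = tokenize_and_filter(cleaned)

print(cleaned)  # USERTAGGING loving the new python release URL
print(tokens)   # ['loving', 'new', 'python', 'release']
```

Because the text is lower-cased before the placeholders are inserted, the upper-case constants cannot collide with real words from the tweet, which makes them safe to filter out at the end.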

In [8]:
import re
from nltk.tokenize import word_tokenize
from string import punctuation 
from nltk.corpus import stopwords 
from bs4 import BeautifulSoup

class PreProcessTweets:
    def __init__(self):
        self._stopwords = set(stopwords.words("english") + ["USERTAGGING","URL"])
        
    def processTweets(self, list_of_tweets):
        processedTweets=[]
        for tweet in list_of_tweets:
            processedTweets.append(
                (
                    self.processTweet(tweet["tweet"]),
                    tweet["polarity"]                    
                )
            )
        return processedTweets
    
    def processTweet(self, tweet):
        tweet = BeautifulSoup(tweet, "html.parser").get_text() # Extracts text from HTML (just in case!)
        tweet = tweet.lower() # Converts text to lower-case
        tweet = re.sub(r"((www\.[^\s]+)|(https?://[^\s]+))", "URL", tweet) # Replaces URLs with the URL constant
        tweet = re.sub(r"@[^\s]+", "USERTAGGING", tweet) # Replaces usernames with the USERTAGGING constant
        tweet = re.sub(r"#([^\s]+)", r"\1", tweet) # Removes the # in #hashtag
        for p in punctuation: 
            tweet = tweet.replace(p, "") # Removes punctuation
        tweet = word_tokenize(tweet) # Creates a list of words
        return [word for word in tweet if word not in self._stopwords]

In [9]:
# Load test and training dataset for modeling
training_data, testing_data = load_training_dataset(sample_size = 10000, test_size_frac=.5)

In [10]:
#Preprocessing Tweets
tweet_processor = PreProcessTweets()
pp_training_data = tweet_processor.processTweets(training_data)

In [11]:
# Let's take a look at what some tweets look like after cleansing and tokenization
pp_training_data[:4]

[(['homework'], 0),
 (['ill'], 4),
 (['dont',
   'suppose',
   'stockists',
   'sri',
   'lanka',
   'miss',
   'ss',
   'sea',
   'island',
   'cotton',
   'numbers'],
  0),
 (['noticed', 'rat', 'kitchen'], 0)]

### Build Vocabulary

The function below builds the vocabulary, that is, the list of all words that we are going to use to train our model and later to evaluate each tweet.

Some people argue that it is better to focus on the most frequent words (e.g. the 2500 most used in our training dataset) and/or on the words that appear in the largest number of documents (in our case, tweets).

Since this project is focused not on the accuracy of the model itself but on the implementation of a pipeline that uses a model, we are going to use all words.
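
For reference, restricting the vocabulary to the most frequent words could look like the sketch below. It is not used in this notebook. The function name `build_limited_vocabulary` and the toy data are hypothetical, and it relies on `collections.Counter` from the standard library (in NLTK 3, `FreqDist` subclasses `Counter`, so `most_common` should be available there as well).

```python
from collections import Counter

def build_limited_vocabulary(preprocessed_dataset, max_words):
    """Return the max_words most frequent words across all tweets."""
    counts = Counter()
    for words, _polarity in preprocessed_dataset:
        counts.update(words)
    return [word for word, _count in counts.most_common(max_words)]

# Same shape as the preprocessed data: (list of words, polarity)
toy_dataset = [
    (["good", "day"], 4),
    (["bad", "day"], 0),
    (["good", "good", "movie"], 4),
]

limited = build_limited_vocabulary(toy_dataset, 2)
print(limited)  # ['good', 'day']
```

Calling `counts.update(set(words))` instead would count each word once per tweet, which ranks words by the number of tweets they appear in rather than by raw frequency.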

In [12]:
import nltk 

def build_vocabulary(preprocessed_training_dataset):
    all_words = []
    
    for (words, polarity) in preprocessed_training_dataset:
        all_words.extend(words)

    word_list = nltk.FreqDist(all_words)
    word_features = word_list.keys()
    
    return word_features

In [13]:
# Then we build our vocabulary
word_features = build_vocabulary(pp_training_data)

In [14]:
# and let's take a look at it:
word_features

dict_keys(['homework', 'ill', 'dont', 'suppose', 'stockists', 'sri', 'lanka', 'miss', 'ss', 'sea', 'island', 'cotton', 'numbers', 'noticed', 'rat', 'kitchen', 'ohi', 'thought', 'gon', 'na', 'something', 'saucy', 'working', 'oral', 'presentation', 'upset', 'doesnt', 'hate', 'wan', 'leave', 'schhol', '3', 'weeks', 'exams', 'hiya', 'upto', 'much', 'cant', 'decide', 'watching', 'reading', 'skydiving', 'making', 'better', 'worse', 'im', 'really', 'trying', 'patient', 'voted', 'neither', 'two', 'congrats', 'everyone', 'graduated', 'last', 'week', 'including', 'feels', 'happy', 'going', 'work', '8', 'enjoying', 'listening', 'perfect', 'mt', 'many', 'want', 'lol', 'need', 'figure', 'put', 'keep', 'job', 'celebrities', '4low', 'w', 'iz', 'dat', 'dnt', 'unfair', 'good', 'rock', 'shows', 'happen', 'blore', 'pune', 'ahh', 'macbook', 'tooo', 'fun', 'heheh', 'awwsome', 'pouring', 'rain', 'awesome', 'day', 'also', 'sephora', 'forever21', 'taking', 'long', 'open', 'thank', 'behalf', 'sharon', 'gratefu

### Generating Features
The function below needs to be called for each tweet. It walks through the vocabulary built previously and produces a dictionary that flags each vocabulary word with True if it appears in that specific tweet and with False otherwise. Thus, the majority of words will be tagged as False and a small number of them (the ones contained in the tweet) as True.

**To-do**: this should also be encapsulated in a class so that it can be pickled. Or maybe encapsulate the whole code?
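
One possible shape for that to-do is sketched below, under the assumption that the vocabulary travels together with the extraction logic. The class name `FeatureExtractor` is hypothetical and not part of this notebook. The vocabulary is copied into a plain list because a `dict_keys` view (what `keys()` returns) cannot be pickled.

```python
import pickle

class FeatureExtractor:  # hypothetical name, not part of the notebook
    def __init__(self, vocabulary):
        # Copy into a list: dict_keys views are not picklable
        self._vocabulary = list(vocabulary)

    def extract_features(self, tweet_words):
        """Flag every vocabulary word as present (True) or absent (False)."""
        present = set(tweet_words)
        return {"contains(%s)" % word: (word in present)
                for word in self._vocabulary}

extractor = FeatureExtractor(["homework", "ill", "rat"])

# Round trip through pickle, as the streaming process would do with a file
restored = pickle.loads(pickle.dumps(extractor))

features = restored.extract_features(["noticed", "rat", "kitchen"])
print(features)
# {'contains(homework)': False, 'contains(ill)': False, 'contains(rat)': True}
```

Words that are not in the vocabulary ("noticed", "kitchen") simply produce no feature, which is also how unseen words in new tweets would be handled.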

In [15]:
def extract_features(tweet):
    tweet_words=set(tweet)
    features={}
    for word in word_features:
        features['contains(%s)' % word]=(word in tweet_words)
    return features 

In [16]:
# Building the training features
training_features = nltk.classify.apply_features(extract_features,pp_training_data)

In [17]:
# And taking a look into it
training_features

[({'contains(homework)': True, 'contains(ill)': False, 'contains(dont)': False, 'contains(suppose)': False, 'contains(stockists)': False, 'contains(sri)': False, 'contains(lanka)': False, 'contains(miss)': False, 'contains(ss)': False, 'contains(sea)': False, 'contains(island)': False, 'contains(cotton)': False, 'contains(numbers)': False, 'contains(noticed)': False, 'contains(rat)': False, 'contains(kitchen)': False, 'contains(ohi)': False, 'contains(thought)': False, 'contains(gon)': False, 'contains(na)': False, 'contains(something)': False, 'contains(saucy)': False, 'contains(working)': False, 'contains(oral)': False, 'contains(presentation)': False, 'contains(upset)': False, 'contains(doesnt)': False, 'contains(hate)': False, 'contains(wan)': False, 'contains(leave)': False, 'contains(schhol)': False, 'contains(3)': False, 'contains(weeks)': False, 'contains(exams)': False, 'contains(hiya)': False, 'contains(upto)': False, 'contains(much)': False, 'contains(cant)': False, 'contain

### Training the model
And finally, we are going to train the model using Naive Bayes. We could have tried other classification algorithms but again, the main purpose of this project is the implementation of the pipeline, not the accuracy of the model
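
As a reminder of what the classifier estimates: Naive Bayes picks the polarity $c$ with the highest posterior probability, under the assumption that the features are conditionally independent given the class. In the notation below (ours, not NLTK's), $f_1, \dots, f_n$ are the boolean `contains(word)` features of a tweet:

$$
\hat{c} = \arg\max_{c \in \{0, 4\}} \; P(c) \prod_{i=1}^{n} P(f_i \mid c)
$$

Here $P(c)$ is estimated from the share of training tweets with polarity $c$, and $P(f_i \mid c)$ from how often feature $f_i$ takes its observed value among training tweets of that polarity. The smoothing applied by the NLTK implementation is left out of this sketch.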

In [18]:
NBayesClassifier = nltk.NaiveBayesClassifier.train(training_features)

### Using the model
The following code uses the trained model to classify each of the tweets in our testing dataset

Note that before classifying, we need to apply the same preprocessing (cleansing and tokenizing) that we built before and extract the features using our vocabulary

In [19]:
li = []
for each_tweet in testing_data:
    words = tweet_processor.processTweet(each_tweet["tweet"])
    row = {
        "polarity": each_tweet["polarity"],
        "tweet_id": each_tweet["tweet_id"],
        "date": each_tweet["date"],
        "query": each_tweet["query"],
        "user": each_tweet["user"],
        "tweet": each_tweet["tweet"],
        "predicted": NBayesClassifier.classify(extract_features(words))
    }

    li.append(row)                                

The next code snippet just creates a Pandas DataFrame with the results of our prediction, along with some variables that we are going to use in our evaluation of the model

In [20]:
final_dataset = pd.DataFrame(li)
Y_test = final_dataset["polarity"]
predicted = final_dataset["predicted"]
final_dataset

Unnamed: 0,polarity,tweet_id,date,query,user,tweet,predicted
0,4,1980018400,Sun May 31 05:23:40 PDT 2009,NO_QUERY,rapioliez,i wanna be rich and i want lots of money..bhaha,0
1,4,1988690191,Sun May 31 23:05:40 PDT 2009,NO_QUERY,colby_h20polo,im tired [[: have a good night. eeehhh only...,0
2,4,2174959418,Sun Jun 14 23:43:54 PDT 2009,NO_QUERY,NaChInGoTe,miley you look at what I write? i love you,4
3,4,1883516163,Fri May 22 08:40:32 PDT 2009,NO_QUERY,lisar1167,@jrmarykdkb LOL. I too am in a perpetual state...,4
4,4,2070172155,Sun Jun 07 17:13:14 PDT 2009,NO_QUERY,WasWizFanPatrol,@iamdiddy nice can u take a pic wit any pla...,0
...,...,...,...,...,...,...,...
4995,0,1677954049,Sat May 02 04:50:44 PDT 2009,NO_QUERY,McFlyer4ever,is really nervous 4 her audition any suggesti...,4
4996,4,2063399951,Sun Jun 07 02:30:23 PDT 2009,NO_QUERY,ZombieAssassin,Been thinking it might be kind of cool to use ...,4
4997,4,2178102639,Mon Jun 15 07:11:52 PDT 2009,NO_QUERY,tinfoiltiarra,"@johncmayer keep doing your thing, man. you're...",4
4998,0,1835813131,Mon May 18 06:48:15 PDT 2009,NO_QUERY,OrangeMagpie,The monotony of work isn't helping to inspire ...,0


Here is the confusion matrix (as a reminder, we did not use the whole training dataset, just a sample of it)

In [21]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print("Confusion Matrix:\n", confusion_matrix(Y_test,predicted))

Confusion Matrix:
 [[1784  774]
 [ 718 1724]]


And here is our classification report

In [22]:
print("Classification Report:\n", classification_report(Y_test,predicted))

Classification Report:
               precision    recall  f1-score   support

           0       0.71      0.70      0.71      2558
           4       0.69      0.71      0.70      2442

    accuracy                           0.70      5000
   macro avg       0.70      0.70      0.70      5000
weighted avg       0.70      0.70      0.70      5000
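
The figures in the report can be reproduced by hand from the confusion matrix printed earlier. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, both ordered as 0 then 4) and uses plain Python only; the variable names are ours.

```python
# Confusion matrix copied from the output above
matrix = [[1784, 774],
          [718, 1724]]
labels = [0, 4]

total = sum(sum(row) for row in matrix)
# Accuracy: correctly classified tweets (the diagonal) over all tweets
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total

report = {}
for i, label in enumerate(labels):
    true_positives = matrix[i][i]
    predicted_as_label = sum(row[i] for row in matrix)  # column sum
    actually_label = sum(matrix[i])                     # row sum (the support)
    precision = true_positives / predicted_as_label
    recall = true_positives / actually_label
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (precision, recall, f1)
    print(label, f"{precision:.2f} {recall:.2f} {f1:.2f}")

print("accuracy", round(accuracy, 4))
```

For polarity 0, for example, precision is 1784 / (1784 + 718) and recall is 1784 / (1784 + 774), which round to the 0.71 and 0.70 shown in the report.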



Just extracting the accuracy (with more precision....hahaha)

In [23]:
print("Accuracy:\n", accuracy_score(Y_test, predicted))

Accuracy:
 0.7016


In [24]:
# Delete temporary folder
if os.path.exists(temporary_folder):
    shutil.rmtree(temporary_folder)