In [1]:
import pandas as pd
import os

In [2]:
base_folder = os.getcwd()

### Load Training Dataset

The following function loads the training file and splits it into training and test datasets

It receives the following args:

* **sample_size**: the number of rows from the file that we want to load. The whole file has 1.6 million rows, and it is impractical to work with this amount on a local machine. For the final training with the whole dataset, a Hadoop cluster is advised. If the arg is not provided, the function returns all the rows, split into two lists of dicts: one for training and another for testing
* **test_size_frac**: the fraction of rows that will be reserved for testing the model

**Note**: we are converting the Pandas DataFrame to a list of dicts because the nltk package does not work with Pandas DataFrames directly

In [3]:
def load_training_dataset(sample_size = None, test_size_frac = 0.5):
    training_dataset_path = os.path.join(
        base_folder, 
        "trainingandtestdata", 
        "training.1600000.processed.noemoticon.csv")

    training_dataset = pd.read_csv(
        training_dataset_path, 
        encoding="latin-1", 
        on_bad_lines="warn", # Skips malformed lines and warns about them (pandas >= 1.3)
        header=None, 
        names=["polarity", "tweet_id", "date", "query", "user", "tweet"])
    if sample_size is not None:
        training_dataset = training_dataset.sample(sample_size)

    #training_dataset = training_dataset[["tweet_id", "polarity", "tweet"]]
    
    testing_dataset = training_dataset.sample(frac = test_size_frac)

    training_dataset = training_dataset.drop(testing_dataset.index)
 
    return training_dataset.to_dict("records"), testing_dataset.to_dict("records")
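
To see the split and the conversion in isolation, here is a minimal sketch on a toy DataFrame. The rows and the `random_state` seed are made up for illustration (the function above does not fix a seed, so its split differs between runs)

```python
import pandas as pd

# Toy stand-in for the real CSV (hypothetical rows, same column names)
toy = pd.DataFrame({
    "polarity": [0, 4, 0, 4, 0, 4],
    "tweet": ["need a hug", "crazy good news", "so tired",
              "love it", "missed the bus", "best day"],
})

# Reserve half of the rows for testing; random_state is an assumption added
# here only to make the example reproducible
toy_test = toy.sample(frac=0.5, random_state=0)

# Everything that was not sampled for testing stays in the training set
toy_train = toy.drop(toy_test.index)

# Each row becomes one dict keyed by column name
train_records = toy_train.to_dict("records")
test_records = toy_test.to_dict("records")

print(len(train_records), len(test_records))
```

Because the test rows are dropped by index, the two lists never share a row.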

In [4]:
# Load test and training dataset for exploration
training_data, testing_data = load_training_dataset(sample_size = None, test_size_frac=.5)

In [5]:
pd.DataFrame(training_data).head(10)

Unnamed: 0,polarity,tweet_id,date,query,user,tweet
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467811592,Mon Apr 06 22:20:03 PDT 2009,NO_QUERY,mybirch,Need a hug
3,0,1467811795,Mon Apr 06 22:20:05 PDT 2009,NO_QUERY,2Hood4Hollywood,@Tatiana_K nope they didn't have it
4,0,1467812025,Mon Apr 06 22:20:09 PDT 2009,NO_QUERY,mimismo,@twittera que me muera ?
5,0,1467812416,Mon Apr 06 22:20:16 PDT 2009,NO_QUERY,erinx3leannexo,spring break in plain city... it's snowing
6,0,1467812579,Mon Apr 06 22:20:17 PDT 2009,NO_QUERY,pardonlauren,I just re-pierced my ears
7,0,1467812771,Mon Apr 06 22:20:19 PDT 2009,NO_QUERY,robrobbierobert,"@octolinz16 It it counts, idk why I did either..."
8,0,1467812964,Mon Apr 06 22:20:22 PDT 2009,NO_QUERY,lovesongwriter,Hollis' death scene will hurt me severely to w...
9,0,1467813137,Mon Apr 06 22:20:25 PDT 2009,NO_QUERY,armotley,about to file taxes


Column definitions:

0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)

1 - the id of the tweet

2 - the date of the tweet

3 - the query. If there is no query, then this value is NO_QUERY.

4 - the user that tweeted

5 - the text of the tweet

In [6]:
# We noticed that there is not a single case of Neutral (2) polarity
pd.DataFrame(training_data + testing_data).polarity.value_counts()

4    800000
0    800000
Name: polarity, dtype: int64

### Pre-process Tweets

The following class prepares the dataset by:

* Extracting the text from HTML (for the training dataset provided, we already have the text, but we want to avoid using any HTML tag for classification)
* Converting all words to lower case
* Replacing any URL with the "URL" constant (to enable its removal in a later step)
* Replacing any user tagging with "USERTAGGING" (to enable its removal in a later step)
* Removing any "#" from hashtags
* Removing punctuation (it has little or no weight on classification, as it can be used for both intentions)
* Tokenizing (creating a list of words)
* And finally, removing words and punctuation that have little or no weight on classification (and can even create biases):
    * Stop words: a set of common words that are used regardless of the intention (things like it, that, a, the)
    * Remove the two constants that we used to replace user tagging and URLs
    
**Note**: we are creating a class for this process because we want to "pickle" it (serialize and save it as a file) for use in the implementation of the streaming process

In [7]:
import re
from nltk.tokenize import word_tokenize
from string import punctuation 
from nltk.corpus import stopwords 
from bs4 import BeautifulSoup

class PreProcessTweets:
    def __init__(self):
        self._stopwords = set(stopwords.words("english") + ["USERTAGGING","URL"])
        
    def processTweets(self, list_of_tweets):
        processedTweets=[]
        for tweet in list_of_tweets:
            processedTweets.append(
                (
                    self.processTweet(tweet["tweet"]),
                    tweet["polarity"]                    
                )
            )
        return processedTweets
    
    def processTweet(self, tweet):
        tweet = BeautifulSoup(tweet, "html.parser").get_text() # Extracts text from HTML (just in case!)
        tweet = tweet.lower() # Converts text to lower-case
        tweet = re.sub(r"((www\.[^\s]+)|(https?://[^\s]+))", "URL", tweet) # Replaces URLs with the URL constant
        tweet = re.sub(r"@[^\s]+", "USERTAGGING", tweet) # Replaces usernames with the USERTAGGING constant
        tweet = re.sub(r"#([^\s]+)", r"\1", tweet) # Removes the # in #hashtag
        for p in punctuation: 
            tweet = tweet.replace(p, "") # Removes punctuation
        tweet = word_tokenize(tweet) # Creates a list of words
        return [word for word in tweet if word not in self._stopwords]
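
To see what each step does to a single tweet, here is a rough sketch that uses only the standard library. It is not the class above: the HTML extraction is skipped, `str.split` stands in for `word_tokenize`, and `clean_tweet`, `TOY_STOPWORDS` and the sample tweet are hypothetical names and data made up for illustration

```python
import re
from string import punctuation

# Hypothetical, tiny stop-word set; the class above uses NLTK's English list
TOY_STOPWORDS = {"it", "its", "that", "a", "the", "USERTAGGING", "URL"}

def clean_tweet(tweet):
    tweet = tweet.lower()
    # URLs and user tags become upper-case constants after lower-casing,
    # so they cannot collide with ordinary (lower-case) words
    tweet = re.sub(r"((www\.[^\s]+)|(https?://[^\s]+))", "URL", tweet)
    tweet = re.sub(r"@[^\s]+", "USERTAGGING", tweet)
    # Keep the hashtag text, drop only the leading '#'
    tweet = re.sub(r"#([^\s]+)", r"\1", tweet)
    for p in punctuation:
        tweet = tweet.replace(p, "")
    words = tweet.split()  # simple stand-in for word_tokenize
    return [w for w in words if w not in TOY_STOPWORDS]

result = clean_tweet("@bob Check https://example.com, it's #awesome!")
print(result)  # ['check', 'awesome']
```

Note that punctuation is removed without inserting a space, so "it's" becomes "its". Words separated only by punctuation get merged in the same way, which is presumably how a token like 'mopedwe' can show up in the processed output.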

In [8]:
# Load test and training dataset for modeling
training_data, testing_data = load_training_dataset(sample_size = 10000, test_size_frac=.5)

In [9]:
#Preprocessing Tweets
tweet_processor = PreProcessTweets()
pp_training_data = tweet_processor.processTweets(training_data)

In [10]:
# Let's take a look at what some tweets look like after cleansing and tokenization
pp_training_data[:4]

[(['agree'], 4),
 (['crazy', 'good', 'news'], 4),
 (['numbers', 'arent', 'everything'], 4),
 (['got', 'passed', 'mopedwe', 'cant', 'friends', 'anymore', 'lol'], 0)]

### Build Vocabulary

The function below builds the vocabulary, that is, the list of all words that we are going to use to train our model and later to evaluate each tweet

Some people argue that it is better to focus on the most used words (e.g. the 2500 most used in our training dataset) and/or the words that are present in the most documents (in our case, tweets)

For the sake of this project, as it is focused not on the accuracy of the model itself but on the implementation of a pipeline using a model, we are going to use all words

In [11]:
import nltk 

def build_vocabulary(preprocessed_training_dataset):
    all_words = []
    
    for (words, polarity) in preprocessed_training_dataset:
        all_words.extend(words)

    word_list = nltk.FreqDist(all_words)
    word_features = word_list.keys()
    
    return word_features
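
For comparison, the "most used words" alternative mentioned above could look roughly like the sketch below. It is not used in this project; `build_top_n_vocabulary` and the toy data are hypothetical, and `collections.Counter` is used so that the example runs without NLTK (`nltk.FreqDist` offers the same `most_common` method)

```python
from collections import Counter

def build_top_n_vocabulary(preprocessed_dataset, n):
    # Count how often each word appears across all tweets
    counts = Counter()
    for words, _polarity in preprocessed_dataset:
        counts.update(words)
    # Keep only the n most frequent words
    return [word for word, _count in counts.most_common(n)]

# Toy data in the same (words, polarity) shape as pp_training_data
toy_dataset = [
    (["good", "news"], 4),
    (["good", "day"], 4),
    (["bad", "news"], 0),
    (["good"], 4),
]

top_words = build_top_n_vocabulary(toy_dataset, 2)
print(top_words)  # ['good', 'news']
```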

In [12]:
# Then we build our vocabulary
word_features = build_vocabulary(pp_training_data)

In [13]:
# and let's take a look at it:
word_features



### Generating Features
The function below needs to be called for each one of the tweets. It builds a dictionary with one entry per word of the vocabulary previously built, tagging each word with True if it is used in that specific tweet and with False otherwise. Thus, the majority of words will be tagged as False and a small number of them (the ones contained in the tweet) as True

**To-do**: this should also be encapsulated in a class in order to have it pickled. Or maybe encapsulate the whole code?

In [14]:
def extract_features(tweet):
    tweet_words=set(tweet)
    features={}
    for word in word_features:
        features['contains(%s)' % word]=(word in tweet_words)
    return features 
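
As an illustration of the resulting dictionary, here is a small sketch with a made-up four-word vocabulary. The function name `extract_features_with_vocab` and the explicit `vocabulary` parameter are assumptions for this example; the function above reads the global `word_features` instead

```python
def extract_features_with_vocab(tweet_words, vocabulary):
    # One entry per vocabulary word: True if the tweet contains it
    present = set(tweet_words)
    return {"contains(%s)" % word: (word in present) for word in vocabulary}

# Hypothetical vocabulary and tweet, for illustration only
toy_vocabulary = ["good", "news", "bad", "day"]
features = extract_features_with_vocab(["good", "day", "sunshine"], toy_vocabulary)

print(features)
```

Words that are not in the vocabulary (here "sunshine") simply do not appear in the features. Passing the vocabulary as an argument is one possible way to make the function easier to pickle together with its data, along the lines of the to-do above.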

In [15]:
# Building the training features
training_features = nltk.classify.apply_features(extract_features,pp_training_data)

In [16]:
# And taking a look into it
training_features



### Training the model
And finally, we are going to train the model using Naive Bayes. We could have tried other classification algorithms but again, the main purpose of this project is the implementation of the pipeline, not the accuracy of the model

In [17]:
NBayesClassifier = nltk.NaiveBayesClassifier.train(training_features)
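
For reference, the general Naive Bayes decision rule can be written as below. This is the textbook form, where $c$ is a polarity label and $f_1, \dots, f_n$ are the `contains(word)` features of one tweet; NLTK's implementation details, such as how it estimates each probability, are not covered here

```latex
\hat{c} = \arg\max_{c \in \{0, 4\}} \; P(c) \prod_{i=1}^{n} P(f_i \mid c)
```

The "naive" part is the assumption that the features are independent of each other given the label, which is what allows the product over individual words.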

### Using the model
The following code uses the trained model to classify each one of the tweets of our testing dataset

Note that before we do the classification, we need to apply the preprocessing (cleansing and tokenizing) that we built before and extract the features using our vocabulary

In [18]:
li = []
for each_tweet in testing_data:
    words = tweet_processor.processTweet(each_tweet["tweet"])
    row = {
        "polarity": each_tweet["polarity"],
        "tweet_id": each_tweet["tweet_id"],
        "date": each_tweet["date"],
        "query": each_tweet["query"],
        "user": each_tweet["user"],
        "tweet": each_tweet["tweet"],
        "predicted": NBayesClassifier.classify(extract_features(words))
    }

    li.append(row)                                

The next code snippet just creates a Pandas DataFrame with the results of our prediction along with some variables that we are going to use on our evaluation of the model

In [19]:
final_dataset = pd.DataFrame(li)
Y_test = final_dataset["polarity"]
predicted = final_dataset["predicted"]
final_dataset

Unnamed: 0,polarity,tweet_id,date,query,user,tweet,predicted
0,0,2323119233,Wed Jun 24 23:57:08 PDT 2009,NO_QUERY,goofyholly,Transformers 2 was soldout today for the 12.10...,0
1,4,2066538439,Sun Jun 07 10:42:32 PDT 2009,NO_QUERY,chelschaos,@AngelIbarra hope the London show is awesome ...,4
2,0,2244241112,Fri Jun 19 14:29:22 PDT 2009,NO_QUERY,BeeMusick,"at the mall :/ getting presents for dia, my po...",0
3,4,1695029080,Mon May 04 04:58:38 PDT 2009,NO_QUERY,LaBarceloneta,"@msstacy13 Well, thanks for thinking of me! An...",4
4,4,1970774217,Sat May 30 05:09:50 PDT 2009,NO_QUERY,akalata,@changeorder @askrom is now @chrisfahey,4
...,...,...,...,...,...,...,...
4995,4,1793054749,Thu May 14 01:11:58 PDT 2009,NO_QUERY,Lysh_Here,All the tests are OVER! Its the best feeling!,4
4996,4,2059550344,Sat Jun 06 17:09:29 PDT 2009,NO_QUERY,KellyJohnson85,@midnightsunco Of course!,4
4997,4,1932311368,Tue May 26 21:13:10 PDT 2009,NO_QUERY,chrissyx14,wondering why she is still grounded.. lol good...,4
4998,0,2203821501,Wed Jun 17 00:36:59 PDT 2009,NO_QUERY,thenoblesavage,"@johnhalton Yeah, wish I had said that, or sim...",4


Here is the Confusion Matrix (as a reminder, we did not use the whole training dataset, just a sample of it)

In [20]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print("Confusion Matrix:\n", confusion_matrix(Y_test,predicted))

Confusion Matrix:
 [[1774  777]
 [ 681 1768]]


And here is our classification report

In [21]:
print("Classification Report:\n", classification_report(Y_test,predicted))

Classification Report:
               precision    recall  f1-score   support

           0       0.72      0.70      0.71      2551
           4       0.69      0.72      0.71      2449

    accuracy                           0.71      5000
   macro avg       0.71      0.71      0.71      5000
weighted avg       0.71      0.71      0.71      5000
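
The numbers in the report can be re-derived by hand from the confusion matrix printed earlier, as a quick sanity check. The sketch below assumes scikit-learn's convention that rows are the true labels and columns are the predicted labels, both in the order [0, 4]; the variable names are made up for illustration

```python
# Confusion matrix copied from the output above
matrix = [[1774, 777],
          [681, 1768]]

true_0_pred_0, true_0_pred_4 = matrix[0]
true_4_pred_0, true_4_pred_4 = matrix[1]
total = sum(sum(row) for row in matrix)

# Accuracy: share of all tweets that were classified correctly
accuracy = (true_0_pred_0 + true_4_pred_4) / total

# Precision: of the tweets predicted as a class, how many really belong to it
precision_0 = true_0_pred_0 / (true_0_pred_0 + true_4_pred_0)
precision_4 = true_4_pred_4 / (true_4_pred_4 + true_0_pred_4)

# Recall: of the tweets that really belong to a class, how many were found
recall_0 = true_0_pred_0 / (true_0_pred_0 + true_0_pred_4)
recall_4 = true_4_pred_4 / (true_4_pred_4 + true_4_pred_0)

print(accuracy)
print(round(precision_0, 2), round(recall_0, 2))
print(round(precision_4, 2), round(recall_4, 2))
```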



Just extracting the accuracy (with more precision....hahaha)

In [22]:
print("Accuracy:\n", accuracy_score(Y_test, predicted))

Accuracy:
 0.7084
