To this end, we will be using the Sentiment140 dataset, which contains data collected from Twitter.

# 1. Importing and Discovering the dataset

In [9]:
from time import time
import random

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [10]:
#Reading the dataset, which has no column titles, using Latin-1 (ISO-8859-1) encoding
df_raw = pd.read_csv("training.1600000.processed.noemoticon.csv", encoding= "ISO-8859-1", header = None)

df_raw.columns = ["Labels", "Time" , "date", "query", "Username" , "text"]

df_raw.head(10) # show first 10 rows of this dataset

Unnamed: 0,Labels,Time,date,query,Username,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
5,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew
6,0,1467811592,Mon Apr 06 22:20:03 PDT 2009,NO_QUERY,mybirch,Need a hug
7,0,1467811594,Mon Apr 06 22:20:03 PDT 2009,NO_QUERY,coZZ,@LOLTrish hey long time no see! Yes.. Rains a...
8,0,1467811795,Mon Apr 06 22:20:05 PDT 2009,NO_QUERY,2Hood4Hollywood,@Tatiana_K nope they didn't have it
9,0,1467812025,Mon Apr 06 22:20:09 PDT 2009,NO_QUERY,mimismo,@twittera que me muera ?
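Since only the label and text columns are used later on, the column names and the column selection can also be passed directly to `pd.read_csv`. The following is a minimal sketch on two made-up rows that mimic the six-column, header-less layout shown above; the sample rows and the name `df_small` are illustrative assumptions, not part of the dataset. With the real file you would also pass `encoding="ISO-8859-1"`.

```python
import io

import pandas as pd

# Two made-up rows in the same six-column, header-less layout as the dataset.
sample_csv = io.StringIO(
    '0,1,"Mon Apr 06 22:19:45 PDT 2009",NO_QUERY,user_a,"Need a hug"\n'
    '4,2,"Mon Apr 06 22:19:49 PDT 2009",NO_QUERY,user_b,"great day"\n'
)

column_names = ["Labels", "Time", "date", "query", "Username", "text"]

# names= assigns the titles while reading; usecols= keeps only what we need,
# so the unused columns are never loaded into memory.
df_small = pd.read_csv(
    sample_csv,
    header=None,
    names=column_names,
    usecols=["Labels", "text"],
)
print(df_small)
```

Loading fewer columns can noticeably reduce memory use on a file of this size, although the exact saving depends on the machine and the pandas version.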


In [11]:
# Checking the data's output balance
# the label '4' denotes positive and '0' denotes negative sentiment
df_raw['Labels'].value_counts()

0    800000
4    800000
Name: Labels, dtype: int64

In [12]:
#Omitting every column except the text and label, as we won't need any of the other information
df = df_raw[["Labels",'text']]
df.head()

Unnamed: 0,Labels,text
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."


As our dataset is large (1,600,000 rows), working with it on a regular machine is very challenging. For this reason, we will trim our dataframe to 1/4th of its original size.
As a balanced label distribution generally helps a classifier avoid favouring one class, we will make sure to keep the two classes equally represented while trimming the dataframe.

In [13]:
# Separating positive and negative rows
df_pos = df[df["Labels"] == 4]
df_neg = df[df["Labels"] == 0]
print(len(df_pos), len(df_neg))

800000 800000


In [14]:
#Retaining only the first 1/4th of the rows from each label group
df_pos = df_pos[:len(df_pos) // 4]
df_neg = df_neg[:len(df_neg) // 4]
print(len(df_pos), len(df_neg))

200000 200000


In [18]:
#concatenating both positive and negative groups and storing them back into a single dataframe
df = pd.concat([df_pos , df_neg])
len(df)

400000
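The cells above keep the first quarter of each class in file order. If the order of the file carries any pattern (for example, tweets sorted by date), a random sample per class may be more representative. The sketch below shows one possible way to do this with `groupby(...).sample(...)` on a small toy dataframe; the toy data, the name `balanced`, and the `random_state` value are assumptions for illustration, and this is an alternative to the notebook's approach rather than what it actually does.

```python
import pandas as pd

# Toy dataframe: 8 negative (0) and 8 positive (4) rows.
toy = pd.DataFrame({
    "Labels": [0] * 8 + [4] * 8,
    "text": [f"tweet {i}" for i in range(16)],
})

# Draw 25% of the rows at random from each label group, so the
# classes stay balanced after trimming.
balanced = toy.groupby("Labels").sample(frac=0.25, random_state=0)

counts = balanced["Labels"].value_counts()
print(counts)
```

Fixing `random_state` keeps the sample reproducible between runs.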

# 2. Cleaning and Processing the data

### 2.1 Tokenization

In order to feed our text data to a classification model, we first need to tokenize it.
Tokenization is the process of splitting a piece of text into a list of individual words, or tokens.

Here we will use TweetTokenizer, a Twitter-aware tokenizer provided by the nltk library.
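To make the idea concrete, here is a toy tokenizer built only on Python's `re` module. It is a deliberately simplified stand-in and not NLTK's algorithm: the pattern and the function name `simple_tweet_tokenize` are assumptions made for this illustration. It keeps URLs, @mentions and #hashtags as single tokens and splits punctuation into separate tokens.

```python
import re

# Hypothetical, simplified pattern, tried left to right:
# URLs, @mentions / #hashtags, words (with an optional apostrophe part),
# and finally any single non-space symbol.
TOKEN_PATTERN = re.compile(r"https?://\S+|[@#]\w+|\w+(?:'\w+)?|[^\w\s]")


def simple_tweet_tokenize(text):
    """Split a tweet into a list of tokens (toy stand-in, not NLTK)."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tweet_tokenize(
    "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer!!"
)
print(tokens)
```

A general-purpose word tokenizer would typically break the mention and the URL into several pieces, which is why a Twitter-aware tokenizer is the more natural choice for this dataset.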

In [29]:
start_time = time()
from nltk.tokenize import TweetTokenizer
#the reduce_len parameter will allow a maximum of 3 consecutive repeating characters, while trimming the rest.
#for example, it will transform the word: 'Helooooooooo' to: 'Helooo'
tk = TweetTokenizer(reduce_len=True)
data = []

#separating our features (text) and our labels into two lists to simplify our work.
X = df['text'].tolist()
Y = df['Labels'].tolist()

#Building our data list: a list of tuples, where each tuple pairs the tokenized text
#with its label (1 for positive, 0 for negative).
#The loop variables are named text and label so that the X and Y lists are not overwritten.
for text, label in zip(X, Y):
    if label == 4:
        data.append((tk.tokenize(text), 1))
    else:
        data.append((tk.tokenize(text), 0))

#printing the elapsed time and the first 5 elements of our 'data' list
#(note: time() measures wall-clock time, not CPU time)
print('CPU Time', time() - start_time)

print(data[:5])

CPU Time 39.68032121658325
[(['I', 'LOVE', '@Health4UandPets', 'u', 'guys', 'r', 'the', 'best', '!', '!'], 1), (['im', 'meeting', 'up', 'with', 'one', 'of', 'my', 'besties', 'tonight', '!', 'Cant', 'wait', '!', '!', '-', 'GIRL', 'TALK', '!', '!'], 1), (['@DaRealSunisaKim', 'Thanks', 'for', 'the', 'Twitter', 'add', ',', 'Sunisa', '!', 'I', 'got', 'to', 'meet', 'you', 'once', 'at', 'a', 'HIN', 'show', 'here', 'in', 'the', 'DC', 'area', 'and', 'you', 'were', 'a', 'sweetheart', '.'], 1), (['Being', 'sick', 'can', 'be', 'really', 'cheap', 'when', 'it', 'hurts', 'too', 'much', 'to', 'eat', 'real', 'food', 'Plus', ',', 'your', 'friends', 'make', 'you', 'soup'], 1), (['@LovesBrooklyn2', 'he', 'has', 'that', 'effect', 'on', 'everyone'], 1)]
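The effect of `reduce_len=True` described in the comments above can be illustrated with a single regular expression. This is a minimal sketch of the behaviour (any run of three or more identical characters is capped at three); the function name `reduce_lengthening` is chosen here for illustration, and the sketch is not claimed to be NLTK's exact implementation.

```python
import re


def reduce_lengthening(text):
    """Cap every run of 3 or more identical characters at exactly 3."""
    # (.)\1{2,} matches a character followed by at least two repeats of itself.
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)


print(reduce_lengthening("Helooooooooo"))  # Helooo
print(reduce_lengthening("straiiight"))    # straiiight (a run of 3 is left as is)
print(reduce_lengthening("Hello"))         # Hello (a run of 2 is left as is)
```

Capping elongated words this way maps variants such as 'Helooooo' and 'Heloooooooo' to the same token, which keeps the vocabulary smaller.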


In [22]:
X

['I LOVE @Health4UandPets u guys r the best!! ',
 'im meeting up with one of my besties tonight! Cant wait!!  - GIRL TALK!!',
 '@DaRealSunisaKim Thanks for the Twitter add, Sunisa! I got to meet you once at a HIN show here in the DC area and you were a sweetheart. ',
 'Being sick can be really cheap when it hurts too much to eat real food  Plus, your friends make you soup',
 '@LovesBrooklyn2 he has that effect on everyone ',
 '@ProductOfFear You can tell him that I just burst out laughing really loud because of that  Thanks for making me come out of my sulk!',
 '@r_keith_hill Thans for your response. Ihad already find this answer ',
 "@KeepinUpWKris I am so jealous, hope you had a great time in vegas! how did you like the ACM's?! LOVE YOUR SHOW!! ",
 '@tommcfly ah, congrats mr fletcher for finally joining twitter ',
 '@e4VoIP I RESPONDED  Stupid cat is helping me type. Forgive errors ',
 'crazy day of school. there for 10 hours straiiight. about to watch the hills. @spencerpratt told m