Data setup for the BlazingText text classification algorithm. The required input format is described at https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html

We need to preprocess the training data into space-separated, tokenized text that the BlazingText algorithm can consume. Each class label must be prefixed with __label__ and must appear on the same line as the original sentence. We'll use the nltk library to tokenize the input sentences.
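The line format described above can be sketched as follows. This is a minimal illustration, not the notebook's pipeline: `simple_tokenize` is a hypothetical regex stand-in for `nltk.word_tokenize` (so it runs without downloading nltk data), and `to_blazingtext_line` is a helper name introduced here.

```python
import re

LABEL_PREFIX = "__label__"

def simple_tokenize(text):
    # Hypothetical stand-in for nltk.word_tokenize: lowercase the text, then
    # split it into runs of word characters/apostrophes or single punctuation marks.
    return re.findall(r"[\w']+|[^\w\s]", text.lower())

def to_blazingtext_line(label, text):
    # One example per line: the prefixed label, then the space-separated tokens.
    tokens = simple_tokenize(text)
    return LABEL_PREFIX + str(label) + " " + " ".join(tokens)

example = to_blazingtext_line(4, "Linux ready for prime time, Intel says.")
print(example)
```

Unlike nltk, the stand-in does not split contractions such as "don't" into separate tokens, so the two tokenizers will not produce identical output.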

    Train with File Mode
    For supervised mode, the training/validation file should contain a training sentence per line along with the labels. Labels are words that are prefixed by the string "__label__". Here is an example of a training/validation file:
    "__label__4"  linux ready for prime time , intel says , despite all the linux hype , the open-source movement has yet to make a huge splash in the desktop market . that may be about to change , thanks to chipmaking giant intel corp .

    "__label__2"  bowled by the slower one again , kolkata , november 14 the past caught up with sourav ganguly as the indian skippers return to international cricket was short lived .


The data source is the Sentiment140 dataset from Kaggle, comprising 1.6 million Twitter messages: https://www.kaggle.com/kazanova/sentiment140


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
import re
import csv

import nltk

nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:

Tweet = "Welcome to the @UFC @drewCarrey where the points don't matter. Thats Right!! the Points are just number to look at till someone get's  Knocked out."
# Remove Twitter handles; the raw string avoids an invalid-escape warning for \s
Tweet = re.sub(r'@[^\s]+','',Tweet)
print(Tweet)

Welcome to the   where the points don't matter. Thats Right!! the Points are just number to look at till someone get's  Knocked out.


In [3]:
#print (nltk.word_tokenize(Tweet))
print (nltk.word_tokenize(Tweet.lower()))

['welcome', 'to', 'the', 'where', 'the', 'points', 'do', "n't", 'matter', '.', 'thats', 'right', '!', '!', 'the', 'points', 'are', 'just', 'number', 'to', 'look', 'at', 'till', 'someone', 'get', "'s", 'knocked', 'out', '.']


In [4]:
df = pd.read_csv('sentimentdata.csv', delimiter=',',encoding="latin1", header=None, usecols=[0,5]) 
#df = pd.read_csv('sentimentdata.csv', delimiter=',', nrows = 60000, header=None, usecols=[0,5]) 
#df = pd.read_csv('tester.csv', delimiter=',', header=None) 
# Only the 1st column (index 0, sentiment) and the 6th column (index 5, tweet text) are needed

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 2 columns):
0    1600000 non-null int64
5    1600000 non-null object
dtypes: int64(1), object(1)
memory usage: 24.4+ MB


In [6]:
df.head()

Unnamed: 0,0,5
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."


In [7]:
df.shape

(1600000, 2)

In [8]:
df = shuffle(df)
df.head(20)

Unnamed: 0,0,5
1112654,4,"@TravelThirst u make everyday a #travel day, l..."
30370,0,@jreagon11 I'm not Im going to stay awake or ...
904199,4,So why did my internal alarm clock wake me @ 6...
312521,0,last keg night of the year
1437786,4,"Actually, a puppy is too much expensive... Who..."
1281352,4,@Rog42 At least I haven't heard any Bing Crosb...
1010093,4,@iGrace no frooti today for you girl! everyone...
170795,0,Fb I hate when I try &amp; support my local bo...
1188917,4,"@shikarasolis Awe that is sweet, i am spending..."
457454,0,@bohemiangeek @sciencegoddess no cup holder B...


In [9]:
#df[df.columns[0]] 
#values in first column

In [10]:
# 0 = negative, 2 = neutral, 4 = positive. Issue with data set - only 0 and 4 are available
index_to_label = {}
index_to_label[0] = 'Negative'
index_to_label[4] = 'Positive'
print(index_to_label)
print(index_to_label[4])

{0: 'Negative', 4: 'Positive'}
Positive


In [11]:
# set Labels appropriately
def labelize(inp):
    return ("__label__"+index_to_label[inp])

print(labelize(4))

__label__Positive


In [12]:
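# Clean a tweet: remove Twitter handles and convert to lower case.
# Note: the nltk tokenization step is commented out, so despite its name this
# function returns a cleaned string, not a list of tokens.
def tokenize(inpstr):
    Tweetstr = inpstr
    Tweetstr = re.sub(r'@[^\s]+','',Tweetstr)  # raw string avoids an invalid-escape warning
    #Tweetstr = nltk.word_tokenize(Tweetstr.lower())
    return(Tweetstr.lower())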

print(tokenize("@ConnorMcGregor - money in hand, that's Good. Money in Head, that's bad"))

 - money in hand, that's good. money in head, that's bad
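The output above keeps a leading space where the handle was removed. The sketch below is one possible variant that also collapses runs of whitespace (including embedded newlines, which would otherwise split one example across several lines of the training file). `clean_tweet` is a hypothetical name, not part of the notebook.

```python
import re

HANDLE_RE = re.compile(r"@\S+")      # same pattern as '@[^\s]+'
WHITESPACE_RE = re.compile(r"\s+")   # spaces, tabs and newlines

def clean_tweet(text):
    # Drop Twitter handles, then squeeze whitespace so the tweet stays on one line.
    without_handles = HANDLE_RE.sub("", text)
    return WHITESPACE_RE.sub(" ", without_handles).strip().lower()

print(clean_tweet("@ConnorMcGregor - money in hand, that's Good. Money in Head, that's bad"))
```

A bare "@" followed by a space (as in "wake me @ 6") is left alone, which matches the behavior of the original pattern.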


In [13]:
#df[df.columns[0]] = "__label__"+str(df[df.columns[0]])
#df[df.columns[0]] = "__label__"+df[df.columns[0]].astype(str) #--- just add a __label__ to the beginning
#df[df.columns[0]] = "__label__"+index_to_label[df[df.columns[0]]] #-- append __label__ and set value 
#df[df.columns[1]] = re.sub('@[^\s]+','',str(df.columns[1]))
df[df.columns[0]] = df[df.columns[0]].apply(labelize)  #applies labelize function to first column
df[df.columns[1]] = df[df.columns[1]].apply(tokenize) #applies tokenize to second column

In [14]:
df.head(20)

Unnamed: 0,0,5
1112654,__label__Positive,"u make everyday a #travel day, lol. i lov that"
30370,__label__Negative,i'm not im going to stay awake or try too...
904199,__label__Positive,so why did my internal alarm clock wake me @ 6...
312521,__label__Negative,last keg night of the year
1437786,__label__Positive,"actually, a puppy is too much expensive... who..."
1281352,__label__Positive,at least i haven't heard any bing crosby joke...
1010093,__label__Positive,no frooti today for you girl! everyone is get...
170795,__label__Negative,fb i hate when i try &amp; support my local bo...
1188917,__label__Positive,"awe that is sweet, i am spending the day with..."
457454,__label__Negative,no cup holder but that would be a great add...


In [15]:
#df.to_csv('training_sentiment140.csv', sep = ' ', header=False, index=False)

CREATE THE TRAINING AND TEST (VALIDATION) FILES

In [16]:
rows = df.shape[0]
train = int(.7 * rows)
test = rows-train

rows, train, test

(1600000, 1120000, 480000)
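The shuffle earlier in the notebook is unseeded, so the rows that land in each split change from run to run. As a rough sketch of a reproducible 70/30 split, using only the standard library and a hypothetical helper `split_rows` (the seed value is an arbitrary assumption):

```python
import random

def split_rows(rows, train_fraction=0.7, seed=42):
    # Shuffle a copy with a fixed seed so the split is the same on every run.
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

train_rows, test_rows = split_rows(range(10))
print(len(train_rows), len(test_rows))
```

With pandas and scikit-learn, passing `random_state` to `sklearn.utils.shuffle` should give the same kind of repeatability.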

In [17]:
# Write the training set (first 70% of the shuffled rows)
#csv.writer(df)
df.iloc[:train].to_csv('training_sentiment140.csv', sep = ' ', header=False, index=False)
# Write the test/validation set (remaining 30%)
df.iloc[train:].to_csv('test_sentiment140.csv', sep = ' ', header=False, index=False)

# Alternative attempts that disable quoting at write time (kept for reference)
#df.iloc[:train].to_csv('training_sentiment140.csv', sep = ' ', header=False, index=False, quoting=csv.QUOTE_NONE, quotechar="",  escapechar=" ")
#df.iloc[train:].to_csv('test_sentiment140.csv', sep = ' ', header=False, index=False, quoting=csv.QUOTE_NONE, quotechar="",  escapechar=" ")

In [18]:
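# to_csv wraps any field containing the separator (a space) in double quotes.
# Strip every double quote so each line is plain "__label__X text".
# Note: this also removes double quotes that were part of the tweet itself.
with open('training_sentiment140.csv', 'r') as f, open('training_sentiment140_noquotes.csv', 'w') as fo:
    for line in f:
        fo.write(line.replace('"', ''))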

In [19]:
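# Same quote stripping for the test/validation file.
# Output name changed from 'test_sentiment140.csv_noquotes.csv' to match the training file's naming.
with open('test_sentiment140.csv', 'r') as f, open('test_sentiment140_noquotes.csv', 'w') as fo:
    for line in f:
        fo.write(line.replace('"', ''))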
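An alternative to writing with `to_csv` and stripping quotes afterwards is to write each line directly, then check that every line starts with the label prefix. The following is a sketch under assumptions, not the notebook's method: `write_blazingtext_file` and `count_malformed_lines` are hypothetical helpers, and the sample rows and file name are invented for illustration.

```python
import os
import tempfile

def write_blazingtext_file(path, labeled_rows):
    # Write "label text" per line; no CSV quoting is involved, so quotes inside
    # a tweet are preserved rather than stripped.
    with open(path, "w", encoding="utf-8") as out:
        for label, text in labeled_rows:
            # Collapse whitespace so an embedded newline cannot split an example.
            out.write(label + " " + " ".join(text.split()) + "\n")

def count_malformed_lines(path, prefix="__label__"):
    # Count lines that do not begin with the label prefix.
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for line in f if not line.startswith(prefix))

# Invented sample rows, for illustration only.
rows = [
    ("__label__Positive", 'he said "hi", lol'),
    ("__label__Negative", "last keg night\nof the year"),
]
path = os.path.join(tempfile.mkdtemp(), "sample_noquotes.txt")
write_blazingtext_file(path, rows)
print(count_malformed_lines(path))
```

For the DataFrame in this notebook, the rows could be supplied with something like `zip(df[0], df[5])`, though that has not been run against the full dataset here.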