Data Setup for Einstein Intent API.

The data comes from the following location: http://help.sentiment140.com/for-students
It provides a .CSV with sentiment data based on 1.6 million Twitter posts.

The data is a CSV with emoticons removed. The file has 6 fields:

0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
1 - the id of the tweet (2087)
2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
3 - the query (lyx). If there is no query, then this value is NO_QUERY.
4 - the user that tweeted (robotickilldozr)
5 - the text of the tweet (Lyx is cool)

The desired format for Einstein is a .CSV with no header row, where each line has the form: "content", intent
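As a rough illustration of that target layout, the sketch below writes two made-up rows with Python's csv module. The sample rows are hypothetical, and the choice of minimal quoting (quotes only where the content needs them) is an assumption, not something the Einstein documentation is quoted on here.

```python
import csv
import io

# Hypothetical sample rows: (tweet text, intent label)
rows = [
    ("lyx is cool", "Positive"),
    ('he said "no way", then left', "Negative"),
]

# Write to an in-memory buffer instead of a real file, with no header row.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Content that contains commas or quotes is wrapped in double quotes and inner quotes are doubled, so the file still parses back into exactly two fields per row.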



In [90]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
import re
import csv

import nltk

nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [91]:
Tweet = "Welcome to the @UFC @drewCarrey where the points don't matter. Thats Right!! the Points are just number to look at till someone get's  Knocked out."
Tweet = re.sub(r'@[^\s]+','',Tweet)  # raw string so \s is not treated as a string escape
print (Tweet)

Welcome to the   where the points don't matter. Thats Right!! the Points are just number to look at till someone get's  Knocked out.
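Removing the handles leaves a run of spaces where they used to be. If that matters for the model input, a second substitution can collapse the whitespace. This is a minimal sketch; the helper name strip_handles is hypothetical and not part of the notebook.

```python
import re

def strip_handles(text):
    """Remove @handles, then collapse the leftover whitespace."""
    no_handles = re.sub(r'@\S+', '', text)
    # Collapse any run of whitespace to one space and trim the ends.
    return re.sub(r'\s+', ' ', no_handles).strip()

sample = "Welcome to the @UFC @drewCarrey where the points don't matter."
cleaned = strip_handles(sample)
print(cleaned)
```

Note that the pattern removes anything starting with "@" up to the next whitespace, so it would also strip the domain part of an email address.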


In [92]:
#print (nltk.word_tokenize(Tweet))
print (nltk.word_tokenize(Tweet.lower()))

['welcome', 'to', 'the', 'where', 'the', 'points', 'do', "n't", 'matter', '.', 'thats', 'right', '!', '!', 'the', 'points', 'are', 'just', 'number', 'to', 'look', 'at', 'till', 'someone', 'get', "'s", 'knocked', 'out', '.']


In [93]:
df = pd.read_csv('Raw_data.csv', delimiter=',',encoding="latin1", header=None, usecols=[0,5]) 
#df = pd.read_csv('Raw_data.csv', delimiter=',',nrows = 1580000,encoding="latin1", header=None, usecols=[0,5]) 
#df = pd.read_csv('Raw_data.csv', delimiter=',', nrows = 100, header=None, usecols=[5,0]) 
#df = pd.read_csv('tester.csv', delimiter=',', header=None) 
# only need the 1st column (sentiment, index 0) and the 6th column (tweet text, index 5)
# in the output, the tweet needs to be in the first column and the sentiment in the second

In [94]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 2 columns):
0    1600000 non-null int64
5    1600000 non-null object
dtypes: int64(1), object(1)
memory usage: 24.4+ MB


In [95]:
df.head()

Unnamed: 0,0,5
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."


In [96]:
df.tail()

Unnamed: 0,0,5
1599995,4,Just woke up. Having no school is the best fee...
1599996,4,TheWDB.com - Very cool to hear old Walt interv...
1599997,4,Are you ready for your MoJo Makeover? Ask me f...
1599998,4,Happy 38th Birthday to my boo of alll time!!! ...
1599999,4,happy #charitytuesday @theNSPCC @SparksCharity...


In [97]:
df.shape

(1600000, 2)

In [98]:
df = df[[df.columns[1],df.columns[0]]]

In [99]:
df = shuffle(df)
df.head(20)

Unnamed: 0,5,0
1015309,okay.... so charlie and i waded thru the sea i...,4
433134,@Robynnn_b COLLAB WITH WTK!? HELLLLL NO. That...,0
74103,OH CRAP!!! Its raining...I NEED BBQ,0
550941,HAS A REALLY BAD HEADACHE,0
656267,Going to hikari san's house. Yay i get to hang...,0
440019,my lips hurt as well now good thing i got me ...,0
122278,please be nice when ditching shopping carts in...,0
241317,... my website goes on... http://www.cascada-m...,0
1155366,@LysaLove we just got home!,4
1344050,@yoadrian29 I know hon &amp; i appreciate it!i...,4
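The shuffle above gives a different row order on every run, so the later train/test split is not repeatable. If repeatability is wanted, a seed can be fixed. The sketch below uses pandas' DataFrame.sample on a tiny made-up frame; the seed value 42 is arbitrary. sklearn.utils.shuffle also accepts a random_state argument for the same purpose.

```python
import pandas as pd

# Tiny illustrative frame with the same column labels as the notebook (5 = text, 0 = polarity)
sample = pd.DataFrame({5: ["a", "b", "c", "d"], 0: [0, 4, 0, 4]})

# frac=1 returns every row in random order; random_state makes the order repeatable.
first = sample.sample(frac=1, random_state=42)
second = sample.sample(frac=1, random_state=42)

print(first.equals(second))
```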


In [100]:
# 0 = negative, 2 = neutral, 4 = positive. Issue with the data set: only 0 and 4 actually appear, so 'Neutral' is never used
index_to_label = {}
index_to_label[0] = 'Negative'
index_to_label[2] = 'Neutral'
index_to_label[4] = 'Positive'
print(index_to_label)
print(index_to_label[4])

{0: 'Negative', 2: 'Neutral', 4: 'Positive'}
Positive
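The comment about missing labels can be checked directly by counting the polarity values. The following is a small sketch on a made-up frame, not the real data; on the notebook's DataFrame the same idea would be applied to the polarity column.

```python
import pandas as pd

index_to_label = {0: 'Negative', 2: 'Neutral', 4: 'Positive'}

# Hypothetical sample with the same column labels as the raw file (0 = polarity, 5 = text)
sample = pd.DataFrame({0: [0, 4, 4, 0, 4], 5: ["a", "b", "c", "d", "e"]})

# Count how often each polarity code occurs.
counts = sample[0].value_counts().to_dict()

# Labels whose code never occurs in the data.
missing = [name for code, name in index_to_label.items() if code not in counts]
print(counts, missing)
```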


In [101]:
# set Labels appropriately
def labelize(inp):
    return (index_to_label[inp])

print(labelize(4))

Positive


In [102]:
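# clean the tweet: remove Twitter handles and make lower case
# (despite the name, no word tokenization is applied; the nltk call is left commented out)
def tokenize(inpstr):
    Tweetstr = inpstr
    Tweetstr = re.sub(r'@[^\s]+','',Tweetstr)  # raw string so \s is not treated as a string escape
    #Tweetstr = nltk.word_tokenize(Tweetstr.lower())
    return(Tweetstr.lower())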

print(tokenize("@ConnorMcGregor - money in hand, that's Good. Money in Head, that's bad"))

 - money in hand, that's good. money in head, that's bad


In [103]:
df[df.columns[0]] = df[df.columns[0]].apply(tokenize)  # clean the tweet text in the first column (remove handles, lower case)
df[df.columns[1]] = df[df.columns[1]].apply(labelize)  # replace polarity codes in the second column with label names
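On 1.6 million rows, calling a Python function per row with apply can be slow. One possible alternative, sketched here on a two-row made-up frame, is to use pandas' vectorized string methods and Series.map with the label dictionary. This is an assumption about what might be faster, not a measured result.

```python
import pandas as pd

index_to_label = {0: 'Negative', 2: 'Neutral', 4: 'Positive'}

# Hypothetical sample with the notebook's column order (5 = text, 0 = polarity)
sample = pd.DataFrame({
    5: ["@LysaLove we just got home!", "OH CRAP!!! Its raining"],
    0: [4, 0],
})

# Same cleaning as tokenize(): drop handles, then lower-case.
sample[5] = sample[5].str.replace(r'@\S+', '', regex=True).str.lower()

# Same mapping as labelize(): look each code up in the dictionary.
sample[0] = sample[0].map(index_to_label)

print(sample)
```

Unlike labelize, map does not raise on a code that is missing from the dictionary; it produces NaN instead, so unexpected codes would need a separate check.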

In [104]:
df.head(20)

Unnamed: 0,5,0
1015309,okay.... so charlie and i waded thru the sea i...,Positive
433134,collab with wtk!? helllll no. that'd be like...,Negative
74103,oh crap!!! its raining...i need bbq,Negative
550941,has a really bad headache,Negative
656267,going to hikari san's house. yay i get to hang...,Negative
440019,my lips hurt as well now good thing i got me ...,Negative
122278,please be nice when ditching shopping carts in...,Negative
241317,... my website goes on... http://www.cascada-m...,Negative
1155366,we just got home!,Positive
1344050,i know hon &amp; i appreciate it!i'll be doin...,Positive
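The cleaned output still contains HTML entities such as &amp; and raw URLs. Whether removing them helps the model is untested here, but if it is wanted, a sketch along these lines could be applied. The helper name clean_text and the example URL are hypothetical.

```python
import html
import re

def clean_text(text):
    """Decode HTML entities, drop URLs, and tidy whitespace."""
    text = html.unescape(text)                  # "&amp;" becomes "&"
    text = re.sub(r'https?://\S+', '', text)    # remove http/https links
    return re.sub(r'\s+', ' ', text).strip()

cleaned = clean_text("i know hon &amp; i appreciate it! http://example.com/x")
print(cleaned)
```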


CREATE THE TRAINING AND TEST FILES

In [105]:
# Use 95% of the data to train the model and reserve 5% for my own testing.
# In this case, validation is something I will do manually rather than systematically.
# (Note: Einstein splits the uploaded data internally into training and validation sets for model tuning.)
rows = df.shape[0]
train = int(.95 * rows)
test = rows-train

rows, train, test

(1600000, 1520000, 80000)

In [106]:
df.iloc[:train].to_csv('training_EinsteinSentiment.csv', sep = ',', header=False, index=False)
df.iloc[train:].to_csv('test_EinsteinSentiment.csv', sep =',', header=False, index=False)
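Before uploading, it may be worth reading the written files back and checking that every row has exactly two fields, non-empty content, and a known label. A tweet that consisted only of a handle, for example, becomes empty after cleaning. The sketch below runs on an in-memory sample rather than the real files; the helper name find_bad_rows is hypothetical, and whether Einstein rejects such rows is an assumption.

```python
import csv
import io

allowed = {"Negative", "Neutral", "Positive"}

def find_bad_rows(lines):
    """Return 1-based row numbers that do not look like: content, label."""
    bad = []
    for number, row in enumerate(csv.reader(lines), start=1):
        if len(row) != 2 or not row[0].strip() or row[1] not in allowed:
            bad.append(number)
    return bad

# Made-up file contents: rows 3 (empty content) and 4 (unknown label) are invalid.
sample_file = io.StringIO(
    'has a really bad headache,Negative\n'
    '"oh crap, its raining",Negative\n'
    ',Positive\n'
    'we just got home!,Happy\n'
)
bad_rows = find_bad_rows(sample_file)
print(bad_rows)
```

For the real files, an open file handle (opened with newline='') could be passed in place of the StringIO object.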