# Introduction

This notebook cleans and tokenizes Twitter data found [here](https://data.world/crowdflower/brands-and-product-emotions) for use in machine learning in the next notebook. It produces to separate datasets. One is lemmatized and one is stemmed, but the preceding cleaning and tokenization is identical.

## Packages

In [1]:
import pandas as pd
pd.set_option("max_columns", None)

import numpy as np
np.random.seed(0)

import nltk
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

import re

import string

from gensim.models import Word2Vec

from collections import Counter

## Preview Data

In [2]:
df = pd.read_csv('data/judge_1377884607_tweet_product_company.csv')
print(df.shape)
df.head(10)

(8721, 3)


Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion
5,@teachntech00 New iPad Apps For #SpeechTherapy...,,No emotion toward brand or product
6,,,No emotion toward brand or product
7,"#SXSW is just starting, #CTIA is around the co...",Android,Positive emotion
8,Beautifully smart and simple idea RT @madebyma...,iPad or iPhone App,Positive emotion
9,Counting down the days to #sxsw plus strong Ca...,Apple,Positive emotion


# Data Cleaning

## Renaming Columns

In [3]:
df.rename(columns={'tweet_text' : 'text',
                   'is_there_an_emotion_directed_at_a_brand_or_product' : 'emotion',
                   'emotion_in_tweet_is_directed_at' : 'directed_at'},
          inplace=True)

df.head()

Unnamed: 0,text,directed_at,emotion
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


## Dropping NaNs

In [4]:
df.isna().sum()

text              1
directed_at    5552
emotion           0
dtype: int64

In [5]:
df.dropna(subset = ['text'], inplace = True)

In [6]:
df.isna().sum()

text              0
directed_at    5551
emotion           0
dtype: int64

Not replacing NaN's if emotion is undirected. Often it seems they actually *are* directed at a brand, but I don't have time to manually go through and label these. Plus, I'm only using that feature for plotting EDA for the presentation. It won't actually be fed into the NLP model.

## Dropping `"I can't tell"` Target Values

In [7]:
df = df[df.emotion != "I can't tell"]

# Tokenization

The following process creates a DataFrame of cleaned and tokenized tweets. Each tweet is replaced with a list of tokens. There are no user handles, hashtags, or web addresses. Punctuation and stopwords have also been removed.

In [10]:
def basic_clean(text):
    stop_words = stopwords.words("english")
    
    text = re.sub('@\S+', '', text)
    text = re.sub('http\S+', '', text)
    text = re.sub('#\S+', '', text)
    for i in string.punctuation:
        text = text.replace(i, '').lower()
    
    tokens = nltk.word_tokenize(text)
    new_tokens = []
    for token in tokens:
        if token.lower() not in stop_words:
            new_tokens.append(token)
            
    return new_tokens

In [26]:
df_clean = df.copy()

In [27]:
for i in range(len(df_clean)):
    df_clean.iloc[i].text = basic_clean(df_clean.iloc[i].text)

In [28]:
df_clean.head()

Unnamed: 0,text,directed_at,emotion
0,"[3g, iphone, 3, hrs, tweeting, dead, need, upg...",iPhone,Negative emotion
1,"[know, awesome, ipadiphone, app, youll, likely...",iPad or iPhone App,Positive emotion
2,"[wait, 2, also, sale]",iPad,Positive emotion
3,"[hope, years, festival, isnt, crashy, years, i...",iPad or iPhone App,Negative emotion
4,"[great, stuff, fri, marissa, mayer, google, ti...",Google,Positive emotion


# Lemmatization and Stemming

## Assigning Copies

In [11]:
df_lemma = df_clean.copy()
df_stem = df_clean.copy()

## Lemmatizing

In [12]:
lemmatizer = nltk.stem.WordNetLemmatizer() 

In [13]:
for i in range(len(df_lemma)):
    for x in range(len(df_lemma.iloc[i].text)):
        df_lemma.iloc[i].text[x] = lemmatizer.lemmatize(df_lemma.iloc[i].text[x])

In [14]:
df_lemma.head()

Unnamed: 0,text,directed_at,emotion
0,"[3g, iphone, 3, hr, tweeting, dead, need, upgr...",iPhone,Negative emotion
1,"[know, awesome, ipadiphone, app, youll, likely...",iPad or iPhone App,Positive emotion
2,"[wait, 2, also, sale]",iPad,Positive emotion
3,"[hope, year, festival, isnt, crashy, year, iph...",iPad or iPhone App,Negative emotion
4,"[great, stuff, fri, marissa, mayer, google, ti...",Google,Positive emotion


## Stemming

In [15]:
stemmer = nltk.stem.SnowballStemmer(language = 'english')

In [16]:
for i in range(len(df_stem)):
    for x in range(len(df_stem.iloc[i].text)):
        df_stem.iloc[i].text[x] = stemmer.stem(df_stem.iloc[i].text[x])

In [17]:
df_stem.head()

Unnamed: 0,text,directed_at,emotion
0,"[3g, iphon, 3, hr, tweet, dead, need, upgrad, ...",iPhone,Negative emotion
1,"[know, awesom, ipadiphon, app, youll, like, ap...",iPad or iPhone App,Positive emotion
2,"[wait, 2, also, sale]",iPad,Positive emotion
3,"[hope, year, festiv, isnt, crashi, year, iphon...",iPad or iPhone App,Negative emotion
4,"[great, stuff, fri, marissa, mayer, googl, tim...",Google,Positive emotion


## Exporting CSV's

In [29]:
df_lemma.to_csv("data/df_lemma.csv")
df_stem.to_csv("data/df_stem.csv")

# More EDA and Stuff - Ignore For Now

In [39]:
def word_counts(text):
    wordcount = Counter()
    for i in text.values:
        for x in i:
            wordcount[x] += 1
    return wordcount

In [21]:
# Have to reassign this cause of weird errors above. No time to explore
df_clean = df.copy()
for i in range(len(df_clean)):
    df_clean.iloc[i].text = basic_clean(df_clean.iloc[i].text)

In [14]:
# Exporting this CSV too
df_clean.to_csv("data/df_clean.csv")

In [38]:
wordcounts = word_counts(df_clean.text)
wordcounts.most_common()

[('link', 4093),
 ('rt', 2942),
 ('ipad', 2101),
 ('google', 1956),
 ('apple', 1684),
 ('store', 1398),
 ('iphone', 1208),
 ('new', 1063),
 ('2', 1040),
 ('austin', 815),
 ('app', 757),
 ('amp', 695),
 ('launch', 626),
 ('social', 602),
 ('popup', 556),
 ('today', 553),
 ('circles', 519),
 ('sxsw', 463),
 ('network', 450),
 ('android', 423),
 ('via', 406),
 ('line', 388),
 ('get', 383),
 ('called', 360),
 ('free', 353),
 ('party', 319),
 ('major', 302),
 ('mobile', 292),
 ('like', 279),
 ('one', 261),
 ('time', 259),
 ('temporary', 254),
 ('im', 249),
 ('���', 246),
 ('possibly', 244),
 ('opening', 242),
 ('people', 220),
 ('going', 216),
 ('see', 216),
 ('downtown', 214),
 ('check', 211),
 ('great', 210),
 ('day', 210),
 ('maps', 207),
 ('w', 203),
 ('apps', 200),
 ('go', 200),
 ('dont', 199),
 ('need', 197),
 ('mayer', 197),
 ('open', 189),
 ('marissa', 185),
 ('got', 181),
 ('know', 177),
 ('googles', 174),
 ('come', 172),
 ('first', 163),
 ('win', 162),
 ('good', 156),
 ('us', 156)

# df_stem W2V

In [91]:
model = Word2Vec(sentences=df_stem.text, size=100, window=3, min_count=5)

In [92]:
model.train(sentences=df_stem.text, total_examples=model.corpus_count, epochs=10)

(577074, 866780)

In [94]:
model.wv.most_similar('iphon')

[('marketplac', 0.8339308500289917),
 ('android', 0.822823703289032),
 ('dl', 0.8225846290588379),
 ('droid', 0.8183708190917969),
 ('ride', 0.815941572189331),
 ('wa', 0.8087526559829712),
 ('io', 0.8081423044204712),
 ('blackberri', 0.7942501902580261),
 ('keep', 0.7869850397109985),
 ('also', 0.7834244966506958)]

In [97]:
model.wv.similarity('ipad', 'iphon')

0.36800176

# clean_df W2V

In [85]:
model_clean = Word2Vec(sentences=df_clean.text, size=100, window=3, min_count=5)

In [86]:
model_clean.train(sentences=df_clean.text, total_examples=model.corpus_count, epochs=10)

(562842, 866780)

In [108]:
model_clean.wv.vocab['3g']

<gensim.models.keyedvectors.Vocab at 0x225d461be48>

In [87]:
model_clean.wv.most_similar('iphone')

[('android', 0.889495849609375),
 ('market', 0.884195864200592),
 ('marketplace', 0.8746926784515381),
 ('ios', 0.8510479927062988),
 ('working', 0.8361167311668396),
 ('also', 0.8299484252929688),
 ('updates', 0.8274936079978943),
 ('gram', 0.8266642689704895),
 ('song', 0.8248703479766846),
 ('development', 0.8247174024581909)]

In [95]:
model_clean.wv.most_similar('tweet')

[('youll', 0.962291955947876),
 ('create', 0.948000967502594),
 ('you�۪ll', 0.9437388181686401),
 ('quotif', 0.9394602179527283),
 ('follow', 0.9375459551811218),
 ('quotipad', 0.9373168349266052),
 ('photo', 0.9371738433837891),
 ('user', 0.9368828535079956),
 ('ready', 0.9355154633522034),
 ('borrow', 0.9337400794029236)]

In [88]:
model_clean.wv.similarity('rock', 'iphone')

0.6959976

In [89]:
model_clean.wv.similarity('ipad', 'iphone')

0.3466813

In [102]:
model_clean.wv.similarity('laptop', 'iphone')

0.73951

In [104]:
model_clean.wv.similarity('fire', 'iphone')

0.54867953

In [106]:
df_clean.emotion

0                         Negative emotion
1                         Positive emotion
2                         Positive emotion
3                         Negative emotion
4                         Positive emotion
                       ...                
8716                      Positive emotion
8717    No emotion toward brand or product
8718    No emotion toward brand or product
8719    No emotion toward brand or product
8720    No emotion toward brand or product
Name: emotion, Length: 8720, dtype: object

# With Abhineet

In [107]:
# Binarizer label data

In [110]:
# Word2Vec first: vocab size, embeddings
#    Would require an LSTM
# Outputs of embedding:


# Final layer with softmax: Ternary output

In [109]:
# Bigrams / digrams as features for final model

In [112]:
# Simpler NN would be putting word counts into embedding layer as first layer
# Dense layers with dropout up until output

# Maybe Things Are Looking Up

Some Keras Imports

In [15]:
from keras.layers import Input, Dense, LSTM, Embedding, Dropout, Activation, Bidirectional, GlobalMaxPool1D
from keras.models import Sequential
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.preprocessing import text, sequence

Using TensorFlow backend.


Wordcounts - Just gets all the amounts of times a word is used. "Ten or more" is a list with only the words used ten times plus

In [40]:
def word_counts(text):
    wordcount = Counter()
    for i in text.values:
        for x in i:
            wordcount[x] += 1
    return wordcount

In [66]:
wordcounts = word_counts(df_clean.text)

In [67]:
ten_or_more = [x for x, y in wordcounts.items() if y > 10]

In [None]:
# Make the word count drop any with <10 occurence
# This becomes embedding layer's input

Label binarizer - This turned happy, neutral, sad to 001, 010, 100. Or something like that.

In [16]:
from sklearn.preprocessing import LabelBinarizer

In [65]:
lb = LabelBinarizer()

In [64]:
# In order of 1's position it goes: Negative, neutral, positive
y = lb.fit_transform(df_clean.emotion)

Tokenizer - This basically just made a list of every word and then gave it a unique ID? I don't get the point.

In [56]:
from keras.preprocessing.text import Tokenizer

In [57]:
tk = Tokenizer()

In [58]:
tk.fit_on_texts(ten_or_more)

In [93]:
word_idx = tk.texts_to_sequences(ten_or_more)

In [63]:
len(tk.word_index)

1082

Test train split

In [72]:
import tensorflow

In [68]:
from sklearn.model_selection import train_test_split

In [71]:
X_train, x_test, y_train, y_test = train_test_split(df_clean.text, y, test_size=0.2)

Failed vect text thing with Abhineet

In [77]:
def vect_text(text):
    text = tensorflow.expand_dims(text, -1)
    return vectorize_layer(text)

In [88]:
# from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

In [89]:
# vectorize_layer = TextVectorization()

In [87]:
# X_train.map(vect_text)

Model

In [42]:
model_new = Sequential()

In [91]:
model_new.add(Embedding(input_dim=1082, output_dim=100))
model_new.add(Dense(30, activation="relu"))
model_new.add(Dense(3, activation='softmax'))