# Introduction

This notebook cleans and tokenizes Twitter data found [here](https://data.world/crowdflower/brands-and-product-emotions) for use in machine learning in the next notebook. It produces to separate datasets. One is lemmatized and one is stemmed, but the preceding cleaning and tokenization is identical.

## Packages

In [1]:
import pandas as pd
pd.set_option("max_columns", None)

import numpy as np
np.random.seed(0)

import nltk
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

import re

import string

from sklearn.preprocessing import LabelBinarizer

from gensim.models import word2vec

## Preview Data

In [2]:
df = pd.read_csv('data/judge_1377884607_tweet_product_company.csv')
print(df.shape)
df.head(10)

(8721, 3)


Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion
5,@teachntech00 New iPad Apps For #SpeechTherapy...,,No emotion toward brand or product
6,,,No emotion toward brand or product
7,"#SXSW is just starting, #CTIA is around the co...",Android,Positive emotion
8,Beautifully smart and simple idea RT @madebyma...,iPad or iPhone App,Positive emotion
9,Counting down the days to #sxsw plus strong Ca...,Apple,Positive emotion


# Data Cleaning

## Renaming Columns

In [3]:
df.rename(columns={'tweet_text' : 'text',
                   'is_there_an_emotion_directed_at_a_brand_or_product' : 'emotion',
                   'emotion_in_tweet_is_directed_at' : 'directed_at'},
          inplace=True)

df.head()

Unnamed: 0,text,directed_at,emotion
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


## Dropping NaNs

In [4]:
df.isna().sum()

text              1
directed_at    5552
emotion           0
dtype: int64

In [5]:
df.dropna(subset = ['text'], inplace = True)

In [6]:
df.isna().sum()

text              0
directed_at    5551
emotion           0
dtype: int64

Not replacing NaN's if emotion is undirected. Often it seems they actually *are* directed at a brand, but I don't have time to manually go through and label these. Plus, I'm only using that feature for plotting EDA for the presentation. It won't actually be fed into the NLP model.

# Tokenization

The following process creates a DataFrame of cleaned and tokenized tweets. Each tweet is replaced with a list of tokens. There are no user handles, hashtags, or web addresses. Punctuation and stopwords have also been removed.

In [7]:
def basic_clean(text):
    stop_words = stopwords.words("english")
    
    text = re.sub('@\S+', '', text)
    text = re.sub('http\S+', '', text)
    text = re.sub('#\S+', '', text)
    for i in string.punctuation:
        text = text.replace(i, '').lower()
    
    tokens = nltk.word_tokenize(text)
    new_tokens = []
    for token in tokens:
        if token.lower() not in stop_words:
            new_tokens.append(token)
            
    return new_tokens

In [23]:
df_clean = df.copy()

In [24]:
for i in range(len(df_clean)):
    df_clean.iloc[i].text = basic_clean(df_clean.iloc[i].text)

In [25]:
df_clean

Unnamed: 0,text,directed_at,emotion
0,"[3g, iphone, 3, hrs, tweeting, dead, need, upg...",iPhone,Negative emotion
1,"[know, awesome, ipadiphone, app, youll, likely...",iPad or iPhone App,Positive emotion
2,"[wait, 2, also, sale]",iPad,Positive emotion
3,"[hope, years, festival, isnt, crashy, years, i...",iPad or iPhone App,Negative emotion
4,"[great, stuff, fri, marissa, mayer, google, ti...",Google,Positive emotion
...,...,...,...
8716,"[ipad, everywhere, link]",iPad,Positive emotion
8717,"[wave, buzz, rt, interrupt, regularly, schedul...",,No emotion toward brand or product
8718,"[googles, zeiger, physician, never, reported, ...",,No emotion toward brand or product
8719,"[verizon, iphone, customers, complained, time,...",,No emotion toward brand or product


# Lemmatization and Stemming

## Assigning Copies

In [26]:
df_lemma = df_clean.copy()
df_stem = df_clean.copy()

## Lemmatizing

In [27]:
lemmatizer = nltk.stem.WordNetLemmatizer() 

In [28]:
for i in range(len(df_lemma)):
    for x in range(len(df_lemma.iloc[i].text)):
        df_lemma.iloc[i].text[x] = lemmatizer.lemmatize(df_lemma.iloc[i].text[x])

In [30]:
df_lemma.head()

Unnamed: 0,text,directed_at,emotion
0,"[3g, iphone, 3, hr, tweeting, dead, need, upgr...",iPhone,Negative emotion
1,"[know, awesome, ipadiphone, app, youll, likely...",iPad or iPhone App,Positive emotion
2,"[wait, 2, also, sale]",iPad,Positive emotion
3,"[hope, year, festival, isnt, crashy, year, iph...",iPad or iPhone App,Negative emotion
4,"[great, stuff, fri, marissa, mayer, google, ti...",Google,Positive emotion


## Stemming

In [29]:
stemmer = nltk.stem.SnowballStemmer(language = 'english')

In [31]:
for i in range(len(df_stem)):
    for x in range(len(df_stem.iloc[i].text)):
        df_stem.iloc[i].text[x] = stemmer.stem(df_stem.iloc[i].text[x])

In [32]:
df_stem.head()

Unnamed: 0,text,directed_at,emotion
0,"[3g, iphon, 3, hr, tweet, dead, need, upgrad, ...",iPhone,Negative emotion
1,"[know, awesom, ipadiphon, app, youll, like, ap...",iPad or iPhone App,Positive emotion
2,"[wait, 2, also, sale]",iPad,Positive emotion
3,"[hope, year, festiv, isnt, crashi, year, iphon...",iPad or iPhone App,Negative emotion
4,"[great, stuff, fri, marissa, mayer, googl, tim...",Google,Positive emotion


# Exporting CSV's

In [None]:
# df_lemma.to_csv("../data/df_lemma.csv")
# df_stem.to_csv("../data/df_stem.csv")