# Brands and Product Emotions

**Authors:** Kevin McDonough, Brad Horn, Ryan Reilly

## Overview

This project analyzes roughly 8,700 tweets about Apple and Google products. Human raters labeled the sentiment of each tweet as positive, negative, or neither. The goal is to build an NLP model that accurately predicts a tweet's sentiment from its text, using exploratory data analysis and iterative predictive modeling with classification models.

## Business Problem

Apple and Google have hired us to predict the sentiment of tweets about their products. They will use our analysis to gather critical feedback about problems in newly released products. We will provide recommendations on the following:

- Which products to manage based on negative tweets
- What people say most often in negative tweets

## Data Understanding

Each row in this dataset represents a unique tweet made by a user about an Apple or Google product. There are three columns in the dataset. Each feature and its description is listed below.

| Feature | Description |
|:-------|:-------|
| tweet_text | The full text of the tweet |
| emotion_in_tweet_is_directed_at | The product the tweet is directed at |
| is_there_an_emotion_directed_at_a_brand_or_product | The sentiment label of the tweet in 4 classes (positive, negative, neutral, and "I can't tell") |

In [14]:
#For data engineering
import pandas as pd
import numpy as np
import re
import string

#For visualization
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib as mpl
%matplotlib inline

#To encode the target labels as integers
from sklearn.preprocessing import LabelEncoder

#To build all of our models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import NearestNeighbors
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelBinarizer

# nltk
import nltk
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize, sent_tokenize, FreqDist
from nltk.corpus import stopwords
from nltk.util import ngrams

#For Neural Net
from keras.utils.np_utils import to_categorical

#For counting words
import collections

#For Vectorizing
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

#Train test split, CV and Gridsearch
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

#To build our dummy model
from sklearn.dummy import DummyClassifier

#For evaluating our classification models
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, classification_report 
from sklearn.metrics import plot_roc_curve

#To apply oversampling for imbalanced dataset
from imblearn.over_sampling import SMOTE

#For building pipelines
from imblearn.pipeline import Pipeline as Pipeline

#to see how long a cell takes to run
import time

#Use the functions in the py file for preprocessing
import sys
#sys.path.insert(0, 'src/')
#import preprocessing

#To ignore warnings
import warnings
warnings.filterwarnings('ignore')

#Stopwords
stop_words = set(stopwords.words('english'))

In [15]:
#Import the data
df = pd.read_csv('../data/tweets.csv')

#Rename columns to get shorter column names
df.rename(columns={"tweet_text": "tweet",\
                   "emotion_in_tweet_is_directed_at":"product",\
                   "is_there_an_emotion_directed_at_a_brand_or_product": "sentiment"},\
          inplace=True)

In [16]:
#Take a look at the datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8721 entries, 0 to 8720
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   tweet      8720 non-null   object
 1   product    3169 non-null   object
 2   sentiment  8721 non-null   object
dtypes: object(3)
memory usage: 204.5+ KB


The product column contains many nulls (only 3,169 of 8,721 rows have a value), which we will have to deal with. One tweet is also missing, and we may need to cast the tweet column to string before analysis.

In [17]:
#Take a look at the outcome variable and its value counts
df['sentiment'].value_counts().to_frame()

| | sentiment |
|:-------|-------:|
| No emotion toward brand or product | 5156 |
| Positive emotion | 2869 |
| Negative emotion | 545 |
| I can't tell | 151 |


There are 4 classes in our target variable. Neutral ("No emotion toward brand or product") and positive tweets make up most of the data, while negative tweets are scarce (545), so we will need to apply a re-sampling technique in our models. Tweets labeled "I can't tell" will be removed from the dataset before analysis and modeling.
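
The imbalance can be quantified directly from the counts above. The sketch below uses only those three counts (setting aside "I can't tell") and computes each class's share, plus the `n_samples / (n_classes * count)` weight that scikit-learn applies for `class_weight='balanced'`. Class weighting is shown only as a point of comparison with re-sampling; it is an assumption here, not something this notebook does.

```python
# Counts taken from the value_counts() output above, excluding "I can't tell".
counts = {
    "No emotion toward brand or product": 5156,
    "Positive emotion": 2869,
    "Negative emotion": 545,
}

n_samples = sum(counts.values())
n_classes = len(counts)

# Share of each class in the labelled data.
shares = {label: count / n_samples for label, count in counts.items()}

# "Balanced" weights: rarer classes get proportionally larger weights.
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

for label in counts:
    print(f"{label:<36} share={shares[label]:.3f} weight={weights[label]:.2f}")
```

Negative tweets account for roughly 6% of the labelled rows, which is why some form of re-balancing is worth considering.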

## Data Preparation

### Handle Missing Values

In [18]:
#Create a dataframe with each column's count of NAs.
missing = pd.DataFrame(df.isna().sum(), columns = ['Nulls'])
#Label index column
missing.index.name = 'Feature'
#Show the columns sorted by number of missing values
missing.sort_values(by=['Nulls'])

| Feature | Nulls |
|:-------|-------:|
| sentiment | 0 |
| tweet | 1 |
| product | 5552 |


#### We will remove the row with no tweet.

In [19]:
df = df[df['tweet'].notna()]

#### Impute Product Column

In [20]:
# Need to figure out a way to impute values in this column if possible
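
One possible starting point, sketched here under assumptions rather than as this notebook's method, is a keyword lookup on the tweet text. The `PRODUCT_KEYWORDS` list and the `guess_product` helper are hypothetical, and the category names are illustrative; they would need to be aligned with the actual values in the `product` column.

```python
import re

# Hypothetical keyword -> product mapping; extend it to match the real categories.
PRODUCT_KEYWORDS = [
    (re.compile(r"\bipad\b", re.IGNORECASE), "iPad"),
    (re.compile(r"\biphone\b", re.IGNORECASE), "iPhone"),
    (re.compile(r"\bandroid\b", re.IGNORECASE), "Android"),
    (re.compile(r"\bgoogle\b", re.IGNORECASE), "Google"),
    (re.compile(r"\bapple\b", re.IGNORECASE), "Apple"),
]

def guess_product(tweet):
    """Return the first matching product name, or None when nothing matches."""
    for pattern, product in PRODUCT_KEYWORDS:
        if pattern.search(tweet):
            return product
    return None

print(guess_product("Ipad everywhere. #SXSW {link}"))              # iPad
print(guess_product("Some Verizon iPhone customers complained"))   # iPhone
print(guess_product("Wave, buzz... RT @mention"))                  # None
```

With pandas, such a helper could fill only the missing entries, for example `df['product'].fillna(df['tweet'].map(guess_product))`. Tweets that mention several products would need a more careful rule than "first match wins".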

### Check for row duplicates and remove

In [21]:
print('Number of duplicates: {}'.format(len(df[df.duplicated()])))

Number of duplicates: 22


In [22]:
#Remove duplicates
df.drop_duplicates(inplace=True)

### Remove rows where the label is "I can't tell"

In [23]:
#Remove tweets with "I can't tell" sentiment
df = df[df.sentiment != "I can't tell"]

### Rename class labels

In [24]:
df['sentiment'].replace({'No emotion toward brand or product': 'Nuetral', 'Positive emotion': 'Positive', 'Negative emotion': 'Negative'}, inplace=True)     

## Feature Engineering

In [25]:
df

Feature,tweet,product,sentiment
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive
...,...,...,...
8716,Ipad everywhere. #SXSW {link},iPad,Positive
8717,"Wave, buzz... RT @mention We interrupt your re...",,Nuetral
8718,"Google's Zeiger, a physician never reported po...",,Nuetral
8719,Some Verizon iPhone customers complained their...,,Nuetral


In [26]:
from textblob import TextBlob

In [27]:
#Create a polarity function using TextBlob (score ranges from -1 to 1)
pol = lambda x: TextBlob(x).sentiment.polarity
pol(df['tweet'][1])

#Add the rounded polarity of each tweet as a new column
df['polarity'] = round(df['tweet'].apply(pol),2)

In [28]:
df

Feature,tweet,product,sentiment,polarity
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative,-0.25
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive,0.47
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive,-0.16
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative,0.00
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive,0.80
...,...,...,...,...
8716,Ipad everywhere. #SXSW {link},iPad,Positive,0.00
8717,"Wave, buzz... RT @mention We interrupt your re...",,Nuetral,0.00
8718,"Google's Zeiger, a physician never reported po...",,Nuetral,0.00
8719,Some Verizon iPhone customers complained their...,,Nuetral,-0.05


## Preprocess for modeling

In [29]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/kevinmcdonough/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [30]:
df['tweet'] = df['tweet'].astype(str)

In [31]:
def preprocess_tweet_text(tweet):
    # Lowercase
    tweet = tweet.lower()
    # Remove urls
    tweet = re.sub(r"http\S+|www\S+|https\S+", '', tweet, flags=re.MULTILINE)
    # Remove user @ references and '#' from tweet
    tweet = re.sub(r'\@\w+|\#','', tweet)
    # Remove punctuation
    tweet = tweet.translate(str.maketrans('', '', string.punctuation))
    # Tokenize and remove stopwords
    tweet_tokens = word_tokenize(tweet)
    filtered_words = [w for w in tweet_tokens if w not in stop_words]
    # Lemmatize (pos='a' treats every token as an adjective)
    lemmatizer = WordNetLemmatizer()
    lemma_words = [lemmatizer.lemmatize(w, pos='a') for w in filtered_words]
    # Return the words joined back together
    return " ".join(lemma_words)
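
To see what the regex and punctuation steps do on their own, here is a small sketch that leaves out stopword removal and lemmatization so it runs without any NLTK data. The `strip_noise` name is hypothetical, and the second sample tweet is made up.

```python
import re
import string

def strip_noise(tweet):
    """Apply only the regex and punctuation steps (no stopwords, no lemmatizing)."""
    tweet = tweet.lower()
    tweet = re.sub(r"http\S+|www\S+|https\S+", "", tweet, flags=re.MULTILINE)
    tweet = re.sub(r"\@\w+|\#", "", tweet)
    tweet = tweet.translate(str.maketrans("", "", string.punctuation))
    # Collapse the whitespace left behind by the removals.
    return " ".join(tweet.split())

print(strip_noise("Ipad everywhere. #SXSW {link}"))
# ipad everywhere sxsw link
print(strip_noise("@mention Loving the new #iPad! http://t.co/abc"))
# loving the new ipad
```

Note that the `{link}` placeholder is not a URL, so it survives as the plain token `link` once the braces are stripped. If that token is not wanted as a feature, it may be worth adding to the stopword set.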

In [32]:
df['tweet'] = df['tweet'].apply(preprocess_tweet_text)

In [33]:
df['tweet']

0       3g iphone 3 hrs tweeting riseaustin dead need ...
1       know awesome ipadiphone app youll likely appre...
2                              wait ipad 2 also sale sxsw
3       hope years festival isnt crashy years iphone a...
4       great stuff fri sxsw marissa mayer google tim ...
                              ...                        
8716                            ipad everywhere sxsw link
8717    wave buzz rt interrupt regularly scheduled sxs...
8718    googles zeiger physician never reported potent...
8719    verizon iphone customers complained time fell ...
8720    �ϡ�������ʋ�΋�ҋ�������⋁�����������rt google tes...
Name: tweet, Length: 8547, dtype: object

In [34]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [35]:
tokenizer_obj = Tokenizer(num_words=2000)

In [36]:
df['sentiment'].value_counts()

Nuetral     5142
Positive    2861
Negative     544
Name: sentiment, dtype: int64

In [37]:
# Convert string labels to 0,1,2
le = LabelEncoder()
df['target'] = le.fit_transform(df['sentiment'])

In [38]:
# Separate features and labels 
X = df['tweet']
y = df['target']

In [39]:
# Create test and train datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state=1)

In [40]:
# One-hot encode the integer sentiment labels
lb = LabelBinarizer()
lb.fit(y_train)

# lb.transform already returns one-hot rows; to_categorical(...)[:, :, 1] recovers them as float32
y_train_lb = to_categorical(lb.transform(y_train))[:, :, 1]
y_test_lb = to_categorical(lb.transform(y_test))[:, :, 1]

In [41]:
y_train_lb[0]

array([0., 0., 1.], dtype=float32)
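
To read a vector like this one, it helps to recall that `LabelEncoder` numbers the classes in sorted order of their names. The sketch below reproduces that mapping in plain Python, using the class spellings as they appear in this notebook; the `one_hot` helper is hypothetical.

```python
# Class names as they appear in df['sentiment'] after renaming (spelling as in the notebook).
classes = sorted(["Nuetral", "Positive", "Negative"])

# LabelEncoder assigns integer codes in sorted order of the class names.
code_of = {name: code for code, name in enumerate(classes)}

def one_hot(code, n_classes=len(classes)):
    """Return a one-hot list with a 1.0 in position `code`."""
    return [1.0 if i == code else 0.0 for i in range(n_classes)]

print(code_of)                       # {'Negative': 0, 'Nuetral': 1, 'Positive': 2}
print(one_hot(code_of["Positive"]))  # [0.0, 0.0, 1.0]
```

So the first training label shown above corresponds to a positive tweet.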

In [42]:
X_train.values

array(['brilliant move apple even begins apple wins sxsw link',
       'thats impressive quotpopupquot store must purchased ipad 2 enjoy sxsw',
       'i7 releasing demons link codes valid 40075959p 031111 infektd sxsw zomb',
       ...,
       'cool team startups came via siliconvalley bus hacking apps exgoogle au stage w speakermeetup sxsw dogpatch labs',
       'rt attending sxsw austin guide free download itunes link lp',
       'iphone thirsty sxsw chevy volt lounge w 16 others link'],
      dtype=object)

In [43]:
X_train = X_train.values
X_test = X_test.values

In [44]:
total_reviews = df['tweet'].values

In [45]:
total_reviews

array(['3g iphone 3 hrs tweeting riseaustin dead need upgrade plugin stations sxsw',
       'know awesome ipadiphone app youll likely appreciate design also theyre giving free ts sxsw',
       'wait ipad 2 also sale sxsw', ...,
       'googles zeiger physician never reported potential ae yet fda relies physicians quotwere operating wout dataquot sxsw health2dev',
       'verizon iphone customers complained time fell back hour weekend course new yorkers attended sxsw',
       '�ϡ�������ʋ�\u038b�ҋ�������⋁�����������rt google tests ���checkin offers�\u06dd sxsw link'],
      dtype=object)

In [46]:
tokenizer_obj.fit_on_texts(total_reviews)

In [47]:
max_length = max([len(s.split()) for s in total_reviews])

In [48]:
max_length

26

In [49]:
vocab_size = len(tokenizer_obj.word_index) + 1

In [50]:
X_train_tokens = tokenizer_obj.texts_to_sequences(X_train)
X_test_tokens = tokenizer_obj.texts_to_sequences(X_test)

In [51]:
X_train_pad = pad_sequences(X_train_tokens, maxlen=max_length, padding='post')
X_test_pad = pad_sequences(X_test_tokens, maxlen=max_length, padding='post')
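
As a rough illustration of what `padding='post'` does, the following sketch right-pads integer sequences with zeros in plain Python. It mirrors the Keras defaults as I understand them (over-long sequences lose items from the front); the `pad_post` helper and the sample ids are hypothetical.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value`; keep the last `maxlen` items if too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncate from the front, like truncating='pre'
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

print(pad_post([[5, 2, 9], [7], [1, 2, 3, 4, 5, 6]], maxlen=4))
# [[5, 2, 9, 0], [7, 0, 0, 0], [3, 4, 5, 6]]
```

Because 0 is reserved for padding and the tokenizer's word indices start at 1, `vocab_size` is computed as `len(word_index) + 1`.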

In [52]:
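# Oversample the minority classes with SMOTE.
# Note: X_resampled and y_resampled are not passed to model.fit below, which trains on X_train_pad.
X_resampled, y_resampled = SMOTE().fit_resample(X_train_pad, y_train_lb)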

In [53]:
len(y_resampled)

9165
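
SMOTE creates synthetic rows by interpolating between a minority-class sample and one of its nearest neighbors. The sketch below shows just that interpolation step on two made-up rows of padded token ids; `smote_like_sample` is a hypothetical helper, not the imbalanced-learn implementation.

```python
import random

def smote_like_sample(x, neighbor, rng):
    """Interpolate a synthetic point on the segment between x and one neighbor."""
    lam = rng.random()  # uniform in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]

rng = random.Random(0)
x = [12, 845, 3, 0]         # made-up padded token ids
neighbor = [40, 17, 3, 0]   # made-up neighbor from the same class
synthetic = smote_like_sample(x, neighbor, rng)
print(synthetic)
```

Token ids are arbitrary labels rather than quantities, so a value that falls between two ids generally points to an unrelated word. For that reason it may be more meaningful to oversample in a continuous feature space (for example TF-IDF vectors or embeddings) than on raw id sequences.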

In [54]:
from keras.models import Sequential
from keras.layers import Embedding, Dense, LSTM, GRU, Flatten

In [55]:
EMBEDDING_DIM = 100

In [56]:
print('Build model...')

model = Sequential()
model.add(Embedding(vocab_size, EMBEDDING_DIM, input_length=max_length))
model.add(GRU(units=32, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(3, activation='softmax'))

Build model...


In [57]:
# Three-class softmax with one-hot targets, so use categorical cross-entropy
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
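
The choice of loss matters for a three-way softmax output: categorical cross-entropy is the usual pairing, while binary cross-entropy scores each output unit as a separate yes/no problem. The sketch below compares the two per-sample values for one made-up prediction; the probabilities are illustrative only.

```python
import math

y_true = [0.0, 0.0, 1.0]   # one-hot target, as in y_train_lb[0]
y_pred = [0.2, 0.3, 0.5]   # made-up softmax output for illustration

# Categorical cross-entropy: only the true class contributes.
cce = -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

# Binary cross-entropy: each output is scored as its own yes/no problem, then averaged.
bce = -sum(
    t * math.log(p) + (1 - t) * math.log(1 - p) for t, p in zip(y_true, y_pred)
) / len(y_true)

print(f"categorical: {cce:.4f}, binary: {bce:.4f}")
# categorical: 0.6931, binary: 0.4243
```

The two numbers are not on the same scale, so loss values logged under one setting should not be compared with values logged under the other.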

In [58]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 26, 100)           1038200   
_________________________________________________________________
gru (GRU)                    (None, 32)                12864     
_________________________________________________________________
dense (Dense)                (None, 3)                 99        
Total params: 1,051,163
Trainable params: 1,051,163
Non-trainable params: 0
_________________________________________________________________
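
The parameter counts in this summary can be checked by hand. The sketch below assumes the GRU uses separate input and recurrent biases (the `reset_after=True` behavior that I believe is the TensorFlow 2 default) and infers the vocabulary size from the embedding row of the summary.

```python
vocab_size = 10382      # inferred from the summary: 1,038,200 / 100
embedding_dim = 100
units = 32
n_classes = 3

embedding_params = vocab_size * embedding_dim
# GRU with separate input and recurrent biases (reset_after=True): 3 gates.
gru_params = 3 * (units * (embedding_dim + units) + 2 * units)
dense_params = units * n_classes + n_classes

total = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total)
# 1038200 12864 99 1051163
```

Almost all of the trainable parameters sit in the embedding layer, which is worth keeping in mind given that there are only a few thousand training tweets.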


In [59]:
len(X_test_pad)

3419

In [60]:
model.fit(X_train_pad, y_train_lb, batch_size=128, epochs=12, validation_data=(X_test_pad, y_test_lb), verbose=1)

Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Epoch 8/12
Epoch 9/12
Epoch 10/12
Epoch 11/12
Epoch 12/12


<tensorflow.python.keras.callbacks.History at 0x7fbe4e3681f0>

## Word2Vec

In [68]:
tweets = df['tweet'].values.tolist()
review_lines = list()

for tweet in tweets:
    # Lowercase
    tweet = tweet.lower()
    # Remove urls
    tweet = re.sub(r"http\S+|www\S+|https\S+", '', tweet, flags=re.MULTILINE)
    # Remove user @ references and '#' from tweet
    tweet = re.sub(r'\@\w+|\#','', tweet)
    # Remove punctuations
    tweet = tweet.translate(str.maketrans('', '', string.punctuation))
    # Remove stopwords
    tweet_tokens = word_tokenize(tweet)
    filtered_words = [w for w in tweet_tokens if not w in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    lemma_words = [lemmatizer.lemmatize(w, pos='a') for w in filtered_words]
    # Keep each tweet as a list of tokens, as Word2Vec expects
    review_lines.append(lemma_words)

In [69]:
review_lines

[['3g',
  'iphone',
  '3',
  'hrs',
  'tweeting',
  'riseaustin',
  'dead',
  'need',
  'upgrade',
  'plugin',
  'stations',
  'sxsw'],
 ['know',
  'awesome',
  'ipadiphone',
  'app',
  'youll',
  'likely',
  'appreciate',
  'design',
  'also',
  'theyre',
  'giving',
  'free',
  'ts',
  'sxsw'],
 ['wait', 'ipad', '2', 'also', 'sale', 'sxsw'],
 ['hope',
  'years',
  'festival',
  'isnt',
  'crashy',
  'years',
  'iphone',
  'app',
  'sxsw'],
 ['great',
  'stuff',
  'fri',
  'sxsw',
  'marissa',
  'mayer',
  'google',
  'tim',
  'oreilly',
  'tech',
  'booksconferences',
  'amp',
  'matt',
  'mullenweg',
  'wordpress'],
 ['new',
  'ipad',
  'apps',
  'speechtherapy',
  'communication',
  'showcased',
  'sxsw',
  'conference',
  'iear',
  'edchat',
  'asd'],
 ['sxsw',
  'starting',
  'ctia',
  'around',
  'corner',
  'googleio',
  'hop',
  'skip',
  'jump',
  'good',
  'time',
  'android',
  'fan'],
 ['beautifully',
  'smart',
  'simple',
  'idea',
  'rt',
  'wrote',
  'hollergram',
  'i

In [70]:
import gensim

In [71]:
# `size` is the gensim 3.x argument name; gensim 4.0+ calls it `vector_size`
model = gensim.models.Word2Vec(sentences=review_lines, size=EMBEDDING_DIM, window=5, workers=4, min_count=1)

In [72]:
# `wv.vocab` is gensim 3.x; gensim 4.0+ exposes the vocabulary as `wv.key_to_index`
words = list(model.wv.vocab)

In [73]:
print('Vocabulary Size: %d' % len(words))

Vocabulary Size: 10380


In [74]:
model.wv.most_similar('help')

[('room', 0.9999278783798218),
 ('meet', 0.9998672008514404),
 ('without', 0.9998671412467957),
 ('say', 0.9998602271080017),
 ('user', 0.9998586773872375),
 ('much', 0.9998446106910706),
 ('also', 0.9998436570167542),
 ('hit', 0.9998413324356079),
 ('give', 0.9998410940170288),
 ('battery', 0.9998387098312378)]
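
`most_similar` ranks words by cosine similarity between their vectors. The sketch below computes that measure in plain Python on made-up 3-d vectors, to show how nearly parallel vectors score close to 1.0 regardless of meaning.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-d vectors: nearly parallel vs. orthogonal.
print(round(cosine_similarity([1.0, 2.0, 3.0], [1.1, 2.0, 2.9]), 4))
print(round(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]), 4))  # 0.0
```

Scores of about 0.9999 for words as unrelated as `room` and `battery` suggest that the learned vectors are almost parallel to one another. That may indicate the embeddings are undertrained on this small corpus, so the neighbor list should be read with caution.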

In [75]:
filename = 'twitter_embedding_word2vec.txt'
model.wv.save_word2vec_format(filename, binary=False)

In [76]:
len(review_lines)

8547

In [77]:
import os

# Load the saved word2vec text file into a {word: vector} dictionary
embeddings_index = {}
with open(os.path.join('', 'twitter_embedding_word2vec.txt'), encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
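
The word2vec text format begins with a header line holding the vocabulary size and the vector dimension, followed by one word per line. The sketch below parses a tiny made-up sample of that format and reads the header separately, so it does not end up in the dictionary as if it were a word.

```python
import io

# Tiny stand-in for twitter_embedding_word2vec.txt (made-up words and values).
sample = io.StringIO("2 3\nipad 0.1 0.2 0.3\nsxsw -0.4 0.5 0.6\n")

# First line: "<vocab_count> <dimension>".
header = sample.readline().split()
vocab_count, dim = int(header[0]), int(header[1])

embeddings = {}
for line in sample:
    parts = line.rstrip().split(" ")
    word, coefs = parts[0], [float(x) for x in parts[1:]]
    if len(coefs) == dim:  # skip malformed rows
        embeddings[word] = coefs

print(vocab_count, dim, embeddings["sxsw"])
# 2 3 [-0.4, 0.5, 0.6]
```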

In [78]:
VALIDATION_SPLIT = 0.3

# vectorize the text samples into a 2D integer tensor
tokenizer_obj = Tokenizer()
tokenizer_obj.fit_on_texts(review_lines)
sequences = tokenizer_obj.texts_to_sequences(review_lines)

# pad sequences
word_index = tokenizer_obj.word_index
print('Found %s unique tokens.' % len(word_index))

review_pad = pad_sequences(sequences, maxlen=max_length)
sentiment =  df['sentiment'].values
print('Shape of review tensor:', review_pad.shape)
print('Shape of sentiment tensor:', sentiment.shape)

# split the data into a training set and a validation set
indices = np.arange(review_pad.shape[0])
np.random.shuffle(indices)
review_pad = review_pad[indices]
sentiment = sentiment[indices]
num_validation_samples = int(VALIDATION_SPLIT * review_pad.shape[0])

X_train_pad = review_pad[:-num_validation_samples]
y_train = sentiment[:-num_validation_samples]
X_test_pad = review_pad[-num_validation_samples:]
y_test = sentiment[-num_validation_samples:]

Found 10380 unique tokens.
Shape of review tensor: (8547, 26)
Shape of sentiment tensor: (8547,)


In [79]:
print('Shape of X_train_pad tensor:', X_train_pad.shape)
print('Shape of y_train tensor:', y_train.shape)

print('Shape of X_test_pad tensor:', X_test_pad.shape)
print('Shape of y_test tensor:', y_test.shape)

Shape of X_train_pad tensor: (5983, 26)
Shape of y_train tensor: (5983,)
Shape of X_test_pad tensor: (2564, 26)
Shape of y_test tensor: (2564,)
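
These shapes follow from the split arithmetic. The sketch below reproduces the sizes from the row count reported earlier, and adds a seeded shuffle; the fixed seed is an assumption made here for repeatability, since the shuffle in the cell above is not seeded.

```python
import random

VALIDATION_SPLIT = 0.3
n_rows = 8547  # rows in review_pad, from the shape printed earlier

num_validation_samples = int(VALIDATION_SPLIT * n_rows)
num_training_samples = n_rows - num_validation_samples
print(num_training_samples, num_validation_samples)
# 5983 2564

# A fixed seed (an assumption) makes the split repeatable across runs.
indices = list(range(n_rows))
random.Random(42).shuffle(indices)
train_idx = indices[:-num_validation_samples]
val_idx = indices[-num_validation_samples:]
```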


In [80]:
EMBEDDING_DIM = 100
num_words = len(word_index) + 1
# Rows for words missing from the embedding index stay all-zeros
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))

for word, i in word_index.items():
    if i >= num_words:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [81]:
print(num_words)

10381


In [82]:
lb = LabelBinarizer()
lb.fit(y_train)

y_train_lb = to_categorical(lb.transform(y_train))[:, :, 1]
y_test_lb = to_categorical(lb.transform(y_test))[:, :, 1]

In [83]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten, Conv1D, MaxPooling1D
from keras.initializers import Constant

# define model
model = Sequential()
# load pre-trained word embeddings into an Embedding layer
# note that we set trainable = False so as to keep the embeddings fixed
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=max_length,
                            trainable=False)

model.add(embedding_layer)
model.add(Conv1D(filters=128, kernel_size=5, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(3, activation='softmax'))
print(model.summary())

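# compile network
# note: 'categorical_crossentropy' is the usual loss for a 3-class softmax; the log below was produced with 'binary_crossentropy'
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])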

# fit the model
model.fit(X_train_pad, y_train_lb, batch_size=128, epochs=50, validation_data=(X_test_pad, y_test_lb), verbose=2)

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 26, 100)           1038100   
_________________________________________________________________
conv1d (Conv1D)              (None, 22, 128)           64128     
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 11, 128)           0         
_________________________________________________________________
flatten (Flatten)            (None, 1408)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 3)                 4227      
Total params: 1,106,455
Trainable params: 68,355
Non-trainable params: 1,038,100
_________________________________________________________________
None
Epoch 1/50
47/47 - 1s - loss: 0.5320 - accuracy: 0.5771 - val_loss: 0.5232 - val_accuracy: 0.5846
Ep

<tensorflow.python.keras.callbacks.History at 0x7fbe4e3810d0>
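
The output shapes and parameter counts in the CNN summary can be verified with a few lines of arithmetic. The sketch below assumes the Keras defaults of `'valid'` padding and stride 1 for `Conv1D`, and takes the other values from the cells above.

```python
num_words = 10381
embedding_dim = 100
max_length = 26
filters, kernel_size, pool_size, n_classes = 128, 5, 2, 3

conv_len = max_length - kernel_size + 1          # 'valid' padding, stride 1
pool_len = conv_len // pool_size
flat_dim = pool_len * filters

embedding_params = num_words * embedding_dim     # frozen (trainable=False)
conv_params = kernel_size * embedding_dim * filters + filters
dense_params = flat_dim * n_classes + n_classes

print(conv_len, pool_len, flat_dim)                      # 22 11 1408
print(conv_params + dense_params, embedding_params)      # 68355 1038100
```

Only the convolution and dense layers are trained here; the embedding weights stay fixed at the Word2Vec values.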