## Project Description
The goal of this project is to build a model that performs sentiment analysis on Twitter messages. Sentiment analysis is a method for gauging the attitudes and opinions expressed in text towards particular topics. Specifically, this project determines whether Twitter users’ attitudes towards the products of two brands, Apple and Google, are positive or negative. Brands can use the results to monitor their reputation across social media and to improve their products so that they better meet customers’ demands.


In [1]:
import re
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
import string
import nltk
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from time import time

%matplotlib inline

In [2]:
df = pd.read_csv(r'''C:\Users\mia\files\drive-download-20190215T130906Z-001\judge-1377884607_tweet_product_company.csv''',encoding='latin-1')

In [3]:
df.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
tweet_text                                            9092 non-null object
emotion_in_tweet_is_directed_at                       3291 non-null object
is_there_an_emotion_directed_at_a_brand_or_product    9093 non-null object
dtypes: object(3)
memory usage: 213.2+ KB


In [5]:
# Change the column names for convenience
df.columns = ['tweets','brand','emotion']

## Data Preprocessing

##### Just from the first five rows, it looks like this data set contains tweets on more than one brand's products. Let's discover how many different brands these tweets are about.

In [6]:
df.describe().T

| | count | unique | top | freq |
|---|---|---|---|---|
| tweets | 9092 | 9065 | RT @mention Marissa Mayer: Google Will Connect... | 5 |
| brand | 3291 | 9 | iPad | 946 |
| emotion | 9093 | 4 | No emotion toward brand or product | 5389 |


#### The brand column has 9 unique values. A closer look shows that these are not 9 distinct brands but products and services of two main brands, plus a few other categories:


In [7]:
df.brand.unique()

array(['iPhone', 'iPad or iPhone App', 'iPad', 'Google', nan, 'Android',
       'Apple', 'Android App', 'Other Google product or service',
       'Other Apple product or service'], dtype=object)

In [8]:
df.groupby('brand').count()

| brand | tweets | emotion |
|---|---|---|
| Android | 78 | 78 |
| Android App | 81 | 81 |
| Apple | 661 | 661 |
| Google | 430 | 430 |
| Other Apple product or service | 35 | 35 |
| Other Google product or service | 293 | 293 |
| iPad | 946 | 946 |
| iPad or iPhone App | 470 | 470 |
| iPhone | 297 | 297 |


##### Grouping the data by ‘brand’ makes it clearer that most of the tweets are about the products or services of two brands, Google and Apple. The data set will be partitioned into two parts based on the target brand. For tweets labelled 'Android' or 'Android App', one can't tell which specific brand's products they are about, so they will be dropped. The model will be built using only the tweets about Apple products. Once the final model is chosen, the Google tweets will be used to check whether the model also works on tweets about a different brand, in other words, how well it generalizes.

In [9]:
df.emotion.unique()

array(['Negative emotion', 'Positive emotion',
       'No emotion toward brand or product', "I can't tell"], dtype=object)

##### Moreover, this data set has four emotion labels: ‘Negative emotion’, ‘Positive emotion’, ‘I can’t tell’, and ‘No emotion toward brand or product’. Only two of them matter in this project: positive emotion and negative emotion. Thus, the samples with any other label will be ignored.

In [10]:
# Keep only tweets with positive or negative emotion.
df = df.loc[df.emotion.isin(['Negative emotion','Positive emotion'])]

# Create two new dataframes based on the brand the tweets are about 
# Only Keep tweets on two brands -- Apple and Google
apple_df = df.loc[df.brand.isin(['iPhone', 'iPad or iPhone App', 'iPad','Apple','Other Apple product or service'])]
google_df = df.loc[df.brand.isin(['Google','Other Google product or service'])]

# Reset the index of the two new dataframes
apple_df = apple_df.reset_index(drop=True)
google_df = google_df.reset_index(drop=True)

In [11]:
apple_df.describe().T

| | count | unique | top | freq |
|---|---|---|---|---|
| tweets | 2337 | 2332 | Oh. My. God. The #SXSW app for iPad is pure, u... | 2 |
| brand | 2337 | 5 | iPad | 918 |
| emotion | 2337 | 2 | Positive emotion | 1949 |


##### Notice that there are five fewer unique tweets (2332) than total tweets (2337), which means there are duplicates. The duplicates will be dropped.

In [12]:
# Drop duplicate tweets
apple_df = apple_df.loc[~apple_df.tweets.duplicated()]
apple_df.describe().T

| | count | unique | top | freq |
|---|---|---|---|---|
| tweets | 2332 | 2332 | Rockin an iPad 2 from the downtown Apple #SXSW... | 1 |
| brand | 2332 | 5 | iPad | 917 |
| emotion | 2332 | 2 | Positive emotion | 1945 |


In [13]:
google_df.describe().T

| | count | unique | top | freq |
|---|---|---|---|---|
| tweets | 697 | 695 | RT @mention Marissa Mayer: Google Will Connect... | 3 |
| brand | 697 | 2 | Google | 414 |
| emotion | 697 | 2 | Positive emotion | 582 |


##### The same holds for the data about Google products: of the 697 tweets, 695 are unique. The duplicates will be dropped as well.

In [14]:
google_df = google_df.loc[~google_df.tweets.duplicated()]
google_df.describe().T

| | count | unique | top | freq |
|---|---|---|---|---|
| tweets | 695 | 695 | Fantastico! RT @mention Marissa Mayer: Google ... | 1 |
| brand | 695 | 2 | Google | 412 |
| emotion | 695 | 2 | Positive emotion | 580 |


#### Now, move on to cleaning the tweets so that they can be represented numerically. URLs, Twitter usernames (@something), numbers, punctuation, and special characters will be removed. After the URLs are taken out and before any punctuation is removed, contractions are expanded first: if the apostrophe were simply stripped, a contraction could be misread, especially in the case of negations. Since the main reason for using the contractions package is to avoid misinterpreting negations, possessive phrases such as "year's" are not expanded. For more information on this package, please see [here](https://github.com/kootenpv/contractions). The URLs, usernames, and other patterns are matched with regular expressions from Python's re library. After cleaning, each tweet is tokenized and the tokens are joined back together with a single space between words.


In [15]:
import contractions

# Define a function that takes a Series of tweets and returns a list of cleaned tweet strings
def process_tweets(tweets):
    # Remove URLs (http or https) in tweets
    clean_tweets = tweets.apply(
        lambda x: re.sub(r'https?://\S+', ' ', x))
    # Remove tweet user names
    clean_tweets = clean_tweets.apply(lambda x: re.sub(r'@[\w]*', ' ',x))
    
    # Expand contractions
    clean_tweets = clean_tweets.apply(lambda x: contractions.fix(x))
    
    # Remove punctuations, numbers, and any special characters
    clean_tweets = clean_tweets.apply(lambda x: re.sub(r'[^a-zA-Z\s\']',' ',x))
    # Tokenization
    tokens = clean_tweets.apply(lambda x:x.split())
    
    return [' '.join(token) for token in tokens]
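
To illustrate why contractions are expanded before punctuation is removed, here is a minimal sketch that uses only the standard library. The `TOY_CONTRACTIONS` table and both helper functions are hypothetical stand-ins for illustration; the notebook itself relies on the `contractions` package, and this toy version strips apostrophes on purpose to show what would be lost.

```python
import re

# Hypothetical, tiny contraction table for illustration only; the
# notebook relies on the `contractions` package instead.
TOY_CONTRACTIONS = {"can't": "cannot", "isn't": "is not", "don't": "do not"}

def expand_toy_contractions(text):
    # Replace each known contraction, ignoring case.
    for short, full in TOY_CONTRACTIONS.items():
        text = re.sub(re.escape(short), full, text, flags=re.IGNORECASE)
    return text

def strip_non_letters(text):
    # Keep letters and whitespace only, then collapse runs of spaces.
    return ' '.join(re.sub(r"[^a-zA-Z\s]", ' ', text).split())

tweet = "I can't stand the new app, it isn't stable"

naive = strip_non_letters(tweet)
expanded = strip_non_letters(expand_toy_contractions(tweet))

print(naive)     # negations are split into fragments such as "can t"
print(expanded)  # negations survive as "cannot" and "is not"
```

Without the expansion step, the negation is reduced to fragments like "can" and "t", which a word-count model cannot distinguish from the affirmative form.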

#### Furthermore, I create two functions to normalize the tweets further by stemming or lemmatizing the words. The goal of both stemming and lemmatization is to reduce derived words to their root forms, but the two differ in approach. Stemming usually just chops off the ends of words, whereas lemmatization uses a morphological analysis of the words. Ronald Wahome claims that these two techniques may not work well here, because they shorten words to their base forms and Twitter messages are short by design. I would still like to try both methods to see whether they make a clear difference to my model. There are many different implementations of stemming and lemmatization in Python; I use the LancasterStemmer and the WordNetLemmatizer from the NLTK library.

### Lemmatization and Stemming

In [16]:
import nltk

In [17]:
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer

In [18]:
def stem_words(tweet):
    # The argument 'tweet' is a single string of space-separated words
    
    words = tweet.split()
    stemmer = LancasterStemmer()
    stems = [stemmer.stem(word) for word in words]
    return ' '.join(stems)

def lemmatize_words(tweet):
    # The argument 'tweet' is a single string of space-separated words
    
    words = tweet.split()
    lemmatizer = WordNetLemmatizer()
    verb_lemmas = [lemmatizer.lemmatize(word, pos='v') for word in words]
    noun_lemmas = [lemmatizer.lemmatize(word) for word in words]
    
    return ' '.join(verb_lemmas),' '.join(noun_lemmas)
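
To make the difference between the two approaches concrete, here is a toy sketch in plain Python. The suffix list and the lookup table are hypothetical and far simpler than what the LancasterStemmer and WordNetLemmatizer actually do; the point is only to contrast rule-based chopping with dictionary-based lookup.

```python
# Toy stand-ins, for illustration only: real stemmers and lemmatizers
# (such as NLTK's) use far richer rules and dictionaries.
TOY_SUFFIXES = ("ning", "ing", "s")          # hypothetical suffix list
TOY_LEMMAS = {"wins": "win", "winning": "win", "begins": "begin",
              "better": "good"}              # hypothetical lookup table

def toy_stem(word):
    # Chop the first matching suffix, keeping at least three characters.
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    # Look the word up; unknown words are returned unchanged.
    return TOY_LEMMAS.get(word, word)

words = ["wins", "winning", "begins", "better", "congress"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

for w, s, l in zip(words, stems, lemmas):
    print("{:>10} {:>10} {:>10}".format(w, s, l))
```

In this sketch the stemmer maps "congress" to the non-word "congres" and leaves "better" untouched, while the lookup-based lemmatizer returns "congress" unchanged and maps "better" to "good".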

In [19]:
# Notebook gave an error and asked to download 'wordnet'
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mia\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Converting Twitter Messages to Be Numeric

In [20]:
# Without stemming and lemmatizing
apple_processed_tweets = process_tweets(apple_df.tweets)

#### Bag of Words vs. TF-IDF

In [21]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
bow_vectorizer = CountVectorizer(stop_words='english')
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

bow = bow_vectorizer.fit_transform(apple_processed_tweets)
tfidf = tfidf_vectorizer.fit_transform(apple_processed_tweets)
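
As a rough illustration of how the two representations differ, the sketch below computes both by hand on three made-up documents. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is my understanding of the scikit-learn defaults; the documents and variable names are illustrative only.

```python
import math
from collections import Counter

# Made-up documents, for illustration only.
docs = ["great app great phone", "battery died again", "great battery"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Bag of words: raw term counts per document.
bow_rows = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
n_docs = len(tokenized)
df = {w: sum(w in doc for doc in tokenized) for w in vocab}
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def l2_normalize(row):
    # Scale the row to unit Euclidean length (all-zero rows stay as-is).
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row

# TF-IDF: counts weighted by IDF, then each row normalized.
tfidf_rows = [l2_normalize([count * idf[w] for count, w in zip(row, vocab)])
              for row in bow_rows]

print(vocab)
print(bow_rows[0])
print([round(v, 3) for v in tfidf_rows[0]])
```

Words that occur in many documents (here "great" and "battery") receive a lower IDF weight than words that occur in only one, so TF-IDF downweights common terms, whereas the bag of words keeps the raw counts.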

#### Convert Target Variables to be Numeric

In [22]:
di = {'Negative emotion':0, 'Positive emotion':1}
y = apple_df.emotion.map(di)
y.unique()

array([0, 1], dtype=int64)

## Train the Model

In [23]:
# Split the data into a training set and a test set
bow_X_train, bow_X_test, bow_y_train, bow_y_test = train_test_split(bow,y)


mnb_bow = MultinomialNB()
mnb_bow.fit(bow_X_train,bow_y_train)

#mnb_bow.predict(g_bowX)
print('BOW & Multinomial: The accuracy score for the training set is {0:.3f}'.format(mnb_bow.score(bow_X_train,bow_y_train)))
print('BOW & Multinomial: The accuracy score for the test set is {0:.3f}'.format(mnb_bow.score(bow_X_test, bow_y_test)))


BOW & Multinomial: The accuracy score for the training set is 0.949
BOW & Multinomial: The accuracy score for the test set is 0.859


In [24]:
# Use TF-IDF
tf_X_train, tf_X_test, tf_y_train, tf_y_test = train_test_split(tfidf,y)


mnb_tfidf = MultinomialNB()
mnb_tfidf.fit(tf_X_train,tf_y_train)

print('TF-IDF & Multinomial: The accuracy score for the training set is {0:.3f}'.format(mnb_tfidf.score(tf_X_train,tf_y_train)))
print('TF-IDF & Multinomial: The accuracy score for the test set is {0:.3f}'.format(mnb_tfidf.score(tf_X_test, tf_y_test)))


TF-IDF & Multinomial: The accuracy score for the training set is 0.860
TF-IDF & Multinomial: The accuracy score for the test set is 0.813


#### With the Multinomial Naive Bayes Algorithm, BOW Gives the Better Result
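
One caveat: the BOW and TF-IDF scores above come from two separate `train_test_split` calls without a fixed `random_state`, so the two models are evaluated on different rows and part of the gap may be due to the split. Below is a minimal sketch of one way to reuse a single split. The helper `split_indices`, the seed 38, and the 25% test fraction (which I believe matches the `train_test_split` default) are illustrative choices, not part of the original notebook.

```python
import random

def split_indices(n_rows, test_fraction=0.25, seed=38):
    # Shuffle the row positions once with a fixed seed, then cut.
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    n_test = int(round(n_rows * test_fraction))
    return sorted(order[n_test:]), sorted(order[:n_test])

# 2332 is the number of de-duplicated Apple tweets.
train_idx, test_idx = split_indices(2332)
print(len(train_idx), len(test_idx))
```

The same `train_idx` and `test_idx` can then select rows from `bow`, `tfidf`, and `y` (for the pandas Series, via `y.iloc`). Passing all three to a single `train_test_split` call with a fixed `random_state` should achieve the same effect.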

### Try Different Algorithms

In [25]:
from sklearn import svm

clf = svm.SVC()
clf.fit(bow_X_train,bow_y_train)
print('BOW & SVM: The accuracy score for the training set is {0:.3f}'.format(clf.score(bow_X_train, bow_y_train)))
print('BOW & SVM: The accuracy score for the test set is {0:.3f}'.format(clf.score(bow_X_test, bow_y_test)))
print()

clf = svm.SVC()
clf.fit(tf_X_train,tf_y_train)
print('TF-IDF & SVM: The accuracy score for the training set is {0:.3f}'.format(clf.score(tf_X_train,tf_y_train)))
print('TF-IDF & SVM: The accuracy score for the test set is {0:.3f}'.format(clf.score(tf_X_test, tf_y_test)))


BOW & SVM: The accuracy score for the training set is 0.836
BOW & SVM: The accuracy score for the test set is 0.827

TF-IDF & SVM: The accuracy score for the training set is 0.844
TF-IDF & SVM: The accuracy score for the test set is 0.804


#### Comparing the combinations of word representations and classification algorithms, BOW with Naive Bayes turns out to be the best combination for training the model. Next, I will tune the hyperparameters of the BOW vectorizer and the Multinomial Naive Bayes classifier to further improve the model.

### Hyperparameter Tuning

In [26]:
parameters = {'vect__max_df':[0.8,0.85,0.9,0.95], 'vect__min_df':[1,2],'vect__binary':[True,False],'clf__alpha':[0.25,0.5,0.75]}
pipeline = Pipeline([('vect',CountVectorizer()),
                    ('clf',MultinomialNB())])
grid_search = GridSearchCV(pipeline, parameters,cv=5,
                          n_jobs=1,verbose=1,scoring='accuracy')

X_train, X_test, y_train, y_test = train_test_split(apple_processed_tweets,y)

t0 = time()
grid_search.fit(X_train,y_train)
print('done in {0:0.3f}'.format((time() - t0)))
print()

print('Best score for training set: {0:0.3f}'.format(grid_search.best_score_))
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Fitting 5 folds for each of 48 candidates, totalling 240 fits
done in 13.447

Best score for training set: 0.867
Best parameters set:
	clf__alpha: 0.5
	vect__binary: True
	vect__max_df: 0.8
	vect__min_df: 1


[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:   13.3s finished
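
The "48 candidates" and "240 fits" reported above follow directly from the size of the grid. The sketch below enumerates the same parameter dictionary with `itertools.product` to show where those numbers come from; it only counts combinations and does not train anything.

```python
from itertools import product

parameters = {
    'vect__max_df': [0.8, 0.85, 0.9, 0.95],
    'vect__min_df': [1, 2],
    'vect__binary': [True, False],
    'clf__alpha': [0.25, 0.5, 0.75],
}

# Every combination of one value per parameter is one candidate.
names = sorted(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[n] for n in names))]

n_folds = 5
print(len(candidates))            # 4 * 2 * 2 * 3 = 48
print(len(candidates) * n_folds)  # 48 * 5 = 240
```

Because the cost grows multiplicatively, adding one more parameter with four values would quadruple the number of fits.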


In [27]:
# Predict on test set
print('Accuracy on test set is {0:0.3f}'.format(grid_search.score(X_test,y_test)))

Accuracy on test set is 0.890


#### The best cross-validated score on the training data set is 0.867 and the score on the test data set is 0.890. The test score is even a little better than the training score. It seems like a great result, but it is not quite as good as it looks. I will explain why shortly.

## Strongly Predictive Features

#### I use the get_feature_names method of the CountVectorizer (get_feature_names_out in scikit-learn 1.0 and later) to get the actual words. An identity matrix is used to create a matrix in which each row contains exactly one word. Then, I use the trained MultinomialNB classifier to predict on this matrix. Finally, I sort the rows by predicted probability and pick the top and bottom 10 rows.
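
As a toy illustration of the idea, separate from the notebook's data, the sketch below scores one-word "documents" with a hand-written multinomial naive Bayes. All counts, the three-word vocabulary, and the function names are hypothetical; only the smoothed likelihood `(count + alpha) / (total + alpha * V)` is the standard formula.

```python
# Hypothetical word counts per class, for illustration only.
vocab = ["win", "hate", "ipad"]
counts = {"negative": [1, 8, 20], "positive": [30, 1, 60]}
n_docs = {"negative": 20, "positive": 80}
alpha = 0.5  # additive smoothing

def word_likelihoods(label):
    # Smoothed P(word | class) as used by multinomial naive Bayes.
    total = sum(counts[label]) + alpha * len(vocab)
    return [(c + alpha) / total for c in counts[label]]

def p_positive_given_word(word_index):
    # A one-hot row means the document consists of exactly this word, once.
    total_docs = sum(n_docs.values())
    joint = {label: n_docs[label] / total_docs
             * word_likelihoods(label)[word_index]
             for label in counts}
    return joint["positive"] / sum(joint.values())

scores = {w: p_positive_given_word(i) for i, w in enumerate(vocab)}
ranked = sorted(vocab, key=scores.get, reverse=True)

for w in ranked:
    print("{:>6} {:.2f}".format(w, scores[w]))
```

In this toy setup the neutral word "ipad" scores close to the class prior of 0.8, which suggests that with an imbalanced data set even unemotional words can receive a high positive probability.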

In [28]:
vec = CountVectorizer(binary=grid_search.best_params_['vect__binary'],
                      min_df=grid_search.best_params_['vect__min_df'],
                      max_df=grid_search.best_params_['vect__max_df'])

X = vec.fit_transform(apple_processed_tweets)

In [29]:

X_train, X_test, y_train, y_test = train_test_split(X,y)
clf = MultinomialNB(alpha=grid_search.best_params_['clf__alpha']).fit(X_train,y_train)

training_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)

print("Accuracy on training data: {0:2f}".format(training_accuracy))
print("Accuracy on test data:     {0:2f}".format(test_accuracy))

Accuracy on training data: 0.968553
Accuracy on test data:     0.825043


In [30]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, clf.predict(X_test)))

[[ 54  38]
 [ 64 427]]
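
The confusion matrix can be unpacked into a few summary numbers. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, with 0 = negative and 1 = positive; the variable names are my own.

```python
# Rows are true labels, columns are predicted labels
# (0 = negative, 1 = positive).
cm = [[54, 38],
      [64, 427]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
negative_recall = tn / (tn + fp)       # share of negative tweets caught
positive_recall = tp / (tp + fn)       # share of positive tweets caught
majority_baseline = (fn + tp) / total  # always predicting "positive"

print("accuracy          {:.3f}".format(accuracy))
print("negative recall   {:.3f}".format(negative_recall))
print("positive recall   {:.3f}".format(positive_recall))
print("majority baseline {:.3f}".format(majority_baseline))
```

The accuracy reproduces the 0.825 reported above. On this particular split it is slightly below the score obtained by always predicting "positive", and fewer than 60% of the negative tweets are identified, so accuracy alone overstates how well the model separates the two classes.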


In [31]:
# Vocabulary words in column order (get_feature_names_out requires scikit-learn >= 1.0)
features = np.array(vec.get_feature_names_out())

In [32]:
x = np.eye(X_test.shape[1])
probs = clf.predict_log_proba(x)[:,0]
ind = np.argsort(probs)

good_words = features[ind[:10]]
bad_words = features[ind[-10:]]

good_prob = probs[ind[:10]]
bad_prob = probs[ind[-10:]]

print("Good words\t     P(positivity | word)")
for w, p in zip(good_words, good_prob):
    print("{:>20}".format(w), "{:.2f}".format(1 - np.exp(p)))
    
# 1 - exp(log P(negative | word)) is P(positive | word), so low values mark negative words
print("Bad words\t     P(positivity | word)")
for w, p in zip(bad_words, bad_prob):
    print("{:>20}".format(w), "{:.2f}".format(1 - np.exp(p)))

Good words	     P(positivity | word)
              before 0.99
                wins 0.99
                 win 0.99
              begins 0.99
               smart 0.99
             winning 0.99
            downtown 0.98
            download 0.98
                 set 0.98
                game 0.98
Bad words	     P(positivity | word)
                suns 0.13
         autocorrect 0.13
             swisher 0.11
             novelty 0.10
           classiest 0.10
           delegates 0.10
               among 0.10
               fades 0.08
             fascist 0.07
                hate 0.06


In [33]:
google_processed_tweets = process_tweets(google_df.tweets)
google_x = vec.transform(google_processed_tweets)
google_y = google_df.emotion.map(di)
print('The accuracy score on the google tweets is {0:3f}'.format(
    clf.score(google_x,google_y)))

The accuracy score on the google tweets is 0.753957


#### The results are not as convincing as I hoped: not everyone would agree that the listed words are clearly positive or negative. For instance, set, before, and begins are among the 10 words most predictive of positive sentiment, but I don't associate any emotion with them. Some of the words most predictive of negative sentiment make little sense, such as classiest and novelty. These words definitely indicate positiveness to me. I am not sure why my model would learn that they are among the top 10 negative words, unless they are often used sarcastically in my data set. Moreover, I notice that both win and wins appear among the top 10 positive words, although they clearly mean the same thing. Thus, I think I should try word stemming and lemmatization and see whether that helps improve my model.

## Try Word Stems and Lemmatization

In [34]:
stems = pd.Series(apple_processed_tweets).apply(stem_words)

In [35]:
vec = CountVectorizer(binary=grid_search.best_params_['vect__binary'],
                      min_df=grid_search.best_params_['vect__min_df'],
                      max_df=grid_search.best_params_['vect__max_df'])
stem_X = vec.fit_transform(stems)

X_train, X_test, y_train, y_test = train_test_split(stem_X,y,random_state=38)
clf = MultinomialNB(alpha=grid_search.best_params_['clf__alpha']).fit(X_train,y_train)

training_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)

print("Accuracy on training data: {0:2f}".format(training_accuracy))
print("Accuracy on test data:     {0:2f}".format(test_accuracy))

Accuracy on training data: 0.955975
Accuracy on test data:     0.869640


In [36]:

# Vocabulary words in column order (get_feature_names_out requires scikit-learn >= 1.0)
features = np.array(vec.get_feature_names_out())

x = np.eye(X_test.shape[1])
probs = clf.predict_log_proba(x)[:,0]
ind = np.argsort(probs)

good_words = features[ind[:10]]
bad_words = features[ind[-10:]]

good_prob = probs[ind[:10]]
bad_prob = probs[ind[-10:]]

print("Good words\t     P(positivity | word)")
for w, p in zip(good_words, good_prob):
    print("{:>20}".format(w), "{:.2f}".format(1 - np.exp(p)))
    
# 1 - exp(log P(negative | word)) is P(positive | word), so low values mark negative words
print("Bad words\t     P(positivity | word)")
for w, p in zip(bad_words, bad_prob):
    print("{:>20}".format(w), "{:.2f}".format(1 - np.exp(p)))

Good words	     P(positivity | word)
                 win 1.00
                 set 0.99
               begin 0.99
                 gam 0.98
                  th 0.98
            congress 0.98
            downtown 0.98
                play 0.98
                 mus 0.98
             tonight 0.98
Bad words	     P(positivity | word)
                 kar 0.10
               swish 0.09
               among 0.08
                fuck 0.08
           classiest 0.08
              iphone 0.08
               deleg 0.08
               novel 0.08
                 fad 0.07
                fasc 0.06


#### Please note that some of the word stems in the above list do not look like proper English words. That is what one may get when using the LancasterStemmer: it tends to stem heavily, which produces stems that are not real words. However, as the results of the remaining models show, it gives the best overall classification accuracy among my four models.


In [37]:
g_stems = pd.Series(google_processed_tweets).apply(stem_words)

In [38]:
print('The accuracy score on the google tweets is {0:3f}'.format(
    clf.score(vec.transform(g_stems),google_y)))

The accuracy score on the google tweets is 0.761151


In [39]:
verb_lemma = [x[0] for x in pd.Series(apple_processed_tweets).apply(lemmatize_words)]
noun_lemma = [x[1] for x in pd.Series(apple_processed_tweets).apply(lemmatize_words)]


In [40]:
vec = CountVectorizer(binary=grid_search.best_params_['vect__binary'],
                      min_df=grid_search.best_params_['vect__min_df'],
                      max_df=grid_search.best_params_['vect__max_df'])
verb_lemma_X = vec.fit_transform(verb_lemma)

X_train, X_test, y_train, y_test = train_test_split(verb_lemma_X,y,random_state=38)
clf = MultinomialNB(alpha=grid_search.best_params_['clf__alpha']).fit(X_train,y_train)

training_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)

print("Accuracy on training data: {0:2f}".format(training_accuracy))
print("Accuracy on test data:     {0:2f}".format(test_accuracy))

Accuracy on training data: 0.959977
Accuracy on test data:     0.842196


In [41]:
g_verb_lemma = [x[0] for x in pd.Series(google_processed_tweets).apply(lemmatize_words)]
#g_noun_lemma = [x[1] for x in pd.Series(google_processed_tweets).apply(lemmatize_words)]

In [42]:
print('The accuracy score on the google tweets is {0:3f}'.format(
    clf.score(vec.transform(g_verb_lemma),google_y)))

The accuracy score on the google tweets is 0.745324


In [43]:
vec = CountVectorizer(binary=grid_search.best_params_['vect__binary'],
                      min_df=grid_search.best_params_['vect__min_df'],
                      max_df=grid_search.best_params_['vect__max_df'])
noun_lemma_X = vec.fit_transform(noun_lemma)

X_train, X_test, y_train, y_test = train_test_split(noun_lemma_X,y,random_state=38)
clf = MultinomialNB(alpha=grid_search.best_params_['clf__alpha']).fit(X_train,y_train)

training_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)

print("Accuracy on training data: {0:2f}".format(training_accuracy))
print("Accuracy on test data:     {0:2f}".format(test_accuracy))

Accuracy on training data: 0.965695
Accuracy on test data:     0.869640


In [44]:
# Vocabulary words in column order (get_feature_names_out requires scikit-learn >= 1.0)
features = np.array(vec.get_feature_names_out())

x = np.eye(X_test.shape[1])
probs = clf.predict_log_proba(x)[:,0]
ind = np.argsort(probs)

good_words = features[ind[:10]]
bad_words = features[ind[-10:]]

good_prob = probs[ind[:10]]
bad_prob = probs[ind[-10:]]

print("Good words\t     P(positivity | word)")
for w, p in zip(good_words, good_prob):
    print("{:>20}".format(w), "{:.2f}".format(1 - np.exp(p)))
    
# 1 - exp(log P(negative | word)) is P(positive | word), so low values mark negative words
print("Bad words\t     P(positivity | word)")
for w, p in zip(bad_words, bad_prob):
    print("{:>20}".format(w), "{:.2f}".format(1 - np.exp(p)))

Good words	     P(positivity | word)
                 win 0.99
                 set 0.99
                wins 0.99
              begins 0.99
            download 0.98
                  th 0.98
             winning 0.98
            congress 0.98
                game 0.98
            downtown 0.98
Bad words	     P(positivity | word)
                kara 0.11
            headache 0.11
                hate 0.09
             swisher 0.09
           classiest 0.08
             novelty 0.08
               among 0.08
            delegate 0.08
                fade 0.07
             fascist 0.06


In [45]:
#g_verb_lemma = [x[0] for x in pd.Series(google_processed_tweets).apply(lemmatize_words)]
g_noun_lemma = [x[1] for x in pd.Series(google_processed_tweets).apply(lemmatize_words)]

In [46]:
print('The accuracy score on the google tweets is {0:3f}'.format(
    clf.score(vec.transform(g_noun_lemma),google_y)))

The accuracy score on the google tweets is 0.735252


#### It looks like training the MultinomialNB classifier on word stems performs best: it reaches the highest accuracy on the Apple test data (0.870, tied with the noun lemmas) and the highest accuracy on the Google data (0.761). Thus, the model trained on word stems will be my final model.