# Business Problem 
Our main objective is to develop a model that assesses the sentiment of a tweet based solely on its content. We use a dataset from CrowdFlower containing more than 9,000 tweets about Apple and Google products. Human raters labeled each tweet as positive, negative, or neutral, which provides the foundation for a Natural Language Processing (NLP) approach to sentiment classification.

In [1]:
import re
import string

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer, word_tokenize

# Download the tokenizer models and stopword lists (no-op if already up to date)
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/tobiaspariente/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/tobiaspariente/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tobiaspariente/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
df = pd.read_csv("tweet_data.csv", encoding='latin-1')

In [3]:
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [4]:
df.columns

Index(['tweet_text', 'emotion_in_tweet_is_directed_at',
       'is_there_an_emotion_directed_at_a_brand_or_product'],
      dtype='object')

In [5]:
df.describe()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
count,9092,3291,9093
unique,9065,9,4
top,RT @mention Marissa Mayer: Google Will Connect...,iPad,No emotion toward brand or product
freq,5,946,5389


In [6]:
df.columns = ['Tweet', 'Product/Brand', 'Emotion']
df.head()

Unnamed: 0,Tweet,Product/Brand,Emotion
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Tweet          9092 non-null   object
 1   Product/Brand  3291 non-null   object
 2   Emotion        9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


In [8]:
df.duplicated().sum()

22

In [9]:
df.isna().sum()

Tweet               1
Product/Brand    5802
Emotion             0
dtype: int64

There are 22 duplicate rows and 1 null entry in the Tweet column; let's remove all of them.
The product is unidentified for 5,802 tweets.

In [10]:
df.drop_duplicates(inplace = True)
df.dropna(subset = ['Tweet'], inplace = True)

In [11]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which raises a FutureWarning and will stop working in pandas 3.0.
df['Product/Brand'] = df['Product/Brand'].fillna('Undetermined')


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9070 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Tweet          9070 non-null   object
 1   Product/Brand  9070 non-null   object
 2   Emotion        9070 non-null   object
dtypes: object(3)
memory usage: 283.4+ KB


In [13]:
df['Emotion'].value_counts()

Emotion
No emotion toward brand or product    5375
Positive emotion                      2970
Negative emotion                       569
I can't tell                           156
Name: count, dtype: int64

In [14]:
df['Emotion'] = df['Emotion'].replace({
    "No emotion toward brand or product": "Neutral",
    "I can't tell": "Neutral"
})

In [15]:
df['Emotion'].value_counts()

Emotion
Neutral             5531
Positive emotion    2970
Negative emotion     569
Name: count, dtype: int64

In [16]:
df['Product/Brand'].value_counts()

Product/Brand
Undetermined                       5788
iPad                                945
Apple                               659
iPad or iPhone App                  469
Google                              428
iPhone                              296
Other Google product or service     293
Android App                          80
Android                              77
Other Apple product or service       35
Name: count, dtype: int64

In [17]:
def find_brand(product, tweet):
    """Infer the brand from the labeled product, falling back to the tweet text."""
    product = product.lower()
    if 'google' in product or 'android' in product:
        return 'Google'
    # 'ip' covers the iPad and iPhone product labels.
    if 'apple' in product or 'ip' in product:
        return 'Apple'

    # No usable product label: search the tweet text instead.
    # Note: 'ip' is a loose substring match, so words such as "tip" also count as Apple.
    lower_tweet = tweet.lower()
    is_google = 'google' in lower_tweet or 'android' in lower_tweet
    is_apple = 'apple' in lower_tweet or 'ip' in lower_tweet

    if is_google and is_apple:
        return 'Both'
    if is_google:
        return 'Google'
    if is_apple:
        return 'Apple'
    return 'Undetermined'

df['Brand'] = df.apply(lambda row: find_brand(row['Product/Brand'], row['Tweet']), axis = 1)
df['Brand'].value_counts()

Brand
Apple           5361
Google          2757
Undetermined     739
Both             213
Name: count, dtype: int64
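
Because the substring 'ip' also appears in unrelated words, the Apple count above may include some false matches. The sketch below is one stricter alternative that anchors each keyword at a word boundary. The function name `find_brand_strict` and the keyword lists are assumptions made for illustration, not part of the notebook, and it has not been run on the dataset, so its counts would differ from those shown.

```python
import re

# Hypothetical keyword patterns; the term lists are an assumption for illustration.
GOOGLE_PATTERN = re.compile(r"\b(?:google|android)")
APPLE_PATTERN = re.compile(r"\b(?:apple|ipad|iphone|itunes)")

def find_brand_strict(text):
    """Classify a text as Google, Apple, Both, or Undetermined by keyword."""
    lower_text = text.lower()
    is_google = bool(GOOGLE_PATTERN.search(lower_text))
    is_apple = bool(APPLE_PATTERN.search(lower_text))
    if is_google and is_apple:
        return "Both"
    if is_google:
        return "Google"
    if is_apple:
        return "Apple"
    return "Undetermined"

print(find_brand_strict("Great tip from the shipping team"))  # Undetermined
print(find_brand_strict("Can not wait for #iPad 2"))          # Apple
print(find_brand_strict("Android app vs iPhone app"))         # Both
```

The first example is the kind of text that the plain 'ip' substring check would label as Apple.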

In [18]:
df.head()

Unnamed: 0,Tweet,Product/Brand,Emotion,Brand
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,Apple
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,Apple
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,Apple
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,Apple
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,Google


In [19]:
df.rename(columns={"Product/Brand": "Product"}, inplace=True)

In [20]:
df.head()

Unnamed: 0,Tweet,Product,Emotion,Brand
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,Apple
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,Apple
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,Apple
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,Apple
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,Google


## Adding New Columns

In [21]:
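def tweet_character_count(text_of_tweet):
    """Return the tweet length, ignoring leading and trailing whitespace."""
    return len(text_of_tweet.strip())

df.rename(columns={"Tweet": "Original Tweet"}, inplace=True)
# Start the cleaned column as a copy of the original text.
df['Clean Tweet'] = df['Original Tweet']

df['Character Count of Original Tweet'] = df['Original Tweet'].apply(tweet_character_count)

# A hashtag must contain at least one letter, so purely numeric tags such as "#2" are skipped.
df['Hashtag'] = df['Original Tweet'].apply(lambda x: re.findall(r'\B#\w*[a-zA-Z]+\w*', x))
df['Hashtag Count'] = df['Hashtag'].str.len()

# Note: 'Clean Tweet' is still identical to 'Original Tweet' here, so this count equals
# the original one; it would need to be recomputed after preprocessing to reflect the cleaned text.
df['Character Count of Clean Tweet'] = df['Clean Tweet'].apply(tweet_character_count)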

df.head()

Unnamed: 0,Original Tweet,Product,Emotion,Brand,Clean Tweet,Character Count of Original Tweet,Hashtag,Hashtag Count,Character Count of Clean Tweet
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,Apple,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,127,"[#RISE_Austin, #SXSW]",2,127
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,Apple,@jessedee Know about @fludapp ? Awesome iPad/i...,139,[#SXSW],1,139
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,Apple,@swonderlin Can not wait for #iPad 2 also. The...,79,"[#iPad, #SXSW]",2,79
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,Apple,@sxsw I hope this year's festival isn't as cra...,82,[#sxsw],1,82
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,Google,@sxtxstate great stuff on Fri #SXSW: Marissa M...,131,[#SXSW],1,131
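
To make the hashtag pattern easier to follow, here is a small standalone check of the same regular expression on the first tweet and on two edge cases. The constant name `HASHTAG_PATTERN` and the two extra strings are made up for illustration.

```python
import re

# Same pattern as in the notebook: '#' not preceded by a word character,
# followed by word characters that include at least one letter.
HASHTAG_PATTERN = re.compile(r'\B#\w*[a-zA-Z]+\w*')

tweet = (".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, "
         "it was dead!  I need to upgrade. Plugin stations at #SXSW.")

print(HASHTAG_PATTERN.findall(tweet))             # ['#RISE_Austin', '#SXSW']
print(HASHTAG_PATTERN.findall("iPad #2 is out"))  # [] (no letter after '#')
print(HASHTAG_PATTERN.findall("email#tag"))       # [] ('#' follows a word character)
```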


In [22]:
df['Emotion'].value_counts()

Emotion
Neutral             5531
Positive emotion    2970
Negative emotion     569
Name: count, dtype: int64

In [23]:
df['Emotion'] = df['Emotion'].replace({
    "Positive emotion": "Positive",
    "Negative emotion": "Negative"
})

In [24]:
df['Emotion'].value_counts()

Emotion
Neutral     5531
Positive    2970
Negative     569
Name: count, dtype: int64
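
The counts above imply an imbalanced target. As a quick arithmetic check, the sketch below converts them to class shares; the dictionary simply restates the numbers from the output above.

```python
# Class counts copied from the value_counts() output above.
counts = {"Neutral": 5531, "Positive": 2970, "Negative": 569}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
# Neutral: 61.0%
# Positive: 32.7%
# Negative: 6.3%
```

A model that always predicts Neutral would therefore already reach about 61% accuracy, which is worth keeping in mind when judging later results.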

## Text Preprocessing

In [25]:
# Lowercase first; the patterns below only match lowercase letters.
df['Clean Tweet'] = df['Clean Tweet'].str.lower()
# Remove URLs, bare .com domains, and the literal {link} and [video] placeholders.
df['Clean Tweet'] = df['Clean Tweet'].apply(lambda x: re.sub(r'https?:\/\/\S+', '', x))
df['Clean Tweet'] = df['Clean Tweet'].apply(lambda x: re.sub(r"www\.[a-z]?\.?(com)+|[a-z]+\.(com)", '', x))
df['Clean Tweet'] = df['Clean Tweet'].apply(lambda x: re.sub(r'\{link\}', '', x))
df['Clean Tweet'] = df['Clean Tweet'].apply(lambda x: re.sub(r"\[video\]", '', x))
# Remove HTML entities such as &amp; and @mentions (letters and digits only).
df['Clean Tweet'] = df['Clean Tweet'].apply(lambda x: re.sub(r'&[a-z]+;', '', x))
df['Clean Tweet'] = df['Clean Tweet'].apply(lambda x: re.sub(r"@[A-Za-z0-9]+", '', x))
# Keep only letters, whitespace, '#', and the characters used in text emoticons; digits are dropped.
df['Clean Tweet'] = df['Clean Tweet'].apply(lambda x: re.sub(r"[^a-z\s\(\-:\)\\\/\];='#]", '', x))

In [26]:
df.head()

Unnamed: 0,Original Tweet,Product,Emotion,Brand,Clean Tweet,Character Count of Original Tweet,Hashtag,Hashtag Count,Character Count of Clean Tweet
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative,Apple,i have a g iphone after hrs tweeting at #ris...,127,"[#RISE_Austin, #SXSW]",2,127
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive,Apple,know about awesome ipad/iphone app that you...,139,[#SXSW],1,139
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive,Apple,can not wait for #ipad also they should sale...,79,"[#iPad, #SXSW]",2,79
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative,Apple,i hope this year's festival isn't as crashy a...,82,[#sxsw],1,82
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive,Google,great stuff on fri #sxsw: marissa mayer (goog...,131,[#SXSW],1,131


In [27]:
def remove_punctuation(text):
    """Drop every character listed in string.punctuation."""
    return "".join(ch for ch in text if ch not in string.punctuation)

df['Clean Tweet'] = df['Clean Tweet'].apply(remove_punctuation)
# Collapse the runs of spaces left behind by the earlier removals.
df['Clean Tweet'] = df['Clean Tweet'].apply(lambda x: re.sub(r"[ ]{2,}", ' ', x))

In [28]:
df.head()

Unnamed: 0,Original Tweet,Product,Emotion,Brand,Clean Tweet,Character Count of Original Tweet,Hashtag,Hashtag Count,Character Count of Clean Tweet
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative,Apple,i have a g iphone after hrs tweeting at risea...,127,"[#RISE_Austin, #SXSW]",2,127
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive,Apple,know about awesome ipadiphone app that youll ...,139,[#SXSW],1,139
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive,Apple,can not wait for ipad also they should sale t...,79,"[#iPad, #SXSW]",2,79
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative,Apple,i hope this years festival isnt as crashy as ...,82,[#sxsw],1,82
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive,Google,great stuff on fri sxsw marissa mayer google ...,131,[#SXSW],1,131


In [29]:
print(df['Original Tweet'][0][0:200])
print(df['Clean Tweet'][0][0:200])

.@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead!  I need to upgrade. Plugin stations at #SXSW.
 i have a g iphone after hrs tweeting at riseaustin it was dead i need to upgrade plugin stations at sxsw
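
The cleaning steps can also be collected into a single function, which makes them easy to test on one string. The sketch below applies the same patterns in the same order; the name `clean_tweet` is an assumption for illustration. It also shows that the cleaned text is shorter than the original, whereas the 'Character Count of Clean Tweet' column was filled before cleaning and still holds the original length.

```python
import re
import string

def clean_tweet(text):
    """Apply the notebook's cleaning steps to one string, in the same order."""
    text = text.lower()
    text = re.sub(r'https?:\/\/\S+', '', text)                       # URLs
    text = re.sub(r"www\.[a-z]?\.?(com)+|[a-z]+\.(com)", '', text)   # bare .com domains
    text = re.sub(r'\{link\}', '', text)                             # {link} placeholder
    text = re.sub(r"\[video\]", '', text)                            # [video] placeholder
    text = re.sub(r'&[a-z]+;', '', text)                             # HTML entities
    text = re.sub(r"@[A-Za-z0-9]+", '', text)                        # @mentions
    text = re.sub(r"[^a-z\s\(\-:\)\\\/\];='#]", '', text)            # digits and other symbols
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.sub(r"[ ]{2,}", ' ', text)                             # collapse repeated spaces

original = (".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, "
            "it was dead!  I need to upgrade. Plugin stations at #SXSW.")
cleaned = clean_tweet(original)

print(cleaned)
# Character counts before and after cleaning, ignoring surrounding whitespace.
print(len(original.strip()), len(cleaned.strip()))
```

In the DataFrame, one way to refresh the column after preprocessing would be `df['Character Count of Clean Tweet'] = df['Clean Tweet'].str.strip().str.len()`.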


In [30]:
# Custom stopword list, used in place of NLTK's built-in English list.
# Punctuation was removed earlier, so "it's" can no longer occur; 'its' covers that case.
new_stopwords = ['a', 'am', 'an', 'and', 'at', 'be', 'for', 'from', 'if', 
                 'in', 'it', "it's", 'its', 'itself', 'my', 'of', 'on', 'or', 'rt', 
                 'that', 'the', 'their', 'theirs', 'these', 'this', 'those', 'to']

def remove_stopwords(text):
    """Tokenize the text and drop the tokens found in new_stopwords."""
    return [word for word in word_tokenize(text) if word not in new_stopwords]

df['Clean Tokens'] = df['Clean Tweet'].apply(remove_stopwords)
df['Clean Token Count'] = df['Clean Tokens'].str.len()

In [31]:
df.head()

Unnamed: 0,Original Tweet,Product,Emotion,Brand,Clean Tweet,Character Count of Original Tweet,Hashtag,Hashtag Count,Character Count of Clean Tweet,Clean Tokens,Clean Token Count
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative,Apple,i have a g iphone after hrs tweeting at risea...,127,"[#RISE_Austin, #SXSW]",2,127,"[i, have, g, iphone, after, hrs, tweeting, ris...",16
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive,Apple,know about awesome ipadiphone app that youll ...,139,[#SXSW],1,139,"[know, about, awesome, ipadiphone, app, youll,...",15
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive,Apple,can not wait for ipad also they should sale t...,79,"[#iPad, #SXSW]",2,79,"[can, not, wait, ipad, also, they, should, sal...",11
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative,Apple,i hope this years festival isnt as crashy as ...,82,[#sxsw],1,82,"[i, hope, years, festival, isnt, as, crashy, a...",12
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive,Google,great stuff on fri sxsw marissa mayer google ...,131,[#SXSW],1,131,"[great, stuff, fri, sxsw, marissa, mayer, goog...",14
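
As a sanity check on the token counts, the sketch below repeats the stopword filter on the first cleaned tweet using only the standard library. It assumes that plain whitespace splitting gives the same tokens as `word_tokenize` for text that is already lowercase and free of punctuation; the names `NEW_STOPWORDS` and `remove_stopwords_simple` are made up for illustration.

```python
# Same words as the notebook's list, stored as a set for faster membership tests.
NEW_STOPWORDS = {
    'a', 'am', 'an', 'and', 'at', 'be', 'for', 'from', 'if',
    'in', 'it', "it's", 'its', 'itself', 'my', 'of', 'on', 'or', 'rt',
    'that', 'the', 'their', 'theirs', 'these', 'this', 'those', 'to',
}

def remove_stopwords_simple(text):
    # Assumption: str.split() stands in for word_tokenize on pre-cleaned text.
    return [word for word in text.split() if word not in NEW_STOPWORDS]

cleaned = (" i have a g iphone after hrs tweeting at riseaustin it was dead "
           "i need to upgrade plugin stations at sxsw")
tokens = remove_stopwords_simple(cleaned)

print(tokens)
print(len(tokens))  # 16, matching 'Clean Token Count' for row 0
```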


In [32]:
df.to_csv('clean_df.csv', index = False)
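
One thing to watch when reloading this file: CSV has no list type, so list-valued columns such as 'Hashtag' and 'Clean Tokens' are written as their string representation. The standard-library sketch below illustrates the round trip with an in-memory buffer and `ast.literal_eval`; the sample token list is made up for illustration.

```python
import ast
import csv
import io

tokens = ['great', 'stuff', 'fri', 'sxsw']

# Write one list-valued cell to an in-memory CSV; the list is stored as text.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Clean Tokens'])
writer.writerow([str(tokens)])

# Read it back: the cell comes back as a plain string, not a list.
buffer.seek(0)
rows = list(csv.reader(buffer))
raw_cell = rows[1][0]
print(type(raw_cell).__name__, raw_cell)

# literal_eval safely parses the string back into a Python list.
restored = ast.literal_eval(raw_cell)
print(restored == tokens)  # True
```

With pandas, the same parsing could be done at load time, for example `pd.read_csv('clean_df.csv', converters={'Clean Tokens': ast.literal_eval, 'Hashtag': ast.literal_eval})`.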