# Capstone Project: Social media sentiment analysis 
## [Case study: Samsung] 
## Part 1: Data collection

# Problem Statement
The project aims to gather and analyze consumer feedback on the company’s brand, products and advertising in order to improve its marketing strategy.

Traditional metrics focus on quantity (number of views, clicks, comments, shares, etc.). A company may achieve solid metrics, but that does not always mean the product is well received.

Sentiment analysis goes beyond quantitative data to the quality of the interactions between the public and brands:

1) It provides invaluable marketing intelligence

2) It is a crucial part of market research

3) It can transform customer support


In [2]:
# Import libraries
import requests
import json
import pandas as pd
import numpy as np
import time
import random
import re
import csv


# API Keys

In [24]:
# Twitter API credentials, read from environment variables so that secrets are not
# stored in the notebook. The variable names below are placeholders.
import os

consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_secret = os.environ['TWITTER_ACCESS_SECRET']

import tweepy 
from tweepy import OAuthHandler

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

# Gather Twitter comments on Samsung, Apple and Huawei

In [46]:
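# Collect the latest 2,500 tweets containing the word 'Samsung', excluding retweets and replies.
# Each negated operator is written with no space after the minus sign.
# Note: api.search is the Tweepy 3.x name; Tweepy 4 renamed it to api.search_tweets.
tw_samsung = []

for tweet in tweepy.Cursor(api.search, q="Samsung -filter:retweets -filter:replies", lang="en", result_type="recent", include_entities=True).items(2500):
    tw_samsung.append(tweet)

print(len(tw_samsung))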

2500


In [22]:
# Collect the latest 2,500 tweets containing the word 'Apple', excluding retweets and replies.
tw_apple = []

for tweet in tweepy.Cursor(api.search, q="Apple -filter:retweets -filter:replies", lang="en", result_type="recent", include_entities=True).items(2500):
    tw_apple.append(tweet)

print(len(tw_apple))

2500


In [18]:
# Collect the latest 2,500 tweets containing the word 'Huawei', excluding retweets and replies.
tw_huawei = []

for tweet in tweepy.Cursor(api.search, q="Huawei -filter:retweets -filter:replies", lang="en", result_type="recent", include_entities=True).items(2500):
    tw_huawei.append(tweet)

print(len(tw_huawei))

2500
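
The three collection cells differ only in the brand keyword, so the search string can be built by one small helper. The sketch below is an illustration rather than part of the original notebook: `build_query` is a hypothetical name, and it assumes the standard search operators `-filter:retweets` and `-filter:replies`, written with no space after the minus sign.

```python
def build_query(keyword, exclude=("retweets", "replies")):
    """Return a search string for `keyword` that excludes the given tweet types.

    `build_query` is a hypothetical helper, not part of Tweepy.
    """
    # Each exclusion is written as "-filter:<type>" with no space after the minus sign.
    filters = " ".join(f"-filter:{kind}" for kind in exclude)
    return f"{keyword} {filters}".strip()


queries = {brand: build_query(brand) for brand in ("Samsung", "Apple", "Huawei")}
for brand, query in queries.items():
    print(brand, "->", query)
```

Each resulting string could then be passed as the `q` argument of the search call, so the three loops collapse into one loop over `queries`.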


In [25]:
# Create function to capture features in dataframe
def process_results(results):
    id_list = [tweet.id for tweet in results]
    data_set = pd.DataFrame(id_list, columns = ["id"])
    
    # Collecting Tweet Data
    data_set["text"] = [tweet.text for tweet in results]
    data_set["created_at"] = [tweet.created_at for tweet in results]
    data_set["retweet_count"] = [tweet.retweet_count for tweet in results]
    data_set["favorite_count"] = [tweet.favorite_count for tweet in results]
    data_set["source"] = [tweet.source for tweet in results]

    # Collecting User Data 
    data_set["user_id"] = [tweet.author.id for tweet in results]
    data_set["user_screen_name"] = [tweet.author.screen_name for tweet in results]
    data_set["user_name"] = [tweet.author.name for tweet in results]
    data_set["user_created_at"] = [tweet.author.created_at for tweet in results]
    data_set["user_description"] = [tweet.author.description for tweet in results]
    data_set["user_followers_count"] = [tweet.author.followers_count for tweet in results]
    data_set["user_friends_count"] = [tweet.author.friends_count for tweet in results]
    data_set["user_location"] = [tweet.author.location for tweet in results]
    
    return data_set  
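
`process_results` only reads attributes from each status object, so the same flattening can be tried without API access. This is a minimal sketch, assuming stand-in objects built with `types.SimpleNamespace`; `tweet_to_row` and the sample values are made up for illustration and cover only a few of the fields above.

```python
from types import SimpleNamespace


def tweet_to_row(tweet):
    """Flatten one tweet-like object into a dict of tweet and author fields."""
    return {
        "id": tweet.id,
        "text": tweet.text,
        "retweet_count": tweet.retweet_count,
        "user_screen_name": tweet.author.screen_name,
        "user_followers_count": tweet.author.followers_count,
    }


# Made-up stand-in for a Tweepy status object (illustrative values only).
fake_tweet = SimpleNamespace(
    id=1,
    text="Example tweet about a phone",
    retweet_count=0,
    author=SimpleNamespace(screen_name="example_user", followers_count=10),
)

rows = [tweet_to_row(t) for t in [fake_tweet]]
print(rows[0]["user_screen_name"])
```

A list of such dicts can be passed directly to `pd.DataFrame(rows)`, which gives one column per key.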

In [47]:
# Create Samsung, Apple and Huawei Dataframes
tweet_samsung=process_results(tw_samsung)
tweet_apple=process_results(tw_apple)
tweet_huawei=process_results(tw_huawei)

In [48]:
tweet_samsung.head()

| | id | text | created_at | retweet_count | favorite_count | source | user_id | user_screen_name | user_name | user_created_at | user_description | user_followers_count | user_friends_count | user_location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1248090709548244992 | @champagnefinest @realnickwilson @therealjuicy... | 2020-04-09 03:29:48 | 0 | 0 | Twitter for Android | 795344175747268612 | ArriMarie37263 | Ol' Dirty ENBY🏳️‍🌈🇵🇷 | 2016-11-06 19:16:43 | I am an Afro-Puerto Rican Non-binary, Bi-sexua... | 535 | 1039 | Neon Valley Street, Sector 9 |
| 1 | 1248090377862733826 | @therealjuicyj That's not the real challenge. ... | 2020-04-09 03:28:29 | 0 | 0 | Twitter for Android | 248209515 | CEOofGreatness | THE G.O.A.T.™ 🐐 | 2011-02-06 14:05:44 | KC Born & Raised, Life, Bone and the Pursuit o... | 964 | 2041 | Kansas City, MO |
| 2 | 1248090299219341312 | @ZiniTevi Amazing app you guys have just want ... | 2020-04-09 03:28:10 | 0 | 0 | Twitter Web App | 1213238109548077057 | CesarLoya_ | Cesar Loya | 2020-01-03 23:18:10 | Mind your own business | 9 | 126 | |
| 3 | 1248089863091564547 | @__Daviann Check apple or Samsung when them op... | 2020-04-09 03:26:26 | 0 | 0 | Twitter for iPhone | 61365781 | romane1 | Keemzz MISC | 2009-07-30 02:28:32 | Live simple(Y) | 197 | 433 | ÜT: 17.9663244,-76.7572738 |
| 4 | 1248089789426839558 | @kcalvinalvinn @danhwang88 @LGUS @Ergotron @Ap... | 2020-04-09 03:26:08 | 0 | 0 | Twitter Web App | 16089438 | pascalpixel | Pascal Pixel | 2008-09-01 23:15:27 | Scientists say half my brain is a fabulous des... | 5151 | 452 | Seoul, Korea |


In [55]:
# create a new column 'brand'
tweet_samsung['brand']=0
tweet_apple['brand']=1
tweet_huawei['brand']=2

In [56]:
# Combine the three brand DataFrames into one for ease of data cleaning
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
tweet_combined = pd.concat([tweet_samsung, tweet_apple, tweet_huawei], ignore_index=True, sort=True)

In [53]:
# Drop tweets with duplicate text
tweet_combined.drop_duplicates(subset='text', inplace=True)

In [60]:
tweet_combined.shape

(7462, 15)

In [59]:
# Save data as csv
tweet_combined.to_csv('./dataset/tweet_combined.csv', index = False)

## Data Cleaning

In [3]:
# Load the Twitter data saved in Part 1
tweet_combined = pd.read_csv('./dataset/tweet_combined.csv')

In [4]:
# Remove tweets posted by Samsung's official accounts, since tweets from a company's own accounts can skew the analysis.
tweet_combined=tweet_combined[~tweet_combined.user_screen_name.str.contains("Samsung")]

In [5]:
# Tweets collected with the keyword 'Apple' also include other senses of the word (e.g. the fruit).
# Drop tweets whose text contains any of the following words unrelated to the company,
# as well as tweets from accounts with 'Tesco' in the screen name.
irrelevant_words = ["fruit", "vinegar", "pie", "cider", "tree", "juice", "green", "healing", "baby"]
tweet_combined = tweet_combined[~tweet_combined.text.str.contains("|".join(irrelevant_words))]
tweet_combined = tweet_combined[~tweet_combined.user_screen_name.str.contains("Tesco")]
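
`str.contains` matches substrings, so a filter word such as "pie" also removes tweets containing "piece", and the filter is applied to all three brands rather than only to the Apple tweets. One possible refinement, sketched here with Python's `re` module, is to match whole words only. The word list mirrors the cell above; `is_irrelevant` and the sample sentences are hypothetical.

```python
import re

# Words that signal the fruit rather than the company (same list as above).
IRRELEVANT_WORDS = ["fruit", "vinegar", "pie", "cider", "tree", "juice",
                    "green", "healing", "baby"]

# \b anchors the match at word boundaries, so "pie" does not match "piece".
WHOLE_WORD = re.compile(r"\b(?:" + "|".join(IRRELEVANT_WORDS) + r")\b",
                        flags=re.IGNORECASE)


def is_irrelevant(text):
    """Return True if the text contains one of the filter words as a whole word."""
    return WHOLE_WORD.search(text) is not None


samples = [
    "Apple pie recipe for the weekend",       # fruit sense, should be dropped
    "This piece on the new iPhone is great",  # contains "pie" only inside "piece"
]
for text in samples:
    print(is_irrelevant(text), "|", text)
```

Whole-word matching misses plurals such as "pies", so the word list may need extending if this approach is used.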

In [6]:
# Remove tweets posted by Huawei's official accounts
tweet_combined=tweet_combined[~tweet_combined.user_screen_name.str.contains("Huawei")]

In [7]:
# 409 official and irrelevant tweets were removed
tweet_combined.shape

(7053, 15)

In [8]:
# 0 is Samsung. 1 is Apple and 2 is Huawei.
tweet_combined['brand'].value_counts()

0    2438
2    2404
1    2211
Name: brand, dtype: int64
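
The integer codes are easier to read when mapped back to brand names. This is a small sketch, assuming the coding stated above (0 = Samsung, 1 = Apple, 2 = Huawei) and the counts printed by `value_counts()`; `BRAND_NAMES` is a hypothetical name.

```python
# Brand coding used in the notebook: 0 = Samsung, 1 = Apple, 2 = Huawei.
BRAND_NAMES = {0: "Samsung", 1: "Apple", 2: "Huawei"}

# Counts copied from the value_counts() output above.
brand_counts = {0: 2438, 2: 2404, 1: 2211}

total = sum(brand_counts.values())
for code, count in brand_counts.items():
    share = 100 * count / total
    print(f"{BRAND_NAMES[code]:<8} {count:>5} ({share:.1f}%)")
```

With pandas, the same mapping can be applied to the column as `tweet_combined['brand'].map(BRAND_NAMES)`.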

In [9]:
tweet_combined.isnull().sum()

brand                      0
created_at                 0
favorite_count             0
id                         0
retweet_count              0
source                     0
text                       0
user_created_at            0
user_description        1171
user_followers_count       0
user_friends_count         0
user_id                    0
user_location           2307
user_name                  0
user_screen_name           0
dtype: int64

In [10]:
# User location has 2719 unique values and 2307 missing values.
tweet_combined['user_location'].value_counts().sort_values(ascending=False).index[:50]

Index(['Canada', 'United States', 'Pakistan', 'England, United Kingdom',
       'India', 'California, USA', 'London, England', 'London',
       'Los Angeles, CA', 'Nigeria', 'United Kingdom', 'Toronto, Ontario',
       'Lagos, Nigeria', 'Ontario, Canada', 'San Francisco, CA', 'Florida',
       'Lyon, France', 'South Africa', 'UK', 'Karachi, Pakistan', 'USA',
       'Hong Kong', 'Abuja, Nigeria', 'Lahore, Pakistan',
       'Islamabad, Pakistan', 'Australia', 'Florida, USA', 'New York, USA',
       'Chicago, IL', 'Punjab, Pakistan', 'Earth', 'Seattle, WA',
       'Kampala, Uganda', 'Johannesburg, South Africa', 'Hyderabad, India',
       'Indonesia', 'Texas, USA', 'Ottawa, Ontario', 'Toronto',
       'Washington, DC', 'Atlanta, GA', 'New York/New Jersey', 'Dallas, TX',
       'South East, England', 'England', 'Manchester, England', 'Germany',
       'Alberta, Canada', 'Scotland, United Kingdom', 'Deutschland'],
      dtype='object')

The most frequent user locations suggest that many tweets came from North America and Europe, with Pakistan, India and Nigeria also well represented, and that few came from China. This is probably because Twitter is blocked in China. The comments and sentiments collected may therefore be skewed towards these regions. Given the large amount of missing data (2,307 of 7,053 rows), the location column is excluded.
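
The share of missing values behind this decision can be worked out from the counts above. A short sketch using the figures printed by `isnull().sum()` and `shape`; roughly a third of the rows have no location.

```python
# Figures copied from the notebook output above.
total_rows = 7053
missing = {"user_description": 1171, "user_location": 2307}

for column, n_missing in missing.items():
    share = 100 * n_missing / total_rows
    print(f"{column}: {n_missing} of {total_rows} missing ({share:.1f}%)")
```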

In [11]:
# Dropped user description and user location due to difficulty in imputing meaningful values.
tweet_combined.drop(columns=['user_description','user_location'],inplace=True)

In [12]:
tweet_combined.shape

(7053, 13)

In [13]:
# Cleaned version 1
tweet_combined.to_csv('./dataset/tweet_combined_clean_v1.csv', index = False)