## Twitter Data Analysis

@author Swati Singh

This notebook analyzes tweets captured during Super Bowl 2019. The tweets were collected from the Twitter API by keyword and stored in JSON format. The code below transforms the JSON data into a pandas DataFrame for further analysis.

Github: https://github.com/swatisingh0107

### Remaining Things to do
1. Extract hashtags from text column and store in a separate column in dataframe
2. Build a word cloud of the most discussed brand names.
3. Clean up text column to remove hyperlinks, hashtags and mentions.
4. Perform time series analysis of tweets
5. Perform geographic analysis of tweets

First, we store the path to the Twitter JSON file in the variable `JSONdata`.

In [65]:
import os
os.getcwd()

'C:\\Users\\swati\\Google Drive\\Superbowl'

In [66]:
import json
import pandas as pd
import matplotlib.pyplot as plt

In [67]:
JSONdata='C:\\Users\\swati\\Google Drive\\Superbowl\\SuperbowlJS.json'


### Read JSON data to a list
Now we read the Twitter JSON file into a list, `tweets`, parsing one JSON document per line. The file is encoded in UTF-8, so it must be opened with that encoding for the parser to decode it correctly.

In [68]:
tweets = []
with open(JSONdata,'r',encoding="utf8") as f:
    tweets = [json.loads(line) for line in f]
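
The list comprehension above relies on the file holding exactly one JSON document per line. A minimal sketch of a slightly more defensive reader, which skips blank lines instead of raising on them, is shown here. The helper name `read_json_lines` and the in-memory sample are illustrative assumptions, not part of the original notebook.

```python
import io
import json

def read_json_lines(handle):
    """Parse one JSON document per non-empty line of an open text file."""
    records = []
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines, which json.loads would reject
            records.append(json.loads(line))
    return records

# Hypothetical two-record sample standing in for the real file.
sample = io.StringIO('{"text": "first"}\n\n{"text": "second"}\n')
records = read_json_lines(sample)
print(len(records))
```

With the real file, the same helper could be called inside the `with open(JSONdata, 'r', encoding="utf8") as f:` block.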

Print the text of the second record.

In [69]:
tweets[1]['text']

'RT @IAStartingLine: Bud Light has now finished last in the Iowa Caucus https://t.co/hUUjmX62to'

### Total number of records

In [70]:
len(tweets)

1000

### Understanding JSON structure of tweet data
Now we will look at the structure of the JSON to look for relevant keys that we will use for analysis.

In [71]:
import json
tweet = tweets[6]
print(json.dumps(tweet, indent=4))

{
    "_id": {
        "$oid": "5c57b5f92cec0f358301cf6d"
    },
    "quote_count": 0,
    "contributors": null,
    "truncated": false,
    "text": "RT @GameOfThrones: .@TheBudKnight\u2019s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight https://t.co/SzhvDcNAaL",
    "is_quote_status": false,
    "in_reply_to_status_id": null,
    "reply_count": 0,
    "id": {
        "$numberLong": "1092261582329995269"
    },
    "favorite_count": 0,
    "entities": {
        "user_mentions": [
            {
                "id": 180463340,
                "indices": [
                    3.0,
                    17.0
                ],
                "id_str": "180463340",
                "screen_name": "GameOfThrones",
                "name": "Game Of Thrones"
            },
            {
                "id": {
                    "$numberLong": "942879856160800768"
                },
                "indices": [
                    20.0,
                    33.0
           

### Example of List item

In [72]:
tweet['source']

'<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>'
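
The JSON dump printed earlier shows that some values are wrapped in MongoDB-style extended JSON, for example `{"$numberLong": "..."}` for large ids and `{"$oid": "..."}` for `_id`. A minimal sketch of unwrapping them follows, assuming these are the only two wrappers that occur. `unwrap` is a hypothetical helper, not something the notebook defines.

```python
def unwrap(value):
    """Recursively replace {"$numberLong": s} with int(s) and {"$oid": s} with s."""
    if isinstance(value, dict):
        if set(value) == {"$numberLong"}:
            return int(value["$numberLong"])
        if set(value) == {"$oid"}:
            return value["$oid"]
        return {key: unwrap(item) for key, item in value.items()}
    if isinstance(value, list):
        return [unwrap(item) for item in value]
    return value

# Fragment copied from the tweet printed earlier.
fragment = {
    "_id": {"$oid": "5c57b5f92cec0f358301cf6d"},
    "id": {"$numberLong": "1092261582329995269"},
    "quote_count": 0,
}
flat = unwrap(fragment)
print(flat["id"])
```

The notebook itself sidesteps the issue by using `id_str`, which is already a plain string.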

### Extract retweeted handles.
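The `retweet_count` field in the JSON is not populated as expected (it is 0 for the records shown below). Hence we use the tweet text to flag retweets: a tweet whose text contains `RT @` is treated as a retweet, and everything before the first colon is stored in `retweeted_from`. All other tweets are labelled `other`.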

In [73]:
for tweet in tweets:
    if 'RT @' in tweet['text']: 
        tweet['retweeted_from'] = tweet['text'].split(':',1)[0]
#         tweet['text']= tweet['text'].split(':',1)[1]
    else:    
        tweet['retweeted_from'] = 'other'
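
As an alternative sketch, the handle can be pulled out with a regular expression anchored at the start of the text. This differs from the loop above in two ways: it ignores `RT @` appearing mid-tweet, and it returns the bare handle without the `RT @` prefix. The names `RT_PATTERN` and `retweeted_handle` are assumptions made for this example.

```python
import re

# "RT @handle:" at the very start of the tweet text.
RT_PATTERN = re.compile(r"^RT @(\w+):")

def retweeted_handle(text):
    """Return the handle a retweet quotes, or 'other' for an original tweet."""
    match = RT_PATTERN.match(text)
    return match.group(1) if match else "other"

print(retweeted_handle("RT @IAStartingLine: Bud Light has now finished last in the Iowa Caucus"))
print(retweeted_handle("Best commercial BY FAR was #NFL100 ad."))
```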

### Transform tweets list to a dataframe. 
Exclude key:value pairs that are not required for our analysis.

In [74]:

column_list=['created_at','id_str','text','retweeted_from','source','retweet_count','favorite_count',
             'screen_name','followers_count','longitude','latitude']

tweet_data = []
for tweet in tweets:
    # The first seven columns are top-level keys of the tweet.
    row = [tweet[key] for key in column_list[:-4]]
    # The next two come from the nested user object.
    row.append(tweet['user']['screen_name'])
    row.append(tweet['user']['followers_count'])
#     if 'extended_tweet' in tweet.keys():
#         row.append(tweet['extended_tweet']['full_text'])
#     else:
#         row.append("None")

    # Coordinates exist only when the user object has a 'derived' location.
    # Pad with None otherwise so that every row has the same length.
    if 'derived' in tweet['user']:
        coordinates = tweet['user']['derived']['locations'][0]['geo']['coordinates']
        row.extend(coordinates[:2])
    else:
        row.extend([None, None])
    tweet_data.append(row)

data = pd.DataFrame(tweet_data, columns=column_list)
data.head()

Unnamed: 0,created_at,id_str,text,retweeted_from,source,retweet_count,favorite_count,screen_name,followers_count,longitude,latitude
0,Mon Feb 04 03:20:19 +0000 2019,1092261543859884032,Best commercial BY FAR was #NFL100 ad. Rest were largely unforgettable or just plain dumb. And so sick of the… https://t.co/BdPtE9y9hq,other,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",0,0,MelanieG_Writer,1031,-96.66696,40.8
1,Mon Feb 04 03:20:19 +0000 2019,1092261541137780736,RT @IAStartingLine: Bud Light has now finished last in the Iowa Caucus https://t.co/hUUjmX62to,RT @IAStartingLine,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",0,0,CaitrinSneed,456,-87.65005,41.85003
2,Mon Feb 04 03:20:21 +0000 2019,1092261550738599942,RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight https://t.co/SzhvDcNAaL,RT @GameOfThrones,"<a href=""http://twitter.com/download/android"" rel=""nofollow"">Twitter for Android</a>",0,0,tpalomaresc,72,,
3,Mon Feb 04 03:20:22 +0000 2019,1092261555008401409,RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight https://t.co/SzhvDcNAaL,RT @GameOfThrones,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",0,0,haileyncrawford,104,,
4,Mon Feb 04 03:20:25 +0000 2019,1092261568404967425,RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight https://t.co/SzhvDcNAaL,RT @GameOfThrones,"<a href=""http://twitter.com/download/android"" rel=""nofollow"">Twitter for Android</a>",0,0,jamesmortlock1,14,,
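
The coordinate lookup above indexes through several nested levels and would raise if any of them were missing. A minimal sketch of a more tolerant lookup using `dict.get` follows. The function name `user_coordinates` is hypothetical, and the two sample dictionaries are cut-down stand-ins built from values shown in the table above.

```python
def user_coordinates(tweet):
    """Return (longitude, latitude) from user.derived.locations, or (None, None)."""
    locations = tweet.get("user", {}).get("derived", {}).get("locations", [])
    if not locations:
        return (None, None)
    coords = locations[0].get("geo", {}).get("coordinates", [None, None])
    return (coords[0], coords[1])

# Cut-down stand-ins: one user with a derived location, one without.
with_geo = {"user": {"derived": {"locations": [{"geo": {"coordinates": [-96.66696, 40.8]}}]}}}
without_geo = {"user": {"screen_name": "tpalomaresc"}}
print(user_coordinates(with_geo))
print(user_coordinates(without_geo))
```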


### Drop duplicated records

In [75]:
data=data.drop_duplicates('id_str')

### Total records to be used for analysis

In [76]:
len(data)

1000
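
`drop_duplicates('id_str')` keeps the first row for each id and discards later ones. The count above is still 1000, which suggests the capture contained no repeated ids. A toy illustration of the behaviour, using made-up rows:

```python
import pandas as pd

# Made-up rows: the id "2" appears twice.
sample = pd.DataFrame({
    "id_str": ["1", "2", "2"],
    "text": ["a", "b", "b again"],
})
deduped = sample.drop_duplicates("id_str")  # keeps the first row per id_str
print(len(sample) - len(deduped))
```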

In [77]:
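# Strip links: this removes everything from "https" to the end of the line,
# so any text that follows a link on the same line is dropped as well.
data['text']=data['text'].str.replace(r'\bhttps.*','',regex=True)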

In [78]:
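# None removes the column-width limit (pandas >= 1.0; older versions used -1, which has since been deprecated).
pd.set_option('display.max_colwidth', None)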
data['text']

0      Best commercial BY FAR was #NFL100 ad. Rest were largely unforgettable or just plain dumb. And so sick of the…                                  
1      RT @IAStartingLine: Bud Light has now finished last in the Iowa Caucus                                                                          
2      RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight                                        
3      RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight                                        
4      RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight                                        
5      RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight                                        
6      RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThr

### Extract mentions and hashtags from the text

In [79]:
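s=data['text']
# Raw strings keep the regex escapes (\B, \w) intact.
# extractall returns one row per match; unstack and join collapse them into one comma-separated string per tweet.
data['mentions']=s.str.extractall(r'(\B@\w*)').unstack().apply(lambda x:','.join(x.dropna()), axis=1)
data['hashtags']=s.str.extractall(r'(\B#\w*|\B\n#\w*)').unstack().apply(lambda x:','.join(x.dropna()), axis=1)
# Drop the newline captured by the second alternative.
data['hashtags']=data['hashtags'].str.replace('\n','')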

In [80]:
data['hashtags']

0      #NFL100                                    
1      NaN                                        
2      #ForTheThrone,#SBLIII                      
3      #ForTheThrone,#SBLIII                      
4      #ForTheThrone,#SBLIII                      
5      #ForTheThrone,#SBLIII                      
6      #ForTheThrone,#SBLIII                      
7      #BudLight,#GOT,#GameOfThrones,#ForTheThrone
8      #ForTheThrone,#SBLIII                      
9      NaN                                        
10     #ForTheThrone,#SBLIII                      
11     NaN                                        
12     NaN                                        
13     NaN                                        
14     #SBLIII                                    
15     NaN                                        
16     #ForTheThrone,#SBLIII                      
17     #ForTheThrone,#SBLIII                      
18     NaN                                        
19     NaN                     
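
A shorter way to get a similar result is `Series.str.findall` followed by `Series.str.join`. This is a sketch rather than a drop-in replacement: tweets without any match yield an empty string instead of `NaN`. The sample texts are abbreviated from the tweets shown earlier.

```python
import pandas as pd

# Abbreviated sample texts.
text = pd.Series([
    "Best commercial BY FAR was #NFL100 ad.",
    "RT @IAStartingLine: Bud Light has now finished last in the Iowa Caucus",
    "RT @GameOfThrones: What will you do #ForTheThrone?\n#SBLIII @BudLight",
])

# findall returns a list of matches per row; join turns each list into a string.
hashtags = text.str.findall(r"#\w+").str.join(",")
mentions = text.str.findall(r"@\w+").str.join(",")
print(hashtags.tolist())
```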

### Clean source and retweeted_from columns
Strip the HTML anchor markup from the `source` column and the redundant `RT @` prefix from the `retweeted_from` column.

In [81]:
data['source'] = data['source'].str.split('>').str.get(1)
data['source']=data['source'].str.replace('</a','')
data['retweeted_from']=data['retweeted_from'].str.replace('RT @','')
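
The same client name can be captured in one step with `Series.str.extract`. This is a sketch that assumes every `source` value is a single `<a ...>name</a>` element, as in the examples shown so far.

```python
import pandas as pd

source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
])

# Capture whatever sits between the opening tag's '>' and the next '<'.
client = source.str.extract(r">([^<]*)<", expand=False)
print(client.tolist())
```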

### Transform created_at column from string to date-time format

In [82]:
data['created_at']=pd.to_datetime(data['created_at'])
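
Passing an explicit `format` avoids relying on pandas to guess the layout of Twitter's timestamp. The sketch below assumes every value follows the `Mon Feb 04 03:20:19 +0000 2019` pattern and an English locale for the day and month names. Because of the `%z` field, the result is timezone-aware.

```python
import pandas as pd

created_at = pd.Series([
    "Mon Feb 04 03:20:19 +0000 2019",
    "Mon Feb 04 03:20:21 +0000 2019",
])

# %a weekday, %b month name, %d day, %H:%M:%S time, %z UTC offset, %Y year
parsed = pd.to_datetime(created_at, format="%a %b %d %H:%M:%S %z %Y")
print(parsed.iloc[0])
```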

### Keep records captured only during game time
The `created_at` values carry a `+0000` offset, so the window below is expressed in UTC.

In [83]:
data=data[(data['created_at']>'2019-02-03 23:29:00') & (data['created_at']<'2019-02-04 04:00:00')]
data.head()

Unnamed: 0,created_at,id_str,text,retweeted_from,source,retweet_count,favorite_count,screen_name,followers_count,longitude,latitude,mentions,hashtags
0,2019-02-04 03:20:19,1092261543859884032,Best commercial BY FAR was #NFL100 ad. Rest were largely unforgettable or just plain dumb. And so sick of the…,other,Twitter for iPhone,0,0,MelanieG_Writer,1031,-96.66696,40.8,,#NFL100
1,2019-02-04 03:20:19,1092261541137780736,RT @IAStartingLine: Bud Light has now finished last in the Iowa Caucus,IAStartingLine,Twitter for iPhone,0,0,CaitrinSneed,456,-87.65005,41.85003,@IAStartingLine,
2,2019-02-04 03:20:21,1092261550738599942,RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight,GameOfThrones,Twitter for Android,0,0,tpalomaresc,72,,,"@GameOfThrones,@TheBudKnight,@BudLight","#ForTheThrone,#SBLIII"
3,2019-02-04 03:20:22,1092261555008401409,RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight,GameOfThrones,Twitter for iPhone,0,0,haileyncrawford,104,,,"@GameOfThrones,@TheBudKnight,@BudLight","#ForTheThrone,#SBLIII"
4,2019-02-04 03:20:25,1092261568404967425,RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight,GameOfThrones,Twitter for Android,0,0,jamesmortlock1,14,,,"@GameOfThrones,@TheBudKnight,@BudLight","#ForTheThrone,#SBLIII"


In [84]:
len(data)

1000

### Count number of retweets

In [85]:
retweet_count=data.groupby(['retweeted_from','text']).size().reset_index(name='counts')

In [86]:
import numpy as np
# Sum favorites and retweets per retweeted handle.
count_stat=data.groupby(['retweeted_from']).agg({'favorite_count':'sum','retweet_count':'sum'})
count_stat.head()
# These counts are all zero in the rows shown below, so they do not yield any meaningful information.
# The next cells instead count how often each retweet text occurs, and use that
# to find the most retweeted tweet and the most popular brand account.


retweeted_from,favorite_count,retweet_count
3PercentConf,0,0
AddictedtoKCMO,0,0
AltHomelandSec,0,0
AshyG_,0,0
BiggestTaco,0,0


In [87]:
retweet_count.sort_values(['counts'],ascending=False).head()

Unnamed: 0,retweeted_from,text,counts
14,GameOfThrones,RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight,295
31,NationalCorn,RT @NationalCorn: .@BudLight America’s corn farmers are disappointed in you. Our office is right down the road! We would love to discuss wi…,73
33,NerdsSocialClub,RT @NerdsSocialClub: BEST COMMERCIAL EVER 😂\n\n#BudLight #GOT #GameOfThrones #ForTheThrone\n\n,64
45,ThorBjornsson_,"RT @ThorBjornsson_: In the @GameOfThrones, you win or you die, @TheBudKnight.\n#SBLIII @BudLight",25
16,GameOfThrones,RT @GameOfThrones: House @BudLight will never forget. #SBLIII #ForTheThrone,18


### To do: turn this into a function that returns the most and least popular tweets

In [88]:
RT_max=np.max(retweet_count['counts'])
rt=retweet_count[retweet_count.counts==RT_max].index[0]
print("The tweet with the most retweets is: \n{}".format(retweet_count['text'][rt]))
print("Number of retweets: {}".format(RT_max))


The tweet with the most retweets is: 
RT @GameOfThrones: .@TheBudKnight’s watch has ended.
What will you do #ForTheThrone?
#SBLIII @BudLight 
Number of retweets: 295
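
One possible shape for the function mentioned in the to-do is sketched below, using `idxmax` and `idxmin`. The name `most_and_least_popular` and its `count_col` parameter are assumptions. On ties, both calls return the first matching row.

```python
import pandas as pd

def most_and_least_popular(counts_df, count_col="counts"):
    """Return the rows with the highest and lowest value in count_col."""
    most = counts_df.loc[counts_df[count_col].idxmax()]
    least = counts_df.loc[counts_df[count_col].idxmin()]
    return most, least

# Counts copied from the table above; texts shortened for the example.
sample = pd.DataFrame({
    "retweeted_from": ["GameOfThrones", "NationalCorn", "NerdsSocialClub"],
    "text": ["watch has ended", "corn farmers", "BEST COMMERCIAL EVER"],
    "counts": [295, 73, 64],
})
most, least = most_and_least_popular(sample)
print(most["retweeted_from"], least["retweeted_from"])
```

In the notebook, the function would be called as `most_and_least_popular(retweet_count)`.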


### Most popular Twitter sources

In [89]:
source_count=data.groupby(['source']).size().reset_index(name='counts')
source_count.sort_values(['counts'],ascending=False).head()

Unnamed: 0,source,counts
14,Twitter for iPhone,597
12,Twitter for Android,267
11,Twitter Web Client,58
10,Twitter Web App,35
13,Twitter for iPad,23


### Sentiment Analysis

In [90]:
from textblob import TextBlob
import re

def clean_tweet(tweet):
    '''
    Utility function to clean the text in a tweet by removing 
    links and special characters using regex.
    '''
    # Raw string so that \w, \/ and \S reach the regex engine unchanged.
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

def analize_sentiment(tweet):
    '''
    Utility function to classify the polarity of a tweet
    using textblob.
    '''
    analysis = TextBlob(clean_tweet(tweet))
    if analysis.sentiment.polarity > 0:
        return 'Positive'
    elif analysis.sentiment.polarity == 0:
        return 'Neutral'
    else:
        return 'Negative'
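
To see what the cleaning step passes on to TextBlob, the sketch below applies the same regular expression to a sample tweet and isolates the three-way labelling in a helper that takes a polarity score directly. `label_polarity` is a hypothetical name and the `https://t.co/abc` link is made up. TextBlob is deliberately not imported, so the example runs without it.

```python
import re

def clean_tweet(tweet):
    """Same regex as above: drop mentions, non-alphanumerics and URLs."""
    pattern = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"
    return " ".join(re.sub(pattern, " ", tweet).split())

def label_polarity(polarity):
    """Map a polarity score in [-1, 1] to the three labels used above."""
    if polarity > 0:
        return "Positive"
    if polarity == 0:
        return "Neutral"
    return "Negative"

# The link is a made-up placeholder.
cleaned = clean_tweet(
    "RT @GameOfThrones: House @BudLight will never forget. #SBLIII https://t.co/abc"
)
print(cleaned)
print(label_polarity(0.5), label_polarity(0.0), label_polarity(-0.2))
```

Note that the `#` is removed but the hashtag word itself stays, so hashtags still contribute to the sentiment score.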

In [91]:
data['SA'] = data['text'].apply(analize_sentiment)

In [92]:
data.head()

Unnamed: 0,created_at,id_str,text,retweeted_from,source,retweet_count,favorite_count,screen_name,followers_count,longitude,latitude,mentions,hashtags,SA
0,2019-02-04 03:20:19,1092261543859884032,Best commercial BY FAR was #NFL100 ad. Rest were largely unforgettable or just plain dumb. And so sick of the…,other,Twitter for iPhone,0,0,MelanieG_Writer,1031,-96.66696,40.8,,#NFL100,Positive
1,2019-02-04 03:20:19,1092261541137780736,RT @IAStartingLine: Bud Light has now finished last in the Iowa Caucus,IAStartingLine,Twitter for iPhone,0,0,CaitrinSneed,456,-87.65005,41.85003,@IAStartingLine,,Positive
2,2019-02-04 03:20:21,1092261550738599942,RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight,GameOfThrones,Twitter for Android,0,0,tpalomaresc,72,,,"@GameOfThrones,@TheBudKnight,@BudLight","#ForTheThrone,#SBLIII",Neutral
3,2019-02-04 03:20:22,1092261555008401409,RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight,GameOfThrones,Twitter for iPhone,0,0,haileyncrawford,104,,,"@GameOfThrones,@TheBudKnight,@BudLight","#ForTheThrone,#SBLIII",Neutral
4,2019-02-04 03:20:25,1092261568404967425,RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight,GameOfThrones,Twitter for Android,0,0,jamesmortlock1,14,,,"@GameOfThrones,@TheBudKnight,@BudLight","#ForTheThrone,#SBLIII",Neutral


### Count tweets and handles based on positive and negative sentiment

In [93]:
# For retweet analysis, remove rows where retweeted_from is 'other'. Those rows are original tweets, not retweets.
retweet = data[data['retweeted_from']!='other']
len(retweet)

648

In [94]:
sentiment_count=retweet.groupby(['retweeted_from','text','SA']).size().reset_index(name='counts')

# With only 1,000 records no group reaches a count of 2000, so the result below is empty.
sentiment_count[sentiment_count['counts']>2000].sort_values(['SA','counts'],ascending=False)

Unnamed: 0,retweeted_from,text,SA,counts


In [95]:
def tweet_sentiment(sentiment):
    # Restrict to retweets carrying the requested sentiment label.
    sentiment_data=sentiment_count[(sentiment_count['SA']==sentiment)& (sentiment_count['retweeted_from']!='other')]
    # Look the maximum up within the filtered rows. Searching the full table
    # could return a row from another sentiment that has the same count.
    sentiment_index=sentiment_data['counts'].idxmax()
    row=sentiment_count.loc[sentiment_index]
    print("The retweet with most "+ sentiment + " sentiment is: \n{}".format(row['text'])+" \nfrom "+row['retweeted_from']+ "\nCount="+str(row['counts']))
    
tweet_sentiment('Positive')    
print("-----------------------------------------------------------------")
tweet_sentiment('Neutral')
print("-----------------------------------------------------------------")
tweet_sentiment('Negative')

The retweet with most Positive sentiment is: 
RT @NerdsSocialClub: BEST COMMERCIAL EVER 😂

#BudLight #GOT #GameOfThrones #ForTheThrone

  
from NerdsSocialClub
Count=64
-----------------------------------------------------------------
The retweet with most Neutral sentiment is: 
RT @GameOfThrones: .@TheBudKnight’s watch has ended.
What will you do #ForTheThrone?
#SBLIII @BudLight  
from GameOfThrones
Count=295
-----------------------------------------------------------------
The retweet with most Negative sentiment is: 
RT @NationalCorn: .@BudLight America’s corn farmers are disappointed in you. Our office is right down the road! We would love to discuss wi… 
from NationalCorn
Count=73
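
The three calls above can also be collapsed into a single lookup with `groupby(...).idxmax()`, which returns the index of the largest count within each sentiment label. This is a sketch: the first three rows reuse the counts printed above with shortened texts, and the fourth row is a made-up example added only to give one group more than one member.

```python
import pandas as pd

sentiment_count = pd.DataFrame({
    "retweeted_from": ["NerdsSocialClub", "GameOfThrones", "NationalCorn", "example_handle"],
    "text": ["BEST COMMERCIAL EVER", "watch has ended", "corn farmers", "made-up row"],
    "SA": ["Positive", "Neutral", "Negative", "Positive"],
    "counts": [64, 295, 73, 5],  # the last value is hypothetical
})

# idxmax per group gives the row label of the most retweeted text for each sentiment.
top_per_sentiment = sentiment_count.loc[sentiment_count.groupby("SA")["counts"].idxmax()]
top = top_per_sentiment.set_index("SA")["retweeted_from"].to_dict()
print(top)
```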
