# Twitter Data Analysis

@author [Swati Singh](swati.jadon0107@gmail.com)

## Introduction 
This notebook analyzes tweets captured during Super Bowl 2019. The data was extracted from the Twitter API using a set of keywords and stored in JSON format. The code below transforms the JSON data into a pandas DataFrame for further analysis.

**Github: https://github.com/swatisingh0107**

## Table of Contents
#### Data Preparation:
1. Read JSON file format into a list
2. Extract the name of the retweeted account from the text of the tweet.
3. Transform list to pandas dataframe. 
4. Extract hashtags and mentions from tweet text and store in new columns
5. Transform created_at column from string to date-time format

#### Data Cleanup:

6. Check for duplicate records
7. Remove shortened t.co links from the tweets
8. Clean Source column of HTML code and keep only text

#### Data Analysis for tweets during game time:
9. Count number of retweets
10. Count top twitter sources
11. Print the ten most popular retweets.
12. Sentiment Analysis. Print most popular tweet based on sentiment.
13. Most popular hashtags
14. Most popular mentions

### Remaining things to do
1. Build a word cloud of the most discussed brand names.
2. Perform time series analysis of tweets
3. Perform geographic analysis of tweets

First, we store the path to the Twitter JSON file in the variable JSONdata.

In [2]:
import os
os.getcwd()

'C:\\Users\\swati\\Google Drive\\GitHub\\TwitterDataAnalysis'

In [3]:
import json
import pandas as pd
import matplotlib.pyplot as plt

In [4]:
JSONdata='C:\\Users\\swati\\Google Drive\\Superbowl\\Dataset\\SuperbowlJS.json'


### Read JSON data to a list
Now we read the Twitter JSON file into a list, 'tweets'. Each line of the file holds one JSON object. Twitter JSON is UTF-8 encoded, so the file must be opened with that encoding to be decoded correctly.

In [5]:
tweets = []
with open(JSONdata,'r',encoding="utf8") as f:
    tweets = [json.loads(line) for line in f]
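
If the capture file contains blank lines or partially written records (an assumption; the original file may well be clean), the list comprehension above stops with a `json.JSONDecodeError`. A minimal sketch of a more tolerant reader follows. The helper name `read_json_lines` is hypothetical and not part of the original notebook.

```python
import json

def read_json_lines(path):
    """Read one JSON object per line, skipping blank or malformed lines.

    Hypothetical helper: returns the parsed records and the number of
    lines that could not be parsed.
    """
    records, skipped = [], 0
    with open(path, "r", encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore empty lines between records
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                skipped += 1  # count broken records instead of stopping
    return records, skipped
```

Calling `tweets, skipped = read_json_lines(JSONdata)` would then give the same list as above for a clean file, plus a count of lines that were dropped.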

Print the text of the second record.

In [6]:
tweets[1]['text']

'RT @IAStartingLine: Bud Light has now finished last in the Iowa Caucus https://t.co/hUUjmX62to'

### Total number of records

In [7]:
len(tweets)

859485

### Understanding JSON structure of tweet data
Now we will look at the structure of the JSON to look for relevant keys that we will use for analysis.

In [8]:
import json
tweet = tweets[6]
#print(json.dumps(tweet, indent=4))

### Example of List item

In [9]:
tweet['source']

'<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>'

### Extract retweeted handles.
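The retweet_count field in the captured JSON is not populated as expected (it is 0 in the sample rows further down). Hence we use the tweet text instead: a tweet whose text contains 'RT @' is flagged as a retweet, and the part of the text before the first colon is stored in retweeted_from. All other tweets are labelled 'other'.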

In [10]:
for tweet in tweets:
    if 'RT @' in tweet['text']: 
        tweet['retweeted_from'] = tweet['text'].split(':',1)[0]
#         tweet['text']= tweet['text'].split(':',1)[1]
    else:    
        tweet['retweeted_from'] = 'other'
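
The membership test above matches `RT @` anywhere in the text, which is the likely source of values such as `!!! CNET` in the grouped output further down. As a hedged alternative, the sketch below only accepts the marker at the very start of the text. The names `RT_PATTERN` and `retweeted_handle` are hypothetical, and unlike the loop above the function returns the bare handle without the `RT @` prefix.

```python
import re

# Matches a leading "RT @handle:" and captures the handle.
RT_PATTERN = re.compile(r"^RT @(\w+):")

def retweeted_handle(text):
    """Return the retweeted handle, or 'other' if the text is not a retweet."""
    match = RT_PATTERN.match(text)
    return match.group(1) if match else "other"

print(retweeted_handle("RT @GameOfThrones: 70 sleeps until #GoTS8."))  # GameOfThrones
print(retweeted_handle("!!! RT @CNET: watch this"))                    # other
```

Whether quoted retweets such as the second example should count as retweets is a judgment call; the stricter rule simply keeps the retweeted_from column free of stray prefixes.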

### Transform tweets list to a dataframe. 
Exclude key:value pairs that are not required for our analysis.

In [11]:

column_list=['created_at','id_str','text','retweeted_from','source','retweet_count','favorite_count',
             'screen_name','followers_count','longitude','latitude']

tweet_data = []
for tweet in tweets:
    tweet_data.append([tweet[key] for key in column_list[:-4]])
    tweet_data[-1].append(tweet['user']['screen_name'])
    tweet_data[-1].append(tweet['user']['followers_count'])
#     if 'extended_tweet' in tweet.keys():
#         tweet_data[-1].append(tweet['extended_tweet']['full_text'])
#     else:
#         tweet_data[-1].append("None")
            
    # Not every user has derived location data; pad with None so each row has all 11 columns.
    if 'derived' in tweet['user']:
        coordinates = tweet['user']['derived']['locations'][0]['geo']['coordinates']
        tweet_data[-1].extend(coordinates[:2])  # [longitude, latitude]
    else:
        tweet_data[-1].extend([None, None])

data = pd.DataFrame(tweet_data, columns=column_list)
data.head()

Unnamed: 0,created_at,id_str,text,retweeted_from,source,retweet_count,favorite_count,screen_name,followers_count,longitude,latitude
0,Mon Feb 04 03:20:19 +0000 2019,1092261543859884032,Best commercial BY FAR was #NFL100 ad. Rest we...,other,"<a href=""http://twitter.com/download/iphone"" r...",0,0,MelanieG_Writer,1031,-96.66696,40.8
1,Mon Feb 04 03:20:19 +0000 2019,1092261541137780736,RT @IAStartingLine: Bud Light has now finished...,RT @IAStartingLine,"<a href=""http://twitter.com/download/iphone"" r...",0,0,CaitrinSneed,456,-87.65005,41.85003
2,Mon Feb 04 03:20:21 +0000 2019,1092261550738599942,RT @GameOfThrones: .@TheBudKnight’s watch has ...,RT @GameOfThrones,"<a href=""http://twitter.com/download/android"" ...",0,0,tpalomaresc,72,,
3,Mon Feb 04 03:20:22 +0000 2019,1092261555008401409,RT @GameOfThrones: .@TheBudKnight’s watch has ...,RT @GameOfThrones,"<a href=""http://twitter.com/download/iphone"" r...",0,0,haileyncrawford,104,,
4,Mon Feb 04 03:20:25 +0000 2019,1092261568404967425,RT @GameOfThrones: .@TheBudKnight’s watch has ...,RT @GameOfThrones,"<a href=""http://twitter.com/download/android"" ...",0,0,jamesmortlock1,14,,
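
The coordinate lookup in the loop above assumes that every `derived` entry has a non-empty `locations` list with a `geo` object. A minimal sketch of a defensive lookup is shown here; `first_coordinates` is a hypothetical helper, and the nested key layout is taken from the loop above rather than from any API specification.

```python
def first_coordinates(tweet):
    """Return (longitude, latitude) of the first derived location, or (None, None).

    Hypothetical helper; assumes the key layout used in the loop above.
    """
    try:
        lon, lat = tweet["user"]["derived"]["locations"][0]["geo"]["coordinates"][:2]
        return lon, lat
    except (KeyError, IndexError, TypeError, ValueError):
        # Any missing key, empty list, or malformed value falls back to None.
        return None, None

sample = {"user": {"derived": {"locations": [{"geo": {"coordinates": [-96.66696, 40.8]}}]}}}
print(first_coordinates(sample))
print(first_coordinates({"user": {"screen_name": "example"}}))
```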


### Drop duplicated records

In [12]:
data=data.drop_duplicates('id_str')

### Total records to be used for analysis

In [13]:
len(data)

767453

### Remove hyperlinks from tweet text

In [14]:
import re
data['text']=data['text'].str.replace(r'\bhttps.*','',regex=True)
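
Note that the pattern above removes the link and everything that follows it on the same line. Tweets often end with the t.co link, so the result is frequently the same, but text after a mid-tweet link would be lost. The sketch below compares it with a narrower pattern on made-up sample strings.

```python
import pandas as pd

# Made-up sample text for illustration only.
sample = pd.Series(["Watch https://t.co/abc123 right now", "No link here"])

# Pattern from the cell above: drops the link and the rest of the line.
greedy = sample.str.replace(r"\bhttps.*", "", regex=True)

# Narrower alternative: drops only the URL token itself.
narrow = sample.str.replace(r"https?://\S+", "", regex=True)

print(greedy.tolist())
print(narrow.tolist())
```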

In [15]:
pd.set_option('display.max_colwidth', None)  # None shows the full text; recent pandas versions no longer accept -1
data['text'].head()

0    Best commercial BY FAR was #NFL100 ad. Rest were largely unforgettable or just plain dumb. And so sick of the… 
1    RT @IAStartingLine: Bud Light has now finished last in the Iowa Caucus                                         
2    RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight       
3    RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight       
4    RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight       
Name: text, dtype: object

### Extract mentions and hashtags from the text

In [16]:
s = data['text']
# Raw strings keep regex escapes such as \B and \w intact.
data['mentions'] = s.str.extractall(r'(\B@\w*)').unstack().apply(lambda x: ' '.join(x.dropna()), axis=1)
data['hashtags'] = s.str.extractall(r'(\B#\w*|\B\n#\w*)').unstack().apply(lambda x: ' '.join(x.dropna()), axis=1)
data['hashtags'] = data['hashtags'].str.replace('\n', '', regex=False)
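
A shorter route to the same columns is `str.findall`, which returns a list of matches per row. This is a sketch on made-up sample text, not the notebook's method; one difference is that rows without matches end up as empty strings rather than missing values.

```python
import pandas as pd

# Made-up sample text for illustration only.
text = pd.Series([
    "RT @GameOfThrones: .@TheBudKnight's watch has ended.\n#SBLIII @BudLight",
    "No tags here",
])

# findall yields a list of matches per row (an empty list when nothing matches);
# str.join then turns each list into one space-separated string.
mentions = text.str.findall(r"\B@\w+").str.join(" ")
hashtags = text.str.findall(r"\B#\w+").str.join(" ")

print(mentions.tolist())
print(hashtags.tolist())
```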

### Clean source columns to remove HTML format. Also clean retweeted_from column to remove redundant characters.

In [17]:
data['source'] = data['source'].str.split('>').str.get(1)
data['source']=data['source'].str.replace('</a','')
data['retweeted_from']=data['retweeted_from'].str.replace('RT @','')
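
The split-and-replace steps for the source column can also be written as a single regular expression that captures the text between the anchor tags. A minimal sketch follows, using the source value shown earlier plus one made-up entry (`Example App` is hypothetical).

```python
import pandas as pd

source = pd.Series([
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    '<a href="http://example.com" rel="nofollow">Example App</a>',  # made-up entry
])

# Capture whatever sits between the opening tag's '>' and the closing tag's '<'.
label = source.str.extract(r">([^<]*)<", expand=False)
print(label.tolist())
```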

### Transform created_at column from string to date-time format

In [18]:
data['created_at']=pd.to_datetime(data['created_at'])

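### Keep records captured only during game time
The created_at timestamps are in UTC (note the +0000 offset in the raw values), so the bounds below are UTC as well.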

In [19]:
data=data[(data['created_at']>'2019-02-03 23:29:00') & (data['created_at']<'2019-02-04 04:00:00')]
data.head()

Unnamed: 0,created_at,id_str,text,retweeted_from,source,retweet_count,favorite_count,screen_name,followers_count,longitude,latitude,mentions,hashtags
0,2019-02-04 03:20:19,1092261543859884032,Best commercial BY FAR was #NFL100 ad. Rest were largely unforgettable or just plain dumb. And so sick of the…,other,Twitter for iPhone,0,0,MelanieG_Writer,1031,-96.66696,40.8,,#NFL100
1,2019-02-04 03:20:19,1092261541137780736,RT @IAStartingLine: Bud Light has now finished last in the Iowa Caucus,IAStartingLine,Twitter for iPhone,0,0,CaitrinSneed,456,-87.65005,41.85003,@IAStartingLine,
2,2019-02-04 03:20:21,1092261550738599942,RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight,GameOfThrones,Twitter for Android,0,0,tpalomaresc,72,,,@GameOfThrones @TheBudKnight @BudLight,#ForTheThrone #SBLIII
3,2019-02-04 03:20:22,1092261555008401409,RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight,GameOfThrones,Twitter for iPhone,0,0,haileyncrawford,104,,,@GameOfThrones @TheBudKnight @BudLight,#ForTheThrone #SBLIII
4,2019-02-04 03:20:25,1092261568404967425,RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight,GameOfThrones,Twitter for Android,0,0,jamesmortlock1,14,,,@GameOfThrones @TheBudKnight @BudLight,#ForTheThrone #SBLIII
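
To make the time zone of the filter explicit, the bounds can be written as timezone-aware timestamps. The sketch below uses two made-up timestamps in Twitter's `created_at` format; the variable names are illustrative.

```python
import pandas as pd

# Made-up timestamps in Twitter's created_at format.
created = pd.Series(pd.to_datetime(
    ["Sun Feb 03 23:00:00 +0000 2019", "Mon Feb 04 03:20:19 +0000 2019"],
    format="%a %b %d %H:%M:%S %z %Y",
    utc=True,
))

# Same window as the filter above, with the UTC zone stated explicitly.
start = pd.Timestamp("2019-02-03 23:29:00", tz="UTC")
end = pd.Timestamp("2019-02-04 04:00:00", tz="UTC")

in_window = (created > start) & (created < end)
print(in_window.tolist())
```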


In [20]:
len(data)

753102

### Count number of retweets

In [21]:
retweet_count=data.groupby(['retweeted_from','text']).size().reset_index(name='counts')

In [22]:
import numpy as np
# Sum favorites and retweets per retweeted account.
count_stat = data.groupby('retweeted_from').agg({'favorite_count': 'sum', 'retweet_count': 'sum'})
count_stat.head()
# These counts do not appear to be captured accurately enough to derive meaningful information.
# The next cells count the most retweeted tweets from the text instead and check the most popular brand accounts.


retweeted_from,favorite_count,retweet_count
!!! CNET,0,0
#BVSBooks,1,1
#ElvisDuranShow,0,0
#EverythingWeGot #gopats theatlantavoice,0,0
#LARams #gorams QuarterPress,0,0


In [23]:
retweet_count.sort_values(['counts'],ascending=False).head()

Unnamed: 0,retweeted_from,text,counts
1791,Avengers,"RT @Avengers: ""Some people move on. But not us."" Watch the brand new Marvel Studios' #AvengersEndgame spot that aired during the Big Game.…",54248
12905,MarvelStudios,RT @MarvelStudios: HIGHER. FURTHER. FASTER. Watch the brand-new #CaptainMarvel spot that aired during the big game. See it in theaters Marc…,12944
7644,GameOfThrones,RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight,9837
24595,captainmarvel,RT @captainmarvel: HIGHER. FURTHER. FASTER. Watch the brand new Marvel Studios' #CaptainMarvel spot that aired during the Big Game. See it…,7155
600,AVG_JP,RT @AVG_JP: 人々はもう\n　諦めている\n我々は違う\n 始めるぞ\n\n『アベンジャーズ／エンドゲーム』特別映像解禁✨\n\n🌏全世界同時解禁の新映像がここに‼\n\n4/26（金）『アベンジャーズ ／エンドゲーム』公開\n#アベンジャーズ が終わる\n,6712


### What are the ten most popular tweets

In [24]:
# nlargest returns the rows with the highest counts, already in descending order.
retweet_count.nlargest(10, 'counts')['text']



1791     RT @Avengers: "Some people move on. But not us." Watch the brand new Marvel Studios' #AvengersEndgame spot that aired during the Big Game.…     
12905    RT @MarvelStudios: HIGHER. FURTHER. FASTER. Watch the brand-new #CaptainMarvel spot that aired during the big game. See it in theaters Marc…    
7644     RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight                                        
24595    RT @captainmarvel: HIGHER. FURTHER. FASTER. Watch the brand new Marvel Studios' #CaptainMarvel spot that aired during the Big Game. See it…     
600      RT @AVG_JP: 人々はもう\n　諦めている\n我々は違う\n　　始めるぞ\n\n『アベンジャーズ／エンドゲーム』特別映像解禁✨\n\n🌏全世界同時解禁の新映像がここに‼\n\n4/26（金）『アベンジャーズ ／エンドゲーム』公開\n#アベンジャーズ が終わる\n         
4725     RT @Complex: APRIL 26TH!                                                                                                                        
7645     RT @GameOfThrones: 70 sleeps until #GoTS8. #ForTheThrone           

### Most popular twitter sources

In [25]:
source_count=data.groupby(['source']).size().reset_index(name='counts')
source_count.sort_values(['counts'],ascending=False).head()

Unnamed: 0,source,counts
1090,Twitter for iPhone,413380
1086,Twitter for Android,201656
1083,Twitter Web Client,58065
1082,Twitter Web App,27201
1089,Twitter for iPad,13673


### Sentiment Analysis

In [26]:
from textblob import TextBlob
import re

def clean_tweet(tweet):
    '''
    Utility function to clean the text in a tweet by removing 
    links and special characters using regex.
    '''
    # Raw string so that regex escapes such as \w and \S are not treated as string escapes.
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", tweet).split())

def analize_sentiment(tweet):
    '''
    Utility function to classify the polarity of a tweet
    using textblob.
    '''
    # Compute the polarity once; TextBlob returns a float between -1 and 1.
    polarity = TextBlob(clean_tweet(tweet)).sentiment.polarity
    if polarity > 0:
        return 'Positive'
    elif polarity == 0:
        return 'Neutral'
    else:
        return 'Negative'
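
The classification above treats only a polarity of exactly 0 as neutral, so a score of 0.01 already counts as positive. One possible refinement is a small neutral band around zero. The sketch below is an assumption, not part of the original analysis: `label_polarity` is a hypothetical helper and the tolerance of 0.05 is an arbitrary illustrative value.

```python
def label_polarity(polarity, tolerance=0.05):
    """Map a polarity score in [-1, 1] to a label, with a neutral band around 0.

    Hypothetical helper; the default tolerance is illustrative only.
    """
    if polarity > tolerance:
        return "Positive"
    if polarity < -tolerance:
        return "Negative"
    return "Neutral"

print(label_polarity(0.5))
print(label_polarity(0.01))
print(label_polarity(-0.3))
```

With `tolerance=0` the helper reproduces the rule used in the function above.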

In [27]:
data['SA'] = data['text'].apply(analize_sentiment)

In [28]:
data.head()

Unnamed: 0,created_at,id_str,text,retweeted_from,source,retweet_count,favorite_count,screen_name,followers_count,longitude,latitude,mentions,hashtags,SA
0,2019-02-04 03:20:19,1092261543859884032,Best commercial BY FAR was #NFL100 ad. Rest were largely unforgettable or just plain dumb. And so sick of the…,other,Twitter for iPhone,0,0,MelanieG_Writer,1031,-96.66696,40.8,,#NFL100,Positive
1,2019-02-04 03:20:19,1092261541137780736,RT @IAStartingLine: Bud Light has now finished last in the Iowa Caucus,IAStartingLine,Twitter for iPhone,0,0,CaitrinSneed,456,-87.65005,41.85003,@IAStartingLine,,Positive
2,2019-02-04 03:20:21,1092261550738599942,RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight,GameOfThrones,Twitter for Android,0,0,tpalomaresc,72,,,@GameOfThrones @TheBudKnight @BudLight,#ForTheThrone #SBLIII,Neutral
3,2019-02-04 03:20:22,1092261555008401409,RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight,GameOfThrones,Twitter for iPhone,0,0,haileyncrawford,104,,,@GameOfThrones @TheBudKnight @BudLight,#ForTheThrone #SBLIII,Neutral
4,2019-02-04 03:20:25,1092261568404967425,RT @GameOfThrones: .@TheBudKnight’s watch has ended.\nWhat will you do #ForTheThrone?\n#SBLIII @BudLight,GameOfThrones,Twitter for Android,0,0,jamesmortlock1,14,,,@GameOfThrones @TheBudKnight @BudLight,#ForTheThrone #SBLIII,Neutral


### Count tweets and handles based on positive and negative sentiment

In [29]:
#For retweet analysis, remove rows where retweeted_from is 'other'. Those rows are original tweets, not retweets.
retweet = data[data['retweeted_from']!='other']
len(retweet)

470984

In [30]:
sentiment_count=retweet.groupby(['retweeted_from','text','SA']).size().reset_index(name='counts')

sentiment_count[sentiment_count['counts']>2000].sort_values(['SA','counts'],ascending=False)

Unnamed: 0,retweeted_from,text,SA,counts
12905,MarvelStudios,RT @MarvelStudios: HIGHER. FURTHER. FASTER. Watch the brand-new #CaptainMarvel spot that aired during the big game. See it in theaters Marc…,Positive,12944
24595,captainmarvel,RT @captainmarvel: HIGHER. FURTHER. FASTER. Watch the brand new Marvel Studios' #CaptainMarvel spot that aired during the Big Game. See it…,Positive,7155
8861,IGN,RT @IGN: 🚨 NEW AVENGERS: ENDGAME FOOTAGE 🚨,Positive,3998
12871,Marvel,RT @Marvel: HIGHER. FURTHER. FASTER. Watch the brand new Marvel Studios' @CaptainMarvel spot that aired during the Big Game. See it in thea…,Positive,3912
15656,PatrickMahomes5,"RT @PatrickMahomes5: RT &amp; @Verizon will donate $1, up to $1.5M, in support of first responders to the @GarySiniseFound. Join me in giving #…",Positive,3535
13454,Microsoft,"RT @Microsoft: Just aired: Our #SuperBowl commercial. \nWhen everybody plays, we all win. \n",Positive,3267
22033,_malikab_,RT @_malikab_: #PepsiHalftime retweet for sweet victory like for sicko mode. proving a point,Positive,2463
2734,Bijansiry,RT @Bijansiry: Trying to prove a point\nRT for Sweet Victory\nLike for this goon\n #PepsiHalftime,Positive,2261
11555,LakeShowYo,RT @LakeShowYo: that Super Bowl #PepsiHalftime performance..,Positive,2056
13357,Micboom008,RT @Micboom008: When they only referenced Spongebob and didn't play Sweet Victory #PepsiHalftime,Positive,2056


In [31]:
def tweet_sentiment(sentiment):
    # Restrict to retweets that carry the requested sentiment label.
    sentiment_data=sentiment_count[(sentiment_count['SA']==sentiment)& (sentiment_count['retweeted_from']!='other')]
    # idxmax searches this subset only, so a row with the same count but a
    # different sentiment label cannot be picked by mistake.
    sentiment_index=sentiment_data['counts'].idxmax()
    print("The retweet with most "+ sentiment + " sentiment is: \n{}".format(sentiment_count['text'][sentiment_index])+" \nfrom "+(sentiment_count['retweeted_from'][sentiment_index])+ "\nCount="+str(sentiment_count['counts'][sentiment_index]))
    
tweet_sentiment('Positive')    
print("-----------------------------------------------------------------")
tweet_sentiment('Neutral')
print("-----------------------------------------------------------------")
tweet_sentiment('Negative')

# def tweet_sentiment(sentiment):
#     sentiment_data=sentiment_count[(sentiment_count['SA']==sentiment)& (sentiment_count['retweeted_from']!='other')]
#     sentiment_array=sentiment_data[sentiment_data.columns[3]]
#     sentiment_index=np.argsort(sentiment_array,-3)[-3:]
#     s=sentiment_data['text'][sentiment_index][::-1]
#     print(s)
# #     sentiment_data['text'][sentiment_index]
# #     print("The retweet with most "+ sentiment + " sentiment is: \n {}".format(sentiment_count['text'][sentiment_index][::-1])+
# #           " \nfrom "+(sentiment_count['retweeted_from'][sentiment_index][::-1])+ "\nCount="+
# #           str(sentiment_count['counts'][sentiment_index][::-1]))
    
# print("Tweets with most positive sentiment")
# tweet_sentiment('Positive')    
# print("-----------------------------------------------------------------")
# print("Tweets with most neutral sentiment")
# tweet_sentiment('Neutral')
# print("-----------------------------------------------------------------")
# print("Tweets with most negative sentiment")
# tweet_sentiment('Negative')



The retweet with most Positive sentiment is: 
RT @MarvelStudios: HIGHER. FURTHER. FASTER. Watch the brand-new #CaptainMarvel spot that aired during the big game. See it in theaters Marc… 
from MarvelStudios
Count=12944
-----------------------------------------------------------------
The retweet with most Neutral sentiment is: 
RT @GameOfThrones: .@TheBudKnight’s watch has ended.
What will you do #ForTheThrone?
#SBLIII @BudLight  
from GameOfThrones
Count=9837
-----------------------------------------------------------------
The retweet with most Negative sentiment is: 
RT @Avengers: "Some people move on. But not us." Watch the brand new Marvel Studios' #AvengersEndgame spot that aired during the Big Game.… 
from Avengers
Count=54248
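
Note that each result above is the most retweeted text within a sentiment class, not the text with the strongest sentiment score. The same lookup can be done for all classes at once with `groupby` and `idxmax`. The sketch below uses a tiny made-up stand-in for `sentiment_count`.

```python
import pandas as pd

# Tiny stand-in for sentiment_count; the rows are made up for illustration.
sentiment_count = pd.DataFrame({
    "retweeted_from": ["a", "b", "c", "d"],
    "text": ["RT @a: great", "RT @b: fine", "RT @c: awful", "RT @d: good"],
    "SA": ["Positive", "Neutral", "Negative", "Positive"],
    "counts": [5, 3, 7, 2],
})

# idxmax gives, per sentiment label, the index of the row with the highest count.
top_rows = sentiment_count.loc[sentiment_count.groupby("SA")["counts"].idxmax()]
print(top_rows[["SA", "retweeted_from", "counts"]])
```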


### Analyze most popular hashtags

In [32]:
from collections import Counter
# str(x) turns missing values into the token 'nan', which is why 'nan' tops the list below.
s = data.hashtags.apply(lambda x: Counter(str(x).split(' ')))
s.sum().most_common(10)


[('nan', 333616),
 ('#PepsiHalftime', 118594),
 ('#AvengersEndgame', 68566),
 ('#SuperBowl', 48618),
 ('#CaptainMarvel', 35558),
 ('#SBLIII', 27419),
 ('#ForTheThrone', 23988),
 ('#PepsiMoreThanOK', 12071),
 ('#AvocadosFromMexico', 8847),
 ('#ad', 6902)]
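
The `'nan'` entry is an artifact of converting missing values to strings, not a real hashtag. A minimal sketch of a count that skips rows without hashtags, on made-up sample data:

```python
from collections import Counter
import pandas as pd

# Made-up sample column; None stands for a tweet without hashtags.
hashtags = pd.Series(["#SBLIII #ForTheThrone", None, "#SBLIII"])

# dropna() removes rows without hashtags, so no 'nan' token is counted.
counts = Counter(tag for row in hashtags.dropna() for tag in row.split())
print(counts.most_common(2))
```

Building one `Counter` from a generator also avoids creating and summing a separate `Counter` per row.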

### Analyze most popular mentions

In [33]:
# As with hashtags, str(x) turns missing values into the token 'nan'.
m = data.mentions.apply(lambda x: Counter(str(x).split(' ')))
m.sum().most_common(10)

[('nan', 154706),
 ('@Avengers', 60686),
 ('@GameOfThrones', 25168),
 ('@TheBudKnight', 16852),
 ('@BudLight', 14804),
 ('@MarvelStudios', 14308),
 ('@budlight', 13736),
 ('@pepsi', 13256),
 ('@Doritos', 10067),
 ('@Microsoft', 9619)]
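
`@BudLight` and `@budlight` appear as separate entries above even though Twitter handles are case-insensitive. A minimal sketch of a case-insensitive count on made-up sample data:

```python
from collections import Counter

# Made-up sample rows of space-separated mentions.
mentions = ["@BudLight @TheBudKnight", "@budlight", "@pepsi"]

# casefold() makes '@BudLight' and '@budlight' count as the same handle.
counts = Counter(handle.casefold() for row in mentions for handle in row.split())
print(counts.most_common(1))
```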