# COVID19 Tweet Analysis
<hr />
<hr />

# Introduction

"*Fake news spreads 6 times faster on social media*": a line from the recent Netflix special "*The Social Dilemma*" that caught my attention. Social media platforms such as Twitter and Facebook host their fair share of opinionated users, silent consumers, trolls, objective individuals and hatemongers, and the list never ends. The recent outbreak of coronavirus (COVID19) has led us to witness unprecedented times. While the world grappled with the pandemic, the Twitterati tweeted away with news, facts, myths, opinions and more. This notebook is an attempt to understand the various aspects of these tweets. It also explores a hypothesis about the genuineness and legitimacy of these tweets by examining parameters such as user location, verified status, and the number of followers, friends, favourites and retweets.

<hr />

# Methodology

The notebook is broadly segmented into 2 sections of Exploratory Data Analysis (EDA):
1. Univariate Exploratory Analysis
2. Multivariate Exploratory Analysis

<hr />

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plotting library used for the word clouds and frequency plots
from matplotlib.pyplot import figure # to set the figure size
import plotly.express as px #plotly library to produce plots
from wordcloud import WordCloud, ImageColorGenerator #wordcloud library
from nltk.tokenize import word_tokenize #word tokenizer
from nltk.probability import FreqDist # Frequency Distributor
from nltk.corpus import stopwords #stop words for data cleaning

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Let's begin by reading the data into a dataframe with our good old friend pandas (of course :))

In [None]:
data = pd.read_csv("/kaggle/input/covid19-tweets/covid19_tweets.csv")

Let's see what the data looks like and check the shape of the dataset.

In [None]:
print("The dataset has {} rows and {} columns".format(data.shape[0],data.shape[1]))
data.head()

Next, let us check the data type of each column using the pandas `dtypes` attribute. This gives a clear indication of the kind of data in each column and shows which columns may need type casting before further processing.

In [None]:
print("------------------------------------------------------------------------------------------------------------")
print("The Datatype of each column in the dataset.\n\n")
print(data.dtypes)
print("------------------------------------------------------------------------------------------------------------")

From the list of data types, 3 columns are of integer type. As part of the preliminary analysis, we can draw meaningful insights from these integer-valued columns by generating **descriptive statistics** for them.

Descriptive statistics summarise the central tendency, dispersion and shape of a dataset's distribution. For further information read [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html).
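
As a small illustration of what these statistics capture, here is a minimal sketch on a made-up dataframe (the `toy` frame and its values are assumptions, not taken from the dataset). A mean far above the median, together with a positive skew, points to a right-skewed column.

```python
import pandas as pd

# Toy stand-in for two numeric columns (values are made up for illustration)
toy = pd.DataFrame({
    "user_followers": [10, 200, 1000, 5000, 1_000_000],
    "user_friends": [5, 50, 500, 1500, 3000],
})

summary = toy.describe()   # count, mean, std, min, quartiles, max per column
skewness = toy.skew()      # positive values indicate a right-skewed column

print(summary.loc[["mean", "50%"]])
print(skewness)
```

On this toy data a single very large account is enough to pull the mean of `user_followers` far above its median.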

In [None]:
print("------------------------------------------------------------------------------------------------------------")
print("The Descriptive Statistics of the Dataset.\n\n")
print(data.describe())
print("------------------------------------------------------------------------------------------------------------")

## Univariate Analysis

Let's get started with the univariate analysis of the dataset.

In this section, we will go through each of the variables/columns in the dataset and understand some basic information like:

* Distribution of column values across the dataset.
* Outlier Detection
* Number of Null values.
* Number of Unique values.
* Aggregation(wherever necessary)

### 1. user_location

Let us start with the number of unique places in the dataset.

In [None]:
print("Total number of Unique locations: ",data["user_location"].nunique())

The tweets come from 26920 unique locations. Could these all be distinct places? It seems unlikely. Even considering how widespread internet connectivity and access to platforms like twitter are, this number looks fairly exaggerated. If each location named only the COUNTRY the tweet was made from, there could not be this many values, since there are only around 200 countries in the world. This means the dataset needs a fair amount of data cleaning/wrangling, in which we would generalise the locations to country names.


**NOTE: This notebook will only explore the variables and will not exclusively cater to data cleaning/wrangling. There will be a follow up kernel which will only cater to data cleaning.**

Let us now see the distribution of locations in all the tweets

In [None]:
print("------------------------------------------------------------------------------------------------------------")
print("Number of tweets for each of the unique location in the dataset.\n\n")
print(data["user_location"].value_counts())
print("------------------------------------------------------------------------------------------------------------")

We can observe that the `user_location` column is fairly polluted. It contains website URLs, country names and names of states belonging to the same country, none of which would count as distinct locations if we treated the country as the unique location. For now, we will only use the top few locations to understand the distribution of `user_location`.

Let us see if there are null values in this particular column

In [None]:
print("The number of tweets where the user location is unknown: ", data["user_location"].isna().sum())

Storing the value counts into a dataframe for plotting the distribution of the user location counts

In [None]:
user_location_df = data["user_location"].value_counts().rename_axis("place").reset_index(name="counts")
user_location_df.head()

In [None]:
user_location_threshold_data = user_location_df[user_location_df["counts"]>25].head(50)

fig = px.bar(user_location_threshold_data,x="place",y="counts", title="Top 50 Locations tweets originate from")
fig.show()

INDIA has been tweeting away during the pandemic. UNITED STATES comes a close second. 

We can see from the list of user locations, there are a lot of repetitive locations that will have to be grouped/aggregated under the country name.
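
To give a feel for what such grouping could look like, here is a minimal sketch. The `COUNTRY_ALIASES` table and the `normalise_location` helper are hypothetical names introduced only for illustration; a real mapping would need to be far larger and is left to the follow-up cleaning kernel.

```python
import pandas as pd

# Hypothetical alias table: a real one would need to be much larger
COUNTRY_ALIASES = {
    "india": "India",
    "new delhi, india": "India",
    "mumbai, india": "India",
    "usa": "United States",
    "united states": "United States",
    "new york, ny": "United States",
}

def normalise_location(raw):
    """Map a free-text location to a country name, or 'Unknown' if unmapped."""
    if pd.isna(raw):
        return "Unknown"
    return COUNTRY_ALIASES.get(str(raw).strip().lower(), "Unknown")

# Made-up sample of raw location strings
toy_locations = pd.Series(["India", "New Delhi, India", "USA", "New York, NY", None, "Mars"])
country_counts = toy_locations.map(normalise_location).value_counts()
print(country_counts)
```

Anything that is missing or not in the table falls into an explicit `Unknown` bucket, so the unmapped share stays visible instead of being silently dropped.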

We can wrap the steps carried out for the preliminary univariate analysis into a function. The `inspect_column` function gathers basic information about a particular column and plots its distribution.

In [None]:
def inspect_column(data,column):
    print("------------------------------------------------------------------------------------------------------------")
    print("Basic Preliminary Information of column '{}'\n\n".format(column))
    print("Total number of Unique ",column,"values: ",data[column].nunique())
    print("----Quick overview of the distribution of the variable------")
    print(data[column].value_counts())
    print("The number of tweets where the", column, "value is unknown:", data[column].isna().sum())
    sub_data_df = data[column].value_counts().rename_axis(column).reset_index(name="counts")
    sub_data_threshold_df = sub_data_df[sub_data_df["counts"]>25].head(100)
    fig = px.bar(sub_data_threshold_df,x=column,y="counts", title="Distribution of values of column '{}'".format(column))
    fig.show()

### 2. user_verified

In [None]:
inspect_column(data,"user_verified")

So there are no null values in the `user_verified` column. It is also interesting that there are about 7 times as many **unverified** accounts as verified accounts putting out information and opinions during the pandemic. This could raise concerns about the legitimacy of the information in these tweets, as most of it comes from unverified sources.

### 3. hashtags

In [None]:
inspect_column(data,"hashtags")

As we can see from the output of `inspect_column`, the hashtags of each tweet are stored as a single string that represents a list of strings. To analyse the hashtags further, we need to clean this column. Assuming most tweets relate to the COVID situation, we will focus on cleaning the COVID-related hashtags first.

In [None]:
# A small method that cleans up the hashtags and collates all the hashtags into a list.
all_hashtag_list = []

# Itertuples is much faster than iterrows
for each_row in data.itertuples():
    if not str(each_row.hashtags).lower() == "nan":
        each_hashtag = str(each_row.hashtags)
        each_hashtag = each_hashtag.strip('[]').replace("'","")
        all_hashtag_list += each_hashtag.split(",")
        
print("Total number of hashtags",len(all_hashtag_list))
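
As an alternative to manual string stripping, the sketch below parses each value with `ast.literal_eval` from the standard library. It assumes every non-null value is the printed form of a Python list of strings, which is how the column appears above; `parse_hashtags` and the sample values are made up for illustration.

```python
import ast
import pandas as pd

def parse_hashtags(raw):
    """Turn a string such as "['COVID19', 'Mask']" into a list of hashtags."""
    if pd.isna(raw):
        return []
    return [str(tag).strip() for tag in ast.literal_eval(raw)]

# Made-up sample in the same shape as the hashtags column
toy_hashtags = pd.Series(["['COVID19']", "['covid19', 'StaySafe']", None])
parsed = toy_hashtags.map(parse_hashtags)

# explode() puts one hashtag per row; empty lists become NaN and are dropped
all_tags = parsed.explode().dropna().tolist()
print(all_tags)
```

Parsing the list literal avoids the stray leading spaces that a plain `split(",")` leaves behind.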

Converting the list of values into a Dataframe to leverage the dataframe utility methods to carry out further analysis.

In [None]:
hashtag_df = pd.DataFrame(all_hashtag_list,columns=["hashtags"])
hashtag_df.head()

Let us use `value_counts` to get the count of each hashtag.

In [None]:
count_df = hashtag_df["hashtags"].value_counts().rename_axis("hashtags").reset_index(name="counts")
count_df.head()

But is it fully cleaned? If not all the hashtags, at least the COVID-related ones? The answer is NO. We can still see duplicate entries, each with its own count, for the same covid19 hashtag. We need to minimise this repetition and bring the varying versions of the COVID hashtag down to a single entry. The code cell below simply adds up the counts of the `covid19` hashtag variants that differ only in case or surrounding whitespace.

In [None]:
hashtag_final_count_dic = {}
for each_row in count_df.itertuples():
    if str(each_row.hashtags).strip().lower() == "covid19":
        if "covid19" not in hashtag_final_count_dic:
            hashtag_final_count_dic["covid19"] = each_row.counts
        else:
            hashtag_final_count_dic["covid19"] += each_row.counts
    else:
        hashtag_final_count_dic[str(each_row.hashtags).strip()] = each_row.counts
        
print("The aggregated hashtags count has {} hashtags".format(len(hashtag_final_count_dic)))
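
The same idea can be applied to every hashtag at once rather than to `covid19` alone. The following is a minimal sketch on made-up values, assuming that hashtags differing only in case or surrounding whitespace should be counted together.

```python
import pandas as pd

# Made-up hashtags with inconsistent case and stray spaces
toy_tags = pd.Series([" COVID19", "covid19", "Covid19 ", "StaySafe", "staysafe", "WHO"])

normalised_counts = (
    toy_tags.str.strip().str.lower()   # remove stray spaces, ignore case
    .value_counts()
    .rename_axis("hashtag")
    .reset_index(name="count")
)
print(normalised_counts)
```

Lower-casing loses the original capitalisation, which is usually acceptable for counting but worth keeping in mind when labelling plots.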

Let's get this into a dataframe as well.

In [None]:
final_hashtag_count_df = pd.DataFrame(hashtag_final_count_dic.items(),columns=['hashtag','count'])
final_hashtag_count_df

We were able to aggregate most of the covid19 hashtags, and we now have the cumulative number of times the covid19 hashtag was used. Let us now see the top 10 most used hashtags.

In [None]:
final_df = final_hashtag_count_df.sort_values(by='count',ascending=False).head(10)
fig = px.bar(final_df,"hashtag","count",title="Top 10 used hashtags during the pandemic")
fig.show()

No surprises here: the hashtag `#covid19` has been used over 100K times.

### Analyzing numerical columns in the dataset



### 4. user_followers

Let us perform analysis on the distribution of user followers for each twitter handle.

In [None]:
fig = px.box(data,y="user_followers", title="The overall distribution of user followers")
fig.show()

It looks like the distribution of followers is fairly right skewed, with the IQR (Inter Quartile Range) spanning roughly 172 to 5K followers. Narrowing the range towards the whiskers will give us a better sense of the distribution. We can also see that there are significant outliers in terms of followers. The median number of followers, however, is around `992`.

Let us trim the extremes on either end to see what the distribution looks like.

In [None]:
fig = px.box(data[(data["user_followers"]>0) & (data["user_followers"]<=15000)],y="user_followers",title="The distribution of User followers within 15000 user followers")
fig.show()

The IQR does shift a bit more towards the right. However, the overall distribution still looks heavily right skewed, with many points beyond the fences. In this case, there are too many values outside the fences to dismiss them all as outliers. These values can add value to the analysis, so we keep them as they are for now.

Let us see what the distribution looks like on a histogram.
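
For reference, the fences that a boxplot draws can be computed directly. The sketch below uses Tukey's rule (1.5 times the IQR beyond the quartiles) on made-up follower counts; `tukey_fences` is a hypothetical helper, not part of pandas or plotly.

```python
import pandas as pd

def tukey_fences(series, k=1.5):
    """Return the (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up follower counts with a long right tail
toy_followers = pd.Series([50, 172, 400, 992, 2000, 5000, 80_000, 1_200_000])
lower, upper = tukey_fences(toy_followers)

# Share of accounts that fall outside the fences
outlier_share = ((toy_followers < lower) | (toy_followers > upper)).mean()
print(lower, upper, outlier_share)
```

When the share of points beyond the fences is large, as in this toy example, treating them all as outliers would discard a meaningful part of the data.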

In [None]:
fig = px.histogram(data[(data["user_followers"]>0) & (data["user_followers"]<=15000)],x="user_followers", nbins=10, color_discrete_sequence=["red"],
                   title="Distribution of user followers with user followers less than 15000")
fig.show()

We can see that the largest group of twitter handles has between 0 and 2000 followers. The number of users keeps decreasing as the number of followers increases. To put it simply, the majority of twitter users have up to 2000 followers.
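
An alternative to cutting the range at a fixed threshold is to look at the counts on a logarithmic scale, which keeps the large accounts in view. This is a minimal sketch on made-up follower counts and is not what the cells above do.

```python
import numpy as np
import pandas as pd

# Made-up follower counts spanning several orders of magnitude
toy_followers = pd.Series([0, 3, 45, 172, 992, 5000, 80_000, 1_200_000])

# log10(1 + x) keeps zero-follower accounts and compresses the long right tail
log_followers = np.log10(1 + toy_followers)

# Count accounts per order of magnitude on the log scale
bins = np.arange(0, 8)
counts, edges = np.histogram(log_followers, bins=bins)
print(counts)
```

Each bin here covers one order of magnitude, so accounts with a handful of followers and accounts with millions appear on the same chart.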

We perform the same actions for the remaining 2 numerical columns `user_friends` and `user_favourites`

### 5. user_friends

In [None]:
fig = px.box(data,y="user_friends")
fig.show()

The boxplot for `user_friends` shows similar properties to that of `user_followers`. The distribution looks extremely right skewed, with the IQR ranging roughly from 100 to 1.7K. Let us apply the same operations to this column to see the effect of limiting the number of user friends.

In [None]:
fig = px.box(data[(data["user_friends"]>0) & (data["user_friends"]<=5000)],y="user_friends", title="Distribution of user friends with user friends less than 5k")
fig.show()

In [None]:
fig = px.histogram(data[(data["user_friends"]>0) & (data["user_friends"]<=5000)],x="user_friends", nbins=10, color_discrete_sequence=["red"],
                  title="Distribution of user friends less than 5K")
fig.show()

We can see that the majority of users have at most 500 friends. Let us now explore `user_favourites` and see if it has similar characteristics.

### 6. user_favourites

In [None]:
fig = px.box(data,y="user_favourites",title="Distribution of user favourites")
fig.show()

The traits are very similar to those of the above 2 columns, with a similarly right-skewed distribution. Let us reduce the scope and try to understand it better.

In [None]:
fig = px.box(data[(data["user_favourites"]>0) & (data["user_favourites"]<=25000)],y="user_favourites",title="Distribution of user favourites with a maximum of 25K")
fig.show()

In [None]:
fig = px.histogram(data[(data["user_favourites"]>0) & (data["user_favourites"]<=25000)],x="user_favourites", nbins=10, color_discrete_sequence=["red"],
                  title="Distribution of user favorites with a maximum of 25k user favorites")
fig.show()

We can see that most users have between 0 and 5000 favourites.

### 7. source

Let us now understand the different sources the Twitterati used to make their tweets. We will later examine how the usage of different sources varies with other variables such as `user_location`.

In [None]:
count_df = data["source"].value_counts().rename_axis("source").reset_index(name="counts")
count_df

We can see that sources across the spectrum are used to disseminate information and opinion. `Twitter Web App`, `Twitter for Android`, `Twitter for iPhone` and `TweetDeck` are some of the most commonly used sources for tweets from all over the world.

We will just be seeing the variation and distribution of the top 10 sources for better understanding and convenience.

In [None]:
fig = px.bar(count_df.head(10), x='source', y='counts',title="Top 10 Sources to make tweets")
fig.show()

### 8. is_retweet

This variable indicates whether a particular tweet is itself a retweet. Let us continue the exploration.

In [None]:
count_df = data["is_retweet"].value_counts()
count_df

Interesting! It looks like none of the tweets in the dataset is a retweet. Let us confirm this by counting the unique values in this variable.

In [None]:
print("There are {} unique values in the column 'is_retweet'".format(data["is_retweet"].nunique()))

Yes. None of the tweets in the dataset is a retweet. This is really interesting considering how rampantly tweets are shared and retweeted.
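
A column with a single distinct value carries no information for the analysis, so it can be useful to check for such columns up front. The sketch below does this on a made-up frame; the column names mirror the dataset but the values are invented.

```python
import pandas as pd

# Made-up rows using a few of the dataset's column names
toy = pd.DataFrame({
    "user_verified": [True, False, False],
    "is_retweet": [False, False, False],
    "source": ["Twitter Web App", "TweetDeck", "Twitter for iPhone"],
})

# Columns with a single distinct value cannot explain any variation
constant_columns = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
print(constant_columns)
```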

### 9. date

Let us perform a time series analysis and understand how the volume of tweets changes over time.

The `date` column is of object datatype. We need to cast it to a datetime type for further analysis.

In [None]:
# converting to Date type
data["date"] = pd.to_datetime(data["date"])

In [None]:
print("Date column is of '{}' type".format(data["date"].dtype))
data["date"].head()

Let us analyse the distribution of tweets per day. Let us start by adding a new column `day_of_tweet` which will only have the date part of the tweet made. This will prove convenient while aggregating values with respect to dates.

In [None]:
data["day_of_tweet"] = pd.to_datetime(data['date']).dt.date
data["day_of_tweet"].head()

Once we have only the dates without the timestamps, let us perform aggregation and record the number of tweets made per day.

In [None]:
date_time_series = data.groupby("day_of_tweet").size().rename_axis("day_of_tweet").reset_index(name="number_of_tweets")
date_time_series
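
Grouping on `day_of_tweet` only produces rows for days that have at least one tweet. If gaps matter, resampling on a datetime index is an alternative, since it also emits days with zero tweets. This is a minimal sketch on made-up timestamps, assuming the `date` column has already been converted to datetime.

```python
import pandas as pd

# Made-up tweets: two on 24 July, none on 25 July, one on 26 July
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-07-24 10:00", "2020-07-24 18:30",
        "2020-07-26 09:15",
    ]),
    "text": ["first", "second", "third"],
})

daily_counts = (
    toy.set_index("date")["text"]
    .resample("D")     # one bucket per calendar day
    .count()           # days without tweets get a count of 0
    .rename("number_of_tweets")
)
print(daily_counts)
```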

We now have the number of tweets made per day from 24th July to 30th August. It's time to plot a time series of this data.

In [None]:
fig = px.line(date_time_series, x='day_of_tweet', y="number_of_tweets", title="Time series for number of tweets made per day")
fig.show()

We can observe some periodicity in the peaks and troughs of interest shown by people, measured by the number of tweets made. Given the bulk of hashtags related to COVID19, it is reasonable to assume that the majority of the tweets relate to the same topic. However, we can confirm this hypothesis during multivariate analysis. We can see that 25th July has the highest number of tweets and 7th August has the fewest. This was a period when the WHO and other major health institutions were discovering new symptoms and behaviour of this deadly virus, followed by new preventive measures. This may have spiked people's interest and, in turn, the number of tweets.
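
One way to probe the apparent periodicity is to smooth the daily counts with a 7-day rolling mean and to average them by weekday. The sketch below runs on made-up daily counts with a deliberate weekly pattern; it shows the technique only and says nothing about the real data.

```python
import pandas as pd

# Made-up daily counts that repeat every 7 days, starting on Friday 24 July 2020
days = pd.date_range("2020-07-24", periods=14, freq="D")
toy_counts = pd.Series([9, 7, 5, 6, 6, 7, 8, 9, 7, 5, 6, 6, 7, 8], index=days)

# A 7-day window averages out a weekly cycle; the first 6 values are NaN
rolling_mean = toy_counts.rolling(window=7).mean()

# Mean count per weekday exposes which days tend to be busiest
by_weekday = toy_counts.groupby(toy_counts.index.day_name()).mean()
print(by_weekday.sort_values(ascending=False))
```

If the rolling mean is flat while the weekday averages differ, the peaks and troughs are mostly a weekly rhythm rather than a change in overall interest.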

### 10. text

Finally, it's time to analyse and explore the tweet text itself. We will apply basic NLP techniques in the following order:

- Tokenization
- Removal of stop words
- Finding the most frequently occurring words across the corpus
- Graphical representation of the same using a wordcloud and a frequency distribution.

Let's get started!
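
Before running these steps on the full corpus, here is the whole pipeline in miniature using only the standard library. `TOY_STOP_WORDS`, `top_words` and the sample tweets are made up for illustration; the cells that follow use NLTK's tokenizer and stop-word list instead.

```python
import re
from collections import Counter

# A tiny hand-picked stop-word set, standing in for NLTK's English list
TOY_STOP_WORDS = {"the", "is", "a", "of", "in", "are", "and"}

def top_words(texts, n=3):
    """Tokenize, drop stop words and return the n most common tokens."""
    tokens = []
    for text in texts:
        tokens += re.findall(r"\w+", text.lower())   # lower-case word tokens
    kept = [t for t in tokens if t not in TOY_STOP_WORDS]
    return Counter(kept).most_common(n)

toy_tweets = [
    "The number of COVID19 cases is rising",
    "COVID19 cases are falling in the city",
    "Masks and distancing reduce COVID19 spread",
]
result = top_words(toy_tweets, n=2)
print(result)
```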


In [None]:
# Plotting a really simple wordcloud of only the first tweet
wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(data.text[0])
plt.figure(figsize=(30,30))
# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
# Let's gather all the tweet text into a single string.
# Join with a space so the last word of one tweet does not merge with the first word of the next.
final_text = " ".join(data["text"].astype(str))
print("There are {} characters overall in tweets".format(len(final_text)))

Let's do some applied NLP analytics on the entire corpus.

In [None]:
#lets first tokenize the entire corpus
tokenized_words = word_tokenize(final_text)
print("The size of the tokenized words in the corpus is of size {}".format(len(tokenized_words)))

Let's look at the frequency distribution of the words.

In [None]:
fdist = FreqDist(tokenized_words)
print(fdist)

In [None]:
# lets see the top 50 most frequently used words
fdist.most_common(50)

We can see a lot of garbage characters and stop words. Let us see how these are distributed anyway.

In [None]:
# plotting the top 50 words in the frequency distribution
plt.figure(figsize=(20,20))
fdist.plot(50)
plt.show()

We can see that there are a lot of punctuation tokens in the corpus. Let's get rid of them first.

In [None]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
tokenized_words = tokenizer.tokenize(final_text)

We can see that a lot of words do not carry substantial meaning. These are stop words, which are basically considered noise: words like is, am, are, this, a, an, the, etc.

In [None]:
stop_words = set(stopwords.words("english"))
print(stop_words)

In [None]:
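# removing stop words (compared case-insensitively) and very short tokens from our tweets
filtered_tokens = []
for each in tokenized_words:
    # NLTK's stop words are lower case, so lower-case the token before the lookup
    if (each.lower() not in stop_words) and (len(each) > 3):
        filtered_tokens.append(each)
print("Tokenised words now has {} words after removing stop words".format(len(filtered_tokens)))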

Let's replot the frequency distribution and see the most frequently used words in the dataset.

In [None]:
fdist = FreqDist(filtered_tokens)
print(fdist)

In [None]:
plt.figure(figsize=(30,30))
fdist.plot(50)
plt.show()

The most frequently used token is `https`, which at first may seem inconsequential, but it does show that the majority of tweets include a link, possibly to the source of the information they contain. The other most commonly used words across the dataset are indicative of the current situation. Some of them are: `COVID19`, `cases`, `people`, `pandemic`, `2020` and several others.
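
If link fragments such as `https` are not wanted in the frequency counts, URLs can be removed before tokenization. This is a minimal sketch with a simple regular expression; `strip_urls` is a hypothetical helper and the pattern is an assumption that covers plain `http://` and `https://` links only.

```python
import re

# Matches http:// or https:// followed by any run of non-space characters
URL_PATTERN = re.compile(r"https?://\S+")

def strip_urls(text):
    """Remove http(s) links and collapse the whitespace left behind."""
    return " ".join(URL_PATTERN.sub(" ", text).split())

example = "New cases reported today https://t.co/abc123 stay safe"
cleaned = strip_urls(example)
print(cleaned)
```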

In [None]:
#Combining the top 200 frequently used words in the dataset
analyse_str = " ".join([each[0] for each in fdist.most_common(200)])
print(analyse_str)

In [None]:
# WordCloud for the most frequently used words across the dataset
wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(analyse_str)
plt.figure(figsize=(20,20))
# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Amidst all the commotion and clamour about COVID, we see some politically attributed and/or driven tweets as well. Words like `realDonaldTrump` and `trump` show that twitter almost never runs out of gas when it comes to discussing politics.

-----------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------


## Multi-Variate Exploratory Analysis

In this section, we will explore how one variable changes with respect to other variables. Multivariate analysis is a prominent technique for drawing insights into the behaviour of data.


Let's get started.

### 1. user_location v/s account verification status

In this section, we will see how the verification status of the account making the tweet relates to its location. This will help us understand which country/location has been making most of the tweets and how many of them come from verified accounts. Going back to our hypothesis that verification indicates the genuineness of the information, we are trying to find out which locations potentially have more unverified news than verified information.

In [None]:
# grouping by user_location and user_verified and gathering top 50 entries
user_loc_df = data.groupby(["user_location","user_verified"])["user_verified"].count().reset_index(name="count").sort_values(by=['count'], ascending=False).head(50)
user_loc_df.head()

In [None]:
# bar plot to show the user_location and count by user_verified
fig = px.bar(user_loc_df, x='user_location',y='count',color='user_verified',barmode="group",
            title="Relationship between the user locations v/s user verified")
fig.show()

We can see that the United States has the largest number of unverified users. India, on the other hand, has a fair number of verified users. Does this say anything about the fake news narrative that often circulates in the media these days, especially in the US? *Food for thought*

In [None]:
fig = px.pie(user_loc_df, values='count', names='user_verified', title='Ratio of Verified accounts v/s Unverified accounts')
fig.show()

Among these top 50 location groups, approximately 86% of the tweets come from unverified accounts and only about 14% from verified accounts. This is a fairly large skew towards unverified accounts.

### 2. user_location v/s Hashtags

Analysing user location together with hashtags gives us insight into the different topics being talked about in different locations. In a way, this also presents the overall mood of a particular location in relation to the topic at hand.
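
The pie chart above is built from the top 50 location groups only. To get the split over a whole column, `value_counts(normalize=True)` returns the proportions directly. The sketch below uses a made-up boolean series with a 7 to 1 ratio, mirroring the ratio observed in the univariate section.

```python
import pandas as pd

# Made-up verification flags: 7 unverified accounts for every verified one
toy_verified = pd.Series([False] * 7 + [True])

shares = (
    toy_verified.map({True: "verified", False: "unverified"})
    .value_counts(normalize=True)   # proportions instead of raw counts
)
print(shares)
```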

In [None]:
# extracting the user location and their respective hashtags
user_loc_hastag_data = data[["user_location","hashtags"]]
user_loc_hastag_data.head()

In [None]:
# converting the dataframe to dictionary to aggregate by location
user_loc_hastag_data_dic = user_loc_hastag_data.to_dict(orient='records')
print("There are a total of {} records in the dictionary".format(len(user_loc_hastag_data_dic)))

In [None]:
# code block to perform string manipulation and extract location keys and aggregated values
cleaned_dic_container = []
for each in user_loc_hastag_data_dic:
    if str(each["user_location"]).lower() != 'nan' and str(each["hashtags"]).lower() != 'nan':
        cleaned_dic = {}
        each["hashtags"] = str(each["hashtags"]).strip('[]').replace("'","").split(",")
        cleaned_dic["user_location"] = str(each["user_location"])
        cleaned_dic["hashtags"] = each["hashtags"]
        cleaned_dic_container.append(cleaned_dic)
cleaned_dic_container[0:5]

In [None]:
# converting the processed list of dictionaries to a dataframe by using the 'explode' method of pandas to spread each of the 'hashtag' column entries vertically
user_loc_hashtags_df = pd.DataFrame(cleaned_dic_container)
user_loc_hashtags_df = user_loc_hashtags_df.explode('hashtags')
user_loc_hashtags_df

In [None]:
# strip stray whitespace before grouping so that " COVID19" and "COVID19" are counted together
user_loc_hashtags_df["user_location"] = user_loc_hashtags_df["user_location"].str.strip()
user_loc_hashtags_df["hashtags"] = user_loc_hashtags_df["hashtags"].str.strip()
# count each (location, hashtag) pair and keep the 100 most frequent pairs
hashtag_loc_df = user_loc_hashtags_df.groupby(['user_location',"hashtags"])["hashtags"].count().reset_index(name="count").sort_values(by=['count'], ascending=False).head(100)
hashtag_loc_df

Let us see the most popular hashtags aggregated by user location.

In [None]:
fig = px.bar(hashtag_loc_df,x = "user_location",y="count",color="hashtags",title="What are these countries talking about the most ?")
fig.show()

No surprises here. The majority of users have been talking about COVID. However, we do get to see some other hashtags as well. We can see `#auspoll` trending in Canberra, Australia. It goes to show that despite the showstopper 'COVID', the world is also talking about other pressing issues like elections.

### 3. user_location v/s tweet source

It would also be interesting to know which sources are used to tweet across the diverse demographics of the world. Let's take a look. Viewing this in conjunction with the `user_verified` status should be interesting as well: we can understand which source is predominantly used and how many of its users are verified.

In [None]:
# grouping by user location, user verified status and source, then keeping the 50 most common combinations
user_loc_source_df = data.groupby(["user_location","user_verified","source"])["source"].count().reset_index(name="count").sort_values(by=['count'], ascending=False).head(50)
user_loc_source_df.head()

In [None]:
fig = px.bar(user_loc_source_df,x="user_location", y="count",color="source", facet_col="user_verified",title="Exploring the relationship between the user location v/s source of the tweet v/s user verification status")
fig.show()

We can see that most verified users make use of Twitter Web App and TweetDeck (which is generally used by those managing multiple twitter accounts, e.g. media houses). Additionally, Twitter for Android is the source used by most of the unverified accounts.

### 4. user verified v/s user followers

We started from the reported finding that fake news spreads faster on social media. Assuming for a minute that verified accounts are less likely to propagate fake news, it would be interesting to compare the median number of followers by verification status. Ideally, verified accounts should have more followers than unverified accounts. But is that the case in reality? Let us find out.

In [None]:
# grouping by user verified status and finding the median user followers
user_ver_followers_df = data.groupby("user_verified")["user_followers"].median().reset_index(name="median_followers").sort_values(by=["median_followers"],ascending=False)
user_ver_followers_df

In [None]:
fig = px.pie(user_ver_followers_df, values="median_followers",names="user_verified",hole=.5, title="The proportion of Median number of followers between user verified accounts v/s unverified accounts")
fig.show()

We can see that the result is heavily skewed in favour of verified accounts. Perhaps setting a threshold on the number of followers would give a clearer picture. This is a classic case of outliers being included in the overall analysis.

### 5. user_verified v/s user friends

Following the same hypothesis as for verified accounts and user followers, let us find out whether the same pattern holds for the median number of friends.
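
Instead of a single median per group, comparing several quantiles gives a fuller picture of how the two groups differ and is less sensitive to a few very large accounts. This is a minimal sketch on made-up follower counts.

```python
import pandas as pd

# Made-up accounts: three verified, four unverified
toy = pd.DataFrame({
    "user_verified": [True, True, True, False, False, False, False],
    "user_followers": [20_000, 150_000, 9_000_000, 50, 300, 900, 400_000],
})

# One row per verification status, one column per quantile
quartiles = (
    toy.groupby("user_verified")["user_followers"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```

The rows are ordered `False` then `True`, and the `0.5` column holds the same medians that the grouped median above would return.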

In [None]:
user_ver_friends_df = data.groupby("user_verified")["user_friends"].median().reset_index(name="median_friends").sort_values(by=["median_friends"],ascending=False)
user_ver_friends_df

In [None]:
fig = px.pie(user_ver_friends_df, values="median_friends",names="user_verified",hole=.5, title="The proportion of Median number of friends between user verified accounts v/s unverified accounts")
fig.show()

We can see that verified and unverified accounts have almost the same median number of friends, with unverified users having slightly more than verified users.

### 6. user_verified v/s number of favourites v/s retweets

In this case, we look at 3 variables in conjunction: the median user favourites and the `is_retweet` count, aggregated by `user_verified`.

In [None]:
user_ver_fav_retweet_df = data.groupby("user_verified",as_index=False).agg({"user_favourites":"median","is_retweet":"count"})
user_ver_fav_retweet_df

In [None]:
fig = px.bar(user_ver_fav_retweet_df, x="is_retweet",y="user_favourites",color="user_verified",orientation='h', 
             title = "Median user favourites v/s number of tweets, grouped by user_verified status" )
fig.show()

We can see that unverified users account for far more tweets and have higher median favourites than verified users. Unverified users have a median of around 1800 favourites across a total of about 156k tweets, whereas verified users have a median of about 1500 favourites across only 23K tweets. Note that `count` on `is_retweet` counts rows, and since no tweet in this dataset is a retweet, these totals are tweet counts rather than retweet counts. The sheer volume of tweets from unverified accounts hints at why misinformation can propagate faster (this rests entirely on the assumption that unverified accounts are more likely to post misinformation).
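
It helps to keep in mind what `count` and `sum` do on a boolean column such as `is_retweet`: `count` returns the number of non-null rows, while `sum` returns the number of `True` values. The sketch below shows the difference on made-up rows.

```python
import pandas as pd

# Made-up rows in which no tweet is a retweet
toy = pd.DataFrame({
    "user_verified": [True, False, False, False],
    "is_retweet": [False, False, False, False],
})

per_group = toy.groupby("user_verified")["is_retweet"].agg(
    rows="count",      # number of tweets in the group
    retweets="sum",    # number of tweets flagged as retweets
)
print(per_group)
```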

### 7. Exploring Relationship between user followers for verified and unverified users with Time

In this section we will see how the median number of followers and the number of tweets have changed with time.

In [None]:
time_series_data = data.groupby('day_of_tweet',as_index=False).agg({'user_followers':'median','user_verified':'count'})
time_series_data

In [None]:
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(x=time_series_data["day_of_tweet"], y=time_series_data["user_followers"],
                    mode='lines+markers',
                    name='median followers'))
# 'count' on user_verified counts rows, so this trace is the number of tweets per day
fig.add_trace(go.Scatter(x=time_series_data["day_of_tweet"], y=time_series_data["user_verified"],
                    mode='lines+markers',
                    name='number of tweets'))

fig.update_layout(title='Time series of median followers and number of tweets per day',
                   xaxis_title='Day',
                   yaxis_title='Count')
fig.show()

We can see that the median number of followers has remained pretty much constant over time. However, the number of tweets made shows many peaks and troughs, with the highest peak towards the end of July and the lowest point at the start of August. Let us now see how the number of tweets from verified and unverified users has changed over time.

In [None]:
time_series_retweet_df = data.groupby(['day_of_tweet','user_verified'],as_index=False).agg({'is_retweet':'count'})
time_series_retweet_df

In [None]:
fig = px.line(time_series_retweet_df, x="day_of_tweet", y="is_retweet", color='user_verified',title="Time series of the number of tweets made by verified and unverified users")
fig.show()

We can see a similar pattern here as well. Because `is_retweet` is counted per row, this plot shows the number of tweets per day. The number of tweets from verified accounts stays fairly consistent, while the number from unverified accounts varies in an almost chaotic pattern. Isn't it metaphorical?

# Conclusion

With this, the EDA of the tweets made during the pandemic comes to an end. We have explored the relationships between the variables and discussed various aspects of the tweets. We have seen how the different variables relate to the way information is put out on the platform. This wasn't the cleanest of data. We have done only very basic, preliminary cleaning for the sake of EDA, and a more detailed cleaning exercise needs to be carried out on the dataset before unsupervised learning can be applied to find more patterns and trends.