# **Load Libraries**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.pyplot import figure

# **Glimpse of the Data**

In [None]:
df = pd.read_csv('/kaggle/input/covid19-tweets/covid19_tweets.csv')
df.head()

In [None]:
df.info()

Almost all the data types are correct; only the date and user_created columns need to be converted to datetime. Several columns also contain a lot of missing values, which we will explore later.

In [None]:
df['date'] = pd.to_datetime(df['date'],format='%Y-%m-%d %H:%M:%S')
df['user_created'] = pd.to_datetime(df['user_created'], format='%Y-%m-%d %H:%M:%S')
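
The conversion above raises an error if any timestamp does not match the given format. A minimal sketch of a more tolerant variant, assuming that silently dropping malformed values is acceptable: `errors="coerce"` turns unparseable entries into `NaT`, and the failures can then be counted. The `raw` series and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for a timestamp column (illustrative values, not real data).
raw = pd.Series(["2020-07-25 12:27:21", "not a date", None])

# errors="coerce" replaces values that do not match the format with NaT
# instead of raising an exception.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Count values that were present in the input but could not be parsed.
n_failed = int((parsed.isna() & raw.notna()).sum())
print("Unparseable timestamps:", n_failed)
```

Checking `n_failed` before moving on makes it clear whether the strict format is safe to use on the full column.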

In [None]:
print("So right now, we currently know that:\n")
print("Number of tweets: {}\n".format(df.shape[0]))
print("Number of users: {}\n".format(df.user_name.nunique()))
print("Users with more than 100K followers: {}\n".format(df[df['user_followers']>100000].user_name.nunique()))
print("Number of verified users: {}\n".format(df[df['user_verified']==True].user_name.nunique()))

# **Knowing more about the data**

In [None]:
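def plot_count(x, df, title, xlabel, ylabel):
    """Plot the 10 most frequent values of df[x], labelled with their share of all rows."""
    figure(figsize=(20, 6))
    sns.set_style("whitegrid")

    total = float(len(df))
    # Pass the column by keyword; recent seaborn releases no longer accept it positionally.
    ax = sns.countplot(x=df[x], order=df[x].value_counts().index[:10])
    for i in ax.patches:
        height = i.get_height()
        # Write each bar's percentage of all rows just above the bar
        ax.text(i.get_x() + i.get_width() / 2.,
                height + 3,
                '{:1.2f}%'.format(100 * height / total),
                ha="center")

    ax.set(title=title, xlabel=xlabel, ylabel=ylabel)
    plt.show()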

In [None]:
plot_count('source', df, "Top 10 Sources of Tweets", "Source of Tweets", "Number of Tweets")

The main sources of the tweets are Twitter Web App, Android and iPhone. Web App leads with more than 40,000 tweets, followed by Android with a little under 30,000 and iPhone with a little over 25,000. The other platforms, such as TweetDeck, Hootsuite Inc. and iPad, account for no more than 10,000 tweets each.
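
The figures quoted here are read off the plot. As a rough sketch, the same counts and percentages can be tabulated directly with `value_counts`. The `source` series below is a small made-up stand-in for `df['source']`, and the name `summary` is only illustrative.

```python
import pandas as pd

# Toy stand-in for df['source'] (illustrative values, not the real counts).
source = pd.Series([
    "Twitter Web App", "Twitter for Android", "Twitter Web App",
    "Twitter for iPhone", "Twitter Web App", "Twitter for Android",
])

# value_counts() sorts from most to least frequent.
counts = source.value_counts()

# Share of all rows, matching the percentage labels drawn by plot_count.
share = (100 * counts / len(source)).round(2)

summary = pd.DataFrame({"tweets": counts, "percent": share}).head(10)
print(summary)
```

On the real data, replacing `source` with `df['source']` should give the exact numbers behind each bar.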

In [None]:
plot_count('user_location', df, "Top 10 User Locations", "User Location", "Number of Tweets")

The plot shows the ten most common user locations among the Covid-19 related tweets. Most of the tweets are from India, the United States, the United Kingdom and Australia.

In [None]:
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek
df['hour'] = df['date'].dt.hour
df['dateonly'] = df['date'].dt.date

In [None]:
figure(figsize=(20,6))
sns.set_style("whitegrid")

agg_df = df.groupby(["dateonly"])["text"].count().reset_index()
agg_df.columns = ["dateonly", "count"]

ax = sns.lineplot(x="dateonly", y="count", data=agg_df)
plt.xticks(rotation=90)
ax.set(title="Tweet Count",xlabel="Date",ylabel="Count")

plt.show()

In [None]:
plot_count("dayofweek", df, "Number of Tweets by Day", "Day", "Count")

In [None]:
plot_count("hour", df, "Number of Tweets by Hour", "Hour", "Count")

# **A Look at the missing data**

In [None]:
figure(figsize=(20,6))
miss = pd.DataFrame(df.isnull().sum())

# Pass x and y by keyword; recent seaborn releases no longer accept them positionally.
ax = sns.barplot(x=miss[0], y=miss.index)
ax.set(title="Missing Data", xlabel="Number of Missing Data")
plt.show()

Now we know that the missing data originates from the hashtags, user_description and user_location columns.
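
A bar chart shows which columns are affected but not how large the gaps are. A minimal sketch of a tabular view, assuming a per-column count and percentage is enough: the `toy` frame and its values are invented for illustration, and `report` is a hypothetical name.

```python
import pandas as pd

# Toy frame with a few gaps (illustrative values, not the real dataset).
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "hashtags": [None, "['COVID19']", None, None],
    "user_location": ["India", None, "UK", None],
})

# Number of missing values per column and their share of all rows.
missing = toy.isnull().sum()
report = pd.DataFrame({
    "missing": missing,
    "percent": (100 * missing / len(toy)).round(2),
})

# Keep only columns that actually have gaps, largest first.
report = report[report["missing"] > 0].sort_values("missing", ascending=False)
print(report)
```

Running the same steps on `df` instead of `toy` would list every column with missing values alongside its percentage.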