# PFIZER TWEETS - EDA, VISUALIZATION AND CLEANING

# **Importing Libraries**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#!pip install wordcloud
from wordcloud import WordCloud,STOPWORDS

# VISUALIZING THE DATA

One of my favorite tasks is data visualization: making colorful plots to uncover relationships in the data. A plot also gets the data across more easily than plain text.

In [None]:
train_df=pd.read_csv("../input/pfizer-vaccine-tweets/vaccination_tweets.csv")
print(f"The shape of the dataset is {train_df.shape}")
print(train_df.columns)  # a notebook cell only displays its last expression, so print this one
train_df.head()

In [None]:
# Count once and reuse; seaborn >= 0.12 needs x and y passed as keywords
top_locations = train_df["user_location"].value_counts().head(5)
plt.figure(figsize=(10,12))
sns.barplot(x=top_locations.values, y=list(top_locations.index))
plt.title("Top 5 Places with maximum tweets",fontsize=14)
plt.xlabel("Number of tweets",fontsize=14)
plt.ylabel("User location",fontsize=14)
plt.show()

I'm restricting this to the top 5 because the location data is not very accurate (some entries are not even locations!).
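Since `user_location` is free text, a light normalization pass before counting might merge the most obvious duplicates. Here is a minimal sketch on made-up values (the sample locations below are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample of free-text locations, for illustration only.
sample_locations = pd.Series(
    ["London", "london ", "LONDON, UK", "New Delhi", "new delhi", None, "Earth"]
)

# Normalize: drop missing values, trim whitespace, lowercase,
# and keep only the part before the first comma.
normalized = (
    sample_locations.dropna()
    .str.strip()
    .str.lower()
    .str.split(",")
    .str[0]
    .str.strip()
)

location_counts = normalized.value_counts()
print(location_counts)
```

This only collapses spelling variants; entries such as "Earth" that are not real locations would still need separate handling.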

In [None]:
# The y argument applies to DataFrame.plot.pie, not to a Series, so it is dropped here
train_df["source"].value_counts().head(5).plot.pie(figsize=(5, 5))

Here are the top 5 sources from which people have tweeted. 

In [None]:
#checking for null data
train_df.isnull().mean()*100

There are many tweets without any hashtags. Another relationship I'm looking forward to exploring is the one between 'user_verified' and 'hashtags', in particular whether non-verified profiles tend to use more hashtags in their tweets.
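One way to check this could look like the sketch below. It assumes each `hashtags` cell holds the string form of a Python list (for example `"['PfizerBioNTech']"`) and is missing when a tweet has no hashtags. The toy rows and the helper `count_hashtags` are hypothetical and only illustrate the idea:

```python
import ast

import pandas as pd

# Toy rows for illustration only; not taken from the dataset.
toy_df = pd.DataFrame(
    {
        "user_verified": [True, False, False, True, False],
        "hashtags": [
            "['PfizerBioNTech']",
            "['COVID19', 'vaccine', 'Pfizer']",
            None,
            None,
            "['vaccine', 'Pfizer']",
        ],
    }
)


def count_hashtags(value):
    """Return the number of hashtags in one cell (0 when missing)."""
    if pd.isna(value):
        return 0
    # Assumption: the cell is the string form of a Python list.
    return len(ast.literal_eval(value))


toy_df["n_hashtags"] = toy_df["hashtags"].apply(count_hashtags)

# Average number of hashtags per tweet, split by verification status.
mean_hashtags = toy_df.groupby("user_verified")["n_hashtags"].mean()
print(mean_hashtags)
```

On the real data the same grouping could be applied to `train_df`, provided the column really follows the assumed format.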

In [None]:
print(f"Data available since {train_df.date.min()}")
print(f"Data available up to {train_df.date.max()}")

In [None]:
train_df['date'] =  pd.to_datetime(train_df['date'])
cnt_srs = train_df['date'].dt.date.value_counts()
cnt_srs = cnt_srs.sort_index()
plt.figure(figsize=(14,6))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8, color='blue')  # x and y must be keywords in seaborn >= 0.12
plt.xticks(rotation='vertical')
plt.xlabel('Date', fontsize=12)
plt.ylabel('Number of tweets', fontsize=12)
plt.title("Number of tweets according to dates")
plt.show()

The only pattern I can spot here is that vaccine tweets fell between 25th Dec and 1st Jan, during the festivities, and picked up again as people resumed their lives. If you can spot anything else, mention it in the comments!
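A rolling mean over the daily counts can make a dip like this easier to see than the raw bars. The sketch below uses hypothetical counts and an arbitrary 3-day window, purely to show the mechanics:

```python
import pandas as pd

# Hypothetical daily tweet counts, for illustration only.
daily_counts = pd.Series(
    [120, 90, 60, 30, 30, 60, 90, 120],
    index=pd.date_range("2020-12-22", periods=8, freq="D"),
)

# 3-day centered rolling mean; the first and last day have no full
# window, so they come out as NaN.
smoothed = daily_counts.rolling(window=3, center=True).mean()
print(smoothed)
```

A wider window (for example 7 days) would smooth out weekday effects as well, at the cost of more missing values at the edges.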

In [None]:
train_df['user_created'] =  pd.to_datetime(train_df['user_created'])
count_  = train_df['user_created'].dt.date.value_counts()
count_ = count_.head(10)  # value_counts is sorted, so these are the 10 most common creation dates
plt.figure(figsize=(10,5))
sns.barplot(x=count_.index, y=count_.values, alpha=0.8)
plt.title('Most accounts created according to date')
plt.xticks(rotation='vertical')
plt.ylabel('Number of accounts', fontsize=12)
plt.xlabel('Date', fontsize=12)
plt.show()

A large number of accounts were created on 4th Dec, right around the time Pfizer launched its vaccine.

In [None]:
# highest retweet and favorite counts in the dataset
print(f" Maximum number of retweets {train_df.retweets.max()}")
print(f" Maximum number of favorites {train_df.favorites.max()}")

In [None]:
train_df.loc[train_df['retweets']==train_df.retweets.max(),'text'].values

In [None]:
train_df.loc[train_df['favorites']==train_df.favorites.max(),'text'].values

Let's see who made the most favorited tweet in the dataset.

In [None]:
# Select several columns with a list; a tuple can be read as a single (MultiIndex) key
train_df.loc[train_df['favorites']==train_df.favorites.max(),['user_name','user_description','user_followers','user_verified']].values

# WORDCLOUDS

In [None]:
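def cloud_plot(wordcloud):
    """Display a generated word cloud without axes."""
    plt.figure(1, figsize=(20,15))
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show()

# Join every description into one string; str() on a Series only returns its truncated repr
description_text = " ".join(train_df['user_description'].dropna().astype(str))
wordcloud_ = WordCloud(
                          background_color='black',
                          stopwords=set(STOPWORDS),
                          max_words=250,
                          max_font_size=40,
                          random_state=1705
                         ).generate(description_text)
cloud_plot(wordcloud_)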

In [None]:
# Join every tweet into one string; str() on a Series only returns its truncated repr
tweet_text = " ".join(train_df['text'].dropna().astype(str))
wordcloud_ = WordCloud(
                          background_color='black',
                          stopwords=set(STOPWORDS),
                          max_words=250,
                          max_font_size=40,
                          random_state=1705
                         ).generate(tweet_text)
# cloud_plot is already defined in the first word cloud cell
cloud_plot(wordcloud_)

In [None]:
# Join every hashtag entry into one string; str() on a Series only returns its truncated repr
hashtag_text = " ".join(train_df['hashtags'].dropna().astype(str))
wordcloud_ = WordCloud(
                          background_color='black',
                          stopwords=set(STOPWORDS),
                          max_words=250,
                          max_font_size=40,
                          random_state=1705
                         ).generate(hashtag_text)
# cloud_plot is already defined in the first word cloud cell
cloud_plot(wordcloud_)

The hashtag word cloud does offer a glimpse into a few other events going on at the time of the Pfizer launch.
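To put numbers behind what the cloud shows, the hashtags could also be counted directly. This is a rough sketch on hypothetical cells, assuming each `hashtags` value is the string form of a Python list:

```python
import ast
from collections import Counter

# Hypothetical cells, assuming each one is the string form of a Python list.
sample_hashtags = [
    "['PfizerBioNTech', 'vaccine']",
    "['vaccine']",
    "['COVID19', 'Vaccine']",
]

hashtag_counts = Counter()
for cell in sample_hashtags:
    # Lowercase so that 'vaccine' and 'Vaccine' are counted together.
    hashtag_counts.update(tag.lower() for tag in ast.literal_eval(cell))

print(hashtag_counts.most_common(2))
```

On the real column, the loop would run over `train_df['hashtags'].dropna()` instead of the sample list.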

Let me know which areas I could improve! I know I have barely scratched the surface in this kernel. I hope to follow up with a kernel on cleaning the text and then running sentiment analysis models on it.