<h1><center>Covid19 tweets. EDA. Visualization. Insights.</center></h1>

<center><img src="https://ichef.bbci.co.uk/news/1024/cpsprodpb/031C/production/_112869700_gettyimages-1209519827-1.jpg"></center>

### Hello everyone! Here I present a basic analysis of this dataset. We will create plots from the existing features and run a first, clustering-based sentiment analysis. We will also create a world map animation and explore a number of other interesting things. Let's start!

<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:green; border:0' role="tab" aria-controls="home"><center>Quick navigation</center></h3>

* [1. Dataset Quick Overview](#1)
* [2. Data Visualization](#2)
* [3. Additional features analysis](#3)
* [4. Tweets text analysis](#4)
* [5. Simple sentiment analysis](#5)
* [6. Animation with geographical distribution of tweets](#6)


#### If you are interested in dynamic monitoring of the tweets, please check another kernel of mine: https://www.kaggle.com/isaienkov/covid19-dynamic-in-time-and-space-of-the-tweets

In [None]:
import numpy as np
import pandas as pd 
import plotly.express as px
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from iso3166 import countries
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot

In [None]:
df = pd.read_csv("/kaggle/input/covid19-tweets/covid19_tweets.csv")

<a id="1"></a>
<h2 style='background:green; border:0; color:white'><center>1. Dataset Quick Overview</center></h2>

Let's do a first quick check of our dataset.

In [None]:
df.head()

In [None]:
df.info()

Let's see the percentage of NaNs for every column. We will visualize only the columns with at least one missing value.
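
As a small illustration of the same computation, here is a sketch on a toy frame. The column values are made up for the example; it relies on the fact that `isna().mean()` returns the fraction of missing values per column.

```python
import pandas as pd

# Toy frame with made-up values, used only to illustrate the computation.
toy = pd.DataFrame({
    "user_name": ["a", "b", "c", "d"],
    "user_location": ["Kyiv", None, None, "London"],
    "hashtags": [None, "['COVID19']", None, None],
})

# isna().mean() is the fraction of missing values per column.
missing_percent = (100 * toy.isna().mean()).round(2)

# Keep only the columns that actually have gaps, smallest share first.
missing_percent = missing_percent[missing_percent > 0].sort_values()
print(missing_percent)
```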

In [None]:
missed = pd.DataFrame()
missed['column'] = df.columns

missed['percent'] = [round(100* df[col].isnull().sum() / len(df), 2) for col in df.columns]
missed = missed.sort_values('percent')
missed = missed[missed['percent']>0]

fig = px.bar(
    missed, 
    x='percent', 
    y="column", 
    orientation='h', 
    title='Percentage of missing values per column (percent > 0)', 
    height=400, 
    width=600
)

fig.show()

<a id="2"></a>
<h2 style='background:green; border:0; color:white'><center>2. Data Visualization</center></h2>

Let's see the top 40 users by number of tweets.

In [None]:
ds = df['user_name'].value_counts().reset_index()
ds.columns = ['user_name', 'tweets_count']
ds = ds.sort_values(['tweets_count'])

fig = px.bar(
    ds.tail(40), 
    x="tweets_count", 
    y="user_name", 
    orientation='h', 
    title='Top 40 users by number of tweets', 
    width=800, 
    height=800
)

fig.show()

In [None]:
df = pd.merge(df, ds, on='user_name')

Let's see the most popular users, ranked by number of followers.

In [None]:
data = df.sort_values('user_followers', ascending=False)
data = data.drop_duplicates(subset='user_name', keep="first")
data = data[['user_name', 'user_followers', 'tweets_count']]
data = data.sort_values('user_followers')

fig = px.bar(
    data.tail(40), 
    x="user_followers", 
    y="user_name", 
    color='tweets_count',
    orientation='h', 
    title='Top 40 users by number of followers', 
    width=800, 
    height=800
)

fig.show()

And the users with the most friends.

In [None]:
data = df.sort_values('user_friends', ascending=False)
data = data.drop_duplicates(subset='user_name', keep="first")
data = data[['user_name', 'user_friends', 'tweets_count']]
data = data.sort_values('user_friends')

fig = px.bar(
    data.tail(40), 
    x="user_friends", 
    y="user_name", 
    color = 'tweets_count',
    orientation='h', 
    title='Top 40 users by number of friends', 
    width=800, 
    height=800
)

fig.show()

Let's see how the coronavirus affected the creation of new user accounts.

In [None]:
df['user_created'] = pd.to_datetime(df['user_created'])
df['year_created'] = df['user_created'].dt.year
data = df.drop_duplicates(subset='user_name', keep="first")
data = data[data['year_created']>1970]
data = data['year_created'].value_counts().reset_index()
data.columns = ['year', 'number']

fig = px.bar(
    data, 
    x="year", 
    y="number", 
    orientation='v', 
    title='User created year by year', 
    width=800, 
    height=600
)

fig.show()

As we can see from the chart, the coronavirus appears to have increased the number of new Twitter users.
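
To read such a trend more easily, the yearly counts can be ordered by year rather than by frequency. The helper below is a sketch on toy data; `accounts_per_year` is a hypothetical name and not part of the notebook.

```python
import pandas as pd

def accounts_per_year(frame):
    """Count distinct users by account creation year, ordered by year."""
    # Keep one row per user so prolific accounts are not counted repeatedly.
    users = frame.drop_duplicates(subset="user_name", keep="first")
    years = pd.to_datetime(users["user_created"]).dt.year
    return years.value_counts().sort_index()

# Made-up rows: user "a" tweets twice but should be counted once.
toy = pd.DataFrame({
    "user_name": ["a", "a", "b", "c"],
    "user_created": ["2019-05-01 10:00:00", "2019-05-01 10:00:00",
                     "2020-03-15 08:30:00", "2020-04-02 12:00:00"],
})
per_year = accounts_per_year(toy)
print(per_year)
```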

In [None]:
df.head(10)

Let's see the 40 most popular locations by number of tweets.

In [None]:
ds = df['user_location'].value_counts().reset_index()
ds.columns = ['user_location', 'count']
ds = ds[ds['user_location']!='NA']
ds = ds.sort_values(['count'])

fig = px.bar(
    ds.tail(40), 
    x="count", 
    y="user_location", 
    orientation='h', title='Top 40 user locations by number of tweets', 
    width=800, 
    height=800
)

fig.show()

We can also draw a pie chart to get the full picture of user locations.

In [None]:
def pie_count(data, field, percent_limit, title):
    # Count values without mutating the caller's DataFrame; NaNs are grouped under 'NA'.
    counts = data[field].fillna('NA').value_counts()
    percentage = 100 * counts / counts.sum()

    # Keep categories at or above the threshold and lump the rest into one slice.
    main = counts[percentage >= percent_limit].copy()
    other_label = "Others(<" + str(percent_limit) + "% each)"
    main.loc[other_label] = counts[percentage < percent_limit].sum()

    trace = go.Pie(labels=main.index.tolist(), values=main.tolist())

    layout = go.Layout(
        title=title,
        height=600,
        width=600
    )

    fig = go.Figure(data=[trace], layout=layout)
    iplot(fig)

pie_count(df, 'user_location', 0.5, 'Number of tweets per location')

Now it's time to check the last categorical feature, `source`. Let's see the top 40 sources by number of tweets.

In [None]:
ds = df['source'].value_counts().reset_index()
ds.columns = ['source', 'count']
ds = ds.sort_values(['count'])

fig = px.bar(
    ds.tail(40), 
    x="count", 
    y="source", 
    orientation='h', 
    title='Top 40 user sources by number of tweets', 
    width=800, 
    height=800
)

fig.show()

<a id="3"></a>
<h2 style='background:green; border:0; color:white'><center>3. Additional features analysis</center></h2>

Let's create a new feature, `hashtags_count`, that shows how many hashtags each tweet contains.

In [None]:
df['hashtags'] = df['hashtags'].fillna('[]')
df['hashtags_count'] = df['hashtags'].apply(lambda x: len(x.split(',')))
df.loc[df['hashtags'] == '[]', 'hashtags_count'] = 0

df.head(10)

And let's look at the values of the newly created column.

In [None]:
df['hashtags_count'].describe()

In [None]:
fig = px.scatter(
    df, 
    x=df['hashtags_count'], 
    y=df['tweets_count'], 
    height=700,
    width=700,
    title='Total number of tweets for users and number of hashtags in every tweet'
)

fig.show()

The distribution of the new feature over the number of tweets is as expected: many tweets with only a few hashtags and a few tweets with a large number of hashtags.
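
The counts above come from splitting the raw string on commas. A slightly more defensive sketch parses the value as a Python list literal instead, assuming the `hashtags` column stores strings such as `"['COVID19', 'lockdown']"`. The `count_hashtags` helper is hypothetical.

```python
import ast

def count_hashtags(raw):
    """Return the number of hashtags in a stringified list such as "['a', 'b']"."""
    if not isinstance(raw, str):
        return 0  # NaN or other non-string values count as no hashtags
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return 0  # malformed strings are treated as empty in this sketch
    return len(parsed) if isinstance(parsed, list) else 0

print(count_hashtags("['COVID19', 'lockdown']"))
print(count_hashtags("[]"))
print(count_hashtags(float("nan")))
```

In the notebook this could be applied with `df['hashtags'].apply(count_hashtags)`, which would also remove the need for the separate `'[]'` correction step.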

In [None]:
ds = df['hashtags_count'].value_counts().reset_index()
ds.columns = ['hashtags_count', 'count']
ds = ds.sort_values(['count'])
ds['hashtags_count'] = ds['hashtags_count'].astype(str) + ' tags'

fig = px.bar(
    ds, 
    x="count", 
    y="hashtags_count", 
    orientation='h', 
    title='Distribution of number of hashtags in tweets', 
    width=800, 
    height=600
)

fig.show()

Now let's see the top 40 users who use hashtags more than others on average.

In [None]:
ds = df[df['tweets_count']>=10]
ds = ds.groupby(['user_name', 'tweets_count'])['hashtags_count'].mean().reset_index()
ds.columns = ['user', 'tweets_count', 'mean_count']
ds = ds.sort_values(['mean_count'])

fig = px.bar(
    ds.tail(40), 
    x="mean_count", 
    y="user", 
    color='tweets_count',
    orientation='h', 
    title='Top 40 users with higher mean number of hashtags (at least 10 tweets per user)', 
    width=800, 
    height=800
)

fig.show()

### Just split day and time into separate columns
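
As an alternative to splitting the string form of the timestamp, the same two columns can be produced with `dt.strftime`. This is a sketch on a toy frame with made-up timestamps.

```python
import pandas as pd

# Made-up timestamps in the same "YYYY-MM-DD HH:MM:SS" shape as the dataset.
toy = pd.DataFrame({"date": ["2020-07-25 12:27:21", "2020-07-26 04:01:00"]})
toy["date"] = pd.to_datetime(toy["date"])

# strftime keeps both parts as plain strings, like the split-based version.
toy["day"] = toy["date"].dt.strftime("%Y-%m-%d")
toy["time"] = toy["date"].dt.strftime("%H:%M:%S")
print(toy)
```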

In [None]:
df['date'] = pd.to_datetime(df['date']) 
df = df.sort_values(['date'])
df['day'] = df['date'].astype(str).str.split(' ', expand=True)[0]
df['time'] = df['date'].astype(str).str.split(' ', expand=True)[1]
df.head()

### Number of unique users per day
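
The two-step `groupby` used for this count can also be written as a single `nunique` aggregation. The sketch below uses a toy frame with made-up values.

```python
import pandas as pd

# User "a" tweets twice on the first day but is counted once per day.
toy = pd.DataFrame({
    "day": ["2020-07-25", "2020-07-25", "2020-07-25", "2020-07-26"],
    "user_name": ["a", "a", "b", "a"],
})

# nunique counts distinct user names within each day.
users_per_day = toy.groupby("day")["user_name"].nunique().reset_index()
users_per_day.columns = ["day", "number_of_users"]
print(users_per_day)
```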

In [None]:
ds = df.groupby(['day', 'user_name'])['hashtags_count'].count().reset_index()
ds = ds.groupby(['day'])['user_name'].count().reset_index()
ds.columns = ['day', 'number_of_users']
ds['day'] = ds['day'].astype(str) + ':00:00:00'
fig = px.bar(
    ds, 
    x='day', 
    y="number_of_users", 
    orientation='v',
    title='Number of unique users per day', 
    width=800, 
    height=800
)
fig.show()

### Now we are going to check how many tweets were posted on each day in our dataset.

In [None]:
ds = df['day'].value_counts().reset_index()
ds.columns = ['day', 'count']
ds = ds.sort_values('count')
ds['day'] = ds['day'].astype(str) + ':00:00:00'
fig = px.bar(
    ds, 
    x='count', 
    y="day", 
    orientation='h',
    title='Tweets distribution over days present in dataset', 
    width=800, 
    height=800
)
fig.show()

### Let's do the same for hours

In [None]:
df['hour'] = df['date'].dt.hour
ds = df['hour'].value_counts().sort_index().reset_index()  # order bars by hour of day
ds.columns = ['hour', 'count']
ds['hour'] = 'Hour ' + ds['hour'].astype(str)
fig = px.bar(
    ds, 
    x="hour", 
    y="count", 
    orientation='v', 
    title='Tweets distribution over hours', 
    width=800
)
fig.show()

### Let's split the hashtags into a separate column, one hashtag per row.

In [None]:
def split_hashtags(x): 
    return str(x).replace('[', '').replace(']', '').split(',')

tweets_df = df.copy()
tweets_df['hashtag'] = tweets_df['hashtags'].apply(lambda row : split_hashtags(row))
tweets_df = tweets_df.explode('hashtag')
tweets_df['hashtag'] = tweets_df['hashtag'].astype(str).str.lower().str.replace("'", '').str.replace(" ", '')
tweets_df.loc[tweets_df['hashtag']=='', 'hashtag'] = 'NO HASHTAG'
tweets_df

### And show the top 20 hashtags in tweets.

In [None]:
ds = tweets_df['hashtag'].value_counts().reset_index()
ds.columns = ['hashtag', 'count']
ds = ds.sort_values(['count'])
fig = px.bar(
    ds.tail(20), 
    x="count", 
    y='hashtag', 
    orientation='h', 
    title='Top 20 hashtags', 
    width=800, 
    height=700
)
fig.show()

### Now we are going to calculate the length of every tweet in the dataset.

In [None]:
df['tweet_length'] = df['text'].str.len()

In [None]:
fig = px.histogram(
    df, 
    x="tweet_length", 
    nbins=80, 
    title='Tweet length distribution', 
    width=800,
    height=700
)
fig.show()

In [None]:
ds = df[df['tweets_count']>=10]
ds = ds.groupby(['user_name', 'tweets_count'])['tweet_length'].mean().reset_index()
ds.columns = ['user_name', 'tweets_count', 'mean_length']
ds = ds.sort_values(['mean_length'])
fig = px.bar(
    ds.tail(40), 
    x="mean_length", 
    y="user_name", 
    color='tweets_count',
    orientation='h', 
    title='Top 40 users with the longest average length of tweet (at least 10 tweets)', 
    width=800, 
    height=800
)
fig.show()

In [None]:
ds = df[df['tweets_count']>=10]
ds = ds.groupby(['user_name', 'tweets_count'])['tweet_length'].mean().reset_index()
ds.columns = ['user_name', 'tweets_count', 'mean_length']
ds = ds.sort_values(['mean_length'])
fig = px.bar(
    ds.head(40), 
    x="mean_length", 
    y="user_name", 
    color='tweets_count',
    orientation='h', 
    title='Top 40 users with the shortest average length of tweet (at least 10 tweets)', 
    width=800, 
    height=800
)
fig.show()

<a id="4"></a>
<h2 style='background:green; border:0; color:white'><center>4. Tweets text analysis</center></h2>

### Here we are going to check the `text` feature of the dataset.
### Let's see a general word cloud for this column.
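
A word cloud is essentially a picture of word frequencies, so the underlying numbers can be inspected directly. The sketch below uses only the standard library; the `STOP` set and the `top_words` helper are hypothetical stand-ins, not part of the `wordcloud` package.

```python
import re
from collections import Counter

# Tiny hypothetical stop-word list; the wordcloud package ships a much longer one.
STOP = {"the", "a", "of", "in", "to", "and", "is", "for"}

def top_words(texts, n=3):
    """Return the n most frequent non-stop words across all texts."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z0-9']+", text.lower())
        counts.update(w for w in words if w not in STOP)
    return counts.most_common(n)

# Made-up tweets for illustration.
tweets = [
    "Covid cases rise in the city",
    "New covid cases reported",
    "Stay safe and wear a mask",
]
print(top_words(tweets, 2))
```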

In [None]:
def build_wordcloud(df, title):
    # Join every entry into one string; str(df) would pass only the truncated Series repr.
    text = ' '.join(df.dropna().astype(str))
    wordcloud = WordCloud(
        background_color='gray', 
        stopwords=set(STOPWORDS), 
        max_words=50, 
        max_font_size=40, 
        random_state=666
    ).generate(text)

    fig = plt.figure(1, figsize=(14,14))
    plt.axis('off')
    fig.suptitle(title, fontsize=16)
    fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()

In [None]:
build_wordcloud(df['text'], 'Prevalent words in tweets for all dataset')

### Let's see word clouds for the top 5 users.

In [None]:
test_df = df[df['user_name']=='GlobalPandemic.NET']
build_wordcloud(test_df['text'], 'Prevalent words in tweets for GlobalPandemic.NET')

In [None]:
test_df = df[df['user_name']=='covidnews.ch']
build_wordcloud(test_df['text'], 'Prevalent words in tweets for covidnews.ch')

In [None]:
test_df = df[df['user_name']=='Open Letters']
build_wordcloud(test_df['text'], 'Prevalent words in tweets for Open Letters')

In [None]:
test_df = df[df['user_name']=='Hindustan Times']
build_wordcloud(test_df['text'], 'Prevalent words in tweets for Hindustan Times')

In [None]:
test_df = df[df['user_name']=='Blood Donors India']
build_wordcloud(test_df['text'], 'Prevalent words in tweets for Blood Donors India')

### Let's also visualize a word cloud for the user descriptions.

In [None]:
build_wordcloud(df['user_description'], 'Prevalent words in user descriptions')

<a id="5"></a>
<h2 style='background:green; border:0; color:white'><center>5. Simple sentiment analysis</center></h2>

### Let's do a simple version of sentiment analysis. We use a TF-IDF vectorizer to extract features and the k-means clustering algorithm to split the data into 2 clusters.

In [None]:
vec = TfidfVectorizer(stop_words="english")
vec.fit(df['text'].values)
features = vec.transform(df['text'].values)

In [None]:
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(features)

In [None]:
res = kmeans.predict(features)
df['Cluster'] = res
df

In [None]:
df[df['Cluster'] == 0].head(20)['text'].tolist()

In [None]:
df[df['Cluster'] == 1].head(20)['text'].tolist()

In [None]:
print('Number of samples for class 0: ', len(df[df['Cluster'] == 0]))
print('Number of samples for class 1: ', len(df[df['Cluster'] == 1]))

In [None]:
build_wordcloud(df[df['Cluster'] == 0]['text'], 'Wordcloud for cluster 0')

In [None]:
build_wordcloud(df[df['Cluster'] == 1]['text'], 'Wordcloud for cluster 1')

## So we can see that cluster 0 appears to contain more or less positive tweets, while cluster 1 appears to contain tweets with information about new cases, reports and regions.

### Let's try more clusters, for example 5.

In [None]:
kmeans = KMeans(n_clusters=5, random_state=0)
kmeans.fit(features)

In [None]:
res = kmeans.predict(features)
df['Cluster5'] = res
df

In [None]:
for i in range(5):
    print('Number of samples for class ' + str(i) + ': ', len(df[df['Cluster5'] == i]))

In [None]:
build_wordcloud(df[df['Cluster5'] == 0]['text'], 'Wordcloud for cluster 0')

In [None]:
build_wordcloud(df[df['Cluster5'] == 1]['text'], 'Wordcloud for cluster 1')

In [None]:
build_wordcloud(df[df['Cluster5'] == 2]['text'], 'Wordcloud for cluster 2')

In [None]:
build_wordcloud(df[df['Cluster5'] == 3]['text'], 'Wordcloud for cluster 3')

In [None]:
build_wordcloud(df[df['Cluster5'] == 4]['text'], 'Wordcloud for cluster 4')

<a id="6"></a>
<h2 style='background:green; border:0; color:white'><center>6. Animation with geographical distribution of tweets</center></h2>

### Here I show how to use a plotly world map to display the geographical distribution of tweets.
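
By default `px.choropleth` reads the `locations` column as ISO-3 country codes, so the free-text locations first have to be mapped to such codes. The sketch below shows the idea with a small hand-written dictionary; `ALPHA3` and `location_to_alpha3` are hypothetical and cover only a few countries, whereas the notebook builds its mapping from the `iso3166` package.

```python
# Hand-written, hypothetical subset of name -> ISO 3166-1 alpha-3 codes.
ALPHA3 = {
    "United States": "USA", "USA": "USA",
    "United Kingdom": "GBR", "UK": "GBR", "England": "GBR",
    "India": "IND", "Canada": "CAN",
}

def location_to_alpha3(user_location):
    """Map a free-text 'City, Country' string to an alpha-3 code, or None."""
    if not isinstance(user_location, str) or "," not in user_location:
        return None
    # Take the second comma-separated part, as the notebook cell does.
    country = user_location.split(",")[1].strip()
    return ALPHA3.get(country)

for loc in ["London, England", "Mumbai, India", "Earth", None]:
    print(loc, "->", location_to_alpha3(loc))
```

Because only the second comma-separated part is used, a value such as "Toronto, Ontario, Canada" is not resolved by this sketch.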

In [None]:
df['location'] = df['user_location'].str.split(',', expand=True)[1].str.lstrip().str.rstrip()
res = df.groupby(['day', 'location'])['text'].count().reset_index()

In [None]:
country_dict = {}
for c in countries:
    country_dict[c.name] = c.alpha3
    
res['alpha3'] = res['location']
res = res.replace({"alpha3": country_dict})

country_list = ['England', 'United States', 'United Kingdom', 'London', 'UK']

res = res[
    (res['alpha3'] == 'USA') | 
    (res['location'].isin(country_list)) | 
    (res['location'] != res['alpha3'])
]

gbr = ['England', 'UK', 'London', 'United Kingdom']
us = ['United States', 'NY', 'CA', 'GA']

res = res[res['location'].notnull()]
res.loc[res['location'].isin(gbr), 'alpha3'] = 'GBR'
res.loc[res['location'].isin(us), 'alpha3'] = 'USA'
res.loc[res['alpha3'] == 'USA', 'location'] = 'USA'
res.loc[res['alpha3'] == 'GBR', 'location'] = 'United Kingdom'
plot = res.groupby(['day', 'location', 'alpha3'])['text'].sum().reset_index()
plot

In [None]:
fig = px.choropleth(
    plot, 
    locations="alpha3",
    hover_name='location',
    color="text",
    animation_frame='day',
    projection="natural earth",
    color_continuous_scale=px.colors.sequential.Plasma,
    title='Tweets from different countries for every day',
    width=800, 
    height=600
)
fig.show()

In [None]:
res = df.groupby(['day', 'location', 'user_name'])['text'].count().reset_index()
res = res[['day', 'location', 'user_name']]
res['alpha3'] = res['location']
res = res.replace({"alpha3": country_dict})

country_list = ['England', 'United States', 'United Kingdom', 'London', 'UK']

res = res[
    (res['alpha3'] == 'USA') | 
    (res['location'].isin(country_list)) | 
    (res['location'] != res['alpha3'])
]

gbr = ['England', 'UK', 'London', 'United Kingdom']
us = ['United States', 'NY', 'CA', 'GA']

res = res[res['location'].notnull()]
res.loc[res['location'].isin(gbr), 'alpha3'] = 'GBR'
res.loc[res['location'].isin(us), 'alpha3'] = 'USA'
res.loc[res['alpha3'] == 'USA', 'location'] = 'USA'
res.loc[res['alpha3'] == 'GBR', 'location'] = 'United Kingdom'
plot = res.groupby(['day', 'location', 'alpha3'])['user_name'].count().reset_index()

In [None]:
fig = px.choropleth(
    plot, 
    locations="alpha3",
    hover_name='location',
    color="user_name",
    animation_frame='day',
    projection="natural earth",
    color_continuous_scale=px.colors.sequential.Plasma,
    title='Numbers of active users for every day',
    width=800, 
    height=600
)
fig.show()