<div class="knitr source"><img src="https://seeklogo.com/images/U/universidad-catolica-san-pablo-ucsp-logo-5309049584-seeklogo.com.gif" align='right' style='position:absolute; top:0; right:0'>
    <h1><p style="text-align:left; font-size: 24px"> Exploratory Data Analysis (EDA) of COVID-19 Tweets</p></h1>
    <h2><p style="text-align:left; font-size: 20px"> Universidad Catolica San Pablo</p></h2>
  
</div>

<img src="https://images.indianexpress.com/2020/04/how-to-use-twitter-amid-covid-19-1.jpg" align = 'center'>
<p><i>Presented by: Joaquín Antonio Castañón Vilca</i></p>


<p style="text-align:justify; font-size: 18px">This notebook presents an Exploratory Data Analysis of tweets published during the COVID-19 pandemic.
The objective is to produce visualizations and extract insights from the existing features of the dataset called "covid19-tweets". I will also try to cluster the words in the tweets and build some geospatial visualizations. All of this is for educational purposes only, as part of a diploma assessment at UCSP. So let's start!</p>

<p style="text-align:justify; font-size: 18px">
You can visit this notebook in Kaggle: <a href='https://www.kaggle.com/jcastanonv/ucsp-covid19-assessment'>https://www.kaggle.com/jcastanonv/ucsp-covid19-assessment</a>
    </p>


<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#3E8FCE;
           font-size:200%;
           font-family:Verdana;
           letter-spacing:1px">

<p style="padding: 40px;
              color:white;
          text-align:center"> Navigation
    <i class="fa fa-search icon"></i>
 
</p>
</div>

<p style="text-align:justify; font-size: 18px">
    <ul style="text-align:justify; font-size: 18px">
        <li>Dataset Overview</li>
        <li>Data Visualization</li>
        <li>Text analysis of tweets</li>
        <li>Clustering of Sentiment</li>
        <li>Geospatial Visualization</li>
</ul>
    </p>
    

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#3E8FCE;
           font-size:200%;
           font-family:Verdana;
           letter-spacing:1px">


<p style="padding: 40px;
              color:white;
          text-align:center"> 
    Dataset Overview
    <i class="fa fa-database icon"></i>
   
 
</p>
</div>

In [None]:
# Import the libraries required to read the CSV file and modify the dataset

import numpy as np
import pandas as pd
import plotly.express as px
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
from matplotlib import rcParams
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from iso3166 import countries
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot


rcParams['figure.figsize'] = 10,7.5
rcParams['figure.dpi'] = 80



<p style="text-align:justify; font-size: 18px">
    In this section we start by analyzing the numerical data. Let's take a quick look at the dataset we will be working with.
</p>

In [None]:
df = pd.read_csv('/kaggle/input/covid19-tweets/covid19_tweets.csv')
df.head(5)

In [None]:
df.info()

<p style="text-align:justify; font-size: 18px">
    As we saw, there are some NaN values, so let's see how many there are.
    </p>
    

In [None]:
# Let's view the percentage of NaN in each column

miss_nan = pd.DataFrame()
miss_nan['column'] = df.columns

miss_nan['percent'] = [round(100* df[col].isnull().sum()/len(df), 2) for col in df.columns]
miss_nan = miss_nan.sort_values('percent', ascending = True)
miss_nan = miss_nan[miss_nan['percent']>0]


# Pass x and y as keywords: recent seaborn versions no longer accept them positionally
sns.barplot(x = 'percent', y = 'column', data = miss_nan, palette = 'Blues')
plt.show()

<p style="text-align:justify; font-size: 18px">
    As we can see, the "hashtags" column has almost 30% NaN values, and "user_location" and "user_description" also have missing values. This may be because some users did not include hashtags in their posts, and some have incomplete profiles, perhaps because they do not use Twitter very often. The "source" column, however, has almost no NaN values.
    </p>
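
<p style="text-align:justify; font-size: 18px">
    As a compact alternative for computing the same percentages, the sketch below uses <code>isna().mean()</code> on a small made-up table. The <code>toy</code> DataFrame and its values are assumptions for illustration only, not rows from the real dataset.
    </p>

```python
import numpy as np
import pandas as pd

# Toy stand-in for the tweets table (values are made up for illustration)
toy = pd.DataFrame({
    "user_location": ["Lima", np.nan, "Delhi", np.nan],
    "hashtags": [np.nan, "['COVID19']", np.nan, np.nan],
    "source": ["Twitter Web App"] * 4,
})

# isna() gives a boolean table; the column-wise mean is the fraction of NaN values
missing_pct = (toy.isna().mean() * 100).round(2)

# Keep only the columns that actually have missing values, largest first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

<p style="text-align:justify; font-size: 18px">
    The resulting Series can be plotted directly, for example with <code>missing_pct.plot.barh()</code>.
    </p>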

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#3E8FCE;
           font-size:200%;
           font-family:Verdana;
           letter-spacing:1px">

<p style="padding: 40px;
              color:white;
          text-align:center"> Data Visualization
    <i class="fa fa-bar-chart icon"></i>
 
</p>
</div>

<p style="text-align:justify; font-size: 18px">
    Let's look at the users ranked by number of tweets.
    </p>
    

In [None]:
number_tweets = df['user_name'].value_counts().reset_index()
number_tweets.columns = ['user_name', 'tweets']


sns.barplot(x = "tweets", y = "user_name", data = number_tweets.head(30), palette = 'Blues_r')
plt.show()

<p style="text-align:justify; font-size: 18px">
This shows how many tweets about COVID-19 each user posted, but it does not mean that these users have a large number of followers. So next we extract the accounts with the most followers and count how many tweets about COVID-19 they have posted.
    </p>
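
<p style="text-align:justify; font-size: 18px">
    One possible way to get followers and tweet counts per account in a single step is a <code>groupby</code> with named aggregations. This is only a sketch on made-up data; the <code>toy</code> table and the <code>per_user</code> name are assumptions, and only the column names mirror the dataset.
    </p>

```python
import pandas as pd

# Toy stand-in for the tweets table (one row per tweet, values are made up)
toy = pd.DataFrame({
    "user_name": ["A", "A", "B", "C", "C", "C"],
    "user_followers": [100, 120, 5000, 10, 10, 12],
    "text": ["t1", "t2", "t3", "t4", "t5", "t6"],
})

# For each account: the largest follower count seen and the number of tweets
per_user = (
    toy.groupby("user_name")
       .agg(followers=("user_followers", "max"), tweets=("text", "count"))
       .sort_values("followers", ascending=False)
       .reset_index()
)
print(per_user)
```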

In [None]:
top_users = df.sort_values('user_followers', ascending =  False).drop_duplicates(subset = 'user_name', keep = 'first')
top_users = top_users[['user_name', 'user_followers']]
# Join on the account name to attach each account's tweet count
top_users = pd.merge(top_users, number_tweets, on = 'user_name', how = 'inner')


#Normalize the scale to make the color bar on the right of the bar plot
norm = plt.Normalize(top_users['tweets'].min(), top_users['tweets'].max())
sm = plt.cm.ScalarMappable(cmap="Blues_r", norm=norm)
sm.set_array([])
#Show the barplot with color bar
ax = sns.barplot(x="user_followers", y = "user_name", data = top_users.head(20), hue = 'tweets', dodge = False, palette = 'Blues_r')
ax.get_legend().remove()
ax.figure.colorbar(sm)
plt.show()


<p style="text-align:justify; font-size: 18px">
    As we can see, accounts such as CNN and National Geographic have a large number of followers but have not published much about COVID-19. In contrast, China Xinhua News, CGTN, and Hindustan Times, which are Asian accounts, posted many tweets on this topic even though they have far fewer followers than CNN and others. Later on, we will look at the geospatial information.
    </p>
    <p style="text-align:justify; font-size: 18px">
Now let's look at the main "source", that is, the app or device used to publish the tweets.
    </p>
    

In [None]:
device = df['source'].value_counts().reset_index()
device.columns = ['source', 'count']
device['percent_tweets'] = round(device['count']/device['count'].sum()*100, 2)


sns.barplot(x = "percent_tweets", y = "source", data = device.head(30), palette = 'Blues_r')
plt.show()

<p style="text-align:justify; font-size: 18px">
As this graph shows, the sources most used for tweeting were "Twitter Web App", "Twitter for Android", and "Twitter for iPhone". These three sources account for approximately 74% of all tweets posted and correspond to the most traditional ways of using Twitter. Other sources include "Blood Donors India" and "Zoho Social".
    </p>
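
<p style="text-align:justify; font-size: 18px">
    A combined share such as the one quoted for the top three sources can be checked with <code>value_counts(normalize=True)</code>. The sketch below runs on a made-up Series; the counts and the "Hypothetical Scheduler" source are assumptions, so the number it produces is not the real 74%.
    </p>

```python
import pandas as pd

# Made-up source labels; "Hypothetical Scheduler" is not a real client name
sources = pd.Series(
    ["Twitter Web App"] * 4
    + ["Twitter for Android"] * 3
    + ["Twitter for iPhone"] * 2
    + ["Hypothetical Scheduler"] * 1,
    name="source",
)

# normalize=True returns fractions instead of raw counts, sorted in descending order
share = sources.value_counts(normalize=True) * 100

# Combined percentage of the three most frequent sources
top3_share = share.head(3).sum()
print(share.round(2))
print(round(top3_share, 2))
```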
<p style="text-align:justify; font-size: 18px">
    Now let's check whether the number of new users increased this year.
    </p>


In [None]:
df['user_created'] = pd.to_datetime(df['user_created'])
new_users = df[['user_created', 'user_name']].drop_duplicates(subset = 'user_name', keep = 'first')
new_users['user_created']= new_users['user_created'].dt.year
count_year = new_users['user_created'].value_counts().reset_index()
count_year.columns = ['year', 'number']
count_year

#sns.lineplot(x = 'year', y = 'number', data = count_year)
# A first look shows that some accounts have a creation year of 1970, which is obviously not real,
# so keep only the rows with a plausible year
count_year = count_year[count_year['year'] > 1990]
sns.lineplot(x = 'year', y = 'number', data = count_year, marker = 'o')
plt.xlabel('Year')
plt.ylabel('Number of New Users')
plt.show()


<p style="text-align:justify; font-size: 18px">
    As this graph shows, the number of new accounts increased this year. Between 2009 and 2020 there is a valley of approximately 10 years in which fewer accounts were created than in 2009; in fact, the number of new accounts per year decreased during that period. The sudden increase this year may be due to the pandemic and the lockdowns in many countries around the world.
    </p>
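
<p style="text-align:justify; font-size: 18px">
    The per-year count can be sketched end to end on a few made-up accounts: parse the creation date, keep one row per account, drop implausible years, and count. The <code>toy</code> table is an assumption for illustration; the 1990 cutoff is the same one used in the cell above.
    </p>

```python
import pandas as pd

# Made-up rows: account A tweeted twice, account B has a bogus 1970 creation date
toy = pd.DataFrame({
    "user_name": ["A", "A", "B", "C", "D"],
    "user_created": [
        "2009-03-01 10:00:00", "2009-03-01 10:00:00",
        "1970-01-01 00:00:00", "2020-04-15 08:30:00",
        "2020-05-02 12:00:00",
    ],
})
toy["user_created"] = pd.to_datetime(toy["user_created"])

# One row per account, so an account with many tweets is counted once
accounts = toy.drop_duplicates(subset="user_name")

# Extract the year and discard implausible values such as 1970
years = accounts["user_created"].dt.year
years = years[years > 1990]

# Number of accounts created per year, in chronological order
per_year = years.value_counts().sort_index()
print(per_year)
```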
<p style="text-align:justify; font-size: 18px">
    In the next section, we will look for insights in the text data.
    </p>

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#3E8FCE;
           font-size:200%;
           font-family:Verdana;
           letter-spacing:1px">

<p style="padding: 40px;
              color:white;
          text-align:center"> Text analysis of tweets
    <i class="fa fa-newspaper-o icon"></i>
 
</p>
</div>

<p style="text-align:justify; font-size: 18px">
    Now we analyze the hashtags used in tweets during this pandemic. As we saw in the previous section, the "hashtags" column has many NaN values, so the first step is to replace these NaN values with an empty list; then we can extract some insights and create a word cloud.
    </p>
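
<p style="text-align:justify; font-size: 18px">
    If the "hashtags" column stores each list as a string such as <code>"['COVID19', 'WearAMask']"</code> (an assumption about the file format), one alternative to splitting on commas is to parse it with <code>ast.literal_eval</code>. The sketch below uses made-up rows, and <code>parse_hashtags</code> is a hypothetical helper.
    </p>

```python
import ast

import numpy as np
import pandas as pd

# Made-up rows; the middle tweet has no hashtags
toy = pd.DataFrame({"hashtags": ["['COVID19', 'WearAMask']", np.nan, "['lockdown']"]})


def parse_hashtags(value):
    """Turn a stringified list into a real list; missing values become []."""
    if pd.isna(value):
        return []
    return ast.literal_eval(value)


toy["hashtag_list"] = toy["hashtags"].apply(parse_hashtags)

# .str.len() also works on lists, so empty lists naturally count as 0
toy["hashtags_count"] = toy["hashtag_list"].str.len()
print(toy[["hashtag_list", "hashtags_count"]])
```

<p style="text-align:justify; font-size: 18px">
    With real lists in the column, tweets without hashtags get a count of 0 without a separate correction step.
    </p>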

In [None]:
df['hashtags'] = df['hashtags'].fillna('[]')
df['hashtags_count'] = df['hashtags'].apply(lambda x: len(x.split(',')))
df.loc[df['hashtags'] == '[]', 'hashtags_count'] = 0

# let's see the number of hashtags used by users
hashtag_per_user = df[['user_name','hashtags_count']].sort_values('hashtags_count', ascending =  False).drop_duplicates(subset = 'user_name', keep = 'first')
sns.barplot(x="hashtags_count", y = "user_name", data = hashtag_per_user.head(30), palette = 'Blues_r')
plt.show()

In [None]:
hashtag_per_user.describe()

<p style="text-align:justify; font-size: 18px">
    As this graph shows, the user "ROCAS THE PURPLEKING" used 17 hashtags in a single tweet. On the other hand, 75% of users use at most 2 hashtags per tweet.
    </p>
<p style="text-align:justify; font-size: 18px">
        Now let's see which hashtags users include most often in their tweets.
    </p>


In [None]:
def hashtags_split(x):
    return str(x).lower().replace('[','').replace(']','').replace("'",'').replace(" ", '').split(',')

hashtag_tweets = df.copy()
hashtag_tweets['hashtag'] = hashtag_tweets['hashtags'].apply(lambda row: hashtags_split(row))
hashtag_tweets = hashtag_tweets.explode('hashtag')
hashtag_tweets.loc[hashtag_tweets['hashtag'] == '', 'hashtag'] = 'No Hashtag'
hashtag_tweets.head()

In [None]:
hashtag_number = hashtag_tweets['hashtag'].value_counts().reset_index()
hashtag_number.columns = ['hashtag', 'count']

sns.barplot(x="count", y = "hashtag", data = hashtag_number.head(10), palette = 'Blues_r')
plt.show()

<p style="text-align:justify; font-size: 18px">
    Now let's make a word cloud of the main words and topics in the tweets.
    </p>
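
<p style="text-align:justify; font-size: 18px">
    Before drawing a word cloud, it can help to check the raw word frequencies. The sketch below removes links with a regular expression instead of listing URL fragments as stop words, then counts the remaining words. The tweets, the tiny stop-word set, and the <code>tokenize</code> helper are all made up for illustration.
    </p>

```python
import re
from collections import Counter

# Made-up tweets for illustration
tweets = [
    "Wear a mask! https://t.co/abc123",
    "Mask up, people. Stay safe #COVID19",
    "COVID19 cases rising https://t.co/xyz",
]

# Tiny illustrative stop-word list (a real analysis would use a fuller one)
stop = {"a", "up", "the", "and"}


def tokenize(text):
    """Lower-case the text, drop links, and return the remaining words."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return [w for w in re.findall(r"[a-z0-9]+", text) if w not in stop]


freq = Counter(word for tweet in tweets for word in tokenize(tweet))
print(freq.most_common(5))
```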

In [None]:
# Join with a space so the last word of one tweet does not merge with the first word of the next
text = " ".join(tweet for tweet in df['text'])
stopwords = set(STOPWORDS)
stopwords.update(['https', 't','co', 'many', 's'])

wordcloud = WordCloud(stopwords=stopwords, background_color='white').generate(text)

plt.imshow(wordcloud)
plt.axis('off')
plt.title('Prevalent words for all tweets')
plt.show()

In [None]:
# In this part we use an image as the mask for the word cloud and take the colors from that image
import os
from PIL import Image
from scipy.ndimage import gaussian_gradient_magnitude
from wordcloud import ImageColorGenerator

d = os.path.dirname(__file__) if "__file__" in locals() else os.getcwd()
# load image. This has been modified in gimp to be brighter and have more saturation.
covid_color = np.array(Image.open(os.path.join(d, "../input/images/2019-nCoV-CDC-23312.jpg")))
# subsample by factor of 3. Very lossy but for a wordcloud we don't really care.
covid_color = covid_color[::3, ::3]

# create mask  white is "masked out"
covid_mask = covid_color.copy()
covid_mask[covid_mask.sum(axis=2) == 0] = 255

edges = np.mean([gaussian_gradient_magnitude(covid_color[:, :, i] / 255., 2) for i in range(3)], axis=0)
covid_mask[edges > .08] = 255
hashtag = " ".join(hashtag for hashtag in hashtag_tweets['hashtag'])
stopwords = set(STOPWORDS)
stopwords.update(['No', 'Hashtag'])

wc = WordCloud(max_words=2000, mask=covid_mask, max_font_size=40, random_state=42, relative_scaling=0, stopwords=stopwords, background_color='white')

# generate word cloud
wc.generate(hashtag)



# create coloring from image
image_colors = ImageColorGenerator(covid_color)
wc.recolor(color_func=image_colors)
plt.figure(figsize=(10, 10))
plt.imshow(wc, interpolation="bilinear")
plt.axis('off')
plt.show()

<p style="text-align:justify; font-size: 18px">
    As we can see, the most prevalent word across all tweets is "COVID19", followed by words like "mask", "help", "people" and "pandemic".
    </p>
<p style="text-align:justify; font-size: 18px">
    On the other hand, the hashtags follow the same topics as the text, with "Covid19" and "coronavirus" as the prevalent words. They are followed by "wearmask", probably because of the mask-wearing policies that almost every country has adopted for prevention, and by "coronovirusupdate", which may be used to report updated numbers of positive cases and deaths. We can also see hashtags like "healthcare", "trump", "lockdown" and "socialdistancing".
    </p>

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#3E8FCE;
           font-size:200%;
           font-family:Verdana;
           letter-spacing:1px">

<p style="padding: 40px;
              color:white;
          text-align:center"> Clustering of Sentiment
    <i class="fa fa-smile-o icon"></i>
 
</p>
</div>

<p style="text-align:justify; font-size: 18px">
    Now, let's see how we can cluster the tweet text into two different groups, or sentiment groups.
    </p>

In [None]:
vec = TfidfVectorizer(stop_words = 'english')
vec.fit(df['text'].values)
features = vec.transform(df['text'].values)

In [None]:
kmeans = KMeans(n_clusters = 2, random_state = 0)
kmeans.fit(features)

In [None]:
res = kmeans.predict(features)
df['Cluster'] = res


In [None]:
text_cluster_1 = " ".join(tweet for tweet in df[df['Cluster'] == 0]['text'])
stopwords = set(STOPWORDS)
stopwords.update(['https', 't','co', 'many', 's'])

wordcloud_1 = WordCloud(max_words = 100, stopwords=stopwords, background_color='white').generate(text_cluster_1)

plt.imshow(wordcloud_1)
plt.axis('off')
plt.title('Group of words for the cluster Nº0')
plt.show()

In [None]:
text_cluster_2 = " ".join(tweet for tweet in df[df['Cluster'] == 1]['text'])
stopwords = set(STOPWORDS)
stopwords.update(['https', 't','co', 'many', 's'])

wordcloud_2 = WordCloud(max_words = 100, stopwords=stopwords, background_color='white').generate(text_cluster_2)

plt.imshow(wordcloud_2)
plt.axis('off')
plt.title('Group of words for the cluster Nº1')
plt.show()

<p style="text-align:justify; font-size: 18px">
    In these two graphs, we can distinguish two sentiment groups. Cluster 0 appears to be more negative, because it groups the texts about people who tested positive for COVID-19. The second group contains more neutral information and talks about prevention, so we can say it is more optimistic.
    </p>


<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#3E8FCE;
           font-size:200%;
           font-family:Verdana;
           letter-spacing:1px">

<p style="padding: 40px;
              color:white;
          text-align:center"> Geospatial Visualization
    <i class="fa fa-rocket icon"></i>
 
</p>
</div>

<p style="text-align:justify; font-size: 18px">
    Finally, we will first look at the distribution of tweets about COVID-19 around the world, and then display this information on a geospatial graph.
    </p>

In [None]:
location = df['user_location'].value_counts().reset_index()
location.columns = ['user_location', 'count']
location = location[location['user_location'] != 'NA']
location = location.sort_values(['count'], ascending = False)

sns.barplot(x="count", y = "user_location", data = location.head(30), palette = 'Blues_r')
plt.show()

<p style="text-align:justify; font-size: 18px">
    As we can see, India and the United States contributed more tweets than other countries.
    </p>
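
<p style="text-align:justify; font-size: 18px">
    Because "user_location" is free text, values such as "India" and "New Delhi, India" are counted as separate locations. One possible way to group them by country is a small lookup table, sketched below. The <code>ALIASES</code> table, the <code>to_country</code> helper, and the sample locations are hypothetical and far from complete.
    </p>

```python
import pandas as pd

# Hypothetical alias table: lower-cased location fragment -> country name
ALIASES = {
    "india": "India",
    "new delhi": "India",
    "mumbai": "India",
    "united states": "United States",
    "usa": "United States",
    "new york": "United States",
    "london": "United Kingdom",
}


def to_country(location):
    """Return a country for a free-text location, or None if nothing matches."""
    parts = [part.strip().lower() for part in str(location).split(",")]
    # Check the last fragment first, since it is often the country
    for part in reversed(parts):
        if part in ALIASES:
            return ALIASES[part]
    return None


# Made-up locations for illustration
locations = pd.Series(["New Delhi, India", "USA", "Mumbai", "London, England", "Moon"])
countries = locations.map(to_country)

# value_counts() skips the unmatched (missing) entries by default
counts = countries.value_counts()
print(counts)
```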

In [None]:
!pip install geopandas


In [None]:
import pycountry

def alpha3code(column):
    CODE = []
    for country in column:
        try:
            code = pycountry.countries.get(name=country).alpha_3
            CODE.append(code)
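        except (AttributeError, LookupError):
            # No exact country-name match: get() returned None or raised a lookup error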
            CODE.append('None')
    return CODE


location['CODE'] = alpha3code(location['user_location'])
location.head()

In [None]:
import geopandas
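# Note: geopandas.datasets was deprecated in geopandas 0.14 and removed in 1.0,
# so this line requires an older geopandas version
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))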
world.columns = ['pop_est', 'continent', 'name', 'CODE', 'gdp_md_est', 'geometry']
world_merge = pd.merge(world, location, on='CODE')

location_merge = pd.read_csv('../input/latlong/countries_latitude_longitude.csv')
world_merge = world_merge.merge(location_merge, on='name').sort_values(by='count', ascending=False).reset_index()
world_merge = world_merge[['user_location', 'count', 'latitude','longitude']]
world_merge.head()

In [None]:
import folium
from folium import plugins
from folium.plugins import HeatMap

folium_map = folium.Map(location=[50,0],
                       zoom_start=3,
                       tiles='CartoDB dark_matter')

# Drop rows without coordinates instead of filling them with 0, which would place markers at (0, 0)
world_merge = world_merge.dropna(subset = ['latitude', 'longitude'])

plugins.FastMarkerCluster(data=list(zip(world_merge['latitude'].values, world_merge['longitude'].values))).add_to(folium_map)
arr = world_merge[['latitude', 'longitude']].values

HeatMap(arr, radius = 15).add_to(folium_map)

folium.LayerControl().add_to(folium_map)
folium_map

<p style="text-align:justify; font-size: 18px">
    We have extracted some insights from the information collected in tweets about COVID-19. The next steps will be to add some machine learning predictions, tune the clustering algorithm, and finally improve the extraction of latitude and longitude for each city or country.
    </p>
    
<p style="text-align:justify; font-size: 18px">
    See You Soon!! :)
    </p>

    
<p style="text-align:justify; font-size: 18px">
    References:
        Kostiantyn Isaienkov (2020). Notebook
    </p>