# Introduction

This project explores the coronavirus-related Tweets dataset created by Shane Smith. The dataset contains 20 files crawled from the Internet: 18 of them hold Tweet information and 2 of them hold location and hashtag information. The main idea of this project is to find the general trends in coronavirus-related Tweets and to group the Tweets based on their text. The contents are:

* Data wrangling (data cleaning, data imputation)
* Data visualization (barplots, boxplots, pieplots, correlation heatmap)
* Information extraction (tf-idf method)
* Text classification (K-means)

## Importing Library

Import all the packages we need for this project.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

## Importing Data

This dataset contains 20 files, including Tweet information for 18 different days, country information and hashtag information. When loading data for this project, I skip some of the files for the following reasons.
* I discard the Tweet information before March 12 because its format is not consistent with the Tweet information on other days.
* I skip the file "Countries" because most of the location information in this dataset is missing (please see the data wrangling part for details), so it doesn't make much sense to load this file.
* The file "Hashtags" is also not loaded, since it only records one of multiple potential hashtags and the criterion used to filter the hashtags is not consistent across dates.

For details of the dataset, please see the dataset creator's description in the discussion section.
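The filtering rules above can be sketched in a few lines. This is only an illustration: the file names below are assumed for the example and may not match the real files exactly, and `file_head` and `SKIPPED` are hypothetical helper names. The idea is that the text before the first space and the first dot identifies either a date or one of the auxiliary files.

```python
def file_head(filename):
    """Return the text before the first space and before the first dot."""
    return filename.split()[0].split(".")[0]

# Heads of the files that are skipped (pre-March-12 Tweets, hashtags, countries)
SKIPPED = {"2020-03-00", "Hashtags", "Countries"}

# Hypothetical file names, assumed for illustration only
names = [
    "2020-03-12 Coronavirus Tweets.CSV",
    "2020-03-00 Coronavirus Tweets.CSV",
    "Countries.CSV",
    "Hashtags.CSV",
]

# Keep only the heads that are not in the skip list
loaded = [file_head(name) for name in names if file_head(name) not in SKIPPED]
print(loaded)
```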

In [None]:
# this list is used to store the date of each data source as a string
Tweet_Date = []
# this dictionary is used to store all data frames of tweet information
Tweet_Dict = {}

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
print("--------------------Start loading data--------------------")
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        # extract file path
        filepath = os.path.join(dirname, filename)
        print(filepath, end="")
        # append file name head as tweet date
        head = filename.split()[0]
        head = head.split(".")[0]
        # filter out useless files
        if head not in ["2020-03-00", "Hashtags", "Countries"]:
            print(" -----> loading")
            Tweet_Date.append(head)
            # read csv file and store in dict
            Tweet_Dict[head] = pd.read_csv(filepath)
        else:
            print()
print("--------------------Finish loading data--------------------")
# sort the list
Tweet_Date = sorted(Tweet_Date)

# Data Wrangling

This section aims to clean the data and get rid of unnecessary data.

First, let's compute the percentage of missing values in the dataset. The matrix below shows that "country_code", "place_full_name", "place_type" and "account_lang" have too many missing values. The columns related to "reply" also have a high missing percentage, but that is expected, since there is no reply information when a Tweet is not a reply. Note that a small amount of data is also missing from the "source" column.

After deleting the fields with a high missing ratio, we also drop the columns containing id, reply or account creation time information, since those are not the focus of this project.

In [None]:
# calculate null value percentage
statistics = pd.concat([Tweet_Dict[date].isnull().mean().to_frame(name=date) for date in Tweet_Date], axis=1)
statistics

In [None]:
# drop columns that have too many missing values and useless columns
for date in Tweet_Date:
    Tweet_Dict[date].drop(labels=["country_code", "place_full_name", "place_type", "account_lang", # too many missing values
                                 "status_id", "user_id", "screen_name",  # useless
                                 "reply_to_status_id", "reply_to_user_id", "reply_to_screen_name", # useless
                                 "account_created_at"], axis=1, inplace=True)

After dropping some columns the dataset seems to be much cleaner and luckily we are able to concatenate all information to one dataframe. Please try using GPU / TPU if you have trouble fitting all the data in your RAM.

In [None]:
# concatenate all the data to one dataframe
df = pd.concat([Tweet_Dict[date] for date in Tweet_Date], ignore_index=True, sort=False)
print("Dataframe shape: ", df.shape)

The other important thing to do is to drop any duplicated rows. Even though we have already dropped all the id columns, the "text" and "created_at" columns together make it very unlikely that two different Tweets are treated as the same row.

In [None]:
# drop duplicate rows
df.drop_duplicates(inplace=True)
print("Dataframe shape: ", df.shape)

Recall that a few values are missing for the feature "source". Let's fill them in with the mode (most frequent value) of the "source" column.

In [None]:
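# impute null values with the most frequent source
# (assign back instead of using inplace on a column selection, which newer pandas versions warn about)
df["source"] = df["source"].fillna(df["source"].mode()[0])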

We have now finished all data wrangling procedures and have relatively clean data. It's time to take a look at the overall information of our dataframe and the statistics of the numeric data.

In [None]:
# present statistics
df.info()

In [None]:
# present description
df.describe()

# EDA

This section is to perform exploratory data analysis. Hopefully we can get some intuition about the corona Tweet dataset in this section.

Since the number of confirmed cases and deaths was continuously increasing, you may wonder whether the number of related Tweets also increased during March. The plot below shows the number of related Tweets per day from March 12 to March 28. It doesn't increase as we might expect. Instead, the number oscillates around 700 thousand.

One interesting thing is that the number of Tweets related to the coronavirus is much higher than normal on March 13. Possible reasons are the suspension of the NBA season and Trump's announcement of the Europe travel ban around March 13.

In [None]:
# list to store number of Tweets
num_tweet = []

# calculate number of tweets on each day
for date in Tweet_Date:
    num_tweet.append(Tweet_Dict[date].shape[0])

# plot
plt.figure(figsize=(12, 5))
plt.bar(Tweet_Date, num_tweet, color="lightcoral")
plt.xticks(rotation=90)
plt.xlabel('Date')
plt.ylabel("Count")
plt.title('Number of Tweets per Day')
plt.show()

Since we have only four numeric features, including "favourites_count", "retweet_count", "followers_count" and "friends_count", let's draw the boxplots to see how they are distributed.

It can be observed from the boxplots that the median of each of these four features is much closer to the first quartile and the bottom whisker than to the third quartile and the top whisker. This indicates, at least according to this dataset, that the Tweet world is like a pyramid, where top accounts get much more attention (favourites, retweets, followers and friends) than normal accounts.

You may be confused about why the maximum retweet count appears to be just two. Note that I hide the outliers in all boxplots, because the top outliers are so high that they would make the plots unreadable.
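To make the whisker behaviour concrete, here is a small sketch with made-up numbers (they are not taken from the dataset). It assumes the usual boxplot convention, which is also matplotlib's default, that the top whisker stops at the largest value within 1.5 times the interquartile range above the third quartile, and that everything beyond is an outlier hidden by `showfliers=False`.

```python
from statistics import mean, quantiles

# Hypothetical retweet counts: mostly tiny, plus one viral Tweet
retweets = [0, 0, 0, 0, 0, 0, 1, 1, 2, 5000]

# "inclusive" quartiles interpolate linearly between data points
q1, median, q3 = quantiles(retweets, n=4, method="inclusive")
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# The upper whisker stops at the largest value inside the fence
upper_whisker = max(x for x in retweets if x <= upper_fence)
hidden_outliers = [x for x in retweets if x > upper_fence]

print("median:", median, "mean:", mean(retweets))
print("upper whisker:", upper_whisker, "hidden outliers:", hidden_outliers)
```

In this toy sample the whisker ends at 2 even though the true maximum is 5000, and the single outlier pulls the mean far above the median, which is the same pattern the boxplots suggest.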

In [None]:
# the four numeric features to plot
numeric_cols = ["favourites_count", "retweet_count", "followers_count", "friends_count"]
# one subplot per feature, outliers hidden to keep the boxes readable
fig, axes = plt.subplots(1, 4, figsize=(14, 5))
for ax, col in zip(axes, numeric_cols):
    df.boxplot(column=col, rot=0, showfliers=False, ax=ax)
plt.tight_layout()
plt.show()

After seeing the distribution of the numeric values, let's figure out whether there is a relationship between those features. Below is the correlation matrix, which indicates the linear relationship between numeric features (+1 indicates perfect positive correlation, -1 indicates perfect negative correlation, 0 indicates no linear association). It turns out that there is almost no correlation between any two of the numeric features.
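For reference, the coefficient in each cell is the Pearson correlation, which is the default method of pandas' `corr()`. The following is a minimal hand-written sketch of that formula on made-up numbers; `pearson` is a hypothetical helper written for this illustration, not part of any library.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sums of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical follower counts compared with three other made-up series
followers = [10, 20, 30, 40, 50]
print(pearson(followers, [1, 2, 3, 4, 5]))  # perfectly increasing together
print(pearson(followers, [5, 4, 3, 2, 1]))  # perfectly opposed
print(pearson(followers, [1, 3, 2, 3, 1]))  # no linear trend
```

Keep in mind that a coefficient near 0 only rules out a linear relationship; two features can still be related in a non-linear way.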

In [None]:
# configure plot size
plt.figure(figsize=(10, 6))
# extract numeric columns
df_corr = df[["favourites_count", "retweet_count", "followers_count", "friends_count"]]
# generate correlation matrix
corrMatrix = df_corr.corr()
# plot heatmap
sns.heatmap(corrMatrix, annot=True)
plt.show()

A word cloud is a very intuitive tool for exploring text data. The word cloud below presents the 50 most popular words among all English Tweets. It can be observed that "COVID19", "coronavirus" and "outbreak" are among the most popular words.

One interesting fact is that "https" is also a high-frequency word, which is due to the links that people commonly attach in their Tweets.

In [None]:
# join all English Tweets into one string
# (str() on a large array abbreviates it with "...", so join the values instead)
text = " ".join(
    " ".join(Tweet_Dict[date].loc[Tweet_Dict[date]["lang"] == "en", "text"].astype(str))
    for date in Tweet_Date
)
# generate word cloud
wordcloud = WordCloud(max_font_size=50, max_words=50, background_color="black", collocations=False).generate(text)
# plot wordcloud
plt.figure(figsize=(18, 8))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

Next, the pie plots show the distribution of the boolean features "is_quote" and "is_retweet". It can be seen that only a small part of all Tweets are quotes of another Tweet.

However, things look weird for the feature "is_retweet", since all of the values are false. According to the dataset creator, the retweet argument was set to false, so this dataset doesn't include retweets. The Tweets we have seen so far are original Tweets, some of which were retweeted by others; the retweets themselves were filtered out.

In [None]:
# configure plot size
plt.figure(figsize=(14, 5))
# plot for is_quote
plt.subplot(1,2,1)
plt.pie([df["is_quote"].mean(), 1-df["is_quote"].mean()], labels=['True', 'False'], shadow=False, startangle=140)
plt.title("is_quote")
# plot for is_retweet
plt.subplot(1,2,2)
plt.pie([df["is_retweet"].mean(), 1-df["is_retweet"].mean()], labels=['True', 'False'], shadow=False, startangle=140)
plt.title("is_retweet")
plt.show()

Finally, let's take a look at the top 10 Tweet sources and languages.

Just for your reference, the top 10 languages are English, Spanish, French, Undetermined Language, Italian, Turkish, Portuguese, German, Hindi and Indian. English dominates this dataset.

In [None]:
# count top 10 sources
df["source"].value_counts().head(10)

In [None]:
# count top 10 languages
df["lang"].value_counts().head(10)

# Clustering

This section focuses on unsupervised text classification (clustering) using the "text" feature in the dataframe. This project only considers English Tweets for convenience. To better understand the overall trend of the Tweets and to reduce the size of the dataset, we only analyze Tweets whose "favourites_count" is at least 10 times the average.

The first step is to extract the Tweets we want to use.

In [None]:
# keep only English tweets whose favourites_count is at least 10 times the mean
df_en = df[(df["lang"]=="en") & (df["favourites_count"]>=df["favourites_count"].mean()*10)]
print("Number of English Tweets with at least 10x the average favourites: ", df_en.shape[0])

To apply an unsupervised machine learning algorithm, we have to transform our text data into numeric arrays. The method we use here is tf-idf, which stands for term frequency-inverse document frequency and is commonly used in information retrieval and text mining. Here is the general idea for calculating tf-idf weights (implementations sometimes normalize the result or add 1 to the denominator of idf).

*tf(t) = (Number of times term t appears in a document) / (Total number of terms in the document)*

*idf(t) = log_e(Total number of documents / Number of documents with term t in it)*

tf (term frequency) measures how often the term appears in a document, while idf measures how rare the term is across documents. The product of tf and idf is the weight of the word. Here we transform the feature column into the tf-idf weight matrix. Note that in practice the weight matrix can be extremely large, so here we only consider the top max_features terms ordered by term frequency across the corpus.
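As a worked example of the two formulas above, here is a minimal sketch on three made-up "Tweets". It follows the plain definitions given here; scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2 normalization by default, so its numbers will differ. The helper names `tf`, `idf` and `tf_idf` are chosen for this illustration only.

```python
import math
from collections import Counter

# Hypothetical mini corpus, one string per document
docs = [
    "stay home stay safe",
    "wash hands stay safe",
    "travel ban announced",
]
tokenized = [doc.split() for doc in docs]

def tf(term, tokens):
    """Share of the document's terms that are `term`."""
    return Counter(tokens)[term] / len(tokens)

def idf(term, corpus):
    """Natural log of (number of documents / documents containing `term`)."""
    n_containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

# Weights of three terms in the first document
for term in ["stay", "home", "safe"]:
    print(term, round(tf_idf(term, tokenized[0], tokenized), 4))
```

In the first document "home" and "safe" appear equally often, but "home" receives the higher weight because it occurs in only one of the three documents.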

In [None]:
# vectorize the text content
vectorizer = TfidfVectorizer(max_features=20000, stop_words='english')
X = vectorizer.fit_transform(df_en["text"])

In [None]:
print("shape of tf-idf weight matrix: ", X.shape)

To implement unsupervised text classification, we will use the K-means algorithm. To choose the number of clusters k, we plot the sum of squared (Euclidean) distances between each point and its cluster centroid for different values of k and look for the elbow (elbow method).

The normal batch K-means method is very time-consuming. We will instead use the Mini batch K-means algorithm to find the optimal number of clusters. Empirical results suggest that Mini batch K-means can save a substantial amount of computation time at the expense of some loss of cluster quality.
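The quantity on the vertical axis of an elbow plot can be illustrated without any clustering library. The sketch below uses made-up 2-D points and hand-picked centroids (they are not fitted by K-means) to show how the sum of squared distances falls sharply until k reaches the natural number of groups and then flattens. `inertia` is a hypothetical helper named after scikit-learn's `inertia_` attribute.

```python
def inertia(points, centroids):
    """Sum of squared Euclidean distances from each point to its nearest centroid."""
    total = 0.0
    for px, py in points:
        total += min((px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centroids)
    return total

# Hypothetical 2-D points forming three tight groups
points = [(0, 0), (0, 2), (10, 0), (10, 2), (20, 0), (20, 2)]

# Hand-picked centroids for k = 1..4, assumed for illustration
candidates = {
    1: [(10, 1)],
    2: [(5, 1), (20, 1)],
    3: [(0, 1), (10, 1), (20, 1)],
    4: [(0, 0), (0, 2), (10, 1), (20, 1)],
}

sse = {k: inertia(points, centroids) for k, centroids in candidates.items()}
print(sse)
```

Going from 1 to 3 centroids removes almost all of the error, while a fourth centroid barely helps, so the elbow of this toy example sits at k = 3.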

In [None]:
# range of number of clusters
num_clusters = range(2, 22, 2)

# list to record sum of squared distances
sum_square_error = []

# iterate through different numbers of clusters and append sse
for k in num_clusters:
    print('now fitting {} clusters using Mini batch K-means algorithm'.format(k))
    sum_square_error.append(MiniBatchKMeans(n_clusters=k, init_size=1024, batch_size=2048, random_state=42).fit(X).inertia_)

# plot sse vs k
plt.figure(figsize=(12, 5))
plt.plot(num_clusters, sum_square_error, "g^-")
plt.xticks(num_clusters)
plt.xlabel('Number of Clusters')
plt.ylabel("Sum of Square Distance")
plt.title('Elbow Method')
plt.show()

It seems that the sum of squared distances doesn't improve much when we increase the number of clusters. In this situation, our best number of clusters seems to be 8, even though this elbow is far from clear (the choice may also shift with a different random seed).

Now we build the model and predict a cluster for each sample.

In [None]:
# create the models and fit 
cluster_predictions = MiniBatchKMeans(n_clusters=8, init_size=1024, batch_size=2048, random_state=42).fit_predict(X)

To visualize our clustering results, we use PCA and t-SNE to project the clusters onto a 2D plane. PCA is a technique that reduces the dimensionality of a dataset by using the Singular Value Decomposition of the data to project it to a lower-dimensional space. To visualize the data in 2 dimensions, the 2 eigenvectors with the highest explained variance are kept.

t-distributed Stochastic Neighbor Embedding (t-SNE) is another dimensionality reduction technique that is particularly well suited for the visualization of high-dimensional datasets. Unlike PCA, it is a non-linear method.

To speed up the process, we randomly choose 2000 samples as the input data for PCA and use the top 60 eigenvectors as the input of t-SNE. Finally, we randomly choose 400 data points for a clearer visualization.
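To illustrate what "keeping the eigenvectors with the highest explained variance" means, here is a small sketch that reduces made-up 2-D points to 1 dimension. It is only an illustration under simplifying assumptions: it uses the closed-form eigenvalues of a 2x2 covariance matrix instead of the SVD that scikit-learn's `PCA` relies on, and the data are invented.

```python
from math import sqrt

# Hypothetical 2-D points lying close to the line y = 2x
points = [(0, 0.1), (1, 1.9), (2, 4.2), (3, 5.8), (4, 8.0)]
n = len(points)

# Center the data
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
centered = [(x - mean_x, y - mean_y) for x, y in points]

# Sample covariance matrix [[a, b], [b, c]]
a = sum(dx * dx for dx, _ in centered) / (n - 1)
b = sum(dx * dy for dx, dy in centered) / (n - 1)
c = sum(dy * dy for _, dy in centered) / (n - 1)

# Eigenvalues of a symmetric 2x2 matrix in closed form
half_trace = (a + c) / 2
root = sqrt(((a - c) / 2) ** 2 + b ** 2)
lambda_1, lambda_2 = half_trace + root, half_trace - root

# Unit eigenvector of the largest eigenvalue (valid because b != 0 here)
norm = sqrt(b ** 2 + (lambda_1 - a) ** 2)
direction = (b / norm, (lambda_1 - a) / norm)

# Share of the total variance kept, and the 1-D coordinate of each point
explained_ratio = lambda_1 / (lambda_1 + lambda_2)
scores = [dx * direction[0] + dy * direction[1] for dx, dy in centered]

print("direction:", direction)
print("explained variance ratio:", explained_ratio)
```

Because the points lie almost on a line, a single component keeps nearly all of the variance. The tf-idf matrix is far less structured, so its first 2 components typically keep only a small share, which is one reason a 2D PCA plot of text clusters can look blurry.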

In [None]:
def plot_tsne_pca(data, labels):
    '''
    This function plots the PCA and t-SNE projections on a 2D plane.
    args:
        data: tf-idf weight matrix (sparse)
        labels: predictions from K-means
    '''
    # initial setup: randomly pick 2000 samples
    max_label = max(labels)
    max_items = np.random.choice(range(data.shape[0]), size=2000, replace=False)

    # densify once as an ndarray (.todense() returns np.matrix, which recent scikit-learn versions may reject)
    dense = data[max_items, :].toarray()

    # keep the 2 components with the most explained variance for the PCA plot,
    # and feed the top 60 components to t-SNE
    pca = PCA(n_components=2).fit_transform(dense)
    tsne = TSNE().fit_transform(PCA(n_components=60).fit_transform(dense))

    # randomly pick 400 of the sampled points for visualization
    idx = np.random.choice(range(pca.shape[0]), size=400, replace=False)
    label_subset = labels[max_items]
    label_subset = [cm.hsv(i/max_label) for i in label_subset[idx]]

    f, ax = plt.subplots(1, 2, figsize=(14, 6))

    # plot PCA
    ax[0].scatter(pca[idx, 0], pca[idx, 1], c=label_subset)
    ax[0].set_title('PCA Cluster')

    # plot t-SNE
    ax[1].scatter(tsne[idx, 0], tsne[idx, 1], c=label_subset)
    ax[1].set_title('TSNE Cluster')
    plt.show()

# plot PCA and t-SNE reduced data
plot_tsne_pca(X, cluster_predictions)

Based on the PCA and t-SNE visualization, our clustering result is far from perfect. Some possible solutions are listed in the future plan section.

The last step of this project is to look at the keywords extracted from the clusters. Note that we only present the top 10 keywords for each cluster based on tf-idf weights.

It can be observed that most of the keywords are clearly related to the coronavirus. What we hope to find is a pattern for each cluster, such as positive/negative attitudes, politics or everyday life. It seems that we cannot find those patterns easily with just the top 10 keywords. However, if we take a close look we can still find something: clusters with topics like Trump or China can be distinguished from other clusters, and keywords like support and friends suggest a positive attitude.

In [None]:
def get_top_keywords(data, clusters, labels, n_terms):
    '''
    This function displays the top keywords of each cluster based on mean tf-idf score.
    '''
    # average the tf-idf weights of all Tweets within each predicted cluster
    cluster_means = pd.DataFrame(data.toarray()).groupby(clusters).mean()

    # loop through each cluster and print the n_terms highest-scoring words
    for i, r in cluster_means.iterrows():
        print('\nCluster {}'.format(i))
        print(','.join([labels[t] for t in np.argsort(r.values)[-n_terms:]]))

# run the code
# (get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names())
get_top_keywords(X, cluster_predictions, vectorizer.get_feature_names_out(), 10)

# Challenges

In this project we face the following challenges:

* The normal batch K-means algorithm is too slow to run, so we use mini batch K-means instead, which comes at the expense of clustering quality. As a result, the resulting clusters are not well separated.
* Limited RAM makes it hard for us to analyze the data at a bigger scale. One possible solution is listed in the future plan section.
* Even though we find a relatively optimal number of clusters using the elbow method, other possible choices have very close sums of squared distances.
* It's hard to find the actual pattern of each cluster.

# Future Plan

Here is a list of a couple of things we can explore in the future.

* Once we have all the hashtags for the given dataset, we can try to train a supervised machine learning model to predict the hashtags of a given Tweet. One thing to notice is that each sample may have multiple labels, which is different from traditional single-label supervised learning.

* For convenience we only analyze Tweets in English in our K-means model. We can try to use translation packages or take a deeper look at the Tweets in other languages.

* Due to the limit of RAM we cannot analyze the data at a bigger scale. AWS cloud may be a good choice if we want to manipulate big data.

* Other features and methods can be included in the clustering process, which may help the model converge to better clusters.
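Regarding the multi-label point in the first item, one common way to represent several hashtags per Tweet is a 0/1 indicator row with one column per hashtag. The sketch below shows the idea in plain Python; the hashtag sets are made up, and this is only one possible encoding rather than a plan the project commits to.

```python
# Hypothetical hashtag sets, one per Tweet
tweet_hashtags = [
    {"covid19", "stayhome"},
    {"coronavirus"},
    {"covid19", "coronavirus", "lockdown"},
]

# Fixed, sorted label vocabulary so that the column order is reproducible
labels = sorted(set().union(*tweet_hashtags))

# One row per Tweet, one 0/1 column per hashtag
targets = [[int(label in tags) for label in labels] for tags in tweet_hashtags]

print(labels)
for row in targets:
    print(row)
```

Each row can contain several ones, which is what distinguishes this setting from single-label classification, where exactly one class is assigned per sample.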

# Acknowledgements

Thanks to Shane Smith for creating this dataset and answering my questions.
Thanks to John B for his notebook on unsupervised text classification, which gave me a lot of inspiration for this project.