# Detecting Bots in Early COVID-19 Tweets Using KMeans Clustering

# Data Collection

### Samuel Park
### October 25th, 2021


*Note that this notebook was run on a local machine.

*The next notebook will be run primarily on Google Colab.




In [1]:
import numpy as np
import pandas as pd
from os import listdir
# from twarc import Twarc



## Coronavirus (covid19) Tweets - Early and Late April
Dataset posted by [Shane Smith](https://www.kaggle.com/smid80/coronavirus-covid19-tweets-early-april) 


The dataset contains COVID-19-related tweets posted between March 29th, 2020, and April 30th, 2020, when the pandemic was beginning in North America. The publisher uploaded a separate CSV file for each date in that timeframe, so those CSV files have to be concatenated row-wise.

When collecting COVID-19-related tweets, the publisher used these hashtags: 
- #coronavirus 
- #coronavirusoutbreak
- #coronavirusPandemic
- #covid19
- #covid_19
- #epitwitter
- #ihavecorona
- #StayHomeStaySafe
- #TestTraceIsolate

In [2]:
path = 'data/shane_smith'

# create a list of all the CSV files from Shane Smith (skip anything else in the folder)
all_files = [file for file in listdir(path) if file.lower().endswith('.csv')]
all_files = sorted(all_files) # file names start with the date, so this sorts by date

# store one dataframe per daily file
df_list = []

for filename in all_files:
    # just to show progress
    print(f'reading in {filename}')
    df_temp = pd.read_csv(f'{path}/{filename}', index_col=None, header=0)
    df_list.append(df_temp)
    print(f'appended dataframe of {filename} to df_list')

# stack the daily dataframes row-wise into a single dataframe
tweets = pd.concat(df_list, axis=0, ignore_index=True)

reading in 2020-03-29 Coronavirus Tweets.CSV
appended dataframe of 2020-03-29 Coronavirus Tweets.CSV to df_list
reading in 2020-03-30 Coronavirus Tweets.CSV
appended dataframe of 2020-03-30 Coronavirus Tweets.CSV to df_list
reading in 2020-03-31 Coronavirus Tweets.CSV
appended dataframe of 2020-03-31 Coronavirus Tweets.CSV to df_list
reading in 2020-04-01 Coronavirus Tweets.CSV
appended dataframe of 2020-04-01 Coronavirus Tweets.CSV to df_list
reading in 2020-04-02 Coronavirus Tweets.CSV
appended dataframe of 2020-04-02 Coronavirus Tweets.CSV to df_list
reading in 2020-04-03 Coronavirus Tweets.CSV
appended dataframe of 2020-04-03 Coronavirus Tweets.CSV to df_list
reading in 2020-04-04 Coronavirus Tweets.CSV
appended dataframe of 2020-04-04 Coronavirus Tweets.CSV to df_list
reading in 2020-04-05 Coronavirus Tweets.CSV
appended dataframe of 2020-04-05 Coronavirus Tweets.CSV to df_list
reading in 2020-04-06 Coronavirus Tweets.CSV
appended dataframe of 2020-04-06 Coronavirus Tweets.CSV to 

KeyboardInterrupt: 

In [None]:
tweets.shape

In [None]:
# export the compiled df as csv file
tweets.to_csv(path_or_buf='data/april_tweets_compiled.csv')

Preview the data.

In [None]:
# show the first 5 rows
tweets.head()


In [None]:
# show data types of the columns
tweets.info()

The `created_at` column should be converted to a datetime dtype. The `account_lang` column was read as `float64`, although a language setting would intuitively be stored with the `object` dtype.
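
The two dtype observations above could be checked along the following lines. This is a minimal sketch on a made-up two-row frame named `sample`, not the real `tweets` data; the timestamp format and the use of `utc=True` are assumptions about how `created_at` is stored.

```python
import pandas as pd

# Toy stand-in for the real `tweets` frame; the values are invented for illustration.
sample = pd.DataFrame({
    'created_at': ['2020-03-29T00:00:00Z', '2020-03-29T00:00:05Z'],
    'account_lang': [float('nan'), float('nan')],
})

# Parse the timestamp strings into a datetime dtype.
# utc=True is an assumption about the timezone of the timestamps.
sample['created_at'] = pd.to_datetime(sample['created_at'], utc=True)

# Share of missing values in account_lang; a value of 1.0 means the column is
# entirely empty, which is one way a text column ends up as float64.
missing_share = sample['account_lang'].isna().mean()

print(sample.dtypes)
print(missing_share)
```

If `account_lang` turns out to be entirely missing in the real data, dropping it may be simpler than changing its dtype.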

Column descriptions:
- status_id: unique id for each tweet
- user_id: unique id for a user
- created_at: datetime for when the tweet was created
- screen_name: user's Twitter handle
- text: content of the tweet
- source: the client or utility used to post the tweet
- reply_to_status_id: tweet id to which this tweet is a reply
- reply_to_user_id: user id to which this tweet is a reply
- is_quote: whether or not this tweet is a quote
- is_retweet: whether or not this tweet is a retweet
- favourites_count: number of users who favorited this tweet
- retweet_count: number of times this tweet was retweeted
- country_code: two-letter code for countries
- place_full_name: full name of the place if geo-tagged (e.g., New York, NY)
- place_type: type of the tagged place (e.g., city); NaN if not geo-tagged
- followers_count: number of followers this user has
- friends_count: number of people this user follows
- account_lang: the language setting of the user account
- account_created_at: the date when the account was created
- verified: whether or not the user is verified (i.e., has blue check mark or not)
- lang: the language of the tweet

Let's get the unique language codes in the dataset.

In [None]:
tweets['lang'].unique()

As shown, there are many languages in the dataset.
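
Beyond listing the unique codes, it can help to see how many tweets fall under each language. The sketch below uses `value_counts` on a small invented `sample` frame; the language codes and counts are illustrative only and do not reflect the real dataset.

```python
import pandas as pd

# Toy stand-in for the `lang` column; the values are invented for illustration.
sample = pd.DataFrame({'lang': ['en', 'es', 'en', 'fr', 'en', 'es']})

# Number of tweets per language code, sorted from most to least frequent.
lang_counts = sample['lang'].value_counts()

# The same information as a share of all rows.
lang_share = sample['lang'].value_counts(normalize=True)

print(lang_counts.head(10))
print(lang_share.head(10))
```

On the real data, the same two calls on `tweets['lang']` would show what share of the collection is kept by an English-only filter.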

Let's keep only the rows whose `lang` value is `en`.

In [None]:
# select rows whose lang is en
tweets_en = tweets.loc[tweets['lang']=='en']

In [None]:
tweets_en.shape

It looks like there are 8,133,785 tweets whose language code is "en" (i.e., English). We want to keep only English tweets for this analysis. There are three reasons for this:
- We are primarily concerned with bots that may have influenced English speakers.
- We may do natural language processing down the line, and English has many NLP resources to draw on.
- The author of this notebook is fluent in a very limited number of languages.
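
Since only English tweets are kept, one possible variation is to filter each daily file as it is read, so the full multilingual frame never has to sit in memory at once. This is a sketch of that alternative and not what this notebook does; the in-memory `day_1` and `day_2` files and their rows are hypothetical stand-ins for the daily CSVs.

```python
import io
import pandas as pd

# Two in-memory stand-ins for the daily CSV files (invented rows).
day_1 = io.StringIO("status_id,lang\n1,en\n2,es\n")
day_2 = io.StringIO("status_id,lang\n3,en\n4,en\n5,fr\n")

english_parts = []
for handle in (day_1, day_2):
    daily = pd.read_csv(handle)
    # Keep only the English rows of this file before storing it.
    english_parts.append(daily.loc[daily['lang'] == 'en'])

# Stack the filtered daily frames row-wise, renumbering the index.
english_only = pd.concat(english_parts, axis=0, ignore_index=True)

print(english_only.shape)
```

The trade-off is that the non-English rows are discarded up front, so the overall language breakdown would have to be computed separately.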

Export the English-only tweets as a CSV file.

In [None]:
tweets_en.to_csv('data/tweets_en_colab.csv')

`tweets_en`, which includes only English tweets, contains 8,133,785 rows (tweets) and 21 columns.
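
One detail to keep in mind when the exported file is read back: `to_csv` writes the row index as an extra first column by default, and a filtered frame such as `tweets_en` keeps the row labels of the frame it was selected from. The sketch below shows the round trip on an invented two-row frame, using an in-memory buffer in place of a file on disk.

```python
import io
import pandas as pd

# Toy frame with non-consecutive row labels, as left behind by a row filter.
sample = pd.DataFrame({'status_id': [11, 12], 'lang': ['en', 'en']}, index=[5, 9])

# Write to an in-memory buffer; the default index=True adds the index as a column.
buffer = io.StringIO()
sample.to_csv(buffer)

# Reading back without index_col yields an extra 'Unnamed: 0' column.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Passing index_col=0 restores the original row labels instead.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(naive.columns.tolist())
print(restored.columns.tolist())
```

Reading the export with `index_col=0`, or writing it with `index=False`, keeps the column count the same as in the original frame.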

Data cleaning will be continued in the second notebook.