## Importing Dataset

* Since this Pfizer Vaccine Tweets dataset is continually updated, we want to pull it directly from Kaggle using the provided API.
* To use the Kaggle API, you need to do the following:
    * Go to your account, scroll to the API section, and click Expire API Token to remove previous tokens.
    * Click Create New API Token. This downloads a kaggle.json file to your machine.
    * Put the file in C:\Users\(your user name)\.kaggle.
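
Before calling the API, it can help to confirm that the token file is where the Kaggle client looks for it by default (the `.kaggle` folder in your home directory). The following is a minimal sketch using only the standard library; the helper names `kaggle_token_path` and `has_kaggle_token` are made up for this illustration and are not part of the Kaggle package.

```python
from pathlib import Path


def kaggle_token_path(home=None):
    """Return the expected location of kaggle.json under the given home directory."""
    base = Path(home) if home is not None else Path.home()
    return base / ".kaggle" / "kaggle.json"


def has_kaggle_token(home=None):
    """Return True if kaggle.json exists in the default .kaggle folder."""
    return kaggle_token_path(home).is_file()


# Show where the file is expected on this machine and whether it is there.
print(kaggle_token_path())
print(has_kaggle_token())
```

If the second line prints `False`, `api.authenticate()` is likely to fail until the file is moved into place.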

In [104]:
!pip install kaggle
from kaggle.api.kaggle_api_extended import KaggleApi
from zipfile import ZipFile
api = KaggleApi()
api.authenticate()
api.dataset_download_files('gpreda/pfizer-vaccine-tweets')



## Access the dataset you just downloaded

In [105]:
# Extract the archive into the same directory as the notebook
with ZipFile('pfizer-vaccine-tweets.zip') as zf:
    zf.extractall()
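
If you are unsure which files the archive contains, you can list its members before or after extracting. This is a small sketch, not part of the original notebook: `list_archive` is a hypothetical helper, and it is demonstrated on a throwaway archive so that it runs without the real download.

```python
import os
import tempfile
from zipfile import ZipFile


def list_archive(zip_path):
    """Return the member file names stored in a zip archive."""
    with ZipFile(zip_path) as zf:
        return zf.namelist()


# Build a tiny stand-in archive in a temporary folder and list its contents.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with ZipFile(demo_zip, "w") as zf:
        zf.writestr("vaccination_tweets.csv", "id,text\n1,hello\n")
    members = list_archive(demo_zip)
    print(members)  # ['vaccination_tweets.csv']
```

Calling `list_archive('pfizer-vaccine-tweets.zip')` on the downloaded file would show the CSV name to pass to pandas.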

## Import necessary tools like pandas

In [106]:
import pandas as pd

### Load the dataset and check variable names

In [107]:
df_tweets = pd.read_csv('vaccination_tweets.csv')
print(df_tweets.columns)
print(df_tweets.shape)

Index(['id', 'user_name', 'user_location', 'user_description', 'user_created',
       'user_followers', 'user_friends', 'user_favourites', 'user_verified',
       'date', 'text', 'hashtags', 'source', 'retweets', 'favorites',
       'is_retweet'],
      dtype='object')
(4487, 16)


### Cleaning
Now we drop the columns that we will not use for this assignment.

In [108]:
df_tweets.drop(columns=['user_description','hashtags'], inplace=True)
df_tweets.columns

Index(['id', 'user_name', 'user_location', 'user_created', 'user_followers',
       'user_friends', 'user_favourites', 'user_verified', 'date', 'text',
       'source', 'retweets', 'favorites', 'is_retweet'],
      dtype='object')

### Extracting users' locations
We extract the city first, if one is provided.
To accomplish this task, we will use a library called geotext.
This library allows us to extract city names and country codes without going through the NLP hassle.

In [109]:
!pip install geotext
from geotext import GeoText



How does it work?

In [110]:
# country_mentions is an OrderedDict keyed by country code;
# reading its keys needs no extra import
places1 = GeoText("my bed")
print(places1.cities)
# prints empty list []
places2 = GeoText("London is a great city")
print(places2.cities)
# prints ['London']
print(list(places2.country_mentions.keys())[0])
print("tada!")

[]
['London']
GB
tada!


In [111]:
from collections import OrderedDict 
od = OrderedDict() 
od['a'] = 1
od['b'] = 2
od['c'] = 3
od['d'] = 4
print(od.keys())

odict_keys(['a', 'b', 'c', 'd'])
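
Because an ordered dictionary iterates over its keys in insertion order, the first key can also be read without building a list. The sketch below uses made-up counts and shows one possible alternative to `list(d.keys())[0]`; the default argument of `next` avoids an error when the dictionary is empty.

```python
from collections import OrderedDict

od = OrderedDict([("GB", 2), ("US", 1)])  # made-up counts for illustration

# First key, read directly from the iterator (no intermediate list)
first_key = next(iter(od))

# With an empty dictionary, the second argument of next() is returned instead
first_or_blank = next(iter(OrderedDict()), "")

print(first_key)             # GB
print(repr(first_or_blank))  # ''
```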


We should drop the rows where user_location is not a string. Missing locations are read from the CSV as NaN, which is a float.

In [112]:
# Flag rows whose user_location is a string (missing values are NaN floats)
is_str = df_tweets['user_location'].apply(lambda loc: isinstance(loc, str))

# Drop the other rows by their index labels
df_tweets.drop(df_tweets[~is_str].index, inplace=True)
df_tweets.shape

(3574, 14)
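
As an aside, when missing values are the only non-string entries, pandas' built-in `dropna` removes the same rows as the type check. The sketch below illustrates this on a made-up toy frame (not the real data) and assumes that NaN is the only kind of non-string value in the column.

```python
import pandas as pd

# Toy frame standing in for df_tweets; the values are made up.
toy = pd.DataFrame({
    "user_name": ["a", "b", "c"],
    "user_location": ["London", float("nan"), "Nice, France"],
})

# The type check and the missing-value check flag the same rows here.
is_str = toy["user_location"].apply(lambda loc: isinstance(loc, str))
same_mask = bool((is_str == toy["user_location"].notna()).all())
print(same_mask)

# dropna keeps only the rows that have a location.
located = toy.dropna(subset=["user_location"])
print(located.shape)
```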

We lost about 900 data points, but we still have about 3500 left.
Now we extract the city name and country code and assign them to each data point.

In [113]:
user_city = []
user_country = []
for location in df_tweets['user_location']:
    # Parse each location string only once
    places = GeoText(location)
    if places.cities:
        # Keep the first city found and the first country code mentioned
        user_city.append(places.cities[0])
        user_country.append(list(places.country_mentions.keys())[0])
    else:
        # No city recognised: leave both fields blank
        user_city.append('')
        user_country.append('')

df_tweets['user_city'] = user_city
df_tweets['user_country'] = user_country
df_tweets

Unnamed: 0,id,user_name,user_location,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,source,retweets,favorites,is_retweet,user_city,user_country
0,1340539111971516416,Rachel Roh,"La Crescenta-Montrose, CA",2009-04-08 17:52:46,405,1692,3247,False,2020-12-20 06:06:44,Same folks said daikon paste could treat a cyt...,Twitter for Android,0,0,False,Montrose,US
1,1338158543359250433,Albert Fong,"San Francisco, CA",2009-09-21 15:27:30,834,666,178,False,2020-12-13 16:27:13,While the world has been on the wrong side of ...,Twitter Web App,1,1,False,San Francisco,US
2,1337858199140118533,eli🇱🇹🇪🇺👌,Your Bed,2020-06-25 23:30:28,10,88,155,False,2020-12-12 20:33:45,#coronavirus #SputnikV #AstraZeneca #PfizerBio...,Twitter for Android,0,0,False,,
3,1337855739918835717,Charles Adler,"Vancouver, BC - Canada",2008-09-10 11:28:53,49165,3933,21853,True,2020-12-12 20:23:59,"Facts are immutable, Senator, even when you're...",Twitter Web App,446,2129,False,Vancouver,CA
5,1337852648389832708,Dee,"Birmingham, England",2020-01-26 21:43:12,105,108,106,False,2020-12-12 20:11:42,Does anyone have any useful advice/guidance fo...,Twitter for iPhone,0,0,False,Birmingham,US
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4481,1350785637201403907,Dee,"Birmingham, England",2020-01-26 21:43:12,112,112,119,False,2021-01-17 12:42:46,Breastfeeding 2 week old but with guidelines r...,Twitter for iPhone,1,15,False,Birmingham,US
4482,1350779065406468099,Saju Mathew MD MPH,"Atlanta, GA",2018-08-12 20:43:47,900,309,1482,False,2021-01-17 12:16:40,Shot #2. Done. Thanks. #PfizerBioNTech. \...,Twitter for iPhone,3,42,False,Atlanta,US
4484,1350774240572801025,Dr.Altaf Dashti,Kuwait,2011-03-13 21:10:54,1498,691,19321,False,2021-01-17 11:57:29,System has been activated and updated 🤪 ✌🏼\n\n...,Twitter for iPhone,1,12,False,,
4485,1350772172273430528,Patricia Hamila,"Nice, France",2015-10-03 05:59:58,21,320,1010,False,2021-01-17 11:49:16,My parents will be getting their first #COVID1...,Twitter for Android,0,0,False,Nice,FR


In [114]:
df_tweets['user_city'].value_counts().to_frame().head(10)

Unnamed: 0,user_city
,1769
London,167
Dubai,59
New York,50
New Delhi,46
Toronto,40
Mumbai,39
Watford,37
Glasgow,37
Cornwall,35


In [87]:
df_tweets['user_country'].value_counts().to_frame().head(10)

Unnamed: 0,user_country
,1769
US,710
GB,391
IN,161
CA,145
AE,67
DE,32
IE,30
ZA,24
PH,22
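
The top row in both tables is the empty string, that is, the 1769 rows where no city was found. To rank only the located tweets, the blanks can be filtered out before counting. The following is a minimal sketch on made-up values, not on the real column.

```python
import pandas as pd

cities = pd.Series(["London", "", "Dubai", "London", ""])  # made-up values

# Keep only the rows where a city was actually extracted, then count.
located = cities[cities != ""]
counts = located.value_counts()
print(counts.to_dict())  # {'London': 2, 'Dubai': 1}
```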


But wait, could a user tweet multiple times?
<br /> Let's find out whether any users tweet a lot.

In [126]:
frequent_tweeters = df_tweets['user_name'].value_counts().head(15).to_frame().reset_index()
frequent_tweeters.columns = ['user_name', 'count']
frequent_tweeters

Unnamed: 0,user_name,count
0,Ian 3.5% #FBPE,33
1,Whtrslugcaviiersong#dontstayhomeandcatchcovid19,30
2,Simon Hodes ⬅️2m➡️ 😷,30
3,TheRag,29
4,ILKHA,29
5,Khaleej Times,27
6,New Straits Times,25
7,🕷Financial Bear 3.5%,25
8,Sue Reeve ♥️🧡💛💚💙💜🇪🇺🇪🇺🏳️‍🌈🏳️‍🌈,20
9,Gulf News,16


Now let's find out where these users are tweeting from.

In [131]:
# Build the lookup once instead of on every iteration
frequent_names = set(frequent_tweeters['user_name'])
cities = []
countries = []
for index, row in df_tweets.iterrows():
    # Keep only frequent tweeters whose city and country were both found
    if row['user_name'] in frequent_names:
        if row['user_city'] and row['user_country']:
            cities.append(row['user_city'])
            countries.append(row['user_country'])
pairs = {'city': cities, 'country': countries}
df = pd.DataFrame(pairs)
df.groupby('country')['city'].value_counts()

country  city    
AE       Dubai       13
CA       Cornwall    33
GB       Watford     30
         Glasgow     20
Name: city, dtype: int64

We can see that these users contributed the majority of the tweets from Cornwall, Watford, and Glasgow (33 of 35, 30 of 37, and 20 of 37, respectively). We may study what they are tweeting more closely later, but for now we will exclude these cities from our study.
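
One possible way to apply this exclusion is a boolean mask built with `isin`. The sketch below runs on a made-up toy frame; the names `excluded_cities` and `df_filtered` are chosen for this illustration only.

```python
import pandas as pd

# Toy frame standing in for df_tweets; the rows are made up.
toy = pd.DataFrame({
    "user_name": ["u1", "u2", "u3", "u4"],
    "user_city": ["Cornwall", "London", "Watford", ""],
})

excluded_cities = ["Cornwall", "Watford", "Glasgow"]

# isin builds a boolean mask; ~ inverts it so only the other rows are kept.
df_filtered = toy[~toy["user_city"].isin(excluded_cities)]
print(df_filtered["user_city"].tolist())  # ['London', '']
```

Note that rows with a blank city are kept by this filter, since they do not match any excluded name.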