<a href="https://colab.research.google.com/github/slp22/data-engineering-project/blob/main/engineering_monkeypox_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Data Engineering | Pipeline

# Monkeypox Tweets

## Imports

In [1]:
import json
import logging
import sqlite3
import matplotlib.pyplot as plt
import numpy as np
import os, shutil, itertools
import pandas as pd
from pathlib import Path
import pickle
import PIL
import random
import seaborn as sns
import sklearn as sk
import warnings
import zipfile

import time
# explicit names, so the `time` module is not shadowed by `datetime.time`
from datetime import date, datetime, timedelta
from sqlite3 import connect

from dateutil.easter import *
from dateutil.parser import *
from dateutil.relativedelta import *
from dateutil.rrule import *


### Google Drive

In [2]:
# mount drive
from google.colab import drive
drive.mount('/content/drive')

# https://colab.research.google.com/notebooks/snippets/sheets.ipynb#scrollTo=JiJVCmu3dhFa

# authorize access 
from google.colab import auth
auth.authenticate_user()

# read in from Google Sheets

import gspread
from google.auth import default
creds, _ = default()

gc = gspread.authorize(creds)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Pyspark



In [3]:
# # https://towardsdatascience.com/pyspark-on-google-colab-101-d31830b238be
# # https://www.analyticsvidhya.com/blog/2020/11/a-must-read-guide-on-how-to-work-with-pyspark-on-google-colab-for-data-scientists/

!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [4]:
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz

In [5]:
!tar -xf spark-3.0.0-bin-hadoop3.2.tgz

In [6]:
!pip install -q findspark

In [7]:
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"

In [8]:
import findspark
findspark.init()

In [9]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode, col, lower
from pyspark.sql.types import StructType,StructField, StringType, IntegerType


## 1 | Pipeline Design


* **Business Problem:** Can we build a dashboard to monitor top trending topics on Twitter about monkeypox?
* **Data source:** [Kaggle Tweets on Monkeypox](https://www.kaggle.com/datasets/thakurnirmalya/monkeypox2022tweets)
* **Impact Hypothesis:** 

## 2 | Data Ingestion

#### 1. [Twitter Dataset on the 2022 MonkeyPox Outbreak](https://www.kaggle.com/datasets/thakurnirmalya/monkeypox2022tweets) (the dataset is a list of tweet IDs)

#### 2. [Twitter Hydrating](https://towardsdatascience.com/learn-how-to-easily-hydrate-tweets-a0f393ed340e#:~:text=Hydrating%20Tweets) with [DocNow Hydrator](https://github.com/DocNow/hydrator/releases)

#### 3. Import [hydrated tweets](https://drive.google.com/drive/folders/1NbddxuSF3v5YuOgjvA1G4WgfPUlKfiul?usp=sharing) from Google Drive to Colab
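
Each hydrated sheet is read later in this notebook with `get_all_values()`, which returns a list of rows whose first row holds the column names. The sketch below shows one way to turn such a list into a DataFrame with a proper header. `rows_to_frame` is a hypothetical helper name and the sample rows are made up for illustration; neither is part of the original pipeline.

```python
import pandas as pd

def rows_to_frame(rows):
    """Build a DataFrame from a list of rows, using the first row as the header."""
    if not rows:
        return pd.DataFrame()
    header, *data = rows
    return pd.DataFrame(data, columns=header)

# Illustrative rows shaped like the output of worksheet.get_all_values()
sample_rows = [
    ["id", "lang", "text"],
    ["1", "en", "example tweet"],
    ["2", "fr", "exemple"],
]
sample = rows_to_frame(sample_rows)
print(sample.shape)
```

Unlike the cells below, this version numbers the data rows from 0, because the header row never enters the frame.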

## 3 | Exploratory Data Analysis

### Explore one set `tweets` (n = 12,656) 

In [10]:
w = ['TweetIDs_Part1', 'TweetIDs_Part2', 'TweetIDs_Part3', 'TweetIDs_Part4', 'TweetIDs_Part5', 'TweetIDs_Part6']
tweets = pd.DataFrame.from_records(gc.open(w[0]).sheet1.get_all_values())

In [11]:
tweets.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,25,26,27,28,29,30,31,32,33,34
0,coordinates,created_at,hashtags,media,urls,favorite_count,id,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_user_id,...,user_followers_count,user_friends_count,user_listed_count,user_location,user_name,user_screen_name,user_statuses_count,user_time_zone,user_urls,user_verified
1,,Wed May 18 21:49:25 +0000 2022,,,,1,1527043704967528453,theofficepirate,1527043356878155776,140472501,...,36791,6088,255,,Yates,Jyates5,36441,,,FALSE


In [12]:
tweets.columns = tweets.iloc[0]
tweets = tweets.drop(index=tweets.index[0])

In [13]:
tweets.head(2)

Unnamed: 0,coordinates,created_at,hashtags,media,urls,favorite_count,id,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_user_id,...,user_followers_count,user_friends_count,user_listed_count,user_location,user_name,user_screen_name,user_statuses_count,user_time_zone,user_urls,user_verified
1,,Wed May 18 21:49:25 +0000 2022,,,,1,1527043704967528453,theofficepirate,1.5270433568781555e+18,140472501.0,...,36791,6088,255,,Yates,Jyates5,36441,,,False
2,,Fri May 20 20:43:44 +0000 2022,,,,0,1527751952448344065,,,,...,134,553,3,"Chicago, IL",Patrick,LeftistHank,10782,,,False


In [14]:
cols_list = list(tweets.columns)
cols_list

['coordinates',
 'created_at',
 'hashtags',
 'media',
 'urls',
 'favorite_count',
 'id',
 'in_reply_to_screen_name',
 'in_reply_to_status_id',
 'in_reply_to_user_id',
 'lang',
 'place',
 'possibly_sensitive',
 'quote_id',
 'retweet_count',
 'retweet_id',
 'retweet_screen_name',
 'source',
 'text',
 'tweet_url',
 'user_created_at',
 'user_id',
 'user_default_profile_image',
 'user_description',
 'user_favourites_count',
 'user_followers_count',
 'user_friends_count',
 'user_listed_count',
 'user_location',
 'user_name',
 'user_screen_name',
 'user_statuses_count',
 'user_time_zone',
 'user_urls',
 'user_verified']

In [15]:
tweets['hashtags'].nunique()

528

In [16]:
tweets['possibly_sensitive'][0:2]

1    
2    
Name: possibly_sensitive, dtype: object

In [17]:
tweets['text']

1        @theofficepirate You bro remember them talking...
2            oh monkey POX? I thought you said monkey POGS
3        If I get monkey pox y’all gotta bring me the j...
4        Great, the initial Monkey pox symptoms read li...
5        @freidergeist @Cameo3D @pullenmyleg_ @Breaking...
                               ...                        
12652    @thecoastguy @TimNielsenDay Patent applied for...
12653               Ian Brown looks like he has monkey pox
12654    @SkyNews how tf are people contracting “monkey...
12655    @ConceptualJames We really need a legitimate l...
12656    What you mean “monkey pox”? https://t.co/lin02...
Name: text, Length: 12656, dtype: object

In [18]:
tweets['tweet_url'][0:2]

1    https://twitter.com/Jyates5/status/15270437049...
2    https://twitter.com/LeftistHank/status/1527751...
Name: tweet_url, dtype: object

In [19]:
tweets['lang'].nunique() #40

40

In [20]:
tweets['lang'].unique()

array(['en', 'ja', 'fr', 'de', 'pl', 'nl', 'qme', 'und', 'da', 'in', 'ta',
       'pt', 'es', 'et', 'ar', 'ru', 'tl', 'el', 'zh', 'qht', 'fi', 'zxx',
       'cy', 'it', 'art', 'tr', 'ht', 'qst', 'ko', 'sr', 'iw', 'ml', 'ro',
       'bn', 'sv', 'hi', 'th', 'ca', 'lv', 'lang'], dtype=object)

In [21]:
print('English entries:', (tweets['lang'] == 'en').sum())

English entries: 12137


In [22]:
tweets = tweets[tweets['lang'] == 'en'].copy()  # copy, so later column assignments do not write to a view
tweets.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12137 entries, 1 to 12656
Data columns (total 35 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   coordinates                 12137 non-null  object
 1   created_at                  12137 non-null  object
 2   hashtags                    12137 non-null  object
 3   media                       12137 non-null  object
 4   urls                        12137 non-null  object
 5   favorite_count              12137 non-null  object
 6   id                          12137 non-null  object
 7   in_reply_to_screen_name     12137 non-null  object
 8   in_reply_to_status_id       12137 non-null  object
 9   in_reply_to_user_id         12137 non-null  object
 10  lang                        12137 non-null  object
 11  place                       12137 non-null  object
 12  possibly_sensitive          12137 non-null  object
 13  quote_id                    12137 non-null  ob

In [23]:
tweets['user_created_at']

1        Fri Apr 01 00:29:41 +0000 2011
2        Fri Apr 01 13:14:03 +0000 2022
3        Fri Apr 01 15:12:02 +0000 2011
4        Fri Apr 01 16:27:26 +0000 2011
5        Fri Apr 01 17:41:17 +0000 2016
                      ...              
12652    Wed Sep 30 18:20:38 +0000 2015
12653    Wed Sep 30 19:12:03 +0000 2020
12654    Wed Sep 30 22:42:35 +0000 2020
12655    Wed Sep 30 23:06:12 +0000 2009
12656    Wed Sep 30 23:38:29 +0000 2009
Name: user_created_at, Length: 12137, dtype: object

In [24]:
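# NOTE: `user_created_at` is the account-creation timestamp; the tweet's own
# timestamp is in `created_at` (compare both columns for row 1 above)
tweets['date'] = pd.to_datetime(tweets['user_created_at'], 
                                  format='%a %b %d %H:%M:%S +0000 %Y', 
                                  errors='coerce').dt.date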

In [25]:
tweets[['date']]

Unnamed: 0,date
1,2011-04-01
2,2022-04-01
3,2011-04-01
4,2011-04-01
5,2016-04-01
...,...
12652,2015-09-30
12653,2020-09-30
12654,2020-09-30
12655,2009-09-30
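
The timestamps in this dataset follow the pattern `Wed May 18 21:49:25 +0000 2022`. As a minimal sketch, the standard library can parse that pattern with `%z` instead of a hard-coded `+0000`. The two sample strings are the `created_at` and `user_created_at` values of row 1 shown above; `parse_twitter_timestamp` is a hypothetical helper name, and the format assumes English day and month names.

```python
from datetime import datetime

# Assumed pattern for Twitter-style timestamps (English day/month names)
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def parse_twitter_timestamp(value):
    """Parse a timestamp such as 'Wed May 18 21:49:25 +0000 2022'."""
    return datetime.strptime(value, TWITTER_FORMAT)

tweet_time = parse_twitter_timestamp("Wed May 18 21:49:25 +0000 2022")    # created_at
account_time = parse_twitter_timestamp("Fri Apr 01 00:29:41 +0000 2011")  # user_created_at
print(tweet_time.date(), account_time.date())
```

For row 1 the two columns are more than a decade apart, so the choice of column decides what a later date filter actually selects.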


In [26]:
tweets['user_id']

1                  275288972
2        1509881738302001155
3                  275573209
4                  275604178
5         715957219972530180
                ...         
12652             3826370843
12653    1311383290688004097
12654    1311436211429543943
12655               78734566
12656               78741475
Name: user_id, Length: 12137, dtype: object

In [27]:
tweets['user_location'].nunique()

4319

In [28]:
tweets['user_screen_name']

1                Jyates5
2            LeftistHank
3         MyNameIsRickyM
4           Just_sue_now
5             JackPaceSr
              ...       
12652    theoceanlawyers
12653          RolexCola
12654            DeeKno_
12655         ChipFranks
12656          moni_lisa
Name: user_screen_name, Length: 12137, dtype: object

In [29]:
tweets = tweets[['date',
                 'user_screen_name',
                 'text',
                 'tweet_url',
                 'user_location',
                 'hashtags']]
tweets.head(2)

Unnamed: 0,date,user_screen_name,text,tweet_url,user_location,hashtags
1,2011-04-01,Jyates5,@theofficepirate You bro remember them talking...,https://twitter.com/Jyates5/status/15270437049...,,
2,2022-04-01,LeftistHank,oh monkey POX? I thought you said monkey POGS,https://twitter.com/LeftistHank/status/1527751...,"Chicago, IL",


In [30]:
tweets = tweets.sort_values('date')
tweets.head(2)

Unnamed: 0,date,user_screen_name,text,tweet_url,user_location,hashtags
464,2006-12-15,tash,Sorry I can't come to work today I've got monk...,https://twitter.com/tash/status/15254387202871...,Earth,
2501,2007-01-08,BBCNews,Several monkeybox cases have been found in the...,https://twitter.com/BBCNews/status/15277393737...,London,


In [31]:
# https://stackoverflow.com/questions/22898824/filtering-pandas-dataframes-on-dates
# https://stackoverflow.com/questions/5619489/troubleshooting-descriptor-date-requires-a-datetime-datetime-object-but-rec

tweets = tweets[(tweets['date'] > date(2022,1,1))] 
tweets

Unnamed: 0,date,user_screen_name,text,tweet_url,user_location,hashtags
5817,2022-01-02,JoelPau68848306,@JohannaSzabo1 @postblocksyndro @igfbss @Garet...,https://twitter.com/JoelPau68848306/status/152...,,
5820,2022-01-02,cosborne687,@CandiceBergenMP You are the problem . Did yo...,https://twitter.com/cosborne687/status/1527372...,"Nipissing, Ontario",
5821,2022-01-02,localpirate7,@FoxNews Tf is monkey pox. Just stop it,https://twitter.com/localpirate7/status/152705...,,
5818,2022-01-02,Charles10151978,So then when are we locking down and shutting ...,https://twitter.com/Charles10151978/status/152...,,
5816,2022-01-02,ORnBNBucksCrew,Ayo we’re going to need to change the name “mo...,https://twitter.com/ORnBNBucksCrew/status/1527...,🇺🇸,
...,...,...,...,...,...,...
1372,2022-05-20,USAF_Brat66,@hrkbenowen Don’t anyone be scared and jump to...,https://twitter.com/USAF_Brat66/status/1528100...,"Tennessee, USA",
1366,2022-05-20,iwillnotsubmit1,"WW3 is on the horizon, Covid-19, and Monkey Po...",https://twitter.com/iwillnotsubmit1/status/152...,"New Jersey, USA",LetsGoBrandon
1367,2022-05-20,TwinomujuniDis5,@cmyeaton @dylanbgeorge @mlipsitch @rebeccajk1...,https://twitter.com/TwinomujuniDis5/status/152...,,
1368,2022-05-20,TwinomujuniDis5,@DrTomFrieden Hey doctor teach me about new di...,https://twitter.com/TwinomujuniDis5/status/152...,,


In [32]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1248 entries, 5817 to 1369
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   date              1248 non-null   object
 1   user_screen_name  1248 non-null   object
 2   text              1248 non-null   object
 3   tweet_url         1248 non-null   object
 4   user_location     1248 non-null   object
 5   hashtags          1248 non-null   object
dtypes: object(6)
memory usage: 68.2+ KB


In [33]:
tweets = tweets.reset_index(drop=True)  # assign the result, otherwise the old index is kept
tweets

Unnamed: 0,date,user_screen_name,text,tweet_url,user_location,hashtags
0,2022-01-02,JoelPau68848306,@JohannaSzabo1 @postblocksyndro @igfbss @Garet...,https://twitter.com/JoelPau68848306/status/152...,,
1,2022-01-02,cosborne687,@CandiceBergenMP You are the problem . Did yo...,https://twitter.com/cosborne687/status/1527372...,"Nipissing, Ontario",
2,2022-01-02,localpirate7,@FoxNews Tf is monkey pox. Just stop it,https://twitter.com/localpirate7/status/152705...,,
3,2022-01-02,Charles10151978,So then when are we locking down and shutting ...,https://twitter.com/Charles10151978/status/152...,,
4,2022-01-02,ORnBNBucksCrew,Ayo we’re going to need to change the name “mo...,https://twitter.com/ORnBNBucksCrew/status/1527...,🇺🇸,
...,...,...,...,...,...,...
1243,2022-05-20,USAF_Brat66,@hrkbenowen Don’t anyone be scared and jump to...,https://twitter.com/USAF_Brat66/status/1528100...,"Tennessee, USA",
1244,2022-05-20,iwillnotsubmit1,"WW3 is on the horizon, Covid-19, and Monkey Po...",https://twitter.com/iwillnotsubmit1/status/152...,"New Jersey, USA",LetsGoBrandon
1245,2022-05-20,TwinomujuniDis5,@cmyeaton @dylanbgeorge @mlipsitch @rebeccajk1...,https://twitter.com/TwinomujuniDis5/status/152...,,
1246,2022-05-20,TwinomujuniDis5,@DrTomFrieden Hey doctor teach me about new di...,https://twitter.com/TwinomujuniDis5/status/152...,,


In [34]:
tweets.to_csv('/content/drive/MyDrive/tweets_eda_clean.csv')

### Import rest of tweet data `df` (n = 229,181)

In [35]:
w = ['TweetIDs_Part1', 'TweetIDs_Part2', 'TweetIDs_Part3', 'TweetIDs_Part4', 'TweetIDs_Part5', 'TweetIDs_Part6']

# load each hydrated sheet into its own DataFrame
df_1, df_2, df_3, df_4, df_5, df_6 = [
    pd.DataFrame.from_records(gc.open(name).sheet1.get_all_values())
    for name in w
]


In [36]:
df_6.tail(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,25,26,27,28,29,30,31,32,33,34
127939,,Sat Jul 23 00:00:23 +0000 2022,,,https://twitter.com/i/broadcasts/1OdKrBzXyMQKX,0,1550631877781835776,,,,...,1818,942,34,Los Angeles,(((Luke Ford))),lukeford,61434,,http://www.lukeford.net,False
127940,,Sat Jul 23 00:00:07 +0000 2022,,,http://crweworld.com/article/world/2448896/2-c...,0,1550631810970750976,,,,...,1051,2047,42,"Las Vegas, NV",Crwe World,CrweWorld,1482340,,http://crweworld.com,False


In [37]:
dfs = [df_1, df_2, df_3, df_4, df_5, df_6]

for d in dfs:
    # promote the first row to the header, then drop that row in place
    d.columns = d.iloc[0]
    d.drop(index=d.index[0], inplace=True)

In [38]:
df_1.head(3)

Unnamed: 0,coordinates,created_at,hashtags,media,urls,favorite_count,id,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_user_id,...,user_followers_count,user_friends_count,user_listed_count,user_location,user_name,user_screen_name,user_statuses_count,user_time_zone,user_urls,user_verified
1,,Wed May 18 21:49:25 +0000 2022,,,,1,1527043704967528453,theofficepirate,1.5270433568781555e+18,140472501.0,...,36791,6088,255,,Yates,Jyates5,36441,,,False
2,,Fri May 20 20:43:44 +0000 2022,,,,0,1527751952448344065,,,,...,134,553,3,"Chicago, IL",Patrick,LeftistHank,10782,,,False
3,,Sat May 21 19:48:54 +0000 2022,,,,1,1528100542345527296,,,,...,1476,2251,7,Ohio,Firm Bizkit,MyNameIsRickyM,225443,,,False


In [39]:
df_2.head(3)

Unnamed: 0,coordinates,created_at,hashtags,media,urls,favorite_count,id,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_user_id,...,user_followers_count,user_friends_count,user_listed_count,user_location,user_name,user_screen_name,user_statuses_count,user_time_zone,user_urls,user_verified
1,,Thu May 26 20:31:06 +0000 2022,,,,0,1529923099424178191,dustinbennett76,1.5299214221344973e+18,43171823.0,...,182,501,3,,Work in Progress,Kenny_Swift,8117,,,False
2,,Sun May 22 22:20:07 +0000 2022,,,,0,1528500983893983233,OnlineAlison,1.5284842269699236e+18,24115438.0,...,1,36,0,,Saniye,biirSaniye,238,,,False
3,,Thu May 26 22:54:22 +0000 2022,,,,1,1529959154059730946,,,,...,2297,1311,8,"Johannesburg, South Africa",Lebogang,lebza_mtwana,70755,,,False


In [40]:
df = pd.concat([df_1, df_2, df_3, df_4, df_5, df_6])
df.tail()

Unnamed: 0,coordinates,created_at,hashtags,media,urls,favorite_count,id,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_user_id,...,user_followers_count,user_friends_count,user_listed_count,user_location,user_name,user_screen_name,user_statuses_count,user_time_zone,user_urls,user_verified
127936,,Sat Jul 23 00:00:17 +0000 2022,,,https://www.cdc.gov/poxvirus/monkeypox/transmi...,1,1550631852393828352,,,,...,529,972,21,,Wendy,wmzraz,41861,,,False
127937,,Sat Jul 23 00:00:00 +0000 2022,,,https://endpts.com/bavarian-nordics-monkeypox-...,4,1550631781396729856,,,,...,24498,40,530,Global,Endpoints News,endpts,44967,,http://endpts.com,False
127938,,Sat Jul 23 00:00:18 +0000 2022,Monkeypox HealthierJC,https://twitter.com/HealthierJC/status/1550631...,,0,1550631856336506886,,,,...,2330,1582,32,"Jersey City, NJ",Healthier JC,HealthierJC,8091,,http://healthierjc.com,False
127939,,Sat Jul 23 00:00:23 +0000 2022,,,https://twitter.com/i/broadcasts/1OdKrBzXyMQKX,0,1550631877781835776,,,,...,1818,942,34,Los Angeles,(((Luke Ford))),lukeford,61434,,http://www.lukeford.net,False
127940,,Sat Jul 23 00:00:07 +0000 2022,,,http://crweworld.com/article/world/2448896/2-c...,0,1550631810970750976,,,,...,1051,2047,42,"Las Vegas, NV",Crwe World,CrweWorld,1482340,,http://crweworld.com,False


In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 229181 entries, 1 to 127940
Data columns (total 35 columns):
 #   Column                      Non-Null Count   Dtype 
---  ------                      --------------   ----- 
 0   coordinates                 229181 non-null  object
 1   created_at                  229181 non-null  object
 2   hashtags                    229181 non-null  object
 3   media                       229181 non-null  object
 4   urls                        229181 non-null  object
 5   favorite_count              229181 non-null  object
 6   id                          229181 non-null  object
 7   in_reply_to_screen_name     229181 non-null  object
 8   in_reply_to_status_id       229181 non-null  object
 9   in_reply_to_user_id         229181 non-null  object
 10  lang                        229181 non-null  object
 11  place                       229181 non-null  object
 12  possibly_sensitive          229181 non-null  object
 13  quote_id                    2

In [42]:
df.to_csv('/content/drive/MyDrive/tweets_raw.csv')

### Clean rest of tweet data

In [43]:
df.tail(2)

Unnamed: 0,coordinates,created_at,hashtags,media,urls,favorite_count,id,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_user_id,...,user_followers_count,user_friends_count,user_listed_count,user_location,user_name,user_screen_name,user_statuses_count,user_time_zone,user_urls,user_verified
127939,,Sat Jul 23 00:00:23 +0000 2022,,,https://twitter.com/i/broadcasts/1OdKrBzXyMQKX,0,1550631877781835776,,,,...,1818,942,34,Los Angeles,(((Luke Ford))),lukeford,61434,,http://www.lukeford.net,False
127940,,Sat Jul 23 00:00:07 +0000 2022,,,http://crweworld.com/article/world/2448896/2-c...,0,1550631810970750976,,,,...,1051,2047,42,"Las Vegas, NV",Crwe World,CrweWorld,1482340,,http://crweworld.com,False


In [44]:
print('English entries:', (df['lang'] == 'en').sum())

English entries: 210812


In [45]:
df = df[(df['lang'] == 'en')]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 210812 entries, 1 to 127940
Data columns (total 35 columns):
 #   Column                      Non-Null Count   Dtype 
---  ------                      --------------   ----- 
 0   coordinates                 210812 non-null  object
 1   created_at                  210812 non-null  object
 2   hashtags                    210812 non-null  object
 3   media                       210812 non-null  object
 4   urls                        210812 non-null  object
 5   favorite_count              210812 non-null  object
 6   id                          210812 non-null  object
 7   in_reply_to_screen_name     210812 non-null  object
 8   in_reply_to_status_id       210812 non-null  object
 9   in_reply_to_user_id         210812 non-null  object
 10  lang                        210812 non-null  object
 11  place                       210812 non-null  object
 12  possibly_sensitive          210812 non-null  object
 13  quote_id                    2

In [46]:
df['lang'].unique()

array(['en'], dtype=object)

In [47]:
# copy, so the `date` column added below is not written to a view
df = df[['user_created_at',
         'user_screen_name',
         'text',
         'tweet_url',
         'user_location',
         'hashtags']].copy()
df.head(2)

Unnamed: 0,user_created_at,user_screen_name,text,tweet_url,user_location,hashtags
1,Fri Apr 01 00:29:41 +0000 2011,Jyates5,@theofficepirate You bro remember them talking...,https://twitter.com/Jyates5/status/15270437049...,,
2,Fri Apr 01 13:14:03 +0000 2022,LeftistHank,oh monkey POX? I thought you said monkey POGS,https://twitter.com/LeftistHank/status/1527751...,"Chicago, IL",


In [48]:
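# NOTE: `date` is derived from `user_created_at` (account creation),
# not from the tweet timestamp `created_at`
df['date'] = pd.to_datetime(df['user_created_at'],
                            format='%a %b %d %H:%M:%S +0000 %Y', 
                            errors='coerce').dt.date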

In [49]:
df[['date']]

Unnamed: 0,date
1,2011-04-01
2,2022-04-01
3,2011-04-01
4,2011-04-01
5,2016-04-01
...,...
127936,2010-07-29
127937,2015-08-27
127938,2016-03-03
127939,2007-12-31


In [50]:
df = df.sort_values('date')
df.head(2)

Unnamed: 0,user_created_at,user_screen_name,text,tweet_url,user_location,hashtags,date
44335,Tue Jul 11 21:26:17 +0000 2006,Maggie,I just realized Monkey Pox survives on fabric ...,https://twitter.com/Maggie/status/154875458060...,San Francisco (she/her),,2006-07-11
119064,Fri Jul 14 05:49:14 +0000 2006,xeni,WHO declares monkeypox a global health emergen...,https://twitter.com/xeni/status/15508495854547...,,,2006-07-14


In [51]:
df = df[(df['date'] > date(2022,1,1))] 
df

Unnamed: 0,user_created_at,user_screen_name,text,tweet_url,user_location,hashtags,date
70237,Sun Jan 02 22:48:29 +0000 2022,lwaddell123,@okunbor13 @RonnieZebron @EricMMatheny And whi...,https://twitter.com/lwaddell123/status/1550877...,"Okmulgee, OK",,2022-01-02
59977,Sun Jan 02 11:58:59 +0000 2022,Marcia_LeMere,@mgthawk @VP72801 But it’s the Liberals who ar...,https://twitter.com/Marcia_LeMere/status/15500...,,,2022-01-02
5817,Sun Jan 02 10:12:00 +0000 2022,JoelPau68848306,@JohannaSzabo1 @postblocksyndro @igfbss @Garet...,https://twitter.com/JoelPau68848306/status/152...,,,2022-01-02
101378,Sun Jan 02 08:32:26 +0000 2022,ApologizToUnvax,@nytimes Mother Nature keep trying to tell us ...,https://twitter.com/ApologizToUnvax/status/155...,,,2022-01-02
1223,Sun Jan 02 08:13:16 +0000 2022,VickyWa71588519,"@DrEricDing @CDCDirector @CDCgov Hi doc, any c...",https://twitter.com/VickyWa71588519/status/153...,,,2022-01-02
...,...,...,...,...,...,...,...
80763,Sat Jul 23 19:36:15 +0000 2022,HABOfficialNews,HEALTH: According to WHO Director-General Tedr...,https://twitter.com/HABOfficialNews/status/155...,United Kingdom,,2022-07-23
75282,Sat Jul 23 21:51:24 +0000 2022,KChristophoro,"@thecoastguy ""Pivotal to the success or failur...",https://twitter.com/KChristophoro/status/15509...,,,2022-07-23
80941,Sat Jul 23 20:58:11 +0000 2022,hundredkittens,My for you page on TikTok lately is really the...,https://twitter.com/hundredkittens/status/1550...,United States,quote monkeypox quotesaboutlife,2022-07-23
76811,Sat Jul 23 22:40:20 +0000 2022,worldbestvideoo,Best description of life in pandemic \n\n#mon...,https://twitter.com/worldbestvideoo/status/155...,,monkeypox Covid_19 Omicron5 12yearsofonedirect...,2022-07-23


In [52]:
# print the column itself; `.unique` without parentheses only prints the bound method
print(df['user_location'])
print('\n', 'Num unique:', df['user_location'].nunique())

70237       Okmulgee, OK
59977                   
5817                    
101378                  
1223                    
               ...      
80763     United Kingdom
75282                   
80941      United States
76811                   
86451     United Kingdom
Name: user_location, Length: 29917, dtype: object

 Num unique: 3797


In [53]:
print(df['user_screen_name'].value_counts())
print('\n', 'Num unique:', df['user_screen_name'].nunique())

EMorgan52          1981
RolandBakerIII      281
MonkeyPoxBSC        226
tickrob76           162
bulletin_ex         153
                   ... 
RobbJack4             1
icingdeath5           1
mancityisbetter       1
SecretAgent2024       1
Mosdeffff7            1
Name: user_screen_name, Length: 15936, dtype: int64

 Num unique: 15936


In [54]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29917 entries, 70237 to 86451
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   user_created_at   29917 non-null  object
 1   user_screen_name  29917 non-null  object
 2   text              29917 non-null  object
 3   tweet_url         29917 non-null  object
 4   user_location     29917 non-null  object
 5   hashtags          29917 non-null  object
 6   date              29917 non-null  object
dtypes: object(7)
memory usage: 1.8+ MB


In [55]:
df = df.reset_index(drop=True)  # assign the result, otherwise the old index is kept
df

Unnamed: 0,user_created_at,user_screen_name,text,tweet_url,user_location,hashtags,date
0,Sun Jan 02 22:48:29 +0000 2022,lwaddell123,@okunbor13 @RonnieZebron @EricMMatheny And whi...,https://twitter.com/lwaddell123/status/1550877...,"Okmulgee, OK",,2022-01-02
1,Sun Jan 02 11:58:59 +0000 2022,Marcia_LeMere,@mgthawk @VP72801 But it’s the Liberals who ar...,https://twitter.com/Marcia_LeMere/status/15500...,,,2022-01-02
2,Sun Jan 02 10:12:00 +0000 2022,JoelPau68848306,@JohannaSzabo1 @postblocksyndro @igfbss @Garet...,https://twitter.com/JoelPau68848306/status/152...,,,2022-01-02
3,Sun Jan 02 08:32:26 +0000 2022,ApologizToUnvax,@nytimes Mother Nature keep trying to tell us ...,https://twitter.com/ApologizToUnvax/status/155...,,,2022-01-02
4,Sun Jan 02 08:13:16 +0000 2022,VickyWa71588519,"@DrEricDing @CDCDirector @CDCgov Hi doc, any c...",https://twitter.com/VickyWa71588519/status/153...,,,2022-01-02
...,...,...,...,...,...,...,...
29912,Sat Jul 23 19:36:15 +0000 2022,HABOfficialNews,HEALTH: According to WHO Director-General Tedr...,https://twitter.com/HABOfficialNews/status/155...,United Kingdom,,2022-07-23
29913,Sat Jul 23 21:51:24 +0000 2022,KChristophoro,"@thecoastguy ""Pivotal to the success or failur...",https://twitter.com/KChristophoro/status/15509...,,,2022-07-23
29914,Sat Jul 23 20:58:11 +0000 2022,hundredkittens,My for you page on TikTok lately is really the...,https://twitter.com/hundredkittens/status/1550...,United States,quote monkeypox quotesaboutlife,2022-07-23
29915,Sat Jul 23 22:40:20 +0000 2022,worldbestvideoo,Best description of life in pandemic \n\n#mon...,https://twitter.com/worldbestvideoo/status/155...,,monkeypox Covid_19 Omicron5 12yearsofonedirect...,2022-07-23


In [56]:
# Save corpus
df.to_pickle('/content/tweet_data.pkl')
df.to_csv(r'/content/tweet_data.csv', index=False)
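
The cleaning steps above (English filter, column selection, date parsing, 2022 cutoff, sort, index reset) can be collected into one function so they run the same way on every input. This is a minimal sketch, not the notebook's own code: `clean_tweets`, `KEEP_COLUMNS`, and the toy rows are hypothetical, and it keeps the notebook's choice of deriving `date` from `user_created_at`. The cutoff comparison is done on datetime values so that unparseable timestamps are dropped.

```python
import pandas as pd

KEEP_COLUMNS = ["user_created_at", "user_screen_name", "text",
                "tweet_url", "user_location", "hashtags"]
TWITTER_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"

def clean_tweets(raw, cutoff="2022-01-01"):
    """Keep English rows and selected columns, add `date`, keep dates after `cutoff`."""
    out = raw.loc[raw["lang"] == "en", KEEP_COLUMNS].copy()
    parsed = pd.to_datetime(out["user_created_at"],
                            format=TWITTER_FORMAT, errors="coerce")
    out["date"] = parsed.dt.date
    # Compare as datetimes: rows that failed to parse (NaT) evaluate to False
    keep = parsed.dt.normalize() > pd.Timestamp(cutoff)
    return out[keep].sort_values("date").reset_index(drop=True)

# Toy rows for illustration only
toy = pd.DataFrame({
    "lang": ["en", "en", "fr"],
    "user_created_at": ["Fri Apr 01 13:14:03 +0000 2022",
                        "Fri Apr 01 00:29:41 +0000 2011",
                        "Sun Jan 02 22:48:29 +0000 2022"],
    "user_screen_name": ["user_a", "user_b", "user_c"],
    "text": ["example one", "example two", "exemple trois"],
    "tweet_url": ["", "", ""],
    "user_location": ["", "", ""],
    "hashtags": ["", "", ""],
})
cleaned = clean_tweets(toy)
print(cleaned[["user_screen_name", "date"]])
```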

## 4 | Storage

#### SQL Database `monkeypox.db`

##### helper functions

In [57]:
# https://towardsdatascience.com/have-a-sql-interview-coming-up-ace-it-using-google-colab-6d3c0ffb29dc

def pd_to_sqlDB(input_df: pd.DataFrame,
                table_name: str,
                db_name: str = 'default.db') -> None:

    # # Setup local logging
    # logging.basicConfig(level=logging.INFO,
    #                     format='%(asctime)s %(levelname)s: %(message)s',
    #                     datefmt='%Y-%m-%d %H:%M:%S')

    # Find columns in the dataframe
    cols = input_df.columns
    cols_string = ','.join(cols)
    val_wildcard_string = ','.join(['?'] * len(cols))

    # Connect to a DB file if it exists, else create a new file
    con = sqlite3.connect(db_name)
    cur = con.cursor()
    # logging.info(f'SQL DB {db_name} created')

    # Create table
    sql_string = f"""CREATE TABLE {table_name} ({cols_string});"""
    cur.execute(sql_string)
    # logging.info(f'SQL Table {table_name} created with {len(cols)} columns')

    # Upload df
    rows_to_upload = input_df.to_dict(orient='split')['data']
    sql_string = f"""INSERT INTO {table_name} ({cols_string}) VALUES ({val_wildcard_string});"""    
    cur.executemany(sql_string, rows_to_upload)
    # logging.info(f'{len(rows_to_upload)} rows uploaded to {table_name}')
  
    # Commit the changes and close the connection
    con.commit()
    con.close()

In [58]:
#  https://towardsdatascience.com/have-a-sql-interview-coming-up-ace-it-using-google-colab-6d3c0ffb29dc

def sql_query_to_pd(sql_query_string: str, db_name: str = 'monkeypox.db') -> pd.DataFrame:
    
    # Connect to the SQL DB
    con = sqlite3.connect(db_name)

    # Execute the SQL query
    cursor = con.execute(sql_query_string)

    # Fetch the data and column names
    result_data = cursor.fetchall()
    cols = [description[0] for description in cursor.description]

    # Close the connection
    con.close()

    # Return as df
    return pd.DataFrame(result_data, columns=cols)
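
As a possible alternative to the two helpers above, pandas can write to and read from SQLite directly with `DataFrame.to_sql` and `pd.read_sql_query`. The sketch below uses an in-memory database and made-up rows, so it does not touch `monkeypox.db`; `if_exists="replace"` is chosen here so that re-running the cell does not fail on an existing table.

```python
import sqlite3
import pandas as pd

# Made-up rows for illustration only
frame = pd.DataFrame({
    "user_screen_name": ["user_a", "user_b"],
    "text": ["first example", "second example"],
})

con = sqlite3.connect(":memory:")  # in-memory database, discarded on close
try:
    # Write the frame to a table, replacing it if it already exists
    frame.to_sql("tweets", con, if_exists="replace", index=False)
    # Read the table back into a new DataFrame
    round_trip = pd.read_sql_query("SELECT * FROM tweets", con)
finally:
    con.close()

print(round_trip)
```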

##### sql_to_df

In [59]:
# https://towardsdatascience.com/have-a-sql-interview-coming-up-ace-it-using-google-colab-6d3c0ffb29dc

# Read CSV as df
input_df = pd.read_csv('/content/tweet_data.csv')

# Upload df to a SQL table
pd_to_sqlDB(input_df,
            table_name='tweets',
            db_name='monkeypox.db')

# Write SQL query in a string variable
sql_query_string = """
    SELECT *
    FROM tweets
"""
# Execute SQL query
corpus = sql_query_to_pd(sql_query_string, db_name='monkeypox.db')
corpus

Unnamed: 0,user_created_at,user_screen_name,text,tweet_url,user_location,hashtags,date
0,Sun Jan 02 22:48:29 +0000 2022,lwaddell123,@okunbor13 @RonnieZebron @EricMMatheny And whi...,https://twitter.com/lwaddell123/status/1550877...,"Okmulgee, OK",,2022-01-02
1,Sun Jan 02 11:58:59 +0000 2022,Marcia_LeMere,@mgthawk @VP72801 But it’s the Liberals who ar...,https://twitter.com/Marcia_LeMere/status/15500...,,,2022-01-02
2,Sun Jan 02 10:12:00 +0000 2022,JoelPau68848306,@JohannaSzabo1 @postblocksyndro @igfbss @Garet...,https://twitter.com/JoelPau68848306/status/152...,,,2022-01-02
3,Sun Jan 02 08:32:26 +0000 2022,ApologizToUnvax,@nytimes Mother Nature keep trying to tell us ...,https://twitter.com/ApologizToUnvax/status/155...,,,2022-01-02
4,Sun Jan 02 08:13:16 +0000 2022,VickyWa71588519,"@DrEricDing @CDCDirector @CDCgov Hi doc, any c...",https://twitter.com/VickyWa71588519/status/153...,,,2022-01-02
...,...,...,...,...,...,...,...
29912,Sat Jul 23 19:36:15 +0000 2022,HABOfficialNews,HEALTH: According to WHO Director-General Tedr...,https://twitter.com/HABOfficialNews/status/155...,United Kingdom,,2022-07-23
29913,Sat Jul 23 21:51:24 +0000 2022,KChristophoro,"@thecoastguy ""Pivotal to the success or failur...",https://twitter.com/KChristophoro/status/15509...,,,2022-07-23
29914,Sat Jul 23 20:58:11 +0000 2022,hundredkittens,My for you page on TikTok lately is really the...,https://twitter.com/hundredkittens/status/1550...,United States,quote monkeypox quotesaboutlife,2022-07-23
29915,Sat Jul 23 22:40:20 +0000 2022,worldbestvideoo,Best description of life in pandemic \n\n#mon...,https://twitter.com/worldbestvideoo/status/155...,,monkeypox Covid_19 Omicron5 12yearsofonedirect...,2022-07-23


* **database = `monkeypox.db`**
* **table_1 = `corpus` (6410 rows × 4 columns)**
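
Notes on a table's shape can drift from what a later run returns, so it may help to ask SQLite for the counts directly. The sketch below is a minimal, self-contained illustration: it uses an in-memory database with two placeholder rows, and the helper name `table_shape` is hypothetical (it is not defined elsewhere in this notebook). Connecting to `monkeypox.db` instead of `:memory:` should report the counts for the real `tweets` table.

```python
import sqlite3

def table_shape(conn, table_name):
    """Return (row_count, column_count) for a SQLite table."""
    # Identifiers cannot be bound as SQL parameters, so quote the name instead.
    quoted = '"' + table_name.replace('"', '""') + '"'
    n_rows = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    # cursor.description has one entry per result column.
    n_cols = len(conn.execute(f"SELECT * FROM {quoted} LIMIT 0").description)
    return n_rows, n_cols

# Placeholder data (assumption): stands in for the real tweets table.
conn = sqlite3.connect(":memory:")  # swap in 'monkeypox.db' to check the real table
conn.execute("CREATE TABLE tweets (user_screen_name TEXT, text TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?, ?)",
    [("user_a", "placeholder tweet one", "2022-01-02"),
     ("user_b", "placeholder tweet two", "2022-07-23")],
)

shape = table_shape(conn, "tweets")
print(shape)
conn.close()
```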

In [60]:
# # copy corpus csv to Google Drive for Tableau
# shutil.copyfile('/content/tweets.csv', '/content/drive/MyDrive/tweets.csv')

# 5 | Processing

### Case Counts by State

In [61]:
# import sheet with state latitude and longitude
# data source: https://developers.google.com/public-data/docs/canonical/states_csv

# open google spreadsheet
worksheet = gc.open('USA-State-Coordinates').sheet1

# get_all_values gives a list of rows.
rows = worksheet.get_all_values()

states = pd.DataFrame.from_records(rows)

states.columns = states.iloc[0]
states.drop([0], inplace=True)
states.drop(['state'], axis=1, inplace=True)
states.sort_values(by=['name'], inplace=True)
states = states.rename(columns={'latitude': 'lat', 'longitude': 'lon', 'name': 'state'})

states.tail(2)

Unnamed: 0,lat,lon,state
51,43.78444,-88.787868,Wisconsin
52,43.075968,-107.290284,Wyoming


In [62]:
states.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52 entries, 1 to 52
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   lat     52 non-null     object
 1   lon     52 non-null     object
 2   state   52 non-null     object
dtypes: object(3)
memory usage: 1.6+ KB
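
The `info()` output shows `lat` and `lon` stored as `object`, most likely because `get_all_values()` returns every cell as text. Mapping tools generally expect numeric coordinates, so a conversion along these lines may be worthwhile. This is a sketch on a two-row stand-in frame (`states_sample`, a hypothetical name) built from values shown above, not the real sheet.

```python
import pandas as pd

# Stand-in for the `states` frame: coordinates arrive as strings.
states_sample = pd.DataFrame({
    "lat": ["43.78444", "43.075968"],
    "lon": ["-88.787868", "-107.290284"],
    "state": ["Wisconsin", "Wyoming"],
})

for col in ["lat", "lon"]:
    # errors="coerce" turns unparseable cells into NaN instead of raising.
    states_sample[col] = pd.to_numeric(states_sample[col], errors="coerce")

print(states_sample.dtypes)
```

After the conversion, a quick `states_sample.isna().sum()` would reveal any cells that failed to parse.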


In [63]:
# import sheet with US case count by state 
# data source: https://www.cdc.gov/poxvirus/monkeypox/response/2022/us-map.html

worksheet = gc.open('2022-US-mpx-cases-by-state').sheet1
rows = worksheet.get_all_values()
cases = pd.DataFrame.from_records(rows)

cases.columns = cases.iloc[0]
cases.drop([0], inplace=True)
cases.drop(['AsOf', 'Case Range'], axis=1, inplace=True)
cases.sort_values(by=['Location'], inplace=True)

cases.tail(2)

Unnamed: 0,Location,Cases
51,Wisconsin,56
52,Wyoming,1


In [64]:
# US case count by state (long + lat)
map_data = pd.concat([states, cases], axis=1)
map_data = map_data[['Location', 'Cases', 'lat', 'lon' ]]
map_data = map_data.rename(columns={'Location': 'state', 'Cases':'cases'})

map_data.tail(2)
# map_data.info()

Unnamed: 0,state,cases,lat,lon
51,Wisconsin,56,43.78444,-88.787868
52,Wyoming,1,43.075968,-107.290284
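
`pd.concat(..., axis=1)` aligns rows by index label, so the result above is only correct if both sheets list the states in the same row order. Joining on the state name removes that assumption. The sketch below uses two stand-in frames (hypothetical names `states_sample` and `cases_sample`, values copied from the outputs above) whose rows are deliberately in a different order; it is one possible approach, not the notebook's method.

```python
import pandas as pd

states_sample = pd.DataFrame(
    {"lat": ["43.78444", "43.075968"],
     "lon": ["-88.787868", "-107.290284"],
     "state": ["Wisconsin", "Wyoming"]},
    index=[51, 52],
)
# Same index labels, but the rows are in the opposite order.
cases_sample = pd.DataFrame(
    {"Location": ["Wyoming", "Wisconsin"], "Cases": ["1", "56"]},
    index=[51, 52],
)

map_sample = (
    cases_sample.rename(columns={"Location": "state", "Cases": "cases"})
    # validate raises if a state name appears more than once on either side.
    .merge(states_sample, on="state", how="inner", validate="one_to_one")
    [["state", "cases", "lat", "lon"]]
)
print(map_sample)
```

With `how="inner"`, states missing from either sheet are silently dropped; `how="outer"` with `indicator=True` would surface them instead.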


In [65]:
map_data = map_data.astype({'cases':'int'})

map_data.sort_values(by=['cases'], ascending=False)
# map_data.sort_values(by=['latitude'], ascending=False)



Unnamed: 0,state,cases,lat,lon
5,California,3291,36.778261,-119.417932
33,New York,3124,43.299428,-74.217933
10,Florida,1739,27.664827,-81.515754
45,Texas,1472,31.968599,-99.901813
11,Georgia,1299,32.157435,-82.907123
14,Illinois,1005,40.633125,-89.398528
31,New Jersey,479,40.058324,-74.405661
39,Pennsylvania,477,41.203322,-77.194525
21,Maryland,461,39.045755,-76.641271
9,District of Columbia,414,38.905985,-77.033418


In [66]:
map_data.head(2)

Unnamed: 0,state,cases,lat,lon
1,Alabama,53,32.318231,-86.902298
2,Alaska,3,63.588753,-154.493062


In [67]:
map_data.to_csv('map_data.csv')  

### PySpark

#### Count specific words using PySpark
* groupBy, count, agg
* search term: pyspark word frequency


In [68]:
spark = SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()

In [69]:
#  use SQL table?

In [70]:
spark_df = spark.read.csv('/content/drive/MyDrive/tweet_data.csv', inferSchema=True, header=True)

# spark_df = spark.createDataFrame(tweets_1)
spark_df.printSchema()
# spark_df.show()

root
 |-- _c0: string (nullable = true)
 |-- user_screen_name: string (nullable = true)
 |-- text: string (nullable = true)
 |-- tweet_url: string (nullable = true)
 |-- user_location: string (nullable = true)
 |-- hashtags: string (nullable = true)
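
The `_c0` column in the schema is probably the pandas index: `to_csv` writes the index by default under an empty header name, and Spark assigns a default name to unnamed columns. A small sketch of the difference, using an in-memory buffer and a one-row placeholder frame (both are assumptions for illustration):

```python
import io
import pandas as pd

df = pd.DataFrame({"user_screen_name": ["user_a"], "text": ["placeholder tweet"]})

with_index = io.StringIO()
df.to_csv(with_index)                   # default index=True: header starts with an empty name

without_index = io.StringIO()
df.to_csv(without_index, index=False)   # index omitted, no unnamed column

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(repr(header_with))
print(repr(header_without))
```

Writing the CSV with `index=False` at export time would keep the extra column out of the Spark schema.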



In [71]:
# df_schema = StructType([StructField("date", StringType(), True),
#                         StructField("tweet", StringType(), True)])


In [74]:
df_schema = StructType([StructField("date", StringType(), True),
                        StructField("text", StringType(), True)])


In [76]:
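# lower-case the tweet text, split on whitespace, and emit one row per token
# (raw string avoids Python's invalid-escape warning for '\s')
spark_df = spark_df.withColumn('text', 
                               explode(split(lower(col('text')), r'\s')))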

In [79]:
(spark_df.groupBy('text')
  .count()
  .orderBy('count', ascending=False)
  .show(50))

+----------+------+
|      text| count|
+----------+------+
|       the|133571|
|    monkey| 99630|
|        to| 79695|
|       pox| 75647|
|         a| 69657|
|        of| 61116|
|       and| 59923|
|        is| 59885|
| monkeypox| 56635|
|        in| 53371|
|          | 41927|
|         i| 36433|
|       for| 33123|
|      that| 29047|
|       you| 27508|
|        it| 27072|
|      with| 25582|
|       are| 24059|
|      have| 23503|
|      this| 23234|
|        on| 22652|
|        be| 21713|
|       not| 20567|
|    health| 18725|
|        we| 18306|
|      they| 18271|
|       who| 18160|
|        as| 16007|
|     about| 15257|
|       get| 14368|
|      from| 14224|
|       has| 13630|
|     covid| 13376|
|        so| 13323|
|      just| 13116|
|     cases| 12848|
|       now| 12733|
|        if| 12523|
|      will| 12361|
|       but| 11837|
|        or| 11559|
|      like| 11437|
|    people| 11211|
|      what| 11139|
|#monkeypox| 11114|
|       all| 10974|
|       was| 10850|


In [78]:
(spark_df)

DataFrame[_c0: string, user_screen_name: string, text: string, tweet_url: string, user_location: string, hashtags: string]
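
The blank token with 41,927 occurrences in the word counts is most likely a side effect of splitting on a single whitespace character: consecutive spaces or newlines produce empty strings. The plain-Python sketch below approximates the Spark logic on two made-up strings (it is a cross-check, not a replacement for the Spark job; `count_tokens` is a hypothetical helper) and shows that splitting on runs of whitespace and dropping empties removes the blank token.

```python
import re
from collections import Counter

def count_tokens(texts, pattern):
    """Lower-case each text, split on the regex, and tally the tokens."""
    counts = Counter()
    for text in texts:
        counts.update(re.split(pattern, text.lower()))
    return counts

# Made-up sample: a double space and a double newline.
sample_texts = ["Monkey pox  cases", "monkeypox cases\n\n#monkeypox"]

single = count_tokens(sample_texts, r"\s")    # same pattern as the Spark cell
runs = count_tokens(sample_texts, r"\s+")     # treat whitespace runs as one separator
runs.pop("", None)  # leading/trailing whitespace can still leave an empty token

print(single.most_common(3))
print(runs.most_common(3))
```

In PySpark, the equivalent change would be a `\s+` pattern followed by a filter on non-empty tokens; removing common stop words such as "the" and "to" would be a natural next step.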

# 6 | Deployment

See the draft Streamlit app here: https://slp22-data-engineering-project-streamlit-mpx-app-ckpzq2.streamlitapp.com/

Streamlit cheat sheet: https://daniellewisdl-streamlit-cheat-sheet-app-ytm9sg.streamlitapp.com/

In [None]:
# # wordcloud
# # https://www.geeksforgeeks.org/generating-word-cloud-python/
# from wordcloud import WordCloud, STOPWORDS
# comment_words = ''
# stopwords = set(STOPWORDS)

# # iterate through the tweet text
# for val in df.text:

#     # typecast each val to string
#     val = str(val)

#     # split the value
#     tokens = val.split()

#     # convert each token to lowercase
#     for i in range(len(tokens)):
#         tokens[i] = tokens[i].lower()

#     comment_words += " ".join(tokens)+" "

# wordcloud = WordCloud(width = 800, height = 800,
#                 background_color ='white',
#                 stopwords = stopwords,
#                 min_font_size = 10).generate(comment_words)

# # plot the WordCloud image
# plt.figure(figsize = (8, 8), facecolor = None)
# plt.imshow(wordcloud)
# plt.axis("off")
# plt.tight_layout(pad = 0)
# plt.title('Monkeypox Tweets')
# plt.savefig("monkeypox-tweets-word-cloud.jpeg");

# 7 | Testing/Robustness

[Python schedule](https://schedule.readthedocs.io/en/stable/examples.html#run-a-job-every-x-minute)