<a href="https://colab.research.google.com/github/slp22/data-engineering-project/blob/main/hydrating_engineering_monkeypox_mvp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Data Engineering | MVP

# Monkeypox Tweets

## Imports

In [1]:
import json
import logging
import sqlite3
import matplotlib.pyplot as plt
import numpy as np
import os, shutil, itertools
import pandas as pd
from pathlib import Path
import pickle
import PIL
import random
import seaborn as sns
import sklearn as sk
import warnings
import zipfile

from sqlite3 import connect


In [7]:
# Google Drive imports/authorization

# mount drive
from google.colab import drive
drive.mount('/content/drive')

# https://colab.research.google.com/notebooks/snippets/sheets.ipynb#scrollTo=JiJVCmu3dhFa

# authorize access 
from google.colab import auth
auth.authenticate_user()

# read in from Google Sheets

import gspread
from google.auth import default
creds, _ = default()

gc = gspread.authorize(creds)


## 1 | Research Design


* **Research Question:** 
* **Impact Hypothesis:** 
* **Data source:** 

* **Data Dictionary:**


## 2 | Data Ingestion

#### 1. [Twitter Dataset on the 2022 MonkeyPox Outbreak](https://www.kaggle.com/datasets/thakurnirmalya/monkeypox2022tweets) (the dataset is a list of tweet IDs)

#### 2. [Twitter Hydrating](https://towardsdatascience.com/learn-how-to-easily-hydrate-tweets-a0f393ed340e#:~:text=Hydrating%20Tweets) with [DocNow Hydrator](https://github.com/DocNow/hydrator/releases)

#### 3. Import [hydrated tweets](https://drive.google.com/drive/folders/1NbddxuSF3v5YuOgjvA1G4WgfPUlKfiul?usp=sharing) from Google Drive to Colab

In [None]:
# worksheets = ['TweetIDs_Part1', 'TweetIDs_Part2', 'TweetIDs_Part3', 'TweetIDs_Part4', 'TweetIDs_Part5', 'TweetIDs_Part6']

worksheet_1 = gc.open('TweetIDs_Part1').sheet1
worksheet_2 = gc.open('TweetIDs_Part2').sheet1
worksheet_3 = gc.open('TweetIDs_Part3').sheet1
worksheet_4 = gc.open('TweetIDs_Part4').sheet1
worksheet_5 = gc.open('TweetIDs_Part5').sheet1
worksheet_6 = gc.open('TweetIDs_Part6').sheet1

# get_all_values gives a list of rows
rows_1 = worksheet_1.get_all_values()
rows_2 = worksheet_2.get_all_values()
rows_3 = worksheet_3.get_all_values()
rows_4 = worksheet_4.get_all_values()
rows_5 = worksheet_5.get_all_values()
rows_6 = worksheet_6.get_all_values()

# Convert each list of rows to a DataFrame
tweets_1 = pd.DataFrame.from_records(rows_1)
tweets_2 = pd.DataFrame.from_records(rows_2)
tweets_3 = pd.DataFrame.from_records(rows_3)
tweets_4 = pd.DataFrame.from_records(rows_4)
tweets_5 = pd.DataFrame.from_records(rows_5)
tweets_6 = pd.DataFrame.from_records(rows_6)
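
The six per-sheet frames can also be built with one helper and stacked into a single frame. The sketch below assumes the first row of each sheet holds the column names, which this notebook does not confirm. `rows_to_frame` and the `sample_rows_*` lists are hypothetical stand-ins for the `get_all_values()` results, so the example runs without Google credentials.

```python
import pandas as pd


def rows_to_frame(rows):
    """Build a DataFrame from a list of rows, using the first row as the header.

    Assumption: rows[0] contains the column names.
    """
    header, *records = rows
    return pd.DataFrame.from_records(records, columns=header)


# Hypothetical stand-ins for worksheet.get_all_values() results
sample_rows_1 = [["id", "text"], ["1", "first tweet"], ["2", "second tweet"]]
sample_rows_2 = [["id", "text"], ["3", "third tweet"]]

# Stack the per-sheet frames into one frame with a fresh index
combined = pd.concat(
    [rows_to_frame(sample_rows_1), rows_to_frame(sample_rows_2)],
    ignore_index=True,
)
print(combined.shape)
```

Note that every value arrives as a string, so numeric columns would still need an explicit conversion.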



In [37]:
print('tweets_1', tweets_1.shape)
print('tweets_2', tweets_2.shape)
print('tweets_3', tweets_3.shape)
print('tweets_4', tweets_4.shape)
print('tweets_5', tweets_5.shape)
print('tweets_6', tweets_6.shape)

tweets_1 (12656, 35)
tweets_2 (15294, 35)
tweets_3 (15140, 35)
tweets_4 (16874, 35)
tweets_5 (41280, 35)
tweets_6 (127941, 35)


In [None]:
tweets_1.head(2)

In [None]:
tweets_2.head(2)

In [None]:
tweets_3.head(2)

In [None]:
tweets_4.head(2)

In [None]:
tweets_5.head(2)

In [None]:
tweets_6.head(2)

## 3 | Exploratory Data Analysis

### First glance

In [None]:
df = pd.read_csv('/content/monkeypox.csv')

In [None]:
df.head(2)

Unnamed: 0,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,...,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest
0,1555815462201872385,1555815462201872385,2022-08-06 12:48:06 India Standard Time,2022-08-06,12:48:06,530,820113517613154304,thetenth2022,TheTenth,,...,,,,,,[],,,,
1,1555815458602831872,1555815458602831872,2022-08-06 12:48:05 India Standard Time,2022-08-06,12:48:05,530,196518052,ashemedai,Jeroen Ruigrok van der Werven,,...,,,,,,[],,,,


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6859 entries, 0 to 6858
Data columns (total 36 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               6859 non-null   int64  
 1   conversation_id  6859 non-null   int64  
 2   created_at       6859 non-null   object 
 3   date             6859 non-null   object 
 4   time             6859 non-null   object 
 5   timezone         6859 non-null   int64  
 6   user_id          6859 non-null   int64  
 7   username         6859 non-null   object 
 8   name             6859 non-null   object 
 9   place            2 non-null      object 
 10  tweet            6859 non-null   object 
 11  language         6859 non-null   object 
 12  mentions         6859 non-null   object 
 13  urls             6859 non-null   object 
 14  photos           6859 non-null   object 
 15  replies_count    6859 non-null   int64  
 16  retweets_count   6859 non-null   int64  
 17  likes_count   

In [None]:
df.describe()

Unnamed: 0,id,conversation_id,timezone,user_id,replies_count,retweets_count,likes_count,video,near,geo,source,user_rt_id,user_rt,retweet_id,retweet_date,translate,trans_src,trans_dest
count,6859.0,6859.0,6859.0,6859.0,6859.0,6859.0,6859.0,6859.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,1.555764e+18,1.555192e+18,530.0,7.224884e+17,0.399038,0.691938,3.053944,0.095932,,,,,,,,,,
std,25414980000000.0,1.168003e+16,0.0,6.620987e+17,1.789475,11.879167,40.567536,0.29452,,,,,,,,,,
min,1.555725e+18,8.510359e+17,530.0,39893.0,0.0,0.0,0.0,0.0,,,,,,,,,,
25%,1.555742e+18,1.55572e+18,530.0,490658200.0,0.0,0.0,0.0,0.0,,,,,,,,,,
50%,1.55576e+18,1.555746e+18,530.0,8.977771e+17,0.0,0.0,0.0,0.0,,,,,,,,,,
75%,1.555784e+18,1.555774e+18,530.0,1.395422e+18,0.0,0.0,1.0,0.0,,,,,,,,,,
max,1.555815e+18,1.555815e+18,530.0,1.555798e+18,68.0,740.0,2180.0,1.0,,,,,,,,,,


In [None]:
cols_list = list(df.columns)
cols_list

['id',
 'conversation_id',
 'created_at',
 'date',
 'time',
 'timezone',
 'user_id',
 'username',
 'name',
 'place',
 'tweet',
 'language',
 'mentions',
 'urls',
 'photos',
 'replies_count',
 'retweets_count',
 'likes_count',
 'hashtags',
 'cashtags',
 'link',
 'retweet',
 'quote_url',
 'video',
 'thumbnail',
 'near',
 'geo',
 'source',
 'user_rt_id',
 'user_rt',
 'retweet_id',
 'reply_to',
 'retweet_date',
 'translate',
 'trans_src',
 'trans_dest']

In [None]:
df['tweet']

0       So has anyone begun to compile 'here's the sta...
1       Getting some groceries, topic shifted to covid...
2       "Illinois Children's Daycare Worker Tests Posi...
3       Illinois daycare worker tests positive for mon...
4       @Natrone86 @JunotIsrael @elcavaqueen @thechicc...
                              ...                        
6854    @hurtmeknots With Monkeypox on the rise, maybe...
6855    @FITNESSSF can you start filling up the disinf...
6856       @politvidchannel 71% already have monkey pox .
6857                            RT if you have #monkeypox
6858    @POTUS @VP 🥰 Dems 🇺🇸💙 are having the best jobs...
Name: tweet, Length: 6859, dtype: object

* RangeIndex: 6859 entries, 0 to 6858
* Data columns (total 36 columns)

### Corpus `tweets`

In [None]:
tweets = df[['language','date','username','hashtags','tweet']]
tweets.head(2)

Unnamed: 0,language,date,username,hashtags,tweet
0,en,2022-08-06,thetenth2022,[],So has anyone begun to compile 'here's the sta...
1,en,2022-08-06,ashemedai,[],"Getting some groceries, topic shifted to covid..."


In [None]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6859 entries, 0 to 6858
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   language  6859 non-null   object
 1   date      6859 non-null   object
 2   username  6859 non-null   object
 3   hashtags  6859 non-null   object
 4   tweet     6859 non-null   object
dtypes: object(5)
memory usage: 268.1+ KB


In [None]:
tweets['username'].nunique()

6028

In [None]:
# How many languages? 
tweets['language'].unique()

array(['en', 'kn', 'de', 'in', 'tl', 'fr', 'pt', 'te', 'tr', 'it', 'qme',
       'bn', 'qht', 'pl', 'es', 'el', 'nl', 'cy', 'ta', 'hi', 'sv', 'ja',
       'th', 'mr', 'et', 'gu', 'da', 'ro', 'ml', 'zxx', 'und', 'pa', 'ur',
       'ko', 'am', 'fi', 'zh', 'lt', 'hu', 'ru', 'ar', 'si'], dtype=object)

In [None]:
# How many tweets in English?
print('English entries:', (tweets['language'] == 'en').sum())

# How many tweets in other languages?
print('Spanish entries:', (tweets['language'] == 'es').sum())
print('Italian entries:', (tweets['language'] == 'it').sum())

English entries: 6410
Spanish entries: 51
Italian entries: 10
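
To see the count for every language at once rather than one language per line, `value_counts()` is a possible shortcut. This is a minimal sketch on a hypothetical `sample` frame, not the notebook's data.

```python
import pandas as pd

# Hypothetical sample with the same 'language' column as the corpus
sample = pd.DataFrame({"language": ["en", "en", "es", "it", "en"]})

# One row per language, sorted from most to least frequent
language_counts = sample["language"].value_counts()
print(language_counts)

# .get() avoids a KeyError for languages that never occur
print("French entries:", language_counts.get("fr", 0))
```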


In [None]:
# Keep only English-language tweets
tweets = tweets[tweets['language'] == 'en']

In [None]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6410 entries, 0 to 6858
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   language  6410 non-null   object
 1   date      6410 non-null   object
 2   username  6410 non-null   object
 3   hashtags  6410 non-null   object
 4   tweet     6410 non-null   object
dtypes: object(5)
memory usage: 300.5+ KB


In [None]:
tweets.head(2)

Unnamed: 0,language,date,username,hashtags,tweet
0,en,2022-08-06,thetenth2022,[],So has anyone begun to compile 'here's the sta...
1,en,2022-08-06,ashemedai,[],"Getting some groceries, topic shifted to covid..."


In [None]:
# Drop language column since we have only English tweets
tweets = tweets.drop(columns=['language'])

In [None]:
tweets.head(10)

Unnamed: 0,date,username,hashtags,tweet
0,2022-08-06,thetenth2022,[],So has anyone begun to compile 'here's the sta...
1,2022-08-06,ashemedai,[],"Getting some groceries, topic shifted to covid..."
2,2022-08-06,democracymotion,[],"""Illinois Children's Daycare Worker Tests Posi..."
3,2022-08-06,thegoogle93,[],Illinois daycare worker tests positive for mon...
4,2022-08-06,bufflosouljah1,[],@Natrone86 @JunotIsrael @elcavaqueen @thechicc...
5,2022-08-06,democracymotion,[],"""CDC Recommends Limiting Sex Partners to Avoid..."
6,2022-08-06,salemjakes,[],"1,058,637 Americans have died in the U.S. Covi..."
7,2022-08-06,fineassbrei,[],MF didn’t know shit about monkey pox in June c...
8,2022-08-06,xenohadi,[],Three people of African origin still admitted ...
11,2022-08-06,geopoliticsind,[],RT-PCR detection kit for human monkeypox virus...


In [None]:
# Save corpus
tweets.to_pickle('/content/tweets.pkl')
tweets.to_csv(r'/content/tweets.csv', index=False)
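
The `hashtags` column has dtype `object` and holds text such as `[]` or `['monkeypox', 'besafe']`, so each value is a string rather than a Python list. If real lists are needed later, one option is `ast.literal_eval`. The sketch below uses a hypothetical `sample` frame, and the `hashtag_list` column name is an assumption.

```python
import ast

import pandas as pd

# Hypothetical sample mirroring how 'hashtags' looks after reading the CSV
sample = pd.DataFrame(
    {
        "username": ["user_a", "user_b", "user_c"],
        "hashtags": ["[]", "['monkeypox', 'besafe']", "['monkeypox']"],
    }
)

# Parse each string into a real list without executing arbitrary code
sample["hashtag_list"] = sample["hashtags"].apply(ast.literal_eval)

# One row per hashtag; empty lists become NaN and are dropped
hashtag_counts = sample["hashtag_list"].explode().dropna().value_counts()
print(hashtag_counts)
```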

In [None]:
# # copy corpus csv to Google Drive for Tableau
# shutil.copyfile('/content/tweets.csv', '/content/drive/MyDrive/tweets.csv')

## 4 | Storage

#### SQL Database `monkeypox.db`

##### helper functions

In [None]:
# https://towardsdatascience.com/have-a-sql-interview-coming-up-ace-it-using-google-colab-6d3c0ffb29dc

def pd_to_sqlDB(input_df: pd.DataFrame,
                table_name: str,
                db_name: str = 'default.db') -> None:

    # # Setup local logging
    # logging.basicConfig(level=logging.INFO,
    #                     format='%(asctime)s %(levelname)s: %(message)s',
    #                     datefmt='%Y-%m-%d %H:%M:%S')

    # Find columns in the dataframe
    cols = input_df.columns
    cols_string = ','.join(cols)
    val_wildcard_string = ','.join(['?'] * len(cols))

    # Connect to a DB file if it exists, else create a new file
    con = sqlite3.connect(db_name)
    cur = con.cursor()
    # logging.info(f'SQL DB {db_name} created')

    # Create table
    sql_string = f"""CREATE TABLE {table_name} ({cols_string});"""
    cur.execute(sql_string)
    # logging.info(f'SQL Table {table_name} created with {len(cols)} columns')

    # Upload df
    rows_to_upload = input_df.to_dict(orient='split')['data']
    sql_string = f"""INSERT INTO {table_name} ({cols_string}) VALUES ({val_wildcard_string});"""    
    cur.executemany(sql_string, rows_to_upload)
    # logging.info(f'{len(rows_to_upload)} rows uploaded to {table_name}')
  
    # Commit the changes and close the connection
    con.commit()
    con.close()
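
Because the helper issues a plain `CREATE TABLE`, running it a second time against the same database file raises an error once the table exists. A possible alternative is pandas' built-in `DataFrame.to_sql` with `if_exists='replace'`. The sketch below uses an in-memory database and a hypothetical `sample` frame so that it runs without touching `monkeypox.db`.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Hypothetical sample with the same columns as the corpus
sample = pd.DataFrame(
    {
        "date": ["2022-08-06", "2022-08-06"],
        "username": ["user_a", "user_b"],
        "hashtags": ["[]", "['monkeypox']"],
        "tweet": ["first tweet", "second tweet"],
    }
)

# closing() guarantees the connection is closed even if an error occurs
with closing(sqlite3.connect(":memory:")) as con:
    # 'replace' drops and recreates the table, so re-running is safe
    sample.to_sql("tweets", con, if_exists="replace", index=False)
    sample.to_sql("tweets", con, if_exists="replace", index=False)
    row_count = con.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]

print(row_count)
```

With `if_exists='replace'` the second call overwrites the table instead of appending to it, so the row count stays equal to the length of the frame.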

In [None]:
# https://towardsdatascience.com/have-a-sql-interview-coming-up-ace-it-using-google-colab-6d3c0ffb29dc

def sql_query_to_pd(sql_query_string: str, db_name: str = 'monkeypox.db') -> pd.DataFrame:
    
    # Connect to the SQL DB
    con = sqlite3.connect(db_name)

    # Execute the SQL query
    cursor = con.execute(sql_query_string)

    # Fetch the data and column names
    result_data = cursor.fetchall()
    cols = [description[0] for description in cursor.description]

    # Close the connection
    con.close()

    # Return as df
    return pd.DataFrame(result_data, columns=cols)
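
The same query-to-DataFrame step can also be done with `pd.read_sql_query`, which accepts bound parameters. This is a sketch under assumptions: the in-memory database, the `sample` frame, and the aggregation query are illustrative and not part of the notebook.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Hypothetical sample rows
sample = pd.DataFrame(
    {
        "date": ["2022-08-06", "2022-08-06", "2022-08-06", "2022-08-05"],
        "username": ["a", "a", "b", "c"],
        "tweet": ["t1", "t2", "t3", "t4"],
    }
)

query = """
    SELECT username, COUNT(*) AS n_tweets
    FROM tweets
    WHERE date = ?
    GROUP BY username
    ORDER BY n_tweets DESC, username
"""

with closing(sqlite3.connect(":memory:")) as con:
    sample.to_sql("tweets", con, index=False)
    # The '?' placeholder is filled from params, avoiding string formatting
    per_user = pd.read_sql_query(query, con, params=("2022-08-06",))

print(per_user)
```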

##### sql_to_df

In [None]:
# https://towardsdatascience.com/have-a-sql-interview-coming-up-ace-it-using-google-colab-6d3c0ffb29dc

# Read csv as df
input_df = pd.read_csv('/content/tweets.csv')

# Upload df to a SQL table
pd_to_sqlDB(input_df,
            table_name='tweets',
            db_name='monkeypox.db')

# Write SQL query in a string variable
sql_query_string = """
    SELECT *
    FROM tweets
"""
# Execute SQL query
corpus = sql_query_to_pd(sql_query_string, db_name='monkeypox.db')
corpus

Unnamed: 0,date,username,hashtags,tweet
0,2022-08-06,thetenth2022,[],So has anyone begun to compile 'here's the sta...
1,2022-08-06,ashemedai,[],"Getting some groceries, topic shifted to covid..."
2,2022-08-06,democracymotion,[],"""Illinois Children's Daycare Worker Tests Posi..."
3,2022-08-06,thegoogle93,[],Illinois daycare worker tests positive for mon...
4,2022-08-06,bufflosouljah1,[],@Natrone86 @JunotIsrael @elcavaqueen @thechicc...
...,...,...,...,...
6405,2022-08-06,arkcowgirl62,[],"@hurtmeknots With Monkeypox on the rise, maybe..."
6406,2022-08-06,speehanagram,"['monkeypox', 'besafe']",@FITNESSSF can you start filling up the disinf...
6407,2022-08-06,pritzkertoilet,[],@politvidchannel 71% already have monkey pox .
6408,2022-08-06,thumbressler,['monkeypox'],RT if you have #monkeypox


* **database = `monkeypox.db`**
* **table_1 = `tweets` (6410 rows × 4 columns), read back into the DataFrame `corpus`**


## 5 | Processing

[Spark](https://app.thisismetis.com/courses/211/pages/home-introduction-to-spark) [PySpark](https://app.thisismetis.com/courses/211/pages/home-pyspark-lab)

* natural language processing, topic modeling (LDA)  
* symptom trends across time + moving average (% or |n|) 
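
The symptom-trend idea can be sketched with plain pandas: flag tweets that mention a keyword, count them per day, then smooth with a rolling mean. The `sample` frame, the keyword `rash`, and the 2-day window are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample spanning several days
sample = pd.DataFrame(
    {
        "date": [
            "2022-08-01", "2022-08-01",
            "2022-08-02",
            "2022-08-03", "2022-08-03", "2022-08-03",
        ],
        "tweet": [
            "fever and rash", "no symptoms here",
            "rash on my arm",
            "rash everywhere", "new Rash today", "just news",
        ],
    }
)
sample["date"] = pd.to_datetime(sample["date"])

# Flag tweets mentioning the (assumed) symptom keyword, case-insensitively
sample["mentions_rash"] = sample["tweet"].str.contains("rash", case=False)

# Daily count of matching tweets
daily = sample.groupby("date")["mentions_rash"].sum()

# 2-day moving average; min_periods=1 keeps the first day instead of NaN
moving_avg = daily.rolling(window=2, min_periods=1).mean()
print(moving_avg)
```

Dividing `daily` by the total number of tweets per day before smoothing would give the percentage variant mentioned above.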

In [None]:
# # copy csv to Google Drive for Tableau connect
# shutil.copyfile('/content/tweets.csv', '/content/drive/MyDrive/tweets.csv')

## 6 | Deployment

* [gdrive <> Tableau](https://medium.com/mlearning-ai/how-to-connect-tableau-to-google-drive-99bdbca1d3f7) 
* [Tableau draft mpox](https://public.tableau.com/views/mpoxdraft/Sheet1?:language=en-US&publish=yes&:display_count=n&:origin=viz_share_link)
* [Tableau embedding](https://www.tableau.com/learn/webinars/live-training-embedded-analytics#video)


[Streamlit](https://app.thisismetis.com/courses/211/pages/home-streamlit) [Flask](https://app.thisismetis.com/courses/211/pages/home-flask-web-apps)

## 7 | Testing/Robustness

[Python schedule](https://schedule.readthedocs.io/en/stable/examples.html#run-a-job-every-x-minute)