<a href="https://colab.research.google.com/github/slp22/data-engineering-project/blob/main/engineering-monkeypox-tweets-mvp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Data Engineering | MVP

# Monkeypox Tweets

## Imports

In [37]:
import json
import matplotlib.pyplot as plt
import numpy as np
import os, shutil, itertools
import pandas as pd
from pathlib import Path
import pickle
import PIL
import random
import seaborn as sns
import sklearn as sk
import tensorflow as tf
import warnings
import zipfile

## 1 | Research Design


* **Research Question:** How well can a neural network diagnose diabetic retinopathy from a retinal image?
* **Impact Hypothesis:** Accelerate the National Eye Institute’s research evaluation of retinal clinical trial data, and streamline publishing results.
* **Data source:** [Diabetic Retinopathy 2015 Data Colored Resized](https://www.kaggle.com/datasets/sovitrath/diabetic-retinopathy-2015-data-colored-resized), n=35,126
* **Error metric:** Accuracy

* **Data Dictionary:**
  * Classes = 5 stages of diabetic retinopathy:
    * **Normal eye**
    * **Mild** Nonproliferative Retinopathy: Microaneurysms are visible, small areas of balloon-like swelling in the retina's tiny blood vessels.
    * **Moderate** Nonproliferative Retinopathy: Some blood vessels that nourish the retina are blocked.
    * **Severe** Nonproliferative Retinopathy: More blocked blood vessels, depriving several areas of the retina of blood supply; retina sends signals to the body to grow new blood vessels for nourishment.
    * **Proliferative** Retinopathy: Advanced stage; new blood vessels are abnormal and fragile; grow along the retina and along the surface of the clear, vitreous gel that fills the inside of the eye.


## 2 | Dataset: [Monkeypox Tweets](https://www.kaggle.com/datasets/aneeshtickoo/tweets-on-monkeypox)

### Data Download

In [19]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [38]:
os.chdir("/content/drive/My Drive/")
os.listdir()

['1 projects',
 '3 life',
 '2 library',
 'vault',
 'Colab Notebooks',
 'kaggle.json']

In [39]:
with open('kaggle.json') as json_file:
    kaggle = json.load(json_file)

In [40]:
# point the Kaggle CLI at the directory that holds kaggle.json
os.environ['KAGGLE_CONFIG_DIR'] = "/content"

In [45]:
# download dataset from kaggle
# restrict the credentials file to owner read/write before calling the CLI
! chmod 600 /content/kaggle.json
! kaggle datasets download -d aneeshtickoo/tweets-on-monkeypox

tweets-on-monkeypox.zip: Skipping, found more recently modified local copy (use --force to force download)
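The Kaggle CLI looks for `kaggle.json` in the directory named by `KAGGLE_CONFIG_DIR` and warns when the file is readable by other users. A minimal sketch of staging the credentials file in one step, assuming it sits in the mounted Drive folder listed above; the helper name `stage_kaggle_credentials` and its arguments are illustrative and not part of the original notebook.

```python
import os
import shutil
import stat
from pathlib import Path


def stage_kaggle_credentials(src, config_dir):
    """Copy kaggle.json into config_dir and restrict it to owner read/write.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    src = Path(src)
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)

    dest = config_dir / "kaggle.json"
    shutil.copyfile(src, dest)
    # Same effect as `chmod 600`: owner may read and write, nobody else.
    dest.chmod(stat.S_IRUSR | stat.S_IWUSR)

    # Point the Kaggle CLI at the directory that now holds the file.
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_dir)
    return dest
```

In this notebook the call would presumably be `stage_kaggle_credentials('/content/drive/My Drive/kaggle.json', '/content')`, run once before the download cell.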


In [46]:
# unzip kaggle file; the context manager closes the archive automatically
with zipfile.ZipFile('tweets-on-monkeypox.zip', 'r') as zip_ref:
    zip_ref.extractall('/content')

## 3 | Exploratory Data Analysis

In [47]:
df = pd.read_csv('/content/monkeypox.csv')

In [49]:
df.head(2)

| | id | conversation_id | created_at | date | time | timezone | user_id | username | name | place | ... | geo | source | user_rt_id | user_rt | retweet_id | reply_to | retweet_date | translate | trans_src | trans_dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1555815462201872385 | 1555815462201872385 | 2022-08-06 12:48:06 India Standard Time | 2022-08-06 | 12:48:06 | 530 | 820113517613154304 | thetenth2022 | TheTenth | | ... | | | | | | [] | | | | |
| 1 | 1555815458602831872 | 1555815458602831872 | 2022-08-06 12:48:05 India Standard Time | 2022-08-06 | 12:48:05 | 530 | 196518052 | ashemedai | Jeroen Ruigrok van der Werven | | ... | | | | | | [] | | | | |


In [None]:
df.info()

In [None]:
df.describe()

In [None]:
cols_list = list(df.columns)
cols_list

In [56]:
cols_list

['id',
 'conversation_id',
 'created_at',
 'date',
 'time',
 'timezone',
 'user_id',
 'username',
 'name',
 'place',
 'tweet',
 'language',
 'mentions',
 'urls',
 'photos',
 'replies_count',
 'retweets_count',
 'likes_count',
 'hashtags',
 'cashtags',
 'link',
 'retweet',
 'quote_url',
 'video',
 'thumbnail',
 'near',
 'geo',
 'source',
 'user_rt_id',
 'user_rt',
 'retweet_id',
 'reply_to',
 'retweet_date',
 'translate',
 'trans_src',
 'trans_dest']

In [None]:
df['tweet']

* RangeIndex: 6859 entries, 0 to 6858
* Data columns (total 36 columns)
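Several columns in the `head` output above (`place`, `geo`, `source`, and the retweet and translation fields) are empty in the first rows. A sketch of how one might rank columns by their share of missing values before choosing a corpus, assuming pandas; the helper `missing_share` and the toy frame are illustrative, and none of the values come from the dataset.

```python
import numpy as np
import pandas as pd


def missing_share(frame):
    """Return the fraction of missing values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)


# Toy frame standing in for `df`; the values are made up for illustration.
toy = pd.DataFrame({
    "tweet": ["a", "b", "c", "d"],
    "place": [np.nan, np.nan, np.nan, "x"],
    "geo": [np.nan, np.nan, np.nan, np.nan],
})

print(missing_share(toy))
```

Applied to the real frame, `missing_share(df)` would show at a glance which of the 36 columns carry enough data to be worth keeping.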

### Select corpus: `tweet_df`

In [96]:
tweet_df = df[['language','date','username','tweet']]
tweet_df.head(2)

| | language | date | username | tweet |
|---|---|---|---|---|
| 0 | en | 2022-08-06 | thetenth2022 | So has anyone begun to compile 'here's the sta... |
| 1 | en | 2022-08-06 | ashemedai | Getting some groceries, topic shifted to covid... |


In [97]:
tweet_df['username'].nunique()

6028

In [98]:
tweet_df['language'].unique()


array(['en', 'kn', 'de', 'in', 'tl', 'fr', 'pt', 'te', 'tr', 'it', 'qme',
       'bn', 'qht', 'pl', 'es', 'el', 'nl', 'cy', 'ta', 'hi', 'sv', 'ja',
       'th', 'mr', 'et', 'gu', 'da', 'ro', 'ml', 'zxx', 'und', 'pa', 'ur',
       'ko', 'am', 'fi', 'zh', 'lt', 'hu', 'ru', 'ar', 'si'], dtype=object)

In [99]:
print('English entries:', (tweet_df['language'] == 'en').sum())
print('Spanish entries:', (tweet_df['language'] == 'es').sum())
print('Italian entries:', (tweet_df['language'] == 'it').sum())


English entries: 6410
Spanish entries: 51
Italian entries: 10
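The per-language filters above each scan the column once. As an alternative sketch, `value_counts` tallies every language code in a single pass; the toy series below is made up and only illustrates the call.

```python
import pandas as pd

# Toy language column; in the notebook this would be tweet_df['language'].
languages = pd.Series(["en", "en", "es", "en", "it", "es"], name="language")

# One pass over the column, sorted from most to least frequent.
counts = languages.value_counts()
print(counts.head(3))

# Look up individual codes; .get returns the default for a code that never occurs.
english = counts.get("en", 0)
german = counts.get("de", 0)
```

This also surfaces placeholder codes such as `und`, `zxx`, `qme`, and `qht`, which appear in the `unique()` output and are easy to overlook when checking languages one at a time.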


In [104]:
# keep English tweets only; filter from df so the cell can be rerun safely
tweet_df = df.loc[df['language'] == 'en', ['language', 'date', 'username', 'tweet']].copy()
tweet_df.info()

In [101]:
tweet_df.head(2)

| | language | date | username | tweet |
|---|---|---|---|---|
| 0 | en | 2022-08-06 | thetenth2022 | So has anyone begun to compile 'here's the sta... |
| 1 | en | 2022-08-06 | ashemedai | Getting some groceries, topic shifted to covid... |


In [103]:
# drop the language column; errors='ignore' keeps the cell safe to rerun
tweet_df = tweet_df.drop(columns=['language'], errors='ignore')
tweet_df.head(2)

In [105]:
# save corpus selection as tweet_df
tweet_df.to_pickle('/content/tweet_df.pkl')
tweet_df.to_csv(r'/content/drive/MyDrive/tweet_df.csv', index=False)
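Filtering on `language` after that column has been dropped raises a `KeyError`, so the selection steps in this section are sensitive to the order in which cells run. A sketch of folding them into one function that always starts from the full frame, assuming pandas; `select_english_corpus` is a hypothetical name and the toy rows are made up.

```python
import pandas as pd


def select_english_corpus(frame):
    """Return date, username and tweet for rows whose language is 'en'.

    Hypothetical helper; it always reads from the full frame, so it can be
    rerun without depending on earlier cell state. Unlike the cells above,
    it resets the index of the result.
    """
    is_english = frame["language"] == "en"
    return frame.loc[is_english, ["date", "username", "tweet"]].reset_index(drop=True)


# Toy frame standing in for `df`; rows are made up for illustration.
toy = pd.DataFrame({
    "language": ["en", "es", "en"],
    "date": ["2022-08-06", "2022-08-06", "2022-08-05"],
    "username": ["user_a", "user_b", "user_c"],
    "tweet": ["first", "segundo", "third"],
    "geo": [None, None, None],
})

corpus = select_english_corpus(toy)
# Calling it again on the same input gives the same result.
assert corpus.equals(select_english_corpus(toy))
```

In the notebook, `tweet_df = select_english_corpus(df)` would replace the separate select, filter, and drop cells before the save step.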