# General tasks and directions

- Add your name, today's date, and the assignment title to the designated cell.
- Write your answers in the cells that contain `Add your answer here.` line.
- Write your code in the cells that contain `# Add your implementation here.` line.
- Use autograder tests that are provided for your convenience.
- Don't change or delete any provided code (including [cell magic](https://ipython.readthedocs.io/en/stable/interactive/magics.html) such as `%%capture output`).


## Add your name, today's date, and the assignment title

author: Uddam Chea

date: 04/10/23

assignment: midterm


# Midterm project

This assignment is individual and you agree to submit your own work.


postgresql://@localhost/hydroxyThis assignment is based on data collected by Dr. Thuy for her research.

The focus of this assignment is data cleaning, preparation, and basis calculations. You have to deal with numbers, datetime values, and text.

You will clean tweets taken from https://github.com/thuydt02/HCQ for a text sentiment analysis problem. The dataset has been uploaded to the *hydroxy* database on our server and you have to retrieve the data from the table *tweets*. The database URL is `postgresql://@localhost/hydroxy`.

The dataset has approximately 164K tweets that were pulled from Twitter in 2020 and contain the word *Hydroxychroloquine*.

Dr. Thuy uses this dataset for her research to analyze the reactions and opinions of social network users on using the medication "Hydroxychloroqine" to treat COVID-19 disease but you can imagine it can be used for a variety of studies.

Dr. Thuy's paper can be found at https://arxiv.org/pdf/2201.00237.pdf

There are 10 manually-graded tasks, each worth 15 points. You should add a short description of your approach to each task. Note that occasional assertions are provided for your convenience but they are not used for grading.

All the necessary imports and some helper functions are provided for your convenience. Some of the pre-processing functions are based on Pythons `re` module ([re — Regular expression operations — Python 3.11.2 documentation](https://docs.python.org/3/library/re.html)), some use `nltk` package ([NLTK :: Natural Language Toolkit](https://www.nltk.org/index.html)).


In [1]:
import nltk
import re
import numpy as np
import pandas as pd
import sqlalchemy as sqla

from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.sentiment import SentimentIntensityAnalyzer
from pandas import Series, DataFrame

nltk.download('vader_lexicon')


def clean_html(sentence):
    """clean up html elements and entities: e.g. <html> </html> &nbsp;"""
    clean_re = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
    clean_text = re.sub(clean_re, ' ', str(sentence))
    return clean_text


def clean_punc(sentence):
    """clean the word of any punctuation or special characters"""
    cleaned = re.sub(r'[?|!|\'|"|#]',r' ', sentence)
    cleaned = re.sub(r'[.|,|:|;|)|(|\|/]',r' ', cleaned)
    cleaned = cleaned.strip()
    cleaned = cleaned.replace("\n"," ")
    return cleaned


def keep_alpha(sentence):
    alpha_sent = ""
    for word in sentence.split():
        alpha_word = re.sub('[^a-z A-Z]+', ' ', word)
        alpha_sent += alpha_word
        alpha_sent += " "
    alpha_sent = alpha_sent.strip()
    return alpha_sent


def remove_stopwords(sentence):
    """remove stop words"""
    stop_words = set(stopwords.words('english'))
    stop_words.update(['  ', 'zero','one','two','three','four','five','six','seven','eight','nine','ten','may','also','across','among','beside','however','yet','within'])
    re_stop_words = re.compile(r"\b(" + "|".join(stop_words) + ")\\W", re.I)
    return re_stop_words.sub(" ", sentence)


def stemming(sentence):
    """sentence stemering"""
    stem_sentence = ""
    stemmer = SnowballStemmer("english")
    for word in sentence.split():
        stem = stemmer.stem(word)
        stem_sentence += stem
        stem_sentence += " "
    stem_sentence = stem_sentence.strip()
    return stem_sentence


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/LC/cheara01/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


## Task 1

*Understanding the data set*

1. Read the whole dataset into a `DataFrame` `df`.
1. Explore the dataframe properties:
    - Shape
    - Data types of all columns
    - Statistics of all the numerical columns
1. Assign proper data types to all columns
    - Numerical
    - Text
    - Categorical
    - Date/time


In [2]:
#connecting to the database
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://@localhost/hydroxy")
query = "SELECT * FROM tweets"

with engine.begin() as conn:
  df = pd.read_sql_query(sql=text(query), con=conn)

In [3]:
assert df.shape == (164168, 12), df.shape

## Task 2

*Removing unnecessary columns*

As you can see, the following columns will not help a machine learning algorithm to learn:
- HYDROXYCHLOROQUINE: all the values in this columns are 1
- query_string: the URL link to the tweet in Twitter.

Delete these columns from the dataframe.

In [4]:
#since the columns that need to be dropped are already identified, I manually dropped them
df.drop(columns=["HYDROXYCHLOROQUINE", "query_string"], inplace=True)

In [5]:
assert df.shape == (164168, 10), df.shape

## Task 3

*Removing columns with missing values*

Delete all columns with more than 80% missing values.

In [6]:
#Taking from my own exercise (the one we did in class)
#this turns out to only drop one column
df.drop(df.columns[df.isna().sum()/df.shape[0] * 100 > 80], axis = 1, inplace=True)

In [7]:
assert df.shape == (164168, 9), df.shape

## Task 4

*Removing duplicates and irrelevant data*

- Two tweets are indentical if they have same `full_text`, `created_at`, `reply_count`, and `favorite_count`
- `full_text` alone cannot be used for comparison as other people can re-tweet.


In [8]:
#reading the drop_duplicates documentation, i use subset parameter to determine the drop criteria
df.drop_duplicates(subset=["full_text", "created_at", "reply_count", "favorite_count"], inplace=True)

In [9]:
assert df.shape == (162891, 9), df.shape

## Task 5

*Cleaning text*

- convert all text values to lower case
- strip leading and trailing spaces off all text values


In [10]:
# basic string lower and strip while only working on columns with "object" dtype
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.lower().str.strip()

## Task 6

*Cleaning tweets*

Process values in the `full_text` column as follows:
- keep letters only
- remove HTML tags and entities
- remove punctuation symbols
- remove stop words (words that have no contribution for sentiment identification of sentences)
- stem words: replace a word with its original version since they have the same meaning and sentiment in a sentence. For example, `happiness` is derived from `happy` so we should replace `happiness` with `happy`.

You are provided all the functions in the setting up cell. Call them for this task.

Note this task may take a while.

In [11]:
#need to download "stopwords" package
nltk.download("stopwords")

#defining a function to keep it clean, calling all the provided cleaning functions above
def process_text(sentence):
    processed_sentence = keep_alpha(sentence)
    processed_sentence = clean_html(sentence)
    processed_sentence = clean_punc(sentence)
    processed_sentence = remove_stopwords(sentence)
    processed_sentence = stemming(sentence)
    return processed_sentence

#applying the function
df["full_text"] = df["full_text"].apply(process_text)

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/LC/cheara01/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Task 7

*Adding state*

Add a new column, `state`, with its values derived from `user_location`.

We want to derive a state from `user_location`, but the data in `user_location` is very messy: some have city name only (e.g., Albany), some have city name and state (e.g. Albany, NY).

You should use the provided lookup table, *state_full.csv*. There are 2 columns in this file, `shortState` and `city`. Whenever an user location has a short state name or a city in columns `shortState` or `city`, you will fill the `state` column with the short state.

For example:

- `user_location` = 'albany, usa' => `state` = 'NY'
- `user_location` = 'boston, massachusetts' => `state` = 'MA'
- `user_location` = 'ames, ia' => `state` = 'IA'
- `user_location` = 'boston' => `state` = 'MA'


Note: this task make take a while.


In [12]:
#Read the lookup table
state_lookup = pd.read_csv('state_full.csv')

#To ensure consistency, i lower everything so that matches can be found since it's case sensitive
state_lookup['shortState'] = state_lookup['shortState'].str.lower()
state_lookup['city'] = state_lookup['city'].str.lower()

In [13]:
#This one is a bit tricky
def extract_state(location):
    if isinstance(location, str):
        # Split the user_location by comma as i give it higher priority than spaces
        location_parts = location.split(',')
        
        # Check if the user_location contains a state name or a city name
        for part in location_parts[::-1]:
            part = part.strip()
            #check for match in shortState
            match = state_lookup[state_lookup['shortState'] == part]
            
            if not match.empty:
                return match['shortState'].iloc[0]
            #check for match in city
            match = state_lookup[state_lookup['city'] == part]
            if not match.empty:
                return match['shortState'].iloc[0]
    return None

df['state'] = df['user_location'].apply(extract_state)
df['state'] = df['state'].str.upper()

In [14]:
assert df.shape == (162891, 10), df.shape

## Task 8

*Counting tweets*

Count number of tweets and sum up `favorite_count` by date. Store the result in a `DataFrame` `df_count_by_date`.
Sort the `DataFrame` in the descending order by the number of tweets.

Hint: use `groupby()` on `df` and then aggregate with `count` for `full_text` and `sum` for `favorite_count`.


In [15]:
#reading pd.dataframe there's an .agg i can tag onto the groupby
df_count_by_date = df.groupby("created_at").agg({"full_text": "count", "favorite_count": "sum"}).sort_values(by="full_text", ascending=False)

In [16]:
assert df_count_by_date.shape == (315, 2), df_count_by_date.shape

## Task 9

*Counting tweets again*

Count number of tweets and sum up `favorite_count` by state. Store the results in a `DataFrame` `df_count_by_state`.
Sort the `DataFrame` in the descending order by the number of tweets.

Hint: use `groupby()` on `df` and then aggregate with `count` for `full_text` and `sum` for `favorite_count`.


In [17]:
#same idea as task8
df_count_by_state = df.groupby('state').agg({'full_text': 'count', 'favorite_count': 'sum'}).sort_values('full_text', ascending=False)

In [18]:
assert df_count_by_state.shape == (51, 2), df_count_by_state.shape

## Task 10

*Finding top 10 tweets*

- Find top 10 tweets by `reply_count`, ordering from the highest to the lowest.
- Find top 10 tweets by `retweet_count`, ordering from the highest to the lowest.
- Find top 10 tweets by `favorite_count`, ordering from the highest to the lowest.

In [19]:
top10_replies = df.nlargest(10, "reply_count")
top10_replies

Unnamed: 0,full_text,created_at,user_location,friends_count,followers_count,reply_count,retweet_count,favorite_count,is_with_url,state
158193,"hydroxychloroquin &amp; azithromycin, taken to...",2020-03-21,"washington, dc",50,85725414,69341,101604,374415,0,DC
8926,i am take #hydroxychloroquin to treat my coron...,2020-07-31,,745,342719,29771,33410,69840,0,
6518,a surpris new studi found that the controversi...,2020-07-03,,1106,49712635,25817,23668,38343,1,
6890,the high respect henri ford health system just...,2020-07-07,"washington, dc",50,85725414,24714,48914,163587,0,DC
161265,do not listen to 45 when he suggest untest #hy...,2020-04-06,"los angeles/washington, d.c.",691,1468361,22788,15749,59642,0,DC
6998,obama hydroxychloroquin from 2008 https://t.co...,2020-07-11,underground bunker,6,2432611,19417,18276,35590,0,
8856,do you actual believ hydroxychloroquin is prob...,2020-07-30,"ohio, usa",1601,68915,19400,3307,21780,0,OH
163236,trump say jesus could have avoid crucifixion b...,2020-04-12,los angeles,1,28564321,17791,13896,108249,0,CA
10447,pleas watch high respect dr. harvey risch of y...,2020-08-24,"washington, dc",50,85725414,15948,53778,124556,0,DC
158503,🚨breaking: a man die &amp; his wife is in icu ...,2020-03-23,florida,389,317113,14391,8954,12417,1,FL


In [20]:
top10_retweets = df.nlargest(10, "retweet_count")
top10_retweets

Unnamed: 0,full_text,created_at,user_location,friends_count,followers_count,reply_count,retweet_count,favorite_count,is_with_url,state
158193,"hydroxychloroquin &amp; azithromycin, taken to...",2020-03-21,"washington, dc",50,85725414,69341,101604,374415,0,DC
158195,pleas don't take hydroxychloroquin (plaquenil)...,2020-03-21,republic of the philippines,335,27605,2501,69684,139878,0,
86990,trump kept tell us to take hydroxychloroquin a...,2020-10-06,eugene@coolquit.com,6154,524500,2558,63176,203950,0,
158251,pleas spread this now! hydroxychloroquin &amp;...,2020-03-21,"new york, ny",48,6703,1433,53780,89525,0,NY
10447,pleas watch high respect dr. harvey risch of y...,2020-08-24,"washington, dc",50,85725414,15948,53778,124556,0,DC
2921,how to hydroxychloroquin https://t.co/ymobdcfgx,2020-05-19,"brooklyn, ny",3005,2276258,5664,50913,206341,0,NY
6890,the high respect henri ford health system just...,2020-07-07,"washington, dc",50,85725414,24714,48914,163587,0,DC
6499,i want to ensur that everyon understand the gr...,2020-07-03,"manhattan, new york",4636,321973,2688,48780,85632,0,NY
6623,hydroxychloroquin work and has work the whole ...,2020-07-04,"manhattan, new york",4636,321973,1670,41374,107698,0,NY
8198,imagin if child porn was taken off social medi...,2020-07-29,"los angeles, ca",327,11610,1997,38429,123763,0,CA


In [21]:
top10_favorites = df.nlargest(10, "favorite_count")
top10_favorites

Unnamed: 0,full_text,created_at,user_location,friends_count,followers_count,reply_count,retweet_count,favorite_count,is_with_url,state
158193,"hydroxychloroquin &amp; azithromycin, taken to...",2020-03-21,"washington, dc",50,85725414,69341,101604,374415,0,DC
2921,how to hydroxychloroquin https://t.co/ymobdcfgx,2020-05-19,"brooklyn, ny",3005,2276258,5664,50913,206341,0,NY
86990,trump kept tell us to take hydroxychloroquin a...,2020-10-06,eugene@coolquit.com,6154,524500,2558,63176,203950,0,
6890,the high respect henri ford health system just...,2020-07-07,"washington, dc",50,85725414,24714,48914,163587,0,DC
46348,so trump is receiv regeneron polyclon antibodi...,2020-10-02,eugene@coolquit.com,6154,524500,1574,37584,153426,0,
158195,pleas don't take hydroxychloroquin (plaquenil)...,2020-03-21,republic of the philippines,335,27605,2501,69684,139878,0,
84270,the presid is receiv multipl medications. it n...,2020-10-05,,870,1397401,2228,22678,135147,0,
63823,hydroxychloroquin stand back and stand by,2020-10-02,"los angeles, california",6315,115264,855,18917,126349,0,CA
10447,pleas watch high respect dr. harvey risch of y...,2020-08-24,"washington, dc",50,85725414,15948,53778,124556,0,DC
8198,imagin if child porn was taken off social medi...,2020-07-29,"los angeles, ca",327,11610,1997,38429,123763,0,CA


## Submission Checklist

- [ ] Your name, today's date, and the assignment title in the designated cell.
- [ ] Your answers in the designated cells (if required).
- [ ] Your code runs and produces the expected output.
- [ ] The validity of your code is verified by autograders (if provided).
- [ ] Restart the kernel and run all cells (in the menubar, select *Kernel*, then *Restart Kernel and Run All Cells*).
- [ ] Save the notebook.
- [ ] Submit the assignment.
