
# DS320 Spring 2023: Midterm Project

Due on Mon 04/03/23 at 8:00 AM

The focus of this assignment is data <i><b>cleaning, preparation, and basic calculations</b></i>: dealing with numbers, datetime values, and text.

Before working on this assignment, review the data cleaning and preparation pipeline I drew on the whiteboard and the "data cleaning" slide (make sure you are clear about the 5 steps there).

You will clean tweets taken from https://github.com/thuydt02/HCQ for a text sentiment analysis problem.
You need to download the dataset from this link, unzip it, and work on the `Full_Tweet_to_github.csv` file.
Do not upload the dataset to your Jupyter Luther since it is big. You need to download this notebook and work on it with Google Colab. I am trying to find a way to upload the dataset to Jupyter Luther. If it works (likely), I will let you know. But be prepared to work in Google Colab.

The dataset has roughly 164K tweets. These tweets were pulled from Twitter and satisfy two conditions:

1. they were created in the year 2020
2. they contain the keyword "Hydroxychloroquine"

I use this dataset for my research. I want to analyze the reactions and opinions of social network users on using the medication "Hydroxychloroquine" to treat COVID-19.

See more about the paper: https://arxiv.org/pdf/2201.00237.pdf

There are 10 tasks, each worth 10 points.

Note: I will manually grade your code, so no test cases will be provided, but I will give the expected outcome for each task where I can.


In [1]:
#setting up: You have to run this code cell first to define the helper functions.

import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.sentiment import SentimentIntensityAnalyzer
#from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

import re
import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")

nltk.download('vader_lexicon')

#functions for pre processing 

#---clean up html elements and entities: e.g. <html> </html> &nbsp;
def cleanHtml(sentence):
    #cleanr = re.compile('<.*?>')
    cleanr = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
    cleantext = re.sub(cleanr, ' ', str(sentence))
    return cleantext


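#---function to clean the sentence of any punctuation or special characters
def cleanPunc(sentence): 
    # inside [...] every character is literal, so no '|' separators are needed
    cleaned = re.sub(r'[?!\'"#|]', ' ', sentence)
    # '\\' matches a literal backslash (the original '\|' only matched '|')
    cleaned = re.sub(r'[.,:;()\\/|]', ' ', cleaned)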
    cleaned = cleaned.strip()
    cleaned = cleaned.replace("\n"," ")
    return cleaned


def keepAlpha(sentence):
    alpha_sent = ""
    for word in sentence.split():
        alpha_word = re.sub('[^a-z A-Z]+', ' ', word)
        alpha_sent += alpha_word
        alpha_sent += " "
    alpha_sent = alpha_sent.strip()
    return alpha_sent

#---stop words removing 
stop_words = set(stopwords.words('english'))
stop_words.update(['  ', 'zero','one','two','three','four','five','six','seven','eight','nine','ten','may','also','across','among','beside','however','yet','within'])
re_stop_words = re.compile(r"\b(" + "|".join(stop_words) + ")\\W", re.I)

def removeStopWords(sentence):
    global re_stop_words
    return re_stop_words.sub(" ", sentence)


#--- sentence stemming
stemmer = SnowballStemmer("english")
def stemming(sentence):
    stemSentence = ""
    for word in sentence.split():
        stem = stemmer.stem(word)
        stemSentence += stem
        stemSentence += " "
    stemSentence = stemSentence.strip()
    return stemSentence
    

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/LC/doth02/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


# Task 01: Understand the data set
- Read the whole tweet dataset (the `Full_Tweet_to_github.csv` file) into a dataframe. Name it `df_tweets`.
From now on, you will work on `df_tweets`.
- Find out the following information. You must show your code:
    + Shape
    + Data types of all columns
    + Numerical columns
    + Text columns
    + Categorical columns
    + Date/time columns
    + Statistics (min, max, mean, std, ...) of all the numerical columns

Expectation: df_tweets.shape = (164168, 12)
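
As a starting point, the inspection calls listed above can be sketched on a toy frame. This is only an illustration: `df_demo` and its columns are made up, and the "parses cleanly as a date" check is one possible heuristic for spotting date/time columns, not a required method.

```python
import pandas as pd

# Toy frame standing in for df_tweets (names and values are illustrative only)
df_demo = pd.DataFrame({
    "full_text": ["take hcq", "trial results out", "stay home"],
    "created_at": ["2020-03-21", "2020-03-21", "2020-10-02"],
    "reply_count": [3, 0, 12],
})

print(df_demo.shape)    # (number of rows, number of columns)
print(df_demo.dtypes)   # data type of every column

# Numerical columns, and everything else (text, categories, date strings)
numeric_cols = df_demo.select_dtypes(include="number").columns.tolist()
other_cols = df_demo.select_dtypes(exclude="number").columns.tolist()

# Heuristic: a non-numerical column whose values all parse as dates
# is a date/time candidate
date_cols = []
for col in other_cols:
    parsed = pd.to_datetime(df_demo[col], errors="coerce")
    if parsed.notna().all():
        date_cols.append(col)

# min, max, mean, std, ... of the numerical columns
print(df_demo[numeric_cols].describe())
```

Telling text columns from categorical ones usually needs a look at the values themselves, for example with `nunique()` on each remaining column.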

In [2]:
###BEGIN SOLUTION
df_tweets = pd.read_csv("../data/Full_Tweet_to_github.csv") 
df_original_tweets = df_tweets.copy()
print(df_tweets.shape)
print(df_tweets.dtypes)
print(df_tweets.describe())  # statistics of the numerical columns
###END SOLUTION


# Task 02: Delete unnecessary columns
As you can see, the following columns will not help a machine learning algorithm to learn:
+ HYDROXYCHLOROQUINE: all the values in this column are 1
+ query_string: the URL link to the tweet on Twitter.

Delete the two columns above from `df_tweets`.

Expectation: the shape of the returned dataframe = (164168, 10)

In [None]:
###BEGIN SOLUTION
df_tweets.drop(axis = 'columns', columns = ['HYDROXYCHLOROQUINE','query_string'], inplace = True)
print(df_tweets.shape)
df_tweets.head()
###END SOLUTION

# Task 03: Delete columns with missing values.

Delete all columns with more than 80% missing values.

In the previous assignment, you were asked to create a function, `delete_cols()`. Call that function on `df_tweets` with a threshold of `80`.

Expectation: df_tweets.shape = (164168, 9)


In [None]:
###BEGIN SOLUTION
def get_missing_percent_in_cols(df):
    percent_missing = df.isnull().sum() * 100 / df.shape[0]
    df_cols = pd.DataFrame(data = {'column_name': df.columns,
                                   'percent_missing': percent_missing,
                                   'data_type': df.dtypes})

    df_cols.sort_values(['percent_missing'], ascending = False, inplace = True)
    return df_cols


def delete_cols(df, threshold):
    df_cols = get_missing_percent_in_cols(df)
    col_to_del = df_cols[df_cols['percent_missing'] > threshold]['column_name']
    return df.drop(col_to_del, axis='columns', inplace = False)


#usage
df_tweets = delete_cols(df_tweets, 80)
print(df_tweets.shape)
###END SOLUTION


# Task 04: Remove duplicates and irrelevant data

    a. Identify duplicates: 
        
        Two tweets are identical if they have the same 
        full_text, created_at, reply_count, favorite_count
        
        Two different tweets can have the same full_text, as other people can retweet
        
        Note: trying to combine the above features may be expensive
    
    b. Remove duplicates

Expectation: 1277 duplicates are deleted
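
The duplicate rule above can be tried on a toy frame first. The rows below are invented for illustration; only the four column names come from the task description.

```python
import pandas as pd

# Toy data: rows 0 and 1 agree on all four key columns,
# row 2 has the same text but a different date (a retweet, not a duplicate)
df_demo = pd.DataFrame({
    "full_text": ["hcq works", "hcq works", "hcq works", "stay home"],
    "created_at": ["2020-03-21", "2020-03-21", "2020-03-22", "2020-03-21"],
    "reply_count": [5, 5, 5, 0],
    "favorite_count": [10, 10, 10, 1],
})

key_cols = ["full_text", "created_at", "reply_count", "favorite_count"]

# duplicated() marks every repeat of an earlier row (compared on key_cols only)
num_duplicates = int(df_demo.duplicated(subset=key_cols).sum())

# drop_duplicates() keeps the first row of each group and drops the repeats
df_unique = df_demo.drop_duplicates(subset=key_cols)

print(num_duplicates, df_unique.shape)
```

Passing the columns through `subset` lets pandas compare them directly, so there is no need to build a combined key column by hand.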

In [None]:
###BEGIN SOLUTION
num_rows = df_tweets.shape[0]
df_tweets = df_tweets.drop_duplicates(subset = ['full_text', 'reply_count', 'favorite_count','created_at'])
print("#duplicates: ", num_rows - df_tweets.shape[0])
###END SOLUTION

# Task 05: Clean text

Do the following subtasks for all values in each text column:
+ lowercase
+ strip spaces (leading and trailing)

Call the functions you wrote in tasks 06 and 07 of the previous assignment.
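
If you do not have those functions at hand, the same two subtasks can be sketched with pandas' vectorized string methods. This is one possible approach on an invented frame, not the required one; it assumes text columns have the `object` dtype and turns missing values into empty strings so that later substring tests do not fail on `NaN`.

```python
import pandas as pd

# Toy frame; the text column is given the object dtype explicitly (assumption)
df_demo = pd.DataFrame({
    "user_location": pd.Series(["  Boston, MA ", None, "AMES, ia"], dtype="object"),
    "reply_count": [1, 2, 3],
})

text_cols = df_demo.select_dtypes(include="object").columns

for col in text_cols:
    # fill missing values, then lowercase and strip leading/trailing spaces
    df_demo[col] = df_demo[col].fillna("").str.lower().str.strip()

print(df_demo)
```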


In [None]:
###BEGIN SOLUTION

def lower_text(df):
    df1 = df.select_dtypes(include='object')
    print(df1.dtypes)
    df1 = df1.applymap(lambda x: x.lower() if isinstance(x, str) else "")
    df_cp = df.copy()
    for c in df1.columns:
        df_cp[c] = df1[c]
    return df_cp


def strip_text(df):
    df1 = df.select_dtypes(include='object')
    df1 = df1.applymap(lambda x: x.strip() if isinstance(x, str) else "")
    df_cp = df.copy()
    for c in df1.columns:
        df_cp[c] = df1[c]
    return df_cp

df_tweets = lower_text(df_tweets)
df_tweets = strip_text(df_tweets)

df_tweets.head()
###END SOLUTION


# Task 06: Clean tweets

Do the following subtasks for all values in the `full_text` column:
+ only keep alphabetical letters
+ remove HTML tags and entities
+ remove punctuation
+ remove stop words (words that contribute nothing to identifying the sentiment of a sentence)
+ stem words: replace a word with its root form, since the word and its root carry the same meaning and sentiment in a sentence. For example, `happiness` is derived from `happy` => `happiness` and `happy` are replaced by the same stem. Stems need not be dictionary words, e.g. `please` becomes `pleas`

You are provided all the functions in the setting up cell. Call them for this task.

Note this task will take a while.

In [None]:
###BEGIN SOLUTION
df_tweets['full_text'] = df_tweets['full_text'].apply(keepAlpha)
df_tweets['full_text'] = df_tweets['full_text'].apply(cleanHtml)
df_tweets['full_text'] = df_tweets['full_text'].apply(cleanPunc)
df_tweets['full_text'] = df_tweets['full_text'].apply(removeStopWords)
df_tweets['full_text'] = df_tweets['full_text'].apply(stemming)

df_tweets.head()
###END SOLUTION
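
One thing worth checking is the order of the calls. `keepAlpha` strips `<`, `>`, `&` and `;`, which are exactly the characters `cleanHtml` looks for, so the two orders can give different results. The sketch below re-implements both helpers under the hypothetical names `clean_html` and `keep_alpha` (same regular expressions as the setup cell) so that it runs without NLTK; the sample sentence is made up.

```python
import re

def clean_html(sentence):
    # same pattern as cleanHtml in the setup cell
    pattern = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
    return re.sub(pattern, ' ', str(sentence))

def keep_alpha(sentence):
    # same logic as keepAlpha in the setup cell
    words = [re.sub('[^a-z A-Z]+', ' ', word) for word in sentence.split()]
    return " ".join(words).strip()

sample = "hcq &amp; zinc <b>work</b>"

# letters first: tag and entity names survive as ordinary words
alpha_first = clean_html(keep_alpha(sample))

# HTML first: tags and entities are removed before the letter filter runs
html_first = keep_alpha(clean_html(sample))

print(alpha_first.split())
print(html_first.split())
```

The token `amp` that appears in some cleaned tweets in this notebook's recorded output is consistent with the letters-first order.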

# Task 07: Add a new column

Add a new column, called `state`, and fill the values in this column using the following instructions.

We want to derive a state from `user_location`, but the data in `user_location` is very messy: some values have only a city name (e.g., Albany), some have a city name and a state, ...

So I am including a look-up table file, called `state_full.csv`, here for you. This file has 2 columns: `shortState` and `city`. Whenever a user location contains a short state name or a city from the `shortState` or `city` columns, you will fill the `state` column with the short state.

For example:

`user_location` = 'albany, usa' => `state` = 'NY'

`user_location` = 'boston, massachusetts' => `state` = 'MA'

`user_location` = 'ames, ia' => `state` = 'IA'

`user_location` = 'boston' => `state` = 'MA'

...

Note: this task will take a long time (about 3-4 hours) to run

Expectation: 53712 short state names (NaN values are not counted) will be filled 



In [8]:
###BEGIN SOLUTION
st = pd.read_csv('./state_full.csv')
st.shape
n = st.shape[0]
st['city'] = st['city'].str.lower()
st['shortState'] = st['shortState'].str.lower()

st.head()

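def get_State(text):
    # return the short state of the first look-up row whose city or short state occurs in text
    for i in range(n):
        if (st.iloc[i]['city'] in text) or (st.iloc[i]['shortState'] in text):
            # upper-case again so the result matches the examples ('NY', 'MA', ...)
            return st.iloc[i]['shortState'].upper()
    return np.nan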

df_tweets['state'] = df_tweets['user_location'].apply(get_State)
num_filled_states = df_tweets.shape[0] - df_tweets.state.isnull().sum()
print(num_filled_states)

df_tweets.head()
df_tweets.to_csv('./clean_tweet_state.csv')

###END SOLUTION
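
Most of the running time comes from calling `st.iloc[i]` inside the loop for every tweet. A possible speed-up, sketched below on an invented three-row look-up table, keeps the same first-match substring rule but reads the look-up columns into plain Python lists once and resolves each distinct location only once. The names `st_demo`, `get_state_fast` and `locations` are hypothetical, and returning the abbreviation in upper case is an assumption made to match the examples above ('NY', 'MA', ...).

```python
import numpy as np
import pandas as pd

# Tiny stand-in for state_full.csv (rows are illustrative only)
st_demo = pd.DataFrame({
    "shortState": ["NY", "MA", "IA"],
    "city": ["Albany", "Boston", "Ames"],
})

# Read the look-up table into Python tuples once:
# (lower-case city, lower-case state for matching, original state to return)
cities = st_demo["city"].str.lower().tolist()
states = st_demo["shortState"].tolist()
pairs = [(city, state.lower(), state) for city, state in zip(cities, states)]

def get_state_fast(text):
    """Return the state of the first row whose city or abbreviation occurs in text."""
    for city, state_lower, state in pairs:
        if city in text or state_lower in text:
            return state
    return np.nan

locations = pd.Series(
    ["albany, usa", "boston, massachusetts", "ames, ia", "boston", "paris"]
)

# Resolve each distinct location once, then map the answers back to every row
lookup = {loc: get_state_fast(loc) for loc in locations.unique()}
result = locations.map(lookup)

print(result.tolist())
```

Because the matching rule is a plain substring test, short abbreviations can also match inside longer words, so it is worth spot-checking a few filled values by hand.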


# Task 08: count #tweets and sum up #favorite_count by date

Count number of tweets and sum up favorite_count by dates. Store the results in a dataframe, called `df_count_by_date`. Sort the dataframe in the descending order of tweet counts.

hint: use `groupby()` on `df_tweets` and then `agg()` with count for full_text and sum for favorite_count

expectation: df_count_by_date.shape = (315, 2)
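
The hint can be tried on a small frame before running it on `df_tweets`. The data below is invented; only the column names follow the task.

```python
import pandas as pd

# Toy data: three tweets on one date, two on the next, one on the last
df_demo = pd.DataFrame({
    "created_at": ["2020-10-02", "2020-10-02", "2020-10-03",
                   "2020-10-02", "2020-10-03", "2020-10-04"],
    "full_text": ["a", "b", "c", "d", "e", "f"],
    "favorite_count": [10, 5, 7, 1, 0, 2],
})

df_counts = (
    df_demo.groupby("created_at")
    # one aggregation per column: count the tweets, sum the favorites
    .agg({"full_text": "count", "favorite_count": "sum"})
    # most active dates first
    .sort_values("full_text", ascending=False)
)

print(df_counts)
```

Note that `count` skips missing values, so rows with a missing `full_text` would not be counted.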

In [30]:
###BEGIN SOLUTION
df_count_by_date = df_tweets.groupby('created_at').agg({'full_text': 'count', 'favorite_count':'sum'})
df_count_by_date.sort_values(['full_text'], inplace = True, ascending = False)
print(df_count_by_date.shape)
df_count_by_date.head(10)

###END SOLUTION

(315, 2)


| created_at | full_text | favorite_count |
|---|---|---|
| 2020-10-02 | 20034 | 646194 |
| 2020-10-03 | 12342 | 191000 |
| 2020-10-04 | 5265 | 100905 |
| 2020-10-05 | 3984 | 203271 |
| 2020-10-06 | 2635 | 227535 |
| 2020-12-16 | 2372 | 66404 |
| 2020-10-08 | 2337 | 43023 |
| 2020-09-10 | 1753 | 9763 |
| 2020-10-07 | 1746 | 23762 |
| 2020-09-09 | 1676 | 50944 |


# Task 09: count #tweets and sum up #favorite_count by state

Count number of tweets and sum up favorite_count by states. Store the results in a dataframe, called `df_count_by_state`. Sort the dataframe in the descending order of tweet counts.

hint: use `groupby()` on `df_tweets` and then `agg()` with count for full_text and sum for favorite_count

expectation: df_count_by_state.shape = (51, 2)

In [37]:
###BEGIN SOLUTION
df_count_by_state = df_tweets.groupby('state', dropna=True).agg({'full_text': 'count', 'favorite_count':'sum'})
df_count_by_state.sort_values(['full_text'], inplace = True, ascending = False)
print(df_count_by_state.shape)
df_count_by_state.head()

###END SOLUTION

(51, 2)


| state | full_text | favorite_count |
|---|---|---|
| CA | 10243 | 1307798 |
| TX | 5734 | 80652 |
| FL | 4221 | 351380 |
| NY | 3727 | 1640301 |
| WA | 3298 | 2557999 |


# Task 10: Top 10 tweets

+ Find the 10 tweets with the highest reply_count, ordered from highest to lowest.
+ Find the 10 tweets with the highest retweet_count, ordered from highest to lowest.
+ Find the 10 tweets with the highest favorite_count, ordered from highest to lowest.
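
Two equivalent ways to take the top rows by one column are sketched below on an invented frame (top 3 instead of top 10 to keep it short). Either is acceptable; `nlargest` avoids sorting the whole dataframe.

```python
import pandas as pd

# Toy data; only the column names follow the task
df_demo = pd.DataFrame({
    "full_text": ["a", "b", "c", "d", "e"],
    "reply_count": [4, 69, 15, 0, 23],
})

# Option 1: sort everything in descending order, then keep the first rows
top3_sorted = df_demo.sort_values("reply_count", ascending=False).head(3)

# Option 2: ask directly for the rows with the largest values
top3 = df_demo.nlargest(3, "reply_count")

print(top3)
```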

In [40]:
###BEGIN SOLUTION

df_top10_reply = df_tweets.sort_values(['reply_count'], ascending = False).iloc[:10]
df_top10_retweet = df_tweets.sort_values(['retweet_count'], ascending = False).iloc[:10]
df_top10_favorite = df_tweets.sort_values(['favorite_count'], ascending = False).iloc[:10]

print(df_top10_reply['reply_count'])
print(df_top10_retweet['retweet_count'])
print(df_top10_favorite['favorite_count'])
#df_top10_reply.head()

###END SOLUTION

484      69341
15326    29771
12924    25817
13294    24714
3531     22788
13399    19417
15254    19400
5495     17791
16840    15948
785      14391
Name: reply_count, dtype: int64
                                               full_text  created_at  \
484    hydroxychloroquin amp azithromycin taken toget...  2020-03-21   
486    pleas take hydroxychloroquin plaquenil plus az...  2020-03-21   
93441  trump kept tell us take hydroxychloroquin inje...  2020-10-06   
540    pleas spread hydroxychloroquin amp azithromyci...  2020-03-21   
16840  pleas watch high respect dr harvey risch yale ...  2020-08-24   
9323                hydroxychloroquin https co ymobdcfgx  2020-05-19   
13294  high respect henri ford health system report b...  2020-07-07   
12905  want ensur everyon understand graviti situat h...  2020-07-03   
13030       hydroxychloroquin work work whole time tweet  2020-07-04   
14600  imagin child porn taken social media quick hyd...  2020-07-29   

                     user