# 02_Data Cleaning

Data cleaning turns the raw data into cleaned data that is ready for EDA and modeling.

The data cleaning process implemented includes:
1. remove URLs
2. replace [deleted] and [removed] with "pseudodeleted" and "pseudoremoved"
3. remove newlines / carriage returns
4. remove stop words
5. remove punctuation and special characters (executed only after stop word removal, because "wasn’t" is a stop word but "wasnt" is not)
6. convert to lowercase → lemmatization
7. remove rows with a null comment
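
The ordering constraint between stop word removal and punctuation removal can be illustrated with a small sketch. It uses a tiny hand-written stop word set (`DEMO_STOP_WORDS`, an assumption for illustration; the notebook itself uses NLTK's English list) and two hypothetical helpers, so it shows the idea rather than the notebook's exact implementation.

```python
import string

# Tiny illustrative stop-word set (an assumption; the notebook uses NLTK's English list).
DEMO_STOP_WORDS = {"it", "wasn't", "the"}

def strip_punctuation(text):
    """Drop every ASCII punctuation character."""
    return "".join(ch for ch in text if ch not in string.punctuation)

def drop_stop_words(text, stop_words=DEMO_STOP_WORDS):
    """Drop whitespace-separated tokens that appear in stop_words."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

sentence = "It wasn't the plan"

# Stop words first: "wasn't" still has its apostrophe and is recognized.
stop_words_first = strip_punctuation(drop_stop_words(sentence))

# Punctuation first: "wasn't" becomes "wasnt", which is no longer a stop word.
punctuation_first = drop_stop_words(strip_punctuation(sentence))

print(stop_words_first)   # plan
print(punctuation_first)  # wasnt plan
```

Running the two orders side by side shows the leftover "wasnt" token that the chosen ordering avoids.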

---
### Import Libraries

In [19]:
# uncomment the following to install "spacy" library from pip, and download the "en_core_web_sm" language pack. 

# !pip install spacy
# !python -m spacy download en_core_web_sm

In [20]:
# uncomment the following to install textblob, a library to perform spelling check and correction
# !pip install textblob

In [21]:
# uncomment the following to install pyspellchecker, a library to perform spelling check and correction
# !pip install pyspellchecker

In [22]:
import spacy

import nltk

from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

import re
import string
import numpy as np

nlp = spacy.load("en_core_web_sm")
from textblob import TextBlob
from spellchecker import SpellChecker

---
### Initialize variables/objects that will be used for data cleaning


In [23]:
stop_words = set(stopwords.words('english'))
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [24]:

special_char_list = list(string.punctuation)
special_char_list+=["’","'s","’s","...","$","@$$."]
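
One point worth checking about this list: the cleaning function further down filters the text one character at a time, so only single-character entries can ever match. The sketch below rebuilds the same list and compares it against its single-character subset; `strip_chars` is a hypothetical helper written for this illustration only.

```python
import string

# Same construction as the cell above.
special_chars = list(string.punctuation) + ["’", "'s", "’s", "...", "$", "@$$."]

# Split the entries by length: a per-character filter can only hit the first group.
single = [c for c in special_chars if len(c) == 1]
multi = [c for c in special_chars if len(c) > 1]

def strip_chars(text, chars):
    """Remove each character of text that is itself an entry of chars."""
    return "".join(ch for ch in text if ch not in chars)

text = "Ryan’s $2.88..."

# The full list and its single-character subset give the same result.
assert strip_chars(text, special_chars) == strip_chars(text, set(single))

print(multi)                             # ["'s", '’s', '...', '@$$.']
print(strip_chars(text, special_chars))  # Ryans 288
```

The multi-character entries are harmless but inert under per-character filtering; patterns such as "'s" need a separate substitution step, which is how the cleaning function handles the straight-apostrophe case.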

---
### Functions to perform data cleaning
1. lemmatization - using spacy (*evaluated, but eventually not used in the data cleaning*)
2. lemmatization - using WordNetLemmatizer
3. stemming (*evaluated, but eventually not used in the data cleaning*)
4. text cleaning
5. spell check - using spellchecker (*evaluated, but eventually not used in the data cleaning*)
6. spell check - using TextBlob (*evaluated, but eventually not used in the data cleaning*)

> [Unused functions were deemed unnecessary for the final cleaning of the dataset.] The spaCy lemmatizer was unable to reduce some continuous-tense words to their lemma (see the unit test output below).

In [25]:
'''
Function to perform lemmatization on a text (not in use, but kept here for future reference).
spaCy is a library that performs lemmatization using a much more efficient algorithm for big data sets.

Afternote: spaCy was not used in the end.
'''
def sentence_lemmatizer_spacy(text):
    if text.strip() == '':
        return np.nan                                       # np.NaN alias was removed in NumPy 2.0

    doc = nlp(text.lower())                                 # Process the text with spaCy
    lemmatized_tokens = [token.lemma_ for token in doc]     # Extract lemmas for each token
    return ' '.join(lemmatized_tokens)                      # Join the lemmatized tokens back into a single string

#unit test
print(sentence_lemmatizer_spacy("[deleted]"))
print(sentence_lemmatizer_spacy("[removed]"))
print(sentence_lemmatizer_spacy("i am pseudoremoved"))
print(sentence_lemmatizer_spacy("i am pseudodeleted !!"))
print(sentence_lemmatizer_spacy(" "))
print(sentence_lemmatizer_spacy(""))
print(sentence_lemmatizer_spacy("hellooooo !!"))
print(sentence_lemmatizer_spacy("fucking fuck fucker FUCKING fuckin!!"))
print(sentence_lemmatizer_spacy("wasn't !!"))

[ delete ]
[ remove ]
I be pseudoremove
I be pseudodelete ! !
nan
nan
hellooooo ! !
fucking fuck fucker fucking fuckin ! !
be not ! !


In [26]:
'''
function to perform lemmatization on a text
Using the WordNetLemmatizer by nltk, which is much slower
'''
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)


def sentence_lemmatizer(text):
    if text.strip() == '':
        return np.nan   # blank text becomes NaN so it can be detected by isnull() and dropped later

    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text.lower())    # lowercase once, before tokenizing

    # Lemmatize each token using its POS tag, then join the results
    return ' '.join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in tokens])


#unit test
print(sentence_lemmatizer("[deleted]"))   
print(sentence_lemmatizer("[removed]"))   
print(sentence_lemmatizer("i am pseudoremoved"))
print(sentence_lemmatizer("i am pseudodeleted !!"))     
print(sentence_lemmatizer("hellooooo !!"))             
print(sentence_lemmatizer(" "))   
print(sentence_lemmatizer(""))   
print(sentence_lemmatizer("fucking fuck fucker FUCKING fuckin!!"))   
print(sentence_lemmatizer("wasn't !!")) 

    

[ delete ]
[ remove ]
i be pseudoremoved
i be pseudodeleted ! !
hellooooo ! !
nan
nan
fuck fuck fucker fuck fuckin ! !
be n't ! !
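
The tag mapping inside `get_wordnet_pos` can be looked at in isolation. The sketch below reproduces only the dictionary lookup, without NLTK: the single-letter constants are the values of `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB` and `wordnet.ADV`, and `penn_to_wordnet` is a hypothetical helper name used for illustration.

```python
# WordNet POS constants as single letters (the values of nltk.corpus.wordnet.ADJ, NOUN, VERB, ADV).
ADJ, NOUN, VERB, ADV = "a", "n", "v", "r"

def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter, defaulting to noun."""
    tag_dict = {"J": ADJ, "N": NOUN, "V": VERB, "R": ADV}
    return tag_dict.get(penn_tag[:1].upper(), NOUN)

for tag in ["VBG", "NNS", "JJ", "RB", "PRP"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Only the first letter of the tag matters, and any tag outside the four groups (such as the pronoun tag `PRP`) falls back to noun. Because `get_wordnet_pos` tags each word on its own, the tag it receives reflects no sentence context.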


> [Unused functions were deemed unnecessary for the final cleaning of the dataset.] Stemming is less effective than the chosen lemmatization.

In [27]:
'''
Function to do text stemming using PorterStemmer (Not in use, but keep here for future reference)
'''
def sentence_stemmer(text):
    words = nltk.word_tokenize(text)
    porter_stemmer = PorterStemmer()
    return ' '.join([porter_stemmer.stem(word) for word in words])


#unit test
print(sentence_stemmer("[deleted]"))  
print(sentence_stemmer("[removed]"))    
print(sentence_stemmer("i am pseudoremoved"))
print(sentence_stemmer("i am pseudodeleted !!"))     
print(sentence_stemmer("hellooooo !!"))   

[ delet ]
[ remov ]
i am pseudoremov
i am pseudodelet ! !
hellooooo ! !


In [28]:
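'''
Function to do basic cleaning of the comment text:
- remove http/https URLs
- replace [deleted] and [removed] with the markers "pseudodeleted" and "pseudoremoved"
- remove newlines and carriage returns
- remove "'s"
- remove stop words
- remove punctuation and special characters
'''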
def text_cleaning(text):

    url_regex = re.compile(
    r'((http|https)://'     # Start with http:// or https://
    r'([a-zA-Z0-9.-]+)'     # Match the domain name (alphanumeric characters, dots, and dashes)
    r'(\.[a-zA-Z]{2,})'     # Match the top-level domain (e.g., .com, .net) with at least 2 characters
    r'(:\d+)?'              # Match an optional port number
    r'(/\S*)?'              # Match an optional path (any non-whitespace characters)
    r'(\?[^"\s]*)?)',        # Match an optional query string (attribute-value pairs)
    re.IGNORECASE        # Ignore case sensitivity
    )

    # remove URL
    text = url_regex.sub("", text)

    # Mark the comment with [deleted] or [removed] with pseudo marker "pseudodeleted" and "pseudoremoved"
    # After lemmatization, [deleted] become [ delete ], [removed] become [ remove ]
    # After stemming, [deleted] become [ delet ], [removed] become [ remov ]
    text = str(text).replace("[deleted]","pseudodeleted").replace("[removed]","pseudoremoved").replace("[ delete ]","pseudodeleted").replace("[ remove ]","pseudoremoved").replace("[ delet ]","pseudodeleted").replace("[ remov ]","pseudoremoved")

    # remove newline / carriage return (replace "\r\n" first, otherwise that branch can never match)
    text = str(text).replace("\r\n", " ").replace("\n", " ").replace("\r", " ").replace("_x000D_", " ")

    # remove  "'s"
    text = re.sub(r"(\'s)","", text)

    # remove stopword
    text = ' '.join([word for word in text.split() if word.lower() not in stop_words])

    # remove punctuation and special character
    text = ''.join([char for char in text if char not in special_char_list])

    # return the cleaned text
    return text.strip()

#unit test
print(text_cleaning("abc abcc abccc abcccc hellllo CY's coffee!! A cup cost @ $2.88 and no sugar belongs to Ryan's"))
print(text_cleaning(",[deleted],"))
print(text_cleaning(",[ delete ],"))
print(text_cleaning(",[ remove ],"))
print(text_cleaning(",[removed],"))
print("1",text_cleaning("http://www.google.com"))
print("2",text_cleaning("http://www.google.com/"))
print("3",text_cleaning("http://www.yahoo.com.sg:8601"))
print("4",text_cleaning("http://www.yahoo.com.sg:8601/"))
print("5",text_cleaning("http://www.yahoo.com.sg:8601/api"))
print("6",text_cleaning("This is my URL http://www.YAhoo.com.sg:8601/api :)"))
print("7",text_cleaning("https://www.galolawfirm.com:8888/how-to-legally-terminate-an-employee-in-texas/#:~:text=Texas%2C%20like%20many%20U.S.%20states,it's%20not%20an%20unlawful%20one."))
print("8",text_cleaning("https://graphics.stltoday.com/apps/payrolls/salaries_2020/teachers/?sort=med_salary&dir=desc"))
print("9",text_cleaning("https://graphics.stltoday.com/apps/payrolls/salaries_2020/teachers/?sort=med_salary&dir=desc\n\
                        OKOK Got it \n \
                        This is 3rd line!!! "))
print("10",text_cleaning("OMG!!"))

abc abcc abccc abcccc hellllo CY coffee cup cost  288 sugar belongs Ryan
pseudodeleted
pseudodeleted
pseudoremoved
pseudoremoved
1 
2 
3 
4 
5 
6 URL
7 
8 
9 OKOK Got 3rd line
10 OMG
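
The URL pattern can also be exercised on its own. The sketch below copies the regular expression from `text_cleaning` and wraps it in a hypothetical helper, `strip_urls`, which additionally collapses the whitespace left behind (the notebook gets the same effect later, when the stop word step splits and rejoins the text).

```python
import re

# Same pattern as in text_cleaning above.
URL_REGEX = re.compile(
    r'((http|https)://'     # scheme: http:// or https://
    r'([a-zA-Z0-9.-]+)'     # domain name
    r'(\.[a-zA-Z]{2,})'     # top-level domain, at least 2 letters
    r'(:\d+)?'              # optional port number
    r'(/\S*)?'              # optional path
    r'(\?[^"\s]*)?)',       # optional query string
    re.IGNORECASE,
)

def strip_urls(text):
    """Remove http/https URLs and collapse the whitespace left behind."""
    return " ".join(URL_REGEX.sub("", text).split())

print(strip_urls("This is my URL http://www.YAhoo.com.sg:8601/api :)"))
print(strip_urls("see https://example.com/a?b=1 and ftp://example.com"))
```

The second call shows a limit of the pattern: only `http` and `https` schemes are removed, so other schemes such as `ftp://` pass through untouched.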


> [Unused functions were deemed unnecessary for the final cleaning of the dataset.] TextBlob spell check is ineffective with slang and some common words (see the unit test output for examples).

In [29]:
def perform_spell_check_correction_textblob(text):
    tb = TextBlob(text)
    return tb.correct()


#unit test
print(perform_spell_check_correction_textblob("helloo, this is CY, I am tying out teh speling cheker"))
print(perform_spell_check_correction_textblob("Donald Trump is going to throw a massive hissyfit."))
print(perform_spell_check_correction_textblob(f"Just wanna say congrats from the U.K. guys! Weâ€™ve been just as nervous/excited as you! Today is a good day for America. \
\
I sincerely hope you all a healthy and prosperous future!"))


hello, this is of, I am tying out the spelling cheer


Donald Plump is going to throw a massive hissyfit.
Must anna say congress from the U.K. guns! He€™ve been just as nervous/excited as you! Today is a good day for America. I sincerely hope you all a healthy and prosperous future!


> [Unused functions were deemed unnecessary for the final cleaning of the dataset.] We decided not to perform spell check with pyspellchecker because the computation was too expensive (estimated at more than 20 hours for the 37K data records).

In [30]:
def perform_spell_check_correction_spellchecker(text):
    words = text.split()
    spell = SpellChecker()
    misspell = list(spell.unknown(words))
    #print(f"MISS=={misspell}")
    for word in misspell:
        correct_spell = spell.correction(word)
        if correct_spell is not None:
            text = text.replace(word, correct_spell)

    #print(f"FIXED>>> {text}")
    return text 


#unit test
print(perform_spell_check_correction_spellchecker("helloo, this is CY, I am tying out teh speling cheker"))
print(perform_spell_check_correction_spellchecker("Donald Trump is going to throw a massive hissyfit."))
print(perform_spell_check_correction_spellchecker(f"Just wanna say congrats from the U.K. guys! Weâ€™ve been just as nervous/excited as you! Today is a good day for America. \
\
I sincerelly hope you all a healthy and prosporous future!"))


hello this is CY, I am tying out the spelling cheer
Donald Trump is going to throw a massive hissyfit.
Just wanna say congrats from the U.K. guys Weâ€™ve been just as nervous/excited as you Today is a good day for America. I sincerely hope you all a healthy and prosperous future


> We decided to forgo spell checking (both TextBlob and SpellChecker) because neither option gave an acceptable balance of accuracy and efficiency.
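
To put the runtime estimate quoted above in per-comment terms, here is a small back-of-the-envelope calculation. It takes the "more than 20 hours" figure as a lower bound and rounds the dataset to 37,000 records, as the note does; both numbers come from the note, not from a new measurement.

```python
estimated_hours = 20        # lower bound quoted in the note above
record_count = 37_000       # "37K" records, rounded as in the note

# Convert the total estimate into seconds spent per comment.
seconds_per_record = estimated_hours * 3600 / record_count

print(f"{seconds_per_record:.2f} s per comment")  # 1.95 s per comment
```

Roughly two seconds per comment is the scale at which spell correction was judged too expensive for this dataset.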

---
### Perform Text Cleaning
- concatenate onion comments and news comments into single dataframe
- check and drop comments which are blank, if any
- verify column datatype
- export the cleaned dataframe to a CSV, ready for the next step (EDA and model evaluation)


In [31]:
import pandas as pd
import numpy as np

In [32]:
onion_file = f"../data/01_raw_onion_data.csv"
news_file = f"../data/01_raw_news_data.csv"

onion_df = pd.read_csv(onion_file)
news_df = pd.read_csv(news_file)

> Concatenate the news file and onion file into the same dataframe

In [33]:
df_list =[onion_df, news_df]
comments_df = pd.concat(df_list)
comments_df.reset_index(drop=True, inplace=True)
comments_df

Unnamed: 0,comment_id,parent_id,post_id,is_submitter,body,score,stickied,created_utc,post_title,subreddit
0,dpep775,t3_7b0y34,t3_7b0y34,False,I love how onion updates to The city,1349,False,2017-11-06 02:49:19,"'No Way To Prevent This,’ Says Only Nation Whe...",TheOnion
1,dpee371,t3_7b0y34,t3_7b0y34,False,...again.,4780,False,2017-11-05 23:14:31,"'No Way To Prevent This,’ Says Only Nation Whe...",TheOnion
2,dpex3x5,t3_7b0y34,t3_7b0y34,False,"Just need to finish it off with ""Our thoughts ...",262,False,2017-11-06 05:53:10,"'No Way To Prevent This,’ Says Only Nation Whe...",TheOnion
3,dpejj86,t3_7b0y34,t3_7b0y34,False,thousands of people going to prison for weed E...,2265,False,2017-11-06 00:57:39,"'No Way To Prevent This,’ Says Only Nation Whe...",TheOnion
4,dpeiwa9,t3_7b0y34,t3_7b0y34,False,Every law abiding citizen in Australia that wa...,1958,False,2017-11-06 00:44:49,"'No Way To Prevent This,’ Says Only Nation Whe...",TheOnion
...,...,...,...,...,...,...,...,...,...,...
37914,gfv68hn,t3_kd8bdc,t3_kd8bdc,False,I predict this confirmation thing is going to ...,3,False,2020-12-14 23:40:41,President-elect Joe Biden clears 270-vote thre...,news
37915,gfv78uj,t3_kd8bdc,t3_kd8bdc,False,I'm excited for the cheeto's rage tweeting later,3,False,2020-12-14 23:49:44,President-elect Joe Biden clears 270-vote thre...,news
37916,gfv7xr7,t3_kd8bdc,t3_kd8bdc,False,I wonder if Republicans will purposely not ref...,3,False,2020-12-14 23:56:06,President-elect Joe Biden clears 270-vote thre...,news
37917,gfv8r3z,t3_kd8bdc,t3_kd8bdc,False,Looking forward to when the whitehouse finally...,3,False,2020-12-15 00:03:35,President-elect Joe Biden clears 270-vote thre...,news


> Check whether any row has missing data.

> Outcome: no missing data, so no rows need to be dropped.

In [34]:
comments_df.isnull().sum()

comment_id      0
parent_id       0
post_id         0
is_submitter    0
body            0
score           0
stickied        0
created_utc     0
post_title      0
subreddit       0
dtype: int64

> The columns not relevant to EDA and modeling will be dropped. 

In [35]:
comments_df = comments_df.drop(columns=['is_submitter','stickied','created_utc'])

> Invoke the "text_cleaning" function to clean the comment text, and write the cleaned comments to a new "body_cleaned" column. Refer to the comments on the "text_cleaning" function for details.

In [36]:
comments_df['body_cleaned'] = comments_df['body'].map(text_cleaning)

> After text cleaning, check again whether any row has missing data.

> Outcome: no missing data, so no rows need to be dropped.

In [37]:
comments_df.isnull().sum()

comment_id      0
parent_id       0
post_id         0
body            0
score           0
post_title      0
subreddit       0
body_cleaned    0
dtype: int64

> Invoke the "sentence_lemmatizer" function to perform lemmatization, and write the lemmatized comments to a new "body_cleaned_lemmatized" column. Refer to the comments on the "sentence_lemmatizer" function for details.

In [38]:
comments_df['body_cleaned_lemmatized'] = comments_df['body_cleaned'].map(sentence_lemmatizer)


> The lemmatization function explicitly returns NaN when the comment text is blank. This ensures that empty comments can be identified by isnull() and dropped as part of the data cleaning.

In [39]:
comments_df['body_cleaned_lemmatized'].isnull().sum() 


140

In [40]:
comments_df = comments_df.dropna(subset=['body_cleaned_lemmatized'])
comments_df['body_cleaned_lemmatized'].isnull().sum() 

0
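
The blank-to-NaN-then-drop pattern used in the cells above can be sketched on a toy dataframe. The helper `blank_to_nan`, the column name `body_final` and the sample rows are assumptions made for this illustration; the notebook applies the same idea through `sentence_lemmatizer`.

```python
import numpy as np
import pandas as pd

def blank_to_nan(text):
    """Return NaN for empty or whitespace-only text, otherwise the stripped text."""
    stripped = text.strip()
    return np.nan if stripped == "" else stripped

# Toy data: two real comments and two blank ones.
demo_df = pd.DataFrame({"body_cleaned": ["love onion", "   ", "", "again"]})
demo_df["body_final"] = demo_df["body_cleaned"].map(blank_to_nan)

# Blank rows are now visible to isnull() and can be dropped by column.
missing_before = int(demo_df["body_final"].isnull().sum())
demo_df = demo_df.dropna(subset=["body_final"]).reset_index(drop=True)
missing_after = int(demo_df["body_final"].isnull().sum())

print(missing_before, missing_after, len(demo_df))
```

An empty string is not treated as missing by `isnull()`, which is why the explicit NaN is needed before `dropna` can remove those rows.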

> Quickly verify that the new columns were created with cleaned and lemmatized data.

In [41]:
comments_df.head(3)

Unnamed: 0,comment_id,parent_id,post_id,body,score,post_title,subreddit,body_cleaned,body_cleaned_lemmatized
0,dpep775,t3_7b0y34,t3_7b0y34,I love how onion updates to The city,1349,"'No Way To Prevent This,’ Says Only Nation Whe...",TheOnion,love onion updates city,love onion update city
1,dpee371,t3_7b0y34,t3_7b0y34,...again.,4780,"'No Way To Prevent This,’ Says Only Nation Whe...",TheOnion,again,again
2,dpex3x5,t3_7b0y34,t3_7b0y34,"Just need to finish it off with ""Our thoughts ...",262,"'No Way To Prevent This,’ Says Only Nation Whe...",TheOnion,need finish Our thoughts prayers go victims fa...,need finish our thought prayer go victim famil...


> Check the datatype of the comment text columns to make sure they are strings (i.e. dtype object). No type conversion is required.

In [42]:
comments_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 37779 entries, 0 to 37918
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   comment_id               37779 non-null  object
 1   parent_id                37779 non-null  object
 2   post_id                  37779 non-null  object
 3   body                     37779 non-null  object
 4   score                    37779 non-null  int64 
 5   post_title               37779 non-null  object
 6   subreddit                37779 non-null  object
 7   body_cleaned             37779 non-null  object
 8   body_cleaned_lemmatized  37779 non-null  object
dtypes: int64(1), object(8)
memory usage: 2.9+ MB


---
### Export the cleaned dataframe into CSV - for EDA and Modeling 

In [43]:
comments_df.to_csv("../data/02_cleaned_data.csv", index=False)