# YouTube Comments Data Preprocessing and Cleaning

This notebook performs data cleaning, preprocessing, and duplicate tagging for YouTube comments, preparing the data for further analysis and modeling.

### Import Required Libraries and Set Display Options

This cell imports all necessary libraries for data manipulation, text processing, and visualization. It also sets pandas display options and configures warnings to be ignored.

In [None]:
# import necessary libraries
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import emoji
import contractions
import unicodedata
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
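# Download required NLTK resources (first run only)
# import nltk
# nltk.download('punkt')
# nltk.download('punkt_tab')  # assumption: newer NLTK releases look for this instead of 'punkt'
# nltk.download('wordnet')
# nltk.download('omw-1.4')
# nltk.download('stopwords')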
from tqdm import tqdm

# tqdm integration with pandas
tqdm.pandas()

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Show full text without truncation
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_rows", 500)

### Load and Inspect Cleaned Comments Dataset

This cell loads the cleaned comments dataset from a CSV file and displays its structure using `info()`.

In [None]:
# Reading the cleaned comments dataset
df = pd.read_csv('dataset/comments_all_cleaned.csv')

# Displaying the DataFrame information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4724755 entries, 0 to 4724754
Data columns (total 9 columns):
 #   Column           Dtype  
---  ------           -----  
 0   commentId        int64  
 1   channelId        int64  
 2   videoId          int64  
 3   authorId         int64  
 4   textOriginal     object 
 5   parentCommentId  float64
 6   likeCount        int64  
 7   publishedAt      object 
 8   updatedAt        object 
dtypes: float64(1), int64(5), object(3)
memory usage: 324.4+ MB


### Load Irrelevant Video IDs

This cell loads a list of irrelevant video IDs from a CSV file, which will be used to filter out unwanted comments.

In [None]:
# Reads a CSV file containing irrelevant video IDs into a DataFrame
irrelevant = pd.read_csv('dataset/irrelevant_video_ids.csv')

### Filter Out Irrelevant Videos

This cell removes comments associated with irrelevant video IDs from the main DataFrame, resulting in a filtered dataset.

In [None]:
# Filters out irrelevant videos from the dataframe
video_ids_to_drop = irrelevant.iloc[:, 0].unique()
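# .copy() makes df_filtered an independent DataFrame, so the column
# assignments in later cells do not operate on a slice of df
df_filtered = df[~df['videoId'].isin(video_ids_to_drop)].copy()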
df_filtered.shape
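
A minimal sketch of the same filter on toy data may make the logic easier to check. The frames `df_demo` and `irrelevant_demo` and their IDs are made up for illustration; only the `isin` pattern comes from the cell above. Taking an explicit `.copy()` of the result keeps later column assignments unambiguous.

```python
import pandas as pd

# Toy stand-ins for `df` and `irrelevant`; the IDs are invented for this sketch.
df_demo = pd.DataFrame(
    {"videoId": [1, 2, 2, 3], "textOriginal": ["a", "b", "c", "d"]}
)
irrelevant_demo = pd.DataFrame({"videoId": [2]})

# First column of the irrelevant-ID file, de-duplicated.
ids_to_drop = irrelevant_demo.iloc[:, 0].unique()

# ~isin keeps rows whose videoId is NOT in the drop list;
# .copy() returns an independent DataFrame rather than a slice.
df_demo_filtered = df_demo[~df_demo["videoId"].isin(ids_to_drop)].copy()

print(df_demo_filtered["videoId"].tolist())
```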

### Convert Published Date to Datetime

This cell converts the `publishedAt` column in the filtered DataFrame to datetime objects for easier sorting and time-based analysis.

In [None]:
# Converts the 'publishedAt' column in the df_filtered DataFrame to datetime objects.
df_filtered['publishedAt'] = pd.to_datetime(df_filtered['publishedAt'])
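
The format of `publishedAt` is not shown in this notebook, so the sketch below assumes ISO 8601 strings and uses made-up values. It illustrates two optional `pd.to_datetime` arguments that can make the conversion more forgiving: `errors="coerce"` and `utc=True`.

```python
import pandas as pd

# Made-up timestamps in an assumed ISO 8601 style; one entry is deliberately malformed.
raw = pd.Series(["2021-03-01T12:00:00Z", "2020-07-15T08:30:00Z", "not a date"])

# errors="coerce" turns unparseable values into NaT instead of raising;
# utc=True yields a timezone-aware datetime column.
parsed = pd.to_datetime(raw, errors="coerce", utc=True)

print(parsed.dtype)
print(parsed.isna().sum())
```

Counting the `NaT` values afterwards is a quick way to see how many rows failed to parse.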

### Sort Comments by Published Date

This cell sorts the filtered DataFrame by the `publishedAt` column in ascending order, ensuring chronological order for further processing.

In [None]:
# Sorting the DataFrame in place by the 'publishedAt' column in ascending order
df_filtered.sort_values(by='publishedAt', inplace=True)
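
Because the duplicate flag below depends on which row counts as the "first" occurrence, the order of rows with identical timestamps can matter. pandas sorts with quicksort by default, which is not guaranteed to be stable. The sketch below, on invented data, shows a stable alternative with `kind="mergesort"`; it is an optional variation, not what the cell above does.

```python
import pandas as pd

# Invented rows: three comments share the same timestamp.
demo = pd.DataFrame(
    {
        "commentId": [10, 11, 12, 13],
        "publishedAt": pd.to_datetime(
            ["2021-01-02", "2021-01-01", "2021-01-01", "2021-01-01"]
        ),
    }
)

# mergesort is stable: rows with equal timestamps keep their
# original relative order (11, 12, 13).
demo_sorted = demo.sort_values(by="publishedAt", kind="mergesort")

print(demo_sorted["commentId"].tolist())
```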

### Flag Duplicate Comments

This cell creates a `duplicatedFlag` column to identify exact duplicate comments (based on `videoId`, `authorId`, and `textOriginal`). The first occurrence is flagged as 0, and subsequent duplicates as 1.

In [None]:
# A flag value of 0 indicates the first occurrence of the comment, and 1 indicates a duplicate
df_filtered["duplicatedFlag"] = df_filtered.duplicated(
    subset=["videoId", "authorId", "textOriginal"],
    keep="first"   # first occurrence = 0, rest = 1
).astype(int)
df_filtered["duplicatedFlag"].value_counts()

duplicatedFlag
0    4644162
1      80593
Name: count, dtype: int64
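
As a small illustration of how `duplicated(..., keep="first")` assigns the flag, the sketch below uses four invented rows. Only the row that repeats all three key columns of an earlier row is marked.

```python
import pandas as pd

# Invented rows: the same text posted by different authors or on different videos.
demo = pd.DataFrame(
    {
        "videoId": [1, 1, 1, 2],
        "authorId": [7, 7, 8, 7],
        "textOriginal": ["nice", "nice", "nice", "nice"],
    }
)

# A row is a duplicate only if videoId, authorId, and textOriginal all match
# an earlier row; the earliest row in each group stays at 0.
demo["duplicatedFlag"] = demo.duplicated(
    subset=["videoId", "authorId", "textOriginal"], keep="first"
).astype(int)

print(demo["duplicatedFlag"].tolist())
```

The same text from a different author (row 3) or on a different video (row 4) is not flagged.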

### Initialize Lemmatizer and Stopwords

This cell initializes the NLTK WordNet lemmatizer and the set of English stopwords, keeping negation words for more accurate text processing.

In [None]:
# Initialize the lemmatizer and stop words for text processing.
lemmatizer = WordNetLemmatizer() 
stop_words = set(stopwords.words('english')) - {"not", "no", "nor"}  # keep negations
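
To see why the negation words are kept, consider the sketch below. It uses a hypothetical miniature stopword set rather than NLTK's English list, and `drop_stopwords` is a helper invented for this illustration.

```python
# Hypothetical miniature stopword list; the notebook uses NLTK's English list.
toy_stopwords = {"this", "is", "a", "the", "not", "no", "nor"}
negations = {"not", "no", "nor"}

# Same set difference as in the cell above.
kept_stopwords = toy_stopwords - negations


def drop_stopwords(sentence, stop_set):
    """Remove every whitespace-separated word found in stop_set."""
    return " ".join(word for word in sentence.split() if word not in stop_set)


sentence = "this is not a good video"
negation_removed = drop_stopwords(sentence, toy_stopwords)
negation_kept = drop_stopwords(sentence, kept_stopwords)

print(negation_removed)
print(negation_kept)
```

With the full list the sentence collapses to "good video", which reverses its meaning; keeping the negations preserves "not good video".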

### Define Text Cleaning Function

This cell defines the `clean_text` function, which applies the following preprocessing steps to each comment, in order: lowercasing; expanding contractions; removing mentions, hashtags, links, emojis, and punctuation; normalizing elongated words; collapsing whitespace; tokenizing; removing stopwords; and lemmatizing. Non-string inputs (such as missing values) are returned as an empty string.

In [None]:
# Clean Text for subsequent ML models 
def clean_text(text):
    if not isinstance(text, str):
        return ""
    
    # Lowercase & trim
    text = text.lower().strip()

    # Expand contractions ("can't" -> "cannot")
    text = contractions.fix(text)

    # Remove mentions, hashtags, links
    text = re.sub(r'@[A-Za-z0-9_.-]+', '', text)
    text = re.sub(r'#\w+', '', text)
    text = re.sub(r"http\S+|www\S+", '', text)

    # Remove emojis completely
    text = ''.join(ch for ch in text if not emoji.is_emoji(ch))

    # Remove punctuation (ASCII + Unicode)
    text = ''.join(ch for ch in text if not unicodedata.category(ch).startswith("P"))

    # Normalize elongated words ("soooo" -> "soo")
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)

    # Collapse whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stopwords & lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    return " ".join(tokens)
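
The regex and Unicode-category steps can be tried in isolation. The sketch below copies those patterns into a hypothetical helper, `strip_noise`, and leaves out the `contractions`, `emoji`, and NLTK calls so that it runs on the standard library alone; the sample strings are invented.

```python
import re
import unicodedata


def strip_noise(text):
    """Subset of the cleaning steps: regexes and punctuation removal only."""
    text = text.lower().strip()
    text = re.sub(r"@[A-Za-z0-9_.-]+", "", text)  # mentions
    text = re.sub(r"#\w+", "", text)              # hashtags
    text = re.sub(r"http\S+|www\S+", "", text)    # links
    # Drop every character whose Unicode category starts with "P" (punctuation).
    text = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)    # runs of 3+ identical chars -> 2
    return re.sub(r"\s+", " ", text).strip()      # collapse whitespace


sample = "@some_user this is soooo gooood!!! see https://example.com #wow"
cleaned = strip_noise(sample)
digits = strip_noise("1000 views")

print(cleaned)
print(digits)
```

One thing the second call shows is that the elongation pattern matches any repeated character, so runs of digits are shortened as well. If numbers matter for the downstream task, restricting the pattern to letters would be one option.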


### Clean Comments and Create New Column

This cell applies the `clean_text` function to the `textOriginal` column of the filtered DataFrame, creating a new column `cleanedText` with the processed text.

In [None]:
# Applies the `clean_text` function to the `textOriginal` column of the `df_filtered` DataFrame,
# creating a new column `cleanedText` with the processed text. The `progress_apply` method is used to show a
# progress bar for the operation.

df_filtered['cleanedText'] = df_filtered['textOriginal'].progress_apply(clean_text)

100%|██████████| 4724755/4724755 [14:11<00:00, 5546.56it/s] 
100%|██████████| 4724755/4724755 [21:22<00:00, 3682.81it/s] 


### Tag Exact Duplicates of Cleaned Text

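This cell re-sorts the DataFrame by published date and runs a second duplicate pass. Among the rows still flagged 0, any comment whose `videoId`, `authorId`, and `cleanedText` match an earlier comment is flagged 1, so comments that differ only in case, punctuation, emojis, or similar noise are also tagged as duplicates.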

In [None]:
# Sort the DataFrame in place by the "publishedAt" column in ascending order
df_filtered.sort_values(by="publishedAt", inplace=True)

# Restrict the second pass to rows not already flagged as exact duplicates
mask = df_filtered["duplicatedFlag"] == 0

# Mark all but the first (earliest) occurrence as 1
df_filtered.loc[mask, "duplicatedFlag"] = (
    df_filtered[mask].duplicated(subset=["videoId", "authorId", "cleanedText"], keep="first").astype(int)
)

# Display the count of unique values in the "duplicatedFlag" column
df_filtered["duplicatedFlag"].value_counts()

duplicatedFlag
0    4614143
1     110612
Name: count, dtype: int64
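
The two passes can be traced on a handful of invented rows. In the sketch below the `cleanedText` values are written by hand rather than produced by `clean_text`, so it only illustrates the flagging logic.

```python
import pandas as pd

# Invented rows from one author on one video; cleanedText is hand-written.
demo = pd.DataFrame(
    {
        "videoId": [1, 1, 1, 1],
        "authorId": [7, 7, 7, 7],
        "textOriginal": ["Great video!", "Great video!", "great video", "Thanks"],
        "cleanedText": ["great video", "great video", "great video", "thanks"],
    }
)

# Pass 1: exact duplicates of the raw text.
demo["duplicatedFlag"] = demo.duplicated(
    subset=["videoId", "authorId", "textOriginal"], keep="first"
).astype(int)

# Pass 2: among rows still flagged 0, compare the cleaned text instead.
mask = demo["duplicatedFlag"] == 0
demo.loc[mask, "duplicatedFlag"] = (
    demo[mask]
    .duplicated(subset=["videoId", "authorId", "cleanedText"], keep="first")
    .astype(int)
)

print(demo["duplicatedFlag"].tolist())
```

The second row is caught by the first pass, and the third row ("great video") is caught only by the second pass because it matches the first row after cleaning.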

### Inspect a Sample Duplicate Comment

This cell filters the DataFrame for rows flagged as duplicates and displays a random sample for inspection.

In [None]:
# Filters the DataFrame for rows where 'duplicatedFlag' is 1 and then selects a random sample of 1 from the filtered data
df_filtered.loc[df_filtered['duplicatedFlag'] == 1, ['videoId', 'authorId', 'textOriginal','cleanedText']].sample(1)

|         | videoId | authorId | textOriginal | cleanedText |
|---------|---------|----------|--------------|-------------|
| 4689347 | 44543 | 773251 | Thank you dear 😘 Please share 💗 💕 ❤️ | thank dear please share ️ |
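
The `cleanedText` value in the sample row above appears to end with an invisible leftover character. A likely cause is that a heart emoji is encoded as two code points, the heart itself plus a variation selector (U+FE0F), and a character-by-character emoji check removes only the first. The sketch below shows one possible way to drop such code points; `strip_invisible_emoji_parts` is a hypothetical helper, not part of the notebook, and could be called right after the emoji-removal step.

```python
# Code points that modify or join emoji but carry no visible glyph of their own.
INVISIBLE_EMOJI_PARTS = {
    "\ufe0e",  # VARIATION SELECTOR-15 (text presentation)
    "\ufe0f",  # VARIATION SELECTOR-16 (emoji presentation)
    "\u200d",  # ZERO WIDTH JOINER (glues emoji sequences together)
}


def strip_invisible_emoji_parts(text):
    """Remove variation selectors and zero-width joiners, then trim."""
    return "".join(ch for ch in text if ch not in INVISIBLE_EMOJI_PARTS).strip()


# Reconstruction of the residue seen in the sample row (assumed to be U+FE0F).
residue = "thank dear please share \ufe0f"
fixed = strip_invisible_emoji_parts(residue)

print(repr(fixed))
```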


### Save Processed Data to CSV

This cell exports the filtered and processed DataFrame to a CSV file for future use.

In [None]:
# This code exports the filtered DataFrame to a CSV file, excluding the index column.
df_filtered.to_csv('dataset/comments_all_tagged_text_duplicates.csv', index=False)
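
When the saved file is read back later, two details are easy to miss: datetime columns come back as plain strings unless they are parsed again, and empty `cleanedText` values (for example from emoji-only comments) are read as `NaN`. The sketch below demonstrates both on invented data, using an in-memory buffer in place of the real file.

```python
import io

import pandas as pd

# Invented rows; the second one has an empty cleaned text.
demo = pd.DataFrame(
    {
        "commentId": [1, 2],
        "publishedAt": pd.to_datetime(
            ["2021-01-01 10:00:00", "2021-01-02 11:30:00"]
        ),
        "cleanedText": ["great video", ""],
    }
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype on load.
restored = pd.read_csv(buffer, parse_dates=["publishedAt"])

# Empty strings were read back as NaN; turn them into "" again.
restored["cleanedText"] = restored["cleanedText"].fillna("")

print(restored.dtypes)
```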