# Video Data Preprocessing and Cleaning

This notebook performs data cleaning, preprocessing, and feature engineering for video data, including topic extraction, filtering, text cleaning, language detection, and translation.

### Install Required Python Packages

This cell installs all necessary Python packages for text processing, language detection, and translation. These packages include:

- `emoji`: For handling and removing emojis from text

- `contractions`: For expanding English contractions (e.g., "can't" to "cannot")

- `langid`: For automatic language detection

- `deep_translator`: For translating non-English text to English

In [None]:
# Install required libraries for text processing and translation
!pip install emoji
!pip install contractions
!pip install langid
!pip install deep_translator

### Import Libraries and Set Display Options

This cell imports all the necessary libraries for data manipulation, text processing, and visualization. It also sets display options for pandas to show more content and configures warnings to be ignored. Key libraries include:

- `pandas`, `numpy`: Data manipulation and analysis

- `re`: Regular expressions for text processing

- `matplotlib.pyplot`: Data visualization

- `emoji`, `contractions`, `unicodedata`: Text cleaning

- `nltk`: Natural language processing (tokenization, stopwords, lemmatization)

- `tqdm`: Progress bars for loops and pandas operations

- `warnings`: Suppress warning messages

- Display options are set to show full text and more rows in pandas DataFrames.

In [None]:
# Import all necessary libraries and set display options
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import emoji
import contractions
import unicodedata
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
# # Download required NLTK resources (only first time)
# nltk.download('punkt')
# nltk.download('punkt_tab')
# nltk.download('wordnet')
# nltk.download('omw-1.4')
# nltk.download('stopwords')
from tqdm import tqdm
import langid  # needed by the language detection step below

# tqdm integration with pandas
tqdm.pandas()

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Show full text without truncation
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_rows", 500)

### Load Cleaned Video Dataset

This cell loads the pre-cleaned video dataset from a CSV file into a pandas DataFrame named `video`. This dataset will be used for further preprocessing and analysis.

In [None]:
# Load the cleaned video dataset
video = pd.read_csv('dataset/videos_cleaned.csv')

### Inspect the Video Dataset

This cell displays the first few rows of the loaded video dataset to provide an overview of its structure and contents. This helps verify that the data has been loaded correctly.

In [None]:
# Display the first few rows of the video dataset to inspect the data
video.head()

Unnamed: 0,videoId,publishedAt,channelId,title,description,tags,defaultLanguage,defaultAudioLanguage,contentDuration,viewCount,likeCount,commentCount,topicCategories
0,85806,2024-01-15 00:59:29+00:00,33807,Unlocking the Benefits of Face Masks for Skin ...,,,en-US,en-US,PT9S,72.0,0.0,0.0,"['https://en.wikipedia.org/wiki/Health', 'http..."
1,30556,2023-10-27 19:32:16+00:00,46650,Get ready for the Magic💚💜🤍💝✨ #hydration #glowi...,,,,,PT45S,257.0,7.0,0.0,['https://en.wikipedia.org/wiki/Lifestyle_(soc...
2,51771,2024-09-28 01:23:22+00:00,14346,#trending #makeup #beautymakeup #yslbeauty #lu...,,,,en-US,PT19S,164.0,4.0,2.0,['https://en.wikipedia.org/wiki/Lifestyle_(soc...
3,45298,2023-07-13 15:19:28+00:00,50139,#shortvedio #balayage,,,,,PT14S,1207.0,20.0,0.0,['https://en.wikipedia.org/wiki/Lifestyle_(soc...
4,43611,2023-04-29 18:47:37+00:00,8143,Full Face of Merit Beauty 🤎 featuring new Flus...,,,,en,PT56S,8647.0,268.0,7.0,['https://en.wikipedia.org/wiki/Lifestyle_(soc...


### Extract Topic Categories from URLs

This cell defines and applies functions to extract topic categories from the `topicCategories` column. It uses regular expressions to find topics in URL strings and creates two new columns:

- `extracted_topicCategories`: List of topics for each video

- `extracted_topicCategories_str`: Comma-separated string of topics for each video

In [5]:
# Extract topic categories from the topicCategories column using regex
def extract_topics(url_list_str):
    """Return the topic names found after 'wiki/' in a stringified URL list."""
    if pd.isna(url_list_str) or url_list_str == '':
        return []
    # Capture everything after "wiki/" up to the next slash, query, fragment, or quote
    return re.findall(r"wiki/([^/?#'\"]+)", url_list_str)

# List of topics per video
video['extracted_topicCategories'] = video['topicCategories'].progress_apply(extract_topics)

# Comma-separated string of topics per video (empty string when there are none)
video['extracted_topicCategories_str'] = video['extracted_topicCategories'].progress_apply(", ".join)

100%|██████████| 91492/91492 [00:00<00:00, 300955.43it/s]
100%|██████████| 91492/91492 [00:00<00:00, 352066.98it/s]
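
To make the pattern's behavior concrete, here is a small standalone check on a string shaped like the `topicCategories` values in the preview. The sample value is taken from the first row of the dataset preview; the `ast.literal_eval` variant is an alternative sketch for comparison, not what this notebook uses.

```python
import ast
import re

# Sample value in the same shape as the topicCategories column (a stringified list of URLs)
sample = "['https://en.wikipedia.org/wiki/Health', 'https://en.wikipedia.org/wiki/Lifestyle_(sociology)']"

# Regex approach: capture the path segment that follows "wiki/"
regex_topics = re.findall(r"wiki/([^/?#'\"]+)", sample)

# Alternative sketch: parse the string into a real list, then keep the last path segment
parsed_topics = [url.rsplit("/", 1)[-1] for url in ast.literal_eval(sample)]

print(regex_topics)
print(parsed_topics)
```

Both approaches give the same topics for well-formed values. The regex is more forgiving when a value is truncated or not valid Python list syntax.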


### Flatten and Count Topic Categories

This cell flattens the extracted topic categories from all videos into a single list and counts the frequency of each category using the `Counter` class. The result helps identify the most common topics in the dataset.

In [7]:
# Flatten the extracted topic categories and count their frequency
from collections import Counter

# Flatten with tqdm
all_categories = []
for sublist in tqdm(video['extracted_topicCategories'], desc="Flattening categories"):
    all_categories.extend(sublist)

# Count frequency
category_counts = pd.Series(Counter(all_categories)).sort_values(ascending=False)

# Display the number of distinct category names
print("Number of unique categories:", len(category_counts))

Flattening categories: 100%|██████████| 91492/91492 [00:00<00:00, 1711337.83it/s]

Number of unique categories: 30
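
The flatten-and-count step can be sketched with the standard library alone. The topic lists below are hypothetical. The sketch also shows why the number of distinct categories is the length of the count table: counting the distinct values of the counts themselves, which is what pandas' `Series.nunique()` does on a count Series, measures something different.

```python
from collections import Counter
from itertools import chain

# Hypothetical per-video topic lists, mirroring the extracted_topicCategories column
topic_lists = [
    ["Health", "Lifestyle_(sociology)"],
    ["Lifestyle_(sociology)", "Physical_attractiveness"],
    ["Health", "Physical_attractiveness"],
    [],
]

# Flatten the nested lists and count each category
counts = Counter(chain.from_iterable(topic_lists))

num_categories = len(counts)                          # distinct category names
num_distinct_frequencies = len(set(counts.values()))  # distinct count values

print(counts.most_common())
print(num_categories, num_distinct_frequencies)
```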





### Check Sample of Extracted Topic Categories

This cell displays a sample of the extracted topic categories for the first 10 videos. This helps verify that the extraction process worked as intended.

In [6]:
# Check a sample of the extracted topic categories for verification
video['extracted_topicCategories'].head(10)

0                     [Health, Lifestyle_(sociology)]
1    [Lifestyle_(sociology), Physical_attractiveness]
2    [Lifestyle_(sociology), Physical_attractiveness]
3    [Lifestyle_(sociology), Physical_attractiveness]
4    [Lifestyle_(sociology), Physical_attractiveness]
5    [Lifestyle_(sociology), Physical_attractiveness]
6    [Lifestyle_(sociology), Physical_attractiveness]
7    [Lifestyle_(sociology), Physical_attractiveness]
8    [Lifestyle_(sociology), Physical_attractiveness]
9    [Lifestyle_(sociology), Physical_attractiveness]
Name: extracted_topicCategories, dtype: object

### Display Category Counts as DataFrame

This cell converts the category counts into a DataFrame for easier viewing and further analysis.

In [8]:
# Display the category counts as a DataFrame
category_counts.to_frame(name='count')

Unnamed: 0,count
Lifestyle_(sociology),87043
Physical_attractiveness,86483
Hobby,4228
Health,3975
Fashion,2989
Music_of_Asia,1030
Entertainment,491
Physical_fitness,347
Pop_music,152
Vehicle,146


### Filter Videos by Target Categories

This cell defines a list of target categories, that is, topics considered irrelevant to this analysis, and finds the videos whose extracted topics match any of them. The matching video IDs are saved to `dataset/irrelevant_video_ids.csv` and are excluded from the main dataset in a later step.

In [None]:
# Define target categories and filter video IDs that match these categories
target_categories = [
    "Vehicle",
    "Humour",
    "Society",
    "Technology",
    "Knowledge",
    "Video_game_culture",
    "Film",
    "Action-adventure_game",
    "Action_game",
    "Role-playing_video_game",
    "Strategy_video_game",
    "Religion",
    "Pet",
    "Politics",
    "Sport",
    "Food",
    "Casual_game",
    "Military",
    "Puzzle_video_game",
    "Country_music",
    "Mixed_martial_arts",
    "Sports_game",
    "Basketball"
    ]

pattern = "|".join(target_categories)

filtered_ids = video.loc[
    video['extracted_topicCategories_str'].str.contains(pattern, na=False, regex=True),
    'videoId'
    ]

filtered_ids.to_frame(name='videoId').to_csv('dataset/irrelevant_video_ids.csv', index=False)
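
Because the filter joins the category names into one regular expression, it matches substrings rather than whole topic names. The sketch below contrasts that with exact matching on the topic lists. The video IDs and the topic name `Filmmaking` are invented for illustration and are not taken from the dataset.

```python
import re

excluded = {"Film", "Pet", "Sport"}  # subset of the target categories above

# Hypothetical topic lists; "Filmmaking" is an invented name used only to show the difference
videos = {
    "a": ["Lifestyle_(sociology)", "Pet"],
    "b": ["Filmmaking", "Hobby"],
    "c": ["Health"],
}

pattern = "|".join(sorted(excluded))

# Substring matching on the joined string (the same idea as str.contains with a regex)
substring_hits = {vid for vid, topics in videos.items() if re.search(pattern, ", ".join(topics))}

# Exact matching on whole topic names
exact_hits = {vid for vid, topics in videos.items() if excluded.intersection(topics)}

print(sorted(substring_hits))
print(sorted(exact_hits))
```

If the topic vocabulary contains no name that embeds a target category, both approaches select the same videos. Exact matching on `extracted_topicCategories` is the safer choice when that cannot be guaranteed.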


### Check Number of Filtered Videos

This cell checks how many videos in the dataset match the filtered target categories. This helps to understand the impact of the filtering step.

In [13]:
# Check the number of videos that match the filtered IDs
video.loc[video['videoId'].isin(filtered_ids)].shape

(379, 18)

### Remove Videos with Target Categories

This cell removes all videos that belong to the specified target categories from the main dataset. The resulting DataFrame, `video_filtered`, contains only videos outside these categories.

In [15]:
# Remove videos that belong to the target categories from the dataset
video_filtered = video[~video['videoId'].isin(filtered_ids)].copy()  # copy so later column assignments do not trigger SettingWithCopyWarning
video_filtered.shape

(91113, 18)

### Remove Videos with Like Count Greater Than View Count

This cell filters out any video records where the number of likes is greater than the number of views. Such cases are logically inconsistent and likely indicate data errors.

In [None]:
# Remove rows where likeCount is greater than viewCount
# (rows with a missing count also fail this comparison and are dropped)
video_filtered = video_filtered[video_filtered['likeCount'] <= video_filtered['viewCount']]
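
One detail of this filter is how it treats missing counts. The sketch below uses a small hypothetical DataFrame with the dataset's `likeCount` and `viewCount` column names. It compares the "keep rows where likes <= views" form with a variant that removes only rows that are demonstrably inconsistent.

```python
import numpy as np
import pandas as pd

# Hypothetical rows using the dataset's likeCount / viewCount column names
df = pd.DataFrame({
    "videoId": [1, 2, 3, 4],
    "likeCount": [5.0, 20.0, np.nan, 3.0],
    "viewCount": [10.0, 8.0, 50.0, np.nan],
})

# Comparisons involving NaN evaluate to False, so videos 3 and 4 are dropped along with video 2
strict = df[df["likeCount"] <= df["viewCount"]]

# Variant that removes only rows where both counts are present and likes exceed views
inconsistent = df["likeCount"] > df["viewCount"]
lenient = df[~inconsistent]

print(strict["videoId"].tolist())
print(lenient["videoId"].tolist())
```

Which form is appropriate depends on whether rows with missing engagement counts should survive this step.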

### Initialize Lemmatizer and Stopwords

This cell initializes the NLTK WordNet lemmatizer and the set of English stopwords. The negation words "not", "no", and "nor" are removed from the stopword set so that they are kept in the cleaned text, which matters for sentiment analysis.

In [None]:
# Initialize lemmatizer and stopwords for text cleaning
lemmatizer = WordNetLemmatizer()

# Drop negations from the stopword set so they survive stopword removal
stop_words = set(stopwords.words('english')) - {"not", "no", "nor"}
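
The effect of excluding negations from the stopword set can be seen on a single sentence. This sketch uses a hypothetical miniature stopword list in place of NLTK's English list so that it runs without downloading any resources.

```python
# Hypothetical miniature stopword list standing in for NLTK's English list
base_stop_words = {"this", "is", "the", "a", "not", "no", "nor"}
negations = {"not", "no", "nor"}

tokens = "this is not a good serum".split()

# Negations treated as stopwords: "not" disappears and the meaning flips
without_negations = [t for t in tokens if t not in base_stop_words]

# Negations removed from the stopword set: "not" stays in the text
kept_stop_words = base_stop_words - negations
with_negations = [t for t in tokens if t not in kept_stop_words]

print(without_negations)
print(with_negations)
```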

### Define Text Cleaning Function

This cell defines the `clean_text` function, which applies these preprocessing steps in order: lowercasing, expanding contractions, removing mentions, stripping the `#` symbol from hashtags (the tag text itself is kept), removing links, emojis, and punctuation, shortening elongated character runs, collapsing whitespace, tokenizing, removing stopwords, and lemmatizing. The cleaned text is suitable for further NLP tasks.

In [None]:
# Function to clean text for subsequent ML models
def clean_text(text):
    if not isinstance(text, str):
        return ""

    # Lowercase & trim
    text = text.lower().strip()

    # Expand contractions ("can't" -> "cannot")
    text = contractions.fix(text)

    # Remove mentions, strip the '#' from hashtags (tag text is kept), remove links
    text = re.sub(r'@[A-Za-z0-9_.-]+', '', text)
    text = re.sub(r'#', '', text)
    text = re.sub(r"http\S+|www\S+", '', text)

    # Remove emojis completely
    text = ''.join(ch for ch in text if not emoji.is_emoji(ch))

    # Remove punctuation (ASCII + Unicode)
    text = ''.join(ch for ch in text if not unicodedata.category(ch).startswith("P"))

    # Normalize elongated words ("soooo" -> "soo")
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)

    # Collapse whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stopwords & lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    return " ".join(tokens)
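
The regex-based steps can be traced on one title without any third-party packages. The sketch below is a reduced version of the function: it omits contraction expansion, emoji removal, tokenization, stopword removal, and lemmatization. The sample title, the name `basic_clean`, and the URL are made up for illustration.

```python
import re
import unicodedata

def basic_clean(text: str) -> str:
    """Reduced cleaning pipeline covering only the standard-library steps."""
    text = text.lower().strip()
    text = re.sub(r"@[A-Za-z0-9_.-]+", "", text)   # drop @mentions
    text = re.sub(r"#", "", text)                  # drop '#', keep the tag text
    text = re.sub(r"http\S+|www\S+", "", text)     # drop links
    # Drop every character whose Unicode category is punctuation (P*)
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)     # runs of 3+ identical chars become 2
    return re.sub(r"\s+", " ", text).strip()       # collapse whitespace

sample = "Sooooo good!!! @brand_x #GlowUp https://example.com/a?b=1"
cleaned = basic_clean(sample)
print(cleaned)
```

The hashtag text `glowup` survives as an ordinary token, which is why hashtag words appear in the `cleanedText` column.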

### Clean Video Titles and Create New Column

This cell applies the `clean_text` function to the `title` column of the filtered video DataFrame. The cleaned text is stored in a new column called `cleanedText` for each video.

In [None]:
# Apply text cleaning to the video titles and create a new column
video_filtered['cleanedText'] = video_filtered['title'].progress_apply(clean_text)

100%|██████████| 91492/91492 [00:19<00:00, 4581.30it/s]


### Detect and Label English Texts

This cell defines a function to detect the language of each cleaned text using `langid` and labels each video as English or non-English. It also prints the counts of English and non-English texts in the dataset.

In [None]:
# Detect language of cleaned text and label as English or non-English
def label_english_texts(texts):
    labels = []
    for text in tqdm(texts, desc="Detecting language (langid)"):
        if not text or not isinstance(text, str):
            labels.append(0)  # treat empty/invalid as non-English
            continue
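        # langid.classify returns (language code, score); by default the score is an
        # unnormalized confidence, not a probability, so only the language code is used
        lang, _score = langid.classify(text)
        labels.append(1 if lang == "en" else 0)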
    return labels

# Apply to your DataFrame
video_filtered["is_english"] = label_english_texts(video_filtered["cleanedText"].tolist())

# Get counts
eng_count = video_filtered["is_english"].sum()
non_eng_count = len(video_filtered) - eng_count

print("English texts:", eng_count)
print("Non-English texts:", non_eng_count)

Detecting language (langid): 100%|██████████| 91492/91492 [00:07<00:00, 12631.79it/s]

English texts: 87221
Non-English texts: 4271
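
A labelling rule that combines the predicted language with a confidence threshold is sensitive to the choice between `or` and `and`. The sketch below shows the difference on hypothetical (language, probability) pairs; these are not real `langid` outputs. It also assumes probabilities normalized to the range 0 to 1. To my knowledge `langid` provides these only through a `LanguageIdentifier` built with `norm_probs=True`, so treat that detail as an assumption to verify against the library's documentation.

```python
def is_english_or(lang: str, prob: float, threshold: float = 0.60) -> int:
    # "or": any confident prediction passes, whatever the language
    return 1 if lang == "en" or prob > threshold else 0

def is_english_and(lang: str, prob: float, threshold: float = 0.60) -> int:
    # "and": the prediction must be English and confident
    return 1 if lang == "en" and prob > threshold else 0

# Hypothetical (language, normalized probability) pairs, not real langid output
predictions = [("en", 0.98), ("es", 0.95), ("en", 0.40), ("de", 0.30)]

or_labels = [is_english_or(lang, p) for lang, p in predictions]
and_labels = [is_english_and(lang, p) for lang, p in predictions]

print(or_labels)
print(and_labels)
```

With `or`, a confident Spanish prediction is labelled English. With `and`, a low-confidence English prediction is labelled non-English and would be sent to translation.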





### Translate Non-English Texts to English

This cell uses the `deep_translator` library to translate all non-English cleaned texts to English. The translations are stored in a new column called `translated` in a separate DataFrame.

In [None]:
# Translate non-English cleaned texts to English using GoogleTranslator
from deep_translator import GoogleTranslator

# Non-English rows are the ones labelled is_english == 0 in the filtered DataFrame
df_non_eng = video_filtered[video_filtered["is_english"] == 0].copy()

translator = GoogleTranslator(source="auto", target="en")

# Translate each non-English text
df_non_eng["translated"] = [
    translator.translate(t) if isinstance(t, str) and t.strip() else ""
    for t in tqdm(df_non_eng["cleanedText"], desc="Translating to English")
]


Translating to English: 100%|██████████| 4271/4271 [17:15<00:00,  4.13it/s]
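
Translating row by row means one request per text. One way to reduce the number of requests, and to keep a single failed request from aborting the run, is to translate each distinct text once and fall back to the original text on error. This is a sketch: `translate_unique` and `fake_translate` are hypothetical names, and the stand-in translator replaces the real `translator.translate` call so the example runs offline.

```python
from typing import Callable, Dict, List

def translate_unique(texts: List[str], translate_fn: Callable[[str], str]) -> List[str]:
    """Translate each distinct non-empty text once and reuse the result for duplicates."""
    cache: Dict[str, str] = {}
    results = []
    for text in texts:
        if not isinstance(text, str) or not text.strip():
            results.append("")
            continue
        if text not in cache:
            try:
                cache[text] = translate_fn(text)
            except Exception:
                # Fall back to the original text if the translation call fails
                cache[text] = text
        results.append(cache[text])
    return results

# Stand-in translator for illustration only
calls = []
def fake_translate(text: str) -> str:
    calls.append(text)
    if text == "boom":
        raise RuntimeError("simulated network failure")
    return text.upper()

translated = translate_unique(["hola", "hola", "", "boom", "bonjour"], fake_translate)
print(translated, len(calls))
```

In the notebook, the same helper could be called with `translator.translate` as `translate_fn`.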


### Merge Translations Back to Main DataFrame

This cell merges the translated texts from non-English videos back into the main DataFrame. For videos that were already in English, the cleaned text is used as the translation.

In [None]:
# Merge translated texts back into the main DataFrame
merged = video_filtered.merge(
    df_non_eng[['videoId', 'translated']],  # take only the needed columns
    on='videoId',
    how='left'  # keep all rows from video_filtered
)

# For rows without translation (i.e., already English), copy cleanedText
merged["translated"] = merged["translated"].fillna(merged["cleanedText"])
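
The merge-then-fill pattern can be checked on a toy example. The two small DataFrames below are hypothetical stand-ins for `video_filtered` and `df_non_eng`. The `validate="one_to_one"` argument is an optional safeguard that this notebook does not use: it makes pandas raise an error if `videoId` is duplicated on either side, which would otherwise silently multiply rows.

```python
import pandas as pd

# Hypothetical miniature versions of video_filtered and df_non_eng
main = pd.DataFrame({
    "videoId": [1, 2, 3],
    "cleanedText": ["face mask skin", "maquillaje facil", "balayage"],
})
non_eng = pd.DataFrame({"videoId": [2], "translated": ["easy makeup"]})

# Left join keeps every row of main; unmatched rows get NaN in "translated"
out = main.merge(non_eng, on="videoId", how="left", validate="one_to_one")

# English rows have no translation after the join, so reuse cleanedText
out["translated"] = out["translated"].fillna(out["cleanedText"])

print(out)
```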

### Inspect Merged DataFrame with Translations

This cell displays the first few rows of the merged DataFrame, allowing you to verify that the translations have been merged correctly and that the data is ready for export.

In [None]:
# Display the first few rows of the merged DataFrame to verify translations
merged.head()

Unnamed: 0,videoId,publishedAt,channelId,title,description,tags,defaultLanguage,defaultAudioLanguage,contentDuration,viewCount,likeCount,commentCount,topicCategories,cleanedText,is_english,translated
0,85806,2024-01-15 00:59:29+00:00,33807,Unlocking the Benefits of Face Masks for Skin Health,,,en-US,en-US,PT9S,72.0,0.0,0.0,"['https://en.wikipedia.org/wiki/Health', 'https://en.wikipedia.org/wiki/Lifestyle_(sociology)']",unlocking benefit face mask skin health,1,unlocking benefit face mask skin health
1,30556,2023-10-27 19:32:16+00:00,46650,Get ready for the Magic💚💜🤍💝✨ #hydration #glowingskin #nomakeuplook #skincare,,,,,PT45S,257.0,7.0,0.0,"['https://en.wikipedia.org/wiki/Lifestyle_(sociology)', 'https://en.wikipedia.org/wiki/Physical_attractiveness']",get ready magic hydration glowingskin nomakeuplook skincare,1,get ready magic hydration glowingskin nomakeuplook skincare
2,51771,2024-09-28 01:23:22+00:00,14346,#trending #makeup #beautymakeup #yslbeauty #luxury #latina #fyp,,,,en-US,PT19S,164.0,4.0,2.0,"['https://en.wikipedia.org/wiki/Lifestyle_(sociology)', 'https://en.wikipedia.org/wiki/Physical_attractiveness']",trending makeup beautymakeup yslbeauty luxury latina fyp,1,trending makeup beautymakeup yslbeauty luxury latina fyp
3,45298,2023-07-13 15:19:28+00:00,50139,#shortvedio #balayage,,,,,PT14S,1207.0,20.0,0.0,"['https://en.wikipedia.org/wiki/Lifestyle_(sociology)', 'https://en.wikipedia.org/wiki/Physical_attractiveness']",shortvedio balayage,0,shortvedio balayage
4,43611,2023-04-29 18:47:37+00:00,8143,Full Face of Merit Beauty 🤎 featuring new Flush Balm Shades! #merit #sephora #makeuptutorial,,,,en,PT56S,8647.0,268.0,7.0,"['https://en.wikipedia.org/wiki/Lifestyle_(sociology)', 'https://en.wikipedia.org/wiki/Physical_attractiveness']",full face merit beauty featuring new flush balm shade merit sephora makeuptutorial,1,full face merit beauty featuring new flush balm shade merit sephora makeuptutorial


### Save Final Cleaned and Translated DataFrame

This cell saves the final merged DataFrame, which contains cleaned and translated video data, to a CSV file for future use or analysis.

In [None]:
# Save the final cleaned and translated DataFrame to a CSV file
merged.to_csv('dataset/videos_cleaned_translated.csv', index=False)
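
CSV files store only text, so a list-valued column such as `extracted_topicCategories`, if it is still present in `merged`, is written in its string form and comes back as a string when the file is reloaded. The sketch below reproduces this with the standard `csv` module and restores the list with `ast.literal_eval`. The single row is illustrative.

```python
import ast
import csv
import io

# Write one row containing a list value; the csv module stringifies it with str()
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["videoId", "extracted_topicCategories"])
writer.writerow([85806, ["Health", "Lifestyle_(sociology)"]])

# Read it back: the list is now plain text
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw_value = rows[0]["extracted_topicCategories"]

# ast.literal_eval turns the text back into a Python list
restored = ast.literal_eval(raw_value)

print(type(raw_value).__name__, restored)
```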