# Data Preprocessing 2: Language Filtering and Preprocessing for Comments

This notebook performs language detection and filtering on a dataset of YouTube comments. It loads the final labeled comments, filters out spam, detects English comments using the `langid` library, and saves the results for downstream analysis. The workflow includes package installation, data loading, spam filtering, language identification, and exporting the processed data.

### Install Required Package: langid

This cell installs the `langid` package, which is used for automatic language identification of text data during preprocessing.

In [None]:
!pip install langid

Collecting langid
  Downloading langid-1.1.6.tar.gz (1.9 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langid
  Building wheel for langid (setup.py) ... done
  Created wheel for langid: filename=langid-1.1.6-py3-none-any.whl size=1941171 sha256=0d1cb21848c4bec3d3e8e33f54ed4ca0eddde418d2f3f582f292a238a4284e02
  Stored in directory: /root/.cache/pip/wheels/3c/bc/9d/266e27289b9019680d65d9b608c37bff1eff565b001c977ec5
Successfully built langid
Installing collected packages: langid
Successfully installed langid-1.1.6


### Import Libraries for Data Processing and Language Detection

This cell imports the necessary libraries for data manipulation (`pandas`), progress tracking (`tqdm`), and language identification (`langid`).

In [None]:
import pandas as pd
from tqdm import tqdm
import langid

### Load Final Labeled Comments Dataset

This cell loads the final labeled comments dataset from a CSV file and displays its shape to confirm successful loading and inspect the data structure.

In [None]:
file_path = 'dataset/final_after_spam.csv'
comment = pd.read_csv(file_path)
comment.shape

(4620076, 15)

### Filter Out Spam Comments

This cell filters the loaded comments to retain only those labeled as not spam (`isSpam == 0`) and displays the shape of the filtered DataFrame.

In [None]:
filtered_comment = comment[comment["isSpam"] == 0].copy()
filtered_comment.shape

(3059016, 15)
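
As a quick sanity check on the two shapes reported above, the row counts before and after the filter can be compared directly. The snippet below is only an illustrative sketch: `summarize_filter` is a hypothetical helper that is not part of the notebook, and the two counts are copied by hand from the cell outputs.

```python
from typing import Tuple


def summarize_filter(rows_before: int, rows_after: int) -> Tuple[int, float]:
    """Return (rows removed, fraction of rows kept) for a filtering step."""
    removed = rows_before - rows_after
    kept_fraction = rows_after / rows_before
    return removed, kept_fraction


# Row counts copied from the shapes printed by the loading and filtering cells
removed, kept_fraction = summarize_filter(4_620_076, 3_059_016)
print(f"Rows removed by the filter: {removed:,}")
print(f"Share of rows kept: {kept_fraction:.1%}")
```

With these counts, the filter drops about 1.56 million rows and keeps roughly two thirds of the dataset. Note that `isSpam == 0` also drops any rows whose label is missing, so the removed rows are not necessarily all labeled as spam.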

### Detect English Comments Using langid

This cell defines a function that uses the `langid` library to label each comment as English (1) or non-English (0), treating empty or non-string values as non-English. It applies the function to the `cleanedText` column of the filtered comments, stores the labels in a new `is_english` column, and prints the count of each class.

In [None]:
def label_english_texts(texts):
    """Return one label per input text: 1 for English, 0 otherwise."""
    labels = []
    for text in tqdm(texts, desc="Detecting language (langid)"):
        # Missing values (NaN) and empty strings are treated as non-English
        if not text or not isinstance(text, str):
            labels.append(0)
            continue
        # langid.classify returns (language code, confidence score); only the code is used
        lang, _score = langid.classify(text)
        labels.append(1 if lang == "en" else 0)
    return labels

# Label every non-spam comment based on its cleaned text
filtered_comment["is_english"] = label_english_texts(filtered_comment["cleanedText"].tolist())

# Count English and non-English comments
eng_count = filtered_comment["is_english"].sum()
non_eng_count = len(filtered_comment) - eng_count

print("English texts:", eng_count)
print("Non-English texts:", non_eng_count)

Detecting language (langid): 100%|██████████| 3059016/3059016 [1:11:05<00:00, 717.08it/s]


English texts: 2008655
Non-English texts: 1050361
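
The progress bar above shows that classifying about three million comments one by one took a little over an hour. If the cleaned comments contain many repeated strings (an assumption that has not been checked here), one possible speed-up is to classify each distinct string only once and reuse the result. The sketch below illustrates that idea; `label_english_unique` is a hypothetical variant of the function above, and the classifier is passed in as an argument so the caching logic can be tried without `langid` installed.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def label_english_unique(
    texts: Iterable[object],
    classify: Callable[[str], Tuple[str, float]],
) -> List[int]:
    """Label texts as English (1) or not (0), classifying each distinct string once."""
    cache: Dict[str, int] = {}
    labels: List[int] = []
    for text in texts:
        # Same rule as the notebook: empty or non-string values are non-English
        if not text or not isinstance(text, str):
            labels.append(0)
            continue
        # Only call the classifier the first time a string is seen
        if text not in cache:
            lang, _score = classify(text)
            cache[text] = 1 if lang == "en" else 0
        labels.append(cache[text])
    return labels


# Possible use in this notebook (assumes langid is imported as above):
#   filtered_comment["is_english"] = label_english_unique(
#       filtered_comment["cleanedText"].tolist(), langid.classify
#   )
```

The labels are identical to those produced by the original loop, because the same classifier is applied to the same strings. Any time saved depends entirely on how many duplicates the data contains.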


### Save Filtered Comments with Language Labels

This cell saves the filtered DataFrame, which now includes English/non-English labels, to a CSV file for further analysis or downstream processing.

In [None]:
filtered_comment.to_csv('dataset/final_after_spam_eng.csv', index=False)

### Preview Filtered Comments

This cell displays the first few rows of the filtered DataFrame, allowing inspection of the comments and their language labels.

In [None]:
filtered_comment.head()

| | commentId | channelId | videoId | authorId | textOriginal | parentCommentId | likeCount | publishedAt | updatedAt | duplicatedFlag | cleanedText | cleanedTextSentiment | regex_spam | predicted_spam | isSpam | is_english |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3166243 | 41024 | 6217 | 26499 | Good Information... Will definitely try it...... | | 0 | 2020-01-01 16:00:58+00:00 | 2020-01-01 16:00:58+00:00 | 0 | good information definitely try thanks | good information definitely try thanks : smili... | 0 | 0.0 | 0 | 1 |
| 1 | 1888757 | 10004 | 86296 | 2608986 | Crystal, is it true that beginning Campaign 3,... | | 1 | 2020-01-04 07:49:54+00:00 | 2020-01-04 07:49:54+00:00 | 0 | crystal true beginning campaign 3 order get fr... | crystal true beginning campaign 3 order get fr... | 0 | 0.0 | 0 | 0 |
| 2 | 0 | 10004 | 86296 | 164837 | Yes but I am charged $8 to cover your free shi... | 1888757.0 | 0 | 2020-01-04 07:53:24+00:00 | 2020-01-04 07:53:24+00:00 | 0 | yes charged $ 8 cover free shipping not rep wo... | yes charged $ 8 cover free shipping not rep wo... | 0 | 0.0 | 0 | 1 |
| 4 | 1279533 | 5459 | 64449 | 882554 | Very useful video | | 2 | 2020-01-04 10:32:19+00:00 | 2020-01-04 10:32:19+00:00 | 0 | useful video | useful video | 0 | 0.0 | 0 | 1 |
| 5 | 2543589 | 32215 | 89804 | 1777705 | Osm three hair colour | | 2 | 2020-01-04 13:07:46+00:00 | 2020-01-04 13:07:46+00:00 | 0 | osm three hair colour | osm three hair colour | 0 | 0.0 | 0 | 1 |
