<h1> <strong> <center> Text Preprocessing </center> </strong>  </h1> 


<h4> 1) Library Imports </h4>

In [11]:
import numpy as np
import regex as re
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import keras
from LughaatNLP import LughaatNLP
import tensorflow as tf

<h4> 2) Data Cleaning </h4>
This step involves: <br>
- Dropping the columns that are irrelevant to the task (`id` and `link`) <br>
- Dropping any rows that contain null values <br>
- Dropping any rows whose `content` duplicates an earlier row, keeping only the first occurrence
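
The three cleaning steps can also be chained into one small function. The sketch below is only an illustration: the `toy` DataFrame and the `clean` helper are hypothetical names, and the values are made up to mirror the column layout of `raw_data.csv`.

```python
import pandas as pd

# Hypothetical toy data mirroring the column layout of raw_data.csv.
toy = pd.DataFrame({
    "id": [0.0, 1.0, 2.0, 3.0],
    "title": ["a", "b", "c", "d"],
    "link": ["u1", "u2", "u3", "u4"],
    "content": ["x", None, "x", "y"],
    "gold_label": ["sports", "sports", "sports", "world"],
})


def clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop unused columns, rows with nulls, and rows with repeated content."""
    return (
        frame.drop(columns=["id", "link"])                 # irrelevant columns
        .dropna()                                          # rows with any null
        .drop_duplicates(subset="content", keep="first")   # repeated content
        .reset_index(drop=True)
    )


cleaned = clean(toy)
print(cleaned.shape)
```

Each call returns a new DataFrame, so `toy` itself is left unchanged and the result can be compared against it afterwards.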

In [12]:
df = pd.read_csv("raw_data.csv")
print(df.shape)
df.head(3)

(3409, 5)


Unnamed: 0,id,title,link,content,gold_label
0,0.0,بھول بھلیاں 3 کے گانے پر دلجیت اور کارتک کا دھ...,https://www.express.pk/story/2733762/bb3-ke-so...,مشہور پنجابی گلوکار اور اداکار دلجیت دوسانجھ ن...,entertainment
1,0.0,سلمان خان کا شاہ رخ خان کے ’منت‘ سے متعلق بڑا ...,https://www.express.pk/story/2732327/salman-kh...,بالی ووڈ کے دبنگ خان نے ممبئی میں موجود شاہ رخ...,entertainment
2,0.0,نیلم کوٹھاری غیر متوقع سوال پر حیران، حاضرین ک...,https://jang.com.pk/news/1418564,بھارتی فلم اور ٹی وی کی معروف اداکارہ اور نیٹ ...,entertainment


In [13]:
df.drop(['id', 'link'], axis=1, inplace=True)
print(df.columns)

Index(['title', 'content', 'gold_label'], dtype='object')


In [14]:
null_values_per_column = df.isnull().sum()
print("Null values per column:\n", null_values_per_column)

numOfNullVals = null_values_per_column.sum()
print("\nTotal number of null values in the dataset:", numOfNullVals)

Null values per column:
 title          0
content       57
gold_label     0
dtype: int64

Total number of null values in the dataset: 57


In [15]:
# Drop every row that still contains a null value.
df.dropna(inplace=True)
print(df.shape)

(3352, 3)


In [16]:
df_unique = df.drop_duplicates(subset='content', keep='first')
print("Original DataFrame: ", df.shape)
print("DataFrame after dropping duplicates: ", df_unique.shape)
print("Number of rows dropped: ", df.shape[0] - df_unique.shape[0])

Original DataFrame:  (3352, 3)
DataFrame after dropping duplicates:  (2749, 3)
Number of rows dropped:  603


<h4> 3) Text Preprocessing </h4>

Here, we define a function `preprocess_dataset` that uses the `regex` and LughaatNLP libraries to clean and preprocess Urdu text.
The function:
- removes URLs
- removes Urdu and Latin punctuation
- removes any remaining characters that are neither word characters nor whitespace
- collapses repeated whitespace
- normalizes the text
- lemmatizes and then stems the words
- removes stopwords
- corrects spelling
- tokenizes the text into words
- joins the tokens back into a single processed string
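
The regex-only part of this pipeline can be tried in isolation. The sketch below is an illustration under stated assumptions: it uses the standard-library `re` module rather than `regex`, it leaves out every LughaatNLP step, and the names `strip_noise` and `sample` are hypothetical.

```python
import re

# Patterns for the regex stage only (no LughaatNLP involved here).
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
NON_WORD_PATTERN = re.compile(r"[^\w\s]")
WHITESPACE_PATTERN = re.compile(r"\s+")


def strip_noise(text: str) -> str:
    """Remove URLs and punctuation, then collapse whitespace."""
    text = URL_PATTERN.sub("", text)
    text = NON_WORD_PATTERN.sub("", text)
    return WHITESPACE_PATTERN.sub(" ", text).strip()


sample = "Breaking news!!  Read more at https://example.com/story?id=1 , now."
print(strip_noise(sample))
```

URLs are removed first so that stripping punctuation does not break them into fragments that would survive as ordinary words.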

In [17]:
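text_processor = LughaatNLP()

def preprocess_dataset(text):
    """Clean one Urdu string and return it as space-joined tokens."""
    # Strip URLs (the "http\S+" branch already covers "https" links).
    text = re.sub(r"http\S+|www\S+", "", text)
    # Strip common Urdu and Latin punctuation marks.
    urdu_punctuation = r"[،۔؛؟!\"'،ٔ]+"
    text = re.sub(urdu_punctuation, "", text)
    # Strip anything left that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()

    # LughaatNLP steps: normalize, lemmatize, stem, drop stopwords, fix spelling.
    text = text_processor.normalize(text)
    text = text_processor.lemmatize_sentence(text)
    text = text_processor.urdu_stemmer(text)
    text = text_processor.remove_stopwords(text)
    text = text_processor.corrected_sentence_spelling(text, 1)
    # Tokenize, then join the tokens back into one string.
    tokens = text_processor.urdu_tokenize(text)
    return " ".join(tokens)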

In [18]:
print("Preprocessing the 'content' column...")
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
df_unique = df_unique.assign(processed_content=df_unique['content'].apply(preprocess_dataset))

Preprocessing the 'content' column...


In [19]:
print("Preprocessing the 'title' column...")
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
df_unique = df_unique.assign(processed_title=df_unique['title'].apply(preprocess_dataset))

Preprocessing the 'title' column...


In [20]:
df_unique.columns

Index(['title', 'content', 'gold_label', 'processed_content',
       'processed_title'],
      dtype='object')

In [22]:
# Keep only the label and the processed text columns.
df_unique = df_unique.drop(columns=['content', 'title'])
df_unique.columns


Index(['gold_label', 'processed_content', 'processed_title'], dtype='object')

In [24]:
df_unique.to_csv('scraped_content.csv', index=False)
print("Preprocessed DataFrame saved to scraped_content.csv.")

Preprocessed DataFrame saved to scraped_content.csv.
