# Urdu Text Data Preprocessing

### **Description**  
This notebook contains the data cleaning and preprocessing steps that produce the final Excel file used by all downstream models. After preprocessing, the data is saved to "normalized_and_tokenized_combined_data.xlsx", which the models read as their input.



## Imports

In [3]:
##### Imports ####
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import urduhack
import spacy
from collections import Counter
from typing import List, Dict, Tuple

##### Globals #####
nlp = spacy.blank('ur')  # Blank Urdu spaCy pipeline, used only for tokenization

**Loading data**

In [4]:
# Load the Excel file into a DataFrame
df = pd.read_excel('../Data/combined_data.xlsx')

# Display the first few rows to understand the data
df.head()


Unnamed: 0,id,title,link,content,gold_label
0,0,’کبھی میں کبھی تم‘ کی اداکارہ نائمہ بٹ کو سوشل...,https://www.express.pk/story/2732145/kabhi-mai...,مشہور ڈرامے ’کبھی میں کبھی تم‘ میں منفی کردار ...,entertainment
1,1,’سدھو موسے والا واپس آگیا‘ گلوکار کے ننھے بھا...,https://www.express.pk/story/2732131/sidhu-moo...,بھارت کے آنجہانی پنجابی گلوکار سدھو موسے والا...,entertainment
2,2,ترکیہ کے ’سلطان صلاح الدین ایوبی‘ اسٹار کا پاک...,https://www.express.pk/story/2732124/turkey-k-...,ترکی کے اداکار اوغور گونیش کی سیریز ’سلطان صلا...,entertainment
3,3,نعمان اعجاز اور صبا قمر بولڈ مناظر فلمانے پر ت...,https://www.express.pk/story/2732113/nauman-ej...,نعمان اعجاز اور صبا قمر کی ویب سیریز کے بولڈ س...,entertainment
4,4,سلمان خان اور لارنس بشنوئی پر گانا لکھنے والا ...,https://www.express.pk/story/2732103/salman-kh...,بالی ووڈ کے سپر اسٹار سلمان خان کو ایک مرتبہ پ...,entertainment


**Checking for Missing Values and Duplicates**

### Drop any rows with missing values:

In [None]:
df = df.dropna()
df.shape

(2281, 5)

### Drop any rows with duplicate values: 

In [None]:
df = df.drop_duplicates()
df.shape

(2281, 5)

### See if the gold_labels are all in order:

In [None]:
# Check the unique values in the 'gold_label' column
unique_labels = df['gold_label'].unique()

# Print the unique labels
print(unique_labels)
print(len(unique_labels))

['entertainment' 'business' 'sports' 'science-technology' 'world'
 'international']
6


##### `world` is not a label we want, so we map it to `international`:

In [None]:
df['gold_label'] = df['gold_label'].replace('world', 'international')

#verifying:
unique_labels_after_mapping = df['gold_label'].unique()
print(unique_labels_after_mapping)

print(len(df['gold_label'].unique()))

['entertainment' 'business' 'sports' 'science-technology' 'international']
5
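
With the labels merged, it can also help to see how many articles fall under each one, since a skewed distribution affects any model trained on this file. The following is a minimal sketch using `collections.Counter` on a small made-up list; the `labels` list is illustrative and is not the real column.

```python
from collections import Counter

# Hypothetical labels standing in for df['gold_label'].tolist()
labels = ["sports", "world", "business", "international", "sports", "world"]

# Same mapping as above: fold 'world' into 'international'
mapped = ["international" if label == "world" else label for label in labels]

# Count how many samples carry each label, most frequent first
label_counts = Counter(mapped)
for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```

On the real DataFrame, `df['gold_label'].value_counts()` reports the same kind of per-label counts directly.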


In [5]:
# Check the data types of each column
print("data types of each column\n",df.dtypes)

# Check for missing values in each column
print("\nMissing values in each column: \n",df.isnull().sum())

# Check for duplicate rows
print("\nDuplicated rows:",df.duplicated().sum())



data types of each column
 id             int64
title         object
link          object
content       object
gold_label    object
dtype: object

Missing values in each column: 
 id            0
title         0
link          0
content       0
gold_label    0
dtype: int64

Duplicated rows: 0


## Preprocessing Pipeline for Urdu Text Data

> **Note:** This pipeline assumes the data has already been cleaned, which was done in the steps above.

Steps to prepare Urdu text data for machine learning models:

1. **Shuffle the Data:**  
   Randomize the order of the data samples to eliminate any unintended patterns that might bias the model.

2. **Text Normalization:**  
   Use the **UrduHack library** to normalize Urdu text by addressing inconsistencies in script, such as fixing spacing, removing unnecessary diacritics, and standardizing character forms.

3. **Tokenization:**  
   Split the normalized text into **word tokens** using a blank **spaCy** Urdu pipeline (`spacy.blank('ur')`), which applies spaCy's rule-based tokenizer with its Urdu language defaults.
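
The three steps above can be sketched end to end on a few toy sentences. This sketch deliberately avoids the real libraries: Unicode NFC plus whitespace collapsing stands in for UrduHack's normalizer, and a plain whitespace split stands in for the spaCy tokenizer. Both stand-ins, and the names `normalize` and `tokenize`, are assumptions made only for illustration.

```python
import random
import unicodedata
from typing import List

# Toy samples; the real pipeline reads these from df['content']
samples = ["یہ  ایک مثال ہے", "دوسرا جملہ", "تیسرا جملہ یہاں"]

# 1. Shuffle with a fixed seed so the order is reproducible
rng = random.Random(20)
shuffled = samples[:]
rng.shuffle(shuffled)

# 2. Normalize (stand-in: Unicode NFC and whitespace collapsing;
#    the notebook uses urduhack.normalization.normalize instead)
def normalize(text: str) -> str:
    return " ".join(unicodedata.normalize("NFC", text).split())

# 3. Tokenize (stand-in: whitespace split;
#    the notebook uses the blank Urdu spaCy pipeline instead)
def tokenize(text: str) -> List[str]:
    return text.split()

processed = [tokenize(normalize(text)) for text in shuffled]
for tokens in processed:
    print(tokens)
```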



**Classes and Functions Required for Pre-Processing**

In [None]:
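# Train/test split function
def train_test_split(X, y, test_size=0.2, random_state=None):
    """Shuffle the row indices once, then slice them into test and train sets."""
    # Convert to arrays so positional indexing also works for lists and pandas Series
    X, y = np.asarray(X), np.asarray(y)

    # A local generator avoids reseeding NumPy's global random state
    rng = np.random.default_rng(random_state)

    n_samples = len(X)
    indices = rng.permutation(n_samples)
    n_test = int(test_size * n_samples)  # truncates toward zero

    test_indices = indices[:n_test]
    train_indices = indices[n_test:]

    return (
        X[train_indices], X[test_indices],
        y[train_indices], y[test_indices]
    )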

def normalization_and_tokenization(text):
    """Normalize an Urdu string and return its tokens longer than one character."""
    # Normalize text (standardizes character forms, spacing, etc.)
    text = urduhack.normalization.normalize(text)

    # Tokenize text into words using the blank Urdu spaCy pipeline
    tokens = nlp(text)  # nlp is the global pipeline created in the imports cell

    # Drop single-character tokens (mostly punctuation)
    return [token.text for token in tokens if len(token.text) > 1]
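
The split function works by permuting the row indices once and slicing that permutation, so the train and test sets never overlap and together cover every row. A dependency-free sketch of the same idea makes those properties easy to check; the helper name `split_indices` is hypothetical and is not part of the notebook.

```python
import random

def split_indices(n_samples, test_size=0.2, seed=None):
    """Return (train_indices, test_indices) taken from one shuffled index list."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = int(test_size * n_samples)  # truncates, e.g. 0.2 * 2281 -> 456
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10, test_size=0.2, seed=42)

# The two sets are disjoint and together cover every row exactly once
print(len(train_idx), len(test_idx))
print(set(train_idx).isdisjoint(test_idx))
print(sorted(train_idx + test_idx) == list(range(10)))
```

Because the test size is truncated with `int`, the 2281 rows in this dataset would yield 456 test rows and 1825 training rows at the default `test_size=0.2`.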

**Running the Pre-Processing**

In [None]:

#Shuffling the data
df = df.sample(frac=1, random_state=20).reset_index(drop=True)

# Normalizing and Tokenizing the 'content' column
df['processed_content'] = df['content'].apply(normalization_and_tokenization)


# Convert lists of tokens back to text so we can apply TF-IDF
df['processed_content'] = df['processed_content'].apply(lambda x: ' '.join(x))
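
The length filter in `normalization_and_tokenization` and the join step can be illustrated on a hand-written token list. The tokens below are hypothetical and are not the output of the real tokenizer; the point is that single-character tokens are dropped, which removes punctuation such as the Urdu full stop `۔` but also one-letter words such as `و`.

```python
# Hypothetical token list, as a tokenizer might return for a short sentence
tokens = ["کراچی", "کے", "سوا", "و", "ملک", "۔"]

# Same filter as normalization_and_tokenization: keep tokens longer than one character
kept = [tok for tok in tokens if len(tok) > 1]

# Same join as above: one whitespace-separated string per document
joined = " ".join(kept)

# Splitting the joined string on whitespace recovers exactly the kept tokens
roundtrip = joined.split()
print(kept)
print(roundtrip == kept)
```

Storing the tokens as a single whitespace-delimited string keeps the column easy to write to Excel, and a downstream vectorizer can recover the tokens by splitting on whitespace.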


### Save changes to file so models can use it:

In [None]:
# Save the processed DataFrame; all models read their data from this file.
df.to_excel("normalized_and_tokenized_combined_data.xlsx")

In [6]:
#Sanity Check
df.head()

Unnamed: 0,id,title,link,content,gold_label,processed_content
0,780,امریکی صدارتی الیکشن، کونسی ریاستیں ریپبلکنز ن...,https://urdu.geo.tv/latest/385845-,امریکی صدارتی الیکشن میں ری پبلکنز امیدوار ڈون...,world,"[امریکی, صدارتی, الیکشن, میں, ری, پبلکنز, امید..."
1,893,انروا پر اسرائیلی پابندی غزہ کے بچوں کے قتل عا...,https://urdu.geo.tv/latest/384924-,اقوام متحدہ کے ادارہ برائے اطفال (یونیسیف) نے...,world,"[اقوام, متحدہ, کے, ادارہ, برائے, اطفال, یونیسی..."
2,574,میٹا کا گوگل کے مقابلے میں اپنا اے آئی سرچ انج...,https://urdu.geo.tv/latest/384923-,میٹا کی جانب سے آرٹی فیشل انٹیلی جنس (اے آئی) ...,science-technology,"[میٹا, کی, جانب, سے, آرٹی, فیشل, انٹیلی, جنس, ..."
3,200,کراچی کے سوا باقی ملک کیلئے بجلی سستی ہوگئی,https://urdu.geo.tv/latest/385886-,اسلام آباد:کراچی کے سوا باقی ملک کے لیے بجلی س...,business,"[اسلام, آباد, کراچی, کے, سوا, باقی, ملک, کے, ل..."
4,575,سرکاری ملازمین کا بیٹا چین کا امیر ترین شخص کی...,https://urdu.geo.tv/latest/384908-,ٹک ٹاک کی سرپرست کمپنی بائیٹ ڈانس کے بانی زینگ...,science-technology,"[ٹک, ٹاک, کی, سرپرست, کمپنی, بائیٹ, ڈانس, کے, ..."
