
# Demonstration: Normalization

In [1]:
import pandas as pd # Import pandas for handling CSV data
import unicodedata # Import unicodedata for Unicode Normalization
import re # Import regex for text cleaning
import nltk # Import NLTK for tokenization
nltk.download('punkt') # Download NLTK resources

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
# Load dataset from CSV file
file_path = "noisy_dataset.csv"
df = pd.read_csv(file_path)
df.head()

   id                                      text
0   1    Café prices are rising!!! 😱 #inflation
1   2      naïve approach won’t work in 2024...
2   3  I ❤️ NLP!!! It's coöperate vs. cooperate
3   4       ¡Hola! ¿Cómo estás?   Estoy bien :)
4   5                中文文本混合 English text 123!!!


In [3]:
# Ensure 'text' column exists in the dataset
if 'text' not in df.columns:
  raise ValueError("The 'text' column does not exist in the dataset.")

In [4]:
# Display the first few rows of the dataset
print("Original Dataset:")
print(df.head())

Original Dataset:
   id                                      text
0   1    Café prices are rising!!! 😱 #inflation
1   2      naïve approach won’t work in 2024...
2   3  I ❤️ NLP!!! It's coöperate vs. cooperate
3   4       ¡Hola! ¿Cómo estás?   Estoy bien :)
4   5                中文文本混合 English text 123!!!


In [5]:
# HANDLING MULTILINGUAL TEXT
print("\nHandling Multilingual Text: Punctuation, Special Characters, Case Normalization")
def handle_multilingual_text(text):
  text = re.sub(r'[^\w\s]', '', text) # Remove punctuation and symbols; \w is Unicode-aware, so accented and CJK letters are kept (case is left unchanged)
  return text


Handling Multilingual Text: Punctuation, Special Characters, Case Normalization


In [6]:
df['Processed_Multilingual'] = df['text'].apply(handle_multilingual_text)
print(df[['text', 'Processed_Multilingual']].head())

                                       text             Processed_Multilingual
0    Café prices are rising!!! 😱 #inflation  Café prices are rising  inflation
1      naïve approach won’t work in 2024...   naïve approach wont work in 2024
2  I ❤️ NLP!!! It's coöperate vs. cooperate  I  NLP Its coöperate vs cooperate
3       ¡Hola! ¿Cómo estás?   Estoy bien :)      Hola Cómo estás   Estoy bien 
4                中文文本混合 English text 123!!!            中文文本混合 English text 123
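
The processed column above still contains doubled spaces and a trailing space where symbols were removed, and letter case is untouched. A minimal sketch of a possible follow-up step is shown here; `tidy_text` is a hypothetical helper name, not part of the original notebook, and whether case folding is desirable depends on the downstream task.

```python
import re

def tidy_text(text):
    """Hypothetical helper: strip punctuation, collapse whitespace, fold case."""
    text = re.sub(r'[^\w\s]', '', text)       # same punctuation/symbol removal as handle_multilingual_text
    text = re.sub(r'\s+', ' ', text).strip()  # collapse whitespace runs and trim both ends
    return text.casefold()                    # Unicode-aware lowercasing

print(tidy_text("¡Hola! ¿Cómo estás?   Estoy bien :)"))  # hola cómo estás estoy bien
```

If adopted, it could be applied the same way as the existing function, for example `df['text'].apply(tidy_text)`.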


In [7]:
# NORMALIZATION TECHNIQUES (Accents, Unicode, Special Characters)
print("\nNormalization Techniques: Accents, Unicode, Special Characters")
def normalize_text(text):
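  # NFKD splits accented letters into base letter + combining mark; encoding to ASCII with
  # errors='ignore' then drops the marks together with every other non-ASCII character (e.g. CJK text)
  text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
  text = re.sub(r'[^\x00-\x7F]+', ' ', text) # Safety net only: the text is already pure ASCII at this point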
  return text


Normalization Techniques: Accents, Unicode, Special Characters
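
To see why `normalize_text` decomposes with NFKD before encoding to ASCII, it may help to compare Unicode normalization forms on a single accented word. The sketch below uses only the standard-library `unicodedata` module; the sample strings are illustrative and not taken from the dataset file.

```python
import unicodedata

word = "Caf\u00e9"  # 'Café' written with a single precomposed é (U+00E9)

nfc = unicodedata.normalize('NFC', word)  # composed form: é stays one code point
nfd = unicodedata.normalize('NFD', word)  # decomposed form: 'e' + combining acute accent (U+0301)

print(len(nfc), len(nfd))  # 4 5

# Only the decomposed form keeps the base letter when non-ASCII characters are dropped
print(nfc.encode('ascii', 'ignore').decode('ascii'))  # Caf
print(nfd.encode('ascii', 'ignore').decode('ascii'))  # Cafe

# NFKD additionally replaces compatibility characters, e.g. the 'fi' ligature (U+FB01)
print(unicodedata.normalize('NFKD', "\ufb01ne"))  # fine
```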


In [8]:
df['Normalized_Text'] = df['Processed_Multilingual'].apply(normalize_text)
print(df[['Processed_Multilingual', 'Normalized_Text']].head())

              Processed_Multilingual                    Normalized_Text
0  Café prices are rising  inflation  Cafe prices are rising  inflation
1   naïve approach wont work in 2024   naive approach wont work in 2024
2  I  NLP Its coöperate vs cooperate  I  NLP Its cooperate vs cooperate
3      Hola Cómo estás   Estoy bien       Hola Como estas   Estoy bien 
4            中文文本混合 English text 123                   English text 123
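
In the last row above, the ASCII-only approach discards the Chinese text entirely. Where that is not wanted, one possible alternative is to decompose the text and remove only the combining marks, leaving other scripts intact. This is a sketch rather than the notebook's method, and `strip_accents` is a hypothetical name.

```python
import unicodedata

def strip_accents(text):
    """Hypothetical helper: remove diacritics but keep non-Latin characters."""
    decomposed = unicodedata.normalize('NFKD', text)
    # unicodedata.combining() returns a non-zero class for combining marks such as accents
    kept = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize('NFC', kept)  # recompose whatever remains

print(strip_accents("Café naïve coöperate"))           # Cafe naive cooperate
print(strip_accents("中文文本混合 English text 123"))  # 中文文本混合 English text 123
```

Scripts that rely on combining marks to carry meaning may still lose information with this approach, so it is worth checking against the languages actually present in the data.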
