# **Arabic**-specific text processing

In [None]:
!pip install PyArabic

Defaulting to user installation because normal site-packages is not writeable


In [None]:
arabic_text = "ذَهَبَ أَحْمَدُ إِلَى السُّوقِ لِيَشْتَرِيَ بَعْضَ الْفَاكِهَةِ وَالْخُضْرَوَاتِ."


# **remove** diacritics

Why remove diacritics?

Most everyday written Arabic (news, social media, etc.) omits diacritics, so undiacritized text is the more common case.
Many NLP models, including Stanza’s, also tend to work better when diacritics are removed, because diacritics are rarely used in the text they usually process.

In [None]:
from pyarabic.araby import strip_tashkeel

# Remove Diacritics
text_without_diacritics = strip_tashkeel(arabic_text)

# Print Results
print("Text with Diacritics:")
print(arabic_text)
print("\nText without Diacritics:")
print(text_without_diacritics)

Text with Diacritics:
ذَهَبَ أَحْمَدُ إِلَى السُّوقِ لِيَشْتَرِيَ بَعْضَ الْفَاكِهَةِ وَالْخُضْرَوَاتِ

Text without Diacritics:
ذهب أحمد إلى السوق ليشتري بعض الفاكهة والخضروات


strip_tashkeel:
A function from pyarabic.araby designed specifically to remove Arabic diacritics (tashkeel).

It handles the common diacritics: Fatha, Kasra, Damma, Shadda, Sukun, and Tanween.

PyArabic makes diacritic removal straightforward: a single call to strip_tashkeel is enough.
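
To make the behaviour concrete, here is a minimal standard-library sketch that strips the marks listed above by their Unicode range (U+064B to U+0652). It is an approximation for illustration only, not PyArabic's implementation, and the name strip_diacritics_re is hypothetical.

```python
import re

# Tanween (U+064B-U+064D), Fatha/Damma/Kasra (U+064E-U+0650),
# Shadda (U+0651) and Sukun (U+0652) form one contiguous Unicode range.
TASHKEEL_RE = re.compile("[\u064B-\u0652]")

def strip_diacritics_re(text):
    """Remove the common Arabic diacritics from text (hypothetical helper)."""
    return TASHKEEL_RE.sub("", text)

# The verb "dhahaba", written with a Fatha after each of its three letters
vocalized = "\u0630\u064E\u0647\u064E\u0628\u064E"
print(strip_diacritics_re(vocalized))
```

Text without any of these marks passes through unchanged, so the function is safe to apply to mixed input.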

# **morphological** analysis

In [None]:
import stanza

# Download the Arabic model
stanza.download('ar')

# Initialize the Arabic pipeline
nlp = stanza.Pipeline('ar')

# Process the text
doc = nlp(text_without_diacritics)

# Perform morphological analysis
for sentence in doc.sentences:
    for word in sentence.words:
        print(f"Word: {word.text}, Lemma: {word.lemma}, POS: {word.upos}")


2024-11-27 18:47:32 INFO: Downloaded file to C:\Users\win 11\stanza_resources\resources.json
2024-11-27 18:47:32 INFO: Downloading default packages for language: ar (Arabic) ...
2024-11-27 18:47:33 INFO: File exists: C:\Users\win 11\stanza_resources\ar\default.zip
2024-11-27 18:47:36 INFO: Finished downloading models and saved to C:\Users\win 11\stanza_resources
2024-11-27 18:47:36 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2024-11-27 18:47:37 INFO: Downloaded file to C:\Users\win 11\stanza_resources\resources.json
2024-11-27 18:47:39 INFO: Loading these models for language: ar (Arabic):
| Processor | Package       |
-----------------------------
| tokenize  | padt          |
| mwt       | padt          |
| pos       | padt_charlm   |
| lemma     | padt_nocharlm |
| depparse  | padt_charlm   |
| ner       | aqmar_charlm  |

2024-11-27 18:47:39 INFO: Using device: cpu
2024-11-27 18:47:39 INFO: Loading: tokenize
2024-11-27 18:47:39 INFO: Loading: mwt
2024-11-27 18:47:39 INFO: Loading: pos
2024-11-27 18:47:40 INFO: Loading: lemma
2024-11-27 18:47:40 INFO: Loading: depparse
2024-11-27 18:47:40 INFO: Loading: ner
2024-11-27 18:47:42 INFO: Done loading processors!


Word: ذهب, Lemma: ذَهَب, POS: VERB
Word: أحمد, Lemma: أحمد, POS: X
Word: إلى, Lemma: إِلَى, POS: ADP
Word: السوق, Lemma: سُوق, POS: NOUN
Word: ل, Lemma: لِ, POS: CCONJ
Word: يشتري, Lemma: اِشتَرَى, POS: VERB
Word: بعض, Lemma: بَعض, POS: NOUN
Word: الفاكهة, Lemma: فَاكِهَة, POS: NOUN
Word: و, Lemma: وَ, POS: CCONJ
Word: الخضروات, Lemma: خُضْرَة, POS: NOUN


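Stanza provides pre-trained models for Arabic. As the log above shows, the default pipeline loads the padt packages (trained on the Prague Arabic Dependency Treebank) for tokenization, multi-word token expansion, POS tagging, lemmatization, and dependency parsing, plus an NER model.

Stanza is a convenient tool for Arabic morphological analysis: one pipeline covers a range of NLP tasks, and the pre-trained models let you analyze Arabic text without training anything from scratch.
The multi-word token (mwt) step also splits clitics, which is why "ليشتري" appears as "ل" + "يشتري" and "والخضروات" as "و" + "الخضروات". The tags still deserve a check before you rely on them: here, for example, the name "أحمد" is tagged X rather than PROPN.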
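
The loop above prints only the lemma and the universal POS tag. Stanza words also expose a feats attribute that holds morphological features as a Universal Dependencies string (Name=Value pairs separated by |). A minimal sketch for turning such a string into a dictionary follows; parse_feats is a hypothetical helper, and the sample string is illustrative rather than actual output from the run above.

```python
def parse_feats(feats):
    """Convert 'Aspect=Perf|Gender=Masc' into {'Aspect': 'Perf', 'Gender': 'Masc'}."""
    # Treat a missing value or the CoNLL-U placeholder "_" as "no features"
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

sample_feats = "Aspect=Perf|Gender=Masc|Number=Sing|Person=3"  # illustrative value
features = parse_feats(sample_feats)
print(features["Aspect"], features["Number"])
```

Inside the loop above, this could be called as parse_feats(word.feats) to print, say, only the gender and number of each word.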

# **handling** dialect manually

In [None]:
# Updated dialect-to-MSA mapping
dialect_to_msa = {
    "مفيش": "لا يوجد",   # Egyptian Arabic "mfeesh" -> MSA "laa yoojad"
    "وين": "أين",         # Levantine Arabic "wein" -> MSA "ayn"
    "كيفك": "كيف حالك",    # Levantine Arabic "kayfak" -> MSA "kayfa haalik"
    "إزاي": "كيف",         # Egyptian Arabic "izay" -> MSA "kayfa"
    "إزايك": "كيف حالك",   # Egyptian Arabic "izayk" -> MSA "kayfa haalik"
    "عندك": "لديك",       # Egyptian Arabic "indak" -> MSA "ladayk"
}

def normalize_dialect(text, lexicon):
    tokens = text.split()  # Tokenization by spaces (simplified)
    normalized_tokens = [lexicon.get(token, token) for token in tokens]
    return " ".join(normalized_tokens)

# Example text in Egyptian Arabic
text = "إزايك عامل إيه؟"  # Egyptian Arabic: "How are you? What’s up?"
normalized_text = normalize_dialect(text, dialect_to_msa)
print(f"Original Text: {text}")
print(f"Normalized Text: {normalized_text}")


Original Text: إزايك عامل إيه؟
Normalized Text: كيف حالك عامل إيه؟
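
Note that only the first token was mapped. Because split() leaves punctuation attached to a word, a lexicon entry directly followed by a question mark (for example "وين؟") would not be found either. One possible way around this, sketched below under the assumption that the input is undiacritized, is to substitute on runs of word characters with re.sub so that punctuation and spacing are left untouched. The name normalize_dialect_re and the small lexicon are illustrative.

```python
import re

demo_lexicon = {
    "وين": "أين",        # Levantine "wein" -> MSA "ayn"
    "مفيش": "لا يوجد",   # Egyptian "mfeesh" -> MSA "laa yoojad"
}

def normalize_dialect_re(text, lexicon):
    """Replace each run of word characters that is a lexicon key."""
    # \w+ matches Arabic letters but not punctuation such as the Arabic question mark
    return re.sub(r"\w+", lambda m: lexicon.get(m.group(0), m.group(0)), text)

print(normalize_dialect_re("وين السوق؟ وين؟", demo_lexicon))
```

Multi-word expressions and context-dependent words still need more than a token lookup; this only fixes the punctuation problem.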


# **English**-specific text processing

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import pos_tag

In [None]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to C:\Users\win
[nltk_data]     11\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\win
[nltk_data]     11\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\win 11\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [None]:
text = "The CFO confirmed that the FY 2024 budget is ready.The meeting will take place on Mon, 10th Jan. We need to finalize the NDA before the project launch."


# stemming

In [None]:

from nltk.stem import PorterStemmer

# Initialize the stemmer
stemmer = PorterStemmer()

# Tokenize the text
tokens = word_tokenize(text)

# Stem the words (excluding punctuation)
stemmed_words = [stemmer.stem(word) for word in tokens if word.isalpha()]

# Display the result
print("the original text :",text)
print("Stemmed Words: ", ' '.join(stemmed_words))


the original text : The CFO confirmed that the FY 2024 budget is ready.The meeting will take place on Mon, 10th Jan. We need to finalize the NDA before the project launch.
Stemmed Words:  the cfo confirm that the fy budget is meet will take place on mon we need to final the nda befor the project launch


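In this output:

"confirmed" becomes "confirm".
"finalize" becomes "final".
"before" becomes "befor": the stemmer strips the final "e", so the result is not a valid English word.
"meeting" becomes "meet".
Abbreviations and proper nouns ("CFO", "FY", "NDA", "Mon") keep their letters but are lowercased ("cfo", "fy", "nda", "mon").
Tokens that are not purely alphabetic are dropped by the isalpha() filter, which is why "2024", "10th" and "Jan" are missing. "ready" is missing too, apparently because the input has no space after "ready." and the resulting token "ready.The" is not alphabetic.

Stemming chops affixes (with the Porter stemmer, suffixes) off words to obtain a common root. The stems may not be valid words.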
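
To show the idea of suffix stripping in isolation, here is a deliberately tiny sketch. It is not the Porter algorithm: the suffix list, the minimum stem length, and the name toy_stem are all assumptions chosen so that the words discussed above behave the same way.

```python
# Suffixes are tried in order; the first one that leaves a long enough stem wins
SUFFIXES = ("ing", "ed", "ize", "s", "e")

def toy_stem(word, min_stem=3):
    """Lowercase the word and strip at most one suffix (toy example only)."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[: -len(suffix)]
    return w

for word in ["confirmed", "meeting", "finalize", "before", "CFO"]:
    print(word, "->", toy_stem(word))
```

A real stemmer applies many ordered rules with conditions on the remaining stem; the point here is only that rule-based chopping can produce non-words such as "befor".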

# **lemmatization**

In [None]:
# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Tokenize the text
tokens = word_tokenize(text)

# POS tagging
pos_tags = pos_tag(tokens)

# Function to map NLTK POS tags to WordNet POS tags
def get_wordnet_pos(tag):
    if tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # Default to NOUN for all other tags

# Lemmatize the words using correct POS tag
lemmatized_words = [
    lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in pos_tags if word.isalpha()
]

# Display the result
print("Lemmatized Words: ", ' '.join(lemmatized_words))

Lemmatized Words:  The CFO confirm that the FY budget be meeting will take place on Mon We need to finalize the NDA before the project launch


Why use POS tags with lemmatization?
Using POS tagging ensures that words such as verbs and nouns are processed according to their role in the sentence, which makes the lemmatization more accurate.

"confirmed" is tagged as a verb (a tag starting with "V"), so lemmatization changes it to "confirm".

"is" is also tagged as a verb, so it becomes "be".

"finalize" is a verb (not a noun) and is already in its base form, so it is unchanged.

"meeting" is a noun, so it stays as "meeting".

"launch" is a noun, so it stays as "launch".

Unlike the stemmer, the lemmatizer preserves case ("The", "CFO").

Lemmatization aims for a valid base form through linguistic analysis, which makes it more accurate than stemming.
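
To illustrate why the POS tag matters, here is a toy lookup keyed on (word, POS). The table is hand-written for this example and is not WordNet data; toy_lemmatize and TOY_LEMMAS are hypothetical names.

```python
# Hand-written entries; "v" and "n" mirror the WordNet-style POS letters
TOY_LEMMAS = {
    ("meeting", "v"): "meet",      # "they are meeting today"
    ("meeting", "n"): "meeting",   # "the meeting will take place"
    ("confirmed", "v"): "confirm",
    ("is", "v"): "be",
}

def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("meeting", "n"), toy_lemmatize("meeting", "v"))
```

The same surface form maps to different lemmas depending on its tag, which is the reason the notebook passes get_wordnet_pos(tag) to the lemmatizer.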

# **handling** abbreviations after lemmatization

In [None]:
abbreviations = {
    "FY": "Fiscal Year",
    "CFO": "Chief Financial Officer",
    "Mon": "Monday",
    "Jan": "January",
    "NDA": "Non-Disclosure Agreement",
    "etc": "et cetera"
}
def expand_abbreviations(words):
    # Replace each token that is a known abbreviation; leave other tokens unchanged
    return ' '.join(abbreviations.get(word, word) for word in words)

# Expand abbreviations in the lemmatized tokens
expanded_text = expand_abbreviations(lemmatized_words)

# Display the result
print("Expanded Text: ")
print( expanded_text)


Expanded Text: 
The Chief Financial Officer confirm that the Fiscal Year budget be meeting will take place on Monday We need to finalize the Non-Disclosure Agreement before the project launch


Abbreviation Dictionary: We define a dictionary, abbreviations, that maps common abbreviations to their expanded forms (e.g., "FY": "Fiscal Year").

Expanding Abbreviations: The function expand_abbreviations() looks up each lemmatized token in the dictionary and, if it is found, replaces it with its full form. A token that is not in the dictionary remains unchanged. The lookup is case-sensitive and matches whole tokens only.

Result: The abbreviations present in the lemmatized tokens (CFO, FY, Mon, NDA) are expanded to their full forms. "Jan" is not expanded because it was already filtered out in the lemmatization step.
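
If abbreviations should be expanded in the raw text instead, so that tokens such as "Jan." are not lost to the isalpha() filter, one option is a whole-word regular expression. This is a minimal sketch under that assumption; expand_in_text and ABBR_RE are hypothetical names, and the dictionary repeats the entries used above.

```python
import re

abbreviations = {
    "FY": "Fiscal Year",
    "CFO": "Chief Financial Officer",
    "Mon": "Monday",
    "Jan": "January",
    "NDA": "Non-Disclosure Agreement",
}

# Match any key as a whole word; longer keys come first so they win over shorter ones
ABBR_RE = re.compile(
    r"\b(" + "|".join(sorted(map(re.escape, abbreviations), key=len, reverse=True)) + r")\b"
)

def expand_in_text(text):
    """Expand known abbreviations in place, keeping punctuation and spacing."""
    return ABBR_RE.sub(lambda m: abbreviations[m.group(1)], text)

print(expand_in_text("The meeting will take place on Mon, 10th Jan."))
```

Because the word boundaries are part of the pattern, "Mon" inside "Monday" is left alone, and the period after "Jan" is preserved.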

# **Advanced** text handling

# **multilingual** processing

In [None]:
multilingual_text = "I love learning languages. J'adore apprendre de nouvelles langues."

NLP tasks in this example:

Language Detection: Identify the language of each part of the text.

Tokenization: Tokenize the text into individual words.

Multilingual Processing: Handle the text differently based on the language detected.

(Optional) Translation: Translate the text into a single language if needed.

1. Language Detection:
We'll start by detecting the language of each sentence in the text.

In [None]:
from langdetect import DetectorFactory, detect

# langdetect is non-deterministic by default; fix the seed for reproducible results
DetectorFactory.seed = 0

# Split text into sentences (naive split on periods)
sentences = multilingual_text.split('.')

# Function to detect language
def detect_language(text):
    return detect(text)

# Detect language for each non-empty sentence
for sentence in sentences:
    sentence = sentence.strip()
    if sentence:
        language = detect_language(sentence)
        print(f"Sentence: '{sentence}' ===> Detected Language: {language}")


Sentence: 'I love learning languages' ===> Detected Language: en
Sentence: 'J'adore apprendre de nouvelles langues' ===> Detected Language: fr


Explanation of Language Detection:
Input Text: "I love learning languages. J'adore apprendre de nouvelles langues."

Process: We split the text into sentences and detect the language of each sentence using the langdetect library.

Output: The language of each sentence is detected (English, French).
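
Splitting on every period drops the punctuation and produces an empty trailing piece. A slightly more careful standard-library sketch is shown below, assuming sentences are separated by whitespace after ".", "!" or "?"; split_sentences is a hypothetical helper, not part of langdetect.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace; keep the punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

parts = split_sentences("I love learning languages. J'adore apprendre de nouvelles langues.")
for part in parts:
    print(part)
```

This still splits after abbreviations such as "Jan." and does nothing when the space after the period is missing, so a trained sentence tokenizer remains the safer choice for real text.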

2. Tokenization
Next, we tokenize each sentence based on the detected language.

In [None]:
import spacy

In [None]:
# Load spaCy models for different languages
nlp_en = spacy.load('en_core_web_sm')
nlp_fr = spacy.load('fr_core_news_sm')

In [None]:
# Tokenization function for multilingual text
def tokenize_multilingual(multilingual_text):
    # Split into sentences (sent_tokenize uses the English Punkt model by default)
    sentences = nltk.sent_tokenize(multilingual_text)
    all_tokens = []
    for sentence in sentences:
        sentence = sentence.strip()  # Clean the sentence
        if sentence:  # Only process non-empty sentences
            language = detect_language(sentence)
            # Use appropriate spaCy model based on detected language
            if language == 'en':
                doc = nlp_en(sentence)
            elif language == 'fr':
                doc = nlp_fr(sentence)
            else:
                continue  # Skip unsupported languages

            tokens = [token.text for token in doc]  # Extract tokens
            all_tokens.append((language, tokens))  # Store language and tokens
    return all_tokens

# Tokenize the multilingual text
processed_tokens = tokenize_multilingual(multilingual_text)

# Display the tokens
for language, tokens in processed_tokens:
    print(f"Language: {language}")
    print(f"Tokens: {tokens}")


Language: en
Tokens: ['I', 'love', 'learning', 'languages', '.']
Language: fr
Tokens: ["J'", 'adore', 'apprendre', 'de', 'nouvelles', 'langues', '.']


Explanation of Tokenization:
Input Text: "I love learning languages. J'adore apprendre de nouvelles langues."

Process:
The text is split into sentences with nltk.sent_tokenize.
Based on the detected language (en, fr), we use the appropriate spaCy model to tokenize each sentence; sentences in any other language are skipped.

Output: The tokens (words and punctuation) are extracted for each sentence, together with its detected language.
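
As more languages are added, the if/elif chain can be replaced by a lookup table from language code to tokenizer. The sketch below uses two simple stand-in tokenizers so that it runs without spaCy; TOKENIZERS, tokenize_by_language and the stub functions are hypothetical names.

```python
def tokenize_en_stub(sentence):
    # Stand-in for an English tokenizer: split on whitespace
    return sentence.split()

def tokenize_fr_stub(sentence):
    # Stand-in for a French tokenizer: also split after an apostrophe ("J'adore" -> "J'", "adore")
    return sentence.replace("'", "' ").split()

# Map each supported language code to the function that tokenizes it
TOKENIZERS = {"en": tokenize_en_stub, "fr": tokenize_fr_stub}

def tokenize_by_language(sentence, language):
    """Return the tokens, or None when the language is not supported."""
    tokenizer = TOKENIZERS.get(language)
    if tokenizer is None:
        return None
    return tokenizer(sentence)

print(tokenize_by_language("J'adore apprendre", "fr"))
```

In the notebook, the table values could instead wrap the loaded models, for example lambda s: [t.text for t in nlp_en(s)], so supporting another language means adding one entry.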

3. Translation
Finally, we can translate the multilingual text into a single language, such as English or Arabic.

In [None]:
from googletrans import Translator

# Initialize the Translator
translator = Translator()

# Split the text into non-empty, stripped sentences
sentences = [s.strip() for s in multilingual_text.split('.') if s.strip()]

# Translate each sentence into English
translations_en = [
    translator.translate(s, src=detect_language(s), dest='en').text
    for s in sentences
]

# Translate each sentence into Arabic
translations_ar = [
    translator.translate(s, src=detect_language(s), dest='ar').text
    for s in sentences
]

# Display the translations
print("Translations into English:")
for translation in translations_en:
    print(translation)
print("="*150)
print("Translations into arabic:")
for translation in translations_ar:
    print(translation)

Translations into English:
I love learning languages
I love to learn new languages
Translations into arabic:
أحب تعلم اللغات
أحب أن أتعلم لغات جديدة


Explanation of Translation:
Input Text: "I love learning languages. J'adore apprendre de nouvelles langues."

Process:
The text is split into sentences.
For each sentence, we detect the language and translate it into English and Arabic using googletrans.

Output: Each sentence is translated into English and Arabic. The translation process makes it easier to handle multilingual content in a unified language.
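
The cell above also sends the English sentence to the translator with dest='en'. One possible refinement is to skip sentences that are already in the target language. The sketch below takes the detection and translation functions as parameters and uses stand-ins so that it runs without network access; translate_all, fake_detect and fake_translate are hypothetical names.

```python
def translate_all(sentences, dest, detect_fn, translate_fn):
    """Translate each sentence into dest, leaving sentences already in dest untouched."""
    results = []
    for sentence in sentences:
        src = detect_fn(sentence)
        if src == dest:
            results.append(sentence)  # already in the target language
        else:
            results.append(translate_fn(sentence, src, dest))
    return results

# Stand-ins for language detection and translation (illustration only)
def fake_detect(sentence):
    return "fr" if "J'" in sentence else "en"

def fake_translate(sentence, src, dest):
    return f"[{src}->{dest}] {sentence}"

out = translate_all(
    ["I love learning languages", "J'adore apprendre"], "en", fake_detect, fake_translate
)
print(out)
```

In the notebook, detect_fn could be detect_language and translate_fn a small wrapper around translator.translate(s, src=src, dest=dest).text.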