Purpose
The purpose of this activity is to explore an Arabic NLP tool and to practice Python development skills.
Instructions
Create a mini dataset of a few tweets or Facebook posts written in Arabic (10 is enough for now) and save them in a text or CSV file.
Create a Jupyter notebook (you can use Google Colab).
Load your dataset into a DataFrame.
Use regex to remove punctuation, emojis, and any other non-text sequences so that only the text remains.
Add a new column to the DataFrame that contains the tokenized form of the text (whitespace can be the delimiter).
Add a new column to the DataFrame that shows the lemmatized form of the text using Farasa, Madamira, or any other Arabic lemmatizer.
Compare the original and lemmatized text. Write a short paragraph describing your findings: what happened to the text, and why.

**Arabic NLP - Basic Arabic Text Processing Hands-on**

***Step 1- Load your dataset to a Data Frame***

In [2]:
import pandas as pd

#load the CSV file with UTF-8 encoding
df = pd.read_csv('Arabic Tweets Data.csv', encoding='utf-8')

#display the DataFrame
df.head()


Unnamed: 0,text
0,الطقس اليوم رائع، والشمس مشرقة! ☀️
1,أحب القهوة في الصباح الباكر، إنها تعطي يومي بد...
2,مباراة كرة القدم كانت مشوقة للغاية! هل شاهدتها...
3,أشعر بالتعب الشديد بعد يوم عمل طويل. 😓
4,يجب أن أذهب للتسوق غداً، لأنني بحاجة إلى بعض ا...


Step 1 loads the CSV file containing the original Arabic text into a pandas DataFrame, so that the later steps can add cleaned, tokenized, and lemmatized versions of the text as new columns.
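
The instructions also ask for the posts to be saved to a file before they are loaded. A minimal sketch of that round trip with the standard-library `csv` module, assuming a single `text` column; the file name `sample_tweets.csv` and the helper names are placeholders, and the two posts are copied from the output above:

```python
import csv

# Two of the posts shown in the output above; a real dataset would hold about ten.
posts = [
    "الطقس اليوم رائع، والشمس مشرقة! ☀️",
    "أشعر بالتعب الشديد بعد يوم عمل طويل. 😓",
]

# Hypothetical file name, used only for this sketch.
CSV_PATH = "sample_tweets.csv"


def save_posts(rows, path):
    """Write one post per row under a single 'text' header."""
    # utf-8 keeps Arabic letters and emojis intact; newline="" is what csv expects.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["text"])
        writer.writerows([row] for row in rows)


def load_posts(path):
    """Read the 'text' column back as a list of strings."""
    with open(path, encoding="utf-8", newline="") as f:
        return [row["text"] for row in csv.DictReader(f)]


save_posts(posts, CSV_PATH)
print(load_posts(CSV_PATH))
```

A file written this way can then be loaded with `pd.read_csv(..., encoding='utf-8')` exactly as in the cell above.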

***Step 2- Remove punctuation, emojis, and any other non-text sequences using regex, keeping only the text***

In [3]:
import re

#clean text by removing punctuation, emojis, diacritics, and non-Arabic characters
def clean_text(text):
    #first pass: keep only whitespace and characters in the Arabic Unicode block
    #(U+0600-U+06FF); this drops Latin text, ASCII punctuation, and emojis
    pattern = r'[^\u0600-\u06FF\s]'
    cleaned_text = re.sub(pattern, '', text)

    #second pass: the Arabic block also contains Arabic punctuation (e.g. '،')
    #and diacritics, which are not word characters, so remove them here
    cleaned_text = re.sub(r'[^\w\s]', '', cleaned_text)
    return cleaned_text

#now, apply the cleaning function to the DataFrame's 'text' column
df['cleaned_text'] = df['text'].apply(clean_text)

#display the cleaned text
df[['text', 'cleaned_text']].head()


Unnamed: 0,text,cleaned_text
0,الطقس اليوم رائع، والشمس مشرقة! ☀️,الطقس اليوم رائع والشمس مشرقة
1,أحب القهوة في الصباح الباكر، إنها تعطي يومي بد...,أحب القهوة في الصباح الباكر إنها تعطي يومي بدا...
2,مباراة كرة القدم كانت مشوقة للغاية! هل شاهدتها...,مباراة كرة القدم كانت مشوقة للغاية هل شاهدتها
3,أشعر بالتعب الشديد بعد يوم عمل طويل. 😓,أشعر بالتعب الشديد بعد يوم عمل طويل
4,يجب أن أذهب للتسوق غداً، لأنني بحاجة إلى بعض ا...,يجب أن أذهب للتسوق غدا لأنني بحاجة إلى بعض الأ...


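In Step 2, we cleaned the text with two regular expressions. The first keeps only whitespace and characters from the Arabic Unicode block, which drops emojis, Latin characters, and ASCII punctuation. The second removes the Arabic punctuation and diacritics that survive the first pass (for example, "غداً" becomes "غدا"), leaving only Arabic words and whitespace.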
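
The same cleaning can be done in one pass by whitelisting only the basic Arabic letter range. This is a sketch of an alternative, not the method used in the cell above; the names `clean_arabic`, `NON_ARABIC_LETTER`, and `EXTRA_SPACE` are illustrative, and the range U+0621 to U+064A is an assumption that fits standard Arabic but would drop extended letters used in Persian or Urdu:

```python
import re

# Basic Arabic letters run from hamza (U+0621) to yeh (U+064A).
# Diacritics (U+064B and up), Arabic-Indic digits, and Arabic punctuation
# such as the comma (U+060C) fall outside this range and are removed.
# Note that the tatweel (U+0640) sits inside the range and is kept.
NON_ARABIC_LETTER = re.compile(r"[^\u0621-\u064A\s]")
EXTRA_SPACE = re.compile(r"\s+")


def clean_arabic(text):
    """Keep Arabic letters and whitespace, then normalize the whitespace."""
    letters_only = NON_ARABIC_LETTER.sub("", text)
    # Removing symbols can leave double or trailing spaces behind.
    return EXTRA_SPACE.sub(" ", letters_only).strip()


print(clean_arabic("الطقس اليوم رائع، والشمس مشرقة! ☀️"))
```

Collapsing whitespace at the end avoids the trailing space that is otherwise left where an emoji or punctuation mark used to be.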

***Step 3- Add a new column to the data frame that consists of the tokenized format of the text (whitespace character can be the delimiter)***

In [4]:
#tokenizing the cleaned text by splitting based on whitespace
df['tokenized_text'] = df['cleaned_text'].apply(lambda x: x.split())

#display the tokenized text
df[['cleaned_text', 'tokenized_text']].head()

Unnamed: 0,cleaned_text,tokenized_text
0,الطقس اليوم رائع والشمس مشرقة,"[الطقس, اليوم, رائع, والشمس, مشرقة]"
1,أحب القهوة في الصباح الباكر إنها تعطي يومي بدا...,"[أحب, القهوة, في, الصباح, الباكر, إنها, تعطي, ..."
2,مباراة كرة القدم كانت مشوقة للغاية هل شاهدتها,"[مباراة, كرة, القدم, كانت, مشوقة, للغاية, هل, ..."
3,أشعر بالتعب الشديد بعد يوم عمل طويل,"[أشعر, بالتعب, الشديد, بعد, يوم, عمل, طويل]"
4,يجب أن أذهب للتسوق غدا لأنني بحاجة إلى بعض الأ...,"[يجب, أن, أذهب, للتسوق, غدا, لأنني, بحاجة, إلى..."


Step 3 splits each cleaned sentence on whitespace into individual words (tokens) and stores the resulting list in a new column of the DataFrame.
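
Two properties of whitespace tokenization are worth seeing on a small example. The sketch below wraps `str.split()` in a helper named `whitespace_tokenize` (an illustrative name) and uses the first cleaned sentence from the output above:

```python
def whitespace_tokenize(text):
    """Split on any run of whitespace, as the lambda in the cell above does."""
    # split() with no argument collapses consecutive spaces and ignores
    # leading or trailing whitespace, so no empty tokens are produced.
    return text.split()


sentence = "الطقس اليوم رائع والشمس مشرقة"
tokens = whitespace_tokenize(sentence)

# Attached clitics stay glued to their word: the conjunction "و" remains
# part of the single token "والشمس".
print(len(tokens), tokens[3])

# By contrast, split(" ") keeps an empty string between two adjacent spaces.
print("a  b".split(" "))
```

This is why calling `split()` without an argument is the safer choice after regex cleaning, which can leave double spaces behind.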

***Step 4- Add a new column to the data frame that shows the lemmatized form of the text using Farasa, Madamira, or any other Arabic lemmatizer***

In [5]:
#will use Stanza to perform lemmatization on Arabic text

#install Stanza
!pip install stanza


Collecting stanza
  Downloading stanza-1.9.2-py3-none-any.whl.metadata (13 kB)
Collecting emoji (from stanza)
  Downloading emoji-2.12.1-py3-none-any.whl.metadata (5.4 kB)
Downloading stanza-1.9.2-py3-none-any.whl (1.1 MB)
Downloading emoji-2.12.1-py3-none-any.whl (431 kB)
Installing collected packages: emoji, stanza
Successfully installed emoji-2.12.1 stanza-1.9.2


In [6]:
import stanza

#download the Arabic model
stanza.download('ar')


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.9.0.json:   0%|   …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: ar (Arabic) ...


Downloading https://huggingface.co/stanfordnlp/stanza-ar/resolve/v1.9.0/models/default.zip:   0%|          | 0…

INFO:stanza:Downloaded file to /root/stanza_resources/ar/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources


In [7]:
import pandas as pd

#load the original csv file
df = pd.read_csv('Arabic Tweets Data.csv', encoding='utf-8')

#display the first few rows of the dataframe
df.head()


Unnamed: 0,text
0,الطقس اليوم رائع، والشمس مشرقة! ☀️
1,أحب القهوة في الصباح الباكر، إنها تعطي يومي بد...
2,مباراة كرة القدم كانت مشوقة للغاية! هل شاهدتها...
3,أشعر بالتعب الشديد بعد يوم عمل طويل. 😓
4,يجب أن أذهب للتسوق غداً، لأنني بحاجة إلى بعض ا...


In [9]:
#initialize the Stanza pipeline for Arabic
nlp = stanza.Pipeline(lang='ar', processors='tokenize,pos,lemma')


INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.9.0.json:   0%|   …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: ar (Arabic):
| Processor | Package       |
-----------------------------
| tokenize  | padt          |
| mwt       | padt          |
| pos       | padt_charlm   |
| lemma     | padt_nocharlm |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
  checkpoint = torch.load(filename, lambda storage, loc: storage)
INFO:stanza:Loading: mwt
  checkpoint = torch.load(filename, lambda storage, loc: storage)
INFO:stanza:Loading: pos
  checkpoint = torch.load(filename, lambda storage, loc: storage)
  data = torch.load(self.filename, lambda storage, loc: storage)
  state = torch.load(filename, lambda storage, loc: storage)
INFO:stanza:Loading: lemma
  checkpoint = torch.load(filename, lambda storage, loc: storage)
INFO:stanza:Done loading processors!


In [10]:
#Lemmatize Arabic Text Using Stanza
#define a function to lemmatize Arabic text using Stanza
def lemmatize_text(text):
    doc = nlp(text)  # Process the text with Stanza
    lemmas = []

    for sentence in doc.sentences:
        for word in sentence.words:
            lemmas.append(word.lemma)  #append the lemma of each word

    return ' '.join(lemmas)  #join the lemmas back into a string

#apply the lemmatization function to each row in the 'text' column
df['lemmatized_text'] = df['text'].apply(lemmatize_text)

#display the updated DataFrame
df.head()


Unnamed: 0,text,lemmatized_text
0,الطقس اليوم رائع، والشمس مشرقة! ☀️,طَقس يَوم رَائِع ، وَ شَمس مُشَرَّق ! ☀️
1,أحب القهوة في الصباح الباكر، إنها تعطي يومي بد...,أَحَبّ قُهوَة فِي صَبَاح بَاكِر ، إِنَّ هُوَ أ...
2,مباراة كرة القدم كانت مشوقة للغاية! هل شاهدتها...,مُبَارَاة كُرَة قَدَم كَان مُشَوَّق لِ غَايَة ...
3,أشعر بالتعب الشديد بعد يوم عمل طويل. 😓,أَشعَر بِ تَعَبَّة شَدِيد بَعدَ يَوم عَمَل طَو...
4,يجب أن أذهب للتسوق غداً، لأنني بحاجة إلى بعض ا...,وَجَب أَنَّ ذَهَب لِ تَسَوُّق غَد ، لِأَنَّ هُ...


In [11]:
#save the updated DataFrame to a new CSV file.
df.to_csv('lemmatized_text_output.csv', encoding='utf-8', index=False)

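Step 4 ran each row of the original 'text' column through the Stanza Arabic pipeline. Because the CSV was reloaded first, the input was the raw text rather than the cleaned or tokenized columns. The pipeline tokenized the text, extracted each word's lemma, and the space-joined lemmas were stored in a new column.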
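
To lemmatize the cleaned text instead of the raw text, the function can take the pipeline as a parameter and return a list of lemmas. The following is a sketch under stated assumptions: `lemmatize_with` and `fake_pipeline` are hypothetical names, the fallback to `word.text` when a lemma is missing is a defensive choice rather than documented Stanza behavior, and the stand-in pipeline exists only so the example runs without downloading a model:

```python
from types import SimpleNamespace


def lemmatize_with(pipeline, text):
    """Return the lemmas of `text` as a list, given a Stanza-style pipeline."""
    doc = pipeline(text)
    return [
        # Fall back to the surface form if no lemma was assigned.
        word.lemma if word.lemma is not None else word.text
        for sentence in doc.sentences
        for word in sentence.words
    ]


def fake_pipeline(text):
    """Stand-in that mimics only doc.sentences, sentence.words, word.text, word.lemma."""
    words = [SimpleNamespace(text=tok, lemma=None) for tok in text.split()]
    return SimpleNamespace(sentences=[SimpleNamespace(words=words)])


# With the stand-in, every lemma is missing, so the surface forms come back.
print(lemmatize_with(fake_pipeline, "الطقس اليوم"))

# With the real pipeline and the Step 2 column still present, usage would be:
# df['lemmas'] = df['cleaned_text'].apply(lambda t: lemmatize_with(nlp, t))
```

Returning a list keeps the result parallel to the `tokenized_text` column, although the two lists can differ in length because the pipeline splits clitics into separate words.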

***Step 5- Comparison between the original and lemmatized text. Write a short paragraph to describe your findings, what happened to the text and justify these results.***

Observations:

Verbs: Verbs are reduced to their lemma, the dictionary citation form (conventionally the third-person masculine singular perfect), rather than to a bare root. Tense, person, and gender markers are dropped.

For example:

Original: "أحب" (I love) → Lemmatized: "أَحَبّ" (love).

Original: "كانت" (she/it was) → Lemmatized: "كَان" (be).

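Noun Changes: The definite article "الـ" (the) is removed from nouns. Feminine endings on adjectives are also dropped (for example, "مشرقة" → "مُشَرَّق"), whereas a noun such as "قُهوَة" keeps its ta marbuta because it is part of the lemma: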

Original: "الطقس" (the weather) → Lemmatized: "طَقس" (weather).

Original: "القهوة" (the coffee) → Lemmatized: "قُهوَة" (coffee).

Prepositions and Particles: Standalone prepositions such as "في" (in) keep the same form apart from the added diacritics, because they are already in their base form. Attached clitics are split off into separate words, for example "والشمس" → "وَ شَمس" and "للغاية" → "لِ غَايَة".

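Pronouns: Words that carry an attached pronoun, such as "إنها" (indeed it/she) and "لأنني" (because I), are split into the particle and the pronoun, and the pronoun is lemmatized to the default third-person masculine singular "هُوَ":

Original: "إنها" (indeed it/she) → Lemmatized: "إِنَّ هُوَ" (indeed he).

Original: "لأنني" (because I) → Lemmatized: "لِأَنَّ هُوَ" (because he).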

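Emojis and Punctuation: Punctuation and emojis pass through unchanged. They appear in the lemmatized column because the lemmatizer was run on the original 'text' column rather than on 'cleaned_text', and their lemma is simply the symbol itself.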
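
Because Stanza returns lemmas with diacritics, a direct string comparison with the original words reports a change even when only vowel marks were added. A small sketch that ignores diacritics before comparing; the helper names are illustrative, and the character range U+064B to U+0652 is assumed to cover the marks seen in this output:

```python
import re

# Tanween, short vowels, shadda, and sukun: U+064B through U+0652.
TASHKEEL = re.compile(r"[\u064B-\u0652]")


def strip_diacritics(text):
    """Remove Arabic vowel marks so that forms can be compared letter by letter."""
    return TASHKEEL.sub("", text)


def same_ignoring_diacritics(original, lemma):
    """True when the two forms differ only in diacritics."""
    return strip_diacritics(original) == strip_diacritics(lemma)


# A preposition that only gained vowel marks.
print(same_ignoring_diacritics("في", "فِي"))
# A noun that also lost its definite article.
print(same_ignoring_diacritics("الطقس", "طَقس"))
```

Applied to word and lemma pairs, this separates real morphological changes (removed articles, split clitics, changed verb forms) from purely orthographic ones.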

Findings:
The lemmatized text reduces the complex morphology of Arabic words to their dictionary forms. Removing verb conjugations, adjective gender markers, and definite articles, and splitting off attached clitics, produces a more consistent representation of the text, which is particularly helpful for NLP applications. The lemmas are also returned with full diacritics, which the original text lacks.

Justification of the results:
Arabic has a rich morphology, with verbs and nouns having considerable gender, number, and tense inflection.
By eliminating these variants and presenting the word's base form, lemmatization helps lessen data sparsity in NLP applications.

Furthermore, separating attached pronouns and clitic prefixes from their host words improves consistency, because models can then treat the same word identically across its inflected forms.

Attached pronouns are also generalized to "هُوَ" ("إنها" becomes "إِنَّ هُوَ"). This discards person and gender information and can alter the original meaning, but it is usually acceptable in more general NLP applications.

Finally, lemmatization reduces words to their dictionary forms while preserving the main sense of the text, which simplifies it overall and makes processing and analysis easier for computational models.
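
The data-sparsity argument can be checked by counting distinct forms before and after lemmatization. The sketch below uses a deliberately tiny toy input rather than the notebook's dataset: "كانت" and its lemma "كَان" come from the output above, while the second row, which contains "كان" itself, is an added illustrative assumption. `vocabulary_size` is a hypothetical helper name.

```python
def vocabulary_size(token_lists):
    """Count the distinct tokens across a collection of token lists."""
    return len({token for tokens in token_lists for token in tokens})


# Toy input: two inflected forms of the same verb, one per row.
surface_forms = [["كانت"], ["كان"]]
# Both forms share the lemma reported by Stanza above.
lemmas = [["كَان"], ["كَان"]]

print(vocabulary_size(surface_forms), vocabulary_size(lemmas))
```

On a real DataFrame the same function could be applied to the token and lemma columns, for example `vocabulary_size(df['tokenized_text'])`, provided both columns hold lists.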
