<h3>Quran in CSV</h3>
<i>Credit: </i><a href='https://data.mendeley.com/datasets/sg5j65kgdf/1'>https://data.mendeley.com/datasets/sg5j65kgdf/1</a>

In [36]:
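import ast
import pandas as pd
import tqdm

tqdm.tqdm.pandas()  # registers progress_map / progress_apply on pandas objects

# Column 3 is renamed to 'text' and column 10 to 'sura' below; rows missing either are dropped.
quran_df = pd.read_csv(r"CSVs_&_other_files\quran.csv", header=None, usecols=[3, 10]).dropna()
# Drop rows whose text starts with 'سورة' (presumably sura-title rows) and name the columns.
# Optional extra filter, currently disabled: & ~(quran_df[3].str.len() < 25)
quran_df = quran_df[~quran_df[3].str.startswith('سورة')].rename(columns={3: 'text', 10: 'sura'})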

In [37]:
quran_df.sura.unique().shape  # Number of suras

(114,)

In [59]:
# Collect the verses of each sura into one list per sura.
df = quran_df.groupby('sura').agg({'text': list}).reset_index()
df.head()

Unnamed: 0,sura,text
0,سورة ابراهيم,[فلا تحسبن الله مخلف وعده رسله ان الله عزيز ذو...
1,سورة ال عمران,"[بسم الله الرحمٰن الرحيم, الم, الله لا الٰه ال..."
2,سورة الاحزاب,[ولقد كانوا عٰهدوا الله من قبل لا يولون الادبٰ...
3,سورة الاحقاف,"[تنزيل الكتٰب من الله العزيز الحكيم, بسم الله ..."
4,سورة الاخلاص,"[بسم الله الرحمٰن الرحيم, قل هو الله احد, الله..."


<h3>Stemming/Lemmatization</h3>

<h5>1. NLTK ISRIStemmer (very fast, but not very accurate)</h5>

In [16]:
import nltk
from nltk.stem.isri import ISRIStemmer
st = ISRIStemmer()

In [26]:
# df.text.head(10).progress_map(lambda x : st.stem(x))
st.stem('محمد')

'حمد'
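
`ISRIStemmer.stem` takes a single word, so a full verse has to be split into tokens before stemming. The following is a minimal sketch, assuming plain whitespace tokenization is good enough for this text. `stem_verse` is a hypothetical helper, not part of NLTK, and it takes the stemmer as a plain callable so the sketch runs without NLTK installed.

```python
from typing import Callable, List


def stem_verse(verse: str, stem_word: Callable[[str], str]) -> List[str]:
    """Split a verse on whitespace and stem each token on its own."""
    return [stem_word(token) for token in verse.split()]


# Toy stand-in for a real stemmer (keeps the first three characters only),
# used here just to show the call shape.
toy_stem = lambda token: token[:3]
print(stem_verse("قل هو الله احد", toy_stem))
```

With the stemmer from the cell above, the same helper could be applied as `quran_df.text.head(10).progress_map(lambda v: stem_verse(v, st.stem))`.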

<h5>2. qalsadi lemmatizer (fast enough, and more accurate than the NLTK stemmer)</h5>

In [9]:
import qalsadi.lemmatizer
lemmer = qalsadi.lemmatizer.Lemmatizer()

In [21]:
quran_df.text.head(10).progress_map(lemmer.lemmatize_text)

100%|██████████| 10/10 [00:00<00:00, 64.97it/s]


1                           [سم, الله, رحم, ٰ, ن, رحيم]
2                            [حمد, له, رب, الع, ٰ, لما]
3                                     [رحم, ٰ, ن, رحيم]
4                                  [م, ٰ, لك, يوم, دين]
5                            [اياك, عبد, واياك, استعان]
6                               [هدن, صر, ٰ, ط, مستقيم]
7     [صر, ٰ, ط, الذين, انعمت, على, غير, مغضوب, على,...
9                           [سم, الله, رحم, ٰ, ن, رحيم]
10                                                [لما]
11      [ذ, ٰ, لك, كت, ٰ, ب, لا, ريب, في, هدى, للمتقين]
Name: text, dtype: object
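
The stray `ٰ` tokens in the output above suggest that the superscript (dagger) alef in the source text splits words such as `الرحمٰن` into several pieces. One possible preprocessing step is to remove or replace that character before lemmatizing. This is a sketch under two assumptions: that the character is U+0670, and that the lemmatizer handles the normalized spelling better, which has not been verified here. `normalize_dagger_alef` is a hypothetical helper.

```python
DAGGER_ALEF = "\u0670"  # ARABIC LETTER SUPERSCRIPT ALEF


def normalize_dagger_alef(text: str, replacement: str = "") -> str:
    """Replace every dagger alef with `replacement` (default: remove it)."""
    return text.replace(DAGGER_ALEF, replacement)


# Removing the mark: the word keeps its letters but loses the superscript alef.
print(normalize_dagger_alef("الرحم\u0670ن"))
# Replacing it with a regular alef (U+0627) instead.
print(normalize_dagger_alef("كت\u0670ب", replacement="\u0627"))
```

Neither choice matches modern spelling for every word, so it is worth comparing the lemmatizer output for both variants on a handful of verses before committing to one.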

<h5>3. Farasa API lemmatizer (the most accurate of the three, but slow because every call goes through a web API)</h5>

In [65]:
import json
import http.client

def farasa_lemma(text):
    """Send one string to the Farasa web API and return its list of lemmas."""
    conn = http.client.HTTPSConnection("farasa-api.qcri.org")
    # json.dumps escapes quotes and backslashes, so any verse text yields valid JSON.
    payload = json.dumps({"text": text}, ensure_ascii=False).encode('utf-8')
    headers = {'content-type': "application/json", 'cache-control': "no-cache"}
    try:
        conn.request("POST", "/msa/webapi/lemma", payload, headers)
        res = conn.getresponse()
        data = res.read().decode('utf-8')
    finally:
        conn.close()  # release the socket even if the request fails
    return json.loads(data)['result']

# # Lemmatization code
# quran_df['lemmatized'] = quran_df.text.progress_map(farasa_lemma)
# quran_df.to_csv('quran_with_lemma.csv', index=False)

# test on the first 10 verses (one string per request)
test = quran_df.text.head(10).progress_map(farasa_lemma)


100%|██████████| 10/10 [00:07<00:00,  1.27it/s]
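
Since each call is a network round trip and some verses repeat (the Basmala opens several suras in the data above), caching results per distinct string can avoid repeated requests. This is a minimal sketch using `functools.lru_cache`. `cached_lemmatizer` is a hypothetical wrapper, and the demo uses a stand-in function instead of the real API so it runs offline.

```python
from functools import lru_cache
from typing import Callable, Iterable, Tuple


def cached_lemmatizer(
    lemmatize: Callable[[str], Iterable[str]]
) -> Callable[[str], Tuple[str, ...]]:
    """Wrap a lemmatizer so each distinct input string is processed only once."""

    @lru_cache(maxsize=None)
    def cached(text: str) -> Tuple[str, ...]:
        # Store a tuple: it is immutable, so callers cannot alter the cached value.
        return tuple(lemmatize(text))

    return cached


# Demo with a stand-in lemmatizer that records how often it is really called.
calls = []

def fake_lemmatize(text: str):
    calls.append(text)
    return text.split()

lemma_cached = cached_lemmatizer(fake_lemmatize)
basmala = "بسم الله الرحمٰن الرحيم"
lemma_cached(basmala)
lemma_cached(basmala)
print(len(calls))  # the second call is served from the cache
```

In the notebook this could be used as `quran_df.text.progress_map(cached_lemmatizer(farasa_lemma))`. Note that the wrapper returns tuples rather than lists.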


In [93]:
farasa_lemma(df.iloc[0].text[2])

['ر',
 'كتب',
 'انزلنه',
 'يك',
 'خرج',
 'ناس',
 'من',
 'ظلم',
 'الى',
 'نور',
 'اذن',
 'رب',
 'الى',
 'صراط',
 'عزيز',
 'حميد']

In [82]:
df.iloc[2].text[2]

'بسم الله الرحمٰن الرحيم'