## getting the file of quran segments

In a [previous](http://abdulbaqi.io/2019/01/15/quranic_roots/) post I explored some `Linux` commands for finding root words in the entire Quran or in a particular sura of the Quran.

While `Linux` commands can be very productive in certain cases, they are not meant for data analysis. Hence, we need to resort to a proper data science programming language like R or Python.

In this post, I will explore the power of `Python` to address the same topic (i.e., root words in the Quran), but will extend the problem to much more interesting queries.

This post is not intended to be a beginner's tutorial on either `python` or `pandas`, the python package for data analysis that I will use here. I expect you to have some experience with both, but more with python. I will assume you have less experience with `pandas`.

Without further ado, let us get started.

First, let us start with a few setup steps, like loading the `pandas` package and renaming it to `pd` for ease of use.

In [19]:
import pandas as pd

Next, read the file that contains the morphological information. Pandas has the `read_csv` function, which can read directly from a URL. Note that the file contains some copyright information in the first 56 lines, hence the `skiprows` option. Also, note that `read_csv` by default assumes the separator to be a comma; if it is not, as is the case in this file, we need to specify the delimiter explicitly, hence the `sep='\t'` option. Finally, we display a few lines from the top with the `head()` function.

In [20]:
url = 'http://textminingthequran.com/data/quranic-corpus-morphology-0.4.txt'
qdforiginal = pd.read_csv(url, sep='\t',skiprows=56)
qdforiginal.head()

Unnamed: 0,LOCATION,FORM,TAG,FEATURES
0,(1:1:1:1),bi,P,PREFIX|bi+
1,(1:1:1:2),somi,N,STEM|POS:N|LEM:{som|ROOT:smw|M|GEN
2,(1:1:2:1),{ll~ahi,PN,STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN
3,(1:1:3:1),{l,DET,PREFIX|Al+
4,(1:1:3:2),r~aHoma`ni,ADJ,STEM|POS:ADJ|LEM:r~aHoma`n|ROOT:rHm|MS|GEN
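
To see what `skiprows` and `sep` do without downloading anything, here is a minimal sketch on a made-up miniature of the file. The two comment lines and the name `raw` are assumptions for illustration only; the real file has 56 lines to skip.

```python
import io
import pandas as pd

# Hypothetical miniature of the corpus file: two lines of preamble to skip,
# then a tab-separated header and two data rows.
raw = (
    "# copyright line 1\n"
    "# copyright line 2\n"
    "LOCATION\tFORM\tTAG\tFEATURES\n"
    "(1:1:1:1)\tbi\tP\tPREFIX|bi+\n"
    "(1:1:1:2)\tsomi\tN\tSTEM|POS:N|LEM:{som|ROOT:smw|M|GEN\n"
)

# skiprows drops the preamble; sep tells pandas the columns are tab-separated.
toy = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=2)
print(toy.shape)
print(toy.columns.tolist())
```

Without `sep="\t"` pandas would look for commas and would not split these lines into the four columns.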


It is a good idea to save the first version locally by the `to_csv()` function. 

In [21]:
qdforiginal.to_csv('quran-morphology-v1.csv')

Looking at the first few lines of the file above we see that the `LOCATION` and `FEATURES` columns need to be split further. 

Our file contains 128k lines (you can verify that with the command `qdforiginal.shape`). I prefer to take a small sample of this big file and experiment with the splitting on it. Once that works, we can run it on the entire file.

### Splitting columns

Here is my strategy: since I am interested in root words, I first want to select all rows that contain the text `ROOT:` in the `FEATURES` column. This can be done with a command like the following:

In [22]:
qdforiginal.FEATURES.str.contains('ROOT:')[:3]

0    False
1     True
2     True
Name: FEATURES, dtype: bool

I only took the first 3 lines of the entire 128k lines. It returns `boolean` values of `True` or `False`. So, we can pass this boolean result to filter the entire dataframe by:

In [23]:
qdforiginal[qdforiginal.FEATURES.str.contains('ROOT:')].head(3)

Unnamed: 0,LOCATION,FORM,TAG,FEATURES
1,(1:1:1:2),somi,N,STEM|POS:N|LEM:{som|ROOT:smw|M|GEN
2,(1:1:2:1),{ll~ahi,PN,STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN
4,(1:1:3:2),r~aHoma`ni,ADJ,STEM|POS:ADJ|LEM:r~aHoma`n|ROOT:rHm|MS|GEN
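
The same boolean-mask idea can be sketched on a three-row toy frame. The frame `toy` and the name `mask` are made up for this illustration; `regex=False` is an optional choice here that treats `ROOT:` as plain text rather than as a regular expression.

```python
import pandas as pd

# Toy rows copied from the head of the file shown above.
toy = pd.DataFrame({
    "FORM": ["bi", "somi", "{l"],
    "FEATURES": [
        "PREFIX|bi+",
        "STEM|POS:N|LEM:{som|ROOT:smw|M|GEN",
        "PREFIX|Al+",
    ],
})

# One True/False per row: does FEATURES contain the literal text "ROOT:"?
mask = toy.FEATURES.str.contains("ROOT:", regex=False)

# Indexing the frame with the mask keeps only the rows marked True.
with_root = toy[mask]
print(with_root)
```

Note that the filtered frame keeps the original row labels, which is why the output above shows the indices 1, 2 and 4 rather than 0, 1 and 2.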


To get a random sample, I can use the `sample()` method as follows. I will name this sample `qsample`.

In [24]:
qsample = qdforiginal[qdforiginal.FEATURES.str.contains('ROOT:')].sample(10); qsample

Unnamed: 0,LOCATION,FORM,TAG,FEATURES
75915,(24:41:11:3),T~ayoru,N,STEM|POS:N|LEM:Tayor|ROOT:Tyr|M|NOM
21576,(4:159:14:1),$ahiydFA,N,STEM|POS:N|LEM:$ahiyd|ROOT:$hd|MS|INDEF|ACC
63776,(18:52:7:2),daEa,V,STEM|POS:V|PERF|LEM:daEaA|ROOT:dEw|3MP
108439,(47:3:17:3),n~aAsi,N,STEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN
78890,(26:60:2:1),m~u$oriqiyna,N,STEM|POS:N|ACT|PCPL|(IV)|LEM:m~u$oriqiyn|ROOT:...
127444,(97:4:3:3),r~uwHu,N,STEM|POS:N|LEM:ruwH|ROOT:rwH|M|NOM
78895,(26:61:3:2),jamoEaAni,N,STEM|POS:N|LEM:jamoE|ROOT:jmE|MD|NOM
109979,(48:25:28:1),{ll~ahu,PN,STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|NOM
51559,(12:62:12:1),>aholi,N,STEM|POS:N|LEM:>ahol|ROOT:Ahl|M|GEN
46582,(10:100:10:2),r~ijosa,N,STEM|POS:N|LEM:rijos|ROOT:rjs|M|ACC
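
Since `sample()` draws different rows on every run, your ten rows will differ from mine. If you want a repeatable sample, a small sketch on a toy frame (the frame and the seed value `0` are arbitrary choices for illustration) is:

```python
import pandas as pd

toy = pd.DataFrame({"n": range(100)})

# The same random_state gives the same rows on every call.
first_draw = toy.sample(5, random_state=0)
second_draw = toy.sample(5, random_state=0)
print(first_draw.equals(second_draw))
```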


My intention is to split the `LOCATION` column into four columns, and then the `FEATURES` column into a column for Root and another for Lemma.

First, I am going to split the `LOCATION` column into four columns. This is done with the `str.extract` method, which takes a regular expression. Each `?P<...>` named group within the regular expression becomes a column with that name.

In [25]:
tmp1 = qsample.LOCATION.str.extract(r'(?P<sura>\d*):(?P<aya>\d*):(?P<word>\d*):(?P<w_seg>\d*)'); tmp1

Unnamed: 0,sura,aya,word,w_seg
75915,24,41,11,3
21576,4,159,14,1
63776,18,52,7,2
108439,47,3,17,3
78890,26,60,2,1
127444,97,4,3,3
78895,26,61,3,2
109979,48,25,28,1
51559,12,62,12,1
46582,10,100,10,2
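
One detail worth knowing about `str.extract` is that the new columns hold strings, not numbers, even though they look numeric. A minimal sketch on two made-up locations (the names `loc`, `parts` and `parts_int` are assumptions for this example, and the pattern is a slightly stricter variant that uses `\d+` and matches the parentheses):

```python
import pandas as pd

loc = pd.Series(["(1:1:1:2)", "(24:41:11:3)"])

# Each named group becomes a column of the result.
parts = loc.str.extract(
    r"\((?P<sura>\d+):(?P<aya>\d+):(?P<word>\d+):(?P<w_seg>\d+)\)"
)
print(type(parts.loc[0, "sura"]))   # the extracted pieces are str values

# Cast when numeric behaviour (sorting, comparing, merging on ints) is needed.
parts_int = parts.astype(int)
print(parts_int.dtypes)
```

This matters later whenever these columns are compared or joined with integer columns.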


Now, let us extract the roots from the `FEATURES` column in the same way.

In [26]:
tmp2 = qsample.FEATURES.str.extract(r'ROOT:(?P<Root>[^|]*)'); tmp2

Unnamed: 0,Root
75915,Tyr
21576,$hd
63776,dEw
108439,nws
78890,$rq
127444,rwH
78895,jmE
109979,Alh
51559,Ahl
46582,rjs


Similarly, I want to extract **Lemmas** as well. 

In [27]:
tmp3 = qsample.FEATURES.str.extract(r'LEM:(?P<Lemma>[^|]*)'); tmp3

Unnamed: 0,Lemma
75915,Tayor
21576,$ahiyd
63776,daEaA
108439,n~aAs
78890,m~u$oriqiyn
127444,ruwH
78895,jamoE
109979,{ll~ah
51559,>ahol
46582,rijos


Finally, all that is left is to concatenate the original sample `qsample` with these three splits `tmp1, tmp2, tmp3`, as follows. The `axis=1` option means that the frames are placed side by side as columns (rather than stacked as rows).

In [28]:
pd.concat([tmp1, qsample, tmp2,tmp3], axis=1)

Unnamed: 0,sura,aya,word,w_seg,LOCATION,FORM,TAG,FEATURES,Root,Lemma
75915,24,41,11,3,(24:41:11:3),T~ayoru,N,STEM|POS:N|LEM:Tayor|ROOT:Tyr|M|NOM,Tyr,Tayor
21576,4,159,14,1,(4:159:14:1),$ahiydFA,N,STEM|POS:N|LEM:$ahiyd|ROOT:$hd|MS|INDEF|ACC,$hd,$ahiyd
63776,18,52,7,2,(18:52:7:2),daEa,V,STEM|POS:V|PERF|LEM:daEaA|ROOT:dEw|3MP,dEw,daEaA
108439,47,3,17,3,(47:3:17:3),n~aAsi,N,STEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN,nws,n~aAs
78890,26,60,2,1,(26:60:2:1),m~u$oriqiyna,N,STEM|POS:N|ACT|PCPL|(IV)|LEM:m~u$oriqiyn|ROOT:...,$rq,m~u$oriqiyn
127444,97,4,3,3,(97:4:3:3),r~uwHu,N,STEM|POS:N|LEM:ruwH|ROOT:rwH|M|NOM,rwH,ruwH
78895,26,61,3,2,(26:61:3:2),jamoEaAni,N,STEM|POS:N|LEM:jamoE|ROOT:jmE|MD|NOM,jmE,jamoE
109979,48,25,28,1,(48:25:28:1),{ll~ahu,PN,STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|NOM,Alh,{ll~ah
51559,12,62,12,1,(12:62:12:1),>aholi,N,STEM|POS:N|LEM:>ahol|ROOT:Ahl|M|GEN,Ahl,>ahol
46582,10,100,10,2,(10:100:10:2),r~ijosa,N,STEM|POS:N|LEM:rijos|ROOT:rjs|M|ACC,rjs,rijos
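
A point that is easy to miss: with `axis=1`, `pd.concat` lines rows up by their index labels, not by their position. That is why the pieces above fit together even though the sample has scattered labels. A minimal sketch with two toy frames (names and values made up for illustration):

```python
import pandas as pd

# Same labels, but 'right' is missing label 0 and 'left' is not in sorted order.
left = pd.DataFrame({"FORM": ["somi", "bi"]}, index=[1, 0])
right = pd.DataFrame({"Root": ["smw"]}, index=[1])

combined = pd.concat([left, right], axis=1)
print(combined)
```

The row labelled 1 receives its root, while the row labelled 0 gets a missing value because `right` has no such label.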


Now that the experiment ran successfully on the sample, let us repeat it on the actual file `qdforiginal`.

In [29]:
tmp1 = qdforiginal.LOCATION.str.extract(r'(?P<sura>\d*):(?P<aya>\d*):(?P<word>\d*):(?P<w_seg>\d*)')
tmp2 = qdforiginal.FEATURES.str.extract(r'ROOT:(?P<Root>[^|]*)')
tmp3 = qdforiginal.FEATURES.str.extract(r'LEM:(?P<Lemma>[^|]*)')
df_qruan = pd.concat([tmp1, qdforiginal, tmp2,tmp3], axis=1)

To confirm the shape of the new dataframe `df_qruan`, I can use the `shape` attribute. I can also display a few random rows.

In [30]:
df_qruan.shape

(128219, 10)

In [31]:
df_qruan.sample(5)

Unnamed: 0,sura,aya,word,w_seg,LOCATION,FORM,TAG,FEATURES,Root,Lemma
81748,27,63,4,1,(27:63:4:1),Zuluma`ti,N,STEM|POS:N|LEM:Zuluma`t|ROOT:Zlm|FP|GEN,Zlm,Zuluma`t
107753,46,16,8,1,(46:16:8:1),wa,CONJ,PREFIX|w:CONJ+,,
83638,28,49,2,2,(28:49:2:2),>otu,V,STEM|POS:V|IMPV|LEM:>ataY|ROOT:Aty|2MP,Aty,>ataY
97849,38,78,4,1,(38:78:4:1),<ilaY`,P,STEM|POS:P|LEM:<ilaY`,,<ilaY`
73247,23,27,8,2,(23:27:8:2),<i*aA,T,STEM|POS:T|LEM:<i*aA,,<i*aA


Our newly introduced columns might contain extra spaces, which we can get rid of with the `str.strip()` method as follows.

In [32]:
df_qruan.Root = df_qruan.Root.str.strip()

In [33]:
df_qruan.Lemma = df_qruan.Lemma.str.strip()

It is a good idea to save this version to a `csv` file. `index=False` avoids unnecessarily including an extra index column in the output file.

In [16]:
df_qruan.to_csv('quran-morphology-v2.csv', index=False)
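
To see what `index=False` saves us from, here is a small sketch that writes a toy frame both ways into in-memory buffers and reads it back. The buffers and the variable names are assumptions for illustration; the same applies to files on disk.

```python
import io
import pandas as pd

toy = pd.DataFrame({"Root": ["smw", "Alh"]})

# Default: the index is written as a first column with an empty header.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

When the file is read back, the index column written by default shows up as a column named `Unnamed: 0`.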

## join with Meccan/Medinan file

It would be very useful to augment our file with a new column that tells us whether a sura is Meccan or Medinan. This will later allow us to answer questions like, for example, **what are the unique root words in the Quran that appear only in Meccan suras?**

To do this, I am referring to a table of contents page I created some time back using `Angular` [here](http://textminingthequran.com/toc/)

My idea is to go to that page, select the table with the mouse, copy it to the clipboard, and then read the clipboard into a dataframe `qtoc`. Note that the cell below, as shown, reads the table from a local `toc.csv` file rather than from the clipboard.

In [34]:
qtoc=pd.read_csv('toc.csv')

In [35]:
qtoc.head()

Unnamed: 0,No.,Name Arabic,Name,English Meaning,No of verses,Place,Chronology
0,1,الفاتحة,Al-Fatiha,The Opening,7,Meccan,5
1,2,البقرة,Al-Baqara,The Cow,286,Medinan,87
2,3,آل عمران,Al Imran,The House of Joachim,200,Medinan,89
3,4,النساء,An-Nisa',Women,176,Medinan,92
4,5,المائدة,Al-Ma'ida,The Table Spread,120,Medinan,112


Again, let me save this dataframe locally.

In [36]:
qtoc.to_csv('toc.csv', index=False)

I will now use the `merge` function to merge our original file `df_qruan` with `qtoc` on the sura number (the `sura` column in the left `df_qruan` file and the `No.` column in the right `qtoc` file). A `left` join is the one that makes sense here, since it keeps every row of `df_qruan`. The new dataframe is saved as `quran`.

In [37]:
# str.extract returns strings, so sura is cast to int to match the int64 'No.' column;
# without this cast the merge raises a ValueError about object and int64 key columns.
df_qruan['sura'] = df_qruan['sura'].astype(int)
quran = df_qruan.merge(qtoc.loc[:,['No.', 'Place']], how='left', left_on='sura', right_on='No.')
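
A merge needs its key columns to have compatible types. The `sura` column came out of `str.extract` and therefore holds strings, while `No.` holds integers. Here is a minimal sketch of the cast followed by the left join, on toy frames whose names and rows are made up for illustration:

```python
import pandas as pd

# 'sura' holds strings, as it would after str.extract.
segments = pd.DataFrame({
    "sura": ["1", "2", "2"],
    "Root": ["smw", "ktb", "hdy"],
})
toc = pd.DataFrame({"No.": [1, 2], "Place": ["Meccan", "Medinan"]})

# Cast the key to int so that both sides of the join have the same type.
segments["sura"] = segments["sura"].astype(int)

# Left join: every segment row is kept and gains the Place of its sura.
merged = segments.merge(toc, how="left", left_on="sura", right_on="No.")
print(merged)
```

The cast assumes every row has a valid sura number; a missing value in that column would make `astype(int)` fail.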

I can display some useful information about the dataframe through the `info()` method.

In [None]:
quran.info()

I noticed that I no longer need the `LOCATION` and `No.` columns, as they are now redundant. So, I just drop them.

In [None]:
quran.drop(columns=['LOCATION','No.'], inplace=True)

As usual, here is a local copy of the final file after doing all these setup steps.

In [None]:
quran.to_csv('quran-morphology-final.csv', index=False)

## converting Buckwalter to Arabic

Our file contains Quranic words and roots in Buckwalter form, and I wanted a handy function to convert that into Arabic form. Here is how we do it.

First, referencing [this](http://corpus.quran.com/java/buckwalter.jsp) site, I can construct the following dictionary, which maps each Arabic Unicode character to its Buckwalter symbol. I will call this dictionary `abjad`.

In [None]:
abjad = {u"\u0627":'A',
u"\u0628":'b', u"\u062A":'t', u"\u062B":'v', u"\u062C":'j',
u"\u062D":'H', u"\u062E":'x', u"\u062F":'d', u"\u0630":'*', u"\u0631":'r',
u"\u0632":'z', u"\u0633":'s', u"\u0634":'$', u"\u0635":'S', u"\u0636":'D',
u"\u0637":'T', u"\u0638":'Z', u"\u0639":'E', u"\u063A":'g', u"\u0641":'f',
u"\u0642":'q', u"\u0643":'k', u"\u0644":'l', u"\u0645":'m', u"\u0646":'n',
u"\u0647":'h', u"\u0648":'w', u"\u0649":'Y', u"\u064A":'y'}

In [None]:
abjad[' ']=' '
abjad[u"\u0621"] = '\''
abjad[u"\u0623"] = '>'
abjad[u"\u0625"] = '<'
abjad[u"\u0624"] = '&'
abjad[u"\u0626"] = '}'
#abjad[u"\u0655"] = '\'' # Hamza below
abjad[u"\u0622"] = '|'
abjad[u"\u064E"] = 'a'
abjad[u"\u064F"] = 'u'
abjad[u"\u0650"] = 'i'
abjad[u"\u0651"] = '~'
abjad[u"\u0652"] = 'o'
abjad[u"\u064B"] = 'F'
abjad[u"\u064C"] = 'N'
abjad[u"\u064D"] = 'K'
abjad[u"\u0640"] = '_'
abjad[u"\u0670"] = '`'
abjad[u"\u0629"] = 'p'
abjad[u"\u0653"] = '^'
abjad[u"\u0654"] = '#'
abjad[u"\u0671"] = '{'
abjad[u"\u06DC"] = ':'
abjad[u"\u06DF"] = '@'
abjad[u"\u06E0"] = '"'
abjad[u"\u06E2"] = '['
abjad[u"\u06E3"] = ';'
abjad[u"\u06E5"] = ','
abjad[u"\u06E6"] = '.'
abjad[u"\u06E8"] = '!'
abjad[u"\u06EA"] = '-'
abjad[u"\u06EB"] = '+'
abjad[u"\u06EC"] = '%'
abjad[u"\u06ED"] = ']'

Let us also construct the reverse dictionary, called `alphabet`, that maps the Buckwalter symbols back to Unicode so that we can display Arabic words.

In [None]:
# Create the reverse
alphabet = {}
for key in abjad:
    alphabet[abjad[key]] = key

Using these two dictionaries, we can convert a string from one form to the other with the following two handy functions.

In [None]:
def arabic_to_buc(ara):
    return ''.join(map(lambda x:abjad[x], list(ara)))

def buck_to_arabic(buc):
    return ''.join(map(lambda x:alphabet[x], list(buc)))

Here is a small test.

In [None]:
buck_to_arabic('EalaY`')

In [None]:
arabic_to_buc('الحمد لله')
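
As a self-contained check of the round trip, here is a sketch that uses only a small subset of the mapping, just enough for the test phrase. The names `abjad_subset`, `alphabet_subset`, `to_buck` and `to_arabic` are made up for this example, and the reverse dictionary is built with a dict comprehension instead of a loop.

```python
# Subset of the Arabic -> Buckwalter mapping, enough for the phrase below.
abjad_subset = {
    "\u0627": "A",  # alif
    "\u0644": "l",  # lam
    "\u062D": "H",  # hah
    "\u0645": "m",  # meem
    "\u062F": "d",  # dal
    "\u0647": "h",  # heh
    " ": " ",
}

# Reverse mapping: Buckwalter -> Arabic.
alphabet_subset = {buck: ara for ara, buck in abjad_subset.items()}

def to_buck(text):
    """Convert Arabic text to Buckwalter, character by character."""
    return "".join(abjad_subset[ch] for ch in text)

def to_arabic(text):
    """Convert Buckwalter text back to Arabic, character by character."""
    return "".join(alphabet_subset[ch] for ch in text)

phrase = "\u0627\u0644\u062D\u0645\u062F \u0644\u0644\u0647"  # the phrase tested above
buck = to_buck(phrase)
print(buck)
print(to_arabic(buck) == phrase)
```

Both directions raise a `KeyError` on any character missing from the dictionary, so the mapping has to cover every symbol that occurs in the data.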

## counting roots

Now it is time to get into the core of our query: **What are the unique root words that appear in Meccan suras, but not in Medinan suras?**

As we saw before, we can (1) filter a dataframe with a logical check like `quran.Place == 'Meccan'`. With that we (2) get all Meccan rows, (3) select only the `Root` column, (4) run the `unique()` method to get an array of unique roots, and (5) convert it to a list with the `tolist()` method. Finally, (6) we wrap the whole thing in `set()`, and so we get the set of unique Meccan root words, called `k` here. Note how chaining lets me perform six operations in one line.

In [None]:
k = set(quran[quran.Place == 'Meccan'].Root.unique().tolist())

With the same logic, we produce the unique list of Medinan words in a set called `d`.

In [None]:
d = set(quran[quran.Place == 'Medinan'].Root.unique().tolist())

With this we can now remove from the Meccan set the roots that also appear in the Medinan set, by the following set operation. We find that there are 547 such Meccan-only roots, 198 Medinan-only roots, and 898 roots that appear in both.

In [None]:
makki_words = k-d; len(makki_words)

In [None]:
madani_words = d - k; len(madani_words)

In [None]:
both = k & d

In [None]:
len(both)
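
The three set operations can be sketched on a toy frame. The rows below are made up for illustration. The sketch also adds a `dropna()` before `unique()`, which is my own precaution and not part of the cells above: segments without a root have a missing `Root`, and `unique()` would otherwise keep that missing value as one of its results.

```python
import pandas as pd

toy = pd.DataFrame({
    "Place": ["Meccan", "Meccan", "Meccan", "Medinan", "Medinan", "Medinan"],
    "Root":  ["smw",    "Tyr",    None,     "smw",     "jmE",     None],
})

# Unique roots per place, with missing values dropped first.
k = set(toy[toy.Place == "Meccan"].Root.dropna().unique().tolist())
d = set(toy[toy.Place == "Medinan"].Root.dropna().unique().tolist())

meccan_only = k - d    # in k but not in d
medinan_only = d - k   # in d but not in k
both = k & d           # in both sets
print(meccan_only, medinan_only, both)
```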

We now have all the nuts and bolts at hand to define two useful functions as follows.

Our first function is `sura_words`. It takes as input a list of sura numbers (for example, `[113,114]` means suras 113 and 114). It also takes the kind of unique words we want to find for this list of suras: `W`, the default, is the word list, `R` is the root list and `L` is the lemma list. Note how we use the `isin()` method to filter the dataframe down to the list of suras we provide. Also note the `dropna()` method, which drops the `null` values from the list. Finally, note how we return the Arabic form of the final results using the `buck_to_arabic()` function we defined earlier.

In [None]:
# function to return words given a list of sura
def sura_words(s_list, kind='W'):
    if (kind=='R'):
        result = quran[quran.sura.isin(s_list)].Root.dropna().unique().tolist()
    elif (kind=='L'):
        result = quran[quran.sura.isin(s_list)].Lemma.dropna().unique().tolist()
    else:
        result = quran[quran.sura.isin(s_list)].FORM.unique().tolist()
    return [buck_to_arabic(x) for x in result]

Here is a test on the `Lemma` words of sura No. 111.

In [None]:
sura_words([111],'L')
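
The chain inside `sura_words` can be followed step by step on a toy frame. The rows are made up for illustration, and the Buckwalter-to-Arabic conversion is left out so that the sketch stands on its own.

```python
import pandas as pd

toy = pd.DataFrame({
    "sura":  [111, 111, 111, 112],
    "Lemma": ["yad", "yad", None, "{ll~ah"],
})

# isin keeps the rows of the requested suras, dropna removes missing lemmas,
# unique removes repeats, and tolist turns the result into a plain list.
lemmas = toy[toy.sura.isin([111])].Lemma.dropna().unique().tolist()
print(lemmas)
```

Note that `isin([111])` compares against integers, so it only matches if the `sura` column holds integers rather than strings.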

The above function has many uses. For example, you may want to increase your Quranic vocabulary gradually by memorizing the roots of one sura at a time. This function conveniently gives you the unique list of roots (or lemmas, or just words).

With a small variation, and by exploiting the set operations, we can define another function called `unique_sura_words` that again takes a list of suras and returns the roots (or lemmas, or raw words) that appear only in this list of suras. Note the `~` operator, which negates a condition. So `~quran.sura.isin([113,114])` means all suras except 113 and 114.

In [None]:
# function to return words that appear only in the given list of sura
def unique_sura_words(s_list, kind='W'):
    if (kind=='R'):
        first = quran[quran.sura.isin(s_list)].Root.dropna().unique().tolist()
        second = quran[~quran.sura.isin(s_list)].Root.dropna().unique().tolist()
        result = list(set(first)-set(second))
    elif (kind=='L'):
        first = quran[quran.sura.isin(s_list)].Lemma.dropna().unique().tolist()
        second = quran[~quran.sura.isin(s_list)].Lemma.dropna().unique().tolist()
        result = list(set(first)-set(second))
    else:
        first = quran[quran.sura.isin(s_list)].FORM.dropna().unique().tolist()
        second = quran[~quran.sura.isin(s_list)].FORM.dropna().unique().tolist()
        result = list(set(first)-set(second))
    return [buck_to_arabic(x) for x in result]

Using this function, we learn that sura 113 has these two root words that can be found nowhere else in the Quran.

In [None]:
unique_sura_words([113],'R')
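
The three branches of `unique_sura_words` differ only in the column they read, so one possible, more compact variant takes the column name as a parameter. This is a sketch on a made-up toy frame, not a replacement for the function above: the name `unique_to_suras` is hypothetical, it returns Buckwalter strings without converting them to Arabic, and it sorts the result so that the output order is stable.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "sura": [113, 113, 114, 114, 1],
    "Root": ["flq", "rbb", "rbb", "nws", None],
})

def unique_to_suras(df, s_list, column="Root"):
    """Return values of `column` that occur in the given suras and nowhere else."""
    inside = df.sura.isin(s_list)
    first = set(df[inside][column].dropna())    # values inside the listed suras
    second = set(df[~inside][column].dropna())  # values in all other suras
    return sorted(first - second)

only_113 = unique_to_suras(toy, [113])
print(only_113)
```

In this toy data `rbb` also occurs in sura 114, so only `flq` is reported as unique to sura 113.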