# NLP TP 01 Text Preprocessing
### spaCy ARABIC AND ENGLISH

In [1]:
import spacy
from spacy.lang.ar import Arabic

#### 1- Loading data :

In [3]:

text_EN = "Perhaps one of the most significant advances made by Arabic mathematics began at this time with the work of al-Khwarizmi, namely the beginnings of algebra. It is important to understand just how significant this new idea was. It was a revolutionary move away from the Greek concept of mathematics which was essentially geometry. Algebra was a unifying theory which allowedrational numbers,irrational numbers, geometrical magnitudes, etc., to all be treated as \"algebraic objects\". It gave mathematics a whole new development path so much broader in concept to that which had existed before, and provided a vehicle for future development of the subject. Another important aspect of the introduction of algebraic ideas was that it allowed mathematics to be applied to itselfin a way which had not happened before."
print(text_EN)

text_AR = "ربما كانت أحد أهم التطورات التي قامت بها الرياضيات العربية التي بدأت في هذا الوقت بعمل الخوارزمي وهي بدايات الجبر, ومن المهم فهم كيف كانت هذه الفكرة الجديدة مهمة, فقد كانت خطوة نورية بعيدا عن المفهوم اليوناني للرياضيات التي هي في جوهرها هندسة, الجبر کان نظرية موحدة تتيح الأعداد الكسرية والأعداد اللا كسرية, والمقادير الهندسية وغيرها, أن تتعامل على أنها أجسام جبرية, وأعطت الرياضيات ككل مسارا جديدا للتطور بمفهوم أوسع بكثير من الذي كان موجودا من قبل, وقم وسيلة للتنمية في هذا الموضوع مستقبلا. وجانب آخر مهم لإدخال أفكار الجبر وهو أنه سمح بتطبيق الرياضيات على نفسها بطريقة لم تحدث من قبل"
print(text_AR)



Perhaps one of the most significant advances made by Arabic mathematics began at this time with the work of al-Khwarizmi, namely the beginnings of algebra. It is important to understand just how significant this new idea was. It was a revolutionary move away from the Greek concept of mathematics which was essentially geometry. Algebra was a unifying theory which allowedrational numbers,irrational numbers, geometrical magnitudes, etc., to all be treated as "algebraic objects". It gave mathematics a whole new development path so much broader in concept to that which had existed before, and provided a vehicle for future development of the subject. Another important aspect of the introduction of algebraic ideas was that it allowed mathematics to be applied to itselfin a way which had not happened before.
ربما كانت أحد أهم التطورات التي قامت بها الرياضيات العربية التي بدأت في هذا الوقت بعمل الخوارزمي وهي بدايات الجبر, ومن المهم فهم كيف كانت هذه الفكرة الجديدة مهمة, فقد كانت خطوة نورية بعيدا

#### 2- Convert text to lowercase, Remove punctuation, Tokenization :

In [9]:
import re

# Convert text to lowercase :

text_EN = text_EN.lower()
print(text_EN)

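# Remove punctuation (every character that is neither a word character nor whitespace).
# Arabic has no letter case, so text_AR only needs this step.

text_EN = re.sub(r'[^\w\s]', '', text_EN)
print(text_EN)

text_AR = re.sub(r'[^\w\s]', '', text_AR)
print(text_AR)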

# Tokenization :

nlp = spacy.load("en_core_web_sm")

# The loops below print these token attributes:
#   idx      : character offset of the token within the text.
#   is_alpha : whether the token consists of alphabetic characters.
#   is_punct : whether the token is a punctuation symbol.
#   is_space : whether the token consists of whitespace.
#   shape_   : the orthographic shape of the token (e.g. "xxxx" for a lowercase word).
#   is_stop  : whether the token is a stop word.

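# Tokenization AR :
# Arabic() builds a blank Arabic pipeline (tokenizer and lexical attributes only).

nlp_AR = Arabic()
for token in nlp_AR(text_AR):
    print(token, token.idx, token.is_alpha, token.is_punct, token.is_space, token.shape_, token.is_stop)

# The later cells keep using token_AR as produced by the English model, as in the
# original run, so its tags, chunks and lemmas for the Arabic text are not reliable.
token_AR = nlp(text_AR)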

# Tokenization EN :

token_EN = nlp(text_EN)
for token in token_EN:
    print(token, token.idx, token.is_alpha, token.is_punct, token.is_space, token.shape_, token.is_stop)
 

perhaps one of the most significant advances made by arabic mathematics began at this time with the work of alkhwarizmi namely the beginnings of algebra it is important to understand just how significant this new idea was it was a revolutionary move away from the greek concept of mathematics which was essentially geometry algebra was a unifying theory which allowedrational numbersirrational numbers geometrical magnitudes etc to all be treated as algebraic objects it gave mathematics a whole new development path so much broader in concept to that which had existed before and provided a vehicle for future development of the subject another important aspect of the introduction of algebraic ideas was that it allowed mathematics to be applied to itselfin a way which had not happened before
perhaps one of the most significant advances made by arabic mathematics began at this time with the work of alkhwarizmi namely the beginnings of algebra it is important to understand just how significant 
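
The output above contains merged words such as `numbersirrational` and `alkhwarizmi`: the pattern `[^\w\s]` deletes the comma and the hyphen without leaving a space behind. The following is a minimal sketch, using only the standard library, that compares deleting punctuation with replacing it by a space. The helper names `strip_punctuation` and `replace_punctuation` are hypothetical and not part of the notebook.

```python
import re


def strip_punctuation(text: str) -> str:
    """Delete every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)


def replace_punctuation(text: str) -> str:
    """Replace punctuation with a space, then collapse repeated whitespace."""
    spaced = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", spaced).strip()


sample_en = "numbers,irrational numbers, al-Khwarizmi."
sample_ar = "بدايات الجبر, ومن المهم"

print(strip_punctuation(sample_en))    # words around "," and "-" are glued together
print(replace_punctuation(sample_en))  # words stay separated
print(strip_punctuation(sample_ar))    # \w is Unicode-aware, so Arabic letters are kept
```

Which variant is preferable depends on the task: replacing with a space keeps word boundaries, but it also splits hyphenated names such as `al-Khwarizmi` into two tokens.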

#### 3- Part of Speech Tagging or POS Tagging :

In [12]:
# Here, two attributes of the Token class are accessed:
# 1- tag_ : the fine-grained part-of-speech tag (e.g. NNS, VBD).
# 2- pos_ : the coarse-grained part-of-speech tag (e.g. NOUN, VERB).

# EN :

for token in token_EN:
    print(token, token.tag_, token.pos_, spacy.explain(token.tag_))

# AR :

for token in token_AR:
    print(token, token.tag_, token.pos_, spacy.explain(token.tag_))

perhaps RB ADV adverb
one CD NUM cardinal number
of IN ADP conjunction, subordinating or preposition
the DT DET determiner
most RBS ADV adverb, superlative
significant JJ ADJ adjective (English), other noun-modifier (Chinese)
advances NNS NOUN noun, plural
made VBN VERB verb, past participle
by IN ADP conjunction, subordinating or preposition
arabic JJ ADJ adjective (English), other noun-modifier (Chinese)
mathematics NNS NOUN noun, plural
began VBD VERB verb, past tense
at IN ADP conjunction, subordinating or preposition
this DT DET determiner
time NN NOUN noun, singular or mass
with IN ADP conjunction, subordinating or preposition
the DT DET determiner
work NN NOUN noun, singular or mass
of IN ADP conjunction, subordinating or preposition
alkhwarizmi NN NOUN noun, singular or mass
namely RB ADV adverb
the DT DET determiner
beginnings NNS NOUN noun, plural
of IN ADP conjunction, subordinating or preposition
algebra NNS NOUN noun, plural
it PRP PRON pronoun, personal
is VBZ AUX verb, 3
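
To make the relation between `tag_` and `pos_` easier to see, the sketch below tallies the coarse-grained tags and groups the fine-grained tags under them. It works on a few rows copied from the output above, so it runs without a model; with a live `Doc`, the same count could be obtained with `Counter(token.pos_ for token in token_EN)`.

```python
from collections import Counter

# (token, fine-grained tag_, coarse-grained pos_) rows copied from the output above.
rows = [
    ("perhaps", "RB", "ADV"),
    ("one", "CD", "NUM"),
    ("of", "IN", "ADP"),
    ("the", "DT", "DET"),
    ("most", "RBS", "ADV"),
    ("significant", "JJ", "ADJ"),
    ("advances", "NNS", "NOUN"),
    ("made", "VBN", "VERB"),
    ("by", "IN", "ADP"),
    ("arabic", "JJ", "ADJ"),
    ("mathematics", "NNS", "NOUN"),
    ("began", "VBD", "VERB"),
]

# How often each coarse-grained tag occurs.
pos_counts = Counter(pos for _, _, pos in rows)

# Which fine-grained tags fall under each coarse-grained tag.
fine_by_coarse = {}
for _, tag, pos in rows:
    fine_by_coarse.setdefault(pos, set()).add(tag)

print(pos_counts.most_common())
print({pos: sorted(tags) for pos, tags in fine_by_coarse.items()})
```

In this excerpt, one coarse tag can cover several fine tags: `ADV` covers `RB` and `RBS`, and `VERB` covers `VBN` and `VBD`.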

#### 4- Chunking

In [16]:
# EN :

for chunk in token_EN.noun_chunks:
    print(chunk)

# AR : 

for chunk in token_AR.noun_chunks:
    print(chunk)

the most significant advances
arabic mathematics
this time
the work
alkhwarizmi
namely the beginnings
algebra
it
this new idea
it
a revolutionary move
the greek concept
mathematics
geometry algebra
a unifying theory
allowedrational numbersirrational numbers
algebraic objects
it
mathematics
a whole new development path
concept
a vehicle
future development
the subject
another important aspect
the introduction
algebraic ideas
it
mathematics
a way
ربما
التي قامت بها الرياضيات العربية التي
هذا الوقت بعمل الخوارزمي
وهي بدايات الجبر ومن المهم فهم كيف كانت هذه الفكرة الجديدة مهمة فقد كانت خطوة
بعيدا
اللا
والمقادير الهندسية وغيرها
على أنها أجسام جبرية وأعطت الرياضيات ككل مسارا جديدا
بمفهوم
أوسع بكثير من الذي كان موجودا
من قبل وقم
مهم لإدخال أفكار
وهو أنه سمح بتطبيق الرياضيات
على نفسها بطريقة
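
The Arabic chunks above were produced by `en_core_web_sm`, because `token_AR` is the result of `nlp(text_AR)`; they should not be read as real Arabic noun phrases. The sketch below, which assumes spaCy v3 (for `Doc.has_annotation`), shows what a blank Arabic pipeline provides: tokens, but no dependency parse to build noun chunks from. The names `nlp_ar` and `doc_ar` are illustrative.

```python
import spacy

# Blank Arabic pipeline: tokenizer only, no tagger, parser or lemmatizer.
nlp_ar = spacy.blank("ar")
doc_ar = nlp_ar("هذا الوقت بعمل الخوارزمي")

print(nlp_ar.pipe_names)                 # no trained components
print([token.text for token in doc_ar])  # tokenization still works

# Noun chunks are derived from the dependency parse, so check for it first.
if doc_ar.has_annotation("DEP"):
    for chunk in doc_ar.noun_chunks:
        print(chunk.text)
else:
    print("No dependency parse available: noun chunks cannot be computed.")
```

Meaningful Arabic chunking would need a pipeline with a parser trained on Arabic, which this notebook does not load.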


#### 5- Lemmatization :

In [17]:
# EN :

for token in token_EN:
    print (token, token.lemma_)

# AR : 

for token in token_AR:
    print (token, token.lemma_)

perhaps perhaps
one one
of of
the the
most most
significant significant
advances advance
made make
by by
arabic arabic
mathematics mathematic
began begin
at at
this this
time time
with with
the the
work work
of of
alkhwarizmi alkhwarizmi
namely namely
the the
beginnings beginning
of of
algebra algebra
it it
is be
important important
to to
understand understand
just just
how how
significant significant
this this
new new
idea idea
was be
it it
was be
a a
revolutionary revolutionary
move move
away away
from from
the the
greek greek
concept concept
of of
mathematics mathematic
which which
was be
essentially essentially
geometry geometry
algebra algebra
was be
a a
unifying unifying
theory theory
which which
allowedrational allowedrational
numbersirrational numbersirrational
numbers number
geometrical geometrical
magnitudes magnitude
etc etc
to to
all all
be be
treated treat
as as
algebraic algebraic
objects object
it it
gave give
mathematics mathematic
a a
whole whole
new new
development de
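
Lemmatization maps several inflected forms to one base form. As a small illustration, the sketch below groups some (token, lemma) pairs copied from the output above by their lemma; it does not call spaCy. The pair `mathematics`/`mathematic` is the model's own output and is kept as printed, although it is arguably a lemmatizer error.

```python
from collections import defaultdict

# (surface form, lemma) pairs copied from the output above.
pairs = [
    ("advances", "advance"),
    ("made", "make"),
    ("began", "begin"),
    ("is", "be"),
    ("was", "be"),
    ("numbers", "number"),
    ("treated", "treat"),
    ("gave", "give"),
    ("mathematics", "mathematic"),
    ("algebra", "algebra"),
]

# Collect every surface form observed for each lemma.
forms_by_lemma = defaultdict(set)
for form, lemma in pairs:
    forms_by_lemma[lemma].add(form)

# Forms that the lemmatizer actually changed.
changed = [form for form, lemma in pairs if form != lemma]

print({lemma: sorted(forms) for lemma, forms in forms_by_lemma.items()})
print(changed)
```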

#### 6- Parsing :

In [19]:
# EN : 

for token in token_EN:
    print (token.text, token.tag_, token.head.text, token.dep_)

# AR :

for token in token_AR:
    print (token.text, token.tag_, token.head.text, token.dep_)



perhaps RB one advmod
one CD began nsubj
of IN one prep
the DT advances det
most RBS significant advmod
significant JJ advances amod
advances NNS of pobj
made VBN advances acl
by IN made agent
arabic JJ mathematics amod
mathematics NNS by pobj
began VBD began ROOT
at IN began prep
this DT time det
time NN at pobj
with IN began prep
the DT work det
work NN with pobj
of IN work prep
alkhwarizmi NN of pobj
namely RB beginnings advmod
the DT beginnings det
beginnings NNS one appos
of IN beginnings prep
algebra NNS of pobj
it PRP is nsubj
is VBZ began ccomp
important JJ is acomp
to TO understand aux
understand VB is xcomp
just RB significant advmod
how WRB significant advmod
significant JJ was acomp
this DT idea det
new JJ idea amod
idea NN was nsubj
was VBD understand ccomp
it PRP was nsubj
was VBD understand ccomp
a DT move det
revolutionary JJ move amod
move NN was attr
away RB move advmod
from IN away prep
the DT concept det
greek NNP concept amod
concept NN from pobj
of IN concept prep
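
Each output row gives a token, its tag, its head and the dependency label, so the rows together describe a tree whose root is its own head (`began ... began ROOT`). The sketch below rebuilds that tree for the first 12 tokens of the output and walks it in plain Python. Heads are stored as token indices, because the printed head text becomes ambiguous once a word repeats; in spaCy the index is available as `token.head.i`. The helpers `path_to_root` and `children` are hypothetical names.

```python
# First 12 tokens of the output above, with each head given as a token index.
words = ["perhaps", "one", "of", "the", "most", "significant",
         "advances", "made", "by", "arabic", "mathematics", "began"]
heads = [1, 11, 1, 6, 5, 6, 2, 6, 7, 10, 8, 11]
deps = ["advmod", "nsubj", "prep", "det", "advmod", "amod",
        "pobj", "acl", "agent", "amod", "pobj", "ROOT"]


def path_to_root(i: int) -> list:
    """Follow head links from token i until the root, which is its own head."""
    path = [words[i]]
    while heads[i] != i:
        i = heads[i]
        path.append(words[i])
    return path


def children(i: int) -> list:
    """Return the direct dependents of token i within this excerpt."""
    return [words[j] for j, h in enumerate(heads) if h == i and j != i]


print(" -> ".join(path_to_root(10)))  # from "mathematics" up to the root
print(children(6))                    # dependents of "advances"
print(children(11))                   # dependents of "began" within the excerpt
```

On a parsed `Doc`, `token.ancestors` and `token.children` give the same information directly.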

# Thank you for your attention 