<a href="https://colab.research.google.com/github/zahr-eddine/nlp_preprocessing_data/blob/main/nlp_preprocessing_data_using_huggingface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Preprocessing data using the Hugging Face Transformers package** ##

**Dataset**

In [None]:
dataset = [        
"""Perhaps one of the most significant advances made by Arabic mathematics began at this time with the work of al-Khwarizmi, namely
the beginnings of algebra. It is important to understand just how significant this new idea was. It was a revolutionary move away from
the Greek concept of mathematics which was essentially geometry. Algebra was a unifying theory which allowed rational
numbers, irrational numbers, geometrical magnitudes, etc., to all be treated as "algebraic objects". It gave mathematics a whole new
development path so much broader in concept to that which had existed before, and provided a vehicle for future development of the
subject. Another important aspect of the introduction of algebraic ideas was that it allowed mathematics to be applied to itself in a
way which had not happened before.""",

 """ربما كانت أحد أهم التطورات التي قامت بها الرياضيات العربية التي بدأت في هذا الوقت بعمل الخوارزمي  وهي بدايات الجبر،ومن المهم فهم كيف كانت هذه الفكرة الجديدة مهمة، فقد كانت خطوة ثورية بعيدا عن
المفهوم اليوناني للرياضيات التي هي في جوهرها  هندسة، الجبركان نظرية موحدة تتحيح الأعداد الكسرية و الأعداد اللا كسرية ، والمقادير الهندسية و غيرها ، أن تتعامل على أنها أجسام جبرية، و أعطت الرياضيات ككل مسارا جديدًا للتطوربمفهوم 
 أوسع بكثير من الذي كان موجودًا من قبل ، وقدم وسيلة للتنمية في هذا الموضوع مستقبلا .و جانب آخر مهم لإدخال أفكار الجبر و هو أنه سمح بتطبيق الرياضيات على نفسها 
بطريقة  لم تحدث من قبل."""
]

**Installing Transformers**

In [None]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/fd/1a/41c644c963249fd7f3836d926afa1e3f1cc234a1c40d80c5f03ad8f6f1b2/transformers-4.8.2-py3-none-any.whl (2.5MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/d4/e2/df3543e8ffdab68f5acc73f613de9c2b155ac47f162e725dcac87c521c11/tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3MB)
Collecting huggingface-hub==0.0.12
  Downloading https://files.pythonhosted.org/packages/2f/ee/97e253668fda9b17e968b3f97b2f8e53aa0127e8807d24a54768742

**Import packages**

**Using BERT Tokenization**

In [None]:
import transformers
from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup

## **Begin Preprocessing** ##

**Remove punctuation**

In [None]:
import string
punctuations = '''`÷×؛<>_()*&^%][ـ،/:"؟.,'{}~¦+|!”…“–ـ''' + string.punctuation  #all punctuations 
print(punctuations)

def remove_punctuations(text):
  """Return `text` with every character listed in `punctuations` removed."""
  return ''.join(ch for ch in text if ch not in punctuations)

`÷×؛<>_()*&^%][ـ،/:"؟.,'{}~¦+|!”…“–ـ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
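
As an alternative to the character-by-character filter above, the same idea can be sketched with `str.translate`, which deletes all listed characters in one pass. This is only an illustration: `EXTRA_PUNCTUATION` is an assumed, shortened subset of the notebook's punctuation list, and `strip_punctuation` is a hypothetical helper name, not part of the notebook.

```python
import string

# Assumed subset of the Arabic/typographic marks used in the notebook,
# shortened here for illustration only.
EXTRA_PUNCTUATION = "،؛؟…“”–"
ALL_PUNCTUATION = EXTRA_PUNCTUATION + string.punctuation

# With three arguments, str.maketrans maps every character in the
# third string to None, so translate() deletes it.
PUNCT_TABLE = str.maketrans("", "", ALL_PUNCTUATION)


def strip_punctuation(text):
    """Hypothetical helper: remove every character in ALL_PUNCTUATION."""
    return text.translate(PUNCT_TABLE)


print(strip_punctuation('treated as "algebraic objects".'))  # -> treated as algebraic objects
print(strip_punctuation("al-Khwarizmi"))                     # -> alKhwarizmi
```

Note that the hyphen is part of `string.punctuation`, so hyphenated names such as "al-Khwarizmi" are fused into one word. Whether that is desirable depends on the downstream task.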


**Tokenization**

In [None]:
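PRE_TRAINED_MODEL_NAME = 'bert-base-cased'
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)  # load the pretrained WordPiece tokenizer


def tokenize_text(text):
  """Lowercase `text`, split it into WordPiece tokens, and map them to vocabulary ids."""
  # Note: 'bert-base-cased' has a case-sensitive vocabulary, so lowercasing
  # splits proper nouns into pieces (e.g. 'arabic' -> 'a', '##rab', '##ic').
  text = text.lower()
  tokens = tokenizer.tokenize(text)
  ids = tokenizer.convert_tokens_to_ids(tokens)
  return tokens, ids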
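
To make the `##` pieces in the tokenizer output easier to read, here is a toy sketch of greedy longest-match-first sub-word splitting followed by an id lookup. It is not the library's implementation: `TOY_VOCAB`, its ids, and the `wordpiece` helper are all hypothetical and exist only for this example.

```python
# Hypothetical miniature vocabulary (token -> id); the real BERT
# vocabulary has tens of thousands of entries.
TOY_VOCAB = {"[UNK]": 0, "Arabic": 1, "a": 2, "##rab": 3, "##ic": 4, "algebra": 5}


def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into sub-word pieces, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


for word in ("Arabic", "arabic"):
    pieces = wordpiece(word, TOY_VOCAB)
    ids = [TOY_VOCAB[p] for p in pieces]  # analogous to convert_tokens_to_ids
    print(word, pieces, ids)
```

With this toy vocabulary, "Arabic" stays whole while the lowercased "arabic" falls apart into `a`, `##rab`, `##ic`, which mirrors what happens when lowercased text is fed to a cased vocabulary.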



In [None]:

tokens, ids = tokenize_text(dataset[0])
print(tokens, ids)

['perhaps', 'one', 'of', 'the', 'most', 'significant', 'advances', 'made', 'by', 'a', '##rab', '##ic', 'mathematics', 'began', 'at', 'this', 'time', 'with', 'the', 'work', 'of', 'al', '-', 'k', '##hwa', '##riz', '##mi', ',', 'namely', 'the', 'beginnings', 'of', 'algebra', '.', 'it', 'is', 'important', 'to', 'understand', 'just', 'how', 'significant', 'this', 'new', 'idea', 'was', '.', 'it', 'was', 'a', 'revolutionary', 'move', 'away', 'from', 'the', 'g', '##ree', '##k', 'concept', 'of', 'mathematics', 'which', 'was', 'essentially', 'geometry', '.', 'algebra', 'was', 'a', 'un', '##ifying', 'theory', 'which', 'allowed', 'rational', 'numbers', ',', 'irrational', 'numbers', ',', 'geometric', '##al', 'magnitude', '##s', ',', 'etc', '.', ',', 'to', 'all', 'be', 'treated', 'as', '"', 'algebraic', 'objects', '"', '.', 'it', 'gave', 'mathematics', 'a', 'whole', 'new', 'development', 'path', 'so', 'much', 'broader', 'in', 'concept', 'to', 'that', 'which', 'had', 'existed', 'before', ',', 'and', 