# 📘 Welcome to the Arabic Text Summarization Project

## Overview
Welcome, students 👩‍🎓👨‍🎓, to our exciting journey into the world of Natural Language Processing (NLP)! In this project, we'll be delving into the fascinating task of text summarization with a focus on the Arabic language 📚. Our goal is to develop a model that can efficiently summarize Arabic text, making it easier to grasp the essence of large documents quickly 🚀.

## Project Objectives
- **Understanding Text Summarization**: Learn the fundamentals of how text summarization works 📝.
- **Exploring NLP Models**: Get hands-on experience with advanced NLP models like AraGPT2 🤖.
- **Model Fine-Tuning and Training**: Discover how to fine-tune pre-trained models on a custom dataset for specific tasks like summarization 🧠.
- **Practical Application**: Apply your knowledge to build a model that can summarize Arabic texts 🌐.

## Dataset
We'll be using a custom dataset of Arabic texts and their summaries 📖. This dataset will allow us to train our model to understand and generate concise summaries.

We generated this dataset using ChatGPT 😜
If you've read this sentence, send me a message.




## ⚠️ **Important: Use GPU Runtime** ⚠️

To ensure this notebook functions correctly and efficiently, it is **crucial to use a GPU runtime**. Follow these steps to enable GPU acceleration:

1. **Open Runtime settings**: At the top of the page, click on `Runtime` in the menu bar. 🔄

2. **Change the runtime type**: In the dropdown menu, select `Change runtime type`. 🛠️

3. **Select GPU as the hardware accelerator**: In the dialog that appears, under `Hardware accelerator`, choose `GPU T4` from the dropdown menu. 🖥️

4. **Save the settings**: Click `Save` to apply the changes. 💾

By enabling GPU, the computations in this notebook will be significantly faster, especially for tasks like training neural networks, processing large datasets, or performing complex calculations.


## PART1: Load AraGPT2

Using the link below, learn how to load the AraGPT2 base model.

https://huggingface.co/aubmindlab/aragpt2-base

In [None]:
import torch
torch.cuda.is_available()

True

In [1]:
!git clone https://github.com/samibelyo/text_summarization_project.git

Cloning into 'text_summarization_project'...
remote: Enumerating objects: 22, done.
remote: Counting objects: 100% (22/22), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 22 (delta 5), reused 18 (delta 3), pack-reused 0
Receiving objects: 100% (22/22), 10.33 KiB | 10.33 MiB/s, done.
Resolving deltas: 100% (5/5), done.


In [None]:
!pip install arabert

Collecting arabert
  Downloading arabert-1.0.1-py3-none-any.whl (179 kB)
Collecting PyArabic (from arabert)
  Downloading PyArabic-0.6.15-py3-none-any.whl (126 kB)
Collecting farasapy (from arabert)
  Downloading farasapy-0.0.14-py3-none-any.whl (11 kB)
Collecting emoji==1.4.2 (from arabert)
  Downloading emoji-1.4.2.tar.gz (184 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-1.4.2-py3-none-any.whl size=186460 sha256=18918caf1fdf137cfb60d8ad5fd2b374001dd

In [None]:
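from transformers import GPT2TokenizerFast, pipeline
# The model card lists two alternative model classes; normally only one is imported.
# For the base and medium checkpoints:
from transformers import GPT2LMHeadModel
# For the large and mega checkpoints (note: this import shadows the one above):
from arabert.aragpt2.grover.modeling_gpt2 import GPT2LMHeadModel
from arabert.preprocess import ArabertPreprocessor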

In [None]:
#TODO: Complete this cell
MODEL_NAME = 'aubmindlab/aragpt2-base'
arabert_prep = ArabertPreprocessor(model_name=MODEL_NAME)

text = "الجزائر بلد"
text_clean = arabert_prep.preprocess(text)  # apply the AraBERT preprocessing to the prompt

model = GPT2LMHeadModel.from_pretrained(MODEL_NAME)
tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)
generation_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)

# The pipeline acts as a convenient interface around tokenization, generation, and decoding.
# Feel free to try different decoding settings.
generation_pipeline(text_clean,  # pass the preprocessed prompt, not the raw text
    pad_token_id=tokenizer.eos_token_id,
    num_beams=10,
    max_length=200,
    top_p=0.9,
    repetition_penalty=3.0,
    no_repeat_ngram_size=3)[0]['generated_text']

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of the model checkpoint at aubmindlab/aragpt2-base were not used when initializing GPT2LMHeadModel: ['ln_f.weight', 'ln_f.bias']
- This IS expected if you are initializing GPT2LMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPT2LMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at aubmindlab/aragpt2-base and are newly initialized: ['emb_norm.weight', 'emb_norm.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


'الجزائر بلد انهاجب ان ورم القيمةون ، الأسرة هيالا� هى فيه بصورة العلامة والمدر الى بشكل أن المستقبل النظام ولا كافة في بأن هو شكلبنالات الاسرة أنها الأصل عبر نصف جميع شبهالإن ممثلةإ علاء تعتبر علىلت انه إلى نظ ايضا الممثلة عنحيات الافراد وكل يعتبرأن هما ربي نور والاسرة أيضا الأم وكذلكث كلها الأفرادذا شكله ى عموما بأكملها وهو فى بأكمله البيان كل فهي الدولة نورا وتلك له وبقية نظامالت كما عليه العائلة وتعتبر الاصل مختلفنت الشكلاالس مجموع الملاك كبير جيهان منذ من 7 كثيرة وهي وباقيهم البند تماما ون منه الملك أنه او 27 الشعب ويتمالل� فيصل لةج المراجعتى حتى كيت الله فإن ضوء 2انته لذلك كاملا آ مثل بما اللله القمة بوصول 11 بالتأكيد عمرأت تلكين بواسطة 2017 ج جمعاء يعكس وج النب مريم السيدة 23 والدكتورة يب عليةتي بالإضافة لأن الاطفال ال بقيةدش أرباع الطريق لجميع نتمنى لمختلف طليعةه أي كبيرة وشبه استوا منهالات� بالفعل كذلك وسلمألكل فيعالع بالاضافة وبعض ؛ ء دليل الي فهو ملكة وغيرهاgh لكلالى وسائر نه'
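One of the decoding settings used above, `no_repeat_ngram_size=3`, prevents any 3-token sequence from appearing twice in the output. The sketch below illustrates the idea in plain Python on a toy token list. The helper name `banned_next_tokens` is hypothetical and is not part of the `transformers` API; the library applies the same rule internally to token ids.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size - 1:
        return set()
    prefix_len = ngram_size - 1
    # The last (n-1) tokens are the prefix of the n-gram about to be completed.
    current_prefix = tuple(generated[len(generated) - prefix_len:])
    banned = set()
    # Scan every earlier n-gram; if its prefix matches, its final token is banned.
    for start in range(len(generated) - ngram_size + 1):
        ngram = generated[start:start + ngram_size]
        if tuple(ngram[:prefix_len]) == current_prefix:
            banned.add(ngram[-1])
    return banned


tokens = ["a", "b", "c", "d", "a", "b"]
print(banned_next_tokens(tokens, 3))  # "c" is banned because "a b c" already occurred
```

Lowering `no_repeat_ngram_size` makes the constraint stricter (shorter sequences may not repeat), while raising it allows more repetition.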

### Print the AraGPT2 Model and Analyze the Architecture

# TODO: print AraGPT2

In [None]:
print(model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(64000, 768)
    (wpe): Embedding(1024, 768)
    (emb_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (lm_head): Linear(in_features=768, out_features=64000, bias=False)
)
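As a starting point for the analysis, the parameter count can be estimated by hand from the shapes in the printout. This is a rough sketch under two assumptions that the printout does not confirm: the `Conv1D` layers use the standard GPT-2 sizes (3 × 768 outputs for `c_attn`, 4 × 768 hidden units in the MLP), and `lm_head` shares its weight matrix with `wte`. The function name `gpt2_param_count` is made up for this illustration.

```python
def gpt2_param_count(vocab=64000, n_pos=1024, d=768, n_layers=12):
    """Estimate trainable parameters from the shapes shown in the printed module tree."""
    layer_norm = 2 * d                                  # weight + bias
    embeddings = vocab * d + n_pos * d + layer_norm     # wte + wpe + emb_norm
    # Assumed Conv1D sizes (standard GPT-2): c_attn is d -> 3d, attention c_proj is d -> d
    attn = (d * 3 * d + 3 * d) + (d * d + d)
    # Assumed MLP sizes (standard GPT-2): c_fc is d -> 4d, c_proj is 4d -> d
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)
    block = 2 * layer_norm + attn + mlp                 # ln_1, ln_2, attention, MLP
    # lm_head is assumed to be tied to wte, so it adds no extra parameters
    return embeddings + n_layers * block


print(f"{gpt2_param_count():,}")
```

Under these assumptions the estimate comes to roughly 135 million parameters. To check it against the loaded model, compare with `sum(p.numel() for p in model.parameters())`.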


## PART2: Fine-tuning

To fine-tune AraGPT2 for text summarization, we use the file `arabic_texts_summaries.csv`

#### *Fine-tuning Steps:*


1.   Load the dataset and split it into train/val/test sets.
2.   Create DataLoaders for the train and val sets.
3.   Resize the model embeddings to match the new tokenizer length.
4.   Fine-tune the model on the train data, evaluating on the val data during training.
5.   Save the tokenizer and the fine-tuned model.
6.   Generate summaries for the test set, which is not used during fine-tuning.



In [None]:
from src.utils_data import *
from src.utils_tokenizer import *
from src.train import *

In [None]:
max_length = 53
sum_length = 10
split_probability = 0.3

In [None]:
train, val, test = process_data("data/arabic_texts_summaries.csv",max_length , sum_length, split_probability)
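The internals of `process_data` live in `src/utils_data.py` and are not shown here. Purely as an illustration of what a three-way split can look like, the sketch below assumes that a fraction of the rows (0.3, mirroring `split_probability`) is held out and divided evenly between validation and test. Both that interpretation and the helper name `split_rows` are assumptions, not the project's actual behavior.

```python
import random


def split_rows(rows, holdout_fraction=0.3, seed=0):
    """Shuffle rows, then split into train / val / test (hypothetical scheme)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)           # fixed seed for a reproducible split
    n_holdout = int(len(rows) * holdout_fraction)
    n_val = n_holdout // 2                      # assumption: holdout is halved into val and test
    holdout, train_rows = rows[:n_holdout], rows[n_holdout:]
    return train_rows, holdout[:n_val], holdout[n_val:]


train_rows, val_rows, test_rows = split_rows(range(100))
print(len(train_rows), len(val_rows), len(test_rows))
```

Whatever scheme the real function uses, the key property is the same: the three sets must not overlap, so that the test set stays unseen during fine-tuning.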

In [1]:
# Add token to AraGPT2 tokenizer
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained('aubmindlab/aragpt2-base')

special_tokens = {'bos_token':'<BOS>', 'eos_token':'<EOS>', 'pad_token':'<PAD>', 'additional_special_tokens':['<SUMMARIZE>']}
tokenizer.add_special_tokens(special_tokens)

print('tokenizer len: {}'.format(len(tokenizer)))

ignore_idx = tokenizer.pad_token_id


tokenizer len: 64004
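The printed length of 64004 is consistent with the 64000-entry base vocabulary (the `wte` embedding shape printed earlier) plus the four special tokens added above. To make the role of those tokens concrete, here is a minimal sketch of how one text/summary pair could be laid out as a single sequence. The exact layout used by this project is defined in `src/utils_tokenizer.py`, so the ordering, the truncation rule, and the helper name `build_sample` below are assumptions for illustration only.

```python
BOS, EOS, PAD, SEP = "<BOS>", "<EOS>", "<PAD>", "<SUMMARIZE>"


def build_sample(text_tokens, summary_tokens, max_len):
    """Lay out one example as <BOS> text <SUMMARIZE> summary <EOS>, padded to max_len."""
    sequence = [BOS] + text_tokens + [SEP] + summary_tokens + [EOS]
    sequence = sequence[:max_len]               # naive truncation, for illustration only
    n_pad = max_len - len(sequence)
    attention_mask = [1] * len(sequence) + [0] * n_pad   # 1 = real token, 0 = padding
    return sequence + [PAD] * n_pad, attention_mask


seq, mask = build_sample(["الجزائر", "بلد", "كبير"], ["الجزائر", "كبيرة"], max_len=10)
print(seq)
print(mask)
```

Setting `ignore_idx` to the pad token id, as in the cell above, is what allows the padded positions to be excluded when the loss is computed.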


In [None]:
# TODO: apply tokenizer
import os

tokenizer_dir = "tokenizer_path_save"
os.makedirs(tokenizer_dir, exist_ok=True)  # create the output directory if needed

max_seq_len = 768
tokenizer.save_pretrained(tokenizer_dir)
tokenizer_len = len(tokenizer)
print('ignore_index: {}'.format(ignore_idx))
print('max_len: {}'.format(max_seq_len))

# TODO: fix the tokenize_dataset function in src/utils_tokenizer.py before calling it
train, val, test = tokenize_dataset(tokenizer, train, val, test, max_seq_len)


In [None]:
#Generate train/val/test files
#save tokenized data
out_dir="tokenizer_data"
processed_set= "dataset"
data_dir = os.path.join(out_dir, processed_set)
if not os.path.exists(data_dir):
  os.makedirs(data_dir) # Create output directory if needed
file = os.path.join(data_dir,"train.csv")
train.to_csv(file, index=False)

file = os.path.join(data_dir,"val.csv")
val.to_csv(file, index=False)

file = os.path.join(data_dir,"test.csv")
test.to_csv(file, index=False)

In [None]:
# TODO: Visualize train and explain each column

In [None]:
# TODO: Data Loaders
# Fix code in utils_data.py

import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

# TODO: call get_gpt2_dataset (see src/utils_data.py for its arguments) to build both datasets
# train_dataset, val_dataset = get_gpt2_dataset(...)

b = train_dataset[0]  # check one data row

# Shuffle the training data; keep the validation order fixed
train_dataloader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=1)
val_dataloader = DataLoader(val_dataset, sampler=SequentialSampler(val_dataset), batch_size=1)

train_loader_len = len(train_dataloader)

In [None]:
config = {
    "out_dir": "our output_dir",
    "training_models": "models_dir",
    "final_model": "AraGPT_Limitless",
}

# fine tune pretrained model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_dir =  'aubmindlab/aragpt2-base'

# Named `trainer` so it does not overwrite the `train` split created earlier
trainer = Train(device, model_dir, tokenizer_len, ignore_idx, train_loader_len, config)
trainer.train_model(train_dataloader, val_dataloader)
