[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]
(https://colab.research.google.com/github/refugies-info/genai-for-public-good/blob/main/notebooks/Categorie_Classifier.ipynb)

#**Simplification Typology Classifier using mBert**





This NoteBook is deticated to explore the use of pre-trained transformer-based models for multiclass text classification. The primary focus is on predicting the appropriate simplification strategies required to simplify respective SE sentences.




---

### **Text Simplification Strategies**  
Strategies for text simplification vary along a spectrum from **omission** (removing text) to **explanation** (adding clarifications). The **transcription** strategy lies in the middle, maintaining the original content without modifications.  

#### **Macro-Strategies and Their Subcategories**  
Text simplification is structured into **macro-strategies**, which are further divided into **strategies** and **micro-strategies**:  

📌 **Transcription** – No changes are made to the original text.  
📌 **Syntactic Change** – Adjustments between different syntactic levels.  
📌 **Transposition** – Changes in word class (e.g., transforming a noun into a verb).  
📌 **Synonymy** – Simplifying technical or abstract words (e.g., *conversation* → *talk*).  
📌 **Modulation** – Reorganizing sentence structure without altering meaning.  
📌 **Explanation** – Expanding on implicit grammar or hidden content.  
📌 **Illocutionary Change** – Making implied meanings explicit.  
📌 **Compression** – Condensing grammatical or semantic constructs for clarity.  
📌 **Omission** – Removing redundant rhetorical or diamesic elements.  

---
![Taxonomy](https://drive.google.com/uc?id=1Hs8zyBESJEWZ_B7Kylm-gcH-Bg-ogsg2)


---
> ### Load the data

----


In [1]:
!pip install gdown
!gdown "https://drive.google.com/uc?id=1z-SRYUL6bAHQ-vfP0AJgxMjpkCd-56iP"


Downloading...
From: https://drive.google.com/uc?id=1z-SRYUL6bAHQ-vfP0AJgxMjpkCd-56iP
To: /home/onyxia/work/genai-for-public-good/notebooks/ri_annotated_texts_final.csv
100%|██████████████████████████████████████| 76.8k/76.8k [00:00<00:00, 11.5MB/s]


In [None]:

!pip install datasets transformers evaluate sentencepiece accelerate


In [2]:
model_path = 'microsoft/deberta-v3-small'

In [3]:
import pandas as pd
file_path = 'ri_annotated_texts_final.csv'
data = pd.read_csv(file_path)

In [4]:
def clean_text(text:str):
    import re
    text = text.strip()
    text = re.sub(r"^-\s+", "", text)
    return text
data["Version initiale"] = data["Version initiale"].apply(clean_text)
data["Version retraitée"] = data["Version retraitée"].apply(clean_text)


In [5]:
data = data.groupby(by="Version initiale").aggregate({"Version retraitée":'first', "Catégorie":lambda x: ", ".join(x)}).reset_index(drop=False)


In [6]:
classes = sorted(list(set(", ".join(data["Catégorie"]).split(", "))))

In [7]:
classes

['Compression',
 'Explanation',
 'Modulation',
 'Omission',
 'Substitution',
 'Synonymy',
 'Syntactic',
 'Transcription',
 'Transposition']

In [8]:
class2id = {class_:id_ for id_, class_ in enumerate(classes)}
id2class = {id_:class_ for class_, id_ in class2id.items()}

data["classes"] = [[class2id[g] for g in j.split(", ")] for j in data["Catégorie"]]
data

Unnamed: 0,Version initiale,Version retraitée,Catégorie,classes
0,173h de formation en français pour étrangers d...,"Des cours de français pour débutants, 4 après-...","Explanation, Substitution","[1, 4]"
1,96h de français pour apprendre à communiquer à...,96 heures de français pour progresser à l'oral...,Substitution,[4]
2,Accompagnement et conseils pendant et après la...,Accompagnement et conseils pendant et après la...,Transcription,[7]
3,Accompagnement individuel,Accompagnement individuel,Transcription,[7]
4,Accompagnement pour les démarches,Accompagnement pour les démarches,Transcription,[7]
...,...,...,...,...
252,savoir se présenter et se comporter en entrepr...,savoir se présenter et avoir la bonne attitude...,Substitution,[4]
253,une découverte du chantier et des métiers poss...,une découverte du chantier et des métiers poss...,Transcription,[7]
254,une présentation des métiers recherchés par le...,une présentation des métiers recherchés par le...,Transcription,[7]
255,vos coordonnées,votre nom et votre numéro,Explanation,[1]


In [9]:
from transformers import AutoTokenizer
model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)



  from .autonotebook import tqdm as notebook_tqdm


In [42]:
import torch
class Dataset(torch.utils.data.Dataset):
    def __init__(self, text, labels):
        self.text = text
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        example = tokenizer(self.text[idx], truncation=True)
        #example['labels'] = self.labels[idx]
        
        labels = [0. for i in range(len(classes))]
        for label_id in self.labels[idx]:
            assert isinstance(label_id, int)
            labels[label_id] = 1.
        return example

        example = tokenizer(text, truncation=True)
        example['labels'] = labels



class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, text, labels, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.text = text
        self.labels = labels
        self.max_len = max_len

    def __len__(self):
        return len(self.text)

    def __getitem__(self, idx):
        inputs = tokenizer(self.text[idx], truncation=True)
        if 0:
            inputs = self.tokenizer.encode_plus(
                self.text[idx],
                None,
                add_special_tokens=True,
                max_length=self.max_len,
                padding='max_length',
                return_token_type_ids=True,
                truncation=True,
                return_attention_mask=True,
                return_tensors='pt'
            )
        labels = [0. for i in range(len(classes))]
        for label_id in self.labels[idx]:
            assert isinstance(label_id, int)
            labels[label_id] = 1.

        return {
            'input_ids': inputs['input_ids'],
            'attention_mask': inputs['attention_mask'],
            'token_type_ids': inputs["token_type_ids"],
            'labels_ids': torch.FloatTensor(labels),
        }



#tokenized_dataset = Dataset(data["Version initiale"], data["classes"])
tokenized_dataset = CustomDataset(tuple(data["Version initiale"]), tuple(data["classes"]), tokenizer, 1000)


In [43]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


In [44]:
import evaluate
import numpy as np

clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

def sigmoid(x):
   return 1/(1 + np.exp(-x))

def compute_metrics(eval_pred):

   predictions, labels = eval_pred
   predictions = sigmoid(predictions)
   predictions = (predictions > 0.5).astype(int).reshape(-1)
   return clf_metrics.compute(predictions=predictions, references=labels.astype(int).reshape(-1))
#references=labels.astype(int).reshape(-1))


In [45]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(classes), id2label=id2class, label2id=class2id, problem_type = "multi_label_classification")
#class_counts = data['Typology Encoded'].value_counts()
#class_weights = torch.tensor([1.0 / count * len(data) / 2.0 for count in class_counts])



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [48]:

# Define a Custom Trainer to Incorporate Class Weights
class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        print(tuple(dict(inputs).keys()))
        print(kwargs)
        #labels = inputs.pop("targets")
        labels = torch.zeros(())
        outputs = model(**inputs)
        logits = outputs.logits
        loss_fct = torch.nn.BCELoss()
        loss_fct= torch.nn.BCEWithLogitsLoss()
        labels = torch.zeros(logits.shape, device=logits.device)
        #loss_fct = torch.nn.BCELoss(weight=class_weights.to(logits.device))
        #loss_fct = torch.nn.CrossEntropyLoss(weight=class_weights.to(logits.device))  # Use class weights
        print(f"{logits.shape=} {labels.shape=}")
        loss = loss_fct(logits, labels)
        return (loss, outputs) if return_outputs else loss


training_args = TrainingArguments(

   output_dir="my_awesome_model",
   learning_rate=2e-5,
   per_device_train_batch_size=1,
   per_device_eval_batch_size=1,
   num_train_epochs=2,
   weight_decay=0.01,
   evaluation_strategy="epoch",
   save_strategy="epoch",
   load_best_model_at_end=False,
)

trainer = CustomTrainer(

   model=model,
   args=training_args,
   train_dataset=tokenized_dataset,#tokenized_dataset["train"],
   eval_dataset=tokenized_dataset,#tokenized_dataset["test"],
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
)

trainer.train()


  trainer = CustomTrainer(


('input_ids', 'attention_mask', 'token_type_ids')
{'num_items_in_batch': None}
logits.shape=torch.Size([1, 9]) labels.shape=torch.Size([1, 9])


Epoch,Training Loss,Validation Loss


('input_ids', 'attention_mask', 'token_type_ids')
{'num_items_in_batch': None}
logits.shape=torch.Size([1, 9]) labels.shape=torch.Size([1, 9])
('input_ids', 'attention_mask', 'token_type_ids')
{'num_items_in_batch': None}
logits.shape=torch.Size([1, 9]) labels.shape=torch.Size([1, 9])
('input_ids', 'attention_mask', 'token_type_ids')
{'num_items_in_batch': None}
logits.shape=torch.Size([1, 9]) labels.shape=torch.Size([1, 9])
('input_ids', 'attention_mask', 'token_type_ids')
{'num_items_in_batch': None}
logits.shape=torch.Size([1, 9]) labels.shape=torch.Size([1, 9])
('input_ids', 'attention_mask', 'token_type_ids')
{'num_items_in_batch': None}
logits.shape=torch.Size([1, 9]) labels.shape=torch.Size([1, 9])
('input_ids', 'attention_mask', 'token_type_ids')
{'num_items_in_batch': None}
logits.shape=torch.Size([1, 9]) labels.shape=torch.Size([1, 9])
('input_ids', 'attention_mask', 'token_type_ids')
{'num_items_in_batch': None}
logits.shape=torch.Size([1, 9]) labels.shape=torch.Size([1, 9])

KeyboardInterrupt: 

/home/onyxia/work/genai-for-public-good/.venv/lib/python3.12/site-packages/transformers/training_args.py:1575: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
/tmp/ipykernel_15557/4099262736.py:33: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `CustomTrainer.__init__`. Use `processing_class` instead.
  trainer = CustomTrainer(
('input_ids', 'attention_mask', 'token_type_ids')
{'num_items_in_batch': None}



---




> ### Data Exploration and Analysis





---



In [3]:
# Display basic information and first few rows
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 370 entries, 0 to 369
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Version initiale   370 non-null    object
 1   Version retraitée  370 non-null    object
 2   Catégorie          370 non-null    object
dtypes: object(3)
memory usage: 8.8+ KB


In [4]:
# Display the first few rows of the dataset
data.head()

Unnamed: 0,Version initiale,Version retraitée,Catégorie
0,Dispositif d'apprentissage du français : perme...,Des ateliers 2 fois par semaine pour progresse...,Explanation
1,Dispositif d'apprentissage du français : perme...,Des ateliers 2 fois par semaine pour progresse...,Explanation
2,Dispositif d'apprentissage du français : perme...,"Des ateliers pour progresser en français, mieu...",Substitution
3,Dispositif d'apprentissage du français : perme...,"Des ateliers pour progresser en français, mieu...",Compression
4,Dispositif d'apprentissage du français : perme...,"Des ateliers pour progresser en français, mieu...",Syntactic




```
load important libiraries for visualisation
```



In [5]:
!pip install seaborn wordcloud ace_tools



In [6]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import nltk
from nltk.corpus import stopwords
from collections import Counter
from wordcloud import WordCloud



> Distribution of Simplification Strategies

