# <b>Rihal CodeStacker4: ML</b>
### <ul><u><b><i>IMPORTANT:</i></b></u><li>If you want to test the models only without fine-tuning them, please load the models as explained in the notebook</li><br/><li>Make sure to install requriments.txt</li></ul>
## Level 1: The Basics
#### Develop a model that can categorize news articles into their respective categories.

In [1]:
!nvidia-smi

Mon Apr  8 10:28:53 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:A2:00.0 Off |                  Off |
|  0%   25C    P8             15W /  450W |       0MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

### Importing needed libraries:

In [1]:
from transformers import TFAutoModel, AutoTokenizer, Trainer, TrainingArguments
import pandas as pd, numpy as np
from datasets import Dataset, DatasetDict
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder
import keras

2024-04-08 10:47:02.579366: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


### Processing the input file:

In [2]:
df_train = pd.read_json("./N24News/nytimes_train.json")
df_test = pd.read_json("./N24News/nytimes_test.json")

In [3]:
df_train = df_train.replace(r'^\s*$', np.nan, regex=True)
df_train = df_train.replace(r'\n',' ', regex=True)
df_train = df_train.replace(r'--', '', regex=True)
df_train = df_train.replace(r'\\\\', '', regex=True)
df_test = df_test.replace(r'^\s*$', np.nan, regex=True)
df_test = df_test.replace(r'\n',' ', regex=True)
df_test = df_test.replace(r'--', '', regex=True)
df_test = df_test.replace(r'\\\\', '', regex=True)

In [4]:
df_train = df_test.dropna()
df_test = df_test.dropna()

In [5]:
df_train = df_train.drop(['article_url', 'article_id', 'image', 'image_id'], axis=1)
df_test = df_test.drop(['article_url', 'article_id', 'image', 'image_id'], axis=1)

### Loading The <a href="https://huggingface.co/google-bert/bert-base-uncased">bert-base-uncased</a> LLM:

In [7]:
model = TFAutoModel.from_pretrained("bert-base-uncased")

2024-04-08 10:29:02.679643: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22283 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:a2:00.0, compute capability: 8.9
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClas

#### Loading its tokenizer:

In [8]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

### Encoding The Classes:

In [6]:
label_encoder = LabelEncoder()

# Fit and transform the 'section' column to encode the labels
df_train['label'] = label_encoder.fit_transform(df_train['section'])
df_test['label'] = label_encoder.fit_transform(df_train['section'])

label_map = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))

# Print the mapping
print("Label Mapping:")
print(label_map)

Label Mapping:
{'Art & Design': 0, 'Automobiles': 1, 'Books': 2, 'Dance': 3, 'Economy': 4, 'Education': 5, 'Fashion & Style': 6, 'Food': 7, 'Global Business': 8, 'Health': 9, 'Media': 10, 'Movies': 11, 'Music': 12, 'Opinion': 13, 'Real Estate': 14, 'Science': 15, 'Sports': 16, 'Style': 17, 'Technology': 18, 'Television': 19, 'Theater': 20, 'Travel': 21, 'Well': 22, 'Your Money': 23}


### Converting DF to DS:

In [7]:
ds_train = Dataset.from_pandas(df_train)
ds_test = Dataset.from_pandas(df_test)
ds_train = ds_train.remove_columns('__index_level_0__')
ds_test = ds_test.remove_columns('__index_level_0__')
ds = DatasetDict({"train":ds_train,"test":ds_test})
ds

DatasetDict({
    train: Dataset({
        features: ['section', 'headline', 'article', 'abstract', 'caption', 'label'],
        num_rows: 6088
    })
    test: Dataset({
        features: ['section', 'headline', 'article', 'abstract', 'caption', 'label'],
        num_rows: 6088
    })
})

### Tokenize The DS:

In [11]:
def tokenize(batch):
    return tokenizer(batch["article"], padding=True, truncation=True)

In [12]:
ds_encoded = ds.map(tokenize, batched=True, batch_size=None)

Map:   0%|          | 0/6088 [00:00<?, ? examples/s]

Map:   0%|          | 0/6088 [00:00<?, ? examples/s]

In [13]:
ds_encoded

DatasetDict({
    train: Dataset({
        features: ['section', 'headline', 'article', 'abstract', 'caption', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 6088
    })
    test: Dataset({
        features: ['section', 'headline', 'article', 'abstract', 'caption', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 6088
    })
})

In [14]:
ds_encoded.set_format('tf', columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])

BATCH_SIZE = 16

def order(inp):
    '''
    This function will group all the inputs of BERT
    into a single dictionary and then output it with
    labels.
    '''
    data = list(inp.values())
    return {
        'input_ids': data[1],
        'attention_mask': data[2],
        'token_type_ids': data[3]
    }, data[0]

# converting train split of `ds_encoded` to tensorflow format
train_dataset = tf.data.Dataset.from_tensor_slices(ds_encoded['train'][:])
# set batch_size and shuffle
train_dataset = train_dataset.batch(BATCH_SIZE).shuffle(1000)
# map the `order` function
train_dataset = train_dataset.map(order, num_parallel_calls=tf.data.AUTOTUNE)

# ... doing the same for test set ...
test_dataset = tf.data.Dataset.from_tensor_slices(ds_encoded['test'][:])
test_dataset = test_dataset.batch(BATCH_SIZE)
test_dataset = test_dataset.map(order, num_parallel_calls=tf.data.AUTOTUNE)

### Create Custome keras.Model Class to Predict Classes (Labels):

In [15]:
class BERTForClassification(tf.keras.Model):

    def __init__(self, bert_model, num_classes):
        super().__init__()
        self.bert = bert_model
        self.fc = tf.keras.layers.Dense(num_classes, activation='softmax')

    def call(self, inputs):
        x = self.bert(inputs)[1]
        return self.fc(x)


In [16]:
classifier = BERTForClassification(model, num_classes=24)

classifier.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=['accuracy']
)


### Fine-Tune the Classifier:

In [17]:
history = classifier.fit(
    train_dataset,
    epochs=8
)

Epoch 1/8


I0000 00:00:1712568649.848087   15751 service.cc:145] XLA service 0x7ef768eeafd0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1712568649.848154   15751 service.cc:153]   StreamExecutor device (0): NVIDIA GeForce RTX 4090, Compute Capability 8.9
2024-04-08 09:30:49.858363: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-04-08 09:30:49.873924: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:465] Loaded cuDNN version 8902
I0000 00:00:1712568650.021791   15751 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


### Evaluate on Unseen Data (Test Data):

In [18]:
loss, acc = classifier.evaluate(test_dataset)
print("Test Accuracy: ", acc)

Test Accuracy:  0.994086742401123


#### As seen, the model has 99.4% accuracy on unseen data <b><i>with only 8 epochs</b></i>. 
### Now, we save the model:

In [25]:
classifier.save_weights('./BERTClassificationWeights', save_format='tf')

In [17]:
def predict_category(text, model):
    # Tokenize text
    inputs = tokenizer(text, padding=True, truncation=True, max_length=512, return_tensors='tf')
    
    # Extract input tensors
    input_ids = inputs["input_ids"]
    token_type_ids = inputs["token_type_ids"]
    attention_mask = inputs["attention_mask"]
    
    # Run inference
    predictions = model.predict({"input_ids": input_ids,
                                      "token_type_ids": token_type_ids,
                                      "attention_mask": attention_mask})
    
    # Get the predicted label
    predicted_label = tf.argmax(predictions, axis=1).numpy()[0]
    
    return predicted_label

### Loading the model from local files:

In [18]:
new_model = BERTForClassification(model, num_classes=24)

new_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=['accuracy']
)

In [19]:
new_model.load_weights('./BERTClassificationWeights')

<tensorflow.python.checkpoint.checkpoint.CheckpointLoadStatus at 0x7fa431852c10>

#### Evaluate on model on test data to check it's loaded correctly:

In [20]:
loss, acc = new_model.evaluate(test_dataset)
print("Test Accuracy: ", acc)

Test Accuracy:  0.994086742401123


#### Test classification on some articles:

In [21]:
for article, actual_label in zip(ds['test'][3:6]['article'], ds['test'][3:6]['label']):
    predicted_label = predict_category(article, new_model)
    
    print("\nArticle Text:")
    print(article)
    
    # Decode predicted and actual labels
    decoded_predicted_label = label_encoder.inverse_transform([predicted_label])[0]
    decoded_actual_label = label_encoder.inverse_transform([actual_label])[0]
    
    print("Predicted Label:", decoded_predicted_label)
    print("Actual Label:", decoded_actual_label)
    print()


Article Text:
CHURCHILL: BLOOD, SWEAT & OIL PAINT (2015) Stream on Acorn TV and Amazon. Winston Churchill is famous for his achievements as a statesman  successfully navigating Britain through World War II chief among them  and for his legendary wit and outsize appetites. This BBC special introduces viewers to a lesser known dimension of his life: his passion for painting. The journalist and television presenter Andrew Marr visits Churchill's studio at Chartwell, his longtime home, and some of the areas that he loved to paint. Marr also shares his own experiences with Churchill's preferred hobby. Interviews with Celia Sandys and Emma Soames, Churchill's descendants, reveal the extent of their grandfather's dedication to his craft. And commentary from David Coombs shines a light on Churchill's relationships with professional painters of his time.  DRUNK PARENTS (2019) Stream on Netflix; rent on Amazon, Google Play, iTunes, Vudu and YouTube. Comedy thrives on misfortune, especially when

In [8]:
ds = ds.remove_columns('label')

## Level 2: The Intermediate
#### Generate abstracts that provide a clear and concise summary of the article.

### Importing needed libraries:

In [9]:
from transformers import AutoModelForSeq2SeqLM, pipeline
import nltk
from nltk.tokenize import sent_tokenize
from tqdm import tqdm
import torch
from datasets import load_metric

### Loading the <a href="https://huggingface.co/facebook/bart-large-cnn"> bart-large-cnn </a> LLM and its Tokenizer:

In [10]:
model_ckpt = "facebook/bart-large-cnn"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model_bart = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

#### Split the dataset into smaller batches that we can process simultaneously:

In [11]:
def generate_batch_sized_chunks(list_of_elements, batch_size):
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]

#### Calculate ROGUE for the test dataset with the Bert model:
##### This is useful to compare our fine-tuned model (generated summaries) with the vanilla model (original summaries) based on ROGUE

In [12]:
def calculate_metric_on_test_ds(dataset, metric, model, tokenizer, 
                                batch_size=4, device=device, 
                                column_text="article", 
                                column_summary="highlights"):
    article_batches = list(generate_batch_sized_chunks(dataset[column_text], batch_size))
    target_batches = list(generate_batch_sized_chunks(dataset[column_summary], batch_size))

    for article_batch, target_batch in tqdm(
        zip(article_batches, target_batches), total=len(article_batches)):
        
        inputs = tokenizer(article_batch, max_length=1024,  truncation=True, 
                           padding="max_length", return_tensors="pt")
        
        # Ensure that inputs are on the correct device
        inputs = {k: v.to(device) for k, v in inputs.items()}
        
        with torch.no_grad():
            # Move model to the device
            model.to(device)
            # Generate summaries
            summaries = model.generate(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"], 
                                       length_penalty=0.8, num_beams=8, max_length=128)
            
        # Move generated summaries to the same device as inputs
        summaries = summaries.to(device)
        
        # Decode the generated summaries
        decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True, 
                                              clean_up_tokenization_spaces=True) 
                             for s in summaries]      
        
        # Replace empty strings with spaces
        decoded_summaries = [d.replace("", " ") for d in decoded_summaries]
        
        # Add batch to the metric
        metric.add_batch(predictions=decoded_summaries, references=target_batch)
        
    # Compute and return the ROUGE scores
    score = metric.compute()
    return score

In [13]:
# Load the ROUGE metric
rouge_metric = load_metric('rouge')
score = calculate_metric_on_test_ds(ds['test'], rouge_metric, model_bart, tokenizer, column_text='article', column_summary='abstract', batch_size=8)

  rouge_metric = load_metric('rouge')
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
100%|██████████| 761/761 [29:11<00:00,  2.30s/it]


In [16]:
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
rouge_dict = dict((rn, score[rn].mid.fmeasure ) for rn in rouge_names )

pd.DataFrame(rouge_dict, index = ['BART'])

Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
BART,0.010103,0.00037,0.010081,0.010084


### Convert the Exmaple to Features Our Model Can Predict:
##### (basic tokenization)

In [17]:
def convert_examples_to_features(example_batch):
    input_encodings = tokenizer(example_batch['article'] , max_length = 1024, truncation = True )
    
    with tokenizer.as_target_tokenizer():
        target_encodings = tokenizer(example_batch['abstract'], max_length = 128, truncation = True )
        
    return {
        'input_ids' : input_encodings['input_ids'],
        'attention_mask': input_encodings['attention_mask'],
        'labels': target_encodings['input_ids']
    }
    
ds_encoded = ds.map(convert_examples_to_features, batched = True)

Map:   0%|          | 0/6088 [00:00<?, ? examples/s]



Map:   0%|          | 0/6088 [00:00<?, ? examples/s]

### Create a Data Collector

In [18]:
from transformers import DataCollatorForSeq2Seq

seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model_bart)

### Fine-Tune the LLM:

In [33]:
from transformers import TrainingArguments, Trainer

trainer_args = TrainingArguments(
    output_dir='bart-cnn', num_train_epochs=3, warmup_steps=500,
    per_device_train_batch_size=1, per_device_eval_batch_size=1,
    weight_decay=0.01, logging_steps=10,
    evaluation_strategy='no', eval_steps=500, save_steps=1e6,
    gradient_accumulation_steps=16,
    do_eval=False
) 

In [34]:
trainer = Trainer(model=model_bart, args=trainer_args,
                  tokenizer=tokenizer, data_collator=seq2seq_data_collator,
                  train_dataset=ds_encoded["train"]
                 )     

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


In [35]:
trainer.train()

Step,Training Loss
10,1.4883
20,1.5763
30,1.6218
40,1.4529
50,1.3855
60,1.41
70,1.2488
80,1.4678
90,1.5707
100,1.4822


TrainOutput(global_step=1140, training_loss=1.0813508042118005, metrics={'train_runtime': 1945.7525, 'train_samples_per_second': 9.387, 'train_steps_per_second': 0.586, 'total_flos': 2.914525223345357e+16, 'train_loss': 1.0813508042118005, 'epoch': 3.0})

#### Calculate ROGUE for the test dataset with the <b>Fine Tuned</b> Bert model:

In [36]:
score = calculate_metric_on_test_ds(ds['test'], rouge_metric, trainer.model, tokenizer, batch_size = 8, column_text = 'article', column_summary= 'abstract')

100%|██████████| 761/761 [25:04<00:00,  1.98s/it]


In [37]:
rouge_dict = dict((rn, score[rn].mid.fmeasure ) for rn in rouge_names )

pd.DataFrame(rouge_dict, index = ['BART'])

Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
BART,0.011099,0.000589,0.011079,0.011086


These scores suggest that after fine-tuning the model, there were small improvements in the generated summaries' overlap with the reference summaries. While the improvements are small, they indicate that the fine-tuning process helped the model generate summaries that are slightly more similar to the reference summaries. The improvements are particularly notable in the Rouge-1 and Rouge-L scores, which measure overlap in single words and consider some aspects of word order, respectively.

### Generate Sample Abstracts of Articles:

In [39]:
def generate_abstracts(model, tokenizer, articles, original_abstracts, max_length=128, num_beams=4, device="cuda"):
    inputs = tokenizer(articles, max_length=1024, truncation=True, padding="max_length", return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        model.to(device)
        summaries = model.generate(input_ids=inputs["input_ids"],
                                   attention_mask=inputs["attention_mask"],
                                   length_penalty=0.8, num_beams=num_beams, max_length=max_length)

    summaries = summaries.to(device)
    decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True, clean_up_tokenization_spaces=True) for s in
                         summaries]

    return decoded_summaries


generated_abstracts = generate_abstracts(model_bart, tokenizer, ds['test'][3:6]['article'], ds['test'][3:6]['abstract'])

for idx, (article, original_abstract, generated_abstract) in enumerate(zip(ds['test'][3:6]['article'], ds['test'][3:6]['abstract'], generated_abstracts), 1):
    print(f"Example {idx}:")
    print("Article:\n", article)
    print("\nOriginal Abstract:\n", original_abstract)
    print("\nGenerated Abstract:\n", generated_abstract)
    print()

Example 1:
Article:
 CHURCHILL: BLOOD, SWEAT & OIL PAINT (2015) Stream on Acorn TV and Amazon. Winston Churchill is famous for his achievements as a statesman  successfully navigating Britain through World War II chief among them  and for his legendary wit and outsize appetites. This BBC special introduces viewers to a lesser known dimension of his life: his passion for painting. The journalist and television presenter Andrew Marr visits Churchill's studio at Chartwell, his longtime home, and some of the areas that he loved to paint. Marr also shares his own experiences with Churchill's preferred hobby. Interviews with Celia Sandys and Emma Soames, Churchill's descendants, reveal the extent of their grandfather's dedication to his craft. And commentary from David Coombs shines a light on Churchill's relationships with professional painters of his time.  DRUNK PARENTS (2019) Stream on Netflix; rent on Amazon, Google Play, iTunes, Vudu and YouTube. Comedy thrives on misfortune, especiall

In [40]:
model_bart.save_pretrained("bart-cnn-model")

Non-default generation parameters: {'max_length': 142, 'min_length': 56, 'early_stopping': True, 'num_beams': 4, 'length_penalty': 2.0, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}


In [42]:
tokenizer.save_pretrained("bart-tokenizer")

('bart-tokenizer/tokenizer_config.json',
 'bart-tokenizer/special_tokens_map.json',
 'bart-tokenizer/vocab.json',
 'bart-tokenizer/merges.txt',
 'bart-tokenizer/added_tokens.json',
 'bart-tokenizer/tokenizer.json')

### Load Model From Local Files:

In [43]:
new_model = AutoModelForSeq2SeqLM.from_pretrained("bart-cnn-model")

In [44]:
new_tokenizer = AutoTokenizer.from_pretrained("bart-tokenizer")

#### Test loaded model:

In [45]:
generated_abstracts = generate_abstracts(new_model, new_tokenizer, ds['test'][:3]['article'], ds['test'][:3]['abstract'])
for idx, (article, original_abstract, generated_abstract) in enumerate(zip(ds['test'][3:6]['article'], ds['test'][3:6]['abstract'], generated_abstracts), 1):
    print(f"Example {idx}:")
    print("Article:\n", article)
    print("\nOriginal Abstract:\n", original_abstract)
    print("\nGenerated Abstract:\n", generated_abstract)
    print()

Example 1:
Article:
 CHURCHILL: BLOOD, SWEAT & OIL PAINT (2015) Stream on Acorn TV and Amazon. Winston Churchill is famous for his achievements as a statesman  successfully navigating Britain through World War II chief among them  and for his legendary wit and outsize appetites. This BBC special introduces viewers to a lesser known dimension of his life: his passion for painting. The journalist and television presenter Andrew Marr visits Churchill's studio at Chartwell, his longtime home, and some of the areas that he loved to paint. Marr also shares his own experiences with Churchill's preferred hobby. Interviews with Celia Sandys and Emma Soames, Churchill's descendants, reveal the extent of their grandfather's dedication to his craft. And commentary from David Coombs shines a light on Churchill's relationships with professional painters of his time.  DRUNK PARENTS (2019) Stream on Netflix; rent on Amazon, Google Play, iTunes, Vudu and YouTube. Comedy thrives on misfortune, especiall

### Level 3: The Advanced
#### Generate captions for each news article's image that accurately reflect the content.

Unfortunatly, I didn't have the time to implement the solution for this, but my roadmap was:<br/>
1- Use a pre-trained Convolutional Neural Network (CNN) for image feature extraction (e.g., VGG, ResNet).<br/>
2- Use a pre-trained Language Model (e.g. BERT) for text generation.<br/>
3- Combine CNN and LM:<br/>
    - CNN processes the image and extracts features.<br/>
    - LM generates the summary based on these features.