## TheyDo Technical Interview

### Finetune a model for:
1. summarisation
2. translation
3. sentiment analysis

Based on a dataset from the [LLaMA paper](https://arxiv.org/pdf/2302.13971.pdf).

I couldn't find a clear summarisation, translation or sentiment analysis dataset in the paper, so the comparisons below use CNN/DailyMail, IMDB (`aclImdb`) and WMT16 `de-en`.

### Finetune Summarisation model

Which model should we go for?

<img src="cnn_dailymail_summarisation_landscape.png">

T5 is a strong model for any text-to-text task, which makes it a good fit for both summarisation and translation.
I have used T5 before, and the small T5 variant can be finetuned on my Mac.

How to measure summarisation performance?

- **ROUGE-1**: overlap of unigrams (single words) between the system and reference summaries.
- **ROUGE-2**: the same overlap, computed on bigrams.
- **ROUGE-L**: score based on the Longest Common Subsequence (LCS), computed over the whole summary.
- **ROUGE-Lsum**: treats newlines as sentence boundaries and computes the LCS-based score sentence by sentence.
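
To make the definitions above concrete, here is a minimal sketch of ROUGE-N and ROUGE-L as F1 scores over whitespace tokens. It is an illustration only: the helper names (`ngrams`, `rouge_n`, `lcs_length`, `rouge_l`) are made up for this example, and the scores reported in this document come from a library implementation whose tokenisation differs, so the numbers will not match exactly.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction, reference, n=1):
    """F1 over clipped n-gram overlap between two whitespace-tokenised strings."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # clipped counts of shared n-grams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """F1 based on the LCS of the two token sequences."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


prediction = "the cat sat on the mat"
reference = "the cat lay on the mat"
print(rouge_n(prediction, reference, 1), rouge_n(prediction, reference, 2), rouge_l(prediction, reference))
```

In this toy pair, five of six unigrams and three of five bigrams are shared, which is why ROUGE-2 drops faster than ROUGE-1 when a single word changes.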

---


Train/eval/test split: 5/5/5 examples
  
***** eval metrics *****
  epoch                   =        3.0
  eval_gen_len            =       54.4
  eval_loss               =     2.4578
  eval_rouge1             =    20.1191
  eval_rouge2             =     3.2284
  eval_rougeL             =    14.6009
  eval_rougeLsum          =    19.8187
  eval_runtime            = 0:00:04.03
  eval_samples            =          5
  eval_samples_per_second =      1.238
  eval_steps_per_second   =      0.495

Train/eval/test split: 2000/500/500 examples

***** eval metrics *****
"epoch": 3.0,
"eval_gen_len": 59.248,
"eval_loss": 2.052272081375122,
"eval_rouge1": 31.186,
"eval_rouge2": 11.7692,
"eval_rougeL": 22.3708,
"eval_rougeLsum": 28.5695,
"eval_runtime": 2077.82,
"eval_samples": 500,
"eval_samples_per_second": 0.241,
"eval_steps_per_second": 0.06




  

In [1]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("finetuned_summarisation")
model = AutoModelForSeq2SeqLM.from_pretrained("finetuned_summarisation")




In [2]:
example = """WASHINGTON, June 7 (Reuters) - 

Ukrainians abandoned inundated homes on Wednesday as floods crested across the south after the destruction of the dam, with Russia and Ukraine trading blame for the disaster.

The World Bank will support Ukraine by conducting a rapid assessment of damage and needs after Tuesday's destruction of a huge hydroelectric dam on the front lines between Russian and Ukrainian forces, a top bank official said on Wednesday.

Anna Bjerde, the World Bank's managing director for operations, said on Twitter the destruction of the Novo Kakhovka dam had "many very serious consequences for essential service delivery and the broader environment."

Ukrainian Prime Minister Denys Shmyhal, also writing on Twitter, said he spoke with Bjerde about the impact of the dam's collapse, and she assured him the World Bank would carry out a rapid assessment of the damage and needs.

The World Bank will support Ukraine by conducting a rapid assessment of damage and needs after Tuesday's destruction of a huge hydroelectric dam on the front lines between Russian and Ukrainian forces, a top bank official said on Wednesday.

Ukrainians abandoned inundated homes on Wednesday as floods crested across the south after the destruction of the dam, with Russia and Ukraine trading blame for the disaster.

Ukraine said the deluge would leave hundreds of thousands of people without access to drinking water, swamp tens of thousands of hectares of agricultural land and turn at least 500,000 hectares deprived of irrigation into "deserts"."""

In [3]:
example = example.strip()

In [4]:
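# T5 expects a task prefix; "summarize: " selects the summarisation task
input_ids = tokenizer(f"summarize: {example}", return_tensors="pt").input_ids
# generate() is called without a length argument, so the output is capped at the
# default generation length, which is why the summary below is cut off mid-sentence
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))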

World Bank will support Ukraine by conducting a rapid assessment of damage and needs. Ukrainians




Russian and Ukrainian forces on the front lines of conflict have been blamed for the destruction of the Novo Kakhovka dam on Tuesday, causing flooding in Ukraine's southern region and displacing hundreds of thousands of people. The World Bank is providing support by conducting a rapid assessment of damage and

In [7]:
# TODO load saved models and test them

In [10]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("finetuned_de-en_translation")
model = AutoModelForSeq2SeqLM.from_pretrained("finetuned_de-en_translation")


# task_prefix = "translate German to English: "
task_prefix = ""

sentences = ["Das Haus ist wunderbar.", "Ich arbeite gerne in NYC."]

In [11]:
inputs = tokenizer([f"{task_prefix}{sentence}" for sentence in sentences], return_tensors="pt", padding=True)
outputs = model.generate(inputs["input_ids"])
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(tokenizer.decode(outputs[1], skip_special_tokens=True))

The house is wonderful.
I like working in NYC.


In [12]:
outputs[0]

tensor([    0,  1150,  1925,    56,   380, 11897,    57, 13646,  3607,  4193,
           13,  1783,    11,   523,     3,     5, 22849,     7, 13876,    16])

In [32]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("finetuned-imdb-bert")
model = AutoModelForSequenceClassification.from_pretrained("finetuned-imdb-bert")

In [33]:
text = "Pirates of the Caribean was a horrible movie."


In [34]:
import torch

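def sent_pred(text):
    """Print the predicted label and class probabilities for `text`."""
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits

    predicted_class_id = logits.argmax().item()
    predicted_label = model.config.id2label[predicted_class_id]
    # Softmax over the class dimension; an explicit dim avoids PyTorch's implicit-dimension warning
    probs = torch.nn.functional.softmax(logits, dim=-1)
    print(f"Predicted {predicted_label} {probs}")

sent_pred(text)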

Predicted 0 tensor([[0.9844, 0.0156]])




In [43]:
sent_pred("Inception was a great movie")

Predicted 1 tensor([[0.1129, 0.8871]])




In [41]:
!ls data/aclImdb/train/neg | wc -l

   11500


## Compare OpenAI and finetuned models:

In [1]:
import pandas as pd

cnn_test = pd.read_csv("data/cnn_dailymail/test.csv")


In [8]:
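# Keep the first 100 articles; .copy() avoids pandas' SettingWithCopyWarning when columns are added later
cnn_test = cnn_test.iloc[:100].copy()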

In [9]:
cnn_test

Unnamed: 0,id,article,highlights
0,92c514c913c0bdfe25341af9fd72b29db544099b,Ever noticed how plane seats appear to be gett...,Experts question if packed out planes are put...
1,2003841c7dc0e7c5b1a248f9cd536d727f27a45a,A drunk teenage boy had to be rescued by secur...,Drunk teenage boy climbed into lion enclosure ...
2,91b7d2311527f5c2b63a65ca98d21d9c92485149,Dougie Freedman is on the verge of agreeing a ...,Nottingham Forest are close to extending Dougi...
3,caabf9cbdf96eb1410295a673e953d304391bfbb,Liverpool target Neto is also wanted by PSG an...,Fiorentina goalkeeper Neto has been linked wit...
4,3da746a7d9afcaa659088c8366ef6347fe6b53ea,Bruce Jenner will break his silence in a two-h...,"Tell-all interview with the reality TV star, 6..."
...,...,...,...
95,64ee7c9eb9f1efbb7da0ce80498434c623615b84,As Zlatan Ibrahimovic famously believes the Wo...,Zlatan Ibrahimovic will line up against former...
96,5cf4682cd03238d5867027ce9492b626cd1ed011,"Jameela spent £3,000 on having all her amalgam...","Jameela Jamil, 29, is convinced dental work tr..."
97,3815d19af18ff22be6ad6095722d7367bb7271af,A paramedic who pretended he was gay to get cl...,"Christopher Bridger, 25, attacked three women ..."
98,fb207604ffa7e8371c622840445825db8993d4d2,"Paris Saint-Germain face Nice on Saturday, hop...",Paris Saint-Germain captain Thiago Silva suffe...


In [12]:
sum(cnn_test.article.apply(len))*0.02

7311.92

In [10]:
from api.infer_openai import openai_summary

In [11]:
from tqdm import tqdm

openai_summaries = []
articles = list(cnn_test.article)
for article in tqdm(articles):
    openai_summaries.append(openai_summary(article))

100%|█████████████████████████████████████████| 100/100 [08:41<00:00,  5.22s/it]


In [14]:
import json
with open('openai_summaries.json', 'w') as f:
    json.dump(openai_summaries, f)

In [15]:
cnn_test["openai"] = openai_summaries



In [19]:

import evaluate
rouge = evaluate.load('rouge')



In [23]:
openai_results = rouge.compute(predictions=cnn_test["openai"], references=cnn_test["highlights"])

In [26]:
from api.infer import summarise

from tqdm import tqdm

fine_summaries = []
articles = list(cnn_test.article)
for article in tqdm(articles):
    fine_summaries.append(summarise(article))

  4%|█▋                                         | 4/100 [00:00<00:22,  4.23it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (1086 > 512). Running this sequence through the model will result in indexing errors
100%|█████████████████████████████████████████| 100/100 [00:34<00:00,  2.90it/s]


In [27]:
cnn_test["finetune"] = fine_summaries



In [28]:
fine_results = rouge.compute(predictions=cnn_test["finetune"], references=cnn_test["highlights"])

In [29]:
print(f"""OpenAI {openai_results}
Finetune {fine_results}
""")

OpenAI {'rouge1': 0.3711652739208089, 'rouge2': 0.12991879419704758, 'rougeL': 0.23570574188914356, 'rougeLsum': 0.304950388468087}
Finetune {'rouge1': 0.23731288238512083, 'rouge2': 0.11321663348202815, 'rougeL': 0.20167088618817003, 'rougeLsum': 0.2204555556945117}



OpenAI scores much higher on ROUGE-1 (0.37 vs 0.24) and ROUGE-Lsum (0.30 vs 0.22), which indicates more word overlap with the reference highlights and better agreement in sentence-level word order. The gap on ROUGE-2 is small (0.13 vs 0.11).

**OpenAI wins for summarisation, at a cost of $1.53.**

---

In [36]:
from pathlib import Path


def read_reviews(directory: str, label: int, limit: int = 50):
    """Read up to `limit` review files from `directory`, pairing each with `label`."""
    files = [p for p in Path(directory).iterdir() if p.is_file()][:limit]
    return [p.read_text() for p in files], [label] * len(files)


# 50 negative (label 0) and 50 positive (label 1) reviews from the IMDB test split
neg_texts, neg_labels = read_reviews("data/aclImdb/test/neg/", 0)
pos_texts, pos_labels = read_reviews("data/aclImdb/test/pos/", 1)
texts = neg_texts + pos_texts
labels = neg_labels + pos_labels

In [39]:
imdb_test = pd.DataFrame({"texts": texts, "labels": labels})

In [41]:
from api.infer_openai import openai_sentiment
from tqdm import tqdm

openai_sentiments = []
imdb_reviews = list(imdb_test.texts)
for review in tqdm(imdb_reviews):
    openai_sentiments.append(openai_sentiment(review))

import json
with open('openai_sentiment.json', 'w') as f:
    json.dump(openai_sentiments, f)

imdb_test["openai"] = openai_sentiments

100%|█████████████████████████████████████████| 100/100 [01:48<00:00,  1.09s/it]


In [49]:
from transformers import AutoModelForSequenceClassification, AutoModelForSeq2SeqLM, AutoTokenizer
sentiment_tokenizer = AutoTokenizer.from_pretrained("finetuned-imdb-bert", truncation=True)
sentiment_model = AutoModelForSequenceClassification.from_pretrained("finetuned-imdb-bert")

In [50]:
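def sentiment(text: str, with_prob=False):
    """Classify `text` with the finetuned BERT model; optionally return class probabilities."""
    # Truncate at the token level so long reviews fit BERT's 512-token limit
    inputs = sentiment_tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

    with torch.no_grad():
        logits = sentiment_model(**inputs).logits

    predicted_class_id = logits.argmax().item()
    predicted_label = sentiment_model.config.id2label[predicted_class_id]
    # Softmax over the class dimension (an explicit dim avoids the implicit-dimension warning)
    probs = torch.nn.functional.softmax(logits, dim=-1)

    if with_prob:
        return {"label": predicted_label, "probability": probs}
    return predicted_label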

In [52]:
# from api.infer import sentiment
from tqdm import tqdm
import torch

fine_sentiment = []
imdb_reviews = list(imdb_test.texts)
for review in tqdm(imdb_reviews):
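    # review[:508] keeps the first 508 characters (not tokens), which stays under BERT's 512-token limit
    fine_sentiment.append(sentiment(review[:508]))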

100%|█████████████████████████████████████████| 100/100 [00:06<00:00, 16.56it/s]


In [56]:
imdb_test["fine"] = fine_sentiment

OpenAI is not very good at following the output instructions: the returned labels are a bit noisy.

In [59]:
imdb_test.openai.iloc[:10]

0                   \n\n1
1                   \n\n0
2                   \n\n0
3                   \n\n0
4                   \n\n0
5                   \n\n0
6                   \n\n0
7                 \n\n{0}
8                   \n\n0
9     <br /><br />\n**0**
Name: openai, dtype: object

In [67]:
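def clean_openai_imdb(label: str) -> int:
    """Map a raw OpenAI reply to 0 (negative), 1 (positive) or 3 (unparseable)."""
    if "0" in label:
        return 0
    elif "1" in label:
        return 1
    else:
        return 3  # sentinel outside the real label set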

In [70]:
imdb_test["openai"] = imdb_test.openai.apply(clean_openai_imdb)
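
The substring check above is enough for the replies shown, but it would also read a reply such as "10" as label 0. As a hedged alternative, the sketch below only accepts a standalone `0` or `1` and falls back to the same sentinel value otherwise. The names `parse_binary_label` and `UNPARSEABLE` are hypothetical and are not part of the original notebook.

```python
import re

UNPARSEABLE = 3  # same fallback value used by clean_openai_imdb


def parse_binary_label(raw: str) -> int:
    """Return 0 or 1 if every standalone 0/1 digit in the reply agrees, else UNPARSEABLE."""
    text = re.sub(r"<[^>]+>", " ", raw)  # drop HTML remnants such as <br />
    digits = re.findall(r"(?<!\d)[01](?!\d)", text)  # 0 or 1 not embedded in a longer number
    if len(set(digits)) == 1:
        return int(digits[0])
    return UNPARSEABLE


for reply in ["\n\n1", "\n\n{0}", "<br /><br />\n**0**", "10", "0 or 1"]:
    print(repr(reply), "->", parse_binary_label(reply))
```

Whether the stricter rule is worth it depends on how often the model drifts from the requested format; on the ten replies shown above both approaches give the same labels.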

In [62]:
imdb_test.labels.iloc[0] == 0

True

In [73]:
from sklearn.metrics import classification_report

target_names = ['negative', 'positive']
print("OpenAI SENTIMENT ANALYSIS")
print(classification_report(imdb_test["labels"], imdb_test["openai"], target_names=target_names, labels=[0,1]))

OpenAI SENTIMENT ANALYSIS
              precision    recall  f1-score   support

    negative       0.96      0.88      0.92        50
    positive       0.91      0.96      0.93        50

   micro avg       0.93      0.92      0.92       100
   macro avg       0.93      0.92      0.92       100
weighted avg       0.93      0.92      0.92       100



In [74]:
print("BERT SENTIMENT ANALYSIS")
print(classification_report(imdb_test["labels"], imdb_test["fine"], target_names=target_names, labels=[0,1]))

BERT SENTIMENT ANALYSIS
              precision    recall  f1-score   support

    negative       0.72      0.88      0.79        50
    positive       0.85      0.66      0.74        50

    accuracy                           0.77       100
   macro avg       0.78      0.77      0.77       100
weighted avg       0.78      0.77      0.77       100
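
The two reports can be compared on a single number by recovering the count of correct predictions from recall and support. The sketch below does that with the values printed above; `accuracy_from_report` is a hypothetical helper written for this note. The OpenAI report shows `micro avg` instead of `accuracy`, which scikit-learn does when the `labels` argument covers only a subset of the labels present. That is consistent with one reply having been mapped to the fallback label 3 (92 correct out of 99 parsed predictions is about 0.93 precision), though this is an inference from the printed numbers rather than something checked in the notebook.

```python
def accuracy_from_report(per_class):
    """Recover (correct, total) from per-class (recall, support) pairs."""
    correct = sum(round(recall * support) for recall, support in per_class.values())
    total = sum(support for _, support in per_class.values())
    return correct, total


# (recall, support) per class, copied from the two reports above
openai_report = {"negative": (0.88, 50), "positive": (0.96, 50)}
bert_report = {"negative": (0.88, 50), "positive": (0.66, 50)}

for name, report in [("OpenAI", openai_report), ("BERT", bert_report)]:
    correct, total = accuracy_from_report(report)
    print(f"{name}: {correct}/{total} correct, accuracy {correct / total:.2f}")
```

Both models find the same share of negative reviews (recall 0.88); the gap comes from positive reviews, where the finetuned BERT misses about a third.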



**OpenAI wins for sentiment analysis with superior precision, at a cost of $0.29 for 100 reviews.**


---

In [80]:
from datasets import load_dataset

dataset = load_dataset("wmt16", 'de-en')



In [88]:
# Peek at the first German sentence in the test split
dataset["test"].data["translation"][0][0]

<pyarrow.StringScalar: 'Obama empfängt Netanyahu'>

In [95]:
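german_sents = []
english_sents = []
# Take the first 100 test pairs; e[0] is the German text, e[1] the English reference
for e in dataset["test"].data["translation"][:100]:
    # .as_py() converts the pyarrow scalar to a plain Python string
    german_sents.append(e[0].as_py())
    english_sents.append(e[1].as_py())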
    

In [98]:
translation_df = pd.DataFrame({"de": german_sents, "en": english_sents})

In [99]:
from api.infer_openai import openai_translate_german_to_english
from tqdm import tqdm

openai_translations = []
german_sents = list(translation_df.de)
for german_sent in tqdm(german_sents):
    openai_translations.append(openai_translate_german_to_english(german_sent))

import json
with open('openai_translations.json', 'w') as f:
    json.dump(openai_translations, f)

translation_df["openai"] = openai_translations

100%|█████████████████████████████████████████| 100/100 [03:49<00:00,  2.29s/it]


In [123]:
tokenizer = AutoTokenizer.from_pretrained("finetuned_de-en_translation")
translation_model = AutoModelForSeq2SeqLM.from_pretrained("finetuned_de-en_translation")

def translate_german_to_english(text: str, task_prefix: str = "") -> str:
    """Translate one German sentence with the finetuned model."""
    inputs = tokenizer(f"{task_prefix}{text}", return_tensors="pt")
    outputs = translation_model.generate(inputs["input_ids"])
    return tokenizer.decode(outputs[0], skip_special_tokens=True)



In [124]:
translate_german_to_english("Obama empfängt Netanyahu")



'Obama receives Netanyahu'

In [125]:
from tqdm import tqdm

fine_translations = []
german_sents = list(translation_df.de)
for german_sent in tqdm(german_sents):
    fine_translations.append(translate_german_to_english(german_sent))

translation_df["fine"] = fine_translations

100%|█████████████████████████████████████████| 100/100 [00:59<00:00,  1.69it/s]


In [131]:
with open("openai_translations.json") as fh:
    openai_translations = json.load(fh)

In [133]:
translation_df["openai"] = translation_df["openai"].str.strip()

In [136]:
bleu = evaluate.load("bleu")



In [143]:
openai_results = bleu.compute(predictions=translation_df["openai"], references=[[str(e)] for e in translation_df["en"]])
fine_results = bleu.compute(predictions=translation_df["fine"], references=[[str(e)] for e in translation_df["en"]])
print(f"""OPENAI {openai_results}
FINETUNE RESULTS
{fine_results}""")

OPENAI {'bleu': 0.5223381400841746, 'precisions': [0.7807139545665276, 0.5785123966942148, 0.45477772100153296, 0.362412493268713], 'brevity_penalty': 1.0, 'length_ratio': 1.0160150730098916, 'translation_length': 2157, 'reference_length': 2123}
FINETUNE RESULTS
{'bleu': 0.45434115198395, 'precisions': [0.7358405074762121, 0.5140009492168961, 0.38365719980069757, 0.29365495542737285], 'brevity_penalty': 1.0, 'length_ratio': 1.0395666509656147, 'translation_length': 2207, 'reference_length': 2123}
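
The `bleu` value in each result can be rebuilt from the other fields: it is the geometric mean of the four n-gram precisions multiplied by the brevity penalty. The sketch below recomputes both scores from the printed numbers. `bleu_from_stats` is a hypothetical helper for illustration, not a function of the `evaluate` library.

```python
import math


def bleu_from_stats(precisions, translation_length, reference_length):
    """Combine n-gram precisions and lengths into a corpus BLEU score."""
    if min(precisions) <= 0:
        return 0.0
    # Geometric mean of the n-gram precisions (uniform weights)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    # Brevity penalty: only penalise output shorter than the reference
    ratio = translation_length / reference_length
    brevity_penalty = 1.0 if ratio > 1.0 else math.exp(1 - 1 / ratio)
    return geo_mean * brevity_penalty


openai_bleu = bleu_from_stats(
    [0.7807139545665276, 0.5785123966942148, 0.45477772100153296, 0.362412493268713],
    translation_length=2157, reference_length=2123)
finetune_bleu = bleu_from_stats(
    [0.7358405074762121, 0.5140009492168961, 0.38365719980069757, 0.29365495542737285],
    translation_length=2207, reference_length=2123)
print(f"OpenAI {openai_bleu:.4f}, finetuned {finetune_bleu:.4f}")
```

Both systems produce output slightly longer than the references, so the brevity penalty is 1.0 in both cases and the difference in BLEU comes entirely from the n-gram precisions, with the gap widening for longer n-grams.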


**OpenAI wins for German-to-English translation, at a cost of $0.20 for 100 sentences.**