## TheyDo Technical Interview

### Finetune a model for:
1. summarisation
2. translation
3. sentiment analysis

Based on a dataset from the [LLaMA paper](https://arxiv.org/pdf/2302.13971.pdf).

I couldn't find a dataset in the paper that is clearly aimed at summarisation, translation or sentiment analysis.

### Finetune Summarisation model

Which model should we go for?

<img src="cnn_dailymail_summarisation_landscape.png">

T5 is a text-to-text model, so it is a good fit for both summarisation and translation.
I have used T5 before, and the small variant (T5-small) can be finetuned on my Mac.

How to measure summarisation performance?

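- ROUGE-1: overlap of unigrams (single words) between the system summary and the reference summary.
- ROUGE-2: the same overlap, computed on bigrams (pairs of adjacent words).
- ROUGE-L: score based on the Longest Common Subsequence (LCS) of words between the system and reference summaries.
- ROUGE-Lsum: splits each summary into sentences at newlines, then computes the LCS-based score over those sentences (summary-level ROUGE-L).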
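
To make the ROUGE variants listed above concrete, here is a minimal sketch of ROUGE-N and ROUGE-L as F1 scores. It assumes plain lowercased whitespace tokenisation and no stemming, so its numbers will not match a real ROUGE package exactly; the helper names (`ngrams`, `f1`, `rouge_n`, `lcs_length`, `rouge_l`) and the two example sentences are made up for illustration. The scores here are in the 0 to 1 range, whereas the eval metrics in this document appear to be on a 0 to 100 scale.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, pred_total, ref_total):
    """Turn an overlap count into an F1 score (harmonic mean of P and R)."""
    if overlap == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n=1):
    """ROUGE-N F1: clipped n-gram overlap between prediction and reference."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # Counter & keeps the minimum count
    return f1(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """ROUGE-L F1: LCS length relative to the two summary lengths."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    return f1(lcs_length(pred, ref), len(pred), len(ref))


prediction = "the cat sat on the mat"
reference = "the cat lay on the mat"
print("ROUGE-1:", rouge_n(prediction, reference, 1))
print("ROUGE-2:", rouge_n(prediction, reference, 2))
print("ROUGE-L:", rouge_l(prediction, reference))
```

A single changed word costs one unigram but two bigrams, which is why ROUGE-2 is usually the lowest of the scores.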

---


Train/eval/test split: 5/5/5 examples
  
    ***** eval metrics *****
      epoch                   =        3.0
      eval_gen_len            =       54.4
      eval_loss               =     2.4578
      eval_rouge1             =    20.1191
      eval_rouge2             =     3.2284
      eval_rougeL             =    14.6009
      eval_rougeLsum          =    19.8187
      eval_runtime            = 0:00:04.03
      eval_samples            =          5
      eval_samples_per_second =      1.238
      eval_steps_per_second   =      0.495

Train/eval/test split: 2000/500/500 examples

    ***** eval metrics *****
    "epoch": 3.0,
    "eval_gen_len": 59.248,
    "eval_loss": 2.052272081375122,
    "eval_rouge1": 31.186,
    "eval_rouge2": 11.7692,
    "eval_rougeL": 22.3708,
    "eval_rougeLsum": 28.5695,
    "eval_runtime": 2077.82,
    "eval_samples": 500,
    "eval_samples_per_second": 0.241,
    "eval_steps_per_second": 0.06

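The two eval logs above are easier to compare side by side. The short sketch below copies the ROUGE values from those logs, prints the absolute gain per metric, and recomputes the throughput of the larger run as a sanity check. The variable names are illustrative only. Note that the first run is scored on just 5 examples, so its numbers are very noisy.

```python
# ROUGE values copied from the two eval logs above.
small_run = {"rouge1": 20.1191, "rouge2": 3.2284, "rougeL": 14.6009, "rougeLsum": 19.8187}
large_run = {"rouge1": 31.186, "rouge2": 11.7692, "rougeL": 22.3708, "rougeLsum": 28.5695}

# Absolute gain of the 2000/500/500 run over the 5/5/5 run, per metric.
gains = {name: large_run[name] - small_run[name] for name in small_run}
for name, gain in gains.items():
    print(f"{name:10s} {small_run[name]:8.4f} -> {large_run[name]:8.4f}  (+{gain:.4f})")

# Sanity check: reported throughput should equal eval_samples / eval_runtime.
samples_per_second = 500 / 2077.82
print(f"samples/s: {samples_per_second:.3f}")
```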



  

In [1]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tst-summarization")
model = AutoModelForSeq2SeqLM.from_pretrained("tst-summarization")


In [19]:
example = """WASHINGTON, June 7 (Reuters) - 

Ukrainians abandoned inundated homes on Wednesday as floods crested across the south after the destruction of the dam, with Russia and Ukraine trading blame for the disaster.

The World Bank will support Ukraine by conducting a rapid assessment of damage and needs after Tuesday's destruction of a huge hydroelectric dam on the front lines between Russian and Ukrainian forces, a top bank official said on Wednesday.

Anna Bjerde, the World Bank's managing director for operations, said on Twitter the destruction of the Novo Kakhovka dam had "many very serious consequences for essential service delivery and the broader environment."

Ukrainian Prime Minister Denys Shmyhal, also writing on Twitter, said he spoke with Bjerde about the impact of the dam's collapse, and she assured him the World Bank would carry out a rapid assessment of the damage and needs.

The World Bank will support Ukraine by conducting a rapid assessment of damage and needs after Tuesday's destruction of a huge hydroelectric dam on the front lines between Russian and Ukrainian forces, a top bank official said on Wednesday.

Ukrainians abandoned inundated homes on Wednesday as floods crested across the south after the destruction of the dam, with Russia and Ukraine trading blame for the disaster.

Ukraine said the deluge would leave hundreds of thousands of people without access to drinking water, swamp tens of thousands of hectares of agricultural land and turn at least 500,000 hectares deprived of irrigation into "deserts"."""

In [20]:
example = example.strip()

In [22]:
input_ids = tokenizer(f"summarize: {example}", return_tensors="pt").input_ids
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

World Bank will support Ukraine by conducting a rapid assessment of damage and needs. Ukrainians
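
The summary above stops mid-sentence at "Ukrainians". A likely cause is that `generate` was called without a length argument, in which case it falls back to a short default cap (20 tokens in the Transformers versions I know of); the 20-id tensor printed further down is consistent with that. The sketch below shows one way to lift the cap. The helper `summarize`, the `GENERATION_KWARGS` values and the 512-token input limit are assumptions for illustration, not tuned settings, and the output is not shown because it has not been run here.

```python
# Assumed generation settings; the values are illustrative, not tuned.
GENERATION_KWARGS = {
    "max_new_tokens": 80,    # allow a longer summary than the default cap
    "num_beams": 4,          # beam search instead of greedy decoding
    "early_stopping": True,  # stop once all beams have finished
}


def summarize(text, model_dir="tst-summarization", **overrides):
    """Summarise `text` with the finetuned checkpoint stored in `model_dir`."""
    # Imported lazily so the settings above can be inspected without
    # loading the library or a model.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

    # Truncate long articles to an assumed 512-token input limit.
    inputs = tokenizer(
        f"summarize: {text.strip()}",
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    # Passing **inputs forwards the attention mask along with the input ids.
    output_ids = model.generate(**inputs, **{**GENERATION_KWARGS, **overrides})
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With the checkpoint directory available, `print(summarize(example))` would replace the three lines in the cell above.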


In [23]:
task_prefix = "translate English to German: "

sentences = ["The house is wonderful.", "I like to work in NYC."]

In [28]:
inputs = tokenizer([f"{task_prefix}{sentence}" for sentence in sentences], return_tensors="pt", padding=True)
outputs = model.generate(**inputs)  # also passes the attention mask, since the batch is padded
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(tokenizer.decode(outputs[1], skip_special_tokens=True))

Das Haus ist wunderbar.
Ich arbeite gerne in NYC.


In [12]:
outputs[0]

tensor([    0,  1150,  1925,    56,   380, 11897,    57, 13646,  3607,  4193,
           13,  1783,    11,   523,     3,     5, 22849,     7, 13876,    16])

In [30]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("test-sent-bert")
model = AutoModelForSequenceClassification.from_pretrained("test-sent-bert")

In [41]:
text = "Pirates of the Caribean was a horrible movie."
inputs = tokenizer(text, return_tensors="pt")

In [42]:
import torch

with torch.no_grad():

    logits = model(**inputs).logits

In [44]:
model.config.id2label

{0: 0, 1: 1}

In [45]:
predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]

0
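
The prediction above is a bare class id because `id2label` maps `0` to `0` and `1` to `1`. The sketch below shows how logits could be turned into a readable label with a confidence score. It is pure Python so it runs without the model; the logit values are hypothetical, and the mapping of `0` to "negative" and `1` to "positive" is an assumption that should be checked against the label encoding of the training data.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed label names; verify against the training data before relying on them.
ID2LABEL = {0: "negative", 1: "positive"}

# Hypothetical logits for one input; in the notebook this would be logits[0].tolist().
example_logits = [2.0, -1.0]

probs = softmax(example_logits)
predicted_id = max(range(len(probs)), key=probs.__getitem__)
print(f"{ID2LABEL[predicted_id]} (p={probs[predicted_id]:.3f})")
```

If the assumed mapping is right, it can also be stored in the model config by passing `id2label` and `label2id` when loading the model for finetuning, so that `model.config.id2label` returns names instead of ids.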