![image](car.jpeg)

**Car-ing is sharing**, an auto dealership company for car sales and rental, is taking their services to the next level thanks to **Large Language Models (LLMs)**.

As their newly recruited AI and NLP developer, you've been asked to prototype a chatbot app with multiple functionalities that not only assist customers but also provide support to human agents in the company.

The solution should receive textual prompts and use a variety of pre-trained Hugging Face LLMs to respond to a series of tasks, e.g. classifying the sentiment in a car’s text review, answering a customer question, summarizing or translating text, etc.


In [187]:
# Import necessary packages
import pandas as pd
import torch

from transformers import logging
logging.set_verbosity(logging.WARNING)

## Importing Data & Necessary Packages

In [190]:
# pandas, torch, and the transformers logging setup are already imported in the first cell
import re

import evaluate
from transformers import (
    AutoModelForQuestionAnswering,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    pipeline,
)

In [189]:
car_reviews = pd.read_csv('path/car_reviews.csv', sep=";")
car_reviews

| | Review | Class |
|---|---|---|
| 0 | I am very satisfied with my 2014 Nissan NV SL.... | POSITIVE |
| 1 | The car is fine. It's a bit loud and not very ... | NEGATIVE |
| 2 | My first foreign car. Love it, I would buy ano... | POSITIVE |
| 3 | I've come across numerous reviews praising the... | NEGATIVE |
| 4 | I've been dreaming of owning an SUV for quite ... | POSITIVE |


# Step 1: Classifying Car Reviews

Use a pre-trained LLM to classify the sentiment of the five car reviews in the `car_reviews.csv` dataset, and evaluate the accuracy and F1 score of the predictions.

## Sentiment Analysis

In [191]:
# Load a sentiment analysis pipeline (this model returns "POSITIVE" or "NEGATIVE")
sentiment_pipeline = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


In [192]:
# Process all reviews in the dataset
predicted_labels = sentiment_pipeline(car_reviews["Review"].tolist())
predicted_labels

[{'label': 'POSITIVE', 'score': 0.929397702217102},
 {'label': 'POSITIVE', 'score': 0.8654273152351379},
 {'label': 'POSITIVE', 'score': 0.9994640946388245},
 {'label': 'NEGATIVE', 'score': 0.9935314059257507},
 {'label': 'POSITIVE', 'score': 0.9986565113067627}]

In [193]:
# Mapping predicted labels to binary values: 1 for POSITIVE, 0 for NEGATIVE
predictions = [1 if pred["label"].upper() == "POSITIVE" else 0 for pred in predicted_labels]
predictions

[1, 1, 1, 0, 1]

In [194]:
# Mapping the reference classes in the dataset to binary as well
ref_labels = [1 if label.upper() == "POSITIVE" else 0 for label in car_reviews["Class"].tolist()]
ref_labels

[1, 0, 1, 0, 1]

## Model Evaluation Metrics

In [195]:
# Accuracy of the model
accuracy_metric = evaluate.load("accuracy")
accuracy_result = accuracy_metric.compute(predictions=predictions, references=ref_labels)["accuracy"]
print("Accuracy:", accuracy_result)

# F1 score of the model
f1_metric = evaluate.load("f1")
f1_result = f1_metric.compute(predictions=predictions, references=ref_labels, average="binary")["f1"]
print("\nF1 Score:", f1_result)

Accuracy: 0.8

F1 Score: 0.8571428571428571
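
As a sanity check on the reported metrics, the sketch below recomputes accuracy and F1 in plain Python from the `predictions` and `ref_labels` lists shown above. The helper name `binary_metrics` is hypothetical and not part of `evaluate`; it only illustrates how the two numbers follow from the confusion counts.

```python
def binary_metrics(predictions, references, positive=1):
    """Return (accuracy, f1) for binary labels; hypothetical helper for illustration."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == positive and r == positive)
    fp = sum(1 for p, r in pairs if p == positive and r != positive)
    fn = sum(1 for p, r in pairs if p != positive and r == positive)
    correct = sum(1 for p, r in pairs if p == r)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, f1


# Values copied from the notebook outputs above
predictions = [1, 1, 1, 0, 1]
ref_labels = [1, 0, 1, 0, 1]

accuracy, f1 = binary_metrics(predictions, ref_labels)
print("Accuracy:", accuracy)
print("F1 score:", round(f1, 4))
```

The only error is the second review, which is predicted positive but labelled negative. That single false positive gives a precision of 3/4 and a recall of 3/3, hence F1 = 6/7 ≈ 0.857.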


# Step 2: Translate a Car Review

The company has recently been attracting customers from Spain. Extract the _first two sentences_ of the first review in the dataset and pass them to an English-to-Spanish translation LLM. Calculate the BLEU score to assess translation quality, using the content of `reference_translations.txt` as references.

In [196]:
# Extracting the first two sentences from the first review
first_two_sentences = " ".join(re.split(r'(?<=[.!?])\s+', car_reviews.Review[0])[:2])
first_two_sentences

'I am very satisfied with my 2014 Nissan NV SL. I use this van for my business deliveries and personal use.'
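
The regular expression above splits on whitespace that follows `.`, `!`, or `?`, and the lookbehind keeps the punctuation attached to each sentence. The sketch below wraps the same pattern in a hypothetical helper, `first_n_sentences`. The first two sentences of the sample text come from the output above; the third is made up for illustration.

```python
import re


def first_n_sentences(text, n=2):
    """Hypothetical helper: keep the first n sentences of text."""
    # Split on whitespace that follows ., ! or ?; the lookbehind keeps the
    # punctuation attached to the sentence it ends.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:n])


# The first two sentences come from the notebook output above;
# the third is invented for this example.
sample = (
    "I am very satisfied with my 2014 Nissan NV SL. "
    "I use this van for my business deliveries and personal use. "
    "This sentence is illustrative only."
)
excerpt = first_n_sentences(sample)
print(excerpt)
```

This pattern treats every period followed by whitespace as a sentence boundary, so abbreviations such as "Dr." would be split as well. That is acceptable for this review but worth keeping in mind for other text.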

In [197]:
# Load the model and tokenizer for English-to-Spanish translation
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-es")

In [198]:
# Tokenize the first two sentences
new_input = tokenizer(first_two_sentences, return_tensors="pt")

# Generate the translation under torch.no_grad() so no gradients are computed
with torch.no_grad():
    outputs = model.generate(
        **new_input,  # pass input_ids together with the attention mask
        max_length=128,  # upper bound on the length of the generated sequence
        eos_token_id=tokenizer.eos_token_id,  # stop at the end-of-sequence token
        early_stopping=True  # optional: only takes effect with beam search
    )
    
translated_review = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Translated review:", translated_review)

Translated review: Estoy muy satisfecho con mi Nissan NV SL 2014. Uso esta camioneta para mis entregas de negocios y uso personal.


In [199]:
# Read the reference translation from file
with open("data/reference_translations.txt", "r", encoding="utf-8") as f:
    ref_translation = f.read().splitlines()
    
ref_translation

['Estoy muy satisfecho con mi Nissan NV SL 2014. Utilizo esta camioneta para mis entregas comerciales y uso personal.',
 'Estoy muy satisfecho con mi Nissan NV SL 2014. Uso esta furgoneta para mis entregas comerciales y uso personal.']

In [200]:
# Calculate the BLEU score against the first reference translation only;
# pass references=[ref_translation] to score against both references
bleu = evaluate.load("bleu")
bleu_score = bleu.compute(predictions=[translated_review], references=[ref_translation[0]])
print("BLEU score", bleu_score["bleu"])

BLEU score 0.6888074582865503
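
To make the score above easier to interpret, here is a minimal pure-Python sketch of sentence-level BLEU with clipped n-gram precision and a brevity penalty. The helper names `tokenize`, `ngram_counts`, and `simple_bleu` are hypothetical, and the regex tokenizer is a simplifying assumption rather than the tokenizer used by `evaluate`, so results may differ on other inputs. The strings are copied from the outputs above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplified tokenizer (an assumption): words and single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(prediction, references, max_order=4):
    """Illustrative sentence-level BLEU; not the evaluate implementation."""
    pred = tokenize(prediction)
    refs = [tokenize(r) for r in references]

    log_precisions = []
    for n in range(1, max_order + 1):
        pred_counts = ngram_counts(pred, n)
        # For each n-gram, allow at most the highest count seen in any reference
        max_ref = Counter()
        for ref in refs:
            max_ref |= ngram_counts(ref, n)
        matches = sum((pred_counts & max_ref).values())
        total = sum(pred_counts.values())
        if matches == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))

    # Brevity penalty against the shortest reference length
    ref_len = min(len(r) for r in refs)
    bp = 1.0 if len(pred) > ref_len else math.exp(1 - ref_len / len(pred))
    return bp * math.exp(sum(log_precisions) / max_order)


translated_review = (
    "Estoy muy satisfecho con mi Nissan NV SL 2014. "
    "Uso esta camioneta para mis entregas de negocios y uso personal."
)
ref_translation = [
    "Estoy muy satisfecho con mi Nissan NV SL 2014. "
    "Utilizo esta camioneta para mis entregas comerciales y uso personal.",
    "Estoy muy satisfecho con mi Nissan NV SL 2014. "
    "Uso esta furgoneta para mis entregas comerciales y uso personal.",
]

single_ref = simple_bleu(translated_review, ref_translation[:1])
both_refs = simple_bleu(translated_review, ref_translation)
print("BLEU, first reference only:", single_ref)
print("BLEU, both references:", both_refs)
```

Scoring against both references gives a higher value than scoring against the first one alone, because n-grams containing `Uso` only match the second reference.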


# Step 3: Ask a question about a car review

The 2nd review in the dataset emphasizes brand aspects. Load an extractive QA LLM such as `"deepset/minilm-uncased-squad2"`, ask it the question `"What did he like about the brand?"`, and obtain an answer.

## Method 1: Using the question-answering pipeline

In [201]:
# Load an extractive QA pipeline with a model such as deepset/minilm-uncased-squad2
qa_pipeline = pipeline("question-answering", model="deepset/minilm-uncased-squad2")

# Define the question and use the 2nd review (index 1) as context
question = "What did he like about the brand?"
context = car_reviews["Review"].iloc[1]

qa_output = qa_pipeline(question=question, context=context)
answer = qa_output["answer"]

print("\nStep 3: Extractive QA")
print("Question:", question)
print("Context (2nd review):", context)
print("Answer:", answer)

Some weights of the model checkpoint at deepset/minilm-uncased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu



Step 3: Extractive QA
Question: What did he like about the brand?
Context (2nd review): The car is fine. It's a bit loud and not very powerful. On one hand, compared to its peers, the interior is well-built. The transmission failed a few years ago, and the dealer replaced it under warranty with no issues. Now, about 60k miles later, the transmission is failing again. It sounds like a truck, and the issues are well-documented. The dealer tells me it is normal, refusing to do anything to resolve the issue. After owning the car for 4 years, there are many other vehicles I would purchase over this one. Initially, I really liked what the brand is about: ride quality, reliability, etc. But I will not purchase another one. Despite these concerns, I must say, the level of comfort in the car has always been satisfactory, but not worth the rest of issues found.
Answer: ride quality, reliability


## Method 2: Using the tokenizer and model directly

In [202]:
# Define model name and load the tokenizer and model using Auto classes
model_name = "deepset/minilm-uncased-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Define the question and extract the context (2nd review from the dataframe)
question = "What did he like about the brand?"
context = car_reviews.Review.iloc[1]

# Tokenize the input: the tokenizer will handle concatenating the question and context
inputs = tokenizer(question, context, return_tensors="pt")

# Perform inference
with torch.no_grad():
    outputs = model(**inputs)
    
# The model outputs two sets of logits for start and end positions.
start_logits = outputs.start_logits
end_logits = outputs.end_logits

# Identify the most likely start and end positions for the answer
start_index = torch.argmax(start_logits)
end_index = torch.argmax(end_logits) + 1

# Extract the tokens corresponding to the answer span
answer_tokens = inputs["input_ids"][0][start_index:end_index]
raw_answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)

# Post-processing: Clean up the answer text (e.g., strip whitespace)
answer = raw_answer.strip()

print("Question:", question)
print("Context:", context)
print("Extracted Answer:", answer)



Question: What did he like about the brand?
Context: The car is fine. It's a bit loud and not very powerful. On one hand, compared to its peers, the interior is well-built. The transmission failed a few years ago, and the dealer replaced it under warranty with no issues. Now, about 60k miles later, the transmission is failing again. It sounds like a truck, and the issues are well-documented. The dealer tells me it is normal, refusing to do anything to resolve the issue. After owning the car for 4 years, there are many other vehicles I would purchase over this one. Initially, I really liked what the brand is about: ride quality, reliability, etc. But I will not purchase another one. Despite these concerns, I must say, the level of comfort in the car has always been satisfactory, but not worth the rest of issues found.
Extracted Answer: ride quality, reliability
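
Method 2 takes the argmax of the start and end logits independently, which works for this review but can in principle return an end position that precedes the start. The sketch below shows one way to pick the best valid span instead, using plain Python lists. The helper `best_span`, its `max_answer_len` parameter, and the toy logits are all hypothetical and only illustrate the idea.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Hypothetical helper: highest-scoring (start, end) pair with start <= end."""
    best_score, best = float("-inf"), (0, 0)
    for start, s_logit in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length cap
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = s_logit + end_logits[end]
            if score > best_score:
                best_score, best = score, (start, end)
    return best


# Toy logits (made up): the independent argmax picks start=3 and end=1,
# which is not a valid span.
start_logits = [0.1, 0.2, 0.3, 2.0, 0.5]
end_logits = [0.0, 1.5, 0.2, 0.4, 1.2]

naive = (start_logits.index(max(start_logits)), end_logits.index(max(end_logits)))
start, end = best_span(start_logits, end_logits)
print("independent argmax:", naive)
print("constrained span:", (start, end))
```

With the model outputs from Method 2, the lists could be obtained with `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`, and the answer tokens would then be `inputs["input_ids"][0][start:end + 1]`. Note that this sketch does not exclude question or special tokens from the search.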


# Step 4: Summarize and analyze a car review

Summarize the last review in the dataset in approximately 50-55 tokens. Store the result in the variable `summarized_text`.

In [203]:
# Load a summarization pipeline (using a model like facebook/bart-large-cnn)
summarization_pipeline = pipeline("summarization", model="facebook/bart-large-cnn")

# Extract the last car review from the dataset
last_review = car_reviews.Review.iloc[-1]

# Generate a summary with approximately 50-55 tokens.
summarized_output = summarization_pipeline(last_review, min_length=50, max_length=55, do_sample=False)
summarized_text = summarized_output[0]["summary_text"]

# Print the generated summary
print("Summarized Text:", summarized_text)

# Load the bias evaluation metrics (toxicity and regard) from the evaluate library
toxicity_metric = evaluate.load("toxicity")
regard_metric = evaluate.load("regard")

# Format the summarized text as a list for the metric inputs
toxicity_result = toxicity_metric.compute(predictions=[summarized_text], aggregation="maximum")
regard_result = regard_metric.compute(data=[summarized_text])

print("Toxicity:", toxicity_result["max_toxicity"])
print("\nRegard scores for each label:")
regard_df = pd.DataFrame(regard_result["regard"][0])
regard_df.set_index("label", inplace=True)
regard_df

Device set to use cpu


Summarized Text: The Nissan Rogue provides me with the desired SUV experience without burdening me with an exorbitant payment. Handling and styling are great; I have hauled 12 bags of mulch in the back with the seats down and could have held more. The engine delivers strong


Device set to use cpu
Device set to use cpu


Toxicity: 0.00013863427739124745

Regard scores for each label:


| label | score |
|---|---|
| positive | 0.626334 |
| neutral | 0.202735 |
| other | 0.122916 |
| negative | 0.048016 |
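
Because `max_length=55` caps the number of generated tokens, the summary above stops mid-sentence ("The engine delivers strong"). If the text is shown to customers, one option is to drop the trailing fragment. The sketch below is an optional post-processing step and not part of the original task; the helper `trim_incomplete_sentence` and the variable `display_text` are hypothetical, and `summarized_text` itself is left unchanged.

```python
import re


def trim_incomplete_sentence(text):
    """Hypothetical helper: drop a trailing fragment that lacks end punctuation."""
    text = text.strip()
    if text.endswith((".", "!", "?")):
        return text
    # Find the last sentence-ending punctuation mark followed by whitespace
    matches = list(re.finditer(r"[.!?](?=\s)", text))
    if not matches:
        return text  # no complete sentence to fall back on
    return text[: matches[-1].end()]


# Summary copied from the notebook output above
summarized_text = (
    "The Nissan Rogue provides me with the desired SUV experience without "
    "burdening me with an exorbitant payment. Handling and styling are great; "
    "I have hauled 12 bags of mulch in the back with the seats down and could "
    "have held more. The engine delivers strong"
)
display_text = trim_incomplete_sentence(summarized_text)
print(display_text)
```

The trimmed text is shorter than the 50-55 token target, so it suits display purposes only; the toxicity and regard scores above were computed on the untrimmed summary.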
