### Project Overview

This project serves as a **proof-of-concept (PoC)** for a **multitask LLM-powered chatbot prototype** designed for **Car-ing is sharing**, a growing car sales and rental company targeting international customers — particularly in **Spain**.

The goal is to demonstrate how **pre-trained Large Language Models (LLMs)** can be used to:
1. Classify customer sentiment in car reviews
2. Translate key content into Spanish for local customers
3. Extract meaningful insights via Question Answering (QA)
4. Summarize long reviews for quick ingestion
5. Evaluate model outputs using standardized metrics

This PoC lays the foundation for a scalable, multilingual customer support chatbot that can:
- Analyze real-time feedback
- Answer user questions in multiple languages
- Detect toxic or negative sentiment
- Provide concise summaries of vehicle experiences

### Dataset

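The dataset contains **customer reviews** of various Nissan vehicles, each labeled as either `POSITIVE` or `NEGATIVE`. It is a small sample (five reviews in the run shown here) that simulates real-world user feedback.

| Column | Description |
|-------|--------|
| `Review` | Free-text customer review |
| `Class` | Sentiment label: `POSITIVE` or `NEGATIVE` |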

> ✅ **Note:** The dataset is intentionally small for rapid prototyping and demonstration.

### Tools & Libraries Used

| Library | Purpose |
|-------|--------|
| `pandas` | Data loading and manipulation |
| `transformers` (Hugging Face) | Access to pre-trained models |
| `evaluate` | Standardized metric computation |
| `pipeline` (from `transformers`) | High-level API for NLP tasks |

In [42]:
# Import necessary libraries
import pandas as pd
from transformers import pipeline
import evaluate

from transformers import logging
logging.set_verbosity(logging.WARNING)

In [43]:
# Load dataset
reviews = pd.read_csv('car_reviews.csv', delimiter=';')

In [44]:
# Display first few rows for inspection
reviews.head(5)

| Unnamed: 0 | Review | Class |
|-------|--------|--------|
| 0 | I am very satisfied with my 2014 Nissan NV SL.... | POSITIVE |
| 1 | The car is fine. It's a bit loud and not very ... | NEGATIVE |
| 2 | My first foreign car. Love it, I would buy ano... | POSITIVE |
| 3 | I've come across numerous reviews praising the... | NEGATIVE |
| 4 | I've been dreaming of owning an SUV for quite ... | POSITIVE |


### ✅ Sentiment Classification

**Objective** Classify each review as **POSITIVE** or **NEGATIVE** using a DistilBERT model fine-tuned on SST-2.

In [45]:
# Initialize sentiment classification pipeline
sentiment_pipeline = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

# convert the pandas Series to a plain list of strings and run the model
predicted_labels = sentiment_pipeline(reviews['Review'].tolist())

# extract labels and map to binary integers: POSITIVE -> 1, NEGATIVE -> 0
predictions = [1 if pred["label"] == "POSITIVE" else 0 for pred in predicted_labels]

print('sample outputs:', predicted_labels[:3])
print('predictions (first 10):', predictions[:10])

Device set to use cuda:0


sample outputs: [{'label': 'POSITIVE', 'score': 0.9293985366821289}, {'label': 'POSITIVE', 'score': 0.8654282093048096}, {'label': 'POSITIVE', 'score': 0.9994640946388245}]
predictions (first 10): [1, 1, 1, 0, 1]


#### ✅ Evaluation
We compute:
- **Accuracy**: Proportion of correct predictions
- **F1 Score**: Harmonic mean of precision and recall (more informative than accuracy when the classes are imbalanced)

In [46]:
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

# Map the ground-truth labels to the same binary encoding as the predictions
references = [1 if label == "POSITIVE" else 0 for label in reviews['Class'].tolist()]

# Compute metrics
accuracy_compute = accuracy.compute(predictions=predictions, references=references)
f1_compute = f1.compute(predictions=predictions, references=references)

accuracy_result = accuracy_compute['accuracy']
f1_result = f1_compute['f1']

print('Accuracy:', accuracy_result)
print('F1 Score:', f1_result)

Accuracy: 0.8
F1 Score: 0.8571428571428571


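> ✅ **Insight:** The model correctly classifies 4 of the 5 reviews (accuracy 0.80, F1 ≈ 0.86). With only five samples, this result is indicative rather than statistically meaningful.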
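
The reported metrics can be checked by hand. The sketch below recomputes accuracy and F1 in plain Python from the five predictions printed above and the five `Class` labels shown in the `reviews.head(5)` output. The helper name `accuracy_and_f1` is illustrative and is not part of `evaluate`.

```python
def accuracy_and_f1(predictions, references):
    """Return (accuracy, F1) for binary labels where 1 is the positive class."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == 1 and r == 1 for p, r in pairs)  # true positives
    fp = sum(p == 1 and r == 0 for p, r in pairs)  # false positives
    fn = sum(p == 0 and r == 1 for p, r in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return correct / len(pairs), f1

# Values copied from the notebook outputs above
predictions = [1, 1, 1, 0, 1]
references = [1, 0, 1, 0, 1]  # POSITIVE, NEGATIVE, POSITIVE, NEGATIVE, POSITIVE

acc, f1_value = accuracy_and_f1(predictions, references)
print(acc, round(f1_value, 4))
```

The only error is a false positive on the second review, so precision is 3/4, recall is 3/3, and F1 is 6/7 ≈ 0.857.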

### ✅ English-to-Spanish Translation

**Objective** Translate the **first two sentences** of the **first review** into Spanish to support Spanish-speaking customers.

In [47]:
# Extract first two sentences from the first review
sentences = '.'.join(reviews['Review'][0].split('.')[:2]) + '.'
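
The cell above treats every period as a sentence boundary. The sketch below wraps the same expression in a hypothetical helper, `first_sentences`, and runs it on two made-up strings (not taken from `car_reviews.csv`) to show where this shortcut works and where it falls short.

```python
def first_sentences(text: str, n: int = 2) -> str:
    """Keep the first n period-delimited chunks of text, as in the cell above."""
    return ".".join(text.split(".")[:n]) + "."

# Hypothetical inputs for illustration only
simple = "Great van. Plenty of space. Would buy again."
tricky = "It averages 7.5 litres per 100 km. Would buy again."

print(first_sentences(simple))  # Great van. Plenty of space.
print(first_sentences(tricky))  # It averages 7.5 litres per 100 km.
```

In the second case the decimal point counts as a boundary, so only one sentence is returned. For the first review in this dataset the shortcut is sufficient; a sentence tokenizer would be a safer choice for arbitrary text.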

In [48]:
# Initialize translation pipeline
translator = pipeline(
    task="translation_en_to_es",
    model="Helsinki-NLP/opus-mt-en-es"
)

# Translate
translated_sentences = translator(sentences, clean_up_tokenization_spaces=True)

Device set to use cuda:0


In [49]:
translated_review = translated_sentences[0]['translation_text']

In [50]:
translated_review

'Estoy muy satisfecho con mi Nissan NV SL 2014. Uso esta camioneta para mis entregas de negocios y uso personal.'

#### ✅ BLEU Score Evaluation

BLEU measures translation quality by the n-gram overlap between the generated text and one or more human reference translations. Scores range from 0 to 1, and higher is better.

In [51]:
# Load the reference translation (first line of the file); read as UTF-8 to preserve Spanish characters
with open("reference_translations.txt", "r", encoding="utf-8") as f:
    translation_references = f.readline().strip()

In [52]:
translation_references

'Estoy muy satisfecho con mi Nissan NV SL 2014. Utilizo esta camioneta para mis entregas comerciales y uso personal.'

In [53]:
bleu = evaluate.load("bleu")

# Compute BLEU score
bleu_compute = bleu.compute(
    predictions=[translated_review],
    references=[translation_references]
)

bleu_score = bleu_compute["bleu"]
print("BLEU Score:", bleu_score)

BLEU Score: 0.6888074582865503


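> ✅ **Interpretation:** A BLEU score of about 0.69 indicates substantial n-gram overlap with the reference. The two translations differ only in word choice (*Uso* vs. *Utilizo*, *de negocios* vs. *comerciales*), so the meaning is preserved. A single sentence pair is too little to judge overall translation quality.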

> **Tip:** For production, consider using `nltk` or `sacrebleu` for more robust BLEU computation.
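
To make the score less of a black box, the sketch below recomputes BLEU for the sentence pair above in plain Python. It uses clipped n-gram precisions up to order 4, their geometric mean, and a brevity penalty. The regex tokenizer is a simplification assumed for this example, and the function names are illustrative rather than part of `evaluate`.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Simplified tokenizer: words and punctuation marks become separate tokens
    return re.findall(r"\w+|[^\w\s]", text)

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_single_reference(prediction, reference, max_order=4):
    pred, ref = tokenize(prediction), tokenize(reference)
    log_precisions = []
    for n in range(1, max_order + 1):
        pred_counts, ref_counts = ngram_counts(pred, n), ngram_counts(ref, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference
        matches = sum((pred_counts & ref_counts).values())
        total = sum(pred_counts.values())
        if matches == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))
    # The brevity penalty only applies when the prediction is shorter than the reference
    bp = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return bp * math.exp(sum(log_precisions) / max_order)

# Sentences copied from the notebook outputs above
prediction = ("Estoy muy satisfecho con mi Nissan NV SL 2014. "
              "Uso esta camioneta para mis entregas de negocios y uso personal.")
reference = ("Estoy muy satisfecho con mi Nissan NV SL 2014. "
             "Utilizo esta camioneta para mis entregas comerciales y uso personal.")

print(round(bleu_single_reference(prediction, reference), 4))
```

With this tokenization the n-gram precisions are 19/22, 16/21, 13/20, and 10/19. The prediction is one token longer than the reference, so no brevity penalty applies.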

### ✅ Extractive Question Answering (QA)

**Objective** Answer the question: *"What did he like about the brand?"* using the **second review**.

In [54]:
# Load QA pipeline
qa_pipeline = pipeline(
    task="question-answering",
    model="deepset/minilm-uncased-squad2"
)

question = "What did he like about the brand?"

# Get answer
result = qa_pipeline(question=question, context=reviews['Review'][1])
answer = result['answer']

Some weights of the model checkpoint at deepset/minilm-uncased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


In [55]:
print(f"Answer: {answer}")

Answer: ride quality, reliability


> ✅ **Insight:** The model correctly identifies the positive brand aspects mentioned in the review.

> The `minilm-uncased-squad2` model is fine-tuned on SQuAD 2.0, which also contains unanswerable questions, so it is well suited to extractive QA.
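
Extractive QA models do not generate text. They score each token as a possible start and end of the answer and return the best-scoring span of the context. The sketch below illustrates that selection step in plain Python. The tokens and scores are invented for illustration and are not outputs of `deepset/minilm-uncased-squad2`; `best_span` is a hypothetical helper.

```python
def best_span(start_scores, end_scores, max_answer_len=5):
    """Return (start, end) indices maximizing start + end score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for i, start_score in enumerate(start_scores):
        # Only consider spans of at most max_answer_len tokens
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            score = start_score + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

# Made-up context tokens and per-token scores (assumptions for this example)
tokens = ["I", "like", "the", "ride", "quality", "and", "reliability", "a", "lot"]
start_scores = [0.1, 0.2, 0.3, 2.5, 0.4, 0.1, 0.9, 0.0, 0.0]
end_scores = [0.0, 0.1, 0.1, 0.3, 1.2, 0.2, 2.8, 0.1, 0.3]

start, end = best_span(start_scores, end_scores)
print(" ".join(tokens[start:end + 1]))  # ride quality and reliability
```

Because the answer is always a slice of the context, an extractive model cannot paraphrase or combine information from distant parts of a review.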

### ✅ Review Summarization

**Objective** Generate a concise summary of the **last review**, limited to **50–55 tokens**.

In [56]:
# Load summarization pipeline
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

# Generate summary with max_length = 55
summary = summarizer(reviews['Review'].iloc[-1], max_length=55)

# Extract the summary text
summarized_text = summary[0]['summary_text']

Device set to use cuda:0
Your min_length=56 must be inferior than your max_length=55.


In [57]:
print("Summary:", summarized_text)

Summary:  Nissan Rogue provides the desired SUV experience without burdening me with an exorbitant payment . Handling and styling are great; I have hauled 12 bags of mulch in the back with the seats down and could have held more . The engine delivers strong performance, and


> ✅ **Insight:** The summary captures key points: affordability, space, handling, and performance.

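> The `max_length` was set to 55 to meet the token constraint. The checkpoint's default `min_length` is 56, which conflicts with this limit and triggers the warning shown in the cell output; passing an explicit `min_length` below `max_length` (for example, 50) avoids the conflict.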
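
Because generation stops at `max_length`, the summary above ends mid-sentence. One possible post-processing step, sketched below, is to drop the trailing fragment after the last sentence-ending punctuation mark. `trim_to_last_sentence` is a hypothetical helper, not part of `transformers`.

```python
def trim_to_last_sentence(text: str) -> str:
    """Cut text after its last '.', '!' or '?'; return it unchanged if none is found."""
    text = text.strip()
    cut = max(text.rfind(mark) for mark in ".!?")
    return text if cut == -1 else text[:cut + 1]

# Summary copied from the notebook output above
summarized_text = (
    " Nissan Rogue provides the desired SUV experience without burdening me "
    "with an exorbitant payment . Handling and styling are great; I have hauled "
    "12 bags of mulch in the back with the seats down and could have held more . "
    "The engine delivers strong performance, and"
)

trimmed = trim_to_last_sentence(summarized_text)
print(trimmed)
```

The trimmed text ends at "could have held more ." and loses the unfinished clause about engine performance, so this trades completeness for a clean ending.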

### 🛡️ Toxicity & Regard Analysis

**Objective**
Analyze the **summary** for:
- **Toxicity** (harmful content)
- **Regard** (polarity of the language toward the person or group described: neutral, positive, negative, other)

In [58]:
# Load toxicity metric
toxicity_metric = evaluate.load("toxicity")

# The metric expects a list of strings, one entry per text to score
toxicity_results = toxicity_metric.compute(
    predictions=[summarized_text],
    aggregation="maximum"
)

print("Toxicity Results:", toxicity_results['max_toxicity'])

Using default facebook/roberta-hate-speech-dynabench-r4-target checkpoint
Device set to use cuda:0


Toxicity Results: 0.2430584877729416


In [59]:
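# Load regard metric (language polarity toward the subject of the text)
regard = evaluate.load("regard")

# The metric expects a list of strings, one entry per text to score
regard_results = regard.compute(
    data=[summarized_text],
    aggregation="maximum"
)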

print("Regard Results:", regard_results['max_regard']) 

Device set to use cuda:0


Regard Results: {'neutral': 0.928610622882843, 'positive': 0.2930169701576233, 'negative': 0.09140203893184662, 'other': 0.06192563846707344}


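> ✅ **Interpretation:**
- **Toxicity:** Low (0.24, below 0.5, the default threshold of the metric's `ratio` aggregation) → no sign of harmful content
- **Regard:** Predominantly **neutral** (0.93) with some **positive** signal (0.29) → suitable for customer-facing content. With `aggregation="maximum"` each category is reported separately, so the values need not sum to 1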

> **Use Case:** Checks like these can filter out harmful or overly negative content before it is displayed in the chatbot.
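
A minimal sketch of such a filter is shown below. It takes the scores reported above as plain numbers and compares them with cut-off values. The function name `is_safe_to_display` and both thresholds are assumptions chosen for illustration; a real deployment would need to tune them on labeled data.

```python
def is_safe_to_display(toxicity, regard, max_toxicity=0.5, max_negative=0.5):
    """Return True when both scores stay under the (assumed) thresholds."""
    return toxicity < max_toxicity and regard.get("negative", 0.0) < max_negative

# Scores copied from the notebook outputs above
toxicity_score = 0.2430584877729416
regard_scores = {
    "neutral": 0.928610622882843,
    "positive": 0.2930169701576233,
    "negative": 0.09140203893184662,
    "other": 0.06192563846707344,
}

print(is_safe_to_display(toxicity_score, regard_scores))  # True
```

Keeping the thresholds as parameters makes it easy to apply stricter limits to content shown to customers than to content used for internal analysis.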