

# SvaraAI – AI/ML Engineer Internship Assignment




**Objective:**
This project builds a machine learning and NLP pipeline that classifies prospect email replies into three categories: *positive*, *neutral*, and *negative*. The task involves:

* Data preprocessing and cleaning
* Baseline model training (Logistic Regression with TF-IDF)
* Fine-tuning a transformer model (DistilBERT)
* Evaluating models using accuracy and F1 score
* Deploying the best-performing model via a FastAPI service

**Scope:**
The project demonstrates end-to-end capabilities, from data handling and model training to production-ready deployment, ensuring professional standards suitable for real-world AI/ML applications.



**Author:** Viraj Bhutada

**Date:** 22 September 2025

---



# Part A – ML/NLP Pipeline

## Step 1: Load & Preprocess Dataset

In this step, we load the email replies dataset and perform basic preprocessing:
- Rename columns for consistency.
- Clean the text (lowercasing, removing punctuation, trimming spaces).
- Standardize labels.
- Encode labels for ML models.
- Split dataset into training and test sets.


In [12]:
!pip install --upgrade transformers datasets scikit-learn torch fastapi uvicorn -q


In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import re
from google.colab import files

# Upload CSV
uploaded = files.upload()

# Load CSV
df = pd.read_csv(next(iter(uploaded)))

# Rename columns
df = df.rename(columns={'reply': 'text', 'label': 'label'})

# Clean text
df['text'] = df['text'].str.strip().str.lower()
df['text'] = df['text'].apply(lambda x: re.sub(r'[^\w\s]', '', x))

# Standardize labels
df['label'] = df['label'].str.strip().str.lower()

# Encode labels
le = LabelEncoder()
df['label_enc'] = le.fit_transform(df['label'])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label_enc'], test_size=0.2, random_state=42
)

print("Sample data:")
df.head()


Saving reply_classification_dataset.csv to reply_classification_dataset (3).csv
Sample data:


|   | text | label | label_enc |
|---|------|-------|-----------|
| 0 | can we discuss pricing | neutral | 1 |
| 1 | im excited to explore this further plz send co... | positive | 2 |
| 2 | we not looking for new solutions | negative | 0 |
| 3 | could u clarify features included | neutral | 1 |
| 4 | lets schedule a meeting to dive deeper | positive | 2 |


We can see that the dataset is clean and ready for modeling. Labels are encoded as integers for ML compatibility.
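To make the cleaning and encoding steps concrete, here is a minimal sketch of what they do to a single reply. `clean_text` and `encode_labels` are illustrative helpers that do not appear in the notebook; `encode_labels` only mimics the sorted-order mapping that `LabelEncoder` applies to string labels, and the sample inputs are made up.

```python
import re


def clean_text(text: str) -> str:
    """Mirror the notebook's cleaning: strip, lowercase, drop punctuation."""
    text = text.strip().lower()
    # [^\w\s] matches anything that is neither a word character nor whitespace,
    # so apostrophes and '!' are removed while digits and underscores stay.
    return re.sub(r"[^\w\s]", "", text)


def encode_labels(labels):
    """Hypothetical stand-in for LabelEncoder: index labels in sorted order."""
    classes = sorted(set(labels))
    mapping = {name: idx for idx, name in enumerate(classes)}
    return classes, [mapping[name] for name in labels]


cleaned = clean_text("  I'm excited to explore this further!  ")
classes, encoded = encode_labels(["neutral", "positive", "negative", "neutral"])

print(cleaned)
print(classes, encoded)
```

Because the classes are sorted alphabetically, `negative`, `neutral`, and `positive` map to 0, 1, and 2, which matches the `label_enc` column in the sample output above.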


## Step 2: Baseline Model – Logistic Regression

We train a simple Logistic Regression model using TF-IDF features.
This gives us a baseline performance before using more complex transformer models.


In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score, accuracy_score
import pickle

# TF-IDF
tfidf = TfidfVectorizer(max_features=2000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

# Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_tfidf, y_train)

# Predictions
y_pred = lr.predict(X_test_tfidf)

# Evaluation
print("Baseline Logistic Regression Results:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred, average='weighted'))
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=le.classes_))

# Save model and TF-IDF
with open("tfidf.pkl", "wb") as f:
    pickle.dump(tfidf, f)
with open("lr_model.pkl", "wb") as f:
    pickle.dump(lr, f)


Baseline Logistic Regression Results:
Accuracy: 0.9953051643192489
F1 Score: 0.9952978860372445

Classification Report:
               precision    recall  f1-score   support

    negative       1.00      0.99      0.99       150
     neutral       0.99      1.00      1.00       136
    positive       0.99      1.00      1.00       140

    accuracy                           1.00       426
   macro avg       1.00      1.00      1.00       426
weighted avg       1.00      1.00      1.00       426



The baseline already reaches 99.5% accuracy and a weighted F1 score of 0.995 on the 426-reply test set, so the transformer model in the next step has very little room to improve on it.
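For readers unfamiliar with the `average='weighted'` option used above, the following sketch shows how a weighted F1 score is assembled from per-class F1 scores. It is a plain-Python illustration, not scikit-learn's implementation, and the toy labels are invented for the example rather than taken from the dataset.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, cls):
    """F1 for one class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting each class by its support."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    return sum(
        per_class_f1(y_true, y_pred, cls) * count / total
        for cls, count in support.items()
    )


# Toy labels (hypothetical): one class-0 reply is misclassified as class 1.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2]

print("weighted F1:", round(weighted_f1(y_true, y_pred), 4))
```

Weighting by support means a class with more test examples contributes more to the final score, which is why the weighted F1 stays close to accuracy when the classes are roughly balanced, as they are here.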


## Step 3: Transformer Model – DistilBERT

We fine-tune a small transformer model (DistilBERT) for text classification.
Steps:
- Tokenize text using the DistilBERT tokenizer.
- Convert datasets to HuggingFace Dataset format.
- Set training arguments and train the model.
- Evaluate performance on test data.


In [16]:
# Step 3 – Transformer Model: DistilBERT
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
import torch
import os

# ✅ Disable Weights & Biases (W&B) logging
os.environ["WANDB_DISABLED"] = "true"

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Convert to HuggingFace Dataset
train_dataset = Dataset.from_pandas(pd.DataFrame({'text': X_train, 'label': y_train}))
test_dataset = Dataset.from_pandas(pd.DataFrame({'text': X_test, 'label': y_test}))

def tokenize(batch):
    return tokenizer(batch['text'], padding=True, truncation=True, max_length=128)

train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

# Model
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)

# Training Arguments
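training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    num_train_epochs=2,
    logging_steps=10,
    report_to="none"  # turn off W&B and other logging integrations (replaces the deprecated WANDB_DISABLED)
)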

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer
)

# Train & Evaluate
trainer.train()
trainer.evaluate()

# Save model
model.save_pretrained("./distilbert_model")
tokenizer.save_pretrained("./distilbert_model")
print("DistilBERT saved successfully!")


Map:   0%|          | 0/1703 [00:00<?, ? examples/s]

Map:   0%|          | 0/426 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


| Step | Training Loss |
|------|---------------|
| 10 | 0.9981 |
| 20 | 0.5637 |
| 30 | 0.237 |
| 40 | 0.0956 |
| 50 | 0.0653 |
| 60 | 0.0915 |
| 70 | 0.0152 |
| 80 | 0.0066 |
| 90 | 0.0048 |
| 100 | 0.004 |




DistilBERT saved successfully!


**Observation:** DistilBERT is trained and saved for deployment. Its accuracy and F1-score can now be compared with the baseline model to decide which to use in production.


## Step 4: Model Evaluation – Accuracy & F1 Score

**Objective:** Compute accuracy and weighted F1 score for DistilBERT on the test set so that it can be compared with the baseline from Step 2.

In [31]:
from sklearn.metrics import accuracy_score, f1_score

# Compute metrics for DistilBERT
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds, average='weighted')
    return {"accuracy": acc, "f1": f1}

trainer.compute_metrics = compute_metrics
eval_results = trainer.evaluate()
print("DistilBERT Evaluation Results:", eval_results)





DistilBERT Evaluation Results: {'eval_loss': 0.0005361773073673248, 'eval_model_preparation_time': 0.0029, 'eval_accuracy': 1.0, 'eval_f1': 1.0, 'eval_runtime': 13.5307, 'eval_samples_per_second': 31.484, 'eval_steps_per_second': 3.991}




## Step 5: Model Comparison & Production Recommendation

**Baseline Model – Logistic Regression + TF-IDF**

* **Accuracy:** 0.995
* **Weighted F1 Score:** 0.995
* **Strengths:**

  * Extremely fast to train and infer.
  * Low computational requirements.
  * Highly interpretable results, easy to debug.
* **Limitations:**

  * Cannot fully capture contextual nuances or semantic relationships in text.
  * May underperform on ambiguous or complex replies.

**Transformer Model – DistilBERT**

* **Accuracy:** 1.0 (100%)
* **Weighted F1 Score:** 1.0 (100%)
* **Strengths:**

  * Leverages contextual embeddings to understand subtle language patterns.
  * Excels at capturing nuanced or ambiguous replies.
  * More robust to diverse phrasing and semantic variations.
* **Limitations:**

  * Larger model size, slower inference compared to baseline.
  * Requires higher computational resources.

**Recommendation for Production:**

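* **DistilBERT is the preferred model** for deployment. On this test set the gap to the baseline is small (1.0 vs. 0.995 accuracy, i.e. two of 426 replies), so the choice rests mainly on its ability to capture context and semantics, which is crucial for accurately classifying prospect replies.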
* **Logistic Regression can serve as a fallback** for scenarios requiring low-latency inference or limited computational resources.
* Overall, DistilBERT provides the best balance between accuracy, robustness, and contextual understanding, ensuring that sales teams receive reliable insights for prioritizing prospects.
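One way to read the primary/fallback recommendation is as a small routing function. The sketch below is an assumption about how such routing could look, not something the notebook implements: `make_router`, the `low_latency` flag, and the two stub predictors are all hypothetical, and the stubs return fixed values in place of real model calls.

```python
from typing import Callable, Dict

Predictor = Callable[[str], Dict[str, object]]


def make_router(primary: Predictor, fallback: Predictor):
    """Return a function that prefers `primary` and falls back when asked or on error."""

    def route(text: str, low_latency: bool = False) -> Dict[str, object]:
        if low_latency:
            # Caller explicitly asked for the cheaper model.
            return {**fallback(text), "model": "fallback"}
        try:
            return {**primary(text), "model": "primary"}
        except Exception:
            # Broad catch is deliberate for the sketch; narrow it in real code.
            return {**fallback(text), "model": "fallback"}

    return route


# Stand-in predictors (hypothetical); real ones would wrap DistilBERT and
# the TF-IDF + Logistic Regression pipeline respectively.
def distilbert_stub(text: str) -> Dict[str, object]:
    return {"label": "positive", "confidence": 0.9}


def logreg_stub(text: str) -> Dict[str, object]:
    return {"label": "positive", "confidence": 0.8}


route = make_router(distilbert_stub, logreg_stub)
print(route("Looking forward to the demo!"))
print(route("Looking forward to the demo!", low_latency=True))
```

Tagging each response with the model that produced it makes it easier to monitor how often the fallback is actually used.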

---
# Part B – Deployment Task (API)

**Objective:** Wrap the best-performing model (DistilBERT) in a FastAPI service so that external applications can get predictions via a REST API.



* **Endpoint:** `/predict`
* **Input:** JSON `{ "text": "Looking forward to the demo!" }`
* **Output:** JSON `{ "label": "positive", "confidence": 0.87 }`

In [33]:
%%writefile app.py
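from fastapi import FastAPI
from pydantic import BaseModel
import re
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch.nn.functional as F
import os

# Disable W&B
os.environ["WANDB_DISABLED"] = "true"

app = FastAPI()

# Load the fine-tuned model and tokenizer saved in Step 3
tokenizer = AutoTokenizer.from_pretrained("./distilbert_model")
model = AutoModelForSequenceClassification.from_pretrained("./distilbert_model")
model.eval()  # inference mode: disables dropout

# Order must match the LabelEncoder encoding from Step 1 (0, 1, 2)
labels = ["negative", "neutral", "positive"]

class TextIn(BaseModel):
    text: str

def clean_text(text: str) -> str:
    # Apply the same cleaning as in training: strip, lowercase, remove punctuation
    return re.sub(r"[^\w\s]", "", text.strip().lower())

@app.post("/predict")
def predict(data: TextIn):
    # Use the same max_length as during fine-tuning
    inputs = tokenizer(clean_text(data.text), return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():  # gradients are not needed at inference time
        outputs = model(**inputs)
    probs = F.softmax(outputs.logits, dim=1)
    label_idx = torch.argmax(probs, dim=1).item()
    confidence = probs[0][label_idx].item()
    return {"label": labels[label_idx], "confidence": round(confidence, 2)}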


Overwriting app.py




The best-performing model (DistilBERT) is wrapped as a REST API using FastAPI, so external applications or services can request predictions programmatically.

**Key Features of the API:**

* **Endpoint:** `/predict`
* **Request:** JSON object containing a text string.

  ```json
  { "text": "Looking forward to the demo!" }
  ```
* **Response:** JSON object containing the predicted label and confidence score.

  ```json
  { "label": "positive", "confidence": 0.87 }
  ```
* **Model Used:** Fine-tuned DistilBERT transformer.
* **Implementation Notes:**

  * W&B logging is disabled for simplicity (`os.environ["WANDB_DISABLED"] = "true"`).
  * Model and tokenizer are loaded from the saved directory (`./distilbert_model`).
  * The API uses `torch.nn.functional.softmax` to convert logits to probabilities.
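The last note, turning logits into a label and a confidence score, can be illustrated without PyTorch. The sketch below reimplements softmax in plain Python purely for explanation; `to_prediction` is a hypothetical helper and the logits are invented, whereas the API itself uses `torch.nn.functional.softmax` on the model output.

```python
import math

LABELS = ["negative", "neutral", "positive"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits):
    """Pick the most probable class and report its probability as confidence."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[idx], "confidence": round(probs[idx], 2)}


# Made-up logits for one reply, in the order negative, neutral, positive.
prediction = to_prediction([-1.0, 0.5, 2.0])
print(prediction)
```

The confidence is simply the softmax probability of the winning class, so it reflects how strongly the model prefers that class over the other two rather than a calibrated likelihood of being correct.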


**Note:** Writing this cell generates `app.py`, which completes the technical part of the deployment. There is **no need to run `uvicorn` in Colab**; the file is sufficient for submission or local testing.
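For local testing outside Colab, a client could call the endpoint along the following lines. This is a sketch under assumptions: it presumes the service was started with something like `uvicorn app:app` and is reachable at `http://127.0.0.1:8000` (uvicorn's default host and port), and `build_request` and `classify` are hypothetical helper names. Only the standard library is used.

```python
import json
import urllib.request

# Assumed local address; adjust to wherever the service is actually running.
API_URL = "http://127.0.0.1:8000/predict"


def build_request(text: str, url: str = API_URL) -> urllib.request.Request:
    """Build a POST request whose JSON body matches the TextIn schema."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(text: str, url: str = API_URL, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(build_request(text, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example call, to be run only while the server is up:
# classify("Looking forward to the demo!")
```

With the server running, `classify` should return a dictionary with the `label` and `confidence` keys described above.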

---


