# 📞 Smart Call Center Analyzer  

**🎯 Purpose:**  
Build a project that predicts customer churn and tags sentiment, intent, and a prioritization score from call‑like customer–agent interactions.

**🏢 Business Value:**  
Helps call centers to:  
✅ Identify at‑risk customers (reduce churn)  
✅ Track sentiment in real time (improve satisfaction)  
✅ Classify intent (T5-Small with zero-shot learning and regex fallbacks) for accurate prioritization

**🛠️ Tech Stack:**  
- Python (Pandas, scikit‑learn, Matplotlib/Seaborn)  
- Hugging Face Transformers for GenAI  
- Jupyter Notebook for documentation & reproducibility

**📂 Datasets Used:**  
- **Kaggle TWCS** – Customer Support on Twitter dataset  
- **UCI Sentiment Labelled Sentences** – for validating sentiment


## 🧩 Step 1: Environment Setup  

Install and import the minimal libraries we need:  

- **Pandas / NumPy** → data manipulation  
- **scikit‑learn** → churn model (logistic regression)  
- **Matplotlib / Seaborn** → visualizations for stakeholders  
- **Transformers** → for GenAI sentiment/summarization  


In [42]:
import pandas as pd
import numpy as np
import re

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

from transformers import pipeline

print("✅ Environment ready.")


✅ Environment ready.


## 📥 Step 2: Load Datasets  

We are using **two real‑world open datasets**:  

1. **TWCS (Twitter Customer Support)**  
   - Mimics customer–agent interactions (like call center chats)  
   - We'll derive features like `response_time` and `churn_label`

2. **UCI Sentiment Labelled Sentences**  
   - Contains pre‑labeled sentences (0 = negative, 1 = positive); we load the `amazon_cells_labelled.txt` file  
   - Used to validate or fine‑tune our sentiment tagging
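
A small sketch of reading a tab-separated "sentence, label" file defensively. The sample sentences below are made up for illustration, and nothing in this notebook shows that `amazon_cells_labelled.txt` needs the extra option (it loads as 1000 rows as-is). The assumption is only that free-text sentences may contain stray double quotes, which `quoting=csv.QUOTE_NONE` tells pandas to treat as ordinary characters.

```python
import csv
import io

import pandas as pd

# Hypothetical sample in the same "text<TAB>label" layout as the UCI file.
sample = (
    'Good case, Excellent value.\t1\n'
    'The 5" screen cracked in a week.\t0\n'
    '"Never again.\t0\n'
)

# QUOTE_NONE keeps double quotes as literal characters instead of field delimiters.
df_sample = pd.read_csv(
    io.StringIO(sample),
    sep='\t',
    header=None,
    names=['text', 'label'],
    quoting=csv.QUOTE_NONE,
)
print(df_sample)
```

Comparing the resulting row count with the number of lines in the file is a quick way to confirm that no sentences were merged during parsing.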


In [43]:
# Load TWCS raw dataset
df_twcs = pd.read_csv('../data/raw/twcs.csv', encoding='utf-8')
print("🔹 TWCS shape:", df_twcs.shape)
print(df_twcs.head(3))

# Load UCI dataset
df_uci = pd.read_csv(
    '../data/raw/amazon_cells_labelled.txt',
    sep='\t', header=None,
    names=['text', 'label']
)
print("\n🔹 UCI shape:", df_uci.shape)
print(df_uci.head(3))


🔹 TWCS shape: (2811774, 7)
   tweet_id   author_id  inbound                      created_at  \
0         1  sprintcare    False  Tue Oct 31 22:10:47 +0000 2017   
1         2      115712     True  Tue Oct 31 22:11:45 +0000 2017   
2         3      115712     True  Tue Oct 31 22:08:27 +0000 2017   

                                                text response_tweet_id  \
0  @115712 I understand. I would like to assist y...                 2   
1      @sprintcare and how do you propose we do that               NaN   
2  @sprintcare I have sent several private messag...                 1   

   in_response_to_tweet_id  
0                      3.0  
1                      1.0  
2                      4.0  

🔹 UCI shape: (1000, 2)
                                                text  label
0  So there is no way for me to plug it in here i...      0
1                        Good case, Excellent value.      1
2                             Great for the jawbone.      1


## 🔎 Step 3: Quick Data Understanding  

Before cleaning, **inspect the datasets** to understand their columns and potential issues.

- For TWCS, we expect columns like `tweet_id`, `text`, `inbound`, `in_response_to_tweet_id`, `created_at`.  
- For UCI, we have simple `text` and `label` columns.  

👉 **Goal:** Identify missing values, malformed timestamps, or irrelevant fields early.
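
One way to spot malformed timestamps early is to parse with `errors='coerce'` and count the values that come back as `NaT`. This is a minimal sketch on toy values: the format string is inferred from the `created_at` preview in Step 2, and the last entry is deliberately broken for illustration.

```python
import pandas as pd

# Toy values in the style of the TWCS preview; the last one is intentionally malformed.
raw_timestamps = pd.Series([
    "Tue Oct 31 22:10:47 +0000 2017",
    "Tue Oct 31 22:11:45 +0000 2017",
    "not a timestamp",
])

# Unparseable values become NaT instead of raising an error.
parsed = pd.to_datetime(
    raw_timestamps,
    format="%a %b %d %H:%M:%S %z %Y",
    errors="coerce",
    utc=True,
)

n_bad = int(parsed.isna().sum())
print(f"Unparseable timestamps: {n_bad} of {len(raw_timestamps)}")
```

If nearly every row turns out unparseable, the format string is the more likely culprit than the data.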


In [44]:
print("TWCS columns:", df_twcs.columns.tolist())
print("\nMissing values in TWCS:\n", df_twcs.isna().sum().sort_values(ascending=False).head(10))

print("\nUCI columns:", df_uci.columns.tolist())
print("\nMissing values in UCI:\n", df_uci.isna().sum())


TWCS columns: ['tweet_id', 'author_id', 'inbound', 'created_at', 'text', 'response_tweet_id', 'in_response_to_tweet_id']

Missing values in TWCS:
 response_tweet_id          1040629
in_response_to_tweet_id     794335
tweet_id                         0
inbound                          0
author_id                        0
text                             0
created_at                       0
dtype: int64

UCI columns: ['text', 'label']

Missing values in UCI:
 text     0
label    0
dtype: int64


## 🧹 Step 4: Data Cleaning & Preprocessing  

We will clean and engineer features from the TWCS dataset:

**🔧 Operations on TWCS**
- ✅ Ensure `tweet_id` and `in_response_to_tweet_id` are numeric
- ✅ Parse `created_at` to proper datetime
- ✅ Separate **customer tweets** (`inbound=True`) from **agent replies** (`inbound=False`)
- ✅ Merge them to calculate `response_time` in minutes
- ✅ Remove invalid rows (negative or missing response times) and cap the rest at 1,440 minutes (24 hours)
- ✅ Deduplicate by `tweet_id`
- ✅ Create a `churn_label` column by scanning the agent reply text for churn‑related keywords
- ✅ Clean text by removing URLs, mentions, and special characters
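
The pairing and `response_time` steps above can be sketched on a three-row toy frame. The rows mirror the tweet IDs from the Step 2 preview, but the timestamps are written in ISO form and the texts are shortened, so treat this as an illustration rather than the notebook's exact pipeline.

```python
import pandas as pd

toy = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "inbound": [False, True, True],
    "created_at": pd.to_datetime(
        ["2017-10-31 22:10:47", "2017-10-31 22:11:45", "2017-10-31 22:08:27"],
        utc=True,
    ),
    "text": ["@115712 I understand.", "@sprintcare and how?", "@sprintcare I sent messages"],
    "in_response_to_tweet_id": [3.0, 1.0, 4.0],
})

# Customer tweets: rename so their columns do not clash with the agent columns.
customers = toy[toy["inbound"]].rename(columns={
    "tweet_id": "customer_tweet_id",
    "created_at": "customer_created_at",
    "text": "customer_text",
})

# Agent replies: the tweet they respond to becomes the join key.
agents = toy[~toy["inbound"]].rename(
    columns={"in_response_to_tweet_id": "customer_tweet_id"}
)

pairs = agents.merge(
    customers[["customer_tweet_id", "customer_created_at", "customer_text"]],
    on="customer_tweet_id",
    how="inner",
)

# Minutes between the customer tweet and the agent reply.
pairs["response_time"] = (
    (pairs["created_at"] - pairs["customer_created_at"]).dt.total_seconds() / 60
)
print(pairs[["tweet_id", "customer_tweet_id", "response_time"]])
```

Only agent tweet 1 has a matching customer tweet (tweet 3) in this toy frame, so the inner join yields a single pair with a response time of 140 seconds.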

**🔧 Operations on UCI**
- ✅ Clean text in the same way for consistency
- ✅ Labels are already provided (0 = negative, 1 = positive)

👉 **Why?**  
High‑quality cleaned features (like response_time and churn signals) are the foundation for accurate modeling and insight generation.
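
Because the churn signal comes from keyword matching, it helps to see how plain regex alternation behaves. The sketch below uses a shortened, hypothetical keyword list and made-up replies. It compares substring matching with a whole-word variant built from `\b` word boundaries; the whole-word variant is an illustrative alternative, not what this notebook uses.

```python
import pandas as pd

# Made-up agent replies for illustration.
replies = pd.Series([
    "We are sorry about the issue with your bill.",
    "Your new badge is on its way!",
    "Happy to help, have a great day.",
])

# Shortened, hypothetical keyword list.
keywords = r"cancel|bad|issue"

# Substring matching: "bad" also fires inside "badge".
substring_hits = replies.str.contains(keywords, case=False, na=False).astype(int)

# Whole-word matching: a non-capturing group wrapped in word boundaries.
bounded = rf"\b(?:{keywords})\b"
whole_word_hits = replies.str.contains(bounded, case=False, na=False).astype(int)

print(pd.DataFrame({
    "reply": replies,
    "substring": substring_hits,
    "whole_word": whole_word_hits,
}))
```

The trade-off is that whole-word matching also stops catching inflected forms such as "cancelled" or "issues" unless they are listed explicitly.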


In [46]:
# ---------- Utility: Text cleaning ----------
import re

def clean_text(text):
    text = re.sub(r'http\S+', '', text)           # remove URLs
    text = re.sub(r'@\w+', '', text)              # remove mentions
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)    # remove special chars
    return text.strip().lower()

# ---------- TWCS Cleaning ----------
df_twcs['tweet_id'] = pd.to_numeric(df_twcs['tweet_id'], errors='coerce')
df_twcs['in_response_to_tweet_id'] = pd.to_numeric(df_twcs['in_response_to_tweet_id'], errors='coerce')

# Parse created_at (raw values look like "Tue Oct 31 22:10:47 +0000 2017")
df_twcs['created_at'] = pd.to_datetime(
    df_twcs['created_at'],
    format="%a %b %d %H:%M:%S %z %Y",
    errors='coerce',
    utc=True
)

# Split customer vs agent
df_customer = df_twcs[df_twcs['inbound']].rename(columns={
    'tweet_id': 'customer_tweet_id',
    'created_at': 'customer_created_at',
    'text': 'customer_text'
})

df_agent = df_twcs[~df_twcs['inbound']].rename(
    columns={'in_response_to_tweet_id': 'customer_tweet_id'}
)

# Merge on customer_tweet_id
df_pairs = pd.merge(
    df_agent,
    df_customer[['customer_tweet_id', 'customer_created_at', 'customer_text']],
    on='customer_tweet_id',
    how='inner'
)

# Calculate response time in minutes
df_pairs['response_time'] = (
    (df_pairs['created_at'] - df_pairs['customer_created_at']).dt.total_seconds() / 60
)
# Keep only valid response times (NaN and negative values are dropped), then cap at 24 hours
df_pairs = df_pairs[df_pairs['response_time'] >= 0].copy()
df_pairs['response_time'] = df_pairs['response_time'].clip(upper=1440)

# Drop duplicates
df_pairs = df_pairs.drop_duplicates(subset=['tweet_id'])

# Create churn_label by keyword search in agent response
churn_keywords = r'cancel|unhappy|disappointed|frustrated|bad|poor|terrible|awful|issue|problem|angry|upset|complain|worst|never|fail|horrible|pathetic|ridiculous|disaster|nightmare|sucks|furious'
df_pairs['churn_label'] = df_pairs['text'].str.contains(churn_keywords, case=False, na=False).astype(int)

# Clean text
df_pairs['cleaned_text'] = df_pairs['text'].apply(clean_text)

print("✅ Cleaned TWCS pairs shape:", df_pairs.shape)
print(df_pairs[['text', 'cleaned_text', 'response_time', 'churn_label']].head())

# ---------- UCI Cleaning ----------
df_uci['cleaned_text'] = df_uci['text'].apply(clean_text)

print("\n✅ Cleaned UCI preview:")
print(df_uci.head())


✅ Cleaned TWCS pairs shape: (1261888, 12)
                                                text  \
0  @115712 I understand. I would like to assist y...   
1  @115712 Please send us a Private Message so th...   
2  @115712 Can you please send us a private messa...   
3  @115713 This is saddening to hear. Please shoo...   
4  @115713 We understand your concerns and we'd l...   

                                        cleaned_text  response_time  \
0  i understand i would like to assist you we wou...       2.333333   
1  please send us a private message so that we ca...       5.233333   
2  can you please send us a private message so th...       1.233333   
3  this is saddening to hear please shoot us a dm...       5.800000   
4  we understand your concerns and wed like for y...       2.800000   

   churn_label  
0            0  
1            0  
2            0  
3            0  
4            0  

✅ Cleaned UCI preview:
                                                text  label  \
0  So

## 💾 Step 5: Save Processed Data  

We save the cleaned datasets into the `../data/processed/` folder for downstream tasks:

- `cleaned_twcs.csv` → For churn modeling & summarization
- `cleaned_uci.csv` → For validating the sentiment model

👉 **Why?**  
Keeping processed data separate from raw data ensures reproducibility and avoids overwriting raw sources.
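
`DataFrame.to_csv` does not create missing folders, so a fresh checkout may need the processed directory created first. A minimal sketch, assuming a temporary base directory as a stand-in for the project root and a one-row toy frame:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Temporary stand-in for the project root (assumption for this sketch).
base_dir = Path(tempfile.mkdtemp())
processed_dir = base_dir / "data" / "processed"

# Create the folder (and any missing parents); no error if it already exists.
processed_dir.mkdir(parents=True, exist_ok=True)

toy = pd.DataFrame({"cleaned_text": ["good case excellent value"], "label": [1]})
out_path = processed_dir / "cleaned_uci.csv"
toy.to_csv(out_path, index=False)

# Read the file back to confirm the round trip.
reloaded = pd.read_csv(out_path)
print(reloaded)
```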


In [47]:
df_pairs.to_csv('../data/processed/cleaned_twcs.csv', index=False)
df_uci.to_csv('../data/processed/cleaned_uci.csv', index=False)

print("✅ Processed datasets saved.")


✅ Processed datasets saved.
