# Project Name - Solution Name

## Problem Definition


TODO: Describe the problem definition here
> Short statement that describes:
- The problem:Analyzing sentiment in Arabic social media complaints to understand customer satisfaction and identify areas for improvement.
- Used data:Excel file containing social media posts in Arabic with metadata (post links, text, timestamps)
- Model(s) chosen:

    Pre-trained Arabic BERT models for initial labeling (AraBERT, XLM-R, MARBERT)
    Logistic Regression with TF-IDF for deployment-ready classification
- Evaluation metric(s): Accuracy, Precision, Recall, F1-Score per sentiment class
- In general: Pipeline: Data cleaning → Arabic text preprocessing → Sentiment labeling → Model training → Evaluation → Export for deployment

## Dependencies

In [None]:
# Install required packages
!pip install transformers torch accelerate tqdm --quiet
!pip install pandas numpy matplotlib seaborn
!pip install nltk scikit-learn joblib
!pip install openpyxl  # For Excel files
!pip install huggingface_hub

# For Arabic NLP
import nltk
nltk.download('stopwords')

In [None]:
# Import dependencies
import os
import re
import math
import json
import getpass
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

# NLP and ML
import nltk
from nltk.corpus import stopwords
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import joblib

# Colab specific (optional)
try:
    from google.colab import files, drive
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

In [None]:
# Environment variables / API tokens (keep secure!)
# Option 1: Environment variable
hf_token = os.environ.get("HF_TOKEN")

# Option 2: Secure input (recommended for notebooks)
if not hf_token:
    hf_token = getpass.getpass("Enter Hugging Face token (or leave empty): ").strip() or None

# Never hardcode tokens in production code!

## Data Reading

In [None]:
# Upload file in Colab
if IN_COLAB:
    from google.colab import files
    uploaded = files.upload()

# Read Excel file
# Source: Internal social media monitoring system
df = pd.read_excel("SocialMedia_Complaints_15072025.xlsx")

print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
df.head()

## Data Exploration

In [None]:
# 1. Basic information
print("=" * 50)
print("DATASET INFO")
print("=" * 50)
df.info()

# 2. Check data types and missing values
print("\n" + "=" * 50)
print("MISSING VALUES")
print("=" * 50)
missing_summary = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum(),
    'Missing_Percentage': (df.isnull().sum() / len(df)) * 100
})
print(missing_summary[missing_summary['Missing_Count'] > 0])

# 3. Check for duplicates
print("\n" + "=" * 50)
print("DUPLICATES")
print("=" * 50)
print(f"Duplicate rows: {df.duplicated().sum()}")
print(f"Duplicate post-links: {df.duplicated(subset=['post-link']).sum()}")
print(f"Duplicate Post Text: {df.duplicated(subset=['Post Text']).sum()}")

# 4. Text length analysis
df["text_length_words"] = df["Post Text"].astype(str).apply(lambda x: len(x.split()))
df["text_length_chars"] = df["Post Text"].astype(str).apply(len)

print("\n" + "=" * 50)
print("TEXT LENGTH STATISTICS")
print("=" * 50)
print(df[["text_length_words", "text_length_chars"]].describe())

# 5. Visualizations
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Word count distribution
axes[0].hist(df["text_length_words"], bins=30, edgecolor='black', alpha=0.7)
axes[0].set_title("Distribution of Post Length (Words)")
axes[0].set_xlabel("Number of Words")
axes[0].set_ylabel("Frequency")

# Character count distribution
axes[1].hist(df["text_length_chars"], bins=30, edgecolor='black', alpha=0.7, color='orange')
axes[1].set_title("Distribution of Post Length (Characters)")
axes[1].set_xlabel("Number of Characters")
axes[1].set_ylabel("Frequency")

plt.tight_layout()
plt.show()

# 6. Sample posts
print("\n" + "=" * 50)
print("SAMPLE POSTS")
print("=" * 50)
for i, text in enumerate(df["Post Text"].sample(5, random_state=42), 1):
    print(f"{i}. {text[:200]}..." if len(str(text)) > 200 else f"{i}. {text}")
    print("-" * 30)

➡️ **Conclude with:**



*   Dataset contains social media posts in Arabic with varying lengths

*   Some posts have missing text or are duplicates
*   Text length varies significantly (need to handle very short/long posts)


*   Presence of URLs, numbers, and special characters that need cleaning




## Data Cleaning

In [None]:
print("Duplicate post-links count:", df.duplicated(subset=["post-link"]).sum())
print("Duplicate Post Text count:", df.duplicated(subset=["Post Text"]).sum())


In [None]:
# 1. Remove duplicates (by Post Text or by both post-link and Post Text)

df = df.drop_duplicates(subset=["Post Text"], keep="first")

print(f"After removing duplicates: {len(df)} rows")

# 2. Ensure no NaNs in Post Text before quality check
df = df.dropna(subset=["Post Text"])
print(f"After removing missing Post Text: {len(df)} rows")

# 3. Consistent quality labels (use 'ok' to match your expected output)
def check_irrelevant(text):
    if pd.isna(text):
        return "empty"
    s = str(text).strip()
    if s == "":
        return "empty"
    if (s.startswith("http") or s.startswith("www.")) and len(s.split()) <= 3:
        return "only link"
    elif len(s.split()) < 3:
        return "too short"
    else:
        return "ok"   # <= use 'ok' if that's the label you expect

df["irrelevant"] = df["Post Text"].apply(check_irrelevant)
print("\nIrrelevant post counts:")
print(df["irrelevant"].value_counts())


print("Duplicate post-links count after droping :", df.duplicated(subset=["post-link"]).sum())
print("Duplicate Post Text count after droping :", df.duplicated(subset=["Post Text"]).sum())

# Keep only ok posts
df_clean = df[df["irrelevant"] == "ok"].copy()
print(f"\nFinal dataset after cleaning: {len(df_clean)} rows")


➡️ **Conclude with:**



*   There was no duplicate posts based on post-link


*   Filtered out posts that are too short or contain only URLs



*   Dataset reduced but quality improved significantly




## Data Preprocessing

In [None]:
# Arabic text preprocessing function
import re
import nltk
from nltk.corpus import stopwords

# Download Arabic stopwords
nltk.download("stopwords")
arabic_stopwords = set(stopwords.words("arabic"))

def clean_arabic_text(text):
    """
    Comprehensive Arabic text cleaning pipeline
    """
    text = str(text)

    # 1. Remove URLs
    text = re.sub(r"http\S+|www.\S+", " ", text)

    # 2. Remove numbers
    text = re.sub(r"\d+", " ", text)

    # 3. Remove punctuation and special characters
    text = re.sub(r"[^\w\s]", " ", text)

    # 4. Normalize Arabic letters
    text = re.sub("[إأآا]", "ا", text)
    text = re.sub("ى", "ي", text)
    text = re.sub("ؤ", "و", text)
    text = re.sub("ئ", "ي", text)
    text = re.sub("ة", "ه", text)

    # 5. Remove non-Arabic characters (keep only Arabic and spaces)
    text = re.sub(r"[^\u0600-\u06FF\s]", " ", text)

    # 6. Remove extra spaces
    text = re.sub(r"\s+", " ", text).strip()

    # 7. Remove Arabic stopwords
    tokens = text.split()
    tokens = [w for w in tokens if w not in arabic_stopwords and len(w) > 1]
    text = " ".join(tokens)

    return text

# Apply cleaning
df_clean["clean_text"] = df_clean["Post Text"].apply(clean_arabic_text)

# Preview cleaning results
print("CLEANING PREVIEW")
print("=" * 60)
for i in range(min(5, len(df_clean))):
    row = df_clean.iloc[i]
    print(f"🔹 Original: {row['Post Text'][:100]}...")
    print(f"✅ Cleaned: {row['clean_text'][:100]}...")
    print("-" * 60)

# Check cleaned text statistics
df_clean["clean_text_length"] = df_clean["clean_text"].apply(lambda x: len(x.split()))
print(f"\nCleaned text word count stats:")
print(df_clean["clean_text_length"].describe())

➡️ **Conclude with:**

*   Normalized Arabic characters for consistency

*   Removed noise (URLs, numbers, non-Arabic text)

*   Eliminated stopwords to focus on meaningful content



*   Text is now ready for sentiment analysis




## Modeling

Phase 1: Automated Labeling with Pre-trained Models

In [None]:
# Setup for pre-trained Arabic sentiment models
model_candidates = [
    "PRAli22/AraBert-Arabic-Sentiment-Analysis",
    "akhooli/xlm-r-large-arabic-sent",
    "Ammar-alhaj-ali/arabic-MARBERT-sentiment",
]

# Load model with fallback mechanism
device = 0 if torch.cuda.is_available() else -1
sentiment_pipe = None
chosen_model = None

for model_id in model_candidates:
    try:
        print(f"Attempting to load: {model_id}")
        tokenizer = AutoTokenizer.from_pretrained(model_id, token=hf_token)
        model = AutoModelForSequenceClassification.from_pretrained(model_id, token=hf_token)
        sentiment_pipe = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer, device=device)
        chosen_model = model_id
        print(f"✅ Successfully loaded: {chosen_model}")
        break
    except Exception as e:
        print(f"❌ Failed to load {model_id}: {e}")
        continue

if sentiment_pipe is None:
    raise RuntimeError("Could not load any sentiment model")

# Batched inference with progress tracking
def predict_sentiment_batch(texts, batch_size=32):
    """Predict sentiment for texts in batches"""
    labels = []
    scores = []

    for i in tqdm(range(0, len(texts), batch_size), desc="Predicting sentiment"):
        batch = texts[i:i+batch_size]
        try:
            outputs = sentiment_pipe(batch, truncation=True, max_length=512)
            for out in outputs:
                labels.append(out['label'])
                scores.append(out['score'])
        except Exception as e:
            print(f"Error in batch {i//batch_size}: {e}")
            # Fallback: process individually
            for text in batch:
                try:
                    out = sentiment_pipe(text, truncation=True, max_length=512)[0]
                    labels.append(out['label'])
                    scores.append(out['score'])
                except:
                    labels.append("ERROR")
                    scores.append(0.0)

    return labels, scores

# Apply sentiment prediction
texts = df_clean["clean_text"].tolist()
labels, scores = predict_sentiment_batch(texts, batch_size=16 if device == -1 else 32)

df_clean["sentiment_label"] = labels
df_clean["confidence_score"] = scores

# Normalize labels
def normalize_label(label):
    label_lower = str(label).lower()
    if "pos" in label_lower or label_lower == "1":
        return "positive"
    elif "neg" in label_lower or label_lower == "0":
        return "negative"
    elif "neu" in label_lower or label_lower == "2":
        return "neutral"
    else:
        return label_lower

df_clean["sentiment_normalized"] = df_clean["sentiment_label"].apply(normalize_label)

print("\nSentiment Distribution:")
print(df_clean["sentiment_normalized"].value_counts())
print(f"\nAverage confidence: {df_clean['confidence_score'].mean():.3f}")

Phase 2: Train Custom Classification Model


In [None]:
# Prepare data for training
valid_sentiments = ["positive", "negative", "neutral"]
df_model = df_clean[df_clean["sentiment_normalized"].isin(valid_sentiments)].copy()

X = df_model["clean_text"]
y = df_model["sentiment_normalized"]

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
print(f"\nClass distribution in training:")
print(y_train.value_counts(normalize=True))

# TF-IDF Vectorization
vectorizer = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2),  # Unigrams and bigrams
    min_df=3,
    max_df=0.9
)

X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print(f"\nTF-IDF matrix shape: {X_train_tfidf.shape}")

# Train Logistic Regression
lr_model = LogisticRegression(
    max_iter=500,
    C=1.0,
    class_weight='balanced',  # Handle class imbalance
    random_state=42
)

lr_model.fit(X_train_tfidf, y_train)

# Predictions
y_pred = lr_model.predict(X_test_tfidf)
y_pred_proba = lr_model.predict_proba(X_test_tfidf)

print("\n✅ Model training complete!")

## Evaluation (Track One Change at a Time)

TODO: include a **comparison table between different models** (with their key metrics like accuracy, F1, loss, etc.) to clearly highlight which approach performed best.

In [None]:
# Detailed evaluation
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, support = precision_recall_fscore_support(
    y_test, y_pred, average=None, labels=valid_sentiments
)

# Create performance summary
performance_summary = pd.DataFrame({
    'Class': valid_sentiments,
    'Precision': precision,
    'Recall': recall,
    'F1-Score': f1,
    'Support': support
})

print("=" * 60)
print("CLASSIFICATION REPORT")
print("=" * 60)
print(classification_report(y_test, y_pred, target_names=valid_sentiments))

print("\n" + "=" * 60)
print("PERFORMANCE SUMMARY")
print("=" * 60)
print(performance_summary)
print(f"\nOverall Accuracy: {accuracy:.3f}")

# Confusion Matrix
import seaborn as sns
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred, labels=valid_sentiments)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=valid_sentiments,
            yticklabels=valid_sentiments)
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()

# Model comparison table
comparison_data = {
    'Model': ['AraBERT (Zero-shot)', 'Logistic Regression (Trained)'],
    'Approach': ['Pre-trained transformer', 'TF-IDF + Classical ML'],
    'Accuracy': ['-', f'{accuracy:.3f}'],
    'Avg F1': ['-', f'{f1.mean():.3f}'],
    'Training Time': ['None (pre-trained)', '< 1 minute'],
    'Inference Speed': ['Slow (~100 samples/sec)', 'Fast (~10000 samples/sec)'],
    'Deployment': ['Requires GPU, large memory', 'CPU sufficient, lightweight']
}

comparison_df = pd.DataFrame(comparison_data)
print("\n" + "=" * 60)
print("MODEL COMPARISON")
print("=" * 60)
print(comparison_df.to_string(index=False))

## Deployment

In [None]:
# TODO: Expose your model here (API, app, or function)# Save models and preprocessors
import joblib

# Save the trained model and vectorizer
joblib.dump(lr_model, 'sentiment_model.joblib')
joblib.dump(vectorizer, 'sentiment_vectorizer.joblib')
print("✅ Model saved: sentiment_model.joblib")
print("✅ Vectorizer saved: sentiment_vectorizer.joblib")

# Create prediction function for deployment
def predict_sentiment(text, model_path='sentiment_model.joblib',
                      vectorizer_path='sentiment_vectorizer.joblib'):
    """
    Production-ready sentiment prediction function
    """
    # Load model and vectorizer
    model = joblib.load(model_path)
    vectorizer = joblib.load(vectorizer_path)

    # Clean text
    cleaned = clean_arabic_text(text)

    # Vectorize
    text_tfidf = vectorizer.transform([cleaned])

    # Predict
    prediction = model.predict(text_tfidf)[0]
    confidence = model.predict_proba(text_tfidf).max()

    return {
        'text': text,
        'cleaned_text': cleaned,
        'sentiment': prediction,
        'confidence': float(confidence)
    }

# Test the function
test_texts = [
    "الخدمة سيئة جدا ولا انصح بها",
    "ممتاز وسريع شكرا لكم",
    "عادي لا بأس به"
]

print("\n" + "=" * 60)
print("DEPLOYMENT TEST")
print("=" * 60)
for text in test_texts:
    result = predict_sentiment(text)
    print(f"Text: {result['text']}")
    print(f"Sentiment: {result['sentiment']} (Confidence: {result['confidence']:.2f})")
    print("-" * 40)

# Export results for API/Dashboard
output_data = df_clean[['Post Text', 'clean_text', 'sentiment_normalized', 'confidence_score']].copy()

# Save as multiple formats
output_data.to_csv('sentiment_results.csv', index=False, encoding='utf-8-sig')
output_data.to_excel('sentiment_results.xlsx', index=False)
output_data.to_json('sentiment_results.json', orient='records', force_ascii=False, indent=2)

print("\n✅ Results exported to:")
print("  - sentiment_results.csv")
print("  - sentiment_results.xlsx")
print("  - sentiment_results.json")

➡️ **Conclude with:** How you exposed your model (API, app, script), and any challenges.

## Final Notes  

### ✅ What Worked Well  
- **Arabic-specific preprocessing**: Normalizing Arabic characters significantly improved model performance.  
- **Pre-trained models for labeling**: Saved manual annotation effort for thousands of samples.  
- **TF-IDF + Logistic Regression**: Simple, fast, interpretable, and deployment-friendly.  
- **Balanced class weights**: Helped handle the imbalanced dataset.  

### ❌ What Didn’t Add Value  
- **Complex neural networks**: Marginal improvement not worth the computational cost.  
- **Character-level features**: Word-level features were sufficient for Arabic.  
- **Extensive hyperparameter tuning**: Default parameters worked reasonably well.  

### 🎯 Challenges Faced  
- **Arabic text variety**: Mix of dialects and MSA made preprocessing challenging.  
- **Neutral sentiment ambiguity**: Hard to distinguish from mild positive/negative.  
- **Model size for deployment**: Transformer models were too large for production constraints.  

### 🚀 Next Steps for Improvement  
- **Active learning**: Use model uncertainty to identify samples for manual review.  
- **Dialect-specific models**: Train separate models for Egyptian, Gulf, and Levantine Arabic.  
- **Aspect-based sentiment**: Identify which aspects (service, product, price) drive sentiment.  
- **Real-time monitoring**: Deploy as a streaming service for live social media analysis.  
- **Ensemble methods**: Combine multiple models for better accuracy.  
