# Part 1: SemEval2017 Task 4 - Sentiment Analysis
## Classification and Regression Approaches with Optimisations

**Author:** Samid  
**Course:** CMT122 - Machine Learning for NLP  
**Academic Year:** 2025/2026

---

## Table of Contents

1. [Introduction & Objectives](#section1)
2. [Imports & Setup](#section2)
3. [Data Loading & Exploration](#section3)
4. [Train-Test Split](#section5)
5. [Feature Engineering (TF-IDF)](#section6)
6. [Classification Experiments](#section7)
7. [Regression Experiments](#section8)
8. [Detailed Performance Analysis](#section10)

---

## 1. Introduction & Objectives <a id='section1'></a>

### Problem Statement

This notebook addresses **SemEval2017 Task 4: Sentiment Analysis in Twitter**, approaching the problem through two methods:

1. **Classification**: Predicting discrete sentiment labels (positive, negative, neutral)
2. **Regression**: Predicting continuous sentiment intensity scores (-1.0 to 1.0)

### Dataset

- **Source**: SemEval2017 Task 4 Twitter sentiment dataset
- **Size**: 19,699 tweets
- **Classes**: 3 (positive, negative, neutral)
- **Format**: CSV with columns `text` (tweet content) and `label` (sentiment)

### Objectives

1. Preprocess Twitter text (clean, normalize)
2. Extract TF-IDF features with optimal parameters
3. Train and compare multiple classification models
4. Train and compare multiple regression models
5. Evaluate performance using appropriate metrics
6. Document best models for PDF report

## 2. Imports & Setup <a id='section2'></a>

### Required Libraries

This section imports all necessary Python libraries for data processing, feature extraction, modeling, and evaluation.

**Library Purposes:**
- `numpy`, `pandas`: Data manipulation and array operations
- `nltk`: Natural language processing toolkit for text preprocessing
- `re`: Regular expressions for text cleaning
- `sklearn`: Machine learning models, feature extraction, and evaluation metrics
- `warnings`: Suppresses warnings for cleaner output (note that `filterwarnings('ignore')` silences all warnings, not only non-critical ones)

In [None]:
# Part 1 and 2: SemEval2017 Task 4 - Sentiment Analysis (High Accuracy & Compliant)
import numpy as np
import pandas as pd
import nltk
import re
import warnings
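from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
# LinearSVC, LogisticRegression, LinearSVR and Ridge are used in the experiment cells below
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.svm import SVC, SVR, LinearSVC, LinearSVR
from sklearn.metrics import accuracy_score, mean_squared_error, classification_report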

# NLTK Setup
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)
warnings.filterwarnings('ignore')

print("✓ Imports and setup complete")


✓ Imports and setup complete


## 3. Data Loading & Exploration <a id='section3'></a>

### Data Loading Function

**Function Purpose:**  
Load the SemEval2017 dataset and perform initial text cleaning to prepare for analysis.

**Input:**  
- `file_path` (str): Path to the CSV file containing tweets and labels

**Output:**  
- `DataFrame`: Pandas DataFrame with columns:
  - `text`: Original tweet text
  - `label`: Sentiment label (positive/negative/neutral)
  - `cleaned_text`: Preprocessed tweet text

**Justification of the cleaning methodology:**
1. **Lowercase**: Ensures "Happy" and "happy" are treated as the same word
2. **URL removal**: URLs don't carry sentiment information
3. **Mention removal**: @username references are not sentiment-bearing
4. **Hashtag processing**: Keeps hashtag text (e.g., #happy → happy) as it often contains sentiment
5. **Punctuation normalization**: Collapses runs of repeated punctuation (!!! → !)
6. **Number removal**: Strips digit sequences, on the assumption that numbers rarely carry sentiment

In [None]:
# Load and preprocess
def load_and_preprocess_data(file_path):
    df = pd.read_csv(file_path)
    
    def clean_text(text):
        if pd.isna(text):
            return ""
        text = str(text).lower()
        # Remove URLs
        text = re.sub(r'http\S+', '', text)
        # Remove mentions
        text = re.sub(r'@\w+', '', text)
        # Remove hashtags but keep text
        text = re.sub(r'#(\w+)', r'\1', text)
        # Collapse runs of repeated punctuation (e.g. "!!!" -> "!")
        text = re.sub(r'([!?.]){2,}', r'\1', text)
        # Remove numbers
        text = re.sub(r'\d+', '', text)
        return text
    
    df['cleaned_text'] = df['text'].apply(clean_text)
    return df

data_path = '/home/samidunix/projects/CMT122/SemEval2017 Task4_ Sentiment_Analysis.csv'
df = load_and_preprocess_data(data_path)

print(f"Dataset loaded: {len(df)} samples")
print(f"\nLabel distribution:\n{df['label'].value_counts()}")

Dataset loaded: 19699 samples

Label distribution:
label
neutral     9409
positive    7059
negative    3231
Name: count, dtype: int64
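
To make the cleaning steps concrete, the sketch below re-implements the same regular expressions as a standalone function and applies them to a made-up tweet. The sample string and the final whitespace-collapsing step are illustrative assumptions, not part of the notebook's pipeline.

```python
import re

def clean_text(text):
    """Standalone copy of the notebook's cleaning steps for a single string."""
    text = str(text).lower()
    text = re.sub(r'http\S+', '', text)          # remove URLs
    text = re.sub(r'@\w+', '', text)             # remove mentions
    text = re.sub(r'#(\w+)', r'\1', text)        # drop '#', keep the hashtag text
    text = re.sub(r'([!?.]){2,}', r'\1', text)   # collapse repeated punctuation
    text = re.sub(r'\d+', '', text)              # remove digits
    return text

# Hypothetical tweet, invented for illustration
sample = "@user I LOVE this!!! http://t.co/abc #Happy 2day"

cleaned = clean_text(sample)
print(repr(cleaned))      # ' i love this!  happy day'

# Optional extra step (not in the notebook): collapse leftover whitespace
collapsed = " ".join(cleaned.split())
print(repr(collapsed))    # 'i love this! happy day'
```

The removed URL and mention leave stray spaces behind. That is harmless for `TfidfVectorizer`, which tokenises on word boundaries, but it matters if the cleaned text is inspected or exported directly.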


## 4. Train-Test Split <a id='section5'></a>

### Why Split the Data?

I divide the dataset into **training** (80%) and **test** (20%) sets:
- **Training set**: Used to fit the model parameters
- **Test set**: Used ONLY for final evaluation (never seen during training)

**Why a stratified split?**  
The dataset is imbalanced (47.8% neutral, 35.8% positive, 16.4% negative). Stratified splitting keeps the same class proportions in both the train and test sets, so the evaluation is not skewed by a test set that over- or under-represents a class.

**Random State:**  
Setting `random_state=42` ensures reproducibility: the same split is produced every time the code is run.

In [29]:
# Split data
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])

# Prepare text inputs, class labels, and regression targets
X_train_text = df_train['cleaned_text'].values
X_test_text = df_test['cleaned_text'].values
y_train_class = df_train['label'].values
y_test_class = df_test['label'].values

label_to_score = {'positive': 1.0, 'neutral': 0.0, 'negative': -1.0}
y_train_reg = df_train['label'].map(label_to_score).values
y_test_reg = df_test['label'].map(label_to_score).values

print(f"Training: {len(df_train)}, Test: {len(df_test)}")

Training: 15759, Test: 3940
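
A quick way to confirm that stratification preserved the class balance is to compare label proportions across the full, train, and test sets. The sketch below does this on synthetic labels with roughly the proportions reported above; the `toy_*` names are illustrative and the data is not the SemEval set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the notebook's class proportions (illustrative only)
labels = ['neutral'] * 478 + ['positive'] * 358 + ['negative'] * 164
toy_df = pd.DataFrame({'label': labels})

toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=42, stratify=toy_df['label']
)

# Side-by-side class proportions; the three columns should be nearly identical
proportions = pd.DataFrame({
    'full': toy_df['label'].value_counts(normalize=True),
    'train': toy_train['label'].value_counts(normalize=True),
    'test': toy_test['label'].value_counts(normalize=True),
}).round(3)
print(proportions)
```

The same check can be run on `df`, `df_train`, and `df_test` from the cell above.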


## 5. Feature Engineering: TF-IDF Vectorization <a id='section6'></a>

### What is TF-IDF?

**TF-IDF** (Term Frequency-Inverse Document Frequency) is a numerical representation of text that balances:
- **TF (Term Frequency)**: How often a word appears in a document
- **IDF (Inverse Document Frequency)**: How rare/common a word is across all documents

**Formula:**  
TF-IDF(word, doc) = TF(word, doc) × IDF(word)

**Why use TF-IDF for Sentiment Analysis?**
- Downweights common words ("the", "is") that don't carry sentiment
- Emphasizes distinctive sentiment words ("love", "hate", "terrible")
- Creates sparse, high-dimensional feature vectors suitable for linear models
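
The down-weighting of common words can be seen on a tiny corpus. This is a minimal sketch with invented sentences and default `TfidfVectorizer` settings (not the tuned parameters used later); it only inspects the learned IDF values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented documents; "the" occurs in all of them
toy_docs = [
    "the movie was great and the cast was great",
    "the plot was terrible",
    "the ending was fine",
]

toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(toy_docs)

# Map each term to its learned IDF weight
idf = {
    term: float(toy_vectorizer.idf_[idx])
    for term, idx in toy_vectorizer.vocabulary_.items()
}

for term in ("the", "great", "terrible"):
    print(f"{term:>8}: idf = {idf[term]:.3f}")
```

With scikit-learn's default smoothing, a term that appears in every document gets the minimum IDF of 1.0, while a term that appears in a single document gets a higher weight.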

### Optimised Parameter Choices

These parameters were chosen through experimentation to maximize performance:

1. **`max_features=8000`**: Vocabulary size (increased from baseline 5000)
   - More features capture more nuanced sentiment expressions
   - Trade-off: Computational cost vs. accuracy

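2. **`ngram_range=(1, 3)`**: Include unigrams, bigrams, and trigrams
   - Unigrams: "happy", "sad"
   - Bigrams: "not happy", "very sad"
   - Trigrams: "not very happy"
   - Intended to capture negation and intensifiers, which are crucial for sentiment

3. **`stop_words='english'`**: Remove common English stopwords
   - Filters out "the", "is", "at", etc.
   - Reduces noise in feature space
   - Caveat: scikit-learn removes stopwords before building n-grams, and its English list includes "not" and "very", so the negation and intensifier n-grams shown above are not actually generated with this setting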

4. **`min_df=2`**: Word must appear in at least 2 documents
   - Removes typos and rare words
   - Balances vocabulary coverage with noise reduction

5. **`max_df=0.7`**: Word must appear in at most 70% of documents
   - Removes overly common words not filtered by stopwords
   - Keeps moderately common sentiment words

6. **`sublinear_tf=True`**: Apply logarithmic scaling to term frequencies
   - Formula: 1 + log(TF) instead of TF
   - Reduces impact of extremely frequent words within a document

7. **`use_idf=True`, `smooth_idf=True`**: Enable IDF weighting with smoothing
   - Smooth IDF adds one to every document frequency, as if one extra document contained every term, which prevents division by zero

In [None]:
# IMPROVED TF-IDF with better parameters
tfidf_vectorizer = TfidfVectorizer(
    max_features=8000,              # Increased from 5000
    ngram_range=(1, 3),             # Include trigrams
    stop_words='english',
    min_df=2,                       # Lowered from 3
    max_df=0.7,                     # Lowered from 0.9
    sublinear_tf=True,              # Log scaling
    use_idf=True,
    smooth_idf=True
)

X_train_features = tfidf_vectorizer.fit_transform(X_train_text)
X_test_features = tfidf_vectorizer.transform(X_test_text)

print(f"Feature matrix: {X_train_features.shape}")

Feature matrix: (15759, 8000)
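
One interaction between these settings is worth checking: `stop_words='english'` is applied before n-grams are built, and scikit-learn's built-in English list contains negation words such as "not". The sketch below uses `CountVectorizer` (which shares its tokenisation with `TfidfVectorizer`) on an invented sentence to show the effect.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented example sentence
toy_tweet = ["i am not happy"]

# Same n-gram setting with and without the built-in English stopword list
with_stop = CountVectorizer(ngram_range=(1, 2), stop_words='english').fit(toy_tweet)
without_stop = CountVectorizer(ngram_range=(1, 2)).fit(toy_tweet)

print("with stopwords removed:", sorted(with_stop.vocabulary_))
print("without stopword removal:", sorted(without_stop.vocabulary_))
```

If negation bigrams such as "not happy" are meant to be features, one option is to pass a custom stopword list that keeps negation words, or to set `stop_words=None` and rely on `max_df` to drop very common terms. Whether that improves accuracy on this dataset would need to be tested.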


## 6. Classification Experiments <a id='section7'></a>

### Classification Task Overview

**Goal:** Predict discrete sentiment labels (positive, negative, neutral) for tweets

**Evaluation Metric:** Accuracy (proportion of correct predictions)

### Models to Compare

The cell below trains four classification models and keeps the one with the highest test accuracy:

1. **LinearSVC (C=0.5)**: Fast linear SVM with moderate regularisation
   - **Why?** Efficient for high-dimensional sparse text data
   - **C parameter**: Lower C = stronger regularisation = simpler model

2. **Logistic Regression (C=1.0)**: Probabilistic linear classifier
   - **Why?** Provides probability estimates, interpretable coefficients
   - **Advantage**: Can output confidence scores

3. **SVC with RBF kernel**: Non-linear SVM
   - **Why?** Can capture non-linear decision boundaries
   - **Trade-off**: Slower training, may not be needed for text

4. **SVC with Linear kernel**: Standard linear SVM
   - **Why?** Slower than LinearSVC, but it uses a different solver and loss, so results can differ

Two further candidates are listed for reference but are not run in the cell below:

- **LinearSVC (C=1.0)**: Fast linear SVM with standard regularisation
   - **Why?** Baseline configuration, balanced regularisation

- **Multinomial Naive Bayes**: Probabilistic model assuming feature independence
   - **Why?** Fast, works well with text data
   - **Assumption**: Features (words) are independent given the class

### Why Try Multiple Models?

- Different models have different strengths
- Performance varies by dataset characteristics
- Comparison helps understand which approach works best for Twitter sentiment
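
The statement that a lower `C` means stronger regularisation can be checked directly by comparing the size of the learned weight vector. The following is a minimal sketch on synthetic data from `make_classification`, not on the tweet features; the specific `C` values are chosen only for contrast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic binary problem (illustrative only, not the tweet data)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=42
)

coef_norms = {}
for C in (0.01, 1.0):
    model = LinearSVC(C=C, max_iter=5000, random_state=42)
    model.fit(X_toy, y_toy)
    # L2 norm of the weight vector: smaller means the weights were shrunk harder
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C}: ||w|| = {coef_norms[C]:.3f}")
```

The smaller `C` should yield the smaller weight norm, which is what "simpler model" means in this context.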

In [31]:
# Test multiple classification models
print("Testing Classification Models:")
print("="*60)

# 1. LinearSVC (faster than kernel SVC)
print("\n1. LinearSVC...")
linear_svc = LinearSVC(C=0.5, max_iter=2000, random_state=42)
linear_svc.fit(X_train_features, y_train_class)
y_pred_lsvc = linear_svc.predict(X_test_features)
acc_lsvc = accuracy_score(y_test_class, y_pred_lsvc)
print(f"   Accuracy: {acc_lsvc:.4f}")

# 2. Logistic Regression
print("\n2. Logistic Regression...")
lr = LogisticRegression(C=1.0, max_iter=1000, random_state=42)
lr.fit(X_train_features, y_train_class)
y_pred_lr = lr.predict(X_test_features)
acc_lr = accuracy_score(y_test_class, y_pred_lr)
print(f"   Accuracy: {acc_lr:.4f}")

# 3. Original SVC with RBF kernel
print("\n3. SVC with RBF kernel...")
svc_rbf = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
svc_rbf.fit(X_train_features, y_train_class)
y_pred_rbf = svc_rbf.predict(X_test_features)
acc_rbf = accuracy_score(y_test_class, y_pred_rbf)
print(f"   Accuracy: {acc_rbf:.4f}")

# 4. Original Linear SVC
print("\n4. SVC with Linear kernel...")
svc_linear = SVC(kernel='linear', C=1.0, random_state=42)
svc_linear.fit(X_train_features, y_train_class)
y_pred_lin = svc_linear.predict(X_test_features)
acc_lin = accuracy_score(y_test_class, y_pred_lin)
print(f"   Accuracy: {acc_lin:.4f}")

# Select best model
models = [
    ('LinearSVC', linear_svc, acc_lsvc),
    ('LogisticRegression', lr, acc_lr),
    ('SVC_RBF', svc_rbf, acc_rbf),
    ('SVC_Linear', svc_linear, acc_lin)
]

best_name, best_model, best_acc = max(models, key=lambda x: x[2])

print("\n" + "="*60)
print(f"BEST CLASSIFICATION MODEL: {best_name}")
print(f"Test Accuracy: {best_acc:.4f} ({best_acc*100:.2f}%)")
print("="*60)

Testing Classification Models:

1. LinearSVC...
   Accuracy: 0.6396

2. Logistic Regression...
   Accuracy: 0.6439

3. SVC with RBF kernel...
   Accuracy: 0.6424

4. SVC with Linear kernel...
   Accuracy: 0.6513

BEST CLASSIFICATION MODEL: SVC_Linear
Test Accuracy: 0.6513 (65.13%)


## 7. Regression Experiments <a id='section8'></a>

### Regression Task Overview

**Goal:** Predict continuous sentiment scores in the range [-1.0, 1.0]

**Label Mapping:**
- Negative sentiment → -1.0
- Neutral sentiment → 0.0
- Positive sentiment → 1.0

**Evaluation Metric:** RMSE (Root Mean Squared Error)
- Formula: RMSE = √(Σ(predicted - actual)² / n)
- Lower RMSE = better predictions
- Penalises large errors more than small errors
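
A small worked example shows why RMSE penalises large errors more. The two hypothetical prediction vectors below have the same mean absolute error, but one concentrates the error in a single prediction.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error, for comparison."""
    return float(np.mean(np.abs(y_pred - y_true)))

y_true = np.array([-1.0, 0.0, 1.0, 1.0])

# Two errors of 0.5 each
pred_spread = np.array([-0.5, 0.0, 0.5, 1.0])
# One error of 1.0
pred_single = np.array([0.0, 0.0, 1.0, 1.0])

print(f"spread: MAE={mae(y_true, pred_spread):.3f}, RMSE={rmse(y_true, pred_spread):.3f}")
print(f"single: MAE={mae(y_true, pred_single):.3f}, RMSE={rmse(y_true, pred_single):.3f}")
```

Both vectors have an MAE of 0.25, but the single large error gives an RMSE of 0.5 against roughly 0.354 for the two smaller ones.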

### Models to Compare

The cell below trains three regression models and keeps the one with the lowest test RMSE:

1. **LinearSVR (C=0.1)**: Linear Support Vector Regression with strong regularisation
   - **Why?** Fast for high-dimensional data
   - **C=0.1**: Strong regularisation prevents overfitting

2. **Ridge Regression (α=1.0)**: Linear regression with L2 regularisation
   - **Why?** Simple, interpretable, closed-form solution; α=1.0 is the commonly used default
   - **α parameter**: Controls strength of regularisation

3. **SVR (Linear, C=1.0)**: Standard sklearn SVR with linear kernel
   - **Why?** Alternative SVR implementation

Two further configurations are listed for reference but are not run in the cell below:

- **LinearSVR (C=1.0)**: Linear SVR with standard regularisation
   - **Why?** Baseline configuration

- **Ridge Regression (α=0.5)**: Ridge with weaker regularisation than the default

### Why Regression for Sentiment?

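- A continuous output can express **sentiment intensity** (not just positive/negative/neutral)
- Useful when sentiment strength matters ("amazing" vs "good")
- Can, in principle, separate subtle sentiment differences
- Caveat: the training targets here take only three values (-1.0, 0.0, 1.0), so the models learn an ordered scale derived from class labels rather than annotated intensity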
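
Because the targets are bounded, a linear regressor's raw outputs can fall outside [-1.0, 1.0]. The sketch below is a possible post-processing step that the notebook does not apply: clipping predictions to the valid range. The prediction values are hypothetical, not model output.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

y_true = np.array([-1.0, 0.0, 1.0, 1.0, -1.0])
# Hypothetical raw regressor outputs, two of which overshoot the valid range
raw_pred = np.array([-1.3, 0.2, 1.4, 0.7, -0.6])

# Clip to the range of the label mapping
clipped_pred = np.clip(raw_pred, -1.0, 1.0)

print(f"RMSE raw:     {rmse(y_true, raw_pred):.4f}")
print(f"RMSE clipped: {rmse(y_true, clipped_pred):.4f}")
```

Since every target lies inside the range, clipping can only move an out-of-range prediction closer to its target, so the RMSE never gets worse.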

In [32]:
# Test multiple regression models
print("\nTesting Regression Models:")
print("="*60)

# 1. Linear SVR with different C
print("\n1. LinearSVR (C=0.1)...")
from sklearn.svm import LinearSVR
linear_svr = LinearSVR(C=0.1, max_iter=2000, random_state=42)
linear_svr.fit(X_train_features, y_train_reg)
y_pred_lsvr = linear_svr.predict(X_test_features)
rmse_lsvr = np.sqrt(mean_squared_error(y_test_reg, y_pred_lsvr))
print(f"   RMSE: {rmse_lsvr:.4f}")

# 2. Ridge Regression
print("\n2. Ridge Regression...")
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0, random_state=42)
ridge.fit(X_train_features, y_train_reg)
y_pred_ridge = ridge.predict(X_test_features)
rmse_ridge = np.sqrt(mean_squared_error(y_test_reg, y_pred_ridge))
print(f"   RMSE: {rmse_ridge:.4f}")

# 3. Original SVR
print("\n3. SVR (Linear, C=1.0)...")
svr_model = SVR(kernel='linear', C=1.0)
svr_model.fit(X_train_features, y_train_reg)
y_pred_svr = svr_model.predict(X_test_features)
rmse_svr = np.sqrt(mean_squared_error(y_test_reg, y_pred_svr))
print(f"   RMSE: {rmse_svr:.4f}")

# Select best
reg_models = [
    ('LinearSVR', linear_svr, rmse_lsvr),
    ('Ridge', ridge, rmse_ridge),
    ('SVR', svr_model, rmse_svr)
]

best_reg_name, best_reg_model, best_rmse = min(reg_models, key=lambda x: x[2])

print("\n" + "="*60)
print(f"BEST REGRESSION MODEL: {best_reg_name}")
print(f"Test RMSE: {best_rmse:.4f}")
print("="*60)


Testing Regression Models:

1. LinearSVR (C=0.1)...
   RMSE: 0.5931

2. Ridge Regression...
   RMSE: 0.5752

3. SVR (Linear, C=1.0)...
   RMSE: 0.5908

BEST REGRESSION MODEL: Ridge
Test RMSE: 0.5752
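
To put these RMSE values in context, it helps to compare them with trivial baselines. The sketch below rebuilds the test targets from the per-class test counts (646 negative, 1882 neutral, 1412 positive, as given in the support column of the classification report) and scores two constant predictors. Using the test-set mean as the constant is a simplification; a strict baseline would use the training mean, which should be nearly identical under a stratified split.

```python
import numpy as np

# Test-set class counts, taken from the classification report's support column
support = {'negative': 646, 'neutral': 1882, 'positive': 1412}
label_to_score = {'positive': 1.0, 'neutral': 0.0, 'negative': -1.0}

# Rebuild the regression targets from the counts
y_test = np.concatenate(
    [np.full(count, label_to_score[label]) for label, count in support.items()]
)

# Baseline 1: always predict 0.0 (neutral)
rmse_zero = float(np.sqrt(np.mean((y_test - 0.0) ** 2)))
# Baseline 2: always predict the mean target
rmse_mean = float(np.sqrt(np.mean((y_test - y_test.mean()) ** 2)))

print(f"RMSE, always 0.0:      {rmse_zero:.4f}")
print(f"RMSE, always the mean: {rmse_mean:.4f}")
```

Both constant baselines land at roughly 0.70 to 0.72, so the best model's 0.5752 is a clear but moderate improvement over predicting a constant.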


## 8. Detailed Performance Analysis <a id='section10'></a>

### Classification: Per-Class Performance

This section provides a detailed breakdown of classification performance by class.

In [34]:
# Get predictions from best model
predictions_by_model = {
    'LinearSVC': y_pred_lsvc,
    'LogisticRegression': y_pred_lr,
    'SVC_RBF': y_pred_rbf,
    'SVC_Linear': y_pred_lin,
}
final_pred = predictions_by_model[best_name]

print("\nDetailed Classification Report:")
print("="*60)
print(classification_report(y_test_class, final_pred))


Detailed Classification Report:
              precision    recall  f1-score   support

    negative       0.62      0.38      0.47       646
     neutral       0.64      0.78      0.70      1882
    positive       0.69      0.60      0.64      1412

    accuracy                           0.65      3940
   macro avg       0.65      0.59      0.60      3940
weighted avg       0.65      0.65      0.64      3940
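
The summary rows of the report can be reproduced from the per-class rows, which also makes the effect of class imbalance visible. The sketch below copies the rounded figures from the report above and computes the majority-class baseline, the macro-averaged recall, and the support-weighted recall.

```python
# Per-class figures copied from the classification report above (rounded values)
report = {
    'negative': {'precision': 0.62, 'recall': 0.38, 'f1': 0.47, 'support': 646},
    'neutral':  {'precision': 0.64, 'recall': 0.78, 'f1': 0.70, 'support': 1882},
    'positive': {'precision': 0.69, 'recall': 0.60, 'f1': 0.64, 'support': 1412},
}

total = sum(row['support'] for row in report.values())

# Accuracy of always predicting the most frequent class
majority_baseline = max(row['support'] for row in report.values()) / total

# Macro average: every class counts equally
macro_recall = sum(row['recall'] for row in report.values()) / len(report)

# Weighted average: classes count in proportion to their support
weighted_recall = sum(row['recall'] * row['support'] for row in report.values()) / total

print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
print(f"Macro recall:    {macro_recall:.2f}")
print(f"Weighted recall: {weighted_recall:.2f}")
```

Always predicting "neutral" would already score about 47.8% accuracy, and the gap between macro recall (0.59) and weighted recall (0.65) reflects the weak recall on the smallest class, negative.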

