# <span style="color: #CCFF00;">Project TruthMiner - Mining through headlines to uncover real vs fake

## <span style="color: #98FF98;">📌 Project Overview

TruthMiner is a Natural Language Processing (NLP) project that focuses on classifying news headlines as either **real (1)** or **fake (0)**.  
The goal is to build a robust machine learning pipeline that can process raw text, extract meaningful features, and accurately distinguish between authentic and fabricated news.

### <span style="color: #77DD77;">Main Goals


1. **Text Preprocessing & Feature Engineering**  
   - Clean and transform raw news headlines (remove stopwords, punctuation, apply stemming/lemmatization, and vectorize text using TF-IDF or Bag-of-Words).

2. **Model Development**  
   - Build and train a classifier (e.g., Logistic Regression, Naïve Bayes, SVM, Random Forest, or deep learning) to distinguish between real and fake headlines.

3. **Prediction on Unseen Data**  
   - Apply the trained model to `testing_data.csv`, replacing placeholder labels (`2`) with predicted labels (`0` = fake, `1` = real).

4. **Evaluation & Accuracy Estimation**  
   - Evaluate performance using metrics such as Accuracy, Precision, Recall, and F1-score.  
   - Provide an estimation of how well the model is expected to perform in practice.


## <span style="color: #FFD1DC;"> Import Libraries

In [87]:
import pandas as pd
import numpy as np
import re
import string
from collections import Counter

# NLP libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer


# Feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split

#Models and Evaluation
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

## <span style="color: #00FFFF;"> Training Data 

### <span style="color: #30D5C8;">Loading the Training Data 

In [88]:
data_train = pd.read_csv(r"C:\DSML bootcamp\Week7\project-3-nlp\dataset\training_data.csv", sep="\t", header=None, names=["label", "headline"])


In [89]:
data_train

| | label | headline |
|---|---|---|
| 0 | 0 | donald trump sends out embarrassing new year‚s... |
| 1 | 0 | drunk bragging trump staffer started russian c... |
| 2 | 0 | sheriff david clarke becomes an internet joke ... |
| 3 | 0 | trump is so obsessed he even has obama‚s name ... |
| 4 | 0 | pope francis just called out donald trump duri... |
| ... | ... | ... |
| 34147 | 1 | tears in rain as thais gather for late king's ... |
| 34148 | 1 | pyongyang university needs non-u.s. teachers a... |
| 34149 | 1 | philippine president duterte to visit japan ah... |
| 34150 | 1 | japan's abe may have won election\tbut many do... |


* After inspecting the data, the next step is to split it into training and validation partitions. A separate CSV file is reserved for the final predictions, but holding out part of the labelled data lets us estimate accuracy before the final submission.
* During this development phase we train on `X_train` and validate on `X_val`.

### <span style="color: #30D5C8;">X y Split

In [90]:
X=data_train["headline"]
y=data_train["label"]

In [91]:
X_train, X_val, y_train, y_val = \
    train_test_split(data_train["headline"], data_train["label"], test_size=0.3, random_state=42, stratify=y)
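
As a quick sanity check on the split, the partition sizes can be worked out by hand. The sketch below assumes scikit-learn's rounding for a fractional `test_size` (held-out count rounded up, the remainder used for training) and takes the 34,152-row total from the shape printed later in this notebook. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the held-out count is rounded up and the rest is training data.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

# 34152 rows in training_data.csv, 30% held out for validation
n_train, n_val = split_sizes(34152, 0.3)
print(n_train, n_val)
```

The two counts should agree with the row counts of the TF-IDF matrices reported in the Feature Engineering section.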

### <span style="color: #30D5C8;">Data Preprocessing

This process involves cleaning and normalizing raw text.  
In other words, it transforms messy raw text into a cleaned text batch.  

**Purpose:** Make text consistent and remove noise.  

 🔄 Pipeline

- **Input:** Raw messy text  
- **Output:** Clean, normalized text (still text!)

**Steps:**
1. Convert all text to **lowercase**  
2. Remove **punctuation** (where necessary)  
3. Perform **tokenization** (break sentences into words)  
4. Remove **stopwords** (common words like *the, is, and*)  
5. Apply **stemming** and/or **lemmatization** to reduce words to their root form  
6. Obtain **cleaned text**
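
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the cleaning function used later in the notebook: the stopword set is a tiny hand-picked assumption (NLTK's English list is much longer), tokenization is a plain whitespace split, and stemming/lemmatization is left out.

```python
import string

# Tiny illustrative stopword list (an assumption, not NLTK's full list).
STOPWORDS = {"the", "is", "and", "an", "a", "to", "for", "in", "out", "this"}

def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    tokens = [tok for tok in tokens if tok not in STOPWORDS]
    return " ".join(tokens)

print(preprocess("Sheriff David Clarke Becomes An Internet Joke!"))
```

The output is still a string, which is what the feature-extraction step expects as input.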

```python
# Schematic only: the helper functions below are placeholders, not defined in this notebook.
text = text.lower()                          # ✅ Preprocessing
text = remove_punctuation(text)              # ✅ Preprocessing
tokens = tokenize(text)                      # ✅ Preprocessing
tokens = remove_stopwords(tokens)            # ✅ Preprocessing
tokens = apply_stemming(tokens)              # ✅ Preprocessing
clean_text = ' '.join(tokens)
```


In [92]:
# Look at actual text samples
print("Sample headlines by class:")

for label in data_train['label'].unique():
    print(f"\n--- Class {label} samples ---")
    
    samples = data_train[data_train['label'] == label]['headline'].head(3)
    for i, headline in enumerate(samples, 1):
        print(f"{i}: '{headline}'")

Sample headlines by class:

--- Class 0 samples ---
1: 'donald trump sends out embarrassing new year‚s eve message; this is disturbing'
2: 'drunk bragging trump staffer started russian collusion investigation'
3: 'sheriff david clarke becomes an internet joke for threatening to poke people ‚in the eye‚'

--- Class 1 samples ---
1: 'as u.s. budget fight looms	republicans flip their fiscal script'
2: 'u.s. military to accept transgender recruits on monday: pentagon'
3: 'senior u.s. republican senator: 'let mr. mueller do his job''


In [93]:
# Check for missing values
print("Missing values:")
print(data_train.isnull().sum())

# Handle missing text
print(f"Null headlines: {data_train['headline'].isnull().sum()}")
print(f"Empty headlines: {(data_train['headline'] == '').sum()}")

Missing values:
label       0
headline    0
dtype: int64
Null headlines: 0
Empty headlines: 0


In [94]:
def clean_single_headline(text):
    """Clean individual headline text"""
    
    # Handle missing values
    if pd.isna(text):
        return ""
    
    # Ensure string type
    text = str(text)
    
    # Convert to lowercase
    text = text.lower()
    
    # Fix whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    # Handle excessive punctuation
    text = re.sub(r'[!]{2,}', '!', text)
    text = re.sub(r'[?]{2,}', '?', text)


    # Tokenize, then remove English stopwords
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]
    text = ' '.join(filtered_tokens)

    # Expand contractions.
    # Caveat: this runs after tokenization and uses plain substring replacement,
    # so a quoted word such as 'massive' can be altered (see "amassive" in the
    # cleaned test data further down).
    contractions = {
        "can't": "cannot", "won't": "will not", "n't": " not",
        "'re": " are", "'ve": " have", "'ll": " will",
        "'d": " would", "'m": " am"
    }
    
    for contraction, expansion in contractions.items():
        text = text.replace(contraction, expansion)
    
    return text
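
One possible refinement, shown only as a sketch: expand contractions before tokenization and require the suffix to be attached to a word, so that an opening quote followed by `m` (as in `'massive'`) is left alone. `expand_contractions` is a hypothetical helper and is not used anywhere else in this notebook; it reuses the same contraction table as the function above.

```python
import re

# Whole-word contractions are handled first, then the suffix forms.
_WHOLE = {"can't": "cannot", "won't": "will not"}
_SUFFIX = {"n't": " not", "'re": " are", "'ve": " have",
           "'ll": " will", "'d": " would", "'m": " am"}

def expand_contractions(text):
    """Expand contractions only when the suffix is attached to a word."""
    for short, full in _WHOLE.items():
        text = re.sub(rf"\b{re.escape(short)}\b", full, text)
    for suffix, full in _SUFFIX.items():
        # (?<=[a-z]) requires a letter immediately before the suffix;
        # the trailing \b requires the suffix to end the word.
        text = re.sub(rf"(?<=[a-z]){re.escape(suffix)}\b", full, text)
    return text

print(expand_contractions("i'm sure they don't and can't"))
print(expand_contractions("u.n. seeks 'massive' aid"))
```

The input is assumed to be lowercased already, which matches the order of operations in `clean_single_headline`.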

In [95]:
data_train["headline_cleaned"]=data_train["headline"].apply(clean_single_headline)

In [96]:
data_train

| | label | headline | headline_cleaned |
|---|---|---|---|
| 0 | 0 | donald trump sends out embarrassing new year‚s... | donald trump sends embarrassing new year‚s eve... |
| 1 | 0 | drunk bragging trump staffer started russian c... | drunk bragging trump staffer started russian c... |
| 2 | 0 | sheriff david clarke becomes an internet joke ... | sheriff david clarke becomes internet joke thr... |
| 3 | 0 | trump is so obsessed he even has obama‚s name ... | trump obsessed even obama‚s name coded website... |
| 4 | 0 | pope francis just called out donald trump duri... | pope francis called donald trump christmas speech |
| ... | ... | ... | ... |
| 34147 | 1 | tears in rain as thais gather for late king's ... | tears rain thais gather late king 's funeral |
| 34148 | 1 | pyongyang university needs non-u.s. teachers a... | pyongyang university needs non-u.s. teachers t... |
| 34149 | 1 | philippine president duterte to visit japan ah... | philippine president duterte visit japan ahead... |
| 34150 | 1 | japan's abe may have won election\tbut many do... | japan 's abe may election many not want pm |


-   The headlines change only slightly after cleaning. To verify this, we define a function that prints the original and cleaned headlines side by side.

In [97]:
def visual_comparison(dataframe, num_samples=5):
    """
    Show original vs cleaned headlines side by side
    """
    print("🔍 VISUAL COMPARISON - Original vs Cleaned")
    print("="*80)
    
    for i in range(min(num_samples, len(dataframe))):
        original = dataframe['headline'].iloc[i]
        cleaned = dataframe['headline_cleaned'].iloc[i]
        label = dataframe['label'].iloc[i]
        
        # Show if anything changed
        changed = "CHANGED" if original != cleaned else " NO CHANGE"
        
        print(f"\n📰 Sample {i+1} (Label: {label}) - {changed}")
        print(f"Original:  '{original}'")
        print(f"Cleaned:   '{cleaned}'")
        
        # Show specific differences
        if original != cleaned:
            print(f"Length:    {len(original)} → {len(cleaned)} chars")
            print(f"Words:     {len(original.split())} → {len(cleaned.split())} words")

In [98]:
visual_comparison(data_train, 5)

🔍 VISUAL COMPARISON - Original vs Cleaned

📰 Sample 1 (Label: 0) - CHANGED
Original:  'donald trump sends out embarrassing new year‚s eve message; this is disturbing'
Cleaned:   'donald trump sends embarrassing new year‚s eve message ; disturbing'
Length:    78 → 67 chars
Words:     12 → 10 words

📰 Sample 2 (Label: 0) -  NO CHANGE
Original:  'drunk bragging trump staffer started russian collusion investigation'
Cleaned:   'drunk bragging trump staffer started russian collusion investigation'

📰 Sample 3 (Label: 0) - CHANGED
Original:  'sheriff david clarke becomes an internet joke for threatening to poke people ‚in the eye‚'
Cleaned:   'sheriff david clarke becomes internet joke threatening poke people ‚in eye‚'
Length:    89 → 75 chars
Words:     15 → 11 words

📰 Sample 4 (Label: 0) - CHANGED
Original:  'trump is so obsessed he even has obama‚s name coded into his website (images)'
Cleaned:   'trump obsessed even obama‚s name coded website ( images )'
Length:    77 → 57 chars
Words: 

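-   The comparison shows that the raw headlines are already lowercased and largely free of noise; the cleaning step mainly removes stopwords and separates punctuation. Because the TF-IDF vectorizer below applies its own English stopword filter, we can move on to Feature Engineering.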

### <span style="color: #30D5C8;"> Feature Engineering


Feature Engineering transforms cleaned text into numerical representations  
that machine learning models can understand and process.  

- **Input:** Clean text  
- **Output:** Numerical vectors or matrices  
- **Purpose:** Convert text into a format suitable for ML models  

 🔄 Pipeline

**Clean Text → Numerical Representation**

Common techniques include:  
1. **Bag of Words (BoW)** – word counts in a document  
2. **TF-IDF (Term Frequency–Inverse Document Frequency)** – weighted word importance  
3. **Word Embeddings** (e.g., Word2Vec, GloVe, FastText) – semantic vector representations  
4. **Transformer Embeddings** (e.g., BERT, GPT) – contextualized text representations  
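
To make the difference between the first two techniques concrete, here is a small hand-rolled sketch on a made-up three-headline corpus. It uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation, which to my understanding mirrors the default settings of `TfidfVectorizer`; the corpus and the helper names are assumptions for illustration only.

```python
import math
from collections import Counter

# Made-up toy corpus, not taken from the dataset.
docs = [
    "trump video goes viral",
    "senate passes tax bill",
    "trump signs tax bill",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = Counter(tok for doc in tokenized for tok in set(doc))

def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(tokens):
    """Raw term count times IDF, then L2-normalised."""
    counts = Counter(tokens)
    weights = [counts[t] * idf(t) for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

# Bag of Words: plain counts per vocabulary term.
bow = [[Counter(doc)[t] for t in vocab] for doc in tokenized]
tfidf_rows = [tfidf_vector(doc) for doc in tokenized]

print(vocab)
print(bow[0])
print([round(w, 3) for w in tfidf_rows[0]])
```

In the Bag-of-Words row every present word counts as 1, whereas in the TF-IDF row a word that appears in several documents (`trump`) receives a lower weight than words unique to one headline (`video`, `viral`).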

#### <span style="color: #A7C7E7;"> TF-IDF

In [99]:
# Create TF-IDF features
tfidf = TfidfVectorizer(
    max_features=5000,
    stop_words='english', 
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.95
)

X_train_tfidf = tfidf.fit_transform(X_train)
X_val_tfidf = tfidf.transform(X_val)

print("TF-IDF Complete!")
print(f"Training features: {X_train_tfidf.shape}")
print(f"Validation features: {X_val_tfidf.shape}")

TF-IDF Complete!
Training features: (23906, 5000)
Validation features: (10246, 5000)


#### <span style="color: #89CFF0;">Insights and Analysis of TF-IDF Results:

In [100]:
# Get feature names
feature_names = tfidf.get_feature_names_out()
print(f"Total features: {len(feature_names)}")
print(f"Sample features: {list(feature_names[:20])}")

# Analyzing fake vs real word preferences:
fake_indices = y_train == 0
real_indices = y_train == 1

# Convert to numpy arrays to avoid pandas/sparse matrix issues
if hasattr(X_train_tfidf, 'toarray'):
    # If it's a sparse matrix
    X_train_array = X_train_tfidf.toarray()
else:
    # If it's already a dense array or DataFrame
    X_train_array = np.array(X_train_tfidf)

# Calculate means for fake and real news
fake_means = X_train_array[fake_indices].mean(axis=0)
real_means = X_train_array[real_indices].mean(axis=0)

# Terms with the highest mean TF-IDF score within fake news:
fake_top = fake_means.argsort()[-10:][::-1]
print("\nTop fake news words:", [feature_names[i] for i in fake_top])

# Terms with the highest mean TF-IDF score within real news:
real_top = real_means.argsort()[-10:][::-1] 
print("Top real news words:", [feature_names[i] for i in real_top])

# Showing the actual TF-IDF scores for context:
print("\nFake news word scores:")
for i in fake_top[:5]:
    print(f"   '{feature_names[i]}': {fake_means[i]:.4f}")

print("\nReal news word scores:")  
for i in real_top[:5]:
    print(f"   '{feature_names[i]}': {real_means[i]:.4f}")

# Optional: Show the difference between fake and real news word usage
print("\nWords more distinctive to fake news (fake - real scores):")
diff_scores = fake_means - real_means
fake_distinctive = diff_scores.argsort()[-10:][::-1]
for i in fake_distinctive[:5]:
    print(f"   '{feature_names[i]}': fake={fake_means[i]:.4f}, real={real_means[i]:.4f}, diff={diff_scores[i]:.4f}")

print("\nWords more distinctive to real news (real - fake scores):")
real_distinctive = diff_scores.argsort()[:10]
for i in real_distinctive[:5]:
    print(f"   '{feature_names[i]}': fake={fake_means[i]:.4f}, real={real_means[i]:.4f}, diff={diff_scores[i]:.4f}")

Total features: 5000
Sample features: ['10', '100', '100 days', '100 million', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '200', '2000', '2008', '2010', '2012', '2015']

Top fake news words: ['trump', 'video', 'obama', 'hillary', 'just', 'donald', 'gop', 'clinton', 'donald trump', 'president']
Top real news words: ['trump', 'says', 'house', 'russia', 'white house', 'white', 'senate', 'republican', 'china', 'tax']

Fake news word scores:
   'trump': 0.0454
   'video': 0.0443
   'obama': 0.0192
   'hillary': 0.0188
   'just': 0.0153

Real news word scores:
   'trump': 0.0371
   'says': 0.0267
   'house': 0.0186
   'russia': 0.0123
   'white house': 0.0120

Words more distinctive to fake news (fake - real scores):
   'video': fake=0.0443, real=0.0003, diff=0.0440
   'hillary': fake=0.0188, real=0.0007, diff=0.0181
   'just': fake=0.0153, real=0.0002, diff=0.0151
   'obama': fake=0.0192, real=0.0086, diff=0.0106
   'gop': fake=0.0105, real=0.0001, diff=0.0105

Words more d

### <span style="color: #30D5C8;"> Models and Evaluation

#### <span style="color: #A7C7E7;"> Logistic Regression

In [101]:
# Initialize Logistic Regression
log_reg = LogisticRegression(max_iter=1000, random_state=42)

# Train the model
log_reg.fit(X_train_tfidf, y_train)

# Predictions
y_pred = log_reg.predict(X_val_tfidf)

# Evaluation
print(" Accuracy:", accuracy_score(y_val, y_pred))
print("\n Classification Report:\n", classification_report(y_val, y_pred))
print("\n Confusion Matrix:\n", confusion_matrix(y_val, y_pred))

 Accuracy: 0.9287526839742338

 Classification Report:
               precision    recall  f1-score   support

           0       0.94      0.92      0.93      5272
           1       0.91      0.94      0.93      4974

    accuracy                           0.93     10246
   macro avg       0.93      0.93      0.93     10246
weighted avg       0.93      0.93      0.93     10246


 Confusion Matrix:
 [[4831  441]
 [ 289 4685]]
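
The figures in the classification report can be re-derived from the confusion matrix by hand. The sketch below uses the four counts printed above and treats class 1 (real) as the positive class; rows are true labels and columns are predicted labels.

```python
# Confusion matrix layout: [[TN, FP],
#                           [FN, TP]] with class 1 (real) as positive.
tn, fp = 4831, 441
fn, tp = 289, 4685

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)          # of predicted real, how many are real
recall = tp / (tp + fn)             # of actual real, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

Rounded to two decimals, these values should match the class `1` row of the report.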


#### <span style="color: #A7C7E7;"> Naive Bayes

In [102]:
# Initialize Naive Bayes (MultinomialNB is a common choice for count/TF-IDF text features)
nb = MultinomialNB()

# Train the model
nb.fit(X_train_tfidf, y_train)

# Predictions
y_pred = nb.predict(X_val_tfidf)

# Evaluation
print(" Accuracy:", accuracy_score(y_val, y_pred))
print("\nClassification Report:\n", classification_report(y_val, y_pred))
print("\n Confusion Matrix:\n", confusion_matrix(y_val, y_pred))

 Accuracy: 0.9227991411282451

Classification Report:
               precision    recall  f1-score   support

           0       0.92      0.93      0.93      5272
           1       0.93      0.91      0.92      4974

    accuracy                           0.92     10246
   macro avg       0.92      0.92      0.92     10246
weighted avg       0.92      0.92      0.92     10246


 Confusion Matrix:
 [[4922  350]
 [ 441 4533]]


#### <span style="color: #A7C7E7;"> Random Forest Classifier

In [103]:
# ----- 1) Train -----
rf = RandomForestClassifier(
    n_estimators=400,
    max_depth=None,           # try 20–60 if you want to regularize
    max_features="sqrt",      # good default for high-dim TF-IDF
    min_samples_split=2,
    min_samples_leaf=1,
    n_jobs=-1,
    random_state=42,
    class_weight="balanced"   # helps if classes are imbalanced; remove if not needed
)
rf.fit(X_train_tfidf, y_train)

# ----- 2) Validate -----
y_pred = rf.predict(X_val_tfidf)
print(" Accuracy:", accuracy_score(y_val, y_pred))
print("\n Classification Report:\n", classification_report(y_val, y_pred, digits=4))
print("\n Confusion Matrix:\n", confusion_matrix(y_val, y_pred))

 Accuracy: 0.9133320320124927

 Classification Report:
               precision    recall  f1-score   support

           0     0.9186    0.9124    0.9155      5272
           1     0.9078    0.9144    0.9111      4974

    accuracy                         0.9133     10246
   macro avg     0.9132    0.9134    0.9133     10246
weighted avg     0.9134    0.9133    0.9133     10246


 Confusion Matrix:
 [[4810  462]
 [ 426 4548]]


#### <span style="color: #A7C7E7;"> Linear SVM

In [104]:
# Train
svm = LinearSVC(C=1.0, class_weight="balanced", random_state=42)  # set class_weight=None if balanced
svm.fit(X_train_tfidf, y_train)

# Validate
y_pred = svm.predict(X_val_tfidf)
print(" Accuracy:", accuracy_score(y_val, y_pred))
print("\n Classification Report:\n", classification_report(y_val, y_pred, digits=4))
print("\n Confusion Matrix:\n", confusion_matrix(y_val, y_pred))

 Accuracy: 0.9318758539918017

 Classification Report:
               precision    recall  f1-score   support

           0     0.9441    0.9222    0.9330      5272
           1     0.9195    0.9421    0.9307      4974

    accuracy                         0.9319     10246
   macro avg     0.9318    0.9322    0.9319     10246
weighted avg     0.9322    0.9319    0.9319     10246


 Confusion Matrix:
 [[4862  410]
 [ 288 4686]]


#### <span style="color: #A7C7E7;"> XGBoost Classifier

In [105]:
# Compute the class ratio used for scale_pos_weight
pos = (y_train == 1).sum()
neg = (y_train == 0).sum()
scale_pos_weight = neg / pos if pos > 0 else 1.0

print(f"Class distribution: Fake={neg}, Real={pos}")
print(f"Scale pos weight: {scale_pos_weight:.3f}")

# XGBoost model
xgb = XGBClassifier(
    objective="binary:logistic",
    eval_metric="logloss",
    n_estimators=800,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,
    min_child_weight=1,
    tree_method="hist",
    n_jobs=-1,
    random_state=42,
    scale_pos_weight=scale_pos_weight
)

Class distribution: Fake=12300, Real=11606
Scale pos weight: 1.060


In [106]:
xgb.fit(X_train_tfidf, y_train)



In [107]:
# Make predictions
y_pred = xgb.predict(X_val_tfidf)
y_pred_proba = xgb.predict_proba(X_val_tfidf)[:, 1]

In [108]:
print("Accuracy:", accuracy_score(y_val, y_pred))
print("\n Classification Report:\n", classification_report(y_val, y_pred, digits=4))
print("\n Confusion Matrix:\n", confusion_matrix(y_val, y_pred))

Accuracy: 0.88834667187195

 Classification Report:
               precision    recall  f1-score   support

           0     0.9456    0.8308    0.8845      5272
           1     0.8411    0.9493    0.8920      4974

    accuracy                         0.8883     10246
   macro avg     0.8934    0.8901    0.8882     10246
weighted avg     0.8949    0.8883    0.8881     10246


 Confusion Matrix:
 [[4380  892]
 [ 252 4722]]


#### <span style="color: #FFE5B4;"> Results

In [109]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate_model(name, model, X_val, y_val):
    """Return accuracy, precision, recall, F1 for given model."""
    y_pred = model.predict(X_val)
    acc = accuracy_score(y_val, y_pred)
    p, r, f1, _ = precision_recall_fscore_support(y_val, y_pred, average="binary", zero_division=0)
    return {"Model": name, "Accuracy": acc, "Precision": p, "Recall": r, "F1": f1}

In [110]:
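# Fit each model on X_train_tfidf / y_train. Some settings differ from the individual cells above
# (e.g. class_weight for Logistic Regression, alpha for Naive Bayes, n_estimators for XGBoost),
# so the scores can differ slightly.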
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42),
    "Naive Bayes": MultinomialNB(alpha=0.5),
    "Random Forest": RandomForestClassifier(
        n_estimators=400, max_depth=None, max_features="sqrt", class_weight="balanced", random_state=42, n_jobs=-1
    ),
    "Linear SVM": LinearSVC(C=1.0, class_weight="balanced", random_state=42),
    "XGBoost": XGBClassifier(
        objective="binary:logistic",
        eval_metric="logloss",
        n_estimators=500,
        learning_rate=0.05,
        max_depth=6,
        subsample=0.8,
        colsample_bytree=0.8,
        tree_method="hist",
        random_state=42,
        scale_pos_weight=((y_train==0).sum() / (y_train==1).sum())  # handles imbalance
    )
}

results = []
for name, model in models.items():
    print(f" Training {name}...")
    model.fit(X_train_tfidf, y_train)
    res = evaluate_model(name, model, X_val_tfidf, y_val)
    results.append(res)

# Convert to DataFrame and sort by validation accuracy (highest first)
df_results = pd.DataFrame(results).sort_values("Accuracy", ascending=False)

print("\n MODEL PERFORMANCE RANKING (by Test Accuracy)")
print("="*60)
print(df_results.to_string(index=False, float_format='%.4f'))

# Highlight the winner
best_model = df_results.iloc[0]
print(f"\n BEST PERFORMING MODEL:")
print(f"   Model: {best_model['Model']}")
print(f"   Test Accuracy: {best_model['Accuracy']:.4f}")
print(f"   F1-Score: {best_model['F1']:.4f}")

 Training Logistic Regression...
 Training Naive Bayes...
 Training Random Forest...
 Training Linear SVM...
 Training XGBoost...

 MODEL PERFORMANCE RANKING (by Test Accuracy)
              Model  Accuracy  Precision  Recall     F1
         Linear SVM    0.9319     0.9195  0.9421 0.9307
Logistic Regression    0.9277     0.9103  0.9441 0.9269
        Naive Bayes    0.9233     0.9291  0.9115 0.9202
      Random Forest    0.9133     0.9078  0.9144 0.9111
            XGBoost    0.8751     0.8177  0.9558 0.8813

 BEST PERFORMING MODEL:
   Model: Linear SVM
   Test Accuracy: 0.9319
   F1-Score: 0.9307


### <span style="color: #30D5C8;"> Transformers

#### <span style="color: #A7C7E7;"> DistilBERT

In [111]:
import pandas as pd
from sklearn.metrics import accuracy_score
from transformers import pipeline

In [112]:
# 1. Load your dataset
df_training = pd.read_csv(r"C:\DSML bootcamp\Week7\project-3-nlp\dataset\training_data.csv", sep="\t", header=None, names=["label", "headline"])
df_training = df_training.dropna()
df_training["label"] = df_training["label"].astype(int)

# 2. Use an off-the-shelf text-classification pipeline (this checkpoint is fine-tuned on SST-2 sentiment, not on fake-news data)
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english", truncation=True)

# 3. Run predictions
preds = []
for text in df_training["headline"]:
    result = classifier(text)[0]
    label = 1 if result["label"] == "POSITIVE" else 0
    preds.append(label)

# 4. Accuracy
acc = accuracy_score(df_training["label"], preds)
print(f"Accuracy: {acc:.4f}")

Device set to use cpu


Accuracy: 0.5493
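
To put this number in context, it helps to compare it with the accuracy of always predicting the most frequent class. The sketch below uses the label counts from the distribution printed later in this notebook (17,572 fake and 16,580 real).

```python
# Label counts taken from the "Label distribution" output further down.
label_counts = {0: 17572, 1: 16580}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)
majority_baseline = label_counts[majority_label] / total

print(f"Always predicting {majority_label}: {majority_baseline:.4f}")
```

The sentiment checkpoint is only a few points above this floor, which is expected: mapping `POSITIVE` to "real" is an arbitrary choice, since the model was never trained to separate real from fake headlines.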


#### <span style="color: #A7C7E7;"> Transfer Learning

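-   The off-the-shelf sentiment model above reaches only about 0.55 accuracy, close to chance, because it was fine-tuned for sentiment analysis rather than fake-news detection. In this section we fine-tune the pre-trained DistilBERT model (`distilbert-base-uncased`) on our own data to get better results.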

In [113]:
# Run this to test everything is working:
import accelerate
import transformers
from transformers import Trainer
print("✅ All imports successful!")
print(f"Accelerate version: {accelerate.__version__}")
print(f"Transformers version: {transformers.__version__}")

✅ All imports successful!
Accelerate version: 1.10.1
Transformers version: 4.56.0



In [115]:
# Imports for fine-tuning DistilBERT on the TruthMiner data
import pandas as pd
import numpy as np
import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    DataCollatorWithPadding, TrainingArguments, Trainer, EarlyStoppingCallback
)

In [120]:
# ---------- Paths (adjust if needed) ----------
TRAIN_PATH = "C:/DSML bootcamp/Week7/project-3-nlp/dataset/training_data.csv"
TEST_PATH  = "C:/DSML bootcamp/Week7/project-3-nlp/dataset/testing_data.csv"
SUBMIT_OUT = "C:/DSML bootcamp/Week7/project-3-nlp/dataset/testing_predictions.csv"

In [121]:
# ---------- Columns & model config ----------
TEXT_COL   = "headline"
LABEL_COL  = "label"
MODEL_NAME = "distilbert-base-uncased"
MAX_LEN    = 256
VAL_SIZE   = 0.1   # 90/10 split, matching the test_size used in the split cell below

print("Loading TSVs (no header)...")
data_train = pd.read_csv(TRAIN_PATH, sep="\t", header=None, names=[LABEL_COL, TEXT_COL])
data_test  = pd.read_csv(TEST_PATH,  sep="\t", header=None, names=[LABEL_COL, TEXT_COL])

# Sanity checks
assert TEXT_COL in data_train.columns and LABEL_COL in data_train.columns
assert TEXT_COL in data_test.columns  and LABEL_COL in data_test.columns
assert set(data_train[LABEL_COL].unique()).issubset({0,1}), "Training labels must be 0/1."

print(f"Training data: {data_train.shape}")
print(f"Label distribution: {data_train[LABEL_COL].value_counts().to_dict()}")

Loading TSVs (no header)...
Training data: (34152, 2)
Label distribution: {0: 17572, 1: 16580}


In [122]:
# ---------- Build HF Datasets + stratified validation split (90/10) ----------
print("Creating stratified train/validation split...")

# Use sklearn for the stratified split
train_data, val_data = train_test_split(
    data_train,
    test_size=0.1,
    random_state=42,
    stratify=data_train[LABEL_COL]
)

print(f"Train size: {len(train_data):,}")
print(f"Validation size: {len(val_data):,}")
print("Training label distribution:", train_data[LABEL_COL].value_counts().to_dict())
print("Validation label distribution:", val_data[LABEL_COL].value_counts().to_dict())

# Create HF datasets from the split data
hf_train = Dataset.from_pandas(train_data[[TEXT_COL, LABEL_COL]], preserve_index=False)
hf_val = Dataset.from_pandas(val_data[[TEXT_COL, LABEL_COL]], preserve_index=False)
hf_ds = DatasetDict(train=hf_train, validation=hf_val)

# Test dataset for inference (ignore its 'label' values = 2)
hf_test = Dataset.from_pandas(data_test[[TEXT_COL]], preserve_index=False)

print("✅ Datasets created successfully!")

Creating stratified train/validation split...
Train size: 30,736
Validation size: 3,416
Training label distribution: {0: 15814, 1: 14922}
Validation label distribution: {0: 1758, 1: 1658}
✅ Datasets created successfully!


In [123]:
# ---------- Tokenizer ----------
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tok_fn(examples):
    return tokenizer(
        examples[TEXT_COL],
        truncation=True,
        padding=False,      # dynamic padding via collator
        max_length=MAX_LEN
    )

cols_to_remove = [c for c in hf_ds["train"].column_names if c not in [TEXT_COL, LABEL_COL]]
tok_trainval = hf_ds.map(tok_fn, batched=True, remove_columns=cols_to_remove)
tok_trainval = tok_trainval.rename_column(LABEL_COL, "labels")
tok_trainval.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

tok_test = hf_test.map(tok_fn, batched=True)
tok_test.set_format(type="torch", columns=["input_ids", "attention_mask"])

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
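
The tokenizer is called with `padding=False` because padding is deferred to the collator, which pads each batch only to the length of its longest sequence. The idea can be sketched in plain Python; `pad_batch` and the token ids are hypothetical and only illustrate the concept, not how `DataCollatorWithPadding` is implemented.

```python
def pad_batch(batch, pad_id=0, length=None):
    """Pad every sequence to `length` (default: the longest sequence in the batch)."""
    target = length if length is not None else max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in batch]
    return input_ids, attention_mask

# Hypothetical token ids for three short headlines.
batch = [[101, 7, 8, 102], [101, 9, 102], [101, 4, 5, 6, 102]]

dynamic_ids, dynamic_mask = pad_batch(batch)      # pad to the longest sequence
fixed_ids, _ = pad_batch(batch, length=256)       # pad everything to MAX_LEN

print(len(dynamic_ids[0]), len(fixed_ids[0]))
```

Headlines are short, so padding per batch keeps the tensors far smaller than padding every example to `MAX_LEN`.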

Map: 100%|██████████| 30736/30736 [00:01<00:00, 23315.90 examples/s]
Map: 100%|██████████| 3416/3416 [00:00<00:00, 31357.84 examples/s]
Map: 100%|██████████| 9984/9984 [00:00<00:00, 34272.70 examples/s]


In [124]:
# ---------- Model ----------
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# ---------- Metrics ----------
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, preds)
    p, r, f1, _ = precision_recall_fscore_support(labels, preds, average="binary", zero_division=0)
    return {"accuracy": acc, "precision": p, "recall": r, "f1": f1}

print("Using device:", "mps" if torch.backends.mps.is_available() else "cpu")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Using device: cpu


In [125]:
# ---------- Training args (M3/MPS friendly) ----------
args = TrainingArguments(
    output_dir="runs/distilbert_truthminer",
    learning_rate=2e-5,
    per_device_train_batch_size=8,      # increase to 16 if memory allows
    per_device_eval_batch_size=16,
    num_train_epochs=4,                 # a bit more training than 3
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    logging_steps=50,
    report_to="none",
    dataloader_num_workers=0,           # safer on MPS
    seed=42,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tok_trainval["train"],
    eval_dataset=tok_trainval["validation"],
    processing_class=tokenizer,   # replaces the deprecated tokenizer= argument
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)

print("🚀 Starting training...")

🚀 Starting training...




In [126]:
# ---------- Train & evaluate ----------
trainer.train()
val_metrics = trainer.evaluate()
print("Validation metrics:", val_metrics)



| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.1086 | 0.063437 | 0.983899 | 0.984876 | 0.981906 | 0.983389 |
| 2 | 0.0412 | 0.066949 | 0.98829 | 0.982122 | 0.993969 | 0.98801 |
| 3 | 0.0003 | 0.114554 | 0.984778 | 0.989038 | 0.979493 | 0.984242 |
| 4 | 0.0 | 0.108051 | 0.986827 | 0.986136 | 0.986731 | 0.986434 |




Validation metrics: {'eval_loss': 0.06694863736629486, 'eval_accuracy': 0.9882903981264637, 'eval_precision': 0.9821215733015495, 'eval_recall': 0.9939686369119421, 'eval_f1': 0.988009592326139, 'eval_runtime': 25.885, 'eval_samples_per_second': 131.968, 'eval_steps_per_second': 8.267, 'epoch': 4.0}
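
The final validation metrics equal the epoch 2 row of the training log because `load_best_model_at_end=True` restores the checkpoint with the best F1. The loop below is a simplified model of patience-based early stopping applied to the logged F1 values; it is a sketch of the idea, not the Trainer's actual implementation.

```python
# Validation F1 per epoch, copied from the training log above.
f1_by_epoch = {1: 0.983389, 2: 0.98801, 3: 0.984242, 4: 0.986434}
patience = 2

best_epoch, best_f1, stale = None, float("-inf"), 0
stopped_at = None
for epoch, f1 in f1_by_epoch.items():
    if f1 > best_f1:
        # New best score: remember it and reset the patience counter.
        best_epoch, best_f1, stale = epoch, f1, 0
    else:
        stale += 1
    stopped_at = epoch
    if stale >= patience:
        break

print(f"best epoch: {best_epoch} (F1={best_f1:.4f}), last epoch run: {stopped_at}")
```

With a patience of 2, the patience limit is reached at epoch 4, which is also the configured `num_train_epochs`, so early stopping did not shorten this run.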


In [127]:
# ---------- Predict test ----------
print("Making predictions on test data...")
pred = trainer.predict(tok_test)
test_labels = np.argmax(pred.predictions, axis=-1)  # 0/1

Making predictions on test data...




In [129]:
# ---------- Write submission in the SAME format as input ----------
submit_df = data_test.copy()
submit_df[LABEL_COL] = test_labels
submit_df = submit_df[[LABEL_COL, TEXT_COL]]  # ensure order

submit_df.to_csv(SUBMIT_OUT, sep="\t", header=False, index=False)
print(f"✅ Saved predictions → {SUBMIT_OUT}")
print(f"Prediction distribution: {pd.Series(test_labels).value_counts().to_dict()}")

✅ Saved predictions → C:/DSML bootcamp/Week7/project-3-nlp/dataset/testing_predictions.csv
Prediction distribution: {1: 5032, 0: 4952}


## <span style="color: #00FFFF;"> Test Data 

### <span style="color: #30D5C8;">Loading the Test Data 

In [130]:
data_test = pd.read_csv(r"C:\DSML bootcamp\Week7\project-3-nlp\dataset\testing_data.csv", sep="\t", header=None, names=["label", "headline"])


In [131]:
data_test

| | label | headline |
|---|---|---|
| 0 | 2 | copycat muslim terrorist arrested with assault... |
| 1 | 2 | wow! chicago protester caught on camera admits... |
| 2 | 2 | germany's fdp look to fill schaeuble's big shoes |
| 3 | 2 | mi school sends welcome back packet warning ki... |
| 4 | 2 | u.n. seeks 'massive' aid boost amid rohingya '... |
| ... | ... | ... |
| 9979 | 2 | boom! fox news leftist chris wallace attempts ... |
| 9980 | 2 | here it is: list of democrat hypocrites who vo... |
| 9981 | 2 | new fires ravage rohingya villages in northwes... |
| 9982 | 2 | meals on wheels shuts the lyin‚ lefties up wit... |


### <span style="color: #30D5C8;">X y Split

In [132]:
X = data_test["headline"]

### <span style="color: #30D5C8;">Data Preprocessing

In [133]:
data_test["headline_cleaned"] = data_test["headline"].apply(clean_single_headline)

In [134]:
data_test

| | label | headline | headline_cleaned |
|---|---|---|---|
| 0 | 2 | copycat muslim terrorist arrested with assault... | copycat muslim terrorist arrested assault weapons |
| 1 | 2 | wow! chicago protester caught on camera admits... | wow ! chicago protester caught camera admits v... |
| 2 | 2 | germany's fdp look to fill schaeuble's big shoes | germany 's fdp look fill schaeuble 's big shoes |
| 3 | 2 | mi school sends welcome back packet warning ki... | mi school sends welcome back packet warning ki... |
| 4 | 2 | u.n. seeks 'massive' aid boost amid rohingya '... | u.n. seeks amassive ' aid boost amid rohingya... |
| ... | ... | ... | ... |
| 9979 | 2 | boom! fox news leftist chris wallace attempts ... | boom ! fox news leftist chris wallace attempts... |
| 9980 | 2 | here it is: list of democrat hypocrites who vo... | : list democrat hypocrites voted filibuster gw... |
| 9981 | 2 | new fires ravage rohingya villages in northwes... | new fires ravage rohingya villages northwest m... |
| 9982 | 2 | meals on wheels shuts the lyin‚ lefties up wit... | meals wheels shuts lyin‚ lefties truth moveon.... |


### <span style="color: #30D5C8;">TF-IDF

In [135]:
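# Note: tfidf was fitted on the raw training headlines (X_train) above, while the
# cleaned headlines are used here; ideally both steps would use the same text form.
X = data_test["headline_cleaned"]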

In [136]:
X_test_tfidf = tfidf.transform(X)

### <span style="color: #30D5C8;">Predictions

In [137]:
Predictions = svm.predict(X_test_tfidf)

In [138]:
data_test["label"] = Predictions

In [139]:
data_test

| | label | headline | headline_cleaned |
|---|---|---|---|
| 0 | 0 | copycat muslim terrorist arrested with assault... | copycat muslim terrorist arrested assault weapons |
| 1 | 0 | wow! chicago protester caught on camera admits... | wow ! chicago protester caught camera admits v... |
| 2 | 0 | germany's fdp look to fill schaeuble's big shoes | germany 's fdp look fill schaeuble 's big shoes |
| 3 | 0 | mi school sends welcome back packet warning ki... | mi school sends welcome back packet warning ki... |
| 4 | 1 | u.n. seeks 'massive' aid boost amid rohingya '... | u.n. seeks amassive ' aid boost amid rohingya... |
| ... | ... | ... | ... |
| 9979 | 0 | boom! fox news leftist chris wallace attempts ... | boom ! fox news leftist chris wallace attempts... |
| 9980 | 1 | here it is: list of democrat hypocrites who vo... | : list democrat hypocrites voted filibuster gw... |
| 9981 | 1 | new fires ravage rohingya villages in northwes... | new fires ravage rohingya villages northwest m... |
| 9982 | 0 | meals on wheels shuts the lyin‚ lefties up wit... | meals wheels shuts lyin‚ lefties truth moveon.... |
