
# Natural Language Processing with Disaster Tweets — Course Mini‑Project

**Author:** Anees Shaikh  
**Date:** 2025-09-17 15:21  
**Competition:** Kaggle — *Natural Language Processing with Disaster Tweets*

**Links (fill in once created):**  
- **Kaggle Notebook:** *(URL to your public Kaggle notebook)*  
- **GitHub Repository:** *(URL to your public repo for this project)*  
- **Leaderboard Screenshot:** *(Add an image to your repo and link it here)*

---

## Executive Summary

This project classifies tweets as **disaster‑related (1)** or **not (0)**.  
Kaggle scores submissions with the **F1 score**, the harmonic mean of precision and recall. This matters when classes may be imbalanced and both false alarms and misses are costly.
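
As a quick illustration of how F1 combines the two error types, the sketch below computes precision, recall, and F1 from raw counts. The counts and the helper name `f1_from_counts` are made up for the example and are not results from this project.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall (0.0 if undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 80 true positives, 20 false alarms, 40 misses
score = f1_from_counts(tp=80, fp=20, fn=40)
print(round(score, 4))
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1 by maximizing only precision or only recall.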

I implement and compare:
1. **Classical NLP:** TF‑IDF + Logistic Regression  
2. **Neural NLP (RNN family):** Tokenizer → Embedding → **BiLSTM/BiGRU** with early stopping

Deliverables target the rubric:
- **Problem & Data Description**
- **EDA (inspect, visualize, clean)**
- **Model Architecture & Rationale**
- **Results & Analysis (hyperparameters, what helped)**
- **Conclusion (learnings, future work)**
- **Submission** (GitHub repo + Kaggle leaderboard screenshot)

> Tip (Kaggle): Add the competition dataset as an input:  
> **`/kaggle/input/nlp-getting-started`** contains `train.csv` and `test.csv`.  
> Optionally add **GloVe 6B 100d** as a Kaggle Dataset input (e.g., `glove6b100dtxt`) to enable pretrained embeddings.


In [None]:

# Imports — keep lightweight for reliable grading/runtime
import os, re, html
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import layers, models, callbacks, optimizers

# Reproducibility
SEED = 42
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Matplotlib default settings (no external styles)
plt.rcParams["figure.figsize"] = (7,4)
plt.rcParams["axes.grid"] = True

# Kaggle vs local paths
KAGGLE_DATA = "/kaggle/input/nlp-getting-started"
LOCAL_DATA = "./data"  # if you download the CSVs locally, put them in ./data
GLOVE_DIRS = [
    "/kaggle/input/glove6b100dtxt",     # common Kaggle dataset name
    "/kaggle/input/glove-6b-100d",      # alternative
    "./data"                            # local fallback
]

print("TensorFlow:", tf.__version__)



## 1. Problem & Data Description *(5 pts)*

- **Task:** Binary text classification — predict whether a tweet describes a real disaster (**1**) or not (**0**).
- **Data:** 7,613 labeled tweets for training (`train.csv`) and 3,263 unlabeled tweets for submission (`test.csv`).
- **Target:** `target` (0/1).  
- **Inputs:** `text`, plus optional features like `keyword` and `location` which may contain useful signal (sometimes noisy).
- **Metric:** **F1 score** on the test set (Kaggle leaderboard).

We will:
- Inspect class balance and text length distributions.
- Clean and normalize text minimally but carefully to preserve signal (hashtags, negations, etc.).
- Compare a **TF‑IDF + Logistic Regression** baseline with an **RNN‑based** model (BiLSTM/BiGRU) using Keras.


In [None]:

# 2. Load Data
def load_competition_data():
    # Try Kaggle path first
    if os.path.exists(os.path.join(KAGGLE_DATA, "train.csv")):
        train_path = os.path.join(KAGGLE_DATA, "train.csv")
        test_path  = os.path.join(KAGGLE_DATA, "test.csv")
    else:
        # Fallback to local
        train_path = os.path.join(LOCAL_DATA, "train.csv")
        test_path  = os.path.join(LOCAL_DATA, "test.csv")
    train = pd.read_csv(train_path)
    test  = pd.read_csv(test_path)
    return train, test

train, test = load_competition_data()
train.head()



## 2. EDA — Inspect, Visualize, Clean *(15 pts)*
We'll look at:
- Basic schema and missingness
- Class balance
- Text length distributions
- Simple token characteristics (URLs, @mentions, #hashtags)

We'll then outline a **cleaning strategy** and apply it consistently across train/val/test.


In [None]:

# Basic info
display(train.describe(include='all'))
print("\nMissing values per column:\n", train.isna().sum())

# Class balance
class_counts = train['target'].value_counts().sort_index()
print("\nClass counts (0=not disaster, 1=disaster):\n", class_counts)

# Plot class balance
plt.figure()
class_counts.plot(kind='bar')
plt.title("Class Balance")
plt.xlabel("target")
plt.ylabel("count")
plt.show()

# Text length distributions
train['text_len'] = train['text'].astype(str).apply(len)
plt.figure()
plt.hist(train[train['target']==0]['text_len'], bins=30, alpha=0.7, label='target=0')
plt.hist(train[train['target']==1]['text_len'], bins=30, alpha=0.7, label='target=1')
plt.title("Tweet Length Distribution by Class")
plt.xlabel("characters")
plt.ylabel("count")
plt.legend()
plt.show()

# Quick URL / mention / hashtag counts
def count_pattern(s, pat):
    return len(re.findall(pat, s))

train['n_urls'] = train['text'].astype(str).apply(lambda s: count_pattern(s, r"http\S+"))
train['n_mentions'] = train['text'].astype(str).apply(lambda s: count_pattern(s, r"@\w+"))
train['n_hashtags'] = train['text'].astype(str).apply(lambda s: count_pattern(s, r"#\w+"))

print("\nAverage markers per tweet:")
print(train[['n_urls','n_mentions','n_hashtags']].mean().round(3))

plt.figure()
plt.hist(train['n_hashtags'], bins=20)
plt.title("Hashtags per Tweet")
plt.xlabel("#hashtags")
plt.ylabel("count")
plt.show()



### Cleaning Strategy

- Lowercase
- HTML unescape (convert `&amp;` → `&`)
- Replace URLs with token `URL`
- Replace user mentions with `@user`
- Convert hashtags like `#Fire` to `hashtag_fire` (keeps the word while marking it)
- Normalize numbers to `NUM`
- Remove excessive punctuation/whitespace

> Note: We avoid heavy stemming/lemmatization to keep runtime small and to preserve potentially useful forms.


In [None]:

URL_RE = re.compile(r"http\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
NUM_RE = re.compile(r"\d+")

def clean_text(s: str) -> str:
    s = str(s)
    s = html.unescape(s)
    s = s.lower()
    s = URL_RE.sub(" URL ", s)
    s = MENTION_RE.sub(" @user ", s)
    s = HASHTAG_RE.sub(lambda m: f" hashtag_{m.group(1)} ", s)
    s = NUM_RE.sub(" NUM ", s)
    # keep basic punctuation but collapse repeats/whitespace;
    # A-Z is allowed so the URL/NUM placeholders inserted above are not stripped
    s = re.sub(r"[^a-zA-Z0-9_@#\$%&'\-\+\/\?\!\.,\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

train['text_clean'] = train['text'].apply(clean_text)
test['text_clean']  = test['text'].apply(clean_text)

train[['text','text_clean']].head(10)



### Plan of Analysis

1. **Baseline:** TF‑IDF (word & char n‑grams) → Logistic Regression (strong linear baseline for short texts).  
2. **Neural:** Tokenize → pad sequences → **BiLSTM/BiGRU** with optional **GloVe** initialization; early stopping; lightweight hyperparam sweep.  
3. Compare F1 on a validation split; select best for test predictions & submission.


In [None]:

# 3. Train/Validation Split
X_text = train['text_clean'].values
y = train['target'].values

X_train, X_val, y_train, y_val = train_test_split(
    X_text, y, test_size=0.15, random_state=SEED, stratify=y
)

len(X_train), len(X_val)



## 3. Baseline: TF‑IDF + Logistic Regression

This classical approach is fast and competitive for short texts:
- **TF‑IDF** captures token importance across the corpus.
- **Character n‑grams** help with misspellings/variants.
- **Logistic Regression** provides a simple linear decision boundary.

We tune a couple of key parameters (ngram ranges, C) with a simple loop due to runtime constraints.
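
To see why character n-grams help with misspellings, the sketch below compares a word and a typo of it at the word level and at the character level. It uses plain set overlap (Jaccard) rather than TF-IDF weights, so it is a simplification of what `TfidfVectorizer(analyzer='char')` does. The helper names and the example typo are hypothetical.

```python
def char_ngrams(text: str, n_min: int = 3, n_max: int = 5) -> set:
    """Return the set of character n-grams of length n_min..n_max."""
    return {
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    }

def jaccard(a: set, b: set) -> float:
    """Overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

correct, typo = "earthquake", "earthqauke"  # hypothetical misspelling

# Word-level view: the two tokens are simply different features.
word_overlap = jaccard({correct}, {typo})

# Character-level view: several 3- to 5-grams are shared despite the typo.
char_overlap = jaccard(char_ngrams(correct), char_ngrams(typo))

print(word_overlap, round(char_overlap, 3))
```

The word-level overlap is zero, while the character-level overlap is positive, so a linear model over character n-grams can still pick up evidence from the misspelled form.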


In [None]:

def run_tfidf_logreg(X_tr, y_tr, X_va, y_va,
                     word_ngrams=(1,2), char_ngrams=(3,5), C=2.0,
                     max_features=50000):
    word_vec = TfidfVectorizer(
        ngram_range=word_ngrams,
        max_features=max_features,
        analyzer='word',
        min_df=2
    )
    char_vec = TfidfVectorizer(
        ngram_range=char_ngrams,
        max_features=max_features,
        analyzer='char',
        min_df=2
    )
    Xw = word_vec.fit_transform(X_tr)
    Xc = char_vec.fit_transform(X_tr)
    from scipy.sparse import hstack
    X_tr_vec = hstack([Xw, Xc]).tocsr()

    Xw_val = word_vec.transform(X_va)
    Xc_val = char_vec.transform(X_va)
    X_va_vec = hstack([Xw_val, Xc_val]).tocsr()

    clf = LogisticRegression(
        solver='liblinear',
        C=C,
        max_iter=200
    )
    clf.fit(X_tr_vec, y_tr)
    va_pred = clf.predict(X_va_vec)
    f1 = f1_score(y_va, va_pred)
    return f1, clf, (word_vec, char_vec)

# Tiny hyperparam sweep
grid = [
    {"word_ngrams": (1,2), "char_ngrams": (3,5), "C": 2.0},
    {"word_ngrams": (1,2), "char_ngrams": (3,6), "C": 2.0},
    {"word_ngrams": (1,3), "char_ngrams": (3,6), "C": 1.0},
]

best_tfidf = {"f1": -1}
for g in grid:
    f1, clf, vecs = run_tfidf_logreg(X_train, y_train, X_val, y_val, **g)
    print("TFIDF+LR", g, "F1=", round(f1, 4))
    if f1 > best_tfidf["f1"]:
        best_tfidf = {"f1": f1, "clf": clf, "vecs": vecs, "params": g}

print("\nBest TFIDF+LR:", best_tfidf["params"], "F1=", round(best_tfidf["f1"],4))

# Show report
word_vec, char_vec = best_tfidf["vecs"]
from scipy.sparse import hstack
X_val_vec = hstack([word_vec.transform(X_val), char_vec.transform(X_val)]).tocsr()
print("\nValidation report (TFIDF+LR):")
print(classification_report(y_val, best_tfidf["clf"].predict(X_val_vec), digits=4))



## 4. Neural Model: BiLSTM / BiGRU *(RNN family)*

We build a compact RNN model:
- Tokenize with `Tokenizer`
- Pad sequences
- Embedding: random or **GloVe 100d** if available
- **Bidirectional LSTM or GRU** (we'll try both)
- Early stopping and reduce LR on plateau
- Evaluate with validation **F1**

> **Optional GloVe:** If you add a Kaggle dataset providing `glove.6B.100d.txt`, the notebook will auto‑detect and initialize the embedding matrix with pretrained vectors.
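
To make the tokenize-and-pad steps concrete, here is a simplified pure-Python stand-in for what `Tokenizer` and `pad_sequences` do with `padding='post'` and `truncating='post'`. The helper names (`build_vocab`, `encode`, `pad_post`) are hypothetical, and the whitespace splitting and frequency ordering only approximate the Keras behavior. Index 0 is reserved for padding and index 1 for out-of-vocabulary words.

```python
from collections import Counter

def build_vocab(texts, max_words):
    """Map the most frequent words to indices 2..max_words-1 (0=pad, 1=OOV)."""
    counts = Counter(w for t in texts for w in t.split())
    vocab = {"<OOV>": 1}
    for word, _ in counts.most_common(max_words - 2):
        vocab[word] = len(vocab) + 1
    return vocab

def encode(text, vocab):
    """Replace each word with its index; unseen words map to the OOV index 1."""
    return [vocab.get(w, 1) for w in text.split()]

def pad_post(seq, max_len):
    """Truncate from the end, then right-pad with zeros to max_len."""
    seq = seq[:max_len]
    return seq + [0] * (max_len - len(seq))

fit_texts = ["fire near the forest", "the fire is out"]  # toy examples
vocab = build_vocab(fit_texts, max_words=10)

# The third text contains words never seen when the vocabulary was built.
all_texts = fit_texts + ["flood in the city"]
padded = [pad_post(encode(t, vocab), max_len=6) for t in all_texts]
print(padded)
```

Every row ends up with the same length, which is what lets the embedding layer receive a fixed-shape `(batch, MAX_LEN)` integer array.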


In [None]:

# Prepare sequences
MAX_WORDS = 30000
MAX_LEN = 50  # tweets are short; keep compact for speed

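tok = Tokenizer(num_words=MAX_WORDS, oov_token="<OOV>")
# Note: the vocabulary is built from all labeled tweets (train + validation rows).
# No labels are used, but validation F1 may be slightly optimistic as a result.
# Keras' default `filters` also strip `_` and `@`, so `hashtag_fire` becomes `hashtag` + `fire`.
tok.fit_on_texts(train['text_clean'].tolist())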

Xseq_tr = tok.texts_to_sequences(X_train)
Xseq_va = tok.texts_to_sequences(X_val)
Xseq_te = tok.texts_to_sequences(test['text_clean'].tolist())

Xseq_tr = pad_sequences(Xseq_tr, maxlen=MAX_LEN, padding='post', truncating='post')
Xseq_va = pad_sequences(Xseq_va, maxlen=MAX_LEN, padding='post', truncating='post')
Xseq_te = pad_sequences(Xseq_te, maxlen=MAX_LEN, padding='post', truncating='post')

word_index = tok.word_index
vocab_size = min(MAX_WORDS, len(word_index) + 1)
vocab_size


In [None]:

# Try to load GloVe 100d
def find_glove_file():
    for d in GLOVE_DIRS:
        path = os.path.join(d, "glove.6B.100d.txt")
        if os.path.exists(path):
            return path
    return None

glove_path = find_glove_file()
EMBED_DIM = 100 if glove_path else 64  # fallback to 64-dim if no GloVe

emb_matrix = None
if glove_path:
    print("Loading GloVe from:", glove_path)
    embeddings_index = {}
    with open(glove_path, encoding='utf8') as f:
        for line in f:
            values = line.strip().split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    print("Loaded %d word vectors." % len(embeddings_index))

    emb_matrix = np.random.normal(scale=0.6, size=(vocab_size, 100)).astype('float32')
    for word, i in word_index.items():
        if i >= vocab_size: 
            continue
        vec = embeddings_index.get(word)
        if vec is not None:
            emb_matrix[i] = vec

emb_matrix is not None, EMBED_DIM


In [None]:

def build_rnn_model(model_type="bilstm", embed_dim=EMBED_DIM, units=64, dropout=0.2, lr=2e-3):
    inp = layers.Input(shape=(MAX_LEN,))
    if emb_matrix is not None and embed_dim == 100:
        emb = layers.Embedding(
            vocab_size, 100, weights=[emb_matrix], trainable=False, mask_zero=False
        )(inp)
    else:
        emb = layers.Embedding(vocab_size, embed_dim)(inp)

    if model_type == "bilstm":
        x = layers.Bidirectional(layers.LSTM(units, return_sequences=True))(emb)
        x = layers.GlobalMaxPool1D()(x)
    elif model_type == "bigru":
        x = layers.Bidirectional(layers.GRU(units, return_sequences=True))(emb)
        x = layers.GlobalMaxPool1D()(x)
    else:
        x = layers.GlobalAveragePooling1D()(emb)

    x = layers.Dropout(dropout)(x)
    x = layers.Dense(units, activation='relu')(x)
    x = layers.Dropout(dropout)(x)
    out = layers.Dense(1, activation='sigmoid')(x)

    model = models.Model(inp, out)
    opt = optimizers.Adam(learning_rate=lr)
    model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
    return model

def train_and_eval(model_type="bilstm", embed_dim=EMBED_DIM, units=64, dropout=0.2, lr=2e-3, epochs=6, batch_size=128):
    model = build_rnn_model(model_type, embed_dim, units, dropout, lr)
    es = callbacks.EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)
    rlrop = callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=1)
    hist = model.fit(
        Xseq_tr, y_train,
        validation_data=(Xseq_va, y_val),
        epochs=epochs,
        batch_size=batch_size,
        verbose=0,
        callbacks=[es, rlrop]
    )
    # compute F1 on val
    val_probs = model.predict(Xseq_va, verbose=0).ravel()
    val_pred = (val_probs >= 0.5).astype(int)
    f1 = f1_score(y_val, val_pred)
    return f1, model

rnn_grid = [
    {"model_type": "bilstm", "units": 64, "dropout": 0.2, "lr": 2e-3},
    {"model_type": "bigru",  "units": 64, "dropout": 0.2, "lr": 2e-3},
]

best_rnn = {"f1": -1}
for g in rnn_grid:
    f1, mdl = train_and_eval(**g)
    print("RNN", g, "F1=", round(f1,4))
    if f1 > best_rnn["f1"]:
        best_rnn = {"f1": f1, "model": mdl, "params": g}

print("\nBest RNN:", best_rnn["params"], "F1=", round(best_rnn["f1"],4))

# Show simple report for RNN
val_probs = best_rnn["model"].predict(Xseq_va, verbose=0).ravel()
val_pred = (val_probs >= 0.5).astype(int)
print("\nValidation report (Best RNN):")
print(classification_report(y_val, val_pred, digits=4))



## 5. Results & Analysis *(35 pts)*

We record validation **F1** for each approach and discuss what helped (e.g., character n‑grams, bidirectional RNNs, pretrained embeddings, early stopping).


In [None]:

results = pd.DataFrame([
    {"model": "TFIDF+LR", **best_tfidf["params"], "val_f1": best_tfidf["f1"]},
    {"model": f"RNN({best_rnn['params']['model_type']})", **best_rnn["params"], "val_f1": best_rnn["f1"]}
]).sort_values("val_f1", ascending=False)
results.reset_index(drop=True, inplace=True)
results



## 6. Final Model → Train on Full Data & Create Submission

We select the approach with the **best validation F1** and refit it on the full training set (with the same preprocessing). Then we generate `submission.csv` with columns `id,target`.


In [None]:

# Decide which approach to use
use_rnn = best_rnn["f1"] >= best_tfidf["f1"]

if use_rnn:
    print("Using RNN for final training and submission.")
    # Refit tokenizer on full clean text
    tok_full = Tokenizer(num_words=MAX_WORDS, oov_token="<OOV>")
    tok_full.fit_on_texts(train['text_clean'].tolist())

    Xseq_full = tok_full.texts_to_sequences(train['text_clean'].tolist())
    Xseq_full = pad_sequences(Xseq_full, maxlen=MAX_LEN, padding='post', truncating='post')

    Xseq_test = tok_full.texts_to_sequences(test['text_clean'].tolist())
    Xseq_test = pad_sequences(Xseq_test, maxlen=MAX_LEN, padding='post', truncating='post')

    # Rebuild model with best params
    p = best_rnn["params"]
    model_full = build_rnn_model(**p)
    es = callbacks.EarlyStopping(monitor='loss', patience=1, restore_best_weights=True)
    model_full.fit(Xseq_full, train['target'].values, epochs=6, batch_size=128, verbose=0, callbacks=[es])

    test_probs = model_full.predict(Xseq_test, verbose=0).ravel()
    test_pred = (test_probs >= 0.5).astype(int)

else:
    print("Using TFIDF+LR for final training and submission.")
    g = best_tfidf["params"]
    word_vec = TfidfVectorizer(ngram_range=g["word_ngrams"], max_features=50000, analyzer='word', min_df=2)
    char_vec = TfidfVectorizer(ngram_range=g["char_ngrams"], max_features=50000, analyzer='char', min_df=2)

    Xw_full = word_vec.fit_transform(train['text_clean'].tolist())
    Xc_full = char_vec.fit_transform(train['text_clean'].tolist())
    from scipy.sparse import hstack
    X_full_vec = hstack([Xw_full, Xc_full]).tocsr()

    clf = LogisticRegression(solver='liblinear', C=g["C"], max_iter=200)
    clf.fit(X_full_vec, train['target'].values)

    Xw_te = word_vec.transform(test['text_clean'].tolist())
    Xc_te = char_vec.transform(test['text_clean'].tolist())
    X_te_vec = hstack([Xw_te, Xc_te]).tocsr()

    test_pred = clf.predict(X_te_vec)

# Build submission
sub = pd.DataFrame({"id": test["id"], "target": test_pred})
sub_path = "submission.csv"
sub.to_csv(sub_path, index=False)
print("Wrote:", sub_path)
sub.head()


In [None]:

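# (Optional) Confusion matrix for the selected model on validation to visualize errors
if use_rnn:
    cm = confusion_matrix(y_val, (best_rnn["model"].predict(Xseq_va, verbose=0).ravel() >= 0.5).astype(int))
else:
    from scipy.sparse import hstack
    # Use the vectorizers that were fitted together with best_tfidf["clf"]; the global
    # word_vec/char_vec were refit on the full training set above and may differ in feature count.
    val_word_vec, val_char_vec = best_tfidf["vecs"]
    X_val_vec = hstack([val_word_vec.transform(X_val), val_char_vec.transform(X_val)]).tocsr()
    cm = confusion_matrix(y_val, best_tfidf["clf"].predict(X_val_vec))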

cm



## 7. Conclusion *(15 pts)*

- **What worked:**  
  - Character n‑grams in TF‑IDF often help for noisy short texts.
  - Bidirectional RNNs capture context from both directions; early stopping reduces overfitting.
  - Optional pretrained embeddings (GloVe 100d) can stabilize training for small datasets.

- **What didn’t help (or was neutral) in quick tests:**  
  - Larger sequence lengths tended not to improve results for very short tweets (added noise).  
  - Overly aggressive cleaning (e.g., stripping all hashtags/mentions) removed useful signal.

- **Next Steps / Future Work:**  
  - Try **1D‑CNN**, **attention layers**, or **transformers** (e.g., DistilBERT, RoBERTa) for likely F1 gains.  
  - Use **cross‑validation** with **stratified folds** for more robust model selection.  
  - Expand features: leverage `keyword` and engineer flags (e.g., presence of URLs, exclamation marks).  
  - Calibrate probability threshold for F1 (optimize threshold on validation).
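
As a rough sketch of the cross-validation idea above, the snippet below runs stratified 3-fold CV with a TF-IDF + Logistic Regression pipeline and F1 scoring. The toy texts and labels are invented stand-ins; in the notebook they would be replaced by `train['text_clean']` and `train['target']`, and the fold count and hyperparameters here are assumptions, not tuned values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in data (hypothetical): 6 disaster-like and 6 everyday tweets.
texts = [
    "forest fire spreading near the town", "earthquake shakes the city center",
    "flood waters rising after the storm", "wildfire evacuation ordered tonight",
    "major earthquake damage reported", "flash flood warning for the valley",
    "this pizza is fire honestly", "my phone battery died again",
    "watching a movie with friends", "traffic was slow this morning",
    "new album drops on friday", "coffee and a good book today",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Keeping the vectorizer inside the pipeline means it is refit on each
# training fold only, so no vocabulary leaks in from the held-out fold.
pipe = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(3, 5)),
    LogisticRegression(solver="liblinear", C=2.0),
)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="f1")
print("F1 per fold:", scores, "mean:", scores.mean())
```

Averaging over folds gives a less noisy basis for model selection than a single 85/15 split, at the cost of fitting the model once per fold.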

> For the course mini‑project, either baseline is acceptable if you explain your reasoning and results clearly. Aim for a non‑zero Kaggle F1 and a clean, reproducible pipeline.



## 8. Submitting to Kaggle & Sharing Deliverables *(30 pts)*

- **Create submission:** This notebook saves `submission.csv` in the working directory.
- **Submit on Kaggle:** From the output files on the right (in Kaggle), click **"Submit to Competition"**.  
- **Make Notebook Public:** "Save & Run All" → "Publish" to share your notebook link.
- **GitHub Repo:** Include this notebook, your `submission.csv`, and a short `README.md`.  
- **Leaderboard Screenshot:** After your best submission, take a screenshot of your position and add it to your repo (e.g., `img/leaderboard.png`). Link it in the top cell.

---

### References
- Kaggle Competition: *Natural Language Processing with Disaster Tweets*  
- TF‑IDF: scikit‑learn documentation  
- RNNs/LSTM/GRU: Keras documentation; Hochreiter & Schmidhuber (1997), Cho et al. (2014)  
- GloVe: Pennington, Socher, Manning (2014)


In [None]:

# (Optional) Threshold tuning helper for F1 on validation set
def best_threshold_for_f1(y_true, y_prob):
    thresholds = np.linspace(0.2, 0.8, 25)
    scores = [(t, f1_score(y_true, (y_prob >= t).astype(int))) for t in thresholds]
    return max(scores, key=lambda x: x[1])

if 'val_probs' in globals():
    t_opt, f1_opt = best_threshold_for_f1(y_val, val_probs)
    print("Best threshold on validation:", round(t_opt,3), "F1=", round(f1_opt,4))
