# NLP Disaster Tweets Classification
## Natural Language Processing with Recurrent Neural Networks

**Author**: Matthew Campbell  
**Course**: Week 14 Module 4 - RNNs and NLP  
**Date**: October 2025

---

## Table of Contents
1. [Project Description](#1-project-description)
2. [Data Description](#2-data-description)
3. [Exploratory Data Analysis](#3-exploratory-data-analysis)
4. [Data Preprocessing](#4-data-preprocessing)
5. [Model Architecture](#5-model-architecture)
6. [Results and Analysis](#6-results-and-analysis)
7. [Conclusion](#7-conclusion)
8. [GitHub Repository](#8-github-repository)
9. [Kaggle Submission](#9-kaggle-submission)

## Setup: Google Colab vs Local

This notebook runs both in Google Colab and locally. If using Colab, uncomment the lines in the cell below to mount Google Drive and install any missing packages.

In [None]:
# Google Colab Setup (uncomment if running in Colab)
# from google.colab import drive
# drive.mount('/content/drive')
# %cd /content/drive/MyDrive/Boulder_University/Week14_Module4

# Install packages if needed in Colab
# !pip install -q wordcloud

## Import Libraries

In [None]:
# Standard library
import os
import re
import string
import random
from pathlib import Path

# Data manipulation
import numpy as np
import pandas as pd

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# NLP libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Deep learning
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models, callbacks
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve, f1_score

# Set random seeds for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)
random.seed(RANDOM_SEED)

# Styling
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print(f"TensorFlow version: {tf.__version__}")
print(f"GPU available: {len(tf.config.list_physical_devices('GPU')) > 0}")

In [None]:
# Download NLTK data (run once)
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

In [None]:
# Check if running in Google Colab
try:
    import google.colab
    IN_COLAB = True
    print("Running in Google Colab")
except ImportError:
    IN_COLAB = False
    print("Running locally")

# Set data directory
# If you mounted Google Drive in Colab, the data is in the 'data' subdirectory
# If running locally, data is also in 'data' subdirectory
data_dir = Path('data')

print(f"Data directory: {data_dir}")

# Load data
train_df = pd.read_csv(data_dir / 'train.csv')
test_df = pd.read_csv(data_dir / 'test.csv')
sample_submission = pd.read_csv(data_dir / 'sample_submission.csv')

print("✓ Data loaded successfully!")
print(f"Training set: {len(train_df):,} samples")
print(f"Test set: {len(test_df):,} samples")

---
# 1. Project Description

## Problem Statement

This project addresses the Kaggle [Natural Language Processing with Disaster Tweets](https://www.kaggle.com/c/nlp-getting-started) competition. The goal is to build a machine learning model that can accurately predict whether a given tweet is about a real disaster or not.

### Context
Social media platforms like Twitter have become important communication channels during emergency events. However, not all tweets containing disaster-related keywords actually refer to real disasters. For example:
- "Our city is on fire with excitement!" ❌ Not a disaster
- "There's a bushfire approaching the town" ✅ Real disaster

Being able to automatically identify genuine disaster tweets can help:
- Emergency services respond faster
- News organisations verify events
- Aid organisations deploy resources effectively

### Problem Type
**Binary text classification** using Natural Language Processing and Recurrent Neural Networks.

### Evaluation Metric
Submissions are scored with the **F1 score**, the harmonic mean of precision and recall. This metric is appropriate because:
- It penalises both false positives and false negatives
- Both types of error matter in disaster detection
- It is more informative than accuracy when classes are imbalanced
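
To make the metric concrete, here is a minimal sketch in plain Python that computes precision, recall and F1 for the positive class. The labels and the helper name `f1_from_labels` are invented for illustration and are not taken from the competition data. The last lines show why accuracy alone can mislead: a model that never predicts "disaster" still gets a non-trivial accuracy but an F1 of zero.

```python
# Toy labels, invented for illustration (1 = disaster, 0 = not a disaster)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]


def f1_from_labels(y_true, y_pred):
    """Return (precision, recall, f1) for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


precision, recall, f1 = f1_from_labels(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# A degenerate model that always predicts "not a disaster"
all_negative = [0] * len(y_true)
print(f"always-0 accuracy={accuracy(y_true, all_negative):.3f}")
print(f"always-0 f1={f1_from_labels(y_true, all_negative)[2]:.3f}")
```

In the notebook itself, `sklearn.metrics.f1_score` (already imported above) would normally be used instead of a hand-written helper.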

### Dataset Overview
- **Training samples**: 7,613 labelled tweets
- **Test samples**: 3,263 unlabelled tweets (for Kaggle submission)
- **Features**: Tweet text, plus `keyword` and `location` fields (both may be missing)
- **Labels**: 1 = real disaster, 0 = not a disaster

### Technical Approach
We'll explore Recurrent Neural Networks (RNNs), specifically:
1. **LSTM** (Long Short-Term Memory) - handles long-term dependencies
2. **GRU** (Gated Recurrent Unit) - simpler, faster alternative to LSTM
3. **Bidirectional RNNs** - process text in both directions for better context

These architectures are well-suited for sequential text data because they can:
- Capture word order and context
- Handle variable-length inputs
- Learn long-range dependencies between words
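
As a rough preview of how these three options fit together in Keras, the sketch below builds a small classifier whose recurrent layer can be an LSTM or a GRU, optionally wrapped in `Bidirectional`. The helper name `build_rnn` and the hyperparameter values (`VOCAB_SIZE`, `EMBED_DIM`, `RNN_UNITS`) are placeholders chosen for illustration, not the settings used later in this notebook.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical hyperparameters, chosen only for illustration
VOCAB_SIZE = 10_000  # number of distinct token ids (0 is reserved for padding)
EMBED_DIM = 64       # size of each word embedding vector
RNN_UNITS = 32       # hidden units in the recurrent layer


def build_rnn(cell="lstm", bidirectional=False):
    """Build a small binary text classifier around an LSTM or GRU layer."""
    rnn_cls = layers.LSTM if cell == "lstm" else layers.GRU
    rnn = rnn_cls(RNN_UNITS)
    if bidirectional:
        # Runs one copy forwards and one backwards, then concatenates them
        rnn = layers.Bidirectional(rnn)
    model = keras.Sequential([
        keras.Input(shape=(None,), dtype="int32"),  # variable-length sequences
        layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True),  # mask padding
        rnn,
        layers.Dense(1, activation="sigmoid"),  # probability of "disaster"
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model


model = build_rnn("gru", bidirectional=True)

# Two fake "tweets" of 12 random token ids each, just to check shapes
batch = np.random.randint(1, VOCAB_SIZE, size=(2, 12))
probs = model(batch).numpy()
print(probs.shape)
```

The input shape `(None,)` leaves the sequence length open, and `mask_zero=True` tells the recurrent layer to skip padded positions, which is one common way to handle variable-length inputs.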

---
# 2. Data Description

## Loading the Data

In [None]:
# Check if running in Google Colab
try:
    import google.colab
    IN_COLAB = True
    print("Running in Google Colab")
except ImportError:
    IN_COLAB = False
    print("Running locally")

# Set data directory based on environment
if IN_COLAB:
    # In Colab, data is typically in /content or mounted drive
    # Assuming data is uploaded to Colab's content folder
    data_dir = Path('/content')
else:
    # Local environment
    data_dir = Path('data')

print(f"Data directory: {data_dir}")

# Load data
train_df = pd.read_csv(data_dir / 'train.csv')
test_df = pd.read_csv(data_dir / 'test.csv')
sample_submission = pd.read_csv(data_dir / 'sample_submission.csv')

print("✓ Data loaded successfully!")
print(f"Training set: {len(train_df):,} samples")
print(f"Test set: {len(test_df):,} samples")

## Dataset Structure and Size

In [None]:
# Display basic information
print("=" * 60)
print("DATASET OVERVIEW")
print("=" * 60)

print(f"\n📊 Training set size: {len(train_df):,} samples")
print(f"📊 Test set size: {len(test_df):,} samples")
print(f"📊 Total samples: {len(train_df) + len(test_df):,}")

print("\n" + "=" * 60)
print("TRAINING DATA STRUCTURE")
print("=" * 60)
print(f"\nColumns: {list(train_df.columns)}")
print(f"\nData types:\n{train_df.dtypes}")
print(f"\nShape: {train_df.shape}")
print(f"Memory usage: {train_df.memory_usage(deep=True).sum() / 1024:.2f} KB")

In [None]:
# Show first few rows
print("First 10 training samples:")
train_df.head(10)

## Data Dimensions and Properties

In [None]:
# Analyse text length
train_df['text_length'] = train_df['text'].apply(len)
train_df['word_count'] = train_df['text'].apply(lambda x: len(x.split()))

print("=" * 60)
print("TEXT STATISTICS")
print("=" * 60)

print("\n📝 Character length:")
print(f"   Mean: {train_df['text_length'].mean():.1f} characters")
print(f"   Median: {train_df['text_length'].median():.1f} characters")
print(f"   Min: {train_df['text_length'].min()} characters")
print(f"   Max: {train_df['text_length'].max()} characters")
print(f"   Std: {train_df['text_length'].std():.1f}")

print("\n📝 Word count:")
print(f"   Mean: {train_df['word_count'].mean():.1f} words")
print(f"   Median: {train_df['word_count'].median():.1f} words")
print(f"   Min: {train_df['word_count'].min()} words")
print(f"   Max: {train_df['word_count'].max()} words")
print(f"   Std: {train_df['word_count'].std():.1f}")

In [None]:
# Check for missing values
print("\n" + "=" * 60)
print("MISSING VALUES")
print("=" * 60)
missing_counts = train_df.isnull().sum()
missing_pct = (missing_counts / len(train_df) * 100).round(2)

missing_df = pd.DataFrame({
    'Missing Count': missing_counts,
    'Percentage': missing_pct
})
print(f"\n{missing_df}")

print("\n💡 Note: the 'keyword' and 'location' fields have missing values.")
print("   We'll focus on the 'text' field, which has no missing values.")

In [None]:
# Check class distribution
print("\n" + "=" * 60)
print("CLASS DISTRIBUTION")
print("=" * 60)

class_counts = train_df['target'].value_counts().sort_index()
print(f"\n0 (Not disaster): {class_counts[0]:,} samples ({class_counts[0]/len(train_df)*100:.1f}%)")
print(f"1 (Disaster):     {class_counts[1]:,} samples ({class_counts[1]/len(train_df)*100:.1f}%)")

imbalance_ratio = class_counts[0] / class_counts[1]
print(f"\nClass imbalance ratio: {imbalance_ratio:.2f}:1")

if imbalance_ratio > 1.5:
    print("⚠️  Moderate class imbalance detected")
    print("    → Will use a stratified split to preserve class proportions")
    print("    → May consider class weights during training")
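
If class weights do turn out to be useful, one common heuristic is to weight each class inversely to its frequency. The sketch below is a plain-Python illustration of that idea. The helper name `balanced_class_weights` and the toy labels are made up for this example and are not drawn from `train_df`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n)
            for cls, n in sorted(counts.items())}


# Toy labels, invented for illustration: 6 negatives and 4 positives
toy_labels = [0] * 6 + [1] * 4
weights = balanced_class_weights(toy_labels)

for cls, weight in weights.items():
    print(f"class {cls}: weight {weight:.3f}")
```

The minority class receives the larger weight, so its errors count more in the loss. This is the same formula scikit-learn uses for `class_weight="balanced"`, and a dictionary of this shape can be passed to the `class_weight` argument of Keras `model.fit`.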