# Kaggle Competition: Natural Language Processing with Disaster Tweets


## Description of the Problem and Data
Briefly describe the challenge problem and NLP.  Describe the size, dimension, structure, etc., of the data.

In [None]:
# Import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import re
import string
from urllib.parse import unquote
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

### Problem
(from Kaggle) Twitter has become an important communication channel in times of emergency.
The ubiquitousness of smartphones enables people to announce an emergency they’re observing in real-time. Because of this, more agencies are interested in programatically monitoring Twitter (i.e. disaster relief organizations and news agencies).

Natural Language Processing - NLP enables computers to understand, interpret, and generate human language.  Key aspects include: Understanding context and meaning, processing text and speech, and enabling technologies for use in applications, like machine translation or voice assistants.  

Sources for NLP techniques: NLTK, tensorflow, keras, and sklearn documentation. Reddit r/NLP.   

### Data
Each sample in the train and test set has the following information: The text of a tweet, a keyword from that tweet (watch for blanks!), and the location the tweet was sent from (also could be missing).

We are predicting whether a given tweet is about a real disaster or not.  If so, predict a '1', otherwise '0'.

Columns: 'id', 'text', 'location', 'keyword', 'target' (in train.csv only, this denotes whether a tweet is about a real disaster or not).

In [None]:
# load the training dataset
df_train = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")

In [None]:
# Describe the data size, dimension, structure, etc...
print(f"Dimensions (Rows, Columns): {df_train.shape}")
print(f"Total elements: {df_train.size}")

df_train.info()
print("\nData Types per Column:\n", df_train.dtypes)
print("\nFirst 3 rows:\n", df_train.head(3))

# Check for missing values (hinted at in problem description)
print("\nMissing values per column:\n", df_train.isnull().sum())

## Exploratory Data Analysis
Show a few visualizations.  Describe any data cleaning procedures.  Based on this EDA, what is the plan of analysis?

'keyword' is missing 61 values.  This represents a small fraction of the data so I think that just deleting the rows
 will be the best option.  This will have a negligible impact on the model's training. 

In [None]:
# Handle missing values from 'keyword'
df_train.dropna(subset=['keyword'], inplace=True)
print(f"New training dataframe shape after dropping missing rows: {df_train.shape}")

After accounting for the missing values in 'keyword', 'location' has 2472 missing values.  This is a large percentage of the data so I do not just want to delete it.  I'll replace the na's with a 'NONE_PROVIDED' to give them their own category and we can potentially check if this category has any predictive value.

In [None]:
# Handle missing values from 'location'
df_train['location'] = df_train['location'].fillna('NONE_PROVIDED')
print("Missing 'location' values imputed with 'NONE_PROVIDED'.")

In [None]:
# Check the balance of the target classes using a count plot
plt.figure(figsize=(6,5))
sns.countplot(x='target', data=df_train)
plt.title('Distribution of Target Class')
plt.xlabel('Target (0: Not Disaster, 1: Disaster)')
plt.ylabel('Count')
plt.xticks([0,1])
plt.grid(axis='y', alpha = 0.5)
plt.show()

### Text Cleaning and Normalization
Remove/replace elements that are typically irrelevant to the text's meaning.  HTML tags, URLs, Hashtags, Mentions, and Punctuation.

Standardize the remaining text.  Lowercasing, tokenization (splitting text into individual words/sub-words), Stop Word removal, and Lemmatize (reduce to base form of word).

In [None]:
# Preprocessing Function
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # HTML tags
    text = re.sub(r'<.*?>', '', text)
    # URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Mentions and Hashtags
    text = re.sub(r'@\w+|#', '', text)
    # Punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Lowercase
    text = text.lower()
    # Tokenize
    tokens = word_tokenize(text)
    # Stop Word removal and lemmatize
    processed_tokens = []
    for word in tokens:
        if word not in stop_words:
            processed_tokens.append(lemmatizer.lemmatize(word))

    return " ".join(processed_tokens)

In [None]:
# Apply preprocessing to 'text' column
df_train['clean_text'] = df_train['text'].apply(preprocess_text)

print("Text column ('text') has been cleaned and normalized into 'clean_text'.")
print("\nExample Original vs. Cleaned Text:")
print(f"Original: {df_train['text'].iloc[0]}")
print(f"Cleaned:  {df_train['clean_text'].iloc[0]}")

### Preprocess 'keyword' Column
decode the URL encodings, aggregate low frequency keywords, One-hot encode

In [None]:
# apply unquote to keyword
def unquote_keyword(keyword):
    if pd.isna(keyword): return 'UNKNOWN'
    return unquote(keyword)
    
df_train['clean_keyword'] = df_train['keyword'].apply(unquote_keyword)

In [None]:
# get counts of unique keywords
keyword_counts = df_train['clean_keyword'].value_counts()
unique_count = len(keyword_counts)

print(keyword_counts.head())
print("-" * 50)
print(keyword_counts.tail())
print("-" * 50)

print(f"The total number of unique keywords is: {unique_count}")

### Aggregate 'keyword'
Aggregating keywords into a 'top N' and 'other' category should help with training the model by reducing dimensionality and complexity by shrinking the number of features the model has to learn from the keyword column.  This should make a simpler, faster model with less risk of overfitting.  

In [None]:
# Aggregate low-frequency keywords
TOP_N = 75
keyword_counts = df_train['clean_keyword'].value_counts()
top_n_keywords = keyword_counts.nlargest(TOP_N).index.tolist()

df_train['agg_keyword'] = np.where(
    df_train['clean_keyword'].isin(top_n_keywords),
    df_train['clean_keyword'],
    'Other'
)

In [None]:
# Keyword Distribution
keyword_counts = df_train['agg_keyword'].value_counts().reset_index()
keyword_counts.columns = ['agg_keyword', 'count']
keyword_counts = keyword_counts.sort_values(by='count', ascending=False)

plt.figure(figsize=(8,10))
sns.barplot(x='count', y='agg_keyword', data=keyword_counts)
plt.title('Distribution of Keywords')
plt.xlabel('Count')
plt.xscale('log')
plt.ylabel('Keyword')
plt.grid(axis='x', alpha=0.5)
plt.tight_layout()
plt.show()

In [None]:
# One-hot encode 'agg_keyword' column
keyword_dummies = pd.get_dummies(df_train['agg_keyword'], prefix='keyword')

df_train = pd.concat([df_train, keyword_dummies], axis=1)

df_train.drop(['keyword', 'clean_keyword', 'agg_keyword'], axis=1, inplace=True)

In [None]:
print("New DataFrame Head (Showing New Keyword Columns)")
print(df_train.head())
print("\nNew DataFrame Shape")
print(f"The new shape (rows, columns) is: {df_train.shape}")

### Preprocessing the 'location' column
There are significant data quality issues that need to be addressed before One-hot encoding.  

Aliases/Variants of locations like 'USA' and 'United States'.  Non-Locations like 'worldwide' and 'everywhere'.  Foreign Characters that need correction. and single occurances that I will group together as they are unlikely to be predictive.

In [None]:
# create function to accomplish preprocessing steps to 'location'
def clean_location(loc):
    if loc == 'NONE_PROVIDED':
        return loc
    try:
        loc = loc.encode('latin1').decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass
        
    # lower case and strip whitespace
    loc = loc.lower().strip()
    
    # remove trailing punctuation
    loc = loc.strip(string.punctuation)
    
    # remove numbers and non-alphabetic noise
    loc = re.sub(r'[\d]', '', loc)
    loc = re.sub(r'[^\w\s,\-\']', '', loc)
    loc = re.sub(r'\s+', ' ', loc).strip() # replace multiple spaces

    return loc

In [None]:
# apply the clean_location function to the location column
df_train['clean_location'] = df_train['location'].apply(clean_location)

In [None]:
# create function to handle the noise and variation in the location column
def consolidate_location(loc):
    """Maps common noise terms and aliases to standard forms."""
    if loc in ['unknown', '']:
        return 'UNKNOWN'

    # Common Noise/Junk terms identified during inspection
    junk_list = ['worldwide', 'everywhere', 'here', 'the internet', 'my timeline', 
                 'noplace', 'follow me', 'he/him or she/her (ask)', 'email', 
                 'facebook', 'twitter', 'ig', 'snapchat', 'link in bio',
                'five down from the coffeeshop', 'reddit', 'road to the billionaires club',
                'all around the world', 'mad as hell', 'in the word of god',
                'narnia', 'planet earth', 'ìït ,', 'ìït ,-','happily married with kids',
                'america of founding fathers','taylor swift','theythem','anonymous','upstairs',
                ',','httpwwwamazoncomdpbhr','in the shadows','international','the world',
                'breaking news','in hell','pedophile hunting ground','neverland','world',
                'world wide'] # this list could go on forever.  stopping here.
    
    if any(j in loc for j in junk_list):
        return 'JUNK'

    # Consolidate country/city variations
    if loc in ['usa', 'united states', 'united states of america', 'us']:
        return 'usa'
    if loc in ['uk', 'london uk', 'england']:
        return 'united kingdom'
    if loc in ['new york', 'new york ny', 'ny', 'nyc']:
        return 'new york city'
    if loc in ['los angeles, ca', 'la']:
        return 'los angeles'
    if loc in ['ca', 'cali', 'california', 'southern california']:
        return 'california'
    if loc in ['texas', 'tx', 'republic of texas']:
        return 'texas'
    
    return loc

In [None]:
# apply consolidated mapping
df_train['standard_location'] = df_train['clean_location'].apply(consolidate_location)

In [None]:
# Aggregate low frequency locations
TOP_N_LOCATIONS = 200

location_counts = df_train['standard_location'].value_counts()
top_n_locations = location_counts.nlargest(TOP_N_LOCATIONS).index.tolist()

df_train['agg_location'] = np.where(
    df_train['standard_location'].isin(top_n_locations),
    df_train['standard_location'],
    'UNKNOWN'
)

In [None]:
final_unique_count = df_train['agg_location'].nunique()
print(f"\nAggregated to {final_unique_count} features.")
print(f"aggregated location feature unique counts:")
print(df_train['agg_location'].value_counts().head(5))

# Drop intermediate columns
df_train.drop(['location', 'clean_location', 'standard_location'], axis=1, inplace=True)

#print(df_train['agg_location'].unique().tolist())

In [None]:
location_dummies = pd.get_dummies(df_train['agg_location'], prefix='agg_location')

df_train = pd.concat([df_train, location_dummies], axis =1)

df_train.drop('agg_location', axis =1, inplace=True)

In [None]:
print("New DataFrame Head (Showing New Location Columns)")
print(df_train.head())
print("\nNew DataFrame Shape")
print(df_train.shape)

## Model Architecture and Training
Describe model architecture and reasoning for why it is suitable for this problem.  

Include reference list for NLP-specific tutorials/discussion boards/code examples.  Methods to process texts to matrix form include: TF-IDF, GloVe, Word2Vec, etc.  Briefly explain the method and how they work.  

Build and train a sequential neural network model (any RNN family nn, including advanced architectures LSTM, GRU, bidirectional RNN, etc).

### Text Vectorization and Word Embedding
There are two components we'll use to transform the text: Keras's Tokenizer to convert the text to a sequence of integers and a Keras embedding layer for processing the text.  

Keras's Tokenizer scans the entire 'clean_text', assigns a unique integer to every unique word, then converts each tweet into a sequence of those integers.

The Keras Embedding layer is initialized randomly and then learns a unique dense vector for every word in the vocabulary in training.  An advantage of this method vs GloVe or Word2Vec is that this process is simple and requires no external files.

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout, Concatenate
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [None]:
TEXT_COLUMN = 'clean_text'
TARGET_COLUMN = 'target'

categorical_cols = [col for col in df_train.columns if col.startswith('location_') or col.startswith('keyword_')]
X_cat = df_train[categorical_cols].values
Y = df_train[TARGET_COLUMN].values

MAX_WORDS = 10000  # Only consider the top 10,000 words
MAX_LEN = 50       # Pad/truncate sequences to a fixed length of 50

tokenizer = Tokenizer(num_words=MAX_WORDS, oov_token="<OOV>")
tokenizer.fit_on_texts(df_train[TEXT_COLUMN])
word_index = tokenizer.word_index

# Convert text to sequences and pad them
sequences = tokenizer.texts_to_sequences(df_train[TEXT_COLUMN])
X_text = pad_sequences(sequences, maxlen=MAX_LEN, padding='post', truncating='post')

# Split the dataset into training and validation sets
X_text_train, X_text_val, X_cat_train, X_cat_val, Y_train, Y_val = train_test_split(
    X_text, X_cat, Y, test_size=0.2, random_state=0, stratify=Y
)


### Model Architecture
We're using a Bidirectional LSTM (Bi-LSTM) network.  

LSTMs are a type of RNN designed to overcome the vanishing gradient problem, making them excellent at learning long term dependencies in sequence data (like text).

Bidirectional LSTM means that the tweet is processed twice.  One time running from beginning to end, and another from the end to the beginning.  The outputs are combined.

For text classification, knowing the context that follows a word can be as important as the context that precedes it and this method attempts to capture that.

In [None]:
# Build the Bi-LSTM Model Architecture

EMBEDDING_DIM = 100
VOCAB_SIZE = min(len(word_index) + 1, MAX_WORDS)
NUM_CATEGORICAL_FEATURES = X_cat_train.shape[1] # Number of one-hot encoded features

# Text Input Branch (RNN)
text_input = tf.keras.Input(shape=(MAX_LEN,), name='text_input')
x = Embedding(VOCAB_SIZE, EMBEDDING_DIM)(text_input)
x = Bidirectional(LSTM(64, return_sequences=False))(x) # 64 units, returns only the final output
x = Dense(32, activation='relu')(x)
x = Dropout(0.5)(x)
text_output = x

# Categorical Input Branch (Dense Network)
cat_input = tf.keras.Input(shape=(NUM_CATEGORICAL_FEATURES,), name='cat_input')
y = Dense(16, activation='relu')(cat_input) # Simple dense layer for categorical features
cat_output = y

# Merge Branches
# Concatenate the output of the text branch and the categorical branch
merged = Concatenate()([text_output, cat_output])

# Final Output Layer
z = Dense(16, activation='relu')(merged)
z = Dropout(0.5)(z)
output = Dense(1, activation='sigmoid')(z) # Sigmoid for binary classification

# Create Model
model = tf.keras.Model(inputs=[text_input, cat_input], outputs=output)

In [None]:
# Hyperparameter Optimization & Training Setup

# Optimization: Using a learning rate scheduler and dropout
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4) # Start with a small learning rate

model.compile(
    optimizer=optimizer,
    loss='binary_crossentropy',
    metrics=['accuracy', tf.keras.metrics.Precision(), tf.keras.metrics.Recall()]
)

# Callbacks for improved training performance
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=5, restore_best_weights=True
)

print(model.summary())

# Train Model
history = model.fit(
     {'text_input': X_text_train, 'cat_input': X_cat_train},
     Y_train,
     epochs=20,
     batch_size=32,
     validation_data=({'text_input': X_text_val, 'cat_input': X_cat_val}, Y_val),
     callbacks=[early_stopping],
     verbose=1
 )

In [44]:
# apply preprocessing pipeline to the test data
df_test = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")
submission_ids = df_test['id']
print(submission_ids.info())

<class 'pandas.core.series.Series'>
RangeIndex: 3263 entries, 0 to 3262
Series name: id
Non-Null Count  Dtype
--------------  -----
3263 non-null   int64
dtypes: int64(1)
memory usage: 25.6 KB
None


In [45]:
# Describe the data size, dimension, structure, etc...
print(f"Dimensions (Rows, Columns): {df_test.shape}")
print(f"Total elements: {df_test.size}")

df_test.info()
print("\nData Types per Column:\n", df_test.dtypes)
print("\nFirst 3 rows:\n", df_test.head(3))

# Check for missing values
print("\nMissing values per column:\n", df_test.isnull().sum())

Dimensions (Rows, Columns): (3263, 4)
Total elements: 13052
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3263 entries, 0 to 3262
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        3263 non-null   int64 
 1   keyword   3237 non-null   object
 2   location  2158 non-null   object
 3   text      3263 non-null   object
dtypes: int64(1), object(3)
memory usage: 102.1+ KB

Data Types per Column:
 id           int64
keyword     object
location    object
text        object
dtype: object

First 3 rows:
    id keyword location                                               text
0   0     NaN      NaN                 Just happened a terrible car crash
1   2     NaN      NaN  Heard about #earthquake is different cities, s...
2   3     NaN      NaN  there is a forest fire at spot pond, geese are...

Missing values per column:
 id             0
keyword       26
location    1105
text           0
dtype: int64


In [46]:
# Re-run all imputation and cleaning steps
#df_test.dropna(subset=['keyword'], inplace=True)
df_test['keyword'].fillna('UNKNOWN', inplace=True)
df_test['location'].fillna('UNKNOWN', inplace=True)
df_test['clean_location'] = df_test['location'].apply(clean_location)
df_test['standard_location'] = df_test['clean_location'].apply(consolidate_location)
df_test['clean_keyword'] = df_test['keyword'].apply(unquote_keyword)
df_test['clean_text'] = df_test['text'].apply(preprocess_text)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_test['keyword'].fillna('UNKNOWN', inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_test['location'].fillna('UNKNOWN', inplace=True)


In [47]:
# Aggregate Locations and Keywords
df_test['agg_location'] = np.where(
    df_test['standard_location'].isin(top_n_locations),
    df_test['standard_location'],
    'OTHER_LOCATION'
)
df_test['agg_keyword'] = np.where(
    df_test['clean_keyword'].isin(top_n_keywords),
    df_test['clean_keyword'],
    'OTHER_KEYWORD'
)

In [48]:
# One-Hot Encode Categorical Features
# Create dummies for the test set
test_cat_dummies = pd.get_dummies(df_test['agg_location'], prefix='agg_location')
test_cat_dummies_k = pd.get_dummies(df_test['agg_keyword'], prefix='agg_keyword')
test_cat_dummies = pd.concat([test_cat_dummies, test_cat_dummies_k], axis=1)

#print(test_cat_dummies.head())

In [49]:
# Reindex the test set dummy features to match the training set column list.
# Columns in the test set but not in 'categorical_cols' are dropped (noise).
# Columns in 'categorical_cols' but not in the test set are added as 0s (safe).
test_cat_aligned = test_cat_dummies.reindex(columns=categorical_cols, fill_value=0)

# Convert the final, aligned DataFrame to a NumPy array for the Keras model
X_cat_test = test_cat_aligned.values

print(f"Categorical features successfully aligned. Final shape: {X_cat_test.shape}")

Categorical features successfully aligned. Final shape: (3263, 76)


In [50]:
# Tokenize Text 
X_text_test = tokenizer.texts_to_sequences(df_test['clean_text'])
X_text_test = pad_sequences(X_text_test, maxlen=MAX_LEN, padding='post', truncating='post')

In [51]:
# Make predictions (probabilities)
predictions_proba = model.predict({'text_input': X_text_test, 'cat_input': X_cat_test})

# Convert probabilities to binary class (0 or 1) using 0.5 threshold
predictions = (predictions_proba > 0.5).astype(int).flatten()

[1m102/102[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 11ms/step


In [52]:
# Create Submission DataFrame
submission_df = pd.DataFrame({
    'id': submission_ids,
    'target': predictions
})

# Display the head of the submission file
print(submission_df.head())

# Save the submission file to CSV
submission_df.to_csv('submission.csv', index=False)
print("\nsubmission.csv created.")

   id  target
0   0       1
1   2       0
2   3       1
3   9       1
4  11       1

submission.csv created.


## Result and Analysis
Hyperparameter tuning, compare different architectures, apply techniques to improve training/performance.  Include results with tables and figures.  Discuss why or why not things worked well, any troubleshooting, and hyperparameter optimization procedure summary.

### Results 
#### (based on a recent 'Run' may not be the same results acheived on subsequent runs)
Best Epoch = 6.  An accuracy of 81.3% suggests that the model is performing quite well at distinguishing between disaster and non-disaster tweets.  The difference between Precision (84.0%) and Recall (69.7%) indicates a slight bias.  The model correctly predicts a disaster 84% of the time, but misses about 30% of the actual disaster tweets (low recall).  

The F1 score is 2x(Precision * Recall)/(Precision + Recall) which is 76.2%.

### Analysis
We used a Bi-LSTM instead of a simpler RNN to improve sequence learning capacity.  We used Adam optimizer with a small initial learning rate (1e^-4) for slow, stable convergence.  Dropout (0.5) was applied after the main Bi-LSTM output and before the final dense layers to fight overfitting.  We used an Early Stopping callback to monitor the val_loss.  Training automatically stopped after the validation loss failed to improve for 5 consecutive epochs.  This technique is used to prevent overfitting (and speed things up).  

Submissions are evaluated using F1 between the predicted and expected answers.

## Discussion and Conclusion
Discuss and interpret results as well as learnings and takeaways.  What did and did not help improve the performance of the model(s)  What improvements could we try in the future?

### Areas for Improvement
The Bi-LSTM model might still struggle with stuble sarcasm or highly figurative language that is often found in social media.  

A pre-trained embedding like GloVe might have gave a better starting point, especially if many words in the test set were not present in the training set.  

The categorical dense layers were simple, more complex interactions could be learned by adding more layers to the categorical branch. 