## Introduction

In this notebook, we will build a model to automatically classify tweet text into disaster-related or not disaster-related categories. This can help identify tweets discussing real-world disasters and expedite relief efforts.

The dataset comes from a Kaggle competition and contains ~10,000 tweets labeled as positive (relevant to disasters) or negative (not relevant).

### Imports and Settings

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import re
import string
import matplotlib.pyplot as plt
from imblearn.over_sampling import RandomOverSampler
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
import tensorflow as tf
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences

In [None]:
import nltk
import subprocess

try:
    nltk.data.find('wordnet.zip')
except LookupError:
    # WordNet was not found: download it into the working directory and unzip it
    nltk.download('wordnet', download_dir='/kaggle/working/')
    command = "unzip /kaggle/working/corpora/wordnet.zip -d /kaggle/working/corpora"
    subprocess.run(command.split())
    nltk.data.path.append('/kaggle/working/')

from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

## Exploratory Data Analysis

The training data has 7613 labeled samples. Let's inspect some samples from each class.


In [None]:
tweets = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')

print(tweets['text'][tweets['target']==0].sample(5))
print(tweets['text'][tweets['target']==1].sample(5))

We observe abbreviations, hashtags, and emojis typical of tweet language. Both classes discuss related topics like flooding and damage.

Let's visualize the class distribution:

In [None]:
# sort_index() orders the counts by label (0, 1) so the bars match the tick labels
plt.bar([0,1], tweets['target'].value_counts().sort_index())
plt.xticks([0,1], ['Negative', 'Positive'])
plt.title('Class Distribution')
plt.ylabel('Count')
plt.show()

This shows the dataset is somewhat imbalanced, with more negative than positive samples. We should consider techniques like oversampling to handle the class imbalance.

Let's also look at the tweet length distribution:

In [None]:
tweets['text'].apply(len).hist(bins=30)
plt.title('Tweet Length Distribution')
plt.xlabel('Length')
plt.ylabel('Frequency')
plt.show()

The histogram suggests that most tweets are short but still carry content: typically more than 40 and fewer than 100 characters.

## Data Preprocessing

To prepare the text for modeling, we will:
- Normalize all characters to lowercase
- Remove URLs and the `@` and `#` symbols (the username and hashtag words themselves are kept)
- Remove punctuation
- Lemmatize text
- Remove stopwords

In [None]:
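# Use a separate name so the imported `stopwords` module is not shadowed
# (re-running this cell would otherwise fail).
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()                                # normalize case
    text = re.sub(r'http\S+', '', text)                # remove URLs
    text = text.replace('@', '').replace('#', '')      # drop the symbols, keep the words
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Drop stopwords, then lemmatize the remaining tokens
    tokens = [lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words]
    return " ".join(tokens)

tweets['text'] = tweets['text'].apply(preprocess)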
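
To make the effect of these steps concrete, here is a minimal standalone sketch that applies the same cleaning order to one made-up tweet. It deliberately leaves out lemmatization so it runs without NLTK data, and `TOY_STOPWORDS` and `clean_tweet` are hypothetical names: the set is a tiny stand-in for the NLTK English list, not the real thing.

```python
import re
import string

# Hypothetical, tiny stand-in for NLTK's English stopword list.
TOY_STOPWORDS = {"the", "is", "in", "a", "of", "and"}

def clean_tweet(text):
    """Lowercase, then strip URLs, '@'/'#' symbols, punctuation, and stopwords."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)              # drop URLs
    text = text.replace("@", "").replace("#", "")    # keep the word, drop the symbol
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in TOY_STOPWORDS)

example = "Forest #Fire near the lake! Details: http://t.co/abc123 @CityNews"
print(clean_tweet(example))  # forest fire near lake details citynews
```

Note that the hashtag and username words survive; only the `#` and `@` symbols are removed.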


## Model Building

We will split the data 80-20 into training and validation sets. 

The text features will be encoded into TF-IDF vectors.

A logistic regression classifier will be trained on the TF-IDF representations.


In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(tweets['text'], tweets['target'], test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train) 
X_valid = vectorizer.transform(X_valid)

model = LogisticRegression()
model.fit(X_train, y_train)
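
As a rough illustration of what the TF-IDF encoding computes, the sketch below scores a three-document toy corpus by hand. It assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which matches my understanding of the `TfidfVectorizer` defaults. The corpus, the whitespace tokenization, and the `tfidf` function are made up for this example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: c * idf[t] for t, c in counts.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

toy_corpus = ["fire near forest", "flood near river", "love this song"]
vectors = tfidf(toy_corpus)
# 'near' appears in two documents, so it is weighted lower than 'fire'.
print(vectors[0])
```

Terms shared across many tweets are down-weighted, so the classifier leans on the rarer, more distinctive words.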

## Evaluation

We get ~80% validation accuracy with the logistic regression classifier. The classification report shows decent F1 scores for both classes.

In [None]:
predictions = model.predict(X_valid)

print(accuracy_score(y_valid, predictions))
print(classification_report(y_valid, predictions))
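
For readers unfamiliar with the numbers in the classification report, the following sketch computes accuracy, precision, recall, and F1 for the positive class from scratch. The labels are made up for illustration and `binary_report` is a hypothetical helper, not part of scikit-learn.

```python
def binary_report(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels: 4 disaster tweets, 6 non-disaster tweets.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
report = binary_report(y_true, y_pred)
print(report)
```

In this toy case accuracy is 0.7 while recall on the positive class is only 0.5, which is why the per-class report is worth reading alongside the single accuracy figure.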

The simple NLP classifier does a decent job. Let's try some other techniques.

Let's oversample the minority positive class:

In [None]:
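# Oversample only the training split so the validation set stays untouched
ros = RandomOverSampler(random_state=42)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)

# Reuse the already-fitted TF-IDF features; no new vectorizer is needed
model = LogisticRegression()
model.fit(X_train_ros, y_train_ros)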

predictions = model.predict(X_valid)

print(accuracy_score(y_valid, predictions))
print(classification_report(y_valid, predictions))
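
Conceptually, random oversampling just duplicates minority-class samples until the classes are the same size. The sketch below shows that idea in plain Python, assuming sampling with replacement up to the majority count. `random_oversample` is a hypothetical helper written for illustration and is not the imbalanced-learn implementation.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """Duplicate random samples of smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        extra = rng.choices(pool, k=target - count)  # sampling with replacement
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out

X_toy = ["a", "b", "c", "d", "e"]
y_toy = [0, 0, 0, 1, 1]
X_bal, y_bal = random_oversample(X_toy, y_toy)
print(Counter(y_bal))  # both classes now have 3 samples
```

Because the added rows are exact copies, oversampling changes the class weighting the model sees but adds no new information.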

We can also try other classifiers like Naive Bayes and SVM:

In [None]:
models = [
    LogisticRegression(),
    MultinomialNB(),
    SVC()
]

for model in models:
    model.fit(X_train_ros, y_train_ros)
    preds = model.predict(X_valid)
    print(model)
    print(accuracy_score(y_valid, preds))

Next, we can try a simple feed-forward neural network on the TF-IDF features (trained on the original, non-oversampled training set):

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid') 
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
X_train_arr = X_train.toarray()
model.fit(X_train_arr, y_train, epochs=5, verbose=1)

In [None]:
loss, accuracy = model.evaluate(X_train_arr, y_train)
print(f"Loss: {loss}")
print(f"Accuracy: {accuracy}")

X_valid_arr = X_valid.toarray()
loss, accuracy = model.evaluate(X_valid_arr, y_valid)
print(f"Loss: {loss}")
print(f"Accuracy: {accuracy}")

The neural network did well on the training data, but its validation accuracy was noticeably lower, which suggests overfitting.

Let's try a stronger model architecture.

First we will tokenize the tweets into integer sequences and pad them to a fixed length.

In [None]:
MAX_NUM_WORDS = 10000  
MAX_SEQUENCE_LENGTH = 100  

tokenizer = Tokenizer(num_words=MAX_NUM_WORDS, oov_token='<OOV>')
tokenizer.fit_on_texts(tweets['text'])
sequences = tokenizer.texts_to_sequences(tweets['text'])
padded_sequences = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH, truncating='post', padding='post')
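
To show what `padding='post'` and `truncating='post'` do, here is a small pure-Python sketch of the same behavior. `pad_post` is a hypothetical helper and the integer sequences are made up; it assumes 0 is the padding value, as in the Keras default.

```python
def pad_post(sequences, maxlen, value=0):
    """Cut each sequence at the end to maxlen, then right-pad with `value`."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                             # truncating='post' keeps the start
        padded.append(seq + [value] * (maxlen - len(seq)))   # padding='post' pads the end
    return padded

toy_sequences = [[5, 8, 2], [7], [4, 9, 1, 3, 6, 2]]
print(pad_post(toy_sequences, maxlen=4))
# [[5, 8, 2, 0], [7, 0, 0, 0], [4, 9, 1, 3]]
```

Every row ends up with the same length, which is what the embedding layer needs as input.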

Then we will make a new train/validation/test split (70/15/15) from the padded data.

In [None]:
X = padded_sequences
y = tweets['target'].values

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
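
The two-step split can be checked with simple arithmetic. This sketch assumes the held-out portion is rounded up and the rest goes to training, which is my understanding of how `train_test_split` handles a fractional `test_size`; `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out), rounding the held-out part up (assumed behavior)."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

n_train, n_temp = split_sizes(7613, 0.3)   # first split: train vs. held-out 30%
n_val, n_test = split_sizes(n_temp, 0.5)   # second split: validation vs. test
print(n_train, n_val, n_test)  # 5329 1142 1142
```

Under that assumption the 7613 tweets divide into roughly 70% training, 15% validation, and 15% test.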

Now for the model architecture, which includes an embedding layer, dropout, and bidirectional LSTM layers that can capture word-order context in the tweet sequences.

In [None]:
embedding_dim = 64

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=MAX_NUM_WORDS, output_dim=embedding_dim, input_length=MAX_SEQUENCE_LENGTH),
    tf.keras.layers.SpatialDropout1D(0.3),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(16)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer=tf.keras.optimizers.Adam(learning_rate=0.001) , metrics=['accuracy'])
model.summary()

We will also use early stopping, model checkpointing, and learning rate reduction on plateau, all monitoring the validation loss, to limit overfitting and keep the best weights.

In [None]:
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True, verbose=1)
checkpoint = tf.keras.callbacks.ModelCheckpoint('best_model.h5', monitor='val_loss', save_best_only=True, verbose=1)
lr_schedule = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=1, verbose=1, min_lr=0.00001)

history = model.fit(X_train, y_train, batch_size=128, epochs=50, validation_data=(X_val, y_val), callbacks=[early_stopping, checkpoint, lr_schedule], verbose=1)
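
The early stopping rule with `patience=3` can be sketched as a small simulation: training stops once the validation loss has failed to improve for three consecutive epochs. The loss values below are hypothetical and `simulate_early_stopping` is an illustrative helper, not the Keras callback itself.

```python
def simulate_early_stopping(val_losses, patience=3):
    """Return (stopped_epoch, best_epoch), 0-indexed; stopped_epoch is None if never triggered."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

hypothetical_val_losses = [0.60, 0.50, 0.48, 0.49, 0.50, 0.51, 0.47]
stopped, best = simulate_early_stopping(hypothetical_val_losses, patience=3)
print(stopped, best)  # 5 2
```

In this made-up run, training halts at epoch 5 and never reaches the better loss at epoch 6. With `restore_best_weights=True`, the weights from epoch 2 would be the ones kept.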


In [None]:
plt.figure(figsize=(12, 4))
    
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')

plt.subplot(1, 2, 2)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')

plt.tight_layout()
plt.show()

In [None]:
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {loss}")
print(f"Test Accuracy: {accuracy}")

## Submission

In [None]:
test_tweets = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
test_tweets['text'] = test_tweets['text'].apply(preprocess)
test_sequences = tokenizer.texts_to_sequences(test_tweets['text'])
test_padded = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH, padding='post', truncating='post')
predictions = model.predict(test_padded)
# Flatten the (n, 1) sigmoid outputs and threshold at 0.5
binary_predictions = (predictions.ravel() > 0.5).astype(int)
submission_df = pd.DataFrame({'id': test_tweets['id'], 'target': binary_predictions})
submission_df.to_csv('submission.csv', index=False)

## Conclusion

Throughout this notebook, we engaged in the entire lifecycle of a natural language processing (NLP) project tailored for tweet classification. Key takeaways from our study include:

1. **Data Examination:** Upon initial examination, the tweets in our dataset were found to be replete with elements typical of microblogging platforms: hashtags, mentions, emojis, and abbreviations. The dataset was somewhat imbalanced, with a higher number of negative samples.

2. **Data Preprocessing:** Essential preprocessing steps were undertaken, including lowercasing, removal of URLs, `@` and `#` symbols, and punctuation, as well as lemmatization. Stopwords were also excluded to reduce noise in the text.

3. **Model Exploration:** Several models were evaluated:
    - A Logistic Regression model achieved an accuracy of approximately 80%.
    - Oversampling the minority class helped address the data imbalance, but the validation accuracy decreased slightly.
    - Multiple classifiers, including Naive Bayes, SVM, and Neural Networks, were explored. Their performances varied, with the neural network model performing remarkably well on the training data but less so on validation.
    
4. **Neural Network Expansion:** To enhance the neural network's performance, a more robust architecture was employed. By introducing embeddings, dropout layers, and bidirectional LSTM layers, the model's capability to discern patterns in the sequence data was improved. Additionally, early stopping and learning rate reduction were used as strategies to optimize the model's performance.

5. **Evaluation:** The LSTM-based deep learning model demonstrated a test accuracy of approximately 80%, which is a respectable figure considering the complexities and nuances of tweet language.

6. **Submission:** The final model was used to predict the labels of a test dataset. The results were compiled and are ready for submission.

In summary, this study underscores the importance of a systematic and iterative approach to NLP. The simple TF-IDF logistic regression baseline already reached about 80% accuracy, and the more complex LSTM architecture landed in the same range rather than clearly improving on it. Future iterations could explore more sophisticated architectures, ensemble methods, or external datasets to further enhance performance.