## Introduction

In this notebook, we will build a model that automatically classifies tweets as disaster-related or not. Such a classifier can help surface tweets about real-world disasters and expedite relief efforts.

The dataset comes from the Kaggle `nlp-getting-started` competition and contains ~10,000 tweets, of which the 7,613 training tweets are labeled as positive (`target` = 1, relevant to disasters) or negative (`target` = 0, not relevant).

### Imports and Settings

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import re
import string

In [None]:
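import nltk
import subprocess

# Register the writable Kaggle directory up front so lookups and downloads agree
nltk.data.path.append('/kaggle/working/')

try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet', download_dir='/kaggle/working/')
    # Unpack the archive in place; -o overwrites existing files without prompting
    command = "unzip -o /kaggle/working/corpora/wordnet.zip -d /kaggle/working/corpora"
    subprocess.run(command.split())

# The stopword list used during preprocessing is a separate corpus
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords', download_dir='/kaggle/working/')

from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer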

## Exploratory Data Analysis

The training data has 7613 labeled samples. Let's inspect some samples from each class.


In [None]:
tweets = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')

print(tweets['text'][tweets['target']==0].sample(5))
print(tweets['text'][tweets['target']==1].sample(5))

The samples show abbreviations, hashtags, and emojis typical of tweet language. Tweets in both classes touch on similar topics, such as flooding and damage.
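
Beyond reading individual samples, it can help to check the class balance and the typical tweet length per class. The following is a minimal sketch on a made-up five-row DataFrame; the rows and the `n_words` column are illustrative assumptions, not taken from the dataset. On the real data the same calls would run against `tweets`.

```python
import pandas as pd

# Made-up rows that mimic the `text` / `target` columns of train.csv
toy = pd.DataFrame({
    "text": [
        "Wildfire spreading near the highway, evacuations under way",
        "This new album is fire",
        "Traffic was a disaster this morning lol",
        "Flood waters rising downtown, roads closed",
        "My exam was a total wreck",
    ],
    "target": [1, 0, 0, 1, 0],
})

# Share of each class; a strong imbalance would make plain accuracy misleading
class_share = toy["target"].value_counts(normalize=True).sort_index()
print(class_share)

# Average number of whitespace-separated tokens per class
toy["n_words"] = toy["text"].str.split().str.len()
mean_len = toy.groupby("target")["n_words"].mean()
print(mean_len)
```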

## Data Preprocessing

To prepare the text for modeling, we will:
- Normalize all characters to lowercase
- Remove URLs and strip the `@` and `#` symbols (the username and hashtag words themselves are kept)
- Remove punctuation
- Remove stopwords
- Lemmatize the remaining words

In [None]:
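# Use a new name so the imported `stopwords` corpus reader is not overwritten
# (overwriting it makes this cell fail when it is re-run)
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
# Build the punctuation-stripping table once instead of once per tweet
punct_table = str.maketrans('', '', string.punctuation)

def preprocess(text):
    text = text.lower()
    text = re.sub(r'http\S+', '', text)             # drop URLs
    text = text.replace('@', '').replace('#', '')   # strip the symbols, keep the words
    text = text.translate(punct_table)              # drop remaining punctuation
    # Filter out stopwords, then lemmatize the surviving tokens
    tokens = [lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words]
    return " ".join(tokens)

tweets['text'] = tweets['text'].apply(preprocess)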
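
The function above strips only the `@` and `#` characters, so the mentioned username and the hashtag word stay in the text. If the goal is to drop those tokens entirely, one possible variant is sketched below. It uses only the standard library; `strip_tweet_markup`, its `drop_hashtags` flag, and the sample tweet are hypothetical names and data introduced for illustration, and the sketch leaves out stopword removal and lemmatization.

```python
import re
import string

URL_RE = re.compile(r'http\S+|www\.\S+')   # links, including bare www. ones
MENTION_RE = re.compile(r'@\w+')           # @username tokens
HASHTAG_RE = re.compile(r'#\w+')           # #hashtag tokens
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_tweet_markup(text, drop_hashtags=True):
    """Lowercase a tweet and remove URLs, mentions, (optionally) hashtags, and punctuation."""
    text = text.lower()
    text = URL_RE.sub(' ', text)
    text = MENTION_RE.sub(' ', text)
    if drop_hashtags:
        text = HASHTAG_RE.sub(' ', text)
    text = text.translate(PUNCT_TABLE)
    # Collapse the whitespace left behind by the removals
    return ' '.join(text.split())

# Made-up tweet, not taken from the dataset
sample = "Flooding on Main St!! @cityalerts #flood http://t.co/abc123"
cleaned = strip_tweet_markup(sample)
kept_hashtag = strip_tweet_markup(sample, drop_hashtags=False)
print(cleaned)
print(kept_hashtag)
```

Whether hashtag words should be kept is a modeling choice: a hashtag such as `#flood` may carry exactly the signal the classifier needs, so it is worth comparing both settings on the validation set.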


## Model Building

We will split the data 80-20 into training and validation sets. 

The text features will be encoded into TF-IDF vectors.
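
To make the encoding concrete, here is a small sketch of what `TfidfVectorizer` produces with its default settings. The three documents are made up for illustration. Each document becomes one row, each vocabulary word one column, and a word that occurs in fewer documents receives a higher weight than one that occurs in many.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus
docs = [
    "fire near the forest",
    "forest fire spreading fast",
    "love this sunny day",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)          # sparse matrix: documents x vocabulary
print(X.shape)
print(sorted(vec.vocabulary_))

dense = X.toarray()
# With the default settings every row is scaled to unit L2 norm
row_norms = (dense ** 2).sum(axis=1) ** 0.5

# In the first document, "near" (1 document) outweighs "fire" (2 documents)
near_weight = dense[0, vec.vocabulary_["near"]]
fire_weight = dense[0, vec.vocabulary_["fire"]]
print(near_weight, fire_weight)
```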

A logistic regression classifier will be trained on the TF-IDF representations.


In [None]:
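# Hold out 20% of the tweets for validation; the fixed seed makes the split reproducible
X_train, X_valid, y_train, y_valid = train_test_split(
    tweets['text'], tweets['target'], test_size=0.2, random_state=42
)

# Fit the vocabulary and IDF weights on the training split only,
# then reuse them to transform the validation split
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_valid = vectorizer.transform(X_valid)

model = LogisticRegression()
model.fit(X_train, y_train)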
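
As an alternative formulation, the vectorizer and the classifier can be bundled into a single scikit-learn pipeline, which keeps the "fit on training data only" rule automatic. The sketch below runs on eight made-up texts; the toy data, the `stratify` argument, and `max_iter=1000` are assumptions added for illustration and are not part of the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up examples: 1 = about a real disaster, 0 = not
texts = [
    "earthquake damages buildings downtown",
    "wildfire forces evacuation of town",
    "flood waters destroy bridge",
    "tornado rips through neighborhood",
    "this pizza is the bomb",
    "my team got destroyed last night",
    "traffic jam was a disaster lol",
    "new album is absolute fire",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# stratify keeps the class ratio identical in both splits
X_tr, X_va, y_tr, y_va = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# fit() learns the TF-IDF vocabulary from X_tr only; predict() reuses it
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)
preds = pipe.predict(X_va)
print(list(preds), y_va)
```

A pipeline object can also be passed directly to utilities such as `cross_val_score`, which then refit the vectorizer inside every fold.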

## Evaluation

The logistic regression classifier reaches roughly 80% validation accuracy, and the classification report shows reasonable F1 scores for both classes.

In [None]:
predictions = model.predict(X_valid)

print(accuracy_score(y_valid, predictions))
print(classification_report(y_valid, predictions))
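
To show how the numbers in the classification report relate to each other, the sketch below computes them by hand for a made-up set of eight labels and predictions. These values are illustrative only and are not the results of the model above.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Made-up ground truth and predictions
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

# Rows are true classes, columns are predicted classes, both ordered [0, 1]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

acc = accuracy_score(y_true, y_pred)
# average="binary" reports the metrics for the positive class (label 1)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)

# F1 is the harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)
print(cm)
print(acc, precision, recall, f1, f1_manual)
```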

## Conclusion

In this notebook, we built a simple NLP classifier to detect disaster-related tweets. The steps included:

- Exploring the tweet dataset
- Preprocessing the text data
- Creating TF-IDF features
- Fitting a logistic regression model
- Evaluating on a held-out set

Some ways to improve the model would be:
- Using word embeddings instead of TF-IDF
- Trying other classifiers, such as SVMs or RNNs
- Expanding the dataset size using augmentation
- Ensembling multiple models
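
As a starting point for the "other classifiers" idea, the sketch below compares logistic regression with a linear SVM under cross-validation. The eight texts, the `candidates` dictionary, and the choice of two folds are assumptions made to keep the example tiny; on the real data one would pass the preprocessed tweets and use more folds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up examples: 1 = about a real disaster, 0 = not
texts = [
    "earthquake damages buildings downtown",
    "wildfire forces evacuation of town",
    "flood waters destroy bridge",
    "tornado rips through neighborhood",
    "this pizza is the bomb",
    "my team got destroyed last night",
    "traffic jam was a disaster lol",
    "new album is absolute fire",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
}

results = {}
for name, clf in candidates.items():
    # The pipeline refits TF-IDF inside each fold, so no fold sees the other's vocabulary
    pipe = make_pipeline(TfidfVectorizer(), clf)
    results[name] = cross_val_score(pipe, texts, labels, cv=2, scoring="accuracy")
    print(name, results[name].mean())
```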

This notebook provides a template for identifying disaster tweets with NLP. The same techniques can be extended toward a more robust real-world system.