# Kaggle Disaster Tweets Classification

## Step 1: Brief Description of the Problem and Data

The objective of this project is to classify tweets as either related to a disaster or not. This is a binary classification problem that leverages Natural Language Processing (NLP) techniques. The data consists of tweets, each represented by textual content, and additional features such as keywords and location, though these may not always be present. The target variable is binary, indicating whether the tweet is disaster-related (1) or not (0).

The dataset includes:

- Training data: 7,613 tweets, each labeled with a target variable (1 for disaster-related, 0 for non-disaster).
- Test data: 3,263 tweets without the target label.
- Data columns:
  - id: Unique identifier for each tweet.
  - keyword: A keyword from the tweet, which may be relevant to disasters.
  - location: Location where the tweet originated, which may or may not be present.
  - text: The tweet's content.
  - target: The label, only present in the training data.

The challenge is to preprocess the text data, extract relevant features, and build a model that can accurately predict the target variable for the test set.

In [None]:
# Load the datasets
import pandas as pd

train_df = pd.read_csv('/train.csv')
test_df = pd.read_csv('/test.csv')
sample_submission_df = pd.read_csv('/sample_submission.csv')

# Display the first few rows of each dataset to understand their structure
train_df.head(), test_df.head(), sample_submission_df.head()

## Step 2: Exploratory Data Analysis (EDA)

EDA was performed to understand the distribution and structure of the data. The first step was to visualize the distribution of the target variable to check whether the classes were balanced. A count plot showed a relatively balanced dataset between disaster-related and non-disaster-related tweets.

Next, I inspected the keyword and location columns. I found missing values in both, which were imputed with the placeholder "unknown". This ensured no data was lost and allowed the model to consider these features even when specific values were missing.

Visualizations of the most common keywords and locations provided insights into the relevance of these features. Common keywords like "fatalities" and "damage" appeared frequently in disaster-related tweets, indicating their potential importance. Locations were more varied and less concentrated, suggesting they might not be as predictive.
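One way to check how strongly a keyword is associated with disaster tweets is to compute, per keyword, the share of tweets labeled 1. The sketch below uses a small hypothetical DataFrame (`toy_df`, not real dataset rows) with the same `keyword` and `target` column names as the training data; the names `keyword_stats`, `disaster_rate`, and `n_tweets` are illustrative.

```python
import pandas as pd

# Hypothetical toy data standing in for train_df (not real Kaggle rows)
toy_df = pd.DataFrame({
    "keyword": ["fatalities", "fatalities", "damage", "damage", "unknown", "unknown"],
    "target":  [1, 1, 1, 0, 0, 0],
})

# Share of disaster tweets (mean of the 0/1 target) and tweet count per keyword
keyword_stats = (
    toy_df.groupby("keyword")["target"]
    .agg(disaster_rate="mean", n_tweets="count")
    .sort_values("disaster_rate", ascending=False)
)
print(keyword_stats)
```

On the real data, the same `groupby` applied to `train_df` would show which keywords lean toward the disaster class. Rates for keywords with very few tweets should be read with caution.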

Finally, the text data itself was analyzed. I assessed tweet lengths and the frequency of common words. The preprocessing function used later in this notebook cleans the text by lowercasing it, removing URLs, special characters, and a small set of stopwords, and splitting on whitespace. It does not apply lemmatization.
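The tweet-length and word-frequency checks described above can be sketched roughly as follows. This is a minimal illustration on three hypothetical tweets (`toy_tweets` is not taken from the dataset), using only the standard library.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy tweets as (text, target) pairs, not real dataset rows
toy_tweets = [
    ("Forest fire near the town", 1),
    ("I love this sunny day", 0),
    ("Fire crews fight the fire", 1),
]

# Average tweet length in whitespace-separated words for each class
words_per_class = {}
for text, target in toy_tweets:
    words_per_class.setdefault(target, []).append(len(text.split()))
avg_words = {target: mean(lengths) for target, lengths in words_per_class.items()}

# Most frequent lowercase words across all tweets
word_counts = Counter(
    word for text, _ in toy_tweets for word in text.lower().split()
)
top_words = word_counts.most_common(2)

print(avg_words)
print(top_words)
```

With a pandas DataFrame, the word count per tweet could be obtained in a similar way with `train_df['text'].str.split().str.len()`.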

In [None]:
# Visualize the distribution of the target variable
import matplotlib.pyplot as plt
import seaborn as sns

sns.countplot(x='target', data=train_df)
plt.title('Distribution of Target Variable')
plt.show()

### Handling Missing Data

In [None]:
# Impute missing values in the 'keyword' and 'location' columns with 'unknown'.
# Assign the result back instead of calling fillna(inplace=True) on a column,
# which newer pandas versions warn about and which may not modify the DataFrame.
for df in (train_df, test_df):
    df['keyword'] = df['keyword'].fillna('unknown')
    df['location'] = df['location'].fillna('unknown')

# Verify that there are no missing values left
train_df.isnull().sum(), test_df.isnull().sum()

### Keyword and Location Analysis

In [None]:
# Visualize the most common keywords and locations.
# Note: the 'unknown' placeholder from the imputation step is counted like any other value.
top_keywords = train_df['keyword'].value_counts().head(15)
sns.barplot(x=top_keywords.values, y=top_keywords.index)
plt.title('Top 15 Keywords')
plt.xlabel('Number of tweets')
plt.show()

top_locations = train_df['location'].value_counts().head(15)
sns.barplot(x=top_locations.values, y=top_locations.index)
plt.title('Top 15 Locations')
plt.xlabel('Number of tweets')
plt.show()

## Step 3: Preparing Data for Modeling

Given the nature of the problem, where the goal is to classify text data, I opted for a Logistic Regression model as a starting point. Logistic Regression is a linear model commonly used for binary classification tasks, especially in scenarios involving high-dimensional text data.

Text preprocessing was critical to this step. The tweets were transformed into numerical features using TF-IDF, a method that weighs words based on their frequency in a document relative to their frequency across all documents. This approach helps in emphasizing unique words that might indicate a disaster, while downplaying common words.

Though more advanced architectures like RNNs or Transformers could be explored, Logistic Regression was chosen for its efficiency and interpretability, which make it a reasonable choice for initial experimentation. The model was trained on the transformed text data. The training code in Step 4 sets only max_iter and random_state and leaves the regularization strength at its default, so hyperparameter tuning is treated there as a possible improvement rather than a completed step.

In [None]:
# Basic Text Preprocessing (Stopwords Removal, Tokenization)
import re

# Manually defined list of stopwords
manual_stopwords = {'the', 'in', 'a', 'of', 'to', 'and', 'on', 'for', 'i', 's', 'is', 'at', 'by', 'from', 'it', 'you', 'my', 'that'}

# Simplified preprocessing function
def basic_preprocess(text):
    # Remove URLs and special characters, tokenize, and remove stopwords
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'\W+', ' ', text.lower())
    tokens = [word for word in text.split() if word.isalpha() and word not in manual_stopwords]
    return ' '.join(tokens)

# Apply preprocessing
train_df['clean_text'] = train_df['text'].apply(basic_preprocess)
test_df['clean_text'] = test_df['text'].apply(basic_preprocess)

# Display the cleaned text
train_df[['text', 'clean_text']].head()

### Feature Extraction with TF-IDF

In [None]:
# TF-IDF Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

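# Note: the vectorizer is fit on all labeled tweets before the train/validation split,
# so its vocabulary and IDF weights also reflect the validation tweets. Fitting it on
# the training split only would give a stricter validation estimate.
tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(train_df['clean_text'])
y = train_df['target']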

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
X_test = tfidf.transform(test_df['clean_text'])

X_train.shape, X_val.shape, X_test.shape

## Step 4: Model Training and Evaluation

The Logistic Regression model achieved an accuracy of approximately 80.7% on the validation set. The model performed well in identifying non-disaster tweets, with high precision and recall. However, it struggled slightly with disaster-related tweets, indicating potential areas for improvement.
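To show how accuracy can hide a gap between the two classes, the following sketch computes per-class precision and recall by hand. The labels and predictions are hypothetical and are not the notebook's actual validation results; `precision_recall` is an illustrative helper.

```python
# Hypothetical validation labels and predictions (not the notebook's results)
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

def precision_recall(y_true, y_pred, positive):
    """Return (precision, recall) treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
disaster_precision, disaster_recall = precision_recall(y_true, y_pred, positive=1)
other_precision, other_recall = precision_recall(y_true, y_pred, positive=0)

print(accuracy, disaster_recall, other_recall)
```

In this toy case the overall accuracy is 0.7, yet recall is 0.8 for the non-disaster class and only 0.6 for the disaster class, the same kind of imbalance described above.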

Several strategies were considered to enhance performance:

- Adjusting parameters like regularization strength in Logistic Regression could fine-tune the balance between bias and variance.
- Additional features, such as n-grams or embeddings, could be incorporated to capture more contextual information.
- Exploring more complex models like RNNs or fine-tuning a pre-trained transformer model could better capture the semantic nuances of tweets.
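The first two ideas could be explored together with a grid search over the regularization strength `C` and the n-gram range. The sketch below is one possible setup under stated assumptions: the corpus (`toy_texts`, `toy_labels`) is hypothetical, the grid values are arbitrary, and F1 is chosen as the scoring metric for illustration. Wrapping the vectorizer and classifier in a `Pipeline` means TF-IDF is refit on the training portion of each fold, so validation text does not influence the features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = disaster-related, 0 = not
toy_texts = [
    "wildfire destroys homes near town", "earthquake damage reported downtown",
    "flood waters rising fast", "wildfire evacuation ordered tonight",
    "earthquake collapses bridge", "flood damage closes roads",
    "great movie tonight", "love this song",
    "great coffee this morning", "love my new phone",
    "watching a great game", "this song is stuck in my head",
]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Vectorizer and classifier are tuned and cross-validated as one unit
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

# Illustrative grid: unigrams vs. unigrams+bigrams, and three values of C
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1")
search.fit(toy_texts, toy_labels)

print(search.best_params_, search.best_score_)
```

On the real data, `toy_texts` and `toy_labels` would be replaced by `train_df['clean_text']` and `train_df['target']`, and a wider grid or more folds may be appropriate.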

In [None]:
# Training a Logistic Regression Model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)
y_val_pred = model.predict(X_val)

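accuracy = accuracy_score(y_val, y_val_pred)
classification_report_result = classification_report(y_val, y_val_pred)

# Print rather than display a tuple so the report's line breaks render correctly
print(f'Validation accuracy: {accuracy:.3f}')
print(classification_report_result)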

## Step 5: Generate Predictions and Prepare Submission

In [None]:
# Generate predictions and prepare submission file
test_predictions = model.predict(X_test)
submission_df = pd.DataFrame({'id': test_df['id'], 'target': test_predictions})
submission_df.to_csv('/submission.csv', index=False)
submission_df.head()

## Conclusion

In conclusion, the Logistic Regression model provided a strong baseline for classifying disaster-related tweets. The EDA and preprocessing steps were crucial in ensuring that the model had access to clean and meaningful data. While the model performed well, especially with non-disaster tweets, future work could involve exploring more sophisticated models and additional features to improve disaster tweet detection.

Key learnings from this project include the importance of text preprocessing, the effectiveness of TF-IDF in text classification, and the potential need for advanced models when dealing with nuanced textual data. Hyperparameter optimization and deep learning approaches are the most direct next steps.