# Advanced Classification of Disaster-Related Tweets Using Deep Learning

## Introduction
In this project, we build a deep learning model with Keras to classify tweets by whether or not they refer to a real disaster. The task is inspired by the "NLP with Disaster Tweets" challenge and uses additional data to improve model performance and insight. The dataset is a good opportunity to apply Natural Language Processing (NLP) techniques to real-world data.

---

## Dataset Overview
### Context
The dataset contains over 11,000 tweets associated with disaster-related keywords such as "crash," "quarantine," and "bush fires." The data structure is based on the original "Disasters on social media" dataset. It includes:
- **Tweets:** The text of the tweet.
- **Keywords:** Specific disaster-related keywords.
- **Location:** The location associated with the tweet, where one is available. This field is missing for many rows (7,952 of 11,370 are non-null).

These tweets were collected on **January 14th, 2020** and cover major events including:
- The eruption of Taal Volcano in Batangas, Philippines.
- The emerging outbreak of **Coronavirus (COVID-19)**.
- The devastating **Bushfires in Australia**.
- The **Iranian downing of flight PS752**.

### Important Note
The dataset contains text that may include profane, vulgar, or offensive language. Please approach with caution during analysis.

---

## Project Goals
### Inspiration
The primary goal of this project is to develop a machine learning model capable of identifying whether a tweet is genuinely related to a disaster or not. This involves:
1. Enriching the already available data with newly collected, manually classified tweets.
2. Leveraging state-of-the-art deep learning methods to extract meaningful insights.
3. Applying NLP techniques to preprocess, clean, and tokenize the tweets for model training.

This notebook will walk through the process of preparing the dataset, building a deep learning model, and evaluating its performance. By the end, we aim to achieve a robust model that can classify disaster tweets with high accuracy.
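
As a rough illustration of the cleaning and tokenizing step in item 3, here is a minimal sketch that uses only the standard library. The names `clean_tweet` and `tokenize` are hypothetical helpers introduced for this example, and the specific rules (lowercasing, dropping URLs and @-mentions, stripping punctuation) are assumptions about what "cleaning" could mean here, not the preprocessing this notebook necessarily applies.

```python
import re

# Patterns for tweet noise; these are illustrative choices, not a fixed standard.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs, @-mentions, and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Removes '#' and other symbols but keeps the hashtag word itself.
    text = NON_ALNUM_RE.sub(" ", text)
    # Collapse runs of whitespace left behind by the substitutions.
    return " ".join(text.split())


def tokenize(text: str) -> list[str]:
    """Split a cleaned tweet into whitespace-delimited tokens."""
    return clean_tweet(text).split()


example = "Arsonist sets cars ablaze at dealership https://t.co/abc @user #fire!"
cleaned = clean_tweet(example)
tokens = tokenize(example)
```

Keeping the hashtag word while dropping the `#` symbol is a design choice: hashtags such as `#fire` often carry the disaster signal, so discarding them entirely could remove useful information.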

---

## Why It Matters
Effective classification of disaster-related tweets has numerous practical applications:
- **Emergency Response:** Helps organizations identify critical information in real time.
- **Resource Allocation:** Facilitates better planning by focusing on real disasters.
- **Misinformation Control:** Mitigates the spread of false information during crises.

In [1]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import numpy as np
import os

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [4]:
# Load the dataset
data = pd.read_csv('tweets.csv')

# Display the first few rows to inspect the dataset
print(data.head())

# Display dataset information (columns, data types, non-null counts).
# info() prints its report directly and returns None, so no print() is needed.
data.info()

   id keyword        location  \
0   0  ablaze             NaN   
1   1  ablaze             NaN   
2   2  ablaze   New York City   
3   3  ablaze  Morgantown, WV   
4   4  ablaze             NaN   

                                                text  target  
0  Communal violence in Bhainsa, Telangana. "Ston...       1  
1  Telangana: Section 144 has been imposed in Bha...       1  
2  Arsonist sets cars ablaze at dealership https:...       1  
3  Arsonist sets cars ablaze at dealership https:...       1  
4  "Lord Jesus, your love brings freedom and pard...       0  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11370 entries, 0 to 11369
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        11370 non-null  int64 
 1   keyword   11370 non-null  object
 2   location  7952 non-null   object
 3   text      11370 non-null  object
 4   target    11370 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 444.3+ 

Ensure the dataset contains the required columns:
- `text`: The tweet content.
- `target`: The classification label indicating whether the tweet refers to a real disaster.

In [6]:
# Verify required columns
assert 'text' in data.columns, "Column 'text' is missing in the dataset."
assert 'target' in data.columns, "Column 'target' is missing in the dataset."
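
Beyond confirming that the columns exist, it can be useful to look at the label distribution and the number of missing values per column before training. The sketch below assumes a DataFrame shaped like `data`. The helper names `summarize_labels` and `missing_counts` are hypothetical, and the `sample` frame is a tiny stand-in so the example runs on its own.

```python
import pandas as pd


def summarize_labels(df: pd.DataFrame, label_col: str = "target") -> pd.DataFrame:
    """Return the count and share of each label value."""
    counts = df[label_col].value_counts().sort_index()
    shares = (counts / len(df)).round(3)
    return pd.DataFrame({"count": counts, "share": shares})


def missing_counts(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values in each column."""
    return df.isna().sum()


# Stand-in data; in the notebook you would pass `data` instead.
sample = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "location": [None, "New York City", None, "Morgantown, WV"],
    "target": [1, 1, 0, 1],
})

summary = summarize_labels(sample)
missing = missing_counts(sample)
print(summary)
print(missing)
```

If one label turns out to be much rarer than the other, accuracy alone can be misleading, and metrics such as precision, recall, or F1 may be worth reporting alongside it.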

We will split the dataset into training and validation sets using an 80%-20% ratio.

In [8]:
# Features (tweet content) and labels (1 = disaster, 0 = not a disaster)
X = data['text']    # Features
y = data['target']  # Labels

# Split the dataset (80% training, 20% validation)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
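
The split above is purely random. If the two classes are imbalanced, one option is to pass `stratify` to `train_test_split` so that both subsets keep roughly the same label ratio. The following sketch uses a small synthetic frame (`toy`, an assumption made for illustration) rather than the real dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 5 positive and 15 negative examples (25% positive).
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(20)],
    "target": [1] * 5 + [0] * 15,
})

# stratify=... asks scikit-learn to preserve the label ratio in both subsets.
X_tr, X_val, y_tr, y_val = train_test_split(
    toy["text"],
    toy["target"],
    test_size=0.2,
    random_state=42,
    stratify=toy["target"],
)

print("train positives:", int(y_tr.sum()), "of", len(y_tr))
print("validation positives:", int(y_val.sum()), "of", len(y_val))
```

In the notebook, the equivalent change would be adding `stratify=y` to the existing call. Whether that is needed depends on how skewed the `target` column actually is.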

Save the training and validation datasets as separate CSV files for later use.

In [9]:
# Combine features and labels into dataframes (the 'target' values are saved under the column name 'label')
train_df = pd.DataFrame({'text': X_train, 'label': y_train})
test_df = pd.DataFrame({'text': X_test, 'label': y_test})

# Save the dataframes to CSV files
train_df.to_csv('train.csv', index=False)
test_df.to_csv('test.csv', index=False)

print("Datasets have been saved successfully:")
print("- Training set: train.csv")
print("- Validation set: test.csv")

Datasets have been saved successfully:
- Training set: train.csv
- Validation set: test.csv
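
A quick way to gain confidence in the saved files is to read one back and compare it with what was written. Tweets can contain quotes, commas, and line breaks, so the sketch below checks a round trip on a small stand-in frame (`train_demo`, an assumption for illustration) written to a temporary directory instead of the real `train.csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in rows with characters that commonly trip up CSV handling.
train_demo = pd.DataFrame({
    "text": ['He said "fire", then left', "line one\nline two"],
    "label": [1, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train.csv"
    # index=False keeps the DataFrame index out of the file, as in the cell above.
    train_demo.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
print(len(reloaded), "rows reloaded")
```

The same idea applies to the real files: reload `train.csv` and `test.csv` with `pd.read_csv` and compare their row counts with `len(X_train)` and `len(X_test)`.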
