# Advanced Classification of Disaster-Related Tweets Using Deep Learning

## Introduction
In this project, we will build a deep learning model using Keras to classify whether a tweet describes a real disaster or not. This task is inspired by the "NLP with Disaster Tweets" challenge and enriched with additional data to improve model performance and insights. The dataset provides a fascinating opportunity to explore Natural Language Processing (NLP) techniques on real-world data.

---

## Dataset Overview
### Context
The dataset contains over 11,000 tweets associated with disaster-related keywords such as "crash," "quarantine," and "bush fires." The data structure is based on the original "Disasters on social media" dataset. It includes:
- **Tweets:** The text of the tweet.
- **Keywords:** Specific disaster-related keywords.
- **Location:** The geographical information provided in the tweets.

These tweets were collected on **January 14th, 2020** and cover major events including:
- The eruption of Taal Volcano in Batangas, Philippines.
- The emerging outbreak of **Coronavirus (COVID-19)**.
- The devastating **Bushfires in Australia**.
- The **Iranian downing of flight PS752**.

### Important Note
The dataset contains text that may include profane, vulgar, or offensive language. Please approach with caution during analysis.

---

## Project Goals
### Inspiration
The primary goal of this project is to develop a machine learning model capable of identifying whether a tweet is genuinely related to a disaster or not. This involves:
1. Enriching the already available data with newly collected, manually classified tweets.
2. Leveraging state-of-the-art deep learning methods to extract meaningful insights.
3. Applying NLP techniques to preprocess, clean, and tokenize the tweets for model training.

This notebook will walk through the process of preparing the dataset, building a deep learning model, and evaluating its performance. By the end, we aim to achieve a robust model that can classify disaster tweets with high accuracy.

---

## Why It Matters
Effective classification of disaster-related tweets has numerous practical applications:
- **Emergency Response:** Helps organizations identify critical information in real time.
- **Resource Allocation:** Facilitates better planning by focusing on real disasters.
- **Misinformation Control:** Mitigates the spread of false information during crises.

In [None]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import numpy as np
import os

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
# Load the dataset
data = pd.read_csv('tweets.csv')

# Display the first few rows to inspect the dataset
print(data.head())

# Display dataset information (columns, data types, non-null counts)
print(data.info())

   id keyword        location  \
0   0  ablaze             NaN   
1   1  ablaze             NaN   
2   2  ablaze   New York City   
3   3  ablaze  Morgantown, WV   
4   4  ablaze             NaN   

                                                text  target  
0  Communal violence in Bhainsa, Telangana. "Ston...       1  
1  Telangana: Section 144 has been imposed in Bha...       1  
2  Arsonist sets cars ablaze at dealership https:...       1  
3  Arsonist sets cars ablaze at dealership https:...       1  
4  "Lord Jesus, your love brings freedom and pard...       0  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11370 entries, 0 to 11369
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        11370 non-null  int64 
 1   keyword   11370 non-null  object
 2   location  7952 non-null   object
 3   text      11370 non-null  object
 4   target    11370 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 444.3+ 

Ensure the dataset contains the required columns:
- `text`: The tweet content.
- `target`: The classification label indicating whether the tweet describes a real disaster (1) or not (0).

In [None]:
# Verify required columns
assert 'text' in data.columns, "Column 'text' is missing in the dataset."
assert 'target' in data.columns, "Column 'target' is missing in the dataset."

We will split the dataset into training and validation sets using an 80%-20% ratio.

In [None]:
# Features (tweet content) and labels (1 = disaster, 0 = not a disaster)
X = data['text']       # Features
y = data['target']      # Labels

# Split the dataset (80% training, 20% validation)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
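
The split above is purely random, so the class ratio in the two parts can drift slightly from the ratio in the full dataset. A stratified split keeps that ratio fixed, and scikit-learn's `train_test_split` accepts a `stratify` argument for this purpose. The pure-Python sketch below only illustrates the idea on toy labels; the function name `stratified_split` and the toy data are assumptions made for illustration and are not part of the original notebook.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly the same class ratio in both parts."""
    # Group row positions by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Hold out the same fraction of every class
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (hypothetical): 80 non-disaster (0) and 20 disaster (1) tweets
toy_labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(toy_labels)

print("train size:", len(train_idx), "held-out size:", len(test_idx))
print("disaster tweets held out:", sum(toy_labels[i] for i in test_idx))
```

Both parts keep the 80/20 class ratio of the toy labels, which makes validation metrics easier to compare with training metrics.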

Save the training and validation datasets as separate CSV files for later use.

In [None]:
# Combine features and labels into dataframes
train_df = pd.DataFrame({'text': X_train, 'label': y_train})
test_df = pd.DataFrame({'text': X_test, 'label': y_test})

# Save the dataframes to CSV files
train_df.to_csv('train.csv', index=False)
test_df.to_csv('test.csv', index=False)

print("Datasets have been saved successfully:")
print("- Training set: train.csv")
print("- Validation set: test.csv")

Datasets have been saved successfully:
- Training set: train.csv
- Validation set: test.csv


## 2. Data Visualization

In this section, we conduct a detailed exploratory data analysis (EDA) to understand the structure and distribution of our dataset. EDA is crucial for identifying potential challenges, trends, and biases within the data, which in turn helps in selecting the most suitable models and preprocessing steps.

### 2.1 Dataset Size

First, we print the size of both the training and testing datasets. This gives us an idea of how many data points we are working with, which is essential when evaluating model performance and understanding the balance of the dataset.


In [None]:
print("Training dataset size: ", len(X_train))
print("Testing dataset size: ", len(X_test))

### 2.2 Class Distribution in Training Data
Next, we examine the distribution of the target variable in the training set. The target variable indicates whether a tweet is related to a **disaster (1) or not (0)**. Understanding the class distribution helps identify if the dataset is imbalanced, which could influence the choice of model or evaluation metrics (e.g., using precision-recall curves instead of accuracy).

In [None]:
# Display the count of tweets for each target class in the training set
# (X_train only holds the tweet text; the labels live in y_train)
y_train.value_counts()

We then plot a histogram of the target variable to visualize the distribution of disaster and non-disaster tweets in the training data. This graphical representation makes it easier to spot any imbalance.

In [None]:
y_train.hist()  # Labels are stored in y_train, not in X_train
plt.ylabel("# tweets")
plt.show()
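
To make the point about evaluation metrics concrete, the sketch below shows how accuracy can look good on an imbalanced set even when a classifier never finds a disaster tweet. The label counts are invented for illustration and are not taken from this dataset; the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive (disaster) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical imbalanced labels: 80 non-disaster, 20 disaster
y_true_toy = [0] * 80 + [1] * 20
# A trivial baseline that always predicts the majority class
y_pred_toy = [0] * 100

baseline_accuracy = accuracy(y_true_toy, y_pred_toy)
baseline_precision, baseline_recall = precision_recall(y_true_toy, y_pred_toy)

print("accuracy:", baseline_accuracy)
print("precision:", baseline_precision, "recall:", baseline_recall)
```

The baseline reaches 80% accuracy while its recall on disaster tweets is zero, which is why precision and recall are worth tracking alongside accuracy.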

### 2.3 Exploratory Analysis of Tweet Length
#### 2.3.1 Word Count per Tweet
A useful aspect of text data is the length of the text. Here, we explore the number of words per tweet. This metric can help us understand the average tweet size for both disaster-related and non-disaster-related tweets, which may inform decisions on feature engineering, such as tokenization or padding.

We use histograms to display the word count for each category of tweets (disaster vs. non-disaster).

In [None]:
# Calculate the number of words per tweet for each category (disaster vs non-disaster)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,5))

# X_train is a Series of tweet text; select rows by the matching labels in y_train
tweet_len_0 = X_train[y_train == 0].str.split().map(len)  # Non-disaster tweets
tweet_len_1 = X_train[y_train == 1].str.split().map(len)  # Disaster tweets

ax1.hist(tweet_len_0, color='green')
ax1.set_title('Non-disaster tweets')

ax2.hist(tweet_len_1, color='red')
ax2.set_title('Disaster tweets')

fig.suptitle('Word Count per Tweet')

plt.show()
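
Word counts like these can guide the sequence length used for padding. The sketch below picks a length that covers a chosen share of tweets and then pads or truncates token-id lists to that length. The 95th percentile, the padding id `0`, and the helper names are assumptions for illustration; Keras provides a `pad_sequences` utility for the same job.

```python
import math


def choose_max_len(lengths, percentile=95):
    """Smallest length that covers `percentile` percent of the tweets (nearest rank)."""
    ordered = sorted(lengths)
    rank = math.ceil(percentile / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Cut sequences longer than max_len and right-pad shorter ones with pad_id."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


# Hypothetical word counts for ten tweets
toy_lengths = [4, 5, 6, 7, 8, 9, 10, 12, 15, 30]
median_len = choose_max_len(toy_lengths, percentile=50)

short_seq = pad_or_truncate([7, 3, 9], max_len=5)
long_seq = pad_or_truncate([1, 2, 3, 4, 5, 6], max_len=5)

print(median_len, short_seq, long_seq)
```

Choosing a percentile rather than the maximum keeps a few unusually long tweets from inflating every input sequence.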

#### 2.3.2 Unique Word Count per Tweet
Next, we analyze the number of unique words per tweet. This measure indicates how diverse the vocabulary is for each tweet and can be an important factor when building feature sets **like word embeddings or TF-IDF.**

In [None]:
# Calculate the number of unique words per tweet for each category
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,5))

# X_train is a Series of tweet text; select rows by the matching labels in y_train
tweet_len_0 = X_train[y_train == 0].str.split().map(lambda x: len(set(x)))  # Non-disaster tweets
tweet_len_1 = X_train[y_train == 1].str.split().map(lambda x: len(set(x)))  # Disaster tweets

ax1.hist(tweet_len_0, color='green')
ax1.set_title('Non-disaster tweets')

ax2.hist(tweet_len_1, color='red')
ax2.set_title('Disaster tweets')

fig.suptitle('Unique Word Count per Tweet')

plt.show()

#### 2.3.3 Average Word Length per Tweet
Lastly, we investigate the average length of the words used in the tweets. This measure provides insight into the complexity or simplicity of the language used in disaster vs. non-disaster tweets. A higher average word length might indicate more formal or technical language, whereas shorter words could suggest more informal communication.

In [None]:
# Calculate the average word length per tweet for each category
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,5))

# X_train is a Series of tweet text; select rows by the matching labels in y_train
tweet_len_0 = X_train[y_train == 0].str.split().map(lambda x: np.mean([len(i) for i in x]))  # Non-disaster tweets
tweet_len_1 = X_train[y_train == 1].str.split().map(lambda x: np.mean([len(i) for i in x]))  # Disaster tweets

ax1.hist(tweet_len_0, color='green')
ax1.set_title('Non-disaster tweets')

ax2.hist(tweet_len_1, color='red')
ax2.set_title('Disaster tweets')

fig.suptitle('Average Word Length per Tweet')

plt.show()
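
The three per-tweet measures from this section can be checked by hand on a single tweet. The helper `length_stats` below is a hypothetical sketch, not part of the notebook; it also guards against empty tweets, for which a mean word length is undefined.

```python
def length_stats(text):
    """Word count, unique word count, and mean word length for one tweet."""
    words = text.split()
    if not words:
        # An empty tweet has no words, so report zeros instead of dividing by zero
        return {"words": 0, "unique_words": 0, "mean_word_len": 0.0}
    return {
        "words": len(words),
        "unique_words": len(set(words)),
        "mean_word_len": sum(len(w) for w in words) / len(words),
    }


# Toy tweet (made up for illustration)
stats = length_stats("fire fire near the old mill")
print(stats)
```

For the toy tweet, six whitespace-separated tokens collapse to five unique words because "fire" is repeated.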


### 2.4 Further Feature Calculations

In addition to the basic tweet length analysis, we could calculate several other features that may provide additional insights for modeling. These features include:

* Number of **stopwords** per tweet
* Number of **URLs** per tweet
* **Average** number of characters per tweet
* Number of **characters** per tweet
* Number of punctuation marks per tweet
* Number of **hashtags** per tweet
* Number of **mentions** (@) per tweet

These additional features could be crucial when constructing advanced models or for improving the understanding of tweet content.
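
Several of the listed features can be computed with the standard library alone. The sketch below is one possible approach; the regular expressions are simplified assumptions (for example, URLs are taken to start with `http://` or `https://`), and `tweet_features` is a hypothetical helper name.

```python
import re
import string

# Simplified patterns, assumed for illustration
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")


def tweet_features(text):
    """Count characters, URLs, hashtags, mentions, and punctuation in one tweet."""
    return {
        "n_chars": len(text),
        "n_urls": len(URL_RE.findall(text)),
        "n_hashtags": len(HASHTAG_RE.findall(text)),
        "n_mentions": len(MENTION_RE.findall(text)),
        # Counts every punctuation character, including those inside URLs
        "n_punctuation": sum(ch in string.punctuation for ch in text),
    }


# Toy tweet (made up for illustration)
example = "Wildfire spreading fast! @CityAlerts #evacuate https://example.com/map"
features = tweet_features(example)
print(features)
```

With pandas, a function like this could be applied to the text column and expanded into separate feature columns, for instance to compare their distributions across the two classes.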

### 2.5 Stopwords Analysis

Stopwords are very common words that carry little meaning on their own but help structure a sentence, such as articles, pronouns, prepositions, and auxiliary verbs. In many NLP pipelines, especially bag-of-words approaches, stopwords are removed because they add little discriminative signal. Classic search engines, for example, have often ignored stopwords when indexing content. Whether removal helps depends on the model, so it is worth checking how stopwords are distributed in each class first.

To explore which stopwords are most common in the dataset, we can use the following approach:



In [None]:
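import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)  # Fetch the stopword list if it is not installed yet
english_stopwords = set(stopwords.words('english'))  # Build the set once for fast lookups

def plot_stopwords(label):
    """Plot the 10 stopwords that appear in the most tweets of the given class."""
    tweets_stopwords = {}
    # X_train holds the tweet text; y_train holds the matching labels
    for words in X_train[y_train == label].str.split():
        # Count each stopword at most once per tweet
        for w in set(words) & english_stopwords:
            tweets_stopwords[w] = tweets_stopwords.get(w, 0) + 1

    top = sorted(tweets_stopwords.items(), key=lambda x: x[1], reverse=True)[:10]
    plt.bar(*zip(*top))
    plt.show()

plot_stopwords(0)  # Non-disaster tweets
plot_stopwords(1)  # Disaster tweets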

This will display the 10 most frequent stopwords in disaster and non-disaster tweets separately.
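
The cell above compares tokens exactly as written, so a capitalized "The" does not match NLTK's lowercase stopword list. A case-insensitive variant is sketched below with `collections.Counter`. The small `TOY_STOPWORDS` set and the toy tweets are assumptions that stand in for the NLTK list and the real data.

```python
from collections import Counter

# Tiny stand-in for the NLTK English stopword list (assumption for illustration)
TOY_STOPWORDS = {"the", "a", "in", "of", "to", "is", "and"}


def top_stopwords(texts, stopword_set, k=10):
    """Return the k stopwords that occur in the most tweets, ignoring case."""
    counts = Counter()
    for text in texts:
        # Lowercase before matching and count each stopword once per tweet
        tokens = {tok.lower() for tok in text.split()}
        counts.update(tokens & stopword_set)
    return counts.most_common(k)


toy_tweets = [
    "The fire is in the hills",
    "A flood of water",
    "The road to town is closed",
]
top = dict(top_stopwords(toy_tweets, TOY_STOPWORDS))
print(top)
```

Here "The" and "the" are counted together, which gives a fairer picture of how often each stopword appears.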

### 2.6 Punctuation Marks Analysis

Next, we analyze the punctuation marks used in the tweets. Punctuation marks can play a role in sentiment analysis and text classification. We examine the most frequently used punctuation marks in both disaster-related and non-disaster-related tweets:

In [None]:
import string
from collections import Counter

def plot_punctuation(label):
    """Plot the 20 most frequent punctuation characters in tweets of the given class."""
    tweets_punctuation = Counter()
    # X_train holds the tweet text; y_train holds the matching labels
    for text in X_train[y_train == label]:
        # Count every punctuation character, including ones attached to words
        tweets_punctuation.update(ch for ch in text if ch in string.punctuation)

    top = tweets_punctuation.most_common(20)
    plt.figure(figsize=(10, 5))
    plt.bar(*zip(*top))
    plt.show()

plot_punctuation(0)  # Non-disaster tweets
plot_punctuation(1)  # Disaster tweets


This will plot the most common punctuation marks for both types of tweets.

### 2.7 N-grams Analysis

Finally, we perform an analysis of n-grams, which are sequences of n consecutive words. N-grams are useful for capturing patterns in text and can help us understand frequently occurring phrases or expressions.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(2, 2))  # Bigrams only
# X_train is already the Series of tweet text, so it is passed directly
sum_words = cv.fit_transform(X_train).sum(axis=0)

# Pair each bigram with its total count and keep the 20 most frequent
words_freq = [(word, sum_words[0, idx]) for word, idx in cv.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)[:20]

plt.figure(figsize=(15, 7))
plt.barh(*zip(*words_freq))
plt.show()

This analysis will highlight the top 20 most frequent 2-grams (bigrams) in the dataset.
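
To see what a bigram count looks like without scikit-learn, the sketch below pairs each token with its successor and tallies the pairs. It splits on whitespace only, so its counts can differ from `CountVectorizer`, which applies its own tokenization. The toy tweets and the name `bigram_counts` are made up for illustration.

```python
from collections import Counter


def bigram_counts(texts):
    """Count consecutive word pairs across all texts (lowercased, whitespace split)."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # zip pairs each token with the one that follows it
        counts.update(zip(tokens, tokens[1:]))
    return counts


toy_tweets = [
    "forest fire near town",
    "forest fire spreads",
    "fire near town hall",
]
counts = bigram_counts(toy_tweets)

for pair, n in counts.most_common(3):
    print(" ".join(pair), n)
```

Pairs are formed within each tweet only, so the last word of one tweet is never joined to the first word of the next.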

These visualizations and additional feature calculations provide deeper insights into the structure and content of the dataset. By understanding features like word count, stopwords, punctuation marks, and n-grams, we can better prepare our data for modeling. Moreover, the analysis helps identify any potential biases or patterns that could influence the performance of the model, ensuring that we approach the task of disaster tweet classification in a well-informed manner.