# NLP for Disaster Tweets Analysis & Prediction

## Introduction

This project explores Natural Language Processing (NLP) techniques for determining whether a tweet refers to a real disaster. The dataset consists of two files, `train.csv` and `test.csv`, plus `sample_submission.csv`, which illustrates the expected submission format.

## Dataset Description

### Files Needed
- **train.csv:** Training set containing tweet text, keywords, location, and target labels.
- **test.csv:** Test set with tweet text, keywords, and location information.
- **sample_submission.csv:** Sample file demonstrating the correct submission format.

### Data Format
Each sample in the training and test sets includes:
- The text of a tweet.
- A keyword from the tweet (which may be blank).
- The location from which the tweet was sent (may also be blank).

### Prediction Target
The primary objective is to predict whether a given tweet is about a real disaster. If it is, the prediction is coded as 1; if not, it is coded as 0.

### The Data Dictionary

| Column    | Description                                              |
|-----------|----------------------------------------------------------|
| `id`      | Unique identifier for each tweet                         |
| `text`    | The content of the tweet                                 |
| `location`| Location from which the tweet was sent (may be blank)    |
| `keyword` | A specific keyword from the tweet (may be blank)         |
| `target`  | Present in `train.csv` only; indicates whether the tweet is about a real disaster (1) or not (0) |

## Getting Started

1. **Obtain the required files:**
   - Download `train.csv`, `test.csv`, and `sample_submission.csv`.

2. **Explore the dataset:**
   - Utilize the provided files to comprehend the structure and contents of the data.

3. **Run NLP models:**
   - Employ NLP techniques to predict whether tweets are about real disasters.

4. **Submit Predictions:**
   - Utilize insights gained from NLP models to predict the target for tweets in the test set.
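
The four steps above can be sketched end to end as follows. This is a minimal baseline under stated assumptions, not the approach developed later in this notebook: the small in-memory frames stand in for `train.csv` and `test.csv`, the bag-of-words plus logistic regression model is chosen only for brevity, and the `id`/`target` layout of the submission frame is assumed to mirror `sample_submission.csv`.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for train.csv and test.csv (illustrative rows, not real data)
train_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": [
        "Forest fire spreading near the village",
        "Earthquake destroys several buildings",
        "I love this song so much",
        "What a lovely sunny afternoon",
    ],
    "target": [1, 1, 0, 0],
})
test_df = pd.DataFrame({
    "id": [10, 11],
    "text": ["Flood waters rising downtown", "Pizza night with friends"],
})

# Bag-of-words features feeding a linear classifier
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_df["text"], train_df["target"])

# Submission frame: one row per test tweet with its predicted label
submission = pd.DataFrame({
    "id": test_df["id"],
    "target": model.predict(test_df["text"]),
})
print(submission)
```

With real data, the two toy frames would be replaced by `pd.read_csv` calls on the downloaded files, and the result could be written out with `submission.to_csv(..., index=False)`.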

## Contribution

This project encourages contributions from the community. Feel free to submit pull requests, share insights, or engage in discussions in the project's forum.

## Acknowledgements

The dataset for this project is sourced from [Kaggle](https://www.kaggle.com/), and heartfelt gratitude is extended to the contributors for making this valuable data available for analysis.

## Let's get started with our PRRRRRRRRRRREDICTION!


#### Importing required libraries

In [1]:
%reset -f

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import os
import missingno as msno 

import nltk
from nltk.corpus import stopwords
from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold

from collections import defaultdict, Counter

plt.style.use('ggplot')

import re
from nltk.tokenize import word_tokenize
import gensim
import string

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Embedding, LSTM, Dense, Input, Dropout, SpatialDropout1D, GlobalAveragePooling1D
from tensorflow.keras.initializers import Constant
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, Callback
import gc

import operator
import joblib

import tokenization
from wordcloud import STOPWORDS

from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
from sklearn.metrics import precision_score, recall_score, f1_score
from scikeras.wrappers import KerasClassifier

import tensorflow_hub as hub


#### Load the dataset

In [3]:
# import data from CSV after downloading from Kaggle 
# https://www.kaggle.com/competitions/nlp-getting-started/overview
path = '/Users/stevenschepanski/Documents/Projects/NLP_Tweets/'
tweet = pd.read_csv(path + 'data/train.csv')
test = pd.read_csv(path + 'data/test.csv')

### 1. Data Exploration

In [4]:
# Display the initial rows of the data frame
tweet.head()


|   | id | keyword | location | text | target |
|---|----|---------|----------|------|--------|
| 0 | 1 |  |  | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 |  |  | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 |  |  | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 |  |  | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 |  |  | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [5]:
# Provide concise information about the data frame
tweet.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


In [6]:
# Display the initial rows of the data frame
test.head()

|   | id | keyword | location | text |
|---|----|---------|----------|------|
| 0 | 0 |  |  | Just happened a terrible car crash |
| 1 | 2 |  |  | Heard about #earthquake is different cities, s... |
| 2 | 3 |  |  | there is a forest fire at spot pond, geese are... |
| 3 | 9 |  |  | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 |  |  | Typhoon Soudelor kills 28 in China and Taiwan |


In [7]:
# Provide concise information about the data frame
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3263 entries, 0 to 3262
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        3263 non-null   int64 
 1   keyword   3237 non-null   object
 2   location  2158 non-null   object
 3   text      3263 non-null   object
dtypes: int64(1), object(3)
memory usage: 102.1+ KB


In [8]:
print('The train data has', tweet.shape[0],'rows and', tweet.shape[1], 'columns.')
print('The test data has', test.shape[0],'rows and', test.shape[1], 'columns.')

The train data has 7613 rows and 5 columns.
The test data has 3263 rows and 4 columns.


In [9]:
import plotly.graph_objects as go
import plotly.subplots as sp

# Custom colours
violet = 'rgb(108, 84, 158)'
dark_orange = 'rgb(191, 87, 0)'

# Create subplots with one row and two columns
fig = sp.make_subplots(rows=1, cols=2, subplot_titles=['Missing Values in "tweet" Dataset', 'Missing Values in "test" Dataset'])

# Plot for "tweet" dataset
fig.add_trace(go.Bar(x=['keyword', 'location'], y=tweet[['keyword', 'location']].isnull().sum().values,
                     name='tweet', marker_color=[violet, dark_orange]), row=1, col=1)

# Plot for "test" dataset
fig.add_trace(go.Bar(x=['keyword', 'location'], y=test[['keyword', 'location']].isnull().sum().values,
                     name='test', marker_color=[violet, dark_orange]), row=1, col=2)

# Update layout
fig.update_layout(
    showlegend=False,  # Hide legend for individual clarity
    yaxis=dict(title='Missing Value Count'),
    plot_bgcolor='white',
    paper_bgcolor='white'  # Set the paper background to white
)

# Update axis appearance
fig.update_xaxes(showline=True, linewidth=1, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='black', mirror=True)

# Show the plot
fig.show()


In [10]:
# Calculate missing value ratios for 'keyword' and 'location' columns
tweet_missing_ratio = tweet[['keyword', 'location']].isnull().mean()
test_missing_ratio = test[['keyword', 'location']].isnull().mean()

# Display missing value ratios for 'tweet' dataset
print("Missing Value Ratios in 'tweet' Dataset:")
print(tweet_missing_ratio)

# Display missing value ratios for 'test' dataset
print("\nMissing Value Ratios in 'test' Dataset:")
print(test_missing_ratio)


Missing Value Ratios in 'tweet' Dataset:
keyword     0.008013
location    0.332720
dtype: float64

Missing Value Ratios in 'test' Dataset:
keyword     0.007968
location    0.338645
dtype: float64


In [11]:
# Print the number of duplicated rows in the 'tweet' dataset
print(tweet.duplicated().sum(), 'rows are duplicated.')

# Print the number of duplicated 'id' values in the 'tweet' dataset
print(tweet['id'].duplicated().sum(), 'IDs are duplicated.')


0 rows are duplicated.
0 IDs are duplicated.


No entire rows are duplicated, and no tweet ID appears twice in the data.
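
The checks above compare whole rows and IDs, so they would not flag the same tweet text appearing under different IDs. A minimal sketch of such an additional check, run on a hypothetical toy frame rather than the real data, could look like this:

```python
import pandas as pd

# Hypothetical toy frame: ids are unique, but one text appears twice
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": ["fire downtown", "fire downtown", "nice weather", "big storm"],
    "target": [1, 0, 0, 1],
})

# Rows whose text also occurs in an earlier row
n_dup_text = toy["text"].duplicated().sum()

# Texts that carry more than one distinct label
labels_per_text = toy.groupby("text")["target"].nunique()
conflicting = labels_per_text[labels_per_text > 1].index.tolist()

print(n_dup_text, "texts are duplicated.")
print("Texts with conflicting labels:", conflicting)
```

Whether the real training set contains such cases is not established here; the sketch only shows how one might look.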

#### 1.1 Missing Values

The training and test sets have nearly identical ratios of missing values in `keyword` and `location`:

- About 0.8% of `keyword` values are missing in both sets.
- About 33% of `location` values are missing in the training set and about 34% in the test set.

Because the ratios are so close, the two sets were most likely drawn from the same sample. Missing values in these features are filled with `no_keyword` and `no_location` respectively.

In [12]:
for df in [tweet, test]:
    for col in ['keyword', 'location']:
        df[col] = df[col].fillna(f'no_{col}')


In [13]:
# Check missing values after filling
print("\nMissing values after filling:")
print(tweet[['keyword', 'location']].isna().sum())


Missing values after filling:
keyword     0
location    0
dtype: int64


### 2. Exploratory Data Analysis

In [14]:
# Create subplots with one row and two columns
fig = sp.make_subplots(rows=1, cols=2, subplot_titles=['Disaster Tweets', 'Not Disaster Tweets'],
                       shared_yaxes=True)

# Disaster tweets
tweet_len_disaster = tweet[tweet['target'] == 1]['text'].str.len()
fig.add_trace(go.Histogram(x=tweet_len_disaster, marker_color=violet), row=1, col=1)

# Not disaster tweets
tweet_len_not_disaster = tweet[tweet['target'] == 0]['text'].str.len()
fig.add_trace(go.Histogram(x=tweet_len_not_disaster, marker_color=dark_orange), row=1, col=2)

# Update layout with titles and labels
fig.update_layout(
    title='Number of Characters in Tweets',
    xaxis=dict(title='Number of Characters'),
    yaxis=dict(title='Frequency'),
    xaxis2=dict(title='Number of Characters'),  # Add x-axis label for the second subplot
    showlegend=False,  # Hide legend for individual clarity
    plot_bgcolor='white',
    paper_bgcolor='white'  # Set the paper background to white
)

# Update axis appearance
fig.update_xaxes(showline=True, linewidth=1, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='black', mirror=True)

# Show the plot
fig.show()


In [15]:
# Create subplots with one row and two columns
fig = sp.make_subplots(rows=1, cols=2, subplot_titles=['Disaster Tweets', 'Not Disaster Tweets'],
                       shared_yaxes=True)

# Disaster tweets
tweet_len_words_disaster = tweet[tweet['target'] == 1]['text'].str.split().map(lambda x: len(x))
fig.add_trace(go.Histogram(x=tweet_len_words_disaster, marker_color=violet), row=1, col=1)

# Not disaster tweets
tweet_len_words_not_disaster = tweet[tweet['target'] == 0]['text'].str.split().map(lambda x: len(x))
fig.add_trace(go.Histogram(x=tweet_len_words_not_disaster, marker_color=dark_orange), row=1, col=2)

# Update layout with titles and labels
fig.update_layout(
    title='Number of Words in a Tweet',
    xaxis=dict(title='Number of Words'),
    yaxis=dict(title='Frequency'),
    xaxis2=dict(title='Number of Words'),
    showlegend=False,  # Hide legend for individual clarity
    plot_bgcolor='white',
    paper_bgcolor='white'  # Set the paper background to white
)

# Update axis appearance
fig.update_xaxes(showline=True, linewidth=1, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='black', mirror=True)

# Show the plot
fig.show()

In [16]:
# Create subplots with one row and two columns
fig = sp.make_subplots(rows=1, cols=2, subplot_titles=['Disaster Tweets', 'Not Disaster Tweets'],
                       shared_yaxes=True)

# Disaster tweets
word_disaster = tweet[tweet['target'] == 1]['text'].str.split().apply(lambda x: [len(i) for i in x])
fig.add_trace(go.Histogram(x=word_disaster.map(lambda x: np.mean(x)), marker_color=violet, nbinsx=20), row=1, col=1)

# Not disaster tweets
word_not_disaster = tweet[tweet['target'] == 0]['text'].str.split().apply(lambda x: [len(i) for i in x])
fig.add_trace(go.Histogram(x=word_not_disaster.map(lambda x: np.mean(x)), marker_color=dark_orange, nbinsx=20), row=1, col=2)

# Update layout with titles and labels
fig.update_layout(
    title='Average Word Length in Each Tweet',
    xaxis=dict(title='Average Word Length'),
    yaxis=dict(title='Frequency'),
    xaxis2=dict(title='Average Word Length'),
    showlegend=False,  # Hide legend for individual clarity
    plot_bgcolor='white',
    paper_bgcolor='white'  # Set the paper background to white
)

# Update axis appearance
fig.update_xaxes(showline=True, linewidth=1, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='black', mirror=True)

# Show the plot
fig.show()


In [17]:
# Add a column for target mean in the DataFrame
tweet['target_mean'] = tweet.groupby('keyword')['target'].transform('mean')

# Sort the DataFrame by 'target_mean' in descending order
sorted_tweet = tweet.sort_values(by='target_mean', ascending=False)

# Reverse the order of keywords
sorted_tweet_reversed = sorted_tweet[::-1]


In [18]:
# Create a horizontal bar chart
fig = go.Figure()

# Add bars for each keyword with target distribution
fig.add_trace(go.Bar(
    y=sorted_tweet_reversed['keyword'],
    x=sorted_tweet_reversed['target'],
    orientation='h',
    marker=dict(color=[dark_orange if val == 0 else violet for val in sorted_tweet_reversed['target']])
))

# Update layout with titles and labels
fig.update_layout(
    title='Target Distribution in Keywords',
    xaxis=dict(title='Frequency'),
    yaxis=dict(title='Keywords'),
    showlegend=False,  # Hide legend for individual clarity
    plot_bgcolor='white',
    paper_bgcolor='white'  # Set the paper background to white
)

# Update axis appearance
fig.update_xaxes(showline=True, linewidth=1, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='black', mirror=True, tickfont=dict(size=12))

# Show the plot
fig.show()

# Drop the 'target_mean' column from the DataFrame
tweet.drop(columns=['target_mean'], inplace=True)
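
Because the chart above receives one bar per tweet, aggregating first can make the per-keyword split easier to read. The sketch below counts tweets per keyword and target with `pd.crosstab` on a hypothetical toy frame; the `disaster_share` column name is an illustrative choice, and the resulting count columns could then be passed to two `go.Bar` traces.

```python
import pandas as pd

# Hypothetical toy frame standing in for the 'tweet' DataFrame
toy = pd.DataFrame({
    "keyword": ["wildfire", "wildfire", "wildfire", "ablaze", "ablaze", "aftershock"],
    "target":  [1, 1, 0, 1, 0, 0],
})

# One row per keyword, one column per target value, cells hold tweet counts
counts = pd.crosstab(toy["keyword"], toy["target"])

# Share of disaster tweets per keyword, used for ordering
counts["disaster_share"] = counts[1] / (counts[0] + counts[1])
counts = counts.sort_values("disaster_share", ascending=False)
print(counts)
```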


In [19]:
def create_corpus(target):
    """
    Create a corpus of words from the 'text' column for a specific target value.

    Parameters:
    - target: The target value for which to create the corpus.

    Returns:
    - corpus: A list containing individual words from the 'text' column for the specified target.
    """
    corpus = []

    # Iterate over the 'text' column for rows with the specified target value
    for x in tweet[tweet['target'] == target]['text'].str.split():
        for i in x:
            corpus.append(i)
    
    return corpus
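
As a possible alternative to the nested loops, the same token list can be built with pandas string methods. The sketch below is an assumption-laden variant, not a drop-in replacement: `create_corpus_from` is a hypothetical name, and it takes the DataFrame as an argument instead of reading the global `tweet`.

```python
import pandas as pd

def create_corpus_from(df, target):
    """Return all whitespace-separated tokens of tweets with the given target."""
    # str.split() yields one list per tweet; explode() flattens them into one Series
    tokens = df.loc[df["target"] == target, "text"].str.split().explode()
    return tokens.dropna().tolist()

# Toy data for illustration only
toy = pd.DataFrame({
    "text": ["Fire in the hills", "so happy today", "Flood warning issued"],
    "target": [1, 0, 1],
})
print(create_corpus_from(toy, 1))
```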


In [20]:
import nltk

# Download NLTK stopwords
nltk.download('stopwords')

# Define the set of stopwords
stop = set(stopwords.words('english'))

# Create corpora for target values 0 and 1
corpus_0 = create_corpus(0)
corpus_1 = create_corpus(1)


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/stevenschepanski/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [21]:
# Count stopwords for target value '0'
dic_0 = defaultdict(int)
for word in corpus_0:
    if word in stop:
        dic_0[word] += 1

# Count stopwords for target value '1'
dic_1 = defaultdict(int)
for word in corpus_1:
    if word in stop:
        dic_1[word] += 1


In [22]:
# Get top stopwords for target value '0'
top_0 = sorted(dic_0.items(), key=lambda x: x[1], reverse=True)[:10]

# Get top stopwords for target value '1'
top_1 = sorted(dic_1.items(), key=lambda x: x[1], reverse=True)[:10]

# Unpack top stopwords for plotting
x_0, y_0 = zip(*top_0)
x_1, y_1 = zip(*top_1)


In [23]:
from plotly.subplots import make_subplots

# Create a bar plot comparing top stopwords for target values '0' and '1'
fig = make_subplots(rows=1, cols=2, subplot_titles=['Top Stopwords for Target=0', 'Top Stopwords for Target=1'],
                    shared_yaxes=True, horizontal_spacing=0.1)

# Plot for Target=0
fig.add_trace(go.Bar(x=x_0, y=y_0, marker_color=violet), row=1, col=1)

# Plot for Target=1
fig.add_trace(go.Bar(x=x_1, y=y_1, marker_color=dark_orange), row=1, col=2)

# Update layout with titles and labels
fig.update_layout(
    title='Comparison of Top Stopwords',
    xaxis=dict(title='Stopwords'),
    yaxis=dict(title='Frequency'),
    xaxis2=dict(title='Stopwords'),
    showlegend=False,  # Hide legend for individual clarity
    plot_bgcolor='white',
    paper_bgcolor='white'  # Set the paper background to white
)

# Update axis appearance
fig.update_xaxes(showline=True, linewidth=1, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='black', mirror=True)

# Keep x-axis labels horizontal and use a smaller tick font
fig.update_xaxes(tickangle=0, tickfont=dict(size=10))

# Show the plot
fig.show()


In [24]:
# Create subplots
fig = make_subplots(rows=1, cols=2, subplot_titles=['Punctuation Analysis for Target=0', 'Punctuation Analysis for Target=1'],
                    shared_yaxes=True, horizontal_spacing=0.1)

# Count punctuation occurrences for target value '0'
dic_0_punctuation = defaultdict(int)
special = string.punctuation
for i in corpus_0:
    if i in special:
        dic_0_punctuation[i] += 1

# Count punctuation occurrences for target value '1'
dic_1_punctuation = defaultdict(int)
for i in corpus_1:
    if i in special:
        dic_1_punctuation[i] += 1

# Create bar plots for punctuation for target values '0' and '1'
x_0_punctuation, y_0_punctuation = zip(*dic_0_punctuation.items())
x_1_punctuation, y_1_punctuation = zip(*dic_1_punctuation.items())

# Plot for target value '0'
fig.add_trace(go.Bar(x=x_0_punctuation, y=y_0_punctuation, marker_color=dark_orange, opacity=0.7), row=1, col=1)

# Plot for target value '1'
fig.add_trace(go.Bar(x=x_1_punctuation, y=y_1_punctuation, marker_color=violet, opacity=0.7), row=1, col=2)

# Update layout with titles and labels
fig.update_layout(
    title='Comparison of Punctuation',
    xaxis=dict(title='Punctuation'),
    yaxis=dict(title='Frequency'),
    xaxis2=dict(title='Punctuation'),
    showlegend=False,  # Hide legend for individual clarity
    plot_bgcolor='white',
    paper_bgcolor='white'  # Set the paper background to white
)

# Update axis appearance
fig.update_xaxes(showline=True, linewidth=1, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='black', mirror=True)

# Show the plot
fig.show()
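
The loops above test whole whitespace-separated tokens against `string.punctuation`, so punctuation attached to a word (as in `fire!`) is not counted. As a possible alternative, a character-level count could be sketched as follows; `count_punctuation` is a hypothetical helper and the sample texts are made up.

```python
import string
from collections import Counter

def count_punctuation(texts):
    """Count every punctuation character across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(ch for ch in text if ch in string.punctuation)
    return counts

sample = ["Fire!!! Evacuate now.", "Is this real? #wildfire"]
punct_counts = count_punctuation(sample)
print(punct_counts.most_common(3))
```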


In [25]:
# Create subplots
fig = make_subplots(rows=1, cols=2, subplot_titles=['Most Common Words for Target=0', 'Most Common Words for Target=1'],
                    shared_yaxes=True, horizontal_spacing=0.1)

# Count occurrences for target value '0'
counter_0 = Counter(corpus_0)
most_0 = counter_0.most_common()
x_0 = []
y_0 = []
for word, count in most_0[:40]:
    if word not in stop:
        x_0.append(word)
        y_0.append(count)

# Count occurrences for target value '1'
counter_1 = Counter(corpus_1)
most_1 = counter_1.most_common()
x_1 = []
y_1 = []
for word, count in most_1[:40]:
    if word not in stop:
        x_1.append(word)
        y_1.append(count)

# Plot for target value '0'
fig.add_trace(go.Bar(x=y_0, y=x_0, orientation='h', marker_color=violet), row=1, col=1)

# Plot for target value '1'
fig.add_trace(go.Bar(x=y_1, y=x_1, orientation='h', marker_color=dark_orange), row=1, col=2)

# Update layout with titles and labels
fig.update_layout(
    title='Comparison of Most Common Words',
    xaxis=dict(title='Frequency'),
    yaxis=dict(title='Words'),
    xaxis2=dict(title='Frequency'),
    showlegend=False,  # Hide legend for individual clarity
    plot_bgcolor='white',
    paper_bgcolor='white'  # Set the paper background to white
)

# Update axis appearance
fig.update_xaxes(showline=True, linewidth=1, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='black', mirror=True)

# Show the plot
fig.show()


In [26]:
def get_top_tweet_bigrams(corpus, n=None):
    """
    Get the top n bigrams from a corpus.

    Parameters:
    - corpus: List of text documents.
    - n: Number of top bigrams to return. If None, return all.

    Returns:
    - List of tuples containing the top n bigrams and their frequencies.
    """
    vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]
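
To make the counting step concrete, here is a dependency-free sketch of the same idea using only the standard library. It assumes simple lowercase whitespace tokenisation, which differs from the default tokeniser of `CountVectorizer`, so counts on real tweets may not match exactly; `top_bigrams` is an illustrative name.

```python
from collections import Counter

def top_bigrams(texts, n=None):
    """Return the n most frequent adjacent word pairs across all texts."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        # zip pairs each word with its successor within the same text
        counts.update(" ".join(pair) for pair in zip(words, words[1:]))
    return counts.most_common(n)

docs = ["Forest fire near town", "forest fire spreads fast", "heavy rain tonight"]
print(top_bigrams(docs, 1))  # [('forest fire', 2)]
```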


In [27]:
# Create a horizontal bar plot for the top 10 bigrams in tweets
fig = go.Figure()

top_tweet_bigrams = get_top_tweet_bigrams(tweet['text'])[:10]
x, y = map(list, zip(*top_tweet_bigrams))

fig.add_trace(go.Bar(x=y, y=x, orientation='h', marker_color=violet))

# Update layout with titles and labels
fig.update_layout(
    title='Top 10 Bigrams in Tweets',
    xaxis=dict(title='Frequency'),
    yaxis=dict(title='Bigrams'),
    showlegend=False,  # Hide legend for individual clarity
    plot_bgcolor='white',
    paper_bgcolor='white'  # Set the paper background to white
)

# Update axis appearance
fig.update_xaxes(showline=True, linewidth=1, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='black', mirror=True)

# Show the plot
fig.show()

In [28]:
def calculate_meta_features(df):
    """
    Calculate meta-features for a DataFrame.

    Parameters:
    - df: The DataFrame for which to calculate meta-features.

    Returns:
    - The DataFrame with added meta-features.
    """
    # Word count
    df['word_count'] = df['text'].apply(lambda x: len(str(x).split()))

    # Unique word count
    df['unique_word_count'] = df['text'].apply(lambda x: len(set(str(x).split())))

    # Stop word count
    df['stop_word_count'] = df['text'].apply(lambda x: len([word for word in str(x).lower().split() if word in stop]))

    # URL count ('http' also matches 'https')
    df['url_count'] = df['text'].apply(lambda x: len([word for word in str(x).lower().split() if 'http' in word]))

    # Mean word length
    df['mean_word_length'] = df['text'].apply(lambda x: np.mean([len(word) for word in str(x).split()]))

    # Character count
    df['char_count'] = df['text'].apply(lambda x: len(str(x)))

    # Punctuation count
    df['punctuation_count'] = df['text'].apply(lambda x: len([char for char in str(x) if char in string.punctuation]))

    # Hashtag count
    df['hashtag_count'] = df['text'].apply(lambda x: len([word for word in str(x).split() if word.startswith('#')]))

    # Mention count
    df['mention_count'] = df['text'].apply(lambda x: len([word for word in str(x).split() if word.startswith('@')]))

    return df
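
To see what these meta-features measure, the sketch below applies a subset of the same counting rules to a single made-up tweet using only the standard library. `tweet_meta` is a hypothetical helper written for illustration.

```python
import string

def tweet_meta(text):
    """Compute a handful of surface-level counts for one tweet."""
    words = text.split()
    return {
        "word_count": len(words),
        "unique_word_count": len(set(words)),
        "url_count": sum("http" in w.lower() for w in words),
        "char_count": len(text),
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "hashtag_count": sum(w.startswith("#") for w in words),
        "mention_count": sum(w.startswith("@") for w in words),
    }

meta = tweet_meta("@user Huge #wildfire near #Spokane http://t.co/abc")
print(meta)
```

Note that the punctuation count includes the characters inside the URL as well as the `@` and `#` prefixes.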

In [29]:
# Apply the function to both 'tweet' and 'test' DataFrames
tweet = calculate_meta_features(tweet)
test = calculate_meta_features(test)

In [119]:
import plotly.express as px
from plotly.subplots import make_subplots

features = ['word_count', 'unique_word_count', 'stop_word_count', 'url_count', 'mean_word_length',
            'char_count', 'punctuation_count', 'hashtag_count', 'mention_count']

# Create subplots for tweet data frame
fig_tweet = make_subplots(rows=len(features), cols=1, subplot_titles=features)

for i, feature in enumerate(features):
    # Separate histograms for each class of 'target'
    histogram_class_0 = px.histogram(tweet[tweet['target'] == 0], x=feature, title=f'{feature} Distribution (No disaster)')
    histogram_class_1 = px.histogram(tweet[tweet['target'] == 1], x=feature, title=f'{feature} Distribution (Disaster)')

    # Update trace colors
    histogram_class_0.update_traces(marker_color=violet, selector=dict(type='histogram'))
    histogram_class_1.update_traces(marker_color=dark_orange, selector=dict(type='histogram'))

    # Add traces to the subplots
    fig_tweet.add_trace(histogram_class_0['data'][0], row=i + 1, col=1)
    fig_tweet.add_trace(histogram_class_1['data'][0], row=i + 1, col=1)

fig_tweet.update_layout(height=len(features) * 300, showlegend=False, title_text="Tweet Data")

# Show the plots
fig_tweet.update_layout(barmode='stack')
fig_tweet.show()

In [120]:
# Create one subplot per feature, each combining the tweet (train) and test data
fig_combined = make_subplots(rows=len(features), cols=1, subplot_titles=features)

for i, feature in enumerate(features):
    # Histogram for tweet data frame
    histogram_tweet = px.histogram(tweet, x=feature, title=f'{feature} Distribution (Tweet)')
    # Histogram for test data frame
    histogram_test = px.histogram(test, x=feature, title=f'{feature} Distribution (Test)')

    # Update trace colors (adjust colors as needed)
    histogram_tweet.update_traces(marker_color='blue', selector=dict(type='histogram'))
    histogram_test.update_traces(marker_color='orange', selector=dict(type='histogram'))

    # Add traces to the subplots
    fig_combined.add_trace(histogram_tweet['data'][0], row=i + 1, col=1)
    fig_combined.add_trace(histogram_test['data'][0], row=i + 1, col=1)

fig_combined.update_layout(height=len(features) * 300, showlegend=False, title_text="Comparison between Tweet and Test Data")

# Show the plots
fig_combined.update_layout(barmode='stack')
fig_combined.show()

### 3. Data Preprocessing/Cleaning

In [None]:
import re
import string
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, GlobalMaxPooling1D, Dense
from tensorflow.keras.optimizers import Adam as AdamLegacy
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from spellchecker import SpellChecker
from tqdm import tqdm

In [None]:
# Drop the 'id' column from the 'tweet' DataFrame
df = tweet.drop(columns=['id'])
print(df.shape)

In [None]:
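# Function to separate punctuation from words by padding it with spaces
# (despite the name, nothing is removed)
def remove_punctuations(text):
    punctuations = '@#!?+&*[]-%:/();$=><|{}^' + "'`"

    # Normalise runs of two or more periods to a single ellipsis token first;
    # padding '.' beforehand would split every ellipsis into separate periods
    text = re.sub(r'\.{2,}', ' ... ', text)

    # Pad single periods that are not part of an ellipsis
    text = re.sub(r'(?<!\.)\.(?!\.)', ' . ', text)

    for p in punctuations:
        text = text.replace(p, f' {p} ')

    return text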

In [None]:
# Function to remove URLs from text
def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

In [None]:
# Function to remove HTML tags from text
def remove_html(text):
    html = re.compile(r'<.*?>')
    return html.sub(r'', text)
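
A quick check of the two patterns above on a made-up string shows what they remove and what they leave behind. The sketch recompiles the same regular expressions so that it runs on its own; the final whitespace-collapsing step is an addition for illustration and is not part of the functions above.

```python
import re

URL_RE = re.compile(r'https?://\S+|www\.\S+')
HTML_RE = re.compile(r'<.*?>')

sample = 'Flood alert <b>now</b>: see https://t.co/xyz123 or www.example.com'

no_url = URL_RE.sub('', sample)      # strips both the https link and the www link
no_html = HTML_RE.sub('', no_url)    # strips the <b> and </b> tags
cleaned = ' '.join(no_html.split())  # collapses the gaps left by the removals

print(repr(no_html))
print(repr(cleaned))
```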

In [None]:
# Function to remove emojis from text
def remove_emoji(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  
                               u"\U0001F300-\U0001F5FF"  
                               u"\U0001F680-\U0001F6FF"  
                               u"\U0001F1E0-\U0001F1FF"  
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)
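
The last range in the pattern above (`U+24C2` to `U+1F251`) spans far more than emoji, including CJK ideographs, so text in those scripts is stripped as well. The sketch below recompiles the same character class to show both effects on made-up strings; the block names in the comments are annotations added here, not part of the original pattern.

```python
import re

EMOJI_RE = re.compile("["
                      "\U0001F600-\U0001F64F"  # emoticons
                      "\U0001F300-\U0001F5FF"  # symbols and pictographs
                      "\U0001F680-\U0001F6FF"  # transport and map symbols
                      "\U0001F1E0-\U0001F1FF"  # regional indicator (flag) letters
                      "\U00002702-\U000027B0"  # dingbats
                      "\U000024C2-\U0001F251"  # broad range that also covers CJK
                      "]+", flags=re.UNICODE)

# U+1F64F (folded hands) falls in the emoticons range and is removed
with_emoji = EMOJI_RE.sub('', 'Stay safe \U0001F64F everyone')

# U+6771 and U+4EAC (the characters for "Tokyo") fall in the broad range
with_cjk = EMOJI_RE.sub('', 'Tokyo \u6771\u4eac quake')

print(repr(with_emoji))
print(repr(with_cjk))
```

If the data may contain non-Latin scripts, narrowing or dropping that last range is worth considering.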


In [None]:
# Function for spell correction using SpellChecker
spell = SpellChecker()
def correct_spellings(text):
    corrected_text = []
    misspelled_words = spell.unknown(text.split())
    for word in text.split():
        if word in misspelled_words:
            corrected_word = spell.correction(word)
            if corrected_word and corrected_word != word:
                corrected_text.append(corrected_word)
            else:
                corrected_text.append(word)
        else:
            corrected_text.append(word)
    return " ".join(filter(None, corrected_text))

In [None]:
abbreviations = {
    "$" : " dollar ",
    "€" : " euro ",
    "4ao" : "for adults only",
    "a.m" : "before midday",
    "a3" : "anytime anywhere anyplace",
    "aamof" : "as a matter of fact",
    "acct" : "account",
    "adih" : "another day in hell",
    "afaic" : "as far as i am concerned",
    "afaict" : "as far as i can tell",
    "afaik" : "as far as i know",
    "afair" : "as far as i remember",
    "afk" : "away from keyboard",
    "app" : "application",
    "approx" : "approximately",
    "apps" : "applications",
    "asap" : "as soon as possible",
    "asl" : "age, sex, location",
    "atk" : "at the keyboard",
    "ave." : "avenue",
    "aymm" : "are you my mother",
    "ayor" : "at your own risk", 
    "b&b" : "bed and breakfast",
    "b+b" : "bed and breakfast",
    "b.c" : "before christ",
    "b2b" : "business to business",
    "b2c" : "business to customer",
    "b4" : "before",
    "b4n" : "bye for now",
    "b@u" : "back at you",
    "bae" : "before anyone else",
    "bak" : "back at keyboard",
    "bbbg" : "bye bye be good",
    "bbc" : "british broadcasting corporation",
    "bbias" : "be back in a second",
    "bbl" : "be back later",
    "bbs" : "be back soon",
    "be4" : "before",
    "bfn" : "bye for now",
    "blvd" : "boulevard",
    "bout" : "about",
    "brb" : "be right back",
    "bros" : "brothers",
    "brt" : "be right there",
    "bsaaw" : "big smile and a wink",
    "btw" : "by the way",
    "bwl" : "bursting with laughter",
    "c/o" : "care of",
    "cet" : "central european time",
    "cf" : "compare",
    "cia" : "central intelligence agency",
    "csl" : "can not stop laughing",
    "cu" : "see you",
    "cul8r" : "see you later",
    "cv" : "curriculum vitae",
    "cwot" : "complete waste of time",
    "cya" : "see you",
    "cyt" : "see you tomorrow",
    "dae" : "does anyone else",
    "dbmib" : "do not bother me i am busy",
    "diy" : "do it yourself",
    "dm" : "direct message",
    "dwh" : "during work hours",
    "e123" : "easy as one two three",
    "eet" : "eastern european time",
    "eg" : "example",
    "embm" : "early morning business meeting",
    "encl" : "enclosed",
    "encl." : "enclosed",
    "etc" : "and so on",
    "faq" : "frequently asked questions",
    "fawc" : "for anyone who cares",
    "fb" : "facebook",
    "fc" : "fingers crossed",
    "fig" : "figure",
    "fimh" : "forever in my heart", 
    "ft." : "feet",
    "ft" : "featuring",
    "ftl" : "for the loss",
    "ftw" : "for the win",
    "fwiw" : "for what it is worth",
    "fyi" : "for your information",
    "g9" : "genius",
    "gahoy" : "get a hold of yourself",
    "gal" : "get a life",
    "gcse" : "general certificate of secondary education",
    "gfn" : "gone for now",
    "gg" : "good game",
    "gl" : "good luck",
    "glhf" : "good luck have fun",
    "gmt" : "greenwich mean time",
    "gmta" : "great minds think alike",
    "gn" : "good night",
    "g.o.a.t" : "greatest of all time",
    "goat" : "greatest of all time",
    "goi" : "get over it",
    "gps" : "global positioning system",
    "gr8" : "great",
    "gratz" : "congratulations",
    "gyal" : "girl",
    "h&c" : "hot and cold",
    "hp" : "horsepower",
    "hr" : "hour",
    "hrh" : "his royal highness",
    "ht" : "height",
    "ibrb" : "i will be right back",
    "ic" : "i see",
    "icq" : "i seek you",
    "icymi" : "in case you missed it",
    "idc" : "i do not care",
    "idgadf" : "i do not give a damn fuck",
    "idgaf" : "i do not give a fuck",
    "idk" : "i do not know",
    "ie" : "that is",
    "i.e" : "that is",
    "ifyp" : "i feel your pain",
    "ig" : "instagram",
    "iirc" : "if i remember correctly",
    "ilu" : "i love you",
    "ily" : "i love you",
    "imho" : "in my humble opinion",
    "imo" : "in my opinion",
    "imu" : "i miss you",
    "iow" : "in other words",
    "irl" : "in real life",
    "j4f" : "just for fun",
    "jic" : "just in case",
    "jk" : "just kidding",
    "jsyk" : "just so you know",
    "l8r" : "later",
    "lb" : "pound",
    "lbs" : "pounds",
    "ldr" : "long distance relationship",
    "lmao" : "laugh my ass off",
    "lmfao" : "laugh my fucking ass off",
    "lol" : "laughing out loud",
    "ltd" : "limited",
    "ltns" : "long time no see",
    "m8" : "mate",
    "mf" : "motherfucker",
    "mfs" : "motherfuckers",
    "mfw" : "my face when",
    "mofo" : "motherfucker",
    "mph" : "miles per hour",
    "mr" : "mister",
    "mrw" : "my reaction when",
    "ms" : "miss",
    "mte" : "my thoughts exactly",
    "nagi" : "not a good idea",
    "nbc" : "national broadcasting company",
    "nbd" : "not big deal",
    "nfs" : "not for sale",
    "ngl" : "not going to lie",
    "nhs" : "national health service",
    "nrn" : "no reply necessary",
    "nsfl" : "not safe for life",
    "nsfw" : "not safe for work",
    "nth" : "nice to have",
    "nvr" : "never",
    "nyc" : "new york city",
    "oc" : "original content",
    "og" : "original",
    "ohp" : "overhead projector",
    "oic" : "oh i see",
    "omdb" : "over my dead body",
    "omg" : "oh my god",
    "omw" : "on my way",
    "p.a" : "per annum",
    "p.m" : "after midday",
    "pm" : "prime minister",
    "poc" : "people of color",
    "pov" : "point of view",
    "pp" : "pages",
    "ppl" : "people",
    "prw" : "parents are watching",
    "ps" : "postscript",
    "pt" : "point",
    "ptb" : "please text back",
    "pto" : "please turn over",
    "qpsa" : "what happens", #"que pasa",
    "ratchet" : "rude",
    "rbtl" : "read between the lines",
    "rlrt" : "real life retweet", 
    "rofl" : "rolling on the floor laughing",
    "roflol" : "rolling on the floor laughing out loud",
    "rotflmao" : "rolling on the floor laughing my ass off",
    "rt" : "retweet",
    "ruok" : "are you ok",
    "sfw" : "safe for work",
    "sk8" : "skate",
    "smh" : "shake my head",
    "sq" : "square",
    "srsly" : "seriously", 
    "ssdd" : "same stuff different day",
    "tbh" : "to be honest",
    "tbs" : "tablespooful",
    "tbsp" : "tablespooful",
    "tfw" : "that feeling when",
    "thks" : "thank you",
    "tho" : "though",
    "thx" : "thank you",
    "tia" : "thanks in advance",
    "til" : "today i learned",
    "tl;dr" : "too long i did not read",
    "tldr" : "too long i did not read",
    "tmb" : "tweet me back",
    "tntl" : "trying not to laugh",
    "ttyl" : "talk to you later",
    "u" : "you",
    "u2" : "you too",
    "u4e" : "yours for ever",
    "utc" : "coordinated universal time",
    "w/" : "with",
    "w/o" : "without",
    "w8" : "wait",
    "wassup" : "what is up",
    "wb" : "welcome back",
    "wtf" : "what the fuck",
    "wtg" : "way to go",
    "wtpa" : "where the party at",
    "wuf" : "where are you from",
    "wuzup" : "what is up",
    "wywh" : "wish you were here",
    "yd" : "yard",
    "ygtr" : "you got that right",
    "ynk" : "you never know",
    "zzz" : "sleeping bored and tired"
}

In [None]:
def convert_abbrev(word):
    # Look up the lowercased token; return the word unchanged if it is not a known abbreviation
    return abbreviations.get(word.lower(), word)


In [None]:
def convert_abbrev_in_text(text):
    tokens = word_tokenize(text)
    tokens = [convert_abbrev(word) for word in tokens]
    text = ' '.join(tokens)
    return text
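
As a quick illustration of the lookup above, here is a minimal sketch that expands abbreviations in a short string. It assumes a hypothetical three-entry subset of the dictionary (`sample_abbreviations`) and splits on whitespace rather than calling `word_tokenize`, so it runs without any NLTK data; NLTK's tokenizer may split punctuation differently.

```python
# Hypothetical three-entry subset of the full abbreviations dictionary
sample_abbreviations = {"idk": "i do not know", "ppl": "people", "w/": "with"}

def expand_abbreviations(text, mapping):
    """Replace every whitespace-separated token found in `mapping` (case-insensitive)."""
    return " ".join(mapping.get(token.lower(), token) for token in text.split())

expanded = expand_abbreviations("IDK why ppl drive w/ no lights", sample_abbreviations)
print(expanded)  # i do not know why people drive with no lights
```

Because the lookup lowercases each token first, every dictionary key has to be lowercase to be reachable.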

In [None]:
def remove_punct(text):
    """
    Remove punctuation from the input text.

    Parameters:
    - text: Input text containing punctuation.

    Returns:
    - Cleaned text with punctuation removed.
    """
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

In [None]:
def expand_contractions(text):
    """
    Expand contractions in the input text.

    Parameters:
    - text: Input text containing contractions.

    Returns:
    - Text with expanded contractions.
    """
    contraction_mapping = {"ain't": "is not", "aren't": "are not", "can't": "cannot", "'cause": "because",
                           "could've": "could have", "couldn't": "could not", "didn't": "did not",
                           "doesn't": "does not", "don't": "do not", "hadn't": "had not",
                           "hasn't": "has not", "haven't": "have not", "he'd": "he would",
                           "he'll": "he will", "he's": "he is", "how'd": "how did", "how'll": "how will",
                           "how's": "how is", "i'd": "i would", "i'll": "i will", "i'm": "i am",
                           "i've": "i have", "isn't": "is not", "it'd": "it would", "it'll": "it will",
                           "it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not",
                           "might've": "might have", "mightn't": "might not", "must've": "must have",
                           "mustn't": "must not", "needn't": "need not", "oughtn't": "ought not",
                           "shan't": "shall not", "she'd": "she would", "she'll": "she will",
                           "she's": "she is", "should've": "should have", "shouldn't": "should not",
                           "that'd": "that would", "that's": "that is", "there'd": "there would",
                           "there's": "there is", "they'd": "they would", "they'll": "they will",
                           "they're": "they are", "they've": "they have", "wasn't": "was not",
                           "we'd": "we would", "we'll": "we will", "we're": "we are", "we've": "we have",
                           "weren't": "were not", "what'll": "what will", "what're": "what are",
                           "what's": "what is", "what've": "what have", "when's": "when is",
                           "when've": "when have", "where'd": "where did", "where's": "where is",
                           "where've": "where have", "who'll": "who will", "who's": "who is",
                           "who've": "who have", "why's": "why is", "why've": "why have",
                           "will've": "will have", "won't": "will not", "would've": "would have",
                           "wouldn't": "would not", "y'all": "you all", "you'd": "you would",
                           "you'll": "you will", "you're": "you are", "you've": "you have"}

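    # Escape the keys and use lookarounds instead of \b so that "'cause" (which starts
    # with an apostrophe) can match; ignore case so "I'm" and "i'm" are both expanded
    contraction_pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(key) for key in contraction_mapping) + r")(?!\w)",
        flags=re.IGNORECASE,
    )
    return contraction_pattern.sub(lambda m: contraction_mapping[m.group().lower()], text)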
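
Contraction expansion depends on the apostrophe still being present, so it only has an effect when it runs before punctuation is stripped. The sketch below shows the difference, assuming a hypothetical two-entry mapping (`sample_contractions`) and small helper names (`expand`, `strip_punct`) that are not part of the notebook.

```python
import re
import string

# Hypothetical two-entry subset of the contraction mapping
sample_contractions = {"can't": "cannot", "it's": "it is"}
pattern = re.compile(
    r"(?<!\w)(" + "|".join(map(re.escape, sample_contractions)) + r")(?!\w)",
    re.IGNORECASE,
)

def expand(text):
    """Expand the contractions listed in sample_contractions."""
    return pattern.sub(lambda m: sample_contractions[m.group().lower()], text)

def strip_punct(text):
    """Remove every character in string.punctuation, including apostrophes."""
    return text.translate(str.maketrans("", "", string.punctuation))

tweet = "It's so hot we can't breathe"
expanded_first = strip_punct(expand(tweet))   # contractions are found and expanded
stripped_first = expand(strip_punct(tweet))   # apostrophes are gone, nothing matches

print(expanded_first)  # it is so hot we cannot breathe
print(stripped_first)  # Its so hot we cant breathe
```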


In [None]:
def handle_hashtags_usernames(text):
    """
    Handle hashtags and usernames in the input text.

    Parameters:
    - text: Input text containing hashtags and usernames.

    Returns:
    - Text with the '#' stripped from hashtags and @username mentions removed.
    """
    # Keep the hashtag's word but drop the leading '#'
    text = re.sub(r'#(\w+)', r'\1', text)

    # Replace @username mentions with a space
    text = re.sub(r'@(\w+)', ' ', text)

    return text
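
The function above keeps each hashtag as a single word. If space-separated words are wanted for CamelCase hashtags, one possible extension is sketched below; `split_hashtag` and `expand_hashtags` are hypothetical names, and this is an optional variant rather than the approach used in this notebook.

```python
import re

def split_hashtag(match):
    """Turn 'ForestFire' into 'Forest Fire' by adding a space at each lower-to-upper boundary."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", match.group(1))

def expand_hashtags(text):
    """Drop the '#' and split CamelCase hashtags into separate words."""
    return re.sub(r"#(\w+)", split_hashtag, text)

result = expand_hashtags("Evacuations under way #ForestFire #help")
print(result)  # Evacuations under way Forest Fire help
```

All-lowercase hashtags such as `#forestfire` pass through as one word, since there is no case boundary to split on.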

In [None]:
from nltk.tokenize import word_tokenize
# Function to preprocess text data
def preprocess_text(text):
    text = remove_URL(text)
    text = remove_html(text)
    text = remove_emoji(text)
    # Run the steps that rely on '#', '@', apostrophes and dotted abbreviations
    # before punctuation is stripped; afterwards they would have nothing to match
    text = handle_hashtags_usernames(text)
    text = expand_contractions(text)
    text = convert_abbrev_in_text(text)
    text = remove_punctuations(text)
    text = remove_punct(text)
    text = correct_spellings(text)
    return text

In [None]:
# Function to load GloVe word embeddings from a file
def load_glove_embedding(file_path):
    embedding_dict = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vectors = np.asarray(values[1:], 'float32')
            embedding_dict[word] = vectors
    return embedding_dict

In [None]:
# Function to create an embedding matrix for a given tokenizer using pre-trained word embeddings
def create_embedding_matrix(embedding_dict, tokenizer, max_len, num_words):
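    # The embedding dimension is read from any one vector in the dictionary
    embedding_dim = len(next(iter(embedding_dict.values())))
    embedding_matrix = np.zeros((num_words, embedding_dim))
    for word, i in tqdm(tokenizer.word_index.items(), desc="Creating Embedding Matrix"):
        # Valid row indices are 0..num_words-1, so skip anything at or beyond num_words
        if i >= num_words:
            continue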
        emb_vec = embedding_dict.get(word)
        if emb_vec is not None:
            embedding_matrix[i] = emb_vec
    return embedding_matrix
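
To make the row layout of the embedding matrix concrete, here is a toy sketch. It assumes made-up 3-dimensional vectors (`toy_embeddings`) and a hand-written, 1-based word index (`toy_word_index`) standing in for `tokenizer.word_index`; neither comes from the real GloVe files.

```python
import numpy as np

# Hypothetical stand-ins for the GloVe dictionary and the tokenizer's 1-based word index
toy_embeddings = {"fire": np.array([0.1, 0.2, 0.3]), "flood": np.array([0.4, 0.5, 0.6])}
toy_word_index = {"fire": 1, "zzzfire": 2, "flood": 3}

num_words = len(toy_word_index) + 1            # +1 because row 0 is reserved for padding
embedding_dim = len(next(iter(toy_embeddings.values())))
matrix = np.zeros((num_words, embedding_dim))

for word, i in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:                     # out-of-vocabulary words keep an all-zero row
        matrix[i] = vector

print(matrix.shape)  # (4, 3)
```

Row 0 (padding) and row 2 (a word missing from the embeddings) stay all zeros, while rows 1 and 3 hold the pre-trained vectors.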


In [None]:
# Function to create a corpus from text data
def create_corpus(df):
    corpus = []
    for text in df['text']:
        for word in text.split():
            corpus.append(word)
    return corpus


In [None]:
# Function to preprocess data for training with pre-trained word embeddings
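def prepare_data_with_embedding(df, glove_file, max_len=50):
    # Work on a copy so repeated calls do not re-clean text in the caller's DataFrame
    df = df.copy()

    # Apply text cleaning using preprocess_text function
    df['text'] = df['text'].apply(preprocess_text)

    # Build the vocabulary from the corpus, then encode each tweet as one sequence
    # (encoding the flat word list would yield one row per word instead of per tweet)
    corpus = create_corpus(df)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(corpus)
    sequences = tokenizer.texts_to_sequences(df['text'].tolist())
    tweet_pad = pad_sequences(sequences, maxlen=max_len, truncating='post', padding='post')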

    # Create word index and embedding matrix
    word_index = tokenizer.word_index
    num_words = len(word_index) + 1
    embedding_dict = load_glove_embedding(glove_file)
    embedding_matrix = create_embedding_matrix(embedding_dict, tokenizer, max_len, num_words)

    return tweet_pad, embedding_matrix, num_words
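
The `padding='post'` and `truncating='post'` settings mean that short tweets are filled with zeros at the end and long tweets lose their trailing tokens. The pure-Python sketch below mimics that behaviour for illustration only; `pad_post` is a hypothetical helper, not the Keras implementation.

```python
def pad_post(sequences, maxlen, value=0):
    """Cut each sequence to maxlen from the end, then fill the remainder with `value`."""
    padded = []
    for seq in sequences:
        kept = seq[:maxlen]                                  # post-truncation
        padded.append(kept + [value] * (maxlen - len(kept))) # post-padding
    return padded

padded = pad_post([[5, 2], [7, 1, 4, 9, 3]], maxlen=4)
print(padded)  # [[5, 2, 0, 0], [7, 1, 4, 9]]
```

Every row ends up with the same length, which is what lets the padded tweets be stacked into a single array for the `Embedding` layer.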

In [None]:
# Paths to GloVe embeddings
path = '/Users/stevenschepanski/Documents/Projects/NLP_Tweets/models/'
glove_files = [
    os.path.join(path, 'glove.6B.50d.txt'),
    os.path.join(path, 'glove.6B.100d.txt'),
    os.path.join(path, 'glove.6B.200d.txt'),
]

In [None]:
# Initialize a list to store models and their performances
models_and_performances = []


### Model Selection and Architecture


In [None]:
# Define additional variables
num_epochs = 10
max_len = 50


In [None]:
# Loop through each GloVe file
for glove_file in glove_files:
    # glove_files already contains full paths, so each entry is used as-is
    tweet_pad, embedding_matrix, num_words = prepare_data_with_embedding(df, glove_file, max_len=max_len)

    # Split into training and validation sets. Labels are read from df in its original
    # row order so they stay aligned with the rows of tweet_pad; train_test_split
    # shuffles both together by default.
    X_train, X_val, y_train, y_val = train_test_split(
        tweet_pad[:len(df)],
        df['target'].values,
        test_size=0.2,
        random_state=42
    )

    # Define the deep learning model; the frozen Embedding layer is initialised
    # with the matrix built from the current GloVe file
    model = Sequential([
        Embedding(
            input_dim=num_words,
            output_dim=embedding_matrix.shape[1],
            weights=[embedding_matrix],
            input_length=max_len,
            trainable=False
        ),
        Bidirectional(LSTM(128, return_sequences=True)),
        GlobalMaxPooling1D(),
        Dense(64, activation='relu'),
        Dense(1, activation='sigmoid')
    ])

    # Compile the model; binary_crossentropy matches the single sigmoid output for the 0/1 target
    model.compile(optimizer=AdamLegacy(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

    # Train the model
    history = model.fit(
        X_train,
        y_train,
        epochs=num_epochs,
        validation_data=(X_val, y_val),
        verbose=1
    )

    # Evaluate the model on the validation set
    val_loss, val_acc = model.evaluate(X_val, y_val)
    print(f"Validation Loss for {glove_file}: {val_loss:.4f}, Validation Accuracy: {val_acc*100:.2f}%")

    # Save the model and its performance
    models_and_performances.append((model, val_acc))
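
Once the loop has finished, the stored `(model, val_acc)` pairs can be compared to pick the best run. The sketch below shows one way to do that, using placeholder strings and dummy accuracy values in place of real models and measured results.

```python
# Placeholder (model, validation accuracy) pairs; the values are dummies, not measured results
example_results = [("model_a", 0.50), ("model_b", 0.75), ("model_c", 0.60)]

# max() with a key function returns the pair with the highest validation accuracy
best_model, best_acc = max(example_results, key=lambda pair: pair[1])
print(best_model, best_acc)  # model_b 0.75
```

Applied to `models_and_performances`, the same call would return the trained model with the highest validation accuracy.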

### Hyperparameters

# Conclusion