## Import libraries

In [None]:
import pandas as pd
import numpy as np
import nltk
import re
from imblearn.over_sampling import SMOTE
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

The `nltk.download('stopwords')` call fetches the NLTK stopwords corpus so that it is available locally; it returns `True` on success and `False` if the download fails. Stopwords are common words (such as "and", "the", "is") that are often removed during text preprocessing to focus on more meaningful words.

In [183]:
nltk.download('stopwords')

[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1000)>


False
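
In the output above the download failed with an SSL certificate error and the call returned `False`, so this cell fetched nothing; the later cells can only work if the corpus was already installed locally. A minimal sketch, assuming you want to skip the network call when the corpus is present (the helper name `ensure_stopwords` is hypothetical, not part of the original notebook):

```python
import nltk

def ensure_stopwords():
    """Return True if the NLTK stopwords corpus is usable, downloading it only if missing."""
    try:
        # Raises LookupError when the corpus is not found in any NLTK data directory.
        nltk.data.find('corpora/stopwords')
        return True
    except LookupError:
        # nltk.download returns False (rather than raising) when the download fails.
        return bool(nltk.download('stopwords'))

if not ensure_stopwords():
    print("Stopwords corpus unavailable; stopwords.words('english') will fail.")
```

Checking the return value makes a failed download visible at the top of the notebook instead of surfacing later inside `clean_text`.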

## Functions Defined

### `load_data`

The `load_data` function is responsible for loading the training and test datasets from CSV files. It reads the data into pandas DataFrames and returns them for further processing.

In [55]:
def load_data():
    """
    Load training and test data from CSV files
    Returns:
        train_df: pandas DataFrame containing training data
        test_df: pandas DataFrame containing test data
    """
    train_df = pd.read_csv('/Users/mmesoma/personal-projects/kaggle-competition-nlp-disaster-tweets/data/train.csv')
    test_df = pd.read_csv('/Users/mmesoma/personal-projects/kaggle-competition-nlp-disaster-tweets/data/test.csv')
    return train_df, test_df
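
The paths above are absolute and tied to one machine. A hedged sketch of a more portable variant, assuming both CSV files sit in a single directory that you pass in (the function name `load_data_from` and the `data_dir` parameter are illustrative, not part of the original notebook):

```python
from pathlib import Path

import pandas as pd

def load_data_from(data_dir):
    """Read train.csv and test.csv from data_dir and return them as DataFrames."""
    data_dir = Path(data_dir)
    train_df = pd.read_csv(data_dir / 'train.csv')
    test_df = pd.read_csv(data_dir / 'test.csv')
    return train_df, test_df
```

With this variant, a call such as `load_data_from('data')` would resolve the files relative to the notebook's working directory.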

### `clean_text`

The `clean_text` function preprocesses a tweet in several steps. It removes URLs and @mentions, strips the `#` from hashtags while keeping the hashtag word, replaces every non-letter character (digits and punctuation included) with a space, converts the text to lowercase, splits it on whitespace, removes English stopwords, and applies Porter stemming.

In [126]:
def clean_text(text):
    """
    Clean a tweet: drop URLs and mentions, keep hashtag words without '#',
    replace non-letter characters with spaces, lowercase, remove stopwords, stem
    Args:
        text: string
    Returns:
        string of space-separated stemmed tokens
    """
    text = re.sub(r'http\S+', '', text)      # remove URLs
    text = re.sub(r'@\w+', '', text)         # remove @mentions
    text = re.sub(r'#(\w+)', r'\1', text)    # keep the hashtag word, drop '#'
    text = re.sub(r'[^a-zA-Z]', ' ', text)   # non-letters (digits too) become spaces
    text = text.lower().strip()
    words = text.split()
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    stemmer = PorterStemmer()
    words = [stemmer.stem(word) for word in words]
    return ' '.join(words)
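
To see what the regular-expression steps do on their own, here is a small sketch that applies only those substitutions, leaving out stopword removal and stemming so that it runs without NLTK. The helper name `strip_noise` and the sample tweet are made up for illustration:

```python
import re

def strip_noise(text):
    """Regex-only part of the cleaning steps (no stopword removal, no stemming)."""
    text = re.sub(r'http\S+', '', text)      # drop URLs
    text = re.sub(r'@\w+', '', text)         # drop @mentions
    text = re.sub(r'#(\w+)', r'\1', text)    # keep the hashtag word, drop '#'
    text = re.sub(r'[^a-zA-Z]', ' ', text)   # anything that is not a letter becomes a space
    return ' '.join(text.lower().split())

sample = "Flooding on 5th Ave!! #StaySafe @nycgov https://example.com/x1"
cleaned = strip_noise(sample)
print(cleaned)   # flooding on th ave staysafe
```

Note that `5th` is reduced to `th`: digits are removed, which can leave short fragments behind.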

### `vectorize_text`

The `vectorize_text` function applies TF-IDF vectorization to the training, validation, and test text. The vectorizer is fit on the training text only, and the learned vocabulary and IDF weights are then reused to transform the validation and test text, so no information from those sets leaks into the features. The function accepts optional `max_features` and `ngram_range` parameters and returns the three sparse feature matrices.

In [169]:
def vectorize_text(train_text, val_text, test_text, max_features=5000, ngram_range=(1, 2)):
    """
    Vectorize text data using TF-IDF
    Args:
        train_text: pandas Series containing training text data
        val_text: pandas Series containing validation text data
        test_text: pandas Series containing test text data
        max_features: int, default=5000
        ngram_range: tuple, default=(1, 2)
    Returns:
        X_train_vec: sparse matrix for training data
        X_val_vec: sparse matrix for validation data
        X_test_vec: sparse matrix for test data
    """
    tfidf = TfidfVectorizer(max_features=max_features, ngram_range=ngram_range)
    X_train_vec = tfidf.fit_transform(train_text)
    X_val_vec = tfidf.transform(val_text)
    X_test_vec = tfidf.transform(test_text)
    return X_train_vec, X_val_vec, X_test_vec
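
The fit-on-train, transform-elsewhere pattern used above can be illustrated on a few toy strings. This is only a sketch: the documents are invented, and `max_features` is left at its default to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

fit_docs = ["forest fire near town", "fire crew evacu town"]   # toy, made-up documents
other_docs = ["flood near town", "earthquak"]                  # contain unseen terms

tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_fit = tfidf.fit_transform(fit_docs)      # learns the vocabulary and IDF weights
X_other = tfidf.transform(other_docs)      # reuses them; unseen terms are ignored

print(X_fit.shape[1] == X_other.shape[1])  # True: both matrices share the same columns
print(X_other[1].nnz)                      # 0: "earthquak" is not in the learned vocabulary
```

Because unseen terms are dropped at transform time, a document made up entirely of new words becomes an all-zero row.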

## Inspect Dataset

In [99]:
train_df, test_df = load_data()

In [100]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


In [101]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3263 entries, 0 to 3262
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        3263 non-null   int64 
 1   keyword   3237 non-null   object
 2   location  2158 non-null   object
 3   text      3263 non-null   object
dtypes: int64(1), object(3)
memory usage: 102.1+ KB


## Data Preprocessing

We apply the `clean_text` function to preprocess the text data in both the training and test datasets. This step ensures that the text data is cleaned and standardized before further processing.

In [127]:
print("\nCleaning text data...")
train_df['cleaned_text'] = train_df['text'].apply(clean_text)
test_df['cleaned_text'] = test_df['text'].apply(clean_text)
print("\nText data cleaned successfully.")


Cleaning text data...

Text data cleaned successfully.


In [128]:
train_df.head()

| | id | keyword | location | text | target | cleaned_text |
|---|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 | deed reason earthquak may allah forgiv us |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 | forest fire near la rong sask canada |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 | resid ask shelter place notifi offic evacu she... |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 | peopl receiv wildfir evacu order california |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 | got sent photo rubi alaska smoke wildfir pour ... |


Next, we split the training dataset into training and validation sets using an 80-20 split.

In [None]:
X_train, X_val, y_train, y_val = train_test_split(train_df["cleaned_text"], 
                                                  train_df["target"], 
                                                  test_size=0.2, 
                                                  random_state=79)

Next, we vectorize the cleaned text with the `vectorize_text` function, which fits TF-IDF on the training split and applies it to the validation and test text. This converts the text into sparse numerical feature matrices suitable for machine learning models.

In [None]:
X_train_vec, X_val_vec, X_test_vec = vectorize_text(X_train, X_val, test_df['cleaned_text'])

We use the Synthetic Minority Over-sampling Technique (SMOTE) to address class imbalance in the training set. SMOTE generates synthetic samples for the minority class until both classes are the same size, which can improve model performance and generalization. It is applied to the training matrix only, so the validation set keeps its original class distribution.

In [None]:

smote = SMOTE(random_state=79)
X_train_bal, y_train_bal = smote.fit_resample(X_train_vec, y_train)
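
Conceptually, SMOTE builds each synthetic sample by taking a minority-class point, choosing one of its nearest minority-class neighbours, and picking a random point on the line segment between them. The sketch below shows only that interpolation step with the standard library; it is a simplified illustration, not imbalanced-learn's implementation, and the vectors and the `interpolate` helper are made up.

```python
import random

def interpolate(sample, neighbor, rng):
    """Return a random point on the line segment between sample and neighbor."""
    gap = rng.random()   # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(79)
minority_a = [0.0, 0.4, 0.0]   # toy feature vectors, invented for illustration
minority_b = [0.2, 0.0, 0.0]

synthetic = interpolate(minority_a, minority_b, rng)
print(synthetic)
```

Each coordinate of the synthetic point lies between the corresponding coordinates of the two real points, and a feature that is zero in both stays zero.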

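Next, we train a Bernoulli Naive Bayes model on the balanced training data and evaluate it on the validation set. Note that `BernoulliNB` binarizes its input by default (`binarize=0.0`), so the model only sees whether a term is present in a tweet, not its TF-IDF weight.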

In [None]:
BERNmodel = BernoulliNB().fit(X_train_bal, y_train_bal)
pred_val = BERNmodel.predict(X_val_vec)
accuracy = accuracy_score(y_val, pred_val)
print(f"Validation accuracy: {accuracy}")

Validation accuracy: 0.8168089297439265
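
Accuracy is the fraction of validation tweets classified correctly. As a rough sanity check, and assuming the validation set holds 1523 rows (20% of the 7613 training rows, rounded up), the printed score can be converted back into a count:

```python
import math

n_val = math.ceil(0.2 * 7613)         # assumed validation size: 1523
reported = 0.8168089297439265         # value printed above

n_correct = round(reported * n_val)   # nearest whole number of correct predictions
print(n_correct, n_val)               # 1244 1523
print(abs(n_correct / n_val - reported) < 1e-9)   # True: the counts reproduce the score
```

Under that assumption, the model labels 1244 of the 1523 validation tweets correctly and misclassifies the remaining 279.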


Finally, we predict labels for the test set with the trained model and save them, together with the tweet ids, to `submission.csv`.

In [182]:
pred_test = BERNmodel.predict(X_test_vec)
submission = pd.DataFrame({'id': test_df['id'], 'target': pred_test})
submission.to_csv('submission.csv', index=False)
print("\nFile created: submission.csv")


File created: submission.csv
