## Import libraries

In [27]:
import nltk
from scripts.model_utils import evaluate_model, train_model, make_predictions
from scripts.data_utils import clean_text, load_data, vectorize_text
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

## Functions Defined

### `load_data`

The `load_data` function loads the training and test datasets from CSV files into pandas DataFrames and returns them.
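
The implementation lives in `scripts/data_utils.py` and is not shown in this notebook. A minimal sketch of what such a loader might look like, assuming the files are named `train.csv` and `test.csv` inside a `data/` directory (the file names, the directory, and the `data_dir` parameter are all assumptions):

```python
from pathlib import Path

import pandas as pd


def load_data(data_dir="data"):
    """Read the training and test CSV files into DataFrames.

    The `data_dir` parameter and the file names are hypothetical;
    the notebook itself calls `load_data()` with no arguments.
    """
    data_dir = Path(data_dir)
    train_df = pd.read_csv(data_dir / "train.csv")
    test_df = pd.read_csv(data_dir / "test.csv")
    return train_df, test_df
```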

### `clean_text`

The `clean_text` function preprocesses a text string in several steps. It removes URLs, mentions, and any character that is not a letter or a space (digits are dropped as well, as the sample output under Data Preprocessing shows), converts the text to lowercase, tokenizes it, removes stopwords, and applies stemming.
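
The steps above can be sketched roughly as follows. This is not the project's actual code: the regular expressions, the whitespace tokenizer, and the tiny hard-coded stopword list are assumptions chosen to keep the example self-contained (the real function presumably uses a full stopword list such as NLTK's). Digits are dropped here because the sample output later in the notebook contains none.

```python
import re

from nltk.stem import PorterStemmer

# Tiny illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"a", "an", "and", "are", "as", "in", "is", "of", "our", "the", "this", "to"}
STEMMER = PorterStemmer()


def clean_text(text):
    """Clean one text string and return the stemmed tokens joined by spaces."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)                     # remove @mentions
    text = re.sub(r"[^A-Za-z\s]", "", text)               # keep letters and whitespace only
    tokens = text.lower().split()                         # lowercase, split on whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]    # remove stopwords
    return " ".join(STEMMER.stem(t) for t in tokens)      # stem and rejoin
```

For example, `clean_text("Check http://t.co/abc @user, RUNNING now!")` strips the URL and the mention before the remaining words are lowercased and stemmed.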

### `vectorize_text`

The `vectorize_text` function applies TF-IDF vectorization to the training, validation, and test text, converting it into numerical feature matrices suitable for machine learning models. It takes the three text collections along with optional `max_features` and `ngram_range` parameters, and returns the vectorized feature matrices.
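
A minimal sketch of such a function using scikit-learn's `TfidfVectorizer` is shown below. The default values for `max_features` and `ngram_range` are assumptions, as is the choice to fit the vectorizer on the training text only, which is the usual way to keep validation and test vocabulary from leaking into the features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def vectorize_text(train_text, val_text, test_text,
                   max_features=5000, ngram_range=(1, 1)):
    """Return TF-IDF matrices for the three splits.

    The default parameter values are hypothetical.
    """
    vectorizer = TfidfVectorizer(max_features=max_features, ngram_range=ngram_range)
    # Learn the vocabulary and IDF weights from the training text only.
    X_train = vectorizer.fit_transform(train_text)
    # Reuse the fitted vocabulary for the other splits.
    X_val = vectorizer.transform(val_text)
    X_test = vectorizer.transform(test_text)
    return X_train, X_val, X_test
```

Because all three matrices share one fitted vocabulary, they have the same number of columns, and words that never occur in the training text are ignored in the validation and test matrices.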

### `train_model`

The `train_model` function trains a Bernoulli Naive Bayes model using the provided training data.
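
A sketch of what this might look like with scikit-learn's `BernoulliNB`; the `alpha` parameter is an assumption, since the notebook passes only the features and labels. Note that `BernoulliNB` binarizes its inputs at a threshold of 0.0 by default, so TF-IDF weights are effectively reduced to term presence or absence.

```python
from sklearn.naive_bayes import BernoulliNB


def train_model(X_train, y_train, alpha=1.0):
    """Fit and return a Bernoulli Naive Bayes classifier.

    `alpha` (additive smoothing) is a hypothetical parameter.
    """
    model = BernoulliNB(alpha=alpha)  # default binarize=0.0: nonzero features count as "present"
    model.fit(X_train, y_train)
    return model
```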

### `evaluate_model`

The `evaluate_model` function evaluates the trained model on a validation set and prints the validation accuracy.
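
A minimal sketch, assuming accuracy is computed with scikit-learn's `accuracy_score` and printed with four decimal places to match the output shown later. Returning the score is an assumption; the notebook only relies on the printed value.

```python
from sklearn.metrics import accuracy_score


def evaluate_model(model, X_val, y_val):
    """Print and return the accuracy of `model` on the validation set."""
    y_pred = model.predict(X_val)
    accuracy = accuracy_score(y_val, y_pred)
    print(f"Validation accuracy: {accuracy:.4f}")
    return accuracy  # returning the value is an assumption
```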

### `make_predictions`

The `make_predictions` function makes predictions on a test set using the trained model and saves the results to a CSV file.
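
One way this could be written is sketched below. The `output_path` parameter and the `id` and `target` column names are assumptions (the column names mirror the training data), and the printed message follows the output shown at the end of the notebook.

```python
import pandas as pd


def make_predictions(model, X_test, ids, output_path="submission.csv"):
    """Predict labels for the test set and write them to a CSV file.

    `output_path` and the output column names are hypothetical.
    """
    predictions = model.predict(X_test)
    submission = pd.DataFrame({"id": list(ids), "target": list(predictions)})
    submission.to_csv(output_path, index=False)  # no index column in the file
    print(f"Predictions made and saved to file: {output_path}")
    return submission
```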



## Load and Inspect Dataset

In [28]:
train_df, test_df = load_data()

In [29]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


In [30]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3263 entries, 0 to 3262
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        3263 non-null   int64 
 1   keyword   3237 non-null   object
 2   location  2158 non-null   object
 3   text      3263 non-null   object
dtypes: int64(1), object(3)
memory usage: 102.1+ KB


## Data Preprocessing

We apply the `clean_text` function to the `text` column of both the training and test datasets and store the result in a new `cleaned_text` column.

In [31]:
print("\nCleaning text data...")
train_df['cleaned_text'] = train_df['text'].apply(clean_text)
test_df['cleaned_text'] = test_df['text'].apply(clean_text)
print("\nText data cleaned successfully.")


Cleaning text data...

Text data cleaned successfully.


In [32]:
train_df.head()

|   | id | keyword | location | text | target | cleaned_text |
|---|----|---------|----------|------|--------|--------------|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 | deed reason earthquak may allah forgiv us |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 | forest fire near la rong sask canada |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 | resid ask shelter place notifi offic evacu she... |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 | peopl receiv wildfir evacu order california |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 | got sent photo rubi alaska smoke wildfir pour ... |


Next, we split the training dataset into training and validation sets using an 80-20 split.

In [33]:
X_train, X_val, y_train, y_val = train_test_split(train_df["cleaned_text"], 
                                                  train_df["target"], 
                                                  test_size=0.2, 
                                                  random_state=79)
print(f"Training data: {X_train.shape[0]} samples\nValidation data: {X_val.shape[0]} samples")

Training data: 6090 samples
Validation data: 1523 samples


Next, we vectorize the cleaned text with the `vectorize_text` function, which applies TF-IDF vectorization to the training, validation, and test sets and converts the text into numerical feature vectors suitable for machine learning models.

In [34]:
X_train_vec, X_val_vec, X_test_vec = vectorize_text(X_train, X_val, test_df['cleaned_text'])
print("\nText data vectorized successfully.")


Text data vectorized successfully.


We then use the Synthetic Minority Over-sampling Technique (SMOTE) to address class imbalance in the training set. SMOTE generates synthetic minority-class samples by interpolating between existing minority samples and their nearest neighbors, which balances the class distribution and can help the model avoid favoring the majority class. Resampling is applied only to the training data; the validation and test sets are left unchanged.

In [35]:
smote = SMOTE(random_state=79)
X_train_bal, y_train_bal = smote.fit_resample(X_train_vec, y_train)
print(f"Balanced training data: {X_train_bal.shape[0]} samples")

Balanced training data: 6974 samples
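
The class sizes behind these numbers can be recovered with a little arithmetic. The sketch below assumes a binary target and SMOTE's default `sampling_strategy`, which oversamples the minority class until it matches the majority class, so the balanced set is exactly twice the size of the majority class.

```python
n_before = 6090  # training samples before resampling (printed after the split)
n_after = 6974   # training samples after SMOTE (printed above)

n_majority = n_after // 2             # both classes have this size after balancing
n_minority = n_before - n_majority    # original minority class size
n_synthetic = n_after - n_before      # synthetic samples added by SMOTE

print(n_majority, n_minority, n_synthetic)  # 3487 2603 884
print(f"{n_minority / n_before:.1%}")       # 42.7%
```

Under these assumptions the minority class made up about 42.7% of the original training split, so the imbalance being corrected here is fairly mild.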


## Model Training and Evaluation

Next, we'll train a Bernoulli Naive Bayes model using the balanced training data and evaluate its performance on the validation set.

In [36]:
model = train_model(X_train_bal, y_train_bal)
print("\nModel trained successfully.")
evaluate_model(model, X_val_vec, y_val)


Model trained successfully.

Validation accuracy: 0.8168
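
Accuracy alone does not show how the errors are split between the two classes. A short sketch of how precision, recall, and F1 could be computed alongside it with scikit-learn follows; the labels are made up for illustration and are not results from this notebook. In practice `y_true` would be `y_val` and `y_pred` would be `model.predict(X_val_vec)`.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

precision = precision_score(y_true, y_pred)  # share of predicted positives that are correct
recall = recall_score(y_true, y_pred)        # share of actual positives that are found
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
cm = confusion_matrix(y_true, y_pred)        # rows: true class, columns: predicted class

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(cm)
```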


## Model Prediction

Finally, we use the trained model to predict labels for the test set. The predictions are saved to `submission.csv`.

In [37]:
make_predictions(model, X_test_vec, test_df['id'])


Predictions made and saved to file: submission.csv
