## Assignment 2
### Food Hazard Detection LGBM Model Submission Notebook

This submission uses an LGBM model trained on the `text` column rather than the title. The model is trained with the parameters that performed best for the text-focused variant in the benchmark evaluations, and is then used to predict labels for a new, unlabeled dataset. The reasoning behind this approach is as follows:

- Better Performance with Text: The text-based LGBM model outperformed the title-based model in the benchmarks, particularly in hazard-category classification (F1-score of 0.9065).
- Training with Benchmark Parameters: The model reuses the parameters of the text-focused model from the benchmarks.
- Prediction on Unlabeled Dataset: The trained model predicts labels for a new, unlabeled dataset.
- Product Classification Challenge: Product classification remains weak, but the text-based model still performs better than the title-based one and leaves room for improvement.

The text-focused approach is therefore the strongest of the benchmarked options for these classification tasks.

# Training Data Overview

In [1]:
import pandas as pd

# Load the dataset
file_path = r"C:\Users\steli\OneDrive\Desktop\Stelios\DSAUEB\Trimester 1\PDS\A2\PDS-A2\Data\incidents_train.csv"
df = pd.read_csv(file_path)

# Initial inspection of the data
data_overview = {
    'Shape': df.shape,
    'Columns': df.columns.tolist(),
    'df Types': df.dtypes,
    'Missing Values': df.isnull().sum(),
}

print(data_overview)
# Drop the unnecessary index column
df = df.drop(columns=['Unnamed: 0'])


{'Shape': (5082, 11), 'Columns': ['Unnamed: 0', 'year', 'month', 'day', 'country', 'title', 'text', 'hazard-category', 'product-category', 'hazard', 'product'], 'df Types': Unnamed: 0           int64
year                 int64
month                int64
day                  int64
country             object
title               object
text                object
hazard-category     object
product-category    object
hazard              object
product             object
dtype: object, 'Missing Values': Unnamed: 0          0
year                0
month               0
day                 0
country             0
title               0
text                0
hazard-category     0
product-category    0
hazard              0
product             0
dtype: int64}


# Import Necessary Libraries

In [2]:
import pandas as pd
import lightgbm as lgb
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import os
from shutil import make_archive
import re
from nltk.corpus import stopwords
import nltk


# Download NLTK Stopwords

Stopwords: NLTK's stopwords corpus provides a list of common English stopwords, which are removed from the text during preprocessing.

In [3]:
# Download the NLTK stopwords corpus (uncomment on the first run)
#nltk.download('stopwords')

# Get the list of English stopwords
stop_words = set(stopwords.words('english'))


# Define the Function to Clean Text

This function cleans the text by:

- Removing non-alphanumeric characters.
- Converting text to lowercase.
- Removing extra spaces.
- Removing common stopwords using the NLTK stopwords list.

In [4]:
# Function to clean text (title or text) and remove stopwords
def clean_text(text):
    # Remove non-alphanumeric characters (excluding spaces)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Remove extra spaces
    text = ' '.join(text.split())
    # Remove stopwords
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text
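
To make the effect of these steps concrete, here is a minimal sketch that runs the same cleaning logic on one sample sentence. The name `clean_text_demo`, the tiny `demo_stop_words` set, and the sample string are illustrative assumptions, standing in for the NLTK list so that the example runs without any downloads.

```python
import re

# Hypothetical stand-in for the NLTK English stopword set
demo_stop_words = {"of", "in", "the", "a"}

def clean_text_demo(text, stop_words=demo_stop_words):
    # Remove non-alphanumeric characters (excluding whitespace)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # split() collapses repeated whitespace; drop stopwords while rejoining
    return ' '.join(word for word in text.split() if word not in stop_words)

sample = "Recall of Chicken-Salad: Listeria found in 2 lots!"
cleaned = clean_text_demo(sample)
print(cleaned)
```

This prints `recall chickensalad listeria found 2 lots`. Note that punctuation is deleted rather than replaced by a space, so the hyphenated "Chicken-Salad" becomes the single token "chickensalad".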


# Load Data (Train and Test)

In [5]:
# Load training data (adjust the path to your local copy)
train_path = r"C:\Users\steli\OneDrive\Desktop\Stelios\DSAUEB\Trimester 1\PDS\A2\PDS-A2\Data\incidents_train.csv"
train_df = pd.read_csv(train_path, index_col=0)

# Load test data (test data will remain uncleaned)
test_path = r"C:\Users\steli\OneDrive\Desktop\Stelios\DSAUEB\Trimester 1\PDS\A2\PDS-A2\Data\validation_data\incidents.csv"
test_df = pd.read_csv(test_path, index_col=0)


# Clean Only the 'text' Column in the Train Data

In [6]:
# Clean the 'text' column in the training data only
train_df['text'] = train_df['text'].apply(clean_text)
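
Because only the training text is cleaned, the TF-IDF features at prediction time are computed from differently preprocessed text than the features the model was trained on. One possible alternative, shown here as a sketch and not as the method used in this notebook, is to apply the same function to both frames. The toy frames, the reduced stopword set, and the names `demo_train`, `demo_test`, and `demo_test_clean` are assumptions for illustration.

```python
import re
import pandas as pd

# Hypothetical stand-in for the NLTK English stopword set
demo_stop_words = {"the", "of", "in", "a", "was", "for"}

def clean_text(text):
    # Same steps as the notebook's function: strip punctuation, lowercase, drop stopwords
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    text = text.lower()
    return ' '.join(w for w in text.split() if w not in demo_stop_words)

# Toy frames standing in for train_df and test_df
demo_train = pd.DataFrame({"text": ["Recall of the Cheese: Listeria!"]})
demo_test = pd.DataFrame({"text": ["Listeria was found in the cheese."]})

# Apply the identical cleaning step to both frames;
# copy() keeps the raw test frame available unchanged
demo_train["text"] = demo_train["text"].apply(clean_text)
demo_test_clean = demo_test.copy()
demo_test_clean["text"] = demo_test_clean["text"].apply(clean_text)

print(demo_train["text"].iloc[0])
print(demo_test_clean["text"].iloc[0])
```

Whether this improves scores on the real data would need to be checked against a held-out split; the sketch only shows how to keep the two preprocessing paths identical.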


# Define Features and Targets

In [7]:
# Define relevant features and targets
features = ['year', 'month', 'day', 'country']
targets_subtask1 = ['hazard-category', 'product-category']
targets_subtask2 = ['hazard', 'product']
all_targets = targets_subtask1 + targets_subtask2


# Prepare Data Function for Test Set (No Cleaning)

In [8]:
# Prepare data function for test set
def prepare_test_data(text_column):
    X = test_df[features + [text_column]]  # Include text for prediction (no cleaning applied)
    return X


# Prepare Data Function for Train Set (With Cleaning)

In [9]:
# Prepare data function for train set
def prepare_train_data(text_column):
    X = train_df[features + [text_column]]  # Include cleaned text for training
    return X


# Define the LightGBM Pipeline

- Text Preprocessing: TF-IDF vectorization is used for the text column.
- Standard Scaling: The year, month, and day columns are scaled.
- Categorical Encoding: The country column is one-hot encoded.
- LightGBM Classifier: The pipeline ends with a LightGBM classifier configured with num_leaves=80, learning_rate=0.05, and n_estimators=300.

In [10]:
# Define LightGBM pipeline for text
def build_lgb_pipeline_text():
    preprocessor = ColumnTransformer(
        transformers=[
            ('text', TfidfVectorizer(), 'text'),  # Use TF-IDF for text
            ('num', StandardScaler(), ['year', 'month', 'day']),
            ('cat', OneHotEncoder(handle_unknown='ignore'), ['country'])
        ]
    )
    
    # LightGBM classifier
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', lgb.LGBMClassifier(num_leaves=80, learning_rate=0.05, n_estimators=300, verbose=-1))
    ])
    return pipeline
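
To see what the preprocessor hands to the classifier, the following sketch fits the same three transformers on a few toy rows and inspects the resulting feature matrix. The row values are made up, and `sparse_threshold=0` is added only so the demo returns a dense array that is easy to print; the notebook's pipeline does not set it.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows (hypothetical values) with the same column names as the notebook
demo = pd.DataFrame({
    "year": [2019, 2020, 2021],
    "month": [1, 6, 12],
    "day": [5, 15, 25],
    "country": ["us", "de", "us"],
    "text": ["salmonella chicken", "listeria cheese", "salmonella eggs"],
})

preprocessor = ColumnTransformer(
    transformers=[
        ("text", TfidfVectorizer(), "text"),
        ("num", StandardScaler(), ["year", "month", "day"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),
    ],
    sparse_threshold=0,  # demo-only: always return a dense array
)

# 5 TF-IDF terms + 3 scaled numeric columns + 2 one-hot country columns
features_matrix = preprocessor.fit_transform(demo)
print(features_matrix.shape)

# A country not seen during fitting is encoded as all zeros instead of raising
unseen = pd.DataFrame({
    "year": [2022], "month": [3], "day": [9],
    "country": ["fr"], "text": ["listeria milk"],
})
unseen_matrix = preprocessor.transform(unseen)
print(unseen_matrix[0, -2:])
```

The output columns follow the order of the `transformers` list, so the last two columns here are the one-hot country indicators.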


# Train a Model for Each Target

- train_lgb_model_for_target: For each target, a fresh pipeline is trained on the feature columns plus the cleaned text.
- train_test_split: Splits train_df 80/20 with a fixed random_state. Only the 80% portion is used here; the held-out 20% is discarded rather than evaluated.
- Model Training: The pipeline is fitted on the 80% training portion of train_df.

In [11]:
# Train a model for each target
def train_lgb_model_for_target(target):
    text_pipeline = build_lgb_pipeline_text()
    
    # Split the data for training (use only the current target for y_train)
    X_train, _, y_train, _ = train_test_split(
        train_df[features + ['text']],  # Features
        train_df[target],  # Target for this specific task
        test_size=0.2, random_state=42
    )

    text_pipeline.fit(X_train, y_train)
    return text_pipeline
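
Since the held-out 20% is never used, one option is to score the model on it first and then refit on all labeled rows before predicting. The sketch below illustrates that pattern under several assumptions: `evaluate_then_refit` and `build_demo_pipeline` are hypothetical names, the toy texts and labels are invented, macro-averaged F1 is an assumed choice of metric, and `LogisticRegression` is swapped in for the LightGBM classifier purely to keep the example small and fast.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

def build_demo_pipeline():
    # Stand-in pipeline; the notebook would use build_lgb_pipeline_text() here
    return Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("classifier", LogisticRegression()),
    ])

def evaluate_then_refit(build_pipeline, X, y, test_size=0.2, random_state=42):
    # Hold out a validation split and score a model trained on the remainder
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    scored = build_pipeline()
    scored.fit(X_tr, y_tr)
    score = f1_score(y_val, scored.predict(X_val), average="macro")

    # Refit a fresh pipeline on every labeled row for the final predictions
    final = build_pipeline()
    final.fit(X, y)
    return final, score

# Invented toy data with two classes
texts = [
    "salmonella in chicken", "salmonella in eggs", "salmonella in poultry",
    "salmonella in turkey", "salmonella in beef", "peanut not declared",
    "milk not declared", "soy not declared", "egg not declared",
    "wheat not declared",
]
labels = ["biological"] * 5 + ["allergens"] * 5

final_model, val_score = evaluate_then_refit(build_demo_pipeline, texts, labels)
print(f"validation macro-F1: {val_score:.3f}")
```

The validation score gives a quick sanity check per target, while the final model benefits from the extra 20% of labeled data.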


# Make Predictions for Each Target

- make_predictions_for_target: This function uses the trained pipeline to make predictions on the test data.

In [12]:
# Make predictions on the test data
def make_predictions_for_target(pipeline, X_test):
    return pipeline.predict(X_test)


# Prepare Train and Test Data

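- train_X: Selects the feature columns and cleaned text from the training data. It is kept only for inspection, because train_lgb_model_for_target reads train_df directly.
- test_X: Selects the same columns from the test data (no cleaning).
- Training and Predictions: For each target, a model is trained and predictions are made on test_X.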

In [13]:
# Prepare train and test data
train_X = prepare_train_data('text')  # Cleaned train data
test_X = prepare_test_data('text')  # Test data (no cleaning)


# Train Models and Make Predictions for Each Target

In [None]:
# Initialize a DataFrame to store all predictions
predictions_df = pd.DataFrame()

# Train models and make predictions for each target
for target in all_targets:
    print(f"Training and predicting for {target}...")
    
    # Train a separate model for each target
    target_pipeline = train_lgb_model_for_target(target)
    
    # Make predictions for the test set
    predictions_df[target] = make_predictions_for_target(target_pipeline, test_X)


Training and predicting for hazard-category...


# Save Predictions and Create Zip Archive

In [None]:
# Save predictions to the submission folder
os.makedirs('./submission_v3', exist_ok=True)
predictions_df.to_csv('./submission_v3/submission.csv', index=False)

# Zip the folder for submission (creates ./submission_v3.zip)
make_archive('./submission_v3', 'zip', './submission_v3')

print("Predictions saved and submission_v3.zip created successfully.")
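
The save-and-zip step depends on the created folder, the CSV path, and the archive root all pointing at the same directory. A minimal sketch of that idea, using a single path variable throughout, is shown below. The helper name `save_and_zip` and the sample row are hypothetical, and the standard-library `csv` module stands in for `DataFrame.to_csv` so the example runs without pandas.

```python
import csv
import tempfile
import zipfile
from pathlib import Path
from shutil import make_archive

def save_and_zip(rows, header, out_dir):
    # Create the folder and write the CSV inside that same folder
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / "submission.csv"
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    # The archive is written next to the folder as <out_dir>.zip
    return make_archive(str(out_dir), "zip", root_dir=str(out_dir))

with tempfile.TemporaryDirectory() as tmp:
    zip_path = save_and_zip(
        rows=[["biological", "meat"]],  # invented example row
        header=["hazard-category", "product-category"],
        out_dir=Path(tmp) / "submission_v3",
    )
    # List the archive contents before the temporary folder is removed
    with zipfile.ZipFile(zip_path) as zf:
        archived = zf.namelist()

print(archived)
```

Deriving every path from one variable avoids the situation where the folder is created in one location and the file is written to another.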


In [None]:
predictions_df