Data Loading and Preprocessing
Loading the Dataset:

The dataset is read from a CSV file using pandas. This step loads the data into a pandas DataFrame, making it easy to manipulate and preprocess.
Handling Missing Values:

Missing values in the 'comment' column are filled with empty strings. This matters because the TF-IDF vectorizer used later expects every document to be a string and raises an error on NaN values. Filling with empty strings ensures that every row has valid text data.
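As a minimal sketch of this step, the snippet below fills missing comments in a small hypothetical DataFrame (the rows are invented for illustration, not taken from the dataset). It assigns the result back to the column, which avoids relying on `inplace=True` applied to a column selection.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
df = pd.DataFrame(
    {"comment": ["great post", None, "not helpful"], "label": [1, 0, 0]}
)

# Replace missing comments with empty strings and assign the result back.
df["comment"] = df["comment"].fillna("")

# Count how many missing values remain in the column.
n_missing = int(df["comment"].isna().sum())
print(n_missing)
```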
Reducing Dataset Size:

For memory efficiency and faster computation, a random sample of 10,000 records is taken from the dataset. Sampling is important when working with large datasets, especially during the development and experimentation phase, as it reduces the computational load and speeds up the process.
Encoding Target Labels:

The target labels are converted from categorical values to numeric values using LabelEncoder from scikit-learn. This is necessary because machine learning algorithms typically require numeric input. Encoding ensures that the target variable is in a suitable format for training the model.
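To illustrate what the encoding does, here is a small sketch with hypothetical string labels (the label names are assumptions; the real 'label' column may already be numeric). LabelEncoder assigns integers in sorted order of the class names and can map them back.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical labels, used only for illustration.
labels = ["sarcastic", "sincere", "sarcastic", "sincere"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)  # integers assigned in sorted class order

# The fitted encoder remembers the mapping and can reverse it.
classes = list(encoder.classes_)
decoded = list(encoder.inverse_transform(encoded))
print(classes, encoded.tolist())
```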
Text Vectorization using TF-IDF:

The text data in the 'comment' column is vectorized using the TF-IDF (Term Frequency-Inverse Document Frequency) technique. TF-IDF transforms the text into numerical features by weighting each word by how often it occurs in a document and how rare it is across the entire dataset. Setting max_features to 1000 keeps only the 1000 most frequent terms in the corpus, which balances capturing useful information against computational cost.
Train-Test Split:

The dataset is split into training and test sets with an 80/20 split using the train_test_split function from scikit-learn. This split allows the model to be trained on one portion of the data and evaluated on another, ensuring that the performance metrics reflect the model's ability to generalize to unseen data.
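A minimal sketch of the split on synthetic data is shown below. The `stratify=y` argument is an optional addition that is not used in this notebook; it keeps the class proportions equal in both subsets, which can help when the classes are imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 2 features, balanced binary labels.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# 80/20 split; stratify=y is an assumption added for illustration.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```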
Standardizing the Data:

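The features are scaled to unit variance using StandardScaler from scikit-learn. The scaler is created with with_mean=False, so each feature is divided by its standard deviation but not centered; zero entries in the TF-IDF matrix therefore stay zero. The scaler is fit on the training set only and then applied to the test set. This step is important for gradient-based optimization methods used in neural networks, as it helps in faster convergence and more stable training.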
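The code in this notebook creates the scaler with `with_mean=False`. The sketch below, using a small made-up matrix, shows the effect of that setting: each column ends up with unit variance, but the column means are not shifted to zero and zero entries remain zero.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up feature matrix with some zero entries.
X_train = np.array([[0.0, 1.0], [0.0, 3.0], [2.0, 5.0]])

# with_mean=False: divide by the standard deviation without centering.
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X_train)

col_std = X_scaled.std(axis=0)    # unit variance per column
col_mean = X_scaled.mean(axis=0)  # not zero, because nothing was subtracted
print(col_std, col_mean)
```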
Building and Optimizing the CNN Model
Defining the Hypermodel:

A hypermodel is defined to specify the architecture of the CNN model and includes tunable hyperparameters. This approach allows us to explore different configurations of the model to find the one that performs best.
Input Layer:

The input layer defines the shape of each input sample, which in this case is (number of features, 1). The first dimension is the number of TF-IDF features (1000), which Conv1D treats as the sequence length (steps); the trailing 1 is the number of channels. This layer ensures that the data fed into the model has the correct dimensions.
Conv1D Layer:

The initial Conv1D (1-dimensional convolutional) layer is added with a tunable number of filters and kernel size. Conv1D layers are effective for extracting local features from the input sequences. The filters and kernel size are chosen from a range of values to find the best configuration.
Flatten Layer:

A Flatten layer is added to convert the multi-dimensional output of the Conv1D layer into a 1-dimensional tensor. This prepares the data for the fully connected Dense layers.
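To get a feel for the size of the flattened tensor, the sketch below computes it by hand. It assumes the Conv1D defaults used in this notebook (stride 1 and 'valid' padding, i.e. no padding); the helper names are hypothetical.

```python
def conv1d_output_length(steps, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' (no) padding."""
    return (steps - kernel_size) // stride + 1


def flattened_size(steps, kernel_size, filters):
    """Number of values per sample after Conv1D followed by Flatten."""
    return conv1d_output_length(steps, kernel_size) * filters


# Smallest and largest configurations in the search space, for 1000 input steps.
smallest = flattened_size(steps=1000, kernel_size=3, filters=32)
largest = flattened_size(steps=1000, kernel_size=7, filters=128)
print(smallest, largest)
```

Because the first Dense layer connects to every flattened value, its weight count grows in proportion to this size, which is one reason the filter count has a large effect on training time.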
Dense Layers:

Between one and three Dense layers are added, each with a tunable number of units (32 to 128) and ReLU activation, followed by a Dropout layer with a tunable rate (0.2 to 0.5) and a BatchNormalization layer. The Dense layers add non-linearity to the model and help capture complex patterns in the data. Dropout helps prevent overfitting by randomly setting a fraction of its inputs to zero during training. Batch normalization is added to stabilize and accelerate training.
Output Layer:

A final Dense layer with a single unit and sigmoid activation is added. The sigmoid activation function outputs a probability between 0 and 1, making it suitable for binary classification tasks. This layer produces the final prediction of the model.
Compiling the Model:

The model is compiled using the Adam optimizer and binary cross-entropy loss. The Adam optimizer is an adaptive learning rate optimization algorithm that is well-suited for training deep learning models. Binary cross-entropy is used as the loss function because it is appropriate for binary classification problems. The model's performance is evaluated using accuracy as a metric.
Hyperparameter Tuning with Keras Tuner
Initializing the Tuner:

Keras Tuner's RandomSearch method is used to initialize the tuner. This method searches for the best hyperparameters by randomly sampling from the specified search space. It optimizes the model's validation accuracy over multiple trials.
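The idea behind random search can be sketched in plain Python. This is not Keras Tuner's implementation, only an illustration of independent random sampling from a search space that mirrors the ranges in build_cnn_model. As a simplification, it samples a single dense_units and dropout value rather than one per dense layer.

```python
import random

# Search space mirroring the ranges in build_cnn_model (simplified).
SEARCH_SPACE = {
    "filters": [32, 64, 96, 128],
    "kernel_size": [3, 5, 7],
    "num_dense_layers": [1, 2, 3],
    "dense_units": [32, 64, 96, 128],
    "dropout": [0.2, 0.3, 0.4, 0.5],
}


def sample_config(rng):
    """Draw one configuration by picking each hyperparameter independently."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


rng = random.Random(42)  # fixed seed so the sampled trials are reproducible
trials = [sample_config(rng) for _ in range(10)]
print(trials[0])
```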
Hyperparameter Search:

The tuner searches for the best hyperparameters by training the model with different configurations and evaluating their performance. It performs up to 10 trials, each executed 3 times to reduce the effect of run-to-run variance, and each execution trains for 10 epochs. A validation split of 20% is used to assess the model's performance on held-out data during the tuning process.
Retrieving Best Hyperparameters:

The best hyperparameters found by the tuner are retrieved. These optimal hyperparameters are then used to build the final model. This step ensures that the model configuration selected has the best chance of achieving high performance on the task.
Training the Optimized Model
Building the Optimized Model:

The model is built using the optimal hyperparameters obtained from the tuning process. This model configuration is expected to perform better than random or manually selected configurations.
Early Stopping:

Early stopping is configured to monitor the validation loss and stop training if the loss does not improve for 3 consecutive epochs. This prevents the model from overfitting to the training data by stopping the training process once it stops improving on the validation set. The best weights are restored at the end of training.
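The patience logic can be sketched without Keras. The function below is a simplified stand-in for the callback's behavior (it ignores options such as min_delta), and the validation losses are hypothetical values chosen for illustration.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; the best epoch is the one whose weights would be restored.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses: the minimum is at epoch index 1.
losses = [0.70, 0.68, 0.69, 0.71, 0.72, 0.66]
stop_epoch, best_epoch = early_stop_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```

In this example the later improvement (0.66) is never reached, which illustrates the trade-off of a small patience value.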
Learning Rate Scheduler:

A learning rate scheduler is defined to adjust the learning rate during training. For the first 10 epochs the learning rate is kept constant; from then on it is multiplied by exp(-0.1) (about 0.905) at every epoch, so it decays exponentially. Larger steps early in training speed up initial progress, while smaller steps later allow finer adjustments as the model approaches a minimum.
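The schedule can be traced in plain Python. The sketch below assumes a starting learning rate of 0.001 (the value that appears in the training log) and returns a plain float from the schedule function.

```python
import math


def scheduler(epoch, lr):
    """Keep lr constant for epochs 0-9, then decay it by exp(-0.1) per epoch."""
    if epoch < 10:
        return lr
    return float(lr * math.exp(-0.1))


lr = 0.001  # assumed starting learning rate
history = []
for epoch in range(15):
    lr = scheduler(epoch, lr)
    history.append(lr)

print(history[9], history[10], history[14])
```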
Reshaping Data for CNN Input:

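The data is reshaped to match the input shape required by the Conv1D layer, which is (batch_size, steps, channels). Here each sample becomes a sequence of 1000 steps with a single channel, so the arrays have shape (number of samples, 1000, 1). This step ensures that the input data has the correct dimensions for the Conv1D layers to process.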
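A minimal sketch of the reshape with NumPy, using random data in place of the TF-IDF matrix. Adding a trailing axis is equivalent to the reshape call in this notebook and leaves the values unchanged.

```python
import numpy as np

# Random stand-in for a (samples, features) TF-IDF matrix.
X = np.random.default_rng(0).random((4, 1000))

# Add a trailing channel axis: (samples, 1000) -> (samples, 1000, 1).
X_cnn = X[..., np.newaxis]

# The same result via an explicit reshape.
X_cnn_alt = X.reshape(X.shape[0], X.shape[1], 1)
print(X_cnn.shape)
```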
Model Training:

The model is trained on the training data for up to 50 epochs, with the last 20% of the training data held out for validation. With 8,000 training rows, this leaves 6,400 rows for fitting, which at a batch size of 32 gives 200 batches per epoch. The early stopping and learning rate scheduler callbacks are used during training to improve efficiency and prevent overfitting. A batch size of 32 is chosen as it is a common size that balances training speed and stability.
Evaluating the Model
Making Predictions:

Predictions are made on the test set using the trained model. The predicted probabilities are converted to binary values (0 or 1) using a threshold of 0.5. This step provides the model's final predictions for the test set.
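The thresholding step can be illustrated with a few hypothetical probabilities. Note that the comparison is strict, so a probability of exactly 0.5 maps to class 0.

```python
import numpy as np

# Hypothetical model outputs with shape (n_samples, 1), as produced by a sigmoid unit.
probs = np.array([[0.12], [0.50], [0.73], [0.91]])

# Strict threshold at 0.5, then flatten to a 1-D label vector.
preds = (probs > 0.5).astype("int32").ravel()
print(preds.tolist())
```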
Calculating the F1 Score:

The F1 score is calculated to evaluate the model's performance. The F1 score is the harmonic mean of precision and recall, and it provides a balanced measure of the model's accuracy, especially useful for imbalanced datasets. A high F1 score indicates that the model performs well in terms of both precision and recall.
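As a worked example on made-up labels, the sketch below computes precision, recall, and F1 with scikit-learn. With 3 true positives, 1 false positive, and 1 false negative, precision and recall are both 3/4, so their harmonic mean is also 3/4.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 3 true positives, 1 false negative, 1 false positive.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # 2 * P * R / (P + R)
print(precision, recall, f1)
```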
Making Predictions with New Input
Function for Preprocessing and Predicting New Input:

A function is defined to preprocess new text input and make predictions using the trained model. This function vectorizes the input text using the TF-IDF vectorizer, standardizes the vector, reshapes it to match the CNN input shape, and then feeds it into the model to obtain the prediction.
Example Usage for New Input:

An example is provided where a new comment is passed to the preprocessing and prediction function. The function returns the predicted class (0 or 1) for the new comment. This demonstrates how the model can be used in real-world applications to make predictions on new, unseen data.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Conv1D, Flatten, Dropout, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping, LearningRateScheduler
import keras_tuner as kt

# Load the dataset
data = pd.read_csv('kaggle_train.csv')

# Handle missing values in the 'comment' column
data['comment'] = data['comment'].fillna('')

# Reduce dataset size for memory efficiency (sample 10,000 records)
data = data.sample(n=10000, random_state=42)

# Encode target labels if necessary
label_column = 'label'
label_encoder = LabelEncoder()
data[label_column] = label_encoder.fit_transform(data[label_column])

# Text Vectorization using TF-IDF with fewer features
tfidf = TfidfVectorizer(max_features=1000)
X = tfidf.fit_transform(data['comment']).toarray()

# Split data into features and target
y = data[label_column]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the data
scaler = StandardScaler(with_mean=False)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define the hypermodel for CNN
def build_cnn_model(hp):
    model = Sequential()
    
    # Input layer
    model.add(Input(shape=(X_train.shape[1], 1)))
    
    # Conv1D layer
    model.add(Conv1D(filters=hp.Int('filters', min_value=32, max_value=128, step=32),
                     kernel_size=hp.Choice('kernel_size', values=[3, 5, 7]),
                     activation='relu'))
    
    # Flatten layer
    model.add(Flatten())
    
    # Dense layers with dropout and batch normalization
    for i in range(hp.Int('num_dense_layers', 1, 3)):
        model.add(Dense(units=hp.Int(f'dense_units_{i}', min_value=32, max_value=128, step=32), activation='relu'))
        model.add(Dropout(hp.Float(f'dropout_{i}', min_value=0.2, max_value=0.5, step=0.1)))
        model.add(BatchNormalization())
    
    # Output layer
    model.add(Dense(1, activation='sigmoid'))
    
    # Compile the model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    
    return model

# Initialize the tuner for CNN model
tuner_cnn = kt.RandomSearch(
    build_cnn_model,
    objective='val_accuracy',
    max_trials=10,
    executions_per_trial=3,
    directory='hyperband_cnn',
    project_name='cnn_optimization'
)

# Search for the best hyperparameters
# (add a channel axis so the data matches the model's (features, 1) input shape)
tuner_cnn.search(X_train[..., np.newaxis], y_train, epochs=10, validation_split=0.2)

# Get the optimal hyperparameters
best_hps_cnn = tuner_cnn.get_best_hyperparameters(num_trials=1)[0]

# Build the model with the optimal hyperparameters
model_cnn = build_cnn_model(best_hps_cnn)

# Define early stopping and learning rate scheduler
early_stopping_cnn = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

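def scheduler_cnn(epoch, lr):
    # Keep the learning rate constant for the first 10 epochs
    if epoch < 10:
        return lr
    # Then decay it by exp(-0.1) per epoch; return a plain float, not a tensor
    return float(lr * np.exp(-0.1))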

lr_scheduler_cnn = LearningRateScheduler(scheduler_cnn)

# Reshape the data for CNN input
X_train_cnn = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)
X_test_cnn = X_test.reshape(X_test.shape[0], X_test.shape[1], 1)

# Train the CNN model
history_cnn = model_cnn.fit(
    X_train_cnn, y_train,
    epochs=50,
    validation_split=0.2,
    batch_size=32,
    callbacks=[early_stopping_cnn, lr_scheduler_cnn]
)

# Predict on the test set for CNN model
y_pred_cnn = (model_cnn.predict(X_test_cnn) > 0.5).astype("int32")

# Calculate the F1 score for CNN model
f1_cnn = f1_score(y_test, y_pred_cnn)
print(f"CNN F1 Score: {f1_cnn}")

# Function to preprocess and predict new input for CNN model
def preprocess_and_predict_cnn(comment):
    # Preprocess the input comment
    input_vector = tfidf.transform([comment]).toarray()
    input_vector = scaler.transform(input_vector)
    input_vector = input_vector.reshape(input_vector.shape[0], input_vector.shape[1], 1)
    
    # Make prediction
    prediction = (model_cnn.predict(input_vector) > 0.5).astype("int32")
    return prediction

# Example usage for new input with CNN model
new_comment_cnn = "This is a sample comment for prediction."
prediction_cnn = preprocess_and_predict_cnn(new_comment_cnn)
print(f"Prediction for new comment with CNN model: {prediction_cnn[0][0]}")


Trial 10 Complete [00h 02m 21s]
val_accuracy: 0.6272916793823242

Best val_accuracy So Far: 0.6272916793823242
Total elapsed time: 01h 19m 51s
Epoch 1/50
200/200 ━━━━━━━━━━━━━━━━━━━━ 9s 22ms/step - accuracy: 0.5431 - loss: 0.9230 - val_accuracy: 0.6300 - val_loss: 0.6842 - learning_rate: 0.0010
Epoch 2/50
200/200 ━━━━━━━━━━━━━━━━━━━━ 4s 19ms/step - accuracy: 0.6544 - loss: 0.6056 - val_accuracy: 0.6156 - val_loss: 0.6807 - learning_rate: 0.0010
Epoch 3/50
200/200 ━━━━━━━━━━━━━━━━━━━━ 4s 18ms/step - accuracy: 0.7104 - loss: 0.5573 - val_accuracy: 0.6275 - val_loss: 0.6970 - learning_rate: 0.0010
Epoch 4/50
200/200 ━━━━━━━━━━━━━━━━━━━━ 4s 19ms/step - accuracy: 0.7432 - loss: 0.5171 - val_accuracy: 0.6062 - val_loss: 0.7481 - learning_rate: 0.0010
Epoch 5/50
200/200 ━━━━━━━━━━━━━━━━━━━━ 4s 19ms/step - accuracy: 0.7576 - l