# This notebook is both a learning exercise and contest entry
All changes are under the MIT License.
I am building upon DAIGT-related work, as the vast majority of other contestants have also leveraged it. I am starting with Wilmer E. Henao's notebook because it claims a high initial LB score, is recent, and appears well structured.

The first submission is Henao's baseline. Alterations will follow. I am initially using the T4 x2 accelerator.
The baseline run did not actually run. First attempt through all cells showed that catboost is not accepting the features. Instead of trying to fix it, I've removed it and added its weight to lgb. I believe the current code will finish execution and am running a new baseline.

1. The first run failed as the code did not make an actual submission and ended with no real work done.
2. The second attempt succeeded once the code block that prevented an actual submission from being computed was removed. Also, CatBoost caused errors and was commented out.
3. There were three runs with different weights for NB, SGD, and LGB, all three producing an LB of 0.956.
4. LGB alone produced an LB of 0.898
5. Seeing if I can get TabPFN to work and what LB score it produces. Unfortunately, TabPFN runs with only a single configuration on the test data and runs out of memory on the public data test run. Versions 6, 7, and 8 were attempts. The val_loss was also coming back at 0.5, suggesting little better than 50/50 odds. I am leaving the code commented out or converted to markdown.
6. NB scored 0.926 and SGD scored 0.940 on the LB.
7. Tried weights [0.34,0.38,0.28] but the LB was only 0.953.
8. Seeing if DenseNet can improve results. This was a journey trying to get it to run without exceptions or out-of-memory errors. I learned about sparse matrices and sparse tensors in TensorFlow. I also saw odd behavior in training and then realized that we have a leak: the vocabulary is loaded from the test data, which means training uses words that do not exist in the training essays. However, it still times out with both the 'SPE' and 'BPE' models, so I'm cutting the models in half.
9. The half set of models (one NB, one SGD, one LGB, and one DenseNet), also using only the training set's vocabulary, scored 0.893 with weights 0.07, 0.31, 0.31, and 0.31. Ironically, the new DenseNet alone scored 0.921. See v36 and v37 respectively.
10. Running full experiment, training vocab only, weights NB=0.07, SGD=0.31, LGB=0.31, and DN=0.31. (DN is training with an 80/20 split). This is v38.
11. Running full experiment, test vocab, weights NB=0.07, SGD=0.31, LGB=0.31, and DN=0.31. (DN is training with an 80/20 split). This is v39.
12. OK, from an LB perspective, it scores higher when trained with all words in the test set instead of just the words in the training set. I presume it can relate words unique to the test set to words in the training set to help improve accuracy / lower loss. 
13. Added CatBoost as a 5th model, and changed the weights to NB=0.07, SGD=0.24, LGB=0.23, DN=0.23, and CB=0.23. Will build as v40.
14. OutOfMemory (OOM) is a bother, let's see if we can make the first "BPE" pass without it. Also found a curious item, it looks like the y_train was loaded with the 'labels' column, but the 'labels' column was left in the train data (it is now dropped). This produced a LB score of 0.960, the highest score yet. 
15. Somewhat of a wild stab. Past evidence suggests LGB is the lowest scoring contributor, next come DN and NB, then we have SGD and CatBoost. They are entered as NB, SGD, LGB, DN, and CatBoost with weights [0.10, 0.35, 0.05, 0.10, 0.40]. This produced the highest LB score of 0.960.
16. Setting up the first attempt to determine the weights dynamically from the ROC AUC scores of the individual models, using the ratios of the area above the curve (1 - ROC AUC, the negative effects). A retry was needed: I reorganized things so the internal training set predictions are kept for only one model at a time and their ROC AUC is the only thing preserved for later use. We ran into OOM when trying to keep 5 extra predictions in memory (it's really close to its limits). The result was bad; it scored 0.946 on the LB.
17. I am going to also give Nelder-Mead optimization one try with the weights. 
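
The two weighting ideas in items 16 and 17 can be sketched as follows. This is a minimal illustration on synthetic, made-up predictions, not the code that produced the LB scores above; the helper names (`ratio_weights`, `blend`, `nelder_mead_weights`) are hypothetical. It assumes predictions from a held-out split are available. The notebook itself scored on training-set predictions, which may favor overfit models, though that is only a guess at why the dynamic weights scored lower.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import roc_auc_score


def ratio_weights(aucs):
    """Weight each model by min(1 - AUC) / (1 - AUC), so the best model gets 1.0."""
    deltas = 1.0 - np.asarray(aucs, dtype=float)
    return deltas.min() / deltas


def blend(weights, preds):
    """Weighted average of prediction columns (preds has shape n_samples x n_models)."""
    w = np.abs(weights)
    return preds @ (w / w.sum())


def nelder_mead_weights(y_true, preds):
    """Search blend weights that maximise ROC AUC, starting from equal weights."""
    n_models = preds.shape[1]
    start = np.full(n_models, 1.0 / n_models)
    result = minimize(lambda w: -roc_auc_score(y_true, blend(w, preds)),
                      start, method="Nelder-Mead",
                      options={"maxiter": 200, "xatol": 1e-3, "fatol": 1e-6})
    w = np.abs(result.x)
    return w / w.sum(), -result.fun


# Synthetic held-out predictions from three hypothetical models of varying noise.
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=400)
noise = rng.normal(size=(400, 3)) * np.array([0.4, 0.8, 1.6])
val_preds = y_val[:, None] + noise

print("ratio weights:", ratio_weights([0.99, 0.98, 0.95]))
best_w, best_auc = nelder_mead_weights(y_val, val_preds)
print("Nelder-Mead weights:", best_w, "AUC:", best_auc)
```

Because ROC AUC depends only on the ranking of the blended scores, the objective is piecewise constant, so Nelder-Mead may stall on plateaus; the result should be treated as a rough starting point rather than an optimum.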

# Comprehensive Analysis and Modeling Notebook

Welcome to this comprehensive notebook where we dive deep into the world of data analysis and machine learning. This document is meticulously crafted to guide you through various stages of data processing, modeling, and prediction. Here's what to expect:

## What This Notebook Offers:
1. **Data Preprocessing**: Initial steps to clean and prepare the data for analysis.
2. **Exploratory Data Analysis (EDA)**: Insights and patterns unraveled through visual and statistical methods.
3. **Feature Engineering**: Enhancing the dataset with new, informative features.
4. **Model Development**: Implementation of various machine learning models, including both traditional and advanced techniques.
5. **Evaluation and Optimization**: Assessing model performance and tuning them for better accuracy.
6. **Ensemble Techniques**: Leveraging the power of multiple models to improve predictions.
7. **Final Predictions and Submission**: Preparing the final predictions for submission, demonstrating the practical application of our analysis.

## Intended Audience:
This notebook is designed for both beginners and experienced practitioners in the field of data science. Whether you're looking to learn new skills, seeking to understand specific methodologies, or aiming to apply advanced techniques in machine learning, this notebook has something to offer.

## Feedback and Collaboration:
Your feedback is highly appreciated! If you have any suggestions, questions, or ideas for improvement, please feel free to share. Collaboration is the key to success in the ever-evolving field of data science, and your input is invaluable.

---

Let's embark on this data science journey together and uncover the stories hidden within the data!


In [None]:
import sys
import gc

import pandas as pd
from sklearn.model_selection import StratifiedKFold
import numpy as np
from sklearn.metrics import roc_auc_score
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
    SentencePieceBPETokenizer
)

from datasets import Dataset
from tqdm.auto import tqdm
from transformers import PreTrainedTokenizerFast

from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier

from tensorflow.keras.metrics import AUC
from sklearn.model_selection import train_test_split

In [None]:
from tensorflow.keras.layers import Concatenate, Dense, Dropout, LSTM, BatchNormalization, Activation, Input
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Conv1D, Conv2D, Flatten, Input, MaxPooling1D, MaxPooling2D, concatenate
import matplotlib.pyplot as plt
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from sklearn.base import BaseEstimator, ClassifierMixin

In [None]:
import tensorflow as tf

# Configure Strategy. Assume TPU...if not set default for GPU
tpu = None
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect(tpu="local") # "local" for 1VM TPU
    strategy = tf.distribute.TPUStrategy(tpu)
    print("on TPU")
    print("REPLICAS: ", strategy.num_replicas_in_sync)
except Exception:
    # No TPU available: fall back to the default strategy (CPU or GPU)
    strategy = tf.distribute.get_strategy()

# pip install catboost[gpu]

In [None]:
INITIAL_SEED = 4221

# For reproducibility, lock down the random seeds
import random
import os
import tensorflow as tf

def seed_everything(seed):
    random.seed(seed)
    np.random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    tf.random.set_seed(seed)
    
seed_everything(INITIAL_SEED) # best try to make runs reproducible - can also tweak this as it affects training splits

In [None]:
import sys
print(sys.version)

In [None]:
test = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/test_essays.csv')
sub = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/sample_submission.csv')
train = pd.read_csv("/kaggle/input/daigt-v2-train-dataset/train_v2_drcat_02.csv", sep=',')

In [None]:
print ("Test")
print(test.head(5))
print(test.describe())
print ("train")
print(train.head(5))
print(train.describe())

In [None]:
train = train.drop_duplicates(subset=['text'])
train.reset_index(drop=True, inplace=True)

In [None]:
LOWERCASE = False
VOCAB_SIZE = 64000

# Creating Byte-Pair Encoding Tokenizer

This cell initializes a Byte-Pair Encoding (BPE) tokenizer, a subword tokenization method that is effective for NLP tasks. The tokenizer is configured with the special tokens `[UNK]`, `[PAD]`, `[CLS]`, `[SEP]`, and `[MASK]`, and uses byte-level pre-tokenization. It is trained on the test essays, fed to the trainer in batches of 1,000, and then wrapped in `PreTrainedTokenizerFast` for efficient tokenization. Finally, it is applied to both the test and training text data.
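
To make the idea behind BPE concrete, here is a toy, standard-library-only sketch of how merges are learned and applied. It is an illustration of the general technique under simplifying assumptions (character-level start, whole words, a deterministic tie-break), not the implementation inside the `tokenizers` library; the function names are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair in the corpus."""
    vocab = Counter(tuple(word) for word in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Ties are broken by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(apply_bpe("lowest", merges))
```

The merges depend entirely on the corpus they are learned from, which is why training the tokenizer on the test essays rather than the training essays changes the resulting features.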


In [None]:
# Creating Byte-Pair Encoding tokenizer
raw_tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
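# Note: the conditional expression binds more loosely than `+`, so with LOWERCASE = False
# the whole list is empty and NFC normalization is not applied either.
raw_tokenizer.normalizer = normalizers.Sequence([normalizers.NFC()] + [normalizers.Lowercase()] if LOWERCASE else [])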
raw_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.BpeTrainer(vocab_size=VOCAB_SIZE, special_tokens=special_tokens)
dataset = Dataset.from_pandas(test[['text']])
def train_corp_iter(): 
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]
raw_tokenizer.train_from_iterator(train_corp_iter(), trainer=trainer)
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=raw_tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)
tokenized_texts_test = []

for text in tqdm(test['text'].tolist()):
    tokenized_texts_test.append(tokenizer.tokenize(text))

tokenized_texts_train = []

for text in tqdm(train['text'].tolist()):
    tokenized_texts_train.append(tokenizer.tokenize(text))

# TF-IDF Vectorization with Custom Tokenization

This cell implements the TF-IDF (Term Frequency-Inverse Document Frequency) vectorization process on the already-tokenized essays. The `TfidfVectorizer` is set up with a 3-5 n-gram range and sublinear term frequency scaling. Identity functions are passed as both `tokenizer` and `preprocessor` so that the BPE tokens are used as they are; because a custom `preprocessor` is supplied, scikit-learn skips its built-in preprocessing, so `lowercase` and `strip_accents` have no effect here. After fitting the vectorizer to the tokenized test data, we extract the vocabulary. This vocabulary is then used to initialize a new `TfidfVectorizer`, which transforms both the training and test datasets. Afterwards, the vectorizer is deleted to free up memory.
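
The two-step pattern described above (collect a vocabulary from one corpus, then vectorize with that vocabulary fixed) can be seen on a tiny example. This is a sketch with made-up token lists and a shorter n-gram range of 2-3 for readability; `identity` and `make_vectorizer` are hypothetical helper names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def identity(tokens):
    """Pass pre-tokenized documents through unchanged."""
    return tokens


def make_vectorizer(vocabulary=None):
    return TfidfVectorizer(ngram_range=(2, 3), lowercase=False, sublinear_tf=True,
                           analyzer="word", tokenizer=identity, preprocessor=identity,
                           token_pattern=None, vocabulary=vocabulary)


# Made-up, already-tokenized documents.
train_tokens = [["the", "cat", "sat", "down"], ["a", "dog", "sat", "down"]]
test_tokens = [["the", "cat", "sat", "up"]]

# Step 1: collect the n-gram vocabulary from the test documents only.
vocab = make_vectorizer().fit(test_tokens).vocabulary_

# Step 2: vectorize both sets with that vocabulary fixed.
vectorizer = make_vectorizer(vocabulary=vocab)
tf_train_demo = vectorizer.fit_transform(train_tokens)
tf_test_demo = vectorizer.transform(test_tokens)

print(sorted(vocab))
print(tf_train_demo.shape, tf_test_demo.shape)
```

Any training n-gram that never occurs in the test documents is dropped, so a training essay with no overlap ends up as an all-zero row. This keeps the feature matrix small, at the cost of making the features depend on the test set.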


In [None]:
def dummy(text):
    return text

vectorizer = TfidfVectorizer(ngram_range=(3, 5), lowercase=False, sublinear_tf=True, 
    analyzer = 'word',
    tokenizer = dummy,
    preprocessor = dummy,
    token_pattern = None, strip_accents='unicode')

vectorizer.fit(tokenized_texts_test)  # alternative tried: fit on tokenized_texts_train (training vocab only)

# Getting vocab
vocab = vectorizer.vocabulary_

print("Vocabulary length", len(vocab))

vectorizer = TfidfVectorizer(ngram_range=(3, 5), lowercase=False, sublinear_tf=True, vocabulary=vocab,
                            analyzer = 'word',
                            tokenizer = dummy,
                            preprocessor = dummy,
                            token_pattern = None, strip_accents='unicode'
                            )

tf_train = vectorizer.fit_transform(tokenized_texts_train)
tf_test = vectorizer.transform(tokenized_texts_test)

del vectorizer
gc.collect()

In [None]:
y_train = train['label'].values

In [None]:
y_train

In [None]:
train.drop('label', axis=1, inplace=True)

# Adding a DenseNet Neural Network (block like MLP)
With a plot for its output
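
The builder in this notebook starts with a 16-unit layer and then stacks five blocks of two 8-unit layers, concatenating each layer's output onto its input. The short sketch below only does the bookkeeping for that pattern, to show how the feature width grows; the function names are hypothetical and no Keras model is built.

```python
def densenet_widths(initial_units=16, growth=8, num_blocks=5, layers_per_block=2):
    """Return the feature width after the first layer and after every concatenation."""
    widths = [initial_units]
    for _ in range(num_blocks * layers_per_block):
        # Each Dense(growth) output is concatenated onto its own input.
        widths.append(widths[-1] + growth)
    return widths


def block_parameter_count(widths, growth=8):
    """Weights plus biases of the Dense(growth) layers inside the blocks."""
    return sum(growth * width + growth for width in widths[:-1])


widths = densenet_widths()
print(widths)
print(block_parameter_count(widths))
```

The blocks themselves are small; with a TF-IDF input that is many thousands of columns wide, most of the trainable weights sit in the first 16-unit layer.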

In [None]:
def plot_training_history(history, metrics):
    loss = history.history['loss']
    #val_loss = history.history['val_loss']

    epochs = range(len(loss))

    plt.figure(figsize=(12, 6))

    # Plot loss
    plt.subplot(1, 2, 1)
    plt.plot(epochs, loss, label='Training Loss', color="blue")
    #plt.plot(epochs, val_loss, label='Validation Loss', color="red")
    plt.title('Loss')
    plt.xlabel('Epochs')
    plt.legend()

    # No second panel is drawn: `metrics` is currently unused, so the 'Loss'
    # title and legend set above are left in place rather than overwritten.

    plt.tight_layout()
    plt.show()

In [None]:
USE_DN_ACTIVATION = 'relu'

def build_DenseNet(X_transformed, num_outputs):
    """
    Build a multi-block dense network with several hidden layers.

    Args:
        X_transformed: The full training set; only its shape (feature count) is used.
        num_outputs: Number of output units (1 for binary classification).

    Returns:
        The input and output tensors used to define a Keras Model. All hidden
        layers sit between an Input as wide as the training set and an output
        of size num_outputs.
    """    
    
    inputs = Input(shape=[X_transformed.shape[1]])# Initial fully connected layer
    #x = BatchNormalization()(inputs)
    x = Dense(16)(inputs)
    x = Activation(USE_DN_ACTIVATION)(x)
    x = Dropout(.2)(x)

    # Dense blocks
    num_blocks = 5  
    
    for _ in range(num_blocks):
        # Dense block
        for _ in range(2):  # Adjust the number of layers in each block as needed
            y = x
            #x = BatchNormalization()(x)
            x = Activation(USE_DN_ACTIVATION)(x)
            x = Dense(8)(x)#, kernel_initializer='he_normal')(x)
            x = Dropout(.2)(x)
            x = Concatenate()([y, x])

    # Final fully connected layers
    #x = BatchNormalization()(x)
    x = Activation(USE_DN_ACTIVATION)(x)
    x = Dense(24)(x) # 64)(x) #, kernel_initializer='he_normal')(x) # was 1800
    x = Dropout(.2)(x)
    x = Activation(USE_DN_ACTIVATION)(x)

    # Sigmoid output: probability of the positive class (the name 'regression_output' is historical)
    regression_output = Dense(num_outputs, activation='sigmoid', name='regression_output')(x)#, kernel_initializer='he_normal')(x)

    return inputs, regression_output

In [None]:
import pandas as pd
from scipy.sparse import csr_matrix

def csr_to_dataframe(csr_data):
    # Convert the csr_matrix to a dense format
    dense_data = csr_data.todense()

    # Create a DataFrame from the dense matrix
    df = pd.DataFrame(dense_data)

    return df

In [None]:
def numpy_to_csr(np_pred):
    # Convert the NumPy array of predictions to a sparse (CSR) matrix
    sparse_matrix = csr_matrix(np_pred)

    return sparse_matrix

In [None]:
DN_EPOCH_PLAN = 100

early_stopping = EarlyStopping(
    patience=6,
    min_delta=0.001,
    restore_best_weights=True,
)

mycount = 0

class DNKerasClassifierWrapper(BaseEstimator, ClassifierMixin):
    def __init__(self, keras_model):
        self.keras_model = keras_model

    def fit(self, scrX,y):
        #lr_callback = tf.keras.callbacks.LearningRateScheduler(lambda step: 0.001-(0.001*(step/(DN_EPOCH_PLAN+1))), verbose=0)
        lr_callback = tf.keras.callbacks.LearningRateScheduler(lambda step: 0.0001, verbose=0)
        global mycount

        split_index = int(0.2 * scrX.shape[0])

        X_train_now, X_val, y_train_now, y_val = train_test_split(scrX, 
                                                  y, 
                                                  test_size=0.20,
                                                  random_state=42+mycount)        
        
        X_train_now = self._convert_to_sparse_tensor(X_train_now)
        X_train_now = tf.sparse.reorder(X_train_now)
        X_val = self._convert_to_sparse_tensor(X_val)
        X_val = tf.sparse.reorder(X_val)
        
        mycount += 1
        
        with strategy.scope():
            history_1 = self.keras_model.fit(X_train_now, y_train_now, 
                                         epochs=80, 
                                         batch_size=1024,
                                         callbacks=[lr_callback,early_stopping],
                                         validation_data = (X_val, y_val),
                                         verbose=0)
        plot_training_history(history_1, metrics=[])
        _ = gc.collect()
        return self  # scikit-learn convention: fit returns the estimator
        
    def predict_proba(self, X_sparse):
        _ = gc.collect()
        
        batch_size = 1000
        print("Predicting for DN")
        
        n_rows = X_sparse.shape[0]
        prob_class_1 = []
        initialized = False
        for start in range(0, n_rows, batch_size):
            end = min(start + batch_size, n_rows)
            chunk = self._convert_to_sparse_tensor(X_sparse[start:end])  # Convert this slice to a sparse tensor
            chunk = tf.sparse.reorder(chunk)
            chunk_predictions = []
            with strategy.scope():
                chunk_predictions = self.keras_model.predict(chunk).ravel()
            if not initialized:
                initialized = True
                prob_class_1 = chunk_predictions
            else:
                prob_class_1 = np.concatenate([prob_class_1, chunk_predictions])
            _ = gc.collect()
            
        print("Prediction completed")

        prob_class_0 = 1 - prob_class_1
        _ = gc.collect()
        return np.vstack((prob_class_0, prob_class_1)).T
        
    def _convert_to_sparse_tensor(self, csr):
        csr_coo = csr.tocoo()
        # Build an (nnz, 2) array of [row, col] pairs; np.mat is not available in NumPy 2.x
        indices = np.column_stack((csr_coo.row, csr_coo.col))
        return tf.SparseTensor(indices, csr_coo.data, csr_coo.shape)
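
`tf.SparseTensor` is built from three pieces: an array of `[row, col]` index pairs, the matching values, and the dense shape, and `tf.sparse.reorder` puts the entries into row-major order. The sketch below reproduces that layout with SciPy and NumPy only, so the conversion can be inspected without TensorFlow; `csr_to_indices_values` is a hypothetical helper, not part of the wrapper above.

```python
import numpy as np
from scipy.sparse import csr_matrix


def csr_to_indices_values(csr):
    """Return (indices, values, shape) in row-major order for a SciPy CSR matrix."""
    coo = csr.tocoo()
    indices = np.column_stack((coo.row, coo.col)).astype(np.int64)
    # lexsort uses the last key as the primary key: sort by row, then by column.
    order = np.lexsort((indices[:, 1], indices[:, 0]))
    return indices[order], coo.data[order], coo.shape


dense = np.array([[0.0, 2.0, 0.0],
                  [1.0, 0.0, 3.0]])
indices, values, shape = csr_to_indices_values(csr_matrix(dense))

print(indices)
print(values)
print(shape)
```

Only the non-zero entries are stored, which is what keeps the wide TF-IDF matrix within memory limits; calling `todense()` on the full matrix would materialize every zero.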



In [None]:
# How many DenseNet models to build - Do not change without adding to weights.

TOTAL_MODELS = 1

# Ensemble Learning with Multiple Classifiers

This cell sets up an ensemble learning model using various classifiers, each with its specific configurations:

1. **Multinomial Naive Bayes**: 
   A `MultinomialNB` classifier with a smoothing parameter `alpha` set to 0.02.

2. **SGD Classifier**: 
   An `SGDClassifier` for linear models with a modified Huber loss function, a maximum of 8000 iterations, and a tolerance of 1e-4 for stopping criteria.

3. **LightGBM Classifier**: 
   An `LGBMClassifier` configured with custom parameters such as learning rate, lambda values, max depth, and more, specified in the `p6` dictionary.

4. **CatBoost Classifier**: 
   A `CatBoostClassifier` in silent mode (no verbose output) with a specific learning rate, L2 regularization, and subsampling. The first pass uses 3000 iterations and a subsampling rate of 0.35; the second pass uses 2000 iterations and a rate of 0.4.

5. **Ensemble Model - Weighted Average**: 
   The description inherited from the baseline refers to a soft-voting `VotingClassifier`. In this version each model (`MultinomialNB`, `SGDClassifier`, `LGBMClassifier`, the DenseNet, and `CatBoostClassifier`) is instead fitted in turn and its `predict_proba` output is combined in a hand-written weighted average, so that each model can be deleted before the next one is trained. The weights favor the non-Naive Bayes models.
   
6. **Added a DenseNet (a block-structured FNN)**:
   This addition was made by Thomas Gamet, as were the weight adjustments.

The ensemble is then trained on the transformed training data (`tf_train`) and labels (`y_train`). Finally, the ensemble is used to predict probabilities on the test dataset (`tf_test`), and garbage collection is run to manage memory.
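
The blending step amounts to a weighted average of each model's positive-class probabilities, which is the same quantity soft voting computes. A minimal sketch with made-up predictions follows; the weights are the ones used in the first pass of this notebook, and `weighted_average` is a hypothetical helper.

```python
import numpy as np


def weighted_average(predictions, weights):
    """Blend per-model probability vectors; predictions has shape (n_models, n_samples)."""
    preds = np.asarray(predictions, dtype=float)
    w = np.asarray(weights, dtype=float)
    if preds.shape[0] != w.shape[0]:
        raise ValueError("one weight is required per model")
    # Dividing by the sum keeps the result a probability even if weights do not sum to 1.
    return (w[:, None] * preds).sum(axis=0) / w.sum()


# Made-up probabilities for three essays from five models: NB, SGD, LGB, DN, CatBoost.
model_preds = [
    [0.90, 0.20, 0.60],
    [0.80, 0.10, 0.55],
    [0.70, 0.30, 0.40],
    [0.95, 0.05, 0.50],
    [0.85, 0.15, 0.45],
]
weights = [0.02, 0.20, 0.10, 0.24, 0.44]

blended = weighted_average(model_preds, weights)
print(blended)
```

Computing the average by hand, rather than through a `VotingClassifier`, means only the prediction vectors need to stay in memory once each model has been fitted.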


In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import auc, precision_recall_curve
from scipy.optimize import minimize
from sklearn.metrics import log_loss
import time
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

#def minobjective(weights, truthset, pred1, pred2, pred3, pred4, pred5):
#    weighted_sum = weights[0] * pred1 + weights[1] * pred2 + weights[2] * pred3 + weights[3] * pred4 + weights[4] * pred5
#    objectiveval = roc_auc_score(truthset, weighted_sum)
#    return -objectiveval

def calculate_voting_bpe(tf_train, tf_test, y_train):
    neg_deltas = [0.0, 0.0, 0.0, 0.0, 0.0]
    
    clf = MultinomialNB(alpha=0.02)
    sgd_model = SGDClassifier(max_iter=8000, tol=1e-4, loss="modified_huber") 
    p6={'n_iter': 3000,
        'verbose': -1,'objective': 'cross_entropy','metric': 'auc',
        'learning_rate': 0.00582, 'colsample_bytree': 0.78,
        'colsample_bynode': 0.8, 'lambda_l1': 4.56296,
        "device": "gpu",
        'gpu_device_id': 0,
        'lambda_l2': 2.97485, 'min_data_in_leaf': 115, 'max_depth': 23}
    lgb=LGBMClassifier(**p6)
    print("tf_train.shape=", tf_train.shape)
    # Fit classifiers and make predictions
    print("clf.fit() in progress")
    clf.fit(tf_train, y_train)
    predictions_mnb = clf.predict_proba(tf_test)[:, 1]
    tpredictions_mnb = clf.predict_proba(tf_train)[:, 1]
    neg_deltas[0] = 1 - roc_auc_score(y_train, tpredictions_mnb)
    del tpredictions_mnb
    del clf
    _ = gc.collect()

    print("sgd_model.fit() in progress")
    sgd_model.fit(tf_train, y_train)
    predictions_sgd = sgd_model.predict_proba(tf_test)[:, 1]
    tpredictions_sgd = sgd_model.predict_proba(tf_train)[:, 1]
    neg_deltas[1] = 1 - roc_auc_score(y_train, tpredictions_sgd)
    del tpredictions_sgd
    del sgd_model
    _ = gc.collect()
    
    print("lgb.fit() in progress")
    lgb.fit(tf_train, y_train)
    predictions_lgb = lgb.predict_proba(tf_test)[:, 1]
    tpredictions_lgb = lgb.predict_proba(tf_train)[:, 1]
    neg_deltas[2] = 1 - roc_auc_score(y_train, tpredictions_lgb)
    print('done with lightgbm')
    del tpredictions_lgb
    del lgb
    _ = gc.collect()
    
    #TOTAL_MODELS = 1

    dn_model_wrapper_list = []

    for model in range(TOTAL_MODELS):
        inputs, outputs = build_DenseNet(tf_train, 1)
        dnmodelx = Model(inputs=inputs, outputs=outputs)
        dnmodelx.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001), 
                        loss='binary_crossentropy', metrics=[])
        dnmodelx.summary()
        dn_keras_wrapper = DNKerasClassifierWrapper(dnmodelx)
        dn_model_wrapper_list.append(dn_keras_wrapper)
    
    print("dn_model_wrapper_list=",dn_model_wrapper_list)

    print("dn0.fit() in progress")
    dn_model_wrapper_list[0].fit(tf_train, y_train)
    predictions_dn0 = dn_model_wrapper_list[0].predict_proba(tf_test)[:, 1]
    tpredictions_dn0 = dn_model_wrapper_list[0].predict_proba(tf_train)[:, 1]
    neg_deltas[3] = 1 - roc_auc_score(y_train, tpredictions_dn0)
    del dn_model_wrapper_list # add to the last use of dn#
    del tpredictions_dn0
    _ = gc.collect()
    print('done with dn0')
    
    cat=CatBoostClassifier(iterations=3000, #task_type='GPU', bootstrap_type='Bernoulli', 
                       verbose=0,
                       l2_leaf_reg=6.65913,
                       learning_rate=0.005599,
                       subsample = 0.35,
                       allow_const_label=True,loss_function = 'CrossEntropy')
    
    try:
        cat.fit(tf_train, y_train)
        predictions_cat = cat.predict_proba(tf_test)[:, 1]
        tpredictions_cat = cat.predict_proba(tf_train)[:, 1]
        neg_deltas[4] = 1 - roc_auc_score(y_train, tpredictions_cat)
        print('done with catboost')
        # Free the training-set predictions here: they only exist when the fit succeeds
        del tpredictions_cat
        del cat
    except Exception:
        print("Skipping catboost in the dev run as it cannot train on test only vocab.")
        # Fall back to the DenseNet predictions so the weighted average below still works
        predictions_cat = predictions_dn0

    _ = gc.collect()
    
    #from sklearn.svm import SVC
    #
    #the_model = SVC(kernel='linear',
    #                    probability=True, 
    #                    #max_iter=8000,
    #                    verbose=0)
    #print("+++++++++++++ svm_model.fit() in progress")
    #the_model.fit(tf_train, y_train)
    #predictions_svc = the_model.predict_proba(tf_test)[:, 1]
    #print("+++++++++++++ wrappinng up the predictions")
    
    # Initial weights (equal weights for all five models)
    #initial_weights = [1/5, 1/5, 1/5, 1/5, 1/5]

    # Constraints: weights should sum to 1
    # constraints = [{'type': 'eq', 'fun': lambda w: sum(w) - 1}]
    # Nelder-Mead does not use the constraints (kept in case a differentiable option is tried)

    # Bounds: each weight should be between 0 and 1
    #bounds = [(0, 1)] * 5

    #print("Attempting to find an optimal set of weights")
    #starttime = time.time()
    #options = {'maxiter': 1000, 'maxfev': 5000}
    #result = minimize(minobjective, initial_weights, args=(y_train, 
    #                                                    tpredictions_mnb,
    #                                                    tpredictions_sgd,
    #                                                    tpredictions_lgb,
    #                                                    tpredictions_dn0,
    #                                                    tpredictions_cat),
    #                  method='Nelder-Mead', bounds=bounds, options=options) # constraints=constraints)

    # Get the optimized weights
    #weights = result.x
    #print("weights =",weights)
    #print("Took",time.time()-starttime,"seconds.")
    
    # With dynamically assigned weights the LB score dropped from 0.96 to 0.946
    #min_delta = min(neg_deltas)
    #print("neg_deltas=",neg_deltas)
    #weights = [1.0, 1.0, 1.0, 1.0, 1.0]
    #for i in range(len(weights)):
    #    weights[i] = min_delta/neg_deltas[i]
    #print ("Dynamically adjusted weights =", weights)
    
    # Define weights
    #weights = [0.07,0.24,0.23,0.23,0.23] # 0.893 with DN using 80% training data
    #weights = [0.07,0.31,0.62] # LB 0.956 with 2000 and 4000 trees
    #weights = [0.06,0.47,0.47] # LB 0.956 with 4000 trees
    #weights = [0.00,0.00,0.00,1.00]
    weights = [0.02,0.20,0.10,0.24,0.44]

    # Calculate weighted average of predictions
    final_preds = (weights[0] * predictions_mnb + weights[1] * predictions_sgd + weights[2] * predictions_lgb +
                   weights[3] * predictions_dn0 + weights[4] * predictions_cat) / sum(weights)
    #final_preds = (weights[0] * tpredictions_mnb + weights[1] * tpredictions_sgd + weights[2] * tpredictions_lgb +
    #               weights[3] * tpredictions_dn0 + weights[4] * tpredictions_cat) / sum(weights)
    #final_preds = predictions_dn0
    
    del predictions_mnb
    del predictions_sgd
    del predictions_lgb
    del predictions_dn0
    del predictions_cat
    #del predictions_svc
    
    #del tpredictions_mnb
    #del tpredictions_sgd
    #del tpredictions_lgb
    #del tpredictions_dn0
    #del tpredictions_cat
    
    # Garbage collection
    gc.collect()

    return final_preds

final_preds_bpe = calculate_voting_bpe(tf_train, tf_test, y_train)
_ = gc.collect()



from sklearn.metrics import roc_auc_score 

print(y_train.shape) 
print(final_preds_bpe.shape)

try: 
    roc_auc = roc_auc_score(y_train, final_preds_bpe) 
    print("ROC AUC Score (if with y_test then this is the baseline):", roc_auc) 
except Exception: 
    print("Trying to ignore a mismatch of test and training data on ROC AUC calculation (this can only use all training data)")

del tokenized_texts_test, tokenized_texts_train, dataset, raw_tokenizer, tokenizer
del trainer
del tf_train, tf_test, y_train
del vocab
_ = gc.collect()
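
The ROC AUC check above cannot succeed when the predictions are for the test essays, because the labels and predictions have different lengths. One way to get a local score is to hold out part of the training data. The sketch below shows the idea on synthetic count features with a single `MultinomialNB`; the data, the split size, and the variable names are assumptions for illustration, not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Synthetic count features; the label depends on the first two columns.
rng = np.random.default_rng(0)
X_demo = rng.poisson(1.0, size=(200, 20)).astype(float)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 2).astype(int)

# Keep 20% of the labelled rows aside; the model never sees them during fitting.
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

model = MultinomialNB(alpha=0.02).fit(X_fit, y_fit)
hold_preds = model.predict_proba(X_hold)[:, 1]

# Labels and predictions now have the same length, so the score is well defined.
hold_auc = roc_auc_score(y_hold, hold_preds)
print("Hold-out ROC AUC:", hold_auc)
```

A score computed this way is only a rough guide to the leaderboard, since the hidden test essays may differ from the training data.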

# The whole spe thing looks like little more than a second build of the models
## Let's comment out all of "SPE" and see how well just the "BPE" models do.

# Go as far as possible, make sure any references to the training and test data
# are truly returned to the heap and hope it is not fragmented beyond use.

del test
del sub
del train
_ = gc.collect()

# Before we can get started we need to reload the training data
It was deleted and garbage collected so that we really start off clean
to avoid out of memory errors.

test = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/test_essays.csv')
sub = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/sample_submission.csv')
train = pd.read_csv("/kaggle/input/daigt-v2-train-dataset/train_v2_drcat_02.csv", sep=',')

train = train.drop_duplicates(subset=['text'])
train.reset_index(drop=True, inplace=True)

We can now get serious about the second way to build encodings for training.

VOCAB_SIZE = 42000
# Creating Byte-Pair Encoding tokenizer
print("Working on the second BPE (it's not really SPE)")
raw_tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
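# Note: with LOWERCASE = False this expression evaluates to an empty list, so no normalizer (not even NFC) is applied.
raw_tokenizer.normalizer = normalizers.Sequence([normalizers.NFC()] + [normalizers.Lowercase()] if LOWERCASE else [])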
raw_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.BpeTrainer(vocab_size=VOCAB_SIZE, special_tokens=special_tokens)
dataset = Dataset.from_pandas(test[['text']])
def train_corp_iter(): 
    for i in range(0, len(dataset), 500):
        yield dataset[i : i + 500]["text"]
raw_tokenizer.train_from_iterator(train_corp_iter(), trainer=trainer)
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=raw_tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)
tokenized_texts_test = []

for text in tqdm(test['text'].tolist()):
    tokenized_texts_test.append(tokenizer.tokenize(text))

tokenized_texts_train = []

for text in tqdm(train['text'].tolist()):
    tokenized_texts_train.append(tokenizer.tokenize(text))
    
########################################
    
vectorizer = TfidfVectorizer(ngram_range=(3, 5), lowercase=False, sublinear_tf=True, analyzer = 'word',
    tokenizer = dummy,
    preprocessor = dummy,
    token_pattern = None, strip_accents='unicode')

vectorizer.fit(tokenized_texts_test)  # alternative tried: fit on tokenized_texts_train (training vocab only)

# Getting vocab
vocab = vectorizer.vocabulary_

print("Vocabulary length", len(vocab))

vectorizer = TfidfVectorizer(ngram_range=(3, 5), lowercase=False, sublinear_tf=True, vocabulary=vocab,
                            analyzer = 'word',
                            tokenizer = dummy,
                            preprocessor = dummy,
                            token_pattern = None, strip_accents='unicode'
                            )

print("vectorizer fit_transform in progress")
tf_train = vectorizer.fit_transform(tokenized_texts_train)
print("vectorizer transform in progress")
tf_test = vectorizer.transform(tokenized_texts_test)

del vectorizer
gc.collect()

y_train = train['label'].values



def calculate_voting(tf_train, tf_test, y_train):
    # Initialize classifiers
    clf = MultinomialNB(alpha = 0.02)
    sgd_model = SGDClassifier(max_iter=8000, tol=1e-4, loss="modified_huber") 
    p7={'n_iter': 2000,
        'verbose': -1,'objective': 'cross_entropy','metric': 'auc',
        'learning_rate': 0.0058, 'colsample_bytree': 0.78,
        'colsample_bynode': 0.8, 'lambda_l1': 4.563,
        "device": "gpu",
        'gpu_device_id': 0,
        'lambda_l2': 2.97, 'min_data_in_leaf': 112, 'max_depth': 21}
    lgb = LGBMClassifier(**p7)
    cat=CatBoostClassifier(iterations=2000,
                       verbose=0,
                       l2_leaf_reg=6.659,
                       learning_rate=0.0056,
                       subsample = 0.4,
                       allow_const_label=True,loss_function = 'CrossEntropy')
    
    print("tf_train.shape=", tf_train.shape)
    # Fit classifiers and make predictions
    print("clf.fit() in progress")
    clf.fit(tf_train, y_train)
    predictions_mnb = clf.predict_proba(tf_test)[:, 1]
    del clf
    _ = gc.collect()

    print("sgd_model.fit() in progress")
    sgd_model.fit(tf_train, y_train)
    predictions_sgd = sgd_model.predict_proba(tf_test)[:, 1]
    del sgd_model
    _ = gc.collect()
    
    print("lgb.fit() in progress")
    lgb.fit(tf_train, y_train)
    predictions_lgb = lgb.predict_proba(tf_test)[:, 1]
    print('done with lightgbm')
    del lgb
    _ = gc.collect()
    
    dn_model_wrapper_list = []

    for model in range(TOTAL_MODELS):
        inputs, outputs = build_DenseNet(tf_train, 1)
        dnmodelx = Model(inputs=inputs, outputs=outputs)
        dnmodelx.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), 
                          loss='binary_crossentropy', metrics=[])
        dnmodelx.summary()
        dn_keras_wrapper = DNKerasClassifierWrapper(dnmodelx)
        dn_model_wrapper_list.append(dn_keras_wrapper)
    
    print("dn_model_wrapper_list=",dn_model_wrapper_list)
    
    print("dn0.fit() in progress")
    dn_model_wrapper_list[0].fit(tf_train, y_train)
    predictions_dn0 = dn_model_wrapper_list[0].predict_proba(tf_test)[:, 1]
    del dn_model_wrapper_list
    _ = gc.collect()
    print('done with dn0')
    
    predictions_cat = []
    try:
        cat.fit(tf_train, y_train)
        predictions_cat = cat.predict_proba(tf_test)[:, 1]
        print('done with catboost')
        del cat
    except Exception:
        print("Skipping catboost in the dev run as it cannot train on test only vocab.")
        predictions_cat = predictions_dn0
    
    # Define weights
    weights = [0.07,0.24,0.23,0.23,0.23]
    #weights = [0.07,0.31,0.31,0.31]
    #weights = [0.07,0.31,0.62] # LB 0.956 with 2000 and 4000 trees
    #weights = [0.06,0.47,0.47] # LB 0.956 with 4000 trees
    #weights = [0.00,0.00,0.00,1.00]

    # Calculate weighted average of predictions
    #final_preds = (weights[0] * predictions_mnb + weights[1] * predictions_sgd + weights[2] * predictions_lgb + weights[3] * predictions_cat) / sum(weights)
    final_preds = (weights[0] * predictions_mnb + 
                   weights[1] * predictions_sgd + 
                   weights[2] * predictions_lgb + 
                   weights[3] * predictions_dn0 + 
                   weights[4] * predictions_cat) / sum(weights)
    
    del predictions_mnb
    del predictions_sgd
    del predictions_lgb
    del predictions_dn0
    del predictions_cat

    # Garbage collection
    gc.collect()

    return final_preds

final_preds_spe = calculate_voting(tf_train, tf_test, y_train)
_ = gc.collect()

# Final Submission and Closing Remarks

## Submission Preparation
In this final cell, we prepare our submission:

1. **Ensemble Prediction Averaging**:
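   The baseline combines the predictions from the Byte-Pair Encoding (BPE) pass and the second pass, labelled "SPE" (SentencePiece encoding) although it is also BPE-based, by averaging them. This approach helps in harnessing the strengths of both models.

   ```python
   sub['generated'] = (final_preds_bpe + final_preds_spe) / 2
   ```

   The cell below currently submits only the BPE predictions; the blended alternative is left commented out.
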

In [None]:
#sub['generated'] = 0.51 * np.array(final_preds_bpe) + 0.49 * final_preds_spe
sub['generated'] = np.array(final_preds_bpe)
sub.to_csv('submission.csv', index=False)
sub