Question 1: Transfer Learning with Hugging Face Models
Part (a): Implement Transfer Learning Using Hugging Face Models
For this task, you'll implement transfer learning with a pre-trained model: either emilyalsentzer/Bio_ClinicalBERT, loaded through the Hugging Face transformers library, or the Universal Sentence Encoder, loaded through TensorFlow Hub. Follow these steps:

Choose a Model:

- BERT-based models (Bio_ClinicalBERT): Bio_ClinicalBERT was further pre-trained on clinical notes, which makes it a good fit for classifying medical text.
- Universal Sentence Encoder (USE): A general-purpose model that maps each sentence to a fixed-length embedding, which can then feed downstream tasks such as classification or clustering.

Load and Implement the Model:

- Load the pre-trained model with the Hugging Face library (for Bio_ClinicalBERT) or TensorFlow Hub (for USE).
- Preprocess your data (tokenization for BERT, or text embedding for USE) and fine-tune the model.

Custom Class and Functional API:

As part of the task, you will implement the model using TensorFlow’s Functional API. It lets you wrap a pre-trained model in a custom layer class and wire that layer into a larger architecture.

Here is a basic example of how you might wrap Bio_ClinicalBERT in a custom layer and train a binary classifier on top of it:

In [None]:
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
# Step 1: Load the dataset into a pandas DataFrame
# (pd.read_csv already returns a DataFrame, so no extra conversion is needed)
df = pd.read_csv('St_Paul_hospital_train.csv')

In [None]:
# Step 2: Load the pre-trained Bio_ClinicalBERT tokenizer
model_name = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = BertTokenizer.from_pretrained(model_name)


In [None]:
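# Step 3: Tokenize the medical text data
def tokenize_texts(texts):
    """Tokenize a list of strings.

    padding=True pads every sequence to the longest one in this call;
    truncation=True with max_length=512 cuts longer texts to BERT's input limit.
    """
    return tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="tf")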

# Tokenize the medical texts
encodings = tokenize_texts(df['medical_text'].tolist())
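
To make the effect of padding, truncation, and the attention mask concrete, here is a toy sketch in plain Python. It is not the Hugging Face tokenizer: `pad_and_truncate` is a hypothetical helper, the token ids are made up, and unlike a real BERT tokenizer it does not re-append a closing special token after truncation.

```python
def pad_and_truncate(token_id_lists, max_length, pad_id=0):
    """Toy illustration: cap each sequence at max_length, then pad to the longest one."""
    truncated = [ids[:max_length] for ids in token_id_lists]
    target_len = max(len(ids) for ids in truncated)

    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Three made-up sequences of different lengths (assumed ids, for illustration only)
batch = pad_and_truncate(
    [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]],
    max_length=5,
)
```

Every row of `batch["input_ids"]` ends up with the same length, and `batch["attention_mask"]` records which positions are real tokens. These are the two tensors the dataset later passes to the model.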


In [None]:
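# Step 4: Extract the labels as a NumPy array
# (train_test_split indexes its inputs with integer arrays, which tf.Tensor does not
# support; tf.data converts the array to tensors later.)
# Assumes 'diagnosis' is already encoded as 0/1 for binary classification.
labels = df['diagnosis'].to_numpy()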


In [None]:
# Step 5: Split the data into training and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df['medical_text'], labels, test_size=0.2, random_state=42
)

# Tokenize the train and validation texts
train_encodings = tokenize_texts(train_texts.tolist())
val_encodings = tokenize_texts(val_texts.tolist())


In [None]:
# Step 6: Create TensorFlow Datasets
def create_tf_dataset(encodings, labels):
    """Create a TensorFlow dataset."""
    return tf.data.Dataset.from_tensor_slices((
        {"input_ids": encodings['input_ids'], "attention_mask": encodings['attention_mask']},
        labels
    ))


In [None]:
# Create train and validation datasets
train_dataset = create_tf_dataset(train_encodings, train_labels).shuffle(1000).batch(32)
val_dataset = create_tf_dataset(val_encodings, val_labels).batch(32)


In [None]:
# Step 7: Build the model using TensorFlow and Bio_ClinicalBERT
class BioClinicalBERTLayer(tf.keras.layers.Layer):
    def __init__(self, model):
        super(BioClinicalBERTLayer, self).__init__()
        self.bert = model

    def call(self, inputs):
        """Call function that passes the input through the BERT model."""
        outputs = self.bert(inputs)
        return outputs.last_hidden_state  # Return embeddings from the last layer
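
The layer returns one vector per token, so its output has shape (batch, sequence length, hidden size), and a classifier needs to reduce that to one vector per text. The toy sketch below uses nested Python lists in place of tensors to show two common reductions: taking the first ([CLS]) position, which this notebook uses, and a masked mean over real tokens, shown only as an alternative for comparison. Both helper functions are hypothetical names introduced for illustration.

```python
def cls_pool(hidden_states):
    """Keep the vector at sequence position 0 ([CLS]) for each example."""
    return [seq[0] for seq in hidden_states]


def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per example, ignoring padded positions (mask == 0)."""
    pooled = []
    for seq, mask in zip(hidden_states, attention_mask):
        n_real = sum(mask)
        dim = len(seq[0])
        pooled.append(
            [sum(vec[d] * m for vec, m in zip(seq, mask)) / n_real for d in range(dim)]
        )
    return pooled


# One example, three token positions, hidden size 2; the last position is padding
hidden = [[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]]
mask = [[1, 1, 0]]

cls_vec = cls_pool(hidden)
mean_vec = masked_mean_pool(hidden, mask)
```

On tensors, `cls_pool` corresponds to the slice `[:, 0, :]` applied to the layer's output.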


In [None]:
# Load the Bio_ClinicalBERT model
bert_model = TFBertModel.from_pretrained(model_name)

# Instantiate the BioClinicalBERT layer
bert_layer = BioClinicalBERTLayer(bert_model)
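
Whether the pre-trained weights are updated during training is controlled by the `trainable` attribute that every Keras layer and model exposes. The sketch below uses a small Dense layer as a hypothetical stand-in for the pre-trained encoder (it does not load Bio_ClinicalBERT) to show how freezing changes which weights training would update.

```python
import tensorflow as tf

# Hypothetical stand-in for a pre-trained encoder (NOT Bio_ClinicalBERT)
encoder = tf.keras.layers.Dense(8, name="pretrained_encoder")
head = tf.keras.layers.Dense(1, activation="sigmoid", name="classifier_head")

inputs = tf.keras.Input(shape=(4,))
outputs = head(encoder(inputs))
toy_model = tf.keras.Model(inputs, outputs)

# Kernel + bias for both layers are trainable by default
trainable_before = len(toy_model.trainable_weights)

# Freeze the encoder: its weights are excluded from gradient updates
encoder.trainable = False
trainable_after = len(toy_model.trainable_weights)
```

Under the same assumption, setting `bert_model.trainable = False` before compiling would train only the classification head (feature extraction), while leaving it at the default fine-tunes the whole network.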


In [None]:
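# Define one input per tokenizer output; the names match the dataset's dictionary keys.
# BERT consumes integer token ids, not raw strings.
input_ids = tf.keras.Input(shape=(None,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(None,), dtype=tf.int32, name="attention_mask")

# Pass both inputs through the Bio_ClinicalBERT layer
embedding = bert_layer({"input_ids": input_ids, "attention_mask": attention_mask})

# Use the [CLS] token (index 0 on the sequence axis) to represent the whole text
output = tf.keras.layers.Dense(1, activation='sigmoid')(embedding[:, 0, :])


In [None]:
# Define the model
model = tf.keras.Model(
    inputs={"input_ids": input_ids, "attention_mask": attention_mask},
    outputs=output,
)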


In [None]:
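# Step 8: Compile the model
# A small learning rate is used because Adam's default (1e-3) is generally too large
# for fine-tuning pre-trained transformer weights.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss='binary_crossentropy',
    metrics=['accuracy'],
)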
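
For intuition about the loss being minimized, binary cross-entropy can be computed by hand. The sketch below is a plain-Python illustration rather than Keras' implementation; the function name and the clipping constant `eps` are assumptions chosen for the example.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Same labels, opposite predictions
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
```

Confident correct predictions give a loss near 0.105, while confident wrong ones give about 2.303, which is why the sigmoid output is pushed toward the true label during training.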


In [None]:
# Step 9: Train the model
model.fit(train_dataset, validation_data=val_dataset, epochs=3)


In [None]:
# Step 10: Evaluate the model (optional, with test data)
# test_texts = [...]
# test_labels = [...]
# test_encodings = tokenize_texts(test_texts)
# test_labels = tf.convert_to_tensor(test_labels)
# The dataset must be batched, as the training and validation datasets are
# test_dataset = create_tf_dataset(test_encodings, test_labels).batch(32)
# model.evaluate(test_dataset)
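
The sigmoid output is a probability, so reporting accuracy implies a decision threshold. As a rough sketch of what that involves, the plain-Python example below turns probabilities into 0/1 predictions and scores them. The helper names, the example probabilities, and the 0.5 threshold are assumptions for illustration, not values produced by the model above.

```python
def to_labels(probs, threshold=0.5):
    """Predict 1 when the probability exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probs]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up probabilities and labels for four texts
probs = [0.92, 0.40, 0.55, 0.08]
y_true = [1, 1, 0, 0]

y_pred = to_labels(probs)
acc = accuracy(y_true, y_pred)
```

Here two of the four predictions match, so `acc` is 0.5. Moving the threshold trades false positives against false negatives, which may matter more than raw accuracy for diagnostic labels.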
