## 1. Setup and Data Loading

In [None]:
!pip install transformers torch pandas numpy scikit-learn


In [None]:
import pandas as pd
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, EncoderDecoderModel
from torch.optim import AdamW
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm

In [None]:
# Read the CSV file
df = pd.read_csv('/content/course_dataset.csv')

# Display sample data
print("Dataset shape:", df.shape)
df.head()

Dataset shape: (255, 3)


| | course_description | learning_outcomes | skill_sets |
|---|---|---|---|
| 0 | This course provides comprehensive insights in... | ['Understand the core principles of Introducti... | ['Report writing', 'Ethics', 'Presentation'] |
| 1 | Students will explore fundamental and advanced... | ['Demonstrate knowledge of Introduction to Pro... | ['AWS', 'Containers', 'Service models'] |
| 2 | The course introduces Introduction to Programm... | ['Gain practical experience with Introduction ... | ['Security policies', 'Risk assessment', 'Cryp... |
| 3 | Introduction to Programming is covered in-dept... | ['Identify challenges in Introduction to Progr... | ['Data modeling', 'Database design', 'SQL'] |
| 4 | This module covers critical aspects of Introdu... | ['Explore advanced concepts of Introduction to... | ['SQL', 'Normalization', 'Data modeling'] |


## 2. Data Preprocessing

In [None]:
import ast

df['learning_outcomes'] = df['learning_outcomes'].apply(ast.literal_eval)
df['skill_sets'] = df['skill_sets'].apply(ast.literal_eval)

# Create a combined target column (learning outcomes + skills)
df['target_text'] = df.apply(lambda x: "Learning Outcomes: " + "; ".join(x['learning_outcomes']) +
                    " | Skills: " + ", ".join(x['skill_sets']), axis=1)
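To make the target format concrete, here is a minimal standalone sketch of the same transformation applied to a single row. The helper name `build_target_text` and the learning-outcome strings are hypothetical and used only for illustration; the skill list is taken from the first sample row.

```python
import ast

def build_target_text(learning_outcomes_raw: str, skill_sets_raw: str) -> str:
    """Parse the stringified lists and join them into one target string."""
    # The CSV stores each list as its Python repr, so literal_eval recovers a real list.
    outcomes = ast.literal_eval(learning_outcomes_raw)
    skills = ast.literal_eval(skill_sets_raw)
    return "Learning Outcomes: " + "; ".join(outcomes) + " | Skills: " + ", ".join(skills)

# Hypothetical outcome strings; the skills match the first sample row.
example = build_target_text(
    "['Understand core principles', 'Apply them in practice']",
    "['Report writing', 'Ethics', 'Presentation']",
)
print(example)
# Learning Outcomes: Understand core principles; Apply them in practice | Skills: Report writing, Ethics, Presentation
```

The `;`, `|` and `,` separators are what the inference code later splits on, so they should not appear inside individual outcomes or skills.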

In [None]:
# Split data into training and validation sets
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

print(f"Training samples: {len(train_df)}")
print(f"Validation samples: {len(val_df)}")


Training samples: 204
Validation samples: 51


## 3. Model Setup - Using BERT for Sequence-to-Sequence Learning

In [None]:
# Load tokenizer and model
from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")

# Set the special token IDs that generation relies on
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id



Some weights of BertLMHeadModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['bert.encoder.layer.0.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.0.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.0.crossattention.output.dense.bias', 'bert.encoder.layer.0.crossattention.output.dense.weight', 'bert.encoder.layer.0.crossattention.self.key.bias', 'bert.encoder.layer.0.crossattention.self.key.weight', 'bert.encoder.layer.0.crossattention.self.query.bias', 'bert.encoder.layer.0.crossattention.self.query.weight', 'bert.encoder.layer.0.crossattention.self.value.bias', 'bert.encoder.layer.0.crossattention.self.value.weight', 'bert.encoder.layer.1.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.1.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.1.crossattention.output.dense.bias', 'bert.encoder.layer.1.crossattention.output.dense.weight', 'bert.encoder.layer.1.crossattention.self.key.bias', 'bert.e

In [None]:
# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

EncoderDecoderModel(
  (encoder): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, el

## 4. Data Preparation for Model

In [None]:
# Tokenize the data
max_input_length = 128
max_target_length = 128

def tokenize_data(df):
    inputs = tokenizer(df['course_description'].tolist(), padding='max_length',
                      truncation=True, max_length=max_input_length, return_tensors="pt")

    targets = tokenizer(df['target_text'].tolist(), padding='max_length',
                       truncation=True, max_length=max_target_length, return_tensors="pt")

    return inputs, targets

train_inputs, train_targets = tokenize_data(train_df)
val_inputs, val_targets = tokenize_data(val_df)

In [None]:
# Create PyTorch datasets
class CourseDataset(Dataset):
    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets

    def __len__(self):
        return len(self.inputs['input_ids'])

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.inputs.items()}
        item['labels'] = self.targets['input_ids'][idx]
        return item

train_dataset = CourseDataset(train_inputs, train_targets)
val_dataset = CourseDataset(val_inputs, val_targets)
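Because the targets are padded to `max_target_length`, the labels above include `[PAD]` token IDs, and the loss is also computed on those positions. A common variation, shown here only as a sketch, replaces padding positions with `-100`, which is the default `ignore_index` of `torch.nn.CrossEntropyLoss`. The helper name `mask_padding` and the example token IDs are hypothetical, and applying this change would alter the loss values reported later in the notebook.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_padding(label_ids, pad_token_id):
    """Replace padding token IDs so the loss skips those positions."""
    return [IGNORE_INDEX if token_id == pad_token_id else token_id
            for token_id in label_ids]

# Hypothetical label sequence: [CLS] ... [SEP] followed by padding (ID 0 for BERT).
labels = [101, 4083, 13105, 102, 0, 0, 0]
print(mask_padding(labels, pad_token_id=0))
# [101, 4083, 13105, 102, -100, -100, -100]
```

On tensors, the same effect could be obtained with `labels.masked_fill(labels == tokenizer.pad_token_id, -100)` inside `__getitem__`.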


In [None]:
batch_size = 8
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
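As a quick sanity check on the loaders, the number of batches per epoch follows from the dataset size and `batch_size`. The sketch below assumes the default `drop_last=False`, so a final partial batch is kept; `batches_per_epoch` is a hypothetical helper, not part of PyTorch.

```python
import math

def batches_per_epoch(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches a DataLoader yields for one pass over the data."""
    if drop_last:
        return num_samples // batch_size  # the partial batch is discarded
    return math.ceil(num_samples / batch_size)  # the partial batch is kept

print(batches_per_epoch(204, 8))  # 26 training batches
print(batches_per_epoch(51, 8))   # 7 validation batches
```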

## 5. Model Training

In [None]:
# Training setup
optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 10

In [None]:
# Training loop
for epoch in range(num_epochs):
    model.train()
    total_loss = 0

    for batch in tqdm(train_loader, desc=f"Epoch {epoch + 1}"):
        # Move batch to device
        batch = {k: v.to(device) for k, v in batch.items()}

        # Forward pass
        outputs = model(input_ids=batch['input_ids'],
                      attention_mask=batch['attention_mask'],
                      labels=batch['labels'])

        loss = outputs.loss
        total_loss += loss.item()

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    avg_train_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch + 1} - Training loss: {avg_train_loss:.4f}")

    # Validation
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for batch in tqdm(val_loader, desc="Validating"):
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(input_ids=batch['input_ids'],
                            attention_mask=batch['attention_mask'],
                            labels=batch['labels'])
            val_loss += outputs.loss.item()

    avg_val_loss = val_loss / len(val_loader)
    print(f"Epoch {epoch + 1} - Validation loss: {avg_val_loss:.4f}")

Epoch 1 - Training loss: 6.0979
Epoch 1 - Validation loss: 3.1264
Epoch 2 - Training loss: 2.6587
Epoch 2 - Validation loss: 2.3685
Epoch 3 - Training loss: 2.2515
Epoch 3 - Validation loss: 2.1748
Epoch 4 - Training loss: 2.0911
Epoch 4 - Validation loss: 2.3983
Epoch 5 - Training loss: 2.0548
Epoch 5 - Validation loss: 1.9403
Epoch 6 - Training loss: 1.8293
Epoch 6 - Validation loss: 1.7332
Epoch 7 - Training loss: 1.7019
Epoch 7 - Validation loss: 1.6939
Epoch 8 - Training loss: 1.5704
Epoch 8 - Validation loss: 1.4850
Epoch 9 - Training loss: 1.5197
Epoch 9 - Validation loss: 1.3780
Epoch 10 - Training loss: 1.4697
Epoch 10 - Validation loss: 1.4526
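The validation loss is lowest at epoch 9 and rises slightly at epoch 10, while the loop above always keeps the final weights. The sketch below shows one way to pick the best epoch and to express a simple patience-based stopping rule, using the validation losses logged above. The helpers `best_epoch` and `should_stop` are hypothetical and are not part of the original training loop.

```python
# Validation losses copied from the training log, epochs 1 to 10.
val_losses = [3.1264, 2.3685, 2.1748, 2.3983, 1.9403,
              1.7332, 1.6939, 1.4850, 1.3780, 1.4526]

def best_epoch(losses):
    """Return (1-based epoch, loss) of the lowest validation loss."""
    best_index = min(range(len(losses)), key=losses.__getitem__)
    return best_index + 1, losses[best_index]

def should_stop(losses, patience):
    """True if none of the last `patience` epochs beat the best earlier loss."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    return all(loss >= best_before for loss in losses[-patience:])

print(best_epoch(val_losses))               # (9, 1.378)
print(should_stop(val_losses, patience=1))  # True: epoch 10 did not improve on epoch 9
print(should_stop(val_losses, patience=2))  # False: epoch 9 improved on epoch 8
```

In a training loop, one would typically call `model.save_pretrained(...)` whenever a new best validation loss is reached, so that the saved checkpoint corresponds to the best epoch rather than the last one.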





## 6. Save the Model

In [None]:
# Save the trained model
model.save_pretrained("course_outcomes_generator")
tokenizer.save_pretrained("course_outcomes_generator")

('course_outcomes_generator/tokenizer_config.json',
 'course_outcomes_generator/special_tokens_map.json',
 'course_outcomes_generator/vocab.txt',
 'course_outcomes_generator/added_tokens.json')

## 7. Inference Example

In [None]:
# Load the saved model for inference
model = EncoderDecoderModel.from_pretrained("course_outcomes_generator").to(device)

In [None]:
import re

def generate_outcomes(course_description):
    # Tokenize input
    inputs = tokenizer(course_description, return_tensors="pt",
                      max_length=max_input_length, truncation=True, padding='max_length')
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Generate output
    outputs = model.generate(input_ids=inputs['input_ids'],
                             attention_mask=inputs['attention_mask'],
                             max_length=max_target_length)

    # Decode output
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Split into learning outcomes and skills. The uncased tokenizer lowercases
    # text and puts spaces around punctuation, so the labels are matched
    # case-insensitively with optional whitespace before the colon.
    if "|" in generated_text:
        outcomes_part, skills_part = generated_text.split("|", 1)
        outcomes_part = re.sub(r"^\s*learning outcomes\s*:", "", outcomes_part, flags=re.IGNORECASE)
        skills_part = re.sub(r"^\s*skills\s*:", "", skills_part, flags=re.IGNORECASE)
        outcomes = [o.strip() for o in outcomes_part.split(";") if o.strip()]
        skills = [s.strip() for s in skills_part.split(",") if s.strip()]
    else:
        outcomes = [generated_text]
        skills = []

    return outcomes, skills

In [None]:
!pip install transformers torch sentencepiece



In [None]:

from transformers import pipeline

# Baseline: off-the-shelf GPT-2, a general-purpose language model (not tuned for educational text)
generator = pipeline('text-generation', model='gpt2')

def generate_educational_outcomes(description):
    prompt = f"Generate learning outcomes for a course about {description}. Outcomes:\n1."
    results = generator(prompt, max_length=150, num_return_sequences=1)
    return results[0]['generated_text']

print(generate_educational_outcomes("web development with React and Node.js"))



Generate learning outcomes for a course about web development with React and Node.js. Outcomes:
1. Complete a course about writing React applications on top of React
2. Learn JavaScript and build functional web apps using JavaScript. Outcomes:
1. Earn 1,500 ESX Credits for completing React
2. Learn building applications with Node
3. Learn React-centric web frameworks like Ember and Cake
4. Learn building web applications with jQuery
5. Learn building web apps with WebRPC
And we've even created two free Courses to learn React and Node: React: Building a React Application on top of Node, Node.js and Angular on top of React: Building a React Application using JavaScript and Node on


In [None]:

from transformers import pipeline

# Second run of the same off-the-shelf GPT-2 baseline; the generated text differs between runs
generator = pipeline('text-generation', model='gpt2')

def generate_educational_outcomes(description):
    prompt = f"Generate learning outcomes for a course about {description}. Outcomes:\n1."
    results = generator(prompt, max_length=150, num_return_sequences=1)
    return results[0]['generated_text']

print(generate_educational_outcomes("web development with React and Node.js"))



Generate learning outcomes for a course about web development with React and Node.js. Outcomes:
1. Create a component that renders data.
2. Create and use template templates to render various components.
3. Implement components with and for components.
4. Create templates where we can use jQuery (to render some page), Webpack (to integrate with other browsers), and React (to create responsive apps for mobile devices).
5. Test component creation and use our React API and build our App, including creating an app.
6. Apply tests on our web page to test our API and create our component.
Here's a summary of the test lifecycle.
How much of this component is a JSX library
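The raw GPT-2 output repeats the prompt and drifts into unrelated text after the list. The sketch below keeps only the first consecutively numbered list that follows the prompt. The helper `extract_numbered_items` is hypothetical, and the sample text is an abridged excerpt of the output above (items 4 to 6 omitted).

```python
import re

def extract_numbered_items(generated_text: str, prompt: str) -> list[str]:
    """Return the items of the first numbered list that follows the prompt."""
    # The pipeline echoes the prompt; drop it but keep its trailing "1." marker.
    prefix = prompt.removesuffix("1.")
    body = generated_text.removeprefix(prefix)
    items = []
    for line in body.splitlines():
        match = re.match(r"\s*(\d+)\.\s*(.+)", line)
        # Stop at the first line that is not the next item in sequence.
        if match is None or int(match.group(1)) != len(items) + 1:
            break
        items.append(match.group(2).strip())
    return items

prompt = ("Generate learning outcomes for a course about "
          "web development with React and Node.js. Outcomes:\n1.")
# Abridged excerpt of the sample output shown above.
sample = prompt + (
    " Create a component that renders data.\n"
    "2. Create and use template templates to render various components.\n"
    "3. Implement components with and for components.\n"
    "Here's a summary of the test lifecycle."
)
for item in extract_numbered_items(sample, prompt):
    print(item)
```

Stopping when the numbering restarts also handles outputs such as the first run, where the model begins a second `Outcomes:` list partway through.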
