![Workshop Banner](https://github.com/CLDiego/SPE_GeoHackathon_2025/blob/main/assets/S1_M1.png?raw=1)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CLDiego/SPE_GeoHackathon_2025/blob/main/S1_M1_LLM_HF.ipynb)

***
# Session 01 // Module 01: Large Language Models (LLMs) with HuggingFace

In this module, we'll explore the fundamentals of Large Language Models (LLMs) using HuggingFace Transformers. We'll cover tokens, embeddings, context windows, and hands-on text generation with a focus on geoscience applications.

## Learning Objectives
- Understand tokens, subword tokenization, embeddings, and context windows
- Load and use a small HuggingFace model for inference
- Generate text with controlled decoding parameters
- Visualize and interpret embeddings for geoscience texts
- Apply LLMs to simple geoscience definition tasks

## What you’ll build
- A minimal pipeline to tokenize text, produce embeddings, and visualize them
- A lightweight text generation setup using a small causal LM
- Exercises to craft prompts and compare decoding strategies



In [1]:
# Download utils from GitHub
!wget -q --show-progress https://raw.githubusercontent.com/CLDiego/SPE_GeoHackathon_2025/refs/heads/dev/spe_utils.txt -O spe_utils.txt
!wget -q --show-progress -x -nH --cut-dirs=5 -i spe_utils.txt



In [2]:
# Hugging Face API token
# Retrieving the token is required to get access to HF hub
from google.colab import userdata
hf_token = userdata.get('HF_TOKEN')

In [5]:
# Import the shared workshop data: terms, tokenization examples, geophysics texts, and their categories
from spe_utils.data import (
    GEOSCIENCE_TERMS,
    TOKENIZATION_EXAMPLES,
    GEOPHYSICS_TEXTS,
    GEOPHYSICS_CATEGORIES,
)

# 1. Understanding Tokens

> <img src="https://raw.githubusercontent.com/CLDiego/uom_fse_dl_workshop/main/figs/icons/write.svg" width="20"/> **Definition**: **Tokens** are the basic units LLMs process. Modern tokenizers use subword schemes (WordPiece, BPE) to split words into frequently occurring chunks.

A token can be:
- a whole word (e.g., "seismic")
- a subword (e.g., "geo", "##physical")
- punctuation or special symbols

Why subwords?
- Handle rare words and morphology better
- Keep the vocabulary compact while covering most text

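Illustrative example (WordPiece-style, where `##` marks a continuation piece; actual splits depend on the tokenizer's vocabulary):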

```text
reservoir → ["reservoir"]
microfracture → ["micro", "##fract", "##ure"]
wellbore-stability → ["well", "##bore", "-", "stability"]
```

> <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/reminder.svg" width="20"/> **Tip**: Different models use different tokenizers (BERT uses WordPiece; GPT-2 uses BPE).
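To make the subword idea concrete, here is a minimal sketch of greedy longest-match-first splitting, the strategy WordPiece-style tokenizers use at inference time. The vocabulary `toy_vocab` and the helper `wordpiece_tokenize` are hypothetical and exist only for this illustration; a real tokenizer ships a vocabulary of tens of thousands of entries and handles casing, punctuation, and normalization as well.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split (WordPiece-style sketch)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not taken from any real model
toy_vocab = {"reservoir", "micro", "##fract", "##ure", "geo", "##physical"}

for word in ["reservoir", "microfracture", "geophysical", "shale"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

Words that are in the vocabulary stay whole, while unseen words are assembled from smaller known pieces, which is why the vocabulary can stay compact.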

In [7]:
from transformers import BertTokenizer
from spe_utils.visualisation import bert_tokenize_and_color

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [8]:
for text in TOKENIZATION_EXAMPLES:
    bert_tokenize_and_color(text, tokenizer)

Original text: Seismic inversion is a geophysical technique.
Number of tokens: 8


Tokens: ['seismic', 'inversion', 'is', 'a', 'geo', '##physical', 'technique', '.']
--------------------------------------------------------------------------------
Original text: Hydrocarbon exploration uses seismic surveys.
Number of tokens: 7


Tokens: ['hydro', '##carbon', 'exploration', 'uses', 'seismic', 'surveys', '.']
--------------------------------------------------------------------------------
Original text: Reservoir characterization involves petrophysical analysis.
Number of tokens: 10


Tokens: ['reservoir', 'characterization', 'involves', 'pet', '##rop', '##hy', '##sic', '##al', 'analysis', '.']
--------------------------------------------------------------------------------
Original text: What is the porosity and permeability of this formation?
Number of tokens: 13


Tokens: ['what', 'is', 'the', 'por', '##osity', 'and', 'per', '##me', '##ability', 'of', 'this', 'formation', '?']
--------------------------------------------------------------------------------


In [9]:
# Display sample vocabulary, special tokens, and token mapping

# Sample vocab (first 20 keys)
vocab = tokenizer.get_vocab()
print("Sample vocabulary (first 20):", list(vocab.keys())[:20])

# Special tokens
print("Special tokens:", tokenizer.special_tokens_map)

# Mapping for the first tokenization example
sample_text = TOKENIZATION_EXAMPLES[0]
tokens = tokenizer.tokenize(sample_text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"\nSample text: {sample_text}")
print(f"Tokens: {tokens}")
print(f"Token IDs: {token_ids}")

# Full encoding for the first example
encoded = tokenizer(sample_text, return_tensors='pt')
print(f"\nFull encoding (input_ids): {encoded['input_ids']}")
print(f"Attention mask: {encoded['attention_mask']}")

decoded = tokenizer.decode(token_ids)
print(f"Decoded tokens: {decoded}")

Sample vocabulary (first 20): ['[PAD]', '[unused0]', '[unused1]', '[unused2]', '[unused3]', '[unused4]', '[unused5]', '[unused6]', '[unused7]', '[unused8]', '[unused9]', '[unused10]', '[unused11]', '[unused12]', '[unused13]', '[unused14]', '[unused15]', '[unused16]', '[unused17]', '[unused18]']
Special tokens: {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}

Sample text: Seismic inversion is a geophysical technique.
Tokens: ['seismic', 'inversion', 'is', 'a', 'geo', '##physical', 'technique', '.']
Token IDs: [22630, 28527, 2003, 1037, 20248, 23302, 6028, 1012]

Full encoding (input_ids): tensor([[  101, 22630, 28527,  2003,  1037, 20248, 23302,  6028,  1012,   102]])
Attention mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
Decoded tokens: seismic inversion is a geophysical technique.
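The output above shows 8 token IDs from `convert_tokens_to_ids` but 10 IDs in the full encoding. The difference is the two special tokens that the tokenizer call adds around the sequence. The short sketch below reproduces that relationship using only the numbers printed above; `CLS_ID` and `SEP_ID` are illustrative names for the `bert-base-uncased` IDs of `[CLS]` and `[SEP]`.

```python
# IDs copied from the printed output above
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in bert-base-uncased
token_ids = [22630, 28527, 2003, 1037, 20248, 23302, 6028, 1012]

# The full encoding wraps the subword IDs with the two special tokens
full_encoding = [CLS_ID] + token_ids + [SEP_ID]

print(full_encoding)
print(f"{len(token_ids)} subword IDs -> {len(full_encoding)} IDs with special tokens")
```

Keep these extra positions in mind when budgeting tokens: they count toward the model's maximum sequence length.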


# 2. Understanding Embeddings

> <img src="https://raw.githubusercontent.com/CLDiego/uom_fse_dl_workshop/main/figs/icons/write.svg" width="20"/> **Definition**: **Embeddings** map tokens or sentences to dense vectors that capture semantic relationships. Nearby vectors are semantically similar.

Key concepts:
- Static vs contextual:
  - Static (e.g., GloVe) assigns one vector per word
  - Contextual (e.g., BERT, MiniLM) depends on surrounding words
- Dimensions: Common sizes are 384, 512, 768, 1024+

> <img src="https://raw.githubusercontent.com/CLDiego/uom_fse_dl_workshop/main/figs/icons/reminder.svg" width="20"/> **Evaluation tip**: Use cosine similarity to compare embeddings. Values closer to 1.0 indicate higher similarity.
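Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction and ignores magnitude. The following is a minimal pure-Python sketch; the helper name `cosine_sim` and the toy vectors are made up for illustration, and library implementations are preferable for real embedding matrices.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors (real embeddings have hundreds of dimensions)
same_direction = cosine_sim([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_sim([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_sim([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Scaling a vector does not change the score, which is why cosine similarity is a common choice for comparing embeddings of texts with different lengths.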

In [10]:
from transformers import AutoModel, AutoTokenizer
import torch

# Load a small model for embeddings
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)


In [11]:
def get_embeddings(texts, tokenizer, model):
    """Get sentence embeddings"""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512)

    with torch.no_grad():
        outputs = model(**inputs)
        # Use CLS token embedding (first token) for sentence representation
        embeddings = outputs.last_hidden_state[:, 0, :]

    return embeddings.numpy()

# Get embeddings for the imported geoscience terms
embeddings = get_embeddings(GEOSCIENCE_TERMS, tokenizer, model)

print(f"Embedding shape: {embeddings.shape}")
print(f"Each term is represented by {embeddings.shape[1]} numbers")
print(f"\nFirst 10 embedding values for '{GEOSCIENCE_TERMS[0]}':")
print(embeddings[0][:10])

Embedding shape: (6, 384)
Each term is represented by 384 numbers

First 10 embedding values for 'seismic inversion':
[ 0.04884825 -0.23083082  0.6350067   0.37743932 -0.08776819 -0.7972005
 -0.56521857 -0.129537   -0.17343435  0.01015301]
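The cell above uses the first-token (`[CLS]`) vector as the sentence embedding. Sentence-transformers models such as `all-MiniLM-L6-v2` are commonly paired with mean pooling instead, where the token vectors are averaged and padding positions are skipped via the attention mask. The sketch below shows that pooling step on plain Python lists so the arithmetic is visible; `mean_pool` and the toy values are hypothetical, and in practice the same operation would be applied to the model's `last_hidden_state` tensor.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention_mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy example: 4 positions with 2-dimensional vectors, last position is padding
token_vectors = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

cls_vector = token_vectors[0]                             # CLS pooling: first position only
mean_vector = mean_pool(token_vectors, attention_mask)    # mean pooling: all real tokens

print("CLS pooling: ", cls_vector)
print("Mean pooling:", mean_vector)
```

Comparing both pooling choices on the similarity analysis later in this section is a worthwhile experiment; the numbers reported in this notebook were produced with CLS pooling.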


In [12]:
print(f"Total number of geophysics texts: {len(GEOPHYSICS_TEXTS)}")
print("Sample texts:")
for i, text in enumerate(GEOPHYSICS_TEXTS[:5]):
    print(f"{i+1}. {text}")

Total number of geophysics texts: 56
Sample texts:
1. Seismic inversion transforms seismic reflection data into quantitative subsurface rock properties.
2. P-wave velocity depends on rock density and bulk modulus in elastic media.
3. S-wave velocity is controlled by shear modulus and density of the formation.
4. Seismic amplitude variation with offset reveals fluid content and lithology changes.
5. Pre-stack seismic inversion simultaneously estimates multiple elastic parameters from angle stacks.


In [13]:
import torch
from sklearn.manifold import TSNE

# Encode all geophysics sentences
inputs = tokenizer(GEOPHYSICS_TEXTS, padding=True, truncation=True, return_tensors="pt", max_length=512)

with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state[:,0,:]  # CLS token

print(f"Embeddings shape: {embeddings.shape}")
print(f"Each sentence is represented by {embeddings.shape[1]} dimensional vector")

# Reduce dimensions to 3D with t-SNE
perplexity = min(30, len(GEOPHYSICS_TEXTS) - 1)
tsne = TSNE(n_components=3, perplexity=perplexity, random_state=42, max_iter=1000)
embeddings_3d = tsne.fit_transform(embeddings.numpy())

print(f"3D embeddings shape: {embeddings_3d.shape}")
print(f"Using perplexity: {perplexity}")

Embeddings shape: torch.Size([56, 384])
Each sentence is represented by 384 dimensional vector
3D embeddings shape: (56, 3)
Using perplexity: 30


In [14]:
import plotly.express as px
# Create the 3D scatter plot using imported data
fig = px.scatter_3d(
    x=embeddings_3d[:,0],
    y=embeddings_3d[:,1],
    z=embeddings_3d[:,2],
    hover_name=GEOPHYSICS_TEXTS,  # Use imported data
    color=GEOPHYSICS_CATEGORIES,  # Use imported categories
    title="Interactive 3D Geophysics Text Embeddings",
    labels={'x':'Dimension 1', 'y':'Dimension 2', 'z':'Dimension 3'},
)

fig.update_traces(marker=dict(size=5, opacity=0.7))
fig.update_layout(
    template='plotly_dark', font_family='monospace', width=900, height=700)
fig.show()


In [15]:
# Analyze semantic similarities within categories
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# Calculate similarity matrix
similarity_matrix = cosine_similarity(embeddings.numpy())

# Find most similar sentence pairs
similarity_pairs = []
for i in range(len(GEOPHYSICS_TEXTS)):
    for j in range(i+1, len(GEOPHYSICS_TEXTS)):
        similarity_pairs.append({
            'text1': GEOPHYSICS_TEXTS[i][:50] + '...',
            'text2': GEOPHYSICS_TEXTS[j][:50] + '...',
            'category1': GEOPHYSICS_CATEGORIES[i],
            'category2': GEOPHYSICS_CATEGORIES[j],
            'similarity': similarity_matrix[i, j],
            'same_category': GEOPHYSICS_CATEGORIES[i] == GEOPHYSICS_CATEGORIES[j]
        })

# Convert to DataFrame and sort by similarity
df_similarities = pd.DataFrame(similarity_pairs)
df_top_similar = df_similarities.nlargest(10, 'similarity')

print("Top 10 Most Similar Sentence Pairs:")
print("=" * 80)
for idx, row in df_top_similar.iterrows():
    same_cat = "✓" if row['same_category'] else "✗"
    print(f"Similarity: {row['similarity']:.3f} | Same Category: {same_cat}")
    print(f"1. [{row['category1']}] {row['text1']}")
    print(f"2. [{row['category2']}] {row['text2']}")
    print("-" * 80)

# Calculate average similarity within vs between categories
within_category_sim = df_similarities[df_similarities['same_category']]['similarity'].mean()
between_category_sim = df_similarities[~df_similarities['same_category']]['similarity'].mean()

print(f"\nAverage similarity within same category: {within_category_sim:.3f}")
print(f"Average similarity between different categories: {between_category_sim:.3f}")
print(f"Difference: {within_category_sim - between_category_sim:.3f}")

Top 10 Most Similar Sentence Pairs:
Similarity: 0.905 | Same Category: ✓
1. [Drilling & Completion] Perforation creates communication pathways between...
2. [Drilling & Completion] Wellbore trajectory optimization maximizes reservo...
--------------------------------------------------------------------------------
Similarity: 0.892 | Same Category: ✓
1. [Reservoir Properties] Porosity measures the void space available for flu...
2. [Reservoir Properties] Capillary pressure controls fluid distribution at ...
--------------------------------------------------------------------------------
Similarity: 0.874 | Same Category: ✓
1. [Geology & Geochemistry] Migration pathways allow hydrocarbons to move from...
2. [Geology & Geochemistry] Seal integrity prevents hydrocarbon leakage from r...
--------------------------------------------------------------------------------
Similarity: 0.866 | Same Category: ✓
1. [Seismic Methods] Pre-stack seismic inversion simultaneously estimat...
2. [Seismic 

# 3. Understanding Context Windows

> <img src="https://raw.githubusercontent.com/CLDiego/uom_fse_dl_workshop/main/figs/icons/write.svg" width="20"/> **Definition**: The **context window** is the maximum number of tokens a model processes at once. Both your prompt and generated continuation must fit within this limit.

Why it matters:
- Limits how much the model “remembers” at inference time
- Affects truncation and chunking strategies for long documents
- Impacts latency and memory usage

Typical sizes (approximate):
- GPT-2 family: 1,024 tokens
- Modern LLMs: 4k–200k+ tokens, depending on the model

Key parameters to watch:
- `tokenizer.model_max_length`, `config.n_positions` (GPT-2), or `config.max_position_embeddings` (many other architectures)
- `max_new_tokens` vs `max_length` (prefer `max_new_tokens` to avoid counting prompt tokens implicitly)

***

> <img src="https://raw.githubusercontent.com/CLDiego/uom_fse_dl_workshop/main/figs/icons/code.svg" width="20"/> **Snippet**: Chunking long text (sliding window) example:

```python
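def chunk_tokens(tokenizer, text, max_len=1024, overlap=50):
    """Split text into overlapping windows of at most max_len token IDs."""
    if not 0 <= overlap < max_len:
        # With overlap >= max_len the window would never advance
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    start = 0
    while start < len(ids):
        end = min(start + max_len, len(ids))
        chunks.append(ids[start:end])
        if end == len(ids):
            break
        start = end - overlap  # slide back by overlap
    return chunks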

# Each chunk can be fed independently; aggregate results later.
```
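To see where the window boundaries fall, the same sliding-window logic can be run on plain integers instead of real token IDs. This is a small sketch under that assumption: `chunk_ids` is a hypothetical tokenizer-free variant of the snippet above, and the inputs are stand-ins rather than real data.

```python
def chunk_ids(ids, max_len, overlap):
    """Sliding-window chunking over a list of IDs (tokenizer-free variant)."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    chunks, start = [], 0
    while start < len(ids):
        end = min(start + max_len, len(ids))
        chunks.append(ids[start:end])
        if end == len(ids):
            break
        start = end - overlap  # the next window re-reads the last `overlap` IDs
    return chunks


# Tiny case: 10 IDs, windows of 4, overlap of 1
small = chunk_ids(list(range(10)), max_len=4, overlap=1)
for chunk in small:
    print(chunk)

# Larger case: a 1,820-ID input with the default-style settings
large = chunk_ids(list(range(1820)), max_len=1024, overlap=50)
print([len(chunk) for chunk in large])
```

Each window starts `overlap` IDs before the previous one ended, so content near a boundary appears in both neighbouring chunks.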

> <img src="https://raw.githubusercontent.com/CLDiego/uom_fse_dl_workshop/main/figs/icons/reminder.svg" width="20"/> **Tip**: Leave a margin in the window for generation. Example: if the model limit is 1024, keep the prompt ≤ 900 tokens and use `max_new_tokens` ≤ 100.
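The budgeting rule in the tip is simple arithmetic: prompt tokens plus `max_new_tokens` must not exceed the context window. The helpers below are a hypothetical sketch of that check (`max_prompt_tokens`, `fits_in_window`, and `safety_margin` are names invented here, not library functions).

```python
def max_prompt_tokens(context_window, max_new_tokens, safety_margin=0):
    """Largest prompt length that still leaves room for generation."""
    budget = context_window - max_new_tokens - safety_margin
    if budget <= 0:
        raise ValueError("no room left for the prompt")
    return budget


def fits_in_window(prompt_tokens, context_window, max_new_tokens):
    """True if the prompt plus the requested continuation fits in the window."""
    return prompt_tokens + max_new_tokens <= context_window


print(max_prompt_tokens(1024, 100))     # hard upper bound on the prompt length
print(fits_in_window(900, 1024, 100))   # the example from the tip
print(fits_in_window(974, 1024, 100))   # a longer prompt with the same generation budget
```

Running a check like this before calling `generate` makes truncation an explicit decision rather than a silent side effect.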

In [16]:
# Demonstrate context window limitations
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load GPT-2 model
tokenizer_gpt2 = GPT2Tokenizer.from_pretrained('gpt2')
model_gpt2 = GPT2LMHeadModel.from_pretrained('gpt2')


In [18]:
# Set pad token
tokenizer_gpt2.pad_token = tokenizer_gpt2.eos_token

print(f"GPT-2 maximum position embeddings: {model_gpt2.config.n_positions}")
print(f"This means the context window is {model_gpt2.config.n_positions} tokens")

# Create a long geoscience text to test context limits
long_text = """
Seismic inversion is a geophysical technique used to derive subsurface properties from seismic data.
The process involves converting seismic reflection data into quantitative rock and fluid properties such as
acoustic impedance, porosity, and lithology. This technique is fundamental in hydrocarbon exploration
and reservoir characterization. The inversion process typically starts with seismic data acquisition,
followed by data processing, and finally the inversion itself. There are several types of seismic inversion
including post-stack inversion, pre-stack inversion, and simultaneous inversion. Post-stack inversion
works with stacked seismic data to derive acoustic impedance. Pre-stack inversion uses angle-dependent
reflectivity information to derive multiple elastic properties. Simultaneous inversion integrates seismic
and well log data to provide more accurate and detailed subsurface models.
""" * 10  # Repeat to make it longer

# Tokenize the long text
tokens = tokenizer_gpt2.tokenize(long_text)
print(f"\nLong text has {len(tokens)} tokens")
print(f"Exceeds context window: {len(tokens) > model_gpt2.config.n_positions}")

# Show what happens when we truncate
max_length = model_gpt2.config.n_positions - 50  # Leave room for generation
truncated_tokens = tokens[:max_length]
print(f"Truncated to {len(truncated_tokens)} tokens for processing")

GPT-2 maximum position embeddings: 1024
This means the context window is 1024 tokens

Long text has 1820 tokens
Exceeds context window: True
Truncated to 974 tokens for processing


# 4. Loading a Small HuggingFace Model

We’ll use a compact causal language model for fast experimentation. Smaller models are great for demos and offline inference but will have limited knowledge and coherence compared to larger models.

Considerations:
- Trade-offs: size vs speed vs quality
- Device placement: CPU, GPU, or Apple MPS
- Precision: fp32 (safe), fp16/bf16 (faster on supported hardware)

***

> <img src="https://raw.githubusercontent.com/CLDiego/uom_fse_dl_workshop/main/figs/icons/code.svg" width="20"/> **Snippet**: Convenient loading patterns:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = 'distilgpt2'
tokenizer_gen = AutoTokenizer.from_pretrained(model_name)
model_gen = AutoModelForCausalLM.from_pretrained(model_name)

# Device selection
device = 'mps' if torch.backends.mps.is_available() else ('cuda' if torch.cuda.is_available() else 'cpu')
model_gen = model_gen.to(device)

# Pad/eos safety for generation
if tokenizer_gen.pad_token_id is None:
    tokenizer_gen.pad_token = tokenizer_gen.eos_token
```

> <img src="https://raw.githubusercontent.com/CLDiego/uom_fse_dl_workshop/main/figs/icons/reminder.svg" width="20"/> **Tip (advanced)**: For larger models that fit in memory, try `device_map='auto'` and `torch_dtype=torch.float16` on supported hardware.

In [19]:
from transformers import AutoModelForCausalLM

# Load a small, efficient model for text generation
model_name = "distilgpt2"  # Smaller, faster version of GPT-2
tokenizer_gen = AutoTokenizer.from_pretrained(model_name)
model_gen = AutoModelForCausalLM.from_pretrained(model_name)


In [20]:
# Set pad token
if tokenizer_gen.pad_token is None:
    tokenizer_gen.pad_token = tokenizer_gen.eos_token

print(f"Model: {model_name}")
print(f"Vocabulary size: {tokenizer_gen.vocab_size:,}")
print(f"Model parameters: {model_gen.num_parameters():,}")
print(f"Context window: {model_gen.config.n_positions} tokens")
print(f"Embedding dimension: {model_gen.config.n_embd}")

Model: distilgpt2
Vocabulary size: 50,257
Model parameters: 81,912,576
Context window: 1024 tokens
Embedding dimension: 768
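The parameter count printed above gives a quick way to reason about the precision trade-off: weight memory is roughly the number of parameters times the bytes per parameter. The sketch below is a back-of-the-envelope estimate only. It covers the weights and ignores activations, caches, and framework overhead, and the helper `weight_memory_mib` is a hypothetical name.

```python
# Bytes needed to store one parameter in each precision
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def weight_memory_mib(num_parameters, dtype):
    """Approximate memory for the model weights alone, in MiB."""
    return num_parameters * BYTES_PER_PARAM[dtype] / (1024 ** 2)


num_parameters = 81_912_576  # distilgpt2 parameter count printed above

estimates = {dtype: weight_memory_mib(num_parameters, dtype) for dtype in BYTES_PER_PARAM}
for dtype, mib in estimates.items():
    print(f"{dtype}: ~{mib:.0f} MiB")
```

Halving the precision halves the weight footprint, which is the main reason fp16 and bf16 are attractive for larger models on supported hardware.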


# 5. Generate Simple Text Completions

We’ll generate short continuations and compare decoding strategies.

Decoding parameters (cheat sheet):
- `max_new_tokens`: number of tokens to generate (prefer over `max_length`)
- `temperature`: randomness (↓ = conservative, ↑ = creative)
- `top_k`: Only consider the top K most likely next tokens. Smaller K = safer, larger K = more variety.
- `top_p` (nucleus): Only consider the smallest set of tokens whose total probability ≥ p (e.g., 0.9). Lower p = safer.
- `repetition_penalty`: Penalizes tokens that have already appeared. Use >1.0 (e.g., 1.1–1.3) to reduce loops.
- `no_repeat_ngram_size`: Forbids repeating any exact phrase of length n (e.g., 2 prevents repeating bigrams).
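To see what `temperature` does numerically, the sketch below divides a set of logits by the temperature before applying softmax. The logits are made up for illustration and `softmax_with_temperature` is a hypothetical helper, not a library function.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for four candidate next tokens
toy_logits = [2.0, 1.0, 0.5, -1.0]

distributions = {t: softmax_with_temperature(toy_logits, t) for t in (0.3, 1.0, 1.2)}
for t, probs in distributions.items():
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

Lower temperatures concentrate probability on the top token, and higher temperatures flatten the distribution so that less likely tokens are sampled more often.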

***

> <img src="https://raw.githubusercontent.com/CLDiego/uom_fse_dl_workshop/main/figs/icons/code.svg" width="20"/> **Snippet**: Example generation call:

```python
prompt = 'Reservoir characterization involves'
inputs = tokenizer_gen(prompt, return_tensors='pt').to(model_gen.device)
with torch.no_grad():
    out_ids = model_gen.generate(
        **inputs,
        max_new_tokens=60,
        temperature=0.7,
        top_p=0.9,
        top_k=50,
        do_sample=True,
        repetition_penalty=1.1,
        eos_token_id=tokenizer_gen.eos_token_id,
        pad_token_id=tokenizer_gen.pad_token_id,
    )
print(tokenizer_gen.decode(out_ids[0], skip_special_tokens=True))
```

> <img src="https://raw.githubusercontent.com/CLDiego/uom_fse_dl_workshop/main/figs/icons/reminder.svg" width="20"/> **Tips**
> - Keep prompts concise and specific. Prefix with context if needed (e.g., “Geoscience definition: …”).
> - Tune top_k or top_p one at a time (top_p is usually easier to tune); they can also be combined.
> - temperature, top_k, and top_p take effect only when do_sample=True (otherwise decoding is greedy).
> - Prefer max_new_tokens over max_length.
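The difference between `top_k` and `top_p` is easiest to see on a small probability distribution. The sketch below filters a made-up distribution both ways and renormalizes what remains; `top_k_filter` and `top_p_filter` are hypothetical helpers that illustrate the idea and are not the library's implementation.

```python
def top_k_filter(probs, k):
    """Keep the k most probable token indices and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}


# Hypothetical next-token probabilities for four candidate tokens
toy_probs = [0.5, 0.3, 0.15, 0.05]

kept_top_k = top_k_filter(toy_probs, k=2)
kept_top_p = top_p_filter(toy_probs, p=0.9)
print("top_k=2  keeps token indices:", sorted(kept_top_k))
print("top_p=0.9 keeps token indices:", sorted(kept_top_p))
```

`top_k` always keeps a fixed number of candidates, whereas `top_p` adapts: it keeps few candidates when the model is confident and more when the distribution is flat.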

In [21]:
def generate_text(prompt, tokenizer, model, max_length=100, temperature=0.7, num_return_sequences=1):
    """Generate text completions for a prompt.

    Note: max_length counts the prompt tokens plus the newly generated tokens.
    """
    inputs = tokenizer(prompt, return_tensors='pt')

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            num_return_sequences=num_return_sequences,
            temperature=temperature,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            no_repeat_ngram_size=2  # Avoid repetition
        )

    generated_texts = []
    for output in outputs:
        generated_text = tokenizer.decode(output, skip_special_tokens=True)
        generated_texts.append(generated_text)

    return generated_texts

# Test with simple prompts
simple_prompts = [
    "The geology of this region",
    "Oil and gas exploration requires",
    "Seismic waves travel through"
]

print("=== Simple Text Completions ===")
for prompt in simple_prompts:
    generated = generate_text(prompt, tokenizer_gen, model_gen, max_length=60)
    print(f"\nPrompt: '{prompt}'")
    print(f"Completion: '{generated[0]}'")
    print("-" * 80)

=== Simple Text Completions ===

Prompt: 'The geology of this region'
Completion: 'The geology of this region

The area is the largest and most complex of the 2.5 billion sq km region that has existed in the past 50 years. (The most famous example of a geologic area lies in northern North America.)
This region is also home to the vast majority'
--------------------------------------------------------------------------------

Prompt: 'Oil and gas exploration requires'
Completion: 'Oil and gas exploration requires an average of 30 minutes before the drilling begins, said Doug Visser, executive director of the Oil and Gas Change Coalition. The most recent gas company exploration in Texas has been in Tennessee, which is the last state with a fracking ban.

"Texas is a'
--------------------------------------------------------------------------------

Prompt: 'Seismic waves travel through'
Completion: 'Seismic waves travel through Europe, and they are not as common as the European ones. The 

In [22]:
# Experiment with different generation parameters
prompt = "Reservoir characterization involves"

print("=== Effect of Different Parameters ===")
print(f"Prompt: '{prompt}'\n")

# Low temperature (more deterministic)
low_temp = generate_text(prompt, tokenizer_gen, model_gen, max_length=50, temperature=0.3)
print(f"Low temperature (0.3): {low_temp[0]}")

# High temperature (more creative)
high_temp = generate_text(prompt, tokenizer_gen, model_gen, max_length=50, temperature=1.2)
print(f"High temperature (1.2): {high_temp[0]}")

# Multiple generations
multiple = generate_text(prompt, tokenizer_gen, model_gen, max_length=50, temperature=0.8, num_return_sequences=3)
print("\nMultiple generations:")
for i, gen in enumerate(multiple, 1):
    print(f"{i}. {gen}")

=== Effect of Different Parameters ===
Prompt: 'Reservoir characterization involves'

Low temperature (0.3): Reservoir characterization involves the use of a technique of measuring the concentration of the active ingredient in the product. The concentration is determined by the volume of concentration in a given product, and the amount of concentrated concentration.

The concentration concentration and concentration
High temperature (1.2): Reservoir characterization involves placing hands on a water bottle to avoid any possible contamination and thus minimizing the risk of contamination or contamination at any other facility.

The U.S. Department of Marine Fisheries uses special measures for protection of freshwater species and

Multiple generations:
1. Reservoir characterization involves the use of a small number of microchips: the one in which two microchip chips are inserted into each other. The tiny ones are placed at the bottom of the chip and then placed directly next to the micr
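The low-temperature output above still repeats single words ("concentration") even though `no_repeat_ngram_size=2` is set. That is expected: the setting only forbids repeating an exact bigram, not an individual token. The sketch below is a simplified illustration of the rule, written for this explanation and not taken from the library; `banned_next_tokens` is a hypothetical helper that operates on a list of word strings instead of token IDs.

```python
def banned_next_tokens(generated, n):
    """Tokens that would complete an n-gram already present in `generated`."""
    if n <= 0 or len(generated) < n:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed
    prefix = tuple(generated[len(generated) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(generated) - n + 1):
        ngram = tuple(generated[i:i + n])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


# Word-level stand-in for a generated sequence
generated = ["the", "concentration", "concentration", "and", "concentration"]

banned = banned_next_tokens(generated, n=2)
print(sorted(banned))
```

After the final "concentration", only the words that previously followed "concentration" are blocked. Any other continuation is allowed, so the same word can come back as part of new bigrams.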

## Summary

In this module, we covered:

1. **Tokens**: Basic units that LLMs process (words, subwords, punctuation)
2. **Embeddings**: Numerical representations that capture semantic meaning
3. **Context Windows**: Maximum input size limitations (1,024 tokens for GPT-2)
4. **Model Loading**: Using HuggingFace transformers to load pre-trained models
5. **Text Generation**: Creating completions with different parameters
6. **Geoscience Applications**: Generating definitions for technical terms

### Key Takeaways:
- Tokenization breaks text into processable units
- Embeddings capture semantic relationships between concepts
- Context windows limit how much text models can process at once
- Different prompting strategies can yield different results
- Temperature controls randomness in generation

# 6. Exercise: Geoscience Definition Generation

> <img src="https://raw.githubusercontent.com/CLDiego/uom_fse_dl_workshop/main/figs/icons/write.svg" width="20"/> **Task**: Generate concise, factual definitions for geoscience terms.

Guidelines:
- Aim for 1–3 sentences per definition
- Prefer domain-appropriate vocabulary
- Keep statements verifiable and neutral

Suggested prompt templates:

```text
Define "{term}" in geoscience.
In petroleum geoscience, what is {term}?
Give a brief, technical definition of {term}.
```
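The templates above can be filled programmatically so that every term is tested against every prompt style. This is a minimal sketch; `PROMPT_TEMPLATES` and `build_prompts` are names chosen for illustration.

```python
# The three suggested templates, with {term} as the placeholder
PROMPT_TEMPLATES = [
    'Define "{term}" in geoscience.',
    "In petroleum geoscience, what is {term}?",
    "Give a brief, technical definition of {term}.",
]


def build_prompts(term):
    """Fill every template with the given term."""
    return [template.format(term=term) for template in PROMPT_TEMPLATES]


prompts = build_prompts("porosity")
for prompt in prompts:
    print(prompt)
```

Keeping the templates in one list makes it easy to add a new style and rerun the comparison for all terms.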

***

> <img src="https://raw.githubusercontent.com/CLDiego/uom_fse_dl_workshop/main/figs/icons/code.svg" width="20"/> **Snippet**: Example helper (a sketch you can adapt in the code cell):

```python
import torch

def generate_definition(term, tokenizer, model, max_new_tokens=80, temperature=0.6):
    prompt = f"Geoscience definition: {term}. Definition:"
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=0.9,
            do_sample=True,
            no_repeat_ngram_size=2,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )
    return tokenizer.decode(out[0], skip_special_tokens=True)
```
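The decoded output of a causal LM contains the prompt followed by the continuation, and small models often keep writing past the definition. A light post-processing step can strip the prompt and keep only the first few sentences, in line with the 1–3 sentence guideline. The helper below is a hypothetical sketch: `extract_definition` is not part of any library, the sentence splitting is deliberately naive (abbreviations such as "e.g." will confuse it), and the sample text is hand-written rather than model output.

```python
import re


def extract_definition(generated_text, prompt, max_sentences=3):
    """Drop the echoed prompt and keep at most max_sentences sentences."""
    text = generated_text
    if text.startswith(prompt):
        text = text[len(prompt):]
    text = " ".join(text.split())  # collapse newlines and repeated spaces
    # Naive split: a sentence ends at ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(sentences[:max_sentences]).strip()


# Hand-written stand-in for a decoded generation (not real model output)
prompt = "Geoscience definition: porosity. Definition:"
decoded = (
    prompt
    + " Porosity is the fraction of void space in a rock. It is often reported as a percentage.\n\n"
    + "It varies with depth. Extra sentence here."
)

definition = extract_definition(decoded, prompt, max_sentences=2)
print(definition)
```

Applying the same post-processing to every prompt style keeps the comparison in the advanced challenge fair, since all definitions are trimmed by the same rule.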

Advanced challenge:
- Compare three prompts for the same term and evaluate which is clearer
- Embed each generated definition and compute cosine similarity between them
- Visualize definitions in 2D (t-SNE/UMAP) to see clustering by prompt style

In [None]:
# # Exercise: Generate geoscience definitions
# # STEP 1: Define a method to generate definitions using
# # a predefined prompt
# def generate_definition(term, tokenizer, model, max_length=150):
#     """Generate a definition for a geoscience term"""

#     """ YOUR CODE HERE """

#     prompt =

#     return generated[0]

# # Main exercise: Seismic inversion
# print("=== MAIN EXERCISE: Seismic Inversion Definition ===")
# seismic_inversion_def = generate_definition("seismic inversion", tokenizer_gen, model_gen)
# print(seismic_inversion_def)
# print("\n" + "="*80 + "\n")

# # Additional geoscience terms to try
# geoscience_terms_exercise = [
#     "porosity",
#     "permeability",
#     "reservoir characterization",
#     "hydrocarbon migration",
#     "seismic interpretation",
#     "well logging"
# ]

# print("=== Additional Geoscience Definitions ===")
# # STEP 2: Iterate over the list and generate definitions
# """ YOUR CODE HERE """

In [None]:
# # Advanced exercise: Compare different prompting strategies
# term = "seismic inversion"

# # STEP 3: Create a list of different prompting strategies
# # These strategies will be used to generate definitions for the term
# """ YOUR CODE HERE """

# prompting_strategies =

# # STEP 4: Iterate through strategies and generate definitions
# print("=== Comparing Prompting Strategies ===")
# for i, prompt in enumerate(prompting_strategies, 1):

#     """ YOUR CODE HERE """

#     print(f"\nStrategy {i}: '{prompt}'")
#     print(f"Response: {generated[0]}")
#     print("-" * 60)