# 05 — Code & Data Prep with LLMs

**Goal:** Practice using an LLM to help with **coding** and **data wrangling** tasks, while keeping your workflow transparent and reproducible.

What you'll learn:
- When LLMs are helpful for coding — and when they are **not**.
- Generating a small Python function from natural language (NL).
- Explaining existing code in plain English.
- Cleaning a tiny messy dataset with clear, auditable steps.
- Asking quick statistical questions (and verifying with code).

> This notebook assumes you've completed **03** and **04**. We continue using the same Groq setup and the `chat()` helper.


## 0) Groq API settings (run first)

Paste your Groq API key below. A plain string is fine for a workshop, but remove it before sharing or committing the notebook.


In [None]:
# --- Groq API Settings ---
import os
os.environ["GROQ_API_KEY"] = ""  # <-- replace with your key, e.g., "gsk_..."
model = "llama-3.3-70b-versatile"  # https://console.groq.com/docs/models
os.environ["BASE_URL"] = "https://api.groq.com/openai/v1"  # OpenAI-compatible endpoint
print("Groq settings ready (remember to paste your key).")


Groq settings ready (remember to paste your key).
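
If the notebook will be shared, one option is to read the key from the environment and only prompt for it when it is missing. The sketch below is a minimal illustration; `load_api_key` and its `prompt_fn` parameter are hypothetical names, not part of the Groq or OpenAI libraries.

```python
import os
from getpass import getpass

def load_api_key(env_var="GROQ_API_KEY", prompt_fn=getpass):
    """Return the key from the environment, prompting only if it is missing.

    `load_api_key` and `prompt_fn` are illustrative names (assumptions).
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # getpass hides typed input, so the key is not echoed into the output
        key = prompt_fn(f"Enter {env_var}: ").strip()
        os.environ[env_var] = key
    return key
```

Calling `load_api_key()` before creating the client would then replace the hard-coded string in the settings cell.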


## 1) Create the client and helpers

We use a small set of helpers:
- `chat()` — send a message to the model.
- `set_system()` — reset memory with a system instruction.
- `clear_chat()` — clear memory entirely.

Keeping memory tidy is important for **reproducibility**.


In [None]:
from openai import OpenAI

chat_history = []
client = OpenAI(api_key=os.environ["GROQ_API_KEY"], base_url=os.environ["BASE_URL"])

def set_system(prompt: str):
    """Reset chat to a single system message."""
    global chat_history
    chat_history = [{"role": "system", "content": prompt}]
    return chat_history

def clear_chat():
    """Clear all chat history."""
    global chat_history
    chat_history = []
    return chat_history

def chat(user_input: str, temperature: float = 0.2, max_turns: int = 8):
    """Send one user turn and get a reply, preserving context.
    Keeps system prompt + last `max_turns` user/assistant messages.
    """
    global chat_history
    chat_history.append({"role": "user", "content": user_input})
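    # Keep the system prompt (if any) separate so it is never duplicated or trimmed
    has_system = chat_history[0]["role"] == "system"
    system = chat_history[:1] if has_system else []
    rest = chat_history[1:] if has_system else chat_history
    window = system + rest[-(max_turns * 2):]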
    resp = client.chat.completions.create(model=model, messages=window, temperature=temperature)
    reply = resp.choices[0].message.content
    chat_history.append({"role": "assistant", "content": reply})
    return reply


print("Testing API ...")
print()
print("User:")
print("Hello LLM!")
print("LLM:")
print(chat("Hello LLM!"))
print()
print("Client ready. Use chat('...'), set_system('...'), clear_chat().")



Testing API ...

User:
Hello LLM!
LLM:
Hello! It's nice to meet you. Is there something I can help you with or would you like to chat?

Client ready. Use chat('...'), set_system('...'), clear_chat().
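
The context window that `chat()` is meant to send (the system message plus the last `max_turns` user/assistant pairs) can be checked without calling the API. The sketch below is a standalone illustration of that intent; `build_window` is a hypothetical helper, not part of the notebook's helpers.

```python
def build_window(history, max_turns=8):
    """Return the system message (if any) plus the last `max_turns * 2` other messages.

    `build_window` is a hypothetical helper for illustration; it does not call the API.
    """
    has_system = bool(history) and history[0]["role"] == "system"
    system = history[:1] if has_system else []
    rest = history[1:] if has_system else history
    return system + rest[-(max_turns * 2):]

# Build a short fake conversation: one system message and three exchanges
history = [{"role": "system", "content": "Be brief."}]
for i in range(3):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

window = build_window(history, max_turns=1)
print([m["content"] for m in window])
# → ['Be brief.', 'question 2', 'answer 2']
```

Printing the window like this is a quick way to confirm that the system message appears exactly once and that older turns are dropped as expected.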


## 2) When to use LLMs for coding — and when not to

**Good uses:**
- Writing small helper functions (string cleaning, quick plotting, file I/O).
- Explaining unfamiliar code from a source you *already trust*.
- Drafting boilerplate (docstrings, argument parsing).

**Be cautious / avoid:**
- Security-sensitive code (auth, crypto, secrets).
- Heavy numerical code where correctness is critical (derive first, then code).
- Blindly executing generated code — **always read and test**.

In this notebook, we keep examples **small and auditable**.


## 3) Generate a small Python function from natural language (NL)

We'll ask the LLM for a short, reusable helper:
> *"Write a function that normalises a CSV column (strip spaces, lowercase strings, and convert numbers safely)."*

We **do not** auto-execute the returned code — we read it first, then (optionally) paste it into a new cell.


In [3]:
clear_chat()
set_system("You are a helpful Python tutor. Keep code short, readable, and beginner-friendly.")

prompt = """Write a Python function:
- name: normalize_column(df, col)
- if the column is string-like: strip spaces and lowercase
- if numeric-like: coerce to numeric (errors='coerce')
- return the modified dataframe (in-place ok for simplicity)
Keep it short and add a concise docstring.
"""
code_suggestion = chat(prompt, temperature=0.2)
print(code_suggestion)


```python
import pandas as pd
import numpy as np

def normalize_column(df, col):
    """
    Normalize a column in a DataFrame: strip and lowercase strings, coerce to numeric otherwise.

    Parameters:
    df (pd.DataFrame): Input DataFrame
    col (str): Column name to normalize

    Returns:
    pd.DataFrame: Modified DataFrame
    """
    if df[col].dtype.kind in 'bOUS':  # string-like
        df[col] = df[col].str.strip().str.lower()
    else:  # numeric-like
        df[col] = pd.to_numeric(df[col], errors='coerce')
    return df
```


**Review the code above** (visually) and paste it into a new cell if you're happy with it.  
For convenience, here's a **simple reference implementation** you can run directly:


In [4]:
import pandas as pd
import numpy as np

def normalize_column(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Normalize a column: strip/lowercase strings, coerce numeric.

    This is a minimal, workshop-friendly version.
    """
    if col not in df.columns:
        return df
    s = df[col]
    # Try numeric coercion first; treat the column as numeric if any value parses
    coerced = pd.to_numeric(s, errors="coerce")
    if coerced.notna().any():
        df[col] = coerced
        return df
    # Fall back to string normalization
    df[col] = s.astype(str).str.strip().str.lower()
    return df

# Tiny demo
tmp = pd.DataFrame({
    "name": ["  Alice ", "BOB", None],
    "age": [" 23", "not a number", "45 "],
})
print("Before:\n", tmp)
normalize_column(tmp, "name")
normalize_column(tmp, "age")
print("\nAfter:\n", tmp)


Before:
        name           age
0    Alice             23
1       BOB  not a number
2      None           45 

After:
     name   age
0  alice  23.0
1    bob   NaN
2   none  45.0
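
Reading a model reply can be supplemented with a static check that never executes the code. The sketch below pulls the first fenced block out of a reply, parses it with `ast`, and reports which imported modules cannot be found locally. `extract_code` and `review_code` are hypothetical helper names, and the sample `reply` is made up for the demonstration.

```python
import ast
import importlib.util
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

def extract_code(reply):
    """Return the first fenced code block in `reply`, or the whole reply if none is found."""
    pattern = FENCE + r"(?:python)?\s*\n(.*?)" + FENCE
    match = re.search(pattern, reply, flags=re.DOTALL)
    return match.group(1) if match else reply

def review_code(code):
    """Parse `code` without running it; return (imported modules, modules not found)."""
    tree = ast.parse(code)  # raises SyntaxError if the code is not valid Python
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module.split(".")[0])
    missing = sorted(m for m in modules if importlib.util.find_spec(m) is None)
    return sorted(modules), missing

# Made-up reply containing one real and one non-existent import
reply = ("Here you go:\n" + FENCE + "python\n"
         "import math\nimport not_a_real_module_xyz\nprint(math.pi)\n" + FENCE)
code = extract_code(reply)
print(review_code(code))
# → (['math', 'not_a_real_module_xyz'], ['not_a_real_module_xyz'])
```

In this notebook the same check could be applied as `review_code(extract_code(code_suggestion))`. A module that is reported missing may simply not be installed, so the result is a prompt to investigate rather than proof of a hallucination.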


## 4) Explain existing code

Now let's ask the LLM to **explain** a snippet. This is great for onboarding and code reviews.


In [5]:
snippet = """def zscore(x):
    import numpy as np
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std(ddof=1) if x.size > 1 else 1.0)
"""

clear_chat()
set_system("Explain code like a patient senior engineer helping a beginner.")
explanation = chat(f"Explain this code step by step:\n\n{snippet}", temperature=0.2)
print(explanation)


Let's break down this code step by step.

### Function Definition
```python
def zscore(x):
```
This line defines a function named `zscore` that takes one argument `x`. This function will calculate the z-score of the input data.

### Importing Libraries
```python
import numpy as np
```
This line imports the `numpy` library and assigns it the alias `np`. `numpy` is a library for efficient numerical computation in Python.

### Data Conversion
```python
x = np.asarray(x, dtype=float)
```
This line converts the input `x` into a `numpy` array with a data type of `float`. This is done to ensure that the input data is in a format that can be used for numerical computations.

### Calculating the Z-Score
```python
return (x - x.mean()) / (x.std(ddof=1) if x.size > 1 else 1.0)
```
This line calculates the z-score of the input data and returns the result. Let's break it down further:

* `x.mean()`: This calculates the mean (average) of the input data.
* `x - x.mean()`: This subtracts the mean from
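
An explanation is easier to trust once it has been checked on inputs small enough to verify by hand. The sketch below mirrors the snippet's logic using only the standard library, so `zscore_plain` is an assumption made for checking and not the snippet itself.

```python
import statistics

def zscore_plain(values):
    """Pure-Python mirror of the snippet's logic (sample standard deviation)."""
    values = [float(v) for v in values]
    mean = statistics.fmean(values)
    # statistics.stdev divides by n - 1, matching x.std(ddof=1) in the snippet
    spread = statistics.stdev(values) if len(values) > 1 else 1.0
    return [(v - mean) / spread for v in values]

print(zscore_plain([2, 4, 6]))  # → [-1.0, 0.0, 1.0]
print(zscore_plain([5]))        # → [0.0]
```

For `[2, 4, 6]` the mean is 4 and the sample standard deviation is 2, so the expected z-scores are -1, 0, and 1. The single-element case exercises the `else 1.0` branch, which avoids dividing by an undefined standard deviation.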

## 5) Data cleaning with guidance

We'll make a **tiny messy dataset** and demonstrate a few small, **auditable** cleaning steps.

Steps:
1. Inspect data (head, dtypes, missing values).
2. Normalise columns using our helper.
3. Handle missing values with simple choices.
4. Fix simple typos using a small dictionary.


In [6]:
# Create a tiny messy dataset
raw = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "city": ["Brisbane", "brisbane ", "Brisban", None, "Sydney"],
    "temp_c": ["22.1", "21.8", "not avail", " 19.5", "  28.0 "],
    "notes": [" Sunny", "Cloudy", None, " light rain", "WINDY  "],
})

print("Original data:")
display(raw)

print("Dtypes:")
print(raw.dtypes)


Original data:


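|   | id | city | temp_c | notes |
|---|----|------|--------|-------|
| 0 | 1 | Brisbane | 22.1 | Sunny |
| 1 | 2 | brisbane | 21.8 | Cloudy |
| 2 | 3 | Brisban | not avail | None |
| 3 | 4 | None | 19.5 | light rain |
| 4 | 5 | Sydney | 28.0 | WINDY |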


Dtypes:
id         int64
city      object
temp_c    object
notes     object
dtype: object
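
Step 1 also mentions missing values, which the cell above does not count. A minimal sketch of that inspection follows; it repeats the tiny dataset so it runs on its own, and the variable names `parsed` and `unparsed` are illustrative.

```python
import pandas as pd

# Same tiny dataset as above, repeated so this example is self-contained
raw = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "city": ["Brisbane", "brisbane ", "Brisban", None, "Sydney"],
    "temp_c": ["22.1", "21.8", "not avail", " 19.5", "  28.0 "],
    "notes": [" Sunny", "Cloudy", None, " light rain", "WINDY  "],
})

# Count missing values per column
print(raw.isna().sum())

# Find temp_c entries that are present but do not parse as numbers
parsed = pd.to_numeric(raw["temp_c"], errors="coerce")
unparsed = raw.loc[parsed.isna() & raw["temp_c"].notna(), "temp_c"]
print(unparsed.tolist())  # → ['not avail']
```

Listing the values that fail to parse before coercing them makes it clear which entries will silently become `NaN` in the next step.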


### Normalising columns

In [7]:
# 5.1 Normalize textual/numeric columns
clean1 = raw.copy()
clean1 = normalize_column(clean1, "temp_c")
clean1 = normalize_column(clean1, "notes")
# for 'city', keep as strings (no numeric coercion), so do a custom string pass:
clean1["city"] = clean1["city"].astype(str).str.strip().str.lower()

print("After basic normalization:")
display(clean1)
print(clean1.dtypes)


After basic normalization:


|   | id | city | temp_c | notes |
|---|----|------|--------|-------|
| 0 | 1 | brisbane | 22.1 | sunny |
| 1 | 2 | brisbane | 21.8 | cloudy |
| 2 | 3 | brisban | NaN | none |
| 3 | 4 | none | 19.5 | light rain |
| 4 | 5 | sydney | 28.0 | windy |


id          int64
city       object
temp_c    float64
notes      object
dtype: object


### Handle missing values

An example function to compare with the LLM's code:

In [8]:
def fill_missing_auto(df, placeholder="unknown"):
    """
    Fill missing values automatically:
      - Numeric columns → replace NaN with the column mean
      - Text columns → replace 'none' with missing, then fill with a placeholder

    Returns a new DataFrame (does not change the original).
    """
    out = df.copy()

    for col in out.columns:
        if out[col].dtype.kind in "fiu":  # numeric columns
            mean_val = out[col].mean(skipna=True)
            out[col] = out[col].fillna(mean_val)
        elif out[col].dtype.kind in "bOUS":  # boolean or text-like
            out[col] = out[col].replace({"none": None}).fillna(placeholder)

    return out



Asking the LLM to generate the function:

In [9]:
### Ask LLM to generate the function

clear_chat()
set_system("You are a helpful Python tutor. Keep code short, readable, and beginner-friendly.")

prompt = """
Create a function named fill_missing_auto that performs below.
It gets a dataframe, and a place holder string (default "unknown"). It detects string and numerical values and fills in 
missing values using a simple, easy-to-read approach:
  - For the numeric columns replaces NaN values with the column mean.
  - For text columns turns 'none' into missing, then fills missing entries with the placeholder (default "unknown").
Returns a new DataFrame (does not modify the original).
"""

code_suggestion = chat(prompt, temperature=0.2)
print(code_suggestion)

```python
import pandas as pd
import numpy as np

def fill_missing_auto(df, placeholder="unknown"):
    """
    Fill missing values in a DataFrame.

    For numeric columns, replace NaN values with the column mean.
    For text columns, replace 'none' with NaN, then fill NaN values with the placeholder.

    Args:
        df (pd.DataFrame): Input DataFrame.
        placeholder (str, optional): Placeholder string. Defaults to "unknown".

    Returns:
        pd.DataFrame: New DataFrame with filled missing values.
    """

    # Create a copy of the original DataFrame to avoid modifying it
    new_df = df.copy()

    # Iterate over each column in the DataFrame
    for col in new_df.columns:
        # Check if the column contains numeric values
        if new_df[col].dtype.kind in 'bifc':
            # Replace NaN values with the column mean
            new_df[col] = new_df[col].fillna(new_df[col].mean())
        else:
            # Replace 'none' with NaN
            new_df[col] = new_df
```

Paste the LLM's function below:

In [10]:
def fill_missing_auto(df, placeholder="unknown"):
    """
    Fill missing values in a DataFrame.

    For numeric columns, replace NaN values with the column mean.
    For text columns, replace 'none' with NaN, then fill NaN values with the placeholder.

    Args:
        df (pd.DataFrame): Input DataFrame.
        placeholder (str, optional): Placeholder string. Defaults to "unknown".

    Returns:
        pd.DataFrame: New DataFrame with filled missing values.
    """

    # Create a copy of the original DataFrame
    new_df = df.copy()

    # Iterate over each column in the DataFrame
    for col in new_df.columns:
        # Check if the column contains numeric values
        if new_df[col].dtype.kind in 'bifc':
            # Replace NaN values with the column mean
            new_df[col] = new_df[col].fillna(new_df[col].mean())
        else:
            # Replace 'none' with NaN
            new_df[col] = new_df[col].replace('none', np.nan)
            # Replace NaN values with the placeholder
            new_df[col] = new_df[col].fillna(placeholder)

    return new_df

Testing the function

In [11]:
# 5.2 Handle missing values (with helper)
clean2 = fill_missing_auto(clean1, placeholder="unknown")

print("After filling missing values:")
display(clean2)

After filling missing values:


|   | id | city | temp_c | notes |
|---|----|------|--------|-------|
| 0 | 1 | brisbane | 22.1 | sunny |
| 1 | 2 | brisbane | 21.8 | cloudy |
| 2 | 3 | brisban | 22.85 | unknown |
| 3 | 4 | unknown | 19.5 | light rain |
| 4 | 5 | sydney | 28.0 | windy |
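
The filled value 22.85 can be reproduced by hand, which is a cheap way to audit the imputation step. The sketch below uses only the standard library; the `observed` list is copied from the four `temp_c` values that parsed as numbers.

```python
import math
import statistics

observed = [22.1, 21.8, 19.5, 28.0]  # the four temp_c values that parsed as numbers
fill_value = statistics.fmean(observed)
print(round(fill_value, 2))  # → 22.85

# Filling with the mean keeps the mean unchanged but shrinks the spread
filled = observed + [fill_value]
print(statistics.stdev(observed) > statistics.stdev(filled))  # → True
```

The second check is a reminder that mean imputation makes a column look less variable than the observed data, which matters if the filled column is later used in a statistical test.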


### Fix typos

Example dictionary to fix typos

In [12]:
city_fix = {"brisban": "brisbane", "brisbane ": "brisbane"}

Asking the LLM to make the dictionary:

In [14]:
### Ask LLM to generate the dictionary

clear_chat()
set_system("You are a helpful Python tutor. Keep code short, readable, and beginner-friendly.")


column_unique_str = str(raw.city.unique())
prompt = f"""
Give me a dictionary to use for replacing and fixing the misspellings and typos in this list: {column_unique_str}
"""

code_suggestion = chat(prompt, temperature=0.2)
print(code_suggestion)

Here's a dictionary that maps common misspellings and typos to their correct spellings:

```python
city_corrections = {
    'Brisban': 'Brisbane',
    'brisbane': 'Brisbane',  # to standardize case
    'brisban': 'Brisbane',  # another possible typo
    'sydney': 'Sydney',  # to standardize case
}
```

You can use this dictionary to replace the misspellings and typos in your list. Note that this dictionary does not handle the `None` value, as it's not a string that can be corrected. You may want to add a separate check to handle `None` values. 

Here's a simple example of how you could use this dictionary:

```python
city_list = ['Brisbane', 'brisbane', 'Brisban', None, 'Sydney']

corrected_list = []
for city in city_list:
    if city is None:
        corrected_list.append(None)  # or some other default value
    else:
        corrected_list.append(city_corrections.get(city, city))

print(corrected_list)
```


Paste the dictionary proposed by the LLM below

In [15]:
city_fix = {
    'brisbane': 'Brisbane',
    'brisban': 'Brisbane',
    'sydney': 'Sydney'
}

Testing the dictionary

In [16]:
# 5.3 Fix simple typos using a dictionary
clean3 = clean2.copy()
clean3["city"] = clean2["city"].replace(city_fix)

print("Final cleaned dataset:")
display(clean3)


Final cleaned dataset:


|   | id | city | temp_c | notes |
|---|----|------|--------|-------|
| 0 | 1 | Brisbane | 22.1 | sunny |
| 1 | 2 | Brisbane | 21.8 | cloudy |
| 2 | 3 | Brisbane | 22.85 | unknown |
| 3 | 4 | unknown | 19.5 | light rain |
| 4 | 5 | Sydney | 28.0 | windy |
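
A typo dictionary proposed by an LLM can be cross-checked against a deterministic suggestion. The sketch below uses `difflib.get_close_matches` from the standard library; `suggest_fixes`, the `canonical` list, and the `cutoff` of 0.8 are hypothetical choices for this illustration.

```python
import difflib

def suggest_fixes(values, canonical, cutoff=0.8):
    """Map each non-canonical value to its closest canonical name, if similar enough.

    `suggest_fixes`, `canonical`, and `cutoff` are illustrative names and settings.
    """
    fixes = {}
    for value in values:
        if value is None:
            continue  # missing values are handled separately
        cleaned = value.strip().lower()
        if cleaned in canonical:
            continue  # already a known spelling
        match = difflib.get_close_matches(cleaned, canonical, n=1, cutoff=cutoff)
        if match:
            fixes[cleaned] = match[0]
    return fixes

cities = ["Brisbane", "brisbane ", "Brisban", None, "Sydney"]
print(suggest_fixes(cities, canonical=["brisbane", "sydney"]))
# → {'brisban': 'brisbane'}
```

If the two dictionaries disagree, that is a cue to inspect the entry by hand before applying it. This approach needs a list of accepted spellings up front, which the LLM prompt above did not require.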


## 6) Quick statistical help

We'll ask the LLM which test compares two independent samples, then **run the test** with Python.


In [17]:
clear_chat()
set_system("You are a helpful stats tutor for beginners.")
q = "What is the name of the simplest statistical test to compare two independent samples when we assume normality? Pick a common multi-purpose one and explain briefly what it returns and how it does it."
ans = chat(q, temperature=0.0)
print(ans)


The simplest statistical test to compare two independent samples when we assume normality is the **Two-Sample T-Test**, also known as the Independent Samples T-Test.

This test returns a **p-value** and a **t-statistic**. The p-value indicates the probability of observing the difference between the two sample means (or a more extreme difference) assuming that the true population means are equal. The t-statistic is a measure of the number of standard errors by which the sample means differ.

The Two-Sample T-Test works by:

1. Calculating the sample means and standard deviations of the two groups.
2. Calculating the difference between the sample means.
3. Calculating the standard error of the difference (which takes into account the sample sizes and standard deviations).
4. Converting the difference to a t-statistic by dividing it by the standard error.
5. Looking up the t-statistic in a t-distribution table (or using software) to find the p-value.

If the p-value is below a certain sig

In [18]:
q="""
Give me a Python function that takes two arrays as input and implements this test.
Do not import it from any other library.
Keep the code very simple for educational purposes. Add comments and explanations to the code.
"""

ans = chat(q, temperature=0.0)
print("Model's suggestion:", ans)

Model's suggestion: ```python
import math

def calculate_mean(array):
    # Calculate the mean of an array
    return sum(array) / len(array)

def calculate_variance(array):
    # Calculate the variance of an array
    mean = calculate_mean(array)
    return sum((x - mean) ** 2 for x in array) / len(array)

def calculate_standard_deviation(array):
    # Calculate the standard deviation of an array
    return math.sqrt(calculate_variance(array))

def two_sample_t_test(sample1, sample2):
    """
    Perform a two-sample t-test on two independent samples.

    Args:
        sample1 (list): The first sample.
        sample2 (list): The second sample.

    Returns:
        t_statistic (float): The t-statistic.
        p_value (float): The p-value (approximated).
    """
    # Calculate the means of the two samples
    mean1 = calculate_mean(sample1)
    mean2 = calculate_mean(sample2)

    # Calculate the standard deviations of the two samples
    std_dev1 = calculate_standard_deviation(sam

Paste the LLM-proposed function below:

In [19]:
import math

def calculate_mean(array):
    # Calculate the mean of an array
    return sum(array) / len(array)

def calculate_variance(array):
    # Calculate the variance of an array
    mean = calculate_mean(array)
    return sum((x - mean) ** 2 for x in array) / len(array)

def calculate_standard_deviation(array):
    # Calculate the standard deviation of an array
    return math.sqrt(calculate_variance(array))

def two_sample_t_test(sample1, sample2):
    # Calculate the means of the two samples
    mean1 = calculate_mean(sample1)
    mean2 = calculate_mean(sample2)

    # Calculate the standard deviations of the two samples
    std_dev1 = calculate_standard_deviation(sample1)
    std_dev2 = calculate_standard_deviation(sample2)

    # Calculate the standard error of the difference
    n1 = len(sample1)
    n2 = len(sample2)
    std_error = math.sqrt((std_dev1 ** 2 / n1) + (std_dev2 ** 2 / n2))

    # Calculate the t-statistic
    t_statistic = (mean1 - mean2) / std_error

    # Calculate the degrees of freedom
    df = n1 + n2 - 2

    # Calculate the p-value (approximation using a normal distribution)
    # Note: This is a simplification and not the exact p-value calculation
    # In practice, you would use a t-distribution table or a library function
    p_value = 2 * (1 - (math.erf(abs(t_statistic) / math.sqrt(2)) + 1) / 2)

    return t_statistic, p_value

Testing the function

In [20]:
# Verify with a simple t-test example

import numpy as np
rng = np.random.default_rng(0)
sample1 = rng.normal(22.0, 2.5, size=30)
sample2 = rng.normal(23.0, 2.5, size=30)

t_statistic, p_value = two_sample_t_test(sample1, sample2)
print(f"T-Statistic: {t_statistic}")
print(f"P-Value: {p_value}")


T-Statistic: -3.5295748922219743
P-Value: 0.0004162278799961783
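
Reading the pasted function line by line shows three simplifications: `calculate_variance` divides by `len(array)` (population variance) instead of `n - 1`, `df` is computed but never used, and the p-value comes from a normal approximation. The sketch below, using made-up samples `a` and `b` and a hypothetical helper `welch_t`, shows how the first choice alone inflates the t-statistic.

```python
import math
import statistics

def welch_t(sample1, sample2):
    """t-statistic using sample variances (n - 1 denominator), unpooled standard error."""
    n1, n2 = len(sample1), len(sample2)
    var1 = statistics.variance(sample1)  # divides by n - 1
    var2 = statistics.variance(sample2)
    std_error = math.sqrt(var1 / n1 + var2 / n2)
    return (statistics.fmean(sample1) - statistics.fmean(sample2)) / std_error

# Made-up illustrative samples (not data from this notebook)
a = [20.1, 22.4, 21.7, 23.0, 19.8]
b = [23.5, 24.1, 22.9, 25.2, 24.4]

t_sample = welch_t(a, b)

# Same formula with population variances (divide by n), as in the pasted helper
se_pop = math.sqrt(statistics.pvariance(a) / len(a) + statistics.pvariance(b) / len(b))
t_pop = (statistics.fmean(a) - statistics.fmean(b)) / se_pop

print(abs(t_pop) > abs(t_sample))  # → True
```

With equal sample sizes the two statistics differ by a factor of `sqrt(n / (n - 1))`, which is small for 30 observations but noticeable for 5. If SciPy is installed, `scipy.stats.ttest_ind(sample1, sample2, equal_var=False)` offers an independent check of both the statistic and the p-value.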


## 7) Common pitfalls (and how to avoid them)

- **Hallucinated imports or functions:** If the model suggests a library you've never seen, verify it exists before using it.
- **Silent logic errors:** Read generated code line-by-line and write **tiny tests**.
- **Mutable defaults and hidden state:** Prefer pure functions that take inputs and return outputs.
- **Reproducibility:** Save prompts, model name, and parameters (temperature, etc.) next to your code.
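
One lightweight way to follow the reproducibility advice is to append every prompt and reply to a JSON Lines file. This is a minimal sketch; `log_prompt`, the record fields, and the file name in the usage note are assumptions, not part of the notebook's helpers.

```python
import json
from datetime import datetime, timezone

def log_prompt(path, prompt, reply, model, temperature):
    """Append one prompt/reply record as a JSON line. All names here are illustrative."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "temperature": temperature,
        "prompt": prompt,
        "reply": reply,
    }
    # Append mode keeps earlier records, one JSON object per line
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

In this notebook a call such as `log_prompt("prompt_log.jsonl", prompt, code_suggestion, model, 0.2)` after each `chat()` call would keep the log next to the code.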

> Rule of thumb: *Treat LLM output like a junior assistant’s draft — helpful, but requires your review.*


## 8) Wrap-up

You used an LLM to:
- Propose a small helper function and explain code.
- Clean a tiny dataset with auditable, simple steps.
- Ask for a statistical test and verify with Python.

**Takeaway:** LLMs can speed up everyday research coding, but **you** stay in control by reviewing, testing, and documenting.
