# **🚀 AI-Powered Coding with Free & Open Models**

---

## **🔹 Step 1: What are Hugging Face Models?**

---

Hugging Face is like the **GitHub for AI models**: it hosts hundreds of thousands of models built by researchers, companies, and the open-source community.

### ✅ Key points for you:

1. **Model Hub**:

   * Website: [https://huggingface.co/models](https://huggingface.co/models)
   * You can search models by task (text generation, code generation, translation, etc.).
   * Each model has a "card" that explains usage, training data, limitations.

2. **Open-source**:

   * Most models are free to download, though each has its own license (listed on the model card).
   * Some are small (good for Colab/local use).
   * Some are huge (require GPUs or cloud).

3. **Coding Models (for our “mini Copilot”)**:
   Here are 3 popular free ones:

   * `deepseek-ai/deepseek-coder-1.3b-base` → lightweight, fast, free. (The steps below use its instruction-tuned sibling, `deepseek-ai/deepseek-coder-1.3b-instruct`.)
   * `codellama/CodeLlama-7b-Instruct-hf` → trained for coding conversations.
   * `bigcode/starcoder` → great at structured Python/code generation.

4. **How you use them**:

   * With the `transformers` Python library.
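   * You don’t need an API key for public models if you run locally or in Colab: the weights download once, are cached, and then run without further network access. Gated models (for example, `bigcode/starcoder`) additionally require accepting their terms on the Hub and logging in with an access token.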

---

---

## **🔹 Step 2: Setup (Free & Open Models)**

---

We’ll set up Hugging Face models locally (or in Google Colab).

### ✅ Install dependencies

In [None]:
!pip install transformers accelerate sentencepiece

#transformers → lets us load and run models
#accelerate → needed for device_map="auto", which places the model on the GPU if one is available
#sentencepiece → required by some tokenizers (for example, Llama-based ones such as CodeLlama)
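
Before loading a model, it can help to confirm that the three packages were actually installed. The sketch below uses only the standard library; the helper name `check_packages` is hypothetical and not part of any Hugging Face API.

```python
from importlib import metadata


def check_packages(names):
    """Return {package_name: installed_version or None} for each name."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


for pkg, ver in check_packages(["transformers", "accelerate", "sentencepiece"]).items():
    print(f"{pkg}: {ver or 'NOT INSTALLED'}")
```

If any package is reported as missing, rerun the `pip install` cell before continuing.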



---

---

## **Step 3: Auto-Generating Code** 🚀

---

This walkthrough assumes a GPU runtime (a Tesla T4 in Colab), so we’ll make sure the model uses it for faster inference.
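
If the notebook might run without a GPU, the device and dtype can be chosen at runtime instead of assumed. A minimal sketch, where `pick_device_and_dtype` is a hypothetical helper: it falls back to CPU when PyTorch is missing, and picks float32 on CPU because half precision is mainly a GPU optimization.

```python
def pick_device_and_dtype():
    """Return (device, dtype_name): float16 on a CUDA GPU, float32 on CPU."""
    try:
        import torch
    except ImportError:
        return "cpu", "float32"  # PyTorch not installed; nothing to probe
    if torch.cuda.is_available():
        return "cuda", "float16"
    return "cpu", "float32"


device, dtype_name = pick_device_and_dtype()
print(f"device={device}, dtype={dtype_name}")
```

With PyTorch installed, `getattr(torch, dtype_name)` gives the value to pass as `torch_dtype` when loading the model.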



In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM # AutoTokenizer: Converts text (your prompt) into tokens (numbers the model understands).
                                                             # AutoModelForCausalLM: Loads a causal language model (good for text/code generation).
import torch # torch: PyTorch, used to run the model on CPU/GPU.


In [None]:
# Pick a model
model_name = "deepseek-ai/deepseek-coder-1.3b-instruct" # DeepSeek Coder (1.3B parameters) — an open-source coding model.

# Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(model_name) # Downloads and loads the tokenizer for this model.
model = AutoModelForCausalLM.from_pretrained( # Loads the actual AI model.
    model_name, # model we are going to use
    torch_dtype=torch.float16, # Uses half-precision to save memory & run faster on GPU.
    device_map="auto" # Automatically moves the model to GPU (if available), otherwise CPU.
)

# Define your prompt
prompt ="""### Instruction:
Output only valid Python code (no explanations, no comments, no markdown fences).

### Task:
1. Load the Titanic dataset from seaborn: sns.load_dataset("titanic").
2. Clean missing values:
   - Fill missing ages with the median age.
   - Fill missing embark_town with the most frequent value.
   - Drop COLUMNS (not rows) with too many missing values:
       na_ratio = titanic.isna().mean()
       cols_to_drop = na_ratio[na_ratio > 0.30].index
       titanic = titanic.drop(columns=cols_to_drop)
     Do NOT use dropna(subset=...).
3. Show summary statistics
4. Print missing counts after cleaning

### Response:
"""

# Tokenize input
inputs = tokenizer(prompt, return_tensors="pt").to(model.device) # Converts your text prompt into tokens (PyTorch tensor).
                                                                 # .to(model.device) ensures the tokens go to the same device as the model (GPU if available).

# Generate code
outputs = model.generate( # Runs the model to generate text/code.
    **inputs, # the tokenized prompt stored in the `inputs` variable from the previous step
    max_new_tokens=300, # Limits output length.
    do_sample=True,         # enable sampling so temperature/top_p apply
    temperature=0.7, # Controls randomness (lower = more deterministic, higher = more varied).
    top_p=0.9, # Nucleus sampling: samples only from the smallest set of tokens whose probabilities sum to at least 0.9.
    eos_token_id=tokenizer.eos_token_id,  # stop properly
    pad_token_id=tokenizer.eos_token_id   # avoid padding warnings
)

# Decode response
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True) # Converts the generated token IDs (numbers) back into readable text, ignoring special tokens like <pad> or <eos>.
#outputs[0] → contains all tokens (prompt + new response).
#inputs['input_ids'].shape[1]: → slices out only the newly generated tokens (so the model doesn’t just repeat the prompt).
#tokenizer.decode(..., skip_special_tokens=True) → converts tokens back to readable text, ignoring special symbols.
print(response)





```python
import seaborn as sns
import pandas as pd

# Load the Titanic dataset from seaborn
titanic = sns.load_dataset("titanic")

# Fill missing ages with the median age
titanic['age'] = titanic['age'].fillna(titanic['age'].median())

# Fill missing embark_town with the most frequent value
titanic['embark_town'] = titanic['embark_town'].fillna(titanic['embark_town'].mode()[0])

# Drop COLUMNS with too many missing values
na_ratio = titanic.isna().mean()
cols_to_drop = na_ratio[na_ratio > 0.30].index
titanic = titanic.drop(columns=cols_to_drop)

# Show summary statistics
print(titanic.describe(include='all'))

# Print missing counts after cleaning
print(titanic.isna().sum())
```
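
The `temperature` and `top_p` arguments used in the `generate` call are easier to reason about with a toy example. The sketch below is plain Python, not the `transformers` implementation: `top_p_filter` is a hypothetical helper that applies temperature to a few made-up scores and keeps the smallest set of tokens whose cumulative probability reaches `top_p`.

```python
import math


def top_p_filter(logits, temperature=0.7, top_p=0.9):
    """Return {token: renormalized probability} for the tokens kept by top-p."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    # Softmax (subtract the max for numerical stability).
    peak = max(scaled.values())
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    # Keep the most likely tokens until their cumulative probability reaches top_p.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}


# Made-up scores for four candidate next tokens.
toy_logits = {"import": 3.0, "def": 2.0, "print": 1.0, "banana": -2.0}
print(top_p_filter(toy_logits))
```

With these made-up scores only the two most likely tokens survive the filter, which is why sampled output stays plausible while still varying between runs.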



---

---

## **✅ Step 4: Summarizing Outputs with Free Models**

---

Imagine your code produced some output (say, a **confusion matrix** from a classification task). Instead of interpreting it manually, you’ll ask the model to **explain it**.



In [None]:
# Example "output" from a model or code execution
output_text = """
Confusion Matrix:
[[50, 10],
 [ 5, 35]]
"""

# Summarization prompt
summary_prompt = f"""### Instruction:
Explain this confusion matrix in 3 simple bullet points for a beginner.

### Data:
{output_text}

### Response:
"""

# Tokenize
inputs = tokenizer(summary_prompt, return_tensors="pt").to(model.device)

# Generate explanation
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    eos_token_id=tokenizer.eos_token_id, # tells the model when to stop.
    pad_token_id=tokenizer.eos_token_id # avoids padding warnings.
)

# Decode only new tokens
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
#outputs[0] → contains all tokens (prompt + new response).
#inputs['input_ids'].shape[1]: → slices out only the newly generated tokens (so the model doesn’t just repeat the prompt).
#tokenizer.decode(..., skip_special_tokens=True) → converts tokens back to readable text, ignoring special symbols.
print(response)

Sure, here's a simple explanation:

1. **True Positives (TP):** These are cases in which we predicted yes (they have the disease), and they actually have the disease.

2. **False Positives (FP):** These are cases in which we predicted no (they don't have the disease), but they actually have the disease.

3. **True Negatives (TN):** These are cases in which we predicted no (they don't have the disease), and they don't actually have the disease.

4. **False Negatives (FN):** These are cases in which we predicted yes (they have the disease), but they don't actually have the disease.

So, the confusion matrix is a table that describes the performance of a classification model (or "classifier") on a set of data for which the true values are known. It is a summary of predictions made by the
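
The model's answer above describes the four cell types in general terms but never uses the actual numbers, and its definitions of false positives and false negatives are swapped, so it is worth checking the matrix by hand. A minimal sketch, assuming the scikit-learn layout (rows are actual classes, columns are predicted classes, class 1 is "positive"); the notebook does not state which layout produced this matrix, and `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(matrix):
    """Compute basic metrics from a 2x2 confusion matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),  # of predicted positives, how many were right
        "recall": tp / (tp + fn),     # of actual positives, how many were found
    }


metrics = binary_metrics([[50, 10], [5, 35]])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under that assumed layout, the matrix gives an accuracy of 0.850, a precision of 0.778, and a recall of 0.875.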


---

---

## **🚀 Step 5: Writing Docstrings & Tests**

---

Here, we’ll ask the model to **analyze a function**, then generate:

1. A **Google-style docstring** (explains inputs, outputs, and behavior).
2. A **pytest unit test** (to check if the function works as expected).


In [None]:


# Example Function
function_code = """
def preprocess_data(df):
    df.fillna(0, inplace=True)
    return df
"""

# Prompt asking for docstring + pytest unit test
prompt = f"""### Instruction:
Write a Google-style docstring and a correct pytest unit test for this function.
Rules:
- Do NOT add `import` statements for the function (assume it's already defined).
- The unit test must contain exactly 3 assertions:
  1. `assert result.isnull().sum().sum() == 0`
  2. Explicit equality checks for a few known non-missing values.
  3. Explicit equality checks that missing values are replaced with 0.
- Do NOT use `.all().all()` style assertions.
- Do NOT use tautological conditions like (x == x).
- Only include pandas import if needed.

### Function:
{function_code}

### Response:
"""

# Tokenize
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate output
outputs = model.generate(
    **inputs,
    max_new_tokens=300,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id
)

# Decode only the new tokens
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)

```python
import pandas as pd
import pytest

def preprocess_data(df):
    """
    Preprocesses a pandas DataFrame by replacing missing values with 0.

    Parameters:
    df (pandas.DataFrame): The DataFrame to preprocess.

    Returns:
    pandas.DataFrame: The preprocessed DataFrame.
    """
    df.fillna(0, inplace=True)
    return df
```

### Unit Test

```python
def test_preprocess_data():
    df = pd.DataFrame({
        'A': [1, 2, None],
        'B': [None, None, 3],
        'C': [4, 5, None]
    })

    preprocess_data(df)

    assert df.isnull().sum().sum() == 0
    assert df['A'].sum() == 1
    assert df['B'].sum() == 0
    assert df['C'].sum() == 4
```

This test first creates a DataFrame with some missing values. It then calls the `preprocess_data` function on this DataFrame. After the function call, it checks that the DataFrame is now valid (i.e., all values are not null). It also checks that
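
The response above stops mid-sentence, most likely because generation reached the `max_new_tokens` limit before the model emitted its end-of-sequence token. One way to detect this is to inspect the newly generated token IDs. A minimal sketch, where `hit_token_limit` is a hypothetical helper and the token IDs in the usage lines are made up:

```python
def hit_token_limit(new_token_ids, eos_token_id, max_new_tokens):
    """Return True if generation used its full budget without emitting EOS."""
    return len(new_token_ids) >= max_new_tokens and eos_token_id not in new_token_ids


# Made-up token IDs: 2 stands in for the end-of-sequence token.
print(hit_token_limit([5, 9, 7, 2], eos_token_id=2, max_new_tokens=300))  # False
print(hit_token_limit([5, 9, 7, 4], eos_token_id=2, max_new_tokens=4))    # True
```

In the notebook, `new_token_ids` would be `outputs[0][inputs['input_ids'].shape[1]:].tolist()`. If the check returns `True`, raising `max_new_tokens` is the simplest remedy.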


---

## Step 5B: Manual Fixes to Model Output

---

Let’s carefully look at the generated output.

---

### ✅ What’s correct

* **Docstring**: Well-structured and clear. It is close to Google style, although Google style names the argument section `Args:` rather than `Parameters:`.
* **Function**: Correctly fills missing values with `0`.
* **Unit test structure**: Defines a `test_preprocess_data` function, creates a DataFrame, calls the preprocessing function, and runs assertions.

---

### ❌ What’s wrong

1. **Unnecessary imports**

   * `import pytest` is not needed, since the test doesn’t use any pytest features (like fixtures or `raises`).
   * Only `import pandas as pd` is required.

2. **Assertions are logically wrong**
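   Filling missing values with `0` does not change the column sums (pandas skips `NaN` when summing), so the test should expect the plain sums of the known values. The expected numbers in the generated test are simply wrong: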

   * `assert df['A'].sum() == 1` → Wrong, because column `A` becomes `[1, 2, 0]` → sum = 3, not 1.
   * `assert df['B'].sum() == 0` → Wrong, because column `B` becomes `[0, 0, 3]` → sum = 3, not 0.
   * `assert df['C'].sum() == 4` → Wrong, because column `C` becomes `[4, 5, 0]` → sum = 9, not 4.

3. **Assertions don’t test the actual behavior**
   Instead of checking sums, the test should verify that:

   * Missing values are replaced with `0`.
   * Known values remain unchanged.
   * No `NaN` values remain.

👉 So the main problems with the generated output are that the assertions expect the wrong sums and that the `pytest` import is unnecessary.

👉 To fix this, I will **manually rewrite the test assertions** so that they properly check:
1. All missing values are filled with 0.  
2. The DataFrame contains no null values after preprocessing.  
3. Known values remain unchanged.

---

## ✅ Corrected Version

In [None]:
import pandas as pd

def preprocess_data(df):
    """
    Preprocesses a pandas DataFrame by replacing missing values with 0.

    Args:
        df (pandas.DataFrame): The DataFrame to preprocess. It is modified in place.

    Returns:
        pandas.DataFrame: The preprocessed DataFrame.
    """
    df.fillna(0, inplace=True)
    return df


def test_preprocess_data():
    df = pd.DataFrame({
        'A': [1, 2, None],
        'B': [None, None, 3],
        'C': [4, 5, None]
    })

    result = preprocess_data(df)

    # ✅ Correct assertions
    assert result.isnull().sum().sum() == 0               # No NaN values left
    assert result.loc[0, 'A'] == 1                        # Known value unchanged
    assert result.loc[1, 'B'] == 0                        # Missing replaced with 0
    assert result.loc[2, 'C'] == 0                        # Missing replaced with 0


With these manual corrections, the unit test now validates the preprocessing logic correctly.  

---

---

## **🔹 Step 6: Build a Coding Assistant Notebook**

---

Now we’ll bring everything together into an **AI Coding Assistant**.  
This assistant will:

1. Take a user’s coding request as input.  
2. Generate Python code using a Hugging Face model.  
3. Allow the user to execute that code.  
4. Generate docstrings, tests, and explanations.  

This turns our notebook into a **mini Copilot**, fully open-source and free.

---

### **Model setup**

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM # AutoTokenizer → turns your text prompt into tokens (numbers).
                                                             # AutoModelForCausalLM → loads the model that predicts the next token (good for code/text).

import torch # needed because Hugging Face runs models on PyTorch.


---

### **Choose the model**

We’re using the instruction-tuned version.

This one follows prompts like “only output Python code, no explanation” better than the base version.

In [None]:
model_name = "deepseek-ai/deepseek-coder-1.3b-instruct"


---

### **Load tokenizer & model**


In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name) # Loads the tokenizer (downloaded from Hugging Face on first use, then cached).
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # saves memory, faster on GPU
    device_map="auto"           # automatically uses GPU if available
)


---

### **Define the helper function**


**First: Why “wrap in a function”?**

Right now, every time we want to generate something (code, docstring, explanation), we’re writing a lot of repeated lines:

```python
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, ...)
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
```

That’s **boilerplate** (repeated code).
If you don’t wrap it, you’ll need to copy-paste this block **every single time** you ask the model something.

👉 By putting this into a function, e.g., `generate_code(prompt)`, you can just call that function whenever you need a response.

That makes your notebook cleaner, easier to use, and avoids mistakes.


In [None]:
def generate_code(prompt, max_new_tokens=300, temperature=0.7, top_p=0.9):
    """Generate a completion for the prompt with the loaded DeepSeek model and return only the new text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate( # asks the model to continue your prompt.
        **inputs,  # unpacks the tokenized prompt.
        max_new_tokens=max_new_tokens, # how long the answer can be.
        do_sample=True, # allows randomness (instead of always picking the most likely word).
        temperature=temperature, # control creativity
        top_p=top_p, # control creativity
        eos_token_id=tokenizer.eos_token_id, # tells the model when to stop (end-of-sequence).
        pad_token_id=tokenizer.eos_token_id # used for padding (keeps batch dimensions aligned).
    )
    # Decode tokens → text
    response = tokenizer.decode( # converts numbers back to readable text (Python code here).
        outputs[0][inputs['input_ids'].shape[1]:],
        skip_special_tokens=True
    )
    return response.strip()



This is where you wrap the repeated steps into a **function**.
Now instead of writing 10 lines every time, you just call the function:

```python
generated_code = generate_code(user_prompt)
print(generated_code)
```

---

✅ **The function refactor is done.**
We’ll now reuse it for:

* generating code,
* writing docstrings/tests,
* explaining results.
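
Every prompt in this notebook follows the same `### Instruction:` / `### Response:` layout, so the prompt text can be wrapped the same way as the generation call. A minimal sketch; `build_prompt` and its parameters are hypothetical names, not part of `transformers`.

```python
def build_prompt(instruction, sections=None):
    """Assemble a prompt in the '### Instruction / ### Response' layout.

    sections is an optional dict mapping a heading (e.g. "Task") to its text.
    """
    parts = [f"### Instruction:\n{instruction.strip()}\n"]
    for heading, text in (sections or {}).items():
        parts.append(f"### {heading}:\n{text.strip()}\n")
    parts.append("### Response:\n")
    return "\n".join(parts)


prompt = build_prompt(
    "Output only valid Python code (no explanations, no comments, no markdown fences).",
    {"Task": '1. Load the Titanic dataset from seaborn: sns.load_dataset("titanic").\n2. Print the first 5 rows.'},
)
print(prompt)
```

The result can be passed straight to `generate_code(prompt)`.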

---


---

### **User Prompt → Code Generation:**

In [None]:
user_prompt = """### Instruction:
Output only valid Python code (no explanations, no comments, no markdown fences).

### Task:
1. Load the Titanic dataset from seaborn: sns.load_dataset("titanic").
2. Print the first 5 rows.

### Response:
"""

generated_code = generate_code(user_prompt)
print(" Generated Code:\n")
print(generated_code)


 Generated Code:

```python
import seaborn as sns

# Load the Titanic dataset
titanic = sns.load_dataset("titanic")

# Print the first 5 rows
print(titanic.head())
```


---

### **Run Generated Code:**

In [None]:
if "```" in generated_code:
    generated_code = generated_code.split("```")[1]  # take content inside the first fenced block
    generated_code = generated_code.removeprefix("python").strip()  # drop only the leading language tag (Python 3.9+)


# Carefully inspect generated code before running it
try:
    exec(generated_code)
    print("\n Code executed successfully")
except Exception as e:
    print("\n Error:", e)


   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  

 Code executed successfully
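
Splitting on the first fence works for the output shown here, but a slightly more defensive approach is to extract the fenced block with a regular expression and confirm that it parses before calling `exec`. A sketch under the assumption that the model returns at most one Python block; `extract_code` and `is_valid_python` are hypothetical helper names.

```python
import ast
import re

FENCE = "`" * 3  # built this way to avoid a literal fence inside this example
FENCE_RE = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_code(text):
    """Return the first fenced code block in text, or the text itself if unfenced."""
    match = FENCE_RE.search(text)
    return (match.group(1) if match else text).strip()


def is_valid_python(code):
    """Return True if code parses as Python (it is not executed)."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


sample = f"{FENCE}python\nimport math\nprint(math.sqrt(16))\n{FENCE}"
code = extract_code(sample)
print(code)
print(is_valid_python(code))            # True
print(is_valid_python("def broken(:"))  # False
```

A successful parse only rules out syntax errors (for example, from truncated output); it says nothing about whether the code is safe, so generated code should still be read before it is executed.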


---

### **Ask for Docstring & Tests:**

In [None]:
docstring_prompt = f"""### Instruction:
Write a Google-style docstring and pytest unit test for this code.

### Code:
{generated_code}

### Response:
"""

docstring_output = generate_code(docstring_prompt, max_new_tokens=250)
print(" Docstring & Tests:\n")
print(docstring_output)


 Docstring & Tests:

# Google-style docstring
"""
This script is used to load the Titanic dataset from seaborn.

Attributes:
    titanic (DataFrame): The Titanic dataset from seaborn.

Functions:
    sns.load_dataset("titanic"): Loads the Titanic dataset from seaborn.
"""

# Pytest unit test
import pytest
import seaborn as sns
import pandas as pd

@pytest.fixture
def titanic():
    return sns.load_dataset("titanic")

def test_titanic_load(titanic):
    assert isinstance(titanic, pd.DataFrame)
    assert titanic.shape[0] > 0
    assert titanic.shape[1] > 0

# The fixture function is used to create a shared instance of the dataset for testing
# The test function checks that the dataset is a DataFrame and that it has at least one row and one column.
# If these conditions are not met, the test will fail and


---

### **Ask for Explanation:**

In [None]:
result_text = "Suppose my confusion matrix is [[50,10],[5,35]]"

explain_prompt = f"""### Instruction:
Explain this confusion matrix in 3 simple bullet points.

### Input:
{result_text}

### Response:
"""

explanation = generate_code(explain_prompt, max_new_tokens=150)
print("Explanation:\n")
print(explanation)


Explanation:

Sure, here are the three simple bullet points that explain the confusion matrix:

1. The confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of data for which the true values are known.

2. In this context, the diagonal elements represent the number of points for which the predicted label is equal to the true label, while off-diagonal elements are those that are mislabeled by the classifier.

3. In the given confusion matrix, the true labels are 0 (for the first group) and 1 (for the second group), and the predicted labels are 0 (for the first group) and 1 (
