<img src="https://theaiengineer.dev/tae_logo_gw_flatter.png" width=35% align=right>

# Building a Large Language Model from Scratch — A Step-by-Step Guide Using Python and PyTorch
## Chapter 3 — Setting Up the Project
**© Dr. Yves J. Hilpisch**<br>AI-Powered by GPT-5.

## How to Use This Notebook

- Clone or sync the repository into your Colab workspace in a reproducible way.
- Install Python dependencies with pinned versions for deterministic runs.
- Validate the setup by running a smoke test that imports key modules.

### Repository Setup Options

There are two common paths in Colab:

1. Mount a cloud drive (Google Drive, Dropbox, etc.) and access the synchronized folder.
2. Clone the repository directly with `git clone`.

Pick the approach that gives you the most reliable storage for persistent artifacts.

In [None]:
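# Example: cloning via Git (replace the URL with your fork if needed)
import os
import subprocess

repo_url = "https://github.com/your-org/atto-llm-book.git"
target_dir = "project_repo"

if not os.environ.get('COLAB_RELEASE_TAG'):
    print('Skipping git clone outside Colab. Run this cell manually when you have network access.')
elif os.path.isdir(target_dir):
    # Re-running the cell would otherwise fail because the folder already exists.
    print(f"'{target_dir}' already exists; skipping git clone.")
else:
    # A shallow clone (--depth=1) fetches only the latest commit.
    subprocess.run(['git', 'clone', '--depth=1', repo_url, target_dir], check=True)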
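
The first option, mounting a cloud drive, can be sketched in a similar guarded style. This is a minimal sketch for Google Drive only, using `drive.mount` from the `google.colab` package that is available inside Colab. The helper name `mount_drive` and the folder name `atto-llm-book` are hypothetical; choose whatever fits your own Drive layout.

```python
import os


def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive in Colab; return the mount point, or None elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        print("Not running in Colab; skipping Drive mount.")
        return None
    drive.mount(mount_point)  # prompts for authorization on first use
    return mount_point


mount_point = mount_drive()
if mount_point:
    # Hypothetical project folder inside your Drive (assumption, rename freely).
    project_dir = os.path.join(mount_point, "MyDrive", "atto-llm-book")
    os.makedirs(project_dir, exist_ok=True)
    print(f"Persistent project folder: {project_dir}")
```

Files written below the mount point survive a runtime reset, which makes this option a natural home for checkpoints, while a fresh `git clone` is usually the simpler way to get the code itself.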


### Dependency Management

Use a requirements file with pinned versions whenever possible, so that future training runs and deployments start from the same baseline environment.

In [None]:
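# Use pip to install only what you need for this chapter.
import os
import subprocess
import sys

# The path is relative to the current working directory; adjust it if the
# file lives inside the cloned repository folder.
requirements_file = 'requirements.txt'

if os.environ.get('COLAB_RELEASE_TAG'):
    # Calling pip through sys.executable installs into the interpreter that runs this notebook.
    subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', '-r', requirements_file], check=True)
else:
    print('Skipping pip install outside Colab to keep validation fast. Run this in Colab when needed.')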
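
To make the idea of pinned versions concrete, the sketch below builds `name==version` lines from whatever is currently installed, using the standard-library `importlib.metadata` module. The helper name `pin_requirements` is hypothetical, and the package list is only an example; it is one possible way to produce a pinned requirements file, not a prescribed workflow.

```python
from importlib import metadata


def pin_requirements(packages):
    """Return one 'name==version' line per installed package in `packages`."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Keep a visible marker instead of silently dropping the package.
            lines.append(f"# {name} is not installed")
    return lines


# Example package list (assumption): adapt it to what this chapter needs.
pinned = pin_requirements(["torch", "numpy"])
print("\n".join(pinned))
```

The printed lines can be pasted into `requirements.txt` so that a later `pip install -r requirements.txt` reproduces the same versions.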


### Smoke Test the Environment

Import the utilities that later notebooks expect. If this cell fails, resolve the issue before moving forward.

In [None]:
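try:
    # These imports assume the folder that contains `code/` is on sys.path.
    from code.attollm.data import dataset
    from code.attollm.model import transformer
    print("Modules imported successfully.")
except ModuleNotFoundError as exc:
    print(f"Import failed: {exc}")
    # Python's standard library also ships a module named `code`; a message
    # ending in "'code' is not a package" means that module was found first.
    print("Adjust PYTHONPATH or install missing packages before continuing.")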
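
If the import fails because the repository is not on the module search path, one option is to extend `sys.path` for the current session. The sketch below assumes the repository was cloned into `project_repo`, as in the cloning cell, and that the `code` package sits at the top level of that folder (an assumption about the repository layout). The helper name `add_to_sys_path` is hypothetical.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory):
    """Put `directory` at the front of sys.path once and return its absolute path."""
    path = str(Path(directory).resolve())
    if path not in sys.path:
        # Inserting at position 0 gives this folder priority during import lookup.
        sys.path.insert(0, path)
    return path


# Assumed clone location; change it if you cloned or synced somewhere else.
repo_root = add_to_sys_path("project_repo")
print(f"Added to sys.path: {repo_root}")
```

This change lasts only for the running kernel. Restart the runtime after changing the path if a module with the same name was already imported, then run the smoke test again.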


### Configuration Snapshot

Capture the versions of critical libraries so your future self (or collaborators) knows what environment produced your results.

In [None]:
import json
from importlib import metadata  # standard-library replacement for the deprecated pkg_resources

packages_of_interest = ["torch", "numpy", "pandas", "transformers"]
versions = {}
for pkg in packages_of_interest:
    try:
        versions[pkg] = metadata.version(pkg)
    except metadata.PackageNotFoundError:
        versions[pkg] = "not installed"
print(json.dumps(versions, indent=2))
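
A printed snapshot disappears with the notebook output, so it can help to store it next to your results. The sketch below extends the idea with the Python version and platform string and writes everything to a JSON file. The helper name `environment_snapshot` and the file name `environment_snapshot.json` are hypothetical, and the temporary folder is used only to keep the example free of side effects.

```python
import json
import platform
import sys
import tempfile
from importlib import metadata
from pathlib import Path


def environment_snapshot(packages):
    """Collect interpreter, platform, and package versions in one dictionary."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


snapshot = environment_snapshot(["torch", "numpy", "pandas", "transformers"])

# Hypothetical target file; point this at your artifact folder in practice.
snapshot_path = Path(tempfile.gettempdir()) / "environment_snapshot.json"
snapshot_path.write_text(json.dumps(snapshot, indent=2))
print(f"Snapshot written to {snapshot_path}")
```

Saving one such file per training run makes it easier to trace a result back to the environment that produced it.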


## Exercises

- Decide where you will store checkpoints and large artifacts; document the path in your notes.
- Create a new Python virtual environment (locally or in Colab) and list the exact activation steps.
- Write a short shell script that provisions the environment from scratch using the commands above.
