## **[tool.poetry]**

This section contains metadata about the project. It defines the core information needed for the project setup and package distribution.

### **Details:**

- **`name`**:
  - `name = "llm-engineering"`
  - The name of the project. It is used as the package name if this project is published as a Python package.

- **`version`**:
  - `version = "0.1.0"`
  - Specifies the version of the project, following semantic versioning conventions (major.minor.patch).

- **`description`**:
  - `description = ""`
  - A short description of the project. It’s currently empty, but it is recommended to add a concise summary of what the project does.

- **`authors`**:
  - `authors = ["iusztinpaul <p.b.iusztin@gmail.com>"]`
  - A list of contributors or maintainers for the project. It includes the author’s name and email in the format `Name <email>`.

- **`license`**:
  - `license = "MIT"`
  - Indicates the license type for the project (e.g., MIT, Apache-2.0). This clarifies how the project can be used or modified.

- **`readme`**:
  - `readme = "README.md"`
  - Specifies the README file (usually Markdown) that documents the project. Poetry also uses its contents as the package's long description when the project is published.
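
Put together, the keys described above would look roughly like the fragment below. This is a sketch assembled only from the values listed in this section; the real file may contain additional keys.

```toml
[tool.poetry]
name = "llm-engineering"
version = "0.1.0"
description = ""   # currently empty; a one-line summary is recommended
authors = ["iusztinpaul <p.b.iusztin@gmail.com>"]
license = "MIT"
readme = "README.md"
```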

---

### **Why It Matters**

This section is crucial for:
- Identifying the project.
- Setting expectations for users and collaborators.
- Preparing the project for distribution (e.g., uploading to PyPI).

## **[tool.poetry.dependencies]**

This section defines the dependencies required for the project to run. These are the Python packages that will be installed when the project is set up.

### **Details:**

- **`python = "~3.11"`**:
  - Specifies the Python version required for the project.
  - The `~` symbol means that any patch release of Python `3.11` (e.g., `3.11.1`, `3.11.8`) is acceptable, but it will not allow minor version upgrades (e.g., `3.12`).

- **Dependencies and Versions**:
  - **`zenml = { version = "0.67.0", extras = ["server"] }`**:
    - Requires `zenml` version `0.67.0`.
    - The `extras = ["server"]` specifies optional features to include during installation (e.g., server functionality).

  - **`pymongo = "^4.6.2"`**:
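    - Requires `pymongo` version `4.6.2` or higher, but lower than `5.0.0`.
    - The `^` operator allows updates that do not change the leftmost non-zero component of the version. For versions `1.0.0` and above, the major version stays fixed. For `0.x` releases (e.g., `loguru = "^0.7.2"`), the minor version stays fixed, so the allowed range is `>=0.7.2, <0.8.0`.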
    - **Description**:
      - `pymongo` is the official Python driver for MongoDB, a popular NoSQL database. It provides tools to connect to MongoDB instances, query and manipulate the database, and perform CRUD operations (Create, Read, Update, Delete).
      - It supports advanced features like indexing, aggregation pipelines, and replica set management.
      - Useful in data pipelines, where MongoDB serves as a storage layer.

  - **Other Dependencies**:
    - **`click = "^8.0.1"`**: CLI creation library.
    - **`loguru = "^0.7.2"`**: Logging library.
    - **`rich = "^13.7.1"`**: For creating rich terminal output.
    - **`numpy = "^1.26.4"`**: Numerical computation library.
    - **`poethepoet = "0.29.0"`**: Task runner for managing commands. With no operator, the version is pinned to exactly `0.29.0`.
    - **`datasets = "^3.0.1"`**: Library for accessing and managing datasets.

- **Commented Sections**:
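  - Comments in the file group the remaining dependencies into logical categories (e.g., **Digital data ETL**, **Feature engineering**, **RAG**, **Inference**) for better organization. These comments are for readers only and have no effect on installation.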

  - **Digital Data ETL**:
    - Libraries like `selenium`, `webdriver-manager`, `beautifulsoup4`, etc., are used for extracting and transforming web-based data.

  - **Feature Engineering**:
    - Libraries like `qdrant-client`, `langchain`, and `sentence-transformers` are used for building and processing features, particularly in machine learning workflows.

  - **RAG (Retrieval-Augmented Generation)**:
    - Dependencies like `langchain-openai`, `tiktoken`, and `jinja2` facilitate retrieval-augmented generation pipelines and text processing.

  - **Inference**:
    - Libraries such as `fastapi` and `uvicorn` are used for deploying and managing inference services.
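
To make the version operators concrete, the fragment below repeats the constraints quoted in this section and annotates each with the range Poetry resolves it to. It is a sketch limited to the entries listed above; the dependencies under the commented categories are omitted because their constraints are not shown here.

```toml
[tool.poetry.dependencies]
python = "~3.11"                                     # >=3.11.0, <3.12.0
zenml = { version = "0.67.0", extras = ["server"] }  # exactly 0.67.0, with the "server" extra
pymongo = "^4.6.2"                                   # >=4.6.2, <5.0.0
click = "^8.0.1"                                     # >=8.0.1, <9.0.0
loguru = "^0.7.2"                                    # >=0.7.2, <0.8.0 (0.x: minor stays fixed)
rich = "^13.7.1"                                     # >=13.7.1, <14.0.0
numpy = "^1.26.4"                                    # >=1.26.4, <2.0.0
poethepoet = "0.29.0"                                # exactly 0.29.0
datasets = "^3.0.1"                                  # >=3.0.1, <4.0.0
```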

---

### **Why It Matters**

This section:
- Defines the environment the project runs in.
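- Constrains the allowed dependency versions. Together with the `poetry.lock` file that Poetry generates, which records the exact versions resolved, this makes installs reproducible.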
- Allows grouping dependencies by functionality, making it easier to manage and understand the project's needs.


## **[tool.poetry.group.dev.dependencies]**

This section specifies the development dependencies for the project. These are tools and libraries that are required only during development, such as linters, testing frameworks, and pre-commit hooks.

### **Details:**

- **`ruff = "^0.4.9"`**:
  - `ruff` is a fast Python linter and formatter designed to enforce code style and improve code quality.
  - It supports many common Python style guides and offers features like detecting unused imports, enforcing consistent formatting, and catching syntax issues.

- **`pre-commit = "^3.7.1"`**:
  - A framework for managing and executing pre-commit hooks.
  - Pre-commit hooks run checks or tasks (e.g., linting, formatting, security checks) automatically before committing code, ensuring code quality and consistency.

- **`pytest = "^8.2.2"`**:
  - A popular Python testing framework used to write and run tests.
  - It supports fixtures, parameterized tests, and plugins, making it highly extensible and suitable for various testing needs.
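
As a sketch, the development group described above would be declared under its own table, with the group name (`dev`) embedded in the table header. The range comments are annotations, not part of the original file.

```toml
[tool.poetry.group.dev.dependencies]
ruff = "^0.4.9"        # >=0.4.9, <0.5.0 (0.x: minor stays fixed)
pre-commit = "^3.7.1"  # >=3.7.1, <4.0.0
pytest = "^8.2.2"      # >=8.2.2, <9.0.0
```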

---

### **Why It Matters**

This section:
- Ensures the development environment includes tools to maintain high code quality and robustness.
- Separates development-only dependencies from runtime dependencies, reducing unnecessary package installations in production environments.
- Helps enforce best practices during the software development lifecycle.

---

### **How This Section Works**

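- Poetry organizes dependencies into **dependency groups**, making it easy to install only the dependencies relevant to a particular task.
- Groups are installed by default by `poetry install` unless they are declared with `optional = true`. If the `dev` group is optional, include it explicitly with:
  ```bash
  poetry install --with dev
  ```
- To leave the development tools out (e.g., in a production image), use `poetry install --without dev`.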


## **[tool.poetry.group.aws.dependencies]**

This section specifies the dependencies required for integrating with AWS services. These dependencies are grouped under the `aws` category and are typically used when deploying the application on AWS or interacting with AWS resources.

### **Details:**

- **`sagemaker = ">=2.232.2"`**:
  - The `sagemaker` library is used to work with Amazon SageMaker, a fully managed service for building, training, and deploying machine learning models at scale.
  - The `>=` constraint sets only a lower bound: any release from `2.232.2` onward is accepted, including future major versions.

- **`s3fs = ">2022.3.0"`**:
  - A Pythonic file system interface for accessing Amazon S3 (Simple Storage Service) buckets.
  - Enables reading from and writing to S3 buckets as if they were local files.
  - The `>` constraint accepts any release newer than `2022.3.0`, with no upper bound.

- **`aws-profile-manager = "^0.7.3"`**:
  - A tool to manage AWS credentials and profiles efficiently.
  - Useful when switching between multiple AWS accounts or roles.

- **`kubernetes = "^30.1.0"`**:
  - A library for interacting with Kubernetes clusters from Python.
  - This might be used to manage AWS EKS (Elastic Kubernetes Service) clusters.

- **`sagemaker-huggingface-inference-toolkit = "^2.4.0"`**:
  - A toolkit for deploying Hugging Face models to SageMaker endpoints.
  - Provides utilities for serving models with SageMaker's inference APIs.
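
For reference, the group described above can be sketched as the following table. The values are the ones quoted in this section; the comments spell out the range each constraint allows.

```toml
[tool.poetry.group.aws.dependencies]
sagemaker = ">=2.232.2"                             # lower bound only
s3fs = ">2022.3.0"                                  # strictly newer than 2022.3.0, no upper bound
aws-profile-manager = "^0.7.3"                      # >=0.7.3, <0.8.0
kubernetes = "^30.1.0"                              # >=30.1.0, <31.0.0
sagemaker-huggingface-inference-toolkit = "^2.4.0"  # >=2.4.0, <3.0.0
```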

---

### **Why It Matters**

This section:
- Groups AWS-specific dependencies so they can be left out when not needed, either with `poetry install --without aws` or by declaring the group with `optional = true`.
- Focuses on tools that facilitate cloud deployment, model serving, and resource management on AWS.
- Helps ensure the project is modular, allowing only necessary dependencies to be installed when deploying to AWS.
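
A dependency group is skipped by default only when it is explicitly marked optional. The fragment below is a hypothetical sketch of that declaration; whether this project actually sets `optional = true` for the `aws` group is an assumption, not something stated in this document.

```toml
# Hypothetical: marks the group as opt-in, so it is installed
# only when requested with `poetry install --with aws`.
[tool.poetry.group.aws]
optional = true

[tool.poetry.group.aws.dependencies]
sagemaker = ">=2.232.2"
```

Without the `optional = true` line, `poetry install` installs the group by default and `poetry install --without aws` excludes it.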

---

### **How This Section Works**

- To install the `aws` dependency group explicitly (needed only if the group is declared optional), use:
  ```bash
  poetry install --with aws
  ```


## **Parallel Dependencies for Azure**

If you plan to work with Microsoft Azure services instead of AWS, here are equivalent or parallel libraries and dependencies for integrating and managing Azure resources in Python:

### **Details:**

- **`azureml-sdk`**:
  - Equivalent to `sagemaker` for AWS.
  - Used for building, training, and deploying machine learning models in Azure Machine Learning (AzureML).
  - Includes tools for managing AzureML workspaces, compute clusters, and experiments.

- **`adlfs`**:
  - Equivalent to `s3fs` for AWS.
  - Provides a Pythonic interface for working with Azure Data Lake Storage (ADLS) Gen1 and Gen2.
  - Enables file system operations such as reading, writing, and listing files in ADLS.

- **`azure-identity`**:
  - Equivalent to `aws-profile-manager` for AWS.
  - A library for managing Azure credentials and authentication, particularly for accessing Azure resources securely using service principals, managed identities, or interactive login.

- **`azure-mgmt-containerservice`**:
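  - The closest counterpart to the `kubernetes` entry in the AWS group, but with a different scope: it manages Azure Kubernetes Service (AKS) cluster resources programmatically.
  - Useful for creating, updating, and managing Kubernetes clusters hosted on Azure.
  - The `kubernetes` client itself is cloud-agnostic, so it can still be used to work with workloads running inside an AKS cluster.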

- **`azureml-hyperdrive`**:
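  - Listed as the parallel to `sagemaker-huggingface-inference-toolkit`, although the two serve different purposes: this entry covers hyperparameter tuning, not model serving.
  - A toolkit for automating hyperparameter tuning experiments in AzureML.
  - Provides utilities for optimizing model configurations using distributed training on Azure resources.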

---

### **Why These Dependencies Matter**

- These libraries allow you to seamlessly integrate your Python application with Azure’s machine learning and cloud computing services.
- They provide equivalent functionality to the AWS dependencies in the `aws.dependencies` group but are tailored to Azure's ecosystem.
- Using these packages ensures that your project can operate effectively in Azure’s cloud infrastructure.

---

### **How to Add These Dependencies**

To include these Azure-specific dependencies in your `pyproject.toml`, you can define an additional group:

```toml
[tool.poetry.group.azure.dependencies]
azureml-sdk = "^1.48.0"
adlfs = "^2023.5.0"
azure-identity = "^1.14.0"
azure-mgmt-containerservice = "^23.0.0"
azureml-hyperdrive = "^1.48.0"
```


## **[build-system]**

This section specifies the build system requirements for the project. It defines the tools and configurations used to build the package and prepare it for distribution.

### **Details:**

- **`requires`**:
  - `requires = ["poetry-core"]`
  - Lists the packages that must be installed in order to build the project. Here the only build requirement is `poetry-core`.
  - `poetry-core` is a lightweight core library extracted from Poetry to manage the build process.

- **`build-backend`**:
  - `build-backend = "poetry.core.masonry.api"`
  - Defines the backend used for building the package.
  - This points to Poetry’s backend API, which processes the build metadata and generates distributions (e.g., `.whl` or `.tar.gz` files).
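
Taken together, the two keys described above form the following table. This is a sketch built from the values quoted in this section.

```toml
[build-system]
requires = ["poetry-core"]                 # packages needed at build time
build-backend = "poetry.core.masonry.api"  # entry point the build frontend calls
```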

---

### **Why It Matters**

- The `[build-system]` table is defined by **PEP 518**, and its `build-backend` key by **PEP 517**. Together they are the standard way to declare how a Python package is built. If the table is missing, build tools fall back to a legacy `setup.py`-based build.
- It ensures that tools like Poetry or `pip` know how to build the project.
- By using `poetry-core`, the build needs only Poetry's lightweight build backend. The full Poetry tool, including its dependency resolver and CLI, is not required to build the package.

---

### **How It Works**

When running a build command like:
```bash
poetry build
```
- The specified `build-backend` (`poetry.core.masonry.api`) uses the dependencies and metadata defined in the `pyproject.toml` file to create a distributable package.
- The `requires` field ensures that `poetry-core` is available during the build process, so the build backend has the necessary tools to execute the build.
- The build process generates distribution files in the `dist/` directory:
  - A source archive (`.tar.gz`).
  - A wheel file (`.whl`).

To publish the package to a repository like PyPI, use the command:
```bash
poetry publish --build
```
This ensures the project complies with modern Python packaging standards and simplifies the build and distribution process.


## **[tool.poe.tasks]**

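This section defines custom tasks for the project using **Poe the Poet**, a task runner designed for Python projects. It simplifies repetitive or complex commands by allowing them to be stored in the configuration file and executed with a simple `poetry poe <task>` command. The `poetry poe` form relies on Poe the Poet being installed as a Poetry plugin; without the plugin, the same task can be run with `poetry run poe <task>`.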

### **Details:**

- **General Syntax**:  
  Each task is defined with a name and a command or a sequence of commands.  
  Example:  
  `run-digital-data-etl-alex = "echo 'It is not supported anymore.'"`

- **Task Categories**:  
  Tasks are grouped into functional categories for clarity:
  - **Data pipelines**: Automate ETL processes, feature engineering, and dataset preparation.
  - **Utility pipelines**: Manage exporting and importing data artifacts.
  - **Training pipelines**: Automate the training and evaluation processes for models.
  - **Inference**: Handle inference-related processes such as calling APIs or running services.
  - **Infrastructure**: Manage local and cloud infrastructure setups, including Docker and AWS.
  - **QA (Quality Assurance)**: Automate code linting, formatting, and security checks.
  - **Tests**: Run test suites for validating the codebase.

### **Highlighted Tasks**:

#### **Data Pipelines**:
- `run-digital-data-etl-maxime`:  
  Runs a specific ETL process with a configuration file:  
  `poetry run python -m tools.run --run-etl --no-cache --etl-config-filename digital_data_etl_maxime_labonne.yaml`

- `run-feature-engineering-pipeline`:  
  Executes a feature engineering pipeline:  
  `poetry run python -m tools.run --no-cache --run-feature-engineering`

#### **Inference**:
- `run-inference-ml-service`:  
  Starts an inference service using FastAPI:  
  `poetry run uvicorn tools.ml_service:app --host 0.0.0.0 --port 8000 --reload`

- `call-inference-ml-service`:  
  Sends a POST request to the local inference service using `curl`:  
  `curl -X POST 'http://127.0.0.1:8000/rag' -H 'Content-Type: application/json' -d '{"query": "My query"}'`

#### **Infrastructure**:
- `local-docker-infrastructure-up`:  
  Brings up the local Docker infrastructure:  
  `docker compose up -d`

- `set-local-stack`:  
  Sets the default local ZenML stack:  
  `poetry run zenml stack set default`

#### **QA (Quality Assurance)**:
- `lint-check`:  
  Runs the `ruff` linter to check code style:  
  `poetry run ruff check .`

- `format-fix`:  
  Formats the code using `ruff`:  
  `poetry run ruff format .`

#### **Tests**:
- `test`:  
  Runs the test suite using `pytest`:  
  `poetry run pytest tests/`
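
As a sketch, a few of the tasks highlighted above would be declared as shown below. The first four entries use the commands quoted in this section. The last entry, `qa-all`, is a hypothetical name added only to illustrate a sequence task and is not taken from the project; in an array, plain strings are treated as references to other tasks by name.

```toml
[tool.poe.tasks]
# Single-command tasks (commands as quoted above)
run-feature-engineering-pipeline = "poetry run python -m tools.run --no-cache --run-feature-engineering"
lint-check = "poetry run ruff check ."
format-fix = "poetry run ruff format ."
test = "poetry run pytest tests/"

# Hypothetical sequence task: runs the referenced tasks in order
qa-all = ["lint-check", "format-fix"]
```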

---

### **Why It Matters**

- Automates repetitive tasks, improving development efficiency.
- Provides a clear, structured way to organize commands related to data processing, training, inference, and infrastructure.
- Integrates seamlessly with Poetry, allowing commands to be executed within the managed environment.

---

### **How It Works**

1. Define tasks in the `[tool.poe.tasks]` section of the `pyproject.toml` file.
2. Each task can be a single command or a sequence of commands.
3. Execute tasks using the following syntax:
   ```bash
   poetry poe <task-name>
   ```

For example, to run the `run-feature-engineering-pipeline` task:

```bash
poetry poe run-feature-engineering-pipeline
```