# Mastering Configuration in Data Science Projects

Configuration management is a critical aspect of data science projects. The separation of configuration from source code offers numerous benefits. This article will explore the importance of this practice, compare various configuration file formats, introduce the use of Pydantic BaseSettings, and provide an example of a web scraping project that employs these technologies.

## Importance of Separating Configuration from Source Code

1. Environment Flexibility: By separating configuration from the source code, you can easily adapt your application to different environments (development, testing, production) without changing the code.

2. Security: Sensitive data such as API keys, database credentials, and other secrets should not be hard-coded into the application. Keeping them in a separate configuration file allows for better security practices.

3. Maintainability and Scalability: Changes in the environment or third-party services should not require a change in the codebase. A well-structured configuration file makes it easier to update these details and scale the application.
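
As a minimal sketch of the points above, the snippet below reads a database URL from an environment variable instead of hard-coding it. The variable name `APP_DATABASE_URL`, the fallback value, and the helper `get_database_url` are assumptions made for illustration only.

```python
import os

# Hypothetical variable name and fallback, chosen for illustration only.
ENV_VAR = "APP_DATABASE_URL"
DEFAULT_URL = "sqlite:///dev.db"


def get_database_url() -> str:
    """Return the database URL for the current environment."""
    # The code stays the same; only the environment decides the value.
    return os.environ.get(ENV_VAR, DEFAULT_URL)


# Development: the variable is not set, so the local fallback is used.
os.environ.pop(ENV_VAR, None)
dev_url = get_database_url()

# Production: the deployment sets the variable, with no code change required.
os.environ[ENV_VAR] = "postgresql://user:password@db.internal:5432/prod"
prod_url = get_database_url()

print(dev_url)
print(prod_url)
```

The secret never appears in the source tree, and switching environments only requires changing what the process sees at startup.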

## Comparison of Configuration File Formats
Several formats are commonly used for configuration files, each with its own strengths and weaknesses:

1. JSON: Easy to use with JavaScript and many other languages, but lacks comments and can be verbose.
2. YAML: More human-readable and supports comments, but its indentation-sensitive syntax has many edge cases and parsing is generally slower than JSON.
3. INI: Simple and easy to write, but lacks standardization and advanced features.
4. TOML: Designed to be easy to parse and write, supports typed values such as dates, arrays, and nested tables, but is less widely adopted than JSON or YAML.
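
One practical difference between these formats is how much type information survives parsing. The sketch below, which uses only the standard library and made-up keys (`scraping`, `timeout`, `retries`), loads the same values from JSON and from INI to show that JSON returns typed values while INI returns strings unless you convert them explicitly.

```python
import configparser
import json

# The same hypothetical settings, written in two formats.
json_text = '{"scraping": {"timeout": 30, "retries": 3}}'
ini_text = """
[scraping]
timeout = 30
retries = 3
"""

json_config = json.loads(json_text)

ini_config = configparser.ConfigParser()
ini_config.read_string(ini_text)

# JSON keeps the number as an int.
json_timeout = json_config["scraping"]["timeout"]

# INI values are always strings; conversion is the caller's job.
ini_timeout = ini_config["scraping"]["timeout"]
ini_timeout_int = ini_config.getint("scraping", "timeout")

print(type(json_timeout).__name__, type(ini_timeout).__name__)
```

This is one reason a validation layer on top of the raw file is useful, whichever format you choose.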

## Leveraging Pydantic for Configuration Management
Pydantic offers several benefits for configuration management. It provides strong typing of configuration variables, which helps catch errors early and improves code readability. Pydantic’s BaseSettings class automatically parses values from environment variables and validates them, which simplifies loading configurations. It reads .env files out of the box; other formats such as JSON, YAML, or TOML can be parsed first and then passed to the settings class for validation. Because settings classes are built on type hints, editors can offer autocomplete and type checking, which makes development more efficient. Lastly, Pydantic’s declarative class syntax helps keep configuration code clean and maintainable. Note that in Pydantic v2, BaseSettings lives in the separate pydantic-settings package.

## Project Example: Web Scraping Financial Website

Let’s consider a project where we scrape financial data from a website, clean the data, compute metrics, and save the data to a database. We’ll use Pydantic’s BaseSettings to manage our configuration, which we will store in a TOML configuration file.

### Project Structure
Here is a possible structure for our project:
```
/my_project
|-- /src
|   |-- __init__.py
|   |-- main.py
|   |-- scrape.py
|   |-- clean.py
|   |-- compute.py
|-- /config
|   |-- settings.toml
|-- /tests
|   |-- __init__.py
|   |-- test_scrape.py
|   |-- test_clean.py
|   |-- test_compute.py
|-- README.md
|-- requirements.txt

```
In this structure, the src directory contains the source code of our project, divided into different modules (scrape.py, clean.py, compute.py) plus main.py, which orchestrates them and saves the results to the database. The config directory contains our configuration file, settings.toml. The tests directory contains our unit tests.

### Configuration File 
Here’s an example of what our settings.toml file could look like:
```toml
[default]
scraping_url = "http://example.com"
database_url = "postgresql://user:password@localhost:5432/mydatabase"
metrics = ["metric1", "metric2", "metric3"]

```
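
To see how this file looks once parsed, here is a small sketch that loads the same content from a string. It assumes either Python 3.11+ (for the standard-library `tomllib`) or the third-party `toml` package on older versions; both expose a `loads` function for strings.

```python
try:
    import tomllib as toml_parser  # standard library, Python 3.11+
except ModuleNotFoundError:
    import toml as toml_parser  # third-party package on older versions

settings_text = '''
[default]
scraping_url = "http://example.com"
database_url = "postgresql://user:password@localhost:5432/mydatabase"
metrics = ["metric1", "metric2", "metric3"]
'''

config = toml_parser.loads(settings_text)

# The [default] table becomes a nested dictionary under the "default" key.
default = config["default"]
print(sorted(default))
print(default["metrics"])
```

Because the values sit under the `[default]` table, any code that consumes the file has to index into `config["default"]` rather than read the keys from the top level.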
### Using Pydantic BaseSettings with TOML
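Pydantic’s BaseSettings class can validate values loaded from a TOML configuration file. BaseSettings does not parse TOML by itself: its env_file option expects a dotenv-style file, so we read settings.toml with the toml package and pass the [default] table to our Settings class:
```python
from typing import List

import toml
from pydantic import AnyHttpUrl, PostgresDsn

try:
    # Pydantic v2 moved BaseSettings to the pydantic-settings package.
    from pydantic_settings import BaseSettings
except ImportError:
    # Pydantic v1 ships BaseSettings in the main package.
    from pydantic import BaseSettings


class Settings(BaseSettings):
    scraping_url: AnyHttpUrl
    database_url: PostgresDsn
    metrics: List[str]


def load_settings(path: str = "config/settings.toml") -> Settings:
    # Parse the TOML file and validate the [default] table against Settings.
    config = toml.load(path)
    return Settings(**config["default"])


settings = load_settings()
```
In this example, load_settings parses settings.toml with toml.load and passes the keys of the [default] table to Settings as keyword arguments. Pydantic then validates each value against its declared type, so an invalid URL or a missing key raises a validation error at startup. The try/except import keeps the example working with both Pydantic v1 and v2.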

With the configuration loaded and validated, the rest of the project can read its settings from a single object. In our web scraping script, we can now use these settings:





```python
# scrape.py
import pandas as pd


def scrape_website(url: str) -> pd.DataFrame:
    """
    Scrapes the given website URL and returns the scraped data.

    Args:
        url (str): The URL of the website to scrape.

    Returns:
        pd.DataFrame: The scraped data.
    """
    ...
```

```python
# clean.py
import pandas as pd


def clean_data(data: pd.DataFrame) -> pd.DataFrame:
    """
    Cleans the given data and returns the cleaned data.

    Args:
        data (pd.DataFrame): The data to clean.

    Returns:
        pd.DataFrame: The cleaned data.
    """
    ...
```

```python
# compute.py
import pandas as pd


def compute_metrics(data: pd.DataFrame, metrics: list[str]) -> pd.DataFrame:
    """
    Computes the given metrics on the data and returns the results.

    Args:
        data (pd.DataFrame): The data to compute metrics on.
        metrics (list[str]): The list of metrics to compute.

    Returns:
        pd.DataFrame: The results of the computed metrics.
    """
    ...
```

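```python
# main.py
import pandas as pd
import toml
from pydantic import AnyHttpUrl, PostgresDsn

try:
    # Pydantic v2 moved BaseSettings to the pydantic-settings package.
    from pydantic_settings import BaseSettings
except ImportError:
    from pydantic import BaseSettings

from scrape import scrape_website
from clean import clean_data
from compute import compute_metrics


class Settings(BaseSettings):
    scraping_url: AnyHttpUrl
    database_url: PostgresDsn
    metrics: list[str]


def save_to_database(metrics: pd.DataFrame, url: str) -> None:
    """
    Saves the computed metrics to the database.

    Args:
        metrics (pd.DataFrame): The computed metrics to save.
        url (str): The URL of the database.
    """
    ...


def main() -> None:
    """
    Main function that orchestrates the scraping, cleaning, computing metrics, and saving to database.
    """
    # Load the [default] table from the TOML file and validate it.
    settings = Settings(**toml.load("config/settings.toml")["default"])
    data = scrape_website(str(settings.scraping_url))
    # Use a new name so the clean_data function is not shadowed.
    cleaned_data = clean_data(data)
    metrics = compute_metrics(cleaned_data, settings.metrics)
    save_to_database(metrics, str(settings.database_url))


if __name__ == "__main__":
    main()
```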