# MCBS Dataset Loader Tutorial

This notebook demonstrates dynamic dataset discovery in MCBS. The library queries the remote repository in real time, so it can find datasets that were added after the installed MCBS package was released.

## Key Features

- **Dynamic Dataset Discovery**: Discover datasets that were added after the package was released
- **Automatic Caching**: Datasets are cached locally for faster access
- **Metadata Access**: Get detailed information about each dataset
- **Simple API**: Load any dataset with just one line of code

In [2]:
# Import the necessary functions from mcbs.datasets
from mcbs.datasets import fetch_data, list_available_datasets, get_dataset_info
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set some display options
pd.set_option('display.max_columns', 20)
plt.style.use('ggplot')

## 1. Discovering Available Datasets

First, let's see which datasets are available in the remote repository. This call queries the GitHub repository in real time.

In [3]:
# Get real-time list of all available datasets
available_datasets = list_available_datasets()
print(f"Found {len(available_datasets)} available datasets:\n")

# Print the list of available datasets
for dataset in available_datasets:
    print(f"- {dataset}")

INFO:mcbs.datasets.dataset_loader:Fetching metadata from: https://raw.githubusercontent.com/carlosguirado/mcbs-datasets/main/datasets/metadata.json
INFO:mcbs.datasets.dataset_loader:Querying GitHub API for available datasets: https://api.github.com/repos/carlosguirado/mcbs-datasets/contents/datasets
INFO:mcbs.datasets.dataset_loader:Discovered 4 datasets from remote repository


Found 4 available datasets:

- chicago_mode_choice_dataset
- ltds_dataset
- modecanada_dataset
- swissmetro_dataset


## 2. Getting Dataset Information

Now let's get more detailed information about each dataset.

In [4]:
# Create a DataFrame to display dataset information
dataset_info = []

for dataset_name in available_datasets:
    try:
        info = get_dataset_info(dataset_name)
        dataset_info.append({
            'Name': dataset_name,
            'Description': info.get('description', 'N/A'),
            'Samples': info.get('n_samples', 'N/A'),
            'Features': info.get('n_features', 'N/A'),
            'Target': info.get('target', 'N/A')
        })
    except Exception as e:
        print(f"Error getting info for {dataset_name}: {e}")

# Display the information as a table
pd.DataFrame(dataset_info)

INFO:mcbs.datasets.dataset_loader:Fetching metadata from: https://raw.githubusercontent.com/carlosguirado/mcbs-datasets/main/datasets/metadata.json
INFO:mcbs.datasets.dataset_loader:Fetching metadata from: https://raw.githubusercontent.com/carlosguirado/mcbs-datasets/main/datasets/metadata.json
INFO:mcbs.datasets.dataset_loader:Querying GitHub API for available datasets: https://api.github.com/repos/carlosguirado/mcbs-datasets/contents/datasets
INFO:mcbs.datasets.dataset_loader:Discovered 4 datasets from remote repository


|   | Name | Description | Samples | Features | Target |
|---|------|-------------|---------|----------|--------|
| 0 | chicago_mode_choice_dataset | NaN | NaN | NaN | NaN |
| 1 | ltds_dataset | London Travel Demand Survey (LTDS) | 81086.0 | 35.0 | travel_mode |
| 2 | modecanada_dataset | ModeCanada | 15520.0 | 11.0 | choice |
| 3 | swissmetro_dataset | Swissmetro | 10728.0 | 27.0 | CHOICE |
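The row for `chicago_mode_choice_dataset` has no values rather than `'N/A'`. One possible explanation (an inference from the output, not something the MCBS documentation confirms) is that its metadata keys exist but hold null values, a case the default argument of `dict.get` does not cover. A minimal sketch of a stricter fallback, where `field_or_na` is a hypothetical helper and not part of MCBS:

```python
def field_or_na(info: dict, key: str, missing: str = "N/A"):
    """Return info[key], or `missing` when the key is absent or its value is None."""
    value = info.get(key)
    return missing if value is None else value


# Illustrative metadata with keys present but null (made-up example data).
sparse_info = {"description": None, "n_samples": None}

# dict.get only falls back when the key is absent, so None passes through.
print(sparse_info.get("description", "N/A"))

# The helper also replaces null values with the placeholder.
print(field_or_na(sparse_info, "description"))

# Real values, including falsy ones such as 0, are returned unchanged.
print(field_or_na({"n_samples": 0}, "n_samples"))
```

In the loop above, `field_or_na(info, 'description')` could stand in for `info.get('description', 'N/A')`.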


## 3. Loading a Dataset

Let's load one of the datasets. The `fetch_data` function will:
1. Check if the dataset is cached locally
2. If not, download it from the remote repository
3. Cache it locally for future use
4. Return it as a pandas DataFrame
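The four steps above follow a common cache-then-download pattern. The sketch below illustrates that pattern only; it is not the MCBS implementation. `ensure_cached`, `fake_download`, and the `<name>/<name>.csv.gz` file layout are assumptions made for the example.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_cached(name: str, cache_dir: Path,
                  download: Callable[[str], bytes]) -> Path:
    """Return the local path of a dataset file, downloading it on a cache miss."""
    cached_file = cache_dir / name / f"{name}.csv.gz"  # assumed layout
    if not cached_file.exists():                       # step 1: look in the cache
        payload = download(name)                       # step 2: fetch on a miss
        cached_file.parent.mkdir(parents=True, exist_ok=True)
        cached_file.write_bytes(payload)               # step 3: store for next time
    return cached_file  # step 4 would be pd.read_csv(cached_file)


# Stand-in downloader that records how often it is called.
download_calls = []


def fake_download(name: str) -> bytes:
    download_calls.append(name)
    return b"placeholder bytes"


with tempfile.TemporaryDirectory() as tmp:
    first = ensure_cached("demo", Path(tmp), fake_download)
    second = ensure_cached("demo", Path(tmp), fake_download)
    cached_ok = second.exists()

# Only the first call reaches the downloader; the second is a cache hit.
print(f"downloads performed: {len(download_calls)}")
```

Passing the downloader in as a callable keeps the caching logic testable without network access.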

In [5]:
# Load the Swissmetro dataset
swissmetro_data = fetch_data('swissmetro_dataset')

# Display basic information about the dataset
print(f"Dataset shape: {swissmetro_data.shape}")
print("\nColumn names:")
print(", ".join(swissmetro_data.columns))

# Display the first 5 rows
swissmetro_data.head()

INFO:mcbs.datasets.dataset_loader:Local cache disabled. Downloading dataset from remote source.
INFO:mcbs.datasets.dataset_loader:Downloading dataset from: https://raw.githubusercontent.com/carlosguirado/mcbs-datasets/main/datasets/swissmetro/swissmetro.csv.gz


ConnectionError: Error downloading dataset: 404 Client Error: Not Found for url: https://raw.githubusercontent.com/carlosguirado/mcbs-datasets/main/datasets/swissmetro/swissmetro.csv.gz
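When the download fails, as with the 404 above, `swissmetro_data` is never assigned and every later cell that uses it raises `NameError`. A minimal defensive sketch follows. `try_load` is a hypothetical helper, not part of MCBS, and catching `OSError` is an assumption, since the output does not show which exception class the library raises.

```python
from typing import Callable, Optional


def try_load(name: str, loader: Callable[[str], object]) -> Optional[object]:
    """Call loader(name); return None and report the problem if the download fails."""
    try:
        return loader(name)
    # The built-in ConnectionError and the requests exception hierarchy
    # both derive from OSError, so one clause covers either case.
    except OSError as exc:
        print(f"Could not load {name}: {exc}")
        return None


# Stand-in loader that fails the way the cell above did.
def failing_loader(name: str):
    raise ConnectionError("Error downloading dataset: 404 Client Error: Not Found")


result = try_load("swissmetro_dataset", failing_loader)
if result is None:
    print("Dataset unavailable; skip the cells that depend on it.")
```

In the notebook, the equivalent call would be `swissmetro_data = try_load('swissmetro_dataset', fetch_data)`, followed by a `None` check before the exploration cells.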

## 4. Basic Data Exploration

Now let's do some basic exploration of the loaded dataset.

In [6]:
# Check data types
swissmetro_data.dtypes

NameError: name 'swissmetro_data' is not defined

In [7]:
# Summary statistics
swissmetro_data.describe()

NameError: name 'swissmetro_data' is not defined

## 5. Visualizing Mode Choices

Let's create a simple visualization of the mode choices in the dataset.

In [None]:
# Check the target variable distribution
target_column = get_dataset_info('swissmetro_dataset').get('target', 'CHOICE')

# Create a bar plot of the mode choices
plt.figure(figsize=(10, 6))
mode_counts = swissmetro_data[target_column].value_counts()
sns.barplot(x=mode_counts.index, y=mode_counts.values)
plt.title('Distribution of Mode Choices in Swissmetro Dataset')
plt.xlabel('Mode Choice')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()

# Display the percentage of each mode
mode_percentages = (mode_counts / mode_counts.sum() * 100).round(2)
mode_df = pd.DataFrame({
    'Count': mode_counts,
    'Percentage': mode_percentages
})
mode_df

## 6. Loading a Different Dataset

Let's load another dataset to demonstrate how easy it is to switch between datasets.

In [None]:
# Load the LTDS dataset
ltds_data = fetch_data('ltds_dataset')

# Display basic information
print(f"Dataset shape: {ltds_data.shape}")
print("\nColumn names:")
print(", ".join(ltds_data.columns))

# Display the first 5 rows
ltds_data.head()

## 7. Checking Dataset Cache

By default, datasets are cached in the `~/.mcbs/datasets` directory. Let's check the cache to see which datasets have been downloaded.

In [9]:
from pathlib import Path

# Get the default cache directory
cache_dir = Path.home() / '.mcbs' / 'datasets'

# List cached datasets (top-level directories only, not nested ones)
if cache_dir.is_dir():
    print(f"Cached datasets in {cache_dir}:")
    for entry in sorted(cache_dir.iterdir()):
        if entry.is_dir() and entry.name != '__pycache__':
            print(f"- {entry.name}")
else:
    print(f"Cache directory {cache_dir} does not exist yet.")

Cached datasets in /Users/carlosguirado/.mcbs/datasets:
- swissmetro


## 8. How Does the Dynamic Discovery Work?

Dataset discovery works in three steps:

1. Check local metadata for known datasets
2. Fetch remote metadata from the GitHub repository
3. Use the GitHub API to discover dataset directories in the repository

This means MCBS can discover datasets that were added after the package was released!
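The three steps above amount to taking the union of three sources of dataset names. The sketch below shows that merge on made-up inputs; it is an illustration, not the MCBS code. `merge_dataset_names`, the metadata dictionaries keyed by dataset name, and the `_dataset` suffix appended to directory names are assumptions. The `name` and `type` fields match what the GitHub contents API returns for each directory entry.

```python
def merge_dataset_names(local_meta: dict, remote_meta: dict,
                        api_listing: list) -> list:
    """Combine dataset names from local metadata, remote metadata, and a directory listing."""
    # Steps 1 and 2: names known from local and remote metadata.
    names = set(local_meta) | set(remote_meta)
    # Step 3: every directory in the listing counts as a dataset,
    # even when no metadata entry describes it yet.
    for entry in api_listing:
        if entry.get("type") == "dir":
            names.add(f"{entry['name']}_dataset")  # assumed naming convention
    return sorted(names)


# Made-up inputs for illustration only.
local_meta = {"swissmetro_dataset": {}}
remote_meta = {"swissmetro_dataset": {}, "ltds_dataset": {}, "modecanada_dataset": {}}
api_listing = [
    {"name": "chicago_mode_choice", "type": "dir"},
    {"name": "swissmetro", "type": "dir"},
    {"name": "metadata.json", "type": "file"},
]

discovered = merge_dataset_names(local_meta, remote_meta, api_listing)
print(discovered)
```

A dataset found only through the directory listing has no metadata entry, so its description and sample counts would be unavailable.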

## Conclusion

The MCBS dataset loader provides a simple and powerful way to access transportation mode choice datasets. Key benefits:

- **One-line data loading**: `data = fetch_data('dataset_name')`
- **Dynamic discovery**: Finds datasets added to the repository after the package was released
- **Automatic caching**: Fast access to previously used datasets
- **Consistent interface**: All datasets follow the same format

This makes it easy to benchmark models across multiple datasets without having to worry about data preprocessing or storage.