# 1. Getting Started
## a) Connecting to Weights and Biases

In [1]:
# 1. Log in to your W&B account
import wandb

wandb.login()

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/user/.netrc
[34m[1mwandb[0m: Currently logged in as: [33msebastien-s[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

## Import Data

As described in the `DVC` course, you need to download the data first.

To run the script from this notebook, we need to adjust the ingestion script slightly.

The only change is in the `main` function of the `src/data/import_raw_data.py` script: replace `raw_data_relative_path="./data/raw"` with `raw_data_relative_path="../data/raw"`, because the notebook runs one directory below the project root.

```py
def main(raw_data_relative_path="../data/raw",  # just add .. before data/raw
        # the rest of the code...
```
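An alternative to editing the relative path by hand is to derive the default from the script's own location, so that it resolves the same way whatever the working directory of the notebook is. The sketch below assumes the layout implied above (the script in `src/data/`, the data in `data/raw` under the project root). `default_raw_data_dir` is a hypothetical helper, not part of the course code.

```python
from pathlib import Path


def default_raw_data_dir(script_path):
    """Return <project root>/data/raw for a script stored in <project root>/src/data/.

    Hypothetical helper: assumes the script sits exactly two directories
    below the project root.
    """
    script = Path(script_path)
    project_root = script.parent.parent.parent  # src/data -> src -> project root
    return project_root / "data" / "raw"


# Inside import_raw_data.py one could call default_raw_data_dir(__file__).
print(default_raw_data_dir("/project/src/data/import_raw_data.py"))
```

With a default computed this way, the script could be launched from the project root or from the notebook directory without touching the source.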

In [6]:
# import data
!echo y | python ../src/data/import_raw_data.py

raw doesn't exists. Do you want to create it? (y/n): downloading https://mlops-project-db.s3.eu-west-1.amazonaws.com/accidents/caracteristiques-2021.csv as caracteristiques-2021.csv
downloading https://mlops-project-db.s3.eu-west-1.amazonaws.com/accidents/lieux-2021.csv as lieux-2021.csv
downloading https://mlops-project-db.s3.eu-west-1.amazonaws.com/accidents/usagers-2021.csv as usagers-2021.csv
downloading https://mlops-project-db.s3.eu-west-1.amazonaws.com/accidents/vehicules-2021.csv as vehicules-2021.csv
2025-02-19 12:42:02,269 - __main__ - INFO - making raw data set


Now let's adjust the `make_dataset.py` script as well so that it works inside this Jupyter notebook.

Before executing the command below, change every `\\` path separator to `/` in the `main` function.

```py
def main(input_filepath, output_filepath):
    # same code

    # Prompt the user for input file paths
    input_filepath = click.prompt('Enter the file path for the input data', type=click.Path(exists=True))
    input_filepath_users = f"{input_filepath}/usagers-2021.csv" # change `\\` to `/`
    input_filepath_caract = f"{input_filepath}/caracteristiques-2021.csv" # change `\\` to `/`
    input_filepath_places = f"{input_filepath}/lieux-2021.csv" # change `\\` to `/`
    input_filepath_veh = f"{input_filepath}/vehicules-2021.csv" # change `\\` to `/`
    output_filepath = click.prompt('Enter the file path for the output preprocessed data (e.g., output/preprocessed_data.csv)', type=click.Path())
    
    # same code
```
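Hard-coding either separator ties the script to one operating system. As a sketch of a portable alternative, `pathlib` can join the directory and the file names. The file names are the ones used above, while `RAW_FILES` and `build_input_paths` are hypothetical names introduced only for illustration.

```python
from pathlib import Path

# File names taken from the snippet above; the dictionary keys are illustrative.
RAW_FILES = {
    "users": "usagers-2021.csv",
    "caract": "caracteristiques-2021.csv",
    "places": "lieux-2021.csv",
    "veh": "vehicules-2021.csv",
}


def build_input_paths(input_dir):
    """Join the input directory with each raw file name using pathlib."""
    base = Path(input_dir)
    return {key: base / name for key, name in RAW_FILES.items()}


paths = build_input_paths("../data/raw")
for key, path in paths.items():
    print(key, path)
```

`Path` objects can be passed directly to `pd.read_csv`, so the rest of the function would not need to change.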

In [12]:
# preprocess data
!printf "../data/raw\n../data/preprocessed\ny\n" | python ../src/data/make_dataset.py

2025-02-19 12:50:16,955 - __main__ - INFO - making final data set from raw data
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_users.grav.replace([1,2,3,4], [1,3,4,2], inplace = True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_veh['catv'].replace(catv_value, catv_value_new, inplace = True)
The behavior will change 

## b) First W&B run

In [7]:
# 2. Start a W&B Run
run = wandb.init(
    project="classification-car-accidents",
    name='My first run',
    tags=["baseline", "random-forest"],
)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


In [13]:
#  3. Capture a dictionary of hyperparameters
params = {"n_estimators": 2, "criterion": 'gini', "max_depth": 2}

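# Update the run's config in place; assigning to `wandb.config` would only
# rebind the module attribute and the values would not be saved with the run.
run.config.update(params)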

In [14]:
# 4. Train the model
import pandas as pd 
from sklearn.ensemble import RandomForestClassifier
import numpy as np

X_train = pd.read_csv('../data/preprocessed/X_train.csv')
X_test = pd.read_csv('../data/preprocessed/X_test.csv')
y_train = pd.read_csv('../data/preprocessed/y_train.csv')
y_test = pd.read_csv('../data/preprocessed/y_test.csv')
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

rf_classifier = RandomForestClassifier(**params)

rf_classifier.fit(X_train, y_train)

In [15]:
# 5. Capture a dictionary of metrics
train_accuracy = rf_classifier.score(X_train, y_train)
test_accuracy = rf_classifier.score(X_test, y_test)
wandb.log({"train_accuracy": train_accuracy, "test_accuracy": test_accuracy})

In [16]:
# 6. Track model artifact
import joblib

#Save the trained model to a file
model_filename = '../models/trained_model.joblib'
joblib.dump(rf_classifier, model_filename)

#Track the file
wandb.log_artifact(model_filename)

<Artifact run-7sf5my6t-trained_model.joblib>

In [17]:
# 7. Finish the run
wandb.finish()

Run history:
test_accuracy   ▁
train_accuracy  ▁

Run summary:
test_accuracy   0.70238
train_accuracy  0.70247


# 2. Visualizing Metrics

## c) Second W&B run

In [18]:
# 1. Log in to your W&B account
wandb.login()

# 2. Start a W&B Run
run = wandb.init(
    project="classification-car-accidents",
    name='My second run',
    tags=["baseline", "Decision Tree"],
)

#  3. Capture a dictionary of hyperparameters
params = {"criterion": 'gini', "max_depth": 10}

# Update the run's config in place so the values are saved with the run
run.config.update(params)

# 4. Train the model
from sklearn.tree import DecisionTreeClassifier

dt_classifier = DecisionTreeClassifier(**params)
dt_classifier.fit(X_train, y_train)

In [19]:
# 5. Capture a dictionary of metrics 
train_accuracy = dt_classifier.score(X_train, y_train)
test_accuracy = dt_classifier.score(X_test, y_test)
wandb.log({"train_accuracy": train_accuracy, "test_accuracy": test_accuracy})

In [20]:
# 6. Track plots and log artifacts with sklearn.plot_classifier
y_pred = dt_classifier.predict(X_test)
y_probas = dt_classifier.predict_proba(X_test)
labels = ['non-prioritary accident', 'prioritary accident']

wandb.sklearn.plot_classifier(
    dt_classifier,
    X_train,
    X_test,
    y_train,
    y_test,
    y_pred,
    y_probas,
    labels,
    model_name="Decision Tree",
    feature_names=X_train.columns,
)

# 7. Finish the run
wandb.finish()

[34m[1mwandb[0m: 
[34m[1mwandb[0m: Plotting Decision Tree.
[34m[1mwandb[0m: Logged feature importances.
[34m[1mwandb[0m: Logged confusion matrix.
[34m[1mwandb[0m: Logged summary metrics.
[34m[1mwandb[0m: Logged class proportions.
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[34m[1mwandb[0m: Logged calibration curve.
[34m[1mwandb[0m: Logged roc curve.
[34m[1mwandb[0m: Logged precision-recall curve.


Run history:
test_accuracy   ▁
train_accuracy  ▁

Run summary:
test_accuracy   0.75198
train_accuracy  0.80096


## b) Comparing runs

In [None]:
# 1. Log in to your W&B account
wandb.login()

# 2. Start a W&B Run
run = wandb.init(
    project="classification-car-accidents",
    name='My third run',
    tags=["baseline", "Decision Tree"],
)

#  3. Capture a dictionary of hyperparameters
params = {"criterion": 'entropy', "max_depth": 20}

# Update the run's config in place so the values are saved with the run
run.config.update(params)

# 4. Train the model
from sklearn.tree import DecisionTreeClassifier

dt_classifier = DecisionTreeClassifier(**params)
dt_classifier.fit(X_train, y_train)

# 5. Capture a dictionary of metrics 
train_accuracy = dt_classifier.score(X_train, y_train)
test_accuracy = dt_classifier.score(X_test, y_test)
wandb.log({"train_accuracy": train_accuracy, "test_accuracy": test_accuracy})

# 6. Track plots and log artifacts with sklearn.plot_classifier
y_pred = dt_classifier.predict(X_test)
y_probas = dt_classifier.predict_proba(X_test)
labels = ['non-prioritary accident', 'prioritary accident']

wandb.sklearn.plot_classifier(
    dt_classifier,
    X_train,
    X_test,
    y_train,
    y_test,
    y_pred,
    y_probas,
    labels,
    model_name="Decision Tree",
    feature_names=X_train.columns,
)

# 7. Finish the run
wandb.finish()

# 3. Sweeps
## b) Methods and hyperparameters

In [None]:
# 1. Pick a method
sweep_config = {
    'method': 'random'
    }
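A sweep configuration can also name the metric to optimize. The `random` method does not depend on it, but the Bayesian method does, and stating it makes the goal of the sweep explicit. The fragment below is a sketch: `sweep_config_with_metric` is a hypothetical variable name, and `test_accuracy` is assumed to be the metric of interest because it is the key logged with `wandb.log` in this notebook.

```python
# Illustrative variant of the sweep configuration with an explicit metric.
# "test_accuracy" must match a key passed to wandb.log in the training function.
sweep_config_with_metric = {
    "method": "random",
    "metric": {"name": "test_accuracy", "goal": "maximize"},
}

print(sweep_config_with_metric)
```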

In [None]:
# 2. Name hyperparameters
parameters_dict = {
    'criterion': {
        'values': ['gini', 'entropy', 'log_loss']
        },
    'splitter': {
        'values': ['best', 'random']
        },
    'max_depth': {
          'values': [None, 10, 20, 50, 100, 200, 500]
        },
    'random_state': {
        'values': [42]
    }
    }

sweep_config['parameters'] = parameters_dict
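To get a feel for what a random search does with this dictionary, the sketch below counts the full grid and draws a few random combinations in plain Python. It only mimics the idea and is not how W&B implements sweeps; `grid_size` and `sample_configuration` are hypothetical helpers.

```python
import math
import random

# Same search space as above, repeated so the example is self-contained.
parameters_dict = {
    "criterion": {"values": ["gini", "entropy", "log_loss"]},
    "splitter": {"values": ["best", "random"]},
    "max_depth": {"values": [None, 10, 20, 50, 100, 200, 500]},
    "random_state": {"values": [42]},
}


def grid_size(parameters):
    """Number of distinct combinations an exhaustive grid search would try."""
    return math.prod(len(spec["values"]) for spec in parameters.values())


def sample_configuration(parameters, rng):
    """Pick one value per hyperparameter, uniformly at random."""
    return {name: rng.choice(spec["values"]) for name, spec in parameters.items()}


rng = random.Random(0)  # seeded so the draws are reproducible
print("grid size:", grid_size(parameters_dict))  # 3 * 2 * 7 * 1 combinations
for _ in range(5):
    print(sample_configuration(parameters_dict, rng))
```

Since the grid holds 42 combinations, a random sweep capped at a handful of runs explores only a fraction of it, and the same combination may be drawn more than once.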

## c) Launching the sweep

In [None]:
from sklearn.tree import DecisionTreeClassifier

# 3. Initialize the sweep
sweep_id = wandb.sweep(sweep_config, project="classification-car-accidents")

# 4. Define the training function
def train(parameters=None):
    run = wandb.init(
        project="classification-car-accidents",
        tags=["sweep", "Decision Tree"],
        config=parameters
    )

    parameters = wandb.config
    
    dt_classifier = DecisionTreeClassifier(**parameters)
    dt_classifier.fit(X_train, y_train)

    train_accuracy = dt_classifier.score(X_train, y_train)
    test_accuracy = dt_classifier.score(X_test, y_test)
    wandb.log({"train_accuracy": train_accuracy, "test_accuracy": test_accuracy})

    wandb.finish()

# 5. Run the sweep agent
wandb.agent(sweep_id, train, count=5)