## This practical session does not require Azure AI Foundry, an API key, or an endpoint

##### Resource: [https://learn.microsoft.com/en-us/training/modules/hyperparameters-azure-databricks/1-introduction](https://learn.microsoft.com/en-us/training/modules/hyperparameters-azure-databricks/1-introduction)

In [0]:
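 import mlflow
 from openai import AzureOpenAI

 # Assumption: the original cell never defines `client`. Created without
 # arguments, AzureOpenAI reads AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY,
 # and OPENAI_API_VERSION from the environment.
 client = AzureOpenAI()

 system_prompt = "Assistant is a large language model trained by OpenAI."

 # Trace OpenAI calls automatically with MLflow.
 mlflow.openai.autolog()

 with mlflow.start_run():
     response = client.chat.completions.create(
         model="gpt-4osachin",  # Azure OpenAI deployment name
         messages=[
             {"role": "system", "content": system_prompt},
             {"role": "user", "content": "Tell me a joke about animals."},
         ],
     )

     print(response.choices[0].message.content)
     mlflow.log_param("completion_tokens", response.usage.completion_tokens)
 # The `with` block ends the run on exit, so no explicit mlflow.end_run() is needed.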

Sure! Here's one:

Why don’t elephants use computers?

Because they’re afraid of the mouse! 🐘🖱️


Trace(request_id=tr-ae1cfeffb1974affbabac5cd8e11b029)

In [0]:
 %sh
 rm -r /dbfs/hyperparam_tune_lab
 mkdir /dbfs/hyperparam_tune_lab
 wget -O /dbfs/hyperparam_tune_lab/penguins.csv https://raw.githubusercontent.com/MicrosoftLearning/mslearn-databricks/main/data/penguins.csv

rm: cannot remove '/dbfs/hyperparam_tune_lab': No such file or directory
--2025-08-19 05:57:17--  https://raw.githubusercontent.com/MicrosoftLearning/mslearn-databricks/main/data/penguins.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9533 (9.3K) [text/plain]
Saving to: ‘/dbfs/hyperparam_tune_lab/penguins.csv’

     0K .........                                             100% 2.89M=0.003s

2025-08-19 05:57:17 (2.89 MB/s) - ‘/dbfs/hyperparam_tune_lab/penguins.csv’ saved [9533/9533]



In [0]:
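# Import only what is used; a wildcard import from pyspark.sql.functions
# shadows Python built-ins such as max, min, and sum.
from pyspark.sql.functions import col

# On Databricks, Spark resolves this path against DBFS, so it points at the file downloaded above.
data = spark.read.format("csv").option("header", "true").load("/hyperparam_tune_lab/penguins.csv")
# Drop rows with missing values and cast each column to its working type.
data = data.dropna().select(col("Island").astype("string"),
                          col("CulmenLength").astype("float"),
                          col("CulmenDepth").astype("float"),
                          col("FlipperLength").astype("float"),
                          col("BodyMass").astype("float"),
                          col("Species").astype("int")
                          )
display(data.sample(0.2))

# Split the data roughly 70/30 into training and test sets.
splits = data.randomSplit([0.7, 0.3])
train = splits[0]
test = splits[1]
print("Training Rows:", train.count(), " Testing Rows:", test.count())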

Island,CulmenLength,CulmenDepth,FlipperLength,BodyMass,Species
Torgersen,38.9,17.8,181.0,3625.0,0
Torgersen,36.6,17.8,185.0,3700.0,0
Torgersen,38.7,19.0,195.0,3450.0,0
Biscoe,40.6,18.6,183.0,3550.0,0
Dream,36.4,17.0,195.0,3325.0,0
Dream,40.8,18.4,195.0,3900.0,0
Dream,36.0,18.5,186.0,3100.0,0
Dream,42.3,21.2,191.0,4150.0,0
Biscoe,39.6,17.7,186.0,3500.0,0
Biscoe,39.0,17.5,186.0,3550.0,0


Training Rows: 250  Testing Rows: 92
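
The split above yields 250 and 92 rows (about 73% and 27% of 342) rather than an exact 70/30 division. This is consistent with `randomSplit` treating its weights as per-row probabilities rather than exact quotas. The sketch below is a simplified plain-Python model of that idea, not Spark's implementation; `bernoulli_split` and the seed value are hypothetical, chosen only for illustration.

```python
import random


def bernoulli_split(rows, weights, seed):
    """Assign each row to a split by an independent random draw."""
    total = sum(weights)
    # Cumulative boundaries, e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(342))  # 342 = 250 + 92 rows reported in the output above
train_rows, test_rows = bernoulli_split(rows, [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))
```

For a repeatable split, `randomSplit` also accepts a `seed` argument, for example `data.randomSplit([0.7, 0.3], seed=42)`. The counts shown above came from an unseeded run, so a rerun would likely produce slightly different numbers.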


In [0]:
import optuna
import mlflow # if you wish to log your experiments
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, MinMaxScaler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
   
def objective(trial):
    # Suggest hyperparameter values (maxDepth and maxBins):
    max_depth = trial.suggest_int("MaxDepth", 0, 9)
    max_bins = trial.suggest_categorical("MaxBins", [10, 20, 30])

    # Define pipeline components
    cat_feature = "Island"
    num_features = ["CulmenLength", "CulmenDepth", "FlipperLength", "BodyMass"]
    catIndexer = StringIndexer(inputCol=cat_feature, outputCol=cat_feature + "Idx")
    numVector = VectorAssembler(inputCols=num_features, outputCol="numericFeatures")
    numScaler = MinMaxScaler(inputCol=numVector.getOutputCol(), outputCol="normalizedFeatures")
    featureVector = VectorAssembler(inputCols=[cat_feature + "Idx", "normalizedFeatures"], outputCol="Features")

    dt = DecisionTreeClassifier(
        labelCol="Species",
        featuresCol="Features",
        maxDepth=max_depth,
        maxBins=max_bins
    )

    pipeline = Pipeline(stages=[catIndexer, numVector, numScaler, featureVector, dt])
    model = pipeline.fit(train)

    # Evaluate the model using accuracy.
    predictions = model.transform(test)
    evaluator = MulticlassClassificationEvaluator(
        labelCol="Species",
        predictionCol="prediction",
        metricName="accuracy"
    )
    accuracy = evaluator.evaluate(predictions)

    # Since Optuna minimizes the objective, return negative accuracy.
    return -accuracy
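
The objective returns the negated accuracy so that a minimizing optimizer prefers more accurate models. As a minimal illustration in plain Python (not the Spark evaluator itself), accuracy is the fraction of rows whose prediction equals the label. The `accuracy` helper and the label lists below are hypothetical.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Hypothetical labels and predictions for two candidate models.
labels = [0, 1, 2, 2]
model_a = [0, 1, 1, 2]  # 3 of 4 correct
model_b = [0, 1, 2, 2]  # 4 of 4 correct

loss_a = -accuracy(labels, model_a)
loss_b = -accuracy(labels, model_b)
print(loss_a, loss_b)

# The more accurate model has the smaller (more negative) objective value.
best = min([("model_a", loss_a), ("model_b", loss_b)], key=lambda item: item[1])
print(best[0])
```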

In [0]:
# Optimization run with 5 trials:
study = optuna.create_study()
study.optimize(objective, n_trials=5)

print("Best param values from the optimization run:")
print(study.best_params)

[I 2025-08-19 05:58:23,117] A new study created in memory with name: no-name-f90394f3-038d-43ca-9f75-0fd49f0bd86b


Downloading artifacts:   0%|          | 0/45 [00:00<?, ?it/s]

Uploading artifacts:   0%|          | 0/4 [00:00<?, ?it/s]

[I 2025-08-19 05:59:34,420] Trial 0 finished with value: -0.9891304347826086 and parameters: {'MaxDepth': 5, 'MaxBins': 20}. Best is trial 0 with value: -0.9891304347826086.


[I 2025-08-19 06:00:34,770] Trial 1 finished with value: -0.9891304347826086 and parameters: {'MaxDepth': 6, 'MaxBins': 30}. Best is trial 0 with value: -0.9891304347826086.


[I 2025-08-19 06:01:35,443] Trial 2 finished with value: -0.9782608695652174 and parameters: {'MaxDepth': 9, 'MaxBins': 10}. Best is trial 0 with value: -0.9891304347826086.


[I 2025-08-19 06:02:33,308] Trial 3 finished with value: -0.9782608695652174 and parameters: {'MaxDepth': 7, 'MaxBins': 10}. Best is trial 0 with value: -0.9891304347826086.


[I 2025-08-19 06:03:30,335] Trial 4 finished with value: -0.5108695652173914 and parameters: {'MaxDepth': 0, 'MaxBins': 10}. Best is trial 0 with value: -0.9891304347826086.


Best param values from the optimization run:
{'MaxDepth': 5, 'MaxBins': 20}
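
Because the objective returns negative accuracy, the values in the log are easier to read once the sign is flipped back. The sketch below does that in plain Python using the values copied from the log output above and the 92 test rows reported by the split. The variable names are illustrative, and the count of correct predictions is inferred by rounding rather than read from the model.

```python
# Values and parameters copied from the Optuna log output above.
trials = [
    (-0.9891304347826086, {"MaxDepth": 5, "MaxBins": 20}),
    (-0.9891304347826086, {"MaxDepth": 6, "MaxBins": 30}),
    (-0.9782608695652174, {"MaxDepth": 9, "MaxBins": 10}),
    (-0.9782608695652174, {"MaxDepth": 7, "MaxBins": 10}),
    (-0.5108695652173914, {"MaxDepth": 0, "MaxBins": 10}),
]
TEST_ROWS = 92  # "Testing Rows: 92" from the split output

for number, (value, params) in enumerate(trials):
    acc = -value  # undo the sign flip applied in the objective
    correct = round(acc * TEST_ROWS)
    print(f"trial {number}: accuracy={acc:.4f} ({correct}/{TEST_ROWS} correct) {params}")

# min() returns the first of any tied minima, which here is trial 0.
best_value, best_params = min(trials, key=lambda t: t[0])
print(best_params)
```

Trial 4 stands out: a tree with `MaxDepth` 0 is a single leaf that predicts one class for every row, which explains the accuracy of about 0.51. Optuna's `create_study` also accepts `direction="maximize"`, in which case the objective could return the accuracy directly instead of its negative.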
