## Importing Libraries and Checking the spacier Version
In this initial setup, we import `pandas` for data manipulation and `sys` so that we can append the parent directory (`../`) to the Python path, which makes the local `spacier` package importable. We then import `model` and `spacier` from `spacier.ml` and print the package version to confirm that we are running the intended release (0.0.4 in this run). `spacier` is a custom library for this project; the cells below use it for random sampling and for selecting candidates with Bayesian optimization criteria.

In [1]:
import pandas as pd
import sys
sys.path.append('../')
from spacier.ml import model, spacier

print("spacier: ", spacier.__version__)

spacier:  0.0.4


## Loading Data
Here, we load the datasets with `pandas`. `X.csv` and `X_pool.csv` hold the features, and `y.csv` holds the targets for the rows in `X.csv`. The data are presumably split into a labeled training set and a "pool" of candidates, from which the sampling methods below select new points. A `y_pool.csv` file with targets for the pool also belongs to the dataset, but it is not loaded in this cell.

In [2]:
data_path = "../spacier/data"
df_X = pd.read_csv(f"{data_path}/X.csv")
df_pool_X = pd.read_csv(f"{data_path}/X_pool.csv")

df = pd.read_csv(f"{data_path}/y.csv")

## Random Sampling

This section demonstrates random sampling, where we draw a subset of the pool dataset uniformly at random. Because the selection does not depend on any model, it does not favor particular regions of the candidate space, which makes it a common baseline for the model-guided strategies below. Here, `sample(10)` returns 10 indices out of the 1067 candidates.

In [3]:
new_index = spacier.Random(df_X, df_pool_X, df).sample(10)
print(new_index)

Number of candidates :  1067
[432, 540, 68, 366, 192, 426, 886, 314, 429, 385]
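
As a point of reference, uniform sampling without replacement can be sketched with the standard library alone. This is a minimal illustration, not the implementation behind `spacier.Random`; the function name `sample_pool_indices` and the `seed` argument are assumptions introduced for the example.

```python
import random


def sample_pool_indices(n_candidates, k, seed=None):
    """Draw k distinct row positions uniformly from range(n_candidates).

    Hypothetical helper for illustration; not part of spacier.
    """
    rng = random.Random(seed)  # local generator, so the global state is untouched
    return rng.sample(range(n_candidates), k)


picked = sample_pool_indices(1067, 10, seed=0)
print(picked)
```

Fixing the seed makes the selection reproducible, which is useful when random sampling serves as a baseline for the other strategies.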


## Uncertainty Sampling

In this part, we use uncertainty sampling, a technique common in active learning. It selects the candidates for which the model is least confident in its predictions. With a Gaussian process regressor (`"sklearn_GP"`), these are presumably the candidates with the largest predictive uncertainty for the target `Cp`. Labeling such ambiguous examples tends to improve the model with fewer new measurements than random selection would need.

In [4]:
new_index = spacier.BO(df_X, df_pool_X, df, "sklearn_GP", ["Cp"]).uncertainty(10)
print(new_index)

Number of training data :  10
Number of candidates :  1067
[920, 730, 916, 917, 919, 914, 927, 722, 723, 731]
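
The selection step itself can be sketched as follows, assuming the model supplies one predictive standard deviation per candidate and that this value is used as the uncertainty measure. The helper name `top_k_by_uncertainty` and the toy numbers are hypothetical; how `spacier` computes its ranking internally is not shown here.

```python
def top_k_by_uncertainty(pred_std, k):
    """Return the positions of the k largest predictive standard deviations.

    Hypothetical helper for illustration; not part of spacier.
    """
    order = sorted(range(len(pred_std)), key=lambda i: pred_std[i], reverse=True)
    return order[:k]


# Toy predictive standard deviations for five candidates (made-up values).
pred_std = [0.12, 0.80, 0.05, 0.43, 0.79]
chosen = top_k_by_uncertainty(pred_std, 3)
print(chosen)  # [1, 4, 3]
```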


## Probability of Improvement (PI)

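Probability of Improvement is a strategy used in Bayesian optimization: it selects the next point to evaluate by maximizing the probability of improving on the current best observation, which makes it useful when optimizing a property under predictive uncertainty. In the call below, `[[3000, 4000]]` appears to specify a target range for `Cp`, so the candidates are presumably ranked by the probability that their predicted `Cp` falls inside that range.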

In [5]:
new_index = spacier.BO(df_X, df_pool_X, df, "sklearn_GP", ["Cp"]).PI([[3000, 4000]], 10)
print(new_index)

Number of training data :  10
Number of candidates :  1067
[709, 708, 714, 713, 712, 711, 710, 716, 705, 704]
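
If the prediction for a candidate is Gaussian with mean `mean` and standard deviation `std`, the probability of landing in a target range `[lower, upper]` is the difference of two normal CDF values. The sketch below assumes this range-based reading of PI; it is not taken from the `spacier` source, and `prob_in_range` is a hypothetical name.

```python
from statistics import NormalDist


def prob_in_range(mean, std, lower, upper):
    """P(lower <= Y <= upper) for Y ~ Normal(mean, std**2); std must be positive.

    Hypothetical helper for illustration; not part of spacier.
    """
    dist = NormalDist(mu=mean, sigma=std)
    return dist.cdf(upper) - dist.cdf(lower)


# Two toy candidates with made-up predictions for Cp.
p_centered = prob_in_range(3500.0, 500.0, 3000.0, 4000.0)  # range is mean +/- 1 std
p_far = prob_in_range(2000.0, 500.0, 3000.0, 4000.0)       # range is 2 to 4 std above the mean
print(p_centered > p_far)
```

A candidate whose predicted mean sits inside the range with a small standard deviation scores close to 1, while a candidate far outside the range scores close to 0.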


This cell continues the PI example by adding a second target property, `refractive_index`, alongside `Cp`, with one target range per property (`[3000, 4000]` and `[1.6, 1.7]`). It shows how PI extends to several targets at once. In this run, the ten selected indices are the same as in the single-target case.

In [6]:
new_index = spacier.BO(df_X, df_pool_X, df, "sklearn_GP", ["Cp", "refractive_index"]).PI([[3000, 4000], [1.6, 1.7]], 10)
print(new_index)

Number of training data :  10
Number of candidates :  1067
[709, 708, 714, 713, 712, 711, 710, 716, 705, 704]
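
One simple way to combine several target ranges is to treat the per-target predictions as independent Gaussians and multiply the per-target probabilities. The sketch below makes exactly that independence assumption; whether `spacier` combines targets this way is not confirmed by this notebook, and `joint_prob_in_ranges` is a hypothetical name.

```python
from statistics import NormalDist


def joint_prob_in_ranges(means, stds, ranges):
    """Probability that every target falls in its range, assuming independence.

    Hypothetical helper for illustration; not part of spacier.
    Each std must be positive.
    """
    prob = 1.0
    for mean, std, (lower, upper) in zip(means, stds, ranges):
        dist = NormalDist(mu=mean, sigma=std)
        prob *= dist.cdf(upper) - dist.cdf(lower)
    return prob


# Target ranges from the cell above; the predictions are made-up values.
ranges = [(3000.0, 4000.0), (1.6, 1.7)]
p_joint = joint_prob_in_ranges([3500.0, 1.65], [500.0, 0.05], ranges)
print(p_joint)
```

Because each factor is at most 1, adding a target can only lower a candidate's score, so candidates must satisfy all ranges simultaneously to rank highly.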


## Expected Improvement (EI)

Expected Improvement is another acquisition function from Bayesian optimization. Whereas PI only asks how likely an improvement is, EI also weighs how large the improvement is expected to be: it scores each candidate by the expected amount by which its outcome exceeds the current best observation. Both a favorable predicted mean and a large predictive uncertainty raise this score, so EI balances exploration (of poorly sampled regions) and exploitation (of regions already known to be good).

In [7]:
new_index = spacier.BO(df_X, df_pool_X, df, "sklearn_GP", ["Cp"]).EI(10)
print(new_index)

Number of training data :  10
Number of candidates :  1067
[410, 330, 413, 335, 411, 754, 268, 409, 134, 412]
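
For a Gaussian prediction, EI has a well-known closed form. The sketch below assumes maximization and writes `best` for the current best observed value; the function name is hypothetical, and the optimization direction and any extra parameters `spacier` uses are not shown in this notebook.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and std 1


def expected_improvement(mean, std, best):
    """Closed-form EI for maximization: E[max(Y - best, 0)], Y ~ Normal(mean, std**2).

    Hypothetical helper for illustration; not part of spacier.
    """
    if std <= 0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(mean - best, 0.0)
    z = (mean - best) / std
    return (mean - best) * _STD_NORMAL.cdf(z) + std * _STD_NORMAL.pdf(z)


ei_at_best = expected_improvement(mean=1.0, std=1.0, best=1.0)
ei_certain = expected_improvement(mean=2.0, std=0.0, best=1.0)
print(ei_at_best, ei_certain)
```

The first term rewards a predicted mean above `best` (exploitation), and the second rewards predictive uncertainty (exploration).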


## Upper Confidence Bound (UCB)

The Upper Confidence Bound criterion scores each candidate by an optimistic estimate of its outcome, typically the predicted mean plus a multiple of the predictive standard deviation. Candidates rank highly either because the model expects a good value (exploitation) or because the model is uncertain about them (exploration), and the multiplier controls the balance between the two.


In [8]:
new_index = spacier.BO(df_X, df_pool_X, df, "sklearn_GP", ["Cp"]).UCB(10)
print(new_index)

Number of training data :  10
Number of candidates :  1067
[335, 411, 413, 410, 330, 754, 268, 331, 409, 134]
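
A minimal sketch of the score, assuming maximization and a trade-off multiplier called `kappa` here. The name `kappa`, its default of 2.0, and the toy numbers are assumptions for illustration; the value `spacier` uses is not shown in this notebook.

```python
def ucb_scores(means, stds, kappa=2.0):
    """Upper confidence bound per candidate: mean + kappa * std.

    Hypothetical helper for illustration; not part of spacier.
    """
    return [m + kappa * s for m, s in zip(means, stds)]


# Toy predictions for three candidates (made-up values).
means = [1.0, 1.5, 0.8]
stds = [0.1, 0.05, 0.6]

explore = ucb_scores(means, stds, kappa=2.0)
exploit = ucb_scores(means, stds, kappa=0.0)

best_explore = max(range(len(explore)), key=explore.__getitem__)
best_exploit = max(range(len(exploit)), key=exploit.__getitem__)
print(best_explore, best_exploit)  # 2 1
```

With `kappa=0` the score reduces to the predicted mean, so the candidate with the highest mean wins; a larger `kappa` shifts the choice toward the most uncertain candidate.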


## Expected Hypervolume Improvement (EHVI)

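Expected Hypervolume Improvement is a multi-objective acquisition function used in Bayesian optimization. It selects the points expected to most increase the hypervolume, that is, the volume of objective space dominated by the current Pareto front relative to a reference point. This method is valuable when two or more objectives conflict and we want a good set of trade-offs rather than a single optimum. The call below sets `standardization=True`, presumably so that `Cp` and `refractive_index`, whose values differ by orders of magnitude, contribute on comparable scales.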

In [9]:
%%time
new_index = spacier.BO(df_X, df_pool_X, df, "sklearn_GP", ["Cp", "refractive_index"], standardization=True).EHVI(10)
print(new_index)

Number of training data :  10
Number of candidates :  1067
[330, 486, 413, 412, 411, 623, 410, 353, 335, 390]
CPU times: total: 2.03 s
Wall time: 2.33 s
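
To make the hypervolume idea concrete, the sketch below computes the dominated area for two maximized objectives and estimates EHVI for one candidate by Monte Carlo sampling from independent Gaussian predictions. The function names, the reference point, the independence assumption, and the sampling approach are all choices made for this illustration; they are not taken from the `spacier` implementation.

```python
import random


def hypervolume_2d(points, ref):
    """Area dominated by `points` above the reference point `ref`.

    Both objectives are maximized. Hypothetical helper, not part of spacier.
    """
    # Only points better than the reference in both objectives contribute.
    pts = [p for p in points if p[0] > ref[0] and p[1] > ref[1]]
    # Sweep from the largest first objective down, adding one strip per
    # point that raises the best second objective seen so far.
    pts.sort(key=lambda p: p[0], reverse=True)
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


def ehvi_monte_carlo(front, ref, mean, std, n_samples=2000, seed=0):
    """Monte Carlo estimate of the expected hypervolume gain of one candidate.

    Assumes independent Gaussian predictions for the two objectives.
    """
    rng = random.Random(seed)
    base = hypervolume_2d(front, ref)
    total = 0.0
    for _ in range(n_samples):
        draw = (rng.gauss(mean[0], std[0]), rng.gauss(mean[1], std[1]))
        total += hypervolume_2d(front + [draw], ref) - base
    return total / n_samples


# Toy Pareto front and reference point (made-up values).
front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
ref = (0.0, 0.0)

hv = hypervolume_2d(front, ref)
gain_good = ehvi_monte_carlo(front, ref, mean=(3.0, 3.0), std=(1e-9, 1e-9))
gain_poor = ehvi_monte_carlo(front, ref, mean=(0.5, 0.5), std=(1e-9, 1e-9))
print(hv, gain_good, gain_poor)
```

A candidate that is already dominated by the front adds no area, so its estimated gain is zero, while a candidate predicted to dominate the whole front gains the full difference in area.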
