HephAIstos is an open-source Python framework for running machine learning pipelines on CPUs, GPUs, and quantum processing units (QPUs). Distributed under the Apache License, Version 2.0, HephAIstos builds on frameworks such as scikit-learn, Keras on TensorFlow, and Qiskit, as well as custom code. You can create a pipeline using a single Python function with parameters or leverage specific routines. Contributions to improve the framework are highly encouraged.
Find it on GitHub: https://github.com/xaviervasques/hephaistos.git
- Installation
- Dependencies
- Techniques
- Feature Rescaling
- Categorical Data Encoding
- Time-Related Feature Engineering
- Missing Values
- Feature Extraction
- Feature Selection
- Classification Algorithms for CPUs
- Classification Algorithms for GPUs
- Classification Algorithms for QPUs
- Regression Algorithms for CPUs
- Regression Algorithms for GPUs
- Convolutional Neural Networks
To install HephAIstos, clone the repository from GitHub, either by downloading it or by running the following command in your terminal:
git clone https://github.com/xaviervasques/hephaistos.git
Next, navigate to the Hephaistos directory and install the requirements:
pip install -r requirements.txt
To execute the examples described in this README, create a Python file within the hephaistos directory and run it.
HephAIstos has the following dependencies:
- Python
- joblib
- numpy
- scipy
- pandas
- sklearn
- category_encoders
- hashlib
- matplotlib
- tensorflow
- qiskit
- qiskit_machine_learning
- torch
- torchvision
Datasets are included in the package.
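Datasets such as neurons and breastcancer are exposed as loader functions in the data.datasets module, each returning a pandas DataFrame, as the examples below assume:

from data.datasets import neurons, breastcancer
df = neurons()          # Load the neurons dataset as a pandas DataFrame
df_bc = breastcancer()  # Load the breast cancer dataset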
Feature Rescaling: StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler, Unit Vector Normalization, Log transformation, Square Root transformation, Reciprocal transformation, Box-Cox, Yeo-Johnson, Quantile Gaussian, and Quantile Uniform.
Categorical Data Encoding: Ordinal Encoding, One Hot Encoding, Label Encoding, Helmert Encoding, Binary Encoding, Frequency Encoding, Mean Encoding, Sum Encoding, Weight of Evidence Encoding, Probability Ratio Encoding, Hashing Encoding, Backward Difference Encoding, Leave One Out Encoding, James-Stein Encoding, M-estimator.
Time-Related Feature Engineering: Time Split (year, month, seconds, …), Lag, Rolling Window, Expanding Window.
Missing Values: Row/Column Removal, Statistical Imputation (Mean, Median, Mode), Linear Interpolation, Multivariate Imputation by Chained Equation (MICE), KNN Imputation.
Feature Extraction: Principal Component Analysis, Independent Component Analysis, Linear Discriminant Analysis, Locally Linear Embedding, t-distributed Stochastic Neighbor Embedding, Manifold Learning Techniques.
Feature Selection: Filter methods (Variance Threshold, Statistical Tests, Chi-Square Test, ANOVA F-value, Pearson correlation coefficient), Wrapper methods (Forward Stepwise Selection, Backward Elimination, Exhaustive Feature Selection), and Embedded methods (Least Absolute Shrinkage and Selection Operator, Ridge Regression, Elastic Net, Regularization embedded into ML algorithms, Tree-based Feature Importance, Permutation Feature Importance).
Classification Algorithms for CPUs: Support Vector Machine with linear, radial basis function, sigmoid, and polynomial kernel functions (svm_linear, svm_rbf, svm_sigmoid, svm_poly), Multinomial Logistic Regression (logistic_regression), Linear Discriminant Analysis (lda), Quadratic Discriminant Analysis (qda), Gaussian Naive Bayes (gnb), Multinomial Naive Bayes (mnb), K-Nearest Neighbors (kneighbors), Stochastic Gradient Descent (sgd), Nearest Centroid Classifier (nearest_centroid), Decision Tree Classifier (decision_tree), Random Forest Classifier (random_forest), Extremely Randomized Trees (extra_trees), Multi-layer Perceptron Classifier (mlp_neural_network), Multi-layer Perceptron Classifier with automatic hyperparameter tuning (mlp_neural_network_auto).
Classification Algorithms for GPUs: Logistic Regression (gpu_logistic_regression), Multi-layer Perceptron (gpu_mlp), Recurrent Neural Network (gpu_rnn), 2D Convolutional Neural Network (conv2d).
Classification Algorithms for QPUs: q_kernel_zz, q_kernel_default, q_kernel_8, q_kernel_9, q_kernel_10, q_kernel_11, q_kernel_12, q_kernel_training, q_kernel_8_pegasos, q_kernel_9_pegasos, q_kernel_10_pegasos, q_kernel_11_pegasos, q_kernel_12_pegasos, q_kernel_default_pegasos.
Regression Algorithms for CPUs: Linear Regression (linear_regression), SVR with linear kernel (svr_linear), SVR with RBF kernel (svr_rbf), SVR with sigmoid kernel (svr_sigmoid), SVR with polynomial kernel (svr_poly), Multi-layer Perceptron for regression (mlp_regression), Multi-layer Perceptron with automatic hyperparameter tuning for regression (mlp_auto_regression).
Regression Algorithms for GPUs: Linear Regression (gpu_linear_regression), Multi-layer Perceptron using GPUs for regression (gpu_mlp_regression), Recurrent Neural Network for regression (gpu_rnn_regression).
Convolutional Neural Networks: 2D convolutional neural network (CNN) using GPUs if they are available (conv2d).
To run a machine learning pipeline with HephAIstos, use the Python function ml_pipeline_function, which accepts user-defined parameters. This section explains all available options for ml_pipeline_function.

The first parameter is mandatory: a DataFrame containing a variable to predict, which we call "Target". The remaining parameters are optional, meaning you can ignore them, and depend on what you need to run. Assuming you provide a DataFrame defined as df, you can apply the following options:
- output_folder: To save figures, results, and inference models, set the path of the output folder where the pipeline results are stored, such as the model's accuracy metrics in a .csv file. Let's take an example. Create a Python file in hephaistos, write the following lines, and execute it:
from ml_pipeline_function import ml_pipeline_function

# Import dataset
from data.datasets import neurons
df = neurons()  # Load the neurons dataset

# Run the ML pipeline and store the output in the './Outputs/' folder
ml_pipeline_function(df, output_folder='./Outputs/')
- test_size: If the dataset is not a time series dataset, set the fraction of the data reserved for testing. For example, test_size=0.2 means that 20% of the data is used for testing. Example:
from ml_pipeline_function import ml_pipeline_function

# Import dataset
from data.datasets import neurons
df = neurons()  # Load the neurons dataset

# Run the ML pipeline: split the data into train and test sets with a
# 20% test size and store the output in the './Outputs/' folder
ml_pipeline_function(df, output_folder='./Outputs/', test_size=0.2)
- test_time_size: If the dataset is a time series dataset, use test_time_size instead of test_size. If you choose test_time_size=1000, the last 1000 values of the dataset are used for testing.
- time_feature_name: Name of the feature containing the time series.
- time_split: Splits the time variable by year, month, minutes, seconds, etc., as described in the pandas documentation. Available options: 'year', 'month', 'hour', 'minute', 'second'.
- time_format: The strftime format used to parse the time, e.g., "%d/%m/%Y". See the strftime documentation for more information and different options. For example, if the data looks like '1981-7-1 15:44:31', the format would be "%Y-%d-%m %H:%M:%S".
Example:
from ml_pipeline_function import ml_pipeline_function

# Import dataset
from data.datasets import DailyDelhiClimateTrain
df = DailyDelhiClimateTrain()  # Load the DailyDelhiClimateTrain dataset
df = df.rename(columns={"meantemp": "Target"})  # Rename the 'meantemp' column to 'Target'

# Run the ML pipeline: remove rows with missing values, use the last 365 days
# for testing, and split the 'date' feature into 'year', 'month', and 'day' columns
ml_pipeline_function(
    df,
    output_folder='./Outputs/',          # Store the output in the './Outputs/' folder
    missing_method='row_removal',        # Remove rows with missing values
    test_time_size=365,                  # Use the last 365 values for testing
    time_feature_name='date',            # Name of the time feature
    time_format="%Y-%m-%d",              # Format used to parse the time feature
    time_split=['year', 'month', 'day']  # Split the time feature into 'year', 'month', and 'day'
)
- missing_method: Select a methodology for handling missing values. Options: "row_removal", "column_removal", "stats_imputation_mean", "stats_imputation_median", "stats_imputation_mode", "linear_interpolation", "mice", "knn". Example:
from ml_pipeline_function import ml_pipeline_function

# Import dataset
from data.datasets import neurons
df = neurons()  # Load the neurons dataset

# Run the ML pipeline: remove rows with missing values and store the output
# in the './Outputs/' folder
ml_pipeline_function(df, output_folder='./Outputs/', missing_method='row_removal')
- time_transformation: To transform time series data, use techniques such as lag, rolling window, or expanding window. For example, to use lag, set time_transformation='lag'.
- If lag is selected, add the following parameters:
  - number_of_lags: An integer defining the number of lags you want.
  - lagged_features: The features to which lag is applied.
  - lag_aggregation: The aggregation method. The following options are available: "min", "max", "mean", "std", or "no". Several options can be selected at the same time.
- If rolling_window is selected:
  - window_size: An integer indicating the size of the window.
  - rolling_features: The features to which the rolling window is applied.
- If expanding_window is selected:
  - expanding_window_size: An integer indicating the size of the window.
  - expanding_features: The features to which the expanding window is applied.
Example:
from ml_pipeline_function import ml_pipeline_function
from data.datasets import DailyDelhiClimateTrain

df = DailyDelhiClimateTrain()  # Load the DailyDelhiClimateTrain dataset
df = df.rename(columns={"meantemp": "Target"})  # Rename the 'meantemp' column to 'Target'

# Run the ML pipeline: remove rows with missing values, use the last 365 days for testing,
# split the 'date' feature into 'year', 'month', and 'day' columns, and apply a lag
# transformation with 2 lags to 'wind_speed' and 'meanpressure', aggregated with 'min' and 'mean'
ml_pipeline_function(
    df,
    output_folder='./Outputs/',
    missing_method='row_removal',
    test_time_size=365,
    time_feature_name='date',
    time_format="%Y-%m-%d",
    time_split=['year', 'month', 'day'],
    time_transformation='lag',                       # Apply the lag transformation
    number_of_lags=2,                                # Number of lags
    lagged_features=['wind_speed', 'meanpressure'],  # Features to lag
    lag_aggregation=['min', 'mean']                  # Aggregation methods for the lagged features
)
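No rolling-window example appears above, so here is a minimal sketch based on the parameters just described; the window size and feature names are illustrative choices, not defaults:

from ml_pipeline_function import ml_pipeline_function
from data.datasets import DailyDelhiClimateTrain

df = DailyDelhiClimateTrain()  # Load the DailyDelhiClimateTrain dataset
df = df.rename(columns={"meantemp": "Target"})  # Rename the 'meantemp' column to 'Target'

# Sketch: apply a rolling window of size 7 to two features
ml_pipeline_function(
    df,
    output_folder='./Outputs/',
    missing_method='row_removal',
    test_time_size=365,
    time_feature_name='date',
    time_format="%Y-%m-%d",
    time_split=['year', 'month', 'day'],
    time_transformation='rolling_window',            # Use a rolling window instead of lag
    window_size=7,                                   # Size of the rolling window (illustrative)
    rolling_features=['wind_speed', 'meanpressure']  # Features to apply the rolling window to
)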
- categorical: If the dataset contains categorical data labeled with text, select data encoding methods. The following options are available: 'ordinal_encoding', 'one_hot_encoding', 'label_encoding', 'helmert_encoding', 'binary_encoding', 'frequency_encoding', 'mean_encoding', 'sum_encoding', 'weightofevidence_encoding', 'probability_ratio_encoding', 'hashing_encoding', 'backward_difference_encoding', 'leave_one_out_encoding', 'james_stein_encoding', 'm_estimator_encoding'. Different encoding methods can be combined.
- Indicate the features to encode with each method using the corresponding parameter:
- features_ordinal
- features_one_hot
- features_label
- features_helmert
- features_binary
- features_frequency
- features_mean
- features_sum
- features_weight
- features_proba_ratio
- features_hashing
- features_backward
- features_leave_one_out
- features_james_stein
- features_m
Example:
from ml_pipeline_function import ml_pipeline_function
from data.datasets import insurance

df = insurance()  # Load the insurance dataset
df = df.rename(columns={"charges": "Target"})  # Rename the 'charges' column to 'Target'

# Run the ML pipeline: remove rows with missing values, use a 20% test set,
# and encode the categorical features
ml_pipeline_function(
    df,
    output_folder='./Outputs/',
    missing_method='row_removal',
    test_size=0.2,
    categorical=['binary_encoding', 'label_encoding'],  # Encoding methods to apply
    features_binary=['smoker', 'sex'],                  # Binary-encode 'smoker' and 'sex'
    features_label=['region']                           # Label-encode 'region'
)
- rescaling: Include a data rescaling method. The following options are available:
- standard_scaler
- minmax_scaler
- maxabs_scaler
- robust_scaler
- normalizer
- log_transformation
- square_root_transformation
- reciprocal_transformation
- box_cox
- yeo_johnson
- quantile_gaussian
- quantile_uniform
Example:
from ml_pipeline_function import ml_pipeline_function

# Import data
from data.datasets import neurons
df = neurons()  # Load the neurons dataset

# Run the ML pipeline: remove rows with missing values, use a 20% test set,
# label-encode the 'Target' feature, and rescale the features with the standard scaler
ml_pipeline_function(
    df,
    output_folder='./Outputs/',
    missing_method='row_removal',
    test_size=0.2,
    categorical=['label_encoding'],
    features_label=['Target'],
    rescaling='standard_scaler'  # Apply standard scaling to the features
)
- features_extraction: Select the feature extraction method. The following options are available:
- pca
- ica
- icawithpca
- lda_extraction
- random_projection
- truncatedSVD
- isomap
- standard_lle
- modified_lle
- hessian_lle
- ltsa_lle
- mds
- spectral
- tsne
- nca
- number_components: Number of components to keep for PCA, ICA, LDA, etc.
- n_neighbors: Number of neighbors to consider for manifold learning techniques.

Example:
from ml_pipeline_function import ml_pipeline_function

# Import data
from data.datasets import neurons
df = neurons()  # Load the neurons dataset

# Run the ML pipeline: preprocess the data and apply PCA with 2 components
ml_pipeline_function(
    df,
    output_folder='./Outputs/',
    missing_method='row_removal',
    test_size=0.2,
    categorical=['label_encoding'],
    features_label=['Target'],
    rescaling='standard_scaler',
    features_extraction='pca',  # Apply Principal Component Analysis for feature extraction
    number_components=2         # Keep 2 principal components
)
This code snippet demonstrates how to use ml_pipeline_function to run an ML pipeline with PCA feature extraction. The dataset is first preprocessed by handling missing data, encoding categorical features, and rescaling the data. PCA then extracts two principal components before the final model is fit and evaluated.
ml_pipeline_function also lets you select a feature selection method (filter, wrapper, or embedded) to improve your machine learning model's performance. The different options and examples are provided below.
Filter options include:
- variance_threshold: Apply a variance threshold. If you choose this option, indicate the features to process (features_to_process=['feature_1', 'feature_2', …]) and the threshold (var_threshold=0 or any number).
- chi2: Perform a chi-square test on the samples and retain only the k best features. Define k with the k_features parameter.
- anova_f_c: Create a SelectKBest object to select the features with the k best ANOVA F-values for classification. Define k with the k_features parameter.
- anova_f_r: Create a SelectKBest object to select the features with the k best ANOVA F-values for regression. Define k with the k_features parameter.
- pearson: Keep the variables that are highly correlated with the target and uncorrelated among themselves. Define the Pearson correlation coefficient threshold between features with cc_features and between features and the target with cc_target.
Examples:
from ml_pipeline_function import ml_pipeline_function
# Import Data
from data.datasets import neurons
df = neurons() # Load the neurons dataset
# Run ML Pipeline
ml_pipeline_function(
df,
output_folder='./Outputs/', # Store the output in the './Outputs/' folder
missing_method='row_removal', # Remove rows with missing values
test_size=0.2, # Set the test set size to 20% of the dataset
categorical=['label_encoding'], # Apply label encoding for categorical features
features_label=['Target'], # Apply label encoding for the 'Target' feature
rescaling='standard_scaler', # Apply standard scaling to the features
feature_selection='pearson', # Apply Pearson correlation-based feature selection
cc_features=0.7, # Set the correlation coefficient threshold for pairwise feature correlation to 0.7
cc_target=0.7 # Set the correlation coefficient threshold for correlation with the target variable to 0.7
)
# Execute the ML pipeline with the loaded dataset, removing rows with missing values,
# using a test set of 20%, applying label encoding for the 'Target' feature, rescaling the features using the standard scaler,
# and performing feature selection based on Pearson correlation with thresholds of 0.7 for pairwise feature correlation and correlation with the target variable.
from ml_pipeline_function import ml_pipeline_function
# Import Data
from data.datasets import neurons
df = neurons() # Load the neurons dataset
# Run ML Pipeline
ml_pipeline_function(
df,
output_folder='./Outputs/', # Store the output in the './Outputs/' folder
missing_method='row_removal', # Remove rows with missing values
test_size=0.2, # Set the test set size to 20% of the dataset
categorical=['label_encoding'], # Apply label encoding for categorical features
features_label=['Target'], # Apply label encoding for the 'Target' feature
rescaling='standard_scaler', # Apply standard scaling to the features
feature_selection='anova_f_c', # Apply ANOVA F-test based feature selection
k_features=2 # Select the top 2 features based on their F-scores
)
# Execute the ML pipeline with the loaded dataset, removing rows with missing values,
# using a test set of 20%, applying label encoding for the 'Target' feature, rescaling the features using the standard scaler,
# and performing feature selection based on ANOVA F-test, selecting the top 2 features.
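No variance_threshold example appears above, so here is a minimal sketch based on the parameters described earlier; 'feature_1' and 'feature_2' are placeholders for actual column names:

from ml_pipeline_function import ml_pipeline_function

# Import Data
from data.datasets import neurons
df = neurons()  # Load the neurons dataset

# Sketch: drop the processed features whose variance does not exceed the threshold
ml_pipeline_function(
    df,
    output_folder='./Outputs/',
    missing_method='row_removal',
    test_size=0.2,
    categorical=['label_encoding'],
    features_label=['Target'],
    rescaling='standard_scaler',
    feature_selection='variance_threshold',          # Apply a variance threshold
    features_to_process=['feature_1', 'feature_2'],  # Placeholder feature names
    var_threshold=0                                  # Keep features with variance above 0
)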
Available feature_selection options for wrapper methods: forward_stepwise, backward_elimination, exhaustive.
- wrapper_classifier: Select a classifier or regressor from scikit-learn (e.g., KNeighborsClassifier(), RandomForestClassifier(), LinearRegression()) and apply it to the forward stepwise, backward elimination, or exhaustive method.
- min_features and max_features: Specify the minimum and maximum number of features in the combination for the exhaustive method.
Example:
from ml_pipeline_function import ml_pipeline_function
from sklearn.neighbors import KNeighborsClassifier
from data.datasets import breastcancer
# Import Data
df = breastcancer() # Load the breast cancer dataset
df = df.drop(["id"], axis=1) # Drop the 'id' column
# Run ML Pipeline
ml_pipeline_function(
df,
output_folder='./Outputs/', # Store the output in the './Outputs/' folder
missing_method='row_removal', # Remove rows with missing values
test_size=0.2, # Set the test set size to 20% of the dataset
categorical=['label_encoding'], # Apply label encoding for categorical features
features_label=['Target'], # Apply label encoding for the 'Target' feature
rescaling='standard_scaler', # Apply standard scaling to the features
feature_selection='backward_elimination', # Apply backward elimination for feature selection
wrapper_classifier=KNeighborsClassifier(), # Use K-nearest neighbors classifier for the wrapper method in backward elimination
k_features=2 # Select the top 2 features
)
# Execute the ML pipeline with the loaded dataset, removing rows with missing values,
# using a test set of 20%, applying label encoding for the 'Target' feature, rescaling the features using the standard scaler,
# and performing feature selection using backward elimination with a K-nearest neighbors classifier.
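The exhaustive option additionally takes min_features and max_features; a hedged sketch with illustrative bounds:

from ml_pipeline_function import ml_pipeline_function
from sklearn.neighbors import KNeighborsClassifier
from data.datasets import breastcancer

# Import Data
df = breastcancer()  # Load the breast cancer dataset
df = df.drop(["id"], axis=1)  # Drop the 'id' column

# Sketch: evaluate all feature combinations of size 2 to 4 with a KNN classifier
ml_pipeline_function(
    df,
    output_folder='./Outputs/',
    missing_method='row_removal',
    test_size=0.2,
    categorical=['label_encoding'],
    features_label=['Target'],
    rescaling='standard_scaler',
    feature_selection='exhaustive',  # Exhaustive wrapper method
    wrapper_classifier=KNeighborsClassifier(),
    min_features=2,  # Smallest combination size to evaluate (illustrative)
    max_features=4   # Largest combination size to evaluate (illustrative)
)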
You can select several embedded methods for feature_selection:
- lasso: If you choose lasso, add the alpha parameter (lasso_alpha).
- feat_reg_ml: Select features with regularization embedded into machine learning algorithms. Select the (scikit-learn) machine learning algorithm by setting the ml_penalty parameter.
- embedded_linear_regression
- embedded_logistic_regression
- embedded_decision_tree_regressor
- embedded_decision_tree_classifier
- embedded_random_forest_regressor
- embedded_random_forest_classifier
- embedded_permutation_regression
- embedded_permutation_classification
- embedded_xgboost_regression
- embedded_xgboost_classification
Example:
from ml_pipeline_function import ml_pipeline_function
from sklearn.svm import LinearSVC
from data.datasets import breastcancer
# Import Data
df = breastcancer() # Load the breast cancer dataset
df = df.drop(["id"], axis=1) # Drop the 'id' column
# Run ML Pipeline
ml_pipeline_function(
df,
output_folder='./Outputs/', # Store the output in the './Outputs/' folder
missing_method='row_removal', # Remove rows with missing values
test_size=0.2, # Set the test set size to 20% of the dataset
categorical=['label_encoding'], # Apply label encoding for categorical features
features_label=['Target'], # Apply label encoding for the 'Target' feature
rescaling='standard_scaler', # Apply standard scaling to the features
feature_selection='feat_reg_ml', # Apply feature selection using a machine learning model
ml_penalty=LinearSVC(
C=0.05, # Apply a regularization strength of 0.05
penalty='l1', # Apply L1 regularization
dual=False, # Use the primal form of the SVM problem
max_iter=5000 # Set the maximum number of iterations to 5000
)
)
# Execute the ML pipeline with the loaded dataset, removing rows with missing values,
# using a test set of 20%, applying label encoding for the 'Target' feature, rescaling the features using the standard scaler,
# and performing feature selection using a machine learning model with a LinearSVC with specified parameters.
In this example, we use the feat_reg_ml embedded feature selection method with a LinearSVC classifier. The LinearSVC classifier is specified with an L1 penalty, a C value of 0.05, and a maximum of 5000 iterations. The pipeline uses this classifier for feature selection and then proceeds with training and evaluation.
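For the lasso option, only the lasso_alpha parameter is documented; a minimal sketch with an illustrative alpha value:

from ml_pipeline_function import ml_pipeline_function
from data.datasets import breastcancer

# Import Data
df = breastcancer()  # Load the breast cancer dataset
df = df.drop(["id"], axis=1)  # Drop the 'id' column

# Sketch: LASSO-based embedded feature selection
ml_pipeline_function(
    df,
    output_folder='./Outputs/',
    missing_method='row_removal',
    test_size=0.2,
    categorical=['label_encoding'],
    features_label=['Target'],
    rescaling='standard_scaler',
    feature_selection='lasso',  # Embedded feature selection with LASSO
    lasso_alpha=0.1             # Regularization strength (illustrative value)
)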
Choose from the following classification algorithms that run on CPUs:
- svm_linear
- svm_rbf
- svm_sigmoid
- svm_poly
- logistic_regression
- lda
- qda
- gnb
- mnb
- kneighbors: Add the n_neighbors parameter to specify the number of neighbors.
- sgd
- nearest_centroid
- decision_tree
- random_forest: Optionally add the n_estimators_forest parameter to specify the number of estimators.
- extra_trees: Add the n_estimators_forest parameter to specify the number of estimators.
- mlp_neural_network: Set parameters such as max_iter, hidden_layer_sizes, activation, solver, alpha, learning_rate, and learning_rate_init.
- mlp_neural_network_auto

For each classification algorithm, add the number of k-folds for cross-validation (cv).
from ml_pipeline_function import ml_pipeline_function
from data.datasets import breastcancer
# Import Data
df = breastcancer() # Load the breast cancer dataset
df = df.drop(["id"], axis=1) # Drop the 'id' column
# Run ML Pipeline
ml_pipeline_function(
df,
output_folder='./Outputs/', # Store the output in the './Outputs/' folder
missing_method='row_removal', # Remove rows with missing values
test_size=0.2, # Set the test set size to 20% of the dataset
categorical=['label_encoding'], # Apply label encoding for categorical features
features_label=['Target'], # Apply label encoding for the 'Target' feature
rescaling='standard_scaler', # Apply standard scaling to the features
classification_algorithms=[
'svm_rbf', # Apply Support Vector Machine with radial basis function kernel
'lda', # Apply Linear Discriminant Analysis
'random_forest' # Apply Random Forest Classifier
],
n_estimators_forest=100, # Set the number of trees in the random forest to 100
cv=5 # Perform 5-fold cross-validation
)
# Execute the ML pipeline with the loaded dataset, removing rows with missing values,
# using a test set of 20%, applying label encoding for the 'Target' feature, rescaling the features using the standard scaler,
# and performing classification using SVM with RBF kernel, LDA, and Random Forest with the specified parameters.
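As a sketch of the mlp_neural_network hyperparameters listed above (the values are illustrative, not defaults):

from ml_pipeline_function import ml_pipeline_function
from data.datasets import breastcancer

# Import Data
df = breastcancer()  # Load the breast cancer dataset
df = df.drop(["id"], axis=1)  # Drop the 'id' column

# Sketch: multi-layer perceptron classifier with explicit hyperparameters
ml_pipeline_function(
    df,
    output_folder='./Outputs/',
    missing_method='row_removal',
    test_size=0.2,
    categorical=['label_encoding'],
    features_label=['Target'],
    rescaling='standard_scaler',
    classification_algorithms=['mlp_neural_network'],
    max_iter=1000,              # Maximum number of training iterations (illustrative)
    hidden_layer_sizes=(100,),  # One hidden layer with 100 units (illustrative)
    activation='relu',          # Activation function (illustrative)
    solver='adam',              # Weight-optimization solver (illustrative)
    alpha=0.0001,               # L2 regularization term (illustrative)
    learning_rate='constant',   # Learning-rate schedule (illustrative)
    learning_rate_init=0.001,   # Initial learning rate (illustrative)
    cv=5                        # 5-fold cross-validation
)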
Choose from the following classification algorithms that run on GPUs:
- gpu_logistic_regression: Set parameters such as gpu_logistic_optimizer, gpu_logistic_loss, and gpu_logistic_epochs.
- gpu_mlp: Set parameters such as gpu_mlp_optimizer, gpu_mlp_activation, gpu_mlp_loss, and gpu_mlp_epochs.
- gpu_rnn: Set parameters such as rnn_units, rnn_activation, rnn_optimizer, rnn_loss, and rnn_epochs.
from ml_pipeline_function import ml_pipeline_function
from data.datasets import breastcancer
# Import Data
df = breastcancer() # Load the breast cancer dataset
df = df.drop(["id"], axis=1) # Drop the 'id' column
# Run ML Pipeline
ml_pipeline_function(
df,
output_folder='./Outputs/', # Store the output in the './Outputs/' folder
missing_method='row_removal', # Remove rows with missing values
test_size=0.2, # Set the test set size to 20% of the dataset
categorical=['label_encoding'], # Apply label encoding for categorical features
features_label=['Target'], # Apply label encoding for the 'Target' feature
rescaling='standard_scaler', # Apply standard scaling to the features
classification_algorithms=[
'svm_rbf', # Apply Support Vector Machine with radial basis function kernel
'lda', # Apply Linear Discriminant Analysis
'random_forest', # Apply Random Forest Classifier
'gpu_logistic_regression' # Apply GPU-accelerated Logistic Regression
],
n_estimators_forest=100, # Set the number of trees in the random forest to 100
gpu_logistic_optimizer='adam', # Set the optimizer for GPU Logistic Regression to 'adam'
gpu_logistic_epochs=50, # Set the number of training epochs for GPU Logistic Regression to 50
cv=5 # Perform 5-fold cross-validation
)
# Execute the ML pipeline with the loaded dataset, removing rows with missing values,
# using a test set of 20%, applying label encoding for the 'Target' feature, rescaling the features using the standard scaler,
# and performing classification using SVM with RBF kernel, LDA, Random Forest, and GPU-accelerated Logistic Regression with the specified parameters.
Running the pipeline will print the steps of the processes and provide metrics for the models, such as accuracy, precision, recall, and F1 score.
Another example, specifying the optimizer and the loss function for GPU logistic regression:
from ml_pipeline_function import ml_pipeline_function
from data.datasets import breastcancer

# Import Data
df = breastcancer()  # Load the breast cancer dataset
df = df.drop(["id"], axis=1)  # Drop the 'id' column

# Run ML Pipeline
ml_pipeline_function(
    df,
    output_folder='./Outputs/',
    missing_method='row_removal',
    test_size=0.2,
    categorical=['label_encoding'],
    features_label=['Target'],
    rescaling='standard_scaler',
    classification_algorithms=[
        'svm_rbf',                 # Support Vector Machine with RBF kernel
        'lda',                     # Linear Discriminant Analysis
        'random_forest',           # Random Forest Classifier
        'gpu_logistic_regression'  # GPU-accelerated Logistic Regression
    ],
    n_estimators_forest=100,        # Number of trees in the random forest
    gpu_logistic_optimizer='adam',  # Optimizer for GPU Logistic Regression
    gpu_logistic_epochs=50,         # Number of training epochs
    gpu_logistic_loss='mse',        # Loss function: 'mse', 'binary_crossentropy', or 'categorical_crossentropy'
    cv=5                            # 5-fold cross-validation
)
Running this pipeline prints the corresponding metrics at the end.
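The gpu_mlp and gpu_rnn options take the analogous parameters listed above; here is a hedged sketch for gpu_mlp with illustrative values:

from ml_pipeline_function import ml_pipeline_function
from data.datasets import breastcancer

# Import Data
df = breastcancer()  # Load the breast cancer dataset
df = df.drop(["id"], axis=1)  # Drop the 'id' column

# Sketch: GPU-accelerated multi-layer perceptron classification
ml_pipeline_function(
    df,
    output_folder='./Outputs/',
    missing_method='row_removal',
    test_size=0.2,
    categorical=['label_encoding'],
    features_label=['Target'],
    rescaling='standard_scaler',
    classification_algorithms=['gpu_mlp'],
    gpu_mlp_activation='relu',           # Activation function (illustrative)
    gpu_mlp_optimizer='adam',            # Optimizer (illustrative)
    gpu_mlp_loss='binary_crossentropy',  # Loss function (illustrative)
    gpu_mlp_epochs=50,                   # Number of training epochs (illustrative)
    cv=5
)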
Choose from the following classification algorithms that run on Quantum Processing Units (QPUs):
- Quantum Kernel methods: q_kernel_default, q_kernel_8, q_kernel_9, q_kernel_10, q_kernel_11, q_kernel_12, q_kernel_default_pegasos, q_kernel_8_pegasos, q_kernel_9_pegasos, q_kernel_10_pegasos, q_kernel_11_pegasos, q_kernel_12_pegasos, q_kernel_zz, q_kernel_zz_pegasos.
- Quantum Neural Networks: q_samplerqnn, q_estimatorqnn, q_vqc.
- Quantum Kernel Alignment (QKA) methods: q_kernel_training.
These algorithms accept the following parameters:
- reps: Number of times the feature map circuit is repeated.
- ibm_account: Your IBM Quantum API key.
- quantum_backend: The backend on which to run the quantum algorithms. Choose from: ibmq_qasm_simulator, ibmq_armonk, ibmq_santiago, ibmq_bogota, ibmq_lima, ibmq_belem, ibmq_quito, simulator_statevector, simulator_mps, simulator_extended_stabilizer, simulator_stabilizer, ibmq_manila.
- multiclass: Choose from OneVsRestClassifier, OneVsOneClassifier, svc (pass the quantum kernel to SVC from scikit-learn), or None (use QSVC from Qiskit).
- n_steps: Number of steps performed during the training procedure.
- C: Regularization parameter.
from ml_pipeline_function import ml_pipeline_function
from data.datasets import breastcancer
# Import Data
df = breastcancer() # Load the breast cancer dataset
df = df.drop(["id"], axis=1) # Drop the 'id' column
# Run ML Pipeline
ml_pipeline_function(
df,
output_folder='./Outputs/', # Store the output in the './Outputs/' folder
missing_method='row_removal', # Remove rows with missing values
test_size=0.2, # Set the test set size to 20% of the dataset
categorical=['label_encoding'], # Apply label encoding for categorical features
features_label=['Target'], # Apply label encoding for the 'Target' feature
rescaling='standard_scaler', # Apply standard scaling to the features
features_extraction='pca', # Apply Principal Component Analysis for feature extraction
classification_algorithms=['svm_linear'], # Apply Support Vector Machine with a linear kernel
number_components=2, # Set the number of principal components to 2
cv=5, # Perform 5-fold cross-validation
quantum_algorithms=[
'q_kernel_default',
'q_kernel_zz',
'q_kernel_8',
'q_kernel_9',
'q_kernel_10',
'q_kernel_11',
'q_kernel_12'
], # List of quantum algorithms to use
reps=2, # Set the number of repetitions for the quantum circuits
ibm_account=YOUR_API, # Replace with your IBM Quantum API key
quantum_backend='qasm_simulator' # Use the QASM simulator as the quantum backend
)
# Execute the ML pipeline with the loaded dataset, removing rows with missing values,
# using a test set of 20%, applying label encoding for the 'Target' feature, rescaling the features using the standard scaler,
# and performing classification using SVM with a linear kernel, PCA for feature extraction, and quantum algorithms with the specified parameters.
To run the algorithm on the least busy quantum backend, set the quantum_backend parameter to 'least_busy':
quantum_backend = 'least_busy'
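For the Pegasos variants, the n_steps and C parameters described above apply; a hedged sketch with illustrative values:

from ml_pipeline_function import ml_pipeline_function
from data.datasets import breastcancer

# Import Data
df = breastcancer()  # Load the breast cancer dataset
df = df.drop(["id"], axis=1)  # Drop the 'id' column

# Sketch: Pegasos quantum kernel with training-procedure parameters
ml_pipeline_function(
    df,
    output_folder='./Outputs/',
    missing_method='row_removal',
    test_size=0.2,
    categorical=['label_encoding'],
    features_label=['Target'],
    rescaling='standard_scaler',
    features_extraction='pca',
    number_components=2,
    quantum_algorithms=['q_kernel_default_pegasos'],
    reps=2,                       # Feature map repetitions
    n_steps=100,                  # Steps in the training procedure (illustrative)
    C=1000,                       # Regularization parameter (illustrative)
    ibm_account=YOUR_API,         # Replace with your IBM Quantum API key
    quantum_backend='least_busy'  # Run on the least busy backend
)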
The pipeline supports the following regression algorithms:
- linear_regression
- svr_linear
- svr_rbf
- svr_sigmoid
- svr_poly
- mlp_regression
- mlp_auto_regression
- gpu_linear_regression: Linear regression using the SGD optimizer. The following parameters can be configured:
  - gpu_linear_activation: 'linear'
  - gpu_linear_epochs: An integer for the number of epochs
  - gpu_linear_learning_rate: Learning rate for the SGD optimizer
  - gpu_linear_loss: Loss function, such as 'mse', 'binary_crossentropy', or 'categorical_crossentropy'
- gpu_mlp_regression: Multi-layer perceptron neural network using GPUs for regression. The following parameters can be configured:
  - gpu_mlp_epochs_r: An integer for the number of epochs
  - gpu_mlp_activation_r: Activation function, e.g., softmax, sigmoid, linear, or tanh
- gpu_rnn_regression: Recurrent neural network for regression. The following parameters can be configured:
  - rnn_units: Positive integer, the dimensionality of the output space
  - rnn_activation: Activation function, e.g., softmax, sigmoid, linear, or tanh
  - rnn_optimizer: Optimizer, e.g., adam, sgd, RMSprop
  - rnn_loss: Loss function, such as 'mse', 'binary_crossentropy', or 'categorical_crossentropy'
  - rnn_epochs: An integer for the number of epochs
# Import the required modules
from ml_pipeline_function import ml_pipeline_function
from data.datasets import breastcancer
# Load the breast cancer dataset
df = breastcancer()
# Drop the 'id' column from the dataset
df = df.drop(["id"], axis=1)
# Run the Machine Learning (ML) pipeline function with the following parameters:
# - Dataframe: df
# - Output folder: './Outputs/'
# - Missing data handling method: 'row_removal'
# - Test dataset size: 20% of the total dataset
# - Categorical data handling: 'label_encoding'
# - Target feature label: 'Target'
# - Data rescaling method: 'standard_scaler'
# - Regression algorithms to be used: 'linear_regression', 'svr_linear', 'svr_rbf', 'gpu_linear_regression'
# - GPU Linear Regression specific parameters:
# - Number of training epochs: 50
# - Activation function: 'linear'
# - Learning rate: 0.01
# - Loss function: 'mse'
ml_pipeline_function(
df,
output_folder='./Outputs/',
missing_method='row_removal',
test_size=0.2,
categorical=['label_encoding'],
features_label=['Target'],
rescaling='standard_scaler',
regression_algorithms=['linear_regression', 'svr_linear', 'svr_rbf', 'gpu_linear_regression'],
gpu_linear_epochs=50,
gpu_linear_activation='linear',
gpu_linear_learning_rate=0.01,
gpu_linear_loss='mse',
)
Running this example will print the steps of the process and report the models' metrics. Note that the data used might not be optimal for the chosen algorithms.
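No gpu_mlp_regression example appears above; based on the two documented parameters, a minimal sketch would be:

from ml_pipeline_function import ml_pipeline_function
from data.datasets import breastcancer

# Load the breast cancer dataset and drop the 'id' column
df = breastcancer()
df = df.drop(["id"], axis=1)

# Sketch: GPU multi-layer perceptron regression
ml_pipeline_function(
    df,
    output_folder='./Outputs/',
    missing_method='row_removal',
    test_size=0.2,
    categorical=['label_encoding'],
    features_label=['Target'],
    rescaling='standard_scaler',
    regression_algorithms=['gpu_mlp_regression'],
    gpu_mlp_epochs_r=50,           # Number of training epochs (illustrative)
    gpu_mlp_activation_r='linear'  # Activation function (illustrative)
)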
In this example, we demonstrate how to use the pipeline for training a Recurrent Neural Network (RNN) model on the Breast Cancer dataset.
# Import the required modules
from ml_pipeline_function import ml_pipeline_function
from data.datasets import breastcancer
# Load the breast cancer dataset
df = breastcancer()
# Drop the 'id' column from the dataset as it is not relevant for analysis
df = df.drop(["id"], axis=1)
# Call the machine learning pipeline function with the following parameters:
# - DataFrame (df)
# - Output folder for saving results ('./Outputs/')
# - Method for handling missing values ('row_removal')
# - Test set size (0.2 or 20%)
# - Encoding method for categorical variables ('label_encoding')
# - Column name for target variable in the dataset (['Target'])
# - Data rescaling method ('standard_scaler')
# - Regression algorithms to be applied (['linear_regression', 'svr_linear', 'svr_rbf', 'gpu_rnn_regression'])
# - Number of epochs for RNN training (50)
# - Activation function for RNN ('linear')
# - Optimizer for RNN ('adam')
# - Number of units in RNN (500)
# - Loss function for RNN ('mse')
ml_pipeline_function(
df,
output_folder='./Outputs/',
missing_method='row_removal',
test_size=0.2,
categorical=['label_encoding'],
features_label=['Target'],
rescaling='standard_scaler',
regression_algorithms=['linear_regression', 'svr_linear', 'svr_rbf', 'gpu_rnn_regression'],
rnn_epochs=50,
rnn_activation='linear',
rnn_optimizer='adam',
rnn_units=500,
rnn_loss='mse'
)
In this example, we demonstrate how to use the pipeline for training a Recurrent Neural Network (RNN) regression model on the Daily Delhi Climate dataset.
# Import the required modules
from ml_pipeline_function import ml_pipeline_function
import pandas as pd
# Load the Daily Delhi Climate Train dataset
DailyDelhiClimateTrain = './data/datasets/DailyDelhiClimateTrain.csv'
df = pd.read_csv(DailyDelhiClimateTrain, delimiter=',')
# Convert the 'date' column to datetime format
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
# Extract the year, month, and day from the 'date' column and create new columns for each
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
# Drop the 'date' column
df.drop('date', inplace=True, axis=1)
# Rename the 'meantemp' column to 'Target'
df = df.rename(columns={"meantemp": "Target"})
# Drop rows with at least one missing value
df = df.dropna()
# Run the Machine Learning (ML) pipeline function with the following parameters:
# - Dataframe: df
# - Output folder: './Outputs/'
# - Missing data handling method: 'row_removal'
# - Test dataset size: 20% of the total dataset
# - Data rescaling method: 'standard_scaler'
# - Regression algorithm to be used: 'gpu_rnn_regression'
# - GPU RNN Regression specific parameters:
# - Number of units: 500
# - Activation function: 'tanh'
# - Optimizer: 'RMSprop'
# - Loss function: 'mse'
# - Number of training epochs: 50
ml_pipeline_function(
df,
output_folder='./Outputs/',
missing_method='row_removal',
test_size=0.2,
rescaling='standard_scaler',
regression_algorithms=['gpu_rnn_regression'],
rnn_units=500,
rnn_activation='tanh',
rnn_optimizer='RMSprop',
rnn_loss='mse',
rnn_epochs=50,
)
Running this example will train the RNN regression model on the preprocessed dataset and display the model's performance metrics.
- conv2d: 2D convolutional neural network (CNN) using GPUs if they are available. The parameters are the following:
  - conv_kernel_size: The size of the filter matrix for the convolution (conv_kernel_size x conv_kernel_size).
  - conv_activation: Activation function to use (softmax, sigmoid, linear, relu, or tanh).
  - conv_optimizer: Optimizer (adam, sgd, RMSprop).
  - conv_loss: Loss function such as the mean squared error ('mse'), the binary logarithmic loss ('binary_crossentropy'), or the multi-class logarithmic loss ('categorical_crossentropy').
  - conv_epochs: Number (integer) of epochs.
In this example, we train a 2D CNN on the MNIST dataset using the pipeline:
# Import the ml_pipeline_function from the external module
from ml_pipeline_function import ml_pipeline_function
# Import pandas for data manipulation
import pandas as pd
# Import TensorFlow library for machine learning and deep learning
import tensorflow as tf
# Import the mnist dataset from the Keras library
from keras.datasets import mnist
# Import to_categorical function to convert integer labels to binary class matrices
from tensorflow.keras.utils import to_categorical
# Load the MNIST dataset into a tuple of tuples (train and test data)
df = mnist.load_data()
# Assign the train and test data to variables
(X_train, y_train), (X_test, y_test) = df
# Use the training split as the features (X) and labels (y) passed to the pipeline
(X, y), (_, _) = df
# Reshape the training data to fit the model's input shape (number of images, height, width, and channels)
X_train = X_train.reshape(X_train.shape[0],X_train.shape[1],X_train.shape[2],1)
# Reshape the testing data to fit the model's input shape
X_test = X_test.reshape(X_test.shape[0],X_test.shape[1],X_test.shape[2],1)
# Reshape the whole dataset to fit the model's input shape
X = X.reshape(X.shape[0],X.shape[1],X.shape[2],1)
# Call the ml_pipeline_function with the given parameters to train and evaluate a convolutional neural network
ml_pipeline_function(
    df, X, y, X_train, y_train, X_test, y_test,
    output_folder='./Outputs/',
    convolutional=['conv2d'],              # Train a 2D convolutional neural network
    conv_activation='relu',                # Activation function
    conv_kernel_size=3,                    # 3 x 3 convolution filter
    conv_optimizer='adam',                 # Optimizer
    conv_loss='categorical_crossentropy',  # Multi-class logarithmic loss
    conv_epochs=1                          # Number of epochs
)
This code snippet imports the necessary modules, loads the MNIST dataset, and splits it into training and test sets. The data is reshaped to fit the model, and then the ML pipeline function is called with the specified parameters. The pipeline function performs tasks such as data preprocessing, fitting a Convolutional Neural Network (CNN) model to the data using a 2D convolutional layer, and saving the results in the specified output folder. Running this example will train the 2D CNN model on the preprocessed dataset and display the model's performance metrics.