# Utility Functions for PR104

1. [ann_cv()](#ann)
1. [arff_tocsv()](#arff_tocsv)
1. [compare_classifiers()](#compare_class)
1. [compare_roc()](#compare_roc)
1. [data_barplot()](#barplot)
1. [data_summary()](#data_summary)
1. [df_display()](#df_display)
1. [nbayes_cv()](#naivebayes)
1. [knn_cv()](#knn)
1. [tree_cv()](#tree)
1. [plot_line()](#plot_line)
1. [plot_performance()](#plotperf)
1. [roc_cv()](#roc_cv)
1. [save_fig()](#save_fig)
1. [scatter_custom()](#scatter_cust)
1. [svm_cv()](#svm)
1. [varselect_tocsv()](#varselect_tocsv)

In [9]:
%run setup.ipynb

<br>

### ann_cv(*df, ax = None, cbar = False, normalize = None, folds = 10, shuffle = True, seed = 42, n_iter = 50, **kargs*)<a class='anchor' id='ann'></a>
#### A wrapper function that performs cross-validated tuning and evaluation of an ANN with 1 hidden layer on a given dataset

The function runs a randomized search over the parameter space using `RandomizedSearchCV`. Unlike `GridSearchCV`, it does not try every parameter value; instead, a fixed number of parameter settings is sampled from the specified distributions. The `n_iter` parameter sets the computation budget, that is, the number of sampled candidates, so increasing `n_iter` leads to a finer search at the cost of a longer runtime.

For each parameter, either a distribution over possible values or a list of discrete choices (which will be sampled uniformly) can be specified.
For continuous parameters, such as $\alpha$ in this case, it is important to specify a continuous distribution to take full advantage of the randomization. A continuous log-uniform random variable is available through `loguniform`. It is the continuous counterpart of a log-spaced grid and is useful for searching penalty values, which are often explored across several orders of magnitude, at least as a first step. For example, to specify $\alpha$, `loguniform(0.0001, 0.01)` can be used instead of `[0.0001, 0.001, 0.01]`.
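
The difference between the discrete list and the continuous distribution can be checked directly. The snippet below is a minimal sketch, assuming `loguniform` is the one from `scipy.stats`; the sample size and the seed are arbitrary choices made for illustration.

```python
import numpy as np
from scipy.stats import loguniform

# Discrete grid: only these three values can ever be drawn
grid = [0.0001, 0.001, 0.01]

# Continuous log-uniform distribution over the same range
alpha_dist = loguniform(0.0001, 0.01)
samples = alpha_dist.rvs(size=1000, random_state=42)

# Every sample stays inside the bounds of the distribution
print(samples.min() >= 0.0001, samples.max() <= 0.01)

# Samples are spread roughly evenly across the two orders of magnitude,
# so about half of them should fall below 1e-3
below_1e3 = np.mean(samples < 0.001)
print(f"share of samples below 1e-3: {below_1e3:.2f}")
```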

The function has the following parameters:

- **`df`**: Dataframe containing the data.
- **`ax`**: *Axes object, default = None* 
     <br> Axes on which to draw the plot of the confusion matrix.
- **`cbar`**: *bool, default = False*
     <br> Whether or not to display the colorbar.
- **`normalize`**: *None or {'true', 'pred', 'all'}*
    <br> Normalization mode to apply to the confusion matrix.
- **`folds`**: *int, default = 10*
    <br> Number of folds to use in the k-fold cross-validation.
- **`shuffle`**: *bool, default = True*
    <br> Whether or not to shuffle the data before applying k-fold cross-validation.
- **`seed`**: *int, default = 42*
    <br> Integer used as a seed for the random number generator.
- **`n_iter`**: *int, default = 50*
    <br> Number of parameter settings that are sampled. Tunes the trade-off runtime vs quality of the solution.
- **`**kargs`**: 
    <br> Additional keyword arguments to pass to the `MLPClassifier`.

The function outputs a dictionary containing the following:

- **`ConfusionMatrixDisplay`**: *ConfusionMatrixDisplay object*
    <br> Object containing the confusion matrix, labels and the Confusion Matrix visualization.
- **`confusionmatrix`**: *ndarray of shape (n_classes, n_classes)*
  <br> Matrix whose *i*-th row and *j*-th column entry indicates the number of samples with true label being *i*-th class and predicted label being *j*-th class.
- **`results`**: *dict of ndarrays*
    <br> A dictionary that summarizes the results of cross-validation for each hyperparameter combination tried during the random search.
- **`accuracy`**: *float*
    <br>Mean cross-validated accuracy of the model.
- **`recall`**: *float*
    <br>Mean cross-validated percentage recall of the model.
- **`auc`**: *float*
    <br>Mean cross-validated AUC of the ROC of the model.
- **`model`**: *sklearn.pipeline.Pipeline object*
    <br>The best estimator from `RandomizedSearchCV`
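
How the reported metrics relate to the search results can be sketched on synthetic data. The example below is an illustration only: the dataset, the reduced search space, the three folds and `n_iter = 4` are assumptions chosen to keep it fast, and the pipeline scales every column directly instead of using a column transformer.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic binary data standing in for the real dataset (assumption)
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

pipe = make_pipeline(RobustScaler(), MLPClassifier(max_iter=300, random_state=42))

# Reduced search space: one discrete list and one continuous distribution
space = {
    'mlpclassifier__hidden_layer_sizes': [(3,), (5,), (7,)],
    'mlpclassifier__alpha': loguniform(0.0001, 0.01),
}
metrics = ('accuracy', 'balanced_accuracy', 'recall', 'roc_auc')
kf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

search = RandomizedSearchCV(pipe, space, scoring=metrics, cv=kf,
                            refit='balanced_accuracy', n_iter=4, random_state=42)
search.fit(X, y)

# With multi-metric scoring, best_index_ refers to the metric named in `refit`;
# the other metrics are read from cv_results_ at that same index
best = search.best_index_
summary = {m: search.cv_results_[f'mean_test_{m}'][best] for m in metrics}
print(search.best_params_)
print(summary)
```

Because several metrics are scored at once, `refit` has to name the metric that selects the best candidate; the remaining metrics are only reported for that candidate.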

In [1]:
def ann_cv(df, ax = None, cbar = False, normalize = None, folds = 10, shuffle = True, seed = 42, n_iter = 50, **kargs):
    
    # If no Axes object is provided, create a new one
    if ax is None:
        ax = plt.gca()
    # Check if ax is an instance of the Axes class
    elif not isinstance(ax, Axes):
        raise TypeError("The 'ax' argument must be an Axes object")
        
    # Separate the target variable from predictors
    y = df['defects']
    X = df.drop('defects', axis = 1)
    predictors = X.columns.values.tolist()
    
    # Create StratifiedKFold object
    # If shuffle is False, ignore any passed seed to avoid a ValueError
    kf = StratifiedKFold(n_splits = folds, shuffle = shuffle, random_state = seed if shuffle else None)

    # Create a MLPClassifier model
    ann = MLPClassifier(random_state = seed, **kargs)
    
    # Create column transformer to normalize predictors
    z = make_column_transformer((RobustScaler(), predictors), remainder = "passthrough")
    # Create pipeline
    pipe = make_pipeline(z, ann)
    
    # Perform Randomized Search to find the optimal parameters values
    space = {
        'mlpclassifier__hidden_layer_sizes': [i for i in range(3,26,2)], # number of neurons in the hidden layer
    'mlpclassifier__activation':['logistic', 'tanh' , 'relu'], # Activation function for the hidden layer
    'mlpclassifier__alpha': loguniform(0.0001, 0.01) # Strength of the L2 regularization term
    }
    metrics = ('accuracy','balanced_accuracy','recall','roc_auc')
    rnd_srch = RandomizedSearchCV(pipe, space, scoring = metrics, cv = kf, refit = 'balanced_accuracy', n_iter = n_iter,
                                 n_jobs = -1, verbose = 0, random_state = seed)
    rnd_srch.fit(X, y)
    
    # Retrieve the mean cross-validated accuracy, recall and AUC of the best candidate
    acc = rnd_srch.cv_results_['mean_test_accuracy'][rnd_srch.best_index_]
    recall_perc = rnd_srch.cv_results_['mean_test_recall'][rnd_srch.best_index_]*100
    auc = rnd_srch.cv_results_['mean_test_roc_auc'][rnd_srch.best_index_]
    
    # Generate cross-validated predicted labels for the target class
    y_pred = cross_val_predict(rnd_srch.best_estimator_, X, y, cv=kf)
    
    
    # Create confusion matrix
    lab = ['Clean','Buggy']
    disp = ConfusionMatrixDisplay.from_predictions(y_true = y, y_pred = y_pred,  cmap = 'Blues', display_labels = lab, 
                                                   normalize = normalize, ax = ax, colorbar = cbar)
    ax.set_title('ANN', fontsize=14, fontweight='bold')
    
    # Create output
    out = {'ConfusionMatrixDisplay' : disp,
           'confusionmatrix' : disp.confusion_matrix,
           'results' : rnd_srch.cv_results_,
           'accuracy' : round(acc, 3),
           'recall' : round(recall_perc, 2),
           'auc' : round(auc, 3),
           'model' : rnd_srch.best_estimator_}
    
    return out

<br>

### arff_tocsv(*name*) <a class='anchor' id='arff_tocsv'></a>
#### Loads *.arff* data and saves it in a CSV file
This function loads data from an ARFF file and saves it as a CSV file. The ARFF file is opened in a `with` statement and read with the `loadarff` function, so the file is closed automatically when the `with` block ends and no file handle is left open.
The function then checks the dataset for missing values and, if any are found, imputes them with the median of the corresponding column.
Finally, it prints a message confirming that the ARFF file was loaded and saves the DataFrame as a CSV file using the `to_csv()` method. The CSV file is written to the same directory as the ARFF file (`DATA_PATH`) under the same name, overwriting any existing data. The function returns the final DataFrame, or `None` if the ARFF file is not found.

The function takes one argument:

- **`name`**: a string that specifies the name of the ARFF file to be loaded, without its extension. The function appends `.arff` to it when building the path to the file.
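
The load, impute and cast steps can be tried without any file on disk. The snippet below is a minimal sketch: the three-row ARFF text is made up for illustration, only three of the real column names are used, and all attributes are assumed to be numeric.

```python
import io

import pandas as pd
from scipy.io.arff import loadarff
from sklearn.impute import SimpleImputer

# Toy ARFF content (hypothetical values); '?' marks a missing entry
arff_text = """@relation toy
@attribute loc numeric
@attribute branchCount numeric
@attribute defects numeric
@data
10,3,0
?,5,1
30,?,0
"""

# loadarff accepts a file-like object and returns (data, metadata)
data, meta = loadarff(io.StringIO(arff_text))
df = pd.DataFrame(data)
print(df.isnull().sum().sum(), "missing values before imputation")

# Replace each missing entry with the median of its column
imputed = SimpleImputer(strategy='median').fit_transform(df)
df = pd.DataFrame(imputed, columns=df.columns)

# Cast the count-like columns back to integers
df = df.astype({'loc': 'int', 'branchCount': 'int', 'defects': 'int'})
print(df)
```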

In [3]:
from scipy.io.arff import loadarff 
from sklearn.impute import SimpleImputer

def arff_tocsv(name):
    # Constructs the full file path for the ARFF file
    path = os.path.join(DATA_PATH, name + ".arff")
    
    try:
        # Reads the data from the file into a NumPy structured array
        with open(path, "r") as arff_file:
            raw_data = loadarff(arff_file)
    except FileNotFoundError:
        print("The specified ARFF file was not found.")
        return
    
    # Define the columns that should be of integer type
    intdict = {'loc':'int','v(g)':'int','ev(g)':'int',
               'iv(g)': 'int','n':'int','lOCode':'int',
               'lOComment':'int','lOBlank':'int',
               'lOCodeAndComment':'int','uniq_Op':'int',
               'uniq_Opnd':'int','total_Op':'int',
               'total_Opnd':'int','branchCount':'int',
               'defects':'int'}
    
    # Array is then converted to a Pandas dataframe 
    df_data = pd.DataFrame(raw_data[0])
    
    # Check if there are any missing values in the dataframe
    # 5 rows with missing values can be found in jm1.arff data source
    if df_data.isnull().any().any():
        # Impute missing values with median of the same column
        imputer = SimpleImputer(strategy='median')
        df_data_num = df_data.select_dtypes(include=[np.number])
        # Write the imputed values back into the numeric columns only, so that
        # column labels stay aligned if the data also has non-numeric columns
        df_data[df_data_num.columns] = imputer.fit_transform(df_data_num)
    
    # Convert the variables contained in the intdict dictionary to an integer dtype
    df_data = df_data.astype(dtype=intdict)

    print(name + ".arff","successfully loaded")
    
    # Saves the dataframe as a CSV file using the same name as the ARFF file
    # (any existing data is overwritten)
    df_data.to_csv(f'{DATA_PATH}/{name}.csv', mode="w")
    print("Saved in", f'{DATA_PATH}/{name}.csv')
    return df_data


<br>

### compare_classifiers(*df, models_list, outname, cbar = False, normalize = None, folds = 10, shuffle = True, seed = 42*) <a class='anchor' id='compare_class'></a>
#### Compares classifiers by generating side-by-side confusion-matrix visualizations and evaluating their accuracy, recall and AUC. The full results are saved in a pickle file and a summary is returned as a dictionary.

The function has the following parameters:

- **`df`**: *pandas dataframe*
    <br> Input data to apply the machine learning models on.
- **`models_list`**: *list of dictionaries*
    <br> List of dictionaries containing the classification models (with optional parameters) to apply on the input data.
- **`outname`**: *string*
    <br> Output name for the results pickle file.
- **`cbar`**: *bool, default = False*
    <br> Whether or not to display color bar in the confusion matrix.
- **`normalize`**: *None or {'true', 'pred', 'all'}*
    <br> Normalization mode to apply to the confusion matrix.
- **`folds`**: *int, default = 10*
    <br> Number of folds for cross-validation.
- **`shuffle`**: *bool, default = True*
    <br> Whether to shuffle the input data before applying the cross-validation.
- **`seed`**: *int, default = 42*
    <br> Seed for the random number generator.

The function returns a dictionary (`res`) with the following keys:

- **`models`**: *list*
    <br> Fitted classifier models, one per entry in `models_list`.
- **`accuracy`**: *ndarray*
    <br> Accuracy score of each model.
- **`recall`**: *ndarray*
    <br> Recall score of each model.
- **`auc`**: *ndarray*
    <br> AUC score of each model.

If an error occurs while processing `models_list`, the function prints the error, closes the plot and returns `None`.
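
The expected shape of `models_list` is easiest to see on a toy case. In the sketch below, `toy_cv` is a hypothetical stand-in for wrappers such as `ann_cv` or `knn_cv`: each entry is a dictionary whose `'model'` key holds the wrapper function, and every other key is forwarded to that function as a keyword argument.

```python
import copy

def toy_cv(df, folds=10, **kargs):
    # Hypothetical wrapper: just echoes what it received
    return {'folds': folds, 'extra': kargs}

# One dictionary per model; keys other than 'model' are optional parameters
models_list = [
    {'model': toy_cv},
    {'model': toy_cv, 'n_iter': 5},
]

outputs = []
# Work on a deep copy so that pop() does not alter the caller's list
for spec in copy.deepcopy(models_list):
    fn = spec.pop('model')                      # the callable to run
    outputs.append(fn(None, folds=3, **spec))   # remaining keys become kwargs

print(outputs)
```

Popping from a copy leaves `models_list` reusable, for example to pass the same list to another comparison afterwards.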


In [1]:
def compare_classifiers(df, models_list, outname, cbar = False, normalize = None, folds = 10, shuffle = True, seed = 42):
    
    try:
        full_res = {} # dict to store results of all models
        models = copy.deepcopy(models_list) # creating a copy of models_list
        n_mod = len(models) # number of models
        accuracies = np.empty(n_mod) # empty array to store accuracy of each model
        recalls = np.empty(n_mod) # empty array to store recall of each model
        aucs = np.empty(n_mod) # empty array to store AUC of each model
        fitted = [] # list to store fitted models
        
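        # Creates a subplot grid with 1 row and variable number of columns
        # (squeeze=False keeps `ax` indexable even when there is a single model)
        fig, ax = plt.subplots(1, n_mod, figsize=((n_mod*5), 4), squeeze=False)
        ax = ax.ravel()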
        
        # Loop through all models
        for i, model_data in enumerate(models):
            model = model_data.pop('model')
            
            # Check if the model is callable
            if not callable(model):
                raise TypeError(f"The 'model' attribute in models_list[{i}] is not callable: {model}")
            
            # Call the model with the required arguments and any additional arguments
            out = model(df, ax=ax[i], cbar = cbar, normalize = normalize, folds = folds, shuffle = shuffle, seed = seed, **model_data)
            
            # Get the classifier name
            if isinstance(out['model'], Pipeline):
                name = out['model'].steps[-1][1].__class__.__name__ 
            else:
                name = out['model'].__class__.__name__
            
            # Store results of this model in the dict
            full_res[name] = out
            
            # Append the fitted model to the list
            fitted.append(out['model'])
            
            # Store accuracy, recall and AUC of this model
            accuracies[i] = out['accuracy']
            recalls[i] = out['recall']
            aucs[i] = out['auc']

        # Save results to a binary file
        with open(f'{RESULTS_PATH}/{outname}.pickle', 'wb') as handle:
            pickle.dump(full_res, handle, protocol=pickle.HIGHEST_PROTOCOL)

        # Store models, accuracy, recall and AUC in a dict
        res = {'models':fitted, 'accuracy':accuracies, 'recall':recalls, 'auc':aucs}
        return  res
    
    except Exception as e:
        # If an error occurs, close the plot and return None
        print(f"An error occurred while processing models_list: {e}")
        plt.close()
        return None


<br>

### compare_roc(*df, models_list, palette = None, **kargs*) <a class='anchor' id='compare_roc'></a>
#### A function that compares the ROC curves of multiple binary classifiers.

The function has the following parameters:

- **`df`**: *Pandas DataFrame*
    <br> Dataframe containing features and target variables. The target variable should be binary.
- **`models_list`**: *list*
    <br> A list of binary classifiers. Each classifier in the list must have `fit` and `predict_proba` methods.
- **`palette`**: *callable colormap, default = None*
    <br> Color palette for the ROC curve plots. If `None`, the `'tab10'` palette from Matplotlib is used.
- **`**kargs`**:
    <br> Additional keyword arguments to pass to the `roc_cv` function.
    

The function outputs the following:

- **`aucs`**: *Numpy Array*
  <br> An array containing the AUC (Area Under the Curve) values of each binary classifier.
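
The overall flow, one ROC curve per classifier on a shared axis with one `'tab10'` color each, can be sketched as follows. This is an illustration under stated assumptions, not the `roc_cv` implementation: the data is synthetic, the two classifiers are arbitrary, and each curve is built from pooled out-of-fold probabilities obtained with `cross_val_predict`.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import GaussianNB

# Synthetic binary data and two arbitrary classifiers (assumptions)
X, y = make_classification(n_samples=300, n_features=6, random_state=42)
models = [GaussianNB(), LogisticRegression(max_iter=1000)]

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
palette = plt.get_cmap('tab10')
aucs = np.empty(len(models))

fig, ax = plt.subplots(figsize=(10, 6))
for i, model in enumerate(models):
    # Out-of-fold probability of the positive class for every sample
    proba = cross_val_predict(model, X, y, cv=kf, method='predict_proba')[:, 1]
    fpr, tpr, _ = roc_curve(y, proba)
    aucs[i] = roc_auc_score(y, proba)
    ax.plot(fpr, tpr, color=palette(i),
            label=f"{type(model).__name__} (AUC = {aucs[i]:.2f})")

# Chance level for reference
ax.plot([0, 1], [0, 1], "k--", alpha=0.4, label="Chance Level (AUC = 0.5)")
ax.legend(loc='lower right')
print(aucs)
```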


In [2]:
def compare_roc(df, models_list, palette = None, **kargs):
    # Get the number of models
    n_mod = len(models_list)
    
    # Initialize an empty array to store the AUC of each model
    aucs = np.empty(n_mod)
    
    # Create a figure and axis for plotting the ROC curve
    fig, ax = plt.subplots(figsize=(10, 6))
    
    # Get a color palette to use for plotting the ROC curves
    if palette is None:
        # Use the default color palette 'tab10' from Matplotlib 
        palette = plt.get_cmap('tab10')
    # Select a # of colors from the palette equal to the number of ROC curves being plotted
    colors = [palette(i) for i in range(n_mod)]
    
    # Loop through each model in the models_list
    for i, model in enumerate(models_list):
        try:
            # Call the roc_cv function on the current model, passing the df data, axis, and color
            out = roc_cv(model, df, ax=ax, chance=False, color=colors[i], **kargs)
            # Store the AUC of the current model in the aucs array
            aucs[i] = out['AUC']
        except Exception as e:
            print(f"Model {i} cannot be fit or predict_proba method is missing: {e}")
            aucs[i] = np.nan
            
    # Plot the chance level (AUC = 0.5) on the same axis
    ax.plot([0, 1], [0, 1], "k--", label="Chance Level (AUC = 0.5)", alpha=0.4)
    
    # Add a legend to the plot to show the AUC for each model
    plt.legend(loc='lower right', markerscale=9, prop={'size': 9})
    
    # Return the AUCs of all the models
    return aucs

<br>

### data_barplot(*df, save = True, id = 'data_barplot', **kargs*) <a class='anchor' id='barplot'></a>
#### Creates a grouped bar plot with the *Buggy* and *Clean* instances

It plots the '*Buggy*' and '*Clean*' columns from `df`, with the '*Buggy*' bars shown in red and the '*Clean*' bars shown in blue. The x-axis ticks are labeled with the index values of `df`, which correspond to the different datasets employed. The figure is given a title and, if `save` is True, it is saved under the specified `id` using the separate `save_fig()` function. The figure object is then returned.

The function takes four arguments:
<br>
- **`df`**: is the dataframe containing the data to be plotted
- **`save`**: a boolean value indicating whether the plot should be saved (default is True)
- **`id`**: a string value specifying the name of the file to save the plot as (default is 'data_barplot')
- **`**kargs`**: any other keyword arguments to pass to the save_fig function.

In [4]:
def data_barplot(df, save = True, id='data_barplot', **kargs):
    
    # Create x-axis values
    num_rows = df.shape[0]
    x = np.arange(num_rows)

    y1 = df['Buggy']
    y2 = df['Clean']
    # Set width of bars in plot
    width = 0.40

    # Set tick positions on x-axis
    ticks_pos = [r for r in range(num_rows)]

    # Create figure and axis objects
    figure, ax = plt.subplots(figsize=(9,4))

    # Plot data in grouped bar format
    ax.bar(x-0.2, y1, width, label='Buggy', color='#8b0000')
    ax.bar(x+0.2, y2, width, label='Clean', color='#00008b')

    # Set tick positions and labels on x-axis
    ax.set_xticks(ticks_pos, df.index)
    
    # Get current y-axis ticks
    yticks = ax.get_yticks()

    # Modify some of the y-axis ticks
    yticks[0] = 500
    ax.set_yticks(yticks)
    
    # Remove the vertical grid lines (the ones drawn at the x-axis ticks)
    ax.xaxis.grid(False)

    # Add legend to plot
    ax.legend()

    # Add title
    figure.suptitle('Figure 2: Class Distribution among data - "Defects"',
               fontsize=12, fontweight='bold')

    # Save Figure
    if save:
        save_fig(id, **kargs)
    
    # Return the figure
    return figure

<br>

### data_summary(*filename, index, plot = True*)<a class='anchor' id='data_summary'></a>
#### Reads the data from a specified file, creates a summary table and a grouped bar plot of the data.

This function takes three arguments:
<br>
- **`filename`**: a string that specifies the name of the file containing the data. This function assumes that the file is a CSV file located in the `DATA_PATH` directory.
- **`index`**: specifies the column to use as the index of the dataframe.
- **`plot`**: a boolean that, if set to True, causes the function to also return a grouped bar plot of the data using the `data_barplot()` function. If plot is not provided, it defaults to True.
<br>
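
The two derived columns of the summary table can be reproduced on a small table. The sketch below keeps the `Instances` and `Buggy` column names used by the function; the dataset names and counts are made up for illustration.

```python
import pandas as pd

# Hypothetical summary data: total modules and buggy modules per dataset
table = pd.DataFrame({
    'Name': ['toy_a', 'toy_b'],
    'Instances': [100, 200],
    'Buggy': [20, 50],
}).set_index('Name')

# Clean = modules without bugs
table.insert(2, 'Clean', table['Instances'] - table['Buggy'])

# Imbalance Ratio = Buggy / Clean, rounded and formatted with 3 decimal places
ratio = (table['Buggy'] / table['Clean']).round(3).map('{:.3f}'.format)
table.insert(3, 'Imbalance Ratio', ratio)

print(table)
```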

In [6]:
def data_summary(filename, index, plot = True):
    try:
        # Check that the filename is a string
        if not isinstance(filename, str):
            raise ValueError('ERROR: The filename must be a string')

        # Constructs the full file path for the CSV file
        path = os.path.join(DATA_PATH, filename + ".csv")

        # Open the file with the given filename and read the data into a DataFrame
        table1 = pd.read_csv(path)

        # Check that the index is a proper column name
        if index not in table1.columns:
            raise ValueError('ERROR: The index must be an existing column')

        # Set the specified column as the index
        table1.set_index(index, inplace=True)

        # Add a new column called 'Clean' with the number of modules without bugs
        table1.insert(2,'Clean',table1['Instances'] -  table1['Buggy'])
        
        # Add a new column called 'Imbalance Ratio'
        table1.insert(3, 'Imbalance Ratio', table1['Buggy'] / (table1['Clean']))

        # Round the values in the 'Imbalance Ratio' column to 3 decimal places
        table1['Imbalance Ratio'] = table1['Imbalance Ratio'].round(3)

        # Format the values in the 'Imbalance Ratio' column as strings with 3 decimal places
        table1['Imbalance Ratio'] = table1['Imbalance Ratio'].map('{:.3f}'.format)

        # Convert the dataframe to an HTML table with a caption
        html_table1 = table1.style.to_html(caption="Table 1: Summary of Datasets")


        # If the plot argument is True, create a grouped bar plot of the data
        if plot:
            return HTML(html_table1), data_barplot(table1)
        else:
            # Otherwise, only return the HTML table
            return HTML(html_table1)

    except FileNotFoundError:
        # If the file does not exist, print an error message
        print('ERROR: The file does not exist')
    except ValueError as error:
        print(error)


<br>

### df_display(*df, title, decimals = 3, highlight = False*) <a class='anchor' id='df_display'></a>
#### Takes a pandas DataFrame and returns a styled version of the DataFrame using the `.style` attribute.

This function relies on the styling capabilities of the pandas DataFrame object. It takes the following parameters:

- **`df`**: Pandas DataFrame object
- **`title`**: string that represents the caption to be set for the styled DataFrame
- **`decimals`**: *int, default 3*
    <br> Determines the number of decimal places to be displayed in the styled DataFrame
- **`highlight`**: *bool, default False*
    <br> Indicates whether to highlight the maximum and minimum values in each row of the DataFrame
    
The function returns a styled version of the input DataFrame df by setting various display properties, including:

- the caption of the styled DataFrame, which is set to title and has a font size of 14 points
- the cell font size of the styled DataFrame, which is set to 13 points
- the number of decimal places displayed in the styled DataFrame, which is set to `decimals`

If the highlight argument is set to `True`, the function also highlights the maximum and minimum values in each row of the DataFrame using green and red colors. The styling is performed using the `.style` attribute of the input DataFrame and the `.highlight_max()` and `.highlight_min()` methods.

In [7]:
def df_display(df, title, decimals = 3, highlight = False):
    
    # Create a style object with the specified formatting options
    style = (
        df.style
        .set_caption(title)                 # Set the caption for the table
        .format(precision=decimals)         # Set the number of decimal places to display (set_precision was removed in pandas 2.0)
        .set_properties(**{'font-size': '13pt'})  # Set the font size for the cells
        .set_table_styles([{'selector': 'caption', 'props': [('font-size', '14pt')]}])  # Set the font size for the caption
    )
    
    # If highlight is True, apply the highlight_max and highlight_min methods to the style object
    if highlight:
        style = style.highlight_max(axis=1, color='#90EE90').highlight_min(axis=1, color='#FFB6C1')
        
    return style

<br>

### nbayes_cv(*df, ax = None, cbar = False, normalize = None, folds = 10, shuffle = True, seed = 42, **kargs*)<a class='anchor' id='naivebayes'></a>
#### A wrapper function that performs cross-validated evaluation of a Naive Bayes classifier on a given dataset

The function has the following parameters:

- **`df`**: Dataframe containing the data.
- **`ax`**: *Axes object, default = None* 
     <br> Axes on which to draw the plot of the confusion matrix.
- **`cbar`**: *bool, default = False*
     <br> Whether or not to display the colorbar.
- **`normalize`**: *None or {'true', 'pred', 'all'}*
    <br> Normalization mode to apply to the confusion matrix.
- **`folds`**: *int, default = 10*
    <br> Number of folds to use in the k-fold cross-validation.
- **`shuffle`**: *bool, default = True*
    <br> Whether or not to shuffle the data before applying k-fold cross-validation.
- **`seed`**: *int, default = 42*
    <br> The random seed used to shuffle the data when 'shuffle' is set to True. If 'shuffle' is set to False, this parameter is ignored.
- **`**kargs`**: 
    <br> Additional keyword arguments to pass to the GaussianNB() classifier.
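
The reason `seed` is ignored when `shuffle` is False can be seen directly in scikit-learn: `StratifiedKFold` rejects a `random_state` when shuffling is disabled. The sketch below illustrates this; `make_folds` is a hypothetical helper that mirrors the guard used by the wrapper functions.

```python
from sklearn.model_selection import StratifiedKFold

def make_folds(folds=10, shuffle=True, seed=42):
    # Hypothetical helper: drop the seed when no shuffling is requested
    return StratifiedKFold(n_splits=folds, shuffle=shuffle,
                           random_state=seed if shuffle else None)

# With the guard, shuffle=False works and the seed is simply discarded
kf = make_folds(shuffle=False)
print(kf.random_state)

# Without the guard, scikit-learn raises a ValueError at construction time
try:
    StratifiedKFold(n_splits=10, shuffle=False, random_state=42)
    raised = False
except ValueError as err:
    raised = True
    print(err)
```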

The function outputs a dictionary containing the following:

- **`ConfusionMatrixDisplay`**: *ConfusionMatrixDisplay object*
    <br> Object containing the confusion matrix, labels and the Confusion Matrix visualization.
- **`confusionmatrix`**: *ndarray of shape (n_classes, n_classes)*
  <br> Matrix whose *i*-th row and *j*-th column entry indicates the number of samples with true label being *i*-th class and predicted label being *j*-th class.
- **`accuracy`**: *float*
    <br>Mean cross-validated accuracy of the model.
- **`recall`**: *float*
    <br>Mean cross-validated percentage recall of the model.
- **`auc`**: *float*
    <br>Mean cross-validated AUC of the ROC of the model.
- **`model`**: *sklearn.pipeline.Pipeline object*
    <br>Fitted pipeline object that includes the `GaussianNB` classifier and column transformer.

In [18]:
def nbayes_cv(df, ax = None, cbar = False, normalize = None, folds = 10, shuffle = True, seed = 42, **kargs):
    
    # if no Axes object is provided, create a new one
    if ax is None:
        ax = plt.gca()
    # check if ax is an instance of the Axes class
    elif not isinstance(ax, Axes):
        raise TypeError("The 'ax' argument must be an Axes object")
    
    # Separate the target variable from predictors
    y = df['defects']
    X = df.drop('defects', axis = 1)
    predictors = X.columns.values.tolist()
    
    # Create StratifiedKFold object
    # If shuffle is False, ignore any passed seed to avoid a ValueError 
    seed = seed if shuffle else None
    kf = StratifiedKFold(n_splits = folds, shuffle = shuffle, random_state = seed)
        
    # create a Naive Bayes classifier with a Gaussian distribution assumption
    nb = GaussianNB(**kargs)
    
    # Create column transformer to normalize predictors
    z = make_column_transformer((RobustScaler(), predictors), remainder = "passthrough")
    # Create pipeline
    pipe = make_pipeline(z, nb)
    
    # calculate the mean cross-validated accuracy, percentage recall and AUC
    scores = cross_validate(pipe, X, y, cv=kf, scoring = ('accuracy', 'recall', 'roc_auc'))
    acc = np.mean(scores['test_accuracy'])
    recall_perc = np.mean(scores['test_recall'])*100
    auc = np.mean(scores['test_roc_auc'])
    
    # generate cross-validated predicted labels for the target class
    y_pred = cross_val_predict(pipe, X, y, cv=kf)
    
    # create a ConfusionMatrixDisplay object using the confusion matrix and labels
    lab = ['Clean','Buggy']
    disp = ConfusionMatrixDisplay.from_predictions(y_true=y, y_pred=y_pred, display_labels = lab, cmap='Blues',
                                                   normalize = normalize, ax = ax, colorbar = cbar)
    ax.set_title('GaussianNB', fontsize=14, fontweight='bold')
    
    # Create output
    out = {'ConfusionMatrixDisplay' : disp,
           'confusionmatrix' : disp.confusion_matrix,
           'accuracy' : round(acc, 3),
           'recall' : round(recall_perc, 2),
           'auc' : round(auc, 3),
           'model' : pipe.fit(X, y)}
    
    return out

<br>

### knn_cv(*df, ax = None, cbar = False, normalize = None, folds = 10, shuffle = True, seed = 42, **kargs*)<a class='anchor' id='knn'></a>
#### A wrapper function that performs cross-validated tuning and evaluation of a k-NN classifier on a given dataset

The function has the following parameters:

- **`df`**: Dataframe containing the data.
- **`ax`**: *Axes object, default = None* 
     <br> Axes on which to draw the plot of the confusion matrix.
- **`cbar`**: *bool, default = False*
     <br> Whether or not to display the colorbar.
- **`normalize`**: *None or {'true', 'pred', 'all'}*
    <br> Normalization mode to apply to the confusion matrix.
- **`folds`**: *int, default = 10*
    <br> Number of folds to use in the k-fold cross-validation.
- **`shuffle`**: *bool, default = True*
    <br> Whether or not to shuffle the data before applying k-fold cross-validation.
- **`seed`**: *int, default = 42*
    <br> The random seed used to shuffle the data when `shuffle` is set to True. If `shuffle` is set to False, this parameter is ignored.
- **`**kargs`**: 
    <br> Additional keyword arguments to pass to the `KNeighborsClassifier` classifier.

The function outputs a dictionary containing the following:

- **`ConfusionMatrixDisplay`**: *ConfusionMatrixDisplay object*
    <br> Object containing the confusion matrix, labels and the Confusion Matrix visualization.
- **`confusionmatrix`**: *ndarray of shape (n_classes, n_classes)*
  <br> Matrix whose *i*-th row and *j*-th column entry indicates the number of samples with true label being *i*-th class and predicted label being *j*-th class.
- **`results`**: *dict of ndarrays*
    <br> A dictionary that summarizes the results of cross-validation for each hyperparameter combination tried during the grid search.
- **`accuracy`**: *float*
    <br>Mean cross-validated accuracy of the model.
- **`recall`**: *float*
    <br>Mean cross-validated percentage recall of the model.
- **`auc`**: *float*
    <br>Mean cross-validated AUC of the ROC of the model.
- **`model`**: *sklearn.pipeline.Pipeline object*
    <br>The best estimator from `GridSearchCV`
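
The grid search over the number of neighbors can be sketched on synthetic data. The example is an illustration only: the dataset and the five folds are assumptions, and the pipeline scales every column directly instead of using a column transformer. The grid `range(1, 17, 2)` covers the odd values of *k* from 1 to 15.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic binary data standing in for the real dataset (assumption)
X, y = make_classification(n_samples=200, n_features=6, random_state=42)

pipe = make_pipeline(RobustScaler(), KNeighborsClassifier())
ks = {"kneighborsclassifier__n_neighbors": range(1, 17, 2)}  # k = 1, 3, ..., 15
metrics = ('accuracy', 'balanced_accuracy', 'recall', 'roc_auc')
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(pipe, ks, scoring=metrics, cv=kf, refit='balanced_accuracy')
grid.fit(X, y)

# Best k according to balanced accuracy, and its other metrics
k = grid.best_params_["kneighborsclassifier__n_neighbors"]
best = grid.best_index_
acc = grid.cv_results_['mean_test_accuracy'][best]
auc = grid.cv_results_['mean_test_roc_auc'][best]
print(f"{k}-NN: accuracy = {acc:.3f}, AUC = {auc:.3f}")
```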

In [17]:
def knn_cv(df, ax = None, cbar = False, normalize = None, folds = 10, shuffle = True, seed = 42, **kargs):
    
    # If no Axes object is provided, create a new one
    if ax is None:
        ax = plt.gca()
    # Check if ax is an instance of the Axes class
    elif not isinstance(ax, Axes):
        raise TypeError("The 'ax' argument must be an Axes object")
        
    # Separate the target variable from predictors
    y = df['defects']
    X = df.drop('defects', axis = 1)
    predictors = X.columns.values.tolist()
    
    # Create StratifiedKFold object
    # If shuffle is False, ignore any passed seed to avoid a ValueError
    kf = StratifiedKFold(n_splits = folds, shuffle = shuffle, random_state = seed if shuffle else None)
    
    # Create k-NN model
    knn = KNeighborsClassifier(n_jobs = -1, **kargs)
    # Create column transformer to normalize predictors
    z = make_column_transformer((RobustScaler(), predictors), remainder = "passthrough")
    # Create pipeline
    pipe = make_pipeline(z, knn)
    
    # Perform grid search to find the optimal number of neighbors
    ks = {"kneighborsclassifier__n_neighbors": range(1,17,2)}
    metrics = ('accuracy','balanced_accuracy','recall', 'roc_auc')
    grid = GridSearchCV(pipe, ks, scoring = metrics, cv = kf, refit = 'balanced_accuracy', n_jobs = -1)
    grid.fit(X, y)
    k = grid.best_params_["kneighborsclassifier__n_neighbors"]
    
    # Retrieve the mean cross-validated metrics
    acc = grid.cv_results_['mean_test_accuracy'][grid.best_index_]
    recall_perc = grid.cv_results_['mean_test_recall'][grid.best_index_]*100
    auc = grid.cv_results_['mean_test_roc_auc'][grid.best_index_]
    
    # Generate cross-validated predicted labels for the target class
    y_pred = cross_val_predict(grid.best_estimator_, X, y, cv = kf)
    
    # Create confusion matrix
    lab = ['Clean','Buggy']
    disp = ConfusionMatrixDisplay.from_predictions(y_true = y, y_pred = y_pred,  cmap = 'Blues', display_labels = lab,
                                                   normalize = normalize, ax = ax, colorbar = cbar)
    ax.set_title(str(k) +'-NN', fontsize = 14, fontweight = 'bold')
    
    # Create output
    out = {'ConfusionMatrixDisplay' : disp,
           'confusionmatrix' : disp.confusion_matrix,
           'results' : grid.cv_results_,
           'accuracy' : round(acc, 3),
           'recall' : round(recall_perc, 2),
           'auc' : round(auc, 2),
           'model' : grid.best_estimator_}
    
    return out

<br>

### tree_cv(*df, ax = None, cbar = False, normalize = None, folds = 10, shuffle = True, seed = 42, **kargs*) <a class='anchor' id='tree'></a>
#### A wrapper function that performs cross-validated tuning and evaluation of a DecisionTreeClassifier on a given dataset

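The predictors are scaled with `RobustScaler` and projected with `PCA` before the tree is fitted. `GridSearchCV` then tunes `max_depth` over odd values, starting at 1 and bounded by the depth of the fully grown tree, and refits the candidate with the best mean balanced accuracy. The dataframe must contain a `defects` column, which is used as the target.

The function has the following parameters: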

- **`df`**: Dataframe containing the data.
- **`ax`**: *Axes object, default = None* 
     <br> Axes on which to draw the plot of the confusion matrix.
- **`cbar`**: *bool, default = False*
     <br> Whether or not to display the colorbar.
- **`normalize`**: *None or {'true', 'pred', 'all'}, default = None*
    <br> Normalization mode to apply to the confusion matrix.
- **`folds`**: *int, default = 10*
    <br> Number of folds to use in the k-fold cross-validation.
- **`shuffle`**: *bool, default = True*
    <br> Whether or not to shuffle the data before applying k-fold cross-validation.
- **`seed`**: *int, default = 42*
    <br> Integer used as a seed for the random number generator. It seeds both the fold shuffling (when `shuffle` is True) and the `DecisionTreeClassifier`.
- **`**kargs`**: 
    <br> Additional keyword arguments to pass to the `DecisionTreeClassifier` classifier.

The function outputs a dictionary containing the following:

- **`ConfusionMatrixDisplay`**: *ConfusionMatrixDisplay object*
    <br> Object containing the confusion matrix, labels and the Confusion Matrix visualization.
- **`confusionmatrix`**: *ndarray of shape (n_classes, n_classes)*
  <br> Matrix whose entry in row *i* and column *j* is the number of samples (or the proportion, if `normalize` is set) whose true label is the *i*-th class and whose predicted label is the *j*-th class.
- **`results`**: *dict of ndarrays*
    <br> Dictionary that summarizes the cross-validation results for each hyperparameter combination tried during the grid search.
- **`accuracy`**: *float*
    <br>Mean cross-validated accuracy of the model.
- **`recall`**: *float*
    <br>Mean cross-validated recall of the model, expressed as a percentage.
- **`auc`**: *float*
    <br>Mean cross-validated ROC AUC of the model.
- **`model`**: *sklearn.pipeline.Pipeline object*
    <br>The best estimator from `GridSearchCV`, refitted on the whole dataset.
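
To make the layout of `confusionmatrix` concrete, here is a minimal plain-Python sketch (not the function's own code) that builds a 2x2 matrix with true classes on the rows and predicted classes on the columns, then derives accuracy, percentage recall and balanced accuracy from it. The labels below are made up for illustration, and the helper names `confusion_counts` and `summarize` are hypothetical. Note that the wrapper reports the mean of per-fold scores, which can differ slightly from values computed on the pooled matrix as done here.

```python
def confusion_counts(y_true, y_pred):
    """2x2 matrix: rows are true classes, columns are predicted classes (0 = Clean, 1 = Buggy)."""
    cm = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        cm[int(t)][int(p)] += 1
    return cm


def summarize(cm):
    """Derive pooled accuracy, percentage recall and balanced accuracy from the matrix."""
    (tn, fp), (fn, tp) = cm
    recall = tp / (tp + fn)            # share of buggy modules that were detected
    specificity = tn / (tn + fp)       # share of clean modules that were kept clean
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "recall_percent": 100 * recall,
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Made-up labels: four clean modules and two buggy ones
y_true = [False, False, False, False, True, True]
y_pred = [False, False, True, False, True, False]

cm = confusion_counts(y_true, y_pred)
metrics = summarize(cm)
print(cm)
print(metrics)
```

Balanced accuracy averages recall and specificity, which is why it is a more informative refit criterion than plain accuracy when buggy modules are rare.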

In [15]:
def tree_cv(df, ax = None, cbar = False, normalize = None, folds = 10, shuffle = True, seed = 42, **kargs):
    
    # If no Axes object is provided, create a new one
    if ax is None:
        ax = plt.gca()
    # Check if ax is an instance of the Axes class
    elif not isinstance(ax, Axes):
        raise TypeError("The 'ax' argument must be an Axes object")
        
    # Separate the target variable from predictors
    y = df['defects']
    X = df.drop('defects', axis = 1)
    predictors = X.columns.values.tolist()
    
    # Create StratifiedKFold object
    # If shuffle is False, ignore any passed seed to avoid a ValueError
    kf = StratifiedKFold(n_splits = folds, shuffle = shuffle, random_state = seed if shuffle else None)
    
    # Create a DecisionTreeClassifier model
    tree_clf = DecisionTreeClassifier(random_state = seed, **kargs)
    
    # Create column transformer to normalize predictors
    z = make_column_transformer((RobustScaler(), predictors), remainder = "passthrough")
    # Create pipeline
    pipe = make_pipeline(z, PCA(), tree_clf)
    
    # Get depth of the fully grown tree
    pipe.fit(X, y)
    full_tree = pipe.named_steps['decisiontreeclassifier']
    full_depth = full_tree.get_depth()
    
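    # Perform grid search to find the optimal depth
    # (odd depths from 1 up to the full depth; max(..., 1) keeps the grid non-empty for very shallow trees)
    depths = {"decisiontreeclassifier__max_depth": range(1, max(full_depth, 1) + 1, 2)}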
    metrics = ('accuracy','balanced_accuracy','recall', 'roc_auc')
    grid = GridSearchCV(pipe, depths, scoring = metrics, cv = kf, refit = 'balanced_accuracy', n_jobs = -1)
    grid.fit(X, y)
    
    # Retrieve the mean cross-validated metrics
    acc = grid.cv_results_['mean_test_accuracy'][grid.best_index_]
    recall_perc = grid.cv_results_['mean_test_recall'][grid.best_index_]*100
    auc = grid.cv_results_['mean_test_roc_auc'][grid.best_index_]
    
    # Generate cross-validated predicted labels for the target class
    y_pred = cross_val_predict(grid.best_estimator_, X, y, cv=kf)
    
    # Create confusion matrix
    lab = ['Clean','Buggy']
    disp = ConfusionMatrixDisplay.from_predictions(y_true=y, y_pred=y_pred,  cmap='Blues', display_labels = lab, 
                                                   normalize = normalize, ax = ax, colorbar = cbar)
    title = 'Balanced DecisionTree' if tree_clf.class_weight == 'balanced' else 'DecisionTree'
    ax.set_title(title, fontsize=14, fontweight='bold')
    
    # Create output
    out = {'ConfusionMatrixDisplay' : disp,
           'confusionmatrix' : disp.confusion_matrix,
           'results' : grid.cv_results_,
           'accuracy' : round(acc, 3),
           'recall' : round(recall_perc, 2),
           'auc' : round(auc, 3),
           'model' : grid.best_estimator_}
    
    return out

<br>

### plot_line(*axis, slope, intercept, **kargs*)<a class='anchor' id='plot_line'></a>
#### Utility function for plotting a line on a given matplotlib axis object
The line is defined by its slope and intercept and is drawn between the current x-limits of the axis, so the limits should be set (or the data plotted) before calling the function. The function takes the following arguments:
<br>
- **`axis`**: a matplotlib axis object on which the line will be plotted.
- **`slope`**: a float value that defines the slope of the line.
- **`intercept`**: a float value that defines the y-intercept of the line.
- **`**kargs`**: optional keyword arguments that are forwarded to the matplotlib `plot` call. This allows the caller to specify additional properties of the line, such as its color or line style.
<br>
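
As a small illustration of the geometry, the sketch below computes the two endpoints that such a line would have for given x-limits. It is plain Python with no plotting, and the helper name `line_endpoints` is hypothetical rather than part of this notebook.

```python
def line_endpoints(xlim, slope, intercept):
    """Return ([x0, x1], [y0, y1]) for y = slope * x + intercept across the x-limits."""
    xmin, xmax = xlim
    xs = [xmin, xmax]
    ys = [slope * x + intercept for x in xs]
    return xs, ys


# With x-limits (0, 4), the line y = 0.5 * x + 1 runs from (0, 1) to (4, 3)
xs, ys = line_endpoints((0.0, 4.0), slope=0.5, intercept=1.0)
print(xs, ys)
```

Because only two points are needed to draw a straight line, the function does not have to sample the interval between the limits.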

In [13]:
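def plot_line(axis, slope, intercept, **kargs):
    # Use the current x-limits of the axis as the endpoints of the line
    xmin, xmax = axis.get_xlim()
    # Draw on the given axis rather than on whichever axis is currently active
    axis.plot([xmin, xmax],
              [xmin*slope+intercept, xmax*slope+intercept],
              **kargs)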

<br>

### plot_performance(*data, metric_name, save=True, fig_num = None*) <a class='anchor' id='plotperf'></a>
#### Creates and saves two plots (a boxplot and a barplot) of a given performance metric for classifiers on multiple datasets

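The `plot_performance` function takes four inputs:

- **`data`:** *pandas DataFrame*<br> contains the performance metric values, with one column per classifier and one row per dataset
- **`metric_name`:** *str*<br> name of the performance metric being plotted; it is also used in the plot titles and in the names of the saved files
- **`save`:** *bool, default = True*<br> whether to save both figures through `save_fig`, as `<metric_name>_Boxplot` and `<metric_name>_Barplot`
- **`fig_num`:** *int, default = None*<br> if provided, the box plot title is prefixed with "Figure `fig_num`." and the bar plot title with "Figure `fig_num` + 1."

The function produces two plots: a box plot showing the distribution of the metric for each classifier across datasets, and a grouped bar plot showing the metric of each classifier on each dataset. It raises a `TypeError` if `data` is not a DataFrame or `metric_name` is not a string.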
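
The way the two titles are numbered can be sketched in isolation. The snippet below is a plain-Python illustration of the title logic only; the helper names `figure_heading` and `plot_titles` are hypothetical and do not exist in the notebook.

```python
def figure_heading(fig_num, offset=0):
    """Return the 'Figure N.  ' prefix, or an empty string when no number is given."""
    if fig_num is None:
        return ""
    return "Figure " + str(int(fig_num) + offset) + ".  "


def plot_titles(metric_name, fig_num=None):
    """Titles of the box plot and of the bar plot, which is numbered one higher."""
    suffix = metric_name + " Measure for Classifiers"
    return figure_heading(fig_num) + suffix, figure_heading(fig_num, 1) + suffix


box_title, bar_title = plot_titles("Recall", fig_num=3)
print(box_title)
print(bar_title)
```

Since the bar plot always takes the next number, a caller numbering figures in a report should advance its counter by two after each call.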

In [4]:
def plot_performance(data, metric_name, save=True, fig_num = None):
    
    # Check if input data is a pandas dataframe
    if not isinstance(data, pd.DataFrame):
        raise TypeError("'data' argument must be a pandas DataFrame")
        
    # Check if metric name is a string
    if not isinstance(metric_name, str):
        raise TypeError("'metric_name' argument must be a string")
    
    sns.set(style="darkgrid")
    sns.set_palette('tab10')
    
    # Create box plot
    box_a = sns.boxplot(data=data, linewidth=2.0, width=0.7)
    
    # Get the figure and set its size
    fig = box_a.get_figure()
    fig.set_size_inches(10,5)
    
    # Create the title for the plot
    # If fig_num is provided, add "Figure X. " to the title, where X is the fig_num
    head = 'Figure ' + str(fig_num) + ".  " if fig_num is not None else ''
    box_a.set_title(head + metric_name + " Measure for Classifiers", y=1.05, fontsize=15, fontweight='bold')
    
    # Set the font size for the x-axis tick labels
    box_a.axes.tick_params(axis='x', labelsize=14)
    plt.ylabel(metric_name, fontsize=14, labelpad=20, fontweight='bold')
    
    # Save the figure, if save is set to True
    if save:
        save_fig(metric_name + '_Boxplot')
    
    # Create bar plot
    fig, ax = plt.subplots(figsize=(10,5))
    data.plot(kind='bar', stacked=False, ax=ax, width=0.7, edgecolor='darkslategray')
    plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5), fontsize=12)
    plt.xlabel('Datasets', fontsize=13, labelpad=10, fontweight='bold')
    plt.ylabel(metric_name, fontsize=14, labelpad=20, fontweight='bold')
    ax.tick_params(axis='x', labelrotation=0, labelsize=12)
    
    # Create the title for the plot
    # If fig_num is provided, add "Figure X. " to the title, where X is the fig_num + 1
    head = 'Figure ' + str(int(fig_num)+1) + ".  " if fig_num is not None else ''
    plt.title(head +  metric_name + " Measure for Classifiers", y=1.05, fontsize=15, fontweight='bold')
    ax.grid(axis='x')
    
    # Save the figure, if save is set to True
    if save:
        save_fig(metric_name + '_Barplot')

<br>

### roc_cv(*classifier, df, cv = None, ax = None, chance = True, **kargs*)<a class='anchor' id='roc_cv'></a>
#### Calculates the cross-validated ROC curve and AUC for a given classifier 
The data are split into folds using stratified k-fold cross-validation (or any other cross-validator object whose `split` method yields train and test indices). For each fold, the classifier is fitted on the training set and predicts the probabilities of the positive class on the test set. The false positive rate (FPR) and true positive rate (TPR) for the test set are calculated with the `roc_curve` function from scikit-learn's `metrics` module, and the TPR is interpolated onto a common grid of 100 evenly spaced FPR values. The AUC for the test set is calculated with the `roc_auc_score` function.

The interpolated TPR values are then averaged across folds to obtain the mean ROC curve, which is plotted on `ax`. The AUC shown in the legend and returned by the function is the area under this mean curve, while the reported spread is the standard deviation of the per-fold AUC values. 
The function takes the following arguments:
- **`classifier`**: a binary classifier object that has a `fit` and `predict_proba` method.
- **`df`**: a DataFrame of feature values in which the last column is assumed to be the target labels variable. 
- **`cv`**: an optional cross-validator object with a `split` method. If not provided, a shuffled stratified k-fold cross-validation with 10 splits and `random_state = 42` is used.
- **`ax`**: an optional matplotlib `Axes` object to plot the ROC curve on. If not provided, the current Axes (`plt.gca()`) is used.
- **`chance`**: Boolean indicating whether to plot the chance level. Default is True.
- **`**kargs`**: optional keyword arguments that are passed to the plot function when drawing the ROC curve.

The function returns a dictionary containing the following keys:

- **`AUC`**: the area under the mean ROC curve obtained by averaging the per-fold curves.
- **`AUC_std`**: the standard deviation of the AUC values across all folds.
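
The averaging step can be sketched without any dependencies. The example below is an illustration under stated assumptions, not the function's implementation: `interp` is a plain-Python stand-in for `np.interp`, the two fold curves are made up, and the grid has 5 points instead of 100 so the numbers are easy to follow. The helper names `interp`, `mean_roc` and `trapezoid_area` are hypothetical.

```python
from bisect import bisect_right


def interp(x, xs, ys):
    """Piecewise-linear interpolation of (xs, ys) at x; xs must be non-decreasing."""
    j = bisect_right(xs, x)
    if j == 0:
        return ys[0]
    if j == len(xs):
        return ys[-1]
    x0, x1, y0, y1 = xs[j - 1], xs[j], ys[j - 1], ys[j]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def mean_roc(fold_curves, grid):
    """Average per-fold (fpr, tpr) curves on a shared FPR grid."""
    rows = []
    for fpr, tpr in fold_curves:
        row = [interp(x, fpr, tpr) for x in grid]
        row[0] = 0.0                      # force each curve to start at the origin
        rows.append(row)
    mean_tpr = [sum(col) / len(col) for col in zip(*rows)]
    mean_tpr[-1] = 1.0                    # force the mean curve to end at (1, 1)
    return mean_tpr


def trapezoid_area(xs, ys):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))


# Two made-up folds: one good curve and one that follows the chance diagonal
folds = [([0.0, 0.0, 0.5, 1.0], [0.0, 0.5, 1.0, 1.0]),
         ([0.0, 0.5, 1.0], [0.0, 0.5, 1.0])]
grid = [0.0, 0.25, 0.5, 0.75, 1.0]

mean_tpr = mean_roc(folds, grid)
mean_auc = trapezoid_area(grid, mean_tpr)
print(mean_tpr, mean_auc)
```

Interpolating onto a shared grid is what makes the averaging possible, because each fold produces its ROC curve at a different set of FPR values.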

In [9]:
def roc_cv(classifier, df, cv = None, ax = None, chance = True , **kargs):
    
    # If no cross-validation object is provided, use stratified k-fold with 10 splits
    if cv is None:
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
        
    # If no Axes object is provided, create a new one
    if ax is None:
        ax = plt.gca()
    
    # Initialize lists to store true positive rates and AUC values
    tprs = []
    aucs = []
    # Generate an array of 100 evenly spaced points between 0 and 1
    mean_fpr = np.linspace(0, 1, 100)

    # Set the target column to be the last column in the dataframe
    target_col = df.columns[-1]
    # Set the feature columns to be all the columns except the last one
    feature_cols = df.columns[:-1]

    # Iterate over the folds of the cross-validation
    for fold, (train, test) in enumerate(cv.split(df, df[target_col])):
        # Split the data into training and test sets
        X_train, y_train = df[feature_cols].iloc[train], df[target_col].iloc[train]
        X_test, y_test = df[feature_cols].iloc[test], df[target_col].iloc[test]
        
        # Fit the classifier on the training set
        classifier.fit(X_train, y_train)
        # Predict the probabilities of the positive class for the test set
        y_pred = classifier.predict_proba(X_test)[:, 1]
        # Calculate the false positive rate and true positive rate for the test set
        fpr, tpr, thresholds = roc_curve(y_test, y_pred)
        # Interpolate the true positive rate at the mean false positive rate points
        interp_tpr = np.interp(mean_fpr, fpr, tpr)
        # Set the first element of the interpolated true positive rate to 0
        interp_tpr[0] = 0.0
        # Add the interpolated true positive rate to the list
        tprs.append(interp_tpr)
        # Add the AUC for the test set to the list
        aucs.append(roc_auc_score(y_test, y_pred))

    # Calculate the mean and standard deviation of the true positive rates across all folds
    mean_tpr = np.mean(tprs, axis=0)
    mean_tpr[-1] = 1.0
    std_tpr = np.std(tprs, axis=0)

    # Calculate the AUC of the mean ROC curve and the standard deviation of the per-fold AUC values
    mean_auc = auc(mean_fpr, mean_tpr)
    std_auc = np.std(aucs)

    # Stores the Classifier name in a variable
    if isinstance(classifier, Pipeline):
        classifier_name = classifier.steps[-1][1].__class__.__name__ 
    else:
        classifier_name = classifier.__class__.__name__
    
    # If the chance flag is set to True, the function plots a chance level line (y=x, AUC=0.5)
    if chance:
        ax.plot([0, 1], [0, 1], "k--", label="Chance Level (AUC = 0.5)", alpha=0.4)
    
    # Plot the ROC curve on the Axes object
    ax.plot(
        mean_fpr,
        mean_tpr,
        label=classifier_name + r" (AUC = %0.2f $\pm$ %0.2f)" % (mean_auc, std_auc),
        lw=2,
        alpha=0.8, 
        **kargs
    )
    ax.set_xlabel("False Positive Rate (FPR)", labelpad=20)
    ax.set_ylabel("True Positive Rate (TPR)", labelpad=20)
    
    
    # Return the AUC of the mean ROC curve and the standard deviation of the per-fold AUC values
    return {'AUC':mean_auc, 'AUC_std':std_auc}

<br>

### save_fig(*fig_id, tight_layout = True, fig_extension = "png", resolution = 300*)<a class='anchor' id='save_fig'></a>
#### Utility function for saving a matplotlib figure to a file

The function saves the current matplotlib figure and takes four arguments:
<br>
- **`fig_id`**: a string with the file name (without extension) under which the figure is saved, inside the directory given by the `IMAGES_PATH` variable.
- **`tight_layout`**: a boolean value that determines whether `plt.tight_layout()` is called before saving, which adjusts the padding so that labels and titles fit inside the figure. The default value is True.
- **`fig_extension`**: a string that specifies the file extension of the figure file, which is also used as the output format. The default value is "png".
- **`resolution`**: an integer that specifies the resolution of the figure in dots per inch (dpi). The default value is 300.
<br>
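
The output path is assembled from the image directory, the figure name and the extension. The sketch below shows only that step; the value given to `IMAGES_PATH` here is an assumption for illustration (the real value is defined elsewhere in the project), and the helper name `figure_path` is hypothetical.

```python
import os

# Assumed location for illustration only; the notebook defines its own IMAGES_PATH
IMAGES_PATH = os.path.join("output", "images")


def figure_path(fig_id, fig_extension="png", images_path=IMAGES_PATH):
    """Build <images_path>/<fig_id>.<fig_extension> in a platform-independent way."""
    return os.path.join(images_path, fig_id + "." + fig_extension)


png_path = figure_path("Recall_Boxplot")
pdf_path = figure_path("roc_curves", fig_extension="pdf")
print(png_path)
print(pdf_path)
```

The target directory must already exist when the figure is saved, so it may be worth creating it once with `os.makedirs(IMAGES_PATH, exist_ok=True)`.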

In [14]:
def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    #print("Saving figure", fig_id)
    
    if tight_layout:
        plt.tight_layout()
        
    plt.savefig(path, format=fig_extension, dpi=resolution)

<br>

### scatter_custom(*x, y, df, ax = None, **kargs*) <a class='anchor' id='scatter_cust'></a>
#### Custom function for generating a scatter plot from a pandas.DataFrame
The function plots clean and buggy modules with different markers and colors, and returns the Axes object. It takes three required arguments and one optional argument:
- **`x and y`**: strings with the names of the columns in `df` to use for the x- and y-axes of the plot, respectively. A `ValueError` is raised if either column is missing.
- **`df`**: a pandas.DataFrame containing the data to plot. It must include a boolean `defects` column, which is used to separate the two classes.
- **`ax`**: an optional matplotlib *Axes* object, which represents a single subplot in a Figure, on which the plot is drawn. If not provided, the current Axes is used.

The function also accepts additional keyword arguments, which are passed to the `Axes.scatter` method when generating the plot. This allows you to customize the appearance of the plot, for example by setting the marker size. The arguments `alpha`, `marker`, `edgecolors`, `color`, `label` and `data` are set by the function itself and cannot be overridden.
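
The class-specific styling can be sketched without matplotlib or pandas. The example below is a plain-Python illustration that groups points by the `defects` flag and attaches the marker and edge color used for each class. The rows and the column names `loc` and `complexity` are made up, and the helper name `split_by_class` is hypothetical.

```python
# Per-class style, mirroring the markers and edge colors used by the function
STYLES = {
    False: {"label": "Clean", "marker": "o", "edgecolor": "#00008b"},
    True: {"label": "Buggy", "marker": "s", "edgecolor": "#8b0000"},
}


def split_by_class(rows, x, y, target="defects"):
    """Group the (x, y) points of `rows` by class and attach the plotting style."""
    columns = set(rows[0]) if rows else set()
    if not {x, y}.issubset(columns):
        raise ValueError("Provide existing column names")
    layers = {}
    for is_buggy, style in STYLES.items():
        selected = [r for r in rows if bool(r[target]) == is_buggy]
        layers[style["label"]] = {
            "x": [r[x] for r in selected],
            "y": [r[y] for r in selected],
            "marker": style["marker"],
            "edgecolor": style["edgecolor"],
        }
    return layers


# Made-up rows standing in for a dataframe
rows = [
    {"loc": 10, "complexity": 2, "defects": False},
    {"loc": 85, "complexity": 9, "defects": True},
    {"loc": 24, "complexity": 3, "defects": False},
]
layers = split_by_class(rows, "loc", "complexity")
print(layers["Clean"]["x"], layers["Buggy"]["x"])
```

Each entry of `layers` corresponds to one scatter call, so both classes share the same axes while remaining visually distinct.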

In [18]:
def scatter_custom(x, y, df, ax = None, **kargs):
    
    # if no Axes object is provided, create a new one
    if ax is None:
        ax = plt.gca()
    # check if ax is an instance of the Axes class
    elif not isinstance(ax, Axes):
        # if ax is not an Axes object, raise an error
        raise TypeError("The 'ax' argument must be an Axes object")
    
    # create sets of the provided x and y column names, and the column names in df
    varnames_set = set((x,y))
    columns_set = set(df.columns)

    # check if the x and y column names are a subset of the column names in df
    if not varnames_set.issubset(columns_set):
        # if x and y are not a subset of df.columns, raise an error
        raise ValueError("Provide existing column names")
    
    # Create two DataFrames for clean and buggy data
    df_clean = df.loc[df['defects'] == False]
    df_buggy = df.loc[df['defects'] == True]
    # (label, data, marker, edge color) for each class
    groups = [('Clean', df_clean, 'o', '#00008b'),
              ('Buggy', df_buggy, 's', '#8b0000')]
    
    # create the plot: one scatter layer per class
    for label, data, marker, edgecolor in groups:
        ax.scatter(x, y, alpha=0.7, marker = marker,
                   edgecolors = edgecolor, color = 'white', data = data,
                   label = label, **kargs)
        
    # set the title, x-axis label, and y-axis label for the plot
    ax.set_title((x + ' vs. ' + y), fontsize=14, fontweight='bold')
    ax.set_xlabel(x, fontstyle='italic')
    ax.set_ylabel(y, fontstyle='italic')
    # show the legend for the plot
    ax.legend()
    # return the Axes object so the caller can customize or display the plot
    return ax

<br>

### svm_cv(*df, ax = None, cbar = False, normalize = None, folds = 10, shuffle = True, seed = 42, n_iter = 50, **kargs*) <a class='anchor' id='svm'></a>
#### A wrapper function that performs cross-validated tuning and evaluation of a SVM with RBF kernel on a given dataset

The function implements a randomized search over the parameter space using `RandomizedSearchCV`. In contrast to `GridSearchCV`, not all parameter values are tried out; instead, a fixed number of parameter settings is sampled from the specified distributions. The computation budget, that is, the number of sampled candidates, is set with the `n_iter` parameter, so increasing `n_iter` leads to a finer search at the cost of a longer runtime.

For each parameter, either a distribution over possible values or a list of discrete choices (which will be sampled uniformly) can be specified. 
For continuous parameters, such as `C` and `gamma` in this case, it is important to specify a continuous distribution to take full advantage of the randomization. A continuous log-uniform random variable is available through `loguniform`. This is a continuous version of log-spaced parameters and is useful for searching penalty values, as we often explore values at different orders of magnitude, at least as a first step. For example, `loguniform(1, 100)` can be used for `C` instead of the list `[1, 10, 100]`. In this function, both `C` and `gamma` are sampled from `loguniform(10e-3, 10e3)`, that is, between 0.01 and 10000.
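
To see what log-uniform sampling means in practice, here is a standard-library sketch. It is not how `loguniform` is implemented; it only reproduces the idea by drawing the logarithm of the value uniformly. The helper name `sample_loguniform` is hypothetical.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)
samples = [sample_loguniform(1, 100, rng) for _ in range(10_000)]

# Each decade receives about the same share of samples:
# roughly half fall in [1, 10) and half in [10, 100]
share_low_decade = sum(s < 10 for s in samples) / len(samples)
print(round(share_low_decade, 2))
```

A plain uniform distribution on the same interval would place about 90% of the samples above 10, so small values of the parameter would rarely be explored.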

The function has the following parameters:

- **`df`**: Dataframe containing the data.
- **`ax`**: *Axes object, default = None* 
     <br> Axes on which to draw the plot of the confusion matrix.
- **`cbar`**: *bool, default = False*
     <br> Whether or not to display the colorbar.
- **`normalize`**: *None or {'true', 'pred', 'all'}, default = None*
    <br> Normalization mode to apply to the confusion matrix.
- **`folds`**: *int, default = 10*
    <br> Number of folds to use in the k-fold cross-validation.
- **`shuffle`**: *bool, default = True*
    <br> Whether or not to shuffle the data before applying k-fold cross-validation.
- **`seed`**: *int, default = 42*
    <br> Integer used as a seed for the random number generator. It seeds the fold shuffling (when `shuffle` is True), the `SVC` model and the randomized search.
- **`n_iter`**: *int, default = 50*
    <br> Number of parameter settings that are sampled. Tunes the trade-off runtime vs quality of the solution.
- **`**kargs`**: 
    <br> Additional keyword arguments to pass to the `SVC` classifier.

The function outputs a dictionary containing the following:

- **`ConfusionMatrixDisplay`**: *ConfusionMatrixDisplay object*
    <br> Object containing the confusion matrix, labels and the Confusion Matrix visualization.
- **`confusionmatrix`**: *ndarray of shape (n_classes, n_classes)*
  <br> Matrix whose entry in row *i* and column *j* is the number of samples (or the proportion, if `normalize` is set) whose true label is the *i*-th class and whose predicted label is the *j*-th class.
- **`results`**: *dict of ndarrays*
    <br> Dictionary that summarizes the cross-validation results for each hyperparameter combination tried during the random search.
- **`accuracy`**: *float*
    <br>Mean cross-validated accuracy of the model.
- **`recall`**: *float*
    <br>Mean cross-validated recall of the model, expressed as a percentage.
- **`auc`**: *float*
    <br>Mean cross-validated ROC AUC of the model.
- **`model`**: *sklearn.pipeline.Pipeline object*
    <br>The best estimator from `RandomizedSearchCV`, refitted on the whole dataset.

In [14]:
def svm_cv(df, ax = None, cbar = False, normalize = None, folds = 10, shuffle = True, seed = 42, n_iter = 50, **kargs):
    
    # If no Axes object is provided, create a new one
    if ax is None:
        ax = plt.gca()
    # Check if ax is an instance of the Axes class
    elif not isinstance(ax, Axes):
        raise TypeError("The 'ax' argument must be an Axes object")
        
    # Separate the target variable from predictors
    y = df['defects']
    X = df.drop('defects', axis = 1)
    predictors = X.columns.values.tolist()
    
    # Create StratifiedKFold object
    # If shuffle is False, ignore any passed seed to avoid a ValueError
    kf = StratifiedKFold(n_splits = folds, shuffle = shuffle, random_state = seed if shuffle else None)

    # Create a SupportVectorMachine model
    svm = SVC(random_state = seed, cache_size = 1500, **kargs)
    
    # Create column transformer to normalize predictors
    z = make_column_transformer((RobustScaler(), predictors), remainder = "passthrough")
    # Create pipeline
    pipe = make_pipeline(z, svm)
    
    # Perform randomized search to find the optimal parameter values
    # (note that 10e-3 == 0.01 and 10e3 == 10000)
    space = {'svc__C': loguniform(10e-3, 10e3),
             'svc__gamma': loguniform(10e-3, 10e3)}
    metrics = ('accuracy','balanced_accuracy','recall','roc_auc')
    rnd_srch = RandomizedSearchCV(pipe, space, n_iter = n_iter, scoring = metrics, cv = kf, refit = 'balanced_accuracy',
                                 n_jobs = -1, verbose = 0, random_state = seed)
    rnd_srch.fit(X, y)
    
    # Retrieve the mean cross-validated metrics
    acc = rnd_srch.cv_results_['mean_test_accuracy'][rnd_srch.best_index_]
    recall_perc = rnd_srch.cv_results_['mean_test_recall'][rnd_srch.best_index_]*100
    auc = rnd_srch.cv_results_['mean_test_roc_auc'][rnd_srch.best_index_]
    
    # Generate cross-validated predicted labels for the target class
    y_pred = cross_val_predict(rnd_srch.best_estimator_, X, y, cv=kf)
    
    # Create confusion matrix
    lab = ['Clean','Buggy']
    disp = ConfusionMatrixDisplay.from_predictions(y_true = y, y_pred = y_pred,  cmap = 'Blues', display_labels = lab, 
                                                   normalize = normalize, ax = ax, colorbar = cbar)
    title = 'Balanced SVM' if svm.class_weight == 'balanced' else 'SVM'
    ax.set_title(title, fontsize=14, fontweight='bold')
    
    # Create output
    out = {'ConfusionMatrixDisplay' : disp,
           'confusionmatrix' : disp.confusion_matrix,
           'results' : rnd_srch.cv_results_,
           'accuracy' : round(acc, 3),
           'recall' : round(recall_perc, 2),
           'auc' : round(auc, 3),
           'model' : rnd_srch.best_estimator_}
    
    return out

<br>

### varselect_tocsv(*df, varnames, outname*) <a class='anchor' id='varselect_tocsv'></a>
#### Selects features of interest and saves them in a CSV file

This function selects a subset of columns (i.e., variables) from a pandas dataframe and saves the result as a CSV file in the directory given by `DATA_PATH`. It returns the selected dataframe; if any argument is invalid, it prints an error message and returns `None`. The function takes three arguments:

- **`df`**: a pandas dataframe that contains the data.
- **`varnames`**: a list of strings, each of which is the name of a column in *df* that should be selected.
- **`outname`**: a string that specifies the name of the CSV file, without the extension. The suffix *.csv* is appended when the file is created, and any existing file with the same name is overwritten.
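
The validate-then-write pattern can be sketched with the standard library alone. The example below is a stand-in that works on a list of dictionaries rather than a pandas dataframe and, unlike `DataFrame.to_csv`, writes no index column. The rows, the column names and the helper name `select_to_csv` are made up for illustration.

```python
import csv
import os
import tempfile


def select_to_csv(rows, varnames, outname, data_path):
    """Write only the `varnames` columns of `rows` to <data_path>/<outname>.csv."""
    columns = set(rows[0]) if rows else set()
    if not (isinstance(outname, str) and set(varnames).issubset(columns)):
        print("Provide proper arguments: rows, a list of feature names and an output name")
        return None
    path = os.path.join(data_path, outname + ".csv")
    # Opening with mode "w" overwrites any existing file of the same name
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(varnames), extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return path


# Made-up rows standing in for a dataframe
rows = [{"loc": 10, "complexity": 2, "defects": False},
        {"loc": 85, "complexity": 9, "defects": True}]

with tempfile.TemporaryDirectory() as tmp:
    path = select_to_csv(rows, ["loc", "defects"], "selection", tmp)
    with open(path, newline="") as f:
        written = list(csv.reader(f))
    rejected = select_to_csv(rows, ["loc", "missing"], "selection", tmp)

print(written)
```

Checking the column names before writing means that a misspelled feature name produces a clear message instead of a partially written file.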

In [1]:
def varselect_tocsv(df, varnames, outname):
    # Validate the arguments; `and` short-circuits, so df.columns is only
    # accessed once df is known to be a DataFrame
    if (isinstance(df, pd.DataFrame) and
        isinstance(outname, str) and
        set(varnames).issubset(df.columns)):
        # Selects the specified columns from the dataframe
        out_df = df[varnames]
        # Saves the result to a CSV file using the specified name
        # (any existing data is overwritten)
        out_path = f'{DATA_PATH}/{outname}.csv'
        out_df.to_csv(out_path, mode="w")
        print('The variable selection was successfully saved in', out_path)
        return out_df
    else:
        # If any of the input arguments are invalid, the function prints an error message and returns None.
        print('Provide proper arguments: a df, a list of feature names and an output name')

<br>

<br>