# Utility Functions for PR104

1. [arff_tocsv()](#arff_tocsv)
1. [data_barplot()](#barplot)
1. [data_summary()](#data_summary)
1. [naivebayes_cv()](#naivebayes)
1. [knn_cv()](#knn)
1. [plot_line()](#plot_line)
1. [roc_cv()](#roc_cv)
1. [save_fig()](#save_fig)
1. [scatter_custom()](#scatter_cust)
1. [varselect_tocsv()](#varselect_tocsv)

In [9]:
%run setup.ipynb

<br>

### arff_tocsv(*name*) <a class='anchor' id='arff_tocsv'></a>
#### Loads *.arff* data and saves it in a CSV file
This function loads data from an ARFF file and saves it as a CSV file. The ARFF file is opened in a `with` block and parsed with `loadarff`, so the file handle is closed automatically when the block ends. If the file cannot be found, the function prints a message and returns `None`.
The function then checks the dataset for missing values and, if any are found, imputes them with the median of the corresponding column.
Finally, it prints a message confirming that the ARFF file was loaded and saves the DataFrame with the `to_csv()` method. The CSV file is written to the `DATA_PATH` directory under the same base name as the ARFF file, overwriting any existing file. The function returns the final dataframe.

The function takes one argument:

- **`name`**: a string that specifies the name of the ARFF file to be loaded, without its extension. The function appends *.arff* to it when building the file path.

In [3]:
from scipy.io.arff import loadarff 
from sklearn.impute import SimpleImputer

def arff_tocsv(name):
    # Constructs the full file path for the ARFF file
    path = os.path.join(DATA_PATH, name + ".arff")
    
    try:
        # Reads the data from the file into a NumPy structured array
        with open(path, "r") as arff_file:
            raw_data = loadarff(arff_file)
    except FileNotFoundError:
        print("The specified ARFF file was not found.")
        return
    
    # Define the columns that should be of integer type
    intdict = {'loc':'int','v(g)':'int','ev(g)':'int',
               'iv(g)': 'int','n':'int','lOCode':'int',
               'lOComment':'int','lOBlank':'int',
               'lOCodeAndComment':'int','uniq_Op':'int',
               'uniq_Opnd':'int','total_Op':'int',
               'total_Opnd':'int','branchCount':'int',
               'defects':'int'}
    
    # Array is then converted to a Pandas dataframe 
    df_data = pd.DataFrame(raw_data[0])
    
    # Check if there are any missing values in the dataframe
    # 5 rows with missing values can be found in jm1.arff data source
    if df_data.isnull().any().any():
        # Impute missing values with the median of the same column.
        # Only numeric columns are imputed; the result is written back into
        # those columns so that column names stay aligned with the data
        imputer = SimpleImputer(strategy='median')
        num_cols = df_data.select_dtypes(include=[np.number]).columns
        df_data[num_cols] = imputer.fit_transform(df_data[num_cols])
    
    # Convert the variables contained in the intdict dictionary to an integer dtype
    df_data = df_data.astype(dtype=intdict)

    print(name + ".arff","successfully loaded")
    
    # Saves the dataframe as a CSV file using the same name as the ARFF file
    # (any existing data is overwritten)
    df_data.to_csv(f'{DATA_PATH}/{name}.csv', mode="w")
    print("Saved in", f'{DATA_PATH}/{name}.csv')
    return df_data
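
The median imputation step described above can be tried in isolation. The following is a minimal sketch on a toy dataframe, not the notebook's actual data: the column values are invented for illustration, and writing the imputed values back into the numeric columns is one possible way to keep column names aligned.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values): one missing entry in each numeric column
toy = pd.DataFrame({
    "loc": [10.0, 20.0, np.nan, 40.0],
    "v(g)": [1.0, np.nan, 3.0, 5.0],
})

# Select the numeric columns, fit the imputer on them and write the
# imputed values back, so column names and order are preserved
num_cols = toy.select_dtypes(include=[np.number]).columns
imputer = SimpleImputer(strategy="median")
toy[num_cols] = imputer.fit_transform(toy[num_cols])

print(toy)
```

Each missing entry is replaced by the median of the observed values in its own column, so the remaining rows are left unchanged.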


<br>

### data_barplot(*df, save = True, id='data_barplot', **kargs*) <a class='anchor' id='barplot'></a>
#### Creates a grouped bar plot with the *Buggy* and *Clean* instances

It plots the '*Buggy*' and '*Clean*' columns from `df` side by side, with the '*Buggy*' bars in dark red and the '*Clean*' bars in dark blue. The x-axis ticks are labeled with the index values of `df`, which correspond to the datasets employed. The figure is given a title, optionally saved under the specified `id` by the separate `save_fig()` function, and then returned.
The function takes four arguments:
<br>
- **`df`**: is the dataframe containing the data to be plotted
- **`save`**: a boolean value indicating whether the plot should be saved (default is True)
- **`id`**: a string value specifying the name of the file to save the plot as (default is 'data_barplot')
- **`**kargs`**: any other keyword arguments to pass to the save_fig function.

In [1]:
def data_barplot(df, save = True, id='data_barplot', **kargs):
    
    # Create x-axis values
    num_rows = df.shape[0]
    x = np.arange(num_rows)

    y1 = df['Buggy']
    y2 = df['Clean']
    # Set width of bars in plot
    width = 0.40

    # Set tick positions on x-axis
    ticks_pos = [r for r in range(num_rows)]

    # Create figure and axis objects
    figure, ax = plt.subplots(figsize=(7,4))

    # Plot data in grouped bar format
    ax.bar(x-0.2, y1, width, label='Buggy', color='#8b0000')
    ax.bar(x+0.2, y2, width, label='Clean', color='#00008b')

    # Set tick positions and labels on x-axis
    ax.set_xticks(ticks_pos, df.index)

    # Add legend to plot
    ax.legend()

    # Add title
    figure.suptitle('Figure 2: Class Distribution among data - "Defects"',
               fontsize=12, fontweight='bold')

    # Save Figure
    if save:
        save_fig(id, **kargs)
    
    # Return the figure so that the caller (or the notebook) can display it
    return figure
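
The grouped layout comes from shifting each series by half a bar width around the integer tick positions. Below is a minimal sketch of that idea, assuming made-up counts for two hypothetical datasets and a non-interactive backend so that it also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical counts for two datasets named "A" and "B"
counts = pd.DataFrame({"Buggy": [5, 8], "Clean": [45, 92]}, index=["A", "B"])

x = np.arange(len(counts))  # one integer position per dataset
width = 0.40                # width of a single bar

fig, ax = plt.subplots(figsize=(7, 4))
# Shift each series by half a bar width so the two bars sit side by side
buggy_bars = ax.bar(x - width / 2, counts["Buggy"], width, label="Buggy")
clean_bars = ax.bar(x + width / 2, counts["Clean"], width, label="Clean")

# Label the ticks with the dataframe index
ax.set_xticks(x)
ax.set_xticklabels(counts.index)
ax.legend()
```

With a width of 0.40 and offsets of 0.20, the right edge of each '*Buggy*' bar touches the left edge of the matching '*Clean*' bar.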

<br>

### data_summary(*filename, index, plot = True*)<a class='anchor' id='data_summary'></a>
#### Reads the data from a specified file, creates a summary table and a grouped bar plot of the data.

This function takes three arguments:
<br>
- **`filename`**: a string that specifies the name of the file containing the data. This function assumes that the file is a CSV file located in the `DATA_PATH` directory.
- **`index`**: specifies the column to use as the index of the dataframe.
- **`plot`**: a boolean that, if set to True, causes the function to also return a grouped bar plot of the data using the `data_barplot()` function. If plot is not provided, it defaults to True.
<br>

In [12]:
def data_summary(filename, index, plot = True):
    try:
        # Check that the filename is a string
        if not isinstance(filename, str):
            raise ValueError('ERROR: The filename must be a string')

        # Constructs the full file path for the CSV file
        path = os.path.join(DATA_PATH, filename + ".csv")

        # Open the file with the given filename and read the data into a DataFrame
        table1 = pd.read_csv(path)

        # Check that the index is a proper column name
        if index not in table1.columns:
            raise ValueError('ERROR: The index must be an existing column')

        # Set the column given by `index` as the index of the dataframe
        table1.set_index(index, inplace=True)

        # Add a new column called 'Clean' with the number of modules without bugs
        table1.insert(2,'Clean',table1['Instances'] -  table1['Buggy'])

        # Format the dataframe as an HTML table
        html_table1 = table1.style.to_html(caption="Table 1: Summary of Datasets")

        # If the plot argument is True, create a grouped bar plot of the data
        if plot:
            return HTML(html_table1), data_barplot(table1)
        else:
            # Otherwise, only return the HTML table
            return HTML(html_table1)

    except FileNotFoundError:
        # If the file does not exist, print an error message
        print('ERROR: The file does not exist')
    except ValueError as error:
        print(error)
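
The '*Clean*' column is derived from two columns that the CSV file is assumed to contain, `Instances` and `Buggy`. A minimal sketch of that step, using invented rows in place of the real summary file:

```python
import pandas as pd

# Hypothetical summary rows; the real CSV is assumed to provide these columns
summary = pd.DataFrame(
    {"Name": ["A", "B"], "Instances": [50, 100], "Buggy": [5, 8]}
).set_index("Name")

# Insert 'Clean' as the third column: modules without reported defects
summary.insert(2, "Clean", summary["Instances"] - summary["Buggy"])

print(summary)
```

Note that `insert()` modifies the dataframe in place, and that position 2 is only valid when the dataframe already has at least two columns after the index is set.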


<br>

### naivebayes_cv(*df, ax = None, cbar = False, normalize = None, **kargs*)<a class='anchor' id='naivebayes'></a>
#### A wrapper function that performs cross-validated evaluation of a Naive Bayes classifier on a given dataset

The function has the following parameters:

- **`df`**: A dataframe containing the data
- **`ax`**: An axis object to plot the confusion matrix on (default is None)
- **`cbar`**: A boolean value for displaying a colorbar (default is False)
- **`normalize`**: A value for normalizing the confusion matrix (default is None)
- **`**kargs`**: Additional keyword arguments passed to the `StratifiedKFold` function

The function outputs a dictionary containing the following:

- **`ax`**: The axis object on which the confusion matrix is plotted 
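- **`confusionmatrix`**: The `ConfusionMatrixDisplay` object
- **`accuracy`**: The mean cross-validated accuracy, as a fraction between 0 and 1
- **`recall`**: The mean cross-validated recall, expressed as a percentage
- **`model`**: The unfitted pipeline that combines the `RobustScaler` column transformer with a `GaussianNB` classifier (Naive Bayes with a Gaussian distribution assumption)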

In [1]:
def naivebayes_cv(df, ax=None, cbar = False, normalize = None, **kargs):
    
    # if no Axes object is provided, create a new one
    if ax is None:
        ax = plt.gca()
    # check if ax is an instance of the Axes class
    elif not isinstance(ax, Axes):
        raise TypeError("The 'ax' argument must be an Axes object")
    
    # Separate the target variable from predictors
    y = df['defects']
    X = df.drop('defects', axis = 1)
    predictors = X.columns.values.tolist()
    
    # Create StratifiedKFold object
    kf = StratifiedKFold(**kargs)
        
    # create a GaussianNB object, which is a Naive Bayes classifier with a Gaussian distribution assumption
    nb = GaussianNB()
    
    # Create column transformer to normalize predictors
    z = make_column_transformer((RobustScaler(), predictors), remainder = "passthrough")
    # Create pipeline
    pipe = make_pipeline(z, nb)
    
    # Calculate the mean cross-validated accuracy and percentage recall,
    # evaluating the full pipeline so that scaling is fitted inside each fold
    acc = cross_val_score(pipe, X, y, cv=kf, scoring='accuracy').mean()
    recall_perc = cross_val_score(pipe, X, y, cv=kf, scoring='recall').mean()*100
    
    # Generate cross-validated predicted labels for the target class
    y_pred = cross_val_predict(pipe, X, y, cv=kf)
    
    # create a ConfusionMatrixDisplay object using the confusion matrix and labels
    lab = ['Clean','Buggy']
    disp = ConfusionMatrixDisplay.from_predictions(y_true=y, y_pred=y_pred, display_labels = lab, cmap='Blues',
                                                   normalize = normalize, ax = ax, colorbar = cbar)
    ax.set_title('GaussianNB', fontsize=14, fontweight='bold')
    
    # Create output
    out = {'ax' : ax,
          'confusionmatrix' : disp,
          'accuracy' : acc,
          'recall' : recall_perc,
          'model' : pipe}
    
    return out
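
The scoring part of this wrapper can be sketched without any plotting. The example below is an illustration under stated assumptions: it uses synthetic data from `make_classification` rather than the defect datasets, and it scales all features with a plain `RobustScaler` step instead of a column transformer.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic binary classification data (stand-in for the defect datasets)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Stratified folds keep the class proportions similar in every split
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Scaling and classification in one estimator, refitted within each fold
pipe = make_pipeline(RobustScaler(), GaussianNB())

acc = cross_val_score(pipe, X, y, cv=kf, scoring="accuracy")
rec = cross_val_score(pipe, X, y, cv=kf, scoring="recall")

print("mean accuracy:", acc.mean())
print("mean recall (%):", rec.mean() * 100)
```

Passing the pipeline rather than the bare classifier to `cross_val_score` means the scaler is fitted on the training part of each fold only.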

<br>

### knn_cv(*df, ax = None, cbar = False, normalize = None, **kargs*)<a class='anchor' id='knn'></a>
#### A wrapper function that performs cross-validated evaluation of a k-NN classifier on a given dataset

The function has the following parameters:

- **`df`**: A dataframe containing the data
- **`ax`**: An axis object to plot the confusion matrix on (default is None)
- **`cbar`**: A boolean value for displaying a colorbar (default is False)
- **`normalize`**: A value for normalizing the confusion matrix (default is None)
- **`**kargs`**: Additional keyword arguments passed to the `StratifiedKFold` constructor

The function outputs a dictionary containing the following:

- **`ax`**: The axis object on which the confusion matrix is plotted
- **`confusionmatrix`**: The *ConfusionMatrixDisplay* object
- **`accuracy`**: The mean cross-validated accuracy, as a fraction between 0 and 1
- **`recall`**: The mean cross-validated recall, expressed as a percentage
- **`model`**: The best estimator from the grid search

In [3]:
def knn_cv(df, ax=None, cbar = False, normalize = None, **kargs):
    
    # If no Axes object is provided, create a new one
    if ax is None:
        ax = plt.gca()
    # Check if ax is an instance of the Axes class
    elif not isinstance(ax, Axes):
        raise TypeError("The 'ax' argument must be an Axes object")
        
    # Separate the target variable from predictors
    y = df['defects']
    X = df.drop('defects', axis = 1)
    predictors = X.columns.values.tolist()
    
    # Create StratifiedKFold object
    kf = StratifiedKFold(**kargs)
    
    # Create k-NN model
    knn = KNeighborsClassifier(weights='distance')
    # Create column transformer to normalize predictors
    z = make_column_transformer((RobustScaler(), predictors), remainder = "passthrough")
    # Create pipeline
    pipe = make_pipeline(z,knn)
    
    # Perform grid search to find the optimal number of neighbors
    ks = {"kneighborsclassifier__n_neighbors": range(1,15)}
    grid = GridSearchCV(pipe, ks, scoring = "recall", cv = kf, refit = True)
    grid.fit(X, y)
    k=grid.best_params_["kneighborsclassifier__n_neighbors"]
    
    # Calculate the mean cross-validated accuracy and recall
    acc = cross_val_score(grid, X, y, cv= kf, scoring='accuracy').mean()
    recall_perc = cross_val_score(grid, X, y, cv=kf, scoring='recall').mean()*100
    
    # Generate cross-validated predicted labels for the target class
    y_pred = cross_val_predict(grid, X, y, cv=kf)
    
    # Create confusion matrix
    lab = ['Clean','Buggy']
    disp = ConfusionMatrixDisplay.from_predictions(y_true=y, y_pred=y_pred,  cmap='Blues', display_labels = lab,
                                                   normalize = normalize, ax = ax, colorbar = cbar)
    ax.set_title(str(k) +'-NN', fontsize=14, fontweight='bold')
    
    # Create output
    out = {'ax' : ax,
          'confusionmatrix' : disp,
          'accuracy' : acc,
          'recall' : recall_perc,
          'model' : grid.best_estimator_}
    
    return out
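
The grid-search key `kneighborsclassifier__n_neighbors` follows from how `make_pipeline` names its steps: each step is named after its lowercased class name, and a parameter is addressed as `<step>__<parameter>`. A minimal sketch of that naming, using a plain `RobustScaler` step as a simplifying assumption:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

pipe = make_pipeline(RobustScaler(), KNeighborsClassifier(weights="distance"))

# Step names are generated from the lowercased class names
step_names = list(pipe.named_steps)

# Grid keys use the "<step name>__<parameter name>" convention
param_grid = {"kneighborsclassifier__n_neighbors": range(1, 15)}

# Every key of the grid must be a known parameter of the pipeline
all_known = all(key in pipe.get_params() for key in param_grid)

print(step_names, all_known)
```

Because `range(1, 15)` excludes its upper bound, the search covers 1 to 14 neighbors.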

<br>

### plot_line(*axis, slope, intercept, **kargs*)<a class='anchor' id='plot_line'></a>
#### Utility function for plotting a line on a given matplotlib axis object
The line is defined by its slope and intercept and is drawn between the current x-limits of the axis. The function takes three positional arguments plus optional keyword arguments:
<br>
- **`axis`**: a matplotlib axis object on which the line will be plotted.
- **`slope`**: a float value that defines the slope of the line.
- **`intercept`**: a float value that defines the y-intercept of the line.
- **`**kargs`**: optional keyword arguments that are forwarded to the plotting call. This allows the caller to specify additional properties of the line such as its color, line style, and so on.
<br>

In [13]:
def plot_line(axis, slope, intercept, **kargs):
    xmin, xmax = axis.get_xlim()
    # Draw on the axis that was passed in, not on the current pyplot axes
    axis.plot([xmin, xmax],
              [xmin*slope+intercept, xmax*slope+intercept],
              **kargs)
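
A line that spans the visible x-range only needs its two endpoints, evaluated at the current x-limits. The sketch below illustrates this with assumed values (x-limits of 0 to 10, slope 2, intercept 1) and draws directly on the given Axes object.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs headless
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.set_xlim(0, 10)

slope, intercept = 2.0, 1.0  # illustrative values

# Evaluate y = slope * x + intercept at the two ends of the x-range
xmin, xmax = ax.get_xlim()
(line,) = ax.plot(
    [xmin, xmax],
    [slope * xmin + intercept, slope * xmax + intercept],
    color="gray",
    linestyle="--",
)
```

Since the endpoints are read from the axis when the function is called, the x-limits should be set (or the data plotted) before the line is added.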

<br>

### roc_cv(*classifier, df, cv = None, ax = None, **kargs*)<a class='anchor' id='roc_cv'></a>
#### Calculates the cross-validated ROC curve and AUC for a given classifier 
It splits the data into folds using stratified k-fold cross-validation (or any other splitter object whose `split()` method yields train and test index arrays), then in each fold fits the classifier on the training set and evaluates it on the test set. The per-fold ROC curves are interpolated onto a common grid of false positive rates and averaged.

The function takes the following arguments:
- **`classifier`**: a classifier object that has `fit` and `predict_proba` methods.
- **`df`**: a DataFrame in which the last column is assumed to hold the target labels; all other columns are used as features.
- **`cv`**: an optional cross-validation splitter that provides a `split()` method. If not provided, a stratified k-fold cross-validation with 10 splits is used.
- **`ax`**: an optional matplotlib Axes object to plot the ROC curve on. If not provided, a new Axes object is created.
- **`**kargs`**: optional keyword arguments that are passed to the plot function when drawing the ROC curve.

The function returns a list containing:

- **`ax`**: the Axes object used to plot the ROC curve.
- **`mean_auc`**: the area under the mean ROC curve, which is obtained by averaging the interpolated per-fold curves.
- **`std_auc`**: the standard deviation of the per-fold AUC values.

In [1]:
def roc_cv(classifier, df, cv = None, ax = None, **kargs):
    
    # If no cross-validation object is provided, use stratified k-fold with 10 splits
    if cv is None:
        cv = StratifiedKFold(n_splits=10)
        
    # If no Axes object is provided, create a new one
    if ax is None:
        ax = plt.gca()
    
    # Initialize lists to store true positive rates and AUC values
    tprs = []
    aucs = []
    # Generate an array of 100 evenly spaced points between 0 and 1
    mean_fpr = np.linspace(0, 1, 100)

    # Set the target column to be the last column in the dataframe
    target_col = df.columns[-1]
    # Set the feature columns to be all the columns except the last one
    feature_cols = df.columns[:-1]

    # Iterate over the folds of the cross-validation
    for fold, (train, test) in enumerate(cv.split(df, df[target_col])):
        # Split the data into training and test sets
        X_train, y_train = df[feature_cols].iloc[train], df[target_col].iloc[train]
        X_test, y_test = df[feature_cols].iloc[test], df[target_col].iloc[test]
        
        # Fit the classifier on the training set
        classifier.fit(X_train, y_train)
        # Predict the probabilities for the test set
        y_pred = classifier.predict_proba(X_test)[:, 1]
        # Calculate the false positive rate and true positive rate for the test set
        fpr, tpr, thresholds = roc_curve(y_test, y_pred)
        # Interpolate the true positive rate at the mean false positive rate points
        interp_tpr = np.interp(mean_fpr, fpr, tpr)
        # Set the first element of the interpolated true positive rate to 0
        interp_tpr[0] = 0.0
        # Add the interpolated true positive rate to the list
        tprs.append(interp_tpr)
        # Add the AUC for the test set to the list
        aucs.append(roc_auc_score(y_test, y_pred))

    # Calculate the mean and standard deviation of the true positive rates across all folds
    mean_tpr = np.mean(tprs, axis=0)
    mean_tpr[-1] = 1.0
    std_tpr = np.std(tprs, axis=0)

    # Compute the AUC of the mean ROC curve and the standard deviation of the per-fold AUCs
    mean_auc = auc(mean_fpr, mean_tpr)
    std_auc = np.std(aucs)

    # Stores the Classifier name in a variable
    if isinstance(classifier, Pipeline):
        classifier_name = classifier.steps[-1][1].__class__.__name__ 
    else:
        classifier_name = classifier.__class__.__name__
    
    # Plot the ROC curve on the Axes object
    ax.plot(
        mean_fpr,
        mean_tpr,
        label=classifier_name + r" (AUC = %0.2f $\pm$ %0.2f)" % (mean_auc, std_auc),
        lw=2,
        alpha=0.8, 
        **kargs
    )
    
    # Return the Axes object, the AUC of the mean ROC curve, and the standard deviation of the per-fold AUCs
    return [ax, mean_auc, std_auc]
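
The averaging step relies on interpolating every fold's ROC curve onto the same grid of false positive rates. Here is a minimal NumPy-only sketch of that step; the two per-fold curves are invented, the grid has 5 points instead of 100, and the area is computed with an explicit trapezoidal rule rather than scikit-learn's `auc`.

```python
import numpy as np

# Hypothetical per-fold ROC points as (fpr, tpr) pairs
folds = [
    (np.array([0.0, 0.5, 1.0]), np.array([0.0, 1.0, 1.0])),
    (np.array([0.0, 0.5, 1.0]), np.array([0.0, 0.5, 1.0])),
]

# Common grid of false positive rates shared by all folds
mean_fpr = np.linspace(0, 1, 5)

tprs = []
for fpr, tpr in folds:
    # Evaluate this fold's curve at the common grid points
    interp_tpr = np.interp(mean_fpr, fpr, tpr)
    interp_tpr[0] = 0.0  # every ROC curve starts at (0, 0)
    tprs.append(interp_tpr)

# Average point by point and force the curve to end at (1, 1)
mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0

# Area under the mean curve using the trapezoidal rule
area = float(np.sum(np.diff(mean_fpr) * (mean_tpr[1:] + mean_tpr[:-1]) / 2))

print(mean_tpr, area)
```

Without the common grid the curves could not be averaged, because each fold generally produces a different set of thresholds and therefore different false positive rates.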

<br>

### save_fig(*fig_id, tight_layout=True, fig_extension="png", resolution=300*)<a class='anchor' id='save_fig'></a>
#### Utility function for saving a matplotlib figure to a file

The function takes four arguments:
<br>
- **`fig_id`**: a string that specifies the name of the file to which the figure will be saved.
- **`tight_layout`**: a boolean value that determines whether `plt.tight_layout()` is called before saving, which adjusts the padding so that labels and titles fit inside the figure. The default value is True.
- **`fig_extension`**: a string that specifies the file extension of the figure file. The default value is "png".
- **`resolution`**: an integer that specifies the resolution of the figure in dots per inch (dpi). The default value is 300.
<br>

In [14]:
def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    #print("Saving figure", fig_id)
    
    if tight_layout:
        plt.tight_layout()
        
    plt.savefig(path, format=fig_extension, dpi=resolution)
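
The same save logic can be exercised outside the notebook. In the sketch below a temporary directory stands in for `IMAGES_PATH`, and the figure name and the lower resolution are illustrative choices.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs headless
import matplotlib.pyplot as plt

# Temporary directory used as a stand-in for IMAGES_PATH
images_path = tempfile.mkdtemp()

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])

# Build "<directory>/<fig_id>.<extension>" in a platform-independent way
fig_id, fig_extension = "demo_figure", "png"
path = os.path.join(images_path, fig_id + "." + fig_extension)

plt.tight_layout()
plt.savefig(path, format=fig_extension, dpi=100)

saved = os.path.isfile(path)
print(path, saved)
```

`plt.savefig` writes the current figure, so a function built this way has to be called while the figure of interest is still the active one.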

<br>

### scatter_custom(*x, y, df, ax = None, **kargs*) <a class='anchor' id='scatter_cust'></a>
#### Custom function for generating a scatter plot from a pandas.DataFrame
It takes three required arguments and one optional argument:
- **`x`** and **`y`**: strings naming the columns of `df` to use for the x- and y-axes of the plot, respectively.
- **`df`**: a pandas.DataFrame containing the data to plot. It must include a `defects` column, which is used to split the points into clean and buggy groups.
- **`ax`**: an optional *Axes* object, which represents a single subplot in a `matplotlib` Figure, on which the plot is drawn. If not provided, the current Axes is used.

The function also accepts additional keyword arguments, which are passed to the `scatter` method of the Axes object when generating the plot. This allows you to customize the appearance of the plot, for example by setting the marker size.

In [18]:
def scatter_custom(x, y, df, ax = None, **kargs):
    
    # if no Axes object is provided, create a new one
    if ax is None:
        ax = plt.gca()
    # check if ax is an instance of the Axes class
    if not isinstance(ax, Axes):
        # if ax is not an Axes object, raise an error
        raise TypeError("The 'ax' argument must be an Axes object")
    
    # create sets of the provided x and y column names, and the column names in df
    varnames_set = set((x,y))
    columns_set = set(df.columns)

    # check if the x and y column names are a subset of the column names in df
    if not varnames_set.issubset(columns_set):
        # if x and y are not a subset of df.columns, raise an error
        raise ValueError("Provide existing column names")
    
    # Create two DataFrames for clean and buggy data
    df_buggy = df.loc[df['defects'] == True]
    df_clean = df.loc[df['defects'] == False]
    df_grouped = [df_clean, df_buggy]
    color = {True:'#8b0000',False:'#00008b'}
    marker = {True:'s',False:'o'}
    labels = ['Clean', 'Buggy']
    
    # create the plot
    for i in range(2):
        # scatter plot of the clean and buggy data
        ax.scatter(x,y, alpha=0.7, marker = marker[i],
                   edgecolors = color[i], color = 'white', data = df_grouped[i],
                   label = labels[i], **kargs)
        
    # set the title, x-axis label, and y-axis label for the plot
    # ax.set_title(), ax.set_xlabel(), and ax.set_ylabel()
    ax.set_title((x + ' vs. ' + y), fontsize=14, fontweight='bold')
    ax.set_xlabel(x, fontstyle='italic')
    ax.set_ylabel(y, fontstyle='italic')
    # show the legend for the plot
    ax.legend()
    # return the Axes object so that the caller can customize or display it
    return ax
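
Two steps of this function are easy to check on their own: validating the column names and splitting the rows by the `defects` flag. The sketch below uses an invented dataframe, and `columns_exist` is a hypothetical helper introduced only for this illustration.

```python
import pandas as pd

# Toy data (hypothetical values) with a boolean 'defects' column
toy = pd.DataFrame({
    "loc": [10, 25, 40, 5],
    "v(g)": [1, 4, 7, 1],
    "defects": [False, True, True, False],
})

def columns_exist(names, frame):
    """Return True if every name in `names` is a column of `frame`."""
    return set(names).issubset(frame.columns)

ok = columns_exist(("loc", "v(g)"), toy)
missing = columns_exist(("loc", "not_a_column"), toy)

# One sub-dataframe per class, keyed by the legend label
groups = {
    label: toy.loc[toy["defects"] == flag]
    for label, flag in (("Clean", False), ("Buggy", True))
}

print(ok, missing, {label: len(part) for label, part in groups.items()})
```

Validating against the columns of the dataframe that was passed in keeps the function independent of any variable defined elsewhere in the notebook.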

<br>

### varselect_tocsv(*df, varnames, outname*) <a class='anchor' id='varselect_tocsv'></a>
#### Selects features of interest and saves them in a CSV file

This function is used to select a subset of columns (i.e., variables) from a Pandas dataframe and save the result as a CSV file. The function takes three arguments:

- **`df`**: a Pandas dataframe that contains the data.
- **`varnames`**: a list of strings, each of which is the name of a column in *df* that should be selected.
- **`outname`**: a string that specifies the name to be used for the CSV file. This string should be a valid file name and will have *.csv* appended to it when the file is created.

In [16]:
def varselect_tocsv(df, varnames, outname):
    # Validate the arguments. `and` short-circuits, so df.columns is only
    # accessed once df is known to be a DataFrame
    if (isinstance(df, pd.DataFrame) and
        isinstance(outname, str) and
        set(varnames).issubset(df.columns)):
            # Selects the specified columns from the dataframe
            out_df = df[varnames]
            # Saves the result to a CSV file using the specified name
            # (any existing data is overwritten)
            out_df.to_csv(f'{RESULTS_PATH}/{outname}.csv', mode="w")
            print('The variable selection was successfully saved in',f'{RESULTS_PATH}/{outname}.csv' )
            return out_df
    else:
        # If any of the input arguments are invalid, the function prints an error message and returns None.
        print('Provide proper arguments: a df, a list of feature names and an output name')
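
The selection and save steps can be checked with a round trip through a CSV file. This is a minimal sketch under stated assumptions: the dataframe is invented, and a temporary directory and file name stand in for `RESULTS_PATH` and `outname`.

```python
import os
import tempfile

import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({"loc": [10, 25], "v(g)": [1, 4], "defects": [0, 1]})

# Select a subset of columns by name
selected = toy[["loc", "defects"]]

# Temporary directory used as a stand-in for RESULTS_PATH
out_path = os.path.join(tempfile.mkdtemp(), "selection.csv")
selected.to_csv(out_path, mode="w")

# The index is written as the first column, so read it back as the index
reloaded = pd.read_csv(out_path, index_col=0)

print(reloaded)
```

Because `to_csv()` writes the index by default, reading the file back without `index_col=0` would produce an extra unnamed column.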

<br>

<br>