# Utility Functions for PR104

1. [arff_tocsv()](#arff_tocsv)
1. [varselect_tocsv()](#varselect_tocsv)

In [1]:
%run setup.ipynb

<br>

### arff_tocsv(*name*) <a class='anchor' id='arff_tocsv'></a>
#### Loads *.arff* data and saves it in a CSV file
This function is used to load data from an ARFF file and save it as a CSV file. The function takes one argument:

- **name**: a string that specifies the name of the ARFF file to be loaded, without its extension. It should be a valid file name; *.arff* is appended to it when the file is read.

In [9]:
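import os

import pandas as pd
from scipy.io.arff import loadarff

# DATA_PATH is not defined in this cell; it is expected to come from setup.ipynb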

def arff_tocsv(name):
    # Constructs the full file path for the ARFF file
    path = os.path.join(DATA_PATH, name + ".arff")
    # Reads the file; loadarff returns a (data, metadata) tuple
    raw_data = loadarff(path)
    # The data (first element, a NumPy record array) is converted to a Pandas dataframe
    df_data = pd.DataFrame(raw_data[0])
    # Converts the "defects" column to the nullable boolean data type
    df_data["defects"] = df_data["defects"].astype("boolean")
    print(name + ".arff", "successfully loaded")
    # Saves the dataframe as a CSV file using the same name as the ARFF file
    csv_path = os.path.join(DATA_PATH, name + ".csv")
    df_data.to_csv(csv_path, mode="w")
    print("Saved in", csv_path)
    return df_data
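
A note on the boolean conversion: `loadarff` returns nominal ARFF attributes as byte strings, so a column such as `defects` may hold values like `b'true'` rather than Python booleans. If that is the case for your data, one possible approach is to decode the byte values and map the labels explicitly before casting. The sketch below is not part of the original notebook; the helper name `decode_nominal_columns`, the toy dataframe, and the `"true"`/`"false"` labels are assumptions for illustration.

```python
import pandas as pd

def decode_nominal_columns(df):
    """Return a copy of df with bytes values in object columns decoded to str.

    Hypothetical helper, not part of the original notebook.
    """
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object:
            out[col] = out[col].map(
                lambda v: v.decode("utf-8") if isinstance(v, bytes) else v
            )
    return out

# Toy frame imitating what loadarff yields for a nominal attribute (assumed values)
toy = pd.DataFrame({"loc": [10.0, 25.0], "defects": [b"false", b"true"]})
decoded = decode_nominal_columns(toy)

# Explicit label-to-boolean mapping; the labels are assumed to be "true"/"false"
decoded["defects"] = (
    decoded["defects"].map({"true": True, "false": False}).astype("boolean")
)
```

With an explicit mapping, any label not listed in the dictionary becomes a missing value instead of being silently coerced, which makes unexpected labels easier to spot.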

<br>

### varselect_tocsv(*df, varnames, outname*) <a class='anchor' id='varselect_tocsv'></a>
#### Selects features of interest and saves them in a CSV file

This function is used to select a subset of columns (i.e., variables) from a Pandas dataframe and save the result as a CSV file. The function takes three arguments:

- **df**: a Pandas dataframe that contains the data.
- **varnames**: a list of strings, each of which is the name of a column in *df* that should be selected.
- **outname**: a string that specifies the name to be used for the CSV file. This string should be a valid file name and will have *.csv* appended to it when the file is created.

In [8]:
# In the paper, only McCabe metrics are selected to build the quality predictors.

def varselect_tocsv(df, varnames, outname):
    # Validates the argument types first, then checks that every requested
    # variable name is an existing column of the dataframe
    if (isinstance(df, pd.DataFrame)
            and isinstance(outname, str)
            and set(varnames).issubset(df.columns)):
        # Selects the specified columns from the dataframe
        out_df = df[varnames]
        # Saves the result to a CSV file using the specified name
        out_path = f'{RESULTS_PATH}/{outname}.csv'
        out_df.to_csv(out_path, mode="w")
        print('The variable selection was successfully saved in', out_path)
        return out_df
    else:
        # If any of the input arguments are invalid, the function prints an error message and returns None.
        print('Provide proper arguments: a dataframe, a list of column names and an output file name string')
        return None
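
When the subset check fails, it can help to know which names were not found. The following is a minimal sketch of that idea, separate from the function above; the helper `missing_columns`, the toy dataframe, and the column names `v(g)` and `ev(g)` are assumptions chosen for illustration, not taken from the notebook's data.

```python
import pandas as pd

def missing_columns(df, varnames):
    """Return the requested names that are absent from df.columns, in order.

    Hypothetical helper, not part of the original notebook.
    """
    present = set(df.columns)
    return [name for name in varnames if name not in present]

# Toy dataframe with assumed column names
toy = pd.DataFrame({"loc": [10, 25], "v(g)": [2, 5], "defects": [False, True]})

wanted = ["v(g)", "ev(g)"]
absent = missing_columns(toy, wanted)

# Keep only the requested columns that actually exist
selected = toy[[name for name in wanted if name not in absent]]
```

Reporting the contents of `absent` in the error message would point directly at a misspelled or missing column name.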