# Overview

The process has three steps:
1. Clean the data
2. Run the data through the Weka tool
3. Record the results

_The code is documented with Google-style Python docstrings._

## 1. Data Cleaning
Weka handles most of the required preprocessing, including:
* Tokenization
* Stop Word Removal
* Stemming
* Tf-Idf
* InfoGain

Before feeding the data into Weka, unneeded columns must be removed from the dataset. The remaining columns must then be formatted so that the output conforms to the ARFF file format.

In [8]:
# dependencies

import numpy as np
import pandas as pd
from pprint import pprint

In [9]:
# load raw dataset

def load_data(path):
    """Loads file data and removes unneeded columns.
    
    Args:
        path: The file path of the file being opened
        
    Returns:
        A numpy array of the 'Subject' and 'Severity Rating' columns.
    """
    df = pd.read_csv(path, sep=',', encoding='ISO-8859-1')
    raw_data = np.array(df)
    
    # keep only the columns for Subject and Severity Rating
    extract_cols = [1, 2]
    data = raw_data[:, extract_cols]
    
    return data
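
Selecting columns by position breaks silently if the CSV layout changes. As a hedged alternative, the sketch below selects them by name. It assumes the CSV header uses the names 'Subject' and 'Severity Rating' given in the docstring above; the toy rows and the `ID` and `Status` columns are made up for illustration and are not taken from the real dataset.

```python
import io

import pandas as pd

# Made-up stand-in for the raw CSV; the real file's header may differ.
toy_csv = io.StringIO(
    "ID,Subject,Severity Rating,Status\n"
    "1,Example subject one,3,Open\n"
    "2,Example subject two,2,Closed\n"
)

df = pd.read_csv(toy_csv)

# Keep only the two columns needed for the ARFF file, selected by name.
selected = df[['Subject', 'Severity Rating']].to_numpy()

print(selected.shape)
```

A misspelled or missing column name raises a `KeyError` here, which is easier to notice than a wrong column ending up in the output.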

In [35]:
def no_quotes(data):
    """Removes all single quotes from a string.
    
    Args:
        data: The string to clean.
        
    Returns:
        The string with every single quote character removed.
    """
    return data.replace('\'', '')

# text-based columns
def add_quotes(data):
    """Wraps each entry of the Subject column in single quotes.
    
    Single quotes inside the text are removed first so that they do not
    clash with the surrounding quotes. The array is modified in place.
    
    Args:
        data: An n*m matrix where n is the number of rows in the dataset
              and the m = 2 columns are 'Subject' and 'Severity Rating'.
              
              For example:
              [['Build 5.3: Unitialized Variables' 3]
               ['Build 5.3 FSW: Typecast Mismatch in Memory Deallocation' 3]
               ['Build 5.3 FSW: Parameter Type Mismatch' 3]]
    
    Returns:
        The same n*m matrix, with each 'Subject' entry stripped of embedded
        single quotes and wrapped in single quotes.
    """
    for x in data:
        # no_quotes leaves text without single quotes unchanged
        x[0] = '\'' + no_quotes(x[0]) + '\''
    
    return data
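
Removing apostrophes changes the text: an identifier such as `'tbl_ptr'` loses its quoting in the output. One possible alternative, sketched below, is to backslash-escape the quotes instead of deleting them. This assumes the ARFF reader accepts `\'` inside a single-quoted string, which should be verified against the Weka version in use. `quote_arff_string` is a hypothetical helper and is not part of this notebook.

```python
def quote_arff_string(text):
    """Wraps text in single quotes, escaping backslashes and single quotes."""
    # Escape backslashes first so the escapes added for quotes are not doubled.
    escaped = text.replace('\\', '\\\\').replace("'", "\\'")
    return "'" + escaped + "'"


print(quote_arff_string("Array 'Fdc' Possibly Out-of-Bounds"))
```

If the escaped form turns out not to load cleanly, stripping the quotes as `add_quotes` does remains the simpler fallback.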

In [36]:
# output ARFF file
def generate_arff(data):
    """Writes the cleaned data to '../output/output.arff'.
    
    Args:
        data: The n*2 matrix returned by add_quotes.
    """
    output_path = '../output/output.arff'
    
    # the with block closes the file even if a write fails
    with open(output_path, 'w') as output_file:
        # write header
        output_file.write('@relation nasa\n')
        output_file.write('@attribute Text string\n')
        output_file.write('@attribute class-att {1, 2, 3, 4, 5}\n\n')
        output_file.write('@data\n\n')
        
        # write data: one "'<text>',<severity>" row per record
        for x in data:
            line = x[0] + ',' + str(x[1]) + '\n'
            output_file.write(line)
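
Before loading the file into Weka, a quick structural check can catch rows that are likely to fail parsing. The sketch below is a minimal example, assuming the layout written by `generate_arff`: a single-quoted text field, a comma, then a class label from 1 to 5. `check_arff_rows` and the sample text are hypothetical and only illustrate the idea; this is not a full ARFF parser.

```python
def check_arff_rows(arff_text, classes=('1', '2', '3', '4', '5')):
    """Returns the 1-based line numbers of data rows that look malformed."""
    bad_lines = []
    in_data = False
    for number, line in enumerate(arff_text.splitlines(), start=1):
        stripped = line.strip()
        if not in_data:
            # Everything up to and including '@data' is header.
            in_data = stripped.lower() == '@data'
            continue
        if not stripped:
            continue  # blank lines between rows are ignored
        # Split on the last comma: the text itself may contain commas.
        text, sep, label = stripped.rpartition(',')
        quoted = len(text) >= 2 and text[0] == "'" and text[-1] == "'"
        if not sep or not quoted or label.strip() not in classes:
            bad_lines.append(number)
    return bad_lines


# Made-up sample: line 7 is valid, line 8 is unquoted, line 9 has an unknown class.
sample = (
    "@relation nasa\n"
    "@attribute Text string\n"
    "@attribute class-att {1, 2, 3, 4, 5}\n"
    "\n"
    "@data\n"
    "\n"
    "'Parameter Type Mismatch',3\n"
    "Unquoted text,3\n"
    "'Bad class',9\n"
)
print(check_arff_rows(sample))
```

The check flags rows whose severity falls outside the declared class set, which is worth knowing if the raw data contains missing or unexpected ratings.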
    

In [37]:
# test code
raw_data = load_data('../dataset/raw/pitsA.csv')
cleaned_data = add_quotes(raw_data)
generate_arff(cleaned_data)

#pprint(cleaned_data)

Build 5.1 OBC Code: Symbol 'tbl_ptr' and 'sm_cond_commit_crc' not initialized
UPL Build 5.1 Code: 'WORDS_PER_MONITOR' macro usage inconsistent with description
OBC Build 5.1 Code: possible data overrun for function 'memcpy'
OBC Build 5.1 Code: variable 'vm_10Hz_frame_delay’ tested against out-of-range value
OBC Build 5.1 Code: function 'check_to_report_tag' missing return value
Build 5.1 UPL Code: Array 'tlmBuffer' Possibly Out-of-Bounds
Build 5.1 UPL Code: Suspiciously Placed ';' in sm_main.c
Build 5.1 OBC Code: 'TempLevelStat' Not Initialized in sts_df.c
Build 5.1 OBC Code: Array 'Fdc' Possibly Out-of-Bounds in acprocessiru.c
Build 5.1 OBC Code: Array 'v' Possibly Out-of-Bounds in accommandattitude.c
Build 5.0 OBC Code: Uninitialized local vars, 'x' and 'y'., in sts_df.c
Build 5.0 OBC FSW CA: Uninitialized local vars, 'w', 'x', 'y', 'z', and 'a', in sts_df.c - case #2
Build 5.0 OBC Code: Uninitialized local vars, 'w', 'x', 'y', 'z', and 'a', in sts_df.c - case #1
Build 5.0 OBC Code