# Overview

The process flow:
1. Clean the data
2. Run the data through the Weka tool
3. Record the results

#### Assumptions
* Because severity 1 and severity 5 rarely occur, we have chosen not to include them in the test data initially
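
One way to read this assumption is that rows rated severity 1 or 5 are left out of the data at first. The following is a minimal sketch of such a filter, assuming the two-column layout used in the rest of this notebook (text in column 0, severity in column 1). The function name `drop_rare_severities` and the sample rows are illustrative only and are not part of the original pipeline.

```python
import numpy as np

def drop_rare_severities(data, rare=(1, 5)):
    """Returns only the rows whose severity rating is not in `rare`."""
    keep = [int(row[1]) not in rare for row in data]
    return data[np.array(keep, dtype=bool)]

# illustrative rows, not taken from the dataset
rows = np.array([['a', 1], ['b', 3], ['c', 5], ['d', 2]], dtype=object)
filtered = drop_rare_severities(rows)
```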


_Google Style Python Docstrings have been used for documentation_

## 1. Data Cleaning
Weka handles most of the required preprocessing, such as:
* Tokenization
* Stop Word Removal
* Stemming
* TF-IDF
* InfoGain
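
To make a few of these steps concrete, here is a rough sketch of tokenization, stop word removal, and TF-IDF weighting in plain Python. It uses the textbook weight (raw term count times `log(N / df)`), which is an assumption and may differ from the exact formula Weka applies. The stop word list, the function names, and the sample documents are illustrative, and stemming and InfoGain are not covered.

```python
import math
from collections import Counter

# tiny illustrative stop word list (an assumption, not Weka's list)
STOP_WORDS = {'the', 'in', 'of', 'a'}

def tokenize(text):
    """Lowercases, splits on whitespace, and drops stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def tf_idf(docs):
    """Returns one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# made-up documents for illustration
docs = ['type mismatch in memory', 'parameter type mismatch', 'memory leak']
weights = tf_idf(docs)
```

Terms that appear in fewer documents (such as `parameter` here) receive a larger weight than terms shared by several documents.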

Weka can also randomize and split the dataset into training and testing sets.
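
For reference, the randomize-and-split step amounts to something like the sketch below, written with NumPy rather than Weka. The 80/20 ratio, the fixed seed, and the name `shuffle_split` are assumptions chosen for illustration.

```python
import numpy as np

def shuffle_split(data, test_fraction=0.2, seed=0):
    """Shuffles the rows, then splits them into (train, test)."""
    rng = np.random.default_rng(seed)
    shuffled = data[rng.permutation(len(data))]
    n_test = int(round(len(data) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# placeholder rows standing in for the cleaned dataset
rows = np.arange(20).reshape(10, 2)
train, test = shuffle_split(rows)
```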

Before the data is fed into Weka, unneeded columns must be removed from the dataset. The remaining columns must also be formatted to conform to the ARFF file format.
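
`load_data` in this notebook selects the columns by position. As an alternative sketch, pandas can read only the needed columns by name through the `usecols` parameter of `read_csv`. The header names and the in-memory CSV below are assumptions for illustration; the real file's headers may differ.

```python
import io
import pandas as pd

# stand-in for the raw CSV; the column names and values are assumptions
csv_text = 'ID,Subject,Severity Rating,Owner\n7,Sample subject,3,someone\n'

# read only the two columns that are needed downstream
df = pd.read_csv(io.StringIO(csv_text), usecols=['Subject', 'Severity Rating'])
data = df.to_numpy(dtype=object)
```

Selecting by name fails loudly if a header is missing, whereas selecting by position silently picks whatever column happens to be there.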

In [1]:
# dependencies

import numpy as np
import pandas as pd
from pprint import pprint

In [2]:
# load raw dataset

def load_data(path):
    """Loads file data and removes unneeded columns.
    
    Args:
        path: The file path of the file being opened
        
    Returns:
        A numpy array of the 'Subject' and 'Severity Rating' columns.
    """
    df = pd.read_csv(path, sep=',', encoding='ISO-8859-1')
    
    # keep only the Subject and Severity Rating columns (positions 1 and 2)
    extract_cols = [1, 2]
    data = df.iloc[:, extract_cols].to_numpy(dtype=object)
    
    return data

In [3]:
def no_punctuation(data):
    """Lowercases a string and replaces punctuation with spaces.
    
    Args:
        data: The string to clean.
        
    Returns:
        The lowercased string with each punctuation character replaced
        by a single space.
    """
    # raw string so the backslash is a literal character, not an escape
    punctuations = r'''!()[]{};:'"\,<>./?@#$%^&*~'''
    
    # map every punctuation character to a space
    table = str.maketrans(punctuations, ' ' * len(punctuations))
    
    return data.lower().translate(table)

# text-based columns
def add_quotes(data):
    """Cleans the Subject column and wraps it in single quotes.
    
    Also converts the severity ratings to int. The input is modified in place.
    
    Args:
        data: An n*m matrix where n is the number of rows in the dataset
              and m is the columns: 'Subject' and 'Severity Rating'. 
              
              For example:
              [['Build 5.3: Unitialized Variables' 3]
              ['Build 5.3 FSW: Typecast Mismatch in Memory Deallocation' 3]
              ['Build 5.3 FSW: Parameter Type Mismatch' 3]]
 
    
    Returns:
        The same n*m matrix, with the 'Subject' text lowercased, stripped
        of punctuation, and quoted, and 'Severity Rating' cast to int.
    """
    for x in data:
        # remove punctuation and surround by single quotes
        x[0] = no_punctuation(x[0])
        x[0] = "'" + x[0] + "'"
        
        # convert float severity ratings to int
        x[1] = int(x[1])
    
    return data
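
As a quick illustration of what the cleaning step produces, the sketch below applies the same punctuation set to the first sample subject from the `add_quotes` docstring. `clean_subject` is a hypothetical compact stand-in for `no_punctuation` plus the quoting step, not a function from this notebook. Removing apostrophes matters here because an unescaped single quote inside the text would end the quoted ARFF value early.

```python
# same punctuation set as no_punctuation, written as a raw string
PUNCTUATION = r'''!()[]{};:'"\,<>./?@#$%^&*~'''
TABLE = str.maketrans(PUNCTUATION, ' ' * len(PUNCTUATION))

def clean_subject(text):
    """Lowercases text, blanks out punctuation, and adds single quotes."""
    return "'" + text.lower().translate(TABLE) + "'"

cleaned = clean_subject('Build 5.3: Unitialized Variables')
# cleaned == "'build 5 3  unitialized variables'"
```

Each punctuation character becomes one space, so `: ` turns into two consecutive spaces.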

In [4]:
# randomize and split (not implemented here; Weka can perform this step)

In [5]:
# output ARFF file
def generate_arff(data):
    """Writes the cleaned dataset to an ARFF file.
    
    Args:
        data: An n*2 matrix of quoted text and integer severity ratings,
              as returned by add_quotes.
    """
    output_path = '../output/output.arff'
    
    # the context manager closes the file even if a write fails
    with open(output_path, 'w') as output_file:
        # write header
        output_file.write('@relation nasa\n')
        output_file.write('@attribute Text string\n')
        output_file.write('@attribute class-att {1, 2, 3, 4, 5}\n\n')
        output_file.write('@data\n\n')
        
        # write data
        for x in data:
            output_file.write(x[0] + ',' + str(x[1]) + '\n')
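
`generate_arff` writes to a fixed path, and `open` raises `FileNotFoundError` if the `../output` directory does not exist. A possible variant, sketched below, takes the path as a parameter and creates the parent directory first. The name `write_arff`, the temporary directory, and the sample row are assumptions for illustration.

```python
import tempfile
from pathlib import Path

def write_arff(data, output_path):
    """Writes rows of (quoted text, severity) to an ARFF file."""
    path = Path(output_path)
    # create the parent directory if it is missing
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open('w') as f:
        f.write('@relation nasa\n')
        f.write('@attribute Text string\n')
        f.write('@attribute class-att {1, 2, 3, 4, 5}\n\n')
        f.write('@data\n\n')
        for text, severity in data:
            f.write(f'{text},{severity}\n')
    return path

# illustrative call: temporary directory and a made-up row
target = Path(tempfile.mkdtemp()) / 'output' / 'output.arff'
out = write_arff([("'sample subject'", 3)], target)
```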
    

In [6]:
# test code
raw_data = load_data('../dataset/raw/pitsC.csv')
cleaned_data = add_quotes(raw_data)
generate_arff(cleaned_data)

#pprint(cleaned_data)