### Initial Data - Minimal Edits ###
This is the initial dataset, formatted for loading into a pandas DataFrame <br/>
This notebook is strictly for data preprocessing

##### What This Notebook Will Do #####
- Remove NFR (non-functional requirement) classes that have too little data or are not needed (e.g. Functional)
- Remove all duplicate sentences
- Remove all non-alphabetical characters
- Remove all 'common' stop-words
- Export the final data to a new .txt file

The new .txt file will be manually examined for any remaining errors: grammatical, spelling, or otherwise

A later decision will be made to remove any custom stop-words in an effort to improve the model's accuracy
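
When stop-words are removed, matching on whole words avoids clipping letters out of longer words (for example, `if` inside `specific`). The following is a minimal sketch of that idea, not the notebook's method: the helper name `remove_stop_words` and both lists are assumptions chosen for illustration.

```python
import re

import pandas as pd

# Illustrative lists only; placeholders, not the notebook's final choice.
STOP_PHRASES = ["the system shall", "the product shall"]
STOP_WORDS = ["the", "and", "or", "a", "an", "of", "to", "if"]


def remove_stop_words(sentences: pd.Series) -> pd.Series:
    """Remove whole-word stop items from a Series of lower-case sentences."""
    # Phrases come first in the alternation so they match before their own words.
    items = [re.escape(s) for s in STOP_PHRASES + STOP_WORDS]
    pattern = r"\b(?:" + "|".join(items) + r")\b"
    cleaned = sentences.str.replace(pattern, " ", regex=True)
    # Collapse the runs of spaces left behind by the removals.
    return cleaned.str.replace(r"\s+", " ", regex=True).str.strip()


demo = pd.Series([
    "the system shall log specific events",
    "notify the user if a record is locked",
])
result = remove_stop_words(demo).tolist()
```

Because `\b` only matches at word edges, `specific` and `notify` keep their inner `if`, while the standalone `if` is removed.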


In [87]:
import pandas as pd

#putting correctly formatted data in pandas DataFrame
filename = 'DataFiles/InitialData.txt'
df = pd.read_csv(filename)
grouped_classes = df.groupby(['class'])
grouped_classes.describe()

class,count,unique,top,freq
access control,473,464,The LHCP can select a patient to obtain additi...,2
audit,99,99,The irregular activity reports are customizable,1
availability,38,38,The product shall be available during normal b...,1
capacity and performance,74,73,The response time shall be fast enough to main...,2
database design,965,959,The system shall provide the ability to save a...,2
functional,1793,1787,Upon reading the report the public health agen...,2
legal,66,66,Recipient shall not attempt to identify the in...,1
look and feel,67,67,All screens created as part of the Disputes ap...,1
maintainability,133,133,Other law (including regulations adopted by th...,1
operational,101,101,The product shall display HTML properly in 80%...,1


### Trimming the data ###
The data set is clearly imbalanced (for example, 38 sentences for availability versus 1793 for functional), so we are going to trim it down <br />
###### Classes that we will include: ######
- Access Control
- Database Design
- Privacy
- Security
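
A minimal sketch of how the imbalance can be inspected and the kept classes selected, using a tiny made-up DataFrame in place of `InitialData.txt`. The column names `class` and `sentence` follow the notebook; the rows and the name `keep_classes` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the real data set; the rows are invented.
toy = pd.DataFrame({
    "class": ["access control", "access control", "audit",
              "privacy", "security", "functional"],
    "sentence": ["s1", "s2", "s3", "s4", "s5", "s6"],
})

# Rows per class: a quick way to see the imbalance.
counts = toy["class"].value_counts()

# Keep only the classes chosen for the model.
keep_classes = ["access control", "database design", "privacy", "security"]
kept = toy[toy["class"].isin(keep_classes)].reset_index(drop=True)
kept_classes = sorted(kept["class"].unique())
```

Selecting with an allow-list (`isin(keep_classes)`) is one option; a removal list gives the same result as long as it names every unwanted class.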

In [88]:
import re 
import io

#import nltk 
#nltk.download('stopwords')
#from nltk.corpus import stopwords
#not sure if we should use the NLTK corpus....has some weird entries in it
#print(stopwords.words('english'))


#classes to drop from the data set
removal_array = ['audit', 'availability', 'capacity and performance', 'functional', 'legal', 'look and feel', 
                'maintainability', 'operational', 'other nonfunctional', 'recoverability', 'reliability', 
                 'usability']

#remove every row whose class is listed in removal_array
df = df[~df['class'].isin(removal_array)]
    
#drop all the duplicate rows
df = df.drop_duplicates()

#remove all non-alphabetical characters (underscores and spaces are kept)
#make all words lower case
#assigning the whole column avoids the chained indexing df['sentence'][ind] = ...,
#which pandas may apply to a temporary copy
df['sentence'] = df['sentence'].apply(lambda s: re.sub('[^a-zA-Z_ ]+', '', s).lower())

#Stop words and custom stop phrases
#entries are padded with spaces so that they do not match inside longer words
stop_words = ['the ', ' the ', ' and ', ' or ', ' a ', ' an ',  ' of ', ' to ', ' that ', ' is ', 
              ' by ', ' you ', ' not ', ' can ', ' user ', ' when ', ' are ', ' this ', ' who ', ' whom ', ' if ']
custom_stop_phrases = ['the system shall ', 'the product shall ', 
                    ' provide the ability ', ' have the ability ', ' has the capability ', ' has the ability '] 

# * works like the ... (spread) operator from javascript
full_stop = [*custom_stop_phrases, *stop_words]

#remove stop items from each sentence in DataFrame
#plain (non-regex) replacement over the whole column, one stop item at a time
for stop in full_stop:
    df['sentence'] = df['sentence'].str.replace(stop, ' ', regex=False)

df.to_csv('DataFiles/PostProcessing.txt', encoding='utf-8', index=False)



#### Data Trimming ####
The data produced above is written to DataFiles/PostProcessing.txt <br />
Each class will then be trimmed to the following number of sentences <br />

- Access Control - 350
- Database Design - 350
- Privacy - 246
- Security - 302
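
The notebook does not say how the rows to keep are chosen. One possible approach, sketched here on invented data, is to sample each class at random up to a per-class cap. The `caps` dictionary, the scaled-down sizes, and the fixed `random_state` are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical per-class caps, scaled down from the sizes listed above.
caps = {"access control": 3, "database design": 3, "privacy": 2}

# Toy stand-in for the post-processed data; the rows are invented.
toy = pd.DataFrame({
    "class": ["access control"] * 5 + ["database design"] * 4 + ["privacy"] * 2,
    "sentence": [f"sentence {i}" for i in range(11)],
})

# Sample each class separately so every class gets its own cap.
parts = []
for name, group in toy.groupby("class"):
    n = min(len(group), caps[name])
    parts.append(group.sample(n=n, random_state=0))

trimmed = pd.concat(parts, ignore_index=True)
sizes = trimmed["class"].value_counts().to_dict()
```

Classes that are already at or below their cap (here `privacy`) are kept whole, which mirrors how Privacy and Security keep all of their sentences in the list above.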

In [89]:
new_classes = df.groupby(['class'])
new_classes.describe()

class,count,unique,top,freq
access control,464,464,chooses open his her upcoming appointment list,1
database design,959,958,generate lists patients specific conditions us...,2
privacy,246,246,a covered health care provider may condition ...,1
security,302,302,system provides integrated security managed in...,1
