# Software Design Using ML&AI nWave


# 1. Setup

To prepare your environment, you need to install some packages.

## 1.1 Install the necessary packages

You need the latest versions of these packages:<br>

**spaCy:** a library for natural language processing (NLP).<br>
**pandas:** provides the dataframes used throughout this notebook.<br>
**stop_words:** a list of common stop words.<br>
**boto3:** the AWS SDK for Python, used for communicating with AWS.<br>
**websocket-client:** a Python client for WebSockets.<br>
**pyorient:** a Python client for OrientDB.<br><br>
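
Before running the install cells, it can help to see which of these packages are already present. The following is a minimal sketch, not part of the original notebook. It assumes Python 3.8 or later (for `importlib.metadata`) and that the PyPI distribution names in `REQUIRED` match your environment; the helper `installed_versions` is a hypothetical name.

```python
from importlib import metadata

# Distribution names as published on PyPI (assumed; adjust to your environment).
REQUIRED = ["spacy", "pandas", "stop-words", "boto3", "websocket-client", "pyorient"]

def installed_versions(names):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` still needs one of the `pip install` cells below.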



**Install NLTK:**

In [1]:
!pip install --upgrade nltk
!pip install --upgrade pyorient

Requirement already up-to-date: nltk in /anaconda3/lib/python3.6/site-packages (3.3)
Requirement not upgraded as not directly required: six in /anaconda3/lib/python3.6/site-packages (from nltk) (1.11.0)
Requirement already up-to-date: pyorient in /Users/swaroopmishra/.local/lib/python3.6/site-packages (1.5.5)


**Install the Boto3 client for AWS communication through the CLI:**

In [2]:
!pip install boto3 



**Install stop_words:**

In [3]:
!pip install stop-words



**Install websocket client:**

In [4]:
!pip install websocket-client



**Install pyorient and the AWS CLI:**

In [5]:
! pip install awscli
! pip install pyorient --user
! pip install --upgrade --user awscli



Collecting awscli
  Downloading https://files.pythonhosted.org/packages/0f/7d/81e59502c95100bfb9010e6e04fe6dc8f03b4c11f5c63d79b9888ad4a412/awscli-1.15.20-py2.py3-none-any.whl (1.3MB)
Requirement not upgraded as not directly required: PyYAML<=3.12,>=3.10 in /anaconda3/lib/python3.6/site-packages (from awscli) (3.12)
Collecting botocore==1.10.20 (from awscli)
  Downloading https://files.pythonhosted.org/packages/8d/bf/99c47b80476a890773d56233a890a4c30d0d5868e6c991dcc945f4735d75/botocore-1.10.20-py2.py3-none-any.whl (4.2MB)
Requirement not upgraded as not directly required: colorama<=0.3.9,>=0.2.5 in /anaconda3/lib/python3.6/site-packages (from awscli) (0.3.7)
Requirement not upgraded as not directly required: s3transfer<0.2.0,>=0.1.12 in /anaconda3/lib/python3.6/site-packages (from awscli) (0.1.13)
Requirement not upgrade

## 1.2 Import packages and libraries

Import the packages and libraries that you'll use:

In [1]:
import json
import spacy

import re
import nltk
from nltk.cluster.util import cosine_distance
from stop_words import get_stop_words
import numpy

import boto3
from botocore.client import Config

import websocket
import _thread
import time

from io import BytesIO
import pandas as pd
import sys

# 2. Configuration

Add the configurable items of the notebook below.

## 2.1 Add your service credentials, if any are required

This is where you add the credentials of the infrastructure you are using, for example to store data.


Run the cell.

In [6]:
### This is the section to provide credentials for AWS S3 account
### While sharing the notebook remove them -- will try to make this cell hidden later

## Console URL :::  https://awstestconsole-swaroop.signin.aws.amazon.com/console
## Account Id: 
## Username : 
## Password : 
## Then Navigate to the S3 section

## 2.2 Add your service credentials for S3

You must create an S3 bucket on AWS. To access data in a file in Object Storage, you need the Object Storage authentication credentials. Insert these credentials as credentials_1 in the following cell, replacing the cell's current contents.

In [20]:
# @hidden_cell
# The following code contains the credentials for a file in your S3 bucket.
# You might want to remove those credentials before you share your notebook.
credentials_1 = {
    'ACCESS_KEY_ID': '',
    'ACCESS_SECRET_KEY': '',
    'BUCKET': 'software-testing-pyscript'
}
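
To avoid leaving keys in a shared notebook, the same dictionary could be filled from environment variables. This is only a sketch under stated assumptions: the variable names `S3_ACCESS_KEY_ID`, `S3_ACCESS_SECRET_KEY`, and `S3_BUCKET` and both helper functions are hypothetical, and the dictionary keys mirror `credentials_1` above.

```python
import os

def load_s3_credentials(env=os.environ):
    """Build a credentials dict from environment variables (hypothetical names)."""
    return {
        'ACCESS_KEY_ID': env.get('S3_ACCESS_KEY_ID', ''),
        'ACCESS_SECRET_KEY': env.get('S3_ACCESS_SECRET_KEY', ''),
        'BUCKET': env.get('S3_BUCKET', 'software-testing-pyscript'),
    }

def missing_credentials(credentials):
    """Return the keys whose values are still empty."""
    return [key for key, value in credentials.items() if not value]

# A plain dict stands in for os.environ so the example does not depend on the machine
credentials_example = load_s3_credentials({'S3_ACCESS_KEY_ID': 'example-id'})
print(missing_credentials(credentials_example))  # ['ACCESS_SECRET_KEY']
```

Checking `missing_credentials` before creating the S3 client gives an early, readable failure instead of an authentication error later on.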

# 3. spaCy Text Classification (required only if spaCy is used for machine learning)

Write the classification-related utility functions in a modularized form.

## 3.1 Requirement Classification


In [21]:
def POS_tagging(text):
    """ Generate Part of speech tagging of the text.
    """
    POSofText = nltk.tag.pos_tag(text)
    return POSofText
def split_into_tokens(text):
    """ Split text into tokens.
    """
    tokens = nltk.word_tokenize(text)
    return tokens


In [15]:
text = "want to withdraw cash from ATM"
tokens = split_into_tokens(text)
postags = POS_tagging(tokens)
print(postags)

[('want', 'NN'), ('to', 'TO'), ('withdraw', 'VB'), ('cash', 'NN'), ('from', 'IN'), ('ATM', 'NNP')]


## 3.2 Augmented Classification

Custom classification utility functions for augmenting the results of the spaCy API call

In [22]:
def split_sentences(text):
    """ Split text into sentences.
    """
    sentence_delimiters = re.compile(u'[\\[\\]\n.!?]')
    sentences = sentence_delimiters.split(text)
    return sentences

def split_into_tokens(text):
    """ Split text into tokens.
    """
    tokens = nltk.word_tokenize(text)
    return tokens
    
def POS_tagging(text):
    """ Generate Part of speech tagging of the text.
    """
    POSofText = nltk.tag.pos_tag(text)
    return POSofText

def keyword_tagging(tag,tagtext,text):
    """ Tag the text matching keywords.
    """
    if (text.lower().find(tagtext.lower()) != -1):
        return text[text.lower().find(tagtext.lower()):text.lower().find(tagtext.lower())+len(tagtext)]
    else:
        return 'UNKNOWN'
    
def regex_tagging(tag,regex,text):
    """ Tag the text matching REGEX.
    """    
    p = re.compile(regex, re.IGNORECASE)
    matchtext = p.findall(text)
    regex_list=[]    
    if (len(matchtext)>0):
        for regword in matchtext:
            regex_list.append(regword)
    return regex_list

def chunk_tagging(tag,chunk,text):
    """ Tag the text using chunking.
    """
    parsed_cp = nltk.RegexpParser(chunk)
    pos_cp = parsed_cp.parse(text)
    chunk_list=[]
    for root in pos_cp:
        if isinstance(root, nltk.tree.Tree):               
            if root.label() == tag:
                chunk_word = ''
                for child_root in root:
                    chunk_word = chunk_word +' '+ child_root[0]
                chunk_list.append(chunk_word)
    return chunk_list
    
def augument_SpResponse(responsejson,updateType,text,tag):
    """ Update the output JSON with augumented classifications.
    """
    if(updateType == 'keyword'):
        if not any(d.get('text', None) == text for d in responsejson['keywords']):
            responsejson['keywords'].append({"text":text,"relevance":0.5})
    else:
        if not any(d.get('text', None) == text for d in responsejson['entities']):
            responsejson['entities'].append({"type":tag,"text":text,"relevance":0.5,"count":1}) 

def classify_text(text, config):
    """ Perform augumented classification of the text.
    """
    
    #will be used for storing initial value of response json, this is from nlu earlier
    with open('sample.json') as f:
        responsejson = json.load(f)
    
    sentenceList = split_sentences(text) #
    
    tokens = split_into_tokens(text)
    
    postags = POS_tagging(tokens)
    
    configjson = json.loads(config)  # json.loads parses a JSON string; json.load would read from a file-like object instead
    
    
    for stages in configjson['configuration']['classification']['stages']:
        # print('Stage - Performing ' + stages['name']+':')
        for steps in stages['steps']:
            # print('    Step - ' + steps['type']+':')
            if (steps['type'] == 'keywords'):
                for keyword in steps['keywords']:
                    for word in sentenceList:
                        wordtag = keyword_tagging(keyword['tag'],keyword['text'],word)
                        if(wordtag != 'UNKNOWN'):
                            #print('      '+keyword['tag']+':'+wordtag)
                            augument_SpResponse(responsejson,'entities',wordtag,keyword['tag'])
            elif(steps['type'] == 'd_regex'):
                for regex in steps['d_regex']:
                    for word in sentenceList:
                        regextags = regex_tagging(regex['tag'],regex['pattern'],word)
                        if (len(regextags)>0):
                            for words in regextags:
                                #print('      '+regex['tag']+':'+words)
                                augument_SpResponse(responsejson,'entities',words,regex['tag'])
            elif(steps['type'] == 'chunking'):
                for chunk in steps['chunk']:
                    chunktags = chunk_tagging(chunk['tag'],chunk['pattern'],postags)
                    if (len(chunktags)>0):
                        for words in chunktags:
                            #print('      '+chunk['tag']+':'+words)
                            augument_SpResponse(responsejson,'entities',words,chunk['tag'])
            else:
                print('UNKNOWN STEP')
    
    return responsejson

def replace_unicode_strings(response):
    """ Convert dict with byte strings to strings.
    """
    # Python 3: str is already Unicode and dicts have items(), not iteritems();
    # only bytes values still need decoding
    if isinstance(response, dict):
        return {replace_unicode_strings(key): replace_unicode_strings(value) for key, value in response.items()}
    elif isinstance(response, list):
        return [replace_unicode_strings(element) for element in response]
    elif isinstance(response, bytes):
        return response.decode('utf-8')
    else:
        return response
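
The notebook does not show the contents of the configuration file or of `sample.json`. The sketch below illustrates the shape that `classify_text` appears to expect, inferred only from the keys it reads (`configuration`, `classification`, `stages`, `steps`, `type`, `keywords`, `d_regex`, `chunk`, `tag`, `text`, `pattern`). All rule values, the stage name, and the helper `step_types` are hypothetical.

```python
import json

# Hypothetical rule values; only the key names are taken from classify_text.
example_config = {
    "configuration": {
        "classification": {
            "stages": [
                {
                    "name": "Base Tagging",
                    "steps": [
                        {"type": "keywords",
                         "keywords": [{"tag": "CHANNEL", "text": "ATM"}]},
                        {"type": "d_regex",
                         "d_regex": [{"tag": "AMOUNT", "pattern": r"\$\d+"}]},
                        {"type": "chunking",
                         "chunk": [{"tag": "NP", "pattern": "NP: {<DT>?<JJ>*<NN.*>+}"}]},
                    ],
                }
            ]
        }
    }
}

# classify_text receives the config as a JSON string, and augument_SpResponse
# appends to these two lists, so sample.json needs at least this structure.
config_text = json.dumps(example_config)
empty_response = {"keywords": [], "entities": []}

def step_types(config_string):
    """List the step types in the order classify_text would visit them."""
    parsed = json.loads(config_string)
    stages = parsed["configuration"]["classification"]["stages"]
    return [step["type"] for stage in stages for step in stage["steps"]]

print(step_types(config_text))  # ['keywords', 'd_regex', 'chunking']
```

Note that for a chunking step the `tag` has to equal the label used in the grammar `pattern`, because `chunk_tagging` compares it against `root.label()`.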

# 4. Correlate text content

In [23]:
stopWords = get_stop_words('english')
# List of words to be ignored for text similarity
stopWords.extend(["The","This","That",".","!","?"])

def compute_text_similarity(text1, text2, text1tags, text2tags):
    """ Compute text similarity using cosine
    """
    #stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form
    stemmer = nltk.stem.porter.PorterStemmer()
    sentences_text1 = split_sentences(text1)
    sentences_text2 = split_sentences(text2)
    tokens_text1 = []
    tokens_text2 = []
    
    for sentence in sentences_text1:
        tokenstemp = split_into_tokens(sentence.lower())
        tokens_text1.extend(tokenstemp)
    
    for sentence in sentences_text2:
        tokenstemp = split_into_tokens(sentence.lower())
        tokens_text2.extend(tokenstemp)
    if (len(text1tags) > 0):  
        tokens_text1.extend(text1tags)
    if (len(text2tags) > 0):    
        tokens_text2.extend(text2tags)
    
    tokens1Filtered = [stemmer.stem(x) for x in tokens_text1 if x not in stopWords]
    
    tokens2Filtered = [stemmer.stem(x) for x in tokens_text2 if x not in stopWords]
    
    #  remove duplicate tokens
    tokens1Filtered = set(tokens1Filtered)
    tokens2Filtered = set(tokens2Filtered)
   
    tokensList=[]

    text1vector = []
    text2vector = []
    
    if len(tokens1Filtered) < len(tokens2Filtered):
        tokensList = tokens1Filtered
    else:
        tokensList = tokens2Filtered

    for token in tokensList:
        if token in tokens1Filtered:
            text1vector.append(1)
        else:
            text1vector.append(0)
        if token in tokens2Filtered:
            text2vector.append(1)
        else:
            text2vector.append(0)  

    cosine_similarity = 1-cosine_distance(text1vector,text2vector)
    if numpy.isnan(cosine_similarity):
        cosine_similarity = 0
    
    return cosine_similarity

# 5. Persistence and Storage
## 5.1 Configure Object Storage Client

In [24]:
s3 = boto3.client('s3',
                    aws_access_key_id=credentials_1['ACCESS_KEY_ID'],
                    aws_secret_access_key=credentials_1['ACCESS_SECRET_KEY'],
                    config=Config(signature_version='s3v4')
                     )
# Enter the path where you want to store data downloaded from S3


def get_file(filename,Location):
    # The S3 download is currently disabled, so the file must already exist at Location
    #s3.download_file(Bucket=credentials_1['BUCKET'],Key=filename,Filename=Location)
    pass

#def load_string(fileobject):
#    '''Load the file contents into a Python string'''
#    text = fileobject.read()
#    return text

#def load_df(fileobject,sheetname):
#    '''Load file contents into a Pandas dataframe'''
#    excelFile = pd.ExcelFile(fileobject)
#    df = excelFile.parse(sheetname)
#    return df

#def put_file(filename, filecontents):
#    '''Write file to Cloud Object Storage'''
#    resp = s3.put_object(Bucket=credentials_1['BUCKET'], Key=filename, Body=filecontents)
    #resp = s3.Bucket(Bucket=credentials_1['BUCKET']).put_object(Key=filename, Body=filecontents)
#    return resp
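
If the download is re-enabled, one option is to fetch a file only when no local copy exists. The following is a hedged sketch rather than the notebook's implementation: `fetch_if_missing` is a hypothetical helper, and `client` is assumed to be a boto3 S3 client (or any object with a compatible `download_file` method), called with the same keyword arguments as the commented-out line above.

```python
import os

def fetch_if_missing(client, bucket, key, destination):
    """Download key from bucket to destination unless the file already exists.

    Returns True when a download was attempted, False when the local copy was reused.
    """
    if os.path.exists(destination):
        return False
    parent = os.path.dirname(destination)
    if parent:
        # Create the target folder first so the download has somewhere to write
        os.makedirs(parent, exist_ok=True)
    client.download_file(Bucket=bucket, Key=key, Filename=destination)
    return True
```

With the objects defined in this notebook, a call might look like `fetch_if_missing(s3, credentials_1['BUCKET'], requirements_file_name, Location)`.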



## 5.2 OrientDB client - functions to connect, store and retrieve data

**Connect to OrientDB**

In [25]:
import pyorient  # not imported in section 1.2; without it this cell fails with a NameError

client = pyorient.OrientDB(host="localhost", port=2424)
user = ""
passw = ""
session_id = client.connect(user, passw)

**OrientDB Core functions**

In [63]:
def create_database(dbname, username, password):
    """ Create a database
    """
    client.db_create( dbname, pyorient.DB_TYPE_GRAPH, pyorient.STORAGE_TYPE_MEMORY )
    print(dbname  + " created and opened successfully")
        
def drop_database(dbname):
    """ Drop a database
    """
    if client.db_exists( dbname, pyorient.STORAGE_TYPE_MEMORY ):
        client.db_drop(dbname)
    
def create_class(classname):
    """ Create a class
    """
    command = "create class "+classname + " extends V"
    client.command(command)
    
def create_record(classname, entityname, attributes):
    """ Create a record
    """
    command = "insert into " + classname + " set " 
    attrstring = ""
    for index,key in enumerate(attributes):
        attrstring = attrstring + key + " = '"+ attributes[key] + "'"
        if index != len(attributes) -1:
            attrstring = attrstring +","
    command = command + attrstring
    client.command(command)
    
def create_defect_testcase_edge(defectid, testcaseid, attributes):
    """ Create an edge between a defect and a testcase
    """
    command = "create edge linkedtestcases from (select from Defect where ID = " + "'" + defectid + "') to (select from Testcase where ID = " + "'" + testcaseid + "')" 
    if len(attributes) > 0:
        command = command + " set "
    attrstring = ""
    for index,key in enumerate(attributes):
        val = attributes[key]
        if not isinstance(val, str):
            val = str(val)
        attrstring = attrstring + key + " = '"+ val + "'"
        if index != len(attributes) -1:
            attrstring = attrstring +","
    command = command + attrstring
    client.command(command)    
    
def create_testcase_requirement_edge(testcaseid, reqid, attributes):
    """ Create an edge between a testcase and a requirement
    """
    command = "create edge linkedrequirements from (select from Testcase where ID = "+ "'" + testcaseid+"') to (select from Requirement where ID = "+"'"+reqid+"')" 
    if len(attributes) > 0:
        command = command + " set "
    attrstring = ""
    for index,key in enumerate(attributes):
        val = attributes[key]
        if not isinstance(val, str):
            val = str(val)
        attrstring = attrstring + key + " = '"+ val + "'"
        if index != len(attributes) -1:
            attrstring = attrstring +","
    command = command + attrstring
    client.command(command)  

    
def create_requirement_defect_edge(reqid, defectid, attributes):
    """ Create an edge between a requirement and a defect
    """
    command = "create edge linkeddefects from (select from Requirement where ID = "+ "'" + reqid+"') to (select from Defect where ID = "+"'"+defectid+"')" 
    
    if len(attributes) > 0:
         command = command + " set "
    attrstring = ""
    for index,key in enumerate(attributes):
        val = attributes[key]
        if not isinstance(val, str):
            val = str(val)
        attrstring = attrstring + key + " = '"+ val + "'"
        if index != len(attributes) -1:
            attrstring = attrstring +","
    command = command + attrstring
    client.command(command) 
    
def execute_query(query):
    """ Execute a query
    """
    return client.query(query)
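
The record and edge functions above each rebuild the same `key = 'value'` list by hand, and `create_record` fails on non-string values or values containing a quote. A shared helper could look like the sketch below. Both function names are hypothetical, and the backslash escaping is an assumption about what the target OrientDB version accepts in string literals, so verify it before relying on it.

```python
def quote_value(value):
    """Render a value as a single-quoted SQL string literal.

    Assumption: backslash escaping of backslashes and single quotes is
    accepted by the target OrientDB version.
    """
    text = value if isinstance(value, str) else str(value)
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"

def build_set_clause(attributes):
    """Join attributes into "key = 'value'" pairs separated by commas."""
    return ",".join(f"{key} = {quote_value(value)}" for key, value in attributes.items())

print(build_set_clause({"ID": "D01", "score": 0.75}))  # ID = 'D01',score = '0.75'
```

Each `create_*` function could then append `" set " + build_set_clause(attributes)` when `attributes` is not empty.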

**OrientDB Insights**

In [None]:
def get_related_testcases(defectid):
    """ Get the related testcases for a defect
    """
    testcasesQuery = "select * from ( select expand( out('linkedtestcases')) from Defect where ID = '" + defectid +"' )"
    testcases = execute_query(testcasesQuery)
    scoresQuery = "select expand(out_linkedtestcases) from Defect where ID = '"+defectid+"'"
    scores = execute_query(scoresQuery)
    testcaseList =[]
    scoresList= []
    for testcase in testcases:
        testcaseList.append(testcase.ID)
    for score in scores:
        scoresList.append(score.score)
    result = {}
    length = len(testcaseList)
    for i in range(0, length):
        result[testcaseList[i]] = scoresList[i]
    return result

def get_related_requirements(testcaseid):
    """ Get the related requirements for a testcase
    """
    requirementsQuery = "select * from ( select expand( out('linkedrequirements') ) from Testcase where ID = '" + testcaseid +"' )"
    requirements = execute_query(requirementsQuery)
    print(requirements)
    scoresQuery = "select expand(out_linkedrequirements) from Testcase where ID = '"+testcaseid+"'"
    scores = execute_query(scoresQuery)
    requirementsList =[]
    scoresList= []
    for requirement in requirements:
        requirementsList.append(requirement.ID)
    for score in scores:
        scoresList.append(score.score)
    result = {}
    length = len(requirementsList)
    print(requirementsList, scoresList)
    for i in range(0, length):
        result[requirementsList[i]] = scoresList[i]
    return result

def get_related_defects(reqid):
    """ Get the related defects for a requirement
    """
    defectsQuery = "select * from ( select expand( out('linkeddefects')) from Requirement where ID = '" + reqid +"' )"
    defects = execute_query(defectsQuery)
    scoresQuery = "select expand(out_linkeddefects) from Requirement where ID = '"+reqid+"'"
    scores = execute_query(scoresQuery)
    defectsList =[]
    scoresList= []
    for defect in defects:
        defectsList.append(defect.ID)
    for score in scores:
        scoresList.append(score.score)
    result = {}
    length = len(defectsList)
    for i in range(0, length):
        result[defectsList[i]] = scoresList[i]
    return result

def build_format_defects_list(defectsResult):
    """ Build and format the OrientDB query results for defects
    """
    defects = []
    for defect in defectsResult:
        detail = {}
        detail['ID'] = defect.ID
        detail['Severity'] = defect.Severity
        detail['Description'] = defect.Description
        defects.append(detail)
    return defects

def build_format_testcases_list(testcasesResult):
    """ Build and format the OrientDB query results for testcases
    """
    testcases = []
    for testcase in testcasesResult:
        detail = {}
        detail['ID'] = testcase.ID
        detail['Category'] = testcase.Category
        detail['Description'] = testcase.Description
        testcases.append(detail)
    return testcases  

def build_format_requirements_list(requirementsResult):
    """ Build and format the OrientDB query results for requirements
    """
    requirements = []
    for requirement in requirementsResult:
        detail = {}
        detail['ID'] =requirement.ID
        detail['Description'] = requirement.Description
        detail['Priority'] = requirement.Priority
        requirements.append(detail)
    return requirements  

def get_defects():
    """ Get all defects
    """
    defectsQuery = "select * from Defect"
    defectsResult = execute_query(defectsQuery)
    defects = build_format_defects_list(defectsResult)
    return defects

def get_testcases():
    """ Get all testcases
    """
    testcasesQuery = "select * from Testcase"
    testcasesResult = execute_query(testcasesQuery)
    testcases = build_format_testcases_list(testcasesResult)
    return testcases

def get_requirements():
    """ Get all requirements
    """
    requirementsQuery = "select * from Requirement"
    requirementsResult =  execute_query(requirementsQuery)
    requirements = build_format_requirements_list(requirementsResult)
    return requirements  

def get_defects_severity(severity):
    """ Get defects of a given severity
    """
    query = "select * from Defect where Severity = " + str(severity)
    queryResult =  execute_query(query)
    defects = build_format_defects_list(queryResult)    
    return defects

def get_testcases_category(category):
    """ Get testcases of a given category
    """
    testcasesQuery = "select * from Testcase where Category = '"+str(category)+"'"
    testcasesResult = execute_query(testcasesQuery)
    testcases = build_format_testcases_list(testcasesResult)
    return testcases

def get_testcases_zero_defects():
    """ Get testcases that did not generate any defects
    """
    testcasesQuery = "Select * from Testcase where in('linkedtestcases').size() = 0"
    testcasesResult = execute_query(testcasesQuery)
    testcases = build_format_testcases_list(testcasesResult)
    print(testcases)
    return testcases

def get_defects_zero_testcases():
    """ Get defects that have no associated testcases
    """
    query = "Select * from Defect where out('linkedtestcases').size() = 0"
    queryResult =  execute_query(query)
    defects = build_format_defects_list(queryResult)   
    print(defects)
    return defects

def get_requirements_zero_defect():
    """ Get requirements that have no defects
    """
    query = "Select * from Requirement where out('linkeddefects').size() = 0"
    requirementsResult =  execute_query(query)
    requirements = build_format_requirements_list(requirementsResult)
    return requirements  

def get_requirements_zero_testcases():
    """ Get requirements that have no associated testcases
    """
    query = "Select * from Requirement where in('linkedrequirements').size() = 0"
    requirementsResult =  execute_query(query)
    requirements = build_format_requirements_list(requirementsResult)
    return requirements  

def get_requirement_defects(numdefects):
    """ Get requirements that have at least a given number of defects
    """
    query = "select ID,Description,Priority from Requirement where out('linkeddefects').size() >= " + str(numdefects)
    requirementsResult =  execute_query(query)
    requirements = build_format_requirements_list(requirementsResult)
    for requirement in requirements:
        num = len(get_related_defects(requirement['ID']))
        requirement['defectcount'] = num
    return requirements  
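
The three `get_related_*` functions above follow one pattern: query the linked vertices, query the edges, then pair IDs with scores by position. The sketch below factors that pattern out so it can be tried without a database. `related_scores`, `Record`, and `fake_query` are hypothetical names, the query strings copy the ones used above, and it keeps the original assumption that both queries return results in the same order.

```python
from collections import namedtuple

def related_scores(run_query, source_class, edge_class, source_id):
    """Return {linked vertex ID: edge score} for one source vertex."""
    vertex_query = ("select * from ( select expand( out('" + edge_class + "')) from "
                    + source_class + " where ID = '" + source_id + "' )")
    edge_query = ("select expand(out_" + edge_class + ") from "
                  + source_class + " where ID = '" + source_id + "'")
    vertices = run_query(vertex_query)
    edges = run_query(edge_query)
    # zip stops at the shorter result, so a missing score cannot raise IndexError
    return {vertex.ID: edge.score for vertex, edge in zip(vertices, edges)}

Record = namedtuple("Record", ["ID", "score"])

def fake_query(query):
    # Stand-in for execute_query: returns canned records instead of contacting OrientDB
    if "out_linkedtestcases" in query:
        return [Record(None, 0.8), Record(None, 0.5)]
    return [Record("T01", None), Record("T02", None)]

print(related_scores(fake_query, "Defect", "linkedtestcases", "D01"))
# {'T01': 0.8, 'T02': 0.5}
```

Passing `execute_query` as `run_query` would reproduce `get_related_testcases("D01")` against a live database.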

# 6. Data Preparation

## 6.1 Global variables and functions

In [26]:
# Name of the excel file with data in S3 Storage
BrdFileName = "Banking-BRD.xlsx"
# Choose or get as an input as to which Domain it belongs to i.e banking, healthcare etc
Domain = "Banking"

# Name of the config file in Object Storage
configFileName = "sample_config.txt"

# Config contents
config = None

Path = ".//test/"

# Requirements dataframe
requirements_file_name = "Requirements.xlsx"
requirements_sheet_name = "".join((Domain,"-Requirements"))
requirements_df = None

# Domain/UseCase dataframe
domain_file_name = "Domain.xlsx"
domain_sheet_name = "".join((Domain,"-Domain"))
domain_df = None

# DataElements dataframe
dataelements_file_name ="DataElements.xlsx"
dataelements_sheet_name ="".join((Domain,"-Dataelements"))
dataelements_df = None

def load_artifacts():
    global requirements_df 
    global domain_df 
    global dataelements_df 
    global config
    global Path
    Location = "".join((Path,requirements_file_name))
    get_file(requirements_file_name,Location)
    excel = pd.ExcelFile(Location)
    requirements_df = excel.parse(requirements_sheet_name)
    Location = "".join((Path,domain_file_name))
    get_file(domain_file_name,Location)
    excel = pd.ExcelFile(Location)
    domain_df = excel.parse(domain_sheet_name)
    Location = "".join((Path,dataelements_file_name))
    get_file(dataelements_file_name,Location)
    excel = pd.ExcelFile(Location)
    dataelements_df = excel.parse(dataelements_sheet_name)
    # Use a context manager so the config file is closed after reading
    with open(configFileName) as rule_text:
        config = rule_text.read()
    

def prepare_artifact_dataframes():
    """ Prepare artifact dataframes by creating necessary output columns
    """
    global requirements_df 
    global domain_df 
    global dataelements_df 
    req_cols_len = len(requirements_df.columns)
    dom_cols_len = len(domain_df.columns)
    dat_cols_len = len(dataelements_df.columns)
    requirements_df.insert(req_cols_len, "ClassifiedText","")
    requirements_df.insert(req_cols_len+1, "Keywords","")
    requirements_df.insert(req_cols_len+2, "DomainMatchScore","")
    
    domain_df.insert(dom_cols_len, "ClassifiedText","")
    domain_df.insert(dom_cols_len+1, "Keywords","")
    domain_df.insert(dom_cols_len+2, "DataElementsMatchScore","")

    dataelements_df.insert(dat_cols_len, "ClassifiedText","")
    dataelements_df.insert(dat_cols_len+1, "Keywords","")
    dataelements_df.insert(dat_cols_len+2, "RequirementsMatchScore","")

## 6.2 Utility functions for Engineering Insights

In [32]:
def mod_req_text_classifier_output(artifact_df, config, output_column_name):
    """ Add text classifier output to the artifact dataframe based on rule defined in config
    """
    for index, row in artifact_df.iterrows():
        summary = row["I want to <perform some task>"]
        print("--------------")
        print(summary)
        #classifier_journey_output = mod_classify_text(summary, config)
        #print(classifier_journey_output)
        #artifact_df.set_value(index, output_column_name, classifier_journey_output)
    return artifact_df 


def add_text_classifier_output(artifact_df, config, output_column_name):
    """ Add text classifier output to the artifact dataframe based on rule defined in config
    """
    for index, row in artifact_df.iterrows():
        summary = row["Description"]
        #print("--------------")
        #print(summary)
        classifier_journey_output = classify_text(summary, config)
        print(classifier_journey_output)
        # DataFrame.set_value is deprecated; .at sets a single cell by label
        artifact_df.at[index, output_column_name] = classifier_journey_output
    return artifact_df 
           
def add_keywords_entities(artifact_df, classify_text_column_name, output_column_name):
    """ Add keywords and entities to the artifact dataframe"""
    for index, artifact in artifact_df.iterrows():
        #print("-----------")
        #print(artifact)
        keywords_array = []
        for row in artifact[classify_text_column_name]['keywords']:
            if not row['text'] in keywords_array:
                keywords_array.append(row['text'])
                
        for entities in artifact[classify_text_column_name]['entities']:
            if not entities['text'] in keywords_array:
                keywords_array.append(entities['text'])
            if not entities['type'] in keywords_array:
                keywords_array.append(entities['type'])
        # DataFrame.set_value is deprecated; .at sets a single cell by label
        artifact_df.at[index, output_column_name] = keywords_array
        print(keywords_array)
    return artifact_df 

def populate_text_similarity_score(artifact_df1, artifact_df2, keywords_column_name, output_column_name):
    """ Populate text similarity score to the artifact dataframes
    """
    for index1, artifact1 in artifact_df1.iterrows():
        matches = []
        top_matches = []
        for index2, artifact2 in artifact_df2.iterrows():
            cosine_score = compute_text_similarity(
                artifact1['Description'], 
                artifact2['Description'], 
                artifact1['Keywords'], 
                artifact2['Keywords'])
            # Append the finished record; looking matches up by index2 only works for a 0..n-1 index
            matches.append({'ID': artifact2['ID'], 
                            'cosine_score': cosine_score, 
                            'SubjectID':artifact1['ID']})
       
        sorted_obj = sorted(matches, key=lambda x : x['cosine_score'], reverse=True)
    
        # Matches at or below this cosine threshold are dropped; adjust 0.4 based on the output
        for obj in sorted_obj:
            if obj['cosine_score'] > 0.4:
                top_matches.append(obj)
               
        # DataFrame.set_value is deprecated; .at sets a single cell by label
        artifact_df1.at[index1, output_column_name] = top_matches
    return artifact_df1
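
The ranking and cut-off step can be tried in isolation with plain dictionaries. This is a minimal sketch: `top_matches` as a function and its `threshold` parameter are hypothetical, the 0.4 default is the value used in the function above, and the `U8` entry is made up to show the boundary case.

```python
def top_matches(matches, threshold=0.4):
    """Sort matches by descending cosine score and keep those strictly above threshold."""
    ranked = sorted(matches, key=lambda match: match['cosine_score'], reverse=True)
    return [match for match in ranked if match['cosine_score'] > threshold]

sample = [
    {'ID': 'U2', 'SubjectID': 'R03', 'cosine_score': 0.5},
    {'ID': 'U8', 'SubjectID': 'R03', 'cosine_score': 0.4},  # hypothetical boundary case
    {'ID': 'U4', 'SubjectID': 'R03', 'cosine_score': 0.82915619758885},
]
print([match['ID'] for match in top_matches(sample)])  # ['U4', 'U2']
```

Because the comparison is strict, a score of exactly 0.4 is excluded; raising the threshold shortens each match list.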

## 6.3 Process flow

**Prepare data**
* Load artifacts from object storage and create pandas dataframes
* Prepare the pandas dataframes. Add additional columns required for further processing.

In [31]:
load_artifacts()
prepare_artifact_dataframes()
requirements_df.head()

Unnamed: 0,ID,As a <type of user>,I want to <perform some task>,so that I can <achieve some goal>,ClassifiedText,Keywords,DomainMatchScore
0,R01,Customer,deposit check,I want to increase my bank balance,,,
1,R02,Customer,withdraw cash from an ATM,I don't have to wait in line at the Bank,,,
2,R03,Customer,want to transfer money from one account to ano...,I don't need to pay the amount in person,,,
3,R04,Customer,pay my utility bills online,I don't need to write checks or use postal ser...,,,
4,R05,Customer,apply for a loan,purchase a car,,,


**Run Text Classification on data**
* Add the text classification output to the artifact dataframes

In [33]:
output_column_name = "ClassifiedText"
requirements_df = mod_req_text_classifier_output(requirements_df, config, output_column_name)
#requirements_df = add_text_classifier_output(requirements_df,config, output_column_name)
#domain_df = add_text_classifier_output(domain_df,config, output_column_name)
#dataelements_df = add_text_classifier_output(dataelements_df,config, output_column_name)


--------------
deposit check
--------------
withdraw cash from an ATM
--------------
want to transfer money from one account to another
--------------
pay my utility bills online
--------------
apply for a loan
--------------
request for check books
--------------
restock sufficient cash in ATM machines


**Populate keywords and entities**
* Add the keywords and entities extracted from the unstructured text to the artifact dataframes

In [20]:
classify_text_column_name = "ClassifiedText"
output_column_name = "Keywords"
requirements_df = add_keywords_entities(requirements_df, classify_text_column_name, output_column_name)
domain_df = add_keywords_entities(domain_df, classify_text_column_name, output_column_name)
dataelements_df = add_keywords_entities(dataelements_df, classify_text_column_name, output_column_name)

['', ' A customer', 'NP', ' cheque', ' the balance', ' like', 'VERB', ' deposit', ' increase']
['', ' cash', 'NP', ' line', ' Customer', 'NAME', ' ATM', ' withdraw', 'VERB', ' wait']
['', ' money', 'NP', ' account', ' no physical transfer', ' like', 'VERB', ' transfer']
['', ' a customer', 'NP', ' the bank', ' order', ' mail', ' postal service', ' checksto', 'VERB', ' write', ' use']
['', ' a loan', 'NP', ' bank', ' a car', ' Customer', 'NAME', ' apply', 'VERB', ' purchase']
['', ' sufficient cash', 'NP', ' Banker', 'NAME', ' ATM', ' be', 'VERB', ' restock']
['', ' the credit', 'NP', ' history', ' the bank', ' balance', ' banker', ' the loan', ' Banker', 'NAME', ' be', 'VERB', ' review', ' help', ' decide', ' approve', ' reject']
['', ' Function', 'NP', ' money', ' balance', ' cheque', ' The input', ' debit card number', ' number', ' Verification', 'NAME', ' PIN', ' Customer', ' ATM', ' be', 'VERB']
['', ' This use', 'NP', ' case', ' the current account', ' balance', ' transferingmoney



**Correlate keywords between artifacts**
* Add the text similarity score of associated artifacts to the dataframe

In [33]:
requirements_df.head()
#domain_df.head()

Unnamed: 0,ID,User,Description,ClassifiedText,Keywords,DomainMatchScore
0,R01,Customer,A customer would like to deposit cheque so tha...,"{'keywords': [{'text': '', 'relevance': 0}], '...","[, A customer, NP, cheque, the balance, li...","[{'ID': 'U1', 'cosine_score': 0.70710678118654..."
1,R02,Customer,I am a Customer and I want to withdraw cash fr...,"{'keywords': [{'text': '', 'relevance': 0}], '...","[, cash, NP, line, Customer, NAME, ATM, w...","[{'ID': 'U1', 'cosine_score': 0.67082039324993..."
2,R03,Customer,Customers would like to transfer money from on...,"{'keywords': [{'text': '', 'relevance': 0}], '...","[, money, NP, account, no physical transfer...","[{'ID': 'U4', 'cosine_score': 0.82915619758885..."
3,R04,Customer,"I am a customer at the bank, and I want to or...","{'keywords': [{'text': '', 'relevance': 0}], '...","[, a customer, NP, the bank, order, mail, ...","[{'ID': 'U1', 'cosine_score': 0.50917507721731..."
4,R05,Customer,Customer need to apply for a loan from bank s...,"{'keywords': [{'text': '', 'relevance': 0}], '...","[, a loan, NP, bank, a car, Customer, NAME...","[{'ID': 'U7', 'cosine_score': 0.66666666666666..."


In [22]:
keywords_column_name = "Keywords"
output_column_name = "DomainMatchScore"
requirements_df = populate_text_similarity_score(requirements_df, domain_df, keywords_column_name, output_column_name)

output_column_name = "DataElementsMatchScore"
domain_df = populate_text_similarity_score(domain_df, dataelements_df, keywords_column_name, output_column_name)

output_column_name = "RequirementsMatchScore"
dataelements_df = populate_text_similarity_score(dataelements_df, requirements_df, keywords_column_name, output_column_name)
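
The match lists shown in this section hold one `{'ID', 'SubjectID', 'cosine_score'}` entry per candidate, sorted best-first. The sketch below shows one way such a list could be produced. The names `set_cosine` and `rank_matches`, and the dictionary-of-keyword-lists input, are assumptions; this is not the notebook's `populate_text_similarity_score`, whose definition lies outside this section.

```python
from math import sqrt


def set_cosine(a, b):
    """Cosine similarity of two keyword sets treated as binary vectors."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def rank_matches(subject_id, subject_keywords, candidates):
    """Score one subject against every candidate and sort best-first.

    `candidates` maps a candidate ID to its keyword list (hypothetical shape).
    """
    matches = [
        {"ID": cid, "SubjectID": subject_id,
         "cosine_score": set_cosine(subject_keywords, kws)}
        for cid, kws in candidates.items()
    ]
    return sorted(matches, key=lambda m: m["cosine_score"], reverse=True)


ranked = rank_matches(
    "R03",
    ["money", "account", "transfer"],
    {"U1": ["cash", "account"], "U4": ["money", "transfer", "account", "balance"]},
)
for match in ranked:
    print(match)
```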



In [37]:
#print(requirements_df)
#domain_df.head()
#dataelements_df.head()
#requirements_df.head()
requirements_df.at[2, 'DomainMatchScore']



[{'ID': 'U4', 'SubjectID': 'R03', 'cosine_score': 0.82915619758885},
 {'ID': 'U1', 'SubjectID': 'R03', 'cosine_score': 0.6614378277661476},
 {'ID': 'U5', 'SubjectID': 'R03', 'cosine_score': 0.6614378277661476},
 {'ID': 'U3', 'SubjectID': 'R03', 'cosine_score': 0.6123724356957946},
 {'ID': 'U7', 'SubjectID': 'R03', 'cosine_score': 0.6123724356957946},
 {'ID': 'U9', 'SubjectID': 'R03', 'cosine_score': 0.5590169943749475},
 {'ID': 'U2', 'SubjectID': 'R03', 'cosine_score': 0.5},
 {'ID': 'U6', 'SubjectID': 'R03', 'cosine_score': 0.4330127018922194},
 {'ID': 'U8', 'SubjectID': 'R03', 'cosine_score': 0.4330127018922194}]

In [38]:
domain_df.at[3, 'DataElementsMatchScore']



[{'ID': 'DE5', 'SubjectID': 'U4', 'cosine_score': 0.9682458365518541},
 {'ID': 'DE6', 'SubjectID': 'U4', 'cosine_score': 0.9013878188659974},
 {'ID': 'DE8', 'SubjectID': 'U4', 'cosine_score': 0.8660254037844386},
 {'ID': 'DE2', 'SubjectID': 'U4', 'cosine_score': 0.8017837257372732},
 {'ID': 'DE11', 'SubjectID': 'U4', 'cosine_score': 0.7637626158259734},
 {'ID': 'DE9', 'SubjectID': 'U4', 'cosine_score': 0.7337993857053429},
 {'ID': 'DE1', 'SubjectID': 'U4', 'cosine_score': 0.7223151185146154},
 {'ID': 'DE3', 'SubjectID': 'U4', 'cosine_score': 0.674199862463242},
 {'ID': 'DE12', 'SubjectID': 'U4', 'cosine_score': 0.674199862463242},
 {'ID': 'DE4', 'SubjectID': 'U4', 'cosine_score': 0.6666666666666666},
 {'ID': 'DE7', 'SubjectID': 'U4', 'cosine_score': 0.6666666666666666},
 {'ID': 'DE10', 'SubjectID': 'U4', 'cosine_score': 0.5423261445466404},
 {'ID': 'DE16', 'SubjectID': 'U4', 'cosine_score': 0.5},
 {'ID': 'DE14', 'SubjectID': 'U4', 'cosine_score': 0.447213595499958},
 {'ID': 'DE13', 'Su

In [26]:
#dataelements_df.at[1, 'RequirementsMatchScore']

# Next steps

* Populate the correct wording in the 3 sheets to provide more accurate and insightful data
* Use OrientDB to graph the cosine similarity results
* Use Node-RED to start a UI dashboard
* Optimize the code to reduce memory usage
* Move components such as the notebook and OrientDB to AWS EC2

** Utility functions to store entities and relations in Orient DB **

In [81]:
def store_requirements(requirements_df):
    """ Store requirements into the database
    """
    for index, row in requirements_df.iterrows():
        attrs = {}
        reqid = row["ID"]
        attrs["Description"] = row["Description"].replace('\n', ' ').replace('\r', '')
        attrs["ID"] = reqid
        attrs["User"]= str(row["User"])
        create_record(requirement_classname, reqid, attrs)    
        
def store_domain(domain_df):  
    """ Store domain which has user functions into the database
    """
    for index, row in domain_df.iterrows():
        attrs = {}
        tcaseid = row["ID"]
        attrs["Description"] = row["Description"].replace('\n', ' ').replace('\r', '')
        attrs["ID"] = tcaseid
        attrs["User Fuction"] = str(row["Use CASE"])
        create_record(domain_classname, tcaseid, attrs)
        
def store_dataelements(dataelements_df):
    """ Store data elements or attributes into the database
    """
    for index, row in dataelements_df.iterrows():
        attrs = {}
        defid = row["ID"]
        attrs["Description"] = row["Description"].replace('\n', ' ').replace('\r', '')
        attrs["ID"] = defid
        attrs["Short form"] = str(row["Short"])
        create_record(dataelement_classname, defid, attrs)
        
def store_testcases_requirement_mapping(testcases_df):
    """ Store the related requirements for testcases into the database
    """
    for index, row in testcases_df.iterrows():
        tcaseid = row["ID"]
        requirements = row["RequirementsMatchScore"]
        for requirement in requirements:
            reqid = requirement["ID"]
            attributes = {}
            attributes['score'] = requirement['cosine_score']
            create_testcase_requirement_edge(tcaseid,reqid, attributes)
            
def store_defect_testcase_mapping(defects_df):
    """ Store the related testcases for the defects into the database
    """
    for index, row in defects_df.iterrows():
        defid = row["ID"]
        testcases = row["TestCasesMatchScore"]
        for testcase in testcases:
            testcaseid = testcase["ID"]
            attributes = {}
            attributes['score'] = testcase["cosine_score"]
            create_defect_testcase_edge(defid,testcaseid, attributes)
            
def store_requirement_defect_mapping(requirements_df):
    """ Store the related defects for the requirements in the database
    """
    for index, row in requirements_df.iterrows():
        reqid = row["ID"]
        defects = row["DefectsMatchScore"]
        for defect in defects:
            defectid = defect["ID"]
            cosine_score =  defect["cosine_score"]
            attributes = {}
            attributes['score'] = cosine_score
            create_requirement_defect_edge(reqid,defectid, attributes)

** Store artifacts data and relations into OrientDB **
* Drop and create a database
* Create classes for each category of artifact
* Store artifact data
* Store artifact relations data

In [82]:
#drop_database("SoftwareDesign")
#create_database("SoftwareDesignAI", "", "admin")

requirement_classname = "Requirements"
domain_classname = "Domain"
dataelement_classname = "DataElements"

#create_class(requirement_classname)
#create_class(domain_classname)
#create_class(dataelement_classname)

#store_requirements(requirements_df)
store_dataelements(dataelements_df)
store_domain(domain_df)

#store_testcases_requirement_mapping(testcases_df)
#store_defect_testcase_mapping(defects_df)
#store_requirement_defect_mapping(requirements_df)

PyOrientSQLParsingException: com.orientechnologies.orient.core.sql.OCommandSQLParsingException - Error parsing query:
insert into DataElements set Description = 'Customer Name will be used as part of authenting user and to greet customer',ID = 'DE1',Short form = 'Cus_Nme'
                                                                                                                                             ^
Encountered " <IDENTIFIER> "form "" at line 1, column 139.
Was expecting one of:
    "=" ...
    "=" ...
    
	DB name="SoftwareDesignAI"
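
The parser stops at the word `form`, which suggests that the space in the property name `Short form` is what breaks the `insert` statement; the `User Fuction` key in `store_domain` would presumably fail the same way. One possible workaround, sketched below, is to normalize attribute names before they reach `create_record`. The helper `safe_property_name` is hypothetical and not part of the notebook.

```python
import re


def safe_property_name(name):
    """Turn an arbitrary label into an identifier-style property name."""
    # Replace every run of non-word characters (spaces, punctuation) with "_"
    cleaned = re.sub(r"\W+", "_", name.strip())
    return cleaned.strip("_")


attrs = {
    "Description": "Customer Name will be used as part of authenting user and to greet customer",
    "ID": "DE1",
    "Short form": "Cus_Nme",
}
safe_attrs = {safe_property_name(key): value for key, value in attrs.items()}
print(list(safe_attrs))
```

With names such as `Short_form`, the generated statement no longer contains a bare space inside a property name.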

# This will be used for line by line checking of code for correlation

In [253]:


index1 = 0
index2 = 1
matches.append({'ID': "U2", 
                'cosine_score': 0, 
                'SubjectID':"R01"})
cosine_score = 0.7
matches[index2]["cosine_score"] = cosine_score



In [254]:
print(matches)

[{'ID': 'U1', 'cosine_score': 0.4, 'SubjectID': 'R01'}, {'ID': 'U2', 'cosine_score': 0.7, 'SubjectID': 'R01'}]


In [245]:
sorted_obj = sorted(matches, key=lambda x : x['cosine_score'], reverse=True)
print(sorted_obj)

[{'ID': 'U3', 'cosine_score': 0.9, 'SubjectID': 'R03'}, {'ID': 'U4', 'cosine_score': 0.9, 'SubjectID': 'R04'}, {'ID': 'U1', 'cosine_score': 0.8, 'SubjectID': 'R01'}, {'ID': 'U2', 'cosine_score': 0.7, 'SubjectID': 'R02'}, {'ID': 'U5', 'cosine_score': 0.4, 'SubjectID': 'R05'}]


In [246]:
for obj in sorted_obj:
    if obj['cosine_score'] > 0.4:
        top_matches.append(obj)

In [259]:
stopWords = get_stop_words('english')
# List of words to be ignored for text similarity
stopWords.extend(["The","This","That",".","!","?"])
print(top_matches)

In [287]:
stopWords = get_stop_words('english')
# List of words to be ignored for text similarity
stopWords.extend(["The","This","That",".","!","?"])

# this depicts cosine similarity

text1 = "A customer would like to deposit cheque at the ATM so that he can increase the balance."
text1tags = ['', ' A customer', 'NP', ' cheque', ' the balance', ' like', 'VERB', ' deposit', ' increase', ' ATM']

#text2 = "This Use case deals with Verification of PIN entered by Customer at ATM. One has to validate the pin using the customer number and the debit card number"
#text2tags = ['', ' case', 'NP', ' the pin', ' the customer', ' number', ' the debit', ' card', ' Use', 'NAME', ' Verification', ' PIN', ' Customer', ' ATM', ' validate', 'VERB']

#text1 = "I am a Customer and I want to withdraw cash from an ATM so that I don’t have to wait in line"
#text1tags = ['', ' cash', 'NP', ' line', ' Customer', 'NAME', ' ATM', ' withdraw', 'VERB', ' wait']

#text2 = "I am a Customer and I want to withdraw cash from an ATM so that I don’t have to wait in line"
#text2tags = ['', ' cash', 'NP', ' line', ' Customer', 'NAME', ' ATM', ' withdraw', 'VERB', ' wait']

text2 = "Customer Name will be used as part of authenting user and to greet customer"
text2tags = ['', ' part', 'NP', ' user', ' customer', ' Customer Name', 'NAME', ' be', 'VERB', ' greet']


stemmer = nltk.stem.porter.PorterStemmer()

sentences_text1 = split_sentences(text1)
#print(sentences_text1)
sentences_text2 = split_sentences(text2)
tokens_text1 = []
tokens_text2 = []

for sentence in sentences_text1:
    tokenstemp = split_into_tokens(sentence.lower())
    #print(tokenstemp)
    tokens_text1.extend(tokenstemp)
#print(tokens_text1)

for sentence in sentences_text2:
    tokenstemp = split_into_tokens(sentence.lower())
    tokens_text2.extend(tokenstemp)

if len(text1tags) > 0:
    tokens_text1.extend(text1tags)
if len(text2tags) > 0:
    tokens_text2.extend(text2tags)
#print(tokens_text1)
        
tokens1Filtered = [stemmer.stem(x) for x in tokens_text1 if x not in stopWords]
#print(tokens1Filtered)    
tokens2Filtered = [stemmer.stem(x) for x in tokens_text2 if x not in stopWords]
    
    #  remove duplicate tokens
tokens1Filtered = set(tokens1Filtered)
tokens2Filtered = set(tokens2Filtered)
#print(tokens1Filtered)

tokensList=[]

text1vector = []
text2vector = []
    
if len(tokens1Filtered) < len(tokens2Filtered):
    tokensList = tokens1Filtered
else:
    tokensList = tokens2Filtered

#print(tokensList)
for token in tokensList:
    if token in tokens1Filtered:
        text1vector.append(1)
    else:
        text1vector.append(0)
    if token in tokens2Filtered:
        text2vector.append(1)
    else:
        text2vector.append(0)         
#print(text1vector)  
print(text2vector)
cosine_similarity = 1-cosine_distance(text1vector,text2vector) 

print("cosine similarity :")
#print(cosine_similarity)

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
cosine similarity :


In [40]:
requirements_df.head()
#domain_df.head()

Unnamed: 0,User Story ID,User,Description,ClassifiedText,Keywords,DataelementMatchScore
0,R01,Customer,A customer would like to deposit cheque so tha...,"{'keywords': [{'text': '', 'relevance': 0}], '...","[, A customer, NP, cheque, the balance, li...",
1,R02,Customer,I am a Customer and I want to withdraw cash fr...,"{'keywords': [{'text': '', 'relevance': 0}], '...","[, cash, NP, line, Customer, NAME, ATM, w...",
2,R03,Customer,Customers would like to transfer money from on...,"{'keywords': [{'text': '', 'relevance': 0}], '...","[, money, NP, account, no physical transfer...",
3,R04,Customer,My name is Ryan and I am a customer at the ban...,"{'keywords': [{'text': '', 'relevance': 0}], '...","[, name, NP, a customer, the bank, order, ...",
4,R05,Customer,Customer will need to have a feature to apply ...,"{'keywords': [{'text': '', 'relevance': 0}], '...","[, a feature, NP, a loan, bank, a car, Cu...",


In [15]:
import json

with open('sample.json') as f:
    data = json.load(f)
    
print(data)

{'usage': {'text_characters': 502, 'features': 2, 'text_units': 1}, 'keywords': [{'text': 'A', 'relevance': 0}], 'language': 'en', 'entities': [{'type': 'Person', 'text': 'Stephen Hawking', 'relevance': 0.846941, 'count': 5}]}


In [18]:
def augument_SpResponse(responsejson,updateType,text,tag):
    """ Update the NLU response JSON with augmented classifications.
    """
    if(updateType == 'keyword'):
        if not any(d.get('text', None) == text for d in responsejson['keywords']):
            responsejson['keywords'].append({"text":text,"relevance":0.5})
    else:
        if not any(d.get('text', None) == text for d in responsejson['entities']):
            responsejson['entities'].append({"type":tag,"text":text,"relevance":0.5,"count":1})        
    return responsejson

In [21]:

responsejson = augument_SpResponse(data,'entities','Ryan','Person')

In [24]:
#print(responsejson)

{'usage': {'text_characters': 502, 'features': 2, 'text_units': 1}, 'keywords': [{'text': 'A', 'relevance': 0}], 'language': 'en', 'entities': [{'type': 'Person', 'text': 'Stephen Hawking', 'relevance': 0.846941, 'count': 5}, {'type': 'Person', 'text': 'Ryan', 'relevance': 0.5, 'count': 1}]}


In [11]:
x ="select*"


print(x)

select*


In [None]:
data = json.loads(nlu_string)

# The following code for uploading and downloading works


This uploads the file UserStroies-V0.1.xlsx to S3.
Delete the existing file first if you want to see it appear in S3.
When the PNG is downloaded, it is saved as "MYLOCALIMAGE.PNG" in the directory where you started Jupyter.
Credentials for S3 are provided above in section 2.1.


In [105]:
import boto3
from botocore.client import Config

import pandas as pd

# Fill in your own credentials locally; leave them empty in shared notebooks
ACCESS_KEY_ID = ''
ACCESS_SECRET_KEY = ''
BUCKET_NAME = 'software-testing-pyscript'
KEY = 'Banking-BRD.xlsx' # replace with your object key

s3 = boto3.client(
    's3',
    aws_access_key_id=ACCESS_KEY_ID,
    aws_secret_access_key=ACCESS_SECRET_KEY,
    config=Config(signature_version='s3v4')
)

# The upload is disabled here; uncomment both lines to push the local file to the bucket
#with open('Banking-BRD.xlsx', 'rb') as data:
#    s3.put_object(Bucket=BUCKET_NAME, Key=KEY, Body=data)

print("Done uploading")

# Download the object to a local file (the test directory must already exist)
s3.download_file(BUCKET_NAME,KEY,".//test/MYLOCALEXCELBRD_mod.xlsx")

print("Done downloading")

Done uploading
Done downloading
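
The cell above keeps the access keys as literals in the notebook. A minimal sketch of an alternative is to read them from the standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables and fail early when they are missing. The helper `s3_client_kwargs` and the placeholder values are assumptions for illustration.

```python
import os


def s3_client_kwargs(environ=os.environ):
    """Collect boto3 client arguments from environment variables.

    Raises a clear error instead of sending empty credentials to AWS.
    """
    key_id = environ.get("AWS_ACCESS_KEY_ID", "")
    secret = environ.get("AWS_SECRET_ACCESS_KEY", "")
    if not key_id or not secret:
        raise RuntimeError("Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY first")
    return {"aws_access_key_id": key_id, "aws_secret_access_key": secret}


# Example with a stand-in environment (placeholder values, not real credentials)
kwargs = s3_client_kwargs({"AWS_ACCESS_KEY_ID": "EXAMPLE_ID",
                           "AWS_SECRET_ACCESS_KEY": "EXAMPLE_SECRET"})
print(sorted(kwargs))
# In the notebook these could be passed on as: boto3.client('s3', **kwargs)
```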


In [106]:
df = pd.read_excel("MYLOCALEXCELBRD_mod.xlsx","Banking-Requirements")
df.head()

Unnamed: 0,User Story ID,User,Description
0,R01,Customer,A customer would like to deposit cheque so tha...
1,R02,Customer,I am a Customer and I want to withdraw cash fr...
2,R03,Customer,Customers would like to transfer money from on...
3,R04,Customer,My name is Ryan and I am a customer at the ban...
4,R05,Customer,Customer will need to have a feature to apply ...


In [58]:
xl = pd.ExcelFile("MYLOCALEXCELBRD_mod.xlsx")
xl.sheet_names


['Banking-Requirements']

In [64]:
df = xl.parse("Banking-Requirements")
df.iterrows()




<generator object DataFrame.iterrows at 0x1a133226d0>

In [68]:
config = open("sample_config.txt")

config.read()

'{\n  "configuration": {\n    "classification": {\n      "stages": [\n        {\n          "name": "Base Tagging",\n          "steps": [\n            {\n              "type": "keywords",\n              "keywords": [\n                {\n                  "tag": "chart",\n                  "text": "bar"\n                },\n                { \n                  "tag": "chart",\n                  "text": "line"\n                },\n                {\n                  "tag": "chart",\n                  "text": "pie"\n                },\n                {\n                  "tag": "UI",\n                  "text": "visualization"\n                },\n                {\n                  "tag": "edition",\n                  "text": "editions"\n                },\n                {\n                  "tag": "country",\n                  "text": "countries"\n                },\n                {\n                  "tag": "medal",\n                  "text": "medals"\n                },\n       
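
The raw string above is easier to work with once it is parsed. The sketch below parses a shortened copy of the visible part of the configuration and builds a lookup from keyword text to tag. Only the structure shown in the truncated output is used; the helper name `keyword_tags` is an assumption, and the real file may contain other step types that this sketch ignores.

```python
import json

# Shortened copy of the visible part of sample_config.txt
config_text = """
{
  "configuration": {
    "classification": {
      "stages": [
        {
          "name": "Base Tagging",
          "steps": [
            {
              "type": "keywords",
              "keywords": [
                {"tag": "chart", "text": "bar"},
                {"tag": "chart", "text": "line"},
                {"tag": "UI", "text": "visualization"}
              ]
            }
          ]
        }
      ]
    }
  }
}
"""


def keyword_tags(config):
    """Map each configured keyword text to its tag."""
    lookup = {}
    for stage in config["configuration"]["classification"]["stages"]:
        for step in stage.get("steps", []):
            # Only "keywords" steps carry tag/text pairs
            if step.get("type") == "keywords":
                for entry in step.get("keywords", []):
                    lookup[entry["text"]] = entry["tag"]
    return lookup


tags = keyword_tags(json.loads(config_text))
print(tags)
```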

In [60]:

from xlrd import open_workbook
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
class BRD:
    def __init__(self, usrStryID, asa, action, goal):
        self.usrStryID = usrStryID
        self.asa = asa
        self.action = action
        self.goal = goal
        
# Reads the spreadsheet from the file location
#wb = open_workbook("MYLOCALEXCEL.xlsx")

In [61]:
# Prepare the list of stop words to ignore, such as "the", "can", and "am"
!pip install stop-words
from stop_words import get_stop_words

stopWords = get_stop_words('english')
# List of words to be ignored for text similarity
stopWords.extend(["The","This","That",".","!","?"])
#stop_words = set(stopwords.words('english'))  


boto3 1.7.11 has requirement botocore<1.11.0,>=1.10.11, but you'll have botocore 1.10.10 which is incompatible.


In [62]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/swaroopmishra/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/swaroopmishra/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [63]:
# Note: xlrd 2.0 and later read only .xls files; .xlsx needs an older xlrd or another reader
wb = open_workbook("MYLOCALEXCELBRD_mod.xlsx")
# For each sheet in the spreadsheet
for sheet in wb.sheets():
        numberRows = sheet.nrows
        numberCols = sheet.ncols
        
        items = []
        rows = []
        # For each row in the sheet (row 0 is skipped)
        for row in range(1, numberRows):
            values =  []
            filtered_Sentence = []
            verbs = []
            proNouns = []
            # For each column in the row (column 0 is skipped)
            for col in range(1, numberCols):
                value = (sheet.cell(row, col).value)
                values.append(value)
                # Split the cell text into sentences and words
                print(sent_tokenize(value))
                print("*********************************************")
                words = word_tokenize(value)
                # Tag each word with its part of speech so verbs and proper nouns can be picked out
                tagged = nltk.pos_tag(words)
                print ("Tagged:",tagged)
                # Chunk grammars: runs of verb tags (VB*) and runs of proper-noun tags (NNP*)
                chunkVerb = r"""Chunk: {<VB.?>*} """
                chunkProNoun = r"""Chunk: {<NNP.?>*} """
                
                chunkParser = nltk.RegexpParser(chunkVerb)
                chunked = chunkParser.parse(tagged)
                print ("---->",chunked)
                verbs.append(chunked)
                print ("Grouped VERBS ::", verbs)
                chunkParser = nltk.RegexpParser(chunkProNoun)
                chunked = chunkParser.parse(tagged)
                print("===>",chunked)
                proNouns.append(chunked)
                print ("Chunked Verbs::",verbs)
                print ("Chunked ProNOuns::",proNouns)
                
                #chunked.draw()
                
               
                for w in words:
                    if w not in stopWords:
                        filtered_Sentence.append(w)
            print("Filtered Sentence-->",filtered_Sentence)
                

        break
        

['Customer']
*********************************************
Tagged: [('Customer', 'NN')]
----> (S Customer/NN)
Grouped VERBS :: [Tree('S', [('Customer', 'NN')])]
===> (S Customer/NN)
Chunked Verbs:: [Tree('S', [('Customer', 'NN')])]
Chunked ProNOuns:: [Tree('S', [('Customer', 'NN')])]
['A customer would like to deposit cheque so that he can increase the balance']
*********************************************
Tagged: [('A', 'DT'), ('customer', 'NN'), ('would', 'MD'), ('like', 'VB'), ('to', 'TO'), ('deposit', 'VB'), ('cheque', 'NN'), ('so', 'RB'), ('that', 'IN'), ('he', 'PRP'), ('can', 'MD'), ('increase', 'VB'), ('the', 'DT'), ('balance', 'NN')]
----> (S
  A/DT
  customer/NN
  would/MD
  (Chunk like/VB)
  to/TO
  (Chunk deposit/VB)
  cheque/NN
  so/RB
  that/IN
  he/PRP
  can/MD
  (Chunk increase/VB)
  the/DT
  balance/NN)
Grouped VERBS :: [Tree('S', [('Customer', 'NN')]), Tree('S', [('A', 'DT'), ('customer', 'NN'), ('would', 'MD'), Tree('Chunk', [('like', 'VB')]), ('to', 'TO'), Tree('Chu
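
The chunk trees above carry the whole sentence, with verbs wrapped in `Chunk` nodes. When only the words themselves are needed, filtering the tagged pairs by tag prefix may be enough. The sketch below uses the tagged output printed above as its input; the helper `words_with_tag_prefix` is hypothetical and does not need NLTK.

```python
# Tagged pairs copied from the output above
tagged = [('A', 'DT'), ('customer', 'NN'), ('would', 'MD'), ('like', 'VB'),
          ('to', 'TO'), ('deposit', 'VB'), ('cheque', 'NN'), ('so', 'RB'),
          ('that', 'IN'), ('he', 'PRP'), ('can', 'MD'), ('increase', 'VB'),
          ('the', 'DT'), ('balance', 'NN')]


def words_with_tag_prefix(tagged_words, prefix):
    """Return the words whose part-of-speech tag starts with `prefix`."""
    return [word for word, tag in tagged_words if tag.startswith(prefix)]


verbs = words_with_tag_prefix(tagged, "VB")
nouns = words_with_tag_prefix(tagged, "NN")
print(verbs)
print(nouns)
```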

## 6.2 Utility functions for Engineering Insights

## 6.3 Process flow

** Prepare data **
* Load artifacts from object storage and create pandas dataframes
* Prepare the pandas dataframes. Add additional columns required for further processing.

In [11]:
load_artifacts()
#prepare_artifact_dataframes()

Requirements.xlsx


FileNotFoundError: [Errno 2] No such file or directory: './/test/Requirements.xlsx'
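
The error above means the expected file was not in the `test` directory when `load_artifacts()` ran. A small check before loading can report this more directly. The sketch below is an assumption about how such a guard might look; `missing_artifacts` is a hypothetical helper, and only `Requirements.xlsx` is known from the output above.

```python
from pathlib import Path


def missing_artifacts(directory, filenames):
    """Return the expected files that are not present in `directory`."""
    base = Path(directory)
    return [name for name in filenames if not (base / name).is_file()]


# Hypothetical file list; only Requirements.xlsx appears in the output above
missing = missing_artifacts("./test", ["Requirements.xlsx"])
if missing:
    print("Missing artifact files:", missing)
```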

** Run Spacy Text Classifier on data **
* Add the text classification output to the artifact dataframes

In [None]:
#output_column_name = "ClassifiedText"
#defects_df = add_text_classifier_output(defects_df,config, output_column_name)
#testcases_df = add_text_classifier_output(testcases_df,config, output_column_name)
#requirements_df = add_text_classifier_output(requirements_df,config, output_column_name)

** Populate keywords and entities **
* Add the keywords and entities extracted from the unstructured text to the artifact dataframes

** Correlate keywords between artifacts **
* Add the text similarity score of associated artifacts to the dataframe

** Utility functions to store entities and relations in Orient DB **

# 7. Transform results for Visualization

# 8. Expose integration point with a websocket client

## 8.1 Start websocket client

In [None]:
start_websocket_listener()