In [1]:
import lxml.etree
import os

### 0. Preparing Data

This notebook uses CAPEC v2.11, which can be downloaded from [this link](http://capec.mitre.org/data/archive/capec_v2.11.zip). The cell below loads the CAPEC v2.11 XML file. If a newer version of the raw XML file is released, update the file name in the following code. If the position of the attack pattern table within the XML changes, update the function <b>extract_target_field_elements</b> in Section 2.1.

In [2]:
capec_xml_file='capec_v2.11.xml'

### 1. Introduction 

The purpose of this notebook is to build the field parsers and extract the contents of various fields in the CAPEC v2.11 XML file so that the field content can be directly analyzed and stored in a database. Guided by the [CAPEC Introduction notebook](https://github.com/sailuh/perceive/blob/master/Notebooks/CAPEC/Introduction/capec_introduction.ipynb), this notebook focuses on the detailed structure of the attack pattern table and on how the parser functions work within it.

To preserve semantic information and avoid losing detail when transforming the website and XML representations into the output file, we build a three-step pipeline that modularizes the parser functions for fields with different semantic formats. The pipeline consists of the following steps: locating the XML field nodes, parsing the XML field nodes, and exporting the data structure to the output file based on the semantic formats in Section 4 of the CAPEC Introduction notebook. More details are given in Section 2.

### 2. Parser Architecture

The overall parser architecture consists of three procedures: 1) extracting the nodes with the target field tag, 2) parsing the target field nodes into an in-memory representation, and 3) exporting the data structure to the output file.

Section 2.1 explains how to search for XML field nodes with the target field tag. Whichever field is being parsed, the first step is to use XPath to locate all XML field nodes with the tag of the field we intend to parse. The function in Section 2.1 has been tested for all fields and can therefore locate XML nodes for any given field name, except Summary, which sits under the Description node. To parse Summary, change the XPath accordingly.

Section 2.2 explains how to parse the target field and extract its content into an in-memory representation. Because fields have different nested structures in the raw XML file and the content to parse varies from field to field, the worst case is one parser function per field. However, as Section 4 of the CAPEC Introduction notebook shows, certain fields share the same format on the website, such as a table or a bullet list, so ideally only 4 or 5 functions are needed to represent the data in memory.

Section 3 addresses how to export the data representations from Section 2.2. The number of export functions in Section 3 should equal the number of data structures in Section 2.2.
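
The three procedures can be sketched end to end on a toy document. This is a minimal illustration under stated assumptions, not the notebook's actual parser: the XML snippet, the tag names in it, and the helper names `locate_nodes`, `parse_bullet_list`, and `export_csv` are all hypothetical, and the standard-library `xml.etree.ElementTree` stands in for `lxml.etree` (the calls used here exist in both).

```python
import csv
import io
import xml.etree.ElementTree as ET  # stdlib stand-in; lxml.etree offers the same calls used here

# Toy catalog: tag names and contents are illustrative, not taken from the CAPEC file.
TOY_XML = """
<Catalog>
  <Attack_Patterns>
    <Attack_Pattern ID="1">
      <Prerequisites>
        <Prerequisite>Target accepts user input</Prerequisite>
        <Prerequisite>No input validation</Prerequisite>
      </Prerequisites>
    </Attack_Pattern>
    <Attack_Pattern ID="2">
      <Prerequisites>
        <Prerequisite>Attacker has network access</Prerequisite>
      </Prerequisites>
    </Attack_Pattern>
  </Attack_Patterns>
</Catalog>
"""

def locate_nodes(root, field):
    """Step 1 (hypothetical helper): pair each attack pattern ID with its field node."""
    pairs = []
    for pattern in root.findall('Attack_Patterns/Attack_Pattern'):
        node = pattern.find(field)
        if node is not None:
            pairs.append((pattern.get('ID'), node))
    return pairs

def parse_bullet_list(pairs):
    """Step 2 (hypothetical helper): build {pattern ID: [item text, ...]} in memory."""
    return {pid: [(child.text or '').strip() for child in node] for pid, node in pairs}

def export_csv(parsed):
    """Step 3 (hypothetical helper): write one CSV row per (pattern ID, item)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['capec_id', 'item'])
    for pid, items in parsed.items():
        for item in items:
            writer.writerow([pid, item])
    return buffer.getvalue()

root = ET.fromstring(TOY_XML)
parsed = parse_bullet_list(locate_nodes(root, 'Prerequisites'))
print(export_csv(parsed))
```

Keeping the three steps as separate functions means a field with a different on-site format only needs a new step 2 and step 3 pair, while step 1 stays shared.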

#### 2.1 XML Field Node Location 

This function searches the tree for the field name given as input and returns the XML nodes of that field. The string containing the field name can be found in the histogram in [Section 4](https://github.com/sailuh/perceive/blob/master/Notebooks/CAPEC/Introduction/capec_introduction.ipynb) of the Introduction notebook. As that histogram shows, only certain fields occur frequently enough to be worth parsing.

In addition, because the Summary field is nested under the Description field, change target_field_path when parsing Summary.
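
The difference between the two paths can be seen on a small hand-written document. The snippet below is only a sketch: the XML is a toy, not the CAPEC file, and the standard-library `xml.etree.ElementTree` is used as a stand-in for `lxml.etree`, whose `findall` accepts the same path expressions used here.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

# Toy attack pattern table: one field directly under Attack_Pattern,
# and Summary nested one level deeper under Description.
toy_table = ET.fromstring(
    "<Attack_Patterns>"
    "<Attack_Pattern ID='1'>"
    "<Description><Summary>Short summary text</Summary></Description>"
    "<Typical_Severity>High</Typical_Severity>"
    "</Attack_Pattern>"
    "</Attack_Patterns>"
)

# A direct-child path finds fields that sit directly under Attack_Pattern.
direct = toy_table.findall('Attack_Pattern/Typical_Severity')

# The same kind of path misses Summary, because Summary is not a direct child.
missed = toy_table.findall('Attack_Pattern/Summary')

# Adding the Description step reaches it.
nested = toy_table.findall('Attack_Pattern/Description/Summary')

print(len(direct), len(missed), len(nested))
print(nested[0].text)
```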

In [3]:
def extract_target_field_elements(target_field, capec_xml_file):
    # read xml file and store as the root element in memory
    tree = lxml.etree.parse(capec_xml_file)
    root = tree.getroot()
    
    # Remove namespaces from the XML tags.
    for elem in root.iter():
        # Comments and processing instructions have a non-string tag; skip them
        if not hasattr(elem.tag, 'find'):
            continue
        # Index of the '}' that closes the '{namespace}' prefix, or -1 if the tag has none
        i = elem.tag.find('}')
        if i >= 0:
            # Keep only the local name that follows the '}'
            elem.tag = elem.tag[i+1:]

    # Define the path of the target field: select every element whose tag is the target field
    # and that is a direct child of an Attack_Pattern node ('.' refers to the current node).
    # If the target field is Summary, use the following path instead:
    # target_field_path = 'Attack_Pattern/Description/' + target_field
    target_field_path = 'Attack_Pattern/./' + target_field

    # Extract the attack pattern table, which this code expects to be the third child of the root
    attack_pattern_table = root[2]

    # Collect all elements with the target field name
    target_field_nodes = attack_pattern_table.findall(target_field_path)
    return target_field_nodes
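
Because the function above selects the attack pattern table by position (`root[2]`), it depends on the order of the tables in the XML file. One possible alternative, sketched here on a toy document, is to look the table up by tag name after the namespaces are stripped. The container tag `Attack_Patterns`, the toy namespace URI, and the helper names `strip_namespaces` and `find_table` are assumptions made for illustration; check them against the actual file before relying on them. The standard-library `xml.etree.ElementTree` is used as a stand-in for `lxml.etree`.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

# Toy namespaced catalog; the namespace URI and layout are made up for this sketch.
TOY_CATALOG = """
<capec:Catalog xmlns:capec="http://example.org/toy-capec">
  <capec:Views/>
  <capec:Attack_Patterns>
    <capec:Attack_Pattern ID="7">
      <capec:Typical_Severity>Low</capec:Typical_Severity>
    </capec:Attack_Pattern>
  </capec:Attack_Patterns>
</capec:Catalog>
"""

def strip_namespaces(root):
    """Drop the '{namespace}' prefix from every element tag, in place."""
    for elem in root.iter():
        # With lxml, comments have a non-string tag, so check the type first
        if isinstance(elem.tag, str) and '}' in elem.tag:
            elem.tag = elem.tag.split('}', 1)[1]
    return root

def find_table(root, table_tag='Attack_Patterns'):
    """Return the child of root with the given tag (assumed name), wherever it sits."""
    table = root.find(table_tag)
    if table is None:
        raise ValueError('no <%s> element directly under the root' % table_tag)
    return table

toy_root = strip_namespaces(ET.fromstring(TOY_CATALOG))
table = find_table(toy_root)
severity_nodes = table.findall('Attack_Pattern/Typical_Severity')
print(table.tag, [node.text for node in severity_nodes])
```

In this toy document the table is the second child of the root rather than the third, and the lookup still finds it, which is the point of selecting by tag rather than by index.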