# Reading an SPWLA file

There are three ways to extract the data:

- Looping over the file and applying string methods, etc. This is the naive approach, and very fiddly.
- Using regex to extract everything at once. I have this 'sort of' working, see below.
- **Using a PEG parser.** This is the approach in this notebook.

In [1]:
with open('../data/core_analysis_example.spwla') as f:
    text = f.read()

In [2]:
test = text[:1656]

print(test)

10     2                                                                                                                       
    9999/9-9                                Norway                                            9Sep99
    Weatherford-Labs
15    10   10
          1507      1602  2031   0Weatherford-Labs    Nitrogen Permeability, Hor.
          1512      1602  2031   0Weatherford-Labs    Klinkenberg corrected gas perm, Hor.
          1510      1602  2031   0Weatherford-Labs    Nitrogen Permeability, Vert.
          1515      1602  2031   0Weatherford-Labs    Klinkenberg corrected gas perm, Vert.
          1402      1211  3084   0Weatherford-Labs    Porosity, Horizontal PLUG
          1403      1211  3084   0Weatherford-Labs    Porosity, Vertical PLUG
          1401      1212  3084   0Weatherford-Labs    Porosity, Summation
          1302      1103  3085   0Weatherford-Labs    CORE Oil Saturation
          1301      1103  3085   0Weatherford-Labs    CORE Water Saturation
      

I'm going to manually add a couple of records: one with a cuttings description (a 'description block') followed by a data block, and one with a cuttings description but no data block (although the latter situation does not occur in the actual data).

In [3]:
test += """30     1
     1920.10     0.00   9
36     1    1
    Sst.Lt-gry.VF/F-gr.Sbang.W-cmt.W-srt.Lam.w-Cl,Mic,tr Pyr.
40     1   10
        18.72400    15.45700   143.77300   129.17200    15.74091 -1002.00000 -1002.00000 -1002.00000 -1002.00000     2.68513
30     1
     1940.40     0.00   10
36     1    1
    Sst.Lt-gry.VF/F-gr.Sbang.W-cmt.W-srt.Lam.w-Cl,Mic,tr Pyr.
"""

### Observations

- Some lines are 128 characters wide
- Some of the data is unidentifiable
- This is probably a job for striplog
- The numbers after the record type (10, 15, 20, 30, etc.) seem to give the number of lines in that record (and perhaps the number of fields per line). This is redundant information: we can just read until the next record type flag.
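
The last observation can be sketched with nothing but the standard library. This is a rough illustration, not the approach used in the rest of the notebook: the `HEADER` pattern and the `split_records` helper are hypothetical names, and the pattern assumes a record header is a two-digit record type followed by one or two integer counts, alone on an unindented line.

```python
import re

# Assumption: a header is an unindented line like "30     1" or "40     1   10".
HEADER = re.compile(r"^(\d{2})(?:\s+\d+){1,2}\s*$")


def split_records(text):
    """Group lines into (record_type, body_lines) pairs.

    The counts in the header are ignored: each record simply runs
    until the next header line.
    """
    records = []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            # Start a new record with an empty body.
            records.append((int(match.group(1)), []))
        elif records and line.strip():
            # Any other non-blank line belongs to the current record.
            records[-1][1].append(line.strip())
    return records


sample = """30     1
     1920.10     0.00   9
36     1    1
    Sst.Lt-gry.VF/F-gr.Sbang.W-cmt.W-srt.Lam.w-Cl,Mic,tr Pyr.
40     1   10
        18.72400    15.45700   143.77300
"""

for record_type, body in split_records(sample):
    print(record_type, body)
```

Indented body lines never match `HEADER`, which is what lets the header counts be skipped here. Whether that holds for every SPWLA file is an assumption.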

## Writing a PEG for the file

Let's try using `parsimonious`. There are examples to build from:

- [In the docs](https://github.com/erikrose/parsimonious)
- [On StackOverflow](https://stackoverflow.com/questions/47982949/how-to-parse-complex-text-files-using-python)

First we have to write a **parsing expression grammar**. A PEG looks a lot like a _context-free grammar_ (i.e. one in which each rule is defined the same way regardless of the surrounding context), but its choice operator is ordered, so a given input has at most one parse tree and there is zero ambiguity. That makes PEGs good at describing artificial languages and file formats (whereas CFGs are better for human languages). [Read more.](https://en.wikipedia.org/wiki/Parsing_expression_grammar)

Let's try writing the PEG for this file format...

In [996]:
from parsimonious import Grammar

grammar = Grammar(
    r"""
    file          = file_block field_block range_block data_blocks*
    
    file_block    = "10" ws number ws file_data ws
    file_data     = well_name ws country ws date ws company ws

    field_block   = "15" ws number ws number ws field_row+
    field_row     = number ws number ws number ws "0" company ws field_name ws

    range_block   = "20" ws number ws range_data ws
    range_data    = number ws number ws start_depth ws stop_depth ws number ws "1"

    data_blocks   = depth_block descr_block? data_block?
    depth_block   = "30" ws "1" ws depth ws number ws number ws
    descr_block   = "36" ws "1" ws "1" ws description ws
    data_block    = "40" ws "1" ws record_count ws data ws
    
    start_depth   = number+
    stop_depth    = number+
    record_count  = number+
    depth         = number+
    description   = sentence+
    field_name    = sentence+

    ws            = ~r"\s*"
    number        = ~r"[-.0-9]+"
    data          = ~r"[- .0-9]+"
    well_name     = ~r"[-/0-9]+"
    country       = ~r"[A-Z]+"i
    company       = ~r"[-_A-Z]+"i
    date          = ~r"[A-Z0-9]+"i
    sentence      = ~r"[-., /ÅA-Z]"i
    """
)

We can add flags after a regex, e.g. `~r"[A-Z]+"m`. The available flags are:

- i: IGNORECASE
- l: LOCALE
- m: MULTILINE
- s: DOTALL
- u: UNICODE
- x: VERBOSE
- a: ASCII

Now we can parse the text to build an abstract syntax tree, or AST.

In [997]:
tree = grammar.parse(text)

## Processing the AST

Now we have an abstract syntax tree (AST), and we have to process it to get our data out.

I think the idea is to write a `visit_` method for each rule we care about. Each method returns what we want from its node, and everything is gathered at the top level.

I think returning values like this is how we have to do it to link each Depth to its Descr and Data fields.

In [998]:
from parsimonious import NodeVisitor
import pandas as pd


class FileVisitor(NodeVisitor):
    
    # -------------- File --------------
    def visit_file(self, node, visited_children):
        
        file, field, rnge, data = visited_children
    
        info = {}
        info['file'] = file
        info['fields'] = field
        info['range'] = rnge
        info['data'] = data

        return info
    
    # -------------- Meta --------------
    def visit_file_block(self, node, visited_children):
        *_, file_data, _ = visited_children
        return file_data

    def visit_file_data(self, node, visited_children):
        well, _, country, _, date, _, company, _ = node.children
        meta = {'well_name': well.text}
        meta['country'] = country.text
        meta['company'] = company.text
        try:
            meta['date'] = pd.to_datetime(date.text).isoformat()
        except (TypeError, ValueError) as e:
            meta['date'] = date.text
        return meta

    # -------------- Fields --------------
    def visit_field_block(self, node, visited_children):
        *_, fields = visited_children
        return fields

    def visit_field_row(self, node, visited_children):
        *_, field, _ = node.children
        return field.text.strip()
    
    # -------------- Range --------------
    def visit_range_block(self, node, visited_children):
        *_, rnge, _ = visited_children
        return rnge

    def visit_range_data(self, node, visited_children):
        _, _, _, _, start, _, stop, *_ = node.children
        meta = {'start': float(start.text), 'stop': float(stop.text)}
        return meta

    # -------------- ALL DATA --------------
    def visit_data_blocks(self, node, visited_children):
        depth, descr, data = visited_children
        
        # This is gross, there must be a better way...
        if not isinstance(descr, list):
            descr = ''
        
        if not isinstance(data, list):
            data = []
        
        return {depth: {'descr': descr, 'data': data}}
    
    # -------------- Depth --------------
    def visit_depth_block(self, node, visited_children):
        _, _, _, _, depth, *_ = node.children
        return float(depth.text)

    # -------------- Descr --------------
    def visit_descr_block(self, node, visited_children):
        *_, descr, _ = visited_children
        return descr

    def visit_description(self, node, visited_children):
        return node.text

    # -------------- Data --------------
    def visit_data_block(self, node, visited_children):
        *_, data, _ = visited_children
        return data
    
    def visit_data(self, node, visited_children):
        return [float(x) for x in node.text.split()]
        
    # -------------- Generic --------------
    def generic_visit(self, node, visited_children):
        return visited_children or node

Wow, that seems like a lot of code. Hm.

Anyway, let's instantiate the visitor and walk the tree.

In [999]:
fv = FileVisitor()

f = fv.visit(tree)

f

{'file': {'well_name': '9999/9-9',
  'country': 'Norway',
  'company': 'Weatherford-Labs',
  'date': '1999-09-09T00:00:00'},
 'fields': ['Nitrogen Permeability, Hor.',
  'Klinkenberg corrected gas perm, Hor.',
  'Nitrogen Permeability, Vert.',
  'Klinkenberg corrected gas perm, Vert.',
  'Porosity, Horizontal PLUG',
  'Porosity, Vertical PLUG',
  'Porosity, Summation',
  'CORE Oil Saturation',
  'CORE Water Saturation',
  'Grain Density, Hor.'],
 'range': {'start': 1918.0, 'stop': 1983.72},
 'data': [{1918.95: {'descr': '',
    'data': [[-1002.0,
      -1002.0,
      -1002.0,
      -1002.0,
      -1002.0,
      18.44722,
      -1002.0,
      14.78718,
      -1002.0,
      -1002.0]]}},
  {1919.95: {'descr': '',
    'data': [[-1002.0,
      -1002.0,
      -1002.0,
      -1002.0,
      -1002.0,
      17.06246,
      -1002.0,
      18.06427,
      -1002.0,
      -1002.0]]}},
  {1920.95: {'descr': '',
    'data': [[-1002.0,
      -1002.0,
      -1002.0,
      -1002.0,
      -1002.0,
      1

That's the data we collected. Cool!

## Stack Overflow question

In [6]:
text = """\
30     1
     2001.10     0.00   2.11
40     1   2
     -1002.0000 34.5678
30     1
     2001.90     0.00   1
36     1    1
    Sst.Lt-gry. Pyr.
40     1   2
        18.72400    15.45700
30     1
     2002.90     0.00   2
36     1    1
    Sst.Lt-gry. W-cmt.
"""

In [7]:
grammar = Grammar(
    r"""
    file          = data_blocks+

    data_blocks   = depth_block descr_block? data_block?

    depth_block   = "30" WS "1" WS depth WS NUMBER WS NUMBER WS
    descr_block   = "36" WS "1" WS "1" WS description WS
    data_block    = "40" WS "1" WS record_count WS DATA WS

    record_count  = NUMBER+
    depth         = NUMBER+
    description   = SENTENCE+
    field_name    = SENTENCE+

    WS            = ~r"\s*"
    NUMBER        = ~r"[-.0-9]+"
    DATA          = ~r"[- .0-9]+"
    SENTENCE      = ~r"[-., /ÅA-Z]"i
    """
)

In [8]:
ast = grammar.parse(text)

In [10]:
from parsimonious import NodeVisitor

class FileVisitor(NodeVisitor):
    
    def visit_file(self, node, visited_children):       
        data = {}
        for d in visited_children:
            data.update(d)
        return data
    
    def visit_data_blocks(self, node, visited_children):
        depth, descr, data = visited_children
        descr = descr[0] if isinstance(descr, list) else ''
        data = data[0] if isinstance(data, list) else []
        return {depth: {'descr': descr, 'data': data}}
    
    def visit_depth_block(self, node, visited_children):
        _, _, _, _, depth, *_ = node.children
        return float(depth.text)

    def visit_descr_block(self, node, visited_children):
        *_, descr, _ = visited_children
        return descr

    def visit_description(self, node, visited_children):
        return node.text

    def visit_data_block(self, node, visited_children):
        *_, data, _ = visited_children
        return data
    
    def visit_DATA(self, node, visited_children):
        return [float(x) for x in node.text.split()]
        
    def generic_visit(self, node, visited_children):
        return visited_children or node

In [11]:
FileVisitor().visit(ast)

{2001.1: {'descr': '', 'data': [-1002.0, 34.5678]},
 2001.9: {'descr': 'Sst.Lt-gry. Pyr.', 'data': [18.724, 15.457]},
 2002.9: {'descr': 'Sst.Lt-gry. W-cmt.', 'data': []}}

----

## Help from Code Review Stack Exchange

In [14]:
from parsimonious import Grammar

grammar = Grammar(
    r"""
    file  = chunk+
    chunk = depth_block other_block*
    
    other_block = descr_block / data_block
    
    depth_block = ~"30\s+1\s+" depth number number nl
    descr_block = ~"36\s+1\s+1\s+" description+ nl
    data_block  = ~"40\s+1\s+" count nl number+ nl

    count = number+
    depth = number+

    ws          = ~r"[ \t]+"
    nl          = ~r"(\n\r?|\r\n?)"
    number      = ~r"-?[.0-9]+"
    description = ~r"\S+"
    """
)

In [15]:
ast = grammar.parse(text)

ParseError: Rule 'number' didn't match at '     0.00   2.11
40 ' (line 2, column 13).

In [17]:
class Visitor(NodeVisitor):

    visit_file = NodeVisitor.lift_child
        
    def visit_chunk(self, node, visited_children):
        chunk, others = visited_children
        for block in others:
            chunk.update(block)
        return chunk
        
    def visit_depth_block(self, node, visited_children):
        _, depth, _, _, _ = visited_children
        return {'depth':depth}
        
    visit_other_block = NodeVisitor.lift_child
        
    def visit_descr_block(self, node, visited_children):
        _, descriptions, _ = visited_children
        return {'description':descriptions}
        
    def visit_data_block(self, node, visited_children):
        _, count, data_list, _ = visited_children
        return {'count':count, 'data':data_list}
        
    visit_count = NodeVisitor.lift_child
        
    visit_depth = NodeVisitor.lift_child
        
    def visit_number(self, node, visited_children):
        text = node.text.strip()
        return float(text) if '.' in text else int(text)
        
    def visit_description(self, node, visited_children):
        return node.text.strip()

In [19]:
Visitor().visit(ast)

VisitationError: NotImplementedError: No visitor method was defined for this expression: "30"

Parse tree:
<Node matching "30">  <-- *** We were here. ***

This second error looks like a knock-on effect of the first. Because `grammar.parse(text)` failed above, `ast` still holds the tree built from the earlier Stack Overflow grammar, which contains literal `"30"` nodes. This `Visitor` has no method for them and defines no `generic_visit`, so the default `NodeVisitor.generic_visit` raises `NotImplementedError`.

----

## Another approach

I also saw someone returning nothing from these methods and instead building up a dictionary on the visitor itself. I don't think that's how the `NodeVisitor` is supposed to be used, though.

In [877]:
from parsimonious import NodeVisitor
import pandas as pd


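class FileVisitor(NodeVisitor):

    def __init__(self):
        # Keep the results on the instance: a class-level dict would be
        # shared (and keep growing) across every FileVisitor created.
        self.file = {'fields': [], 'data': [], 'depth': [], 'description': []}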
    
    def visit_file_data(self, node, visited_children):
        well, _, country, _, date, _, company, _ = node.children
        self.file['well_name'] = well.text
        self.file['country'] = country.text
        self.file['company'] = company.text
        try:
            self.file['date'] = pd.to_datetime(date.text).isoformat()
        except (TypeError, ValueError) as e:
            self.file['date'] = date.text

    def visit_field_row(self, node, visited_children):
        *_, field = node.children
        self.file['fields'].append(field.text.strip())
        
    def visit_range_data(self, node, visited_children):
        _, _, _, _, start, _, stop, *_ = node.children
        self.file['start'] = float(start.text)
        self.file['stop'] = float(stop.text)

    def visit_depth_block(self, node, visited_children):
        _, _, _, _, depth, *_ = node.children
        self.file['depth'].append(float(depth.text))

    def visit_data_block(self, node, visited_children):
        for child in node.children:
            if child.expr_name == 'data':
                self.file['data'].append([float(x) for x in child.text.split()])
        
    def visit_descr_block(self, node, visited_children):
        *_, descr, _ = node.children
        self.file['description'].append(descr.text)

    def generic_visit(self, node, visited_children):
        """
        For all other nodes.
        """
        pass