In [2]:
import re
import typer
import pandas as pd

from typing import Dict
from pathlib import Path

# <span style="color:purple">Regular Expressions and Their Applications</span>

## Empirical Workshop

### Winter, 2021

# <span style="color:purple">Installing this Notebook Locally</span>

These slides are generated from a working Python notebook. To install the notebook and required packages locally, execute these steps:

```bash
$ git clone https://github.com/rs-kellogg/empirical_workshop_2021.git
```

```bash
$ cd empirical_workshop_2021
```

```bash
$ conda env create -f environment.yml
```
```bash
$ conda activate workshop-env  # on older systems, use 'source' instead of 'conda'
```
```bash
$ jupyter notebook 2_regex/notebooks
```


<center><img src="../figures/library.png" width="25%" style='border:5px solid #000000'/></center>


* Vast amounts of information are encoded as unstructured data, in the form of text. Fortunately, a lot of it is already stored digitally and is available for computational analysis.

* What tools can we use to perform this analysis?

# <span style="color:purple">The Tool to Use Depends on Text Format and Your Goals</span>

<br>
<br>
<br>

<center><img src="../figures/unstrctured-data-types.png" width="80%" style='border:5px solid #000000'/></center>

<center><img src="../figures/workflow.png" width="80%" style='border:5px solid #000000'/></center>

# <span style="color:purple">Example: Insider Trading Data: SEC Form 4</span>

Form 4 filings are reports submitted to the SEC by investors who buy or sell shares in companies where they are deemed insiders. The SEC defines an insider as any officer, director, or shareholder owning more than 10% of a publicly traded company.

* https://www.sec.gov/files/forms-3-4-5.pdf
* https://www.sec.gov/Archives/edgar/data/1326190/000101297517000759/xslF345X03/edgar.xml
* https://whalewisdomalpha.com/form-4-insider-trading-analysis/

* https://www.sec.gov/Archives/edgar/data/1326190/000101297517000759/

# <span style="color:purple">Regular Expressions: The Swiss Army Knife for Text</span>

<center><img src="../figures/regular_expressions.png" height="100%" style='border:5px solid #000000'/></center>

<center>https://xkcd.com/208/</center>

# <span style="color:purple">What Are Regular Expressions</span>


Regular expressions (called REs, regexes, or regex patterns) are essentially a tiny, highly specialized programming language. Using this little language, you specify the rules for the set of possible strings that you want to match.

* They've been around for decades (https://en.wikipedia.org/wiki/Regular_expression)
* Practically every programming language supports them (e.g., https://docs.python.org/3/howto/regex.html)
* Command line tools such as grep support them (https://en.wikipedia.org/wiki/Grep)

Many web tools exist for building and testing them:

* https://regex101.com/
* https://www.debuggex.com/
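
As a minimal sketch of the "rules for a set of strings" idea, the pattern below describes every string made of one to five capital letters wrapped in parentheses. The sample sentence and the name `ticker_pat` are invented for illustration and are not part of the workshop data.

```python
import re

# Hypothetical sample text, invented for illustration
sentence = "Shares of Altimmune (ALT) and Ecolab (ECL) were reported."

# One rule describes a whole set of strings:
# an opening parenthesis, 1-5 capital letters, a closing parenthesis.
# The inner parentheses form a group, so only the letters are returned.
ticker_pat = re.compile(r"\(([A-Z]{1,5})\)")

tickers = ticker_pat.findall(sentence)
print(tickers)  # ['ALT', 'ECL']
```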

# <span style="color:purple">Python Example</span>

<br>
<br>
<br>

<center><img src="../figures/create_regex.png" width="50%" style='border:5px solid #000000'/></center>

In [4]:
# Store the text in a python variable

file = Path("../data/0001012975-17-000759.txt")
text = file.read_text()
typer.secho(text, fg=typer.colors.WHITE, bg=typer.colors.BLACK)

<SEC-DOCUMENT>0001012975-17-000759.txt : 20171017
<SEC-HEADER>0001012975-17-000759.hdr.sgml : 20171017
<ACCEPTANCE-DATETIME>20171017200436
ACCESSION NUMBER:		0001012975-17-000759
CONFORMED SUBMISSION TYPE:	4
PUBLIC DOCUMENT COUNT:		1
CONFORMED PERIOD OF REPORT:	20171013
FILED AS OF DATE:		20171017
DATE AS OF CHANGE:		20171017

REPORTING-OWNER:	

	OWNER DATA:	
		COMPANY CONFORMED NAME:			Hodges Philip
		CENTRAL INDEX KEY:			0001705562

	FILING VALUES:
		FORM TYPE:		4
		SEC ACT:		1934 Act
		SEC FILE NUMBER:	001-32587
		FILM NUMBER:		171141634

	MAIL ADDRESS:	
		STREET 1:		19 FIRSTFIELD RD., SUITE 200
		CITY:			GAITHERSBURG
		STATE:			MD
		ZIP:			20878

REPORTING-OWNER:	

	OWNER DATA:	
		COMPANY CONFORMED NAME:			Redmont VAXN Capital Holdings, LLC
		CENTRAL INDEX KEY:			0001705638
		STATE OF INCORPORATION:			DE
		FISCAL YEAR END:			1231

	FILING VALUES:
		FORM TYPE:		4
		SEC ACT:		1934 Act
		SEC FILE NUMBER:	001-32587
		FILM NUMBER:		171141633

	BUSINESS ADDRESS:	
		STREET 1:		8

In [5]:
# Create a pattern string

pat_str = r"^\s*FORMER CONFORMED NAME:(.+?)$"

typer.secho(pat_str, fg=typer.colors.WHITE, bg=typer.colors.BLACK)

^\s*FORMER CONFORMED NAME:(.+?)$


In [6]:
# Compile the pattern string into a pattern object

pattern = re.compile(pat_str, flags=re.DOTALL | re.MULTILINE)

typer.secho(f"{pattern}", fg=typer.colors.WHITE, bg=typer.colors.BLACK)

re.compile('^\\s*FORMER CONFORMED NAME:(.+?)$', re.MULTILINE|re.DOTALL)


In [7]:
# Match the pattern against text

match = pattern.findall(text)
for m in match:
    typer.secho(f"Match: {m}", fg=typer.colors.WHITE, bg=typer.colors.BLACK)

Match: 	PHARMATHENE, INC
Match: 	HEALTHCARE ACQUISITION CORP

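The two flags passed to `re.compile` change how the anchors and the dot behave. The sketch below tries the same pattern on a small header fragment that is invented for illustration (it is not read from the workshop file), first without flags, then with `re.MULTILINE`, and finally with a greedy group under `re.DOTALL`.

```python
import re

# Invented three-line header fragment (not read from the workshop data)
sample = (
    "FORM TYPE:\t4\n"
    "\tFORMER CONFORMED NAME:\tPHARMATHENE, INC\n"
    "SEC ACT:\t1934 Act\n"
)
pat_str = r"^\s*FORMER CONFORMED NAME:(.+?)$"

# No flags: ^ matches only at the very start of the text, so nothing is found
no_flags = re.findall(pat_str, sample)

# re.MULTILINE: ^ and $ also match at the start and end of every line
multiline = re.findall(pat_str, sample, flags=re.MULTILINE)

# re.DOTALL lets "." cross newlines; with a greedy (.+) the capture
# then runs to the end of the text instead of stopping at the line end
greedy_dotall = re.findall(
    r"^\s*FORMER CONFORMED NAME:(.+)$", sample, flags=re.MULTILINE | re.DOTALL
)

print(no_flags)                          # []
print([m.strip() for m in multiline])    # ['PHARMATHENE, INC']
print("SEC ACT" in greedy_dotall[0])     # True
```

The lazy quantifier `(.+?)` is what keeps the capture on a single line when `re.DOTALL` is switched on.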

In [8]:
# Split the XML tags using Regex with groups

xml_pat = re.compile(r"<XML>(.+)</XML>", flags=re.DOTALL)
match = xml_pat.findall(text)
xml_text = match[0].strip() 

split_pat = re.compile(r"<(.+)>(.+)<.+>")
match = split_pat.findall(xml_text)
for m in match:
    typer.secho(f"{m}", fg=typer.colors.WHITE, bg=typer.colors.BLACK)
    typer.secho(" " *(len(m[0])+len(m[1]) + 8), bg=typer.colors.RED)

('schemaVersion', 'X0306')

('documentType', '4')

('periodOfReport', '2017-10-13')

('issuerCik', '0001326190')

('issuerName', 'Altimmune, Inc.')

('issuerTradingSymbol', 'ALT')

('rptOwnerCik', '0001705562')

('rptOwnerName', 'Hodges Philip')

('rptOwnerStreet1', 'C/O ALTIMMUNE, INC.')

('rptOwnerStreet2', '19 FIRSTFIELD ROAD, SUITE 200')

('rptOwnerCity', 'GAITHERSBURG')

(

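When a pattern contains more than one group, `findall` returns one tuple per match. The sketch below applies the same split pattern to a short XML fragment that is hand-written for illustration (modeled on the fields printed above), and compares it with a stricter variant. The stricter pattern is an assumption about what one might prefer, not the notebook's method: it uses a backreference so the closing tag must repeat the opening tag's name.

```python
import re

# Hand-written XML fragment, modeled on the fields printed above
xml_text = """<issuer>
    <issuerCik>0001326190</issuerCik>
    <issuerName>Altimmune, Inc.</issuerName>
    <issuerTradingSymbol>ALT</issuerTradingSymbol>
</issuer>"""

# Two groups: the tag name and the text between the tags.
# "." does not cross newlines here, so only single-line elements match.
split_pat = re.compile(r"<(.+)>(.+)<.+>")
pairs = split_pat.findall(xml_text)

# Stricter variant (illustrative): \1 requires the closing tag to
# carry the same name that group 1 captured in the opening tag
strict_pat = re.compile(r"<(\w+)>([^<]+)</\1>")
strict_pairs = strict_pat.findall(xml_text)

for tag, value in pairs:
    print(f"{tag}: {value}")
```

Both patterns return the same three `(tag, value)` pairs for this fragment; the enclosing `<issuer>` element is skipped because its content spans several lines.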
# <span style="color:purple">Deeper Dive: Regex Pattern Elements</span>

* Most characters match themselves: <span style="color:blue">A</span> matches "A", <span style="color:blue">9</span> matches "9"

* Sequences of characters match sequences in text: <span style="color:blue">ABC</span> matches "ABC"

* Metacharacters are what allow us to specify abstract patterns: <span style="color:blue">. ^ $ * + ? { } [ ] \ | ( )</span>

* Disjunctions: <span style="color:blue">A|B</span> or <span style="color:blue">[AB]</span>

* Character classes and ranges: <span style="color:blue">[A-Z]</span>, <span style="color:blue">\d</span>, <span style="color:blue">\s</span>, <span style="color:blue">\w</span>, <span style="color:blue">.</span>

* Operators (repetition and optionality): <span style="color:blue">A*</span>, <span style="color:blue">A+</span>, <span style="color:blue">Ab?</span>

* Groups (capture sub-pieces for extraction): <span style="color:blue">A([A-Z]\d)+Z</span>
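
The elements listed above can be tried out directly with Python's `re` module. The following is a small sketch; the test strings are invented for illustration.

```python
import re

# (pattern, text) pairs, one per kind of pattern element
examples = [
    (r"ABC", "xxABCxx"),          # a literal sequence matches itself
    (r"A|B", "B"),                # disjunction: either A or B
    (r"[A-Z]\d", "Q7"),           # a range followed by a digit class
    (r"\d+\s\w+", "42 shares"),   # digits, one whitespace, word characters
    (r"Ab?", "A"),                # the "b" is optional
    (r"A([A-Z]\d)+Z", "AB1C2Z"),  # a repeated group
]

# group(0) is the full text matched by the pattern
results = [re.search(pat, text).group(0) for pat, text in examples]
print(results)

# A repeated group keeps only its last repetition
last_piece = re.search(r"A([A-Z]\d)+Z", "AB1C2Z").group(1)
print(last_piece)  # C2
```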

# <span style="color:purple">Exploring and Checking Data with GREP</span>

<br>
<br>
<br>

<center><img src="../figures/man_grep.png" width="100%"/></center>

# <span style="color:purple">Scaling up to Multiple Documents</span>

<br>
<br>
<br>

<center><img src="../figures/information_extraction.png" width="100%" style='border:5px solid #000000'/></center>

In [9]:
import re
from typing import Dict

document_fields_header: Dict[str, re.Pattern] = {
    "accession": re.compile(r"^\s*ACCESSION NUMBER:(.+?)$", flags=re.DOTALL | re.MULTILINE),
    "sec_document": re.compile(r"<SEC-DOCUMENT>(.+?):", flags=re.DOTALL | re.MULTILINE),
    # NOTE: this pattern is identical to "sec_document"; r"<SEC-HEADER>(.+?):" was probably intended
    "sec_header": re.compile(r"<SEC-DOCUMENT>(.+?):", flags=re.DOTALL | re.MULTILINE),
    "acceptance_datetime": re.compile(r"<ACCEPTANCE-DATETIME>(.+?)$", flags=re.DOTALL | re.MULTILINE)
}
    
for key, val in document_fields_header.items():
    typer.secho(f"key: {key}", fg=typer.colors.WHITE, bg=typer.colors.RED)
    typer.secho(f"val: {val}", fg=typer.colors.WHITE, bg=typer.colors.BLACK)

key: accession
val: re.compile('^\\s*ACCESSION NUMBER:(.+?)$', re.MULTILINE|re.DOTALL)
key: sec_document
val: re.compile('<SEC-DOCUMENT>(.+?):', re.MULTILINE|re.DOTALL)
key: sec_header
val: re.compile('<SEC-DOCUMENT>(.+?):', re.MULTILINE|re.DOTALL)
key: acceptance_datetime
val: re.compile('<ACCEPTANCE-DATETIME>(.+?)$', re.MULTILINE|re.DOTALL)


In [10]:
def extract_doc_header_info(f: Path) -> Dict[str, str]:
    text = f.read_text()
    row_dict = {"filename": f.name}
    for field, pat in document_fields_header.items():
        row_dict[field] = None
        match = pat.findall(text)
        if match:
            row_dict[field] = match[0].strip()
        else:
            typer.secho(f"WARNING: {f} does not contain {field}", fg=typer.colors.RED)
    return row_dict

typer.secho(f"{extract_doc_header_info}", fg=typer.colors.WHITE, bg=typer.colors.BLACK)

<function extract_doc_header_info at 0x2b86699af280>

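The same dictionary-of-patterns loop can be tried without touching the file system. The sketch below runs it on the first lines of the sample filing shown earlier, held in a string. The names `fields` and `extract_fields` are hypothetical, the `<SEC-HEADER>` pattern is an assumption about the intended header field, and `former_name` is included only to show what happens when a field is absent.

```python
import re
from typing import Dict, Optional

# First lines of the sample filing shown earlier in the notebook
header_text = (
    "<SEC-DOCUMENT>0001012975-17-000759.txt : 20171017\n"
    "<SEC-HEADER>0001012975-17-000759.hdr.sgml : 20171017\n"
    "<ACCEPTANCE-DATETIME>20171017200436\n"
    "ACCESSION NUMBER:\t\t0001012975-17-000759\n"
)

fields: Dict[str, re.Pattern] = {
    "accession": re.compile(r"^\s*ACCESSION NUMBER:(.+?)$", flags=re.MULTILINE),
    "sec_document": re.compile(r"<SEC-DOCUMENT>(.+?):"),
    # Assumption: the header field is meant to come from the <SEC-HEADER> tag
    "sec_header": re.compile(r"<SEC-HEADER>(.+?):"),
    "acceptance_datetime": re.compile(r"<ACCEPTANCE-DATETIME>(.+?)$", flags=re.MULTILINE),
    # Field that is absent from this sample, to show the None fallback
    "former_name": re.compile(r"^\s*FORMER CONFORMED NAME:(.+?)$", flags=re.MULTILINE),
}

def extract_fields(text: str) -> Dict[str, Optional[str]]:
    """Return the first captured value per field, or None when nothing matches."""
    row: Dict[str, Optional[str]] = {}
    for name, pat in fields.items():
        m = pat.search(text)
        row[name] = m.group(1).strip() if m else None
    return row

row = extract_fields(header_text)
for name, value in row.items():
    print(f"{name}: {value}")
```

Using `search` and `group(1)` here instead of `findall(...)[0]` is a stylistic choice; both return the first match.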

In [11]:
from pathlib import Path
import pandas as pd

row_dicts = []
in_dir = Path("../data/2020-sample")
for f in in_dir.glob("*.txt"):
    typer.secho(f"processing file: {f.name}", fg=typer.colors.WHITE, bg=typer.colors.BLACK)
    row_dicts.append(extract_doc_header_info(f))

header_df = pd.DataFrame(row_dicts)
header_df = header_df.set_index("filename")

processing file: 1363364_2_0001638599-20-000500.txt
processing file: 1487371_1_0001487371-20-000092.txt
processing file: 1192933_2_0001179110-20-005642.txt
processing file: 1642376_2_0001140361-20-013703.txt
processing file: 1737287_4_0001214659-20-008571.txt
processing file: 1278895_1_0000899243-20-009380.txt
processing file: 1467858_2_0001467858-20-000082.txt
processing file: 1221787_2_0001209191-20-035900.txt
processing file: 1686807_1_0000947871-20-000166.txt
processing file: 315054_2_0001140361-20-011342.txt
processing file: 1653653_3_0000899243-20-022914.txt
processing file: 1047122_1_0001047122-20-000051.txt
processing file: 1333986_1_0001209191-20-017244.txt
processing file: 1715974_4_0001246360-20-001800.txt
processing file: 1509261_4_0001567619-20-017410.txt

In [12]:
header_df.head(20)

filename,accession,sec_document,sec_header,acceptance_datetime
1363364_2_0001638599-20-000500.txt,0001638599-20-000500,0001638599-20-000500.txt,0001638599-20-000500.txt,20200518185743
1487371_1_0001487371-20-000092.txt,0001487371-20-000092,0001487371-20-000092.txt,0001487371-20-000092.txt,20200305114400
1192933_2_0001179110-20-005642.txt,0001179110-20-005642,0001179110-20-005642.txt,0001179110-20-005642.txt,20200508175755
1642376_2_0001140361-20-013703.txt,0001140361-20-013703,0001140361-20-013703.txt,0001140361-20-013703.txt,20200612194911
1737287_4_0001214659-20-008571.txt,0001214659-20-008571,0001214659-20-008571.txt,0001214659-20-008571.txt,20201013163420
1278895_1_0000899243-20-009380.txt,0000899243-20-009380,0000899243-20-009380.txt,0000899243-20-009380.txt,20200325104905
1467858_2_0001467858-20-000082.txt,0001467858-20-000082,0001467858-20-000082.txt,0001467858-20-000082.txt,20200508173026
1221787_2_0001209191-20-035900.txt,0001209191-20-035900,0001209191-20-035900.txt,0001209191-20-035900.txt,20200611161258
1686807_1_0000947871-20-000166.txt,0000947871-20-000166,0000947871-20-000166.txt,0000947871-20-000166.txt,20200227191820
315054_2_0001140361-20-011342.txt,0001140361-20-011342,0001140361-20-011342.txt,0001140361-20-011342.txt,20200511204850


# <span style="color:purple">Working with XML</span>

https://www.xmlviewer.org/

In [13]:
from typing import Dict

document_fields: Dict[str, str] = {
    "schemaVersion": "schemaVersion",
    "documentType": "documentType",
    "periodOfReport": "periodOfReport",
    "notSubjectToSection16": "notSubjectToSection16",
    "issuerCik": "issuer/issuerCik",
    "issuerName": "issuer/issuerName",
    "issuerTradingSymbol": "issuer/issuerTradingSymbol"
}
    
for key, val in document_fields.items():
    typer.secho(f"key: {key}", fg=typer.colors.WHITE, bg=typer.colors.RED)
    typer.secho(f"val: {val}", fg=typer.colors.WHITE, bg=typer.colors.BLACK)

key: schemaVersion
val: schemaVersion
key: documentType
val: documentType
key: periodOfReport
val: periodOfReport
key: notSubjectToSection16
val: notSubjectToSection16
key: issuerCik
val: issuer/issuerCik
key: issuerName
val: issuer/issuerName
key: issuerTradingSymbol
val: issuer/issuerTradingSymbol


In [14]:
from typing import Dict
import xml.etree.ElementTree as ET
import re

def extract_doc_xml_info(f: Path) -> Dict[str, str]:
    text = f.read_text()
    row_dict = {"filename": f.name}
    
    # extract the XML portion of the document using a regex
    xml_pat = re.compile(r"<XML>(.+)</XML>", flags=re.DOTALL)
    match = xml_pat.findall(text)
    xml_text = match[0].strip()   
    root = ET.fromstring(xml_text)

    # iterate through fields and match on path 
    for field, pat in document_fields.items():
        row_dict[field] = None
        match = root.find(pat)
        if match is not None:
            row_dict[field] = match.text.strip()
        else:
            typer.secho(f"WARNING: {f.name} does not contain {field}", bg=typer.colors.BLACK, fg=typer.colors.WHITE)
    
    return row_dict

typer.secho(f"{extract_doc_xml_info}", fg=typer.colors.WHITE, bg=typer.colors.BLACK)

<function extract_doc_xml_info at 0x2b8571908790>

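The path lookup can be seen in isolation on a small in-memory document. The fragment below is hand-written and heavily trimmed; the root element name `ownershipDocument` is an assumption about the Form 4 schema, and the values are taken from the sample filing printed earlier. The extra check on `node.text` is a defensive addition for empty elements.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed fragment; root element name is an assumption
xml_text = """<ownershipDocument>
    <schemaVersion>X0306</schemaVersion>
    <documentType>4</documentType>
    <issuer>
        <issuerCik>0001326190</issuerCik>
        <issuerName>Altimmune, Inc.</issuerName>
        <issuerTradingSymbol>ALT</issuerTradingSymbol>
    </issuer>
</ownershipDocument>"""

root = ET.fromstring(xml_text)

# Field name -> path relative to the root element
paths = {
    "documentType": "documentType",
    "issuerName": "issuer/issuerName",
    "notSubjectToSection16": "notSubjectToSection16",  # absent in this fragment
}

values = {}
for field, path in paths.items():
    node = root.find(path)
    # find() returns None for a missing element; .text is None for an empty one
    values[field] = node.text.strip() if node is not None and node.text else None

print(values)
```

A path such as `issuer/issuerName` walks from the root into the `issuer` child and then into `issuerName`, which is why nested fields need the longer path.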

In [15]:
import typer
import pandas as pd
from pathlib import Path

row_dicts = []
in_dir = Path("../data/2020-sample")
for f in in_dir.glob("*.txt"):
    row_dicts.append(extract_doc_xml_info(f))

xml_df = pd.DataFrame(row_dicts)
xml_df = xml_df.set_index("filename")



In [16]:
xml_df.head(20)

filename,schemaVersion,documentType,periodOfReport,notSubjectToSection16,issuerCik,issuerName,issuerTradingSymbol
1363364_2_0001638599-20-000500.txt,X0306,4,2020-05-14,0,899866,"ALEXION PHARMACEUTICALS, INC.",ALXN
1487371_1_0001487371-20-000092.txt,X0306,4,2020-03-04,0,1487371,"GenMark Diagnostics, Inc.",GNMK
1192933_2_0001179110-20-005642.txt,X0306,4,2020-05-07,0,31462,ECOLAB INC.,ECL
1642376_2_0001140361-20-013703.txt,X0306,4,2020-06-10,,1517342,PACIFIC DRILLING S.A.,PACD
1737287_4_0001214659-20-008571.txt,X0306,4,2020-10-12,,1737287,"Allogene Therapeutics, Inc.",ALLO
1278895_1_0000899243-20-009380.txt,X0306,4,2020-03-23,0,1278895,"BLACKROCK ENHANCED CAPITAL & INCOME FUND, INC.",CII
1467858_2_0001467858-20-000082.txt,X0306,4,2020-05-07,0,1467858,General Motors Co,GM
1221787_2_0001209191-20-035900.txt,X0306,4,2020-06-10,0,1059556,MOODYS CORP /DE/,MCO
1686807_1_0000947871-20-000166.txt,X0306,4,2020-02-25,0,1517175,"Chefs' Warehouse, Inc.",CHEF
315054_2_0001140361-20-011342.txt,X0306,4,2020-05-11,,1655888,Owl Rock Capital Corp,ORCC


In [17]:
# join the results together
df = header_df.join(xml_df)
df.head(10)

filename,accession,sec_document,sec_header,acceptance_datetime,schemaVersion,documentType,periodOfReport,notSubjectToSection16,issuerCik,issuerName,issuerTradingSymbol
1363364_2_0001638599-20-000500.txt,0001638599-20-000500,0001638599-20-000500.txt,0001638599-20-000500.txt,20200518185743,X0306,4,2020-05-14,0.0,899866,"ALEXION PHARMACEUTICALS, INC.",ALXN
1487371_1_0001487371-20-000092.txt,0001487371-20-000092,0001487371-20-000092.txt,0001487371-20-000092.txt,20200305114400,X0306,4,2020-03-04,0.0,1487371,"GenMark Diagnostics, Inc.",GNMK
1192933_2_0001179110-20-005642.txt,0001179110-20-005642,0001179110-20-005642.txt,0001179110-20-005642.txt,20200508175755,X0306,4,2020-05-07,0.0,31462,ECOLAB INC.,ECL
1642376_2_0001140361-20-013703.txt,0001140361-20-013703,0001140361-20-013703.txt,0001140361-20-013703.txt,20200612194911,X0306,4,2020-06-10,,1517342,PACIFIC DRILLING S.A.,PACD
1737287_4_0001214659-20-008571.txt,0001214659-20-008571,0001214659-20-008571.txt,0001214659-20-008571.txt,20201013163420,X0306,4,2020-10-12,,1737287,"Allogene Therapeutics, Inc.",ALLO
1278895_1_0000899243-20-009380.txt,0000899243-20-009380,0000899243-20-009380.txt,0000899243-20-009380.txt,20200325104905,X0306,4,2020-03-23,0.0,1278895,"BLACKROCK ENHANCED CAPITAL & INCOME FUND, INC.",CII
1467858_2_0001467858-20-000082.txt,0001467858-20-000082,0001467858-20-000082.txt,0001467858-20-000082.txt,20200508173026,X0306,4,2020-05-07,0.0,1467858,General Motors Co,GM
1221787_2_0001209191-20-035900.txt,0001209191-20-035900,0001209191-20-035900.txt,0001209191-20-035900.txt,20200611161258,X0306,4,2020-06-10,0.0,1059556,MOODYS CORP /DE/,MCO
1686807_1_0000947871-20-000166.txt,0000947871-20-000166,0000947871-20-000166.txt,0000947871-20-000166.txt,20200227191820,X0306,4,2020-02-25,0.0,1517175,"Chefs' Warehouse, Inc.",CHEF
315054_2_0001140361-20-011342.txt,0001140361-20-011342,0001140361-20-011342.txt,0001140361-20-011342.txt,20200511204850,X0306,4,2020-05-11,,1655888,Owl Rock Capital Corp,ORCC


# <span style="color:purple">Normalizing Dates with Regex</span>

<br>
<br>
<br>

<center><img src="../figures/MedjoolDates2lb.png" width="800" style='border:5px solid #000000'/></center>

In [18]:
# Normalizing dates with regex

df["acceptance_datetime"] = df["acceptance_datetime"].str.replace(r'^(\d\d\d\d)(\d\d)(\d\d)(\d*)', r'\1-\2-\3', regex=True)
df.head(10)

filename,accession,sec_document,sec_header,acceptance_datetime,schemaVersion,documentType,periodOfReport,notSubjectToSection16,issuerCik,issuerName,issuerTradingSymbol
1363364_2_0001638599-20-000500.txt,0001638599-20-000500,0001638599-20-000500.txt,0001638599-20-000500.txt,2020-05-18,X0306,4,2020-05-14,0.0,899866,"ALEXION PHARMACEUTICALS, INC.",ALXN
1487371_1_0001487371-20-000092.txt,0001487371-20-000092,0001487371-20-000092.txt,0001487371-20-000092.txt,2020-03-05,X0306,4,2020-03-04,0.0,1487371,"GenMark Diagnostics, Inc.",GNMK
1192933_2_0001179110-20-005642.txt,0001179110-20-005642,0001179110-20-005642.txt,0001179110-20-005642.txt,2020-05-08,X0306,4,2020-05-07,0.0,31462,ECOLAB INC.,ECL
1642376_2_0001140361-20-013703.txt,0001140361-20-013703,0001140361-20-013703.txt,0001140361-20-013703.txt,2020-06-12,X0306,4,2020-06-10,,1517342,PACIFIC DRILLING S.A.,PACD
1737287_4_0001214659-20-008571.txt,0001214659-20-008571,0001214659-20-008571.txt,0001214659-20-008571.txt,2020-10-13,X0306,4,2020-10-12,,1737287,"Allogene Therapeutics, Inc.",ALLO
1278895_1_0000899243-20-009380.txt,0000899243-20-009380,0000899243-20-009380.txt,0000899243-20-009380.txt,2020-03-25,X0306,4,2020-03-23,0.0,1278895,"BLACKROCK ENHANCED CAPITAL & INCOME FUND, INC.",CII
1467858_2_0001467858-20-000082.txt,0001467858-20-000082,0001467858-20-000082.txt,0001467858-20-000082.txt,2020-05-08,X0306,4,2020-05-07,0.0,1467858,General Motors Co,GM
1221787_2_0001209191-20-035900.txt,0001209191-20-035900,0001209191-20-035900.txt,0001209191-20-035900.txt,2020-06-11,X0306,4,2020-06-10,0.0,1059556,MOODYS CORP /DE/,MCO
1686807_1_0000947871-20-000166.txt,0000947871-20-000166,0000947871-20-000166.txt,0000947871-20-000166.txt,2020-02-27,X0306,4,2020-02-25,0.0,1517175,"Chefs' Warehouse, Inc.",CHEF
315054_2_0001140361-20-011342.txt,0001140361-20-011342,0001140361-20-011342.txt,0001140361-20-011342.txt,2020-05-11,X0306,4,2020-05-11,,1655888,Owl Rock Capital Corp,ORCC
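
The same substitution can be tried on a single string with `re.sub`. In the sketch below the pattern is additionally anchored with `$` (an assumption about what is desirable), so strings that are not all digits are left untouched. Parsing with `datetime.strptime` is shown as an alternative that keeps the time of day; it assumes the timestamp layout is `YYYYMMDDHHMMSS`.

```python
import re
from datetime import datetime

raw = "20200518185743"  # an acceptance_datetime value from the table above

# Groups 1-3 capture year, month, and day; group 4 swallows the time digits
date_pat = re.compile(r"^(\d{4})(\d{2})(\d{2})(\d*)$")
date_only = date_pat.sub(r"\1-\2-\3", raw)
print(date_only)  # 2020-05-18

# A string that does not match the pattern is returned unchanged
unchanged = date_pat.sub(r"\1-\2-\3", "n/a")
print(unchanged)  # n/a

# Alternative: parse the full timestamp and keep the time as well
parsed = datetime.strptime(raw, "%Y%m%d%H%M%S")
print(parsed.isoformat())  # 2020-05-18T18:57:43
```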


# <span style="color:purple">RegEx Resources</span>

* http://www.regular-expressions.info/

* Python: https://docs.python.org/3/howto/regex.html
* R: https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html
* Stata: https://www.stata.com/support/faqs/data-management/regular-expressions/

* If you want to be a master: https://www.amazon.com/dp/0596528124/ref=cm_sw_su_dp

* Interactive web page: https://regex101.com/