## Tutorial on regexmodel

#### Setup and installation

If you haven't installed `regexmodel` yet, do so now:

In [1]:
# %pip install git+https://github.com/sodascience/regexmodel.git

For this tutorial we need three packages: regexmodel, faker and matplotlib. If the last two are not installed yet, install them with:

In [2]:
# %pip install faker matplotlib

Import matplotlib now; the other packages are imported in the cells where they are first used:

In [3]:
from matplotlib import pyplot as plt

Normally we would already have data that we want to model and synthesize, but for this tutorial we will use the faker package to generate that data for us. We will use fake email addresses.

In [4]:
from faker import Faker

fake = Faker("en")
Faker.seed(12345)
email_addresses = [fake.email() for _ in range(300)]
email_addresses[:10]

['zoconnor@example.com',
 'sarah49@example.com',
 'nathansawyer@example.com',
 'cheryllopez@example.net',
 'schmidtjill@example.net',
 'eddieball@example.org',
 'marc75@example.org',
 'qsmith@example.org',
 'justindelgado@example.org',
 'jillhall@example.net']
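
Before fitting a model, it can help to look at the structure of the strings. The following is a minimal sketch using only the standard library and the ten addresses printed above; the `shape` helper is hypothetical and not part of regexmodel or faker.

```python
import re
from collections import Counter

# The ten addresses shown in the output above, copied verbatim.
sample = [
    "zoconnor@example.com",
    "sarah49@example.com",
    "nathansawyer@example.com",
    "cheryllopez@example.net",
    "schmidtjill@example.net",
    "eddieball@example.org",
    "marc75@example.org",
    "qsmith@example.org",
    "justindelgado@example.org",
    "jillhall@example.net",
]


def shape(local_part: str) -> str:
    """Classify the part before the '@' (hypothetical helper for illustration)."""
    if re.fullmatch(r"[a-z]+", local_part):
        return "letters"
    if re.fullmatch(r"[a-z]+[0-9]+", local_part):
        return "letters+digits"
    return "other"


# Count how often each domain and each local-part shape occurs.
domain_counts = Counter(address.split("@")[1] for address in sample)
shape_counts = Counter(shape(address.split("@")[0]) for address in sample)

print(domain_counts)
print(shape_counts)
```

In this small sample, the local parts are either lowercase letters only or lowercase letters followed by digits, and all domains are `example.com`, `example.net` or `example.org`. This is the structure we would like the model to pick up.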

#### Modeling the structured strings

Now we will use the regexmodel package to model the data:

In [5]:
from regexmodel import RegexModel

model = RegexModel.fit(email_addresses)

<class 'regexmodel.regexclass.UpperRegex'> [\S\s]*([A-Z]+)$ [] ['' '' '' '']
<class 'regexmodel.regexclass.LowerRegex'> [\S\s]*([a-z]+)$ ['c' 'ni'] ['' '' '' '']
<class 'regexmodel.regexclass.DigitRegex'> [\S\s]*([0-9]+)$ [] ['' '' '' '']
<class 'regexmodel.regexclass.UpperRegex'> [\S\s]*([A-Z]+)$ [] ['sarah49' 'lindsay31' 'william61' 'jo70']
<class 'regexmodel.regexclass.LowerRegex'> [\S\s]*([a-z]+)$ [] ['sarah49' 'lindsay31' 'william61' 'jo70']
<class 'regexmodel.regexclass.DigitRegex'> [\S\s]*([0-9]+)$ ['49' '31' '61' '70'] ['sarah49' 'lindsay31' 'william61' 'jo70']
<class 'regexmodel.regexclass.UpperRegex'> [\S\s]*([A-Z]+)$ [] ['sarah' 'lindsay' 'william' 'jo']
<class 'regexmodel.regexclass.LowerRegex'> [\S\s]*([a-z]+)$ ['sarah' 'lindsay' 'william' 'jo'] ['sarah' 'lindsay' 'william' 'jo']
<class 'regexmodel.regexclass.DigitRegex'> [\S\s]*([0-9]+)$ [] ['sarah' 'lindsay' 'william' 'jo']
<class 'regexmodel.regexclass.UpperRegex'> [\S\s]*([A-Z]+)$ [] ['s' 'lin' 'wil' 'bri']
<class 'reg

Let's first see how good the model is by synthesizing new email addresses:

In [6]:
[model.draw() for _ in range(10)]

start Node <LowerRegex[a-z]{4,16}>
start Node <LowerRegex[a-z]{4,16}>
start Node <LowerRegex[a-z]{4,16}>
start Node <LowerRegex[a-z]{4,16}>
start Node <LowerRegex[a-z]{4,16}>
start Node <LowerRegex[a-z]{4,16}>
start Node <LowerRegex[a-z]{4,16}>
start Node <LowerRegex[a-z]{4,16}>
start Node <LowerRegex[a-z]{4,16}>
start Node <Literal [@]>


['lmvvtbuuyuax@example.net',
 'txyzei@example.com',
 'juzsgcqku16@example.net',
 'ycehwdwwanofg@example.com',
 'yzznkpyl@example.com',
 'vfshaihqnnhhnqny@example.com',
 'gsxbddvriqtvq@example.org',
 'ctvdevixvbbpxw@example.net',
 'gzubiqywaclcenx@example.org',
 'zgearggr52@example.net']

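The result is not perfect: the local parts are random letters rather than plausible names. Still, the overall structure (lowercase letters, an optional two-digit suffix, and an `example.com`, `example.net` or `example.org` domain) is reproduced, even though the model has only seen positive examples!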
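
One way to make this impression more concrete is to check the synthesized strings against a hand-written pattern. The sketch below does this for the ten addresses printed above, using only the standard library. The pattern `EMAIL_SHAPE` is our own assumption about what a well-formed address looks like in this data set; it is not something regexmodel provides.

```python
import re

# The ten synthesized addresses from the output above, copied verbatim.
synthetic = [
    "lmvvtbuuyuax@example.net",
    "txyzei@example.com",
    "juzsgcqku16@example.net",
    "ycehwdwwanofg@example.com",
    "yzznkpyl@example.com",
    "vfshaihqnnhhnqny@example.com",
    "gsxbddvriqtvq@example.org",
    "ctvdevixvbbpxw@example.net",
    "gzubiqywaclcenx@example.org",
    "zgearggr52@example.net",
]

# Assumed structure: lowercase letters, optional digits, then one of three domains.
EMAIL_SHAPE = re.compile(r"[a-z]+[0-9]*@example\.(?:com|net|org)")

# Check every synthesized string against the assumed structure.
is_valid = [EMAIL_SHAPE.fullmatch(address) is not None for address in synthetic]
fraction_valid = sum(is_valid) / len(synthetic)

# Count how many local parts end in digits.
n_with_digits = sum(1 for address in synthetic if re.search(r"[0-9]@", address))

print(f"Structurally valid: {fraction_valid:.0%}")
print(f"With a digit suffix: {n_with_digits} of {len(synthetic)}")
```

A check like this only tests the structure of the strings. It says nothing about whether the local parts look like realistic names.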

Now let's look at the serialization of the model:

In [9]:
model.serialize()

[{'regex': '[a-z]{4,16}[@][e][x][a][m][p][l][e][\\\\.][c][o][m]',
  'weights': [300, 220, 248, 248, 248, 248, 248, 248, 248, 248, 103, 103, 103],
  'side_branches_before': [{'i_branch': 0, 'data': {'weight': 272}},
   {'i_branch': 1,
    'data': {'regex': '[a-z]{1,1}[a-z]{1,2}[a-z]{1,3}[a-z]{4,7}[0-9]{2,2}',
     'weights': [28, 0, 9, 14, 24],
     'side_branches_before': [{'i_branch': 0, 'data': {'weight': 9}},
      {'i_branch': 1, 'data': {'weight': 5}},
      {'i_branch': 2, 'data': {'weight': 10}},
      {'i_branch': 3, 'data': {'weight': 3}},
      {'i_branch': 4, 'data': {'weight': 1}}],
     'side_branches_after': []}}],
  'side_branches_after': [{'i_branch': 0,
    'data': {'regex': '[0-9]{2,2}[@][e][x][a][m][p][l][e][\\\\.][n][e][t]',
     'weights': [52, 51, 51, 51, 51, 51, 51, 51, 51, 51, 26, 26, 26],
     'side_branches_before': [],
     'side_branches_after': [{'i_branch': 9,
       'data': {'regex': '[o][r][g]',
        'weights': [26, 26, 26],
        'side_branches_bef
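
To read the serialization, it can help to put the regex elements of the main branch next to their weights. The sketch below copies the first `regex` and `weights` entries from the output above and splits the regex with the standard library. It assumes that each weight belongs to one bracketed element, in order; the output does not state this explicitly, but the two lists have the same length. The `ELEMENT` pattern is our own helper, not part of regexmodel.

```python
import re

# Main-branch regex and weights, copied from the serialization above.
# The printed repr '[\\\\.]' means the string holds two backslashes before the dot.
main_regex = r"[a-z]{4,16}[@][e][x][a][m][p][l][e][\\.][c][o][m]"
weights = [300, 220, 248, 248, 248, 248, 248, 248, 248, 248, 103, 103, 103]

# One element is a bracketed character class with an optional {min,max} repeat.
ELEMENT = re.compile(r"\[[^\]]*\](?:\{\d+,\d+\})?")
elements = ELEMENT.findall(main_regex)

# Pair each element with its weight (assumed to correspond one-to-one).
for element, weight in zip(elements, weights):
    print(f"{element:<14} {weight}")

# The main branch is itself a valid regex and matches a training example.
matches_example = re.fullmatch(main_regex, "zoconnor@example.com") is not None
print(matches_example)
```

The main branch on its own only covers letter-only local parts ending in `example.com`; the other variants appear in the `side_branches_before` and `side_branches_after` entries.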

In [8]:
email_addresses

['zoconnor@example.com',
 'sarah49@example.com',
 'nathansawyer@example.com',
 'cheryllopez@example.net',
 'schmidtjill@example.net',
 'eddieball@example.org',
 'marc75@example.org',
 'qsmith@example.org',
 'justindelgado@example.org',
 'jillhall@example.net',
 'lindsay31@example.com',
 'joavila@example.net',
 'jonathanhamilton@example.org',
 'pettykevin@example.org',
 'tammiebailey@example.com',
 'paulsmith@example.org',
 'zmiller@example.org',
 'bartonmichael@example.net',
 'castillobryan@example.net',
 'nnguyen@example.com',
 'michaelbates@example.org',
 'christophermartin@example.com',
 'jamesmartin@example.org',
 'stoneaaron@example.org',
 'mblackwell@example.org',
 'william61@example.com',
 'lwhite@example.org',
 'mark62@example.org',
 'susanpalmer@example.com',
 'david73@example.net',
 'karaschmidt@example.org',
 'ashley28@example.net',
 'greenronald@example.com',
 'sotocarlos@example.com',
 'hillscott@example.com',
 'nparker@example.com',
 'jo70@example.com',
 'wjohnson@example