## Tutorial on regexmodel

#### Setup and installation

If you haven't installed `regexmodel` yet, do so now:

In [1]:
# %pip install git+https://github.com/sodascience/regexmodel.git

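For this tutorial we will need three packages: `regexmodel`, `faker` and `matplotlib`. Install the last two now if they are not installed already. The graph visualization at the end of the tutorial also imports `networkx` and `pyvis`, which have to be installed separately.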

In [2]:
# %pip install faker matplotlib

Import the plotting interface of matplotlib:

In [3]:
from matplotlib import pyplot as plt

Normally we would already have data that we want to model and synthesize, but for this tutorial we will use the faker package to generate that data for us. We will use fake email addresses.

In [4]:
from faker import Faker

fake = Faker("en")
Faker.seed(12345)
email_addresses = [fake.ascii_email() for _ in range(1000)]
email_addresses[:10]

['wilkinsleah@yahoo.com',
 'ashlee25@sawyer.net',
 'michaelrubio@schmidt.biz',
 'jamiemiller@morris-ellis.com',
 'smithelizabeth@gmail.com',
 'cory61@lane.net',
 'hallmorgan@gmail.com',
 'brianna82@petty.com',
 'michaelpeterson@morris-williams.info',
 'lvang@barton.com']
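
Before fitting, it can help to look at the structure the model will have to capture. The following is a minimal sketch using only the standard library; `sample` is the ten addresses shown above copied by hand, and the helper `top_level_domain` is a name introduced here for illustration, not part of `regexmodel` or `faker`.

```python
from collections import Counter

# The ten addresses printed above, copied by hand.
sample = [
    "wilkinsleah@yahoo.com",
    "ashlee25@sawyer.net",
    "michaelrubio@schmidt.biz",
    "jamiemiller@morris-ellis.com",
    "smithelizabeth@gmail.com",
    "cory61@lane.net",
    "hallmorgan@gmail.com",
    "brianna82@petty.com",
    "michaelpeterson@morris-williams.info",
    "lvang@barton.com",
]


def top_level_domain(address):
    """Return the part after the last dot (hypothetical helper)."""
    return address.rsplit(".", 1)[1]


# How often does each top-level domain occur?
tld_counts = Counter(top_level_domain(a) for a in sample)

# How many domains contain a hyphen, and how many local parts contain digits?
hyphenated = sum("-" in a.split("@", 1)[1] for a in sample)
with_digits = sum(any(ch.isdigit() for ch in a.split("@", 1)[0]) for a in sample)

print(tld_counts.most_common())
print(hyphenated, with_digits)
```

Variations such as the different top-level domains, optional digits and hyphenated domain names are what a fitted model has to represent as alternative branches.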

#### Modeling the structured strings

Now we will use the regexmodel package to model the data:

In [5]:
from regexmodel import RegexModel

model = RegexModel.fit(email_addresses)

[@][a-z]{3,9}[\-][a-z]{4,9}[\.][c][o][m] [@][a-z]{3,9}[\-][a-z]{4,9}[\.][c][o][m] 0 8
['@schmidt.biz' '@morris-ellis.com' '@morris-williams.info' '@vasquez.net'
 '@davis.info'] ['' '' '' '' '']
[@][a-z]{3,9}[\-][a-z]{4,9}[\.][c][o][m] [@][a-z]{3,9}[\-][a-z]{4,9}[\.][c][o] 0 7
['@schmidt.biz' '@morris-williams.info' '@vasquez.net' '@davis.info'
 '@tyler.biz'] []
[@][a-z]{3,9}[\-][a-z]{4,9}[\.][c][o][m] [@][a-z]{3,9}[\-][a-z]{4,9}[\.][c] 0 6
['@schmidt.biz' '@morris-williams.info' '@vasquez.net' '@davis.info'
 '@tyler.biz'] []
[@][a-z]{3,9}[\-][a-z]{4,9}[\.][c][o][m] [@][a-z]{3,9}[\-][a-z]{4,9}[\.] 0 5
['@schmidt.biz' '@morris-williams.info' '@vasquez.net' '@davis.info'
 '@tyler.biz'] ['info' 'net' 'net' 'net' 'biz']
[@][a-z]{3,9}[\-][a-z]{4,9}[\.][c][o][m] [@][a-z]{3,9}[\-][a-z]{4,9} 0 4
['@schmidt.biz' '@vasquez.net' '@davis.info' '@tyler.biz' '@johnson.biz'] ['n.com']
[@][a-z]{3,9}[\-][a-z]{4,9}[\.][c][o][m] [@][a-z]{3,9}[\-] 0 3
['@schmidt.biz' '@vasquez.net' '@davis.info' '@tyler.bi

Let's first see how good the model is by synthesizing new email addresses:

In [6]:
[model.draw() for _ in range(10)]

['tsfvdcvksqsnawix@vijs.com',
 'ukqesyrxv81@omlfdyxn.com',
 'rcaagzwrn@yprusofb.com',
 'kunldqueuioc@ntzqqumm.com',
 'tysnzauilrosjztdia@dasvc.com',
 'kvbwrki@lelvjtx.com',
 'dwcuycunogzzjzzfn@uamzk.com',
 'vkei78@cnnyuhxb.com',
 'ujufedozkcwcreqh@rjhcgrcz.com',
 'aahouc95@uqdcm.com']

The result is far from perfect, but it is not bad either, considering that the model was fitted on positive examples only!
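
One way to make "not so bad" a little more concrete is to check whether the synthesized values at least have the shape of an email address. This is a rough sketch: `drawn` is the output above copied by hand, and `LOOSE_EMAIL` is a deliberately loose pattern chosen for this illustration, not a full email validator and not something `regexmodel` provides.

```python
import re

# The ten synthesized addresses printed above, copied by hand.
drawn = [
    "tsfvdcvksqsnawix@vijs.com",
    "ukqesyrxv81@omlfdyxn.com",
    "rcaagzwrn@yprusofb.com",
    "kunldqueuioc@ntzqqumm.com",
    "tysnzauilrosjztdia@dasvc.com",
    "kvbwrki@lelvjtx.com",
    "dwcuycunogzzjzzfn@uamzk.com",
    "vkei78@cnnyuhxb.com",
    "ujufedozkcwcreqh@rjhcgrcz.com",
    "aahouc95@uqdcm.com",
]

# Loose shape check: local part, "@", domain, ".", top-level domain.
LOOSE_EMAIL = re.compile(r"[a-z0-9]+@[a-z\-]+\.[a-z]+")

valid = [a for a in drawn if LOOSE_EMAIL.fullmatch(a)]
n_dot_com = sum(a.endswith(".com") for a in drawn)

print(f"{len(valid)} of {len(drawn)} have the shape of an email address")
print(f"{n_dot_com} of {len(drawn)} end in .com")
```

A check like this only says that the format is right. The names themselves are random letters, because the model describes character classes and lengths rather than a vocabulary.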

Now let's look at the serialization of the model:

In [7]:
model.serialize()

[{'regex': '[a-z]{3,18}[0-9]{2,2}[@][a-z]{4,9}[\\\\.][c][o][m]',
  'weights': [1000, 259, 259, 791, 746, 705, 705, 705],
  'side_branches_before': [{'i_branch': 0, 'data': {'weight': 466}},
   {'i_branch': 2,
    'data': {'regex': '[a-z]{4,18}',
     'weights': [533],
     'side_branches_before': [{'i_branch': 0, 'data': {'weight': 533}}],
     'side_branches_after': []}},
   {'i_branch': 2, 'data': {'weight': 1}}],
  'side_branches_after': [{'i_branch': 0,
    'data': {'regex': '[@][a-z]{3,9}[\\\\-][a-z]{4,9}[\\\\.][c][o][m]',
     'weights': [207, 207, 120, 119, 118, 70, 70, 70],
     'side_branches_before': [],
     'side_branches_after': [{'i_branch': 1,
       'data': {'regex': '[\\\\.][b][i][z]',
        'weights': [86, 26, 26, 26],
        'side_branches_before': [],
        'side_branches_after': [{'i_branch': 0,
          'data': {'regex': '[i][n][f][o]',
           'weights': [23, 23, 23, 23],
           'side_branches_before': [],
           'side_branches_after': [{'i_branc
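
The serialization is a nested structure: each node has a `regex`, a list of `weights`, and side branches whose `data` is either another node or just a `weight`. The sketch below walks such a structure recursively. The `fragment` is copied by hand from the first element of the output above; its top-level `side_branches_after` is truncated in that output, so it is left empty here rather than guessed. The helpers `iter_regexes` and `branch_weight` are introduced for this illustration, and the reading of the weights as counts of training values is an assumption.

```python
# Fragment copied from the output above; 'side_branches_after' of the
# top-level node is truncated there and therefore left empty here.
fragment = {
    "regex": "[a-z]{3,18}[0-9]{2,2}[@][a-z]{4,9}[\\\\.][c][o][m]",
    "weights": [1000, 259, 259, 791, 746, 705, 705, 705],
    "side_branches_before": [
        {"i_branch": 0, "data": {"weight": 466}},
        {"i_branch": 2, "data": {
            "regex": "[a-z]{4,18}",
            "weights": [533],
            "side_branches_before": [{"i_branch": 0, "data": {"weight": 533}}],
            "side_branches_after": [],
        }},
        {"i_branch": 2, "data": {"weight": 1}},
    ],
    "side_branches_after": [],
}


def iter_regexes(node, depth=0):
    """Yield (depth, regex) for a node and all nested side branches."""
    if "regex" not in node:
        return  # this branch only carries a weight
    yield depth, node["regex"]
    for key in ("side_branches_before", "side_branches_after"):
        for branch in node[key]:
            yield from iter_regexes(branch["data"], depth + 1)


def branch_weight(data):
    """Weight of a branch: its own 'weight', else the first entry of 'weights'."""
    return data["weight"] if "weight" in data else data["weights"][0]


found = list(iter_regexes(fragment))
before_total = sum(branch_weight(b["data"]) for b in fragment["side_branches_before"])

for depth, regex in found:
    print("  " * depth + regex)
print(before_total)
```

In this fragment the three branches before the main regex have weights 466, 533 and 1, which add up to 1000, the number of addresses the model was fitted on.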

We can also check how well the model describes the data it was fitted on:

In [8]:
model.fit_statistics(email_addresses)

philip03@washington.com -1000013.2712598622
huntamber@richardson.biz -1000013.2712272695
richardsonalexander@murray.com -1000013.2712269897
yli@gmail.com -1000013.8905941774
amy54@washington.com -1000014.1218832616
randall21@strickland.org -1000013.2712285216
maria78@ray.com -1000013.272067169
dustincole@lam.com -1000013.2712272678
michelegarrett@copeland-washington.com -1000013.271227268
andersonmaureen@hawkins-wu.info -1000013.271227268
turnertravis@lee.com -1000013.271227268
timothy24@villa-tran.info -1000013.2712285216
kenneth96@cooper-washington.com -1000013.2712285216
dylan80@murray-manning.info -1000013.2720750612
christopher55@orr.net -1000013.271227268
robinsonchristopher@gmail.com -1000013.2712200311
mitchellchristopher@gmail.com -1000013.2712200311
stacey22@burton-kim.net -1000013.2712598622


{'failed': 18,
 'success': 982,
 'n_tot_char': 22023,
 'n_char_success': 21582,
 'n_parameters': 121,
 'avg_log_like_per_char': -2.8483489023604314,
 'avg_log_like_pc_success': -2.7654002316121136}
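
The numbers in this dictionary are easier to read after a little arithmetic. The sketch below copies the values from the output above and derives the fraction of addresses and characters the model can reproduce. It also converts the average log-likelihood per character into an effective number of equally likely choices per character, which assumes that the log-likelihoods are natural logarithms; the source does not state this.

```python
import math

# Values copied from the fit_statistics output above.
stats = {
    "failed": 18,
    "success": 982,
    "n_tot_char": 22023,
    "n_char_success": 21582,
    "n_parameters": 121,
    "avg_log_like_per_char": -2.8483489023604314,
    "avg_log_like_pc_success": -2.7654002316121136,
}

# Fraction of addresses, and of characters, that the model can reproduce.
success_rate = stats["success"] / (stats["success"] + stats["failed"])
char_coverage = stats["n_char_success"] / stats["n_tot_char"]

# Assumption: natural logarithm. exp(-avg log-likelihood) is then the
# effective number of equally likely options per character.
choices_per_char = math.exp(-stats["avg_log_like_pc_success"])

print(f"addresses reproduced: {success_rate:.1%}")
print(f"characters covered:   {char_coverage:.1%}")
print(f"effective choices per character: {choices_per_char:.1f}")
```

For comparison, drawing every character uniformly from the 26 lowercase letters would give 26 choices per character, so under the stated assumption a lower value means the model has learned some structure.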

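To inspect the structure of the fitted model, we can draw it as a graph. This requires the `networkx` package, and `pyvis` for the interactive rendering further down:

In [9]: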
import networkx as nx

In [10]:
from regexmodel.util import Dir

def add_directed_edge(G, link, node_start, node_new):
    """Connect node_new to node_start; links pointing left are drawn reversed."""
    if link.direction == Dir.LEFT:
        G.add_edge(node_new, node_start, group=link.direction.value)
    else:
        G.add_edge(node_start, node_new, group=link.direction.value)


def create_network(G, link, labels, node_start):
    """Recursively add the node behind `link`, and everything below it, to G."""
    cur_i_label = len(labels)  # id of the new node; labels doubles as a counter
    cur_node = link.destination

    # A link without destination, or a node without regex, is drawn as "X".
    if cur_node is None or cur_node.regex is None:
        cur_label = "X"
    else:
        cur_label = cur_node.regex.regex
    G.add_node(cur_i_label, label=cur_label)
    labels.append(cur_label)
    add_directed_edge(G, link, node_start, cur_i_label)

    # Follow all links of the new node (there are none if the link ends here).
    if cur_node is not None:
        for cur_link in cur_node.all_links:
            create_network(G, cur_link, labels, cur_i_label)
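
To see what this recursion produces without a fitted model, the same traversal can be run on a tiny hand-built tree. Everything in this sketch is a stand-in: the `Dir`, `Node` and `Link` classes below only mimic the attributes used above (`direction`, `destination`, a list of links), their enum values are assumptions, and edges are collected in a plain list instead of a `networkx` graph.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Dir(Enum):
    """Stand-in for regexmodel.util.Dir; the values are assumptions."""
    LEFT = 0
    RIGHT = 1


@dataclass
class Node:
    """Stand-in node: a label plus its outgoing links."""
    label: str
    links: list = field(default_factory=list)


@dataclass
class Link:
    """Stand-in link: a direction and an optional destination node."""
    direction: Dir
    destination: Optional[Node] = None


def collect_edges(link, parent_id, labels, edges):
    """Same traversal as create_network, writing edges to a list."""
    node_id = len(labels)
    labels.append("X" if link.destination is None else link.destination.label)
    # Links pointing left are stored reversed, so arrows follow reading order.
    if link.direction == Dir.LEFT:
        edges.append((node_id, parent_id))
    else:
        edges.append((parent_id, node_id))
    if link.destination is not None:
        for child in link.destination.links:
            collect_edges(child, node_id, labels, edges)


# A toy tree: "[@]" with one branch to the left and one to the right.
at_sign = Node("[@]", links=[
    Link(Dir.LEFT, Node("[a-z]{3,18}")),
    Link(Dir.RIGHT, Node("[a-z]{4,9}", links=[Link(Dir.RIGHT)])),
])

labels, edges = ["start"], []
collect_edges(Link(Dir.RIGHT, at_sign), 0, labels, edges)
print(labels)
print(edges)
```

The edge for the left-pointing link comes out as `(2, 1)` rather than `(1, 2)`, which is the reversal that the `Dir.LEFT` branch implements.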
    

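Starting from a single start node, add every root link of the model to a directed graph:

In [17]:
G = nx.DiGraph()
labels = ["start"]  # labels[i] is the label of node i
G.add_node(0, label="start", group=2)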
for link in model.root_links:
    create_network(G, link, labels, 0)

Finally, render the graph as an interactive HTML page with pyvis:

In [18]:
from pyvis.network import Network

net = Network('1000px', '1000px', notebook=True, directed=True)
net.from_nx(G)
net.show("regex.html")

regex.html

