# Tutorial on regexmodel

## Setup and installation

If you haven't installed `regexmodel` yet, including the optional dependencies, do so now:

In [1]:
# %pip install git+https://github.com/sodascience/regexmodel.git[tutorial]

Normally we would already have data that we want to model and synthesize, but for this tutorial we will use the faker package to generate that data for us. We will use fake email addresses.

In [2]:
from faker import Faker

fake = Faker("en")
Faker.seed(12345)
email_addresses = [fake.ascii_email() for _ in range(1000)]
email_addresses[:10]

['wilkinsleah@yahoo.com',
 'ashlee25@sawyer.net',
 'michaelrubio@schmidt.biz',
 'jamiemiller@morris-ellis.com',
 'smithelizabeth@gmail.com',
 'cory61@lane.net',
 'hallmorgan@gmail.com',
 'brianna82@petty.com',
 'michaelpeterson@morris-williams.info',
 'lvang@barton.com']

## Modeling the structured strings

Now we will use the regexmodel package to model the data:

In [3]:
from regexmodel import RegexModel

model = RegexModel.fit(email_addresses)

Let's first see how good the model is by synthesizing new email addresses:

In [4]:
[model.draw() for _ in range(10)]

['brvc61@ocojvbiar.com',
 'hszltapze22@rewzb.com',
 'hwotequvt@duy.com',
 'dkimkedyjmzanzto60@ecm.com',
 'bbohcirqvmivp16@ass.com',
 'jdidhaidejznuxwbsep81@nwpqhuniah.dng',
 'huweuxbhsev83@kfbhetug-oybvj.com',
 'kcqpgtysdw74@bvndjdf.com',
 'txxbe72@ffzdykfrf.com',
 'xakqql@ncsdfyflv.com']

While the result is not perfect, it is not bad either, given that the model has only seen positive examples!

Now let's look at the regex that was fitted:

In [5]:
model.regex

'[a-z]{3,19}((|[0-9]{2})@[a-z]{3,9}(|\\-[a-z]{4,9})\\.(com|[b-o]{1,4}(|[zta-z]{1,2}))|[0-9]{2}@[a-z]{10}\\.[a-z]{3})'

## Modeling performance

There are also some modeling statistics that can be computed. Note that computing these can take a while depending on your computer.

In [6]:
model.fit_statistics(email_addresses)

{'failed': 5,
 'success': 995,
 'n_tot_char': 22023,
 'n_char_success': 21876,
 'n_parameters': 35,
 'avg_log_like_per_char': -2.9062940557741594,
 'avg_log_like_pc_success': -2.8794054655468972}

The `fit_statistics` method retraces whether an email address that is given to it (e.g. johndoe@example.com) has a non-zero probability of being generated by the regex model. As we can see above, there were 5 email addresses in the list that have a probability of 0 of being generated by the model, while the overwhelming majority (995) can be generated with the fitted model.
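As a rough illustration of what "can be generated" means, a value can be checked against the fitted regular expression with Python's standard `re` module. The sketch below copies the regex printed by `model.regex` above and assumes that a full match is a reasonable stand-in for a non-zero generation probability. It does not call `fit_statistics` itself, and the helper name `can_be_generated` is made up for this example.

```python
import re

# Regex copied from the `model.regex` output above. A raw string is used,
# so the doubled backslashes shown in the repr become single backslashes.
FITTED_REGEX = re.compile(
    r"[a-z]{3,19}((|[0-9]{2})@[a-z]{3,9}(|\-[a-z]{4,9})\."
    r"(com|[b-o]{1,4}(|[zta-z]{1,2}))|[0-9]{2}@[a-z]{10}\.[a-z]{3})"
)


def can_be_generated(value: str) -> bool:
    """Hypothetical helper: True if the whole value matches the fitted regex."""
    return FITTED_REGEX.fullmatch(value) is not None


for address in ["johndoe@example.com", "jd@example.com", "John.Doe@example.com"]:
    print(address, can_be_generated(address))
```

The second address fails because the fitted regex requires at least three lowercase letters before the `@`, and the third fails because uppercase letters and dots never appear in that position in the regex.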

The value `n_parameters` gives the number of nodes in the model, and is thus an indicator of the complexity of the model. A higher number of parameters also tends to go together with a longer fitting time. We can influence this number during fitting by setting the `count_thres` parameter. If we set that threshold higher, we generally get a lower number of parameters and therefore a faster fit.

The statistic `avg_log_like_per_char` (average log-likelihood per character) shows how probable a value is on average per character. To understand this better, let's take a simpler example, where the regex is simply `\d{2,2}`. For this regex, the log-likelihood is simply log(1/10\*1/10) = -2\*log(10). Since all values have 2 characters, the average log-likelihood per character is -log(10) ~= -2.30. For failed values (values that cannot be generated by the model), we use a penalty score of -log(1000) per character.
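The arithmetic above can be reproduced with the standard `math` module. The sketch below first recomputes the two-digit example, then checks how the failure penalty enters the overall score. The second part rests on an assumption that is not stated in the text: that `avg_log_like_per_char` is the character-weighted combination of `avg_log_like_pc_success` and the -log(1000) penalty. The numbers are copied from the `fit_statistics` output above.

```python
import math

# Regex \d{2,2}: each of the two digits is one of 10 equally likely options.
log_like = math.log(1 / 10 * 1 / 10)   # equals -2 * log(10)
per_char = log_like / 2                # equals -log(10)

# Values copied from the fit_statistics output above.
n_tot_char = 22023
n_char_success = 21876
avg_log_like_pc_success = -2.8794054655468972
penalty_per_char = -math.log(1000)

# Assumption: failed values contribute the penalty for each of their characters.
n_char_failed = n_tot_char - n_char_success
combined = (
    avg_log_like_pc_success * n_char_success + penalty_per_char * n_char_failed
) / n_tot_char

print(round(per_char, 2), round(combined, 4))
```

Under that assumption, the combined value agrees with the reported `avg_log_like_per_char` of roughly -2.906, which suggests the 147 characters belonging to failed values are what pull the overall score slightly below the success-only score.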

Ideally we want to have the lowest `n_parameters` (simplest model) with the highest `success` and the highest log-likelihood.

## Visualization

To understand more clearly what the graph looks like, we can plot the regex model using the `regex_model_to_pyvis` function. To retrace the paths that can be taken, first find the start node and follow the path from there.

Note: PyVis doesn't work interactively in VSCode/Code OSS.

In [7]:
from regexmodel.visualization import regex_model_to_pyvis

net = regex_model_to_pyvis(model)
net.show("regex.html", notebook=True)

regex.html
