First we need to import the data profiler:

In [None]:
from data_profile import DataProfile

Then load the data into the profiler, which will give us a summary of the data set:

In [None]:
data_src = 'data/sample_data.csv'
dp = DataProfile(data_src)

We can then describe all attributes:

In [None]:
dp.describe()

Or we can describe a single attribute, optionally overriding the default number of rows shown in the frequency table:

In [None]:
dp.describe('current_age', 10)
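
As a rough illustration of what a frequency table with a row limit looks like, here is a minimal standard-library sketch. The `frequency_table` helper and the sample ages are hypothetical and are not part of `DataProfile`; how the profiler builds its own table is not shown here.

```python
from collections import Counter


def frequency_table(values, rows=5):
    """Hypothetical helper: the `rows` most common values with their counts."""
    return Counter(values).most_common(rows)


# Made-up sample values, for illustration only.
ages = [34, 27, 34, 45, 27, 34, 61]
top_two = frequency_table(ages, rows=2)
print(top_two)
```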

Ages should be non-negative (and probably whole numbers), so we can check their validity:

In [None]:
dp.int_validation('current_age', minimum=0)
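
To make the idea concrete, the following sketch shows one way an integer check with optional bounds could work. It is an assumption about the general technique, not the actual implementation of `int_validation`; the `is_valid_int` helper and the sample values are hypothetical.

```python
def is_valid_int(value, minimum=None, maximum=None):
    """Hypothetical check: value parses as a whole number within optional bounds."""
    try:
        number = int(str(value).strip())
    except ValueError:
        # Not a whole number (e.g. '27.5', 'abc', or an empty string).
        return False
    if minimum is not None and number < minimum:
        return False
    if maximum is not None and number > maximum:
        return False
    return True


# Made-up sample values, for illustration only.
ages = ["34", "0", "-5", "27.5", "abc", ""]
invalid_ages = [a for a in ages if not is_valid_int(a, minimum=0)]
print(invalid_ages)
```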

The result of the most recent validation of an attribute is now reported as part of that attribute's data quality:

In [None]:
dp.describe('current_age')

Wrappers are also provided for the pandas methods head, tail, and sample, to view a selection of records:

In [None]:
dp.head()

For each wrapper we can also override the default number of records to show:

In [None]:
dp.tail(10)

We can also isolate a particular attribute:

In [None]:
dp.sample(20, 'birth_date')

The dates appear to be formatted as day/month/year, so we can check their validity by specifying an appropriate date format string. At the same time, we can restrict the date range to dates that fall in the past:

In [None]:
dp.datetime_validation('birth_date', dt_format='%d/%m/%Y', to_dt='6/3/2022')
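
The sketch below illustrates the same kind of check using only the standard library: parse each value with the format string and reject anything later than an upper bound. The `is_valid_date` helper is hypothetical, and it assumes the upper bound is written in the same format as the data, which may not match how `datetime_validation` interprets `to_dt`.

```python
from datetime import datetime


def is_valid_date(value, dt_format, to_dt=None):
    """Hypothetical check: value matches dt_format and is not after to_dt."""
    try:
        parsed = datetime.strptime(value, dt_format)
    except (ValueError, TypeError):
        # Wrong format, impossible date, or a non-string value.
        return False
    # Assumption: the bound uses the same format string as the data.
    if to_dt is not None and parsed > datetime.strptime(to_dt, dt_format):
        return False
    return True


# Made-up sample values, for illustration only.
dates = ["25/12/1990", "12/25/1990", "31/02/2000", "01/01/2030"]
results = {d: is_valid_date(d, "%d/%m/%Y", to_dt="6/3/2022") for d in dates}
print(results)
```

Here only the first value passes: the second has a month of 25, the third is an impossible date, and the fourth falls after the upper bound.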

Convenience functions are provided for email and IP address validation:

In [None]:
dp.email_validation('email')

We can also change the sample size of the invalid values to show:

In [None]:
dp.ip_validation('ip_address', 10)
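
For comparison, IP addresses can also be checked without a regular expression. The sketch below uses Python's standard `ipaddress` module, which is a different technique from the profiler's own validator; the `is_valid_ip` helper and the sample values are hypothetical.

```python
import ipaddress


def is_valid_ip(value):
    """Hypothetical check: value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
    except ValueError:
        return False
    return True


# Made-up sample values, for illustration only.
addresses = ["192.168.0.1", "256.1.1.1", "1.2.3", "::1", "not an ip"]
invalid_addresses = [a for a in addresses if not is_valid_ip(a)]
print(invalid_addresses)
```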

These convenience functions are just wrappers for a more general regular expression validator. For example, consider the 'rec_id' attribute:

In [None]:
dp.sample(20, 'rec_id')

The values appear to start with the letter 'R' followed by 6 digits, so we can validate this using a regular expression:

In [None]:
dp.regex_validation('rec_id', r'^[R][0-9]{6}$')
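
To see which values this pattern accepts, it can be tried directly with Python's `re` module. This is a standalone sketch with made-up record IDs; it assumes the whole value must match, which may differ from how `regex_validation` applies the pattern internally.

```python
import re

# Same pattern as above: the letter 'R' followed by exactly 6 digits.
pattern = re.compile(r'^[R][0-9]{6}$')

# Made-up sample values, for illustration only.
rec_ids = ["R123456", "R12345", "r123456", "R1234567", "X123456"]

# fullmatch requires the entire string to match the pattern.
invalid_ids = [r for r in rec_ids if not pattern.fullmatch(r)]
print(invalid_ids)
```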

There's also a string validation tool:

In [None]:
dp.string_validation('first_name')

By default it only checks that the attribute contains nothing but letters. It can be adjusted to treat other character sets as valid or invalid, and to check that the string length falls within a particular range:

In [None]:
dp.string_validation('phone', letters=False, digits=True, whitespace=True, min_length=10)

Or for the 'postcode' attribute:

In [None]:
dp.string_validation('postcode', letters=False, digits=True, min_length=3, max_length=4)
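
A check of this kind can be sketched as a set of allowed characters plus length bounds. The `is_valid_string` helper below is hypothetical: its parameter names mirror the calls above, but its behaviour (ASCII letters and digits only, for example) is an assumption and not the profiler's actual implementation.

```python
import string


def is_valid_string(value, letters=True, digits=False, whitespace=False,
                    min_length=None, max_length=None):
    """Hypothetical check: only allowed character sets, length within bounds."""
    allowed = set()
    if letters:
        allowed |= set(string.ascii_letters)
    if digits:
        allowed |= set(string.digits)
    if whitespace:
        allowed |= set(string.whitespace)
    if min_length is not None and len(value) < min_length:
        return False
    if max_length is not None and len(value) > max_length:
        return False
    return all(ch in allowed for ch in value)


# Made-up sample values, for illustration only ('2O00' contains the letter O).
postcodes = ["2000", "800", "20", "20000", "2O00"]
valid_postcodes = [
    p for p in postcodes
    if is_valid_string(p, letters=False, digits=True, min_length=3, max_length=4)
]
print(valid_postcodes)
```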

Though, given that Australian postcodes all run from '0200' to '9999', a better way might be to use the integer validation:

In [None]:
dp.int_validation('postcode', minimum=200, maximum=9999)

The most recent validity figures can also be found in the summary:

In [None]:
dp.summary()

Finally, we can save our profile to disk:

In [None]:
dp.save()

Refer to the code documentation for the full list of available methods and their arguments:

In [None]:
help(dp)

Feel free to use, improve, adapt...

If you come across any bugs or have any suggestions, by all means let me know or, better yet, make the changes and submit a merge request.

Happy exploring!