# TDD for data with pytest
<br><br>
TDD is great for software engineering, but it can also add a lot of speed and quality to data science projects.


We'll learn how to use TDD to save time and to quickly improve functions that extract and process data.


# About me

**Chris Brousseau**

*Surface Owl - Founder & Data Scientist*
<br>
*Pragmatic AI Labs - Cloud-Native ML*
<br>

<br>
Prior work at Accenture & USAF
<br>
Engineering @Boston University

<img src="data/images/detective_and_murderer.jpg" alt="Filipe Fortes circa 2013" style="width: 1500px;">

# 0 - Problem to solve
- speed up development & improve quality on data science projects
<br><br>

# Two main cases
  1. test tidy input *(matrix: columns = variables, rows = observations)*  **(tidy data != clean data)**
<br><br>
  2. test ingest/transformations of complex input *(creating tidy & clean data)*
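
The difference between tidy and clean can be sketched in plain Python. This is a rough proxy only, assuming rows are dicts; the records and the helper names `is_tidy` and `unclean_years` are made up for illustration. A table can have a consistent shape and still hold values that need cleaning.

```python
# Hypothetical records: one observation per row, one variable per key.
rows = [
    {"title": "Example A", "year": 2000},
    {"title": "Example B", "year": "2001?"},   # tidy shape, but the value is not clean
]


def is_tidy(rows):
    """Rough proxy for tidy: every row exposes exactly the same variables."""
    columns = set(rows[0])
    return all(set(row) == columns for row in rows)


def unclean_years(rows):
    """Return the rows whose 'year' value is not an int."""
    return [row for row in rows if not isinstance(row["year"], int)]


print(is_tidy(rows))         # the shape is consistent
print(unclean_years(rows))   # one row still needs cleaning
```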


# Our Objectives
- Intro TDD (Test Driven Development)
- Learn about two packages: pytest & datatest 
  1. For tidy data - *see datatest in action* 
  2. For data engineering - *see TDD for complex input*
- Understand when not to use TDD
- Get links to Resources

### Why TDD?
<img src=data/images/debugging_switches.200w.webp style="height: 700px;"/>

### What is TDD
- process for software development
- **themes:** intentional -> small -> explicit -> automated


### How does it work?

- confirm requirements
- write a failing test (against ONLY these requirements!)
- write code to pass the test (keep it small)
- refactor
- retest
- automate
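
One pass through this loop can be sketched as follows, assuming a made-up requirement: column names must be trimmed, lower-case, and use underscores instead of spaces. The function name `clean_column_name` and the sample inputs are illustrative and not part of the talk's repo.

```python
# Steps 1-2: the test is written first, against the stated requirement only.
# Run at this point, it fails because clean_column_name does not exist yet.
def test_clean_column_name():
    assert clean_column_name("  School Name ") == "school_name"
    assert clean_column_name("URL") == "url"


# Step 3: the smallest implementation that makes the test pass.
def clean_column_name(raw):
    """Trim a column name, lower-case it, and join its words with underscores."""
    return "_".join(raw.strip().lower().split())


# Steps 4-6: refactor, then rerun. pytest would collect this test automatically
# because its name starts with `test`; here it is simply called directly.
test_clean_column_name()
```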

### Why TDD?

1. focus on requirements and outcomes first

2. save time debugging

3. boost confidence in your code

4. improve refactoring - *speed and confidence*

5. encourage "clean code" - *SRP, organization*

6. speed up onboarding new team members - *read 1K lines, or a test?*


### Why TDD for data?
1. all the above
2. save you time
3. confidence in pipeline

# Relevant Packages:  pytest & datatest
<br><br>
**pytest:**
framework for writing and running tests
- available on PyPI (`pip install pytest`)
- auto-discovery of your tests (prefix `test` on files, classes & functions)
- runs unittest tests (nose tests too, on pytest versions before 8.0)
- detailed info on failures
- useful plugins (coverage, aws, selenium, databases, etc)
- [Human-readable usage here](https://gist.github.com/kwmiebach/3fd49612ef7a52b5ce3a)
<br><br>
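
Auto-discovery depends only on naming. The sketch below shows what pytest would and would not collect from a hypothetical file named `test_shapes.py`; the function `row_count` is an invented example of code under test.

```python
# contents of a hypothetical file named test_shapes.py


def row_count(rows):
    """Code under test: count the observations in a table."""
    return len(rows)


def test_row_count():                 # collected: function name starts with `test`
    assert row_count([{"a": 1}, {"a": 2}]) == 2


class TestRowCount:                   # collected: class name starts with `Test`
    def test_empty_table(self):       # collected: method name starts with `test`
        assert row_count([]) == 0


def helper_not_collected():           # ignored: no `test` prefix
    return "pytest does not run this"
```

Running `pytest -v` in the directory holding this file would list the two collected tests by name.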

**datatest:**
helps speed up and formalize data-wrangling and data validation tasks

- available on PyPI (`pip install datatest`)
- Test data pipeline components and end-to-end behavior

# 1- TDD for tidy data


### datatest deets!

- *core functions:*
    1. validation
    2. error reporting
    3. acceptance declarations
    <br><br>
- built-in classes for selecting, querying, and iterating over the data under test
- both pytest & unittest styles


- works with Pandas
- has Pandas-like syntax
- useful for pipelines


- https://github.com/shawnbrown/datatest
- https://datatest.readthedocs.io/en/stable/index.html


#### datatest - what does it do for you?

- **validation:**  check that raw data meets requirements you specify
    - columns exist
    - values belong to a specific set, range, or type
    - order and sequences match at a specific index or mapping
    - fuzzy matching

- **compute differences** between inputs & test conditions

- **acceptances** - based on differences
    - tolerance - absolute
    - tolerance - percentage
    - fuzzy, others
    - composable - construct acceptance criteria based on *intersection of lower-level datatest acceptances*

- **all in a test framework**

**[link: validate docs](https://datatest.readthedocs.io/en/stable/reference/datatest-core.html#datatest.validate)**
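
To make the validation, differences, and acceptance ideas concrete, here is a plain-Python sketch that does not call datatest at all. The function names `find_deviations` and `apply_tolerance` and the numbers are invented for illustration; datatest's own API is described in the docs linked above.

```python
def find_deviations(data, requirement):
    """Validation step: map each key to (observed - required) where they differ."""
    return {
        key: data[key] - expected
        for key, expected in requirement.items()
        if data[key] != expected
    }


def apply_tolerance(deviations, tolerance):
    """Acceptance step: drop differences within the absolute tolerance, keep the rest."""
    return {key: dev for key, dev in deviations.items() if abs(dev) > tolerance}


requirement = {"A": 100, "B": 200, "C": 300}   # what the data should contain
data = {"A": 100, "B": 203, "C": 320}          # what the data actually contains

deviations = find_deviations(data, requirement)   # {'B': 3, 'C': 20}
unaccepted = apply_tolerance(deviations, 5)       # {'C': 20}

# A test would fail only on the differences that were not accepted.
print(deviations)
print(unaccepted)
```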

### Example 0 - datatest cases

#### sources:
-  https://datatest.readthedocs.io/en/stable/tutorial/dataframe.html
<br><br>
- https://github.com/moshez/interactive-unit-test/blob/master/unit_testing.ipynb

In [None]:
# setup - thank you Moshe!
import unittest

def test(klass):
    loader = unittest.TestLoader()
    # suite=loader.loadTestsFromTestCase(klass) # original
    suite=loader.loadTestsFromModule(klass) # to work with datatest example
    runner = unittest.TextTestRunner()
    runner.run(suite)

# other helpful setup
# ipytest - https://github.com/chmp/ipytest
import ipytest
import ipytest.magics
# enable ipytest's magics; pytest's assertion rewriting stays off (rewrite_asserts=False)
ipytest.config(rewrite_asserts=False, magics=True)


# load datatest example
import pandas as pd
df = pd.read_csv("./data/test_datatest/movies.csv")
df.head(5)


In [None]:
# %load tests/test_01_datatest_movies_df_unit
#!/usr/bin/env python
import pandas as pd
import datatest as dt
import os


def setUpModule():
    global df
    print(os.getcwd())
    df = pd.read_csv('data/test_datatest/movies.csv')


class TestMovies(dt.DataTestCase):
    @dt.mandatory
    def test_columns(self):
        self.assertValid(
            df.columns,
            {'title', 'rating', 'year', 'runtime'},
        )

    def test_title(self):
        self.assertValidRegex(df['title'], r'^[A-Z]')

    def test_rating(self):
        self.assertValidSuperset(
            df['rating'],
            {'G', 'PG', 'PG-13', 'R', 'NC-17'},
        )

    def test_year(self):
        self.assertValid(df['year'], int)

    def test_runtime(self):
        self.assertValid(df['runtime'], int)

test(TestMovies())


In [None]:
# what is going on with our original data?
df_original = pd.read_csv('data/test_datatest/movies.csv')
df_original.iloc[7:11, :]

In [None]:
# fix the bad data - through pipeline or manually
df_fixed = pd.read_csv('data/test_datatest/movies_fixed.csv')
df_fixed.iloc[7:11, :]

In [None]:
# clear existing test objects in jupyter notebook - similar to reset
%reset_selective -f df 
%reset_selective -f TestMovies

In [None]:
# fixed data - rerun tests
def setUpModule():
    global df
    print(os.getcwd())
    df = pd.read_csv('data/test_datatest/movies_fixed.csv')  # note new source


class TestMovies(dt.DataTestCase):
    @dt.mandatory
    def test_columns(self):
        self.assertValid(
            df.columns,
            {'title', 'rating', 'year', 'runtime'},
        )

    def test_title(self):
        self.assertValidRegex(df['title'], r'^[A-Z]')

    def test_rating(self):
        self.assertValidSuperset(
            df['rating'],
            {'G', 'PG', 'PG-13', 'R', 'NC-17'},
        )

    def test_year(self):
        self.assertValid(df['year'], int)

    def test_runtime(self):
        self.assertValid(df['runtime'], int)

test(TestMovies())

# 2 - TDD for data engineering


### Example 1 - finding urls in excel

- url test case
- multiple url test case which breaks prior tests
- regex101.com illustration (edit function to make tests pass)
  https://regex101.com/
- final regex to rule them all




#### Sample data (under /data/test_cais) - needs transformation

|example 1 |~ |example 2 |
|:--- |:--- |:---|
| <img src="data/images/excel_sample2013.png" alt="excel example 1" style="height: 900px;"> | ...|<img src="data/images/excel_sample2018.png" alt="excel example 2" style="height: 900px;"> |
    

In [None]:
# %load tests/test_02_cais_find_single_url
"""
test functions to find url in cell content from an excel worksheet

functions below have "do_this_later_" prefix to prevent tests from running during early part of talk
remove prefix as we walk through examples, and re-run tests
"""
from src.excel_find_url import find_url


def test_find_single_url():
    """
    unit test to find url in a single text string
    :return: None
    """
    # the find_url function we are testing takes cell content as a string, and current results dict
    # pass an empty results dict, so no existing value is found
    result = {}

    # inputs we expect to pass
    input01 = "Coeducational Boarding/Day School Grades 6-12; Enrollment 350 www.prioryca.org"

    # declare result we expect to find here
    assert find_url(input01, result) == "www.prioryca.org"


In [None]:
# %load src/excel_find_url.py
# %%writefile src/excel_find_url.py


import re
from src.excel_read_cell_info import check_if_already_found

def find_url(content, result):
    """
    finds url of school if it exists in cell
    :param content: cell content from spreadsheet
    :type content: string
    :param result: dict of details on current school
    :type result: dict
    :return: url
    :rtype: str or None
    """
    if check_if_already_found("url", result):
        return result['url']

    # different regex to use during python talk
    # https://regex101.com

    # regex = re.compile(r"w{3}.*", re.IGNORECASE)
    # regex = re.compile(r"(http|https):\/\/.*", re.IGNORECASE)  # EDIT THIS LIVE

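    # note: inside the first character class, `:-_` is a range (':' through '_'),
    # so it also admits characters such as ';', '<', '>' and '@'
    regex = re.compile(
        r"((http|https):\/\/)?[a-zA-Z0-9.\/?::-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9..\/&\/\-_=#])*",
        re.IGNORECASE)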

    # str() guarantees a string input, so no TypeError handling is needed
    match = regex.search(str(content))

    if match:
        return match.group().strip()
    return None


In [None]:
%ls "tests/"

In [None]:
test02 = "tests/test_02_cais_find_single_url.py"

__file__  = test02

ipytest.clean_tests()
ipytest.config.addopts=['-v']
# ['-k test_03_cais_find_https_url.py']

ipytest.run()

In [None]:
# %load tests/test_03_cais_find_https_url.py
"""
test functions to find url in cell content from an excel worksheet

functions below have "do_this_later_" prefix to prevent tests from running during early part of talk
remove prefix as we walk through examples, and re-run tests
"""
from src.excel_find_url import find_url


def test_find_https_url():
    """
    unit test multiple strings for urls in bulk - rather than separate test functions for each
    one way to rapidly iterate on your code, nicely encapsulates similar cases

    requires editing REGEX in excel_find_url.find_url to make this test pass
    """
    result = {}

    # inputs we expect to pass
    input01 = "Coed Boarding/Day School Grades 6-12; Enrollment 350 http://www.prioryca.org"
    input02 = "https://windwardschool.org"

    assert find_url(input01, result) == "http://www.prioryca.org"
    assert find_url(input02, result) == "https://windwardschool.org"
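
The bulk style used in this test can also be written as a table of (input, expected) pairs. The sketch below is self-contained, so it uses a simplified stand-in named `find_url_sketch` (an assumption, not the repo's `find_url`) built on the intermediate `(http|https)://` pattern. With pytest installed, `pytest.mark.parametrize` offers the same table with one reported result per case.

```python
import re

# Simplified stand-in for the function under test; the pattern is the
# intermediate `(http|https)://` regex, not the final one.
URL_PATTERN = re.compile(r"(http|https)://.*", re.IGNORECASE)


def find_url_sketch(content):
    """Return the first URL-like match in the cell content, or None."""
    match = URL_PATTERN.search(str(content))
    return match.group().strip() if match else None


# Each row is one case: (cell content, expected result).
CASES = [
    ("Coed Boarding/Day School Grades 6-12; Enrollment 350 http://www.prioryca.org",
     "http://www.prioryca.org"),
    ("https://windwardschool.org", "https://windwardschool.org"),
    ("no link in this cell", None),
]


def test_find_url_cases():
    for content, expected in CASES:
        # the message names the failing input, since one function covers all cases
        assert find_url_sketch(content) == expected, content


test_find_url_cases()
```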


### Switch to PyCharm

### [regex101](https://regex101.com)

    w{3}.*

    (http|https):\/\/.*

    ((http|https):\/\/)?[a-zA-Z0-9.\/?::-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9..\/&\/\-_=#])*
    
    www.prioryca.org
    http://www.prioryca.org
    https://prioryca.org
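
The same progression can be checked outside regex101 with Python's `re` module. This sketch runs each of the three patterns above against the three sample strings and records what `re.search` returns; the labels `www_only`, `scheme_only`, and `final` and the helper `first_match` are invented here.

```python
import re

PATTERNS = {
    "www_only": r"w{3}.*",
    "scheme_only": r"(http|https):\/\/.*",
    "final": r"((http|https):\/\/)?[a-zA-Z0-9.\/?::-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9..\/&\/\-_=#])*",
}
SAMPLES = ["www.prioryca.org", "http://www.prioryca.org", "https://prioryca.org"]


def first_match(pattern, text):
    """Return the text of the first match, or None when nothing matches."""
    match = re.search(pattern, text, re.IGNORECASE)
    return match.group() if match else None


for label, pattern in PATTERNS.items():
    results = [first_match(pattern, sample) for sample in SAMPLES]
    # a pattern "passes" only if it returns every sample unchanged
    print(label, results == SAMPLES, results)
```

Only the last pattern returns all three samples unchanged. The first drops the `http://` prefix and finds nothing in `https://prioryca.org`, and the second finds nothing in `www.prioryca.org`.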


### Switch to PyCharm

### Example 2 - finding names & use of supplementary data summaries

- use of expected results file bundled with data as pytest input
- structured discovery of edge cases

- **objective:** find school names in messy excel document
- **strategy:** find names by finding specific formats - removing stopwords & addresses
- **test goals:** confirm code finds same # of names as we do manually
- **test approach:** summarize names manually in new tab, *then test code results vs. manual results*
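
A minimal sketch of this approach, assuming the manual summary is stored as CSV text rather than an Excel tab. The table numbers, counts, and school names are invented, and `load_expected_counts` and `count_mismatches` are hypothetical stand-ins, not the talk's helpers.

```python
import csv
import io

# Manual summary bundled with the data: one row per table, counted by hand.
EXPECTED_CSV = """table_num,name_count
10,3
11,2
"""

# Stand-in for the names the extraction code found, keyed by table number.
FOUND_NAMES = {
    10: ["Alpha School", "Beta Academy", "Gamma Prep"],
    11: ["Delta School"],            # one name short of the manual count
}


def load_expected_counts(csv_text):
    """Read the manual summary into {table_num: expected_count}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {int(row["table_num"]): int(row["name_count"]) for row in reader}


def count_mismatches(found, expected):
    """Return {table_num: (found_count, expected_count)} where they disagree."""
    return {
        table: (len(found.get(table, [])), count)
        for table, count in expected.items()
        if len(found.get(table, [])) != count
    }


expected = load_expected_counts(EXPECTED_CSV)
mismatches = count_mismatches(FOUND_NAMES, expected)
print(mismatches)   # each entry points at a table worth inspecting for edge cases
```

Keeping the expected counts in a file next to the data means new edge cases are added by editing the summary, not the test code.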


#### Recall our data (under /data/test_cais) - needs transformation

|example 1 |~ |example 2 |
|:--- |:--- |:---|
| <img src="data/images/excel_sample2013.png" alt="excel example 1" style="height: 900px;"> | ...|<img src="data/images/excel_sample2018.png" alt="excel example 2" style="height: 900px;"> |

#### Review input files (/data/test_cais)
<br><br>
<img src="data/images/excel_summarize_expected_results.png" alt="excel summary of expected results" style="height: 900px;">

In [None]:
"""
tests focused on ability to pull all the names from a cais excel file
"""
# note: common_search is defined elsewhere and is not shown in this cell


def test_find_2013_cais_name_table10():
    """
    test finding names in first member schools tab
    test function to dynamically look up names vs. expected result from separate file
    :return: None
    """
    test_file = "School_Directory_2013-2014-converted.xlsx"
    results_file = "cais_name_counts_manual_2013-2014.xlsx"
    table_num = 10

    found_in_table_10, expected_in_table_10 = common_search(test_file, results_file, table_num)

    assert found_in_table_10 == expected_in_table_10


#### Data Driven transformation accuracy (/data/test_cais)
<br><br>
<img src="data/images/test_results.excel_table_accuracy.png" alt="dynamic input testing">


# 3 - When not to use TDD for data?

- EDA
- quick prototypes
- data source is complete & managed
- cost / time >> benefits


# 4 - Resources

**this talk:**  https://github.com/surfaceowl/pythontalk_tdd_for_data
<br><br>
**pytest**

[pytest on pypi: https://pypi.org/project/pytest/](https://pypi.org/project/pytest/)

[pytest docs: https://docs.pytest.org/en/latest/](https://docs.pytest.org/en/latest/)
<br><br>
**ipytest**

[ipytest pypi](https://pypi.org/project/ipytest/)

[ipytest github](https://github.com/chmp/ipytest)
<br><br>
**datatest**

[datatest on pypi: https://pypi.org/project/datatest/](https://pypi.org/project/datatest/)

[datatest on github: https://github.com/shawnbrown/datatest](https://github.com/shawnbrown/datatest)

[datatest docs: https://datatest.readthedocs.io/en/stable/](https://datatest.readthedocs.io/en/stable/)

**TDD for data**

[towards data science article](https://towardsdatascience.com/tdd-datascience-689c98492fcc)


# Recap: Our Objectives

- Intro TDD (Test Driven Development)
- Learned about pytest & datatest 
- Saw testing in action for:
  1. tidy data
  2. transformation / data engineering
- Understand when not to use TDD
- Have links to Resources

# END

### setup notes

- create and activate a venv, then run `pip install -r requirements.txt` followed by `pip install -e .`
- run pytest from the terminal; you must be in the tests dir
- PyCharm setup: set the test runner to pytest


### resources

https://nbviewer.jupyter.org/github/agostontorok/tdd_data_analysis/blob/master/TDD%20in%20data%20analysis%20-%20Step-by-step%20tutorial.ipynb#Step-by-step-TDD-in-a-data-science-task

http://www.tdda.info/

fix pytest Module not found
https://medium.com/@dirk.avery/pytest-modulenotfounderror-no-module-named-requests-a770e6926ac5


#### regex
https://regex101.com/


In [None]:
# Notes
# writefile must be in first line
# %%writefile src/excel_find_url.py

# javascript commands to set var -- must be in separate cell
%%javascript
IPython.notebook.kernel.execute('nb_name = "' + IPython.notebook.notebook_name + '"')

In [None]:
import ipytest
import ipytest.magics
ipytest.config(rewrite_asserts=True, magics=True)

# __file__ = "INSERT YOUR NOTEBOOK FILENAME HERE"
__file__ = nb_name

In [None]:
# ipytest usage
# https://github.com/chmp/ipytest/blob/master/Example.ipynb

# %%run_pytest[clean] -qq
ipytest.clean_tests()
ipytest.run('-qq')

In [None]:
%%javascript
IPython.notebook.kernel.execute('nb_name = "' + IPython.notebook.notebook_name + '"')
IPython.notebook.kernel.execute('test00 = "' + "tests/test_00_simple_pytest_example.py" + '"')
IPython.notebook.kernel.execute('test01 = "' + "tests/test_01_datatest_movies_df_unit.py" + '"')
IPython.notebook.kernel.execute('test02 = "' + "tests/test_02_cais_find_single_url.py" + '"')
IPython.notebook.kernel.execute('test03 = "' + "tests/test_03_cais_find_https_url.py" + '"')
IPython.notebook.kernel.execute('test04 = "' + "tests/test_04_cais_find_multi_url.py" + '"')
IPython.notebook.kernel.execute('test05 = "' + "tests/test_05_cais_name_count_2013.py" + '"')
IPython.notebook.kernel.execute('test06 = "' + "tests/test_06_cais_name_count_2018.py" + '"')