<img src="data/images/detective_and_murderer.jpg" alt="Filipe Fortes circa 2013" style="width: 1500px;">

# 0 - TDD for data with pytest

TDD is great for software engineering, but did you know it can also add speed and quality to data science projects?

We'll see how TDD saves you time and helps you quickly improve functions that extract and process data.


# About me

**Chris Brousseau**

*Surface Owl - Founder & Data Scientist*
<br>
*Pragmatic AI Labs - Cloud-Native ML*
<br>

<br>
Prior work at Accenture & USAF
<br>
Engineering @Boston University

# Problem to solve
- speed up development & improve quality on data science projects
<br><br>

# Two main cases
  1. test tidy input *(matrix - columns = var, row = observation)*  **(tidy data != clean data)**
<br><br>
  2. test ingest/transformations of complex input *(creating tidy & clean data)*
<br><br>
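
The difference between tidy and clean can be shown with a few lines of standard-library Python. This is only a sketch: the CSV sample, the column names, and the helper functions `load_rows`, `is_tidy`, and `unclean_rows` are hypothetical and not part of the talk's repo.

```python
import csv
import io

# Hypothetical sample: one row per school, one column per variable.
RAW_CSV = """school,enrollment,url
Woodside Priory School,350,www.prioryca.org
York School,225,york.org
Unknown School,n/a,
"""

EXPECTED_COLUMNS = ["school", "enrollment", "url"]


def load_rows(text):
    """Parse CSV text into a list of dicts, one per observation."""
    return list(csv.DictReader(io.StringIO(text)))


def is_tidy(rows, expected_columns):
    """True when every row has exactly the expected columns."""
    return all(list(row.keys()) == expected_columns for row in rows)


def unclean_rows(rows):
    """Return rows whose enrollment is not a whole number."""
    return [row for row in rows if not row["enrollment"].isdigit()]


rows = load_rows(RAW_CSV)
assert is_tidy(rows, EXPECTED_COLUMNS)   # the shape is tidy
assert len(unclean_rows(rows)) == 1      # but one value is still not clean
```

The first assertion covers case 1 (shape of tidy input); the second hints at why case 2 (cleaning and transformation) needs its own tests.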

# Our Objectives
- Intro TDD (Test Driven Development)
- Learn about two packages: pytest & datatest
  - For tidy data - *see datatest in action*
  - For data engineering - *see TDD for complex input*
- Understand when not to use TDD
- Get links to Resources

### Why TDD?
<img src="data/images/debugging_switches.200w.webp" style="height: 700px;"/>

### What is TDD
- process for software development
- themes: intentional -> small -> explicit -> automated


### How does it work?

- confirm requirements
- write a failing test (against ONLY these requirements!)
- write code to pass the test (keep it small)
- refactor
- retest
- automate
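
The steps above can be sketched on a tiny data-cleaning task. The requirement and the function name `clean_column_name` are made up for illustration; the point is the order of work, not the function itself.

```python
# Step 1: requirement (hypothetical): column names are trimmed, lowercase,
# and use underscores instead of spaces.


# Step 2: write the test first. It fails until clean_column_name exists.
def test_clean_column_name():
    assert clean_column_name("  School Name ") == "school_name"
    assert clean_column_name("URL") == "url"


# Step 3: write just enough code to pass the test.
def clean_column_name(raw_name):
    """Return a trimmed, lowercase, underscore-separated column name."""
    return "_".join(raw_name.strip().lower().split())


# Steps 4-6: refactor, then re-run the test (pytest automates this part).
test_clean_column_name()
```

Calling the test function directly works in a notebook cell; in a project, pytest would collect and run it for you.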

In [1]:
%%javascript
IPython.notebook.kernel.execute('nb_name = "' + IPython.notebook.notebook_name + '"')


In [2]:
# %load src/pytest_simple_example.py
"""
overly simple example of function to test with pytest
"""


def add_two_ints(first_int, second_int):
    """
    function to add two numbers together
    :param first_int: the first number to add together
    :param second_int: the second number to add together
    :return: sum of inputs, may be positive or negative
    :rtype: int
    """

    return first_int + second_int


In [3]:
# %load tests/test_00_simple_pytest_example
"""
basic usage example of pytest
"""
import pytest
from src.pytest_simple_example import add_two_ints


@pytest.mark.filterwarnings("ignore::DeprecationWarning")
def test_simple_example():
    assert add_two_ints(1, 1) == 2


In [4]:
add_two_ints(1, 1)

2

In [5]:
test_simple_example()

### Why TDD?

1. focus on requirements and outcomes first

2. reduce time debugging

3. boost confidence in your code

4. improve refactoring - *speed and confidence*

5. encourage "clean code" - *SRP (single responsibility principle), organization*

6. speed up onboarding new team members - *read 1K lines, or a test?*


### Why TDD for data?
1. all the above
2. save you time
3. confidence in pipeline

# 1 - Relevant Packages: pytest & datatest

**pytest**
framework for writing and running tests

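- available on PyPI (`pip install pytest`)
- auto-discovery of your tests (by default: files named `test_*.py` or `*_test.py`, classes prefixed `Test`, functions prefixed `test`)
- runs unittest tests (nose support was removed in pytest 8)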
- detailed info on failures
- useful plugins (coverage, aws, selenium, databases, etc)
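
As a rough illustration of the discovery rules, here is what a small test module might look like. The file name `tests/test_add_two_ints.py` is hypothetical, and `add_two_ints` is copied inline only so the sketch stands on its own. Plain `assert` statements are enough, so the module does not even need to import pytest.

```python
# Contents of a hypothetical file named tests/test_add_two_ints.py.
# With default settings pytest collects files named test_*.py, classes
# prefixed Test, and functions or methods prefixed test.


def add_two_ints(first_int, second_int):
    """Stand-in copy of the function under test, so the sketch is self-contained."""
    return first_int + second_int


def test_add_positive():
    assert add_two_ints(1, 1) == 2


class TestAddTwoIntsEdgeCases:
    """Groups related cases; no base class is required."""

    def test_add_negative(self):
        assert add_two_ints(-3, 1) == -2

    def test_add_zero(self):
        assert add_two_ints(0, 0) == 0
```

A helper named `check_add()` in the same file would not be collected, because its name lacks the `test` prefix.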

**datatest**
helps speed up and formalize data-wrangling and data validation tasks

- available on PyPI (`pip install datatest`)
- test data pipeline components and end-to-end behavior

In [6]:
import ipytest
import ipytest.magics
ipytest.config(rewrite_asserts=True, magics=True)

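# __file__ = "INSERT YOUR NOTEBOOK FILENAME HERE"
# nb_name is set by the %%javascript cell above (classic Jupyter Notebook only);
# if it is undefined, use the hard-coded filename line instead
__file__ = nb_name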

NameError: name 'nb_name' is not defined

In [None]:
# pytest not working correctly in jupyter notebook
# investigate:  https://github.com/chmp/ipytest
# https://github.com/chmp/ipytest/blob/master/Example.ipynb

# %%run_pytest[clean] -qq
ipytest.clean_tests()
ipytest.run('-qq')

In [None]:
add_two_ints(1, 1)

# 2 - TDD for data engineering

### datatest deets

- core functions:  validation, error reporting, and acceptance declarations
- built-in classes for selecting, querying, and iterating over the data under test
- both pytest & unittest styles


- works with Pandas
- has Pandas-like syntax
- useful for pipelines


- https://github.com/shawnbrown/datatest
- https://datatest.readthedocs.io/en/stable/index.html
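
To make the three core ideas concrete (validation, error reporting, acceptance), here is a plain-Python sketch of the pattern. It deliberately does not use datatest and is not datatest's API: `find_differences`, `check_values`, and the sample school types are all hypothetical.

```python
def find_differences(values, allowed):
    """Return the values that are not in the allowed set, in input order."""
    return [value for value in values if value not in allowed]


def check_values(values, allowed, accepted=()):
    """Raise ValueError listing unexpected values, skipping accepted ones."""
    differences = [
        value for value in find_differences(values, allowed)
        if value not in accepted
    ]
    if differences:
        raise ValueError(f"{len(differences)} invalid value(s): {differences}")


# Hypothetical column of school types with one inconsistent entry.
school_types = ["Day", "Boarding/Day", "day", "Boarding"]
allowed_types = {"Day", "Boarding", "Boarding/Day"}

# Validation plus error reporting: the message names the offending values.
try:
    check_values(school_types, allowed_types)
except ValueError as error:
    print(error)

# Acceptance: a known, tolerated difference is declared explicitly.
check_values(school_types, allowed_types, accepted={"day"})
```

The value of a dedicated library is that the differences are reported as structured objects rather than a string, which is where datatest goes beyond this sketch.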


### Example 0 - datatest

- TBD (Chris to write)
- need to create some input files for pandas


### Example 1 - finding urls in excel

- url test case
- multiple url test case which breaks prior tests
- regex101.com illustration (edit function to make tests pass)
  https://regex101.com/
- final regex to rule them all
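
For readers following along without the repo, here is one possible regex that handles the test strings used in this example. It is a sketch, not the talk's implementation: the real `find_url` in `src.excel_read_cell_info` also takes a results dict and is not reproduced here, and `find_url_sketch` and `URL_PATTERN` are hypothetical names.

```python
import re

# Optional scheme, one or more dot-terminated labels, then an alphabetic
# top-level domain of at least two letters.
URL_PATTERN = re.compile(r"(?:https?://)?(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}")


def find_url_sketch(cell_text):
    """Return the first URL-like substring in cell_text, or None."""
    match = URL_PATTERN.search(cell_text)
    return match.group(0) if match else None


# Bulk cases: (input, expected result).
CASES = [
    ("Enrollment 350 www.prioryca.org", "www.prioryca.org"),
    ("https://windwardschool.org", "https://windwardschool.org"),
    ("  Enrollment 225 york.org", "york.org"),
    ("Surface Owl Inc., https://surfaceowl.com", "https://surfaceowl.com"),
    ("Woodside Priory School", None),
    ("8221  Fax (650)", None),
    ("surfaceowl", None),
]

for text, expected in CASES:
    assert find_url_sketch(text) == expected
```

Note how "Inc.," does not match: the dot is followed by a comma rather than letters. Cases like this are why adding one new test string at a time is a fast way to iterate on a regex.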

#### Review input files 1 & 2 (under /data/test_cais)

#### Switch to PyCharm

In [None]:
# %load tests/test_02_cais_find_url_examples
"""
test functions to find url in cell content from an excel worksheet

functions below have "do_this_later_" prefix to prevent tests from running during early part of talk
remove prefix as we walk through examples, and re-run tests
"""
from src.excel_read_file_functions import get_workbook
from src.excel_read_cell_info import find_url
from src.configuration_info_cais import datapath_tests


def test_find_single_url():
    """
    unit test to find url in a single text string
    :return: None
    """
    # the find_url function we are testing takes cell content as a string, and current results dict
    # pass an empty results dict, so no existing value is found
    result = {}

    # inputs we expect to pass
    input01 = "Coeducational Boarding/Day School Grades 6-12; Enrollment 350 www.prioryca.org"

    # declare result we expect to find here
    assert find_url(input01, result) == "www.prioryca.org"


def do_this_later_test_find_multi_url():
    """
    unit test multiple strings for urls in bulk - rather than separate test functions for each
    one way to rapidly iterate on your code, nicely encapsulates similar cases

    requires editing REGEX in excel_read_cell_info.find_url to make this test pass
    """
    result = {}

    # inputs we expect to pass
    input01 = "Coed Boarding/Day School Grades 6-12; Enrollment 350 http://www.prioryca.org"
    input02 = "https://windwardschool.org"
    input03 = "  Enrollment 225 york.org"
    input04 = "Surface Owl Inc., https://surfaceowl.com"

    # inputs we expect to return `None`
    input05 = "Woodside Priory School"
    input06 = "8221  Fax (650)"
    input07 = "Head of School Coeducational Boarding/Day School Grades 6-12; Enrollment 350"
    input08 = "surfaceowl"

    assert find_url(input01, result) == "http://www.prioryca.org"
    assert find_url(input02, result) == "https://windwardschool.org"
    assert find_url(input03, result) == "york.org"
    assert find_url(input04, result) == "https://surfaceowl.com"
    assert find_url(input05, result) is None
    assert find_url(input06, result) is None
    assert find_url(input07, result) is None
    assert find_url(input08, result) is None


def do_this_later_test_find_url_from_excelfile():
    """
    integration test to find url from excel file
    :return: None
    """

    filename = datapath_tests + "School_Directory_2013-2014-converted.xlsx"
    workbook = get_workbook(filename)
    tab_name = "Table 15"
    row = 23
    result = {}  # function expects input dict with prior results; equals empty dict now

    # inputs we expect to pass
    test_input = workbook[tab_name].cell(row=row, column=1).value

    assert find_url(test_input, result) == "https://www.hamlin.org"


In [None]:
ipytest.run('-qq')

### Example 2 - finding names & use of supplementary data summaries

- use of expected results file bundled with data as pytest input
- structured discovery of edge cases

- **objective:** find school names in messy excel document
- **strategy:** find names by finding specific formats - removing stopwords & addresses
- **test goals:** confirm the code finds the same number of names as a manual count
- **test approach:** summarize names manually in new tab, *then test code results vs. manual results*
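
The approach can be sketched in miniature as follows. Everything here is a stand-in: the cell contents, the stopword list, the `find_school_names` helper, and the `MANUAL_SUMMARY` dict (which plays the role of the manually built summary tab) are hypothetical and much simpler than the real worksheet.

```python
import re

# Hypothetical cell contents standing in for one worksheet column.
CELLS = [
    "Woodside Priory School",
    "123 Example Road",
    "Head of School",
    "Windward School",
    "Enrollment 225",
]

# Hypothetical manual summary, as it might be read from a summary tab.
MANUAL_SUMMARY = {
    "name_count": 2,
    "names": ["Woodside Priory School", "Windward School"],
}

STOPWORD_PHRASES = ("head of school", "enrollment", "fax")
STARTS_WITH_NUMBER = re.compile(r"^\s*\d")


def find_school_names(cells):
    """Keep cells that are not blank, address-like, or stopword phrases."""
    names = []
    for cell in cells:
        text = cell.strip()
        if not text or STARTS_WITH_NUMBER.match(text):
            continue  # blank cell or something that looks like an address
        if any(phrase in text.lower() for phrase in STOPWORD_PHRASES):
            continue  # label text, not a school name
        names.append(text)
    return names


def test_names_match_manual_summary():
    found = find_school_names(CELLS)
    assert len(found) == MANUAL_SUMMARY["name_count"]
    assert found == MANUAL_SUMMARY["names"]


test_names_match_manual_summary()
```

When the counts disagree, the mismatch points straight at an edge case the code does not handle yet, which is the "structured discovery" mentioned above.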

#### Review input files (/data/test_cais)

#### Switch to PyCharm


# 3 - When not to use TDD for data?

- EDA
- quick prototypes
- data source is complete & managed
- cost / time >> benefits


# 4 - Resources

**this talk:**  https://github.com/surfaceowl/pythontalk_tdd_for_data
<br><br>
**pytest**

[pytest on pypi: https://pypi.org/project/pytest/](https://pypi.org/project/pytest/)

[pytest docs: https://docs.pytest.org/en/latest/](https://docs.pytest.org/en/latest/)
<br><br>
**datatest**

[datatest on pypi: https://pypi.org/project/datatest/](https://pypi.org/project/datatest/)

[datatest on GitHub: https://github.com/shawnbrown/datatest](https://github.com/shawnbrown/datatest)

[datatest docs: https://datatest.readthedocs.io/en/stable/](https://datatest.readthedocs.io/en/stable/)

**TDD for data**

[towards data science article](https://towardsdatascience.com/tdd-datascience-689c98492fcc)


# Recap: Our Objectives

- Intro TDD (Test Driven Development)
- Learn about two packages: pytest & datatest 
- See TDD for data engineering in action
- Understand when not to use TDD
- Get some relevant Resources

# END

### setup notes

- create a venv, then run `pip install -r requirements.txt`
- run `pip install -e .`
- run pytest from the terminal - must be in the tests dir
- PyCharm setup - set the test runner to pytest


### resources

[TDD in data analysis - step-by-step tutorial (notebook)](https://nbviewer.jupyter.org/github/agostontorok/tdd_data_analysis/blob/master/TDD%20in%20data%20analysis%20-%20Step-by-step%20tutorial.ipynb#Step-by-step-TDD-in-a-data-science-task)

[tdda.info](http://www.tdda.info/)

[fix for pytest ModuleNotFoundError](https://medium.com/@dirk.avery/pytest-modulenotfounderror-no-module-named-requests-a770e6926ac5)


#### regex
https://regex101.com/
