Inferred schema fails to generate example #988

Closed
mattharrison opened this issue Oct 27, 2022 · 5 comments · Fixed by #989
Labels
bug Something isn't working

Comments

@mattharrison

Describe the bug

I'm inferring the schema from a CSV with 83 columns. When I try to generate an example it fails.

Unsatisfiable: Unable to satisfy assumptions of hypothesis example_generating_inner_function.

Code Sample, a copy-pastable example

import pandas as pd
import pandera as pa
import time

url = 'https://github.com/mattharrison/datasets/blob/master/data/ames-housing-dataset.zip?raw=true'
ames = pd.read_csv(url, compression='zip')
s = pa.infer_schema(ames)

for i in range(1, 80):
    start = time.time()
    s.select_columns(list(s.columns.keys())[:i]).example(i)
    print(f'{i} took {time.time()-start} seconds {list(s.columns.keys())[:i]}')

Expected behavior

I would expect this to generate an example. I made a simple script to measure timing when adding columns (of int, str, and float) and it works with 80 columns:

for i in range(80):
    cols = {}
    for y in range(i):
        if y % 3 == 0:
            cols[f'col{y}'] = pa.Column(int)
        if y % 3 == 1:
            cols[f'col{y}'] = pa.Column(float)
        if y % 3 == 2:
            cols[f'col{y}'] = pa.Column(str)
    schema = pa.DataFrameSchema(cols)
    start = time.time()
    schema.example()
    print(f'{i} took {time.time()-start} seconds')

Desktop (please complete the following information):

  • OS: WSL2 Ubuntu 20.04
  • Python Version: 3.8
  • Pandera Version: 0.7.0
@mattharrison mattharrison added the bug Something isn't working label Oct 27, 2022
@cosmicBboy
Collaborator

hey @mattharrison, is there a particular reason you're generating more and more examples .example(i) as you include more and more columns?

hypothesis is doing all the heavy lifting generating the dataframes, and the more examples it has to generate the more time it needs. I believe one can increase the deadline setting, which is basically a timeout for generating examples, to give it more time to generate examples: https://hypothesis.readthedocs.io/en/latest/settings.html#hypothesis.settings.deadline

It'd also be worth documenting that generating more than 50 rows of data is a lot for pandera/hypothesis to handle. The purpose of this synthetic data is unit testing, which typically won't involve large datasets.

@mattharrison
Author

mattharrison commented Oct 27, 2022

Good catch. I changed the i to a 1:

import pandas as pd
import pandera as pa
import time

url = 'https://github.com/mattharrison/datasets/blob/master/data/ames-housing-dataset.zip?raw=true'
ames = pd.read_csv(url, compression='zip')
s = pa.infer_schema(ames)

for i in range(80, 81):
    start = time.time()
    s.select_columns(list(s.columns.keys())[:i]).example(1)
    print(f'{i} took {time.time()-start} seconds {list(s.columns.keys())[:i]}')

I also changed

for i in range(1, 80):

to

for i in range(80, 81):

And it failed to generate an example:

------------------------------------------------------------------
Unsatisfiable                    Traceback (most recent call last)
<ipython-input-106-b6da2b8d41e0> in <module>
      8 for i in range(80, 81):
      9     start = time.time()
---> 10     s.select_columns(list(s.columns.keys())[:i]).example(1)
     11     print(f'{i} took {time.time()-start} seconds {list(s.columns.keys())[:i]}')

~/envs/menv/lib/python3.8/site-packages/pandera/schemas.py in example(self, size, n_regex_columns)
    759                 category=hypothesis.errors.NonInteractiveExampleWarning,
    760             )
--> 761             return self.strategy(
    762                 size=size, n_regex_columns=n_regex_columns
    763             ).example()

~/envs/menv/lib/python3.8/site-packages/hypothesis/strategies/_internal/strategies.py in example(self)
    322 
    323         examples: List[Ex] = []
--> 324         example_generating_inner_function()
    325         return random_choice(examples)
    326 

~/envs/menv/lib/python3.8/site-packages/hypothesis/strategies/_internal/strategies.py in example_generating_inner_function()
    310         # tracebacks, and we want users to know that they can ignore it.
    311         @given(self)
--> 312         @settings(
    313             database=None,
    314             max_examples=10,

    [... skipping hidden 2 frame]

~/envs/menv/lib/python3.8/site-packages/hypothesis/core.py in run_engine(self)
    770         else:
    771             if runner.valid_examples == 0:
--> 772                 raise Unsatisfiable(
    773                     "Unable to satisfy assumptions of hypothesis %s."
    774                     % (get_pretty_function_description(self.test),)

Unsatisfiable: Unable to satisfy assumptions of hypothesis example_generating_inner_function.
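
Because each attempt can be slow, one way to find the column count at which generation starts failing is to bisect instead of scanning linearly. The sketch below is a hypothetical helper (`first_failing_prefix` and `works` are made-up names, not part of pandera) and assumes that once a prefix of columns fails, every longer prefix fails too.

```python
from typing import Callable, Optional


def first_failing_prefix(n_columns: int, works: Callable[[int], bool]) -> Optional[int]:
    """Return the smallest k in 1..n_columns for which works(k) is False.

    Assumes n_columns >= 1 and that failures are monotonic: if the first k
    columns fail, so does every longer prefix. Returns None if the full
    set of columns works.
    """
    if works(n_columns):
        return None
    lo, hi = 1, n_columns  # invariant: works(hi) is False
    while lo < hi:
        mid = (lo + hi) // 2
        if works(mid):
            lo = mid + 1  # first failure is to the right of mid
        else:
            hi = mid  # mid fails, so the first failure is at or before mid
    return lo
```

With the schema from this thread, `works(k)` could wrap `s.select_columns(list(s.columns.keys())[:k]).example(1)` in a `try`/`except` and return `False` when it raises.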

@mattharrison
Author

I also tried inferring from a sample (100 rows) of the data and got a different error:

import pandas as pd
import pandera as pa
import time

url = 'https://github.com/mattharrison/datasets/blob/master/data/ames-housing-dataset.zip?raw=true'
ames = pd.read_csv(url, compression='zip')
s = pa.infer_schema(ames.sample(100, random_state=42))

#for i in range(1, 20):
for i in range(80, 81):
    start = time.time()
    s.select_columns(list(s.columns.keys())[:i]).example(1)
    print(f'{i} took {time.time()-start} seconds {list(s.columns.keys())[:i]}')

------------------------------------------------------------------
TypeError                        Traceback (most recent call last)
~/envs/menv/lib/python3.8/site-packages/pandera/engines/pandas_engine.py in dtype(cls, data_type)
    122         try:
--> 123             return engine.Engine.dtype(cls, data_type)
    124         except TypeError:

~/envs/menv/lib/python3.8/site-packages/pandera/engines/engine.py in dtype(cls, data_type)
    210         except (KeyError, ValueError):
--> 211             raise TypeError(
    212                 f"Data type '{data_type}' not understood by {cls.__name__}."

TypeError: Data type 'empty' not understood by Engine.

During handling of the above exception, another exception occurred:

TypeError                        Traceback (most recent call last)
<ipython-input-110-9c448b1fadcb> in <module>
      3 url = 'https://github.com/mattharrison/datasets/blob/master/data/ames-housing-dataset.zip?raw=true'
      4 ames = pd.read_csv(url, compression='zip')
----> 5 s = pa.infer_schema(ames.sample(100, random_state=42))
      6 import time
      7 #for i in range(1, 20):

~/envs/menv/lib/python3.8/site-packages/pandera/schema_inference.py in infer_schema(pandas_obj)
     24     """
     25     if isinstance(pandas_obj, pd.DataFrame):
---> 26         return infer_dataframe_schema(pandas_obj)
     27     elif isinstance(pandas_obj, pd.Series):
     28         return infer_series_schema(pandas_obj)

~/envs/menv/lib/python3.8/site-packages/pandera/schema_inference.py in infer_dataframe_schema(df)
     58     :returns: DataFrameSchema
     59     """
---> 60     df_statistics = infer_dataframe_statistics(df)
     61     schema = DataFrameSchema(
     62         columns={

~/envs/menv/lib/python3.8/site-packages/pandera/schema_statistics.py in infer_dataframe_statistics(df)
     13     """Infer column and index statistics from a pandas DataFrame."""
     14     nullable_columns = df.isna().any()
---> 15     inferred_column_dtypes = {col: _get_array_type(df[col]) for col in df}
     16     column_statistics = {
     17         col: {

~/envs/menv/lib/python3.8/site-packages/pandera/schema_statistics.py in <dictcomp>(.0)
     13     """Infer column and index statistics from a pandas DataFrame."""
     14     nullable_columns = df.isna().any()
---> 15     inferred_column_dtypes = {col: _get_array_type(df[col]) for col in df}
     16     column_statistics = {
     17         col: {

~/envs/menv/lib/python3.8/site-packages/pandera/schema_statistics.py in _get_array_type(x)
    182         inferred_alias = pd.api.types.infer_dtype(x, skipna=True)
    183         if inferred_alias != "string":
--> 184             data_type = pandas_engine.Engine.dtype(inferred_alias)
    185     return data_type
    186 

~/envs/menv/lib/python3.8/site-packages/pandera/engines/pandas_engine.py in dtype(cls, data_type)
    139                 # let pandas transform any acceptable value
    140                 # into a numpy or pandas dtype.
--> 141                 np_or_pd_dtype = pd.api.types.pandas_dtype(data_type)
    142                 if isinstance(np_or_pd_dtype, np.dtype):
    143                     np_or_pd_dtype = np_or_pd_dtype.type

~/envs/menv/lib/python3.8/site-packages/pandas/core/dtypes/common.py in pandas_dtype(dtype)
   1779     # raise a consistent TypeError if failed
   1780     try:
-> 1781         npdtype = np.dtype(dtype)
   1782     except SyntaxError as err:
   1783         # np.dtype uses `eval` which can raise SyntaxError

TypeError: data type 'empty' not understood
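
One plausible explanation for the `'empty'` alias, assuming the 100-row sample contains an object column whose values are all missing: `pd.api.types.infer_dtype` with `skipna=True` has nothing left to inspect, and the resulting alias is not a dtype that pandas can construct. The series below is made-up data for illustration, not taken from the Ames dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sparsely populated text column (values invented for illustration).
full = pd.Series(["Gd", np.nan, np.nan, np.nan], dtype=object)
sample = full.iloc[1:]  # a sample that happens to contain only missing values

inferred_full = pd.api.types.infer_dtype(full, skipna=True)
inferred_sample = pd.api.types.infer_dtype(sample, skipna=True)
print(inferred_full, inferred_sample)

# The alias inferred for the all-missing sample is not a valid dtype name.
try:
    pd.api.types.pandas_dtype(inferred_sample)
    dtype_error = None
except TypeError as exc:
    dtype_error = exc
print(dtype_error)
```

If this is what happens, checking `ames.sample(100, random_state=42).isna().all()` should show which columns are entirely missing in the sample.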

cosmicBboy added a commit that referenced this issue Oct 27, 2022
fixes #988

A change in #658 introduced a step to handle str/object dtypes due
to issues handling np.str_. Will need to look into whether that's
still necessary in another PR, but this one compresses a bunch
of `strategy.map` calls to a single one.

This addresses an issue where the strategy would be way too long
for schemas with many str/object columns.

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
@cosmicBboy
Collaborator

okay, I identified the proximal issue here: #989

Generating an example on the entire schema works up to 16-ish examples (it craps out at 32):

import pandas as pd
import pandera as pa
import time
from datetime import timedelta

from hypothesis import settings

url = 'https://github.com/mattharrison/datasets/blob/master/data/ames-housing-dataset.zip?raw=true'
ames = pd.read_csv(url, compression='zip')
s = pa.infer_schema(ames)

for i in [0, 1, 2, 4, 8, 16, 32]:
    start = time.time()
    s.example(i)
    print(f'{i} examples took {time.time()-start} seconds')

Output:

0 examples took 0.049291133880615234 seconds
1 examples took 42.040018796920776 seconds
2 examples took 42.64928913116455 seconds
4 examples took 42.541467905044556 seconds
8 examples took 40.94924283027649 seconds
16 examples took 41.02593684196472 seconds
Traceback (most recent call last):
  File "/Users/nielsbantilan/git/pandera/foo.py", line 15, in <module>
    s.example(i)
  File "/Users/nielsbantilan/git/pandera/pandera/schemas.py", line 945, in example
    return self.strategy(
  File "/Users/nielsbantilan/miniconda3/envs/pandera/lib/python3.9/site-packages/hypothesis/strategies/_internal/strategies.py", line 335, in example
    example_generating_inner_function()
  File "/Users/nielsbantilan/miniconda3/envs/pandera/lib/python3.9/site-packages/hypothesis/strategies/_internal/strategies.py", line 324, in example_generating_inner_function
    @settings(
  File "/Users/nielsbantilan/miniconda3/envs/pandera/lib/python3.9/site-packages/hypothesis/core.py", line 1235, in wrapped_test
    raise the_error_hypothesis_found
  File "/Users/nielsbantilan/miniconda3/envs/pandera/lib/python3.9/site-packages/hypothesis/core.py", line 815, in run_engine
    raise Unsatisfiable(f"Unable to satisfy assumptions of {rep}")
hypothesis.errors.Unsatisfiable: Unable to satisfy assumptions of example_generating_inner_function

Cutting a bugfix release 0.13.4 next week, this should be included in there!

cosmicBboy added a commit that referenced this issue Oct 28, 2022
fixes #988

A change in #658 introduced a step to handle str/object dtypes due
to issues handling np.str_. Will need to look into whether that's
still necessary in another PR, but this one compresses a bunch
of `strategy.map` calls to a single one.

This addresses an issue where the strategy would be way too long
for schemas with many str/object columns.

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

@mattharrison
Author

Ok, trying this again with another dataset and running into issues.

This fails. Note I'm not even creating all of the columns (though I would like to), just two floating point columns.

Code:

import pandas as pd
import pandera as pa

raw = pd.read_csv('https://github.com/mattharrison/datasets/raw/master/data/alta-noaa-1980-2019.csv',
                  parse_dates=['DATE'])

pa.infer_schema(raw.iloc[:,3:4]).example(size=5)

Should I open another bug?
