Background

I've been teaching myself pandas and pandera, because some of my personal projects are becoming fairly data-oriented. Any advice or guidance is welcome.

Problem

I'm not sure how I should "fail" the series so that the failure reports the index of the offending row. My goal is to catch these errors and move the invalid rows into another frame, but to do that I need to know their index. I'm sure I'm missing or misunderstanding something fundamental, because this is all pretty new to me.

Example:

import pandas as pd
import pandera as pa
from re import match
from pandera import Column, Check
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

def regex_check(regex):
    def func(series: pd.Series):
        for i, data in enumerate(series.to_list()):
            if isinstance(data, str):
                series[i] = bool(match(regex, data))
            if isinstance(data, list):
                for item in data:
                    if not bool(match(regex, item)):
                        print(f"Series Index: {i}, Series Data: {data}, Item of List: {item}, Regex Match: {bool(match(regex, item))}")
                        series[i] = bool(match(regex, item))
                        break
        return series
    return func

def df_check_data(df: pd.DataFrame, schema: pa.DataFrameSchema) -> pd.DataFrame:
    try:
        schema.validate(df, lazy=True)
    except pa.errors.SchemaErrors as err:
        print("Schema errors and failure cases:")
        print(err.failure_cases)

check1 = regex_check(r"^\d{8}$")
check2 = regex_check(r"^\d{9}$")
schema = pa.DataFrameSchema({
    "NAME": Column(str, Check.isin(["BOB", "JERRY", "TOM", "NORM"])),
    "SOME FIELD": Column(str, Check(check1, error="Regex Check 1")),
    "SOME OTHER FIELD": Column(list, Check(check2, error="Regex Check 2")),
})
df = pd.DataFrame([
    {"NAME": "BOB", "SOME FIELD": "00123000", "SOME OTHER FIELD": ["123456789"]},
    {"NAME": "JERRY", "SOME FIELD": "00123000", "SOME OTHER FIELD": ["123456789", "87654321"]},
])
df_check_data(df, schema)

Output
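For the goal stated under Problem (moving invalid rows into another frame), one possible sketch is to read the "index" column of the failure_cases frame that pandera attaches to SchemaErrors. The helper name split_on_failures and the hand-written failure_cases frame below are hypothetical stand-ins so the snippet runs with pandas alone. In real use the frame would come from err.failure_cases, and the row index is only expected to be populated when the check yields one boolean per row.

```python
import pandas as pd


def split_on_failures(df: pd.DataFrame, failure_cases: pd.DataFrame):
    """Split df into (valid, invalid) rows using failure_cases['index'].

    Hypothetical helper: failure_cases is assumed to look like the frame
    exposed as err.failure_cases on pandera's SchemaErrors.
    """
    # Entries with a missing index describe schema-level failures
    # (for example a missing column) and do not point at a single row.
    bad_index = failure_cases["index"].dropna().unique()
    is_bad = df.index.isin(bad_index)
    return df[~is_bad], df[is_bad]


# Illustrative data only, not taken from the original post.
df = pd.DataFrame({"NAME": ["BOB", "JERRY"], "SOME FIELD": ["00123000", "0012"]})

# Hand-written stand-in for err.failure_cases.
failure_cases = pd.DataFrame({
    "column": ["SOME FIELD"],
    "check": ["Regex Check 1"],
    "failure_case": ["0012"],
    "index": [1],
})

valid, invalid = split_on_failures(df, failure_cases)
```

Inside the except block of df_check_data, the same call would be split_on_failures(df, err.failure_cases).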
Replies: 1 comment
Looks like Check(element_wise=True) and a change to regex_check so that it accepts types other than Series does the trick. Is there a way to do this without element_wise though?

from typing import Any  # needed for the Any annotation below

def regex_check(regex: str):
    def func(elem: Any):
        # A plain string passes if it matches the regex.
        if isinstance(elem, str):
            return bool(match(regex, elem))
        # A list passes only if every item matches the regex.
        if isinstance(elem, list):
            for item in elem:
                if not bool(match(regex, item)):
                    return False
            return True
    return func
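On the question of avoiding element_wise, one possible approach is a check function that takes the whole Series and returns a new boolean Series on the same index, instead of mutating the input. The sketch below reuses the regex_check name from the thread. The inner matches helper and the choice to return False for values that are neither str nor list are assumptions, not something the thread specifies. Series.map still calls Python once per element, so this removes the flag rather than the per-element work.

```python
import re

import pandas as pd


def regex_check(regex: str):
    pattern = re.compile(regex)

    def matches(elem) -> bool:
        # A plain string passes if it matches the pattern.
        if isinstance(elem, str):
            return bool(pattern.match(elem))
        # A list passes only if every item is a matching string.
        if isinstance(elem, list):
            return all(
                isinstance(item, str) and bool(pattern.match(item))
                for item in elem
            )
        # Assumption: any other type counts as a failure.
        return False

    def func(series: pd.Series) -> pd.Series:
        # Build a new boolean Series; the input is left untouched
        # and its index is preserved.
        return series.map(matches).astype(bool)

    return func


check2 = regex_check(r"^\d{9}$")
result = check2(pd.Series([["123456789"], ["123456789", "87654321"]]))
```

A function like this can be passed as Check(check2, error="Regex Check 2") without element_wise=True. For a column that holds only strings, series.str.match(regex) is a vectorized alternative.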