
missing from_records method which returns DataFrame[Schema] #850

Closed
borissmidt opened this issue May 6, 2022 · 9 comments
Labels
enhancement New feature or request

Comments

@borissmidt (Contributor)

Is your feature request related to a problem? Please describe.
When I call pandera.typing.DataFrame[T].from_records() I get an untyped pandas DataFrame back, not a pandera DataFrame[T].

Describe the solution you'd like
A from_records method that creates a typed dataframe; this is especially useful for writing unit tests.

borissmidt added the enhancement (New feature or request) label May 6, 2022
@borissmidt (Contributor, Author)

Another improvement might be a typed 'record' constructor that matches the columns, but I'm not sure how to make the IDE pick this up.

@cosmicBboy (Collaborator)

A from_records method that creates a typed dataframe; this is especially useful for writing unit tests.

I'm open to supporting this!

Basically the pandera DataFrame type would need to override the from_records method by calling the super().from_records method and then typing.cast the output of that to pandera.typing.DataFrame[T] (a self-referencing return annotation).

Let me know if you have the capacity to make a PR for this! Would be happy to help guide further.

In the meantime, a workaround would be something like:

pa.typing.DataFrame[Schema](pd.DataFrame.from_records(...))
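The override-and-cast pattern described above can be sketched without pandera itself. In this sketch, BaseFrame and TypedFrame are hypothetical stand-ins for pandas.DataFrame and pandera.typing.DataFrame; only the mechanics are shown: override the classmethod, delegate to super(), and typing.cast the result to the self-referencing generic type.

```python
# A minimal sketch of the proposed pattern; BaseFrame/TypedFrame are
# hypothetical stand-ins, not the real pandas/pandera classes.
import typing

T = typing.TypeVar("T")


class BaseFrame:
    """Plays the role of the untyped pandas.DataFrame."""

    def __init__(self) -> None:
        self.records: list = []

    @classmethod
    def from_records(cls, records: list) -> "BaseFrame":
        frame = cls()
        frame.records = list(records)
        return frame


class TypedFrame(BaseFrame, typing.Generic[T]):
    """Plays the role of the generic pandera.typing.DataFrame."""

    @classmethod
    def from_records(cls, records: list) -> "TypedFrame[T]":
        # Delegate to the untyped parent, then cast the result so the
        # static return type is the self-referencing generic subclass.
        return typing.cast("TypedFrame[T]", super().from_records(records))


df = TypedFrame.from_records([{"col1": 1, "idx": 2}])
print(df.records)
```

The cast is purely a static-typing device: at runtime super().from_records already constructs an instance of the subclass via cls(), so no conversion happens.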

@borissmidt (Contributor, Author)

borissmidt commented May 6, 2022

I made a workaround, but it isn't ideal, since the schemas by default don't keep the names of the indexes.
Otherwise, each time you call from_records() you have to specify which indexes are used for that data type.
I also couldn't get at the 'generic' parameter of the schema inside the DataFrame type. (I'm kind of new to Python, but programmed Scala before, where you can actually access these things.)

import typing

import pandas as pd
import pandera as pa
import pandera.typing as pat

TSchemaModel = typing.TypeVar("TSchemaModel", bound=pa.SchemaModel)


class ExtDataFrame(typing.Generic[TSchemaModel]):
    schema: typing.Type[TSchemaModel]

    def __init__(self, t: typing.Type[TSchemaModel]):
        self.schema = t

    def from_records(  # type: ignore
        self,
        data,
        index=None,
        exclude=None,
        columns=None,
        coerce_float: bool = False,
        nrows: typing.Optional[int] = None,
    ) -> pat.DataFrame[TSchemaModel]:
        schema = self.schema.to_schema()
        # fall back to the index names defined on the schema
        index = list(schema.index.names) if index is None else index
        return self.schema.validate(
            pat.DataFrame[TSchemaModel](
                pd.DataFrame.from_records(
                    data=data,
                    index=index,
                    exclude=exclude,
                    columns=columns,
                    coerce_float=coerce_float,
                    nrows=nrows,
                )
            )
        )

# then you can do:
ExtDataFrame(Schema).from_records([{"col1": 1, "idx": 2}])
# in most of my code I do:
ExtDataFrame(Schema).from_records([Schema.record(col1=1, idx=2)])

Off topic:

  • I just noticed that pandera doesn't allow the IDE (PyCharm) to easily refactor column names, since the IDE doesn't understand that the columns in pandera.typing.DataFrame are defined by a generic type.
  • Also, pa.typing.DataFrame[Schema] doesn't seem to validate the columns. I have to explicitly call Schema.validate, and had to extend the schema to return a pandera.pandas.DataFrame instead of pandera.BaseDataFrame.

@cosmicBboy (Collaborator)

cosmicBboy commented May 6, 2022

I'm not quite clear on your use case here... would you mind elaborating on that and why you need strictly typed dataframes?
Example tests/code that you're using would help. I ask mainly because this part of the pandera functionality is still experimental, and people who need strictly typed dataframes might see some rough edges, as you have 🙂

I made a workaround, but it isn't ideal, since the schemas by default don't keep the names of the indexes.
Otherwise, each time you call from_records() you have to specify which indexes are used for that data type.

You can supply check_name to the Field associated with your index: https://pandera.readthedocs.io/en/stable/reference/generated/pandera.model_components.Field.html#pandera.model_components.Field

I just noticed that pandera doesn't allow the IDE (PyCharm) to easily refactor column names, since the IDE doesn't understand that the columns in pandera.typing.DataFrame are defined by a generic type.

This is a known limitation of pandera... we haven't yet explored ways of modifying the pandera.typing.DataFrame class to make it aware of the columns/indexes defined in the schema, as this adds more complexity to the pandera DataFrame subclass, e.g. we'd need to deal with conflicts between pandas.DataFrame methods/attributes and user-defined column names. I understand this isn't ideal from an IDE autocompletion perspective, but why does this make refactoring column names hard?

Also, pa.typing.DataFrame[Schema] doesn't seem to validate the columns. I have to explicitly call Schema.validate, and had to extend the schema to return a pandera.pandas.DataFrame instead of pandera.BaseDataFrame.

Can you provide a minimally reproducible code snippet for this in a bug issue? This test makes sure validation of columns does indeed happen.

@borissmidt (Contributor, Author)

Yes, I like typed dataframes, because they are really good for documenting the code, so you don't make errors in column names or types. It also catches a lot of problems in the case of missing data.

Another use I made of it is as a definition of my xlsx report output. I use the title in the Field to actually set the title in the xlsx report output, and use reflection on the schema to get the right columns for serialization. This makes it very easy to change the order of columns in the output format and to change the output itself. In the case of a missing column, the code fails at the function that calculates the data, instead of me having to manually check the output file.

for example:

# Just extends SchemaModel
class MonthlySummary(SchemaModelXlsx):
    @classmethod
    @property
    def sheet_name(cls) -> str:  # this could be a title in the Config instead
        return "Summary"

    month: pat.Index[pat.DateTime] = pa.Field(check_name=True, title="Month")
    bruto_revenue: pat.Series[float] = pa.Field(title="Bruto revenue")
    expenses: pat.Series[float] = pa.Field(title="expenses")
    netto_revenue: pat.Series[float] = pa.Field(title="Netto revenue")


def revenue(
    sales: pat.DataFrame[ProductSales], services: pat.DataFrame[ServiceSales]
) -> pat.DataFrame[Revenue]:
    pass


def monthly_summary(
    bruto_revenue: pat.DataFrame[Revenue], expenses_per_day: pat.DataFrame[Expenses]
):
    # reindexes and then takes the difference between the different kinds
    # of revenue and expenses
    return pat.DataFrame[MonthlySummary](
        {
            "bruto_revenue": total_revenue,
            "expenses": total_expenses,
            "netto_revenue": total_revenue - total_expenses,
        }
    )

Ideally a typed dataframe should have a constructor, so you could call:

# each field should be typed to make construction easy
MonthlySummary(month, bruto_revenue, expenses, netto_revenue)
# or, if you want to extract it from a df; this could drop the 'unstated'
# columns to enforce that you don't just add some data:
MonthlySummary.from_df(df)

Having these specialized types could also add the opportunity to attach methods and properties to the dataframes, making it easy to calculate aggregated data with the defined types.
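The typed 'record' constructor idea can be approximated today with a TypedDict; the MonthlySummaryRecord name below is hypothetical, but once the columns are spelled out this way, a type checker or IDE flags misspelled or missing columns statically.

```python
# Hypothetical sketch: a TypedDict mirroring the schema's columns gives the
# IDE and type checker the column names, so a typo such as "bruto_reveneu"
# is flagged before the code runs.
from typing import TypedDict


class MonthlySummaryRecord(TypedDict):
    month: int
    bruto_revenue: float
    expenses: float
    netto_revenue: float


row: MonthlySummaryRecord = {
    "month": 1,
    "bruto_revenue": 10.0,
    "expenses": 4.0,
    "netto_revenue": 6.0,
}
print(row["netto_revenue"])  # 6.0
```

A list of such records could then feed from_records, keeping the test fixtures both typed and refactor-friendly.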

Can you provide a minimally reproducible code snippet for this in a bug issue? This test makes sure validation of columns does indeed happen

Looking at the test, it doesn't check for missing columns. I'll try to spend some time today to make a sample to double-check the problem.

@borissmidt (Contributor, Author)

borissmidt commented May 9, 2022

OK, it is only from_records that doesn't do any checks. But I only use it in my unit tests.

class MonthlySummary2(SchemaModel):
    month: pat.Index[int] = pa.Field(check_name=True, title="Month")
    bruto_revenue: pat.Series[float] = pa.Field(title="Bruto revenue")
    expenses: pat.Series[float] = pa.Field(title="expenses")
    netto_revenue: pat.Series[float] = pa.Field(title="Netto revenue")


# no schema error is raised, even though netto_revenue is missing:
df = pat.DataFrame[MonthlySummary2].from_records(
    [
        {
            "month": 1,
            "bruto_revenue": 1.0,
            "expenses": 2.0,
        }
    ],
    index=["month"],
)

@cosmicBboy (Collaborator)

Hi @borissmidt

Looking at the test, it doesn't check for missing columns. I'll try to spend some time today to make a sample to double-check the problem.

Yeah if you can send a code snippet (or maybe a PR 🙂) to update that test that would be great!

OK, it is only from_records that doesn't do any checks. But I only use it in my unit tests.

I'm down to support this use case, but I'm currently working on other stuff (#381) so if you'd like to own that part of the codebase I can help review changes and get them merged into the core library.

@borissmidt (Contributor, Author)

borissmidt commented May 10, 2022 via email

cosmicBboy added a commit that referenced this issue Aug 9, 2022
* Add a from record that checks the schema for a pandas dataframe

* Add a from record that checks the schema for a pandas dataframe

* handle nox session.install issue

* fix lint

* fix noxfile issue

* remove unneeded types

* update type annotation

Co-authored-by: cosmicBboy <niels.bantilan@gmail.com>
cosmicBboy added a commit that referenced this issue Aug 10, 2022
@cosmicBboy (Collaborator)

Fixed by #859.
