
missing from_records method which returns DataFrame[Schema] #850

Closed
borissmidt opened this issue May 6, 2022 · 9 comments
Labels
enhancement New feature or request

Comments

@borissmidt (Contributor)

Is your feature request related to a problem? Please describe.
When I call pandera.typing.DataFrame[T].from_records() I get an untyped pandas DataFrame back, not a pandera DataFrame[T].

Describe the solution you'd like
A from_records method that creates a typed dataframe; this is especially useful for writing unit tests.

borissmidt added the enhancement (New feature or request) label May 6, 2022
@borissmidt (Contributor, Author)

Another improvement might be a typed 'record' constructor that matches the columns, but I'm not sure how to make the IDE pick this up.

@cosmicBboy (Collaborator)

A from_records method that creates a typed dataframe; this is especially useful for writing unit tests.

I'm open to supporting this!

Basically the pandera DataFrame type would need to override the from_records method by calling the super().from_records method and then typing.cast the output of that to pandera.typing.DataFrame[T] (a self-referencing return annotation).

Let me know if you have the capacity to make a PR for this! Would be happy to help guide further.

In the meantime, a workaround would be something like:

pa.typing.DataFrame[Schema](pd.DataFrame.from_records(...))
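The override-and-cast pattern described above can be sketched without pandera itself. In this sketch, BaseFrame and TypedFrame are hypothetical stand-ins for pandas.DataFrame and pandera.typing.DataFrame; only the mechanics are shown: override the classmethod, delegate to super(), and typing.cast the result to the self-referencing generic type.

```python
# A minimal sketch of the proposed pattern; BaseFrame/TypedFrame are
# hypothetical stand-ins, not the real pandas/pandera classes.
import typing

T = typing.TypeVar("T")


class BaseFrame:
    """Plays the role of the untyped pandas.DataFrame."""

    def __init__(self) -> None:
        self.records: list = []

    @classmethod
    def from_records(cls, records: list) -> "BaseFrame":
        frame = cls()
        frame.records = list(records)
        return frame


class TypedFrame(BaseFrame, typing.Generic[T]):
    """Plays the role of the generic pandera.typing.DataFrame."""

    @classmethod
    def from_records(cls, records: list) -> "TypedFrame[T]":
        # Delegate to the untyped parent, then cast the result so the
        # static return type is the self-referencing generic subclass.
        return typing.cast("TypedFrame[T]", super().from_records(records))


df = TypedFrame.from_records([{"col1": 1, "idx": 2}])
print(df.records)
```

The cast is purely a static-typing device: at runtime super().from_records already constructs an instance of the subclass via cls(), so no conversion happens.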

@borissmidt (Contributor, Author)

borissmidt commented May 6, 2022

I made a workaround, but it isn't ideal, since the schemas by default don't keep the names of the indexes.
Otherwise, each time you call from_records() you have to specify which indexes are used for that data type.
I also couldn't get at the 'generic' parameter of the schema inside the DataFrame type. (I'm kind of new to Python, but programmed Scala before, where you can actually access these things.)

import typing

import pandas as pd
import pandera as pa
import pandera.typing as pat

TSchemaModel = typing.TypeVar("TSchemaModel", bound=pa.SchemaModel)


class ExtDataFrame(typing.Generic[TSchemaModel]):
    schema: typing.Type[TSchemaModel]

    def __init__(self, t: typing.Type[TSchemaModel]):
        self.schema = t

    def from_records(  # type: ignore
        self,
        data,
        index=None,
        exclude=None,
        columns=None,
        coerce_float: bool = False,
        nrows: typing.Optional[int] = None,
    ) -> pat.DataFrame[TSchemaModel]:
        schema = self.schema.to_schema()
        # fall back to the index names defined on the schema
        index = list(schema.index.names) if index is None else index
        return self.schema.validate(
            pat.DataFrame[TSchemaModel](
                pd.DataFrame.from_records(
                    data=data,
                    index=index,
                    exclude=exclude,
                    columns=columns,
                    coerce_float=coerce_float,
                    nrows=nrows,
                )
            )
        )

# then you can do:
ExtDataFrame(Schema).from_records([{"col1": 1, "idx": 2}])
# in most of my code I do:
ExtDataFrame(Schema).from_records([Schema.record(col1=1, idx=2)])

Off topic:

  • I just noticed that pandera doesn't allow the IDE (PyCharm) to easily refactor column names, since the IDE doesn't understand that the columns in pandera.typing.DataFrame are defined by a generic type.
  • Also, pa.typing.DataFrame[Schema] doesn't seem to validate the columns. I have to explicitly call Schema.validate, and had to extend the schema to return a pandera.pandas.DataFrame instead of pandera.BaseDataFrame.

@cosmicBboy (Collaborator)

cosmicBboy commented May 6, 2022

I'm not quite clear on your use case here... would you mind elaborating on that and why you need strictly typed dataframes?
Example tests/code that you're using would help. I ask mainly because this part of the pandera functionality is still experimental, and people who need strictly typed dataframes might see some rough edges, as you have 🙂

I made a workaround, but it isn't ideal, since the schemas by default don't keep the names of the indexes.
Otherwise, each time you call from_records() you have to specify which indexes are used for that data type.

You can supply check_name to the Field associated with your index: https://pandera.readthedocs.io/en/stable/reference/generated/pandera.model_components.Field.html#pandera.model_components.Field

I just noticed that pandera doesn't allow the IDE (PyCharm) to easily refactor column names, since the IDE doesn't understand that the columns in pandera.typing.DataFrame are defined by a generic type.

This is a known limitation of pandera... we haven't yet explored ways of modifying the pandera.typing.DataFrame class to make it aware of the columns/indexes defined in the schema, as this adds more complexity to the pandera DataFrame subclass, e.g. we'd need to deal with conflicts between pandas.DataFrame methods/attributes and user-defined column names. I understand this isn't ideal from an IDE autocompletion perspective, but why does this make refactoring column names hard?

Also, pa.typing.DataFrame[Schema] doesn't seem to validate the columns. I have to explicitly call Schema.validate, and had to extend the schema to return a pandera.pandas.DataFrame instead of pandera.BaseDataFrame.

Can you provide a minimally reproducible code snippet for this in a bug issue? This test makes sure validation of columns does indeed happen.

@borissmidt (Contributor, Author)

Yes, I like typed dataframes, because they are really good for documenting the code, so you don't make errors in column names or types. It also catches a lot of problems in the case of missing data.

Another use I made of it is as a definition of my xlsx report output. I use the title in the Field to actually set the title in the xlsx report output, and use reflection on the schema to get the right columns for serialization. This makes it very easy to change the order of columns in the output format and to change the output itself. In the case of a missing column, the code fails at the function that calculates the data, instead of me having to manually check the output file.

for example:

# Just extends SchemaModel
class MonthlySummary(SchemaModelXlsx):
    @classmethod
    @property
    def sheet_name(cls) -> str:  # this could be a title in the Config instead
        return "Summary"

    month: pat.Index[pat.DateTime] = pa.Field(check_name=True, title="Month")
    bruto_revenue: pat.Series[float] = pa.Field(title="Bruto revenue")
    expenses: pat.Series[float] = pa.Field(title="expenses")
    netto_revenue: pat.Series[float] = pa.Field(title="Netto revenue")


def revenue(
    sales: pat.DataFrame[ProductSales], services: pat.DataFrame[ServiceSales]
) -> pat.DataFrame[Revenue]:
    pass


def monthly_summary(
    bruto_revenue: pat.DataFrame[Revenue], expenses_per_day: pat.DataFrame[Expenses]
):
    # reindexes and then takes the difference between the different kinds
    # of revenue and expenses
    return pat.DataFrame[MonthlySummary](
        {
            "bruto_revenue": total_revenue,
            "expenses": total_expenses,
            "netto_revenue": total_revenue - total_expenses,
        }
    )

Ideally a typed dataframe should have a constructor, so you could call:

# each field should be typed to make construction easy
MonthlySummary(month, bruto_revenue, expenses, netto_revenue)
# or, if you want to extract it from a df; this could drop the 'unstated'
# columns to enforce that you don't just add some data:
MonthlySummary.from_df(df)

Having these specialized types could also add the opportunity to attach methods and properties to the dataframes, making it easy to calculate aggregated data with the defined types.
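The typed 'record' constructor idea can be approximated today with a TypedDict; the MonthlySummaryRecord name below is hypothetical, but once the columns are spelled out this way, a type checker or IDE flags misspelled or missing columns statically.

```python
# Hypothetical sketch: a TypedDict mirroring the schema's columns gives the
# IDE and type checker the column names, so a typo such as "bruto_reveneu"
# is flagged before the code runs.
from typing import TypedDict


class MonthlySummaryRecord(TypedDict):
    month: int
    bruto_revenue: float
    expenses: float
    netto_revenue: float


row: MonthlySummaryRecord = {
    "month": 1,
    "bruto_revenue": 10.0,
    "expenses": 4.0,
    "netto_revenue": 6.0,
}
print(row["netto_revenue"])  # 6.0
```

A list of such records could then feed from_records, keeping the test fixtures both typed and refactor-friendly.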

Can you provide a minimally reproducible code snippet for this in a bug issue? This test makes sure validation of columns does indeed happen

Looking at the test, it doesn't check for missing columns. I'll try to spend some time today to make a sample to double-check the problem.

@borissmidt (Contributor, Author)

borissmidt commented May 9, 2022

OK, it is only from_records that doesn't do any checks. But I only use it in my unit tests.

class MonthlySummary2(SchemaModel):
    month: pat.Index[int] = pa.Field(check_name=True, title="Month")
    bruto_revenue: pat.Series[float] = pa.Field(title="Bruto revenue")
    expenses: pat.Series[float] = pa.Field(title="expenses")
    netto_revenue: pat.Series[float] = pa.Field(title="Netto revenue")


# no schema error is raised, even though netto_revenue is missing:
df = pat.DataFrame[MonthlySummary2].from_records(
    [
        {
            "month": 1,
            "bruto_revenue": 1.0,
            "expenses": 2.0,
        }
    ],
    index=["month"],
)

@cosmicBboy (Collaborator)

Hi @borissmidt

Looking at the test, it doesn't check for missing columns. I'll try to spend some time today to make a sample to double-check the problem.

Yeah if you can send a code snippet (or maybe a PR 🙂) to update that test that would be great!

OK, it is only from_records that doesn't do any checks. But I only use it in my unit tests.

I'm down to support this use case, but I'm currently working on other stuff (#381) so if you'd like to own that part of the codebase I can help review changes and get them merged into the core library.

@borissmidt (Contributor, Author)

borissmidt commented May 10, 2022 via email

cosmicBboy added a commit that referenced this issue Aug 9, 2022
* Add a from record that checks the schema for a pandas dataframe

* Add a from record that checks the schema for a pandas dataframe

* handle nox session.install issue

* fix lint

* fix noxfile issue

* remove unneeded types

* update type annotation

Co-authored-by: cosmicBboy <niels.bantilan@gmail.com>
cosmicBboy added a commit that referenced this issue Aug 10, 2022
@cosmicBboy (Collaborator)

Fixed by #859.
