
Inferring dtypes in get_as_dataframe #1

Closed
thorbjornwolf opened this issue Mar 22, 2017 · 9 comments

@thorbjornwolf
Contributor

This is an enhancement proposal.

For my use case, it could be nice if gspread-dataframe was able to try to infer column dtypes when fetching data from a sheet. While individual cells are converted through numericise, their column dtype remains object, and the returned dataframe fails equality checks with the original dataframe.

Motivating example

>>> df = pd.DataFrame({'a': [4,1,2,4],
...                    'b': list('abba')},
...                    index=pd.Index(list('ABCD'), name='our index'))
>>> df
           a  b
our index      
A          4  a
B          1  b
C          2  b
D          4  a
>>> df.dtypes
a     int64
b    object
dtype: object
>>> ws =  # Get a test worksheet here
>>> set_with_dataframe(ws, df, include_index=True, resize=True)
>>> r = get_as_dataframe(ws, index_column_number=1)
>>> r  # Looks as expected
           a  b
our index      
A          4  a
B          1  b
C          2  b
D          4  a
>>> r.dtypes  # All object dtype
a    object
b    object
dtype: object
>>> [type(v) for v in r['a']]  # correctly converted to int
[int, int, int, int]
>>> df.equals(r)  # The equality check fails
False
>>> df['a'].equals(r['a'])  # because of the dtype of column 'a'.
False
>>> df['a'] == r['a']  # The values *are* the same, though.
our index
A    True
B    True
C    True
D    True
Name: a, dtype: bool
>>> df['b'].equals(r['b'])  # str works as expected
True

Suggested solution

I am unsure what is the best way to deal with this, and whether it is a general enough use-case to warrant an addition to gspread-dataframe. At any rate, the following code is my initial stab at how dtype inference could be implemented:

import pandas as pd

converters = (
    pd.to_numeric,
    pd.to_timedelta,
    pd.to_datetime,
)


def _assign_column_dtypes(df):
    # Try each converter on every column that is still object dtype.
    # errors='ignore' returns the input unchanged if conversion fails,
    # so columns that match no converter are left as-is.
    for conv in converters:
        for col in df:
            if df[col].dtype != object:
                continue
            df[col] = conv(df[col], errors='ignore')

    return df

It intentionally places timedelta before datetime, as '00:03:00' can be interpreted as either one by pandas. In my use-case, datetimes always include a date, so '00:03:00' would definitely be a timedelta.
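The ambiguity is easy to demonstrate with plain pandas, no gspread needed:

```python
import pandas as pd

# '00:03:00' is a valid timedelta...
td = pd.to_timedelta('00:03:00')
print(td)  # 0 days 00:03:00

# ...but pd.to_datetime also accepts it, attaching today's date,
# which is why to_timedelta must be tried first.
dt = pd.to_datetime('00:03:00')
print(dt.time())  # 00:03:00
```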

Take it for a spin!

# Construct a dataframe where everything is either str or object
n = 10
df = pd.DataFrame({
    'datetime str': pd.date_range('2017-03-15', freq='D', periods=n
                                  ).astype(str),
    'timedelta str': pd.timedelta_range('00:03:00', periods=n, freq='10 s'
                                        ).to_native_types().astype(str),
    'int obj': pd.Series(range(n), dtype=object),
    'int str': [str(i) for i in range(n)],
    'float obj': pd.Series(map(float, range(n)), dtype=object),
    'float str': [str(float(i)) for i in range(n)],
})

print(df)
#   datetime str float obj float str int obj int str timedelta str
# 0   2017-03-15         0       0.0       0       0      00:03:00
# 1   2017-03-16         1       1.0       1       1      00:03:10
# 2   2017-03-17         2       2.0       2       2      00:03:20
# 3   2017-03-18         3       3.0       3       3      00:03:30
# 4   2017-03-19         4       4.0       4       4      00:03:40
# 5   2017-03-20         5       5.0       5       5      00:03:50
# 6   2017-03-21         6       6.0       6       6      00:04:00
# 7   2017-03-22         7       7.0       7       7      00:04:10
# 8   2017-03-23         8       8.0       8       8      00:04:20
# 9   2017-03-24         9       9.0       9       9      00:04:30

print(df.dtypes)
# datetime str     object
# float obj        object
# float str        object
# int obj          object
# int str          object
# timedelta str    object
# dtype: object


df = _assign_column_dtypes(df)

print(df)
#   datetime str  float obj  float str  int obj  int str  timedelta str
# 0   2017-03-15        0.0        0.0        0        0       00:03:00
# 1   2017-03-16        1.0        1.0        1        1       00:03:10
# 2   2017-03-17        2.0        2.0        2        2       00:03:20
# 3   2017-03-18        3.0        3.0        3        3       00:03:30
# 4   2017-03-19        4.0        4.0        4        4       00:03:40
# 5   2017-03-20        5.0        5.0        5        5       00:03:50
# 6   2017-03-21        6.0        6.0        6        6       00:04:00
# 7   2017-03-22        7.0        7.0        7        7       00:04:10
# 8   2017-03-23        8.0        8.0        8        8       00:04:20
# 9   2017-03-24        9.0        9.0        9        9       00:04:30

print(df.dtypes)
# datetime str      datetime64[ns]
# float obj                float64
# float str                float64
# int obj                    int64
# int str                    int64
# timedelta str    timedelta64[ns]
# dtype: object
@robin900
Owner

Very nice description of the problem. I also am unsure if the solution belongs in gspread-dataframe.

What do you think of the arguments to pandas.read_csv? Many of its arguments have the same purpose: to infer pandas dtypes for incoming values read from a file format that has no support for including data type metadata for columns.

@thorbjornwolf
Contributor Author

thorbjornwolf commented Mar 27, 2017

Does it belong here?

(IMHO: Yes)
I have discussed it a bit with a colleague and thought it over: I think this functionality does belong in gspread-dataframe. After all, the goal of gspread-dataframe is to remove friction from using dataframes with Google Sheets (gspread specifically).

Your work thus far has removed (all?) the real pain points like batching, offsets, and performance. The remaining problems are minor, and as of now, I can list only two:

  • I can't say get_as_dataframe(...).resample('M', on='date') without transforming 'date' myself first.
  • Equality fails, as shown above.

If we can make an approximate solution to both problems by just trying out the three different pd.to_* functions, I think that is a low-hanging fruit.

API

I see three options, each having its pros and cons:

  • Being consistent with the API of pandas.read_csv (which I think is messy),
  • Implementing our own flexible low-level API (which will be verbose),
  • Implementing our own brief (but inflexible) high-level API. (Full disclosure: My favourite!)

Here are some suggestions for how we could approach each of the options:

Consistency with read_csv

We could borrow such kwargs as

  • dtype (a dict where the user specifies the wanted dtypes)
  • parse_dates (mark columns to parse datetimes from)

However, I get the feeling that the selection of kwargs there is hacked together over a long time; I always found them a bit confusing :-)
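For reference, the read_csv equivalents of those two kwargs look roughly like this (hypothetical column names, reading from an in-memory buffer instead of a sheet):

```python
import io
import pandas as pd

csv = io.StringIO("date,count\n2017-03-22,4\n2017-03-23,1\n")

# dtype pins a column's type explicitly;
# parse_dates marks columns for datetime parsing.
df = pd.read_csv(csv, dtype={'count': 'int64'}, parse_dates=['date'])
print(df.dtypes)  # date is datetime64[ns], count is int64
```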

Low-level (high detail)

We could consider a more low-level approach, for instance

  • numeric_cols (list of cols)
  • datetime_cols (list of cols)
  • timedelta_cols (list of cols)

However, IMHO this is already so verbose that the end user might as well do the conversions themselves.

High-level (low detail)

  • infer_dtypes (bool or list of columns)

where the dataframe, or a subset of its columns, is run through _assign_column_dtypes. It could default to False, so that we don't step on anyone's toes by forcefully converting their strings.
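A sketch of what that high-level option could look like (the infer_dtypes name, the apply_infer_dtypes helper, and the try/except fallback are assumptions for illustration, not an agreed API):

```python
import pandas as pd


def _try_convert(series):
    # Try converters in order; timedelta before datetime, as above.
    # First successful conversion wins; otherwise return unchanged.
    for conv in (pd.to_numeric, pd.to_timedelta, pd.to_datetime):
        try:
            return conv(series)
        except (ValueError, TypeError):
            continue
    return series


def apply_infer_dtypes(df, infer_dtypes=False):
    """Hypothetical post-processing step for get_as_dataframe.

    infer_dtypes=False leaves df untouched; True infers all object
    columns; a list of column names restricts inference to those.
    """
    if infer_dtypes is False:
        return df
    cols = df.columns if infer_dtypes is True else infer_dtypes
    for col in cols:
        if df[col].dtype == object:
            df[col] = _try_convert(df[col])
    return df
```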

What do you think?

@robin900
Owner

I wonder: Since pandas.io.parsers seems to have factored out a whole lot of the common functionality of parsing, perhaps we could attempt to implement this as a pandas IO parser and then defer to pandas for all of this?

I'm looking especially at PythonParser and how we might subclass or otherwise adapt it to read a worksheet's values appropriately.

https://github.com/pandas-dev/pandas/blob/master/pandas/io/parsers.py#L1852

@robin900
Owner

Actually I think we may have to go all the way to ParserBase and extend that:

https://github.com/pandas-dev/pandas/blob/master/pandas/io/parsers.py#L1101

@robin900
Owner

@NTAWolf Turns out to be very easy to hook up pandas.io.parsers.TextParser to process worksheet.get_all_values(). Here is a working example:

https://github.com/robin900/gspread-dataframe/tree/pandas-parser

Note: None of the get_as_dataframe options like evaluate_formulas are yet supported; and I have not tested all the combinations of the pandas "read options". But I can get/set a DataFrame<->gspread.Worksheet using a sample sheet and it works fine. Let me know your thoughts...

@robin900
Owner

pandas-parser branch now supports evaluate_formulas; the other previously specified arguments to get_as_dataframe are supplanted by standard pandas parser arguments: index_col and header. (numericise I've removed but can re-add if you find it a useful convenience.)

@thorbjornwolf
Contributor Author

Great job! It works like a charm for me :-)
Good thinking on reusing the existing pandas machinery - I wasn't aware this TextParser existed!

@robin900
Owner

OK, I will be adding some tests to exercise the different keyword arguments for TextParser and ensure that the resulting DataFrames are as expected. Then I will plan a major version release.

In the meantime, a quick recipe with the current release is below. (It will always evaluate_formulas; to effect evaluate_formulas=False, you will need to build a list of values yourself using cell.input_value.)

from pandas.io.parsers import TextParser

def get_as_dataframe(worksheet, **options):
    return TextParser(worksheet.get_all_values(), **options).read()
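To try that recipe without a live spreadsheet, a stub object exposing get_all_values() is enough (the FakeWorksheet class below is purely illustrative):

```python
from pandas.io.parsers import TextParser


def get_as_dataframe(worksheet, **options):
    # The recipe above: feed the raw cell values to pandas' parser.
    return TextParser(worksheet.get_all_values(), **options).read()


class FakeWorksheet:
    # Stand-in for gspread.Worksheet; returns rows of cell strings,
    # first row acting as the header, as get_all_values() would.
    def get_all_values(self):
        return [['a', 'b'], ['1', 'x'], ['2', 'y']]


df = get_as_dataframe(FakeWorksheet())
print(df.dtypes)  # 'a' is inferred as int64, 'b' stays object
```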

@robin900
Owner

@NTAWolf I've opened #2 to represent the switch to TextParser; a PR will happen in the next few days to implement. I'm going to close this issue; let me know if you think it should be re-opened.

robin900 pushed a commit that referenced this issue Jul 26, 2017
…ell values

in a DataFrame. Deal with regression where float precision is mangled
during round-trip testing, by using repr() on float values and str()
on other values.

Fixes #1.