
Inferring dtypes in get_as_dataframe #1

Closed
thorbjornwolf opened this issue Mar 22, 2017 · 9 comments

@thorbjornwolf
Contributor

This is an enhancement proposal.

For my use case, it could be nice if gspread-dataframe was able to try to infer column dtypes when fetching data from a sheet. While individual cells are converted through numericise, their column dtype remains object, and the returned dataframe fails equality checks with the original dataframe.

Motivating example

>>> df = pd.DataFrame({'a': [4,1,2,4],
...                    'b': list('abba')},
...                    index=pd.Index(list('ABCD'), name='our index'))
>>> df
           a  b
our index      
A          4  a
B          1  b
C          2  b
D          4  a
>>> df.dtypes
a     int64
b    object
dtype: object
>>> ws =  # Get a test worksheet here
>>> set_with_dataframe(ws, df, include_index=True, resize=True)
>>> r = get_as_dataframe(ws, index_column_number=1)
>>> r  # Looks as expected
           a  b
our index      
A          4  a
B          1  b
C          2  b
D          4  a
>>> r.dtypes  # All object dtype
a    object
b    object
dtype: object
>>> [type(v) for v in r['a']]  # correctly converted to int
[int, int, int, int]
>>> df.equals(r)  # The equality check fails
False
>>> df['a'].equals(r['a'])  # because of the dtype of column 'a'.
False
>>> df['a'] == r['a']  # The values *are* the same, though.
our index
A    True
B    True
C    True
D    True
Name: a, dtype: bool
>>> df['b'].equals(r['b'])  # str works as expected
True

Suggested solution

I am unsure what is the best way to deal with this, and whether it is a general enough use-case to warrant an addition to gspread-dataframe. At any rate, the following code is my initial stab at how dtype inference could be implemented:

import pandas as pd

converters = (
    pd.to_numeric,
    pd.to_timedelta,
    pd.to_datetime,
)


def _assign_column_dtypes(df):
    # Try each converter on every column that is still object dtype.
    # errors='ignore' returns the input unchanged if conversion fails,
    # so columns that match no converter are left as-is.
    for conv in converters:
        for col in df:
            if df[col].dtype != object:
                continue
            df[col] = conv(df[col], errors='ignore')

    return df

It intentionally places timedelta before datetime, as '00:03:00' can be interpreted as either one by pandas. In my use-case, datetimes always include a date, so '00:03:00' would definitely be a timedelta.
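The ambiguity is easy to demonstrate with plain pandas, no gspread needed:

```python
import pandas as pd

# '00:03:00' is a valid timedelta...
td = pd.to_timedelta('00:03:00')
print(td)  # 0 days 00:03:00

# ...but pd.to_datetime also accepts it, attaching today's date,
# which is why to_timedelta must be tried first.
dt = pd.to_datetime('00:03:00')
print(dt.time())  # 00:03:00
```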

Take it for a spin!

# Construct a dataframe where everything is either str or object
n = 10
df = pd.DataFrame({
    'datetime str': pd.date_range('2017-03-15', freq='D', periods=n
                                  ).astype(str),
    'timedelta str': pd.timedelta_range('00:03:00', periods=n, freq='10 s'
                                        ).to_native_types().astype(str),
    'int obj': pd.Series(range(n), dtype=object),
    'int str': [str(i) for i in range(n)],
    'float obj': pd.Series(map(float, range(n)), dtype=object),
    'float str': [str(float(i)) for i in range(n)],
})

print(df)
#   datetime str float obj float str int obj int str timedelta str
# 0   2017-03-15         0       0.0       0       0      00:03:00
# 1   2017-03-16         1       1.0       1       1      00:03:10
# 2   2017-03-17         2       2.0       2       2      00:03:20
# 3   2017-03-18         3       3.0       3       3      00:03:30
# 4   2017-03-19         4       4.0       4       4      00:03:40
# 5   2017-03-20         5       5.0       5       5      00:03:50
# 6   2017-03-21         6       6.0       6       6      00:04:00
# 7   2017-03-22         7       7.0       7       7      00:04:10
# 8   2017-03-23         8       8.0       8       8      00:04:20
# 9   2017-03-24         9       9.0       9       9      00:04:30

print(df.dtypes)
# datetime str     object
# float obj        object
# float str        object
# int obj          object
# int str          object
# timedelta str    object
# dtype: object


df = _assign_column_dtypes(df)

print(df)
#   datetime str  float obj  float str  int obj  int str  timedelta str
# 0   2017-03-15        0.0        0.0        0        0       00:03:00
# 1   2017-03-16        1.0        1.0        1        1       00:03:10
# 2   2017-03-17        2.0        2.0        2        2       00:03:20
# 3   2017-03-18        3.0        3.0        3        3       00:03:30
# 4   2017-03-19        4.0        4.0        4        4       00:03:40
# 5   2017-03-20        5.0        5.0        5        5       00:03:50
# 6   2017-03-21        6.0        6.0        6        6       00:04:00
# 7   2017-03-22        7.0        7.0        7        7       00:04:10
# 8   2017-03-23        8.0        8.0        8        8       00:04:20
# 9   2017-03-24        9.0        9.0        9        9       00:04:30

print(df.dtypes)
# datetime str      datetime64[ns]
# float obj                float64
# float str                float64
# int obj                    int64
# int str                    int64
# timedelta str    timedelta64[ns]
# dtype: object
@robin900
Owner

Very nice description of the problem. I also am unsure if the solution belongs in gspread-dataframe.

What do you think of the arguments to pandas.read_csv? Many of its arguments have the same purpose: to infer pandas dtypes for incoming values read from a file format that has no support for including data type metadata for columns.

@thorbjornwolf
Contributor Author

thorbjornwolf commented Mar 27, 2017

Does it belong here?

(IMHO: Yes)
I have discussed it a bit with a colleague and thought it over: I think this functionality does belong in gspread-dataframe. After all, the goal of gspread-dataframe is to remove friction from using dataframes with Google Sheets (gspread specifically).

Your work thus far has removed (all?) the real pain points like batching, offsets, and performance. The remaining problems are minor, and as of now, I can list only two:

  • I can't say get_as_dataframe(...).resample('M', on='date') without transforming 'date' myself first.
  • Equality fails, as shown above.

If we can make an approximate solution to both problems by just trying out the three different pd.to_* functions, I think that is a low-hanging fruit.

API

I see three options, each having its pros and cons:

  • Being consistent with the API of pandas.read_csv (which I think is messy),
  • Implementing our own flexible low-level API (which will be verbose),
  • Implementing our own brief (but inflexible) high-level API. (Full disclosure: My favourite!)

Here are some suggestions for how we could approach each of the options:

Consistency with read_csv

We could borrow such kwargs as

  • dtype (a dict where the user specifies the wanted dtypes)
  • parse_dates (mark columns to parse datetimes from)

However, I get the feeling that the selection of kwargs there is hacked together over a long time; I always found them a bit confusing :-)
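For reference, the read_csv equivalents of those two kwargs look roughly like this (hypothetical column names, reading from an in-memory buffer instead of a sheet):

```python
import io
import pandas as pd

csv = io.StringIO("date,count\n2017-03-22,4\n2017-03-23,1\n")

# dtype pins a column's type explicitly;
# parse_dates marks columns for datetime parsing.
df = pd.read_csv(csv, dtype={'count': 'int64'}, parse_dates=['date'])
print(df.dtypes)  # date is datetime64[ns], count is int64
```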

Low-level (high detail)

We could consider a more low-level approach, for instance

  • numeric_cols (list of cols)
  • datetime_cols (list of cols)
  • timedelta_cols (list of cols)

However, IMHO this is already so verbose that the end user might as well do the conversions themselves.

High-level (low detail)

  • infer_dtypes (bool or list of columns)

where the dataframe, or a subset of its columns, is run through _assign_column_dtypes. It could default to False, so that we don't step on anyone's toes by forcefully converting their strings.
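A sketch of what that high-level option could look like (the infer_dtypes name, the apply_infer_dtypes helper, and the try/except fallback are assumptions for illustration, not an agreed API):

```python
import pandas as pd


def _try_convert(series):
    # Try converters in order; timedelta before datetime, as above.
    # First successful conversion wins; otherwise return unchanged.
    for conv in (pd.to_numeric, pd.to_timedelta, pd.to_datetime):
        try:
            return conv(series)
        except (ValueError, TypeError):
            continue
    return series


def apply_infer_dtypes(df, infer_dtypes=False):
    """Hypothetical post-processing step for get_as_dataframe.

    infer_dtypes=False leaves df untouched; True infers all object
    columns; a list of column names restricts inference to those.
    """
    if infer_dtypes is False:
        return df
    cols = df.columns if infer_dtypes is True else infer_dtypes
    for col in cols:
        if df[col].dtype == object:
            df[col] = _try_convert(df[col])
    return df
```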

What do you think?

@robin900
Owner

I wonder: Since pandas.io.parsers seems to have factored out a whole lot of the common functionality of parsing, perhaps we could attempt to implement this as a pandas IO parser and then defer to pandas for all of this?

I'm looking especially at PythonParser and how we might subclass or otherwise adapt it to read a worksheet's values appropriately.

https://github.com/pandas-dev/pandas/blob/master/pandas/io/parsers.py#L1852

@robin900
Owner

Actually I think we may have to go all the way to ParserBase and extend that:

https://github.com/pandas-dev/pandas/blob/master/pandas/io/parsers.py#L1101

@robin900
Owner

@NTAWolf Turns out to be very easy to hook up pandas.io.parsers.TextParser to process worksheet.get_all_values(). Here is a working example:

https://github.com/robin900/gspread-dataframe/tree/pandas-parser

Note: None of the get_as_dataframe options like evaluate_formulas are yet supported; and I have not tested all the combinations of the pandas "read options". But I can get/set a DataFrame<->gspread.Worksheet using a sample sheet and it works fine. Let me know your thoughts...

@robin900
Owner

pandas-parser branch now supports evaluate_formulas; the other previously specified arguments to get_as_dataframe are supplanted by standard pandas parser arguments: index_col and header. (numericise I've removed but can re-add if you find it a useful convenience.)

@thorbjornwolf
Contributor Author

Great job! It works like a charm for me :-)
Good thinking on reusing the existing pandas machinery - I wasn't aware this TextParser existed!

@robin900
Owner

OK, I will be adding some tests to exercise the different keyword arguments for TextParser and ensure that the resulting DataFrames are as expected. Then I will plan a major version release.

In the meantime, a quick recipe with the current release is below. (It will always evaluate_formulas; to effect evaluate_formulas=False, you will need to build a list of values yourself using cell.input_value.)

from pandas.io.parsers import TextParser

def get_as_dataframe(worksheet, **options):
    return TextParser(worksheet.get_all_values(), **options).read()
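To try that recipe without a live spreadsheet, a stub object exposing get_all_values() is enough (the FakeWorksheet class below is purely illustrative):

```python
from pandas.io.parsers import TextParser


def get_as_dataframe(worksheet, **options):
    # The recipe above: feed the raw cell values to pandas' parser.
    return TextParser(worksheet.get_all_values(), **options).read()


class FakeWorksheet:
    # Stand-in for gspread.Worksheet; returns rows of cell strings,
    # first row acting as the header, as get_all_values() would.
    def get_all_values(self):
        return [['a', 'b'], ['1', 'x'], ['2', 'y']]


df = get_as_dataframe(FakeWorksheet())
print(df.dtypes)  # 'a' is inferred as int64, 'b' stays object
```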

@robin900
Owner

@NTAWolf I've opened #2 to represent the switch to TextParser; a PR will happen in the next few days to implement. I'm going to close this issue; let me know if you think it should be re-opened.

robin900 pushed a commit that referenced this issue Jul 26, 2017
…ell values

in a DataFrame. Deal with regression where float precision is mangled
during round-trip testing, by using repr() on float values and str()
on other values.

Fixes #1.