Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rolling_apply only applicable to numeric columns #4964

Closed
aschilling opened this issue Sep 24, 2013 · 12 comments
Closed

rolling_apply only applicable to numeric columns #4964

aschilling opened this issue Sep 24, 2013 · 12 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@aschilling
Copy link

Hi everybody,

I discovered that the rolling_apply function is only applicable to numeric columns. I think this should be changed as this seems too limited to me. Let's take the following example,

    import datetime as DT
    df = pd.DataFrame({
    'Buyer': 'Carl Mark Carl Joe Joe Carl'.split(),
    'Quantity': [1,3,5,8,9,3],
    'Date' : [
        DT.datetime(2013,9,1,13,0),
        DT.datetime(2013,9,1,13,5),
        DT.datetime(2013,10,1,20,0),
        DT.datetime(2013,10,3,10,0),
        DT.datetime(2013,12,2,12,0),                                      
        DT.datetime(2013,12,2,14,0),
        ]}).set_index('Date')

Now I want to count all new customers each 10 days.

buyers = []

def novices(x):
    new = [n for n in x if n not in buyers]

    if (len(new) > 0): buyers.extend(new) 

    return len(new)


pd.rolling_apply(df['Buyer'], 10, novices)

throws an exception "ValueError: could not convert string to float: Carl"

However, although meaningless a call with a numeric column such as:

pd.rolling_apply(df['Quantity'], 2, novices)

works.

@jreback
Copy link
Contributor

jreback commented Sep 24, 2013

this is a bit of work, because need right now the rollinging functions only accept float dtypes, so need a generator for object types (not that hard) in the code generator. then need to dispatch based on dtype. will put on the board for 0.14.

@aschilling
Copy link
Author

Thanks, is there maybe an alternative solution how I could implement the functionality with the current pandas version?

@jreback
Copy link
Contributor

jreback commented Sep 24, 2013

There's an embeded bug here (that's why i am astyping)...separate issue

In [28]: df['Quantity'].astype('float').resample('10D',how=np.sum)
Out[28]: 
Date
2013-09-01 13:00:00     4
2013-09-11 13:00:00   NaN
2013-09-21 13:00:00   NaN
2013-10-01 13:00:00    13
2013-10-11 13:00:00   NaN
2013-10-21 13:00:00   NaN
2013-10-31 13:00:00   NaN
2013-11-10 13:00:00   NaN
2013-11-20 13:00:00   NaN
2013-11-30 13:00:00    12
2013-12-10 13:00:00     0
Freq: 10D, dtype: float64

@aschilling
Copy link
Author

Thanks, but I think you got me wrong, I am looking for a possibility to count at each point the number of new customers in the upcoming 10 days. Hence, I need to compare the names of the customers to see if they are new or already known.

@jreback
Copy link
Contributor

jreback commented Sep 24, 2013

I think this will work. It is a 'manual' rolling_apply
you can do anything you want the function (e.g. if you can return a Series with multiple values or a DataFrame)
and can efectively 'group' how you want

The key is return None from the function which concat ignores

In [25]: df
Out[25]: 
                    Buyer  Quantity
Date                               
2013-09-01 13:00:00  Carl         1
2013-09-01 13:05:00  Mark         3
2013-10-01 20:00:00  Carl         5
2013-10-03 10:00:00   Joe         8
2013-12-02 12:00:00   Joe         9
2013-12-02 14:00:00  Carl         3

In [24]: dates = date_range(df.index[0],df.index[-1],freq='D')                                          

In [22]: def f(x,d):
   ....:     if len(x):
   ....:         return Series([x['Quantity'].sum()],index=[d])
   ....:     return None
   ....: 

In [23]: concat([ f(df.loc[(df.index>dates[i-10]) & (df.index<=dates[i])], dates[i]) for i in xrange(10,len(dates)) ])
Out[23]: 
2013-09-11 13:00:00     3
2013-10-02 13:00:00     5
2013-10-03 13:00:00    13
2013-10-04 13:00:00    13
2013-10-05 13:00:00    13
2013-10-06 13:00:00    13
2013-10-07 13:00:00    13
2013-10-08 13:00:00    13
2013-10-09 13:00:00    13
2013-10-10 13:00:00    13
2013-10-11 13:00:00    13
2013-10-12 13:00:00     8
2013-12-02 13:00:00     9
dtype: int64

@aschilling
Copy link
Author

Thanks, that is a great solution. Is there any possibility to use in your last statement [23] a TimeOffset (e.g. a week or month) to select the rows for the f function while still having the variable 'dates' on a daily basis?

@jreback
Copy link
Contributor

jreback commented Sep 25, 2013

In [24]: offset = pd.offsets.MonthEnd('1')

In [25]: concat([ f(df.loc[(df.index>d-offset) & (df.index<=d)], d) for d in dates ])
Out[25]: 
2013-09-01 13:00:00    1
2013-09-02 13:00:00    4
2013-09-03 13:00:00    4
2013-09-04 13:00:00    4
2013-09-05 13:00:00    4
2013-09-06 13:00:00    4
2013-09-07 13:00:00    4
2013-09-08 13:00:00    4
2013-09-09 13:00:00    4
2013-09-10 13:00:00    4
2013-09-11 13:00:00    4
2013-09-12 13:00:00    4
2013-09-13 13:00:00    4
2013-09-14 13:00:00    4
2013-09-15 13:00:00    4
...
2013-10-18 13:00:00    13
2013-10-19 13:00:00    13
2013-10-20 13:00:00    13
2013-10-21 13:00:00    13
2013-10-22 13:00:00    13
2013-10-23 13:00:00    13
2013-10-24 13:00:00    13
2013-10-25 13:00:00    13
2013-10-26 13:00:00    13
2013-10-27 13:00:00    13
2013-10-28 13:00:00    13
2013-10-29 13:00:00    13
2013-10-30 13:00:00    13
2013-10-31 13:00:00    13
2013-12-02 13:00:00     9
Length: 61

@dashesy
Copy link
Contributor

dashesy commented Jun 10, 2015

is it possible to ignore string columns and apply the rolling function to the rest?

@jreback
Copy link
Contributor

jreback commented Mar 6, 2016

going to repurpose this issue to skip nuiscance columns in a dataframe

In [4]: df = DataFrame({'A' : pd.date_range('20130101',periods=3),'B' : range(3)})

In [5]: df
Out[5]: 
           A  B
0 2013-01-01  0
1 2013-01-02  1
2 2013-01-03  2

This is a supported op

In [7]: df.rolling(window=2).count()
Out[7]: 
     A    B
0  1.0  1.0
1  2.0  2.0
2  2.0  2.0

This works

In [9]: df.rolling(window=2).B.sum()
Out[9]: 
0    NaN
1    1.0
2    3.0
Name: B, dtype: float64
df.rolling(window=2).sum()

should work (but excluded A) from the results.

@jreback
Copy link
Contributor

jreback commented Mar 6, 2016

closing in favor of #12537

@jreback jreback closed this as completed Mar 6, 2016
@Darel13712
Copy link

Darel13712 commented Mar 21, 2018

@jreback Hello is there any possibility to get rolling.apply for objects in 2018?) It still ignores objects as of 0.22.0.

idx = pd.date_range(pd.to_datetime('2017-01-01'), pd.to_datetime('2017-01-03'))
df = pd.DataFrame([[1, 's'],[2, 't'],[3, 'r']], index=idx)
0 1
2017-01-01 1 s
2017-01-02 2 t
2017-01-03 3 r

it's ok to ignore objects with built-in functions

df.rolling(2, min_periods=1).sum()
0 1
2017-01-01 1 s
2017-01-02 3 t
2017-01-03 5 r

but why can't i handle objects myself?

def const(x):
    return 5
df.rolling(2, min_periods=1).apply(const)
0 1
2017-01-01 5 s
2017-01-02 5 t
2017-01-03 5 r

This example above is not really useful but I want to be able to apply function to a column that contains objects.

The only solution I've found is to create your own DataFrame/Series type https://stackoverflow.com/questions/44479384/pandas-rolling-apply-doesnt-do-anything

But I really think this should be in pandas by default.

@jreback
Copy link
Contributor

jreback commented Mar 21, 2018

would always take a PR for this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

4 participants