# dataframe calculations

## 1. introduction

* calculations with pandas are 'easy': most of them are short, vectorized one-liners over whole columns
* calculations are usually fast, although speed depends on your machine and on the size of the dataset

## 2. data loading

The link to the animals dataset is not provided, so let's use a 'huge' dataset of Renfe trips instead:

https://www.kaggle.com/thegurusteam/spanish-high-speed-rail-system-ticket-pricing/data

In [2]:
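# import pandas as pd  # plain pandas, replaced by modin below
import numpy as np
import os

# enable modin's out-of-core mode (set before modin is imported)
os.environ['MODIN_OUT_OF_CORE'] = 'true'
import modin.pandas as pd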

In [3]:
renfe = pd.read_csv('./data/spanish-high-speed-rail-system-ticket-pricing.zip')

In [4]:
renfe.info(memory_usage='deep')


<class 'modin.pandas.dataframe.DataFrame'>
RangeIndex: 7671354 entries, 0 to 7671353
Data columns (total 9 columns):
insert_date    object
origin         object
destination    object
start_date     object
end_date       object
train_type     object
price          float64
train_class    object
fare           object
dtypes: float64(1), object(8)
memory usage: 4.0 GB
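Most of the 4.0 GB reported above comes from the `object` (string) columns. As a minimal sketch in plain pandas, on a toy column that stands in for `origin` (the values and sizes are made up for illustration), converting a column with few distinct values to the `category` dtype usually cuts its memory footprint:

```python
import pandas as pd

# toy stand-in for a low-cardinality column such as `origin` (hypothetical data)
cities = pd.Series(['MADRID', 'BARCELONA', 'VALENCIA', 'SEVILLA'] * 2500)

# memory in bytes when stored as Python strings vs. as categories
as_object = cities.astype(object).memory_usage(deep=True)
as_category = cities.astype('category').memory_usage(deep=True)

print(as_object, as_category)
```

Whether this pays off on the real dataset depends on how many distinct values each column holds.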


In [5]:
renfe

Unnamed: 0,insert_date,origin,destination,start_date,end_date,train_type,price,train_class,fare
0,2019-04-11 21:49:46,MADRID,BARCELONA,2019-04-18 05:50:00,2019-04-18 08:55:00,AVE,68.95,Preferente,Promo
1,2019-04-11 21:49:46,MADRID,BARCELONA,2019-04-18 06:30:00,2019-04-18 09:20:00,AVE,75.40,Turista,Promo
2,2019-04-11 21:49:46,MADRID,BARCELONA,2019-04-18 07:00:00,2019-04-18 09:30:00,AVE,106.75,Turista Plus,Promo
3,2019-04-11 21:49:46,MADRID,BARCELONA,2019-04-18 07:30:00,2019-04-18 10:40:00,AVE,90.50,Turista Plus,Promo
4,2019-04-11 21:49:46,MADRID,BARCELONA,2019-04-18 08:00:00,2019-04-18 10:30:00,AVE,88.95,Turista,Promo
...,...,...,...,...,...,...,...,...,...
7671349,2019-05-25 21:26:25,VALENCIA,MADRID,2019-07-20 14:50:00,2019-07-20 22:17:00,REGIONAL,28.35,Turista,Adulto ida
7671350,2019-05-25 21:26:25,VALENCIA,MADRID,2019-07-20 14:10:00,2019-07-20 15:48:00,AVE,33.65,Turista,Promo
7671351,2019-05-25 21:26:25,VALENCIA,MADRID,2019-07-20 12:40:00,2019-07-20 14:20:00,AVE,45.30,Turista,Promo
7671352,2019-05-25 21:26:25,VALENCIA,MADRID,2019-07-20 10:40:00,2019-07-20 13:05:00,INTERCITY,15.70,Turista,Promo


## 3. calculations

### combining columns

with strings...

In [6]:
# note: the values read origin_destination, despite the column name
renfe['destination_origin'] = renfe['origin'] + '_' + renfe['destination']

In [7]:
renfe['destination_origin']

0          MADRID_BARCELONA
1          MADRID_BARCELONA
2          MADRID_BARCELONA
3          MADRID_BARCELONA
4          MADRID_BARCELONA
                 ...       
7671349     VALENCIA_MADRID
7671350     VALENCIA_MADRID
7671351     VALENCIA_MADRID
7671352     VALENCIA_MADRID
7671353     VALENCIA_MADRID
Name: destination_origin, Length: 7671354, dtype: object
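The same concatenation can also be written with `Series.str.cat`. The sketch below uses plain pandas and a two-row toy frame (not the Renfe data) to show that both spellings give the same result:

```python
import pandas as pd

# two-row toy frame with the same column names as the Renfe data
trips = pd.DataFrame({
    'origin': ['MADRID', 'VALENCIA'],
    'destination': ['BARCELONA', 'MADRID'],
})

# element-wise string concatenation with the + operator
with_plus = trips['origin'] + '_' + trips['destination']

# the same result with the .str accessor
with_cat = trips['origin'].str.cat(trips['destination'], sep='_')

print(with_cat.tolist())
```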

dates...

In [8]:
renfe['start_date'] = pd.to_datetime(renfe['start_date'])
renfe['end_date'] = pd.to_datetime(renfe['end_date'])



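The `converters` argument of `read_csv` can also parse dates while loading, but be careful: it calls a Python function on every value, which can be slow on a dataset this size.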

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
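As a sketch of parsing dates at load time, the example below uses the `parse_dates` argument of `read_csv` on a tiny in-memory CSV (two rows copied from the table above, read with plain pandas). How it compares in speed with calling `pd.to_datetime` afterwards on the full file is not measured here:

```python
import io
import pandas as pd

# tiny in-memory CSV standing in for the real file
csv_text = (
    "start_date,end_date,price\n"
    "2019-04-18 05:50:00,2019-04-18 08:55:00,68.95\n"
    "2019-04-18 06:30:00,2019-04-18 09:20:00,75.40\n"
)

# parse both date columns while reading
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['start_date', 'end_date'])

# the columns are datetimes already, so they can be subtracted directly
duration = sample['end_date'] - sample['start_date']
print(duration)
```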

In [21]:
renfe['trip_duration'] = renfe['end_date'] - renfe['start_date']

In [35]:
# total_seconds() covers the whole duration; .dt.seconds would drop whole days
renfe['trip_duration'].dt.total_seconds() / 3600

0          3.083333
1          2.833333
2          2.500000
3          3.166667
4          2.500000
             ...   
7671349    7.450000
7671350    1.633333
7671351    1.666667
7671352    2.416667
7671353    1.666667
Name: trip_duration, Length: 7671354, dtype: float64
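One detail worth knowing when converting durations to hours: `.dt.seconds` returns only the seconds component of each timedelta (whole days are kept separately in `.dt.days`), while `.dt.total_seconds()` covers the full duration. The two agree for trips shorter than a day and diverge beyond that, as this small sketch with made-up durations shows:

```python
import pandas as pd

# made-up durations: one under a day, one over a day
durations = pd.Series([
    pd.Timedelta(hours=2, minutes=30),
    pd.Timedelta(days=1, hours=1),
])

# seconds component only: the whole day of the second value is lost
component_hours = durations.dt.seconds / 3600

# full duration expressed in hours
total_hours = durations.dt.total_seconds() / 3600

print(component_hours.tolist())  # [2.5, 1.0]
print(total_hours.tolist())      # [2.5, 25.0]
```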

constants...

In [37]:
renfe['trip_duration'] * 10

0         1 days 06:50:00
1         1 days 04:20:00
2         1 days 01:00:00
3         1 days 07:40:00
4         1 days 01:00:00
                ...      
7671349   3 days 02:30:00
7671350   0 days 16:20:00
7671351   0 days 16:40:00
7671352   1 days 00:10:00
7671353   0 days 16:40:00
Name: trip_duration, Length: 7671354, dtype: timedelta64[ns]
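Timedeltas can be divided by constants as well as multiplied. A minimal sketch with two toy durations (plain pandas): multiplying by a number scales the duration, and dividing by `pd.Timedelta(hours=1)` turns it into a plain number of hours.

```python
import pandas as pd

# toy durations matching the first and third rows of the table above
durations = pd.Series([
    pd.Timedelta(hours=3, minutes=5),
    pd.Timedelta(hours=2, minutes=30),
])

# timedelta * number -> timedelta
scaled = durations * 10

# timedelta / timedelta -> float (here: hours)
hours = durations / pd.Timedelta(hours=1)

print(scaled.iloc[0])   # 1 days 06:50:00
print(hours.iloc[1])    # 2.5
```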

### conditional calculations

In [40]:
# label trips longer than two hours; total_seconds() also counts whole days
renfe['duration_description'] = np.where(renfe['trip_duration'].dt.total_seconds() > 3600 * 2,
                                         'long trip',
                                         'short trip')
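`np.where` handles a two-way split. For more than two labels, one option is `np.select`, which takes a list of conditions and a matching list of choices. The sketch below is illustrative only: the thresholds and the 'medium trip' label are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# toy trip durations in hours
hours = pd.Series([0.5, 2.5, 7.45])

# conditions are checked in order; the first one that matches wins
conditions = [hours < 2, hours < 5]          # hypothetical thresholds
choices = ['short trip', 'medium trip']      # 'medium trip' is a made-up label

# rows matching no condition get the default label
labels = np.select(conditions, choices, default='long trip')

print(labels.tolist())  # ['short trip', 'medium trip', 'long trip']
```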

In [43]:
# trip duration in hours, counting whole days as well
renfe['trip_duration_h'] = renfe['trip_duration'].dt.total_seconds() / 3600

In [48]:
renfe[['trip_duration_h', 'price']].mean(axis=1)

0          36.016667
1          39.116667
2          54.625000
3          46.833333
4          45.725000
             ...    
7671349    17.900000
7671350    17.641667
7671351    23.483333
7671352     9.058333
7671353    23.483333
Length: 7671354, dtype: float64

A row-wise calculation like this one (`axis=1`) may take a long time on a dataset this size...
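When only a couple of columns are involved, the row-wise mean can be written as plain column arithmetic, which may be quicker. A minimal sketch on a two-row toy frame (plain pandas; no timing is claimed here):

```python
import numpy as np
import pandas as pd

# two-row toy frame with the same column names as above
sample = pd.DataFrame({
    'trip_duration_h': [3.083333, 2.833333],
    'price': [68.95, 75.40],
})

# row-wise mean across the two columns
row_mean = sample[['trip_duration_h', 'price']].mean(axis=1)

# the same value computed column by column
column_mean = (sample['trip_duration_h'] + sample['price']) / 2

print(np.allclose(row_mean, column_mean))
```

The two differ when values are missing: `mean(axis=1)` skips `NaN` by default, while the arithmetic version returns `NaN` for any row with a missing value.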

## 4. using pandas 'out of core' with modin

In [None]:
!pip install "modin[ray]" psutil

https://modin.readthedocs.io/en/latest/UsingPandasonRay/index.html