
# PARLA

## Problem
- Estimate the average time between purchases:
    - Take all customers who made 2 or more purchases.
    - Calculate the time between purchases (a customer with N purchases should have N–1 time intervals).
    - Combine the time intervals of all customers and compute the average.
- Use the data from the file '2022-04-01T12_df_sales.csv' to solve the task.
- Provide the average number of days between purchases, rounded to the nearest whole number.
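
The steps above can be sketched on a toy dataset. This is a minimal illustration, not the notebook's solution: the sample rows and the helper name `mean_days_between_purchases` are made up, and it uses `groupby(...).diff()` rather than the shift-and-join approach used in the cells further down.

```python
import pandas as pd


def mean_days_between_purchases(df: pd.DataFrame) -> int:
    """Hypothetical helper: mean gap between consecutive purchases, in whole days."""
    df = df.sort_values(['user_id', 'date'])
    # diff() within each user gives N-1 intervals for N purchases;
    # each user's first purchase yields NaT and is dropped
    deltas = df.groupby('user_id')['date'].diff().dropna()
    # Timedelta / Timedelta gives fractional days as a float
    return round(deltas.mean() / pd.Timedelta(days=1))


# made-up data: user 'a' has gaps of 10 and 20 days, 'b' has 3 days, 'c' has none
toy = pd.DataFrame({
    'user_id': ['a', 'a', 'a', 'b', 'b', 'c'],
    'date': pd.to_datetime([
        '2022-01-01', '2022-01-11', '2022-01-31',
        '2022-01-05', '2022-01-08', '2022-01-09',
    ]),
})
result = mean_days_between_purchases(toy)
print(result)
```

Customers with a single purchase produce no interval here, so they drop out without an explicit filter.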

## Action
- I estimated the average time between purchases using Python and Pandas.
- I loaded, filtered, grouped, aggregated, and sorted the original dataframe.
- I created a dataframe in which each user's purchase dates are shifted by one row.
- I joined the two dataframes to calculate the time deltas.

## Result
- Successfully calculated the average time between purchases for customers with 2 or more purchases: about 17 days.

## Learning
- I reviewed the relevant Python and Pandas functionality.
- I reviewed working with the 'datetime' data type.

## Application
- I can apply the same Python and Pandas functionality to similar data-related problems.


In [43]:

import pandas as pd


In [44]:

# load sales data
df_sales = pd.read_csv('2022-04-01T12_df_sales.csv')
df_sales['date'] = pd.to_datetime(df_sales['date'])
df_sales.head()


Unnamed: 0,sale_id,date,count_pizza,count_drink,price,user_id
0,1000001,2022-02-04 10:00:24,1,0,720,1c1543
1,1000002,2022-02-04 10:02:28,1,1,930,a9a6e8
2,1000003,2022-02-04 10:02:35,3,1,1980,23420a
3,1000004,2022-02-04 10:03:06,1,1,750,3e8ed5
4,1000005,2022-02-04 10:03:23,1,1,870,cbc468


In [45]:

# find users with multiple sales
df_users_with_many_sales = df_sales.groupby(by='user_id')['sale_id'].count().reset_index()
df_users_with_many_sales = df_users_with_many_sales[df_users_with_many_sales.sale_id > 1]
df_users_with_many_sales.head()


Unnamed: 0,user_id,sale_id
0,000096,2
1,0000d4,2
2,0000de,3
3,0000e4,2
6,0001e2,2


In [46]:

# select users with multiple sales
df_sales = df_sales[df_sales.user_id.isin(df_users_with_many_sales.user_id)]
df_sales = df_sales[['user_id', 'date']]

# sort each user's purchases newest first, then shift dates up by one row
# so that each row is paired with the user's previous purchase date
df_sales = df_sales.sort_values(by=['user_id', 'date'], ascending=[True, False]).reset_index(drop=True)
df_shifted_dates = df_sales.groupby(['user_id']).shift(-1)
df_shifted_dates = df_shifted_dates.rename(columns={'date': 'shifted_date'})
df_shifted_dates.head()


Unnamed: 0,shifted_date
0,2022-03-04 11:15:55
1,NaT
2,2022-02-28 16:32:09
3,NaT
4,2022-03-11 19:33:20


In [47]:

# join sales with shifted dates (aligned on the row index)
df_merged = df_sales.join(df_shifted_dates)
df_merged.head()


Unnamed: 0,user_id,date,shifted_date
0,000096,2022-03-22 13:16:09,2022-03-04 11:15:55
1,000096,2022-03-04 11:15:55,NaT
2,0000d4,2022-03-27 11:26:30,2022-02-28 16:32:09
3,0000d4,2022-02-28 16:32:09,NaT
4,0000de,2022-03-25 17:01:47,2022-03-11 19:33:20


In [48]:

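# calculate deltas; rows with NaT are each user's earliest purchase and have no previous date
# (filter first, then copy, so the new column is added to an independent dataframe)
df_delta = df_merged[df_merged.shifted_date.notna()].copy()
df_delta['delta'] = df_delta.date - df_delta.shifted_date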
df_delta.head()


Unnamed: 0,user_id,date,shifted_date,delta
0,000096,2022-03-22 13:16:09,2022-03-04 11:15:55,18 days 02:00:14
2,0000d4,2022-03-27 11:26:30,2022-02-28 16:32:09,26 days 18:54:21
4,0000de,2022-03-25 17:01:47,2022-03-11 19:33:20,13 days 21:28:27
5,0000de,2022-03-11 19:33:20,2022-02-11 18:57:15,28 days 00:36:05
7,0000e4,2022-03-27 14:54:35,2022-02-28 12:41:47,27 days 02:12:48
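
As a sanity check on the approach above, a customer with N purchases should contribute N-1 intervals, so the number of deltas should equal the number of rows minus the number of users. A minimal sketch on made-up data (the toy rows are assumptions, not taken from the sales file):

```python
import pandas as pd

# made-up data: user 'a' has 3 purchases, user 'b' has 2
toy = pd.DataFrame({
    'user_id': ['a', 'a', 'a', 'b', 'b'],
    'date': pd.to_datetime([
        '2022-01-01', '2022-01-11', '2022-01-31',
        '2022-01-05', '2022-01-08',
    ]),
})

# same idea as above: newest first, then pull the previous purchase date up
toy = toy.sort_values(['user_id', 'date'], ascending=[True, False]).reset_index(drop=True)
toy['shifted_date'] = toy.groupby('user_id')['date'].shift(-1)
toy_delta = toy.dropna(subset=['shifted_date']).copy()
toy_delta['delta'] = toy_delta['date'] - toy_delta['shifted_date']

# N purchases per user -> N-1 intervals per user
expected_intervals = len(toy) - toy['user_id'].nunique()
assert len(toy_delta) == expected_intervals
# with a descending sort, every delta should be positive
assert (toy_delta['delta'] > pd.Timedelta(0)).all()
```

The same two checks could be run against `df_sales` and `df_delta` in the notebook.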


In [49]:

# calculate mean delta
df_delta.delta.mean()


Timedelta('17 days 07:37:57.653431857')
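
The task asks for the answer in whole days, so the `Timedelta` above still needs rounding. A minimal sketch, assuming the mean is as printed above (copied here by hand and truncated to whole seconds; in the notebook one would use `df_delta.delta.mean()` directly):

```python
import pandas as pd

# value copied from the output above, truncated to whole seconds
mean_delta = pd.Timedelta('17 days 07:37:57')

# dividing by a one-day Timedelta gives fractional days as a float
mean_days = mean_delta / pd.Timedelta(days=1)
rounded_days = round(mean_days)
print(rounded_days)
```

Note that the `.days` attribute truncates rather than rounds, so it would only match when the fractional part is below half a day, as it is here.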