# Calculating the average time between user login and sale (PARLA)

## Problem
Estimate the average time between a user's visit to the website and the corresponding purchase:
- A visit is considered related to a purchase if it occurred no earlier than two hours before the purchase.
- For each purchase, calculate the time between the purchase and the earliest website visit by the same user within the two hours prior to the purchase.
- Then compute the average of these time intervals.
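
To make the definition concrete, here is a minimal brute-force sketch on made-up data. The function name `average_gap_minutes`, the toy timestamps, and the choice to skip purchases with no visit in the window are assumptions for illustration; the boundaries (`>=` window start, `<` purchase time) mirror the filters used in the notebook cells.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=2)

def average_gap_minutes(visits, purchases):
    """Average minutes between each purchase and the earliest same-user visit
    inside the two-hour window before it. Purchases without such a visit are skipped."""
    gaps = []
    for user, bought_at in purchases:
        in_window = [
            seen_at for visitor, seen_at in visits
            if visitor == user and bought_at - WINDOW <= seen_at < bought_at
        ]
        if in_window:
            gaps.append((bought_at - min(in_window)).total_seconds() / 60)
    return sum(gaps) / len(gaps) if gaps else None

# Toy data, not taken from the CSV files
visits = [
    ("u1", datetime(2022, 3, 4, 8, 0)),    # more than two hours before the purchase: ignored
    ("u1", datetime(2022, 3, 4, 10, 0)),   # earliest visit inside the window
    ("u1", datetime(2022, 3, 4, 11, 0)),
    ("u2", datetime(2022, 3, 4, 9, 0)),
]
purchases = [
    ("u1", datetime(2022, 3, 4, 11, 30)),  # gap: 90 minutes
    ("u2", datetime(2022, 3, 4, 9, 20)),   # gap: 20 minutes
]

result = average_gap_minutes(visits, purchases)
print(result)  # 55.0
```

The nested loop is only meant to spell out the rule; it would be far too slow for real logs.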

## Action
- I loaded web log data and sales data from CSV files
- I selected the relevant columns and sorted both datasets by user and date
- I merged the datasets on `user_id`, kept only logins within the two hours before each sale, and took the earliest such login per sale
- I calculated the interval for each sale and averaged the intervals

## Result
- I calculated the average interval between login and sale: about 17 minutes

## Learning
- I revised the relevant Python and pandas functions

## Application
- I can apply the relevant Python and pandas functions to data-related problems

In [2]:

import pandas as pd


In [10]:

# load and prepare sales info
df_sales = pd.read_csv('2022-04-01T12_df_sales.csv')
df_sales = df_sales[['user_id', 'date']]
df_sales = df_sales.sort_values(['user_id', 'date'])
df_sales = df_sales.rename(columns={'date': 'sale'})  # 'date' holds the sale timestamp
df_sales.head()


| | user_id | sale |
|---|---|---|
| 101925 | 000096 | 2022-03-04 11:15:55 |
| 168536 | 000096 | 2022-03-22 13:16:09 |
| 90423 | 0000d4 | 2022-02-28 16:32:09 |
| 186586 | 0000d4 | 2022-03-27 11:26:30 |
| 28831 | 0000de | 2022-02-11 18:57:15 |


In [11]:

# load and prepare web-logs info
df_logs = pd.read_csv('2022-04-01T12_df_web_logs.csv')
df_logs = df_logs[['user_id', 'date']]
df_logs = df_logs.sort_values(['user_id', 'date'])
df_logs = df_logs.rename(columns={'date': 'login'})  # 'date' holds the login timestamp
df_logs.head()


| | user_id | login |
|---|---|---|
| 983633 | 96 | 2022-03-04 10:58:01 |
| 983704 | 96 | 2022-03-04 11:00:02 |
| 983803 | 96 | 2022-03-04 11:02:40 |
| 983890 | 96 | 2022-03-04 11:04:37 |
| 984189 | 96 | 2022-03-04 11:12:36 |
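
The two previews print `user_id` differently (`000096` for sales, `96` for web logs). This may only be a display quirk of the export, but if the columns really had different dtypes, the inner merge on `user_id` would silently drop rows. A small sketch of how to check and guard against that, assuming the IDs are meant to be text; the inline CSV is made up for illustration:

```python
import io
import pandas as pd

# Toy CSV (hypothetical rows), with zero-padded, digits-only IDs
csv_text = "user_id,date\n000096,2022-03-04 10:58:01\n000123,2022-03-04 11:00:02\n"

# Default type inference turns digits-only IDs into integers and drops the padding
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred['user_id'].tolist())  # [96, 123]

# Forcing the column to str keeps the IDs exactly as written in the file
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'user_id': str})
print(as_text['user_id'].tolist())   # ['000096', '000123']
```

Comparing `df_sales['user_id'].dtype` with `df_logs['user_id'].dtype` before merging is a cheap way to confirm the keys are compatible.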


In [27]:

# pair every login with every sale of the same user
df = pd.merge(df_logs, df_sales, how='inner', on='user_id')
df['login'] = pd.to_datetime(df['login'])
df['sale'] = pd.to_datetime(df['sale'])
df = df[df['login'] < df['sale']]  # keep only logins that precede the sale
df = df[df['login'] >= df['sale'] - pd.Timedelta(hours=2)]  # keep logins in the two-hour window before the sale
df = df.groupby(['user_id', 'sale'])[['login']].min().reset_index()  # for each sale, find the earliest login
df['delta'] = df['sale'] - df['login']  # interval between earliest login and sale
df['delta_seconds'] = df['delta'].dt.total_seconds()
avg_delta = round(df['delta_seconds'].mean() / 60)  # average interval in whole minutes
avg_delta


17
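
The inner merge above builds every login-sale pair per user before filtering, which can get large for active users. One possible alternative, sketched here on made-up data rather than the real CSV files, uses `pd.merge_asof` with `direction='forward'`: for each sale it looks up the first login at or after the start of the two-hour window, and a final filter discards matches that fall on or after the sale. The `window_start` column and the toy frames are assumptions for illustration.

```python
import pandas as pd

# Toy data (made up for illustration, not taken from the CSV files)
logins = pd.DataFrame({
    'user_id': ['a', 'a', 'a', 'b', 'c', 'c'],
    'login': pd.to_datetime([
        '2022-03-04 10:00', '2022-03-04 10:30', '2022-03-04 11:00',
        '2022-03-04 08:00', '2022-03-04 09:50', '2022-03-04 10:10',
    ]),
})
sales = pd.DataFrame({
    'user_id': ['a', 'a', 'b', 'c'],
    'sale': pd.to_datetime([
        '2022-03-04 11:15', '2022-03-04 15:00',
        '2022-03-04 12:00', '2022-03-04 10:20',
    ]),
})

# Start of the two-hour window before each sale
sales['window_start'] = sales['sale'] - pd.Timedelta(hours=2)

# For each sale, find the same user's first login at or after window_start.
# merge_asof needs both frames sorted by their respective key columns.
matched = pd.merge_asof(
    sales.sort_values('window_start'),
    logins.sort_values('login'),
    left_on='window_start',
    right_on='login',
    by='user_id',
    direction='forward',
)

# Drop sales with no login in the window (NaT) or whose first match is not before the sale
matched = matched[matched['login'] < matched['sale']]

avg_minutes = (matched['sale'] - matched['login']).dt.total_seconds().mean() / 60
print(avg_minutes)  # 52.5
```

In this toy data only two sales have a login inside their window (gaps of 75 and 30 minutes), so the average is 52.5 minutes. Whether this is faster than the plain merge on the real files would need to be measured.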