# Online Casino Product Analytics
## A/B Test Performance Estimation

### Required Libraries

In [40]:
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency

### Reading Data into DataFrame

In [7]:
df = pd.read_parquet('events.parquet', engine='pyarrow')

### Preliminary Data Exploration

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 835357 entries, 0 to 835356
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   user_id     835357 non-null  object        
 1   user_group  835357 non-null  int64         
 2   time        835357 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 19.1+ MB


In [46]:
df.head()

Unnamed: 0,user_id,user_group,time,time_diff,new_session,session_id
82843,test_user_10021823,1,2023-12-07 01:08:51.008309,0.0,False,0
82853,test_user_10021823,1,2023-12-07 01:08:53.085064,2.076755,False,0
82865,test_user_10021823,1,2023-12-07 01:08:56.702185,3.617121,False,0
82866,test_user_10021823,1,2023-12-07 01:08:57.730887,1.028702,False,0
82867,test_user_10021823,1,2023-12-07 01:08:57.808475,0.077588,False,0


In [47]:
df.tail()

Unnamed: 0,user_id,user_group,time,time_diff,new_session,session_id
253401,test_user_999580,0,2023-12-06 15:42:07.074273,2.303254,False,1
209051,test_user_9998026,1,2023-12-06 15:36:14.988054,0.0,False,0
209054,test_user_9998026,1,2023-12-06 15:36:15.099478,0.111424,False,0
209056,test_user_9998026,1,2023-12-06 15:36:15.535345,0.435867,False,0
209073,test_user_9998026,1,2023-12-06 15:36:20.240469,4.705124,False,0


### Building a DataFrame for A/B Test Performance Estimation

The objective is to construct a table in which each row represents a single session of a single user, recording when the session started and ended along with the session date. The table has the following columns:
* `user_id`
* `ab_group`
* `session_start`
* `session_end`
* `session_date`

Sessions are delimited by a 30-minute inactivity rule: an event that occurs more than 30 minutes after the same user's previous event starts a new session.
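To make the inactivity rule concrete, here is a minimal sketch on a hypothetical four-event toy frame (the `toy` DataFrame and its timestamps are invented for illustration and are not part of the dataset). It applies the same gap-then-cumulative-sum idea used in the cells that follow.

```python
import pandas as pd

# Hypothetical toy events for a single user; timestamps are illustrative only
toy = pd.DataFrame({
    "user_id": ["u1"] * 4,
    "time": pd.to_datetime([
        "2023-12-01 10:00:00",
        "2023-12-01 10:10:00",  # 10-minute gap: same session
        "2023-12-01 10:45:00",  # 35-minute gap: new session
        "2023-12-01 10:50:00",  # 5-minute gap: same session
    ]),
})

toy = toy.sort_values(["user_id", "time"])

# Seconds since the same user's previous event (0 for the first event)
gap_seconds = toy.groupby("user_id")["time"].diff().dt.total_seconds().fillna(0)

# A gap above 1800 seconds opens a new session; the running sum numbers the sessions per user
is_new_session = (gap_seconds > 1800).astype(int)
toy["session_id"] = is_new_session.groupby(toy["user_id"]).cumsum()

session_ids = toy["session_id"].tolist()
print(session_ids)
```

Because the comparison is strict (`>`), a gap of exactly 30 minutes stays in the current session under this sketch.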

In [16]:
# Sort by user_id and time to ensure chronological order
df.sort_values(['user_id', 'time'], inplace=True)

# Calculate the time difference between actions
df['time_diff'] = df.groupby('user_id')['time'].diff().dt.total_seconds().fillna(0)

df.head()

Unnamed: 0,user_id,user_group,time,time_diff
82843,test_user_10021823,1,2023-12-07 01:08:51.008309,0.0
82853,test_user_10021823,1,2023-12-07 01:08:53.085064,2.076755
82865,test_user_10021823,1,2023-12-07 01:08:56.702185,3.617121
82866,test_user_10021823,1,2023-12-07 01:08:57.730887,1.028702
82867,test_user_10021823,1,2023-12-07 01:08:57.808475,0.077588


In [17]:
# Define a new session if the time difference is more than 30 minutes (1800 seconds)
df['new_session'] = df['time_diff'] > 1800

df.head()

Unnamed: 0,user_id,user_group,time,time_diff,new_session
82843,test_user_10021823,1,2023-12-07 01:08:51.008309,0.0,False
82853,test_user_10021823,1,2023-12-07 01:08:53.085064,2.076755,False
82865,test_user_10021823,1,2023-12-07 01:08:56.702185,3.617121,False
82866,test_user_10021823,1,2023-12-07 01:08:57.730887,1.028702,False
82867,test_user_10021823,1,2023-12-07 01:08:57.808475,0.077588,False


In [24]:
# Cumulatively sum new_session within each user to number that user's sessions (0, 1, 2, ...)
df['session_id'] = df.groupby('user_id')['new_session'].cumsum()

df.tail(10)

Unnamed: 0,user_id,user_group,time,time_diff,new_session,session_id
253295,test_user_999580,0,2023-12-06 15:41:53.895570,1.953984,False,1
253296,test_user_999580,0,2023-12-06 15:41:53.901264,0.005694,False,1
253312,test_user_999580,0,2023-12-06 15:41:55.944301,2.043037,False,1
253365,test_user_999580,0,2023-12-06 15:42:02.763931,6.81963,False,1
253376,test_user_999580,0,2023-12-06 15:42:04.771019,2.007088,False,1
253401,test_user_999580,0,2023-12-06 15:42:07.074273,2.303254,False,1
209051,test_user_9998026,1,2023-12-06 15:36:14.988054,0.0,False,0
209054,test_user_9998026,1,2023-12-06 15:36:15.099478,0.111424,False,0
209056,test_user_9998026,1,2023-12-06 15:36:15.535345,0.435867,False,0
209073,test_user_9998026,1,2023-12-06 15:36:20.240469,4.705124,False,0


In [25]:
# Aggregate to find session start and end times
session_df = df.groupby(['user_id', 'user_group', 'session_id']).agg(
    session_start=('time', 'min'),
    session_end=('time', 'max')
).reset_index()

In [26]:
session_df.head()

Unnamed: 0,user_id,user_group,session_id,session_start,session_end
0,test_user_10021823,1,0,2023-12-07 01:08:51.008309,2023-12-07 01:10:11.804765
1,test_user_10030144,0,0,2023-12-02 20:37:34.861127,2023-12-02 20:38:14.301560
2,test_user_1003396,1,0,2023-12-01 00:00:48.238407,2023-12-01 00:01:41.376842
3,test_user_1003396,1,1,2023-12-02 08:21:13.134305,2023-12-02 08:35:03.351138
4,test_user_1003396,1,2,2023-12-02 21:12:09.512252,2023-12-02 21:34:53.437896


In [30]:
# Add session date as the date from session_start
session_df['session_date'] = session_df['session_start'].dt.date

# Rename columns to match the requested format
session_df.rename(columns={'user_group': 'ab_group'}, inplace=True)

# Drop session_id as it is not needed in the final output
session_df.drop('session_id', axis=1, inplace=True)

In [31]:
session_df.head()

Unnamed: 0,user_id,ab_group,session_start,session_end,session_date
0,test_user_10021823,1,2023-12-07 01:08:51.008309,2023-12-07 01:10:11.804765,2023-12-07
1,test_user_10030144,0,2023-12-02 20:37:34.861127,2023-12-02 20:38:14.301560,2023-12-02
2,test_user_1003396,1,2023-12-01 00:00:48.238407,2023-12-01 00:01:41.376842,2023-12-01
3,test_user_1003396,1,2023-12-02 08:21:13.134305,2023-12-02 08:35:03.351138,2023-12-02
4,test_user_1003396,1,2023-12-02 21:12:09.512252,2023-12-02 21:34:53.437896,2023-12-02


### Conversion Rates Calculation

Conversion is defined as the **occurrence of a second session for a user**. The conversion rate will then be calculated as the number of users who had a second session divided by the total number of users who **entered the A/B test (had a first session)**.
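As a quick illustration of this definition, the sketch below computes the rate on a hypothetical toy session table (`toy_sessions` and its user IDs are made up for the example). A user counts as converted when they have at least two session rows, and the rate is the share of converted users within each group.

```python
import pandas as pd

# Hypothetical session table: one row per session
toy_sessions = pd.DataFrame({
    "user_id":  ["a", "a", "b", "c", "c", "c", "d", "e"],
    "ab_group": [0,   0,   0,   1,   1,   1,   1,   1],
})

# Number of sessions per user (each user belongs to exactly one group)
sessions_per_user = toy_sessions.groupby(["user_id", "ab_group"]).size()

# Converted = the user came back for at least a second session
converted = sessions_per_user >= 2

# Share of converted users among all users who entered each group
toy_rates = converted.groupby(level="ab_group").mean()
print(toy_rates)
```

In this toy data, group 0 has one converted user out of two and group 1 has one out of three.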

In [33]:
# Count the total number of sessions for each user
first_sessions_count = session_df.groupby(['user_id', 'ab_group']).size().reset_index(name='sessions_count')

first_sessions_count.head()

Unnamed: 0,user_id,ab_group,sessions_count
0,test_user_10021823,1,1
1,test_user_10030144,0,1
2,test_user_1003396,1,11
3,test_user_10035974,0,1
4,test_user_1005039,1,1


In [35]:
# Count how many users in each group had at least one session (entered the A/B test)
users_in_test = first_sessions_count.groupby('ab_group')['user_id'].nunique()

print(users_in_test)

ab_group
0    7429
1    7399
Name: user_id, dtype: int64


In [36]:
# Filter to consider only users with at least 2 sessions (i.e., those who had a second session)
users_with_second_session = first_sessions_count[first_sessions_count['sessions_count'] >= 2]

users_with_second_session.head()

Unnamed: 0,user_id,ab_group,sessions_count
2,test_user_1003396,1,11
6,test_user_10063799,1,11
9,test_user_10068940,0,3
10,test_user_10076625,1,2
13,test_user_1009703,1,2


In [38]:
# Count how many of these users are in each group
conversions = users_with_second_session.groupby('ab_group')['user_id'].nunique()

print(conversions)

ab_group
0    1833
1    1850
Name: user_id, dtype: int64


In [39]:
# Calculate the conversion rate for each group
conversion_rates = conversions / users_in_test

# Display the conversion rates
print(conversion_rates)

ab_group
0    0.246736
1    0.250034
Name: user_id, dtype: float64


**Group B (1, the variation) shows the higher observed conversion rate (25.0% vs. 24.7%)**. However, the gap is small, so a statistical significance test is needed to check whether it could be due to random chance before declaring a winner.

### Statistical Significance Testing

* **Null Hypothesis (H0):** There is no difference in conversion rates between Group A and Group B.
* **Alternative Hypothesis (H1):** There is a significant difference in conversion rates between Group A and Group B.

Confidence level = 95%, significance level = 5%.
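One way to test these hypotheses by hand is a pooled two-proportion z-test. The sketch below is an illustrative alternative, not the method this notebook relies on; it plugs in the user and conversion counts printed above, and the variable names are chosen only for this example.

```python
import math
from scipy.stats import norm

# Counts taken from the outputs above (group A = 0, group B = 1)
conv_a, n_a = 1833, 7429
conv_b, n_b = 1850, 7399

p_a = conv_a / n_a
p_b = conv_b / n_b

# Under H0 both groups share one conversion rate, estimated by pooling
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_two_sided = 2 * norm.sf(abs(z))  # two-sided tail probability

print(f"z = {z:.3f}, two-sided p = {p_two_sided:.3f}")
```

This uncorrected z-test is equivalent to a chi-square test without continuity correction, so its p-value can differ slightly from `chi2_contingency`, which applies Yates' correction to 2x2 tables by default.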

In [42]:
# Calculate the number of users who did not convert for each group
non_conversions = users_in_test - conversions

# Create the contingency table
contingency_table = pd.DataFrame({
    'Converted': conversions,
    'Not_Converted': non_conversions
})

print(contingency_table)

          Converted  Not_Converted
ab_group                          
0              1833           5596
1              1850           5549


In [44]:
# Chi-square test of independence (SciPy applies Yates' continuity correction to 2x2 tables by default)
chi2, p_value, _, _ = chi2_contingency(contingency_table)

# Results interpretation
if p_value < 0.05:
    print("There's a statistically significant difference between the groups. p-value:", p_value)
else:
    print("No statistically significant difference between the groups was found. p-value:", p_value)

No statistically significant difference between the groups was found. p-value: 0.655793039608316


Based on the data analyzed, **there isn't enough evidence to conclude that there's a real difference in conversion rates** between the two groups being tested:

* The p-value means that, if there were truly no difference between the groups, a difference at least as large as the observed one would occur by random chance about 65.58% of the time.
* Since the p-value (0.656) is much higher than the significance level of 0.05, we fail to reject the null hypothesis. The test did not detect a statistically significant effect of the tested variation on conversion rates; this is not proof that the effect is exactly zero.
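A confidence interval for the difference in conversion rates can complement the p-value by showing the range of effects the data are compatible with. The following is a rough sketch using a normal-approximation (Wald) interval and the counts reported earlier; the choice of a Wald interval is an assumption made for this example rather than part of the original analysis.

```python
import math
from scipy.stats import norm

# Counts taken from the outputs above (group A = 0, group B = 1)
conv_a, n_a = 1833, 7429
conv_b, n_b = 1850, 7399

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a

# Unpooled standard error of the difference between two proportions
se_unpooled = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

# Critical value for a 95% two-sided interval
z_crit = norm.ppf(0.975)
ci_low = diff - z_crit * se_unpooled
ci_high = diff + z_crit * se_unpooled

print(f"difference = {diff:.4f}, 95% CI = ({ci_low:.4f}, {ci_high:.4f})")
```

An interval that contains zero is consistent with failing to reject the null hypothesis at the 5% level.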