# Event-based analytics

In this project, I will investigate user behavior in the app of a startup that sells food products. I will use a dataset containing information on the actions users performed in the app (events) during an A/B test.
To reach conclusions, I will do the following:
1. Download and explore the data
2. Prepare the data for analysis
3. Study and check the data
4. Study the event funnel
5. Study the results of the experiment


##  Download and explore the data


In [64]:
# Loading all the libraries I will use:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import seaborn as sns
import math 
from functools import reduce
from scipy import stats as st
import warnings
import plotly.express as px
import plotly.graph_objects as go
import plotly.offline
import nltk, re
import string
from collections import Counter
from string import punctuation
import requests 
import io
plotly.offline.init_notebook_mode(connected=True)
warnings.filterwarnings("ignore")

In [65]:
# Downloading the csv file from my GitHub account

url = "https://raw.githubusercontent.com/yoav-karsenty/Event-based-analytics/main/logs_exp_us.csv"
download = requests.get(url).content

# Reading the downloaded content and turning it into a pandas dataframe

food = pd.read_csv(io.StringIO(download.decode('utf-8')),sep = "\t")


#exploring the data

food.head()

Unnamed: 0,EventName,DeviceIDHash,EventTimestamp,ExpId
0,MainScreenAppear,4575588528974610257,1564029816,246
1,MainScreenAppear,7416695313311560658,1564053102,246
2,PaymentScreenSuccessful,3518123091307005509,1564054127,248
3,CartScreenAppear,3518123091307005509,1564054127,248
4,PaymentScreenSuccessful,6217807653094995999,1564055322,248


In [66]:
food.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244126 entries, 0 to 244125
Data columns (total 4 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   EventName       244126 non-null  object
 1   DeviceIDHash    244126 non-null  int64 
 2   EventTimestamp  244126 non-null  int64 
 3   ExpId           244126 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 7.5+ MB


##  Preparing the data for analysis


I can see there are no missing values, but I need to fix the data type of the EventTimestamp column and rename the columns.

In [67]:
#lowercasing the column names for convenience
food.columns = food.columns.str.lower()


In [68]:
food.head()

Unnamed: 0,eventname,deviceidhash,eventtimestamp,expid
0,MainScreenAppear,4575588528974610257,1564029816,246
1,MainScreenAppear,7416695313311560658,1564053102,246
2,PaymentScreenSuccessful,3518123091307005509,1564054127,248
3,CartScreenAppear,3518123091307005509,1564054127,248
4,PaymentScreenSuccessful,6217807653094995999,1564055322,248


In [69]:
# renaming the columns for convenience
food.columns = ['event_name','device_id','event_timestamp','expid']
food.head()

Unnamed: 0,event_name,device_id,event_timestamp,expid
0,MainScreenAppear,4575588528974610257,1564029816,246
1,MainScreenAppear,7416695313311560658,1564053102,246
2,PaymentScreenSuccessful,3518123091307005509,1564054127,248
3,CartScreenAppear,3518123091307005509,1564054127,248
4,PaymentScreenSuccessful,6217807653094995999,1564055322,248


In [70]:
#checking for duplicates
print(food.duplicated().sum()) 


413


In [71]:
food[food.duplicated(['event_name','device_id','event_timestamp'])]


Unnamed: 0,event_name,device_id,event_timestamp,expid
453,MainScreenAppear,5613408041324010552,1564474784,248
2350,CartScreenAppear,1694940645335807244,1564609899,248
3573,MainScreenAppear,434103746454591587,1564628377,248
4076,MainScreenAppear,3761373764179762633,1564631266,247
4803,MainScreenAppear,2835328739789306622,1564634641,248
...,...,...,...,...
242329,MainScreenAppear,8870358373313968633,1565206004,247
242332,PaymentScreenSuccessful,4718002964983105693,1565206005,247
242360,PaymentScreenSuccessful,2382591782303281935,1565206049,246
242362,CartScreenAppear,2382591782303281935,1565206049,246


In [72]:
# checking the distribution of duplicates 
for i in food[food.duplicated()].columns:
    print(i,':',food[food.duplicated()][i].nunique())

event_name : 5
device_id : 237
event_timestamp : 352
expid : 3


In [73]:
# dropping the duplicates from the dataFrame
food = food.drop_duplicates(['event_name','device_id','event_timestamp'],keep= 'last')
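
Note that `food.duplicated()` with no arguments compares entire rows, while the drop above compares only three key columns. The following is a minimal sketch on made-up rows (the `toy` frame and its values are hypothetical, not project data) showing how the two counts can differ and which row `keep='last'` retains:

```python
import pandas as pd

# Toy data (hypothetical): rows 0 and 1 are identical in every column;
# row 2 matches them on the three key columns but has a different expid.
toy = pd.DataFrame({
    'event_name': ['MainScreenAppear'] * 3 + ['CartScreenAppear'],
    'device_id': [1, 1, 1, 2],
    'event_timestamp': [100, 100, 100, 200],
    'expid': [246, 246, 247, 248],
})

key_cols = ['event_name', 'device_id', 'event_timestamp']

full_row_dupes = toy.duplicated().sum()        # compares all columns
subset_dupes = toy.duplicated(key_cols).sum()  # compares the key columns only

# keep='last' retains the final row of each duplicate group
deduped = toy.drop_duplicates(key_cols, keep='last')

print(full_row_dupes, subset_dupes)
print(deduped)
```

In this dataset the two approaches appear to agree: 413 full-row duplicates were counted, and the row count later falls from 244126 to 243713, which is also 413.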


In [74]:
#converting the 'event_timestamp' column into date time format using the seconds format.
food['event_timestamp'] = pd.to_datetime(food['event_timestamp'], unit='s')
food.head(80)

Unnamed: 0,event_name,device_id,event_timestamp,expid
0,MainScreenAppear,4575588528974610257,2019-07-25 04:43:36,246
1,MainScreenAppear,7416695313311560658,2019-07-25 11:11:42,246
2,PaymentScreenSuccessful,3518123091307005509,2019-07-25 11:28:47,248
3,CartScreenAppear,3518123091307005509,2019-07-25 11:28:47,248
4,PaymentScreenSuccessful,6217807653094995999,2019-07-25 11:48:42,248
...,...,...,...,...
75,MainScreenAppear,8039133903955652950,2019-07-27 15:48:41,247
76,MainScreenAppear,8039133903955652950,2019-07-27 15:48:48,247
77,OffersScreenAppear,7851137947039862735,2019-07-27 16:21:24,246
78,OffersScreenAppear,2862153300066037949,2019-07-27 16:26:51,247


In [75]:
#Adding a date column
food['date'] = food.event_timestamp.dt.date

In [76]:
food['date'] = pd.to_datetime(food['date'])
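
The date column is built in two steps above (extract `dt.date`, then convert back to datetime). As a possible alternative, here is a short sketch that uses `dt.normalize()` to get the same midnight timestamps in one step. The two input values are the first two raw timestamps shown earlier; the variable names are illustrative only.

```python
import pandas as pd

# First two raw timestamps from the log (seconds since the Unix epoch)
ts = pd.Series([1564029816, 1564053102])

event_time = pd.to_datetime(ts, unit='s')  # interpret the integers as seconds
event_date = event_time.dt.normalize()     # midnight of the same day, still a datetime dtype

print(event_time.iloc[0])
print(event_date.iloc[0])
```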

In [77]:
food.head()

Unnamed: 0,event_name,device_id,event_timestamp,expid,date
0,MainScreenAppear,4575588528974610257,2019-07-25 04:43:36,246,2019-07-25
1,MainScreenAppear,7416695313311560658,2019-07-25 11:11:42,246,2019-07-25
2,PaymentScreenSuccessful,3518123091307005509,2019-07-25 11:28:47,248,2019-07-25
3,CartScreenAppear,3518123091307005509,2019-07-25 11:28:47,248,2019-07-25
4,PaymentScreenSuccessful,6217807653094995999,2019-07-25 11:48:42,248,2019-07-25


##  Studying and checking the data

In [78]:
# Grouping by average number of events per user using mean 
food.groupby('device_id')['event_name'].count().mean()

32.27559263673685

In [79]:
#grouping by unique events per user
food.groupby('device_id')['event_name'].nunique().reset_index()

Unnamed: 0,device_id,event_name
0,6888746892508752,1
1,6909561520679493,4
2,6922444491712477,4
3,7435777799948366,1
4,7702139951469979,4
...,...,...
7546,9217594193087726423,3
7547,9219463515465815368,4
7548,9220879493065341500,3
7549,9221926045299980007,1


In [80]:
#getting the distribution of the number of unique event types per user
food.groupby('device_id')['event_name'].nunique().reset_index()['event_name'].value_counts()

4    3035
1    2707
2    1021
5     471
3     317
Name: event_name, dtype: int64

In [81]:
##getting the number of events

print('There are {} events in the logs '.format(food.shape[0]))

There are 243713 events in the logs 


In [82]:
#getting the number of unique users in the logs

print('There are {} users in the logs '.format(food.device_id.nunique()))

There are 7551 users in the logs 


In [83]:
#getting the average of events per user
print('The average number for events per user is {} '.format(food.shape[0]/food.device_id.nunique()))

The average number for events per user is 32.27559263673685 


In [84]:
# Grouping by average number of events per user using median 

food.groupby('device_id')['event_name'].count().median()

20.0

We can see that the mean (about 32.3 events per user) is much higher than the median (20). This is probably due to outliers: for example, clients who order for many employees in their office.
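
To illustrate why a few heavy users pull the mean above the median, here is a small sketch with made-up event counts (the list is hypothetical and only meant to mimic a skewed distribution):

```python
from statistics import mean, median

# Hypothetical event counts per user: most users are light, one is very heavy.
events_per_user = [5, 10, 20, 20, 25, 30, 500]

avg = mean(events_per_user)    # pulled upward by the single heavy user
mid = median(events_per_user)  # depends only on the middle of the sorted list

print(round(avg, 2), mid)
```

Replacing the 500 with 5000 would change the mean a lot and the median not at all, which is why the median is the more robust summary here.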

Next, I want to check what period of time the data covers, so I can be sure that I have equally complete data for the entire period.
Older events could end up in some users' logs for technical reasons, and this could skew the overall picture.
I want to find the moment at which the data starts to be complete and ignore the earlier section.

In [85]:
#Using the describe method to get more information on the dataset
food.describe(include = 'all')

Unnamed: 0,event_name,device_id,event_timestamp,expid,date
count,243713,243713.0,243713,243713.0,243713
unique,5,,176654,,14
top,MainScreenAppear,,2019-08-01 14:40:35,,2019-08-01 00:00:00
freq,119101,,9,,36141
first,,,2019-07-25 04:43:36,,2019-07-25 00:00:00
last,,,2019-08-07 21:15:17,,2019-08-07 00:00:00
mean,,4.627963e+18,,247.022161,
std,,2.642723e+18,,0.82442,
min,,6888747000000000.0,,246.0,
25%,,2.372212e+18,,246.0,


In [86]:
group_check = food.groupby('device_id')['expid'].nunique().reset_index()
group_check.expid.value_counts()

1    7551
Name: expid, dtype: int64

In [87]:
food.expid.value_counts()

248    85582
246    80181
247    77950
Name: expid, dtype: int64

In [None]:
#Visualising the event_timestamp column so I can see which period I should consider
fig = px.histogram(food, x = 'event_timestamp')
fig.show()

As we can clearly see, the histogram indicates that the experiment effectively started on 01.08.2019: only from that date does the number of actions reflect the number of participants we know we have in this data (7551). The data from before 01.08 is probably there for technical reasons, and I will filter it out of the dataset.

In [89]:
#Filtering the data
food = food[food['date'] >='2019-08-01']
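
The cut-off date above was read off the histogram. As a cross-check, one could also pick it programmatically. The sketch below assumes a simple rule (the first day with at least half of the busiest day's events) and uses made-up daily counts; both the rule and the numbers are assumptions for illustration.

```python
import pandas as pd

# Made-up daily event counts: sparse early days, then full traffic.
daily = pd.Series(
    [10, 30, 50, 100, 200, 400, 2000, 36000, 35000, 33000],
    index=pd.date_range('2019-07-25', periods=10, freq='D'),
)

threshold = 0.5 * daily.max()  # assumed rule: at least half of the busiest day
first_full_day = daily[daily >= threshold].index.min()

print(first_full_day.date())
```

On the real data, `daily` could be built with something like `food.groupby('date')['event_name'].count()`.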

In [None]:
#checking if it worked
fig = px.histogram(food, x = 'event_timestamp')
fig.show()

## Study the event funnel

Next, I want to see what events are in the logs and how frequently they occur.

In [91]:
#Grouping the data by events and calculating their frequency of occurrence.
events = food.groupby('event_name')['event_timestamp'].count().reset_index()
events.columns = ['event','total']
events.sort_values(by = 'total',ascending = False)

Unnamed: 0,event,total
1,MainScreenAppear,117328
2,OffersScreenAppear,46333
0,CartScreenAppear,42303
3,PaymentScreenSuccessful,33918
4,Tutorial,1005


Next, I want to find the number of users who performed each of these actions.

In [92]:
#Grouping the data by events and calculating the number of unique users that performed that action.

event_users = food.groupby('event_name')['device_id'].nunique()
event_users = event_users.sort_values(ascending = False).reset_index()
event_users.columns = ['event','users_num']

event_users

Unnamed: 0,event,users_num
0,MainScreenAppear,7419
1,OffersScreenAppear,4593
2,CartScreenAppear,3734
3,PaymentScreenSuccessful,3539
4,Tutorial,840


We can see that far fewer users performed the Tutorial action than any of the other actions. I infer from this that the Tutorial is probably not mandatory and can be skipped, which most users do.

In [93]:
#calculating the share of all users who performed each action
event_perc = food.groupby('event_name')['device_id'].nunique().sort_values(ascending = False)/food.device_id.nunique()
event_perc = event_perc.reset_index()
event_perc

Unnamed: 0,event_name,device_id
0,MainScreenAppear,0.984736
1,OffersScreenAppear,0.609636
2,CartScreenAppear,0.49562
3,PaymentScreenSuccessful,0.469737
4,Tutorial,0.111495


Next, I want to determine in what order the actions took place. Are all of them part of a single sequence?

In [94]:
#getting a time sorted data df without the Tutorial action.
sorted_data = food[food['event_name']!= 'Tutorial'].sort_values(by=['device_id','event_timestamp'])


In [95]:
# A function that returns a user's unique actions in the order they occurred.
def sequence(user):
    sorted_user = sorted_data[sorted_data['device_id']==user].sort_values(by = ['device_id','event_timestamp'])
    return sorted_user['event_name'].drop_duplicates().to_list()

In [96]:
#getting a list of actions for each user
sequence_list = []
for i in sorted_data.device_id.unique():
    sequence_list.append([i,sequence(i)])
    

    

In [97]:
#creating a dataframe with users and their actions
path_data = pd.DataFrame(sequence_list,columns = ['user','path'])
path_data

Unnamed: 0,user,path
0,6888746892508752,[MainScreenAppear]
1,6909561520679493,"[MainScreenAppear, PaymentScreenSuccessful, Ca..."
2,6922444491712477,"[MainScreenAppear, PaymentScreenSuccessful, Ca..."
3,7435777799948366,[MainScreenAppear]
4,7702139951469979,"[MainScreenAppear, OffersScreenAppear, CartScr..."
...,...,...
7525,9217594193087726423,"[PaymentScreenSuccessful, CartScreenAppear, Of..."
7526,9219463515465815368,"[MainScreenAppear, OffersScreenAppear, CartScr..."
7527,9220879493065341500,"[MainScreenAppear, OffersScreenAppear, CartScr..."
7528,9221926045299980007,[MainScreenAppear]


In [98]:
#Getting the most common action order.
## We are going to get an error here, but afterwards we'll get the answer we were searching for
path_data['path'].value_counts()

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


[MainScreenAppear]                                                                   2882
[MainScreenAppear, OffersScreenAppear, CartScreenAppear, PaymentScreenSuccessful]     905
[MainScreenAppear, OffersScreenAppear]                                                872
[MainScreenAppear, OffersScreenAppear, PaymentScreenSuccessful, CartScreenAppear]     771
[MainScreenAppear, PaymentScreenSuccessful, CartScreenAppear, OffersScreenAppear]     668
[MainScreenAppear, CartScreenAppear, PaymentScreenSuccessful, OffersScreenAppear]     510
[MainScreenAppear, CartScreenAppear, OffersScreenAppear, PaymentScreenSuccessful]     259
[MainScreenAppear, OffersScreenAppear, CartScreenAppear]                               92
[OffersScreenAppear, MainScreenAppear, PaymentScreenSuccessful, CartScreenAppear]      65
[OffersScreenAppear, PaymentScreenSuccessful, CartScreenAppear, MainScreenAppear]      63
[MainScreenAppear, CartScreenAppear, OffersScreenAppear]                               52
[MainScree

It looks like [MainScreenAppear, OffersScreenAppear, CartScreenAppear, PaymentScreenSuccessful] is the most common sequence (after MainScreenAppear alone): it was the unique action order of 905 users.
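
The `TypeError: unhashable type: 'list'` shown above arises because Python lists cannot be hashed. One way to avoid it, sketched here on a tiny hypothetical `path_data` frame, is to convert each path to a tuple before counting:

```python
import pandas as pd

# Hypothetical stand-in for path_data: one list of events per user.
path_data = pd.DataFrame({
    'user': [1, 2, 3, 4],
    'path': [
        ['MainScreenAppear'],
        ['MainScreenAppear', 'OffersScreenAppear'],
        ['MainScreenAppear'],
        ['MainScreenAppear'],
    ],
})

# Tuples are hashable, so value_counts can group identical paths.
path_counts = path_data['path'].apply(tuple).value_counts()

print(path_counts)
```

The counts are the same as with lists; only the container type changes, so the index is displayed with parentheses instead of square brackets.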

Next, I want to check at what stage we lose the most users and what share of users make the entire journey from their first event to payment.
To do that, I will use a funnel chart.



In [99]:
# Filtering out the Tutorial event, as it is not part of the sequence because it can be skipped.
clean_events = event_users[event_users['event'] != 'Tutorial']

In [100]:
#Visualizing the data in a funnel chart
figi = go.Figure(go.Funnel(
    x = clean_events['users_num'],
    y = clean_events['event'],textinfo = "value+percent previous")
    )

figi.update_layout(
    title="User Actions Funnel",title_x=0.5)

figi.show() 

We can see that we lose the most users at the transition to the second stage, OffersScreenAppear: only 61.9% of the users from the previous stage reach it, so 38.1% drop off.
The number of users who made the entire journey from their first event to payment is 3539: 47.7% of the initial users and 94.8% of the users from the previous stage.
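
The percentages in the funnel can be reproduced by hand from the unique-user counts in the `event_users` table. A minimal sketch of the arithmetic (the `funnel` and `rows` names are illustrative):

```python
# Unique users per funnel stage, taken from the event_users table above.
funnel = [
    ('MainScreenAppear', 7419),
    ('OffersScreenAppear', 4593),
    ('CartScreenAppear', 3734),
    ('PaymentScreenSuccessful', 3539),
]

initial = funnel[0][1]
rows = []
# Pair every stage with the one before it.
for (_, prev_users), (stage, users) in zip(funnel, funnel[1:]):
    rows.append({
        'stage': stage,
        'pct_of_previous': round(100 * users / prev_users, 1),
        'pct_of_initial': round(100 * users / initial, 1),
        'users_lost': prev_users - users,
    })

for row in rows:
    print(row)
```

The largest loss, both in absolute numbers and as a share, is between MainScreenAppear and OffersScreenAppear.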

Next, i want to get a funnel that shows the difference between the groups of the test

In [101]:
#Creating a list with, for each group, the number of unique users who performed each action
funnel_groups = []
for i in food.expid.unique():
    group = food[(food.expid == i) & (food.event_name != 'Tutorial')].groupby(['event_name','expid'])['device_id'].nunique().reset_index().sort_values(by = 'device_id',ascending = False)
    display(group)
    funnel_groups.append(group)

Unnamed: 0,event_name,expid,device_id
1,MainScreenAppear,246,2450
2,OffersScreenAppear,246,1542
0,CartScreenAppear,246,1266
3,PaymentScreenSuccessful,246,1200


Unnamed: 0,event_name,expid,device_id
1,MainScreenAppear,247,2476
2,OffersScreenAppear,247,1520
0,CartScreenAppear,247,1238
3,PaymentScreenSuccessful,247,1158


Unnamed: 0,event_name,expid,device_id
1,MainScreenAppear,248,2493
2,OffersScreenAppear,248,1531
0,CartScreenAppear,248,1230
3,PaymentScreenSuccessful,248,1181


In [102]:
#concatenating the groups into one
funnel_groups = pd.concat(funnel_groups)
funnel_groups

Unnamed: 0,event_name,expid,device_id
1,MainScreenAppear,246,2450
2,OffersScreenAppear,246,1542
0,CartScreenAppear,246,1266
3,PaymentScreenSuccessful,246,1200
1,MainScreenAppear,247,2476
2,OffersScreenAppear,247,1520
0,CartScreenAppear,247,1238
3,PaymentScreenSuccessful,247,1158
1,MainScreenAppear,248,2493
2,OffersScreenAppear,248,1531


In [103]:

figes =  go.Figure(go.Funnel(
    name = '246',
     y = funnel_groups[funnel_groups['expid']==246]['event_name'],
    x = funnel_groups[funnel_groups['expid']==246]['device_id'],
    textinfo = "value+percent initial"))

figes.add_trace(go.Funnel(
    name = '247',
     y = funnel_groups[funnel_groups['expid']==247]['event_name'],
    x = funnel_groups[funnel_groups['expid']==247]['device_id'],
    textinfo = "value+percent initial"))
figes.add_trace(go.Funnel(
    name = '248',
     y = funnel_groups[funnel_groups['expid']==248]['event_name'],
    x = funnel_groups[funnel_groups['expid']==248]['device_id'],
    textinfo = "value+percent initial"))
figes.update_layout(
    title="User Actions Funnel by Experiment ",title_x=0.5)

figes.show()

At first glance, the groups seem to have pretty similar results, but I will check that with statistical tools later.

## The results of the experiment

In [104]:
#Checking the number of users in each group
food.groupby(['expid'])['device_id'].nunique()

expid
246    2484
247    2513
248    2537
Name: device_id, dtype: int64

We have two control groups (246 and 247), which form an A/A test that checks our mechanisms and calculations. I want to see whether there is a statistically significant difference between samples 246 and 247.


In [105]:
#checking that every user was assigned to only one group
food.groupby(['device_id'])['expid'].nunique().reset_index().query('expid > 1') 

Unnamed: 0,device_id,expid


Next, I want to check that the difference between the groups in the A/A test is not statistically significant. To do that, I will perform a z-test for proportions.

I will formulate two hypotheses, a null hypothesis H0 and an alternative hypothesis H1, and perform the z-test.


To decide whether to reject the null hypothesis, we set a threshold for statistical significance: the critical significance level alpha, which will be 0.05.
If the p-value is less than alpha, we reject the null hypothesis H0.

The hypotheses:

H0: There is no difference between group A and group B in the share of users who performed the event.

H1: There is a difference between group A and group B in the share of users who performed the event.
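
The z-test for two proportions can be written out as follows. The symbols are notation chosen here for illustration: $s_i$ is the number of users in group $i$ who performed the event, and $n_i$ is the total number of users in group $i$.

$$
\hat{p}_1 = \frac{s_1}{n_1}, \qquad \hat{p}_2 = \frac{s_2}{n_2}, \qquad \hat{p} = \frac{s_1 + s_2}{n_1 + n_2}
$$

$$
z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}, \qquad p\text{-value} = 2\,\bigl(1 - \Phi(|z|)\bigr)
$$

Here $\Phi$ is the standard normal CDF. Under H0, $z$ is approximately standard normal, so a small p-value means the observed difference would be unlikely if the two shares were equal.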



In [106]:
#creating a pivot table whose data will be used by the function that conducts the z-test
pivot = food.pivot_table(index = 'event_name',columns = 'expid',values = 'device_id',aggfunc = 'nunique').reset_index()
pivot

expid,event_name,246,247,248
0,CartScreenAppear,1266,1238,1230
1,MainScreenAppear,2450,2476,2493
2,OffersScreenAppear,1542,1520,1531
3,PaymentScreenSuccessful,1200,1158,1181
4,Tutorial,278,283,279


In [107]:
#creating a function that performs a z-test of proportions for the two groups
def check_hyppothesis(group1,group2,event,alpha = 0.05):
    
    succ1 = pivot[pivot.event_name == event][group1].iloc[0] 
    succ2 = pivot[pivot.event_name == event][group2].iloc[0]
    
    trials1 = food[food.expid == group1]['device_id'].nunique() 
    trials2 = food[food.expid == group2]['device_id'].nunique()
    

    #proportions to success in the first group
    p1 = succ1 / trials1
    #proportions to success in the second group
    p2 = succ2 / trials2
    
    p_combined = (succ1 + succ2) / (trials1 + trials2)
    
    difference = p1 - p2
    
    z_value = difference / math.sqrt(p_combined * (1 - p_combined) *(1 / trials1 + 1/trials2))
                                     
    distr =  st.norm(0,1)           
    
    p_value = (1 - distr.cdf(abs(z_value))) * 2
    
    print('p_value :', p_value)
    
    if (p_value < alpha):
        
    
        print("Rejecting the null hypothesis for",event,"and groups",group1,group2)
    else:
        print("Failed to reject the null hypothesis for",event,"and groups",group1,group2)
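
For reference, the same calculation can be sketched as a standalone function that takes raw counts instead of reading from `pivot` and `food`. This version swaps `st.norm` for `math.erfc` from the standard library, which should give the same two-sided p-value; the function name and signature are illustrative, not part of the original analysis.

```python
import math

def two_proportion_z_test(succ1, trials1, succ2, trials2):
    """Return (z, two-sided p-value) for H0: both proportions are equal."""
    p1 = succ1 / trials1
    p2 = succ2 / trials2
    # Pooled proportion under H0
    p_pooled = (succ1 + succ2) / (trials1 + trials2)
    std_err = math.sqrt(p_pooled * (1 - p_pooled) * (1 / trials1 + 1 / trials2))
    z = (p1 - p2) / std_err
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal CDF Phi
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# CartScreenAppear for groups 246 and 247, using the counts from the
# pivot table and the group sizes shown above.
z, p_value = two_proportion_z_test(1266, 2484, 1238, 2513)
print('z :', z)
print('p_value :', p_value)
```

Returning the values instead of printing them inside the function makes it easier to collect p-values from many tests and compare them against a corrected alpha afterwards.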

In [108]:
#checking the function 
check_hyppothesis(246,247,'CartScreenAppear',alpha = 0.05)

p_value : 0.22883372237997213
Failed to reject the null hypothesis for CartScreenAppear and groups 246 247


In [109]:
#Applying the function 

for i in pivot.event_name.unique():
        check_hyppothesis(246,247,i,alpha = 0.05)
        print('---------')

p_value : 0.22883372237997213
Failed to reject the null hypothesis for CartScreenAppear and groups 246 247
---------
p_value : 0.7570597232046099
Failed to reject the null hypothesis for MainScreenAppear and groups 246 247
---------
p_value : 0.2480954578522181
Failed to reject the null hypothesis for OffersScreenAppear and groups 246 247
---------
p_value : 0.11456679313141849
Failed to reject the null hypothesis for PaymentScreenSuccessful and groups 246 247
---------
p_value : 0.9376996189257114
Failed to reject the null hypothesis for Tutorial and groups 246 247
---------


It seems the A/A test went well: there is no statistically significant difference between the control groups, so we are good to go.

Next, I want to do the same with the test group 248, comparing it against each of the control groups and against the two combined.

In [110]:
for i in pivot.event_name.unique():
        check_hyppothesis(247,248,i,alpha = 0.05)
        print('---------')

p_value : 0.5786197879539783
Failed to reject the null hypothesis for CartScreenAppear and groups 247 248
---------
p_value : 0.4587053616621515
Failed to reject the null hypothesis for MainScreenAppear and groups 247 248
---------
p_value : 0.9197817830592261
Failed to reject the null hypothesis for OffersScreenAppear and groups 247 248
---------
p_value : 0.7373415053803964
Failed to reject the null hypothesis for PaymentScreenSuccessful and groups 247 248
---------
p_value : 0.765323922474501
Failed to reject the null hypothesis for Tutorial and groups 247 248
---------


In [111]:
for i in pivot.event_name.unique():
    check_hyppothesis(246,248,i,alpha = 0.05)
    print('---------')

p_value : 0.07842923237520116
Failed to reject the null hypothesis for CartScreenAppear and groups 246 248
---------
p_value : 0.2949721933554552
Failed to reject the null hypothesis for MainScreenAppear and groups 246 248
---------
p_value : 0.20836205402738917
Failed to reject the null hypothesis for OffersScreenAppear and groups 246 248
---------
p_value : 0.2122553275697796
Failed to reject the null hypothesis for PaymentScreenSuccessful and groups 246 248
---------
p_value : 0.8264294010087645
Failed to reject the null hypothesis for Tutorial and groups 246 248
---------


In [112]:
#Modifying the function so I can also check the combined result of 246 and 247
def check_hyppothesis_com(group1,group2,group3,event,alpha = 0.05):
    
    succ1 = pivot[pivot.event_name ==event][group1].iloc[0] +pivot[pivot.event_name == event][group2].iloc[0]
    succ2 = pivot[pivot.event_name == event][group3].iloc[0]
    
    trials1 = food[food.expid == group1]['device_id'].nunique() + food[food.expid == group2]['device_id'].nunique()
    trials2 = food[food.expid == group3]['device_id'].nunique()
    

    #proportions to success in the first group
    p1 = succ1 / trials1
    #proportions to success in the second group
    p2 = succ2 / trials2
    
    p_combined = (succ1 + succ2) / (trials1 + trials2)
    
    difference = p1 - p2
    
    z_value = difference / math.sqrt(p_combined * (1 - p_combined) *(1 / trials1 + 1/trials2))
                                     
    distr =  st.norm(0,1)           
    
    p_value = (1 - distr.cdf(abs(z_value))) * 2
    
    print('p_value :', p_value)
    
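    # Note: group1 and group2 are pooled and compared against group3;
    # the messages below name only the pooled pair.
    if (p_value < alpha):
        print("Rejecting the null hypothesis for",event,"and groups",group1,group2)
    else:
        print("Failed to reject the null hypothesis for",event,"and groups",group1,group2)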

In [113]:
for i in pivot.event_name.unique():
    check_hyppothesis_com(246,247,248,i,alpha = 0.05)
    print('---------')

p_value : 0.18175875284404386
Failed to reject the null hypothesis for CartScreenAppear and groups 246 247
---------
p_value : 0.29424526837179577
Failed to reject the null hypothesis for MainScreenAppear and groups 246 247
---------
p_value : 0.43425549655188256
Failed to reject the null hypothesis for OffersScreenAppear and groups 246 247
---------
p_value : 0.6004294282308704
Failed to reject the null hypothesis for PaymentScreenSuccessful and groups 246 247
---------
p_value : 0.764862472531507
Failed to reject the null hypothesis for Tutorial and groups 246 247
---------


None of the tests shows a statistically significant difference between the groups. Because I ran many tests, the chance of at least one false positive grows with the number of tests, so I will use the Bonferroni correction to adjust alpha and conduct the tests again. Since the corrected alpha is lower than 0.05, this can only make rejecting the null hypothesis harder.

In [114]:

#dividing alpha by the number of tests I will conduct - 16
alpha = 0.05/16
alpha    

0.003125
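
As a quick sanity check of what the correction implies, the sketch below compares the p-values reported above for groups 246 and 248 (rounded to four decimals) against the corrected threshold. The dictionary and variable names are illustrative; the test count of 16 is the one used in the cell above.

```python
# p-values reported above for groups 246 vs 248, rounded to four decimals
p_values = {
    'CartScreenAppear': 0.0784,
    'MainScreenAppear': 0.2950,
    'OffersScreenAppear': 0.2084,
    'PaymentScreenSuccessful': 0.2123,
    'Tutorial': 0.8264,
}

n_tests = 16                      # the count used in the cell above
alpha_corrected = 0.05 / n_tests  # Bonferroni: divide alpha by the number of tests

# Events whose p-value falls below the corrected threshold
rejected = [event for event, p in p_values.items() if p < alpha_corrected]

print(alpha_corrected, rejected)
```

The smallest p-value here (about 0.078) is already above 0.05, so it is also above the stricter corrected threshold and nothing is rejected.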

In [115]:
for i in pivot.event_name.unique():
    # passing the Bonferroni-corrected alpha defined above
    check_hyppothesis(247,248,i,alpha = alpha)
    print('---------')

p_value : 0.5786197879539783
Failed to reject the null hypothesis for CartScreenAppear and groups 247 248
---------
p_value : 0.4587053616621515
Failed to reject the null hypothesis for MainScreenAppear and groups 247 248
---------
p_value : 0.9197817830592261
Failed to reject the null hypothesis for OffersScreenAppear and groups 247 248
---------
p_value : 0.7373415053803964
Failed to reject the null hypothesis for PaymentScreenSuccessful and groups 247 248
---------
p_value : 0.765323922474501
Failed to reject the null hypothesis for Tutorial and groups 247 248
---------


In [116]:
for i in pivot.event_name.unique():
    # passing the Bonferroni-corrected alpha defined above
    check_hyppothesis(246,248,i,alpha = alpha)
    print('---------')

p_value : 0.07842923237520116
Failed to reject the null hypothesis for CartScreenAppear and groups 246 248
---------
p_value : 0.2949721933554552
Failed to reject the null hypothesis for MainScreenAppear and groups 246 248
---------
p_value : 0.20836205402738917
Failed to reject the null hypothesis for OffersScreenAppear and groups 246 248
---------
p_value : 0.2122553275697796
Failed to reject the null hypothesis for PaymentScreenSuccessful and groups 246 248
---------
p_value : 0.8264294010087645
Failed to reject the null hypothesis for Tutorial and groups 246 248
---------


In [117]:
for i in pivot.event_name.unique():
    # passing the Bonferroni-corrected alpha defined above
    check_hyppothesis_com(246,247,248,i,alpha = alpha)
    print('---------')

p_value : 0.18175875284404386
Failed to reject the null hypothesis for CartScreenAppear and groups 246 247
---------
p_value : 0.29424526837179577
Failed to reject the null hypothesis for MainScreenAppear and groups 246 247
---------
p_value : 0.43425549655188256
Failed to reject the null hypothesis for OffersScreenAppear and groups 246 247
---------
p_value : 0.6004294282308704
Failed to reject the null hypothesis for PaymentScreenSuccessful and groups 246 247
---------
p_value : 0.764862472531507
Failed to reject the null hypothesis for Tutorial and groups 246 247
---------


## Conclusion

In this project on user behaviour in a food company's app, I examined the data and reached the following conclusions:

1. The stage where we lose the most users is stage 2, OffersScreenAppear: 38.1% of the users who see the main screen do not proceed to it.
- I would recommend checking what can be improved on the previous stage, MainScreenAppear, in order to get more users to the next stage.
2. 3539 users, which is 47.7% of the initial users and 94.8% of the users from the previous stage, reach the PaymentScreenSuccessful stage.


3. There is no statistically significant difference between the groups of the test. The designers can change the fonts in the app without worrying that it will affect users' actions (based on the A/A/B test we just conducted and analysed).