# A/A/B test result analysis

# Table of contents  
1. [Introduction](#intro)  
2. [Steps](#steps)  
3. [Data Preprocessing](#data_prep)  
4. [Study the event funnel](#step1)  
5. [Study the A/A/B test results](#step2)  
6. [Conclusion](#end)  


<div id='intro'/>

## Introduction:  

This project investigates user behavior in the app of a company that sells food products.  
The designers would like to change the fonts across the entire app, but the managers are afraid that users  
might find the new design intimidating. They decide to base the decision on the results of an A/A/B test.

<div id='steps'/>

## Steps:  

The analysis is done in 2 main steps.  
1. Study the sales funnel.  
2. Study the A/A/B test results.  

In [1]:
#importing all the packages
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats as st
import math as mth
from plotly import graph_objects as go
import warnings

In [2]:
warnings.filterwarnings("ignore")
#Load the dataset
logs_df=pd.read_csv('/datasets/logs_exp_us.csv', delimiter='\t')
logs_df.head(10)

FileNotFoundError: [Errno 2] No such file or directory: '/datasets/logs_exp_us.csv'

***Missing values***

In [None]:
#Dataset details
logs_df.info()

***We don't see any missing values in the dataset.***

<div id='data_prep'/>

## Data Preprocessing.  

In [None]:
#renaming all column names to lower case with _
logs_df=logs_df.rename(columns={'EventName':'event_name',
                                'DeviceIDHash':'device_id_hash',
                                'EventTimestamp':'event_timestamp',
                                 'ExpId':'exp_id'})
logs_df.head(10)


In [None]:
logs_df['event_name'].value_counts()

There are 5 distinct events that repeat throughout the dataset, so we convert the column to the category data type.

In [None]:
logs_df['event_name']=logs_df['event_name'].astype('category')

In [None]:
logs_df['device_id_hash'].value_counts().head(10)

In [None]:
logs_df['event_timestamp'].value_counts().head(10)

In [None]:
#creating new columns for datetime and date.
logs_df['event_datetime'] = pd.to_datetime(logs_df['event_timestamp'], 
                                  unit='s')
logs_df['event_date']=logs_df['event_datetime'].dt.date
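
As a quick sanity check of the conversion, here is a minimal sketch on a hand-made frame. The two timestamps are illustrative values, not rows from the dataset. `unit='s'` tells pandas to read the integers as seconds since the Unix epoch (UTC).

```python
import pandas as pd

# Illustrative epoch timestamps in seconds (not taken from the dataset)
sample = pd.DataFrame({'event_timestamp': [1564617600, 1564704000]})

# unit='s' interprets the integers as seconds since 1970-01-01 00:00:00 UTC
sample['event_datetime'] = pd.to_datetime(sample['event_timestamp'], unit='s')
sample['event_date'] = sample['event_datetime'].dt.date

print(sample)
```

The first value maps to midnight on 2019-08-01 and the second to exactly one day later, which makes it easy to confirm that the unit is seconds and not milliseconds.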

In [None]:
logs_df['exp_id'].value_counts()

In [None]:
#exp_id changed to category type.  
logs_df['exp_id']=logs_df['exp_id'].astype('category')

In [None]:
logs_df.info()

In [None]:
logs_df.head(10)

In [None]:
logs_df['event_datetime'].value_counts().head(10)

In [None]:
logs_df['event_date'].value_counts()

## Data Analysis

In [None]:
n_users=logs_df['device_id_hash'].nunique()
n_users

In [None]:
total_events=logs_df['event_name'].count()
total_events

In [None]:
total_unique_events=logs_df['event_name'].nunique()
total_unique_events

In [None]:
n_events_per_user=round(total_events/n_users)
n_events_per_user

***1. There are 7551 unique users in the dataset.  
2. There are 5 distinct events: MainScreen, OffersScreen, CartScreen, PaymentScreen and Tutorial. Together they occur 244126 times.  
3. Each user performs about 32 events on average.***
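
The average above is total events divided by total users, so a few very active users can pull it up. A minimal sketch on toy data (the user ids and counts are made up) shows how the mean and the median of events per user can differ:

```python
import pandas as pd

# Toy event log: user 1 is a heavy user, users 2 and 3 are light users
toy = pd.DataFrame({
    'device_id_hash': [1] * 10 + [2, 2, 3],
    'event_name': ['MainScreenAppear'] * 13,
})

# Number of events logged for each user
events_per_user = toy.groupby('device_id_hash')['event_name'].count()

mean_events = events_per_user.mean()      # pulled up by the heavy user
median_events = events_per_user.median()  # closer to the typical user

print(mean_events, median_events)
```

If the two numbers are far apart on the real data, the median may describe a typical user better than the mean.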

In [None]:
#group by user
logs_df_user=logs_df.groupby(['device_id_hash','event_datetime']).agg({'event_name':'count'}).reset_index()
logs_df_user.head(10)

In [None]:
#number of events per user per day
events_per_user=logs_df.groupby(['device_id_hash','event_date']).agg({'event_name':'count'}).reset_index()
events_per_user.head(10)


In [None]:
events_per_date=logs_df.groupby(['event_date','event_datetime']).agg({'event_name':'count'}).reset_index()
events_per_date.tail(10)

In [None]:
events_per_date['event_date'].describe()

In [None]:
events_per_date['event_date'].max()

In [None]:
events_per_date['event_date'].min()

'logs_df' dataset has records of events from 2019-07-25 to 2019-08-07.

In [None]:
plt.hist(logs_df['event_date'])
plt.xticks(rotation=90)
plt.show()

The app shows substantial activity only from 31 July 2019 onward. The histogram shows practically no events  
between 25 and 30 July 2019.

In [None]:
#Verifying with percentile
np.percentile(logs_df['event_datetime'],[1,5,10,50,95])

The NumPy percentile calculation shows that the records from '2019-07-25' to '2019-07-31' account for less than  
1% of the data, so we drop all records up to and including 2019-07-31.
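
Another way to pick the cutoff is to look at the share of events per calendar day. The sketch below uses toy data, and the 5% threshold is an assumption chosen for illustration, not a value from this analysis:

```python
import pandas as pd

# Toy log: a thin tail of early events, then full days of data
toy = pd.DataFrame({
    'event_datetime': pd.to_datetime(
        ['2019-07-30 23:50:00'] +
        ['2019-08-01 10:00:00'] * 60 +
        ['2019-08-02 10:00:00'] * 39
    )
})

# Share of all events that fall on each calendar day
daily_share = (
    toy['event_datetime'].dt.floor('D')
    .value_counts(normalize=True)
    .sort_index()
)
print(daily_share)

# Keep only the days that carry at least 5% of the events (assumed threshold)
threshold = 0.05
full_days = daily_share[daily_share >= threshold].index
cutoff = full_days.min()
trimmed = toy[toy['event_datetime'] >= cutoff]

print(cutoff, len(trimmed))
```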

In [None]:
#create new dataset without records from 2019-07-25 to 2019-07-31
logs_updated_df=logs_df[logs_df['event_datetime']>=pd.to_datetime('2019-08-01')]

In [None]:
logs_updated_df.head(10)

In [None]:
#checking for duplicates
logs_duplicates=logs_updated_df[logs_updated_df.duplicated(keep=False)].sort_values(by='event_datetime')
logs_duplicates.head(10)

In [None]:
logs_duplicates[(logs_duplicates['event_name']=='CartScreenAppear') & (logs_duplicates['device_id_hash']==2382591782303281935)]

***Looking at the duplicates, in most cases the same event appears twice. Since duplicates make up only a small share of the data (around 700 out of more than 200000 records), we keep the first record and drop the other copies.***  
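
Note that `duplicated(keep=False)` marks every copy of a repeated row, while `duplicated()` with the default `keep='first'` marks only the extra copies. A minimal sketch on toy rows (values are made up) shows the count, the share and the effect of `drop_duplicates()`:

```python
import pandas as pd

# Toy log: the first two rows are identical
toy = pd.DataFrame({
    'event_name': ['CartScreenAppear', 'CartScreenAppear', 'MainScreenAppear'],
    'device_id_hash': [11, 11, 22],
    'event_timestamp': [1564617600, 1564617600, 1564617601],
})

n_duplicates = toy.duplicated().sum()   # rows that repeat an earlier row
duplicate_share = n_duplicates / len(toy)

deduplicated = toy.drop_duplicates()    # keep='first' is the default

print(n_duplicates, round(duplicate_share, 3), len(deduplicated))
```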


In [None]:
logs_updated_df=logs_updated_df.drop_duplicates()

In [None]:
logs_updated_df['exp_id'].value_counts()

Since this dataset is used for an A/A test, we need to make sure that the 3 groups are  
evenly sized, with no more than a 1% difference in the number of users.  
We therefore take an equal number of users from each experiment group.
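
Before balancing, it can help to measure the gap between group sizes and to confirm that no user was logged under more than one group. This is a sketch on toy data. The user ids are made up, and user 3 is placed in two groups on purpose:

```python
import pandas as pd

# Toy log with made-up users; user 3 appears in two groups on purpose
toy = pd.DataFrame({
    'device_id_hash': [1, 1, 2, 3, 3, 4, 5],
    'exp_id':         [246, 246, 246, 247, 248, 247, 248],
})

# Unique users per experiment group
users_per_group = toy.groupby('exp_id')['device_id_hash'].nunique()

# Relative gap between the largest and the smallest group
relative_gap = (users_per_group.max() - users_per_group.min()) / users_per_group.min()

# Users that were logged under more than one group
groups_per_user = toy.groupby('device_id_hash')['exp_id'].nunique()
mixed_users = groups_per_user[groups_per_user > 1].index.tolist()

print(users_per_group.to_dict(), relative_gap, mixed_users)
```

A non-empty `mixed_users` list on the real data would suggest removing those users before comparing the groups.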

In [None]:
#number of users in exp_id 246
n_246=logs_updated_df[logs_updated_df['exp_id']==246]['device_id_hash'].nunique()
n_246

In [None]:
#number of users in exp_id 247
n_247=logs_updated_df[logs_updated_df['exp_id']==247]['device_id_hash'].nunique()
n_247

In [None]:
#number of users in exp_id 248
n_248=logs_updated_df[logs_updated_df['exp_id']==248]['device_id_hash'].nunique()
n_248

In [None]:
logs_updated_df.duplicated().sum()

In [None]:
#Create a dataset with equal number of users in each exp_id (2484)
logs_updated_df2=logs_updated_df[logs_updated_df['exp_id']==247]
logs_updated_df2.head(5)

In [None]:
logs_updated_df2['device_id_hash'].nunique()

In [None]:
#take a sample of 2484 users with exp_id=247
logs_updated_df3=logs_updated_df2.groupby('device_id_hash').nunique().reset_index()
logs_updated_df3=logs_updated_df3['device_id_hash'].sample(2484)
logs_temp=pd.DataFrame(logs_updated_df3)
logs_temp.head(10)

In [None]:
#dataset contains 2484 users with exp_id=247
logs_updated_df2=logs_updated_df2.merge(logs_temp,on='device_id_hash')
logs_updated_df2.head(5)

In [None]:
logs_updated_df2[logs_updated_df2['exp_id']==247].agg({'device_id_hash':'nunique'})

In [None]:
logs_updated_df3=[]
logs_temp=[]
logs_updated_df3

In [None]:
#getting records with exp_id=248
logs_updated_df4=logs_updated_df[logs_updated_df['exp_id']==248]
logs_updated_df4.head(5)

In [None]:
logs_updated_df3=logs_updated_df4.groupby('device_id_hash').nunique().reset_index()
logs_updated_df3=logs_updated_df3['device_id_hash'].sample(2484)
logs_temp=pd.DataFrame(logs_updated_df3)
logs_temp.head(10)

In [None]:
#getting 2484 users of exp_id 248
logs_updated_df4=logs_updated_df4.merge(logs_temp,on='device_id_hash')
logs_updated_df4.head(5)

In [None]:
logs_updated_df4['device_id_hash'].nunique()

In [None]:
#appending to users from exp_id 247 (DataFrame.append was removed in pandas 2.0)
logs_updated_df2=pd.concat([logs_updated_df2,logs_updated_df4])
logs_updated_df2.head(5)

In [None]:
logs_updated_df3=[]
logs_updated_df4=[]
logs_temp=[]
logs_temp=logs_updated_df[logs_updated_df['exp_id']==246]
logs_temp.head(5)

In [None]:
logs_temp['device_id_hash'].nunique()

In [None]:
#logs_updated_df2 contains an equal number of users with exp_id 246, 247 and 248
logs_updated_df2=pd.concat([logs_updated_df2,logs_temp])
logs_updated_df2.head(10)

In [None]:
logs_updated_df2[logs_updated_df2['exp_id']==246]['device_id_hash'].nunique()

In [None]:
logs_updated_df2[logs_updated_df2['exp_id']==247]['device_id_hash'].nunique()

In [None]:
logs_updated_df2[logs_updated_df2['exp_id']==248]['device_id_hash'].nunique()
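
The balancing steps above can also be condensed into one helper. This is a sketch, not the code used in this notebook: `balance_groups` is a hypothetical name, the toy data is made up, and the fixed `random_state` is an assumption added so that the sample is reproducible.

```python
import pandas as pd

def balance_groups(logs, n_users, random_state=0):
    """Keep all events of n_users randomly chosen users from each exp_id."""
    # One row per (group, user) pair
    users = logs[['exp_id', 'device_id_hash']].drop_duplicates()
    # Draw the same number of users from every group
    sampled = users.groupby('exp_id').sample(n=n_users, random_state=random_state)
    # Keep only the events that belong to the sampled users
    return logs.merge(sampled, on=['exp_id', 'device_id_hash'])

# Toy log: 2 users in group 246, 3 in group 247, 2 in group 248
toy = pd.DataFrame({
    'exp_id':         [246, 246, 246, 247, 247, 247, 247, 248, 248],
    'device_id_hash': [1,   1,   2,   3,   4,   5,   5,   6,   7],
    'event_name':     ['MainScreenAppear'] * 9,
})

balanced = balance_groups(toy, n_users=2)
print(balanced.groupby('exp_id')['device_id_hash'].nunique())
```

On the real data, `n_users` would be the size of the smallest group.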

In [None]:
logs_updated_df1=logs_updated_df2

In [None]:
logs_updated_df1.head(10)

In [None]:
logs_updated_df1['event_date'].min()

In [None]:
logs_updated_df1['event_date'].max()

In [None]:
plt.hist(logs_updated_df1['event_datetime'])
plt.xticks(rotation=90)
plt.show()

***In the updated dataset logs_updated_df1, the start date is '2019-08-01' and the end date is '2019-08-07'.  
Activity increases somewhat between the 5th and the 6th.***

In [None]:
#updated dataset details
logs_updated_df1.info()

In [None]:
#original dataset
logs_df.info()

In [None]:
#records in original dataset - records in updated one
records_lost=logs_df.shape[0] - logs_updated_df1.shape[0]
records_lost

In [None]:
lost_percentage=records_lost/logs_df.shape[0]
lost_percentage

In [None]:
#calculate number of users lost
users_in_original=logs_df['device_id_hash'].nunique()
users_in_updated=logs_updated_df1['device_id_hash'].nunique()
users_in_original


In [None]:
users_in_updated

In [None]:
lost_users=users_in_original-users_in_updated
lost_users_percentage=lost_users/users_in_original
lost_users

In [None]:
lost_users_percentage

***We lose around 1% of the users from the original logs dataset, which should not be a source of concern for the  
further analysis.***

***The share of records lost is also small, around 2% of the dataset, so we expect no major impact on the results.***

In [None]:
#getting the frequency of each event.  
#naming both columns explicitly keeps the result the same across pandas versions
events_frequency=logs_updated_df1['event_name'].value_counts().rename_axis('event_name').reset_index(name='count')
events_frequency

In [None]:
ax = sns.barplot(x='event_name', y='count', data=events_frequency,order=events_frequency.sort_values(by='count').event_name) 
ax.set_title('Frequency of different events')
plt.xticks(rotation=90)

plt.savefig('Frequency_of_events.png')

***The graph shows that MainScreen has the most events. Note that these counts include repeated views by the same users.***

In [None]:
unique_user_in_event=logs_updated_df1.groupby('event_name').agg({'device_id_hash':['nunique','count']})
unique_user_in_event=unique_user_in_event.reset_index()
unique_user_in_event=unique_user_in_event.rename(columns={'nunique':'total_unique_user','count':'total_repeat_user'})
unique_user_in_event

***The table above shows that most users open the Main screen (about 7300), fewer go on to the Offers page (about 4550), then to the Cart page (about 3700) and finally to the Payment successful page (about 3500). Few users  
are interested in the Tutorial page, which only about 800 users open.***

In [None]:
logs_updated_df1['device_id_hash'].nunique()

In [None]:
unique_user_in_event

In [None]:
unique_user_in_event.columns = unique_user_in_event.columns.droplevel(0) 

In [None]:
unique_user_in_event

In [None]:
unique_user_in_event['total_user']=7452

In [None]:
unique_user_in_event

In [None]:
unique_user_in_event.columns

In [None]:
#unique_user_in_event.columns.values[1] = "event_name"
unique_user_in_event=unique_user_in_event.rename(columns={unique_user_in_event.columns[0]:'event_name'})

In [None]:
unique_user_in_event

In [None]:
unique_user_in_event['event_once']=unique_user_in_event['total_unique_user']/unique_user_in_event['total_repeat_user']

In [None]:
ax = sns.barplot(x='event_name', y='total_unique_user', data=unique_user_in_event,order=unique_user_in_event.sort_values(by='total_unique_user').event_name) 
ax.set_title('number of unique users who performed each of these actions.')
plt.xticks(rotation=90)

plt.savefig('users_actions.png')

***This graph shows a trend similar to the one we saw earlier: MainScreen is the most performed action,  
followed by Offers, then Cart and finally Payment. Since very few users open Tutorial, we analyse only the other 4 actions.***

In [None]:
unique_user_in_event

In [None]:
unique_user_in_event['event_once_percent']=round(unique_user_in_event['event_once']*100)
unique_user_in_event

In [None]:
logs_updated_df1.sort_values(by='event_date').head(10)

In [None]:
logs_updated_df1.isna().sum()

In [None]:
ax = sns.barplot(x='event_name', y='event_once_percent', data=unique_user_in_event,order=unique_user_in_event.sort_values(by='event_once_percent').event_name) 
ax.set_title('percent of each of these actions occurred at least once.')
plt.xticks(rotation=90)


Since 'Tutorial' is the least performed action, it has the highest share of single-time events, as it has hardly any repeat users.  
The Main page has the most repeat visits, which shows as a slightly shorter bar compared to the other events.  
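
The ratio plotted above divides unique users by the number of events. If the goal is instead the share of all users who performed each action at least once, one possible sketch is shown below. The toy data and the column names `n_users` and `share_of_users` are assumptions for illustration:

```python
import pandas as pd

# Toy log with 4 made-up users
toy = pd.DataFrame({
    'device_id_hash': [1, 1, 1, 2, 2, 3, 4],
    'event_name': ['MainScreenAppear', 'MainScreenAppear', 'OffersScreenAppear',
                   'MainScreenAppear', 'OffersScreenAppear',
                   'MainScreenAppear', 'Tutorial'],
})

total_users = toy['device_id_hash'].nunique()

# Unique users per event, most popular event first
users_per_event = (
    toy.groupby('event_name')['device_id_hash'].nunique()
    .sort_values(ascending=False)
    .rename('n_users')
    .to_frame()
)

# Share of all users who performed the event at least once
users_per_event['share_of_users'] = users_per_event['n_users'] / total_users

print(users_per_event)
```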


<div id='step1'/>

## Study the event funnel  


In [None]:
#get the first time user perform the event
users_first_access=logs_updated_df1.pivot_table(
    index='device_id_hash', 
    columns='event_name', 
    values='event_datetime',
    aggfunc='min')
users_first_access.head(10)

In [None]:
#users who have viewed the main screen
step_1 = ~users_first_access['MainScreenAppear'].isna()
#users who viewed the main screen and later viewed the offers page
step_2 = step_1 & (users_first_access['OffersScreenAppear'] > users_first_access['MainScreenAppear'])
#users who viewed the cart page after the main and offers pages
step_3 = step_2 & (users_first_access['CartScreenAppear'] > users_first_access['OffersScreenAppear'])
#users who completed the sequence with the payment page
step_4 = step_3 & (users_first_access['PaymentScreenSuccessful'] > users_first_access['CartScreenAppear'])

#count of valid users in each stage
n_pageview = users_first_access[step_1].shape[0]
n_offersview = users_first_access[step_2].shape[0]
n_cartview = users_first_access[step_3].shape[0]
n_paymentview = users_first_access[step_4].shape[0]
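
To see how the strict-order conditions treat missing or out-of-order events, here is a sketch on three made-up users. It builds the first-seen table with `groupby` and `unstack`, which is assumed to give the same layout as the `pivot_table` call above. A comparison with a missing timestamp (`NaT`) evaluates to `False`, so a user who skipped a step drops out of the funnel.

```python
import pandas as pd

# User 1 follows the order, user 2 sees Offers before Main, user 3 only sees Main
toy = pd.DataFrame({
    'device_id_hash': [1, 1, 1, 2, 2, 3],
    'event_name': ['MainScreenAppear', 'OffersScreenAppear', 'CartScreenAppear',
                   'OffersScreenAppear', 'MainScreenAppear',
                   'MainScreenAppear'],
    'event_datetime': pd.to_datetime([
        '2019-08-01 10:00', '2019-08-01 10:05', '2019-08-01 10:10',
        '2019-08-01 09:00', '2019-08-01 09:30',
        '2019-08-01 11:00',
    ]),
})

# First time each user performed each event (NaT where the event never happened)
first_seen = (
    toy.groupby(['device_id_hash', 'event_name'])['event_datetime']
    .min()
    .unstack()
)

step_1 = first_seen['MainScreenAppear'].notna()
step_2 = step_1 & (first_seen['OffersScreenAppear'] > first_seen['MainScreenAppear'])
step_3 = step_2 & (first_seen['CartScreenAppear'] > first_seen['OffersScreenAppear'])

print(step_1.sum(), step_2.sum(), step_3.sum())
```

Only user 1 passes every step. User 2 is excluded at the second step because the first Offers view happened before the first Main view, which is worth keeping in mind when reading the strict funnel numbers.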

In [None]:
n_pageview


In [None]:
n_offersview

In [None]:
n_cartview 

In [None]:
n_paymentview

In [None]:
data= {'event_name':['MainScreen','OffersScreen','CartScreen','PaymentScreen'],
        'n_users':[n_pageview,n_offersview,n_cartview,n_paymentview]}
sequence_views=pd.DataFrame(data)
sequence_views

In [None]:
fig = go.Figure(go.Funnel(
    y = sequence_views['event_name'],
    x = sequence_views['n_users']
    ))
fig.show()

In [None]:
#ratio of users moving from 
#MainScreen to OffersScreen
main_offer= n_offersview/n_pageview

main_offer

In [None]:
#OffersScreen to CartScreen
offer_cart=n_cartview/n_offersview
offer_cart

In [None]:
#CartScreen to PaymentScreen
cart_payment= n_paymentview/n_cartview
cart_payment

In [None]:
#ratio of users that make from start (mainscreen) to end(payment)
main_payment=n_paymentview/n_pageview
main_payment


### Intermediate Conclusion:

***Around 56% of users move from the Main screen to the Offers page. Of those, around 42% move on to the Cart page.  
Finally, around 25% of the users on the Cart page end up paying for the transaction.  
Overall, only 6% of the users who open MainScreen complete the sequence with a payment.***  

***Most users are lost at the Cart page and never reach Payment. The step from the Offers page to the Cart page  
also shows a low conversion of around 42%.***
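
As a consistency check, the overall conversion should equal the product of the step-to-step conversions. The sketch below uses the rounded rates quoted above, so the result is approximate:

```python
import math

# Step-to-step conversion rates quoted above (rounded)
step_rates = {
    'MainScreen -> OffersScreen': 0.56,
    'OffersScreen -> CartScreen': 0.42,
    'CartScreen -> PaymentScreen': 0.25,
}

# The overall conversion is the product of the step conversions
overall = math.prod(step_rates.values())
print(f'{overall:.1%}')

# The step with the lowest rate loses the largest share of its incoming users
weakest_step = min(step_rates, key=step_rates.get)
print(weakest_step)
```

The product is about 5.9%, which agrees with the 6% overall figure, and the weakest step is the move from Cart to Payment.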

<div id='step2'/>

## Study the A/A/B test results.  


In [None]:
#create a dataset 'logs_a_df'  for exp_id=246
logs_a_df=logs_updated_df1[logs_updated_df1['exp_id']==246]
logs_a_df['device_id_hash'].nunique()

In [None]:
#create a dataset 'logs_b_df'  for exp_id=247
logs_b_df=logs_updated_df1[logs_updated_df1['exp_id']==247]
logs_b_df.head(10)

In [None]:
#create a dataset 'logs_c_df'  for exp_id=248
logs_c_df=logs_updated_df1[logs_updated_df1['exp_id']==248]
logs_c_df.head(10)

***Based on the 'events_frequency' dataset we created earlier ,we see that 'MainScreenAppear' is the most popular event compared to  
all other events.***

In [None]:
#group 'logs_a_df' dataset of the groups based on event_name
temp_a_df=logs_a_df.groupby('event_name')['device_id_hash'].nunique().reset_index()
temp_a_df=temp_a_df.rename(columns={'device_id_hash':'n_users'})
temp_a_df['n_total']=logs_a_df['device_id_hash'].nunique()
temp_a_df['n_share']=temp_a_df['n_users']/temp_a_df['n_total']
temp_a_df

***There is no visible outlier or anomaly.***

In [None]:
#group 'logs_b_df' dataset of the groups based on event_name
temp_b_df=logs_b_df.groupby('event_name')['device_id_hash'].nunique().reset_index()
temp_b_df=temp_b_df.rename(columns={'device_id_hash':'n_users'})
temp_b_df['n_total']=logs_b_df['device_id_hash'].nunique()
temp_b_df['n_share']=temp_b_df['n_users']/temp_b_df['n_total']
temp_b_df

***There is no visible outlier or anomaly.***

In [None]:
#group 'logs_c_df' dataset of the groups based on event_name
temp_c_df=logs_c_df.groupby('event_name')['device_id_hash'].nunique().reset_index()
temp_c_df=temp_c_df.rename(columns={'device_id_hash':'n_users'})
temp_c_df['n_total']=logs_c_df['device_id_hash'].nunique()
temp_c_df['n_share']=temp_c_df['n_users']/temp_c_df['n_total']
temp_c_df

***There is no visible outlier or anomaly. All 3 groups show similar trends with very little difference.***

In [None]:
#test whether the share of users performing an event differs between two groups,
#to confirm that the control groups see identical versions of the product and share the same key metrics.
alpha=0.05
def test_proportions(success1,trial1,success2,trial2):
    successes = np.array([success1,success2])
    trials = np.array([trial1,trial2])

# success proportion in the first group:
    p1 = successes[0]/trials[0]

# success proportion in the second group:
    p2 = successes[1]/trials[1]

# success proportion in the combined dataset:
    p_combined = (successes[0] + successes[1]) / (trials[0] + trials[1])

# the difference between the datasets' proportions
    difference = p1 - p2 
    
#calculating the statistic in standard deviations of the standard normal distribution
    z_value = difference / mth.sqrt(p_combined * (1 - p_combined) * (1/trials[0] + 1/trials[1]))

# setting up the standard normal distribution (mean 0, standard deviation 1)
    distr = st.norm(0, 1) 

    p_value = (1 - distr.cdf(abs(z_value))) * 2

    print('p-value: ', p_value)

    if (p_value < alpha):
        print("Rejecting the null hypothesis: there is a significant difference between the proportions")
    else:
        print("Failed to reject the null hypothesis: there is no reason to consider the proportions different") 
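
For reuse, the same pooled two-proportion z-test can be written so that it returns the statistic and the p-value instead of printing them. This is a sketch using only the standard library. `two_proportion_z_test` is a hypothetical name, and the counts in the usage lines are made up:

```python
import math
from statistics import NormalDist

def two_proportion_z_test(success1, trial1, success2, trial2):
    """Return (z, p_value) for a two-sided pooled z-test of two proportions."""
    p1 = success1 / trial1
    p2 = success2 / trial2
    # Pooled proportion under the null hypothesis of equal proportions
    p_combined = (success1 + success2) / (trial1 + trial2)
    std_error = math.sqrt(p_combined * (1 - p_combined) * (1 / trial1 + 1 / trial2))
    z_value = (p1 - p2) / std_error
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z_value)))
    return z_value, p_value

# Made-up counts, for illustration only
z, p = two_proportion_z_test(1200, 2484, 1150, 2484)
print(round(z, 3), round(p, 3))
```

Returning the values makes it easy to collect the results for all events in one table and to compare the p-values with `alpha` afterwards.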

In [None]:
# test 2 control groups -temp_a_df and temp_b_df
for i in range(0,5):
    test_proportions(temp_a_df['n_users'][i],temp_a_df['n_total'][i],temp_b_df['n_users'][i],temp_b_df['n_total'][i])
    

In [None]:
# test 2 groups -temp_a_df and temp_c_df
for i in range(0,5):
    test_proportions(temp_a_df['n_users'][i],temp_a_df['n_total'][i],temp_c_df['n_users'][i],temp_c_df['n_total'][i])
    

In [None]:
# test 2 groups -temp_b_df and temp_c_df
for i in range(0,5):
    test_proportions(temp_b_df['n_users'][i],temp_b_df['n_total'][i],temp_c_df['n_users'][i],temp_c_df['n_total'][i])
    

In [None]:
#test control groups combined  246+247 and test group 248
# test 2 groups -(temp_a_df+temp_b_df) and temp_c_df

#getting a sample of 1242 users from each control group and making a dataset with the sampled records(temp_a_df+temp_b_df)
temp_a_df1=logs_a_df
#take a sample of 1242 users with exp_id=246
temp_a_df2=logs_a_df.groupby('device_id_hash').nunique().reset_index()
temp_a_df2=temp_a_df2['device_id_hash'].sample(1242)
temp_df=pd.DataFrame(temp_a_df2)
temp_df.head(10)


In [None]:
temp_a_df1=temp_a_df1.merge(temp_df,on='device_id_hash')
temp_a_df1.head(10)

In [None]:
temp_a_df1['device_id_hash'].nunique()

In [None]:
temp_b_df1=logs_b_df
#take a sample of 1242 users with exp_id=247
temp_b_df2=logs_b_df.groupby('device_id_hash').nunique().reset_index()
temp_b_df2=temp_b_df2['device_id_hash'].sample(1242)
temp_df1=pd.DataFrame(temp_b_df2)
temp_df1
temp_b_df1=temp_b_df1.merge(temp_df1,on='device_id_hash')
temp_b_df1.head(10)

In [None]:
temp_b_df1['device_id_hash'].nunique()

In [None]:
#appending the records from control group b to group a and get combined group(temp_ab_combined_df)
temp_ab_combined_df=pd.concat([temp_a_df1,temp_b_df1])
temp_ab_combined_df.head(10)

In [None]:
temp_ab_combined_df1=temp_ab_combined_df.groupby('event_name')['device_id_hash'].nunique().reset_index()
temp_ab_combined_df1=temp_ab_combined_df1.rename(columns={'device_id_hash':'n_users'})


In [None]:
#getting total users and share of users performing each event
total_users=temp_ab_combined_df['device_id_hash'].nunique()
temp_ab_combined_df1['n_total']=total_users
temp_ab_combined_df1['n_share']=temp_ab_combined_df1['n_users']/temp_ab_combined_df1['n_total']
temp_ab_combined_df1


In [None]:
#test control groups combined  246+247 and test group 248
# test 2 groups -(temp_a_df+temp_b_df) and temp_c_df

for i in range(0,5):
     test_proportions(temp_ab_combined_df1['n_users'][i],temp_ab_combined_df1['n_total'][i],temp_c_df['n_users'][i],temp_c_df['n_total'][i])


***Based on the above test of proportions, we conclude that group c and the group formed by combining a and b  
were also split correctly.***

***Based on the tests of proportions, we conclude that all groups were split properly.***

In [None]:
#function to test if the 2 samples are statistically different
def test_stat_difference(sample1,sample2):
    results = st.ttest_ind(sample1,sample2)
    print('p-value: ', results.pvalue)
    if results.pvalue < alpha:
        print("We reject the null hypothesis")
    else:
        print("We can't reject the null hypothesis")

In [None]:
#testing the 2 control groups and the test groups for any difference
test_stat_difference(temp_a_df['n_share'],temp_b_df['n_share'])
test_stat_difference(temp_a_df['n_share'],temp_c_df['n_share'])
test_stat_difference(temp_b_df['n_share'],temp_c_df['n_share'])

***Based on the 'test_stat_difference' results, we see little difference between the 3 samples.***

***Let's check whether the trend of each event per day differs between the 3 groups.  
We use the cumulative user count here.***
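
One caveat when building a cumulative user count: a running sum of daily unique users counts a returning user once for every day they are active. If the goal is the number of distinct users seen so far, each user can be counted only on their first day. The sketch below contrasts the two on toy data (user ids and dates are made up):

```python
import pandas as pd

# Users 1 and 2 come back on later days
toy = pd.DataFrame({
    'device_id_hash': [1, 2, 1, 3, 2],
    'event_date': pd.to_datetime(
        ['2019-08-01', '2019-08-01', '2019-08-02', '2019-08-02', '2019-08-03']
    ),
})

# Summing daily unique users counts a returning user once per active day
daily_unique = toy.groupby('event_date')['device_id_hash'].nunique()
cumsum_of_daily = daily_unique.cumsum()

# Counting each user only on the first day they appear gives distinct users to date
first_day = toy.groupby('device_id_hash')['event_date'].min()
new_users_per_day = first_day.value_counts().sort_index()
cumulative_distinct = new_users_per_day.reindex(daily_unique.index, fill_value=0).cumsum()

print(cumsum_of_daily.tolist(), cumulative_distinct.tolist())
```

Both curves are valid for comparing groups as long as the same definition is used for all three groups.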

In [None]:
#working with group a 'exp_id=246'
cum_a_df= logs_a_df.groupby(['event_date','event_name']).agg({'device_id_hash':'nunique'}).reset_index()
cum_a_df.head(5)


In [None]:
#getting cumulative number of users viewing main page 
cum_a_main_df=cum_a_df[cum_a_df['event_name']=='MainScreenAppear']
cum_a_main_df['user_cum']=cum_a_main_df['device_id_hash'].cumsum()

In [None]:
cum_a_main_df

In [None]:
#getting cumulative number of users viewing Offers page 
cum_a_offers_df=cum_a_df[cum_a_df['event_name']=='OffersScreenAppear']
cum_a_offers_df['user_cum']=cum_a_offers_df['device_id_hash'].cumsum()
cum_a_offers_df

In [None]:
#getting cumulative number of users viewing cart page 
cum_a_cart_df=cum_a_df[cum_a_df['event_name']=='CartScreenAppear']
cum_a_cart_df['user_cum']=cum_a_cart_df['device_id_hash'].cumsum()
cum_a_cart_df

In [None]:
#getting cumulative number of users viewing payment page 
cum_a_payment_df=cum_a_df[cum_a_df['event_name']=='PaymentScreenSuccessful']
cum_a_payment_df['user_cum']=cum_a_payment_df['device_id_hash'].cumsum()
cum_a_payment_df

In [None]:
#getting cumulative number of users viewing each  page for group b -'exp_id=247'
cum_b_df= logs_b_df.groupby(['event_date','event_name']).agg({'device_id_hash':'nunique'}).reset_index()
cum_b_main_df=cum_b_df[cum_b_df['event_name']=='MainScreenAppear']
cum_b_main_df['user_cum']=cum_b_main_df['device_id_hash'].cumsum()
cum_b_offers_df=cum_b_df[cum_b_df['event_name']=='OffersScreenAppear']
cum_b_offers_df['user_cum']=cum_b_offers_df['device_id_hash'].cumsum()
cum_b_cart_df=cum_b_df[cum_b_df['event_name']=='CartScreenAppear']
cum_b_cart_df['user_cum']=cum_b_cart_df['device_id_hash'].cumsum()
cum_b_payment_df=cum_b_df[cum_b_df['event_name']=='PaymentScreenSuccessful']
cum_b_payment_df['user_cum']=cum_b_payment_df['device_id_hash'].cumsum()


In [None]:
#getting cumulative number of users viewing each  page for group c-'exp_id=248'
cum_c_df= logs_c_df.groupby(['event_date','event_name']).agg({'device_id_hash':'nunique'}).reset_index()
cum_c_main_df=cum_c_df[cum_c_df['event_name']=='MainScreenAppear']
cum_c_main_df['user_cum']=cum_c_main_df['device_id_hash'].cumsum()
cum_c_offers_df=cum_c_df[cum_c_df['event_name']=='OffersScreenAppear']
cum_c_offers_df['user_cum']=cum_c_offers_df['device_id_hash'].cumsum()
cum_c_cart_df=cum_c_df[cum_c_df['event_name']=='CartScreenAppear']
cum_c_cart_df['user_cum']=cum_c_cart_df['device_id_hash'].cumsum()
cum_c_payment_df=cum_c_df[cum_c_df['event_name']=='PaymentScreenSuccessful']
cum_c_payment_df['user_cum']=cum_c_payment_df['device_id_hash'].cumsum()

In [None]:
# Plotting the group A main screen viewers
plt.plot(cum_a_main_df['event_date'], cum_a_main_df['user_cum'], label='A')
# Plotting the group B main screen viewers
plt.plot(cum_b_main_df['event_date'], cum_b_main_df['user_cum'], label='B')
# Plotting the group C main screen viewers
plt.plot(cum_c_main_df['event_date'], cum_c_main_df['user_cum'], label='C')

plt.legend()
plt.xticks(rotation=90)
plt.xlabel('Date')
plt.ylabel('Total users viewing MainScreen')
plt.show()

In [None]:
# Plotting the group A offers screen viewers
plt.plot(cum_a_offers_df['event_date'], cum_a_offers_df['user_cum'], label='A')
# Plotting the group B offers screen viewers
plt.plot(cum_b_offers_df['event_date'], cum_b_offers_df['user_cum'], label='B')
# Plotting the group C offers screen viewers
plt.plot(cum_c_offers_df['event_date'], cum_c_offers_df['user_cum'], label='C')

plt.legend()
plt.xticks(rotation=90)
plt.xlabel('Date')
plt.ylabel('Total users viewing Offers Screen')
plt.show()

In [None]:
# Plotting the group A cart screen viewers
plt.plot(cum_a_cart_df['event_date'], cum_a_cart_df['user_cum'], label='A')
# Plotting the group B cart screen viewers
plt.plot(cum_b_cart_df['event_date'], cum_b_cart_df['user_cum'], label='B')
# Plotting the group C cart screen viewers
plt.plot(cum_c_cart_df['event_date'], cum_c_cart_df['user_cum'], label='C')

plt.legend()
plt.xticks(rotation=90)
plt.xlabel('Date')
plt.ylabel('Total users viewing cart Screen')
plt.show()

In [None]:
# Plotting the group A payment screen viewers
plt.plot(cum_a_payment_df['event_date'], cum_a_payment_df['user_cum'], label='A')
# Plotting the group B payment  screen viewers
plt.plot(cum_b_payment_df['event_date'], cum_b_payment_df['user_cum'], label='B')
# Plotting the group C payment  screen viewers
plt.plot(cum_c_payment_df['event_date'], cum_c_payment_df['user_cum'], label='C')

plt.legend()
plt.xticks(rotation=90)
plt.xlabel('Date')
plt.ylabel('Total users viewing payment Screen')
plt.show()

***We don't see much difference in the groups based on the events.***

<div id='end'/>

## Final Conclusion

***Based on the given data, we see that:  
1. Only 6% of the users who view the MainScreen finish the sequence with a payment.  
2. 'Tutorial' is the least performed action, so we continued the analysis with the other 4 events.  
3. The number of users reaching the next stage of the sequence decreases gradually from the Main page,  
but drops to 25% at the step to the payment page.  
4. Comparing the groups, we see that all 3 show a similar trend across the different events.  
Since we have only one week of data, it is better to wait a few more days before concluding whether the  
change of fonts impacts the performance of the app.***