# T-Test with Python · Navigation App

Completed by [Anton Starshev](http://linkedin.com/in/starshev) on 18/04/2024

### Context

In this fictional project scenario, I work as a data professional at Waze, a free navigation app that makes it easier for drivers around the world to get where they want to go.

Waze’s data team is working on the churn project. An intermediate request from leadership has emerged: to analyze the relationship between the mean number of rides and device type. Specifically, leadership wants to know whether there is a statistically significant difference in the mean number of rides between iPhone® and Android™ users.

### Data

This project uses a dataset called **waze_dataset.csv**. It contains synthetic data created for this project in partnership with Waze.

### Execution

I divided the work into four phases and carried them out in order:

1. Importing necessary Python packages and loading the dataset
2. Performing Exploratory Data Analysis (EDA) and computing descriptive statistics
3. Conducting hypothesis testing
4. Formulating business insights and recommendations

### 1 · Data Loading

Imported packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.

In [1]:
import pandas as pd
from scipy import stats

Loaded the scenario dataset into a DataFrame.

In [6]:
df = pd.read_csv("waze_dataset.csv", index_col = 0)

### 2 · Data Exploration

Previewed the loaded data.

In [7]:
df.head()

| ID | label | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days | device |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | retained | 283 | 226 | 296.748273 | 2276 | 208 | 0 | 2628.845068 | 1985.775061 | 28 | 19 | Android |
| 1 | retained | 133 | 107 | 326.896596 | 1225 | 19 | 64 | 13715.92055 | 3160.472914 | 13 | 11 | iPhone |
| 2 | retained | 114 | 95 | 135.522926 | 2651 | 0 | 0 | 3059.148818 | 1610.735904 | 14 | 8 | Android |
| 3 | retained | 49 | 40 | 67.589221 | 15 | 322 | 7 | 913.591123 | 587.196542 | 7 | 3 | iPhone |
| 4 | retained | 84 | 68 | 168.24702 | 1562 | 166 | 5 | 3950.202008 | 1219.555924 | 27 | 18 | Android |


Checked the data size.

In [8]:
df.shape

(14999, 12)

Verified the data types and names of columns.

In [9]:
df.dtypes

label                       object
sessions                     int64
drives                       int64
total_sessions             float64
n_days_after_onboarding      int64
total_navigations_fav1       int64
total_navigations_fav2       int64
driven_km_drives           float64
duration_minutes_drives    float64
activity_days                int64
driving_days                 int64
device                      object
dtype: object

Used descriptive statistics to conduct Exploratory Data Analysis (EDA) on the rides data.

In [12]:
df[['drives']].describe(include = 'all')

| statistic | drives |
|---|---|
| count | 14999.0 |
| mean | 67.281152 |
| std | 65.913872 |
| min | 0.0 |
| 25% | 20.0 |
| 50% | 48.0 |
| 75% | 93.0 |
| max | 596.0 |


To explore the relationship between device type and the number of rides customers take, I compared the average ride count for each device type.

In [13]:
df.groupby('device')[['drives']].mean()

| device | drives |
|---|---|
| Android | 66.231838 |
| iPhone | 67.859078 |


**Observation**: First, the data contains just two device categories, Android and iPhone.

Second, in this sample iPhone users take slightly more rides on average than Android users (about 67.9 versus 66.2). However, a gap this small could be due to sampling variability alone, so the next step was to check its statistical significance through hypothesis testing.
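Whether a gap in means could be sampling noise depends on each group's size and spread, not on the means alone. The sketch below shows one way to view count, mean, and standard deviation side by side per device. It uses a small made-up DataFrame (`toy`, hypothetical values) rather than the project dataset, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for waze_dataset.csv (values are made up).
toy = pd.DataFrame({
    "device": ["iPhone", "iPhone", "iPhone", "Android", "Android", "Android"],
    "drives": [10, 20, 30, 6, 15, 45],
})

# Count, mean, and sample standard deviation of drives for each device type.
summary = toy.groupby("device")["drives"].agg(["count", "mean", "std"])
print(summary)
```

Applied to the real DataFrame, the same `agg` call would give the group sizes and standard deviations that the t-test uses when it weighs the difference in means.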

### 3 · Hypothesis Test

Because device type is a categorical variable, I first mapped it to numerical values, assigning "1" to iPhone devices and "2" to Android devices, and stored the result in a new column, "device_type", for use in the test.

In [19]:
device_map = {'Android' : 2, 'iPhone' : 1}

df['device_type'] = df.device.map(device_map)

df[['device','device_type']].head()

| ID | device | device_type |
|---|---|---|
| 0 | Android | 2 |
| 1 | iPhone | 1 |
| 2 | Android | 2 |
| 3 | iPhone | 1 |
| 4 | Android | 2 |


Stated the null hypothesis and the alternative hypothesis:

**H<sub>0</sub>**: There is no difference in the average number of rides between clients who use iPhones and those who use Android devices.

**H<sub>1</sub>**: There is a difference in the average number of rides between clients who use iPhones and those who use Android devices.

Assigned a **5% significance level** to the hypothesis test.

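Determined the type of hypothesis testing: **two-sample two-tailed t-test**. Because the test is run with equal_var = False, it is Welch's version, which does not assume that the two groups have equal variances.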
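The test call in this section passes `equal_var = False`, which selects Welch's variant of the two-sample t-test. As a rough sketch of what that computes, the block below builds the statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand, then compares them with SciPy. The samples `a` and `b` are made-up numbers, and the variable names are illustrative rather than part of the project.

```python
import numpy as np
from scipy import stats

# Hypothetical samples (made-up numbers), deliberately unequal in size and spread.
a = np.array([12.0, 15.0, 9.0, 20.0, 17.0, 14.0])
b = np.array([8.0, 25.0, 30.0, 5.0])

# Sample variance of each group divided by its size.
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

# Welch's t statistic: difference in means over the unpooled standard error.
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite approximation for the degrees of freedom.
df_manual = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

# Two-sided p-value from the t distribution.
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)

# SciPy's result for the same samples, for comparison.
result = stats.ttest_ind(a, b, equal_var=False)
print(t_manual, df_manual, p_manual)
print(result.statistic, result.pvalue)
```

The degrees of freedom from this approximation are generally not a whole number, which is why the test output in this section reports a fractional `df`.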

Filtered the data into two groups based on the device type: iPhone or Android.

In [15]:
iphone_drives = df[df['device_type'] == 1].drives

android_drives = df[df['device_type'] == 2].drives

Conducted the hypothesis test using SciPy Stats.

In [20]:
stats.ttest_ind(a = iphone_drives, b = android_drives, equal_var = False,
                alternative = 'two-sided')

TtestResult(statistic=1.463523206885235, pvalue=0.143351972680206, df=11345.066049381952)

**Test Result**: The p-value of about 14.3% is well above the 5% significance level, so I failed to reject the null hypothesis.
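The decision rule can be written out explicitly. The sketch below takes the statistic and degrees of freedom reported in the output above, recomputes the two-sided p-value from the t distribution, and compares it with the 5% significance level. The names `alpha`, `t_stat`, `dof`, and `decision` are illustrative and not part of the original notebook.

```python
from scipy import stats

alpha = 0.05                  # significance level chosen for this test
t_stat = 1.463523206885235    # statistic reported by stats.ttest_ind
dof = 11345.066049381952      # degrees of freedom reported by stats.ttest_ind

# Two-sided p-value: probability of a |t| at least this large if H0 is true.
p_value = 2 * stats.t.sf(abs(t_stat), dof)

# Reject H0 only when the p-value falls below the significance level.
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"p-value = {p_value:.4f}, decision: {decision}")
```

Failing to reject the null hypothesis means the data do not provide enough evidence of a difference; it does not prove that the two means are equal.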

### 4 · Insight and Recommendation

**Business Insight**: Based on the conducted test, the key business insight is that there is no statistically significant difference in the average number of rides between clients who use iPhones and those who use Android devices.

**Business Recommendation**: *Since the test found no statistically significant difference in the mean number of rides between device types, device type does not appear to explain differences in ride count. I would recommend exploring other factors within the churn research that may influence a user's ride count and conducting hypothesis tests on them.*
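One possible way to follow up on this recommendation is to wrap the same Welch's t-test in a small helper that can be pointed at other two-level grouping columns. The function `welch_by_group` below is a hypothetical helper, not part of the original notebook, and the toy data (including the "churned" label value) is made up for illustration.

```python
import pandas as pd
from scipy import stats

def welch_by_group(data, group_col, value_col):
    """Run Welch's two-sided t-test on value_col across a two-level group_col."""
    # groupby yields the groups in sorted order of their labels.
    groups = [g[value_col].dropna() for _, g in data.groupby(group_col)]
    if len(groups) != 2:
        raise ValueError(f"{group_col} must have exactly two levels")
    return stats.ttest_ind(groups[0], groups[1], equal_var=False)

# Hypothetical toy data (values are made up).
toy = pd.DataFrame({
    "label": ["retained", "retained", "retained", "churned", "churned", "churned"],
    "drives": [10, 20, 30, 40, 55, 70],
})

result = welch_by_group(toy, "label", "drives")
print(result.statistic, result.pvalue)
```

When several factors are tested this way, the chance of at least one false positive grows with the number of tests, so it may be worth adjusting the significance level accordingly.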

### Acknowledgment

I would like to thank Google and Coursera for supporting the educational process and providing the opportunity to refine and showcase skills acquired during the courses by completing real-life scenario portfolio projects such as this one.

### Reference

This is an end-of-course workplace scenario project *«Waze, created in partnership with the realtime driving directions app»* proposed within the syllabus of *Google Advanced Data Analytics Professional Certificate* on Coursera.