# Waze User Churn Project


## Data Exploration & Hypothesis Testing

**The purpose** of this notebook is to conduct a two-sample hypothesis test.

**The goal** is to apply descriptive statistics and hypothesis testing in Python.
<br/>

This notebook has four parts:
1. Imports and data loading
2. Data exploration
3. Hypothesis testing
4. Conclusion and insights for stakeholders


## 1. Imports & Data Loading


In [1]:
# Import
import pandas as pd
import numpy as np
from scipy import stats

In [3]:
# Load dataset into dataframe
df = pd.read_csv('waze_dataset.csv')

In [4]:
df.head()

Unnamed: 0,ID,label,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days,device
0,0,retained,283,226,296.748273,2276,208,0,2628.845068,1985.775061,28,19,Android
1,1,retained,133,107,326.896596,1225,19,64,13715.92055,3160.472914,13,11,iPhone
2,2,retained,114,95,135.522926,2651,0,0,3059.148818,1610.735904,14,8,Android
3,3,retained,49,40,67.589221,15,322,7,913.591123,587.196542,7,3,iPhone
4,4,retained,84,68,168.24702,1562,166,5,3950.202008,1219.555924,27,18,Android


## 2. Data Exploration

To perform EDA using descriptive statistics, we first need to convert the categorical `device` column to a numeric code.

### Convert `device`

We will assign `1` to `iPhone` and `2` to `Android`.

In [6]:
# Create a dict to perform the conversion
map_dictionary = {"iPhone": 1, "Android": 2}

# Create a new numeric column by mapping `device` through the dict
df['device_type'] = df['device'].map(map_dictionary)

# View the change
df.head()

Unnamed: 0,ID,label,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days,device,device_type
0,0,retained,283,226,296.748273,2276,208,0,2628.845068,1985.775061,28,19,Android,2
1,1,retained,133,107,326.896596,1225,19,64,13715.92055,3160.472914,13,11,iPhone,1
2,2,retained,114,95,135.522926,2651,0,0,3059.148818,1610.735904,14,8,Android,2
3,3,retained,49,40,67.589221,15,322,7,913.591123,587.196542,7,3,iPhone,1
4,4,retained,84,68,168.24702,1562,166,5,3950.202008,1219.555924,27,18,Android,2
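
One thing worth checking after a mapping like this is whether every label was covered, since `.map()` returns `NaN` for any value missing from the dict. The following is a minimal sketch on hypothetical toy data (the `toy` frame and `unmapped` name are illustrative and not part of the dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for the real `device` column
toy = pd.DataFrame({"device": ["iPhone", "Android", "Android", "iPhone"]})
map_dictionary = {"iPhone": 1, "Android": 2}

# Map the labels to numeric codes
toy["device_type"] = toy["device"].map(map_dictionary)

# Labels missing from the dict become NaN, so count them as a sanity check
unmapped = toy["device_type"].isna().sum()
print(toy["device_type"].tolist(), "unmapped:", unmapped)
```

If `unmapped` were greater than zero on the real data, the dict would need an entry for the missing label before any group comparison.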


### Device Type vs. Number of Drives
Next, we investigate the relationship between `device_type` and `drives`.

In [8]:
# Comparing the average of drives between iPhone and Android users
android_mean = df[df["device_type"] == 2]["drives"].mean()
iphone_mean = df[df["device_type"] == 1]["drives"].mean()

print("Android Mean Drives: ", android_mean)
print("iPhone Mean Drives: ", iphone_mean)

Android Mean Drives:  66.23183780739629
iPhone Mean Drives:  67.85907775020678


On average, Android users take about 1.6 fewer drives than iPhone users.  

We will conduct a hypothesis test to determine whether this difference is statistically significant.
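
Before testing, it can help to look at group sizes and spread alongside the means, since a gap in means is hard to judge without them. Below is a minimal sketch using `groupby` on hypothetical toy data; the values are made up for illustration and are not taken from the Waze dataset:

```python
import pandas as pd

# Hypothetical toy data; the real notebook would use `df` instead
toy = pd.DataFrame({
    "device": ["iPhone", "iPhone", "iPhone", "Android", "Android", "Android"],
    "drives": [70, 60, 80, 65, 55, 75],
})

# Count, mean, and sample standard deviation of drives per device
summary = toy.groupby("device")["drives"].agg(["count", "mean", "std"])
print(summary)
```

On the real data, the same call on `df` would show whether the 1.6-drive gap is small relative to the variation within each group.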

## 3. Hypothesis Testing

**Type of Test** - one-sided, two-sample t-test (Welch's t-test, which does not assume equal variances). 

**Steps in Hypothesis Testing:**
1. State the Null hypothesis ($H_0$) & Alternative hypothesis ($H_a$)
2. Choose a significance level (usually 0.05)
3. Find the p-value
4. Make a decision based on the p-value (Reject/Fail to Reject $H_0$)


### Step 1 - State Hypotheses
* $H_0$: The mean number of drives for iPhone users is **not** greater than the mean for Android users.
* $H_a$: The mean number of drives for iPhone users is **greater** than the mean for Android users.
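
In symbols, writing $\mu_{\text{iPhone}}$ and $\mu_{\text{Android}}$ for the population mean number of drives in each group (notation introduced here for illustration), the hypotheses can be stated as:

$$
H_0: \mu_{\text{iPhone}} \le \mu_{\text{Android}} \qquad H_a: \mu_{\text{iPhone}} > \mu_{\text{Android}}
$$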

### Step 2 - Significance Level
* We will use 5%

### Step 3 - Calculate P-Value


In [13]:
# Isolate the `drives` data for iPhone users.
iphone_drives = df[df["device_type"] == 1]["drives"]

# Isolate the `drives` data for Android users.
android_drives = df[df["device_type"] == 2]["drives"]

# Perform a one-sided Welch's t-test (equal_var=False because we can't assume the two groups have equal variance)
results = stats.ttest_ind(iphone_drives, android_drives, equal_var=False, alternative='greater')
if results.pvalue >= 0.05:
    print("Fail to Reject H0, p-val = ", results.pvalue)
else:
    print("Reject H0, p-val = ", results.pvalue)

Fail to Reject H0, p-val =  0.071675986340103
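
To make the calculation behind `stats.ttest_ind(..., equal_var=False, alternative='greater')` more concrete, the sketch below computes Welch's t-statistic, the Welch-Satterthwaite degrees of freedom, and the one-sided p-value by hand, then compares them with SciPy's result. The samples `a` and `b` are synthetic and their parameters are arbitrary assumptions, not the Waze data:

```python
import numpy as np
from scipy import stats

# Synthetic samples (arbitrary assumed parameters, for illustration only)
rng = np.random.default_rng(0)
a = rng.normal(68, 65, size=500)
b = rng.normal(66, 65, size=300)

# Squared standard error of each sample mean
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

# Welch's t-statistic and Welch-Satterthwaite degrees of freedom
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)
dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

# One-sided p-value: probability of a t at least this large under H0
p_manual = stats.t.sf(t_manual, dof)

# Compare against SciPy's built-in one-sided Welch test
res = stats.ttest_ind(a, b, equal_var=False, alternative="greater")
print(np.isclose(t_manual, res.statistic), np.isclose(p_manual, res.pvalue))
```

The p-value uses the upper tail (`sf`) because the alternative hypothesis is "greater"; a two-sided test would double the smaller tail instead.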


### Step 4 - Make a Decision

We fail to reject $H_0$ because the p-value (0.07) is greater than the significance level (0.05).  
Failing to reject is not the same as accepting $H_0$: the data do not provide sufficient evidence that the _iPhone_ mean is greater than the _Android_ mean.  

## 4. Conclusion 

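At the 5% significance level, we found no statistically significant evidence that iPhone users take more drives on average than Android users; the observed gap of about 1.6 drives could plausibly be due to chance. This does not show that the two groups are equivalent. Because the test compares drives rather than churn itself, it offers no evidence that device type is a distinguishing factor for churn, but it does not rule that out either.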
