# **Waze Project: Data exploration and hypothesis testing**

In this activity, we will explore the data provided and conduct a hypothesis test.
<br/>

**The purpose** of this project is to demonstrate how to conduct a two-sample hypothesis test.

**The goal** is to apply descriptive statistics and hypothesis testing in Python.
<br/>

*This activity has three parts:*

**Part 1:** Imports and data loading

**Part 2:** Conducting hypothesis testing

**Part 3:** Communicating insights with stakeholders

<br/>

1. What is our research question for this data project?

**"Is there a statistically significant difference in the mean number of drives between iPhone users and Android™ users?"**

### **Imports and data loading**

In [9]:
# Import any relevant packages or libraries
import pandas as pd
from scipy import stats

In [10]:
# Load dataset into dataframe
df = pd.read_csv('waze_dataset.csv')

1. How can computing descriptive statistics help us learn more about our data at this stage of our analysis?

**Descriptive statistics are useful because they let us quickly explore and understand large amounts of data. In this case, computing descriptive statistics lets us quickly compare the average number of drives by device type.**

### **Data exploration**

We use descriptive statistics to conduct exploratory data analysis (EDA).

**Note:** In the dataset, `device` is a categorical variable with the labels `iPhone` and `Android`.

In order to perform this analysis, we must turn each label into an integer. The following code assigns a `1` for an `iPhone` user and a `2` for an `Android` user, and stores the result in a new column, `device_type`.

**Note:** Creating a new variable is ideal so that we don't overwrite original data.

In [11]:
# 1. Create `map_dictionary`
map_dictionary = {'Android': 2, 'iPhone': 1}

# 2. Create new `device_type` column
df['device_type'] = df['device'].map(map_dictionary)
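
One thing worth checking after the mapping is that every label was covered, since `Series.map` returns `NaN` for any value missing from the dictionary. The following is a minimal sketch on a small made-up frame; the `toy` data is hypothetical and stands in for the real dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({'device': ['iPhone', 'Android', 'iPhone', 'Android']})

map_dictionary = {'Android': 2, 'iPhone': 1}
toy['device_type'] = toy['device'].map(map_dictionary)

# Labels absent from the dictionary would become NaN, so count them
unmapped = toy['device_type'].isna().sum()
print(unmapped)
```

A nonzero count would suggest an unexpected label (for example, a differently capitalized device name) that needs handling before the groups are compared.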

We are interested in the relationship between device type and the number of drives. One approach is to look at the average number of drives for each device type.

In [12]:
# Compute the mean number of drives for each device type
df.groupby(['device'])['drives'].mean()

device
Android    66.231838
iPhone     67.859078
Name: drives, dtype: float64
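
The mean alone says little about group sizes or spread, both of which matter for the test that follows. As a sketch, the same `groupby` can report several statistics at once; the `toy` frame below is hypothetical data, not the Waze dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    'device': ['iPhone', 'iPhone', 'iPhone', 'Android', 'Android', 'Android'],
    'drives': [10, 20, 60, 5, 25, 45],
})

# Group size, center, and spread of `drives` for each device type
summary = toy.groupby('device')['drives'].agg(['count', 'mean', 'median', 'std'])
print(summary)
```

Comparing the `std` column across groups is one informal way to judge whether assuming equal variances would be reasonable.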

Based on the averages shown, it appears that drivers who use an iPhone device to interact with the application have a higher number of drives on average. However, this difference might arise from random sampling, rather than being a true difference in the number of drives. To assess whether the difference is statistically significant, we can conduct a hypothesis test.


### **Hypothesis testing**

Our goal is to conduct a two-sample t-test.

**Note:** This is a t-test for two independent samples. This is the appropriate test since the two groups are independent (Android users vs. iPhone users).

**Question:** What are our hypotheses for this data project?

$H_0$: There is no difference in average number of drives between drivers who use iPhone devices and drivers who use Androids.

$H_A$: There is a difference in average number of drives between drivers who use iPhone devices and drivers who use Androids.

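Next, we choose 5% as the significance level and proceed with a two-sample t-test. Setting `equal_var=False` runs Welch's t-test, which does not assume that the two groups have equal variances.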

In [13]:
# 1. Isolate the `drives` column for iPhone users.
iphone = df[df['device_type'] == 1]['drives']

# 2. Isolate the `drives` column for Android users.
android = df[df['device_type'] == 2]['drives']

# 3. Perform the t-test
stats.ttest_ind(a=iphone, b=android, equal_var=False)

Ttest_indResult(statistic=1.4635232068852353, pvalue=0.1433519726802059)
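
To make the calculation behind `equal_var=False` less of a black box, the sketch below computes Welch's t statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand, then compares them with SciPy's result. The samples `a` and `b` are randomly generated for illustration (their means, spreads, and sizes are assumptions), so the numbers will not match the output above.

```python
import numpy as np
from scipy import stats

# Hypothetical samples; not drawn from the Waze dataset
rng = np.random.default_rng(0)
a = rng.normal(loc=68, scale=65, size=200)
b = rng.normal(loc=66, scale=60, size=150)

# Squared standard error of each sample mean (sample variance / n)
se2_a = a.var(ddof=1) / a.size
se2_b = b.var(ddof=1) / b.size

# Welch's t statistic
t_stat = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)

# Welch-Satterthwaite approximation of the degrees of freedom
dof = (se2_a + se2_b) ** 2 / (
    se2_a ** 2 / (a.size - 1) + se2_b ** 2 / (b.size - 1)
)

# Two-sided p-value from the t distribution
p_value = 2 * stats.t.sf(abs(t_stat), dof)

# Compare with SciPy's implementation
result = stats.ttest_ind(a=a, b=b, equal_var=False)
print(np.isclose(t_stat, result.statistic), np.isclose(p_value, result.pvalue))
```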

**Question:** Based on the p-value we got above, do we reject or fail to reject the null hypothesis?

**Since the p-value (about 0.143) is larger than the significance level of 5%, we fail to reject the null hypothesis.**

**There is NOT a statistically significant difference in the average number of drives between drivers who use iPhones and drivers who use Androids.**
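
The decision rule can also be written out explicitly. This is a small sketch that hard-codes the p-value reported by the t-test and the 5% significance level chosen earlier; the variable names are illustrative.

```python
alpha = 0.05                   # significance level chosen before testing
p_value = 0.1433519726802059   # p-value reported by the t-test

# Reject the null hypothesis only when the p-value falls below alpha
decision = 'reject' if p_value < alpha else 'fail to reject'
print(f'p-value = {p_value:.3f}, alpha = {alpha}: {decision} the null hypothesis')
```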

### **Communicate insights with stakeholders**

* What business insight(s) can we draw from the result of our hypothesis test?

> *Drivers who use iPhone devices have, on average, a similar number of drives to those who use Androids; the observed difference is not statistically significant.*

> *A potential next step is to explore what other factors influence the variation in the number of drives, and to run additional hypothesis tests to learn more about user behavior. Further, temporary changes in marketing or user interface for the Waze app may provide more data to investigate churn.*