# **Waze Project**
**Course 4 - The Power of Statistics**

# **Course 4 End-of-course project: Data exploration and hypothesis testing**

In this activity, we will explore the data provided and conduct a hypothesis test.
<br/>

**The purpose** of this project is to demonstrate knowledge of how to conduct a two-sample hypothesis test.

**The goal** is to apply descriptive statistics and hypothesis testing in Python.
<br/>

*This activity has three parts:*

**Part 1:** Imports and data loading
* What data packages will be necessary for hypothesis testing?

**Part 2:** Conduct hypothesis testing
* How did computing descriptive statistics help you analyze your data?

* How did you formulate your null hypothesis and alternative hypothesis?

**Part 3:** Communicate insights with stakeholders

* What key business insight(s) emerged from your hypothesis test?

* What business recommendations do you propose based on your results?

<br/>


# **Data exploration and hypothesis testing**

<img src="images/Pace.png" width="100" height="100" align=left>

# **PACE stages**


Throughout these project notebooks, we will see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.

## **PACE: Plan**

1. What is your research question for this data project? Later on, you will need to formulate the null and alternative hypotheses as the first step of your hypothesis test. Consider your research question now, at the start of this task.


==> Is there a statistically significant difference in the mean number of drives between iPhone users and Android users? To answer this, we conduct a two-sample hypothesis test (t-test) on the `drives` column.

### **Task 1. Imports and data loading**




Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.

In [4]:
# Import any relevant packages or libraries
### YOUR CODE HERE ###
import pandas as pd
import numpy as np
from scipy import stats

In [5]:
# Load dataset into dataframe
df = pd.read_csv('waze_dataset.csv')
df.head()

Unnamed: 0,ID,label,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days,device
0,0,retained,283,226,296.748273,2276,208,0,2628.845068,1985.775061,28,19,Android
1,1,retained,133,107,326.896596,1225,19,64,13715.92055,3160.472914,13,11,iPhone
2,2,retained,114,95,135.522926,2651,0,0,3059.148818,1610.735904,14,8,Android
3,3,retained,49,40,67.589221,15,322,7,913.591123,587.196542,7,3,iPhone
4,4,retained,84,68,168.24702,1562,166,5,3950.202008,1219.555924,27,18,Android


## **PACE: Analyze and Construct**


1. Data professionals use descriptive statistics for exploratory data analysis (EDA). How can computing descriptive statistics help you learn more about your data in this stage of your analysis?


==> Computing descriptive statistics during EDA helps summarize and understand key characteristics of the data, such as central tendency (mean, median) and variability (range, standard deviation).
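
As a minimal sketch of what this looks like in practice, the block below summarizes `drives` by device using only the five preview rows shown in the `df.head()` output above. The tiny `sample` frame is an illustration; the real analysis would run the same `groupby` on the full `df`.

```python
import pandas as pd

# Five preview rows copied from the df.head() output above (illustration only;
# the real analysis uses the full dataset).
sample = pd.DataFrame({
    'drives': [226, 107, 95, 40, 68],
    'device': ['Android', 'iPhone', 'Android', 'iPhone', 'Android'],
})

# Central tendency and spread of `drives`, split by device
summary = sample.groupby('device')['drives'].agg(
    ['count', 'mean', 'median', 'std', 'min', 'max']
)
print(summary)
```

With only five rows these numbers say nothing about the population; the point is the shape of the summary, not its values.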


### **Task 2. Data exploration**

Use descriptive statistics to conduct exploratory data analysis (EDA).

**Note:** In the dataset, `device` is a categorical variable with the labels `iPhone` and `Android`.

For this analysis, we encode each label as an integer. The following code assigns a `1` for an `iPhone` user and a `2` for an `Android` user, and stores the result in a new column, `device_type`.

In [6]:
# 1. Create `map_dictionary`
### YOUR CODE HERE ###
map_dictionary = {
    'Android': 2,
    'iPhone': 1
}

# 2. Create new `device_type` column
### YOUR CODE HERE ###
df['device_type'] = df['device']
# 3. Map the new column to the dictionary
### YOUR CODE HERE ###
df['device_type'] = df['device_type'].map(map_dictionary)

In [7]:
df.head()

Unnamed: 0,ID,label,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days,device,device_type
0,0,retained,283,226,296.748273,2276,208,0,2628.845068,1985.775061,28,19,Android,2
1,1,retained,133,107,326.896596,1225,19,64,13715.92055,3160.472914,13,11,iPhone,1
2,2,retained,114,95,135.522926,2651,0,0,3059.148818,1610.735904,14,8,Android,2
3,3,retained,49,40,67.589221,15,322,7,913.591123,587.196542,7,3,iPhone,1
4,4,retained,84,68,168.24702,1562,166,5,3950.202008,1219.555924,27,18,Android,2
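
One thing worth checking after a mapping step: `Series.map` returns `NaN` for any value that is not a key in the dictionary. The sketch below shows a quick check for unmapped labels. The `device` series and the `'Windows'` label are invented for illustration and do not come from the Waze dataset.

```python
import pandas as pd

map_dictionary = {'Android': 2, 'iPhone': 1}

# Hypothetical device column; 'Windows' is invented to show an unmapped label
device = pd.Series(['Android', 'iPhone', 'Windows', 'iPhone'])
device_type = device.map(map_dictionary)

# Any label missing from the dictionary becomes NaN after .map()
unmapped = device[device_type.isna()].unique()
print(unmapped)
```

An empty result from the same check on the real `device` column would confirm that every row received a code.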


We are interested in the relationship between device type and the number of drives. One approach is to look at the average number of drives for each device type. Calculate these averages.

In [8]:
### YOUR CODE HERE ###
iphone_df = df[df['device_type'] == 1]
android_df = df[df['device_type'] == 2]

iphone_drives_avg = iphone_df['drives'].mean()
android_drives_avg = android_df['drives'].mean()

print('iphone_drives_avg is', iphone_drives_avg)
print('android_drives_avg is', android_drives_avg)

iphone_drives_avg is 67.85907775020678
android_drives_avg is 66.23183780739629


Based on the averages shown, it appears that drivers who use an iPhone device to interact with the application have a higher number of drives on average. However, this difference might arise from random sampling, rather than being a true difference in the number of drives. To assess whether the difference is statistically significant, we can conduct a hypothesis test.
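
To put the size of the gap in perspective, here is a small arithmetic sketch using only the two averages printed above. It describes the observed difference and says nothing about significance.

```python
# Averages copied from the printed output above
iphone_drives_avg = 67.85907775020678
android_drives_avg = 66.23183780739629

# Absolute and relative gap between the two group means
abs_diff = iphone_drives_avg - android_drives_avg
rel_diff = abs_diff / android_drives_avg

print(f'Absolute difference: {abs_diff:.2f} drives')
print(f'Relative difference: {rel_diff:.1%}')
```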


### **Task 3. Hypothesis testing**

Our goal is to conduct a two-sample t-test. Recall the steps for conducting a hypothesis test:


1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis

**Note:** This is a t-test for two independent samples. This is the appropriate test since the two groups are independent (Android users vs. iPhone users).

**Question:** What are your hypotheses for this data project?

==> **Null hypothesis:** There is no difference in the mean number of drives between iPhone users and Android users.

**Alternative hypothesis:** There is a difference in the mean number of drives between iPhone users and Android users.

Next, choose 5% as the significance level and proceed with a two-sample t-test.

1. Isolate the `drives` column for iPhone users.
2. Isolate the `drives` column for Android users.
3. Perform the t-test

In [11]:
# 1. Isolate the `drives` column for iPhone users.
### YOUR CODE HERE ###
iphone_drives = iphone_df['drives']

# 2. Isolate the `drives` column for Android users.
### YOUR CODE HERE ###
android_drives = android_df['drives']

# 3. Perform the t-test (equal_var=False selects Welch's t-test,
#    which does not assume equal variances in the two groups)
### YOUR CODE HERE ###
stats.ttest_ind(a=iphone_drives, b=android_drives, equal_var=False)

Ttest_indResult(statistic=1.4635232068852353, pvalue=0.1433519726802059)
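
Passing `equal_var=False` to `stats.ttest_ind` selects Welch's t-test, which does not assume the two groups share a variance. The sketch below recomputes the statistic and the two-sided p-value by hand and compares them with SciPy's result. The two arrays are hypothetical values invented for illustration, not taken from the Waze dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical drive counts, invented for illustration (not from the Waze dataset)
group_a = np.array([226, 95, 68, 120, 54, 77], dtype=float)
group_b = np.array([107, 40, 88, 61, 93], dtype=float)

n_a, n_b = len(group_a), len(group_b)
var_a, var_b = group_a.var(ddof=1), group_b.var(ddof=1)  # sample variances

# Welch's t statistic: difference in means over the unpooled standard error
se = np.sqrt(var_a / n_a + var_b / n_b)
t_manual = (group_a.mean() - group_b.mean()) / se

# Welch-Satterthwaite approximation of the degrees of freedom
dof = (var_a / n_a + var_b / n_b) ** 2 / (
    (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
)

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# SciPy's result for comparison
result = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

The manual and SciPy values should agree up to floating-point error, which makes the formula behind the reported `statistic` and `pvalue` explicit.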

**Question:** Based on the p-value you got above, do you reject or fail to reject the null hypothesis?

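==> The p-value is 0.143 (14.3%), which is greater than the 5% significance level, so we fail to reject the null hypothesis. In other words, the data do not provide sufficient evidence of a difference in the mean number of drives between iPhone users and Android users; the observed difference is consistent with chance variation.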
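
The decision rule can be written out as a short sketch, using the 5% significance level chosen earlier and the p-value reported by the t-test output above.

```python
alpha = 0.05                  # significance level chosen in this notebook
p_value = 0.1433519726802059  # p-value reported by the t-test output above

# Reject the null hypothesis only when the p-value falls below alpha
if p_value < alpha:
    decision = 'reject the null hypothesis'
else:
    decision = 'fail to reject the null hypothesis'

print(decision)
```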

## **PACE: Execute**


### **Task 4. Communicate insights with stakeholders**

Now that we've completed our hypothesis test, the next step is to share our findings with the Waze leadership team. Consider the following question as we prepare to write our executive summary:

* What business insight(s) can you draw from the result of your hypothesis test?

==> The p-value (14.3%) is greater than the chosen significance level of 5%, so we fail to reject the null hypothesis. There is no statistically significant difference in the mean number of drives between iPhone users and Android users. Device type (iPhone vs. Android) therefore does not appear to have a significant effect on the number of drives.
