## **Waze Project - A/B Test: Analyzing the Number of Drives Between iPhone and Android Users**

### **Objective:**
The goal of this project was to analyze the difference in the mean number of drives between iPhone and Android users using a two-sample hypothesis test. Understanding how the number of drives varies across platforms could provide valuable insights into user engagement and app performance, and could help improve marketing strategies for different user groups. This project was structured using the PACE (Plan, Analyze, Construct, Execute) methodology.

### **Methodology**
1. **Data Preparation:** Loaded and cleaned the dataset using Pandas.
2. **Descriptive Statistics:** Calculated the mean, median, and standard deviation for both groups (iPhone and Android).
3. **Hypothesis Testing:** Performed a two-sample t-test to determine whether there was a significant difference in the average number of drives between iPhone and Android users.
4. **Significance Level:** Set at α = 0.05.


### **1. Research Question (Plan)**
**Research Question:** Does the average number of drives differ between iPhone and Android users?


### **Importing packages and data loading**

##### To begin the analysis, the necessary Python libraries are imported:

    - Pandas for data manipulation and cleaning
    - Numpy for numerical operations and handling arrays
    - Matplotlib for visualizing and plotting
    - Seaborn for enhanced data visualization
    - Scipy for statistical functions to perform the hypothesis test

In [None]:
# Importing relevant packages and libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats


In [None]:
# Load dataset into dataframe
df = pd.read_csv('waze_dataset.csv')

### **2. Data Exploration (Analyze)**

1. Understanding the data and conducting Exploratory Data Analysis


In [None]:
df.head(5)

In [None]:
print(df.info())
df.describe(include='all')

In [None]:
#checking if there are any misspellings
df['device'].unique()

In [None]:
#checking missing data
print('Device column missing data:', df['device'].isnull().sum())
print('Drives column missing data:', df['drives'].isnull().sum())

##### Creating a new integer variable (device_type) that encodes the device column.

    - iPhone: 1
    - Android: 2
This step converts the device type into a numeric code, which is convenient for grouping and filtering in the analysis that follows. The codes are labels only; their numeric order carries no meaning.

In [None]:
df['device_type'] = df['device'].map({'iPhone' : 1, 'Android': 2})
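
Because `Series.map` returns `NaN` for any label that is missing from the mapping dictionary, a quick check after encoding can confirm that no rows were silently left unencoded. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not the Waze dataset) to show the idea:

```python
import pandas as pd

# Hypothetical toy data standing in for the Waze dataset
toy = pd.DataFrame({'device': ['iPhone', 'Android', 'iPhone', 'android']})

device_codes = {'iPhone': 1, 'Android': 2}
toy['device_type'] = toy['device'].map(device_codes)

# Labels absent from the dictionary (here the lowercase 'android') become NaN
unmapped = toy.loc[toy['device_type'].isna(), 'device'].unique().tolist()
print('Unmapped labels:', unmapped)
```

An empty list would mean every label was encoded; anything else points to a spelling or casing variant that needs cleaning first.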

In [None]:
# Mean number of drives for iPhone (1) and Android (2) users
df.groupby('device_type')['drives'].mean().round(2)

**Based on the results, it appears that drivers who use an iPhone have a higher number of drives on average.**

In [None]:
# other statistics (median)
df.groupby('device_type')['drives'].median().round(2)

In [None]:
#other statistics (standard deviation)
df.groupby('device_type')['drives'].std().round(2)
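
The three summaries above could also be collected into a single table with one `agg` call, which makes the groups easier to compare side by side. This is only a sketch on hypothetical toy values (the `toy` frame is an assumption for illustration; the notebook itself would pass `df`):

```python
import pandas as pd

# Hypothetical toy data; the real notebook would use df instead
toy = pd.DataFrame({
    'device_type': [1, 1, 1, 2, 2, 2],
    'drives': [10, 20, 30, 5, 15, 40],
})

# One row per device type, one column per statistic
summary = (
    toy.groupby('device_type')['drives']
       .agg(['count', 'mean', 'median', 'std'])
       .round(2)
)
print(summary)
```

Including `count` is a small addition here: group sizes are useful context when reading the means and standard deviations.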

#### **Visualizations**
To better understand the distribution of the data, boxplots and histograms were generated.
The boxplot gives us a visual representation of the distribution of the number of drives for each device type. It shows potential outliers and the range of the data. The histogram helps visualize the frequency distribution of the number of drives per device, giving us a sense of how data is spread across the two groups.

In [None]:
# Set 2 figures 
fig, axes = plt.subplots(1, 2, figsize=(8, 4)) 
palette = {1: 'black', 2: '#3DDC84'}

#Boxplot
sns.boxplot(data=df, x='device_type', y='drives', ax=axes[0])
axes[0].set_title('Number of Drives by Device')

#Histogram - Distribution

sns.histplot(data=df, x='drives', hue='device_type', multiple='stack', palette = palette, ax=axes[1])
axes[1].set_title('Distribution of Drives by Device')

plt.tight_layout()
plt.subplots_adjust(wspace=0.4)
                   
plt.show()


### **3. Hypothesis testing (Construct)**

A two-sample t-test is appropriate because we are comparing the means of two independent groups (iPhone vs. Android users). The null hypothesis assumes that the means are the same, while the alternative hypothesis suggests that there is a significant difference in the means.
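
Since the test further down is called with `equal_var=False`, the statistic it computes is Welch's version of the two-sample t-test, which does not assume equal variances. As a reference sketch, with the notation being an assumption of this note (subscript 1 for iPhone, 2 for Android; $\bar{x}$ the sample mean, $s^2$ the sample variance, $n$ the group size):

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
```

Here $\nu$ is the Welch-Satterthwaite approximation to the degrees of freedom used to obtain the p-value.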

**Hypothesis:**

    - Null hypothesis (H0) --> The mean number of drives is the same for iPhone and Android users.
    - Alternative hypothesis (Ha) --> The mean number of drives differs between iPhone and Android users.

**Significance Level (α):** Set to 5%



**Performing the Two-Sample T-Test**

Two subsets were created to isolate the drives of iPhone and Android users. A two-sample t-test was then conducted to check for a significant difference in the mean number of drives. The two groups are independent, and the test is run with equal_var=False, so equal variances are not assumed.

In [None]:
# Isolating the drives columns for iPhone and Android users
iphone_drives = df[df['device_type'] == 1]['drives']
android_drives = df[df['device_type'] == 2]['drives']

In [None]:
# Performing the t-test (equal_var=False: equal variances are not assumed)
ttest, pvalue = stats.ttest_ind(a=iphone_drives, b=android_drives, equal_var=False)

print(f't value: {ttest}')
print(f'p value: {pvalue}')
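
To make clear what `stats.ttest_ind` does when `equal_var=False`, the statistic and p-value can be reproduced by hand. The following is a minimal sketch on randomly generated samples (`group_a`, `group_b`, and their parameters are arbitrary assumptions for illustration, not values from the Waze data):

```python
import numpy as np
from scipy import stats

# Hypothetical samples generated for illustration only (not Waze data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=68, scale=65, size=200)
group_b = rng.normal(loc=66, scale=64, size=150)

# Squared standard error of each group mean (sample variance / n)
var_a = group_a.var(ddof=1) / group_a.size
var_b = group_b.var(ddof=1) / group_b.size

# Welch's t statistic: difference in means over the unpooled standard error
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(var_a + var_b)

# Welch-Satterthwaite degrees of freedom
dof = (var_a + var_b) ** 2 / (
    var_a ** 2 / (group_a.size - 1) + var_b ** 2 / (group_b.size - 1)
)

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# Compare with SciPy's result for the same samples
t_scipy, p_scipy = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```

The manual calculation is not needed in practice; it only shows that the reported t value is the difference in group means scaled by its estimated standard error.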

### **4. Decision and Interpretation (Execute)**


Since the **p-value (0.14)** is greater than the significance level (**α = 0.05**), we **fail to reject the null hypothesis**.
This indicates that **there is not a statistically significant difference** in the average number of drives between iPhone and Android users.

In [None]:
significance_level = 0.05

if pvalue < significance_level:
    print(f'p-value = {pvalue:.3f}: reject the null hypothesis; there is a statistically significant difference in mean drives between devices.')
else:
    print(f'p-value = {pvalue:.3f}: fail to reject the null hypothesis; there is no statistically significant difference in mean drives between devices. The observed difference is consistent with chance.')


### **Business Insights**

 **Key Insight:**  
 The analysis found no statistically significant difference in the mean number of drives between Android and iPhone users, suggesting that user engagement, as measured by drives, may not differ substantially between the two platforms.

 **Next Steps:** Explore what other factors influence the variation in the number of drives and run additional hypothesis tests to learn more about user behavior.

### **Final Thoughts**
This project helped me understand the significance of statistical testing, the application of hypothesis testing in real-world scenarios, and how to use Python libraries for analysis. It serves as a solid foundation for future explorations into user behavior and app performance across different platforms.