# **Waze Project**
**Course 2 - Get Started with Python**

Welcome to the Waze Project!

Your Waze data analytics team is still in the early stages of their user churn project. Previously, you were asked to complete a project proposal by your supervisor, May Santner. You have received notice that your project proposal has been approved and that your team has been given access to Waze's user data. To get clear insights, the user data must be inspected and prepared for the upcoming process of exploratory data analysis (EDA).

A Python notebook has been prepared to guide you through this project. Answer the questions and create an executive summary for the Waze data team.

# **Course 2 End-of-course project: Inspect and analyze data**

In this activity, you will examine data provided and prepare it for analysis. This activity will help ensure the information is:

1.   Ready to answer questions and yield insights

2.   Ready for visualizations

3.   Ready for future hypothesis testing and statistical methods
<br/>

**The purpose** of this project is to investigate and understand the data provided.

**The goal** is to use a dataframe constructed within Python, perform a cursory inspection of the provided dataset, and inform team members of your findings.
<br/>

*This activity has three parts:*

**Part 1:** Understand the situation
* How can you best prepare to understand and organize the provided information?

**Part 2:** Understand the data

* Create a pandas dataframe for data learning, future exploratory data analysis (EDA), and statistical activities

* Compile summary information about the data to inform next steps

**Part 3:** Understand the variables

* Use insights from your examination of the summary data to guide deeper investigation into variables


<br/>

Follow the instructions and answer the following questions to complete the activity. Then, you will complete an Executive Summary using the questions listed on the PACE Strategy Document.

Be sure to complete this activity before moving on. The next course item will provide you with a completed exemplar to compare to your own work.



# **Identify data types and compile summary information**


<img src="images/Pace.png" width="100" height="100" align=left>

# **PACE stages**

Throughout these project notebooks, you'll see references to the problem-solving framework, PACE. The following notebook components are labeled with the respective PACE stages: Plan, Analyze, Construct, and Execute.

<img src="images/Plan.png" width="100" height="100" align=left>


## **PACE: Plan**

Consider the questions in your PACE Strategy Document and those below to craft your response:

### **Task 1. Understand the situation**

*   How can you best prepare to understand and organize the provided driver data?


*Begin by exploring your dataset and consider reviewing the Data Dictionary.*

To best prepare for understanding and organizing the provided **driver data** for the Waze Project, consider the following steps, using the available fields:

#### 1. **Explore the Dataset**

Begin by loading and inspecting the dataset using Python:

* **Load the Data**: Use pandas to load the dataset into a DataFrame.
* **Preview the Data**: Check the first few rows using `.head()` and the dataset's structure with `.info()` to understand the columns and their data types.

#### 2. **Review the Data Dictionary**

From the fields provided, we can infer the following meaning for each:

* **ID**: A unique identifier for each driver or session.
* **label**: The classification label for each user, indicating whether the user churned or was retained (the values in the data are `churned` and `retained`).
* **sessions**: The number of sessions a driver has participated in.
* **drives**: The number of individual driving events or trips for each driver.
* **total\_sessions**: The total number of sessions the driver has completed.
* **n\_days\_after\_onboarding**: The number of days since the driver first used the app or was onboarded.
* **total\_navigations\_fav1**: The total number of navigations using the first favorite route.
* **total\_navigations\_fav2**: The total number of navigations using the second favorite route.
* **driven\_km\_drives**: The total kilometers driven across all drives.
* **duration\_minutes\_drives**: The total duration of the driving events, measured in minutes.
* **activity\_days**: The number of days the driver has been active in the app.
* **driving\_days**: The number of days the driver has actively driven.
* **device**: The device used by the driver (iPhone or Android).

#### 3. **Identify Key Variables**

From the fields above, the following variables seem especially relevant for understanding driver behavior:

* **Driving Metrics**: `drives`, `driven_km_drives`, `duration_minutes_drives`, `total_sessions`, and `driving_days` will provide insights into how much and how often the driver is active. These are key to understanding overall driving activity and usage patterns.
* **Navigation Behavior**: `total_navigations_fav1` and `total_navigations_fav2` indicate how often the driver uses favorite routes, potentially influencing behavior or outcomes.
* **Driver Engagement**: `sessions`, `activity_days`, and `n_days_after_onboarding` can give a sense of how engaged and active the driver is with the app, as well as how quickly they adopt it after onboarding.
* **Device**: The `device` field might be important to consider if device type affects app behavior, performance, or driver engagement.

#### 4. **Data Cleaning and Preprocessing**

* **Missing Data**: Check for missing values and decide whether to fill, drop, or otherwise handle them.
* **Outliers**: Look for extreme values in fields like `driven_km_drives` and `duration_minutes_drives`, as they may skew analysis or model results.
* **Categorical Variables**: The `device` field may require encoding (e.g., one-hot encoding) if it is used in any models.
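
The three checks above could be sketched roughly as follows. The small `sample` frame is invented for illustration (it only borrows column names from the dataset), and the 1.5 × IQR rule is one common outlier heuristic assumed here, not something the project prescribes.

```python
import pandas as pd

# Hypothetical miniature frame; the values are invented for illustration only
sample = pd.DataFrame({
    "label": ["retained", "churned", None, "retained", "retained"],
    "driven_km_drives": [2600.0, 3100.0, 900.0, 4000.0, 21000.0],
    "device": ["Android", "iPhone", "iPhone", "Android", "iPhone"],
})

# Missing data: count nulls per column
missing_per_column = sample.isna().sum()

# Outliers: flag values more than 1.5 * IQR above the third quartile
q1, q3 = sample["driven_km_drives"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
is_outlier = sample["driven_km_drives"] > upper_fence

# Categorical variables: one-hot encode the device column
encoded = pd.get_dummies(sample, columns=["device"])

print(missing_per_column)
print(sample[is_outlier])
print(encoded.columns.tolist())
```

Whether to drop, cap, or keep flagged rows is a judgment call that depends on the later analysis; the sketch only identifies them.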

#### 5. **Formulate Initial Hypotheses**

Based on the data, we can hypothesize relationships and insights:

* **Engagement and Driving Behavior**: Drivers who have been using the app for more days (`n_days_after_onboarding`, `activity_days`) are likely to have higher driving activity (measured by `drives`, `driven_km_drives`).
* **Navigation Preferences**: Drivers who frequently use their favorite routes (`total_navigations_fav1`, `total_navigations_fav2`) might exhibit more predictable driving patterns.
* **Device Impact**: There might be differences in driving behavior or session frequency based on the device type (iPhone vs. Android), suggesting that app performance or driver preferences differ by platform.

---

By thoroughly exploring these fields and reviewing their relationships, we can effectively prepare to **analyze**, **construct**, and **execute** our models or evaluations in later stages of the project.

<img src="images/Analyze.png" width="100" height="100" align=left>

## **PACE: Analyze**

Consider the questions in your PACE Strategy Document to reflect on the Analyze stage.

### **Task 2a. Imports and data loading**

Start by importing the packages that you will need to load and explore the dataset. Make sure to use the following import statements:

*   `import pandas as pd`

*   `import numpy as np`


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np

Then, load the dataset into a dataframe. Creating a dataframe will help you conduct data manipulation, exploratory data analysis (EDA), and statistical activities.

**Note:** As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [2]:
# Load dataset into dataframe
df = pd.read_csv('waze_dataset.csv')

### **Task 2b. Summary information**

View and inspect summary information about the dataframe by **coding the following:**

1.   df.head(10)
2.   df.info()

*Consider the following questions:*

1. When reviewing the `df.head()` output, are there any variables that have missing values?

2. When reviewing the `df.info()` output, what are the data types? How many rows and columns do you have?

3. Does the dataset have any missing values?

In [3]:
df.head(5)

|   | ID | label | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days | device |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | retained | 283 | 226 | 296.748273 | 2276 | 208 | 0 | 2628.845068 | 1985.775061 | 28 | 19 | Android |
| 1 | 1 | retained | 133 | 107 | 326.896596 | 1225 | 19 | 64 | 13715.92055 | 3160.472914 | 13 | 11 | iPhone |
| 2 | 2 | retained | 114 | 95 | 135.522926 | 2651 | 0 | 0 | 3059.148818 | 1610.735904 | 14 | 8 | Android |
| 3 | 3 | retained | 49 | 40 | 67.589221 | 15 | 322 | 7 | 913.591123 | 587.196542 | 7 | 3 | iPhone |
| 4 | 4 | retained | 84 | 68 | 168.24702 | 1562 | 166 | 5 | 3950.202008 | 1219.555924 | 27 | 18 | Android |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       14999 non-null  int64  
 1   label                    14299 non-null  object 
 2   sessions                 14999 non-null  int64  
 3   drives                   14999 non-null  int64  
 4   total_sessions           14999 non-null  float64
 5   n_days_after_onboarding  14999 non-null  int64  
 6   total_navigations_fav1   14999 non-null  int64  
 7   total_navigations_fav2   14999 non-null  int64  
 8   driven_km_drives         14999 non-null  float64
 9   duration_minutes_drives  14999 non-null  float64
 10  activity_days            14999 non-null  int64  
 11  driving_days             14999 non-null  int64  
 12  device                   14999 non-null  object 
dtypes: float64(3), int64(8), object(2)
memory usage: 1.5+ MB


**1. When reviewing the `df.head()` output, are there any variables that have missing values?**
From the first 5 rows shown in `df.head()`, there are **no immediately obvious missing values**. However, this only gives a partial view and may not reflect missing data elsewhere in the dataset.

---

**2. When reviewing the `df.info()` output, what are the data types? How many rows and columns do you have?**

* The dataset contains **14,999 rows** and **13 columns**.
* Data types:

  * **`int64`** for 8 columns: numerical counts like `sessions`, `drives`, etc.
  * **`float64`** for 3 columns: continuous variables like `total_sessions`, `driven_km_drives`, and `duration_minutes_drives`.
  * **`object`** for 2 columns: `label` and `device`, which are categorical (text-based) features.

---

**3. Does the dataset have any missing values?**
Yes — the `label` column has **14,299 non-null values**, meaning **700 values are missing**.
All other columns are **complete** (14,999 non-null values), so the only column with missing data is `label`.

### **Task 2c. Null values and summary statistics**

Compare the summary statistics of the 700 rows that are missing labels with summary statistics of the rows that are not missing any values.

**Question:** Is there a discernible difference between the two populations?


In [5]:
# Isolate rows with null values in the 'label' column
df_missing_labels = df[df['label'].isnull()]

# Isolate rows without missing values
df_complete = df[df['label'].notnull()]

# Display summary statistics of rows with null labels
print("Summary Statistics: Rows with Missing Labels")
print(df_missing_labels.describe())

# Display summary statistics of rows with complete data
print("\nSummary Statistics: Rows with Complete Data")
print(df_complete.describe())

Summary Statistics: Rows with Missing Labels
                 ID    sessions      drives  total_sessions  \
count    700.000000  700.000000  700.000000      700.000000   
mean    7405.584286   80.837143   67.798571      198.483348   
std     4306.900234   79.987440   65.271926      140.561715   
min       77.000000    0.000000    0.000000        5.582648   
25%     3744.500000   23.000000   20.000000       94.056340   
50%     7443.000000   56.000000   47.500000      177.255925   
75%    11007.000000  112.250000   94.000000      266.058022   
max    14993.000000  556.000000  445.000000     1076.879741   

       n_days_after_onboarding  total_navigations_fav1  \
count               700.000000              700.000000   
mean               1709.295714              118.717143   
std                1005.306562              156.308140   
min                  16.000000                0.000000   
25%                 869.000000                4.000000   
50%                1650.500000         

In [6]:
# Isolate rows without null values in the 'label' column
df_complete = df[df['label'].notnull()]

# Display summary statistics of rows without null values
print("Summary Statistics: Rows WITHOUT Missing Labels")
print(df_complete.describe())

Summary Statistics: Rows WITHOUT Missing Labels
                 ID      sessions        drives  total_sessions  \
count  14299.000000  14299.000000  14299.000000    14299.000000   
mean    7503.573117     80.623820     67.255822      189.547409   
std     4331.207621     80.736502     65.947295      136.189764   
min        0.000000      0.000000      0.000000        0.220211   
25%     3749.500000     23.000000     20.000000       90.457733   
50%     7504.000000     56.000000     48.000000      158.718571   
75%    11257.500000    111.000000     93.000000      253.540450   
max    14998.000000    743.000000    596.000000     1216.154633   

       n_days_after_onboarding  total_navigations_fav1  \
count             14299.000000            14299.000000   
mean               1751.822505              121.747395   
std                1008.663834              147.713428   
min                   4.000000                0.000000   
25%                 878.500000               10.000000   


**Do the rows with missing labels differ from the rest of the dataset?**

Overall, **the rows with missing labels are quite similar** to those with labels across most features, but there are a few subtle differences worth noting:

### Key Comparisons:

| Feature                       | Missing Labels (700 rows) | Complete Labels (14,299 rows) | Notable Difference?                    |
| ----------------------------- | ------------------------- | ----------------------------- | -------------------------------------- |
| **sessions**                  | Mean = 80.84              | Mean = 80.62                  | Almost identical                       |
| **drives**                    | Mean = 67.80              | Mean = 67.26                  | Almost identical                       |
| **total\_sessions**           | Mean = 198.48             | Mean = 189.55                 | Slightly higher in missing-label group |
| **driven\_km\_drives**        | Mean = 3935.97 km         | Mean = 4044.40 km             | Slightly lower in missing-label group  |
| **duration\_minutes\_drives** | Mean = 1795.12 min        | Mean = 1864.20 min            | Slightly lower                         |
| **total\_navigations\_fav1**  | Mean = 118.72             | Mean = 121.75                 | Very similar                           |
| **total\_navigations\_fav2**  | Mean = 30.37              | Mean = 29.64                  | Very similar                           |
| **activity\_days**            | Mean = 15.38              | Mean = 15.54                  | Almost identical                       |

---
* The rows with missing `label` values show **no major anomalies** or drastic behavioral differences compared to labeled rows.
* These users appear **active and engaged**, so their missing labels could be due to **data collection issues** rather than inactivity or outliers.
* Since the missing group behaves similarly to the labeled group, they could potentially be used for **unsupervised learning** or **label prediction** later on.
* For a supervised model, you would likely **exclude** these rows or impute the labels if appropriate.
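
One way to line the two populations up in a single table is to group on a "label is missing" flag instead of printing two separate `describe()` outputs. The sketch below uses a small invented frame (`toy`) rather than `waze_dataset.csv`, so its numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the real dataframe; the values are invented
toy = pd.DataFrame({
    "label": ["retained", None, "churned", None, "retained", "retained"],
    "sessions": [80, 75, 120, 85, 40, 60],
    "drives": [60, 55, 100, 65, 30, 50],
})

# Readable group key: does this row have a label or not?
group_key = toy["label"].isna().map({True: "missing label", False: "has label"})

# Compare means and medians of the numeric columns across the two groups
comparison = toy.drop(columns="label").groupby(group_key).agg(["mean", "median"])
print(comparison)
```

Applied to the real dataframe, the same pattern would put each feature's statistics for both groups in adjacent rows, which makes differences easier to spot than scanning two printed summaries.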

### **Task 2d. Null values - device counts**

Next, check the two populations with respect to the `device` variable.

**Question:** How many iPhone users had null values and how many Android users had null values?

In [7]:
# Get device counts for rows with missing labels
print("Device Counts (Missing Labels):")
print(df_missing_labels['device'].value_counts())

# Get device counts for rows without missing labels
print("\nDevice Counts (Complete Data):")
print(df_complete['device'].value_counts())

Device Counts (Missing Labels):
device
iPhone     447
Android    253
Name: count, dtype: int64

Device Counts (Complete Data):
device
iPhone     9225
Android    5074
Name: count, dtype: int64


Among the 700 rows with missing labels:

* iPhone users: 447
* Android users: 253

This means that iPhone users make up a larger share of the missing label population, accounting for approximately 64% of the missing data, compared to 36% from Android users.

Now, of the rows with null values, calculate the percentage with each device&mdash;Android and iPhone. You can do this directly with the [`value_counts()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) function.

In [8]:
# Calculate percentage of iPhone and Android users among rows with missing labels
device_null_percentages = df_missing_labels['device'].value_counts(normalize=True) * 100

# Display the percentages
print("Percentage of Devices Among Rows with Missing Labels:")
print(device_null_percentages)

Percentage of Devices Among Rows with Missing Labels:
device
iPhone     63.857143
Android    36.142857
Name: proportion, dtype: float64


How does this compare to the device ratio in the full dataset?

| Device  | % in Missing Labels | % in Full Dataset |
| ------- | ------------------- | ----------------- |
| iPhone  | **63.86%**          | **64.48%**        |
| Android | **36.14%**          | **35.52%**        |

The device distribution in rows with missing labels is very similar to the overall device distribution in the full dataset. This suggests that missing labels are not strongly biased by device type, and device likely doesn't influence label missingness in a significant way.

In [9]:
# Calculate percentage of each device in the full dataset
device_percent_full = df['device'].value_counts(normalize=True) * 100

# Display results
print("Percentage of Devices in Full Dataset:")
print(device_percent_full)

Percentage of Devices in Full Dataset:
device
iPhone     64.484299
Android    35.515701
Name: proportion, dtype: float64


The percentage of missing values by each device is consistent with their representation in the data overall.

There is nothing to suggest a non-random cause of the missing data.
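
As a rough check on that comparison, the device shares can be placed side by side. This sketch rebuilds the counts from the `value_counts()` outputs printed earlier in this notebook instead of reading the CSV; the column names in `shares` are made up for the example.

```python
import pandas as pd

# Device counts copied from the value_counts() outputs in this notebook
missing = pd.Series({"iPhone": 447, "Android": 253})
complete = pd.Series({"iPhone": 9225, "Android": 5074})
full = missing + complete  # every row, labeled or not

shares = pd.DataFrame({
    "missing_label_pct": 100 * missing / missing.sum(),
    "full_dataset_pct": 100 * full / full.sum(),
}).round(2)

# Gap between the two shares, in percentage points
shares["difference_pts"] = (
    shares["missing_label_pct"] - shares["full_dataset_pct"]
).round(2)
print(shares)
```

A gap of well under one percentage point is in line with the reading that label missingness is unrelated to device, although a formal test would be needed to say so with statistical confidence.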

Examine the counts and percentages of users who churned vs. those who were retained. How many of each group are represented in the data?

In [10]:
# Calculate counts of churned vs. retained users
churn_counts = df['label'].value_counts()

# Display the result
print("Counts of Churned vs. Retained Users:")
print(churn_counts)

Counts of Churned vs. Retained Users:
label
retained    11763
churned      2536
Name: count, dtype: int64


This dataset contains 82% retained users and 18% churned users.
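
The percentages can be derived from the counts above. The sketch rebuilds the counts as a small Series so it runs on its own; on the real dataframe, `df['label'].value_counts(normalize=True)` would give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
churn_counts = pd.Series({"retained": 11763, "churned": 2536}, name="count")

# Share of each class among the labeled rows, in percent
churn_pct = (100 * churn_counts / churn_counts.sum()).round(2)
print(churn_pct)
```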

Next, compare the medians of each variable for churned and retained users. The reason for calculating the median and not the mean is that you don't want outliers to unduly affect the portrayal of a typical user. Notice, for example, that the maximum value in the `driven_km_drives` column is 21,183 km. That's more than half the circumference of the earth!
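
To see why the median is the safer summary here, consider a tiny invented sample in which one value is as extreme as the maximum mentioned above. The other four distances are made up for illustration.

```python
import numpy as np

# Invented monthly distances (km); the last value mimics an extreme outlier
km = np.array([3000, 3200, 3400, 3600, 21183])

mean_km = km.mean()
median_km = np.median(km)
print("mean:  ", mean_km)
print("median:", median_km)
```

A single extreme driver pulls the mean far above what any of the other four users drove, while the median stays at the middle value.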

In [11]:
# Calculate median values of all columns for churned and retained users
df.groupby('label').median(numeric_only=True)

| label | ID | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| churned | 7477.5 | 59.0 | 50.0 | 164.339042 | 1321.0 | 84.5 | 11.0 | 3652.655666 | 1607.183785 | 8.0 | 6.0 |
| retained | 7509.0 | 56.0 | 47.0 | 157.586756 | 1843.0 | 68.0 | 9.0 | 3464.684614 | 1458.046141 | 17.0 | 14.0 |


This offers an interesting snapshot of the two groups, churned vs. retained:

The median churned user had ~3 more drives in the last month than the median retained user, but retained users used the app on over twice as many days as churned users in the same time period.

The median churned user drove ~200 more kilometers and 2.5 more hours during the last month than the median retained user.

It seems that churned users had more drives in fewer days, and their trips were farther and longer in duration. Perhaps this is suggestive of a user profile. Continue exploring!

Calculate the median kilometers per drive in the last month for both retained and churned users.

Begin by dividing the `driven_km_drives` column by the `drives` column. Then, group the results by churned/retained and calculate the median km/drive of each group.

In [12]:
# Avoid division by zero
df['km_per_drive'] = df['driven_km_drives'] / df['drives'].replace(0, np.nan)

# Group by 'label' and calculate median km per drive
median_km_per_drive = df.groupby('label')['km_per_drive'].median()

# Display results
median_km_per_drive

label
churned     73.491807
retained    74.051037
Name: km_per_drive, dtype: float64
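
The `.replace(0, np.nan)` step in the cell above matters because pandas does not raise an error when dividing by zero; it returns `inf`, which then takes part in the median. A minimal sketch with an invented three-row frame shows the difference.

```python
import numpy as np
import pandas as pd

# Invented frame: the first user has kilometers recorded but zero drives
toy = pd.DataFrame({"driven_km_drives": [10.0, 20.0, 90.0], "drives": [0, 2, 3]})

# Naive division: the zero denominator silently produces inf
raw = toy["driven_km_drives"] / toy["drives"]

# Guarded division: zeros become NaN, which median() skips by default
safe = toy["driven_km_drives"] / toy["drives"].replace(0, np.nan)

print(raw.median(), safe.median())
```

With the naive version the `inf` counts as the largest value and shifts the median upward; with the guarded version that row is left out of the calculation.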

The median retained user drove about half a kilometer more per drive than the median churned user. How many kilometers per driving day was this?

To calculate this statistic, repeat the steps above using `driving_days` instead of `drives`.

In [13]:
# Add a column for kilometers per driving day (avoiding division by zero)
df['km_per_driving_day'] = df['driven_km_drives'] / df['driving_days'].replace(0, np.nan)

# Group by 'label' and calculate the median kilometers per driving day
median_km_per_day = df.groupby('label')['km_per_driving_day'].median()

# Display the result
print("Median kilometers per driving day by user type:")
print(median_km_per_day)

Median kilometers per driving day by user type:
label
churned     523.086749
retained    272.628549
Name: km_per_driving_day, dtype: float64


Now, calculate the median number of drives per driving day for each group.

In [14]:
# Add a column for drives per driving day (avoid division by zero)
df['drives_per_driving_day'] = df['drives'] / df['driving_days'].replace(0, np.nan)

# Group by 'label' and calculate the median of drives per driving day
median_drives_per_day = df.groupby('label')['drives_per_driving_day'].median()

# Display the result
print("Median drives per driving day by user type:")
print(median_drives_per_day)

Median drives per driving day by user type:
label
churned     7.454545
retained    3.750000
Name: drives_per_driving_day, dtype: float64


The median user who churned drove 523 kilometers each day they drove last month, which is roughly 192% of the per-drive-day distance of retained users (273 km). The median churned user had a similarly disproportionate number of drives per drive day compared to retained users.
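
The comparison in the paragraph above comes down to two ratios of medians. A quick arithmetic check, using the median values copied from the two preceding outputs, might look like this:

```python
# Medians copied from the two outputs above
km_per_day = {"churned": 523.086749, "retained": 272.628549}
drives_per_day = {"churned": 7.454545, "retained": 3.750000}

# Ratio of the churned median to the retained median for each metric
km_ratio = km_per_day["churned"] / km_per_day["retained"]
drives_ratio = drives_per_day["churned"] / drives_per_day["retained"]

print(f"km per driving day:     {km_ratio:.0%} of the retained median")
print(f"drives per driving day: {drives_ratio:.0%} of the retained median")
```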

It is clear from these figures that, regardless of whether a user churned or not, the users represented in this data are serious drivers! It would probably be safe to assume that this data does not represent typical drivers at large. Perhaps the data&mdash;and in particular the sample of churned users&mdash;contains a high proportion of long-haul truckers.

In consideration of how much these users drive, it would be worthwhile to recommend to Waze that they gather more data on these super-drivers. It's possible that the reason for their driving so much is also the reason why the Waze app does not meet their specific set of needs, which may differ from the needs of a more typical driver, such as a commuter.

Finally, examine whether there is an imbalance in how many users churned by device type.

Begin by getting the overall counts of each device type for each group, churned and retained.

In [15]:
# Count of users by label and device
device_counts_by_label = df.groupby(['label', 'device']).size()

# Display the results
print("Device Counts by User Type:")
print(device_counts_by_label)

Device Counts by User Type:
label     device 
churned   Android     891
          iPhone     1645
retained  Android    4183
          iPhone     7580
dtype: int64


Now, within each group, churned and retained, calculate what percent was Android and what percent was iPhone.

In [16]:
# Count of users by label and device
device_counts = df.groupby(['label', 'device']).size()

# Convert counts to percentages within each label group
device_percentages = device_counts.groupby(level=0).apply(lambda x: 100 * x / x.sum())

# Display the result
print("Percentage of Device Types by User Type:")
print(device_percentages)

Percentage of Device Types by User Type:
label     label     device 
churned   churned   Android    35.134069
                    iPhone     64.865931
retained  retained  Android    35.560656
                    iPhone     64.439344
dtype: float64


The ratio of iPhone users and Android users is consistent between the churned group and the retained group, and those ratios are both consistent with the ratio found in the overall dataset.
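
The duplicated `label` level in the output above is a side effect of calling `.apply` on a grouped Series. An alternative that avoids it is `pd.crosstab` with `normalize='index'`. The sketch below rebuilds label/device pairs from the counts printed earlier so that it runs without the CSV; on the real dataframe the inputs would be `df['label']` and `df['device']`.

```python
import pandas as pd

# Rebuild one (label, device) row per user from the counts printed above
counts = {
    ("churned", "Android"): 891,
    ("churned", "iPhone"): 1645,
    ("retained", "Android"): 4183,
    ("retained", "iPhone"): 7580,
}
rows = [pair for pair, n in counts.items() for _ in range(n)]
pairs = pd.DataFrame(rows, columns=["label", "device"])

# Row-normalized cross-tabulation: each label's row sums to 100%
device_pct = (
    100 * pd.crosstab(pairs["label"], pairs["device"], normalize="index")
).round(2)
print(device_pct)
```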

<img src="images/Construct.png" width="100" height="100" align=left>

## **PACE: Construct**

**Note**: The Construct stage does not apply to this workflow. The PACE framework can be adapted to fit the specific requirements of any project.



<img src="images/Execute.png" width="100" height="100" align=left>

## **PACE: Execute**

Consider the questions in your PACE Strategy Document and those below to craft your response:

### **Task 3. Conclusion**

Recall that your supervisor, May Santner, asked you to share your findings with the data team in an executive summary. Consider the following questions as you prepare to write your summary. Think about key points you may want to share with the team, and what information is most relevant to the user churn project.

**Questions:**

1. Did the data contain any missing values? How many, and which variables were affected? Was there a pattern to the missing data?

Yes, the dataset contained 700 rows with missing values in the label column, which indicates whether a user churned or was retained. No other columns appeared to have missing values.
No meaningful pattern was identified: 63.9% of the users with missing labels were iPhone users and 36.1% were Android users. This distribution is consistent with the device ratio in the full dataset (64.5% iPhone, 35.5% Android), so there is no indication of device-based bias in the missing labels.

2. What is a benefit of using the median value of a sample instead of the mean?

The median is less sensitive to extreme outliers than the mean. This makes it a more accurate representation of a "typical" user in datasets where values like driven_km_drives can be skewed by a small number of users driving extremely long distances (e.g., over 21,000 km in a month).

3. Did your investigation give rise to further questions that you would like to explore or ask the Waze team about?

Yes. The data suggest that users who churned were extremely active drivers, with high kilometers and drives per driving day. This raises questions such as:

* Are these users professional drivers (e.g., long-haul truckers)?
* If so, does the Waze app lack features tailored to their needs?
* Could customizing the app experience for super-drivers reduce churn?

4. What percentage of the users in the dataset were Android users and what percentage were iPhone users?

* Android users: 35.5%
* iPhone users: 64.5%

5. What were some distinguishing characteristics of users who churned vs. users who were retained?

Compared to retained users, churned users:

* Had more drives per month (median of 50 vs. 47)
* Drove ~200 km more and spent ~2.5 more hours driving per month
* Had fewer driving days (median of 6 vs. 14), but took more drives per driving day (7.45 vs. 3.75)
* Drove longer distances per driving day (median of 523 km vs. 273 km)

These users were intense app users over fewer days, which may suggest a distinct segment.

6. Was there an appreciable difference in churn rate between iPhone users vs. Android users?

No. The percentage of iPhone and Android users within both the churned and retained groups was nearly identical:

* Churned: 64.9% iPhone, 35.1% Android
* Retained: 64.4% iPhone, 35.6% Android

This suggests that device type did not significantly affect churn behavior in this dataset.
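
The question asks about churn rate by device, which is a slightly different quantity from the device mix within each label group. As a sketch, the rate can be computed from the "Device Counts by User Type" output; the counts below are copied from that output and cover labeled rows only.

```python
# Counts copied from the "Device Counts by User Type" output
churned = {"Android": 891, "iPhone": 1645}
retained = {"Android": 4183, "iPhone": 7580}

# Churn rate = churned users / all labeled users on that device
churn_rate = {
    device: churned[device] / (churned[device] + retained[device])
    for device in churned
}

for device, rate in churn_rate.items():
    print(f"{device}: {rate:.1%}")
```

Both rates come out within a fraction of a percentage point of each other, which agrees with the conclusion above.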