**Skills demonstrated**

* Ability to use Python to import, inspect, and organize data.
* Ability to organize and communicate key information.

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Loading dataset into dataframe
df = pd.read_csv('/kaggle/input/waze-dataset-to-predict-user-churn/waze_dataset.csv')

In [None]:
df.head(10)

In [None]:
df.info()

**OBSERVATION:** The dataset has 700 missing values in the `label` column.

In [None]:
# Isolating rows with null values
null_df = df[df['label'].isnull()]

# Display summary stats of rows with null values
null_df.describe()

In [None]:
# Isolating rows without null values
# (stored under a separate name so `null_df` keeps holding the rows with null labels)
not_null_df = df[~df['label'].isnull()]

# Displaying summary stats of rows without null values
not_null_df.describe()

**OBSERVATION:** There is no discernible difference between the summary statistics of the two populations.
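Reading two separate `describe()` tables makes the comparison harder than it needs to be. As a minimal sketch, the two populations can be placed side by side by grouping on a missingness flag. The `toy` frame and the `label_status` name are hypothetical, and the values are invented for illustration, not taken from the Waze data.

```python
import pandas as pd

# Toy stand-in for the real dataframe; values are invented for illustration only
toy = pd.DataFrame({
    'label': ['retained', None, 'churned', 'retained', None, 'retained'],
    'drives': [10, 12, 30, 8, 9, 11],
    'driving_days': [15, 14, 5, 20, 16, 18],
})

# Flag each row by whether its label is missing
label_status = (
    toy['label'].isnull()
    .map({True: 'label missing', False: 'label present'})
    .rename('label_status')
)

# Medians of the numeric columns for each population, one column per population
comparison = toy.groupby(label_status).median(numeric_only=True).T
print(comparison)
```

On the real data, replacing `toy` with `df` should give one column per population, which makes any large gap in a median easy to spot.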

In [None]:
# Getting count of null values by device
null_df['device'].value_counts()

In [None]:
# Calculating % of iPhone nulls and Android nulls
null_df['device'].value_counts(normalize=True)

In [None]:
# Calculating % of iPhone users and Android users in full dataset
df['device'].value_counts(normalize=True)

The percentage of missing values for each device is consistent with that device's representation in the data overall.

There is nothing to suggest a non-random cause of the missing data.
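The same check can be condensed into a single table. The sketch below uses `pd.crosstab` with `normalize='index'` so that each row (label missing or present) shows its device shares. The `toy` frame is a hypothetical stand-in with invented values; this is only one possible way to lay out the comparison, not a formal test of randomness.

```python
import pandas as pd

# Toy stand-in for the real dataframe; values are invented for illustration only
toy = pd.DataFrame({
    'label': ['retained', None, 'churned', 'retained', None, 'retained', 'churned', None],
    'device': ['iPhone', 'iPhone', 'Android', 'iPhone', 'Android', 'Android', 'iPhone', 'iPhone'],
})

# Rows: whether the label is missing; columns: device share within each row
device_share = pd.crosstab(
    toy['label'].isnull().map({True: 'missing', False: 'present'}),
    toy['device'],
    normalize='index',
)
print(device_share)
```

If the missing and present rows show similar shares on the real data, that supports the reading that device does not drive the missingness.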

In [None]:
# Calculating counts and proportions of churned vs. retained
# (printing both, since a cell only displays its last expression)
print(df['label'].value_counts())
print(df['label'].value_counts(normalize=True))


In [None]:
# Calculating median values of all columns for churned and retained users
df.groupby('label').median(numeric_only=True)

This offers an interesting snapshot of the two groups, churned vs. retained:

The median churned user had ~3 more drives in the last month than the median retained user, but the median retained user used the app on over twice as many days as the median churned user in the same time period.

The median churned user drove ~200 more kilometers and 2.5 more hours during the last month than the median retained user.

It seems that churned users had more drives in fewer days, and their trips were farther and longer in duration. Perhaps this is suggestive of a user profile. Continue exploring!

In [None]:
# Grouping data by `label` and calculating the medians
mediansByLabel = df.groupby('label').median(numeric_only=True)
print("Median km per drive")

# Dividing the median distance by median number of drives
mediansByLabel['driven_km_drives'] / mediansByLabel['drives']

In [None]:
# Dividing the median distance by median number of driving days
print("Median km per driving day")
mediansByLabel['driven_km_drives'] / mediansByLabel['driving_days']

In [None]:
# Dividing the median number of drives by median number of driving days
print("Median number of drives per driving day")
mediansByLabel['drives'] / mediansByLabel['driving_days']

The median user who churned drove 608 kilometers on each day they drove last month, which is almost 250% of the per-driving-day distance of retained users. The median churned user had a similarly disproportionate number of drives per driving day compared to retained users.
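The figures above are ratios of group medians, which is not the same as the median of each user's own ratio. As a hedged cross-check, the sketch below computes km per driving day per user first and then takes the median by label. It assumes that users with zero driving days should be treated as missing rather than infinite; the `toy` frame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataframe; values are invented for illustration only
toy = pd.DataFrame({
    'label': ['churned', 'churned', 'churned', 'retained', 'retained', 'retained'],
    'driven_km_drives': [3000.0, 4000.0, 500.0, 3000.0, 2400.0, 900.0],
    'driving_days': [5, 8, 0, 15, 12, 9],
})

# Per-user km per driving day; zero driving days would give inf, so treat it as missing
km_per_day = (toy['driven_km_drives'] / toy['driving_days']).replace([np.inf, -np.inf], np.nan)

# Median of the per-user ratios for each label (NaN values are skipped by default)
median_km_per_day = km_per_day.groupby(toy['label']).median()
print(median_km_per_day)
```

The two approaches can disagree when a group contains many users with very few driving days, so it is worth checking whether the conclusion holds under both.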

It is clear from these figures that, regardless of whether a user churned or not, the users represented in this data are serious drivers! It would probably be safe to assume that this data does not represent typical drivers at large. Perhaps the data, and in particular the sample of churned users, contains a high proportion of long-haul truckers.

In consideration of how much these users drive, it would be worthwhile to recommend to Waze that they gather more data on these super-drivers. It's possible that the reason for their driving so much is also the reason why the Waze app does not meet their specific set of needs, which may differ from the needs of a more typical driver, such as a commuter.

In [None]:
# For each label, calculating the number of Android users and iPhone users
df.groupby('label')['device'].value_counts()

In [None]:
# For each label, calculating the percentage of Android users and iPhone users
df.groupby('label')['device'].value_counts(normalize=True)

The ratio of iPhone users to Android users is consistent between the churned group and the retained group, and both ratios are consistent with the ratio found in the overall dataset.