# **Waze User Churn: Exploratory Data Analysis**

I will conduct exploratory data analysis (EDA) on the Waze user data for the churn project. I'll also create visualizations for an executive summary to help non-technical stakeholders engage with the data.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('/kaggle/input/waze-dataset-to-predict-user-churn/waze_dataset.csv')

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

#### **`sessions`**

_The number of times a user opened the app during the month_

In [None]:
# Box plot
plt.figure(figsize=(5, 1))
sns.boxplot(x=df["sessions"], fliersize=1)
plt.title("Sessions box plot")

In [None]:
# Histogram
plt.figure(figsize=(5, 3))
sns.histplot(x=df["sessions"])
median = df["sessions"].median()
plt.axvline(median, color='red', linestyle='--')
# Build the annotation from the computed median instead of hard-coding it
plt.text(75, 1200, f'median={median}', color='red')
plt.title('Sessions histogram')

The `sessions` variable is right-skewed, with half of the observations having 56 or fewer sessions. However, some users have more than 700.
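
As a numeric complement to the plots, right skew can be checked by comparing the mean with the median and by looking at the sample skewness. The sketch below is only an illustration: `skew_summary` is a hypothetical helper and the values in `toy_sessions` are made up, not taken from the Waze data.

```python
import pandas as pd

def skew_summary(values: pd.Series) -> dict:
    """Return the median, mean, and sample skewness of a numeric Series."""
    return {
        "median": float(values.median()),
        "mean": float(values.mean()),
        "skew": float(values.skew()),
    }

# Toy data (made up for illustration): mostly small values plus one large one.
toy_sessions = pd.Series([10, 20, 30, 40, 50, 60, 700])
summary = skew_summary(toy_sessions)
print(summary)
```

A mean well above the median, together with a positive skewness, is consistent with a long right tail. On the real data this would be called as `skew_summary(df["sessions"])`.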

#### **`drives`**

_The number of times a user drove at least 1 km during the month_

In [None]:
# Box plot
plt.figure(figsize=(5, 1))
sns.boxplot(x=df["drives"], fliersize=1)
plt.title("Drives box plot")

In [None]:
# Histogram
plt.figure(figsize=(5, 3))
sns.histplot(x=df["drives"])
median = df["drives"].median()
plt.axvline(median, color='red', linestyle='--')
# Build the annotation from the computed median instead of hard-coding it
plt.text(75, 1000, f'median={median}', color='red')
plt.title('Drives histogram')

`drives` follows a similar distribution, approximately log-normal, with a median of 48. However, some drivers had over 400 drives in the last month.

#### **`total_sessions`**

_A model estimate of the total number of sessions since a user has onboarded_

In [None]:
# Helper functions
def histogrammer(column, median_text=True, **kwargs):
    # Round to one decimal so medians below 1 (e.g., ratios) are not rounded to 0
    median = round(df[column].median(), 1)
    plt.figure(figsize=(5, 3))
    ax = sns.histplot(x=df[column], **kwargs)
    plt.axvline(median, color='red', linestyle='--')
    if median_text:  # Annotate the median on the plot unless set to False
        ax.text(0.25, 0.85, f'median={median}', color='red',
                ha='left', va='top', transform=ax.transAxes)
    else:
        print('Median:', median)
    plt.title(f'{column} histogram')

def boxplot(column):
    plt.figure(figsize=(5, 1))
    sns.boxplot(x=df[column], fliersize=1)
    plt.title(f"{column} box plot")

In [None]:
# Box plot
boxplot('total_sessions')

In [None]:
# Histogram
histogrammer('total_sessions')

The `total_sessions` variable is also right-skewed, with a median of 159.6. This is interesting: if the median number of sessions in the last month was 56 and the median of total sessions was ~160, then a large share (roughly a third, judging by the ratio of the two medians) of a user's estimated total sessions might have taken place in the last month.

#### **`n_days_after_onboarding`**

_The number of days since a user signed up for the app_

In [None]:
# Box plot
boxplot('n_days_after_onboarding')

In [None]:
# Histogram
histogrammer('n_days_after_onboarding')

The total user tenure is a uniform distribution with values ranging from near-zero to \~3,500 (\~9.5 years).

#### **`driven_km_drives`**

_Total kilometers driven during the month_

In [None]:
# Box plot
boxplot('driven_km_drives')

In [None]:
# Histogram
histogrammer('driven_km_drives')

The distance driven in the last month per user is right-skewed, with half the users driving under 3,495 kilometers.

#### **`duration_minutes_drives`**

_Total duration driven in minutes during the month_

In [None]:
# Box plot
boxplot('duration_minutes_drives')

In [None]:
# Histogram
histogrammer('duration_minutes_drives')

The `duration_minutes_drives` variable has a heavily skewed right tail. Half of the users drove less than \~1,478 minutes (\~25 hours), but some users clocked over 250 hours over the month.

#### **`activity_days`**

_Number of days the user opens the app during the month_

In [None]:
# Box plot
boxplot('activity_days')

In [None]:
# Histogram
histogrammer('activity_days')

Within the last month, users opened the app on a median of 16 days. The box plot reveals a centered distribution. The histogram shows a nearly uniform distribution, with ~500 people at each count of days. However, there are ~250 people who didn't open the app at all and ~250 people who opened the app every day of the month.

This distribution is noteworthy because it does not mirror the `sessions` distribution, which I would think would be closely correlated with `activity_days`.

#### **`driving_days`**

_Number of days the user drives (at least 1 km) during the month_

In [None]:
# Box plot
boxplot('driving_days')

In [None]:
# Histogram
histogrammer('driving_days')

The number of days users drove each month is almost uniform, and it largely correlates with the number of days they opened the app that month, except the `driving_days` distribution tails off on the right.

However, almost twice as many users did not drive at all during the month (\~1,000) as fell into most of the other day counts (\~550). This might seem counterintuitive when considered together with the information from `activity_days`. That variable had \~500 users opening the app on each of most of the day counts, but only \~250 users who did not open the app at all during the month and \~250 users who opened the app every day.
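
The zero-day comparison can also be made directly from counts rather than read off the histograms. This is a minimal sketch on a made-up `toy` frame that only borrows the two column names; the numbers are not from the Waze data.

```python
import pandas as pd

# Toy rows, made up for illustration only.
toy = pd.DataFrame({
    "activity_days": [0, 3, 5, 5, 12, 31],
    "driving_days": [0, 0, 2, 5, 10, 30],
})

# Count, per column, how many users have a value of exactly zero days.
zero_counts = (toy[["activity_days", "driving_days"]] == 0).sum()
print(zero_counts)
```

On the real data, the same expression applied to `df` would give the number of users with no activity days and no driving days.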

#### **`device`**

_The type of device a user starts a session with_

In [None]:
# Pie chart
fig = plt.figure(figsize=(3, 3))
data = df['device'].value_counts()
plt.pie(data,
        labels=[
            f'{data.index[0]} : {data.values[0]}',
            f'{data.index[1]} : {data.values[1]}'
        ],
        autopct='%1.1f%%')
plt.title('Users by device')

There are nearly twice as many iPhone users as Android users represented in this data.

#### **`label`**

_Binary target variable (“retained” vs. “churned”) indicating whether a user churned at any time during the month_

In [None]:
# Pie chart
fig = plt.figure(figsize=(3, 3))
data = df['label'].value_counts()
plt.pie(data,
        labels=[
            f'{data.index[0]} : {data.values[0]}',
            f'{data.index[1]} : {data.values[1]}'
        ],
        autopct='%1.1f%%')
plt.title('Count of retained vs. churned')

Less than 18% of the users churned.

#### **`driving_days` vs. `activity_days`**


In [None]:
# Histogram
plt.figure(figsize=(12, 4))
label=['driving days', 'activity days']
plt.hist([df['driving_days'], df['activity_days']],
        bins=range(0, 33),
        label=label)
plt.xlabel('days')
plt.ylabel('counts')
plt.legend()
plt.title('driving_days vs. activity_days')

This is counterintuitive. Why are there _fewer_ people who didn't use the app at all during the month and _more_ people who didn't drive at all during the month?

While these variables are related to each other, they're not the same. People probably open the app more often than they use it to drive, perhaps to check drive times or route information, to update settings, or even just by mistake.

Nonetheless, it might be worthwhile to contact the data team at Waze to get more information about this, especially because it seems that the number of days in the month is not the same between variables.

In [None]:
print(df['driving_days'].max())
print(df['activity_days'].max())

It's true: the maximum number of days differs between the two variables. Although it's possible that not a single user drove all 31 days of the month, it's highly unlikely, considering there are 15,000 people represented in the dataset.

In [None]:
# Scatter plot
sns.scatterplot(data=df, x='driving_days', y='activity_days')
plt.title('driving days vs. activity days')
plt.plot([0, 31], [0, 31], color='red', linestyle='--')

If you use the app to drive, then by definition it must count as a day-use as well. In other words, you cannot have more drive-days than activity-days. None of the samples in this data violate this rule, which is good.
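
The scatter plot shows this visually; the same rule can be checked row by row. The sketch below uses a hypothetical helper, `find_day_count_violations`, on made-up rows in which the last row deliberately breaks the rule so that the check has something to find.

```python
import pandas as pd

def find_day_count_violations(frame: pd.DataFrame) -> pd.DataFrame:
    """Return rows where driving days exceed activity days."""
    return frame[frame["driving_days"] > frame["activity_days"]]

# Toy rows, made up for illustration; the last row deliberately breaks the rule.
toy = pd.DataFrame({
    "activity_days": [0, 10, 31, 4],
    "driving_days": [0, 8, 30, 6],
})
violations = find_day_count_violations(toy)
print(len(violations))
```

Applied to the real data as `find_day_count_violations(df)`, an empty result would confirm what the plot suggests.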

#### **Retention by device**

In [None]:
# Histogram
plt.figure(figsize=(5, 4))
sns.histplot(data = df,
            x='device',
            hue='label',
            multiple='dodge',
            shrink=0.9)
plt.title("Retention by Device Histogram")

The proportion of churned users to retained users is consistent between device types.
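
Because the two device groups differ in size, the comparison is easier to read as row-normalized proportions than as raw bar heights. This is a minimal sketch using `pd.crosstab` on made-up rows; only the column names `device` and `label` come from the dataset.

```python
import pandas as pd

# Toy rows, made up for illustration only.
toy = pd.DataFrame({
    "device": ["iPhone", "iPhone", "iPhone", "iPhone", "Android", "Android"],
    "label": ["retained", "churned", "retained", "retained", "retained", "churned"],
})

# normalize="index" makes each device row sum to 1, giving per-device proportions.
rates = pd.crosstab(toy["device"], toy["label"], normalize="index")
print(rates)
```

On the real data, `pd.crosstab(df["device"], df["label"], normalize="index")` would give the churned and retained shares for each device type.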

#### **Retention by kilometers driven per driving day**

Previously, I discovered that the median distance driven last month for users who churned was 8.33 km, versus 3.36 km for people who did not churn.

In [None]:
df['km_per_driving_day'] = df['driven_km_drives'] / df['driving_days']

df['km_per_driving_day'].describe()


In [None]:
df.loc[df['km_per_driving_day']==np.inf, 'km_per_driving_day'] = 0
df['km_per_driving_day'].describe()

The maximum value is 15,420 kilometers _per drive day_. This is physically impossible. Driving 100 km/hour for 12 hours is 1,200 km. It's unlikely many people averaged more than this each day they drove, so I'm disregarding anything over 1,200 km.
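
Dividing by `driving_days` produces `inf` wherever a user has zero driving days, which is why those values are reset to 0 before describing the column. A slightly more defensive sketch is shown below; it assumes that a `0 / 0` result (`NaN`) should also be treated as 0, which is a choice of this example rather than something the notebook does. The helper name `km_per_driving_day` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def km_per_driving_day(km: pd.Series, days: pd.Series) -> pd.Series:
    """Divide km by days, mapping division-by-zero results to 0."""
    ratio = km / days
    # x / 0 gives +/-inf and 0 / 0 gives NaN; treat both as 0 (an assumption).
    return ratio.replace([np.inf, -np.inf], np.nan).fillna(0)

# Toy values, made up for illustration only.
km = pd.Series([100.0, 0.0, 300.0])
days = pd.Series([0, 0, 3])
result = km_per_driving_day(km, days)
print(result.tolist())
```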

In [None]:
# Histogram
plt.figure(figsize=(12, 5))
sns.histplot(data=df,
            x = 'km_per_driving_day',
            bins=range(0, 1201, 20),
            hue='label',
            multiple='fill')
plt.ylabel('%', rotation=0)
plt.title('Churn rate by mean km per driving day')

The churn rate tends to increase as the mean daily distance driven increases. It would be worth investigating further the reasons for long-distance users to discontinue using the app.
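
The trend in the filled histogram can be summarized as a churn rate per distance bin. The sketch below bins with `pd.cut` and averages a churn indicator within each bin; the bin edges and toy rows are made up for illustration and are coarser than the 20 km bins used in the plot.

```python
import pandas as pd

# Toy rows, made up for illustration only.
toy = pd.DataFrame({
    "km_per_driving_day": [50, 150, 250, 450, 700, 900, 1100, 1150],
    "label": ["retained", "retained", "retained", "churned",
              "retained", "churned", "churned", "churned"],
})

# Assign each user to a distance bin (edges chosen only for this example).
km_bin = pd.cut(toy["km_per_driving_day"], bins=[0, 400, 800, 1200],
                include_lowest=True)

# The mean of a boolean churn indicator is the churn rate within each bin.
churned = toy["label"].eq("churned")
churn_by_bin = churned.groupby(km_bin, observed=False).mean()
print(churn_by_bin)
```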

#### **Churn rate per number of driving days**

In [None]:
# Histogram
plt.figure(figsize=(12, 5))
sns.histplot(data=df,
            x='driving_days',
            bins=range(1, 32),
            hue='label',
            multiple='fill',
            discrete=True)
plt.ylabel('%', rotation=0)
plt.title("Churn Rate per Driving Day")

The churn rate is highest for people who rarely drove with Waze during the last month. The more days they drove, the less likely they were to churn. While about 40% of the users who did not drive at all last month churned, nobody who drove on 30 days churned.

This isn't surprising. If people who used the app a lot churned, it would likely indicate dissatisfaction. When people who don't use the app churn, it might be the result of dissatisfaction in the past, or it might be indicative of a lesser need for a navigational app. Maybe they moved to a city with good public transportation and don't need to drive anymore.

#### **Proportion of sessions that occurred in the last month**

In [None]:
df['percent_sessions_in_last_month'] = df['sessions'] / df['total_sessions']

In [None]:
df['percent_sessions_in_last_month'].median()

In [None]:
# Histogram
histogrammer('percent_sessions_in_last_month',
             hue=df['label'],
             multiple='layer',
             median_text=False)

In [None]:
df['n_days_after_onboarding'].median()

Half of the people in the dataset had 40% or more of their sessions in just the last month, yet the overall median time since onboarding is almost five years.

In [None]:
# Histogram
data = df.loc[df['percent_sessions_in_last_month']>=0.4]
plt.figure(figsize=(5,3))
sns.histplot(x=data['n_days_after_onboarding'])
plt.title('Num. days after onboarding for users with >=40% sessions in last month');

The number of days since onboarding for users with 40% or more of their total sessions occurring in just the last month is a uniform distribution. This is very strange. It's worth asking Waze why so many long-time users suddenly used the app so much in the last month.

#### **Handling outliers**

In [None]:
def outlier_imputer(column_name, percentile):
    # Calculate the threshold at the given percentile
    threshold = df[column_name].quantile(percentile)
    # Cap values above the threshold at the threshold
    df.loc[df[column_name] > threshold, column_name] = threshold

    print('{:>25} | percentile: {} | threshold: {}'.format(column_name, percentile, threshold))

In [None]:
for column in ['sessions', 'drives', 'total_sessions',
               'driven_km_drives', 'duration_minutes_drives']:
    outlier_imputer(column, 0.95)

In [None]:
df.describe()
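
The capping step replaces every value above the 95th percentile with the percentile value itself. A non-mutating variant of the same idea can be sketched with `Series.clip`; `cap_at_percentile` is a hypothetical helper, not part of the notebook, and the toy values are made up.

```python
import pandas as pd

def cap_at_percentile(values: pd.Series, percentile: float = 0.95) -> pd.Series:
    """Return a copy with values above the given percentile set to that percentile."""
    threshold = values.quantile(percentile)
    return values.clip(upper=threshold)

# Toy values 1..20, made up for illustration only.
toy_drives = pd.Series(range(1, 21), dtype="float64")
capped = cap_at_percentile(toy_drives)
print(capped.max())
```

Returning a new Series leaves the input untouched, which makes it easier to compare the distribution before and after capping.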

#### **Conclusion**

Analysis revealed that the overall churn rate is \~17%, and that this rate is consistent between iPhone users and Android users.

In [None]:
df['monthly_drives_per_session_ratio'] = (df['drives']/df['sessions'])

In [None]:
df.head(10)