**Project Objective:**

This exploratory data analysis (EDA) project examines telecommunications churn. Its primary goal is to understand customer behavior, identify patterns associated with churn, and visualize key metrics that could inform customer retention strategies. We analyze a dataset of customer records from a telecom company using statistical summaries and visualizations.

**Key Highlights:**

- **Data Filtering and Cleaning:** We implement a flexible filtering system to analyze data based on different criteria like state, international plan, and voicemail plan, allowing for targeted insights into customer segments.

- **Summary Statistics:** By computing comprehensive summary statistics on both the entire dataset and filtered subsets, we aim to uncover basic trends and anomalies in customer behavior related to service usage and churn.

- **Correlation Analysis:** We examine the relationships between various numeric and boolean features to understand how different service features correlate with each other and with churn. This is visualized through a heatmap, providing an intuitive look at which factors might be influencing customer churn.

- **Churn Distribution Visualization:** 
  - **Pie Chart:** To illustrate the overall churn rate within the customer base, providing a quick visual on the proportion of churn versus non-churn customers.
  - **Box Plot:** Analyzing the distribution of customer service calls by churn status to see if there's a noticeable difference in service interaction between churned and non-churned customers.
  - **Count Plot:** To explore how the subscription to international plans correlates with churn, offering insights into product preference and its impact on customer retention.

- **Group Analysis:** Grouping data by churn status to compare total charges across different service categories (day, evening, night, international), which helps in understanding if billing or usage patterns affect churn.
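
The correlation highlight above only covers churn if the churn label is numeric or boolean. As a minimal sketch on made-up values, the block below converts a string label into a flag and ranks features by their Pearson correlation with it. The `toy` frame and the `churn_flag` column name are assumptions for illustration; only the other column names come from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the real dataset).
toy = pd.DataFrame({
    "number_customer_service_calls": [1, 0, 5, 2, 6, 1],
    "total_day_charge": [30.0, 25.5, 48.2, 28.1, 51.0, 27.3],
    "churned": ["false.", "false.", "true.", "false.", "true.", "false."],
})

# The raw label is a string, so select_dtypes(include=["number", "bool"])
# would skip it. Convert it to a boolean flag first (assumed column name).
toy["churn_flag"] = toy["churned"].str.strip().str.lower().eq("true.")

# Pearson correlation over the numeric and boolean columns.
corr = toy.select_dtypes(include=["number", "bool"]).corr()

# Correlation of every feature with the churn flag, strongest first.
churn_corr = corr["churn_flag"].drop("churn_flag").sort_values(ascending=False)
print(churn_corr)
```

With so few rows the coefficients are not meaningful; the point is only that the flag column appears in the matrix once it has a boolean dtype.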

**Project Goals:**

- **Identify Churn Indicators:** Through statistical analysis and visualizations, pinpoint which services or behaviors are most indicative of a customer's likelihood to churn.

- **Inform Retention Strategies:** Use insights from the EDA to suggest where the telecom company might focus its retention efforts or adjust its service offerings.

- **Enhance Data-Driven Decision Making:** Provide a clear, visual, and statistical foundation for stakeholders to make informed decisions regarding marketing, customer service improvements, or product adjustments aimed at reducing churn.
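
One simple way to look for churn indicators is to compare churn rates across customer segments. The sketch below uses made-up rows to compute the churn rate per international-plan segment and the median number of customer service calls per churn status. The `toy` frame and `churn_flag` column are hypothetical; the approach is an illustration, not necessarily the analysis this notebook performs.

```python
import pandas as pd

# Toy data (hypothetical values) using column names from the dataset.
toy = pd.DataFrame({
    "intl_plan": ["yes", "no", "no", "yes", "no", "no", "yes", "no"],
    "number_customer_service_calls": [4, 1, 0, 5, 2, 1, 1, 6],
    "churned": ["true.", "false.", "false.", "true.",
                "false.", "false.", "false.", "true."],
})

# Boolean flag (assumed name) so that the mean equals the churn rate.
toy["churn_flag"] = toy["churned"].eq("true.")

# Share of churned customers within each international-plan segment.
churn_rate_by_plan = toy.groupby("intl_plan")["churn_flag"].mean()

# Median customer service calls for churned vs. retained customers.
median_calls = toy.groupby("churned")["number_customer_service_calls"].median()

print(churn_rate_by_plan)
print(median_calls)
```

Rates are easier to compare across segments than raw counts when the segments differ in size.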

Beyond anticipating which customers are likely to churn, this project aims to explain why customers leave, supporting more targeted and effective retention strategies in the telecom sector.
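
When charges are compared by churn status, summed totals are dominated by whichever group has more customers. As a hedged sketch on made-up values, the block below reports both the sum and the per-customer mean so the two can be read side by side. The `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values): three retained customers, one churned.
toy = pd.DataFrame({
    "churned": ["false.", "false.", "false.", "true."],
    "total_day_charge": [20.0, 30.0, 25.0, 45.0],
    "total_intl_charge": [2.0, 3.0, 2.5, 4.0],
})

charge_cols = ["total_day_charge", "total_intl_charge"]

# Sum and mean per churn status; the result has two-level column labels
# of the form (column, statistic).
by_churn = toy.groupby("churned")[charge_cols].agg(["sum", "mean"])
print(by_churn)
```

Here the retained group has the larger total day charge simply because it has more rows, while the churned customer has the higher average.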

In [0]:
import time
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
!pip install tabulate
import warnings
from tabulate import tabulate

import plotly.graph_objs as go
import plotly


import sys
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

pd.set_option('display.max_columns', None)
pd.set_option('display.width', 500)
warnings.filterwarnings('ignore')

In [0]:
#Widgets used to pass value
dbutils.widgets.text("job-id", "100")
dbutils.widgets.text("postback-url", "")
jobId = dbutils.widgets.get("job-id")
webserverURL = dbutils.widgets.get("postback-url")
#Initialize and Start Execution
from fire_notebook.output.workflowcontext import RestWorkflowContext
restworkflowcontext = RestWorkflowContext(webserverURL, jobId)
message="20"
restworkflowcontext.outputProgress(9, title="Progress", progress=message)

In [0]:
dbutils.widgets.text("arg_state", "%")
dbutils.widgets.text("arg_intl_plan", "%")
dbutils.widgets.text("arg_voice_mail_plan", "%")

state_filter = dbutils.widgets.get("arg_state").strip().lower()
intl_plan_filter = dbutils.widgets.get("arg_intl_plan")
voice_mail_filter = dbutils.widgets.get("arg_voice_mail_plan")

#print("state_filter value is", state_filter)
FILE_PATH = "/dbfs/FileStore/churn.all"
# Read the CSV file, treating empty strings as NaN and forcing 'churned' to be read as a string
colnames = [
    "state", "account_length", "area_code", "phone_number", "intl_plan",
    "voice_mail_plan", "number_vmail_messages",
    "total_day_minutes", "total_day_calls", "total_day_charge",
    "total_eve_minutes", "total_eve_calls", "total_eve_charge",
    "total_night_minutes", "total_night_calls", "total_night_charge",
    "total_intl_minutes", "total_intl_calls", "total_intl_charge",
    "number_customer_service_calls", "churned",
]
df = pd.read_csv(FILE_PATH, names=colnames, header=None, na_values=[''], keep_default_na=False, dtype={'churned': str})
'''
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
df = df[df['churned'].isin(['True.', 'False.'])]  #This is the key line
'''
# Normalize every string cell (trim whitespace, lower-case) so filters compare case-insensitively
df = df.applymap(lambda x: x.strip().lower() if isinstance(x, str) else x)
# Filter rows with valid 'churned' values
df = df[df['churned'].isin(['true.', 'false.'])]


#print("Unique values in 'state' column (after cleaning):", df['state'].unique())
#print("state_filter string:", state_filter)  #Print the value of state_filter, before splitting it

### EDA Functions
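
The filter widgets use `%` as a "match everything" default and accept several values separated by a delimiter. The sketch below shows that convention in isolation on a toy frame. The helper names `parse_filter` and `apply_filter`, the `WILDCARD` constant, and the `toy` data are hypothetical; they are not the functions defined in this notebook.

```python
import pandas as pd

WILDCARD = "%"  # widget default, meaning "no filter"

def parse_filter(raw, sep="|"):
    """Return None for the wildcard, else a list of cleaned, lower-cased values."""
    cleaned = str(raw).strip().strip("'").lower()
    if cleaned == WILDCARD:
        return None
    # Trim whitespace and stray quotes around each delimited value.
    return [part.strip().strip("'") for part in cleaned.split(sep) if part.strip()]

def apply_filter(frame, column, raw, sep="|"):
    """Keep rows whose `column` value is in the parsed filter; the wildcard keeps all rows."""
    values = parse_filter(raw, sep)
    if values is None:
        return frame
    return frame[frame[column].isin(values)]

# Toy data (hypothetical), already lower-cased like the cleaned DataFrame.
toy = pd.DataFrame({
    "state": ["ca", "ny", "tx", "ca"],
    "intl_plan": ["yes", "no", "no", "no"],
})

subset = apply_filter(toy, "state", " 'CA'|NY ")   # keeps the ca and ny rows
subset = apply_filter(subset, "intl_plan", "%")     # wildcard: no further filtering
print(subset)
```

Lower-casing the filter text before comparing matters because the data itself is lower-cased during cleaning.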

In [0]:
# Define a function to filter the DataFrame based on user-specified criteria
def filter_dataframe(df, state_filter, intl_plan_filter, voice_mail_filter):
    filtered_df = df.copy()

    if state_filter != '%':
        #print("\nUnique states before filtering:", filtered_df['state'].unique())

        # Split and clean state filter values
        state_values = [x.strip().strip("'") for x in state_filter.split('|')]
        #print(f"State values for filtering (cleaned): {state_values}")

        # Apply the state filter
        filtered_df = filtered_df[filtered_df['state'].isin(state_values)]


        #print("Unique states after state filtering:", filtered_df['state'].unique())  # Debug statement
        #print("Shape after state filtering:", filtered_df.shape)  # Debug statement

    # Filter by international plan if a specific plan is provided
    if intl_plan_filter != '%':
        intl_plan_values = [x.strip() for x in intl_plan_filter.split(',')]
        filtered_df = filtered_df[filtered_df['intl_plan'].isin(intl_plan_values)]
        #print("Unique intl_plan after intl_plan filtering:", filtered_df['intl_plan'].unique())  # Debug statement
        #print("Shape after intl_plan filtering:", filtered_df.shape)  # Debug statement


    # Filter by voice mail plan if a specific plan is provided
    if voice_mail_filter != '%':
        filtered_df = filtered_df[filtered_df['voice_mail_plan'] == voice_mail_filter]
        #print("Unique voice_mail_plan after voice_mail_plan filtering:", filtered_df['voice_mail_plan'].unique())  # Debug statement
        #print("Shape after voice_mail_plan filtering:", filtered_df.shape)  # Debug statement

    return filtered_df

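# Clean the filters: strip whitespace and quotes, and lower-case them so they
# match the lower-cased DataFrame values (state_filter is already lower-cased above)
intl_plan_filter = str(intl_plan_filter).strip().strip("'").lower()
voice_mail_filter = str(voice_mail_filter).strip().strip("'").lower()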

# Debugging print statements to verify the cleaned values
#print(f"Cleaned intl_plan_filter: {repr(intl_plan_filter)} (length: {len(intl_plan_filter)})")
#print(f"Cleaned voice_mail_filter: {repr(voice_mail_filter)} (length: {len(voice_mail_filter)})")

# Call the filter function with the specified criteria
filtered_df = filter_dataframe(df, state_filter, intl_plan_filter, voice_mail_filter)
# Upper-case string values for display; 'churned' becomes 'TRUE.' / 'FALSE.'
filtered_df = filtered_df.applymap(lambda x: x.strip().upper() if isinstance(x, str) else x)

# Define a list of columns for which summary statistics will be computed
summary_cols = [
    # List of columns to include in the summary statistics
    "total_day_minutes", "total_day_calls", "total_day_charge", "total_eve_calls",
    "number_vmail_messages", "total_eve_minutes", "total_eve_charge",
    "total_night_minutes", "total_night_calls", "total_night_charge",
    "total_intl_minutes", "total_intl_calls", "total_intl_charge",
    "number_customer_service_calls"
]

# Convert Columns to Numeric:
for col in summary_cols:
    filtered_df[col] = pd.to_numeric(filtered_df[col], errors='coerce')

restworkflowcontext.outPandasDataframe(9, "Telco Dataset Sample", filtered_df.head())

# Compute summary statistics for the filtered DataFrame
summary_df = filtered_df[summary_cols].agg(['count', 'mean', 'min', 'max', 'std', 'var']).round(3)

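# Move the statistic names ('count', 'mean', ...) from the index into a column,
# because the pandas index is not carried over when converting to a Spark DataFrame
#print("Summary Statistics for Filtered Data:\n", summary_df)
summary_df = summary_df.reset_index()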


spark_df = spark.createDataFrame(summary_df)
restworkflowcontext.outDataFrame(9, "Summary Statistics for Telco Dataset", spark_df)

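# Drop rows with null values from the filtered DataFrame
df_drop_null = filtered_df.dropna().copy()

# 'churned' is a string label, so add a boolean flag to include churn in the correlations
df_drop_null['churn_flag'] = df_drop_null['churned'].str.lower().eq('true.')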


# Select numeric and boolean columns for the correlation matrix calculation
numeric_bool_cols = df_drop_null.select_dtypes(include=['number', 'bool']).columns

# Calculate the correlation matrix for the numeric and boolean columns
corr_matrix = df_drop_null[numeric_bool_cols].corr()

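# Keep the feature names as a column; the pandas index is not carried over to Spark
spark_corr_matrix = spark.createDataFrame(
    corr_matrix.reset_index().rename(columns={'index': 'feature'})
)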

# Visualize the correlation matrix using a heatmap
# Create the heatmap trace
heatmap = go.Heatmap(
    z=corr_matrix,
    x=corr_matrix.columns,
    y=corr_matrix.index,
    colorscale=[
        [0, 'rgb(119, 228, 200)'],
        [0.2, 'rgb(86, 211, 203)'],
        [0.4, 'rgb(54, 194, 206)'],
        [0.5, 'rgb(62, 167, 207)'],
        [0.6, 'rgb(71, 140, 207)'],
        [0.7, 'rgb(70, 100, 200)'],
        [1, 'rgb(69, 53, 193)']
    ],
    zmid=0,
    zmin=-1,
    zmax=1
)


# Create the layout
layout = go.Layout(
    title='Correlation-Matrix',
    xaxis=dict(title='Features', tickangle=-45),
    yaxis=dict(title='Features', autorange='reversed'),
    width=800,
    height=800
)

# Create the figure
fig = go.Figure(data=[heatmap], layout=layout)
plotly.offline.iplot(fig)

# Generate the Plotly plot as an HTML div
plot_div = plotly.offline.plot(fig, output_type='div', include_plotlyjs=False)


# Display the Correlation Matrix graph
restworkflowcontext.outPlotly(9, title="Correlation Matrix", text=plot_div)

# Display the Correlation Matrix data
restworkflowcontext.outDataFrame(9, "Correlation Matrix Table", spark_corr_matrix)

#DISPLAY CHURNED INFO
# Group the filtered DataFrame by 'churned' and compute aggregated statistics
groupby_df = filtered_df.groupby('churned').agg({
    'total_day_charge': 'sum',
    'total_eve_charge': 'sum',
    'total_night_charge': 'sum',
    'total_intl_charge': 'sum'
}).reset_index()

# Send the grouped data to the workflow output
restworkflowcontext.outPandasDataframe(9, "Breakdown of Total Charges by Churn Status", groupby_df.head())

# Get the churn counts
churn_counts = filtered_df['churned'].value_counts()

# Create the pie chart trace
pie_trace = go.Pie(
    labels=churn_counts.index,
    values=churn_counts.values,
    textinfo='percent',
    insidetextorientation='radial',
    hole=0.4,  # Adjust the size of the donut hole (0 for a full pie)
    marker=dict(colors=['#7695FF', '#FF9874'])   # Custom colors for churned and non-churned segments
)  
# Create the layout
layout = go.Layout(
    title='Churn-Distribution',
    width=800,
    height=800,
    legend=dict(x=0.8, y=0.9)  # Adjust the position of the legend
)

# Create the figure
fig = go.Figure(data=[pie_trace], layout=layout)
plotly.offline.iplot(fig)

# Generate the Plotly plot as an HTML div
plot_div = plotly.offline.plot(fig, output_type='div', include_plotlyjs=False)

# Display the plot using the custom plotly implementation
restworkflowcontext.outPlotly(10, title="Churn Distribution", text=plot_div)



# Create the box plot of customer service calls by churn status
box_trace = go.Box(
    x=filtered_df['churned'],
    y=filtered_df['number_customer_service_calls'],
    boxpoints='outliers',  # draw only the outlier points
    jitter=0.3,
    pointpos=-1.8,
    marker=dict(color='#7695FF')  # marker color
)

# Create the layout
layout = go.Layout(
    title='Customer-Service-Calls-by-Churn-Status',
    xaxis=dict(title='Churn Status'),
    yaxis=dict(title='Number of Customer Service Calls') 
)

# Create the figure
fig = go.Figure(data=[box_trace], layout=layout)
plotly.offline.iplot(fig)

# Generate the Plotly plot as an HTML div
plot_div = plotly.offline.plot(fig, output_type='div', include_plotlyjs=False)

# Display the plot using the custom plotly implementation
restworkflowcontext.outPlotly(9, title="Number of Customer Service Calls by Churn Status", text=plot_div)


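# NOTE: the count-plot section below is disabled; it is wrapped in a string literal and does not run
'''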

# Create a count plot to visualize the distribution of international plan subscriptions by churn status
# Convert 'churned' to boolean; string values were upper-cased earlier, so the labels are 'TRUE.' / 'FALSE.'
filtered_df['churned'] = filtered_df['churned'].map({'TRUE.': True, 'FALSE.': False})
if 'intl_plan' not in filtered_df.columns:
    raise KeyError("'intl_plan' column is missing from the DataFrame.")

# Verify the filtered DataFrame for 'churned = True' rows
print("Filtered DataFrame (churned = True):")
print(filtered_df[filtered_df['churned']])

#churned_intl_plan = filtered_df[filtered_df['churned']]['intl_plan'].value_counts()
#not_churned_intl_plan = filtered_df[~filtered_df['churned']]['intl_plan'].value_counts()


# Create the count plot traces
churned_intl_plan = filtered_df[filtered_df['churned']]['intl_plan'].value_counts()
not_churned_intl_plan = filtered_df[~filtered_df['churned']]['intl_plan'].value_counts()

# Create the traces
trace_churned = go.Bar(
    x=churned_intl_plan.index,
    y=churned_intl_plan.values,
    marker=dict(
        color='#7695FF',  # Set the box color for "Yes"
    ),
    name='Churned'
)

trace_not_churned = go.Bar(
    x=not_churned_intl_plan.index,
    y=not_churned_intl_plan.values,
    marker=dict(
        color='#FF9874',  # Set the box color for "No"
    ),
    name='Not Churned'
)

layout = go.Layout(
    title='International-Plan-Subscription-by-Churn-Status',
    xaxis=dict(title='International Plan'),
    yaxis=dict(title='Count'),
    barmode='group'
)

fig = go.Figure(data=[trace_churned, trace_not_churned], layout=layout)
plotly.offline.iplot(fig)

# Generate the Plotly plot as an HTML div
plot_div = plotly.offline.plot(fig, output_type='div', include_plotlyjs=False)

restworkflowcontext.outPlotly(9, title="International Plan Subscription by Churn Status", text=plot_div)

#htmlstr1 = "<h4>The significantly <em>higher</em> churn rate among customers with an <strong>international plan</strong> suggests that there may be specific issues related to these plans that are driving customer dissatisfaction.</h4>"

#restworkflowcontext.outHTML(9, title="Recommendation", text = htmlstr1)
'''

# FINISH EXECUTION
message="100"
restworkflowcontext.outputProgress(9, title="Progress", progress=message)

message = "Job Execution Completed."
restworkflowcontext.outSuccess(9, title="Success", text=message)