## Expresso Churn Prediction

In this project, we work with the **'Expresso churn'** dataset, which was provided as part of the Expresso Churn Prediction Challenge hosted on the Zindi platform.

**Dataset description:** Expresso is an African telecommunications company that provides services in two African markets: Mauritania and Senegal. The data describes 2.5 million Expresso clients using more than 15 behaviour variables, with the goal of predicting each client's churn probability.

➡️ Dataset link

https://i.imgur.com/OQKLgVy.png

**Instructions**

1. Install the necessary packages
2. Import your data and perform a basic data exploration phase
- Display general information about the dataset
- Create a pandas profiling report to gain insights into the dataset
- Handle missing and corrupted values
- Remove duplicates, if they exist
- Handle outliers, if they exist
- Encode categorical features
3. Based on the previous data exploration, train and test a machine learning classifier
4. Create a Streamlit application (locally)
5. Add input fields for your features and a validation button at the end of the form
6. Import your ML model into the Streamlit application and start making predictions from the provided feature values

#### Importing necessary libraries

In [None]:
# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

#### Loading the dataset

In [None]:
df = pd.read_csv("Expresso_churn_dataset.csv")

In [None]:
df.head()

In [None]:
df.shape

#### Sampling the dataset

In [None]:
# Sample 50,000 random rows from the raw DataFrame (cleaning happens afterwards, on the sample)
df_sampled = df.sample(n=50000, random_state=42)
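
A plain random sample usually keeps the churn ratio close to that of the full dataset, but this is not guaranteed. Below is a minimal sketch of a stratified alternative, assuming the target column is named `CHURN` as in this dataset. The helper name `stratified_sample` and the toy DataFrame are illustrative only and are not part of the original notebook.

```python
import pandas as pd

def stratified_sample(frame: pd.DataFrame, target: str, n: int, seed: int = 42) -> pd.DataFrame:
    """Draw about n rows while keeping each class's share of `target` unchanged."""
    frac = n / len(frame)
    # GroupBy.sample draws the same fraction from every class
    return frame.groupby(target, group_keys=False).sample(frac=frac, random_state=seed)

# Toy data standing in for the real dataset: 20% churners
toy = pd.DataFrame({"CHURN": [1] * 200 + [0] * 800, "REVENUE": range(1000)})
toy_sample = stratified_sample(toy, "CHURN", n=100)
print(len(toy_sample), toy_sample["CHURN"].mean())
```

On the real data the call would look like `stratified_sample(df, "CHURN", n=50000)`.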

#### Overview of the dataset

In [None]:
df_sampled.head()

In [None]:
df_sampled.shape

In [None]:
df_sampled.info()

#### Summary Statistics

In [None]:
df_sampled.describe()

#### Checking for missing values and duplicates

In [None]:
df_sampled.isnull().sum()

##### Replacing missing values

In [None]:
# Fill missing values in numeric columns with each column's median
numeric_cols = ['MONTANT', 'FREQUENCE_RECH', 'REVENUE', 'ARPU_SEGMENT', 'FREQUENCE', 'DATA_VOLUME',
                'ON_NET', 'ORANGE', 'TIGO', 'ZONE1', 'ZONE2', 'FREQ_TOP_PACK']
for col in numeric_cols:
    # Assign back instead of calling fillna(inplace=True) on a column, which newer pandas versions warn about
    df_sampled[col] = df_sampled[col].fillna(df_sampled[col].median())

In [None]:
# Replace missing values for categorical columns with the mode
for col in ['REGION', 'TOP_PACK']:
    df_sampled[col] = df_sampled[col].fillna(df_sampled[col].mode()[0])


In [None]:
df_sampled.isnull().sum()

In [None]:
df_sampled.duplicated().sum()

No duplicate rows were found in the sample.

###### Churn Distribution Across Region

In [None]:
plt.figure(figsize=(18, 5))
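# CHURN is binary, so plot the mean churn rate per region; a boxplot of a 0/1 column is not informative
sns.barplot(x='REGION', y='CHURN', data=df_sampled)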
plt.title('Churn Distribution Across Region')
plt.xticks(rotation=45)
plt.show()


###### Plot features against CHURN and identify data points far from the bulk of the data

In [None]:
sns.scatterplot(x='REVENUE', y='MONTANT', data=df_sampled, hue='CHURN')
plt.show()

In [None]:
from scipy import stats
z_scores = stats.zscore(df_sampled['MONTANT'])
df_sampled[(z_scores > 3) | (z_scores < -3)]


In this sample, none of the users with outlier MONTANT values has churned (CHURN = 0). This may suggest that higher MONTANT values are associated with retention.
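
The z-score cell above only lists the outliers; it does not change them. One common way to handle them is to cap values at the interquartile-range (IQR) fences. The sketch below assumes that approach is acceptable here; the helper `cap_outliers_iqr` and the toy values are hypothetical and not taken from the notebook. Since the outliers all belong to non-churners, capping may discard useful signal, so treat this as an option rather than a required step.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy values standing in for a column such as MONTANT
toy_montant = pd.Series([500, 1000, 1500, 2000, 2500, 100000])
capped = cap_outliers_iqr(toy_montant)
print(capped.tolist())
```

Applied to the notebook's data, this would read `df_sampled['MONTANT'] = cap_outliers_iqr(df_sampled['MONTANT'])`.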

In [None]:
df_sampled['REGION'].value_counts()

In [None]:
df_sampled['TENURE'].value_counts()

In [None]:
# Mean encoding for TENURE: replace each tenure band with the average REVENUE of that band
mean_encoded = df_sampled.groupby('TENURE')['REVENUE'].mean()
df_sampled['TENURE_encoded'] = df_sampled['TENURE'].map(mean_encoded)
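
The encoding above is computed on the whole sample, so rows that later land in the test set contribute to the category means. Below is a minimal sketch of a variant that learns the means from the training rows only and reuses them on other data. The helper names, the fallback rule for unseen categories, and the toy rows are assumptions for illustration.

```python
import pandas as pd

def fit_mean_encoding(train: pd.DataFrame, cat_col: str, value_col: str):
    """Learn per-category means from the training rows only."""
    mapping = train.groupby(cat_col)[value_col].mean()
    fallback = train[value_col].mean()  # used for categories unseen in training
    return mapping, fallback

def apply_mean_encoding(frame: pd.DataFrame, cat_col: str, mapping: pd.Series, fallback: float) -> pd.Series:
    """Map categories to their learned means, falling back to the overall training mean."""
    return frame[cat_col].map(mapping).fillna(fallback)

# Toy rows, not real data
train = pd.DataFrame({"TENURE": ["K > 24 month", "K > 24 month", "I 18-21 month"],
                      "REVENUE": [3000.0, 5000.0, 1000.0]})
test = pd.DataFrame({"TENURE": ["K > 24 month", "D 3-6 month"]})

mapping, fallback = fit_mean_encoding(train, "TENURE", "REVENUE")
test["TENURE_encoded"] = apply_mean_encoding(test, "TENURE", mapping, fallback)
print(test)
```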

In [None]:
df_sampled.head()

In [None]:
# Selecting only numeric columns for correlation
numeric_df = df_sampled.select_dtypes(include=[float, int])
correlation_matrix = numeric_df.corr()

In [None]:
correlation_matrix

In [None]:
# Plotting the correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()

### Modelling

Selecting my features and splitting the data into training and test sets

In [None]:
# Split features (X) and target (y)
X = df_sampled.drop(columns=[ "CHURN", "TOP_PACK", "MRG", "user_id", "REGION", "TENURE"])
y = df_sampled['CHURN']

# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
print(X_train.columns)

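Scaling my data so that each feature contributes equally to distance calculations or the optimization process. Tree-based models such as Random Forest do not depend on feature scale, so this step matters mainly for scale-sensitive models such as logistic regression.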

In [None]:
# Create a StandardScaler instance
scaler = StandardScaler()

# Fit and transform the training data
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data
X_test_scaled = scaler.transform(X_test)
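
To make the scaling step concrete, here is a small worked example of the same transformation done by hand: each column has its mean subtracted and is divided by its standard deviation. The toy matrix is made up for illustration.

```python
import numpy as np

# Toy feature matrix: column 0 is on a much larger scale than column 1
X_toy = np.array([[1000.0, 1.0], [2000.0, 2.0], [3000.0, 3.0]])

mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses
X_toy_scaled = (X_toy - mean) / std

print(X_toy_scaled)
print(X_toy_scaled.mean(axis=0), X_toy_scaled.std(axis=0))
```

After scaling, both columns hold identical values even though their original ranges differed by a factor of 1000. Note that the statistics come from the training data only, which is why the notebook calls `fit_transform` on `X_train` and only `transform` on `X_test`.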

#### Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Fix the seed so the reported metrics and feature importances are reproducible
rf_classifier = RandomForestClassifier(class_weight='balanced', random_state=42)

# Fit the model to the training data
rf_classifier.fit(X_train_scaled, y_train)

In [None]:
# Make predictions
y_pred = rf_classifier.predict(X_test_scaled)

# Evaluate the model
print("Accuracy Score:")
print(accuracy_score(y_test, y_pred))

print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

In [None]:
# Feature importance (matplotlib is already imported at the top of the notebook)

feature_importances = rf_classifier.feature_importances_
features = X.columns
importance_df = pd.DataFrame({'Feature': features, 'Importance': feature_importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Plotting
plt.figure(figsize=(12, 6))
plt.barh(importance_df['Feature'], importance_df['Importance'])
plt.xlabel('Importance')
plt.title('Feature Importance in the Random Forest Classifier')
plt.show()

Based on the feature importances, the features I keep for the next models are:

'FREQUENCE_RECH', 'REVENUE', 'FREQUENCE', 'DATA_VOLUME', 'ON_NET', 'ORANGE', 'REGULARITY', 'FREQ_TOP_PACK'

### Light Gradient Boosting Machine (LightGBM)

In [None]:
import lightgbm as lgb
from lightgbm import LGBMClassifier

In [None]:
X = df_sampled[['FREQUENCE_RECH', 'REVENUE', 'FREQUENCE', 'DATA_VOLUME', 'ON_NET', 'ORANGE', 'REGULARITY', 'FREQ_TOP_PACK']]
y = df_sampled['CHURN']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
lgbm_model = LGBMClassifier(n_estimators=100, verbose=1)
# Fit with early stopping on AUC; note that the test set doubles as the validation set here
lgbm_model.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric='auc',
               callbacks=[lgb.early_stopping(stopping_rounds=10)])

In [None]:
# Make predictions
y_pred = lgbm_model.predict_proba(X_test)[:, 1]  # Probabilities for the positive class (churn)
y_pred_binary = (y_pred >= 0.5).astype(int)  # Apply a 0.5 decision threshold

In [None]:
# Evaluate
accuracy = accuracy_score(y_test, y_pred_binary)
print(f'Accuracy: {accuracy:.2f}')

print("Classification Report:")
print(classification_report(y_test, y_pred_binary))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_binary))
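
The 0.5 cut-off used above is a default, not a requirement. Lowering it flags more clients as churners (higher recall, lower precision), and raising it does the opposite. The sketch below shows that trade-off on toy labels and probabilities; the helper `precision_recall_at` and the toy arrays are illustrative and are not model output.

```python
import numpy as np

def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall for the positive class at a given probability cut-off."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(y_prob) >= threshold).astype(int)
    tp = int(((y_hat == 1) & (y_true == 1)).sum())
    fp = int(((y_hat == 1) & (y_true == 0)).sum())
    fn = int(((y_hat == 0) & (y_true == 1)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels and probabilities, not model output
y_true_toy = [0, 0, 0, 1, 0, 1, 1, 1]
y_prob_toy = [0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.8, 0.9]

for t in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(y_true_toy, y_prob_toy, t)
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f}")
```

With the notebook's variables, the same helper could be called as `precision_recall_at(y_test, y_pred, 0.4)` to check whether a lower threshold recovers more churners at an acceptable precision.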

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
# L1-regularised logistic regression; the liblinear solver supports the L1 penalty
log_model = LogisticRegression(penalty='l1', C=0.5, solver='liblinear')
log_model.fit(X_train, y_train)

In [None]:
logy_pred = log_model.predict(X_test)

In [None]:
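# scikit-learn metrics expect (y_true, y_pred) in that order; swapping them transposes
# the confusion matrix and exchanges precision with recall in the report
ACC = accuracy_score(y_test, logy_pred)
conf_mat = confusion_matrix(y_test, logy_pred)
class_report = classification_report(y_test, logy_pred)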
print(ACC)
print(conf_mat)
print(class_report)

##### LightGBM stands out as the best model overall based on my results. Here's why:

Accuracy: LightGBM has the highest accuracy (0.87), meaning it classifies the largest share of test rows correctly.

Precision (Class 1): LightGBM also has the highest precision (0.69) for class 1, meaning a smaller share of its churn predictions are false positives than for the other models.

Balanced Performance: Its recall for class 1 (0.55) is lower than Random Forest's (0.75), so it misses more actual churners. It still offers a reasonable balance between precision and recall, which is desirable when both false alarms and missed churners are costly.

Training Efficiency: LightGBM generally trains faster and scales better to larger datasets, which will help as the project expands beyond the 50,000-row sample.
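
The balance between precision and recall mentioned above can be summarised in a single number, the F1 score, which is the harmonic mean of the two. Here is a short worked example using the class 1 figures quoted for LightGBM; the helper name `f1_score_from` is illustrative.

```python
def f1_score_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Class 1 precision and recall for LightGBM, as quoted in the comparison above
lgbm_f1 = f1_score_from(0.69, 0.55)
print(round(lgbm_f1, 2))  # 0.61
```

Running the same calculation on the other models' class 1 precision and recall gives a like-for-like comparison that does not depend on accuracy alone.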

### Saving my model

In [None]:
import joblib

In [None]:
joblib.dump(lgbm_model, 'expressoModel.pkl')
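
The saved file only contains the estimator, so any code that loads it has to supply the same eight features, in the same order, that were used for training. Below is a minimal sketch of a helper that enforces this before calling the model. The name `build_input_frame` and the rule that negative values are rejected are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Feature order used when the LightGBM model was trained in this notebook
FEATURES = ['FREQUENCE_RECH', 'REVENUE', 'FREQUENCE', 'DATA_VOLUME',
            'ON_NET', 'ORANGE', 'REGULARITY', 'FREQ_TOP_PACK']

def build_input_frame(values: dict) -> pd.DataFrame:
    """Return a one-row DataFrame with columns in training order.

    Raises ValueError if a feature is missing or negative.
    """
    missing = [name for name in FEATURES if name not in values]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    negative = [name for name in FEATURES if values[name] < 0]
    if negative:
        raise ValueError(f"Negative values not allowed: {negative}")
    return pd.DataFrame([[values[name] for name in FEATURES]], columns=FEATURES)

# Keys supplied in reverse order on purpose; the helper restores training order
example = {name: 1.0 for name in reversed(FEATURES)}
row = build_input_frame(example)
print(list(row.columns) == FEATURES)
```

The resulting frame can be passed straight to `model.predict_proba(row)` after loading the model with `joblib.load`.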

### Create The Streamlit App

In [None]:
# Create the file Expresso_Churn_Prediction_Streamlit_App.py in write mode
with open("Expresso_Churn_Prediction_Streamlit_App.py", "w") as file:
    # Writing the Streamlit code into the file
    file.write("""
# Import necessary libraries
import streamlit as st
import pandas as pd
import joblib

# Load the pre-trained model
model = joblib.load("expressoModel.pkl")  

# Set up the Streamlit app
st.title('Expresso Client Churn Prediction')
st.write("This app predicts the churn probability for Expresso clients based on their behavior.")


# Input fields for user to enter feature values
frequence_rech = st.number_input('Recharge Frequency (FREQUENCE_RECH)', min_value=1.0, max_value=114.0, value=11.44, step=1.0)
revenue = st.number_input('Revenue (REVENUE)', min_value=1.0, max_value=165166.0, value=5454.27, step=0.1)
frequence = st.number_input('Frequency of usage (FREQUENCE)', min_value=1.0, max_value=91.0, value=13.88, step=1.0)
data_volume = st.number_input('Data Volume (DATA_VOLUME)', min_value=0.0, max_value=560933.0, value=3165.06, step=0.1)
on_net = st.number_input('On Net Usage (ON_NET)', min_value=0.0, max_value=20837.0, value=272.18, step=1.0)
orange = st.number_input('Orange Network Usage (ORANGE)', min_value=0.0, max_value=4743.0, value=96.23, step=1.0)
regularity = st.number_input('Regularity of usage (REGULARITY)', min_value=0.0, max_value=1346.0, value=7.93, step=1.0)
freq_top_pack = st.number_input('Frequency of Top Pack (FREQ_TOP_PACK)', min_value=1.0, max_value=320.0, value=9.20, step=1.0)

# Create a dictionary with the input data
input_data = {
    'FREQUENCE_RECH': frequence_rech,
    'REVENUE': revenue,
    'FREQUENCE': frequence,
    'DATA_VOLUME': data_volume,
    'ON_NET': on_net,
    'ORANGE': orange,
    'REGULARITY': regularity,
    'FREQ_TOP_PACK': freq_top_pack
}

# Convert the dictionary to a DataFrame
input_df = pd.DataFrame([input_data])

# Predict churn probability using the loaded model
if st.button('Predict Churn Probability'):
    prediction = model.predict_proba(input_df)[:, 1]  # Probability of churn
    churn_probability = round(prediction[0] * 100, 2)
    st.write(f"The predicted churn probability is {churn_probability}%")

# Option to display input data
if st.checkbox('Show Input Data'):
    st.write(input_df)

""")