# Telecom Revenue Assurance AI Model (El Classico with Ensemble)
Author: Fatih E. NAR

## Introduction
In this notebook, we showcase machine learning models that detect fraudulent telecom service usage. We use a synthetic dataset with features relevant to telco user activity, such as call duration, data usage, and SMS count. The goal is to identify fraudulent events accurately and so improve revenue assurance processes in the telco domain.

## Data Loading and Exploration
We will start by loading and exploring the synthetic dataset to understand its structure and the distribution of features.

In [46]:
## Install dependencies
!pip install -r requirements.txt



In [47]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the synthetic telecom data
data_path = "data/telecom_revass_data.csv"
data = pd.read_csv(data_path)

# Display basic information about the dataset
data.info()

# Display the first few rows of the dataset
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 12 columns):
 #   Column                      Non-Null Count    Dtype  
---  ------                      --------------    -----  
 0   Call_Duration               1000000 non-null  float64
 1   Data_Usage                  1000000 non-null  float64
 2   Sms_Count                   1000000 non-null  int64  
 3   Roaming_Indicator           1000000 non-null  int64  
 4   MobileWallet_Use            1000000 non-null  int64  
 5   Plan_Type                   1000000 non-null  object 
 6   Cost                        1000000 non-null  float64
 7   Cellular_Location_Distance  1000000 non-null  float64
 8   Last_Time_Pin_Used          1000000 non-null  float64
 9   Avg_Call_Duration           1000000 non-null  float64
 10  Avg_Data_Usage              1000000 non-null  float64
 11  Fraud                       1000000 non-null  int64  
dtypes: float64(7), int64(4), object(1)
memory usage: 91.6+ MB

| | Call_Duration | Data_Usage | Sms_Count | Roaming_Indicator | MobileWallet_Use | Plan_Type | Cost | Cellular_Location_Distance | Last_Time_Pin_Used | Avg_Call_Duration | Avg_Data_Usage | Fraud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.692681 | 539.146554 | 1 | 0 | 0 | postpaid | 71.603238 | 3.629675 | 46.011773 | 1.460109 | 336.312984 | 0 |
| 1 | 30.101214 | 247.225104 | 4 | 0 | 0 | postpaid | -3.794503 | 3.654629 | 47.339394 | 30.817472 | 150.96959 | 0 |
| 2 | 13.167457 | 117.971674 | 3 | 0 | 0 | postpaid | 4.58132 | 2.506765 | 8.653167 | 13.554912 | 79.394244 | 0 |
| 3 | 9.129426 | 411.883231 | 4 | 0 | 0 | postpaid | 24.955166 | 0.098861 | 18.875911 | 7.990501 | 317.191998 | 1 |
| 4 | 1.696249 | 1134.432099 | 4 | 0 | 0 | prepaid | 43.51732 | 4.20457 | 30.711223 | 0.159457 | 1073.58526 | 0 |
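Beyond `info()` and `head()`, a useful exploration step for fraud data is to look at how rare the positive class is. The following is a minimal sketch on a hand-made toy frame (the values are illustrative and are not taken from the notebook's dataset); the same two calls would apply to `data` as loaded above.

```python
import pandas as pd

# Toy stand-in for the notebook's DataFrame; the values are illustrative only.
toy = pd.DataFrame({
    "Call_Duration": [4.7, 30.1, 13.2, 9.1, 1.7, 22.4, 3.3, 8.8],
    "Plan_Type": ["postpaid", "postpaid", "postpaid", "postpaid",
                  "prepaid", "prepaid", "postpaid", "prepaid"],
    "Fraud": [0, 0, 0, 1, 0, 0, 0, 1],
})

# Share of each class in the target column.
fraud_share = toy["Fraud"].value_counts(normalize=True).sort_index()
print(fraud_share)

# Fraud rate per plan type: the mean of a 0/1 column is the positive rate.
fraud_rate_by_plan = toy.groupby("Plan_Type")["Fraud"].mean()
print(fraud_rate_by_plan)
```

A strongly skewed class share is the reason the later cells rely on class weights and balanced resampling rather than a plain classifier.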


## Data Preprocessing
Before training the models, we need to preprocess the data. This includes checking for missing values, converting the categorical `Plan_Type` column to numeric indicator columns, and splitting the data into stratified training and test sets.

In [48]:
from sklearn.model_selection import train_test_split

# Check for missing values
missing_values = data.isnull().sum()
print("Missing values in each column:", missing_values)

# Convert categorical variables to numeric
data = pd.get_dummies(data, columns=['Plan_Type'], drop_first=True)

# Split the data into features and target variable
X = data.drop('Fraud', axis=1)
y = data['Fraud']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

Missing values in each column: Call_Duration                 0
Data_Usage                    0
Sms_Count                     0
Roaming_Indicator             0
MobileWallet_Use              0
Plan_Type                     0
Cost                          0
Cellular_Location_Distance    0
Last_Time_Pin_Used            0
Avg_Call_Duration             0
Avg_Data_Usage                0
Fraud                         0
dtype: int64
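The effect of `drop_first=True` and `stratify=y` in the cell above can be checked on a small example. This is a sketch on a toy frame with an assumed 20% fraud rate; the `toy_` names are hypothetical and chosen so they do not overwrite the notebook's own variables.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 20% fraud rate; values are illustrative only.
toy = pd.DataFrame({
    "Cost": [float(i) for i in range(50)],
    "Plan_Type": ["postpaid", "prepaid"] * 25,
    "Fraud": [1 if i % 5 == 0 else 0 for i in range(50)],
})

# drop_first=True keeps a single indicator column for a two-level category.
encoded = pd.get_dummies(toy, columns=["Plan_Type"], drop_first=True)
print(encoded.columns.tolist())

toy_X = encoded.drop("Fraud", axis=1)
toy_y = encoded["Fraud"]

# stratify=toy_y keeps the fraud share the same in both splits.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42, stratify=toy_y
)
print(toy_y.mean(), toy_y_train.mean(), toy_y_test.mean())
```

Without stratification, a rare class can end up over- or under-represented in the test set purely by chance, which makes the evaluation metrics harder to trust.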


## Model Training
We will train two ensemble classifiers: a Random Forest with class weights and a Balanced Random Forest. Each model is fitted on the training data here; predictions on the test data follow in the evaluation section.

In [49]:
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

# Both models are tree-based ensemble classifiers. Ensemble learning trains
# multiple models (often referred to as "weak learners") and combines their
# predictions, so the ensemble can perform better than any individual model.
# This typically improves accuracy, robustness, and generalizability.
# Tree ensembles handle non-linear relationships and are robust to outliers;
# class weights (Random Forest) and balanced resampling (Balanced Random
# Forest) are used here to cope with the imbalanced fraud labels.

# Train a Random Forest classifier with class weights
model_1 = RandomForestClassifier(n_estimators=100, random_state=42, class_weight={0: 1, 1: 10})
model_1.fit(X_train, y_train)

# Initialize and train the BalancedRandomForestClassifier with fine-tuned hyperparameters
model_2 = BalancedRandomForestClassifier(
    random_state=42,
    n_estimators=200,
    min_samples_split=10,
    min_samples_leaf=2,
    max_features='sqrt',
    max_depth=50,
    bootstrap=True,
    sampling_strategy='all',  # Set to 'all' to adopt future behavior
    replacement=True  # Set to 'True' to silence the warning
)
model_2.fit(X_train, y_train)
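The fixed weights `{0: 1, 1: 10}` passed to `model_1` make an error on a fraud case count ten times as much as an error on a legitimate one. Whether that ratio suits the data depends on the actual fraud share, which the notebook does not print. As a rough sketch, scikit-learn's `compute_class_weight` shows what the `'balanced'` heuristic would choose for a hypothetical label vector with 5% fraud:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical label vector with 5% fraud; the notebook's real ratio is not shown.
y_toy = np.array([0] * 95 + [1] * 5)

# 'balanced' weights follow n_samples / (n_classes * count_per_class).
classes = np.array([0, 1])
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
balanced_weights = dict(zip(classes.tolist(), balanced.tolist()))
print(balanced_weights)

# Relative penalty on a misclassified fraud case versus a legitimate one.
balanced_ratio = balanced_weights[1] / balanced_weights[0]
manual_ratio = 10 / 1  # from class_weight={0: 1, 1: 10} in the notebook
print(balanced_ratio, manual_ratio)
```

Comparing the two ratios on the real training labels is a quick way to see whether the hand-picked weights are more or less aggressive than the balanced heuristic.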

## Model Evaluation
We will evaluate both models using a confusion matrix, a classification report, and the accuracy score.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Make predictions on the test set with RandomForestClassifier
y_pred_1 = model_1.predict(X_test)
# Evaluate the model
print("---------------")
print("RandomForestClassifier Results:")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_1))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_1))
print("\nAccuracy Score:")
print(accuracy_score(y_test, y_pred_1))
print("---------------")

# Make predictions on the test set with BalancedRandomForestClassifier
y_pred_2 = model_2.predict(X_test)
# Evaluate the model
conf_matrix = confusion_matrix(y_test, y_pred_2)
class_report = classification_report(y_test, y_pred_2)
acc_score = accuracy_score(y_test, y_pred_2)
print("---------------")
print("BalancedRandomForestClassifier Results:")
print("Confusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)
print("\nAccuracy Score:")
print(acc_score)
print("---------------")
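When reading these reports for imbalanced data, accuracy alone can be misleading, so precision and recall on the fraud class deserve the most attention. The sketch below uses made-up labels (not results from this notebook) to show how the three numbers are derived from the confusion matrix.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 100 events, 10 of them fraudulent (not the notebook's results).
y_true = [1] * 10 + [0] * 90
# A model that catches 6 of the 10 frauds and raises 4 false alarms.
y_hat = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 86

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # share of flagged events that are really fraud
recall = tp / (tp + fn)      # share of fraud events that were flagged
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(precision, recall, accuracy)

# A baseline that never flags fraud still reaches high accuracy on this data.
baseline_accuracy = accuracy_score(y_true, [0] * 100)
print(baseline_accuracy)
```

In this toy case the model and the do-nothing baseline differ by only two points of accuracy, while recall differs by sixty points, which is why the classification report is the more informative output.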

## Conclusion
In this notebook, we built two machine learning models to detect fraudulent telecom events using a synthetic dataset. The Random Forest classifiers showed good performance in identifying fraud. Further steps could involve additional hyperparameter tuning, feature engineering, and testing with real-world data.
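As one possible shape for the hyperparameter tuning step, a randomized search can be run with scikit-learn's `RandomizedSearchCV`. This is only a sketch: the data comes from `make_classification`, the search space is hypothetical, and it is not how the values used in this notebook were obtained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Imbalanced synthetic data (about 10% positives) standing in for the fraud data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=42
)

# Hypothetical search space; the ranges are illustrative, not tuned values.
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42, class_weight="balanced"),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled parameter combinations
    scoring="f1",      # F1 on the positive class instead of plain accuracy
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
print(search.best_score_)
```

Scoring on F1 (or recall) rather than accuracy keeps the search focused on the rare fraud class.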

## Saving the Model
Finally, we will save both trained models to files for future use.

In [None]:
import os
import joblib

# Make sure the output directory exists before writing to it
os.makedirs("models", exist_ok=True)

# Save model 1
model1_path = "models/telecom_revenueassurance_model1.pkl"
joblib.dump(model_1, model1_path)
print(f"Model-1: RandomForestClassifier saved to {model1_path}")

# Save model 2
model2_path = "models/telecom_revenueassurance_model2.pkl"
joblib.dump(model_2, model2_path)
print(f"Model-2: BalancedRandomForestClassifier saved to {model2_path}")
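To reuse a saved model later, it can be read back with `joblib.load`. The sketch below round-trips a small stand-in model through a temporary directory; the data, the model, and the file name `demo_model.pkl` are hypothetical and only illustrate the save and load calls.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem standing in for the notebook's data.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
demo_model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(demo_model, demo_path)
    restored = joblib.load(demo_path)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X_demo) == demo_model.predict(X_demo)).all())
print(same_predictions)
```

A loaded model expects the same feature columns, in the same order, as the data it was trained on, including the indicator column created by `pd.get_dummies`. Pickle-based files should only be loaded from trusted sources and with library versions compatible with those used for saving.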