# Telecom Revenue Assurance AI Model (El Classico with Ensemble)
Author: Fatih E. NAR

## Introduction
In this notebook, we showcase a machine learning model that detects fraudulent use of telecom services. We use a synthetic dataset with features relevant to telco user activity, such as call duration, data usage, and SMS count. The goal is to identify fraudulent events accurately and so improve revenue assurance processes in the telco domain.

## Data Loading and Exploration
We will start by loading and exploring the synthetic dataset to understand its structure and the distribution of features.

In [1]:
## Install dependencies
%pip install -r requirements.txt

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
import lzma
import shutil
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

# Extract the .xz file
with lzma.open('data/telecom_revass_data.csv.xz', 'rb') as f_in:
    with open('data/telecom_revass_data.csv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# Load the synthetic telecom data
data_path = "data/telecom_revass_data.csv"
data = pd.read_csv(data_path)

# Display basic information about the dataset
data.info()

# Display the first few rows of the dataset
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 13 columns):
 #   Column                      Non-Null Count    Dtype  
---  ------                      --------------    -----  
 0   Call_Duration               1000000 non-null  float64
 1   Data_Usage                  1000000 non-null  float64
 2   Sms_Count                   1000000 non-null  int64  
 3   Roaming_Indicator           1000000 non-null  int64  
 4   MobileWallet_Use            1000000 non-null  int64  
 5   Plan_Type                   1000000 non-null  object 
 6   Cost                        1000000 non-null  float64
 7   Cellular_Location_Distance  1000000 non-null  float64
 8   Personal_Pin_Used           1000000 non-null  int64  
 9   Avg_Call_Duration           1000000 non-null  float64
 10  Avg_Data_Usage              1000000 non-null  float64
 11  Avg_Cost                    1000000 non-null  float64
 12  Fraud                       1000000 non-null  int64  
dtypes: float64(7), int64(5), object(1)

| | Call_Duration | Data_Usage | Sms_Count | Roaming_Indicator | MobileWallet_Use | Plan_Type | Cost | Cellular_Location_Distance | Personal_Pin_Used | Avg_Call_Duration | Avg_Data_Usage | Avg_Cost | Fraud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.692681 | 539.146554 | 1 | 0 | 0 | postpaid | 56.99117 | 3.629675 | 0 | 1.460109 | 336.312984 | 71.603238 | 0 |
| 1 | 30.101214 | 247.225104 | 4 | 0 | 0 | postpaid | 9.718698 | 3.654629 | 0 | 30.817472 | 150.96959 | -3.794503 | 0 |
| 2 | 13.167457 | 117.971674 | 3 | 0 | 0 | postpaid | 10.770598 | 2.506765 | 0 | 13.554912 | 79.394244 | 4.58132 | 0 |
| 3 | 9.129426 | 411.883231 | 4 | 0 | 0 | postpaid | 5.88596 | 0.098861 | 0 | 7.990501 | 317.191998 | 24.955166 | 0 |
| 4 | 1.696249 | 1134.432099 | 4 | 0 | 0 | prepaid | 49.433863 | 4.20457 | 0 | 0.159457 | 1073.58526 | 43.51732 | 0 |
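
Fraud labels are often imbalanced, so it can help to look at the class distribution before modeling. The sketch below uses a small hypothetical frame (`toy`) in place of `data`; only the `Fraud` column name comes from the dataset above, and the values are made up.

```python
import pandas as pd

# Hypothetical stand-in for `data`; only the `Fraud` column name is from the notebook.
toy = pd.DataFrame({"Fraud": [0, 0, 0, 1, 0, 1, 0, 0]})

# Absolute counts and relative shares of each class.
class_counts = toy["Fraud"].value_counts()
class_share = toy["Fraud"].value_counts(normalize=True)

print(class_counts)
print(class_share.round(3))
```

Running the same two calls on `data["Fraud"]` would show how skewed the real label distribution is.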


## Data Preprocessing
Before training the model, we need to preprocess the data. This includes checking for missing values, one-hot encoding the categorical `Plan_Type` column, and splitting the data into stratified training (70%) and testing (30%) sets.

In [3]:
from sklearn.model_selection import train_test_split

# Check for missing values
missing_values = data.isnull().sum()
print("Missing values in each column:", missing_values)

# Convert categorical variables to numeric
data = pd.get_dummies(data, columns=['Plan_Type'], drop_first=True)

# Split the data into features and target variable
X = data.drop('Fraud', axis=1)
y = data['Fraud']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

Missing values in each column: Call_Duration                 0
Data_Usage                    0
Sms_Count                     0
Roaming_Indicator             0
MobileWallet_Use              0
Plan_Type                     0
Cost                          0
Cellular_Location_Distance    0
Personal_Pin_Used             0
Avg_Call_Duration             0
Avg_Data_Usage                0
Avg_Cost                      0
Fraud                         0
dtype: int64
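
To make the encoding and the stratified split concrete, here is a minimal sketch on a hypothetical ten-row frame. The column names mirror the dataset, but the values and the 50/50 split size are assumptions chosen to keep the example small.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame; column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "Cost": [5.0, 9.5, 12.0, 3.2, 7.7, 20.1, 1.4, 8.8, 15.0, 6.3],
    "Plan_Type": ["postpaid", "prepaid"] * 5,
    "Fraud": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
})

# drop_first=True drops the first category's indicator column;
# with two plan types, only `Plan_Type_prepaid` remains.
encoded = pd.get_dummies(toy, columns=["Plan_Type"], drop_first=True)

X_toy = encoded.drop("Fraud", axis=1)
y_toy = encoded["Fraud"]

# stratify=y_toy keeps the fraud share (40% here) the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=42, stratify=y_toy
)

print(list(X_toy.columns))
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, a random split of a small or heavily imbalanced dataset could leave one side with noticeably fewer fraud cases than the other.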


## Model Training
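We will train a `BalancedRandomForestClassifier` from imbalanced-learn, a Random Forest variant that balances the classes in the sample each tree is trained on. This involves fitting the model on the training data and then making predictions on the test data.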

In [4]:
# pandas and train_test_split were already imported in earlier cells
from sklearn.model_selection import RandomizedSearchCV
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [5]:
# Initialize and train the BalancedRandomForestClassifier with fine-tuned hyperparameters
model = BalancedRandomForestClassifier(
    random_state=42,
    n_estimators=200,
    min_samples_split=10,
    min_samples_leaf=2,
    max_features='sqrt',
    max_depth=50,
    bootstrap=True,
    sampling_strategy='all',  # Resample every class; set explicitly to adopt the future default
    replacement=True  # Sample with replacement; set explicitly to silence the default-change warning
)
model.fit(X_train, y_train)
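
The balancing idea behind this classifier can be illustrated with a rough NumPy sketch: for each tree, every class is resampled to the size of the smallest class. This is not imbalanced-learn's implementation; `balanced_bootstrap_indices` is a hypothetical helper and the labels are made up.

```python
import numpy as np

def balanced_bootstrap_indices(y, rng):
    """Return row indices with an equal number of samples per class.

    Hypothetical helper: every class is resampled (with replacement)
    down to the size of the smallest class.
    """
    classes, counts = np.unique(y, return_counts=True)
    n_per_class = counts.min()
    picked = [
        rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=True)
        for c in classes
    ]
    return np.concatenate(picked)

rng = np.random.default_rng(42)
y_demo = np.array([0] * 90 + [1] * 10)  # made-up labels, 10% positive

idx = balanced_bootstrap_indices(y_demo, rng)
print(np.bincount(y_demo[idx]))  # equal counts for both classes
```

Each tree then sees a balanced sample, so the majority class does not dominate the splits.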

## Model Evaluation
We will evaluate the model's performance using metrics such as confusion matrix, classification report, and accuracy score.

In [6]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Make predictions on the test set with BalancedRandomForestClassifier
y_pred = model.predict(X_test)
# Evaluate the model
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
acc_score = accuracy_score(y_test, y_pred)
print("---------------")
print("BalancedRandomForestClassifier Results:")
print("Confusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)
print("\nAccuracy Score:")
print(acc_score)
print("---------------")

---------------
BalancedRandomForestClassifier Results:
Confusion Matrix:
[[231323      4]
 [     2  68671]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    231327
           1       1.00      1.00      1.00     68673

    accuracy                           1.00    300000
   macro avg       1.00      1.00      1.00    300000
weighted avg       1.00      1.00      1.00    300000


Accuracy Score:
0.99998
---------------
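
The classification report rounds to two decimals, which hides the few errors that remain. As a small worked example, the fraud-class metrics can be recomputed directly from the confusion matrix counts printed above (rows are actual classes, columns are predicted classes).

```python
# Counts copied from the confusion matrix above (rows: actual, columns: predicted).
tn, fp = 231323, 4
fn, tp = 2, 68671

precision = tp / (tp + fp)  # share of flagged events that were actually fraud
recall = tp / (tp + fn)     # share of fraud events that were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.6f} recall={recall:.6f} f1={f1:.6f} accuracy={accuracy:.5f}")
```

In revenue assurance, the 2 false negatives (missed fraud) and the 4 false positives (legitimate events flagged) typically carry different costs, so it is worth reading both numbers rather than accuracy alone.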


## Conclusion
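In this notebook, we built a machine learning model to detect fraudulent telecom events using a synthetic dataset. On the 300,000-event test set, the BalancedRandomForestClassifier misclassified only 6 events (4 false positives and 2 false negatives), for an accuracy of 0.99998. Since near-perfect scores on synthetic data may not carry over to production traffic, further steps could involve additional hyperparameter tuning, feature engineering, and testing with real-world data.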

## Saving the Model
Finally, we will save the trained model to a file for future use.

In [7]:
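import os

# Save the model together with the training column order
model_path = "models/brfc_model.pkl"
os.makedirs(os.path.dirname(model_path), exist_ok=True)  # create models/ if it is missing
with open(model_path, 'wb') as model_file:
    pickle.dump((model, X_train.columns.tolist()), model_file)
print(f"Revenue Assurance BalancedRandomForestClassifier Model Saved to {model_path}")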

Revenue Assurance BalancedRandomForestClassifier Model Saved to models/brfc_model.pkl
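
Because the file stores a `(model, column list)` tuple, a consumer has to unpack both parts and align incoming data to the saved column order. The sketch below shows that round trip under stated assumptions: a `DummyClassifier` stands in for the trained model, the file goes to a temporary directory, and the toy column names and values are hypothetical.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.dummy import DummyClassifier

# Stand-in for the trained model (assumption: any fitted estimator works the same way).
feature_columns = ["Cost", "Plan_Type_prepaid"]
train = pd.DataFrame({"Cost": [5.0, 60.0, 7.5], "Plan_Type_prepaid": [0, 1, 1]})
stand_in = DummyClassifier(strategy="most_frequent").fit(train[feature_columns], [0, 0, 1])

# Save in the same (model, column list) layout used above, but to a temp file.
path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
with open(path, "wb") as fh:
    pickle.dump((stand_in, feature_columns), fh)

# Load and unpack the tuple.
with open(path, "rb") as fh:
    loaded_model, loaded_columns = pickle.load(fh)

# New data may arrive with a different column order or extra columns;
# reindex to the training column order before predicting.
new_rows = pd.DataFrame(
    {"Plan_Type_prepaid": [1, 0], "Cost": [75.0, 12.5], "Extra": [0, 0]}
)
aligned = new_rows.reindex(columns=loaded_columns, fill_value=0)

preds = loaded_model.predict(aligned)
print(preds)
```

Pickle files can execute arbitrary code when loaded, so only unpickle model files from a trusted source.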
