# Telecom Revenue Assurance AI Model (El Classico with Ensemble)
Author: Fatih E. NAR

## Introduction
In this notebook, we showcase a machine learning model that detects fraudulent use of telecom services. We use a synthetic dataset with features that describe telco user activity, such as call duration, data usage, and SMS count. The goal is to identify fraudulent events accurately and thereby support revenue assurance processes in the telco domain.

## Data Loading and Exploration
We will start by loading and exploring the synthetic dataset to understand its structure and the distribution of features.

In [None]:
## Install dependencies
%pip install -r requirements.txt

In [None]:
import lzma
import shutil
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

# Extract the .xz file
with lzma.open('data/telecom_revass_data.csv.xz', 'rb') as f_in:
    with open('data/telecom_revass_data.csv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# Load the synthetic telecom data
data_path = "data/telecom_revass_data.csv"
data = pd.read_csv(data_path)

# Display basic information about the dataset
data.info()

# Display the first few rows of the dataset
data.head()
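
Beyond `info()` and `head()`, it is usually worth checking how rare the fraud class is before training. The following is a minimal sketch on a toy DataFrame, not the notebook's actual data: only the `Fraud` and `Plan_Type` column names come from this notebook, while the other column names and all values are hypothetical.

```python
import pandas as pd

# Toy stand-in for the real dataset. Only 'Fraud' and 'Plan_Type' are
# column names used in this notebook; the rest are hypothetical.
toy = pd.DataFrame({
    "Call_Duration": [12.5, 3.0, 45.2, 7.8, 120.0, 5.5],
    "Data_Usage": [1.2, 0.4, 3.3, 0.9, 15.0, 0.7],
    "SMS_Count": [10, 2, 25, 4, 300, 3],
    "Plan_Type": ["Prepaid", "Postpaid", "Prepaid", "Postpaid", "Prepaid", "Postpaid"],
    "Fraud": [0, 0, 0, 0, 1, 0],
})

# Share of each class; a small share for Fraud == 1 signals class imbalance
class_share = toy["Fraud"].value_counts(normalize=True)
print(class_share)

# Mean of each numeric feature per class, as a first look at separability
numeric_cols = ["Call_Duration", "Data_Usage", "SMS_Count"]
fraud_means = toy.groupby("Fraud")[numeric_cols].mean()
print(fraud_means)
```

On the real data, the same two calls can be run on `data` directly, provided the numeric column list is adjusted to the actual column names.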

## Data Preprocessing
Before training the model, we need to preprocess the data. This includes checking for missing values, one-hot encoding the categorical `Plan_Type` column, and splitting the data into stratified training and test sets.

In [None]:
from sklearn.model_selection import train_test_split

# Check for missing values
missing_values = data.isnull().sum()
print("Missing values in each column:")
print(missing_values)

# Convert categorical variables to numeric
data = pd.get_dummies(data, columns=['Plan_Type'], drop_first=True)

# Split the data into features and target variable
X = data.drop('Fraud', axis=1)
y = data['Fraud']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
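
To illustrate what the two preprocessing steps above do, here is a small sketch on toy data. The plan names and the `SMS_Count` values are hypothetical. It shows that `drop_first=True` removes the dummy column for the first category and that `stratify` keeps the fraud rate the same in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 20% fraud. Plan names are hypothetical.
toy = pd.DataFrame({
    "SMS_Count": range(20),
    "Plan_Type": ["Basic", "Premium", "Standard", "Basic"] * 5,
    "Fraud": [1, 0, 0, 0, 0] * 4,
})

# One-hot encode Plan_Type; drop_first removes the first category
# ("Basic" here), which then acts as the implicit baseline.
encoded = pd.get_dummies(toy, columns=["Plan_Type"], drop_first=True)
print(encoded.columns.tolist())

X_toy = encoded.drop("Fraud", axis=1)
y_toy = encoded["Fraud"]

# stratify=y_toy preserves the class ratio in both halves
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=42, stratify=y_toy
)
print("fraud rate train:", y_tr.mean(), "test:", y_te.mean())
```

Without `stratify`, a rare class can end up under-represented or entirely missing in the test split, which makes the evaluation metrics unreliable.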

## Model Training
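We will use `BalancedRandomForestClassifier` from imbalanced-learn, a random forest variant that resamples each bootstrap sample to balance the classes. In this section we fit the model on the training data; predictions on the test data follow in the evaluation section.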

In [None]:
# pandas and train_test_split are already imported in the cells above
from sklearn.model_selection import RandomizedSearchCV
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [None]:
# Initialize and train the BalancedRandomForestClassifier with tuned hyperparameters
model = BalancedRandomForestClassifier(
    random_state=42,
    n_estimators=200,
    min_samples_split=10,
    min_samples_leaf=2,
    max_features='sqrt',
    max_depth=50,
    bootstrap=True,
    sampling_strategy='all',  # Resample all classes; set explicitly to adopt the future default behavior
    replacement=True  # Sample with replacement; set explicitly to silence the warning
)
model.fit(X_train, y_train)
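
The notebook imports `RandomizedSearchCV` but does not show how the hyperparameters above were chosen. One common way to arrive at such values is a randomized search, sketched below under several assumptions: the data is generated with `make_classification`, scikit-learn's `RandomForestClassifier` with `class_weight="balanced"` stands in for the balanced forest, and the search space and `f1` scoring are illustrative choices rather than the author's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in data (roughly 10% positives)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=42
)

# Illustrative search space; kept small so the sketch runs quickly
param_distributions = {
    "n_estimators": [20, 50],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 2],
}

search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_distributions=param_distributions,
    n_iter=4,          # number of random parameter combinations to try
    scoring="f1",      # F1 on the positive class instead of accuracy
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

To apply the same idea here, the estimator would be swapped for `BalancedRandomForestClassifier` and the search fitted on `X_train` and `y_train` only, so that the test set stays untouched until the final evaluation.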

## Model Evaluation
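We will evaluate the model's performance using a confusion matrix, a classification report, and the accuracy score. Because fraud is typically the rare class, accuracy should be read together with the per-class precision and recall in the classification report.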

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Make predictions on the test set with BalancedRandomForestClassifier
y_pred = model.predict(X_test)
# Evaluate the model
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
acc_score = accuracy_score(y_test, y_pred)
print("---------------")
print("BalancedRandomForestClassifier Results:")
print("Confusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)
print("\nAccuracy Score:")
print(acc_score)
print("---------------")
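
To make the relationship between these metrics concrete, the sketch below uses ten hand-written labels (hypothetical, not taken from the dataset). It shows that a model can reach the same accuracy as a baseline that never predicts fraud, which is why precision and recall on the fraud class matter.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Ten hypothetical events: 8 legitimate (0) and 2 fraudulent (1)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)  # 7 1 1 1

accuracy = accuracy_score(y_true, y_hat)    # (TP + TN) / total = 0.8
precision = precision_score(y_true, y_hat)  # TP / (TP + FP) = 0.5
recall = recall_score(y_true, y_hat)        # TP / (TP + FN) = 0.5
print(accuracy, precision, recall)

# A baseline that never flags fraud reaches the same accuracy
baseline_accuracy = accuracy_score(y_true, [0] * len(y_true))
print(baseline_accuracy)  # 0.8
```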

## Conclusion
In this notebook, we built a machine learning model to detect fraudulent telecom events using a synthetic dataset. The `BalancedRandomForestClassifier` showed good performance in identifying fraud on the held-out synthetic test set, as summarized by the confusion matrix and classification report above. Further steps could involve additional hyperparameter tuning, feature engineering, and testing with real-world data.

## Saving the Model
Finally, we will save the trained model, together with the list of training feature names, to a pickle file for future use.

In [None]:
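import os

# Save the model together with the training feature names
model_path = "models/brfc_model.pkl"
os.makedirs(os.path.dirname(model_path), exist_ok=True)  # Create models/ if it does not exist
with open(model_path, 'wb') as model_file:
    pickle.dump((model, X_train.columns.tolist()), model_file)
print(f"Revenue Assurance BalancedRandomForestClassifier model saved to {model_path}")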
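
Since the file stores a `(model, feature_names)` tuple, loading it later means unpacking both parts and aligning new data to the saved column order. The sketch below shows that round trip with a small stand-in model and a temporary file. The data, the column names, and the use of scikit-learn's `RandomForestClassifier` are assumptions made for illustration only.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on hypothetical features
train = pd.DataFrame({
    "SMS_Count": [1, 2, 50, 60],
    "Data_Usage": [0.1, 0.2, 9.0, 8.5],
})
labels = [0, 0, 1, 1]
clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(train, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")

    # Save in the same (model, feature_names) layout used in this notebook
    with open(path, "wb") as fh:
        pickle.dump((clf, train.columns.tolist()), fh)

    # Load and unpack both parts
    with open(path, "rb") as fh:
        loaded_model, feature_names = pickle.load(fh)

# New rows may arrive with columns in a different order;
# reindex puts them into the order the model was trained on.
new_rows = pd.DataFrame({"Data_Usage": [0.15, 9.2], "SMS_Count": [2, 55]})
aligned = new_rows.reindex(columns=feature_names, fill_value=0)

predictions = loaded_model.predict(aligned)
print(feature_names)
print(predictions)
```

Note that `pickle.load` can execute arbitrary code, so model files should only be loaded from trusted sources.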