# Megaline Phone Plan Recommendation Model

# Contents <a id='back'></a>

* [Introduction](#introduction)
* [Data Overview](#data_overview)
    * [Initialization](#initialization)
    * [Load Data](#load_data)
* [Prepare the Data](#prepare_data)
* [Model](#model)
    * [Random Forest](#random_forest)
        * [Initial Model](#initial_model_random_forest)
        * [Hyperparameters](#hyperparameters_random_forest)
    * [Decision Tree](#decision_tree)
    * [Logistic Regression](#logistic_regression)
    * [Final Model](#final_model)
    * [Sanity Check](#sanity_check)
* [Conclusion](#conclusion)

# Introduction <a id='introduction'></a>

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.

The model is built on behavior data (`/datasets/users_behavior.csv`) for subscribers who have already switched to the new plans; the dataset comes from the project for the Statistical Data Analysis course. Because preprocessing was completed in that project, the work here moves straight to building the model.

The goal is to develop a model with the highest possible accuracy. For this project, the accuracy threshold is 0.75, and accuracy is checked on the test dataset.

[Back to Contents](#back)

# Data Overview <a id='data_overview'></a>

## Initialization <a id='initialization'></a> <a class="tocSkip">

In [1]:
# Loading all the libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## Load data <a id='load_data'></a> <a class="tocSkip">

In [2]:
# Reading the dataframe file and storing it to users_behavior_df
users_behavior_df = pd.read_csv('/datasets/users_behavior.csv')

# Prepare the data <a id='prepare_data'></a>

In [3]:
# Print the general/summary information about the DataFrame
users_behavior_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
# Print a sample of the data
display(users_behavior_df.head())

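|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |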


This dataset was already preprocessed in the Statistical Data Analysis project: all five columns are numeric and have no missing values, so no further preprocessing is needed.
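A quick way to confirm that no further cleaning is needed is to count missing values and duplicate rows. The sketch below runs those checks on a small stand-in frame built from the five sample rows shown above; in the notebook the same calls would be applied to `users_behavior_df`. The names `sample_df`, `missing_total` and `duplicate_rows` are illustrative.

```python
import pandas as pd

# Stand-in frame built from the five sample rows displayed above
sample_df = pd.DataFrame(
    {
        "calls": [40.0, 85.0, 77.0, 106.0, 66.0],
        "minutes": [311.9, 516.75, 467.66, 745.53, 418.74],
        "messages": [83.0, 56.0, 86.0, 81.0, 1.0],
        "mb_used": [19915.42, 22696.96, 21060.45, 8437.39, 14502.75],
        "is_ultra": [0, 0, 0, 1, 0],
    }
)

# Total number of missing cells across all columns
missing_total = int(sample_df.isna().sum().sum())

# Number of rows that exactly repeat an earlier row
duplicate_rows = int(sample_df.duplicated().sum())

# The target should contain only the two plan labels, 0 and 1
target_values = sorted(sample_df["is_ultra"].unique().tolist())

print("Missing cells:", missing_total)
print("Duplicate rows:", duplicate_rows)
print("Target values:", target_values)
```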

# Model <a id='model'></a>

## Random Forest <a id='random_forest'></a> <a class="tocSkip">

### Initial Model <a id='initial_model_random_forest'></a> <a class="tocSkip">

In [5]:
# Split the data into features (X) and the target variable (y), where y is the 'is_ultra' column
X = users_behavior_df.drop('is_ultra', axis=1)
y = users_behavior_df['is_ultra']

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=12345)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=12345)

# Train the initial random forest (100 trees)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Validation set evaluation
y_val_pred = model.predict(X_val)
val_accuracy = accuracy_score(y_val, y_val_pred)
print("Validation Accuracy:", val_accuracy)

# Test set evaluation
y_test_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
print("Test Accuracy:", test_accuracy)

Validation Accuracy: 0.7838258164852255
Test Accuracy: 0.7869362363919129


Since only one source dataset is provided, it is split into three parts: training, validation, and test. By convention, the validation and test sets are the same size. Here `train_test_split` is first called with `test_size=0.4` to separate the training set from a combined validation-test set, and then with `test_size=0.5` to halve that combined set. The result is a 60-20-20 train-validation-test split. With this split, the model meets the required accuracy threshold of 0.75 on both the validation and test sets.

A `RandomForestClassifier` is used as the starting model because an ensemble of trees often achieves higher accuracy than a single decision tree or a linear model on tabular data. To check whether it is the best choice for this dataset, other classifiers and hyperparameter settings are tried as well.

For reproducibility, the `random_state` passed to `train_test_split` is fixed at 12345 in every experiment, so models that use the same split ratio are compared on identical rows. The classifiers' own `random_state` is not uniform across cells: it is 42 in some and 12345 in others.
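The two-step split can be checked on a stand-in array of the same length. This is a minimal sketch: the `three_way_split` helper and the placeholder data are illustrative and not part of the project code; only the row count (3214) and the `test_size` and `random_state` values come from the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def three_way_split(X, y, holdout=0.4, random_state=12345):
    """Split into train / validation / test, halving the held-out part."""
    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y, test_size=holdout, random_state=random_state
    )
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp, test_size=0.5, random_state=random_state
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Placeholder data with the same number of rows as users_behavior_df
X_demo = np.zeros((3214, 4))
y_demo = np.zeros(3214, dtype=int)

X_tr, X_va, X_te, y_tr, y_va, y_te = three_way_split(X_demo, y_demo)
split_sizes = (len(X_tr), len(X_va), len(X_te))
print("train / validation / test rows:", split_sizes)
```

With 3214 rows this gives 1928 training rows and 643 rows each for validation and test.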
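One way to keep the seeds consistent is to define them once and reuse them in every experiment. The sketch below assumes hypothetical constants `SPLIT_SEED` and `MODEL_SEED` and a hypothetical helper `fit_and_score`, and it uses synthetic data from `make_classification` rather than the Megaline dataset. It shows that two forests built with the same seed on the same split produce the same validation accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

SPLIT_SEED = 12345  # seed used by the train_test_split calls in this notebook
MODEL_SEED = 12345  # hypothetical: a single shared seed for every model


def fit_and_score(model, X_train, y_train, X_val, y_val):
    """Fit the model on the training set and return validation accuracy."""
    model.fit(X_train, y_train)
    return accuracy_score(y_val, model.predict(X_val))


# Synthetic stand-in for the features and target (not the Megaline data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=SPLIT_SEED
)

# Two separately constructed forests with the same seed
first_score = fit_and_score(
    RandomForestClassifier(n_estimators=100, random_state=MODEL_SEED),
    X_train, y_train, X_val, y_val,
)
second_score = fit_and_score(
    RandomForestClassifier(n_estimators=100, random_state=MODEL_SEED),
    X_train, y_train, X_val, y_val,
)
print("Scores match:", first_score == second_score)
```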

### Changing Hyperparameters <a id='hyperparameters_random_forest'></a> <a class="tocSkip">

In [6]:
# Split the data into features (X) and the target variable (y), where y is the 'is_ultra' column
X = users_behavior_df.drop('is_ultra', axis=1)
y = users_behavior_df['is_ultra']

# Further split the data into training, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=12345)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=12345)

# Train a random forest (100 trees) on the 70-15-15 split
model = RandomForestClassifier(n_estimators=100, random_state=12345)
model.fit(X_train, y_train)

# Validation set evaluation
y_val_pred = model.predict(X_val)
val_accuracy = accuracy_score(y_val, y_val_pred)
print("Validation Accuracy:", val_accuracy)

# Test set evaluation
y_test_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
print("Test Accuracy:", test_accuracy)

Validation Accuracy: 0.7966804979253111
Test Accuracy: 0.7867494824016563


In this cell, `train_test_split` is called with `test_size=0.3` for the initial split into training and a combined validation-test set, which is then halved with `test_size=0.5`. The result is a 70-15-15 train-validation-test split. The forest's `random_state` also changes from 42 to 12345 here, so the split ratio is not the only difference from the initial model. The required accuracy threshold of 0.75 is still met. When comparing the two held-out scores, the test accuracy should ideally be close to the validation accuracy, which indicates that performance on unseen data is consistent.

Here the validation accuracy (0.797) is about one percentage point higher than the test accuracy (0.787). The model is fitted only on the training set and sees neither held-out set, so this gap does not by itself indicate overfitting. With fewer than 500 rows in each held-out set, a difference of this size is within the range that sampling variation alone can produce.

Because the initial model with a 60-20-20 split showed more consistent validation and test accuracies, that ratio is used for the final model.
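One way to gauge whether a validation-test gap is meaningful is to compare it with the sampling error of an accuracy estimate. The sketch below uses the binomial standard error `sqrt(p * (1 - p) / n)`. The set sizes of 482 and 483 rows are an assumption based on how a 15% share of 3214 rows rounds, and the helper name is illustrative.

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples rows."""
    return math.sqrt(accuracy * (1 - accuracy) / n_samples)


# Accuracies reported for the 70-15-15 split; set sizes are assumed
val_accuracy, val_rows = 0.7966804979253111, 482
test_accuracy, test_rows = 0.7867494824016563, 483

gap = val_accuracy - test_accuracy

# Standard error of the difference between two independent estimates
gap_se = math.sqrt(
    accuracy_standard_error(val_accuracy, val_rows) ** 2
    + accuracy_standard_error(test_accuracy, test_rows) ** 2
)

print(f"gap = {gap:.4f}, standard error of gap = {gap_se:.4f}")
```

The gap of about 0.01 is smaller than one standard error (about 0.026), so it is compatible with sampling noise.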

In [7]:
# Split the data into features (X) and the target variable (y), where y is the 'is_ultra' column
X = users_behavior_df.drop('is_ultra', axis=1)
y = users_behavior_df['is_ultra']

# Further split the data into training, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=12345)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=12345)

# Train a random forest with 500 trees
model = RandomForestClassifier(n_estimators=500, random_state=12345)
model.fit(X_train, y_train)

# Validation set evaluation
y_val_pred = model.predict(X_val)
val_accuracy = accuracy_score(y_val, y_val_pred)
print("Validation Accuracy:", val_accuracy)

# Test set evaluation
y_test_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
print("Test Accuracy:", test_accuracy)

Validation Accuracy: 0.7946058091286307
Test Accuracy: 0.7867494824016563


The only change from the previous cell is `n_estimators`, raised from 100 to 500 (still on the 70-15-15 split). Adding trees generally makes a forest's predictions more stable, but with diminishing returns, while training time grows roughly in proportion to the number of trees. Here the validation accuracy moved from 0.797 to 0.795 and the test accuracy did not change, so `n_estimators=100` is kept for its shorter training time.
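The same comparison can be run for several forest sizes in one loop. This is a sketch on synthetic data from `make_classification`, not the Megaline dataset, and the tree counts are arbitrary choices; the scores it prints say nothing about the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and target
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=12345
)

# Validation accuracy for each forest size, same seed and same split
val_scores = {}
for n_trees in (10, 50, 100, 200):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=12345)
    forest.fit(X_train, y_train)
    val_scores[n_trees] = accuracy_score(y_val, forest.predict(X_val))

for n_trees, score in val_scores.items():
    print(f"n_estimators={n_trees}: validation accuracy {score:.3f}")
```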

## Decision Tree <a id='decision_tree'></a> <a class="tocSkip">

In [8]:
# Split the data into features (X) and the target variable (y), where y is the 'is_ultra' column
X = users_behavior_df.drop('is_ultra', axis=1)
y = users_behavior_df['is_ultra']

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=12345)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=12345)

# Train a decision tree with default hyperparameters
model = DecisionTreeClassifier(random_state=12345)
model.fit(X_train, y_train)

# Validation set evaluation
y_val_pred = model.predict(X_val)
val_accuracy = accuracy_score(y_val, y_val_pred)
print("Validation Accuracy:", val_accuracy)

# Test set evaluation
y_test_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
print("Test Accuracy:", test_accuracy)

Validation Accuracy: 0.713841368584759
Test Accuracy: 0.7309486780715396


To check that the random forest is the best model for this application, a decision tree was also tested. With default hyperparameters it does not reach the accuracy threshold of 0.75 (0.714 on validation, 0.731 on test). The decision tree is not tuned further, since the random forest already meets the threshold and an ensemble of trees usually generalizes better than a single unpruned tree.
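Tuning the tree was not part of this project. If it were attempted, a common first step would be a sweep over `max_depth`, choosing the depth with the best validation accuracy. The sketch below uses synthetic data and an arbitrary depth range of 1 to 10; it makes no claim about whether a tuned tree would reach the threshold on the Megaline data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the features and target
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=12345
)

# Validation accuracy for each maximum depth
val_by_depth = {}
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    tree.fit(X_train, y_train)
    val_by_depth[depth] = accuracy_score(y_val, tree.predict(X_val))

# Depth with the highest validation accuracy (first one wins on ties)
best_depth = max(val_by_depth, key=val_by_depth.get)
print("Best max_depth on validation:", best_depth)
```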

## Logistic Regression <a id='logistic_regression'></a> <a class="tocSkip">

In [9]:
# Split the data into features (X) and the target variable (y), where y is the 'is_ultra' column
X = users_behavior_df.drop('is_ultra', axis=1)
y = users_behavior_df['is_ultra']

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=12345)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=12345)

# Train a logistic regression with default hyperparameters
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# Validation set evaluation
y_val_pred = model.predict(X_val)
val_accuracy = accuracy_score(y_val, y_val_pred)
print("Validation Accuracy:", val_accuracy)

# Test set evaluation
y_test_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
print("Test Accuracy:", test_accuracy)

Validation Accuracy: 0.7107309486780715
Test Accuracy: 0.6842923794712286


To further confirm that the random forest is the best model for this application, a logistic regression was also tested. It does not reach the accuracy threshold of 0.75 either (0.711 on validation, 0.684 on test). Its test accuracy equals the share of the majority class in the test set (0.684, see the sanity check), so on that set it does no better than always predicting the Smart plan. The logistic regression is not tuned further, since the random forest already meets the threshold.
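The features here are on very different scales (`mb_used` is in the tens of thousands, `calls` in the tens), which can slow or hinder the solver used by `LogisticRegression`. A common remedy, not tried in this project, is to standardize the features inside a pipeline. The sketch below uses synthetic data with one column deliberately inflated to mimic that scale difference; it does not show whether scaling would help on the Megaline data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and target
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
# Inflate one column to imitate a feature measured in much larger units
X_demo[:, 3] *= 10_000

X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=12345
)

# The scaler is fitted on the training set only, then applied to validation
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
scaled_logreg.fit(X_train, y_train)

val_accuracy_scaled = accuracy_score(y_val, scaled_logreg.predict(X_val))
print("Validation accuracy with scaling:", val_accuracy_scaled)
```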

## Final Model <a id='final_model'></a> <a class="tocSkip">

In [10]:
# Split the data into features (X) and the target variable (y), where y is the 'is_ultra' column
X = users_behavior_df.drop('is_ultra', axis=1)
y = users_behavior_df['is_ultra']

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=12345)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=12345)

# Train the final random forest (100 trees)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Validation set evaluation
y_val_pred = model.predict(X_val)
val_accuracy = accuracy_score(y_val, y_val_pred)
print("Validation Accuracy:", val_accuracy)

# Test set evaluation
y_test_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
print("Test Accuracy:", test_accuracy)

Validation Accuracy: 0.7838258164852255
Test Accuracy: 0.7869362363919129


After testing different models and hyperparameters, the random forest above (`n_estimators=100`, 60-20-20 split) was selected as the final model. The random forest is the only model type tested that clears the 0.75 accuracy threshold.

The model's validation accuracy is around 0.78, meaning it correctly predicted 78% of the labels (in this case, whether a user has the "Ultra" plan or not) in the validation dataset.

The model's test accuracy is around 0.79, indicating that it correctly predicted 79% of the labels in the test dataset.
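Accuracy here is simply the share of predictions that match the true labels. The sketch below computes it by hand on a toy pair of label lists (made up for illustration); the `accuracy` helper mirrors what `accuracy_score` returns for such inputs.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(true == pred for true, pred in zip(y_true, y_pred))
    return matches / len(y_true)


# Toy labels: 1 stands for Ultra, 0 for Smart
y_true_demo = [0, 0, 1, 1, 0]
y_pred_demo = [0, 1, 1, 0, 0]

# Three of the five predictions match the true labels
demo_accuracy = accuracy(y_true_demo, y_pred_demo)
print("Accuracy:", demo_accuracy)
```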

## Sanity Check <a id='sanity_check'></a> <a class="tocSkip">

In [11]:
# Calculate the predicted class distribution for validation and test sets
val_pred_distribution = pd.Series(y_val_pred).value_counts(normalize=True)
test_pred_distribution = pd.Series(y_test_pred).value_counts(normalize=True)

# Calculate the actual class distribution in the validation and test sets
val_actual_distribution = y_val.value_counts(normalize=True)
test_actual_distribution = y_test.value_counts(normalize=True)

# Compare the predicted distribution to the actual distribution for validation set
print("Validation Set:")
display("Predicted Distribution:", val_pred_distribution)
display("Actual Distribution:", val_actual_distribution)
print()

# Compare the predicted distribution to the actual distribution for test set
print("Test Set:")
display("Predicted Distribution:", test_pred_distribution)
display("Actual Distribution:", test_actual_distribution)

Validation Set:


'Predicted Distribution:'

0    0.782271
1    0.217729
dtype: float64

'Actual Distribution:'

0    0.706065
1    0.293935
Name: is_ultra, dtype: float64


Test Set:


'Predicted Distribution:'

0    0.741835
1    0.258165
dtype: float64

'Actual Distribution:'

0    0.684292
1    0.315708
Name: is_ultra, dtype: float64

In both the validation and test sets, the predicted class distributions are reasonably close to the actual ones: class 0 accounts for 78% of predictions versus 71% of actual labels in the validation set, and 74% versus 68% in the test set. The predictions therefore broadly follow the underlying data distribution, although the model over-predicts class 0 by roughly 6 to 8 percentage points.

Both the actual and predicted class distributions indicate a class imbalance. Class 0 is the majority class, while Class 1 is the minority class. The model tends to predict more instances as Class 0 ("Smart" plan) than as Class 1 ("Ultra" plan). This aligns with the majority class being more prevalent.
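The actual class shares differ slightly between the validation and test sets because the split is purely random. If equal shares were wanted, `train_test_split` accepts a `stratify` argument. This project does not use it, and using it would change the splits and the reported scores. The sketch below uses placeholder labels with 30% in class 1, not the Megaline data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder labels with 30% in class 1; features are just row indices
y_demo = np.array([1] * 300 + [0] * 700)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Stratify both steps so each subset keeps the overall class ratio
X_train, X_temp, y_train, y_temp = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=12345, stratify=y_demo
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=12345, stratify=y_temp
)

# Share of class 1 in each subset
class1_shares = {
    "train": float(y_train.mean()),
    "validation": float(y_val.mean()),
    "test": float(y_test.mean()),
}
print(class1_shares)
```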

Because the predicted distributions are similar to the actual ones, the model's predictions are plausible in the context of the problem. The model leans toward the majority class but does not collapse to predicting it exclusively: about 22% to 26% of its predictions are class 1. Its accuracy (0.78 on validation, 0.79 on test) is also higher than the 0.71 and 0.68 that always predicting class 0 would score on the same sets.
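A complementary sanity check is to compare the model with a constant baseline that always predicts the majority class. The accuracy of that baseline equals the actual share of class 0. The sketch below uses only the figures reported above; the dictionary layout and variable names are illustrative.

```python
# Figures copied from the outputs above
reported = {
    "validation": {"model_accuracy": 0.7838258164852255, "class0_share": 0.706065},
    "test": {"model_accuracy": 0.7869362363919129, "class0_share": 0.684292},
}

# Margin of the model over a baseline that always predicts class 0
margins = {}
for set_name, figures in reported.items():
    baseline_accuracy = figures["class0_share"]
    margins[set_name] = figures["model_accuracy"] - baseline_accuracy
    print(
        f"{set_name}: model {figures['model_accuracy']:.3f}, "
        f"baseline {baseline_accuracy:.3f}, margin {margins[set_name]:.3f}"
    )
```

The model beats the constant baseline on both sets, by about 8 percentage points on validation and about 10 on test.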

# Conclusion <a id='conclusion'></a>

After experimenting with several models and hyperparameters, the random forest was chosen as the best model for the given data. It achieves a validation accuracy of 78% and a test accuracy of 79%, meeting the project's 75% threshold. A sanity check compared the predicted and actual class distributions: class 0 makes up 78% of predictions versus 71% of actual labels in the validation set, and 74% versus 68% in the test set. The model's predictions are therefore broadly consistent with the underlying data distribution, with a moderate tilt toward the majority class.