# Mobile Carrier Plan Prediction

**Project Description:**

Mobile carrier Megaline has observed that many subscribers are still using legacy plans. They aim to develop a machine learning model that analyzes subscriber behavior and recommends one of their newer plans: Smart or Ultra. This model will be trained on data from subscribers who have already transitioned to the new plans.

**Analysis Goal:**

The goal of this project is to build a classification model that predicts whether a subscriber should be recommended the "Smart" or "Ultra" plan based on their usage patterns, with an accuracy of at least 0.75 on a held-out test set. We compare several classification models, tune their hyperparameters on a validation set, and then measure the chosen model's accuracy on the separate test set.

In [11]:
# Importing the necessary libraries 
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [12]:
# Opening and looking through the data file.
df = pd.read_csv('/datasets/users_behavior.csv')
print(df.head())
print(df.info())
print(df.describe())
print(df['is_ultra'].value_counts())

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB
None
             calls      minutes     messages       mb_used     is_ultra
count  3214.000000  3214.000000  3214.000000   3214.000000  3214.000000
mean     63.038892   438.208787    38.281269  17207.673836     0.306472
std      33.236368   234.569872    36.148326   7570.968246  

In [13]:
# Checking for duplicates
print(f"Number of duplicate rows: {df.duplicated().sum()}")
df = df.drop_duplicates().reset_index(drop=True)

Number of duplicate rows: 0


**Observation:** The data contains no missing values and no duplicate rows. The target variable `is_ultra` is imbalanced: the mean of 0.306 in the `describe()` output means that about 30.6% of users are on the "Ultra" plan (1) and about 69.4% are on the "Smart" plan (0). A model that always predicts "Smart" would therefore already reach roughly 69% accuracy, so every result below should be judged against that baseline, and accuracy alone may hide weak performance on the "Ultra" class.
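One common way to keep the class ratio the same in every subset is a stratified split. The sketch below is not what this notebook does; it uses hypothetical synthetic labels (`X`, `y` are made-up stand-ins) and only illustrates the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 rows, 30% positives (not the Megaline data).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([1] * 300 + [0] * 700)

# stratify=y asks for the same class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345, stratify=y
)

print(f"Share of class 1 in train: {y_train.mean():.3f}")
print(f"Share of class 1 in test:  {y_test.mean():.3f}")
```

Passing `stratify=target` to the splits in this notebook would change which rows land in each subset, so the accuracies reported below would not be reproduced exactly.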

In [14]:
# Splitting the source data into training, validation, and test sets (60/20/20).
features = df.drop('is_ultra', axis=1)
target = df['is_ultra']

features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=12345
)

features_train, features_valid, target_train, target_valid = train_test_split(
    features_train, target_train, test_size=0.25, random_state=12345
)  # 0.25 of the remaining 80% equals 20% of the full dataset.

print(f"Train set size: {len(features_train)}")
print(f"Validation set size: {len(features_valid)}")
print(f"Test set size: {len(features_test)}")


Train set size: 1928
Validation set size: 643
Test set size: 643


**Observation:** The data is split roughly 60/20/20: 1928 rows for training, 643 for validation, and 643 for testing. The validation set is used to compare models and tune hyperparameters; the test set is held back for a single final evaluation of the chosen model.
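The split sizes can be checked by hand. The short calculation below assumes that a fractional `test_size` is rounded up to a whole number of rows; that rounding rule is an assumption here, but it reproduces the sizes printed above.

```python
import math

n_total = 3214  # number of rows reported by df.info()

# Assumption: the held-out part is rounded up to a whole number of rows.
n_test = math.ceil(0.2 * n_total)    # first split: 20% of everything
n_rest = n_total - n_test
n_valid = math.ceil(0.25 * n_rest)   # second split: 25% of the remaining rows
n_train = n_rest - n_valid

print(n_train, n_valid, n_test)
for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {n / n_total:.1%} of the full dataset")
```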

In [15]:
# Investigating the quality of different models by changing hyperparameters.

best_model = None
best_result = 0
best_depth = 0

# Decision Tree
for depth in range(1, 11):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    result = model.score(features_valid, target_valid)
    if result > best_result:
        best_model = model
        best_result = result
        best_depth = depth

print(f"Decision Tree Accuracy: {best_result}, Max Depth: {best_depth}")

Decision Tree Accuracy: 0.7744945567651633, Max Depth: 7


**Decision Tree Performance:** The best Decision Tree reached an accuracy of approximately 77.45% on the validation set at `max_depth=7`. Among the depths tried (1 to 10), this one gave the best validation score: shallower trees likely underfit, while deeper trees did not improve validation accuracy and may have started to overfit the training set. The training accuracy was not recorded, so this reading of the depth curve is not confirmed by the notebook itself.
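To see where a tree starts to overfit, it helps to record training and validation accuracy side by side for each depth. The following is a minimal sketch on hypothetical synthetic data from `make_classification`, not on the Megaline data, so the numbers it prints say nothing about this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data standing in for the four usage features.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=0,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

scores = []  # one (depth, train accuracy, validation accuracy) tuple per depth
for depth in range(1, 11):
    tree = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    tree.fit(X_train, y_train)
    scores.append((depth, tree.score(X_train, y_train), tree.score(X_valid, y_valid)))

for depth, train_acc, valid_acc in scores:
    print(f"depth={depth:2d}  train={train_acc:.3f}  valid={valid_acc:.3f}")
```

A widening gap between the two columns as depth grows is the usual sign of overfitting.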

In [16]:
# Random Forest
best_forest_model = None
best_forest_result = 0
best_est = 0
best_forest_depth = 0

for est in range(10, 51, 10):
    for depth in range(1, 11):
        model = RandomForestClassifier(random_state=12345, n_estimators=est, max_depth=depth)
        model.fit(features_train, target_train)
        result = model.score(features_valid, target_valid)
        if result > best_forest_result:
            best_forest_model = model
            best_forest_result = result
            best_est = est
            best_forest_depth = depth

print(f"Random Forest Accuracy: {best_forest_result}, Estimators: {best_est}, Max Depth: {best_forest_depth}")

Random Forest Accuracy: 0.7978227060653188, Estimators: 50, Max Depth: 10


**Random Forest Performance:** The best Random Forest reached an accuracy of approximately 79.78% on the validation set, about 2.3 percentage points above the best Decision Tree. It was obtained with 50 estimators (`n_estimators`) and a maximum depth of 10 (`max_depth`). Both values sit at the upper edge of the ranges searched, so a wider grid might find a slightly better configuration.
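The nested loops above score each configuration on a single validation split. An alternative, shown here only as a sketch and not as this notebook's method, is a cross-validated search with `GridSearchCV`. The data is hypothetical synthetic data and the grid values are chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic data; kept small so the search runs quickly.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Illustrative grid; a real search could extend past 50 trees and depth 10.
param_grid = {"n_estimators": [10, 30, 50], "max_depth": [2, 6, 10]}

search = GridSearchCV(
    RandomForestClassifier(random_state=12345),
    param_grid,
    scoring="accuracy",
    cv=3,  # 3-fold cross-validation instead of one fixed validation set
)
search.fit(X, y)

print(f"Best parameters: {search.best_params_}")
print(f"Mean cross-validated accuracy: {search.best_score_:.3f}")
```

Cross-validation averages over several splits, which makes the comparison between configurations less sensitive to one particular validation set, at the cost of more training runs.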

In [17]:
# Logistic Regression
logistic_model = LogisticRegression(random_state=12345, solver='liblinear')
logistic_model.fit(features_train, target_train)
logistic_result = logistic_model.score(features_valid, target_valid)
print(f"Logistic Regression Accuracy: {logistic_result}")

Logistic Regression Accuracy: 0.7293934681181959


**Logistic Regression Performance:** The Logistic Regression model reached an accuracy of approximately 72.94% on the validation set, below the project threshold of 0.75 and below both tree-based models. One possible explanation is that the relationship between the features and the target is non-linear, which a linear model cannot fully represent. Another is that the features were not standardized even though they sit on very different scales (for example, `messages` versus `mb_used`), which can affect a regularized linear model.
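Whether scaling matters can be checked by fitting the same model with and without a `StandardScaler` step. This is a sketch under stated assumptions: the data is hypothetical synthetic data, and the column multipliers are invented to loosely mimic counts versus megabytes. It makes no claim about how scaling would change the result on the Megaline data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data with four features.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
# Invented multipliers that put the columns on very different scales.
X = X * np.array([30.0, 200.0, 30.0, 7000.0])

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

models = {
    "unscaled": LogisticRegression(random_state=12345, solver="liblinear"),
    "scaled": make_pipeline(
        StandardScaler(),
        LogisticRegression(random_state=12345, solver="liblinear"),
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = model.score(X_valid, y_valid)
    print(f"{name}: validation accuracy {results[name]:.3f}")
```

Wrapping the scaler in a pipeline ensures it is fitted on the training rows only, so no information from the validation set leaks into the preprocessing.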

In [18]:
# Checking the quality of the model using the test set.

# Pick the model with the highest validation accuracy.
# max() returns the first maximum, so ties resolve in the order listed here.
candidates = {
    "Random Forest": (best_forest_result, best_forest_model),
    "Decision Tree": (best_result, best_model),
    "Logistic Regression": (logistic_result, logistic_model),
}
final_name = max(candidates, key=lambda name: candidates[name][0])
final_model = candidates[final_name][1]
print(f"{final_name} chosen as the final model.")
    
test_accuracy = final_model.score(features_test, target_test)
print(f"Test Accuracy: {test_accuracy}")

Random Forest chosen as the final model.
Test Accuracy: 0.7993779160186625


**Final Model Performance:** The Random Forest model, selected for its validation accuracy, reached a test accuracy of approximately 79.94%. This is close to its validation accuracy of 79.78%, which suggests the model was not overfit to the validation set, and it exceeds the project's accuracy threshold of 0.75.

In [19]:
# Sanity check.

# Baseline: a classifier that ignores the features and always predicts the majority class.

from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(features_train, target_train)
dummy_pred = dummy_clf.predict(features_test)
dummy_accuracy = accuracy_score(target_test, dummy_pred)

print(f"Dummy Classifier (most frequent) Accuracy: {dummy_accuracy}")

Dummy Classifier (most frequent) Accuracy: 0.6951788491446346


**Dummy Classifier Baseline:** The Dummy Classifier with the "most frequent" strategy reached an accuracy of approximately 69.52% on the test set. This is the accuracy obtained by predicting the majority class ("Smart") for every user, so it equals the share of "Smart" users in the test set. The Random Forest's test accuracy of approximately 79.94% is about 10.4 percentage points higher, which indicates that the model has learned patterns from the features rather than simply predicting the most frequent class.
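The link between the baseline accuracy and the class share can be seen on a tiny example. The labels below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: class 0 is the majority in the training data.
y_train = np.array([0] * 70 + [1] * 30)
y_test = np.array([0] * 13 + [1] * 7)

# The features are irrelevant to this baseline, so zeros are enough.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)

baseline_accuracy = dummy.score(X_test, y_test)
majority_share = (y_test == 0).mean()

print(f"Baseline accuracy:          {baseline_accuracy:.2f}")
print(f"Share of class 0 in y_test: {majority_share:.2f}")
```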

In [20]:
# Comparing the dummy accuracy with the test accuracy of the model to see if the model is better than just predicting the majority class.
if test_accuracy > dummy_accuracy:
    print("The model performs better than the dummy classifier.")
else:
    print("The model does not perform better than the dummy classifier.")

The model performs better than the dummy classifier.


**Model Validation:** The check confirms programmatically that the Random Forest's test accuracy (approximately 79.94%) exceeds the Dummy Classifier's baseline (approximately 69.52%). Accuracy alone does not show how well the minority "Ultra" class is predicted; per-class metrics such as recall would be needed to assess that.

**Conclusion:**

This project aimed to develop a classification model that predicts which plan (Smart or Ultra) to recommend to subscribers of the mobile carrier Megaline. After a brief data check (no missing values, no duplicates), we compared three model types and selected a Random Forest classifier that reached a test accuracy of approximately 79.94%. This exceeds the project's accuracy threshold of 0.75 and is about 10.4 percentage points above the majority-class baseline of 69.52%.

The Random Forest model with tuned hyperparameters (50 estimators and a maximum depth of 10) had the best validation accuracy of the three models tested, ahead of the Decision Tree (77.45%) and Logistic Regression (72.94%). Its test accuracy (79.94%) was close to its validation accuracy (79.78%), which supports its ability to generalize to unseen data.

The target variable is imbalanced (about 69% Smart versus 31% Ultra), and this notebook did not correct for it. Future work could explore techniques such as oversampling or undersampling, and report per-class metrics, to improve and verify performance on the minority "Ultra" class.
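As a rough illustration of the oversampling idea, the sketch below duplicates minority-class rows in the training set until both classes are the same size, then compares recall on the minority class. It runs on hypothetical synthetic data and assumes class 1 is the minority; it shows the mechanics only and does not predict the effect on the Megaline data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical synthetic data with roughly a 70/30 class ratio.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=0,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y
)

# Oversample the minority class (assumed to be 1) in the training set only.
minority = y_train == 1
X_min_up, y_min_up = resample(
    X_train[minority], y_train[minority],
    replace=True, n_samples=int((~minority).sum()), random_state=12345,
)
X_bal = np.vstack([X_train[~minority], X_min_up])
y_bal = np.concatenate([y_train[~minority], y_min_up])

recalls = {}
for name, (X_fit, y_fit) in {
    "original": (X_train, y_train),
    "oversampled": (X_bal, y_bal),
}.items():
    model = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=12345)
    model.fit(X_fit, y_fit)
    # Recall on class 1: the share of true minority rows that the model finds.
    recalls[name] = recall_score(y_valid, model.predict(X_valid))
    print(f"{name}: minority-class recall {recalls[name]:.3f}")
```

Only the training rows are resampled; the validation set keeps its original class ratio so that the comparison reflects real-world proportions.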

Overall, the Random Forest model gives Megaline a usable starting point for analyzing subscriber behavior and recommending the Smart or Ultra plan.