# Best Phone Plan Machine Learning Model

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra. 

You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you've already performed the data preprocessing step, you can move straight to creating the model.

Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.  

## Initialization and Data Preparation

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import numpy as np

In [2]:
phone = pd.read_csv('/datasets/users_behavior.csv'); phone.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


According to the documentation:  
- `calls` - number of calls
- `minutes` - total call duration in minutes
- `messages` - number of text messages
- `mb_used` - internet traffic used in MB
- `is_ultra` - plan for the current month (Ultra - 1, Smart - 0)

In [3]:
phone.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
phone['calls'] = phone['calls'].astype(int)
phone['messages'] = phone['messages'].astype(int)
phone.isna().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

In [5]:
features = phone.drop(['is_ultra'], axis=1)
target = phone['is_ultra']

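# train_test_split returns the (1 - test_size) share first, so features_temp
# holds 60% of the rows and features_train holds the remaining 40%.
features_temp, features_train, target_temp, target_train = train_test_split(
    features, target, test_size=0.4, random_state=54321)
# Halve the 60% holdout into validation and test sets (30% of the data each).
features_valid, features_test, target_valid, target_test = train_test_split(
    features_temp, target_temp, test_size=0.5, random_state=54321)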
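
As a sanity check on the split, the sketch below repeats the two `train_test_split` calls on a placeholder frame with the same number of rows (3214) and prints the resulting sizes. The zero-filled `demo` frame and its column name `x` are stand-ins, not the real data; only the row counts matter here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data: same row count as users_behavior.csv, dummy values.
demo = pd.DataFrame(np.zeros((3214, 2)), columns=['x', 'is_ultra'])
demo_features = demo.drop(['is_ultra'], axis=1)
demo_target = demo['is_ultra']

# The first returned piece is the (1 - test_size) share, the second is test_size.
temp_x, train_x, temp_y, train_y = train_test_split(
    demo_features, demo_target, test_size=0.4, random_state=54321)
valid_x, test_x, valid_y, test_y = train_test_split(
    temp_x, temp_y, test_size=0.5, random_state=54321)

split_sizes = {'train': len(train_x), 'valid': len(valid_x), 'test': len(test_x)}
print(split_sizes)
```

With these settings the model is trained on roughly 40% of the rows, with about 30% each held out for validation and testing.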

## Decision Tree Model

In [6]:
for depth in range(1, 6):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    print('max_depth =', depth, ': ', end='')
    print(accuracy_score(target_valid, predictions_valid))

max_depth = 1 : 0.7479253112033195
max_depth = 2 : 0.7697095435684648
max_depth = 3 : 0.7863070539419087
max_depth = 4 : 0.7800829875518672
max_depth = 5 : 0.783195020746888


In [7]:
best_model = None
best_result = float('inf')  # lower RMSE is better
best_depth = 0
for depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    # On 0/1 labels, RMSE equals sqrt(1 - accuracy), so the lowest RMSE
    # corresponds to the highest validation accuracy.
    result = mean_squared_error(target_valid, predictions_valid)**0.5
    if result < best_result:
        best_model = model
        best_result = result
        best_depth = depth

print(f"The best decision tree model has a max depth of {best_depth} and an RMSE of {best_result:.6f}")

The best decision tree model has a max depth of 7 and an RMSE of 0.455488
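
RMSE is an unusual metric for a classifier, but on 0/1 labels it carries the same information as accuracy. The sketch below illustrates this on a small made-up pair of label arrays (`y_true` and `y_pred` are invented for the example, not taken from the dataset).

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Made-up labels and predictions, for illustration only.
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])

accuracy = accuracy_score(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred) ** 0.5

# Each wrong prediction contributes a squared error of 1 and each correct one 0,
# so MSE is the error rate and RMSE = sqrt(1 - accuracy).
print(accuracy, rmse, np.sqrt(1 - accuracy))
```

By the same relation, the reported validation RMSE of 0.455488 corresponds to a validation accuracy of about 0.7925.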


In [8]:
predictions = best_model.predict(features_test)
print(f'Test set accuracy score: {accuracy_score(target_test, predictions):.6f}')

Test set accuracy score: 0.784232


## Random Forest Model

In [9]:
best_model = None
best_result = float('inf')  # lower RMSE is better
best_est = 0
best_depth = 0
# Search over the number of trees (10..90) and the tree depth (1..10).
for est in range(10, 100, 10):
    for depth in range(1, 11):
        model = RandomForestClassifier(n_estimators=est, max_depth=depth, random_state=12345)
        model.fit(features_train, target_train)
        predictions_valid = model.predict(features_valid)
        result = mean_squared_error(target_valid, predictions_valid)**0.5
        if result < best_result:
            best_model = model
            best_result = result
            best_est = est
            best_depth = depth

print(f"The best random forest model has an RMSE of {best_result:.6f}, {best_est} n_estimators, and a depth of {best_depth}")

The best random forest model has an RMSE of 0.446285, 40 n_estimators, and a depth of 7
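
Since the project goal is stated in terms of accuracy, the same search could track validation accuracy directly instead of RMSE. The following is a minimal sketch under that assumption: it runs on synthetic data from `make_classification` and uses a smaller, hypothetical grid, so its numbers say nothing about the Megaline dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the notebook itself uses features_train / features_valid.
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)

best_accuracy, best_params = 0.0, None
for est in (10, 30, 50):
    for depth in (2, 4, 6):
        model = RandomForestClassifier(n_estimators=est, max_depth=depth,
                                       random_state=12345)
        model.fit(X_train, y_train)
        acc = accuracy_score(y_valid, model.predict(X_valid))
        # Keep the first combination that reaches the highest validation accuracy.
        if acc > best_accuracy:
            best_accuracy, best_params = acc, (est, depth)

print(best_params, best_accuracy)
```

Selecting on accuracy picks the same winner as selecting on RMSE for 0/1 labels, but the reported number can be compared directly against the 0.75 threshold.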


In [10]:
predictions = best_model.predict(features_test)
print(f'Test set accuracy score: {accuracy_score(target_test, predictions):.6f}')

Test set accuracy score: 0.802905


## Logistic Regression Model

In [11]:
model = LogisticRegression(random_state=54321, solver='liblinear')
model.fit(features_train, target_train)
score_train = model.score(features_train, target_train)
score_valid = model.score(features_valid, target_valid)

print(f"Accuracy of the logistic regression model on the validation set: {score_valid:.6f}")

Accuracy of the logistic regression model on the validation set: 0.691909


In [12]:
predictions = model.predict(features_test)
print(f'Test set accuracy score: {accuracy_score(target_test, predictions):.6f}')

Test set accuracy score: 0.705394


## Conclusion

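The random forest performs best: with 40 estimators and a maximum depth of 7, it has the lowest validation RMSE (0.446285) and the highest test accuracy (0.802905). The decision tree (maximum depth 7) also clears the 0.75 threshold with a test accuracy of 0.784232, while logistic regression falls short at 0.705394.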
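
The comparison can be made explicit with a short check of the reported test accuracies against the project threshold. The values below are copied from the cell outputs in this notebook; the dictionary and variable names are introduced only for this sketch.

```python
# Test-set accuracies copied from the cell outputs above.
test_accuracy = {
    'decision tree': 0.784232,
    'random forest': 0.802905,
    'logistic regression': 0.705394,
}
THRESHOLD = 0.75  # accuracy required by the project

# Models that meet the threshold, and the single best model by test accuracy.
passing = {name: acc for name, acc in test_accuracy.items() if acc >= THRESHOLD}
best_name = max(test_accuracy, key=test_accuracy.get)

print(best_name, sorted(passing))
```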