## Recommendation of best plan for Megaline mobile carrier customers

## Introduction
#### The purpose of this project is to recommend the best plan to new customers based on their mobile usage behavior.


In [1]:
## Loading the libraries needed for the project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as st

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

from sklearn.model_selection import train_test_split
%matplotlib inline
from IPython.display import Image

In [2]:
# Reading the dataset
df = pd.read_csv('/datasets/users_behavior.csv')
df.sample(25)

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
1583,70.0,561.85,28.0,23850.76,0
1625,38.0,271.41,14.0,8152.11,1
2193,44.0,295.84,0.0,15043.52,0
625,77.0,502.87,5.0,6928.0,0
1940,61.0,391.04,21.0,20625.04,0
230,36.0,267.58,13.0,12737.08,0
2511,72.0,439.92,29.0,14472.76,0
3149,78.0,503.55,57.0,15258.51,0
1109,69.0,479.79,0.0,17559.43,0
3101,115.0,795.39,94.0,11356.9,1


#### From this small sample, it appears that Ultra customers use more call services, whereas Smart customers use more data. This is only a first impression from a handful of rows and needs to be checked against the full dataset. The goal is to develop a model that recommends the Smart or Ultra plan based on a customer's usage.

In [3]:
df.describe()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
count,3214.0,3214.0,3214.0,3214.0,3214.0
mean,63.038892,438.208787,38.281269,17207.673836,0.306472
std,33.236368,234.569872,36.148326,7570.968246,0.4611
min,0.0,0.0,0.0,0.0,0.0
25%,40.0,274.575,9.0,12491.9025,0.0
50%,62.0,430.6,30.0,16943.235,0.0
75%,82.0,571.9275,57.0,21424.7,1.0
max,244.0,1632.06,224.0,49745.73,1.0



In [4]:
# Looking at number of rows and columns
df.shape

(3214, 5)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [6]:
## Checking for missing values
df.isna().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

## There is no missing data in the data frame

In [7]:
df.groupby('is_ultra')[['calls', 'messages', 'mb_used', 'minutes']].sum()

is_ultra,calls,messages,mb_used,minutes
0,130315.0,74413.0,36128672.83,904846.84
1,72292.0,48623.0,19176790.88,503556.2


In [8]:
df['is_ultra'].value_counts()

0    2229
1     985
Name: is_ultra, dtype: int64

## The counts above show that Smart (0) has far more customers than Ultra (1): 2229 versus 985.
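Because Smart has more than twice as many customers, the group totals above mostly reflect group size rather than individual behavior. The sketch below divides each total by the customer count to get a per-customer average. The numbers are copied from the two outputs above; the variable names are only illustrative.

```python
# Group totals and customer counts copied from the notebook outputs above.
totals = {
    0: {"calls": 130315.0, "messages": 74413.0,
        "mb_used": 36128672.83, "minutes": 904846.84},  # Smart
    1: {"calls": 72292.0, "messages": 48623.0,
        "mb_used": 19176790.88, "minutes": 503556.2},   # Ultra
}
customers = {0: 2229, 1: 985}

# Average usage per customer for each plan, rounded to two decimals.
per_customer = {
    plan: {col: round(total / customers[plan], 2) for col, total in cols.items()}
    for plan, cols in totals.items()
}

for plan, means in per_customer.items():
    print(plan, means)
```

On these figures, Ultra customers average more of every service, data included, so the early impression that Smart customers are the heavier internet users looks doubtful. In the notebook itself, `df.groupby('is_ultra').mean()` should give the same per-plan averages directly.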

In [9]:
## Separating the features from the target column
features = df.drop(['is_ultra'], axis = 1)
target = df['is_ultra']
print(features.shape)
print(target.shape)

(3214, 4)
(3214,)


### Splitting data into training (60%), validation (20%), and test (20%) sets

In [10]:

features_train, test_features, target_train, test_target = train_test_split(features, target, test_size=0.40, random_state=42) ## train 60%, holdout 40%
features_valid, test_features, target_valid, test_target = train_test_split(test_features, test_target, test_size=0.50, random_state=42) ## split the holdout in half: validation 20%, test 20%


In [11]:
test_features.shape

(643, 4)

In [12]:
features_train.shape

(1928, 4)

In [13]:
features_valid.shape

(643, 4)
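With roughly 31% Ultra customers, a purely random split can leave the three subsets with slightly different class ratios. One option, not used in this notebook, is to pass `stratify` to `train_test_split`. The sketch below wraps the two-step split in a hypothetical helper, `split_60_20_20`, and runs it on synthetic stand-in data rather than the Megaline dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_60_20_20(features, target, seed=42):
    """Hypothetical helper: 60% train, 20% validation, 20% test, stratified."""
    x_train, x_hold, y_train, y_hold = train_test_split(
        features, target, test_size=0.40, random_state=seed, stratify=target
    )
    x_valid, x_test, y_valid, y_test = train_test_split(
        x_hold, y_hold, test_size=0.50, random_state=seed, stratify=y_hold
    )
    return (x_train, y_train), (x_valid, y_valid), (x_test, y_test)


# Synthetic stand-in data (assumption): 1000 rows, 4 features, 30% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([1] * 300 + [0] * 700)

(train_X, train_y), (valid_X, valid_y), (test_X, test_y) = split_60_20_20(X, y)

# Subset sizes and the share of positives in each subset.
print(len(train_y), len(valid_y), len(test_y))
print(train_y.mean(), valid_y.mean(), test_y.mean())
```

Stratifying keeps the share of Ultra customers about the same in each subset, which makes validation and test accuracy easier to compare.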

### Creating the model by Decision Tree Classifier

In [14]:
# DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=12345)
# initialize Decision Tree Classifier constructor with parameter random_state=12345
model.fit(features_train, target_train)
model.score(features_valid, target_valid)

0.7262830482115086

In [16]:
# Try different max_depth values to see which gives the best validation accuracy
for depth in range(1, 10):
    model= DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train) # train model on training set
    predictions_valid = model.predict(features_valid)
    print('max_depth =',depth,':', end=' ') 
    print(accuracy_score(target_valid, predictions_valid))

max_depth = 1 : 0.7309486780715396
max_depth = 2 : 0.7822706065318819
max_depth = 3 : 0.7916018662519441
max_depth = 4 : 0.7807153965785381
max_depth = 5 : 0.7729393468118196
max_depth = 6 : 0.776049766718507
max_depth = 7 : 0.7807153965785381
max_depth = 8 : 0.7947122861586314
max_depth = 9 : 0.7791601866251944


#### Tuning hyperparameters can improve the accuracy of a Decision Tree model. Without a depth limit, the tree scored 0.726 on the validation set. Across the nine max_depth values tried (1 to 9), accuracy ranged from 0.731 to 0.795, with the best result at max_depth = 8 (0.795) and max_depth = 3 close behind (0.792).
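The validation scores above move up and down between depths rather than improving steadily, which suggests that a single validation split is somewhat noisy. One alternative, which this notebook does not use, is to choose `max_depth` by cross-validation with `GridSearchCV`. The following is a minimal sketch on synthetic data generated with `make_classification`, not on the Megaline dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 600 rows, 4 features, binary target.
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Evaluate max_depth = 1..9 with 5-fold cross-validated accuracy.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=12345),
    param_grid={"max_depth": list(range(1, 10))},
    scoring="accuracy",
    cv=5,
)
search.fit(X, y)

best_depth = search.best_params_["max_depth"]
best_cv_accuracy = search.best_score_
print(best_depth, round(best_cv_accuracy, 3))
```

Averaging over several folds usually gives a steadier ranking of depths than one split, at the cost of fitting more trees.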

### Creating the model by Random Forest Classifier

In [17]:
# RandomForestClassifier
model = RandomForestClassifier(random_state =54321) 
# initialize Random Forest Classifier constructor with parameters random_state=54321
model.fit(features_train, target_train)
target_pred = model.predict(features_valid)
model.score(features_valid, target_valid)

0.7947122861586314

In [18]:
best_score = 0
best_est = 0
for est in range(1, 10): # choose hyperparameter range
    model = RandomForestClassifier(random_state=54321, n_estimators= est) # set number of trees
    model.fit(features_train, target_train) # train model on training set
    score = model.score(features_valid, target_valid) # calculate accuracy score on validation set
    if score > best_score:
        best_score = score # save best accuracy score on validation set
        best_est = est # save number of estimators corresponding to best accuracy score

print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))

Accuracy of the best model on the validation set (n_estimators = 8): 0.7869362363919129
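The loop above only tries 1 to 9 trees, yet the default forest in the previous cell (100 trees in current scikit-learn versions) scored slightly higher on validation, 0.795 against 0.787. A wider search over both `n_estimators` and `max_depth` may therefore be worth trying. The sketch below shows the same keep-the-best loop over two hyperparameters, on synthetic data; the grid values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), split into train and validation sets.
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42
)

best = {"score": 0.0, "n_estimators": None, "max_depth": None}
for n_estimators in (10, 50, 100):      # number of trees
    for max_depth in (4, 8, None):      # None means no depth limit
        model = RandomForestClassifier(
            n_estimators=n_estimators, max_depth=max_depth, random_state=54321
        )
        model.fit(X_train, y_train)
        score = model.score(X_valid, y_valid)  # validation accuracy
        if score > best["score"]:
            best = {"score": score, "n_estimators": n_estimators,
                    "max_depth": max_depth}

print(best)
```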


### Creating the model by Logistic Regression 

In [19]:
model =  LogisticRegression(random_state = 54321, solver='liblinear')# initialize logistic regression constructor with parameters random_state=54321 and solver='liblinear'
model.fit(features_train, target_train)  # train model on training set
score_train = model.score(features_train, target_train) # calculate accuracy score on training set
 
score_valid = model.score(features_valid,target_valid) # calculate accuracy score on validation set
 

print("Accuracy of the logistic regression model on the training set:", score_train)
print("Accuracy of the logistic regression model on the validation set:", score_valid)

Accuracy of the logistic regression model on the training set: 0.7136929460580913
Accuracy of the logistic regression model on the validation set: 0.7200622083981337


#### In the logistic regression model, validation accuracy (0.72) is slightly higher than training accuracy (0.71), so there is no sign of overfitting. However, both values fall below our accuracy threshold of 0.75, so this model is not considered further.
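The features here sit on very different scales (mb_used is in the thousands, messages in the tens), and `StandardScaler` is imported at the top of the notebook but never applied. Standardizing the features before logistic regression may help, although that has not been checked on this dataset. The sketch below compares a plain and a scaled model on synthetic data whose columns are stretched to mimic uneven scales; the stretch factors are made up.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption) with deliberately uneven column scales.
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X = X * np.array([1.0, 10.0, 1.0, 1000.0])
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Logistic regression on raw features, as in the notebook cell above.
plain_model = LogisticRegression(random_state=54321, solver="liblinear")
plain_model.fit(X_train, y_train)
plain_valid_accuracy = plain_model.score(X_valid, y_valid)

# Same model behind a scaler; the pipeline fits the scaler on training data only.
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=54321, solver="liblinear"),
)
scaled_model.fit(X_train, y_train)
scaled_valid_accuracy = scaled_model.score(X_valid, y_valid)

print(plain_valid_accuracy, scaled_valid_accuracy)
```

Putting the scaler inside a pipeline keeps validation and test data out of the scaling statistics.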

### Comparing the three models on the validation set: the random forest reached about 0.79 accuracy (0.795 with default settings, 0.787 with n_estimators = 8), which matches the best decision tree (0.795 at max_depth = 8) and clearly exceeds logistic regression (0.72). Given that the accuracy threshold is set at 0.75, logistic regression is ruled out and the random forest is selected for the final evaluation.
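The 0.75 threshold is easier to interpret next to a majority-class baseline, that is, the accuracy of always predicting the more common plan. The sketch below computes that baseline from the `value_counts` output shown earlier and compares it with the validation scores reported above; the dictionary names are illustrative.

```python
# Class counts copied from the value_counts output (0 = Smart, 1 = Ultra).
class_counts = {0: 2229, 1: 985}

# Accuracy obtained by always predicting the most common class.
baseline_accuracy = max(class_counts.values()) / sum(class_counts.values())

# Validation accuracies reported in the cells above.
validation_scores = {
    "decision tree (max_depth=8)": 0.7947,
    "random forest (default)": 0.7947,
    "logistic regression": 0.7201,
}

print("baseline:", round(baseline_accuracy, 4))
for name, score in validation_scores.items():
    print(name, "gain over baseline:", round(score - baseline_accuracy, 4))
```

Always predicting Smart would already be right for about 69% of customers, so the gains from the tree-based models are real but moderate.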

In [31]:
## apply the best model on test set
final_model = RandomForestClassifier(random_state=54321, n_estimators= 8) 
final_model.fit(features_train, target_train)
final_model.score(test_features, test_target)


0.80248833592535

### In the previous step, the Random Forest Classifier with n_estimators = 8 scored about 0.79 on the validation set. On the held-out test set, the same model scores about 0.80, which is consistent with the validation result and above the 0.75 threshold.
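Since the purpose of the project is to recommend a plan, it may help to see how a fitted model would be queried for a single customer. The sketch below trains a small forest on synthetic data and wraps `predict` in a hypothetical helper, `recommend_plan`. The feature values passed in are arbitrary synthetic numbers, not real usage figures, and the mapping 1 = Ultra, 0 = Smart follows the `is_ultra` column.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

COLUMNS = ["calls", "minutes", "messages", "mb_used"]

# Synthetic stand-in training data (assumption), using the notebook's column names.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
train = pd.DataFrame(X, columns=COLUMNS)
model = RandomForestClassifier(random_state=54321, n_estimators=8)
model.fit(train, y)


def recommend_plan(fitted_model, calls, minutes, messages, mb_used):
    """Hypothetical helper: map the model's 0/1 prediction to a plan name."""
    row = pd.DataFrame([[calls, minutes, messages, mb_used]], columns=COLUMNS)
    return "Ultra" if fitted_model.predict(row)[0] == 1 else "Smart"


# Arbitrary synthetic feature values, for illustration only.
recommendation = recommend_plan(model, 0.5, -1.2, 0.3, 2.0)
print(recommendation)
```

In the notebook, the same helper could be called with `final_model` and a customer's monthly calls, minutes, messages, and megabytes used.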

## Conclusion:

##### Initially, from a small sample of rows, it appeared that Ultra plan customers use more call services while Smart plan customers use more data. This was only a tentative impression from a handful of rows, not a tested result. The aim was therefore to build a model that recommends a plan from a customer's calls, minutes, messages, and data usage.

##### Of the three models examined, the random forest and the tuned decision tree both reached about 0.79 accuracy on the validation set, while logistic regression reached 0.72. Given that the accuracy threshold is set at 0.75, the random forest was selected. On the test set, the random forest (n_estimators = 8) reached 0.80 accuracy, in line with its validation score.

##### For Megaline mobile carrier customers seeking the best plan, it is recommended to assess individual needs such as data usage, coverage requirements, and budget constraints. Options may include unlimited data plans for heavy users, cost-effective multi-line plans for families, or budget-friendly prepaid plans for those looking to save on monthly expenses. It is advisable to compare the latest offerings from Megaline and consider customer reviews and network performance in your area to make an informed decision.
