# Contents <a id='back'></a>

* [Introduction](#intro)
* [Stage 1. Data overview](#data_overview)
    * [Conclusions](#data_overview_conclusions)
* [Stage 2. Data preprocessing and testing and fitting ML models](#data_preprocessing)
    * [2.6 Conclusions](#data_preprocessing_conclusions)
* [Stage 3 Finding best machine learning model to suggest a plan.](#hypothesis)
    
   
* [Findings and Conclusions](#end)

## Introduction: <a id='intro'></a>

I work for Megaline, a telecommunications company. We need to find out which of the new plans best fits customers who are still on legacy price plans.

### Goal:
We must determine which new plan, Smart or Ultra, to recommend to existing customers on older plans. We will train several machine learning models on customer behavior (calls, minutes, messages, and internet traffic) and compare their accuracy against each other to choose the best model for recommending a new price plan.




Brief description of the data in the table:

    calls: number of calls,
    minutes: total call duration in minutes,
    messages: number of text messages,
    mb_used: Internet traffic used in MB,
    is_ultra: plan for the current month (Ultra - 1, Smart - 0)

### Stage 1 Data import and overview  <a id='data_overview'></a>

In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

In [2]:
try:
    user_data = pd.read_csv('datasets/users_behavior.csv')  # CSV file stored locally on my laptop
except FileNotFoundError:
    user_data = pd.read_csv('/datasets/users_behavior.csv')  # fallback path for when the project runs on Practicum

In [3]:
user_data.head(10) #Initial look at the data in the df

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
5,58.0,344.56,21.0,15823.37,0
6,57.0,431.64,20.0,3738.9,1
7,15.0,132.4,6.0,21911.6,0
8,7.0,43.39,3.0,2538.67,1
9,90.0,665.41,38.0,17358.61,0


In [4]:
user_data.shape

(3214, 5)

In [5]:
user_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


First, I will split the source data into a training set, a validation set, and a test set: 60% for training, 20% for validation, and 20% for testing.

In [6]:
df_train, df_valid = train_test_split(user_data, test_size=0.4, random_state=2022)  # random_state is set to the current year, 2022
# first I hold out 40% as df_valid, then I will split that part 50-50 into test and validation

In [7]:
df_train.shape # df_train has 60% of the rows and df_valid has the other 40% which we will split 50-50 with test

(1928, 5)

In [8]:
df_test, df_valid = train_test_split(df_valid, test_size=0.5, random_state=2022)  # splitting the 40% 50/50 to get 20% each for test and validation


In [9]:
df_test.shape

(643, 5)

In [10]:
df_valid.shape

(643, 5)

In [11]:
df_train.shape

(1928, 5)
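
The subset sizes above follow from how the two splits round. Here is a minimal sketch of that arithmetic, assuming `train_test_split` rounds the `test_size` share up and gives the remaining rows to the other part. The helper name `split_sizes` is hypothetical and used only for illustration.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (first_part, second_part) row counts for a fractional test_size.

    Assumption: the second part gets ceil(n_rows * test_size) rows and the
    first part gets whatever is left.
    """
    second = math.ceil(n_rows * test_size)
    return n_rows - second, second


total_rows = 3214  # from user_data.shape

# First split: 60% for training, 40% held out.
n_train, n_holdout = split_sizes(total_rows, 0.4)

# Second split: the held-out 40% is halved into test and validation.
n_test, n_valid = split_sizes(n_holdout, 0.5)

print(n_train, n_test, n_valid)  # 1928 643 643
```

The three counts add up to the original 3214 rows, so no row is lost or duplicated between the subsets.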

We now have our sets, but we need to separate the target from the features. Here the target is the pricing plan, which makes this a binary classification task: either Ultra or Smart. The plan is stored in the is_ultra column as an integer, 1 for Ultra and 0 for Smart, so we exclude this column from the features and use it as the target.

In [12]:
features_train = df_train.drop(['is_ultra'], axis=1) #drop the target from the features

In [13]:
features_test = df_test.drop(['is_ultra'], axis=1) #drop the target for test rows

In [14]:
features_valid = df_valid.drop(['is_ultra'], axis=1)

In [15]:
target_train = df_train['is_ultra'] #Make an array with just the target column

In [16]:
target_test = df_test['is_ultra'] #Same for test and valid

In [17]:
target_valid = df_valid['is_ultra'] 

We now have all our sets with the correct percentage of different rows. 
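
The split and the feature/target separation can also be wrapped into one helper. The sketch below is an alternative, not the method used in this notebook: it adds `stratify`, so each subset keeps roughly the same share of Ultra users. The helper name `split_60_20_20` and the synthetic `demo` frame are assumptions made for illustration; the values are random and are not the Megaline data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_60_20_20(df, target_col, seed=2022):
    """Split df into train/valid/test (features, target) pairs, 60/20/20."""
    features = df.drop(columns=[target_col])
    target = df[target_col]
    # Hold out 40%, keeping the class ratio the same on both sides.
    f_train, f_rest, t_train, t_rest = train_test_split(
        features, target, test_size=0.4, random_state=seed, stratify=target
    )
    # Halve the held-out part into validation and test sets.
    f_valid, f_test, t_valid, t_test = train_test_split(
        f_rest, t_rest, test_size=0.5, random_state=seed, stratify=t_rest
    )
    return (f_train, t_train), (f_valid, t_valid), (f_test, t_test)


# Synthetic stand-in for users_behavior.csv (made-up values).
rng = np.random.default_rng(2022)
demo = pd.DataFrame({
    "calls": rng.integers(0, 200, size=1000).astype(float),
    "mb_used": rng.uniform(0, 40000, size=1000),
    "is_ultra": (rng.random(1000) < 0.3).astype(int),
})

train, valid, test = split_60_20_20(demo, "is_ultra")
for name, (feats, targ) in [("train", train), ("valid", valid), ("test", test)]:
    # Print subset name, feature shape, and share of Ultra users.
    print(name, feats.shape, round(targ.mean(), 3))
```

Stratifying matters most when one plan is much rarer than the other, since a purely random split can leave a small subset with a skewed class ratio.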

The first model I will use is a decision tree, although I believe that if a single tree does well, a forest should do better.

In [18]:
# try max_depth values from 1 to 11
for depth in range(1, 12):
    model = DecisionTreeClassifier(random_state=2022, max_depth=depth)  # keep random_state at 2022
    model.fit(features_train, target_train)  # train the model
    predictions_valid = model.predict(features_valid)  # get the model's predictions
    print("max_depth =", depth, ": ", end='')
    print(accuracy_score(target_valid, predictions_valid))
        

max_depth = 1 : 0.7589424572317263
max_depth = 2 : 0.7900466562986003
max_depth = 3 : 0.8040435458786936
max_depth = 4 : 0.7947122861586314
max_depth = 5 : 0.7993779160186625
max_depth = 6 : 0.8009331259720062
max_depth = 7 : 0.7993779160186625
max_depth = 8 : 0.8009331259720062
max_depth = 9 : 0.7931570762052877
max_depth = 10 : 0.7962674961119751
max_depth = 11 : 0.7791601866251944


As the output above shows, when we train the model and score it on the validation subset, the best accuracy (0.804) comes at a depth of 3; when we compare it to the test subset, the best depth is 5. Both are above the minimum accuracy of 0.75 required by the project, so we could stop here, but let's check what the random forest gives us.
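
Picking the best depth by eye works for eleven rows, but it can also be done in code. A small sketch using the validation accuracies copied from the output above; the variable names `valid_accuracy`, `best_depth`, and `passing` are hypothetical.

```python
# Validation accuracy by max_depth, copied from the printed output.
valid_accuracy = {
    1: 0.7589424572317263,
    2: 0.7900466562986003,
    3: 0.8040435458786936,
    4: 0.7947122861586314,
    5: 0.7993779160186625,
    6: 0.8009331259720062,
    7: 0.7993779160186625,
    8: 0.8009331259720062,
    9: 0.7931570762052877,
    10: 0.7962674961119751,
    11: 0.7791601866251944,
}

# Depth with the highest validation accuracy.
best_depth = max(valid_accuracy, key=valid_accuracy.get)
print(best_depth, valid_accuracy[best_depth])  # 3 0.8040435458786936

# Depths that clear the project's 0.75 accuracy threshold.
passing = [depth for depth, acc in valid_accuracy.items() if acc >= 0.75]
print(passing)
```

Every depth from 1 to 11 clears the threshold on the validation set, so the choice of depth here is about squeezing out accuracy rather than meeting the minimum.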

In [19]:
best_score = 0
best_est = 0
for est in range(1, 12): # choose hyperparameter range
    model = RandomForestClassifier(random_state=2022, n_estimators=est) # set number of trees from 1 to 11
    model.fit(features_train, target_train) # train model on training set
    score = model.score(features_valid, target_valid) # calculate accuracy score on validation set
    if score > best_score:
        best_score = score  # save best accuracy score on validation set
        best_est = est   # save number of estimators corresponding to best accuracy score

print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))

Accuracy of the best model on the validation set (n_estimators = 10): 0.8102643856920684


So far, the best predictions come from the RandomForestClassifier model with n_estimators = 10. We will therefore fit our final model with the n_estimators hyperparameter set to best_est.

In [20]:
# When I cap max_depth (10 in the loop below) I actually get higher accuracy, and the cap helps keep the model from overfitting.
# The individual trees above also did best at a limited depth.
best_score = 0
best_est = 0
for est in range(1, 10): # choose hyperparameter range
    model = RandomForestClassifier(random_state=2022, n_estimators=est, max_depth=10) # set number of trees from 1 to 9
    model.fit(features_train, target_train) # train model on training set
    score = model.score(features_valid, target_valid) # calculate accuracy score on validation set
    if score > best_score:
        best_score = score  # save best accuracy score on validation set
        best_est = est   # save number of estimators corresponding to best accuracy score

print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))

Accuracy of the best model on the validation set (n_estimators = 8): 0.8195956454121306
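
The loop above fixes `max_depth` at 10 and varies only `n_estimators`. Both hyperparameters can be searched together on the validation set. The sketch below shows one way to do that; it runs on synthetic data from `make_classification` as a stand-in, so its scores say nothing about the Megaline data, and the candidate depths `(3, 5, 10)` are an arbitrary choice for illustration.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 4 features and a binary target (not the real dataset).
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=2022
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=2022
)

best = {"score": 0.0, "n_estimators": None, "max_depth": None}

# Try every combination of tree count (1 to 10) and depth cap.
for est, depth in product(range(1, 11), (3, 5, 10)):
    model = RandomForestClassifier(
        random_state=2022, n_estimators=est, max_depth=depth
    )
    model.fit(X_train, y_train)
    score = model.score(X_valid, y_valid)  # accuracy on the validation set
    if score > best["score"]:
        best = {"score": score, "n_estimators": est, "max_depth": depth}

print(best)
```

Searching the two hyperparameters jointly avoids assuming that the best tree count found at one depth is also the best at another.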


In [21]:
final_model = RandomForestClassifier(random_state=2022, n_estimators=best_est, max_depth=10)  # create the model object with n_estimators = best_est
final_model.fit(features_train, target_train)

RandomForestClassifier(max_depth=10, n_estimators=8, random_state=2022)

In [22]:
# score the RandomForestClassifier model (n_estimators = best_est, max_depth = 10) on the test set
score = final_model.score(features_test, target_test)
print(score)

0.8087091757387247
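
The score printed above is accuracy: the share of test rows where the predicted plan matches the actual one. A minimal sketch of that calculation, next to a baseline that always predicts the most common plan. The labels below are toy values made up for illustration, and the names `accuracy`, `model_acc`, and `baseline_acc` are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Toy labels for illustration only (1 = Ultra, 0 = Smart).
y_true = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 0, 0, 1, 1, 0, 0]

model_acc = accuracy(y_true, y_pred)

# Baseline: always predict the most common class.
majority = Counter(y_true).most_common(1)[0][0]
baseline_acc = accuracy(y_true, [majority] * len(y_true))

print(model_acc, baseline_acc)  # 0.8 0.7
```

Comparing against such a baseline shows how much of the accuracy comes from the model rather than from class imbalance; in practice the majority class would be taken from the training targets.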


## Conclusions:

We found that the best classification model is RandomForestClassifier. We tested different hyperparameters: max_depth for individual trees using DecisionTreeClassifier, and n_estimators and max_depth for RandomForestClassifier. For a single tree, the best validation accuracy (0.804) came at max_depth = 3. For the forest, with max_depth set to 10, the best n_estimators was 8, with a validation accuracy of 0.820.

After adjusting these parameters, the final model scored about 0.809 on our test set, which is above the minimum score of 0.75 that the project requires. The model will help migrate customers off old legacy plans and onto the new Ultra and Smart plans. We can now suggest to each customer a plan tailored to their needs and usage.