## Introduction

Megaline, a mobile operator, is unhappy that many of its customers are still using legacy plans.

The goal is a model that analyzes customer behavior and recommends one of Megaline's newer plans:
- Surf
- Ultimate

Behavior data is available for subscribers who have already switched to the new plans.

## Objective

Develop a classification model that chooses the most suitable plan for customers on legacy plans.

The model should reach the highest accuracy possible. The minimum acceptable accuracy for this project is **0.75**.

**Note**: The data has already been pre-processed, so we proceed directly to building the model.

## File data

### Initialization

Importing the libraries used in this project.

In [1]:
# Loading the libraries (only the ones used below)
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

### Loading Data

Loading the user behavior data.

In [2]:
df_plans = pd.read_csv('users_behavior.csv')

### Description of data about the Plans

Each observation in the dataset contains monthly behavioral information about a user.

The information is as follows (by column):
- `calls`: number of calls
- `minutes`: total call duration in minutes
- `messages`: number of text messages
- `mb_used`: Internet traffic used in MB
- `is_ultra`: plan for the current month (Ultimate: 1, Surf: 0)

### Plan conditions

Megaline applies the following rounding rules in its plans:
- Seconds are rounded up to minutes, and megabytes to gigabytes.
- For calls, each individual call is rounded up: even a one-second call is counted as one minute.
- For web traffic, individual sessions are not rounded. Instead, the monthly total is rounded up. For example, a subscriber who uses 1025 megabytes in a month is charged for 2 gigabytes.

**Plan conditions:**

**Surf** Plan

1. Monthly price: $20
2. 500 monthly minutes, 50 text messages and 15 GB of data
3. After exceeding the package limits:
- 1 minute: 3 cents
- 1 text message: 3 cents
- 1 GB of data: $10

**Ultimate** Plan

1. Monthly price: $70
2. 3000 monthly minutes, 1000 text messages and 30 GB of data
3. After exceeding the package limits:
- 1 minute: 1 cent
- 1 text message: 1 cent
- 1 GB of data: $7
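
The rounding rules and plan terms above can be turned into a small billing calculation. The following is a minimal sketch, not Megaline's actual billing logic: the helper `monthly_cost`, the `PLANS` dictionary, and the conversion factor of 1024 MB per GB are assumptions introduced for illustration (the source only says that 1025 MB is billed as 2 GB). It also takes a list of individual call durations, whereas the dataset used in this project stores only monthly totals.

```python
import math

# Plan terms copied from the lists above; all money values are in cents.
PLANS = {
    "surf": {"fee": 2000, "minutes": 500, "messages": 50, "gb": 15,
             "per_minute": 3, "per_message": 3, "per_gb": 1000},
    "ultimate": {"fee": 7000, "minutes": 3000, "messages": 1000, "gb": 30,
                 "per_minute": 1, "per_message": 1, "per_gb": 700},
}

MB_PER_GB = 1024  # assumption: the source does not state the conversion factor


def monthly_cost(plan, call_seconds, messages, mb_used):
    """Return one subscriber's monthly bill in dollars (hypothetical helper)."""
    terms = PLANS[plan]
    # Each call is rounded up to a whole minute before summing.
    billed_minutes = sum(math.ceil(seconds / 60) for seconds in call_seconds)
    # Traffic is rounded up once, on the monthly total.
    billed_gb = math.ceil(mb_used / MB_PER_GB)
    extra_cents = (
        max(0, billed_minutes - terms["minutes"]) * terms["per_minute"]
        + max(0, messages - terms["messages"]) * terms["per_message"]
        + max(0, billed_gb - terms["gb"]) * terms["per_gb"]
    )
    return (terms["fee"] + extra_cents) / 100


# One 1-second call and 1025 MB: billed as 1 minute and 2 GB, all within limits.
light_user = monthly_cost("surf", [1], 0, 1025)
# 510 one-minute calls, 55 messages, 16385 MB (billed as 17 GB) on Surf.
heavy_user = monthly_cost("surf", [60] * 510, 55, 16385)
print(light_user, heavy_user)
```

Working in integer cents avoids floating-point drift when many small overage charges are summed.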

In [3]:
# Printing the general information of the plans DataFrame
df_plans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
# Printing a sample
display(df_plans)

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.90 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| ... | ... | ... | ... | ... | ... |
| 3209 | 122.0 | 910.98 | 20.0 | 35124.90 | 1 |
| 3210 | 25.0 | 190.36 | 0.0 | 3275.61 | 0 |
| 3211 | 97.0 | 634.44 | 70.0 | 13974.06 | 0 |
| 3212 | 64.0 | 462.32 | 90.0 | 31239.78 | 0 |


In [5]:
# Printing some statistical data
df_plans.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| calls | 3214.0 | 63.038892 | 33.236368 | 0.0 | 40.0 | 62.0 | 82.0 | 244.0 |
| minutes | 3214.0 | 438.208787 | 234.569872 | 0.0 | 274.575 | 430.6 | 571.9275 | 1632.06 |
| messages | 3214.0 | 38.281269 | 36.148326 | 0.0 | 9.0 | 30.0 | 57.0 | 224.0 |
| mb_used | 3214.0 | 17207.673836 | 7570.968246 | 0.0 | 12491.9025 | 16943.235 | 21424.7 | 49745.73 |
| is_ultra | 3214.0 | 0.306472 | 0.4611 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


## Data division

To start modeling, the data is separated into two variables:
- **features**: the input columns (`calls`, `minutes`, `messages`, `mb_used`)
- **target**: the column to predict (`is_ultra`)

In [6]:
# Creating the variables from the DataFrame with the data.
features = df_plans.drop('is_ultra', axis=1)
target = df_plans['is_ultra']

# Checking
display(features.head(10))
print()
display(target.head(10))
print()
print(features.shape)
print(target.shape)

| | calls | minutes | messages | mb_used |
|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 |
| 5 | 58.0 | 344.56 | 21.0 | 15823.37 |
| 6 | 57.0 | 431.64 | 20.0 | 3738.9 |
| 7 | 15.0 | 132.4 | 6.0 | 21911.6 |
| 8 | 7.0 | 43.39 | 3.0 | 2538.67 |
| 9 | 90.0 | 665.41 | 38.0 | 17358.61 |





0    0
1    0
2    0
3    1
4    0
5    0
6    1
7    0
8    1
9    0
Name: is_ultra, dtype: int64


(3214, 4)
(3214,)


### Splitting the data into training and testing sets

The data must be divided so that the models can be trained, validated, and finally checked against a test set.

The sets are divided as follows:
- the test set is 25% of the original data
- the remaining 75% forms the training pool
- the validation set is 25% of that pool, which leaves about 56% of the original data for training and about 19% for validation

In [7]:
# Splitting the data into training and testing sets
df_train, df_test = train_test_split(df_plans, test_size=0.25, random_state=12345)

# Declaring the test features and target
features_test = df_test.drop('is_ultra', axis=1)
target_test = df_test['is_ultra']

# Checking
print(df_plans.shape)
print(df_train.shape)
print(features_test.shape)
print(target_test.shape)

(3214, 5)
(2410, 5)
(804, 4)
(804,)


### Splitting the data into Training and Validation sets

In [8]:
# Splitting the data into training and validation sets
df_train, df_valid = train_test_split(df_train, test_size=0.25, random_state=12345)

# Declaring the training and validation features and targets
features_train = df_train.drop('is_ultra', axis=1)
target_train = df_train['is_ultra']
features_valid = df_valid.drop('is_ultra', axis=1)
target_valid = df_valid['is_ultra']

# Checking
print(features_train.shape)
print(target_train.shape)
print(features_valid.shape)
print(target_valid.shape)

(1807, 4)
(1807,)
(603, 4)
(603,)
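
The row counts printed above follow directly from applying the 25% share twice. The sketch below reproduces that arithmetic in plain Python. The helper `split_sizes` is hypothetical, and the rounding rule (held-out part rounded up, remainder kept for training) is an assumption chosen because it matches the shapes shown in this notebook.

```python
import math


def split_sizes(n_rows, test_share=0.25, valid_share=0.25):
    """Return (train, validation, test) row counts for a two-step split."""
    # Assumption: the held-out part is rounded up, the rest stays in the pool.
    n_test = math.ceil(n_rows * test_share)
    n_pool = n_rows - n_test
    n_valid = math.ceil(n_pool * valid_share)
    n_train = n_pool - n_valid
    return n_train, n_valid, n_test


n_train, n_valid, n_test = split_sizes(3214)
print(n_train, n_valid, n_test)

# Shares of the original data that end up in each set.
shares = [round(n / 3214, 4) for n in (n_train, n_valid, n_test)]
print(shares)
```

Because the second split is taken from the 75% pool, the final proportions are roughly 56/19/25 rather than 50/25/25.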


## Training and Validation

Different hyperparameter values are tested for each algorithm to find the settings that give the best validation accuracy.

### Training Decision Trees model

Finding the best value for the **max_depth** hyperparameter.

In [9]:
# Testing Decision Trees model
# Finding the best value for the max_depth hyperparameter
for depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    
    print("max_depth =", depth, ": ", end='')
    print(accuracy_score(target_valid, predictions_valid))

max_depth = 1 : 0.7495854063018242
max_depth = 2 : 0.7761194029850746
max_depth = 3 : 0.7943615257048093
max_depth = 4 : 0.7893864013266998
max_depth = 5 : 0.7877280265339967
max_depth = 6 : 0.7910447761194029
max_depth = 7 : 0.7827529021558872
max_depth = 8 : 0.7910447761194029
max_depth = 9 : 0.7744610281923715
max_depth = 10 : 0.7844112769485904


The Decision Tree model performed best with a `max_depth` of **3**.

This already meets the minimum set for this project: the accuracy threshold is 0.75 and this model reaches about 0.794 on the validation set.

Other models are tested next to see whether the result can be improved.
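
Picking the winner by eye works for ten values, but the same selection can be made programmatically. The sketch below uses the validation accuracies printed by the `max_depth` loop. The helper `best_setting` is hypothetical. It uses a strict `>` comparison, so the earliest setting wins a tie, which is the same tie-breaking behavior as the `if score > best_score` pattern.

```python
# Validation accuracies copied from the max_depth loop output above.
depth_scores = {
    1: 0.7495854063018242,
    2: 0.7761194029850746,
    3: 0.7943615257048093,
    4: 0.7893864013266998,
    5: 0.7877280265339967,
    6: 0.7910447761194029,
    7: 0.7827529021558872,
    8: 0.7910447761194029,
    9: 0.7744610281923715,
    10: 0.7844112769485904,
}


def best_setting(scores):
    """Return (setting, score) with the highest score; earliest wins ties."""
    best_key, best_score = None, float("-inf")
    for key, score in scores.items():
        if score > best_score:  # strict comparison keeps the first maximum
            best_key, best_score = key, score
    return best_key, best_score


best_depth, best_acc = best_setting(depth_scores)
print("best max_depth:", best_depth, "validation accuracy:", best_acc)
```

Preferring the smaller value on ties favors the simpler model, which is usually a reasonable default.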

### Training Random Forest model

Finding the best value for the hyperparameter **n_estimators**.

In [10]:
# Testing Random Forest model
# Finding the best value for the hyperparameter n_estimators
best_score = 0
best_est = 0
for est in range(1, 11):
    model = RandomForestClassifier(random_state=12345, n_estimators=est) # defining the number of trees
    model.fit(features_train, target_train) # Training the model
    score = model.score(features_valid, target_valid) # accuracy calculation
    if score > best_score:
        best_score = score # save the best accuracy result
        best_est = est # saves best number of estimators

print("The accuracy of the best model on the validation set n_estimators = {}: {}".format(best_est, best_score))

The accuracy of the best model on the validation set n_estimators = 6: 0.7993366500829188


The Random Forest model performed best with `n_estimators` of **6**.

This model also meets the minimum set for this project and slightly improves on the decision tree, with a validation accuracy of 0.7993.

Next, the Logistic Regression model is tested to see whether the result can be improved further.

### Training Logistic Regression model

The `liblinear` solver is used because it is a good fit for small datasets such as this one.

In [11]:
# Testing Logistic Regression model
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train, target_train) # training model
score_train = model.score(features_train, target_train) # calculation of training accuracy
score_valid = model.score(features_valid, target_valid) # calculation of validation accuracy

print("Accuracy of the logistic regression model on the training set:", score_train)
print("Accuracy of the logistic regression model on the validation set:", score_valid)

Accuracy of the logistic regression model on the training set: 0.7011621472053127
Accuracy of the logistic regression model on the validation set: 0.6849087893864013


This model does not meet the minimum set for this project: its validation accuracy of 0.6849 is below the 0.75 threshold.

## Performance evaluation of the chosen model

The model with the best validation result was Random Forest with `n_estimators` of **6**, which achieved an accuracy of 0.7993.

For this performance evaluation, the same configuration is first refit on the training set and re-checked on the validation set, and then scored against the test set, which has not been used so far.

The test features and target were split from `df_test` earlier.

In [12]:
# Refitting the Random Forest model with the selected n_estimators on the training data
model = RandomForestClassifier(random_state=12345, n_estimators=6)
model.fit(features_train, target_train) # Refitting the model
score_retreino = model.score(features_valid, target_valid) # accuracy calculation

print("The accuracy of the refit model on the validation set =", score_retreino)

The accuracy of the refit model on the validation set = 0.7993366500829188


The refit model reproduces the same accuracy, as expected: the training data, hyperparameters, and `random_state` are unchanged.

For the evaluation with the test data:
- features_test
- target_test

In [13]:
# Performance evaluation with the Random Forest model
model = RandomForestClassifier(random_state=12345, n_estimators=6)

# Training the model with the test set
model.fit(features_test, target_test)

# Accuracy calculation
score_final = model.score(features_test, target_test)

print("The accuracy of this model on the test set is =", score_final)

The accuracy of this model on the test set is = 0.9564676616915423


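Note that the cell above fits the model on the test set and then scores it on those same rows, so 0.9564 measures how well the forest reproduces data it was trained on, not how it performs on data it has not 'seen'. A held-out estimate requires fitting on the training data (`model.fit(features_train, target_train)`) and only then calling `model.score(features_test, target_test)`.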
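
Why fitting and scoring on the same rows inflates accuracy can be shown with a toy example. The sketch below is not the Random Forest used in this project: it is a hypothetical 1-nearest-neighbor classifier written in plain Python, applied to synthetic data whose labels are pure noise. Since the labels carry no signal, honest accuracy should hover around 0.5, yet the model fit on the evaluation rows scores perfectly because it simply looks up each row's own label.

```python
import random


def fit_1nn(features, labels):
    """'Train' by memorizing the rows; return a predict function."""
    memory = list(zip(features, labels))

    def predict(row):
        # Label of the closest memorized row (squared Euclidean distance).
        nearest = min(
            memory,
            key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], row)),
        )
        return nearest[1]

    return predict


def accuracy(predict, features, labels):
    hits = sum(1 for x, y in zip(features, labels) if predict(x) == y)
    return hits / len(labels)


rng = random.Random(12345)


def make_rows(n):
    # Synthetic data: random features, labels unrelated to the features.
    xs = [(rng.random(), rng.random()) for _ in range(n)]
    ys = [rng.randint(0, 1) for _ in range(n)]
    return xs, ys


x_train, y_train = make_rows(200)
x_test, y_test = make_rows(200)

# Leaky: fit on the test rows, then score on the same rows.
leaky = accuracy(fit_1nn(x_test, y_test), x_test, y_test)
# Held-out: fit on the training rows, score on the untouched test rows.
honest = accuracy(fit_1nn(x_train, y_train), x_test, y_test)
print("fit on test, scored on test:", leaky)
print("fit on train, scored on test:", honest)
```

A small random forest memorizes less aggressively than 1-nearest-neighbor, which is consistent with the 0.9564 above being high but not perfect.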

## Final consideration

Of the 3 models tested, the **Random Forest** model was the most effective on the validation set, with **`n_estimators` of 6** and an accuracy of 0.7993.

The **0.9564** obtained in the test-set cell comes from a model that was fit on the test set itself, so it should not be read as the model's accuracy on unseen data.

## Additional Verification

As a sanity check, the best result is compared against a dummy model. This classifier ignores the input features when predicting and serves as a simple baseline for the more complex classifiers.

The baseline's behavior is selected with the `strategy` parameter. Here `most_frequent` is used, which always predicts the most common class in the training data.

In [14]:
# Performance evaluation with the Dummy model
model = DummyClassifier(random_state=12345, strategy='most_frequent')

# Training the model
model.fit(features_train, target_train)

# Accuracy calculation
score_dummy = model.score(features_valid, target_valid)

print("The accuracy of this model on the validation set is =", score_dummy)

The accuracy of this model on the validation set is = 0.681592039800995


The dummy model reaches an accuracy of only 0.6816 on the validation set, which is below the 0.75 minimum for this project and well below the 0.7993 achieved by the Random Forest model. Logistic Regression (0.6849) barely improves on this baseline.
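
The `most_frequent` baseline is simple enough to compute by hand: its accuracy is just the share of evaluation rows that belong to the majority class of the training labels. The sketch below shows this in plain Python with made-up label lists. The helper `most_frequent_baseline` is hypothetical and only mimics the idea, not the scikit-learn implementation.

```python
from collections import Counter


def most_frequent_baseline(train_labels, eval_labels):
    """Return (majority class, accuracy of always predicting it)."""
    majority = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for label in eval_labels if label == majority)
    return majority, hits / len(eval_labels)


# Illustrative labels only: 0 = Surf, 1 = Ultimate.
train_labels = [0, 0, 0, 1, 0, 1, 0]
valid_labels = [0, 1, 0, 0]

majority, baseline_acc = most_frequent_baseline(train_labels, valid_labels)
print("majority class:", majority, "baseline accuracy:", baseline_acc)
```

In this dataset the mean of `is_ultra` is about 0.31, so roughly 69% of all rows are Surf, which is in line with the 0.68 the dummy model scores on the validation set.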