# Plan recommendation


What do you think about when you imagine the country's mobile communications market? I would venture to guess that there are several telecommunications companies competing for customers. However, when companies have already divided the market among themselves, it may make more sense to put efforts into retaining customers rather than attracting new ones.

The mobile operator has found that many customers still use legacy plans. It wants to build a system that analyzes customer behavior and recommends one of its new plans: "Smart" or "Ultra".

We have data on the behavior of customers who have already switched to these plans. We need to build a classification model that selects the appropriate plan.

Our task is to build a model with an **accuracy** value of at least 0.75.
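As a reminder of what this metric measures, here is a minimal sketch that computes accuracy by hand and with `accuracy_score`. The label arrays are hypothetical and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels for illustration: 1 = "Ultra", 0 = "Smart".
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 0, 1, 0, 0, 0, 1, 1])

# Accuracy is the share of predictions that match the true labels.
manual_accuracy = (y_true == y_pred).mean()
sklearn_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, sklearn_accuracy)
```

Here 6 of the 8 predictions are correct, so both values equal 0.75, which is exactly the threshold the project has to reach.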

The key steps are:
* Data exploration
* Splitting the data into samples
* Testing the quality of different models
* Testing the quality of the model on a test sample
* Sanity checking the model

**Data Description**

Each object in the dataset describes the behavior of one user during one month. The known fields are:
* **calls** - number of calls,
* **minutes** - total duration of calls in minutes,
* **messages** - number of SMS messages,
* **mb_used** - Internet traffic used, in MB,
* **is_ultra** - the plan used during the month ("Ultra" - 1, "Smart" - 0).


The project was built in **Jupyter Notebook** (notebook server version 6.1.4) with **Python** 3.7.8.
Libraries used in the project:
* **Pandas**
* **NumPy**
* **Math**
* **scikit-learn**
* **IPython**

## Data Exploration

In [1]:
# Import required libraries and modules.
import pandas as pd
import numpy as np
from IPython.display import display
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import accuracy_score
from sklearn.dummy import DummyClassifier
# Read the dataset.
data = pd.read_csv('users_behavior.csv')
display(data)
data.info()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.90 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| ... | ... | ... | ... | ... | ... |
| 3209 | 122.0 | 910.98 | 20.0 | 35124.90 | 1 |
| 3210 | 25.0 | 190.36 | 0.0 | 3275.61 | 0 |
| 3211 | 97.0 | 634.44 | 70.0 | 13974.06 | 0 |
| 3212 | 64.0 | 462.32 | 90.0 | 31239.78 | 0 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [2]:
# There is no need to store the data in the calls and messages 
# columns in float64 format.
# Convert them to int64 format.
data = data.astype({'calls':'int', 'messages':'int'})
display(data)
data.info()

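| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40 | 311.90 | 83 | 19915.42 | 0 |
| 1 | 85 | 516.75 | 56 | 22696.96 | 0 |
| 2 | 77 | 467.66 | 86 | 21060.45 | 0 |
| 3 | 106 | 745.53 | 81 | 8437.39 | 1 |
| 4 | 66 | 418.74 | 1 | 14502.75 | 0 |
| ... | ... | ... | ... | ... | ... |
| 3209 | 122 | 910.98 | 20 | 35124.90 | 1 |
| 3210 | 25 | 190.36 | 0 | 3275.61 | 0 |
| 3211 | 97 | 634.44 | 70 | 13974.06 | 0 |
| 3212 | 64 | 462.32 | 90 | 31239.78 | 0 |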


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   int64  
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   int64  
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 125.7 KB


### Conclusion

We have data on the behavior of customers who have already switched to Smart and Ultra plans.
This is a table with 5 columns and 3214 rows.
Original column formats:
* *calls*, *minutes*, *messages*, *mb_used* - `float64`
* *is_ultra* - `int64`

The data was already preprocessed in a previous project, so there are no missing values. We converted the *calls* and *messages* columns from `float64` to `int64`, since the number of calls and SMS messages is always a whole number.
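Before casting a float column to an integer type, it can be worth confirming that no fractional parts would be lost. The sketch below shows one way to do that check; the `sample` frame and the `count_columns` list are hypothetical stand-ins, not the project's actual data.

```python
import pandas as pd

# Hypothetical sample that mimics the dataset's count columns.
sample = pd.DataFrame({
    'calls': [40.0, 85.0, 77.0],
    'messages': [83.0, 56.0, 86.0],
})
count_columns = ['calls', 'messages']

# A float column can be cast to int without loss only if every
# value has a zero fractional part.
is_whole = (sample[count_columns] % 1 == 0).all()

if is_whole.all():
    sample = sample.astype({column: 'int64' for column in count_columns})

print(sample.dtypes)
```

If any value had a fractional part, the cast would be skipped, which would signal that the column needs a closer look first.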

## Splitting the data into samples

In [3]:
# Prepare the data for splitting into samples.
# The is_ultra column is the target feature.
# Extract features.
features = data.drop('is_ultra', axis=1)
# Determine the target feature.
target = data['is_ultra']
# Let's split the data into training, validation and test sets.
# First, split into training and validation.
# Validation set size 0.2.
features_train, features_valid, target_train, target_valid = train_test_split(
    features,
    target, 
    test_size=.2,
    random_state=12345
)
# Now we split the training set into test and
# the final version of the training set.
features_train, features_test, target_train, target_test = train_test_split(
    features_train,
    target_train, 
    test_size=.25,
    random_state=12345
)
# Samples are ready. Let's take a look at their sizes.
# We remember that the original dataset had 5 columns
# (4 features plus the target) and 3214 objects.
samples = [
    features_train, target_train, 
    features_valid, target_valid, 
    features_test, target_test
]
for sample in samples:
    display(sample.shape)

(1928, 4)

(1928,)

(643, 4)

(643,)

(643, 4)

(643,)

### Conclusion

We separated the target from the other features. The target is the *is_ultra* column, stored in the `target` variable. The remaining four columns are the features, stored in the `features` variable.

Calling the `train_test_split` function twice, we split the initial data into three sets: training (60% of the original set), validation (20%) and test (20%). The second call takes 25% of the remaining 80%, which is again 20% of the original set.
The data is stored in the corresponding variables: `features_train`, `target_train`, `features_valid`, `target_valid`, `features_test`, `target_test`.
This is a classification task. We will check three models: decision tree, random forest, and logistic regression.
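The arithmetic behind the 60/20/20 split can be reproduced on placeholder data. The sketch below uses zero-filled arrays with the same number of rows as the dataset; only the row counts matter here, and the `features_rest` and `target_rest` names are introduced for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset;
# the values themselves are arbitrary.
n_rows = 3214
features = np.zeros((n_rows, 4))
target = np.zeros(n_rows, dtype=int)

# Step 1: hold out 20% of all rows for validation.
features_rest, features_valid, target_rest, target_valid = train_test_split(
    features, target, test_size=0.2, random_state=12345
)
# Step 2: hold out 25% of the remaining 80% for testing.
# 0.25 * 0.8 = 0.2, so the test set is also 20% of the original data.
features_train, features_test, target_train, target_test = train_test_split(
    features_rest, target_rest, test_size=0.25, random_state=12345
)

for name, part in [('train', features_train),
                   ('valid', features_valid),
                   ('test', features_test)]:
    print(name, len(part), round(len(part) / n_rows, 3))
```

The resulting sizes are 1928, 643 and 643 rows, matching the shapes printed in the notebook.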

## Testing the quality of different models

In [4]:
# Explore the decision tree model by checking the value of accuracy
# with different max_depth parameter values.
for depth in range(1, 6):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth) 
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    print('max_depth =', depth, ': ', end='')
    print(accuracy_score(target_valid, predictions_valid))

max_depth = 1 : 0.7480559875583204
max_depth = 2 : 0.7838258164852255
max_depth = 3 : 0.7869362363919129
max_depth = 4 : 0.7869362363919129
max_depth = 5 : 0.7884914463452566


In [5]:
# Explore the random forest model by checking the accuracy value
# with different number of estimators.
best_model = None
best_result = 0
best_est = None
for est in range(1, 11):
    model = RandomForestClassifier(random_state=12345, n_estimators=est)
    model.fit(features_train, target_train)
    result = model.score(features_valid, target_valid)
    if result > best_result:
        best_model = model
        best_result = result
        # Remember which number of estimators gave the best result.
        best_est = est
# Print the best number of estimators rather than the last loop value.
print(best_est)
print('Accuracy of the best model on the validation set:', best_result)

10
Accuracy of the best model on the validation set: 0.7869362363919129


In [6]:
# Now we try logistic regression.
model = LogisticRegression(solver='lbfgs', random_state=12345)
model.fit(features_train, target_train)
result = model.score(features_valid, target_valid)
print("Accuracy of logistic regression model on the validation set:", result)

Accuracy of logistic regression model on the validation set: 0.7589424572317263


### Conclusion

We compared three models: decision tree, random forest, and logistic regression.
For the decision tree we varied the maximum depth (1 to 5), and for the random forest we varied the number of estimators (1 to 10).
The best validation accuracy was about 0.7885 for the decision tree (maximum depth 5) and about 0.7869 for the random forest (reported with 10 estimators). Logistic regression scored about 0.7589.
The tree and the forest are practically tied on the validation set. We decided to test the random forest on the test sample, since an ensemble of trees is generally less prone to overfitting than a single tree. Its lower speed can be neglected because the dataset is small.
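One way to keep such a comparison in a single place is to collect the validation score of every candidate in a dictionary and pick the maximum. The sketch below is an illustration only: it uses synthetic data from `make_classification` as a stand-in for the real dataset, and the `candidates` and `scores` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (an assumption for illustration).
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1,
    random_state=12345
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# Build every candidate model with a readable name.
candidates = {}
for depth in range(1, 6):
    candidates[f'tree(max_depth={depth})'] = DecisionTreeClassifier(
        random_state=12345, max_depth=depth
    )
for est in range(1, 11):
    candidates[f'forest(n_estimators={est})'] = RandomForestClassifier(
        random_state=12345, n_estimators=est
    )
candidates['logistic regression'] = LogisticRegression(
    solver='lbfgs', random_state=12345
)

# Fit each candidate and record its accuracy on the validation set.
scores = {
    name: model.fit(X_train, y_train).score(X_valid, y_valid)
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)
print(best_name, scores[best_name])
```

Keeping the name next to the score avoids having to infer the winning hyperparameter from the loop variable after the loop has finished.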

## Testing the quality of the model on a test sample

In [7]:
# Check the model on the test set.
model = RandomForestClassifier(random_state=12345, n_estimators=10)
model.fit(features_train, target_train)
result = model.score(features_test, target_test)
print("Accuracy of the best model on the test set:", result)

Accuracy of the best model on the test set: 0.7884914463452566


### Conclusion

On the test set the model reached an accuracy of 0.7884914463452566 (about 0.7885). The task is complete, because the accuracy exceeds the required level of 0.75. Nevertheless, let's run a sanity check.

## Sanity checking the model

In [8]:
# Create a simple baseline model that makes predictions
# without looking at the features.
# Compare its accuracy value with our model's value.
dummy = DummyClassifier(random_state=12345)
dummy.fit(features_train, target_train)
dummy.score(features_test, target_test)

0.6889580093312597

### Conclusion

With `DummyClassifier` we created a simple baseline model that makes predictions without using the features. We trained it on the training set and tested it on the test set. Its **accuracy** was 0.6889580093312597, which is lower than the accuracy of our model.
We can conclude that our model is sane: its test accuracy of 0.7884914463452566 is about 0.10 higher than the baseline.
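The notebook does not state which strategy `DummyClassifier` used, and the default depends on the scikit-learn version. As a minimal sketch, assuming a baseline that always predicts the most frequent class, its accuracy is simply the share of that class in the evaluated sample. The labels below are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels for illustration: 0 = "Smart", 1 = "Ultra".
y_train = np.array([0] * 7 + [1] * 3)
y_test = np.array([0] * 6 + [1] * 4)
# The baseline ignores the features, so placeholders are enough.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Assumption: an explicit "always predict the majority class" strategy.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)

baseline_accuracy = baseline.score(X_test, y_test)
majority_share = (y_test == 0).mean()
print(baseline_accuracy, majority_share)
```

Both values are 0.6 here, because 6 of the 10 test labels belong to the class that dominates the training labels.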

## General conclusion

We reviewed the dataset provided to us and split it into three sets: training, validation and test. By changing the hyperparameters of the models, we checked their quality on the validation set. As a result, we decided to test the random forest model. The accuracy of its predictions on the test set was 0.7884914463452566. This exceeds both the required level of 0.75 and the dummy model's accuracy of about 0.6890, so the model passes the sanity check.

In [9]:
conclusion = pd.DataFrame(
    index=['Results'],
    columns=['Target', 
              'Training set share', 
              'Validation set share', 
              'Test set share', 
              'Chosen model', 
              'Hyperparameter', 
              'Validation set accuracy', 
              'Test set accuracy', 
              'Dummy model accuracy', 
              'Model is sane (Yes/No)'],
    data=[['is_ultra', 
           '60 %',
           '20 %',
           '20 %', 
           'Random Forest',
           'n_estimators=10',
           0.7869,
           0.7885, 
           0.6890, 
           'Yes']]
)
conclusionStyler = conclusion.style.set_properties(
    **{'text-align': 'center'}
)
conclusionStyler.set_table_styles(
    [dict(selector='th', props=[('text-align', 'center')])]
)
display(conclusionStyler)

| | Target | Training set share | Validation set share | Test set share | Chosen model | Hyperparameter | Validation set accuracy | Test set accuracy | Dummy model accuracy | Model is sane (Yes/No) |
|---|---|---|---|---|---|---|---|---|---|---|
| Results | is_ultra | 60 % | 20 % | 20 % | Random Forest | n_estimators=10 | 0.7869 | 0.7885 | 0.689 | Yes |
