# Tariff Recommendation Model (Subscribers classification)

You have data on the behavior of customers who have already switched to the new tariffs. The task is to build a classification model that selects the appropriate tariff. Data preprocessing was already done in the [Best Tariff](../best_tariff) project.

Build the model with the highest possible *accuracy*. To pass the project, the share of correct answers must reach at least 0.75.
Check *accuracy* on the test set yourself.

## Open and study the file

In [1]:
import pandas as pd
from IPython.display import display
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
from sklearn.dummy import DummyClassifier

In [2]:
PATH = '/Users/vasily/Learning/Data Science/Projects/Telecom subscribers classification/'
df = pd.read_csv(PATH + 'datasets/users_behavior.csv')  # PATH already ends with '/'
print(df.sample(n=10, random_state=12345))
print(df.info())
print(df.describe())

      calls  minutes  messages   mb_used  is_ultra
1415   82.0   507.89      88.0  17543.37         1
916    50.0   375.91      35.0  12388.40         0
1670   83.0   540.49      41.0   9127.74         0
686    79.0   562.99      19.0  25508.19         1
2951   78.0   531.29      20.0   9217.25         0
654    53.0   478.18      78.0  20152.53         0
2827   73.0   582.47      33.0  12095.91         0
1466   31.0   172.10      25.0  31077.59         0
2223   28.0   222.21      30.0  22986.30         0
2639   68.0   523.56      14.0  18910.66         0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB
None
   

## Split Dataframes

In [3]:
# Split the dataframe into features and target
features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']

# Split into train, validation and test sets in the proportion 60:20:20
features_train, features_valid, target_train, target_valid = train_test_split(features, 
                                                                              target, 
                                                                              test_size = 0.4, random_state=123)
features_valid, features_test, target_valid, target_test = train_test_split(features_valid, 
                                                                            target_valid,
                                                                            test_size=0.5, random_state=123)
#Checking size of the frames
print(features_train.shape, target_train.shape)
print(features_valid.shape, target_valid.shape)
print(features_test.shape, target_test.shape)

print(target_train[target_train == 0].count())

(1928, 4) (1928,)
(643, 4) (643,)
(643, 4) (643,)
1334
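
The split above is not stratified, so the share of each tariff in the three parts can drift a little by chance. The following is a minimal sketch of the same 60:20:20 split with the `stratify` argument of `train_test_split`. It runs on synthetic stand-in data (the `demo_*` names and the 70/30 class ratio are assumptions for illustration, not the project's dataset).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 70% class 0, 30% class 1.
rng = np.random.default_rng(0)
demo_features = pd.DataFrame(
    rng.normal(size=(1000, 4)),
    columns=["calls", "minutes", "messages", "mb_used"],
)
demo_target = pd.Series([0] * 700 + [1] * 300, name="is_ultra")

# First cut: 60% train, 40% held out; stratify keeps the class ratio.
X_train, X_rest, y_train, y_rest = train_test_split(
    demo_features, demo_target, test_size=0.4,
    random_state=123, stratify=demo_target)

# Second cut: split the held-out 40% in half (validation and test).
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5,
    random_state=123, stratify=y_rest)

# Size and share of class 1 in each part.
for name, part in [("train", y_train), ("valid", y_valid), ("test", y_test)]:
    print(name, len(part), round(part.mean(), 3))
```

With stratification, the share of class 1 stays close to the overall share in every part, which makes validation and test accuracies easier to compare.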


## Study models

In [4]:
#Decision tree

#With default parameter values
model = tree.DecisionTreeClassifier(random_state=123)
model.fit(features_train, target_train)
predictions_valid = model.predict(features_valid)
print('accuracy of model with all defaults =', accuracy_score(target_valid, predictions_valid))

#Impact of max_depth on accuracy
best_max_depth = 0
best_accuracy = 0
best_tree_model = None
for depth in range(1,20):
    model = tree.DecisionTreeClassifier(random_state=123, max_depth=depth)
    model.fit(features_train, target_train)
    accuracy = model.score(features_valid, target_valid)
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_max_depth = depth
        best_tree_model = model
    print('max_depth = {0} : {1}'.format(depth, accuracy))
print('Optimal max depth: {0}, accuracy: {1}'.format(best_max_depth, best_accuracy))

#text_representation = tree.export_text(best_tree_model)
#print(text_representation)

accuracy of model with all defaults = 0.7465007776049767
max_depth = 1 : 0.7620528771384136
max_depth = 2 : 0.7900466562986003
max_depth = 3 : 0.80248833592535
max_depth = 4 : 0.8040435458786936
max_depth = 5 : 0.8227060653188181
max_depth = 6 : 0.8149300155520995
max_depth = 7 : 0.8118195956454122
max_depth = 8 : 0.80248833592535
max_depth = 9 : 0.8180404354587869
max_depth = 10 : 0.80248833592535
max_depth = 11 : 0.8149300155520995
max_depth = 12 : 0.8102643856920684
max_depth = 13 : 0.7884914463452566
max_depth = 14 : 0.7900466562986003
max_depth = 15 : 0.7713841368584758
max_depth = 16 : 0.7807153965785381
max_depth = 17 : 0.7698289269051322
max_depth = 18 : 0.7511664074650077
max_depth = 19 : 0.7573872472783826
Optimal max depth: 5, accuracy: 0.8227060653188181


In [5]:
#Random forest

best_forest_model = None
best_accuracy = 0
best_max_depth = 0
best_n_est = 0
for depth in range(1, 10):
    for est in range(1, 20):
        model = RandomForestClassifier(random_state=123, n_estimators=est, max_depth=depth)
        model.fit(features_train, target_train)
        accuracy = model.score(features_valid, target_valid)
        if accuracy > best_accuracy:
            best_max_depth = depth
            best_n_est = est
            best_forest_model = model
            best_accuracy = accuracy
print('Optimal max depth: {0}, optimal number of trees : {1}, accuracy: {2}'.format(best_max_depth, 
                                                                                              best_n_est, 
                                                                                              best_accuracy))

Optimal max depth: 8, optimal number of trees : 17, accuracy: 0.833592534992224
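
The nested loops above pick hyperparameters by accuracy on a single validation set. As an alternative (a different technique, not what this notebook does), the same kind of search can be run with cross-validation via `GridSearchCV`. The sketch below uses synthetic data from `make_classification` and a deliberately small grid; the data, the grid values and `cv=3` are all assumptions chosen so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (hypothetical, not users_behavior.csv).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    random_state=123)

# A small grid over the same two hyperparameters tuned above.
param_grid = {"max_depth": [2, 4, 8], "n_estimators": [5, 10]}

# Each combination is scored by 3-fold cross-validated accuracy.
search = GridSearchCV(
    RandomForestClassifier(random_state=123),
    param_grid, scoring="accuracy", cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Cross-validation averages over several splits, so the chosen parameters depend less on one particular validation sample.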


In [6]:
#Logistic regression

regressor_model = LogisticRegression(random_state=123, solver='liblinear')
regressor_model.fit(features_train, target_train)
accuracy = regressor_model.score(features_valid,target_valid) 
print('Accuracy of the logistic regression model:', accuracy)

Accuracy of the logistic regression model: 0.7200622083981337


**Conclusion**

The random forest gave the highest accuracy on the validation set: 0.83 with max depth = 8 and 17 trees. The decision tree came close (0.82 at max depth = 5).
Logistic regression gave the lowest accuracy (0.72).

## Check the model on the test set

In [7]:
print('Accuracy of the random forest model on the test set:', best_forest_model.score(features_test, target_test))
print('Accuracy of the logistic regression model on the test set:', regressor_model.score(features_test, target_test))
print('Accuracy of the decision tree model on the test set:', best_tree_model.score(features_test, target_test))

Accuracy of the random forest model on the test set: 0.7993779160186625
Accuracy of the logistic regression model on the test set: 0.6967340590979783
Accuracy of the decision tree model on the test set: 0.7713841368584758


**Conclusion**

On the test set the models rank in the same order as on the validation set (random forest, then decision tree, then logistic regression), but every accuracy is lower by roughly 2 to 5 percentage points.

## (bonus) Check models for adequacy

To check for adequacy, let's take trivial models and compare their performance indicators with those of the model we trained.
As primitive models, we take two:
1. A model that always predicts the class that prevails in the training sample. For our sample, this will be the "Smart" tariff (is_ultra = 0), since it is skewed towards Smart: 1334 against 594 records.
2. A model that predicts 0 or 1 uniformly at random.

In [8]:
#Model which always predicts most frequent class
frequent_classifier = DummyClassifier(strategy='most_frequent').fit(features_train, target_train)
print('Accuracy of a trivial model which always predicts Smart tariff:', frequent_classifier.score(features_test, target_test))

Accuracy of a trivial model which always predicts Smart tariff: 0.6936236391912908


In [9]:
#Model which predicts a class with probability 50%
uniform_classifier = DummyClassifier(strategy='uniform').fit(features_train, target_train)
print('Accuracy of a trivial model which predicts random class:', uniform_classifier.score(features_test, target_test))

Accuracy of a trivial model which predicts random class: 0.48367029548989116
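
Note that the 0.48 printed here does not match the confusion matrix computed later for the same classifier ((227 + 100) / 643 ≈ 0.51). This is consistent with `DummyClassifier(strategy='uniform')` drawing fresh random predictions on every call when no `random_state` is set. A minimal sketch, on synthetic data (an assumption, not the project's dataset), of how an integer `random_state` makes the baseline repeatable:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic stand-in data (hypothetical, not the project's dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.integers(0, 2, size=200)

# With an integer random_state, every predict() call is seeded the same way.
seeded = DummyClassifier(strategy="uniform", random_state=123).fit(X, y)
first = seeded.predict(X)
second = seeded.predict(X)

print("seeded runs identical:", np.array_equal(first, second))
```

Fixing the seed keeps `score` and `predict` consistent with each other, so the accuracy and the confusion matrix describe the same set of predictions.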


In [11]:
# Compare the confusion matrices of all three models
predictions_test = best_forest_model.predict(features_test)
freq_predictions_test = frequent_classifier.predict(features_test)
uniform_predictions_test = uniform_classifier.predict(features_test)

print('Confusion matrix of our model (random forest):')
print(confusion_matrix(target_test, predictions_test))
print('Confusion matrix of the model which always predicts Smart tariff:')
print(confusion_matrix(target_test, freq_predictions_test))
print('Confusion matrix of the model which predicts random class:')
print(confusion_matrix(target_test, uniform_predictions_test))

Confusion matrix of our model (random forest):
[[413  33]
 [ 96 101]]
Confusion matrix of the model which always predicts Smart tariff:
[[446   0]
 [197   0]]
Confusion matrix of the model which predicts random class:
[[227 219]
 [ 97 100]]


**In our model (random forest)** the number of True Negative cases (real: 0, predicted: 0) = 413; False Positive (real: 0, predicted: 1) = 33; False Negative (real: 1, predicted: 0) = 96; True Positive (real: 1, predicted: 1) = 101.

Note that our model is not very good at identifying the more expensive Ultra tariff: there are 96 False Negative cases, where "Smart" was predicted but the actual tariff was "Ultra", against 101 correctly identified Ultra users. This may mean that the operator misses potential profit.
Recall (sensitivity) = 101/(101+96) = 0.51. Random model recall: 100/(100+97) = 0.51.
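
The arithmetic can be checked directly from the three confusion matrices printed above. The sketch below copies those matrices by hand and derives accuracy, recall and precision for class 1 (Ultra); the helper `summarize` and the dictionary names are illustrative, not part of the notebook.

```python
import numpy as np

# Confusion matrices copied from the output above
# (rows: actual class, columns: predicted class).
matrices = {
    "random forest": np.array([[413, 33], [96, 101]]),
    "always Smart": np.array([[446, 0], [197, 0]]),
    "uniform random": np.array([[227, 219], [97, 100]]),
}

def summarize(cm):
    """Return accuracy, recall and precision for class 1 (Ultra)."""
    tn, fp, fn, tp = cm.ravel()
    accuracy = (tp + tn) / cm.sum()
    recall = tp / (tp + fn)
    # Precision is undefined when the model never predicts class 1.
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    return accuracy, recall, precision

metrics = {name: summarize(cm) for name, cm in matrices.items()}
for name, (acc, rec, prec) in metrics.items():
    print(f"{name}: accuracy={acc:.2f}, recall={rec:.2f}, precision={prec:.2f}")
```

The recall of the random forest and of the random baseline both round to 0.51, while their precision differs (101/134 for the forest against 100/319 for the random model), so recall alone does not capture the whole difference between them.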

## Conclusion

Of the three trained models (Decision Tree, Random Forest, Logistic Regression), the highest accuracy on the test set was achieved by the **Random Forest** model: **0.80**. This is 11 percentage points higher than the accuracy of the trivial model that always predicts "Smart" (0.69).
However, the sensitivity (recall) of our model for Ultra is almost the same as that of the random one.