# Tariff recommendation

We have data on the behavior of customers who have already switched to these tariffs. The task is to build a classification model that selects the appropriate tariff. No data preprocessing is required: it has already been done.

Build the model with the highest possible *accuracy*. To pass the project, the share of correct answers must be at least 0.75. Check *accuracy* on the test set yourself.

**Project plan:**
- Examine the data file.
- Divide the data into three samples: training, validation and test.
- Explore three classification models: Decision Tree, Random Forest and Logistic Regression.
- Tune the hyperparameters of each model and choose the best model.
- Evaluate the accuracy of the trained model.
- Assess the adequacy of the model.

## Data loading

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, fbeta_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


In [2]:
data = pd.read_csv('users_behavior.csv')
data

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.90,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
...,...,...,...,...,...
3209,122.0,910.98,20.0,35124.90,1
3210,25.0,190.36,0.0,3275.61,0
3211,97.0,634.44,70.0,13974.06,0
3212,64.0,462.32,90.0,31239.78,0


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


The target feature is `is_ultra`.

## Split the data into samples

- train (60%) `data_train`
- valid (20%) `data_valid`
- test (20%) `data_test`

In [4]:
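# test_size=0.8 sends 80% of the rows to data_left; the remaining 20% become data_test
data_test, data_left = train_test_split(data, test_size=0.8, shuffle=True, random_state=123)
features_test = data_test.drop(['is_ultra'], axis=1)
target_test = data_test['is_ultra']

# 25% of data_left (20% of all rows) become the validation sample
data_train, data_valid = train_test_split(data_left, test_size=0.25, shuffle=True, random_state=50)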
features_train = data_train.drop(['is_ultra'], axis=1)
target_train = data_train['is_ultra']
features_valid = data_valid.drop(['is_ultra'], axis=1)
target_valid = data_valid['is_ultra']

In [5]:
print('Train sample size', data_train.shape[0])
print('Valid sample size', data_valid.shape[0])
print('Test sample size', data_test.shape[0])

Train sample size 1929
Valid sample size 643
Test sample size 642
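
The printed sizes follow from chaining two fractional splits. The sketch below reproduces that arithmetic without scikit-learn. It assumes that the held-out ("test") part of each split is rounded up and the other part receives the remainder, which matches the numbers above; `split_sizes` is a hypothetical helper, not a library function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_first, n_second) for a fractional test_size.

    Assumption: the second ("test") part is rounded up and the first
    part receives the remainder, which reproduces the printed sizes.
    """
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second

n_total = 3214
# First split: test_size=0.8 sends 80% to data_left, leaving 20% for data_test.
n_test, n_left = split_sizes(n_total, 0.8)
# Second split: 25% of the remainder becomes the validation sample.
n_train, n_valid = split_sizes(n_left, 0.25)

for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {n} rows, {n / n_total:.1%} of the data")
```

The resulting shares are roughly 60% / 20% / 20% of the original 3214 rows.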


## Models

### DecisionTree

Select the optimal `max_depth` hyperparameter in a loop.

In [6]:
best_model = None
best_result = 0
best_depth = 0
result_accuracy = []

for depth in range(1, 11):
    tree_model = DecisionTreeClassifier(random_state=50, max_depth=depth)
    tree_model.fit(features_train, target_train)
    predictions_valid = tree_model.predict(features_valid)
    result = accuracy_score(target_valid, predictions_valid)
    result_accuracy.append(result)
    print(result, depth)
    
    if result > best_result:
        best_model = tree_model
        best_result = result
        best_depth = depth
    
    
print('Accuracy of the best model on the validation set:', best_result, 'Tree depth:', best_depth)

0.744945567651633 1
0.7776049766718507 2
0.7869362363919129 3
0.7807153965785381 4
0.7900466562986003 5
0.7667185069984448 6
0.7822706065318819 7
0.7884914463452566 8
0.7713841368584758 9
0.7853810264385692 10
Accuracy of the best model on the validation set: 0.7900466562986003 Tree depth: 5
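
To put the differences between depths in perspective, the accuracies can be converted back into counts of correctly classified validation objects. This is a small illustrative sketch using only the values printed above; the variable names are chosen here for illustration.

```python
n_valid = 643  # validation sample size printed earlier

# Validation accuracies printed by the loop, keyed by max_depth.
accuracy_by_depth = {
    1: 0.744945567651633,
    2: 0.7776049766718507,
    3: 0.7869362363919129,
    4: 0.7807153965785381,
    5: 0.7900466562986003,
    6: 0.7667185069984448,
    7: 0.7822706065318819,
    8: 0.7884914463452566,
    9: 0.7713841368584758,
    10: 0.7853810264385692,
}

# Accuracy times sample size gives the number of correct predictions.
correct_by_depth = {d: round(a * n_valid) for d, a in accuracy_by_depth.items()}

# Rank depths by the number of correct predictions, best first.
ranked = sorted(correct_by_depth, key=correct_by_depth.get, reverse=True)
best, runner_up = ranked[0], ranked[1]
margin = correct_by_depth[best] - correct_by_depth[runner_up]
print(correct_by_depth)
print(f"depth {best} beats depth {runner_up} by {margin} object(s)")
```

The winning depth is ahead of the runner-up by a single validation object, so the choice between nearby depths should be read as a weak preference rather than a clear ranking.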


### RandomForest

Select the optimal `max_depth` and `n_estimators` hyperparameters in a nested loop.

In [8]:
best_model = None
best_result = 0
best_est = 0
best_depth = 0

for est in range(1, 11):
    for depth in range(1, 11):
        forest_model = RandomForestClassifier(random_state=50, n_estimators=est, max_depth=depth)
        forest_model.fit(features_train, target_train)
        predictions_valid = forest_model.predict(features_valid)
        result = accuracy_score(target_valid, predictions_valid)
        print(result, est, depth)
    
        if result > best_result:
            best_model = forest_model
            best_result = result
            best_est = est
            best_depth = depth
        
print('Accuracy of the best model on the validation set:', best_result,
      'n_estimators:', best_est, 'max_depth:', best_depth)


0.7387247278382582 1 1
0.7807153965785381 1 2
0.7853810264385692 1 3
0.7931570762052877 1 4
0.7916018662519441 1 5
0.7900466562986003 1 6
0.7807153965785381 1 7
0.7822706065318819 1 8
0.7900466562986003 1 9
0.7776049766718507 1 10
0.7387247278382582 2 1
0.7791601866251944 2 2
0.7838258164852255 2 3
0.7869362363919129 2 4
0.7807153965785381 2 5
0.7822706065318819 2 6
0.7744945567651633 2 7
0.7853810264385692 2 8
0.7573872472783826 2 9
0.7573872472783826 2 10
0.7387247278382582 3 1
0.7807153965785381 3 2
0.7884914463452566 3 3
0.7962674961119751 3 4
0.7947122861586314 3 5
0.7993779160186625 3 6
0.7900466562986003 3 7
0.7947122861586314 3 8
0.7931570762052877 3 9
0.7978227060653188 3 10
0.7387247278382582 4 1
0.7511664074650077 4 2
0.7838258164852255 4 3
0.7916018662519441 4 4
0.7838258164852255 4 5
0.7869362363919129 4 6
0.7807153965785381 4 7
0.7869362363919129 4 8
0.7822706065318819 4 9
0.7916018662519441 4 10
0.7356143079315708 5 1
0.744945567651633 5 2
0.80248833592535 5 3
0.78849144
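
The nested loops over `n_estimators` and `max_depth` can also be written as a single loop over the Cartesian product of the parameter values. The sketch below shows the pattern only: `grid_search` and `evaluate` are hypothetical names, and `toy_score` is an artificial scoring function standing in for "fit a model and compute validation accuracy", so its result says nothing about the real data.

```python
from itertools import product

def grid_search(evaluate, grid):
    """Return (best_params, best_score) over the Cartesian product of grid values.

    `evaluate` is a hypothetical callable that takes keyword parameters and
    returns a validation score; in the notebook it would fit a model and
    compute accuracy on the validation sample.
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:  # strict '>' keeps the first of any ties
            best_params, best_score = params, score
    return best_params, best_score

# Toy scoring function for illustration only; it is not a real model.
def toy_score(n_estimators, max_depth):
    return -((n_estimators - 7) ** 2 + (max_depth - 4) ** 2)

grid = {"n_estimators": range(1, 11), "max_depth": range(1, 11)}
best_params, best_score = grid_search(toy_score, grid)
print(best_params, best_score)
```

Keeping the parameter dictionary together with its score avoids tracking each best value in a separate variable.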

### LogisticRegression

In [9]:
logistic_model = LogisticRegression(random_state=50, solver='lbfgs', max_iter=1000)
logistic_model.fit(features_train, target_train)
predictions_valid = logistic_model.predict(features_valid)
result = accuracy_score(target_valid, predictions_valid)

print('Accuracy of the best model on the validation set:', result)

Accuracy of the best model on the validation set: 0.7356143079315708


### Conclusions

**Summary**

- Best Decision Tree on the validation set: accuracy 0.79, tree depth 5.
- Best Random Forest on the validation set: accuracy 0.81, 9 trees, tree depth 9.
- Logistic Regression on the validation set: accuracy 0.74.

The Random Forest with `max_depth=9` and `n_estimators=9` gives the best result.

## Test

In [10]:
forest_model = RandomForestClassifier(random_state=50, n_estimators=9, max_depth=9)
forest_model.fit(features_train, target_train)
predictions_test = forest_model.predict(features_test)
accuracy_test = accuracy_score(target_test, predictions_test)

print(accuracy_test)

0.7897196261682243


The accuracy of the model on the test set is 0.79
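
Since the test sample has only 642 objects, it is worth checking how far the measured accuracy sits from the 0.75 threshold. A rough sketch, assuming a normal-approximation (Wald) interval is acceptable here; `accuracy_interval` is a hypothetical helper written for this note.

```python
import math

def accuracy_interval(n_correct, n_total, z=1.96):
    """Normal-approximation (Wald) interval for an accuracy estimate."""
    p = n_correct / n_total
    half_width = z * math.sqrt(p * (1 - p) / n_total)
    return p - half_width, p + half_width

n_test = 642
accuracy_test = 0.7897196261682243
# Recover the number of correct predictions from the printed accuracy.
n_correct = round(accuracy_test * n_test)

low, high = accuracy_interval(n_correct, n_test)
print(f"accuracy = {n_correct / n_test:.4f}, 95% interval = ({low:.3f}, {high:.3f})")
print("lower bound above the 0.75 threshold:", low > 0.75)
```

Under this approximation the lower bound stays above 0.75, although only by a small margin.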

## Check models

Precision, recall, and F-measure

In [11]:
report = classification_report(target_test, predictions_test, target_names=['Non-ultra', 'Ultra'])
print(report)

              precision    recall  f1-score   support

   Non-ultra       0.80      0.93      0.86       442
       Ultra       0.76      0.48      0.59       200

    accuracy                           0.79       642
   macro avg       0.78      0.70      0.72       642
weighted avg       0.78      0.79      0.77       642
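
The `f1-score` column is the harmonic mean of precision and recall for each class. The sketch below recomputes it from the rounded values in the report; because the inputs are already rounded to two decimals, the last digit can in general differ slightly from what scikit-learn prints.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Per-class precision and recall as printed (rounded) in the report above.
report_values = {
    "Non-ultra": {"precision": 0.80, "recall": 0.93},
    "Ultra": {"precision": 0.76, "recall": 0.48},
}

f1_scores = {
    name: f1(m["precision"], m["recall"]) for name, m in report_values.items()
}
for name, score in f1_scores.items():
    print(f"{name}: F1 = {score:.2f}")
```

The low recall of the Ultra class pulls its F1 score well below its precision, which is why the harmonic mean is more informative here than accuracy alone.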



In [12]:
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(features_train, target_train)
dummy_predict = dummy_clf.predict(features_test)
# score() expects features and true labels, not predictions
dummy_clf.score(features_test, target_test)

0.6884735202492211
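
The baseline score can be checked by hand: a classifier that always predicts the most frequent class is correct exactly for the objects of that class. The sketch below uses the class supports from the classification report and assumes that Non-ultra is also the most frequent class in the training sample, which is what `strategy="most_frequent"` learns from.

```python
# Class supports in the test sample, taken from the classification report.
support = {"Non-ultra": 442, "Ultra": 200}

# Always predicting the majority class is right for every object of that class.
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / sum(support.values())
print(majority_class, baseline_accuracy)

model_accuracy = 0.7897196261682243  # test accuracy of the random forest
print("gain over baseline:", round(model_accuracy - baseline_accuracy, 3))
```

The random forest improves on this baseline by about 0.10 in accuracy.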

## Conclusions

1) Three classification models were compared: Decision Tree, Random Forest and Logistic Regression, with hyperparameter tuning for the tree-based models.

The best result came from the Random Forest with `max_depth=9` and `n_estimators=9`.
Accuracy on the validation set: 0.81

2) The model was evaluated on the test sample with the following results:

- Accuracy (share of correct answers) = 0.79
- Precision (macro average) = 0.78
- Recall (macro average) = 0.70
- F1 score (macro average) = 0.72

The model clears the 0.75 accuracy threshold and beats the most-frequent-class baseline (0.69), so it passes the adequacy check. Its recall on the Ultra class is only 0.48, however, so roughly half of the Ultra users are still classified as Non-ultra.