# Tariff recommendation

We have data on the behavior of customers who have already switched to one of two tariffs, "Smart" or "Ultra". The goal is to build a classification model that recommends the appropriate tariff for a client.

The model must reach an *accuracy* of at least 0.75.

## Open and examine the file

In [1]:
import pandas as pd

df = pd.read_csv('/datasets/users_behavior.csv')
df

,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.90,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
...,...,...,...,...,...
3209,122.0,910.98,20.0,35124.90,1
3210,25.0,190.36,0.0,3275.61,0
3211,97.0,634.44,70.0,13974.06,0
3212,64.0,462.32,90.0,31239.78,0


In [2]:
df.describe().T

,count,mean,std,min,25%,50%,75%,max
calls,3214.0,63.038892,33.236368,0.0,40.0,62.0,82.0,244.0
minutes,3214.0,438.208787,234.569872,0.0,274.575,430.6,571.9275,1632.06
messages,3214.0,38.281269,36.148326,0.0,9.0,30.0,57.0,224.0
mb_used,3214.0,17207.673836,7570.968246,0.0,12491.9025,16943.235,21424.7,49745.73
is_ultra,3214.0,0.306472,0.4611,0.0,0.0,0.0,1.0,1.0


In [3]:
df.isna().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

In [4]:
df.duplicated().sum()

0

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


## Split the data into samples

In [6]:
from sklearn.model_selection import train_test_split

features = df.drop(columns=['is_ultra'])
target = df['is_ultra']

features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.4, random_state=12345)
features_valid, features_test, target_valid, target_test = train_test_split(features_valid, target_valid, test_size=0.5, random_state=12345)
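
The two calls above produce a 60/20/20 split into training, validation, and test samples. As a minimal sketch, assuming a synthetic stand-in frame with the same number of rows (3214) and a made-up class share close to that of `is_ultra`, the resulting sizes can be checked as follows. The `stratify` argument is an optional addition that is not part of the original cell; it keeps the class proportions similar across the three samples.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for users_behavior.csv (values are made up for illustration).
rng = np.random.default_rng(12345)
n_rows = 3214
demo = pd.DataFrame({
    'calls': rng.integers(0, 245, n_rows).astype(float),
    'minutes': rng.uniform(0, 1633, n_rows),
    'messages': rng.integers(0, 225, n_rows).astype(float),
    'mb_used': rng.uniform(0, 49746, n_rows),
    'is_ultra': (rng.random(n_rows) < 0.3).astype(int),
})

demo_features = demo.drop(columns=['is_ultra'])
demo_target = demo['is_ultra']

# Step 1: hold out 40% of the rows; stratify keeps the class share similar.
f_train, f_rest, t_train, t_rest = train_test_split(
    demo_features, demo_target, test_size=0.4, random_state=12345,
    stratify=demo_target)
# Step 2: split the held-out 40% in half: 20% validation, 20% test.
f_valid, f_test, t_valid, t_test = train_test_split(
    f_rest, t_rest, test_size=0.5, random_state=12345, stratify=t_rest)

split_sizes = {'train': len(f_train), 'valid': len(f_valid), 'test': len(f_test)}
print(split_sizes)
```

With 3214 rows this gives 1928 training rows and 643 rows each for validation and test.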

## Explore models

In [7]:
from sklearn.linear_model import LogisticRegression

logreg_result = 0
max_iter = 0
logreg_best = None
for iterations in range(100, 1001, 100):
    logreg = LogisticRegression(random_state=12345, solver='lbfgs', max_iter=iterations)
    logreg.fit(features_train, target_train)
    result = logreg.score(features_valid, target_valid)
    if result >= logreg_result:
        logreg_result = result
        max_iter = iterations
        logreg_best = logreg

print('Best accuracy:', logreg_result, 'Iterations:', max_iter)

Best accuracy: 0.7107309486780715 Iterations: 1000
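
Varying `max_iter` alone affects the score only while the solver has not yet converged, and because the loop compares with `>=`, tied scores resolve to the last value tried. The following is a minimal sketch of an optional variation that the notebook did not run: standardizing the features in a pipeline, which is a common way to help `lbfgs` converge in fewer iterations. The data here is synthetic and the column scales are made up to mimic features measured in very different units.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (made up); columns are rescaled to mimic
# features measured in very different units, e.g. calls vs. megabytes.
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
X = X * np.array([30.0, 200.0, 30.0, 7000.0])
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# The scaler is fitted on the training sample only, then reused for validation.
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345, solver='lbfgs', max_iter=1000),
)
scaled_logreg.fit(X_train, y_train)
scaled_accuracy = scaled_logreg.score(X_valid, y_valid)

# Number of lbfgs iterations actually used (an array with one entry here).
iterations_used = scaled_logreg[-1].n_iter_
print(scaled_accuracy, iterations_used)
```

Whether scaling improves accuracy on the real data is not something this sketch can show; it only illustrates how to wire the step in.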


In [8]:
from sklearn.ensemble import RandomForestClassifier

forest_result = 0
estimators = 0
best_depth = 0
forest_best = None
for est in range(10, 51, 10):
    for depth in range(1, 6):
        forest = RandomForestClassifier(random_state=12345, max_depth=depth, n_estimators=est)
        forest.fit(features_train, target_train)
        result = forest.score(features_valid, target_valid)
        if result >= forest_result:
            forest_result = result
            estimators = est
            # store the best depth separately from the loop variable
            best_depth = depth
            forest_best = forest

print('Best accuracy:', forest_result, 'Estimators:', estimators, 'Depth:', best_depth)

Best accuracy: 0.7947122861586314 Estimators: 40 Depth: 5


The random forest gave the best validation accuracy, about 0.79 (79%), with 40 trees and a maximum depth of 5; logistic regression reached only about 0.71.
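
A common pitfall in this kind of search loop is reusing the loop variable's name for the stored best value, which makes the reported setting equal to the last one tried. The sketch below is a hedged illustration on synthetic data with a smaller grid than the one used here: it keeps the best parameters in a separate dictionary and shows that they can also be read back from the fitted model. The names `best_params` and `best_model` are illustrative.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the notebook itself uses features_train / features_valid.
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

best_score, best_params, best_model = 0.0, None, None
# Small illustrative grid (smaller than the one in the notebook).
for n_estimators, max_depth in product(range(10, 31, 10), range(1, 6)):
    model = RandomForestClassifier(random_state=12345,
                                   n_estimators=n_estimators,
                                   max_depth=max_depth)
    model.fit(X_train, y_train)
    score = model.score(X_valid, y_valid)
    # Strict '>' keeps the first setting tried when scores tie.
    if score > best_score:
        best_score = score
        best_params = {'n_estimators': n_estimators, 'max_depth': max_depth}
        best_model = model

print(best_score, best_params)
# The fitted model carries its own hyperparameters, so they can be double-checked.
print(best_model.n_estimators, best_model.max_depth)
```

Reading `max_depth` and `n_estimators` from the stored model is a cheap way to confirm which setting actually won.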

## Evaluate the model on the test sample

In [9]:
forest_test = forest_best.score(features_test, target_test)
print(forest_test)

0.7838258164852255


## Sanity-check the model

In [10]:
ultra = df[df['is_ultra']==1]
ultra_x = ultra.drop(columns=['is_ultra'])
ultra_y = ultra['is_ultra']

smart = df[df['is_ultra']==0]
smart_x = smart.drop(columns=['is_ultra'])
smart_y = smart['is_ultra']

In [11]:
ultra.describe().T

,count,mean,std,min,25%,50%,75%,max
calls,985.0,73.392893,43.916853,0.0,41.0,74.0,104.0,244.0
minutes,985.0,511.224569,308.0311,0.0,276.03,502.55,730.05,1632.06
messages,985.0,49.363452,47.804457,0.0,6.0,38.0,79.0,224.0
mb_used,985.0,19468.823228,10087.178654,0.0,11770.28,19308.01,26837.72,49745.73
is_ultra,985.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0


In [12]:
smart.describe().T

,count,mean,std,min,25%,50%,75%,max
calls,2229.0,58.463437,25.939858,0.0,40.0,60.0,76.0,198.0
minutes,2229.0,405.942952,184.512604,0.0,274.23,410.56,529.51,1390.22
messages,2229.0,33.384029,28.227876,0.0,10.0,28.0,51.0,143.0
mb_used,2229.0,16208.466949,5870.498853,0.0,12643.05,16506.93,20043.06,38552.62
is_ultra,2229.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [13]:
forest_best.predict(ultra_x).mean()

0.49746192893401014

In [14]:
forest_best.predict(smart_x).mean()

0.04979811574697174
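
The two values above are the shares of users predicted as "Ultra" within each true class. They are computed on the full dataset, which includes the training rows, so they are likely optimistic. As a rough sketch, the class sizes from the `describe()` outputs (985 "Ultra" and 2229 "Smart" users) let us turn these shares into confusion-matrix counts and compare the model with a constant baseline that always predicts "Smart". The variable names are illustrative.

```python
# Class sizes from the describe() outputs above.
n_ultra = 985
n_smart = 2229

# Share of each class that the forest labels as "Ultra" (cells 13 and 14).
ultra_pred_share = 0.49746192893401014   # recall for "Ultra"
smart_pred_share = 0.04979811574697174   # false positive rate for "Smart"

# Convert the shares back to counts.
true_pos = round(ultra_pred_share * n_ultra)    # "Ultra" users found
false_neg = n_ultra - true_pos                  # "Ultra" users missed
false_pos = round(smart_pred_share * n_smart)   # "Smart" users labeled "Ultra"
true_neg = n_smart - false_pos

total = n_ultra + n_smart
model_accuracy = (true_pos + true_neg) / total
# A constant model that always answers "Smart" is right for every "Smart" user.
baseline_accuracy = n_smart / total
precision = true_pos / (true_pos + false_pos)

print(true_pos, false_neg, false_pos, true_neg)
# prints: 490 495 111 2118
print(round(model_accuracy, 3), round(baseline_accuracy, 3), round(precision, 3))
# prints: 0.811 0.694 0.815
```

Read this way, the forest finds about half of the "Ultra" users while rarely mislabeling "Smart" users, and both its full-data accuracy and its test accuracy of about 0.78 are above the roughly 0.69 that a constant "Smart" prediction would score.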

<div style="background: #B0E0E6; padding: 5px; border: 1px solid SteelBlue; border-radius: 5px;">
    <font color='4682B4'><u><b>Conclusion
        
</b></u></font>
    <br />
    <font color='4682B4'>We were given data on users of the "Ultra" and "Smart" tariffs: the number and duration of calls, the number of SMS messages, and the internet traffic used. The task was to train a model that classifies users by their activity so that a suitable tariff can be recommended to a client. Two models were compared across several hyperparameter settings: logistic regression and a random forest. The random forest with 40 trees and a depth of 5 performed best, with a validation accuracy of about 79%. On the test sample the model reached about 78%, above the required 0.75.</font>
</div>