# Recommendation of Tariff
(Introduction to ML)

## Contents

1. [Introduction](#intro)
2. [General information](#general)
3. [Split data](#split)
4. [Models](#models)
5. [Checking Models on Test Set](#testset)
6. [General Conclusion](#bigconclusion)



## Introduction <a id='intro'></a>

**Project Description**

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.
You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model.
Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.
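
To make the 0.75 threshold concrete: accuracy is the share of predictions that match the true labels. The sketch below computes it by hand on made-up labels; the `accuracy` helper and the toy lists are illustrative assumptions, not part of the project data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels (hypothetical, not from the Megaline data): 1 = Ultra, 0 = Smart
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 0, 1, 1, 0]

score = accuracy(y_true, y_pred)   # 6 of 8 predictions are correct
meets_threshold = score >= 0.75    # the project requires at least 0.75
print(score, meets_threshold)
```

This is the same quantity that `accuracy_score` from `sklearn.metrics` reports in the cells that follow.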

**Data Description**

Every observation in the dataset contains monthly behavior information about one user. The information given is as follows:
- calls — number of calls,
- minutes — total call duration in minutes,
- messages — number of text messages,
- mb_used — Internet traffic used in MB,
- is_ultra — plan for the current month (Ultra - 1, Smart - 0).

*Libraries*

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## General information <a id='general'></a>

In [2]:
try:
    telecom_users = pd.read_csv('users_behavior.csv')
except FileNotFoundError:
    # Fall back to the alternative dataset location
    telecom_users = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
telecom_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
telecom_users.sample(5)

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 321 | 38.0 | 237.78 | 5.0 | 4905.5 | 0 |
| 24 | 56.0 | 360.3 | 30.0 | 13560.15 | 0 |
| 966 | 106.0 | 619.94 | 0.0 | 27762.85 | 1 |
| 941 | 35.0 | 270.26 | 15.0 | 15482.34 | 0 |
| 3171 | 50.0 | 393.33 | 14.0 | 6557.41 | 0 |


In [5]:
telecom_users.isna().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

In [6]:
telecom_users.duplicated().sum()

0

### Conclusion

Before modeling, I checked the dataset for problems: it has 3214 rows, no missing values, and no duplicate rows.
The data looks clean, so we can move on to the next step.
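
The checks above can be bundled into one reusable function. This is a minimal sketch: `check_dataset` is a hypothetical helper, and the small frame reuses three rows from the sample shown earlier only to keep the example self-contained.

```python
import pandas as pd


def check_dataset(df, target="is_ultra"):
    """Return basic quality indicators for a behavior dataset."""
    return {
        "rows": len(df),
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        # The target should only contain 0 (Smart) and 1 (Ultra)
        "target_is_binary": set(df[target].unique()) <= {0, 1},
    }


# Three rows in the same layout as users_behavior.csv
toy = pd.DataFrame({
    "calls": [38.0, 56.0, 106.0],
    "minutes": [237.78, 360.30, 619.94],
    "messages": [5.0, 30.0, 0.0],
    "mb_used": [4905.50, 13560.15, 27762.85],
    "is_ultra": [0, 0, 1],
})

report = check_dataset(toy)
print(report)
```

On the real data the same call would be `check_dataset(telecom_users)`.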

## Split data <a id='split'></a>

The dataset has 3214 observations and 5 columns: four features (`calls`, `minutes`, `messages`, `mb_used`) and the target, `is_ultra`, which is the plan we need to predict.

First we need to split the data into three parts: training, validation, and test sets.

In [7]:
# Hold out 40% of the rows, then halve that part into test and validation sets
df_train, df_rest = train_test_split(telecom_users, test_size=0.4, random_state=12345)
df_test, df_valid = train_test_split(df_rest, test_size=0.5, random_state=12345)

### Conclusion

We split the dataset into training, validation, and test sets in a 60%/20%/20% ratio.
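
A quick way to confirm the proportions is to run the same two-step split on row indices and count the results. The sketch below uses a plain list of 3214 indices as a stand-in for the DataFrame (an assumption made to keep it self-contained); the `sizes` and `shares` names are illustrative.

```python
from sklearn.model_selection import train_test_split

n_rows = 3214  # number of rows reported by info()
indices = list(range(n_rows))

# Same two-step scheme as in the notebook: keep 60% for training,
# then halve the remaining 40% into test and validation parts
train_idx, rest_idx = train_test_split(indices, test_size=0.4, random_state=12345)
test_idx, valid_idx = train_test_split(rest_idx, test_size=0.5, random_state=12345)

sizes = {"train": len(train_idx), "valid": len(valid_idx), "test": len(test_idx)}
shares = {name: round(size / n_rows, 3) for name, size in sizes.items()}
print(sizes)
print(shares)

# Every row should land in exactly one part
all_rows = set(train_idx) | set(valid_idx) | set(test_idx)
print(len(all_rows) == n_rows)
```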

## Models <a id='models'></a>


We will compare three models:
- Decision tree
- Random forest
- Logistic regression


In [8]:
features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']

features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']

*Decision Tree*


In [9]:
# Try tree depths 1..14 and report the validation accuracy for each
for depth in range(1, 15):
    model_dt = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model_dt.fit(features_train, target_train)
    predictions_dt = model_dt.predict(features_valid)
    accuracy_dt = accuracy_score(target_valid, predictions_dt)
    print("max_depth =", depth, ": ", accuracy_dt)

max_depth = 1 :  0.7356143079315708
max_depth = 2 :  0.7744945567651633
max_depth = 3 :  0.7791601866251944
max_depth = 4 :  0.7744945567651633
max_depth = 5 :  0.7838258164852255
max_depth = 6 :  0.776049766718507
max_depth = 7 :  0.7993779160186625
max_depth = 8 :  0.7931570762052877
max_depth = 9 :  0.7807153965785381
max_depth = 10 :  0.7884914463452566
max_depth = 11 :  0.7744945567651633
max_depth = 12 :  0.7807153965785381
max_depth = 13 :  0.7713841368584758
max_depth = 14 :  0.76049766718507
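
Rather than reading the best depth off the printout, it can be picked programmatically. The sketch below copies the validation accuracies from the output above and also converts them back to counts of correct answers, which shows how few rows separate neighboring depths. The validation size of 643 is an assumption derived from taking 20% of 3214 rows.

```python
# Validation accuracies copied from the output above, keyed by max_depth
valid_accuracy = {
    1: 0.7356143079315708, 2: 0.7744945567651633, 3: 0.7791601866251944,
    4: 0.7744945567651633, 5: 0.7838258164852255, 6: 0.776049766718507,
    7: 0.7993779160186625, 8: 0.7931570762052877, 9: 0.7807153965785381,
    10: 0.7884914463452566, 11: 0.7744945567651633, 12: 0.7807153965785381,
    13: 0.7713841368584758, 14: 0.76049766718507,
}
n_valid = 643  # assumption: 20% of 3214 rows ended up in the validation set

# Depth with the highest validation accuracy
best_depth = max(valid_accuracy, key=valid_accuracy.get)
best_score = valid_accuracy[best_depth]

# Accuracy times set size gives the number of correctly classified rows
correct = {depth: round(acc * n_valid) for depth, acc in valid_accuracy.items()}
gap_7_vs_8 = correct[7] - correct[8]

print(best_depth, best_score)
print(correct[7], correct[8], gap_7_vs_8)
```

Depths 7 and 8 differ by only a handful of validation rows, so the choice between them should not be treated as decisive.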


*Random Forest*

In [10]:
# Try forests of 1..14 trees and report the validation accuracy for each
for est in range(1, 15):
    model_rf = RandomForestClassifier(random_state=12345, n_estimators=est)
    model_rf.fit(features_train, target_train)
    predictions_rf = model_rf.predict(features_valid)
    accuracy_rf = accuracy_score(target_valid, predictions_rf)
    print("n_estimators =", est, ": ", accuracy_rf)

n_estimators = 1 :  0.7402799377916018
n_estimators = 2 :  0.7589424572317263
n_estimators = 3 :  0.7573872472783826
n_estimators = 4 :  0.7729393468118196
n_estimators = 5 :  0.7667185069984448
n_estimators = 6 :  0.7791601866251944
n_estimators = 7 :  0.7807153965785381
n_estimators = 8 :  0.7869362363919129
n_estimators = 9 :  0.7838258164852255
n_estimators = 10 :  0.7807153965785381
n_estimators = 11 :  0.7729393468118196
n_estimators = 12 :  0.7869362363919129
n_estimators = 13 :  0.7838258164852255
n_estimators = 14 :  0.7807153965785381
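
The loop above varies only `n_estimators`, while the test section later fits a forest with `max_depth = 7` as well. One way to validate both settings together is a small grid over the two hyperparameters. The sketch below runs on synthetic data from `make_classification` (an assumption standing in for the Megaline features), and the grid ranges are arbitrary illustrations, not recommended values.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 4 numeric features and a binary target (not the real data)
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

best = {"accuracy": 0.0, "n_estimators": None, "max_depth": None}

# Evaluate every combination of forest size and tree depth on the validation part
for n_estimators, max_depth in product(range(10, 51, 10), range(2, 11, 2)):
    model = RandomForestClassifier(n_estimators=n_estimators,
                                   max_depth=max_depth,
                                   random_state=12345)
    model.fit(X_train, y_train)
    accuracy = model.score(X_valid, y_valid)
    if accuracy > best["accuracy"]:
        best = {"accuracy": accuracy,
                "n_estimators": n_estimators,
                "max_depth": max_depth}

print(best)
```

With the notebook's data, `X_train`, `y_train`, `X_valid`, and `y_valid` would be replaced by `features_train`, `target_train`, `features_valid`, and `target_valid`.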


*Logistic regression*


In [11]:
model_lr = LogisticRegression(random_state=12345)
model_lr.fit(features_train, target_train)
predictions_lr = model_lr.predict(features_valid)
accuracy_lr = accuracy_score(target_valid, predictions_lr)
accuracy_lr



0.7402799377916018
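
The features here live on very different scales: `calls` and `messages` are in the tens, while `mb_used` reaches tens of thousands. Logistic regression can be sensitive to that, so standardizing the features first is a common thing to try. The sketch below shows the pattern on synthetic data with one artificially inflated column (both are assumptions); whether it improves accuracy on the Megaline data has not been tested here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four behavior features (not the real data)
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
# Inflate one column to mimic a feature measured in much larger units, like mb_used
X[:, 3] *= 10_000

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# The scaler is fitted on the training part only, then applied before the model
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(random_state=12345))
scaled_lr.fit(X_train, y_train)
scaled_accuracy = scaled_lr.score(X_valid, y_valid)
print(scaled_accuracy)
```

Wrapping the scaler and the model in one pipeline keeps validation and test rows out of the scaling statistics.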

### Conclusion

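On the validation set, the decision tree performs best at `max_depth = 7` (accuracy 0.799, followed by 0.793 at depth 8), and the random forest peaks at `n_estimators = 8` and `12` (accuracy 0.787). Logistic regression reaches 0.740, the lowest of the three best results.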
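
Accuracy figures are easier to interpret next to a trivial baseline that always predicts the most frequent plan. The sketch below uses `DummyClassifier` on a made-up 70/30 class split; the real class balance is not shown in this notebook, so these labels are purely an assumption.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical imbalanced target: 70% class 0 (Smart), 30% class 1 (Ultra)
y_train = np.array([0] * 70 + [1] * 30)
y_valid = np.array([0] * 14 + [1] * 6)

# DummyClassifier ignores the features, so placeholder columns are enough
X_train = np.zeros((len(y_train), 1))
X_valid = np.zeros((len(y_valid), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

# Equals the share of the majority class in y_valid
baseline_accuracy = baseline.score(X_valid, y_valid)
print(baseline_accuracy)
```

A trained model is only useful if it beats this number by a clear margin on the same validation set.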

## Checking Models on Test Set <a id='testset'></a>

In [12]:
model_dt = DecisionTreeClassifier(random_state = 12345, max_depth = 7)
model_dt.fit(features_train, target_train)
model_dt.score(features_test, target_test)

0.7822706065318819

In [13]:
model_rf = RandomForestClassifier(random_state = 12345, max_depth = 7, n_estimators = 8)
model_rf.fit(features_train, target_train)
model_rf.score(features_test, target_test)

0.7978227060653188

In [14]:
model_lr.score(features_test, target_test)

0.7589424572317263

### Conclusion
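On the test set, the decision tree (`max_depth = 7`) scores 0.782, the random forest (`n_estimators = 8`, `max_depth = 7`) scores 0.798, and logistic regression scores 0.759. All three exceed the 0.75 threshold. Note that `max_depth = 7` for the random forest was carried over from the decision tree rather than tuned on the validation set.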
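
How much weight the differences between these scores can bear depends on the size of the test set. As a rough guide, the binomial standard error of an accuracy `p` measured on `n` rows is `sqrt(p * (1 - p) / n)`. The sketch below applies it to the three test accuracies reported above; the test size of 643 is an assumption derived from 20% of 3214 rows, and `standard_error` is an illustrative helper.

```python
from math import sqrt

n_test = 643  # assumption: 20% of 3214 rows, as produced by the split above

# Test accuracies copied from the cells above
test_accuracy = {
    "decision_tree": 0.7822706065318819,
    "random_forest": 0.7978227060653188,
    "logistic_regression": 0.7589424572317263,
}


def standard_error(p, n):
    """Binomial standard error of an accuracy p measured on n samples."""
    return sqrt(p * (1 - p) / n)


for name, acc in test_accuracy.items():
    se = standard_error(acc, n_test)
    print(f"{name}: {acc:.3f} +/- {se:.3f}, above 0.75: {acc > 0.75}")

# Compare the forest-vs-tree gap with the uncertainty of a single estimate
gap = test_accuracy["random_forest"] - test_accuracy["decision_tree"]
se_forest = standard_error(test_accuracy["random_forest"], n_test)
print(gap, se_forest)
```

The gap between the forest and the tree is about the size of one standard error, so the ranking of these two models should be read with caution.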

## General Conclusion <a id='bigconclusion'></a>

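All three models exceed the 0.75 accuracy threshold on the test set. `RandomForestClassifier` (0.798) and `DecisionTreeClassifier` (0.782) perform best, while `LogisticRegression` (0.759) passes the threshold by a narrow margin. Based on these results, the random forest is the strongest candidate for recommending a plan.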

