# Tariff recommendation

The goal is to build a classification model with the highest possible *accuracy* (at least 0.75) that picks a suitable tariff based on data from customers who have already chosen one of these tariffs.

## General information

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


The target in this table is **is_ultra**, a categorical feature that encodes the tariff.

## Splitting the data into sets

In [4]:
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [5]:
len(df)

3214

Select 20% of the records for the test set.

In [6]:
df_test = df.sample(frac=0.2, random_state=42)

In [7]:
len(df_test)

643

Collect the remaining records.

In [8]:
df_train_valid = df[~df.index.isin(df_test.index)]

In [9]:
len(df_train_valid)

2571

Split the remaining part into training and validation sets in a 3:1 ratio.

In [10]:
df_train, df_valid = train_test_split(df_train_valid, test_size=0.25, random_state=42)

Check the number of records.

In [11]:
len(df_train)

1928

In [12]:
len(df_valid)

643

In [13]:
len(df_train) + len(df_valid)

2571

### Conclusion

We set aside 20% of the data (643 records) as the test set and split the rest into a training set (75%: 1928 records) and a validation set (25%: 643 records).
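The same 60/20/20 proportions can also be obtained with two calls to `train_test_split`. The sketch below is only an illustration: it runs on a small synthetic frame (the values are made up, only the column names mirror the dataset), and it adds the `stratify` argument, which the notebook does not use, so that the share of `is_ultra` stays the same in every part.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for users_behavior.csv (values are made up for illustration).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "calls": rng.integers(0, 200, size=1000),
    "is_ultra": rng.choice([0, 1], size=1000, p=[0.7, 0.3]),
})

# Step 1: hold out 20% of the rows as the test set.
demo_train_valid, demo_test = train_test_split(
    demo, test_size=0.2, random_state=42, stratify=demo["is_ultra"]
)

# Step 2: split the remaining 80% in a 3:1 ratio into training and validation.
demo_train, demo_valid = train_test_split(
    demo_train_valid, test_size=0.25, random_state=42,
    stratify=demo_train_valid["is_ultra"],
)

print(len(demo_train), len(demo_valid), len(demo_test))
```

Whether stratification matters here depends on how unbalanced `is_ultra` is in the real data, which this notebook does not check.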

## Model exploration

Extract the training features from the training and validation sets.

In [14]:
features_train = df_train.drop(['is_ultra'], axis=1)
features_train.head()

Unnamed: 0,calls,minutes,messages,mb_used
2801,34.0,279.06,83.0,13463.84
237,56.0,469.49,145.0,15877.65
566,37.0,244.58,0.0,22306.89
2898,65.0,387.84,0.0,17035.25
2318,39.0,242.87,11.0,15370.83


In [15]:
features_valid = df_valid.drop(['is_ultra'], axis=1)
features_valid.head()

Unnamed: 0,calls,minutes,messages,mb_used
1443,9.0,88.63,0.0,3390.62
2639,68.0,523.56,14.0,18910.66
3108,52.0,337.17,11.0,13400.4
490,56.0,334.29,82.0,21969.01
2235,49.0,390.87,69.0,15413.9


Extract the targets as well.

In [16]:
target_train = df_train['is_ultra']
target_valid = df_valid['is_ultra']

In [17]:
target_train.head()

2801    0
237     1
566     0
2898    0
2318    0
Name: is_ultra, dtype: int64

In [18]:
target_valid.head()

1443    0
2639    0
3108    0
490     0
2235    0
Name: is_ultra, dtype: int64

In [19]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

Build a table of model validation results so the models can be compared.

In [20]:
model_results = []

Create a function that computes the model's quality score.

In [21]:
def get_model_score(model, features_train, target_train, features_valid, target_valid):
    # Fit on the training data, then measure accuracy on the validation data.
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    return accuracy_score(target_valid, predictions_valid)

Write a function that keeps track, on the fly, of the model with the best score.

In [22]:
def check_best_model(cur_model, cur_score, compare_model, compare_score):
    best_score = cur_score
    best_model = cur_model
    if compare_score > cur_score:
        best_score = compare_score
        best_model = compare_model
    return best_model, best_score
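# A small illustration of how this helper treats ties. The snippet repeats the
# same logic so that it runs on its own; the (name, score) pairs are
# hypothetical stand-ins for fitted estimators and their accuracy.

```python
def check_best_model(cur_model, cur_score, compare_model, compare_score):
    # Same rule as the helper above: replace only on a strictly better score.
    if compare_score > cur_score:
        return compare_model, compare_score
    return cur_model, cur_score

# Hypothetical candidates: model_b and model_c tie on the best score.
candidates = [("model_a", 0.79), ("model_b", 0.80), ("model_c", 0.80)]

best_model, best_score = None, 0
for name, score in candidates:
    best_model, best_score = check_best_model(best_model, best_score, name, score)

print(best_model, best_score)  # model_b 0.8
```

# Because the comparison is strict, the first model that reaches the top score
# is kept and later models with an equal score are ignored.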
    
        

A function that returns the results for a given model in sorted order.

In [23]:
def get_model_results(model_name=None):
    # Collect all recorded runs; optionally keep only the rows for one model.
    results = pd.DataFrame(model_results, columns=['model', 'score', 'n_estimators', 'max_depth', 'min_samples_split', 'min_samples_leaf'])
    if model_name is not None:
        results = results[results['model'] == model_name]
    # Best score first; ties are broken by the smaller hyperparameter values.
    return results.sort_values(
        by=['score', 'n_estimators', 'max_depth', 'min_samples_split', 'min_samples_leaf'], ascending=[False, True, True, True, True])

Set the hyperparameter ranges for the search.

In [24]:
max_depth_range = range(1, 11)
n_estimators_range = range(1, 101, 5)
min_samples_split_range = range(2, 75, 5)
min_samples_leaf_range = range(1, 75, 5)
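# To get a feel for the cost of the search, the sketch below counts the values
# in each range and the number of fits a full grid over the three tree
# parameters needs. The name `tree_fits` is introduced here for illustration.

```python
max_depth_range = range(1, 11)
n_estimators_range = range(1, 101, 5)
min_samples_split_range = range(2, 75, 5)
min_samples_leaf_range = range(1, 75, 5)

# Number of candidate values per hyperparameter.
print(len(max_depth_range), len(n_estimators_range),
      len(min_samples_split_range), len(min_samples_leaf_range))  # 10 20 15 15

# Fits needed for a full decision-tree grid (depth x split x leaf).
tree_fits = (len(max_depth_range)
             * len(min_samples_split_range)
             * len(min_samples_leaf_range))
print(tree_fits)  # 2250
```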

Evaluate the models by iterating over the hyperparameters and collect the results in one common table.

In [25]:
best_model = None
best_score = 0

### LogisticRegression

In [26]:
best_model_lr = None
best_score_lr = 0

In [27]:
model = LogisticRegression(random_state=42)

In [28]:
score = get_model_score(model, features_train, target_train, features_valid, target_valid)

In [29]:
best_model, best_score = check_best_model(best_model, best_score, model, score)
best_model_lr, best_score_lr = check_best_model(best_model_lr, best_score_lr, model, score)

In [30]:
model_results.append(['LogisticRegression', score, None, None, None, None])

In [31]:
get_model_results('LogisticRegression')

Unnamed: 0,model,score,n_estimators,max_depth,min_samples_split,min_samples_leaf
0,LogisticRegression,0.7014,,,,


### BaggingClassifier

In [32]:
best_model_bc = None
best_score_bc = 0

In [33]:
for n_estimators in n_estimators_range:
    model = BaggingClassifier(random_state=42, n_estimators=n_estimators)
    score = get_model_score(model, features_train, target_train, features_valid, target_valid)
    best_model, best_score = check_best_model(best_model, best_score, model, score)
    best_model_bc, best_score_bc = check_best_model(best_model_bc, best_score_bc, model, score)
    model_results.append(
        ['BaggingClassifier', score, n_estimators, None, None, None])

In [34]:
get_model_results('BaggingClassifier').head(10)

Unnamed: 0,model,score,n_estimators,max_depth,min_samples_split,min_samples_leaf
6,BaggingClassifier,0.802488,26.0,,,
12,BaggingClassifier,0.799378,56.0,,,
8,BaggingClassifier,0.797823,36.0,,,
3,BaggingClassifier,0.794712,11.0,,,
4,BaggingClassifier,0.794712,16.0,,,
5,BaggingClassifier,0.794712,21.0,,,
13,BaggingClassifier,0.794712,61.0,,,
14,BaggingClassifier,0.794712,66.0,,,
7,BaggingClassifier,0.793157,31.0,,,
9,BaggingClassifier,0.793157,41.0,,,


### AdaBoostClassifier

In [35]:
best_model_abc = None
best_score_abc = 0

In [36]:
for n_estimators in n_estimators_range:
    model = AdaBoostClassifier(random_state=42, n_estimators=n_estimators)
    score = get_model_score(model, features_train, target_train, features_valid, target_valid)
    best_model, best_score = check_best_model(best_model, best_score, model, score)
    best_model_abc, best_score_abc = check_best_model(best_model_abc, best_score_abc, model, score)
    model_results.append(
        ['AdaBoostClassifier', score, n_estimators, None, None, None])

In [37]:
get_model_results('AdaBoostClassifier').head(10)

Unnamed: 0,model,score,n_estimators,max_depth,min_samples_split,min_samples_leaf
22,AdaBoostClassifier,0.805599,6.0,,,
25,AdaBoostClassifier,0.802488,21.0,,,
26,AdaBoostClassifier,0.802488,26.0,,,
27,AdaBoostClassifier,0.797823,31.0,,,
28,AdaBoostClassifier,0.797823,36.0,,,
24,AdaBoostClassifier,0.796267,16.0,,,
23,AdaBoostClassifier,0.793157,11.0,,,
29,AdaBoostClassifier,0.793157,41.0,,,
35,AdaBoostClassifier,0.793157,71.0,,,
30,AdaBoostClassifier,0.791602,46.0,,,


### DecisionTreeClassifier

In [38]:
best_model_dtc = None
best_score_dtc = 0

In [39]:
for max_depth in max_depth_range:
    for min_samples_split in min_samples_split_range:
        for min_samples_leaf in min_samples_leaf_range:
            model = DecisionTreeClassifier(random_state=42, max_depth=max_depth, min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf)
            score = get_model_score(model, features_train, target_train, features_valid, target_valid)
            best_model, best_score = check_best_model(best_model, best_score, model, score)
            best_model_dtc, best_score_dtc = check_best_model(best_model_dtc, best_score_dtc, model, score)
            model_results.append(
                ['DecisionTreeClassifier', score, None, max_depth, min_samples_split, min_samples_leaf])

In [40]:
get_model_results('DecisionTreeClassifier').head(10)

Unnamed: 0,model,score,n_estimators,max_depth,min_samples_split,min_samples_leaf
716,DecisionTreeClassifier,0.800933,,4.0,2.0,1.0
731,DecisionTreeClassifier,0.800933,,4.0,7.0,1.0
746,DecisionTreeClassifier,0.800933,,4.0,12.0,1.0
761,DecisionTreeClassifier,0.800933,,4.0,17.0,1.0
776,DecisionTreeClassifier,0.800933,,4.0,22.0,1.0
791,DecisionTreeClassifier,0.800933,,4.0,27.0,1.0
806,DecisionTreeClassifier,0.800933,,4.0,32.0,1.0
821,DecisionTreeClassifier,0.800933,,4.0,37.0,1.0
836,DecisionTreeClassifier,0.800933,,4.0,42.0,1.0
949,DecisionTreeClassifier,0.800933,,5.0,2.0,41.0


### RandomForestClassifier

In [41]:
best_model_rfc = None
best_score_rfc = 0

In [42]:
for n_estimators in n_estimators_range:
    model = RandomForestClassifier(random_state=42, 
                                   n_estimators=n_estimators)
    score = get_model_score(model, features_train, target_train, features_valid, target_valid)
    best_model, best_score = check_best_model(best_model, best_score, model, score)
    best_model_rfc, best_score_rfc = check_best_model(best_model_rfc, best_score_rfc, model, score)
    model_results.append(
        ['RandomForestClassifier', score, n_estimators, None, None, None])

In [43]:
get_model_results('RandomForestClassifier').head(10)

Unnamed: 0,model,score,n_estimators,max_depth,min_samples_split,min_samples_leaf
2310,RandomForestClassifier,0.802488,96.0,,,
2292,RandomForestClassifier,0.800933,6.0,,,
2300,RandomForestClassifier,0.799378,46.0,,,
2297,RandomForestClassifier,0.797823,31.0,,,
2298,RandomForestClassifier,0.797823,36.0,,,
2299,RandomForestClassifier,0.797823,41.0,,,
2303,RandomForestClassifier,0.797823,61.0,,,
2309,RandomForestClassifier,0.797823,91.0,,,
2295,RandomForestClassifier,0.796267,21.0,,,
2302,RandomForestClassifier,0.796267,56.0,,,


In [44]:
best_rfc_estimators = get_model_results('RandomForestClassifier')['n_estimators'].head(5).astype('int')

Take the 5 best results among the models built while iterating over n_estimators, and explore the other parameters by the same kind of exhaustive search.

In [45]:
for n_estimators in best_rfc_estimators:
    for max_depth in max_depth_range:
        for min_samples_leaf in min_samples_leaf_range:
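            # NOTE: min_samples_split is not iterated in this loop. It keeps the last
            # value left over from the DecisionTreeClassifier search (72 if the cells
            # ran in order), while the results table records None for it.
            model = RandomForestClassifier(random_state=42, n_estimators=n_estimators, max_depth=max_depth, min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf)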
            score = get_model_score(model, features_train, target_train, features_valid, target_valid)
            best_model, best_score = check_best_model(best_model, best_score, model, score)
            best_model_rfc, best_score_rfc = check_best_model(best_model_rfc, best_score_rfc, model, score)
            model_results.append(
                ['RandomForestClassifier', score, n_estimators, max_depth, None, min_samples_leaf])
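# For comparison, scikit-learn's `GridSearchCV` can run a search of this kind
# without hand-written loops. It is not the method used in this notebook: it
# scores each combination by cross-validation instead of a fixed validation
# set. The sketch uses synthetic data from `make_classification` and a
# deliberately tiny grid, both chosen here only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data with four features, standing in for the real dataset.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, random_state=42)

# A small grid so the example runs quickly (values are illustrative).
param_grid = {
    "n_estimators": [6, 31],
    "max_depth": [2, 8],
    "min_samples_leaf": [1, 6],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="accuracy",
    cv=3,  # 3-fold cross-validation replaces the fixed validation set
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```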

In [46]:
df_model_results = get_model_results()

In [47]:
df_model_results

Unnamed: 0,model,score,n_estimators,max_depth,min_samples_split,min_samples_leaf
2866,RandomForestClassifier,0.821151,31.0,8.0,,1.0
3016,RandomForestClassifier,0.821151,36.0,8.0,,1.0
3017,RandomForestClassifier,0.821151,36.0,8.0,,6.0
2717,RandomForestClassifier,0.821151,46.0,8.0,,6.0
2851,RandomForestClassifier,0.819596,31.0,7.0,,1.0
...,...,...,...,...,...,...
2924,RandomForestClassifier,0.743390,36.0,1.0,,66.0
2925,RandomForestClassifier,0.743390,36.0,1.0,,71.0
2291,RandomForestClassifier,0.707621,1.0,,,
0,LogisticRegression,0.701400,,,,


### Best score per model

In [48]:
df_model_results.groupby(by='model').agg({'score': 'max'}).sort_values('score', ascending=False)

model,score
RandomForestClassifier,0.821151
AdaBoostClassifier,0.805599
BaggingClassifier,0.802488
DecisionTreeClassifier,0.800933
LogisticRegression,0.7014


### Best model by score

In [49]:
df_model_results[df_model_results['score']==df_model_results['score'].max()].head(1)

Unnamed: 0,model,score,n_estimators,max_depth,min_samples_split,min_samples_leaf
2866,RandomForestClassifier,0.821151,31.0,8.0,,1.0


### Conclusion

We examined 5 models (LogisticRegression, BaggingClassifier, AdaBoostClassifier, DecisionTreeClassifier, RandomForestClassifier) with various sets of hyperparameters.

Among the models tried, RandomForestClassifier reached the highest validation accuracy (82.1%) and LogisticRegression the lowest (70.1%). With suitable hyperparameters the remaining models came close to the leader: AdaBoostClassifier 80.6%, BaggingClassifier 80.2%, DecisionTreeClassifier 80.1%. Such a small gap can work in favor of the other models, since they can be noticeably faster to train.
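
The training-speed argument can be checked directly. The sketch below is a minimal way to time `fit` for a single tree and a random forest; it uses synthetic data of about the training-set size rather than the real dataset, and the hyperparameter values are just examples, so the absolute timings are only indicative.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data of roughly the training-set size (1928 rows, 4 features).
X, y = make_classification(n_samples=1928, n_features=4, n_informative=3,
                           n_redundant=0, random_state=42)

models = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42, max_depth=4),
    "RandomForestClassifier": RandomForestClassifier(
        random_state=42, n_estimators=31, max_depth=8),
}

# Measure wall-clock time of a single fit for each model.
fit_seconds = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X, y)
    fit_seconds[name] = time.perf_counter() - start

for name, seconds in fit_seconds.items():
    print(f"{name}: {seconds:.4f} s")
```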

## Checking the models on the test set

In [50]:
features_test = df_test.drop(['is_ultra'], axis=1)
features_test.head()

Unnamed: 0,calls,minutes,messages,mb_used
506,46.0,338.6,35.0,11428.54
2513,39.0,242.71,0.0,20480.11
354,39.0,258.02,0.0,19998.8
1080,36.0,230.99,19.0,23525.07
2389,35.0,205.35,52.0,35177.94


In [51]:
target_test = df_test['is_ultra']

**LogisticRegression**

In [52]:
predictions_test = best_model_lr.predict(features_test)
accuracy_score(target_test, predictions_test)

0.7091757387247278

**BaggingClassifier**

In [53]:
predictions_test = best_model_bc.predict(features_test)
accuracy_score(target_test, predictions_test)

0.7962674961119751

**AdaBoostClassifier**

In [54]:
predictions_test = best_model_abc.predict(features_test)
accuracy_score(target_test, predictions_test)

0.8055987558320373

**DecisionTreeClassifier**

In [55]:
predictions_test = best_model_dtc.predict(features_test)
accuracy_score(target_test, predictions_test)

0.80248833592535

**RandomForestClassifier**

In [56]:
predictions_test = best_model_rfc.predict(features_test)
accuracy_score(target_test, predictions_test)

0.807153965785381

### Conclusion

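**On the test set, the best model of each type kept accuracy close to its validation score: RandomForestClassifier 0.807, AdaBoostClassifier 0.806, DecisionTreeClassifier 0.802, BaggingClassifier 0.796. All four exceed the 0.75 target; LogisticRegression (0.709) does not.**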
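
The validation and test accuracies reported in this notebook can be put side by side to see how much each model changes between the two sets. The snippet below only rearranges numbers already shown in the outputs (rounded to six decimals); the names `scores` and `TARGET` are introduced here for illustration.

```python
# (validation accuracy, test accuracy) copied from the outputs in this notebook.
scores = {
    "RandomForestClassifier": (0.821151, 0.807154),
    "AdaBoostClassifier": (0.805599, 0.805599),
    "BaggingClassifier": (0.802488, 0.796267),
    "DecisionTreeClassifier": (0.800933, 0.802488),
    "LogisticRegression": (0.701400, 0.709176),
}

TARGET = 0.75  # accuracy threshold stated in the task

# Sort by test accuracy, best first, and show the validation-to-test change.
ranked = sorted(scores.items(), key=lambda item: item[1][1], reverse=True)
for name, (valid, test) in ranked:
    print(f"{name:<24} valid={valid:.3f} test={test:.3f} "
          f"diff={test - valid:+.3f} meets_target={test >= TARGET}")
```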