# **DS** test assignment from a mobile operator

Binary classification problem:
"1" - the subscriber is a driver (belongs to the drivers segment),
"0" - the subscriber is not a driver (does not belong to the drivers segment).

The files tabular_data.csv and hashed_feature.csv contain descriptive features for 4084 subscribers ("ID" is the subscriber ID).
The file train.csv contains the target labels (whether the subscriber belongs to the drivers segment).
The file test.csv lists the subscribers for whom a prediction is required; model quality is evaluated on them. The metric is ROC-AUC.
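To make the metric concrete, here is a minimal sketch of how ROC-AUC can be computed with scikit-learn. The labels and scores are made-up toy values, not taken from the assignment data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted scores (hypothetical values, for illustration only).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# ROC-AUC is the share of (positive, negative) pairs that the scores rank correctly.
auc = roc_auc_score(y_true, y_score)

# Only the ordering of the scores matters, so rescaling them leaves the metric unchanged.
auc_rescaled = roc_auc_score(y_true, y_score * 10)

print(auc, auc_rescaled)
```

Because only the ranking matters, the values in the score column do not need to be calibrated probabilities to reach a high ROC-AUC.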

The file tabular_data.csv contains numeric data on subscriber activity over 12 periods.
- period - period number (periods are consecutive, 1 is the newest)
- id - subscriber ID
- feature_0 - feature_49 - data on the subscriber's activity in the corresponding period.


The file hashed_feature.csv contains a set of hashed values of one categorical variable for each subscriber.
- id - subscriber's identifier
- feature_50 - hash of the value of the categorical variable.
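Since one subscriber can have several hashed values, the file has to be collapsed to one row per subscriber before it can be joined to the labels. One possible way, sketched below on a toy frame, is a multi-hot encoding plus a count of distinct hashes. The hash strings and the column names `hash_*` and `feature_50_nunique` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for hashed_feature.csv (hypothetical values).
hashed_toy = pd.DataFrame({
    "id": [0, 0, 1, 2, 2, 2],
    "feature_50": ["a1f", "9bc", "a1f", "a1f", "77e", "9bc"],
})

# Number of distinct hashed values per subscriber.
n_unique = (
    hashed_toy.groupby("id")["feature_50"].nunique().rename("feature_50_nunique")
)

# Multi-hot encoding: one column per hash, 1 if the subscriber has that value.
multi_hot = pd.crosstab(hashed_toy["id"], hashed_toy["feature_50"]).clip(upper=1)
multi_hot = multi_hot.add_prefix("hash_")

# One row per subscriber, indexed by id.
hashed_features = multi_hot.join(n_unique)
print(hashed_features)
```

With many distinct hashes the multi-hot matrix becomes wide, so in practice it may be worth keeping only the most frequent values.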


The file train.csv contains the target labels.
- id - subscriber's identifier
- target - target label value (1 - belongs to drivers segment, 0 - does not belong to drivers segment).


The file test.csv lists the subscribers for whom your model must produce predictions.
- id - the subscriber's identifier
- score - the probability that the subscriber belongs to the drivers segment (class "1"). This probability is determined by your model.


Train your model on the subscribers whose target label is in the train.csv file.
To do this, use the data from the tabular_data.csv and hashed_feature.csv files.
Then use the model to fill in the score column for the subscribers in the test.csv file: the probability that the subscriber belongs to the drivers segment.
Note that the prediction is per subscriber: you predict membership in the drivers segment as such, not for a particular period.
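Because the label is per subscriber while tabular_data.csv has one row per subscriber and period, one option is to aggregate the periods into a single row per `id`. The sketch below shows this on a toy frame. The choice of statistics (`mean`, `min`, `max`) is an assumption for illustration, not something the assignment prescribes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for tabular_data.csv: 2 subscribers x 3 periods (hypothetical values).
tabular_toy = pd.DataFrame({
    "id": [0, 0, 0, 1, 1, 1],
    "period": [1, 2, 3, 1, 2, 3],
    "feature_0": [10.0, 20.0, 30.0, 5.0, np.nan, 7.0],
})

feature_cols = [c for c in tabular_toy.columns if c.startswith("feature_")]

# Collapse the periods into one row per subscriber; NaN values are skipped.
agg = tabular_toy.groupby("id")[feature_cols].agg(["mean", "min", "max"])

# Flatten the (feature, statistic) column MultiIndex into single names.
agg.columns = [f"{col}_{stat}" for col, stat in agg.columns]
agg = agg.reset_index()
print(agg)
```

The resulting frame can be merged with train.csv on `id` in the same way as a single-period slice.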

**P.S. The target ROC-AUC is 90%+**

In [17]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, roc_auc_score

In [4]:
train = pd.read_csv('train.csv')

In [5]:
train

Unnamed: 0,id,target
0,0,0
1,1,0
2,2,1
3,3,0
4,4,1
...,...,...
4079,4079,0
4080,4080,0
4081,4081,0
4082,4082,0


In [6]:
tabular_data = pd.read_csv('tabular_data.csv')

In [7]:
tabular_data.head()

Unnamed: 0,id,period,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,...,feature_40,feature_41,feature_42,feature_43,feature_44,feature_45,feature_46,feature_47,feature_48,feature_49
0,0,1,110.0,55.0,0.432017,0.0,176.78,0.0,0.323712,,...,0.0,0,0.0,0.0,55.0,2.0,0.526552,145.0,133.28,0.0
1,0,2,110.0,110.0,0.397517,0.0,315.42,0.0,0.316798,,...,0.0,0,0.0,0.0,110.0,1.0,0.481063,130.0,229.97,0.0
2,0,3,110.0,55.0,0.35944,0.0,354.55,0.0,0.339188,,...,0.07,0,0.0,0.0,55.0,1.0,0.509598,180.0,231.78,0.0
3,0,4,110.0,55.0,0.285707,0.0,229.98,0.0,0.415428,,...,0.0,0,0.0,0.0,55.0,0.0,0.680089,142.0,183.83,0.0
4,0,5,110.0,55.0,0.101487,444.730391,307.12,0.0,0.56967,,...,0.95,0,20.014485,0.0,55.0,0.0,0.776175,85.0,155.83,0.0


## Let's make a prediction on the latest period (period 1), just to see what the results look like and decide what to do next

In [8]:
df = train.merge(tabular_data[tabular_data['period'] == 1], how='left', on='id')
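The left merge above relies on period 1 having at most one row per subscriber. Both CSVs also carry an `Unnamed: 0` column (the saved index), which the merge keeps with `_x`/`_y` suffixes. A hedged sketch of a more defensive version, on toy frames with hypothetical values, could look like this:

```python
import pandas as pd

# Toy stand-ins for train.csv and tabular_data.csv (hypothetical values).
train_toy = pd.DataFrame({"Unnamed: 0": [0, 1], "id": [0, 1], "target": [0, 1]})
tabular_toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "id": [0, 0, 1, 1],
    "period": [1, 2, 1, 2],
    "feature_0": [110.0, 110.0, 55.0, 60.0],
})

# Drop the index columns that were written out when the CSVs were saved.
train_toy = train_toy.drop(columns=["Unnamed: 0"])
tabular_toy = tabular_toy.drop(columns=["Unnamed: 0"])

# Keep only the newest period (period 1).
last_period = tabular_toy[tabular_toy["period"] == 1]

# validate="one_to_one" raises MergeError if an id is duplicated on either side.
merged = train_toy.merge(last_period, how="left", on="id", validate="one_to_one")
print(merged)
```

Dropping the saved index columns also keeps them out of the feature matrix, where they would otherwise be treated as ordinary features.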

In [11]:
df.isna().mean().sort_values(ascending=False)

feature_33    0.030362
feature_6     0.028648
feature_19    0.027914
feature_2     0.027179
feature_12    0.025955
feature_46    0.025465
feature_16    0.024486
feature_23    0.023751
feature_11    0.023751
feature_13    0.023017
feature_5     0.022527
feature_35    0.022527
feature_18    0.022282
feature_10    0.021548
feature_48    0.021548
feature_7     0.021303
feature_32    0.021058
feature_43    0.020813
feature_9     0.020813
feature_28    0.020568
feature_4     0.020568
feature_24    0.020078
feature_45    0.019833
feature_3     0.019833
feature_31    0.019833
feature_34    0.019833
feature_20    0.019589
feature_22    0.019344
feature_49    0.019099
feature_14    0.019099
feature_1     0.018854
feature_15    0.018609
feature_40    0.018609
feature_27    0.018364
feature_36    0.018364
feature_8     0.018364
feature_30    0.018119
feature_47    0.017875
feature_37    0.017875
feature_42    0.017630
feature_21    0.017385
feature_44    0.017385
feature_39    0.017385
feature_0  

In [14]:
(df.dtypes).value_counts()

float64    48
int64       4
object      1
dtype: int64

In [15]:
X = df.drop(columns=['target']).copy()
Y = df['target']

In [16]:
cat_features = ['feature_25']

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

In [20]:
! pip install catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting catboost
  Downloading catboost-1.1.1-cp38-none-manylinux1_x86_64.whl (76.6 MB)
Installing collected packages: catboost
Successfully installed catboost-1.1.1


In [21]:
from catboost import CatBoostClassifier


In [22]:
model = CatBoostClassifier(verbose=100, eval_metric='AUC', cat_features=cat_features)

In [24]:
model.fit(X_train, y_train, eval_set=(X_test,y_test))

Learning rate set to 0.041073
0:	test: 0.6778916	best: 0.6778916 (0)	total: 10.9ms	remaining: 10.9s
100:	test: 0.7338621	best: 0.7339603 (97)	total: 1.07s	remaining: 9.51s
200:	test: 0.7359767	best: 0.7371970 (140)	total: 2.12s	remaining: 8.42s
300:	test: 0.7332625	best: 0.7373057 (254)	total: 3.19s	remaining: 7.41s
400:	test: 0.7348054	best: 0.7373057 (254)	total: 4.25s	remaining: 6.34s
500:	test: 0.7287739	best: 0.7373057 (254)	total: 5.28s	remaining: 5.26s
600:	test: 0.7248499	best: 0.7373057 (254)	total: 6.33s	remaining: 4.2s
700:	test: 0.7218622	best: 0.7373057 (254)	total: 7.39s	remaining: 3.15s
800:	test: 0.7203578	best: 0.7373057 (254)	total: 8.43s	remaining: 2.1s
900:	test: 0.7185729	best: 0.7373057 (254)	total: 9.52s	remaining: 1.04s
999:	test: 0.7159148	best: 0.7373057 (254)	total: 10.6s	remaining: 0us

bestTest = 0.7373057286
bestIteration = 254

Shrink model to first 255 iterations.


<catboost.core.CatBoostClassifier at 0x7f8824c1d2e0>

In [25]:
df['target'].mean()

0.26322233104799214
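About 26% of the labelled subscribers are drivers, so the classes are moderately imbalanced. With a plain random split the positive share can drift between the training and validation parts. A minimal sketch of a stratified split, on synthetic labels with a similar positive share (hypothetical data, not the assignment files):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels with roughly 26% positives (hypothetical data).
rng = np.random.default_rng(42)
y_toy = (rng.random(1000) < 0.26).astype(int)
X_toy = rng.normal(size=(1000, 3))

# stratify=y_toy keeps the class ratio nearly identical in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
print(y_toy.mean(), y_tr.mean(), y_val.mean())
```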

In [42]:
model.feature_importances_

array([ 3.10356343,  0.        ,  0.13881276,  1.56778506,  2.96839169,
        1.87666142,  1.3283412 ,  1.19504394, 10.80025232,  2.52668995,
        2.26892069,  2.01253126,  2.09957908,  1.02097293,  2.46538565,
        1.83179983,  1.13965008,  1.1188869 ,  0.46460922,  3.2951441 ,
        2.70216613,  1.88517157,  1.6190395 ,  1.48734431,  1.30252298,
        1.25428477,  0.38825616,  2.01268971,  1.9739088 ,  1.36498632,
        2.24176946,  2.11704622,  1.17327402,  2.97356672,  1.71429383,
        1.56071224,  2.26051682,  1.33364282,  1.74208497,  0.33413897,
        0.39338795,  0.56855949,  1.44110191,  0.        ,  2.68658338,
        0.32816802,  0.60357207,  0.95612201, 10.11619094,  3.877913  ,
        1.36311789,  1.00084551])

In [48]:
fi = pd.DataFrame({'name':X.columns, 'w':model.feature_importances_})
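A frame like `fi` is easier to read once it is sorted. The sketch below uses toy names and weights (hypothetical values) to rank features, compute the cumulative share of importance, and list features with zero importance, such as the two zeros in the array above.

```python
import pandas as pd

# Toy importances (hypothetical names and values, for illustration only).
fi_toy = pd.DataFrame({
    "name": ["feature_a", "feature_b", "feature_c", "feature_d"],
    "w": [10.0, 0.0, 60.0, 30.0],
})

# Rank features from most to least important.
fi_sorted = fi_toy.sort_values("w", ascending=False).reset_index(drop=True)

# Cumulative share of total importance covered by the top-k features.
fi_sorted["cum_share"] = fi_sorted["w"].cumsum() / fi_sorted["w"].sum()

# Features with zero importance are candidates for removal.
unused = fi_sorted.loc[fi_sorted["w"] == 0, "name"].tolist()

print(fi_sorted)
print(unused)
```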