# **DS** test assignment from a mobile operator

Binary classification problem:
"1" - the subscriber is a driver (belongs to the drivers segment),
"0" - the subscriber is not a driver (does not belong to the drivers segment).

The files tabular_data.csv and hashed_feature.csv contain descriptive characteristics for 4084 subscribers ("ID" is the subscriber ID).
The file train.csv contains the target labels (whether the subscriber belongs to the drivers segment).
The file test.csv lists the subscribers for which we want predictions and on which the quality of the model will be evaluated. ROC-AUC is used as the metric.

The file tabular_data.csv contains numeric data on subscriber activity for 12 periods.
- period - period number (periods are consecutive, 1 is the newest)
- id - subscriber ID
- feature_0 ... feature_49 - data on the subscriber's activity in the corresponding period.

The file hashed_feature.csv contains a set of hashed values of one categorical variable for each subscriber.
- id - subscriber ID
- feature_50 - hash of a value of the categorical variable.

The file train.csv contains the target labels.
- id - subscriber ID
- target - target label value (1 - belongs to the drivers segment, 0 - does not belong to the drivers segment).

The file test.csv lists the subscribers for which you need to make predictions with your model.
- id - subscriber ID
- score - the probability that the subscriber belongs to the drivers segment (class "1"). This probability is determined by your model.


Build your model on the subscribers whose target label is contained in train.csv, using the data from tabular_data.csv and hashed_feature.csv.
Then use the model to fill in the score column for the subscribers in test.csv: the probability that the subscriber belongs to the drivers segment.
Note that you need to predict membership in the drivers segment as such, without reference to the period.

**P.S. The target ROC-AUC is 90%+**
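
For reference, ROC-AUC can be read as the probability that a randomly chosen driver receives a higher score than a randomly chosen non-driver. The sketch below computes it directly from that definition on made-up labels and scores. The function name `pairwise_auc` and all toy values are hypothetical; in practice `sklearn.metrics.roc_auc_score`, which this notebook imports, would normally be used instead.

```python
def pairwise_auc(labels, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example data, not taken from the assignment files.
toy_labels = [1, 0, 1, 0, 1]
toy_scores = [0.9, 0.2, 0.6, 0.7, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

Here five of the six driver/non-driver pairs are ordered correctly, so the toy AUC is 5/6. Only the ordering of the scores matters, not their absolute values.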

In [None]:
! pip3 install catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting catboost
  Downloading catboost-1.1.1-cp38-none-manylinux1_x86_64.whl (76.6 MB)

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

%matplotlib inline

In [None]:
tabular_data = pd.read_csv('tabular_data.csv')

In [None]:
train = pd.read_csv('train.csv')

In [None]:
test = pd.read_csv('test.csv')

In [None]:
df = train.merge(tabular_data, how='left', on='id')
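
Because tabular_data.csv holds one row per subscriber per period, this left merge repeats each label on every period row of that subscriber. A small sketch on made-up frames (all `toy_` names and values are hypothetical) shows the effect; the `validate` argument of `merge` guards the assumption that `id` is unique in the label table.

```python
import pandas as pd

# Made-up stand-ins for train.csv and tabular_data.csv.
toy_train = pd.DataFrame({"id": [1, 2], "target": [1, 0]})
toy_tabular = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 3],
    "period": [1, 2, 1, 2, 1, 2],
    "feature_0": [0.5, 0.7, 0.1, 0.2, 0.9, 0.8],
})

# validate="one_to_many" raises MergeError if an id is duplicated in toy_train.
toy_df = toy_train.merge(toy_tabular, how="left", on="id", validate="one_to_many")
print(toy_df)
```

Subscriber 3 has no label, so it does not appear in the merged frame, while subscribers 1 and 2 each contribute one row per period.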

In [None]:
hashed_feature = pd.read_csv('hashed_feature.csv')
hashed_feature.head()

In [None]:
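# Forward-fill missing hashes so that ' '.join below does not fail on NaN.
hashed_feature["feature_50"] = hashed_feature["feature_50"].ffill()
# Collapse all hashed values of a subscriber into one space-separated string.
feature_50 = hashed_feature.groupby("id", as_index=True).agg({"feature_50": " ".join})
feature_50["feature_50"].head()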
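
The aggregation above turns several hashed values per subscriber into one space-separated string, which can then be passed to CatBoost as a text feature. The sketch below reproduces the idea on made-up hashes (all values hypothetical). Note that it drops missing values instead of forward-filling them; that is a swapped-in choice for illustration, not what the cell above does, and it avoids carrying one subscriber's hash over to the next.

```python
import pandas as pd

# Made-up stand-in for hashed_feature.csv, with one missing hash.
toy_hashed = pd.DataFrame({
    "id": [1, 1, 2, 3, 3],
    "feature_50": ["a1f", "9c2", "a1f", None, "77b"],
})

# Drop missing hashes, then join the remaining ones per subscriber.
toy_feature_50 = (
    toy_hashed.dropna(subset=["feature_50"])
    .groupby("id")["feature_50"]
    .agg(" ".join)
)
print(toy_feature_50)
```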

In [None]:
df = df.merge(feature_50, how='left', on='id')

In [None]:
df['feature_50']

In [None]:
# Assign the result back: fillna(method=..., inplace=True) on a selected column is deprecated in recent pandas.
df["feature_25"] = df["feature_25"].ffill()

In [None]:
X = df.drop(columns=['target']).copy()

In [None]:
Y = df['target']

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
sns.catplot(
    data=df, x="period", y="feature_46", hue="target",
    kind="violin", split=True,
)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
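
One caveat with a row-level split: each subscriber contributes up to 12 period rows, so the same `id` can land in both the training and the validation part, which may make the validation AUC look better than it would on unseen subscribers. A possible alternative is to split by subscriber. The sketch below uses made-up data and a hypothetical helper `split_by_id`; scikit-learn's `GroupShuffleSplit` implements the same idea.

```python
import numpy as np
import pandas as pd


def split_by_id(frame, id_col="id", test_size=0.3, seed=42):
    """Split rows so that all rows of one subscriber fall on the same side."""
    ids = frame[id_col].unique()
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(ids)
    n_test = max(1, int(round(len(ids) * test_size)))
    is_test = frame[id_col].isin(shuffled[:n_test])
    return frame[~is_test], frame[is_test]


# Made-up data: 10 subscribers with 3 period rows each.
toy = pd.DataFrame({
    "id": np.repeat(np.arange(10), 3),
    "period": np.tile([1, 2, 3], 10),
})
toy_fit, toy_valid = split_by_id(toy)
print(len(toy_fit), len(toy_valid))
```

With this kind of split, the validation score measures how well the model generalizes to subscribers it has never seen, which is closer to the situation in test.csv.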

In [None]:
text_feature = ['feature_50']
cat_feature = ['feature_25']

In [None]:
model = CatBoostClassifier(verbose=100, eval_metric='AUC', text_features=text_feature, cat_features=cat_feature)

In [None]:
model.fit(X_train, y_train, eval_set=(X_test, y_test))
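
The model above scores individual period rows, while the assignment asks for one score per subscriber in test.csv. One possible way to bridge that gap, sketched here on made-up numbers, is to average the row-level probabilities per `id`. The array `toy_row_proba` stands in for something like `model.predict_proba(features)[:, 1]`; averaging is an assumption made for illustration, not something the assignment prescribes.

```python
import numpy as np
import pandas as pd

# Made-up period rows for the subscribers listed in a toy test file.
toy_rows = pd.DataFrame({"id": [10, 10, 11, 11, 12], "period": [1, 2, 1, 2, 1]})
# Stand-in for row-level class "1" probabilities from a fitted classifier.
toy_row_proba = np.array([0.8, 0.6, 0.1, 0.3, 0.5])

# One score per subscriber: the mean probability over that subscriber's periods.
toy_scores = (
    toy_rows.assign(score=toy_row_proba)
    .groupby("id", as_index=False)["score"]
    .mean()
)

# Attach the scores to the test list, keeping its original row order.
toy_test = pd.DataFrame({"id": [12, 10, 11]})
toy_submission = toy_test.merge(toy_scores, how="left", on="id")
print(toy_submission)
```

Other aggregations (maximum, or the score of the newest period only) are equally plausible; comparing them on a validation set split by subscriber would show which works best here.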