## 1. Problem statement

development_sample.csv - data for model development. 
One row represents one event in the data sample. 
This file contains:
"id" - the unique identifier of event and time order variable for data (in ascending order);
"target" - an outcome which should be predicted (two classes: 1 & 0);
Other columns are features.

holdout_sample.csv - data for model testing. 
One row represents one event in the data sample. 
This file contains all the fields from development_sample except target column.

For each "id" in the development_sample and in the holdout_sample you must predict a probability for a target variable (class 1). 
The file should contain a header and have the following format:

id,probability

1,0.0001

2,0.0002

3,0.0003

etc.

Also, we expect from your side the brief report regarding main modelling steps (optionally, but preferably).

## 2. Loading, checking, and cleaning the data

Import the libraries we need.

In [29]:
import pandas as pd
import numpy as np
import random

Read the development file that the model will be trained on.

In [2]:
df = pd.read_csv('data/development_sample.csv')
df.head()

Unnamed: 0,id,target,feature_01,feature_02,feature_03,feature_04,feature_05,feature_06,feature_07,feature_08,...,feature_61,feature_62,feature_63,feature_64,feature_65,feature_66,feature_67,feature_68,feature_69,feature_70
0,1,0,value_1,value_004,value_024,3000,3000,3136.8,373,52.73,...,567.0,335.0,126.0,126.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,0,value_1,value_003,value_036,4000,4000,4273.6,358,363.35,...,269.0,269.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,0,value_2,value_009,value_022,2000,2000,2091.2,261,0.0,...,,,,,,,,,,
3,4,0,value_1,value_011,value_037,10000,10000,10912.0,905,0.0,...,373.0,373.0,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,0,value_1,value_010,value_033,5000,3140,3426.37,240,0.0,...,746.0,746.0,446.0,446.0,0.0,0.0,0.0,0.0,0.0,0.0


Check the data types.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26047 entries, 0 to 26046
Data columns (total 72 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          26047 non-null  int64  
 1   target      26047 non-null  int64  
 2   feature_01  26047 non-null  object 
 3   feature_02  26047 non-null  object 
 4   feature_03  26047 non-null  object 
 5   feature_04  26047 non-null  int64  
 6   feature_05  26047 non-null  int64  
 7   feature_06  26047 non-null  float64
 8   feature_07  26047 non-null  int64  
 9   feature_08  26047 non-null  float64
 10  feature_09  26047 non-null  float64
 11  feature_10  26047 non-null  float64
 12  feature_11  26047 non-null  float64
 13  feature_12  26047 non-null  float64
 14  feature_13  26047 non-null  int64  
 15  feature_14  26047 non-null  int64  
 16  feature_15  26047 non-null  float64
 17  feature_16  26047 non-null  float64
 18  feature_17  26047 non-null  float64
 19  feature_18  26047 non-nul

As we can see, the data contains many categorical (`object`) columns.

In [4]:
df.shape

(26047, 72)

So the table has 72 columns and 26,047 rows.

In [5]:
df.isnull().sum()

id                0
target            0
feature_01        0
feature_02        0
feature_03        0
              ...  
feature_66    14069
feature_67    14069
feature_68    14069
feature_69    14069
feature_70    14069
Length: 72, dtype: int64

Replace NaN values with zeros.

In [6]:
df = df.fillna(0)
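
Filling with zeros makes a missing value indistinguishable from a genuine zero, and `df.isnull().sum()` above shows 14,069 rows with no value in `feature_66` through `feature_70`. One option, sketched here on a tiny hypothetical frame rather than on the real data, is to record a missing-value flag before the fill. The column name `feature_61_missing` is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame; the column names only mimic the notebook's.
demo = pd.DataFrame({
    "feature_61": [567.0, np.nan, 0.0],
    "feature_62": [335.0, np.nan, 0.0],
})

# Record which rows were missing before the fill, so that "missing" and
# "true zero" stay distinguishable for the model.
demo["feature_61_missing"] = demo["feature_61"].isna().astype(int)
demo = demo.fillna(0)

print(demo)
```

Since the late features appear to be missing together, a single flag for the whole group may be enough.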

In [7]:
df.isnull().sum()

id            0
target        0
feature_01    0
feature_02    0
feature_03    0
             ..
feature_66    0
feature_67    0
feature_68    0
feature_69    0
feature_70    0
Length: 72, dtype: int64

In [8]:
df.describe()

Unnamed: 0,id,target,feature_04,feature_05,feature_06,feature_07,feature_08,feature_09,feature_10,feature_11,...,feature_61,feature_62,feature_63,feature_64,feature_65,feature_66,feature_67,feature_68,feature_69,feature_70
count,26047.0,26047.0,26047.0,26047.0,26047.0,26047.0,26047.0,26047.0,26047.0,26047.0,...,26047.0,26047.0,26047.0,26047.0,26047.0,26047.0,26047.0,26047.0,26047.0,26047.0
mean,13024.0,0.012477,7966.846931,7118.296925,7969.635243,593.74569,292.829413,5583.025038,516.687985,5066.337053,...,215.768227,134.958306,50.323377,81.567666,0.057511,0.072177,0.038853,0.545207,19.924521,17.814566
std,7519.265567,0.111006,4162.39233,3905.659217,4435.382178,348.464426,602.319435,5192.583489,1861.977647,5067.124001,...,359.995359,245.965004,161.608202,233.959156,0.464754,0.477084,0.318411,2.833747,125.484214,116.696576
min,1.0,0.0,1000.0,1000.0,1045.6,61.0,0.0,500.0,0.0,-1884.99,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,6512.5,0.0,5000.0,4000.0,4651.3,341.0,0.0,2895.0,0.0,2500.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,13024.0,0.0,8000.0,6940.0,7630.0,549.0,50.0,4000.0,0.0,4000.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,19535.5,0.0,10000.0,10000.0,10642.025,733.0,370.0,6500.0,0.0,6000.0,...,362.0,204.0,7.0,15.0,0.0,0.0,0.0,0.0,0.0,0.0
max,26047.0,1.0,20000.0,20000.0,23600.0,2600.0,12753.06,140333.0,53486.0,140333.0,...,2759.0,2654.0,2429.0,2454.0,13.0,17.0,10.0,29.0,2210.0,2210.0


Convert the categorical data to numeric codes.

In [9]:
from sklearn.preprocessing import OrdinalEncoder

# Categorical columns: feature_01-03, feature_24-25 and feature_28-50.
cat_cols = ["feature_01", "feature_02", "feature_03", "feature_24", "feature_25"]
cat_cols += [f"feature_{i}" for i in range(28, 51)]

df_enc = OrdinalEncoder()
# Encode each column separately; the encoder is refitted on every column.
for col in cat_cols:
    df[col] = df_enc.fit_transform(df[[col]])

Check that no categorical columns remain.

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26047 entries, 0 to 26046
Data columns (total 72 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          26047 non-null  int64  
 1   target      26047 non-null  int64  
 2   feature_01  26047 non-null  float64
 3   feature_02  26047 non-null  float64
 4   feature_03  26047 non-null  float64
 5   feature_04  26047 non-null  int64  
 6   feature_05  26047 non-null  int64  
 7   feature_06  26047 non-null  float64
 8   feature_07  26047 non-null  int64  
 9   feature_08  26047 non-null  float64
 10  feature_09  26047 non-null  float64
 11  feature_10  26047 non-null  float64
 12  feature_11  26047 non-null  float64
 13  feature_12  26047 non-null  float64
 14  feature_13  26047 non-null  int64  
 15  feature_14  26047 non-null  int64  
 16  feature_15  26047 non-null  float64
 17  feature_16  26047 non-null  float64
 18  feature_17  26047 non-null  float64
 19  feature_18  26047 non-nul

In [35]:
df.head()

Unnamed: 0,id,target,feature_01,feature_02,feature_03,feature_04,feature_05,feature_06,feature_07,feature_08,...,feature_61,feature_62,feature_63,feature_64,feature_65,feature_66,feature_67,feature_68,feature_69,feature_70
0,1,0,0.0,3.0,23.0,3000,3000,3136.8,373,52.73,...,567.0,335.0,126.0,126.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,0,0.0,2.0,35.0,4000,4000,4273.6,358,363.35,...,269.0,269.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,0,1.0,8.0,21.0,2000,2000,2091.2,261,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,0,0.0,10.0,36.0,10000,10000,10912.0,905,0.0,...,373.0,373.0,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,0,0.0,9.0,32.0,5000,3140,3426.37,240,0.0,...,746.0,746.0,446.0,446.0,0.0,0.0,0.0,0.0,0.0,0.0


## 3. Working with the RandomForestClassifier model

Define the features and the target.

In [16]:
X = df.drop('target', axis=1)

y = df['target']

In [33]:
X.head()

Unnamed: 0,id,feature_01,feature_02,feature_03,feature_04,feature_05,feature_06,feature_07,feature_08,feature_09,...,feature_61,feature_62,feature_63,feature_64,feature_65,feature_66,feature_67,feature_68,feature_69,feature_70
0,1,0.0,3.0,23.0,3000,3000,3136.8,373,52.73,2400.0,...,567.0,335.0,126.0,126.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,0.0,2.0,35.0,4000,4000,4273.6,358,363.35,5000.0,...,269.0,269.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,1.0,8.0,21.0,2000,2000,2091.2,261,0.0,1600.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,0.0,10.0,36.0,10000,10000,10912.0,905,0.0,3478.86,...,373.0,373.0,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,0.0,9.0,32.0,5000,3140,3426.37,240,0.0,1048.0,...,746.0,746.0,446.0,446.0,0.0,0.0,0.0,0.0,0.0,0.0


In [34]:
y.head()

0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int64

Split the data into training and test sets (80/20).

In [30]:
from sklearn.model_selection import train_test_split

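# scikit-learn draws from NumPy's global random state, so seed NumPy
# (seeding Python's random module does not affect the split).
np.random.seed(42)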

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
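
The mean of `target` in `df.describe()` is about 0.0125, so class 1 is rare and a plain random split can give the two parts slightly different positive rates. A minimal sketch of a stratified, reproducible split follows. It runs on synthetic data (an assumption, since the real file is not available here), and `X_demo`, `y_demo` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's X and y (hypothetical data, ~1.25% positives).
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(8000, 5)), columns=[f"f{i}" for i in range(5)])
y_demo = pd.Series((rng.random(8000) < 0.0125).astype(int), name="target")

# stratify=y_demo keeps the share of class 1 (almost) identical in both parts;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(f"positive rate, train: {y_tr.mean():.4f}")
print(f"positive rate, test:  {y_te.mean():.4f}")
```

Because `id` is described as a time-order variable, a split by `id` (earlier rows for training, later rows for testing) might be closer to how the holdout sample was formed; that is a design choice the task statement leaves open.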

Import the model from the library and instantiate it.

In [31]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(max_depth=2, random_state=0)

Fit the model on the training data.

In [32]:
clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=2, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

## 4. Evaluating the model's accuracy

On the training data:

In [42]:
clf.score(X_train, y_train)

0.9880021116283534

On the test data:

In [43]:
clf.score(X_test, y_test)

0.9856046065259118

In [45]:
y_preds = clf.predict(X_test)

In [46]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_preds)

0.9856046065259118

In [47]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99      5135
           1       0.00      0.00      0.00        75

    accuracy                           0.99      5210
   macro avg       0.49      0.50      0.50      5210
weighted avg       0.97      0.99      0.98      5210





In [48]:
confusion_matrix(y_test, y_preds)

array([[5135,    0],
       [  75,    0]], dtype=int64)
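
The confusion matrix shows that all 5,210 test rows are predicted as class 0, so the accuracy of 0.9856 is exactly the share of class 0 in the test set (5135 / 5210) and says little about how well class 1 is recognised. Since the task asks for probabilities, a ranking metric computed on `predict_proba` output is probably more informative. The sketch below uses synthetic data (an assumption) to compare accuracy with the majority-class baseline, ROC AUC, and average precision.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the notebook's features (hypothetical).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, n_informative=5, weights=[0.97], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Baseline: always predict the majority class 0.
baseline_acc = accuracy_score(y_te, np.zeros_like(y_te))

# Column 1 of predict_proba is the estimated probability of class 1
# (model.classes_ is [0, 1] here).
proba = model.predict_proba(X_te)[:, 1]

print(f"majority-class accuracy: {baseline_acc:.4f}")
print(f"model accuracy:          {model.score(X_te, y_te):.4f}")
print(f"ROC AUC:                 {roc_auc_score(y_te, proba):.4f}")
print(f"average precision:       {average_precision_score(y_te, proba):.4f}")
```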

Check whether the accuracy changes with different values of n_estimators.

In [49]:
np.random.seed(42)
for i in range(10, 100, 10):
    print(f'Trying model with {i} estimators...')
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f'Model accuracy in test set: {clf.score(X_test, y_test) * 100:.2f}%')
    print('')

Trying model with 10 estimators...
Model accuracy in test set: 98.56%

Trying model with 20 estimators...
Model accuracy in test set: 98.56%

Trying model with 30 estimators...
Model accuracy in test set: 98.56%

Trying model with 40 estimators...
Model accuracy in test set: 98.56%

Trying model with 50 estimators...
Model accuracy in test set: 98.56%

Trying model with 60 estimators...
Model accuracy in test set: 98.56%

Trying model with 70 estimators...
Model accuracy in test set: 98.56%

Trying model with 80 estimators...
Model accuracy in test set: 98.56%

Trying model with 90 estimators...
Model accuracy in test set: 98.56%
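
Every setting gives 98.56%, the same figure as a model that predicts class 0 for every row, so accuracy cannot tell these forests apart. A possible alternative, sketched on synthetic data under the assumption that ROC AUC is an acceptable criterion, is to compare settings with stratified cross-validation. The names `X_demo`, `y_demo`, and `results` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (hypothetical stand-in for X_train, y_train).
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=10, n_informative=5, weights=[0.95], random_state=0
)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
results = {}
for n in (10, 50, 100):
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    # scoring="roc_auc" ranks predicted probabilities, so it can differ between
    # models even when all of them predict class 0 at the default 0.5 threshold.
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")
    results[n] = scores.mean()
    print(f"{n:>3} estimators: mean ROC AUC = {scores.mean():.4f}")
```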



## 5. Saving the model

In [50]:
import pickle

# Use a context manager so the file handle is closed after writing.
with open('random_forest_model_1.pkl', 'wb') as f:
    pickle.dump(clf, f)

In [51]:
with open('random_forest_model_1.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
loaded_model.score(X_test, y_test)

0.9856046065259118

## 6. Predicting the target with the saved model

Read the second file.

In [52]:
df_sample = pd.read_csv('data/holdout_sample.csv')
df_sample.head()

Unnamed: 0,id,feature_01,feature_02,feature_03,feature_04,feature_05,feature_06,feature_07,feature_08,feature_09,...,feature_61,feature_62,feature_63,feature_64,feature_65,feature_66,feature_67,feature_68,feature_69,feature_70
0,26048,value_2,value_001,value_012,2620,2610,3079.8,160,500.0,2200.0,...,,,,,,,,,,
1,26049,value_2,value_002,value_028,6900,6900,8142.0,466,0.0,2300.0,...,,,,,,,,,,
2,26050,value_1,value_012,value_044,7000,6400,6976.0,553,0.0,2134.54,...,1006.0,436.0,166.0,664.0,0.0,0.0,0.0,0.0,0.0,0.0
3,26051,value_2,value_012,value_255,9000,9000,10620.0,549,219.0,3000.0,...,,,,,,,,,,
4,26052,value_1,value_006,value_073,1000,1000,1060.0,114,0.0,2000.0,...,222.0,222.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Check whether the features include categorical data.

In [53]:
df_sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4597 entries, 0 to 4596
Data columns (total 71 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          4597 non-null   int64  
 1   feature_01  4597 non-null   object 
 2   feature_02  4597 non-null   object 
 3   feature_03  4597 non-null   object 
 4   feature_04  4597 non-null   int64  
 5   feature_05  4597 non-null   int64  
 6   feature_06  4597 non-null   float64
 7   feature_07  4597 non-null   int64  
 8   feature_08  4597 non-null   float64
 9   feature_09  4597 non-null   float64
 10  feature_10  4597 non-null   float64
 11  feature_11  4597 non-null   float64
 12  feature_12  4597 non-null   float64
 13  feature_13  4597 non-null   int64  
 14  feature_14  4597 non-null   int64  
 15  feature_15  4597 non-null   float64
 16  feature_16  4597 non-null   float64
 17  feature_17  4597 non-null   float64
 18  feature_18  4597 non-null   float64
 19  feature_19  4597 non-null  

In [54]:
df_sample.shape

(4597, 71)

In [55]:
df_sample.isnull().sum()

id               0
feature_01       0
feature_02       0
feature_03       0
feature_04       0
              ... 
feature_66    2403
feature_67    2403
feature_68    2403
feature_69    2403
feature_70    2403
Length: 71, dtype: int64

Replace NaN values with zeros.

In [56]:
df_sample = df_sample.fillna(0)

In [57]:
df_sample.describe()

Unnamed: 0,id,feature_04,feature_05,feature_06,feature_07,feature_08,feature_09,feature_10,feature_11,feature_12,...,feature_61,feature_62,feature_63,feature_64,feature_65,feature_66,feature_67,feature_68,feature_69,feature_70
count,4597.0,4597.0,4597.0,4597.0,4597.0,4597.0,4597.0,4597.0,4597.0,4597.0,...,4597.0,4597.0,4597.0,4597.0,4597.0,4597.0,4597.0,4597.0,4597.0,4597.0
mean,28346.0,9959.445508,8827.759408,9961.573548,702.595171,282.988723,5840.746476,431.928756,5408.81772,9862.401753,...,271.182728,151.938003,61.776376,104.568197,0.058951,0.081575,0.047857,0.60757,22.899935,19.762889
std,1327.183923,5962.810631,5549.79806,6326.609116,466.995349,609.752477,5375.804168,1768.267554,5287.274777,10441.40264,...,443.060188,312.548831,184.085624,267.933061,0.514165,0.497253,0.340056,3.055052,126.099616,115.989816
min,26048.0,1000.0,1000.0,1060.0,61.0,0.0,500.0,0.0,-4833.33,1000.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,27197.0,5000.0,4690.0,5300.0,350.0,0.0,3000.0,0.0,2700.0,5400.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,28346.0,10000.0,7900.0,8720.0,587.0,0.0,4500.0,0.0,4100.0,8000.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,29495.0,15000.0,12000.0,13483.2,924.0,350.0,7000.0,0.0,6500.0,11704.16,...,457.0,183.0,13.0,37.0,0.0,0.0,0.0,0.0,0.0,0.0
max,30644.0,20000.0,20000.0,23600.0,2600.0,10000.0,100000.0,60000.0,100000.0,450000.0,...,2807.0,2807.0,2551.0,2551.0,13.0,11.0,8.0,29.0,2384.0,2384.0


Convert the categorical data to numeric codes.

In [58]:
from sklearn.preprocessing import OrdinalEncoder

# Same columns as in section 2, except feature_25, which is not encoded here.
cat_cols_sample = ["feature_01", "feature_02", "feature_03", "feature_24"]
cat_cols_sample += [f"feature_{i}" for i in range(28, 51)]

df_enc_sample = OrdinalEncoder()
# Encode each column separately; the encoder is refitted on the holdout values.
for col in cat_cols_sample:
    df_sample[col] = df_enc_sample.fit_transform(df_sample[[col]])
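
Fitting a fresh encoder on the holdout file means the integer codes need not match the ones the model was trained on. For example, `value_028` in `feature_03` becomes 26.0 in the holdout output, while the development mapping (`value_024` → 23.0, `value_036` → 35.0) suggests it would be 27.0 there. A minimal sketch of the usual remedy, fitting once on the development data and only transforming the holdout data, is shown on two hypothetical miniature frames. It assumes scikit-learn 0.24 or newer for `handle_unknown="use_encoded_value"`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical miniature versions of the development and holdout frames.
dev = pd.DataFrame({"feature_02": ["value_004", "value_003", "value_009"]})
holdout = pd.DataFrame({"feature_02": ["value_003", "value_001", "value_009"]})

# Fit once on the development data; categories unseen at fit time map to -1.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
dev["feature_02"] = enc.fit_transform(dev[["feature_02"]]).ravel()

# transform (not fit_transform) reuses the development mapping.
holdout["feature_02"] = enc.transform(holdout[["feature_02"]]).ravel()

print(dev["feature_02"].tolist())      # [1.0, 0.0, 2.0]
print(holdout["feature_02"].tolist())  # [0.0, -1.0, 2.0]
```

Here `value_003` and `value_009` keep the same codes in both frames, and the unseen `value_001` is marked with -1 instead of silently shifting the other codes.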

Check that no categorical columns remain.

In [59]:
df_sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4597 entries, 0 to 4596
Data columns (total 71 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          4597 non-null   int64  
 1   feature_01  4597 non-null   float64
 2   feature_02  4597 non-null   float64
 3   feature_03  4597 non-null   float64
 4   feature_04  4597 non-null   int64  
 5   feature_05  4597 non-null   int64  
 6   feature_06  4597 non-null   float64
 7   feature_07  4597 non-null   int64  
 8   feature_08  4597 non-null   float64
 9   feature_09  4597 non-null   float64
 10  feature_10  4597 non-null   float64
 11  feature_11  4597 non-null   float64
 12  feature_12  4597 non-null   float64
 13  feature_13  4597 non-null   int64  
 14  feature_14  4597 non-null   int64  
 15  feature_15  4597 non-null   float64
 16  feature_16  4597 non-null   float64
 17  feature_17  4597 non-null   float64
 18  feature_18  4597 non-null   float64
 19  feature_19  4597 non-null  

In [60]:
X_sample = df_sample

In [61]:
X_sample.head()

Unnamed: 0,id,feature_01,feature_02,feature_03,feature_04,feature_05,feature_06,feature_07,feature_08,feature_09,...,feature_61,feature_62,feature_63,feature_64,feature_65,feature_66,feature_67,feature_68,feature_69,feature_70
0,26048,1.0,0.0,11.0,2620,2610,3079.8,160,500.0,2200.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,26049,1.0,1.0,26.0,6900,6900,8142.0,466,0.0,2300.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,26050,0.0,11.0,41.0,7000,6400,6976.0,553,0.0,2134.54,...,1006.0,436.0,166.0,664.0,0.0,0.0,0.0,0.0,0.0,0.0
3,26051,1.0,11.0,241.0,9000,9000,10620.0,549,219.0,3000.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,26052,0.0,5.0,70.0,1000,1000,1060.0,114,0.0,2000.0,...,222.0,222.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [62]:
y_preds = loaded_model.predict(X_sample)
y_preds

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [66]:
final_table = pd.DataFrame({'id': X_sample['id'], 'target': y_preds})
final_table

Unnamed: 0,id,target
0,26048,0
1,26049,0
2,26050,0
3,26051,0
4,26052,0
...,...,...
4592,30640,0
4593,30641,0
4594,30642,0
4595,30643,0
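
The task statement asks for a file with the columns `id,probability`, whereas the table above holds hard class labels in a `target` column. A minimal sketch of producing the requested format with `predict_proba` follows. The data, the feature names `f1` to `f3`, and the file name `submission_demo.csv` are all hypothetical, and `id` is left out of the features here as a design choice, unlike in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-ins for the encoded development and holdout frames.
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
train["target"] = (train["f1"] + rng.normal(scale=0.5, size=200) > 1).astype(int)
holdout = pd.DataFrame(rng.normal(size=(5, 3)), columns=["f1", "f2", "f3"])
holdout.insert(0, "id", range(26048, 26053))

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(train[["f1", "f2", "f3"]], train["target"])

# predict_proba returns one column per class, ordered as in model.classes_;
# look up the column for class 1 instead of assuming its position.
pos_col = list(model.classes_).index(1)
proba = model.predict_proba(holdout[["f1", "f2", "f3"]])[:, pos_col]

submission = pd.DataFrame({"id": holdout["id"], "probability": proba})
# index=False keeps the file to the two required columns plus the header.
submission.to_csv("submission_demo.csv", index=False)
print(submission)
```

The same steps applied to the development rows would cover the other half of the requirement, which asks for a probability for each `id` in both samples.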
