## Anomaly Detection with XGB Classifier (base model accuracy 79%) - Facebook Recruiting IV: Human or Robot?

There are two datasets in this competition. One is a bidder dataset that lists bidder information, including their ID, payment account, and address. The other is a bid dataset that contains 7.6 million bids on different auctions. All bids in this dataset were made from mobile devices.

The online auction platform uses a fixed dollar increment for each bid, so the dataset does not include an amount for each bid. You are welcome to learn the bidding behavior from the time of the bids, the auction, or the device.


[Dataset download link](https://bit.ly/3uIL3Fl)

### File descriptions

- train.csv - the training set from the bidder dataset
- test.csv - the test set from the bidder dataset
- sampleSubmission.csv - a sample submission file in the correct format
- bids.csv - the bid dataset

### Data fields
**bidder dataset**

- bidder_id – Unique identifier of a bidder.
- payment_account – Payment account associated with a bidder. These are obfuscated to protect privacy. 
- address – Mailing address of a bidder. These are obfuscated to protect privacy. 
- (target) **outcome** – Label of a bidder indicating whether or not it is a robot. Value 1.0 indicates a robot, whereas value 0.0 indicates a human. 
The outcome was half hand-labeled, half stats-based. There are two types of "bots" with different levels of proof:

1. Bidders who are identified as bots/fraudulent with clear proof. Their accounts were banned by the auction site.

2. Bidders who may have just started their business/clicks or whose stats exceed the system-wide average. There is no clear proof that they are bots. 


**bid dataset**

- bid_id - unique id for this bid
- bidder_id – Unique identifier of a bidder (same as the bidder_id used in train.csv and test.csv)
- auction – Unique identifier of an auction
- merchandise –  The category of the auction site campaign, which means the bidder might come to this site by way of searching for "home goods" but ended up bidding for "sporting goods" - and that leads to this field being "home goods". This categorical field could be a search term, or online advertisement. 
- device – Phone model of a visitor
- time - Time that the bid is made (transformed to protect privacy).
- country - The country that the IP belongs to
- ip – IP address of a bidder (obfuscated to protect privacy).
- url - url where the bidder was referred from (obfuscated to protect privacy). 

Reference: https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot/data

---
- Step 1. Data load & EDA
- Step 2. Preprocessing - replace all values with count number
- Step 3. Preprocessing - Upsampling
- Step 4. Train / Valid Set Split
- Step 5. Modeling
- Step 6. Prediction

### Step 1. Data load & EDA

In [3]:
import pandas as pd

In [4]:
train_df = pd.read_csv('./data/train.csv')
test_df = pd.read_csv('./data/test.csv')

In [5]:
train_df.head()

Unnamed: 0,bidder_id,payment_account,address,outcome
0,91a3c57b13234af24875c56fb7e2b2f4rb56a,a3d2de7675556553a5f08e4c88d2c228754av,a3d2de7675556553a5f08e4c88d2c228vt0u4,0.0
1,624f258b49e77713fc34034560f93fb3hu3jo,a3d2de7675556553a5f08e4c88d2c228v1sga,ae87054e5a97a8f840a3991d12611fdcrfbq3,0.0
2,1c5f4fc669099bfbfac515cd26997bd12ruaj,a3d2de7675556553a5f08e4c88d2c2280cybl,92520288b50f03907041887884ba49c0cl0pd,0.0
3,4bee9aba2abda51bf43d639013d6efe12iycd,51d80e233f7b6a7dfdee484a3c120f3b2ita8,4cb9717c8ad7e88a9a284989dd79b98dbevyi,0.0
4,4ab12bc61c82ddd9c2d65e60555808acqgos1,a3d2de7675556553a5f08e4c88d2c22857ddh,2a96c3ce94b3be921e0296097b88b56a7x1ji,0.0


In [6]:
test_df.head()

Unnamed: 0,bidder_id,payment_account,address
0,49bb5a3c944b8fc337981cc7a9ccae41u31d7,a3d2de7675556553a5f08e4c88d2c228htx90,5d9fa1b71f992e7c7a106ce4b07a0a754le7c
1,a921612b85a1494456e74c09393ccb65ylp4y,a3d2de7675556553a5f08e4c88d2c228rs17i,a3d2de7675556553a5f08e4c88d2c228klidn
2,6b601e72a4d264dab9ace9d7b229b47479v6i,925381cce086b8cc9594eee1c77edf665zjpl,a3d2de7675556553a5f08e4c88d2c228aght0
3,eaf0ed0afc9689779417274b4791726cn5udi,a3d2de7675556553a5f08e4c88d2c228nclv5,b5714de1fd69d4a0d2e39d59e53fe9e15vwat
4,cdecd8d02ed8c6037e38042c7745f688mx5sf,a3d2de7675556553a5f08e4c88d2c228dtdkd,c3b363a3c3b838d58c85acf0fc9964cb4pnfa


In [7]:
bids = pd.read_csv('./data/bids.csv')

In [8]:
bids.head()

Unnamed: 0,bid_id,bidder_id,auction,merchandise,device,time,country,ip,url
0,0,8dac2b259fd1c6d1120e519fb1ac14fbqvax8,ewmzr,jewelry,phone0,9759243157894736,us,69.166.231.58,vasstdc27m7nks3
1,1,668d393e858e8126275433046bbd35c6tywop,aeqok,furniture,phone1,9759243157894736,in,50.201.125.84,jmqlhflrzwuay9c
2,2,aa5f360084278b35d746fa6af3a7a1a5ra3xe,wa00e,home goods,phone2,9759243157894736,py,112.54.208.157,vasstdc27m7nks3
3,3,3939ac3ef7d472a59a9c5f893dd3e39fh9ofi,jefix,jewelry,phone4,9759243157894736,in,18.99.175.133,vasstdc27m7nks3
4,4,8393c48eaf4b8fa96886edc7cf27b372dsibi,jefix,jewelry,phone5,9759243157894736,in,145.138.5.37,vasstdc27m7nks3


In [9]:
train_df['bidder_id'].nunique()

2013

In [10]:
bids['bidder_id'].nunique()

6614
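Since `pd.merge` performs an inner join by default, any bidder without rows in `bids.csv` would silently disappear once the two tables are merged later on. A quick way to check for this is sketched below on toy data; `train_toy` and `bids_toy` are hypothetical stand-ins for `train_df` and `bids`.

```python
import pandas as pd

# Toy stand-ins for train_df and bids (hypothetical values, not the real data).
train_toy = pd.DataFrame({"bidder_id": ["a", "b", "c"], "outcome": [0.0, 1.0, 0.0]})
bids_toy = pd.DataFrame({"bidder_id": ["a", "a", "c"], "auction": ["x", "y", "x"]})

# True where the bidder has at least one row in the bid table.
has_bids = train_toy["bidder_id"].isin(bids_toy["bidder_id"])

n_without_bids = int((~has_bids).sum())
print(f"{n_without_bids} of {len(train_toy)} bidders have no bids")
```

The same `isin` check could be applied to `test_df`, where missing bidders matter for the final submission.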

Check - train set 

In [11]:
train_df.nunique()

bidder_id          2013
payment_account    2013
address            2013
outcome               2
dtype: int64

In [12]:
train_df.isnull().sum()

bidder_id          0
payment_account    0
address            0
outcome            0
dtype: int64

In [13]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2013 entries, 0 to 2012
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   bidder_id        2013 non-null   object 
 1   payment_account  2013 non-null   object 
 2   address          2013 non-null   object 
 3   outcome          2013 non-null   float64
dtypes: float64(1), object(3)
memory usage: 63.0+ KB


---
**All 2013 bidder_id values are unique.**

Check - bids

In [14]:
bids.isnull().sum()

bid_id            0
bidder_id         0
auction           0
merchandise       0
device            0
time              0
country        8859
ip                0
url               0
dtype: int64

In [15]:
bids.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7656334 entries, 0 to 7656333
Data columns (total 9 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   bid_id       int64 
 1   bidder_id    object
 2   auction      object
 3   merchandise  object
 4   device       object
 5   time         int64 
 6   country      object
 7   ip           object
 8   url          object
dtypes: int64(2), object(7)
memory usage: 525.7+ MB
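The bid table takes more than 525 MB, and seven of its nine columns are stored as `object`. For low-cardinality columns such as `merchandise` (10 distinct values) and `country` (199), reading them as `category` may reduce memory use. The sketch below illustrates the idea on a small in-memory CSV; the rows are made up and the actual savings on `bids.csv` were not measured.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for bids.csv (hypothetical rows).
csv_text = "bid_id,merchandise,country\n" + "".join(
    f"{i},{'jewelry' if i % 2 else 'furniture'},{'us' if i % 3 else 'in'}\n"
    for i in range(1000)
)

# Default parsing keeps the text columns as strings.
as_object = pd.read_csv(io.StringIO(csv_text))

# Ask pandas to store the repeated labels as categories instead.
as_category = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"merchandise": "category", "country": "category"},
)

# deep=True also counts the string data behind the text columns.
mem_object = as_object.memory_usage(deep=True).sum()
mem_category = as_category.memory_usage(deep=True).sum()
print(mem_object, mem_category)
```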


In [16]:
bids.nunique()

bid_id         7656334
bidder_id         6614
auction          15051
merchandise         10
device            7351
time            776529
country            199
ip             2303991
url            1786351
dtype: int64

Check - test set

In [17]:
test_df.isnull().sum()

bidder_id          0
payment_account    0
address            0
dtype: int64

In [18]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4700 entries, 0 to 4699
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   bidder_id        4700 non-null   object
 1   payment_account  4700 non-null   object
 2   address          4700 non-null   object
dtypes: object(3)
memory usage: 110.3+ KB


In [19]:
test_df.nunique()

bidder_id          4700
payment_account    4700
address            4700
dtype: int64

In [20]:
bids.country.isnull().sum() / len(bids.country)

0.0011570811827174728

In [21]:
sample = pd.read_csv('./data/sampleSubmission.csv')

In [22]:
sample

Unnamed: 0,bidder_id,prediction
0,49bb5a3c944b8fc337981cc7a9ccae41u31d7,0.0
1,a921612b85a1494456e74c09393ccb65ylp4y,0.0
2,6b601e72a4d264dab9ace9d7b229b47479v6i,0.0
3,eaf0ed0afc9689779417274b4791726cn5udi,0.0
4,cdecd8d02ed8c6037e38042c7745f688mx5sf,0.0
...,...,...
4695,bef56983ba78b2ee064443ae95972877jfkyd,0.0
4696,4da45cc915c32d4368ac7e773d92d4affwqrr,0.0
4697,0d0e6220bf59ab9a0c5b5987fb2c34a9p33f9,0.0
4698,4981c32c54dde65b79dbc48fd9ab6457caqze,0.0


In [23]:
bids

Unnamed: 0,bid_id,bidder_id,auction,merchandise,device,time,country,ip,url
0,0,8dac2b259fd1c6d1120e519fb1ac14fbqvax8,ewmzr,jewelry,phone0,9759243157894736,us,69.166.231.58,vasstdc27m7nks3
1,1,668d393e858e8126275433046bbd35c6tywop,aeqok,furniture,phone1,9759243157894736,in,50.201.125.84,jmqlhflrzwuay9c
2,2,aa5f360084278b35d746fa6af3a7a1a5ra3xe,wa00e,home goods,phone2,9759243157894736,py,112.54.208.157,vasstdc27m7nks3
3,3,3939ac3ef7d472a59a9c5f893dd3e39fh9ofi,jefix,jewelry,phone4,9759243157894736,in,18.99.175.133,vasstdc27m7nks3
4,4,8393c48eaf4b8fa96886edc7cf27b372dsibi,jefix,jewelry,phone5,9759243157894736,in,145.138.5.37,vasstdc27m7nks3
...,...,...,...,...,...,...,...,...,...
7656329,7656329,626159dd6f2228ede002d9f9340f75b7puk8d,3e64w,jewelry,phone91,9709222052631578,ru,140.204.227.63,cghhmomsaxi6pug
7656330,7656330,a318ea333ceee1ba39a494476386136a826dv,xn0y0,mobile,phone236,9709222052631578,pl,24.232.159.118,wgggpdg2gx5pesn
7656331,7656331,f5b2bbad20d1d7ded3ed960393bec0f40u6hn,gja6c,sporting goods,phone80,9709222052631578,za,80.237.28.246,5xgysg14grlersa
7656332,7656332,d4bd412590f5106b9d887a43c51b254eldo4f,hmwk8,jewelry,phone349,9709222052631578,my,91.162.27.152,bhtrek44bzi2wfl


### Step 2. Preprocessing - replace all values with count number
Replace each value with the number of times a given bidder_id appears with a given auction / merchandise / device / time / country / ip / url.

In [24]:
bids.groupby(['bidder_id','auction']).size()

bidder_id                              auction
001068c415025a009fee375a12cff4fcnht8y  4ifac      1
002d229ffb247009810828f648afc2ef593rb  2tdw2      2
0030a2dd87ad2733e0873062e4f83954mkj86  obbny      1
003180b29c6a5f8f1d84a6b7b6f7be57tjj1o  cqsh6      1
                                       efh5o      1
                                                 ..
ffd62646d600b759a985d45918bd6f0431vmz  wthc6      1
                                       yfjur      7
                                       yv5hw      5
                                       zz4dz      5
fff2c070d8200e0a09150bd81452ce29ngcnv  2zggz      1
Length: 382341, dtype: int64

In [25]:
auction = pd.DataFrame({'count' : bids.groupby(['bidder_id','auction']).size()}).reset_index()

In [26]:
auction

Unnamed: 0,bidder_id,auction,count
0,001068c415025a009fee375a12cff4fcnht8y,4ifac,1
1,002d229ffb247009810828f648afc2ef593rb,2tdw2,2
2,0030a2dd87ad2733e0873062e4f83954mkj86,obbny,1
3,003180b29c6a5f8f1d84a6b7b6f7be57tjj1o,cqsh6,1
4,003180b29c6a5f8f1d84a6b7b6f7be57tjj1o,efh5o,1
...,...,...,...
382336,ffd62646d600b759a985d45918bd6f0431vmz,wthc6,1
382337,ffd62646d600b759a985d45918bd6f0431vmz,yfjur,7
382338,ffd62646d600b759a985d45918bd6f0431vmz,yv5hw,5
382339,ffd62646d600b759a985d45918bd6f0431vmz,zz4dz,5


In [27]:
auction.set_index('auction')['count']

auction
4ifac    1
2tdw2    2
obbny    1
cqsh6    1
efh5o    1
        ..
wthc6    1
yfjur    7
yv5hw    5
zz4dz    5
2zggz    1
Name: count, Length: 382341, dtype: int64

In [None]:
# Out of memory; the kernel was killed.
bids.replace(auction.set_index('auction')['count'].to_dict())

In [30]:
cols = ['auction', 'merchandise','device','time','ip','url']

> The code was changed because of the memory overload. The replace statement below is not used.

In [28]:
count = 0
for col in cols:
    count += 1
    print('In progress - {}/7'.format(count))
    col_df = pd.DataFrame({'count' : bids.groupby(['bidder_id',col]).size()}).reset_index()
    bids.replace(col_df.set_index(col)['count'].to_dict())

In progress - 1/7


KeyboardInterrupt: 

In [34]:
bids.drop('country',axis=1, inplace=True)
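`country` is the only column with missing values (8859 rows, about 0.12% of all bids), and the notebook drops it. A possible alternative, sketched here on toy data, is to keep the column and mark missing entries with a placeholder label so it can still be count-encoded. The label `"unknown"` is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for bids with some missing countries (hypothetical values).
bids_toy = pd.DataFrame({
    "bidder_id": ["a", "a", "b", "c"],
    "country": ["us", np.nan, "in", np.nan],
})

# Share of rows with a missing country.
missing_ratio = bids_toy["country"].isnull().mean()

# Keep the column and tag missing entries with a placeholder label.
bids_toy["country"] = bids_toy["country"].fillna("unknown")
print(missing_ratio, bids_toy["country"].tolist())
```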

> Replaced with an apply + lambda combination.

In [36]:
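count = 0
for col in cols:
    count += 1
    print('In progress - {}/6'.format(count))
    # Count rows per (bidder_id, value) pair
    col_df = pd.DataFrame({'count' : bids.groupby(['bidder_id',col]).size()}).reset_index()
    # Build a lookup dictionary keyed by the column value
    col_df = col_df.set_index(col)['count'].to_dict()
    # Replace each value with the count found in the lookup
    bids[col] = bids[col].apply(lambda x : col_df[x])

In progress - 1/6
In progress - 2/6
In progress - 3/6
In progress - 4/6
In progress - 5/6
In progress - 6/6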
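One caveat of the dictionary lookup above: the dictionary is keyed by the column value alone, so when several bidders share the same value (for example the same auction), only one of their counts survives and is applied to every row with that value. If the intent is a count per (bidder_id, value) pair, as described at the start of this step, `groupby(...).transform('size')` may be a closer fit and avoids building a Python dictionary. A minimal sketch on toy data, with `bids_toy` and `auction_count` as hypothetical names:

```python
import pandas as pd

# Toy stand-in for bids (hypothetical values).
bids_toy = pd.DataFrame({
    "bidder_id": ["a", "a", "a", "b", "b"],
    "auction":   ["x", "x", "y", "x", "y"],
})

# Number of rows sharing the same (bidder_id, auction) pair, aligned to each row.
bids_toy["auction_count"] = (
    bids_toy.groupby(["bidder_id", "auction"])["auction"].transform("size")
)
print(bids_toy["auction_count"].tolist())
```

Switching to this encoding would change the feature values, so the scores reported later in this notebook would not carry over.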


In [37]:
bids

Unnamed: 0,bid_id,bidder_id,auction,merchandise,device,time,ip,url
0,0,8dac2b259fd1c6d1120e519fb1ac14fbqvax8,35,3,72,1,1,1
1,1,668d393e858e8126275433046bbd35c6tywop,16,2917,161,1,108,7
2,2,aa5f360084278b35d746fa6af3a7a1a5ra3xe,16,1,21,1,4,1
3,3,3939ac3ef7d472a59a9c5f893dd3e39fh9ofi,1,3,200,1,1,1
4,4,8393c48eaf4b8fa96886edc7cf27b372dsibi,1,3,25,1,1,1
...,...,...,...,...,...,...,...,...
7656329,7656329,626159dd6f2228ede002d9f9340f75b7puk8d,1,3,1,1,4,1
7656330,7656330,a318ea333ceee1ba39a494476386136a826dv,1,664,123,1,4,1
7656331,7656331,f5b2bbad20d1d7ded3ed960393bec0f40u6hn,47,1,21,1,1,3
7656332,7656332,d4bd412590f5106b9d887a43c51b254eldo4f,42,3,160,1,1209,1445


In [39]:
bids.drop('bid_id',axis=1,inplace=True)

In [40]:
pd.merge(train_df,bids, on='bidder_id')

Unnamed: 0,bidder_id,payment_account,address,outcome,auction,merchandise,device,time,ip,url
0,91a3c57b13234af24875c56fb7e2b2f4rb56a,a3d2de7675556553a5f08e4c88d2c228754av,a3d2de7675556553a5f08e4c88d2c228vt0u4,0.0,16,1,72,1,1,1
1,91a3c57b13234af24875c56fb7e2b2f4rb56a,a3d2de7675556553a5f08e4c88d2c228754av,a3d2de7675556553a5f08e4c88d2c228vt0u4,0.0,24,1,27,1,216,1
2,91a3c57b13234af24875c56fb7e2b2f4rb56a,a3d2de7675556553a5f08e4c88d2c228754av,a3d2de7675556553a5f08e4c88d2c228vt0u4,0.0,16,1,200,1,1,1
3,91a3c57b13234af24875c56fb7e2b2f4rb56a,a3d2de7675556553a5f08e4c88d2c228754av,a3d2de7675556553a5f08e4c88d2c228vt0u4,0.0,12,1,200,1,1,1
4,91a3c57b13234af24875c56fb7e2b2f4rb56a,a3d2de7675556553a5f08e4c88d2c228754av,a3d2de7675556553a5f08e4c88d2c228vt0u4,0.0,24,1,200,1,1,1
...,...,...,...,...,...,...,...,...,...,...
3071219,c806dbb2decba0ed3c4ff5e2e60a74c2wjvbl,a3d2de7675556553a5f08e4c88d2c22856leq,d02c2b288b8aabd79ff47118aff41a2dqwzwc,0.0,1,664,36,1,1,2
3071220,c806dbb2decba0ed3c4ff5e2e60a74c2wjvbl,a3d2de7675556553a5f08e4c88d2c22856leq,d02c2b288b8aabd79ff47118aff41a2dqwzwc,0.0,1,664,17,1,696,2
3071221,0381a69b7a061e9ace2798fd48f1f537mgq57,fd87037ce0304077079c749f420f0b4c54uo0,f030a221726fbcdfc4dc7dfd1b381a112hieq,0.0,42,553,72,1,1209,1
3071222,84a769adc98498f52debfe57b93a0789556f4,fbe0ce34d6546ebd9e4c63afc68b085byd2tf,a3d2de7675556553a5f08e4c88d2c228fib6p,0.0,1,3,306,1,4,2


### Step 3. Preprocessing - Upsampling

In [41]:
train_df = pd.merge(train_df,bids, on='bidder_id')

In [42]:
train_df['outcome'].value_counts()

0.0    2658808
1.0     412416
Name: outcome, dtype: int64

In [43]:
from sklearn.utils import resample 

In [44]:
df_0 = train_df[train_df['outcome']==0]
df_1 = train_df[train_df['outcome']==1]

# Upsample the minority class (robots) to match the majority class size
df_upsample_1 = resample(df_1,
                         replace=True, 
                         n_samples=len(df_0),
                         random_state=111, 
                        )

df_upsampled = pd.concat([df_0, df_upsample_1])

df_upsampled['outcome'].value_counts()

1.0    2658808
0.0    2658808
Name: outcome, dtype: int64
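Upsampling duplicates robot rows until both classes have 2,658,808 rows. A commonly used alternative for gradient-boosted trees is to leave the data unchanged and weight the positive class instead; XGBoost exposes this through the `scale_pos_weight` parameter, which is often set near the negative-to-positive ratio. The sketch below only computes that ratio from the class counts shown before upsampling. Whether it would help on this data is untested.

```python
import pandas as pd

# Class counts of the merged training frame before upsampling.
counts = pd.Series({0.0: 2658808, 1.0: 412416})

# Ratio of negative (human) rows to positive (robot) rows.
scale_pos_weight = counts.loc[0.0] / counts.loc[1.0]
print(round(scale_pos_weight, 2))

# It could then be passed as xgb.XGBClassifier(scale_pos_weight=scale_pos_weight).
```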

In [45]:
train_df = df_upsampled

In [46]:
train_df

Unnamed: 0,bidder_id,payment_account,address,outcome,auction,merchandise,device,time,ip,url
0,91a3c57b13234af24875c56fb7e2b2f4rb56a,a3d2de7675556553a5f08e4c88d2c228754av,a3d2de7675556553a5f08e4c88d2c228vt0u4,0.0,16,1,72,1,1,1
1,91a3c57b13234af24875c56fb7e2b2f4rb56a,a3d2de7675556553a5f08e4c88d2c228754av,a3d2de7675556553a5f08e4c88d2c228vt0u4,0.0,24,1,27,1,216,1
2,91a3c57b13234af24875c56fb7e2b2f4rb56a,a3d2de7675556553a5f08e4c88d2c228754av,a3d2de7675556553a5f08e4c88d2c228vt0u4,0.0,16,1,200,1,1,1
3,91a3c57b13234af24875c56fb7e2b2f4rb56a,a3d2de7675556553a5f08e4c88d2c228754av,a3d2de7675556553a5f08e4c88d2c228vt0u4,0.0,12,1,200,1,1,1
4,91a3c57b13234af24875c56fb7e2b2f4rb56a,a3d2de7675556553a5f08e4c88d2c228754av,a3d2de7675556553a5f08e4c88d2c228vt0u4,0.0,24,1,200,1,1,1
...,...,...,...,...,...,...,...,...,...,...
781977,d2704c8bb6ebbf13e08f37131382b126wy4yc,308f53abff7d25a618069ec8b74feecexmpf3,a3d2de7675556553a5f08e4c88d2c228djz03,1.0,370,664,17,1,1,2
1064484,1aa485901ede7baf78ea116e297e9d7f9en6i,a3d2de7675556553a5f08e4c88d2c228n1tgz,ca8d4b018cb62966eebb2974f5a83b4fo1zpr,1.0,1,1,166,1,395,8
1493713,9655ccc7c0c193f1549475f02c54dce45kjw7,a04bad750c3144125fb80a399273bfa1wi6hx,fcec7ba7b352f0a5e62ca742391e8ab3yylj7,1.0,42,1,27,1,4,1
2144450,44bfa21b7d497f097a594c8861d4f8d4w5e97,5c0fc5af3b1ae4a5acd28349a71134a13x7o5,a3d2de7675556553a5f08e4c88d2c228irh1c,1.0,1,664,216,2768,4,1


In [53]:
X = train_df.drop(['bidder_id','payment_account','address','outcome'], axis=1)
y = train_df['outcome']

### Step 4. Train / Valid Set Split

In [54]:
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=111)

In [60]:
print(X_train.shape, y_train.shape, X_valid.shape, y_valid.shape)

(3988212, 6) (3988212,) (1329404, 6) (1329404,)
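Because every row is a single bid and the upsampled frame contains exact copies of robot rows, a random row-level split places bids from the same bidder (and duplicated rows) on both sides, which may make the validation score optimistic. One way to guard against this, sketched here as an assumption rather than what the notebook does, is to split by `bidder_id` with scikit-learn's `GroupShuffleSplit` before any upsampling. `df_toy` is a hypothetical stand-in for the merged training frame.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for the merged training frame (hypothetical values).
df_toy = pd.DataFrame({
    "bidder_id": ["a", "a", "b", "b", "c", "c", "d", "d"],
    "auction":   [1, 2, 1, 3, 5, 5, 2, 2],
    "outcome":   [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
})

# Hold out 25% of the bidders (not 25% of the rows).
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=111)
train_idx, valid_idx = next(
    splitter.split(df_toy, df_toy["outcome"], groups=df_toy["bidder_id"])
)

train_part = df_toy.iloc[train_idx]
valid_part = df_toy.iloc[valid_idx]

# No bidder should appear on both sides of the split.
train_bidders = set(train_part["bidder_id"])
valid_bidders = set(valid_part["bidder_id"])
print(train_bidders, valid_bidders)
```

With such a split, upsampling would be applied to the training part only, so that duplicated rows never reach the validation set.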


In [61]:
X_train

Unnamed: 0,auction,merchandise,device,time,ip,url
2586287,36,1,1,2768,29,33
2150317,42,1,175,1,1,1
2015289,1,1,1,246,1,1
1092321,1,1,58,1,696,1
1561172,16,1,36,1,1,1
...,...,...,...,...,...,...
2004758,210,664,274,1,1209,1
2579389,16,1,36,2768,11,1
776796,1,1,1,2768,342,1
2161363,135,3,72,1,42,1


### Step 5. Modeling

> grid_cv was not used because of a suspected memory leak. It was replaced with xgboost (default parameters).

In [56]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

params = {'n_estimators' : [10, 20, 30, 40, 50],
          'max_depth' : [6, 8, 10]
         }

model = RandomForestClassifier(random_state=111)
grid_cv = GridSearchCV(model, param_grid=params, cv=3) 

grid_cv.fit(X_train, y_train)
print('Best Hyper-Parameter : {}'.format(grid_cv.best_params_))
print('Best Accuracy : {:.4f}'.format(grid_cv.best_score_))

best_params = grid_cv.best_params_


KeyboardInterrupt: 
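The grid above covers 15 parameter combinations with 3 folds each, that is 45 cross-validation fits on roughly 4 million rows, which may explain why the run had to be interrupted. A possible workaround, sketched below on synthetic data, is to run the search on a stratified subsample and refit the chosen parameters on the full training set afterwards. The 10% sample size and the reduced grid are arbitrary illustrative choices, and `X_full` / `y_full` are hypothetical stand-ins for `X_train` / `y_train`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the count-encoded features (not the real data).
rng = np.random.default_rng(111)
X_full = pd.DataFrame(
    rng.integers(1, 100, size=(2000, 6)),
    columns=["auction", "merchandise", "device", "time", "ip", "url"],
)
y_full = (X_full["auction"] > 50).astype(float)

# Keep a stratified 10% sample for the hyper-parameter search.
X_small, _, y_small, _ = train_test_split(
    X_full, y_full, train_size=0.1, stratify=y_full, random_state=111
)

params = {"n_estimators": [10, 20], "max_depth": [6, 8]}
grid_cv = GridSearchCV(RandomForestClassifier(random_state=111), param_grid=params, cv=3)
grid_cv.fit(X_small, y_small)
print(grid_cv.best_params_)
```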

In [58]:
import xgboost as xgb

model = xgb.XGBClassifier(n_estimators=100)

In [59]:
# model = RandomForestClassifier(best_params) is not used because of the memory leak

model.fit(X_train, y_train)

print('Train Accuracy : {:.2f}'.format(model.score(X_train, y_train)))
print('Test Accuracy : {:.2f}'.format(model.score(X_valid, y_valid)))

Train Accuracy : 0.79
Test Accuracy : 0.79
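Accuracy on a balanced, upsampled validation set is only one view of model quality. A threshold-free metric such as ROC AUC, computed from predicted probabilities, can complement it. The sketch below uses scikit-learn's `roc_auc_score` on made-up labels and scores; with the fitted model, the scores would presumably come from `model.predict_proba(X_valid)[:, 1]`.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up labels and predicted robot probabilities.
y_true = [0.0, 0.0, 1.0, 1.0]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC ranks the scores and needs no threshold.
auc = roc_auc_score(y_true, y_score)

# Accuracy depends on the chosen threshold (0.5 here).
y_pred = [1.0 if s >= 0.5 else 0.0 for s in y_score]
acc = accuracy_score(y_true, y_pred)
print(auc, acc)
```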


In [62]:
test_df = pd.merge(test_df,bids, on='bidder_id')

### Step 6. Prediction

In [63]:
for_test = test_df.drop(['bidder_id','payment_account','address'], axis=1)

In [65]:
pred = model.predict(for_test)

In [67]:
pred

array([0., 1., 1., ..., 1., 1., 0.])

In [68]:
test_df['bidder_id']

0          49bb5a3c944b8fc337981cc7a9ccae41u31d7
1          49bb5a3c944b8fc337981cc7a9ccae41u31d7
2          49bb5a3c944b8fc337981cc7a9ccae41u31d7
3          49bb5a3c944b8fc337981cc7a9ccae41u31d7
4          a921612b85a1494456e74c09393ccb65ylp4y
                           ...                  
4585105    7ade70030d559a6c255be2f6feca17acnrqs0
4585106    7ade70030d559a6c255be2f6feca17acnrqs0
4585107    7ade70030d559a6c255be2f6feca17acnrqs0
4585108    7ade70030d559a6c255be2f6feca17acnrqs0
4585109    7ade70030d559a6c255be2f6feca17acnrqs0
Name: bidder_id, Length: 4585110, dtype: object

In [83]:
pred_df = pd.DataFrame(pred, columns=['prediction'])

In [84]:
id_df = pd.DataFrame(test_df['bidder_id'])

In [85]:
submission = pd.concat([id_df, pred_df], axis=1)

In [86]:
submission

Unnamed: 0,bidder_id,prediction
0,49bb5a3c944b8fc337981cc7a9ccae41u31d7,0.0
1,49bb5a3c944b8fc337981cc7a9ccae41u31d7,1.0
2,49bb5a3c944b8fc337981cc7a9ccae41u31d7,1.0
3,49bb5a3c944b8fc337981cc7a9ccae41u31d7,1.0
4,a921612b85a1494456e74c09393ccb65ylp4y,0.0
...,...,...
4585105,7ade70030d559a6c255be2f6feca17acnrqs0,1.0
4585106,7ade70030d559a6c255be2f6feca17acnrqs0,0.0
4585107,7ade70030d559a6c255be2f6feca17acnrqs0,1.0
4585108,7ade70030d559a6c255be2f6feca17acnrqs0,1.0


In [92]:
submission.to_csv('./data/sampleSubmission.csv', index=False)

In [93]:
pd.read_csv('./data/sampleSubmission.csv')

Unnamed: 0,bidder_id,prediction
0,49bb5a3c944b8fc337981cc7a9ccae41u31d7,0.0
1,49bb5a3c944b8fc337981cc7a9ccae41u31d7,1.0
2,49bb5a3c944b8fc337981cc7a9ccae41u31d7,1.0
3,49bb5a3c944b8fc337981cc7a9ccae41u31d7,1.0
4,a921612b85a1494456e74c09393ccb65ylp4y,0.0
...,...,...
4585105,7ade70030d559a6c255be2f6feca17acnrqs0,1.0
4585106,7ade70030d559a6c255be2f6feca17acnrqs0,0.0
4585107,7ade70030d559a6c255be2f6feca17acnrqs0,1.0
4585108,7ade70030d559a6c255be2f6feca17acnrqs0,1.0
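The file written above has one row per bid (4,585,110 rows), whereas `sampleSubmission.csv` has one row per `bidder_id`. A possible way to bring the predictions back to that shape is sketched below on toy data. It assumes that the mean of the per-bid predictions is an acceptable bidder-level score and that bidders without any bids default to 0.0; both are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Toy per-bid predictions (hypothetical values).
submission_toy = pd.DataFrame({
    "bidder_id": ["a", "a", "a", "b", "b"],
    "prediction": [0.0, 1.0, 1.0, 0.0, 0.0],
})
# Toy list of all test bidders; "c" has no bids at all.
sample_toy = pd.DataFrame({"bidder_id": ["a", "b", "c"]})

# Average the per-bid predictions for each bidder.
per_bidder = submission_toy.groupby("bidder_id", as_index=False)["prediction"].mean()

# Left-join onto the full bidder list so that nobody is dropped.
final = sample_toy.merge(per_bidder, on="bidder_id", how="left")
final["prediction"] = final["prediction"].fillna(0.0)
print(final)
```

Writing the result to a new file name rather than `sampleSubmission.csv` would also keep the original sample file intact.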
