My project is based on a Kaggle competition. The aim is to identify computer-generated bidding bots on an online auction website; removing these bots would prevent unfair auction activity. All of the data was provided by Kaggle. The performance measure is the score Kaggle returns when the predicted outcomes for the testing data are submitted. The true outcomes for the testing data are never shown.

Import the necessary libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
%matplotlib inline

Get the data. The bid data has all the bids made by the users. The training data has the bidder IDs, the outcome (whether the particular ID was found to be a bot or not), the payment account and the address. The payment account and address are scrambled. The testing data has the same columns as the training data except for the outcome. The outcomes are to be predicted and submitted to Kaggle for evaluation; we cannot see the true outcomes.

In [2]:
bid_data=pd.read_csv('ml_project_data/bids.csv')
training_data=pd.read_csv('ml_project_data/train.csv')
testing_data=pd.read_csv('ml_project_data/test.csv')

In [3]:
bid_data.head()

Unnamed: 0,bid_id,bidder_id,auction,merchandise,device,time,country,ip,url
0,0,8dac2b259fd1c6d1120e519fb1ac14fbqvax8,ewmzr,jewelry,phone0,9759243157894736,us,69.166.231.58,vasstdc27m7nks3
1,1,668d393e858e8126275433046bbd35c6tywop,aeqok,furniture,phone1,9759243157894736,in,50.201.125.84,jmqlhflrzwuay9c
2,2,aa5f360084278b35d746fa6af3a7a1a5ra3xe,wa00e,home goods,phone2,9759243157894736,py,112.54.208.157,vasstdc27m7nks3
3,3,3939ac3ef7d472a59a9c5f893dd3e39fh9ofi,jefix,jewelry,phone4,9759243157894736,in,18.99.175.133,vasstdc27m7nks3
4,4,8393c48eaf4b8fa96886edc7cf27b372dsibi,jefix,jewelry,phone5,9759243157894736,in,145.138.5.37,vasstdc27m7nks3


In [4]:
bid_data.shape

(7656334, 9)

In [5]:
training_data.head()

Unnamed: 0,bidder_id,payment_account,address,outcome
0,91a3c57b13234af24875c56fb7e2b2f4rb56a,a3d2de7675556553a5f08e4c88d2c228754av,a3d2de7675556553a5f08e4c88d2c228vt0u4,0.0
1,624f258b49e77713fc34034560f93fb3hu3jo,a3d2de7675556553a5f08e4c88d2c228v1sga,ae87054e5a97a8f840a3991d12611fdcrfbq3,0.0
2,1c5f4fc669099bfbfac515cd26997bd12ruaj,a3d2de7675556553a5f08e4c88d2c2280cybl,92520288b50f03907041887884ba49c0cl0pd,0.0
3,4bee9aba2abda51bf43d639013d6efe12iycd,51d80e233f7b6a7dfdee484a3c120f3b2ita8,4cb9717c8ad7e88a9a284989dd79b98dbevyi,0.0
4,4ab12bc61c82ddd9c2d65e60555808acqgos1,a3d2de7675556553a5f08e4c88d2c22857ddh,2a96c3ce94b3be921e0296097b88b56a7x1ji,0.0


In [6]:
training_data.shape

(2013, 4)

In [7]:
testing_data.head()

Unnamed: 0,bidder_id,payment_account,address
0,49bb5a3c944b8fc337981cc7a9ccae41u31d7,a3d2de7675556553a5f08e4c88d2c228htx90,5d9fa1b71f992e7c7a106ce4b07a0a754le7c
1,a921612b85a1494456e74c09393ccb65ylp4y,a3d2de7675556553a5f08e4c88d2c228rs17i,a3d2de7675556553a5f08e4c88d2c228klidn
2,6b601e72a4d264dab9ace9d7b229b47479v6i,925381cce086b8cc9594eee1c77edf665zjpl,a3d2de7675556553a5f08e4c88d2c228aght0
3,eaf0ed0afc9689779417274b4791726cn5udi,a3d2de7675556553a5f08e4c88d2c228nclv5,b5714de1fd69d4a0d2e39d59e53fe9e15vwat
4,cdecd8d02ed8c6037e38042c7745f688mx5sf,a3d2de7675556553a5f08e4c88d2c228dtdkd,c3b363a3c3b838d58c85acf0fc9964cb4pnfa


In [8]:
testing_data.shape

(4700, 3)

The bid data is sorted by bidder_id and then by time, so that each bidder's bids sit together in chronological order. This makes the per-bidder calculations easier to work with.

In [9]:
bid_data=bid_data.sort_values(by=['bidder_id','time'],ascending=[True,True])
bid_data.head()

Unnamed: 0,bid_id,bidder_id,auction,merchandise,device,time,country,ip,url
7179832,7179832,001068c415025a009fee375a12cff4fcnht8y,4ifac,jewelry,phone561,9706345052631578,bn,139.226.147.115,vasstdc27m7nks3
1281292,1281292,002d229ffb247009810828f648afc2ef593rb,2tdw2,mobile,phone640,9766744105263157,sg,37.40.254.131,vasstdc27m7nks3
1281311,1281311,002d229ffb247009810828f648afc2ef593rb,2tdw2,mobile,phone219,9766744210526315,sg,37.40.254.131,vasstdc27m7nks3
6805028,6805028,0030a2dd87ad2733e0873062e4f83954mkj86,obbny,mobile,phone313,9704553947368421,ir,21.67.17.162,vnw40k8zzokijsv
3967330,3967330,003180b29c6a5f8f1d84a6b7b6f7be57tjj1o,obbny,mobile,phone420,9640018631578947,id,44.241.8.179,sj4jidex850loas


The time between consecutive bids made by the same bidder is calculated. I figured this would be useful because bots can bid much faster than humans. The first bid of each bidder has no previous bid, so its time_difference is NaN.

In [10]:
time_difference=bid_data.groupby('bidder_id')['time'].diff()
bid_data['time_difference']=time_difference
bid_data.head()

Unnamed: 0,bid_id,bidder_id,auction,merchandise,device,time,country,ip,url,time_difference
7179832,7179832,001068c415025a009fee375a12cff4fcnht8y,4ifac,jewelry,phone561,9706345052631578,bn,139.226.147.115,vasstdc27m7nks3,
1281292,1281292,002d229ffb247009810828f648afc2ef593rb,2tdw2,mobile,phone640,9766744105263157,sg,37.40.254.131,vasstdc27m7nks3,
1281311,1281311,002d229ffb247009810828f648afc2ef593rb,2tdw2,mobile,phone219,9766744210526315,sg,37.40.254.131,vasstdc27m7nks3,105263158.0
6805028,6805028,0030a2dd87ad2733e0873062e4f83954mkj86,obbny,mobile,phone313,9704553947368421,ir,21.67.17.162,vnw40k8zzokijsv,
3967330,3967330,003180b29c6a5f8f1d84a6b7b6f7be57tjj1o,obbny,mobile,phone420,9640018631578947,id,44.241.8.179,sj4jidex850loas,
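
To make the per-bidder diff concrete, here is a minimal sketch on a made-up four-row frame. The bidder IDs and times are illustrative values, not rows from the Kaggle data. It shows that the gap is computed within each bidder and that a bidder's first bid gets NaN.

```python
import pandas as pd

# Toy bids (hypothetical values), already sorted by bidder_id and time.
toy_bids = pd.DataFrame({
    'bidder_id': ['a', 'a', 'a', 'b'],
    'time': [10, 13, 20, 5],
})

# diff() within each group gives the gap to that bidder's previous bid;
# the first bid of every bidder has no predecessor, so it becomes NaN.
toy_bids['time_difference'] = toy_bids.groupby('bidder_id')['time'].diff()
print(toy_bids)
```

Bidder b's only bid is not compared with bidder a's last bid, which is why the groupby is needed rather than a plain diff over the whole column.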


This creates a dataframe with one row per bidder, which will hold the details of all the bidders.

In [12]:
bidder_data=pd.DataFrame(data=bid_data['bidder_id'].unique(),columns=['bidder_id'],index=bid_data['bidder_id'].unique())
bidder_data.head()

Unnamed: 0,bidder_id
001068c415025a009fee375a12cff4fcnht8y,001068c415025a009fee375a12cff4fcnht8y
002d229ffb247009810828f648afc2ef593rb,002d229ffb247009810828f648afc2ef593rb
0030a2dd87ad2733e0873062e4f83954mkj86,0030a2dd87ad2733e0873062e4f83954mkj86
003180b29c6a5f8f1d84a6b7b6f7be57tjj1o,003180b29c6a5f8f1d84a6b7b6f7be57tjj1o
00486a11dff552c4bd7696265724ff81yeo9v,00486a11dff552c4bd7696265724ff81yeo9v


In [13]:
bidder_data.shape #6614 users have made bids

(6614, 1)

I figured that bots would make far more bids than humans. This calculates the number of bids made by each bidder.

In [14]:
bid_counts_by_id=bid_data.groupby('bidder_id')['bidder_id'].agg('count')
bidder_data['no_of_bids_made']=bid_counts_by_id
bidder_data.head(20)

Unnamed: 0,bidder_id,no_of_bids_made
001068c415025a009fee375a12cff4fcnht8y,001068c415025a009fee375a12cff4fcnht8y,1
002d229ffb247009810828f648afc2ef593rb,002d229ffb247009810828f648afc2ef593rb,2
0030a2dd87ad2733e0873062e4f83954mkj86,0030a2dd87ad2733e0873062e4f83954mkj86,1
003180b29c6a5f8f1d84a6b7b6f7be57tjj1o,003180b29c6a5f8f1d84a6b7b6f7be57tjj1o,3
00486a11dff552c4bd7696265724ff81yeo9v,00486a11dff552c4bd7696265724ff81yeo9v,20
0051aef3fdeacdadba664b9b3b07e04e4coc6,0051aef3fdeacdadba664b9b3b07e04e4coc6,68
0053b78cde37c4384a20d2da9aa4272aym4pb,0053b78cde37c4384a20d2da9aa4272aym4pb,10939
0061edfc5b07ff3d70d693883a38d370oy4fs,0061edfc5b07ff3d70d693883a38d370oy4fs,134
00862324eb508ca5202b6d4e5f1a80fc3t3lp,00862324eb508ca5202b6d4e5f1a80fc3t3lp,5
009479273c288b1dd096dc3087653499lrx3c,009479273c288b1dd096dc3087653499lrx3c,1


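This gets each bidder's country. The country of each bidder's earliest bid is used.

In [15]:
# bid_data is sorted by bidder_id and time, so keeping the first row per bidder
# gives the country of that bidder's earliest bid
temp=bid_data[['bidder_id','country']].drop_duplicates('bidder_id')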
bidder_data=bidder_data.merge(temp,on='bidder_id')
bidder_data.head()

Unnamed: 0,bidder_id,no_of_bids_made,country
0,001068c415025a009fee375a12cff4fcnht8y,1,bn
1,002d229ffb247009810828f648afc2ef593rb,2,sg
2,0030a2dd87ad2733e0873062e4f83954mkj86,1,ir
3,003180b29c6a5f8f1d84a6b7b6f7be57tjj1o,3,id
4,00486a11dff552c4bd7696265724ff81yeo9v,20,ng


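This gets the sum of the times between consecutive bids for each bidder.

In [16]:
# min_count=1 keeps the sum as NaN for bidders with a single bid (no gaps to add up)
time_sums=bid_data.groupby('bidder_id')['time_difference'].sum(min_count=1)
# match on bidder_id instead of relying on both frames having the same row order
bidder_data['sum_of_diffs']=bidder_data['bidder_id'].map(time_sums)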
bidder_data.head()

Unnamed: 0,bidder_id,no_of_bids_made,country,sum_of_diffs
0,001068c415025a009fee375a12cff4fcnht8y,1,bn,
1,002d229ffb247009810828f648afc2ef593rb,2,sg,105263200.0
2,0030a2dd87ad2733e0873062e4f83954mkj86,1,ir,
3,003180b29c6a5f8f1d84a6b7b6f7be57tjj1o,3,id,65955680000000.0
4,00486a11dff552c4bd7696265724ff81yeo9v,20,ng,76349840000000.0


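This gets the mean time between bids. Note that the sum is divided by the number of bids rather than the number of gaps (which is one fewer), so the value is somewhat lower than the true mean gap.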

In [17]:
bidder_data['mean_time_diff']=bidder_data['sum_of_diffs']/bidder_data['no_of_bids_made']
bidder_data.head()

Unnamed: 0,bidder_id,no_of_bids_made,country,sum_of_diffs,mean_time_diff
0,001068c415025a009fee375a12cff4fcnht8y,1,bn,,
1,002d229ffb247009810828f648afc2ef593rb,2,sg,105263200.0,52631580.0
2,0030a2dd87ad2733e0873062e4f83954mkj86,1,ir,,
3,003180b29c6a5f8f1d84a6b7b6f7be57tjj1o,3,id,65955680000000.0,21985230000000.0
4,00486a11dff552c4bd7696265724ff81yeo9v,20,ng,76349840000000.0,3817492000000.0
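
The cell above divides the summed gaps by the number of bids. As an alternative sketch, the mean gap can be taken directly from the per-bid differences, which divides by the number of gaps instead. This is not what the notebook does, and the toy values below are made up for illustration.

```python
import pandas as pd

# Toy bids (hypothetical values), sorted by bidder_id and time.
toy_bids = pd.DataFrame({
    'bidder_id': ['a', 'a', 'a', 'b'],
    'time': [10, 13, 20, 5],
})
gaps = toy_bids.groupby('bidder_id')['time'].diff()

# mean() skips NaN, so each bidder's summed gaps are divided by (bids - 1).
mean_gap = gaps.groupby(toy_bids['bidder_id']).mean()
print(mean_gap)
```

For bidder a the gaps are 3 and 7, so the mean gap is 5, whereas dividing the sum by the three bids would give about 3.33. Bidder b has a single bid and stays NaN.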


On exploration, I found that bidders from the countries id and in (Indonesia and India, I assume) account for a large portion of the bots that were detected. I incorporated a feature, id_or_in, that tells me if a bidder is from one of these two countries.

In [18]:
# 1 if the bidder's country is 'id' or 'in', otherwise 0 (a missing country also gives 0)
bidder_data['id_or_in']=bidder_data['country'].isin(['id','in']).astype(int)
bidder_data.head()

Unnamed: 0,bidder_id,no_of_bids_made,country,sum_of_diffs,mean_time_diff,id_or_in
0,001068c415025a009fee375a12cff4fcnht8y,1,bn,,,0
1,002d229ffb247009810828f648afc2ef593rb,2,sg,105263200.0,52631580.0,0
2,0030a2dd87ad2733e0873062e4f83954mkj86,1,ir,,,0
3,003180b29c6a5f8f1d84a6b7b6f7be57tjj1o,3,id,65955680000000.0,21985230000000.0,1
4,00486a11dff552c4bd7696265724ff81yeo9v,20,ng,76349840000000.0,3817492000000.0,0
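
The exploration step itself is not shown in the notebook. One way it could be done, sketched here on made-up labelled bidders, is to count the bots per country and look at each country's share of all bots. The frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical labelled bidders; outcome 1.0 means bot, 0.0 means human.
toy = pd.DataFrame({
    'country': ['id', 'id', 'in', 'us', 'us', 'us'],
    'outcome': [1.0, 1.0, 1.0, 0.0, 0.0, 1.0],
})

# Number of bots per country, then each country's share of all bots.
bots_per_country = toy.groupby('country')['outcome'].sum()
share_of_bots = bots_per_country / bots_per_country.sum()
print(share_of_bots.sort_values(ascending=False))
```

On the real data this would need the outcome and country columns in one frame, which only exists after the training data has been merged with the bidder data.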


The training data is merged with the bidder data. A few bidders in the training data have no bids, so the merge has to be a left join to keep them. This explains some of the NaN values.

In [19]:
training_data=training_data.merge(bidder_data, how='left',on='bidder_id')

In [21]:
training_data.isnull().sum()

bidder_id            0
payment_account      0
address              0
outcome              0
no_of_bids_made     29
country             35
sum_of_diffs       331
mean_time_diff     331
id_or_in            29
dtype: int64
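
To illustrate why the left join produces these NaN values, here is a small sketch with two made-up frames. The `indicator=True` option of `merge` adds a `_merge` column that marks rows with no match on the right side; the frame names and values are assumptions for the example.

```python
import pandas as pd

# Hypothetical labels and per-bidder features; bidder 'c' never placed a bid.
labels = pd.DataFrame({'bidder_id': ['a', 'b', 'c'], 'outcome': [0.0, 1.0, 0.0]})
features = pd.DataFrame({'bidder_id': ['a', 'b'], 'no_of_bids_made': [3, 40]})

# how='left' keeps every labelled bidder; unmatched rows get NaN features.
merged = labels.merge(features, how='left', on='bidder_id', indicator=True)

# Rows marked 'left_only' are the bidders without any bid data.
no_bids = merged[merged['_merge'] == 'left_only']
print(no_bids)
```

An inner join would silently drop bidder c, which is not an option here because every bidder in the testing data needs a prediction.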

In [22]:
training_data[training_data['no_of_bids_made'].isnull()]

Unnamed: 0,bidder_id,payment_account,address,outcome,no_of_bids_made,country,sum_of_diffs,mean_time_diff,id_or_in
49,5f50c6187a179e2ee7ba2fbcfc845c7a1smgr,7326f0a1592b18cb1e6ed7c8ebbd03a72qf7p,a3d2de7675556553a5f08e4c88d2c228uaoqg,0.0,,,,,
88,02bde521e763e4f4e590e8368149e04a96il9,a3d2de7675556553a5f08e4c88d2c2286r1lb,935d2083173e96f099816c1b1f7ee249kk8zo,0.0,,,,,
175,dd661e2d6e79a5b3e66c82373d50f3ee86k85,e805bf9d2399ddc37a194e04703a333c7bv82,2c2b8b44b1615ef6d632fb115a85794djmktr,0.0,,,,,
236,908ce7060337fd8550760a100921f6f7wsemn,a3d2de7675556553a5f08e4c88d2c2282yldz,39b2bdec29461f8a0ae2a5a5b01d259fik8r7,0.0,,,,,
262,b64c209b3d1d91a663d134961af89125u0s9a,f7558102989f5665bbbea00358f8434adf9o9,5c9de1da50cc32a29ffd596ae24cd2be24cly,0.0,,,,,
271,96b1df591dbbf3a002578574671d9ff1rmzev,a3d2de7675556553a5f08e4c88d2c2288tx3v,1a1d480ca96a50e5fa8af28cf9121d80gx8g2,0.0,,,,,
299,d2138bce99f535c244dab68652ccfa2enshxk,72a51aa2faf94e7ffdf736fdb389d4efpyojt,458d233d676d8f62406213ab319b8334dbdxh,0.0,,,,,
305,c08e5e3325e2f4fea171f24ca018e675we8kj,efd15ad70741e38f212ac919ca569615r3g33,81d6c498369ab4af3fea529406dc7d96flle8,0.0,,,,,
339,187636e527504df29bf42d4a2b7767e54bgv7,a3d2de7675556553a5f08e4c88d2c2281swzb,a3d2de7675556553a5f08e4c88d2c2281rrln,0.0,,,,,
364,74f153bd134afc92866a6bc5cceb2088120y0,a3d2de7675556553a5f08e4c88d2c228fd0s8,796f3dd849480319c21677833dc1a6c87c6p1,0.0,,,,,


This replaces the NaN values in the no_of_bids_made column with zero.

In [23]:
training_data['no_of_bids_made']=training_data['no_of_bids_made'].fillna(0)

This changes the NaN values in the id_or_in column to 0.

In [24]:
training_data['id_or_in']=training_data['id_or_in'].fillna(0)

In [25]:
training_data.isnull().sum()

bidder_id            0
payment_account      0
address              0
outcome              0
no_of_bids_made      0
country             35
sum_of_diffs       331
mean_time_diff     331
id_or_in             0
dtype: int64

This finds the mean of mean_time_diff over the bidders whose outcome is human. It is substituted for NaN when a bidder has made 0 or 1 bids.

In [27]:
# mean() skips NaN, so this is the sum divided by the count of non-missing values
temp=training_data.loc[training_data['outcome']==0,'mean_time_diff'].mean()
training_data['mean_time_diff']=training_data['mean_time_diff'].fillna(temp)

In [28]:
training_data.isnull().sum()

bidder_id            0
payment_account      0
address              0
outcome              0
no_of_bids_made      0
country             35
sum_of_diffs       331
mean_time_diff       0
id_or_in             0
dtype: int64
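
The same three fills are applied to both the training and the testing data. A minimal sketch of wrapping them in one helper is shown below, so that both frames are guaranteed to get identical treatment. The function name `fill_missing` and the toy frame are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def fill_missing(frame, mean_time_fill):
    """Return a copy of frame with the three model features free of NaN."""
    out = frame.copy()
    out['no_of_bids_made'] = out['no_of_bids_made'].fillna(0)
    out['id_or_in'] = out['id_or_in'].fillna(0)
    out['mean_time_diff'] = out['mean_time_diff'].fillna(mean_time_fill)
    return out

# Hypothetical training rows; row 1 is a bidder with no bids at all.
toy_train = pd.DataFrame({
    'outcome': [0.0, 0.0, 1.0, 0.0],
    'no_of_bids_made': [2, np.nan, 50, 4],
    'id_or_in': [0, np.nan, 1, 0],
    'mean_time_diff': [10.0, np.nan, 1.0, 30.0],
})

# The fill value comes from the human rows of the training data only.
human_mean = toy_train.loc[toy_train['outcome'] == 0, 'mean_time_diff'].mean()
filled = fill_missing(toy_train, human_mean)
print(filled)
```

The fill value is computed once from the training data and then passed in, so the testing data can reuse it without needing an outcome column of its own.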

This is to merge the testing data with the bidder_data

In [30]:
testing_data=testing_data.merge(bidder_data,how='left',on='bidder_id')

In [31]:
testing_data.head()

Unnamed: 0,bidder_id,payment_account,address,no_of_bids_made,country,sum_of_diffs,mean_time_diff,id_or_in
0,49bb5a3c944b8fc337981cc7a9ccae41u31d7,a3d2de7675556553a5f08e4c88d2c228htx90,5d9fa1b71f992e7c7a106ce4b07a0a754le7c,4.0,us,70223680000000.0,17555920000000.0,0.0
1,a921612b85a1494456e74c09393ccb65ylp4y,a3d2de7675556553a5f08e4c88d2c228rs17i,a3d2de7675556553a5f08e4c88d2c228klidn,3.0,az,76002050000000.0,25334020000000.0,0.0
2,6b601e72a4d264dab9ace9d7b229b47479v6i,925381cce086b8cc9594eee1c77edf665zjpl,a3d2de7675556553a5f08e4c88d2c228aght0,17.0,id,291052600000.0,17120740000.0,1.0
3,eaf0ed0afc9689779417274b4791726cn5udi,a3d2de7675556553a5f08e4c88d2c228nclv5,b5714de1fd69d4a0d2e39d59e53fe9e15vwat,148.0,bd,76521630000000.0,517038100000.0,0.0
4,cdecd8d02ed8c6037e38042c7745f688mx5sf,a3d2de7675556553a5f08e4c88d2c228dtdkd,c3b363a3c3b838d58c85acf0fc9964cb4pnfa,23.0,za,6574789000000.0,285860400000.0,0.0


In [32]:
testing_data.isnull().sum()

bidder_id            0
payment_account      0
address              0
no_of_bids_made     70
country             77
sum_of_diffs       825
mean_time_diff     825
id_or_in            70
dtype: int64

This replaces the NaN values in the same columns of the testing data, using the same fill values as for the training data.

In [33]:
testing_data['no_of_bids_made']=testing_data['no_of_bids_made'].fillna(0)

In [34]:
testing_data['id_or_in']=testing_data['id_or_in'].fillna(0)

In [35]:
# temp is the human mean of mean_time_diff computed from the training data
testing_data['mean_time_diff']=testing_data['mean_time_diff'].fillna(temp)

In [36]:
testing_data.isnull().sum()

bidder_id            0
payment_account      0
address              0
no_of_bids_made      0
country             77
sum_of_diffs       825
mean_time_diff       0
id_or_in             0
dtype: int64

This eliminates human outliers with a very large number of bids (i.e. 50000 or more).

In [42]:
human_data=training_data[training_data['outcome']==0]
human_data=human_data[human_data['no_of_bids_made']<50000]
bot_data=training_data[training_data['outcome']==1]

In [43]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
train2=pd.concat([human_data,bot_data])
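
The split into humans and bots followed by recombining can also be written as a single boolean mask. This is only a sketch on made-up rows: it keeps the same bidders, but leaves them in their original order instead of placing all humans before all bots.

```python
import pandas as pd

# Hypothetical labelled bidders: row 1 is a human outlier, row 2 a heavy bot.
toy = pd.DataFrame({
    'outcome': [0.0, 0.0, 1.0, 1.0],
    'no_of_bids_made': [10, 60000, 70000, 5],
})

# Keep every bot, and keep humans only when they are below the threshold.
keep = (toy['outcome'] == 1) | (toy['no_of_bids_made'] < 50000)
train2_sketch = toy[keep]
print(train2_sketch)
```

Only the human with 60000 bids is dropped; the bot with 70000 bids stays, since a high bid count is exactly the signal the model should learn from bots.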

This separates the training data into the feature columns and the outcome so the model can be fitted.

In [44]:
xtrain=train2[['no_of_bids_made','mean_time_diff','id_or_in']]
ytrain=train2['outcome']

This slices the testing data down to the same feature columns so the model can make predictions on it.

In [45]:
xtest=testing_data[['no_of_bids_made','mean_time_diff','id_or_in']]

This fits the model. I experimented with the n_estimators parameter (i.e. the number of trees in the forest), and the best results came from this value.

In [46]:
rf = RandomForestRegressor(n_estimators = 3000, random_state = 42)
rf.fit(xtrain,ytrain)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=3000, n_jobs=1,
           oob_score=False, random_state=42, verbose=0, warm_start=False)
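
Since the true test outcomes are hidden, a local estimate can help when tuning values such as n_estimators. The sketch below uses out-of-fold predictions and an area-under-ROC score. Both the metric and the synthetic data are assumptions made for illustration: the notebook does not state which metric Kaggle uses, and the small forest here is chosen only to keep the example quick.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for xtrain/ytrain (hypothetical data, not the Kaggle set).
rng = np.random.RandomState(0)
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.column_stack([
    rng.normal(loc=y_demo * 2.0, scale=1.0),  # feature that separates the classes
    rng.normal(size=100),                     # pure noise feature
])

model = RandomForestRegressor(n_estimators=50, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Out-of-fold predictions: each row is scored by a model that never saw it.
oof = cross_val_predict(model, X_demo, y_demo, cv=cv)
local_auc = roc_auc_score(y_demo, oof)
print(round(local_auc, 3))
```

The stratified folds keep the share of bots roughly equal in every fold, which matters when bots are a small minority of the bidders.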

This is to make predictions using the model and the test data.

In [47]:
rf_predictions=rf.predict(xtest)

On submitting these predictions to Kaggle, the model gets a score of 0.82714.
Note: Submitting a csv file with only 0.0 outcomes gave a score of 0.50, so my model is able to detect a good portion of the bots, and the score (0.82714) is not due to simply predicting human (i.e. 0) for everyone.
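
A score of 0.50 for a constant submission is what an area-under-ROC metric would give, although the notebook does not say which metric Kaggle applies. Assuming it is AUC, the following sketch with made-up labels reproduces that baseline.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 0, 1])   # hypothetical true outcomes
all_zero = np.zeros(len(y_true))        # "everyone is human" submission

# With every prediction tied, the ROC curve is the diagonal from (0,0) to (1,1).
baseline = roc_auc_score(y_true, all_zero)
print(baseline)
```

Under that assumption, 0.50 corresponds to predictions with no ranking ability at all, so 0.82714 reflects bots being ranked above humans rather than the class imbalance.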