# Data Preparation & Machine Learning for eBay Shill Bidding data
<b>by Victor Ferreira Silva<br>January 2023</b>
* [Introduction](#Introduction)
* [Data preparation](#DataPreparation)
    * [Data characterisation](#DataCharact)    
    * Exploratory Data Analysis
    * Data cleaning
    * Feature engineering
    * Data scaling
* Dimensionality reduction
    * Principal Component Analysis (PCA)
    * Linear Discriminant Analysis (LDA)
* Machine Learning
    * Clustering algorithms
    * Classification algorithms
* Conclusion
* References

[SBD Dataset Web Page](https://archive.ics.uci.edu/ml/datasets/Shill+Bidding+Dataset)

## <a id="Introduction"></a>Introduction ##
The ability to predict normal and abnormal bidding behavior of eBay users can help companies identify scams and other undesirable users on the platform. The Shill Bidding Dataset (SBD) consists of eBay auction records described by features such as auction duration, bidder tendency and class. The goal of this report is to apply supervised and unsupervised machine learning techniques to the dataset after properly preparing and characterizing it. To improve the results, scaling and feature reduction methods were used, and the performance and accuracy of the applied machine learning methods were compared. At the end of the report, the supervised and unsupervised methods that performed best on this dataset were identified.

## Data preparation<a id="DataPreparation"></a>

### Data characterisation<a id="DataCharact"></a> 
Data characterization summarizes the features of a dataset before any modelling takes place. It typically relies on descriptive statistics to introduce the data to the reader and on plots such as bar charts and scatter plots to visualize it.
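As a rough illustration of those two ingredients, the sketch below computes a statistical summary of one numeric feature and a frequency table of the label. It assumes a small hand-built frame (`sample`, a hypothetical name) whose values are copied from the first ten rows of the dataset shown further down, not the full CSV.

```python
import pandas as pd

# Hypothetical mini-frame: two columns copied from the first ten rows of the dataset.
sample = pd.DataFrame({
    "Auction_Duration": [5, 5, 5, 5, 7, 7, 7, 7, 7, 7],
    "Class": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})

# Descriptive statistics (count, mean, std, quartiles) for a numeric feature.
duration_summary = sample["Auction_Duration"].describe()

# Frequency table of the label: these are the counts a bar chart would plot.
class_counts = sample["Class"].value_counts().sort_index()

print(duration_summary)
print(class_counts)
```

With matplotlib installed, `class_counts.plot.bar()` would turn the frequency table into the corresponding bar chart.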

In [5]:
# Imports and configurations
import warnings
warnings.filterwarnings('ignore')

import pandas as pd

import pandas_profiling
from pandas_profiling import ProfileReport

In [14]:
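# Load the original dataset.
# Without the dtype argument, pandas would parse Record_ID and Auction_ID as integers;
# they are identifiers, so read them as strings (object dtype) instead.
df = pd.read_csv('Shill Bidding Dataset.csv', dtype={'Record_ID': 'object', 'Auction_ID': 'object'})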

In [15]:
df.head(10)

| | Record_ID | Auction_ID | Bidder_ID | Bidder_Tendency | Bidding_Ratio | Successive_Outbidding | Last_Bidding | Auction_Bids | Starting_Price_Average | Early_Bidding | Winning_Ratio | Auction_Duration | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 732 | `_***i` | 0.2 | 0.4 | 0.0 | 2.8e-05 | 0.0 | 0.993593 | 2.8e-05 | 0.666667 | 5 | 0 |
| 1 | 2 | 732 | `g***r` | 0.02439 | 0.2 | 0.0 | 0.013123 | 0.0 | 0.993593 | 0.013123 | 0.944444 | 5 | 0 |
| 2 | 3 | 732 | `t***p` | 0.142857 | 0.2 | 0.0 | 0.003042 | 0.0 | 0.993593 | 0.003042 | 1.0 | 5 | 0 |
| 3 | 4 | 732 | `7***n` | 0.1 | 0.2 | 0.0 | 0.097477 | 0.0 | 0.993593 | 0.097477 | 1.0 | 5 | 0 |
| 4 | 5 | 900 | `z***z` | 0.051282 | 0.222222 | 0.0 | 0.001318 | 0.0 | 0.0 | 0.001242 | 0.5 | 7 | 0 |
| 5 | 8 | 900 | `i***e` | 0.038462 | 0.111111 | 0.0 | 0.016844 | 0.0 | 0.0 | 0.016844 | 0.8 | 7 | 0 |
| 6 | 10 | 900 | `m***p` | 0.4 | 0.222222 | 0.0 | 0.006781 | 0.0 | 0.0 | 0.006774 | 0.75 | 7 | 0 |
| 7 | 12 | 900 | `k***a` | 0.137931 | 0.444444 | 1.0 | 0.768044 | 0.0 | 0.0 | 0.016311 | 1.0 | 7 | 1 |
| 8 | 13 | 2370 | `g***r` | 0.121951 | 0.185185 | 1.0 | 0.035021 | 0.333333 | 0.993528 | 0.023963 | 0.944444 | 7 | 1 |
| 9 | 27 | 600 | `e***t` | 0.155172 | 0.346154 | 0.5 | 0.570994 | 0.307692 | 0.993593 | 0.413788 | 0.611111 | 7 | 1 |


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6321 entries, 0 to 6320
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Record_ID               6321 non-null   object 
 1   Auction_ID              6321 non-null   object 
 2   Bidder_ID               6321 non-null   object 
 3   Bidder_Tendency         6321 non-null   float64
 4   Bidding_Ratio           6321 non-null   float64
 5   Successive_Outbidding   6321 non-null   float64
 6   Last_Bidding            6321 non-null   float64
 7   Auction_Bids            6321 non-null   float64
 8   Starting_Price_Average  6321 non-null   float64
 9   Early_Bidding           6321 non-null   float64
 10  Winning_Ratio           6321 non-null   float64
 11  Auction_Duration        6321 non-null   int64  
 12  Class                   6321 non-null   int64  
dtypes: float64(8), int64(2), object(3)
memory usage: 642.1+ KB


In [17]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Bidder_Tendency | 6321.0 | 0.142541 | 0.197084 | 0.0 | 0.027027 | 0.0625 | 0.166667 | 1.0 |
| Bidding_Ratio | 6321.0 | 0.12767 | 0.13153 | 0.011765 | 0.043478 | 0.083333 | 0.166667 | 1.0 |
| Successive_Outbidding | 6321.0 | 0.103781 | 0.279698 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Last_Bidding | 6321.0 | 0.463119 | 0.380097 | 0.0 | 0.047928 | 0.440937 | 0.860363 | 0.9999 |
| Auction_Bids | 6321.0 | 0.231606 | 0.255252 | 0.0 | 0.0 | 0.142857 | 0.454545 | 0.788235 |
| Starting_Price_Average | 6321.0 | 0.472821 | 0.489912 | 0.0 | 0.0 | 0.0 | 0.993593 | 0.999935 |
| Early_Bidding | 6321.0 | 0.430683 | 0.380785 | 0.0 | 0.02662 | 0.360104 | 0.826761 | 0.9999 |
| Winning_Ratio | 6321.0 | 0.367731 | 0.436573 | 0.0 | 0.0 | 0.0 | 0.851852 | 1.0 |
| Auction_Duration | 6321.0 | 4.615093 | 2.466629 | 1.0 | 3.0 | 5.0 | 7.0 | 10.0 |
| Class | 6321.0 | 0.106787 | 0.308867 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


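On initial examination, the SBD dataset contains 6321 observations and 13 columns, none of which has missing values. The first three columns identify the record, the auction, and the bidder. By default, pandas parses every column except `Bidder_ID` as numeric, which is why `Record_ID` and `Auction_ID` were read as `object` when the file was loaded; the `info()` output accordingly reports three `object` columns, eight `float64` columns, and two `int64` columns. `Class` is still stored as `int64`, but it is a label rather than a quantity and should likewise be treated as categorical data.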
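A minimal sketch of that type handling follows. It assumes a hand-built frame (`sample`, a hypothetical name) that copies a few columns from the first rows shown above instead of reading the CSV. Casting `Class` to the pandas `category` dtype is one option among several; whether `category` or `object` suits the later modelling steps better is left open here.

```python
import pandas as pd

# Hypothetical mini-frame: a few columns from the first rows of df.head(10).
sample = pd.DataFrame({
    "Record_ID": [1, 2, 3, 4, 5, 8, 10, 12],
    "Auction_ID": [732, 732, 732, 732, 900, 900, 900, 900],
    "Bidder_ID": ["_***i", "g***r", "t***p", "7***n", "z***z", "i***e", "m***p", "k***a"],
    "Bidding_Ratio": [0.4, 0.2, 0.2, 0.2, 0.222222, 0.111111, 0.222222, 0.444444],
    "Class": [0, 0, 0, 0, 0, 0, 0, 1],
})

# Identifiers carry no numeric meaning, so store them as strings.
id_cols = ["Record_ID", "Auction_ID", "Bidder_ID"]
sample = sample.astype({col: str for col in id_cols})

# Class is a 0/1 label, not a quantity: mark it as categorical.
sample["Class"] = sample["Class"].astype("category")

# Only genuinely numeric features are left for summaries such as describe().
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

print(sample.dtypes)
print(numeric_cols)
```

After the casts, `describe()` on `sample` would no longer report a mean or standard deviation for the IDs or the label, which are not meaningful for those columns.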

The `describe()` method adds general descriptive statistics for the numeric columns. The data appears to have been pre-processed already: auction duration ranges from 1 to 10, while every other numerical feature lies between 0 and 1.
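The range observation can be checked programmatically rather than read off the table. The sketch below assumes a hand-built frame (`sample`, a hypothetical name) with three numeric columns copied from the first five rows shown earlier; on the full dataset the same lines would run against the numeric columns of `df`.

```python
import pandas as pd

# Hypothetical mini-frame: three numeric columns from the first five rows of the dataset.
sample = pd.DataFrame({
    "Bidder_Tendency": [0.2, 0.02439, 0.142857, 0.1, 0.051282],
    "Winning_Ratio": [0.666667, 0.944444, 1.0, 1.0, 0.5],
    "Auction_Duration": [5, 5, 5, 5, 7],
})

# Per-column minimum and maximum, one row per feature.
ranges = sample.agg(["min", "max"]).T

# Flag the features whose values all lie inside [0, 1].
ranges["in_unit_interval"] = (ranges["min"] >= 0) & (ranges["max"] <= 1)

print(ranges)
```

Features flagged `False`, such as `Auction_Duration` here, are the ones that sit on a different scale from the rest.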