# eBay online auction data analysis
The datasets contain eBay auction information on *Cartier wristwatches, Palm Pilot M515 PDAs, Xbox game consoles, and Swarovski beads*.
### Goal for the analysis:
Find out how the number of bids relates to auction type and open bid, and whether that relationship differs by item type.


In [None]:
%matplotlib inline 

import pandas as pd             # data package
import matplotlib.pyplot as plt # graphics 
import seaborn as sns           # graphics 
import datetime as dt           # date tools, used to note current date  
import numpy as np
from scipy import stats

In [None]:
#read csv file 
auction=pd.read_csv("../input/auction.csv")
auction.head()

In [None]:
type(auction['item'])

In [None]:
auction.shape

In [None]:
auction.dtypes

## The data set includes 10681 observations and 9 variables
* auctionid: unique identifier of an auction
* bid: the proxy bid placed by a bidder
* bidtime: the time in days that the bid was placed, from the start of the auction
* bidder: eBay username of the bidder
* bidderrate: eBay feedback rating of the bidder
* openbid: the opening bid set by the seller
* price: the closing price that the item sold for (equivalent to the second highest bid + an increment)
* item: auction item
* auction_type: length of the auction (3 day, 5 day, or 7 day auction)

# Part 1 Understand eBay auction policy
It's important to understand how eBay auction policy works before conducting further analysis.
I found [this explanation](http://www.ebay.com/gds/Take-Advantage-of-Bid-Increments-/10000000002792188/g.html) easy to understand. 

The website helps to answer the following questions:

**What's the relationship between bids and bidtime?**
People have to bid higher than the current price, which is the second highest bid + increment. If someone places a bid much higher than the current price, eBay uses only part of that bid (second highest bid + increment) as the new current price. A bid is accepted as long as it is higher than the current price, even if it is below the hidden maximum of the leading bidder. Therefore, as bidtime increases, we sometimes see bids that are lower than previous bids.
People are notified when they are outbid, so they may raise their maximum bids to remain the highest bidder.

**What's the relationship between bids and closing price?**
Closing price = second highest bid + increment, or the highest bid itself when second highest bid + increment would exceed the highest bid. For example, if the third highest bid is 8 and the second highest bid is 10, you can bid 10.01, and if no one bids higher you get the item for 10.01 instead of 10 + increment (which is 10.5).

The person who placed the highest bid gets the item, but in most cases how much he/she pays depends on the second highest bid.
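
The pricing rule described above can be sketched in a few lines. This is a minimal illustration, not eBay's actual implementation: the function name `closing_price_from_bids` is hypothetical, and the increment is passed in as a parameter (0.5 here, matching the example above) rather than looked up from eBay's increment table.

```python
def closing_price_from_bids(max_bids, increment):
    """Winner pays second highest bid + increment, capped at the winner's own bid.

    Hypothetical helper for illustration; `increment` is assumed to be known.
    """
    if len(max_bids) < 2:
        raise ValueError("need at least two bids to apply the rule")
    ordered = sorted(max_bids, reverse=True)
    highest, second = ordered[0], ordered[1]
    return min(highest, second + increment)


# Example from the text: bids of 8, 10 and 10.01 with an increment of 0.5
price_close_call = closing_price_from_bids([8, 10, 10.01], increment=0.5)
# A much higher top bid is only partly used
price_high_bid = closing_price_from_bids([8, 10, 25], increment=0.5)
print(price_close_call, price_high_bid)
```

In the first call the winner's own bid is the cap; in the second, only the second highest bid plus the increment is charged.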

In [None]:
auction.head()

In [None]:
auction4=auction[['auctionid','bid','price','auction_type']]

In [None]:
auction4.head()

In [None]:
closing_price = auction4.groupby(['auctionid']).last()

In [None]:
closing_price.head()

In [None]:
closing_price['check']=np.where((closing_price['bid'] == closing_price['price']), closing_price['bid'], np.nan)

In [None]:
closing_price.isnull().sum()

In [None]:
closing_price.shape

In [None]:
1-146/628

For 76.8% of auctions (482 of 628), the last bid recorded before the auction ended equals the closing price. However, the person who places the last bid is not necessarily the one who gets the item.
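
The share can also be computed directly as the mean of a boolean comparison, which avoids reading counts off two separate outputs. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset) with the same `bid` and `price` columns as `closing_price`.

```python
import pandas as pd

# Toy stand-in for closing_price: one row per auction (illustrative values only)
toy = pd.DataFrame(
    {"bid": [100.0, 52.5, 30.0, 75.0], "price": [100.0, 55.0, 30.0, 75.0]},
    index=pd.Index([1, 2, 3, 4], name="auctionid"),
)

# True where the last recorded bid equals the closing price
last_bid_is_price = toy["bid"] == toy["price"]

# The mean of a boolean Series is the share of True values
share = last_bid_is_price.mean()
print(f"{share:.1%} of auctions close at the last recorded bid")
```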

# Part 2 Data Exploration

## 2.1 Distribution of the length of bidtime for all types of auctions
How many days does an auction usually take?

In [None]:
auction1 = auction[['auctionid','bidtime']]

In [None]:
auction1.head()

In [None]:
max_bidtime = auction1.groupby(['auctionid']).max()

In [None]:
max_bidtime.head()

In [None]:
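# with density=True, the sum of bin height * bin width over all bins equals 1
# (density replaces the normed argument, which newer matplotlib versions no longer accept)
fig, ax = plt.subplots(figsize = (18, 16))
ax.hist(max_bidtime['bidtime'], bins=25, density=True)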
sns.kdeplot(max_bidtime['bidtime'], ax=ax, lw = 3)
ax.set_title("Histogram of bidtime")
plt.show()

<font color='green'>**From the kernel density estimate above, auctions most often last 7 days, followed by 3 days and 5 days. As shown below, most auctions are 7 day auctions, which matches the most likely auction length in the plot.**</font>
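
The histogram above is normalised so that it can share an axis with the kernel density curve: bar heights are rescaled until the bar areas sum to 1. A minimal check of that property with NumPy, using made-up auction lengths rather than the dataset:

```python
import numpy as np

# Illustrative auction lengths in days (made-up values, not the dataset)
lengths = np.array([2.9, 3.0, 4.8, 5.0, 6.5, 6.9, 7.0, 7.0, 7.0, 7.0])

# density=True rescales counts so the histogram integrates to 1
heights, edges = np.histogram(lengths, bins=5, density=True)
widths = np.diff(edges)

# Sum of bar areas (height * width)
total_area = float(np.sum(heights * widths))
print(total_area)
```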

## 2.2 Analysis of bidtime for different auction types

In [None]:
auction.head()

### Number of unique auctions for different auction types

In [None]:
num_type=auction.groupby('auction_type')['auctionid'].nunique()
num_type=num_type.to_frame()
print(num_type)

A pie chart showing the percentage composition of different auction types. From the plot below, we can see that <font color='green'>**61% of the auctions are 7 day auctions**</font>. The code is adapted from https://chrisalbon.com/python/matplotlib_pie_chart.html

In [None]:
# Create a list of colors (from iWantHue)
colors=['#5ABA10', '#FE110E','#CA5C05']
# Create a pie chart
plt.pie(
    # using the number of unique auctions per auction type
    num_type['auctionid'],
    # with the labels being the auction types
    labels=num_type.index,
    # with no shadows
    shadow=False,
    # with colors
    colors=colors,
    # with the percent shown to one decimal place
    autopct='%1.1f%%',
    )

# Use equal axis scaling so the pie is drawn as a circle
plt.axis('equal')

# View the plot
plt.tight_layout()
plt.show()

### Max bidtime for each unique auction

I also want to compare the distribution of max bidtime across the three auction types.
To do this, I take the max bidtime for each auctionid calculated above and merge it with auction_type, which gives the auction_type and max bidtime for each unique auction.

In [None]:
max_bidtime=auction.groupby('auctionid',as_index=False)['bidtime'].max()

In [None]:
max_bidtime.head()

In [None]:
auction2 = auction[['auctionid','auction_type']]

In [None]:
auction_type=auction2.groupby('auctionid',as_index=False)['auction_type'].first()

In [None]:
auction_type.head()

In [None]:
auction3 = pd.merge(max_bidtime, auction_type, how='inner', on=['auctionid'])

In [None]:
auction3.head()
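
The separate groupby and merge steps above can also be collapsed into a single groupby with named aggregation. The sketch below shows the idea on a small made-up frame that only borrows the column names of `auction`; the values are illustrative.

```python
import pandas as pd

# Toy bid-level data with the same column names as `auction` (illustrative values)
toy = pd.DataFrame({
    "auctionid": [1, 1, 1, 2, 2, 3],
    "bidtime": [0.5, 2.1, 6.9, 1.0, 2.8, 4.7],
    "auction_type": ["7 day auction"] * 3 + ["3 day auction"] * 2 + ["5 day auction"],
})

# One pass: max bidtime and the (constant) auction type per auction
per_auction = toy.groupby("auctionid", as_index=False).agg(
    bidtime=("bidtime", "max"),
    auction_type=("auction_type", "first"),
)
print(per_auction)
```

This relies on `auction_type` being constant within an auction, which is the same assumption the `first()` call above makes.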

In [None]:
seven_day_auction = auction3.query('auction_type == "7 day auction"') 
five_day_auction= auction3.query('auction_type == "5 day auction"') 
three_day_auction= auction3.query('auction_type == "3 day auction"') 

In [None]:
print(three_day_auction.shape,five_day_auction.shape,seven_day_auction.shape)

In [None]:
fig, ax = plt.subplots(figsize = (8, 6))
ax.hist(seven_day_auction['bidtime'], bins=25, density=True,label='seven_day')
ax.hist(three_day_auction['bidtime'], bins=25, density=True,label='three_day')
ax.hist(five_day_auction['bidtime'], bins=25, density=True,label='five_day')
ax.set_title("Histogram of max bidtime for different auction types")
plt.legend()
plt.show()

From the plot above, we can see that <font color ='green'>**the final bid time for most auctions is the same as the maximum number of days they are available**</font>.

## 2.3 Analysis of open bid 
* Do different items differ significantly in average open bid?
* Do different auction types differ significantly in average open bid?

In [None]:
auction.head()

In [None]:
open_bid = auction.groupby(['auctionid']).last()
open_bid=open_bid[['openbid','item','auction_type']]
open_bid.head()

### Boxplot

In [None]:
#create a boxplot first to compare the distribution of open bid across items
sns.set_style("whitegrid")
ax = sns.boxplot(x="item", y="openbid", data=open_bid)

In [None]:
df=open_bid[open_bid['item']=='Cartier wristwatch']['openbid']
df.max()

We can see that there's a Cartier wristwatch with an open bid of 5,000, which is very likely to bias my analysis of open bid. Therefore, I'm going to exclude that observation. <font color='red'>**I'm not quite sure if I should do this, though. I would appreciate your suggestions.**</font>

In [None]:
#drop the row that has openbid=5000.0
value_list = [5000.0]
open_bid1=open_bid[~open_bid.openbid.isin(value_list)]
open_bid1.head()
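
One way to gauge how much a single extreme observation matters is to compare the mean and the median with and without it. The sketch below does this on made-up open bids (not the actual Cartier data); the 5,000 cutoff simply mirrors the value dropped above.

```python
import pandas as pd

# Made-up open bids with one extreme value (not the actual Cartier data)
open_bids = pd.Series([1.0, 9.99, 50.0, 100.0, 250.0, 500.0, 5000.0])

# Summary statistics with and without the extreme value
with_outlier = open_bids.agg(["mean", "median"])
without_outlier = open_bids[open_bids < 5000.0].agg(["mean", "median"])

comparison = pd.DataFrame({"with_5000": with_outlier, "without_5000": without_outlier})
print(comparison)
```

If the median barely moves while the mean shifts a lot, a median-based comparison is fairly insensitive to whether the observation is kept or dropped.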

### Distribution plot

Introduction of the sns.distplot function: 'sns.distplot combines the matplotlib ``hist`` function (with automatic calculation of a good default bin size) with the seaborn :func:`kdeplot` and :func:`rugplot` functions.'

In [None]:
fig, ax = plt.subplots(figsize = (14, 13))
# build each mask from open_bid1 itself so the mask and the frame share the same index
sns.distplot(open_bid1[open_bid1['item']=='Cartier wristwatch']['openbid'], kde=True,label="Cartier wristwatch")
sns.distplot(open_bid1[open_bid1['item']=='Palm Pilot M515 PDA']['openbid'], kde=True,label="Palm Pilot M515 PDA")
sns.distplot(open_bid1[open_bid1['item']=='Xbox game console']['openbid'], kde=True,label='Xbox game console')
plt.legend()

In [None]:
fig, ax = plt.subplots(figsize = (8, 4.5))
ax = sns.boxplot(x="item", y="openbid", data=open_bid1)

In [None]:
fig, ax = plt.subplots(figsize = (8, 4.5))
ax = sns.boxplot(x="item", y="openbid", data=open_bid1)
ax.set(ylim=(-20, 1000))

Since the distribution of open bid for Cartier wristwatch is skewed to the right, the median is a better basis than the mean for comparing open bid across item types.

In [None]:
open_bid1.groupby('item')['openbid'].median()

In [None]:
open_bid1.groupby('item')['openbid'].mean()

In [None]:
open_bid1.groupby('item')['openbid'].var()

Next, I'm going to use a one-way ANOVA test to see whether different items/auction types differ significantly in open bid.

### One way ANOVA

In [None]:
# compute one-way ANOVA P value   
from scipy import stats  
watch=open_bid1[open_bid1['item']=='Cartier wristwatch']['openbid']
PDA=open_bid1[open_bid1['item']=='Palm Pilot M515 PDA']['openbid']
console=open_bid1[open_bid1['item']=='Xbox game console']['openbid']

f_val, p_val = stats.f_oneway(watch, PDA, console)  
  
print("One-way ANOVA P =", p_val)

With a p-value less than 0.05, we can conclude that <font color='green'>**different types of items are statistically different in average open bid**</font>.

From the boxplot and analysis above, we can see that <font color='green'>**Cartier wristwatch has a much higher median open bid than Palm Pilot and Xbox game console. The distribution of open bid for Cartier wristwatch is more dispersed as well.**</font>

In [None]:
seven_days=open_bid1[open_bid1['auction_type']=='7 day auction']['openbid']
five_days=open_bid1[open_bid1['auction_type']=='5 day auction']['openbid']
three_days=open_bid1[open_bid1['auction_type']=='3 day auction']['openbid']

f_val, p_val = stats.f_oneway(seven_days, five_days, three_days)  
  
print("One-way ANOVA P =", p_val)

With a p-value greater than 0.05, we fail to reject the null hypothesis that the different auction types have the same average open bid. <font color='green'>**Different auction types are not statistically different in average open bid.**</font>
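
One-way ANOVA compares group means and works best when each group is roughly normal, while the open bids here are right-skewed. As a possible cross-check, which is not part of the original analysis, a rank-based Kruskal-Wallis test can be run on the same groups. The sketch below uses made-up values for three groups; in the notebook the inputs would be the same Series passed to `stats.f_oneway`.

```python
from scipy import stats

# Made-up open bids per group (illustrative values, not the dataset)
group_a = [100, 250, 500, 900, 1500]
group_b = [1, 5, 10, 20, 50]
group_c = [1, 2, 9, 15, 30]

# Kruskal-Wallis H-test: compares groups using ranks instead of raw values
h_stat, kw_p_val = stats.kruskal(group_a, group_b, group_c)
print("Kruskal-Wallis H =", h_stat, "P =", kw_p_val)
```

If both tests point the same way, the conclusion does not hinge on the normality assumption.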

## 2.4 Analysis of ending price 

In [None]:
pct_change=auction[['auctionid','openbid','price','item','auction_type']].groupby(['auctionid']).last()
pct_change.head()

In [None]:
pct_change.groupby('item')['price'].median()

In [None]:
pct_change.groupby('item')['price'].mean()

In [None]:
pct_change.groupby('item')['price'].var()

# Part 3 Relation between variables
From the sellers' perspective, to attract more bids and achieve a larger percentage increase in price, they want to know
* how long should they set their auction 
* how they should set their open bid


## 3.1 Is there any relationship between open bid and number of bids?
* Does a lower opening bid set by the seller attract more bids on the auction? 

### Correlation coefficient 

Is a lower open bid correlated with more people placing a bid on the item?

In [None]:
auction.head()

num_bids refers to the number of bids an auction receives during its auction period.

In [None]:
num_bids=auction.groupby('auctionid')['bidder'].count()
num_bids=num_bids.to_frame()
num_bids.columns=['num_bids']
num_bids.head()

In [None]:
#open_bid1 is a dataframe that drops the extreme openbid value of 5,000
open_bid1.head()

In [None]:
num_bids=pd.merge(num_bids, open_bid1, left_index=True, right_index=True)

In [None]:
num_bids.tail()

In [None]:
# spearmanr must be imported before it is passed to stat_func
from scipy.stats import spearmanr
sns.jointplot(x='num_bids', y="openbid", data=num_bids,stat_func=spearmanr)

In [None]:
watch=num_bids[num_bids['item']=='Cartier wristwatch']
xbox=num_bids[num_bids['item']=='Xbox game console']
PDA=num_bids[num_bids['item']=='Palm Pilot M515 PDA'] 

The scatter plot shows that the relation between num_bids and open bid is not linear for the different types of items, so Spearman's correlation is more appropriate for measuring increasing or decreasing trends.
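
To illustrate why a rank-based coefficient suits a monotone but non-linear relation, the sketch below compares Spearman and Pearson correlation on made-up data in which open bid falls steeply as the number of bids rises. The variable names and values are illustrative only.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Made-up monotone but non-linear relation (not the dataset)
num_bids_toy = np.arange(1, 11)
openbid_toy = 100.0 / num_bids_toy

# Spearman works on ranks, Pearson on the raw values
rho, rho_p = spearmanr(num_bids_toy, openbid_toy)
r, r_p = pearsonr(num_bids_toy, openbid_toy)

print("Spearman rho:", rho)
print("Pearson r:", r)
```

Spearman captures the perfectly decreasing trend, while Pearson understates it because the points do not lie on a straight line. The same `spearmanr` call can be applied to `num_bids['num_bids']` and `num_bids['openbid']` to obtain the coefficient without relying on the plot annotation.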

In [None]:
g = sns.jointplot(x='num_bids', y="openbid", data=watch,stat_func=spearmanr)

In [None]:
from scipy.stats import spearmanr
g = sns.jointplot(x='num_bids', y="openbid", data=xbox,stat_func=spearmanr)

In [None]:
g = sns.jointplot(x='num_bids', y="openbid", data=PDA,stat_func=spearmanr)

In general, the number of bids is strongly negatively correlated with the open bid set by the seller, with a Spearman correlation coefficient of -0.71. That is to say, <font color='green'>**an auction with a high open bid tends to attract fewer bids**.</font>

In addition, the Spearman correlation is weakest for the watch and strongest for the PDA, which may be a result of outliers or the larger variation in watch open bids.

In [None]:
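# restrict to the numeric columns; corr() uses Pearson correlation by default
num_bids.groupby('item')[['num_bids','openbid']].corr()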

<font color='green'>**The correlation between number of bids and open bid is strongest for Palm Pilot and weakest for Cartier wristwatch.**</font>

In [None]:
num_bids.groupby('auction_type')[['num_bids','openbid']].corr(method='spearman')

<font color='green'>**The differences in correlation between open bid and number of bids are smaller among different auction types.**</font>

In [None]:
num_bids.groupby(['auction_type','item'])[['num_bids','openbid']].corr(method='spearman')

## 3.2 What's the relationship between auction types and number of bids?

As shown before
* Auctions with lower open bid tend to have a larger number of bids
* Different auction types are not statistically different in open bid 

I would like to see whether the choice of auction type per se affects the number of bids.

In [None]:
num_bids.head()

In [None]:
#number of auctions for each auction type 
num_bids.groupby('auction_type')['num_bids'].count()

In [None]:
#total number of bids for each auction type 
num_bids.groupby('auction_type')['num_bids'].sum()

In [None]:
#average number of bids for each type of auction
num_bids.groupby('auction_type')['num_bids'].mean()

<font color='green'>**We can see that longer auctions have more bids on average.**</font>

Is a longer auction time also associated with more bids within each type of item?

In [None]:
ave_bids=num_bids.groupby(['auction_type','item'])['num_bids'].mean()

In [None]:
ave_bids=ave_bids.to_frame()

In [None]:
ave_bids

In [None]:
ave_bids1=ave_bids.reset_index(level=['auction_type','item'])
ave_bids1
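
The same averages can also be laid out as a wide table, with one row per auction type and one column per item, which makes the comparison across auction lengths easier to read. A minimal sketch with `pivot_table` on made-up data that only borrows the column names of `num_bids`:

```python
import pandas as pd

# Toy per-auction data with the same column names as `num_bids` (illustrative values)
toy = pd.DataFrame({
    "auction_type": ["3 day auction", "3 day auction", "7 day auction", "7 day auction"],
    "item": ["Xbox game console", "Cartier wristwatch"] * 2,
    "num_bids": [10, 12, 20, 11],
})

# Rows: auction type, columns: item, cells: average number of bids
wide = toy.pivot_table(index="auction_type", columns="item", values="num_bids", aggfunc="mean")
print(wide)
```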

The code is adapted from http://bokeh.pydata.org/en/0.10.0/docs/user_guide/charts.html

In [None]:
# bokeh.charts is only available in older Bokeh releases such as the 0.10.0 version linked above
from bokeh.charts import Bar, output_file, show

In [None]:
#plot a bar chart for grouped data
p=Bar(ave_bids1, label="auction_type", values="num_bids", group="item", legend="top_left", ylabel='average_bids')
show(p)

![image](bokeh_plot.png)

In [None]:
ave_bids1.groupby(['item'])['num_bids'].var()

<font color='green'>**For Palm Pilot and Xbox console, a longer auction time is associated with more bids, while for Cartier wristwatch, a longer auction is not necessarily associated with more bids.**</font>

# Part 4 Conclusion 

Based on the analysis above, we can conclude that in general, a lower open bid and a longer auction time can attract more bids on an auction.

However, a closer look at the individual item types shows that the relation between open bid, auction time and number of bids is not that strong (or does not even hold) for items like Cartier wristwatch.

**Possible explanation**

One possible explanation is the characteristics of the different types of items. For items like Cartier wristwatch, whose open bid and closing price are both widely dispersed, the ones sold on eBay are more likely to be rare and exclusive. Therefore, even with a high open bid and a limited auction time, such an item may still receive quite a few bids.

Items like Palm Pilot and Xbox game consoles are (I assume) much less likely to be limited editions. The data come from a book published in 2010 (http://www.modelingonlineauctions.com/home), when Palm Pilot and Xbox game consoles were very unlikely to be rare products in the market. The market for these products is therefore more competitive, and people prefer to bid on listings with a lower open bid.

**Tips for sellers on eBay**

 * If you are selling a product that is hard to find on the market and has great collection value, you don't need to worry much that a high open bid and a limited auction time will scare potential customers away.

 * If you are selling a product that has quite a few substitutes in the market, set a low open bid and make it a 7 day auction. You will attract more bidders by doing this. Since the market is competitive, I believe you won't end up with an unsatisfactory price.