# Building a prediction model for the final selling price of the auctions

IMPORTANT REMARK:

This code shall be executed from start to finish in the defined order. Errors may occur if the cells are executed in a different order.

In [1]:
import pandas as pd
import numpy as np

import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
%matplotlib inline

from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import RANSACRegressor
    
from sklearn.metrics import median_absolute_error
from sklearn.metrics import mean_absolute_error

from sklearn.model_selection import KFold

from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

from sklearn import preprocessing

In [2]:
mirar outliers

SyntaxError: invalid syntax (<ipython-input-2-978e3eff27d9>, line 1)

In this section, different prediction models to predict the final selling price of an auction before it starts have been built. The predictions are made with the intention that having an estimation of the final selling price of an auction could be very helpful for auction business owners (in this case Swoopo).

Since the final selling price of an auction item is obtained by multiplying the bid price increment for that auction and the number of bids placed, and the bid price increment for an auction is known data, we are also indirectly predicting the number of bids placed for the auction. Swoopo's gain is obtained by multiplying the number of bids placed (predicted) and the bid fee (known data), and summing the result to the final price of the auction (predicted). Therefore, Swoopo's gain is also being indirectly predicted. Swoopo's profit would be calculated as the difference between the gain (predicted) and the retail price of the item (known data).

In [19]:
outcomesDf = pd.read_csv('./outcomes_clean.tsv',sep='\t')
outcomesDf.head(1)

Unnamed: 0,auction_id,product_id,item,desc,retail,price,finalprice,bidincrement,bidfee,winner,...,freebids,endtime_str,flg_click_only,flg_beginnerauction,flg_fixedprice,flg_endprice,bids_placed,swoopo_sale_price,swoopo_profit,winner_benefit
0,86827,10009602,sony-ericsson-s500i-unlocked-mysterious-,Sony Ericsson S500i Unlocked Mysterious Green,499.99,13.35,13.35,0.15,0.75,Racer11,...,0,2008-09-16 19:52:00,0,0,0,0,89.0,77.060489,-422.929511,467.14


The columns that are given as input for the prediction model are: "retail" (the retail price of the item), "bidincrement" (the bid increment), "bidfee" (the bid fee), and the flags that indicate whether the auction is a click-only auction, a beginner auction, a fixed-price auction or and end-price auction. 

The retail price of the item will clearly influence the final price that the auction reaches (in general, users will place more bids for more expensive items). The bid increment directly influences the final price that the auction reaches, because the last one is calculated by multiplying the bid increment and the number of bids placed in the auction. The bid fee will also influence the final price of the item, since users will be willing to place more bids if the bid fee is low, and placing more bids implies that the final price reached increases. The flags indicating the type of auction might have an influence for some users on whether they are willing to place more or less bids for the auction, so that is why they have been included.

In [4]:
x_column_names = ["retail","bidincrement","bidfee","flg_click_only","flg_beginnerauction","flg_fixedprice","flg_endprice"]

The "auction_id" (auction id) column has not been included as an input variable for the model because it is unique for each auction, and therefore, it does not provide any useful information for the predictions.

The columns "product_id" (product id), "desc" (product description) and "item" have not been included as input for the model either. The product that is being sold (i.e., the product id) can certainly influence the final price that the auction reaches, but if we were to include this variable as input for the model, we would have to one-hot encode it, as it is expressed as a number, but there is no actual order defined for the values that it contains (a product with a larger product id than another one is not "larger" than that product. The number is just used as an identifier). If we were to include the other two columns ("desc", containing the product description, and "item", containing an item string associated to the product), they would also have to be one-hot encoded.

As it can be seen, all of these columns contain many unique values. In the previous section, the item categories have been obtained from Amazon, and have been converted into a Word2Vec vector. The distances that these vectors define can be used to create clusters for the product categories that can be used as input for the prediction models rather than the other mentioned columns (which contain way too many different values, and the prediction model would not be able to generalize well if we used them as input variables to express diferences between the type of auctioned items).

In [24]:
print(len(outcomesDf["product_id"].unique()))
print(len(outcomesDf["desc"].unique()))
print(len(outcomesDf["item"].unique()))

2080
1769
1767


The columns "price" and "finalprice" contain the final selling price reached for the auction (with a small difference between each other that will later be reminded), which is exactly what we want to predict, and therefore, they cannot be an input for the prediction model.

The columns "winner" (winner of the auction) 'placedbids' (number of paid bids placed by the winnner of the auction), "freebids" (number of free bids placed by the winner of the auction) and "endtime_str' (time in which the auction finished) contain information that is not known until the auction finishes, and therefore, they cannot be used as input for the model to predict the final selling price before the auction starts.

The same thing happens with the other columns that were added in previous sections to the dataset: "bids_placed" (total number of bids placed for the auction), "swoopo_sale_price" (total gain obtained by Swoopo for the auction), "swoopo_profit" (total profit obtained by Swoopo for the auction, calculated as the gain obtained minus the retail price of the item), and "winner_benefit" (the difference between the money paid by the winner for the item and the retail price of the item). They cannot be used as input data for the model because they contain data that is not known at the beginning of the auction (and also, even if the data was known at the beginning of the auction, they have been calculated using the values of the other variables, so they could lead to multicollinearity problems if they were included in that case).

The columns "flg_fixedprice" and "flg_endprice" indicate whether the auction is a fixed-price auction or and end-price auction. Fixed-price actions are auctions for which the final selling price of the item is set from the beginning. End-price auctions are auctions in which the final selling price of the item is zero independently of the bids that are placed, and the revenue for Swoopo comes exclusively from the bids that are placed).

The column "finalprice" contains the real final selling price of the auction (that is, the fixed price for fixed-price auctions and zero for end-price auctions), while the "price" column contains the selling price of the auction calculated as the number of bids placed multiplied by the bid increment. Since the final selling price is already known from the beginning for fixed-price auctions and end-price auctions, and we are interested in knowing the gain that the auction provides for Swoopo, it has been chosen to predict the value in column "price".

In [None]:
X = outcomesDf[x_column_names]
y = outcomesDf["price"]

# One model - Without having into account the categories

The first models that have been built do not take into account the product categories. The input variables for the model are: the retail price of the item, the bid increment, the bid fee, and the flags that indicate whether the auction is a click-only auction, a beginner auction, a fixed-price auction or and end-price auction.

In order to be able to compare the results for different models, it has been chosen to develop the following ones: a random forest regressor, a k-neighbors regressor, a decision tree regressor, a linear regression and a RANSAC regressor.

The dataset has been divided into different parts for the training and the testing phases, and 5-fold cross-validation has been applied so that the resulting quality metrics are more reliable.

The metrics that have been calculated are the mean absolute error and the median absolute error. 

The mean absolute error gives an idea about the error that the model incurs in when performing final price predictions for individual auctions. Nevertheless, the final selling price of an auction depends on a lot of things: the users that are monitoring the auction, the amount of money that they have and their interests at the moment of the auction, etc.

Even when the item that is auctioned is the same one, the final price that is reached for each one of the auctions can be very different. As an example:

In [39]:
outcomesDf[outcomesDf["desc"] == "Sony Ericsson S500i Unlocked Mysterious Green"][["desc","finalprice"]].head(15)

Unnamed: 0,desc,finalprice
0,Sony Ericsson S500i Unlocked Mysterious Green,13.35
3,Sony Ericsson S500i Unlocked Mysterious Green,19.65
4,Sony Ericsson S500i Unlocked Mysterious Green,47.1
5,Sony Ericsson S500i Unlocked Mysterious Green,55.2
6,Sony Ericsson S500i Unlocked Mysterious Green,86.1
17,Sony Ericsson S500i Unlocked Mysterious Green,48.6
18,Sony Ericsson S500i Unlocked Mysterious Green,113.1
48,Sony Ericsson S500i Unlocked Mysterious Green,75.15
49,Sony Ericsson S500i Unlocked Mysterious Green,35.1
50,Sony Ericsson S500i Unlocked Mysterious Green,19.2


The most extreme values could be considered outliers and be removed from the dataset so that they do not negatively influence the model. However, in this case, the extreme values are legitimate observations and they are most likely not incorrectly entered data. Therefore, instead of removing them, it has been decided to include some prediction models that are robust to outliers. For example, random forests and decision trees isolate atypical observations into small leaves, and the RANSAC algorithm provides a robust linear model estimation.

Because of the high variance of the final selling price of the auctions, appart from the mean absolute error, it has also been decided to use the median absolute error as a metric. This metric is not be so highly influenced by extreme values. Making final selling price predictions can help Swoopo or similar businesses to have an estimation of the money that they will earn before a group of auctions start. However, we consider it more important to have an accurate estimation of groups of several auctions (for example, accurate weekly or monthly profit estimations, which can be calculated by using the estimations of the final price of the auctions included in those time periods), rather than accurate estimations for individual auctions, since Swoopo will profit of the total result of all of their auctions. Because of this reason, although the mean absolute error has been calculated, the median absolute error is considered to be a more appropiate metric in this case.

The mean absolute percentage error (MAPE) has also been considered as a metric. Nevertheless, if a video game is predicted to reach a selling price of 20\$, and the real selling price ends up being 30\$,the MAPE would be 33%. However, if a laptop is predicted to reach a selling price of 700\$ and the real selling price ends up being 800\$, the MAPE is only 12.50%, but there is a prediction error of 100\$ between the predicted price and the final price, which can have a higher impact in the business than the error in the video game prediction. For this reason, it has been decided not to use a metric that indicates a percentage.

As mentioned before, in this section, the models that have been built do not take into account the product categories. The input variables for the models are: the retail price of the item, the bid increment, the bid fee, and the flags that indicate whether the auction is a click-only auction, a beginner auction, a fixed-price auction or and end-price auction.

The following code is used to calculate the mean absolute error and the median absolute error (as the average of the results obtained for these two metrics over the 5 cross-validation folds) for the previously mentioned models. The results are analyzed after the lines of code.

In [6]:
kFoldMedianAbsoluteErrorsRandomForestRegressor = []
kFoldMedianAbsoluteErrorsKNeighborsRegressor = []
kFoldMedianAbsoluteErrorsDecisionTreeRegressor = []
kFoldMedianAbsoluteErrorsLinearRegression = []
kFoldMedianAbsoluteErrorsRANSACRegressor = []

kFoldMeanAbsoluteErrorsRandomForestRegressor = []
kFoldMeanAbsoluteErrorsKNeighborsRegressor = []
kFoldMeanAbsoluteErrorsDecisionTreeRegressor = []
kFoldMeanAbsoluteErrorsLinearRegression = []
kFoldMeanAbsoluteErrorsRANSACRegressor = []

kFoldNumber = 1
kf = KFold(n_splits=5, random_state=1)
for train_index, test_index in kf.split(X):
    #5-fold cross-validation
    print("Executing k fold = "+str(kFoldNumber))
    kFoldNumber=kFoldNumber+1
    
    #train and test split
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    #RandomForestRegressor
    model=RandomForestRegressor(random_state=1)
    model.fit(X_train,y_train)
    P_price = model.predict(X_test)
    
    kFoldMedianAbsoluteError = median_absolute_error(y_test,P_price)
    kFoldMedianAbsoluteErrorsRandomForestRegressor.append(kFoldMedianAbsoluteError)    
    kFoldMeanAbsoluteError = mean_absolute_error(y_test,P_price)
    kFoldMeanAbsoluteErrorsRandomForestRegressor.append(kFoldMeanAbsoluteError)  

    #KNeighborsRegressor
    model = KNeighborsRegressor()
    model.fit(X_train,y_train)
    P_price = model.predict(X_test)
    
    kFoldMedianAbsoluteError = median_absolute_error(y_test,P_price)
    kFoldMedianAbsoluteErrorsKNeighborsRegressor.append(kFoldMedianAbsoluteError)
    kFoldMeanAbsoluteError = mean_absolute_error(y_test,P_price)
    kFoldMeanAbsoluteErrorsKNeighborsRegressor.append(kFoldMeanAbsoluteError)
    
    #DecisionTreeRegressor
    model = DecisionTreeRegressor()
    model.fit(X_train,y_train)
    P_price = model.predict(X_test)
    
    kFoldMedianAbsoluteError = median_absolute_error(y_test,P_price)
    kFoldMedianAbsoluteErrorsDecisionTreeRegressor.append(kFoldMedianAbsoluteError)
    kFoldMeanAbsoluteError = mean_absolute_error(y_test,P_price)
    kFoldMeanAbsoluteErrorsDecisionTreeRegressor.append(kFoldMeanAbsoluteError)
    
    #LinearRegression
    model = LinearRegression()
    model.fit(X_train,y_train)
    P_price = model.predict(X_test)
    
    kFoldMedianAbsoluteError = median_absolute_error(y_test,P_price)
    kFoldMedianAbsoluteErrorsLinearRegression.append(kFoldMedianAbsoluteError)
    kFoldMeanAbsoluteError = mean_absolute_error(y_test,P_price)
    kFoldMeanAbsoluteErrorsLinearRegression.append(kFoldMeanAbsoluteError)
    
    #RANSACRegressor
    model = RANSACRegressor(random_state=1)
    model.fit(X_train,y_train)
    P_price = model.predict(X_test)
    
    kFoldMedianAbsoluteError = median_absolute_error(y_test,P_price)
    kFoldMedianAbsoluteErrorsRANSACRegressor.append(kFoldMedianAbsoluteError)
    kFoldMeanAbsoluteError = mean_absolute_error(y_test,P_price)
    kFoldMeanAbsoluteErrorsRANSACRegressor.append(kFoldMeanAbsoluteError)
    
medianAbsoluteErrorsRandomForestRegressorAverage = np.mean(kFoldMedianAbsoluteErrorsRandomForestRegressor)
meanAbsoluteErrorsRandomForestRegressorAverage = np.mean(kFoldMeanAbsoluteErrorsRandomForestRegressor)
medianAbsoluteErrorsKNeighborsRegressorAverage = np.mean(kFoldMedianAbsoluteErrorsKNeighborsRegressor)
meanAbsoluteErrorsKNeighborsRegressorAverage = np.mean(kFoldMeanAbsoluteErrorsKNeighborsRegressor)
medianAbsoluteErrorsDecisionTreeRegressorAverage = np.mean(kFoldMedianAbsoluteErrorsDecisionTreeRegressor)
meanAbsoluteErrorsDecisionTreeRegressorAverage = np.mean(kFoldMeanAbsoluteErrorsDecisionTreeRegressor)
medianAbsoluteErrorsLinearRegressionAverage = np.mean(kFoldMedianAbsoluteErrorsLinearRegression)
meanAbsoluteErrorsLinearRegressionAverage = np.mean(kFoldMeanAbsoluteErrorsLinearRegression)
medianAbsoluteErrorsRANSACRegressorAverage = np.mean(kFoldMedianAbsoluteErrorsRANSACRegressor)
meanAbsoluteErrorsRANSACRegressorAverage = np.mean(kFoldMeanAbsoluteErrorsRANSACRegressor)

print("--")
print("Random Forest Regressor")
print("Median absolute error: "+str(medianAbsoluteErrorsRandomForestRegressorAverage))
print("Mean absolute error: "+str(meanAbsoluteErrorsRandomForestRegressorAverage))
print("--")
print("K Neighbors Regressor")
print("Median absolute error: "+str(medianAbsoluteErrorsKNeighborsRegressorAverage))
print("Mean absolute error: "+str(meanAbsoluteErrorsKNeighborsRegressorAverage))
print("--")
print("Decision Tree Regressor")
print("Median absolute error: "+str(medianAbsoluteErrorsDecisionTreeRegressorAverage))
print("Mean absolute error: "+str(meanAbsoluteErrorsDecisionTreeRegressorAverage))
print("--")
print("Linear Regression")
print("Median absolute error: "+str(medianAbsoluteErrorsLinearRegressionAverage))
print("Mean absolute error: "+str(meanAbsoluteErrorsLinearRegressionAverage))
print("--")
print("RANSAC Regressor")
print("Median absolute error: "+str(medianAbsoluteErrorsRANSACRegressorAverage))
print("Mean absolute error: "+str(meanAbsoluteErrorsRANSACRegressorAverage))

Executing k fold = 1
Executing k fold = 2
Executing k fold = 3
Executing k fold = 4
Executing k fold = 5
--
Random Forest Regressor
Median absolute error: 11.1752458141
Mean absolute error: 29.3445832716
--
K Neighbors Regressor
Median absolute error: 11.9772
Mean absolute error: 31.8406889199
--
Decision Tree Regressor
Median absolute error: 11.0463421819
Mean absolute error: 29.7397227663
--
Linear Regression
Median absolute error: 24.2493912238
Mean absolute error: 42.0046659273
--
RANSAC Regressor
Median absolute error: 10.4691486861
Mean absolute error: 38.1582942494


As it can be seen above, the model with the better results in terms of the median absolute error is the RANSAC regressor. Nevertheless, its mean absolute error is much higher than for other models, which indicates that the deviation between individual prediction results and real results is higher than for other models. The decision tree regresor and the random forest regressor both present good results for the two metrics. The k-neighbors regressor performs alright, but worse than the other two. The worst results are obtained with the linear regression, probably because it is more sensitive to outliers than the others.

# One Model - Having into account the categories

In [7]:
outcomesDf = pd.read_csv('./outcomes_clean.tsv',sep='\t')

In [None]:
ver si esto tiene la resolucion correcta y subirlo a github

In [8]:
productDescriptionToVectorDf = pd.read_excel('./productDescriptionToVector.xlsx')

In [None]:
usar esta funcion tb arriba

In [9]:
def convertCategoryDataFrameToOneHotEncodedVersion(clusteredCategoriesDf, categoryColumnName, trainClusterColumNames = None):
    #categories assigned to each row of the DataFrame
    categories = clusteredCategoriesDf[categoryColumnName]
    onehot_encoder = OneHotEncoder(sparse=False)
    categories = categories.values.reshape(len(categories), 1)
    #One-Hot encoded version of the categories, where each column represents a category,
    #and each row has the value "1" on the column of the category that it belongs to,
    #and "0" on the other columns.
    onehot_encoded = onehot_encoder.fit_transform(categories)
    ncolumns = np.shape(onehot_encoded)[1]
    #array that will contain the column names for the category clusters
    clusterColumNames = ['cluster_']*ncolumns
    for clusterNumber in np.arange(ncolumns):
        #each column name is "cluster_" followed by the cluster number
        clusterColumNames[clusterNumber] += str(clusterNumber)

    #One-Hot Encoded DataFrame
    clusteredCategoriesDfOneHotEnc = pd.DataFrame(onehot_encoded,columns=clusterColumNames)
    
    if trainClusterColumNames is not None:
        #this function is called both in the training phase and the testing phase.
        #For the training phase, trainClusterColumNames = None
        #For the testing phase, trainClusterColumNames contains the column names
        #of the One-Hot encoded DataFrame obtained during the training phase.
        for trainColumnName in trainClusterColumNames:
            if trainColumnName not in clusterColumNames:
                #It may happen that a product category that appeared in the training phase
                #does not appear in the testing phase. However, since the model is given 
                #the training One-Hot encoded DataFrame column values as input, it is necessary
                #that the testing One-Hot encoded DataFrame contains the same columns
                
                #Therefore, if a category that appeared in the training phase does not
                #appear in the testing phase, a column filled in with "0" for all rows
                #is created.
                clusteredCategoriesDfOneHotEnc[trainColumnName] = 0.0
            
    return clusteredCategoriesDfOneHotEnc

In [10]:
kFoldMedianAbsoluteErrorsRandomForestRegressor = []
kFoldMedianAbsoluteErrorsKNeighborsRegressor = []
kFoldMedianAbsoluteErrorsDecisionTreeRegressor = []
kFoldMedianAbsoluteErrorsLinearRegression = []
kFoldMedianAbsoluteErrorsRANSACRegressor = []

kFoldMeanAbsoluteErrorsRandomForestRegressor = []
kFoldMeanAbsoluteErrorsKNeighborsRegressor = []
kFoldMeanAbsoluteErrorsDecisionTreeRegressor = []
kFoldMeanAbsoluteErrorsLinearRegression = []
kFoldMeanAbsoluteErrorsRANSACRegressor = []

kFoldNumber = 1
kf = KFold(n_splits=5, random_state=1)
for train_index, test_index in kf.split(outcomesDf):
    #5-fold cross-validation
    print("Executing k fold = "+str(kFoldNumber))
    kFoldNumber=kFoldNumber+1
    
    #train and test split
    outcomesDf_train, outcomesDf_test = outcomesDf.iloc[train_index], outcomesDf.iloc[test_index]   
    
    #TRAINING PART
    
    #Unique product descriptions present in the training dataset
    productDescriptionTrain = outcomesDf_train['desc'].unique()
    
    productDescriptionToVector={}
    for item in productDescriptionTrain:
        #Word2Vec vector obtained for each product description
        productDescriptionToVector[item] = productDescriptionToVectorDf[item]
    
    productVectors = list(productDescriptionToVector.values())
   
    km = KMeans(n_clusters=15,random_state=2)
    #the clustering is performed with the Word2Vec vectors for the product categories given as input
    km.fit(productVectors)
    clusters = km.labels_.tolist()
    #DataFrame containing columns for the product description and the category cluster that it is associated to
    clusteredCategoriesDf = pd.DataFrame({'desc': list(productDescriptionToVector.keys()), 'cat_cluster': clusters})
    #One-Hot encoded version of the DataFrame containing a column for each cluster
    clusteredCategoriesDfOneHotEnc = convertCategoryDataFrameToOneHotEncodedVersion(clusteredCategoriesDf,'cat_cluster')  
    #column names of the clusters
    cluster_column_names = clusteredCategoriesDfOneHotEnc.columns.values
    clusteredCategoriesDfOneHotEnc['desc'] = clusteredCategoriesDf['desc']
    #the training dataset is merged so that it contains a one hot-encoded 
    #column for each one of the clusters obtained.
    outcomesDfWithCatOneHot = outcomesDf_train.merge(clusteredCategoriesDfOneHotEnc,how='left')
    normal_x_column_names = ["retail","bidincrement","bidfee","flg_click_only","flg_beginnerauction","flg_fixedprice","flg_endprice"]
    #column names to be used in the prediction model: they include the columns corresponding to the clusters obtained.
    column_names_with_clusters = np.concatenate([normal_x_column_names,cluster_column_names])
    
    #training phase variables
    X_train = outcomesDfWithCatOneHot[column_names_with_clusters]
    y_train = outcomesDfWithCatOneHot["price"]
    
    #TEST PART
    
    #Unique product descriptions present in the training dataset
    productDescriptionTest = outcomesDf_test['desc'].unique()
    
    productDescriptionToVector={}
    for item in productDescriptionTest:
        #Word2Vec vector obtained for each product description
        productDescriptionToVector[item] = productDescriptionToVectorDf[item]
   
    productDescriptionToTrainCluster = {}
    for productDescription, vector in productDescriptionToVector.items():
        #The K-Means model obtained during the training phase is used to associate
        #an existing cluster to each one of the product vectors contained in the test data.
        clusterCategory = km.predict([vector])
        productDescriptionToTrainCluster[productDescription] = clusterCategory[0]
    
    #DataFrame containing columns for the product description and the category cluster that it is associated to
    clusteredCategoriesDf = pd.DataFrame(list(productDescriptionToTrainCluster.items()), columns=['desc', 'cat_cluster'])
    #One-Hot encoded version of the DataFrame containing a column for each cluster.
    #The columns are the same ones as in the training phase
    clusteredCategoriesDfOneHotEnc = convertCategoryDataFrameToOneHotEncodedVersion(clusteredCategoriesDf,'cat_cluster',cluster_column_names)
    #column names of the clusters
    cluster_column_names = clusteredCategoriesDfOneHotEnc.columns.values
    clusteredCategoriesDfOneHotEnc['desc'] = clusteredCategoriesDf['desc']
    #the test dataset is merged so that it contains a one hot-encoded 
    #column for each one of the clusters obtained.
    outcomesDfWithCatOneHot = outcomesDf_test.merge(clusteredCategoriesDfOneHotEnc,how='left')
    normal_x_column_names = ["retail","bidincrement","bidfee","flg_click_only","flg_beginnerauction","flg_fixedprice","flg_endprice"]
    #column names to be used in the prediction model: they include the columns corresponding to the clusters obtained.
    column_names_with_clusters = np.concatenate([normal_x_column_names,cluster_column_names])
    
    #test phase variables
    X_test = outcomesDfWithCatOneHot[column_names_with_clusters]
    y_test = outcomesDfWithCatOneHot["price"]
    
    #Predictors
    
    #RandomForestRegressor
    model=RandomForestRegressor(random_state=1)
    model.fit(X_train,y_train)
    P_price = model.predict(X_test)
    
    kFoldMedianAbsoluteError = median_absolute_error(y_test,P_price)
    kFoldMedianAbsoluteErrorsRandomForestRegressor.append(kFoldMedianAbsoluteError)
    kFoldMeanAbsoluteError = mean_absolute_error(y_test,P_price)
    kFoldMeanAbsoluteErrorsRandomForestRegressor.append(kFoldMeanAbsoluteError)  

    #KNeighborsRegressor
    model = KNeighborsRegressor()
    model.fit(X_train,y_train)
    P_price = model.predict(X_test)
    
    kFoldMedianAbsoluteError = median_absolute_error(y_test,P_price)
    kFoldMedianAbsoluteErrorsKNeighborsRegressor.append(kFoldMedianAbsoluteError)
    kFoldMeanAbsoluteError = mean_absolute_error(y_test,P_price)
    kFoldMeanAbsoluteErrorsKNeighborsRegressor.append(kFoldMeanAbsoluteError)  
    
    #DecisionTreeRegressor
    model = DecisionTreeRegressor()
    model.fit(X_train,y_train)
    P_price = model.predict(X_test)
    
    kFoldMedianAbsoluteError = median_absolute_error(y_test,P_price)
    kFoldMedianAbsoluteErrorsDecisionTreeRegressor.append(kFoldMedianAbsoluteError)
    kFoldMeanAbsoluteError = mean_absolute_error(y_test,P_price)
    kFoldMeanAbsoluteErrorsDecisionTreeRegressor.append(kFoldMeanAbsoluteError)  
    
    #LinearRegression
    model = LinearRegression()
    model.fit(X_train,y_train)
    P_price = model.predict(X_test)
    
    kFoldMedianAbsoluteError = median_absolute_error(y_test,P_price)
    kFoldMedianAbsoluteErrorsLinearRegression.append(kFoldMedianAbsoluteError) 
    kFoldMeanAbsoluteError = mean_absolute_error(y_test,P_price)
    kFoldMeanAbsoluteErrorsLinearRegression.append(kFoldMeanAbsoluteError)  
    
    #RANSACRegressor
    model = RANSACRegressor(random_state=1)
    model.fit(X_train,y_train)
    P_price = model.predict(X_test)
    
    kFoldMedianAbsoluteError = median_absolute_error(y_test,P_price)
    kFoldMedianAbsoluteErrorsRANSACRegressor.append(kFoldMedianAbsoluteError) 
    kFoldMeanAbsoluteError = mean_absolute_error(y_test,P_price)
    kFoldMeanAbsoluteErrorsRANSACRegressor.append(kFoldMeanAbsoluteError)  
    
medianAbsoluteErrorsRandomForestRegressorAverage = np.mean(kFoldMedianAbsoluteErrorsRandomForestRegressor)
meanAbsoluteErrorsRandomForestRegressorAverage = np.mean(kFoldMeanAbsoluteErrorsRandomForestRegressor)
medianAbsoluteErrorsKNeighborsRegressorAverage = np.mean(kFoldMedianAbsoluteErrorsKNeighborsRegressor)
meanAbsoluteErrorsKNeighborsRegressorAverage = np.mean(kFoldMeanAbsoluteErrorsKNeighborsRegressor)
medianAbsoluteErrorsDecisionTreeRegressorAverage = np.mean(kFoldMedianAbsoluteErrorsDecisionTreeRegressor)
meanAbsoluteErrorsDecisionTreeRegressorAverage = np.mean(kFoldMeanAbsoluteErrorsDecisionTreeRegressor)
medianAbsoluteErrorsLinearRegressionAverage = np.mean(kFoldMedianAbsoluteErrorsLinearRegression)
meanAbsoluteErrorsLinearRegressionAverage = np.mean(kFoldMeanAbsoluteErrorsLinearRegression)
medianAbsoluteErrorsRANSACRegressorAverage = np.mean(kFoldMedianAbsoluteErrorsRANSACRegressor)
meanAbsoluteErrorsRANSACRegressorAverage = np.mean(kFoldMeanAbsoluteErrorsRANSACRegressor)

print("--")
print("Random Forest Regressor")
print("Median absolute error: "+str(medianAbsoluteErrorsRandomForestRegressorAverage))
print("Mean absolute error: "+str(meanAbsoluteErrorsRandomForestRegressorAverage))
print("--")
print("K Neighbors Regressor")
print("Median absolute error: "+str(medianAbsoluteErrorsKNeighborsRegressorAverage))
print("Mean absolute error: "+str(meanAbsoluteErrorsKNeighborsRegressorAverage))
print("--")
print("Decision Tree Regressor")
print("Median absolute error: "+str(medianAbsoluteErrorsDecisionTreeRegressorAverage))
print("Mean absolute error: "+str(meanAbsoluteErrorsDecisionTreeRegressorAverage))
print("--")
print("Linear Regression")
print("Median absolute error: "+str(medianAbsoluteErrorsLinearRegressionAverage))
print("Mean absolute error: "+str(meanAbsoluteErrorsLinearRegressionAverage))
print("--")
print("RANSAC Regressor")
print("Median absolute error: "+str(medianAbsoluteErrorsRANSACRegressorAverage))
print("Mean absolute error: "+str(meanAbsoluteErrorsRANSACRegressorAverage))

Executing k fold = 1
Executing k fold = 2
Executing k fold = 3
Executing k fold = 4
Executing k fold = 5
--
Random Forest Regressor
Median absolute error: 10.3618596038
Mean absolute error: 28.4712221001
--
K Neighbors Regressor
Median absolute error: 11.0316
Mean absolute error: 31.4725272275
--
Decision Tree Regressor
Median absolute error: 10.5378217679
Mean absolute error: 28.9620298834
--
Linear Regression
Median absolute error: 22.8660347876
Mean absolute error: 39.4963330571
--
RANSAC Regressor
Median absolute error: 12.9409487405
Mean absolute error: 36.7040760076


# Multiple models by category

In [11]:
outcomesDf = pd.read_csv('./outcomes_clean.tsv',sep='\t')

In [12]:
productDescriptionToVectorDf = pd.read_excel('./productDescriptionToVector.xlsx')

In [16]:
kFoldMedianAbsoluteErrorsRandomForestRegressor = []
kFoldMedianAbsoluteErrorsKNeighborsRegressor = []
kFoldMedianAbsoluteErrorsDecisionTreeRegressor = []
kFoldMedianAbsoluteErrorsLinearRegression = []
kFoldMedianAbsoluteErrorsRANSACRegressor = []

kFoldMeanAbsoluteErrorsRandomForestRegressor = []
kFoldMeanAbsoluteErrorsKNeighborsRegressor = []
kFoldMeanAbsoluteErrorsDecisionTreeRegressor = []
kFoldMeanAbsoluteErrorsLinearRegression = []
kFoldMeanAbsoluteErrorsRANSACRegressor = []

kf = KFold(n_splits=5, random_state=1)
kFoldNumber = 1
for train_index, test_index in kf.split(outcomesDf):
    #5-fold cross-validation
    print("Executing k fold = "+str(kFoldNumber))
    kFoldNumber=kFoldNumber+1
    
    #train and test split
    outcomesDf_train, outcomesDf_test = outcomesDf.iloc[train_index], outcomesDf.iloc[test_index]   
    
    #TRAINING PART
                                        
    #Unique product descriptions present in the training dataset
    productDescriptionTrain = outcomesDf_train['desc'].unique()
    
    productDescriptionToVector={}
    for item in productDescriptionTrain:
        #Word2Vec vector obtained for each product description
        productDescriptionToVector[item] = productDescriptionToVectorDf[item]
        
    productVectors = list(productDescriptionToVector.values())
    
    km = KMeans(n_clusters=15,random_state=2)
    #the clustering is performed with the Word2Vec vectors for the product categories given as input
    km.fit(productVectors)
    clusters = km.labels_.tolist()
    #DataFrame containing columns for the product description and the category cluster that it is associated to
    clusteredCategoriesDf = pd.DataFrame({'desc': list(productDescriptionToVector.keys()), 'cat_cluster': clusters})
    #the training dataset is merged and now it contains a column with the category that
    #each product belongs to.
    outcomesDfTrainAndCatCluster = outcomesDf_train.merge(clusteredCategoriesDf,how='left')
    
    clusterIndexToOutcomesDfTrain = {}
    for cluster_index in outcomesDfTrainAndCatCluster['cat_cluster'].unique():
        #the rows associated to each different product category are stored in a dictionary,
        #where the dictionary key is the product category number (cluster number)
        clusterIndexToOutcomesDfTrain[cluster_index] = outcomesDfTrainAndCatCluster[outcomesDfTrainAndCatCluster['cat_cluster'] == cluster_index]
    
    #dictionaries that contain a trained model for each one of the product categories (clusters)
    clusterIndexToRandomForestRegressor = {}
    clusterIndexToKNeighborsRegressor = {}
    clusterIndexToDecisionTreeRegressor = {}
    clusterIndexToLinearRegression = {}
    clusterIndexToRANSACRegressor = {}
    
    for cluster_index, outcomesDfClusterIndex in clusterIndexToOutcomesDfTrain.items():
        #For each product category (cluster), a different model is trained using
        #the rows of the input dataset containing the products associated to that category
        X_train = outcomesDfClusterIndex[["retail","bidincrement","bidfee","flg_click_only","flg_beginnerauction","flg_fixedprice","flg_endprice"]].values
        y_train = outcomesDfClusterIndex["price"]
        
        #RandomForestRegressor
        model=RandomForestRegressor(random_state=1)
        model.fit(X_train,y_train)
        clusterIndexToRandomForestRegressor[cluster_index] = model

        #KNeighborsRegressor
        n_neighbors=5
        if outcomesDfClusterIndex.shape[0] < n_neighbors:
            #In KNeighborsRegressor it is expected n_neighbors <= n_samples
            n_neighbors = outcomesDfClusterIndex.shape[0]
            
        model = KNeighborsRegressor(n_neighbors=n_neighbors)
        model.fit(X_train,y_train)
        clusterIndexToKNeighborsRegressor[cluster_index] = model

        #DecisionTreeRegressor
        model = DecisionTreeRegressor()
        model.fit(X_train,y_train)
        clusterIndexToDecisionTreeRegressor[cluster_index] = model
    
        #LinearRegression
        model = LinearRegression()
        model.fit(X_train,y_train)
        clusterIndexToLinearRegression[cluster_index] = model
        
        #RANSACRegressor
        model = RANSACRegressor(random_state=1)
        model.fit(X_train,y_train)
        clusterIndexToRANSACRegressor[cluster_index] = model
        
    #TEST PART
    
    #Unique product descriptions present in the test dataset
    productDescriptionTest = outcomesDf_test['desc'].unique()
    
    productDescriptionToVector={}
    for item in productDescriptionTest:
        #Word2Vec vector obtained for each product description
        productDescriptionToVector[item] = productDescriptionToVectorDf[item]
   
    productDescriptionToTrainCluster = {}
    for productDescription, vector in productDescriptionToVector.items(): 
        #The K-Means model obtained during the training phase is used to associate
        #an existing cluster to each one of the product vectors contained in the test data.        
        clusterCategory = km.predict([vector])
        productDescriptionToTrainCluster[productDescription] = clusterCategory[0]
    
    #DataFrame containing columns for the product description and the category cluster that it is associated to
    clusteredCategoriesDf = pd.DataFrame(list(productDescriptionToTrainCluster.items()), columns=['desc', 'cat_cluster'])
    #the test dataset is merged and now it contains a column with the category that
    #each product belongs to.
    outcomesDfTestAndCatCluster = outcomesDf_test.merge(clusteredCategoriesDf,how='left')
    
    realAndPredictedValuesRandomForestRegressor = []
    realAndPredictedValuesKNeighborsRegressor = []
    realAndPredictedValuesDecisionTreeRegressor = []
    realAndPredictedValuesLinearRegression = []
    realAndPredictedValuesRANSACRegressor = []
    
    for index, row in outcomesDfTestAndCatCluster.iterrows():
        #iteration through each row of the test dataset
        
        #real value of the column to be predicted
        y_test_value = row["price"]
        #values given as input for the prediction model
        x_test_value = row[["retail","bidincrement","bidfee","flg_click_only","flg_beginnerauction","flg_fixedprice","flg_endprice"]].values
        #product category cluster for this row of the dataset
        clusterIndex = row["cat_cluster"]
        
        #the prediction models associated to this product category (cluster) are extracted
        modelRandomForestRegressor = clusterIndexToRandomForestRegressor[clusterIndex]
        modelKNeighborsRegressor = clusterIndexToKNeighborsRegressor[clusterIndex]
        modelDecisionTreeRegressor = clusterIndexToDecisionTreeRegressor[clusterIndex]
        modelLinearRegression = clusterIndexToLinearRegression[clusterIndex]
        modelRANSACRegressor = clusterIndexToRANSACRegressor[clusterIndex]
        
        #the real value and the predicted value are stored in a different list for
        #each one of the prediction models used.
        realAndPredictedValuesRandomForestRegressor.append((y_test_value,modelRandomForestRegressor.predict([x_test_value])[0]))
        realAndPredictedValuesKNeighborsRegressor.append((y_test_value,modelKNeighborsRegressor.predict([x_test_value])[0]))
        realAndPredictedValuesDecisionTreeRegressor.append((y_test_value,modelDecisionTreeRegressor.predict([x_test_value])[0]))
        realAndPredictedValuesLinearRegression.append((y_test_value,modelLinearRegression.predict([x_test_value])[0]))
        realAndPredictedValuesRANSACRegressor.append((y_test_value,modelRANSACRegressor.predict([x_test_value])[0]))
    
    #RandomForestRegressor
    #all real and predicted values are extracted and the metrics are calculated
    y_test, P_price = zip(*realAndPredictedValuesRandomForestRegressor)    
    kFoldMedianAbsoluteError = median_absolute_error(y_test,P_price)
    kFoldMedianAbsoluteErrorsRandomForestRegressor.append(kFoldMedianAbsoluteError) 
    kFoldMeanAbsoluteError = mean_absolute_error(y_test,P_price)
    kFoldMeanAbsoluteErrorsRandomForestRegressor.append(kFoldMeanAbsoluteError)  
    
    #KNeighborsRegressor
    #all real and predicted values are extracted and the metrics are calculated
    y_test, P_price = zip(*realAndPredictedValuesKNeighborsRegressor)    
    kFoldMedianAbsoluteError = median_absolute_error(y_test,P_price)
    kFoldMedianAbsoluteErrorsKNeighborsRegressor.append(kFoldMedianAbsoluteError)  
    kFoldMeanAbsoluteError = mean_absolute_error(y_test,P_price)
    kFoldMeanAbsoluteErrorsKNeighborsRegressor.append(kFoldMeanAbsoluteError)  
    
    #DecisionTreeRegressor
    #all real and predicted values are extracted and the metrics are calculated
    y_test, P_price = zip(*realAndPredictedValuesDecisionTreeRegressor)    
    kFoldMedianAbsoluteError = median_absolute_error(y_test,P_price)
    kFoldMedianAbsoluteErrorsDecisionTreeRegressor.append(kFoldMedianAbsoluteError)
    kFoldMeanAbsoluteError = mean_absolute_error(y_test,P_price)
    kFoldMeanAbsoluteErrorsDecisionTreeRegressor.append(kFoldMeanAbsoluteError)  
    
    #LinearRegression
    #all real and predicted values are extracted and the metrics are calculated
    y_test, P_price = zip(*realAndPredictedValuesLinearRegression)    
    kFoldMedianAbsoluteError = median_absolute_error(y_test,P_price)
    kFoldMedianAbsoluteErrorsLinearRegression.append(kFoldMedianAbsoluteError)
    kFoldMeanAbsoluteError = mean_absolute_error(y_test,P_price)
    kFoldMeanAbsoluteErrorsLinearRegression.append(kFoldMeanAbsoluteError) 
    
    #RANSACRegressor
    #all real and predicted values are extracted and the metrics are calculated
    y_test, P_price = zip(*realAndPredictedValuesRANSACRegressor)
    kFoldMedianAbsoluteError = median_absolute_error(y_test,P_price)
    kFoldMedianAbsoluteErrorsRANSACRegressor.append(kFoldMedianAbsoluteError)
    kFoldMeanAbsoluteError = mean_absolute_error(y_test,P_price)
    kFoldMeanAbsoluteErrorsRANSACRegressor.append(kFoldMeanAbsoluteError) 
    
medianAbsoluteErrorsRandomForestRegressorAverage = np.mean(kFoldMedianAbsoluteErrorsRandomForestRegressor)
meanAbsoluteErrorsRandomForestRegressorAverage = np.mean(kFoldMeanAbsoluteErrorsRandomForestRegressor)
medianAbsoluteErrorsKNeighborsRegressorAverage = np.mean(kFoldMedianAbsoluteErrorsKNeighborsRegressor)
meanAbsoluteErrorsKNeighborsRegressorAverage = np.mean(kFoldMeanAbsoluteErrorsKNeighborsRegressor)
medianAbsoluteErrorsDecisionTreeRegressorAverage = np.mean(kFoldMedianAbsoluteErrorsDecisionTreeRegressor)
meanAbsoluteErrorsDecisionTreeRegressorAverage = np.mean(kFoldMeanAbsoluteErrorsDecisionTreeRegressor)
medianAbsoluteErrorsLinearRegressionAverage = np.mean(kFoldMedianAbsoluteErrorsLinearRegression)
meanAbsoluteErrorsLinearRegressionAverage = np.mean(kFoldMeanAbsoluteErrorsLinearRegression)
medianAbsoluteErrorsRANSACRegressorAverage = np.mean(kFoldMedianAbsoluteErrorsRANSACRegressor)
meanAbsoluteErrorsRANSACRegressorAverage = np.mean(kFoldMeanAbsoluteErrorsRANSACRegressor)

print("--")
print("Random Forest Regressor")
print("Median absolute error: "+str(medianAbsoluteErrorsRandomForestRegressorAverage))
print("Mean absolute error: "+str(meanAbsoluteErrorsRandomForestRegressorAverage))
print("--")
print("K Neighbors Regressor")
print("Median absolute error: "+str(medianAbsoluteErrorsKNeighborsRegressorAverage))
print("Mean absolute error: "+str(meanAbsoluteErrorsKNeighborsRegressorAverage))
print("--")
print("Decision Tree Regressor")
print("Median absolute error: "+str(medianAbsoluteErrorsDecisionTreeRegressorAverage))
print("Mean absolute error: "+str(meanAbsoluteErrorsDecisionTreeRegressorAverage))
print("--")
print("Linear Regression")
print("Median absolute error: "+str(medianAbsoluteErrorsLinearRegressionAverage))
print("Mean absolute error: "+str(meanAbsoluteErrorsLinearRegressionAverage))
print("--")
print("RANSAC Regressor")
print("Median absolute error: "+str(medianAbsoluteErrorsRANSACRegressorAverage))
print("Mean absolute error: "+str(meanAbsoluteErrorsRANSACRegressorAverage))

Executing k fold = 1
Executing k fold = 2
Executing k fold = 3
Executing k fold = 4
Executing k fold = 5
--
Random Forest Regressor
Median absolute error: 10.3682931183
Mean absolute error: 28.429692278
--
K Neighbors Regressor
Median absolute error: 11.13
Mean absolute error: 30.7647559882
--
Decision Tree Regressor
Median absolute error: 10.432843812
Mean absolute error: 28.893296753
--
Linear Regression
Median absolute error: 16.4468950152
Mean absolute error: 33.9439732817
--
RANSAC Regressor
Median absolute error: 9.99884623609
Mean absolute error: 32.1491303303


In [26]:
kFoldMedianAbsoluteErrorsRandomForestRegressor = {}
kFoldMedianAbsoluteErrorsRANSACRegressor = {}

kFoldMeanAbsoluteErrorsRandomForestRegressor = {}
kFoldMeanAbsoluteErrorsRANSACRegressor = {}

nClusterGrid = [10,12,14,16,18,20]
nEstimatorsGrid=[10,15]

gridTotalCombinations = len(nClusterGrid)*len(nEstimatorsGrid)
gridCombinationNumber = 1

for nClustersGridValue in nClusterGrid:
    for nEstimatorsGridValue in nEstimatorsGrid:
        gridCombinationString = str(nClustersGridValue)+"-"+str(nEstimatorsGridValue)
        print("Executing grid combination "+str(gridCombinationNumber)+" out of "+str(gridTotalCombinations)+" combinations")
        gridCombinationNumber=gridCombinationNumber+1
            
        kf = KFold(n_splits=5, random_state=1)
        kFoldNumber = 1
        for train_index, test_index in kf.split(outcomesDf):
            print("Executing k fold = "+str(kFoldNumber))
            kFoldNumber=kFoldNumber+1

            #print("TRAIN:", train_index, "TEST:", test_index)
            outcomesDf_train, outcomesDf_test = outcomesDf.iloc[train_index], outcomesDf.iloc[test_index]   

            #train variables
            productDescriptionTrain = outcomesDf_train['desc'].unique()

            productDescriptionToVector={}
            for item in productDescriptionTrain:
                productDescriptionToVector[item] = productDescriptionToVectorDf[item]

            productVectors = list(productDescriptionToVector.values())

            km = KMeans(n_clusters=nClustersGridValue,random_state=1)
            km.fit(productVectors)
            clusters = km.labels_.tolist()
            clusteredCategoriesDf = pd.DataFrame({'desc': list(productDescriptionToVector.keys()), 'cat_cluster': clusters})

            outcomesDfTrainAndCatCluster = outcomesDf_train.merge(clusteredCategoriesDf,how='left')

            clusterIndexToOutcomesDfTrain = {}
            for cluster_index in outcomesDfTrainAndCatCluster['cat_cluster'].unique():
                clusterIndexToOutcomesDfTrain[cluster_index] = outcomesDfTrainAndCatCluster[outcomesDfTrainAndCatCluster['cat_cluster'] == cluster_index]

            clusterIndexToRandomForestRegressor = {}
            clusterIndexToRANSACRegressor = {}

            for cluster_index, outcomesDfClusterIndex in clusterIndexToOutcomesDfTrain.items():
                X_train = outcomesDfClusterIndex[["retail","bidincrement","bidfee","flg_click_only","flg_beginnerauction","flg_fixedprice","flg_endprice"]].values
                y_train = outcomesDfClusterIndex["price"]

                #RandomForestRegressor
                model=RandomForestRegressor(random_state=1,n_estimators=nEstimatorsGridValue)
                model.fit(X_train,y_train)
                clusterIndexToRandomForestRegressor[cluster_index] = model

                #RANSACRegressor
                model = RANSACRegressor(random_state=1)
                model.fit(X_train,y_train)
                clusterIndexToRANSACRegressor[cluster_index] = model

            #test variables

            productDescriptionTest = outcomesDf_test['desc'].unique()

            productDescriptionToVector={}
            for item in productDescriptionTest:
                productDescriptionToVector[item] = productDescriptionToVectorDf[item]

            productDescriptionToTrainCluster = {}
            for productDescription, vector in productDescriptionToVector.items(): 
                clusterCategory = km.predict([vector])
                productDescriptionToTrainCluster[productDescription] = clusterCategory[0]

            clusteredCategoriesDf = pd.DataFrame(list(productDescriptionToTrainCluster.items()), columns=['desc', 'cat_cluster'])

            outcomesDfTestAndCatCluster = outcomesDf_test.merge(clusteredCategoriesDf,how='left')

            realAndPredictedValuesRandomForestRegressor = []
            realAndPredictedValuesRANSACRegressor = []

            for index, row in outcomesDfTestAndCatCluster.iterrows():
                y_test_value = row["price"]
                x_test_value = row[["retail","bidincrement","bidfee","flg_click_only","flg_beginnerauction","flg_fixedprice","flg_endprice"]].values
                clusterIndex = row["cat_cluster"]

                modelRandomForestRegressor = clusterIndexToRandomForestRegressor[clusterIndex]
                modelRANSACRegressor = clusterIndexToRANSACRegressor[clusterIndex]

                realAndPredictedValuesRandomForestRegressor.append((y_test_value,modelRandomForestRegressor.predict([x_test_value])[0]))
                realAndPredictedValuesRANSACRegressor.append((y_test_value,modelRANSACRegressor.predict([x_test_value])[0]))

            #RandomForestRegressor
            y_test, P_price = zip(*realAndPredictedValuesRandomForestRegressor)    
            kFoldMedianAbsoluteError = median_absolute_error(y_test,P_price)
            if gridCombinationString not in kFoldMedianAbsoluteErrorsRandomForestRegressor:
                kFoldMedianAbsoluteErrorsRandomForestRegressor[gridCombinationString] = []
            kFoldMedianAbsoluteErrorsRandomForestRegressor[gridCombinationString].append(kFoldMedianAbsoluteError)    

            kFoldMeanAbsoluteError = mean_absolute_error(y_test,P_price)      
            if gridCombinationString not in kFoldMeanAbsoluteErrorsRandomForestRegressor:
                kFoldMeanAbsoluteErrorsRandomForestRegressor[gridCombinationString] = []
            kFoldMeanAbsoluteErrorsRandomForestRegressor[gridCombinationString].append(kFoldMeanAbsoluteError)

            #RANSACRegressor
            y_test, P_price = zip(*realAndPredictedValuesRANSACRegressor)
            kFoldMedianAbsoluteError = median_absolute_error(y_test,P_price)
            if gridCombinationString not in kFoldMedianAbsoluteErrorsRANSACRegressor:
                kFoldMedianAbsoluteErrorsRANSACRegressor[gridCombinationString] = []
            kFoldMedianAbsoluteErrorsRANSACRegressor[gridCombinationString].append(kFoldMedianAbsoluteError)

            kFoldMeanAbsoluteError = mean_absolute_error(y_test,P_price)
            if gridCombinationString not in kFoldMeanAbsoluteErrorsRANSACRegressor:
                kFoldMeanAbsoluteErrorsRANSACRegressor[gridCombinationString] = []
            kFoldMeanAbsoluteErrorsRANSACRegressor[gridCombinationString].append(kFoldMeanAbsoluteError)

Executing grid combination 1 out of 12 combinations
Executing k fold = 1
Executing k fold = 2
Executing k fold = 3
Executing k fold = 4
Executing k fold = 5
Executing grid combination 2 out of 12 combinations
Executing k fold = 1
Executing k fold = 2
Executing k fold = 3
Executing k fold = 4
Executing k fold = 5
Executing grid combination 3 out of 12 combinations
Executing k fold = 1
Executing k fold = 2
Executing k fold = 3
Executing k fold = 4
Executing k fold = 5
Executing grid combination 4 out of 12 combinations
Executing k fold = 1
Executing k fold = 2
Executing k fold = 3
Executing k fold = 4
Executing k fold = 5
Executing grid combination 5 out of 12 combinations
Executing k fold = 1
Executing k fold = 2
Executing k fold = 3
Executing k fold = 4
Executing k fold = 5
Executing grid combination 6 out of 12 combinations
Executing k fold = 1
Executing k fold = 2
Executing k fold = 3
Executing k fold = 4
Executing k fold = 5
Executing grid combination 7 out of 12 combinations
Execut

In [30]:
for key, value in kFoldMedianAbsoluteErrorsRandomForestRegressor.items():
    print(key)
    
    kFoldMedianAbsoluteErrorsRandomForestRegressorCurrentCombination = kFoldMedianAbsoluteErrorsRandomForestRegressor[key]
    kFoldMeanAbsoluteErrorsRandomForestRegressorCurrentCombination = kFoldMeanAbsoluteErrorsRandomForestRegressor[key]    
    print(np.mean(kFoldMedianAbsoluteErrorsRandomForestRegressorCurrentCombination))
    print(np.mean(kFoldMeanAbsoluteErrorsRandomForestRegressorCurrentCombination))

    kFoldMedianAbsoluteErrorsRANSACRegressorCurrentCombination = kFoldMedianAbsoluteErrorsRANSACRegressor[key]
    kFoldMeanAbsoluteErrorsRANSACRegressorCurrentCombination = kFoldMeanAbsoluteErrorsRANSACRegressor[key]
    print(np.mean(kFoldMedianAbsoluteErrorsRANSACRegressorCurrentCombination))
    print(np.mean(kFoldMeanAbsoluteErrorsRANSACRegressorCurrentCombination))
      
    print("--")

10-10
10.4594567194
28.6962893948
10.2178974538
33.2196202127
--
10-15
10.4213405461
28.6453434962
10.2178974538
33.2196202127
--
12-10
10.4395032143
28.4038919219
10.1100809616
33.0392161377
--
12-15
10.4472204551
28.4161170589
10.1100809616
33.0392161377
--
14-10
10.3468814515
28.5392290926
9.89376069183
32.277718041
--
14-15
10.3531427948
28.4877084596
9.89376069183
32.277718041
--
16-10
10.3705519721
28.4194095854
10.0567676519
31.7937890004
--
16-15
10.3612881889
28.4085002882
10.0567676519
31.7937890004
--
18-10
10.4760550579
28.4587696967
10.1733244581
31.9060245483
--
18-15
10.4508287166
28.4697061745
10.1733244581
31.9060245483
--
20-10
10.4003411201
28.4526045531
9.94284841493
31.8042127359
--
20-15
10.3935962208
28.4609372177
9.94284841493
31.8042127359
--


# Multiple models by category and retail price

In [None]:
outcomesDf = pd.read_csv('./outcomes_clean.tsv',sep='\t')

In [None]:
productDescriptionToVectorDf = pd.read_excel('./productDescriptionToVector.xlsx')

In [17]:
kFoldMedianAbsoluteErrorsRandomForestRegressor = []
kFoldMedianAbsoluteErrorsKNeighborsRegressor = []
kFoldMedianAbsoluteErrorsDecisionTreeRegressor = []
kFoldMedianAbsoluteErrorsLinearRegression = []
kFoldMedianAbsoluteErrorsRANSACRegressor = []

kFoldMeanAbsoluteErrorsRandomForestRegressor = []
kFoldMeanAbsoluteErrorsKNeighborsRegressor = []
kFoldMeanAbsoluteErrorsDecisionTreeRegressor = []
kFoldMeanAbsoluteErrorsLinearRegression = []
kFoldMeanAbsoluteErrorsRANSACRegressor = []

kf = KFold(n_splits=5, random_state=1)
kFoldNumber = 1
for train_index, test_index in kf.split(outcomesDf):
    #5-fold cross-validation
    print("Executing k fold = "+str(kFoldNumber))
    kFoldNumber=kFoldNumber+1
    
    #train and test split
    outcomesDf_train, outcomesDf_test = outcomesDf.iloc[train_index], outcomesDf.iloc[test_index]   
    
    #TRAINING PART
    
    productDescriptionToVector={}
    for item in outcomesDf['desc'].unique():
        #Word2Vec vector obtained for each product description
        productDescriptionToVector[item] = productDescriptionToVectorDf[item]
    
    #DataFrame containing unique combinations of the columns "desc" and "retail" (the same product
    #is auctioned more than once over time, and sometimes it is associated to a different retail price)
    descAndRetailPricesCombinationsTrain = outcomesDf_train.groupby(['desc','retail']).size().reset_index()
    descAndRetailPricesCombinationsTrain = descAndRetailPricesCombinationsTrain.drop(0,axis=1)
    
    #min-max scaler
    scaler = preprocessing.MinMaxScaler()
    #retail prices now range from 0 to 1
    retail_std = scaler.fit_transform(pd.DataFrame(descAndRetailPricesCombinationsTrain['retail']))
    #retail prices now range from -0.5 to 0.5 (in the same one as the Word2Vec vector values)
    retail_std = retail_std - 0.5
    #Google's pre-trained Word2Vec model vector length is 300 features.
    #Since the 300 features are given as input for the clustering, and the
    #retail price has been scaled,the retail price is multiplied by 300
    #so that it has the same importance for the clustering as the
    #300 vector features altogether.
    descAndRetailPricesCombinationsTrain['retail_std'] = retail_std*300
       
    productVectorColumnValuesAndRetail = []
    for index, row in descAndRetailPricesCombinationsTrain.iterrows():
        #iteration through each unique combination of the columns "desc" and "retail"
        productDescription=row['desc']
        auctionProductVector = np.array(productDescriptionToVector[productDescription])
        #the Word2Vec vector for the product description and the retail price will be given
        #as input for the clustering
        productVectorColumnValuesAndRetail.append(np.append(auctionProductVector,row['retail_std']))
    
    km = KMeans(n_clusters=15,random_state=2)
    #clustering with the Word2Vec vectors and retail prices given as input
    km.fit(productVectorColumnValuesAndRetail)
    clusters = km.labels_.tolist()
    descAndRetailPricesCombinationsTrain['cat_cluster'] = clusters
    #the training dataset is merged and now it contains a column with the cluster that
    #each product belongs to.
    outcomesDfTrainAndCatCluster = outcomesDf_train.merge(descAndRetailPricesCombinationsTrain,how='left')

    clusterIndexToOutcomesDfTrain = {}
    for cluster_index in outcomesDfTrainAndCatCluster['cat_cluster'].unique():
        #the rows associated to each different cluster are stored in a dictionary,
        #where the dictionary key is the cluster number
        clusterIndexToOutcomesDfTrain[cluster_index] = outcomesDfTrainAndCatCluster[outcomesDfTrainAndCatCluster['cat_cluster'] == cluster_index]
    
    #dictionaries that contain a trained model for each one of the clusters
    clusterIndexToRandomForestRegressor = {}
    clusterIndexToKNeighborsRegressor = {}
    clusterIndexToDecisionTreeRegressor = {}
    clusterIndexToLinearRegression = {}
    clusterIndexToRANSACRegressor = {}
    
    for cluster_index, outcomesDfClusterIndex in clusterIndexToOutcomesDfTrain.items():
        #For each cluster, a different model is trained using
        #the rows of the input dataset containing the products associated to that cluster
        X_train = outcomesDfClusterIndex[["retail","bidincrement","bidfee","flg_click_only","flg_beginnerauction","flg_fixedprice","flg_endprice"]].values
        y_train = outcomesDfClusterIndex["price"]
        
        #RandomForestRegressor
        model=RandomForestRegressor(random_state=1)
        model.fit(X_train,y_train)
        clusterIndexToRandomForestRegressor[cluster_index] = model

        #KNeighborsRegressor
        n_neighbors=5
        if outcomesDfClusterIndex.shape[0] < n_neighbors:
            #In KNeighborsRegressor it is expected n_neighbors <= n_samples
            n_neighbors = outcomesDfClusterIndex.shape[0]
            
        model = KNeighborsRegressor(n_neighbors=n_neighbors)
        model.fit(X_train,y_train)
        clusterIndexToKNeighborsRegressor[cluster_index] = model

        #DecisionTreeRegressor
        model = DecisionTreeRegressor()
        model.fit(X_train,y_train)
        clusterIndexToDecisionTreeRegressor[cluster_index] = model
    
        #LinearRegression
        model = LinearRegression()
        model.fit(X_train,y_train)
        clusterIndexToLinearRegression[cluster_index] = model
        
        #RANSACRegressor
        model = RANSACRegressor(random_state=1)
        # assume linear model by default
        min_samples = X_train.shape[1] + 1
        if min_samples > X_train.shape[0]:
            #min_samples may not be larger than number X_train.shape[0]
            min_samples = X_train.shape[0]
            model = RANSACRegressor(min_samples=min_samples,random_state=1)
        try:
            model.fit(X_train,y_train)
            clusterIndexToRANSACRegressor[cluster_index] = model
        except Exception as e: 
            print(outcomesDfClusterIndex)
            print(e)
        
    #TEST PART
    
    #DataFrame containing unique combinations of the columns "desc" and "retail" (the same product
    #is auctioned more than once over time, and sometimes it is associated to a different retail price)
    descAndRetailPricesCombinationsTest = outcomesDf_test.groupby(['desc','retail']).size().reset_index()
    descAndRetailPricesCombinationsTest = descAndRetailPricesCombinationsTest.drop(0,axis=1)
    
    #The scaler that is used in the test phase is the one that was obtained during the
    #training phase.
    #The same operations as in the training phase are performed.
    retail_std = scaler.transform(pd.DataFrame(descAndRetailPricesCombinationsTest['retail']))
    retail_std = retail_std - 0.5
    descAndRetailPricesCombinationsTest['retail_std'] = retail_std*300
     
    clusters = []
    for index, row in descAndRetailPricesCombinationsTest.iterrows():
        productDescription=row['desc']
        auctionProductVector = np.array(productDescriptionToVector[productDescription])
        productVectorColumnValuesAndRetail = np.append(auctionProductVector,row['retail_std'])
        #The K-Means model obtained during the training phase is used to associate
        #an existing cluster to each one of the combinations of product vectors and retail
        #prices contained in the test data.   
        clusterCategory = km.predict([productVectorColumnValuesAndRetail])
        clusters.append(clusterCategory[0])
        
    descAndRetailPricesCombinationsTest['cat_cluster'] = clusters
    #the test dataset is merged and now it contains a column with the cluster that
    #each product belongs to.
    outcomesDfTestAndCatCluster = outcomesDf_test.merge(descAndRetailPricesCombinationsTest,how='left')
    
    realAndPredictedValuesRandomForestRegressor = []
    realAndPredictedValuesKNeighborsRegressor = []
    realAndPredictedValuesDecisionTreeRegressor = []
    realAndPredictedValuesLinearRegression = []
    realAndPredictedValuesRANSACRegressor = []
    
    for index, row in outcomesDfTestAndCatCluster.iterrows():
        #iteration through each row of the test dataset
        
        #real value of the column to be predicted
        y_test_value = row["price"]
        #values given as input for the prediction model
        x_test_value = row[["retail","bidincrement","bidfee","flg_click_only","flg_beginnerauction","flg_fixedprice","flg_endprice"]].values
        #product category cluster for this row of the dataset
        clusterIndex = row["cat_cluster"]

        #the prediction models associated to this cluster are extracted
        modelRandomForestRegressor = clusterIndexToRandomForestRegressor[clusterIndex]
        modelKNeighborsRegressor = clusterIndexToKNeighborsRegressor[clusterIndex]
        modelDecisionTreeRegressor = clusterIndexToDecisionTreeRegressor[clusterIndex]
        modelLinearRegression = clusterIndexToLinearRegression[clusterIndex]
        
        realAndPredictedValuesRandomForestRegressor.append((y_test_value,modelRandomForestRegressor.predict([x_test_value])[0]))
        realAndPredictedValuesKNeighborsRegressor.append((y_test_value,modelKNeighborsRegressor.predict([x_test_value])[0]))
        realAndPredictedValuesDecisionTreeRegressor.append((y_test_value,modelDecisionTreeRegressor.predict([x_test_value])[0]))
        realAndPredictedValuesLinearRegression.append((y_test_value,modelLinearRegression.predict([x_test_value])[0]))
        
        if clusterIndex not in clusterIndexToRANSACRegressor:
            print("clusterIndex not in clusterIndexToRANSACRegressor")
            continue
        modelRANSACRegressor = clusterIndexToRANSACRegressor[clusterIndex]
        realAndPredictedValuesRANSACRegressor.append((y_test_value,modelRANSACRegressor.predict([x_test_value])[0]))
    
    #RandomForestRegressor
    #all real and predicted values are extracted and the metrics are calculated
    y_test, P_price = zip(*realAndPredictedValuesRandomForestRegressor)    
    kFoldMedianAbsoluteError = median_absolute_error(y_test,P_price)
    kFoldMedianAbsoluteErrorsRandomForestRegressor.append(kFoldMedianAbsoluteError) 
    kFoldMeanAbsoluteError = mean_absolute_error(y_test,P_price)
    kFoldMeanAbsoluteErrorsRandomForestRegressor.append(kFoldMeanAbsoluteError) 
    
    #KNeighborsRegressor
    #all real and predicted values are extracted and the metrics are calculated
    y_test, P_price = zip(*realAndPredictedValuesKNeighborsRegressor)    
    kFoldMedianAbsoluteError = median_absolute_error(y_test,P_price)
    kFoldMedianAbsoluteErrorsKNeighborsRegressor.append(kFoldMedianAbsoluteError) 
    kFoldMeanAbsoluteError = mean_absolute_error(y_test,P_price)
    kFoldMeanAbsoluteErrorsKNeighborsRegressor.append(kFoldMeanAbsoluteError) 
    
    #DecisionTreeRegressor
    #all real and predicted values are extracted and the metrics are calculated
    y_test, P_price = zip(*realAndPredictedValuesDecisionTreeRegressor)    
    kFoldMedianAbsoluteError = median_absolute_error(y_test,P_price)
    kFoldMedianAbsoluteErrorsDecisionTreeRegressor.append(kFoldMedianAbsoluteError)
    kFoldMeanAbsoluteError = mean_absolute_error(y_test,P_price)
    kFoldMeanAbsoluteErrorsDecisionTreeRegressor.append(kFoldMeanAbsoluteError) 
    
    #LinearRegression
    #all real and predicted values are extracted and the metrics are calculated
    y_test, P_price = zip(*realAndPredictedValuesLinearRegression)    
    kFoldMedianAbsoluteError = median_absolute_error(y_test,P_price)
    kFoldMedianAbsoluteErrorsLinearRegression.append(kFoldMedianAbsoluteError)
    kFoldMeanAbsoluteError = mean_absolute_error(y_test,P_price)
    kFoldMeanAbsoluteErrorsLinearRegression.append(kFoldMeanAbsoluteError) 
    
    #RANSACRegressor
    #all real and predicted values are extracted and the metrics are calculated
    y_test, P_price = zip(*realAndPredictedValuesRANSACRegressor)    
    kFoldMedianAbsoluteError = median_absolute_error(y_test,P_price)
    kFoldMedianAbsoluteErrorsRANSACRegressor.append(kFoldMedianAbsoluteError)
    kFoldMeanAbsoluteError = mean_absolute_error(y_test,P_price)
    kFoldMeanAbsoluteErrorsRANSACRegressor.append(kFoldMeanAbsoluteError) 
    
medianAbsoluteErrorsRandomForestRegressorAverage = np.mean(kFoldMedianAbsoluteErrorsRandomForestRegressor)
meanAbsoluteErrorsRandomForestRegressorAverage = np.mean(kFoldMeanAbsoluteErrorsRandomForestRegressor)
medianAbsoluteErrorsKNeighborsRegressorAverage = np.mean(kFoldMedianAbsoluteErrorsKNeighborsRegressor)
meanAbsoluteErrorsKNeighborsRegressorAverage = np.mean(kFoldMeanAbsoluteErrorsKNeighborsRegressor)
medianAbsoluteErrorsDecisionTreeRegressorAverage = np.mean(kFoldMedianAbsoluteErrorsDecisionTreeRegressor)
meanAbsoluteErrorsDecisionTreeRegressorAverage = np.mean(kFoldMeanAbsoluteErrorsDecisionTreeRegressor)
medianAbsoluteErrorsLinearRegressionAverage = np.mean(kFoldMedianAbsoluteErrorsLinearRegression)
meanAbsoluteErrorsLinearRegressionAverage = np.mean(kFoldMeanAbsoluteErrorsLinearRegression)
medianAbsoluteErrorsRANSACRegressorAverage = np.mean(kFoldMedianAbsoluteErrorsRANSACRegressor)
meanAbsoluteErrorsRANSACRegressorAverage = np.mean(kFoldMeanAbsoluteErrorsRANSACRegressor)

print("--")
print("Random Forest Regressor")
print("Median absolute error: "+str(medianAbsoluteErrorsRandomForestRegressorAverage))
print("Mean absolute error: "+str(meanAbsoluteErrorsRandomForestRegressorAverage))
print("--")
print("K Neighbors Regressor")
print("Median absolute error: "+str(medianAbsoluteErrorsKNeighborsRegressorAverage))
print("Mean absolute error: "+str(meanAbsoluteErrorsKNeighborsRegressorAverage))
print("--")
print("Decision Tree Regressor")
print("Median absolute error: "+str(medianAbsoluteErrorsDecisionTreeRegressorAverage))
print("Mean absolute error: "+str(meanAbsoluteErrorsDecisionTreeRegressorAverage))
print("--")
print("Linear Regression")
print("Median absolute error: "+str(medianAbsoluteErrorsLinearRegressionAverage))
print("Mean absolute error: "+str(meanAbsoluteErrorsLinearRegressionAverage))
print("--")
print("RANSAC Regressor")
print("Median absolute error: "+str(medianAbsoluteErrorsRANSACRegressorAverage))
print("Mean absolute error: "+str(meanAbsoluteErrorsRANSACRegressorAverage))

Executing k fold = 1
       auction_id  product_id                                      item  \
71144      217223    10013607  2009-mini-cooper-chili-red-and-black-con   

                                                   desc   retail    price  \
71144  2009 Mini Cooper Chili Red and Black Convertible  24550.0  3939.36   

       finalprice  bidincrement  bidfee winner     ...       flg_click_only  \
71144     3939.36          0.12     0.6  CaCO3     ...                    0   

       flg_beginnerauction flg_fixedprice  flg_endprice  bids_placed  \
71144                    0              0             0      32828.0   

       swoopo_sale_price  swoopo_profit  winner_benefit  retail_std  \
71144       22739.251441   -1810.748559        19017.64       150.0   

       cat_cluster  
71144            1  

[1 rows x 23 columns]
RANSAC could not find a valid consensus set. All `max_trials` iterations were skipped because each randomly chosen sub-sample failed the passing criteria. See es