# Dublin R El Dorado Competition

## Quick Summary

5,000 data points with pseudo-geological information (including proven gold reserves).

50 (or more) new sites up for auction with limited data.

Each team starts with $50,000,000.00 budget to bid with.

Blind, sealed-bid auctions for rights to mine the parcel of land (min bid $100,000.00)

Auctions happen in order by parcel_id

Extraction costs are non-trivial.

Winning team has most cash at the end.

The link for the [competition](http://www.kaybensoft.com/dublinr/)

In [1]:
import numpy as np
import pandas as pd

In [2]:
#Load data
costs_data = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\elDorado\costs_data.csv")
elevation_data = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\elDorado\elevation_data.csv")
sample_data = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\elDorado\sample_data.csv")

In [3]:
#merge sample_data and elevation datasets on parcel_id
train = pd.merge(sample_data,elevation_data, on='parcel_id', how = 'inner',suffixes=('_left', '_right'))
#drop duplicate
train.rename(columns={'Easting_left': 'Easting','Northing_left': 'Northing'}, inplace=True)
train = train.drop(['Easting_right','Northing_right'],axis=1)

In [4]:
#Join the the test and train dataframes
def get_combined_data():
    # reading train data
    train = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\elDorado\train_data.csv")

    
    # reading test data
    test = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\elDorado\auction_parcels.csv")

    # extracting and then removing the targets from the training data 
    targets = train.gold_available
    train.drop('gold_available',1,inplace=True)
    

    # merging train data and test data for future feature engineering
    combined = train.append(test)
    combined.reset_index(inplace=True)
    combined.drop('index',inplace=True,axis=1)
   
    return combined
combined = get_combined_data()

#### Same as total in sample_data and auctin_parcels

In [7]:
#drop unnamed column
combined.drop('Unnamed: 0',inplace=True,axis=1)

In [10]:
combined.tail(2)

Unnamed: 0,Arkonor,Bistot,Crokite,Easting,Gneiss,Hedbergite,Isogen,Megacyte,Mercoxit,Mexallon,Nocxium,Northing,Plagioclase,Pyerite,Spudumain,Tritanium,Veldspar,elevation,parcel_id
5048,103.044603,105.965291,104.451764,106.5,,,,,108.523457,,,146.5,,,,,,185.783466,22007
5049,100.591409,119.167135,114.396146,63.5,,,,,117.289582,,,148.5,,,,,,-128.664883,22264


In [9]:
combined.shape

(5050, 19)

Fill in the NaN values with mean. (mean and median is the same)

In [11]:
combined["Gneiss"].fillna(combined["Gneiss"].mean(), inplace=True)
combined["Hedbergite"].fillna(combined["Hedbergite"].mean(), inplace=True)
combined["Isogen"].fillna(combined["Isogen"].mean(), inplace=True)
combined["Mexallon"].fillna(combined["Mexallon"].mean(), inplace=True)
combined["Nocxium"].fillna(combined["Nocxium"].mean(), inplace=True)
combined["Plagioclase"].fillna(combined["Plagioclase"].mean(), inplace=True)
combined["Pyerite"].fillna(combined["Pyerite"].mean(), inplace=True)
combined["Spudumain"].fillna(combined["Spudumain"].mean(), inplace=True)
combined["Tritanium"].fillna(combined["Tritanium"].mean(), inplace=True)
combined["Veldspar"].fillna(combined["Veldspar"].mean(), inplace=True)
combined["Megacyte"].fillna(combined["Megacyte"].mean(), inplace=True)

In [12]:
print('Reading Training data')
print('\nSize of Training  data: ' + str(combined.shape))
print('Columns:' + str(combined.columns.values))

print('dtypes')
print('\n')
print(combined.dtypes)
print('\n')
print('Info: ')
print('\n')
print(combined.info)
print('Shape: ')
print('\n')
print(combined.shape)
print('\n')
print('numerical columns statistcs')
print('\n')
print(combined.describe())
import re
# Review input features (train set) - Part 2A
missing_values = []
nonumeric_values = []

print ("========================\n")

for column in combined:
    # Find all the unique feature values
    uniq = combined[column].unique()
    print ("'{}' has {} unique values" .format(column,uniq.size))
    
    
    # Find features with missing values
    if (True in pd.isnull(uniq)):
        s = "{} has {} missing" .format(column, pd.isnull(combined[column]).sum())
        missing_values.append(s)
    
    # Find features with non-numeric values
    for i in range (1, np.prod(uniq.shape)):
        if (re.match('nan', str(uniq[i]))):
            break
        if not (re.search('(^\d+\.?\d*$)|(^\d*\.?\d+$)', str(uniq[i]))):
            nonumeric_values.append(column)
            break
  
print ("\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")
print ("Features with missing values:\n{}\n\n" .format(missing_values))
print ("Features with non-numeric values:\n{}" .format(nonumeric_values))
print ("\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")

Reading Training data

Size of Training  data: (5050, 19)
Columns:['Arkonor' 'Bistot' 'Crokite' 'Easting' 'Gneiss' 'Hedbergite' 'Isogen'
 'Megacyte' 'Mercoxit' 'Mexallon' 'Nocxium' 'Northing' 'Plagioclase'
 'Pyerite' 'Spudumain' 'Tritanium' 'Veldspar' 'elevation' 'parcel_id']
dtypes


Arkonor        float64
Bistot         float64
Crokite        float64
Easting        float64
Gneiss         float64
Hedbergite     float64
Isogen         float64
Megacyte       float64
Mercoxit       float64
Mexallon       float64
Nocxium        float64
Northing       float64
Plagioclase    float64
Pyerite        float64
Spudumain      float64
Tritanium      float64
Veldspar       float64
elevation      float64
parcel_id        int64
dtype: object


Info: 


<bound method DataFrame.info of          Arkonor      Bistot     Crokite  Easting      Gneiss  Hedbergite  \
0     105.515728   94.116871  108.024488      3.5  114.181143  130.352913   
1     116.793372  130.856900  113.132975      8.5  119.666173  128

### Feature Extraction

#### How to distinguish between classification and regression?

* if you have y as floats, but only a finite number of different values can be obtained, and all of them are obtained in training set, then this is classification - just convert your values to strings or integers and you are good to go. Or if the values in target column are discrete.

* if you have y as floats, and this are actuall real values, and you can have plenty of values, even not seen in the training set and you expect your model to somehow "interpolate" this is regression and you are supposed to use DecisionTreeRegressor instead.Or if the values in target column are continuous.

In [75]:
#Recover the train set and the test set from the combined dataset
def recover_train_test_target():
    global combined
    
    train0 = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\elDorado\sample_data.csv")
    #classifier can only have integer but no floats..
    targets = train0.gold_available
    train = combined.ix[0:4999]
    test = combined.ix[5000:]
    
    return train,test,targets

In [76]:
train,test,targets = recover_train_test_target()

In [93]:
#Tree-based estimators can be used to compute feature importances, which in turn can be used to discard irrelevant features.
from sklearn.tree import DecisionTreeRegressor
from sklearn.feature_selection import SelectFromModel
clf = DecisionTreeRegressor(max_depth=200,max_features='auto',random_state=2)
clf = clf.fit(train, targets)

In [94]:
#Let's have a look at the importance of each feature.
features = pd.DataFrame()
features['feature'] = train.columns
features['importance'] = clf.feature_importances_

In [98]:
features.sort_values (by = 'importance',ascending=False)

Unnamed: 0,feature,importance
7,Megacyte,0.320334
12,Plagioclase,0.242324
15,Tritanium,0.094173
1,Bistot,0.073765
10,Nocxium,0.073336
5,Hedbergite,0.031163
4,Gneiss,0.025366
0,Arkonor,0.019466
2,Crokite,0.018449
13,Pyerite,0.016798


In [96]:
#Let's now transform our train set and test set in a more compact datasets.
model = SelectFromModel(clf, prefit=True)
train_new = model.transform(train)
train_new.shape

(5000, 5)

In [97]:
test_new = model.transform(test)
test_new.shape


(50, 5)

In [82]:
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(n_estimators = 200,)
forest.fit(train_new, targets)


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=200, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [88]:
submission_file = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\elDorado\output.csv")

The selling price for gold is 1500 per ounce as per the competition, Variable cost is calculated with (Elevation_Level / Threshhold) * gold_available. Profit is Total Selling price - Total Cost. I decided to calculate the ratio for profit from total profit and for bidding amount I multiplied total amount (50000000) * profit ratio. So the business can maximize ROI for their money accordingly.


In [89]:
submission_file

Unnamed: 0,parcel_id,gold_available,Total Selling Price,elevation,Fixed Cost,Variable Cost,Total Cost,Profit,Profit Ratio,bid_amount
0,515,11282.37,16923556.29,23.19,3000000,4720543.61,7720543.61,9203012.68,0.024964,1248208.72
1,914,11481.55,17222318.01,784.99,8000000,3181537.51,11181537.51,6040780.51,0.016386,819313.76
2,1538,11354.35,17031522.28,1258.73,8000000,3146290.39,11146290.39,5885231.89,0.015964,798216.63
3,1790,11349.07,17023599.53,474.18,3000000,4748450.89,7748450.89,9275148.64,0.02516,1257992.55
4,2416,11725.36,17588046.58,1584.86,8000000,3249097.26,11249097.26,6338949.33,0.017195,859754.53
5,3332,7495.1,11242648.24,209.17,3000000,3204155.25,6204155.25,5038492.99,0.013667,683373.05
6,3697,10478.7,15718050.71,152.18,3000000,4384288.08,7384288.08,8333762.63,0.022606,1130311.94
7,4006,15647.48,23471223.65,1128.92,8000000,4101875.11,12101875.11,11369348.54,0.030841,1542029.82
8,4125,10392.41,15588607.82,-770.37,4000000,4539404.69,8539404.69,7049203.13,0.019122,956086.57
9,4577,10414.18,15621273.62,-198.78,4000000,4548913.82,8548913.82,7072359.8,0.019185,959227.32


In [99]:
#select the columns 
submission_file = submission_file[['parcel_id','bid_amount']]


In [100]:
submission_file

Unnamed: 0,parcel_id,bid_amount
0,515,1248208.72
1,914,819313.76
2,1538,798216.63
3,1790,1257992.55
4,2416,859754.53
5,3332,683373.05
6,3697,1130311.94
7,4006,1542029.82
8,4125,956086.57
9,4577,959227.32


In [92]:
#export the file for submission
submission_file.to_csv('submission.csv')

###### For future, I need to know more about Vickrey Auction method.
Finally, any and all suggestions for improvements are more than welcome. 