# Dublin R El Dorado Competition

Competition page: [Dublin R El Dorado competition](http://www.kaybensoft.com/dublinr/)

## Quick Summary

- 5,000 data points with pseudo-geological information (including proven gold reserves).
- 50 (or more) new sites up for auction with limited data.
- Each team starts with a budget of $50,000,000.00 to bid with.
- Blind, sealed-bid auctions for the rights to mine each parcel of land (minimum bid $100,000.00).
- Auctions happen in order of parcel_id.
- Extraction costs are non-trivial.
- The winning team is the one with the most cash at the end.
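
The budget and minimum-bid rules can be turned into a quick sanity check on a set of planned bids. The sketch below is illustrative only: `check_bids` and the example bids are hypothetical, and it assumes the conservative case in which every bid wins, so the sum of all bids must fit within the budget. The summary does not say how the organisers treat overspending.

```python
MIN_BID = 100_000      # minimum bid per parcel, from the summary above
BUDGET = 50_000_000    # starting budget per team, from the summary above


def check_bids(bids):
    """Return a list of problems found in a {parcel_id: bid_amount} mapping."""
    problems = []
    for parcel_id, amount in bids.items():
        if amount < MIN_BID:
            problems.append(f"parcel {parcel_id}: bid {amount} is below the minimum")
    total = sum(bids.values())
    # Assumption: plan for the worst case where every bid wins
    if total > BUDGET:
        problems.append(f"total of {total} exceeds the budget")
    return problems


# Hypothetical bids, for illustration only
example_bids = {101: 250_000, 102: 50_000, 103: 1_000_000}
print(check_bids(example_bids))
```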



In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd

## Loading all the data

In [2]:
#auction_parcel = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\elDorado\auction_parcels.csv")
costs_data = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\elDorado\costs_data.csv")
elevation_data = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\elDorado\elevation_data.csv")
sample_data = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\elDorado\sample_data.csv")

## Let us join the sample_data and elevation datasets on parcel_id

Joining adds the elevation to each sampled parcel, so it can be used when predicting the gold. Both the fixed and variable costs depend on finding the gold, so elevation is a big deal. Note that the elevation table has 10,000 rows while the sample table has 5,000, so the inner join keeps only the parcels present in both.

In [3]:
train = pd.merge(sample_data,elevation_data, on='parcel_id', how = 'inner',suffixes=('_left', '_right'))
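
To see what this join produces, here is a small sketch on toy frames (the values are made up and the frame names are hypothetical). It shows that an inner join keeps only the shared `parcel_id` values and that overlapping column names receive the `_left`/`_right` suffixes. It also shows an alternative that may avoid the duplicated columns altogether by joining on all three keys, assuming the coordinates agree exactly in both tables.

```python
import pandas as pd

# Toy stand-ins for sample_data and elevation_data (values are made up)
samples = pd.DataFrame({"parcel_id": [4, 9], "Easting": [3.5, 8.5], "Northing": [0.5, 0.5]})
elevations = pd.DataFrame({
    "parcel_id": [4, 9, 12],
    "Easting": [3.5, 8.5, 1.5],
    "Northing": [0.5, 0.5, 1.5],
    "elevation": [296.8, 102.6, 55.0],
})

# Same call shape as above; validate raises if parcel_id is not unique on both sides
merged = pd.merge(samples, elevations, on="parcel_id", how="inner",
                  suffixes=("_left", "_right"), validate="one_to_one")
print(merged.columns.tolist())
print(len(merged))

# Alternative: join on all shared keys so no suffixed duplicates appear
merged_keys = pd.merge(samples, elevations,
                       on=["parcel_id", "Easting", "Northing"], how="inner")
print(merged_keys.columns.tolist())
```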

Let us drop the duplicated coordinate columns and rename the remaining ones to Easting and Northing.

In [4]:
train.rename(columns={'Easting_left': 'Easting','Northing_left': 'Northing'}, inplace=True)

In [5]:
train = train.drop(['Easting_right','Northing_right'],axis=1)

In [6]:
train.head(2)

Unnamed: 0,parcel_id,Easting,Northing,Pyerite,Mexallon,Tritanium,Megacyte,Nocxium,Isogen,Veldspar,Plagioclase,Hedbergite,Spudumain,Gneiss,Arkonor,Mercoxit,Bistot,Crokite,gold_available,elevation
0,4,3.5,0.5,133.271693,108.255699,95.881135,129.69544,111.145532,138.742495,109.816062,148.29849,130.352913,145.596832,114.181143,105.515728,107.604397,94.116871,108.024488,3901.517878,296.809685
1,9,8.5,0.5,144.446857,110.173244,97.841036,104.65559,104.429629,123.081045,104.578077,117.521651,128.958155,126.251966,119.666173,116.793372,109.626584,130.8569,113.132975,17523.530328,102.632863


## Export the joined dataset

In [7]:
train.to_csv('train_data.csv')

## Join the test and train dataframes along rows

In [20]:
def get_combined_data():
    # reading train data
    train = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\elDorado\train_data.csv")

    
    # reading test data
    test = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\elDorado\auction_parcels.csv")

    # removing the target column from the training data
    train = train.drop(columns='gold_available')

    # stacking train and test rows for future feature engineering;
    # sort=True sorts the columns when the two frames do not line up,
    # matching the alphabetical column order in the output further down
    combined = pd.concat([train, test], ignore_index=True, sort=True)

    return combined

In [21]:
combined = get_combined_data()
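
Because the train and test rows are stacked into one frame and later separated again by row position, it can help to record where each row came from. The sketch below is one possible approach on toy data (the frames and values are hypothetical): `pd.concat` with `keys` builds a two-level index, so each part can be recovered by name rather than by counting rows.

```python
import pandas as pd

# Toy stand-ins for the train and test frames (values are made up)
train = pd.DataFrame({"parcel_id": [4, 9], "Gneiss": [114.2, 119.7], "elevation": [296.8, 102.6]})
test = pd.DataFrame({"parcel_id": [6875, 5712], "Gneiss": [101.0, None], "elevation": [798.47, 169.53]})

# keys adds an outer index level that labels the origin of every row
combined = pd.concat([train, test], keys=["train", "test"], names=["source", "row"])

# Each part can be selected by its label, whatever happens to the row count in between
train_part = combined.loc["train"]
test_part = combined.loc["test"]
print(train_part.shape, test_part.shape)
```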

In [23]:
combined.drop('Unnamed: 0',inplace=True,axis=1)

In [25]:
combined = combined.drop_duplicates()

In [37]:
# Fill missing values in these mineral columns with each column's mean
mineral_cols = ["Gneiss", "Hedbergite", "Isogen", "Mexallon", "Nocxium", "Plagioclase",
                "Pyerite", "Spudumain", "Tritanium", "Veldspar", "Megacyte"]
combined[mineral_cols] = combined[mineral_cols].fillna(combined[mineral_cols].mean())
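
The means above are computed over the combined frame, so the auction parcels contribute to the fill values. A common variation, shown here as a sketch on toy data (the frames and values are hypothetical), is to compute the means on the training rows only and reuse them for the test rows.

```python
import numpy as np
import pandas as pd

# Toy frames; the values are made up
train = pd.DataFrame({"Gneiss": [110.0, 120.0], "Isogen": [130.0, 140.0]})
test = pd.DataFrame({"Gneiss": [np.nan, 100.0], "Isogen": [150.0, np.nan]})

# Column means from the training rows only
train_means = train.mean()

# fillna with a Series fills each column using the entry that matches its name
test_filled = test.fillna(train_means)
print(test_filled)
```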

In [40]:
print('Reading Training data')
print('\nSize of Training  data: ' + str(combined.shape))
print('Columns:' + str(combined.columns.values))

print('dtypes')
print('\n')
print(combined.dtypes)
print('\n')
print('Info: ')
print('\n')
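# Without parentheses this prints the bound method's repr (which embeds the frame);
# call combined.info() to get the per-column summary instead
print(combined.info)
print('Shape: ')
print('\n')
print(combined.shape)
print('\n')
print('numerical columns statistics')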
print('\n')
print(combined.describe())
import re
# Review input features (train set) - Part 2A
missing_values = []
nonumeric_values = []

print ("========================\n")

for column in combined:
    # Find all the unique feature values
    uniq = combined[column].unique()
    print ("'{}' has {} unique values" .format(column,uniq.size))
    
    
    # Find features with missing values
    if (True in pd.isnull(uniq)):
        s = "{} has {} missing" .format(column, pd.isnull(combined[column]).sum())
        missing_values.append(s)
    
    # Find features with non-numeric values
    # (the pattern accepts an optional minus sign, digits and an optional decimal part)
    for i in range(uniq.size):
        if (re.match('nan', str(uniq[i]))):
            break
        if not (re.search(r'(^-?\d+\.?\d*$)|(^-?\d*\.?\d+$)', str(uniq[i]))):
            nonumeric_values.append(column)
            break
  
print ("\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")
print ("Features with missing values:\n{}\n\n" .format(missing_values))
print ("Features with non-numeric values:\n{}" .format(nonumeric_values))
print ("\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")

Reading Training data

Size of Training  data: (5053, 19)
Columns:['Arkonor' 'Bistot' 'Crokite' 'Easting' 'Gneiss' 'Hedbergite' 'Isogen'
 'Megacyte' 'Mercoxit' 'Mexallon' 'Nocxium' 'Northing' 'Plagioclase'
 'Pyerite' 'Spudumain' 'Tritanium' 'Veldspar' 'elevation' 'parcel_id']
dtypes


Arkonor        float64
Bistot         float64
Crokite        float64
Easting        float64
Gneiss         float64
Hedbergite     float64
Isogen         float64
Megacyte       float64
Mercoxit       float64
Mexallon       float64
Nocxium        float64
Northing       float64
Plagioclase    float64
Pyerite        float64
Spudumain      float64
Tritanium      float64
Veldspar       float64
elevation      float64
parcel_id        int64
dtype: object


Info: 


<bound method DataFrame.info of          Arkonor      Bistot     Crokite  Easting      Gneiss  Hedbergite  \
0     105.515728   94.116871  108.024488      3.5  114.181143  130.352913   
1     116.793372  130.856900  113.132975      8.5  119.666173  128

Elevation has some negative values. Northing and Easting have 150 unique values and are associated with elevation. There are no missing values.
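
The same checks (missing values, non-numeric columns, negative values, unique counts) can also be expressed with pandas' built-in helpers. This is a minimal sketch on a toy frame (hypothetical data), not a replacement for the review loop above.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value, one negative value and one text column
frame = pd.DataFrame({
    "parcel_id": [4, 9, 12],
    "elevation": [296.8, -178.9, np.nan],
    "label": ["a", "b", "c"],
})

missing_per_column = frame.isna().sum()                                  # count of NaN per column
non_numeric = frame.select_dtypes(exclude="number").columns.tolist()     # columns that are not numeric
negative_counts = (frame.select_dtypes(include="number") < 0).sum()      # negatives per numeric column
unique_counts = frame.nunique()                                          # distinct non-missing values

print(missing_per_column)
print(non_numeric)
print(negative_counts)
print(unique_counts)
```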

In [42]:
# Use every column as a predictor except parcel_id, which we only need for the submission.
    
predictors = list(combined.columns)
predictors.remove('parcel_id')
    

In [43]:
# Let's standardize all of them to zero mean and unit variance.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(combined[predictors])

train_data_scaled = scaler.transform(combined[predictors])
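
For reference, `StandardScaler` centres each column and divides by its standard deviation, so the result is not confined to the unit interval; rescaling to [0, 1] is what scikit-learn's `MinMaxScaler` does. The sketch below reproduces both transformations with plain NumPy on a few elevation values from this notebook's tables, purely to show the difference.

```python
import numpy as np

# A few elevation values taken from the tables in this notebook
values = np.array([-178.9, 169.53, 798.47, 1304.02])

# Standardization: zero mean, unit variance (np.std uses ddof=0 by default)
standardized = (values - values.mean()) / values.std()

# Min-max scaling: maps the smallest value to 0 and the largest to 1
min_max = (values - values.min()) / (values.max() - values.min())

print(standardized)
print(min_max)
```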

## Modelling

In [47]:
#Let's start by importing the useful libraries.

from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest
# cross-validation and grid-search helpers live in sklearn.model_selection
from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_val_score

In [48]:
#To evaluate our model we'll be using a 5-fold cross validation with the Accuracy metric.
def compute_score(clf, X, y,scoring='accuracy'):
    xval = cross_val_score(clf, X, y, cv = 5,scoring=scoring)
    return np.mean(xval)
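
Since `gold_available` is a continuous quantity, one alternative worth considering is to score a regressor with an error metric rather than a classifier with accuracy. The sketch below swaps in that different technique and is not what this notebook does: `compute_regression_score` is a hypothetical helper, the data are synthetic, and the choice of `RandomForestRegressor` with mean absolute error is an assumption made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data: five features, target driven mostly by the first one
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)


def compute_regression_score(reg, X, y, scoring="neg_mean_absolute_error"):
    """Mean 5-fold cross-validated error (the sign is flipped so lower is better)."""
    scores = cross_val_score(reg, X, y, cv=5, scoring=scoring)
    return -np.mean(scores)


mae = compute_regression_score(RandomForestRegressor(n_estimators=20, random_state=0), X, y)
print(mae)
```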


In [124]:
#Recover the train set and the test set from the combined dataset
def recover_train_test_target():
    global combined
    
    train0 = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\elDorado\sample_data.csv")
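    # The classifier needs discrete class labels, so the float targets are cast to 6-byte strings
    targets = np.asarray(train0.gold_available, dtype="|S6") 
    
    # .loc slices by index label and includes both endpoints
    train = combined.loc[0:4999]
    test = combined.loc[5000:]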
    
    return train,test,targets

In [125]:
train,test,targets = recover_train_test_target()
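
The split above relies on index labels, which matters because `drop_duplicates` was applied earlier and can leave gaps in the index. The toy sketch below (hypothetical data) shows how label-based and position-based slicing differ, and how a label slice returns fewer rows once some labels are missing.

```python
import pandas as pd

frame = pd.DataFrame({"value": range(6)})

# Label-based slicing includes the end label; positional slicing excludes the end position
by_label = frame.loc[0:3]
by_position = frame.iloc[0:3]
print(len(by_label), len(by_position))

# After removing a row, the same label slice covers fewer rows
filtered = frame[frame["value"] != 1]
print(filtered.loc[0:3]["value"].tolist())
print(filtered.iloc[0:4]["value"].tolist())
```

If a fixed number of rows is required regardless of gaps, positional slicing with `iloc` (or resetting the index first) is the safer option.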

In [126]:
#Tree-based estimators can be used to compute feature importances, which in turn can be used to discard irrelevant features.
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
clf = ExtraTreesClassifier(n_estimators=50)
clf = clf.fit(train, targets)

In [127]:
#Let's have a look at the importance of each feature.
features = pd.DataFrame()
features['feature'] = train.columns
features['importance'] = clf.feature_importances_

In [128]:
features.sort_values(['importance'],ascending=False)

Unnamed: 0,feature,importance
16,Veldspar,0.053529
7,Megacyte,0.053498
13,Pyerite,0.053395
6,Isogen,0.053193
18,parcel_id,0.05314
15,Tritanium,0.053132
10,Nocxium,0.053014
17,elevation,0.052772
1,Bistot,0.052686
12,Plagioclase,0.052602
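
One way to read these numbers is to compare them with the share each feature would get if importance were spread evenly. Tree-ensemble importances in scikit-learn sum to 1, so with the 19 columns reported above the even share is 1/19. The sketch below does that arithmetic for the ten values listed; ratios close to 1 would suggest that no feature stands out for this target encoding, though that reading is an interpretation rather than something the notebook states.

```python
n_features = 19                    # number of columns reported for `combined` above
uniform_share = 1 / n_features     # importance per feature if all were equal

# The ten importances listed in the output above
top_importances = {
    "Veldspar": 0.053529, "Megacyte": 0.053498, "Pyerite": 0.053395,
    "Isogen": 0.053193, "parcel_id": 0.05314, "Tritanium": 0.053132,
    "Nocxium": 0.053014, "elevation": 0.052772, "Bistot": 0.052686,
    "Plagioclase": 0.052602,
}

ratios = {name: value / uniform_share for name, value in top_importances.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f} x uniform share")
```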


In [95]:
# Let's now transform our train set and test set into more compact datasets.
model = SelectFromModel(clf, prefit=True)
train_new = model.transform(train)
train_new.shape

(5000, 10)

In [96]:
test_new = model.transform(test)
test_new.shape

(53, 10)
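
As far as I understand it, when no threshold is given `SelectFromModel` keeps the features whose importance is at or above the mean importance (for estimators without an L1 penalty). The sketch below mimics that rule with NumPy on hypothetical importances; the names and numbers are made up for illustration and do not come from the fitted model.

```python
import numpy as np

# Hypothetical feature names and importances, for illustration only
feature_names = np.array(["Veldspar", "Megacyte", "Arkonor", "Easting"])
importances = np.array([0.30, 0.28, 0.22, 0.20])

# Keep the features at or above the mean importance
threshold = importances.mean()
keep = importances >= threshold
print(feature_names[keep].tolist())
```

When the importances are almost equal, as in the table above, which features land above the mean can change from one fit to the next unless `random_state` is fixed.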

In [101]:
import warnings
warnings.filterwarnings('ignore') 

# scikit-learn import

from sklearn.ensemble import RandomForestClassifier


## Evaluate Algorithms

In [117]:
# Random forests are quite handy. They do, however, come with some parameters to tweak in order to get an optimal model for the prediction task.
forest = RandomForestClassifier(max_features='sqrt')

forest.fit(train_new, targets)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [119]:
output = forest.predict(test_new)
df_output = pd.DataFrame()
df_output['parcel_id'] = test['parcel_id']
df_output['gold_available'] = output
df_output[['parcel_id','gold_available']].to_csv('output.csv',index=False)

In [2]:
submission_file = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\elDorado\output_sorted.csv")

In [3]:
submission_file

Unnamed: 0,parcel_id,gold_available,Total Price of Gold,Elevation,Fixed Cost,Variable Cost,Total Cost,Profit,Ratio from total profit,bid_amount
0,6875,20658.0,30987000.0,798.47,800000.0,5239901.7,6039901.7,24947100.0,0.066711,3335526.83
1,5712,18109.0,27163500.0,169.53,3000000.0,7393502.278,10393502.28,16770000.0,0.044844,2242215.77
2,16067,17502.0,26253000.0,1304.02,8000000.0,4517703.75,12517703.75,13735300.0,0.036729,1836464.05
3,15815,16904.0,25356000.0,1596.11,8000000.0,4363345.0,12363345,12992660.0,0.034743,1737169.94
4,17186,16698.0,25047000.0,-178.9,4000000.0,7102849.5,11102849.5,13944150.0,0.037288,1864388.7
5,19984,13551.0,20326500.0,528.3,8000000.0,5624794.25,13624794.25,6701706.0,0.017921,896044.86
6,11614,12828.0,19242000.0,266.58,3000000.0,5324689.0,8324689,10917310.0,0.029194,1459688.15
7,4577,12497.0,18745500.0,-198.78,4000000.0,5374751.417,9374751.417,9370749.0,0.025058,1252906.57
8,13683,12289.0,18433500.0,195.77,3000000.0,3825719.313,11825719.31,6607781.0,0.01767,883486.71
9,8712,12535.0,18802500.0,1007.17,8000000.0,3361469.167,11361469.17,7441031.0,0.019898,994895.58


In [4]:
#Select the rows for submission
submission_file = submission_file.iloc[:50]

In [5]:
submission_file

Unnamed: 0,parcel_id,gold_available,Total Price of Gold,Elevation,Fixed Cost,Variable Cost,Total Cost,Profit,Ratio from total profit,bid_amount
0,6875,20658.0,30987000.0,798.47,800000.0,5239901.7,6039901.7,24947098.3,0.066711,3335526.83
1,5712,18109.0,27163500.0,169.53,3000000.0,7393502.278,10393502.28,16769997.72,0.044844,2242215.77
2,16067,17502.0,26253000.0,1304.02,8000000.0,4517703.75,12517703.75,13735296.25,0.036729,1836464.05
3,15815,16904.0,25356000.0,1596.11,8000000.0,4363345.0,12363345.0,12992655.0,0.034743,1737169.94
4,17186,16698.0,25047000.0,-178.9,4000000.0,7102849.5,11102849.5,13944150.5,0.037288,1864388.7
5,19984,13551.0,20326500.0,528.3,8000000.0,5624794.25,13624794.25,6701705.75,0.017921,896044.86
6,11614,12828.0,19242000.0,266.58,3000000.0,5324689.0,8324689.0,10917311.0,0.029194,1459688.15
7,4577,12497.0,18745500.0,-198.78,4000000.0,5374751.417,9374751.417,9370748.58,0.025058,1252906.57
8,13683,12289.0,18433500.0,195.77,3000000.0,3825719.313,11825719.31,6607780.69,0.01767,883486.71
9,8712,12535.0,18802500.0,1007.17,8000000.0,3361469.167,11361469.17,7441030.83,0.019898,994895.58
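
The derived columns in this table are not computed anywhere in the notebook, but the first rows are consistent with a simple recipe: revenue at 1,500 per unit of gold, profit as revenue minus fixed and variable costs, and a bid equal to each parcel's share of total profit times the 50,000,000 budget. That recipe is inferred from the numbers, not stated in the source, so the sketch below should be read as an assumption. It uses only the first three rows, which means the shares and bids differ from the table, where the share is taken over all parcels.

```python
import pandas as pd

GOLD_PRICE = 1_500      # inferred from "Total Price of Gold" / gold_available in the table
BUDGET = 50_000_000     # starting budget from the competition summary

# First three rows of the table above (costs copied as shown)
parcels = pd.DataFrame({
    "parcel_id": [6875, 5712, 16067],
    "gold_available": [20658.0, 18109.0, 17502.0],
    "fixed_cost": [800000.0, 3000000.0, 8000000.0],
    "variable_cost": [5239901.7, 7393502.278, 4517703.75],
})

parcels["revenue"] = parcels["gold_available"] * GOLD_PRICE
parcels["total_cost"] = parcels["fixed_cost"] + parcels["variable_cost"]
parcels["profit"] = parcels["revenue"] - parcels["total_cost"]

# Each parcel's share of the total expected profit, scaled to the budget
parcels["profit_share"] = parcels["profit"] / parcels["profit"].sum()
parcels["bid_amount"] = (parcels["profit_share"] * BUDGET).round(2)

print(parcels[["parcel_id", "revenue", "profit", "bid_amount"]])
```

Allocating the whole budget in proportion to expected profit is only one possible bidding rule, and it would produce negative bids for parcels with negative expected profit, so such rows would need separate handling.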


In [7]:
#select the columns 
submission_file = submission_file[['parcel_id','bid_amount']]

In [8]:
submission_file

Unnamed: 0,parcel_id,bid_amount
0,6875,3335526.83
1,5712,2242215.77
2,16067,1836464.05
3,15815,1737169.94
4,17186,1864388.7
5,19984,896044.86
6,11614,1459688.15
7,4577,1252906.57
8,13683,883486.71
9,8712,994895.58


In [9]:
#export the file for submission
submission_file.to_csv('submission.csv', index=False)