# Selecting a location for a well

Suppose you work for the oil company GlavRosGosNeft and need to decide where to drill a new well.

You are given oil samples from three regions. Each region contains 100,000 fields for which the oil quality and the volume of reserves were measured. Build a machine learning model that helps determine the region where production will bring the greatest profit. Analyze the possible profit and risks using the *Bootstrap* technique.

Steps to select a location:

- Fields are surveyed in the selected region, and the feature values are determined for each one;
- A model is built and used to estimate the volume of reserves;
- The fields with the highest estimated reserves are selected. The number of fields depends on the company's budget and the cost of developing one well;
- The profit is the total profit from the selected fields.

## Loading and preparing data

In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from numpy.random import RandomState

In [2]:
df0 = pd.read_csv('/datasets/geo_data_0.csv')
df1 = pd.read_csv('/datasets/geo_data_1.csv')
df2 = pd.read_csv('/datasets/geo_data_2.csv')

In [3]:
print(df0.head())
print(df1.head())
df2.head()

      id        f0        f1        f2     product
0  txEyH  0.705745 -0.497823  1.221170  105.280062
1  2acmU  1.334711 -0.340164  4.365080   73.037750
2  409Wp  1.022732  0.151990  1.419926   85.265647
3  iJLyR -0.032172  0.139033  2.978566  168.620776
4  Xdl7t  1.988431  0.155413  4.751769  154.036647
      id         f0         f1        f2     product
0  kBEdx -15.001348  -8.276000 -0.005876    3.179103
1  62mP7  14.272088  -3.475083  0.999183   26.953261
2  vyE1P   6.263187  -5.948386  5.001160  134.766305
3  KcrkZ -13.081196 -11.506057  4.999415  137.945408
4  AHL4O  12.702195  -8.147433  5.004363  134.766305


      id         f0        f1         f2     product
0  fwXo0  -1.146987  0.963328  -0.828965   27.758673
1  WJtFt   0.262778  0.269839  -2.530187   56.069697
2  ovLUW   0.194587  0.289035  -5.586433    62.87191
3  q6cA6    2.23606  -0.55376   0.930038  114.572842
4  WPMUX  -0.515993  1.716266   5.899011  149.600746


In [4]:
df0.describe().T

            count      mean        std         min        25%        50%         75%         max
f0       100000.0  0.500419   0.871832   -1.408605   -0.07258    0.50236    1.073581    2.362331
f1       100000.0  0.250143   0.504433   -0.848218  -0.200881   0.250252    0.700646    1.343769
f2       100000.0  2.502647   3.248248  -12.088328   0.287748   2.515969    4.715088    16.00379
product  100000.0      92.5  44.288691         0.0  56.497507  91.849972  128.564089  185.364347


In [5]:
df1.describe().T

            count       mean        std         min        25%        50%         75%         max
f0       100000.0   1.141296   8.965932  -31.609576  -6.298551   1.153055    8.621015   29.421755
f1       100000.0  -4.796579   5.119872  -26.358598  -8.267985  -4.813172   -1.332816   18.734063
f2       100000.0   2.494541   1.703572   -0.018144   1.000021   2.011479    3.999904    5.019721
product  100000.0     68.825  45.944423         0.0  26.953261  57.085625  107.813044  137.945408


In [6]:
df2.describe().T

            count       mean        std         min        25%        50%         75%         max
f0       100000.0   0.002023   1.732045   -8.760004  -1.162288   0.009424    1.158535    7.238262
f1       100000.0  -0.002081   1.730417    -7.08402   -1.17482  -0.009482    1.163678    7.844801
f2       100000.0   2.495128   3.473445  -11.970335   0.130359   2.484236    4.858794   16.739402
product  100000.0       95.0  44.749921         0.0  59.450441  94.925613  130.595027  190.029838


In [7]:
df0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [8]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [9]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [10]:
df0['id'].duplicated().sum()

10

In [11]:
duplicated_ids_0 = df0.loc[df0['id'].duplicated(), 'id'].values

In [12]:
df0[df0['id'].isin(duplicated_ids_0)]

          id         f0         f1         f2     product
  931  HZww2   0.755284   0.368511   1.863211   30.681774
 1364  bxg6G   0.411645    0.85683   -3.65344    73.60426
 1949  QcMuo   0.506563  -0.323775  -2.215583   75.496502
 3389  A5aEY  -0.039949   0.156872   0.209861   89.249364
 7530  HZww2   1.061194  -0.373969   10.43021  158.828695
16633  fiKDv   0.157341   1.028359   5.585586   95.817889
21426  Tdehs   0.829407   0.298807  -0.049563   96.035308
41724  bxg6G  -0.823752   0.546319   3.630479   93.007798
42529  AGS9W   1.454747  -0.479651    0.68338  126.370504
51970  A5aEY  -0.180335   0.935548  -2.094773   33.020205


In [13]:
df1['id'].duplicated().sum()

4

In [14]:
df1.duplicated().sum()

0

In [15]:
df2['id'].duplicated().sum()

4

In [16]:
df2.duplicated().sum()

0

The duplicated well IDs are most likely data-entry errors: no complete rows are duplicated, so rows that share an ID hold different measurements. The rows are therefore kept.
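One way to double-check this decision is to list every row that shares an ID and confirm that the measurements differ. The sketch below runs on a small made-up frame (the IDs and values are illustrative, not taken from the datasets), and `rows_with_shared_id` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def rows_with_shared_id(df, id_column="id"):
    """Return all rows whose ID occurs more than once, grouped by ID."""
    mask = df[id_column].duplicated(keep=False)  # keep=False marks every copy
    return df[mask].sort_values(id_column)


# Toy data (made up for illustration): "aaa" appears twice with different values.
toy = pd.DataFrame({
    "id": ["aaa", "bbb", "aaa", "ccc"],
    "f0": [0.1, 0.2, 0.9, 0.4],
    "product": [10.0, 20.0, 30.0, 40.0],
})

shared = rows_with_shared_id(toy)
n_full_duplicates = toy.duplicated().sum()  # rows identical in all columns

print(shared)
print("fully duplicated rows:", n_full_duplicates)
```

Unlike `duplicated()` with its default `keep='first'`, passing `keep=False` flags both copies of each repeated ID, which makes the pairs easy to compare side by side.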

In [17]:
print(df0.isna().sum())
print(df1.isna().sum())
print(df2.isna().sum())

id         0
f0         0
f1         0
f2         0
product    0
dtype: int64
id         0
f0         0
f1         0
f2         0
product    0
dtype: int64
id         0
f0         0
f1         0
f2         0
product    0
dtype: int64


## Model training and testing

In [18]:
def model(df):
    # 'id' is an identifier and 'product' is the target; f0, f1 and f2 are the features
    features = df.drop(columns=['id', 'product'])
    target = df['product']

    # Hold out 25% of the rows for validation
    features_train, features_valid, target_train, target_valid = train_test_split(
        features, target, random_state=12345, test_size=0.25)

    # Fit the scaler on the training part only, then apply it to both parts
    scaler = StandardScaler()
    scaler.fit(features_train)
    features_train_scaled = scaler.transform(features_train)
    features_valid_scaled = scaler.transform(features_valid)

    # The local name differs from the function name to avoid shadowing it
    regressor = LinearRegression()
    regressor.fit(features_train_scaled, target_train)

    predictions = regressor.predict(features_valid_scaled)
    rmse = mean_squared_error(target_valid, predictions)**0.5
    average_predicted_product = predictions.mean()

    return average_predicted_product, rmse, predictions, target_valid

In [19]:
df0_product, df0_rmse, df0_predictions, target0 = model(df0)
print('Mean predicted reserves in the first region:', df0_product)
print('RMSE:', df0_rmse)

Mean predicted reserves in the first region: 92.59256778438035
RMSE: 37.5794217150813


In [20]:
df1_product, df1_rmse, df1_predictions, target1 = model(df1)
print('Mean predicted reserves in the second region:', df1_product)
print('RMSE:', df1_rmse)

Mean predicted reserves in the second region: 68.728546895446
RMSE: 0.893099286775617


In [21]:
df2_product, df2_rmse, df2_predictions, target2 = model(df2)
print('Mean predicted reserves in the third region:', df2_product)
print('RMSE:', df2_rmse)

Mean predicted reserves in the third region: 94.96504596800489
RMSE: 40.02970873393434


In the first region the model predicts mean reserves of about 93 thousand barrels per well, with an RMSE of about 38 thousand barrels. The RMSE describes the typical error of a prediction for an individual well, not the error of the mean: a single prediction is typically off by roughly 38 thousand barrels.

In the second region the mean predicted reserves are about 69 thousand barrels, with an RMSE of about 0.9 thousand barrels.

In the third region the mean is about 95 thousand barrels, with an RMSE of about 40 thousand barrels.

The model is by far the most accurate in the second region, but the mean predicted reserves there are lower than in the other two regions.
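An RMSE value is easier to judge next to a baseline. A constant model that always predicts the training mean has an RMSE close to the standard deviation of the target, so a useful model should score clearly below that. The sketch below shows the comparison on synthetic data (the data and coefficients are assumptions for illustration, not the project datasets).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration, not the project datasets).
rng = np.random.RandomState(12345)
X = rng.normal(size=(1000, 3))
y = 50 + 10 * X[:, 0] - 5 * X[:, 2] + rng.normal(scale=2.0, size=1000)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

linear = LinearRegression().fit(X_train, y_train)
rmse_model = mean_squared_error(y_valid, linear.predict(X_valid)) ** 0.5

# Baseline: always predict the mean of the training target.
baseline_predictions = np.full(len(y_valid), y_train.mean())
rmse_baseline = mean_squared_error(y_valid, baseline_predictions) ** 0.5

print("model RMSE:", rmse_model)
print("baseline RMSE:", rmse_baseline)
```

For comparison, `describe()` above reports a standard deviation of `product` of about 44.3, 45.9 and 44.7 in the three regions (computed over the full datasets rather than the validation split). The RMSE of 37.6 and 40.0 in the first and third regions is therefore only modestly below a constant baseline, while 0.89 in the second region is far below it.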

## Preparation for profit calculation

In [22]:
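budget = 10000  # development budget for one region, million rubles
number_of_wells_in_region = 200  # wells to be developed in the region
budget_per_well = budget/number_of_wells_in_region
price = 0.45  # revenue per thousand barrels, million rubles
min_product = budget_per_well/price  # break-even reserves per well, thousand barrels
print(budget_per_well)
print(min_product)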

50.0
111.11111111111111


In [23]:
print(min_product-df0_product)
print(min_product-df1_product)
print(min_product-df2_product)

18.518543326730764
42.38256421566511
16.146065143106227


A budget of 10,000 million rubles is allocated for developing a region. 200 wells have to be selected in the region, so one well costs 50 million rubles on average. At current prices, one thousand barrels brings in 0.45 million rubles of revenue, so a well must contain at least 50 / 0.45 ≈ 111 thousand barrels to break even. This is about 18 thousand barrels more than the mean predicted reserves in the first region, 42 more than in the second and 16 more than in the third. Wells chosen at random are therefore likely to lose money on average.
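The break-even volume can be written out explicitly. The symbols are introduced here for illustration and do not appear in the notebook: $B$ is the regional budget in million rubles, $n$ the number of wells, $p$ the revenue per thousand barrels in million rubles, and $\bar{v}_k$ the mean predicted reserves in region $k$.

```latex
\[
v_{\min} = \frac{B / n}{p} = \frac{10\,000 / 200}{0.45} = \frac{50}{0.45} \approx 111.1
\quad \text{thousand barrels per well}
\]
\[
v_{\min} - \bar{v}_1 \approx 111.1 - 92.6 = 18.5, \qquad
v_{\min} - \bar{v}_2 \approx 111.1 - 68.7 = 42.4, \qquad
v_{\min} - \bar{v}_3 \approx 111.1 - 95.0 = 16.1
\]
```

All three differences are positive, which is why the selection has to rely on the model's ranking of wells rather than on the regional averages.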

In [24]:
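def profit(predictions, target):
    # Positions of the wells with the highest predicted reserves
    index = pd.Series(predictions).sort_values(ascending=False).head(number_of_wells_in_region).index
    # Revenue from the actual reserves of those wells minus the development budget
    total_profit = price*target.iloc[index].sum() - budget
    return total_profit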

In [25]:
profit(df0_predictions, target0)

3320.8260431398503

In [26]:
profit(df1_predictions, target1)

2415.086696681512

In [27]:
profit(df2_predictions, target2)

2710.3499635998323

## Calculation of profits and risks

In [28]:
def bootstrap(predictions, target, number):
    state = RandomState(12345)
    values = []
    for _ in range(number):
        # Draw 500 wells with replacement; the sample keeps the original positions as its index
        predicted_subsample = pd.Series(predictions).sample(n=500, replace=True, random_state=state)
        # Profit from the best 200 wells of this sample
        values.append(profit(predicted_subsample, target))

    values = pd.Series(values)
    average_profit = values.mean()
    # Bounds of the 95% interval of the profit distribution
    lower = values.quantile(0.025)
    upper = values.quantile(0.975)
    # Risk of loss: share of samples with negative profit
    risk = (values<0).sum()/number

    return average_profit, lower, upper, risk

In [29]:
bootstrap(df0_predictions, target0, 1000)

(396.16498480237146, -111.21554589049533, 909.7669415534222, 0.069)

In [30]:
bootstrap(df1_predictions, target1, 1000)

(456.04510578666105, 33.82050939898541, 852.2894538660361, 0.015)

In [31]:
bootstrap(df2_predictions, target2, 1000)

(404.4038665683571, -163.35041339560078, 950.3595749238001, 0.076)

Only the second region has a risk of loss below 2.5% (1.5%), and it also has the highest mean profit (about 456 million rubles). The second region is therefore recommended for drilling.
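The selection rule behind this conclusion can be sketched as a small function: keep the regions whose risk of loss is below the threshold, then pick the one with the highest mean profit. The numbers are copied from the bootstrap outputs above; `choose_region` and `MAX_RISK` are hypothetical names introduced for this illustration.

```python
# Bootstrap results copied from the outputs above:
# (mean profit, 2.5% quantile, 97.5% quantile, risk of loss), profit in million rubles.
results = {
    "first": (396.16498480237146, -111.21554589049533, 909.7669415534222, 0.069),
    "second": (456.04510578666105, 33.82050939898541, 852.2894538660361, 0.015),
    "third": (404.4038665683571, -163.35041339560078, 950.3595749238001, 0.076),
}

MAX_RISK = 0.025  # risk threshold used in the conclusion (hypothetical name)


def choose_region(results, max_risk):
    """Return the region with the highest mean profit among those below max_risk."""
    eligible = {name: stats for name, stats in results.items() if stats[3] < max_risk}
    if not eligible:
        return None  # no region satisfies the risk constraint
    return max(eligible, key=lambda name: eligible[name][0])


best = choose_region(results, MAX_RISK)
print(best)
```

Applying the risk filter first matters: a region with a higher mean profit but a risk above the threshold would be excluded before the comparison.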