# Selecting a location for a well

Let's say we work for the mining company GlavRosGosNeft. We need to decide where to drill a new well.

We were provided with oil samples in three regions: in each 10,000 fields, where the quality of oil and the volume of its reserves were measured. Let's build a machine learning model that will help determine the region where mining will bring the greatest profit. Let's analyze the possible profits and risks using the *Bootstrap.* technique

Steps to select a location:

- Deposits are searched for in the selected region, and the characteristic values ​​are determined for each;
- Build a model and estimate the volume of reserves;
- Deposits with the highest value estimates are selected. The number of fields depends on the company’s budget and the cost of developing one well;
- Profit is equal to the total profit of the selected fields.

<b>Description of data</b>

Geological exploration data for three regions is in the following files:

- /datasets/geo_data_0.csv.
- /datasets/geo_data_1.csv.
- /datasets/geo_data_2.csv.
- id — unique identifier of the well;
- f0, f1, f2 - three signs of points (it doesn’t matter what they mean, but the signs themselves are significant);
- product — volume of reserves in the well (thousand barrels).


<b>Task conditions:</b>
- Only linear regression is suitable for training the model (the others are not predictable enough).
- When exploring a region, 500 points are examined, from which, using machine learning, the best 200 are selected for development.
- The budget for well development in the region is 10 billion rubles.
- At current prices, one barrel of raw materials brings 450 rubles in income. The income from each unit of product is 450 thousand rubles, since the volume is indicated in thousands of barrels.
- After assessing the risks, you need to leave only those regions in which the probability of losses is less than 2.5%. Among them, the region with the highest average profit is selected.

## Loading and preparing data

Import the necessary libraries and look at the data

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib as plt

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from scipy import stats as st


df_0 = pd.read_csv('/datasets/geo_data_0.csv')
df_1 = pd.read_csv('/datasets/geo_data_1.csv')
df_2 = pd.read_csv('/datasets/geo_data_2.csv')

data = [df_0, df_1, df_2]

for i in data:
    print(i.info())
    display(i.head())
    print('*' * 30)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None


Unnamed: 0,id,f0,f1,f2,product
0,txEyH,0.705745,-0.497823,1.22117,105.280062
1,2acmU,1.334711,-0.340164,4.36508,73.03775
2,409Wp,1.022732,0.15199,1.419926,85.265647
3,iJLyR,-0.032172,0.139033,2.978566,168.620776
4,Xdl7t,1.988431,0.155413,4.751769,154.036647


******************************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None


Unnamed: 0,id,f0,f1,f2,product
0,kBEdx,-15.001348,-8.276,-0.005876,3.179103
1,62mP7,14.272088,-3.475083,0.999183,26.953261
2,vyE1P,6.263187,-5.948386,5.00116,134.766305
3,KcrkZ,-13.081196,-11.506057,4.999415,137.945408
4,AHL4O,12.702195,-8.147433,5.004363,134.766305


******************************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None


Unnamed: 0,id,f0,f1,f2,product
0,fwXo0,-1.146987,0.963328,-0.828965,27.758673
1,WJtFt,0.262778,0.269839,-2.530187,56.069697
2,ovLUW,0.194587,0.289035,-5.586433,62.87191
3,q6cA6,2.23606,-0.55376,0.930038,114.572842
4,WPMUX,-0.515993,1.716266,5.899011,149.600746


******************************


Let's look at the correlation

In [225]:
for i in data:
    print(i.corr())
    print('*' * 30)

               f0        f1        f2   product
f0       1.000000 -0.440723 -0.003153  0.143536
f1      -0.440723  1.000000  0.001724 -0.192356
f2      -0.003153  0.001724  1.000000  0.483663
product  0.143536 -0.192356  0.483663  1.000000
******************************
               f0        f1        f2   product
f0       1.000000  0.182287 -0.001777 -0.030491
f1       0.182287  1.000000 -0.002595 -0.010155
f2      -0.001777 -0.002595  1.000000  0.999397
product -0.030491 -0.010155  0.999397  1.000000
******************************
               f0        f1        f2   product
f0       1.000000  0.000528 -0.000448 -0.001987
f1       0.000528  1.000000  0.000779 -0.001012
f2      -0.000448  0.000779  1.000000  0.445871
product -0.001987 -0.001012  0.445871  1.000000
******************************


### Conclusion

Data does not contain gaps

Of the three characteristics, f2 correlates most with product. Moreover, in df_1 the correlation is almost 100%.

Let's move on to building models

## Model training and testing

<b>Train the model using the data from the first region</b>

In [226]:
features_0 = df_0.drop(['product', 'id'], axis=1) #признаки 
target_0 = df_0['product'] # цель

features_train_0, features_valid_0, target_train_0, target_valid_0 = train_test_split(features_0,\
    target_0, test_size=.25, random_state=12345)

model_0 = LinearRegression(n_jobs=-1)
model_0.fit(features_train_0, target_train_0)
prediction_0 = model_0.predict(features_valid_0)
rmse_0 = mean_squared_error(target_valid_0, prediction_0)**0.5
scores_0 = model.score(features_train_0, target_train_0)
print("RMSE", rmse_0)
print('Среднее кол-во нефти в регионе', df_0['product'].mean())
print('Score', scores_0)

RMSE 37.5794217150813
Среднее кол-во нефти в регионе 92.50000000000001
Score 0.2266310049191338


<b>Train the model using the data from the second region</b>

In [227]:
features_1 = df_1.drop(['product', 'id'], axis=1) #признаки 
target_1 = df_1['product'] # цель

features_train_1, features_valid_1, target_train_1, target_valid_1 = train_test_split(features_1,\
    target_1, test_size=.25, random_state=12345)

model_1 = LinearRegression(n_jobs=-1)
model_1.fit(features_train_1, target_train_1)
prediction_1 = model_1.predict(features_valid_1)
rmse_1 = mean_squared_error(target_valid_1, prediction_1)**0.5
scores_1 = model.score(features_train_1, target_train_1)
print("RMSE", rmse_1)
print('Среднее кол-во нефти в регионе', df_1['product'].mean())
print('Score', scores_1)

RMSE 0.893099286775616
Среднее кол-во нефти в регионе 68.82500000000002
Score 0.047375840663758884


<b>Train the model using data from the third region</b>

In [228]:
features_2 = df_2.drop(['product', 'id'], axis=1) #признаки 
target_2 = df_2['product'] # цель

features_train_2, features_valid_2, target_train_2, target_valid_2 = train_test_split(features_2,\
    target_2, test_size=.25, random_state=12345)

model_2 = LinearRegression(n_jobs=-1)
model_2.fit(features_train_2, target_train_2)
prediction_2 = model_2.predict(features_valid_2)
rmse_2 = mean_squared_error(target_valid_2, prediction_2)**0.5
scores_2 = model.score(features_train_2, target_train_2)
print("RMSE", rmse_2)
print('Среднее кол-во нефти в регионе', df_2['product'].mean())
print('Score', scores_2)

RMSE 40.02970873393434
Среднее кол-во нефти в регионе 95.00000000000004
Score 0.19661432867329998


### Conclusion

The RMSE of the first and third regions is quite high - 37 and 40, respectively.
RMSE of the second region - 0.9

The average amount of raw materials in region 2 is the smallest - 68 tons

Let's move on to calculating profit

## Preparation for profit calculation

Let us denote the necessary variables according to the wording of the problem

In [229]:
wells_all = 500 # wells to be studied
wells_best = 200 # select only 200 out of 500
budget = 10_000_000_000 # total budget
barrel = 450 # revenue per barrel
product = 450,000 # income per product (1 product = 1,000 barrels)

Let's find a budget for 1 well

In [230]:
budget_well_best = budget / wells_best
budget_well_best

50000000.0

Minimum payback per well

In [231]:
min_product = budget_well_best / revenue_product
min_product

111.11111111111111

Let's write a function to determine profit for each region

In [232]:
prediction_0 = pd.Series(prediction_0)
target_valid_0 = pd.Series(target_valid_0).reset_index(drop=True)


In [233]:
def revenue(target, predicted, count=200):
    predicted_sorted = predicted.sort_values(ascending=False)
    selected = target[predicted_sorted.index][:count]
    return (((revenue_product * selected.sum() - budget))) / 1_000_000_000
print('Прибыль для первого региона:', revenue(target_valid_0, prediction_0), 'млрд руб')


Прибыль для первого региона: 3.3208260431398524 млрд руб


In [234]:
prediction_1 = pd.Series(prediction_1)
target_valid_1 = pd.Series(target_valid_1).reset_index(drop=True)

print('Прибыль для первого региона:', revenue(target_valid_1, prediction_1))

Прибыль для первого региона: 2.4150866966815108


In [235]:
prediction_2 = pd.Series(prediction_2)
target_valid_2 = pd.Series(target_valid_2).reset_index(drop=True)

print('Прибыль для первого региона:', revenue(target_valid_2, prediction_2))

Прибыль для первого региона: 2.7103499635998327


### Conclusion

From the results obtained above, we conclude that the profit of the 1st region is the highest and amounts to 3.3 billion rubles. 2nd - 2.4 billion rubles. 3rd - 2.7 billion rubles

Let's move on to calculating profits and risks

## Calculation of profits and risks

Let's analyze the first region

In [236]:
state = np.random.RandomState(12345)
values_0 = []
for i in range(1000):
    target_subsample_0 = target_valid_0.sample(n=500, replace=True, random_state = state)
    predictions_subsample_0 = prediction_0[target_subsample_0.index]
    values_0.append(revenue(target_subsample_0, predictions_subsample_0))
values_0 = pd.Series(values_0)
mean_0 = values_0.mean()
lower_0 = values_0.quantile(0.025)
upper_0 = values_0.quantile(0.975)
risk_0 = (((values_0 < 0).sum() / len(values_0)) * 100)
print('Средняя прибыль для региона 0:', mean_0)
print('Доверительный интервал для региона 0:', lower_0, ':', upper_0)
print('Риск', risk_0, '%')

Средняя прибыль для региона 0: 0.42593852691059236
Доверительный интервал для региона 0: -0.10209009483793655 : 0.9479763533583688
Риск 6.0 %


Let's analyze the second region

In [237]:
values_1 = []
for i in range(1000):
    target_subsample_1 = target_valid_1.sample(n=500, replace=True, random_state = state)
    predictions_subsample_1 = prediction_1[target_subsample_1.index]
    values_1.append(revenue(target_subsample_1, predictions_subsample_1))
values_1 = pd.Series(values_1)
mean_1 = values_1.mean()
lower_1 = values_1.quantile(0.025)
upper_1 = values_1.quantile(0.975)
risk_1 = (((values_1 < 0).sum() / len(values_1)) * 100)
print('Средняя прибыль для региона 1:', mean_1)
print('Доверительный интервал для региона 1:', lower_1, ':', upper_1)
print('Риск', risk_1, '%')

Средняя прибыль для региона 1: 0.5182594936973248
Доверительный интервал для региона 1: 0.1281232314330863 : 0.9536129820669086
Риск 0.3 %


Let's analyze the third region

In [238]:
values_2 = []
for i in range(1000):
    target_subsample_2 = target_valid_2.sample(n=500, replace=True, random_state = state)
    predictions_subsample_2 = prediction_2[target_subsample_2.index]
    values_2.append(revenue(target_subsample_2, predictions_subsample_2))
values_2 = pd.Series(values_2)
mean_2 = values_2.mean()
lower_2 = values_2.quantile(0.025)
upper_2 = values_2.quantile(0.975)
risk_2 = (((values_2 < 0).sum() / len(values_2)) * 100)
print('Средняя прибыль для региона 2:', mean_2)
print('Доверительный интервал для региона 2:', lower_2, ':', upper_2)
print('Риск', risk_2, '%')

Средняя прибыль для региона 2: 0.4201940053440501
Доверительный интервал для региона 2: -0.11585260916001143 : 0.989629939844574
Риск 6.2 %


### Conclusion

The second region has the highest average profit - 0.5 billion rubles

This region also has the lowest risk - 0.3%


## General conclusion

We analyzed data on raw material production for three regions.

Of the three regions, region 2 is the most attractive for mining.

This region has the highest average profit and the lowest risks