**Volkov Aleksandr**\
Email: volkov.aleks@gmail.com \
Telegram: @volkov_lxndr


# **Well placement**

In this task we need to decide where it is most efficient to drill new wells.

Typical steps are: 
- In the chosen region, collect characteristics for the wells: the quality of oil and the volume of reserves;
- Build a model to predict the amount of reserves in new wells;
- Select the wells with the highest value estimates;
- Determine the region with the maximum revenue (for the selected wells).


We are provided with oil samples from three regions. The characteristics for each well in the region are already known. We need a model to determine the region where production is most profitable and to analyze the possible profits and risks.

# 1. Data preprocessing

In [1]:
# import all the necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

In [2]:
# process initial data
region_0 = pd.read_csv('geo_data_0.csv')
region_1 = pd.read_csv('geo_data_1.csv')
region_2 = pd.read_csv('geo_data_2.csv')

In [3]:
# check the data format 
display(region_0.head(3))
display(region_1.head(3))
display(region_2.head(3))

|   | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | txEyH | 0.705745 | -0.497823 | 1.22117 | 105.280062 |
| 1 | 2acmU | 1.334711 | -0.340164 | 4.36508 | 73.03775 |
| 2 | 409Wp | 1.022732 | 0.15199 | 1.419926 | 85.265647 |


|   | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | kBEdx | -15.001348 | -8.276 | -0.005876 | 3.179103 |
| 1 | 62mP7 | 14.272088 | -3.475083 | 0.999183 | 26.953261 |
| 2 | vyE1P | 6.263187 | -5.948386 | 5.00116 | 134.766305 |


|   | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | fwXo0 | -1.146987 | 0.963328 | -0.828965 | 27.758673 |
| 1 | WJtFt | 0.262778 | 0.269839 | -2.530187 | 56.069697 |
| 2 | ovLUW | 0.194587 | 0.289035 | -5.586433 | 62.87191 |


In [4]:
# Checking for  Missing Values
display(region_0.info())
display(region_1.info())
display(region_2.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


None

In [5]:
# checking for duplicates
print('Number of duplicates for region 0:', region_0.duplicated().sum())
print('Number of duplicates for region 1:',region_1.duplicated().sum())
print('Number of duplicates for region 2:',region_2.duplicated().sum())

Number of duplicates for region 0: 0
Number of duplicates for region 1: 0
Number of duplicates for region 2: 0


**Section summary:**
- The features take both positive and negative values;
- All features and the target are decimal (float64) values;
- There are no missing values;
- There are no fully duplicated rows in the datasets.
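
The duplicate check in cell 5 compares whole rows, so two rows that share a well id but differ in their features would not be counted. Below is a minimal sketch of an id-level check in plain Python. The ids in `sample_ids` are made up for illustration and are not taken from the datasets; on the real frames the same idea could be written as `region_0['id'].duplicated().sum()`.

```python
from collections import Counter

def repeated_ids(ids):
    """Return a dict of ids that occur more than once, with their counts."""
    counts = Counter(ids)
    return {well_id: n for well_id, n in counts.items() if n > 1}

# Illustrative ids only (hypothetical, not from geo_data_*.csv)
sample_ids = ['aaa', 'bbb', 'aaa', 'ccc', 'bbb', 'aaa']
print(repeated_ids(sample_ids))  # {'aaa': 3, 'bbb': 2}
```

Whether repeated ids should be dropped or kept is a separate decision that depends on what the id represents; the sketch only reports them.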

# 2. Model building

Let's train the model using the function model_build, which:

1) separates the features (f0, f1, f2) from the target, product (the volume of reserves in the well, in thousand barrels);
2) splits the data into training and validation sets in a 75:25 ratio;
3) scales the features and fits a linear regression model;
4) calculates the mean predicted volume of reserves and the RMSE of the model on the validation set.
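
To make the metric in step 4 concrete, here is a small sketch of RMSE computed by hand. The function name `rmse` and the sample volumes are assumptions for illustration; the notebook itself relies on scikit-learn for this calculation.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical volumes in thousand barrels (not taken from the datasets)
actual = [100.0, 80.0, 120.0, 60.0]
predicted = [90.0, 85.0, 110.0, 75.0]
print(round(rmse(actual, predicted), 2))  # 10.61
```

Because the errors are squared before averaging, RMSE is expressed in the same units as the target (thousand barrels) and penalizes large misses more heavily than small ones.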

In [6]:
# Function of model construction
def model_build(data):
    
    # Separate the features from the target
    features = data[['f0', 'f1', 'f2']]
    target = data['product']

    # Split the data into training and validation sets (75:25)
    features_train, features_valid, target_train, target_valid = train_test_split(
        features, target, test_size=0.25, random_state=12345)
    print('features train size:', len(features_train))
    print('featues valid size:', len(features_valid))
    
    # Features scaling
    scaler = StandardScaler()
    scaler.fit(features_train)
    features_train_scaled = scaler.transform(features_train)
    features_valid_scaled = scaler.transform(features_valid)

    # Since the target is quantitative, this is a regression task; we build
    # a linear regression model (the others are not sufficiently predictive).
    model = LinearRegression()
    model.fit(features_train_scaled, target_train)
    predictions_valid = model.predict(features_valid_scaled)

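    # Calculate the mean predicted production volume and the RMSE of the model
    product_mean = predictions_valid.mean()
    # RMSE as the square root of MSE; this does not rely on the `squared`
    # argument, which newer scikit-learn releases no longer accept
    RMSE = mean_squared_error(target_valid, predictions_valid) ** 0.5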
    print('Average production:', product_mean)
    print('RMSE :', RMSE)
    
    return (target_valid, predictions_valid, product_mean, RMSE)

Let's build a result table: 

In [7]:
results_table = pd.DataFrame(columns=['product_mean', 'RMSE', 'mean_income', 'risk'])

list_of_data = [region_0, region_1, region_2]
list_of_predictions_valid = [0]*3
list_of_target_valid = [0]*3

#partially fill the table:
for i in range(len(list_of_data)):
    print(i, 'region')
    list_of_target_valid[i], list_of_predictions_valid[i], results_table.loc[i, 'product_mean'], results_table.loc[i, 'RMSE'] = model_build(list_of_data[i])
    print()

0 region
features train size: 75000
featues valid size: 25000
Average production: 92.59256778438035
RMSE : 37.5794217150813

1 region
features train size: 75000
featues valid size: 25000
Average production: 68.72854689544602
RMSE : 0.8930992867756182

2 region
features train size: 75000
featues valid size: 25000
Average production: 94.96504596800489
RMSE : 40.02970873393434



**Section summary:**

- Regions 0 and 2 have the highest mean predicted reserves (92.6 and 95.0 thousand barrels per well, respectively), while Region 1 has 68.7 thousand barrels.

- Along with the high predicted reserves, Regions 0 and 2 have a high RMSE (37.6 and 40.0, respectively), whereas Region 1 has the lowest RMSE (0.89).


# 3. Income calculations

To calculate profits, let's keep all the key values in separate variables.

1) When exploring the region, 500 points are investigated, from which the best 200 development wells are selected using machine learning; \
2) The budget for well development in the region is 10 billion;\
3) The income from each unit of product (one thousand barrels) is 450 thousand;\
4) The probability of a loss must be less than 2.5%.

In [8]:
WELLS = 500 # points investigated during region exploration
BEST_WELLS = 200 # wells with the best characteristics, selected for development
BUDGET = 10e9 # budget for well development per region
REVENUE_PER_KBBL = 450000 # income per unit of product (thousand barrels)
RISK_THRESHOLD = .025 # threshold for the probability of a loss
production_min = BUDGET/BEST_WELLS/REVENUE_PER_KBBL
print('Minimum profitable product volume:',production_min)

Minimum profitable product volume: 111.11111111111111


**Section summary:**

- The minimum average volume per well for break-even development is 111.1 thousand barrels, which exceeds the mean predicted reserves in every region.
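
As a rough illustration of how far each region's mean falls short of the break-even volume, the sketch below repeats the calculation with the constants from the text and the mean predicted volumes printed in section 2 (rounded to two decimals). The dictionary names are assumptions made for this example.

```python
BUDGET = 10e9               # development budget per region (from the text)
BEST_WELLS = 200            # wells selected for development
REVENUE_PER_KBBL = 450_000  # income per thousand barrels

# Volume each selected well must yield, on average, to recover the budget
break_even = BUDGET / BEST_WELLS / REVENUE_PER_KBBL

# Mean predicted volume per well, copied from the model output in section 2
mean_predicted = {0: 92.59, 1: 68.73, 2: 94.97}

shortfall = {region: break_even - volume
             for region, volume in mean_predicted.items()}
for region, gap in shortfall.items():
    print(f"region {region}: mean is {gap:.2f} thousand barrels below break-even")
```

A regional mean below break-even does not by itself rule out a profit, because only the 200 best of 500 explored points are developed and their average volume can be well above the regional mean.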

# 4. Risk and profit calculations

In [9]:
state = np.random.RandomState(12345)

In [12]:
def income_calculation(target, predictions):
    # rank the wells by predicted reserves, highest first
    predictions_sorted = predictions.sort_values(ascending=False)
    # actual volumes of the top-ranked wells
    # (the slice [:BEST_WELLS-1] keeps the first BEST_WELLS - 1 entries)
    selected_wells = target[predictions_sorted.index][:(BEST_WELLS-1)]
    total_production = selected_wells.sum()
    # profit: revenue minus budget, scaled by 10e6 (= 1e7, i.e. tens of millions)
    income = (total_production * REVENUE_PER_KBBL - BUDGET)/10e6
    return income

#function for the money loss risk estimation
def region_analys(targets, predictions):
    # Bootstrap with 1000 resamples of WELLS points each
    incomes = []
    predictions = pd.Series(predictions)
    targets = pd.Series(targets.reset_index(drop=True))
    for h in range(1000):
        subsample_predict = predictions.sample(n=WELLS, replace = True, random_state = state)
        subsample_target = targets[subsample_predict.index]
        incomes.append(income_calculation(subsample_target, subsample_predict))
    incomes=pd.Series(incomes)
    
    # Average income, 95% confidence interval, risk of a loss
    mean_income = incomes.mean()
    dov_int = (incomes.quantile(0.025), incomes.quantile(0.975))
    risk = (incomes < 0).mean()*100
    print('Average income for', i, 'region', mean_income)
    print('95% confidence interval for', i, 'region', dov_int)
    if risk<RISK_THRESHOLD*100:
        print('The risk of money loss for', i, 'region is less than 2,5%; Risk:', risk, '%')
    else:
        print('The risk of money loss for', i, 'region is greater than 2,5%; Risk:', risk, '%')
    return mean_income, risk
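
For readers who want to experiment with the bootstrap outside the notebook, here is a standalone sketch of the same idea in plain Python. It is not the function used above: it selects wells by position (so a well drawn twice is counted exactly twice), takes exactly BEST_WELLS wells, and returns the profit in raw currency units. The synthetic volumes, the helper names `profit`, `bootstrap_profits` and `quantile`, and the nearest-rank quantile are all assumptions made for illustration.

```python
import random

WELLS = 500
BEST_WELLS = 200
BUDGET = 10e9
REVENUE_PER_KBBL = 450_000

def profit(actual, predicted):
    """Profit from developing the BEST_WELLS wells with the highest predictions."""
    # positions of the wells ranked by predicted volume, highest first
    ranked = sorted(range(len(predicted)), key=lambda k: predicted[k], reverse=True)
    total_volume = sum(actual[k] for k in ranked[:BEST_WELLS])
    return total_volume * REVENUE_PER_KBBL - BUDGET

def bootstrap_profits(actual, predicted, n_resamples=1000, seed=12345):
    """Resample WELLS wells with replacement and collect the profit of each draw."""
    rng = random.Random(seed)
    n = len(actual)
    profits = []
    for _ in range(n_resamples):
        picks = [rng.randrange(n) for _ in range(WELLS)]
        profits.append(profit([actual[k] for k in picks],
                              [predicted[k] for k in picks]))
    return profits

def quantile(sorted_values, q):
    """Empirical quantile by nearest rank (simpler than pandas' interpolation)."""
    index = min(len(sorted_values) - 1, max(0, round(q * (len(sorted_values) - 1))))
    return sorted_values[index]

# Synthetic data (an assumption): noisy predictions of made-up volumes
data_rng = random.Random(0)
actual = [max(0.0, data_rng.gauss(90, 45)) for _ in range(5000)]
predicted = [a + data_rng.gauss(0, 30) for a in actual]

profits = sorted(bootstrap_profits(actual, predicted))
mean_profit = sum(profits) / len(profits)
interval = (quantile(profits, 0.025), quantile(profits, 0.975))
risk = sum(p < 0 for p in profits) / len(profits)
print(f"mean profit: {mean_profit / 1e6:.1f} million")
print(f"95% interval: {interval[0] / 1e6:.1f} to {interval[1] / 1e6:.1f} million")
print(f"risk of loss: {risk:.1%}")
```

The numbers printed by this sketch describe the synthetic data only and are not comparable with the results for the three regions.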

In [13]:
for i in range(3):
    print(i, 'region')
    results_table.loc[i, 'mean_income'], results_table.loc[i, 'risk'] = region_analys(
        list_of_target_valid[i], list_of_predictions_valid[i])
    print()

0 region
Average income for 0 region 38.031729576840036
95% confidence interval for 0 region (-15.376413173489198, 92.88623820333875)
The risk of money loss for 0 region is greater than 2,5%; Risk: 7.3 %

1 region
Average income for 1 region 46.72445589442629
95% confidence interval for 1 region (4.100770011942993, 89.5065834118917)
The risk of money loss for 1 region is less than 2,5%; Risk: 1.6 %

2 region
Average income for 2 region 36.54977280466331
95% confidence interval for 2 region (-25.15688825619246, 90.77281565517713)
The risk of money loss for 2 region is greater than 2,5%; Risk: 11.5 %



In [14]:
results_table

|   | product_mean | RMSE | mean_income | risk |
|---|---|---|---|---|
| 0 | 92.592568 | 37.579422 | 38.03173 | 7.3 |
| 1 | 68.728547 | 0.893099 | 46.724456 | 1.6 |
| 2 | 94.965046 | 40.029709 | 36.549773 | 11.5 |


**Section summary:**

Based on the results, we can see that only Region 1 has a probability of loss below 2.5% (1.6%). Its average income is 46.7 in the units returned by income_calculation (10e6 each, i.e. about 467 million).
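
The selection rule can be written out explicitly. The sketch below applies the 2.5% risk threshold to the values from the results table (rounded) and then picks the eligible region with the highest mean income; the names `results`, `eligible` and `best_region` are assumptions for this example.

```python
RISK_THRESHOLD = 0.025  # maximum acceptable probability of a loss

# (mean_income, risk in percent) per region, copied from the results table
results = {
    0: (38.03, 7.3),
    1: (46.72, 1.6),
    2: (36.55, 11.5),
}

# Keep only regions whose risk is below the threshold
eligible = {region: income for region, (income, risk) in results.items()
            if risk < RISK_THRESHOLD * 100}
# Among those, recommend the region with the highest mean income
best_region = max(eligible, key=eligible.get) if eligible else None
print("eligible regions:", sorted(eligible))  # eligible regions: [1]
print("recommended region:", best_region)     # recommended region: 1
```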

# Conclusion

During the project, the following steps were carried out.\
Initial data analysis: the features take both positive and negative decimal values; no missing values, incorrect feature names, or incorrect data types were found; the datasets contain no duplicate rows.

It was found that Regions 0 and 2 have the highest mean predicted volumes (92.6 and 95.0 thousand barrels, respectively), while Region 1 has 68.7 thousand barrels. Along with the high predicted volumes, Regions 0 and 2 have a high prediction error: the RMSE is 37.6 and 40.0, respectively, whereas Region 1 has the lowest RMSE (0.9).

We found that the minimum average volume per well for break-even development is 111.1 thousand barrels, which exceeds the mean predicted volume in every region.

Linear regression models were built to predict oil reserves in the 3 regions. Based on the models' predictions, Region 1 is recommended for the development of 200 wells.

Region 1 is characterized by:

- Predicted average income: 46.7 in units of 10e6 (about 467 million);

- With 95% probability, the income from this region will lie between 4.1 and 89.5 in the same units;

- Risk of money loss: 1.6%.

The prediction model is sufficiently accurate: the error in the predicted oil reserves is 0.9 thousand barrels (RMSE).

**Feel free to contact me with any questions!**