# **Well placement determination**

In this task we need to decide where it is most efficient to drill a new well.

Typical steps are: 
- In the chosen region, collect characteristics for the wells: the quality of oil and the volume of reserves;
- Build a model to predict the amount of reserves in new wells;
- Select the wells with the highest value estimates;
- Determine the region with the maximum revenue (for the selected wells).


We are provided with oil samples from three regions. The characteristics for each well in the region are already known. We need a model to determine the region where production is most profitable and to analyze the possible profits and risks.

# 1. Data preprocessing

In [1]:
# import all the necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

In [2]:
# process initial data
region_0 = pd.read_csv('geo_data_0.csv')
region_1 = pd.read_csv('geo_data_1.csv')
region_2 = pd.read_csv('geo_data_2.csv')

In [3]:
# check the data format 
display(region_0.head(3))
display(region_1.head(3))
display(region_2.head(3))

Unnamed: 0,id,f0,f1,f2,product
0,txEyH,0.705745,-0.497823,1.22117,105.280062
1,2acmU,1.334711,-0.340164,4.36508,73.03775
2,409Wp,1.022732,0.15199,1.419926,85.265647


Unnamed: 0,id,f0,f1,f2,product
0,kBEdx,-15.001348,-8.276,-0.005876,3.179103
1,62mP7,14.272088,-3.475083,0.999183,26.953261
2,vyE1P,6.263187,-5.948386,5.00116,134.766305


Unnamed: 0,id,f0,f1,f2,product
0,fwXo0,-1.146987,0.963328,-0.828965,27.758673
1,WJtFt,0.262778,0.269839,-2.530187,56.069697
2,ovLUW,0.194587,0.289035,-5.586433,62.87191


In [4]:
# check for missing values
display(region_0.info())
display(region_1.info())
display(region_2.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


None

In [5]:
# checking for duplicates
print('Number of duplicates for region 0:', region_0.duplicated().sum())
print('Number of duplicates for region 1:', region_1.duplicated().sum())
print('Number of duplicates for region 2:', region_2.duplicated().sum())

Number of duplicates for region 0: 0
Number of duplicates for region 1: 0
Number of duplicates for region 2: 0


**Section summary:**
- The features f0, f1 and f2 take both positive and negative values;
- All numeric columns (f0, f1, f2, product) are stored as float64;
- There are no missing values; 
- There are no duplicates in the datasets either.
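
The duplicate check above compares whole rows. As a complementary check, one might also look for well ids that occur more than once with different measurements. The sketch below is only an illustration on a small hypothetical table (the `sample` frame and its values are made up, not taken from the datasets):

```python
import pandas as pd

# Hypothetical miniature stand-in for one region's table (not the real data)
sample = pd.DataFrame({
    'id': ['a1', 'b2', 'a1', 'c3'],
    'f0': [0.1, 0.2, 0.3, 0.4],
    'f1': [1.0, 1.1, 1.2, 1.3],
    'f2': [2.0, 2.1, 2.2, 2.3],
    'product': [10.0, 20.0, 30.0, 40.0],
})

# Full-row duplicates: rows identical in every column
full_row_duplicates = sample.duplicated().sum()

# Repeated well ids: rows whose id already appeared in an earlier row
repeated_ids = sample['id'].duplicated().sum()

print('Full-row duplicates:', full_row_duplicates)
print('Repeated ids:', repeated_ids)
```

Here the id `a1` appears twice with different feature values, so the full-row check reports nothing while the id check flags one row. Whether such rows should be dropped depends on what the ids mean in the real data.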

# 2. Model building

Let's train the model using the function model_build, which:

1) separates the features (f0, f1, f2) from the target, product (volume of reserves in the well, thousand barrels);
2) splits the data into training and validation sets in a 75:25 ratio;
3) scales the features and fits a linear regression model;
4) calculates the mean predicted reserves and the RMSE of the model on the validation set.

In [6]:
# Model-building function
def model_build(data):

    # Separate the features from the target
    features = data[['f0', 'f1', 'f2']]
    target = data['product']

    # Split the data into training and validation sets (75:25)
    features_train, features_valid, target_train, target_valid = train_test_split(
        features, target, test_size=0.25, random_state=12345)
    print('features train size:', len(features_train))
    print('featues valid size:', len(features_valid))
    
    # Features scaling
    scaler = StandardScaler()
    scaler.fit(features_train)
    features_train_scaled = scaler.transform(features_train)
    features_valid_scaled = scaler.transform(features_valid)

    # The target is quantitative, so this is a regression task:
    # we fit a linear regression (other models were considered insufficiently predictive).
    model = LinearRegression()
    model.fit(features_train_scaled, target_train)
    predictions_valid = model.predict(features_valid_scaled)

    # Calculate the mean predicted reserves and the RMSE of the model
    # (np.sqrt of the MSE is used because squared=False is deprecated in recent scikit-learn)
    product_mean = predictions_valid.mean()
    RMSE = np.sqrt(mean_squared_error(target_valid, predictions_valid))
    print('Average production:', product_mean)
    print('RMSE :', RMSE)
    
    return (target_valid, predictions_valid, product_mean, RMSE)
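
The scale-then-fit steps inside model_build can also be bundled into a single scikit-learn pipeline, which fits the scaler on the training features only and reuses it at prediction time. This is a minimal sketch of that alternative, not the approach used in this notebook; the `demo` frame is synthetic data generated purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 'product' is a noisy linear function of the features
rng = np.random.default_rng(12345)
demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=['f0', 'f1', 'f2'])
demo['product'] = (50 + 3 * demo['f0'] - 2 * demo['f1'] + 10 * demo['f2']
                   + rng.normal(scale=1.0, size=1000))

# Same 75:25 split as in model_build
features_train, features_valid, target_train, target_valid = train_test_split(
    demo[['f0', 'f1', 'f2']], demo['product'], test_size=0.25, random_state=12345)

# The pipeline fits the scaler on the training features, then fits the regression
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(features_train, target_train)

# predict() applies the already-fitted scaler before the regression
predictions_valid = pipeline.predict(features_valid)

rmse = np.sqrt(mean_squared_error(target_valid, predictions_valid))
print('RMSE:', rmse)
```

For ordinary least squares, standardizing the features does not change the predictions, so the main benefit of the pipeline here is that the scaler and the model travel together as one object.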

Let's build a results table:

In [7]:
results_table = pd.DataFrame(columns=['product_mean', 'RMSE', 'mean_income', 'risk'])

list_of_data = [region_0, region_1, region_2]
list_of_predictions_valid = [0]*3
list_of_target_valid = [0]*3

# partially fill the table (product_mean and RMSE only):
for i in range(len(list_of_data)):
    print(i, 'region')
    list_of_target_valid[i], list_of_predictions_valid[i], results_table.loc[i, 'product_mean'], results_table.loc[i, 'RMSE'] = model_build(list_of_data[i])
    print()

0 region
features train size: 75000
featues valid size: 25000
Average production: 92.59256778438035
RMSE : 37.5794217150813

1 region
features train size: 75000
featues valid size: 25000
Average production: 68.72854689544602
RMSE : 0.8930992867756182

2 region
features train size: 75000
featues valid size: 25000
Average production: 94.96504596800489
RMSE : 40.02970873393434



**Section summary:**

- Regions 0 and 2 have the highest mean predicted reserves per well (about 92.6 and 95.0 thousand barrels, respectively), while Region 1 has about 68.7 thousand barrels.

- Along with the highest predicted reserves, Regions 0 and 2 have a high root mean squared error (RMSE of about 37.6 and 40.0 thousand barrels), so their predictions are less reliable. Region 1 has by far the lowest RMSE (about 0.89 thousand barrels).
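
RMSE values are easier to judge against a reference point. One common option, which is not part of the original analysis, is to compare the model with a constant baseline that always predicts the mean target. The sketch below uses small hypothetical arrays (not values from the datasets) to show the calculation:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical validation values (not taken from the datasets)
target_valid = np.array([10.0, 50.0, 90.0, 130.0])
predictions_valid = np.array([20.0, 40.0, 100.0, 120.0])

# RMSE of the model's predictions
model_rmse = np.sqrt(mean_squared_error(target_valid, predictions_valid))

# RMSE of a constant baseline that always predicts the mean target
# (in practice the mean would come from the training targets)
baseline_predictions = np.full_like(target_valid, target_valid.mean())
baseline_rmse = np.sqrt(mean_squared_error(target_valid, baseline_predictions))

print('Model RMSE:', model_rmse)
print('Baseline RMSE:', baseline_rmse)
```

A model whose RMSE is close to the baseline's adds little beyond predicting the average, while a much smaller RMSE indicates that the features carry real predictive signal.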


# 3. Income calculations