# Selecting a location for new oil wells.

We need to decide where to drill new wells. 

We have been provided with data on oil samples from three regions, with 100,000 sites in each, where the quality of the oil and the volume of reserves have been measured. We will build a machine learning model to determine the region where extraction will bring the highest profit, and analyze the potential profit and risks using the Bootstrap technique.

Steps for selecting a location:

- In the chosen region, candidate sites are surveyed and feature values are measured for each;
- A model is built and the volume of reserves at each site is estimated;
- The sites with the highest estimated reserves are selected. The number of sites depends on the company's budget and the cost of developing one well;
- The profit is the total income from the selected sites minus the development budget.

## Task conditions:
* Only linear regression is suitable for training the model (others are not predictable enough).
* During exploration of a region, 500 points are surveyed, and the 200 best of them are selected for development using machine learning.
* The budget for developing wells in the region is 10 billion rubles.
* At current prices, one barrel of raw material brings 450 rubles of income. The income from each unit of product is 450 thousand rubles, since the volume is indicated in thousands of barrels.
* After assessing the risks, only those regions with a probability of losses less than 2.5% should be left. Among them, the region with the highest average profit is chosen.
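
The last condition amounts to a filter followed by picking the maximum. A minimal sketch, assuming each region has already been summarised by a risk of loss and an average profit; the function name `choose_region`, the dictionary layout, and the numbers are illustrative and not taken from the data:

```python
def choose_region(results, max_risk=0.025):
    """Return the name of the eligible region with the highest mean profit.

    `results` maps a region name to a dict with keys 'risk' and
    'mean_profit' (hypothetical layout). Returns None if no region
    passes the risk filter.
    """
    eligible = {name: r for name, r in results.items() if r['risk'] < max_risk}
    if not eligible:
        return None
    return max(eligible, key=lambda name: eligible[name]['mean_profit'])


# Illustrative numbers only
example = {
    'A': {'risk': 0.060, 'mean_profit': 500.0},  # most profitable, but too risky
    'B': {'risk': 0.010, 'mean_profit': 400.0},
    'C': {'risk': 0.020, 'mean_profit': 450.0},
}
print(choose_region(example))  # prints C
```

Note that the most profitable region is not necessarily the one chosen: the risk filter is applied first.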

## Contents:
1. Reviewing and preprocessing of data.
2. Training and testing the model.
3. Preparing for profit calculation.
4. Calculating profits and risks.
5. Overall conclusion.

## Description of data:

* `id` is a unique identifier for the well.
* `f0`, `f1`, and `f2` are three features of the points (their meaning is not disclosed, but the features themselves are significant).
* `product` is the volume of reserves in the well (thousands of barrels).

## Plan for project execution:
1. Load and prepare the data.
2. Train and test the model for each region:
 - 2.1. Split the data into a training and validation set in a 75:25 ratio.
 - 2.2. Train the model and make predictions on the validation set.
 - 2.3. Save the predictions and the correct answers on the validation set.
 - 2.4. Print the average reserve of the predicted raw material and the RMSE of the model on the screen.
 - 2.5. Analyze the results.
3. Prepare for profit calculation:
 - 3.1. Save all the key values for the calculations in separate variables.
 - 3.2. Calculate the sufficient volume of raw material for the break-even development of a new well. Compare the obtained volume of raw material with the average reserve in each region.
 - 3.3. Write conclusions for the stage of preparation of profit calculation.
4. Write a function for calculating the profit for the selected wells and model predictions:
 - 4.1. Select wells with the maximum prediction values.
 - 4.2. Sum up the target value of the raw material volume corresponding to these predictions.
 - 4.3. Calculate the profit for the obtained volume of raw material.
5. Calculate the risks and profits for each region:
 - 5.1. Apply the Bootstrap technique with 1000 samples to find the profit distribution.
 - 5.2. Find the average profit, 95% confidence interval, and risk of loss. Loss is a negative profit.
 - 5.3. Write conclusions: propose a region for well development and justify the choice.

## 1. Reviewing and preprocessing of data.

In [1]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

from numpy.random import RandomState

In [2]:
pd.set_option('display.max_columns', 50)
pd.options.display.float_format = '{:,.2f}'.format

Let's load the data.

In [3]:
url_1 = 'https://code.s3.yandex.net/datasets/geo_data_0.csv'
url_2 = 'https://code.s3.yandex.net/datasets/geo_data_1.csv'
url_3 = 'https://code.s3.yandex.net/datasets/geo_data_2.csv'

geo_data_1 = pd.read_csv(url_1)
geo_data_2 = pd.read_csv(url_2)
geo_data_3 = pd.read_csv(url_3)

Let's take a look at the data.

In [4]:
geo_data_1.head(3)

|   | id | f0 | f1 | f2 | product |
|---|----|----|----|----|---------|
| 0 | txEyH | 0.71 | -0.5 | 1.22 | 105.28 |
| 1 | 2acmU | 1.33 | -0.34 | 4.37 | 73.04 |
| 2 | 409Wp | 1.02 | 0.15 | 1.42 | 85.27 |


In [5]:
geo_data_2.head(3)

|   | id | f0 | f1 | f2 | product |
|---|----|----|----|----|---------|
| 0 | kBEdx | -15.0 | -8.28 | -0.01 | 3.18 |
| 1 | 62mP7 | 14.27 | -3.48 | 1.0 | 26.95 |
| 2 | vyE1P | 6.26 | -5.95 | 5.0 | 134.77 |


In [6]:
geo_data_3.head(3)

|   | id | f0 | f1 | f2 | product |
|---|----|----|----|----|---------|
| 0 | fwXo0 | -1.15 | 0.96 | -0.83 | 27.76 |
| 1 | WJtFt | 0.26 | 0.27 | -2.53 | 56.07 |
| 2 | ovLUW | 0.19 | 0.29 | -5.59 | 62.87 |


Let's look at the info of the datasets.

In [7]:
print('geo_data_1')
geo_data_1.info()
print()
print('geo_data_2')
geo_data_2.info()
print()
print('geo_data_3')
geo_data_3.info()

geo_data_1
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB

geo_data_2
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB

geo_data_3
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 

No missing values were detected.
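
`info()` reports non-null counts per column; an explicit count is another way to confirm this. A small sketch on a toy frame (the helper name `count_missing` and the data are illustrative):

```python
import numpy as np
import pandas as pd


def count_missing(df):
    """Return the number of missing values in each column."""
    return df.isna().sum()


toy = pd.DataFrame({'f0': [0.1, np.nan, 0.3], 'product': [10.0, 20.0, 30.0]})
missing = count_missing(toy)
print(missing)
```

Applied to each of the three datasets, a result of all zeros would match the conclusion above.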

Let's check for duplicates.

In [8]:
print('The number of duplicates geo_data_1:', geo_data_1.duplicated().sum())
print('The number of duplicates geo_data_2:', geo_data_2.duplicated().sum())
print('The number of duplicates geo_data_3:', geo_data_3.duplicated().sum())

The number of duplicates geo_data_1: 0
The number of duplicates geo_data_2: 0
The number of duplicates geo_data_3: 0


No duplicates were identified.
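
`duplicated()` with no arguments flags only rows that match in every column. Since `id` is meant to be unique, it may be worth checking that column on its own as well. A sketch on a toy frame (the helper `count_duplicate_ids` is illustrative; whether the real datasets contain repeated IDs is not checked here):

```python
import pandas as pd


def count_duplicate_ids(df, id_column='id'):
    """Count rows whose identifier already appeared in an earlier row."""
    return int(df[id_column].duplicated().sum())


toy = pd.DataFrame({
    'id': ['a1', 'b2', 'a1'],
    'f0': [0.5, 1.0, 0.7],   # the repeated id has different feature values
})
full_row_duplicates = int(toy.duplicated().sum())
repeated_ids = count_duplicate_ids(toy)
print(full_row_duplicates, repeated_ids)  # prints 0 1
```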

We will drop the `id` column. It is a unique identifier for each well and carries no predictive information, and as a non-numeric column it cannot be passed to the scaler or the linear regression as is.

In [9]:
geo_data_1 = geo_data_1.drop(['id'], axis=1)
geo_data_2 = geo_data_2.drop(['id'], axis=1)
geo_data_3 = geo_data_3.drop(['id'], axis=1)

## 2. Training and testing the model.

Let's write a function that does the following:
* Separates the features from the target.
* Splits the dataset into training and validation sets.
* Standardizes the features (centering and scaling).
* Trains a linear regression model.
* Calculates the RMSE on the validation set, as well as the average actual and predicted reserves for the region.

In [10]:
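def region_prediction(data):
    """Train a linear regression for one region and evaluate it on a validation split.

    Returns validation predictions, RMSE, mean of actual reserves in the region,
    mean of predicted reserves, and the validation target with a reset index.
    """
    features = data.drop(['product'], axis=1)
    target = data['product']

    # 75:25 split into training and validation sets
    features_train, features_valid, target_train, target_valid = train_test_split(
        features, target, test_size=0.25, random_state=12345)

    # Fit the scaler on the training set only, then apply it to both sets
    scaler = StandardScaler()
    features_train = scaler.fit_transform(features_train)
    features_valid = scaler.transform(features_valid)

    model_lr = LinearRegression()
    model_lr.fit(features_train, target_train)

    # Positional index (0..n-1), matching the reset index of target_valid below
    predictions = pd.Series(model_lr.predict(features_valid))
    rmse = np.sqrt(mean_squared_error(target_valid, predictions))
    stock_mean = data['product'].mean()
    stock_mean_pred = predictions.mean()

    return predictions, rmse, stock_mean, stock_mean_pred, target_valid.reset_index(drop=True)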
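
A note on the scaling step: ordinary least squares with an intercept gives the same predictions whether or not the features are standardized, so for `LinearRegression` the scaler mainly affects how the coefficients read, not the RMSE. A sketch on synthetic data (all values are generated for illustration) that checks this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales
rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, scale=[1.0, 10.0, 0.1], size=(200, 3))
y = X @ np.array([2.0, -0.5, 30.0]) + rng.normal(size=200)

X_train, X_valid = X[:150], X[150:]
y_train = y[:150]

# Model on raw features
raw_pred = LinearRegression().fit(X_train, y_train).predict(X_valid)

# Model on standardized features (scaler fitted on the training part only)
scaler = StandardScaler().fit(X_train)
scaled_pred = (
    LinearRegression()
    .fit(scaler.transform(X_train), y_train)
    .predict(scaler.transform(X_valid))
)

same = np.allclose(raw_pred, scaled_pred)
print(same)
```

Scaling does matter for regularized or distance-based models, which is one reason to keep the step if the model might be swapped later.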

Let's look at the metric and the average stock of raw materials.

In [11]:
pred_1, rmse_1, stock_mean_1, stock_mean_pred_1, target_valid_1 = region_prediction(geo_data_1)
print('The RMSE of the model in Region 1 = {:.4f}'.format(rmse_1))
print('The average stock of predicted raw materials in Region 1 = {:.2f} ths. barrels'.format(stock_mean_pred_1))
print()
pred_2, rmse_2, stock_mean_2, stock_mean_pred_2, target_valid_2 = region_prediction(geo_data_2)
print('The RMSE of the model in Region 2 = {:.4f}'.format(rmse_2))
print('The average stock of predicted raw materials in Region 2 = {:.2f} ths. barrels'.format(stock_mean_pred_2))
print()
pred_3, rmse_3, stock_mean_3, stock_mean_pred_3, target_valid_3 = region_prediction(geo_data_3)
print('The RMSE of the model in Region 3 = {:.4f}'.format(rmse_3))
print('The average stock of predicted raw materials in Region 3 = {:.2f} ths. barrels'.format(stock_mean_pred_3))
print()

The RMSE of the model in Region 1 = 37.5794
The average stock of predicted raw materials in Region 1 = 92.59 ths. barrels

The RMSE of the model in Region 2 = 0.8931
The average stock of predicted raw materials in Region 2 = 68.73 ths. barrels

The RMSE of the model in Region 3 = 40.0297
The average stock of predicted raw materials in Region 3 = 94.97 ths. barrels



Conclusion: 

As the RMSE results show, the model is most accurate on Region 2, with an RMSE of 0.8931, but Region 2 also has the lowest average predicted reserves, 68.73 thousand barrels. Regions 1 and 3 are close to each other in both metric and average predicted reserves: the RMSE is 37.5794 in Region 1 and 40.0297 in Region 3, and the average predicted reserves are 92.59 and 94.97 thousand barrels respectively.
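
RMSE is easier to interpret next to a naive baseline. A model that always predicts the training mean has an RMSE close to the standard deviation of the target, so a regression is only informative to the extent that it beats that figure. A sketch on synthetic data (the helpers `rmse` and `baseline_rmse` and the numbers are illustrative, not results from the three regions):

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def baseline_rmse(target_train, target_valid):
    """RMSE of a constant model that always predicts the training mean."""
    constant = np.full(len(target_valid), np.mean(target_train))
    return rmse(target_valid, constant)


# Synthetic target with a standard deviation of about 40
rng = np.random.RandomState(0)
target = rng.normal(loc=90.0, scale=40.0, size=10_000)
train, valid = target[:7_500], target[7_500:]
print(round(baseline_rmse(train, valid), 1))
```

Running the same comparison per region would show how much of the target's spread each regression actually explains.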

## 3. Preparing for profit calculation.

Let's save the key values for calculations in separate variables.

In [12]:
budget = 10_000_000_000                # The budget for drilling in one region
research_wells = 500                   # The number of wells for exploration of the region
best_wells = 200                       # The number of best wells for development
revenue_per_thousand_barrel = 450_000  # The income from 1000 barrels
max_risk = 0.025                       # The maximum risk level
cost_hole = budget / best_wells        # The cost of developing one well

In [13]:
print('The volume of raw materials for break-even development of a new well = {:.2f} ths. barrels'.format(cost_hole / revenue_per_thousand_barrel))

The volume of raw materials for break-even development of a new well = 111.11 ths. barrels


In [14]:
print('The average reserves of the region 1 = {:.2f} ths. barrels'.format(stock_mean_1))
print('The average reserves of the region 2 = {:.2f} ths. barrels'.format(stock_mean_2))
print('The average reserves of the region 3 = {:.2f} ths. barrels'.format(stock_mean_3))

The average reserves of the region 1 = 92.50 ths. barrels
The average reserves of the region 2 = 68.83 ths. barrels
The average reserves of the region 3 = 95.00 ths. barrels


Conclusion: 

The average reserves per well in every region (92.50, 68.83, and 95.00 thousand barrels) are below the break-even volume of 111.11 thousand barrels per well. Developing randomly chosen wells would therefore lead to losses on average, so the wells have to be selected with the model.
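
The break-even figure follows directly from the task conditions. Writing $B$ for the budget, $n$ for the number of developed wells, and $r$ for the income per thousand barrels (symbols introduced here for illustration):

```latex
V_{\text{break-even}} = \frac{B}{n \cdot r}
= \frac{10\,000\,000\,000}{200 \cdot 450\,000}
= \frac{50\,000\,000}{450\,000}
\approx 111.11 \ \text{thousand barrels per well}
```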

## 4. Calculating profits and risks.

Let's write a function for calculating the profit by regions.

In [15]:
def profit(prediction, target):
    data = pd.concat([prediction, target],axis=1)
    data.columns = ['prediction','target']
    data = data.sort_values(by = 'prediction', ascending = False)[:best_wells]
    return (data['target'].sum() * revenue_per_thousand_barrel - budget)
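
To make the selection logic concrete, here is a small self-contained variant with toy numbers. It ranks wells by prediction but sums the actual reserves of the chosen wells, which is the point of the function above. The name `profit_top_k` and all values are illustrative:

```python
import pandas as pd


def profit_top_k(prediction, target, k, revenue_per_unit, budget):
    """Profit from developing the k wells with the highest predictions."""
    top_index = prediction.sort_values(ascending=False).index[:k]
    return target.loc[top_index].sum() * revenue_per_unit - budget


prediction = pd.Series([120.0, 80.0, 150.0, 95.0])
target = pd.Series([100.0, 130.0, 140.0, 90.0])

# Wells 2 and 0 have the highest predictions, so their true reserves
# (140 + 100 = 240) are used, not the truly best pair (140 + 130 = 270).
result = profit_top_k(prediction, target, k=2, revenue_per_unit=10, budget=2_000)
print(result)  # prints 400.0
```

The gap between 240 and 270 is the cost of prediction error: the less accurate the model, the more often a weaker well is chosen over a stronger one.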

In [16]:
revenue_1 = profit(pred_1, target_valid_1)
print('The profit for the received volume of raw materials in the region 1 = {:.1f} mln. rub.'.format(revenue_1 / 1e6))
print()
revenue_2 = profit(pred_2, target_valid_2)
print('The profit for the received volume of raw materials in the region 2 = {:.1f} mln. rub.'.format(revenue_2 / 1e6))
print()
revenue_3 = profit(pred_3, target_valid_3)
print('The profit for the received volume of raw materials in the region 3 = {:.1f} mln. rub.'.format(revenue_3 / 1e6))

The profit for the received volume of raw materials in the region 1 = 3320.8 mln. rub.

The profit for the received volume of raw materials in the region 2 = 2415.1 mln. rub.

The profit for the received volume of raw materials in the region 3 = 2710.3 mln. rub.


Let's write a function for calculating the profit and risks of each region using the Bootstrap technique.

In [17]:
def estimate(prediction, target):
    """Bootstrap the profit distribution; print and return mean, 95% interval and risk of loss."""
    state = np.random.RandomState(12345)
    
    values = []
    
    for _ in range(1000):
        # Draw the explored wells with replacement, then take the matching predictions
        target_subsample = target.sample(n=research_wells, 
                                         replace=True, 
                                         random_state=state)
        pred_subsample = prediction[target_subsample.index]
        values.append(profit(pred_subsample, target_subsample))
    
    values = pd.Series(values)
    # Convert rubles to millions of rubles
    mean = values.mean() / 1e6
    lower = values.quantile(0.025) / 1e6
    upper = values.quantile(0.975) / 1e6
    confidence_interval = (lower, upper)
    # Share of bootstrap samples with negative profit
    risk_of_loss = (values < 0).sum() / values.count()
    
    print('Average profit = {:.1f} mln. rub.'.format(mean))
    print('95% confidence interval from {:.1f} to {:.1f} mln. rub.'.format(lower, upper))
    print('Risk level {:.2%}'.format(risk_of_loss))
    
    return mean, confidence_interval, risk_of_loss

Let's look at the results.

In [18]:
print('Region 1')
region_1 = estimate(pred_1, target_valid_1)
print()
print('Region 2')
region_2 = estimate(pred_2, target_valid_2)
print()
print('Region 3')
region_3 = estimate(pred_3, target_valid_3)
print()

Region 1
Average profit = 396.2 mln. rub.
95% confidence interval from -111.2 to 909.8 mln. rub.
Risk level 6.90%

Region 2
Average profit = 456.0 mln. rub.
95% confidence interval from 33.8 to 852.3 mln. rub.
Risk level 1.50%

Region 3
Average profit = 404.4 mln. rub.
95% confidence interval from -163.4 to 950.4 mln. rub.
Risk level 7.60%
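
The resampling above depends on predictions and targets sharing the same index labels. A variant that samples row positions with NumPy avoids that dependency. This is a sketch on synthetic data; `bootstrap_profit`, its parameters, and the toy economics are assumptions for illustration, not the notebook's implementation:

```python
import numpy as np


def bootstrap_profit(prediction, target, n_explored, n_best,
                     revenue_per_unit, budget, n_iter=1000, seed=12345):
    """Return an array of bootstrapped profits.

    Each iteration draws n_explored row positions with replacement,
    keeps the n_best rows by prediction, and sums their true targets.
    """
    prediction = np.asarray(prediction)
    target = np.asarray(target)
    rng = np.random.default_rng(seed)
    profits = np.empty(n_iter)
    for i in range(n_iter):
        rows = rng.integers(0, len(target), size=n_explored)
        # Positions (within the sample) of the n_best highest predictions
        best = rows[np.argsort(prediction[rows])[-n_best:]]
        profits[i] = target[best].sum() * revenue_per_unit - budget
    return profits


# Synthetic data: noisy predictions of a known target
rng = np.random.default_rng(0)
target = rng.normal(100.0, 30.0, size=5_000)
prediction = target + rng.normal(0.0, 20.0, size=5_000)

profits = bootstrap_profit(prediction, target, n_explored=500, n_best=200,
                           revenue_per_unit=1.0, budget=22_000.0)
lower, upper = np.quantile(profits, [0.025, 0.975])
risk = (profits < 0).mean()
print(profits.mean(), (lower, upper), risk)
```

The three summary figures are computed the same way as in `estimate`: the mean of the bootstrapped profits, their 2.5% and 97.5% quantiles, and the share of negative values.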



## 5. Overall conclusion.
* No missing values, errors, anomalies, or duplicates were identified upon data review.
* The `id` column, a unique identifier with no predictive value, was removed from the dataframes.
* Features and target feature were identified.
* The datasets were divided into training and validation sets.
* Data centering and standardization were performed by scaling the data.
* A linear regression model was trained.
* The RMSE metrics were calculated for each region, as well as the average raw material reserve in the region, and it was seen that the model was most accurate in region 2 with an RMSE of 0.8931, but with the lowest average predicted reserve of 68.73 thousand barrels. Regions 1 and 3 were close in terms of metrics and average predicted reserves, with RMSEs of 37.5794 and 40.0297, respectively, and average predicted reserves of 92.59 and 94.97 thousand barrels, respectively.
* The key values for profit and risk calculations were saved in separate variables.
* A function was written for profit and risk calculations using the Bootstrap technique.
* Region 2 is the recommended region for development. Its probability of loss is 1.5%, below the 2.5% threshold, while regions 1 and 3 exceed the threshold at 6.9% and 7.6%. The 95% confidence interval for regions 1 and 3 extends into negative profit, whereas for region 2 it stays positive throughout. Region 2 also has the highest average profit of the three regions.