# Best Place for New Oil Well?

# Content

* [Introduction](#intro)
* [Initialization](#initialization)
* [Data Exploration](#data_exploration)
  * [Region 1](#region_1)
  * [Region 2](#region_2)
  * [Region 3](#region_3)
  * [Conclusion](#conclusion)
* [Model Training](#model_training)
  * [Splitting the source data](#splitting)
    * [Region 1](#region_1)
    * [Region 2](#region_2)
    * [Region 3](#region_3)
  * [Training the model and making predictions](#training_predictions)
    * [Region 1](#region_1)
    * [Region 2](#region_2)
    * [Region 3](#region_3)
  * [Conclusion](#conclusion)
* [Calculating Volume of Reserve](#volume)
  * [Conclusion](#conclusion)
* [Profit Calculation](#profit)
  * [Obtaining highest values of predictions](#highest_value)
  * [Calculating profit from selected oil well and model predictions](#profit_calculations)
* [Bootstrapping for Profit and Risk](#bootstrapping)
  * [Region 1](#region_1)
  * [Region 2](#region_2)
  * [Region 3](#region_3)
  * [Conclusion](#conclusion)
* [Findings](#findings)

## Introduction

OilyGiant, a mining company, wants to find the best place for a new oil well. We are given data on oil samples from three regions, including the parameters of each oil well in the region. The budget for developing 200 oil wells is 100 million USD. One barrel of raw material brings 4.5 USD of revenue; because the volume of reserves is measured in thousand barrels, the revenue from one unit of product is 4,500 USD. We will build a model that helps pick the region with the highest profit.

**Data Description:**

* Geological exploration data for the 3 regions are stored in files:
  * `geo_data_0.csv`
  * `geo_data_1.csv`
  * `geo_data_2.csv`
* `id` - unique oil well identifier
* `f0`, `f1`, `f2` - three features of points (their specific meaning is unimportant, but the features themselves are significant)
* `product` - volume of reserves in the oil well (thousand barrels)

**Objectives:**

* Train and test the model for each region 
* Calculate volume of reserves sufficient for developing a new well without losses
* Calculate profit from a set of selected oil wells and model predictions 
* Select a region for oil wells' development and calculate the profit for the obtained volume of reserves 
* Use bootstrapping technique with 1000 samples to find the distribution of profit 

## Initialization 

In [106]:
# Loading all libraries 
import pandas as pd
import numpy as np
from numpy.random import RandomState
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score


In [107]:
# Loading the data
try:
  df_0 = pd.read_csv('geo_data_0.csv')
  df_1 = pd.read_csv('geo_data_1.csv')
  df_2 = pd.read_csv('geo_data_2.csv')
except FileNotFoundError:
  df_0 = pd.read_csv('/datasets/geo_data_0.csv')
  df_1 = pd.read_csv('/datasets/geo_data_1.csv')
  df_2 = pd.read_csv('/datasets/geo_data_2.csv')


## Data Exploration

### Region 1 (`geo_data_0`)

Description of data:

* `id` - unique oil well identifier
* `f0`, `f1`, `f2` - three features of points (their specific meaning is unimportant, but the features themselves are significant)
* `product` - volume of reserves in the oil well (thousand barrels)

In [108]:
# Obtaining the first 5 rows of the table
df_0.head()

| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | txEyH | 0.705745 | -0.497823 | 1.22117 | 105.280062 |
| 1 | 2acmU | 1.334711 | -0.340164 | 4.36508 | 73.03775 |
| 2 | 409Wp | 1.022732 | 0.15199 | 1.419926 | 85.265647 |
| 3 | iJLyR | -0.032172 | 0.139033 | 2.978566 | 168.620776 |
| 4 | Xdl7t | 1.988431 | 0.155413 | 4.751769 | 154.036647 |


In [109]:
# Obtaining the number of rows and columns
shape = df_0.shape
print('Region 1 has {} rows and {} columns'.format(shape[0], shape[1]))

Region 1 has 100000 rows and 5 columns


In [110]:
# Obtaining general info on the table
df_0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


Since every column has 100,000 non-null values, there is no missing data in this table.

In [111]:
# Checking for duplicates
df_0.duplicated().sum()

0

### Region 2 (`geo_data_1`)

Description of data:

* `id` - unique oil well identifier
* `f0`, `f1`, `f2` - three features of points (their specific meaning is unimportant, but the features themselves are significant)
* `product` - volume of reserves in the oil well (thousand barrels)

In [112]:
# Obtaining first 5 rows of the table
df_1.head()

| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | kBEdx | -15.001348 | -8.276 | -0.005876 | 3.179103 |
| 1 | 62mP7 | 14.272088 | -3.475083 | 0.999183 | 26.953261 |
| 2 | vyE1P | 6.263187 | -5.948386 | 5.00116 | 134.766305 |
| 3 | KcrkZ | -13.081196 | -11.506057 | 4.999415 | 137.945408 |
| 4 | AHL4O | 12.702195 | -8.147433 | 5.004363 | 134.766305 |


In [113]:
# Obtaining number of rows and columns 
shape = df_1.shape
print('Region 2 has {} rows and {} columns'.format(shape[0], shape[1]))

Region 2 has 100000 rows and 5 columns


In [114]:
# Obtaining general info on the table
df_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


Since every column has 100,000 non-null values, there is no missing data in this table.

In [115]:
# Checking for duplicates
df_1.duplicated().sum()

0

### Region 3 (`geo_data_2`)

Description of data:
* `id` - unique oil well identifier
* `f0`, `f1`, `f2` - three features of points (their specific meaning is unimportant, but the features themselves are significant)
* `product` - volume of reserves in the oil well (thousand barrels)

In [116]:
# Obtaining first 5 rows of the table
df_2.head()

| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | fwXo0 | -1.146987 | 0.963328 | -0.828965 | 27.758673 |
| 1 | WJtFt | 0.262778 | 0.269839 | -2.530187 | 56.069697 |
| 2 | ovLUW | 0.194587 | 0.289035 | -5.586433 | 62.87191 |
| 3 | q6cA6 | 2.23606 | -0.55376 | 0.930038 | 114.572842 |
| 4 | WPMUX | -0.515993 | 1.716266 | 5.899011 | 149.600746 |


In [117]:
# Obtaining number of rows and columns 
shape = df_2.shape
print('Region 3 has {} rows and {} columns'.format(shape[0], shape[1]))

Region 3 has 100000 rows and 5 columns


In [118]:
# Obtaining general info on the table
df_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


Since every column has 100,000 non-null values, there is no missing data in this table.

In [119]:
# Checking for duplicates
df_2.duplicated().sum()

0

### Conclusion

Each region's table has 100,000 rows and 5 columns, with the same column names in a consistent lowercase style. There are two data types: the `id` column is of object type, and `f0`, `f1`, `f2`, and `product` are floats. None of the three tables has missing values or fully duplicated rows.
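The checks above can be gathered into one helper. The following is a minimal sketch, assuming a hypothetical `summarize_quality` function and a small made-up table in place of the real CSV files. It also counts repeated `id` values, which `duplicated()` on whole rows does not detect; whether the real files contain such rows is not checked in this notebook.

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame, id_col: str = "id") -> dict:
    """Return basic data-quality counts for one region's table (hypothetical helper)."""
    return {
        "rows": len(df),
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "repeated_ids": int(df[id_col].duplicated().sum()),
    }


# Toy data (not from the real files): two rows share an id but differ elsewhere,
# and one feature value is missing.
toy = pd.DataFrame(
    {
        "id": ["a1", "b2", "a1"],
        "f0": [0.1, 0.2, 0.3],
        "f1": [1.0, None, 1.2],
        "f2": [2.0, 2.1, 2.2],
        "product": [10.0, 20.0, 30.0],
    }
)
report = summarize_quality(toy)
print(report)
```

On the real data the same call would be `summarize_quality(df_0)`, and likewise for `df_1` and `df_2`.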

We will now train and test the model for each region.

## Model Training

### Splitting the source data of each region 


#### Region 1

In [120]:
# Dropping unimportant feature
df_0 = df_0.drop(['id'], axis=1)

# Checking to make sure column was dropped
df_0.head()

| | f0 | f1 | f2 | product |
|---|---|---|---|---|
| 0 | 0.705745 | -0.497823 | 1.22117 | 105.280062 |
| 1 | 1.334711 | -0.340164 | 4.36508 | 73.03775 |
| 2 | 1.022732 | 0.15199 | 1.419926 | 85.265647 |
| 3 | -0.032172 | 0.139033 | 2.978566 | 168.620776 |
| 4 | 1.988431 | 0.155413 | 4.751769 | 154.036647 |


In [121]:
# Declaring variables for features and target 
features_0 = df_0.drop(['product'], axis=1)
target_0 = df_0['product']

In [122]:
# Splitting the data into 75% training set and 25% validation set
features_train_0, features_valid_0, target_train_0, target_valid_0 = train_test_split(features_0, target_0, test_size=0.25, random_state=12345)

In [123]:
# Checking to see if data was split properly
print('The training set contains {} rows,'.format(features_train_0.shape[0]), 'which represents 75% of the data')
print('The validation set contains {} rows,'.format(features_valid_0.shape[0]), 'which represents 25% of the data')

The training set contains 75000 rows, which represents 75% of the data
The validation set contains 25000 rows, which represents 25% of the data


In [124]:
# Features scaling 
scaler = StandardScaler()
scaler.fit(features_train_0)
features_train_0 = scaler.transform(features_train_0)
features_valid_0 = scaler.transform(features_valid_0)

In [125]:
# Checking size of features and target set
print('Size of training features:', features_train_0.shape)
print('Size of training target:', target_train_0.shape)
print()
print('Size of validation features:', features_valid_0.shape)
print('Size of validation target:', target_valid_0.shape)

Size of training features: (75000, 3)
Size of training target: (75000,)

Size of validation features: (25000, 3)
Size of validation target: (25000,)


#### Region 2

In [126]:
# Dropping unimportant feature
df_1 = df_1.drop(['id'], axis=1)

# Checking to make sure column was dropped
df_1.head()

| | f0 | f1 | f2 | product |
|---|---|---|---|---|
| 0 | -15.001348 | -8.276 | -0.005876 | 3.179103 |
| 1 | 14.272088 | -3.475083 | 0.999183 | 26.953261 |
| 2 | 6.263187 | -5.948386 | 5.00116 | 134.766305 |
| 3 | -13.081196 | -11.506057 | 4.999415 | 137.945408 |
| 4 | 12.702195 | -8.147433 | 5.004363 | 134.766305 |


In [127]:
# Declaring variables for features and target 
features_1 = df_1.drop(['product'], axis=1)
target_1 = df_1['product']

In [128]:
# Splitting the data into 75% training set and 25% validation set
features_train_1, features_valid_1, target_train_1, target_valid_1 = train_test_split(features_1, target_1, test_size=0.25, random_state=12345)

In [129]:
# Checking to see if data was split properly
print('The training set contains {} rows,'.format(features_train_1.shape[0]), 'which represents 75% of the data')
print('The validation set contains {} rows,'.format(features_valid_1.shape[0]), 'which represents 25% of the data')

The training set contains 75000 rows, which represents 75% of the data
The validation set contains 25000 rows, which represents 25% of the data


In [130]:
# Features scaling 
scaler = StandardScaler()
scaler.fit(features_train_1)
features_train_1 = scaler.transform(features_train_1)
features_valid_1 = scaler.transform(features_valid_1)

In [131]:
# Checking size of features and target set
print('Size of training features:', features_train_1.shape)
print('Size of training target:', target_train_1.shape)
print()
print('Size of validation features:', features_valid_1.shape)
print('Size of validation target:', target_valid_1.shape)

Size of training features: (75000, 3)
Size of training target: (75000,)

Size of validation features: (25000, 3)
Size of validation target: (25000,)


#### Region 3

In [132]:
# Dropping unimportant feature
df_2 = df_2.drop(['id'], axis=1)

# Checking to make sure column was dropped
df_2.head()

| | f0 | f1 | f2 | product |
|---|---|---|---|---|
| 0 | -1.146987 | 0.963328 | -0.828965 | 27.758673 |
| 1 | 0.262778 | 0.269839 | -2.530187 | 56.069697 |
| 2 | 0.194587 | 0.289035 | -5.586433 | 62.87191 |
| 3 | 2.23606 | -0.55376 | 0.930038 | 114.572842 |
| 4 | -0.515993 | 1.716266 | 5.899011 | 149.600746 |


In [133]:
# Declaring variables for features and target 
features_2 = df_2.drop(['product'], axis=1)
target_2 = df_2['product']

In [134]:
# Splitting the data into 75% training set and 25% validation set
features_train_2, features_valid_2, target_train_2, target_valid_2 = train_test_split(features_2, target_2, test_size=0.25, random_state=12345)

In [135]:
# Checking to see if data was split properly
print('The training set contains {} rows,'.format(features_train_2.shape[0]), 'which represents 75% of the data')
print('The validation set contains {} rows,'.format(features_valid_2.shape[0]), 'which represents 25% of the data')

The training set contains 75000 rows, which represents 75% of the data
The validation set contains 25000 rows, which represents 25% of the data


In [136]:
# Features scaling 
scaler = StandardScaler()
scaler.fit(features_train_2)
features_train_2 = scaler.transform(features_train_2)
features_valid_2 = scaler.transform(features_valid_2)

In [137]:
# Checking size of features and target set
print('Size of training features:', features_train_2.shape)
print('Size of training target:', target_train_2.shape)
print()
print('Size of validation features:', features_valid_2.shape)
print('Size of validation target:', target_valid_2.shape)

Size of training features: (75000, 3)
Size of training target: (75000,)

Size of validation features: (25000, 3)
Size of validation target: (25000,)
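The split-and-scale steps are identical for the three regions, so they could be wrapped in one function. The following is a sketch only: `split_and_scale` is a hypothetical helper, not part of this notebook, and the data is synthetic. As in the cells above, the scaler is fitted on the training features only and then applied to both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_and_scale(df, target_col="product", test_size=0.25, seed=12345):
    """Split one region's table and standardize features with training statistics only."""
    features = df.drop(columns=[target_col])
    target = df[target_col]
    f_train, f_valid, t_train, t_valid = train_test_split(
        features, target, test_size=test_size, random_state=seed
    )
    scaler = StandardScaler().fit(f_train)  # statistics come from the training set
    return scaler.transform(f_train), scaler.transform(f_valid), t_train, t_valid


# Synthetic stand-in for one region (not the real data).
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f0", "f1", "f2"])
toy["product"] = rng.uniform(0, 190, size=1000)

f_train, f_valid, t_train, t_valid = split_and_scale(toy)
print(f_train.shape, f_valid.shape)
```

With the real tables, the same function could be called once per region after the `id` column is dropped.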


### Training the model and making predictions for each region

#### Region 1

In [138]:
# Creating linear regression model
model = LinearRegression()
model.fit(features_train_0, target_train_0)
predicted_valid_0 = model.predict(features_valid_0)

# Calculating RMSE
mse = mean_squared_error(target_valid_0, predicted_valid_0)
print('MSE =', round(mse,3))
print('RMSE =', round(mse**0.5, 3))

# Calculating R2
print('R2 = ', round(r2_score(target_valid_0, predicted_valid_0), 4))

# Calculating volume of predicted reserves
print('Volume of predicted reserves = ', round(predicted_valid_0.mean(), 3)) 


MSE = 1412.213
RMSE = 37.579
R2 =  0.2799
Volume of predicted reserves =  92.593


#### Region 2

In [139]:
# Creating linear regression model 
model = LinearRegression()
model.fit(features_train_1, target_train_1)
predicted_valid_1 = model.predict(features_valid_1)

# Calculating RMSE
mse = mean_squared_error(target_valid_1, predicted_valid_1)
print('MSE =', round(mse,3))
print('RMSE =', round(mse**0.5, 3))

# Calculating R2
print('R2 = ', round(r2_score(target_valid_1, predicted_valid_1), 4))

# Calculating volume of predicted reserves
print('Volume of predicted reserves = ', round(predicted_valid_1.mean(), 3)) 

MSE = 0.798
RMSE = 0.893
R2 =  0.9996
Volume of predicted reserves =  68.729


#### Region 3

In [140]:
# Creating linear regression model 
model = LinearRegression()
model.fit(features_train_2, target_train_2)
predicted_valid_2 = model.predict(features_valid_2)

# Calculating RMSE
mse = mean_squared_error(target_valid_2, predicted_valid_2)
print('MSE =', round(mse,3))
print('RMSE =', round(mse**0.5, 4))

# Calculating R2
print('R2 = ', round(r2_score(target_valid_2, predicted_valid_2), 4))

# Calculating volume of predicted reserves
print('Volume of predicted reserves = ', round(predicted_valid_2.mean(), 3)) 

MSE = 1602.378
RMSE = 40.0297
R2 =  0.2052
Volume of predicted reserves =  94.965


### Conclusion

We split the source data for each region in a 75:25 ratio: 75% training set and 25% validation set.
* The training set contains 75,000 rows.
* The validation set contains 25,000 rows.

We standardized the features of each region and trained the model with a linear regression. 

For Region 1, the root mean square error (RMSE) was 37.579 and the R2 value was 0.2799. A low R2 score indicates poor model quality.

For Region 2, the RMSE was 0.893 and the R2 value was 0.9996. An RMSE below 1 thousand barrels means the predictions are very close to the actual volumes of reserves, and an R2 score close to 1 indicates good model quality.

For Region 3, the RMSE was 40.030 and the R2 value was 0.2052. A low R2 score indicates poor model quality.
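For reference, both metrics can be computed directly from their definitions. The sketch below uses made-up numbers, not the notebook's data, and checks the manual results against scikit-learn. RMSE is expressed in the units of the target (thousand barrels), while R2 compares the model's squared error with that of always predicting the mean.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted volumes (thousand barrels).
y_true = np.array([100.0, 50.0, 80.0, 120.0])
y_pred = np.array([90.0, 60.0, 85.0, 110.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual.
rmse = np.sqrt(np.mean(residuals ** 2))

# R2: 1 minus (model's squared error / squared error of predicting the mean).
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# The manual values agree with scikit-learn's implementations.
assert np.isclose(rmse, mean_squared_error(y_true, y_pred) ** 0.5)
assert np.isclose(r2, r2_score(y_true, y_pred))
print(round(rmse, 3), round(r2, 4))
```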

The average volume of predicted reserves was also calculated for each region:
* Volume of predicted reserves in Region 1: 92.593 thousand barrels
* Volume of predicted reserves in Region 2: 68.729 thousand barrels
* Volume of predicted reserves in Region 3: 94.965 thousand barrels

We will now prepare for profit calculation and find the volume of reserves sufficient for developing a new well without losses.  


## Calculating Volume of Reserve

In [141]:
# Creating variables for key values 

# When studying a region, 500 oil wells(points) are used
region_wells = 500

# Best 200 oil wells (points) used for profit calculations
best_wells = 200

# Budget for development of 200 oil wells
total_budget = 100000000

# Revenue from one unit of product (volume of reserves is in thousand barrels)
rev_per_unit = 4500

# Maximum acceptable risk of losses: only regions below 2.5% are kept
risk_loss = 0.025

In [142]:
# Calculate the volume of reserves sufficient for developing a new well without losses
new_well_volume = total_budget/rev_per_unit/best_wells
print(f'Volume of reserves sufficient for developing a new well: {new_well_volume:.3f}')

Volume of reserves sufficient for developing a new well: 111.111


In [143]:
# Average volume of reserves in each region 
print('Average volume of reserves in Region 1:', round(predicted_valid_0.mean(), 3))
print()
print('Average volume of reserves in Region 2:', round(predicted_valid_1.mean(), 3))
print()
print('Average volume of reserves in Region 3:', round(predicted_valid_2.mean(), 3))

Average volume of reserves in Region 1: 92.593

Average volume of reserves in Region 2: 68.729

Average volume of reserves in Region 3: 94.965


### Conclusion

The volume of reserves sufficient for developing a new well without a loss is 111.111 thousand barrels. This volume is the break-even baseline for a single well. The average predicted volume of reserves in every region (92.593, 68.729, and 94.965 thousand barrels for Regions 1 to 3) is lower than this baseline, which suggests that wells should be selected by predicted volume rather than developed at random.
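The baseline follows from spreading the budget evenly over the developed wells. As a worked equation, with symbols introduced here only for illustration ($B$ for the total budget, $n$ for the number of wells developed, $r$ for the revenue per unit of product):

$$
V_{\min} = \frac{B}{n \cdot r} = \frac{100{,}000{,}000}{200 \times 4{,}500} \approx 111.11 \ \text{thousand barrels}
$$

Equivalently, each well has a budget of 500,000 USD, which has to be recovered at 4,500 USD per thousand barrels.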

Next, we will write a function to calculate profit from a set of selected oil wells and model predictions.

## Profit Calculation 

In [144]:
# Function to pick the wells with the highest predictions and return their total actual reserves (in barrels)
def highest_pred(target, predicted, count):
  target = pd.Series(target)
  predicted = pd.Series(predicted)
  vol_predicted = predicted.reset_index(drop=True).sort_values(ascending=False)
  select_vol = target.reset_index(drop=True).iloc[vol_predicted.index][:count]
  return select_vol.sum() * 1000


# Function to calculate profit
def revenue(target, predicted, count):
  target = pd.Series(target)
  predicted = pd.Series(predicted)
  predicted_sorted = predicted.reset_index(drop=True).sort_values(ascending=False)
  selected = target.reset_index(drop=True)[predicted_sorted.index][:count]
  profit = selected.sum()*rev_per_unit - total_budget
  return profit
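To make the selection step concrete: wells are ranked by their predicted volume, but the profit is computed from the actual volume of the wells selected. The sketch below uses made-up numbers. `top_k_profit` is a hypothetical variant of `revenue` that takes the price and budget as arguments instead of reading global variables.

```python
import numpy as np


def top_k_profit(actual, predicted, k, revenue_per_unit, budget):
    """Profit from the k wells with the highest predictions, valued at actual reserves."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    top = np.argsort(predicted)[::-1][:k]  # positions of the k largest predictions
    return actual[top].sum() * revenue_per_unit - budget


# Made-up volumes in thousand barrels; the model overrates the second well.
actual = [120.0, 30.0, 150.0, 90.0]
predicted = [110.0, 140.0, 100.0, 20.0]

# Two wells at an assumed 500,000 USD each.
profit = top_k_profit(actual, predicted, k=2, revenue_per_unit=4500, budget=2 * 500_000)
print(profit)  # -325000.0
```

The two top-ranked wells hold 30 + 120 = 150 thousand barrels, worth 675,000 USD, so the overrated well turns the toy selection into a loss.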

### Obtaining highest values of prediction for each region 

In [145]:
# Actual reserves of the 200 wells with the highest predictions in Region 1 (geo_data_0)
print('Actual reserves (barrels) in the 200 top-predicted wells of Region 1:', round(highest_pred(target_valid_0, predicted_valid_0, 200),3))

Actual reserves (barrels) in the 200 top-predicted wells of Region 1: 29601835.651


In [146]:
# Actual reserves of the 200 wells with the highest predictions in Region 2 (geo_data_1)
print('Actual reserves (barrels) in the 200 top-predicted wells of Region 2:', round(highest_pred(target_valid_1, predicted_valid_1, 200),3))

Actual reserves (barrels) in the 200 top-predicted wells of Region 2: 27589081.548


In [147]:
# Actual reserves of the 200 wells with the highest predictions in Region 3 (geo_data_2)
print('Actual reserves (barrels) in the 200 top-predicted wells of Region 3:', round(highest_pred(target_valid_2, predicted_valid_2, 200),3))

Actual reserves (barrels) in the 200 top-predicted wells of Region 3: 28245222.141


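The 200 wells with the highest predictions hold the largest total actual reserves in Region 1 (about 29.6 million barrels), followed by Region 3 (about 28.2 million barrels) and Region 2 (about 27.6 million barrels). We will now calculate the profit for each region.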

### Calculating profit from a set of selected oil wells and model predictions in each region

In [148]:
# Profit from a set of selected oil wells and model predictions in Region 1 (geo_data_0)
print('Profit from set of selected oil wells and model predictions in Region 1:', round(revenue(target_valid_0, predicted_valid_0, 200),2))

Profit from set of selected oil wells and model predictions in Region 1: 33208260.43


In [149]:
# Profit from a set of selected oil wells and model predictions in Region 2 (geo_data_1)
print('Profit from set of selected oil wells and model predictions in Region 2:', round(revenue(target_valid_1, predicted_valid_1, 200),2))

Profit from set of selected oil wells and model predictions in Region 2: 24150866.97


In [150]:
# Profit from a set of selected oil wells and model predictions in Region 3 (geo_data_2)
print('Profit from set of selected oil wells and model predictions in Region 3:', round(revenue(target_valid_2, predicted_valid_2, 200),2))

Profit from set of selected oil wells and model predictions in Region 3: 27103499.64


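Region 1 has the highest profit at 33 million USD, followed by Region 3 (27 million USD) and Region 2 (24 million USD). These figures select the best 200 wells from all 25,000 validation wells, so they are more optimistic than the 500-well scenario evaluated next.

We will now use the bootstrapping technique with 1000 samples to find the distribution of profit. For each region we will report the average profit, a 95% interval, and the risk of losses (the share of bootstrap samples with negative profit).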
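The procedure can be sketched as a single function. This is an illustration under stated assumptions, not the code used for the results below: `bootstrap_profit` is a hypothetical helper, the data is synthetic, and it uses NumPy's `default_rng` rather than `RandomState`, so its numbers will not match the notebook's outputs.

```python
import numpy as np


def bootstrap_profit(actual, predicted, n_iter=1000, sample_size=500, k=200,
                     revenue_per_unit=4500, budget=100_000_000, seed=12345):
    """Bootstrap the profit of the top-k wells chosen from random samples of wells."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rng = np.random.default_rng(seed)
    profits = np.empty(n_iter)
    for i in range(n_iter):
        # Draw well positions with replacement.
        idx = rng.integers(0, len(actual), size=sample_size)
        # Keep the k sampled wells with the highest predictions.
        top = idx[np.argsort(predicted[idx])[::-1][:k]]
        # Value the selection at its actual reserves.
        profits[i] = actual[top].sum() * revenue_per_unit - budget
    return {
        "mean": profits.mean(),
        "interval": (np.quantile(profits, 0.025), np.quantile(profits, 0.975)),
        "risk_of_loss": (profits < 0).mean(),
    }


# Synthetic stand-in for one region's validation set (not the real data).
rng = np.random.default_rng(0)
actual = rng.uniform(0, 190, size=25_000)
predicted = actual + rng.normal(0, 40, size=25_000)

result = bootstrap_profit(actual, predicted)
print(result)
```

Passing the budget and price as arguments keeps the function independent of the notebook's global variables.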

## Bootstrapping for Profit and Risk

### Region 1

In [151]:
# Defining random state
state = RandomState(12345)

# Bootstrapping technique
values = []
target_valid_0 = pd.Series(target_valid_0)
predicted_valid_0 = pd.Series(predicted_valid_0)
for i in range(1000):
  target_subsample_0 = target_valid_0.reset_index(drop=True).sample(replace=True, random_state=state, n=500)
  predicted_subsample_0 = predicted_valid_0.reset_index(drop=True)[target_subsample_0.index]
  values.append(revenue(target_subsample_0, predicted_subsample_0, 200))

# Computing bootstrapping profit, confidence interval, and risk of losses
values = pd.Series(values)
avg_profit = values.mean()
lower = values.quantile(0.025)
upper = values.quantile(.975)
losses = values < 0
risk_of_losses = (losses.sum()/len(values)) 

print(f'Average profit: {avg_profit}')
print(f'95% confidence interval: {(lower, upper)}')
print(f'Risk of losses: {risk_of_losses: .2%}')

Average profit: 3961649.8480237117
95% confidence interval: (-1112155.4589049604, 9097669.41553423)
Risk of losses:  6.90%


### Region 2

In [152]:
# Defining random state
state = RandomState(12345)

# Bootstrapping technique
values = []
target_valid_1 = pd.Series(target_valid_1)
predicted_valid_1 = pd.Series(predicted_valid_1)
for i in range(1000):
  target_subsample_1 = target_valid_1.reset_index(drop=True).sample(replace=True, random_state=state, n=500)
  predicted_subsample_1 = predicted_valid_1.reset_index(drop=True)[target_subsample_1.index]
  values.append(revenue(target_subsample_1, predicted_subsample_1, 200))

# Computing bootstrapping profit, confidence interval, and risk of losses
values = pd.Series(values)
avg_profit = values.mean()
lower = values.quantile(0.025)
upper = values.quantile(.975)
losses = values < 0
risk_of_losses = (losses.sum()/len(values)) 

print(f'Average profit: {avg_profit}')
print(f'95% confidence interval: {(lower, upper)}')
print(f'Risk of losses: {risk_of_losses: .2%}')

Average profit: 4560451.057866608
95% confidence interval: (338205.0939898458, 8522894.538660347)
Risk of losses:  1.50%


### Region 3

In [153]:
# Defining random state
state = RandomState(12345)

# Bootstrapping technique
values = []
target_valid_2 = pd.Series(target_valid_2)
predicted_valid_2 = pd.Series(predicted_valid_2)
for i in range(1000):
  target_subsample_2 = target_valid_2.reset_index(drop=True).sample(replace=True, random_state=state, n=500)
  predicted_subsample_2 = predicted_valid_2.reset_index(drop=True)[target_subsample_2.index]
  values.append(revenue(target_subsample_2, predicted_subsample_2, 200))

# Computing bootstrapping profit, confidence interval, and risk of losses
values = pd.Series(values)
avg_profit = values.mean()
lower = values.quantile(0.025)
upper = values.quantile(.975)
losses = values < 0
risk_of_losses = (losses.sum()/len(values)) 

print(f'Average profit: {avg_profit}')
print(f'95% confidence interval: {(lower, upper)}')
print(f'Risk of losses: {risk_of_losses: .2%}')

Average profit: 4044038.665683568
95% confidence interval: (-1633504.1339559986, 9503595.749237997)
Risk of losses:  7.60%


### Conclusion

The 95% interval reported here is the range between the 2.5% and 97.5% quantiles of the 1,000 bootstrapped profits, so 95% of the simulated profits fall inside it. The risk of losses is the share of bootstrap samples with negative profit. Using bootstrapping, we found that Region 2 (`geo_data_1`) had the highest average profit, 4.56 million USD, and the lowest risk of losses, 1.50%. It is the only region below the 2.5% risk threshold and the only one whose interval lies entirely above zero. Region 3 (`geo_data_2`) had the second highest average profit, 4.04 million USD, with a risk of losses of 7.60%. Region 1 (`geo_data_0`) had the lowest average profit, 3.96 million USD, with a risk of losses of 6.90%.

Based on our findings, we can say that Region 2 (`geo_data_1`) is the best place to develop a new oil well. 

## Findings

We split the source data for each region in a 75:25 ratio: 75% training set and 25% validation set.
* The training set contains 75000 rows.
* The validation set contains 25000 rows.

We then standardized the features of each region and trained a linear regression model. For each region, we found the root mean square error (RMSE) and R2 value.
* For Region 1, the RMSE was 37.579, and the R2 value was 0.2799. A low R2 score is indicative of poor model quality.
* For Region 2, the RMSE was 0.893, and the R2 value was 0.9996. An RMSE below 1 thousand barrels means the predictions are very close to the actual volumes of reserves, and an R2 score close to 1 is indicative of good model quality.
* For Region 3, the RMSE was 40.030, and the R2 value was 0.2052. A low R2 score is indicative of poor model quality.

The volume of predicted reserves calculated for each region was:
* Region 1: 92.593 thousand barrels
* Region 2: 68.729 thousand barrels
* Region 3: 94.965 thousand barrels

We found that the volume of reserves sufficient for developing a new well without a loss was 111.11 thousand barrels. This volume was the break-even baseline for a single well. The average predicted volume of reserves in each of the three regions was lower than this baseline.

Profit was calculated from a set of selected oil wells and model predictions. We found that Region 1 had the highest profit at 33 million USD, followed by Region 3 (27 million USD) and Region 2 (24 million USD), respectively. 

Bootstrapping with 1,000 samples was used to find the distribution of profit, the average profit, a 95% interval, and the risk of loss. Region 2 had the highest average profit, 4.56 million USD, and the lowest risk of loss, 1.50%. Region 3 had the second highest average profit, 4.04 million USD, with a risk of loss of 7.60%. Region 1 had the lowest average profit, 3.96 million USD, with a risk of loss of 6.90%.

Based on our findings, Region 2 (`geo_data_1`) is the best place to develop a new oil well. Region 2 provides the highest average profit with the lowest risk of loss.