# OilyGiant

## Introduction

In this project at OilyGiant, we're tackling a practical challenge: finding the best spots to drill 200 new oil wells. It's all about using data smartly to make profitable decisions in a competitive industry. 

We have geological data for oil samples in three regions. Parameters of each oil well are already known. We need to build a model that will help to pick the region with the highest profit margin, and predict the amount of oil that can be extracted from the wells.

We'll analyze potential profit and risks using the Bootstrapping technique.


## Objectives

1.- Data Analysis: Dig into well data from three regions to understand oil quality and reserve volumes.

2.- Model Development: Build a model to predict reserve volumes in new wells.

3.- Well Selection: Use the model to pick the most promising well locations. 

4.- Region Selection: Identify the region with the highest overall profitability potential.

5.- Risk Assessment: Apply bootstrapping techniques to analyze potential profits and risks.

## Importing the libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt


## Data Preparation And Exploration

### Loading the data

In [2]:
df0 = pd.read_csv('datasets/geo_data_0.csv')
df1 = pd.read_csv('datasets/geo_data_1.csv')
df2 = pd.read_csv('datasets/geo_data_2.csv')

### Initial data exploration

In [3]:
print(df0.head())
print(df1.head())
print(df2.head())

      id        f0        f1        f2     product
0  txEyH  0.705745 -0.497823  1.221170  105.280062
1  2acmU  1.334711 -0.340164  4.365080   73.037750
2  409Wp  1.022732  0.151990  1.419926   85.265647
3  iJLyR -0.032172  0.139033  2.978566  168.620776
4  Xdl7t  1.988431  0.155413  4.751769  154.036647
      id         f0         f1        f2     product
0  kBEdx -15.001348  -8.276000 -0.005876    3.179103
1  62mP7  14.272088  -3.475083  0.999183   26.953261
2  vyE1P   6.263187  -5.948386  5.001160  134.766305
3  KcrkZ -13.081196 -11.506057  4.999415  137.945408
4  AHL4O  12.702195  -8.147433  5.004363  134.766305
      id        f0        f1        f2     product
0  fwXo0 -1.146987  0.963328 -0.828965   27.758673
1  WJtFt  0.262778  0.269839 -2.530187   56.069697
2  ovLUW  0.194587  0.289035 -5.586433   62.871910
3  q6cA6  2.236060 -0.553760  0.930038  114.572842
4  WPMUX -0.515993  1.716266  5.899011  149.600746


After checking the head of each dataframe, we can see that the columns are id, f0, f1, f2, and product. The id column contains the unique well identifier. The columns f0, f1, and f2 are features of each well (their specific meaning is unimportant, but the features themselves are significant). The target is the product column, which contains the volume of reserves in the oil well (in thousands of barrels).

Now we are going to use .info() to check the general information of the dataframes.

In [4]:
print(df0.info())
print(df1.info())
print(df2.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column  

We can see that the three dataframes have 100,000 rows and 5 columns. There are no missing values in the dataframes. All columns are of type float64 except for the id column, which is of type object.

As far as we can see, the data looks ready for further analysis, but first we check whether any well IDs are duplicated.

In [5]:
print('Duplicates in f0 = ', df0.duplicated(subset='id').sum())
print('Duplicates in f1 = ', df1.duplicated(subset='id').sum())
print('Duplicates in f2 = ', df2.duplicated(subset='id').sum())

Duplicates in f0 =  10
Duplicates in f1 =  4
Duplicates in f2 =  4


### Dropping duplicates

Each row should represent a unique well, so we drop the rows with a repeated id, keeping the first occurrence of each.

In [6]:
df0 = df0.drop_duplicates(subset='id')
df1 = df1.drop_duplicates(subset='id')
df2 = df2.drop_duplicates(subset='id')

Checking if there are still duplicates in the dataframes.

In [7]:
print('Duplicates in f0 = ', df0.duplicated(subset='id').sum())
print('Duplicates in f1 = ', df1.duplicated(subset='id').sum())
print('Duplicates in f2 = ', df2.duplicated(subset='id').sum())

Duplicates in f0 =  0
Duplicates in f1 =  0
Duplicates in f2 =  0


Now we can see that there are no duplicates in the dataframes, so we can continue with the analysis.
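Before dropping rows by id, it can be useful to look at the repeated IDs side by side, since two rows with the same id may carry different measurements. The following is a minimal sketch on a made-up dataframe (the values are invented for illustration and are not taken from the project datasets); it assumes only that the id column identifies a well.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the project datasets.
df = pd.DataFrame({
    "id": ["a1", "b2", "a1", "c3"],
    "f0": [0.5, 1.2, 0.7, -0.3],
    "product": [100.0, 80.0, 95.0, 60.0],
})

# keep=False marks every row whose id is repeated, not just the later copies.
repeated = df[df.duplicated(subset="id", keep=False)].sort_values("id")
print(repeated)

# drop_duplicates keeps the first row for each id by default.
deduped = df.drop_duplicates(subset="id")
print(len(df), "->", len(deduped))
```

In this toy example the two "a1" rows differ in f0 and product, so dropping one of them discards a measurement. Whether that matters for the real data depends on how the repeated IDs arose.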

## Model Training and Validation

### Splitting the data into training and validation sets for geo_data_0

We split the data into training and validation sets using the train_test_split() function from sklearn.model_selection, keeping 25% of the rows for validation.

The target is the "product" column, and the features are f0, f1, and f2.

We also drop the id column, because it is an identifier of type object and we don't need it for the analysis.


#### Features and target for geo_data_0

In [8]:
X = df0.drop(['id', 'product'], axis=1)
y = df0['product']  

features_train_0, features_valid_0, target_train_0, target_valid_0 = train_test_split(X, y, test_size=0.25, random_state=12345)

We train a linear regression model for each region; no other model types are considered in this project.

#### Model training for geo_data_0

In [9]:
model = LinearRegression()
model.fit(features_train_0, target_train_0)
predicted_valid_0 = model.predict(features_valid_0)
print('RMSE = ', sqrt(mean_squared_error(target_valid_0, predicted_valid_0)))
print('Mean = ', target_valid_0.mean())
print('Predicted mean = ', predicted_valid_0.mean())


RMSE =  37.853527328872964
Mean =  92.15820490940044
Predicted mean =  92.7891563828062


#### Features and target for geo_data_1

In [10]:

X = df1.drop(['id', 'product'], axis=1)
y = df1['product']

features_train_1, features_valid_1, target_train_1, target_valid_1 = train_test_split(X, y, test_size=0.25, random_state=12345)

#### Model training for geo_data_1

In [11]:

model = LinearRegression()
model.fit(features_train_1, target_train_1)
predicted_valid_1 = model.predict(features_valid_1)
print('RMSE = ', sqrt(mean_squared_error(target_valid_1, predicted_valid_1)))
print('Mean = ', target_valid_1.mean())
print('Predicted mean = ', predicted_valid_1.mean())


RMSE =  0.8920592647717042
Mean =  69.18604400957675
Predicted mean =  69.1783195703043


#### Features and target for geo_data_2

In [12]:

X = df2.drop(['id', 'product'], axis=1)
y = df2['product']

features_train_2, features_valid_2, target_train_2, target_valid_2 = train_test_split(X, y, test_size=0.25, random_state=12345)

#### Model training for geo_data_2

In [13]:

model = LinearRegression()
model.fit(features_train_2, target_train_2)
predicted_valid_2 = model.predict(features_valid_2)
print('RMSE = ', sqrt(mean_squared_error(target_valid_2, predicted_valid_2)))
print('Mean = ', target_valid_2.mean())
print('Predicted mean = ', predicted_valid_2.mean())


RMSE =  40.07585073246016
Mean =  94.7851093536914
Predicted mean =  94.86572480562035


The RMSE is approximately 37.85 for geo_data_0, 0.89 for geo_data_1, and 40.08 for geo_data_2.

The model for geo_data_1 is by far the most accurate, because it has the lowest RMSE.

In all three regions the mean of the predicted values is very close to the mean of the actual values, which is a good sign for the models.
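An RMSE value is easier to judge against a reference point. One common reference, used here as an assumption and not something the notebook itself computes, is a constant model that predicts the mean reserve volume for every well. The sketch below defines RMSE with NumPy and compares a set of predictions against that baseline; the `rmse` helper and all numbers are invented for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Invented reserve volumes and predictions, for illustration only.
y_true = np.array([100.0, 80.0, 120.0, 60.0])
y_pred = np.array([90.0, 85.0, 110.0, 70.0])

model_rmse = rmse(y_true, y_pred)
# Baseline: predict the mean of the targets for every well.
baseline_rmse = rmse(y_true, np.full_like(y_true, y_true.mean()))

print(model_rmse, baseline_rmse)
```

The baseline RMSE equals the standard deviation of the targets, so a model whose RMSE is close to that spread adds little beyond predicting the average. On real data the baseline mean would normally come from the training set.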


### Profit calculation

In [14]:
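total_investment = 100_000_000  # total budget for developing the wells
number_of_wells = 200  # number of wells to be developed
revenue_per_unit = 4500  # revenue per unit of product (one unit = a thousand barrels)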

break_even_units = total_investment / (number_of_wells * revenue_per_unit)
print('Break-even point (in units):', break_even_units)

print('Average reserve in Region 0:', df0['product'].mean(), 'units')
print('Average reserve in Region 1:', df1['product'].mean(), 'units')
print('Average reserve in Region 2:', df2['product'].mean(), 'units')


Break-even point (in units): 111.11111111111111
Average reserve in Region 0: 92.49968421774354 units
Average reserve in Region 1: 68.82391591804064 units
Average reserve in Region 2: 94.99834211933378 units


None of the regions, on average, meet the break-even reserve volume of 111.11 units based on this initial analysis. This suggests that under the current investment plan and based on average reserves, the oil well development might not be profitable in any of the regions.

But we are going to select the 200 wells with the highest values of predictions and calculate the profit for the obtained volume of reserves on each region.

In [15]:
def calculate_profit(predictions, target_valid, revenue_per_unit, number_of_wells=200):
    # We convert predictions to a Pandas Series and reset the index of target_valid so that they match.
    target_valid = target_valid.reset_index(drop=True)
    predictions = pd.Series(predictions).reset_index(drop=True)

    # Sort and selection of top indices
    sorted_indices = np.argsort(predictions)[::-1]
    top_indices = sorted_indices[:number_of_wells]

    # Total reserves and profit
    total_reserves = np.sum(target_valid[top_indices])
    total_profit = (total_reserves * revenue_per_unit) - total_investment
    return total_profit

# Now we apply this function to our predictions
profit_region0 = calculate_profit(predicted_valid_0, target_valid_0, revenue_per_unit)
profit_region1 = calculate_profit(predicted_valid_1, target_valid_1, revenue_per_unit)
profit_region2 = calculate_profit(predicted_valid_2, target_valid_2, revenue_per_unit)

print('Total predicted profit for Region 0:', profit_region0)
print('Total predicted profit for Region 1:', profit_region1)
print('Total predicted profit for Region 2:', profit_region2)



Total predicted profit for Region 0: 33651872.377002865
Total predicted profit for Region 1: 24150866.966815114
Total predicted profit for Region 2: 25012838.532820627


As we can see in the output above:

- Region 0: Predicted profit of approximately $33.65 million.
- Region 1: Predicted profit of approximately $24.15 million.
- Region 2: Predicted profit of approximately $25.01 million.

Region 0 shows the highest potential profit, followed by Region 2 and Region 1, which are close to each other. All three regions are profitable when only the 200 wells with the highest predictions are selected.
These results suggest that, based on the model's predictions, investing in oil wells in Region 0 could be the most profitable option at this stage.
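The selection step can be traced by hand on a tiny example: wells are ranked by predicted reserves, but revenue is counted from the actual reserves of the chosen wells. The sketch below uses a hypothetical helper, `top_k_profit`, and invented numbers (four wells, two selected, a 1,000,000 budget) so that the arithmetic is easy to check; it is not the notebook's function.

```python
import numpy as np

def top_k_profit(predictions, actual, k, revenue_per_unit, investment):
    """Profit from the k wells with the highest predicted reserves."""
    predictions = np.asarray(predictions, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Positions of the k largest predictions, best first.
    top = np.argsort(predictions)[::-1][:k]
    # Revenue comes from the actual reserves of the chosen wells.
    return float(actual[top].sum() * revenue_per_unit - investment)

# Invented numbers for illustration only.
predicted = [50.0, 120.0, 90.0, 130.0]
actual = [55.0, 100.0, 95.0, 140.0]

profit = top_k_profit(predicted, actual, k=2,
                      revenue_per_unit=4500, investment=1_000_000)
print(profit)  # (140 + 100) * 4500 - 1,000,000 = 80000.0
```

The two wells with the highest predictions (130 and 120) hold 140 and 100 units in reality, so the profit reflects what those wells would actually yield.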

### Bootstrapping

For each region, we are going to:

- Use the bootstrapping technique with 1000 samples to find the distribution of profit.
- Find average profit, 95% confidence interval, and risk of losses. Loss is negative profit.

In [16]:
def bootstrap_profit_analysis(predictions, target_valid, revenue_per_unit, n_bootstrap=1000, n_wells=500):
    bootstrap_profits = []
    target_valid = target_valid.reset_index(drop=True)
    for _ in range(n_bootstrap):
        indexes = np.random.choice(range(len(predictions)), size=n_wells, replace=True)
        
        indexed_predictions = predictions[indexes]
        indexed_target_valid = target_valid[indexes]
        
        profit = calculate_profit(indexed_predictions, indexed_target_valid, revenue_per_unit)
        bootstrap_profits.append(profit)


    average_profit = np.mean(bootstrap_profits)
    confidence_interval = np.percentile(bootstrap_profits, [2.5, 97.5])

    risk_of_loss = np.mean(np.array(bootstrap_profits) < 0) * 100 
    

    return average_profit, confidence_interval, risk_of_loss


avg_profit_0, ci_0, risk_0 = bootstrap_profit_analysis(predicted_valid_0, target_valid_0, revenue_per_unit)
avg_profit_1, ci_1, risk_1 = bootstrap_profit_analysis(predicted_valid_1, target_valid_1, revenue_per_unit)
avg_profit_2, ci_2, risk_2 = bootstrap_profit_analysis(predicted_valid_2, target_valid_2, revenue_per_unit)


print('Average profit for Region 0:', avg_profit_0)
print('95% confidence interval:', ci_0)
print('Risk of loss:', risk_0)
print()
print('Average profit for Region 1:', avg_profit_1)
print('95% confidence interval:', ci_1)
print('Risk of loss:', risk_1)
print()
print('Average profit for Region 2:', avg_profit_2)
print('95% confidence interval:', ci_2)
print('Risk of loss:', risk_2)



Average profit for Region 0: 3740770.621899106
95% confidence interval: [-1622671.77347309  9025539.73278304]
Risk of loss: 6.4

Average profit for Region 1: 4675197.310061187
95% confidence interval: [ 431675.89325556 8880092.63048083]
Risk of loss: 1.7000000000000002

Average profit for Region 2: 3581893.245733123
95% confidence interval: [-2186732.04780835  8645820.11408043]
Risk of loss: 10.100000000000001


Looking at the results of the bootstrapping analysis, we can see that:

- Region 1 has the highest average profit, approximately $4.68 million.
- Region 1 also has the lowest risk of losses (1.7%), with a 95% confidence interval from approximately $0.43 million to $8.88 million.
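The resampling above draws from NumPy's global random state without a seed, so the averages, intervals, and risk figures change slightly from run to run. A minimal sketch of a reproducible variant is shown below, assuming a seeded `numpy.random.default_rng` generator. The function name `bootstrap_profits`, the seeds, and the synthetic data are illustrative assumptions; the defaults (1000 resamples of 500 wells, the best 200 selected, 4500 per unit, 100,000,000 budget) mirror the values used in this notebook.

```python
import numpy as np

def bootstrap_profits(predictions, actual, rng, n_bootstrap=1000,
                      sample_size=500, n_wells=200,
                      revenue_per_unit=4500, investment=100_000_000):
    """Profit of the best n_wells wells in each bootstrap resample."""
    predictions = np.asarray(predictions, dtype=float)
    actual = np.asarray(actual, dtype=float)
    profits = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        # Draw well positions with replacement.
        idx = rng.integers(0, len(predictions), size=sample_size)
        # Within the resample, keep the wells with the highest predictions.
        top = idx[np.argsort(predictions[idx])[::-1][:n_wells]]
        profits[i] = actual[top].sum() * revenue_per_unit - investment
    return profits

# Synthetic reserves and noisy predictions, for illustration only.
data_rng = np.random.default_rng(0)
actual = data_rng.normal(90, 40, size=5000)
predicted = actual + data_rng.normal(0, 30, size=5000)

profits = bootstrap_profits(predicted, actual, np.random.default_rng(12345))
mean_profit = profits.mean()
lower, upper = np.percentile(profits, [2.5, 97.5])
risk_of_loss = (profits < 0).mean() * 100  # share of resamples with a loss, in %

print(mean_profit, (lower, upper), risk_of_loss)
```

Passing a generator created with the same seed yields the same profit distribution on every run, which keeps the written interpretation in step with the printed numbers.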

### Conclusion


As this analysis shows, the best region to drill for oil is Region 1, because it has the highest average profit, approximately $4.68 million, and the lowest risk of losses, at 1.7%.

The lower bound of its 95% confidence interval is also positive, unlike in the other two regions, so it makes sense to take that into consideration.

This was not visible in the initial analysis, where Region 0 looked best in terms of potential profit.

The bootstrapping analysis gives a more in-depth result and more tools to say that Region 1 is the best one, because it has the highest average profit, the lowest risk, and a confidence interval that is entirely positive.