# **Machine Learning for Business: OilyGiant**

This project applies machine learning techniques to help OilyGiant identify the most promising location for developing a new oil well. The analysis is based on three datasets, geo_0, geo_1, and geo_2, one per region. Each contains 100,000 records with three features (f0, f1, f2) and the reserve volume (product).

To uncover actionable insights, a Linear Regression model was used as a baseline to predict oil production volume. By comparing model performance and profit potential across the three regions, the project aims to support strategic decision-making and determine which location offers the greatest expected return.

## 1. Import Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import os

## 2. Load Data

In [2]:
geo_0 = pd.read_csv("data/geo_data_0.csv")
geo_1 = pd.read_csv("data/geo_data_1.csv")
geo_2 = pd.read_csv("data/geo_data_2.csv")

## 3. Data Overview

In [3]:
geo_0.head()

|   | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | txEyH | 0.705745 | -0.497823 | 1.22117 | 105.280062 |
| 1 | 2acmU | 1.334711 | -0.340164 | 4.36508 | 73.03775 |
| 2 | 409Wp | 1.022732 | 0.15199 | 1.419926 | 85.265647 |
| 3 | iJLyR | -0.032172 | 0.139033 | 2.978566 | 168.620776 |
| 4 | Xdl7t | 1.988431 | 0.155413 | 4.751769 | 154.036647 |


In [4]:
geo_0.info()

<class 'pandas.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  str    
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), str(1)
memory usage: 3.8 MB


In [5]:
geo_1.head()

|   | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | kBEdx | -15.001348 | -8.276 | -0.005876 | 3.179103 |
| 1 | 62mP7 | 14.272088 | -3.475083 | 0.999183 | 26.953261 |
| 2 | vyE1P | 6.263187 | -5.948386 | 5.00116 | 134.766305 |
| 3 | KcrkZ | -13.081196 | -11.506057 | 4.999415 | 137.945408 |
| 4 | AHL4O | 12.702195 | -8.147433 | 5.004363 | 134.766305 |


In [6]:
geo_1.info()

<class 'pandas.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  str    
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), str(1)
memory usage: 3.8 MB


In [7]:
geo_2.head()

|   | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | fwXo0 | -1.146987 | 0.963328 | -0.828965 | 27.758673 |
| 1 | WJtFt | 0.262778 | 0.269839 | -2.530187 | 56.069697 |
| 2 | ovLUW | 0.194587 | 0.289035 | -5.586433 | 62.87191 |
| 3 | q6cA6 | 2.23606 | -0.55376 | 0.930038 | 114.572842 |
| 4 | WPMUX | -0.515993 | 1.716266 | 5.899011 | 149.600746 |


In [8]:
geo_2.info()

<class 'pandas.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  str    
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), str(1)
memory usage: 3.8 MB


**Observation:**

After loading the datasets for all three regions, a brief inspection confirmed consistent feature structures and no missing values. Feature types are aligned across regions, and the target variable represents the estimated volume of oil reserves for each well.

Given the structured nature of the data and the assumption of a linear relationship between geological features and reserve volume, **Linear Regression** was selected as a baseline model. The primary objective of this project is decision-making under uncertainty, focusing on profitability and risk analysis across regions rather than extensive exploratory analysis.
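
The consistency check described in this observation can be made explicit. The sketch below is only an illustration: `check_consistency` is a hypothetical helper that is not part of the notebook, and the toy frames stand in for the real datasets.

```python
import pandas as pd

def check_consistency(frames):
    """Return True when every frame shares the first frame's columns and dtypes
    and no frame contains missing values. (Hypothetical helper.)"""
    reference = next(iter(frames.values()))
    for df in frames.values():
        same_columns = list(df.columns) == list(reference.columns)
        same_dtypes = list(df.dtypes) == list(reference.dtypes)
        no_missing = not df.isna().any().any()
        if not (same_columns and same_dtypes and no_missing):
            return False
    return True

# Tiny stand-in frames (illustrative values, not the real datasets)
toy_a = pd.DataFrame({"id": ["a", "b"], "f0": [0.1, 0.2], "product": [10.0, 20.0]})
toy_b = pd.DataFrame({"id": ["c", "d"], "f0": [0.3, 0.4], "product": [30.0, 40.0]})
toy_c = toy_b.assign(product=[30.0, None])  # one missing target value

consistent = check_consistency({"a": toy_a, "b": toy_b})
inconsistent = check_consistency({"a": toy_a, "c": toy_c})
```

In the notebook, the same helper could presumably be called with `{"geo_0": geo_0, "geo_1": geo_1, "geo_2": geo_2}`.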

## 4. Model Training and Validation by Region

### 1. Region 0

In [9]:
features_0 = geo_0.drop(['id','product'], axis=1)
target_0 = geo_0['product']

features_train_0, features_valid_0, target_train_0, target_valid_0 = train_test_split(features_0, target_0, test_size = 0.25, random_state = 12345)

model = LinearRegression()
model.fit(features_train_0, target_train_0)
predictions_0 = model.predict(features_valid_0)
mse = mean_squared_error(target_valid_0, predictions_0)

print('Linear Regression')
print('MSE =', mse)
print('RMSE =', mse ** 0.5)

# Constant baseline for comparison: predict the training mean for every validation well
baseline_predictions_0 = pd.Series(target_train_0.mean(), index=target_valid_0.index)
baseline_mse_0 = mean_squared_error(target_valid_0, baseline_predictions_0)

Linear Regression
MSE = 1412.2129364399243
RMSE = 37.5794217150813


### 2. Region 1

In [10]:
features_1 = geo_1.drop(['id','product'], axis=1)
target_1 = geo_1['product']

features_train_1, features_valid_1, target_train_1, target_valid_1 = train_test_split(features_1, target_1, test_size = 0.25, random_state = 12345)

model = LinearRegression()
model.fit(features_train_1, target_train_1)
predictions_1 = model.predict(features_valid_1)
mse = mean_squared_error(target_valid_1, predictions_1)

print('Linear Regression')
print('MSE =', mse)
print('RMSE =', mse ** 0.5)

# Constant baseline for comparison: predict the training mean for every validation well
baseline_predictions_1 = pd.Series(target_train_1.mean(), index=target_valid_1.index)
baseline_mse_1 = mean_squared_error(target_valid_1, baseline_predictions_1)


Linear Regression
MSE = 0.7976263360391149
RMSE = 0.8930992867756166


### 3. Region 2

In [11]:
features_2 = geo_2.drop(['id','product'], axis=1)
target_2 = geo_2['product']

features_train_2, features_valid_2, target_train_2, target_valid_2 = train_test_split(features_2, target_2, test_size = 0.25, random_state = 12345)

model = LinearRegression()
model.fit(features_train_2, target_train_2)
predictions_2 = model.predict(features_valid_2)
mse = mean_squared_error(target_valid_2, predictions_2)

print('Linear Regression')
print('MSE =', mse)
print('RMSE =', mse ** 0.5)

# Constant baseline for comparison: predict the training mean for every validation well
baseline_predictions_2 = pd.Series(target_train_2.mean(), index=target_valid_2.index)
baseline_mse_2 = mean_squared_error(target_valid_2, baseline_predictions_2)

Linear Regression
MSE = 1602.377581323619
RMSE = 40.02970873393434


**Observation:**

Region 1 has a far lower prediction error than the other two regions: its RMSE is about 0.89 thousand barrels, against 37.58 for Region 0 and 40.03 for Region 2. This indicates a much stronger linear relationship between the geological features and reserve volume in Region 1, so its predictions are considerably more reliable than those for Regions 0 and 2.
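
Each training cell also builds a constant mean baseline, but its error is never printed. The sketch below shows one way such a comparison could look. It runs on synthetic data, and it uses NumPy least squares as a stand-in for `LinearRegression`, so all names and numbers in it are assumptions rather than results from the notebook.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length arrays."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Synthetic stand-in data: a nearly linear target (not the real datasets)
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3))
y = 50 + x @ np.array([2.0, -1.0, 25.0]) + rng.normal(scale=1.0, size=1000)

# 75/25 split, mirroring the notebook's test_size=0.25
x_train, x_valid = x[:750], x[750:]
y_train, y_valid = y[:750], y[750:]

# Ordinary least squares with an explicit intercept column
design_train = np.column_stack([np.ones(len(x_train)), x_train])
coef, *_ = np.linalg.lstsq(design_train, y_train, rcond=None)
design_valid = np.column_stack([np.ones(len(x_valid)), x_valid])
model_rmse = rmse(y_valid, design_valid @ coef)

# Constant baseline: always predict the training mean
baseline_rmse = rmse(y_valid, np.full(len(y_valid), y_train.mean()))

print(f"Model RMSE:    {model_rmse:.2f}")
print(f"Baseline RMSE: {baseline_rmse:.2f}")
```

A model whose RMSE is close to the baseline RMSE adds little over predicting the mean, which is useful context when reading the errors for Regions 0 and 2.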

## 5. Profit Calculation Setup

In [12]:
# Region 0
mean_predictions_0 = predictions_0.mean()
mean_target_0 = target_valid_0.mean()
print(mean_predictions_0)
print(mean_target_0)

92.59256778438038
92.07859674082925


In [13]:
# Region 1
mean_predictions_1 = predictions_1.mean()
mean_target_1 = target_valid_1.mean()
print(mean_predictions_1)
print(mean_target_1)

68.72854689544603
68.72313602435999


In [14]:
# Region 2
mean_predictions_2 = predictions_2.mean()
mean_target_2 = target_valid_2.mean()
print(mean_predictions_2)
print(mean_target_2)

94.96504596800492
94.88423280885435


## **Conditions:**

Revenue per barrel: $4.50 USD

Revenue per unit of product (1,000 barrels): $4,500 USD

Budget: $100 million USD for developing 200 oil wells

Total revenue = V x 4,500, where V is the total volume in thousand barrels

Total profit = Total revenue - Budget

Break-even: 4,500 x V = 100,000,000, so V = 100,000,000 / 4,500 = 22,222.22 thousand barrels

In [15]:
required_volume = 100000000 / 4500
print(required_volume)

22222.222222222223


To establish a baseline scenario, we estimate the total reserves that would be obtained if 200 wells were selected at random (i.e., using the regional mean). This helps determine whether average production alone would be sufficient to break even before applying model-based selection.


### Baseline Production Estimate (Random Selection)

In [16]:
total_predicted_reserves_0 = mean_predictions_0 * 200
total_predicted_reserves_1 = mean_predictions_1 * 200
total_predicted_reserves_2 = mean_predictions_2 * 200

print(f"Geo 0: {total_predicted_reserves_0:.2f} thousand barrels")
print(f"Geo 1: {total_predicted_reserves_1:.2f} thousand barrels")
print(f"Geo 2: {total_predicted_reserves_2:.2f} thousand barrels")

Geo 0: 18518.51 thousand barrels
Geo 1: 13745.71 thousand barrels
Geo 2: 18993.01 thousand barrels


Multiplying the mean predicted reserves by 200 simulates the outcome of developing average wells rather than strategically selecting the highest-producing ones.

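The results indicate that 200 average wells would not reach the break-even threshold of about 22,222 thousand barrels in any region: the estimated totals are roughly 18,519 (Geo 0), 13,746 (Geo 1), and 18,993 (Geo 2) thousand barrels. Profitability therefore depends on selecting the top wells based on model predictions.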
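
The same comparison can be expressed per well, which makes the gap easier to read. This is a small sketch using the conditions stated earlier and the mean predictions rounded from the notebook output; the variable names are illustrative.

```python
REVENUE_PER_UNIT = 4_500   # USD per thousand barrels
BUDGET = 100_000_000       # USD for 200 wells
N_WELLS = 200

break_even_total = BUDGET / REVENUE_PER_UNIT       # thousand barrels across all wells
break_even_per_well = break_even_total / N_WELLS   # thousand barrels per well

# Mean predicted reserves per well, rounded from the notebook output
mean_predicted = {"Geo 0": 92.59, "Geo 1": 68.73, "Geo 2": 94.97}

# Positive values mean the average well falls short of break-even
shortfall = {region: break_even_per_well - mean
             for region, mean in mean_predicted.items()}

print(f"Break-even per well: {break_even_per_well:.2f} thousand barrels")
for region, gap in shortfall.items():
    print(f"{region}: {gap:.2f} thousand barrels per well below break-even")
```

Each well must average about 111.11 thousand barrels to cover the budget, which is above the mean prediction in all three regions.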

## 6. Function to Calculate Profit

In [17]:
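def calculate_profit(predictions, targets, n_selected=200,
                     revenue_per_unit=4500, total_budget=100000000):
    """Return (profit, total_reserves) for the n_selected wells with the
    highest predicted reserves; profit is computed from their actual targets."""

    # Pair each prediction with its actual value; the original index is not needed
    df = pd.DataFrame({
        'predicted': predictions,
        'target': targets
    }).reset_index(drop=True)

    # Keep the wells the model ranks highest
    df_sorted = df.sort_values(by='predicted', ascending=False).head(n_selected)

    total_reserves = df_sorted['target'].sum()
    revenue = total_reserves * revenue_per_unit
    profit = revenue - total_budget

    return profit, total_reserves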

In [18]:
profit_0, reserves_0 = calculate_profit(predictions_0, target_valid_0)
print(f"Geo 0 - Profit: ${profit_0:,.2f}, Reserves: {reserves_0:,.2f} thousand barrels")

profit_1, reserves_1 = calculate_profit(predictions_1, target_valid_1)
print(f"Geo 1 - Profit: ${profit_1:,.2f}, Reserves: {reserves_1:,.2f} thousand barrels")

profit_2, reserves_2 = calculate_profit(predictions_2, target_valid_2)
print(f"Geo 2 - Profit: ${profit_2:,.2f}, Reserves: {reserves_2:,.2f} thousand barrels")

Geo 0 - Profit: $33,208,260.43, Reserves: 29,601.84 thousand barrels
Geo 1 - Profit: $24,150,866.97, Reserves: 27,589.08 thousand barrels
Geo 2 - Profit: $27,103,499.64, Reserves: 28,245.22 thousand barrels


When the top 200 wells are selected by model prediction from the full validation set (25,000 wells per region), Geo 0 produces the highest estimated profit ($33.2M) and the largest total reserve volume (29,601.84 thousand barrels). Geo 1 ($24.2M) and Geo 2 ($27.1M) are also profitable under this selection strategy, but with lower total reserves and returns.

Based on these point estimates, Geo 0 appears to be the most promising candidate. The final recommendation, however, depends on bootstrapping to evaluate profit uncertainty, a confidence interval for profit, and the risk of loss.
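
The selection rule behind these point estimates ranks wells by prediction but computes profit from actual reserves. The toy sketch below illustrates that distinction. The four wells, the budget, and the helper name `top_n_profit` are all made up for the example.

```python
import numpy as np

def top_n_profit(predicted, actual, n_selected, revenue_per_unit, budget):
    """Pick the n wells with the highest predictions, then compute profit
    from their actual reserves. (Hypothetical helper for illustration.)"""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Indices of the n_selected largest predictions, best first
    chosen = np.argsort(predicted)[::-1][:n_selected]
    return float(actual[chosen].sum() * revenue_per_unit - budget)

# Toy wells: the last one has the largest actual reserves but the lowest prediction
predicted = [120.0, 80.0, 150.0, 60.0]
actual = [100.0, 90.0, 130.0, 160.0]

toy_profit = top_n_profit(predicted, actual, n_selected=2,
                          revenue_per_unit=4_500, budget=1_000_000)
print(toy_profit)
```

The model picks the wells predicted at 150 and 120, so the well with the largest actual reserves (160) is missed. Prediction error therefore translates directly into lost profit, which is why model accuracy matters for the regional comparison.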

## 7. Risk and Profit

In [19]:
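def bootstrap_profit(pred, actual, n_iter=1000, wells_sample=500, top_n=200):
    """Simulate n_iter explorations: draw wells_sample wells with replacement,
    develop the top_n by prediction, and record the resulting profit in USD."""
    rng = np.random.RandomState(42)  # fixed seed for reproducibility
    profits = []
    df = pd.DataFrame({'pred': pred, 'act': actual})

    for _ in range(n_iter):
        sample = df.sample(n=wells_sample, replace=True, random_state=rng)
        best   = sample.nlargest(top_n, 'pred')
        # Revenue of $4,500 per thousand barrels minus the $100M budget
        profit = best['act'].sum() * 4_500 - 100_000_000
        profits.append(profit)

    return pd.Series(profits)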

In [20]:
profits_0 = bootstrap_profit(predictions_0, target_valid_0)
profits_1 = bootstrap_profit(predictions_1, target_valid_1)
profits_2 = bootstrap_profit(predictions_2, target_valid_2)

In [21]:
def analyze_bootstrap_results(profits):
    profits = pd.Series(profits)

    avg_profit = profits.mean()
    # The 5th and 95th percentiles bound the central 90% of simulated profits
    confidence_interval = profits.quantile([0.05, 0.95])
    # Share of simulations with negative profit, in percent
    loss_risk = (profits < 0).mean() * 100

    return avg_profit, confidence_interval, loss_risk

In [22]:
avg_0, ci_0, risk_0 = analyze_bootstrap_results(profits_0)
avg_1, ci_1, risk_1 = analyze_bootstrap_results(profits_1)
avg_2, ci_2, risk_2 = analyze_bootstrap_results(profits_2)

In [23]:
def print_summary(region, avg, ci, risk):
    print(f"{region}:")
    print(f"  Average profit: ${avg:,.2f}")
    print(f"  90% CI: ${ci.loc[0.05]:,.2f} to ${ci.loc[0.95]:,.2f}")
    print(f"  Risk of loss: {risk:.2f}%\n")

print_summary("Geo 0", avg_0, ci_0, risk_0)
print_summary("Geo 1", avg_1, ci_1, risk_1)
print_summary("Geo 2", avg_2, ci_2, risk_2)

Geo 0:
  Average profit: $3,816,285.42
  90% CI: $-533,328.04 to $7,857,030.14
  Risk of loss: 7.30%

Geo 1:
  Average profit: $4,517,872.16
  90% CI: $1,102,003.50 to $7,841,293.22
  Risk of loss: 0.70%

Geo 2:
  Average profit: $3,903,056.30
  90% CI: $-658,562.52 to $8,586,988.24
  Risk of loss: 7.70%
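
The analysis code calls `profits.quantile([0.05, 0.95])`, which bounds the central 90% of the simulated profits. A two-sided 95% interval would use the 2.5th and 97.5th percentiles instead. The sketch below shows one way to parameterize the level. `percentile_interval` is a hypothetical helper, and the profits are synthetic rather than the notebook's bootstrap output.

```python
import numpy as np

def percentile_interval(profits, level=0.95):
    """Two-sided percentile interval covering `level` of the bootstrap profits."""
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(np.asarray(profits, dtype=float), [tail, 1.0 - tail])
    return float(lower), float(upper)

# Synthetic profits standing in for one region's bootstrap output
rng = np.random.default_rng(42)
synthetic_profits = rng.normal(loc=4_000_000, scale=2_500_000, size=1000)

interval_90 = percentile_interval(synthetic_profits, level=0.90)  # 5th to 95th percentile
interval_95 = percentile_interval(synthetic_profits, level=0.95)  # 2.5th to 97.5th percentile

print(f"90% interval: {interval_90[0]:,.0f} to {interval_90[1]:,.0f}")
print(f"95% interval: {interval_95[0]:,.0f} to {interval_95[1]:,.0f}")
```

The 95% interval is never narrower than the 90% interval, so switching levels on the real data would need a rerun before any conclusion about the sign of the lower bound is restated.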



# **Conclusion**

Bootstrapping analysis reveals meaningful differences in profitability stability across regions.

Although Geo 0 initially showed the highest point-estimate profit, the uncertainty analysis changes the decision. Geo 1 demonstrates the highest average profit ($4.52M), an interval between the 5th and 95th percentiles that is entirely positive ($1.10M to $7.84M), and the lowest risk of loss (0.70%).

In contrast, Geo 0 and Geo 2 both show wider confidence intervals that include negative values and carry substantially higher loss probabilities (7.30% and 7.70%, respectively).

Considering expected return and downside risk, **Geo 1 represents the most stable and economically justified region for investment.**
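
The reasoning in this conclusion can be written as an explicit decision rule: discard regions whose risk of loss exceeds a tolerance, then pick the highest average profit among the rest. In the sketch below the figures come from the bootstrap summary, while the 2.5% tolerance and the helper name are assumptions, since the analysis does not state a threshold.

```python
# Bootstrap summary copied from the results (profit in USD, risk in percent)
results = {
    "Geo 0": {"avg_profit": 3_816_285.42, "loss_risk": 7.30},
    "Geo 1": {"avg_profit": 4_517_872.16, "loss_risk": 0.70},
    "Geo 2": {"avg_profit": 3_903_056.30, "loss_risk": 7.70},
}

MAX_LOSS_RISK = 2.5  # hypothetical risk tolerance in percent, not stated in the analysis

def recommend_region(results, max_loss_risk):
    """Return the region with the highest average profit among those whose
    risk of loss is below max_loss_risk, or None if no region qualifies."""
    eligible = {region: stats for region, stats in results.items()
                if stats["loss_risk"] < max_loss_risk}
    if not eligible:
        return None
    return max(eligible, key=lambda region: eligible[region]["avg_profit"])

recommended = recommend_region(results, MAX_LOSS_RISK)
print(recommended)
```

With these figures only Geo 1 passes the filter, and it also has the highest average profit, so here the rule agrees with the recommendation for any tolerance between 0.70% and 7.30%.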