<div style="border:solid blue 2px; padding: 20px">

**Overall Summary of the Project**

Hi Raul! Your notebook is clear, logically structured, and meets most of the business objectives — nice work 🎉.  
Below you’ll find what already shines, a couple of optional polish ideas, and the **critical tweaks** we still need before final approval.

---

**✅ Strengths**

* Very clean data-prep (NA / duplicate checks, dropped the `id` column).  
* Consistent `random_state=42` in the splits for reproducibility.  
* Concise helper-functions (`train_and_validate`, `calculate_profit`, `bootstrap_profit`) keep the code DRY.  
* Good business narrative — every step ties model results back to profit.

---

**❗ Critical changes required (blockers)**  

* **Budget constant should be 100 M, not 10 M**  
  The current setting inflates every profit estimate and forces loss-risk to 0 %.  
  <code>BUDGET = 100_000_000</code>

* **Keep predictions and actual reserves aligned inside each bootstrap sample**  
  Resetting the index separately breaks the 1-to-1 correspondence.  
  ```python
  def bootstrap_profit(preds, actual, iterations=1000, top_n=200):
      df  = pd.DataFrame({'pred': preds, 'act': actual})
      rng = np.random.RandomState(42)
      profits = []

      for _ in range(iterations):
          sample = df.sample(n=500, replace=True, random_state=rng)  # same indices for both columns
          best   = sample.nlargest(top_n, 'pred')                    # rank by prediction
          profit = best['act'].sum() * 4_500 - 100_000_000           # sum *actual* reserves
          profits.append(profit)

      return pd.Series(profits)
  ```

* **Re-evaluate each region with the corrected bootstrap and budget**  
  ```python
  profits_1 = bootstrap_profit(pred_1, actual_1)      # Region 1 example
  mean_1     = profits_1.mean()
  loss_1     = (profits_1 < 0).mean() * 100
  print(f"Mean profit: ${mean_1:,.0f},  Loss risk: {loss_1:.2f}%")
  ```  

  Approve whichever region still satisfies **loss-risk ≤ 2.5 %** and has the highest mean profit — Region 1 will likely remain the safest, but please verify after rerunning.
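
To see why the alignment point above matters, here is a minimal sketch on toy data. The values and variable names are hypothetical and not taken from your notebook. It only illustrates that looking up actual reserves by the sampled index labels keeps each prediction paired with its own well, while a positional lookup after `reset_index` does not.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): each actual value is exactly 1 above its prediction
preds = pd.Series([10.0, 20.0, 30.0, 40.0])
actual = pd.Series([11.0, 21.0, 31.0, 41.0])

rng = np.random.RandomState(0)
sample_preds = preds.sample(n=4, replace=True, random_state=rng)

# Aligned: use the sampled index labels to fetch the matching actual values
aligned_actual = actual.loc[sample_preds.index]

# Misaligned: after reset_index the labels are just 0..3, so the lookup
# returns the first four actual values regardless of which wells were sampled
misaligned_actual = actual.loc[sample_preds.reset_index(drop=True).index]

print((aligned_actual.values - sample_preds.values).tolist())  # every gap is 1.0
print(misaligned_actual.tolist())
```

Putting both columns in one DataFrame and sampling rows, as in the snippet above, avoids the problem entirely because each sampled row carries its own prediction and actual value.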

---

Apply these tweaks, rerun the last section, and you should be all set for approval.  
Ping me if anything’s unclear — happy to help!

</div>

<div style="border:solid blue 2px; padding: 20px">

**Overall Summary of the Project Iter 2**

Thanks for the changes, Raul! Your code looks perfect now!! Just a minor thing, you didn't update your conclusion:

"Final Conclusion

After running all the models and doing the profit/risk analysis, Region 1 stood out as the safest choice. It had the lowest RMSE, solid average profit, and the risk of losses was 0%, just like Region 0 and 2. But what makes Region 1 better is that it has more consistent returns based on the confidence interval being very tight. So, if I had to recommend a region for OilyGiant to invest in, I’d definitely go with Region 1."

This is no longer correct. Please update it based on the actual risk of losses (it is not 0% anymore) ;)

After that, you are ready for approval, I'll be waiting for your final update. You got it !!

</div>

<div style="border:solid blue 2px; padding: 20px">

**Overall Summary of the Project Iter 3**

Congrats on your approval, Raul! ;)

</div>

Predicting Profitable Oil Wells with Machine Learning

Introduction

In this project, I worked with data from three different oil regions to help the company decide where to develop new oil wells. The goal was to build a model to predict oil reserves and figure out which region would be the most profitable with the least risk. I followed several steps, including training models, calculating potential profits, and using bootstrapping to evaluate risk.

In [1]:
# Step 1: Import the necessary libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


In [2]:
# Step 2: Load all three datasets for each region
data_0 = pd.read_csv('/datasets/geo_data_0.csv')
data_1 = pd.read_csv('/datasets/geo_data_1.csv')
data_2 = pd.read_csv('/datasets/geo_data_2.csv')

In [3]:
# Step 3: Check for missing and duplicate values to ensure data quality
datasets = [data_0, data_1, data_2]

for i, df in enumerate(datasets):
    print(f"Region {i} — rows: {df.shape[0]}, missing values: {df.isna().sum().sum()}, duplicates: {df.duplicated().sum()}")


Region 0 — rows: 100000, missing values: 0, duplicates: 0
Region 1 — rows: 100000, missing values: 0, duplicates: 0
Region 2 — rows: 100000, missing values: 0, duplicates: 0


First, I imported all the libraries I needed, like pandas for data wrangling, numpy for calculations, and sklearn for building the machine learning model.
Next, I loaded the datasets for each region. These files contain features from oil wells and a target column called product, which tells how much oil (in thousands of barrels) a well is expected to produce.
I checked each dataset for missing or duplicate values to make sure there’s no dirty data that could mess up the model training. Thankfully, things looked clean: all three regions have 100,000 rows with no missing values and no duplicate rows.

In [4]:
# Step 4: Function to train the model and make predictions
def train_and_validate(data):
    features = data.drop(['product', 'id'], axis=1)
    target = data['product']
    
    X_train, X_valid, y_train, y_valid = train_test_split(
        features, target, test_size=0.25, random_state=42)
    
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    predictions = model.predict(X_valid)
    # Take the square root of the MSE directly; newer scikit-learn versions no longer accept squared=False
    rmse = np.sqrt(mean_squared_error(y_valid, predictions))
    
    print(f"Predicted mean: {predictions.mean():.2f}")
    print(f"Model RMSE: {rmse:.2f}")
    
    return pd.Series(predictions), y_valid.reset_index(drop=True)


In [5]:
# Step 5: Run the model for each region and store predictions
print("Region 0:")
pred_0, actual_0 = train_and_validate(data_0)

print("\nRegion 1:")
pred_1, actual_1 = train_and_validate(data_1)

print("\nRegion 2:")
pred_2, actual_2 = train_and_validate(data_2)


Region 0:
Predicted mean: 92.40
Model RMSE: 37.76

Region 1:
Predicted mean: 68.71
Model RMSE: 0.89

Region 2:
Predicted mean: 94.77
Model RMSE: 40.15


To see how well our model could predict oil reserves, I wrote a function that trains a simple linear regression model. I split each dataset into training and validation sets using a 75/25 split, trained the model, and made predictions on the validation data. I also calculated the RMSE to see how accurate the model was for each region.

I ran this process for all three regions and printed the average predicted reserves and the model’s RMSE. Region 1 has the lowest predicted mean (68.71) but by far the lowest RMSE (0.89, compared with 37.76 for Region 0 and 40.15 for Region 2), so its predictions are the most reliable. This step helped me compare how well the model performed in each region and gave me an idea of how much to trust the predictions moving forward.
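
As a quick illustration of what RMSE measures, here is a minimal sketch on made-up numbers (the arrays below are hypothetical, not values from the datasets). It computes the root of the mean squared error by hand, which is expressed in the same units as `product` (thousands of barrels).

```python
import numpy as np

# Hypothetical actual and predicted reserves, in thousands of barrels
y_true = np.array([100.0, 80.0, 120.0, 60.0])
y_pred = np.array([90.0, 85.0, 110.0, 70.0])

errors = y_true - y_pred            # 10, -5, 10, -10
mse = np.mean(errors ** 2)          # (100 + 25 + 100 + 100) / 4 = 81.25
rmse = np.sqrt(mse)                 # square root puts the error back in the original units

print(f"RMSE: {rmse:.2f}")          # RMSE: 9.01
```

Read this way, an RMSE of 0.89 means predictions are typically off by less than a thousand barrels, while an RMSE near 40 means they are typically off by tens of thousands of barrels.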

In [6]:
# Step 6: Set up constants for profit calculations

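BUDGET = 100_000_000  # total budget in USD for developing 200 wells (fixed: was 10M before)
WELLS_TO_SELECT = 200
REVENUE_PER_BARREL = 4.5  # USD
REVENUE_PER_1000_BARRELS = REVENUE_PER_BARREL * 1000  # `product` is in thousands of barrels
WELL_COST = BUDGET / WELLS_TO_SELECT  # USD per well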

# Calculate the break-even volume per well
break_even_volume = WELL_COST / REVENUE_PER_1000_BARRELS
print(f"\nBreak-even volume: {break_even_volume:.2f} thousand barrels")



Break-even volume: 111.11 thousand barrels


In this step, I set up all the constants I needed for the profit calculations: the total budget ($100 million), the number of wells we plan to develop (200), and the revenue per barrel ($4.50). From those I derived the revenue per 1,000 barrels ($4,500) and the cost per well ($500,000). Then I calculated the break-even volume, which is how much oil a well needs to produce so we don’t lose money: 111.11 thousand barrels. That is higher than the average predicted reserves in every region (92.40, 68.71, and 94.77), so developing wells at random would lose money and we have to pick the best ones.
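
To make the break-even comparison concrete, here is a small sketch that recomputes the break-even volume from the Step 6 constants and measures how far each region's predicted mean (copied by hand from the Step 5 output) falls below it. The dictionary and the `shortfall` name are only for illustration.

```python
# Constants as defined in Step 6
BUDGET = 100_000_000
WELLS_TO_SELECT = 200
REVENUE_PER_1000_BARRELS = 4.5 * 1000

break_even_volume = BUDGET / WELLS_TO_SELECT / REVENUE_PER_1000_BARRELS

# Predicted mean reserves per region, copied from the Step 5 output
predicted_means = {"Region 0": 92.40, "Region 1": 68.71, "Region 2": 94.77}

for region, mean_volume in predicted_means.items():
    shortfall = break_even_volume - mean_volume
    print(f"{region}: mean {mean_volume:.2f} is {shortfall:.2f} below break-even")
```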

In [7]:
# Step 7: Define a function to calculate profit
def calculate_profit(preds, actual):
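    # Rank wells by predicted reserves and keep the labels of the top ones
    top_indices = preds.sort_values(ascending=False).head(WELLS_TO_SELECT).index
    # Revenue is based on the actual reserves of those wells, looked up by label
    total_profit = actual.loc[top_indices].sum() * REVENUE_PER_1000_BARRELS - BUDGET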
    return total_profit


In [8]:
# Step 8: Estimate profit in each region
print("\nEstimated profit per region:")
print(f"Region 0: ${calculate_profit(pred_0, actual_0):,.2f}")
print(f"Region 1: ${calculate_profit(pred_1, actual_1):,.2f}")
print(f"Region 2: ${calculate_profit(pred_2, actual_2):,.2f}")


Estimated profit per region:
Region 0: $33,591,411.14
Region 1: $24,150,866.97
Region 2: $25,985,717.59


To figure out how much profit we could expect in each region, I created a function that selects the top 200 wells with the highest predicted reserves. Then, it sums up their actual reserves, multiplies that by the revenue per thousand barrels, and subtracts the total budget. This gave me a rough estimate of how profitable each region might be based purely on the predictions from the model. Because the top 200 are picked from all 25,000 validation wells, these figures are far higher than the bootstrap estimates, which choose from only 500 sampled wells at a time. It's not perfect, but it helped me compare the regions and get a sense of which one looked the most promising before running the bootstrapping analysis.
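
Here is a tiny worked example of the same idea, using made-up numbers so the arithmetic can be checked by hand. The constants `TOP_N`, `REVENUE_PER_UNIT`, and `TOY_BUDGET` are hypothetical stand-ins for the real ones in Step 6.

```python
import pandas as pd

# Toy setup (hypothetical numbers): pick the 2 best wells out of 4
TOP_N = 2
REVENUE_PER_UNIT = 10     # revenue per unit of reserves
TOY_BUDGET = 100

preds = pd.Series([50.0, 40.0, 30.0, 20.0])    # model predictions
actual = pd.Series([20.0, 45.0, 60.0, 10.0])   # true reserves

# Rank by prediction: wells 0 and 1 are selected
top_indices = preds.sort_values(ascending=False).head(TOP_N).index

# Revenue comes from the actual reserves of the selected wells: 20 + 45 = 65
profit = actual.loc[top_indices].sum() * REVENUE_PER_UNIT - TOY_BUDGET
print(profit)   # 550.0
```

Note that well 2 has the largest true reserves (60) but is not selected because its prediction is lower. That is exactly how prediction error eats into profit.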

In [9]:
# Step 9: Use bootstrapping to evaluate risk and uncertainty
def bootstrap_profit(preds, actual, iterations=1000, top_n=200):
    df = pd.DataFrame({'pred': preds, 'actual': actual})
    profits = []
    rng = np.random.RandomState(42)
    
    for _ in range(iterations):
        sample = df.sample(n=500, replace=True, random_state=rng)
        top_wells = sample.nlargest(top_n, 'pred')
        profit = top_wells['actual'].sum() * REVENUE_PER_1000_BARRELS - BUDGET
        profits.append(profit)
    
    return pd.Series(profits)

In [10]:
# Step 10: Run bootstrapping for all regions
for i, (preds, actual) in enumerate([
    (pred_0, actual_0), (pred_1, actual_1), (pred_2, actual_2)]
):
    profits = bootstrap_profit(preds, actual)
    avg_profit = profits.mean()
    ci_lower = profits.quantile(0.025)
    ci_upper = profits.quantile(0.975)
    risk = (profits < 0).mean() * 100
    
    print(f"\nRegion {i}:")
    print(f"Average profit: ${avg_profit:,.2f}")
    print(f"95% CI: (${ci_lower:,.2f}, ${ci_upper:,.2f})")
    print(f"Risk of loss: {risk:.2f}%")



Region 0:
Average profit: $3,995,754.78
95% CI: ($-1,104,678.95, $8,974,603.28)
Risk of loss: 6.00%

Region 1:
Average profit: $4,520,488.91
95% CI: ($616,844.80, $8,453,401.78)
Risk of loss: 1.50%

Region 2:
Average profit: $3,750,099.03
95% CI: ($-1,447,667.27, $8,883,904.04)
Risk of loss: 8.00%


To better understand how reliable the profit estimates are for each region, I used the bootstrapping technique. This basically means I repeated the profit calculation 1,000 times, each time randomly picking 500 wells (with replacement) and selecting the top 200 predicted wells to estimate profit. Doing this shows how much the profit can swing depending on which wells happen to end up in the sample.

After running the simulation, I looked at the average profit for each region, along with a 95% confidence interval and the percentage of cases where profit turned out negative (which I used as the risk of loss). This gave me a much clearer picture — not just of which region has the highest average returns, but which ones are actually stable and less risky. This step really helped narrow down which region would be safest and smartest for investment.
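
The summary statistics can be sketched on their own, separate from the well data. The example below uses synthetic profits drawn from a normal distribution (the mean and spread are arbitrary assumptions, not my results) just to show how the average, the 95% interval, and the risk of loss are derived from a set of bootstrap outcomes.

```python
import numpy as np

# Synthetic profits (hypothetical): 1,000 draws from a normal distribution
rng = np.random.RandomState(42)
profits = rng.normal(loc=4_000_000, scale=2_500_000, size=1000)

avg_profit = profits.mean()
ci_lower, ci_upper = np.quantile(profits, [0.025, 0.975])  # middle 95% of outcomes
risk = (profits < 0).mean() * 100                            # share of losing runs, in %

print(f"Average profit: ${avg_profit:,.2f}")
print(f"95% interval: (${ci_lower:,.2f}, ${ci_upper:,.2f})")
print(f"Risk of loss: {risk:.2f}%")
```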



Final Conclusion 

After training the models, estimating profits, and running bootstrapping to assess risks, I found that Region 1 still comes out as the best option overall. It has the lowest risk of loss at 1.5%, and it is the only region under the 2.5% threshold we were aiming for. It also has the highest bootstrapped average profit (about $4.52 million). Region 0 looked more profitable in the single top-200 estimate from Step 8 ($33.6 million vs. $24.2 million), but in the bootstrap its average profit is lower (about $4.00 million) and its 6% risk of loss rules it out.

Region 1’s 95% confidence interval ($0.62 million to $8.45 million) is also narrower than the others and is the only one that stays entirely above zero, which means the results are more stable and consistent. So based on all this, if I had to recommend a region for OilyGiant to invest in, Region 1 would be the smartest and safest choice.
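
The decision rule behind this conclusion can be written out as a short sketch. The numbers are copied by hand from the Step 10 output; the `results` dictionary and `MAX_RISK` name are only for illustration.

```python
# Bootstrap results copied from the Step 10 output
results = {
    "Region 0": {"avg_profit": 3_995_754.78, "risk": 6.00},
    "Region 1": {"avg_profit": 4_520_488.91, "risk": 1.50},
    "Region 2": {"avg_profit": 3_750_099.03, "risk": 8.00},
}
MAX_RISK = 2.5  # maximum acceptable risk of loss, in %

# Keep only regions under the risk threshold, then take the most profitable one
eligible = {name: r for name, r in results.items() if r["risk"] < MAX_RISK}
best_region = max(eligible, key=lambda name: eligible[name]["avg_profit"])

print(best_region)   # Region 1
```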