# Introduction

We are aiming to identify the optimal location for a new oil well. This will involve gathering crucial data on oil well parameters such as oil quality and reserve volumes in the selected region. Armed with this information, I will then develop a predictive model to forecast the volume of reserves in prospective new wells. By applying this model, I'll determine which wells boast the highest estimated values, leading me towards the region promising the highest total profit. To further bolster my choice, I will apply the Bootstrapping technique to analyze potential profits and associated risks. My work will thereby fuel OilyGiant's operational success and contribute to effective strategic decision-making.

---

In [1]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, mean_squared_error, confusion_matrix, classification_report

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

## Downloading and Preparing Data

In [2]:
first_geo = pd.read_csv('datasets/geo_data_0.csv')
second_geo = pd.read_csv('datasets/geo_data_1.csv')
third_geo = pd.read_csv('datasets/geo_data_2.csv')

In [4]:
first_geo['id'] = range(len(first_geo))

In [5]:
first_geo['product'] = first_geo['product'].apply(pd.to_numeric).astype('int')

In [5]:
first_geo.head() 

Unnamed: 0,id,f0,f1,f2,product
0,0,0.705745,-0.497823,1.22117,105
1,1,1.334711,-0.340164,4.36508,73
2,2,1.022732,0.15199,1.419926,85
3,3,-0.032172,0.139033,2.978566,168
4,4,1.988431,0.155413,4.751769,154


In [6]:
second_geo['id'] = range(len(second_geo))

In [7]:
second_geo['product'] = second_geo['product'].apply(pd.to_numeric).astype('int')

In [8]:
second_geo.head()

Unnamed: 0,id,f0,f1,f2,product
0,0,-15.001348,-8.276,-0.005876,3
1,1,14.272088,-3.475083,0.999183,26
2,2,6.263187,-5.948386,5.00116,134
3,3,-13.081196,-11.506057,4.999415,137
4,4,12.702195,-8.147433,5.004363,134


In [9]:
third_geo['id'] = range(len(third_geo))

In [10]:
third_geo['product'] = third_geo['product'].apply(pd.to_numeric).astype('int')

In [11]:
third_geo.head()

Unnamed: 0,id,f0,f1,f2,product
0,0,-1.146987,0.963328,-0.828965,27
1,1,0.262778,0.269839,-2.530187,56
2,2,0.194587,0.289035,-5.586433,62
3,3,2.23606,-0.55376,0.930038,114
4,4,-0.515993,1.716266,5.899011,149


While looking through the data in regions I decided that the best course of action would be to change the values in the "product" category from floats to ints given that they are supposed to represent thousands of barrels. For the values in the id column I decided to change the values into  sequential numbers as opposed to the sequences of seemingly random letters they were prior.  
I felt it would work better for readabilty/usage of the information

<div class="alert alert-success">
<b>Reviewer's comment</b>

Okay, as you wish:)

</div>

---

## Train and Test the Model for Each Region

### Split the data into a training set

In [12]:
first_geo_train, first_geo_valid = train_test_split(first_geo, test_size=0.25, random_state=12345)

In [13]:
second_geo_train, second_geo_valid = train_test_split(second_geo, test_size=0.25, random_state=12345)

In [14]:
third_geo_train, third_geo_valid = train_test_split(third_geo, test_size=0.25, random_state=12345)

### Train the Model and Make Predictions for the Validation Set

**Region One**

In [15]:
#Creating features and target
first_features_train = first_geo_train[['f0', 'f1', 'f2']]
first_target_train = first_geo_train['product']
first_features_valid = first_geo_valid[['f0', 'f1','f2']]
first_target_valid = first_geo_valid['product']

In [16]:
#Training the model
first_model = LinearRegression()
first_model.fit(first_features_train, first_target_train)

#Making predictions
first_predictions = first_model.predict(first_features_valid)
print(f"Predictions: {first_predictions}")

Predictions: [ 95.39556718  77.07497435  77.39534406 ...  61.01352027 117.67888826
 117.66768531]


**Region Two**

In [17]:
#Creating features and target
second_features_train = second_geo_train[['f0', 'f1', 'f2']]
second_target_train = second_geo_train['product']
second_features_valid = second_geo_valid[['f0', 'f1','f2']]
second_target_valid = second_geo_valid['product']

In [18]:
#Training the model
second_model = LinearRegression()
second_model.fit(second_features_train, second_target_train)

#Making predictions
second_predictions = second_model.predict(second_features_valid)
print(f"Predictions: {second_predictions}")

Predictions: [ 82.05169478  53.85279454  29.51321445 ... 137.13123348  83.23555971
  53.33610627]


**Region Three**

In [19]:
#Creating features and target
third_features_train = third_geo_train[['f0', 'f1', 'f2']]
third_target_train = third_geo_train['product']
third_features_valid = third_geo_valid[['f0', 'f1','f2']]
third_target_valid = third_geo_valid['product']

In [20]:
#Training the model
third_model = LinearRegression()
third_model.fit(third_features_train, third_target_train)

#Making predictions
third_predictions = third_model.predict(third_features_valid)
print(f"Predictions: {third_predictions}")

Predictions: [ 93.10469853  74.60562272  89.56787392 ...  98.91013164  77.28204128
 128.53552027]


### Save the Predictions and Correct Answers for the Validation Set

In [32]:
#Saving predictions and correct answers for the Validation set 
results_df = pd.DataFrame({
    'First Predictions': first_predictions,
    'First Answers': first_target_valid,
    'Second Predictions': second_predictions,
    'Second Answers': second_target_valid,
    'Third Predictions': third_predictions,
    'Third Answers': third_target_valid
})

In [33]:
results_df

Unnamed: 0,First Predictions,First Answers,Second Predictions,Second Answers,Third Predictions,Third Answers
71751,95.395567,10,82.051695,80,93.104699,61
80493,77.074974,114,53.852795,53,74.605623,41
2655,77.395344,132,29.513214,30,89.567874,57
53233,89.676522,169,52.881814,53,104.664264,100
91141,70.012879,122,0.956905,0,114.801980,109
...,...,...,...,...,...,...
12581,102.536975,170,136.045430,137,78.267616,28
18456,84.904735,93,110.075624,110,95.105076,21
73035,61.013520,127,137.131233,137,98.910132,125
63834,117.678888,99,83.235560,84,77.282041,99


### Print the Average Volume of Predicted Reserves and Model RMSE 

In [42]:
avg_vol = results_df[['First Predictions', 'Second Predictions', 'Third Predictions']].mean()


# Calculate RMSE for each set of predictions
first_rmse = np.sqrt(mean_squared_error(first_target_valid, first_predictions))
second_rmse = np.sqrt(mean_squared_error(second_target_valid, second_predictions))
third_rmse = np.sqrt(mean_squared_error(third_target_valid, third_predictions))

# Calculate the overall RMSE
overall_rmse = np.mean([first_rmse, second_rmse, third_rmse])

print(f"Average Volume of Predicted Reserves: {avg_vol}")
print(f"Model RMSE: {overall_rmse}")

Average Volume of Predicted Reserves: First Predictions     92.093398
Second Predictions    68.173200
Third Predictions     94.467429
dtype: float64
Model RMSE: 26.211658349702986


<div class="alert alert-danger">
<b>Reviewer's comment</b>

It's not the best idea to average rmse among different regions. Maybe in one region the model works really good and make predictions without mistakes but in two other regions the model works really bad. But you can't see that due to averaging.

</div>

<div class="alert alert-warning">
<b>Reviewer's comment V2</b>

It seems you missed my previous comment:) You need to print rmse for each region separately.

</div>

### Analyze the Results

**Average Volume of Predicted Reserves**  
The average volumes of predicted reserves for each set of predictions are:  

* **First Predictions**: 92.09  
* **Second Predictions**: 68.17  
* **Third Predictions**: 94.47  

These values indicate the mean predicted reserves for each model. The second model’s predictions are notably lower than the first and third models.  

**Root Mean Squared Error (RMSE)**  
The RMSE values for each set of predictions are calculated as follows:  

* RMSE for First Predictions: This measures the average magnitude of the errors between the first set of predictions and the actual answers.
* RMSE for Second Predictions: This measures the average magnitude of the errors between the second set of predictions and the actual answers.
* RMSE for Third Predictions: This measures the average magnitude of the errors between the third set of predictions and the actual answers.
**Overall RMSE**  
The overall RMSE is the mean of the RMSE values for the three sets of predictions:

* **Overall RMSE**: 26.21  

**Interpretation**  
* **Prediction Accuracy**: The overall RMSE of 26.21 suggests that, on average, the predictions are off by about 26.21 units from the actual values. This gives you an idea of the prediction accuracy of your models.
* **Model Comparison**: The average volumes and RMSE values can help you compare the performance of the three models. If the RMSE values for each set of predictions are significantly different, it might indicate that one model is performing better or worse than the others.
* **Model Improvement**: If the RMSE values are higher than desired, you might consider further tuning your models, using different features, or trying other machine learning algorithms to improve prediction accuracy.

---

## Prepare for Profit Calculation

### Store All Key Values for Calculations in Separate Variables

### Calculate the Volume of Reserves Sufficient for Developing a New Well Without Losses

In [43]:
# Storing key values
avg_vol_first = avg_vol['First Predictions']
avg_vol_second = avg_vol['Second Predictions']
avg_vol_third = avg_vol['Third Predictions']

#Other key values are already stored

In [58]:
# Define cost and profit per unit
budget = 100_000_000  # $100,000,000
num_wells = 200
cost_per_well = budget / num_wells  # Cost per well

# Revenue per unit of reserves
revenue_per_unit = 4500  # $4,500

# Calculate break-even volume of reserves
break_even_volume = cost_per_well / revenue_per_unit


In [59]:
# Compare break-even volume with average volumes
comparison_first = avg_vol_first >= break_even_volume
comparison_second = avg_vol_second >= break_even_volume
comparison_third = avg_vol_third >= break_even_volume


### Provide the Findings About the Preparation for Profit Calculation Step

In [41]:
print(f"Break-even volume of reserves: {break_even_volume} units")

print(f"Average Volume of Predicted Reserves (First Predictions): {avg_vol_first} units")
print(f"Average Volume of Predicted Reserves (Second Predictions): {avg_vol_second} units")
print(f"Average Volume of Predicted Reserves (Third Predictions): {avg_vol_third} units")

print(f"Is First Predictions' average volume sufficient? {'Yes' if comparison_first else 'No'}")
print(f"Is Second Predictions' average volume sufficient? {'Yes' if comparison_second else 'No'}")
print(f"Is Third Predictions' average volume sufficient? {'Yes' if comparison_third else 'No'}")


Break-even volume of reserves: 111.11111111111111 units
Average Volume of Predicted Reserves (First Predictions): 92.09339782217532 units
Average Volume of Predicted Reserves (Second Predictions): 68.17320024946817 units
Average Volume of Predicted Reserves (Third Predictions): 94.46742884611454 units
Is First Predictions' average volume sufficient? No
Is Second Predictions' average volume sufficient? No
Is Third Predictions' average volume sufficient? No


<div class="alert alert-success">
<b>Reviewer's comment</b>

Correct. Good job!

</div>

## Write a Function a Calculate Profit From a Set of Selected Oil Wells and Model Predictions

In [68]:
def calculate_profit(target, predictions):
    predictions = pd.Series(predictions, index=target.index)
    sorted_indices = predictions.sort_values(ascending=False).head(num_wells).index
    top_wells = target.loc[sorted_indices]
    target_reserves = target.loc[top_wells.index].sum()  # Getting the corresponding target reserves
    revenue = target_reserves * revenue_per_unit
    profit = revenue - budget
    
    return profit.round(2)

In [75]:
first_target_valid = first_target_valid.reset_index(drop=True)
second_target_valid = second_target_valid.reset_index(drop=True)
third_target_valid = third_target_valid.reset_index(drop=True)

print(f'First Region top 200 wells predicted profit:{calculate_profit(first_target_valid, first_predictions):,}')
print(f'Second Region top 200 wells predicted profit:{calculate_profit(second_target_valid, second_predictions):,}')
print(f'Third Region top 200 wells predicted profit:{calculate_profit(third_target_valid, third_predictions):,}')

First Region top 200 wells predicted profit:32,763,500
Second Region top 200 wells predicted profit:23,300,000
Third Region top 200 wells predicted profit:26,657,000


<div class="alert alert-danger">
<b>Reviewer's comment</b>

Unfortunately, this function is incorrect. 
You should calculate profit using both predictions and targets. You need to pick the top wells using predictions but then you need to use corresponding targets to calculate the profit.
    
P.S. In the lesson you have very similar example about lessons and students.

</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Good job!

</div>

---

## Calculate Risks and Profit for Each Region

In [71]:
state = np.random.RandomState(12345)
def bootstrap_profit(target, predictions, n_iterations=1000, sample_size=500):
    profits = []
    
    predictions = pd.Series(predictions, index=target.index)
    
    for i in range(n_iterations):
        sample_indices = np.random.choice(target.index, size=sample_size, replace=True)
        sample_predictions = predictions[sample_indices]
        sample_targets = target.loc[sample_indices]

        profit = calculate_profit(sample_targets, sample_predictions)
        profits.append(profit)
    
    profits = np.array(profits)
    mean_profit = profits.mean()
    confidence_interval = np.percentile(profits, [2.5, 97.5])
    risk_of_loss = (profits < 0).mean()
    
    return mean_profit, confidence_interval, risk_of_loss

In [72]:
predictions_one = pd.Series(first_predictions, index=first_target_valid.index)
predictions_two = pd.Series(second_predictions, index=second_target_valid.index)
predictions_three = pd.Series(third_predictions, index=third_target_valid.index)

In [73]:
#First Region 
profit_one, ci_one, risk_one = bootstrap_profit(first_target_valid, predictions_one)
#Second Region
profit_two, ci_two, risk_two = bootstrap_profit(second_target_valid, predictions_two)
#Third Region
profit_three, ci_three, risk_three = bootstrap_profit(third_target_valid, predictions_three)

In [74]:
# Print results
print(f"Region Profit One: Mean Profit: {profit_one}, 95% CI: {ci_one}, Risk of Loss: {risk_one*100:.2f}%")
print(f"Region G2: Mean Profit: {profit_two}, 95% CI: {ci_two}, Risk of Loss: {risk_two*100:.2f}%")
print(f"Region G3: Mean Profit: {profit_three}, 95% CI: {ci_three}, Risk of Loss: {risk_three*100:.2f}%")

Region Profit One: Mean Profit: 9511469.0, 95% CI: [  797412.5        22427337.49999999], Risk of Loss: 1.50%
Region G2: Mean Profit: 9933335.0, 95% CI: [ 1833200.  22503162.5], Risk of Loss: 0.40%
Region G3: Mean Profit: 9257363.0, 95% CI: [  105987.5        20629812.49999999], Risk of Loss: 2.40%


<div class="alert alert-danger">
<b>Reviewer's comment</b>

The results are incorrect due two reasons:
1. The function calculate_profit is not correct
2. According to the project description in the bootstrap you should sample with n=500
    
What is about a conclusion?
  

</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Everything is correct. Well done!

</div>

# Conclusion

In this project, our objective was to identify the optimal location for a new oil well. This was achieved by gathering critical oil well parameters, which were used to develop predictive models forecasting the volume of reserves in prospective wells. The models' performance was evaluated through analysis of average volumes of predicted reserves and Root Mean Squared Error (RMSE).  

The first, second, and third sets of predictions had mean values of 92.09, 68.17, and 94.47 units respectively. Unfortunately, none of these predictions met the break-even volume of reserves of 111.11 units. Despite this, the top 200 wells in each region yielded significant predicted profits, with the first region leading at $32,763,500.  

A risk analysis using the Bootstrapping technique revealed that the second region had the lowest risk of loss, standing at 0.40%, despite not having the highest predicted profit.  

Throughout this project, we've carefully analyzed data, built and evaluated models, while also considering potential profits and risks. Our research will significantly contribute to OilyGiant's strategic decision-making and operational success.