**Review**

Hi, my name is Dmitry and I will be reviewing your project.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did an excellent job! The project is accepted! Keep up the good work on the next sprint!

<div class="alert alert-warning">
<b>Reviewer's comment</b>

One small note on presentation: the text would be more readable if you used headers only for actual section titles and wrote conclusions and comments as regular text.

</div>

## Importing Modules

In [1]:
import pandas as pd
import numpy as np

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns


## Loading in the CSV Datasets

In [2]:
geo_data_0 = pd.read_csv(r'/datasets/geo_data_0.csv')
geo_data_1 = pd.read_csv(r'/datasets/geo_data_1.csv')
geo_data_2 = pd.read_csv(r'/datasets/geo_data_2.csv')

## Viewing the geo_data_0 dataset.

In [3]:
geo_data_0.head()

Unnamed: 0,id,f0,f1,f2,product
0,txEyH,0.705745,-0.497823,1.22117,105.280062
1,2acmU,1.334711,-0.340164,4.36508,73.03775
2,409Wp,1.022732,0.15199,1.419926,85.265647
3,iJLyR,-0.032172,0.139033,2.978566,168.620776
4,Xdl7t,1.988431,0.155413,4.751769,154.036647


## Viewing the geo_data_0 data types

In [4]:
geo_data_0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


## Shape of geo_data_0

In [5]:
geo_data_0.shape

(100000, 5)

## Inspected the dataset for missing values and found none.

In [6]:
geo_data_0.isna().sum()

id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

## Inspecting the dataset for duplicates. There are no duplicates in the data.

In [7]:
geo_data_0.duplicated().sum()

0

## View geo_data_1

In [8]:
geo_data_1.head()

Unnamed: 0,id,f0,f1,f2,product
0,kBEdx,-15.001348,-8.276,-0.005876,3.179103
1,62mP7,14.272088,-3.475083,0.999183,26.953261
2,vyE1P,6.263187,-5.948386,5.00116,134.766305
3,KcrkZ,-13.081196,-11.506057,4.999415,137.945408
4,AHL4O,12.702195,-8.147433,5.004363,134.766305


## View data types of geo_data_1

In [9]:
geo_data_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


## Shape of geo_data_1

In [10]:
geo_data_1.shape

(100000, 5)

## Inspected the dataset for missing values and found none.

In [11]:
geo_data_1.isna().sum()

id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

## Inspected the dataset for duplicates and found none.

In [12]:
geo_data_1.duplicated().sum()

0

## Viewing geo_data_2

In [13]:
geo_data_2.head()

Unnamed: 0,id,f0,f1,f2,product
0,fwXo0,-1.146987,0.963328,-0.828965,27.758673
1,WJtFt,0.262778,0.269839,-2.530187,56.069697
2,ovLUW,0.194587,0.289035,-5.586433,62.87191
3,q6cA6,2.23606,-0.55376,0.930038,114.572842
4,WPMUX,-0.515993,1.716266,5.899011,149.600746


## Geo_data_2 datatypes

In [14]:
geo_data_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


## Shape of geo_data_2

In [15]:
geo_data_2.shape

(100000, 5)

## Inspected the dataset for missing values and found none.

In [16]:
geo_data_2.isna().sum()

id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

## Inspected the dataset for duplicates and found none.

In [17]:
geo_data_2.duplicated().sum()

0

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data was loaded and inspected!

</div>

## Dropped the product column because it is the target, and the id column because it is only an identifier and has no effect on the target.

<div class="alert alert-success">
<b>Reviewer's comment</b>

Yep, that makes sense!

</div>

In [18]:
features_data_0 = geo_data_0.drop(['product', 'id'], axis=1)
features_data_0.head()

Unnamed: 0,f0,f1,f2
0,0.705745,-0.497823,1.22117
1,1.334711,-0.340164,4.36508
2,1.022732,0.15199,1.419926
3,-0.032172,0.139033,2.978566
4,1.988431,0.155413,4.751769


In [19]:
type(features_data_0)

pandas.core.frame.DataFrame

## Creating a target dataset for geo_data_0.

In [20]:
target_data_0 = geo_data_0['product']
target_data_0.head()

0    105.280062
1     73.037750
2     85.265647
3    168.620776
4    154.036647
Name: product, dtype: float64

In [21]:
target_data_0.shape

(100000,)

## Split the features and target into training (75%) and validation (25%) sets with train_test_split.

In [22]:
X_train, X_valid, y_train, y_valid = train_test_split(features_data_0, target_data_0, test_size = 0.25, random_state=12345)

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data was split into train and test sets

</div>

In [23]:
X_train

Unnamed: 0,f0,f1,f2
27212,0.022450,0.951034,2.197333
7866,1.766731,0.007835,6.436602
62041,0.724514,0.666063,1.840177
70185,-1.104181,0.255268,2.026156
82230,-0.635263,0.747990,6.643327
...,...,...,...
4094,1.863680,-0.298123,1.621324
85412,-1.162682,-0.014822,6.819941
2177,0.862688,-0.403776,1.867662
77285,0.846235,-0.489533,1.058786


In [24]:
y_train

27212    147.370612
7866     147.630053
62041     77.696728
70185     55.210501
82230    113.891723
            ...    
4094     124.380793
85412    144.874913
2177     134.967255
77285     64.494357
86498    151.514894
Name: product, Length: 75000, dtype: float64

## Created a Linear Regression model and fit it to the training data

In [23]:
model_geo_data_0 = LinearRegression()
model_geo_data_0.fit(X_train, y_train)

LinearRegression()

## Used the model to predict on the validation set. The average predicted volume is about 92.6 per well.

In [24]:
model_0_pred = model_geo_data_0.predict(X_valid)
print(f"Average Volume of Predicted Reserves for Model 0: {model_0_pred.mean()}")

Average Volume of Predicted Reserves for Model 0: 92.59256778438035


In [25]:
sample_0_target = (y_valid.reset_index(drop=True))

sample_0_predictions = (pd.Series(model_0_pred))

In [26]:
sample_0_target

0         10.038645
1        114.551489
2        132.603635
3        169.072125
4        122.325180
            ...    
24995    170.116726
24996     93.632175
24997    127.352259
24998     99.782700
24999    177.821022
Name: product, Length: 25000, dtype: float64

In [27]:
sample_0_predictions

0         95.894952
1         77.572583
2         77.892640
3         90.175134
4         70.510088
            ...    
24995    103.037104
24996     85.403255
24997     61.509833
24998    118.180397
24999    118.169392
Length: 25000, dtype: float64

## The Root Mean Squared Error (RMSE) for the model is 37.6.

In [28]:
rmse_model_0 = mean_squared_error(y_valid, model_0_pred, squared=False)  # squared=False returns RMSE, not MSE
print(f"RMSE for Model 0: {rmse_model_0}")

RMSE for Model 0: 37.5794217150813
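
With `squared=False`, `mean_squared_error` returns the square root of the MSE, so the value above is an RMSE expressed in the same units as `product`. A minimal sketch of the relationship, using made-up numbers rather than the project data:

```python
import numpy as np

# Hypothetical toy values, not taken from the project data.
y_true = np.array([100.0, 80.0, 120.0, 90.0])
y_pred = np.array([110.0, 70.0, 115.0, 95.0])

# MSE is the mean of the squared errors; RMSE is its square root.
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

print(f"MSE:  {mse:.2f}")   # 62.50
print(f"RMSE: {rmse:.2f}")  # 7.91
```

Reporting RMSE rather than MSE makes the error directly comparable to the average predicted volume.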


## Showing the first ten predictions for model_0.

In [29]:
model_0_pred[:10]

array([ 95.89495185,  77.57258261,  77.89263965,  90.17513418,
        70.51008829,  69.12707635, 125.10675866,  87.64384928,
        86.03587058,  98.65531069])

## Same process but with Geo_Data_1.

In [30]:
features_data_1 = geo_data_1.drop(['product', 'id'], axis=1)
features_data_1.head()

Unnamed: 0,f0,f1,f2
0,-15.001348,-8.276,-0.005876
1,14.272088,-3.475083,0.999183
2,6.263187,-5.948386,5.00116
3,-13.081196,-11.506057,4.999415
4,12.702195,-8.147433,5.004363


## Created a target for dataset_1

In [31]:
target_data_1 = geo_data_1['product']
target_data_1.head()

0      3.179103
1     26.953261
2    134.766305
3    137.945408
4    134.766305
Name: product, dtype: float64

## Split the features and target into training (75%) and validation (25%) sets with train_test_split.

In [32]:
X_train_1, X_valid_1, y_train_1, y_valid_1 = train_test_split(features_data_1, target_data_1, test_size = 0.25, random_state=12345)

## Created a Linear Regression model and fit it to the training data

In [33]:
model_geo_data_1 = LinearRegression()
model_geo_data_1.fit(X_train_1, y_train_1)

LinearRegression()

## Generated predictions on the validation set.

In [34]:
model_1_pred = model_geo_data_1.predict(X_valid_1)
print(f"Average Volume of Predicted Reserves for Model 1: {model_1_pred.mean()}")

Average Volume of Predicted Reserves for Model 1: 68.728546895446


In [35]:
sample_1_target = (y_valid_1.reset_index(drop=True))

sample_1_predictions = (pd.Series(model_1_pred))

## Calculated the Root Mean Squared Error (RMSE) to be 0.89

In [36]:
rmse_model_1 = mean_squared_error(y_valid_1, model_1_pred, squared=False)  # squared=False returns RMSE, not MSE
print(f"RMSE for Model 1: {rmse_model_1}")

RMSE for Model 1: 0.893099286775617


## Looked at the first 10 predictions from the model.

In [37]:
model_1_pred[:10]

array([ 82.66331365,  54.43178616,  29.74875995,  53.5521335 ,
         1.24385647, 111.43849042, 137.13437396,  82.88890232,
       110.89731069,  29.21930594])

## Same process but with Geo_Data_2

In [38]:
features_data_2 = geo_data_2.drop(['product', 'id'], axis=1)
features_data_2.head()

Unnamed: 0,f0,f1,f2
0,-1.146987,0.963328,-0.828965
1,0.262778,0.269839,-2.530187
2,0.194587,0.289035,-5.586433
3,2.23606,-0.55376,0.930038
4,-0.515993,1.716266,5.899011


## Created a target dataset using the product column.

In [39]:
target_data_2 = geo_data_2['product']
target_data_2.head()

0     27.758673
1     56.069697
2     62.871910
3    114.572842
4    149.600746
Name: product, dtype: float64

## Split the features and target into training (75%) and validation (25%) sets with train_test_split.

In [40]:
X_train_2, X_valid_2, y_train_2, y_valid_2 = train_test_split(features_data_2, target_data_2, test_size = 0.25, random_state=12345)

## Created a Linear Regression model and fit it to the training data

In [41]:
model_geo_data_2 = LinearRegression()
model_geo_data_2.fit(X_train_2, y_train_2)

LinearRegression()

## Generated predictions for dataset 2 on the validation set

In [42]:
model_2_pred = model_geo_data_2.predict(X_valid_2)
print(f"Average Volume of Predicted Reserves for Model 2: {model_2_pred.mean()}")

Average Volume of Predicted Reserves for Model 2: 94.96504596800489


In [43]:
sample_2_target = (y_valid_2.reset_index(drop=True))

sample_2_predictions = (pd.Series(model_2_pred))

## Calculated the Root Mean Squared Error (RMSE) for model 2 to be 40.0

In [44]:
rmse_model_2 = mean_squared_error(y_valid_2, model_2_pred, squared=False)  # squared=False returns RMSE, not MSE
print(f"RMSE for Model 2: {rmse_model_2}")

RMSE for Model 2: 40.02970873393434


## Showing the first ten predictions from our model.

In [45]:
model_2_pred[:10]

array([ 93.59963303,  75.10515854,  90.06680936, 105.16237507,
       115.30331048, 121.93919667, 119.05304048,  75.39657483,
       111.40054309,  84.02931965])

<div class="alert alert-success">
<b>Reviewer's comment</b>

The models were trained and evaluated correctly

</div>
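
The same split, fit, predict and score steps are repeated for all three regions, so they could be wrapped in one function. The sketch below is only an illustration: `train_and_evaluate` is a hypothetical helper name, the demo DataFrame is synthetic stand-in data, and RMSE is computed with `np.sqrt` instead of `squared=False` so that it does not depend on the scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def train_and_evaluate(df, random_state=12345):
    """Hypothetical helper: split, fit, predict and score one region."""
    features = df.drop(["product", "id"], axis=1)
    target = df["product"]
    X_train, X_valid, y_train, y_valid = train_test_split(
        features, target, test_size=0.25, random_state=random_state
    )
    model = LinearRegression().fit(X_train, y_train)
    # Keep predictions and targets positionally aligned (index 0..n-1).
    predictions = pd.Series(model.predict(X_valid))
    target_valid = y_valid.reset_index(drop=True)
    rmse = np.sqrt(mean_squared_error(target_valid, predictions))
    return target_valid, predictions, rmse


# Synthetic stand-in for a region file, only to make the sketch runnable.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "id": [f"w{i}" for i in range(1000)],
    "f0": rng.normal(size=1000),
    "f1": rng.normal(size=1000),
    "f2": rng.normal(size=1000),
})
demo["product"] = 50 + 10 * demo["f2"] + rng.normal(scale=1.0, size=1000)

target_valid, predictions, rmse = train_and_evaluate(demo)
print(f"Mean predicted volume: {predictions.mean():.2f}, RMSE: {rmse:.2f}")
```

With the real data, the function would be called once per region, for example `train_and_evaluate(geo_data_0)`.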

## Calculating the volume of reserves sufficient for developing a new well without losses

#### The budget for 200 oil wells is 100 million dollars, so the revenue from the 200 selected wells must total at least 100 million dollars to avoid a loss

In [46]:
oil_wells = 100000000
product_number = 200
oil_well_individual = oil_wells/product_number
print(f"A single oil well will have to produce ${oil_well_individual:.2f} to make our $100 million dollars back.")

revenue_per_product = 4500
print(f"A single oil well will have to have {round(oil_well_individual/revenue_per_product, 2)} products at each oil well. Any less than that and our company loses money on that individual well. If we have more products than that then our company makes money.")

A single oil well will have to produce $500000.00 to make our $100 million dollars back.
A single oil well will have to have 111.11 products at each oil well. Any less than that and our company loses money on that individual well. If we have more products than that then our company makes money.


<div class="alert alert-success">
<b>Reviewer's comment</b>

Calculation is correct!

</div>

## The mean of dataset_0 is 92.5, well below the break-even volume of 111.11, so an average well in this region would lose money.

In [47]:
target_data_0.mean()


92.50000000000001

## The mean of dataset_1 is 68.8, well below the break-even volume of 111.11, so an average well in this region would lose money.

In [48]:
target_data_1.mean()

68.82500000000002

## The mean of dataset_2 is 95.0, well below the break-even volume of 111.11, so an average well in this region would lose money.

In [49]:
target_data_2.mean()

95.00000000000004

<div class="alert alert-success">
<b>Reviewer's comment</b>

Great!

</div>

## Creating a volume function for each model

In [50]:
def volume(target, predictions, count):
    id_sorted = np.argsort(predictions)[::-1]
    selected = target[id_sorted][:count]
    return selected 

<div class="alert alert-success">
<b>Reviewer's comment</b>

Good, the wells are sorted by predictions, but you use the targets to find total volume of oil contained in them

</div>
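
One point worth double-checking with this pattern: `np.argsort` returns positions within `predictions`, so the lookup only picks the intended wells when `target` has the same length and order as `predictions` (for example, validation targets with a reset index). The sketch below uses toy data and a hypothetical function name, `top_by_prediction`, to show a positional lookup with `.iloc` that does not depend on the index labels:

```python
import numpy as np
import pandas as pd

# Toy data: the index labels are deliberately non-sequential,
# as they are for a validation split.
target = pd.Series([10.0, 50.0, 30.0, 40.0], index=[7, 2, 9, 4])
predictions = np.array([12.0, 48.0, 33.0, 35.0])  # same order as target


def top_by_prediction(target, predictions, count):
    """Hypothetical variant: true volumes of the top-predicted wells."""
    order = np.argsort(predictions)[::-1][:count]  # positions, not labels
    return target.iloc[order]                      # positional lookup


selected = top_by_prediction(target, predictions, 2)
print(selected.tolist())  # [50.0, 40.0]
```

Using `.iloc` makes the positional intent explicit, whereas plain `target[...]` with integers is interpreted as a label lookup on an integer index.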

## Taking the 200 wells with the highest predicted volume from dataset_0.

In [51]:
total_volume_0 = (volume(target_data_0, model_0_pred, 200))
print(total_volume_0.sort_values(ascending=False))

21277    184.419390
4521     183.623485
6769     180.534680
20340    179.438060
7890     177.547449
            ...    
10827     11.087085
10396     11.007289
6036       9.191087
7212       8.886762
23360      5.209102
Name: product, Length: 200, dtype: float64


In [52]:
total_volume_0.sum()

17950.77998728407

## Calculating the profit for the selected dataset_0 wells, which shows a loss of about 19.2 million dollars.

In [53]:
profit_0 = (total_volume_0).sum() * 4500 - 100000000
profit_0

-19221490.05722168

## Taking the 200 wells with the highest predicted volume from dataset_1.

In [54]:
total_volume_1 = (volume(target_data_1, model_1_pred, 200))
print(total_volume_1.sort_values(ascending=False))

7777     137.945408
3770     137.945408
17534    137.945408
13461    137.945408
22350    137.945408
            ...    
15987      0.000000
5118       0.000000
10131      0.000000
21442      0.000000
8404       0.000000
Name: product, Length: 200, dtype: float64


In [55]:
total_volume_1.sum()

13298.32432740735

## Calculating the profit for the selected dataset_1 wells, which shows a loss of about 40.2 million dollars.

In [56]:
profit_1 = (total_volume_1).sum() * 4500 - 100000000
profit_1

-40157540.526666924

## Taking the 200 wells with the highest predicted volume from dataset_2.

In [57]:
total_volume_2 = (volume(target_data_2, model_2_pred, 200))
print(total_volume_2.sort_values(ascending=False))

13864    188.892634
16198    187.523298
16263    185.106622
417      182.466197
22188    181.408048
            ...    
1581      17.945354
7013      16.228970
102        8.382735
20504      6.652556
11763      4.758346
Name: product, Length: 200, dtype: float64


In [58]:
total_volume_2.sum()

19423.869127363043

## Calculating the profit for the selected dataset_2 wells, which shows a loss of about 12.6 million dollars.

In [59]:
profit_2 = (total_volume_2).sum() * 4500 - 100000000
profit_2

-12592588.926866308

## Create a graph showing the top performing wells from each dataset.

In [60]:
total_volume_0.name

'product'

## Creating a profit function to use in the bootstrapping model

#### Bootstrap with 1000 random samples of 500 wells each. The lower and upper values printed for each region are the 2.5% and 97.5% quantiles of profit, which bound the 95% confidence interval.

In [61]:
def profit(target, predictions, count):
    predictions_sorted = predictions.sort_values(ascending=False)
    selected_points = target[predictions_sorted.index][:count]
    profit_selected_points = selected_points.sum() * 4500 - 100000000
    return profit_selected_points

<div class="alert alert-success">
<b>Reviewer's comment</b>

Profit function is correct

</div>

## Performing Bootstrapping on Model 0

In [62]:
state = np.random.RandomState(12345)

values = []
for i in range(1000):
    subsample = sample_0_target.sample(n = 500, replace=True,  random_state=state)
    predictions = sample_0_predictions[subsample.index]
    subsample_profit = profit(subsample, predictions, 200)
    values.append(subsample_profit)

    
values = pd.Series(values)

lower = values.quantile(0.025)
upper = values.quantile(0.975)

print(f"Lower: {lower}")
print(f"Upper: {upper}")

print(f"\nThe mean value is {values.mean()}")

negative_profit_chance = (values < 0).mean()
print("Risk of losses =", negative_profit_chance * 100, "%")

Lower: -1020900.9483793724
Upper: 9479763.533583675

The mean value is 4259385.269105923
Risk of losses = 6.0 %


## Performing Bootstrapping on Model 1

In [63]:
state = np.random.RandomState(12345)

values = []
for i in range(1000):
    subsample = sample_1_target.sample(n = 500, replace=True,  random_state=state)
    predictions = sample_1_predictions[subsample.index]
    subsample_profit = profit(subsample, predictions, 200)
    values.append(subsample_profit)

    
values = pd.Series(values)

lower = values.quantile(0.025)
upper = values.quantile(0.975)

print(f"Lower: {lower}")
print(f"Upper: {upper}")

print(f"\nThe mean value is {values.mean()}")

negative_profit_chance = (values < 0).mean()
print("Risk of losses =", negative_profit_chance * 100, "%")

Lower: 688732.2537050088
Upper: 9315475.912570495

The mean value is 5152227.734432898
Risk of losses = 1.0 %


## Performing Bootstrapping on Model 2

In [64]:
state = np.random.RandomState(12345)

values = []
for i in range(1000):
    subsample = sample_2_target.sample(n = 500, replace=True,  random_state=state)
    predictions = sample_2_predictions[subsample.index]
    subsample_profit = profit(subsample, predictions, 200)
    values.append(subsample_profit)

    
values = pd.Series(values)

lower = values.quantile(0.025)
upper = values.quantile(0.975)

print(f"Lower: {lower}")
print(f"Upper: {upper}")

print(f"\nThe mean value is {values.mean()}")

negative_profit_chance = (values < 0).mean()
print("Risk of losses =", negative_profit_chance * 100, "%")

Lower: -1288805.473297878
Upper: 9697069.541802654

The mean value is 4350083.627827557
Risk of losses = 6.4 %


<div class="alert alert-success">
<b>Reviewer's comment</b>

Bootstrapping is done correctly. The 95% confidence interval, mean profit and risk of losses are calculated correctly.

</div>
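
The three bootstrap cells differ only in their inputs, so the loop could be factored into a single function. The following is a sketch under stated assumptions, not the notebook's exact procedure: `bootstrap_profit` is a hypothetical name, the demo data is synthetic, and each subsample's index is reset before ranking (a choice made here to avoid duplicated labels, which differs slightly from the cells above).

```python
import numpy as np
import pandas as pd


def profit(target, predictions, count, unit_revenue=4500, budget=100_000_000):
    # Rank wells by prediction, then sum the true volumes of the top ones.
    top = predictions.sort_values(ascending=False).index[:count]
    return target.loc[top].sum() * unit_revenue - budget


def bootstrap_profit(target, predictions, n_iter=1000, sample_size=500,
                     count=200, seed=12345):
    """Hypothetical helper: bootstrap distribution of profit for one region."""
    state = np.random.RandomState(seed)
    values = []
    for _ in range(n_iter):
        # Draw wells with replacement and take the matching predictions.
        subsample = target.sample(n=sample_size, replace=True, random_state=state)
        preds = predictions[subsample.index]
        # Reset both indexes so repeated wells get distinct labels.
        subsample = subsample.reset_index(drop=True)
        preds = preds.reset_index(drop=True)
        values.append(profit(subsample, preds, count))
    values = pd.Series(values)
    return {
        "mean": values.mean(),
        "lower": values.quantile(0.025),
        "upper": values.quantile(0.975),
        "risk": (values < 0).mean(),
    }


# Synthetic stand-in for validation targets and predictions.
rng = np.random.default_rng(0)
demo_target = pd.Series(rng.uniform(0, 190, size=2000))
demo_predictions = demo_target + pd.Series(rng.normal(scale=20, size=2000))

result = bootstrap_profit(demo_target, demo_predictions, n_iter=200)
print(result)
```

With the notebook's variables, a call would look like `bootstrap_profit(sample_1_target, sample_1_predictions)`.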

# Conclusion

## I would suggest Geo_Data_1 as the region for new well development. Its 2.5% profit quantile is positive, unlike the other two regions, and its risk of losses is the lowest at 1.0%. The 95% confidence interval for profit in this region runs from about 689,000 USD to about 9,315,000 USD.

<div class="alert alert-success">
<b>Reviewer's comment</b>

Region choice is correct and justified!

</div>