<div class="alert alert-success">
<b>Reviewer's comment V2</b>

The project is accepted! Keep up the good work on the next sprint!

</div>

**Review**

Hi, my name is Dmitry and I will be reviewing your project.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did a great job overall, but there is one small problem that needs to be fixed before the project is accepted. Let me know if you have questions!

# Introduction
#### For this project at OilyGiant, I will sort through data to find the most profitable location for a new well. I will analyze oil well parameters across several regions, build predictive models to estimate reserves, pick the oil wells with the highest estimated values, and pick the region with the highest total profit from the selected wells. I will also use bootstrapping to analyze potential profit and risks.
## data description
#### id — unique oil well identifier
#### f0, f1, f2 — three features of points (their specific meaning is unimportant, but the features themselves are significant)
#### product — volume of reserves in the oil well (thousand barrels).

### importing packages

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

### opening datasets 

In [2]:
geo_0 = pd.read_csv('/datasets/geo_data_0.csv')
geo_1 = pd.read_csv('/datasets/geo_data_1.csv')
geo_2 = pd.read_csv('/datasets/geo_data_2.csv')

#### geo_0

In [3]:
geo_0.head()

Unnamed: 0,id,f0,f1,f2,product
0,txEyH,0.705745,-0.497823,1.22117,105.280062
1,2acmU,1.334711,-0.340164,4.36508,73.03775
2,409Wp,1.022732,0.15199,1.419926,85.265647
3,iJLyR,-0.032172,0.139033,2.978566,168.620776
4,Xdl7t,1.988431,0.155413,4.751769,154.036647


In [4]:
geo_0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


#### checking for duplicates in the id column

In [5]:
duplicate_id_0 = geo_0[geo_0.duplicated('id')]

In [6]:
duplicate_id_0.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 7530 to 97785
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   id       10 non-null     object 
 1   f0       10 non-null     float64
 2   f1       10 non-null     float64
 3   f2       10 non-null     float64
 4   product  10 non-null     float64
dtypes: float64(4), object(1)
memory usage: 480.0+ bytes


#### there are 10 duplicates in the id column
#### dropping duplicates in the id column

In [7]:
geo_0.drop_duplicates(subset=['id'], inplace=True)

In [8]:
geo_0.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 99990 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   id       99990 non-null  object 
 1   f0       99990 non-null  float64
 2   f1       99990 non-null  float64
 3   f2       99990 non-null  float64
 4   product  99990 non-null  float64
dtypes: float64(4), object(1)
memory usage: 4.6+ MB


#### info reflects the 10 duplicated id's being dropped
#### checking for missing values

In [9]:
geo_0_missing = geo_0.isnull().sum()
geo_0_missing

id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

#### there are no missing values in geo_0

#### saving the dataset

In [10]:
geo_0_clean = geo_0.copy()

#### geo_1

In [11]:
geo_1.head()

Unnamed: 0,id,f0,f1,f2,product
0,kBEdx,-15.001348,-8.276,-0.005876,3.179103
1,62mP7,14.272088,-3.475083,0.999183,26.953261
2,vyE1P,6.263187,-5.948386,5.00116,134.766305
3,KcrkZ,-13.081196,-11.506057,4.999415,137.945408
4,AHL4O,12.702195,-8.147433,5.004363,134.766305


In [12]:
geo_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


#### checking for duplicates in id column

In [13]:
duplicate_id_1 = geo_1[geo_1.duplicated('id')]
duplicate_id_1

Unnamed: 0,id,f0,f1,f2,product
41906,LHZR0,-8.989672,-4.286607,2.009139,57.085625
82178,bfPNe,-6.202799,-4.820045,2.995107,84.038886
82873,wt4Uk,10.259972,-9.376355,4.994297,134.766305
84461,5ltQ6,18.213839,2.191999,3.993869,107.813044


#### found 4 duplicates
#### dropping duplicates

In [14]:
geo_1.drop_duplicates(subset=['id'], inplace=True)

In [15]:
geo_1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 99996 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   id       99996 non-null  object 
 1   f0       99996 non-null  float64
 2   f1       99996 non-null  float64
 3   f2       99996 non-null  float64
 4   product  99996 non-null  float64
dtypes: float64(4), object(1)
memory usage: 4.6+ MB


#### info reflects the dropped duplicates
#### checking for missing values

In [16]:
geo_1_missing = geo_1.isnull().sum()
geo_1_missing

id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

#### no missing values
#### saving dataset

In [17]:
geo_1_clean = geo_1.copy()

#### geo_2

In [18]:
geo_2.head()

Unnamed: 0,id,f0,f1,f2,product
0,fwXo0,-1.146987,0.963328,-0.828965,27.758673
1,WJtFt,0.262778,0.269839,-2.530187,56.069697
2,ovLUW,0.194587,0.289035,-5.586433,62.87191
3,q6cA6,2.23606,-0.55376,0.930038,114.572842
4,WPMUX,-0.515993,1.716266,5.899011,149.600746


In [19]:
geo_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


#### checking for duplicates in id column

In [20]:
duplicate_id_2 = geo_2[geo_2.duplicated('id')]
duplicate_id_2

Unnamed: 0,id,f0,f1,f2,product
43233,xCHr8,-0.847066,2.101796,5.59713,184.388641
49564,VF7Jo,-0.883115,0.560537,0.723601,136.23342
55967,KUPhW,1.21115,3.176408,5.54354,132.831802
95090,Vcm5J,2.587702,1.986875,2.482245,92.327572


#### 4 duplicated id's
#### dropping duplicated id

In [21]:
geo_2.drop_duplicates(subset=['id'], inplace=True)

In [22]:
geo_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 99996 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   id       99996 non-null  object 
 1   f0       99996 non-null  float64
 2   f1       99996 non-null  float64
 3   f2       99996 non-null  float64
 4   product  99996 non-null  float64
dtypes: float64(4), object(1)
memory usage: 4.6+ MB


#### info reflects the dropped id's
#### checking for missing values

In [23]:
geo_2_missing = geo_2.isnull().sum()
geo_2_missing

id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

#### no missing values
#### saving dataset

In [24]:
geo_2_clean = geo_2.copy()
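
The three cleaning passes above repeat the same steps for each region. As a minimal sketch, they could be wrapped in one helper. The name `clean_region` and the toy frame are hypothetical and only illustrate the idea. Unlike the cells above, this version also resets the index after dropping rows, which is an assumption about what is convenient later, not something the notebook does at this point.

```python
import pandas as pd

def clean_region(df: pd.DataFrame, id_col: str = "id") -> pd.DataFrame:
    """Report duplicated ids and missing values, then drop repeated ids."""
    n_dupes = int(df.duplicated(id_col).sum())
    n_missing = int(df.isnull().sum().sum())
    print(f"duplicated ids: {n_dupes}, missing values: {n_missing}")
    # keep the first occurrence of each id and give the rows a contiguous index
    return df.drop_duplicates(subset=[id_col]).reset_index(drop=True)

# toy frame standing in for one region (values are made up for illustration)
toy = pd.DataFrame({
    "id": ["a", "b", "a", "c"],
    "f0": [0.1, 0.2, 0.3, 0.4],
    "product": [10.0, 20.0, 30.0, 40.0],
})
toy_clean = clean_region(toy)
```

With the real data, the same call would be made once per region, for example `clean_region(geo_0)`.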

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data was loaded and inspected

</div>

#### splitting datasets into features and targets

In [25]:
geo_0_features = geo_0_clean.drop(['product', 'id'], axis=1)
geo_0_target = geo_0_clean['product']

geo_1_features = geo_1_clean.drop(['product', 'id'], axis=1)
geo_1_target = geo_1_clean['product']

geo_2_features = geo_2_clean.drop(['product', 'id'], axis=1)
geo_2_target = geo_2_clean['product']

<div class="alert alert-success">
<b>Reviewer's comment</b>

Features and targets make sense

</div>

#### splitting the datasets into training and validation sets. 75% train 25% valid

In [26]:
geo_0_features_train, geo_0_features_valid, geo_0_target_train, geo_0_target_valid = train_test_split(
    geo_0_features, geo_0_target, test_size=0.25, random_state=12345)

geo_1_features_train, geo_1_features_valid, geo_1_target_train, geo_1_target_valid = train_test_split(
    geo_1_features, geo_1_target, test_size=0.25, random_state=12345)

geo_2_features_train, geo_2_features_valid, geo_2_target_train, geo_2_target_valid = train_test_split(
    geo_2_features, geo_2_target, test_size=0.25, random_state=12345)
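
A small sketch of what this split does, using a made-up 100-row frame rather than the project data. It checks the 75/25 sizes, shows that a fixed `random_state` reproduces the same split, and shows that the validation target keeps its original shuffled row labels, while a model's predictions come back as a plain positional array.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# toy data standing in for one region (values are made up for illustration)
features = pd.DataFrame({"f0": np.arange(100.0), "f1": np.arange(100.0) * 2})
target = pd.Series(np.arange(100.0), name="product")

f_train, f_valid, t_train, t_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345)

# the same seed gives the same rows in the validation set
_, f_valid_again, _, _ = train_test_split(
    features, target, test_size=0.25, random_state=12345)

print(len(f_train), len(f_valid))
print(t_valid.index[:5].tolist())  # original row labels, no longer 0, 1, 2, ...
```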

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data for each region was split into train and validation

</div>

## training linear regression models
### printing the average volume of predicted reserves and model RMSE.

#### geo_0 

In [27]:
geo_0_model = LinearRegression()
geo_0_model.fit(geo_0_features_train, geo_0_target_train)
predictions_valid_geo_0 = geo_0_model.predict(geo_0_features_valid)
result = mean_squared_error(geo_0_target_valid, predictions_valid_geo_0) ** 0.5
predictions_avg_geo_0 = predictions_valid_geo_0.mean()

print("average of predicted reserves:", predictions_avg_geo_0)
print('RMSE:', result)

average of predicted reserves: 92.78915638280621
RMSE: 37.853527328872964


#### geo_1

In [28]:
geo_1_model = LinearRegression()
geo_1_model.fit(geo_1_features_train, geo_1_target_train)
predictions_valid_geo_1 = geo_1_model.predict(geo_1_features_valid)
result = mean_squared_error(geo_1_target_valid, predictions_valid_geo_1) ** 0.5
predictions_avg_geo_1 = predictions_valid_geo_1.mean()

print("average of predicted reserves:", predictions_avg_geo_1)
print('RMSE:', result)

average of predicted reserves: 69.17831957030432
RMSE: 0.8920592647717029


#### geo_2

In [29]:
geo_2_model = LinearRegression()
geo_2_model.fit(geo_2_features_train, geo_2_target_train)
predictions_valid_geo_2 = geo_2_model.predict(geo_2_features_valid)
result = mean_squared_error(geo_2_target_valid, predictions_valid_geo_2) ** 0.5
predictions_avg_geo_2 = predictions_valid_geo_2.mean()

print("average of predicted reserves:", predictions_avg_geo_2) 
print('RMSE:', result)

average of predicted reserves: 94.86572480562035
RMSE: 40.07585073246016


#### geo_0 and geo_2 had the highest average predicted reserves, at about 93 and 95 thousand barrels; geo_1 had the lowest, at about 69. However, geo_1's model had by far the lowest RMSE, about 0.89, compared with about 38 and 40 for geo_0 and geo_2. geo_1's model is therefore the most accurate.
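
RMSE is measured in the same units as `product` (thousand barrels), and a lower value means predictions sit closer to the actual reserves. The sketch below computes it by hand on made-up numbers and compares it against a constant baseline that always predicts the mean. The helper `rmse` and all values are illustrative assumptions, not results from this project.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# made-up reserves and predictions, in thousand barrels
y_true = np.array([100.0, 50.0, 150.0, 100.0])
y_model = np.array([90.0, 60.0, 140.0, 110.0])
y_baseline = np.full_like(y_true, y_true.mean())  # always predict the mean

model_rmse = rmse(y_true, y_model)
baseline_rmse = rmse(y_true, y_baseline)
print(f"model RMSE: {model_rmse:.2f}, baseline RMSE: {baseline_rmse:.2f}")
```

A model is only useful for ranking wells if its RMSE is clearly below such a baseline.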

<div class="alert alert-success">
<b>Reviewer's comment</b>

The models were trained and evaluated correctly

</div>

## profit calculation

#### The budget for development of 200 oil wells is 100 million USD.
#### One barrel of raw materials brings 4.5 USD of revenue. The revenue from one unit of product is 4,500 USD (volume of reserves is in thousand barrels).
#### finding the minimum average number of units each well will have to produce

In [30]:
budget = 100000000
sample = 200
revenue_per_unit = 4500

In [31]:
min_per_well = (budget/sample) / revenue_per_unit

print(min_per_well)

111.11111111111111


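#### each well will need to produce a minimum of 111.11 units (thousand barrels) on average to break even
#### this break-even volume is higher than the average predicted reserves in all three regions (about 93, 69 and 95), so I will take the top 200 wells in each region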
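
The break-even arithmetic can be laid out step by step. The sketch below repeats the calculation and compares the result with the average predicted reserves printed earlier, rounded to two decimals. The names `cost_per_well`, `break_even_volume` and `shortfall` are introduced here for illustration only.

```python
budget = 100_000_000       # USD for developing the selected wells
wells = 200                # number of wells to develop
revenue_per_unit = 4_500   # USD per unit of product (one thousand barrels)

cost_per_well = budget / wells                        # USD each well must earn back
break_even_volume = cost_per_well / revenue_per_unit  # thousand barrels per well

# average predicted reserves from the model output above, rounded
predicted_avg = {"geo_0": 92.79, "geo_1": 69.18, "geo_2": 94.87}
shortfall = {region: break_even_volume - avg for region, avg in predicted_avg.items()}

for region, gap in shortfall.items():
    print(f"{region}: {gap:.2f} thousand barrels below break-even on average")
```

Every region falls short on average, which is why picking wells at random would lose money and selecting the best predicted wells matters.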

<div class="alert alert-success">
<b>Reviewer's comment</b>

Calculation is correct!

</div>

#### finding profit based on top 200 wells in each region

In [32]:
def calculate_profit(target, predictions):
    # rank wells by predicted reserves and keep the top `sample` of them
    predictions = pd.Series(predictions)
    top_wells = predictions.sort_values(ascending=False).head(sample).index
    # profit is based on the actual reserves of the selected wells
    target_reserves = target.loc[top_wells].sum()
    revenue = target_reserves * revenue_per_unit
    profit = revenue - budget
    return profit
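
A toy illustration of the selection logic: wells are ranked by predicted reserves, but the profit is computed from the actual reserves of the wells that were picked. The function `toy_profit` and every number below are hypothetical and scaled down so the result can be checked by hand.

```python
import numpy as np
import pandas as pd

def toy_profit(target, predictions, n_wells, revenue_per_unit, budget):
    """Pick the n_wells with the highest predictions, then sum their actual reserves."""
    ranked = pd.Series(np.asarray(predictions), index=target.index)
    chosen = ranked.sort_values(ascending=False).head(n_wells).index
    return target.loc[chosen].sum() * revenue_per_unit - budget

# made-up actual reserves and predictions for four wells (thousand barrels)
actual = pd.Series([120.0, 80.0, 150.0, 60.0])
predicted = [130.0, 140.0, 90.0, 50.0]

# wells 1 and 0 have the highest predictions; their actual reserves are 80 + 120
profit = toy_profit(actual, predicted, n_wells=2, revenue_per_unit=4_500, budget=800_000)

# with perfect predictions, wells 2 and 0 would have been chosen instead
best_possible = toy_profit(actual, actual, n_wells=2, revenue_per_unit=4_500, budget=800_000)
print(profit, best_possible)
```

The gap between the two numbers is the cost of prediction error, which is why model accuracy affects the final profit.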

<div class="alert alert-success">
<b>Reviewer's comment</b>

Profit is calculated correctly

</div>

In [33]:
geo_0_target_valid = geo_0_target_valid.reset_index(drop=True)
geo_1_target_valid = geo_1_target_valid.reset_index(drop=True)
geo_2_target_valid = geo_2_target_valid.reset_index(drop=True)

<div class="alert alert-success">
<b>Reviewer's comment</b>

Great, now targets and predictions have the same indices

</div>

In [34]:
print(f'geo_0 predicted profit: {calculate_profit(geo_0_target_valid, predictions_valid_geo_0):,}')
print(f'geo_1 predicted profit: {calculate_profit(geo_1_target_valid, predictions_valid_geo_1):,}')
print(f'geo_2 predicted profit: {calculate_profit(geo_2_target_valid, predictions_valid_geo_2):,}')

geo_0 predicted profit: 33,651,872.377002865
geo_1 predicted profit: 24,150,866.966815114
geo_2 predicted profit: 25,012,838.532820627


#### geo_0 provides the highest predicted profit at about 33.7m USD, followed by geo_2 at about 25.0m and geo_1 at about 24.2m

<div class="alert alert-warning">
<b>Reviewer's comment</b>

Alright, keep in mind that these numbers come from the overall top 200 wells in each region though, so it's an estimate of maximum possible profit which is not very likely to be achieved

</div>

## calculating the risks and profit for each region using bootstrapping

In [35]:
def bootstrapping(target, predictions, well_samples, bootstrap_samples, region):
    predictions = pd.Series(predictions)
    
    values = [] 
    
    state = np.random.RandomState(12345)
    
    for i in range(bootstrap_samples):
        target_subsample = target.sample(n=well_samples, replace=True, random_state=state)
        probs_subsample = predictions[target_subsample.index]
        values.append(calculate_profit(target_subsample, probs_subsample))
    
    values = pd.Series(values)
   
    upper = values.quantile(0.975)
    lower = values.quantile(0.025)
     
    mean = values.mean()
    
    risk_of_loss = (values < 0).mean() * 100
    
    print(region)
    print(f'upper and lower bounds: ({upper:,}, {lower:,})')
    print(f'average profit: {mean:,}')
    print(f'risk of losses: {risk_of_loss}%')
    print()
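
One pandas detail that may be worth checking here: sampling with replacement keeps the original row labels, so a well drawn twice appears under the same label twice. A label-based lookup with `.loc` then returns every row carrying that label, for each time the label is requested. The sketch below only demonstrates this behavior on a made-up three-row Series and shows a position-based alternative. It makes no claim about how much, if at all, the figures in this notebook are affected.

```python
import pandas as pd

target = pd.Series([10.0, 20.0, 30.0], index=[0, 1, 2])

# stand-in for a bootstrap draw in which row 2 was picked twice
subsample = target.loc[[2, 2, 1]]

# asking for label 2 twice from a Series that holds label 2 twice yields 2 x 2 rows
looked_up = subsample.loc[[2, 2]]
print(len(looked_up))

# alternative: give every drawn row its own label before ranking and selecting
relabelled = subsample.reset_index(drop=True)
top_two = relabelled.sort_values(ascending=False).head(2)
print(len(top_two), top_two.sum())
```

Resetting the index of both the sampled targets and the matching predictions before ranking would keep each selection at exactly the requested number of wells.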

<div class="alert alert-danger">
<s><b>Reviewer's comment</b>

The bootstrapping function is correct, except there is a small misprint here:
    
```python
target_subsample = predictions.sample(n=well_samples, replace=True, random_state=state)
```
    
I assume you meant to sample from targets, not predictions here.

</div>

<div class="alert alert-info">
  fixed with = target.sample(n....)
</div>
  

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Great!

</div>

In [36]:
bootstrapping(geo_0_target_valid, predictions_valid_geo_0, 500, 1000, 'geo_0')
bootstrapping(geo_1_target_valid, predictions_valid_geo_1, 500, 1000, 'geo_1')
bootstrapping(geo_2_target_valid, predictions_valid_geo_2, 500, 1000, 'geo_2')

geo_0
upper and lower bounds: (12,792,400.755693015, -58,784.45790209988)
average profit: 5,856,522.724429933
risk of losses: 2.8000000000000003%

geo_1
upper and lower bounds: (12,476,844.632237483, 1,857,995.8171697822)
average profit: 6,822,983.0506521575
risk of losses: 0.3%

geo_2
upper and lower bounds: (11,679,714.100374795, -696,510.441707522)
average profit: 5,379,665.322994557
risk of losses: 4.1000000000000005%



#### using a 95% confidence interval from 1,000 bootstrap iterations of 500 wells each, both geo_0 (2.8%) and geo_2 (4.1%) have a higher risk of loss than the 2.5% threshold set by the company, so I will reject them. geo_1 is under the threshold at 0.3%, with an average profit of 6.8m and an upper bound of 12.5m
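
The decision rule can be sketched on its own: take the distribution of bootstrap profits, read off the 2.5% and 97.5% quantiles for the 95% interval, and compare the share of negative outcomes with the 2.5% threshold. The profits below are simulated from a normal distribution purely for illustration and are not the notebook's values.

```python
import numpy as np

rng = np.random.RandomState(12345)
# simulated profits in USD, standing in for 1,000 bootstrap results
profits = rng.normal(loc=5_000_000, scale=2_500_000, size=1000)

lower, upper = np.quantile(profits, [0.025, 0.975])  # 95% interval
risk_of_loss = (profits < 0).mean() * 100            # share of losses, in percent
threshold = 2.5                                      # maximum acceptable risk, in percent
acceptable = risk_of_loss < threshold

print(f"95% interval: ({lower:,.0f}, {upper:,.0f})")
print(f"risk of loss: {risk_of_loss:.1f}%, within threshold: {acceptable}")
```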

## conclusion
#### after bootstrapping each region with 1,000 iterations of 500 wells and a 95% confidence interval, geo_1 was the only region below the company's risk threshold of 2.5%. It has the highest average profit, 6.8m, with a 0.3% risk of loss, and its 95% interval runs from about 1.9m to 12.5m. geo_1 is the region I will be presenting to the company.

<div class="alert alert-danger">
<s><b>Reviewer's comment</b>

Please check the results after fixing the problem above

</div>

<div class="alert alert-info">
  checked results after fixing problem
</div>
  

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Alright! Region choice makes sense and is justified

</div>