# Find out which new well is the best
(machine learning in business)

## Content

1. [Introduction](#intro)
2. [General information](#general)
3. [Model](#model)
4. [Profit](#profit)
5. [Conclusion](#conclusion)

## Introduction<a id="intro"></a>
**Project Description**

You have data on oil samples from three regions; the parameters of each well in a region are already known. Build a model that helps pick the region with the highest expected profit.

**Data description**
- `id`: unique oil well identifier
- `f0`, `f1`, `f2`: three features of each well (their specific meaning is unimportant, but the features themselves are significant)
- `product`: volume of reserves in the oil well (thousand barrels)

*Libraries*

In [1]:
import pandas as pd
import numpy as np

from sklearn.metrics import mean_squared_error

from sklearn.model_selection import train_test_split
from scipy import stats as st

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

import warnings
warnings.filterwarnings('ignore')

## General Information<a id="general"></a>


In [2]:
try:
    region_1 = pd.read_csv('geo_data_0.csv')
    region_2 = pd.read_csv('geo_data_1.csv')
    region_3 = pd.read_csv('geo_data_2.csv')
except FileNotFoundError:
    # Fall back to the /datasets directory when the files are not in the working directory.
    region_1 = pd.read_csv('/datasets/geo_data_0.csv')
    region_2 = pd.read_csv('/datasets/geo_data_1.csv')
    region_3 = pd.read_csv('/datasets/geo_data_2.csv')

*Region 1*

In [3]:
region_1.sample(5)

| | id | f0 | f1 | f2 | product |
| --- | --- | --- | --- | --- | --- |
| 83245 | WIp8z | -0.046338 | 0.586529 | 6.692993 | 178.947797 |
| 77530 | y6JS0 | 0.164177 | 0.948437 | 6.570345 | 61.973446 |
| 5389 | gLBAQ | 2.18768 | 0.468166 | 3.096371 | 71.273613 |
| 56277 | OsDPf | 0.096591 | 0.836343 | 5.636373 | 87.565521 |
| 50633 | FOohD | -1.1161 | 0.121117 | 0.216412 | 57.402938 |


In [4]:
region_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
id         100000 non-null object
f0         100000 non-null float64
f1         100000 non-null float64
f2         100000 non-null float64
product    100000 non-null float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [5]:
region_1.duplicated().sum()

0

In [6]:
region_1.isna().sum()

id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

*Region 2*

In [7]:
region_2.sample(5)

| | id | f0 | f1 | f2 | product |
| --- | --- | --- | --- | --- | --- |
| 36788 | hDY5L | 14.982935 | -3.164337 | 2.999655 | 80.859783 |
| 80188 | YyDwk | 0.881283 | -7.242448 | -0.004558 | 3.179103 |
| 64587 | RgHqz | 3.098683 | -4.119691 | 0.997039 | 26.953261 |
| 90097 | JYFYX | 4.362827 | -5.654776 | 2.989045 | 80.859783 |
| 27106 | 8bmB7 | 12.558534 | -0.164662 | 4.000755 | 107.813044 |


In [8]:
region_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
id         100000 non-null object
f0         100000 non-null float64
f1         100000 non-null float64
f2         100000 non-null float64
product    100000 non-null float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [9]:
region_2.duplicated().sum()

0

In [10]:
region_2.isna().sum()

id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

*Region 3*

In [11]:
region_3.sample(5)

| | id | f0 | f1 | f2 | product |
| --- | --- | --- | --- | --- | --- |
| 93699 | BKIJ2 | -0.437419 | 2.688932 | 8.862211 | 157.402711 |
| 80849 | go5tw | 0.998164 | -3.723322 | -0.255566 | 133.560497 |
| 97602 | k0GlW | -0.858348 | -0.53607 | 9.271542 | 134.899752 |
| 44667 | Rzefu | 0.710377 | 1.959778 | 1.006551 | 130.157433 |
| 40153 | C6q8N | 1.896837 | 0.149583 | 5.581972 | 121.745157 |


In [12]:
region_3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
id         100000 non-null object
f0         100000 non-null float64
f1         100000 non-null float64
f2         100000 non-null float64
product    100000 non-null float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [13]:
region_3.duplicated().sum()

0

In [14]:
region_3.isna().sum()

id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

### Conclusion

We have three datasets, one per region, each with 100,000 rows.

There are no duplicate rows and no missing values.

The data is clean. We only need to drop the `id` column, since it is not used in modeling.

In [15]:
region_1 = region_1.drop('id', axis=1)

In [16]:
region_2 = region_2.drop('id', axis=1)

In [17]:
region_3 = region_3.drop('id', axis=1)

## Model<a id='model'></a>

In [18]:
def split_data(data):
    features = data.drop('product', axis=1)
    target = data['product']

    features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.25, random_state=123456)

    return features_train, features_valid, target_train, target_valid

In [19]:
# def liner_regr(data):
#     features_train, features_valid, target_train, target_valid = split_data(data)
#     model = LinearRegression()
#     model.fit(features_train, target_train)

#     prediction = model.predict(features_valid)
#     prediction = pd.Series(prediction, index=target_valid.index)

#     mse = mean_squared_error(target_valid, prediction)
#     rmse = mse ** 0.5
#     print('RMSE=', rmse)

#     print('Answer: ', target_valid.mean())
#     print('Prediction: ', prediction.mean())
#     return prediction, target_valid

In [20]:
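def liner_regr(data):
    """Fit a linear regression for one region; return validation predictions and targets."""
    features_train, features_valid, target_train, target_valid = split_data(data)

    # Work on copies so the scaled columns are not written into slices of `data`.
    features_train = features_train.copy()
    features_valid = features_valid.copy()

    # Fit the scaler on the training set only, then apply it to both sets.
    numeric = ['f0', 'f1', 'f2']
    scaler = StandardScaler()
    scaler.fit(features_train[numeric])
    features_train[numeric] = scaler.transform(features_train[numeric])
    features_valid[numeric] = scaler.transform(features_valid[numeric])

    model = LinearRegression()
    model.fit(features_train, target_train)

    # Keep the validation index so predictions stay aligned with the true values.
    prediction = model.predict(features_valid)
    prediction = pd.Series(prediction, index=target_valid.index)

    mse = mean_squared_error(target_valid, prediction)
    rmse = mse ** 0.5
    print('RMSE=', rmse)

    print('Answer: ', target_valid.mean())
    print('Prediction: ', prediction.mean())
    return prediction, target_valid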


In [21]:
prediction_1, answer_1 = liner_regr(region_1)

RMSE= 37.80046993478272
Answer:  93.1898161393134
Prediction:  92.49286560032228


In [22]:
prediction_2, answer_2 = liner_regr(region_2)

RMSE= 0.890493320627005
Answer:  69.02747842574298
Prediction:  69.03055429593262


In [23]:
prediction_3, answer_3 = liner_regr(region_3)

RMSE= 39.98039278339241
Answer:  94.97813012557837
Prediction:  94.84835708988288


### Conclusion

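In this section we:
 - split the data into training and validation sets (75:25);
 - scaled the features and trained a linear regression model for each region;
 - made predictions on the validation set;
 - obtained the following results (reserves in thousand barrels):

| Region | RMSE | Mean predicted reserves | Mean actual reserves |
| ------ | ---- | ----------------------- | -------------------- |
| 1 | 37.8005 | 92.4929 | 93.1898 |
| 2 | 0.8905 | 69.0306 | 69.0275 |
| 3 | 39.9804 | 94.8484 | 94.9781 |

The model for region 2 is by far the most accurate (RMSE 0.89), although that region has the lowest mean reserves.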
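
For reference, the RMSE in the table is the square root of the mean squared difference between actual and predicted reserves, so it is expressed in the same unit as `product` (thousand barrels). A minimal sketch with made-up numbers (the values and the helper name `rmse` are illustrative, not taken from the datasets):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up reserves in thousand barrels, for illustration only.
actual = [100.0, 80.0, 120.0, 60.0]
predicted = [90.0, 85.0, 110.0, 70.0]

print(round(rmse(actual, predicted), 2))  # 9.01
```

Read this way, an RMSE of 38 to 40 in regions 1 and 3 means the typical prediction error is around 40% of the mean reserve, while the error in region 2 is under one thousand barrels.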


## Profit<a id='profit'></a>

In [24]:
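points_all = 500                      # wells sampled per bootstrap draw
points = 200                          # best wells selected for development
budget_total = 10_000_000_000         # total budget, rub
income_per_barrel = 450000            # revenue per unit of `product` (thousand barrels), rub
risk_max = 2.5/100                    # maximum acceptable probability of loss
budget_per_one = budget_total/points  # budget per developed well, rub
unit_of_barrel = 1000                 # barrels in one unit of `product`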

In [25]:
print(f"Break-even volume per well = {budget_total/points/income_per_barrel:.2f} thousand barrels")

Break-even volume per well = 111.11 thousand barrels
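
The break-even figure above follows from spreading the budget over the wells to be developed and dividing by the revenue per unit of `product`. The arithmetic, written out with the constants from the cell above and the mean predicted reserves from the model output (the variable names here are illustrative):

```python
budget_total = 10_000_000_000   # total development budget
wells_to_develop = 200          # `points` in the notebook
revenue_per_unit = 450_000      # revenue per unit of `product` (thousand barrels)

budget_per_well = budget_total / wells_to_develop
break_even_volume = budget_per_well / revenue_per_unit
print(f"Budget per well: {budget_per_well:,.0f}")
print(f"Break-even volume: {break_even_volume:.2f} thousand barrels")

# Mean predicted reserves per region, copied from the model output above.
mean_predicted = {1: 92.49, 2: 69.03, 3: 94.85}
for region, volume in mean_predicted.items():
    shortfall = break_even_volume - volume
    print(f"Region {region}: {volume:.2f} predicted, {shortfall:.2f} below break-even")
```

The average well in every region falls short of the threshold, which is why the analysis selects the 200 best-predicted wells instead of drilling at random.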


In [26]:
def profit(target, predict, points):
    """Profit (rub) from the `points` wells with the highest predicted reserves."""
    predict_sorted = predict.sort_values(ascending=False)
    best = target[predict_sorted.index][:points]

    # `product` is measured in thousand barrels and income_per_barrel is revenue
    # per thousand barrels, so this expression is already in rub.
    return best.sum() * income_per_barrel - budget_total

In [27]:
def riskes(sample):
    """Share of bootstrap samples with negative profit, in percent."""
    risk_percent = (sample < 0).mean()
    return risk_percent * 100

In [28]:
def bootstrap(target, predict, n_sampled=points_all, n_best=points):
    state = np.random.RandomState(12345)
    values = []

    for _ in range(1000):
        target_subsample = target.sample(
            n=n_sampled,
            replace=True,
            random_state=state)
        predict_subsample = predict[target_subsample.index]

        values.append(profit(target_subsample, predict_subsample, n_best))

    values = pd.Series(values)
    risk = riskes(values)
    if risk > (risk_max * 100):
        print('Risks > 2.5%')
        return

    print('Risks are {:.2f}%'.format(risk))
    print()

    mean_values = values.mean()
    lower = values.quantile(q=0.025)
    upper = values.quantile(q=0.975)
    print("Avg profit: {:.2f} million rub".format(mean_values / 10**6))
    print()
    print("95% confidence interval: ({:.2f}, {:.2f}) million rub".format(lower / 10**6, upper / 10**6))

In [29]:
bootstrap(answer_1, prediction_1)

Risks > 2.5%


In [30]:
bootstrap(answer_2, prediction_2)

Risks are 0.70%

Avg profit: 523.54 million rub

95% confidence interval: (98.79, 963.56) million rub


In [31]:
bootstrap(answer_3, prediction_3)

Risks > 2.5%


### Conclusion
In the first and third regions the risk of loss exceeds 2.5%, so they are rejected. The second region is the best choice: its risk of loss is 0.70%, below the limit.
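
The bootstrap procedure used in this section can be sketched without pandas: draw 500 wells with replacement, keep the 200 with the highest predictions, compute the profit from their actual reserves, and repeat 1000 times. The version below is only a sketch on synthetic data (the data generator, seed, and helper names are assumptions for illustration, not the notebook's datasets). It selects wells by position, which keeps each drawn well paired with its own prediction even when the same well is drawn more than once.

```python
import random

BUDGET = 10_000_000_000      # total budget
REVENUE_PER_UNIT = 450_000   # revenue per thousand barrels
N_SAMPLED, N_SELECTED, N_ROUNDS = 500, 200, 1000

def profit(actual, predicted, n_selected):
    """Profit from developing the n_selected wells with the highest predictions."""
    # Rank positions by prediction, then sum the matching actual reserves.
    order = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    volume = sum(actual[i] for i in order[:n_selected])
    return volume * REVENUE_PER_UNIT - BUDGET

def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list."""
    pos = q * (len(sorted_values) - 1)
    low = int(pos)
    high = min(low + 1, len(sorted_values) - 1)
    return sorted_values[low] + (sorted_values[high] - sorted_values[low]) * (pos - low)

rng = random.Random(12345)

# Synthetic validation set: actual reserves and noisy predictions of them.
actual = [max(0.0, rng.gauss(90, 45)) for _ in range(25_000)]
predicted = [a + rng.gauss(0, 20) for a in actual]

profits = []
for _ in range(N_ROUNDS):
    # Sample positions with replacement so each draw keeps its own prediction.
    idx = rng.choices(range(len(actual)), k=N_SAMPLED)
    profits.append(profit([actual[i] for i in idx], [predicted[i] for i in idx], N_SELECTED))

profits.sort()
mean_profit = sum(profits) / len(profits)
risk = 100 * sum(p < 0 for p in profits) / len(profits)
lower, upper = quantile(profits, 0.025), quantile(profits, 0.975)

print(f"Mean profit: {mean_profit / 1e6:.2f} million")
print(f"95% interval: ({lower / 1e6:.2f}, {upper / 1e6:.2f}) million")
print(f"Risk of loss: {risk:.2f}%")
```

The interval here is taken from the 2.5% and 97.5% quantiles of the bootstrap profits, and the risk is the share of rounds that end with a negative profit, matching the quantities reported for each region.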

## Conclusion<a id='conclusion'></a>

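1. The break-even volume is 111.11 thousand barrels per well. The mean predicted reserves in all three regions (69 to 95 thousand barrels) are below this, so wells must be selected by predicted volume.
2. In the first and third regions the risk of loss exceeds the 2.5% limit. The second region is the best choice, with a risk of loss of 0.70%.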
