## Quick Modeling Summary

In this analysis, I conducted a rapid modeling process using a Random Forest Regressor. Here's a brief overview of the steps involved:

1. **Data Preparation**:
    - Handled missing values in the numerical features with median imputation.
    - Assumed there are no missing values in the categorical features.
    - Applied one-hot encoding to the categorical features, with a minimum frequency of 20 so that categories seen fewer than 20 times are grouped into a single infrequent category.

2. **Model Training**:
    - Utilized scikit-learn's `RandomForestRegressor` as the modeling algorithm.
    - Employed grid search cross-validation (`GridSearchCV`) to systematically search through a range of hyperparameters and find the best combination for the model.

3. **Evaluation**:
    - Assessed model performance with R-squared, the default score for scikit-learn regressors: about 0.64 averaged over the cross-validation folds and about 0.68 on the held-out test set of 200 rows.
    - Validated the model with five-fold cross-validation during the grid search.

4. **Conclusion**:
    - The goal was to quickly build and evaluate a baseline predictive model on the available data.
    - The approach relied on common machine learning techniques (data preprocessing, algorithm selection, and hyperparameter tuning) to build an effective, straightforward predictive model.

This quick modeling approach provided a foundational understanding of the data and allowed for initial insights or predictions to be derived efficiently. Further iterations or refinements of the model can be (and will be) pursued based on additional data or specific requirements of the problem.


# Next step

**Conformal Prediction for Prediction Intervals**:
- Construct a pipeline that includes the data preprocessing and model training steps.
- Apply conformal prediction to the model pipeline to obtain prediction intervals for future inferences.
- Conformal prediction provides a principled way to quantify the uncertainty of predictions, yielding prediction intervals rather than point predictions alone.
- All of this can be found in the [conformal_prediction notebook](conformal_prediction.ipynb).
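
To make the idea concrete, here is a minimal sketch of split conformal prediction built only on scikit-learn and NumPy. It is an illustration under stated assumptions, not necessarily the method used in the linked notebook: the data is synthetic, and the three-way split, the 90% target coverage, and all variable names are hypothetical choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical; not the dataset used in this notebook)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = 3.0 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(scale=0.5, size=1000)

# Three disjoint parts: fit the model, calibrate the intervals, check coverage
X_fit, X_rest, y_fit, y_rest = train_test_split(X_demo, y_demo, test_size=0.5, random_state=42)
X_cal, X_new, y_cal, y_new = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_fit, y_fit)

# Nonconformity scores: absolute residuals on the calibration set
scores = np.abs(y_cal - model.predict(X_cal))

# Conformal quantile with the finite-sample correction, for 90% target coverage
alpha = 0.1
n_cal = len(scores)
level = min(1.0, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal)
q_hat = np.quantile(scores, level, method="higher")

# Symmetric prediction intervals around the point predictions
pred = model.predict(X_new)
lower, upper = pred - q_hat, pred + q_hat
coverage = np.mean((y_new >= lower) & (y_new <= upper))
print(f"interval half-width: {q_hat:.2f}, empirical coverage: {coverage:.2f}")
```

The interval width here is constant for every prediction; adaptive variants exist but are outside the scope of this sketch.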

In [3]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

In [4]:
df = pd.read_csv("../data/clean/data.csv")

### Good to know... presence of multicollinearity

There is multicollinearity between "footage_lateral_length" and "md_ft". For simplicity, footage_lateral_length is removed.
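
One quick way to spot this kind of redundancy is a pairwise correlation matrix. The sketch below uses synthetic values, and it assumes, for the toy data only, that measured depth is roughly true vertical depth plus lateral length; the real relationship in the dataset may differ.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (hypothetical values, not the notebook's data)
rng = np.random.default_rng(0)
lateral = rng.uniform(4000, 12000, size=500)
tvd = rng.uniform(5500, 8000, size=500)
toy = pd.DataFrame({
    "footage_lateral_length": lateral,
    "tvd_ft": tvd,
    # Toy assumption: measured depth ~ vertical depth + lateral length + noise
    "md_ft": lateral + tvd + rng.normal(scale=200, size=500),
})

# Pairwise Pearson correlation; values close to 1 flag redundant features
corr = toy.corr()
print(corr.round(2))

# Keep md_ft and drop the redundant column, mirroring the choice made here
toy_reduced = toy.drop(columns=["footage_lateral_length"])
```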

In [5]:
NUM_FEATURES = ["md_ft", "proppant_volume", "total_number_of_stages", "azimuth", "isip", "porosity", "proppant_fluid_ratio", "pump_rate", "tvd_ft"]

CAT_FEATURES = ["treatment_company", "operator"]

In [6]:
median_imputer = SimpleImputer(
    strategy="median"
)

In [7]:
df_num_features = pd.DataFrame(median_imputer.fit_transform(df[NUM_FEATURES]), columns=NUM_FEATURES)

In [8]:
one_hot_encoder = OneHotEncoder(
    sparse_output=False,
    handle_unknown="infrequent_if_exist",
    min_frequency=20
)

In [9]:
one_hot_encoded = one_hot_encoder.fit_transform(df[CAT_FEATURES])

# Extract column names for one-hot encoded features
column_names = one_hot_encoder.get_feature_names_out(CAT_FEATURES)

# Create DataFrame with one-hot encoded features and column names
df_one_hot_encoded = pd.DataFrame(one_hot_encoded, columns=column_names)

In [11]:
df_production = df["production"]

## Removing categorical features affects overall model performance

In [12]:
df_train = pd.concat([df_num_features, df_one_hot_encoded, df_production], axis=1)
# df_train = pd.concat([df_num_features, df_production], axis=1)

In [13]:
df_train.head()

| | md_ft | proppant_volume | total_number_of_stages | azimuth | isip | porosity | proppant_fluid_ratio | pump_rate | tvd_ft | treatment_company_treatment_company_1 | ... | operator_operator_25 | operator_operator_26 | operator_operator_4 | operator_operator_5 | operator_operator_6 | operator_operator_7 | operator_operator_8 | operator_operator_9 | operator_infrequent_sklearn | production |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19148.0 | 21568792.0 | 56.0 | -32.279999 | 4149.0 | 0.02 | 1.23 | 83.0 | 6443.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5614.947951 |
| 1 | 15150.0 | 9841307.0 | 33.0 | -19.799999 | 5776.0 | 0.17 | 1.47 | 102.0 | 7602.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2188.836707 |
| 2 | 14950.0 | 17116240.0 | 62.0 | -26.879999 | 4628.0 | 0.02 | 1.67 | 88.0 | 5907.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1450.033022 |
| 3 | 11098.0 | 3749559.0 | 11.0 | -49.099998 | 4582.0 | 0.03 | 0.77 | 100.0 | 6538.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1060.764407 |
| 4 | 10549.0 | 6690705.0 | 9.0 | 5.56 | 4909.0 | 0.02 | 1.32 | 94.0 | 7024.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 607.530385 |


In [14]:
X = df_train.drop(["production"], axis=1)
y = df["production"]

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=200, random_state=42)

In [24]:
rf = RandomForestRegressor(n_jobs=-1, random_state=42)

In [25]:
hyperparameters = {
    "n_estimators": [50, 100, 200, 500],
    "criterion": ["squared_error", "absolute_error", "friedman_mse"],
    "max_depth": [None, 10, 20, 30],
    "max_features": ["sqrt", "log2"],
    "max_samples": [None, 0.25, 0.50, 0.75]
}

In [26]:
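rf_cv = GridSearchCV(
    estimator=rf,
    param_grid=hyperparameters,
    cv=5,          # explicit; same as the default five-fold cross-validation
    scoring="r2",  # explicit; same as the default score of a regressor
    n_jobs=-1
)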
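
Before launching the search, it helps to know how many fits it implies. The sketch below counts the combinations in the grid defined above with `ParameterGrid` and multiplies by the five cross-validation folds; the variable names `n_candidates` and `n_folds` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

hyperparameters = {
    "n_estimators": [50, 100, 200, 500],
    "criterion": ["squared_error", "absolute_error", "friedman_mse"],
    "max_depth": [None, 10, 20, 30],
    "max_features": ["sqrt", "log2"],
    "max_samples": [None, 0.25, 0.50, 0.75],
}

# Every combination of the listed values: 4 * 3 * 4 * 2 * 4
n_candidates = len(ParameterGrid(hyperparameters))
n_folds = 5  # five-fold cross-validation

print(n_candidates, n_candidates * n_folds)  # 384 1920
```

That is 1,920 forest fits plus one final refit on the full training set. The scikit-learn documentation notes that the `"absolute_error"` criterion trains significantly slower than `"squared_error"`, so a third of these fits will dominate the runtime.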

In [28]:
rf_cv.fit(X_train, y_train)

In [29]:
pd.DataFrame(rf_cv.cv_results_).to_csv("../images/model_performance/grid_search_cv.csv", index=False)

In [30]:
rf_cv.best_params_

{'criterion': 'absolute_error',
 'max_depth': 20,
 'max_features': 'sqrt',
 'max_samples': None,
 'n_estimators': 200}

In [31]:
rf_cv.best_score_

0.6370642775262598

# Features with importance greater than 0

In [32]:
winner = rf_cv.best_estimator_

In [33]:
winner.score(X_test, y_test)

0.6811481640504892
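
Both `best_score_` and `score` report R-squared for a regressor, so the two numbers above are directly comparable. As a reminder of what that metric measures, here is a small sketch that computes it by hand and compares it with `r2_score`; the values in `y_true` and `y_pred` are hypothetical, and RMSE is added only as a complementary error in the units of the target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted values, for illustration only
y_true = np.array([5600.0, 2200.0, 1450.0, 1060.0, 610.0])
y_pred = np.array([4900.0, 2500.0, 1700.0, 900.0, 800.0])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"R^2: {r2:.3f}, RMSE: {rmse:.1f}")
```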

In [34]:
feature_importances = winner.feature_importances_.round(2)

In [35]:
features = X_train.columns

In [36]:
df_features = pd.DataFrame(zip(features, feature_importances), columns=["feature", "Importance"]).sort_values(by="Importance", ascending=False)
df_features[df_features["Importance"]>0]

| | feature | Importance |
|---|---|---|
| 0 | md_ft | 0.13 |
| 2 | total_number_of_stages | 0.12 |
| 1 | proppant_volume | 0.12 |
| 8 | tvd_ft | 0.1 |
| 4 | isip | 0.07 |
| 6 | proppant_fluid_ratio | 0.07 |
| 3 | azimuth | 0.06 |
| 7 | pump_rate | 0.06 |
| 24 | operator_operator_14 | 0.05 |
| 5 | porosity | 0.04 |
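
The listed importances sum to about 0.82, while a random forest's importances sum to 1, so the remainder is spread over one-hot columns that each round to 0.00. One way to see the total contribution of a categorical feature is to sum the importances of its encoded columns. The sketch below does this on synthetic data; the toy columns, the `offsets` mapping, and the prefix rule used to map encoded columns back to `operator` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in (hypothetical): one numeric and one categorical feature
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "md_ft": rng.uniform(8000, 20000, size=400),
    "operator": rng.choice([f"operator_{i}" for i in range(1, 9)], size=400),
})
offsets = {f"operator_{i}": 100.0 * i for i in range(1, 9)}
target = 0.1 * toy["md_ft"] + toy["operator"].map(offsets) + rng.normal(scale=50, size=400)

# One-hot encode, giving columns such as "operator_operator_1"
X_toy = pd.get_dummies(toy, columns=["operator"], dtype=float)
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_toy, target)

importances = pd.Series(model.feature_importances_, index=X_toy.columns)

# Map each encoded column back to its source feature, then sum per feature
source = importances.index.to_series().map(
    lambda name: "operator" if name.startswith("operator_") else name
)
grouped = importances.groupby(source).sum().sort_values(ascending=False)
print(grouped.round(2))
```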


# Model and pipeline configuration

### One hot encoding:
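- `min_frequency=20`: categories that appear fewer than 20 times in the training data are grouped into a single infrequent category.

- `handle_unknown='infrequent_if_exist'`: categories not seen during fitting are mapped to that infrequent category at prediction time if it exists; otherwise they are encoded as all zeros.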

In [38]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [39]:
# Define the preprocessing steps for numerical and categorical features
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),  # Impute missing values with median
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=20))  # One-hot encode categorical features
])

# Combine preprocessing steps for numerical and categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, NUM_FEATURES),
        ('cat', categorical_transformer, CAT_FEATURES)
    ])

# Define the pipeline, reusing the best hyperparameters found by the grid search
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(
        n_estimators=200,
        criterion="absolute_error",
        max_depth=20,
        max_features="sqrt",
        max_samples=None,
        random_state=42,
    ))
])

In [40]:
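# X_train is already imputed and one-hot encoded, so it no longer contains CAT_FEATURES.
# Fit the pipeline on the raw columns of the same training rows instead.
pipeline.fit(df.loc[X_train.index, NUM_FEATURES + CAT_FEATURES], y_train)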

# Save model
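
A common way to persist a fitted scikit-learn estimator or pipeline is `joblib`. The sketch below is one possible approach, not a record of what was done here: the small stand-in model, the file name `random_forest.joblib`, and the temporary directory are all hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small stand-in model trained on synthetic data (hypothetical)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=100)
model = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# Hypothetical location; a temporary directory keeps the example self-contained
model_path = Path(tempfile.mkdtemp()) / "random_forest.joblib"
joblib.dump(model, model_path)

# Reload and confirm the restored model predicts identically
restored = joblib.load(model_path)
same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Saving the whole fitted pipeline, rather than only the forest, keeps the imputer and encoder together with the model so that raw rows can be scored directly after loading.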