LA_Team/Facies_classification_LA_TEAM_05_VALIDATION.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Facies classification using Machine Learning #\n",
    "## LA Team Submission 5 ## \n",
    "### _[Lukas Mosser](https://at.linkedin.com/in/lukas-mosser-9948b32b/en), [Alfredo De la Fuente](https://pe.linkedin.com/in/alfredodelafuenteb)_ ####"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this approach for solving the facies classfication problem ( https://github.com/seg/2016-ml-contest. ) we will explore the following statregies:\n",
    "- Features Exploration: based on [Paolo Bestagini's work](https://github.com/seg/2016-ml-contest/blob/master/ispl/facies_classification_try02.ipynb), we will consider imputation, normalization and augmentation routines for the initial features.\n",
    "- Model tuning: "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Libraries\n",
    "\n",
    "We will need to install the following libraries and packages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# %%sh\n",
    "# pip install pandas\n",
    "# pip install scikit-learn\n",
    "# pip install tpot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from __future__ import print_function\n",
    "import numpy as np\n",
    "%matplotlib inline\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "from sklearn.model_selection import cross_val_score\n",
    "from sklearn.model_selection import KFold , StratifiedKFold\n",
    "from classification_utilities import display_cm, display_adj_cm\n",
    "from sklearn.metrics import confusion_matrix, f1_score\n",
    "from sklearn import preprocessing\n",
    "from sklearn.model_selection import LeavePGroupsOut\n",
    "from sklearn.multiclass import OneVsOneClassifier\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from scipy.signal import medfilt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Preprocessing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#Load Data\n",
    "data = pd.read_csv('../facies_vectors.csv')\n",
    "\n",
    "# Parameters\n",
    "feature_names = ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE', 'NM_M', 'RELPOS']\n",
    "facies_names = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS', 'WS', 'D', 'PS', 'BS']\n",
    "facies_colors = ['#F4D03F', '#F5B041','#DC7633','#6E2C00', '#1B4F72','#2E86C1', '#AED6F1', '#A569BD', '#196F3D']\n",
    "\n",
    "# Store features and labels\n",
    "X = data[feature_names].values \n",
    "y = data['Facies'].values \n",
    "\n",
    "# Store well labels and depths\n",
    "well = data['Well Name'].values\n",
    "depth = data['Depth'].values\n",
    "\n",
    "# Fill 'PE' missing values with mean\n",
    "imp = preprocessing.Imputer(missing_values='NaN', strategy='mean', axis=0)\n",
    "imp.fit(X)\n",
    "X = imp.transform(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We procceed to run [Paolo Bestagini's routine](https://github.com/seg/2016-ml-contest/blob/master/ispl/facies_classification_try02.ipynb) to include a small window of values to acount for the spatial component in the log analysis, as well as the gradient information with respect to depth. This will be our prepared training dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Feature windows concatenation function\n",
    "def augment_features_window(X, N_neig):\n",
    "    \n",
    "    # Parameters\n",
    "    N_row = X.shape[0]\n",
    "    N_feat = X.shape[1]\n",
    "\n",
    "    # Zero padding\n",
    "    X = np.vstack((np.zeros((N_neig, N_feat)), X, (np.zeros((N_neig, N_feat)))))\n",
    "\n",
    "    # Loop over windows\n",
    "    X_aug = np.zeros((N_row, N_feat*(2*N_neig+1)))\n",
    "    for r in np.arange(N_row)+N_neig:\n",
    "        this_row = []\n",
    "        for c in np.arange(-N_neig,N_neig+1):\n",
    "            this_row = np.hstack((this_row, X[r+c]))\n",
    "        X_aug[r-N_neig] = this_row\n",
    "\n",
    "    return X_aug\n",
    "\n",
    "\n",
    "# Feature gradient computation function\n",
    "def augment_features_gradient(X, depth):\n",
    "    \n",
    "    # Compute features gradient\n",
    "    d_diff = np.diff(depth).reshape((-1, 1))\n",
    "    d_diff[d_diff==0] = 0.001\n",
    "    X_diff = np.diff(X, axis=0)\n",
    "    X_grad = X_diff / d_diff\n",
    "        \n",
    "    # Compensate for last missing value\n",
    "    X_grad = np.concatenate((X_grad, np.zeros((1, X_grad.shape[1]))))\n",
    "    \n",
    "    return X_grad\n",
    "\n",
    "\n",
    "# Feature augmentation function\n",
    "def augment_features(X, well, depth, N_neig=1):\n",
    "    \n",
    "    # Augment features\n",
    "    X_aug = np.zeros((X.shape[0], X.shape[1]*(N_neig*2+2)))\n",
    "    for w in np.unique(well):\n",
    "        w_idx = np.where(well == w)[0]\n",
    "        X_aug_win = augment_features_window(X[w_idx, :], N_neig)\n",
    "        X_aug_grad = augment_features_gradient(X[w_idx, :], depth[w_idx])\n",
    "        X_aug[w_idx, :] = np.concatenate((X_aug_win, X_aug_grad), axis=1)\n",
    "    \n",
    "    # Find padded rows\n",
    "    padded_rows = np.unique(np.where(X_aug[:, 0:7] == np.zeros((1, 7)))[0])\n",
    "    \n",
    "    return X_aug, padded_rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X_aug, padded_rows = augment_features(X, well, depth)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# # Initialize model selection methods\n",
    "# lpgo = LeavePGroupsOut(2)\n",
    "\n",
    "# # Generate splits\n",
    "# split_list = []\n",
    "# for train, val in lpgo.split(X, y, groups=data['Well Name']):\n",
    "#     hist_tr = np.histogram(y[train], bins=np.arange(len(facies_names)+1)+.5)\n",
    "#     hist_val = np.histogram(y[val], bins=np.arange(len(facies_names)+1)+.5)\n",
    "#     if np.all(hist_tr[0] != 0) & np.all(hist_val[0] != 0):\n",
    "#         split_list.append({'train':train, 'val':val})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def preprocess():\n",
    "    \n",
    "    # Preprocess data to use in model\n",
    "    X_train_aux = []\n",
    "    X_test_aux = []\n",
    "    y_train_aux = []\n",
    "    y_test_aux = []\n",
    "    \n",
    "    # For each data split\n",
    "    split = split_list[5]\n",
    "        \n",
    "    # Remove padded rows\n",
    "    split_train_no_pad = np.setdiff1d(split['train'], padded_rows)\n",
    "\n",
    "    # Select training and validation data from current split\n",
    "    X_tr = X_aug[split_train_no_pad, :]\n",
    "    X_v = X_aug[split['val'], :]\n",
    "    y_tr = y[split_train_no_pad]\n",
    "    y_v = y[split['val']]\n",
    "\n",
    "    # Select well labels for validation data\n",
    "    well_v = well[split['val']]\n",
    "\n",
    "    # Feature normalization\n",
    "    scaler = preprocessing.RobustScaler(quantile_range=(25.0, 75.0)).fit(X_tr)\n",
    "    X_tr = scaler.transform(X_tr)\n",
    "    X_v = scaler.transform(X_v)\n",
    "        \n",
    "    X_train_aux.append( X_tr )\n",
    "    X_test_aux.append( X_v )\n",
    "    y_train_aux.append( y_tr )\n",
    "    y_test_aux.append (  y_v )\n",
    "    \n",
    "    X_train = np.concatenate( X_train_aux )\n",
    "    X_test = np.concatenate ( X_test_aux )\n",
    "    y_train = np.concatenate ( y_train_aux )\n",
    "    y_test = np.concatenate ( y_test_aux )\n",
    "    \n",
    "    return X_train , X_test , y_train , y_test "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Analysis\n",
    "\n",
    "In this section we will run a Cross Validation routine "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# from tpot import TPOTClassifier\n",
    "# from sklearn.model_selection import train_test_split\n",
    "# X_train, X_test, y_train, y_test = preprocess()\n",
    "\n",
    "# tpot = TPOTClassifier(generations=5, population_size=20, \n",
    "#                       verbosity=2,max_eval_time_mins=20,\n",
    "#                       max_time_mins=100,scoring='f1_micro',\n",
    "#                       random_state = 17)\n",
    "# tpot.fit(X_train, y_train)\n",
    "# print(tpot.score(X_test, y_test))\n",
    "# tpot.export('FinalPipeline.py')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.ensemble import  RandomForestClassifier, VotingClassifier\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.naive_bayes import BernoulliNB\n",
    "from sklearn.pipeline import make_pipeline, make_union\n",
    "from sklearn.preprocessing import FunctionTransformer\n",
    "import xgboost as xgb\n",
    "from xgboost.sklearn import  XGBClassifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Train and test a classifier\n",
    "\n",
    "# Pass in the classifier so we can iterate over many seed later.\n",
    "def train_and_test(X_tr, y_tr, X_v, well_v, clf):\n",
    "    \n",
    "    # Feature normalization\n",
    "    scaler = preprocessing.RobustScaler(quantile_range=(25.0, 75.0)).fit(X_tr)\n",
    "    X_tr = scaler.transform(X_tr)\n",
    "    X_v = scaler.transform(X_v)\n",
    "    \n",
    "    clf.fit(X_tr, y_tr)\n",
    "    \n",
    "    # Test classifier\n",
    "    y_v_hat = clf.predict(X_v)\n",
    "    \n",
    "    # Clean isolated facies for each well\n",
    "    for w in np.unique(well_v):\n",
    "        y_v_hat[well_v==w] = medfilt(y_v_hat[well_v==w], kernel_size=5)\n",
    "    \n",
    "    return y_v_hat"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prediction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "....................................................................................................\n",
      "||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||"
     ]
    }
   ],
   "source": [
    "#Load testing data\n",
    "test_data = pd.read_csv('../validation_data_nofacies.csv')\n",
    "\n",
    "    # Train classifier\n",
    "    #clf = make_pipeline(make_union(VotingClassifier([(\"est\", ExtraTreesClassifier(criterion=\"gini\", max_features=1.0, n_estimators=500))]), FunctionTransformer(lambda X: X)), XGBClassifier(learning_rate=0.73, max_depth=10, min_child_weight=10, n_estimators=500, subsample=0.27))\n",
    "    #clf =  make_pipeline( KNeighborsClassifier(n_neighbors=5, weights=\"distance\") ) \n",
    "    #clf = make_pipeline(MaxAbsScaler(),make_union(VotingClassifier([(\"est\", RandomForestClassifier(n_estimators=500))]), FunctionTransformer(lambda X: X)),ExtraTreesClassifier(criterion=\"entropy\", max_features=0.0001, n_estimators=500))\n",
    "    # * clf = make_pipeline( make_union(VotingClassifier([(\"est\", BernoulliNB(alpha=60.0, binarize=0.26, fit_prior=True))]), FunctionTransformer(lambda X: X)),RandomForestClassifier(n_estimators=500))\n",
    "\n",
    "# # Prepare training data\n",
    "# X_tr = X\n",
    "# y_tr = y\n",
    "\n",
    "# # Augment features\n",
    "# X_tr, padded_rows = augment_features(X_tr, well, depth)\n",
    "\n",
    "# # Removed padded rows\n",
    "# X_tr = np.delete(X_tr, padded_rows, axis=0)\n",
    "# y_tr = np.delete(y_tr, padded_rows, axis=0) \n",
    "\n",
    "# Prepare test data\n",
    "well_ts = test_data['Well Name'].values\n",
    "depth_ts = test_data['Depth'].values\n",
    "X_ts = test_data[feature_names].values\n",
    "\n",
    "\n",
    "    \n",
    "y_pred = []\n",
    "print('.' * 100)\n",
    "for seed in range(100):\n",
    "    np.random.seed(seed)\n",
    "\n",
    "    # Make training data.\n",
    "    X_train, padded_rows = augment_features(X, well, depth)\n",
    "    y_train = y\n",
    "    X_train = np.delete(X_train, padded_rows, axis=0)\n",
    "    y_train = np.delete(y_train, padded_rows, axis=0) \n",
    "\n",
    "    # Train classifier  \n",
    "    clf = make_pipeline(XGBClassifier(learning_rate=0.12,\n",
    "                                      max_depth=3,\n",
    "                                      min_child_weight=10,\n",
    "                                      n_estimators=150,\n",
    "                                      seed=seed,\n",
    "                                      colsample_bytree=0.9))\n",
    "\n",
    "    # Make blind data.\n",
    "    X_test, _ = augment_features(X_ts, well_ts, depth_ts)\n",
    "\n",
    "    # Train and test.\n",
    "    y_ts_hat = train_and_test(X_train, y_train, X_test, well_ts, clf)\n",
    "    \n",
    "    # Collect result.\n",
    "    y_pred.append(y_ts_hat)\n",
    "    print('|', end='')\n",
    "    \n",
    "np.save('LA_Team_100_realizations.npy', y_pred)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# # Augment features\n",
    "# X_ts, padded_rows = augment_features(X_ts, well_ts, depth_ts)\n",
    "\n",
    "# # Predict test labels\n",
    "# y_ts_hat = train_and_test(X_tr, y_tr, X_ts, well_ts)\n",
    "\n",
    "# # Save predicted labels\n",
    "# test_data['Facies'] = y_ts_hat\n",
    "# test_data.to_csv('Prediction_XX_Final.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:python3]",
   "language": "python",
   "name": "conda-env-python3-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}