# Case: Clustering of reservoir rocks


#### Data Source:

Measurements from different logging techniques, such as logging while drilling and wireline logging, together with analyzed core samples.
![](https://www.bhge.com/sites/default/files/styles/stand_alone_img/public/2018-05/open-hole-wireline-petrophysics_1.png?itok=qV6l0DiB)

source: https://www.bhge.com/upstream/evaluation/wireline-logging/open-hole-wireline-petrophysics


#### Business case:

Combine data from different data sources into a clustering pipeline that reduces expert man-hours and manual calculations.

![](https://www.spec2000.net/text107fp/ecs_shgas_ans.png)

source: https://www.spec2000.net/17-totalcarbon.htm


### Training purpose: 
Compare different dimensionality reduction and clustering techniques


In [6]:
%pylab inline 
%load_ext autoreload
%autoreload 2

# external
import pandas as pd
import numpy as np
import seaborn as sns
import warnings
from pandas.plotting import register_matplotlib_converters

register_matplotlib_converters()
sns.set_style('whitegrid')
warnings.filterwarnings("ignore")
pd.set_option('display.float_format', lambda x: "%.5f"%x)

Populating the interactive namespace from numpy and matplotlib
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [7]:
data = pd.read_csv("../data/WellFormationClustering.csv")

In [8]:
data.head()

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X7,X8,X9,...,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21
0,4.38107,2279.82903,2366.16873,0.26976,2710.87317,2535.00708,2.44354,-4.06095,-0.94038,5798.87399,...,488.02468,-0.0958,-1.37634,-0.1401,775.02116,1.06325,1202.2362,-1.4632,0.81351,1.82343
1,-0.10485,2096.71231,3206.733,-1.27218,2148.02911,4487.8676,-2.63848,-1.58033,-1.06769,6139.80585,...,470.64686,-1.89673,2.64952,-3.24908,1936.45757,2.21207,5787.93825,1.10841,-1.33759,-1.99234
2,-0.64085,1363.16381,6944.33178,-1.37275,5040.47926,2980.02518,-1.17807,-1.97274,3.63486,2320.59874,...,401.01363,0.8128,0.22944,1.41945,3423.60024,3.56075,4477.27387,-0.0125,1.75208,2.28539
3,3.33353,1776.0582,2057.46104,0.53708,1883.58471,2476.7296,2.19906,-2.5083,-1.25397,5801.33026,...,476.75225,-0.07169,-1.24918,-0.11001,1512.38326,-0.01873,2869.85994,-0.38814,0.38624,1.08131
4,1.35529,1722.42357,5181.18323,-0.72539,4163.59916,2784.10304,0.28054,-2.82286,1.88573,3649.63105,...,434.72497,0.48809,-0.44102,0.86246,2384.83896,2.5995,3142.94454,-0.97524,1.42534,2.16684


In [9]:
data.describe()

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X7,X8,X9,...,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,...,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,2.19631,1482.61975,2545.1139,0.37313,1865.94103,2579.31328,1.49126,-1.62837,-0.80196,5384.6976,...,461.51746,-0.01372,-0.86735,-0.01503,2134.19586,-0.04418,4009.04133,-0.0706,0.2799,0.74966
std,1.6826,482.58018,1059.14458,1.03432,843.44373,1176.40905,2.5657,0.9389,1.10709,944.44627,...,18.55556,1.22555,2.17294,2.10626,597.57502,1.57469,1550.11423,1.00353,1.16045,1.88265
min,-4.44673,234.0,351.0,-2.8168,100.0,428.0,-4.57496,-4.93772,-4.90043,66.0,...,349.0,-4.51327,-4.91582,-7.75132,54.0,-3.34118,11.0,-2.98455,-4.05853,-5.80435
25%,0.37789,1120.59196,1913.337,-0.65169,1367.44244,1652.61941,-1.09116,-2.26546,-1.30783,4922.52964,...,450.95786,-0.70905,-2.57371,-1.21352,1733.94429,-1.32184,2790.31317,-0.75671,-0.45562,-0.55272
50%,2.55466,1478.24314,2213.65134,0.69388,1703.15826,2325.92847,2.30529,-1.49216,-0.92745,5346.49047,...,461.99683,0.33382,-1.43535,0.58481,2144.20235,-0.40898,3749.44256,-0.0811,0.62679,1.35152
75%,3.59303,1821.82753,2823.69034,1.24543,2064.63831,3469.08943,3.55947,-0.9663,-0.54109,5824.25497,...,473.08557,0.87887,0.96144,1.51764,2530.77321,1.2937,5533.16699,0.60438,1.11712,2.1583
max,5.97455,3000.0,9901.0,2.29432,6552.0,6127.0,6.23405,0.94642,6.50952,9446.0,...,534.0,2.1697,5.29099,3.74536,5049.0,5.62946,7128.0,2.95817,2.7476,4.5974


---
# Let's practice!

### Step 1. Normalize data

In [10]:
from sklearn.preprocessing import StandardScaler

# scaler = StandardScaler()

# call the fit_transform method on the dataset
# we don't need the original dataset, so you can replace it with the scaled values
# data = 
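
As an illustration of what standardization does, here is a minimal sketch on a small synthetic table. The frame `demo` and its three columns are hypothetical stand-ins that only mimic the mixed scales visible in `data.describe()`; they are not the well data itself.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the well-log table: columns on very different scales.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "X0": rng.normal(2.0, 1.7, size=200),
    "X1": rng.normal(1500.0, 480.0, size=200),
    "X2": rng.normal(2500.0, 1000.0, size=200),
})

scaler = StandardScaler()
# fit_transform returns a NumPy array, so wrap it to keep the column names.
demo_scaled = pd.DataFrame(scaler.fit_transform(demo), columns=demo.columns)

# After scaling, each column has mean ~0 and (population) standard deviation ~1.
print(demo_scaled.mean().round(3))
print(demo_scaled.std(ddof=0).round(3))
```

Without this step, distance-based methods such as KMeans would be dominated by the columns with the largest numeric ranges.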

### Step 2. Dimensionality reduction

In [11]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# In this step, let's try two different approaches:
# PCA will be our linear reducer and t-SNE will be our candidate
# for non-linear dimensionality reduction

# pca = PCA(n_components=2)
# tsne = TSNE(n_components=2)

# let's save results into two different variables
# data_pca = 
# data_tsne = 
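
A minimal sketch of both reducers, run on synthetic blobs rather than on the well data. The array `X`, the sample count, the three groups, and the `random_state` and `perplexity` values are all assumptions chosen only to keep the example small and repeatable.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical stand-in data: 150 samples, 10 features, 3 groups.
X, _ = make_blobs(n_samples=150, n_features=10, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)

# Linear projection onto the two directions of largest variance.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Non-linear embedding; perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_tsne = tsne.fit_transform(X)

print(X_pca.shape, X_tsne.shape)
# Share of the total variance kept by each principal component.
print(pca.explained_variance_ratio_)
```

Note that `TSNE` only offers `fit_transform`: unlike `PCA`, it cannot project new samples onto an embedding that has already been fitted.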

### Step 3. Visualize results

In [12]:
# It's always useful to look at the visualizations of your results
# Let's draw two scatter plots: one for the PCA output and one for the t-SNE output

# pylab.figure(figsize=(12, 8))
# pylab.scatter()

# pylab.figure(figsize=(12,8))
# pylab.scatter()
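
One possible way to put the two embeddings side by side is sketched below. It uses `matplotlib.pyplot` directly instead of the `pylab` namespace, and the synthetic `X_pca` and `X_tsne` arrays are hypothetical stand-ins for `data_pca` and `data_tsne`.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical stand-in data and embeddings (not the well data).
X, _ = make_blobs(n_samples=150, n_features=10, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

# One panel per embedding so the two layouts can be compared directly.
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, emb, title in zip(axes, (X_pca, X_tsne), ("PCA", "t-SNE")):
    ax.scatter(emb[:, 0], emb[:, 1], s=10, alpha=0.7)
    ax.set_title(title)
    ax.set_xlabel("component 1")
    ax.set_ylabel("component 2")
fig.tight_layout()
```

In a notebook with inline plotting the figure is rendered at the end of the cell; in a plain script, call `plt.show()` afterwards.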

### Step 4. Clustering

In [13]:
from sklearn.cluster import KMeans, SpectralClustering

# Let's use clustering to determine the different rock types
# We will try two approaches: KMeans as a basic, simple one
# and SpectralClustering as a more sophisticated one
# Decide which data is better to use, data_pca or data_tsne,
# based on the visualization above

# kmeans = KMeans(n_clusters=)
# spectral = SpectralClustering(n_clusters=)
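
The sketch below shows how the two clusterers could be fitted and compared on a 2D embedding. Everything specific in it is an assumption: the data are synthetic blobs, and `n_clusters = 3` matches how those blobs were generated. It says nothing about the right number of rock types in the well data, which the exercise leaves for you to choose.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Hypothetical stand-in for data_pca: a 2D embedding of synthetic blobs.
X, _ = make_blobs(n_samples=150, n_features=10, centers=3, random_state=0)
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

n_clusters = 3  # assumption: matches the synthetic data only

# KMeans: assigns each point to the nearest of n_clusters centroids.
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(X_2d)

# SpectralClustering: clusters the points via a similarity graph (RBF affinity by default).
spectral = SpectralClustering(n_clusters=n_clusters, random_state=0)
spectral_labels = spectral.fit_predict(X_2d)

# Adjusted Rand index: 1.0 means both methods produced the same partition.
agreement = adjusted_rand_score(kmeans_labels, spectral_labels)
print("agreement between methods:", round(agreement, 3))
# Silhouette score: higher values indicate tighter, better-separated clusters.
print("KMeans silhouette:", round(silhouette_score(X_2d, kmeans_labels), 3))
```

Because cluster label numbers are arbitrary, compare the two results with a permutation-invariant measure such as the adjusted Rand index rather than by checking label equality.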