# Intro to `pandas`

We'll explore the Pandas package for simple data handling tasks using geoscience data examples. 

## Basic Pandas

This section introduces the `DataFrame`, the central data structure in Pandas. If you're familiar with R, it's pretty much the same idea! There's a useful cheat sheet [here](https://www.datacamp.com/community/blog/pandas-cheat-sheet-python#gs.59HV6BY).

The main purpose of Pandas is to allow easy manipulation of data in tabular form. Perhaps the most important idea that makes Pandas great for data science is that it always preserves **alignment** between data and labels.

In [None]:
import pandas as pd
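
To make the alignment idea concrete, here is a minimal sketch. The two `Series` and their values are made up for illustration; the point is only that arithmetic pairs values up by label, not by position.

```python
import pandas as pd

# Two Series with the same labels in a different order (illustrative values).
velocity = pd.Series([2.13, 3.45, 2.45],
                     index=['sandstone', 'limestone', 'shale'])
density = pd.Series([2.40, 2.65, 2.71],
                    index=['shale', 'sandstone', 'limestone'])

# The product is computed label by label, so 'sandstone' velocity
# is multiplied by 'sandstone' density even though the order differs.
impedance = velocity * density
print(impedance)
```

With plain lists or NumPy arrays, the same multiplication would silently pair up the wrong rocks.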

The most common data structure in Pandas is the `DataFrame`: a 2D structure that can hold various types of Python objects, indexed by an `index` array (or multiple `index` arrays). Columns are usually labelled as well, using strings.

An easy way to think about a `DataFrame` is to imagine it as an Excel spreadsheet.

Let's define one using a small dataset:

In [None]:
data =  [[2.13, 'sandstone'],
         [3.45, 'limestone'],
         [2.45, 'shale']]
data

Make a `DataFrame` from `data`

In [None]:
df = pd.DataFrame(data, columns=['velocity', 'lithology'])
df

Accessing the data is a bit more complex than with NumPy arrays, but for good reasons: rows and columns can be addressed by label as well as by position.
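
As a rough sketch of the most common access patterns, the block below rebuilds the small dataset so that it runs on its own. The names on the left-hand side are just illustrative.

```python
import pandas as pd

df = pd.DataFrame([[2.13, 'sandstone'],
                   [3.45, 'limestone'],
                   [2.45, 'shale']],
                  columns=['velocity', 'lithology'])

velocities = df['velocity']        # one column, returned as a Series
first_row = df.loc[0]              # one row, selected by index label
lith = df.loc[1, 'lithology']      # one value, by row label and column name
last_vel = df.iloc[-1, 0]          # one value, by integer position
fast = df[df['velocity'] > 2.4]    # only the rows where a condition holds
```

`loc` works with labels and `iloc` with integer positions; with the default index the two happen to look the same for rows, which is a common source of confusion.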

## Adding data

Add more data (row wise)

In [None]:
df.loc[3] = [2.6, 'shale']
df

Add a new column with a "complete" list, array or Series, that is, one with a value for every row:

In [None]:
# One value per row: the DataFrame has 4 rows at this point.
df['new_column'] = ["x", "y", "z", "a"]
df
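
Why the new column has to be "complete" can be sketched as follows. This is a toy example with made-up porosity values, not part of the dataset: a list is assigned by position and must match the number of rows, whereas a `Series` is aligned on the index and leaves gaps as `NaN`.

```python
import pandas as pd

df = pd.DataFrame({'velocity': [2.13, 3.45, 2.45]})

# A list is assigned by position, so its length must equal the number of rows.
df['lithology'] = ['sandstone', 'limestone', 'shale']

# A Series is aligned on the index; rows without a matching label get NaN.
df['porosity'] = pd.Series([0.21, 0.08], index=[0, 2])

# A list of the wrong length raises a ValueError instead of being padded.
try:
    df['too_short'] = ['x', 'y']
    raised = False
except ValueError:
    raised = True

print(df)
```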

## Reading a CSV

Pandas can also read tabular data from files on disk ([here](http://pandas.pydata.org/pandas-docs/version/0.20/io.html)'s a list of all the formats that it can read and write). A very common one is CSV, so let's load one!

The data is the same as used in this study: http://www.kgs.ku.edu/PRS/publication/2003/ofr2003-30/index.html

From that poster:

> The Panoma Field (2.9 TCF gas) produces from Permian Council Grove Group marine carbonates and nonmarine siliciclastics in the Hugoton embayment of the Anadarko Basin. It and the Hugoton Field, which has produced from the Chase Group since 1928, the top of which is 300 feet shallower have combined to produce 27 TCF gas, making it the largest gas producing area in North America. Both fields are stratigraphic traps with their updip west and northwest limits nearly coincident. Maximum recoveries in the Panoma are attained west of center of the field. Deeper production includes oil and gas from Pennsylvanian Lansing-Kansas City, Marmaton, and Morrow and the Mississippian.

In [None]:
df = pd.read_csv("../data/Panoma_Field_Permian.csv")
df.head()

We have some well logs, plus...

> Two other feature elements derived from other geologic data are geologic constraining variables (GCV), nonmarine-marine (NM-M) and relative position (RPos). NM-M is determined from formation tops and bases and RPos is the position of a particular sample with respect to the base of its respective nonmarine or marine interval. These two important variables help to incorporate geologic knowledge into the variable mix.

In [None]:
import seaborn as sns
import numpy as np

sns.displot(df['ILD'])

<div style="background: #e0ffe0; border: solid 2px #d0f0d0; border-radius:3px; padding: 1em; color: darkgreen">
<h3>Exercise</h3>

- Create a new column called `ILD_log10` and store in it the log<sub>10</sub> of the values in column `ILD`.
- Make a 'displot' of the new column.
- Check the Pandas documentation [here](http://pandas.pydata.org/pandas-docs/version/0.22/api.html#data-manipulations) and look for a way to determine how many different facies are part of the `DataFrame`.
</div>

# Inspecting the `DataFrame`

Using the `DataFrame` of well log information loaded earlier, we can get a statistical summary with the `describe()` method of the `DataFrame` object:

In [None]:
df.describe()

In [None]:
sns.displot(df['GR'])

In [None]:
# Cap GR values at 200; anything higher is set to 200.
df['GR'] = df['GR'].clip(upper=200)

## Better descriptions

We can define a Python dictionary that maps each integer facies label in the `DataFrame` to a facies name:

In [None]:
lithofacies = {1:'sandstone', 2:'c_siltstone', 3:'f_siltstone', 4:'marine_silt_shale',
               5:'mudstone', 6:'wackestone', 7:'dolomite', 8:'packstone', 9:'bafflestone'}

Let's add a new column with the name version of the facies. Both DataFrames and Series have a `replace()` method, which takes a dictionary of what to replace with what, so we can pass our dictionary straight to it.

In [None]:
df["Lithofacies"] = df["Facies"].replace(lithofacies)

In [None]:
df.head()

## Adding more data to the `DataFrame`

We'd like to augment the DataFrame with some new data, based on some of the existing data.

In [None]:
def calc_phi_rhob(phind, deltaphi):
    """
    Compute phi_RHOB from phi_ND and delta-phi.
    """
    return 2 * (phind/100) / (1 - deltaphi/100) - deltaphi/100

In [None]:
def calc_rhob(phi_rhob, matrix='sandstone', fluid='brine'):
    """
    Computes RHOB from phi_RHOB using some typical values for rho_matrix,
    and rho_fluid. See wiki.aapg.org/Density-neutron_log_porosity
    """
    matrixes = {
        'mudstone':   2350,
        'siltstone':  2550,
        'sandstone':  2650,
        'limestone':  2710,
        'dolomite':   2880,
        'anhydrite':  2980,
        'salt':       2030,
    }

    fluids = {
        'water':       1000,
        'brine':       1100,
        'heavy oil':   1000,
        'light oil':    800,
        'lng':          650,
    }
    
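    # Unrecognised names fall back to sandstone (2650) and brine (1100).
    rho_matrix = matrixes.get(matrix.lower(), 2650)
    rho_fluid = fluids.get(fluid.lower(), 1100)
    # Volume-weighted average of matrix and fluid densities, in kg/m³.
    return rho_matrix * (1 - phi_rhob) + rho_fluid * phi_rhob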
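
As a quick sanity check of the mixing rule at the end of `calc_rhob`, here is the arithmetic for a single sample. The porosity of 0.2 is an illustrative assumption, not a value from the dataset; the two densities are the sandstone and brine entries from the dictionaries above.

```python
phi = 0.2            # fractional porosity (illustrative value)
rho_matrix = 2650    # sandstone, kg/m³
rho_fluid = 1100     # brine, kg/m³

# 80% of the volume is matrix and 20% is fluid:
# 2650 * 0.8 + 1100 * 0.2 = 2120 + 220
rhob = rho_matrix * (1 - phi) + rho_fluid * phi
print(round(rhob))   # 2340
```

Note that `phi_rhob` must be a fraction, not a percentage, for this to make sense.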

<div style="background: #e0ffe0; border: solid 2px #d0f0d0; border-radius:3px; padding: 1em; color: darkgreen">
<h3>Exercise</h3>

- Create a new column called `RHOB` and use the functions `calc_phi_rhob` and `calc_rhob` with the appropriate arguments to fill its values. Assume everything is sandstone.
- Check the distribution of the new RHOB values. Sedimentary rocks usually have densities in the range 2000 to 2500 kg/m³. Some of these seem rather small. Use `df.loc[...]` (with a condition in place of the ellipsis) to make a copy of the DataFrame that only includes the values above some reasonable number, maybe 1500 kg/m³.
- **Stretch goal:** create a function that processes a row, taking `row` as its only argument. Then use `row['Facies']` to get the matrix (the geological one!) and use that to make the calculation for each row, returning it. Then you can use `df.apply()` with `axis=1` to apply your function to every row and make a new column. Use this dictionary to look up the matrix type:
</div>

In [None]:
lithologies = {1:'sandstone',
               2:'siltstone', 3:'siltstone',
               4:'mudstone', 5:'mudstone',
               6:'wackestone',
               7:'dolomite',
               8:'limestone', 9:'limestone',
              }
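
The stretch goal relies on `df.apply()` with `axis=1`. As a rough sketch of that pattern on an unrelated toy table (the column names, lookup table and function here are all hypothetical), each row is passed to the function as a `Series`, and the returned values are collected into a new column:

```python
import pandas as pd

toy = pd.DataFrame({'well': ['A', 'A', 'B'],
                    'code': [1, 2, 1],
                    'thickness': [10.0, 4.0, 6.0]})
names = {1: 'sand', 2: 'shale'}   # hypothetical lookup table

def describe_row(row):
    """Build a text label from two columns of a single row."""
    return f"{names[row['code']]}: {row['thickness']:.1f} m"

# axis=1 means "call the function once per row".
toy['label'] = toy.apply(describe_row, axis=1)
print(toy)
```

Row-wise `apply` is convenient but slow on large tables, so it is best kept for logic that is awkward to express with whole-column operations.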

In [None]:
# YOUR CODE HERE



In [None]:
sns.displot(df['RHOB'].loc[df['RHOB'] > 1000])

## Visual exploration of the data

Pandas has a `scatter_matrix()` function, but it's not that pretty.

In [None]:
_ = pd.plotting.scatter_matrix(df, figsize=(15,15))

We can better visualize the properties of each facies, and how they compare, using Seaborn's `pairplot()` function. The `seaborn` library integrates with matplotlib to make these kinds of plots easily.

In [None]:
sns.pairplot(df,
             hue="Lithofacies",
             vars=['GR','RHOB','PE','ILD_log10'])

We can have a lot of control over all of the elements in the pair-plot by using the `PairGrid` object.

In [None]:
import matplotlib.pyplot as plt

g = sns.PairGrid(df, hue="Lithofacies", vars=['GR','RHOB','PE','ILD_log10'], height=4)

g.map_upper(plt.scatter, alpha=0.4)  
g.map_lower(plt.scatter, alpha=0.4)
g.map_diag(plt.hist, bins=20)  
g.add_legend()

It's clear that these facies are hard to separate in feature space. Let's simplify a bit by grouping them into broader classes:

In [None]:
df["Lithology"] = df["Facies"].replace(lithologies)

In [None]:
mineralogy = {
     1:'siliciclastic',
     2:'siliciclastic', 3:'siliciclastic',
     4:'siliciclastic', 5:'siliciclastic',
     6:'carbonate',
     7:'carbonate',
     8:'carbonate', 9:'carbonate',
}

In [None]:
df["Mineralogy"] = df["Facies"].map(mineralogy)
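
The cell above uses `map()` where earlier cells used `replace()`. With a dictionary that covers every facies code, the two give the same result. They differ only for values missing from the dictionary, as this small sketch with a deliberately incomplete, made-up lookup shows:

```python
import pandas as pd

codes = pd.Series([1, 6, 10])                   # 10 has no entry in the lookup
lookup = {1: 'siliciclastic', 6: 'carbonate'}   # deliberately incomplete

mapped = codes.map(lookup)          # unmatched values become NaN
replaced = codes.replace(lookup)    # unmatched values are left unchanged

print(mapped.tolist())
print(replaced.tolist())
```

`map()` therefore makes missing dictionary entries easy to spot, while `replace()` lets them pass through silently.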

In [None]:
g = sns.PairGrid(df, hue="Lithology", vars=['GR','RHOB','PE','ILD_log10'], height=4)  
g.map_upper(plt.scatter, alpha=0.4)
g.map_lower(plt.scatter, alpha=0.4)
g.map_diag(plt.hist, bins=20)
g.add_legend()

In [None]:
g = sns.PairGrid(df, hue="Mineralogy", vars=['GR','RHOB','PE','ILD_log10'], height=4)  
g.map_upper(plt.scatter, alpha=0.4)
g.map_lower(plt.scatter, alpha=0.4)
g.map_diag(plt.hist, bins=20)
g.add_legend()

In [None]:
df.head()

In [None]:
df.to_csv("../data/training_data.csv", index=False)

<hr />

<p style="color:gray">©2020 Agile Geoscience. Licensed CC-BY.</p>