## Calculating cumulative frequency and density
In this Notebook we will take the dataset a step further.

We will calculate and plot the **Cumulative Thickness Frequency** and **Cumulative Spacing Frequency** as we go across the vein transect. We will analyse what happens in these figures: the shape of these frequency distributions gives us insight into which mathematical distributions govern the thickness and spacing of these veins, such as power-law, lognormal, negative exponential or Poisson distributions.

Calculating these will set us up to fit those mathematical functions to the distributions using non-linear least-squares regression, so that we can be more quantitative and statistical, with an eye to gaining insight into the vein-forming mechanisms. The parameters of the fitted mathematical functions can then also be used to generate Discrete Fracture Networks for reservoir modelling or geotechnical modelling.

<img src="../images/s2_Gillespieetal1999_distributions1.png" alt="Variable" width="600"/>

<img src="../images/s2_Gillespieetal1999_distributions2.png" alt="Variable" width="600"/>

<center> <i> Example distributions of non-stratabound veins governed by certain distribution functions (after Gillespie et al 1999, doi: 10.1144/GSL.SP.1999.155.01.05) </i> </center>

First we will import the libraries we need (as before) and read in the datafile.

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
dataframe = pd.read_csv('Vein_dataset_large.csv')
dataframe.head()

## Calculation of Cumulative Frequency
For this step, we need to calculate the cumulative frequency of the thickness and spacing values. Here we will calculate for ourselves how many values of thickness or spacing are higher than a given threshold value.

To make the calculation easy, we will put the Thickness and Spacing columns into two ordinary Python lists, aptly named *thickness* and *spacing*, using the Pandas `tolist()` method. To calculate our cumulative frequency we also sort the values from smallest to largest, and we can use the Python list method `sort()` to do this for us.

In [None]:
#s = dataframe.sort_values('Thickness', ascending=True)
thickness = dataframe['Thickness'].tolist()
spacing = dataframe['Spacing'].tolist()
thickness.sort()
spacing.sort()

Now that we have sorted our lists, we can calculate the number of values that are higher than a threshold value. To do this, we will step through threshold values in increments of 1 mm, and calculate how many values are higher than the threshold at each increment.

First we will create empty lists to hold the cumulative frequency values we are about to calculate.

In [None]:
cfreqThickness, cfreqSpacing = [],[]

We will then loop over the total length of the dataframe with `for i in range(len(dataframe.index))`, calculating each frequency value with `sum(dataframe["Thickness"] > i)`. This uses a *Boolean expression* that is `True` wherever a value is `> i`. Each result is appended to one of the two lists using the `append()` method.
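To see what the Boolean expression does on its own, here is a small sketch using a made-up Series of four thickness values (the name `toy_thickness` and the numbers are illustrative and are not taken from the dataset). Comparing a Series with a number gives a Series of `True`/`False` values, and summing it counts the `True` entries.

```python
import pandas as pd

# Hypothetical thickness values in mm, for illustration only
toy_thickness = pd.Series([0.5, 2.0, 3.5, 7.0])

# Element-wise comparison gives a Boolean Series
mask = toy_thickness > 2
print(mask.tolist())

# True counts as 1 and False as 0, so the sum is the
# number of values strictly above the threshold
count_above = int(mask.sum())
print(count_above)
```

Note that the comparison is strict: a value exactly equal to the threshold is not counted.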

In [None]:
# For Thickness
for i in range(len(dataframe.index)):
    cfreqThickness.append(sum(dataframe["Thickness"] > i))
# For Spacing
for i in range(len(dataframe.index)):
    cfreqSpacing.append(sum(dataframe["Spacing"] > i))
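The loops above rescan the whole column for every threshold value. As an alternative sketch (not the method used in this Notebook), sorted values can be searched with NumPy's `np.searchsorted()`, which should give the same counts in a single call. The helper name `count_greater_than` and the toy values are assumptions made for illustration.

```python
import numpy as np

def count_greater_than(values, thresholds):
    """Count, for each threshold, how many values are strictly greater."""
    sorted_values = np.sort(np.asarray(values, dtype=float))
    # side='right' gives the number of values <= each threshold
    n_not_greater = np.searchsorted(sorted_values, thresholds, side='right')
    return len(sorted_values) - n_not_greater

# Hypothetical values in mm, for illustration only
toy_values = [0.5, 2.0, 3.5, 7.0, 7.0]
toy_thresholds = np.arange(len(toy_values))  # 0, 1, 2, 3, 4 as in the loop
toy_cfreq = count_greater_than(toy_values, toy_thresholds)
print(toy_cfreq)

# The explicit loop gives the same counts
loop_cfreq = [sum(v > t for v in toy_values) for t in toy_thresholds]
print(loop_cfreq)
```

With the Notebook's data, this could be called as `count_greater_than(dataframe['Thickness'], np.arange(len(dataframe.index)))`.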

We can now plot this data, as before. To be able to show all the values, we will need a list of values spaced 1 mm apart for spacing and thickness.

In [None]:
# Create the threshold values (0, 1, 2, ... mm) to plot against
index = np.arange(len(dataframe.index))

# Plot the data
fig, ax = plt.subplots(1,1, figsize=(7,7))
ax.plot(index, cfreqThickness, 'k.', label="Vein Thickness Frequency")
ax.plot(index, cfreqSpacing, 'c.', label="Vein Spacing Frequency")

#Everything below this line is just the formatting mark-up
ax.set_xlabel('Vein Thickness or Spacing (mm)', fontsize='large')
ax.set_ylabel('Cumulative Frequency (N)', fontsize='large')
ax.grid(True, which='major', color='#CCCCCC', linestyle='-')
ax.grid(True, which='minor', color='#999999', linestyle='--')
ax.set_xscale('log')
ax.set_yscale('log')
#ax.set_xlim((1,10))
#ax.set_ylim((1,200))
ax.set_title('Cumulative Frequency of veins larger than a value')
plt.legend(loc="upper right")
plt.tight_layout(pad=1.05)
plt.show()

## Fitting distributions

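We will now attempt to fit mathematical functions to our data, using non-linear least-squares regression. The distributions of interest here are the power law and the lognormal; the example cell in this section starts with a negative exponential function, with alternatives left commented out.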


### Curve fitting
We will use the curve fitting tools in the Python package/library SciPy, but many other scientific packages contain curve fitting algorithms. `scipy.optimize.curve_fit()` uses non-linear least-squares regression to fit a mathematical function to data. For more information on what `curve_fit()` does and which arguments or parameters it takes, [see the SciPy documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html).

We first have to import `curve_fit()` from the `scipy.optimize` library.

In [None]:
from scipy.optimize import curve_fit

We first define our mathematical functions, and then input the mathematical functions and data as *arguments* to the `curve_fit()` Python *function*.

In [None]:
# Define the threshold values (x-axis) that the function is fitted against
index = np.arange(len(dataframe.index))

def function(x, a, b):
    return a * np.exp(-b * x)

#def function(x, a, b):
#    return (a * x) + b

# https://en.wikipedia.org/wiki/Generalised_logistic_function
#def function(x, a, b, c, d, g):
#    return ( ( (a-d) / ( (1+( (x/c)** b )) **g) ) + d )

params, params_covariance = curve_fit(function, index, cfreqThickness)
print('a =',params[0],'and b =',params[1])

You may get a warning here about *'overflow being encountered'*; this is to do with the level of precision that NumPy is able to store data to. Use `type()` to see what kind of data `a` and `b` are, and look at the links [here](https://stackoverflow.com/questions/40726490/overflow-error-in-pythons-numpy-exp-function) and [here](https://codesource.io/solved-overflow-encountered-in-long_scalars/) to think about what you might do about that.
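One possible way to reduce the chance of an overflow, sketched here on synthetic data rather than the vein dataset, is to give `curve_fit()` an initial guess through its `p0` argument and to keep both parameters non-negative through its `bounds` argument. With `x >= 0` and `b >= 0` the exponent can never become positive, so `np.exp()` stays at or below 1. The function name `neg_exponential`, the synthetic values and the starting guesses are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def neg_exponential(x, a, b):
    """Negative exponential: a * exp(-b * x)."""
    return a * np.exp(-b * x)

# Synthetic stand-in data (not the vein dataset): a known curve plus noise
rng = np.random.default_rng(0)
x = np.arange(0, 50, dtype=float)
y = 200.0 * np.exp(-0.1 * x) + rng.normal(0.0, 1.0, size=x.size)

# Initial guess: a near the largest observed count, b small and positive
p0 = [y.max(), 0.05]

# Bounds keep a and b non-negative during the search
fit_params, fit_cov = curve_fit(neg_exponential, x, y, p0=p0, bounds=(0, np.inf))
print('a =', fit_params[0], 'and b =', fit_params[1])
```

The fitted values should land close to the `200.0` and `0.1` used to generate the synthetic data. Whether these particular guesses and bounds suit the vein data is something to check against your own plot.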

We then plot the data and the fitted curves all in one graph, using a standard set of matplotlib parameters:

In [None]:
# Plot the data and a subsample of the fitted function
fig, ax = plt.subplots(1,1, figsize=(8,8))
ax.plot(index, cfreqThickness, 'k-', label='data', marker='o')
ax.plot(index, function(index, *params), 'r-', label='fit')
# Code below this line is just figure mark-up.
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlim((1,100))
ax.set_ylim((1,1000))
ax.grid(True, which='major', color='#CCCCCC', linestyle='-')
ax.grid(True, which='minor', color='#999999', linestyle='--')
plt.legend(loc="upper right")
plt.title("Vein frequency along a transect")
plt.xlabel('Vein Thickness or Spacing (mm)', fontsize='large')
plt.ylabel('Cumulative Frequency (N)', fontsize='large')
plt.show()
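The power law is named above as one of the candidate distributions. A minimal sketch of fitting one with `curve_fit()` is given here, using synthetic data so that it runs on its own. The function name `power_law`, the synthetic values and the starting guess `p0` are assumptions for illustration. The threshold values start at 1 rather than 0, because `0 ** (-b)` is not defined for positive `b`. The square roots of the diagonal of the returned covariance matrix give one-standard-deviation estimates for the parameters.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, b):
    """Power law: N = a * x**(-b)."""
    return a * x ** (-b)

# Synthetic stand-in data (not the vein dataset), with 2% multiplicative noise
rng = np.random.default_rng(42)
x = np.arange(1, 101, dtype=float)  # start at 1 to avoid 0**(-b)
y = 500.0 * x ** (-1.2) * rng.normal(1.0, 0.02, size=x.size)

# Fit, starting from a rough initial guess
pl_params, pl_cov = curve_fit(power_law, x, y, p0=[y[0], 1.0])

# One-standard-deviation parameter uncertainties
pl_err = np.sqrt(np.diag(pl_cov))
print('a =', pl_params[0], '+/-', pl_err[0])
print('b =', pl_params[1], '+/-', pl_err[1])
```

To try this on the Notebook's data, one option would be `curve_fit(power_law, index[1:], cfreqThickness[1:])`, which skips the threshold value of 0.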

***-----Possible additional content----***

## Calculating cumulative sum

We will now calculate the *cumulative sum* of the thickness and spacing over the entire arrays in the pandas Dataframe, using a method called `cumsum()`.

[Library info on `cumsum()` in pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cumsum.html)

We create new columns **Cumulative Thickness** and **Cumulative Spacing** in our Dataframe, and immediately populate them with our calculated cumulative sums. In the lines below, we use the method `cumsum()` on the *Spacing* and *Thickness* arrays in the Dataframe object.

In [None]:
dataframe['Cumulative Thickness'] = dataframe['Thickness'].cumsum()
dataframe['Cumulative Spacing'] = dataframe['Spacing'].cumsum()
dataframe.head()
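As a quick check of what `cumsum()` returns, here is a sketch on a made-up four-row Dataframe (the name `toy` and the values are illustrative only). Each row of the new column holds the running total of the values up to and including that row.

```python
import pandas as pd

# Hypothetical thickness values in mm, for illustration only
toy = pd.DataFrame({'Thickness': [2.0, 0.5, 3.0, 1.5]})

# Running total down the column
toy['Cumulative Thickness'] = toy['Thickness'].cumsum()
print(toy)
```

The last value of the cumulative column equals the sum of the whole Thickness column.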

In [None]:
fig2, ax2 = plt.subplots(1,1, figsize=(7,7))
ax2.bar(dataframe["Position"], dataframe["Cumulative Thickness"],width=3*dataframe["Thickness"], linewidth=0, color='k')
ax2.set_title("Cumulative vein thickness along the transect")
ax2.set_xlabel("Position (mm)")
ax2.set_ylabel("Cumulative Thickness (mm)")
plt.show()