# Introduction  

Copulas (also written "copulae") are functions that couple the marginal distributions of several random variables into a joint distribution that captures their dependence structure. They let us construct a multivariate distribution from arbitrary univariate marginals while modelling the dependence among them separately. Copulas come in several families, such as Gaussian, Archimedean, and non-parametric (empirical) copulas, and are used in fields such as finance, engineering, and machine learning.  
Copulas are useful for capturing both linear and non-linear dependence between two stocks because they model the joint distribution of the stock returns without imposing any particular form on the marginal distributions. This means copulas can accommodate different behaviors and shapes of the stock returns (fat tails, skewness, multimodality, etc.), which in turn makes them a good fit for market-neutral strategies such as Pairs Trading.  
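
For readers who prefer a formula, the idea above is usually stated as Sklar's theorem. The notation here ($F$ for the joint CDF, $F_X$ and $F_Y$ for the marginal CDFs, $C$ for the copula) is introduced for illustration and is not used elsewhere in this notebook. For two random variables $X$ and $Y$ there exists a copula $C$ such that

$$
F(x, y) = C\big(F_X(x),\, F_Y(y)\big).
$$

When the marginals are continuous, $C$ is unique and can be recovered from the joint distribution:

$$
C(u, v) = F\big(F_X^{-1}(u),\, F_Y^{-1}(v)\big), \qquad u, v \in [0, 1].
$$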

## A Brief Overview of Copulas  
The technical details of copulas, although intuitive, are fairly lengthy and deserve a more extensive introduction, for which I highly recommend the following article: https://towardsdatascience.com/copulas-an-essential-guide-applications-in-time-series-forecasting-f5c93dcd6e99. For those who are not familiar with copulas, the short graphical guide below presents the conceptual intuition.

In [5]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import plotly.express as px
import plotly.figure_factory as ff
from scipy.stats import norm, gamma, beta

Suppose we have 2 Normally distributed random variables with a correlation of 0.8, presented as a bivariate normal distribution:

In [2]:
np.random.seed(seed=5)
mean = [0,0]
rho = 0.8
cov = [[1,rho],[rho,1]] # unit variances on the diagonal, correlation rho off the diagonal

# generate bivariate normal samples
norm_1,norm_2 = np.random.multivariate_normal(mean,cov,1000).T
# apply the standard Normal CDF to each sample (probability integral transform)
unif_1 = norm.cdf(norm_1) # values are Uniform on (0, 1)
unif_2 = norm.cdf(norm_2)

norm_data = pd.concat([pd.DataFrame(norm_1), pd.DataFrame(norm_2)], axis=1)
norm_data.columns = ['X', 'Y']
norm_data.corr()

|   | X | Y |
|---|---|---|
| X | 1.0 | 0.803551 |
| Y | 0.803551 | 1.0 |


The correlation is plotted as follows:


In [3]:
fig = px.scatter(norm_data, x = 'X', y='Y', width=700, height=500, trendline='ols', trendline_color_override='DeepPink', marginal_x='histogram', marginal_y='histogram', title='Bi-Variate Normal')
fig.show()

This is what we all do when using correlation!

BUT,  
 what if the data look different? What if the 2 variables are neither *Normal* nor *Linearly* correlated?  

Let us imagine a real-world scenario where we need to measure the correlation between 2 quantities: the amounts of **time** and **money** spent on Amazon. Their respective data are derived via [Inverse Transform Sampling](https://en.wikipedia.org/wiki/Inverse_transform_sampling) from the 2 Uniform samples we obtained earlier by applying the Normal CDF.
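
As a minimal standalone sketch of inverse transform sampling, the example below uses an Exponential distribution, whose inverse CDF has a closed form. The rate `lam`, the sample size, and the seed are hypothetical values chosen for illustration; they are not part of the original notebook.

```python
import numpy as np

# Hypothetical settings, chosen only for this illustration
rng = np.random.default_rng(0)
lam = 2.0                       # rate of the target Exponential distribution
u = rng.uniform(size=100_000)   # Uniform samples on [0, 1)

# Inverse CDF of Exponential(lam): F^{-1}(u) = -ln(1 - u) / lam
samples = -np.log1p(-u) / lam

# The sample mean should be close to the theoretical mean 1 / lam
sample_mean = samples.mean()
print(round(sample_mean, 3))
```

The `.ppf()` methods in `scipy.stats` play the role of the inverse CDF for distributions, such as Gamma and Beta, where a closed form is not as convenient.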

In [None]:
# generate time and money data as Gamma and Beta distributions respectively
# using the inverse sampling method (.ppf() functions)
# applied to the Uniform samples obtained from the Normal CDFs
website_time = pd.DataFrame(gamma.ppf(unif_1, a=2, scale=5))
website_spend =  pd.DataFrame(beta.ppf(unif_2,a=0.5, b=0.5, loc=5, scale=100))
join_time_spend = pd.concat([website_time, website_spend], axis=1)
join_time_spend.columns = ['Time', 'Cash']

Time spent on website


In [6]:
gamma_dist  = ff.create_distplot([website_time.values.reshape(-1)], group_labels = [' '])
gamma_dist.update_layout(showlegend=False, title_text='Time Spent on Website', width=1000, height=500)
gamma_dist.show()

Dollars spent on website


In [7]:
beta_dist  = ff.create_distplot([website_spend.values.reshape(-1)], group_labels = [' '])
beta_dist.update_layout(showlegend=False, title_text='Dollars Spent on Website', width=1000, height=500)
beta_dist.show()

Correlation of time vs money spent on website 

In [11]:
fig = px.scatter(join_time_spend, x = 'Time', y='Cash', width=1000, height=500,  range_y=[0,110], trendline='ols', trendline_color_override='DeepPink',  marginal_x='histogram', marginal_y='histogram')
fig.show()
print("Correlation between time and $$ is: " + str(round(join_time_spend.corr().values[0][1],4)))


Correlation between time and $$ is: 0.7231


Something is different!  
Remember our earlier (Bivariate) correlation was **0.8**?
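
One hedged way to see what changed and what did not: the sketch below regenerates comparable data and compares the Pearson correlation with Spearman's rank correlation before and after the transformation. It uses NumPy's `default_rng`, so the numbers will not match the cells above exactly. `pearsonr` and `spearmanr` from `scipy.stats` do not appear in the original notebook and are brought in here only for illustration. Because `norm.cdf`, `gamma.ppf`, and `beta.ppf` are strictly increasing, the ranks of the samples are preserved, so the rank correlation should stay the same even though the Pearson correlation does not.

```python
import numpy as np
from scipy.stats import norm, gamma, beta, pearsonr, spearmanr

# Illustrative re-creation of the data; the seed and generator are assumptions
rng = np.random.default_rng(5)
rho = 0.8
cov = [[1.0, rho], [rho, 1.0]]
z = rng.multivariate_normal([0.0, 0.0], cov, size=1000)

# Normal CDF -> Uniform samples, then inverse CDFs -> Gamma and Beta samples
u = norm.cdf(z)
time_spent = gamma.ppf(u[:, 0], a=2, scale=5)
cash_spent = beta.ppf(u[:, 1], a=0.5, b=0.5, loc=5, scale=100)

# Pearson measures linear association and depends on the marginals
pearson_normal = pearsonr(z[:, 0], z[:, 1])[0]
pearson_transformed = pearsonr(time_spent, cash_spent)[0]

# Spearman depends only on ranks, which monotone transforms preserve
spearman_normal = spearmanr(z[:, 0], z[:, 1])[0]
spearman_transformed = spearmanr(time_spent, cash_spent)[0]

print("Pearson  (normal, transformed):", pearson_normal, pearson_transformed)
print("Spearman (normal, transformed):", spearman_normal, spearman_transformed)
```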