Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The method 'select_copula' of Bivariate return wrong CopulaType #101

Closed
gaboolic opened this issue Aug 26, 2019 · 3 comments · Fixed by #104
Closed

The method 'select_copula' of Bivariate return wrong CopulaType #101

gaboolic opened this issue Aug 26, 2019 · 3 comments · Fixed by #104
Assignees
Labels
bug There is an error in the code that needs to be fixed
Milestone

Comments

@gaboolic
Copy link

  • Copulas version: 0.2.2 (2019-07-31)
  • Python version: python 3.6
  • Operating System: mac

Description

In copulas/bivariate/base.py
class CopulaTypes(Enum):

CLAYTON = 0
FRANK = 1
GUMBEL = 2
INDEPENDENCE = 3

But the method 'select_copula' , is
copula_candidates = [frank]
copula_candidates.append(clayton)
copula_candidates.append(gumbel)

They're not in the same order.
When frank.tau > 0 and frank is best, it returns clayton. It should return frank

@csala
Copy link
Contributor

csala commented Aug 26, 2019

Thanks for reporting this @gaboolic !

It definitely looks like this should be fixed.

Would you mind making a PR fixing it? :-)

My suggested approach would be to return returning copula_candidates[selected_copula].copula_type instead of CopulaTypes(selected_copula):

selected_copula = np.argmax(cost_LR)
selected_theta = theta_candidates[selected_copula]
return copula_candidates[selected_copula].copula_type, selected_theta

@csala csala added the bug There is an error in the code that needs to be fixed label Aug 26, 2019
@csala csala modified the milestones: 0.2.3, 0.2.4 Aug 26, 2019
@gaboolic
Copy link
Author

OK,Thanks for your prompt reply

@csala
Copy link
Contributor

csala commented Aug 27, 2019

This issue seems to go much deeper than expected.

I reproduce here a short conversation that happened on a private channel between me and @aliciasun

In the base bivariate copula class, we have the select_copula method that supposedly selects the best copula (Frank, Gumbel or Clayton) doing the following:

Given out candidate copulas the procedure proposed for selecting the one
that best fit to a dataset of pairs :math:\{(u_j, v_j )\}, j=1,2,...n , is as follows:

  1. Estimate the most likely parameter :math:\theta of each copula candidate for the given dataset.
  2. Construct :math:R(z|\theta). Calculate the area under the tail for each of the copula candidates.
  3. Compare the areas: :math:a_u achieved using empirical copula against the ones achieved for > the copula candidates. Score the outcome of the comparison from 3 (best) down to 1 (worst).
  4. Proceed as in steps 2- 3 with the lower tail and function :math:L.
  5. Finally the sum of empirical upper and lower tail functions is compared against :math:R + L. Scores of the three comparisons are summed and the candidate with the highest value is selected.

NOTE: These steps are taken directly from the "Copula Graphical Models for Wind Resource Estimation" paper: https://groups.csail.mit.edu/EVO-DesignOpt/groupWebSite/uploads/Main/IJCAI_sub_KV_AC_UM.pdf

However, when looking at the actual code I see:

        z_left, L, z_right, R = cls.compute_empirical(X)
        left_dependence, right_dependence = cls.get_dependencies(
            copula_candidates, z_left, z_right)

        # compute L2 distance from empirical distribution
        cost_L = [np.sum((L - l) ** 2) for l in left_dependence]
        cost_R = [np.sum((R - r) ** 2) for r in right_dependence]
        cost_LR = np.add(cost_L, cost_R)

        selected_copula = np.argmax(cost_LR)

However, if I understand this correctly, the current code is selecting the copula based on a COST instead of a SCORE, which means that the copula that ends up being selected is not the one that is closest to the empirical distribution but rather the one that is further away.

The conclusion reached was:

Yeah I think it is should be reversed. Also to follow the exact description in the paper, each of cost_L/cost_R/cost_LR should be sorted and assigned a rank first (where the smallest element has a rank of 1). Also each of cost_L/cost_R/cost_LR should be compared. Right now the final selection is only based on cost_LR

And

The problem is that cost_LR should not be cost_L + cost_R as it is, but rather the score computed when comparing the sum of the empirical upper and lower tail functions with the candidate equivalent.
So we should really compute rank_L, rank_R and rank_LR independently, and then get the one with the highest sum, as you suggested.

Since the issue has grown bigger than we thought at the beginning, I'm assigning it to @JDTheRipperPC , who, a part from fixing the problem, will work on adding tests that reproduce and validate the behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug There is an error in the code that needs to be fixed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants