# A pattern for comparisons

[Doing business](https://gramener.com/playground/doingbusiness/) is an interesting analysis of the [ease of doing business data](https://gramener.com/playground/doingbusiness/doingbusiness.csv). It shows how each country fares relative to others.

Radar charts are another way to show the same comparison.

In [139]:
import os
import vis              # The Gramener visualisation server
import layout           # The Gramener visualisation server
import color as _color  # The Gramener visualisation server
import urllib
import pandas as pd
import orderedattrdict
from IPython.display import HTML

In [59]:
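if not os.path.exists('doingbusiness.csv'):
    # urlretrieve lives in urllib.request on Python 3; fall back to the Python 2 location
    try:
        from urllib.request import urlretrieve
    except ImportError:
        from urllib import urlretrieve
    urlretrieve('https://gramener.com/playground/doingbusiness/doingbusiness.csv', 'doingbusiness.csv')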

data = pd.read_csv('doingbusiness.csv', encoding='utf-8')
ranks = data.set_index('Country')[[col for col in data.columns if 'Rank' in col]]
ease = ranks.max().max() - ranks

plot_args = {
    'width': 240,
    'height': 240,
    'color': ['rgba(255, 0, 0, 0.5)', 'rgba(0, 0, 255, 0.5)'],
    'radar': True,
    'stack': None
}
def radar(*countries, **kwargs):
    result = []
    if kwargs.get('split'):
        for country in countries:
            result.append(vis.SVG('areaplot.svg', data=ease.T[[country]], **plot_args))
    result.append(vis.SVG('areaplot.svg', data=ease.T[list(countries)], **plot_args))
    if kwargs.get('table'):
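        # .loc / .iloc replace the removed .ix / .irow accessors
        table = ranks.loc[list(countries)]
        # Rank difference: second country minus first (positive = first country ranks better)
        diff = table.iloc[1] - table.iloc[0]
        diff.name = 'Diff'
        result.append(pd.concat([table, diff.to_frame().T]).to_html())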
    return HTML(''.join(result))

### Here is how Italy and Pakistan compare

Italy's generally better for business, but Pakistan is better for dealing with construction permits and protecting minority investors.

In [60]:
radar('Italy', 'Pakistan', table=True, split=True)

| Country | Ease of Doing Business Rank | Starting a Business Rank | Dealing with Construction Permits Rank | Getting Electricity Rank | Registering Property Rank | Getting Credit Rank | Protecting Minority Investors Rank | Paying Taxes Rank | Trading Across Borders Rank | Enforcing Contracts Rank | Resolving Insolvency Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Italy | 45 | 50 | 86 | 59 | 24 | 97 | 36 | 137 | 1 | 111 | 23 |
| Pakistan | 138 | 122 | 61 | 157 | 137 | 133 | 25 | 171 | 169 | 151 | 94 |
| Diff | 93 | 72 | -25 | 98 | 113 | 36 | -11 | 34 | 168 | 40 | 71 |


### Here's how the US and Pakistan compare

Pakistan actually ranks better than the US at protecting minority investors -- though on every other parameter, including starting a business, the US is ahead.

In [61]:
radar('United States', 'Pakistan', table=True, split=True)

| Country | Ease of Doing Business Rank | Starting a Business Rank | Dealing with Construction Permits Rank | Getting Electricity Rank | Registering Property Rank | Getting Credit Rank | Protecting Minority Investors Rank | Paying Taxes Rank | Trading Across Borders Rank | Enforcing Contracts Rank | Resolving Insolvency Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| United States | 7 | 49 | 33 | 44 | 34 | 2 | 35 | 53 | 34 | 21 | 5 |
| Pakistan | 138 | 122 | 61 | 157 | 137 | 133 | 25 | 171 | 169 | 151 | 94 |
| Diff | 131 | 73 | 28 | 113 | 103 | 131 | -10 | 118 | 135 | 130 | 89 |


# Types of patterns

Some patterns are more interesting than others. Here are some thoughts:

## Comparison patterns

- Create a matrix between every pair of entities. Each cell shows a metric. The metric could be:
    - difference metric: how different are the countries on average? e.g. the % of parameters on which country X is ahead of country Y.
    - variation metric: how dissimilar are the countries? e.g. is country X well ahead of country Y on some parameters, but well behind on others?
    - similarity metric: how similar are the countries
- Visualise the matrix in different ways:
    - List the top 10 pairs based on metric
    - Network diagram
    - Heatgrid
    - Tree based on hierarchical clustering
    - Extend this by surveying linear algebra / graph theory literature
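
As a rough illustration of the first visualisation idea (listing the top pairs), here is a minimal sketch in plain Python. It assumes the metric is the net share of parameters on which one country ranks better, the same idea as the `lead` function later in this notebook. The three rows are copied from the tables above; `matrix` and `top_pairs` are names made up for this sketch.

```python
from itertools import permutations

# Ranks copied from the two tables above (lower rank = better).
ranks = {
    'Italy':         [45, 50, 86, 59, 24, 97, 36, 137, 1, 111, 23],
    'Pakistan':      [138, 122, 61, 157, 137, 133, 25, 171, 169, 151, 94],
    'United States': [7, 49, 33, 44, 34, 2, 35, 53, 34, 21, 5],
}

def lead(a, b):
    """Net share of parameters on which a ranks better than b (-1 to 1)."""
    ahead = sum(x < y for x, y in zip(ranks[a], ranks[b]))
    behind = sum(x > y for x, y in zip(ranks[a], ranks[b]))
    return float(ahead - behind) / len(ranks[a])

# One cell per ordered pair of distinct countries
matrix = {(a, b): lead(a, b) for a, b in permutations(ranks, 2)}

# Largest metric first; with the full data set, slice [:10] for the top 10
top_pairs = sorted(matrix.items(), key=lambda item: item[1], reverse=True)[:3]
for (a, b), value in top_pairs:
    print('%s leads %s by %.0f%%' % (a, b, 100 * value))
```

With only three countries the list is short, but the same sort works on the full pairwise matrix.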

## Contrast / exception patterns

- Countries that are very high in most areas, low in few (or the opposite)
- Beats others in most, but beaten by X in just one or a few

- Discretisation into quartiles
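
The "beats others in most, but beaten in just a few" pattern could be detected along these lines. This is only a sketch under assumptions: the rows are copied from the tables above, and the `exceptions` helper and its `max_exceptions` threshold are hypothetical names and values chosen for illustration.

```python
params = [
    'Ease of Doing Business', 'Starting a Business',
    'Dealing with Construction Permits', 'Getting Electricity',
    'Registering Property', 'Getting Credit',
    'Protecting Minority Investors', 'Paying Taxes',
    'Trading Across Borders', 'Enforcing Contracts', 'Resolving Insolvency',
]
# Ranks copied from the two tables above (lower rank = better).
ranks = {
    'Italy':         [45, 50, 86, 59, 24, 97, 36, 137, 1, 111, 23],
    'Pakistan':      [138, 122, 61, 157, 137, 133, 25, 171, 169, 151, 94],
    'United States': [7, 49, 33, 44, 34, 2, 35, 53, 34, 21, 5],
}

def exceptions(leader, other, max_exceptions=2):
    """Parameters on which `other` beats `leader`, if there are only a few.

    Returns an empty list when `other` never wins, or wins too often
    for the result to count as an exception. max_exceptions is an
    arbitrary threshold picked for this sketch.
    """
    beaten_on = [p for p, x, y in zip(params, ranks[leader], ranks[other]) if y < x]
    return beaten_on if 0 < len(beaten_on) <= max_exceptions else []

print(exceptions('United States', 'Pakistan'))
print(exceptions('Italy', 'Pakistan'))
```

The two calls recover the exceptions described in the radar chart sections above.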

# Comparisons

## Difference metric

The "difference" between two countries can be defined in many ways. Here's one metric: country X's lead over country Y is the share of areas where X is ahead of Y, minus the share where Y is ahead of X. In the Italy vs Pakistan example above:

- Italy leads Pakistan in 9 areas
- Pakistan leads Italy in 2 areas
- Italy leads Pakistan overall in 9 - 2 = 7 areas, which is ~64% (7/11)

In [62]:
count = len(ranks.columns)
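def lead(a, b):
    # Net share of parameters on which a ranks better (lower rank) than b, from -1 to 1
    diff = ranks.loc[a] - ranks.loc[b]    # .loc replaces the removed .ix accessor
    return float((diff < 0).sum() - (diff > 0).sum()) / count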

lead('Italy', 'Pakistan')

0.6363636363636364

Let's compute the lead for every pair of countries.

In [156]:
# Warning: unoptimised, slow computation
def pairwise(method):
    result = {}
    for a in ranks.index:
        for b in ranks.index:
            result[a, b] = method(a, b)
    result = pd.Series(result).reset_index()
    result.columns = ['Country X', 'Country Y', 'Metric']
    return result.pivot_table(index='Country X', columns='Country Y', values='Metric')
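
The nested loop above calls `method` once per ordered pair. For the `lead` metric specifically, the same matrix can probably be computed in one step with NumPy broadcasting. This is a minimal sketch rather than the notebook's method: it assumes NumPy is available, uses a three-country toy array copied from the tables above, and `names`, `values` and `lead_matrix` are illustrative names.

```python
import numpy as np

names = ['Italy', 'Pakistan', 'United States']
# Rows are countries, columns are the 11 rank parameters (lower = better).
values = np.array([
    [45, 50, 86, 59, 24, 97, 36, 137, 1, 111, 23],
    [138, 122, 61, 157, 137, 133, 25, 171, 169, 151, 94],
    [7, 49, 33, 44, 34, 2, 35, 53, 34, 21, 5],
])

# diff[i, j, k] = rank of country i minus rank of country j on parameter k
diff = values[:, None, :] - values[None, :, :]
ahead = (diff < 0).sum(axis=2)    # parameters where i ranks better than j
behind = (diff > 0).sum(axis=2)   # parameters where j ranks better than i
lead_matrix = (ahead - behind) / float(values.shape[1])

print(lead_matrix[names.index('Italy'), names.index('Pakistan')])
```

The intermediate array has countries x countries x parameters entries, which stays small for a couple of hundred countries and 11 parameters. Wrapping the result as `pd.DataFrame(lead_matrix, index=names, columns=names)` gives a labelled matrix similar to what `pairwise` returns.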

This is the pairwise difference for the first 20 countries (alphabetically), shown as a matrix. For each country, read horizontally to see which countries it is better than.

In [157]:
# Warning: slow computation
differences = pairwise(lead)

In [183]:
# Plot the matrix computed above (`differences`) for the first 20 countries
HTML('<style>.matrix text { font-size: 11px }</style>' +
     vis.SVG('clusterplot.svg', width=600, height=600, cls='matrix', gradient=_color.RdYlGn,
             data=differences.head(20).T.head(20),
             distance=True, dendrogram=False, label=100))

### Pareto sub-optimal countries

Are there countries that are worse than some other country in every way, so that we'd never invest in them? In the above graph, that's where the lead is 100% or -100% (a value of 1 or -1 in the matrix). We can drop all other patterns and just look for these.

However, there are still too many of these countries.

Another approach is to see if these Pareto-optimal countries form a tree. The short answer is no.

In [159]:
# Countries that beat at least one other country on every parameter (lead of 1)
pareto_optimal_countries = (differences >= 1).sum(axis=1) > 0
diff_pareto = differences[pareto_optimal_countries].T[pareto_optimal_countries].T
HTML(vis.SVG('clusterplot.svg', width=600, height=500, cls='matrix', gradient=_color.RdYlGn,
             data=diff_pareto.head(20).T.head(20),
             distance=True, dendrogram=False, label=100))
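
To make the "worse in every way" test concrete, here is a small plain-Python sketch of a dominance check. It uses the three rows from the tables above; the `dominates` helper and its `skip` argument are hypothetical, added only to show how a single exception breaks dominance.

```python
params = [
    'Ease of Doing Business', 'Starting a Business',
    'Dealing with Construction Permits', 'Getting Electricity',
    'Registering Property', 'Getting Credit',
    'Protecting Minority Investors', 'Paying Taxes',
    'Trading Across Borders', 'Enforcing Contracts', 'Resolving Insolvency',
]
# Ranks copied from the two tables above (lower rank = better).
ranks = {
    'Italy':         [45, 50, 86, 59, 24, 97, 36, 137, 1, 111, 23],
    'Pakistan':      [138, 122, 61, 157, 137, 133, 25, 171, 169, 151, 94],
    'United States': [7, 49, 33, 44, 34, 2, 35, 53, 34, 21, 5],
}

def dominates(a, b, skip=()):
    """True if a ranks strictly better than b on every parameter not in skip."""
    return all(x < y
               for p, x, y in zip(params, ranks[a], ranks[b])
               if p not in skip)

# Ordered pairs (a, b) of distinct countries where a dominates b
dominated_pairs = [(a, b) for a in ranks for b in ranks
                   if a != b and dominates(a, b)]
print(dominated_pairs)
print(dominates('United States', 'Pakistan'))
print(dominates('United States', 'Pakistan',
                skip=('Protecting Minority Investors',)))
```

Among these three countries nobody dominates anybody: the US falls short of dominating Pakistan only because of the minority investors rank.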

## Variation metric

In [192]:
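def variation(a, b):
    diff = ranks.loc[a] - ranks.loc[b]    # .loc replaces the removed .ix accessor
    a_b = diff[diff > 0].abs().sum()      # total rank gap on parameters where b is ahead
    b_a = diff[diff < 0].abs().sum()      # total rank gap on parameters where a is ahead
    # Small result: the two countries' leads roughly cancel out
    return abs(a_b - b_a)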
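
As a worked example of this metric, the sketch below re-implements `variation` in plain Python on the three rows shown in the tables above. It is an illustration under those assumptions, not a replacement for the pandas version.

```python
# Ranks copied from the two tables above (lower rank = better).
ranks = {
    'Italy':         [45, 50, 86, 59, 24, 97, 36, 137, 1, 111, 23],
    'Pakistan':      [138, 122, 61, 157, 137, 133, 25, 171, 169, 151, 94],
    'United States': [7, 49, 33, 44, 34, 2, 35, 53, 34, 21, 5],
}

def variation(a, b):
    diff = [x - y for x, y in zip(ranks[a], ranks[b])]
    a_b = sum(d for d in diff if d > 0)     # total gap where b is ahead
    b_a = sum(-d for d in diff if d < 0)    # total gap where a is ahead
    return abs(a_b - b_a)

for a, b in [('Italy', 'Pakistan'), ('United States', 'Pakistan'),
             ('United States', 'Italy')]:
    print(a, b, variation(a, b))
```

Because the positive and negative gaps are subtracted from each other, the result equals the absolute value of the summed rank differences, so it is smallest when one country's leads are offset by the other's.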

In [193]:
# Warning: slow computation
variations = pairwise(variation)
variations = 1 - variations / float(variations.max().max())

In [None]:
variations

In [173]:
high_variation_countries = (variations >= 0.95).any(axis=1)
high_variations = variations[high_variation_countries].T[high_variation_countries].T

In [174]:
# Show only the highest variations
HTML(vis.SVG('clusterplot.svg', width=600, height=500, cls='matrix', gradient=_color.Blues,
             data=high_variations,
             distance=True, dendrogram=False, label=100))