# Goal
The purpose of this kernel is to find out which country has the best ratio of games won to games lost.


In [None]:
import numpy as np
import pandas as pd

import html
from io import StringIO

import matplotlib.pyplot as plt
%matplotlib inline

input = "../input/wta/"

Let's first load the matches.csv file and run some simple aggregations.

In [None]:
# some HTML-encoded characters are present, e.g. around Francesca La&#39;O
with open(input + 'matches.csv', 'r', encoding='utf-8') as f:
    content = StringIO(html.unescape(f.read()))

df_matches = pd.read_csv(content, low_memory=False, encoding="utf-8",sep=",", escapechar="'")

df_winners = df_matches.groupby(['winner_ioc']).size().sort_values(ascending=False).reset_index()
df_losers = df_matches.groupby(['loser_ioc']).size().sort_values(ascending=False).reset_index()

After these operations, we'll need to merge them.

This raises another question: does any country always win or always lose?

In [None]:
df_winners[~df_winners['winner_ioc'].isin(df_losers['loser_ioc'])]

Papua New Guinea won its only game!

In [None]:
df_losers[~df_losers['loser_ioc'].isin(df_winners['winner_ioc'])]

Bahrain and Haiti unfortunately lost all their games...

I wasn't able to find anything on 'ECA', nor on the two players Masika Williams and Sylma Phillip, who apparently live in 'ECA'.
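Both checks above can also be done in one pass. Here is a minimal sketch on made-up rows (the toy data and the names `toy_matches`, `record`, `only_wins` and `only_losses` are mine, not part of the kernel). It lines up win and loss counts per country and treats a missing count as zero:

```python
import pandas as pd

# Toy stand-in for matches.csv; these rows are invented for illustration only.
toy_matches = pd.DataFrame({
    "winner_ioc": ["USA", "USA", "BEL", "PNG"],
    "loser_ioc":  ["BEL", "BRN", "USA", "HAI"],
})

wins = toy_matches.groupby("winner_ioc").size().rename("wins")
losses = toy_matches.groupby("loser_ioc").size().rename("losses")

# Aligning the two Series on their index keeps every country;
# a country missing on one side gets NaN there, which we turn into 0.
record = pd.concat([wins, losses], axis=1).fillna(0).astype(int)
record.index.name = "IOC"

only_wins = record[record["losses"] == 0]    # never lost a game
only_losses = record[record["wins"] == 0]    # never won a game

print(only_wins)
print(only_losses)
```

On the real data, the same idea should work by swapping `toy_matches` for `df_matches`.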

We can now merge our two aggregations and compute the ratio.

In [None]:
df_ratio = pd.merge(
                df_winners,
                df_losers,
                how='inner',
                left_on='winner_ioc',
                right_on='loser_ioc'
                )

df_ratio.drop('loser_ioc', axis=1, inplace=True)
df_ratio.columns = ['IOC', 'wins', 'loses']
df_ratio.set_index('IOC', inplace=True)
df_ratio['ratio'] = df_ratio['wins'] / df_ratio['loses']
df_ratio['total'] = df_ratio['wins'] + df_ratio['loses']

Let's take a look at the top 10:

In [None]:
df_ratio.sort_values('ratio', ascending=False)[:10]

We can see that some "small" countries (with a small number of games) can make it into the top 10 when ranked by ratio. Denmark ranks first with a ratio of 1.97!
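One way to keep very small samples from dominating the ranking is to require a minimum number of games before ranking. This is only a sketch on made-up numbers: the cut-off `MIN_GAMES` and the country codes are arbitrary choices of mine, and the frame simply mimics the columns of `df_ratio`.

```python
import pandas as pd

# Made-up numbers shaped like df_ratio (index = IOC code).
toy_ratio = pd.DataFrame(
    {"wins": [3, 900, 500], "loses": [1, 600, 550]},
    index=pd.Index(["AAA", "BBB", "CCC"], name="IOC"),
)
toy_ratio["ratio"] = toy_ratio["wins"] / toy_ratio["loses"]
toy_ratio["total"] = toy_ratio["wins"] + toy_ratio["loses"]

MIN_GAMES = 100  # arbitrary cut-off, chosen only for this illustration

# Keep countries with enough games, then rank them by ratio.
ranked = (
    toy_ratio[toy_ratio["total"] >= MIN_GAMES]
    .sort_values("ratio", ascending=False)
)
print(ranked)
```

Here "AAA" has the best raw ratio but only 4 games, so it drops out once the filter is applied.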

Let's display the ratios of the 20 countries that played the most games. The red line marks a ratio of 1, where wins equal losses.

In [None]:
df_ratio.sort_values('total', ascending=False)['ratio'][:20] \
    .plot.bar(title="Ratios of the 20 countries with the most games played",
              figsize=(20,10))
    
plt.axhline(y=1, color='r', linestyle='-')
plt.show()



So far, this exploration has only given us the global picture. Let's dive into how each country's ratio evolved over the years.


In [None]:
df_winners_evo = df_matches.groupby(['winner_ioc', 'year']).size().sort_values(ascending=False).reset_index()
df_losers_evo = df_matches.groupby(['loser_ioc', 'year']).size().sort_values(ascending=False).reset_index()

df_ratio_evo = pd.merge(
                df_winners_evo,
                df_losers_evo,
                how='inner',
                left_on=['winner_ioc','year'],
                right_on=['loser_ioc', 'year']
                )

df_ratio_evo.drop('loser_ioc', axis=1, inplace=True)
df_ratio_evo.columns = ['IOC', 'year', 'wins', 'loses']

df_ratio_evo['ratio'] = df_ratio_evo['wins'] / df_ratio_evo['loses']
df_ratio_evo['total'] = df_ratio_evo['wins'] + df_ratio_evo['loses']
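One thing to keep in mind with the inner merge on (country, year): a country that has only wins or only losses in a given year disappears for that year. The sketch below, on invented rows, compares it with an outer merge that keeps those years and fills the missing count with 0. The renaming to a shared `IOC` key is my own choice, not something the kernel does.

```python
import pandas as pd

# Invented rows: in 2001, USA only wins and BEL only loses.
toy_matches = pd.DataFrame({
    "year":       [2000, 2000, 2000, 2001, 2001],
    "winner_ioc": ["BEL", "BEL", "USA", "USA", "USA"],
    "loser_ioc":  ["USA", "USA", "BEL", "BEL", "BEL"],
})

wins = toy_matches.groupby(["winner_ioc", "year"]).size().reset_index(name="wins")
losses = toy_matches.groupby(["loser_ioc", "year"]).size().reset_index(name="loses")

# Same kind of merge as in the kernel: only pairs present on both sides survive.
inner = pd.merge(wins, losses, how="inner",
                 left_on=["winner_ioc", "year"],
                 right_on=["loser_ioc", "year"])

# Outer merge on a shared key keeps every (country, year) pair.
outer = pd.merge(
    wins.rename(columns={"winner_ioc": "IOC"}),
    losses.rename(columns={"loser_ioc": "IOC"}),
    how="outer", on=["IOC", "year"],
).fillna({"wins": 0, "loses": 0})

print(len(inner), len(outer))
```

With the outer version, a year without losses gives a division by zero when computing the ratio (pandas returns `inf`), so those rows need separate handling before plotting.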

In [None]:
top_country = df_ratio.sort_values('wins', ascending=False)[:10].index

df_ratio_evo_top = df_ratio_evo[df_ratio_evo['IOC'].isin(top_country)]

df_ratio_evo_top.set_index(['year', 'IOC'])['ratio'] \
                .unstack('IOC') \
                .plot(title="Ratio evolution of the 10 countries with the most wins",
                      figsize=(20,10))
        
plt.axhline(y=1, color='r', linestyle='-')
plt.axvline(2011, color='b', linestyle='-')
plt.show()        

We can see that Belgian women's tennis did very well and dominated the game from 2001 until 2011 (blue line). Justine Henin is a famous Belgian tennis player, and according to her Wikipedia page, "She retired from professional tennis on 26 January 2011, due to a chronic elbow injury." Coincidence? I think not!
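A rough way to test this hunch would be to recompute a country's yearly ratio after leaving out every match played by one player. The sketch below uses invented rows and assumes the matches table has `winner_name` and `loser_name` columns, which I have not checked against the real file. The helpers `yearly_ratio` and `without_player` are hypothetical names of mine.

```python
import pandas as pd

# Invented rows; "winner_name" and "loser_name" are assumed column names.
toy = pd.DataFrame({
    "year":        [2005, 2005, 2005, 2005],
    "winner_ioc":  ["BEL", "BEL", "USA", "BEL"],
    "loser_ioc":   ["USA", "USA", "BEL", "USA"],
    "winner_name": ["Player A", "Player A", "Player C", "Player B"],
    "loser_name":  ["Player C", "Player D", "Player B", "Player C"],
})

def yearly_ratio(matches, ioc):
    """Wins divided by losses per year for one country."""
    wins = matches[matches["winner_ioc"] == ioc].groupby("year").size()
    losses = matches[matches["loser_ioc"] == ioc].groupby("year").size()
    return wins.div(losses)  # NaN for years missing on either side

def without_player(matches, name):
    """Drop every match in which the given player took part."""
    mask = (matches["winner_name"] == name) | (matches["loser_name"] == name)
    return matches[~mask]

with_a = yearly_ratio(toy, "BEL")
without_a = yearly_ratio(without_player(toy, "Player A"), "BEL")
print(with_a.loc[2005], without_a.loc[2005])
```

If the country's curve stays well above 1 without the player, the "coincidence" explanation gets weaker. If it falls toward 1, the hunch holds up.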