# Roland Garros graph

This notebook collects and prepares the data for this visualization: https://bl.ocks.org/rhuille/8c1c29a182f186d3f17a1f67f90faf96

The purpose of this notebook is to build a dataset of the men's matches from the last seven [Roland Garros](http://www.rolandgarros.com/en_FR/news/index.html) tournaments. The preparation takes two steps:

- 1 - **I scrape the data from the *Equipe* newspaper website.** The data are then stored in a .csv file
- 2 - **I create a .json file representing the graph**

Equipe website page that I scrape: http://www.lequipe.fr/Tennis/SH_RG.html

*NB : If the Equipe website changes in the future, this notebook might not work anymore.*

## 1 - Scraping

In [3]:
from urllib import request
from bs4 import BeautifulSoup
from pandas import DataFrame

Here is the list of the *Equipe* newspaper URLs that I scrape:

In [4]:
list_tournois = {
    "2016": "http://www.lequipe.fr/Tennis/SH_RG.html",
    "2015": "http://www.lequipe.fr/Tennis/TennisTableauTournoi5637.html",
    "2014": "http://www.lequipe.fr/Tennis/TennisTableauTournoi4939.html",
    "2013": "http://www.lequipe.fr/Tennis/TennisTableauTournoi4456.html",
    "2012": "http://www.lequipe.fr/Tennis/TennisTableauTournoi3687.html",
    "2011": "http://www.lequipe.fr/Tennis/TennisTableauTournoi3114.html",
    "2010": "http://www.lequipe.fr/Tennis/TennisTableauTournoi2672.html",
}

In [5]:
def is_joueur(tag):
    """Return True for tags that carry a player's full name ('nomcomplet')."""
    return tag.has_attr('nomcomplet')

The data will be stored in six lists: winners, losers, years, rounds, and the scores of each side.
Here is the hierarchical structure of the Equipe website table:
- table
    - rounds
        - matches
            - players
            - scores

In [6]:
winners = []
losers = []
years = []
scores_winners = []
scores_losers = []
rounds = []

for year, url in list_tournois.items():
    data = request.urlopen(url).read()
    soup = BeautifulSoup(data, 'html.parser')

    table = soup.find(id="rateau")  # HTML element containing all the data we want

    # First split the table into rounds.
    dict_rounds = {}
    for i in [4, 5, 6]:  # Only the last three rounds are kept; more could be taken.
        dict_rounds[i] = table.findAll("div", attrs={"numtour": str(i)})[0]

    # Then split each round into its matches.
    dict_match = {}
    for i in dict_rounds:
        dict_match[i] = dict_rounds[i].findAll("div", attrs={"class": "joueurs"})

    for i in dict_match:  # dict_match[i] is the list of matches in round i
        for k in dict_match[i]:  # k is a match
            years.append(year)
            rounds.append(i)
            for j in k.findAll("a"):  # each link holds one player
                if "gagne" in j["class"]:  # the "gagne" class marks the winner
                    winners.append(j.findAll(is_joueur)[0]["nomcomplet"])
                    scores_winners.append(j.findAll(attrs={"class": "score"})[0].text)
                else:
                    losers.append(j.findAll(is_joueur)[0]["nomcomplet"])
                    scores_losers.append(j.findAll(attrs={"class": "score"})[0].text)

Now we can build the DataFrame from our lists:

In [7]:
df = DataFrame({"Winners": winners, "Losers": losers, "Years" : years, "Rounds" : rounds,
                "Scores_Winners": scores_winners, "Scores_Losers": scores_losers})
df.head()

Unnamed: 0,Losers,Rounds,Scores_Losers,Scores_Winners,Winners,Years
0,NISHIKORI,4,61601,2676,MURRAY,2016
1,CILIC,4,331,666,WAWRINKA,2016
2,CARRENO BUSTA,4,20 ab.,62,NADAL,2016
3,DJOKOVIC,4,6530,766,THIEM,2016
4,MURRAY,5,7637631,66576,WAWRINKA,2016
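
Building the DataFrame only works when all six lists have the same length. The scraping loop adds a year and a round for every match, but a winner only when a link carries the `gagne` class, so a match without such a link would leave the lists misaligned. A small guard like the one below could be run just before the DataFrame is built. This is a sketch, and `check_same_length` is a hypothetical helper, not part of the original notebook.

```python
def check_same_length(**columns):
    """Return the common length of the given lists, or raise if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


# Two-row example taken from the first rows shown above.
n_rows = check_same_length(
    Winners=["MURRAY", "WAWRINKA"],
    Losers=["NISHIKORI", "CILIC"],
    Years=["2016", "2016"],
)
print(n_rows)
```

With the real data, the call would take the six lists `winners`, `losers`, `years`, `rounds`, `scores_winners` and `scores_losers`.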


In [8]:
df.to_csv("rolland_garros.csv", index=False)

## 2 - Creation of the graph

In [9]:
from pandas import read_csv
import json

In [10]:
df = read_csv("rolland_garros.csv")
df.head()  # NB: only the "Losers" and "Winners" columns will be used

Unnamed: 0,Losers,Rounds,Scores_Losers,Scores_Winners,Winners,Years
0,NISHIKORI,4,61601,2676,MURRAY,2016
1,CILIC,4,331,666,WAWRINKA,2016
2,CARRENO BUSTA,4,20 ab.,62,NADAL,2016
3,DJOKOVIC,4,6530,766,THIEM,2016
4,MURRAY,5,7637631,66576,WAWRINKA,2016


In [11]:
df.shape

(49, 6)

In [12]:
# One node per player, with the number of matches won and lost.
nodes = []
for player in set(df["Losers"]).union(df["Winners"]):
    nodes.append({"id": player,
                  "victory": len(df[df.Winners == player]),
                  "defeat": len(df[df.Losers == player])})
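
The node construction can also be sketched without pandas, which makes the win and loss counts easy to check by hand. The example below is an alternative formulation using `collections.Counter`, not the code this notebook runs. The `matches` list holds the five (winner, loser) pairs shown in the `df.head()` output, and the nodes are sorted by name only to make the output reproducible.

```python
from collections import Counter

# (winner, loser) pairs copied from the five rows displayed by df.head().
matches = [
    ("MURRAY", "NISHIKORI"),
    ("WAWRINKA", "CILIC"),
    ("NADAL", "CARRENO BUSTA"),
    ("THIEM", "DJOKOVIC"),
    ("WAWRINKA", "MURRAY"),
]

victories = Counter(winner for winner, _ in matches)
defeats = Counter(loser for _, loser in matches)

# One node per player; a Counter returns 0 for players it has never seen.
nodes = [
    {"id": player, "victory": victories[player], "defeat": defeats[player]}
    for player in sorted(set(victories) | set(defeats))
]

for node in nodes:
    print(node)
```

On these five rows, MURRAY appears once as winner and once as loser, so his node carries one victory and one defeat.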

In [13]:
# Count how many times each (winner, loser) pairing occurred.
counter = {}
for j in range(df.shape[0]):
    pair = (df.Winners[j], df.Losers[j])
    counter[pair] = counter.get(pair, 0) + 1

In [14]:
# One link per pair of players: if both (A, B) and (B, A) occur in counter,
# they are merged into a single link holding the victories of each player.
links = {}
for i in counter:
    if (i[1], i[0]) not in links:
        links[i] = {'source': i[0], 'target': i[1],
                    'victory ' + i[0]: counter[i],
                    'victory ' + i[1]: counter.get((i[1], i[0]), 0)}
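
To see how two opposite pairings collapse into a single link, here is a small self-contained sketch of the same idea. The players `A`, `B` and `C` and their results are placeholders invented for the illustration, and the counting is done with `collections.Counter` rather than the manual dictionary used above.

```python
from collections import Counter

# Placeholder results as (winner, loser): A beat B twice, B beat A once,
# and A beat C once.
matches = [("A", "B"), ("B", "A"), ("A", "B"), ("A", "C")]
counter = Counter(matches)

links = {}
for (winner, loser), wins in counter.items():
    if (loser, winner) in links:
        continue  # this pair is already covered by the reverse entry
    links[(winner, loser)] = {
        "source": winner,
        "target": loser,
        "victory " + winner: wins,
        # Victories in the other direction, 0 if that pairing never occurred.
        "victory " + loser: counter.get((loser, winner), 0),
    }

for link in links.values():
    print(link)
```

The three distinct pairings produce two links, and the link between `A` and `B` records both head-to-head counts.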

In [15]:
with open('rolland.json', 'w') as f:
    json.dump({"nodes": nodes, "links": list(links.values())}, f)
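
The resulting file has two top-level keys, `nodes` and `links`. As a quick sanity check before handing the file to the visualization, one can verify that every link refers to existing node ids. The sketch below does this on a two-player extract whose counts are derived from the five rows shown earlier, not from the full dataset, and it round-trips through a JSON string instead of a file.

```python
import json

# Small extract in the same shape as rolland.json (counts from the five
# displayed rows only).
graph = {
    "nodes": [
        {"id": "WAWRINKA", "victory": 2, "defeat": 0},
        {"id": "MURRAY", "victory": 1, "defeat": 1},
    ],
    "links": [
        {"source": "WAWRINKA", "target": "MURRAY",
         "victory WAWRINKA": 1, "victory MURRAY": 0},
    ],
}

# Serialize and parse back, as a reader of the file would.
loaded = json.loads(json.dumps(graph))

node_ids = {node["id"] for node in loaded["nodes"]}
dangling = [
    link for link in loaded["links"]
    if link["source"] not in node_ids or link["target"] not in node_ids
]
print(len(dangling))
```

An empty `dangling` list means every link endpoint has a matching node.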