# Exploring Women's Tennis Players

In this notebook we explore WTA (Women's Tennis Association) tennis players. This dataset has data going back all the way to the beginning of the organization, as far as records are kept, and I'm interested to see what historical information is in the data!

## Players

First things first, the most immediate table in the dataset is `players.csv`. This dataset has some cleaning issues; the following code block steps through them.

In [None]:
import pandas as pd
players = pd.read_csv("../input/wta/players.csv", encoding='latin1', index_col=0)

# Top column is misaligned.
players.index.name = 'ID'
players.columns = ['First' , 'Last', 'Handidness', 'DOB', 'Country']

# Parse date data to dates.
players = players.assign(DOB=pd.to_datetime(players['DOB'], format='%Y%m%d'))

# Handidness is reported as U if unknown; set np.nan instead.
import numpy as np
players = players.assign(Handidness=players['Handidness'].replace('U', np.nan))

players.head()
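
One caveat about the date parsing step: with a strict `format`, `pd.to_datetime` raises on any value that does not match, and historical birth dates are often incomplete. Here is a minimal sketch on made-up values (invented for illustration, not taken from `players.csv`) that uses `errors='coerce'` to turn unparseable entries into `NaT` instead of failing:

```python
import pandas as pd

# Toy stand-in for the DOB column; these values are invented for illustration.
dob_raw = pd.Series(['19811026', '19800617', None, '19990000'])

# errors='coerce' maps missing or impossible dates (e.g. month 00) to NaT
# rather than raising an exception.
dob = pd.to_datetime(dob_raw, format='%Y%m%d', errors='coerce')

print(dob)
print('unparseable:', dob.isna().sum())
```

Counting the `NaT` values afterwards shows how many birth dates were lost to the coercion.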

There's a well-known supposition in tennis that left-handed = better. What does the historical data say?

In [None]:
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
players.Handidness.value_counts(dropna=False).plot.bar(figsize=(12, 6),
                                                       title='WTA Player Handedness')

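Unfortunately, too many records are unknown to confirm much! A head count of players would not settle whether left-handers are *better* in any case; that question needs match results.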
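
To actually test the supposition, one would compare results rather than head counts. The sketch below does this on a toy table. The column names `winner_hand` and `loser_hand` are assumptions about what a matches table might contain, and the rows are made up:

```python
import pandas as pd

# Toy matches table; `winner_hand` and `loser_hand` are assumed column names
# and the rows are invented for illustration.
toy = pd.DataFrame({
    'winner_hand': ['L', 'R', 'R', 'L', 'R', 'U'],
    'loser_hand':  ['R', 'R', 'L', 'R', 'R', 'R'],
})

# Only matches between a left-hander and a right-hander are informative.
mixed = toy[
    ((toy.winner_hand == 'L') & (toy.loser_hand == 'R'))
    | ((toy.winner_hand == 'R') & (toy.loser_hand == 'L'))
]

# Share of those matches won by the left-hander.
lefty_win_rate = (mixed.winner_hand == 'L').mean()
print(f'{len(mixed)} L-vs-R matches, lefty win rate {lefty_win_rate:.2f}')
```

A rate meaningfully above 0.5 on real data would support the supposition, though unknown handedness would still shrink the usable sample.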

Next let's look at the trend of WTA player birth years.

In [None]:
players.set_index('DOB').resample('Y').count().Country.plot.line(
    linewidth=1, 
    figsize=(12, 4),
    title='WTA Player Year of Birth'
)

This data suggests that the number of professional women's tennis players has never been higher than it is today. Which countries are the most heavily represented?

In [None]:
players.Country.value_counts().head(20).plot.bar(
    figsize=(12, 6),
    title='WTA Player Country Representing'
)

The United States is far and away the world leader when it comes to producing tennis players! Other big tennis nations are found in Europe, along with Japan and Australia. Japan aside, Asian countries like India (e.g. Sania Mirza) and China (e.g. Li Na) have a growing representation.

## Matches

The next table is the matches table. This one has a lot of missing values.

In [None]:
matches = pd.read_csv("../input/wta/matches.csv", encoding='latin1', index_col=0)
matches.head(3)

We can take a look at who wins matches.

In [None]:
matches['winner_name'].value_counts().head(20).plot.bar(
    figsize=(12, 4),
    title='WTA Players with Most Matches Won'
)

And who loses them.

In [None]:
matches['loser_name'].value_counts().head(20).plot.bar(
    figsize=(12, 4),
    title='WTA Players with Most Matches Lost'
)

Interestingly enough, the lists of biggest match winners and biggest match losers are very different. Martina Navratilova is far and away the leader in terms of all matches played; she was a multiple Grand Slam champion who was active from 1975 to, incredibly, 2006 (a 31-year career!). Many of the others on the list of biggest match winners are also all-time greats: Serena and Venus Williams, Lindsay Davenport, Steffi Graf, Monica Seles...

The biggest losers are usually also formidable players, but more grindy ones a notch or more below the biggest winners. Ai Sugiyama, who has the unenviable distinction of losing more matches than any other female tennis player ever, nevertheless has three Grand Slam wins (albeit in doubles) and has won 43 individual titles. Jelena Jankovic, the seventh biggest loser, spent some time as the number 1 ranked WTA player in the world.

In [None]:
pd.concat([matches['winner_name'], matches['loser_name']]).value_counts().head(20).plot.bar(
    figsize=(12, 4),
    title='WTA Players with Most Matches Played'
)

The chart for most matches played mixes these two sets up quite nicely.

What does an average tennis career look like? It turns out that three quarters of players who have contested WTA matches *never made it to 20 matches played*.

In [None]:
(pd.concat([matches['winner_name'], matches['loser_name']]).value_counts() < 20).astype(int).sum()

In [None]:
pd.Series(
    [(pd.concat([matches['winner_name'], matches['loser_name']]).value_counts() < 20)\
         .astype(int).sum(),
    (pd.concat([matches['winner_name'], matches['loser_name']]).value_counts() >= 20)\
         .astype(int).sum()],
    index=['No', 'Yes']
).plot.bar(title='Played At Least 20 Matches?')
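
The "three quarters" figure can be read off directly as a fraction, since the mean of a boolean mask is the share of `True` values. A minimal sketch on toy data follows. The names and the lowered threshold are made up so that the example stays readable:

```python
import pandas as pd

# Toy stand-ins for matches['winner_name'] and matches['loser_name'].
winners = pd.Series(['A', 'A', 'A', 'B', 'A'])
losers = pd.Series(['B', 'C', 'D', 'C', 'E'])

# Matches played per player: each appearance as winner or loser counts once.
matches_played = pd.concat([winners, losers]).value_counts()

# The notebook uses a threshold of 20; 3 keeps the toy example small.
threshold = 3
share_below = (matches_played < threshold).mean()

print(matches_played)
print('share of players below threshold:', share_below)
```

On the real table, the same expression with `threshold = 20` gives the share of players who never reached 20 matches.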

In [None]:
(pd.concat([matches['winner_name'], matches['loser_name']])
     .value_counts()
     .where(lambda v: v >= 20)
     .dropna()
).plot.hist(
    bins=100,
    figsize=(12, 4),
    title='WTA Career Length'
)

Removing the `<20` players to keep the axis from being skewed too heavily, we get the chart above. The average tennis player will play upwards of 70 competitive matches per season, so 200 matches is perhaps 3 years in a career. Overall, it seems that many players do not make it very far into a career before calling it in. The line chart below, which tracks maximums, emphasizes this:

In [None]:
(pd.concat([matches['winner_name'], matches['loser_name']])
     .value_counts(ascending=True)
     .cummax()  # running maximum of matches played
     .reset_index(drop=True)
).plot.line()

A brief note on format: recall that this table has quite a few missing values.

In [None]:
import missingno as msno
plt.rcdefaults()
msno.matrix(matches.head(500))
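
The nullity matrix only covers the first 500 rows. As a complementary check, the share of missing values per column can be computed over a whole table. Here is a minimal sketch on a toy frame. The column names echo ones used in this notebook, but the values are made up:

```python
import numpy as np
import pandas as pd

# Toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    'winner_name': ['A', 'B', 'C', 'D'],
    'winner_seed': [1, np.nan, np.nan, 3],
    'loser_seed': [np.nan, np.nan, np.nan, 8],
})

# isna() gives a boolean frame; the column-wise mean is the share of missing values.
null_share = toy.isna().mean().sort_values(ascending=False)
print(null_share)
```

Applied to `matches`, this would rank the columns from emptiest to fullest.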

There's lots more to explore in this table. For fun, here's a quick look at how often higher-ranked seeds defeat lower-ranked or unseeded players in tourneys:

In [None]:
plt.style.use('fivethirtyeight')

# Seed 1 is the strongest seed, so a *smaller* number means a higher-ranked seed.
# Missing or non-numeric seeds are treated as unseeded, i.e. worse than any seed (inf).
seeds = (matches[['winner_seed', 'loser_seed']]
     .apply(pd.to_numeric, errors='coerce')
     .fillna(np.inf)
)

(seeds
     # Skip matches between two unseeded players: there is no favourite to compare.
     .loc[lambda df: np.isfinite(df.winner_seed) | np.isfinite(df.loser_seed)]
     .pipe(lambda df: df.winner_seed < df.loser_seed)
     .value_counts()
).plot.bar(title='Higher Ranked Seed Won Match')

## Qualifying matches

Qualifying matches are matches which are played in the early stages of a tournament for spots in the main draw. Usually a handful of main draw slots are set aside for qualifiers, who must defeat other qualifiers to win a spot in the tournament. In a sense, these are a tournament within the tournament. Bigger tournaments, like Grand Slams, have bigger qualifier rounds as well!

The qualifiers dataset contains similar data to the matches dataset, but specific to tourney qualifier rounds. However, a handful of columns are basically never filled and therefore useless, as demonstrated below.

In [None]:
qual = pd.read_csv("../input/wta/qualifying_matches.csv")
qual.head()

In [None]:
qual.shape

In [None]:
plt.rcdefaults()
msno.matrix(qual.head(500))

I will omit any further commentary on this part of the data because it's pretty similar to the mainline `matches` info.

## Rankings

The rankings table lists the rankings achieved by the various players for various dates. The dates provided are those at the ends of the weeks, so this data can be used to e.g. track the rise and fall of specific players or groups of players through the rankings!

In [None]:
rankings = pd.read_csv("../input/wta/rankings.csv")
rankings.head()
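
To track several players side by side, the long table can be reshaped so that each player becomes a column. The sketch below uses the `ranking_date`, `player_id`, and `ranking` columns from this table, but the rows and IDs are made up:

```python
import pandas as pd

# Toy rankings in long format; IDs and ranks are invented for illustration.
toy = pd.DataFrame({
    'ranking_date': [20000103, 20000103, 20000110, 20000110],
    'player_id': [1, 2, 1, 2],
    'ranking': [1, 2, 2, 1],
})

# Parse the integer dates, then pivot to one row per date and one column per player.
wide = (toy
    .assign(ranking_date=pd.to_datetime(toy.ranking_date.astype(str), format='%Y%m%d'))
    .pivot(index='ranking_date', columns='player_id', values='ranking')
)
print(wide)
```

Calling `wide.plot.line()` on such a frame would draw one line per player; inverting the y-axis puts rank 1 at the top.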

The rankings are, with certain exceptions, measurements made at the end of the week, going back as far as ~1994.

Player rankings are a function of how well they did in matches and tournaments that they played. Bigger tournaments award more points than smaller ones. Winning tournaments awards more points than being the second-place finalist, which awards more than being a semifinalist, and so on down the ladder. Points earned from winning something expire one year after they are earned (e.g. the next time the event is held).
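
The expiry rule can be illustrated with a rolling 52-week window. This is a deliberately simplified sketch: the real WTA system has further rules (for example, about which results count), and the results table and the `points_on` helper below are hypothetical:

```python
import pandas as pd

# Made-up results for one hypothetical player: event date and points earned.
results = pd.DataFrame({
    'date': pd.to_datetime(['2000-01-17', '2000-06-05', '2001-01-22', '2001-06-04']),
    'points': [500, 1000, 100, 300],
})

def points_on(results, as_of):
    """Sum the points earned in the 52 weeks up to and including `as_of`."""
    as_of = pd.Timestamp(as_of)
    window_start = as_of - pd.Timedelta(weeks=52)
    live = results[(results.date > window_start) & (results.date <= as_of)]
    return int(live.points.sum())

print(points_on(results, '2000-12-31'))  # both 2000 results still count
print(points_on(results, '2001-03-01'))  # the January 2000 result has expired
```

The second total is lower than the first because the 500 points from January 2000 were replaced by only 100 a year later, which is how a player can drop in the rankings after a comparatively poor result.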

The rankings table includes roughly the top 1200 players, with some variance. Here's one ranking plot:

In [None]:
plt.style.use('fivethirtyeight')
rankings[rankings['ranking_date'] == 20000101].ranking.sort_values().reset_index(drop=True).plot.line()

Here is the number of rankings available in each historical slice:

In [None]:
plt.style.use('fivethirtyeight')
rankings['ranking_date'].value_counts().sort_index().plot.line(linewidth=1.5)

Ranking points are roughly exponentially distributed. This likely follows from the way tournament points are awarded, which is itself roughly exponential: there are a lot of players winning matches in marginal tournaments and picking up 50 points or so apiece, while the game's masters can make 1000 in one tournament!

In [None]:
rankings['ranking_points'].plot.hist(bins=100)
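
A quick way to sanity-check the "exponential" reading without eyeballing a histogram: for an exponential distribution the standard deviation equals the mean, and the median is ln(2) times the mean. The sketch below demonstrates the check on synthetic data, used as a stand-in for `rankings['ranking_points']`. The scale of 200 is an arbitrary assumption:

```python
import numpy as np
import pandas as pd

# Synthetic exponential sample standing in for the real ranking points.
rng = np.random.default_rng(0)
points = pd.Series(rng.exponential(scale=200, size=10_000))

# Both ratios should be close to their theoretical values for exponential data.
std_over_mean = points.std() / points.mean()          # theory: 1
median_over_mean = points.median() / points.mean()    # theory: ln(2), about 0.693

print('std / mean:', std_over_mean)
print('median / mean:', median_over_mean)
```

If the same ratios computed on the real column land far from 1 and 0.693, the distribution is heavy-tailed in some other way.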

Just as an example, here are the rankings and points that Serena Williams, a (possibly the) WTA all-time great, has garnered over time.

In [None]:
serena_williams = (rankings.query('player_id == "200033"')
     .pipe(lambda df: df.assign(ranking_date=pd.to_datetime(df.ranking_date, format='%Y%m%d', errors='coerce')))
     .set_index('ranking_date')
     .loc[:, ['ranking', 'ranking_points']]
)

In [None]:
fig, axarr = plt.subplots(1, 2, figsize=(12, 6))
fig.suptitle('Serena Williams Rank (L) and Points (R) Over Time')

serena_williams['ranking'].plot.line(ax=axarr[0], linewidth=2, color='steelblue')
axarr[0].set_ylim(0, 20)

serena_williams['ranking_points'].plot.line(ax=axarr[1], linewidth=2, color='steelblue')
pass

## Conclusion

This is a fantastic dataset with a lot of exploratory potential. Hopefully this notebook has given you some ideas of further exploration you can do with it!