# Initial Data Analysis of a Typical Data Table

This notebook presents an initial analysis of the "atp_matches_2010.csv" table, which contains data on every match played on the ATP World Tour in 2010. I started with a single table to simplify the initial analysis; the other tables in the database have the same structure.

In [6]:
import numpy as np
import pandas as pd

table = pd.read_csv('atp_matches_2010.csv')

table.head(10)

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,...,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
0,2010-339,Brisbane,Hard,32,A,20100103,1,104053,1.0,,...,34.0,29.0,11.0,10.0,3.0,5.0,7.0,4410.0,77.0,598.0
1,2010-339,Brisbane,Hard,32,A,20100103,2,104958,,WC,...,34.0,22.0,14.0,9.0,7.0,10.0,134.0,400.0,78.0,590.0
2,2010-339,Brisbane,Hard,32,A,20100103,3,104755,,,...,58.0,38.0,14.0,14.0,7.0,11.0,52.0,850.0,88.0,568.0
3,2010-339,Brisbane,Hard,32,A,20100103,4,105051,,Q,...,29.0,16.0,15.0,9.0,2.0,5.0,285.0,151.0,28.0,1260.0
4,2010-339,Brisbane,Hard,32,A,20100103,5,104607,4.0,,...,41.0,26.0,14.0,9.0,6.0,9.0,20.0,1655.0,251.0,179.0
5,2010-339,Brisbane,Hard,32,A,20100103,6,104571,,,...,36.0,28.0,20.0,12.0,3.0,6.0,42.0,945.0,55.0,790.0
6,2010-339,Brisbane,Hard,32,A,20100103,7,103248,,,...,36.0,17.0,3.0,7.0,5.0,10.0,119.0,461.0,67.0,649.0
7,2010-339,Brisbane,Hard,32,A,20100103,8,105064,8.0,,...,38.0,20.0,17.0,12.0,2.0,8.0,36.0,1021.0,73.0,611.0
8,2010-339,Brisbane,Hard,32,A,20100103,9,103484,,,...,52.0,41.0,16.0,14.0,3.0,5.0,44.0,935.0,25.0,1320.0
9,2010-339,Brisbane,Hard,32,A,20100103,10,102967,,,...,21.0,15.0,10.0,8.0,1.0,5.0,58.0,744.0,80.0,587.0


### Structure and Information

First, let's identify the information contained within such a table.  
  
Each row represents one match. There are **3030 rows**.  
  
There are **49 columns** (described in *matches_data_dictionary.txt*). These columns can be categorized as follows:
- information relative to the tournament
    - *tourney_id : id of the tournament (the first four characters are the year; the rest has no predictable structure)*
    - tourney_name : name of the tournament
    - surface : Hard, Clay or Grass
    - *draw_size : number of players in the tournament*
    - tourney_level : G (Grand Slam), M (Masters 1000s), A (Other tour-level events), ...
    - tourney_date : YYYYMMDD
<br/><br/>
- information relative to the match
    - *match_num*
    - score : e.g. 6-3 3-6 7-6(4)
    - best_of : 3 or 5
    - round : R32, R16, QF, SF, F
    - **minutes : length of the match in minutes, the value we are trying to predict**
<br/><br/>
- information relative to the winner
    - winner_id : can be used to find the winner in *atp_players.csv*
    - *winner_seed : seed of the winner coming into the tournament, missing for most players*
    - *winner_entry*
    - winner_name
    - winner_hand : R or L
    - winner_ht : height in cm
    - winner_ioc : three-character country code
    - winner_age
    <br/><br/>
    - winner_rank : (most recent before match) ATP ranking
    - winner_rank_points : ATP points
<br/><br/>
- information relative to the loser
    - loser_id
    - loser_seed
    - *loser_entry*
    - loser_name
    - loser_hand
    - loser_ht
    - loser_ioc
    - loser_age
    <br/><br/>
    - loser_rank
    - loser_rank_points
<br/><br/>
- match statistics
    - w_ace : winner's number of aces
    - w_df : winner's number of double faults
    - w_svpt : winner's number of serve points
    - w_1stIn : winner's number of first serves made
    - w_1stWon : winner's number of first-serve points won
    - w_2ndWon : winner's number of second-serve points won
    - w_SvGms : winner's number of serve games
    - w_bpSaved : winner's number of break points saved
    - w_bpFaced : winner's number of break points faced
    <br/><br/>
    - l_ace
    - l_df
    - l_svpt
    - l_1stIn
    - l_1stWon
    - l_2ndWon
    - l_SvGms
    - l_bpSaved
    - l_bpFaced


tourney_id
- a unique identifier for each tournament, such as 2020-888. The exact formats are borrowed from several different sources, so while the first four characters are always the year, the rest of the ID doesn't follow a predictable structure.
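
Because only the first four characters are guaranteed to be the year, a safe way to read a `tourney_id` is to slice rather than to rely on a separator. The helper below is a minimal sketch; `split_tourney_id` is a hypothetical name, and stripping a leading hyphen from the remainder is a convenience that assumes ids shaped like the ones shown in this notebook.

```python
def split_tourney_id(tourney_id):
    """Split an id such as '2010-339' into (year, remainder)."""
    year = int(tourney_id[:4])           # the first four characters are the year
    remainder = tourney_id[4:].lstrip("-")  # no predictable structure
    return year, remainder


print(split_tourney_id("2010-339"))
```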

tourney_name
surface
draw_size
- number of players in the draw, often rounded up to the nearest power of 2. (For instance, a tournament with 28 players may be shown as 32.)
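
The rounding described above can be sketched as follows. `next_power_of_two` is a hypothetical helper; since the dictionary only says the value is *often* rounded up, this shows the arithmetic, not a rule the data is guaranteed to follow.

```python
def next_power_of_two(players):
    """Smallest power of two that is greater than or equal to players."""
    if players < 1:
        raise ValueError("players must be a positive integer")
    return 1 << (players - 1).bit_length()


for players in (28, 32, 48):
    print(players, "->", next_power_of_two(players))
```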

tourney_level
- For men: 'G' = Grand Slams, 'M' = Masters 1000s, 'A' = other tour-level events, 'C' = Challengers, 'S' = Satellites/ITFs, 'F' = Tour finals and other season-ending events, and 'D' = Davis Cup 
- For women, there are several additional tourney_level codes, including 'P' = Premier, 'PM' = Premier Mandatory, and 'I' = International. The various levels of ITFs are given by the prize money (in thousands), such as '15' = ITF $15,000. Other codes, such as 'T1' for Tier I (and so on) are used for older WTA tournament designations. 'D' is used for Federation/Fed/Billie Jean King Cup, and also for Wightman Cup and Bonne Bell Cup.

- Others, eventually for both genders: 'E' = exhibition (events not sanctioned by the tour, though the definitions can be ambiguous), 'J' = juniors, and 'T' = team tennis, which does not yet appear anywhere in the dataset but will at some point.
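
For the men's data used in this notebook, the level codes can be turned into readable labels with a small lookup. The sketch below covers only the men's codes and the shared codes listed above; `LEVEL_LABELS` and `describe_level` are hypothetical names, and unknown codes are reported, not guessed.

```python
# Men's codes plus the codes shared by both tours, as described above.
LEVEL_LABELS = {
    "G": "Grand Slam",
    "M": "Masters 1000",
    "A": "other tour-level event",
    "C": "Challenger",
    "S": "Satellite/ITF",
    "F": "Tour finals or other season-ending event",
    "D": "Davis Cup",
    "E": "exhibition",
    "J": "juniors",
    "T": "team tennis",
}


def describe_level(code):
    """Return a readable label for a tourney_level code."""
    return LEVEL_LABELS.get(code, f"unknown level {code!r}")


print(describe_level("A"))
```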

tourney_date
- eight digits, YYYYMMDD, usually the Monday of the tournament week.
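
Parsing the eight-digit value into a real date makes it possible to check the weekday. The sketch below uses the standard library; `parse_tourney_date` is a hypothetical helper, and the sample value is the Brisbane date from the first rows of the table.

```python
from datetime import datetime


def parse_tourney_date(value):
    """Convert a YYYYMMDD value (int or str) into a datetime.date."""
    return datetime.strptime(str(value), "%Y%m%d").date()


brisbane = parse_tourney_date(20100103)
print(brisbane.isoformat())
print(brisbane.weekday())  # Monday is 0, Sunday is 6
```

For this tournament the date falls on a Sunday, which is consistent with the date being only *usually* the Monday of the tournament week. On the whole column, `pd.to_datetime(table['tourney_date'].astype(str), format='%Y%m%d')` should perform the same conversion.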

match_num
- a match-specific identifier. Often starting from 1, sometimes counting down from 300, and sometimes arbitrary. 

winner_id
- the player_id used in this repo for the winner of the match

winner_seed
winner_entry
- 'WC' = wild card, 'Q' = qualifier, 'LL' = lucky loser, 'PR' = protected ranking, 'ITF' = ITF entry, and there are a few others that are occasionally used.
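
The entry codes lend themselves to the same kind of lookup. This is a sketch under two assumptions: `describe_entry` is a hypothetical helper, and an empty cell (as in most of the rows shown earlier) is simply reported as having no code, without reading any meaning into it.

```python
ENTRY_CODES = {
    "WC": "wild card",
    "Q": "qualifier",
    "LL": "lucky loser",
    "PR": "protected ranking",
    "ITF": "ITF entry",
}


def describe_entry(code):
    """Return a readable label for a winner_entry or loser_entry value."""
    # None, or NaN (the only value that is not equal to itself).
    if code is None or code != code:
        return "no entry code recorded"
    return ENTRY_CODES.get(code, f"other entry code {code!r}")


print(describe_entry("WC"))
print(describe_entry(float("nan")))
```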

winner_name
winner_hand
- R = right, L = left, U = unknown. For ambidextrous players, this is their serving hand.
winner_ht
- height in centimeters, where available

winner_ioc
- three-character country code

winner_age
- age, in years, as of the tourney_date

loser_id
loser_seed
loser_entry
loser_name
loser_hand
loser_ht
loser_ioc
loser_age
score
best_of
- '3' or '5', indicating the maximum number of sets for this match
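
Since `score` and `best_of` together describe how long a match could and did last, it can help to parse the score string into sets. The sketch below handles scores in the format of the example given earlier (`6-3 3-6 7-6(4)`); `parse_score` and `sets_needed_to_win` are hypothetical helpers, and silently skipping tokens that are not set scores (for instance a retirement marker) is an assumption about how such cases appear in the data.

```python
import re

# One set such as "6-3" or "7-6(4)"; the tie-break detail is discarded.
SET_PATTERN = re.compile(r"^(\d+)-(\d+)(?:\(\d+\))?$")


def parse_score(score):
    """Return a list of (games, games) tuples, one per completed set token."""
    sets = []
    for token in score.split():
        match = SET_PATTERN.match(token)
        if match:
            sets.append((int(match.group(1)), int(match.group(2))))
    return sets


def sets_needed_to_win(best_of):
    """Number of sets a player must win in a best-of-N match."""
    return best_of // 2 + 1


sets = parse_score("6-3 3-6 7-6(4)")
print(sets, len(sets), sets_needed_to_win(3))
```

The number of sets played and the total number of games are natural candidate features when predicting `minutes`.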

round
minutes
- match length, where available
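
"Where available" means some values are missing, so it is worth measuring how many before modelling. The sketch below defines a hypothetical `missing_share` helper; because no `minutes` values are visible in the excerpt above, it is demonstrated on the `winner_seed` values of the first ten rows shown earlier.

```python
import pandas as pd


def missing_share(column):
    """Fraction of missing values in a pandas Series."""
    return float(column.isna().mean())


# winner_seed for the first ten rows displayed by table.head(10).
winner_seed_head = pd.Series(
    [1.0, None, None, None, 4.0, None, None, 8.0, None, None]
)
print(missing_share(winner_seed_head))
```

On the full table the same call would be `missing_share(table['minutes'])`.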

w_ace
- winner's number of aces
w_df
- winner's number of double faults
w_svpt
- winner's number of serve points
w_1stIn
- winner's number of first serves made
w_1stWon
- winner's number of first-serve points won
w_2ndWon
- winner's number of second-serve points won
w_SvGms
- winner's number of serve games
w_bpSaved
- winner's number of break points saved
w_bpFaced
- winner's number of break points faced
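
The raw counts above are easier to compare across matches once they are turned into rates. The sketch below is an illustration, not part of the original analysis: it uses the loser-side values from the first three rows shown earlier, and the derived column names (`l_1stWon_rate`, `w_breaks`) are hypothetical. Breaks of serve achieved by the winner are taken as the break points the loser faced minus the ones the loser saved.

```python
import pandas as pd

# Loser-side statistics from the first three rows of table.head(10).
sample = pd.DataFrame({
    "l_1stIn": [34.0, 34.0, 58.0],
    "l_1stWon": [29.0, 22.0, 38.0],
    "l_bpSaved": [3.0, 7.0, 7.0],
    "l_bpFaced": [5.0, 10.0, 11.0],
})

# Share of first-serve points won by the loser.
sample["l_1stWon_rate"] = sample["l_1stWon"] / sample["l_1stIn"]

# Break points converted by the winner.
sample["w_breaks"] = sample["l_bpFaced"] - sample["l_bpSaved"]

print(sample[["l_1stWon_rate", "w_breaks"]])
```

The first-serve percentage would be computed the same way, as `l_1stIn / l_svpt`; it is left out here because `l_svpt` is not among the columns visible in the excerpt.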

l_ace
l_df
l_svpt
l_1stIn
l_1stWon
l_2ndWon
l_SvGms
l_bpSaved
l_bpFaced

winner_rank
- winner's ATP or WTA rank, as of the tourney_date, or the most recent ranking date before the tourney_date
winner_rank_points
- number of ranking points, where available
loser_rank
loser_rank_points
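
The two ranking columns can be combined into a single measure of how lopsided a match was expected to be. This is a sketch with hypothetical derived names (`rank_gap`, `upset`), using the ranking values from the first four rows shown earlier. A match is flagged as an upset when the winner's rank number is larger, that is, worse, than the loser's.

```python
import pandas as pd

# Ranking values from the first four rows of table.head(10).
ranks = pd.DataFrame({
    "winner_rank": [7.0, 134.0, 52.0, 285.0],
    "loser_rank": [77.0, 78.0, 88.0, 28.0],
})

# Negative gap: the better-ranked player won.
ranks["rank_gap"] = ranks["winner_rank"] - ranks["loser_rank"]
ranks["upset"] = ranks["winner_rank"] > ranks["loser_rank"]

print(ranks)
```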