## Exploring the Match Charting Project
The Match Charting Project is an awesome crowdsourced dataset, built by many volunteers charting individual tennis matches. It provides point-by-point data for most major ATP matches to date. Big ups to [Jeff Sackmann](http://www.tennisabstract.com/blog/2013/11/26/the-match-charting-project/) for co-ordinating this incredible effort and making the data public. There are plenty of insights to be drawn from it, and I encourage any avid tennis fan, or sports statistics fan, to try answering some interesting questions with this data.

I might start charting up some matches when Wimbledon comes round!

### Finding the most clutch Player
One of the perennial questions is: who is the most clutch player on tour? Many say it's Djokovic or Nadal, but I'd like some statistical evidence that it actually is.

First, we need to define what clutch means. According to Google, clutch means: "denoting or occurring at a critical situation in which the outcome of a game or competition is at stake."

A critical situation in tennis would occur when:
- The player is playing to save a game point/break point/set point/match point
- The player is playing to win a game point/break point/set point/match point

Obviously, saving a break point and saving a match point are different levels of clutchness. On top of that, not all matches carry the same amount of pressure, e.g. a Challenger first-round match vs. a Wimbledon final. Scaling by event, however, may skew the stats towards higher-ranked players, so we might do one version with event-scaling and one without.

Perhaps there should also be multipliers for players who survive critical situations several times in a row. For example, if you're 0-40 down and you bomb down 5 aces, each ace gives the player a clutch score, along with the multiplier for having done it 4 times in a row.

Likewise, if you're 40-0 up and you lose 5 consecutive points, that should decrease your clutch score.

Furthermore, we should distinguish how the point was won. Winning a long rally in a critical situation is probably more clutch than an ace. Also, ending the rally with a winner, as opposed to an opponent's unforced error, is more clutch.

All these factors should come into play when determining the clutch score for each point.
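The streak idea is not given concrete numbers here, so the following is only a sketch under assumed values. The hypothetical `streak_multiplier` adds an arbitrary 10% for each extra consecutive pressure point survived, and `score_streak` simply resets on a lost point rather than subtracting anything. Both names and the 10% step are illustrative assumptions, not part of the dataset or the final method.

```python
def streak_multiplier(streak_length, step=0.1):
    """Hypothetical multiplier: 1.0 for the first point, +step for each extra one in a row."""
    return 1.0 + step * (streak_length - 1)


def score_streak(outcomes, base=1.0):
    """Sum clutch scores over a sequence of pressure points.

    outcomes: list of booleans, True when the player survived the pressure point.
    A lost point resets the streak and (in this sketch) scores nothing.
    """
    total = 0.0
    streak = 0
    for survived in outcomes:
        if survived:
            streak += 1
            total += base * streak_multiplier(streak)
        else:
            streak = 0
    return total


# Five pressure points survived in a row: multipliers 1.0, 1.1, 1.2, 1.3, 1.4
five_in_a_row = score_streak([True] * 5)
# A miss in the middle resets the streak: 1.0 + 1.1, then 1.0 + 1.1 again
interrupted = score_streak([True, True, False, True, True])
print(round(five_in_a_row, 2), round(interrupted, 2))
```

A penalty for blowing a lead could be added in the `else` branch, but the size of that penalty would be yet another assumption.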

In [295]:
%%html
<style>
  table {margin-left: 0 !important;}
</style>

### Assumptions
With the factors above in mind, we'll need to start off with assumptions which we will use as weights for determining the clutch-points (cp for short). 

1. Let's borrow loss aversion from [behavioural economics](https://www.behavioraleconomics.com/resources/mini-encyclopedia-of-be/loss-aversion/): we feel losses about twice as much as gains.
2. Let's also assume there's twice as much pressure on a break point as on serve.

I've given saving a game/set/match point on a break point weights of 1/2/3 respectively, then halved them for winning, as per (1.).
Saving on serve is half of saving on a break point, as per (2.).
Winning on serve is half of that again, as per (1.).

| Win/Save     | Serve/Break | Game Point | Set Point | Match Point |
|--------------|-------------|------------|-----------|-------------|
| Win          | On Serve    | 0.25       | 0.5       | 0.75        |
| Win          | Break       | 0.5        | 1         | 1.5         |
| Save         | On Serve    | 0.5        | 1         | 1.5         |
| Save         | Break       | 1          | 2         | 3           |
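The whole table follows from the two assumptions, which a few lines of Python can confirm. This is a sketch only: `BASE_SAVE_ON_BREAK`, `WIN_FACTOR`, `ON_SERVE_FACTOR`, and `base_weight` are hypothetical names introduced for illustration.

```python
# Weight for saving a point in a break situation: game / set / match point
BASE_SAVE_ON_BREAK = {'Game Point': 1.0, 'Set Point': 2.0, 'Match Point': 3.0}
WIN_FACTOR = 0.5       # assumption (1.): a gain weighs half as much as a loss
ON_SERVE_FACTOR = 0.5  # assumption (2.): on serve carries half the pressure of a break point


def base_weight(win_or_save, serve_or_break, point_type):
    """Look up the base weight and apply the two halving rules."""
    weight = BASE_SAVE_ON_BREAK[point_type]
    if win_or_save == 'Win':
        weight *= WIN_FACTOR
    if serve_or_break == 'On Serve':
        weight *= ON_SERVE_FACTOR
    return weight


# Print the four rows in the same order as the table
for win_or_save in ('Win', 'Save'):
    for serve_or_break in ('On Serve', 'Break'):
        row = [base_weight(win_or_save, serve_or_break, pt) for pt in BASE_SAVE_ON_BREAK]
        print(win_or_save, serve_or_break, row)
```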

I also assigned factors to how the point was won.

| Rally   Length | Factor |
|----------------|--------|
| 0 to 3         | 1      |
| 4 to 7         | 1.1    |
| 8+             | 1.2    |

| Type of Win      | Factor |
|--------------------|--------|
| Unreturnable Serve | 1      |
| Rally Winner       | 1.1    |
| Swinging Volley    | 1.1    |
| Dropshot           | 1.2    |
| Half-Volley        | 1.2    |
| Trick shot         | 1.3    |

**For example**: if a player saved a match point with a half-volley dropshot after a 10-shot rally, they would get a clutch-point score of 3 (the Save/Break match point weight) × 1.2 (rally of 8+) × 1.2 (dropshot) × 1.2 (half-volley) = 5.184
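The worked example can be checked with a few lines of Python. This is a sketch only: `rally_factor`, `SHOT_FACTOR`, and `clutch_points` are hypothetical names, and treating a shot with two labels (half-volley and dropshot) as the product of both factors is an assumption read off the arithmetic in the example.

```python
import math

# Factors copied from the "Type of Win" table
SHOT_FACTOR = {
    'Unreturnable Serve': 1.0, 'Rally Winner': 1.1, 'Swinging Volley': 1.1,
    'Dropshot': 1.2, 'Half-Volley': 1.2, 'Trick shot': 1.3,
}


def rally_factor(rally_length):
    """Factor from the "Rally Length" table."""
    if rally_length <= 3:
        return 1.0
    if rally_length <= 7:
        return 1.1
    return 1.2


def clutch_points(base_weight, rally_length, shot_types):
    """Multiply the base weight by the rally factor and every shot-type factor."""
    score = base_weight * rally_factor(rally_length)
    for shot in shot_types:
        score *= SHOT_FACTOR[shot]
    return score


example = clutch_points(3, 10, ['Half-Volley', 'Dropshot'])
print(round(example, 3))  # 5.184
```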

It's a somewhat crude calculation, but it'll have to do, because being clutch is somewhat subjective.
I've chosen not to include factors like the significance of the event, among many other things, but I think this list of factors covers enough.

Anyway, let's move on to the data, shall we?

### Normalizing for how good a player is
I realized the data we're about to crunch will favour those who are inherently just good at the game, so we'd simply see the top players at the top of the rankings.

To normalize for this, I think we ought to calculate each player's overall point win-rate as well as their win-rate on pressure points, then subtract the former from the latter to give a normalized score.
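As a rough sketch of that normalization on made-up points (the helper names `win_rate` and `normalized_clutch` are hypothetical, and the toy player is not from the dataset):

```python
def win_rate(points):
    """Share of points won; points is a list of booleans (True = point won)."""
    return sum(points) / len(points)


def normalized_clutch(all_points, pressure_flags):
    """Pressure-point win rate minus overall win rate.

    all_points: booleans, one per point played (True = won).
    pressure_flags: booleans, True where that point was a pressure point.
    """
    pressure_points = [won for won, pressure in zip(all_points, pressure_flags) if pressure]
    return win_rate(pressure_points) - win_rate(all_points)


# Toy player: wins 6 of 10 points overall, but 3 of 4 pressure points
won = [True, False, True, True, False, True, False, True, True, False]
pressure = [False, False, True, False, False, True, True, False, True, False]
score = normalized_clutch(won, pressure)
print(round(score, 2))
```

A positive score means the player does better under pressure than their baseline, regardless of how strong that baseline is.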



In [296]:
import pandas as pd

In [297]:
# df=pd.read_csv('samplepoints.csv', engine='python')
df=pd.read_csv('C:/Users/William Jiang/Documents/tennis_MatchChartingProject/charting-m-points.csv', engine='python')
# df=df.head(1000)

In [298]:
# df=pd.read_csv('poo.csv', engine='python')

## Data Cleansing
Let's isolate the components we care about.
We only need the following information:
1. Names
2. Points that are one point away from a game being won or lost.
3. Rally length
4. How the point ended.

Everything else gets stripped away.

In [299]:
#Lets Split the points up 
df[['PtsServer','PtsRet']] = df['Pts'].str.split(expand=True,pat = "-")
#And the Player names
df[['Date','Gender','City','Round','P1Name','P2Name']] = df['match_id'].str.split(expand=True,pat = "-")
df=df[df['City']!='NextGen_Finals']
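Splitting on every hyphen assumes that no field contains a hyphen of its own. A small sketch of a more defensive variant, using the `n` argument of `Series.str.split` to cap the number of splits at five, so any extra hyphen stays inside the last field. The second id below is made up purely to show the effect.

```python
import pandas as pd

toy = pd.DataFrame({'match_id': [
    '20200116-M-Auckland-QF-Hubert_Hurkacz-Feliciano_Lopez',
    '20200101-M-Some_City-R32-Player_One-Player_Two-Junior',  # made-up id with an extra hyphen
]})

fields = ['Date', 'Gender', 'City', 'Round', 'P1Name', 'P2Name']
# n=5 stops after five splits, so exactly six columns come back
toy[fields] = toy['match_id'].str.split('-', n=5, expand=True)
print(toy[fields])
```

Whether the real file contains such ids is not something this sketch establishes; it only shows how the split behaves if it does.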

In [300]:
#Flag best-of-five matches: match_ids where either player gets to 3 sets
filter1 = df['Set1.1']==3
filter2 = df['Set2.1']==3
big_match_array=df[filter1|filter2]['match_id'].unique()

In [302]:
df_big_match = pd.DataFrame()
df_big_match['match_id']=big_match_array
df_big_match['bigmatchflag']=True
# df_big_match.to_csv('5setmatches.csv')

In [303]:
df=pd.merge(df,df_big_match,on='match_id',how='left')
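One thing to watch with the left merge: matches that are not in `df_big_match` end up with `bigmatchflag` as NaN rather than False, so a later `== False` comparison never matches them. A sketch on toy ids (not real matches), assuming a plain True/False flag is what's wanted, in which case `isin` builds it directly:

```python
import pandas as pd

toy = pd.DataFrame({'match_id': ['a', 'a', 'b', 'c']})  # toy ids, not real matches
big_match_ids = ['a', 'c']

# Left merge: unmatched rows get NaN in the flag column
flags = pd.DataFrame({'match_id': big_match_ids, 'bigmatchflag': True})
merged = pd.merge(toy, flags, on='match_id', how='left')
print(merged['bigmatchflag'].isna().sum())

# isin: every row gets a real True/False
toy['bigmatchflag'] = toy['match_id'].isin(big_match_ids)
print(toy)
```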

In [304]:
#Replace AD with 50 because easier to write logic with integer values.
df.loc[df['PtsServer'] == 'AD', 'PtsServer'] = '50' 
df.loc[df['PtsRet'] == 'AD', 'PtsRet'] = '50' 

In [305]:
#Map month abbreviations to integers; any other value is returned unchanged.
MONTH_TO_INT = {
    'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
    'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12,
}

def dateToInt(x):
    return MONTH_TO_INT.get(x, x)


In [310]:
df['PtsServer']=df['PtsServer'].apply(dateToInt)
df['PtsServer']=df['PtsServer'].astype('int32')
df['PtsRet']=df['PtsRet'].apply(dateToInt)
df['PtsRet']=df['PtsRet'].astype('int32')

In [311]:

#Define whether a point is crucial or not
#Score at which a player is one point from winning, keyed by 'TB?' code
#('S' is the super tie-break, 'W' is 8-all)
TIEBREAK_THRESHOLD = {'1': 6, 'S': 6, 'W': 8, 'A': 10, 'T': 12}

def isCrucialPoint(row):
    server, returner = row['PtsServer'], row['PtsRet']
    tb = row['TB?']
    if tb in ('0','V'):
        #Regular game: exactly one player on 40, or exactly one on advantage (50)
        return bool((server==40) != (returner==40) or (server==50) != (returner==50))
    if tb in TIEBREAK_THRESHOLD:
        threshold = TIEBREAK_THRESHOLD[tb]
        if server>=threshold and returner>=threshold:
            #Both past the threshold: crucial unless the score is level
            return bool(server!=returner)
        #Otherwise crucial as soon as either player reaches the threshold
        return bool(server>=threshold or returner>=threshold)
    return False

In [312]:
df['IsCrucialPoint']=df.apply(isCrucialPoint,axis=1)
#Take a copy so that adding columns to df1 later doesn't trigger SettingWithCopyWarning
df1=df[df['IsCrucialPoint']].copy()
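Row-wise `apply` gets slow on a few hundred thousand points. For regular games, the same test can be written as a boolean mask over whole columns. A sketch on toy scores, assuming `PtsServer` and `PtsRet` are already integers with AD stored as 50 (tie-break rows would still need their own thresholds):

```python
import pandas as pd

toy = pd.DataFrame({
    'PtsServer': [40, 40, 30, 50, 15],
    'PtsRet':    [0, 40, 40, 40, 30],
})

server, returner = toy['PtsServer'], toy['PtsRet']
# Exactly one player on 40, or exactly one player on advantage (50)
one_on_40 = server.eq(40) ^ returner.eq(40)
one_on_ad = server.eq(50) ^ returner.eq(50)
toy['IsCrucialPoint'] = one_on_40 | one_on_ad
print(toy)
```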

In [313]:
df1

Unnamed: 0,match_id,Pt,Set1,Set2,Gm1,Gm2,Pts,Gm#,TbSet,TB?,...,PtsServer,PtsRet,Date,Gender,City,Round,P1Name,P2Name,bigmatchflag,IsCrucialPoint
3,20200116-M-Auckland-QF-Hubert_Hurkacz-Felician...,4,0,0,0.0,0.0,40-0,1 (4),1,0,...,40,0,20200116,M,Auckland,QF,Hubert_Hurkacz,Feliciano_Lopez,,True
9,20200116-M-Auckland-QF-Hubert_Hurkacz-Felician...,10,0,0,1.0,0.0,40-30,2 (6),1,0,...,40,30,20200116,M,Auckland,QF,Hubert_Hurkacz,Feliciano_Lopez,,True
13,20200116-M-Auckland-QF-Hubert_Hurkacz-Felician...,14,0,0,1.0,1.0,40-0,3 (4),1,0,...,40,0,20200116,M,Auckland,QF,Hubert_Hurkacz,Feliciano_Lopez,,True
19,20200116-M-Auckland-QF-Hubert_Hurkacz-Felician...,20,0,0,2.0,1.0,40-30,4 (6),1,0,...,40,30,20200116,M,Auckland,QF,Hubert_Hurkacz,Feliciano_Lopez,,True
23,20200116-M-Auckland-QF-Hubert_Hurkacz-Felician...,24,0,0,2.0,2.0,40-0,5 (4),1,0,...,40,0,20200116,M,Auckland,QF,Hubert_Hurkacz,Feliciano_Lopez,,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
414575,19700704-M-Wimbledon-F-John_Newcombe-Ken_Rosewall,285,2,2,3.0,1.0,AD-40,43 (8),0,0,...,50,40,19700704,M,Wimbledon,F,John_Newcombe,Ken_Rosewall,True,True
414581,19700704-M-Wimbledon-F-John_Newcombe-Ken_Rosewall,291,2,2,4.0,1.0,30-40,44 (6),0,0,...,30,40,19700704,M,Wimbledon,F,John_Newcombe,Ken_Rosewall,True,True
414587,19700704-M-Wimbledon-F-John_Newcombe-Ken_Rosewall,297,2,2,5.0,1.0,30-40,45 (6),0,0,...,30,40,19700704,M,Wimbledon,F,John_Newcombe,Ken_Rosewall,True,True
414589,19700704-M-Wimbledon-F-John_Newcombe-Ken_Rosewall,299,2,2,5.0,1.0,AD-40,45 (8),0,0,...,50,40,19700704,M,Wimbledon,F,John_Newcombe,Ken_Rosewall,True,True


In [316]:
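#Can't rely on the points to determine who is about to win the game, so infer it from the outcome:
#if the game ended on this point ('GM'), the point winner was the one about to win;
#otherwise the point winner just saved it, so the other player was the one about to win.
def abouttowin(row):
    if row['PtsAfter']=='GM':
        if row['PtWinner']==1:
            return_val=1
        else:
            return_val=2
    else:
        if row['PtWinner']==1:
            return_val=2
        else:
            return_val=1

    return return_val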
        


In [317]:
#Did the point winner convert their own chance ('Win') or fend off the opponent's ('Save')?
def winorsave(row):
    abouttowin=row['abouttowin']
    ptWinner=row['PtWinner']
    if ptWinner==1:
        if abouttowin==1:
            return_val='Win'
        else:
            return_val='Save'
    else: 
        if abouttowin==2:
            return_val='Win'
        else:
            return_val='Save'

    
    return return_val
        

In [318]:
#Point Type
def PointType(row):
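    #Set is on the line: someone has 5+ games and the games are not level, or it's 6-6
    if ( (row['Gm1']>=5 or row['Gm2']>=5 ) and row['Gm1']!=row['Gm2']) or (row['Gm1']==6 and row['Gm2']==6):
        if row['bigmatchflag']==True and (row['Set1']==2 or row['Set2']==2):
            return_val='MatchPoint'
        #bigmatchflag is NaN (not False) after the left merge, so test for "not True"
        elif row['bigmatchflag']!=True and (row['Set1']==1 or row['Set2']==1):
            return_val='MatchPoint'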
        else:
            return_val='SetPoint'
    elif (row['abouttowin']==1 and row['Svr']==2) or (row['abouttowin']==2 and row['Svr']==1):
        return_val='BreakPoint'
    else:
        return_val='GamePoint'             
    

    
    return return_val
        

In [319]:
#Return Player who won the point in the clutch moment.
def ClutchPlayerWon(row):
    ptWinner=row['PtWinner']
    if ptWinner==1:
        return_val=row['P1Name']
    else:
        return_val=row['P2Name']        
    
 
    return return_val
        

In [320]:
#Return Player who lost the point in the clutch moment.
def ClutchPlayerLost(row):
    ptWinner=row['PtWinner']
    if ptWinner==1:
        return_val=row['P2Name']
    else:
        return_val=row['P1Name']        
    
 
    return return_val
        

In [321]:
df1['abouttowin']=df1.apply(abouttowin,axis=1)
df1['winorsave']=df1.apply(winorsave,axis=1)
df1['PointType']=df1.apply(PointType,axis=1)
df1['ClutchPlayerWon']=df1.apply(ClutchPlayerWon,axis=1)
df1['ClutchPlayerLost']=df1.apply(ClutchPlayerLost,axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

In [322]:
#Keep break/set/match points only; copy so later column assignments don't warn
df2=df1[df1['PointType']!='GamePoint'].copy()

In [323]:
# df2.to_csv('bla.csv')

In [324]:
dfClutchWon = pd.DataFrame()
dfClutchWon['count']=df2.groupby('ClutchPlayerWon')['ClutchPlayerWon'].count().sort_values(ascending=False)
dfClutchWon=dfClutchWon.reset_index()

In [325]:
#Clutch Points Lost
dfClutchLost = pd.DataFrame()
dfClutchLost['count']=df2.groupby('ClutchPlayerLost')['ClutchPlayerLost'].count().sort_values(ascending=False)
dfClutchLost=dfClutchLost.reset_index()

In [326]:
df_total=pd.merge(dfClutchWon,dfClutchLost,left_on='ClutchPlayerWon', right_on='ClutchPlayerLost',how='inner')

In [327]:
df_total['ratio']=df_total['count_x']/(df_total['count_x']+df_total['count_y'])
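The inner merge drops any player who appears on only one side (for example, someone who has never lost a pressure point in the data). A sketch of an alternative on toy rows, using `value_counts` and letting `pd.concat` align the two counts by player name. The single-letter player names are placeholders.

```python
import pandas as pd

toy = pd.DataFrame({
    'ClutchPlayerWon':  ['A', 'A', 'B', 'A', 'C'],
    'ClutchPlayerLost': ['B', 'B', 'A', 'B', 'B'],
})

won = toy['ClutchPlayerWon'].value_counts()
lost = toy['ClutchPlayerLost'].value_counts()
# concat aligns on player name; players missing on one side get 0 instead of vanishing
totals = pd.concat([won, lost], axis=1, keys=['won', 'lost']).fillna(0)
totals['ratio'] = totals['won'] / (totals['won'] + totals['lost'])
print(totals.sort_values('ratio', ascending=False))
```

With a minimum-sample filter like the `count_x > 100` cut-off, such one-sided players rarely matter, but keeping them makes the counts easier to audit.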

In [331]:
#Sorted List of most clutch Players
df_final=df_total.sort_values('ratio',ascending=False)
df_final[df_final['count_x']>100]

Unnamed: 0,ClutchPlayerWon,count_x,ClutchPlayerLost,count_y,ratio
68,Sam_Querrey,151,Sam_Querrey,106,0.587549
96,Thomas_Fabbiano,104,Thomas_Fabbiano,77,0.574586
80,Yoshihito_Nishioka,135,Yoshihito_Nishioka,101,0.572034
43,Bjorn_Borg,257,Bjorn_Borg,197,0.566079
40,Karen_Khachanov,269,Karen_Khachanov,207,0.565126
...,...,...,...,...,...
59,Sergi_Bruguera,179,Sergi_Bruguera,217,0.452020
85,Andrey_Rublev,130,Andrey_Rublev,161,0.446735
75,Marton_Fucsovics,142,Marton_Fucsovics,176,0.446541
16,David_Ferrer,457,David_Ferrer,577,0.441973


In [332]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    display(df_final[df_final['count_x']>100])

Unnamed: 0,ClutchPlayerWon,count_x,ClutchPlayerLost,count_y,ratio
68,Sam_Querrey,151,Sam_Querrey,106,0.587549
96,Thomas_Fabbiano,104,Thomas_Fabbiano,77,0.574586
80,Yoshihito_Nishioka,135,Yoshihito_Nishioka,101,0.572034
43,Bjorn_Borg,257,Bjorn_Borg,197,0.566079
40,Karen_Khachanov,269,Karen_Khachanov,207,0.565126
70,Petr_Korda,148,Petr_Korda,119,0.554307
45,Gustavo_Kuerten,254,Gustavo_Kuerten,208,0.549784
69,Felix_Auger_Aliassime,151,Felix_Auger_Aliassime,130,0.537367
44,David_Goffin,254,David_Goffin,219,0.536998
94,Pablo_Carreno_Busta,112,Pablo_Carreno_Busta,97,0.535885


In [18]:
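#Output clutch factors depending on what kind of point was won (weights from the Assumptions table).
def clutchFactor(row):
    pointType=row['PointType']
    winorsave=row['winorsave']

    #Base weight for saving in a break situation: game/set/match = 1/2/3
    if pointType=='MatchPoint':
        TypeFactor=3
    elif pointType=='SetPoint':
        TypeFactor=2
    else:
        TypeFactor=1

    #Winning counts half as much as saving, as per assumption (1.)
    winFactor=0.5 if winorsave=='Win' else 1
    #On serve counts half as much as a break situation, as per assumption (2.)
    serveFactor=0.5 if row['abouttowin']==row['Svr'] else 1

    totalcf=winFactor*serveFactor*TypeFactor
    return totalcf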

In [19]:
df2['winorsave']=df2.apply(winorsave,axis=1)
df2['ClutchFactor']=df2.apply(clutchFactor,axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [20]:
stuff=['PtsServer','PtsRet','Gm1','Gm2','Set1','Set2','Svr','Ret','PointType','BreakType','PtWinner','PtsAfter','ClutchFactor']
df1[stuff]

Unnamed: 0,PtsServer,PtsRet,Gm1,Gm2,Set1,Set2,Svr,Ret,PointType,BreakType,PtWinner,PtsAfter,ClutchFactor
3,40,0,0.0,0.0,0,0,1,2,GamePointP1,OnServe,1,GM,0.25
9,40,30,1.0,0.0,0,0,2,1,GamePointP2,OnServe,2,GM,0.25
13,40,0,1.0,1.0,0,0,1,2,GamePointP1,OnServe,1,GM,0.25
19,40,30,2.0,1.0,0,0,2,1,GamePointP2,OnServe,2,GM,0.25
23,40,0,2.0,2.0,0,0,1,2,GamePointP1,OnServe,1,GM,0.25
...,...,...,...,...,...,...,...,...,...,...,...,...,...
984,40,0,3.0,3.0,0,0,1,2,GamePointP1,OnServe,1,GM,0.25
988,40,0,4.0,3.0,0,0,2,1,GamePointP2,OnServe,2,GM,0.25
993,40,15,4.0,4.0,0,0,1,2,GamePointP1,OnServe,1,GM,0.25
998,40,15,5.0,4.0,0,0,2,1,GamePointP2,OnServe,1,40-30,0.50


In [21]:
#Avoid shadowing the built-in filter()
on_serve = df1["BreakType"]=="OnServe"
df1[stuff].where(on_serve)

Unnamed: 0,PtsServer,PtsRet,Gm1,Gm2,Set1,Set2,Svr,Ret,PointType,BreakType,PtWinner,PtsAfter,ClutchFactor
3,40.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,GamePointP1,OnServe,1.0,GM,0.25
9,40.0,30.0,1.0,0.0,0.0,0.0,2.0,1.0,GamePointP2,OnServe,2.0,GM,0.25
13,40.0,0.0,1.0,1.0,0.0,0.0,1.0,2.0,GamePointP1,OnServe,1.0,GM,0.25
19,40.0,30.0,2.0,1.0,0.0,0.0,2.0,1.0,GamePointP2,OnServe,2.0,GM,0.25
23,40.0,0.0,2.0,2.0,0.0,0.0,1.0,2.0,GamePointP1,OnServe,1.0,GM,0.25
...,...,...,...,...,...,...,...,...,...,...,...,...,...
984,40.0,0.0,3.0,3.0,0.0,0.0,1.0,2.0,GamePointP1,OnServe,1.0,GM,0.25
988,40.0,0.0,4.0,3.0,0.0,0.0,2.0,1.0,GamePointP2,OnServe,2.0,GM,0.25
993,40.0,15.0,4.0,4.0,0.0,0.0,1.0,2.0,GamePointP1,OnServe,1.0,GM,0.25
998,40.0,15.0,5.0,4.0,0.0,0.0,2.0,1.0,GamePointP2,OnServe,1.0,40-30,0.50


In [22]:
print(df1)

     Unnamed: 0  Unnamed: 0.1  \
3             3             3   
9             9             9   
13           13            13   
19           19            19   
23           23            23   
..          ...           ...   
984         984           984   
988         988           988   
993         993           993   
998         998           998   
999         999           999   

                                              match_id  Pt  Set1  Set2  Gm1  \
3    20200116-M-Auckland-QF-Hubert_Hurkacz-Felician...   4     0     0  0.0   
9    20200116-M-Auckland-QF-Hubert_Hurkacz-Felician...  10     0     0  1.0   
13   20200116-M-Auckland-QF-Hubert_Hurkacz-Felician...  14     0     0  1.0   
19   20200116-M-Auckland-QF-Hubert_Hurkacz-Felician...  20     0     0  2.0   
23   20200116-M-Auckland-QF-Hubert_Hurkacz-Felician...  24     0     0  2.0   
..                                                 ...  ..   ...   ...  ...   
984  20200109-M-ATP_Cup-QF-Alex_De_Minaur-Daniel_E

In [29]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    display(df1[stuff])

Unnamed: 0,PtsServer,PtsRet,Gm1,Gm2,Set1,Set2,Svr,Ret,PointType,BreakType,PtWinner,PtsAfter,ClutchFactor
3,40,0,0.0,0.0,0,0,1,2,GamePointP1,OnServe,1,GM,0.25
9,40,30,1.0,0.0,0,0,2,1,GamePointP2,OnServe,2,GM,0.25
13,40,0,1.0,1.0,0,0,1,2,GamePointP1,OnServe,1,GM,0.25
19,40,30,2.0,1.0,0,0,2,1,GamePointP2,OnServe,2,GM,0.25
23,40,0,2.0,2.0,0,0,1,2,GamePointP1,OnServe,1,GM,0.25
27,0,40,3.0,2.0,0,0,2,1,GamePointP1,OnServe,2,15-40,0.5
28,15,40,3.0,2.0,0,0,2,1,GamePointP1,OnServe,2,30-40,0.5
29,30,40,3.0,2.0,0,0,2,1,GamePointP1,OnServe,2,40-40,0.5
31,40,50,3.0,2.0,0,0,2,1,GamePointP1,Break,1,GM,0.5
35,0,40,4.0,2.0,0,0,1,2,GamePointP2,Break,2,GM,0.5


In [23]:
df1[stuff].to_csv('test_out.csv')