## Exploring the Match Charting Project
Some awesome data crowdsourced by many people charting different tennis matches. This dataset provides point by point data of most major ATP matches to date. Big ups to [Jeff Sackmann](http://www.tennisabstract.com/blog/2013/11/26/the-match-charting-project/) for co-ordinating this incredible effort to get this data out in the public. There are so many insights to be gotten from this dataset. I encourage any avid tennis fan or even sports statistics fan to look into answering some interesting questions using this data.

I might start charting up some matches when Wimbledon comes round!

### Finding the most clutch Player
One of the most perennnial question is: Who is the most clutch player on the tour? Many say it's Djokovic/Nadal, but I'd like to get some statistical proof that it actually is.

First, we need to define what clutch means. According to google clutch means: "denoting or occurring at a critical situation in which the outcome of a game or competition is at stake."

A critical situation in tennis would occur when:
- The player is playing to save a game point/break point/set point/match point
- The player is playing to win a game point/break point/set point/match point

Obviously saving break point vs. saving match point are different levels of clutchness. On top of that, not all matches carry the same amount of pressure. e.g. A Challenger 1st round event vs. Wimbledon Final. This however may skew the stats to those who are higher ranked, so we might do a version with and without event-scaling.

Perhaps there should also be multipliers for players who survive critical situations consecutive times in a row. For example: If you're 0-40 down and you bomb down 5 aces. Each ace will give the player a clutch score along with the multiplier for having done it 4 times in a row.

Likewise, if you were 40-0 up and you lose 5 consecutive points - that should decrease your clutch score. 

Furthermore, we should distinquish how the point was won. Winning long rallies in critical situations are probably more clutch than an ace. Also, ending the rally in a winner as opposed to an opponents unforced error is more clutch.

All these factors should come into play when determining the clutch score for each point.

In [4]:
%%html
<style>
  table {margin-left: 0 !important;}
</style>

### Assumptions
With the factors above in mind, we'll need to start off with assumptions which we will use as weights for determining the clutch-points (cp for short). 

1. Let's use a concept in [behavioural theory](https://www.behavioraleconomics.com/resources/mini-encyclopedia-of-be/loss-aversion/) that we feel losses twice as much as gains. 
2. Let's also assume there's twice as much pressure on a break point than on serve.

I've given saving gaming/set/match points: 1/2/3 respectively. Then halved it for winning as per (1.)
I did the same for saving on-serve as per (2.). 
Then I halved that to get Winning On-serve as (1.)

| Winning-Point   Type | Serve/Break | Game Point | Set Point | Match Point |
|--------------|-------------|------------|-----------|-------------|
| Win          | On Serve    | 0.25       | 0.5       | 0.75        |
| Win          | Break       | 0.5        | 1         | 1.5         |
| Save         | On Serve    | 0.5        | 1         | 1.5         |
| Save         | Break       | 1          | 2         | 3           |

I also assigned factors to how the point was won.

| Rally   Length | Factor |
|----------------|--------|
| 0 to 3         | 1      |
| 4 to 7         | 1.1    |
| 8+             | 1.2    |

| Type of Win      | Factor |
|--------------------|--------|
| Unreturnable Serve | 1      |
| Rally Winner       | 1.1    |
| Swinging Volley    | 1.1    |
| Dropshot           | 1.2    |
| Half-Volley        | 1.2    |
| Trick shot         | 1.3    |

**For Example**: If a player happened to save a matchpoint with a half volley dropshot after a 10 shot rally, they would get a clutch-point of 3*1.2*1.2*1.2=5.184

A somewhat crude calculation - but it'll have to do because being clutch is sort of subjective.
I've chosen to not include factors like the significance of the event, amongst many other things - but I think this list of factors will cover enough.

Anyways let's move on with the data shall we.

In [1]:
import pandas as pd

In [2]:
df=pd.read_csv('samplepoints.csv', engine='python')

In [31]:
# df=pd.read_csv('C:/Users/William Jiang/Documents/tennis_MatchChartingProject/charting-m-points.csv', engine='python')

In [125]:
#Let's first spit out a sample csv to look at the data more easliy.

df_sample=df.head(n=1000)
df_sample.to_csv('samplepoints.csv')

In [10]:
# #Isolate a particular match to test our algorithm on.
# df_test=df[df['match_id']=='20200116-M-Auckland-QF-Hubert_Hurkacz-Feliciano_Lopez']

## Data Cleansing
Let's isolate the components we are concerned with.
We are only concerned with the following information:
1. Names
2. Points that are one point away from having a game lost/won.
3. Rally Length
4. How the point ended.

Else we will strip away all the other data.

In [3]:
#Lets Split the points up 
df[['PtsServer','PtsRet']] = df['Pts'].str.split(expand=True,pat = "-")
#And the Player names
df[['Date','Gender','City','Round','P1Name','P2Name']] = df['match_id'].str.split(expand=True,pat = "-")

In [4]:
#Replace AD with 50 because easier to write logic with integer values.
df.loc[df['PtsServer'] == 'AD', 'PtsServer'] = '50' 
df.loc[df['PtsRet'] == 'AD', 'PtsRet'] = '50' 

In [5]:
def dateToInt(x):
    switcher = {
        'Jan': 1,
        'Feb': 2,
        'Mar': 3,
        'Apr': 4,
        'May': 5,
        'Jun': 6,
        'Jul': 7,
        'Aug': 8,
        'Sep': 9,
        'Oct': 10,
        'Nov': 11,
        'Dec': 12, 
    }
    if x in switcher.keys():
        return_value=switcher[x]
    else:
        return_value=x
    return return_value


In [6]:
df['PtsServer']=df['PtsServer'].apply(dateToInt)
df['PtsServer']=df['PtsServer'].astype('int32')
df['PtsRet']=df['PtsRet'].apply(dateToInt)
df['PtsRet']=df['PtsRet'].astype('int32')

In [36]:
df[['PtsServer','PtsRet']]

Unnamed: 0,PtsServer,PtsRet
0,0,0
1,15,0
2,30,0
3,40,0
4,0,0
...,...,...
995,15,0
996,15,15
997,30,15
998,40,15


In [7]:
#Define whether a point is crucial or not
def isCrucialPoint(row):
    return_value=False
    if row['TB?']==0:
        if (row['PtsServer']==40) ^ (row['PtsRet']==40):
            return_value=True    
        if (row['PtsServer']==50) ^ (row['PtsRet']==50):
            return_value=True   
    elif row['TB?']==1:
        if row['PtsServer']>=6 and row['PtsRet']>=6:
            if row['PtsServer']!=row['PtsRet']:
                return_value=True
        if (row['PtsServer']>=6) ^ (row['PtsRet']>=6):
            return_value=True
                
    return return_value

In [8]:
df['IsCrucialPoint']=df.apply(isCrucialPoint,axis=1)
df1=df[df['IsCrucialPoint']==True]

In [25]:
#Point Type
def PointType(row):
    if (row['Svr']==1 and row['PtsServer']>row['PtsRet']) or (row['Ret']==1 and row['PtsRet']>row['PtsServer']): 
        if (row['Gm1']>=5 and row['Gm1']>row['Gm2']) or (row['Gm1']==6 and row['Gm1']==6):
            if row['Set1']==1:
                return_val='MatchPointP1'
            else:
                return_val='SetPointP1'
        else:
            return_val='GamePointP1'
    elif (row['Svr']==2 and row['PtsServer']>row['PtsRet']) or (row['Ret']==2 and row['PtsRet']>row['PtsServer']): 
        if (row['Gm2']>=5 and row['Gm2']>row['Gm1']) or (row['Gm2']==6 and row['Gm1']==6):
            if row['Set2']==1:
                return_val='MatchPointP2'
            else:
                return_val='SetPointP2'
        else:
            return_val='GamePointP2'   
    return return_val
        

In [31]:
def BreakType(row):
    if ('P1' in row['PointType'] and row['Svr']==2) or ('P2' in row['PointType'] and row['Svr']==1):
        return_val='Break'
    else: 
        return_val='OnServe'
        
    return return_val
        
        
        

In [32]:
df1['PointType']=df1.apply(PointType,axis=1)
df1['BreakType']=df1.apply(BreakType,axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [33]:
stuff=['Svr','PtsServer','PtsRet','Ret','Gm1','Gm2','Set1','Set2','PointType','BreakType']
df1[stuff]

Unnamed: 0,Svr,PtsServer,PtsRet,Ret,Gm1,Gm2,Set1,Set2,PointType,BreakType
3,1,40,0,2,0.0,0.0,0,0,GamePointP1,OnServe
9,2,40,30,1,1.0,0.0,0,0,GamePointP2,OnServe
13,1,40,0,2,1.0,1.0,0,0,GamePointP1,OnServe
19,2,40,30,1,2.0,1.0,0,0,GamePointP2,OnServe
23,1,40,0,2,2.0,2.0,0,0,GamePointP1,OnServe
...,...,...,...,...,...,...,...,...,...,...
984,1,40,0,2,3.0,3.0,0,0,GamePointP1,OnServe
988,2,40,0,1,4.0,3.0,0,0,GamePointP2,OnServe
993,1,40,15,2,4.0,4.0,0,0,GamePointP1,OnServe
998,2,40,15,1,5.0,4.0,0,0,GamePointP2,OnServe


In [44]:
#Output Clutch-Factors depending on what kind of point was won.
def clutchFactor(row):
    pointType=row['PointType']
    breakType=row['BreakType']
    ptWinner=row['PtWinner']
    if pointType='GamePointP1' and breakType='OnServe' and ptWinner=1:
        return_val=0.25
    elif 
    

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,match_id,Pt,Set1,Set2,Gm1,Gm2,Pts,Gm#,...,rallyCount,PtsServer,PtsRet,Date,Gender,City,Round,P1Name,P2Name,IsCrucialPoint
3,3,3,20200116-M-Auckland-QF-Hubert_Hurkacz-Felician...,4,0,0,0.0,0.0,40-0,1 (4),...,1,40,0,20200116,M,Auckland,QF,Hubert_Hurkacz,Feliciano_Lopez,True
9,9,9,20200116-M-Auckland-QF-Hubert_Hurkacz-Felician...,10,0,0,1.0,0.0,40-30,2 (6),...,9,40,30,20200116,M,Auckland,QF,Hubert_Hurkacz,Feliciano_Lopez,True
13,13,13,20200116-M-Auckland-QF-Hubert_Hurkacz-Felician...,14,0,0,1.0,1.0,40-0,3 (4),...,1,40,0,20200116,M,Auckland,QF,Hubert_Hurkacz,Feliciano_Lopez,True
19,19,19,20200116-M-Auckland-QF-Hubert_Hurkacz-Felician...,20,0,0,2.0,1.0,40-30,4 (6),...,3,40,30,20200116,M,Auckland,QF,Hubert_Hurkacz,Feliciano_Lopez,True
23,23,23,20200116-M-Auckland-QF-Hubert_Hurkacz-Felician...,24,0,0,2.0,2.0,40-0,5 (4),...,3,40,0,20200116,M,Auckland,QF,Hubert_Hurkacz,Feliciano_Lopez,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
984,984,984,20200109-M-ATP_Cup-QF-Alex_De_Minaur-Daniel_Evans,43,0,0,3.0,3.0,40-0,7 (4),...,3,40,0,20200109,M,ATP_Cup,QF,Alex_De_Minaur,Daniel_Evans,True
988,988,988,20200109-M-ATP_Cup-QF-Alex_De_Minaur-Daniel_Evans,47,0,0,4.0,3.0,40-0,8 (4),...,1,40,0,20200109,M,ATP_Cup,QF,Alex_De_Minaur,Daniel_Evans,True
993,993,993,20200109-M-ATP_Cup-QF-Alex_De_Minaur-Daniel_Evans,52,0,0,4.0,4.0,40-15,9 (5),...,5,40,15,20200109,M,ATP_Cup,QF,Alex_De_Minaur,Daniel_Evans,True
998,998,998,20200109-M-ATP_Cup-QF-Alex_De_Minaur-Daniel_Evans,57,0,0,5.0,4.0,40-15,10 (5),...,8,40,15,20200109,M,ATP_Cup,QF,Alex_De_Minaur,Daniel_Evans,True


In [53]:
df_setstowin=df1.groupby("match_id")[["Set1.1","Set2.1"]].max(axis=1).max(axis=1)
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


3     NaN
9     NaN
13    NaN
19    NaN
23    NaN
       ..
984   NaN
988   NaN
993   NaN
998   NaN
999   NaN
Name: SetsToWin, Length: 274, dtype: float64