## Exploring the Match Charting Project
This is some awesome data, crowdsourced by many people charting different tennis matches. The dataset provides point-by-point data for most major ATP matches to date. Big ups to [Jeff Sackmann](http://www.tennisabstract.com/blog/2013/11/26/the-match-charting-project/) for co-ordinating this incredible effort and getting the data out in public. There are so many insights to be had from this dataset, and I encourage any avid tennis fan, or sports statistics fan, to try answering some interesting questions with it.

I might start charting up some matches when Wimbledon comes round!

### Normalizing for how good a player is
I realized the data we're about to crunch will favour players who are inherently just good at the game, so we'd simply see the top players at the top of the rankings.

To normalize for this, I think we ought to calculate each player's overall point win-rate as well as their win-rate on pressure points, then take the difference between the two as a normalized score.
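
As a rough sketch of that normalization, here is the arithmetic on plain counts. The function names and the numbers are made up for illustration, and the sign convention (pressure rate minus overall rate, so a positive score means a player does better under pressure) is my assumption.

```python
def win_rate(points_won, points_played):
    """Fraction of points won, or None when no points were played."""
    if points_played == 0:
        return None
    return points_won / points_played


def clutch_score(all_won, all_played, pressure_won, pressure_played):
    """Pressure-point win rate minus overall win rate (assumed sign convention)."""
    overall = win_rate(all_won, all_played)
    pressure = win_rate(pressure_won, pressure_played)
    if overall is None or pressure is None:
        return None
    return pressure - overall


# Made-up numbers: 520 of 1000 points won overall, 110 of 200 pressure points won.
example_score = clutch_score(520, 1000, 110, 200)
print(round(example_score, 3))
```

A player who wins 52% of all points but 55% of pressure points scores about +0.03, regardless of how strong they are overall.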


In [1]:
import pandas as pd

In [2]:
df=pd.read_csv('C:/Users/William Jiang/Documents/tennis_MatchChartingProject/charting-m-points.csv', engine='python')

In [3]:
# df=df.head(1000)

In [4]:
# df=pd.read_csv('poo.csv', engine='python')

## Data Cleansing
Let's isolate the components we are concerned with.
We only need the following information:
1. Names
2. Points that are one point away from a game being won or lost.
3. Rally length
4. How the point ended.

Everything else will be stripped away.

In [5]:
#Split the score into the server's and the returner's points
df[['PtsServer','PtsRet']] = df['Pts'].str.split(expand=True,pat = "-")
#Split match_id into its components, including the player names
df[['Date','Gender','City','Round','P1Name','P2Name']] = df['match_id'].str.split(expand=True,pat = "-")
df=df[df['City']!='NextGen_Finals']
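
To make the `match_id` split concrete, here is a small sketch on a single string. The example id and the helper name `parse_match_id` are hypothetical; I'm only assuming the six dash-separated fields that the column names above imply.

```python
FIELDS = ['Date', 'Gender', 'City', 'Round', 'P1Name', 'P2Name']


def parse_match_id(match_id):
    """Split a dash-separated match id into the six named fields."""
    parts = match_id.split('-')
    if len(parts) != len(FIELDS):
        # An extra dash (for instance inside a name) would shift every field.
        raise ValueError(f'expected {len(FIELDS)} fields, got {len(parts)}: {match_id!r}')
    return dict(zip(FIELDS, parts))


# Hypothetical id, only to show the shape of the result.
example = parse_match_id('20190714-M-Wimbledon-F-Player_One-Player_Two')
print(example['P1Name'], 'vs', example['P2Name'])
```

The length check is there because a plain split on `-` cannot tell a field separator from a dash inside a field.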

In [6]:
#Flag best-of-five matches: match_ids where either 'Set1.1' or 'Set2.1' reaches 3
filter1 = df['Set1.1']==3
filter2 = df['Set2.1']==3
big_match_array=df[filter1|filter2]['match_id'].unique()

In [7]:
df_big_match = pd.DataFrame()
df_big_match['match_id']=big_match_array
df_big_match['bigmatchflag']=True

In [8]:
df=pd.merge(df,df_big_match,on='match_id',how='left')
df['bigmatchflag'] = df['bigmatchflag'].fillna(value=False)
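
The helper frame and left merge above work fine. As an alternative sketch, the same flag can be built in one step with `Series.isin`. The toy frame and ids below are made up.

```python
import pandas as pd

# Toy stand-ins for df and big_match_array.
toy = pd.DataFrame({'match_id': ['a', 'a', 'b', 'c']})
best_of_five_ids = ['a', 'c']

# True where the row's match_id is in the list, False otherwise (no NaN to fill).
toy['bigmatchflag'] = toy['match_id'].isin(best_of_five_ids)
print(toy)
```

This avoids the `fillna` step, since `isin` already returns a plain boolean column.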

In [9]:
#Replace 'AD' with 50, because the logic is easier to write with integer values.
df.loc[df['PtsServer'] == 'AD', 'PtsServer'] = '50' 
df.loc[df['PtsRet'] == 'AD', 'PtsRet'] = '50' 

In [10]:
def dateToInt(x):
    #Map month abbreviations to their month number; return anything else unchanged.
    switcher = {
        'Jan': 1,
        'Feb': 2,
        'Mar': 3,
        'Apr': 4,
        'May': 5,
        'Jun': 6,
        'Jul': 7,
        'Aug': 8,
        'Sep': 9,
        'Oct': 10,
        'Nov': 11,
        'Dec': 12,
    }
    return switcher.get(x, x)


In [11]:
df['PtsServer']=df['PtsServer'].apply(dateToInt)
df['PtsServer']=df['PtsServer'].astype('int32')
df['PtsRet']=df['PtsRet'].apply(dateToInt)
df['PtsRet']=df['PtsRet'].astype('int32')
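
Here is the score conversion from the last few cells on a single value, as a minimal sketch. The helper names are mine, and it deliberately leaves out the month-name handling that `dateToInt` does.

```python
def score_token_to_int(token):
    """Map a score token such as '15', '40' or 'AD' to an integer."""
    if token == 'AD':
        return 50  # same stand-in value the notebook uses for advantage
    return int(token)


def split_score(pts):
    """Turn a 'server-returner' score string into a pair of integers."""
    server, returner = pts.split('-')
    return score_token_to_int(server), score_token_to_int(returner)


print(split_score('40-AD'))
print(split_score('6-5'))
```

Using 50 for advantage keeps every score comparable as a number, which is what the crucial-point logic relies on.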

In [12]:

#Define whether a point is crucial or not
def isCrucialPoint(row):
    return_value=False
    if row['TB?'] in ('0','V'):
        if (row['PtsServer']==40) ^ (row['PtsRet']==40):
            return_value=True    
        if (row['PtsServer']==50) ^ (row['PtsRet']==50):
            return_value=True   
    elif row['TB?']=='1':
        if row['PtsServer']>=6 and row['PtsRet']>=6:
            if row['PtsServer']!=row['PtsRet']:
                return_value=True
        if (row['PtsServer']>=6) ^ (row['PtsRet']>=6):
            return_value=True
    #Super Tie-Break
    elif row['TB?']=='S':
        if row['PtsServer']>=6 and row['PtsRet']>=6:
            if row['PtsServer']!=row['PtsRet']:
                return_value=True
        if (row['PtsServer']>=6) ^ (row['PtsRet']>=6):
            return_value=True       
    #8-all
    elif row['TB?']=='W':
        if row['PtsServer']>=8 and row['PtsRet']>=8:
            if row['PtsServer']!=row['PtsRet']:
                return_value=True
        if (row['PtsServer']>=8) ^ (row['PtsRet']>=8):
            return_value=True         
    elif row['TB?']=='A':
        if row['PtsServer']>=10 and row['PtsRet']>=10:
            if row['PtsServer']!=row['PtsRet']:
                return_value=True
        if (row['PtsServer']>=10) ^ (row['PtsRet']>=10):
            return_value=True     
    elif row['TB?']=='T':
        if row['PtsServer']>=12 and row['PtsRet']>=12:
            if row['PtsServer']!=row['PtsRet']:
                return_value=True
        if (row['PtsServer']>=12) ^ (row['PtsRet']>=12):
            return_value=True     
                
    return return_value
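
The tie-break branches above all follow one pattern: at least one player has reached the threshold and the scores are not level. A table-driven sketch of the same rules is below. It takes plain values instead of a row, the function name is mine, and the thresholds are simply copied from the branches above.

```python
# Thresholds copied from isCrucialPoint, keyed by the 'TB?' code.
TIEBREAK_THRESHOLD = {'1': 6, 'S': 6, 'W': 8, 'A': 10, 'T': 12}


def is_crucial_point(tb_code, pts_server, pts_ret):
    """Same rules as isCrucialPoint, written against plain values."""
    if tb_code in ('0', 'V'):
        # Regular game: exactly one player on 40, or exactly one on advantage (50).
        one_on_40 = (pts_server == 40) != (pts_ret == 40)
        one_on_ad = (pts_server == 50) != (pts_ret == 50)
        return one_on_40 or one_on_ad
    threshold = TIEBREAK_THRESHOLD.get(tb_code)
    if threshold is None:
        return False  # unknown code, same as falling through every branch
    return max(pts_server, pts_ret) >= threshold and pts_server != pts_ret


print(is_crucial_point('0', 40, 30))  # game point
print(is_crucial_point('1', 6, 6))    # level in a tie-break
```

Adding a new tie-break format then only needs a new dictionary entry.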

In [13]:
#Return Player who won the point in the clutch moment.
def ClutchPlayerWon(row):
    ptWinner=row['PtWinner']
    if ptWinner==1:
        return_val=row['P1Name']
    else:
        return_val=row['P2Name']        
    
 
    return return_val
        

In [14]:
#Return Player who lost the point in the clutch moment.
def ClutchPlayerLost(row):
    ptWinner=row['PtWinner']
    if ptWinner==1:
        return_val=row['P2Name']
    else:
        return_val=row['P1Name']        
    
 
    return return_val
        

In [15]:
def WinnerIsServerOrReturner(row):
    if int(row['isSvrWinner'])==1:
        return_val= 'Server'
    else:
        return_val= 'Returner'
    return return_val

In [16]:
def LoserIsServerOrReturner(row):
    if int(row['isSvrWinner'])==0:
        return_val= 'Server'
    else:
        return_val= 'Returner'
    return return_val

In [17]:
def HowPoint(row):
    if row['isAce']==True:
        return_val='Ace'
    elif row['isUnret']==True:
        return_val='Unret'
    elif row['isRallyWinner']==True:
        return_val='RallyWinner'
    elif row['isForced']==True:
        return_val='Forced'
    elif row['isUnforced']==True:
        return_val='Unforced'
    elif row['isDouble']==True:
        return_val='Double'  
    else:
        return_val='Other'
 
    return return_val

In [18]:
#Have to handle some cases where the data isn't an integer.
def rallyCounttoint(row):
    if str(row['rallyCount']).isdecimal():
        return_val=int(row['rallyCount'])
    else:
        return_val=None
    
    return return_val
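
A quick sketch of which values survive that check, using a standalone helper (the name `parse_rally_count` is mine). One thing to watch: if pandas happens to have read the column as floats, a value like `7.0` becomes the string `'7.0'`, which is not decimal and so turns into `None`.

```python
def parse_rally_count(value):
    """Return the value as an int if its string form is all digits, else None."""
    text = str(value)
    return int(text) if text.isdecimal() else None


for raw in ['12', 7, '3.0', float('nan'), '']:
    print(repr(raw), '->', parse_rally_count(raw))
```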

In [19]:
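#Which player (1 or 2) was one point from winning the game?
#Can't rely on the points to determine this, so infer it from the outcome:
#if the game ended on this point ('GM') it was the point winner, otherwise the point loser.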
def abouttowin(row):
    if row['IsCrucialPoint']:       
        if row['PtsAfter']=='GM':
            if row['PtWinner']==1:
                return_val=1
            else:
                return_val=2
        else:
            if row['PtWinner']==1:
                return_val=2
            else:
                return_val=1     
    else:
        return_val=None
            
    return return_val
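
To show how this inference plays out, here is the same rule on plain values. The function name is mine, and the `'40-40'` value for `PtsAfter` is just a stand-in for "anything other than `'GM'`".

```python
def about_to_win(is_crucial, pts_after, pt_winner):
    """Player (1 or 2) who was one point from the game, or None if not crucial."""
    if not is_crucial:
        return None
    winner = 1 if pt_winner == 1 else 2
    loser = 2 if winner == 1 else 1
    # Game over after this point: the winner converted. Otherwise the winner saved.
    return winner if pts_after == 'GM' else loser


print(about_to_win(True, 'GM', 1))     # player 1 closed out the game
print(about_to_win(True, '40-40', 2))  # player 2 saved a game point held by player 1
```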
        


In [20]:
#Was the point winner trying to win the game ('Win') or to save it ('Save')?
def PWinnerwinorsave(row):
    if row['IsCrucialPoint']:    
        abouttowin=row['abouttowin']
        ptWinner=row['PtWinner']
        if ptWinner==1:
            if abouttowin==1:
                return_val='Win'
            else:
                return_val='Save'
        else: 
            if abouttowin==2:
                return_val='Win'
            else:
                return_val='Save'
    else:
        return_val='N/A'

    
    return return_val
        

In [21]:
#Was the point loser trying to win the game ('Win') or to save it ('Save')?
def PLoserwinorsave(row):
    if row['IsCrucialPoint']:    
        abouttowin=row['abouttowin']
        ptWinner=row['PtWinner']
        if ptWinner==1:
            if abouttowin==2:
                return_val='Win'
            else:
                return_val='Save'
        else: 
            if abouttowin==1:
                return_val='Win'
            else:
                return_val='Save'
    else:
        return_val='N/A'

    
    return return_val
        

In [22]:
def BreakPoint(row):
    if row['IsCrucialPoint']:    
        if (row['abouttowin']==1 and row['Svr']==2) or (row['abouttowin']==2 and row['Svr']==1):
            return_val='Break'
        else:
            return_val='OnServe'
    else:
        return_val='N/A'
    
    return return_val

In [23]:
#Point Type: classify a crucial point as a match point, set point or game point
def PointType(row):
    #when p1 about to win set
    logicforset1=row['abouttowin']==1 and ((row['Gm1']>=5 and row['Gm1']>row['Gm2']) or (row['Gm1']==6 and row['Gm2']==6))
    #when p2 about to win set
    logicforset2=row['abouttowin']==2 and ((row['Gm2']>=5 and row['Gm2']>row['Gm1']) or (row['Gm1']==6 and row['Gm2']==6))
    if row['IsCrucialPoint']:    
        if logicforset1:
            if row['bigmatchflag']==True and row['Set1']==2:
                return_val='MatchPoint'
            elif row['bigmatchflag']==False and row['Set1']==1 :
                return_val='MatchPoint'
            else:
                return_val='SetPoint'
        elif logicforset2:
            if row['bigmatchflag']==True and row['Set2']==2:
                return_val='MatchPoint'
            elif row['bigmatchflag']==False and row['Set2']==1 :
                return_val='MatchPoint'
            else:
                return_val='SetPoint'    
        else:
            return_val='GamePoint'      
    else:
        return_val='N/A'
    

    
    return return_val
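
The point-type rules are easier to check on a few hand-made scores. Below is a compact sketch of the same logic on plain values. The function name and the dict-based arguments are mine, and it treats `None` as "not a crucial point".

```python
def point_type(about_to_win, games, sets, best_of_five):
    """games and sets are dicts keyed by player number (1 or 2)."""
    if about_to_win is None:
        return 'N/A'
    me, other = about_to_win, 3 - about_to_win
    six_all = games[1] == 6 and games[2] == 6
    ahead_from_five = games[me] >= 5 and games[me] > games[other]
    if not (ahead_from_five or six_all):
        return 'GamePoint'
    # Sets already won that make the next set the deciding one for this player.
    sets_before_match_point = 2 if best_of_five else 1
    return 'MatchPoint' if sets[me] == sets_before_match_point else 'SetPoint'


# Player 1 leads 5-4 and is a set up in a best-of-three match.
print(point_type(1, {1: 5, 2: 4}, {1: 1, 2: 0}, best_of_five=False))
# Same score in a best-of-five match at one set all.
print(point_type(1, {1: 5, 2: 4}, {1: 1, 2: 1}, best_of_five=True))
```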
        

In [24]:
df['IsCrucialPoint']=df.apply(isCrucialPoint,axis=1)
df['PlayerWon']=df.apply(ClutchPlayerWon,axis=1)
df['PlayerLost']=df.apply(ClutchPlayerLost,axis=1)
df['HowPoint']=df.apply(HowPoint,axis=1)
df['rallyCount']=df.apply(rallyCounttoint,axis=1)

In [25]:
df['abouttowin']=df.apply(abouttowin,axis=1)
df['BreakPoint']=df.apply(BreakPoint,axis=1)
df['PointType']=df.apply(PointType,axis=1)
df['PWinnerwinorsave']=df.apply(PWinnerwinorsave,axis=1)
df['PLoserwinorsave']=df.apply(PLoserwinorsave,axis=1)
df['WinnerIsServerOrReturner']=df.apply(WinnerIsServerOrReturner,axis=1)
df['LoserIsServerOrReturner']=df.apply(LoserIsServerOrReturner,axis=1)
df['Year']=df['Date'].str.slice(stop=4)

In [26]:
#Points Won
dfallWon = pd.DataFrame()
dfallWon['count']=df.groupby(['WinnerIsServerOrReturner','BreakPoint','Year','PlayerWon','HowPoint','IsCrucialPoint','PWinnerwinorsave','PointType'])['PlayerWon'].count().sort_values(ascending=False)
dfallWon['avg_rally_count']=df.groupby(['WinnerIsServerOrReturner','BreakPoint','Year','PlayerWon','HowPoint','IsCrucialPoint','PWinnerwinorsave','PointType'])['rallyCount'].mean().sort_values(ascending=False)
dfallWon['WonLost']='Won'
dfallWon=dfallWon.reset_index()
dfallWon=dfallWon.rename(columns={"PlayerWon": "Player","PWinnerwinorsave": "winorsave",'WinnerIsServerOrReturner':'ServerOrReturner'})

#Points Lost
dfallLost = pd.DataFrame()
dfallLost['count']=df.groupby(['LoserIsServerOrReturner','BreakPoint','Year','PlayerLost','HowPoint','IsCrucialPoint','PLoserwinorsave','PointType'])['PlayerLost'].count().sort_values(ascending=False)
dfallLost['avg_rally_count']=df.groupby(['LoserIsServerOrReturner','BreakPoint','Year','PlayerLost','HowPoint','IsCrucialPoint','PLoserwinorsave','PointType'])['rallyCount'].mean().sort_values(ascending=False)
dfallLost['WonLost']='Lost'
dfallLost=dfallLost.reset_index()
dfallLost=dfallLost.rename(columns={"PlayerLost": "Player","PLoserwinorsave": "winorsave",'LoserIsServerOrReturner':'ServerOrReturner'})

dfallagg=pd.concat([dfallWon,dfallLost])
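
Each aggregate above runs the same `groupby` twice, once for the count and once for the mean. As a sketch of an alternative, pandas named aggregation can produce both columns in one pass. The toy frame below is made up, and I use `'size'` for the count, on the assumption that counting rows per group is what we want.

```python
import pandas as pd

toy = pd.DataFrame({
    'PlayerWon': ['A', 'A', 'B', 'A'],
    'HowPoint': ['Ace', 'Ace', 'Forced', 'Forced'],
    'rallyCount': [1, 1, 5, None],
})

keys = ['PlayerWon', 'HowPoint']
summary = (
    toy.groupby(keys)
       .agg(count=('rallyCount', 'size'),            # rows per group, missing values included
            avg_rally_count=('rallyCount', 'mean'))  # mean ignores missing rally counts
       .reset_index()
)
print(summary)
```

The result already has the group keys as ordinary columns, so it can be renamed and concatenated the same way as `dfallWon` and `dfallLost`.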


In [27]:
df[df['PlayerWon']=='Roger_Federer'].to_csv('test.csv')

In [62]:
dfallagg.to_csv('final_dataset_agg.csv')

# Done!