## T20 World Cup 2022 prediction

In this notebook, I predict the winner of the upcoming T20 World Cup 2022 using the results of T20 matches played from 2012 to August 2022 together with the current team rankings.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as ticker
import matplotlib.ticker as plticker
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

##### "Matches" dataset with results of all T20 matches played between February 2012 and August 2022

In [2]:
matches = pd.read_csv("C://Users/eshub/OneDrive/Desktop/matches.csv")

In [3]:
matches.head()

Unnamed: 0,Ground,Date,Win by,Win margin,Result,Team_1,Team_2
0,Wellington,17-02-2012,wickets,6.0,England,New Zealand,England
1,Hamilton,19-02-2012,runs,48.0,England,New Zealand,England
2,Auckland,22-02-2012,runs,10.0,England,New Zealand,England
3,Abu Dhabi,25-11-2011,wickets,5.0,Pakistan,Pakistan,Sri Lanka
4,,23-02-2012,runs,8.0,Pakistan,England,Pakistan


In [4]:
matches.tail()

Unnamed: 0,Ground,Date,Win by,Win margin,Result,Team_1,Team_2
2013,Al Amarat,14-08-2022,wickets,5.0,Kuwait,Bahrain,Kuwait
2014,Al Amarat,17-08-2022,runs,102.0,Kuwait,Kuwait,Bahrain
2015,Al Amarat,20-08-2022,runs,8.0,Hong Kong,Hong Kong,Singapore
2016,Al Amarat,21-08-2022,wickets,1.0,Kuwait,United Arab Emirates,Kuwait
2017,Al Amarat,22-08-2022,runs,47.0,United Arab Emirates,United Arab Emirates,Singapore


In [5]:
matches.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2018 entries, 0 to 2017
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Ground      1855 non-null   object 
 1   Date        2018 non-null   object 
 2   Win by      1960 non-null   object 
 3   Win margin  1960 non-null   float64
 4   Result      2018 non-null   object 
 5   Team_1      2018 non-null   object 
 6   Team_2      2018 non-null   object 
dtypes: float64(1), object(6)
memory usage: 110.5+ KB
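The `Date` column is read as `object`, so the date range cannot be checked directly. Below is a minimal sketch of parsing the day-first dates and checking the range. It runs on a toy frame that copies three dates from the preview above; `toy` and `parsed` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy rows copied from the matches.head() preview above (Date column only).
toy = pd.DataFrame({"Date": ["17-02-2012", "19-02-2012", "25-11-2011"]})

# The dates are day-month-year strings, so the format is given explicitly.
parsed = pd.to_datetime(toy["Date"], format="%d-%m-%Y")

print(parsed.min().date(), parsed.max().date())
```

On these three rows the earliest date falls in November 2011, which is earlier than the February 2012 start named in the heading.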


##### "Fixtures" dataset containing the fixtures for the T20 World Cup

In [6]:
fixtures = pd.read_csv("C://Users/eshub/OneDrive/Desktop/fixtures.csv")

In [7]:
fixtures.head()

Unnamed: 0,S. No,Date,Team_1,Team_2,Venue
0,1,22-Oct,New Zealand,Australia,Sydney
1,2,22-Oct,England,Afghanistan,Perth
2,3,23-Oct,Sri Lanka,Ireland,Hobart
3,4,23-Oct,India,Pakistan,Melbourne
4,5,24-Oct,Bangladesh,Netherlands,Hobart


In [8]:
fixtures.tail()

Unnamed: 0,S. No,Date,Team_1,Team_2,Venue
25,26,04-Nov,Australia,Afghanistan,Adelaide
26,27,05-Nov,England,Sri Lanka,Sydney
27,28,06-Nov,South Africa,Netherlands,Adelaide
28,29,06-Nov,Pakistan,Bangladesh,Adelaide
29,30,06-Nov,India,Zimbabwe,Melbourne


In [9]:
fixtures.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   S. No   30 non-null     int64 
 1   Date    30 non-null     object
 2   Team_1  30 non-null     object
 3   Team_2  30 non-null     object
 4   Venue   30 non-null     object
dtypes: int64(1), object(4)
memory usage: 1.3+ KB


##### "Rankings" dataset containing the current T20 team rankings

In [10]:
rankings = pd.read_csv("C://Users/eshub/OneDrive/Desktop/rankings.csv")

In [11]:
rankings.head()

Unnamed: 0,Position,Team,Rating
0,1,India,268
1,2,England,266
2,3,Pakistan,258
3,4,South Africa,256
4,5,New Zealand,252


In [12]:
rankings.tail()

Unnamed: 0,Position,Team,Rating
12,13,UAE,183
13,14,Namibia,183
14,15,Scotland,182
15,16,Nepal,180
16,17,Netherlands,177


In [13]:
rankings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17 entries, 0 to 16
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Position  17 non-null     int64 
 1   Team      17 non-null     object
 2   Rating    17 non-null     int64 
dtypes: int64(2), object(1)
memory usage: 536.0+ bytes


In [14]:
matches.head()

Unnamed: 0,Ground,Date,Win by,Win margin,Result,Team_1,Team_2
0,Wellington,17-02-2012,wickets,6.0,England,New Zealand,England
1,Hamilton,19-02-2012,runs,48.0,England,New Zealand,England
2,Auckland,22-02-2012,runs,10.0,England,New Zealand,England
3,Abu Dhabi,25-11-2011,wickets,5.0,Pakistan,Pakistan,Sri Lanka
4,,23-02-2012,runs,8.0,Pakistan,England,Pakistan


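##### Keeping only matches that involve at least one of the selected Super 12 teams

In [15]:
# Afghanistan and Zimbabwe appear in the fixtures but are not in this list.
worldcup_teams = ["India", "England", "Australia", "New Zealand", "South Africa", "Pakistan", "Bangladesh",
                  "Sri Lanka", "Netherlands", "Ireland"]
# A match is kept if either side is in the list.
matches_team_1 = matches[matches['Team_1'].isin(worldcup_teams)]
matches_team_2 = matches[matches['Team_2'].isin(worldcup_teams)]
matches_teams = pd.concat((matches_team_1, matches_team_2))
# drop_duplicates() returns a new DataFrame; the result is only displayed here,
# not assigned, so matches_teams still holds matches between two listed teams twice.
matches_teams.drop_duplicates()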

Unnamed: 0,Ground,Date,Win by,Win margin,Result,Team_1,Team_2
0,Wellington,17-02-2012,wickets,6.0,England,New Zealand,England
1,Hamilton,19-02-2012,runs,48.0,England,New Zealand,England
2,Auckland,22-02-2012,runs,10.0,England,New Zealand,England
3,Abu Dhabi,25-11-2011,wickets,5.0,Pakistan,Pakistan,Sri Lanka
4,,23-02-2012,runs,8.0,Pakistan,England,Pakistan
...,...,...,...,...,...,...,...
1985,Bulawayo,15-07-2022,wickets,7.0,Netherlands,United States of America,Netherlands
1989,Bulawayo,17-07-2022,runs,37.0,Zimbabwe,Zimbabwe,Netherlands
2006,Harare,30-07-2022,runs,17.0,Zimbabwe,Zimbabwe,Bangladesh
2007,Harare,31-07-2022,wickets,7.0,Bangladesh,Zimbabwe,Bangladesh
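The filter above can also be written as a single boolean mask, which avoids the duplicates that concatenation creates. The following is a sketch on made-up toy rows with a shortened team list; `teams`, `toy`, `filtered`, `concat` and `deduped` are illustrative names.

```python
import pandas as pd

teams = ["England", "Pakistan"]  # shortened list for the toy example

toy = pd.DataFrame({
    "Team_1": ["New Zealand", "England", "Kuwait"],
    "Team_2": ["England", "Pakistan", "Bahrain"],
    "Result": ["England", "Pakistan", "Kuwait"],
})

# One boolean mask: keep a match if either side is in the list.
mask = toy["Team_1"].isin(teams) | toy["Team_2"].isin(teams)
filtered = toy[mask]

# Concatenating two filtered frames keeps England v Pakistan twice
# until drop_duplicates() is applied and its result is assigned.
concat = pd.concat((toy[toy["Team_1"].isin(teams)], toy[toy["Team_2"].isin(teams)]))
deduped = concat.drop_duplicates()

print(len(filtered), len(concat), len(deduped))
```

The mask version never produces the repeated row, so no de-duplication step is needed afterwards.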


In [16]:
matches_teams.head()

Unnamed: 0,Ground,Date,Win by,Win margin,Result,Team_1,Team_2
0,Wellington,17-02-2012,wickets,6.0,England,New Zealand,England
1,Hamilton,19-02-2012,runs,48.0,England,New Zealand,England
2,Auckland,22-02-2012,runs,10.0,England,New Zealand,England
3,Abu Dhabi,25-11-2011,wickets,5.0,Pakistan,Pakistan,Sri Lanka
4,,23-02-2012,runs,8.0,Pakistan,England,Pakistan


Removing columns which do not play a role in predicting the results such as "Ground", "Date", "Win by" and "Win margin".

In [17]:
matches_teams_data = matches_teams.drop(["Ground", "Date", "Win by", "Win margin"], axis=1)

In [18]:
matches_teams_data.head()

Unnamed: 0,Result,Team_1,Team_2
0,England,New Zealand,England
1,England,New Zealand,England
2,England,New Zealand,England
3,Pakistan,Pakistan,Sri Lanka
4,Pakistan,England,Pakistan


The cell below builds a label column `winning_team` (0 if Team 1 wins, 1 if Team 2 wins) and then drops it again, so the data passed on still has only `Result`, `Team_1` and `Team_2`.

In [19]:
matches_teams_data = matches_teams_data.reset_index(drop=True)
matches_teams_data.loc[matches_teams_data["Result"] == matches_teams_data["Team_1"], 'winning_team'] = 0
matches_teams_data.loc[matches_teams_data["Result"] == matches_teams_data["Team_2"], 'winning_team'] = 1
# The label column is dropped again here, so the model below is trained on 'Result' (team names).
matches_teams_data = matches_teams_data.drop(['winning_team'], axis=1)

In [20]:
matches_teams_data.head()

Unnamed: 0,Result,Team_1,Team_2
0,England,New Zealand,England
1,England,New Zealand,England
2,England,New Zealand,England
3,Pakistan,Pakistan,Sri Lanka
4,Pakistan,England,Pakistan
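As a sketch of how a 0/1 label could be built and kept, the toy example below uses `np.select` on three rows taken from the preview above. This is an alternative shown for illustration, not what the notebook trains on. The `NaN` default covers rows where `Result` matches neither team, which presumably happens for matches without a winner.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Result": ["England", "Pakistan", "Pakistan"],
    "Team_1": ["New Zealand", "Pakistan", "England"],
    "Team_2": ["England", "Sri Lanka", "Pakistan"],
})

# 0 when Team_1 won, 1 when Team_2 won, NaN when Result matches neither side.
toy["winning_team"] = np.select(
    [toy["Result"] == toy["Team_1"], toy["Result"] == toy["Team_2"]],
    [0.0, 1.0],
    default=np.nan,
)

print(toy["winning_team"].tolist())
```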


In [21]:
labelled = pd.get_dummies(matches_teams_data, prefix=["Team_1", "Team_2"], columns=["Team_1", "Team_2"])

x = labelled.drop(["Result"], axis=1)
y = labelled["Result"]

In [22]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=42)

In [23]:
labelled.head()

Unnamed: 0,Result,Team_1_Afghanistan,Team_1_Australia,Team_1_Bangladesh,Team_1_Barbados,Team_1_Bermuda,Team_1_Canada,Team_1_England,Team_1_France,Team_1_Germany,...,Team_2_Scotland,Team_2_Singapore,Team_2_South Africa,Team_2_Sri Lanka,Team_2_Thailand,Team_2_Uganda,Team_2_United Arab Emirates,Team_2_United States of America,Team_2_West Indies,Team_2_Zimbabwe
0,England,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,England,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,England,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Pakistan,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,Pakistan,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
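To make the encoding step concrete, here is a small sketch of `pd.get_dummies` on two rows from the preview above. The `dtype=int` argument is an addition for this example: recent pandas versions return boolean columns by default, and `dtype=int` gives the 0/1 columns shown in the output above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Result": ["England", "Pakistan"],
    "Team_1": ["New Zealand", "Pakistan"],
    "Team_2": ["England", "Sri Lanka"],
})

# Each distinct team name becomes its own indicator column per side.
encoded = pd.get_dummies(
    toy, prefix=["Team_1", "Team_2"], columns=["Team_1", "Team_2"], dtype=int
)

print(list(encoded.columns))
```

`Result` is left untouched because it is not listed in `columns`, which is why it can still be split off as the target.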


### Random forest classifier

A random forest is trained on the one-hot encoded team columns, with the `Result` column as the target.

In [24]:
rf = RandomForestClassifier(n_estimators=100, max_depth=20, random_state=0)
rf.fit(x_train, y_train) 

score = rf.score(x_train, y_train)
score2 = rf.score(x_test, y_test)

print("Training set accuracy: ", '%.3f'%(score))
print("Test set accuracy: ", '%.3f'%(score2))

Training set accuracy:  0.692
Test set accuracy:  0.618


In [25]:
rankings.head()

Unnamed: 0,Position,Team,Rating
0,1,India,268
1,2,England,266
2,3,Pakistan,258
3,4,South Africa,256
4,5,New Zealand,252


In [26]:
fixtures.head()

Unnamed: 0,S. No,Date,Team_1,Team_2,Venue
0,1,22-Oct,New Zealand,Australia,Sydney
1,2,22-Oct,England,Afghanistan,Perth
2,3,23-Oct,Sri Lanka,Ireland,Hobart
3,4,23-Oct,India,Pakistan,Melbourne
4,5,24-Oct,Bangladesh,Netherlands,Hobart


Adding each team's ranking position to the fixtures so that a "favourite" can be identified for every match.

In [27]:
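# Look up each team's ranking position by name; names without a match become NaN.
fixtures.insert(1, 'first_position', fixtures['Team_1'].map(rankings.set_index('Team')['Position']))
fixtures.insert(2, 'second_position', fixtures['Team_2'].map(rankings.set_index('Team')['Position']))

# fixtures has only 30 rows, so this slice keeps all of them.
fixtures = fixtures.iloc[:45, :]
fixtures.head()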

Unnamed: 0,S. No,first_position,second_position,Date,Team_1,Team_2,Venue
0,1,5.0,6,22-Oct,New Zealand,Australia,Sydney
1,2,2.0,10,22-Oct,England,Afghanistan,Perth
2,3,8.0,12,23-Oct,Sri Lanka,Ireland,Hobart
3,4,1.0,3,23-Oct,India,Pakistan,Melbourne
4,5,9.0,17,24-Oct,Bangladesh,Netherlands,Hobart
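The `first_position` column is displayed as floats, which suggests that at least one `Team_1` name found no match in the rankings and was mapped to `NaN`. A later output line prints "South Africa" followed by two spaces, which would be consistent with a trailing space in that name. The sketch below assumes that cause and shows how stripping whitespace before the lookup would avoid it; `toy_fixtures`, `raw` and `clean` are illustrative names.

```python
import pandas as pd

rankings = pd.DataFrame({"Team": ["South Africa", "Bangladesh"], "Position": [4, 9]})
# Assumed problem case: a team name with a trailing space.
toy_fixtures = pd.DataFrame({"Team_1": ["South Africa "], "Team_2": ["Bangladesh"]})

position = rankings.set_index("Team")["Position"]

# A name with stray whitespace finds no match and maps to NaN.
raw = toy_fixtures["Team_1"].map(position)

# Stripping whitespace first restores the lookup.
clean = toy_fixtures["Team_1"].str.strip().map(position)

print(raw.isna().tolist(), clean.tolist())
```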


In [28]:
fixtures.tail()

Unnamed: 0,S. No,first_position,second_position,Date,Team_1,Team_2,Venue
25,26,6.0,10,04-Nov,Australia,Afghanistan,Adelaide
26,27,2.0,8,05-Nov,England,Sri Lanka,Sydney
27,28,4.0,17,06-Nov,South Africa,Netherlands,Adelaide
28,29,3.0,9,06-Nov,Pakistan,Bangladesh,Adelaide
29,30,1.0,11,06-Nov,India,Zimbabwe,Melbourne


#### Treating the higher-ranked team as the "favourite" of each match

In [29]:
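pred_set = []

# Put the better-ranked side (lower position number) in Team_1.
# A missing position (NaN) makes the comparison False, so that row takes the else branch.
for index, row in fixtures.iterrows():
    if row['first_position'] < row['second_position']:
        pred_set.append({'Team_1': row['Team_1'], 'Team_2': row['Team_2'], 'winning_team': None})
    else:
        pred_set.append({'Team_1': row['Team_2'], 'Team_2': row['Team_1'], 'winning_team': None})

pred_set = pd.DataFrame(pred_set)
# Keep the team names for printing; pred_set is one-hot encoded in the next cell.
backup_pred_set = pred_set.copy()
pred_set.head()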

Unnamed: 0,Team_1,Team_2,winning_team
0,New Zealand,Australia,
1,England,Afghanistan,
2,Sri Lanka,Ireland,
3,India,Pakistan,
4,Bangladesh,Netherlands,
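The row-by-row loop can also be expressed with `Series.where`, which swaps the two sides only where the second team is better ranked. This is a sketch on two fixtures with the positions shown earlier; `team_1_favoured` and `ordered` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    "Team_1": ["New Zealand", "Ireland"],
    "Team_2": ["Australia", "Afghanistan"],
    "first_position": [5, 12],
    "second_position": [6, 10],
})

# True where Team_1 is already the better-ranked (lower position number) side.
team_1_favoured = toy["first_position"] < toy["second_position"]

# where() keeps the original value when the condition holds, else takes the other side.
ordered = pd.DataFrame({
    "Team_1": toy["Team_1"].where(team_1_favoured, toy["Team_2"]),
    "Team_2": toy["Team_2"].where(team_1_favoured, toy["Team_1"]),
})

print(ordered.values.tolist())
```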


In [30]:
# Converting into dummy variables
pred_set = pd.get_dummies(pred_set, prefix=['Team_1', 'Team_2'], columns=['Team_1', 'Team_2'])

# Add the training columns that are missing here, then match the training column order.
missing_cols = set(labelled.columns) - set(pred_set.columns)
for c in missing_cols:
    pred_set[c] = 0
pred_set = pred_set[labelled.columns]

pred_set = pred_set.drop(['Result'], axis=1)
pred_set.head()

Unnamed: 0,Team_1_Afghanistan,Team_1_Australia,Team_1_Bangladesh,Team_1_Barbados,Team_1_Bermuda,Team_1_Canada,Team_1_England,Team_1_France,Team_1_Germany,Team_1_Hong Kong,...,Team_2_Scotland,Team_2_Singapore,Team_2_South Africa,Team_2_Sri Lanka,Team_2_Thailand,Team_2_Uganda,Team_2_United Arab Emirates,Team_2_United States of America,Team_2_West Indies,Team_2_Zimbabwe
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
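Aligning the prediction columns with the training columns can be done in one step with `DataFrame.reindex`. The sketch below uses a made-up, shortened column list; `train_columns`, `toy_pred` and `aligned` are illustrative names.

```python
import pandas as pd

# Shortened stand-in for the training feature columns.
train_columns = ["Team_1_England", "Team_1_India", "Team_2_Pakistan", "Team_2_Zimbabwe"]

# One fixture (India v Pakistan) after one-hot encoding: only two columns exist.
toy_pred = pd.DataFrame({"Team_1_India": [1], "Team_2_Pakistan": [1]})

# reindex adds the missing columns filled with 0 and enforces the training order.
aligned = toy_pred.reindex(columns=train_columns, fill_value=0)

print(aligned.iloc[0].tolist())
```

Columns that exist only in the prediction frame are dropped by the same call, so the result always has exactly the columns the model was fitted on.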


In [31]:
fixtures.shape

(30, 7)

In [32]:
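predictions = rf.predict(pred_set)
# rf was trained on 'Result', so predictions holds team names, not 0/1.
# The check below therefore never matches 1 and the else branch always runs,
# printing Team_1 (the better-ranked side) as the winner.
for i in range(fixtures.shape[0]):
    print(backup_pred_set.iloc[i, 1] + " and " + backup_pred_set.iloc[i, 0])
    if predictions[i] == 1:
        print("Winner: " + backup_pred_set.iloc[i, 1])
    else:
        print("Winner: " + backup_pred_set.iloc[i, 0])
    print("")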

Australia and New Zealand
Winner: New Zealand

Afghanistan and England
Winner: England

Ireland and Sri Lanka
Winner: Sri Lanka

Pakistan and India
Winner: India

Netherlands and Bangladesh
Winner: Bangladesh

Zimbabwe and South Africa
Winner: South Africa

Sri Lanka and Australia
Winner: Australia

Ireland and England
Winner: England

Afghanistan and New Zealand
Winner: New Zealand

South Africa  and Bangladesh
Winner: Bangladesh

Netherlands and India
Winner: India

Zimbabwe and Pakistan
Winner: Pakistan

Ireland and Afghanistan
Winner: Afghanistan

Australia and England
Winner: England

Sri Lanka and New Zealand
Winner: New Zealand

Zimbabwe and Bangladesh
Winner: Bangladesh

Netherlands and Pakistan
Winner: Pakistan

South Africa and India
Winner: India

Ireland and Australia
Winner: Australia

Afghanistan and Sri Lanka
Winner: Sri Lanka

New Zealand and England
Winner: England

Netherlands and Zimbabwe
Winner: Zimbabwe

Bangladesh and India
Winner: India

South Africa and Pakistan
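Because the classifier is fitted on `Result`, its predictions are team names rather than 0/1 flags. The sketch below uses hypothetical predictions (not output from the model in this notebook) to show why a comparison with `1` never succeeds, and how name predictions could be printed directly instead.

```python
import pandas as pd

toy_fixtures = pd.DataFrame({
    "Team_1": ["India", "England"],
    "Team_2": ["Pakistan", "Australia"],
})

# Hypothetical predictions from a classifier trained on winner names.
name_predictions = ["Pakistan", "England"]

# Comparing a name with the integer 1 is never true, so a 0/1 check
# would always fall through to its else branch.
flags = [p == 1 for p in name_predictions]
print(flags)

# With name predictions, the predicted winner can be reported as-is.
lines = [
    f"{t2} and {t1}: {p}"
    for t1, t2, p in zip(toy_fixtures["Team_1"], toy_fixtures["Team_2"], name_predictions)
]
print("\n".join(lines))
```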

In [33]:
semis = [('New Zealand', 'India'), ('England', 'Pakistan')]

The function below predicts the semi-finals and the final using the same logic as above.

In [34]:
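def predict(results, rankings, labelled, logreg):
    """Print the predicted winner for each (team_a, team_b) pair in results.

    logreg can be any fitted classifier; the notebook passes the random forest rf.
    """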
    positions = []

    for match in results:
        positions.append(rankings.loc[rankings['Team'] == match[0],'Position'].iloc[0])
        positions.append(rankings.loc[rankings['Team'] == match[1],'Position'].iloc[0])
    
    pred_set = []

    i = 0
    j = 0

    while i < len(positions):
        dict1 = {}

        if positions[i] < positions[i + 1]:
            dict1.update({'Team_1': results[j][0], 'Team_2': results[j][1]})
        else:
            dict1.update({'Team_1': results[j][1], 'Team_2': results[j][0]})

        pred_set.append(dict1)
        i += 2
        j += 1
        
    pred_set = pd.DataFrame(pred_set)
    backup_pred_set = pred_set

    pred_set = pd.get_dummies(pred_set, prefix=['Team_1', 'Team_2'], columns=['Team_1', 'Team_2'])

    missing_cols2 = set(labelled.columns) - set(pred_set.columns)
    for c in missing_cols2:
        pred_set[c] = 0
    pred_set = pred_set[labelled.columns]

    pred_set = pred_set.drop(['Result'], axis=1)


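    predictions = logreg.predict(pred_set)
    # As above, predictions holds team names, so the == 1 check never matches
    # and Team_1 (the better-ranked side) is always printed as the winner.
    for i in range(len(pred_set)):
        print(backup_pred_set.iloc[i, 1] + " and " + backup_pred_set.iloc[i, 0])
        if predictions[i] == 1:
            print("Winner: " + backup_pred_set.iloc[i, 1])
        else:
            print("Winner: " + backup_pred_set.iloc[i, 0])
        print("")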

In [35]:
predict(semis, rankings, labelled, rf)

New Zealand and India
Winner: India

Pakistan and England
Winner: England



In [36]:
final = [('India', 'England')]

In [37]:
predict(final, rankings, labelled, rf)

England and India
Winner: India

