![picture](https://drive.google.com/uc?export=view&id=1wjkS0L1i30melKIDhytUFnlAbPCRUwIb)


<hr color='#008B8B'>

<font size="5" color = '#19388A'> <b>CS5346 S2-AY2022-23: Information Visualisation - OTOT TASK A3 </b></font>
    
<font size="3" color = '#4F91CD'><i>Name: Varsha Singh</i></font> 
   

# Objective

<font size = 5  color = '#19388A'> <b>Inside IPL 2022: Analysis of India's Premier Cricket League(IPL)</b> </font>

This task aims to analyze the [IPL 2022 Matchwise dataset](https://www.kaggle.com/datasets/aravindas01/ipl-2022dataset) through visualizations. Our aim is to gain insights and answer specific queries while comprehending the different visual encoding and channel techniques.

<img src="https://www.sportstime247.com/wp-content/uploads/2022/02/IPL-2022-Group-Format.jpg"  width = 700 length = 700>




In [1]:
#Import libraries
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.offline import init_notebook_mode, iplot, plot
import plotly as py
init_notebook_mode(connected = True)
import plotly.graph_objs as go
#import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
from plotly import tools
import warnings
warnings.filterwarnings('ignore')

In [2]:
team_colors = {
    'CSK': '#FFFF3C',
    'PBKS':'#DCDDDF',
    'SRH': '#FF822A',
    'KKR': '#2E0854',
    'RCB': '#EC1C24',
    'RR': '#254AA5',
    'MI': '#004BA0',
    'DC': '#00008B',
    'GT': '#0C2340',
    'LSG': '#9D2235'}

<div class="alert alert-block alert-info">
The color codes for headings and titles are inspired by <a href="https://www.schemecolor.com/ipl-colors.php" class="alert-link">IPL colors</a>
    <br>
The color codes for teams are taken from <a href="https://www.schemecolor.com/ipl-cricket-team-color-codes.php" class="alert-link">here</a></div>



# About the dataset

The [IPL 2022 Matchwise dataset](https://www.kaggle.com/datasets/aravindas01/ipl-2022dataset) is available on Kaggle. 

This dataset contains match-wise data for the IPL 2022 season, which ran from March 26 to May 29 and comprised 74 matches. It covers the group stage, the playoffs, and the final, with the following 20 attributes:

`match_id`:  A unique identifier for each match in the IPL 2022 season <br>
`date`: The date on which the match was played<br>
`venue`: The name of the stadium where the match was played<br>
`team1`: The name of the first team playing in the match<br>
`team2`: The name of the second team playing in the match<br>
`stage`: The stage of the tournament in which the match was played<br>
`toss_winner`: The team that won the toss for the match<br>
`toss_decision`: The decision made by the team that won the toss, i.e., whether to bat or bowl first<br>
`first_ings_score`:  The total score of the team batting first in the match<br>
`first_ings_wkts` : The number of wickets lost by the team batting first in the match.<br>
`second_ings_score`: The total score of the team batting second in the match.<br>
`second_ings_wkts`: The number of wickets lost by the team batting second in the match<br>
`match_winner` : The name of the team that won the match.<br>
`won_by`: The method by which the winning team won the match, i.e., by runs or wickets.<br>
`margin`:  The margin by which the winning team won the match, in terms of runs or wickets.<br>
`player_of_the_match` :  The name of the player who was awarded the "Player of the Match" award for their performance in the match.<br>
`top_scorer`: The name of the player who scored the most runs in the match.<br>
`highscore`: The highest individual score by a player in the match, from either team.<br>
`best_bowling`: The name of the bowler who had the best bowling figures in the match (if two or more bowlers have the same bowling figures, the bowler who took more wickets in fewer overs is selected).<br>
`best_bowling_figure`: The bowling figures of the bowler who had the best performance in the match.<br>





## Data Preprocessing 

In [3]:
#Read data 
#Please change the location of the data file before rerunning
df = pd.read_csv("Book_ipl22_ver_33.csv")
df.shape

(74, 20)

In [4]:
df.head()

Unnamed: 0,match_id,date,venue,team1,team2,stage,toss_winner,toss_decision,first_ings_score,first_ings_wkts,second_ings_score,second_ings_wkts,match_winner,won_by,margin,player_of_the_match,top_scorer,highscore,best_bowling,best_bowling_figure
0,1,"March 26,2022","Wankhede Stadium, Mumbai",Chennai,Kolkata,Group,Kolkata,Field,131,5,133,4,Kolkata,Wickets,6,Umesh Yadav,MS Dhoni,50,Dwayne Bravo,3--20
1,2,"March 27,2022","Brabourne Stadium, Mumbai",Delhi,Mumbai,Group,Delhi,Field,177,5,179,6,Delhi,Wickets,4,Kuldeep Yadav,Ishan Kishan,81,Kuldeep Yadav,3--18
2,3,"March 27,2022","Dr DY Patil Sports Academy, Mumbai",Banglore,Punjab,Group,Punjab,Field,205,2,208,5,Punjab,Wickets,5,Odean Smith,Faf du Plessis,88,Mohammed Siraj,2--59
3,4,"March 28,2022","Wankhede Stadium, Mumbai",Gujarat,Lucknow,Group,Gujarat,Field,158,6,161,5,Gujarat,Wickets,5,Mohammed Shami,Deepak Hooda,55,Mohammed Shami,3--25
4,5,"March 29,2022","Maharashtra Cricket Association Stadium,Pune",Hyderabad,Rajasthan,Group,Hyderabad,Field,210,6,149,7,Rajasthan,Runs,61,Sanju Samson,Aiden Markram,57,Yuzvendra Chahal,3--22


**Replace team name with actual acronyms used in IPL** 

The team names in this dataset are currently city or state names. To make them more IPL-friendly, they are replaced with the official team acronyms. Please see below for reference:

<img src="team_color.png" align ='left' width = 400 length = 400>


In [5]:
# replace team names with their short forms that are actually used in IPL 
df = df.replace({'Mumbai': 'MI', 'Kolkata': 'KKR','Gujarat': 'GT','Hyderabad': 'SRH','Delhi': 'DC','Banglore': 'RCB','Lucknow': 'LSG','Chennai': 'CSK','Punjab': 'PBKS','Rajasthan': 'RR'})
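Because the replacement relies on exact string matches, a spelling that is missing from the mapping would pass through silently. The sketch below is one possible way to check for leftovers. The helper `unmapped_team_names`, the constant names, and the small `demo` frame are illustrative assumptions and are not part of the original notebook; the mapping itself is copied from the cell above.

```python
import pandas as pd

# Mapping copied from the cell above ('Banglore' is the spelling used in the dataset).
TEAM_ACRONYMS = {
    'Mumbai': 'MI', 'Kolkata': 'KKR', 'Gujarat': 'GT', 'Hyderabad': 'SRH',
    'Delhi': 'DC', 'Banglore': 'RCB', 'Lucknow': 'LSG', 'Chennai': 'CSK',
    'Punjab': 'PBKS', 'Rajasthan': 'RR',
}
TEAM_COLUMNS = ['team1', 'team2', 'toss_winner', 'match_winner']


def unmapped_team_names(frame, columns=TEAM_COLUMNS):
    """Return the sorted values in the team columns that are not known acronyms."""
    known = set(TEAM_ACRONYMS.values())
    found = set()
    for col in columns:
        found.update(frame[col].dropna().unique())
    return sorted(found - known)


# Tiny made-up frame (not the real data) with one deliberately unmapped spelling.
demo = pd.DataFrame({
    'team1': ['Chennai', 'Delhi'],
    'team2': ['Kolkata', 'Bangalore'],
    'toss_winner': ['Kolkata', 'Delhi'],
    'match_winner': ['Kolkata', 'Delhi'],
})
demo = demo.replace(TEAM_ACRONYMS)
leftover = unmapped_team_names(demo)
print(leftover)
```

On the real data, an empty list from `unmapped_team_names(df)` would indicate that every team column now holds only acronyms.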

In [6]:
#function to return summary of dataframe
def df_summary(df):
    # Collect one summary row per column, then build the DataFrame in one step
    # (DataFrame.append was removed in pandas 2.0)
    rows = []

    # Iterate through the columns of the DataFrame
    for col in df.columns:
        rows.append({
            "Variable": col,
            # Variable type
            "Type": df[col].dtype,
            # Number of missing values
            "Missing Values": df[col].isna().sum(),
            # Number of unique values
            "Unique Values": df[col].nunique(),
            # The distinct values themselves, in order of first appearance
            "Distinct Unique Values": df[col].drop_duplicates().tolist(),
        })

    return pd.DataFrame(rows, columns=["Variable", "Type", "Missing Values",
                                       "Unique Values", "Distinct Unique Values"])

In [7]:
#display a brief insight of dataset
df_summary(df)

Unnamed: 0,Variable,Type,Missing Values,Unique Values,Distinct Unique Values
0,match_id,int64,0,74,"[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14..."
1,date,object,0,62,"[March 26,2022, March 27,2022, March 28,2022, ..."
2,venue,object,0,6,"[Wankhede Stadium, Mumbai, Brabourne Stadium, ..."
3,team1,object,0,9,"[CSK, DC, RCB, GT, SRH, KKR, MI, LSG, PBKS]"
4,team2,object,0,9,"[KKR, MI, PBKS, LSG, RR, GT, SRH, CSK, DC]"
5,stage,object,0,3,"[Group, Playoff, Final]"
6,toss_winner,object,0,10,"[KKR, DC, PBKS, GT, SRH, RCB, LSG, MI, CSK, RR]"
7,toss_decision,object,0,2,"[Field, Bat]"
8,first_ings_score,int64,0,55,"[131, 177, 205, 158, 210, 128, 137, 193, 171, ..."
9,first_ings_wkts,int64,0,10,"[5, 2, 6, 10, 7, 8, 3, 4, 9, 0]"


*The dataset has no null values, and on inspection all categorical variables contain valid values. The visualisations that follow assume that the dataset's values are correct.*
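The checks described above can also be expressed in code. The following is a minimal sketch, assuming the column names listed in the dataset description; the function `basic_checks`, the particular set of rules, and the three-row `demo` frame are illustrative and were not part of the original analysis.

```python
import pandas as pd


def basic_checks(frame):
    """Return a dict of simple sanity-check results for a match-level frame."""
    winner_is_participant = (
        (frame['match_winner'] == frame['team1'])
        | (frame['match_winner'] == frame['team2'])
    )
    return {
        'no_missing_values': bool(frame.notna().all().all()),
        'valid_toss_decision': bool(frame['toss_decision'].isin(['Field', 'Bat']).all()),
        'winner_played_in_match': bool(winner_is_participant.all()),
        'unique_match_ids': bool(frame['match_id'].is_unique),
    }


# Made-up rows for illustration only.
demo = pd.DataFrame({
    'match_id': [1, 2, 3],
    'team1': ['CSK', 'DC', 'RCB'],
    'team2': ['KKR', 'MI', 'PBKS'],
    'toss_decision': ['Field', 'Field', 'Bat'],
    'match_winner': ['KKR', 'DC', 'PBKS'],
})
clean_report = basic_checks(demo)

# Corrupt one row: the recorded winner did not play in that match.
broken = demo.copy()
broken.loc[2, 'match_winner'] = 'GT'
broken_report = basic_checks(broken)

print(clean_report)
print(broken_report)
```

Calling `basic_checks(df)` on the real data would return the same four flags, which makes the "trust the dataset" assumption explicit and repeatable.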

# Query 1

<font size="5" color = '#19388A'> <b>Champion of IPL 2022: The team that clinched the title </b></font>

To visualize the points table across 74 matches, a new Pandas DataFrame called points_table is created. Each match is added to the DataFrame in chronological order. When a team wins, they are awarded two points. This allows for the animation of the points table over the course of the matches.





In [8]:
# Initialize dictionaries to keep track of team scores and the points table
points_table = {'Match ID' : [], 'Team Name' : [], 'Points' : []}
team_points = {'MI':0, 'CSK':0, 'DC':0, 'RCB':0, 'GT':0, 'SRH':0, 'LSG':0, 'PBKS':0,'RR':0, 'KKR':0}

# Populate the points table with initial values for each team (shown as match 0)
for team, points in team_points.items():
    points_table['Match ID'].append(0)
    points_table['Team Name'].append(team)
    points_table['Points'].append(points)

# Iterate over each match in the input DataFrame and update the team scores and points table
for matchno, winner in zip(df['match_id'], df['match_winner']):
    team_points[winner] += 2
    for team, points in team_points.items():
        points_table['Match ID'].append(matchno)
        points_table['Team Name'].append(team)
        points_table['Points'].append(points)
        
#Convert the points table to a Pandas DataFrame        
points_table = pd.DataFrame(points_table)

In [9]:
points_table.head()

Unnamed: 0,Match ID,Team Name,Points
0,0,MI,0
1,0,CSK,0
2,0,DC,0
3,0,RCB,0
4,0,GT,0
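The same running table can be built without an explicit loop. The sketch below is an alternative, not the notebook's method: it one-hot encodes the winner of each match with `pd.crosstab`, takes a cumulative sum, and reshapes the result to long format. The function `cumulative_points` and the three-match `demo` frame are illustrative assumptions.

```python
import pandas as pd


def cumulative_points(matches, points_per_win=2):
    """Long-format running points table: one row per (match, team)."""
    teams = sorted(set(matches['team1']) | set(matches['team2']))
    ordered = matches.sort_values('match_id')
    # Rows are matches, columns are teams, 1 marks the winner of that match.
    wins = pd.crosstab(ordered['match_id'], ordered['match_winner'])
    wins = wins.reindex(columns=teams, fill_value=0)
    running = wins.cumsum() * points_per_win
    # Prepend a row of zeros as "match 0" so every team starts level.
    start = pd.DataFrame(0, index=[0], columns=teams)
    running = pd.concat([start, running])
    running.index.name = 'Match ID'
    long_table = running.reset_index().melt(
        id_vars='Match ID', var_name='Team Name', value_name='Points')
    return long_table.sort_values(['Match ID', 'Team Name']).reset_index(drop=True)


# Made-up three-match season for illustration only.
demo = pd.DataFrame({
    'match_id': [1, 2, 3],
    'team1': ['A', 'A', 'B'],
    'team2': ['B', 'C', 'C'],
    'match_winner': ['B', 'A', 'B'],
})
table = cumulative_points(demo)
final = table[table['Match ID'] == 3].set_index('Team Name')['Points']
print(table)
```

The output has the same three columns as `points_table`, so it could be passed to the same `px.bar` call; only the row order within each match differs (alphabetical by team here).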


In [10]:
fig = px.bar(points_table, x="Points", y="Team Name", color="Team Name",
             text='Points', orientation='h', range_x=[0, 25], 
             title='Champion of IPL 2022: The team that clinched the title',
             color_discrete_map=team_colors, template='plotly_white',
             animation_frame="Match ID", animation_group="Team Name")

fig.update_layout(yaxis={'categoryorder':'total ascending'})

fig.update_layout(images=[dict(
    source="ipl.jpeg",
    xref="paper", yref="paper",
    x=0.001, y=1.1,
    sizex=4, sizey=1.9,
    opacity=0.3,
    layer="below")])
fig.show()

In this visualization, we aim to determine the winner of IPL 2022.
- The `points_table` DataFrame holds every team's running total after each of the **74 matches**, in **chronological order**, with two points awarded per win. Each match becomes one animation frame.
<br>
- The visualization is plotted for all 10 teams and uses the [standard color coding](https://www.schemecolor.com/ipl-cricket-team-color-codes.php) used in IPL. 
<br>
- **Reason for using animation**: The matches are played sequentially, and the teams' positions may change after each match. For instance, KKR initially held the top spot until match 25, after which Gujarat Titans (GT) took over the lead on the leaderboard. The animation recreates what fans experience as the table shifts after every match.
<br>

- The following visual encoding is used:
  - This visualisation is a **horizontal bar chart**.
  - The **x-axis** shows the points scored by the teams over the course of the 74 matches, with ticks at intervals of 5 points starting from 0. 
  - The **animation frames** show the 74 matches in chronological order. 
  - The **y-axis** shows the names of the 10 teams, whose order changes with their points over the course of IPL 2022. Once the animation completes, the teams appear in descending order of points. 
  - Each **bar** represents one team and is color-encoded as per the IPL color standard.
  - The **bar length** represents the numerical value of the points scored by each team.
  - The **text label** on each bar provides a numerical value for the points scored by each team.
  
 

- From the visualization, we can see that Gujarat Titans clinched the IPL 2022 title. 


<img src="https://images.indianexpress.com/2022/05/GT-trophy.jpg"  width = 700 length = 700>

# Query 2

<font size="5" color = '#19388A'> <b>Are teams more likely to win a match if they win the toss?  </b></font>

In [11]:
# count the number of matches won and lost when winning the toss
won = df[df['toss_winner'] == df['match_winner']].shape[0]
lost = df[df['toss_winner'] != df['match_winner']].shape[0]

fig = go.Figure(data=[go.Pie(
    labels=['Matches won', 'Matches lost'],
    values=[won, lost],
    pull=[0.1, 0],  
    hole=.5,  
    marker=dict(colors=['#19388A', '#4F91CD']),  
    textinfo='label+percent+value',   
    sort=False
)])

# add a title and subtitle to the chart
fig.update_layout(
    title={
        'text': 'Are teams more likely to win a match if they win the toss?',
    },
    annotations=[
        dict(
            x=0.5,
            y=0.5,
            text='Total Matches: ' + str(won + lost),
            showarrow=False,
            font=dict(size=20)
        ),
    ],
    margin=dict(l=20, r=20, t=80, b=40) 
)

fig.show()


In this visualization, we aim to determine whether winning the toss plays an important role in winning the match.
- The variable `won` holds the number of matches won by the team that won the toss.
- The variable `lost` holds the number of matches lost by the team that won the toss.

- The following visual encoding is used:
  - The chart is a **donut chart** (a pie chart with a hole in the centre).
  - The **labels** represent the categories being compared, which are "Matches won" and "Matches lost."
  - The **values** are the `won` and `lost` counts; each slice shows both its count and its percentage of the total.
  - The two slices are distinguished by **color**, using the IPL-inspired heading colors. 
  
- From the visualization, we can see that the toss does not have much influence on the match result: the toss winner won roughly half of the matches. 
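To put a number on "not much influence", an exact binomial test against a 50/50 null can be worked out by hand. The sketch below uses the figure of 36 toss-winner victories out of 74 matches quoted in the Query 3 discussion; the function `two_sided_binomial_p` is an illustrative helper and the test itself is an addition, not part of the original analysis.

```python
from math import comb


def two_sided_binomial_p(successes, trials):
    """Exact two-sided p-value under the null that each outcome has probability 0.5."""
    total = 2 ** trials
    # Probability of seeing this many successes or fewer / or more, under the null.
    lower = sum(comb(trials, i) for i in range(0, successes + 1)) / total
    upper = sum(comb(trials, i) for i in range(successes, trials + 1)) / total
    return min(1.0, 2 * min(lower, upper))


toss_winner_wins = 36   # matches won by the toss winner (from the Query 3 text)
total_matches = 74      # matches in the season

win_rate = toss_winner_wins / total_matches
p_value = two_sided_binomial_p(toss_winner_wins, total_matches)
print(f"win rate = {win_rate:.3f}, p-value = {p_value:.2f}")
```

A p-value this far from the usual 0.05 threshold is consistent with the reading of the chart: on this season's data, the toss winner's record is indistinguishable from a coin flip.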


# Query 3

<font size="5" color = '#19388A'> <b>Batting First = Match winning?  </b></font>

In [12]:
# Count the matches won by the toss winner, split by the toss decision (field or bat)
# (summing the boolean masks avoids positional indexing into .count(), which is deprecated in pandas 2.x)
toss_winner_won = df['toss_winner'] == df['match_winner']
winner_field = int((toss_winner_won & (df['toss_decision'] == 'Field')).sum())
winner_bat = int((toss_winner_won & (df['toss_decision'] == 'Bat')).sum())

fig = go.Figure(go.Bar(
    x=['Field', 'Bat'],
    y=[winner_field, winner_bat],
    text=[winner_field, winner_bat],
    textposition='auto',
    marker=dict(
        color=['#19388A', '#4F91CD'],
    ),
))

fig.update_layout(
    template = 'plotly_white',height=500, width=500,
    title='Batting First = Match winning?',
    xaxis=dict(
        title='Toss Decision',
        title_font=dict(size=14),
        tickfont=dict(size=12)
    ),
    yaxis=dict(
        title='Number of Matches Won',
        title_font=dict(size=14),
        tickfont=dict(size=12)
    ),
)

fig.show()

In this visualization, we aim to determine whether electing to bat first gives the team an edge over the other team.
- **36** matches were won by the team that won the toss. Of these, 29 toss winners chose to field first and 7 opted to bat first. 
- `winner_field` is the number of matches won by the team that won the toss and chose to field first.  
- `winner_bat` is the number of matches won by the team that won the toss and elected to bat first. 
<br>

- The following visual encoding is used:
  - This visualisation is a **simple bar chart**.
  - The **x-axis** shows the toss decision, i.e. whether the team chose to bat first or field first after winning the toss.
  - The **y-axis** shows the number of matches won.
  - Each **bar** represents one toss decision and is color coded to distinguish the fielding and batting decisions. 
  - The **bar height** represents the number of matches won by toss winners who made that decision. 
  - The **text label** on each bar gives the number of matches won. 
  
  
- From the visualization, we can see that, among the toss winners who went on to win, most (29 of 36) had decided to field first. 
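The raw counts above do not account for how often each decision was made, so a win rate per decision can be a useful complement. The sketch below is one possible way to compute it, assuming the column names from the dataset description; the function `toss_winner_success_rate` and the five-row `demo` frame are illustrative and contain made-up values.

```python
import pandas as pd


def toss_winner_success_rate(matches):
    """Per toss decision: matches played, toss-winner wins, and the win rate."""
    toss_winner_won = (matches['toss_winner'] == matches['match_winner']).astype(int)
    return (
        matches.assign(toss_winner_won=toss_winner_won)
        .groupby('toss_decision')['toss_winner_won']
        .agg(matches='count', wins='sum', win_rate='mean')
    )


# Made-up rows for illustration only.
demo = pd.DataFrame({
    'toss_decision': ['Field', 'Field', 'Field', 'Bat', 'Bat'],
    'toss_winner':   ['A', 'B', 'C', 'A', 'B'],
    'match_winner':  ['A', 'B', 'D', 'C', 'B'],
})
summary = toss_winner_success_rate(demo)
print(summary)
```

Applied to the real `df`, the `wins` column would reproduce the two bar heights, while `win_rate` shows whether fielding first was actually more successful or simply chosen more often.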


# Query 4

<font size="5" color = '#19388A'> <b>Venue favoritism: Which teams dominate which cricket grounds?</b></font>

In [13]:
# Count the number of matches won by each team at each venue
venue_team_wins = df.groupby(['venue', 'match_winner'])['match_winner'].count().reset_index(name='matches_won')

# Find the maximum number of matches won by a single team at each venue
max_wins_per_venue = venue_team_wins.groupby(['venue'])['matches_won'].max().reset_index(name='max_wins')

# Filter the venue_team_wins DataFrame to only include rows where the number of matches won by a single team equals the maximum number of matches won at the venue
team_with_max_wins = pd.merge(venue_team_wins, max_wins_per_venue, on='venue', suffixes=('', '_max'))
team_with_max_wins = team_with_max_wins[team_with_max_wins['matches_won'] == team_with_max_wins['max_wins']]

fig = px.bar(team_with_max_wins, x='venue', y='matches_won', color='match_winner',color_discrete_map=team_colors,
             title='Venue favoritism: Which teams dominate which cricket grounds?', text='matches_won')
fig.update_layout(xaxis_title='Venue', yaxis_title='Number of Matches Won')
fig.update_layout(
    template = 'plotly_white',height=700, width=1000,
)
fig.show()

In [14]:
team_with_max_wins

Unnamed: 0,venue,match_winner,matches_won,max_wins
0,"Brabourne Stadium, Mumbai",DC,4,4
7,"Dr DY Patil Sports Academy, Mumbai",CSK,3,3
11,"Dr DY Patil Sports Academy, Mumbai",LSG,3,3
16,"Dr DY Patil Sports Academy, Mumbai",SRH,3,3
17,"Eden Gardens, Kolkata",GT,1,1
18,"Eden Gardens, Kolkata",RCB,1,1
20,"Maharashtra Cricket Association Stadium,Pune",GT,3,3
26,"Narendra Modi Stadium, Ahmedabad",GT,1,1
27,"Narendra Modi Stadium, Ahmedabad",RR,1,1
34,"Wankhede Stadium, Mumbai",RCB,4,4


In this visualization, we aim to determine whether a venue favours one team over the others.
- There were **6 venues** in total in IPL 2022. 
- First, the number of matches won by each team at each venue is computed. From this, the team (or teams, in case of a tie) with the maximum number of wins at each venue is selected and plotted on the bar chart. 
<br>

- The following visual encoding is used:
  - This visualisation is a **stacked bar chart**; venues where several teams tie for the most wins show one segment per team.
  - The **x-axis** represents the venue.
  - The **y-axis** represents the number of matches won.
  - The **color of the bars** represents the team that won the matches.
  - The **height of each bar segment** represents the maximum number of matches won by a single team at that venue.  
  - The **text label** on each bar segment gives the number of matches won. 
  
  
- From the visualization, we can see that Brabourne Stadium, Mumbai favoured Delhi Capitals (DC), while Wankhede Stadium, Mumbai seems to have favoured Royal Challengers Bangalore (RCB).  
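Win counts alone depend on how many matches a team happened to play at each ground. A hedged complement is to divide wins by appearances. The sketch below assumes the `venue`, `team1`, `team2`, and `match_winner` columns from the dataset description; the function `venue_win_rates` and the four-match `demo` frame are illustrative and use made-up values.

```python
import pandas as pd


def venue_win_rates(matches):
    """Wins, matches played, and win rate for every (venue, team) pair."""
    # Stack team1 and team2 so each match contributes one row per participant.
    appearances = pd.concat([
        matches[['venue', 'team1']].rename(columns={'team1': 'team'}),
        matches[['venue', 'team2']].rename(columns={'team2': 'team'}),
    ])
    played = appearances.groupby(['venue', 'team']).size().rename('played')
    won = (
        matches.groupby(['venue', 'match_winner']).size()
        .rename('won')
        .rename_axis(['venue', 'team'])
    )
    table = played.to_frame().join(won)
    # Teams that never won at a venue have no entry in `won`; treat that as zero.
    table['won'] = table['won'].fillna(0).astype(int)
    table['win_rate'] = table['won'] / table['played']
    return table.reset_index()


# Made-up rows for illustration only.
demo = pd.DataFrame({
    'venue':        ['X', 'X', 'X', 'Y'],
    'team1':        ['A', 'A', 'B', 'A'],
    'team2':        ['B', 'C', 'C', 'B'],
    'match_winner': ['A', 'A', 'C', 'B'],
})
rates = venue_win_rates(demo)
a_at_x = rates[(rates['venue'] == 'X') & (rates['team'] == 'A')].iloc[0]
b_at_x = rates[(rates['venue'] == 'X') & (rates['team'] == 'B')].iloc[0]
print(rates)
```

With only a handful of matches per team and venue, such rates are noisy, so they are best read alongside the `played` column rather than on their own.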

# Query 5

<font size="5" color = '#19388A'> <b>Does success in IPL 2022 hinge on having top performing players as part of the winning team? </b></font>

In [15]:
top_players = df['player_of_the_match'].value_counts().nlargest(5)
player_colors = {
    'Umesh Yadav': '#2E0854',
    'Jos Buttler': '#254AA5',
    'Kuldeep Yadav': '#00008B',
    'David Miller': '#0C2340',
    'Quinton de Kock': '#9D2235'
}

fig = go.Figure()

# One trace per player: marker size scales with the number of awards
for player_name, count in top_players.items():
    color = player_colors[player_name]
    fig.add_trace(go.Scatter(
        x=[count],
        y=[player_name],
        mode='markers',
        name=player_name,
        marker=dict(
            size=count*20,
            sizemode='diameter',
            sizemin=5,
            color=color,
            showscale=False
        )
    ))

fig.update_layout(
    title='Top 5 Players of the Match',
    xaxis_title='No of time selected as Player of the Match',
    yaxis_title='Player of the Match',
    template='plotly_white', height=500, width=600,
)
fig.update_layout(yaxis={'categoryorder':'total ascending'})

fig.show()

In this visualization, we aim to determine whether a team's success depends on having the top-performing players.
- The top 5 players are computed with `value_counts()` on `player_of_the_match`, keeping the five largest counts.  


- The following visual encoding is used:
  - This visualisation is a **scatter plot**.
  - The **x-axis** represents the number of times a player was selected as Player of the Match.
  - The **y-axis** represents the top 5 players.
  - The **size of the markers** represents the number of times a player was selected as Player of the Match, and the **color of the markers** is coded by the team the player belonged to. For example, Kuldeep Yadav played for Delhi Capitals, so his marker uses the DC color. 
  
  
- From the visualization, we can see that the top player, Kuldeep Yadav, was part of Delhi Capitals (DC), which ranked 5th in the tournament. This suggests that an individual top performer alone does not decide the title. 

# References

1. https://www.espncricinfo.com/series/indian-premier-league-2022-1298423
2. https://plotly.com/python/reference/


# Appendix 

In [16]:
scores = pd.melt(df, id_vars=['match_id'], value_vars=['first_ings_score', 'second_ings_score'], var_name='inning', value_name='score')

fig = px.line(scores, x='match_id', y='score', color='inning',
              title='Trend of Scores in the First Inning and Second Inning')
fig.update_layout(xaxis_title='Match ID', yaxis_title='Score')
fig.show()