<a href="https://colab.research.google.com/github/sidharth178/Exploratory-Data-Analysis-Sports-IPL/blob/master/Exploratory_Data_Analysis_Sports(IPL).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 style="color:green" align="center"><b> Exploratory Data Analysis - Sports </b> </h1>

### **Author: Sidharth Kumar Mohanty**
### **Data Science and Business Analytics Intern @ The Sparks Foundation**
### **Task #5 : "Exploratory Data Analysis : Sports (Indian Premier League)"**
### **Dataset:** Click [here](https://bit.ly/34SRn3b)
### **Problem Statement :**


1.   Perform exploratory data analysis on the Indian Premier League (IPL).
2.   As a sports analyst, find the most successful teams and players, and the factors that contribute to a team's win or loss.
3.  Suggest teams or players that a company should endorse for its products.



## **1. Import Libraries**

In [None]:
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

## **2. Load Datasets**

In [None]:
# load matches dataset
matches_df = pd.read_csv("../input/ipl-data-set/matches.csv")

matches_df.head()

In [None]:
# load deliveries dataset
deliveries_df = pd.read_csv("../input/ipl-data-set/deliveries.csv")

deliveries_df.head()

In [None]:
# merge matches & deliveries datasets
merge_df = pd.merge(deliveries_df,matches_df,left_on='match_id',right_on='id')

merge_df.head()
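
Each delivery belongs to exactly one match, so the merge above should be many-to-one. The following is a minimal sketch on hypothetical toy frames (not the real CSV files) of how the `validate` and `indicator` options of `pd.merge` can confirm that no delivery is duplicated or left unmatched. It uses `how="left"` instead of the default inner join so that unmatched deliveries would stay visible.

```python
import pandas as pd

# hypothetical toy frames standing in for deliveries_df and matches_df
toy_deliveries = pd.DataFrame({"match_id": [1, 1, 2], "batsman_runs": [4, 0, 6]})
toy_matches = pd.DataFrame({"id": [1, 2], "Season": [2008, 2008]})

# validate raises MergeError if 'id' is not unique in the right frame;
# indicator adds a '_merge' column showing where each row came from
toy_merge = pd.merge(toy_deliveries, toy_matches, left_on="match_id", right_on="id",
                     how="left", validate="many_to_one", indicator=True)

# deliveries without a matching match row would be flagged 'left_only'
unmatched = int((toy_merge["_merge"] != "both").sum())
print(len(toy_merge), unmatched)
```

If the row count equals the number of deliveries and `unmatched` is zero, the merge neither duplicated nor lost any delivery.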

In [None]:
# size of each dataset
print("============================================")
print("size of matches dataset : ",matches_df.shape )
print("============================================")
print("size of deliveries dataset : ",deliveries_df.shape )
print("============================================")
print("size of merge dataset : ",merge_df.shape )
print("============================================")

## **3. EDA of Matches dataset**

In [None]:
matches_df.info()

In [None]:
# statistical analysis of matches_df
matches_df.describe(include='all')

## **4. Handling Missing Values**


In [None]:
# missing values in matches_df
matches_df.isnull().sum()

- Columns "city", "winner", "player_of_match", "umpire1", and "umpire2" each have a few missing values.
- The "umpire3" column has by far the most missing values, so we drop it from the DataFrame.
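
Raw counts are easier to judge as a share of all rows. Here is a minimal sketch, on a hypothetical toy frame rather than the real matches data, of ranking columns by their fraction of missing values; `missing_share` is an illustrative helper name, not part of the notebook.

```python
import pandas as pd

def missing_share(df):
    """Return the fraction of missing values per column, largest first."""
    return df.isnull().mean().sort_values(ascending=False)

# hypothetical toy frame, not the real matches data
toy = pd.DataFrame({
    "city": ["Mumbai", None, "Delhi", "Pune"],
    "umpire3": [None, None, None, None],
    "winner": ["A", "B", "A", "C"],
})
share = missing_share(toy)
print(share)
```

A column whose share is close to 1.0 carries almost no information and is a reasonable candidate for dropping.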

In [None]:
# drop "umpire3" column
matches_df.drop(["umpire3"],axis=1,inplace=True)

### **4.1. Handling Missing Values in "city" column**

In [None]:
# find the venue of every row whose "city" is missing
matches_df[matches_df["city"].isnull()][["city","venue"]]

- All the missing "city" values belong to matches played at "Dubai International Cricket Stadium", so we can fill them with "Dubai".
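
A blanket `fillna("Dubai")` is safe here only because every missing city shares one venue. As a slightly more defensive sketch, assuming a hand-written venue-to-city mapping (the helper `fill_city_from_venue` and the toy frame are hypothetical), the city can be inferred from the venue so that an unexpected venue would stay missing instead of being mislabelled.

```python
import pandas as pd

def fill_city_from_venue(df, venue_to_city):
    """Fill missing 'city' values using a venue -> city lookup."""
    out = df.copy()
    inferred = out["venue"].map(venue_to_city)   # NaN for venues not in the lookup
    out["city"] = out["city"].fillna(inferred)
    return out

# hypothetical toy frame
toy = pd.DataFrame({
    "city": ["Mumbai", None, None],
    "venue": ["Wankhede Stadium",
              "Dubai International Cricket Stadium",
              "Dubai International Cricket Stadium"],
})
filled = fill_city_from_venue(toy, {"Dubai International Cricket Stadium": "Dubai"})
print(filled["city"].tolist())
```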

In [None]:
matches_df["city"] = matches_df["city"].fillna("Dubai")

- In matches_df, "player_of_match", "umpire1", and "umpire2" have 4, 2, and 2 missing values respectively. Since so few rows are affected, we can delete them.

### **4.2. Handling Missing Values in "umpire1", "umpire2", "player_of_match" columns**

In [None]:
# rows having missing values
matches_df[(matches_df["umpire1"].isnull()) | (matches_df["umpire2"].isnull()) | (matches_df["player_of_match"].isnull())]

In [None]:
# delete rows having missing value in columns 'umpire1', 'umpire2', 'player_of_match'.
matches_df.dropna(subset=['umpire1', 'umpire2', 'player_of_match'],inplace=True)

In [None]:
# shape of updated matches_df DataFrame
matches_df.shape

## **5. EDA of Deliveries Dataset**

In [None]:
deliveries_df.info()

In [None]:
# statistical analysis of deliveries dataset
deliveries_df.describe()

## **6. Handling Missing Values**

In [None]:
# count the missing values in each column
deliveries_df.isnull().sum()

- The columns "player_dismissed", "dismissal_kind", and "fielder" are missing in more than 90% of rows, because they are only filled on deliveries where a wicket falls.
- So we drop these columns from deliveries_df.
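
If the dismissal information is needed later, one option is to keep it as a compact flag before dropping the sparse column. This is an alternative to the notebook's approach, sketched on hypothetical toy data; `is_wicket` is an illustrative column name.

```python
import pandas as pd

# hypothetical toy deliveries; most balls have no dismissal, so the column is mostly empty
toy = pd.DataFrame({
    "bowler": ["X", "X", "Y", "Y"],
    "player_dismissed": [None, "P1", None, None],
})

# keep the information as a boolean flag, then drop the sparse column
toy["is_wicket"] = toy["player_dismissed"].notna()
toy = toy.drop(columns=["player_dismissed"])
print(int(toy["is_wicket"].sum()))
```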

In [None]:
# drop columns "player_dismissed","dismissal_kind","fielder" from the DataFrame
deliveries_df.drop(columns=["player_dismissed","dismissal_kind","fielder"],inplace=True)

In [None]:
# check for any missing value in deliveries_df
deliveries_df.isnull().sum().sum()

In [None]:
# check for any missing value in matches_df
matches_df.isnull().sum().sum()

Both datasets are now clean, i.e. no missing values remain.

In [None]:
matches_df.tail()

In [None]:
deliveries_df.head()

## **7. Number of Teams Participated Each Season**

In [None]:
matches_df.groupby('Season')['team1'].nunique().plot(kind = 'bar', figsize=(15,5),color = 'c')
plt.title("Number of teams participated each season ",fontsize=18,fontweight="bold")
plt.ylabel("Count of teams", size = 25)
plt.xlabel("Season", size = 25)
plt.xticks(size = 15)
plt.yticks(size = 15)

- In 2011, 2012, and 2013 there were 10, 9, and 9 teams respectively; every other season had 8 teams.
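
Counting only "team1" works when every team appears in that column at least once per season. As a sketch that does not rely on this, the count can use both team columns; the helper `teams_per_season` and the toy frame below are hypothetical.

```python
import pandas as pd

def teams_per_season(df):
    """Count distinct teams per season using both team1 and team2."""
    both = pd.concat([
        df[["Season", "team1"]].rename(columns={"team1": "team"}),
        df[["Season", "team2"]].rename(columns={"team2": "team"}),
    ])
    return both.groupby("Season")["team"].nunique()

# hypothetical toy frame: in 2008 only team "A" ever appears as team1
toy = pd.DataFrame({
    "Season": [2008, 2008, 2009],
    "team1": ["A", "A", "B"],
    "team2": ["B", "C", "C"],
})
counts = teams_per_season(toy)
print(counts.to_dict())
```

On this toy data, counting "team1" alone would report one team for 2008, while the combined count reports three.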

## **8. Matches Played in Each Season**

In [None]:
plt.figure(figsize = (18,6))
sns.countplot(x='Season',data=matches_df)
plt.title("Number of Matches played in each IPL season",fontsize=20)
plt.xlabel("season",fontsize=15)
plt.ylabel('Matches',fontsize=15)
plt.show() 

## **9. Number of Matches Won by Team**

In [None]:
plt.figure(figsize = (18,6))
sns.countplot(x='winner',data=matches_df, palette='cool')
plt.title("Numbers of matches won by team ",fontsize=20)
plt.xticks(rotation=50)
plt.xlabel("Teams",fontsize=15)
plt.ylabel("No of wins",fontsize=15)
plt.show()

- Mumbai Indians have won the most matches, followed by Chennai Super Kings.
- In the matches_df DataFrame, the "city" column has 32 unique values while the "venue" column has 41.
- Let's find out which cities have more than one venue.
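
The question of which cities have several venues can also be answered directly with a distinct count. A minimal sketch on a hypothetical toy frame (`venues_per_city` is an illustrative helper name):

```python
import pandas as pd

def venues_per_city(df):
    """Number of distinct venues in each city, largest first."""
    return df.groupby("city")["venue"].nunique().sort_values(ascending=False)

# hypothetical toy frame
toy = pd.DataFrame({
    "city": ["Mumbai", "Mumbai", "Mumbai", "Delhi"],
    "venue": ["Wankhede Stadium", "Brabourne Stadium", "Wankhede Stadium", "Feroz Shah Kotla"],
})
per_city = venues_per_city(toy)
multi = per_city[per_city > 1]   # keep only cities with more than one venue
print(multi.to_dict())
```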

In [None]:
# count the matches played at each venue, grouped by city
city_venue = matches_df.groupby(['city','venue']).count()['Season']
city_venue_df = pd.DataFrame(city_venue)
city_venue_df

## **10. Venues that have hosted the most IPL matches**

In [None]:
# plot the 10 venues that have hosted the most matches
matches_df["venue"].value_counts().sort_values(ascending = True).tail(10).plot(kind = 'barh',figsize=(12,8), fontsize=15, color='c')
plt.title("Venues that have hosted the most IPL matches",fontsize=18,fontweight="bold")
plt.ylabel("Venue", size = 25)
plt.xlabel("Frequency", size = 25)

## **11. Which teams have won the most IPL matches so far?**

In [None]:
matches_df["winner"].value_counts().sort_values(ascending = True).tail().plot(kind = 'barh', figsize = (15,5), color = 'c')
plt.title("Top 5 teams by IPL matches won",fontsize=18,fontweight="bold")
plt.ylabel("Teams", size = 25)
plt.xlabel("Frequency", size = 25)
plt.xticks(size = 15)
plt.yticks(size = 15)

## **12. Do teams choose to bat or field first after winning the toss?**

In [None]:
colors = ['#FFBF00', '#FA8072']
# 'colors' already sets the wedge colours, so no colormap is passed
matches_df['toss_decision'].value_counts().plot(kind='pie', fontsize=14, autopct='%3.1f%%', colors=colors,
                                               figsize=(10,7), shadow=True, startangle=135, legend=True)
plt.ylabel('Toss Decision')
plt.title('Decision taken by captains after winning tosses', size = 20)
plt.show()

- After winning the toss, teams usually choose to field first.

## **13. How does the toss decision affect match results?**

In [None]:
# create a column that stores 'win' if the toss winner also won the match, and 'loss' otherwise
matches_df['toss_win_game_win'] = np.where((matches_df.toss_winner == matches_df.winner),'win','loss')
plt.figure(figsize = (15,5))
sns.countplot(x='toss_win_game_win', data=matches_df, hue = 'toss_decision')
plt.title("How Toss Decision affects match result", fontsize=18,fontweight="bold")
plt.xticks(size = 15)
plt.yticks(size = 15)
plt.xlabel("Winning Toss and winning match", fontsize = 25)
plt.ylabel("Frequency", fontsize = 25)

- After winning the toss, teams that choose to field first appear to have a higher chance of winning the match.
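
Because far more captains choose to field, raw counts can overstate the effect. Comparing win rates within each decision is a fairer check. A minimal sketch on a hypothetical toy frame (`toss_win_rate` is an illustrative helper name):

```python
import pandas as pd

def toss_win_rate(df):
    """Share of matches won by the toss winner, split by toss decision."""
    won = df["toss_winner"] == df["winner"]
    return won.groupby(df["toss_decision"]).mean()

# hypothetical toy frame
toy = pd.DataFrame({
    "toss_winner":   ["A", "B", "A", "C", "B"],
    "winner":        ["A", "A", "A", "C", "C"],
    "toss_decision": ["field", "field", "field", "bat", "bat"],
})
rates = toss_win_rate(toy)
print(rates.round(2).to_dict())
```

On the real data, a clearly higher rate for "field" than for "bat" would support the observation above. Similar rates would suggest the difference in counts mainly reflects how often each decision is taken.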

## **14. Number of tosses won by each team**

In [None]:
plt.figure(figsize = (18,6))
sns.countplot(x='toss_winner',data=matches_df, palette='cool')
plt.title("Number of tosses won by each team",fontsize=20)
plt.xticks(rotation=50)
plt.xlabel("Teams",fontsize=15)
plt.ylabel("No of tosses",fontsize=15)
plt.show()

## **15. Each team's decision to bat first or second after winning the toss**

In [None]:
plt.figure(figsize = (25,10))
sns.countplot(x='toss_winner', data = matches_df, hue = 'toss_decision')
plt.title("Teams' decision to bat first or second after winning toss", size = 30, fontweight = 'bold')
plt.xticks(size = 15, rotation=50)
plt.yticks(size = 15)
plt.xlabel("Toss Winner", size = 35)
plt.ylabel("Count", size = 35)

## **16. Which players' performances have most often led to their team's win?**

In [None]:
matches_df['player_of_match'].value_counts().head(10).plot(kind = 'bar',figsize=(12,8), fontsize=15, color='c')
plt.title("Top 10 players with most MoM awards",fontsize=18,fontweight="bold")
plt.ylabel("Frequency", size = 25)
plt.xlabel("Players", size = 25)

- CH Gayle has won the most player-of-the-match awards, followed by AB de Villiers.

## **17. Total runs scored over the years**

In [None]:
merge_df.groupby('Season')['batsman_runs'].sum().plot(kind = 'line', linewidth = 3, figsize =(15,5),color = 'c')
                                                                                          
plt.title("Runs over the years",fontsize= 25, fontweight = 'bold')
plt.xlabel("Season", size = 25)
plt.ylabel("Total Runs Scored", size = 25)
plt.xticks(size = 12)
plt.yticks(size = 12)

## **18. Top Run Getters of IPL**

In [None]:
# plot the top 10 run getters in IPL so far
merge_df.groupby('batsman')['batsman_runs'].sum().sort_values(ascending = False).head(10).plot(kind = 'bar', color = 'c',
                                                                                            figsize = (15,5))
plt.title("Top Run Getters of IPL", fontsize = 20, fontweight = 'bold')
plt.xlabel("Batsmen", size = 25)
plt.ylabel("Total Runs Scored", size = 25)
plt.xticks(size = 12)
plt.yticks(size = 12)

- Virat Kohli is the top run getter across all IPL seasons.

## **19. Which batsman has been most consistent among top 10 run getters ?**

In [None]:
# use batsman_runs (runs off the bat) rather than total_runs, which also includes extras
consistent_batsman = merge_df[merge_df.batsman.isin(['SK Raina', 'V Kohli','RG Sharma','G Gambhir',
                                            'RV Uthappa', 'S Dhawan','CH Gayle', 'MS Dhoni',
                                            'DA Warner', 'AB de Villiers'])][['batsman','Season','batsman_runs']]

consistent_batsman.groupby(['Season','batsman'])['batsman_runs'].sum().unstack().plot(kind = 'box', figsize = (15,8))
plt.title("Most Consistent batsmen of IPL", fontsize = 20, fontweight = 'bold')
plt.xlabel("Batsmen", size = 25)
plt.ylabel("Total Runs Scored each season", size = 25)
plt.xticks(size = 15)
plt.yticks(size = 15)

## **20. Top Wicket Takers of IPL**

In [None]:
merge_df.groupby('bowler')['player_dismissed'].count().sort_values(ascending = False).head(10).plot(kind = 'bar', 
                                                color = 'c', figsize = (15,5))
plt.title("Top Wicket Takers of IPL", fontsize = 20, fontweight = 'bold')
plt.xlabel("Bowler", size = 25)
plt.ylabel("Total Wickets Taken", size = 25)
plt.xticks(size = 12)
plt.yticks(size = 12)

- SL Malinga is the top wicket taker in IPL.
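
Counting every "player_dismissed" entry also includes dismissals such as run outs, which are not credited to the bowler. Here is a sketch of a stricter count on hypothetical toy data. The labels in `NOT_CREDITED_TO_BOWLER` are an assumption about the values in the "dismissal_kind" column and should be checked against `merge_df['dismissal_kind'].unique()` before use.

```python
import pandas as pd

# assumption: these labels match the values used in the 'dismissal_kind' column
NOT_CREDITED_TO_BOWLER = ["run out", "retired hurt", "obstructing the field"]

def bowler_wickets(df):
    """Count dismissals per bowler, excluding kinds not credited to the bowler."""
    credited = df[df["player_dismissed"].notna()
                  & ~df["dismissal_kind"].isin(NOT_CREDITED_TO_BOWLER)]
    return credited.groupby("bowler")["player_dismissed"].count().sort_values(ascending=False)

# hypothetical toy deliveries
toy = pd.DataFrame({
    "bowler": ["X", "X", "Y", "Y", "Y"],
    "player_dismissed": ["P1", "P2", "P3", None, "P4"],
    "dismissal_kind": ["bowled", "run out", "caught", None, "lbw"],
})
wickets = bowler_wickets(toy)
print(wickets.to_dict())
```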

## **21. Batsmen with the best strike rates over the years**

In [None]:
# we will only consider players who have played in 10 or 11 seasons
no_of_balls = pd.DataFrame(merge_df.groupby('batsman')['ball'].count()) # number of deliveries faced by each batsman
runs = pd.DataFrame(merge_df.groupby('batsman')['batsman_runs'].sum()) # total runs of each batsman
seasons = pd.DataFrame(merge_df.groupby('batsman')['Season'].nunique()) # number of distinct seasons each batsman played in

batsman_strike_rate = pd.DataFrame({'balls':no_of_balls['ball'],'run':runs['batsman_runs'],'Season':seasons['Season']})
batsman_strike_rate.reset_index(inplace = True)

batsman_strike_rate['strike_rate'] = batsman_strike_rate['run']/batsman_strike_rate['balls']*100
highest_strike_rate = batsman_strike_rate[batsman_strike_rate.Season.isin([10,11])][['Season','batsman','strike_rate']].sort_values(by = 'strike_rate',
                                                                                                           ascending = False)

highest_strike_rate.head(10)

In [None]:
plt.figure(figsize = (15,6))
sns.barplot(x='batsman', y='strike_rate', data = highest_strike_rate.head(10), hue = 'Season',palette = 'cool')
plt.title("Highest strike rates in IPL",fontsize= 30, fontweight = 'bold')
plt.xlabel("Player", size = 25)
plt.ylabel("Strike Rate", size = 25)
plt.xticks(size = 15, rotation=50)
plt.yticks(size = 14)
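
Strike rate is runs scored per 100 balls faced, and wides do not count as balls faced. The calculation above counts every delivery, so it slightly understates strike rates. Below is a sketch of a stricter version on hypothetical toy data; it assumes the deliveries data has a "wide_runs" column, and `strike_rate` is an illustrative helper name.

```python
import pandas as pd

def strike_rate(df):
    """Runs per 100 balls faced; wides are not counted as balls faced."""
    faced = df[df["wide_runs"] == 0]                    # assumption: 'wide_runs' column exists
    balls = faced.groupby("batsman")["ball"].count()
    runs = df.groupby("batsman")["batsman_runs"].sum()
    return (runs / balls * 100).rename("strike_rate")

# hypothetical toy deliveries: the second ball to P is a wide
toy = pd.DataFrame({
    "batsman": ["P", "P", "P", "Q", "Q"],
    "ball": [1, 2, 3, 1, 2],
    "batsman_runs": [4, 0, 2, 1, 1],
    "wide_runs": [0, 1, 0, 0, 0],
})
sr = strike_rate(toy)
print(sr.to_dict())
```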

**Q-1: As a sports analyst, find the most successful teams and players, and the factors that contribute to a team's win or loss.**
- Mumbai Indians are the most successful team in IPL and have also won the most tosses.
- More matches were won by chasing a total (419 matches) than by defending one (350 matches).
- When defending a total, the biggest victory was by 146 runs (Mumbai Indians defeated Delhi Daredevils on 06 May 2017 at Feroz Shah Kotla stadium, Delhi).
- When chasing a target, the biggest victory was by 10 wickets (without losing any wicket), and there were 11 such instances.
- Mumbai is the city that has hosted the most IPL matches.
- Chris Gayle has won the most player-of-the-match awards.
- Eden Gardens is the venue that has hosted the most IPL matches.
- A team that wins the toss should choose to field first, as this gives the higher probability of winning.

**Q-2: Suggest teams or players a company should endorse for its products.**
- If the franchise is looking for a consistent batsman who scores a good number of runs, go for V Kohli, S Raina, Rohit Sharma, David Warner...
- If the franchise is looking for a game-changing batsman, go for Chris Gayle, AB de Villiers, R Sharma, MS Dhoni...
- If the franchise is looking for a batsman who can score a good number of runs every match, go for DA Warner, CH Gayle, V Kohli, AB de Villiers, S Dhawan.
- If the franchise needs the best lower-order finisher with a good strike rate, go for CH Gayle, KA Pollard, DA Warner, SR Watson, BB McCullum.
- If the franchise needs an experienced bowler, go for Harbhajan Singh, A Mishra, PP Chawla, R Ashwin, SL Malinga, DJ Bravo.
- If the franchise needs a wicket-taking bowler, go for SL Malinga, DJ Bravo, A Mishra, Harbhajan Singh, PP Chawla.
- If the franchise needs a bowler who bowls the most dot balls, go for Harbhajan Singh, SL Malinga, B Kumar, A Mishra, PP Chawla.
- If the franchise needs a bowler with a good economy rate, go for DW Steyn, M Muralitharan, R Ashwin, SP Narine, Harbhajan Singh.


Happy Learning!!!

### If you find this notebook useful, kindly **upvote** it
### If you want the code for the full project, click [here](https://github.com/sidharth178/Exploratory-Data-Analysis-Sports-IPL)
### Follow me on [GitHub](https://github.com/sidharth178). I regularly upload data science projects.