## IPL Data Set

**Task:** Perform Exploratory Data Analysis (EDA) on the data set

- Link to data set
https://www.kaggle.com/ramjidoolla/ipl-data-set

In [None]:
from IPython.display import Image
Image("../input/ipl-image/ipl.jpg")

The Indian Premier League (IPL) is a professional Twenty20 cricket league in India, usually contested between March and May of every year by eight teams representing eight different cities or states in India. The league was founded by the Board of Control for Cricket in India (BCCI) in 2007. The IPL has an exclusive window in the ICC Future Tours Programme.

The IPL is the most-attended cricket league in the world and in 2014 was ranked sixth by average attendance among all sports leagues. In 2010, the IPL became the first sporting event in the world to be broadcast live on YouTube. The brand value of the IPL in 2019 was ₹475 billion, according to Duff & Phelps. According to the BCCI, the 2015 IPL season contributed ₹11.5 billion to the GDP of the Indian economy.

### Dataset Information

This notebook uses 6 datasets. They are described below:

|**Dataset**|**Content**|
|----|----|
|matches.csv|Details of every IPL match, e.g. the teams, the city and stadium in which the match was played, the names of the umpires, the toss winner and toss decision, and the match winner.|
|teamwise_home_and_away.csv|How many matches each team has won at its home stadium and away, along with the corresponding win percentages.|
|deliveries.csv|Details of every delivery bowled in the IPL.|
|most_runs_average_strikerate.csv|Runs scored by every batter, along with their strike rate and average.|
|teams.csv|Names of all the teams that have played in the history of the IPL.|
|Players.xlsx|Player details, including date of birth (`dob`), `batting_hand` and country.|


A few team names differ across these datasets, because some franchises were renamed or replaced:
- Deccan Chargers was replaced by Sunrisers Hyderabad as the Hyderabad franchise.
- Delhi Daredevils was renamed Delhi Capitals.
- Pune Warriors and Rising Pune Supergiants are two separate Pune-based franchises that played in different seasons.

## ***Importing necessary libraries***
The following code is written in Python 3.x. Libraries provide pre-written functionality to perform necessary tasks.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [None]:
!pip install openpyxl --quiet

## ***Importing all the 6 datasets***

In [None]:
match= pd.read_csv('../input/ipl-data-set/matches.csv', index_col='id', parse_dates=True)
home_and_away= pd.read_csv('../input/ipl-data-set/teamwise_home_and_away.csv')
delivery= pd.read_csv('../input/ipl-data-set/deliveries.csv')
runs_avg_strikerate= pd.read_csv('../input/ipl-data-set/most_runs_average_strikerate.csv')
team= pd.read_csv('../input/ipl-data-set/teams.csv')
player= pd.read_excel('../input/ipl-data-set/Players.xlsx')

## ***Inspecting the datasets***

### **1. matches.csv**

In [None]:
# Viewing the top 5 rows of the dataset
match.head()

# To view the bottom 5 rows instead, use:
# match.tail()

# To view a different number of rows, pass an integer,
# e.g. match.head(10) or match.tail(15)

In [None]:
# Now let us check the shape of the dataset and whether it contains any null values.

print('shape of the dataset=', match.shape)

print(' \nThe null count of each column of the dataset are as follows:')
match.isnull().sum()

- Here we can observe that there are 756 rows and 17 columns present in the dataset.
- We can also see that there are a few null values present in the dataset. Later, if need be, we can treat them, as required.
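
To judge how much treatment the nulls need, it can help to see each column's null count next to its share of rows. The sketch below uses a small made-up frame; the `toy` data and the `null_summary` helper are illustrative assumptions, not part of the dataset or of pandas.

```python
import pandas as pd

# Hypothetical toy frame standing in for `match`
toy = pd.DataFrame({
    "city": ["Mumbai", None, "Delhi", "Kolkata"],
    "winner": ["MI", "DC", None, None],
    "win_by_runs": [10, 0, 5, 0],
})

def null_summary(df):
    """Return the null count and null percentage per column, largest first."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(1)
    summary = pd.DataFrame({"nulls": counts, "percent": percent})
    return summary.sort_values("nulls", ascending=False)

print(null_summary(toy))
```

A column with only a handful of nulls can usually be left alone until an analysis actually needs it, which is the approach taken here.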

In [None]:
match['team1'].value_counts()
#match['team2'].value_counts()

In [None]:
# Function to preview the numeric features (first 5 rows):

def numeric_features(dataset):
    numeric_col = dataset.select_dtypes(include=np.number).columns.tolist()
    return dataset[numeric_col].head()

numeric_columns = numeric_features(match)
print("Numerical Features:")
print(numeric_columns)

print("===="*20)

# Function to preview the categorical (non-numeric) features (first 5 rows):

def categorical_features(dataset):
    categorical_col = dataset.select_dtypes(exclude=np.number).columns.tolist()
    return dataset[categorical_col].head()

categorical_columns = categorical_features(match)
print("Categorical Features:")
print(categorical_columns)

print("===="*20)

# Function to check the datatypes of all the columns:

def check_datatypes(dataset):
    return dataset.dtypes

print("Datatypes of all the columns:")
check_datatypes(match)

### **2. teamwise_home_and_away.csv**

In [None]:
home_and_away.head()

In [None]:
print('shape of the dataset=', home_and_away.shape)

print(' \nThe null count of each column of the dataset are as follows:')
home_and_away.isnull().sum()

- Here we can observe that there are no null values present in our dataset.

In [None]:
# Reusing numeric_features, categorical_features and check_datatypes
# from the matches.csv section

numeric_columns = numeric_features(home_and_away)
print("Numerical Features:")
print(numeric_columns)

print("===="*20)

categorical_columns = categorical_features(home_and_away)
print("Categorical Features:")
print(categorical_columns)

print("===="*20)

print("Datatypes of all the columns:")
check_datatypes(home_and_away)

### **3. deliveries.csv**

In [None]:
delivery.head()

In [None]:
print('shape of the dataset=', delivery.shape)

print(' \nThe null count of each column of the dataset are as follows:')
delivery.isnull().sum()

- Here we can see that there are 179078 rows and 21 columns in our dataset.
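- The columns `player_dismissed`, `dismissal_kind` and `fielder` are almost empty, because most deliveries do not end in a dismissal. The nulls are therefore meaningful rather than missing data, so we keep these columns; they are needed for the dismissal analysis later.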
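
A null in `dismissal_kind` can be read as "no wicket fell on this ball". A minimal sketch on made-up rows (the `toy` frame is hypothetical) shows how the non-null entries can be turned into a wicket flag:

```python
import pandas as pd

# Hypothetical deliveries: most balls have no dismissal
toy = pd.DataFrame({
    "ball": [1, 2, 3, 4, 5],
    "dismissal_kind": [None, "caught", None, "bowled", "run out"],
})

# A non-null dismissal_kind marks a delivery on which a wicket fell
toy["is_wicket"] = toy["dismissal_kind"].notna()

total_wickets = int(toy["is_wicket"].sum())
wicket_share = toy["is_wicket"].mean()

print("wickets:", total_wickets)
print("share of balls with a wicket:", wicket_share)
```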

In [None]:
# Reusing numeric_features, categorical_features and check_datatypes
# from the matches.csv section

numeric_columns = numeric_features(delivery)
print("Numerical Features:")
print(numeric_columns)

print("===="*20)

categorical_columns = categorical_features(delivery)
print("Categorical Features:")
print(categorical_columns)

print("===="*20)

print("Datatypes of all the columns:")
check_datatypes(delivery)

### **4. most_runs_average_strikerate.csv**

In [None]:
runs_avg_strikerate.head()

In [None]:
print('shape of the dataset=', runs_avg_strikerate.shape)

print(' \nThe null count of each column of the dataset are as follows:')
runs_avg_strikerate.isnull().sum()

In [None]:
#sns.distplot(runs_avg_strikerate['average'])
#runs_avg_strikerate.describe()
sns.boxplot(x=runs_avg_strikerate['average'])

- Here we can see that there are 516 rows and 6 columns in our dataset.
- There are 34 null values in the `average` column.
- The values are somewhat right-skewed, and the boxplot shows a few outliers.
- We therefore fill the null values rather than dropping the rows, and we use the median because it is less affected by outliers than the mean.
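
The effect of an outlier on the mean versus the median can be checked on a few made-up numbers. The values below are hypothetical and only illustrate why the median is the safer fill value for a skewed column:

```python
import numpy as np
import pandas as pd

# Hypothetical batting averages with one outlier and two missing values
avg = pd.Series([20.0, 22.0, 25.0, 27.0, 90.0, np.nan, np.nan], name="average")

print("mean  :", avg.mean())    # pulled upward by the outlier
print("median:", avg.median())  # depends only on the middle value

# Fill only this column's missing values with its own median
filled = avg.fillna(avg.median())
print("nulls after fill:", filled.isnull().sum())
```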

In [None]:
# Fill only the `average` column, using its own median
runs_avg_strikerate['average'] = runs_avg_strikerate['average'].fillna(runs_avg_strikerate['average'].median())

In [None]:
runs_avg_strikerate.isnull().sum()

In [None]:
# Reusing numeric_features, categorical_features and check_datatypes
# from the matches.csv section

numeric_columns = numeric_features(runs_avg_strikerate)
print("Numerical Features:")
print(numeric_columns)

print("===="*20)

categorical_columns = categorical_features(runs_avg_strikerate)
print("Categorical Features:")
print(categorical_columns)

print("===="*20)

print("Datatypes of all the columns:")
check_datatypes(runs_avg_strikerate)

### **5. teams.csv**

In [None]:
team.head()

- This dataset only has names of the teams in IPL. So no further information can be drawn here.

### **6. Players.xlsx**

In [None]:
player.head()

In [None]:
print('shape of the dataset=', player.shape)

print(' \nThe null count of each column of the dataset are as follows:')
player.isnull().sum()

- We don't know yet, if we will require this dataset in our EDA further.
- So we can keep it as it is, and can modify the dataset, if needed.

## ***Exploratory Data Analysis (EDA)***

#### **1. Total wins by each team so far**

In [None]:
match['winner'].value_counts()

In [None]:
plt.figure(figsize=(15,5))
match_wins=match['winner'].value_counts()
match_wins.plot.bar()

- From the above plot, we can see that the most matches have been won by **Mumbai Indians** in this dataset, which covers **2008 to 2019**.

#### **2. Top 10 players of the match so far**

In [None]:
match['player_of_match'].value_counts()[0:10]

In [None]:
match['player_of_match'].value_counts()[0:5].keys()

In [None]:
# Compute the top 10 once and reuse it for both axes
top_players = match['player_of_match'].value_counts().head(10)
plt.figure(figsize=(15,10))
plt.bar(top_players.index, top_players.values)
plt.show()

- From the above plot, we can see that **Chris Gayle** (listed as `CH Gayle`) was named player of the match most often between 2008 and 2019.

#### **3. Top cities in which matches were played**

In [None]:
match['city'].value_counts()

In [None]:
plt.figure(figsize=(15,10))
match['city'].value_counts().plot.bar()

- From the above graph, we can see that **Mumbai** is the city where most of the matches were played, followed by **Kolkata**.

#### **4. Frequency of the result column**

In [None]:
match['result'].value_counts()

In [None]:
plt.figure(figsize=(15,10))
plt.pie(list(match['result'].value_counts()), labels= list(match['result'].value_counts().keys()),autopct='%0.1f%%')
plt.show()

- Here we can see that 98.3% of the time, the result of the match was normal.
- 1.2% of the time, the match was a tie.
- 0.5% of the time, there was no result. This may have been because the match was called off for some reason.
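
The percentages shown in the pie chart can also be obtained as numbers with `value_counts(normalize=True)`. A small sketch on made-up results (the `result` Series below is hypothetical and does not reproduce the dataset's proportions):

```python
import pandas as pd

# Hypothetical results: 8 normal matches, 1 tie, 1 no result
result = pd.Series(["normal"] * 8 + ["tie", "no result"], name="result")

# normalize=True returns fractions of the total instead of raw counts
shares = (result.value_counts(normalize=True) * 100).round(1)
print(shares)
```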

#### **5. Total matches played by team so far**

In [None]:
total_match=match['team1'].value_counts()+match['team2'].value_counts()
total_match.sort_values(ascending=False)

In [None]:
plt.figure(figsize=(15,5))
total_match.sort_values(ascending=False).plot.bar()

- This plot shows that **Mumbai Indians** have played the most matches, followed by **Royal Challengers Bangalore**.
- The fewest matches were played by **Rising Pune Supergiants** and **Kochi Tuskers Kerala**.

#### **6. Matches officiated by each umpire (Top 20)**

In [None]:
umpires= match['umpire1'].value_counts()+match['umpire2'].value_counts()+match['umpire3'].value_counts()
umpires.sort_values(ascending= False)[0:20]

In [None]:
plt.figure(figsize=(15,5))
umpires.sort_values(ascending=False).head(20).plot.bar()

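- By these counts, umpire **S Ravi** has officiated the most matches (114), followed by **C Shamshuddin** (83).
- Note that `+` aligns the three `value_counts()` results on the umpire's name and returns NaN for any umpire missing from one of the columns, so only umpires who appear in all of `umpire1`, `umpire2` and `umpire3` receive a count here.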
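
Adding `value_counts()` results with `+` leaves NaN for any name that is missing from one of the Series. A possible alternative, sketched here on a hypothetical `toy` frame, is to stack the columns into a single Series first and count once:

```python
import pandas as pd

# Hypothetical officiating data; umpire3 is mostly empty
toy = pd.DataFrame({
    "umpire1": ["X", "Y", "X", "Z"],
    "umpire2": ["Y", "X", "Z", "Y"],
    "umpire3": [None, None, "X", None],
})

# `+` aligns on the index: Y and Z never appear in umpire3, so they become NaN
naive = (toy["umpire1"].value_counts()
         + toy["umpire2"].value_counts()
         + toy["umpire3"].value_counts())

# Stacking the three columns counts every appearance; nulls are dropped by default
combined = pd.concat([toy["umpire1"], toy["umpire2"], toy["umpire3"]]).value_counts()

print(naive)
print(combined)
```

With this approach the ranking can differ from the one obtained with `+`, because umpires absent from one column are no longer excluded.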

#### **7. Wins by biggest margin of runs**

In [None]:
big_mar= match.sort_values('win_by_runs',ascending = False).reset_index(drop = True)
big_mar = big_mar[:10]
big_mar = big_mar[['winner','win_by_runs']]
print(big_mar)

In [None]:
plt.figure(figsize=(15,10))
sns.barplot(x= big_mar['winner'], y= big_mar['win_by_runs'])

- From the table and plot above, we can see that **Mumbai Indians** recorded the win with the largest margin of runs.
- This fits with **MI** also being the team with the most wins, although a single large margin does not by itself make a team the most successful.
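
One thing to keep in mind when reading the bar plot: `sns.barplot` draws one bar per category and, by default, shows the mean when a category appears more than once. To see each team's single largest margin instead, the rows can be grouped first. A sketch with made-up margins (the `toy` values are hypothetical):

```python
import pandas as pd

# Hypothetical top wins; one team appears twice
toy = pd.DataFrame({
    "winner": ["MI", "RCB", "MI", "KKR"],
    "win_by_runs": [146, 144, 102, 140],
})

# Largest single margin per team, biggest first
largest = toy.groupby("winner")["win_by_runs"].max().sort_values(ascending=False)
print(largest)
```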

#### **8. Best Fielder**

In [None]:
# Count the dismissals credited to each fielder and keep the top 10
bf = delivery.groupby('fielder')['dismissal_kind'].count().reset_index(name='Dismissals')
bf = bf.sort_values(by='Dismissals', ascending=False).head(10)

plt.figure(figsize=(10,8))
plt.title("Best IPL Fielders")
plt.bar(bf.fielder, bf.Dismissals)
plt.xlabel("Best Fielders")
plt.ylabel("No. of Dismissals")
# Write each count just inside the top of its bar
for position, dismissals in enumerate(bf.Dismissals):
    plt.text(position-0.2, dismissals-4, str(dismissals))
plt.xticks(rotation=90)
plt.show()

- From the above plot, we can see that **MS Dhoni** is our best fielder, followed by **KD Karthik**.
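
A total count mixes different kinds of fielding involvement. To separate them, the dismissals can be cross-tabulated by fielder and dismissal type. The sketch below uses made-up rows (the `toy` frame and the fielder names `P` and `Q` are hypothetical):

```python
import pandas as pd

# Hypothetical dismissals; the last row has no fielder involved
toy = pd.DataFrame({
    "fielder": ["P", "P", "Q", "P", "Q", None],
    "dismissal_kind": ["caught", "stumped", "caught", "caught", "run out", "bowled"],
})

# Rows = fielders, columns = dismissal types, cells = counts
by_kind = pd.crosstab(toy["fielder"], toy["dismissal_kind"])
print(by_kind)
```

A breakdown like this makes it easier to tell a wicketkeeper's catches and stumpings apart from an outfielder's catches and run outs.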

#### **9. Most successful team in home condition**

In [None]:
home_wins=home_and_away[['team','home_win_percentage']]
home_wins.sort_values('home_win_percentage',ascending=False)

In [None]:
plt.figure(figsize=(15,10))
home_wins.sort_values('home_win_percentage',ascending=False).plot.bar(x='team',y='home_win_percentage')
plt.show()

- From the above graph, we can say that `Rising Pune Supergiant` has the highest home win percentage, making it the most successful team in home conditions.

#### **10. Most successful team in away condition**

In [None]:
away_wins=home_and_away[['team','away_win_percentage']]
away_wins.sort_values('away_win_percentage',ascending=False)

In [None]:
plt.figure(figsize=(15,10))
away_wins.sort_values('away_win_percentage',ascending=False).plot.bar(x='team',y='away_win_percentage')
plt.show()

- From the above graph, we can say that `Gujarat Lions` has the highest away win percentage, making it the most successful team in away conditions.
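
A win percentage is wins divided by matches played, so a team with few matches can top the ranking on a small sample. The sketch below shows the arithmetic on made-up numbers; the column names `away_wins` and `away_matches` and the teams `T1` and `T2` are assumptions for illustration and may not match the columns in `teamwise_home_and_away.csv`.

```python
import pandas as pd

# Hypothetical records: T1 played few away matches, T2 played many
toy = pd.DataFrame({
    "team": ["T1", "T2"],
    "away_wins": [6, 40],
    "away_matches": [10, 80],
})

toy["away_win_percentage"] = (toy["away_wins"] / toy["away_matches"] * 100).round(1)

# Keep the match count next to the percentage so the sample size stays visible
print(toy.sort_values("away_win_percentage", ascending=False))
```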

#### **11. Most Runs scored by batters (Top 10)**

In [None]:
runs_avg_strikerate.sort_values('total_runs',ascending=False).head(10)

In [None]:
runs_avg_strikerate.sort_values('total_runs',ascending=False).head(10).plot.bar(x='batsman',y='total_runs')

- We find that **Virat Kohli** has scored the highest number of runs, with a strike rate of 131.987351.
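
For reference, strike rate is runs scored per 100 balls faced, and batting average is runs scored per dismissal. A minimal sketch of both formulas follows; the function names are illustrative and the inputs are made-up numbers, not values from the dataset.

```python
def strike_rate(runs, balls_faced):
    """Runs scored per 100 balls faced."""
    return runs / balls_faced * 100


def batting_average(runs, dismissals):
    """Runs scored per dismissal; undefined for a batter who was never out."""
    if dismissals == 0:
        return None
    return runs / dismissals


print(strike_rate(50, 40))        # 50 runs off 40 balls
print(batting_average(120, 4))    # 120 runs, dismissed 4 times
print(batting_average(10, 0))     # never dismissed
```

A batter who has never been dismissed has no defined average, which would be one plausible source of null values in an `average` column.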

# ***Team Specific Analysis***

- Here, I will pick a couple of teams, say `Mumbai Indians` and `Royal Challengers Bangalore`, and do a few analyses on them.
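
To analyse one team's batting, the deliveries need to be filtered by the batting side rather than by innings number, since any team can bat in any innings. A sketch on made-up rows, assuming the deliveries data exposes a `batting_team` column (the `toy` frame and batsman names are hypothetical):

```python
import pandas as pd

# Hypothetical deliveries from both innings
toy = pd.DataFrame({
    "inning": [1, 1, 2, 2, 1],
    "batting_team": ["Mumbai Indians", "Mumbai Indians",
                     "Royal Challengers Bangalore", "Mumbai Indians",
                     "Royal Challengers Bangalore"],
    "batsman": ["A", "B", "C", "A", "C"],
})

# Keep only the balls on which Mumbai Indians were batting
mi = toy[toy["batting_team"] == "Mumbai Indians"]

# Each row is one delivery, so this counts balls faced per batsman
balls_faced = mi["batsman"].value_counts()
print(balls_faced)
```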

In [None]:
delivery.head()

### 1. Mumbai Indians

In [None]:
# Select the deliveries on which Mumbai Indians were the batting side
mumbai = delivery[delivery['batting_team'] == 'Mumbai Indians']
mumbai.head(10)

In [None]:
mumbai.info()

#### i. Batsman who gets the chance to bat the most

In [None]:
# Each row is one delivery, so this counts the balls faced by each batsman
mumbai['batsman'].value_counts()

In [None]:
plt.figure(figsize=(15,5))
mumbai['batsman'].value_counts().head(10).plot.bar()

- Since each row of `delivery` is one ball, the first bar is the batsman who has faced the most deliveries while batting for MI.

#### ii. Types of dismissal in the match when `MI` bats

In [None]:
mumbai['dismissal_kind'].value_counts()

In [None]:
plt.figure(figsize=(15,10))
dismissals = mumbai['dismissal_kind'].value_counts()
plt.pie(dismissals, labels=dismissals.index, autopct='%0.1f%%')
plt.show()

- Each slice is one dismissal type's share of all MI wickets. Balls without a dismissal are null in `dismissal_kind` and are left out by `value_counts()`.

### 2. Royal Challengers Bangalore

In [None]:
# Select the deliveries on which Royal Challengers Bangalore were the batting side
rcb = delivery[delivery['batting_team'] == 'Royal Challengers Bangalore']
rcb.head()

#### i. Batsman who gets the chance to bat the most

In [None]:
rcb['batsman'].value_counts()[0:10]

In [None]:
plt.figure(figsize=(15,5))
rcb['batsman'].value_counts().head(10).plot.bar()

- The first bar is the batsman who has faced the most deliveries while batting for RCB.

#### ii. Types of dismissal in the match when `RCB` bats

In [None]:
rcb['dismissal_kind'].value_counts()

In [None]:
plt.figure(figsize=(15,15))
dismissals = rcb['dismissal_kind'].value_counts()
plt.pie(dismissals, labels=dismissals.index, autopct='%0.1f%%')
plt.show()

- The largest slice is the most common way RCB batsmen were dismissed, and the smallest slice is the rarest.

# **Please Upvote my work. That will be the best way to appreciate my work and improve my confidence.**