# **IPL EDA**
### IPL is one of the most famous cricket leagues in the world. I tried to summarize my knowledge to extract information from the provided data. Please feel free to check, edit, and comment if you have any doubts.
### Topics
* Data Cleaning and Rearranging
* Data Visualization
    * Histogram
    * Bar chart
    * Pie chart
    * Map - folium

In [None]:
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt # plotting
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd
from pandas_profiling import ProfileReport

import seaborn as sns
import missingno as msno
from scipy import stats
sns.set(color_codes=True)
import warnings
warnings.filterwarnings('ignore')

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Data Cleaning and Rearranging

In [None]:
df = pd.read_csv("/kaggle/input/ipl-dataset-20082019/matches.csv", index_col=0)
df.head()

## Check NaN values

In [None]:
df.isnull().sum()
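
Raw NaN counts are easier to judge next to the share of rows they represent. The following is only a minimal sketch on a made-up four-row frame (the values are invented for illustration and are not taken from matches.csv); the same two expressions could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values, standing in for matches.csv
toy = pd.DataFrame({
    "city": ["Mumbai", np.nan, "Delhi", "Pune"],
    "umpire3": [np.nan, np.nan, np.nan, "X"],
    "winner": ["A", "B", "A", "B"],
})

# Count and percentage of missing values per column, worst column first
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
}).sort_values("pct_missing", ascending=False)
print(missing)
```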

## Remove unwanted data
#### In the Season column, "IPL-" is prepended to the year. Removing "IPL-" lets the Season column be used as year-wise data.
#### The umpire3 column has lots of NaN values, so drop that column.

In [None]:
df.Season = df.Season.str.replace(r'IPL-', '').astype(int)
df.drop(columns=["umpire3"], inplace = True)
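
As a small illustration of the Season cleanup, here is a sketch on an invented Series in the "IPL-<year>" format described above. Passing `regex=False` is an optional choice made here so that "IPL-" is treated as a literal string regardless of the pandas version default.

```python
import pandas as pd

# Invented sample values in the "IPL-<year>" format
season = pd.Series(["IPL-2008", "IPL-2017", "IPL-2019"])

# Strip the literal prefix, then convert the remaining text to integers
season_year = season.str.replace("IPL-", "", regex=False).astype(int)
print(season_year.tolist())
```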

## Rearrange Data
#### Missing values in city, umpire1, and umpire2 will be replaced by "-" for easier processing.
#### Many teams have changed their names over the years, so we treat those as the same team (depending on the city).

In [None]:
df.city = df.city.fillna("-")
df.umpire1 = df.umpire1.fillna("-")
df.umpire2 = df.umpire2.fillna("-")
df = df.replace('Rising Pune Supergiants', 'Rising Pune Supergiant')
df = df.replace('Pune Warriors', 'Rising Pune Supergiant')
df = df.replace('Deccan Chargers', 'Sunrisers Hyderabad')
df = df.replace('Delhi Capitals', 'Delhi Daredevils')
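
One possible alternative to calling `df.replace` once per name is to keep the renames in a single dictionary and apply it only to the columns that hold team names. This is a sketch under assumptions: the list of team columns is my guess at where team names appear, and the toy rows (including the venue names) are invented.

```python
import pandas as pd

# Old name -> name used in this notebook (same renames as above)
TEAM_ALIASES = {
    "Rising Pune Supergiants": "Rising Pune Supergiant",
    "Pune Warriors": "Rising Pune Supergiant",
    "Deccan Chargers": "Sunrisers Hyderabad",
    "Delhi Capitals": "Delhi Daredevils",
}
# Assumption: these are the columns that contain team names
TEAM_COLUMNS = ["team1", "team2", "toss_winner", "winner"]

# Toy rows with invented match-ups and venues
toy = pd.DataFrame({
    "team1": ["Deccan Chargers", "Delhi Capitals"],
    "team2": ["Pune Warriors", "Mumbai Indians"],
    "toss_winner": ["Deccan Chargers", "Mumbai Indians"],
    "winner": ["Pune Warriors", "Delhi Capitals"],
    "venue": ["Some Stadium", "Another Stadium"],
})

# Replace names only in the team columns; other columns stay untouched
toy[TEAM_COLUMNS] = toy[TEAM_COLUMNS].replace(TEAM_ALIASES)
print(toy)
```

Restricting the replacement to team columns avoids accidentally rewriting a matching string in an unrelated column.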

#### Remove records that do not have valid information. We can remove data based on the result (no result).

In [None]:
is_NaN = df.isnull()
row_has_NaN = is_NaN.any(axis=1)
rows_with_NaN = df[row_has_NaN]
rows_with_NaN

In [None]:
df.dropna(inplace=True)
df.isnull().sum()
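
`dropna()` without arguments removes a row if any column is missing. If the intent is only to drop matches without a result, a narrower variant is possible. The sketch below assumes that a missing `winner` marks a no-result match; the toy rows and the `note` column are invented.

```python
import numpy as np
import pandas as pd

# Toy rows: the second match has no winner (assumed to mean "no result")
toy = pd.DataFrame({
    "winner": ["A", np.nan, "B"],
    "player_of_match": ["p1", np.nan, "p2"],
    "note": [np.nan, "rain", "ok"],  # hypothetical extra column with its own NaN
})

# Drop only rows where the winner is missing; NaN in other columns is kept
cleaned = toy.dropna(subset=["winner"])
print(len(toy) - len(cleaned), "row(s) dropped")
```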

In [None]:
df.info()

In [None]:
df.describe([0.10,0.25,0.50,0.75,0.90,0.95,0.99]).T

# Visualization

#### We will start with a pair plot of all numeric data in the dataframe: histograms on the diagonal and scatter plots for each pair of columns.

In [None]:
sns.pairplot(df)

### Number of matches played in each year

In [None]:
df.Season.value_counts().plot(kind="bar")
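
`value_counts()` orders its result by count, so the bars above are not in year order. A minimal sketch of a chronological variant, using an invented Series of seasons:

```python
import pandas as pd

# Invented season values
season = pd.Series([2009, 2008, 2009, 2010, 2009, 2008])

# Sort by the index (the year) instead of by the count
per_season = season.value_counts().sort_index()
print(per_season)
# per_season.plot(kind="bar") would then draw the bars in year order
```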

### Top 10 Player of the Match winners over time

In [None]:
player_of_match = df["player_of_match"].value_counts()[:10]
player_of_match.plot(kind="barh")
print(df["player_of_match"].value_counts()[:10])

### Who won the most number of matches

In [None]:
match_winner = df["winner"].value_counts()
match_winner.plot(kind="barh")
print(df["winner"].value_counts())

### Most matches won in a particular season (2019)

In [None]:
df.loc[df.Season == 2019, "winner"].value_counts().plot(kind="barh")

### After winning the toss, which decision is taken: bat or field?

In [None]:
df["toss_decision"].value_counts().plot(kind="barh")
df["toss_decision"].value_counts()

### Which team won the most matches by a margin of more than 50 runs

In [None]:
df.loc[df.win_by_runs > 50, "winner"].value_counts().plot(kind="barh")

### Which team won the most matches by a margin of more than 5 wickets

In [None]:
df.loc[df.win_by_wickets > 5, "winner"].value_counts().plot(kind="barh")

### Did the toss winner win the match? Which team did best?

In [None]:
df.loc[df.toss_winner == df.winner, "winner"].value_counts().plot(kind="barh")
print(df.loc[df.toss_winner == df.winner, "winner"].value_counts())
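
The counts above favour teams that simply won more tosses. One way to compare teams more fairly is the share of toss wins that were converted into match wins. This is a sketch on invented data; the `toss_and_match` column is a helper name introduced here for illustration.

```python
import pandas as pd

# Invented toss and match results for three teams
toy = pd.DataFrame({
    "toss_winner": ["A", "A", "B", "B", "B", "C"],
    "winner":      ["A", "B", "B", "B", "A", "C"],
})

# True where the toss winner also won the match
toy["toss_and_match"] = toy["toss_winner"] == toy["winner"]

# Mean of booleans per team = share of toss wins converted into match wins
conversion = (
    toy.groupby("toss_winner")["toss_and_match"]
    .mean()
    .sort_values(ascending=False)
)
print(conversion)
```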

### Team to team record

In [None]:
teams = df.team1.unique().tolist()
teams.sort()
for team1 in teams:
    for team2 in df.team2.unique().tolist():
        df_ttw = df.loc[(df["team1"] == team1) & (df["team2"] == team2), "winner"]
        if len(df_ttw) > 0:
            print(df_ttw.value_counts())
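
The loop above treats (team1, team2) and (team2, team1) as different fixtures. A possible order-independent version builds one key per pair of teams and counts winners per key. This is a sketch on invented data; the `fixture` column is a hypothetical helper, not part of the dataset.

```python
import pandas as pd

# Invented matches between three teams
toy = pd.DataFrame({
    "team1":  ["A", "B", "A", "C"],
    "team2":  ["B", "A", "C", "A"],
    "winner": ["A", "A", "C", "C"],
})

# Order-independent key: "A vs B" for both (A, B) and (B, A)
toy["fixture"] = [
    " vs ".join(sorted(pair)) for pair in zip(toy["team1"], toy["team2"])
]

# Rows: fixture, columns: winner, values: number of wins
record = toy.groupby(["fixture", "winner"]).size().unstack(fill_value=0)
print(record)
```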

### Pie chart showing which team has the higher percentage of wins (Chennai Super Kings vs Mumbai Indians)

In [None]:
df.loc[(df["team1"] == "Chennai Super Kings") & (df["team2"] == "Mumbai Indians"), "winner"].value_counts().plot(kind="pie")
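
The filter above only picks matches where Chennai Super Kings is team1 and Mumbai Indians is team2. A sketch of a variant that covers both orders and computes the actual percentages, using invented results (the real win split is not claimed here):

```python
import pandas as pd

# Invented results; the last row is a match against a third team
toy = pd.DataFrame({
    "team1":  ["Chennai Super Kings", "Mumbai Indians",
               "Chennai Super Kings", "Mumbai Indians"],
    "team2":  ["Mumbai Indians", "Chennai Super Kings",
               "Mumbai Indians", "Delhi Daredevils"],
    "winner": ["Mumbai Indians", "Mumbai Indians",
               "Chennai Super Kings", "Mumbai Indians"],
})

# Both teams must come from the pair, in either order
pair = {"Chennai Super Kings", "Mumbai Indians"}
mask = toy["team1"].isin(pair) & toy["team2"].isin(pair)

# normalize=True returns fractions; multiply by 100 for percentages
share = toy.loc[mask, "winner"].value_counts(normalize=True) * 100
print(share.round(1))
# share.plot(kind="pie", autopct="%.1f%%") would label each slice with its percentage
```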

### Mapping the match cities on a map

In [None]:
# Unique cities, excluding the "-" placeholder used for missing values
city = [c for c in df.city.unique() if c != "-"]

In [None]:
!pip install opencage
from opencage.geocoder import OpenCageGeocode
import folium
key = "ec38974f44884f42bfa871dc36b8a090"  
geocoder = OpenCageGeocode(key)
india = geocoder.geocode("India")
lat = india[0]['geometry']['lat']
lng = india[0]['geometry']['lng']
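ipl_map = folium.Map(location=[lat, lng], zoom_start=2)  # avoid shadowing the built-in map()
for query in city:
    pop = query  # popup label keeps the original city name
    if query == "Kochi":
        query = "Kochi India"  # disambiguate the search query
    results = geocoder.geocode(query)
    if not results:  # skip cities the geocoder cannot resolve
        continue
    lat = results[0]['geometry']['lat']
    lng = results[0]['geometry']['lng']
    folium.Marker((lat, lng), popup=pop).add_to(ipl_map)
ipl_map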
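
The geocoding step can also be separated from the plotting step, which makes it easier to test without network access. The sketch below is an assumption-laden illustration: `collect_coordinates` and `fake_geocode` are hypothetical helpers, the coordinates are invented, and the result shape (`results[0]['geometry']['lat']`) is simply the one used in the cell above.

```python
from typing import Callable, Dict, List, Tuple


def collect_coordinates(
    cities: List[str], geocode: Callable[[str], list]
) -> Dict[str, Tuple[float, float]]:
    """Return {city: (lat, lng)} for every city the geocoder could resolve."""
    coords = {}
    for name in cities:
        results = geocode(name)
        if not results:  # empty list or None: nothing found, skip this city
            continue
        geometry = results[0]["geometry"]
        coords[name] = (geometry["lat"], geometry["lng"])
    return coords


# Stand-in geocoder with invented coordinates, so the sketch runs offline
def fake_geocode(query):
    table = {"Mumbai": [{"geometry": {"lat": 1.0, "lng": 2.0}}]}
    return table.get(query, [])


print(collect_coordinates(["Mumbai", "Nowhere"], fake_geocode))
```

In the notebook, `geocoder.geocode` could be passed in place of `fake_geocode`, and the returned dictionary could then be looped over to add one `folium.Marker` per city.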

# Conclusion
### This might not be the best analysis, but it can start our IPL journey.

# If you like the analysis please **UPVOTE**.