# **Task 19 - (Article 116)** [![Static Badge](https://img.shields.io/badge/Open%20in%20Colab%20-%20orange?style=plastic&logo=googlecolab&labelColor=grey)](https://colab.research.google.com/github/sshrizvi/DS-Python/blob/main/Pandas/Tasks/task_19.ipynb)

|🔴 **WARNING** 🔴|
|:-----------:|
|If you have not studied Article 116, please read it before attempting this task.|
| Here is [DataFrameGroupBy Object](../Articles/116_groupby_object.md) |

### 📦 **Importing Relevant Libraries**

In [1]:
import numpy as np
import pandas as pd

### ⚠️ **Data Warning**
The following questions use the Titanic dataset, which is available in [Resources](../Resources/).

TITANIC Dataset : [Link](https://docs.google.com/spreadsheets/d/e/2PACX-1vQjh5HzZ1N0SU7ME9ZQRzeVTaXaGsV97rU8R7eAcg53k27GTstJp9cRUOfr55go1GRRvTz1NwvyOnuh/pub?gid=1562145139&single=true&output=csv)

#### **Reading Data into DataFrame**

In [2]:
titanic_df = pd.read_csv(
    filepath_or_buffer = '../Resources/Data/titanic_train_dataset.csv'
)

In [3]:
titanic_df

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |


### **🎯 Q1: Grouping by "Pclass" to Analyze "Age"**  

Using **`groupby`**, create groups based on the `"Pclass"` column and perform the following operations:  

1. **Calculate the Average Age** for each `"Pclass"` group.  
2. **Count the Total Number of Missing Values** in the `"Age"` column for each `"Pclass"`.  

3. **📌 Expected Output Format**  
    | **Pclass** | **Average Age** | **Missing Values in Age** |
    |------------|----------------|---------------------------|
    | 1          | 38.233441      | 30                        |
    | 2          | 29.877630      | 11                        |
    | 3          | 25.140620      | 136                       |

4. **Notes:**  
   - The dataset should have a column **`"Pclass"`** representing **Passenger Class (1st, 2nd, or 3rd class)**.  
   - The `"Age"` column contains missing values (`NaN`), which should be counted for each `"Pclass"`.  

In [4]:
pclass_group = titanic_df.groupby(
    by = 'Pclass'
)

pd.DataFrame(
    data = {
        'Average Age' : pclass_group.Age.mean(),
        'Missing Values in Age' : pclass_group.Age.apply(
            func = lambda x : x.isna().sum(),
            include_groups = False
        )
    }
)

| Pclass | Average Age | Missing Values in Age |
|---|---|---|
| 1 | 38.233441 | 30 |
| 2 | 29.87763 | 11 |
| 3 | 25.14062 | 136 |
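
The two Q1 columns can also be produced in a single `agg` call using named aggregation. The snippet below is a minimal sketch on a made-up `toy_df` (not the Titanic data), so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two Titanic columns used in Q1 (values are made up).
toy_df = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3, 3],
    'Age': [40.0, np.nan, 30.0, 20.0, np.nan, np.nan, 10.0],
})

# Named aggregation: each keyword becomes an output column.
# The dict is unpacked with ** because the column names contain spaces.
summary = toy_df.groupby('Pclass')['Age'].agg(
    **{
        'Average Age': 'mean',                              # NaN is skipped by default
        'Missing Values in Age': lambda s: s.isna().sum(),  # count of NaN per group
    }
)
print(summary)
```

Compared with building a `pd.DataFrame` from two separate groupby results, this keeps both statistics in one pass over the groups.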


### **🎯 Q2: Handling Missing Values in `"Embarked"` Column Using `groupby`**  

Using **`groupby`**, create groups based on the `"Pclass"` column and perform the following operations:  

1. **Fill the Missing Values (`NaN`)** in the `"Embarked"` column with the **mode (most frequent value)** of that group.  
2. **Print the Value Counts** of the `"Embarked"` column for each `"Pclass"`, sorted in **ascending order**.  

3. **📌 Expected Output Format**  
    For each `"Pclass"` group, print the `"Embarked"` column's value counts in ascending order:  

    - **Example Output (Pclass = 1)**  
    ```
    Embarked Value Counts (Sorted in Ascending Order) for Pclass = 1:
    Q: 5  
    C: 35  
    S: 100  
    ```  
    - **Example Output (Pclass = 2)**  
    ```
    Embarked Value Counts (Sorted in Ascending Order) for Pclass = 2:
    Q: 8  
    C: 20  
    S: 120  
    ```  
    - **Example Output (Pclass = 3)**  
    ```
    Embarked Value Counts (Sorted in Ascending Order) for Pclass = 3:
    C: 10  
    Q: 15  
    S: 200  
    ```

4. **Notes:**  
   - The `"Pclass"` column represents **Passenger Class (1st, 2nd, or 3rd class)**.  
   - The `"Embarked"` column contains **port names** (`S`, `C`, `Q`), where missing values (`NaN`) should be replaced by the **mode of each group**.  
   - The **mode** (most frequently occurring value) should be determined **per `"Pclass"` group** before filling missing values.  
   - Use **`groupby("Pclass")`** to process each class separately.  
   - **Sort** the `"Embarked"` column's value counts in **ascending order** before printing.

In [5]:
def fill_nan(group):
    return group['Embarked'].fillna(
        value = group['Embarked'].mode()[0],
    )

embarked_filled = pclass_group.apply(  # keep the result; apply does not modify titanic_df
    func = fill_nan,
    include_groups = False
)

for name, data in embarked_filled.groupby(level = 'Pclass'):
    print('Embarked Value Counts (Sorted in Ascending Order) for Pclass {}:'.format(name))
    print(
        data.value_counts().sort_values()
    )

Embarked Value Counts (Sorted in Ascending Order) for Pclass 1:
Embarked
Q      2
C     85
S    129
Name: count, dtype: int64
Embarked Value Counts (Sorted in Ascending Order) for Pclass 2:
Embarked
Q      3
C     17
S    164
Name: count, dtype: int64
Embarked Value Counts (Sorted in Ascending Order) for Pclass 3:
Embarked
C     66
Q     72
S    353
Name: count, dtype: int64
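
An alternative to `apply` for group-wise filling is `transform`, which returns a Series aligned to the original rows so it can be assigned straight back to a column. This is a minimal sketch on a made-up `toy_df`; the values are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one missing port in each class.
toy_df = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 2, 2],
    'Embarked': ['S', 'S', np.nan, 'C', np.nan, 'C'],
})

# transform keeps the original index, so the filled values line up row by row.
toy_df['Embarked'] = toy_df.groupby('Pclass')['Embarked'].transform(
    lambda s: s.fillna(s.mode().iloc[0])  # mode() ignores NaN by default
)

for pclass, ports in toy_df.groupby('Pclass')['Embarked']:
    print(pclass, ports.value_counts().sort_values().to_dict())
```

Note that assigning back overwrites the column, which would also affect any later cell that reads `Embarked`.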


### **🎯 Q3: Grouping by `"Embarked"` and `"Pclass"` to Find Average Fare**  

Perform the following steps:  

1. **Group the DataFrame** based on the `"Embarked"` column.  
2. **Within each `"Embarked"` group**, further group the data based on the `"Pclass"` column.  
3. **Calculate the Average Fare** for each `"Pclass"` within each `"Embarked"` group.  
4. **Round off** the average fare values **to 2 decimal places**.  
5. **Store the result in a nested dictionary format** where:  
   - The **keys** are `"Embarked"` values (`S`, `C`, `Q`).  
   - The **nested keys** are `"Pclass"` values (`1`, `2`, `3`).  
   - The **values** are the rounded average fares for each `"Pclass"`.  

6. **📌 Expected Output Format**  
    The output should be a **nested dictionary**, as shown below:  

    ```python
    {
      'C': {1: 104.72, 2: 25.36, 3: 11.21},
      'Q': {1: 90.0, 2: 12.35, 3: 11.18},
      'S': {1: 70.36, 2: 20.33, 3: 14.64}
    }
    ```
7. **Notes:**  
   - The `"Embarked"` column contains values:  
     - **`S`** → Southampton  
     - **`C`** → Cherbourg  
     - **`Q`** → Queenstown  
   - The `"Pclass"` column represents **Passenger Class**:  
     - **1** → First Class  
     - **2** → Second Class  
     - **3** → Third Class  
   - The `"Fare"` column represents the **ticket fare** paid by passengers.  
   - The output should be **structured as a dictionary**, with `"Embarked"` groups as primary keys and `"Pclass"` as sub-keys.  
   - **Use `round( , 2)`** to round the average fare values to **2 decimal places**.

In [6]:
embarked_pclass_group = titanic_df.groupby(
    by = [
        'Embarked',
        'Pclass'
    ]
)

fare_dict = {}

for name, data in embarked_pclass_group:
    embarked, pclass = name
    if embarked not in fare_dict:
        fare_dict[embarked] = {}
    fare_dict[embarked][pclass] = round(data.Fare.mean(), 2)

fare_dict

{'C': {1: 104.72, 2: 25.36, 3: 11.21},
 'Q': {1: 90.0, 2: 12.35, 3: 11.18},
 'S': {1: 70.36, 2: 20.33, 3: 14.64}}
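
The same nested dictionary can be built without an explicit accumulation loop by aggregating first and then splitting the result by its outer index level. The sketch below uses a made-up `toy_df`, so the fares are illustrative only.

```python
import pandas as pd

# Toy data (made up) with the three columns Q3 relies on.
toy_df = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'C', 'C', 'Q'],
    'Pclass':   [1, 1, 2, 1, 3, 3],
    'Fare':     [70.0, 71.0, 20.333, 100.0, 11.0, 7.75],
})

# One mean per (Embarked, Pclass) pair, rounded to 2 decimal places.
mean_fare = toy_df.groupby(['Embarked', 'Pclass'])['Fare'].mean().round(2)

# Split by the outer level and turn each inner Series into a plain dict.
fare_dict = {
    embarked: fares.droplevel('Embarked').to_dict()
    for embarked, fares in mean_fare.groupby(level='Embarked')
}
print(fare_dict)
```

Combinations that never occur in the data (for example first class from `Q` in this toy frame) are simply absent from the inner dictionaries.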

### ⚠️ **Data Warning**
The following questions use the FIFA World Cup 2022 dataset, which is available in [Resources](../Resources/).

FIFA WorldCup 2022 Dataset : [Link](https://docs.google.com/spreadsheets/d/e/2PACX-1vT3D_x_4DS6d51LKJ7ze1sxT5WpV5uiSVOFYHLwBiGru6vFyVv5h5-83AwFjxWYiWfCDjDAaarHAV-k/pub?gid=0&single=true&output=csv)

#### **Reading Data into DataFrame**

In [7]:
fifa_df = pd.read_csv(
    filepath_or_buffer = '../Resources/Data/fifa_worldcup_2022.csv'
)

In [8]:
fifa_df

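|  | Sl. No | Match No. | Team | Against | Group | Goal | Possession (%) | Inside Penalty Area | Outside Penalty Area | Assists | ... | Fouls Against | Offsides | Passes | Passes Completed | Crosses | Crosses Completed | Corners | Free Kicks | Penalties Scored | Pts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Qatar | Ecuador | A | 0 | 40 | 0 | 0 | 0 | ... | 15 | 3 | 453 | 387 | 10 | 5 | 1 | 19 | 0 | 0 |
| 1 | 2 | 1 | Ecuador | Qatar | A | 2 | 46 | 2 | 0 | 1 | ... | 15 | 4 | 484 | 419 | 26 | 10 | 3 | 17 | 1 | 3 |
| 2 | 3 | 2 | England | Iran | B | 6 | 69 | 6 | 0 | 6 | ... | 9 | 2 | 810 | 733 | 29 | 9 | 8 | 16 | 0 | 3 |
| 3 | 4 | 2 | Iran | England | B | 2 | 20 | 2 | 0 | 1 | ... | 14 | 2 | 232 | 156 | 11 | 3 | 0 | 10 | 1 | 0 |
| 4 | 5 | 3 | Senegal | Netherlands | A | 0 | 39 | 0 | 0 | 0 | ... | 13 | 2 | 391 | 326 | 22 | 8 | 6 | 14 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 123 | 124 | 62 | Morocco | Fram | F | 0 | 55 | 0 | 0 | 0 | ... | 11 | 3 | 583 | 518 | 22 | 1 | 3 | 15 | 0 | 0 |
| 124 | 125 | 63 | Croatia | Morocco | F | 2 | 45 | 2 | 0 | 2 | ... | 13 | 2 | 491 | 430 | 21 | 3 | 6 | 13 | 0 | 3 |
| 125 | 126 | 63 | Morocco | Croatia | F | 1 | 45 | 1 | 0 | 0 | ... | 11 | 2 | 494 | 428 | 20 | 5 | 3 | 15 | 0 | 0 |
| 126 | 127 | 64 | Argentina | France | C | 3 | 46 | 3 | 0 | 1 | ... | 26 | 4 | 648 | 544 | 20 | 4 | 6 | 22 | 5 | 3 |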


### **🎯 Q4: Grouping by `"Team"` and Performing Z-Normalization**  

**Tasks to Perform:**  
1. **Group the DataFrame** using the `"Team"` column.  
2. **Perform Z-Normalization** on the following columns **within each group**:  
   - `"Passes"`  
   - `"Passes Completed"`  
   - `"Attempted Line Breaks"`  
   - `"Completed Line Breaks"`  
3. **Define a function `z_normalization`** that:  
   - Takes **two arguments**:  
     1. **`group`** → Represents each grouped `"Team"` data.  
     2. **`cols_to_perform`** → List of columns on which Z-Normalization is applied.  
   - Uses the formula:  
     $$
     Z = \frac{X_i - \mu}{\text{std}}
     $$
   - Uses **`.apply()`** to apply Z-Normalization across the grouped data.  
4. **After performing normalization**, calculate the following statistics for each `"Team"`:  
   - **Minimum `"Passes"`**  
   - **Maximum `"Passes"`**  
   - **Minimum `"Yellow Cards"`**  
   - **Maximum `"Yellow Cards"`**  
   - **Average `"Yellow Cards"`**  
   - **Maximum `"Attempted Line Breaks"`**  
   - **Minimum `"Attempted Line Breaks"`**  
   - **Standard Deviation of `"Attempted Line Breaks"`**  
   - **Average `"Possession (%)"`**  


5. **📌 Expected Output Structure**  
    After performing the operations, the **final output** should be a **summary table** where each row corresponds to a `"Team"` and includes:  
    - **Z-normalized values** for selected columns.  
    - **Statistical summaries** for `"Passes"`, `"Yellow Cards"`, `"Attempted Line Breaks"`, and `"Possession"`.  

6. **Notes:**  
   - The Z-Normalization is applied **within each `"Team"` group**, meaning each team's stats are normalized **relative to their own data**.  
   - Ensure to **avoid NaN values** in standard deviation calculations.  
   - Use **`.apply()`** for efficient column-wise transformations.  
   - Use **grouped aggregations** for statistical summaries after normalization.

In [9]:
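def z_normalization(group, cols_to_perform):
    for column in cols_to_perform:
        # Caveat: this `return` exits on the first iteration, so only the
        # first column ('Passes') is normalized, as the output below shows.
        return group[column].apply(
            func = lambda x : (x - group[column].mean()) / group[column].std()
        )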

In [10]:
team_groups = fifa_df.groupby(
    by = 'Team'
)

team_groups.apply(
    func = z_normalization,
    cols_to_perform = [
        'Passes',
        'Passes Completed',
        'Attempted Line Breaks',
        'Completed Line Breaks'
    ],
    include_groups = False
)

Team          
Argentina  8     -0.208138
           46    -0.630447
           77     1.685214
           98     0.622403
           115   -0.137753
                    ...   
Uruguay    63    -0.931702
           89    -0.124867
Wales      7     -0.054584
           32     1.026174
           68    -0.971590
Name: Passes, Length: 128, dtype: float64
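
The result above covers only `Passes`, because `z_normalization` returns from inside its loop. A minimal sketch of a variant that normalizes every requested column at once is shown below. The name `z_normalize`, the toy values, and the choice to report a z-score of 0 when a group has no spread are assumptions made for illustration, not part of the original task.

```python
import pandas as pd

# Toy data (made up): team B has identical 'Passes' in every match.
toy_df = pd.DataFrame({
    'Team': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Passes': [400, 500, 600, 300, 300, 300],
    'Passes Completed': [300, 400, 500, 250, 260, 270],
})
cols = ['Passes', 'Passes Completed']

def z_normalize(group, cols_to_perform):
    """Return z-scores for every requested column of one group."""
    block = group[cols_to_perform]
    std = block.std()            # sample standard deviation (ddof=1), the pandas default
    std = std.where(std > 0)     # zero or undefined spread becomes NaN
    # Assumption: report 0 where the z-score is undefined (no spread in the group).
    return ((block - block.mean()) / std).fillna(0.0)

z_scores = toy_df.groupby('Team', group_keys=False)[cols].apply(
    z_normalize, cols_to_perform=cols
)
print(z_scores)
```

Selecting `[cols]` before `apply` keeps the grouping column out of each group, and guarding the standard deviation avoids NaN for teams with a single match or constant values.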

In [11]:
team_groups.agg(
    func = {
        'Passes' : ['min', 'max'],
        'Yellow Cards' : ['min', 'max', 'mean'],
        'Attempted Line Breaks' : ['min', 'max', 'std'],
        'Possession (%)' : 'mean'
    }
)

| Team | Passes (min) | Passes (max) | Yellow Cards (min) | Yellow Cards (max) | Yellow Cards (mean) | Attempted Line Breaks (min) | Attempted Line Breaks (max) | Attempted Line Breaks (std) | Possession (%) (mean) |
|---|---|---|---|---|---|---|---|---|---|
| Argentina | 408 | 862 | 0 | 8 | 2.285714 | 141 | 249 | 38.774685 | 49.285714 |
| Australia | 286 | 466 | 0 | 3 | 1.75 | 133 | 171 | 17.682383 | 31.25 |
| Belgium | 512 | 685 | 1 | 3 | 1.666667 | 167 | 195 | 14.422205 | 49.0 |
| Brazil | 548 | 695 | 0 | 3 | 1.2 | 164 | 193 | 10.691118 | 50.4 |
| Cameroon | 295 | 500 | 1 | 5 | 2.666667 | 144 | 182 | 19.857828 | 38.333333 |
| Canada | 448 | 536 | 2 | 4 | 2.666667 | 102 | 176 | 37.753587 | 44.333333 |
| Costa Rica | 231 | 454 | 1 | 3 | 2.0 | 86 | 154 | 38.974351 | 27.333333 |
| Croatia | 461 | 724 | 0 | 2 | 1.142857 | 97 | 259 | 54.499891 | 47.428571 |
| Denmark | 537 | 650 | 1 | 2 | 1.666667 | 173 | 241 | 34.0 | 51.333333 |
| Ecuador | 429 | 484 | 0 | 2 | 1.0 | 143 | 177 | 17.009801 | 42.666667 |


### ⚠️ **Data Warning**
The following questions use the IPL Deliveries dataset, which is available in [Resources](../Resources/).

IPL DELIVERIES Dataset : [Link](https://drive.google.com/file/d/1-kvv_9KCSAFWcrhS9WgTxSrURkRh6GNt/view?usp=share_link)

#### **Reading Data into DataFrame**

In [12]:
deliveries = pd.read_csv(
    filepath_or_buffer = '../Resources/Data/deliveries.csv'
)

In [13]:
pd.options.display.max_columns = None

In [14]:
deliveries

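|  | match_id | inning | batting_team | bowling_team | over | ball | batsman | non_striker | bowler | is_super_over | wide_runs | bye_runs | legbye_runs | noball_runs | penalty_runs | batsman_runs | extra_runs | total_runs | player_dismissed | dismissal_kind | fielder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Sunrisers Hyderabad | Royal Challengers Bangalore | 1 | 1 | DA Warner | S Dhawan | TS Mills | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN |
| 1 | 1 | 1 | Sunrisers Hyderabad | Royal Challengers Bangalore | 1 | 2 | DA Warner | S Dhawan | TS Mills | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN |
| 2 | 1 | 1 | Sunrisers Hyderabad | Royal Challengers Bangalore | 1 | 3 | DA Warner | S Dhawan | TS Mills | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 4 | NaN | NaN | NaN |
| 3 | 1 | 1 | Sunrisers Hyderabad | Royal Challengers Bangalore | 1 | 4 | DA Warner | S Dhawan | TS Mills | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN |
| 4 | 1 | 1 | Sunrisers Hyderabad | Royal Challengers Bangalore | 1 | 5 | DA Warner | S Dhawan | TS Mills | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 179073 | 11415 | 2 | Chennai Super Kings | Mumbai Indians | 20 | 2 | RA Jadeja | SR Watson | SL Malinga | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | NaN | NaN | NaN |
| 179074 | 11415 | 2 | Chennai Super Kings | Mumbai Indians | 20 | 3 | SR Watson | RA Jadeja | SL Malinga | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 2 | NaN | NaN | NaN |
| 179075 | 11415 | 2 | Chennai Super Kings | Mumbai Indians | 20 | 4 | SR Watson | RA Jadeja | SL Malinga | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | SR Watson | run out | KH Pandya |
| 179076 | 11415 | 2 | Chennai Super Kings | Mumbai Indians | 20 | 5 | SN Thakur | RA Jadeja | SL Malinga | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 2 | NaN | NaN | NaN |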


### **🎯 Q5: Identifying Batsmen in Specific Chasing Categories**  

**Tasks to Perform:**  
1. **Define "Chasing" Matches**  
   - A match is considered as a **chasing scenario** when the **team bats in the second innings**.  
2. **Identify Batsmen in Two Categories:**  
   - **Batsman with the Highest Score While Chasing**  
     - Find the **batsman** who has scored the **highest individual score** in a chasing innings.  
   - **Batsman with the Best Strike Rate While Chasing (100+ Balls Faced)**  
     - Consider only those batsmen who have **faced at least 100 balls** while chasing.  
     - Find the batsman with the **best (highest) strike rate** under these conditions.  

3. **📌 Expected Output Structure**  
    After performing the operations, the final output should include:  
    1. **Batsman Name** with the highest individual score while chasing.  
    2. **Batsman Name** with the best strike rate while chasing (among those who have faced 100+ balls).  

4. **Notes:**  
   - Ensure that **only second innings performances** are considered.  
   - The **strike rate formula** is:  
     $$
     \text{Strike Rate} = \left(\frac{\text{Runs Scored}}{\text{Balls Faced}}\right) \times 100
     $$
   - **Filtering Condition**: Include only batsmen who have **faced at least 100 balls** when calculating the best strike rate.

In [15]:
chasing_matches = deliveries[deliveries['inning'] == 2]

batsman_groups = chasing_matches.groupby(
    by = 'batsman'
)

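print(
    'Batsman with the Highest Score While Chasing : {}'.format(
        # Caveat: this sums runs over all chasing innings per batsman;
        # a single-innings score would also need `match_id` in the grouping.
        batsman_groups['batsman_runs'].sum().sort_values(
            ascending = False
        ).index[0]
    )
)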

batsman_groups = batsman_groups.filter(
    func = lambda x : x.shape[0] >= 100  # "at least 100 balls" includes exactly 100
).groupby(
    by = 'batsman'
)

def strike_rate(group):
    return (
        group.sum() / group.size
    ) * 100


print(
    'Batsman with the Best Strike Rate While Chasing (100+ Balls Faced) : {}'.format(
        batsman_groups['batsman_runs'].apply(
            func = strike_rate
        ).sort_values(
            ascending = False
        ).index[0]
    )
)

Batsman with the Highest Score While Chasing : RV Uthappa
Batsman with the Best Strike Rate While Chasing (100+ Balls Faced) : SP Narine


### **🎯 Q6: Identify the Most Successful Bowler Against Any Batsman**  

**Tasks to Perform:**  
1. **Define "Most Successful" Bowler-Batsman Pair**  
   - A bowler is considered **most successful** against a batsman if they have dismissed that batsman **the most number of times**.  
   - If two or more bowler-batsman pairs have the **same number of dismissals**, then the **bowler who has conceded fewer runs** to that batsman is ranked higher.  

2. **Steps to Identify the Pair:**  
   - Count the **number of dismissals** for every bowler-batsman combination.  
   - If there is a tie in dismissals, use the **runs conceded** by the bowler to determine the ranking.  
   - Return the **bowler and batsman pair** that ranks first under these rules.  

3. **📌 Expected Output Structure**  
    After processing the dataset, the final output should include:  
    - **Bowler Name**  
    - **Batsman Name**  
    - **Number of Times Dismissed**  
    - **Total Runs Conceded**  

4. **Notes:**  
   - Consider **all modes of dismissal** where the bowler is credited (e.g., bowled, caught, LBW, etc.).  
   - Ignore dismissals like **run-outs** where the bowler does not play a direct role.  
   - If multiple pairs have the same dismissal count, the one with **fewer runs conceded** should be ranked higher.

In [16]:
deliveries['dismissal_kind'].unique()

array([nan, 'caught', 'bowled', 'run out', 'lbw', 'caught and bowled',
       'stumped', 'retired hurt', 'hit wicket', 'obstructing the field'],
      dtype=object)

In [17]:
bowler_batsman_groups = deliveries.groupby(
    by = [
        'bowler',
        'batsman'
    ]
)

list_data = []

for pair, data in bowler_batsman_groups:
    bowler, batsman = pair
    dismissed_balls = data.dropna(
        subset = ['player_dismissed']
    )
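    # Caveat: `&` binds tighter than `!=`, so the comparison with 'run out' is applied to the
    # combined mask rather than to `dismissal_kind`; run outs are therefore not filtered out here.
    dismissal_count = (dismissed_balls['player_dismissed'].notna() & dismissed_balls['dismissal_kind'] != 'run out').sum()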
    conceded_runs = data['batsman_runs'].sum()

    list_data.append(
        [
            bowler,
            batsman,
            dismissal_count,
            conceded_runs
        ]
    )

dismissal_df = pd.DataFrame(
    data = list_data,
    columns = [
        'Bowler Name',
        'Batsman Name',
        'No. of Times Dismissed',
        'Runs Conceded'
    ]
)

In [18]:
dismissal_df.sort_values(
    by = [
        'No. of Times Dismissed',
        'Runs Conceded'
    ],
    ascending = [
        False,
        True
    ]
).head(1)

|  | Bowler Name | Batsman Name | No. of Times Dismissed | Runs Conceded |
|---|---|---|---|---|
| 2515 | B Kumar | PA Patel | 7 | 66 |
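
Operator precedence matters when combining boolean masks: in Python, `&` binds tighter than `!=` and `==`, so each comparison needs its own parentheses (or a method call such as `isin`). The sketch below counts bowler-credited dismissals on a made-up `toy` frame. The list `NOT_BOWLER_CREDIT` and the extra check that the dismissed player is the striker are assumptions about how the notes in Q6 might be interpreted.

```python
import numpy as np
import pandas as pd

# Toy ball-by-ball data (made up).
toy = pd.DataFrame({
    'bowler':           ['X', 'X', 'X', 'X', 'Y', 'Y'],
    'batsman':          ['P', 'P', 'P', 'Q', 'P', 'P'],
    'batsman_runs':     [4, 0, 0, 1, 0, 6],
    'player_dismissed': [np.nan, 'P', 'P', 'Q', 'P', np.nan],
    'dismissal_kind':   [np.nan, 'caught', 'run out', 'bowled', 'lbw', np.nan],
})

# Assumption: these dismissal kinds are not credited to the bowler.
NOT_BOWLER_CREDIT = ['run out', 'retired hurt', 'obstructing the field']

# Every condition is its own parenthesized expression or method call.
credited = (
    toy['player_dismissed'].notna()
    & (toy['player_dismissed'] == toy['batsman'])
    & ~toy['dismissal_kind'].isin(NOT_BOWLER_CREDIT)
)

pair_stats = (
    toy.assign(dismissal=credited)
       .groupby(['bowler', 'batsman'])
       .agg(dismissals=('dismissal', 'sum'), runs_conceded=('batsman_runs', 'sum'))
       .sort_values(['dismissals', 'runs_conceded'], ascending=[False, True])
)
print(pair_stats)
```

Aggregating a boolean column with `sum` counts the `True` rows, which replaces the explicit loop over groups.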


### **🎯 Q7: Identify the Most Successful Batting Pair in IPL**  

**Tasks to Perform:**  
1. **Define "Most Successful" Batting Pair**  
   - A batting pair is considered **most successful** if they have **scored the most runs together** while batting in partnership.  

2. **Steps to Identify the Pair:**  
   - Identify **all batting pairs** from the dataset based on partnerships.  
   - Sum the **total runs scored** by each pair across all matches.  
   - Return the **batting pair** with the highest aggregate runs.  

3. **📌 Expected Output Structure**  
    After processing the dataset, the final output should include:  
    - **Batsman 1 Name**  
    - **Batsman 2 Name**  
    - **Total Runs Scored Together**  
    - **Number of Partnerships (optional, for insight)**  

4. **Notes:**  
   - Partnerships should be considered **irrespective of wickets falling** (e.g., if a partnership ends due to a non-striker getting run out, it should still be counted).  
   - Runs should include **all contributions** made by the two batsmen together (including boundaries, running between the wickets, etc.).  
   - If two pairs have the **same total runs**, the one with **fewer innings played together** should be ranked higher.

In [21]:
deliveries['batting_pair'] = deliveries.apply(
    func = lambda row : tuple(
        sorted(
            [row['batsman'], row['non_striker']]
        )
    ),
    axis = 1
)

batsmen_partnership_group = deliveries.groupby(
    by = 'batting_pair'
)

list_data = []

for pair, data in batsmen_partnership_group:
    batsman1, batsman2 = pair
    total_runs = data['batsman_runs'].sum()
    innings_count = data['match_id'].unique().size
    list_data.append(
        [
            batsman1,
            batsman2,
            total_runs,
            innings_count
        ]
    )

batsmen_partnership_df = pd.DataFrame(
    data = list_data,
    columns = [
        'Batsman 1',
        'Batsman 2',
        'Total Runs',
        'Innings Count',
    ]
)

In [22]:
batsmen_partnership_df.sort_values(
    by = [
        'Total Runs',
        'Innings Count'
    ],
    ascending = [
        False,
        True  # ties on runs: fewer innings together ranks higher
    ]
)

|  | Batsman 1 | Batsman 2 | Total Runs | Innings Count |
|---|---|---|---|---|
| 277 | AB de Villiers | V Kohli | 2773 | 68 |
| 1075 | CH Gayle | V Kohli | 2650 | 60 |
| 1265 | DA Warner | S Dhawan | 2242 | 50 |
| 1624 | G Gambhir | RV Uthappa | 1795 | 48 |
| 2775 | MS Dhoni | SK Raina | 1411 | 53 |
| ... | ... | ... | ... | ... |
| 3332 | SN Thakur | Washington Sundar | 0 | 1 |
| 3376 | Sandeep Sharma | VR Aaron | 0 | 1 |
| 3384 | Shoaib Ahmed | TL Suman | 0 | 1 |
| 3385 | Shoaib Ahmed | WPUJC Vaas | 0 | 1 |
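
Building the pair key with a row-wise `apply` can be slow on a frame with many rows. A possible vectorized alternative is to sort the two name columns within each row using NumPy. The sketch below uses a made-up `toy` frame, and the column names `batsman_1` and `batsman_2` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (made up): the same pair appears in both striker/non-striker orders.
toy = pd.DataFrame({
    'batsman':      ['V Kohli', 'AB de Villiers', 'V Kohli', 'CH Gayle'],
    'non_striker':  ['AB de Villiers', 'V Kohli', 'CH Gayle', 'V Kohli'],
    'batsman_runs': [4, 6, 1, 2],
})

# Sort the two names within each row so (A, B) and (B, A) share one key.
names = np.sort(toy[['batsman', 'non_striker']].to_numpy(), axis=1)
toy['batsman_1'] = names[:, 0]
toy['batsman_2'] = names[:, 1]

pair_runs = toy.groupby(['batsman_1', 'batsman_2'])['batsman_runs'].sum()
print(pair_runs)
```

Grouping on two plain columns also avoids unpacking tuple keys when the result is turned back into a DataFrame.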


### **🎯 Q8: Create a DataFrame for All Batting Pairs in IPL**  

**Tasks to Perform:**  
1. **Identify all batting pairs** from the dataset based on partnerships in IPL matches.  
2. **Calculate and store the following statistics** for each batting pair:  
   - **`Batsman1`** → Name of the first batsman in the partnership.  
   - **`Batsman2`** → Name of the second batsman in the partnership.  
   - **`Runs`** → Total runs scored by the pair while batting together.  
   - **`Avg` (Batting Average)** → Runs scored together divided by the number of innings they batted together.  
   - **`StrikeRate`** → (Total runs scored / Total balls faced) × 100.  
     - **Note:** Wide balls can be counted in total balls to simplify the calculation.  

3. **📌 Expected Output Structure**  
    The final DataFrame should have the following columns:  
    | Batsman1       | Batsman2       | Runs | Avg  | StrikeRate |
    |---------------|---------------|------|------|------------|
    | Player A      | Player B      | 1500 | 42.8 | 135.23     |
    | Player C      | Player D      | 1200 | 37.5 | 128.45     |
    | Player E      | Player F      | 1100 | 40.0 | 132.67     |


4. **Notes:**  
   - **The order of batsmen in a pair does not matter** (e.g., "Virat Kohli & AB de Villiers" is the same as "AB de Villiers & Virat Kohli").  
   - Partnerships should be considered **only in completed innings**, even if a player remains not out.  
   - If two pairs have the **same total runs**, rank them based on **higher strike rate**.

In [23]:
deliveries['batting_pair'] = deliveries.apply(
    func = lambda row : tuple(
        sorted(
            [row['batsman'], row['non_striker']]
        )
    ),
    axis = 1
)

batsmen_partnership_group = deliveries.groupby(
    by = 'batting_pair'
)

list_data = []

for pair, data in batsmen_partnership_group:
    striker, non_striker = pair
    total_runs = data['batsman_runs'].sum()
    innings_count = data['match_id'].unique().size
    list_data.append(
        [
            striker,
            non_striker,
            total_runs,
            total_runs / innings_count,
            (total_runs / data.shape[0]) * 100
        ]
    )

batsmen_partnership_df = pd.DataFrame(
    data = list_data,
    columns = [
        'Batsman 1',
        'Batsman 2',
        'Total Runs',
        'Average Run',
        'Strike Rate'
    ]
)

In [24]:
batsmen_partnership_df.sort_values(
    by = [
        'Total Runs'
    ],
    ascending = [
        False
    ]
)

|  | Batsman 1 | Batsman 2 | Total Runs | Average Run | Strike Rate |
|---|---|---|---|---|---|
| 277 | AB de Villiers | V Kohli | 2773 | 40.779412 | 148.925886 |
| 1075 | CH Gayle | V Kohli | 2650 | 44.166667 | 134.313229 |
| 1265 | DA Warner | S Dhawan | 2242 | 44.840000 | 129.971014 |
| 1624 | G Gambhir | RV Uthappa | 1795 | 37.395833 | 125.964912 |
| 2775 | MS Dhoni | SK Raina | 1411 | 26.622642 | 133.744076 |
| ... | ... | ... | ... | ... | ... |
| 3175 | RR Bhatkal | V Kohli | 0 | 0.000000 | 0.000000 |
| 1005 | C de Grandhomme | P Negi | 0 | 0.000000 | 0.000000 |
| 999 | C de Grandhomme | IR Jaggi | 0 | 0.000000 | 0.000000 |
| 3168 | RP Singh | S Sreesanth | 0 | 0.000000 | 0.000000 |
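
The Q8 statistics can also be expressed as one named aggregation followed by two derived columns, using the column names from the expected output (`Batsman1`, `Batsman2`, `Runs`, `Avg`, `StrikeRate`) and the strike-rate tie-break from the notes. This is a minimal sketch on a made-up `toy` frame in which the pair columns are assumed to be already sorted within each row.

```python
import pandas as pd

# Toy data (made up); Batsman1/Batsman2 are assumed to be order-independent keys.
toy = pd.DataFrame({
    'match_id':     [1, 1, 1, 2, 2, 2],
    'Batsman1':     ['A', 'A', 'A', 'A', 'A', 'C'],
    'Batsman2':     ['B', 'B', 'B', 'B', 'B', 'D'],
    'batsman_runs': [4, 0, 2, 6, 0, 1],
})

pair_stats = toy.groupby(['Batsman1', 'Batsman2']).agg(
    Runs=('batsman_runs', 'sum'),
    Innings=('match_id', 'nunique'),   # matches in which the pair batted together
    Balls=('batsman_runs', 'count'),   # every delivery counts, wides included
)
pair_stats['Avg'] = pair_stats['Runs'] / pair_stats['Innings']
pair_stats['StrikeRate'] = pair_stats['Runs'] / pair_stats['Balls'] * 100

# Rank by runs, breaking ties with the higher strike rate.
result = (
    pair_stats[['Runs', 'Avg', 'StrikeRate']]
    .sort_values(['Runs', 'StrikeRate'], ascending=[False, False])
    .reset_index()
)
print(result)
```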
