### Day 1: Introduction to Python for Data Science

Welcome to Day 1! In today’s session, we’ll cover some essential concepts and techniques you’ll use in Python for data science.

**Overview of the Topics for Today:**
1. Introduction to Python for Data Science.
2. Data manipulation with Pandas:
   - Data cleaning and transformation.
   - Grouping, merging, and aggregating data.
3. Introduction to NumPy:
   - Array creation and manipulation.
   - Vectorized operations for efficient computation.

Let’s get started!


### Install the libraries using pip if not already installed

In [1]:

!pip install pandas numpy matplotlib
# This command uses `!pip install` to install the necessary libraries
# for data manipulation (`pandas`), numerical calculations (`numpy`)





[notice] A new release of pip is available: 24.2 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip


### Import Libraries
In this cell, we will import the required libraries. The following libraries are crucial for our project:
- `pandas`: for data manipulation.
- `numpy`: for numerical operations.

In [2]:
import pandas as pd
import numpy as np
import time
#import matplotlib.pyplot as plt


### Data Manipulation with Pandas: Data Cleaning and Transformation

One of the first steps in data science is to clean and transform your data. We will use **Pandas** for that purpose, which provides robust tools for data manipulation.

**Data Cleaning Tasks:**
- Handling missing values.
- Converting data types.
- Removing duplicates.
- Filtering data.


### Data set link : <br>
- https://www.kaggle.com/datasets/mahendran1/icc-cricket?resource=download

In [3]:
# Sample DataFrame
df = pd.read_csv("Batting\\ODI data.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,Player,Span,Mat,Inns,NO,Runs,HS,Ave,BF,SR,100,50,0,Unnamed: 13
0,0,SR Tendulkar (INDIA),1989-2012,463,452,41,18426,200*,44.83,21367,86.23,49,96,20,
1,1,KC Sangakkara (Asia/ICC/SL),2000-2015,404,380,41,14234,169,41.98,18048,78.86,25,93,15,
2,2,RT Ponting (AUS/ICC),1995-2012,375,365,39,13704,164,42.03,17046,80.39,30,82,20,
3,3,ST Jayasuriya (Asia/SL),1989-2011,445,433,18,13430,189,32.36,14725,91.2,28,68,34,
4,4,DPMD Jayawardene (Asia/SL),1998-2015,448,418,39,12650,144,33.37,16020,78.96,19,77,28,


### Find different exsting columns

In [4]:
df.columns

Index(['Unnamed: 0', 'Player', 'Span', 'Mat', 'Inns', 'NO', 'Runs', 'HS',
       'Ave', 'BF', 'SR', '100', '50', '0', 'Unnamed: 13'],
      dtype='object')

In [5]:
# Drop a single column by name
df = df.drop(['Unnamed: 0','Unnamed: 13'], axis=1)


In [6]:
df.head()

Unnamed: 0,Player,Span,Mat,Inns,NO,Runs,HS,Ave,BF,SR,100,50,0
0,SR Tendulkar (INDIA),1989-2012,463,452,41,18426,200*,44.83,21367,86.23,49,96,20
1,KC Sangakkara (Asia/ICC/SL),2000-2015,404,380,41,14234,169,41.98,18048,78.86,25,93,15
2,RT Ponting (AUS/ICC),1995-2012,375,365,39,13704,164,42.03,17046,80.39,30,82,20
3,ST Jayasuriya (Asia/SL),1989-2011,445,433,18,13430,189,32.36,14725,91.2,28,68,34
4,DPMD Jayawardene (Asia/SL),1998-2015,448,418,39,12650,144,33.37,16020,78.96,19,77,28


In [7]:
df.isnull().sum()

Player    0
Span      0
Mat       0
Inns      0
NO        0
Runs      0
HS        0
Ave       0
BF        0
SR        0
100       0
50        0
0         0
dtype: int64

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 13 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Player  2500 non-null   object
 1   Span    2500 non-null   object
 2   Mat     2500 non-null   int64 
 3   Inns    2500 non-null   object
 4   NO      2500 non-null   object
 5   Runs    2500 non-null   object
 6   HS      2500 non-null   object
 7   Ave     2500 non-null   object
 8   BF      2500 non-null   object
 9   SR      2500 non-null   object
 10  100     2500 non-null   object
 11  50      2500 non-null   object
 12  0       2500 non-null   object
dtypes: int64(1), object(12)
memory usage: 254.0+ KB


### **Overview of the DataFrame**
- **Type**:  
  `<class 'pandas.core.frame.DataFrame'>` indicates that the object is a Pandas DataFrame.

- **Rows**:  
  `RangeIndex: 2500 entries, 0 to 2499` specifies the DataFrame has **2,500 rows**, indexed from `0` to `2499`.

- **Columns**:  
  `Data columns (total 13 columns)` means the DataFrame contains **13 columns**.

---

### **Column Information**
For each column, the following details are provided:

1. **Column Name**:  
   The name of each column, e.g., `Player`, `Span`, `Mat`, etc.

2. **Non-Null Count**:  
   The number of **non-null (non-missing)** values in each column.  
   - Here, every column has **2500 non-null values**, meaning there are no missing values in any column.

3. **Data Type (Dtype)**:  
   - The type of data stored in each column.  
     For example:
     - `object`: Represents strings or mixed types.
     - `int64`: Represents integers.

---

### **Per-Column Details**
| **Column** | **Description** | **Dtype** |
|------------|-----------------|-----------|
| `Player`   | contains player names. | `object` (text) |
| `Span`     | represents the player's career span (e.g., 2000–2010). | `object` (text) |
| `Mat`      | Total matches played by each player. | `int64` (integer) |
| `Inns`     | Number of innings played. | `object` (text, possibly due to special cases like `'-'`) |
| `NO`       | Not-outs (number of times the player remained not out). | `object` (text) |
| `Runs`     | Total runs scored by the player. | `object`  |
| `HS`       | Highest score made by the player. | `object` |
| `Ave`      | Batting average. | `object` (text, possibly due to missing or invalid values) |
| `BF`       | Balls faced by the player. | `object` (text, likely contains invalid or missing values) |
| `SR`       | Strike rate. | `object` (text, possibly due to missing/invalid entries) |
| `100`      | Number of centuries. | `object` (text, possibly contains special values) |
| `50`       | Number of half-centuries. | `object` (text, possibly contains special values) |
| `0`        | Number of ducks (scores of zero). | `object` (text) |

---

### **Data Type Summary**
- `dtypes: int64(1), object(12)`:
  - `int64(1)`: There is **1 column** (`Mat`) with numeric data.
  - `object(12)`: There are **12 columns** with text or mixed data types.

---

### **Memory Usage**
- `memory usage: 254.0+ KB`: The DataFrame uses approximately **254 KB** of memory.


In [9]:
df.describe()

Unnamed: 0,Mat
count,2500.0
mean,37.1616
std,58.885075
min,1.0
25%,4.0
50%,13.0
75%,43.0
max,463.0


In [10]:
df.columns

Index(['Player', 'Span', 'Mat', 'Inns', 'NO', 'Runs', 'HS', 'Ave', 'BF', 'SR',
       '100', '50', '0'],
      dtype='object')

In [11]:
df = df.replace('-', 0)

In [12]:
df["Player"].value_counts()

Player
Raqibul Hasan (BDESH)     2
SR Tendulkar (INDIA)      1
PJ Martin (ENG)           1
A Shahzad (ENG)           1
SN Thakur (INDIA)         1
                         ..
Aaqib Javed (PAK)         1
Shamsur Rahman (BDESH)    1
CA Soper (PNG)            1
JM Vince (ENG)            1
GR Beard (AUS)            1
Name: count, Length: 2499, dtype: int64

In [13]:
df = df.drop_duplicates()

In [14]:
df.shape

(2500, 13)

In [15]:
# why duplicate player....

In [16]:


split_data = df['Player'].str.split('(', expand=True)


data = pd.DataFrame({
    'Player_Name': split_data[0].str.strip(),
    'Team': split_data[1].str.strip(')')
})


In [17]:
data

Unnamed: 0,Player_Name,Team
0,SR Tendulkar,INDIA
1,KC Sangakkara,Asia/ICC/SL
2,RT Ponting,AUS/ICC
3,ST Jayasuriya,Asia/SL
4,DPMD Jayawardene,Asia/SL
...,...,...
2495,ZS Ansari,ENG
2496,Ariful Haque,BDESH
2497,Ashfaq Ahmed,PAK
2498,MD Bailey,NZ


In [18]:
data["Team"].value_counts()

Team
ENG               239
PAK               213
AUS               212
INDIA             208
WI                188
NZ                186
SL                178
ZIM               132
BDESH             126
SA                103
CAN                83
UAE                81
SCOT               68
NL                 64
IRE                54
KENYA              46
AFG                44
HKG                40
BMUDA              35
USA                27
NAM                26
PNG                23
NEPAL              19
Afr/SA             15
EAf                13
OMAN               13
Asia/INDIA          7
Asia/SL             5
AUS/ICC             5
Afr/ZIM             4
Asia/PAK            4
USA/WI              3
Afr/KENYA           3
Asia/BDESH          3
ENG/ICC             3
ENG/IRE             3
ICC/NZ              3
Asia/ICC/PAK        2
Asia/ICC/SL         2
Afr/ICC/SA          2
ICC/WI              2
Asia/ICC/INDIA      2
ICC/SA              1
SA/USA              1
NL/SA               1
HKG/N

In [19]:
data.columns

Index(['Player_Name', 'Team'], dtype='object')

In [20]:
df["Span"].value_counts()

Span
2019-2019    101
2018-2019     37
2017-2019     35
2003-2003     33
2018-2018     32
            ... 
1980-1993      1
1995-2002      1
1987-1989      1
1974-1982      1
1990-1993      1
Name: count, Length: 566, dtype: int64

In [21]:
# Perform the split with safety
split_columns = df['Span'].str.split('-', expand=True)

# Assign columns dynamically based on split result
data['Debute_Year'] = split_columns[0]
# Convert 'Player' column to string
data['Debute_Year'] = data['Debute_Year'].astype(int)

data['Retirment_Year'] = split_columns[1]
data['Retirment_Year'] = data['Retirment_Year'].astype(int)
data

Unnamed: 0,Player_Name,Team,Debute_Year,Retirment_Year
0,SR Tendulkar,INDIA,1989,2012
1,KC Sangakkara,Asia/ICC/SL,2000,2015
2,RT Ponting,AUS/ICC,1995,2012
3,ST Jayasuriya,Asia/SL,1989,2011
4,DPMD Jayawardene,Asia/SL,1998,2015
...,...,...,...,...
2495,ZS Ansari,ENG,2015,2015
2496,Ariful Haque,BDESH,2018,2018
2497,Ashfaq Ahmed,PAK,1994,1994
2498,MD Bailey,NZ,1998,1998


### Did Virat Kholi retire in ODIS

In [22]:
data["Span"] = data["Retirment_Year"] - data["Debute_Year"]

In [23]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Player_Name     2500 non-null   object
 1   Team            2500 non-null   object
 2   Debute_Year     2500 non-null   int32 
 3   Retirment_Year  2500 non-null   int32 
 4   Span            2500 non-null   int32 
dtypes: int32(3), object(2)
memory usage: 68.5+ KB


In [24]:
print(df["Inns"].info())


<class 'pandas.core.series.Series'>
RangeIndex: 2500 entries, 0 to 2499
Series name: Inns
Non-Null Count  Dtype 
--------------  ----- 
2500 non-null   object
dtypes: object(1)
memory usage: 19.7+ KB
None


In [25]:
data['Inns'] = df['Inns'].astype(int)

In [26]:
# Create the 'Not_Out' column: 1 if '*' is in 'HS', else 0
data['HS_Not_Out'] = df['HS'].apply(lambda x: 1 if '*' in str(x) else 0)

# Optionally, clean the 'HS' column to remove '*'
data['HS'] = df['HS'].apply(lambda x: str(x).replace('*', '').strip())


In [27]:
data

Unnamed: 0,Player_Name,Team,Debute_Year,Retirment_Year,Span,Inns,HS_Not_Out,HS
0,SR Tendulkar,INDIA,1989,2012,23,452,1,200
1,KC Sangakkara,Asia/ICC/SL,2000,2015,15,380,0,169
2,RT Ponting,AUS/ICC,1995,2012,17,365,0,164
3,ST Jayasuriya,Asia/SL,1989,2011,22,433,0,189
4,DPMD Jayawardene,Asia/SL,1998,2015,17,418,0,144
...,...,...,...,...,...,...,...,...
2495,ZS Ansari,ENG,2015,2015,0,0,0,0
2496,Ariful Haque,BDESH,2018,2018,0,0,0,0
2497,Ashfaq Ahmed,PAK,1994,1994,0,0,0,0
2498,MD Bailey,NZ,1998,1998,0,0,0,0


In [28]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Player_Name     2500 non-null   object
 1   Team            2500 non-null   object
 2   Debute_Year     2500 non-null   int32 
 3   Retirment_Year  2500 non-null   int32 
 4   Span            2500 non-null   int32 
 5   Inns            2500 non-null   int32 
 6   HS_Not_Out      2500 non-null   int64 
 7   HS              2500 non-null   object
dtypes: int32(4), int64(1), object(3)
memory usage: 117.3+ KB


In [29]:
df["NO"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 2500 entries, 0 to 2499
Series name: NO
Non-Null Count  Dtype 
--------------  ----- 
2500 non-null   object
dtypes: object(1)
memory usage: 19.7+ KB


### Creating new columns for better readability

In [30]:
data["NotOuts"] = df["NO"].astype("int")
data["Average"] = df["Ave"].astype("float")
data["BallsFaced"] = df["BF"].astype("int")
data["StrikeRate"] = df["SR"].astype("float")
data["100s"] = df["100"].astype("int")
data["50s"] = df["50"].astype("int")
data["0s"] = df["0"].astype("int")
data["Runs"] = df["Runs"].astype("int")

In [31]:
data.head()

Unnamed: 0,Player_Name,Team,Debute_Year,Retirment_Year,Span,Inns,HS_Not_Out,HS,NotOuts,Average,BallsFaced,StrikeRate,100s,50s,0s,Runs
0,SR Tendulkar,INDIA,1989,2012,23,452,1,200,41,44.83,21367,86.23,49,96,20,18426
1,KC Sangakkara,Asia/ICC/SL,2000,2015,15,380,0,169,41,41.98,18048,78.86,25,93,15,14234
2,RT Ponting,AUS/ICC,1995,2012,17,365,0,164,39,42.03,17046,80.39,30,82,20,13704
3,ST Jayasuriya,Asia/SL,1989,2011,22,433,0,189,18,32.36,14725,91.2,28,68,34,13430
4,DPMD Jayawardene,Asia/SL,1998,2015,17,418,0,144,39,33.37,16020,78.96,19,77,28,12650


In [32]:
# Separate numerical and categorical columns
numerical_columns = data.select_dtypes(include=['int32', 'float64']).columns
categorical_columns = data.select_dtypes(include=['object', 'bool']).columns

# Display results
print("Numerical Columns:", numerical_columns.tolist())



Numerical Columns: ['Debute_Year', 'Retirment_Year', 'Span', 'Inns', 'NotOuts', 'Average', 'BallsFaced', 'StrikeRate', '100s', '50s', '0s', 'Runs']


In [33]:
print("Categorical Columns:", categorical_columns.tolist())

Categorical Columns: ['Player_Name', 'Team', 'HS']


In [34]:
data

Unnamed: 0,Player_Name,Team,Debute_Year,Retirment_Year,Span,Inns,HS_Not_Out,HS,NotOuts,Average,BallsFaced,StrikeRate,100s,50s,0s,Runs
0,SR Tendulkar,INDIA,1989,2012,23,452,1,200,41,44.83,21367,86.23,49,96,20,18426
1,KC Sangakkara,Asia/ICC/SL,2000,2015,15,380,0,169,41,41.98,18048,78.86,25,93,15,14234
2,RT Ponting,AUS/ICC,1995,2012,17,365,0,164,39,42.03,17046,80.39,30,82,20,13704
3,ST Jayasuriya,Asia/SL,1989,2011,22,433,0,189,18,32.36,14725,91.20,28,68,34,13430
4,DPMD Jayawardene,Asia/SL,1998,2015,17,418,0,144,39,33.37,16020,78.96,19,77,28,12650
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2495,ZS Ansari,ENG,2015,2015,0,0,0,0,0,0.00,0,0.00,0,0,0,0
2496,Ariful Haque,BDESH,2018,2018,0,0,0,0,0,0.00,0,0.00,0,0,0,0
2497,Ashfaq Ahmed,PAK,1994,1994,0,0,0,0,0,0.00,0,0.00,0,0,0,0
2498,MD Bailey,NZ,1998,1998,0,0,0,0,0,0.00,0,0.00,0,0,0,0


### Average Above 40 from India

In [35]:
data[(data["Average"] > 40) &  (data["Team"] == "INDIA")]

Unnamed: 0,Player_Name,Team,Debute_Year,Retirment_Year,Span,Inns,HS_Not_Out,HS,NotOuts,Average,BallsFaced,StrikeRate,100s,50s,0s,Runs
0,SR Tendulkar,INDIA,1989,2012,23,452,1,200,41,44.83,21367,86.23,49,96,20,18426
6,V Kohli,INDIA,2008,2019,11,233,0,183,39,59.84,12445,93.28,43,55,13,11609
19,RG Sharma,INDIA,2007,2019,12,214,0,264,32,49.14,10063,88.88,28,43,13,8944
73,S Dhawan,INDIA,2010,2019,9,131,0,143,7,44.5,5869,94.01,17,27,5,5518
268,AT Rayudu,INDIA,2013,2019,6,50,1,124,14,47.05,2143,79.04,3,10,3,1694
309,KM Jadhav,INDIA,2014,2019,5,50,0,120,18,42.31,1325,102.18,2,6,2,1354
420,KL Rahul,INDIA,2016,2019,3,25,0,111,4,42.33,1098,80.96,3,5,2,889
626,SS Iyer,INDIA,2017,2019,2,10,0,88,0,47.6,454,104.84,0,6,0,476
1186,GK Khoda,INDIA,1998,1998,0,2,0,89,0,57.5,185,62.16,0,1,0,115
1607,AV Mankad,INDIA,1974,1974,0,1,0,44,0,44.0,61,72.13,0,0,0,44


In [36]:
data["Team"].value_counts()

Team
ENG               239
PAK               213
AUS               212
INDIA             208
WI                188
NZ                186
SL                178
ZIM               132
BDESH             126
SA                103
CAN                83
UAE                81
SCOT               68
NL                 64
IRE                54
KENYA              46
AFG                44
HKG                40
BMUDA              35
USA                27
NAM                26
PNG                23
NEPAL              19
Afr/SA             15
EAf                13
OMAN               13
Asia/INDIA          7
Asia/SL             5
AUS/ICC             5
Afr/ZIM             4
Asia/PAK            4
USA/WI              3
Afr/KENYA           3
Asia/BDESH          3
ENG/ICC             3
ENG/IRE             3
ICC/NZ              3
Asia/ICC/PAK        2
Asia/ICC/SL         2
Afr/ICC/SA          2
ICC/WI              2
Asia/ICC/INDIA      2
ICC/SA              1
SA/USA              1
NL/SA               1
HKG/N

In [37]:
# Replace specific patterns
data.Team = data.Team.str.replace('Afr/', '', regex=False)
data.Team= data.Team.str.replace('ASIA/', '', regex=False)
data.Team= data.Team.str.replace('Asia/', '', regex=False)
data.Team = data.Team.str.replace('ICC/', '', regex=False)
data.Team = data.Team.str.replace('/ICC', '', regex=False)

In [38]:
data["Team"].value_counts()

Team
ENG         242
PAK         219
INDIA       217
AUS         217
WI          190
NZ          189
SL          185
ZIM         136
BDESH       129
SA          121
CAN          83
UAE          81
SCOT         68
NL           64
IRE          54
KENYA        49
AFG          44
HKG          40
BMUDA        35
USA          27
NAM          26
PNG          23
NEPAL        19
OMAN         13
EAf          13
USA/WI        3
ENG/IRE       3
NL/SA         1
HKG/NZ        1
3)            1
SA/USA        1
AUS/NZ        1
ENG/SCOT      1
CAN/WI        1
ENG/PNG       1
AUS/SA        1
1)            1
Name: count, dtype: int64

In [39]:
data

Unnamed: 0,Player_Name,Team,Debute_Year,Retirment_Year,Span,Inns,HS_Not_Out,HS,NotOuts,Average,BallsFaced,StrikeRate,100s,50s,0s,Runs
0,SR Tendulkar,INDIA,1989,2012,23,452,1,200,41,44.83,21367,86.23,49,96,20,18426
1,KC Sangakkara,SL,2000,2015,15,380,0,169,41,41.98,18048,78.86,25,93,15,14234
2,RT Ponting,AUS,1995,2012,17,365,0,164,39,42.03,17046,80.39,30,82,20,13704
3,ST Jayasuriya,SL,1989,2011,22,433,0,189,18,32.36,14725,91.20,28,68,34,13430
4,DPMD Jayawardene,SL,1998,2015,17,418,0,144,39,33.37,16020,78.96,19,77,28,12650
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2495,ZS Ansari,ENG,2015,2015,0,0,0,0,0,0.00,0,0.00,0,0,0,0
2496,Ariful Haque,BDESH,2018,2018,0,0,0,0,0,0.00,0,0.00,0,0,0,0
2497,Ashfaq Ahmed,PAK,1994,1994,0,0,0,0,0,0.00,0,0.00,0,0,0,0
2498,MD Bailey,NZ,1998,1998,0,0,0,0,0,0.00,0,0.00,0,0,0,0


In [40]:
# Group by 'Team' and find the earliest debut year for each country
earliest_debut = data.groupby('Team')['Debute_Year'].min()

# Merge the result back to the original DataFrame to get player details
first_players = data[data.apply(lambda x: x['Debute_Year'] == earliest_debut[x['Team']], axis=1)]
first_players

Unnamed: 0,Player_Name,Team,Debute_Year,Retirment_Year,Span,Inns,HS_Not_Out,HS,NotOuts,Average,BallsFaced,StrikeRate,100s,50s,0s,Runs
37,EJG Morgan,ENG/IRE,2006,2019,13,217,0,148,32,39.71,8053,91.24,13,46,15,7348
108,WTS Porterfield,IRE,2006,2019,13,133,0,139,3,31.05,5818,69.38,11,17,9,4037
131,KJ O'Brien,IRE,2006,2019,13,130,0,142,17,30.88,3922,88.98,2,18,5,3490
133,SO Tikolo,KENYA,1996,2014,18,130,0,111,12,29.05,4524,75.77,3,24,13,3428
138,KC Wessels,AUS/SA,1983,1994,11,105,0,107,7,34.35,6088,55.30,1,26,0,3367
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2440,LR Gibbs,WI,1973,1975,2,1,1,0,1,0.00,0,0.00,0,0,0,0
2448,Kaleemullah,OMAN,2019,2019,0,1,0,0,0,0.00,2,0.00,0,0,1,0
2462,Nasir Hameed,HKG,2004,2004,0,1,0,0,0,0.00,1,0.00,0,0,1,0
2484,GS Sobers,WI,1973,1973,0,1,0,0,0,0.00,6,0.00,0,0,1,0


In [41]:
first_players[["Team","Debute_Year"]].value_counts()

Team      Debute_Year
BMUDA     2006           20
AFG       2009           15
NEPAL     2018           15
KENYA     1996           15
NZ        1973           14
NAM       2003           14
IRE       2006           13
HKG       2004           13
WI        1973           13
SL        1975           13
SCOT      1999           13
SA        1991           13
OMAN      2019           13
INDIA     1974           13
ZIM       1983           13
EAf       1975           13
CAN       1979           13
NL        1996           12
USA       2004           12
PAK       1973           11
ENG       1971           11
UAE       1994           11
BDESH     1986           11
PNG       2014           10
AUS       1971            8
ENG/IRE   2006            2
AUS/SA    1983            1
USA/WI    1990            1
AUS/NZ    2008            1
ENG/SCOT  1997            1
SA/USA    2010            1
HKG/NZ    2015            1
CAN/WI    1991            1
NL/SA     2009            1
3)        2016            

In [42]:
data.sort_values(by='Runs', ascending=False).head(10)

Unnamed: 0,Player_Name,Team,Debute_Year,Retirment_Year,Span,Inns,HS_Not_Out,HS,NotOuts,Average,BallsFaced,StrikeRate,100s,50s,0s,Runs
0,SR Tendulkar,INDIA,1989,2012,23,452,1,200,41,44.83,21367,86.23,49,96,20,18426
1,KC Sangakkara,SL,2000,2015,15,380,0,169,41,41.98,18048,78.86,25,93,15,14234
2,RT Ponting,AUS,1995,2012,17,365,0,164,39,42.03,17046,80.39,30,82,20,13704
3,ST Jayasuriya,SL,1989,2011,22,433,0,189,18,32.36,14725,91.2,28,68,34,13430
4,DPMD Jayawardene,SL,1998,2015,17,418,0,144,39,33.37,16020,78.96,19,77,28,12650
5,Inzamam-ul-Haq,PAK,1991,2007,16,350,1,137,53,39.52,15812,74.24,10,83,20,11739
6,V Kohli,INDIA,2008,2019,11,233,0,183,39,59.84,12445,93.28,43,55,13,11609
7,JH Kallis,SA,1996,2014,18,314,0,139,53,44.36,15885,72.89,17,86,17,11579
8,SC Ganguly,INDIA,1992,2007,15,300,0,183,23,41.02,15416,73.7,22,72,16,11363
9,R Dravid,INDIA,1996,2011,15,318,0,153,40,39.16,15284,71.24,12,83,13,10889


In [43]:

# Group by 'Team' and calculate average 'Ave' and total 'Runs'
team_stats = data.groupby('Team').agg({'Average': 'mean', 'Runs': 'sum'}).reset_index()

# Find the team with the highest average
team_highest_average = team_stats.loc[team_stats['Average'].idxmax()]

# Find the team with the highest total runs
team_highest_runs = team_stats.loc[team_stats['Runs'].idxmax()]

# Display results
print("Team with the highest average:")
print(team_highest_average)

print("\nTeam with the highest total runs:")
print(team_highest_runs)


Team with the highest average:
Team       HKG/NZ
Average      40.0
Runs          160
Name: 16, dtype: object

Team with the highest total runs:
Team           INDIA
Average    16.978341
Runs          204716
Name: 17, dtype: object


## Introduction to NumPy: Array Creation and Operations

In [44]:
# Convert 'Runs' to numeric if needed
runs = pd.to_numeric(df['Runs'], errors='coerce').to_numpy()

# Total runs scored
total_runs = np.nansum(runs)  # Use np.nansum to ignore NaN values
print("Total Runs Scored:", total_runs)


Total Runs Scored: 1683824


In [45]:
# Ensure 'debuted_year' is numeric
data['debuted_year'] = pd.to_numeric(data['Debute_Year'], errors='coerce')

# Group by 'debuted_year' and count the number of players
data.groupby('debuted_year').size().reset_index(name='Total_Debutants')



Unnamed: 0,debuted_year,Total_Debutants
0,1971,19
1,1972,12
2,1973,49
3,1974,25
4,1975,49
5,1976,19
6,1977,26
7,1978,34
8,1979,35
9,1980,32


In [46]:
# Convert 'Ave' to numeric if needed
averages = pd.to_numeric(df['Ave'], errors='coerce').to_numpy()

# Mean and median
mean_ave = np.nanmean(averages)  # Average ignoring NaN values
median_ave = np.nanmedian(averages)  # Median ignoring NaN values

print("Mean Average:", mean_ave)
print("Median Average:", median_ave)


Mean Average: 17.330384
Median Average: 15.33


In [47]:
# Sort and select the top 5 run-scorers
top_5_runs = np.sort(runs)[-5:][::-1]
print("Top 5 Run Scores:", top_5_runs)


Top 5 Run Scores: [18426 14234 13704 13430 12650]


In [48]:
# Calculate the mean of runs
average_runs = np.nanmean(runs)

# Find players with above-average runs
above_average = runs > average_runs

# Use the boolean mask to extract player names
players_above_average = df['Player'].to_numpy()[above_average]
print("Players with Above-Average Runs:")
print(players_above_average)


Players with Above-Average Runs:
['SR Tendulkar (INDIA)' 'KC Sangakkara (Asia/ICC/SL)'
 'RT Ponting (AUS/ICC)' 'ST Jayasuriya (Asia/SL)'
 'DPMD Jayawardene (Asia/SL)' 'Inzamam-ul-Haq (Asia/PAK)'
 'V Kohli (INDIA)' 'JH Kallis (Afr/ICC/SA)' 'SC Ganguly (Asia/INDIA)'
 'R Dravid (Asia/ICC/INDIA)' 'MS Dhoni (Asia/INDIA)' 'CH Gayle (ICC/WI)'
 'BC Lara (ICC/WI)' 'TM Dilshan (SL)' 'Mohammad Yousuf (Asia/PAK)'
 'AC Gilchrist (AUS/ICC)' 'AB de Villiers (Afr/SA)' 'M Azharuddin (INDIA)'
 'PA de Silva (SL)' 'RG Sharma (INDIA)' 'Saeed Anwar (PAK)'
 'S Chanderpaul (WI)' 'Yuvraj Singh (Asia/INDIA)' 'DL Haynes (WI)'
 'MS Atapattu (SL)' 'ME Waugh (AUS)' 'LRPL Taylor (NZ)'
 'V Sehwag (Asia/ICC/INDIA)' 'HM Amla (SA)' 'HH Gibbs (SA)'
 'Shahid Afridi (Asia/ICC/PAK)' 'SP Fleming (ICC/NZ)' 'MJ Clarke (AUS)'
 'SR Waugh (AUS)' 'Shoaib Malik (PAK)' 'A Ranatunga (SL)'
 'Javed Miandad (PAK)' 'EJG Morgan (ENG/IRE)' 'Younis Khan (PAK)'
 'Saleem Malik (PAK)' 'NJ Astle (NZ)' 'GC Smith (Afr/SA)'
 'WU Tharanga (Asia/SL)

In [49]:
# Extract 'debuted_year' column as NumPy array
debut_years = data['Debute_Year'].to_numpy()

# Get unique years and their counts
unique_years, debutants_count = np.unique(debut_years, return_counts=True)

# Combine into a structured array
debutants_data = np.array(list(zip(unique_years, debutants_count)), dtype=[('Year', 'int'), ('Debutants', 'int')])

# Display the debutants data
print("Total Debutants Per Year:")
print(debutants_data)


Total Debutants Per Year:
[(1971,  19) (1972,  12) (1973,  49) (1974,  25) (1975,  49) (1976,  19)
 (1977,  26) (1978,  34) (1979,  35) (1980,  32) (1981,  16) (1982,  26)
 (1983,  48) (1984,  30) (1985,  26) (1986,  47) (1987,  29) (1988,  41)
 (1989,  19) (1990,  42) (1991,  25) (1992,  48) (1993,  33) (1994,  60)
 (1995,  45) (1996,  83) (1997,  53) (1998,  50) (1999,  62) (2000,  46)
 (2001,  56) (2002,  60) (2003,  80) (2004,  86) (2005,  37) (2006, 121)
 (2007,  66) (2008,  98) (2009,  82) (2010,  77) (2011,  68) (2012,  38)
 (2013,  69) (2014,  80) (2015,  58) (2016,  66) (2017,  59) (2018,  69)
 (2019, 101)]


In [50]:
# Normalize runs
normalized_runs = (runs - np.nanmin(runs)) / (np.nanmax(runs) - np.nanmin(runs))
print("Normalized Runs:")
print(normalized_runs)


Normalized Runs:
[1.         0.77249539 0.74373168 ... 0.         0.         0.        ]


In [51]:
data

Unnamed: 0,Player_Name,Team,Debute_Year,Retirment_Year,Span,Inns,HS_Not_Out,HS,NotOuts,Average,BallsFaced,StrikeRate,100s,50s,0s,Runs,debuted_year
0,SR Tendulkar,INDIA,1989,2012,23,452,1,200,41,44.83,21367,86.23,49,96,20,18426,1989
1,KC Sangakkara,SL,2000,2015,15,380,0,169,41,41.98,18048,78.86,25,93,15,14234,2000
2,RT Ponting,AUS,1995,2012,17,365,0,164,39,42.03,17046,80.39,30,82,20,13704,1995
3,ST Jayasuriya,SL,1989,2011,22,433,0,189,18,32.36,14725,91.20,28,68,34,13430,1989
4,DPMD Jayawardene,SL,1998,2015,17,418,0,144,39,33.37,16020,78.96,19,77,28,12650,1998
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2495,ZS Ansari,ENG,2015,2015,0,0,0,0,0,0.00,0,0.00,0,0,0,0,2015
2496,Ariful Haque,BDESH,2018,2018,0,0,0,0,0,0.00,0,0.00,0,0,0,0,2018
2497,Ashfaq Ahmed,PAK,1994,1994,0,0,0,0,0,0.00,0,0.00,0,0,0,0,1994
2498,MD Bailey,NZ,1998,1998,0,0,0,0,0,0.00,0,0.00,0,0,0,0,1998


In [52]:
# Using Pandas
import timeit
start_pandas = timeit.timeit()
data['Runs'].sum()
end_pandas = timeit.timeit()
print("Pandas Execution Time:", end_pandas - start_pandas)

# Using NumPy
start_numpy = timeit.timeit()
runs_array = data['Runs'].to_numpy()
np.nansum(runs_array)
end_numpy = timeit.timeit()
print("NumPy Execution Time:", end_numpy - start_numpy)


Pandas Execution Time: -0.00011150000500492752
NumPy Execution Time: 0.00034080000477842987



 ### Summary of Differences
 | Feature               | Pandas                           | NumPy                        |
 |------------------------|-----------------------------------|------------------------------|
 | **Data Structure**     | DataFrame (tabular), Series      | Multidimensional arrays      |
 | **Ease of Use**        | High (especially for tabular data)| Moderate (focused on arrays) |
 | **Performance**        | Slower than NumPy for numbers    | Faster for numerical operations |
 | **Missing Data**       | Handles `NaN` natively           | Requires `np.nan_to_num` or `np.isnan` |
 | **Grouped Operations** | Easy with `groupby`              | Requires additional logic    |

#### NumPy is optimal for raw numerical operations, while Pandas excels in handling structured, tabular data.


In [53]:
# Save the DataFrame to a CSV file
data.to_csv('cricket_data.csv', index=False)

print("Data saved to 'cricket_data.csv'")


Data saved to 'cricket_data.csv'
