# Pandas

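Pandas is a powerful open-source data manipulation and analysis library for Python. It provides easy-to-use data structures and functions for working with structured data. The two primary data structures in Pandas are:

- **Series**: a one-dimensional labeled array that holds a single column of data.
- **DataFrame**: a two-dimensional labeled table made up of rows and columns.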

Pandas is particularly useful for:
- **Data Cleaning**: It provides tools for handling missing data, converting data types, and other data cleaning tasks.
- **Data Exploration**: It allows you to explore and analyze data easily, including statistical analysis, aggregation, and summarization.
- **Data Manipulation**: Pandas supports operations like merging and joining datasets, reshaping data, and more.
- **Data I/O**: It can read and write data in various formats, including CSV, Excel, SQL databases, and more.
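
As a rough illustration of the cleaning and exploration tasks listed above, here is a minimal sketch. The data and the column names `city` and `temp` are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing temperature reading
weather = pd.DataFrame({
    "city": ["Pune", "Pune", "Delhi", "Delhi"],
    "temp": [30.0, np.nan, 35.0, 37.0],
})

# Data cleaning: fill the missing value with the column mean
weather["temp"] = weather["temp"].fillna(weather["temp"].mean())

# Data exploration: average temperature per city
mean_by_city = weather.groupby("city")["temp"].mean()
print(mean_by_city)
```

Filling with the mean is only one of several possible strategies; dropping the row with `dropna()` is another common choice.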

In [4]:
# Basic Example

import pandas as pd

# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles']}

df = pd.DataFrame(data)

# Displaying the DataFrame
df

Unnamed: 0,Name,Age,City
0,Alice,25,New York
1,Bob,30,San Francisco
2,Charlie,35,Los Angeles


Pandas simplifies many data manipulation tasks, making it a popular choice for data scientists, analysts, and developers working with data in Python.

_**NOTE**_: By convention, `numpy` is imported as `np` and `pandas` as `pd`. These aliases are used in the official documentation and in most tutorials.

In [18]:
import numpy as np
import pandas as pd

# Fix the seed so that the same random numbers are generated on every run
np.random.seed(100)

# 5 x 3 array of random integers from 0 to 99 (the upper bound 100 is excluded)
arr = np.random.randint(0, 100, (5, 3))
arr

array([[ 8, 24, 67],
       [87, 79, 48],
       [10, 94, 52],
       [98, 53, 66],
       [98, 14, 34]])

In [22]:
# Create a DataFrame from the array. Default column names and row indexes (0, 1, ...) are generated.
df = pd.DataFrame(arr)
df

Unnamed: 0,0,1,2
0,8,24,67
1,87,79,48
2,10,94,52
3,98,53,66
4,98,14,34


In [26]:
print(type(arr))
print(type(df))

<class 'numpy.ndarray'>
<class 'pandas.core.frame.DataFrame'>


In [28]:
# You can also supply your own row and column labels
rownames = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri']
columnnames = ['Jan', 'Feb', 'Mar']

df = pd.DataFrame(arr, index=rownames, columns=columnnames)
df

Unnamed: 0,Jan,Feb,Mar
Mon,8,24,67
Tue,87,79,48
Wed,10,94,52
Thu,98,53,66
Fri,98,14,34


### How to create a DataFrame from a dictionary?

In [31]:
mydict = {
    'Jan': [1, 2, 3, 4, 5],
    'Feb': [10, 20, 30, 40, 50],
    'Mar': [15, 25, 35, 45, 55],
}

# dataframe from dict
df = pd.DataFrame(mydict, index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'])
df

Unnamed: 0,Jan,Feb,Mar
Mon,1,10,15
Tue,2,20,25
Wed,3,30,35
Thu,4,40,45
Fri,5,50,55
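
In the dictionary above, each key is a column name and each list is a column. A related pattern, sketched here with made-up values, is a list of dictionaries in which each dictionary is one row. The name `records` is only illustrative.

```python
import pandas as pd

# Each dict is one row; the keys become the column names
records = [
    {"Jan": 1, "Feb": 10},
    {"Jan": 2, "Feb": 20},
    {"Jan": 3, "Feb": 30},
]

df_rows = pd.DataFrame(records, index=["Mon", "Tue", "Wed"])
print(df_rows)
```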


### Reading data from file

In [33]:
CSV_SAMPLE_FILE = "datasets/ODI_WC_2023_batting_summary.csv"
df = pd.read_csv(CSV_SAMPLE_FILE)

df

Unnamed: 0,Match_no,Match_Between,Team_Innings,Batsman_Name,Batting_Position,Dismissal,Runs,Balls,4s,6s,Strike_Rate
0,1,England vs New Zealand,England,Jonny Bairstow,1,c Daryl Mitchell b Mitchell Santner,33,35,4,1,94.300
1,1,England vs New Zealand,England,Dawid Malan,2,c Tom Latham b Matt Henry,14,24,2,0,58.300
2,1,England vs New Zealand,England,Joe Root,3,b Glenn Phillips,77,86,4,1,89.500
3,1,England vs New Zealand,England,Harry Brook,4,c Devon Conway b Rachin Ravindra,25,16,4,1,156.300
4,1,England vs New Zealand,England,Moeen Ali,5,b Glenn Phillips,11,17,1,0,64.700
...,...,...,...,...,...,...,...,...,...,...,...
911,48,India vs Australia,Australia,Travis Head,2,c Shubman Gill b Mohammed Siraj,137,120,15,4,114.167
912,48,India vs Australia,Australia,Mitchell Marsh,3,c KL Rahul b Jasprit Bumrah,15,15,1,1,100.000
913,48,India vs Australia,Australia,Steve Smith,4,lbw b Jasprit Bumrah,4,9,1,0,44.444
914,48,India vs Australia,Australia,Marnus Labuschagne,5,not out,58,110,4,0,52.727


In [37]:
# Show only the first few rows
df.head()  # returns the first 5 rows by default

Unnamed: 0,Match_no,Match_Between,Team_Innings,Batsman_Name,Batting_Position,Dismissal,Runs,Balls,4s,6s,Strike_Rate
0,1,England vs New Zealand,England,Jonny Bairstow,1,c Daryl Mitchell b Mitchell Santner,33,35,4,1,94.3
1,1,England vs New Zealand,England,Dawid Malan,2,c Tom Latham b Matt Henry,14,24,2,0,58.3
2,1,England vs New Zealand,England,Joe Root,3,b Glenn Phillips,77,86,4,1,89.5
3,1,England vs New Zealand,England,Harry Brook,4,c Devon Conway b Rachin Ravindra,25,16,4,1,156.3
4,1,England vs New Zealand,England,Moeen Ali,5,b Glenn Phillips,11,17,1,0,64.7


In [38]:
df.head(2)  # returns only the first 2 rows

Unnamed: 0,Match_no,Match_Between,Team_Innings,Batsman_Name,Batting_Position,Dismissal,Runs,Balls,4s,6s,Strike_Rate
0,1,England vs New Zealand,England,Jonny Bairstow,1,c Daryl Mitchell b Mitchell Santner,33,35,4,1,94.3
1,1,England vs New Zealand,England,Dawid Malan,2,c Tom Latham b Matt Henry,14,24,2,0,58.3


In [39]:
# Show only the last few rows
df.tail()  # returns the last 5 rows by default

Unnamed: 0,Match_no,Match_Between,Team_Innings,Batsman_Name,Batting_Position,Dismissal,Runs,Balls,4s,6s,Strike_Rate
911,48,India vs Australia,Australia,Travis Head,2,c Shubman Gill b Mohammed Siraj,137,120,15,4,114.167
912,48,India vs Australia,Australia,Mitchell Marsh,3,c KL Rahul b Jasprit Bumrah,15,15,1,1,100.0
913,48,India vs Australia,Australia,Steve Smith,4,lbw b Jasprit Bumrah,4,9,1,0,44.444
914,48,India vs Australia,Australia,Marnus Labuschagne,5,not out,58,110,4,0,52.727
915,48,India vs Australia,Australia,Glenn Maxwell,6,not out,2,1,0,0,200.0


In [40]:
df.tail(2)  # returns only the last 2 rows

Unnamed: 0,Match_no,Match_Between,Team_Innings,Batsman_Name,Batting_Position,Dismissal,Runs,Balls,4s,6s,Strike_Rate
914,48,India vs Australia,Australia,Marnus Labuschagne,5,not out,58,110,4,0,52.727
915,48,India vs Australia,Australia,Glenn Maxwell,6,not out,2,1,0,0,200.0


In [41]:
# Shape of the DataFrame as a tuple: (number of rows, number of columns)
df.shape

(916, 11)
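
Because `shape` is a plain tuple, it can be unpacked directly. The sketch below uses a small made-up frame, `df_small`, rather than the cricket dataset, and shows a few related ways to inspect a DataFrame.

```python
import pandas as pd

# Small illustrative frame (not the cricket dataset)
df_small = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows, n_cols = df_small.shape   # unpack the (rows, columns) tuple
print(n_rows, n_cols)
print(len(df_small))              # len() gives the number of rows
print(list(df_small.columns))     # column labels
print(df_small.dtypes)            # one dtype per column
```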

In [43]:
# To get the underlying NumPy array behind the DataFrame, use the `.values` attribute

df.values

array([[1, 'England vs New Zealand', 'England', ..., 4, 1, '94.300'],
       [1, 'England vs New Zealand', 'England', ..., 2, 0, '58.300'],
       [1, 'England vs New Zealand', 'England', ..., 4, 1, '89.500'],
       ...,
       [48, 'India vs Australia', 'Australia', ..., 1, 0, '44.444'],
       [48, 'India vs Australia', 'Australia', ..., 4, 0, '52.727'],
       [48, 'India vs Australia', 'Australia', ..., 0, 0, '200.000']],
      dtype=object)
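
The `dtype=object` in the output above appears because the columns hold different types, so NumPy falls back to a generic object array. The pandas documentation recommends the `to_numpy()` method over `.values` for this conversion. A minimal sketch with two made-up frames, `mixed` and `numeric`:

```python
import pandas as pd

# Columns of different types versus columns that are all integers
mixed = pd.DataFrame({"name": ["a", "b"], "runs": [10, 20]})
numeric = pd.DataFrame({"runs": [10, 20], "balls": [12, 18]})

print(mixed.to_numpy().dtype)    # object, since strings and integers are mixed
print(numeric.to_numpy().dtype)  # a single integer dtype
```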

Besides reading from text files and the clipboard, pandas also supports a variety of other formats, such as pickle, fwf (fixed-width format), Excel, JSON, HTML tables, HDFStore, Feather, Parquet, ORC, SAS, SPSS, Stata, SQL queries, and Google BigQuery.
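
As a small sketch of moving between formats, the example below reads CSV text and round-trips it through JSON. It uses in-memory strings via `io.StringIO` instead of real files, and the sample data is made up.

```python
import io
import pandas as pd

# Illustrative CSV text standing in for a file on disk
csv_text = "name,runs\nRoot,77\nBrook,25\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Write the frame to JSON text, then read it back
json_text = df_csv.to_json(orient="records")
df_json = pd.read_json(io.StringIO(json_text), orient="records")

print(df_json)
```

With real files, a path such as `"data.json"` (hypothetical) would take the place of the `StringIO` objects.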

#### Mini Challenge
Convert each of the following sets of lists to a pandas DataFrame with column names and an index.

```python
index = [1, 2, 3, 4, 5]
col1 = list('abcde')
col2 = list('pqrst')
```

```python
lst = [
    ["Bunny", 25],
    ["Sunny", 30],
    ["Funny", 26],
    ["Hunny", 22],
]
```

In [75]:
# Mini challenge: 1

index = [1, 2, 3, 4, 5]
col1 = list('abcde')
col2 = list('pqrst')

# Pair col1[i] with col2[i] to build one row per index label
df = pd.DataFrame([[col1[i], col2[i]] for i in range(len(index))], index=index)
df

Unnamed: 0,0,1
1,a,p
2,b,q
3,c,r
4,d,s
5,e,t


In [76]:
# Mini challenge: 2

lst = [
    ["Bunny", 25],
    ["Sunny", 30],
    ["Funny", 26],
    ["Hunny", 22],
]

df = pd.DataFrame(lst)
df

Unnamed: 0,0,1
0,Bunny,25
1,Sunny,30
2,Funny,26
3,Hunny,22
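
Both solutions leave the columns with default integer labels. One possible variant that also names the columns is sketched below. The column names `col1`, `col2`, `name`, and `age`, and the index for the second frame, are assumptions, since the challenge does not specify them.

```python
import pandas as pd

index = [1, 2, 3, 4, 5]
col1 = list('abcde')
col2 = list('pqrst')

# Challenge 1: a dict gives each list a column name
df1 = pd.DataFrame({'col1': col1, 'col2': col2}, index=index)

lst = [
    ["Bunny", 25],
    ["Sunny", 30],
    ["Funny", 26],
    ["Hunny", 22],
]

# Challenge 2: each inner list is a row, so the column names are passed separately
df2 = pd.DataFrame(lst, columns=['name', 'age'], index=[1, 2, 3, 4])

print(df1)
print(df2)
```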


### Series and its relation with DataFrame

A `Series` is a one-dimensional data structure that stores a single column of data. You can think of a Series as one column extracted from a `DataFrame`.
A Series is very similar to a NumPy array; the main difference is that it carries an index label for each observation.

**Relationship between a Series and a DataFrame**

If you extract any given column from a DataFrame, the resulting object is a Series.
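
A Series can also be built directly. The sketch below, with made-up labels `x`, `y`, `z`, shows the index labels that set it apart from a plain NumPy array, and confirms that a DataFrame column is a Series.

```python
import numpy as np
import pandas as pd

values = np.array([25, 17, 3])

# A Series wraps the values and attaches an index label to each one
s = pd.Series(values, index=['x', 'y', 'z'], name='A')

print(s['y'])         # label-based lookup
print(s.to_numpy())   # the values as a NumPy array, without labels

# Selecting one column of a DataFrame gives a Series
df_demo = pd.DataFrame({'A': values, 'B': [1, 2, 3]})
print(isinstance(df_demo['A'], pd.Series))
```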

In [78]:
# Create a 5x4 (rows x columns) matrix of random integers from 1 to 99
arr = np.random.randint(1, 100, (5, 4))
arr

array([[25, 16, 61, 59],
       [17, 10, 94, 87],
       [ 3, 28,  5, 32],
       [ 2, 14, 84,  5],
       [92, 60, 68,  8]])

In [79]:
df = pd.DataFrame(arr, columns=['A', 'B', 'C', 'D'])
df

Unnamed: 0,A,B,C,D
0,25,16,61,59
1,17,10,94,87
2,3,28,5,32
3,2,14,84,5
4,92,60,68,8


In [81]:
df['A']

0    25
1    17
2     3
3     2
4    92
Name: A, dtype: int64

In [82]:
type(df['A'])

pandas.core.series.Series

In [83]:
# Slice the Series to get specific elements (here, the first three)
df['A'][0:3]

0    25
1    17
2     3
Name: A, dtype: int64

In [84]:
# To get the numpy array, use `.values`
df['A'][0:3].values

array([25, 17,  3])
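
The slice `[0:3]` above selects by position. When the index labels are not the default 0, 1, 2, ..., it is often clearer to be explicit: `.iloc` selects by position and `.loc` selects by label. A minimal sketch with a made-up Series:

```python
import pandas as pd

s = pd.Series([25, 17, 3, 2, 92],
              index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'], name='A')

first_three = s.iloc[0:3]        # positions 0, 1, 2 (end position excluded)
mon_to_wed = s.loc['Mon':'Wed']  # labels Mon through Wed (end label included)

print(first_three)
print(mon_to_wed)
```

Note the difference in the end point: positional slices exclude it, while label slices include it.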