# Introduction to Pandas

## What is Pandas?
**Pandas** is an open-source Python library used for **data manipulation and data analysis**.  
It provides fast, flexible, and easy-to-use data structures for working with **structured data** such as tables, time series, and CSV/Excel files.

The two main data structures in Pandas are:
- **Series** → 1D labeled array (like a column in a table)
- **DataFrame** → 2D labeled table (like a spreadsheet or SQL table)
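
As a rough illustration of how the two structures relate, the sketch below builds a small Series and a small DataFrame and shows that selecting one DataFrame column gives back a Series. The fruit values and the names `calories` and `table` are made up for the example.

```python
import pandas as pd

# Illustrative values only; the names below are not part of any real dataset.
calories = pd.Series([52, 89], index=["apple", "banana"], name="calories")
table = pd.DataFrame(
    {"calories": [52, 89], "protein_g": [0.3, 1.1]},
    index=["apple", "banana"],
)

column = table["calories"]       # selecting a single column returns a Series
print(type(column).__name__)     # Series
print(column.equals(calories))   # True
```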

Pandas is widely used in:
- Data Science
- Machine Learning
- Data Engineering
- Finance
- Scientific Computing

---

## Who Invented Pandas?

Pandas was **created by Wes McKinney in 2008** while he was working at **AQR Capital Management**, a quantitative finance firm.

### Why It Was Created
Wes McKinney needed a powerful tool to:
- Clean messy financial data
- Handle missing values
- Work with time-series data efficiently
- Perform fast statistical analysis

Python lacked a mature data analysis library at the time, so he built Pandas.

---

## Meaning of the Name "Pandas"

The name **Pandas** comes from:
> **“Panel Data”** — a term used in econometrics for multi-dimensional structured datasets.

---

## Why Pandas is Important Today

Pandas is now a **core tool in the Python data ecosystem** and works closely with:
- **NumPy** (numerical computing)
- **Matplotlib / Seaborn** (visualization)
- **Scikit-learn** (machine learning)

It powers workflows from **simple CSV cleaning** to **large-scale data pipelines**.

In [127]:
import pandas as pd 
import numpy as np

# Series in Pandas

In [128]:
a = pd.Series([1 , 2, 3, 4, 5])
a

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [129]:
a.dtypes

dtype('int64')

In [130]:
a.values

array([1, 2, 3, 4, 5])

In [131]:
a.index

RangeIndex(start=0, stop=5, step=1)

## Assigning a name to the series

In [132]:
a.name = "Numbers"
print(a.name)

Numbers


## Indexing

In [133]:
a[0]

np.int64(1)

In [134]:
a[0:3]

0    1
1    2
2    3
Name: Numbers, dtype: int64

## Position-Based Indexing with `iloc[]`

In [135]:
a.iloc[3]

np.int64(4)

In [136]:
a.iloc[[1, 2, 4]]  # iloc accepts integer positions only

1    2
2    3
4    5
Name: Numbers, dtype: int64

In [137]:
a.name = 'calories'

# Assign index

In [138]:
Index = ['apple' , 'mango' , 'banana','grapes', 'Peach']
a.index = Index
a

apple     1
mango     2
banana    3
grapes    4
Peach     5
Name: calories, dtype: int64

In [139]:
a['grapes']

np.int64(4)

# Difference between the `iloc` and `loc` indexers

In [140]:
a.iloc[4]  # selects by integer position only

np.int64(5)

In [141]:
a.loc['grapes']  # selects by index label, which here is a string

np.int64(4)

# In label-based slicing, both the start and the end labels are included

In [142]:
a['banana': 'Peach']

banana    3
grapes    4
Peach     5
Name: calories, dtype: int64

In [143]:
a.loc['banana':'grapes']

banana    3
grapes    4
Name: calories, dtype: int64
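
To make the contrast concrete, here is a minimal sketch that slices the same data once by position and once by label. It rebuilds the Series under the assumed name `s` so it runs on its own.

```python
import pandas as pd

# Same values and labels as the `calories` Series above.
s = pd.Series([1, 2, 3, 4, 5],
              index=["apple", "mango", "banana", "grapes", "Peach"])

by_position = s.iloc[2:4]             # positions 2 and 3; the stop position 4 is excluded
by_label = s.loc["banana":"grapes"]   # both "banana" and "grapes" are included

print(by_position.tolist())  # [3, 4]
print(by_label.tolist())     # [3, 4]

# To reach "Peach" by position the stop has to be one past it.
print(s.iloc[2:5].equals(s.loc["banana":"Peach"]))  # True
```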

In [144]:
fruit_protein = {
    "Fruit": ["Apple", "Banana", "Orange", "Mango", "Papaya", "Guava", "Strawberry"],
    "Protein (g per 100g)": [0.3, 1.1, 0.9, 0.8, 0.5, 2.6, 0.7]
}


In [145]:
s2 = pd.Series(fruit_protein["Protein (g per 100g)"], index = fruit_protein["Fruit"])
s2

Apple         0.3
Banana        1.1
Orange        0.9
Mango         0.8
Papaya        0.5
Guava         2.6
Strawberry    0.7
dtype: float64

In [146]:
print(s2.iloc[5])

2.6


In [147]:
print(s2.loc['Apple': 'Papaya'])

Apple     0.3
Banana    1.1
Orange    0.9
Mango     0.8
Papaya    0.5
dtype: float64


In [148]:
print(s2[::-1])

Strawberry    0.7
Guava         2.6
Papaya        0.5
Mango         0.8
Orange        0.9
Banana        1.1
Apple         0.3
dtype: float64


# Conditional Selections

In [149]:
print(s2>1)

Apple         False
Banana         True
Orange        False
Mango         False
Papaya        False
Guava          True
Strawberry    False
dtype: bool


In [150]:
s2[s2>1]  # keep only the entries where the condition is True

Banana    1.1
Guava     2.6
dtype: float64

# Logical Operators

In [151]:
s2[(s2>1) & (s2<= 3)]

Banana    1.1
Guava     2.6
dtype: float64

# Not operation

In [152]:
s2[~(s2>1)]  # ~ negates the mask: values not greater than 1

Apple         0.3
Orange        0.9
Mango         0.8
Papaya        0.5
Strawberry    0.7
dtype: float64
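
The `&` and `~` operators above can be combined with `|` (or) in the same way. The sketch below assumes a fresh copy of the protein data under the name `protein`; each comparison is wrapped in parentheses because `&`, `|`, and `~` bind more tightly than `<` and `>` in Python.

```python
import pandas as pd

protein = pd.Series(
    [0.3, 1.1, 0.9, 0.8, 0.5, 2.6, 0.7],
    index=["Apple", "Banana", "Orange", "Mango", "Papaya", "Guava", "Strawberry"],
)

# OR: values below 0.5 or above 2.0
extremes = protein[(protein < 0.5) | (protein > 2.0)]

# NOT combined with AND: values that are not above 1 but are at least 0.8
mid_range = protein[~(protein > 1) & (protein >= 0.8)]

print(extremes.index.tolist())   # ['Apple', 'Guava']
print(mid_range.index.tolist())  # ['Orange', 'Mango']
```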

In [153]:
s2['Banana'] = 2.8

In [154]:
s2

Apple         0.3
Banana        2.8
Orange        0.9
Mango         0.8
Papaya        0.5
Guava         2.6
Strawberry    0.7
dtype: float64

In [155]:
data = {
    "Name":['saee' , 'kishori' , 'Charlie' , 'David' , 'Shravani'] ,
    "Age" : [21 , 20 , 25 , 26 , 23],
    "Department":['HR' , 'IT' , 'Finance' , 'Medical' , 'HR'],
    "Salary":[10000000000 , 5000000 , 300000 , 100000 , 450000]
}

In [156]:
data

{'Name': ['saee', 'kishori', 'Charlie', 'David', 'Shravani'],
 'Age': [21, 20, 25, 26, 23],
 'Department': ['HR', 'IT', 'Finance', 'Medical', 'HR'],
 'Salary': [10000000000, 5000000, 300000, 100000, 450000]}

## Creating a DataFrame

In [157]:
df = pd.DataFrame(data)
df

,Name,Age,Department,Salary
0,saee,21,HR,10000000000
1,kishori,20,IT,5000000
2,Charlie,25,Finance,300000
3,David,26,Medical,100000
4,Shravani,23,HR,450000


In [158]:
df.head()  # first 5 rows by default

,Name,Age,Department,Salary
0,saee,21,HR,10000000000
1,kishori,20,IT,5000000
2,Charlie,25,Finance,300000
3,David,26,Medical,100000
4,Shravani,23,HR,450000


In [159]:
df.tail()  # last 5 rows by default; this frame has only 5 rows, so the output matches head()

,Name,Age,Department,Salary
0,saee,21,HR,10000000000
1,kishori,20,IT,5000000
2,Charlie,25,Finance,300000
3,David,26,Medical,100000
4,Shravani,23,HR,450000


In [160]:
df.iloc[1:4]

,Name,Age,Department,Salary
1,kishori,20,IT,5000000
2,Charlie,25,Finance,300000
3,David,26,Medical,100000


In [161]:
df.loc[1:3 , ['Age' , 'Department']]

,Age,Department
1,20,IT
2,25,Finance
3,26,Medical


In [162]:
df.loc[: , ['Salary' , 'Name' , 'Age']]

,Salary,Name,Age
0,10000000000,saee,21
1,5000000,kishori,20
2,300000,Charlie,25
3,100000,David,26
4,450000,Shravani,23


In [163]:
df.loc[::-1 , ['Salary' , 'Name' , 'Age']]

,Salary,Name,Age
4,450000,Shravani,23
3,100000,David,26
2,300000,Charlie,25
1,5000000,kishori,20
0,10000000000,saee,21


In [164]:
df[['Age' , 'Department']]

,Age,Department
0,21,HR
1,20,IT
2,25,Finance
3,26,Medical
4,23,HR


In [165]:
df.sample(5, random_state=42)  # random_state seeds the random generator, so the same rows come back on every run


,Name,Age,Department,Salary
1,kishori,20,IT,5000000
4,Shravani,23,HR,450000
2,Charlie,25,Finance,300000
0,saee,21,HR,10000000000
3,David,26,Medical,100000
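
A quick way to see what the seed does is to draw the same sample twice. This is a minimal sketch on a throwaway frame named `df_demo`, which is not part of the notebook's data.

```python
import pandas as pd

df_demo = pd.DataFrame({"value": range(10)})

first = df_demo.sample(3, random_state=42)
second = df_demo.sample(3, random_state=42)  # same seed, so the same rows are drawn

print(first.equals(second))  # True
```

Leaving `random_state` out gives a different draw on each call, which is usually what you want outside of tests and tutorials.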


In [166]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        5 non-null      object
 1   Age         5 non-null      int64 
 2   Department  5 non-null      object
 3   Salary      5 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 292.0+ bytes


In [167]:
df.describe()

,Age,Salary
count,5.0,5.0
mean,23.0,2001170000.0
std,2.54951,4471482000.0
min,20.0,100000.0
25%,21.0,300000.0
50%,23.0,450000.0
75%,25.0,5000000.0
max,26.0,10000000000.0


In [168]:
df.describe(include = 'all')

,Name,Age,Department,Salary
count,5,5.0,5,5.0
unique,5,,4,
top,saee,,HR,
freq,1,,2,
mean,,23.0,,2001170000.0
std,,2.54951,,4471482000.0
min,,20.0,,100000.0
25%,,21.0,,300000.0
50%,,23.0,,450000.0
75%,,25.0,,5000000.0
max,,26.0,,10000000000.0


In [169]:
df['Age'].value_counts()  # counts how many times each distinct value appears in the "Age" column

Age
21    1
20    1
25    1
26    1
23    1
Name: count, dtype: int64
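
Every age above occurs once, so the counts are not very informative. The sketch below applies the same method to the department values, where one value repeats, and also shows the `normalize=True` option that returns fractions instead of counts. The Series name `departments` is an assumption for the example.

```python
import pandas as pd

departments = pd.Series(["HR", "IT", "Finance", "Medical", "HR"], name="Department")

counts = departments.value_counts()                # absolute counts, most frequent first
shares = departments.value_counts(normalize=True)  # fractions that sum to 1

print(counts["HR"])   # 2
print(shares["HR"])   # 0.4
```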

## Typecasting

In [170]:
df['age'] = df['Age'].astype(int)  # note: this adds a new lowercase 'age' column; 'Age' itself is left unchanged

In [171]:
df['Age'].dtype

dtype('int64')

In [172]:
df['Age'] = df['Age'].astype(float)

In [173]:
df['Age'].dtype

dtype('float64')

In [174]:
df['Age'] = df['Age'].astype(str)

In [175]:
df['Age'].dtype

dtype('O')

In [176]:
df

,Name,Age,Department,Salary,age
0,saee,21.0,HR,10000000000,21
1,kishori,20.0,IT,5000000,20
2,Charlie,25.0,Finance,300000,25
3,David,26.0,Medical,100000,26
4,Shravani,23.0,HR,450000,23


In [177]:
df.dtypes

Name          object
Age           object
Department    object
Salary         int64
age            int64
dtype: object
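
The casts above all succeed because every value is well formed. With messy input the two conversion routes behave differently, as this sketch on a hypothetical Series named `raw` suggests: `astype(int)` rejects the whole column, while `pd.to_numeric(..., errors="coerce")` replaces unparseable values with NaN.

```python
import pandas as pd

raw = pd.Series(["21", "20", "n/a", "26"])  # hypothetical messy input

# astype(int) fails for the whole column if any single value cannot be parsed.
try:
    raw.astype(int)
    strict_failed = False
except ValueError:
    strict_failed = True

# to_numeric with errors="coerce" keeps going and marks bad values as NaN.
lenient = pd.to_numeric(raw, errors="coerce")

print(strict_failed)          # True
print(lenient.isna().sum())   # 1
```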

In [199]:
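# errors='coerce' turns any value that cannot be parsed as a number into NaN
df['Salary'] = pd.to_numeric(df['Salary'], errors='coerce')
df['Salary'] = df['Salary'] + 5000  # add 5000 to every salary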


In [201]:
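# Integers are read as nanoseconds since 1970-01-01, so converting salaries this way is for illustration only
df['Salary'] = pd.to_datetime(df['Salary'])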
df['Salary']

age
(21.0, 21)   1970-01-01 00:00:10.000005
(20.0, 20)   1970-01-01 00:00:00.005005
(25.0, 25)   1970-01-01 00:00:00.000305
(26.0, 26)   1970-01-01 00:00:00.000105
(23.0, 23)   1970-01-01 00:00:00.000455
Name: Salary, dtype: datetime64[ns]
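
The dates near 1970 in the output above follow from how `pd.to_datetime` treats bare integers. The sketch below, using made-up inputs, compares the default nanosecond interpretation with an explicit `unit="s"` and with date strings, which are the more typical input.

```python
import pandas as pd

# Without a unit, integers are counted in nanoseconds from 1970-01-01.
as_nanoseconds = pd.to_datetime(pd.Series([1_700_000_000]))

# With unit="s" the same integer is counted in seconds instead.
as_seconds = pd.to_datetime(pd.Series([1_700_000_000]), unit="s")

# Date strings are parsed directly.
from_text = pd.to_datetime(pd.Series(["2024-01-15", "2024-02-01"]))

print(as_nanoseconds[0].year)        # 1970
print(as_seconds[0])                 # 2023-11-14 22:13:20
print(from_text.dt.month.tolist())   # [1, 2]
```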

In [180]:
df.isna()

,Name,Age,Department,Salary,age
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False


In [181]:
df.isna().sum()

Name          0
Age           0
Department    0
Salary        0
age           0
dtype: int64

In [182]:
df.duplicated()

0    False
1    False
2    False
3    False
4    False
dtype: bool

# Adding the current time as a Timestamp column

In [183]:
df['Timestamp'] = pd.Timestamp.now()
df

,Name,Age,Department,Salary,age,Timestamp
0,saee,21.0,HR,1970-01-01 00:00:10.000000,21,2026-01-25 17:51:55.740111
1,kishori,20.0,IT,1970-01-01 00:00:00.005000,20,2026-01-25 17:51:55.740111
2,Charlie,25.0,Finance,1970-01-01 00:00:00.000300,25,2026-01-25 17:51:55.740111
3,David,26.0,Medical,1970-01-01 00:00:00.000100,26,2026-01-25 17:51:55.740111
4,Shravani,23.0,HR,1970-01-01 00:00:00.000450,23,2026-01-25 17:51:55.740111


## `df.duplicated()`

This returns a Boolean Series with one entry per row:

True → this row repeats an earlier row

False → this row is a first occurrence (or has no duplicate at all)

In [184]:
df.duplicated()

0    False
1    False
2    False
3    False
4    False
dtype: bool

In [185]:
df.duplicated().sum()

np.int64(0)
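
Since the frame above has no repeated rows, every entry is False. The sketch below uses a small made-up frame named `people` that does contain a repeat, and shows the `keep` parameter along with `drop_duplicates()`.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["saee", "kishori", "saee"],
    "Department": ["HR", "IT", "HR"],
})

first_kept = people.duplicated()            # default keep="first": only later repeats are flagged
all_marked = people.duplicated(keep=False)  # flag every member of a duplicate group
deduped = people.drop_duplicates()          # drop the rows flagged by the default rule

print(first_kept.tolist())   # [False, False, True]
print(all_marked.tolist())   # [True, False, True]
print(len(deduped))          # 2
```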

# `nunique()` - counts the number of unique values in each column of the DataFrame

In [186]:
df.nunique()

Name          5
Age           5
Department    4
Salary        5
age           5
Timestamp     1
dtype: int64

In [187]:
df.nunique().sum()

np.int64(25)
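
Note that the sum above simply adds up the per-column counts. When the distinct values themselves are of interest, `Series.unique()` is the usual companion to `nunique()`. A minimal sketch on an assumed frame `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Department": ["HR", "IT", "Finance", "Medical", "HR"],
    "Age": [21, 20, 25, 26, 23],
})

per_column = df_demo.nunique()              # one count per column
distinct = df_demo["Department"].unique()   # the distinct values themselves

print(per_column["Department"])  # 4
print(sorted(distinct))          # ['Finance', 'HR', 'IT', 'Medical']
```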

# `df.rename(columns = {old_name : new_name} , inplace = True)` - renames columns of the DataFrame
# `inplace = True` - modifies the original DataFrame directly instead of returning a new one

In [188]:
df.rename(columns = {'Age': 'age'} , inplace = True)

In [189]:
df

,Name,age,Department,Salary,age,Timestamp
0,saee,21.0,HR,1970-01-01 00:00:10.000000,21,2026-01-25 17:51:55.740111
1,kishori,20.0,IT,1970-01-01 00:00:00.005000,20,2026-01-25 17:51:55.740111
2,Charlie,25.0,Finance,1970-01-01 00:00:00.000300,25,2026-01-25 17:51:55.740111
3,David,26.0,Medical,1970-01-01 00:00:00.000100,26,2026-01-25 17:51:55.740111
4,Shravani,23.0,HR,1970-01-01 00:00:00.000450,23,2026-01-25 17:51:55.740111
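
The frame above ends up with two columns labelled `age`, because `Age` was renamed to a name that already existed. The sketch below shows the non-`inplace` form of `rename` on a small assumed frame `staff`, with a target name (`age_years`, chosen for the example) that does not collide.

```python
import pandas as pd

staff = pd.DataFrame({"Name": ["saee", "kishori"], "Age": [21, 20]})

# Without inplace=True, rename returns a new DataFrame and leaves `staff` untouched.
renamed = staff.rename(columns={"Age": "age_years"})

print(staff.columns.tolist())      # ['Name', 'Age']
print(renamed.columns.tolist())    # ['Name', 'age_years']
print(renamed.columns.is_unique)   # True
```

Checking `columns.is_unique` after a rename is a cheap way to catch accidental duplicate labels.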


# `set_index()` - sets a particular column as the index of the DataFrame


In [190]:
# df has two columns named 'age' at this point, so each index label below is a pair of values
df.set_index('age', inplace=True)

In [191]:
df

age,Name,Department,Salary,Timestamp
"(21.0, 21)",saee,HR,1970-01-01 00:00:10.000000,2026-01-25 17:51:55.740111
"(20.0, 20)",kishori,IT,1970-01-01 00:00:00.005000,2026-01-25 17:51:55.740111
"(25.0, 25)",Charlie,Finance,1970-01-01 00:00:00.000300,2026-01-25 17:51:55.740111
"(26.0, 26)",David,Medical,1970-01-01 00:00:00.000100,2026-01-25 17:51:55.740111
"(23.0, 23)",Shravani,HR,1970-01-01 00:00:00.000450,2026-01-25 17:51:55.740111


In [192]:
df.sort_values('age' , ascending  =  True)

age,Name,Department,Salary,Timestamp
"(20.0, 20)",kishori,IT,1970-01-01 00:00:00.005000,2026-01-25 17:51:55.740111
"(21.0, 21)",saee,HR,1970-01-01 00:00:10.000000,2026-01-25 17:51:55.740111
"(23.0, 23)",Shravani,HR,1970-01-01 00:00:00.000450,2026-01-25 17:51:55.740111
"(25.0, 25)",Charlie,Finance,1970-01-01 00:00:00.000300,2026-01-25 17:51:55.740111
"(26.0, 26)",David,Medical,1970-01-01 00:00:00.000100,2026-01-25 17:51:55.740111


In [193]:
print(df.columns)


Index(['Name', 'Department', 'Salary', 'Timestamp'], dtype='object')
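
As the column list above shows, a column that has been moved into the index no longer appears in `df.columns`. The sketch below illustrates that round trip on a small assumed frame `staff`, using `reset_index()` to move the index back into an ordinary column.

```python
import pandas as pd

staff = pd.DataFrame({
    "Name": ["saee", "kishori", "Charlie"],
    "Department": ["HR", "IT", "Finance"],
})

indexed = staff.set_index("Name")   # "Name" becomes the row labels
restored = indexed.reset_index()    # and moves back to an ordinary column

print(indexed.columns.tolist())               # ['Department']
print(indexed.loc["kishori", "Department"])   # IT
print(restored.columns.tolist())              # ['Name', 'Department']
```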


## Checking how much RAM a DataFrame uses, in megabytes (MB)

`deep=True` → counts the actual memory held by text (object) columns, not just their pointers

`sum()` → adds the per-column figures into a total

`/ 1024**2` → converts bytes → MB (strictly speaking, MiB)

In [197]:
df.memory_usage(deep=True).sum() / 1024**2


np.float64(0.0008974075317382812)
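
To see what `deep=True` changes, the sketch below compares the shallow and deep totals on an assumed frame `df_demo`. For object-dtype text columns the shallow figure covers only the references to the strings; other string dtypes may report the same value either way, so the comparison is written as "at least".

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["saee", "kishori", "Charlie", "David", "Shravani"],
    "Age": [21, 20, 25, 26, 23],
})

shallow = df_demo.memory_usage().sum()         # default: text columns are not inspected
deep = df_demo.memory_usage(deep=True).sum()   # also measures the stored strings

print(deep >= shallow)    # True
print(deep / 1024**2)     # total size in MiB
```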