## 🐼 PANDAS COMPLETE COURSE (2025 EDITION)


## 1️⃣ Introduction to Pandas
🐍 What is Pandas?

Pandas (the name derives from "panel data", an econometrics term, and doubles as a play on "Python data analysis") is a Python library built on top of NumPy, designed for:

Data manipulation (cleaning, transforming, reshaping)

Data analysis (aggregating, grouping, summarizing)

Data visualization (basic charts and exploration)

Integration with many formats — CSV, Excel, SQL, JSON, etc.

It provides fast, flexible, and expressive data structures to work with labeled and relational data (like a spreadsheet or SQL table).

👉 In short:

Pandas = NumPy + SQL + Excel (combined, but in Python syntax)

### 🧠 Why Use Pandas?
🔹 Before Pandas:

Working with raw Python lists, dictionaries, or NumPy arrays for data analysis was hard because:

You needed loops for simple operations.

Data often came from CSVs or databases — not easy to clean manually.

NumPy arrays don’t store labels (column names / indices).

🔹 With Pandas:

Data is organized like a table with rows and columns.

You can easily filter, aggregate, merge, clean, and transform data.

Performance-critical internals are written in Cython and C, so vectorized operations run much faster than equivalent Python loops.
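
As a rough illustration of the contrast above, the sketch below totals sales per city twice: once with an explicit loop over dictionaries, and once with a Pandas `groupby`. The records and the names `records`, `totals_loop`, and `totals_pandas` are invented for this example.

```python
import pandas as pd

# Hypothetical sales records, invented for illustration.
records = [
    {"city": "Pune", "sales": 120},
    {"city": "Delhi", "sales": 80},
    {"city": "Pune", "sales": 60},
]

# Plain Python: accumulate totals per city with an explicit loop.
totals_loop = {}
for row in records:
    totals_loop[row["city"]] = totals_loop.get(row["city"], 0) + row["sales"]

# Pandas: the same aggregation expressed as one labeled operation.
df_sales = pd.DataFrame(records)
totals_pandas = df_sales.groupby("city")["sales"].sum()

print(totals_loop)
print(totals_pandas)
```

Both approaches produce the same totals; the Pandas version keeps the city labels attached to the result as its index.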

| Feature               | Python Lists | NumPy Arrays | Pandas                          |
| --------------------- | ------------ | ------------ | ------------------------------- |
| Labeled Data          | ❌            | ❌            | ✅                               |
| Heterogeneous Data    | ✅            | ❌            | ✅                               |
| Missing Data Handling | ❌            | ❌            | ✅                               |
| SQL-like Operations   | ❌            | ❌            | ✅                               |
| File I/O Support      | ❌            | ✅ (limited)  | ✅ (CSV, Excel, JSON, SQL, etc.) |


### 🧩 Pandas vs NumPy

| **Aspect**    | **NumPy**                               | **Pandas**                                     |
| ------------- | --------------------------------------- | ---------------------------------------------- |
| Structure     | Homogeneous (same data type)            | Heterogeneous (different types)                |
| Main Object   | ndarray                                 | Series, DataFrame                              |
| Label Support | Only numeric indices                    | Custom row/column labels                       |
| Missing Data  | Limited handling (NaN for float only)   | Full support with `NaN`, `isna()`, `fillna()`  |
| Functionality | Mathematical operations                 | Data analysis, manipulation, grouping, joining |
| Performance   | Lower overhead for numeric computation  | Some overhead from index and label handling    |
| Use Case      | Scientific computing                    | Data analysis / manipulation                   |


###### 👉 Pandas actually uses NumPy under the hood, so you often use both together.
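
A minimal sketch of how the two libraries interoperate, assuming a small Series of made-up values: `to_numpy()` hands back the underlying data as a plain `ndarray`, and a NumPy ufunc such as `np.sqrt` applied to a Series returns a Series with the labels intact.

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen so the square roots are exact.
areas = pd.Series([4.0, 9.0, 16.0], index=["a", "b", "c"])

arr = areas.to_numpy()      # plain ndarray, labels are dropped
print(type(arr), arr.dtype)

roots = np.sqrt(areas)      # NumPy ufunc on a Series keeps the index
print(roots)
```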

## Install pandas 

In [1]:
!pip install pandas



## 🧱 Core Data Structures in Pandas
Pandas introduces two primary data structures:


#### 1️⃣ Series — 1D Labeled Array

A one-dimensional array that holds data + labels (index).

In [2]:
import pandas as pd

s = pd.Series([1,2,3,4,5,6])
print(s)

s_index = pd.Series([1,2,3],index=['a','b','c'])
print(s_index)

0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
a    1
b    2
c    3
dtype: int64


###### 🔹 Key Points:

Each element has an index label.

Works like a column in Excel or a single NumPy array with labels.

Vectorized operations supported:

In [3]:
print(s * 2)   # multiplies each element
print(s + 5)   # adds 5 to each element


0     2
1     4
2     6
3     8
4    10
5    12
dtype: int64
0     6
1     7
2     8
3     9
4    10
5    11
dtype: int64


### 2️⃣ DataFrame — 2D Labeled Table

A tabular data structure with rows and columns, like an Excel sheet or SQL table.

In [4]:
import pandas as pd 

data = {
    'name':["vineeth","jagan","jagadeesh"],
    'languges':['python','c','java']
}

df = pd.DataFrame(data)
print(df)

        name languges
0    vineeth   python
1      jagan        c
2  jagadeesh     java


##### 🔹 Key Features:

Columns can be different data types (int, float, string, bool, etc.)

Supports indexing, slicing, filtering, joining, grouping, plotting, etc.

Built-in methods for reading/writing data from files (CSV, Excel, SQL, JSON...)
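
The features listed above can be sketched in a few lines. The example assumes a hypothetical `people` table with one column per dtype, filters rows with a boolean mask, and round-trips the data through CSV using an in-memory buffer (`io.StringIO`) rather than a file on disk.

```python
import io
import pandas as pd

# Hypothetical data, invented for illustration.
people = pd.DataFrame({
    "name": ["asha", "ravi", "meena"],
    "age": [31, 25, 42],
    "score": [88.5, 92.0, 79.5],
    "active": [True, False, True],
})

print(people.dtypes)                  # one dtype per column
over_30 = people[people["age"] > 30]  # row filtering with a boolean mask
print(over_30)

# CSV round trip through an in-memory text buffer.
buffer = io.StringIO()
people.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```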

#### 💡 Tips & Tricks

#### Quick Overview    

In [5]:
df.info()  # Summary of columns, dtypes, memory usage

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   name      3 non-null      object
 1   languges  3 non-null      object
dtypes: object(2)
memory usage: 180.0+ bytes


In [6]:
df.head(1) ## first row

      name languges
0  vineeth   python


In [7]:
df.tail(1) ## last row

        name languges
2  jagadeesh     java


In [8]:
df.describe()  # Statistical summary 

           name languges
count         3        3
unique        3        3
top     vineeth   python
freq          1        1


#### Inspect Data

In [9]:
df.shape  # (rows, columns)

(3, 2)

In [10]:
df.columns  # list of columns

Index(['name', 'languges'], dtype='object')

In [11]:
df.dtypes # data types of columns

name        object
languges    object
dtype: object

In [12]:
df.index # row indices

RangeIndex(start=0, stop=3, step=1)

# 🧱 PART  — CORE DATA STRUCTURES 
# Series

A Series is a one-dimensional labeled array capable of holding any data type — integers, floats, strings, objects, etc.

Think of it like an Excel column or a NumPy array with labels.


### 🔹 1. Creating a Series

In [13]:
import pandas as pd
import numpy as np

# From list
s1 = pd.Series([10, 20, 30, 40])

# From list with custom index
s2 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# From dictionary
s3 = pd.Series({'a': 10, 'b': 20, 'c': 30})

# From NumPy array
arr = np.array([5, 10, 15])
s4 = pd.Series(arr, index=['x', 'y', 'z'])

print(s1)
print("\n", s2)
print("\n", s3)
print("\n", s4)

# ✅ Notes:
# Default index starts from 0.
# You can assign custom labels.
# If you use a dictionary → keys become index.

0    10
1    20
2    30
3    40
dtype: int64

 a    10
b    20
c    30
d    40
dtype: int64

 a    10
b    20
c    30
dtype: int64

 x     5
y    10
z    15
dtype: int64
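
The constructor parameters interact in a few ways the cell above does not show. As a sketch with made-up data: passing `index=` together with a dictionary selects and reorders keys, a scalar is repeated for every label, and `dtype=` and `name=` can be set at construction time.

```python
import pandas as pd

prices = {"a": 10, "b": 20, "c": 30}   # hypothetical data

# index= selects and reorders dictionary keys; a label with no matching
# key ("x") becomes NaN, which makes the result float64.
picked = pd.Series(prices, index=["c", "a", "x"])
print(picked)

# A scalar value is repeated for every index label.
filled = pd.Series(7, index=["p", "q", "r"])
print(filled)

# dtype= and name= are set when the Series is created.
typed = pd.Series([1, 2, 3], dtype="float64", name="weights")
print(typed)
```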


#### 🔹 2. Accessing Elements (Indexing & Slicing)

In [14]:
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

print(s['b'])       # Access by label → 2
print(s.iloc[1])    # Access by position → 2 (s[1] on a labeled index is deprecated)
print(s[['a', 'd']])# Multiple labels
print(s[1:4])       # Slice by position → b,c,d


2
2
a    1
d    4
dtype: int64
b    2
c    3
d    4
dtype: int64



✅ Tip:

.loc[] → by label

.iloc[] → by position

In [15]:
print(s.loc['a':'c'])   # label-based slice (inclusive)
print(s.iloc[1:4])      # position-based slice (exclusive)

a    1
b    2
c    3
dtype: int64
b    2
c    3
d    4
dtype: int64
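
The distinction matters most when the index labels are themselves integers. A small sketch, assuming a Series whose labels (10, 20, 30) do not match the positions (0, 1, 2):

```python
import pandas as pd

# Integer labels that differ from positions (hypothetical values).
s_int = pd.Series([100, 200, 300], index=[10, 20, 30])

by_label = s_int.loc[20]         # looks up the label 20
by_position = s_int.iloc[1]      # looks up the second element
label_slice = s_int.loc[10:20]   # label slice: both ends included
pos_slice = s_int.iloc[0:2]      # position slice: end excluded

print(by_label, by_position)
print(label_slice)
print(pos_slice)
```

Being explicit with `.loc[]` or `.iloc[]` avoids any ambiguity about whether an integer means a label or a position.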


### 🔹 3. Vectorized Operations

#### Pandas performs element-wise operations automatically (NumPy-style broadcasting).

In [16]:
s = pd.Series([1, 2, np.nan, 4], index=['a', 'b', 'c', 'd'])

print(s + 10)        # Add 10 to all elements
print(s * 2)         # Multiply all elements by 2
print(s ** 2)        # Square
print(s + s)         # Element-wise addition

a    11.0
b    12.0
c     NaN
d    14.0
dtype: float64
a    2.0
b    4.0
c    NaN
d    8.0
dtype: float64
a     1.0
b     4.0
c     NaN
d    16.0
dtype: float64
a    2.0
b    4.0
c    NaN
d    8.0
dtype: float64


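✅ Note:
Element-wise arithmetic propagates missing values: any operation involving NaN returns NaN, as at index c in the output above. Reductions such as sum() and mean() skip NaN by default.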
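
A short sketch separating the two behaviors, using the same values as the cell above (the variable names are chosen for this example): element-wise arithmetic leaves NaN in place, while a reduction skips it unless `skipna=False` is passed.

```python
import numpy as np
import pandas as pd

s_nan = pd.Series([1, 2, np.nan, 4], index=["a", "b", "c", "d"])

doubled = s_nan * 2                # element-wise: 'c' stays NaN
total = s_nan.sum()                # reduction: NaN is skipped by default
strict = s_nan.sum(skipna=False)   # reduction with skipping disabled

print(doubled)
print(total, strict)
```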

#### 🔹 4. Handling Null Values

In [17]:
s = pd.Series([1, np.nan, 3, np.nan, 5])

print(s.isna())      # Detect missing
print(s.notna())     # Detect non-missing
print(s.fillna(0))   # Replace NaN with 0
print(s.dropna())    # Remove NaN entries
print(s.mean())      # NaN-aware mean
# ✅ Pandas reductions such as mean() and sum() skip NaN by default (skipna=True).

0    False
1     True
2    False
3     True
4    False
dtype: bool
0     True
1    False
2     True
3    False
4     True
dtype: bool
0    1.0
1    0.0
2    3.0
3    0.0
4    5.0
dtype: float64
0    1.0
2    3.0
4    5.0
dtype: float64
3.0
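
Replacing NaN with a constant is only one option. As a sketch on the same values, here are three other common strategies; which one is appropriate depends on the data, and none of them is prescribed by this course.

```python
import numpy as np
import pandas as pd

s_gaps = pd.Series([1, np.nan, 3, np.nan, 5])

forward = s_gaps.ffill()                  # carry the last valid value forward
with_mean = s_gaps.fillna(s_gaps.mean())  # fill with the NaN-aware mean (3.0)
linear = s_gaps.interpolate()             # linear interpolation between neighbors

print(forward.tolist())
print(with_mean.tolist())
print(linear.tolist())
```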


### 🔹 5. Descriptive Statistics

In [18]:
s = pd.Series([10, 20, 30, 40, 50])

print(s.mean())      # Average
print(s.median())    # Middle value
print(s.std())       # Standard deviation
print(s.min(), s.max())
print()
print(s.describe())  # Quick summary
# ✅ describe() gives a mini report of count, mean, std, min, quartiles, and max

30.0
30.0
15.811388300841896
10 50

count     5.000000
mean     30.000000
std      15.811388
min      10.000000
25%      20.000000
50%      30.000000
75%      40.000000
max      50.000000
dtype: float64
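
One detail behind the `std` value above: Pandas computes the sample standard deviation (`ddof=1`, dividing by n - 1), whereas NumPy's `np.std` defaults to the population form (`ddof=0`). For these five values:

$$
s = \sqrt{\frac{(-20)^2 + (-10)^2 + 0^2 + 10^2 + 20^2}{5 - 1}} = \sqrt{250} \approx 15.811
$$

A minimal sketch comparing the two defaults:

```python
import numpy as np
import pandas as pd

s_stats = pd.Series([10, 20, 30, 40, 50])

sample_std = s_stats.std()               # pandas default: ddof=1
population_std = s_stats.std(ddof=0)     # divide by n instead of n - 1
numpy_std = np.std(s_stats.to_numpy())   # NumPy default: ddof=0

print(sample_std, population_std, numpy_std)
```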


### 🔹 6. Useful Attributes

In [19]:
print(s.values)     # Underlying NumPy array
print(s.index)      # Index labels
print(s.dtype)      # Data type
print(s.shape)      # Shape (tuple)
print(s.ndim)       # Dimension (1)
print(s.size)       # Number of elements
print(s.name)       # Series name (can assign)


[10 20 30 40 50]
RangeIndex(start=0, stop=5, step=1)
int64
(5,)
1
5
None


In [20]:
s.name = "Sales"
print(s)

0    10
1    20
2    30
3    40
4    50
Name: Sales, dtype: int64


## 🔹 7. Applying Functions

In [21]:
s = pd.Series([1, 2, 3, 4, 5])

print(s.apply(lambda x: x ** 2))   # Using apply
print(s.map(lambda x: x * 10))     # Using map
# ✅ Both apply() and map() apply a function element-wise.

0     1
1     4
2     9
3    16
4    25
dtype: int64
0    10
1    20
2    30
3    40
4    50
dtype: int64
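
Two related points, sketched with made-up data: `map()` also accepts a dictionary and treats it as a lookup table (values without an entry become NaN), and for plain arithmetic a vectorized expression gives the same result as `apply()` without calling a Python function per element.

```python
import pandas as pd

# Hypothetical language codes; "rb" deliberately has no dictionary entry.
codes = pd.Series(["py", "js", "rb"])
names = codes.map({"py": "Python", "js": "JavaScript"})
print(names)

nums = pd.Series([1, 2, 3, 4, 5])
squared_apply = nums.apply(lambda x: x ** 2)   # one Python call per element
squared_vec = nums ** 2                        # vectorized equivalent
print(squared_apply.equals(squared_vec))
```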


## 🔹 8. Unique Values & Frequency

In [22]:
s = pd.Series(['apple', 'banana', 'apple', 'orange', 'banana', 'apple'])

print(s.unique())         # Unique values
print(s.nunique())        # Number of unique values
print(s.value_counts())   # Frequency count


['apple' 'banana' 'orange']
3
apple     3
banana    2
orange    1
Name: count, dtype: int64
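
`value_counts()` can also report proportions instead of raw counts. A sketch on the same fruit data, using the `normalize=True` parameter:

```python
import pandas as pd

fruits = pd.Series(['apple', 'banana', 'apple', 'orange', 'banana', 'apple'])

counts = fruits.value_counts()                # sorted, most frequent first
shares = fruits.value_counts(normalize=True)  # fractions that sum to 1

print(counts.index[0])   # most frequent value
print(shares)
```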


## 🔹 9. Boolean Filtering

In [24]:
s = pd.Series([10, 20, 30, 40, 50])

print(s[s > 25])      # Filter values greater than 25
print(s[(s > 20) & (s < 50)])  # Combine conditions
# ✅ Boolean masks let you select data conditionally — like WHERE in SQL.

2    30
3    40
4    50
dtype: int64
2    30
3    40
dtype: int64
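
Beyond `&`, masks can be combined with `|` (or) and negated with `~`, and helper methods cover common cases. A sketch on the same values; note that each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons.

```python
import pandas as pd

s_vals = pd.Series([10, 20, 30, 40, 50])

either_end = s_vals[(s_vals < 20) | (s_vals > 40)]   # OR of two conditions
not_large = s_vals[~(s_vals > 25)]                   # NOT: negate a mask
members = s_vals[s_vals.isin([20, 40])]              # membership test
in_range = s_vals[s_vals.between(20, 40)]            # inclusive on both ends

print(either_end.tolist(), not_large.tolist())
print(members.tolist(), in_range.tolist())
```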


## 🔹 10. Combining Series (Alignment)
When combining Series with different indices, Pandas aligns automatically by label.

In [25]:
s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s2 = pd.Series([5, 15, 25], index=['b', 'c', 'd'])

print(s1 + s2)
# ✅ Automatic label alignment — unmatched labels → NaN.

a     NaN
b    25.0
c    45.0
d     NaN
dtype: float64
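
When NaN for unmatched labels is not the desired result, two common alternatives are the `add()` method with `fill_value`, which treats a missing side as the given value, and dropping the unmatched labels afterwards. A sketch with the same two Series:

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s2 = pd.Series([5, 15, 25], index=['b', 'c', 'd'])

filled_sum = s1.add(s2, fill_value=0)   # a missing side counts as 0
common_only = (s1 + s2).dropna()        # keep only labels present in both

print(filled_sum)
print(common_only)
```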


### 🧩 Summary: Series Parameters & Attributes

| Category            | Attribute / Method                                                 | Description                      |
| ------------------- | ------------------------------------------------------------------ | -------------------------------- |
| **Creation Params** | `data`, `index`, `dtype`, `name`, `copy`                           | Define input, labels, type, name |
| **Attributes**      | `.values`, `.index`, `.dtype`, `.shape`, `.size`, `.ndim`, `.name` | Metadata of the Series           |
| **Null Ops**        | `.isna()`, `.notna()`, `.fillna()`, `.dropna()`                    | Handle missing data              |
| **Math & Stats**    | `.mean()`, `.sum()`, `.std()`, `.min()`, `.max()`, `.describe()`   | Numeric operations               |
| **Vectorization**   | `+`, `-`, `*`, `/`, `**`                                           | Element-wise operations          |
| **Apply Functions** | `.apply()`, `.map()`                                               | Custom transformations           |
| **Unique / Count**  | `.unique()`, `.nunique()`, `.value_counts()`                       | Frequency analysis               |
| **Indexing**        | `.loc[]`, `.iloc[]`, direct `[ ]`                                  | Label vs position access         |
| **Head/Tail**       | `.head(n)`, `.tail(n)`                                             | Peek at first/last rows          |
