
> *Pandas is a simple library. It helps us efficiently perform operations on the rows and columns of tabular data such as a CSV file: a single row (or column) is a Series, and multiple rows and columns form a DataFrame.*


## **Why Use Pandas?**

| Task | Without Pandas | With Pandas |
| :--- | :--- | :--- |
| Load a CSV | `open()` + loops | `pd.read_csv()` |
| Filter rows | Custom loop logic | `df[df["col"] > 5]` |
| Group & summarize | Manual aggregation | `df.groupby("key").sum()` |
| Merge two datasets | Nested loops | `pd.merge()` |

Pandas saves time, reduces code, and increases readability.
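
As a rough sketch of the four rows in the table above, the snippet below runs each operation on a tiny dataset. The column names (`city`, `product`, `units`, `price`) and the values are made up for illustration, and `io.StringIO` stands in for a real CSV file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file on disk.
csv_text = """city,product,units
Delhi,pen,3
Delhi,book,8
Mumbai,pen,6
Mumbai,book,9
"""

# Load a CSV
df = pd.read_csv(io.StringIO(csv_text))

# Filter rows: keep only rows where units > 5
big_orders = df[df["units"] > 5]

# Group & summarize: total units per city
units_per_city = df.groupby("city")["units"].sum()

# Merge two datasets on a shared column (prices is a hypothetical lookup table)
prices = pd.DataFrame({"product": ["pen", "book"], "price": [10, 50]})
merged = pd.merge(df, prices, on="product")

print(big_orders)
print(units_per_city)
print(merged)
```

Each step is a single expression, where plain Python would need a loop plus some bookkeeping.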

---

## **Pandas vs. Excel vs. SQL vs. NumPy**

| Tool | Strengths | Weaknesses |
| :--- | :--- | :--- |
| **Excel** | Easy UI, great for small data | Slow, manual, not scalable |
| **SQL** | Efficient querying of big data | Not ideal for transformation logic |
| **NumPy** | Fast, low-level array operations | No labels, harder for tabular data |
| **Pandas** | Label-aware, fast, flexible | Slightly steep learning curve |

---



## **What is Pandas?** 🐼


Pandas is a powerful, open-source Python library used for **data manipulation, cleaning, and analysis**. It provides two main data structures:

* **Series**: A one-dimensional labeled array.
* **DataFrame**: A two-dimensional labeled table (like an Excel sheet: **just rows and columns**).

Pandas makes working with structured data fast, expressive, and flexible.

> If you are working with tables, spreadsheets, or CSVs in Python, Pandas is your best friend.

---

📍 **Advantages of a Series, Each with a Demonstration**

1. Fast vectorized operations
2. Automatic index alignment
3. Handles missing data (`NaN`)
4. Label and position access
5. Integrates seamlessly with DataFrames

In [20]:
import pandas as pd

data_dict = {"a": 10, "b": 20, "c": 30, "d": 40}
series = pd.Series(data_dict)  # dict keys become the index labels
print("--- Original Series ---")
print(series)

"""1 - Vectorized Operations"""

# Without Pandas: a plain dict needs an explicit loop (here, a dict comprehension)
dict_multiplied = {key: value * 2 for key, value in data_dict.items()}
print("\nRequires a loop", dict_multiplied,"\n")

# With Pandas: the multiplication is applied to every element at once (vectorized)
series_multiplied = series * 2
print("Vectorization\n", series_multiplied)

--- Original Series ---
a    10
b    20
c    30
d    40
dtype: int64

Requires a loop {'a': 20, 'b': 40, 'c': 60, 'd': 80} 

Vectorization
 a    20
b    40
c    60
d    80
dtype: int64
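
Vectorization is not limited to multiplication. As a small sketch reusing the same values, arithmetic with a scalar and comparisons also apply element by element, and the boolean result of a comparison can be used to filter the Series. The variable names here are only for illustration.

```python
import pandas as pd

series = pd.Series({"a": 10, "b": 20, "c": 30, "d": 40})

# Arithmetic with a scalar is applied to every element
plus_five = series + 5

# A comparison returns a boolean Series with the same index
is_large = series > 25

# A boolean Series can be used as a mask to keep matching elements
large_only = series[is_large]

print(plus_five)
print(is_large)
print(large_only)
```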



In [22]:
"""2 - Automatic Index Alignment"""

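# Default index: integer labels 0, 1, 2, ...
s = pd.Series([10, 20, 30, 40])
print(s, "\n")

# Custom index: these labels are what pandas aligns on in operations between Series
s = pd.Series([10, 20, 30, 40], index=["Rajat", "Simba", "Coco", "Joe"])
print(s)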

0    10
1    20
2    30
3    40
dtype: int64 

Rajat    10
Simba    20
Coco     30
Joe      40
dtype: int64
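
The cell above shows how an index is set. Alignment itself shows up when two Series are combined. The sketch below uses two hypothetical Series (`week1`, `week2`) whose labels are in a different order and only partly overlap. Values are matched by label, and a label missing from either side produces `NaN`.

```python
import pandas as pd

# Hypothetical scores; the labels overlap only partly and are in a different order
week1 = pd.Series([10, 20, 30], index=["Rajat", "Simba", "Coco"])
week2 = pd.Series([1, 2, 3], index=["Coco", "Rajat", "Joe"])

# Addition matches values by label, not by position
total = week1 + week2
print(total)

# Series.add with fill_value treats a label missing on one side as 0
total_filled = week1.add(week2, fill_value=0)
print(total_filled)
```

Here `Rajat` gets 10 + 2 and `Coco` gets 30 + 1, even though they sit at different positions in the two Series.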


In [23]:
"""3 - Handles Missing Data (NaN)"""

# None is stored as NaN, which is why the dtype becomes float64
data_with_missing = pd.Series([100, 200, None, 400])
print("Series with a missing value:\n", data_with_missing)

# Reductions such as sum() skip NaN by default (skipna=True)
total_sum = data_with_missing.sum()
print(f"\nThe sum of the series is: {total_sum}")

Series with a missing value:
 0    100.0
1    200.0
2      NaN
3    400.0
dtype: float64

The sum of the series is: 700.0
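
Skipping `NaN` is a default and can be turned off. As a minimal sketch on the same values, the snippet below counts the missing entries, shows what happens when skipping is disabled, and fills the gap explicitly. Filling with 0 is only one possible choice and depends on what the data means.

```python
import pandas as pd

values = pd.Series([100, 200, None, 400])

# Count the missing entries
missing_count = values.isna().sum()

# With skipna=False the NaN propagates into the result
strict_sum = values.sum(skipna=False)

# mean() also skips NaN, so it averages the three present values
mean_present = values.mean()

# Replace NaN explicitly before summing
filled_sum = values.fillna(0).sum()

print(missing_count, strict_sum, mean_present, filled_sum)
```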


In [24]:
"""4 - Label and Position Access"""

access_series = pd.Series([10, 20, 30], index=["first", "second", "third"])
print("Series for access demonstration:\n", access_series)

# Access by label (like a dictionary)
label_access = access_series["second"]
print(f"\nAccess by label 'second': {label_access}")

# Access by integer position (like a list) using .iloc
position_access = access_series.iloc[0]
print(f"Access by position 0 (.iloc[0]): {position_access}")

Series for access demonstration:
 first     10
second    20
third     30
dtype: int64

Access by label 'second': 20
Access by position 0 (.iloc[0]): 10
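
The difference between label and position matters most when the labels are themselves integers. The sketch below uses a hypothetical Series with labels 10, 20, 30 and contrasts `.loc` (label-based) with `.iloc` (position-based), including how each one slices.

```python
import pandas as pd

# Hypothetical Series whose integer labels do not match the positions
s = pd.Series(["a", "b", "c"], index=[10, 20, 30])

by_label = s.loc[10]        # label 10
by_position = s.iloc[0]     # position 0

label_slice = s.loc[10:20]      # label slice: both endpoints are included
position_slice = s.iloc[0:1]    # position slice: the stop is excluded

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

Using `.loc` and `.iloc` explicitly avoids any ambiguity about which kind of lookup is meant.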


In [27]:
"""5 - Integrates Seamlessly with DataFrames"""

names = pd.Series(["Alice", "Bob", "Charlie"], name="Name")
print(names,"\n")
ages = pd.Series([25, 30, 35], name="Age")
print(ages,"\n")
cities = pd.Series(["New York", "Los Angeles", "Chicago"], name="City")
print(cities,"\n")

# Create a DataFrame from Series
df = pd.DataFrame({"Name": names, "Age": ages, "City": cities})
print("DataFrame created from multiple Series:\n")
print(df)

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object 

0    25
1    30
2    35
Name: Age, dtype: int64 

0       New York
1    Los Angeles
2        Chicago
Name: City, dtype: object 

DataFrame created from multiple Series:

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
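
The relationship also works in the other direction: pulling a single column or a single row out of a DataFrame gives back a Series. The sketch below builds the DataFrame with `pd.concat(..., axis=1)`, an alternative to the dict form above that takes column labels from each Series' `name`.

```python
import pandas as pd

names = pd.Series(["Alice", "Bob", "Charlie"], name="Name")
ages = pd.Series([25, 30, 35], name="Age")

# Each Series' name becomes the column label
df = pd.concat([names, ages], axis=1)

# Selecting one column returns a Series
age_column = df["Age"]
print(type(age_column).__name__, age_column.name)

# Selecting one row returns a Series indexed by the column names
first_row = df.iloc[0]
print(first_row)
```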
