🧠 WHY PANDAS EXISTS
* NumPy arrays are powerful, but they have one big limitation: every element must be the same data type. In real data you have names (strings), ages (integers), salaries (floats), cities (strings), all mixed together in one table.
* That's what Pandas solves. It gives you a DataFrame: think of it as a supercharged Excel spreadsheet that you can manipulate with code. Rows are samples, columns are features, and each column can have its own data type.
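
To make the dtype limitation concrete, here is a small sketch. The names `mixed` and `df_demo` and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# NumPy: mixing text and numbers forces one common dtype for the whole array
mixed = np.array(["Tony", 30, 9.2])
print(mixed.dtype)        # a Unicode string dtype: the numbers were turned into text

# Pandas: each column keeps its own dtype
df_demo = pd.DataFrame({"name": ["Tony"], "age": [30], "salary": [85000.5]})
print(df_demo.dtypes)     # age stays an integer column, salary stays a float column
```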

🧠 TOPIC 1: Series and DataFrame

Pandas has two core data structures. A Series is one column: a 1D labelled array. A DataFrame is the full table: multiple Series stuck together.

In [None]:
import pandas as pd

# Series: one column
s = pd.Series([10, 20, 30, 40, 50])
print(s)

# DataFrame: full table
data = {
    "name": ["Tony", "Pepper", "Rhodey", "Happy"],
    "age": [30, 28, 35, 40],
    "cgpa": [9.2, 8.5, 7.8, 6.9],
    "branch": ["AIML", "CSE", "ECE", "MECH"]
}

df = pd.DataFrame(data)
print(df)


The numbers on the left (0, 1, 2, 3) are the index: automatic row labels that Pandas assigns when you don't provide your own.
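
A quick way to see the index and the "DataFrame is made of Series" idea for yourself. This is only a sketch; `df_demo` and `s_demo` are illustrative names.

```python
import pandas as pd

df_demo = pd.DataFrame({"name": ["Tony", "Pepper"], "age": [30, 28]})

print(df_demo.index)           # the automatic labels: a RangeIndex starting at 0
print(type(df_demo["age"]))    # pulling out one column gives you a Series

# You can supply your own labels instead of 0, 1, 2, ...
s_demo = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s_demo["b"])             # 20
```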

🧠 TOPIC 2: Exploring a DataFrame

These are the first things you run on any new dataset, every single time:

In [None]:
print(df.shape)        # (4, 4): 4 rows, 4 columns
print(df.dtypes)       # data type of each column
print(df.columns)      # column names
print(df.head(2))      # first 2 rows (default is 5)
print(df.tail(2))      # last 2 rows
df.info()              # summary: types, non-null counts (prints by itself and returns None)
print(df.describe())   # statistics: mean, std, min, max for numeric columns

**df.describe()** is powerful: in one line it gives you count, mean, std, min, 25th percentile, median, 75th percentile, and max for every numeric column. That's your first look at the data's personality.
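
The result of describe() is itself a DataFrame, so you can pull single statistics out of it. A minimal sketch, reusing the age and cgpa values from Topic 1 (the names `df_demo` and `summary` are just for this example):

```python
import pandas as pd

df_demo = pd.DataFrame({"age": [30, 28, 35, 40], "cgpa": [9.2, 8.5, 7.8, 6.9]})
summary = df_demo.describe()

print(summary.index.tolist())       # the statistic names, one per row
print(summary.loc["50%", "age"])    # the "50%" row is the median: 32.5
print(df_demo["age"].median())      # same value computed directly
print(summary.loc["mean", "age"])   # 33.25
```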

🧠 TOPIC 3: Selecting Data

In [None]:
# Select one column: returns a Series
print(df["name"])

# Select multiple columns: returns a DataFrame
print(df[["name", "cgpa"]])

# Select rows by index number: iloc (integer location)
print(df.iloc[0])       # first row
print(df.iloc[1:3])     # rows 1 and 2
print(df.iloc[0, 2])    # row 0, column 2 → 9.2

# Select rows by label: loc
print(df.loc[0])                        # row with index 0
print(df.loc[0, "name"])               # row 0, column "name" → Tony
print(df.loc[:, "cgpa"])               # all rows, cgpa column

**iloc** vs **loc**: this trips everyone up at first. *iloc* uses integer positions, like NumPy. *loc* uses labels: row index labels and column names. With the default integer index they look the same, but they behave differently once the index has custom labels. Slices differ too: iloc[1:3] stops before position 3, while loc[1:3] includes the row labelled 3.
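
The difference is easiest to see with a custom index. Here is a small sketch that uses names as row labels; `df_demo`, `by_label`, and `by_pos` are illustrative names, not part of the notebook above.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"age": [30, 28, 35], "cgpa": [9.2, 8.5, 7.8]},
    index=["Tony", "Pepper", "Rhodey"],   # custom row labels
)

print(df_demo.loc["Pepper", "cgpa"])   # lookup by label: 8.5
print(df_demo.iloc[1, 1])              # same cell by position: 8.5

by_label = df_demo.loc["Tony":"Pepper"]   # label slice includes the end label
by_pos = df_demo.iloc[0:1]                # position slice excludes the end position
print(len(by_label), len(by_pos))         # 2 1
```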

🧠 TOPIC 4: Filtering

This is where Pandas becomes powerful. You can filter rows based on conditions, just like a SQL WHERE clause:

In [None]:
# Students with cgpa above 8
print(df[df["cgpa"] > 8])

# Students in AIML branch
print(df[df["branch"] == "AIML"])

# Multiple conditions: use & for AND, | for OR, and wrap each condition in parentheses
print(df[(df["cgpa"] > 7) & (df["age"] < 35)])

# NOT condition
print(df[df["branch"] != "MECH"])

df["cgpa"] > 8 returns a Series of True/False values, one per row. When you pass that Series inside df[...], Pandas keeps only the rows where the value is True. This is called boolean indexing, and it shows up constantly when filtering data for ML.
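
It helps to look at the True/False Series on its own before using it as a filter. A short sketch with the Topic 1 data; `df_demo` and `mask` are illustrative names.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "name": ["Tony", "Pepper", "Rhodey", "Happy"],
    "cgpa": [9.2, 8.5, 7.8, 6.9],
    "branch": ["AIML", "CSE", "ECE", "MECH"],
})

mask = df_demo["cgpa"] > 8
print(mask.tolist())      # [True, True, False, False]

print(df_demo[mask])      # rows where the mask is True
print(df_demo[~mask])     # ~ flips the mask: rows where cgpa is NOT above 8

# isin() checks membership in a list of allowed values
print(df_demo[df_demo["branch"].isin(["AIML", "CSE"])])
```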

🧠 TOPIC 5: Adding, Modifying, Dropping

In [9]:
# Add a new column
df["passing"] = df["cgpa"] >= 7
df["marks"] = df["cgpa"] * 10

# Modify a column
df["age"] = df["age"] + 1

# Drop a column or a row (drop returns a new DataFrame, so assign it back)
df = df.drop("marks", axis=1)      # axis=1 means column
df = df.drop(0, axis=0)            # axis=0 means row: drops the row with index label 0

# Rename columns
df = df.rename(columns={"name": "student_name", "cgpa": "gpa"})

🧠 TOPIC 6: Handling Missing Values
* Real-world data is usually dirty, with missing values scattered through it. Expect this in almost every ML project:

In [11]:
#import numpy as np

data = {
    "name": ["Tony", "Pepper", "Rhodey", "Happy"],
    "age": [30, None, 35, 40],
    "cgpa": [9.2, 8.5, None, 6.9]
}

df = pd.DataFrame(data)

print(df.isnull())           # True where value is missing
print(df.isnull().sum())     # count of missing values per column

# Drop rows with any missing value
df_dropped = df.dropna()

# Fill missing values
df_filled = df.fillna(0)                        # fill with 0
df_filled2 = df.fillna(df.mean(numeric_only=True))   # fill with column mean (common in ML)

    name    age   cgpa
0  False  False  False
1  False   True  False
2  False  False   True
3  False  False  False
name    0
age     1
cgpa    1
dtype: int64


Filling with the column mean is a common default in ML: you don't lose the row, and you replace the gap with a typical value for that column. When a column has outliers or is heavily skewed, the median is usually the safer choice.
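
You can also choose a different fill value per column by passing a dictionary to fillna(). The sketch below reuses the data from this topic; which statistic goes with which column is an arbitrary choice for illustration, and `fill_values` and `df_demo` are made-up names.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "name": ["Tony", "Pepper", "Rhodey", "Happy"],
    "age": [30, None, 35, 40],
    "cgpa": [9.2, 8.5, None, 6.9],
})

# One fill value per column: median for age, mean for cgpa
fill_values = {
    "age": df_demo["age"].median(),    # median of 30, 35, 40 is 35.0
    "cgpa": df_demo["cgpa"].mean(),    # mean of 9.2, 8.5, 6.9
}
df_filled = df_demo.fillna(fill_values)

print(df_filled)
print(df_demo.isnull().sum().sum())    # original is unchanged: still 2 missing values
```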

🧠 TOPIC 7: Loading Real Data

In [None]:
# Load a CSV file
df = pd.read_csv("students.csv")

# Save a DataFrame to CSV
df.to_csv("output.csv", index=False)   # index=False means don't save row numbers
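
If you don't have a students.csv on disk yet, you can try the same round trip in memory. This is a sketch under that assumption: the CSV text is made up, and io.StringIO stands in for a real file.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a file on disk
csv_text = "name,age,cgpa\nTony,30,9.2\nPepper,28,8.5\n"
df_demo = pd.read_csv(io.StringIO(csv_text))   # read_csv accepts a path or a file-like object

# Write without the index, then read it back
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
df_back = pd.read_csv(io.StringIO(buffer.getvalue()))
print(df_back.equals(df_demo))                 # True: the round trip preserved the table

# Without index=False the row labels are saved as an extra unnamed column
with_index = io.StringIO()
df_demo.to_csv(with_index)
reloaded = pd.read_csv(io.StringIO(with_index.getvalue()))
print(reloaded.columns.tolist())
```

The last print shows why index=False is usually what you want when the index is just the default row numbers.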

Task 1: DataFrame Basics (20 mins)

In [None]:
import pandas as pd
data = {
    "Name":["Tony","Sherlock","David","Peter"],
    "Age":[20,19,20,21],
    "cgpa":[9,9.2,7.8,8.8],
    "Branch":["aiml","ai","core","aiml"],
    "city":["Los angeles","Washington","Paris","Hawai"]
}
df = pd.DataFrame(data)
print(df.shape)
print(df.dtypes)
print(df.describe())

print(df.loc[:, ["Name", "cgpa"]])   # use a list of column names to select several columns
print(df.tail(3))
print(df.iloc[2])

DataFrame.tail(n) returns the last n rows of a DataFrame (Series has the same method). The default is n=5, which makes it a quick way to check how a dataset ends.

Task 2: Filtering and Adding Columns (25 mins)

In [1]:
import pandas as pd
import utils   # helper module providing get_grade() and scholarship() (not shown here)
data = {
    "Name":["Tony","Sherlock","David","Peter"],
    "Age":[20,19,20,22],
    "cgpa":[9,9.2,6.8,8.8],
    "Branch":["aiml","ai","core","aiml"],
    "city":["Los angeles","Washington","Paris","Hawai"]
}
df = pd.DataFrame(data)

print(df[df["cgpa"]>8])
print(df[df["Branch"]=="aiml"])
print(df[(df["cgpa"]>7) & (df["Age"]<22)])

# Loop version, kept for comparison:
# Grade = []
# scholarship = []
# for cgpa in df["cgpa"]:
#     Grade.append(utils.get_grade(cgpa))
#     scholarship.append(utils.scholarship(cgpa))

# Pandas way: apply a function to every value in a column
df["Grade"] = df["cgpa"].apply(utils.get_grade)
df["Scholarship"] = df["cgpa"].apply(utils.scholarship)

print(df.Grade)
print(df.Scholarship)

       Name  Age  cgpa Branch         city
0      Tony   20   9.0   aiml  Los angeles
1  Sherlock   19   9.2     ai   Washington
3     Peter   22   8.8   aiml        Hawai
    Name  Age  cgpa Branch         city
0   Tony   20   9.0   aiml  Los angeles
3  Peter   22   8.8   aiml        Hawai
       Name  Age  cgpa Branch         city
0      Tony   20   9.0   aiml  Los angeles
1  Sherlock   19   9.2     ai   Washington
0    Outstanding
1    Outstanding
2           Good
3      Excellent
Name: Grade, dtype: str
0     True
1     True
2    False
3    False
Name: Scholarship, dtype: bool


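.apply() is one of the most used Pandas methods. It takes a function and runs it on every value in a column. No loop, no empty list, no append. One line.
From now on, when you want to transform a column with your own Python function, think .apply() first. For plain arithmetic or comparisons, operating on the whole column directly (like df["cgpa"] * 10) is simpler and faster.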
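
The utils module isn't shown in this notebook, so here is a hypothetical stand-in you can run. The thresholds in get_grade and scholarship are guesses chosen only to reproduce the output printed above; the real rules may differ.

```python
import pandas as pd

def get_grade(cgpa):
    """Hypothetical grading rule; the cut-offs are assumptions."""
    if cgpa >= 9:
        return "Outstanding"
    if cgpa >= 8:
        return "Excellent"
    if cgpa >= 6:
        return "Good"
    return "Needs improvement"

def scholarship(cgpa):
    """Hypothetical rule: scholarship for a cgpa of 9 or above."""
    return cgpa >= 9

df_demo = pd.DataFrame({"cgpa": [9, 9.2, 6.8, 8.8]})

# apply() calls the function once per value in the column
df_demo["Grade"] = df_demo["cgpa"].apply(get_grade)
df_demo["Scholarship"] = df_demo["cgpa"].apply(scholarship)

# A simple comparison does not need apply(): compare the whole column at once
df_demo["Scholarship_vec"] = df_demo["cgpa"] >= 9

print(df_demo)
```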

Task 3: Missing Values + CSV (30 mins)

In [24]:
import pandas as pd
import numpy as np

data = {
    "name": ["Tony", "Pepper", "Rhodey", "Happy", "Natasha"],
    "age": [19, np.nan, 21, 20, np.nan],
    "cgpa": [9.2, 8.5, np.nan, 7.1, 8.8],
    "city": ["Mumbai", "Delhi", None, "Chennai", "Pune"]
}

df = pd.DataFrame(data)

print(df.isnull().sum())
df["age"] = df["age"].fillna(df["age"].mean())        # fill age with mean
df["cgpa"] = df["cgpa"].fillna(df["cgpa"].median())   # fill cgpa with median
df = df.dropna(subset=["city"])                        # drop rows where city missing
df.to_csv("cleaned_students.csv",index=False)
df2=pd.read_csv("cleaned_students.csv")
print(df2)

print(df)

name    0
age     2
cgpa    1
city    1
dtype: int64
      name   age  cgpa     city
0     Tony  19.0   9.2   Mumbai
1   Pepper  20.0   8.5    Delhi
2    Happy  20.0   7.1  Chennai
3  Natasha  20.0   8.8     Pune
      name   age  cgpa     city
0     Tony  19.0   9.2   Mumbai
1   Pepper  20.0   8.5    Delhi
3    Happy  20.0   7.1  Chennai
4  Natasha  20.0   8.8     Pune
