# Introduction to Pandas for Machine Learning

In this notebook, we will practice using **Pandas**, a Python library for working with tabular data.  
Understanding Pandas is *essential* for Machine Learning because:

- Most ML datasets are tabular (rows = examples, columns = features).  
- Pandas DataFrames make it easy to explore, clean, and preprocess data.  
- ML models require clean numerical input, which we can prepare using Pandas.

We will go step by step:
1. Pandas basics (Series, DataFrame).  
2. Selecting and filtering data.  
3. Descriptive statistics and summaries.  
4. Reading and writing CSV files.  
5. Handling missing values.  
6. Grouping and aggregation.  
7. Sorting and merging/joining.  

Run each code cell and read the explanation carefully. Try small variations yourself!

## Import Pandas

In [None]:
import pandas as pd

# Check pandas version
print("Pandas version:", pd.__version__)

Pandas version: 2.2.2


## Pandas Objects: Series and DataFrame

- **Series**: A one-dimensional labeled array (like a column in Excel).  
- **DataFrame**: A two-dimensional labeled table (rows + columns).  

Think of:
- A **Series** as a single column of data.  
- A **DataFrame** as a spreadsheet with multiple columns.

We’ll start by creating them manually before working with real datasets.


## Create Series

In [None]:
# A Pandas Series: like one column
s = pd.Series([10, 20, 30, 40], name="Numbers")
print(s)

0    10
1    20
2    30
3    40
Name: Numbers, dtype: int64


In [None]:
# Access index and values
print("Index:", s.index)
print("Values:", s.values)

Index: RangeIndex(start=0, stop=4, step=1)
Values: [10 20 30 40]


## Create DataFrame

In [None]:
# A DataFrame: table with rows and columns
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "Salary": [50000, 60000, 70000, 80000]
}

In [None]:
df = pd.DataFrame(data)

In [None]:
print(df)

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
3    David   40   80000


## Accessing Data

You can access:
- Columns using `df["column_name"]`  
- Multiple columns using `df[["col1","col2"]]`  
- Rows using `.loc` (label-based) or `.iloc` (position-based)  
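
One difference between the two row indexers is easy to miss: a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position. The sketch below uses a hypothetical `demo` DataFrame (not part of this notebook's data) with string row labels so that labels and positions cannot be confused.

```python
import pandas as pd

# Hypothetical demo table with string row labels (illustration only)
demo = pd.DataFrame(
    {"Age": [25, 30, 35, 40]},
    index=["a", "b", "c", "d"],
)

# .loc slices by label and INCLUDES the end label "c"
by_label = demo.loc["a":"c"]

# .iloc slices by position and EXCLUDES the end position 2, like a Python list
by_position = demo.iloc[0:2]

print(by_label)
print(by_position)
```

With the default integer index used elsewhere in this notebook, labels and positions happen to coincide, which is why `df.loc[0:2]` returns three rows but `df.iloc[0:2]` returns two.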


In [None]:
# Access one column
print(df["Name"])

0      Alice
1        Bob
2    Charlie
3      David
Name: Name, dtype: object


In [None]:
# Access multiple columns
print(df[["Name", "Salary"]])

      Name  Salary
0    Alice   50000
1      Bob   60000
2  Charlie   70000
3    David   80000


In [None]:
# Access a single row by index position
print(df.iloc[1])

Name        Bob
Age          30
Salary    60000
Name: 1, dtype: object


In [None]:
# Access rows by index label (.loc slices include the end label, so this selects rows 0, 1 and 2)
print(df.loc[0:2])

In [None]:
# Access the Name and Age columns for the first 2 rows (labels 0 and 1)
print(df.loc[0:1, ["Name", "Age"]])

    Name  Age
0  Alice   25
1    Bob   30


## Filtering Data (Selecting Rows by Condition)

We can filter rows based on conditions, just like in SQL or Excel filters.

Examples:
- `df[df["Salary"] > 60000]` → Selects rows where Salary is greater than 60,000.  
- `df[df["Age"] < 35]` → Selects rows where Age is less than 35.  

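Multiple conditions can be combined using the operators below. Wrap each condition in parentheses, because these operators bind more tightly than comparisons such as `>`: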
- `&` for AND  
- `|` for OR  
- `~` for NOT
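
The three operators can be tried out as follows. This is a minimal sketch that rebuilds the employee table under the hypothetical name `employees`, so it runs on its own without touching `df`.

```python
import pandas as pd

# Hypothetical copy of the employee table so this cell is self-contained
employees = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "Salary": [50000, 60000, 70000, 80000],
})

# AND: older than 25 and earning less than 80000
and_mask = (employees["Age"] > 25) & (employees["Salary"] < 80000)

# OR: younger than 30 or earning more than 70000
or_mask = (employees["Age"] < 30) | (employees["Salary"] > 70000)

# NOT: everyone who is not younger than 35
not_mask = ~(employees["Age"] < 35)

print(employees[and_mask])
print(employees[or_mask])
print(employees[not_mask])
```

Python's `and`, `or` and `not` keywords do not work here, because they try to reduce a whole Series to a single True/False value.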


In [None]:
# Filtering rows
print("Employees with Salary > 60000:")
print(df[df["Salary"] > 60000])

Employees with Salary > 60000:
      Name  Age  Salary
2  Charlie   35   70000
3    David   40   80000


In [None]:
print("Employees younger than 35:")
print(df[df["Age"] < 35])

Employees younger than 35:
    Name  Age  Salary
0  Alice   25   50000
1    Bob   30   60000


## Descriptive Statistics

Pandas provides built-in methods to quickly summarize data.

- `df.describe()` → Summary statistics (count, mean, std, min, max, quartiles).  
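- `df.mean(numeric_only=True)` → Mean of each numeric column (without `numeric_only=True`, text columns such as `Name` raise an error in pandas 2.x).  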
- `df["Salary"].max()` → Maximum salary.  
- Similarly: `.min()`, `.sum()`, `.median()`, `.std()`.

These are very useful for **exploring datasets** before applying ML.


In [None]:
# Summary statistics
print(df.describe())

             Age        Salary
count   4.000000      4.000000
mean   32.500000  65000.000000
std     6.454972  12909.944487
min    25.000000  50000.000000
25%    28.750000  57500.000000
50%    32.500000  65000.000000
75%    36.250000  72500.000000
max    40.000000  80000.000000


In [None]:
# Mean of each column
print("Mean values:\n", df.mean(numeric_only=True))

Mean values:
 Age          32.5
Salary    65000.0
dtype: float64


In [None]:
# Maximum salary
print("Max salary:", df["Salary"].max())

Max salary: 80000


## Reading and Writing CSV Files

In real ML projects, data usually comes from files (CSV, Excel, etc.).

- `df.to_csv("file.csv")` → Save DataFrame to CSV (pass `index=False` to avoid writing the row index as an extra column).  
- `pd.read_csv("file.csv")` → Load DataFrame from CSV.  

In Colab, the file is saved temporarily in the runtime environment.  


You can also upload/download CSVs for practice.
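
To see what the `index` argument of `to_csv` changes, the sketch below writes a small hypothetical DataFrame (`small`) to an in-memory string instead of a file and reads it back. Using `io.StringIO` is only a convenience for this illustration; a file path works the same way.

```python
import io
import pandas as pd

# Hypothetical two-row table (illustration only)
small = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# With no path, to_csv returns the CSV text as a string
csv_with_index = small.to_csv()                # default: row index is written too
csv_without_index = small.to_csv(index=False)  # row index is left out

reloaded_with = pd.read_csv(io.StringIO(csv_with_index))
reloaded_without = pd.read_csv(io.StringIO(csv_without_index))

print(list(reloaded_with.columns))     # includes an extra "Unnamed: 0" column
print(list(reloaded_without.columns))  # only the original columns
```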


In [None]:
# Save to CSV
df.to_csv("employees.csv", index=False)

In [None]:
# Read CSV back
df_loaded = pd.read_csv("employees.csv")
print(df_loaded)

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
3    David   40   80000


## Handling Missing Data

Datasets often have **missing values**. Pandas helps with:
- `df.isnull()` → Shows where data is missing.  
- `df.fillna(value)` → Replace missing values with a given value (e.g., the column mean).  
- `df.dropna()` → Remove rows with missing values.  

Handling missing data properly is *very important* in ML pipelines.
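
Two common variations are sketched below on a hypothetical `scores` DataFrame: filling each column with its own value by passing a dict to `fillna`, and dropping rows only when a chosen column is missing via `dropna(subset=...)`. The fill values (the median for `Age`, 0 for `Salary`) are arbitrary choices for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing value per column (illustration only)
scores = pd.DataFrame({
    "Age": [25, np.nan, 35],
    "Salary": [50000, 60000, np.nan],
})

# Boolean DataFrame: True wherever a value is missing
mask = scores.isnull()

# Fill each column with its own replacement value
filled = scores.fillna({"Age": scores["Age"].median(), "Salary": 0})

# Drop a row only if Age is missing; a missing Salary is tolerated
dropped = scores.dropna(subset=["Age"])

print(mask)
print(filled)
print(dropped)
```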


In [None]:
# Create a DataFrame with missing values
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, None, 35, 40],
    "Salary": [50000, 60000, None, 80000]
}

In [None]:
df_missing = pd.DataFrame(data)
print(df_missing)

      Name   Age   Salary
0    Alice  25.0  50000.0
1      Bob   NaN  60000.0
2  Charlie  35.0      NaN
3    David  40.0  80000.0


In [None]:
# Detect missing values
print(df_missing.isnull().sum())

Name      0
Age       1
Salary    1
dtype: int64


In [None]:
# Fill missing values with mean
df_filled = df_missing.fillna(df_missing.mean(numeric_only=True))
print("After filling missing values:\n", df_filled)

After filling missing values:
       Name        Age        Salary
0    Alice  25.000000  50000.000000
1      Bob  33.333333  60000.000000
2  Charlie  35.000000  63333.333333
3    David  40.000000  80000.000000


In [None]:
# Drop rows with missing values
df_dropped = df_missing.dropna()
print("After dropping missing values:\n", df_dropped)

After dropping missing values:
     Name   Age   Salary
0  Alice  25.0  50000.0
3  David  40.0  80000.0


## Grouping and Aggregation

Pandas can group data by categories (like pivot tables in Excel).  

- `df.groupby("column")` → Groups by unique values of a column.  
- Then you can apply aggregation functions like `.mean()`, `.sum()`, `.count()`.  

Example:
- Find average salary per department.
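
Several aggregations can also be computed in one pass with `.agg`. The following sketch uses a hypothetical `staff` DataFrame holding the same departments and salaries as the example in this section.

```python
import pandas as pd

# Hypothetical table with the same departments and salaries (illustration only)
staff = pd.DataFrame({
    "Department": ["HR", "IT", "IT", "HR"],
    "Salary": [50000, 60000, 70000, 80000],
})

# One row per department, one column per aggregation function
summary = staff.groupby("Department")["Salary"].agg(["mean", "sum", "count"])
print(summary)
```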


In [None]:
# Add a new column Department
df["Department"] = ["HR", "IT", "IT", "HR"]

print("DataFrame with Department column:\n", df)

DataFrame with Department column:
       Name  Age  Salary Department
0    Alice   25   50000         HR
1      Bob   30   60000         IT
2  Charlie   35   70000         IT
3    David   40   80000         HR


In [None]:
# Group by Department and compute mean
print("Average salary per department:\n", df.groupby("Department")["Salary"].mean())

Average salary per department:
 Department
HR    65000.0
IT    65000.0
Name: Salary, dtype: float64


## Sorting Data

We can sort rows in Pandas using:
- `df.sort_values(by="column")` → Sorts ascending (default).  
- `df.sort_values(by="column", ascending=False)` → Sorts descending.  

Examples:
- Sort employees by Age.  
- Sort employees by Salary in descending order.

In [None]:
# Sort by Age
print(df.sort_values(by="Age"))

      Name  Age  Salary Department
0    Alice   25   50000         HR
1      Bob   30   60000         IT
2  Charlie   35   70000         IT
3    David   40   80000         HR


In [None]:
# Sort by Salary (descending)
print(df.sort_values(by="Salary", ascending=False))

      Name  Age  Salary Department
3    David   40   80000         HR
2  Charlie   35   70000         IT
1      Bob   30   60000         IT
0    Alice   25   50000         HR


## Merging DataFrames

In ML projects, data often comes from multiple tables.  
We can combine them using `pd.merge()` (similar to SQL joins).  

Example:
- Employee info in one table.  
- Bonus info in another table.  
- Merge them on the "Name" column.
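
By default `pd.merge()` performs an inner join, which keeps only keys present in both tables. The `how` parameter selects other join types. The sketch below uses hypothetical `people` and `bonuses` tables whose names only partly overlap, to make the difference visible.

```python
import pandas as pd

# Hypothetical tables with partly overlapping names (illustration only)
people = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]})
bonuses = pd.DataFrame({"Name": ["Alice", "Bob", "Eve"], "Bonus": [2000, 3000, 1000]})

# Inner join (default): only names found in both tables
inner = pd.merge(people, bonuses, on="Name")

# Left join: every row of `people`; Bonus is NaN where no match exists
left = pd.merge(people, bonuses, on="Name", how="left")

print(inner)
print(left)
```

In the example in this section both tables contain the same four names, so an inner join and a left join give the same result.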

In [None]:
# Another DataFrame with bonus info
bonus = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Bonus": [2000, 3000, 4000, 5000]
})

In [None]:
# Merge on Name
df_merged = pd.merge(df, bonus, on="Name")
print(df_merged)

      Name  Age  Salary Department  Bonus
0    Alice   25   50000         HR   2000
1      Bob   30   60000         IT   3000
2  Charlie   35   70000         IT   4000
3    David   40   80000         HR   5000


## Summary

We practiced Pandas basics step by step:

- Creating Series and DataFrames.  
- Accessing columns and rows with `[]`, `.loc`, and `.iloc`.  
- Filtering rows using conditions.  
- Descriptive statistics.  
- Reading and writing CSVs.  
- Handling missing data.  
- Grouping and aggregation.  
- Sorting data.  
- Merging DataFrames.  

These operations are the foundation of **data preprocessing in ML**.  

## Practice Exercises

In [None]:
# Sample students grades dataset
data = {
    "Student": ["Ahmed", "Fatima", "Hassan", "Ayesha", "Bilal", "Sara", "Imran", "Zainab"],
    "Math": [85, 78, 92, 70, 88, 60, 95, 80],
    "Physics": [90, 75, 85, 65, 92, 58, 97, 82],
    "Chemistry": [88, 80, 89, 72, 85, 55, 94, 78],
    "Section": ["A", "B", "A", "B", "A", "B", "A", "B"]
}

df_students = pd.DataFrame(data)
print(df_students)


### Exercise 1: Filtering Rows

Select all students who scored **more than 85 in Math**.


### Exercise 2: Descriptive Statistics

Compute the **average score in Physics and Chemistry** for the whole class.

### Exercise 3: Handling Missing Data

Suppose the Chemistry score for "Bilal" is missing (NaN). Fill it with the **average Chemistry score**.

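- Hint: First set the value to NaN with `df_students.loc[row_index, "Chemistry"] = np.nan`, where `row_index` is the index label of Bilal's row. This requires `import numpy as np`.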

### Exercise 4: Grouping and Aggregation

Compute the **average Math score per Section**.


### Exercise 5: Adding a New Column and Sorting

1. Add a new column `Total` = sum of Math + Physics + Chemistry.  
2. Sort the DataFrame by `Total` in descending order.
