
<style>
/* Increase size for section headers */
h2 { font-size: 2.5em !important; }
h3 { font-size: 2.25em !important; }
h4 { font-size: 1.75em !important; }
</style>

# **Day 3: Introduction to Data Analysis with Pandas 🐼**

Welcome to Day 3! Today, we'll dive into **Pandas**, a powerful Python library essential for data analysis. By the end of this session, you'll be comfortable with the basics of manipulating and analyzing data.

# **🎯 Goals for Today:**

1.  Understand what Python **libraries** are and why they are useful.
2.  Learn about **Pandas** and its primary data structures: **Series** and **DataFrame**.
3.  Load data from a CSV file into a DataFrame.
4.  Inspect and explore data to understand its structure and contents.
5.  Perform data cleaning tasks like handling missing values and duplicates.
6.  Transform data by creating new columns and changing data types.
7.  Calculate basic statistics and perform simple aggregations.
8.  Create basic visualizations (histograms, bar plots, scatter plots) to understand data patterns.
9.  Practice with hands-on exercises and group activities.

---
# **1. What are Python Libraries? 🤔**

Python libraries are collections of pre-written code (modules, functions, classes) that let you perform specific tasks without writing the code from scratch. They extend Python's capabilities.

**Common Libraries in Data Analysis:**
* **Pandas:** For data manipulation and analysis (what we're learning today!).
* **NumPy:** For numerical operations, especially with arrays.
* **Matplotlib:** For creating static, animated, and interactive visualizations.
* **Seaborn:** Built on Matplotlib, provides a high-level interface for drawing attractive statistical graphics.
* **SciPy:** For scientific and technical computing.
* **Scikit-learn:** For machine learning tasks.

---
# **2. Introduction to Pandas 🐼**

**Pandas** is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language. It's the go-to library for handling structured data.

**Why use Pandas?**
* Easily loads data from various sources (CSV, Excel, databases, etc.).
* Provides rich data structures for holding and manipulating data.
* Offers a wide range of functions for cleaning, transforming, merging, and reshaping data.
* Integrates well with other data science libraries like NumPy and Matplotlib.



# **🛠️ Setup: Import Libraries**

First, we need to import the libraries we'll be using. **Pandas** is for data manipulation, and **Matplotlib** is for plotting. We use aliases (`pd`, `plt`) by convention to make them easier to use.

In [6]:
# Import pandas and matplotlib
import pandas as pd
import matplotlib.pyplot as plt

# **Core Pandas Data Structures:**

# **2.1. Pandas Series**
A **Series** is a one-dimensional labeled array capable of holding data of any type (integers, strings, floating-point numbers, Python objects, etc.). It's like a single column in a spreadsheet or a SQL table. Each element in a Series has an associated label, called an **index**.

In [None]:
# Example 2.1.1: Create a Series from a list
student_scores = pd.Series([90, 85, 77, 92, 88])
print("Student Scores Series:")
print(student_scores)

# Example 2.1.2: Create a Series with a custom index
fruit_quantities = pd.Series([10, 15, 8, 12], index=['apples', 'bananas', 'cherries', 'dates'])
print("\nFruit Quantities Series:")
print(fruit_quantities)

# Accessing an element using index
print("\nQuantity of bananas:", fruit_quantities['bananas'])

**Exercise 2.1.1:** Create a Pandas Series named `book_prices` with the following prices: `[15.99, 22.50, 12.75, 9.99]`. Print the Series.

In [None]:
# Exercise 2.1.1 Code Cell


**Exercise 2.1.2:** Create a Pandas Series named `subject_teachers` with subjects as indices `['Math', 'Science', 'History', 'English']` and teacher names as values `['Mr. Smith', 'Ms. Jones', 'Dr. Brown', 'Ms. Davis']`. Print the Series and then print the teacher for 'History'.

In [None]:
# Exercise 2.1.2 Code Cell


# **✨ Group Activity 1: Series Operations ✨**

**Scenario:** You have a Series of product prices and want to apply a 10% discount.
1. Create a Pandas Series named `original_prices` with values `[20, 50, 30, 75, 90]`.
2. Try to calculate the discounted prices (original price - 10% of original price).
3. **Buggy Code attempt:** `discounted_prices = original_prices - "10%"`
4. Discuss: Why does this fail? How can you correctly calculate the discount (e.g., `original_prices * 0.90`)?
5. Write the corrected code to calculate and print the `discounted_prices`.

In [None]:
# Group Activity 1 Code Cell
original_prices = pd.Series([20, 50, 30, 75, 90])
# discounted_prices = original_prices - "10%" # This is buggy!
# print(discounted_prices)

# Write your corrected code here:
discount_percentage = 0.10
discounted_prices = original_prices * (1 - discount_percentage)
print("Original Prices:")
print(original_prices)
print("\nDiscounted Prices (10% off):")
print(discounted_prices)

# **2.2. Pandas DataFrame**
A **DataFrame** is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). You can think of it as a spreadsheet, a SQL table, or a dictionary of Series objects. It's the most commonly used Pandas object.

In [None]:
# Example 2.2.1: Create a DataFrame from a dictionary of lists
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 28],
    'City': ['New York', 'Paris', 'London', 'Berlin']
}
df_people = pd.DataFrame(data)
print("People DataFrame:")
print(df_people)

In [None]:
# Example 2.2.2: Create a DataFrame from a dictionary of lists with a custom index
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 28],
    'City': ['New York', 'Paris', 'London', 'Berlin']
}

# Define a custom index
custom_index = ['ID101', 'ID102', 'ID103', 'ID104']

# Create the DataFrame with the custom index
df_people = pd.DataFrame(data, index=custom_index)

print("People DataFrame with Custom Index:")
print(df_people)
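
To make the "dictionary of Series" description concrete, here is a small sketch. The names `ages`, `cities`, and `df_from_series` are made up for illustration. It builds a DataFrame from two Series that share an index, then pulls one column back out as a Series.

```python
import pandas as pd

# Two Series sharing the same index labels (illustrative data)
ages = pd.Series([25, 30, 22], index=['Alice', 'Bob', 'Charlie'])
cities = pd.Series(['New York', 'Paris', 'London'], index=['Alice', 'Bob', 'Charlie'])

# Each dictionary key becomes a column name; the shared index becomes the row labels
df_from_series = pd.DataFrame({'Age': ages, 'City': cities})
print(df_from_series)

# Selecting a single column gives back a Series
age_column = df_from_series['Age']
print(type(age_column))
print("Bob's age:", age_column['Bob'])
```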


**Exercise 2.2.1:** Create a DataFrame named `df_products` with three columns: `Product_Name` (strings), `Price` (floats), and `In_Stock` (booleans). Include at least 3 rows of data. Print the DataFrame.

In [None]:
# Exercise 2.2.1 Code Cell


**Exercise 2.2.2:** Create a DataFrame named `df_temps` with rows indexed by `['Mon', 'Tue', 'Wed']` and columns for `['Morning_Temp', 'Evening_Temp']`. Populate it with some temperature data. Print the DataFrame.

In [None]:
# Exercise 2.2.2 Code Cell


---
# **3. Loading Data from CSV Files 📄**

CSV (Comma Separated Values) files are a common way to store tabular data. Pandas makes it very easy to read data from CSV files into a DataFrame.

**Key function:** `pd.read_csv('your_file_name.csv')`

**For this notebook, we will assume you have a `student.csv` file in the same directory as this notebook with the following approximate structure:**
```csv
id,name,class,mark,gender
1,John Deo,Four,75,female
2,Max Ruin,Three,85,male
3,Arnold,Three,55,male
4,Krish Star,Four,60,female
5,John Mike,Four,60,female
6,Alex John,Four,55,male
7,My John Rob,Fifth,78,male
8,Asruid,Five,85,male
9,Tes Qry,Six,78,
10,Big John,Four,55,female
11,Ronald,Six,89,female
12,Recky,Six,94,female
13,Kty,Seven,88,female
14,Bigy,Seven,88,female
15,Tade Row,Eight,88,male
16,Gimmy,Four,88,male
17, HONNY,Five,75,male
18,KINN ENG,Six,98,female
19,Linnea,Seven,69,female
20,Jackly,Nine,65,female
21,Babby John,Four,69,female
22,Reggid,Seven,72,male
23,Herod,Eight,79,male
24,Tiddy Now,Seven,78,male
25,Mikky,Seven,72,male
26,Crelea,Seven,79,male
27,Big Nose,Three,82,female
28, Anto,Six,67,male
29,Tes Qry,Six,78,
30,Reppy Red,Six,79,female
31,Malik,Five,82,male
32,イドリ,Four,75,male
33,Monika,Nine,58,female
34,Gain Toe,Seven,69,male
35,BSR,Eight,92,male
```
*(You can create this file yourself, or I will provide code to create a sample DataFrame if the file is not found.)*
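
If `student.csv` is not available, one way to follow along is to build a small stand-in DataFrame from text. The sketch below is only an assumption about how such a fallback could look: the helper name `load_students` and the five-row sample are illustrative, not part of the course files. It uses `io.StringIO` so that `pd.read_csv` can read a string as if it were a file.

```python
import io
import pandas as pd

# A few rows copied from the listing above, used only when the real file is missing
SAMPLE_CSV = """id,name,class,mark,gender
1,John Deo,Four,75,female
2,Max Ruin,Three,85,male
3,Arnold,Three,55,male
8,Asruid,Five,85,male
9,Tes Qry,Six,78,
"""

def load_students(path='student.csv'):
    """Read the CSV at `path`, or fall back to the small sample (hypothetical helper)."""
    try:
        return pd.read_csv(path)
    except FileNotFoundError:
        print(f"'{path}' not found, using the built-in sample instead.")
        return pd.read_csv(io.StringIO(SAMPLE_CSV))

# A deliberately missing file name, to show the fallback path
df_sample = load_students('file_that_does_not_exist.csv')
print(df_sample)
```

The empty `gender` field in the last sample row is loaded as `NaN`, which is useful later when practising missing-value handling.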

In [None]:
# 1. Read the file into a DataFrame called df
df = pd.read_csv('student.csv')

# 2. Peek at the first 5 rows to check your data
df.head()

**Exercise 3.1:** Imagine you have another CSV file named `courses.csv` with columns `Course_ID`, `Course_Name`, `Credits`. Write the Python code to load this hypothetical file into a DataFrame called `df_courses`.

In [None]:
# Exercise 3.1 Code Cell


---
# **4. Inspecting Your Data (Initial Exploration) 🕵️‍♀️**

Once your data is loaded, the first step is to inspect it to understand its structure, content, and quality. Pandas provides several useful functions for this:

* `df.head(n)`: View the first `n` rows (default is 5).
* `df.tail(n)`: View the last `n` rows (default is 5).
* `df.shape`: Get the dimensions (number of rows, number of columns).
* `df.info()`: Get a concise summary including data types, non-null values, and memory usage.
* `df.dtypes`: Check the data type of each column.
* `df.columns`: View the column names.
* `df.describe()`: Generate descriptive statistics for numerical columns (count, mean, std, min, max, quartiles).

In [None]:
# 1. Print the first 3 rows
print("First 3 rows:")
print(df.head(3))                   # head() returns a DataFrame, so we wrap it in print()

# 2. Print the last 2 rows
print("\nLast 2 rows:")
print(df.tail(2))                   # tail() same as head(), needs print()

# 3. Print the shape (rows, columns)
print("\nShape (rows, columns):", df.shape)

# 4. Print info (counts & data types)
print("\nInfo:")
df.info()                           # info() prints on its own; don’t wrap in print()

# 5. Print each column’s data type
print("\nData types:")
print(df.dtypes)                    # dtypes is a Series, so we wrap it in print()

# 6. Print all column names
print("\nColumns:")
print(list(df.columns))             # convert Index to list for cleaner printout

# 7. Print descriptive stats for numeric columns
print("\nDescriptive stats:")
print(df.describe())                # describe() returns a DataFrame, so we wrap it in print()


**Exercise 4.1:** Using the `df` DataFrame, display:
1. The first 7 rows.
2. The last 4 rows.
3. Only the column names.

In [None]:
# Exercise 4.1 Code Cell


**Exercise 4.2:** For the `df` DataFrame:
1. Get a full summary using `info()`.
2. Get descriptive statistics for only the 'mark' column. (Hint: `df['column_name'].describe()`)

In [None]:
# Exercise 4.2 Code Cell


# **5. Basic Data Selection in Pandas**

The example below shows five ways to select part of a DataFrame. Multi-row results are followed by `.head()` so that at most five rows are printed:
- **Columns by label (`loc`)**: pick one column (`df.loc[:, 'name']`) or several (`df.loc[:, ['name', 'mark']]`).
- **Rows by label (`loc`)**: get a single row (`df.loc[2]`), a range that includes both ends (`df.loc[2:4]`), or specific rows (`df.loc[[0, 3, 5]]`).
- **Columns by position (`iloc`)**: pick the first column (`df.iloc[:, 0]`), the first two columns (`df.iloc[:, 0:2]`), or a list of positions (`df.iloc[:, [0, 2]]`).
- **Rows by position (`iloc`)**: grab the second row (`df.iloc[1]`), a slice that excludes its end (`df.iloc[0:3]`), or a list of positions (`df.iloc[[0, 3, 5]]`).
- **Rows and columns together**: combine both in one call, e.g. `df.loc[1:3, ['name', 'class']]` by label or `df.iloc[0:3, 0:2]` by position (first three rows, first two columns).


In [None]:
# 1. Select columns by label with loc
one_col   = df.loc[:, 'name']              # Single column → Series
two_cols  = df.loc[:, ['name', 'mark']]    # Two columns → DataFrame

print("1. Names (first 5):")
print(one_col.head(), "\n")

print("2. Name & Mark (first 5):")
print(two_cols.head(), "\n")


# 2. Select rows by label with loc
row_2     = df.loc[2]                      # Row with index label 2
rows_2_4  = df.loc[2:4]                    # Rows 2,3,4 (inclusive)
some_rows = df.loc[[0, 3, 5]]              # Specific rows by label

print("3. Row at label 2:")
print(row_2, "\n")

print("4. Rows 2–4 (first 5 of that slice):")
print(rows_2_4.head(), "\n")

print("5. Rows [0, 3, 5] (first 5):")
print(some_rows.head(), "\n")


# 3. Select columns by position with iloc
col_0      = df.iloc[:, 0]              # First column → Series
cols_0_2   = df.iloc[:, 0:2]            # First two columns → DataFrame
some_cols  = df.iloc[:, [0, 2]]         # First and third columns → DataFrame

print("6. Column at position 0 (first 5):")
print(col_0.head(), "\n")

print("7. Columns 0–1 (first 5):")
print(cols_0_2.head(), "\n")

print("8. Columns at positions [0, 2] (first 5):")
print(some_cols.head(), "\n")


# 4. Select rows by position with iloc
row_pos1  = df.iloc[1]                     # Second row (position 1)
rows_0_2  = df.iloc[0:3]                   # Rows 0,1,2 (end exclusive)
some_pos  = df.iloc[[0, 3, 5]]             # Specific rows by position

print("9. Row at position 1:")
print(row_pos1, "\n")

print("10. Rows 0–2 (first 5 of that slice):")
print(rows_0_2.head(), "\n")

print("11. Positions [0, 3, 5] (first 5):")
print(some_pos.head(), "\n")


# 5. Combine row & column selection
sub_loc   = df.loc[1:3, ['name', 'class']] # Labels 1–3, columns name & class
sub_iloc  = df.iloc[0:3, 0:2]              # First 3 rows, first 2 columns

print("12. loc rows 1–3 & cols name,class (first 5):")
print(sub_loc.head(), "\n")

print("13. iloc rows 0–2 & cols 0–1 (first 5):")
print(sub_iloc.head(), "\n")
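
With the default index (0, 1, 2, ...), labels and positions coincide, which can hide the difference between `loc` and `iloc`. The short sketch below uses a made-up DataFrame (`df_demo`, indexed by hypothetical student codes) to show the two behaving differently: `loc` slices by label and includes the end label, while `iloc` slices by position and excludes the end position.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'name': ['Ann', 'Ben', 'Cara', 'Dev'], 'mark': [71, 64, 90, 58]},
    index=['s1', 's2', 's3', 's4'],
)

by_label = df_demo.loc['s2':'s3']      # labels s2 through s3, end label included
by_position = df_demo.iloc[1:3]        # positions 1 and 2, end position excluded
print(by_label.equals(by_position))    # both select Ben and Cara

one_value = df_demo.loc['s3', 'mark']  # single cell by row label and column label
same_value = df_demo.iloc[2, 1]        # same cell by row position and column position
print(one_value, same_value)
```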



**Exercise 5.1:** From `df`:
1. Select and display only the 'class' column.
2. Select and display the 'name', 'mark', and 'gender' columns for all students.

In [None]:
# Exercise 5.1 Code Cell


**Exercise 5.2:** From `df`:
1. Using `loc`, select the rows with index labels 2, 4, and 6 (if these labels exist in your DataFrame's index), showing only their 'name' and 'mark'.
2. Using `iloc`, select the row slice `0:4` and the column slice `1:3`. In an `iloc` slice the start is included and the end is excluded, so this returns rows 0 to 3 and columns 1 to 2.

In [None]:
# Exercise 5.2 Code Cell


# **6. Filtering DataFrames (Simple Conditions)**

Filtering in pandas means selecting specific rows from your data based on a rule or condition. This is useful when you want to look at a smaller part of your data that meets certain criteria.

For example, you might want to see:
- only the students who scored more than 80
- only the students who are in class "Four"
- only the students who scored less than 60

#### Basic Syntax

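`df[condition]`, where `condition` is a boolean Series such as `df['mark'] > 80`. Only the rows where the condition is `True` are kept.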


In [None]:
# Students with marks greater than 80
high_scorers = df[df['mark'] > 80]
print("Students with mark > 80:\n", high_scorers)

# Students in class 'Four'
class_four = df[df['class'] == 'Four']
print("\nStudents in class 'Four':\n", class_four)

# Students with marks less than 60
low_mark = df[df['mark'] < 60]
print("\nStudents with mark < 60:\n", low_mark)
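
Conditions can also be combined. The sketch below uses a small made-up table (`df_demo`) rather than the course data. `&` means "and", `|` means "or", each condition needs its own parentheses, and `isin` checks membership in a list.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'name': ['Ann', 'Ben', 'Cara', 'Dev', 'Eli'],
    'class': ['Four', 'Five', 'Four', 'Six', 'Five'],
    'mark': [71, 64, 90, 58, 83],
})

# AND: both conditions must hold
four_and_high = df_demo[(df_demo['class'] == 'Four') & (df_demo['mark'] > 80)]

# OR: at least one condition must hold
low_or_six = df_demo[(df_demo['mark'] < 60) | (df_demo['class'] == 'Six')]

# isin: a compact way to match any value from a list
four_or_five = df_demo[df_demo['class'].isin(['Four', 'Five'])]

print(four_and_high)
print(low_or_six)
print(four_or_five)
```

Python's `and` / `or` keywords do not work element by element on a Series, which is why the `&` and `|` operators are used here.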

**Exercise 6.1:** From `df`, select and display:
1. All students who are 'male'.
2. All students whose 'mark' is exactly 78.

In [None]:
# Exercise 6.1 Code Cell


**Exercise 6.2:** From `df`, select and display:
1. All 'female' students who have a 'mark' greater than or equal to 70.
2. All students who are in 'class' 'Six' OR 'class' 'Seven'.

In [None]:
# Exercise 6.2 Code Cell


---
# **7. Data Cleaning**

Real-world data is often messy. Data cleaning involves handling inconsistencies, errors, and missing data.

# **7.1. Handling Missing Values (`NaN`)**
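Missing values are usually represented as `NaN` (Not a Number). In the sample data, the students with `id` 9 and 29 have an empty `gender` field, which `read_csv` loads as `NaN`.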

* **Detecting missing values:** `df.isnull()` (returns a boolean DataFrame) or `df.isnull().sum()` (counts missing values per column).
* **Dropping missing values:** `df.dropna()` (drops rows with any NaN).
    * `axis=1` drops columns with NaN.
    * `how='all'` drops rows/columns if all values are NaN.
    * `thresh=N` keeps rows/columns with at least N non-NaN values.
* **Filling missing values:** `df.fillna(value)`
    * Fill with a specific value (0, 'Unknown', etc.).
    * Fill with mean: `df['column'].fillna(df['column'].mean())`.
    * Fill with median: `df['column'].fillna(df['column'].median())`.
    * Fill with mode: `df['column'].fillna(df['column'].mode()[0])` (mode can return multiple values, so take the first).
    * Forward fill with `df.ffill()` or backward fill with `df.bfill()` (the older `fillna(method='ffill')` form is deprecated in recent pandas versions).
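
The options listed above can be tried on a tiny made-up table. This is a sketch: `df_demo` and its values are invented for illustration, and the row counts in the comments apply to this table only.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': [np.nan, np.nan, 6.0, np.nan],
    'c': ['x', np.nan, 'z', 'w'],
})

print(df_demo.isnull().sum())            # missing values per column

any_dropped = df_demo.dropna()           # drop rows with at least one NaN (1 row left)
all_dropped = df_demo.dropna(how='all')  # drop rows where every value is NaN (3 rows left)
at_least_2 = df_demo.dropna(thresh=2)    # keep rows with 2 or more non-NaN values (2 rows left)

forward_filled = df_demo.ffill()         # copy the last seen value downward
print(forward_filled)
```

Note that forward fill cannot fill a gap at the very top of a column: the first two values of `b` stay `NaN` because there is no earlier value to copy.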

In [None]:
# Example: Basic data cleaning with Pandas

# -------------------------------
# Step 1: View the original shape and check for missing values
# -------------------------------
print("Original data shape (rows, columns):")
print(df.shape)

print("\nMissing values in each column:")
print(df.isnull().sum())

# -------------------------------
# Step 2: Remove rows with missing data
# -------------------------------
df_no_missing = df.copy()
df_no_missing = df_no_missing.dropna()

print("\nData shape after dropping rows with missing values:")
print(df_no_missing.shape)
print("Missing values after dropping:")
print(df_no_missing.isnull().sum())

# -------------------------------
# Step 3: Fill missing values
# -------------------------------
df_filled = df.copy()

# Convert 'mark' to numeric, turn errors into NaN
df_filled['mark'] = pd.to_numeric(df_filled['mark'], errors='coerce')

# Fill missing marks with the average (mean) value
mean_mark = df_filled['mark'].mean()
df_filled['mark'] = df_filled['mark'].fillna(mean_mark)

# Fill missing gender values with 'Unknown'
df_filled['gender'] = df_filled['gender'].fillna('Unknown')

print("\nMissing values after filling:")
print(df_filled.isnull().sum())

# -------------------------------
# Step 4: Remove duplicate rows
# -------------------------------
df_no_duplicates = df_filled.copy()

# Drop duplicate rows
df_no_duplicates = df_no_duplicates.drop_duplicates()

print("\nShape after removing duplicate rows:")
print(df_no_duplicates.shape)
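
`drop_duplicates()` with no arguments only removes rows that match in every column. In the CSV listing shown earlier, the two "Tes Qry" rows have different `id` values, so they do not count as duplicates. The sketch below, on a made-up three-row table, shows how the `subset` parameter changes which columns are compared. Whether two such rows really describe the same student is a judgment call that depends on the data.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'id': [9, 29, 30],
    'name': ['Tes Qry', 'Tes Qry', 'Reppy Red'],
    'class': ['Six', 'Six', 'Six'],
    'mark': [78, 78, 79],
})

# Full-row comparison: the ids differ, so nothing is flagged as a duplicate
n_full_duplicates = df_demo.duplicated().sum()
print("Full-row duplicates:", n_full_duplicates)

# Compare only the chosen columns; the first occurrence is kept by default
deduped = df_demo.drop_duplicates(subset=['name', 'class', 'mark'])
print(deduped)
```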


**Exercise 7.1.1: Handling Missing Values**

Use a fresh copy of the DataFrame `df`.

1. Count the total number of missing values in the entire DataFrame.  
   💡 *Hint:* Use `df.isnull().sum().sum()` to get the total number.

2. Fill the missing values in the **`gender`** column with the most common value (called the **mode**).  
   💡 *Hint:* Use `df['gender'].mode()[0]` to get the most frequent value.

3. Print the number of missing values again using `df.isnull().sum()` to check the result.


In [None]:
# Exercise 7.1.1 Code Cell


In [None]:
# Demo solution
# Step 1: Make a fresh copy of the data
df_clean = df.copy()

# Step 2: Count total missing values
print("Total missing values:", df_clean.isnull().sum().sum())

# Step 3: Fill missing gender values with the mode
most_common_gender = df_clean['gender'].mode()[0]
df_clean['gender'] = df_clean['gender'].fillna(most_common_gender)

# Step 4: Check missing values again
print("\nMissing values after filling 'gender':")
print(df_clean.isnull().sum())


# **7.2. Renaming Columns**
You can rename columns using `df.rename(columns={'old_name': 'new_name', ...}, inplace=True)`.

In [None]:
# Example: Renaming columns in a DataFrame

# Make a copy of the original DataFrame
df_renamed = df.copy()

# Rename the columns 'class' to 'student_class' and 'mark' to 'score'
df_renamed.rename(columns={'class': 'student_class', 'mark': 'score'}, inplace=True)

# Print the new column names
print("Columns after renaming:", df_renamed.columns)


# **8. Data Transformation**

Data transformation is the process of modifying the structure or values of a DataFrame to make the data more useful or meaningful. This often includes creating new columns, changing data types, or preparing values for analysis.

---

# **8.1. Creating New Columns**

The example below transforms the dataset step by step: it copies the DataFrame, makes sure `mark` is numeric, and then derives two new columns from it.

Before making any changes, it is good practice to work on a copy (`df_transformed = df.copy()`) so that the original data stays untouched.


In [None]:
# Example: Creating new columns in a DataFrame

# Make a copy of the original DataFrame
df_transformed = df.copy()

# Convert 'mark' to numeric and fill any missing values with 0
df_transformed['mark'] = pd.to_numeric(df_transformed['mark'], errors='coerce').fillna(0)

# Create a new column: 'mark_percentage' (assuming max mark is 100)
max_mark = 100
df_transformed['mark_percentage'] = df_transformed['mark'] * 100 / max_mark

# Create a new column: 'pass_fail' (Pass if mark >= 60)
df_transformed['pass_fail'] = df_transformed['mark'].apply(lambda x: 'Pass' if x >= 60 else 'Fail')

# Display the result
print(df_transformed[['name', 'mark', 'mark_percentage', 'pass_fail']].head())
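
`apply` with a `lambda` runs a Python function once per value. For a simple two-way rule, `np.where` is a common vectorized alternative. This sketch uses a made-up `marks` DataFrame and assumes the same pass mark of 60 and maximum mark of 100 as the example above.

```python
import numpy as np
import pandas as pd

marks = pd.DataFrame({'name': ['Ann', 'Ben', 'Cara'], 'mark': [71, 59, 60]})

# np.where(condition, value_if_true, value_if_false) works on the whole column at once
marks['pass_fail'] = np.where(marks['mark'] >= 60, 'Pass', 'Fail')

# Arithmetic on a column is vectorized too (assumed maximum mark of 100)
max_mark = 100
marks['mark_percentage'] = marks['mark'] * 100 / max_mark

print(marks)
```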


**Exercise 8.1:** In a copy of `df` (ensure 'mark' is numeric and NaNs are handled, e.g., filled with 0 for this exercise):
1. Create a new column called `bonus_mark` which is 10% of the original 'mark'.
2. Create another column `final_mark` which is the sum of 'mark' and `bonus_mark`.
3. Print the 'name', 'mark', `bonus_mark`, and `final_mark` columns for the first 5 students.

In [None]:
# Exercise 8.1 Code Cell


---
# **9. Basic Statistics and Aggregation 📊**

Pandas makes it easy to get insights from your data through statistics and aggregation.

* `df['column'].value_counts()`: Counts of unique values in a Series (good for categorical data).
* Basic statistics on Series/DataFrame columns:
    * `df['column'].mean()`
    * `df['column'].median()`
    * `df['column'].min()`, `df['column'].max()`
    * `df['column'].std()` (standard deviation)
    * `df['column'].sum()`
    * `df['column'].count()` (non-null values)
* `df.groupby('column_to_group_by')`: Groups data based on categories in a column. You can then apply aggregate functions (like mean, sum, count) to the groups.
    * Example: `df.groupby('class')['mark'].mean()` (average mark per class)

In [None]:
# Example 9.1: Statistics and Aggregation

# Convert 'mark' column to numeric
df['mark'] = pd.to_numeric(df['mark'], errors='coerce')

# Value counts for 'class'
print("Class distribution:")
print(df['class'].value_counts())

# Basic statistics
print("\nMean mark:", df['mark'].mean())
print("Median mark:", df['mark'].median())
print("Highest mark:", df['mark'].max())
print("Lowest mark:", df['mark'].min())

# Mean mark per class
print("\nAverage mark per class:")
print(df.groupby('class')['mark'].mean())

# Student count per gender (filling missing values with 'Unknown')
print("\nStudents per gender:")
print(df.fillna({'gender': 'Unknown'}).groupby('gender')['id'].count())
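
When several statistics per group are needed, `agg` can compute them in one call. The sketch below uses a made-up `df_demo` table; the statistic names passed to `agg` are standard pandas aggregation names.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'class': ['Four', 'Four', 'Five', 'Five', 'Five'],
    'mark': [70, 80, 60, 90, 75],
})

# One row per class, one column per statistic
summary = df_demo.groupby('class')['mark'].agg(['count', 'mean', 'min', 'max'])
print(summary)
```

The result is itself a DataFrame indexed by class, so a single figure can be read with `loc`, for example `summary.loc['Four', 'mean']`.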


**Exercise 9.1:** Using `df`:
1. Get the value counts for the 'gender' column.
2. Calculate and print the percentage of each gender using `value_counts(normalize=True) * 100`.

In [None]:
# Exercise 9.1 Code Cell


---
# **10. Basic Visualizations 📈📉**

Visualizations help in understanding data patterns and trends. Pandas integrates with Matplotlib for plotting.
Remember we imported `matplotlib.pyplot as plt`.

# **10.1. Histograms**
Histograms show the distribution of a numerical variable.
`df['column'].plot(kind='hist', title='My Histogram', bins=10)`
`plt.xlabel('X-axis Label')`
`plt.ylabel('Frequency')`
`plt.show()`
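
As a minimal sketch of the histogram pattern above, the following uses a short made-up Series of marks (`marks`) instead of the course data. `plot()` returns a Matplotlib `Axes` object, so the labels can be set on it directly.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative marks, not taken from student.csv
marks = pd.Series([55, 60, 60, 75, 78, 85, 88, 88, 94], name='mark')

# bins controls how many equal-width intervals the range of marks is split into
ax = marks.plot(kind='hist', bins=5, title='Distribution of Marks')
ax.set_xlabel('Mark')
ax.set_ylabel('Frequency')
plt.show()
```

Changing `bins` changes how coarse or fine the picture is, so it is worth trying a few values before drawing conclusions about the shape of the distribution.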

Beyond histograms, three other plot types are covered below. Each can be created directly from a pandas object:

- **Bar Chart**: Displays the count of categories or frequency of values.
- **Line Plot**: Shows how a variable changes over an index (e.g. time or order).
- **Scatter Plot**: Explores the relationship between two numerical variables.



# **10.2. Bar Plots**
Bar plots are useful for comparing categorical data.
`df['column'].value_counts().plot(kind='bar', title='My Bar Plot')`
`plt.xlabel('Categories')`
`plt.ylabel('Count')`
`plt.xticks(rotation=45)` (to rotate x-axis labels if they overlap)
`plt.show()`

In [None]:
# Bar graph of student counts per class using only pandas
df['class'].value_counts().plot.bar().figure.show()


# **10.3. Line Plots**

Line plots show how a numerical variable changes over an index or time. They are useful for visualising trends and patterns in data sequences.  
`df['column'].plot(title='My Line Plot')`  
`plt.show()`


In [None]:
# Line plot of student marks using only pandas
df['mark'].plot(title='Line Plot of Student Marks').figure.show()


# **10.4. Scatter Plots**
Scatter plots show the relationship between two numerical variables.
`df.plot.scatter(x='col1', y='col2', title='My Scatter Plot')`
`plt.show()`

In [None]:
import numpy as np

# Create a separate DataFrame with random data
# (named df_random so the student data in df is not overwritten)
df_random = pd.DataFrame({
    'x': np.random.randint(1, 100, size=50),
    'y': np.random.randint(1, 100, size=50)
})

# Create scatter plot using only pandas
df_random.plot.scatter(x='x', y='y', title='Scatter Plot of Random Values').figure.show()


# **Group Activity: Choosing the Right Plot**

**Scenario:** Working with the `df` dataset, discuss the following questions with your group. For each one, decide which type of plot (Histogram, Bar Plot, Line Plot, or Scatter Plot) would be the most suitable to answer the question, and explain why.

1. How many students are there in each `'class'`?
2. What does the overall distribution of student `'marks'` look like? Are the marks generally high, low, or spread out?
3. Is there a relationship between a student's `'id'` and their `'mark'`? (Assume `'id'` represents enrolment order or time sequence).
4. How do the numbers of `'male'` and `'female'` students compare?





In [36]:
# Group Activity Answers

# 1. How many students are there in each 'class'?
#    - Appropriate plot: Bar Plot
#    - Reason: A bar plot is ideal for showing the number of students in each category of the 'class' column.
# Example:
# df['class'].value_counts().plot(kind='bar', title='Students per Class').figure.show()

# 2. What is the overall distribution of student 'marks'?
#    - Appropriate plot: Histogram
#    - Reason: A histogram shows how marks are distributed (e.g. skewed, spread out, or clustered).
# Example:
# pd.to_numeric(df['mark'], errors='coerce').dropna().plot(kind='hist', title='Mark Distribution').figure.show()

# 3. Is there a relationship between a student's 'id' and their 'mark'?
#    - Appropriate plot: Scatter Plot
#    - Reason: A scatter plot shows the relationship or correlation between two numeric variables.
# Example:
# df.dropna(subset=['id', 'mark']).plot.scatter(x='id', y='mark', title='ID vs Mark').figure.show()

# 4. How do the numbers of 'male' and 'female' students compare?
#    - Appropriate plot: Bar Plot
#    - Reason: A bar plot is useful for comparing the counts of categorical values like gender.
# Example:
# df['gender'].fillna('Unknown').value_counts().plot(kind='bar', title='Gender Comparison').figure.show()


---
# **11. Best Practices Recap**

* **Clear Variable Names:** Use descriptive names for DataFrames and Series (e.g., `df_students`, `mean_score`).
* **Comment Your Code:** Explain complex steps or a non-obvious logic with `#` comments.
* **Inspect Frequently:** Use `.head()`, `.info()`, `.shape`, `.describe()` often to check your data and the results of operations.
* **Work on Copies:** When performing operations that modify a DataFrame (especially `inplace=True` or tricky transformations), it's often safer to work on a copy: `df_copy = df.copy()`.
* **Break Down Problems:** For complex tasks, break them into smaller, manageable steps.
* **Understand Your Data Types:** Ensure columns have the correct data types for the operations you want to perform.

---
# **🎉 Day 3 Wrap-up**

Congratulations! You've learned the fundamentals of data analysis using Pandas. You can now:
* Create and understand Series and DataFrames.
* Load data from CSV files.
* Inspect, clean, and transform datasets.
* Calculate basic statistics and perform aggregations.
* Create simple visualizations to explore your data.

**Keep practicing!** The more you work with data, the more comfortable you'll become with Pandas. Try these techniques on different datasets or explore more advanced Pandas features.

**Next Steps:** We'll build upon these skills to tackle more complex data analysis tasks and possibly explore other libraries like Seaborn for more advanced visualizations.