# Session 2: Data Manipulation with Pandas & NumPy
## Week 1 - Data Science & Machine Learning Training Programme

**Duration:** 3 hours  
**Learning Objectives:**
- Master Pandas DataFrames for data manipulation
- Understand NumPy arrays and mathematical operations
- Learn data loading from various sources
- Apply basic data cleaning techniques
- Explore the Iris dataset

---

## Part 1: Advanced Pandas Operations (75 minutes)

In [None]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
# Silence warnings to keep the output tidy (note: this also hides deprecation notices)
warnings.filterwarnings('ignore')

# Set up plotting
%matplotlib inline
plt.style.use('default')
sns.set_palette("husl")

print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

### 1.1 Creating DataFrames from Different Sources

In [None]:
# Method 1: From dictionary
employee_data = {
    'employee_id': [1001, 1002, 1003, 1004, 1005],
    'name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown', 'Diana Prince', 'Eve Wilson'],
    'department': ['Engineering', 'Marketing', 'Engineering', 'Sales', 'HR'],
    'salary': [75000, 65000, 80000, 60000, 55000],
    'hire_date': ['2020-01-15', '2019-03-22', '2021-07-10', '2020-11-03', '2018-05-18'],
    'performance_score': [4.2, 3.8, 4.5, 3.9, 4.1]
}

df = pd.DataFrame(employee_data)
print("DataFrame from dictionary:")
print(df)
print(f"\nShape: {df.shape}")
print(f"Columns: {list(df.columns)}")

In [None]:
# Method 2: From lists of lists
data_lists = [
    [1006, 'Frank Miller', 'Finance', 70000, '2021-02-28', 4.0],
    [1007, 'Grace Lee', 'Engineering', 78000, '2020-09-15', 4.3],
    [1008, 'Henry Davis', 'Marketing', 62000, '2021-01-20', 3.7]
]

columns = ['employee_id', 'name', 'department', 'salary', 'hire_date', 'performance_score']
df_additional = pd.DataFrame(data_lists, columns=columns)

print("DataFrame from lists:")
print(df_additional)

# Combine dataframes
df_combined = pd.concat([df, df_additional], ignore_index=True)
print(f"\nCombined DataFrame shape: {df_combined.shape}")

In [None]:
# Method 3: Generate synthetic data (useful for practice)
np.random.seed(42)

n_samples = 100
synthetic_data = pd.DataFrame({
    'customer_id': range(1001, 1001 + n_samples),
    'age': np.random.randint(18, 65, n_samples),
    'income': np.random.normal(50000, 15000, n_samples),
    'purchase_amount': np.random.exponential(100, n_samples),
    'category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home'], n_samples),
    'satisfaction': np.random.uniform(1, 5, n_samples)
})

# Clean the data
synthetic_data['income'] = synthetic_data['income'].round(2)
synthetic_data['purchase_amount'] = synthetic_data['purchase_amount'].round(2)
synthetic_data['satisfaction'] = synthetic_data['satisfaction'].round(1)

print("Synthetic DataFrame:")
print(synthetic_data.head())
print(f"\nDataFrame info:")
synthetic_data.info()  # info() prints directly and returns None, so it should not be wrapped in print()

### 1.2 DataFrame Indexing and Selection

In [None]:
# Working with the employee data
df = df_combined.copy()

# Convert hire_date to datetime
df['hire_date'] = pd.to_datetime(df['hire_date'])

print("Original DataFrame:")
print(df)
print(f"\nData types:")
print(df.dtypes)

In [None]:
# Column selection
print("=== Column Selection ===")

# Single column (returns Series)
names = df['name']
print(f"Names (Series): {type(names)}")
print(names.head())

# Single column (returns DataFrame)
names_df = df[['name']]
print(f"\nNames (DataFrame): {type(names_df)}")
print(names_df.head())

# Multiple columns
subset = df[['name', 'department', 'salary']]
print("\nMultiple columns:")
print(subset)

In [None]:
# Row selection
print("=== Row Selection ===")

# By index position (.iloc)
print("First 3 rows (.iloc):")
print(df.iloc[:3])

# By label (.loc) - first set index
df_indexed = df.set_index('employee_id')
print("\nRow with employee_id 1003 (.loc):")
print(df_indexed.loc[1003])

# Boolean indexing (very important!)
high_performers = df[df['performance_score'] >= 4.0]
print(f"\nHigh performers (score >= 4.0): {len(high_performers)} employees")
print(high_performers[['name', 'performance_score']])
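
The difference between positional and label-based slicing is easy to miss, so here is a minimal sketch on a small hypothetical frame (`staff` and its values are invented for illustration, not taken from the session data). It assumes an integer label index similar to `employee_id`.

```python
import pandas as pd

# Hypothetical frame indexed by an employee-id-like label
staff = pd.DataFrame(
    {"name": ["Alice", "Bob", "Charlie", "Diana"],
     "salary": [75000, 65000, 80000, 60000]},
    index=[1001, 1002, 1003, 1004],
)

# .iloc slices by position and excludes the stop position (like Python lists)
by_position = staff.iloc[0:2]

# .loc slices by label and includes the stop label
by_label = staff.loc[1001:1003]

# .loc also accepts a boolean mask and a column selection in one call
well_paid_names = staff.loc[staff["salary"] > 70000, "name"]

print(len(by_position), len(by_label))
print(well_paid_names.tolist())
```

Selecting rows and columns in a single `.loc` call also avoids the chained-indexing pattern `df[mask]['col']`, which can behave unexpectedly on assignment.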

In [None]:
# Advanced filtering
print("=== Advanced Filtering ===")

# Multiple conditions with &, |, ~
high_paid_engineers = df[(df['department'] == 'Engineering') & (df['salary'] > 70000)]
print("High-paid Engineers:")
print(high_paid_engineers[['name', 'salary']])

# Using .isin() for multiple values
tech_sales = df[df['department'].isin(['Engineering', 'Sales'])]
print(f"\nEngineering or Sales employees: {len(tech_sales)}")

# String operations (no first name in this data starts with 'J', so match on the surname)
j_names = df[df['name'].str.split().str[-1].str.startswith('J')]
print("\nEmployees whose surname starts with 'J':")
print(j_names['name'].tolist())
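
The parentheses around each condition matter because `&`, `|` and `~` bind more tightly than comparison operators. As a hedged alternative, the sketch below shows the same filter written with `DataFrame.query`, using a small hypothetical `team` frame rather than the session data.

```python
import pandas as pd

# Hypothetical data for illustration only
team = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "department": ["Engineering", "Marketing", "Engineering", "Sales"],
    "salary": [75000, 65000, 80000, 60000],
})

# Boolean-mask form: each comparison needs its own parentheses
mask_result = team[(team["department"] == "Engineering") & (team["salary"] > 70000)]

# query() form: the condition is a string and uses and/or/not
query_result = team.query("department == 'Engineering' and salary > 70000")

# ~ inverts a mask, here selecting everyone outside Engineering
non_engineers = team[~team["department"].isin(["Engineering"])]

print(mask_result.equals(query_result))
print(non_engineers["name"].tolist())
```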

### 1.3 Data Aggregation and Grouping

In [None]:
# Basic aggregations
print("=== Basic Aggregations ===")
print(f"Average salary: ${df['salary'].mean():,.2f}")
print(f"Median salary: ${df['salary'].median():,.2f}")
print(f"Salary standard deviation: ${df['salary'].std():,.2f}")
print(f"Total employees: {len(df)}")

# Descriptive statistics
print("\nDescriptive statistics for numerical columns:")
print(df.describe())

In [None]:
# GroupBy operations (very important for data analysis)
print("=== GroupBy Operations ===")

# Group by department
dept_stats = df.groupby('department').agg({
    'salary': ['mean', 'median', 'count'],
    'performance_score': 'mean'
})

print("Statistics by department:")
print(dept_stats)

# Flatten column names for easier access
dept_stats.columns = ['_'.join(col).strip() for col in dept_stats.columns]
print("\nWith flattened column names:")
print(dept_stats)
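
Flattening the column MultiIndex by hand works, but pandas also offers named aggregation, which produces flat column names directly. The following is a sketch on a hypothetical `team` frame; the output column names (`salary_mean`, `headcount`, and so on) are chosen here for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only
team = pd.DataFrame({
    "department": ["Engineering", "Engineering", "Marketing", "Sales"],
    "salary": [75000, 80000, 65000, 60000],
    "performance_score": [4.2, 4.5, 3.8, 3.9],
})

# Named aggregation: new_column=(source_column, aggregation)
dept_summary = team.groupby("department").agg(
    salary_mean=("salary", "mean"),
    salary_median=("salary", "median"),
    headcount=("salary", "count"),
    performance_mean=("performance_score", "mean"),
)

print(dept_summary)
```

Because the result already has single-level columns, values can be read with plain labels such as `dept_summary.loc['Engineering', 'salary_mean']`.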

In [None]:
# More groupby examples
print("=== Advanced GroupBy ===")

# Create salary bands
df['salary_band'] = pd.cut(df['salary'], 
                          bins=[0, 60000, 70000, 80000, float('inf')],
                          labels=['Low', 'Medium', 'High', 'Very High'])

print("Salary bands:")
print(df[['name', 'salary', 'salary_band']])

# Cross-tabulation
cross_tab = pd.crosstab(df['department'], df['salary_band'])
print("\nCross-tabulation of department vs salary band:")
print(cross_tab)
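
One detail worth checking with `pd.cut` is which side of each bin is closed. By default the bins are right-closed, so a salary of exactly 60000 lands in the lower band. The sketch below reuses the same bin edges and labels on three boundary values to compare the default with `right=False`.

```python
import pandas as pd

salaries = pd.Series([60000, 70000, 80000])  # values sitting exactly on bin edges
bins = [0, 60000, 70000, 80000, float("inf")]
labels = ["Low", "Medium", "High", "Very High"]

# Default: intervals are (a, b], so an edge value belongs to the lower band
right_closed = pd.cut(salaries, bins=bins, labels=labels)

# right=False: intervals are [a, b), so an edge value belongs to the upper band
left_closed = pd.cut(salaries, bins=bins, labels=labels, right=False)

print(right_closed.tolist())
print(left_closed.tolist())
```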

## Part 2: NumPy Arrays and Mathematical Operations (45 minutes)

### 2.1 NumPy Array Creation and Properties

In [None]:
# Different ways to create arrays
print("=== Array Creation ===")

# From lists
arr1 = np.array([1, 2, 3, 4, 5])
print(f"From list: {arr1}")
print(f"Type: {type(arr1)}, Shape: {arr1.shape}, Dtype: {arr1.dtype}")

# 2D array
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"\n2D array:\n{arr2d}")
print(f"Shape: {arr2d.shape}, Dimensions: {arr2d.ndim}")

# Using built-in functions
zeros = np.zeros((3, 4))
ones = np.ones((2, 3))
identity = np.eye(3)
print(f"\nZeros array:\n{zeros}")
print(f"\nIdentity matrix:\n{identity}")

In [None]:
# Range-based arrays
print("=== Range-based Arrays ===")

# arange: similar to range() but returns array
range_arr = np.arange(0, 10, 2)
print(f"arange(0, 10, 2): {range_arr}")

# linspace: evenly spaced numbers
linear = np.linspace(0, 1, 5)
print(f"linspace(0, 1, 5): {linear}")

# Random arrays
np.random.seed(42)
random_uniform = np.random.uniform(0, 1, 5)
random_normal = np.random.normal(0, 1, 5)
print(f"Random uniform: {random_uniform}")
print(f"Random normal: {random_normal}")

### 2.2 Array Operations and Broadcasting

In [None]:
# Element-wise operations
print("=== Element-wise Operations ===")

a = np.array([1, 2, 3, 4, 5])
b = np.array([2, 3, 4, 5, 6])

print(f"a: {a}")
print(f"b: {b}")
print(f"a + b: {a + b}")
print(f"a * b: {a * b}")
print(f"a ** 2: {a ** 2}")
print(f"np.sqrt(a): {np.sqrt(a)}")

# Comparison operations
print(f"\na > 3: {a > 3}")
print(f"a[a > 3]: {a[a > 3]}")

In [None]:
# Broadcasting - NumPy's powerful feature
print("=== Broadcasting ===")

# Scalar with array
arr = np.array([1, 2, 3, 4, 5])
result1 = arr + 10
print(f"Array + scalar: {arr} + 10 = {result1}")

# Arrays with different shapes
matrix = np.array([[1, 2, 3], [4, 5, 6]])
vector = np.array([10, 20, 30])

print(f"\nMatrix:\n{matrix}")
print(f"Vector: {vector}")
print(f"Matrix + Vector:\n{matrix + vector}")

# Column vector broadcasting
col_vector = np.array([[100], [200]])
print(f"\nColumn vector:\n{col_vector}")
print(f"Matrix + Column vector:\n{matrix + col_vector}")
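
Broadcasting compares shapes from the trailing dimension backwards: two dimensions are compatible when they are equal or when one of them is 1. As a small sketch of that rule, the code below checks a few shape pairs with `np.broadcast_shapes` (this assumes NumPy 1.20 or newer) and shows what happens when shapes do not line up.

```python
import numpy as np

matrix = np.ones((2, 3))
row = np.ones(3)           # shape (3,), treated as (1, 3)
column = np.ones((2, 1))   # shape (2, 1)

# Compatible pairs: the result takes the larger size in each dimension
print(np.broadcast_shapes(matrix.shape, row.shape))
print(np.broadcast_shapes(matrix.shape, column.shape))
print(np.broadcast_shapes(column.shape, row.shape))

# Incompatible pair: trailing dimensions are 3 and 2, and neither is 1
mismatch_error = None
try:
    matrix + np.ones(2)
except ValueError as err:
    mismatch_error = str(err)
    print(f"Incompatible shapes: {err}")
```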

In [None]:
# Statistical operations
print("=== Statistical Operations ===")

data = np.random.normal(100, 15, (4, 5))  # 4x5 matrix
print(f"Data matrix (4x5):\n{data.round(2)}")

print(f"\nOverall statistics:")
print(f"Mean: {data.mean():.2f}")
print(f"Std: {data.std():.2f}")
print(f"Min: {data.min():.2f}")
print(f"Max: {data.max():.2f}")

print(f"\nAxis-wise operations:")
print(f"Mean along axis 0 (columns): {data.mean(axis=0).round(2)}")
print(f"Mean along axis 1 (rows): {data.mean(axis=1).round(2)}")
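
Here `axis=0` collapses the rows and returns one value per column, while `axis=1` collapses the columns and returns one value per row. A common use of this, combined with broadcasting, is standardizing each column. The sketch below uses a hypothetical `scores` matrix and a local random generator so it does not disturb the seed set earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(100, 15, size=(4, 5))  # hypothetical 4x5 data matrix

# One mean and one standard deviation per column, each of shape (5,)
col_means = scores.mean(axis=0)
col_stds = scores.std(axis=0)

# Shape (4, 5) minus shape (5,) broadcasts across the rows
standardized = (scores - col_means) / col_stds

# keepdims=True keeps shape (4, 1), so row means broadcast across the columns
row_means = scores.mean(axis=1, keepdims=True)
row_centered = scores - row_means

print(col_means.shape, row_means.shape)
print(standardized.mean(axis=0).round(2))
```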

### 2.3 Array Reshaping and Indexing

In [None]:
# Reshaping arrays
print("=== Array Reshaping ===")

original = np.arange(12)
print(f"Original array: {original}")
print(f"Shape: {original.shape}")

# Reshape to 2D
reshaped = original.reshape(3, 4)
print(f"\nReshaped (3x4):\n{reshaped}")

# Reshape to 3D
reshaped_3d = original.reshape(2, 2, 3)
print(f"\nReshaped (2x2x3):\n{reshaped_3d}")

# Flatten back to 1D
flattened = reshaped.flatten()
print(f"\nFlattened: {flattened}")
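
Two reshaping details are worth a quick sketch: passing `-1` lets NumPy infer one dimension, and `flatten()` always returns a copy whereas `ravel()` returns a view when the data is contiguous (as it is in this example). The variable names below are illustrative.

```python
import numpy as np

original = np.arange(12)

# -1 asks NumPy to infer the missing dimension (12 / 3 = 4 columns)
grid = original.reshape(3, -1)

flat_view = grid.ravel()     # a view here, because the data is contiguous
flat_copy = grid.flatten()   # always an independent copy

flat_copy[0] = 99   # changes only the copy
flat_view[1] = -1   # writes through to grid and original

print(grid.shape)
print(original[:3])
print(np.shares_memory(flat_view, original), np.shares_memory(flat_copy, original))
```

Reaching for `flatten()` is the safer default when the result will be modified; `ravel()` avoids the copy when the array is only read.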

In [None]:
# Advanced indexing
print("=== Advanced Indexing ===")

matrix = np.arange(20).reshape(4, 5)
print(f"Matrix (4x5):\n{matrix}")

# Slicing
print(f"\nFirst two rows, first three columns:\n{matrix[:2, :3]}")
print(f"\nLast row: {matrix[-1, :]}")
print(f"\nLast column: {matrix[:, -1]}")

# Boolean indexing
mask = matrix > 10
print(f"\nElements > 10: {matrix[mask]}")

# Fancy indexing
rows = [0, 2]
cols = [1, 3, 4]
print(f"\nSelected elements (rows {rows}, cols {cols}):\n{matrix[np.ix_(rows, cols)]}")
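
The `np.ix_` helper is needed because passing two index lists directly pairs them element by element instead of selecting a block. A short sketch of the difference, using the same 4x5 layout:

```python
import numpy as np

matrix = np.arange(20).reshape(4, 5)

# Two lists are paired up: this picks elements (0, 1) and (2, 3)
paired = matrix[[0, 2], [1, 3]]

# np.ix_ builds an open mesh: rows 0 and 2 crossed with columns 1 and 3
block = matrix[np.ix_([0, 2], [1, 3])]

print(paired)
print(block)
```

With paired indexing both lists must have the same length (or be broadcastable), which is why rows `[0, 2]` with columns `[1, 3, 4]` only works through `np.ix_`.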

## Part 3: Data Loading from Various Sources (30 minutes)

### 3.1 Loading Data from Files

In [None]:
# Create sample CSV data first
sample_data = {
    'product_id': range(1, 101),
    'product_name': [f'Product_{i}' for i in range(1, 101)],
    'category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home'], 100),
    'price': np.random.uniform(10, 500, 100).round(2),
    'stock_quantity': np.random.randint(0, 100, 100),
    'rating': np.random.uniform(1, 5, 100).round(1)
}

df_products = pd.DataFrame(sample_data)

# Save to CSV (we'll load it back)
df_products.to_csv('sample_products.csv', index=False)
print("Sample CSV file created: sample_products.csv")

# Display first few rows
print("\nSample data:")
print(df_products.head())

In [None]:
# Loading CSV files
print("=== Loading CSV Files ===")

# Basic loading
df_loaded = pd.read_csv('sample_products.csv')
print(f"Loaded DataFrame shape: {df_loaded.shape}")
print(f"Columns: {list(df_loaded.columns)}")

# Loading with specific parameters
df_custom = pd.read_csv('sample_products.csv', 
                       usecols=['product_name', 'category', 'price'],
                       nrows=10)
print("\nLoaded with custom parameters (first 10 rows, selected columns):")
print(df_custom)

# Check data types
print("\nData types:")
print(df_loaded.dtypes)
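
Type inference in `read_csv` is convenient but can silently change data, for example by stripping leading zeros from codes. The sketch below reads a small in-memory CSV (the `orders` data and column names are hypothetical) and sets `dtype` and `parse_dates` explicitly.

```python
import io

import pandas as pd

# Hypothetical CSV content held in memory instead of a file
csv_text = """order_id,product_code,price,order_date
1,007,19.99,2023-01-15
2,042,5.50,2023-02-20
3,100,N/A,2023-03-10
"""

orders = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"product_code": str},   # keep codes as text so leading zeros survive
    parse_dates=["order_date"],    # parse this column as datetime
)

print(orders.dtypes)
print(orders)
```

The string `N/A` is among the markers pandas treats as missing by default, so `price` is read as a float column with one missing value.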

In [None]:
# Create and load JSON data
import json

# Create sample JSON
json_data = {
    'customers': [
        {'id': 1, 'name': 'John Doe', 'email': 'john@email.com', 'orders': [101, 102]},
        {'id': 2, 'name': 'Jane Smith', 'email': 'jane@email.com', 'orders': [103]},
        {'id': 3, 'name': 'Bob Johnson', 'email': 'bob@email.com', 'orders': [104, 105, 106]}
    ]
}

# Save to JSON
with open('sample_customers.json', 'w') as f:
    json.dump(json_data, f)

# Load JSON data (the top-level 'customers' key becomes a single column of dicts)
df_json = pd.read_json('sample_customers.json')
print("JSON data loaded:")
print(df_json)

# Normalize nested JSON
from pandas import json_normalize
df_normalized = json_normalize(json_data['customers'])
print("\nNormalized JSON data:")
print(df_normalized)
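
After normalizing, the `orders` column still holds a list per customer. One possible next step, sketched here on a trimmed-down copy of the customer records, is `DataFrame.explode`, which produces one row per order (`ignore_index` assumes pandas 1.1 or newer; the `order_id` column name is chosen for illustration).

```python
import pandas as pd

# Trimmed copy of the customer records used above
customers = [
    {"id": 1, "name": "John Doe", "orders": [101, 102]},
    {"id": 2, "name": "Jane Smith", "orders": [103]},
    {"id": 3, "name": "Bob Johnson", "orders": [104, 105, 106]},
]

flat = pd.json_normalize(customers)

# One row per list element; the other columns are repeated for each order
per_order = (
    flat.explode("orders", ignore_index=True)
        .rename(columns={"orders": "order_id"})
)

print(per_order)
```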

### 3.2 Loading from URLs and APIs

In [None]:
# Loading data from URL (if internet connection available)
try:
    # Example: Loading Iris dataset from UCI repository
    url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    
    # Column names for Iris dataset
    iris_columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
    
    iris_df = pd.read_csv(url, names=iris_columns)
    print("Iris dataset loaded from URL:")
    print(iris_df.head())
    print(f"Shape: {iris_df.shape}")
    
except Exception as e:
    print(f"Could not load from URL: {e}")
    print("This might be due to internet connectivity. We'll create a local Iris dataset instead.")
    
    # Fall back to the copy of Iris bundled with scikit-learn
    from sklearn.datasets import load_iris
    iris_sklearn = load_iris()
    
    # Reuse iris_columns so both branches produce the same column names
    iris_df = pd.DataFrame(iris_sklearn.data, columns=iris_columns[:4])
    iris_df['species'] = iris_sklearn.target_names[iris_sklearn.target]
    print("\nIris dataset loaded from scikit-learn:")
    print(iris_df.head())

## Part 4: Introduction to the Iris Dataset (30 minutes)

### 4.1 Understanding the Iris Dataset

In [None]:
# Load the complete Iris dataset
from sklearn.datasets import load_iris

# Load data
iris_data = load_iris()
iris_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_df['species'] = iris_data.target_names[iris_data.target]
iris_df['target'] = iris_data.target

print("=== Iris Dataset Overview ===")
print(f"Dataset shape: {iris_df.shape}")
print(f"Features: {iris_data.feature_names}")
print(f"Classes: {iris_data.target_names}")
print(f"\nDataset description:")
print(iris_data.DESCR[:500] + "...")

In [None]:
# Explore the dataset structure
print("=== Dataset Exploration ===")
print("First 5 rows:")
print(iris_df.head())

print("\nLast 5 rows:")
print(iris_df.tail())

print("\nDataset info:")
iris_df.info()  # info() prints directly and returns None, so it should not be wrapped in print()

print("\nBasic statistics:")
print(iris_df.describe())

In [None]:
# Explore species distribution
print("=== Species Analysis ===")
species_counts = iris_df['species'].value_counts()
print("Species distribution:")
print(species_counts)

print("\nPercentage distribution:")
print((species_counts / len(iris_df) * 100).round(2))

# Statistics by species
print("\nMean values by species:")
species_stats = iris_df.groupby('species').mean()
print(species_stats)

### 4.2 Basic Data Visualization

In [None]:
# Create visualizations for the Iris dataset
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Iris Dataset - Basic Visualizations', fontsize=16)

# Histogram of sepal length
axes[0, 0].hist(iris_df['sepal length (cm)'], bins=20, alpha=0.7, color='skyblue')
axes[0, 0].set_title('Sepal Length Distribution')
axes[0, 0].set_xlabel('Sepal Length (cm)')
axes[0, 0].set_ylabel('Frequency')

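# Box plot of petal length by species
species_names = iris_df['species'].unique()
petal_by_species = [iris_df.loc[iris_df['species'] == name, 'petal length (cm)']
                    for name in species_names]
axes[0, 1].boxplot(petal_by_species)
# Set tick labels separately: the boxplot `labels` keyword is deprecated in newer Matplotlib releases
axes[0, 1].set_xticklabels(species_names)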
axes[0, 1].set_title('Petal Length by Species')
axes[0, 1].set_ylabel('Petal Length (cm)')

# Scatter plot
for species in iris_df['species'].unique():
    species_data = iris_df[iris_df['species'] == species]
    axes[1, 0].scatter(species_data['sepal length (cm)'], 
                       species_data['sepal width (cm)'], 
                       label=species, alpha=0.7)
axes[1, 0].set_title('Sepal Length vs Width')
axes[1, 0].set_xlabel('Sepal Length (cm)')
axes[1, 0].set_ylabel('Sepal Width (cm)')
axes[1, 0].legend()

# Bar chart of species counts
species_counts.plot(kind='bar', ax=axes[1, 1], color=['coral', 'lightgreen', 'lightblue'])
axes[1, 1].set_title('Species Count')
axes[1, 1].set_ylabel('Count')
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Advanced visualization with Seaborn
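# Pairplot to show all feature relationships.
# pairplot creates its own figure, so a preceding plt.figure() would only leave an empty one.
# 'target' is dropped because 'species' already encodes the class.
sns.pairplot(iris_df.drop(columns='target'), hue='species', markers=["o", "s", "D"])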
plt.suptitle('Iris Dataset - Pairwise Feature Relationships', y=1.02)
plt.show()

# Correlation heatmap
plt.figure(figsize=(8, 6))
numeric_cols = iris_df.select_dtypes(include=[np.number])
correlation_matrix = numeric_cols.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Heatmap')
plt.show()

## Part 5: Basic Data Cleaning Techniques (30 minutes)

### 5.1 Handling Missing Data

In [None]:
# Create dataset with missing values for practice
np.random.seed(42)
n_samples = 50

messy_data = pd.DataFrame({
    'name': [f'Person_{i}' for i in range(n_samples)],
    'age': np.random.randint(18, 65, n_samples),
    'income': np.random.normal(50000, 15000, n_samples),
    'score': np.random.uniform(0, 100, n_samples)
})

# Introduce missing values
missing_indices = np.random.choice(n_samples, size=10, replace=False)
messy_data.loc[missing_indices[:5], 'age'] = np.nan
messy_data.loc[missing_indices[5:], 'income'] = np.nan

print("=== Dataset with Missing Values ===")
print(f"Dataset shape: {messy_data.shape}")
print(f"Missing values per column:")
print(messy_data.isnull().sum())

print("\nFirst 10 rows:")
print(messy_data.head(10))

In [None]:
# Different strategies for handling missing data
print("=== Missing Data Strategies ===")

# Strategy 1: Drop rows with any missing values
clean_drop = messy_data.dropna()
print(f"After dropping rows with missing values: {clean_drop.shape}")

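# Strategy 2: Fill with mean/median
clean_fill = messy_data.copy()
# Assign the result back: fillna(inplace=True) on a selected column is deprecated
# and does not modify the DataFrame under Copy-on-Write
clean_fill['age'] = clean_fill['age'].fillna(clean_fill['age'].median())
clean_fill['income'] = clean_fill['income'].fillna(clean_fill['income'].mean())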

print(f"After filling with mean/median: {clean_fill.shape}")
print(f"Missing values after filling: {clean_fill.isnull().sum().sum()}")

# Strategy 3: Forward fill (useful for time series)
# .ffill() replaces the deprecated fillna(method='ffill'); a missing value in the first row stays missing
clean_ffill = messy_data.ffill()
print(f"Missing values after forward fill: {clean_ffill.isnull().sum().sum()}")
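
When different columns need different fill values, `fillna` also accepts a dictionary, which keeps the imputation in one call. The sketch below uses a tiny hypothetical `survey` frame and additionally records which ages were imputed, since that information is lost once the gaps are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap per column
survey = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [42000.0, 58000.0, np.nan, 50000.0],
})

# Per-column fill values: median for age, mean for income
fill_values = {
    "age": survey["age"].median(),
    "income": survey["income"].mean(),
}

# fillna returns a new DataFrame; the original is left untouched
filled = survey.fillna(fill_values)

# Keep a flag so later analysis can tell observed from imputed values
filled["age_was_missing"] = survey["age"].isna()

print(filled)
```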

### 5.2 Data Type Conversions and Validation

In [None]:
# Create dataset with mixed data types
mixed_data = pd.DataFrame({
    'id': ['001', '002', '003', '004', '005'],
    'price': ['10.50', '25.99', 'N/A', '15.75', '30.00'],
    'quantity': ['5', '10', '3', '7', '12'],
    'date': ['2023-01-15', '2023-02-20', '2023-03-10', '2023-04-05', '2023-05-12'],
    'is_active': ['True', 'False', 'True', 'True', 'False']
})

print("=== Data Type Conversion ===")
print("Original data types:")
print(mixed_data.dtypes)
print("\nOriginal data:")
print(mixed_data)

In [None]:
# Convert data types
cleaned_data = mixed_data.copy()

# Convert price to numeric (handle 'N/A')
cleaned_data['price'] = pd.to_numeric(cleaned_data['price'], errors='coerce')

# Convert quantity to integer
cleaned_data['quantity'] = cleaned_data['quantity'].astype(int)

# Convert date to datetime
cleaned_data['date'] = pd.to_datetime(cleaned_data['date'])

# Convert boolean string to boolean
cleaned_data['is_active'] = cleaned_data['is_active'].map({'True': True, 'False': False})

print("After conversion:")
print(cleaned_data.dtypes)
print("\nCleaned data:")
print(cleaned_data)
print(f"\nMissing values: {cleaned_data.isnull().sum().sum()}")
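
Conversions such as `astype(int)` raise an error on the first value that cannot be parsed, and a `map` over booleans quietly turns unexpected spellings into missing values. As a hedged sketch of the validation side, the code below converts a hypothetical `raw` frame and then lists the original values that failed to convert. It uses the nullable `Int64` dtype so integers and missing values can share a column.

```python
import pandas as pd

# Hypothetical raw input containing values that cannot be converted cleanly
raw = pd.DataFrame({
    "quantity": ["5", "ten", "3"],
    "is_active": ["True", "false", "yes"],
})

# Coerce failures to missing, then store as nullable integers
quantity = pd.to_numeric(raw["quantity"], errors="coerce").astype("Int64")

# Normalize case before mapping; unmapped spellings become missing
is_active = raw["is_active"].str.lower().map({"true": True, "false": False})

# Report the original values that did not survive conversion
bad_quantity = raw.loc[quantity.isna(), "quantity"].tolist()
bad_flags = raw.loc[is_active.isna(), "is_active"].tolist()

print(quantity.dtype)
print(f"Unparseable quantities: {bad_quantity}")
print(f"Unrecognized flags: {bad_flags}")
```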

### 5.3 Practical Exercise: Clean the Iris Dataset

In [None]:
# Let's artificially introduce some issues to the Iris dataset for cleaning practice
iris_messy = iris_df.copy()

# Introduce some issues
# 1. Add some missing values
iris_messy.loc[10:12, 'sepal length (cm)'] = np.nan

# 2. Add some outliers
iris_messy.loc[5, 'petal length (cm)'] = 100  # Unrealistic value

# 3. Add inconsistent species names
iris_messy.loc[20:22, 'species'] = 'SETOSA'  # Uppercase
iris_messy.loc[80:82, 'species'] = 'versicolor '  # Extra space

print("=== Iris Dataset Cleaning Exercise ===")
print("Issues introduced:")
print(f"Missing values: {iris_messy.isnull().sum().sum()}")
print(f"Unique species (should be 3): {iris_messy['species'].nunique()}")
print(f"Species values: {iris_messy['species'].unique()}")
print(f"Max petal length: {iris_messy['petal length (cm)'].max()}")

In [None]:
# Clean the dataset
iris_clean = iris_messy.copy()

# 1. Handle missing values - fill with median
for column in iris_clean.select_dtypes(include=[np.number]).columns:
    if iris_clean[column].isnull().any():
        median_value = iris_clean[column].median()
        iris_clean[column] = iris_clean[column].fillna(median_value)
        print(f"Filled missing values in {column} with median: {median_value:.2f}")

# 2. Handle outliers - cap values outside 1.5 * IQR (this also clips genuine extremes, e.g. in sepal width)
for column in ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']:
    Q1 = iris_clean[column].quantile(0.25)
    Q3 = iris_clean[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = iris_clean[(iris_clean[column] < lower_bound) | (iris_clean[column] > upper_bound)]
    if len(outliers) > 0:
        print(f"Found {len(outliers)} outliers in {column}")
        iris_clean[column] = iris_clean[column].clip(lower_bound, upper_bound)

# 3. Standardize species names
iris_clean['species'] = iris_clean['species'].str.lower().str.strip()

print(f"\nAfter cleaning:")
print(f"Missing values: {iris_clean.isnull().sum().sum()}")
print(f"Unique species: {iris_clean['species'].nunique()}")
print(f"Species values: {iris_clean['species'].unique()}")
print(f"Max petal length: {iris_clean['petal length (cm)'].max():.2f}")
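
Clipping changes the data, so it can help to flag outliers first and decide afterwards what to do with them. The following is a minimal sketch of the same 1.5 × IQR rule wrapped in a helper; `iqr_bounds` is a hypothetical function name and `petal_lengths` is made-up data with one planted extreme value.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up measurements with one unrealistic value
petal_lengths = pd.Series([1.4, 1.5, 1.3, 1.6, 1.4, 100.0])

lower, upper = iqr_bounds(petal_lengths)

# Flag rather than modify, so the decision stays visible
is_outlier = (petal_lengths < lower) | (petal_lengths > upper)

# Capping remains available as a separate, explicit step
capped = petal_lengths.clip(lower, upper)

print(f"Fences: {lower:.3f} to {upper:.3f}")
print(petal_lengths[is_outlier].tolist())
```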

## Summary and Homework

### What We Covered Today:
1. ✅ Advanced Pandas operations: DataFrame creation, indexing, grouping, aggregation
2. ✅ NumPy arrays: creation, operations, broadcasting, statistical functions
3. ✅ Data loading from various sources: CSV, JSON, URLs
4. ✅ Introduction to the Iris dataset: exploration and visualization
5. ✅ Basic data cleaning: missing values, data types, outliers

### Key Concepts Mastered:
- **DataFrame manipulation** with advanced indexing and filtering
- **NumPy array operations** and broadcasting
- **Data loading** from multiple sources
- **Exploratory Data Analysis** techniques
- **Data cleaning** strategies

### Homework Before Next Session:
1. **Complete the data cleaning exercise** above if not finished
2. **Explore the Iris dataset** further - try different visualizations
3. **Practice loading data** from different file formats (create your own CSV/JSON files)
4. **Read about data visualization** - we'll create comprehensive EDA visualizations next session

### Next Session Preview:
**Session 3: Data Visualisation & EDA + Iris Classification Project**
- Advanced data visualization with Matplotlib and Seaborn
- Comprehensive EDA methodology
- Building your first machine learning model with the Iris dataset
- GitHub project documentation

### Additional Resources:
- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [NumPy Documentation](https://numpy.org/doc/)
- [Iris Dataset Information](https://archive.ics.uci.edu/ml/datasets/iris)
- [Data Cleaning Guide](https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b)