### **Module 2: Data Wrangling and Manipulation**
Students will learn how to select, filter, sort, transform, and de-duplicate data for analysis.

#### **Topics:**
- **Selecting and Filtering Data:**
  - Conditional selection (`boolean indexing`).
  - Filtering rows/columns based on conditions.
  
- **Sorting Data:**
  - Sorting by values or index using `sort_values()` and `sort_index()`.
  
- **Renaming Columns:**
  - Renaming columns using `rename()`.
  
- **Adding/Removing Columns:**
  - Creating new columns from existing data.
  - Dropping columns or rows using `drop()`.

- **Data Transformation:**
  - Applying functions with `apply()` and `map()`.
  - Using lambda functions for custom transformations.
  
- **Handling Duplicates:**
  - Detecting and removing duplicate rows using `drop_duplicates()`.

#### **Hands-on Lab:**
- Filter a dataset based on specific conditions (e.g., values greater than a threshold).
- Sort a dataset by multiple columns.
- Create new columns by applying transformations to existing ones.

---

## **Pandas Data Wrangling and Manipulation Notes with Hands-on Exercises**

### **Overview:**
In this module, you will learn how to select, filter, sort, transform, and de-duplicate data using `pandas`. These operations are essential for cleaning and preparing data before analysis.

---

### **1. Selecting and Filtering Data**
#### **Concepts:**
- **Boolean Indexing:** Pass a Boolean Series (one `True`/`False` value per row) to a DataFrame to keep only the rows where the condition is `True`.
- **Filtering Rows and Columns:** Apply conditions to extract subsets of data.

#### **Key Methods:**
- `df[condition]` – Filter rows based on condition.
- `df.loc[]` – Select rows and columns using labels.
- `df.iloc[]` – Select rows and columns using integer positions.
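
The three methods above can be compared side by side. The following is a minimal sketch on a small sample table (the variable names are illustrative); it also shows how to combine two conditions and how to match against a list of values with `isin()`.

```python
import pandas as pd

# Small sample table, mirroring the student data used in this section
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Score': [85, 78, 92, 70, 88],
    'Passed': [True, True, True, False, True],
})

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
high_and_passed = df[(df['Score'] > 80) & (df['Passed'])]

# .loc: filter rows with a condition and pick columns by label
names_and_scores = df.loc[df['Score'] > 80, ['Name', 'Score']]

# .iloc: first two rows and first two columns, by integer position
top_left = df.iloc[:2, :2]

# isin(): keep rows whose value appears in a list
selected = df[df['Name'].isin(['Bob', 'Eva'])]

print(high_and_passed)
print(names_and_scores)
print(top_left)
print(selected)
```

Python's `and`/`or` keywords do not work on a Series, which is why the element-wise operators `&` and `|` are used here.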

#### **Example:**

In [1]:
import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Score': [85, 78, 92, 70, 88],
    'Passed': [True, True, True, False, True]
}
df = pd.DataFrame(data)

# Filter rows where score is greater than 80
filtered_df = df[df['Score'] > 80]
print(filtered_df)

      Name  Score  Passed
0    Alice     85    True
2  Charlie     92    True
4      Eva     88    True


#### **Real-World Example:**
**Use Case:** An e-commerce company wants to filter customers who spent more than $500 in a single transaction.

In [2]:
import pandas as pd

# Sample e-commerce dataset
data = {
    'CustomerID': [101, 102, 103, 104, 105],
    'TransactionAmount': [250, 600, 300, 750, 200],
    'Location': ['NY', 'LA', 'SF', 'NY', 'LA']
}
df = pd.DataFrame(data)

# Filter customers who spent more than $500
high_spending_customers = df[df['TransactionAmount'] > 500]
print(high_spending_customers)

   CustomerID  TransactionAmount Location
1         102                600       LA
3         104                750       NY


#### **Exercise 1:**
1. Load a dataset (e.g., a CSV file).
2. Filter the data to display only rows where values in a specific column exceed a threshold.

#### **Exercise 2:**
1. Filter rows where employees have more than 10 years of experience.
2. Filter rows where orders were placed from a specific city (e.g., "Chicago").


### **2. Sorting Data**
#### **Concepts:**
- Sorting helps in organizing data for better readability and analysis.
- Sort by specific columns or indices.

#### **Key Methods:**
- `df.sort_values(by='column_name')` – Sort rows based on column values.
- `df.sort_index()` – Sort rows by index.
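
Sorting by more than one column is a common need. The sketch below assumes a hypothetical employee table (the column names and values are made up for illustration) and sorts it by department in ascending order, then by salary in descending order within each department.

```python
import pandas as pd

# Hypothetical employee records; names and values are illustrative only
employees = pd.DataFrame({
    'Department': ['Sales', 'IT', 'Sales', 'IT'],
    'Name': ['Dana', 'Carl', 'Abel', 'Bea'],
    'Salary': [52000, 61000, 58000, 64000],
})

# Pass lists to 'by' and 'ascending' to control each sort key separately:
# Department A-Z, then Salary highest-first within each department
by_dept_salary = employees.sort_values(
    by=['Department', 'Salary'],
    ascending=[True, False],
)
print(by_dept_salary)

# sort_values() returns a new DataFrame and keeps the original index labels,
# so sort_index() brings the rows back to their original order
restored = by_dept_salary.sort_index()
print(restored)
```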

#### **Example:**

In [3]:
import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Score': [85, 78, 92, 70, 88],
    'Passed': [True, True, True, False, True]
}

# Creating a DataFrame
df = pd.DataFrame(data)

# Sort by Score (ascending)
sorted_df = df.sort_values(by='Score')

# Sort by Name (descending)
sorted_df_desc = df.sort_values(by='Name', ascending=False)
print(sorted_df_desc)

      Name  Score  Passed
4      Eva     88    True
3    David     70   False
2  Charlie     92    True
1      Bob     78    True
0    Alice     85    True


#### **Real-World Example:**
**Use Case:** A hotel booking system wants to sort customer bookings by check-in date.

In [4]:
# Sample booking dataset
bookings = {
    'BookingID': [201, 202, 203, 204, 205],
    'CheckInDate': ['2025-01-12', '2025-01-10', '2025-01-15', '2025-01-11', '2025-01-09'],
    'RoomType': ['Suite', 'Deluxe', 'Standard', 'Deluxe', 'Suite']
}
df_bookings = pd.DataFrame(bookings)

# Sort by check-in date (the dates are strings, but 'YYYY-MM-DD' text sorts in chronological order)
sorted_bookings = df_bookings.sort_values(by='CheckInDate')
print(sorted_bookings)

   BookingID CheckInDate  RoomType
4        205  2025-01-09     Suite
1        202  2025-01-10    Deluxe
3        204  2025-01-11    Deluxe
0        201  2025-01-12     Suite
2        203  2025-01-15  Standard
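
The example above sorts correctly because `YYYY-MM-DD` strings happen to sort in date order. With other date formats, string sorting can give the wrong order. The following sketch assumes hypothetical bookings stored as day/month/year text and converts them with `pd.to_datetime()` before sorting.

```python
import pandas as pd

# Hypothetical bookings with dates stored as day/month/year text
bookings = pd.DataFrame({
    'BookingID': [201, 202, 203],
    'CheckInDate': ['12/01/2025', '05/02/2025', '30/12/2024'],
})

# Convert the text to real datetime values; 'format' states how the text is laid out
bookings['CheckInDate'] = pd.to_datetime(bookings['CheckInDate'], format='%d/%m/%Y')

# Sorting now follows the calendar rather than the characters in the string
sorted_bookings = bookings.sort_values(by='CheckInDate')
print(sorted_bookings)
```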


#### **Exercise 3:**
1. Sort a list of products based on their sales numbers in descending order.
2. Sort employee records by department and then by name alphabetically.

#### **Exercise 4:**
1. Sort the dataset based on multiple columns.
2. Sort in descending order by one column and ascending order by another.

### **3. Renaming Columns**
#### **Concepts:**
- Renaming is useful when you want to make column names more descriptive, readable, or consistent.

#### **Key Methods:**
- `df.rename(columns={'old_name': 'new_name'})` – Rename specific columns.
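
Besides a dictionary, `rename()` also accepts a function that is applied to every column name, which helps when enforcing a naming convention. A minimal sketch, assuming a hypothetical inventory table:

```python
import pandas as pd

# Hypothetical inventory table; column names are illustrative only
inventory = pd.DataFrame({
    'Product Name': ['Pen', 'Notebook'],
    'Price': [1.5, 4.0],
    'In Stock': [120, 45],
})

# Rename a single column with a mapping; columns not listed stay unchanged
with_units = inventory.rename(columns={'Price': 'Price (USD)'})
print(with_units.columns.tolist())

# Pass a function to rename every column: lowercase, spaces replaced by underscores
snake_case = inventory.rename(columns=lambda name: name.lower().replace(' ', '_'))
print(snake_case.columns.tolist())
```

In both calls `rename()` returns a new DataFrame, so `inventory` itself keeps its original column names.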

#### **Example:**

In [5]:
# Rename column 'Score' to 'Exam Score'
renamed_df = df.rename(columns={'Score': 'Exam Score'})
print(renamed_df.head())

      Name  Exam Score  Passed
0    Alice          85    True
1      Bob          78    True
2  Charlie          92    True
3    David          70   False
4      Eva          88    True


#### **Real-World Example:**
**Use Case:** A company wants to rename columns for a payroll report to follow the required format.

In [6]:
# Sample payroll data
payroll_data = {
    'EmpID': [1, 2, 3, 4],
    'SalaryUSD': [50000, 70000, 65000, 58000],
    'TaxRate': [0.20, 0.22, 0.18, 0.21]
}
df_payroll = pd.DataFrame(payroll_data)

# Rename columns
renamed_df = df_payroll.rename(columns={'EmpID': 'EmployeeID', 'SalaryUSD': 'AnnualSalary', 'TaxRate': 'TaxPercentage'})
print(renamed_df)

   EmployeeID  AnnualSalary  TaxPercentage
0           1         50000           0.20
1           2         70000           0.22
2           3         65000           0.18
3           4         58000           0.21


#### **Exercise 5:**
1. Rename columns in a student grade report to more descriptive names.
2. Rename columns in a product inventory to include units (e.g., "Price" to "Price (USD)").

#### **Exercise 6:**
1. Rename columns to have consistent naming conventions (e.g., all lowercase).
2. Rename multiple columns at once.

### **4. Adding and Removing Columns**
#### **Concepts:**
- Add new columns derived from existing columns.
- Remove unnecessary columns or rows.

#### **Key Methods:**
- `df['new_column'] = ...` – Add a new column.
- `df.drop(columns=['column_name'])` – Drop specific columns.
- `df.drop(index=[index_label])` – Drop specific rows by index label.
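
New columns are often computed directly from existing ones, and rows are often removed by condition rather than by label. The sketch below assumes a hypothetical sales table (all names and values are illustrative).

```python
import pandas as pd

# Hypothetical sales table; names and values are illustrative only
sales = pd.DataFrame({
    'Product': ['Pen', 'Notebook', 'Stapler'],
    'Quantity': [10, 3, 0],
    'Price': [1.5, 4.0, 7.25],
    'LegacyCode': ['X1', 'X2', 'X3'],
})

# Arithmetic between columns works element-wise, row by row
sales['Revenue'] = sales['Quantity'] * sales['Price']

# Drop a column that is no longer needed
sales = sales.drop(columns=['LegacyCode'])

# "Drop rows by condition" is usually written as keeping the rows you want
sold = sales[sales['Quantity'] > 0]

# Drop a row by its index label
without_first = sales.drop(index=[0])

print(sales)
print(sold)
print(without_first)
```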

#### **Example:**

In [7]:
# Add a new column 'Grade'
df['Grade'] = ['A', 'B', 'A+', 'C', 'B+']

# Drop the column 'Passed'
df = df.drop(columns=['Passed'])
print(df)

      Name  Score Grade
0    Alice     85     A
1      Bob     78     B
2  Charlie     92    A+
3    David     70     C
4      Eva     88    B+


#### **Real-World Example:**
**Use Case:** An airline adds a column indicating whether a flight is domestic or international based on destination.

In [8]:
# Sample flight data
flights = {
    'FlightNumber': ['AA123', 'BA456', 'UA789'],
    'Destination': ['New York', 'London', 'Chicago'],
    'Duration': [180, 420, 200]
}
df_flights = pd.DataFrame(flights)

# Add new column for flight type
df_flights['FlightType'] = df_flights['Destination'].apply(lambda x: 'International' if x == 'London' else 'Domestic')
print(df_flights)
print("\n")
# Remove Duration column
df_flights = df_flights.drop(columns=['Duration'])
print(df_flights)

  FlightNumber Destination  Duration     FlightType
0        AA123    New York       180       Domestic
1        BA456      London       420  International
2        UA789     Chicago       200       Domestic


  FlightNumber Destination     FlightType
0        AA123    New York       Domestic
1        BA456      London  International
2        UA789     Chicago       Domestic


#### **Exercise 7:**
1. Add a new column based on the values of other columns (e.g., calculate a percentage).
2. Drop rows based on a condition.

#### **Exercise 8:**
1. Add a column in a sales dataset to calculate total revenue (quantity * price).
2. Remove columns related to legacy data that are no longer needed.

### **5. Data Transformation**
#### **Concepts:**
- Applying functions to transform data.
- Lambda functions allow for custom, inline transformations.

#### **Key Methods:**
- `df.apply(func)` – Apply a function to each column (the default, `axis=0`) or to each row (`axis=1`).
- `df['column'].map(func)` – Apply a function to each value of a single column (a Series).
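
To make the difference between the two methods concrete, here is a minimal sketch on a hypothetical table of city readings (the column names and the rating-to-label mapping are assumptions made for illustration). It shows `map()` with a function, `map()` with a dictionary, and `apply()` with `axis=1`.

```python
import pandas as pd

# Hypothetical readings; names and values are illustrative only
readings = pd.DataFrame({
    'City': ['Oslo', 'Cairo', 'Lima'],
    'TempC': [0, 35, 20],
    'Rating': [2, 5, 3],
})

# map() with a function: called once per value in the column
readings['TempF'] = readings['TempC'].map(lambda c: c * 9 / 5 + 32)

# map() with a dictionary: each value is looked up and replaced
rating_labels = {1: 'Low', 2: 'Low', 3: 'Medium', 4: 'High', 5: 'High'}
readings['RatingLabel'] = readings['Rating'].map(rating_labels)

# apply() with axis=1: the function receives one row at a time,
# so it can read several columns together
readings['Summary'] = readings.apply(
    lambda row: f"{row['City']}: {row['RatingLabel']}", axis=1
)
print(readings)
```

Values that are missing from the dictionary become `NaN`, so it is worth checking that the mapping covers every value in the column.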

#### **Example:**

In [9]:
# Convert scores to pass/fail status
def pass_fail(score):
    return 'Pass' if score >= 75 else 'Fail'

df['Status'] = df['Score'].apply(pass_fail)
print(df)
print("\n")
# Using lambda function to increase scores by 5%
df['Score_Updated'] = df['Score'].map(lambda x: x * 1.05)
print(df)

      Name  Score Grade Status
0    Alice     85     A   Pass
1      Bob     78     B   Pass
2  Charlie     92    A+   Pass
3    David     70     C   Fail
4      Eva     88    B+   Pass


      Name  Score Grade Status  Score_Updated
0    Alice     85     A   Pass          89.25
1      Bob     78     B   Pass          81.90
2  Charlie     92    A+   Pass          96.60
3    David     70     C   Fail          73.50
4      Eva     88    B+   Pass          92.40


#### **Real-World Example:**
**Use Case:** A healthcare system needs to calculate BMI (Body Mass Index) for patients based on their height and weight.

In [10]:
# Sample healthcare dataset
patients = {
    'PatientID': [101, 102, 103],
    'WeightKg': [70, 85, 60],
    'HeightM': [1.75, 1.82, 1.60]
}
df_patients = pd.DataFrame(patients)

# Add BMI column
df_patients['BMI'] = df_patients.apply(lambda row: row['WeightKg'] / (row['HeightM'] ** 2), axis=1)
print(df_patients)

   PatientID  WeightKg  HeightM        BMI
0        101        70     1.75  22.857143
1        102        85     1.82  25.661152
2        103        60     1.60  23.437500
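
Row-wise `apply()` is easy to read, but for simple arithmetic the same result can usually be obtained with column operations, which tend to be faster on large datasets. A sketch of that alternative, using the same patient data and rounding to one decimal place as an added assumption:

```python
import pandas as pd

# Same patient data as in the example above
df_patients = pd.DataFrame({
    'PatientID': [101, 102, 103],
    'WeightKg': [70, 85, 60],
    'HeightM': [1.75, 1.82, 1.60],
})

# Column arithmetic computes BMI for every row at once, without apply()
df_patients['BMI'] = (df_patients['WeightKg'] / df_patients['HeightM'] ** 2).round(1)
print(df_patients)
```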


#### **Exercise 9:**
1. Create a new column that applies a mathematical operation on existing columns.
2. Use `apply()` to categorize values based on conditions.

#### **Exercise 10:**
1. Apply a transformation to categorize customer ratings into "Low", "Medium", and "High".
2. Use `map()` to convert temperatures from Celsius to Fahrenheit.

### **6. Handling Duplicates**
#### **Concepts:**
- Detecting and removing duplicate rows.

#### **Key Methods:**
- `df.duplicated()` – Returns a Boolean Series that marks each row repeating an earlier row as `True`.
- `df.drop_duplicates()` – Removes duplicate rows, keeping the first occurrence by default.
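
By default both methods compare entire rows. In practice, duplicates are often defined by a few key columns only. The sketch below assumes a hypothetical event registration table and uses the `subset` and `keep` parameters to control what counts as a duplicate and which copy is kept.

```python
import pandas as pd

# Hypothetical registrations; names and values are illustrative only
registrations = pd.DataFrame({
    'Email': ['a@example.com', 'b@example.com', 'a@example.com', 'c@example.com'],
    'Event': ['Workshop', 'Workshop', 'Workshop', 'Keynote'],
    'SignupDate': ['2025-01-01', '2025-01-02', '2025-01-05', '2025-01-03'],
})

# Rows 0 and 2 differ in SignupDate, so a full-row check finds no duplicates
full_row_dupes = registrations.duplicated().sum()

# subset: compare only the columns that define a duplicate
same_person_event = registrations.duplicated(subset=['Email', 'Event'])

# keep='last': keep the last copy of each duplicate group (here, the later signup)
latest = registrations.drop_duplicates(subset=['Email', 'Event'], keep='last')

# keep=False: mark every copy, which is useful for inspecting duplicates before removal
all_copies = registrations[registrations.duplicated(subset=['Email', 'Event'], keep=False)]

print(full_row_dupes)
print(same_person_event)
print(latest)
print(all_copies)
```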

#### **Example:**

In [11]:
# Sample DataFrame with duplicates
data_dup = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Eva'],
    'Score': [85, 78, 92, 85, 88]
}
df_dup = pd.DataFrame(data_dup)

# Detect duplicates
print(df_dup.duplicated())
print("\n")
# Remove duplicates
df_no_duplicates = df_dup.drop_duplicates()
print(df_no_duplicates)

0    False
1    False
2    False
3     True
4    False
dtype: bool


      Name  Score
0    Alice     85
1      Bob     78
2  Charlie     92
4      Eva     88


#### **Real-World Example:**
**Use Case:** A streaming service wants to ensure no duplicate subscription records exist in their system.

In [12]:
# Sample subscription data
subscriptions = {
    'UserID': [1001, 1002, 1003, 1001],
    'SubscriptionPlan': ['Premium', 'Basic', 'Premium', 'Premium'],
    'DateJoined': ['2025-01-01', '2025-01-02', '2025-01-03', '2025-01-01']
}
df_subscriptions = pd.DataFrame(subscriptions)

# Detect duplicates
print(df_subscriptions.duplicated())
print("\n")
# Remove duplicates
df_unique_subscriptions = df_subscriptions.drop_duplicates()
print(df_unique_subscriptions)

0    False
1    False
2    False
3     True
dtype: bool


   UserID SubscriptionPlan  DateJoined
0    1001          Premium  2025-01-01
1    1002            Basic  2025-01-02
2    1003          Premium  2025-01-03


#### **Exercise 11:**
1. Identify duplicate rows in your dataset.
2. Remove duplicates and verify the shape of the DataFrame before and after.

#### **Exercise 12:**
1. Identify duplicate entries in an event registration dataset.
2. Remove duplicate product listings from an inventory and count the number of unique products.

### **Summary:**
- Filtering and selecting data allows you to isolate relevant rows or columns.
- Sorting helps order the dataset for better analysis.
- Renaming columns improves the readability of the dataset.
- Adding/removing columns is useful for creating new features or removing irrelevant data.
- Data transformation allows you to apply custom logic to modify data.
- Handling duplicates ensures that data does not have unnecessary redundancy.

By practicing these concepts and exercises, you will gain a strong foundation in data wrangling and manipulation using `pandas`.