# Indian College Data Cleaning Assignment

## Objective
This assignment focuses on **data cleaning with the pandas `apply()` method**. You'll work with a messy Indian college dataset that has a range of data quality issues.

## Learning Goals
- Master the `apply()` function with lambda expressions
- Handle different data types and formats
- Clean and standardize messy data
- Deal with missing values and outliers
- Practice string manipulation and data validation

## Dataset Description
The dataset contains information about 500+ college students with the following columns:
- `student_id`: Student identification number
- `name`: Student name
- `email`: Email address
- `phone`: Contact number
- `age`: Age of student
- `gender`: Gender
- `course`: Course enrolled (B.Tech, M.Tech, etc.)
- `department`: Department name
- `semester`: Current semester
- `cgpa`: Cumulative Grade Point Average
- `attendance`: Attendance percentage
- `city`: City name
- `state`: State name
- `admission_date`: Date of admission
- `fees_paid`: Whether fees have been paid
- `hostel`: Whether student stays in hostel

---

## Step 0: Load the Dataset

In [None]:
import pandas as pd
import numpy as np

# Load the messy dataset
df = pd.read_csv('../Datasets/indian_college_messy_data.csv')

print(f"Dataset shape: {df.shape}")
print("\nFirst 10 rows:")
display(df.head(10))
print("\nData types:")
print(df.dtypes)
print("\nMissing values:")
print(df.isnull().sum())

---
# ASSIGNMENT QUESTIONS
---

## Question 1: Clean Student Names

**Problem:** The `name` column has inconsistent capitalization (UPPERCASE, lowercase, Title Case).

**Task:** Use `apply()` to convert all names to Title Case (the first letter of each word capitalized, the rest lowercase).

**Expected Output:** All names should be in format like "Rahul Kumar" or "Priya Sharma"

**Hint:** Use the `.title()` string method inside apply()
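
As a rough illustration of the hint, the sketch below applies `.title()` to a small made-up Series (the names and the `title_names` variable are illustrative, not part of the assignment dataset). The `isinstance` guard is an added assumption so that missing names pass through unchanged.

```python
import pandas as pd

# Toy data, invented for illustration only
names = pd.Series(["RAHUL KUMAR", "priya sharma", None])

# Title-case string values; leave anything that is not a string untouched
title_names = names.apply(lambda x: x.title() if isinstance(x, str) else x)

print(title_names.tolist())
```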

In [None]:
# Your code here
# df['name'] = df['name'].apply(lambda x: ...)


# Check result
print("Sample cleaned names:")
print(df['name'].head(10))

## Question 2: Clean and Standardize Email Addresses

**Problem:** Email addresses have:
- Leading/trailing spaces
- Mixed case (should be lowercase)
- Some missing values

**Task:** Use `apply()` to:
1. Convert emails to lowercase
2. Remove leading/trailing spaces
3. Keep NaN values as they are

**Hint:** Check if value is not null before processing using `pd.notna()` or `pd.isna()`
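
The null check in the hint can be sketched on toy data as follows. The sample addresses and the `clean_emails` name are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
emails = pd.Series(["  Rahul.K@Example.COM ", "PRIYA@example.com", np.nan])

# Strip whitespace and lowercase only when the value is present
clean_emails = emails.apply(lambda x: x.strip().lower() if pd.notna(x) else x)

print(clean_emails.tolist())
```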

In [None]:
# Your code here
# df['email'] = df['email'].apply(lambda x: ...)


# Check result
print("Sample cleaned emails:")
print(df['email'].head(10))

## Question 3: Extract Clean Phone Numbers

**Problem:** Phone numbers have various formats:
- With country code: +91-9876543210
- With parentheses: (9876543210)
- With spaces: 987 654 3210
- Mixed formats

**Task:** Use `apply()` to extract only the 10-digit phone number (remove all special characters, spaces, and country code).

**Hint:** You can use string methods like `.replace()` to remove characters, or use regular expressions
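
A minimal regular-expression sketch of this idea, on toy data: it removes every non-digit character and then keeps the last ten digits. It assumes that any country code appears as a prefix, so the final ten digits are the phone number; `clean_phone` is a hypothetical helper name.

```python
import re

import numpy as np
import pandas as pd

def clean_phone(value):
    """Return the last 10 digits of a phone value, or the value itself if missing."""
    if pd.isna(value):
        return value
    digits = re.sub(r"\D", "", str(value))  # drop everything except digits
    return digits[-10:]                     # assumption: country code is a prefix

# Toy data, invented for illustration only
phones = pd.Series(["+91-9876543210", "(9876543210)", "987 654 3210", np.nan])
clean_phones = phones.apply(clean_phone)

print(clean_phones.tolist())
```

Values stored as floats (for example `9876543210.0`) would need extra handling, since the trailing `0` after the decimal point would be counted as a digit.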

In [None]:
# Your code here
# df['phone'] = df['phone'].apply(lambda x: ...)


# Check result
print("Sample cleaned phone numbers:")
print(df['phone'].head(10))
print(f"\nAll phone numbers have 10 digits: {df['phone'].dropna().apply(lambda x: len(str(x)) == 10).all()}")

## Question 4: Fix Age Column

**Problem:** The `age` column has:
- String values that should be integers
- Negative values
- Unrealistic values (e.g., age > 60 for college students)

**Task:** Use `apply()` to:
1. Convert all ages to integers
2. Take the absolute value if negative
3. If age > 40, treat it as an error and subtract 50 (these values were inflated by 50 in the raw data)
4. Check that every cleaned age falls in the valid range of 17-30

**Hint:** Use conditional logic inside apply function
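
When the logic has several branches, a named function is often easier to read than a long lambda. The sketch below shows that pattern on toy data; `clean_age` is a hypothetical helper, and it assumes there are no missing ages (`int()` would raise on NaN).

```python
import pandas as pd

def clean_age(value):
    """Convert to int, drop the sign, and undo the +50 inflation for values above 40."""
    age = abs(int(float(value)))  # float() first so strings like "25.0" also parse
    if age > 40:
        age -= 50
    return age

# Toy data, invented for illustration only
ages = pd.Series(["21", -19, "72", 25.0])
clean_ages = ages.apply(clean_age)

print(clean_ages.tolist())
```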

In [None]:
# Your code here
# df['age'] = df['age'].apply(lambda x: ...)


# Check result
print("Age statistics after cleaning:")
print(df['age'].describe())
print(f"\nAll ages are integers: {pd.api.types.is_integer_dtype(df['age'])}")
print(f"Age range: {df['age'].min()} to {df['age'].max()}")

## Question 5: Standardize Gender Values

**Problem:** The `gender` column has multiple representations:
- Male: 'Male', 'M', 'male', 'MALE', 'Man'
- Female: 'Female', 'F', 'female', 'FEMALE', 'Woman'
- Other: 'Other', 'O', 'other', 'Non-Binary'

**Task:** Use `apply()` to standardize all gender values to just three categories: 'Male', 'Female', 'Other'

**Hint:** Convert to uppercase first, then use conditional checks or a mapping dictionary
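
The mapping-dictionary approach might look like the sketch below. The `gender_map` keys simply mirror the variants listed above in uppercase; the toy values and the `standardize_gender` helper are illustrative, and unknown values are passed through unchanged as a design assumption.

```python
import pandas as pd

# Uppercase variant -> standard label
gender_map = {
    "MALE": "Male", "M": "Male", "MAN": "Male",
    "FEMALE": "Female", "F": "Female", "WOMAN": "Female",
    "OTHER": "Other", "O": "Other", "NON-BINARY": "Other",
}

def standardize_gender(value):
    """Normalize case and whitespace, then look the value up in gender_map."""
    if pd.isna(value):
        return value
    key = str(value).strip().upper()
    return gender_map.get(key, value)  # unknown values are left as they are

# Toy data, invented for illustration only
genders = pd.Series(["M", "female", "WOMAN", "Non-Binary", " male "])
clean_genders = genders.apply(standardize_gender)

print(clean_genders.tolist())
```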

In [None]:
# Your code here
# df['gender'] = df['gender'].apply(lambda x: ...)


# Check result
print("Gender value counts after standardization:")
print(df['gender'].value_counts())

## Question 6: Clean CGPA Values

**Problem:** The `cgpa` column has:
- String values (should be float)
- Values > 10 (out of range)
- Negative values
- Missing values

**Task:** Use `apply()` to:
1. Convert to float
2. Take absolute value if negative
3. If value > 10, subtract 5 (since original was increased by 5)
4. Round to 2 decimal places
5. Keep NaN as NaN

**Hint:** Use `pd.notna()` to check for non-null values before processing
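
The five steps can be sketched as a small helper applied to toy data. `clean_cgpa` and the sample values are hypothetical; the sketch assumes every non-missing value can be parsed by `float()`.

```python
import numpy as np
import pandas as pd

def clean_cgpa(value):
    """Apply the steps in order: NaN check, float, abs, undo the +5 shift, round."""
    if pd.isna(value):
        return np.nan
    cgpa = abs(float(value))
    if cgpa > 10:
        cgpa -= 5
    return round(cgpa, 2)

# Toy data, invented for illustration only
raw_cgpa = pd.Series(["8.456", -7.2, 13.75, np.nan])
clean_cgpas = raw_cgpa.apply(clean_cgpa)

print(clean_cgpas.tolist())
```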

In [None]:
# Your code here
# df['cgpa'] = df['cgpa'].apply(lambda x: ...)


# Check result
print("CGPA statistics after cleaning:")
print(df['cgpa'].describe())
print(f"\nCGPA data type: {df['cgpa'].dtype}")
print(f"CGPA range after cleaning (expected 5.0 to 10.0): {df['cgpa'].dropna().min()} to {df['cgpa'].dropna().max()}")

## Question 7: Clean Attendance Percentage

**Problem:** The `attendance` column has:
- Values with '%' symbol (e.g., '85.5%')
- Values > 100
- Mixed types (string and numeric)
- Missing values

**Task:** Use `apply()` to:
1. Remove '%' symbol if present
2. Convert to float
3. Cap values at 100 (if > 100, set to 100)
4. Round to 1 decimal place
5. Keep NaN as NaN

**Hint:** Check if value is string and contains '%', then remove it
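
A possible shape for this cleaning function is sketched below on toy data. `clean_attendance` is a hypothetical name, and `min(pct, 100.0)` is just one way to express the cap.

```python
import numpy as np
import pandas as pd

def clean_attendance(value):
    """Strip a trailing '%', convert to float, cap at 100, round to 1 decimal."""
    if pd.isna(value):
        return np.nan
    if isinstance(value, str):
        value = value.replace("%", "").strip()
    pct = float(value)
    return round(min(pct, 100.0), 1)

# Toy data, invented for illustration only
raw_attendance = pd.Series(["85.5%", 104.2, "92", np.nan])
clean_att = raw_attendance.apply(clean_attendance)

print(clean_att.tolist())
```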

In [None]:
# Your code here
# df['attendance'] = df['attendance'].apply(lambda x: ...)


# Check result
print("Attendance statistics after cleaning:")
print(df['attendance'].describe())
print(f"\nMax attendance: {df['attendance'].max()}%")

## Question 8: Expand State Abbreviations

**Problem:** The `state` column has a mix of full names and abbreviations:
- 'MH', 'Maharashtra'
- 'KA', 'Karnataka'
- etc.

**Task:** Use `apply()` to convert all state abbreviations to full names using the mapping below.

```python
state_map = {
    'MH': 'Maharashtra', 'KA': 'Karnataka', 'TN': 'Tamil Nadu',
    'DL': 'Delhi', 'GJ': 'Gujarat', 'WB': 'West Bengal',
    'TS': 'Telangana', 'RJ': 'Rajasthan', 'UP': 'Uttar Pradesh',
    'KL': 'Kerala'
}
```

**Hint:** Use the dictionary's `.get()` method with the original value as default
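
To see why `.get()` with a default is convenient here, consider the toy sketch below. It uses a deliberately shortened mapping (`small_map`, a hypothetical name) so that it does not depend on the full `state_map`; values without an entry fall back to themselves.

```python
import pandas as pd

# Shortened mapping for illustration; the assignment uses the full state_map
small_map = {"MH": "Maharashtra", "KA": "Karnataka"}

# Toy data, invented for illustration only
states = pd.Series(["MH", "Karnataka", "KA", "Goa"])

# dict.get(key, default) returns the default when the key is missing,
# so full names and unmapped values are returned unchanged
expanded = states.apply(lambda x: small_map.get(x, x))

print(expanded.tolist())
```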

In [None]:
# State mapping dictionary
state_map = {
    'MH': 'Maharashtra', 'KA': 'Karnataka', 'TN': 'Tamil Nadu',
    'DL': 'Delhi', 'GJ': 'Gujarat', 'WB': 'West Bengal',
    'TS': 'Telangana', 'RJ': 'Rajasthan', 'UP': 'Uttar Pradesh',
    'KL': 'Kerala'
}

# Your code here
# df['state'] = df['state'].apply(lambda x: ...)


# Check result
print("State value counts after expansion:")
print(df['state'].value_counts())
print(f"\nNo abbreviations remaining: {not (df['state'].str.len() == 2).any()}")

## Question 9: Standardize Boolean Columns

**Problem:** Both `fees_paid` and `hostel` columns have multiple representations of True/False:
- True values: 'Yes', 'yes', 'YES', 'Y', 'y', '1', 1, True, 'True', 'TRUE'
- False values: 'No', 'no', 'NO', 'N', 'n', '0', 0, False, 'False', 'FALSE'

**Task:** Use `apply()` to convert both columns to proper boolean values (True/False).

**Hint:** Convert to string and upper-case it, then check whether it matches one of the true values ('YES', 'Y', '1', 'TRUE')
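
One way to express the true/false test is membership in a set of accepted tokens, sketched below on toy data. `TRUE_TOKENS` and `to_bool` are hypothetical names; under this sketch anything not in the set, including a missing value, becomes `False`, which may or may not be what you want.

```python
import pandas as pd

# Accepted spellings of "true", compared after upper-casing
TRUE_TOKENS = {"YES", "Y", "1", "TRUE"}

def to_bool(value):
    """Return True only when the normalized string form is a known true token."""
    return str(value).strip().upper() in TRUE_TOKENS

# Toy data, invented for illustration only
raw_flags = pd.Series(["Yes", "n", 1, False, "TRUE", "0"])
flags = raw_flags.apply(to_bool)

print(flags.tolist())
```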

In [None]:
# Your code here for fees_paid
# df['fees_paid'] = df['fees_paid'].apply(lambda x: ...)

# Your code here for hostel
# df['hostel'] = df['hostel'].apply(lambda x: ...)


# Check result
print("Fees Paid value counts:")
print(df['fees_paid'].value_counts())
print("\nHostel value counts:")
print(df['hostel'].value_counts())
print(f"\nFees_paid is boolean: {df['fees_paid'].dtype == 'bool'}")
print(f"Hostel is boolean: {df['hostel'].dtype == 'bool'}")

## Question 10: Create Grade Category from CGPA

**Problem:** Create a new column `grade_category` based on CGPA ranges.

**Task:** Use `apply()` to create a new column with the following mapping:
- CGPA >= 9.0: 'Excellent'
- CGPA >= 8.0 and < 9.0: 'Very Good'
- CGPA >= 7.0 and < 8.0: 'Good'
- CGPA >= 6.0 and < 7.0: 'Average'
- CGPA < 6.0: 'Below Average'
- NaN values: 'Not Available'

**Hint:** Use nested if-else conditions or multiple conditions inside apply()
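
Because the thresholds are checked from highest to lowest, each branch only needs a lower bound. The sketch below shows that ordering on toy data; `grade_category` as a function name and the sample values are illustrative.

```python
import numpy as np
import pandas as pd

def grade_category(cgpa):
    """Map a CGPA to its category; thresholds are tested from highest to lowest."""
    if pd.isna(cgpa):
        return "Not Available"
    if cgpa >= 9.0:
        return "Excellent"
    if cgpa >= 8.0:
        return "Very Good"
    if cgpa >= 7.0:
        return "Good"
    if cgpa >= 6.0:
        return "Average"
    return "Below Average"

# Toy data, invented for illustration only
toy_cgpa = pd.Series([9.4, 8.0, 7.25, 6.0, 5.1, np.nan])
categories = toy_cgpa.apply(grade_category)

print(categories.tolist())
```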

In [None]:
# Your code here
# df['grade_category'] = df['cgpa'].apply(lambda x: ...)


# Check result
print("Grade Category distribution:")
print(df['grade_category'].value_counts())
print("\nSample data with CGPA and Grade Category:")
display(df[['name', 'cgpa', 'grade_category']].head(20))

---
## Bonus Challenge: Remove Duplicates

**Task:** Identify and remove duplicate records from the dataset.

**Hint:** Use pandas built-in duplicate handling methods, not apply()
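
For reference, the built-in methods behave as in the toy sketch below: `duplicated()` flags repeats of an earlier row, and `drop_duplicates()` keeps the first occurrence by default. The frame and its values are made up for illustration.

```python
import pandas as pd

# Toy frame with one fully duplicated row (invented for illustration)
toy = pd.DataFrame({
    "student_id": [1, 2, 2, 3],
    "name": ["Asha", "Ravi", "Ravi", "Meera"],
})

n_dupes = int(toy.duplicated().sum())                   # rows identical to an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)  # keep first occurrence, renumber rows

print(n_dupes, len(deduped))
```

Cleaning the columns first generally helps here, since rows that differ only in capitalization or formatting are not counted as duplicates.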

In [None]:
# Check for duplicates
print(f"Total records before removing duplicates: {len(df)}")
print(f"Duplicate records: {df.duplicated().sum()}")

# Your code here to remove duplicates
# df = ...

print(f"\nTotal records after removing duplicates: {len(df)}")

---
## Final Step: Save Cleaned Dataset

After completing all the cleaning tasks, save the cleaned dataset to a new CSV file.

In [None]:
# Save cleaned dataset
df.to_csv('../Datasets/indian_college_cleaned_data.csv', index=False)

print("Cleaned dataset saved successfully!")
print(f"\nFinal dataset shape: {df.shape}")
print("\nData types after cleaning:")
print(df.dtypes)
print("\nMissing values after cleaning:")
print(df.isnull().sum())
print("\nFirst 10 rows of cleaned data:")
display(df.head(10))

---
## Summary Statistics

After cleaning, let's look at some summary statistics to validate our work.

In [None]:
print("=" * 80)
print("SUMMARY STATISTICS - CLEANED DATASET")
print("=" * 80)

print("\n1. Numerical Columns:")
display(df[['age', 'semester', 'cgpa', 'attendance']].describe())

print("\n2. Gender Distribution:")
print(df['gender'].value_counts())

print("\n3. Course Distribution:")
print(df['course'].value_counts())

print("\n4. Department Distribution:")
print(df['department'].value_counts())

print("\n5. State Distribution:")
print(df['state'].value_counts())

print("\n6. Fees Payment Status:")
print(df['fees_paid'].value_counts())

print("\n7. Hostel Status:")
print(df['hostel'].value_counts())

print("\n8. Grade Categories:")
print(df['grade_category'].value_counts())