
# College Placement Data Preprocessing Task

### 🎯 Objective:

Clean and prepare the dataset for further **analysis**, **visualization**, or **machine learning** tasks.

### ✅ Exercise: Preprocess College Placement Data

### 🔶 Step 1: Import Libraries and Load Dataset

In [None]:
# Import libraries
import pandas as pd
import numpy as np

# Load raw dataset
df = pd.read_csv('Placement_Data_Full_Class.csv')

# Show first 5 rows
df.head()


| | Sno | Gender | 10th % | SSC Board | 12th % | HSC Board | 12th Stream | Degree % | Degree stream | Work exp | specialisation | Mba % | status | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | M | 67.0 | Others | 91.0 | Others | Commerce | 58.0 | Sci&Tech | No | Mkt&HR | 58.8 | Placed | 270000.0 |
| 1 | 2 | M | 79.33 | Central | 78.33 | Others | Science | 77.48 | Sci&Tech | Yes | Mkt&Fin | 66.28 | Placed | 200000.0 |
| 2 | 3 | M | 65.0 | Central | 68.0 | Central | Arts | 64.0 | Comm&Mgmt | No | Mkt&Fin | 57.8 | Placed | 250000.0 |
| 3 | 4 | M | 56.0 | Central | 52.0 | Central | Science | 52.0 | Sci&Tech | No | Mkt&HR | 59.43 | Not Placed | NaN |
| 4 | 5 | M | 85.8 | Central | 73.6 | Central | Commerce | 73.3 | Comm&Mgmt | No | Mkt&Fin | 55.5 | Placed | 425000.0 |


### 🔶 Step 2: Explore Dataset Info & Missing Values

In [None]:
# Check structure
df.info()

# Check missing values
df.isnull().sum()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sno             215 non-null    int64  
 1   Gender          215 non-null    object 
 2   10th %          215 non-null    float64
 3   SSC Board       215 non-null    object 
 4   12th %          215 non-null    float64
 5   HSC Board       215 non-null    object 
 6   12th Stream     215 non-null    object 
 7   Degree %        215 non-null    float64
 8   Degree stream   215 non-null    object 
 9   Work exp        215 non-null    object 
 10  specialisation  215 non-null    object 
 11  Mba %           215 non-null    float64
 12  status          215 non-null    object 
 13  salary          148 non-null    float64
dtypes: float64(5), int64(1), object(8)
memory usage: 23.6+ KB


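| Column | Missing values |
|---|---|
| Sno | 0 |
| Gender | 0 |
| 10th % | 0 |
| SSC Board | 0 |
| 12th % | 0 |
| HSC Board | 0 |
| 12th Stream | 0 |
| Degree % | 0 |
| Degree stream | 0 |
| Work exp | 0 |
| specialisation | 0 |
| Mba % | 0 |
| status | 0 |
| salary | 67 |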
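
The `df.info()` output shows 148 non-null `salary` values out of 215 rows, so 67 salaries are missing. Before choosing a fill strategy, it may help to check whether those gaps line up with unplaced students. A minimal sketch on a toy frame (the `sample` rows are made up for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows mimicking the 'status' and 'salary' columns
sample = pd.DataFrame({
    'status': ['Placed', 'Not Placed', 'Placed', 'Not Placed'],
    'salary': [270000.0, np.nan, 250000.0, np.nan],
})

# Count missing salaries within each placement status
missing_by_status = sample['salary'].isnull().groupby(sample['status']).sum()
print(missing_by_status)
```

On the real data, the same two lines applied to `df` would show whether any placed student has an unrecorded salary.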


### 🔶 Step 3: Handle Missing Values

In [None]:
# For simplicity, fill every missing salary with 0
# (missing salaries are assumed to belong to 'Not Placed' students)
df['salary'] = df['salary'].fillna(0)

In [None]:
# Drop rows with any remaining missing values
df.dropna(inplace=True)


In [None]:
# Confirm that no missing values remain
df.isnull().sum()

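| Column | Missing values |
|---|---|
| Sno | 0 |
| Gender | 0 |
| 10th % | 0 |
| SSC Board | 0 |
| 12th % | 0 |
| HSC Board | 0 |
| 12th Stream | 0 |
| Degree % | 0 |
| Degree stream | 0 |
| Work exp | 0 |
| specialisation | 0 |
| Mba % | 0 |
| status | 0 |
| salary | 0 |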
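
The fill in this step sets every missing salary to 0, whatever the placement status. If the intent is to touch only unplaced students, one possible variation is to restrict the fill with a boolean mask. This is a sketch on made-up rows, not the notebook's method, and it assumes it runs before `status` is encoded in Step 6:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the third row is a placed student with an unrecorded salary
sample = pd.DataFrame({
    'status': ['Placed', 'Not Placed', 'Placed'],
    'salary': [270000.0, np.nan, np.nan],
})

# Fill with 0 only where the student was not placed
not_placed = sample['status'] == 'Not Placed'
sample.loc[not_placed, 'salary'] = sample.loc[not_placed, 'salary'].fillna(0)

print(sample)
```

The placed student's gap stays as `NaN`, so it can be handled separately instead of being treated as a zero salary.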


### 🔶 Step 4: Create New Feature: Average Academic Score

In [None]:
# Average of 10th, 12th, Degree, and MBA scores
df['avg_academic_score'] = df[['10th %', '12th %', 'Degree %', 'Mba %']].mean(axis=1)
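
As a quick sanity check of the row-wise mean, the first row shown in Step 1 can be worked by hand: (67.0 + 91.0 + 58.0 + 58.8) / 4 = 68.7. The sketch below reproduces that on a one-row frame (the `row` variable is introduced here only for illustration):

```python
import pandas as pd

# First row of the dataset, as displayed by df.head()
row = pd.DataFrame({
    '10th %': [67.0],
    '12th %': [91.0],
    'Degree %': [58.0],
    'Mba %': [58.8],
})

# axis=1 averages across the columns of each row
row['avg_academic_score'] = row[['10th %', '12th %', 'Degree %', 'Mba %']].mean(axis=1)
print(round(row.loc[0, 'avg_academic_score'], 2))
```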

### 🔶 Step 5: Standardize Categorical Values (Inconsistent Labels)

In [None]:
# Strip whitespace and title-case labels
# (title-casing also lowercases inner capitals, e.g. 'Mkt&HR' becomes 'Mkt&Hr')
df['SSC Board'] = df['SSC Board'].str.strip().str.title()
df['HSC Board'] = df['HSC Board'].str.strip().str.title()
df['12th Stream'] = df['12th Stream'].str.strip().str.title()
df['Degree stream'] = df['Degree stream'].str.strip().str.title()
df['specialisation'] = df['specialisation'].str.strip().str.title()
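
To see what `.str.strip().str.title()` does to individual labels, here is a small sketch on made-up values (the messy variants are hypothetical; only `Mkt&HR` and `Comm&Mgmt` appear in the data shown above):

```python
import pandas as pd

# Hypothetical messy labels, made up to show the effect of the cleanup
labels = pd.Series([' central', 'OTHERS ', 'Mkt&HR', 'Comm&Mgmt'])

# Remove surrounding whitespace, then capitalize the first letter of each word
cleaned = labels.str.strip().str.title()
print(cleaned.tolist())
```

Here `Mkt&HR` comes out as `Mkt&Hr`, because title-casing treats `&` as a word boundary and lowercases every letter after the first in each word. Any later code that matches on `Mkt&HR` would need to use the new spelling.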

### 🔶 Step 6: Encode Categorical Variables


In [None]:
# Encode gender
df['Gender'] = df['Gender'].map({'M': 0, 'F': 1})

In [None]:
# Encode placement status
df['status'] = df['status'].map({'Not Placed': 0, 'Placed': 1})

In [None]:
# Encode binary work experience
df['Work exp'] = df['Work exp'].map({'No': 0, 'Yes': 1})
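
`Series.map` with a dictionary returns `NaN` for any value that is not a key, and `Gender`, `status`, and `Work exp` were not among the columns cleaned in Step 5. A minimal sketch of a check for unmapped labels, using made-up values (`work_exp` and `mapping` are illustrative names):

```python
import pandas as pd

# Hypothetical labels; 'yes ' (lowercase, trailing space) is not a key in the mapping
work_exp = pd.Series(['No', 'Yes', 'yes '])
mapping = {'No': 0, 'Yes': 1}

encoded = work_exp.map(mapping)

# Labels that the mapping did not cover end up as NaN in the encoded column
unmapped = work_exp[encoded.isnull()].unique()
print(unmapped)
```

An empty result means every label was covered; otherwise the listed labels need cleaning or an extra mapping entry before encoding.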

### 🔶 Step 7: Remove Duplicates (if any)

In [None]:
# Drop exact duplicate rows (the first occurrence is kept)
df.drop_duplicates(inplace=True)
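
The cell above drops duplicates without reporting how many there were. A small sketch of counting them first with `DataFrame.duplicated`, on made-up rows:

```python
import pandas as pd

# Hypothetical rows; the last row repeats the second one exactly
sample = pd.DataFrame({
    'Sno': [1, 2, 2],
    'Mba %': [58.8, 66.28, 66.28],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = sample.duplicated().sum()
deduped = sample.drop_duplicates()

print(f"Duplicate rows: {n_duplicates}, rows kept: {len(deduped)}")
```

Rows count as duplicates only when every column matches, so two records for the same student with different `Sno` values would not be caught by this check.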

### 🔶 Step 8: Detect Outliers (optional)

In [None]:
# Quick check for outliers in salary using the IQR rule
# Note: salary now includes the zeros filled in Step 3, which affect the quartiles
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1

In [None]:
# Flag rows whose salary falls outside the 1.5 * IQR fences
outliers = df[(df['salary'] < Q1 - 1.5 * IQR) | (df['salary'] > Q3 + 1.5 * IQR)]
print(f"Number of salary outliers: {len(outliers)}")


Number of salary outliers: 1
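
Because the missing salaries were filled with 0 in Step 3, those zeros take part in the quartile calculation. The sketch below compares the IQR rule with and without them on made-up salaries; `iqr_outliers` is a helper introduced here for illustration, and the numbers are not from the dataset:

```python
import pandas as pd

# Hypothetical salaries after Step 3: zeros stand for unplaced students
salary = pd.Series(
    [0, 0, 0, 200000, 250000, 260000, 270000, 300000, 400000], dtype=float
)

def iqr_outliers(values):
    """Return the values that fall outside the 1.5 * IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

with_zeros = iqr_outliers(salary)               # zeros widen the IQR
placed_only = iqr_outliers(salary[salary > 0])  # quartiles of real salaries only

print(f"Outliers with zeros: {len(with_zeros)}")
print(f"Outliers among placed only: {len(placed_only)}")
```

In this toy case the zeros pull Q1 down to 0 and widen the fences, so the 400000 salary is flagged only when the zeros are excluded. Whether the same happens on the real data would need to be checked.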


### 🔶 Step 9: Save Cleaned Data for ML / Viz

In [None]:
# Save cleaned dataset
df.to_csv('Cleaned_College_Placement.csv', index=False)
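
A quick way to confirm the export is to read the file back and compare its shape and columns. This sketch uses a temporary directory and a made-up `cleaned` frame so that it runs on its own:

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned rows standing in for the real DataFrame
cleaned = pd.DataFrame({'Sno': [1, 2], 'status': [1, 0], 'salary': [270000.0, 0.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'Cleaned_College_Placement.csv')
    cleaned.to_csv(path, index=False)  # index=False avoids writing an extra index column
    reloaded = pd.read_csv(path)

print(reloaded.shape == cleaned.shape)
print(list(reloaded.columns) == list(cleaned.columns))
```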