## Lesson 2 – DataFrame Operations & Cleaning

You will learn data cleaning, sorting, renaming, NaN handling, type conversions, and functional transformations, all of which are core Pandas skills.

Real-world data is messy.  
In this lesson, we fix missing values, inconsistent formatting, incorrect dtypes, outliers, and duplicates.


Step 1 – Sorting

You can sort by column values or by index.

In [37]:
import pandas as pd
import numpy as np 

print("Pandas version:", pd.__version__)

# Load file
df = pd.read_csv("data/employees.csv")
df

Pandas version: 2.3.3


Unnamed: 0,Name,Age,City,Salary,Tax,Net Salary
0,Piyush,26,Bhopal,55000,5500.0,49500.0
1,Amit,25,Delhi,60000,6000.0,54000.0
2,Sneha,27,Pune,65000,6500.0,58500.0
3,Ravi,24,Mumbai,62000,6200.0,55800.0


In [3]:
# Sort by single column

df.sort_values(by='Age', ascending=True)

Unnamed: 0,Name,Age,City,Salary,Tax,Net Salary
3,Ravi,24,Mumbai,62000,6200.0,55800.0
1,Amit,25,Delhi,60000,6000.0,54000.0
0,Piyush,26,Bhopal,55000,5500.0,49500.0
2,Sneha,27,Pune,65000,6500.0,58500.0


In [4]:
df.sort_values(by='Tax', ascending=False)

Unnamed: 0,Name,Age,City,Salary,Tax,Net Salary
2,Sneha,27,Pune,65000,6500.0,58500.0
3,Ravi,24,Mumbai,62000,6200.0,55800.0
1,Amit,25,Delhi,60000,6000.0,54000.0
0,Piyush,26,Bhopal,55000,5500.0,49500.0


In [5]:
# Sort by multiple columns
df.sort_values(by=['City', 'Age'], ascending=[True, False])

Unnamed: 0,Name,Age,City,Salary,Tax,Net Salary
0,Piyush,26,Bhopal,55000,5500.0,49500.0
1,Amit,25,Delhi,60000,6000.0,54000.0
3,Ravi,24,Mumbai,62000,6200.0,55800.0
2,Sneha,27,Pune,65000,6500.0,58500.0


In [6]:
# Sort by index
df.sort_index()

Unnamed: 0,Name,Age,City,Salary,Tax,Net Salary
0,Piyush,26,Bhopal,55000,5500.0,49500.0
1,Amit,25,Delhi,60000,6000.0,54000.0
2,Sneha,27,Pune,65000,6500.0,58500.0
3,Ravi,24,Mumbai,62000,6200.0,55800.0


Mini Task:
Sort your df by Salary (descending) and show only Name, Salary, and Net Salary columns.

In [7]:
df.sort_values(by=['Salary'], ascending=False)[['Name', 'Salary', 'Net Salary']]

Unnamed: 0,Name,Salary,Net Salary
2,Sneha,65000,58500.0
3,Ravi,62000,55800.0
1,Amit,60000,54000.0
0,Piyush,55000,49500.0
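
The sorting calls above return a new DataFrame and leave the original untouched unless the result is assigned back. The following is a minimal sketch of that behavior, using a small hypothetical frame (`demo`) rather than the lesson's CSV file. It also shows `ignore_index=True`, which renumbers the rows of the sorted result.

```python
import pandas as pd

# Hypothetical sample data mirroring some of the lesson's columns
demo = pd.DataFrame({
    "Name": ["Piyush", "Amit", "Sneha", "Ravi"],
    "Age": [26, 25, 27, 24],
    "Salary": [55000, 60000, 65000, 62000],
})

# sort_values returns a sorted copy; demo itself keeps its original order
by_salary = demo.sort_values(by="Salary", ascending=False)

# ignore_index=True renumbers the rows 0..n-1 in the sorted result
renumbered = demo.sort_values(by="Age", ignore_index=True)

print(by_salary[["Name", "Salary"]])
print(renumbered)
```

To keep a sorted order for later cells, assign the result to a variable as done here.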


Step 2 – Renaming Columns

You’ll often need to rename messy column names.

In [8]:
df.rename(columns={'Salary': 'Gross_Salary', 'Net Salary': 'Net_Salary'}, inplace=True)
df.columns

Index(['Name', 'Age', 'City', 'Gross_Salary', 'Tax', 'Net_Salary'], dtype='object')

Mini Task:
Rename the column Tax → Tax_Amount

In [9]:
df.rename(columns={'Tax': 'Tax_Amount'}, inplace=True)
print(df.columns)

Index(['Name', 'Age', 'City', 'Gross_Salary', 'Tax_Amount', 'Net_Salary'], dtype='object')
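
When many column names are messy, renaming them one by one becomes tedious. The sketch below shows one possible way to tidy all labels at once. The frame `raw` and its headers are hypothetical and are not part of the lesson's data.

```python
import pandas as pd

# Hypothetical frame with untidy headers (no rows needed for this demo)
raw = pd.DataFrame(columns=[" Name ", "Net Salary", "Tax Amount"])

# Strip surrounding whitespace, then replace inner spaces with underscores
raw.columns = raw.columns.str.strip().str.replace(" ", "_")

# rename also accepts a function that is applied to every column label
lowered = raw.rename(columns=str.lower)

print(list(raw.columns))
print(list(lowered.columns))
```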


Step 3 – Handling Missing Data (NaN)

Let’s simulate missing values first:

In [10]:
df.loc[2, 'City'] = np.nan
df.loc[0, 'Gross_Salary'] = np.nan
print(df)

     Name  Age    City  Gross_Salary  Tax_Amount  Net_Salary
0  Piyush   26  Bhopal           NaN      5500.0     49500.0
1    Amit   25   Delhi       60000.0      6000.0     54000.0
2   Sneha   27     NaN       65000.0      6500.0     58500.0
3    Ravi   24  Mumbai       62000.0      6200.0     55800.0


Then handle them:

In [11]:
# Check for missing values
print(df.isna().sum())

Name            0
Age             0
City            1
Gross_Salary    1
Tax_Amount      0
Net_Salary      0
dtype: int64


In [12]:
# Fill missing values (assign the result back instead of using inplace=True on a single column)

df['City'] = df['City'].fillna('Unknown')
df


Unnamed: 0,Name,Age,City,Gross_Salary,Tax_Amount,Net_Salary
0,Piyush,26,Bhopal,,5500.0,49500.0
1,Amit,25,Delhi,60000.0,6000.0,54000.0
2,Sneha,27,Unknown,65000.0,6500.0,58500.0
3,Ravi,24,Mumbai,62000.0,6200.0,55800.0


In [13]:
# Drop rows where Gross_Salary is missing
df.dropna(subset=['Gross_Salary'], inplace=True)
df

Unnamed: 0,Name,Age,City,Gross_Salary,Tax_Amount,Net_Salary
1,Amit,25,Delhi,60000.0,6000.0,54000.0
2,Sneha,27,Unknown,65000.0,6500.0,58500.0
3,Ravi,24,Mumbai,62000.0,6200.0,55800.0


In [14]:
# Fill numeric column with mean
# (no visible effect here: the only row with a missing Gross_Salary was dropped in the previous cell)
df['Gross_Salary'] = df['Gross_Salary'].fillna(df['Gross_Salary'].mean())
df


Unnamed: 0,Name,Age,City,Gross_Salary,Tax_Amount,Net_Salary
1,Amit,25,Delhi,60000.0,6000.0,54000.0
2,Sneha,27,Unknown,65000.0,6500.0,58500.0
3,Ravi,24,Mumbai,62000.0,6200.0,55800.0


Mini Task:
Make one column (your choice) have a few np.nan values, then fill them using the median.

In [15]:
df

Unnamed: 0,Name,Age,City,Gross_Salary,Tax_Amount,Net_Salary
1,Amit,25,Delhi,60000.0,6000.0,54000.0
2,Sneha,27,Unknown,65000.0,6500.0,58500.0
3,Ravi,24,Mumbai,62000.0,6200.0,55800.0


In [16]:
df.loc[2, 'Age'] = np.nan
print(df)

    Name   Age     City  Gross_Salary  Tax_Amount  Net_Salary
1   Amit  25.0    Delhi       60000.0      6000.0     54000.0
2  Sneha   NaN  Unknown       65000.0      6500.0     58500.0
3   Ravi  24.0   Mumbai       62000.0      6200.0     55800.0


In [17]:
df['Age'] = df['Age'].fillna(df['Age'].median())
df


Unnamed: 0,Name,Age,City,Gross_Salary,Tax_Amount,Net_Salary
1,Amit,25.0,Delhi,60000.0,6000.0,54000.0
2,Sneha,24.5,Unknown,65000.0,6500.0,58500.0
3,Ravi,24.0,Mumbai,62000.0,6200.0,55800.0
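
Several columns can also be filled in a single call by passing a dictionary to `DataFrame.fillna`, with one fill value per column. This is a minimal sketch on a hypothetical frame (`demo`), not the lesson's DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in each column
demo = pd.DataFrame({
    "City": ["Bhopal", None, "Pune"],
    "Age": [26.0, np.nan, 24.0],
})

# One fill value per column; the result is assigned back, so no inplace=True is needed
demo = demo.fillna({"City": "Unknown", "Age": demo["Age"].median()})

print(demo)
```

The median is computed before the fill, so it is based only on the values that were actually present.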


Step 4 – Changing Data Types

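You can convert data types for optimization or compatibility. Note that casting a float column to an integer type truncates any fractional part, so the median-filled age of 24.5 becomes 24 below.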

In [18]:
df['Age'] = df['Age'].astype('int32')
df['Gross_Salary'] = df['Gross_Salary'].astype('float')

print(df['Age'].dtypes) 
print(df['Gross_Salary'].dtypes) 

int32
float64


Mini Task:
Convert your Tax_Amount column to float, and confirm types using df.dtypes.

In [19]:
df['Tax_Amount'] = df['Tax_Amount'].astype(float)
print(df['Tax_Amount'].dtypes)

float64
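
`astype` raises an error when a value cannot be converted, and a plain integer dtype cannot hold missing values. The sketch below shows one possible alternative for such cases, assuming a hypothetical text column (`ages`) that contains an unparseable entry.

```python
import pandas as pd

# Hypothetical column that was read as text, with one unparseable entry
ages = pd.Series(["25", "24", "n/a"])

# errors="coerce" turns unparseable values into NaN instead of raising
numeric = pd.to_numeric(ages, errors="coerce")

# The nullable "Int64" dtype (capital I) keeps the missing value as <NA>
as_int = numeric.astype("Int64")

print(numeric)
print(as_int)
```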


Step 5 – Applying Functions

Use .apply() and lambda to transform columns.

In [20]:
df['Age_Group'] = df['Age'].apply(lambda x: 'Senior' if x >=26 else 'Junior')
df['Age_Group']

1    Junior
2    Junior
3    Junior
Name: Age_Group, dtype: object

Mini Task:
Create a new column Salary_Level = “High” if Gross_Salary > 60000, else “Low”.

In [21]:
df['Salary_Level'] = df['Gross_Salary'].apply(lambda x: 'High' if x > 60000 else 'Low')
df['Salary_Level']

1     Low
2    High
3    High
Name: Salary_Level, dtype: object
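
For a simple two-way condition like this one, `np.where` is a common vectorized alternative to `.apply()` with a lambda. The sketch below uses a hypothetical frame (`demo`) with the same salary values as the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame reusing the salary values from the lesson
demo = pd.DataFrame({"Gross_Salary": [60000.0, 65000.0, 62000.0]})

# The condition is evaluated on the whole column at once, not row by row
demo["Salary_Level"] = np.where(demo["Gross_Salary"] > 60000, "High", "Low")

print(demo)
```

`.apply()` remains the more flexible choice when the logic does not reduce to a simple condition.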

Step 6 – Dropping & Reordering Columns

In [None]:
# Drop column

df.drop(columns=['Tax_Amount'], inplace=True)

In [27]:
# Reorder columns
df = df[['Name', 'Age', 'City', 'Gross_Salary', 'Net_Salary', 'Age_Group', 'Salary_Level']]
df

Unnamed: 0,Name,Age,City,Gross_Salary,Net_Salary,Age_Group,Salary_Level
1,Amit,25,Delhi,60000.0,54000.0,Junior,Low
2,Sneha,24,Unknown,65000.0,58500.0,Junior,High
3,Ravi,24,Mumbai,62000.0,55800.0,Junior,High


Mini Task:
Drop any 1 column, then reorder so Name and City appear first.

In [None]:
df.drop(columns=['Age_Group'], inplace=True)



In [30]:
df = df[['Name', 'City', 'Age', 'Gross_Salary', 'Net_Salary', 'Salary_Level']]
df

Unnamed: 0,Name,City,Age,Gross_Salary,Net_Salary,Salary_Level
1,Amit,Delhi,25,60000.0,54000.0,Low
2,Sneha,Unknown,24,65000.0,58500.0,High
3,Ravi,Mumbai,24,62000.0,55800.0,High
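
Typing out every column name works for a small table but becomes error-prone as the table grows. A minimal sketch of moving selected columns to the front while keeping the rest in their existing order, using a hypothetical frame (`demo`) and helper names (`first`, `rest`):

```python
import pandas as pd

# Hypothetical frame with columns in an arbitrary order
demo = pd.DataFrame({
    "Age": [25, 24],
    "Name": ["Amit", "Ravi"],
    "Gross_Salary": [60000.0, 62000.0],
    "City": ["Delhi", "Mumbai"],
})

# Columns that should appear first
first = ["Name", "City"]

# Every other column, kept in its existing order
rest = [col for col in demo.columns if col not in first]

demo = demo[first + rest]
print(list(demo.columns))
```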


Step 7 – Reset and Set Index

In [31]:
# Reset index
df.reset_index(drop=True, inplace=True)
df

Unnamed: 0,Name,City,Age,Gross_Salary,Net_Salary,Salary_Level
0,Amit,Delhi,25,60000.0,54000.0,Low
1,Sneha,Unknown,24,65000.0,58500.0,High
2,Ravi,Mumbai,24,62000.0,55800.0,High


In [32]:
# Set 'Name' as index
df.set_index('Name', inplace=True)
df

Name,City,Age,Gross_Salary,Net_Salary,Salary_Level
Amit,Delhi,25,60000.0,54000.0,Low
Sneha,Unknown,24,65000.0,58500.0,High
Ravi,Mumbai,24,62000.0,55800.0,High


Mini Task:
Set City as index and then reset it back to default.

In [33]:
# Note: set_index replaces the current index, so the 'Name' index is discarded here
df.set_index('City', inplace=True)
df

City,Age,Gross_Salary,Net_Salary,Salary_Level
Delhi,25,60000.0,54000.0,Low
Unknown,24,65000.0,58500.0,High
Mumbai,24,62000.0,55800.0,High


In [34]:
# Note: drop=True discards the 'City' index; omit it to get 'City' back as a column
df.reset_index(drop=True, inplace=True)
df

Unnamed: 0,Age,Gross_Salary,Net_Salary,Salary_Level
0,25,60000.0,54000.0,Low
1,24,65000.0,58500.0,High
2,24,62000.0,55800.0,High
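
Setting and resetting an index can lose data if the old index is not kept. The sketch below contrasts `reset_index()` with `reset_index(drop=True)` on a hypothetical frame (`demo`), to show how a column can make the round trip without being lost.

```python
import pandas as pd

# Hypothetical frame with a default integer index
demo = pd.DataFrame({
    "Name": ["Amit", "Sneha"],
    "City": ["Delhi", "Pune"],
    "Age": [25, 27],
})

# City moves from a regular column to the index
by_city = demo.set_index("City")

# Without drop=True, the index comes back as a column
restored = by_city.reset_index()

# With drop=True, the old index is thrown away
dropped = by_city.reset_index(drop=True)

print(list(restored.columns))
print(list(dropped.columns))
```

Use `drop=True` only when the current index carries no information worth keeping, such as a leftover integer index after filtering.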


Step 8 – Exporting Cleaned Data

Finally, save your cleaned data:

In [35]:
df.to_csv('data/employees_cleaned.csv')
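
By default, `to_csv` writes the index as an unnamed first column, which comes back as an `Unnamed: 0` column when the file is read again. If the index is just the default row numbers, passing `index=False` avoids this. A minimal sketch, assuming a hypothetical frame (`demo`) and an in-memory buffer in place of a file path:

```python
import io

import pandas as pd

# Hypothetical cleaned data
demo = pd.DataFrame({"Age": [25, 24], "Salary_Level": ["Low", "High"]})

# Write to an in-memory buffer (a stand-in for a file path) without the index
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Reading it back yields the same columns, with no extra index column
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(reloaded.columns))
```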
