In [None]:
Step 2: Data Cleaning & Manipulation using Pandas

Data cleaning means finding and fixing incorrect, missing, or duplicated values in a dataset.

With the help of the Pandas library, we can easily clean and manipulate data.

In this step, we will learn four important cleaning techniques:
- Dropping or filling missing data
- Removing duplicate rows
- Renaming columns
- Changing data types

In [6]:
import pandas as pd

data = {
    'Name': ['Rizwan', 'Ali', 'Ahmed', None, 'Zara'],
    'Age': [25, 30, None, 22, 28],
    'City': ['Lahore', None, 'Karachi', 'Islamabad', 'Quetta']
}

df = pd.DataFrame(data)

print("🔹 Original DataFrame:")
print(df)


df_cleaned = df.dropna()

print("\n✅ Cleaned DataFrame (after dropna):")
print(df_cleaned)


🔹 Original DataFrame:
     Name   Age       City
0  Rizwan  25.0     Lahore
1     Ali  30.0       None
2   Ahmed   NaN    Karachi
3    None  22.0  Islamabad
4    Zara  28.0     Quetta

✅ Cleaned DataFrame (after dropna):
     Name   Age    City
0  Rizwan  25.0  Lahore
4    Zara  28.0  Quetta


In [None]:
We dropped every row that contains a missing value (None or NaN).

By default, dropna() keeps only the rows that have no missing values in any column, so only rows 0 and 4 remain.
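Dropping every incomplete row can discard a lot of data. As a minimal sketch, dropna() can be narrowed with its `subset` and `thresh` parameters; the variable names `df_named` and `df_mostly_complete` are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Rizwan', 'Ali', 'Ahmed', None, 'Zara'],
    'Age': [25, 30, None, 22, 28],
    'City': ['Lahore', None, 'Karachi', 'Islamabad', 'Quetta']
})

# Drop a row only when 'Name' is missing; gaps in other columns are kept.
df_named = df.dropna(subset=['Name'])

# Keep rows that have at least 2 non-missing values.
df_mostly_complete = df.dropna(thresh=2)

print(df_named)
print(df_mostly_complete)
```

Here `subset` removes only row 3 (the missing name), while `thresh=2` keeps all five rows because each row is missing at most one value.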

In [8]:
import pandas as pd


data = {
    'Name': ['Rizwan', 'Ali', 'Ahmed', None, 'Zara'],
    'Age': [25, 30, None, 22, 28],
    'City': ['Lahore', None, 'Karachi', 'Islamabad', 'Quetta']
}

df = pd.DataFrame(data)

print("🔹 Original DataFrame:")
print(df)


df_filled = df.fillna({
    'Name': 'Unknown',
    'Age': df['Age'].mean(),  
    'City': 'Unknown City'
})

print("\n✅ DataFrame after fillna:")
print(df_filled)


🔹 Original DataFrame:
     Name   Age       City
0  Rizwan  25.0     Lahore
1     Ali  30.0       None
2   Ahmed   NaN    Karachi
3    None  22.0  Islamabad
4    Zara  28.0     Quetta

✅ DataFrame after fillna:
      Name    Age          City
0   Rizwan  25.00        Lahore
1      Ali  30.00  Unknown City
2    Ahmed  26.25       Karachi
3  Unknown  22.00     Islamabad
4     Zara  28.00        Quetta


In [None]:
By using fillna(), we filled the missing values.
- In the Name column, we added 'Unknown'.
- For the Age column, we used the average (mean) of the non-missing ages, which is 26.25.
- In the City column, we added 'Unknown City'.
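Before choosing fill values, it can help to count how many values are missing in each column. The sketch below also shows the median as one possible alternative to the mean for Age; which statistic suits a given dataset is a judgment call, and the names `missing_counts`, `age_mean`, and `age_median` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Rizwan', 'Ali', 'Ahmed', None, 'Zara'],
    'Age': [25, 30, None, 22, 28],
    'City': ['Lahore', None, 'Karachi', 'Islamabad', 'Quetta']
})

# Count missing values per column before deciding how to fill them.
missing_counts = df.isna().sum()
print(missing_counts)

# mean() skips NaN by default: (25 + 30 + 22 + 28) / 4 = 26.25
age_mean = df['Age'].mean()

# The median is less sensitive to outliers: middle of 22, 25, 28, 30 = 26.5
age_median = df['Age'].median()

# Fill only the Age column; Name and City keep their missing values here.
df_filled = df.fillna({'Age': age_median})
print(df_filled)
```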

In [9]:
import pandas as pd

data = {
    'Name': ['Rizwan', 'Ali', 'Ahmed', 'Zara', 'Sara'],
    'Age': [25, 30, 25, 28, 30],
    'City': ['Lahore', 'Karachi', 'Lahore', 'Quetta', 'Karachi']
}

df = pd.DataFrame(data)

print("🔹 Original DataFrame (with duplicates):")
print(df)


df_no_duplicates = df.drop_duplicates()

print("\n✅ DataFrame after drop_duplicates():")
print(df_no_duplicates)


🔹 Original DataFrame (with duplicates):
     Name  Age     City
0  Rizwan   25   Lahore
1     Ali   30  Karachi
2   Ahmed   25   Lahore
3    Zara   28   Quetta
4    Sara   30  Karachi

✅ DataFrame after drop_duplicates():
     Name  Age     City
0  Rizwan   25   Lahore
1     Ali   30  Karachi
2   Ahmed   25   Lahore
3    Zara   28   Quetta
4    Sara   30  Karachi


In [None]:


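drop_duplicates() is used to remove duplicate rows.

By default it compares every column, so a row is dropped only when it matches an earlier row in all columns. In the above example, "Ahmed" shares his Age and City with "Rizwan", and "Sara" shares hers with "Ali", but the names differ, so no row is an exact duplicate.

That is why the output still contains all five rows.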
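To treat rows as duplicates when only some columns match, drop_duplicates() accepts a `subset` parameter. The following is a small sketch on the same data, assuming we want one row per Age and City combination; `df_unique` and `df_last` are names chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Rizwan', 'Ali', 'Ahmed', 'Zara', 'Sara'],
    'Age': [25, 30, 25, 28, 30],
    'City': ['Lahore', 'Karachi', 'Lahore', 'Quetta', 'Karachi']
})

# Flag rows whose Age and City repeat an earlier row.
is_repeat = df.duplicated(subset=['Age', 'City'])
print(is_repeat)

# Keep the first row of each Age/City combination (rows 0, 1, 3).
df_unique = df.drop_duplicates(subset=['Age', 'City'])
print(df_unique)

# keep='last' keeps the final occurrence instead (rows 2, 3, 4).
df_last = df.drop_duplicates(subset=['Age', 'City'], keep='last')
print(df_last)
```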


In [10]:
import pandas as pd

data = {
    "Name": ["Rizwan", "Ali", "Ahmed"],
    "Age": [22, 25, 24]
}

df = pd.DataFrame(data)


df_renamed = df.rename(columns={"Name": "Full_Name", "Age": "Years"})

print(df_renamed)


  Full_Name  Years
0    Rizwan     22
1       Ali     25
2     Ahmed     24


In [None]:
Here, we used the rename() function to change "Name" to "Full_Name" and "Age" to "Years".

rename() returns a new DataFrame with the updated column names; the original df is left unchanged.
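Because rename() returns a new DataFrame by default, the result has to be stored somewhere to be used. A short sketch of that behavior follows; it also passes a function (`str.lower`) as the mapper, which is one way to rename every column at once.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Rizwan", "Ali", "Ahmed"],
    "Age": [22, 25, 24]
})

# rename() leaves df untouched and returns a renamed copy.
df_renamed = df.rename(columns={"Name": "Full_Name", "Age": "Years"})
print(list(df.columns))          # original names
print(list(df_renamed.columns))  # new names

# Assign the result back to keep the change under the same variable.
# A function can be used as the mapper to transform every column name.
df = df.rename(columns=str.lower)
print(list(df.columns))
```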

In [None]:
The astype() function in Pandas is used to change the data type of a column.

For example, if a column has numbers stored as strings, we can convert them to integers or floats using astype().

This is useful when:
- We want to perform mathematical operations on the column
- We want to ensure data consistency


In [11]:
import pandas as pd


data = {
    'Age': ['22', '25', '30'],  
    'Salary': ['45000', '52000', '60000']
}

df = pd.DataFrame(data)


print("Before:")
print(df.dtypes)

df['Age'] = df['Age'].astype(int)
df['Salary'] = df['Salary'].astype(int)

print("\nAfter:")
print(df.dtypes)


Before:
Age       object
Salary    object
dtype: object

After:
Age       int64
Salary    int64
dtype: object
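In [None]:
astype(int) raises an error if any string in the column is not a valid number. As a hedged sketch, pd.to_numeric() with errors='coerce' turns such values into NaN instead; the 'unknown' entry below is invented for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': ['22', '25', 'unknown'],  # 'unknown' is a made-up bad value
    'Salary': ['45000', '52000', '60000']
})

# Invalid strings become NaN, so the column ends up as float64.
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')

# Salary contains only valid digits, so astype(int) works directly.
df['Salary'] = df['Salary'].astype(int)

print(df)
print(df.dtypes)
```

The resulting NaN in Age can then be handled with dropna() or fillna(), as in the earlier cells.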
