<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Removing Duplicates**


Estimated time needed: **30** minutes


## Introduction


In this lab, you will focus on data wrangling: cleaning and organizing data so that it is suitable for analysis. One key task in this process is removing duplicate entries, that is, rows that appear more than once. Left in place, they can distort counts and averages and lead to inaccurate conclusions.


## Objectives


In this lab you will perform the following:


1. Identify duplicate rows in the dataset.
2. Use suitable techniques to remove duplicate rows and verify the removal.
3. Summarize how to handle missing values appropriately.
4. Use ConvertedCompYearly to normalize compensation data.
   


### Install the Required Libraries


In [1]:
!pip install pandas

Collecting pandas
  Downloading pandas-3.0.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (79 kB)
Collecting numpy>=1.26.0 (from pandas)
  Downloading numpy-2.4.2-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (6.6 kB)
Downloading pandas-3.0.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (10.9 MB)
Downloading numpy-2.4.2-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (16.6 MB)
Installing collected packages: numpy, pandas
Successfully installed numpy-2.4.2 pandas-3.0.0


### Step 1: Import Required Libraries


In [2]:
import pandas as pd

### Step 2: Load the Dataset into a DataFrame



Load the dataset using `pd.read_csv()`.


In [3]:
# Define the URL of the dataset
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"

# Load the dataset into a DataFrame
df = pd.read_csv(file_path)

# Display the first few rows to ensure it loaded correctly
print(df.head())


   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                              Hobby   
1  Hobby;Contribute to open-source projects;Other...   
2  Hobby;Contribute to open-source projects;Other...   
3                                                NaN   
4                                 

**Note: If you are working in a local Jupyter environment, you can pass the URL directly to the <code>pandas.read_csv()</code> function as shown below:**



#df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv")


### Step 3: Identifying Duplicate Rows


**Task 1: Identify Duplicate Rows**
  1. Count the number of duplicate rows in the dataset.
  2. Display the first few duplicate rows to understand their structure.


In [4]:
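## Write your code here
# 1. Count the number of duplicate rows
# (printed explicitly, because a notebook cell only displays its last expression)
print("Number of duplicate rows:", df.duplicated().sum())

# 2. Display the first few duplicate rows
df[df.duplicated()].head()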

Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
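
`duplicated()` compares whole rows by default and flags only the second and later copies. The sketch below uses a small made-up DataFrame (the values are illustrative and not taken from the survey file) to show how the `keep` and `subset` parameters change what gets counted.

```python
import pandas as pd

# Toy data: hypothetical values, not taken from the survey file
toy = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3, 4],
    "Employment": ["Full-time", "Student", "Student", "Full-time", "Full-time"],
})

# Default (keep="first"): the first copy is kept, later copies are flagged
full_dupes = toy.duplicated()
print("Full-row duplicates:", full_dupes.sum())

# keep=False flags every member of a duplicated group, useful for inspection
all_copies = toy[toy.duplicated(keep=False)]
print(all_copies)

# subset= restricts the comparison to the listed columns
employment_dupes = toy.duplicated(subset=["Employment"]).sum()
print("Rows repeating an earlier Employment value:", employment_dupes)
```

In the survey data, a `subset` check on `ResponseId` would be one way to look for repeated respondents, assuming that column is meant to be unique.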


### Step 4: Removing Duplicate Rows


**Task 2: Remove Duplicates**
   1. Remove duplicate rows from the dataset using the drop_duplicates() method.
   2. Verify the removal by counting the number of duplicate rows after removal.


In [5]:
## Write your code here
# Remove duplicates (even though Task 1 showed there are none)
df = df.drop_duplicates()

# Verify removal by counting duplicates again
duplicates_after = df.duplicated().sum()

print("Number of duplicate rows after removal:", duplicates_after)


Number of duplicate rows after removal: 0
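
A duplicate count of zero after removal does not show how many rows were dropped. A minimal sketch on made-up data (the values are hypothetical) compares row counts before and after, resets the index, and shows a key-based variant that uses `subset` and `keep`.

```python
import pandas as pd

# Toy data: hypothetical values, not taken from the survey file
toy = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3],
    "Age": ["18-24 years old", "35-44 years old", "35-44 years old", "45-54 years old"],
})

rows_before = len(toy)

# Drop full-row duplicates and renumber the index so it has no gaps
deduped = toy.drop_duplicates().reset_index(drop=True)
rows_removed = rows_before - len(deduped)
print(f"Removed {rows_removed} of {rows_before} rows")

# Key-based variant: treat rows with the same ResponseId as duplicates, keep the last one
by_id = toy.drop_duplicates(subset=["ResponseId"], keep="last")
print(by_id)
```

Whether full-row or key-based removal is appropriate depends on what counts as the same record in a given dataset.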


### Step 5: Handling Missing Values


**Task 3: Identify and Handle Missing Values**
   1. Identify missing values for all columns in the dataset.
   2. Choose a column with significant missing values (e.g., EdLevel) and impute with the most frequent value.
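
One way to judge which columns have "significant" missing values is to rank them by the share of missing entries rather than the raw count. The sketch below does this on a made-up DataFrame. The column names mirror the survey, but the values are hypothetical.

```python
import pandas as pd

# Toy data: hypothetical values, column names borrowed from the survey
toy = pd.DataFrame({
    "EdLevel": ["Bachelor", None, "Master", "Bachelor"],
    "RemoteWork": ["Remote", None, None, None],
    "ResponseId": [1, 2, 3, 4],
})

# isnull() gives True/False per cell; the column mean is the fraction missing
missing_share = toy.isnull().mean().sort_values(ascending=False)
print((missing_share * 100).round(1))
```

Applied to `df`, the same two calls would list the columns with the largest proportion of missing values first.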


In [11]:
## Write your code here
# Identify and handle missing values
# 1. Count missing values in each column
missing_values = df.isnull().sum()
print("Missing values per column:\n", missing_values)

# 2. Identify the most frequent value in a column with many missing values (e.g., EdLevel)
value_counts = df['EdLevel'].value_counts()
print("\nMost frequent values in 'EdLevel':\n", value_counts)

most_frequent_value = df['EdLevel'].mode()[0]
print("\nMost frequent value selected for imputation:", most_frequent_value)

# 3. Impute missing values with the most frequent value (no inplace)
df['EdLevel'] = df['EdLevel'].fillna(most_frequent_value)

# 4. Verify that missing values have been handled
print("\nMissing values in 'EdLevel' after imputation:", df['EdLevel'].isnull().sum())


Missing values per column:
 ResponseId                 0
MainBranch                 0
Age                        0
Employment                 0
RemoteWork             10631
                       ...  
JobSatPoints_11        35992
SurveyLength            9255
SurveyEase              9199
ConvertedCompYearly    42002
JobSat                 36311
Length: 114, dtype: int64

Most frequent values in 'EdLevel':
 EdLevel
Bachelor’s degree (B.A., B.S., B.Eng., etc.)                                          24942
Master’s degree (M.A., M.S., M.Eng., MBA, etc.)                                       15557
Some college/university study without earning a degree                                 7651
Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)     5793
Professional degree (JD, MD, Ph.D, Ed.D, etc.)                                         2970
Associate degree (A.A., A.S., etc.)                                                    1793
Primary/elementary school     

In [13]:
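# ## Task 3 (alternative): identify and handle missing values with inplace=True -- not recommended
# # df['EdLevel'].fillna(..., inplace=True) modifies a column selected from df, not df itself
# # pandas 2.2 warns about this chained-assignment pattern; with Copy-on-Write (the default in pandas 3.0) df is left unchanged
# # in-place edits are also harder to track when notebook cells are re-run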
# # 1. Count missing values in each column
# missing_values = df.isnull().sum()
# print("Missing values per column:\n", missing_values)

# # 2. Identify the most frequent value in a column with many missing values (e.g., EdLevel)
# value_counts = df['EdLevel'].value_counts()
# print("\nMost frequent values in 'EdLevel':\n", value_counts)

# most_frequent_value = df['EdLevel'].mode()[0]
# print("\nMost frequent value selected for imputation:", most_frequent_value)

# # 3. Impute missing values with the most frequent value
# df['EdLevel'].fillna(most_frequent_value, inplace=True)

# # 4. Verify that missing values have been handled
# print("\nMissing values in 'EdLevel' after imputation:", df['EdLevel'].isnull().sum())


In [14]:
# # Same approach without inplace=True, applied to a different column
# # Choose a column (here 'WorkExp')
# column_to_fix = 'WorkExp' 

# # Find the most frequent value (Mode)
# # .mode() returns a Series, so we take the first element [0]
# most_frequent = df[column_to_fix].mode()[0]
# print(f"The most frequent value in {column_to_fix} is: {most_frequent}")

# # Fill the missing values in that column
# df[column_to_fix] = df[column_to_fix].fillna(most_frequent)

# # Verify that there are no more missing values in that specific column
# print(f"Remaining missing values in {column_to_fix}: {df[column_to_fix].isnull().sum()}")

**Important note on imputation**

When you impute with the most frequent value, you assume that the missing entries resemble the typical entry. This keeps every row available for analysis, but it can overstate how common that value is.
- Categorical columns: the mode is the usual choice.
- Numerical columns (like ConvertedCompYearly): the median is often better than the mean if there are outliers.
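
The rule of thumb above can be sketched on a small made-up DataFrame. The `Comp` column and all values are hypothetical. One extreme value is included to show why the median is less sensitive to outliers than the mean.

```python
import pandas as pd

# Toy data: hypothetical values; "Comp" is an illustrative column name
toy = pd.DataFrame({
    "EdLevel": ["Bachelor", "Bachelor", "Master", None],
    "Comp": [50_000.0, 60_000.0, 1_000_000.0, None],
})

# Categorical column: use the mode (mode() returns a Series, so take the first entry)
edlevel_mode = toy["EdLevel"].mode()[0]

# Numerical column: compare mean and median in the presence of an outlier
comp_mean = toy["Comp"].mean()
comp_median = toy["Comp"].median()
print("Mean:", comp_mean, "| Median:", comp_median)

# Build the imputed frame without modifying the original
filled = toy.assign(
    EdLevel=toy["EdLevel"].fillna(edlevel_mode),
    Comp=toy["Comp"].fillna(comp_median),
)
print(filled)
```

Here the single large value pulls the mean far above the other two entries, while the median stays at the middle value.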

### Step 6: Normalizing Compensation Data


**Task 4: Normalize Compensation Data Using ConvertedCompYearly**
   1. Use the ConvertedCompYearly column for compensation analysis as the normalized annual compensation is already provided.
   2. Check for missing values in ConvertedCompYearly and handle them if necessary.


In [15]:
## Write your code here
# Task 4: Normalize compensation data using ConvertedCompYearly

# 1. Check missing values in the compensation column
missing_comp = df['ConvertedCompYearly'].isnull().sum()
print("Missing values in ConvertedCompYearly:", missing_comp)

# 2. If missing values exist, impute using the median (robust for skewed salary data)
if missing_comp > 0:
    median_comp = df['ConvertedCompYearly'].median()
    print("Median compensation used for imputation:", median_comp)

    # No inplace: assign back to the column
    df['ConvertedCompYearly'] = df['ConvertedCompYearly'].fillna(median_comp)

# 3. Verify the result
print("Missing values after handling:", df['ConvertedCompYearly'].isnull().sum())


Missing values in ConvertedCompYearly: 42002
Median compensation used for imputation: 65000.0
Missing values after handling: 0
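
The output above shows that 42002 entries were filled with the same value, 65000.0. Filling many rows with one value narrows the spread of a column, so it can help to record which rows were imputed before filling them. A minimal sketch on a made-up Series (the values are hypothetical):

```python
import pandas as pd

# Toy data: hypothetical compensation values with two gaps
comp = pd.Series([40_000.0, None, 65_000.0, None, 90_000.0], name="ConvertedCompYearly")

# Remember which entries were missing so they can be excluded or flagged later
was_missing = comp.isnull()

# Median imputation, assigned to a new Series rather than done in place
imputed = comp.fillna(comp.median())

print("Imputed rows:", int(was_missing.sum()))
print("Std before imputation:", comp.std())
print("Std after imputation :", imputed.std())
```

In the lab's DataFrame, the flag could be stored as an extra column before the `fillna` call. The name of that column is a free choice and not something the lab prescribes.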


In [17]:
# ## Alternative method: identify and impute missing values in ConvertedCompYearly
# # Check for missing values in the normalized column
# missing_comp = df['ConvertedCompYearly'].isnull().sum()
# print(f"Missing values in ConvertedCompYearly: {missing_comp}")

# # Handling Missing Values
# # Calculate the median compensation
# median_comp = df['ConvertedCompYearly'].median()
# print(f"Median Annual Compensation: {median_comp}")

# # Impute missing values with the median (without using inplace)
# df['ConvertedCompYearly'] = df['ConvertedCompYearly'].fillna(median_comp)

# # Final verification
# print(f"Missing values after imputation: {df['ConvertedCompYearly'].isnull().sum()}")

### Step 7: Summary and Next Steps


**In this lab, you focused on identifying and removing duplicate rows.**

- You handled missing values by imputing the most frequent value in a chosen column.

- You used ConvertedCompYearly for compensation normalization and handled missing values.

- For further analysis, consider exploring other columns or visualizing the cleaned dataset.



This lab focused on preparing the survey dataset for analysis by identifying and removing duplicate rows, confirming that no full‑row duplicates were present.<br>
Missing values were examined across all columns, and the `EdLevel` field—one of the columns with significant missingness—was imputed using its most frequent category to maintain consistency in the dataset.<br> Annual compensation was normalized using the `ConvertedCompYearly` column, with missing values handled to ensure the field was ready for analysis.<br>
With these cleaning steps completed, the dataset is now more reliable and suitable for further exploration and visualization.

<!--
## Change Log

|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2024-11-05|1.2|Madhusudhan Moole|Updated lab|
|2024-09-24|1.1|Madhusudhan Moole|Updated lab|
|2024-09-23|1.0|Raghul Ramesh|Created lab|

-->


Copyright © IBM Corporation. All rights reserved.
