<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Removing Duplicates**


Estimated time needed: **30** minutes


## Introduction


In this lab, you will practice data wrangling: cleaning and organizing raw data so that it is suitable for analysis. One key task in this process is removing duplicate entries, which are repeated records that can distort analysis and lead to inaccurate conclusions.


## Objectives


In this lab you will perform the following:


1. Identify duplicate rows in the dataset.
2. Use suitable techniques to remove duplicate rows and verify the removal.
3. Summarize how to handle missing values appropriately.
4. Use ConvertedCompYearly to normalize compensation data.
   


### Install the Required Libraries


In [1]:
!pip install pandas



### Step 1: Import Required Libraries


In [2]:
import pandas as pd

### Step 2: Load the Dataset into a DataFrame



Load the dataset using pd.read_csv().


In [3]:
# Define the URL of the dataset
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"

# Load the dataset into a DataFrame
df = pd.read_csv(file_path)

# Display the first few rows to ensure it loaded correctly
print(df.head())


   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                              Hobby   
1  Hobby;Contribute to open-source projects;Other...   
2  Hobby;Contribute to open-source projects;Other...   
3                                                NaN   
4                                 

**Note: If you are working in a local Jupyter environment, you can use the URL directly in the <code>pandas.read_csv()</code> function as shown below:**



#df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv")


### Step 3: Identifying Duplicate Rows


**Task 1: Identify Duplicate Rows**
  1. Count the number of duplicate rows in the dataset.
  2. Display the first few duplicate rows to understand their structure.


In [4]:
## Write your code here
# duplicated() returns a boolean Series: True marks a row that repeats an earlier row
df.duplicated()

0        False
1        False
2        False
3        False
4        False
         ...  
65432    False
65433    False
65434    False
65435    False
65436    False
Length: 65437, dtype: bool
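
The truncated Series above shows only the first and last few flags, so it does not reveal how many rows are duplicated. A minimal sketch of both parts of Task 1, using a small made-up DataFrame (`toy` is an assumption, not the survey data) so that it runs without a download:

```python
import pandas as pd

# Toy stand-in for the survey data (values are made up for illustration)
toy = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3, 3],
    "MainBranch": ["dev", "learner", "learner", "hobby", "hobby"],
})

# duplicated() marks every repeat of an earlier row as True
dup_mask = toy.duplicated()

# Summing the boolean mask counts the duplicate rows
num_duplicates = int(dup_mask.sum())
print(f"Number of duplicate rows: {num_duplicates}")

# Boolean indexing keeps only the flagged rows; head() shows the first few
first_duplicates = toy[dup_mask].head()
print(first_duplicates)
```

On the survey data, the same calls would be applied to `df` instead of `toy`.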

In [5]:
df.head()

Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
0,1,I am a developer by profession,Under 18 years old,"Employed, full-time",Remote,Apples,Hobby,Primary/elementary school,Books / Physical media,,...,,,,,,,,,,
1,2,I am a developer by profession,35-44 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,
2,3,I am a developer by profession,45-54 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,,,,,,,Appropriate in length,Easy,,
3,4,I am learning to code,18-24 years old,"Student, full-time",,Apples,,Some college/university study without earning ...,"Other online resources (e.g., videos, blogs, f...",Stack Overflow;How-to videos;Interactive tutorial,...,,,,,,,Too long,Easy,,
4,5,I am a developer by profession,18-24 years old,"Student, full-time",,Apples,,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,...,,,,,,,Too short,Easy,,


### Step 4: Removing Duplicate Rows


**Task 2: Remove Duplicates**
   1. Remove duplicate rows from the dataset using the drop_duplicates() method.
   2. Verify the removal by counting the number of duplicate rows after removal.


In [6]:
## Write your code here
df_no_duplicates = df.drop_duplicates()
print("\nDataFrame with duplicates removed:")
print(df_no_duplicates)


DataFrame with duplicates removed:
       ResponseId                      MainBranch                 Age  \
0               1  I am a developer by profession  Under 18 years old   
1               2  I am a developer by profession     35-44 years old   
2               3  I am a developer by profession     45-54 years old   
3               4           I am learning to code     18-24 years old   
4               5  I am a developer by profession     18-24 years old   
...           ...                             ...                 ...   
65432       65433  I am a developer by profession     18-24 years old   
65433       65434  I am a developer by profession     25-34 years old   
65434       65435  I am a developer by profession     25-34 years old   
65435       65436  I am a developer by profession     18-24 years old   
65436       65437     I code primarily as a hobby     18-24 years old   

                Employment                            RemoteWork   Check  \
0      Empl
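
Printing the whole DataFrame does not confirm that the duplicates are gone. One way to verify the removal, sketched on a made-up DataFrame (`toy` and its values are illustrative assumptions), is to compare row counts and re-count the duplicates:

```python
import pandas as pd

# Made-up data with two repeated rows (illustrative only)
toy = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3, 3],
    "Age": ["18-24", "25-34", "25-34", "35-44", "35-44"],
})

# drop_duplicates() returns a new DataFrame and keeps the first occurrence by default
deduped = toy.drop_duplicates()

# Compare sizes and re-count duplicates to verify the removal
rows_removed = len(toy) - len(deduped)
remaining_duplicates = int(deduped.duplicated().sum())

print(f"Rows before: {len(toy)}, after: {len(deduped)}, removed: {rows_removed}")
print(f"Duplicate rows remaining: {remaining_duplicates}")
```

A remaining count of zero, together with the expected drop in row count, indicates that the removal worked.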

### Step 5: Handling Missing Values


**Task 3: Identify and Handle Missing Values**
   1. Identify missing values for all columns in the dataset.
   2. Choose a column with significant missing values (e.g., EdLevel) and impute with the most frequent value.


In [7]:
## Write your code here
missing_values = df.isnull().sum()
print(missing_values)

ResponseId                 0
MainBranch                 0
Age                        0
Employment                 0
RemoteWork             10631
                       ...  
JobSatPoints_11        35992
SurveyLength            9255
SurveyEase              9199
ConvertedCompYearly    42002
JobSat                 36311
Length: 114, dtype: int64
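
The raw counts are easier to act on when they are sorted and expressed as a share of all rows. A small sketch on made-up data (the `toy` frame and the `summary` column labels are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up data with gaps in two columns (illustrative only)
toy = pd.DataFrame({
    "ResponseId": [1, 2, 3, 4],
    "RemoteWork": ["Remote", None, "Hybrid", None],
    "ConvertedCompYearly": [np.nan, np.nan, 90000.0, np.nan],
})

# Count of missing values per column
missing_counts = toy.isnull().sum()

# The mean of a boolean mask is the fraction of True values
missing_percent = toy.isnull().mean() * 100

# Combine both views and list the sparsest columns first
summary = pd.DataFrame({
    "missing": missing_counts,
    "percent": missing_percent,
}).sort_values("missing", ascending=False)

print(summary)
```

Sorting this way makes it easier to pick a column with significant missing values for the imputation step.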


In [8]:
mode_LearnCodeOnline = df['LearnCodeOnline'].mode()[0]

In [9]:
mode_LearnCodeOnline

'Technical documentation;Blogs;Written Tutorials;Stack Overflow'
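
Computing the mode is only the first half of the imputation step. A minimal sketch of the second half, filling the gaps in one column only, on made-up data (`toy` and its values are assumptions; `EdLevel` is the column the task suggests):

```python
import numpy as np
import pandas as pd

# Made-up data: a text column and a numeric column, both with gaps
toy = pd.DataFrame({
    "EdLevel": ["Bachelor", None, "Bachelor", "Master", None],
    "JobSat": [7.0, np.nan, 8.0, np.nan, 6.0],
})

# mode() ignores missing values and returns a Series; [0] takes the first mode
most_frequent = toy["EdLevel"].mode()[0]

# Fill only the chosen column and assign the result back to it
toy["EdLevel"] = toy["EdLevel"].fillna(most_frequent)

print(toy)
```

Filling the column rather than the whole DataFrame leaves other columns, such as the numeric `JobSat` here, untouched.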

### Step 6: Normalizing Compensation Data


**Task 4: Normalize Compensation Data Using ConvertedCompYearly**
   1. Use the ConvertedCompYearly column for compensation analysis as the normalized annual compensation is already provided.
   2. Check for missing values in ConvertedCompYearly and handle them if necessary.


In [10]:
## Write your code here
# Scale by the column maximum so that the largest value becomes 1; missing values remain NaN
df["ConvertedCompYearly"] = df["ConvertedCompYearly"] / df["ConvertedCompYearly"].max()
df["ConvertedCompYearly"]

0       NaN
1       NaN
2       NaN
3       NaN
4       NaN
         ..
65432   NaN
65433   NaN
65434   NaN
65435   NaN
65436   NaN
Name: ConvertedCompYearly, Length: 65437, dtype: float64
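
The preview above shows only NaN because the first and last rows have no reported compensation; it does not mean the scaling failed. A sketch on made-up numbers (the `toy` frame and the `CompScaled` column name are assumptions) that counts the missing values first and writes the scaled result to a new column, so the original values stay available:

```python
import numpy as np
import pandas as pd

# Made-up compensation values with gaps (illustrative only)
toy = pd.DataFrame({
    "ConvertedCompYearly": [np.nan, 50000.0, 100000.0, np.nan, 200000.0],
})

# Step 1: check how many values are missing
n_missing = int(toy["ConvertedCompYearly"].isnull().sum())
print(f"Missing compensation values: {n_missing}")

# Step 2: scale by the maximum; max() skips NaN, and NaN inputs stay NaN
toy["CompScaled"] = toy["ConvertedCompYearly"] / toy["ConvertedCompYearly"].max()

# Step 3: inspect only the rows that actually have a value
scaled_non_null = toy["CompScaled"].dropna()
print(scaled_non_null)
```

Writing to a separate column is a design choice, not something the lab prescribes; it keeps the cell safe to re-run because the raw values are never overwritten.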

### Step 7: Summary and Next Steps


**In this lab, you focused on identifying and removing duplicate rows.**

- You handled missing values by imputing the most frequent value in a chosen column.

- You used ConvertedCompYearly for compensation normalization and handled missing values.

- For further analysis, consider exploring other columns or visualizing the cleaned dataset.


In [19]:
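## Write your code here
# Caution: calling fillna() on the whole DataFrame fills missing values in other columns too,
# not just LearnCodeOnline (see JobSatPoints_6 in the output below).
# To impute only that column, call fillna() on df['LearnCodeOnline'] and assign the result back to it.
mode_LearnCodeOnline = df.fillna(df['LearnCodeOnline'].value_counts().index[0])
mode_LearnCodeOnline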

Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
0,1,I am a developer by profession,Under 18 years old,"Employed, full-time",Remote,Apples,Hobby,Primary/elementary school,Books / Physical media,Technical documentation;Blogs;Written Tutorial...,...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,ConvertedCompProgramming,Technical documentation;Blogs;Written Tutorial...
1,2,I am a developer by profession,35-44 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,0.0,0.0,0.0,0.0,0.0,0.0,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,ConvertedCompProgramming,Technical documentation;Blogs;Written Tutorial...
2,3,I am a developer by profession,45-54 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Appropriate in length,Easy,ConvertedCompProgramming,Technical documentation;Blogs;Written Tutorial...
3,4,I am learning to code,18-24 years old,"Student, full-time",Technical documentation;Blogs;Written Tutorial...,Apples,Technical documentation;Blogs;Written Tutorial...,Some college/university study without earning ...,"Other online resources (e.g., videos, blogs, f...",Stack Overflow;How-to videos;Interactive tutorial,...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Too long,Easy,ConvertedCompProgramming,Technical documentation;Blogs;Written Tutorial...
4,5,I am a developer by profession,18-24 years old,"Student, full-time",Technical documentation;Blogs;Written Tutorial...,Apples,Technical documentation;Blogs;Written Tutorial...,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Too short,Easy,ConvertedCompProgramming,Technical documentation;Blogs;Written Tutorial...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65432,65433,I am a developer by profession,18-24 years old,"Employed, full-time",Remote,Apples,Hobby;School or academic work,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","On the job training;School (i.e., University, ...",Technical documentation;Blogs;Written Tutorial...,...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,ConvertedCompProgramming,Technical documentation;Blogs;Written Tutorial...
65433,65434,I am a developer by profession,25-34 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,ConvertedCompProgramming,Technical documentation;Blogs;Written Tutorial...
65434,65435,I am a developer by profession,25-34 years old,"Employed, full-time",In-person,Apples,Hobby,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","Other online resources (e.g., videos, blogs, f...",Technical documentation;Stack Overflow;Social ...,...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,ConvertedCompProgramming,Technical documentation;Blogs;Written Tutorial...
65435,65436,I am a developer by profession,18-24 years old,"Employed, full-time","Hybrid (some remote, some in-person)",Apples,Hobby;Contribute to open-source projects;Profe...,"Secondary school (e.g. American high school, G...",On the job training;Other online resources (e....,Technical documentation;Blogs;Written Tutorial...,...,0.0,0.0,0.0,0.0,0.0,0.0,Technical documentation;Blogs;Written Tutorial...,Technical documentation;Blogs;Written Tutorial...,ConvertedCompProgramming,Technical documentation;Blogs;Written Tutorial...


In [21]:
# dropna(..., inplace=True) modifies df directly and returns None, so print() displays None
print(df.dropna(subset=["ConvertedCompYearly"], axis=0, inplace=True))

None
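
The `None` above is the return value of `dropna` when `inplace=True`; the rows were still removed from `df`. A sketch of the assignment form on made-up data (`toy` is an assumption), which returns the cleaned DataFrame so the result can be inspected directly:

```python
import numpy as np
import pandas as pd

# Made-up data: two of three rows lack a compensation value
toy = pd.DataFrame({
    "ResponseId": [1, 2, 3],
    "ConvertedCompYearly": [np.nan, 60000.0, np.nan],
})

# Without inplace=True, dropna returns a new DataFrame and leaves toy unchanged
cleaned = toy.dropna(subset=["ConvertedCompYearly"])

# Verify: the shape shrinks and no missing values remain in the column
print(cleaned.shape)
print(cleaned["ConvertedCompYearly"].isnull().sum())
```

Keeping the original frame intact also makes it possible to compare row counts before and after the drop.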


<!--
## Change Log

|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2024-11-05|1.2|Madhusudhan Moole|Updated lab|
|2024-09-24|1.1|Madhusudhan Moole|Updated lab|
|2024-09-23|1.0|Raghul Ramesh|Created lab|

-->


Copyright © IBM Corporation. All rights reserved.
