# Overview
Data cleaning is the process of identifying and correcting or removing errors, inconsistencies, and irrelevant information from a dataset. The goal of data cleaning is to prepare the data for further analysis, modeling, or visualization by ensuring that it is accurate, consistent, and relevant. 

Data cleaning is an important step in the data science process because the quality of the data directly affects the accuracy and validity of any further analysis or modeling. If the data is not cleaned, errors, inconsistencies, or irrelevant information may lead to incorrect or misleading results. For example, if a dataset contains missing values, the results of a statistical analysis may be biased or incorrect. If a dataset contains inconsistent or incorrect data types, a machine learning model trained on it may be compromised. For these reasons, data cleaning is an important step, and arguably the most important one, for the success of any analytics project.

In this module, we will cover the following topics:

I. Missing data: Values that are absent where a valid value is expected.
II. Duplicate data: Identical or nearly identical records in a database or dataset.
III. Inconsistent data: Data stored in different or conflicting formats within the same dataset.

## Learning Objectives
In this module, the learners will:

* Identify missing values in a dataset
* Apply techniques such as imputation for handling missing values
* Identify and handle duplicate data in a dataset
* Understand the importance of consistent data formats
* Standardize data types and values in a given dataset



# Dataset
Breast Cancer dataset: This is a widely used dataset in the field of medical research and machine learning. Each row describes one tumor sample: a patient identifier, nine characteristics of the tumor cells scored from 1 to 10, the diagnosis, and the name of the diagnosing doctor.

In terms of data cleaning, the dataset may contain missing values or erroneous data that need to be addressed before any analysis can be performed. Some of the common data cleaning techniques that can be applied to this dataset include identifying and handling missing values, checking for duplicates, and removing any irrelevant columns.

The columns in this dataset are:

* patient_id: This column contains a unique identification number assigned to each patient in the dataset.
* clump_thickness: This column represents the thickness of the tumor in the range of 1 to 10.
* cell_size_uniformity: This column represents the uniformity in size of tumor cells in the range of 1 to 10.
* cell_shape_uniformity: This column represents the uniformity in shape of tumor cells in the range of 1 to 10.
* marginal_adhesion: This column represents the level of adhesion of tumor cells to the surrounding tissue in the range of 1 to 10.
* single_ep_cell_size: This column represents the size of the tumor's epithelial cells in the range of 1 to 10.
* bare_nuclei: This column scores the presence of bare nuclei (nuclei that are not surrounded by cytoplasm) in the tumor cells, ranging from 1 to 10.
* bland_chromatin: This column represents the uniformity of the chromatin material within the tumor cells, ranging from 1 to 10.
* normal_nucleoli: This column represents the normalcy of the nucleoli within the tumor cells, ranging from 1 to 10.
* mitoses: This column represents the level of mitosis (cell division) within the tumor cells, ranging from 1 to 10.
* class: This column contains the diagnosis of the tumor as either "benign" or "malignant".
* doctor_name: This column contains the name of the doctor who diagnosed the tumor.
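
Before any cleaning, it often helps to take a quick inventory of the issues described above. The sketch below is only an illustration: it uses a small hypothetical DataFrame (the values are invented; only the column names are borrowed from the list above), and the helper name `audit` and the choice of `doctor_name` as an "irrelevant" column are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from the dataset description.
toy = pd.DataFrame({
    "patient_id": [101, 102, 103, 103],
    "clump_thickness": [5.0, np.nan, 4.0, 4.0],
    "class": ["benign", "malignant", "benign", "benign"],
    "doctor_name": ["Dr. Lee", "Dr. Smith", "Dr. Lee", "Dr. Lee"],
})


def audit(df: pd.DataFrame) -> dict:
    """Summarize the basic data-quality facts that a cleaning pass addresses."""
    return {
        "shape": df.shape,
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }


report = audit(toy)
print(report)

# Dropping a column judged irrelevant for the analysis (an assumption here).
trimmed = toy.drop(columns=["doctor_name"])
print(trimmed.columns.tolist())
```

Running the same three checks on the real dataset gives a first impression of how much cleaning is needed before choosing a technique.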

# Missing Data
## What is missing data?
In real-world datasets, missing data is common and can occur for a variety of reasons such as errors in data collection, data entry, or processing. Missing values can create problems for data analysis and machine learning algorithms, as they may lead to biased or incorrect results. Therefore, handling missing values is an essential step in data preprocessing. 

In Python, a missing value, also known as a "null value", is a value that represents the absence of a valid value. This can occur in many ways, such as when data is missing from a database, when a value cannot be calculated, or when a value is not entered for a specific data point. Plain Python uses None for this purpose, while NumPy and Pandas represent missing numeric values with the special floating-point constant numpy.nan ("not a number", or NaN) from the NumPy library. This is why a numeric column that contains missing values is stored with a float type.
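
A short sketch of how these missing-value markers behave in practice may be useful here. The Series below is invented for illustration. It shows that None is converted to NaN in a float Series and that NaN has to be detected with isna() rather than with an equality comparison.

```python
import numpy as np
import pandas as pd

# Invented values: one NaN and one None in an otherwise numeric Series.
values = pd.Series([1.0, np.nan, None, 4.0])

# None is converted to NaN, so the Series is stored as float64.
print(values.dtype)
print(values.isna().tolist())

# NaN never compares equal to anything, including itself,
# so an equality test finds no missing values at all.
nan_found_by_equality = int((values == np.nan).sum())
nan_found_by_isna = int(values.isna().sum())
print(nan_found_by_equality, nan_found_by_isna)
```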

## Why is it important?
Handling missing values is important for several reasons:

* Missing values can impact the accuracy of data analysis and machine learning algorithms: If a missing value is not handled properly, it can lead to biased or incorrect results. For example, the mean computed from a column with missing values reflects only the observed values and may differ from the true mean.
* Missing values can lead to computational errors: Many machine learning algorithms and statistical models cannot handle missing values and will produce errors when encountering them. This can lead to incorrect results or even prevent the algorithm from running altogether.
* Missing values can reduce the sample size and decrease the power of statistical tests: If a large portion of a dataset contains missing values, the sample size used for analysis will be reduced, potentially leading to lower statistical power and more Type II errors.
* Missing values can impact the interpretability of results: If missing values are not handled properly, they can cause misinterpretation of results and mislead decision-making.
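
The points above can be seen on a tiny, invented Series. This is a minimal sketch rather than a full analysis. Pandas silently skips NaN when computing a mean, NumPy propagates it, and the number of usable observations shrinks.

```python
import numpy as np
import pandas as pd

# Invented values with one missing entry.
scores = pd.Series([2.0, 4.0, np.nan, 6.0])

pandas_mean = scores.mean()               # NaN skipped: mean of 2, 4, 6
strict_mean = scores.mean(skipna=False)   # NaN propagates into the result
numpy_mean = np.mean(scores.to_numpy())   # NumPy also propagates NaN
usable_n = scores.count()                 # counts non-missing values only

print(pandas_mean, strict_mean, numpy_mean, usable_n)
```

Because different libraries treat NaN differently, it is worth deciding explicitly how missing values are handled instead of relying on defaults.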

Let's look at a few ways to deal with missing data in Python.

## Deleting the Missing Values
There are several techniques to deal with missing values. In this section, we will focus on the deletion technique, which involves dropping rows or columns with missing values. We will provide examples of how to implement these techniques using the Pandas library in Python.

### Dropping the Rows
One approach to handling missing values in a dataset is to remove entire rows that contain any missing values. By dropping rows with missing values, we effectively eliminate the observations that have incomplete data. Here's an example:

In [1]:
import pandas as pd 

df = pd.read_csv("https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/breast_cancer_data.csv") 
# Find missing values 
print("Missing values in the original DataFrame:\n", df.isnull().sum(), "\n") 

# Remove rows with missing values 
df_cleaned = df.dropna() 

# Print the size of the DataFrame before deletion 
print("Size of DataFrame before deletion:") 
print(df.shape) 

# Print the size of the modified DataFrame 
print("Size of DataFrame after deletion:") 
print(df_cleaned.shape)

Missing values in the original DataFrame:
 patient_id               0
clump_thickness          1
cell_size_uniformity     1
cell_shape_uniformity    0
marginal_adhesion        0
single_ep_cell_size      0
bare_nuclei              2
bland_chromatin          4
normal_nucleoli          1
mitoses                  0
class                    0
doctor_name              0
dtype: int64 

Size of DataFrame before deletion:
(699, 12)
Size of DataFrame after deletion:
(690, 12)


The code starts by importing the Pandas library, which provides powerful data manipulation capabilities. We then load the dataset with missing values using the read_csv function. Adjust the file path and name according to your dataset.

To identify the missing values in the DataFrame, we use isnull().sum(), which returns the count of missing values for each column. Calling the dropna() method on the DataFrame df returns a new DataFrame without the rows that contain any missing values; df itself is left unchanged. The resulting cleaned DataFrame is stored in df_cleaned.

Before deletion, we print the size of the DataFrame using the shape attribute. This gives us an understanding of the number of rows and columns in the original dataset. After deletion, we print the size of the modified DataFrame to observe the impact of removing rows with missing values.
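
Dropping every row that has any missing value is the strictest option. As a hedged sketch on invented data, the example below shows three other dropna() parameters (subset, how, and thresh) that restrict which rows are removed. The column names mirror the dataset, but the values are made up.

```python
import numpy as np
import pandas as pd

# Invented toy data with missing values in two columns.
toy = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "clump_thickness": [5.0, np.nan, 3.0, np.nan],
    "bland_chromatin": [3.0, 2.0, np.nan, np.nan],
})

# Drop rows with a missing value in any column.
any_missing = toy.dropna()

# Only drop rows where one key column is missing.
key_missing = toy.dropna(subset=["clump_thickness"])

# Only drop rows where both measurement columns are missing.
all_missing = toy.dropna(how="all", subset=["clump_thickness", "bland_chromatin"])

# Keep rows that have at least two non-missing values.
at_least_two = toy.dropna(thresh=2)

print(len(any_missing), len(key_missing), len(all_missing), len(at_least_two))
```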

### Dropping the Columns
Another approach to handling missing values is to remove entire columns that contain any missing values. By dropping columns with missing values, we eliminate the features or variables that have incomplete data. Here's an example:

In [2]:
import pandas as pd 

# Load a dataset with missing values 
df = pd.read_csv("https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/breast_cancer_data.csv") 

# Find missing values 
print("Missing values in the original DataFrame:\n", df.isnull().sum(), "\n") 

# Remove columns with missing values 
df_cleaned = df.dropna(axis=1) 

# Print the size of the DataFrame before deletion 
print("Size of DataFrame before deletion:") 
print(df.shape) 

# Print the size of the modified DataFrame 
print("Size of DataFrame after deletion:") 
print(df_cleaned.shape) 

Missing values in the original DataFrame:
 patient_id               0
clump_thickness          1
cell_size_uniformity     1
cell_shape_uniformity    0
marginal_adhesion        0
single_ep_cell_size      0
bare_nuclei              2
bland_chromatin          4
normal_nucleoli          1
mitoses                  0
class                    0
doctor_name              0
dtype: int64 

Size of DataFrame before deletion:
(699, 12)
Size of DataFrame after deletion:
(699, 7)


Similarly, we start by importing the Pandas library and loading the dataset with missing values using the read_csv function.

We use the isnull().sum() method to identify the missing values in the DataFrame, which gives us the count of missing values for each column. To remove columns that contain any missing values, we apply the dropna(axis=1) function to the DataFrame df. The resulting cleaned DataFrame is stored in df_cleaned.

Before deletion, we print the size of the DataFrame using the shape attribute to get an idea of the number of rows and columns in the original dataset. After deletion, we print the size of the modified DataFrame to observe the impact of removing columns with missing values.
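
Removing every column that has even one missing value can discard useful features. One common alternative, sketched here on invented data, is to drop only the columns whose share of missing values exceeds a cutoff. The 0.5 cutoff and the column names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Invented toy data: one column is mostly missing, another only slightly.
toy = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "mostly_missing": [np.nan, np.nan, np.nan, 7.0],
    "rarely_missing": [1.0, np.nan, 3.0, 4.0],
})

max_missing_fraction = 0.5  # assumed cutoff, for illustration only

# isna().mean() gives the fraction of missing values per column.
missing_fraction = toy.isna().mean()
columns_to_keep = missing_fraction[missing_fraction <= max_missing_fraction].index
trimmed = toy[columns_to_keep]

print(missing_fraction.to_dict())
print(trimmed.columns.tolist())
```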

## Central Tendency Imputation 
Imputation involves replacing missing values in a dataset with estimated values, and it serves as a technique for handling missing data. Among the various methods used for imputation, one common approach is to substitute the missing values with the mean, median, or mode of the available observed values. This allows for a more complete and reliable dataset for further analysis or modeling.

### Imputation Using Mean
Mean imputation involves replacing missing values with the mean of the observed values. This method assumes that the missing values are missing at random and that the mean of the observed values is a good estimate for the missing values. Here is an example of mean imputation in Python using the Pandas library:

In [3]:
import pandas as pd 

# Load a dataset with missing values 
df = pd.read_csv("https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/breast_cancer_data.csv") 

# Find missing values 
print("Missing values in 'bland_chromatin' before imputation:\n ", df['bland_chromatin'].isnull().sum(),"\n") 

# Impute missing values with the column mean and assign the result back 
df['bland_chromatin'] = df['bland_chromatin'].fillna(df['bland_chromatin'].mean()) 

# Find missing values after imputation 
print("Missing values in 'bland_chromatin' after imputation:\n ",  
df['bland_chromatin'].isnull().sum(),"\n") 

Missing values in 'bland_chromatin' before imputation:
  4 

Missing values in 'bland_chromatin' after imputation:
  0 



This code snippet uses the Pandas library to handle missing values in the 'bland_chromatin' column of a dataset. It first counts the missing values in the column with isnull().sum(). The isnull() method returns a boolean mask indicating missing values, and sum() counts the number of True values (i.e., missing values). This gives the count of missing values before any imputation.

To address the missing values, the code calls the fillna() method on the 'bland_chromatin' column, passing the column mean obtained with the mean() method, and assigns the result back to the column. Assigning the result back is more reliable than calling fillna() with inplace=True on a selected column, which recent Pandas versions warn about and which does not update the DataFrame once copy-on-write behavior is in effect. The code then counts the missing values in the 'bland_chromatin' column again using isnull().sum(), which shows the count after imputation.
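
Mean imputation keeps the column mean unchanged, but it also makes the data look less variable than it really is, because every filled value sits exactly at the center. The following sketch demonstrates this side effect on an invented Series.

```python
import numpy as np
import pandas as pd

# Invented values with two missing entries.
values = pd.Series([1.0, 2.0, np.nan, 3.0, np.nan, 6.0])

observed_mean = values.mean()            # mean of 1, 2, 3, 6
imputed = values.fillna(observed_mean)   # returns a new, filled Series

mean_after = imputed.mean()
std_before = values.std()                # computed on observed values only
std_after = imputed.std()                # smaller after imputation

print(observed_mean, mean_after)
print(round(std_before, 3), round(std_after, 3))
```

If the spread of a variable matters for the later analysis, this shrinkage is worth keeping in mind when choosing mean imputation.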

### Imputation Using Median
Median imputation involves replacing missing values with the median of the observed values. This method is similar to mean imputation, but the median is used instead of the mean. Since the median is less sensitive to extreme values, this method can be more robust to outliers than mean imputation. Here is an example of median imputation in Python using the Pandas library:

In [4]:
import pandas as pd 

# Load a dataset with missing values 
df = pd.read_csv("https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/breast_cancer_data.csv") 

# Find missing values 
print("Missing values in 'bland_chromatin' before imputation:\n ", df['bland_chromatin'].isnull().sum(),"\n") 

# Impute missing values with the column median and assign the result back 
df['bland_chromatin'] = df['bland_chromatin'].fillna(df['bland_chromatin'].median()) 

# Find missing values after imputation 
print("Missing values in 'bland_chromatin' after imputation:\n ",  
df['bland_chromatin'].isnull().sum(),"\n") 


Missing values in 'bland_chromatin' before imputation:
  4 

Missing values in 'bland_chromatin' after imputation:
  0 



This code determines the count of missing values in the 'bland_chromatin' column using isnull().sum(), providing an overview of missing values before any imputation.

To handle the missing values, the code calls the fillna() method on the 'bland_chromatin' column, passing the column median obtained with the median() method, and assigns the result back to the column. Assigning the result back is more reliable than calling fillna() with inplace=True on a selected column, which recent Pandas versions warn about and which does not update the DataFrame once copy-on-write behavior is in effect. The code then counts the missing values in the 'bland_chromatin' column again using isnull().sum() to confirm that none remain after imputation.
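
To illustrate why the median is more robust to outliers, the sketch below compares the two fill values on an invented Series that contains one extreme observation.

```python
import numpy as np
import pandas as pd

# Invented values: one missing entry and one extreme outlier (100).
values = pd.Series([1.0, 2.0, 2.0, 3.0, np.nan, 100.0])

mean_fill = values.mean()      # pulled upward by the outlier
median_fill = values.median()  # middle of the observed values

imputed_with_mean = values.fillna(mean_fill)
imputed_with_median = values.fillna(median_fill)

print(mean_fill, median_fill)
print(imputed_with_mean.iloc[4], imputed_with_median.iloc[4])
```

Here the mean-based fill value lies far from most observations, while the median-based fill value matches a typical one.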

### Imputation Using Mode
Mean and median imputation only work on numeric data; for non-numeric data, we need other methods. One such method is mode imputation, which replaces missing values with the mode of the observed values. The mode is the most frequently occurring value in the dataset. This method is commonly used for categorical data, where the mode represents the most common category, but it can also be used for other types of data when appropriate. Here is an example of mode imputation in Python using the Pandas library:

In [5]:
import pandas as pd 
 

# Load a dataset with missing values 
df = pd.read_csv("https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/breast_cancer_data.csv") 
 

# Find missing values 
print("Missing values in 'bland_chromatin' before imputation:\n ", df['bland_chromatin'].isnull().sum(),"\n") 
 

# Impute missing values with the column mode and assign the result back 
df['bland_chromatin'] = df['bland_chromatin'].fillna(df['bland_chromatin'].mode()[0]) 

# Find missing values after imputation 
print("Missing values in 'bland_chromatin' after imputation:\n ",  
df['bland_chromatin'].isnull().sum(),"\n") 


Missing values in 'bland_chromatin' before imputation:
  4 

Missing values in 'bland_chromatin' after imputation:
  0 



This code checks for missing values in the 'bland_chromatin' column using isnull().sum(). To handle the missing values, the code calls the fillna() method on the 'bland_chromatin' column, which replaces missing values with a specified value. In this case, the mode (most frequent value) of the column is calculated using df['bland_chromatin'].mode()[0]. The mode() method returns a Series of mode values, which contains several entries if there are ties. Accessing the first value with [0] selects a single value for imputation. The result is assigned back to the column, which is more reliable than calling fillna() with inplace=True on a selected column: recent Pandas versions warn about that pattern, and it does not update the DataFrame once copy-on-write behavior is in effect.

After the missing values are imputed, the code checks for missing values in the 'bland_chromatin' column again using the same isnull().sum() method.
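
Since mode imputation is mainly intended for categorical data, a small sketch on a text column may help. The Series below is invented; it borrows the doctor_name column name purely for illustration. The last lines show what mode() returns when two values are tied.

```python
import pandas as pd

# Invented categorical data with two missing entries.
doctors = pd.Series(
    ["Dr. Lee", "Dr. Smith", None, "Dr. Lee", None], name="doctor_name"
)

most_common = doctors.mode()[0]      # most frequent observed category
filled = doctors.fillna(most_common)

print(most_common)
print(filled.tolist())

# With a tie, mode() returns every tied value in sorted order,
# so [0] picks the smallest of them.
tied_modes = pd.Series([2, 2, 1, 1]).mode().tolist()
print(tied_modes)
```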

# Duplicate Data

## What is duplicate data?
Duplicate data refers to identical or nearly identical records in a database or dataset. These duplicate records can arise for various reasons, such as human error, technical issues, or merging data from multiple sources without checking for records that already exist.

## Why is it important?
Duplicate data can lead to a variety of problems, such as inflating the size of the database, causing confusion when analyzing the data, and introducing inconsistencies in the data.

Imagine a healthcare organization that maintains a database of patient records. Each record contains information such as the patient's name, address, date of birth, and medical history. Over time, multiple records might be created for the same patient due to various reasons, such as data entry errors, system glitches, or merging of records from different sources.

Now, imagine that a physician needs to look up a patient's medical history to determine the best course of treatment. If the patient has multiple records in the database, the physician might not be able to accurately determine which record contains the most up-to-date and accurate information. As a result, the physician might make an incorrect diagnosis or prescribe the wrong medication, leading to potential harm to the patient.

By removing duplicates, the healthcare organization can ensure that each patient has only one accurate and up-to-date record in the database. This helps ensure that healthcare providers have access to the most accurate and reliable information when making critical decisions about patient care.

This is just one example of why dealing with duplicate data is important. In general, it is essential for organizations to identify and remove duplicates in their data to ensure the accuracy and reliability of the data, avoid confusion and inconsistencies, and make informed decisions based on the data.

## Detecting Duplicates
Addressing duplicate data is an essential aspect of data cleaning. However, before dealing with duplicate data, it is crucial to first identify the duplicates. In Python, the Pandas library offers several ways to mark duplicated data. In this section, we will explore the main techniques used for identifying duplicate data.

### Identify Duplicates Based on All Columns
One technique for identifying duplicate data involves considering all columns in the dataset. This approach compares the entire rows of the dataset to identify exact duplicates, where all column values match.

By examining all columns simultaneously, this technique provides a comprehensive assessment of duplicated records across the entire dataset. It helps in identifying instances where the same data has been entered multiple times or when there are repeated observations in the dataset. Here is an example of how you can identify duplicates based on all columns in Python:

In [6]:
import pandas as pd 

# Load the dataset 
df = pd.read_csv("https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/breast_cancer_data.csv") 

# Identify duplicates based on all columns 
duplicates = df.duplicated() 

# Print out the duplicate counts  
print(duplicates.value_counts()) 

# Print the duplicate row(s) 
print(df[duplicates]) 

False    698
True       1
dtype: int64
     patient_id  clump_thickness  cell_size_uniformity  cell_shape_uniformity  \
258     1198641              3.0                   1.0                      1   

     marginal_adhesion  single_ep_cell_size bare_nuclei  bland_chromatin  \
258                  1                    2           1              3.0   

     normal_nucleoli  mitoses   class doctor_name  
258              1.0        1  benign     Dr. Lee  


The code begins by loading the dataset "breast_cancer_data.csv" into a Pandas DataFrame called df. The next step involves identifying duplicate rows based on all columns using the duplicated() function, which returns a boolean series indicating whether each row is a duplicate or not. 

Next, the code proceeds to print the counts of duplicate values using the value_counts() function applied to the boolean series duplicates. This provides a summary of the number of occurrences for each unique value in the series, indicating the count of duplicates (True) and non-duplicates (False). Lastly, the duplicate rows are printed using the print() function, showing the specific rows that are considered duplicates based on the boolean series duplicates.
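
The behavior of duplicated() is easy to verify on a tiny, invented DataFrame. In this sketch, only the second copy of the repeated row is flagged, and summing the boolean mask gives the number of duplicates directly.

```python
import pandas as pd

# Invented toy data: rows 1 and 2 are identical.
toy = pd.DataFrame({
    "patient_id": [1, 2, 2, 3],
    "class": ["benign", "malignant", "malignant", "benign"],
})

mask = toy.duplicated()         # True for repeats of an earlier row
n_duplicates = int(mask.sum())  # True counts as 1 when summed
duplicate_rows = toy[mask]

print(mask.tolist())
print(n_duplicates)
print(duplicate_rows)
```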

### Identify Duplicates Based on Specific Columns
In addition to identifying duplicates based on all columns, it is also common to focus on specific columns when detecting duplicate data. This technique allows us to determine if there are any repeated or highly similar records based on a subset of columns that are considered relevant for identifying duplicates.

By selecting specific columns for duplicate identification, we can tailor the analysis to focus on the attributes that are most important for determining duplication. This approach is particularly useful when certain columns, such as unique identifiers or key variables, play a crucial role in identifying and distinguishing records. Here is an example of how you can identify duplicates based on specific columns in Python:

In [7]:
import pandas as pd 

# Load the dataset 
df = pd.read_csv("https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/breast_cancer_data.csv") 

# Specify the columns for duplicate identification 
columns_to_check = ['patient_id', 'clump_thickness', 'cell_size_uniformity'] 

# Identify duplicates based on specific columns 
duplicates = df.duplicated(subset=columns_to_check) 


# Print out the duplicate counts 
print(duplicates.value_counts()) 

# Print the duplicate rows 
print(df[duplicates]) 

False    683
True      16
dtype: int64
     patient_id  clump_thickness  cell_size_uniformity  cell_shape_uniformity  \
208     1218860              1.0                   1.0                      1   
253     1100524              6.0                  10.0                     10   
254     1116116              9.0                  10.0                     10   
257     1182404              3.0                   1.0                      1   
258     1198641              3.0                   1.0                      1   
272      320675              3.0                   3.0                      5   
322      733639              3.0                   1.0                      1   
338      704097              1.0                   1.0                      1   
393     1158247              1.0                   1.0                      1   
443      734111              1.0                   1.0                      1   
490     1115293              1.0                   1.0                

This code specifies the columns to be considered for duplicate identification by creating a list columns_to_check containing the column names 'patient_id', 'clump_thickness', and 'cell_size_uniformity'. These columns are used to decide whether records are duplicated, combining the patient identifier with two tumor features (clump thickness and cell size uniformity).

The code then uses the duplicated() function on the DataFrame df, specifying the 'subset' parameter as columns_to_check. This creates a boolean mask duplicates, which indicates True for rows that are duplicates based on the specified columns and False for non-duplicate rows.

To provide a summary of the duplicate counts, the code applies the value_counts() function to the boolean mask duplicates. This counts the occurrences of True and False values, indicating the number of duplicates and non-duplicates, respectively. Finally, the code displays the duplicate rows by using the boolean mask duplicates to index the DataFrame df.
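
When duplicates are defined by a subset of columns, the flagged rows may still differ in other columns. A sketch of one way to inspect such cases, on invented data, is to select every row that shares a key with another row and sort them so that conflicting records sit next to each other. Using patient_id alone as the key is an assumption made for this example.

```python
import pandas as pd

# Invented toy data: patient 10 and patient 11 each appear twice,
# but the repeated rows disagree in other columns.
toy = pd.DataFrame({
    "patient_id": [10, 11, 10, 12, 11],
    "clump_thickness": [5.0, 3.0, 5.0, 1.0, 4.0],
    "class": ["benign", "benign", "malignant", "benign", "benign"],
})

key = ["patient_id"]  # assumed key column for this illustration

exact = int(toy.duplicated().sum())             # identical in every column
by_key = int(toy.duplicated(subset=key).sum())  # repeated key values

# keep=False flags every row whose key occurs more than once.
involved = toy[toy.duplicated(subset=key, keep=False)]
involved = involved.sort_values(key, kind="stable")

print(exact, by_key)
print(involved)
```

Inspecting the grouped rows in this way helps decide whether the records are true duplicates or separate observations that share an identifier.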

### Mark Duplicates Using the Keep Parameter
Marking duplicates using the keep parameter is a technique used to identify and label duplicate rows within a dataset. The keep parameter allows for customizing which duplicate values should be marked or considered as duplicates.

By default, when marking duplicates, the first occurrence of a duplicate value is considered as non-duplicate, while all subsequent occurrences are marked as duplicates. However, the keep parameter provides different options to modify this behavior:

* keep='first': This is the default value and keeps the first occurrence of a duplicated value as non-duplicate, marking all subsequent occurrences as duplicates. 
* keep='last': This option keeps the last occurrence of a duplicated value as non-duplicate, marking all previous occurrences as duplicates. 
* keep=False: Setting keep to False marks all occurrences of duplicate values as duplicates, considering none of them as non-duplicates.

By specifying the keep parameter, you can control how duplicates are marked and choose which occurrence(s) to consider as non-duplicates. This provides flexibility in handling duplicates based on specific requirements or analysis needs.

Here is an example of how you can identify duplicates using the keep parameter in Python:

In [8]:
import pandas as pd  

# Load the dataset  
df = pd.read_csv("https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/breast_cancer_data.csv")  

# Specify the columns for duplicate identification  
columns_to_check = ['patient_id', 'clump_thickness', 'cell_size_uniformity']  

# Mark duplicates using the keep parameter  
duplicates = df.duplicated(subset=columns_to_check, keep='first')  

# Store the duplicate flags in a new column 
df['Duplicates'] = duplicates 

# Print the last 10 rows, including the new 'Duplicates' column  
print(df.tail(10))  

     patient_id  clump_thickness  cell_size_uniformity  cell_shape_uniformity  \
689      654546              1.0                   1.0                      1   
690      654546              1.0                   1.0                      1   
691      695091              5.0                  10.0                     10   
692      714039              3.0                   1.0                      1   
693      763235              3.0                   1.0                      1   
694      776715              3.0                   1.0                      1   
695      841769              2.0                   1.0                      1   
696      888820              5.0                  10.0                     10   
697      897471              4.0                   8.0                      6   
698      897471              4.0                   8.0                      8   

     marginal_adhesion  single_ep_cell_size bare_nuclei  bland_chromatin  \
689                  1          

In the provided code, a list of column names, columns_to_check, is specified. These columns serve the purpose of identifying and marking duplicates within the dataset. By utilizing the duplicated() function, with the subset parameter set to columns_to_check and the keep parameter set to 'first', the code effectively flags the duplicate entries in the DataFrame df. 

To make the duplicates more visible, a new column 'Duplicates' is added to the DataFrame df, which contains the values of the duplicates series. Finally, the last 10 rows of the DataFrame are displayed using the tail() and the print() functions, allowing you to observe and analyze the marked duplicate rows along with the rest of the data.
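
The three keep options are easiest to compare side by side. The following sketch applies each of them to the same invented Series of identifiers, in which the value 7 occurs three times.

```python
import pandas as pd

# Invented identifiers: the value 7 appears at positions 0, 2, and 3.
ids = pd.Series([7, 8, 7, 7])

first = ids.duplicated(keep="first").tolist()  # first 7 is not flagged
last = ids.duplicated(keep="last").tolist()    # last 7 is not flagged
every = ids.duplicated(keep=False).tolist()    # all 7s are flagged

print(first)  # [False, False, True, True]
print(last)   # [True, False, True, False]
print(every)  # [True, False, True, True]
```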


## Removing Duplicates
After identifying duplicates using the techniques discussed earlier, the next step is to remove these duplicates from the dataset. Duplicate data can introduce inaccuracies and biases in the analysis, leading to erroneous insights and conclusions.

By removing duplicates, you eliminate redundant information and streamline the dataset, making it more manageable and suitable for analysis. There are multiple methods available for removing duplicates, each with its own advantages and specific use cases. The choice of method depends on the characteristics of the dataset and the objectives of the analysis.

### Dropping Duplicate Rows
Dropping duplicate rows is a technique used to remove duplicate observations from a dataset. Duplicate rows can occur due to various reasons, such as data entry errors or merging multiple datasets. Removing these duplicates ensures the integrity and accuracy of the data, preventing biased or misleading analysis.

Here is an example of dropping duplicate rows on the breast cancer dataset:

In [9]:
# Import necessary libraries 
import pandas as pd 
import warnings   
 
# Ignore warning messages to enhance readability 
warnings.filterwarnings('ignore')    
 
# Load the breast cancer dataset 
df = pd.read_csv("https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/breast_cancer_data.csv") 

# Identify duplicates based on all columns  
duplicates = df.duplicated()  
 
# Print the duplicate row(s)  
print('Duplicate row(s) in the DataFrame', df[duplicates])   

# Remove the duplicate rows from the DataFrame in-place 
df.drop_duplicates(inplace=True) 


# Recompute the duplicate mask on the cleaned DataFrame and print any remaining duplicate rows 
print('After removing duplicates from the DataFrame', df[df.duplicated()]) 

Duplicate row(s) in the DataFrame      patient_id  clump_thickness  cell_size_uniformity  cell_shape_uniformity  \
258     1198641              3.0                   1.0                      1   

     marginal_adhesion  single_ep_cell_size bare_nuclei  bland_chromatin  \
258                  1                    2           1              3.0   

     normal_nucleoli  mitoses   class doctor_name  
258              1.0        1  benign     Dr. Lee  
After removing duplicates from the DataFrame Empty DataFrame
Columns: [patient_id, clump_thickness, cell_size_uniformity, cell_shape_uniformity, marginal_adhesion, single_ep_cell_size, bare_nuclei, bland_chromatin, normal_nucleoli, mitoses, class, doctor_name]
Index: []


In the given code, the breast cancer dataset is loaded into a DataFrame named df. To enhance readability, the warnings library is imported, and warning messages are ignored using the filterwarnings() method.

The duplicated() function is utilized to identify duplicate rows in the DataFrame by considering all columns. To display the duplicate rows, the print() function is used with the DataFrame df and the boolean index duplicates as parameters. This shows the rows that are considered duplicates based on the entire dataset.

Next, the drop_duplicates() function is applied to the DataFrame to remove the duplicate rows in place, ensuring that only unique observations remain. By setting inplace=True, the DataFrame df is modified directly.

Finally, the print() function is used again to show the duplicate rows after removing duplicates.

* Keeping the first occurrence

In certain scenarios, you may want to keep only the first occurrence of duplicated rows in a dataset and remove any subsequent duplicates. This approach ensures that you retain the earliest record of a duplicated entry while eliminating redundant information.

Here is an example that demonstrates keeping the first occurrence of duplicated rows using the drop_duplicates() function.

In [10]:
import pandas as pd 

# Load the breast cancer dataset 
df = pd.read_csv("https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/breast_cancer_data.csv") 

# Keep the first occurrence of duplicates 
df_cleaned = df.drop_duplicates(keep='first') 

# Print the cleaned dataset 
print("\nCleaned Dataset:") 
print(df_cleaned.head()) 


Cleaned Dataset:
   patient_id  clump_thickness  cell_size_uniformity  cell_shape_uniformity  \
0     1000025              5.0                   1.0                      1   
1     1002945              5.0                   4.0                      4   
2     1015425              3.0                   1.0                      1   
3     1016277              6.0                   8.0                      8   
4     1017023              4.0                   1.0                      1   

   marginal_adhesion  single_ep_cell_size bare_nuclei  bland_chromatin  \
0                  1                    2           1              3.0   
1                  5                    7          10              3.0   
2                  1                    2           2              3.0   
3                  1                    3           4              3.0   
4                  3                    2           1              3.0   

   normal_nucleoli  mitoses   class doctor_name  
0           

In the provided code, we use the drop_duplicates() function to eliminate duplicate rows while keeping only the first occurrence. By specifying keep='first', the function retains the first instance of each duplicated row and removes any subsequent duplicates.

The cleaned dataset is stored in the DataFrame df_cleaned. It contains the original rows but without any duplicates beyond the first occurrence. Finally, we print the cleaned dataset to observe the result.

* Keeping the last occurrence

In certain cases, you may want to keep only the last occurrence of duplicated rows in a dataset and remove any earlier duplicates. If the rows are stored in chronological order, this approach retains the most recent record of a duplicated entry while discarding redundant information.

Here's example code that demonstrates keeping the last occurrence of duplicated rows:

In [11]:
import pandas as pd 

# Load the breast cancer dataset 
df = pd.read_csv("https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/breast_cancer_data.csv") 

# Keep the last occurrence of duplicates 
df_cleaned = df.drop_duplicates(keep='last') 

# Print the cleaned dataset 
print("\nCleaned Dataset:") 
print(df_cleaned.head())


Cleaned Dataset:
   patient_id  clump_thickness  cell_size_uniformity  cell_shape_uniformity  \
0     1000025              5.0                   1.0                      1   
1     1002945              5.0                   4.0                      4   
2     1015425              3.0                   1.0                      1   
3     1016277              6.0                   8.0                      8   
4     1017023              4.0                   1.0                      1   

   marginal_adhesion  single_ep_cell_size bare_nuclei  bland_chromatin  \
0                  1                    2           1              3.0   
1                  5                    7          10              3.0   
2                  1                    2           2              3.0   
3                  1                    3           4              3.0   
4                  3                    2           1              3.0   

   normal_nucleoli  mitoses   class doctor_name  
0           

In the provided code, we use the drop_duplicates() method to remove duplicate rows while keeping only the last occurrence. By specifying keep='last', the method retains the last instance of each duplicated row (in row order) and eliminates any earlier duplicates.
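
To make the difference between the two options concrete, here is a minimal sketch on a made-up three-row DataFrame (hypothetical values, not from the breast cancer dataset). It compares which index labels survive under keep='first' and keep='last'.

```python
import pandas as pd

# Toy data (hypothetical values): rows 0 and 2 are identical
toy = pd.DataFrame({
    "patient_id": [101, 102, 101],
    "mitoses": [1, 3, 1],
})

first_kept = toy.drop_duplicates(keep="first")
last_kept = toy.drop_duplicates(keep="last")

# Index labels of the surviving rows under each option
print(list(first_kept.index))
print(list(last_kept.index))
```

When whole rows are compared, the surviving values are the same either way and only the retained index labels differ. The choice matters more once duplicates are defined on a subset of columns.
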

### Removing Duplicates based on Specific Columns
Removing duplicates based on specific columns involves identifying duplicates only within those columns and removing them from the dataset. This technique is useful when you want to focus on specific attributes or when duplicates in certain columns are more critical to address.

Here's the code snippet to remove duplicates based on specific columns using the breast cancer dataset:

In [12]:
import pandas as pd 

# Load the breast cancer dataset 
df = pd.read_csv("https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/breast_cancer_data.csv") 

# Specify the columns for duplicate identification 
columns_to_check = ['patient_id', 'clump_thickness', 'cell_size_uniformity'] 

# Remove duplicates based on specific columns 
df_cleaned = df.drop_duplicates(subset=columns_to_check) 

# Print the last 10 rows of the cleaned dataset 
print(df_cleaned.tail(10)) 

     patient_id  clump_thickness  cell_size_uniformity  cell_shape_uniformity  \
687      566346              3.0                   1.0                      1   
688      603148              4.0                   1.0                      1   
689      654546              1.0                   1.0                      1   
691      695091              5.0                  10.0                     10   
692      714039              3.0                   1.0                      1   
693      763235              3.0                   1.0                      1   
694      776715              3.0                   1.0                      1   
695      841769              2.0                   1.0                      1   
696      888820              5.0                  10.0                     10   
697      897471              4.0                   8.0                      6   

     marginal_adhesion  single_ep_cell_size bare_nuclei  bland_chromatin  \
687                  1          

In this code, we specify the columns we want to consider for duplicate identification in the columns_to_check list. Next, we use the drop_duplicates() function with the subset parameter set to columns_to_check to remove duplicates based on those specific columns. The resulting cleaned dataset is stored in the df_cleaned DataFrame.

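By printing the last 10 rows of the cleaned dataset using tail(10), we can inspect the result. Index labels 690 and 698 are absent from the output because those rows matched an earlier row on the three specified columns and were removed. Note that a row is dropped even if it differs in the other columns, so the subset should be chosen carefully. This approach allows for the targeted removal of duplicates, focusing on the chosen attributes.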
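
Because subset-based removal can discard rows that differ in other columns, it can help to review the affected rows first. The sketch below is one way to do that on made-up data. The DataFrame and the key_columns list are hypothetical and only reuse column names from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values): rows 1 and 2 match on the key columns
# but differ in cell_shape_uniformity
toy = pd.DataFrame({
    "patient_id": [100, 200, 200],
    "clump_thickness": [5.0, 4.0, 4.0],
    "cell_shape_uniformity": [1, 6, 8],
})
key_columns = ["patient_id", "clump_thickness"]

# keep=False flags every member of a duplicate group, not just the later ones
to_review = toy[toy.duplicated(subset=key_columns, keep=False)]
print(to_review)

# Dropping on the subset keeps the first row of each group
deduped = toy.drop_duplicates(subset=key_columns)
print(deduped)
```

Reviewing to_review before dropping shows whether the discarded rows carry information that should be kept or merged instead.
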

# What is inconsistent data?
Inconsistent data is data that is stored in different formats within the same dataset. This includes differences in how similar values are written, such as mixed date or time formats, currency formats, or units of measurement.

## Why is it important?
Inconsistencies can make it difficult to correctly interpret data, especially if the data is intended for analysis or comparison. For example, consider a dataset that contains sales information, where the currency used for sales is US dollars in some cases and British pounds in others. This inconsistent format can create problems when trying to perform operations on the data, such as finding the total amount of revenue generated from sales.

Inconsistent data formats can arise for many reasons, including manual data entry errors, differences in data collection methods, and differences in the way data are stored or processed by different systems. Regardless of the cause, addressing and resolving inconsistent data formats is an important step in the data cleaning and preparation process, as it ensures that the data is usable and meaningful for analysis and other purposes.
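
The currency example above can be sketched as follows. The sales records and the exchange rates are made up for illustration (they are not real market rates). The point is only that values must be converted to one unit before they are aggregated.

```python
import pandas as pd

# Toy sales records (hypothetical values) with mixed currencies
sales = pd.DataFrame({
    "amount": [100.0, 80.0, 50.0],
    "currency": ["USD", "GBP", "USD"],
})

# Assumed, illustrative exchange rates to USD (not real market rates)
rate_to_usd = {"USD": 1.0, "GBP": 1.25}

# Convert every amount to a single currency before aggregating
sales["amount_usd"] = sales["amount"] * sales["currency"].map(rate_to_usd)

naive_total = sales["amount"].sum()            # mixes dollars and pounds
consistent_total = sales["amount_usd"].sum()   # all values in USD
print(naive_total, consistent_total)
```

The naive total adds dollars and pounds as if they were the same unit, so it does not correspond to any real amount.
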

## Data Standardization
Data standardization is a crucial step in the data cleaning process. Inconsistent data formats and scales can lead to issues during analysis and modeling. Data standardization ensures that the data is on a common scale, follows a consistent format, and is suitable for further analysis or modeling tasks.

By applying data standardization techniques, you can improve data quality, enhance comparability across variables, and reduce biases or discrepancies caused by inconsistent data formats.

### Standardizing Data Types
Standardizing data types is an important step in data cleaning to ensure consistency and compatibility across different data sources. In this section, we focus on standardizing the data types of specific columns in the breast cancer dataset. This step ensures that the data is represented in the appropriate format for further analysis and processing. Here is the code that checks the data types of all columns in the breast cancer dataset using the dtypes attribute:

In [13]:
import pandas as pd 

# Load the breast cancer dataset 
df = pd.read_csv("https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/breast_cancer_data.csv") 

# Check the data types of all columns 
print("Data Types of Columns: ") 
print(df.dtypes) 

Data Types of Columns: 
patient_id                 int64
clump_thickness          float64
cell_size_uniformity     float64
cell_shape_uniformity      int64
marginal_adhesion          int64
single_ep_cell_size        int64
bare_nuclei               object
bland_chromatin          float64
normal_nucleoli          float64
mitoses                    int64
class                     object
doctor_name               object
dtype: object


In the output above, the "bare_nuclei" column has the object data type, which usually indicates that it contains strings. Given the context of the data and the expected range of values (1 to 10), a numeric data type is more appropriate for this column.

In [14]:
import pandas as pd 

# Load the breast cancer dataset 
df = pd.read_csv("https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/breast_cancer_data.csv") 


# Check the unique values in the "bare_nuclei" column 
unique_values = df['bare_nuclei'].unique() 
print("Unique Values in the 'bare_nuclei' column:") 
print(unique_values) 

Unique Values in the 'bare_nuclei' column:
['1' '10' '2' '4' '3' '9' '7' '?' '5' '8' '6' nan]


In the code above, the unique() function is used to obtain the unique values in the "bare_nuclei" column. By inspecting the unique values, you can determine if there are any non-numeric values, such as the "?" symbol, in the "bare_nuclei" column before proceeding with the conversion to a numeric data type.

Here is the code that replaces the "?" symbol with NaN in the "bare_nuclei" column and then converts the column's data type to float:

In [15]:
import pandas as pd   
import numpy as np

# Load the breast cancer dataset 
df = pd.read_csv("https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/breast_cancer_data.csv") 

# Replace the '?' symbol with NaN  
df['bare_nuclei'] = df['bare_nuclei'].replace('?', np.nan) 

# Convert the column to a floating-point data type 
df['bare_nuclei'] = df['bare_nuclei'].astype(float) 

# Print the data types after conversion 
print("Dataset after Data Type Conversion: ","\n") 
print(df.dtypes) 

Dataset after Data Type Conversion:  

patient_id                 int64
clump_thickness          float64
cell_size_uniformity     float64
cell_shape_uniformity      int64
marginal_adhesion          int64
single_ep_cell_size        int64
bare_nuclei              float64
bland_chromatin          float64
normal_nucleoli          float64
mitoses                    int64
class                     object
doctor_name               object
dtype: object


In the code above, we replace all occurrences of '?' with NaN (Not a Number) using the replace() method from Pandas. After resolving this issue, we proceed to convert the 'bare_nuclei' column to a suitable data type. In this case, we convert it to a floating-point number data type using the astype() method, specifying float as the desired data type. This ensures that the 'bare_nuclei' column contains numeric values, enabling numerical calculations and analyses to be performed accurately.

Finally, we print the data types of all columns again using the dtypes attribute to verify the change. The output confirms that the 'bare_nuclei' column is now float64, which allows for consistent data processing and analysis.
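
As an alternative to the replace-then-convert steps, pandas also provides pd.to_numeric(), which can do both in one call. The sketch below applies it to a small made-up Series that mimics the mix of values seen in "bare_nuclei". The values are illustrative and not read from the dataset.

```python
import pandas as pd

# Toy column (hypothetical values) mimicking the mix seen in bare_nuclei
raw = pd.Series(["1", "10", "?", None, "4"], name="bare_nuclei")

# errors="coerce" turns any value that cannot be parsed into NaN
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric.dtype)
print(int(numeric.isna().sum()))
```

One trade-off: errors="coerce" silently converts every unparseable value to NaN, not only "?". Inspecting the unique values first, as done earlier, remains a sensible step.
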

### Standardizing Categorical Values
In the context of data cleaning, it is essential to standardize categorical values to ensure consistency and facilitate accurate analysis. One common scenario is when dealing with categorical columns containing values that may vary in their formatting or capitalization. By standardizing these values, we can eliminate inconsistencies and improve the overall quality of the dataset.

In this dataset, the class values are "benign" and "malignant." To prevent variants that differ only in capitalization (for example, "benign" and "Benign") from being treated as separate categories, we standardize the values by capitalizing the first letter throughout the dataset.

In [16]:
import pandas as pd 

# Load the breast cancer dataset 
df = pd.read_csv("https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/breast_cancer_data.csv") 
# Print the original dataset 
print(df) 

# Standardize class values 
df['class'] = df['class'].str.capitalize() 

# Print the standardized column 
print("Dataset after Standardizing Class Values: ") 
print(df['class']) 

     patient_id  clump_thickness  cell_size_uniformity  cell_shape_uniformity  \
0       1000025              5.0                   1.0                      1   
1       1002945              5.0                   4.0                      4   
2       1015425              3.0                   1.0                      1   
3       1016277              6.0                   8.0                      8   
4       1017023              4.0                   1.0                      1   
..          ...              ...                   ...                    ...   
694      776715              3.0                   1.0                      1   
695      841769              2.0                   1.0                      1   
696      888820              5.0                  10.0                     10   
697      897471              4.0                   8.0                      6   
698      897471              4.0                   8.0                      8   

     marginal_adhesion  sin

In the code above, the "class" column is accessed using the column name 'class'. The str.capitalize() method is applied to each value in the column. It converts the first character to uppercase and the remaining characters to lowercase. This operation standardizes the class values by ensuring consistent capitalization.

Finally, the modified column is printed to demonstrate the changes made.
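
The effect is easier to see on labels that actually vary. The following minimal sketch uses a made-up Series of messy labels (hypothetical values, not from the dataset). It also adds a str.strip() step, which is an assumption beyond the code above, to handle stray spaces.

```python
import pandas as pd

# Toy labels (hypothetical values) with inconsistent case and stray spaces
labels = pd.Series(["benign", "Benign ", "MALIGNANT", " malignant"])

# Strip surrounding whitespace, then capitalize
standardized = labels.str.strip().str.capitalize()

# Number of distinct categories before and after standardization
print(labels.nunique(), standardized.nunique())
print(sorted(standardized.unique()))
```

Comparing nunique() before and after is a simple way to confirm that variants of the same category have been merged.
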