# Basic Data Cleaning

Adapted from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/).

## Overview

This tutorial covers basic data cleaning techniques using Python and pandas. We'll explore common data quality issues and learn how to address them effectively.

Data cleaning is a crucial step in the data science workflow: it helps ensure that your datasets are accurate, consistent, and ready for analysis. In this tutorial, you'll learn to identify and address common data quality issues, such as columns that contain a single value, columns with very few unique values, and duplicate rows. Through hands-on exercises using pandas, you'll gain practical experience in these essential data cleaning tasks and build a foundation for preparing datasets for further analysis and modeling.

## Learning Objectives

- Understand the importance of data cleaning in the data science workflow
- Learn to identify and handle common data quality issues
- Gain practical experience in using pandas for data cleaning tasks
- Develop skills to prepare datasets for further analysis and modeling

## Prerequisites

- Basic knowledge of Python programming
- Familiarity with the pandas library

## Get Started

To start, install the required packages and import the necessary libraries.

### Install required packages

In [None]:
# Installs the pandas library using pip, a package installer for Python.
# Pandas is a powerful data manipulation and analysis library.
%pip install pandas

### Import necessary libraries

In [None]:
# Imports the pandas library and assigns it the alias 'pd' for easier use throughout the code.
import pandas as pd

## Messy Dataset

The breast cancer dataset classifies each breast cancer patient as having either a recurrence or no recurrence of cancer.

```
Number of Instances: 286
Number of Attributes: 9 + the class attribute
Attribute Information:
   1. Class: no-recurrence-events, recurrence-events
   2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.
   3. menopause: lt40, ge40, premeno.
   4. tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59.
   5. inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39.
   6. node-caps: yes, no.
   7. deg-malig: 1, 2, 3.
   8. breast: left, right.
   9. breast-quad: left-up, left-low, right-up, right-low, central.
  10. irradiat: yes, no.
Missing Attribute Values: (denoted by "?")
   Attribute #:  Number of instances with missing values:
   6.             8
   9.             1
Class Distribution:
    1. no-recurrence-events: 201 instances
    2. recurrence-events: 85 instances 
```

You can learn more about the dataset here:

* Breast Cancer Dataset ([breast-cancer.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv))
* Breast Cancer Dataset Description ([breast-cancer.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.names))


The messy dataset used in this tutorial was modified from the Breast Cancer Dataset so that various data cleaning techniques can be demonstrated.

## Identify Columns That Contain a Single Value

First, load the dataset into a pandas DataFrame and display the first few rows.

In [None]:
# Define the file path to the messy_data.csv dataset.
messy_data = "../../Data/messy_data.csv"

# Load the dataset into a pandas DataFrame
df = pd.read_csv(messy_data, header=None)

# Display the first few rows of the DataFrame to inspect the data
df.head()

Then, print the shape of the DataFrame and summarize the number of unique values in each column using `nunique()`.

In [None]:
# Print the shape of the DataFrame 'df', which represents the number of rows and columns.
print("Shape of messy data: ", df.shape)
# Print a header to indicate that the following output shows the number of unique values per column.
print("Column\t#Unique values ")
# Print the number of unique values for each column in the DataFrame 'df'.
print(df.nunique())

The output shows that the column at index 5 contains only a single value. A column that never varies carries no information for modeling, so it should be removed.
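
To make the behavior of `nunique()` concrete, here is a minimal sketch on a small, made-up DataFrame (the `toy` data and the name `single_valued` are illustrative assumptions, not part of the messy dataset). It shows that `nunique()` returns one count per column and that the result can be filtered to find single-valued columns.

```python
import pandas as pd

# Hypothetical toy data: column 1 holds the same value in every row.
toy = pd.DataFrame([
    ["a", 1, "x"],
    ["b", 1, "y"],
    ["c", 1, "x"],
])

# nunique() returns a Series of distinct-value counts, indexed by column label.
counts = toy.nunique()
print(counts)

# Labels of the columns that contain exactly one distinct value.
single_valued = counts[counts == 1].index.tolist()
print(single_valued)  # [1]
```

Note that `nunique()` ignores missing values by default, so a column holding one value plus some `NaN` entries is also reported as having a single unique value.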

## Delete columns that contain a single value

In [None]:
# load the dataset from the CSV file at the path stored in 'messy_data' into a pandas DataFrame (no header row)
df = pd.read_csv(messy_data, header=None)
# print the shape (number of rows and columns) of the DataFrame to understand its dimensions
print(df.shape)

# calculate and store the number of unique values in each column of the DataFrame
counts = df.nunique()

# create a list called 'to_del' containing the indices of columns where the number of unique values is equal to 1
to_del = [i for i, v in enumerate(counts) if v == 1]
# print the list of column indices that are marked for deletion because they have only one unique value
print(to_del)

# drop the columns identified in 'to_del' from the DataFrame in place (modifying the DataFrame directly)
df.drop(to_del, axis=1, inplace=True)
# print the shape of the DataFrame again after dropping the columns to show the updated dimensions
print(df.shape)
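
The cell above collects column positions with `enumerate`, which matches the column labels only because `header=None` gives the columns integer labels starting at 0. As a hedged alternative, the sketch below selects columns by label, which should also work when a dataset has named columns. The `toy` data and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with named columns; 'site' is constant.
toy = pd.DataFrame({
    "age": ["30-39", "40-49", "50-59"],
    "site": ["A", "A", "A"],
    "breast": ["left", "right", "left"],
})

counts = toy.nunique()

# Select labels rather than positions, so this works for any column index.
to_del = counts[counts == 1].index

# drop() returns a new DataFrame here; 'toy' itself is left unchanged.
cleaned = toy.drop(columns=to_del)

print(list(to_del))   # ['site']
print(cleaned.shape)  # (3, 2)
```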

## Identify columns that have very few values

Columns whose number of unique values is very small relative to the number of rows are candidates for closer inspection. The cell below lists every column in which the unique values amount to less than 1% of the rows.

In [None]:
# Load the dataset from a CSV file into a pandas DataFrame. The file path is specified by the variable 'messy_data' and there is no header row in the CSV.
df = pd.read_csv(messy_data, header=None)

# Print a header to the output summarizing the columns that will be displayed.
print("Column, Count, <1%")
# Iterate through each column index and the count of unique values in each column of the DataFrame.
for i, v in enumerate(df.nunique()):
    # Calculate the percentage of unique values in the current column relative to the total number of rows in the DataFrame.
    percentage = float(v) / df.shape[0] * 100
    # Check if the calculated percentage is less than 1%.
    if percentage < 1:
        # If the percentage is less than 1%, print the column index (i), the count of unique values (v), and the calculated percentage (formatted to one decimal place).
        print("%d, %d, %.1f%%" % (i, v, percentage))
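
The same percentage can be computed for all columns at once instead of in a loop. The following is a minimal sketch on made-up data (the `toy` DataFrame and the names `pct_unique` and `low_variety` are assumptions for illustration), not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical toy data with 1000 rows.
toy = pd.DataFrame({
    "id": range(1000),               # 1000 unique values -> 100%
    "flag": [0, 1] * 500,            # 2 unique values    -> 0.2%
    "bucket": list(range(20)) * 50,  # 20 unique values   -> 2%
})

# Percentage of unique values per column, computed for all columns at once.
pct_unique = toy.nunique() / len(toy) * 100
print(pct_unique)

# Columns whose unique-value percentage is below 1%.
low_variety = pct_unique[pct_unique < 1]
print(low_variety)
```

A low percentage does not by itself mean a column is useless. Categorical attributes such as those in the breast cancer dataset naturally take only a handful of values, so treat the result as a list of columns to review rather than a list to delete.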

## Drop columns with unique values less than 1 percent of rows

In [None]:
# load the dataset from the CSV file at the path stored in 'messy_data' into a pandas DataFrame, assuming no header row.
df = pd.read_csv(messy_data, header=None)
# print the shape of the DataFrame (number of rows and columns) to get an overview of its dimensions.
print(df.shape)

# calculate the number of unique values in each column of the DataFrame.
counts = df.nunique()

# identify columns to be deleted based on a threshold: if the percentage of unique values in a column is less than 1% of the total number of rows, mark it for deletion.
to_del = [i for i, v in enumerate(counts) if (float(v) / df.shape[0] * 100) < 1]
# print the indices of the columns identified for deletion.
print("Columns to delete: ", to_del)

# drop the columns identified in 'to_del' from the DataFrame. axis=1 specifies columns, and inplace=True modifies the DataFrame directly.
df.drop(to_del, axis=1, inplace=True)
# print the shape of the DataFrame again after dropping the columns to show the reduced dimensions.
print(df.shape)
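
If the threshold needs to be tuned, the drop step can be wrapped in a small function. The helper below is a hypothetical sketch (`drop_low_variety_columns` and `threshold_pct` are names invented here), shown on made-up data. It returns a new DataFrame instead of modifying the input in place.

```python
import pandas as pd


def drop_low_variety_columns(frame, threshold_pct=1.0):
    """Return a copy of `frame` without the columns whose percentage of
    unique values is below `threshold_pct`. Hypothetical helper."""
    pct_unique = frame.nunique() / len(frame) * 100
    keep = pct_unique[pct_unique >= threshold_pct].index
    return frame[keep].copy()


# Hypothetical toy data with 500 rows.
toy = pd.DataFrame({
    "id": range(500),      # 100% unique
    "flag": [0, 1] * 250,  # 0.4% unique
})

reduced = drop_low_variety_columns(toy)
print(toy.shape, "->", reduced.shape)  # (500, 2) -> (500, 1)
```

Returning a copy keeps the original DataFrame available for comparison, which can be convenient while experimenting with different thresholds.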

## Identify rows that contain duplicate data

In [None]:
# Load the dataset from the CSV file at the path stored in 'messy_data' into a pandas DataFrame, assuming no header row in the file.
df = pd.read_csv(messy_data, header=None)

# Identify and create a boolean Series 'dups' indicating duplicate rows in the DataFrame 'df'.
dups = df.duplicated()

# Print a message to the console indicating whether any duplicate rows were found in the DataFrame 'df'.
print("Any duplicates? ", dups.any())

# Print a header message indicating the display of duplicated rows.
print("Duplicated rows:")
# Print all rows from the DataFrame 'df' that are marked as duplicates in the boolean Series 'dups'.
print(df[dups])
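
By default, `duplicated()` marks only the second and later occurrences of a row, so the first copy does not appear in the output above. The sketch below, on made-up data, contrasts the default with `keep=False`, which flags every member of a duplicate group and can make it easier to inspect the copies side by side.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical.
toy = pd.DataFrame([
    ["40-49", "premeno", "left"],
    ["50-59", "ge40", "right"],
    ["40-49", "premeno", "left"],
])

# Default keep="first": only later repeats are flagged.
later_repeats = toy.duplicated()
print(later_repeats.tolist())  # [False, False, True]

# keep=False: every row that has at least one identical copy is flagged.
all_copies = toy.duplicated(keep=False)
print(toy[all_copies])

# Summing the boolean Series counts the rows that would be removed.
print("Number of repeated rows:", later_repeats.sum())
```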

## Delete rows that contain duplicate data

In [None]:
# Load the dataset from the CSV file at the path stored in 'messy_data' into a pandas DataFrame.
# 'header=None' argument indicates that the CSV file does not have a header row.
df = pd.read_csv(messy_data, header=None)
# Print the shape of the DataFrame after loading the data.
# This will output the number of rows and columns in the DataFrame before removing duplicates.
print(df.shape)

# Remove duplicate rows from the DataFrame in place.
# 'inplace=True' modifies the DataFrame directly instead of returning a new DataFrame.
df.drop_duplicates(inplace=True)
# Print the shape of the DataFrame after removing duplicate rows.
# This will output the number of rows and columns in the DataFrame after duplicate rows are removed.
print(df.shape)
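
After duplicates are removed, the surviving rows keep their original index labels, which leaves gaps in the index. The following minimal sketch, on made-up data, shows the gap and one way to renumber the rows with `reset_index(drop=True)`. Whether renumbering is desirable depends on whether later steps rely on the original row labels.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical.
toy = pd.DataFrame([
    ["40-49", "premeno", "left"],
    ["40-49", "premeno", "left"],
    ["50-59", "ge40", "right"],
])

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = toy.drop_duplicates()
print(deduped.index.tolist())  # [0, 2]

# Renumber the rows from 0 and discard the old index labels.
deduped = deduped.reset_index(drop=True)
print(deduped.index.tolist())  # [0, 1]
print(toy.shape, "->", deduped.shape)
```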

## Conclusion

In this tutorial, we've learned basic data cleaning techniques using Python and pandas. We've covered how to identify and delete columns that contain a single value, how to identify and drop columns with very few unique values, and how to identify and delete duplicate rows. These skills are crucial for preparing datasets for further analysis and modeling in data science projects.

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.

