# Basic Data Cleaning

Adapted from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/).

## Overview

This tutorial covers basic data cleaning techniques using Python and pandas. We'll explore common data quality issues and learn how to address them effectively.

Data cleaning is a crucial step in the data science workflow: it makes your datasets accurate, consistent, and ready for analysis. In this tutorial, you'll learn to identify and address common data quality issues, such as columns that contain a single value, columns with very few unique values, and duplicate rows. Through hands-on exercises using pandas, you'll gain practical experience in these data cleaning tasks. By the end of this tutorial, you'll be able to prepare datasets for further analysis and modeling.

## Learning Objectives

- Understand the importance of data cleaning in the data science workflow
- Learn to identify and handle common data quality issues
- Gain practical experience in using pandas for data cleaning tasks
- Develop skills to prepare datasets for further analysis and modeling

## Prerequisites

- Basic knowledge of Python programming
- Familiarity with the pandas library

## Get Started

To start, we install required packages, import the necessary libraries, and define a helper function to download data using the `requests` library.

### Install required packages

In [None]:
%pip install pandas
%pip install requests

### Import necessary libraries

In [None]:
from pathlib import Path

import pandas as pd
import requests

### Define utility functions

Define a helper function for downloading example datasets.

*Note:* You do not need to understand the following code. It only fetches the example data.

In [None]:
def download(url, to_file):
    """Download content from the given URL and save it to a file.

    Args:
        url (str): The URL to download the content from.
        to_file (str): The name of the file to save the downloaded content to.

    """
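    response = requests.get(url, timeout=10)
    # Raise an HTTPError on a failed request instead of saving an error page
    response.raise_for_status()
    Path(to_file).write_bytes(response.content)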
    print(f"downloaded file '{to_file}'")

## Messy Dataset

The breast cancer dataset classifies each breast cancer patient as having either a recurrence or no recurrence of cancer.

```
Number of Instances: 286
Number of Attributes: 9 + the class attribute
Attribute Information:
   1. Class: no-recurrence-events, recurrence-events
   2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.
   3. menopause: lt40, ge40, premeno.
   4. tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59.
   5. inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39.
   6. node-caps: yes, no.
   7. deg-malig: 1, 2, 3.
   8. breast: left, right.
   9. breast-quad: left-up, left-low, right-up, right-low, central.
  10. irradiat: yes, no.
Missing Attribute Values: (denoted by "?")
   Attribute #:  Number of instances with missing values:
   6.             8
   9.             1.
Class Distribution:
    1. no-recurrence-events: 201 instances
    2. recurrence-events: 85 instances 
```

You can learn more about the dataset here:

* Breast Cancer Dataset ([breast-cancer.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv))
* Breast Cancer Dataset Description ([breast-cancer.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.names))
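
The description above notes that missing values are marked with `?`. pandas reads such markers as ordinary strings unless told otherwise. The following is a minimal sketch of that behavior, using a small inline CSV invented for illustration rather than the real file:

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same style as the dataset
raw = "40-49,yes,left\n50-59,?,right\n30-39,no,?\n"

# Default: "?" is kept as a normal string, so nothing counts as missing
as_text = pd.read_csv(io.StringIO(raw), header=None)
print(as_text.isna().sum().tolist())

# With na_values="?", each "?" becomes NaN and is counted as missing
as_missing = pd.read_csv(io.StringIO(raw), header=None, na_values="?")
print(as_missing.isna().sum().tolist())
```

Whether to pass `na_values="?"` when loading the tutorial data is a judgment call; the cells below load the file without it.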


The messy dataset was modified from the Breast Cancer dataset so that various data cleaning techniques can be demonstrated.

### Download messy data file

In [None]:
download(
    url="https://raw.githubusercontent.com/udel-cbcb/al_ml_workshop/main/data/messy_data.csv",
    to_file="messy_data.csv",
)

## Identify columns that contain a single value

First, load the dataset and preview its first rows. The file has no header row, so we pass `header=None` and pandas labels the columns 0, 1, 2, and so on.

In [None]:
# Load the dataset
df = pd.read_csv("messy_data.csv", header=None)

# Peek into the top five rows
df.head()


Then, summarize the number of unique values in each column using `nunique()`.

In [None]:
print("Shape of messy data: ", df.shape)
print("Column\t#Unique values ")
print(df.nunique())

The output shows that the column at index 5 has only a single value. A column that never varies carries no information for modeling, so it should be removed.
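
To see what `nunique()` reports in isolation, here is a minimal sketch on a small hand-made DataFrame. The column names and values are invented for illustration and are not taken from the messy dataset.

```python
import pandas as pd

# Hypothetical toy data: "site" holds the same value in every row
toy = pd.DataFrame(
    {
        "age": ["30-39", "40-49", "40-49", "50-59"],
        "site": ["A", "A", "A", "A"],
        "irradiat": ["yes", "no", "no", "yes"],
    }
)

# One count of distinct values per column
unique_counts = toy.nunique()
print(unique_counts)

# Columns whose count is 1 contain a single value
single_valued = unique_counts[unique_counts == 1].index.tolist()
print(single_valued)
```

Note that `nunique()` ignores missing values (NaN) by default, so a column holding one value plus some NaN entries is still reported as having one unique value.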

## Delete columns that contain a single value

In [None]:
# load the dataset
df = pd.read_csv("messy_data.csv", header=None)
print(df.shape)

# get number of unique values for each column
counts = df.nunique()

# record columns to delete
to_del = [i for i, v in enumerate(counts) if v == 1]
print(to_del)

# drop useless columns
df.drop(to_del, axis=1, inplace=True)
print(df.shape)
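
The cell above builds `to_del` from column positions. That works here because reading with `header=None` makes the column labels equal to the positions 0, 1, 2, and so on. When columns have names, a boolean mask avoids the position-to-label assumption. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data with named columns; "site" never varies
toy = pd.DataFrame(
    {
        "age": ["30-39", "40-49", "40-49", "50-59"],
        "site": ["A", "A", "A", "A"],
        "deg_malig": [1, 2, 3, 2],
    }
)

# True for every column that has more than one distinct value
keep = toy.nunique() > 1

# Select the varying columns by label; the original frame is left untouched
cleaned = toy.loc[:, keep]
print(cleaned.columns.tolist())
print(cleaned.shape)
```

Returning a new DataFrame instead of using `inplace=True` keeps the original data available for comparison.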

## Identify columns that have very few values

In [None]:
# load the dataset
df = pd.read_csv("messy_data.csv", header=None)

# summarize the number of unique values in each column
print("Column, Count, <1%")
for i, v in enumerate(df.nunique()):
    # Number of unique values as a percentage of the number of rows
    percentage = v / df.shape[0] * 100
    if percentage < 1:
        print(f"{i}, {v}, {percentage:.1f}%")
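
The same calculation can be written without an explicit loop, since dividing the result of `nunique()` by the row count gives one percentage per column. A minimal sketch, assuming a hypothetical 200-row toy DataFrame:

```python
import pandas as pd

# Hypothetical toy data with 200 rows
toy = pd.DataFrame(
    {
        "row_id": range(200),        # 200 unique values: 100% of rows
        "grade": [1, 2, 3, 4] * 50,  # 4 unique values: 2% of rows
        "constant": ["x"] * 200,     # 1 unique value: 0.5% of rows
    }
)

# Unique values per column as a percentage of the number of rows
pct_unique = toy.nunique() / len(toy) * 100
print(pct_unique)

# Columns that fall below the 1% threshold
low_variety = pct_unique[pct_unique < 1].index.tolist()
print(low_variety)
```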

## Drop columns with unique values less than 1 percent of rows

In [None]:
# load the dataset
df = pd.read_csv("messy_data.csv", header=None)
print(df.shape)

# get number of unique values for each column
counts = df.nunique()

# record columns to delete
to_del = [i for i, v in enumerate(counts) if (float(v) / df.shape[0] * 100) < 1]
print("Columns to delete: ", to_del)

# drop useless columns
df.drop(to_del, axis=1, inplace=True)
print(df.shape)
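
Dropping every column below the 1% threshold can be too aggressive. A categorical column such as `breast` (left, right) has few unique values by design and may still be informative. Before dropping, it can help to check how the values are distributed. A minimal sketch on hypothetical toy data, where the `batch` column is invented for illustration:

```python
import pandas as pd

# Hypothetical toy data with 200 rows and two low-cardinality columns
toy = pd.DataFrame(
    {
        "breast": ["left"] * 120 + ["right"] * 80,  # balanced enough to be useful
        "batch": ["b1"] * 199 + ["b2"],             # almost constant
    }
)

# Share of rows taken by the most frequent value in each column
dominant_share = {}
for name in toy.columns:
    shares = toy[name].value_counts(normalize=True)
    dominant_share[name] = shares.iloc[0]
    print(name, shares.round(3).to_dict())
```

Both columns have only two unique values, yet nearly every row of `batch` holds the same value while `breast` is split 60/40. Which of the two to drop depends on the modeling task.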

## Identify rows that contain duplicate data

In [None]:
# load the dataset
df = pd.read_csv("messy_data.csv", header=None)

# calculate duplicates
dups = df.duplicated()

# report if there are any duplicates
print("Any duplicates? ", dups.any())

# list all duplicate rows
print("Duplicated rows:")
print(df[dups])
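
By default, `duplicated()` marks only the second and later copies of a row, so the first occurrence is not listed. Passing `keep=False` marks every copy, which makes it easier to view duplicates side by side. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical
toy = pd.DataFrame(
    {
        "age": ["30-39", "40-49", "30-39", "50-59"],
        "breast": ["left", "right", "left", "left"],
    }
)

# Default (keep="first"): only the later copy is flagged
later_copies = toy.duplicated()
print(toy[later_copies])

# keep=False: every row that has a duplicate is flagged
all_copies = toy.duplicated(keep=False)
print(toy[all_copies])
```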

## Delete rows that contain duplicate data

In [None]:
# load the dataset
df = pd.read_csv("messy_data.csv", header=None)
print(df.shape)

# delete duplicate rows
df.drop_duplicates(inplace=True)
print(df.shape)
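
After `drop_duplicates()`, the remaining rows keep their original index labels, which leaves gaps in the index. If a continuous index is preferred, `ignore_index=True` (available in pandas 1.0 and later) renumbers the rows. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: row 2 repeats row 0
toy = pd.DataFrame(
    {
        "age": ["30-39", "40-49", "30-39", "50-59"],
        "breast": ["left", "right", "left", "left"],
    }
)

# Default: the first occurrence is kept and original labels are preserved
deduped = toy.drop_duplicates()
print(deduped.index.tolist())

# ignore_index=True relabels the remaining rows 0, 1, 2, ...
renumbered = toy.drop_duplicates(ignore_index=True)
print(renumbered.index.tolist())
```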

## Conclusion

In this tutorial, we've learned basic data cleaning techniques using Python and pandas. We've covered how to identify and delete columns that contain a single value, how to flag and drop columns with very few unique values, and how to find and remove duplicate rows. These skills are crucial for preparing datasets for further analysis and modeling in data science projects.

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.

