## Objectives
- Identify the data cleaning phase and each stage from an expert’s perspective.
- Implement solutions for common data structure issues programmatically.
- Implement cleaning for dirty and messy data with Python for tabular data, time series, and text data.
- Implement visual and programmatic testing.
- Store cleaned data.

## Manual vs. programmatic cleaning

Programmatic data cleaning is faster, more efficient, and a better fit for reproducible data wrangling workflows than manual data cleaning. Data wrangling consumes a large share of a data professional's time, so programmatic cleaning is strongly preferred over the manual process.

## The data cleaning process

Programmatic data cleaning has three steps:

- Define: write a data cleaning plan by converting your assessments into cleaning tasks.
- Code: translate the data cleaning plan to code and run it to clean the data.
- Test: test the dataset, often using code, to ensure the cleaning code works and revisit some of the elements from the assessment phase.
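
To make the three steps concrete, here is a minimal sketch on a hypothetical toy table (the column names and values are invented for illustration, not taken from any course dataset):

```python
import pandas as pd

# Hypothetical toy table: the "age" column was read in as text.
df = pd.DataFrame({"patient_id": [1, 2, 3], "age": ["34", "51", "29"]})

# Define: "age" should be stored as an integer, not as a string.

# Code: convert the column on a copy so the raw data stays untouched.
clean = df.copy()
clean["age"] = clean["age"].astype(int)

# Test: check programmatically that the cleaning step worked.
assert pd.api.types.is_integer_dtype(clean["age"])
```

Working on a copy is a design choice here: it keeps the original table available for comparison during the test step.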

## Structuring issues and fixing them

Let's recap the data structuring issues and remediation techniques:

Issue: Column headers are values, not variable names
- Use **unpivoting** or melting to convert a wide dataset (with many columns) into a long format (with fewer columns and more rows).
- Use **transposing** to switch rows with columns and columns with rows.

Issue: The groupings in the data are hard to see
- Use **pivoting** to convert a long dataset into a wide dataset and create useful groupings.
- Use **group-by and aggregations** to group categorical data together and aggregate the values associated with them like a sum or a mean.

Issue: A single observational unit is stored in multiple tables
- Use **merging** to combine multiple data tables into a single table with all of the required information.
- Use **appending or concatenating** on two tables with the same variables to add the data points from one table to the other.

- **Unpivoting:** Using pandas' `.melt()`, specifying `id_vars` as the identifier variable(s), `value_vars` as the columns to unpivot, `var_name` as the name of the new column holding the original table's column names, and `value_name` as the name of the new column holding the original table's values.

- **Pivoting:** Using the `.pivot()` method with the `index`, `columns`, and `values` arguments to produce groupings.

- **Transposing:** Using the `.T` attribute (or the equivalent `.transpose()` method).

- **Merging:** Using the pandas `.merge()` method. By default, `.merge()` performs an inner join, using the intersection of keys from both frames and preserving the order of the left keys.

- **Appending:** Using the `pd.concat()` function. Pass `ignore_index=True` to ensure the index is labeled 0 to n-1, where n is the total number of rows.

- **Group-by and aggregation:** Using `.groupby()` and `.agg()`.
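
The reshaping methods listed above can be sketched on a small, hypothetical table. The column names (`country`, `year`, `cases`) and all values are made up for illustration:

```python
import pandas as pd

# Hypothetical wide table: the year columns are values, not variable names.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2019": [10, 20],
    "2020": [11, 22],
})

# Unpivot: one row per (country, year) pair.
long = wide.melt(id_vars="country", var_name="year", value_name="cases")

# Pivot: go back to one column per year.
wide_again = long.pivot(index="country", columns="year", values="cases")

# Transpose: rows become columns and columns become rows.
transposed = wide.set_index("country").T

# Append: stack two tables that share the same columns.
extra = pd.DataFrame({"country": ["C"], "year": ["2019"], "cases": [30]})
stacked = pd.concat([long, extra], ignore_index=True)

# Group by and aggregate: total and mean cases per country.
summary = stacked.groupby("country")["cases"].agg(["sum", "mean"])
print(summary)
```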

## Advanced Merging

To perform more advanced pandas merging operations, you can use the `on` and `how` arguments of the `.merge()` method.

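1. `on` specifies the column names to join on. You can specify multiple columns as long as they are found in both DataFrames.
2. `how` specifies the type of merge to be performed (`"inner"`, `"left"`, `"right"`, or `"outer"`).
3. Furthermore, `left_on` and `right_on` specify which columns of the left and right DataFrames to use as join keys, which is useful when the key columns have different names. To join on the index instead, use `left_index` and `right_index`.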
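
As a rough illustration of these arguments, the sketch below merges two hypothetical tables (`patients` and `visits`, with invented columns and values) whose key columns are named differently:

```python
import pandas as pd

# Hypothetical tables; the key column is "patient_id" in one and "pid" in the other.
patients = pd.DataFrame({"patient_id": [1, 2, 3], "name": ["Ann", "Bo", "Cy"]})
visits = pd.DataFrame({
    "pid": [1, 1, 4],
    "visit": ["2020-01-05", "2020-02-10", "2020-03-01"],
})

# Different key names: use left_on and right_on.
inner = patients.merge(visits, left_on="patient_id", right_on="pid", how="inner")
left = patients.merge(visits, left_on="patient_id", right_on="pid", how="left")

# Shared key name: use on. An outer join keeps unmatched rows from both sides.
visits_renamed = visits.rename(columns={"pid": "patient_id"})
outer = patients.merge(visits_renamed, on="patient_id", how="outer")

print(len(inner), len(left), len(outer))
```

The inner join keeps only patient 1 (two visits), the left join also keeps patients 2 and 3 with missing visit data, and the outer join additionally keeps the visit for the unknown patient 4.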

## Handle Outliers

- Set up a valid range manually if you already know it, and filter with pandas indexing.
- Identify outliers automatically using the standard deviation method, based on the mean and standard deviation reported by `df.describe()`.
- Drop outliers using `df.drop(index=...)`.
    - Recall that you can pass `inplace=True` to drop the rows in place, which overwrites your original data.
- Finally, identify the impact on summary statistics after dealing with outliers.
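
One possible way to put these steps together is sketched below. The data are invented, and the cutoff of three standard deviations is an assumption (a common convention, not something these notes prescribe):

```python
import pandas as pd

# Hypothetical measurements with one obviously extreme value.
df = pd.DataFrame({"value": [10, 12, 11, 13, 12] * 4 + [100]})

# Standard deviation method: use the mean and std reported by describe().
stats = df["value"].describe()
n_std = 3  # assumed cutoff; adjust to your data
lower = stats["mean"] - n_std * stats["std"]
upper = stats["mean"] + n_std * stats["std"]

# Flag values outside the range, then drop those rows by index label.
outlier_mask = (df["value"] < lower) | (df["value"] > upper)
cleaned = df.drop(index=df[outlier_mask].index)

# Compare summary statistics before and after.
print(df["value"].describe())
print(cleaned["value"].describe())
```

If the valid range is already known, `lower` and `upper` can simply be set by hand instead of being derived from `describe()`.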

## Duplicates
- `.duplicated()` flags repeated rows, `.drop_duplicates()` removes them, and `.drop()` removes rows by index label.
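
A minimal sketch of finding and removing duplicate rows, using a hypothetical table with one repeated row:

```python
import pandas as pd

# Hypothetical table in which the row for id 2 appears twice.
df = pd.DataFrame({"id": [1, 2, 2, 3], "score": [90, 85, 85, 70]})

# duplicated() marks the second and later occurrences of a row as True.
dup_mask = df.duplicated()
print(df[dup_mask])

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = df.drop_duplicates().reset_index(drop=True)
```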

## Missing values
It's important to check whether the missing values in the dataset are correctly represented. Sometimes missing values are encoded as characters like "-" and "#" or as text like "no data", which the `.isna()` method will not detect. So we should always check how missing values are represented and replace any placeholders with a proper missing-value marker like `np.nan`. Some useful methods are:
- `.isin()`
- `.replace()`
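
For example, placeholder strings can be located with `.isin()` and converted with `.replace()`. This is a sketch on invented data; the list of placeholders is an assumption and should come from your own assessment of the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical table where missing values hide behind placeholder strings.
df = pd.DataFrame({
    "height": ["170", "-", "165", "no data"],
    "city": ["Oslo", "#", "Rome", "Lima"],
})
placeholders = ["-", "#", "no data"]  # assumed placeholder values

# isna() finds nothing yet, because the placeholders are ordinary strings.
print(df.isna().sum())

# isin() shows where the placeholders are.
hidden = df.isin(placeholders)

# replace() turns them into real missing values.
cleaned = df.replace(placeholders, np.nan)
print(cleaned.isna().sum())
```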

Let's briefly recap the options to deal with these NaNs:

1. Drop the affected rows with pandas' `dropna()` if only a few rows have missing values; in that case it usually won't impact your data analyses significantly.
2. Drop an entire column when almost all of the values (such as 95-98% of values) are missing in a column. You can use `df.drop('COLUMN_NAME', axis=1)`.
3. If you don’t want to drop the existing data, impute these values using pandas' `fillna()` method. For example, to impute the values using the mean of the column, use `df['COLUMN_NAME'].fillna(df['COLUMN_NAME'].mean())`.
4. Convert the data into categories using `pd.cut()`, then apply one of the above techniques.
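
The first three options can be sketched as follows, on a hypothetical table (column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical table: "age" has one gap, "notes" is almost entirely empty.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0],
    "notes": [np.nan, np.nan, np.nan, "ok"],
})

# Option 1: drop the few rows where "age" is missing.
rows_dropped = df.dropna(subset=["age"])

# Option 2: drop a column that is almost entirely missing.
col_dropped = df.drop("notes", axis=1)

# Option 3: impute missing ages with the column mean.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())
```

Each call returns a new object, so the original `df` is left unchanged unless you assign the result back.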

When checking data quality, it is usually best to deal with completeness issues first so that subsequent data cleaning around missing data will not have to be repeated.

We looked at ways to remediate the following tidiness issues in the data tables.

- **Multiple variables** are stored in one column. String operations and unpivoting can help us resolve this type of issue.
- **One observational unit** is stored in multiple tables. We can use merging to solve the issue.
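
As an illustration of the first issue, a column that packs two variables into one string can be split with pandas string operations. The table, the `contact` column, and the `:` separator below are hypothetical:

```python
import pandas as pd

# Hypothetical table: name and phone number share a single column.
df = pd.DataFrame({"contact": ["Ann Lee:555-0101", "Bo Kim:555-0102"]})

# Split on the separator and spread the pieces into two new columns.
df[["name", "phone"]] = df["contact"].str.split(":", expand=True)

# The combined column is no longer needed.
df = df.drop(columns="contact")
print(df)
```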

We looked at how to remediate some major data quality issues in our clinical trial data tables, including using operations like `str.strip()`, `astype()`, pandas indexing, and more!

This concludes our exploration of data quality issues in this demo. Note that we didn’t get to all the data quality issues - some issues we didn't cover include:

- **Validity issues** where sometimes state names are fully mentioned whereas other times abbreviations are used.
- Multiple phone number formats.