Skip to content
This repository has been archived by the owner on May 23, 2024. It is now read-only.

4.4.1.Manually cleaning data

Seongjoo Brenden Song edited this page Jul 12, 2021 · 1 revision

Verification

A process to confirm that a data-cleaning effort was well-executed and the resulting data is accurate and reliable

Changelog

A file containing a chronologically ordered list of modifications made to a project

See the big picture when verifying data-cleaning

  1. Consider the business problem
  2. Consider the goal
  3. Consider the data

CASE statement

The CASE statement goes through one or more conditions and returns a value as soon as a condition is met

Example:

SELECT
  customer_id,
  CASE
    WHEN first_name = 'Tnoy' THEN 'Tony'
    WHEN first_name = 'Tmo' THEN 'Tom'
    WHEN first_name = 'Rachle' THEN 'Rachel'
    ELSE first_name
    END AS cleaned_name
FROM
  customer_data.customer_name   

Data-cleaning verification: A checklist

This reading will give you a checklist of common problems you can refer to when doing your data cleaning verification, no matter what tool you are using. When it comes to data cleaning verification, there is no one-size-fits-all approach or a single checklist that can be universally applied to all projects. Each project has its own organization and data requirements that lead to a unique list of things to run through for verification.

Keep in mind, as you receive more data or a better understanding of the project goal(s), you might want to revisit some or all of these steps.

Correct the most common problems

Make sure you identified the most common problems and corrected them, including:

  • Sources of errors: Did you use the right tools and functions to find the source of the errors in your dataset?
  • Null data: Did you search for NULLs using conditional formatting and filters?
  • Misspelled words: Did you locate all misspellings?
  • Mistyped numbers: Did you double-check that your numeric data has been entered correctly?
  • Extra spaces and characters: Did you remove any extra spaces or characters using the TRIM function?
  • Duplicates: Did you remove duplicates in spreadsheets using the Remove Duplicates function or DISTINCT in SQL?
  • Mismatched data types: Did you check that numeric, date, and string data are typecast correctly?
  • Messy (inconsistent) strings: Did you make sure that all of your strings are consistent and meaningful?
  • Messy (inconsistent) date formats: Did you format the dates consistently throughout your dataset?
  • Misleading variable labels (columns): Did you name your columns meaningfully?
  • Truncated data: Did you check for truncated or missing data that needs correction?
  • Business Logic: Did you check that the data makes sense given your knowledge of the business?

Review the goal of your project

Once you have finished these data cleaning tasks, it is a good idea to review the goal of your project and confirm that your data is still aligned with that goal. This is a continuous process that you will do throughout your project-- but here are three steps you can keep in mind while thinking about this:

  • Confirm the business problem
  • Confirm the goal of the project
  • Verify that data can solve the problem and is aligned to the goal

Test your knowledge on manual data cleaning

TOTAL POINTS 4

Question 1

Making sure data is properly verified is an important part of the data-cleaning process. Which of the following tasks are involved in this verification? Select all that apply.

  • Considering whether the data is credible and appropriate for the project
  • Asking stakeholders to check and confirm the data is clean
  • Rechecking the data-cleaning effort
  • Manually fixing any errors found in the data

Correct. The verification process confirms that data cleaning was well executed and the resulting data is accurate and reliable. To verify data, analysts recheck the data-cleaning effort, manually fix errors in the data, and consider whether the data is credible and appropriate for the project.

Question 2

Fill in the blank: To count the total number of spreadsheet values within a specified range, a data analyst uses the _____ function.

  • WHOLE
  • TOTAL
  • SUM
  • COUNTA

Correct. To count the total number of spreadsheet values within a specified range, a data analyst uses the COUNTA function.

Question 3

A data analyst is cleaning a dataset with inconsistent formats and repeated cases. They use the TRIM function to remove extra spaces from string variables. What other tools can they use for data cleaning? Select all that apply.

  • Protect sheet
  • Remove duplicates
  • Import data
  • Find and replace

Correct. The analyst can use TRIM*, remove duplicates, and find and replace for data cleaning.*

Question 4

To correct a typo in a database column, where should you insert a CASE statement in a query?

  • As a SELECT clause
  • As an ORDER BY clause
  • As a FROM clause
  • As a GROUP BY clause

Correct. You should add a CASE statement as a SELECT clause. A CASE statement goes through one or more conditions and returns a value as soon as a condition is met. The typo would be a condition and the correction would be the returned value for the condition.

Clone this wiki locally