# The Importance of Properly Cleaning Your Data
Cleaning up your dataset should always be the first part of any analysis. 

![image.png](attachment:image.png)
We've all heard the expression "Garbage in, garbage out"... and it's the truth. Data cleaning is a tedious and time-consuming part of any analysis, but the better you prepare your dataset, the more you can avoid common pitfalls and end up with better, more reliable results. 

## Formatting Errors
Before we can look for more serious issues, we have to clean up our dataset by removing or fixing the following common types of messiness: 
- Spaces and strange characters
- Blank rows
- Mismatched or missing column names
- Encoding issues between operating systems
- Duplicated data
- Categorical data that needs to be encoded as numeric
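
As a rough sketch of what several of these fixes can look like in pandas, the example below tidies a small made-up table. The DataFrame, its column names, and the `size_codes` mapping are all hypothetical and exist only for illustration; real data will usually need its own rules.

```python
import pandas as pd

# Hypothetical messy dataset, invented for illustration only.
raw = pd.DataFrame(
    {
        " Name ": ["  Alice", "Bob  ", None, "Bob  ", "Cara#"],
        "Size": ["small", "large", None, "large", "medium"],
    }
)

# Tidy the column names: strip stray spaces and lowercase them.
clean = raw.rename(columns=lambda c: c.strip().lower())

# Drop rows where every column is blank.
clean = clean.dropna(how="all")

# Strip surrounding spaces, then remove anything that is not a letter or a space.
clean["name"] = (
    clean["name"].str.strip().str.replace(r"[^A-Za-z ]", "", regex=True)
)

# Remove exact duplicate rows (the two "Bob" rows match once spaces are stripped).
clean = clean.drop_duplicates().reset_index(drop=True)

# Encode the categorical column as integers with an explicit (assumed) ordering.
size_codes = {"small": 0, "medium": 1, "large": 2}
clean["size_code"] = clean["size"].map(size_codes)

print(clean)
```

Note that the order matters here: duplicates are only detected after the stray spaces are stripped, because `"Bob  "` and `"Bob"` would otherwise count as different values.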

## Missing Values
Pandas represents missing values as `NaN` (NumPy's "not a number" value) or `None`. Knowing where these values are, why they are there, and how many are missing is critical to your analysis:   
- Do you have all the data needed for an analysis? 
- Does the missing data affect the representativeness of your sample?
- Should you change missing values to 0, remove them from the dataset, or replace them with the mean? 
- Are these data points a measurement of 0, or are they actually missing? 
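
The questions above can be explored with a few pandas calls. This is a minimal sketch on made-up data (the `height_cm` and `visits` columns are hypothetical); which option is appropriate depends on why the values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, invented for illustration only.
df = pd.DataFrame(
    {
        "height_cm": [170.0, np.nan, 165.0, 180.0],
        "visits": [3, 0, None, 5],
    }
)

# How many values are missing in each column?
missing_counts = df.isna().sum()

# Where are they? Select the rows with at least one missing value.
rows_with_missing = df[df.isna().any(axis=1)]

# Option 1: drop every row that has a missing value.
dropped = df.dropna()

# Option 2: replace missing heights with the column mean.
mean_filled = df.assign(height_cm=df["height_cm"].fillna(df["height_cm"].mean()))

# Option 3: replace missing visits with 0, only if "missing" truly means zero here.
zero_filled = df.assign(visits=df["visits"].fillna(0))

print(missing_counts)
print(rows_with_missing)
```

Notice that the `visits` column already contains a real 0 in the second row. After filling with 0, that measured zero and the filled-in value become indistinguishable, which is why the last question in the list matters.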

## Outliers
Sometimes extreme values reveal important insights about your dataset, and other times they're just noise and error. You will need to find these outliers and: 
- Determine if outliers are real or due to error
- Examine their effect on the distribution of your dataset
- Decide if you need to remove certain samples
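
One common way to flag candidate outliers is the 1.5 × IQR rule, sketched below on made-up readings. The rule and its 1.5 multiplier are a convention assumed here for illustration, not necessarily the method used later in the lesson, and a flagged value still needs a human decision about whether it is real.

```python
import pandas as pd

# Hypothetical readings, invented for illustration; 95 is the suspicious value.
values = pd.Series([10, 12, 11, 13, 12, 11, 95], name="reading")

# Interquartile range: the spread of the middle 50% of the data.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged as a candidate outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

outliers = values[is_outlier]
filtered = values[~is_outlier]

# Compare summary statistics with and without the flagged values.
print(values.describe())
print(filtered.describe())
```

Comparing the two `describe()` outputs shows how much a single extreme value can shift the mean and standard deviation while leaving the median nearly unchanged.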

## Standardization
When performing calculations across different measurements, it's important to make sure everything is on the same scale. For example: 
- Time 
- Metric vs imperial measurements
- Temperature
- Bins of the same size
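
To make these concrete, here is a small sketch that puts mixed units on a common scale. The DataFrame, its column names, and the bin edges are hypothetical and chosen only to illustrate each item in the list.

```python
import pandas as pd

# Hypothetical mixed-unit data, invented for illustration only.
df = pd.DataFrame(
    {
        "height": [70.0, 178.0, 65.0],
        "height_unit": ["in", "cm", "in"],
        "temp_f": [32.0, 212.0, 98.6],
        "timestamp": ["2021-01-05", "2021-02-10", "2021-03-15"],
    }
)

# Metric vs imperial: convert inches to centimetres, leave cm rows unchanged.
is_inches = df["height_unit"] == "in"
df["height_cm"] = df["height"].where(~is_inches, df["height"] * 2.54)

# Temperature: Fahrenheit to Celsius.
df["temp_c"] = (df["temp_f"] - 32) * 5 / 9

# Time: parse date strings into a proper datetime column.
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%Y-%m-%d")

# Bins of the same size: explicit edges that are each 10 cm wide.
df["height_bin"] = pd.cut(
    df["height_cm"],
    bins=[160, 170, 180, 190],
    labels=["160-170", "170-180", "180-190"],
)

print(df)
```

Keeping the original `height` and `temp_f` columns alongside the converted ones makes it easy to check the conversion before dropping anything.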

We'll go over techniques for dealing with all of the above in today's lesson. 