# Cleaning Data

Earlier, we mentioned that most of our notebooks, and even some of our individual cells, follow a fairly standard pattern:

1. Read Data
2. **Clean Data**
3. Filter Data
4. Process Data
5. Output Data

This article will be about cleaning data.

If you find yourself reading from external data sources, you often have no idea what the format of the data is.  Certain file formats save data with no extra metadata about the contents, and everything is just ASCII text.  When reading this kind of data into pandas, care must be taken.

You will of course need to read the data by hand, in a text editor, for instance, in order to get an idea of the structure of things like a CSV or Excel file.

Below is a list of common data cleaning operations that are often performed when loading external data into a Pandas DataFrame:

1. Eliminate Unwanted Observations:
    * Duplicate Values: Remove repeated entries that may have occurred during data collection.
    * Irrelevant Observations: Remove data that doesn't fit the specific problem that you're trying to solve.

2. Error Correction:
    * Typos, capitalization inconsistencies, and mislabeled classes may need to be corrected.

3. Formatting corrections:
    * Dates and other special types of data can often be read as strings, and may need parsing and conversion to the appropriate format.

4. Handle Missing Data:
    * Pandas typically represents missing data with NaN values, which can cause problems in analytics operations.
    * This can be resolved by deleting observations, replacing NaN with 0 or another relevant value, or imputation (filling NaN values using methods like `fillna()` or `interpolate()`, or with statistical measures such as the mean or median).
    * Occasionally, the `bfill` and `ffill` (back-fill and forward-fill) methods are useful: they fill in missing values with the next or the previous valid value, respectively.

5. Addressing Text Data:
    * Text data typically requires extra steps in order to prepare it for modeling, like lowercasing, stemming, lemmatization, stop words removal, and vectorization.

6. Normalization and Standardization:
    * It's often helpful to scale numeric variables to bring them onto the same scale, which can improve the performance of certain algorithms.

7. Encode Categorical Variables:
    * Many machine learning models require the input data to be numerical. If your dataset includes categorical data, you may need to encode these categories as numbers.

8. Type coercion:
    * When reading data in from the outside world, the types will often be unclear to pandas.  Sometimes it can figure them out with a little help, sometimes it needs a lot of help.  It is often useful to hold off on asking pandas to work out the right types until you have done many of the previous cleaning steps.  By removing stray '#Empty' tags from a column and replacing them with `None`, it may become clearer to pandas that the column holds floats, not strings.

9. Feature Engineering:
    * Depending on the problem, you might benefit from creating new features based on existing ones, for example, extracting the day of the week from a date, or the domain from an email address.

10. Setting Appropriate Index:
    * Sometimes it's better to have a specific column set as an index for the DataFrame.
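
A few of the operations in this list can be sketched in one short chain. This is a minimal illustration, not a recipe: the `raw` frame, its column names, and its values are all made up for the example.

```python
import pandas as pd

# Hypothetical raw data: a repeated row, inconsistent capitalization,
# and dates stored as strings.
raw = pd.DataFrame({
    "city": ["Boston", "boston ", "Denver", "Denver"],
    "visited": ["2024-01-05", "2024-01-06", "2024-02-10", "2024-02-10"],
    "visits": [3, 5, 2, 2],
})

tidy = (
    raw
    # Error correction: trim whitespace and normalize capitalization.
    .assign(city=lambda df: df["city"].str.strip().str.title())
    # Formatting correction: parse the date strings into datetimes.
    .assign(visited=lambda df: pd.to_datetime(df["visited"], format="%Y-%m-%d"))
    # Unwanted observations: drop rows that are exact duplicates.
    .drop_duplicates()
    # Feature engineering: derive the day of the week from the date.
    .assign(weekday=lambda df: df["visited"].dt.day_name())
    # Indexing: use the parsed date as the index.
    .set_index("visited")
)

print(tidy)
```

The order matters here: the capitalization fix runs before `drop_duplicates()`, so rows that differ only by stray whitespace or case would be recognized as duplicates too.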

Data cleaning is a critical step in the data preprocessing pipeline. It sets the stage for the exploratory data analysis and modeling stages that follow. The quality of the data cleaning and preprocessing can often significantly impact the outcomes of the subsequent analysis or model development stages.

As a general pattern, cleaning up the textual representation, dealing with illegal or missing values, and fixing spelling should come first, with type coercion coming last.
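
As a small sketch of that ordering, assume a hypothetical column in which the text `#Empty` marks a missing reading. The placeholder is cleaned out first, and only then is the column coerced to a numeric type with `pd.to_numeric`.

```python
import numpy as np
import pandas as pd

# Hypothetical column in which '#Empty' marks a missing reading.
readings = pd.Series(["1.5", "#Empty", "2.25", "#Empty", "4.0"])

# Step 1: clean the textual representation.
without_tags = readings.replace("#Empty", np.nan)

# Step 2: coerce the type last, once only numeric text remains.
cleaned = pd.to_numeric(without_tags)

print(cleaned.dtype)
```

Calling `pd.to_numeric(readings)` directly would raise an error on the `#Empty` entries, which is why the coercion is left until the end.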

## Detailed Examples

Let's take a look at some examples of data that needs cleaning, and the pandas techniques and functions that we use to clean it.

### Load Data

Let's load up some fake data to start with.

In [1]:
import pandas as pd
import numpy as np

data = pd.read_csv('data/cleaning_example_01.csv')
data

Unnamed: 0,a,b,c,d
0,,1.0,1.0,1
1,2.0,2.0,2.0,2
2,3.0,3.0,,3
3,4.0,4.0,4.0,4
4,5.0,,5.0,5
5,6.0,6.0,6.0,YYY
6,7.0,7.0,7.0,XXX


Let's see what the basic data looks like.

In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       6 non-null      float64
 1   b       6 non-null      float64
 2   c       6 non-null      float64
 3   d       7 non-null      object 
dtypes: float64(3), object(1)
memory usage: 356.0+ bytes


Notice that column `d` is listed with type `object`.  This is because that column contains both numbers and strings, so pandas doesn't know what else to do with it.  This will come up shortly.
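
When a column comes back as `object`, one way to find out which values are responsible is to attempt a numeric conversion and look at what fails. The sketch below uses a hypothetical stand-in for column `d`, assuming every value was read as text, which is typical for a small CSV file.

```python
import pandas as pd

# Hypothetical stand-in for column `d`, with every value read as text.
d = pd.Series(["1", "2", "3", "4", "5", "YYY", "XXX"], name="d")

# errors="coerce" turns anything unparseable into NaN instead of raising.
as_number = pd.to_numeric(d, errors="coerce")

# Values that were present but failed to parse are the ones to clean up.
offenders = d[as_number.isna() & d.notna()]

print(offenders.tolist())
```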

### Cleaning up bad text

The data we loaded has some mistaken text in one of the columns, so let's get rid of it.  Here, we replace it with the NaN value.

In [3]:
data2 = (
    data
    .replace('XXX', np.nan)
    .replace('YYY', np.nan)
)

data2

Unnamed: 0,a,b,c,d
0,,1.0,1.0,1.0
1,2.0,2.0,2.0,2.0
2,3.0,3.0,,3.0
3,4.0,4.0,4.0,4.0
4,5.0,,5.0,5.0
5,6.0,6.0,6.0,
6,7.0,7.0,7.0,


The `replace` method also supports regular expression replacement across a DataFrame (or a single column), which is quite useful.
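
Here is a minimal sketch of the regular expression form, using a made-up frame with several spellings of the placeholder text. The pattern is anchored with `^` and `$` so that only cells consisting entirely of the placeholder are replaced.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with several placeholder spellings for "missing".
frame = pd.DataFrame({
    "x": ["1", "XXX", "3"],
    "y": ["YYY", "5", "xxxx"],
})

# One case-insensitive pattern matches any cell made up only of X or Y characters.
without_placeholders = frame.replace(r"(?i)^[xy]+$", np.nan, regex=True)

print(without_placeholders)
```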

Note that `data2` now shows NaN in column `d` in place of the bad text, but pandas still thinks the column is an object.

In [4]:
data2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       6 non-null      float64
 1   b       6 non-null      float64
 2   c       6 non-null      float64
 3   d       5 non-null      object 
dtypes: float64(3), object(1)
memory usage: 356.0+ bytes


While not a huge issue yet, let's see if the built-in `convert_dtypes` method can convert that column to an Int64 column, as it does for the others.

In [5]:
data3 = (
    data2
    .convert_dtypes()
)
data3

Unnamed: 0,a,b,c,d
0,,1.0,1.0,1.0
1,2.0,2.0,2.0,2.0
2,3.0,3.0,,3.0
3,4.0,4.0,4.0,4.0
4,5.0,,5.0,5.0
5,6.0,6.0,6.0,
6,7.0,7.0,7.0,


In [6]:
data3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   a       6 non-null      Int64 
 1   b       6 non-null      Int64 
 2   c       6 non-null      Int64 
 3   d       5 non-null      string
dtypes: Int64(3), string(1)
memory usage: 377.0 bytes


Oddly, pandas now thinks column `d` is a string.  This can happen: the automatic tools try to do the right thing for you, but they can't always do so.  So let's see if we can patch it up by manually setting the type.

In [7]:
data3 = (
    data2
    .convert_dtypes()
    .astype({'d': 'Int64'})
)
data3

Unnamed: 0,a,b,c,d
0,,1.0,1.0,1.0
1,2.0,2.0,2.0,2.0
2,3.0,3.0,,3.0
3,4.0,4.0,4.0,4.0
4,5.0,,5.0,5.0
5,6.0,6.0,6.0,
6,7.0,7.0,7.0,


And now it's OK: every column is Int64.

In [8]:
data3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   a       6 non-null      Int64
 1   b       6 non-null      Int64
 2   c       6 non-null      Int64
 3   d       5 non-null      Int64
dtypes: Int64(4)
memory usage: 384.0 bytes


### Getting rid of missing values

There are NaN values listed in several different rows of this data.  

What we should do with these is up for debate and dependent on your situation, but let's look at a few options.

In [9]:
(
    data3
    .fillna(0.0)
)

Unnamed: 0,a,b,c,d
0,0,1,1,1
1,2,2,2,2
2,3,3,0,3
3,4,4,4,4
4,5,0,5,5
5,6,6,6,0
6,7,7,7,0


In [10]:
(
    data3
    .dropna()
)

Unnamed: 0,a,b,c,d
1,2,2,2,2
3,4,4,4,4


In [11]:
(
    data3
    .astype('float')
    .interpolate('linear')
)

Unnamed: 0,a,b,c,d
0,,1.0,1.0,1.0
1,2.0,2.0,2.0,2.0
2,3.0,3.0,3.0,3.0
3,4.0,4.0,4.0,4.0
4,5.0,5.0,5.0,5.0
5,6.0,6.0,6.0,5.0
6,7.0,7.0,7.0,5.0


In [12]:
(
    data3
    .bfill()
)

Unnamed: 0,a,b,c,d
0,2,1,1,1.0
1,2,2,2,2.0
2,3,3,4,3.0
3,4,4,4,4.0
4,5,6,5,5.0
5,6,6,6,
6,7,7,7,


In [13]:
(
    data3
    .ffill()
)

Unnamed: 0,a,b,c,d
0,,1,1,1
1,2.0,2,2,2
2,3.0,3,2,3
3,4.0,4,4,4
4,5.0,4,5,5
5,6.0,6,6,5
6,7.0,7,7,5
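
The options above cover constants, dropping rows, interpolation, and back/forward fill. Filling with a statistic such as the median, mentioned in the list earlier, can be sketched as follows. The `gaps` frame is hypothetical and kept as plain floats for simplicity.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame with one gap per column.
gaps = pd.DataFrame({
    "a": [np.nan, 2.0, 3.0, 4.0],
    "b": [1.0, 2.0, np.nan, 9.0],
})

# fillna accepts a Series keyed by column name, so each column
# is filled with its own median.
filled = gaps.fillna(gaps.median())

print(filled)
```

Swapping `median()` for `mean()` gives mean imputation instead; which one is appropriate depends on the data.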


As we have mentioned elsewhere, leaving all those intermediate dataframes around is not the best idea.  We can instead chain all these steps together in a single operation, which makes sure they happen in the right order.  So, while the steps above are how I might work out how to clean the data, the finished version would look like the cell below, with all the work in one place.

In [14]:
cleaned_data = (
    data
    .replace('XXX', np.nan)
    .replace('YYY', np.nan)
    .convert_dtypes()
    .astype({'d': 'Int64'})
    .fillna(0)
)

cleaned_data

Unnamed: 0,a,b,c,d
0,0,1,1,1
1,2,2,2,2
2,3,3,0,3
3,4,4,4,4
4,5,0,5,5
5,6,6,6,0
6,7,7,7,0
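
If the bad tokens are known before the file is read, one possible alternative is to pass them to `read_csv` through its `na_values` parameter, so they arrive as NaN from the start. The sketch below assumes an inline CSV with the same shape as the example file (without the index column), not the real `data/cleaning_example_01.csv`.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for the example file.
csv_text = """a,b,c,d
,1,1,1
2,2,2,2
3,3,,3
4,4,4,4
5,,5,5
6,6,6,YYY
7,7,7,XXX
"""

cleaned_on_read = (
    # Treat the known placeholder strings as missing while parsing.
    pd.read_csv(io.StringIO(csv_text), na_values=["XXX", "YYY"])
    .convert_dtypes()
    .fillna(0)
)

print(cleaned_on_read.dtypes)
```

With the placeholders handled at read time, column `d` is parsed as numeric from the start, so the manual `astype` step is not needed in this sketch.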
