# Data Cleaning Made Easy
***
This notebook shows how to clean up messy categorical data with very little code. All it takes is one light regular expression.

You may be familiar with .strip(), which removes leading and trailing whitespace. A regex replacement goes further: it removes every whitespace character in the column, including the spaces inside a value, along with digits and punctuation.

_This works best when each cell holds a single word, because the spaces between words are removed too._
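
To make the difference concrete, here is a small sketch comparing .str.strip() with a regex replacement that removes all whitespace. The sample values are made up for illustration and are not part of the data used below.

```python
import pandas as pd

# Hypothetical sample values, only for illustration
s = pd.Series(['  Chris  ', ' C h r i s ', ' New   York '])

# .str.strip() only trims the ends; inner spaces stay
stripped = s.str.strip()

# Replacing every run of whitespace with nothing removes inner spaces too
no_spaces = s.str.replace(r'\s+', '', regex=True)

print(stripped.tolist())   # ['Chris', 'C h r i s', 'New   York']
print(no_spaces.tolist())  # ['Chris', 'Chris', 'NewYork']
```

The last value shows why single-word cells work best: a two-word value is merged into one word.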
***

In [34]:
# This works with Pandas
import pandas as pd

In [39]:
# Creating a DataFrame of some really bad categorical data
df = pd.DataFrame({
    'Name': [
        'C   h5$r(&*^987i    s  ',
        'W&^%$# oo)(*&^(&^%d W a R 0987 d      ',
        ' D )(*^23a %%%tA 0987^%)               ',
        ' c L%e$3an(*&in #$%G)',
        ' f O$%^&L09867l**9ow',
        ' m #$%^&*98237e20394**87  ',
    ]
})

# Displaying below
df

| | Name |
|---|---|
| 0 | `C h5$r(&*^987i s` |
| 1 | `W&^%$# oo)(*&^(&^%d W a R 0987 d` |
| 2 | `D )(*^23a %%%tA 0987^%)` |
| 3 | `c L%e$3an(*&in #$%G)` |
| 4 | `f O$%^&L09867l**9ow` |
| 5 | `m #$%^&*98237e20394**87` |


## Cleaning the data
***
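* We use .str to access the pandas string methods on the column.
* We use .str.lower() to lower-case all letters
* We use .str.replace(r'[\W\d\s+]', '', regex=True)
  * r marks a raw string, so Python passes the backslashes to the regex engine unchanged
  * [...] is a character class: it matches any single character listed inside the brackets
  * \W means "non-word character": anything other than a letter, digit, or underscore, which already includes whitespace and punctuation
  * \d means "digit"
  * \s means "whitespace character"; inside a character class the + is a literal plus sign, not "one or more", and both are already covered by \W
  * We replace every matching character with nothing; underscores count as word characters, so they would survive
* We use .str.title() to capitalize the first letter of each word
* We use .str.strip() to trim leading and trailing whitespace; it is redundant here because the regex has already removed all whitespace, but it is harmless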
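
To see what the character class actually matches, here is a sketch that walks one value from the data frame through the same steps. It uses the standard-library re module rather than pandas so that each intermediate result is easy to inspect; the variable names are only for illustration.

```python
import re

raw = 'W&^%$# oo)(*&^(&^%d W a R 0987 d      '

lowered = raw.lower()
# Inside [...], '+' is a literal plus sign and \s is already covered by \W
cleaned = re.sub(r'[\W\d\s+]', '', lowered)
result = cleaned.title().strip()

print(result)  # Woodward

# The shorter class [\W\d] gives the same result on this value
assert re.sub(r'[\W\d]', '', lowered) == cleaned

# Underscore counts as a word character, so it is NOT removed
underscore_demo = re.sub(r'[\W\d\s+]', '', 'a_b c+1')
print(underscore_demo)  # a_bc
```

If the data may contain underscores, a class such as [^a-z] (anything that is not a lower-case ASCII letter) is one possible alternative.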

In [40]:
# Cleaning the data
df['Name'] = df['Name'].str.lower().str.replace(r'[\W\d\s+]','', regex=True).str.title().str.strip()

# Displaying below
df

| | Name |
|---|---|
| 0 | Chris |
| 1 | Woodward |
| 2 | Data |
| 3 | Cleaning |
| 4 | Follow |
| 5 | Me |
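
For reuse across several columns, the chain above could be wrapped in a small helper. This is only a sketch: the function name clean_single_word is hypothetical, and it swaps in the pattern [^a-z], which also drops underscores but would remove accented letters as well.

```python
import pandas as pd

def clean_single_word(column: pd.Series) -> pd.Series:
    """Keep only ASCII letters, then capitalize the first letter."""
    return (
        column.str.lower()
        .str.replace(r'[^a-z]', '', regex=True)  # drop everything except a-z
        .str.title()
    )

# Two messy sample values; the second adds an underscore on purpose
demo = pd.DataFrame(
    {'Name': ['C   h5$r(&*^987i    s  ', ' m_#$%^&*98237e20394**87  ']}
)
demo['Name'] = clean_single_word(demo['Name'])
print(demo['Name'].tolist())  # ['Chris', 'Me']
```

No .str.strip() is needed in this variant, because whitespace is not in a-z and is removed by the replacement.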
