### BACKGROUND:

### Sample data is from the UC Irvine Machine Learning Repository [adult data set](http://archive.ics.uci.edu/ml/machine-learning-databases/adult/)

Each value in the data set has a white space in front of it, and so do the column names.  We want to remove that white space from the column names and from the values in columns of type string.  Quirks like these in source data are quite common in the real world.  Most often we can't control the source data, but we can clean the data after loading it.  Below are 3 different methods, which differ in how they detect the string columns.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('adult_data.csv')

In [3]:
df.head()

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,<=50k
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


### Except for the first column (```age```), the column names have a white space in front:

In [4]:
df.columns

Index(['age', ' workclass', ' fnlwgt', ' education', ' education-num',
       ' marital-status', ' occupation', ' relationship', ' race', ' sex',
       ' capital-gain', ' capital-loss', ' hours-per-week', ' native-country',
       ' <=50k'],
      dtype='object')

### Let's fix the column names first, then the column values:

In [5]:
df.columns = [column.strip() for column in df.columns]
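The list comprehension above can also be written with the `.str` accessor that pandas exposes on an `Index`. This is a minimal sketch on a small stand-in frame; `sample` and its two rows are hypothetical and only mimic the leading-space problem.

```python
import pandas as pd

# Hypothetical stand-in frame with the same leading-space problem in a column name.
sample = pd.DataFrame({'age': [39, 50], ' workclass': [' State-gov', ' Private']})

# An Index offers the same vectorized string methods as a Series.
sample.columns = sample.columns.str.strip()

cleaned_columns = list(sample.columns)
print(cleaned_columns)
```

Both forms give the same column names; the accessor form simply avoids the explicit loop.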

In [6]:
df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       '<=50k'],
      dtype='object')

### Columns of type string or ```str``` also have a white space in front of each value.  In a pandas dataframe, ```str``` columns are represented by the ```object``` data type:

In [7]:
df.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
<=50k             object
dtype: object

### We can use the ```value_counts()``` method to get counts by data type:

In [8]:
df.dtypes.value_counts()

object    9
int64     6
dtype: int64
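To see which columns sit behind that `object` count, `select_dtypes` can list them by name. The sketch below uses a hypothetical two-row frame rather than the real file, and builds the text columns with `dtype=object` explicitly so that they match the dtypes shown above regardless of pandas version.

```python
import pandas as pd

# Hypothetical two-row stand-in; the text columns are created with dtype=object
# explicitly so that they match the dtypes printed above.
sample = pd.DataFrame({
    'age': [39, 50],
    'workclass': pd.Series([' State-gov', ' Private'], dtype=object),
    'education': pd.Series([' Bachelors', ' HS-grad'], dtype=object),
})

# select_dtypes filters columns by dtype; here it keeps only the object columns.
string_columns = sample.select_dtypes(include='object').columns.tolist()
print(string_columns)
```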

### Let's take a peek at a couple of the string columns:

In [9]:
df['workclass'].values

array([' State-gov', ' Self-emp-not-inc', ' Private', ..., ' Private',
       ' Private', ' Self-emp-inc'], dtype=object)

In [10]:
df['education'].values

array([' Bachelors', ' Bachelors', ' HS-grad', ..., ' HS-grad',
       ' HS-grad', ' HS-grad'], dtype=object)

From above, we see that there is indeed a white space in front of each value.  There are a few different ways we can remove these white spaces.

### Method #1: Test type using the built-in ```isinstance()``` function:

In [11]:
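def strip_whitespace(df):
    # modifies df in place
    for column in df.columns:
        # if first value of column is of type str, then strip the white space from all values
        # (.iloc[0] looks up the first row by position, whatever the index labels are)
        if isinstance(df[column].iloc[0], str):
            df[column] = df[column].str.strip()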

In [12]:
strip_whitespace(df)

### Let's double-check that the function did its job:

In [13]:
df['sex'].values

array(['Male', 'Male', 'Male', ..., 'Female', 'Male', 'Female'],
      dtype=object)

In [14]:
df['education'].values

array(['Bachelors', 'Bachelors', 'HS-grad', ..., 'HS-grad', 'HS-grad',
       'HS-grad'], dtype=object)
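One thing to keep in mind with Method #1 is that it judges a whole column by its first value. If that value happens to be missing, the test fails even though the rest of the column is text. A minimal sketch with a hypothetical column (`col` is not taken from the adult data set):

```python
import numpy as np
import pandas as pd

# Hypothetical column whose first entry is missing.
col = pd.Series([np.nan, ' Private', ' State-gov'], dtype=object)

# NaN is a float, so the first-value test does not treat this as a string column.
first_is_str = isinstance(col.iloc[0], str)

# Checking every value is slower, but it is not fooled by a missing first entry.
any_is_str = bool(col.map(lambda value: isinstance(value, str)).any())

print(first_is_str, any_is_str)
```

Whether the extra pass over the column is worth it depends on how much you trust the first row of your data.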

### Method #2: Using ```dtype``` to test for data type:

In [15]:
df = pd.read_csv('adult_data.csv')
df.columns = [column.strip() for column in df.columns]

In [16]:
def strip_whitespace2(df):
    for column in df.columns:
        # if column is of type 'object', then strip the white space
        if df[column].dtype == 'object':
            df[column] = df[column].str.strip()

In [17]:
strip_whitespace2(df)

In [18]:
df['education'].values

array(['Bachelors', 'Bachelors', 'HS-grad', ..., 'HS-grad', 'HS-grad',
       'HS-grad'], dtype=object)
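A caveat for Method #2: `object` only means "arbitrary Python objects", so an `object` column can hold a mix of strings and other values. In that case `.str.strip()` returns `NaN` for the entries that are not strings. The sketch below shows this on a hypothetical mixed column, together with one possible way to strip only the strings.

```python
import pandas as pd

# Hypothetical object column that mixes strings with an integer.
mixed = pd.Series([' Private', 42, ' State-gov'], dtype=object)

# The .str accessor yields NaN for every element that is not a string.
stripped = mixed.str.strip()

# One way around it: strip only the strings and pass other values through unchanged.
safe = mixed.map(lambda value: value.strip() if isinstance(value, str) else value)

print(stripped.isna().tolist())
print(safe.tolist())
```

The adult data set does not appear to have such mixed columns, so this matters mainly when reusing the function on other data.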

### Method #3: Using pandas ```is_string_dtype``` or ```is_numeric_dtype```:

In [19]:
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype

In [20]:
df = pd.read_csv('adult_data.csv')
df.columns = [column.strip() for column in df.columns]

In [21]:
def strip_whitespace3(df):
    for column in df.columns:
        # if column is of a string data type, then strip the white space
        if is_string_dtype(df[column]):
            df[column] = df[column].str.strip()
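To get a feel for what the two imported predicates return, they can be tried on a small frame first. This is only a sketch; `sample` is a hypothetical stand-in with one numeric and one text column.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype, is_string_dtype

# Hypothetical stand-in frame with one numeric and one text column.
sample = pd.DataFrame({'age': [39, 50], 'workclass': [' State-gov', ' Private']})

# Each predicate accepts a Series (or a dtype) and returns a plain bool.
age_is_numeric = is_numeric_dtype(sample['age'])
age_is_string = is_string_dtype(sample['age'])
workclass_is_string = is_string_dtype(sample['workclass'])
workclass_is_numeric = is_numeric_dtype(sample['workclass'])

print(age_is_numeric, age_is_string, workclass_is_string, workclass_is_numeric)
```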

In [22]:
df['education'].values

array([' Bachelors', ' Bachelors', ' HS-grad', ..., ' HS-grad',
       ' HS-grad', ' HS-grad'], dtype=object)

In [23]:
strip_whitespace3(df)

In [24]:
df['education'].values

array(['Bachelors', 'Bachelors', 'HS-grad', ..., 'HS-grad', 'HS-grad',
       'HS-grad'], dtype=object)
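If you do control how the file is read, and the white space only ever follows the delimiter as it does here, `read_csv` can drop it during parsing through its `skipinitialspace` parameter. The sketch below reads a hypothetical three-line excerpt from memory instead of `adult_data.csv`, so the data in `raw` is an assumption made for illustration.

```python
import io

import pandas as pd

# Hypothetical excerpt that mimics the file's "comma followed by a space" layout.
raw = (
    "age, workclass, education\n"
    "39, State-gov, Bachelors\n"
    "50, Self-emp-not-inc, Bachelors\n"
)

# skipinitialspace=True makes the parser skip spaces that follow a delimiter,
# which covers the header row as well as the values.
clean = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

clean_columns = list(clean.columns)
workclass_values = clean['workclass'].tolist()
print(clean_columns)
print(workclass_values)
```

This only handles white space in front of a value. Trailing white space would still need one of the three methods above.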