# Using the pandas DataFrame pipe() function

The pandas DataFrame [pipe()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pipe.html?highlight=pipe#pandas.DataFrame.pipe) method lets you chain functions: the DataFrame is passed to a function, that function's result is passed to the next function, and so on.  In my opinion, the example in the official documentation is a bit complex and does not make this very clear.  Below is my attempt to show a more practical example.
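
Before the practical example, here is a minimal sketch of the chaining idea on a toy dataframe.  The helper functions `add_one` and `double` and the `toy` dataframe are hypothetical names made up for this illustration; they are not part of the data set used below.

```python
import pandas as pd

# Hypothetical helper functions, used only for this illustration
def add_one(df):
    return df + 1

def double(df):
    return df * 2

toy = pd.DataFrame({'a': [1, 2, 3]})

# Nested function calls have to be read inside-out
nested = double(add_one(toy))

# The same steps written with pipe() read left to right
piped = toy.pipe(add_one).pipe(double)

# Both approaches should produce the same dataframe
print(piped.equals(nested))
```

Each `pipe()` call hands the current dataframe to the given function as its first argument, so the chain is just another way of writing the nested call.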

**NOTE:** I will be using type annotations, which are **optional**.  I think they make it explicit what data types a function expects for its parameters and what data type it returns.

In [1]:
from typing import List
import pandas as pd
# Alias for the DataFrame class, used in the type annotations below
DataFrame = pd.DataFrame

In [2]:
df = pd.read_csv(r'D:\gitprojects\streamlit-seaborn-heatmap\top_parts.csv', usecols=range(6))

In [3]:
df.head()

Unnamed: 0,YEAR,FACTORY,MODEL,PART5_NAME,AF_YR_MTH,RO_YR_MTH
0,2019,HMA,PASSPORT,00036 - COMPRESSOR,2019-09,2019-11
1,2019,ELP,RDX,04101 - SENSOR ASSY.,2018-10,2019-09
2,2019,MAP,CRV,04320 - PIGTAIL,2019-02,2019-11
3,2019,HMA,PILOT,04321 - CONNECTOR,2018-09,2019-02
4,2019,HMIN,CRV,04321 - CONNECTOR,2019-05,2019-11


#### Let's replace the underscores in the column names with spaces so that we have something to clean up later in the pipeline

In [4]:
df.columns = ['YEAR', 'FACTORY', 'MODEL', 'PART5 NAME', 'AF YR MTH', 'RO YR MTH']

In [5]:
df.head()

Unnamed: 0,YEAR,FACTORY,MODEL,PART5 NAME,AF YR MTH,RO YR MTH
0,2019,HMA,PASSPORT,00036 - COMPRESSOR,2019-09,2019-11
1,2019,ELP,RDX,04101 - SENSOR ASSY.,2018-10,2019-09
2,2019,MAP,CRV,04320 - PIGTAIL,2019-02,2019-11
3,2019,HMA,PILOT,04321 - CONNECTOR,2018-09,2019-02
4,2019,HMIN,CRV,04321 - CONNECTOR,2019-05,2019-11


Let's assume that the above is our data set.  Let's also assume we now want to apply some transformations or clean-up to this data set:

- Make the column names lower case
- Replace spaces in the column names with underscores
- Remove the 'model' column from the data set

We can next define functions that will apply those transformations or clean-ups.

### Typical clean-up functions to be used in a pipeline

In [6]:
def start_pipeline(df: DataFrame) -> DataFrame:
    '''Make a copy of the dataframe at the start of a pipeline.
       Later steps may modify their input in place, so copying first leaves the
       original data untouched and keeps the pipeline "idempotent" (each re-run
       produces the same output).
       The copy is held in memory, so avoid this with a very large data set.
       
    Parameters
    ----------
    df : pandas DataFrame
        DataFrame of the untouched/original data
    
    Returns
    -------
    pandas DataFrame
    '''
    
    return df.copy()

def columns_lower(df: DataFrame) -> DataFrame:
    '''Make all column names lower case
    
    Parameters
    ----------
    df : pandas DataFrame
    
    Returns
    -------
    pandas DataFrame
    
    '''
    
    df.columns = [column.lower() for column in df.columns]
    return df

def replace_space_with_underscore(df: DataFrame) -> DataFrame:
    '''Replace space in column names with an underscore character
    
    Parameters
    ----------
    df : pandas DataFrame
    
    Returns
    -------
    pandas DataFrame
    
    '''
    
    df.columns = [column.strip().replace(' ', '_') for column in df.columns]
    return df

def drop_columns(df: DataFrame, column_names: List[str]) -> DataFrame:
    '''Drop or remove column(s) from a data frame
    
    Parameters
    ----------
    df : pandas DataFrame
    column_names : List[str]
        A list of column names that will be dropped/removed
        
    Returns
    -------
    pandas DataFrame
    '''
    
    df = df.drop(columns=column_names)
    return df

In [7]:
(df
 .pipe(start_pipeline)
 .pipe(columns_lower)
 .pipe(replace_space_with_underscore)
 .pipe(drop_columns, column_names=['model'])
).head()

Unnamed: 0,year,factory,part5_name,af_yr_mth,ro_yr_mth
0,2019,HMA,00036 - COMPRESSOR,2019-09,2019-11
1,2019,ELP,04101 - SENSOR ASSY.,2018-10,2019-09
2,2019,MAP,04320 - PIGTAIL,2019-02,2019-11
3,2019,HMA,04321 - CONNECTOR,2018-09,2019-02
4,2019,HMIN,04321 - CONNECTOR,2019-05,2019-11
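
The last step of the pipeline above passes an extra keyword argument to `pipe()`.  As a minimal sketch of what happens there: `pipe()` forwards any additional positional and keyword arguments to the function, after the dataframe itself.  The one-row `toy` dataframe below is made up for illustration, and `drop_columns` is repeated without its docstring so the snippet runs on its own.

```python
import pandas as pd

def drop_columns(df, column_names):
    # Same logic as the drop_columns function defined earlier
    return df.drop(columns=column_names)

# Toy one-row dataframe, made up for this illustration
toy = pd.DataFrame({'year': [2019], 'factory': ['HMA'], 'model': ['PASSPORT']})

# pipe() passes the dataframe first, then forwards column_names=...
via_pipe = toy.pipe(drop_columns, column_names=['model'])

# Equivalent direct call
direct = drop_columns(toy, column_names=['model'])

print(via_pipe.equals(direct))
print(list(via_pipe.columns))
```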


In [9]:
cleaned_df = (df
 .pipe(start_pipeline)
 .pipe(columns_lower)
 .pipe(replace_space_with_underscore)
 .pipe(drop_columns, column_names=['model'])
)

In [10]:
cleaned_df.head()

Unnamed: 0,year,factory,part5_name,af_yr_mth,ro_yr_mth
0,2019,HMA,00036 - COMPRESSOR,2019-09,2019-11
1,2019,ELP,04101 - SENSOR ASSY.,2018-10,2019-09
2,2019,MAP,04320 - PIGTAIL,2019-02,2019-11
3,2019,HMA,04321 - CONNECTOR,2018-09,2019-02
4,2019,HMIN,04321 - CONNECTOR,2019-05,2019-11


#### Because our pipeline starts by making a copy of the original dataframe, the original data is unchanged:

In [11]:
df.head()

Unnamed: 0,YEAR,FACTORY,MODEL,PART5 NAME,AF YR MTH,RO YR MTH
0,2019,HMA,PASSPORT,00036 - COMPRESSOR,2019-09,2019-11
1,2019,ELP,RDX,04101 - SENSOR ASSY.,2018-10,2019-09
2,2019,MAP,CRV,04320 - PIGTAIL,2019-02,2019-11
3,2019,HMA,PILOT,04321 - CONNECTOR,2018-09,2019-02
4,2019,HMIN,CRV,04321 - CONNECTOR,2019-05,2019-11
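
To see why the copy step matters, here is a small sketch that runs `columns_lower` with and without `start_pipeline`.  The two functions are repeated without their docstrings so the snippet runs on its own, and the one-row `original` dataframe is made up for illustration.

```python
import pandas as pd

def start_pipeline(df):
    # Same logic as start_pipeline defined earlier
    return df.copy()

def columns_lower(df):
    # Same logic as columns_lower defined earlier: assigns to df.columns in place
    df.columns = [column.lower() for column in df.columns]
    return df

# Toy one-row dataframe, made up for this illustration
original = pd.DataFrame({'YEAR': [2019], 'FACTORY': ['HMA']})

# With the copy step, the renaming happens on the copy only
original.pipe(start_pipeline).pipe(columns_lower)
with_copy_cols = list(original.columns)
print(with_copy_cols)

# Without the copy step, columns_lower renames the original's columns
original.pipe(columns_lower)
without_copy_cols = list(original.columns)
print(without_copy_cols)
```

The first print should still show the upper-case names, while the second should show them in lower case, because `columns_lower` assigns to `df.columns` on whatever object it receives.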
