**<h1>Introduction</h1>**

This kernel shows how to automatically transform the categorical features in a dataset into numerical features. The transformation is necessary because machine learning algorithms cannot work directly with text or string values. The Telco Customer Churn dataset suits this problem because it has a relatively high number of categorical features, as seen below.

In [None]:
import pandas as pd

# Read the dataset
df = pd.read_csv('../input/WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.info()

**<h1>Automatically Transform Categorical Features to Numerical Features</h1>**

Most machine learning algorithms require numerical inputs. So, if you want to build a classification model using this Telco dataset, you first need to transform the categorical features into numerical ones. The function below does that pre-processing automatically. It takes the dataset as a DataFrame and a list of feature names to exclude from the transformation. Features with two categories are label encoded only; features with more than two categories are additionally one-hot encoded. The function uses the [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) and [OneHotEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) classes from scikit-learn.

In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# transform_categorical
#   - Automatically transform categorical (string-valued) features into numerical ones.
# input  :
#   - df_input (DataFrame) : dataset
#   - exclude (list-like) : names of features to exclude from the transformation
# output :
#   - df (DataFrame) : numerically encoded dataset

def transform_categorical(df_input, exclude=()):
    df = df_input.copy()
    # Dictionary to put the label encoder & one hot encoder
    l_encoderDict = dict()
    o_encoderDict = dict()
    
    # Iterate column in dataframe
    columns = list(df)
    for ori_column in columns :
        # Take the first value of the column (by position) to determine its data type
        sample = df[ori_column].iloc[0]
        # Check whether the data is str or subtype of str
        if isinstance(sample, str) and ori_column not in exclude:
            # Initialize new column name
            new_column = ori_column + '_encoded'
            # Label encoding
            l_encoderDict[ori_column] = LabelEncoder()
            df[new_column] = l_encoderDict[ori_column].fit_transform(df[ori_column])
            # Delete old column
            df.drop(columns=[ori_column], inplace=True)
            # One-hot encode only features with more than two categories
            if len(l_encoderDict[ori_column].classes_) > 2 :
                # One hot encoding
                o_encoderDict[ori_column] = OneHotEncoder()
                oneHot_temp = df[new_column].values.reshape(-1,1)
                oneHot_array = o_encoderDict[ori_column].fit_transform(oneHot_temp).toarray()
                # Initialize new column names
                oneHot_columns = [ori_column + '_' + str(j) for j in range(oneHot_array.shape[1])]
                # Convert one hot encoding array to dataframe, keeping the original row index
                dfOneHot = pd.DataFrame(oneHot_array, columns=oneHot_columns, index=df.index)
                # Add one hot encoding to existing dataframe
                df = pd.concat([df, dfOneHot], axis=1)
                # Delete old column
                df.drop(columns=[new_column], inplace=True)
    return df
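
To make the two steps inside the function concrete, here is a minimal sketch on a toy column. The array `contract` and its values are made up for illustration and are not read from the Telco file; only the `LabelEncoder` and `OneHotEncoder` calls mirror what the function does.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy column with three categories (illustrative values, not the real data)
contract = np.array(['Month-to-month', 'Two year', 'One year', 'Month-to-month'])

# Step 1: label encoding maps each category to an integer code
label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(contract)
print(label_encoder.classes_)  # categories, stored in sorted order
print(codes)                   # integer code of each row

# Step 2: one-hot encoding turns the codes into one indicator column per category
onehot_encoder = OneHotEncoder()
onehot = onehot_encoder.fit_transform(codes.reshape(-1, 1)).toarray()
print(onehot)
```

Because the integer codes suggest an order that the categories may not have, features with more than two categories get the extra one-hot step, while binary features can stay as a single 0/1 column.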


The Telco Customer Churn dataset has 18 string-valued features, but not all of them are categorical. The *customerID* column is a record identifier and will not be used to build the classification model. The *TotalCharges* column is the total amount charged to the customer, so it should be a float, yet it is stored as a string. Therefore, *customerID* and *TotalCharges* should not go through the categorical encoding.
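
If *TotalCharges* is later needed as a numerical feature, one option is `pd.to_numeric`. The sketch below uses a small made-up frame named `toy`, and the blank entry is an assumed example of a value that cannot be parsed. It is not the kernel's own pre-processing step.

```python
import pandas as pd

# Toy stand-in for the TotalCharges column; the blank entry is an assumed
# example of a non-numeric value
toy = pd.DataFrame({'TotalCharges': ['29.85', '1889.5', ' ', '108.15']})

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising
toy['TotalCharges'] = pd.to_numeric(toy['TotalCharges'], errors='coerce')

print(toy['TotalCharges'].dtype)         # numeric dtype after conversion
print(toy['TotalCharges'].isna().sum())  # number of entries that could not be parsed
```

Any resulting NaN values still have to be handled (dropped or imputed) before model training.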

In [None]:
# Use of the subroutine on Telco dataset
df_encoded = transform_categorical(df, exclude=['customerID', 'TotalCharges'])

**<h1>Before & After</h1>**

In [None]:
# Original dataset
df.head(5)

In [None]:
# Encoded dataset
df_encoded.head(5)

In [None]:
df_encoded.describe()
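
By default, `describe()` summarizes only the numeric columns of a mixed DataFrame, so the excluded string columns do not show up in it. A quick way to list the columns that still hold strings is sketched below. The frame `encoded_sample` is a hypothetical stand-in for `df_encoded` with made-up values, and the check reuses the same first-value `isinstance` test as the function.

```python
import pandas as pd

# Hypothetical stand-in for df_encoded: two excluded string columns
# plus two already-encoded columns (values are made up)
encoded_sample = pd.DataFrame({
    'customerID': ['0001-A', '0002-B'],
    'TotalCharges': ['29.85', '1889.5'],
    'gender_encoded': [0, 1],
    'Contract_0': [1.0, 0.0],
})

# Columns whose first value is still a string
string_columns = [
    column for column in encoded_sample.columns
    if isinstance(encoded_sample[column].iloc[0], str)
]
print(string_columns)
```

On the real `df_encoded`, the same list comprehension should return only the columns passed in `exclude`.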