In [17]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder

In [3]:
telco = pd.read_csv('Telco-Customer-Churn.csv')
telco.head(5)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


# LABEL ENCODING

Many machine learning models expect numerical input, so categorical data usually needs to be converted to numbers first.
There are several popular ways to do this.

Label encoding is one of these methods: it replaces each categorical value with an integer between 0 and the number of unique values minus 1. It works well for columns that contain only two values.

In [15]:
# we will use columns that contain only two values
# for example, the gender and Partner columns

labelenc = telco[['gender', 'Partner']]

In [16]:
labelenc

Unnamed: 0,gender,Partner
0,Female,Yes
1,Male,No
2,Male,No
3,Male,No
4,Female,No
...,...,...
7038,Male,Yes
7039,Female,Yes
7040,Female,Yes
7041,Male,Yes


Label encoding can be performed in several ways:
1. using the scikit-learn library
2. using astype to 'category'
3. manually replacing the values with numerical values

Here I start with the first way.
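
The three approaches listed above can be sketched side by side. This is a minimal illustration on a small hypothetical sample (the `sample` DataFrame and the new column names are assumptions, not part of the notebook's data):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample standing in for telco[['gender', 'Partner']]
sample = pd.DataFrame({
    'gender': ['Female', 'Male', 'Male', 'Female'],
    'Partner': ['Yes', 'No', 'No', 'Yes'],
})

# 1. scikit-learn: classes are sorted, so Female -> 0 and Male -> 1
encoder = LabelEncoder()
sample['gender_sklearn'] = encoder.fit_transform(sample['gender'])

# 2. pandas category dtype: the integer codes live in .cat.codes
sample['gender_category'] = sample['gender'].astype('category').cat.codes

# 3. manual mapping: the dictionary makes the assignment explicit
sample['gender_manual'] = sample['gender'].map({'Female': 0, 'Male': 1})

print(sample)
```

On this sample all three columns hold the same codes. The manual mapping is the only one where the code assigned to each label is chosen by hand rather than by sort order.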

In [18]:
labelencoder = LabelEncoder()

labelenc['newgender'] = labelencoder.fit_transform(labelenc['gender'])
labelenc

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  labelenc['newgender'] = labelencoder.fit_transform(labelenc['gender'])


Unnamed: 0,gender,Partner,newgender
0,Female,Yes,0
1,Male,No,1
2,Male,No,1
3,Male,No,1
4,Female,No,0
...,...,...,...
7038,Male,Yes,1
7039,Female,Yes,0
7040,Female,Yes,0
7041,Male,Yes,1
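
The warning in the output above is pandas' SettingWithCopyWarning: `labelenc` was selected from `telco`, so pandas cannot tell whether the new column is meant for a copy or for the original. One common way to avoid it is to take an explicit copy before adding columns. A minimal sketch, using a hypothetical `telco_sample` DataFrame in place of the real data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the telco DataFrame
telco_sample = pd.DataFrame({
    'gender': ['Female', 'Male', 'Male'],
    'Partner': ['Yes', 'No', 'No'],
    'tenure': [1, 34, 2],
})

# .copy() makes the subset an independent DataFrame,
# so assigning a new column to it is unambiguous
subset = telco_sample[['gender', 'Partner']].copy()
subset['newgender'] = LabelEncoder().fit_transform(subset['gender'])

print(subset)
```

The original DataFrame is left unchanged; only `subset` gains the new column.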


In [21]:
labelenc['newpartner'] = labelenc['Partner'].astype('category')
labelenc

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  labelenc['newpartner'] = labelenc['Partner'].astype('category')


Unnamed: 0,gender,Partner,newgender,newpartner
0,Female,Yes,0,Yes
1,Male,No,1,No
2,Male,No,1,No
3,Male,No,1,No
4,Female,No,0,No
...,...,...,...,...
7038,Male,Yes,1,Yes
7039,Female,Yes,0,Yes
7040,Female,Yes,0,Yes
7041,Male,Yes,1,Yes


In [22]:
labelenc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   gender      7043 non-null   object  
 1   Partner     7043 non-null   object  
 2   newgender   7043 non-null   int32   
 3   newpartner  7043 non-null   category
dtypes: category(1), int32(1), object(2)
memory usage: 144.7+ KB
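
As the `info()` output shows, `astype('category')` changes the dtype of `newpartner`, but the column still displays the labels Yes and No. To get numbers from it, read the integer codes that pandas stores behind the labels. A minimal sketch on a hypothetical `partner` Series:

```python
import pandas as pd

# Hypothetical sample standing in for labelenc['Partner']
partner = pd.Series(['Yes', 'No', 'No', 'Yes'], name='Partner')

as_category = partner.astype('category')

# Categories are sorted, so the position of each label is its code
code_to_label = dict(enumerate(as_category.cat.categories))
print(code_to_label)

# .cat.codes returns the integer code of every row
codes = as_category.cat.codes
print(codes.tolist())
```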


# ONE-HOT ENCODING

We can use one-hot encoding for categorical data that has no hierarchy (no natural order), such as payment method.

In [38]:
onehot = telco[['PaymentMethod']]

In [39]:
onehot

Unnamed: 0,PaymentMethod
0,Electronic check
1,Mailed check
2,Mailed check
3,Bank transfer (automatic)
4,Electronic check
...,...
7038,Mailed check
7039,Credit card (automatic)
7040,Electronic check
7041,Mailed check


In [40]:
onehot['PaymentMethod'].value_counts()

Electronic check             2365
Mailed check                 1612
Bank transfer (automatic)    1544
Credit card (automatic)      1522
Name: PaymentMethod, dtype: int64

In [41]:
onehot1 = pd.get_dummies(onehot['PaymentMethod'], prefix='PaymentMethod')

In [43]:
onehot1

Unnamed: 0,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0,0,1,0
1,0,0,0,1
2,0,0,0,1
3,1,0,0,0
4,0,0,1,0
...,...,...,...,...
7038,0,0,0,1
7039,0,1,0,0
7040,0,0,1,0
7041,0,0,0,1


In [44]:
onehotcon = pd.concat([onehot, onehot1], axis=1)

In [45]:
onehotcon

Unnamed: 0,PaymentMethod,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,Electronic check,0,0,1,0
1,Mailed check,0,0,0,1
2,Mailed check,0,0,0,1
3,Bank transfer (automatic),1,0,0,0
4,Electronic check,0,0,1,0
...,...,...,...,...,...
7038,Mailed check,0,0,0,1
7039,Credit card (automatic),0,1,0,0
7040,Electronic check,0,0,1,0
7041,Mailed check,0,0,0,1
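
Two options of `pd.get_dummies` may be worth knowing here. Recent pandas versions return boolean columns by default, so passing `dtype=int` is one way to get the 0/1 output shown above. `drop_first=True` drops one dummy column, which some linear models prefer because the remaining columns already determine it. A minimal sketch on a hypothetical `payments` DataFrame:

```python
import pandas as pd

# Hypothetical sample with the four payment methods seen in value_counts()
payments = pd.DataFrame({
    'PaymentMethod': [
        'Electronic check',
        'Mailed check',
        'Bank transfer (automatic)',
        'Credit card (automatic)',
    ],
})

# dtype=int forces 0/1 columns instead of True/False
dummies = pd.get_dummies(payments['PaymentMethod'],
                         prefix='PaymentMethod', dtype=int)
encoded = pd.concat([payments, dummies], axis=1)
print(encoded)

# drop_first=True removes the first category in sorted order
reduced = pd.get_dummies(payments['PaymentMethod'],
                         prefix='PaymentMethod', dtype=int, drop_first=True)
print(reduced.columns.tolist())
```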


# ORDINAL ENCODING

Ordinal encoding is for categorical data that has a hierarchy, such as a class or a product quality rating.
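
As a minimal sketch of the idea, the `Contract` column could be treated as ordered by contract length. The ordering below, and the 'Two year' value (which does not appear in the preview above), are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical sample standing in for telco['Contract']
contracts = pd.Series(
    ['Month-to-month', 'One year', 'Two year', 'One year'],
    name='Contract',
)

# Assumed order, from shortest to longest commitment
contract_order = ['Month-to-month', 'One year', 'Two year']

# An ordered Categorical assigns codes by position in contract_order
ordered = pd.Categorical(contracts, categories=contract_order, ordered=True)
contract_codes = pd.Series(ordered.codes, name='Contract_ordinal')

print(pd.concat([contracts, contract_codes], axis=1))
```

Unlike label encoding, the codes here follow the order given in `contract_order` rather than the alphabetical order of the labels, so a larger number means a longer contract.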