Previously, one-hot encoding was usually done in pandas (with get_dummies) because doing it in scikit-learn was complicated. Newer scikit-learn releases make it much easier, and doing the encoding inside a pipeline also helps prevent data leakage.

Why is this better?
1- You do not have to build a large DataFrame of dummy columns with get_dummies in pandas to train the model.

2- When new data (test data) arrives, you do not need to run get_dummies on it before predicting. get_dummies only creates columns for the values it sees, so if the test data contains different values, its columns will not line up with the ones the model was trained on. For example, if the training data had S, C and Q in Embarked but the test data had only S and C, get_dummies would produce two columns instead of three, and the data would have the wrong shape.

3- You can grid search over both the model parameters and the preprocessing parameters.

4- In some cases, preprocessing outside scikit-learn makes cross-validation scores less reliable. With steps such as StandardScaler, missing value imputation, or text vectorization, preprocessing the full dataset before cross-validation lets information from the validation folds leak into training.
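To make point 2 concrete, here is a minimal sketch on toy data (the two small DataFrames are made up for illustration, not taken from the Titanic file). It compares the shapes produced by get_dummies with those produced by a OneHotEncoder that was fitted on the training data only.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): training has three ports, test data only two.
train = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
test = pd.DataFrame({'Embarked': ['S', 'C']})

# get_dummies builds columns from whatever values it sees,
# so train and test end up with a different number of columns.
train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)
print(train_dummies.shape, test_dummies.shape)

# OneHotEncoder remembers the categories learned in fit(),
# so transform() always returns one column per learned category.
ohe_demo = OneHotEncoder()
ohe_demo.fit(train)
train_encoded = ohe_demo.transform(train).toarray()
test_encoded = ohe_demo.transform(test).toarray()
print(train_encoded.shape, test_encoded.shape)
```

The encoder keeps the Q column for the test data (filled with zeros), which is what a model trained on three columns expects.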

In [1]:
import pandas as pd

In [2]:
df=pd.read_csv('http://bit.ly/kaggletrain')

In [3]:
df.shape

(891, 12)

In [4]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [5]:
# check for missing values in each column
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [6]:
# exclude rows where Embarked is missing, and keep only the columns used in this analysis
df=df.loc[df.Embarked.notna(),['Survived','Pclass', 'Sex','Embarked']]

In [7]:
df.isna().sum()

Survived    0
Pclass      0
Sex         0
Embarked    0
dtype: int64

In [8]:
## survived is our target
df.head()

Unnamed: 0,Survived,Pclass,Sex,Embarked
0,0,3,male,S
1,1,1,female,C
2,1,3,female,S
3,1,1,female,S
4,0,3,male,S


In [9]:
X=df.loc[:,['Pclass']]
y=df.Survived

In [10]:
X.shape

(889, 1)

In [11]:
y.shape

(889,)

In [12]:
from sklearn.linear_model import LogisticRegression

In [13]:
#For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss; ‘liblinear’ is 
#limited to one-versus-rest schemes.
logreg=LogisticRegression(solver='lbfgs')

In [14]:
from sklearn.model_selection import cross_val_score

In [15]:
cross_val_score(logreg, X, y, cv=5, scoring='accuracy').mean()

0.6783406335301212

In [16]:
# share of each class in the target (these are actual labels, not predictions)
y.value_counts(normalize=True)

0    0.617548
1    0.382452
Name: Survived, dtype: float64
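
The class shares above give a baseline for the 0.678 accuracy: a model that always predicts the majority class (0) would already be right about 61.75% of the time. A minimal sketch of that "null accuracy" calculation, using a made-up target series (`y_toy` is hypothetical, not the Titanic data):

```python
import pandas as pd

# Hypothetical target with a similar kind of imbalance to Survived.
y_toy = pd.Series([0, 0, 0, 1, 1, 0, 1, 0], name='Survived')

# Share of each class in the target.
class_share = y_toy.value_counts(normalize=True)

# Null accuracy: the accuracy of always predicting the most frequent class.
null_accuracy = class_share.max()
print(null_accuracy)
```

Applied to the real `y`, the same expression would return the larger of the two shares printed above.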

## Add more features

In [17]:
from sklearn.preprocessing import OneHotEncoder

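## sparse=False returns a dense NumPy array; the default returns a SciPy sparse matrix.
## Note: scikit-learn 1.2 renamed this argument to sparse_output, and 1.4 removed the old name.
ohe=OneHotEncoder(sparse=False)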

In [18]:
df.head()

Unnamed: 0,Survived,Pclass,Sex,Embarked
0,0,3,male,S
1,1,1,female,C
2,1,3,female,S
3,1,1,female,S
4,0,3,male,S


In [19]:
ohe.fit_transform(df[['Sex']])

array([[0., 1.],
       [1., 0.],
       [1., 0.],
       ...,
       [1., 0.],
       [0., 1.],
       [0., 1.]])

In [20]:
## shows the categories that were encoded; the output columns follow this order
ohe.categories_

[array(['female', 'male'], dtype=object)]
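
As a small illustration of how `categories_` relates to the encoded columns, the sketch below encodes a made-up three-row frame and then maps the result back with `inverse_transform`. The names `toy` and `ohe_demo` are only for this example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical), same values as the Sex column.
toy = pd.DataFrame({'Sex': ['male', 'female', 'female']})

ohe_demo = OneHotEncoder()
encoded = ohe_demo.fit_transform(toy).toarray()

# Columns follow the sorted order in categories_: female first, then male.
print(ohe_demo.categories_)
print(encoded)

# inverse_transform maps the 0/1 rows back to the original labels.
decoded = ohe_demo.inverse_transform(encoded)
print(decoded)
```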

In [21]:
X=df.drop('Survived' , axis ='columns')
X.head()

Unnamed: 0,Pclass,Sex,Embarked
0,3,male,S
1,1,female,C
2,3,female,S
3,1,female,S
4,3,male,S


In [22]:
## column transformer: used when columns in your DataFrame need different preprocessing,
## e.g. one-hot encoding on the Sex and Embarked columns only, since Pclass is already numeric
from sklearn.compose import make_column_transformer

In [23]:
## create the transformer instance first; columns not listed are passed through unchanged
column_trans = make_column_transformer(
    (OneHotEncoder(), ['Sex', 'Embarked']),
    remainder='passthrough')

In [24]:
column_trans.fit_transform(X)

array([[0., 1., 0., 0., 1., 3.],
       [1., 0., 1., 0., 0., 1.],
       [1., 0., 0., 0., 1., 3.],
       ...,
       [1., 0., 0., 0., 1., 3.],
       [0., 1., 1., 0., 0., 1.],
       [0., 1., 0., 1., 0., 3.]])
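
The column order of this output can be confusing: the encoded columns come first (two for Sex, three for Embarked) and the passthrough column Pclass comes last. A minimal sketch on a made-up three-row frame (`toy_X` and `ct` are illustrative names) shows the same layout:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical) with the same three columns as X.
toy_X = pd.DataFrame({
    'Pclass': [3, 1, 3],
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
})

ct = make_column_transformer(
    (OneHotEncoder(), ['Sex', 'Embarked']),
    remainder='passthrough')

# Output layout: [female, male, C, Q, S, Pclass]
out = ct.fit_transform(toy_X)
print(out)
```

Reading the first row against that layout: male, embarked at S, third class.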

## Pipeline

In [25]:
## make_pipeline chains steps together; it is a shorthand for Pipeline.
## The only difference is that make_pipeline generates the step names automatically.
## pipe = Pipeline([('vec', CountVectorizer()), ('clf', LogisticRegression())])
## pipe = make_pipeline(CountVectorizer(), LogisticRegression())
from sklearn.pipeline import make_pipeline

In [26]:
## chain the steps: first the column transformer (one-hot encoding), then the logistic regression
pipe=make_pipeline(column_trans, logreg)

In [27]:
## here we cross-validate the entire pipeline. Adding the two features Sex and Embarked increased accuracy from about 0.68 to about 0.77.
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

0.7727924839713071
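
Because preprocessing and model live in one pipeline, a grid search can tune both at once (point 3 in the list at the top). The following is only a sketch on made-up data: `toy_X`, `toy_y` and the parameter values are assumptions chosen for illustration, and `handle_unknown='ignore'` is set so that a small fold missing a category does not raise an error. Parameter names use the step names that make_pipeline and make_column_transformer generate automatically.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical), shaped like the X and y used in this notebook.
toy_X = pd.DataFrame({
    'Pclass': [1, 3, 3, 1, 2, 3, 1, 2, 3, 3, 1, 2],
    'Sex': ['female', 'male', 'male', 'female', 'female', 'male',
            'female', 'male', 'male', 'male', 'female', 'female'],
    'Embarked': ['S', 'S', 'C', 'C', 'Q', 'S',
                 'S', 'Q', 'C', 'S', 'C', 'S'],
})
toy_y = pd.Series([1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1])

ct = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['Sex', 'Embarked']),
    remainder='passthrough')
toy_pipe = make_pipeline(ct, LogisticRegression(solver='lbfgs'))

# One preprocessing parameter and one model parameter in the same grid.
params = {
    'columntransformer__remainder': ['passthrough', 'drop'],  # keep or drop Pclass
    'logisticregression__C': [0.1, 1, 10],                    # regularization strength
}

grid = GridSearchCV(toy_pipe, params, cv=3, scoring='accuracy')
grid.fit(toy_X, toy_y)
print(grid.best_params_)
print(grid.best_score_)
```

Each candidate is refitted on the training folds only, so the encoder never sees the validation fold during fitting.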

## Evaluation 

In [29]:
X_new=X.sample(5, random_state=99)
X_new

Unnamed: 0,Pclass,Sex,Embarked
599,1,male,C
512,1,male,S
273,1,male,C
215,1,female,C
790,3,male,Q


In [30]:
## since the model is in a pipeline, fitting the pipeline fits both the column transformer and the logistic regression
pipe.fit(X,y)

Pipeline(memory=None,
     steps=[('columntransformer', ColumnTransformer(n_jobs=None, remainder='passthrough', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('onehotencoder', OneHotEncoder(categorical_features=None, categories=None,
       dtype=<class 'numpy.float64'>, handle_unknown='error...enalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))])

In [31]:
## this works like model.predict, but the pipeline transforms the data first. As shown above, the new sample
## contains strings; the pipeline one-hot encodes them and then makes the prediction.
pipe.predict(X_new)

array([1, 0, 1, 1, 0], dtype=int64)
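
One caveat when predicting on new data: the fitted pipeline printed above uses the encoder's default `handle_unknown='error'`, so a category that never appeared in training raises an error at prediction time. The sketch below uses made-up data (the port 'X' is hypothetical) to show the default behaviour next to `handle_unknown='ignore'`, which encodes an unseen value as all zeros instead.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): 'X' is a port that was never seen in training.
train = pd.DataFrame({'Embarked': ['S', 'C', 'Q']})
new = pd.DataFrame({'Embarked': ['X']})

# Default behaviour: unknown categories raise a ValueError in transform().
strict = OneHotEncoder()
strict.fit(train)
try:
    strict.transform(new)
    raised = False
except ValueError as err:
    raised = True
    print('strict encoder failed:', err)

# With handle_unknown='ignore', the unseen value becomes a row of zeros.
lenient = OneHotEncoder(handle_unknown='ignore')
lenient.fit(train)
ignored = lenient.transform(new).toarray()
print(ignored)
```

Whether failing loudly or ignoring silently is the better choice depends on how much you trust the incoming data.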