
# One-Hot Encoding in Scikit-learn
**(Using LabelEncoder + OneHotEncoder and the more streamlined ColumnTransformer approach)**
## Intuition

1. Prepare your categorical data using LabelEncoder().
2. Apply OneHotEncoder() to the new DataFrame from step 1.
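
The two steps can be sketched on a single toy column before they are applied to the Titanic data. This is a minimal illustration, not the notebook's own code: the `sex` array and the variable names are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy column, not taken from the Titanic data
sex = np.array(['male', 'female', 'female', 'male'])

# Step 1: map each string category to an integer code.
# Classes are sorted, so female -> 0 and male -> 1.
label_encoder = LabelEncoder()
sex_codes = label_encoder.fit_transform(sex)

# Step 2: expand the integer codes into one indicator column per category.
# OneHotEncoder expects 2-D input, hence the reshape to a single column.
onehot_encoder = OneHotEncoder(categories='auto')
sex_onehot = onehot_encoder.fit_transform(sex_codes.reshape(-1, 1)).toarray()

print(label_encoder.classes_)
print(sex_codes)
print(sex_onehot)
```

Each row of `sex_onehot` contains exactly one 1, in the column that corresponds to that row's category.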

In [0]:
# import 
import numpy as np
import pandas as pd
# load dataset
X = pd.read_csv('https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv')
X.head(3)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925


In [0]:
# Remove unwanted variables

X = X.loc[:, X.columns != 'Name'] #delete Name column


In [0]:
# Limit to categorical columns with df.select_dtypes(), but know your data:
# Pclass is stored as integers, so it is not selected even though it is categorical.
Xcatsonly = X.select_dtypes(include=[object])
Xcatsonly.head(3)

Unnamed: 0,Sex
0,male
1,female
2,female
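
One possible way to make an integer-coded column such as Pclass show up in a dtype-based selection is to cast it to the pandas `category` dtype first. The sketch below uses a small made-up frame (`toy`) rather than the Titanic data, and selects with `exclude='number'`, which is one option among several.

```python
import pandas as pd

# Hypothetical toy frame with one integer-coded categorical column
toy = pd.DataFrame({
    'Pclass': [3, 1, 3],
    'Sex': ['male', 'female', 'female'],
    'Fare': [7.25, 71.2833, 7.925],
})

# Mark Pclass as categorical so it is no longer treated as a number
toy['Pclass'] = toy['Pclass'].astype('category')

# Keep every column that is not numeric
toy_cats = toy.select_dtypes(exclude='number')
print(toy_cats.columns.tolist())
```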


In [0]:
# check original shape
X.shape

(887, 7)

In [0]:
# import preprocessing from sklearn
from sklearn import preprocessing

In [0]:
# view columns using df.columns
X.columns

Index(['Survived', 'Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard',
       'Parents/Children Aboard', 'Fare'],
      dtype='object')

In [0]:
# Create a LabelEncoder object and fit it to each categorical feature in X


# 1. INSTANTIATE
# encode labels with value between 0 and n_classes-1.
from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()

# 2/3. FIT AND TRANSFORM
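X_2 = X  # NOTE: this is a second name for the same DataFrame, not a copy; X is modified too (use X.copy() for an independent copy)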

X_2['Survived'] = labelencoder.fit_transform(X_2['Survived'])
X_2['Pclass'] = labelencoder.fit_transform(X_2['Pclass'])
X_2['Sex'] = labelencoder.fit_transform(X_2['Sex'])



In [0]:
X_2.dtypes # all values are now numeric.  We need to one-hot encode categorical data, so that our models do not think they are continuous data!

Survived                     int64
Pclass                       int64
Sex                          int64
Age                        float64
Siblings/Spouses Aboard      int64
Parents/Children Aboard      int64
Fare                       float64
dtype: object

In [0]:
X.dtypes # Sex is int64 here as well: X_2 = X made no copy, so X was modified along with X_2

Survived                     int64
Pclass                       int64
Sex                          int64
Age                        float64
Siblings/Spouses Aboard      int64
Parents/Children Aboard      int64
Fare                       float64
dtype: object
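
The two dtype listings above are identical because plain assignment does not copy a DataFrame. A minimal sketch of the difference between an alias and a copy, using a made-up one-column frame:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame
df = pd.DataFrame({'Sex': ['male', 'female', 'female']})

alias = df               # second name for the same object
independent = df.copy()  # separate DataFrame with its own data

# Encoding through the alias also changes df, but not the copy
alias['Sex'] = LabelEncoder().fit_transform(alias['Sex'])

print(alias is df)
print(df['Sex'].tolist())
print(independent['Sex'].tolist())
```

If the original string values are still needed later, `X.copy()` is the safer choice.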

## OneHotEncoder + ColumnTransformer

Encode categorical integer features using a one-hot aka one-of-K scheme.

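The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features.

By default, the output is a sparse matrix where each column corresponds to one possible value of one feature.

With categories='auto', the categories of each feature are determined from the unique values in the training data. (Older scikit-learn releases assumed integer inputs in the range [0, n_values); the n_values parameter has since been removed.)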

This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models.
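
As a rough illustration of the description above, the sketch below fits OneHotEncoder directly on a small made-up frame that mixes a string column and an integer column, with no LabelEncoder step. The frame and the use of `handle_unknown='ignore'` are assumptions for the example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one string column and one integer-coded column
X_toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Pclass': [3, 1, 3],
})

encoder = OneHotEncoder(categories='auto', handle_unknown='ignore')
X_encoded = encoder.fit_transform(X_toy)  # SciPy sparse matrix by default

# One array of learned categories per input column
print(encoder.categories_)
print(X_encoded.shape)
print(X_encoded.toarray())

# A category not seen during fit (Pclass 2) encodes as all zeros for that feature
X_new = pd.DataFrame({'Sex': ['male'], 'Pclass': [2]})
unseen = encoder.transform(X_new).toarray()
print(unseen)
```

Two categories for Sex plus two observed categories for Pclass give four output columns.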

In [0]:
# Goal is to build data that includes one hot encoded categorical variables.  Old approach was to use labelencoder first, then use onehotencoder.
# New approach is to do it all in one execution using ColumnTransformer

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# one hot encode columns in index locations 0,1, and 2...
columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(categories='auto'), [0,1,2])], remainder='passthrough')
Xonehot = columnTransformer.fit_transform(X)

print(X.shape)
Xonehot.shape # an array we can use to input into a model with correctly shaped categorical data.

(887, 7)


(887, 11)

# More advanced example demonstrating the flexibility of the ColumnTransformer approach

In [0]:
import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

np.random.seed(0)

# Read data from Titanic dataset.
titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)

# We will train our classifier with the following features:
# Numeric Features:
# - age: float.
# - fare: float.
# Categorical Features:
# - embarked: categories encoded as strings {'C', 'S', 'Q'}.
# - sex: categories encoded as strings {'female', 'male'}.
# - pclass: ordinal integers {1, 2, 3}.

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']

# Replace missing values with the most frequent value (the mode), then one-hot encode.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# final preprocessor object set up with ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

X = data.drop('survived', axis=1)
y = data['survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))

model score: 0.790
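
The cell above imports GridSearchCV without using it. One way it could be applied is to search over parameters of both the preprocessing steps and the classifier, addressing nested steps with double-underscore names. The sketch below is self-contained and runs on synthetic data. The `toy` frame, the target `y_toy`, and the grid values are assumptions for illustration, not results from the Titanic dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for a few Titanic columns (hypothetical values)
rng = np.random.RandomState(0)
n = 60
toy = pd.DataFrame({
    'age': rng.uniform(1, 70, n),
    'fare': rng.uniform(5, 100, n),
    'sex': rng.choice(['female', 'male'], n),
})
toy.loc[::7, 'age'] = np.nan                    # inject some missing ages
y_toy = (toy['sex'] == 'female').astype(int)    # toy target

numeric = Pipeline([('imputer', SimpleImputer(strategy='median')),
                    ('scaler', StandardScaler())])
categorical = Pipeline([('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer([('num', numeric, ['age', 'fare']),
                                  ('cat', categorical, ['sex'])])
pipe = Pipeline([('preprocessor', preprocessor),
                 ('classifier', LogisticRegression(solver='lbfgs'))])

# step__substep__parameter names reach into the nested pipeline
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(toy, y_toy)

print(search.best_params_)
print("best CV score: %.3f" % search.best_score_)
```

Because the preprocessing lives inside the pipeline, each cross-validation fold fits the imputer, scaler, and encoder on its own training split only.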


In [0]:
import pickle
# Use a context manager so the file is closed after writing
with open("titanic_model.pkl", "wb") as f:
    pickle.dump(clf, f)
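
A saved pipeline can later be restored with `pickle.load` and used for prediction without refitting. The sketch below shows the round trip on a tiny stand-in model so that it runs on its own. The data, the temporary file path, and the variable names are made up for the example.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny stand-in model (hypothetical data) in place of the Titanic pipeline
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
model = Pipeline([('scaler', StandardScaler()),
                  ('classifier', LogisticRegression(solver='lbfgs'))])
model.fit(X_demo, y_demo)

# Write the fitted pipeline to a temporary file
path = os.path.join(tempfile.mkdtemp(), 'demo_model.pkl')
with open(path, 'wb') as f:
    pickle.dump(model, f)

# Read it back and check that predictions match the original model
with open(path, 'rb') as f:
    restored = pickle.load(f)

same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.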

In [0]:
X_test

Unnamed: 0,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1139,3,"Rekic, Mr. Tido",male,38.0,0,0,349249,7.8958,,S,,,
533,2,"Phillips, Miss. Alice Frances Louisa",female,21.0,0,1,S.O./P.P. 2,21.0000,,S,12,,"Ilfracombe, Devon"
459,2,"Jacobsohn, Mr. Sidney Samuel",male,42.0,1,0,243847,27.0000,,S,,,London
1150,3,"Risien, Mr. Samuel Beard",male,,0,0,364498,14.5000,,S,,,
393,2,"Denbury, Mr. Herbert",male,25.0,0,0,C.A. 31029,31.5000,,S,,,"Guernsey / Elizabeth, NJ"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
753,3,"Davies, Mr. Evan",male,22.0,0,0,SC/A4 23568,8.0500,,S,,,
1052,3,"Nankoff, Mr. Minko",male,,0,0,349218,7.8958,,S,,,
426,2,"Hale, Mr. Reginald",male,30.0,0,0,250653,13.0000,,S,,75.0,"Auburn, NY"
554,2,"Schmidt, Mr. August",male,26.0,0,0,248659,13.0000,,S,,,"Newark, NJ"
