<a href="https://colab.research.google.com/github/sakuronohana/cas_datenanalyse/blob/master/PVA2/Feature-Coding-categorical.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Feature Encoding: Categorical Features in Python

Import the pandas, scikit-learn, NumPy, and [category_encoders](http://contrib.scikit-learn.org/categorical-encoding/) libraries.

In [0]:
!pip3 install category_encoders

In [0]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelBinarizer, LabelEncoder

import category_encoders as ce

The data file has no header row, so we specify the column names ourselves.

In [0]:
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration", "num_doors", "body_style",
           "drive_wheels", "engine_location", "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system", "bore", "stroke", 
           "compression_ratio", "horsepower", "peak_rpm", "city_mpg", "highway_mpg", "price"]

Read the data and convert "?" entries to NaN.

In [0]:
df = pd.read_csv("./data/imports-85.data",
                 header=None, names=headers, na_values="?" )
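
To see what `na_values="?"` does in isolation, here is a minimal sketch on a small in-memory CSV. The three sample rows and the shortened column list are made up for illustration; they are not taken from imports-85.data.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same style as the data file:
# no header row, and "?" marks a missing value.
sample = io.StringIO("3,?,alfa-romero\n2,164,audi\n1,?,bmw\n")

demo = pd.read_csv(sample, header=None,
                   names=["symboling", "normalized_losses", "make"],
                   na_values="?")

# "?" entries are parsed as NaN, so the column is read as float64
print(demo)
print(demo.dtypes)
```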

In [0]:
df.head()

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0


Data types in the DataFrame

In [0]:
df.dtypes

symboling              int64
normalized_losses    float64
make                  object
fuel_type             object
aspiration            object
num_doors             object
body_style            object
drive_wheels          object
engine_location       object
wheel_base           float64
length               float64
width                float64
height               float64
curb_weight            int64
engine_type           object
num_cylinders         object
engine_size            int64
fuel_system           object
bore                 float64
stroke               float64
compression_ratio    float64
horsepower           float64
peak_rpm             float64
city_mpg               int64
highway_mpg            int64
price                float64
dtype: object

Make a copy that contains only the object columns.

In [0]:
obj_df = df.select_dtypes(include=['object']).copy()

In [0]:
obj_df.head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
0,alfa-romero,gas,std,two,convertible,rwd,front,dohc,four,mpfi
1,alfa-romero,gas,std,two,convertible,rwd,front,dohc,four,mpfi
2,alfa-romero,gas,std,two,hatchback,rwd,front,ohcv,six,mpfi
3,audi,gas,std,four,sedan,fwd,front,ohc,four,mpfi
4,audi,gas,std,four,sedan,4wd,front,ohc,five,mpfi


Check for null values in the data.

In [0]:
obj_df[obj_df.isnull().any(axis=1)]

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
27,dodge,gas,turbo,,sedan,fwd,front,ohc,four,mpfi
63,mazda,diesel,std,,sedan,fwd,front,ohc,four,idi
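
As a complement to listing the affected rows, the number of missing values per column can be counted directly. A minimal sketch on a made-up three-row frame (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame with missing values in one column
demo = pd.DataFrame({"make": ["dodge", "mazda", "audi"],
                     "num_doors": [np.nan, np.nan, "four"]})

# Number of missing values per column
null_counts = demo.isnull().sum()

# Show only the columns that actually contain missing values
print(null_counts[null_counts > 0])
```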


Since the null values are all in the num_doors column, check which values occur in it.

In [0]:
obj_df["num_doors"].value_counts()

four    114
two      89
Name: num_doors, dtype: int64

We fill the missing values with the most frequent value, four.

In [0]:
obj_df = obj_df.fillna({"num_doors": "four"})
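
Instead of hardcoding "four", the most frequent value can also be looked up programmatically. A minimal sketch, assuming a small stand-in Series rather than the real column:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the num_doors column
doors = pd.Series(["four", "two", "four", np.nan, "two", "four"], name="num_doors")

# mode() ignores NaN by default; [0] picks the first entry if there is a tie
most_frequent = doors.mode()[0]
filled = doors.fillna(most_frequent)

print(most_frequent)
print(filled.isnull().sum())
```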

In [0]:
obj_df[obj_df.isnull().any(axis=1)]

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system


### Feature encoding with pandas

Convert num_cylinders and num_doors to numbers.

In [0]:
obj_df["num_cylinders"].value_counts()

four      159
six        24
five       11
eight       5
two         4
three       1
twelve      1
Name: num_cylinders, dtype: int64

In [0]:
cleanup_nums = {"num_doors":     {"four": 4, "two": 2},
                "num_cylinders": {"four": 4, "six": 6, "five": 5, "eight": 8,
                                  "two": 2, "twelve": 12, "three":3 }}

In [0]:
obj_df.replace(cleanup_nums, inplace=True)
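
Depending on the pandas version, `replace` on object columns may warn about, or no longer perform, the automatic conversion to an integer dtype. An alternative sketch with `map` and an explicit `astype` is shown below; the mini-frame and the name `word_to_int` are made up for illustration.

```python
import pandas as pd

# Hypothetical mini-frame with spelled-out numbers
demo = pd.DataFrame({"num_doors": ["two", "four", "four"],
                     "num_cylinders": ["four", "six", "twelve"]})

word_to_int = {"two": 2, "three": 3, "four": 4, "five": 5,
               "six": 6, "eight": 8, "twelve": 12}

for col in ["num_doors", "num_cylinders"]:
    # map() yields NaN for words missing from the dictionary; astype("int64")
    # then raises, so an unexpected value fails loudly instead of slipping through
    demo[col] = demo[col].map(word_to_int).astype("int64")

print(demo.dtypes)
```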

In [0]:
obj_df.head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
0,alfa-romero,gas,std,2,convertible,rwd,front,dohc,4,mpfi
1,alfa-romero,gas,std,2,convertible,rwd,front,dohc,4,mpfi
2,alfa-romero,gas,std,2,hatchback,rwd,front,ohcv,6,mpfi
3,audi,gas,std,4,sedan,fwd,front,ohc,4,mpfi
4,audi,gas,std,4,sedan,4wd,front,ohc,5,mpfi


Check the data types again to confirm that the columns are now really interpreted as numbers.

In [0]:
obj_df.dtypes

make               object
fuel_type          object
aspiration         object
num_doors           int64
body_style         object
drive_wheels       object
engine_location    object
engine_type        object
num_cylinders       int64
fuel_system        object
dtype: object

One encoding option is to use the pandas category dtype.

In [0]:
obj_df["body_style"].value_counts()

sedan          96
hatchback      70
wagon          25
hardtop         8
convertible     6
Name: body_style, dtype: int64

In [0]:
obj_df["body_style"] = obj_df["body_style"].astype('category')

In [0]:
obj_df.dtypes

make                 object
fuel_type            object
aspiration           object
num_doors             int64
body_style         category
drive_wheels         object
engine_location      object
engine_type          object
num_cylinders         int64
fuel_system          object
dtype: object

We can assign the category codes to a new column so that we have a clean numeric representation.

In [0]:
obj_df["body_style_cat"] = obj_df["body_style"].cat.codes

In [0]:
obj_df.head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system,body_style_cat
0,alfa-romero,gas,std,2,convertible,rwd,front,dohc,4,mpfi,0
1,alfa-romero,gas,std,2,convertible,rwd,front,dohc,4,mpfi,0
2,alfa-romero,gas,std,2,hatchback,rwd,front,ohcv,6,mpfi,2
3,audi,gas,std,4,sedan,fwd,front,ohc,4,mpfi,3
4,audi,gas,std,4,sedan,4wd,front,ohc,5,mpfi,3


In [0]:
obj_df.dtypes

make                 object
fuel_type            object
aspiration           object
num_doors             int64
body_style         category
drive_wheels         object
engine_location      object
engine_type          object
num_cylinders         int64
fuel_system          object
body_style_cat         int8
dtype: object
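
To read the codes back, it helps to know how they relate to the categories. A minimal sketch on a made-up Series, showing the code-to-label mapping and how a missing value is coded:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value
styles = pd.Series(["sedan", "convertible", np.nan, "wagon", "sedan"]).astype("category")

# Inferred categories are sorted; code i refers to categories[i]
code_to_label = dict(enumerate(styles.cat.categories))
print(code_to_label)               # {0: 'convertible', 1: 'sedan', 2: 'wagon'}

# Missing values get the code -1
codes = styles.cat.codes.tolist()
print(codes)                       # [1, 0, -1, 2, 1]
```

Note that the integer codes suggest an ordering (convertible < sedan < wagon) that the categories themselves do not have, which can mislead models that treat the column as numeric.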

### One-hot encoding
For one-hot encoding, use pandas get_dummies.

In [0]:
pd.get_dummies(obj_df, columns=["drive_wheels"]).head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,engine_location,engine_type,num_cylinders,fuel_system,body_style_cat,drive_wheels_4wd,drive_wheels_fwd,drive_wheels_rwd
0,alfa-romero,gas,std,2,convertible,front,dohc,4,mpfi,0,0,0,1
1,alfa-romero,gas,std,2,convertible,front,dohc,4,mpfi,0,0,0,1
2,alfa-romero,gas,std,2,hatchback,front,ohcv,6,mpfi,2,0,0,1
3,audi,gas,std,4,sedan,front,ohc,4,mpfi,3,0,1,0
4,audi,gas,std,4,sedan,front,ohc,5,mpfi,3,1,0,0


get_dummies also has options, such as prefix, that make the resulting column names easier to understand.

In [0]:
pd.get_dummies(obj_df, columns=["body_style", "drive_wheels"], prefix=["body", "drive"]).head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,engine_location,engine_type,num_cylinders,fuel_system,body_style_cat,body_convertible,body_hardtop,body_hatchback,body_sedan,body_wagon,drive_4wd,drive_fwd,drive_rwd
0,alfa-romero,gas,std,2,front,dohc,4,mpfi,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,alfa-romero,gas,std,2,front,dohc,4,mpfi,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,alfa-romero,gas,std,2,front,ohcv,6,mpfi,2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
3,audi,gas,std,4,front,ohc,4,mpfi,3,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4,audi,gas,std,4,front,ohc,5,mpfi,3,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
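
The outputs above show the dummy columns once as integers and once as floats, and newer pandas versions return booleans by default. A minimal sketch, on a made-up frame, that pins the dtype explicitly and also drops one redundant level:

```python
import pandas as pd

# Hypothetical mini-frame
demo = pd.DataFrame({"drive_wheels": ["rwd", "fwd", "4wd", "fwd"]})

# dtype=int gives 0/1 integers regardless of the version's default;
# drop_first=True drops the first level (here "4wd"), which is implied
# when all remaining dummy columns are 0
dummies = pd.get_dummies(demo, columns=["drive_wheels"], prefix="drive",
                         dtype=int, drop_first=True)
print(dummies)
```

Whether to drop a level depends on the model: linear models without regularization usually benefit, tree-based models usually do not need it.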


Another option, where applicable, is a true/false mapping, for example a flag for whether engine_type contains "ohc".

In [0]:
obj_df["engine_type"].value_counts()

ohc      148
ohcf      15
ohcv      13
dohc      12
l         12
rotor      4
dohcv      1
Name: engine_type, dtype: int64

Use np.where and the str accessor to implement this efficiently.

In [0]:
obj_df["OHC_Code"] = np.where(obj_df["engine_type"].str.contains("ohc"), 1, 0)

In [0]:
obj_df[["make", "engine_type", "OHC_Code"]].head(10)

Unnamed: 0,make,engine_type,OHC_Code
0,alfa-romero,dohc,1
1,alfa-romero,dohc,1
2,alfa-romero,ohcv,1
3,audi,ohc,1
4,audi,ohc,1
5,audi,ohc,1
6,audi,ohc,1
7,audi,ohc,1
8,audi,ohc,1
9,audi,ohc,1
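
The same flag can be built without np.where. The sketch below, on a made-up Series, also shows how missing values can be handled explicitly; engine_type has none in this dataset, so this is only a precaution.

```python
import numpy as np
import pandas as pd

# Hypothetical sample including a missing value
engine = pd.Series(["dohc", "ohcv", "rotor", "l", np.nan], name="engine_type")

# na=False treats missing values as "no match" instead of propagating NaN;
# regex=False does a plain substring search
is_ohc = engine.str.contains("ohc", regex=False, na=False)

# Convert the boolean mask to 0/1
ohc_code = is_ohc.astype(int)
print(ohc_code.tolist())   # [1, 1, 0, 0, 0]
```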


### Feature encoding with scikit-learn

Instantiate the LabelEncoder.

In [0]:
lb_make = LabelEncoder()

In [0]:
obj_df["make_code"] = lb_make.fit_transform(obj_df["make"])

In [0]:
obj_df[["make", "make_code"]].head(10)

Unnamed: 0,make,make_code
0,alfa-romero,0
1,alfa-romero,0
2,alfa-romero,0
3,audi,1
4,audi,1
5,audi,1
6,audi,1
7,audi,1
8,audi,1
9,audi,1
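
A fitted LabelEncoder keeps the mapping, so codes can be translated back to labels. A minimal sketch on a made-up Series:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of car makes
makes = pd.Series(["audi", "bmw", "alfa-romero", "audi"])

le = LabelEncoder()
codes = le.fit_transform(makes)

# classes_ holds the sorted labels; a code is the position in this array
print(le.classes_.tolist())                  # ['alfa-romero', 'audi', 'bmw']
print(codes.tolist())                        # [1, 2, 0, 1]

# inverse_transform maps codes back to labels
restored = le.inverse_transform(codes).tolist()
print(restored)                              # ['audi', 'bmw', 'alfa-romero', 'audi']
```

The scikit-learn documentation describes LabelEncoder as a tool for target labels; for input features, OrdinalEncoder from the same module is the intended counterpart.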


For something similar to pandas get_dummies, use LabelBinarizer.

In [0]:
lb_style = LabelBinarizer()
lb_results = lb_style.fit_transform(obj_df["body_style"])

The result is an array that needs to be converted to a DataFrame.

In [0]:
lb_results

array([[1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0],
       ..., 
       [0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0]])

In [0]:
pd.DataFrame(lb_results, columns=lb_style.classes_).head()

Unnamed: 0,convertible,hardtop,hatchback,sedan,wagon
0,1,0,0,0,0
1,1,0,0,0,0
2,0,0,1,0,0
3,0,0,0,1,0
4,0,0,0,1,0
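
To attach the binarized columns to the original frame, the new DataFrame should reuse the original index so the rows line up. A minimal sketch on a made-up frame with a non-default index:

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical frame with a non-default index to show why index= matters
demo = pd.DataFrame({"body_style": ["sedan", "wagon", "hatchback"]},
                    index=[10, 11, 12])

lb = LabelBinarizer()
encoded = pd.DataFrame(lb.fit_transform(demo["body_style"]),
                       columns=lb.classes_,
                       index=demo.index)   # align rows with the original frame

# join() matches rows by index
combined = demo.join(encoded)
print(combined)
```

With only two distinct labels, LabelBinarizer returns a single column, so `columns=lb.classes_` would not fit in that case.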


### Advanced feature encoding of categorical features
These examples use the [category_encoders](http://contrib.scikit-learn.org/categorical-encoding/) library.

In [0]:
# Get a new clean dataframe
obj_df = df.select_dtypes(include=['object']).copy()

In [0]:
obj_df.head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
0,alfa-romero,gas,std,two,convertible,rwd,front,dohc,four,mpfi
1,alfa-romero,gas,std,two,convertible,rwd,front,dohc,four,mpfi
2,alfa-romero,gas,std,two,hatchback,rwd,front,ohcv,six,mpfi
3,audi,gas,std,four,sedan,fwd,front,ohc,four,mpfi
4,audi,gas,std,four,sedan,4wd,front,ohc,five,mpfi


Try the backward difference encoder on the engine_type column.

In [0]:
encoder = ce.backward_difference.BackwardDifferenceEncoder(cols=["engine_type"])
encoder.fit(obj_df, verbose=1)

BackwardDifferenceEncoder(cols=['engine_type'], drop_invariant=False,
             handle_unknown='impute', impute_missing=True, return_df=True,
             verbose=0)

In [0]:
encoder.transform(obj_df).iloc[:,0:7].head()

Unnamed: 0,col_engine_type_0,col_engine_type_1,col_engine_type_2,col_engine_type_3,col_engine_type_4,col_engine_type_5,col_engine_type_6
0,1.0,-0.857143,-0.714286,-0.571429,-0.428571,-0.285714,-0.142857
1,1.0,-0.857143,-0.714286,-0.571429,-0.428571,-0.285714,-0.142857
2,1.0,0.142857,-0.714286,-0.571429,-0.428571,-0.285714,-0.142857
3,1.0,0.142857,0.285714,-0.571429,-0.428571,-0.285714,-0.142857
4,1.0,0.142857,0.285714,-0.571429,-0.428571,-0.285714,-0.142857
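
The numbers in the output above are consistent with the standard backward difference contrast matrix, in which column j compares level j+1 with level j. The sketch below rebuilds that matrix in plain NumPy. It is a reconstruction for illustration, not the category_encoders implementation, and the function name is made up. The level order (dohc, ohcv, ohc, ... by first appearance) is inferred from the output, and the intercept column of ones is left out.

```python
import numpy as np

def backward_difference_contrasts(k):
    """Contrast matrix (k levels x k-1 columns) for backward difference coding."""
    contrasts = np.empty((k, k - 1))
    for j in range(1, k):
        contrasts[:j, j - 1] = -(k - j) / k   # levels 1..j
        contrasts[j:, j - 1] = j / k          # levels j+1..k
    return contrasts

C = backward_difference_contrasts(7)   # engine_type has 7 distinct values
print(np.round(C[:3], 6))              # rows for the first three levels
```

The first row matches the values shown for dohc (-0.857143, -0.714286, ...), and the second and third rows match those shown for ohcv and ohc.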


Here is the polynomial encoding.

In [0]:
encoder = ce.polynomial.PolynomialEncoder(cols=["engine_type"])
encoder.fit(obj_df, verbose=1)

PolynomialEncoder(cols=['engine_type'], drop_invariant=False,
         handle_unknown='impute', impute_missing=True, return_df=True,
         verbose=0)

In [0]:
encoder.transform(obj_df).iloc[:,0:7].head()

Unnamed: 0,col_engine_type_0,col_engine_type_1,col_engine_type_2,col_engine_type_3,col_engine_type_4,col_engine_type_5,col_engine_type_6
0,1.0,-0.566947,0.545545,-0.408248,0.241747,-0.109109,0.032898
1,1.0,-0.566947,0.545545,-0.408248,0.241747,-0.109109,0.032898
2,1.0,-0.377964,0.0,0.408248,-0.564076,0.436436,-0.197386
3,1.0,-0.188982,-0.327327,0.408248,0.080582,-0.545545,0.493464
4,1.0,-0.188982,-0.327327,0.408248,0.080582,-0.545545,0.493464
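
The values above are consistent with orthonormal polynomial contrasts (linear, quadratic, cubic, ...) over equally spaced level scores. The sketch below derives such contrasts with a QR decomposition in NumPy. It is a reconstruction for illustration, not the category_encoders implementation; the function name is made up, and the intercept column is left out.

```python
import numpy as np

def polynomial_contrasts(k):
    """Orthonormal polynomial contrasts (k levels x k-1 columns)."""
    x = np.arange(1, k + 1, dtype=float)
    x -= x.mean()                               # center the level scores
    vander = np.vander(x, k, increasing=True)   # columns: x^0, x^1, ..., x^(k-1)
    q, _ = np.linalg.qr(vander)                 # orthonormalize the columns
    q *= np.sign(q[-1])                         # fix signs: last row positive
    return q[:, 1:]                             # drop the constant column

P = polynomial_contrasts(7)      # engine_type has 7 distinct values
print(np.round(P[:3], 6))        # rows for the first three levels
```

The first row reproduces the values shown for dohc (-0.566947, 0.545545, -0.408248, ...). Polynomial contrasts assume ordered, equally spaced levels, which engine_type does not obviously have, so the example mainly demonstrates the mechanics.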
