## OneHotEncoder Data Transformation Project

#### Objective

Transform nominal survey variables with OneHotEncoder to support visualization and cluster analysis with the k-modes clustering algorithm.

#### Method

Transform LabelEncoded nominal variables into one-hot numeric arrays using scikit-learn's OneHotEncoder.
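
As a minimal sketch of what this transformation does, the example below one-hot encodes a single made-up label-encoded column. The array `toy` and the names `toy_encoder` and `toy_onehot` are illustrative assumptions and are not part of the survey data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical label-encoded column with three categories (0, 1, 2)
toy = np.array([[0], [2], [1], [2]])

toy_encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; .toarray() makes it a dense array
toy_onehot = toy_encoder.fit_transform(toy).toarray()

# One output column per category found during fitting
print(toy_encoder.categories_)
# Each row holds a single 1 in the column of its category
print(toy_onehot)
```

Each integer code becomes its own 0/1 indicator column, so the encoded values no longer imply an ordering among the categories.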

#### Datasource

Proprietary survey, n = 1,200
    
Variables for transformation: var9, var11, var13, var16, var217, var234, var235, var236, var246 (stored in the dataset with a `_y` suffix, e.g. `var9_y`)

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

In [3]:
# Create dataframe
df = pd.read_csv("OpioidsRecodes.csv")
df.head()

Unnamed: 0,Vrid,Vdatesub,Vstatus,Vcid,Vcomment,Vlanguage,Vreferer,Vsessionid,Vuseragent,Vip,...,var16_y,var217_y,var230_y,var231_y,var232_y,var233_y,var234_y,var235_y,var236_y,var246_y
0,17,3/20/2018,Complete,,,English,,1521528719_5ab0af8fe318c0.83350522,Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.3...,47.40.144.98,...,0,1,1,0,0,3,2,2,4,2
1,18,3/20/2018,Complete,,,English,https://s.cint.com/Consent/Collect/9ed49688-a2...,1521528831_5ab0afff9dad82.74757954,Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.3...,24.99.168.150,...,3,1,2,0,0,0,3,3,0,0
2,22,3/20/2018,Complete,,,English,,1521528941_5ab0b06d95d276.07495051,Mozilla/5.0 (Linux; Android 7.0; Moto G (4) Bu...,47.151.21.204,...,4,0,2,1,1,0,3,3,4,1
3,23,3/20/2018,Complete,,,English,https://s.cint.com/Consent/Collect/ed2140dc-95...,1521528964_5ab0b084821f35.43447059,Mozilla/5.0 (X11; CrOS x86_64 8872.73.0) Apple...,98.200.10.6,...,0,0,2,0,1,1,3,3,4,0
4,24,3/20/2018,Complete,,,English,https://s.cint.com/Consent/Collect/0529624f-1a...,1521528989_5ab0b09d6dc659.59506533,Mozilla/5.0 (iPhone; CPU iPhone OS 8_4 like Ma...,174.210.7.12,...,4,0,2,1,1,0,3,3,4,0


In [5]:
df.tail()

Unnamed: 0,Vrid,Vdatesub,Vstatus,Vcid,Vcomment,Vlanguage,Vreferer,Vsessionid,Vuseragent,Vip,...,var16_y,var217_y,var230_y,var231_y,var232_y,var233_y,var234_y,var235_y,var236_y,var246_y
1069,3242,4/16/2018,Complete,,,English,,1523896492_5ad4d0ac65ef18.67029642,Mozilla/5.0 (Linux; Android 7.0; SAMSUNG-SM-G9...,209.232.26.121,...,1,1,1,0,1,1,3,3,3,0
1070,3263,4/16/2018,Complete,,,English,,1523897491_5ad4d4930255d2.69575274,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,108.74.29.196,...,0,1,1,1,1,0,3,3,5,1
1071,3268,4/16/2018,Complete,,,English,,1523897381_5ad4d4256446c8.63615049,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6...,24.131.34.8,...,1,0,2,0,0,0,2,1,4,0
1072,3269,4/16/2018,Complete,,,English,,1523897873_5ad4d611858324.88779137,Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.3...,65.189.193.221,...,0,1,1,1,1,1,3,3,4,1
1073,3292,4/16/2018,Complete,,,English,,1523900952_5ad4e218661488.49655041,Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7...,192.252.75.3,...,0,2,2,2,1,0,3,3,3,0


In [6]:
# View dataset shape
df.shape

(1074, 254)

In [7]:
type(df)

pandas.core.frame.DataFrame

In [147]:
# Create list of variables for transformation
ds_list = ['var9_y', 'var11_y', 'var13_y', 'var16_y', 'var217_y', 'var234_y', 'var235_y', 'var236_y', 'var246_y']

In [148]:
# Create subset for transformation
ds = df[ds_list]
ds

index,var9_y,var11_y,var13_y,var16_y,var217_y,var234_y,var235_y,var236_y,var246_y
0,0,0,0,0,1,2,2,4,2
1,5,1,0,3,1,3,3,0,0
2,0,0,3,4,0,3,3,4,1
3,3,1,5,0,0,3,3,4,0
4,0,1,3,4,0,3,3,4,0
5,2,0,0,6,0,2,2,4,2
6,8,1,0,3,1,1,1,1,1
7,3,1,2,2,0,3,3,4,2
8,1,0,0,0,5,3,3,0,0
9,0,0,0,0,1,0,0,1,0


In [149]:
ds.dtypes

var9_y      int64
var11_y     int64
var13_y     int64
var16_y     int64
var217_y    int64
var234_y    int64
var235_y    int64
var236_y    int64
var246_y    int64
dtype: object

In [150]:
# View value counts of variables for transformation
# (NaN marks a code that a variable never takes)
ds.apply(pd.Series.value_counts)

value,var9_y,var11_y,var13_y,var16_y,var217_y,var234_y,var235_y,var236_y,var246_y
0,175,526.0,754.0,571.0,274.0,116.0,103.0,144.0,603.0
1,81,548.0,100.0,149.0,588.0,291.0,293.0,145.0,346.0
2,91,,119.0,154.0,113.0,54.0,52.0,66.0,125.0
3,175,,58.0,65.0,12.0,613.0,626.0,217.0,
4,62,,12.0,71.0,38.0,,,323.0,
5,94,,31.0,22.0,49.0,,,179.0,
6,88,,,42.0,,,,,
7,101,,,,,,,,
8,104,,,,,,,,
9,103,,,,,,,,
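
Reading the table above, the nine variables take 10, 2, 6, 7, 6, 4, 4, 6 and 3 distinct codes, so the one-hot array should have 10 + 2 + 6 + 7 + 6 + 4 + 4 + 6 + 3 = 48 columns, assuming no other codes occur in the data. The sketch below checks the same rule on a small made-up frame; `toy_df` and its columns `a` and `b` are hypothetical stand-ins for `ds`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for ds: two label-encoded nominal variables
toy_df = pd.DataFrame({
    "a": [0, 1, 2, 1],  # three distinct codes
    "b": [0, 1, 0, 1],  # two distinct codes
})

# Expected one-hot width: sum of distinct codes across all variables
expected_width = int(toy_df.nunique().sum())

# Actual width produced by the encoder
toy_width = OneHotEncoder().fit_transform(toy_df).shape[1]

print(expected_width, toy_width)
```

Running the same comparison on the real subset, with `ds` in place of `toy_df`, is a quick sanity check before clustering.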


In [151]:
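# Fit the encoder; each variable expands into one indicator column per observed category
encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; .toarray() converts it to a dense NumPy array
ohe_array = encoder.fit_transform(ds).toarray()
print(ohe_array)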

[[1. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 1. 0. 0.]
 [1. 0. 0. ... 0. 1. 0.]
 ...
 [0. 0. 0. ... 1. 0. 0.]
 [0. 0. 0. ... 0. 1. 0.]
 [0. 0. 0. ... 1. 0. 0.]]
