# 6. Preprocessing part 2

So far we've been doing a lot of preprocessing on numeric data.

But you can also imagine data like this, where we have classes such as low, medium, and high risk. It would be nice if we could preprocess this text data so that it becomes numeric as well.


In [2]:
import numpy as np

arr = np.array(["low", "low", "high", "medium"]).reshape(-1, 1)
arr


array([['low'],
       ['low'],
       ['high'],
       ['medium']], dtype='<U6')

The most common technique for that is the one-hot encoder.

What this encoder does is take in an array of text or categories and transform it into something that is numeric.

In [3]:
from sklearn.preprocessing import OneHotEncoder


Now, if you just run this as is, you get a data structure known as a sparse matrix.

In [5]:
enc = OneHotEncoder()
enc.fit_transform(arr)


<4x3 sparse matrix of type '<class 'numpy.float64'>'
	with 4 stored elements in Compressed Sparse Row format>

There's a setting we can change, though, so that the sparse output is turned off, and then we can actually see what is inside of it.

In [6]:
# Ask for a dense array. On scikit-learn >= 1.2 this argument is named sparse_output.
enc = OneHotEncoder(sparse=False)
enc.fit_transform(arr)


array([[0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.]])

Note that the first two rows are both low, and indeed they share the same column. Then we see high, which gets its own column, and medium below it, which gets another. So there is a correspondence between categories and columns (`[0., 1., 0.]` -> `'low'`), and that is useful.

The most common use case for this: if this is, let's say, the class that you would like to predict, then this numeric representation is the y array that you pass to scikit-learn, because it is something scikit-learn can train on numerically.
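If you'd rather not read the correspondence off by eye, the fitted encoder can report it. This is a minimal sketch on the same `arr` as above; it calls `.toarray()` on the sparse result rather than passing a dense-output argument, which is just a choice made here to keep the snippet independent of the scikit-learn version.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

arr = np.array(["low", "low", "high", "medium"]).reshape(-1, 1)

enc = OneHotEncoder()
# .toarray() turns the sparse matrix into a regular dense NumPy array.
encoded = enc.fit_transform(arr).toarray()

# categories_ holds one array per input column; its order is the column order.
categories = enc.categories_[0]
print(categories)

# inverse_transform maps the one-hot rows back to the original strings.
decoded = enc.inverse_transform(encoded)
print(decoded.ravel())
```

The categories come out sorted, which is why high ends up in the first column even though low appears first in the data.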

There is some behavior to be aware of, though. It's not super relevant if you're generating labels, but it is relevant if you're using this one-hot encoder to encode the data that will predict the label.

Let's say I grab the encoder now and I ask it to transform something.

What I'm asking it to transform is something that it has never seen before. Notice that I'm asking it to encode "zero", but "zero" does not appear in the set (the `np.array`) that we performed the fit on. So we might wonder what's going to happen here.

In [None]:
# enc.transform([["zero"]])


ValueError: Found unknown categories ['zero'] in column 0 during transform

Well, we get a big fat ValueError. It says: `Found unknown categories ['zero']`.

So essentially it is telling us: you're not allowed to give me data that I've never seen before.

We can change the setting for this, though. At the moment the `handle_unknown` parameter is set to `"error"`, but we can set it to `"ignore"`.

And now if I run this, it's not going to give me an error.

In [8]:
# Unseen categories are now encoded as all zeros (sparse_output=False on scikit-learn >= 1.2).
enc = OneHotEncoder(sparse=False, handle_unknown="ignore")
enc.fit_transform(arr)

enc.transform([["zero"]])


array([[0., 0., 0.]])

What it's saying is that these are all zeros. Another way of putting it: "zero" is neither low, high, nor medium, so it can just give us this all-zero array back.

One final thing to note: `handle_unknown="ignore"` is a very useful setting if you're generating your X matrix, so to say. But you don't want it if you're generating your y labels, because those are things you want to have very strict control over.
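To make the X-matrix case a bit more concrete, here is a rough sketch of the encoder sitting in front of a model. The training data, the labels, and the choice of `LogisticRegression` are all made up for illustration; the only point is that an unseen category no longer stops the pipeline from predicting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up training data: one categorical feature and a binary label.
X_train = np.array(["low", "low", "high", "medium", "high", "medium"]).reshape(-1, 1)
y_train = np.array([0, 0, 1, 1, 1, 0])

pipe = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(),
)
pipe.fit(X_train, y_train)

# "zero" was never seen during fit. It is encoded as an all-zero row,
# so predict still returns a label instead of raising a ValueError.
X_new = np.array([["low"], ["zero"]])
preds = pipe.predict(X_new)
print(preds)
```

For the "zero" row the model only has its intercept to go on, so the prediction there is not very meaningful. That is the trade-off you accept with `"ignore"`.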

In this series of videos I've shown you some of the preprocessing steps that are available, but a very convenient way to play around with more of them and to get a better understanding is to go to a website called drawdata.xyz. Full disclaimer: it's a website that I made. It allows you to quite literally make a drawing of a little bit of data, and that way you can play with it from your Jupyter notebook. Playing with preprocessors is the best way to learn about them.

Once you've drawn the data set that you're interested in, you can click the download CSV button to download the file locally, but you can also copy the CSV to your clipboard.

You can then type `pandas.read_clipboard()` and this will read from your clipboard. The only thing you have to do manually is set the separator to a comma (`pd.read_clipboard(sep=",")`), because I think the clipboard reader typically expects data copied from Excel. What I can now do is just run this and, lo and behold, the data set that I was just drawing is available to me here. This is a really nice way to get a little bit playful with scikit-learn preprocessing steps and pipelines.
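As a rough sketch of that workflow: the snippet below uses a small hard-coded CSV string as a stand-in for the clipboard, so it runs anywhere. The column names `x`, `y`, `z` and the values are assumptions made for this example, so check what your own drawing actually exports. `StandardScaler` is just one preprocessor picked to try out.

```python
import io

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-in for the clipboard contents. Column names and values are made up.
csv_text = "x,y,z\n10.0,200.0,a\n20.0,400.0,a\n30.0,600.0,b\n"

# With the CSV on your clipboard you would call: pd.read_clipboard(sep=",")
df = pd.read_csv(io.StringIO(csv_text), sep=",")

# Scale the two numeric columns to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(df[["x", "y"]])
print(X_scaled)
```

From here you can swap in any other preprocessor and look at what it does to the points you drew.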
