Since the Python notebooks I made for this week were mistakenly deleted, the annotations in this notebook are much shorter than usual. If you have any questions about anything presented here, please contact me. 

# Representing Categorical Features Numerically

Here is a dataset of entirely categorical features that we will use.

In [1]:
import pandas as pd

df = pd.DataFrame([
        ["male", "Europe", "Internet Explorer"],
        ["female", "Europe", "Firefox"],
        ["male", "Asia", "Internet Explorer"],
        ["male", "Asia", "Internet Explorer"],
        ["female", "North America", "Chrome"],
        ["female", "North America", "Firefox"]
    ], columns=["gender", "continent", "browser"])

df

| | gender | continent | browser |
|---|---|---|---|
| 0 | male | Europe | Internet Explorer |
| 1 | female | Europe | Firefox |
| 2 | male | Asia | Internet Explorer |
| 3 | male | Asia | Internet Explorer |
| 4 | female | North America | Chrome |
| 5 | female | North America | Firefox |


We can't pass this directly to an sklearn classifier; it has to be represented numerically first. The way this is traditionally done is to use a method known as one-hot encoding. It's also referred to as label binarization or dummy coding.

For a given categorical feature with $n$ possible labels, we make $n$ new features, one for each label. For each datapoint, the new feature that corresponds to its label takes a value of $1$. All the other new features take a value of $0$.
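The rule above can be sketched by hand before turning to sklearn. This is a minimal illustration, not how sklearn implements it; the variable names (`continents`, `labels`, `one_hot`) are chosen for this example, and sorting the labels is an assumption made so the column order is predictable.

```python
import numpy as np

# The continent column from the dataset above.
continents = ["Europe", "Europe", "Asia", "Asia", "North America", "North America"]

# One new feature per distinct label, in sorted order.
labels = sorted(set(continents))

# Entry (i, j) is 1 when datapoint i carries label j, and 0 otherwise.
one_hot = np.array([[int(value == label) for label in labels] for value in continents])

print(labels)
print(one_hot)
```

Every row contains exactly one 1, because each datapoint has exactly one label.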

The sklearn tool that does this for us is called ``LabelBinarizer``.

In [2]:
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()

I will now use the Label Binarizer instance I created, called ``lb``, on the ``continent`` feature to demonstrate one-hot encoding. 

In [3]:
new_cont = lb.fit_transform(df['continent'])

In [4]:
new_cont

array([[0, 1, 0],
       [0, 1, 0],
       [1, 0, 0],
       [1, 0, 0],
       [0, 0, 1],
       [0, 0, 1]])

We can see that the label binarizer created three new features (columns). Each column corresponds to one of the three possible labels: Europe, Asia, North America. In the first row of ``new_cont``, the second column has a value of 1, while the other two have values of 0. The second column represents Europe. Note that in the third row, the first column has a value of 1 and the other two have values of 0. The first column refers to Asia. Whatever your label was in the original data, you get a value of 1 in the column corresponding to that label in the one-hot encoding, and a 0 in the other columns.

You might use one-hot encoding on a feature with many labels. Once you have your new representation, it'll be hard to figure out which column corresponds to which label. Fortunately, the Label Binarizer object - in this case ``lb`` - stores that information. 

In [5]:
lb.classes_

array(['Asia', 'Europe', 'North America'], 
      dtype='<U13')

The first column is Asia, the second column is Europe, and the third is North America. 
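Because the fitted binarizer remembers its classes, the encoding can also be reversed. The sketch below rebuilds the continent column as a plain list (the names `continents`, `encoded`, and `decoded` are chosen for this example) and uses ``inverse_transform`` to recover the original labels.

```python
from sklearn.preprocessing import LabelBinarizer

# The continent column from the dataset above.
continents = ["Europe", "Europe", "Asia", "Asia", "North America", "North America"]

lb = LabelBinarizer()
encoded = lb.fit_transform(continents)

# Column j of the encoding corresponds to lb.classes_[j].
for j, label in enumerate(lb.classes_):
    print(j, label)

# inverse_transform maps each one-hot row back to its original label.
decoded = lb.inverse_transform(encoded)
print(decoded)
```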

Using all this information we can convert the one-hot encoding into a data frame.

In [6]:
new_cont = pd.DataFrame(new_cont, columns=lb.classes_, index=df.index)
new_cont

| | Asia | Europe | North America |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 |
| 5 | 0 | 0 | 1 |


And we can add it to the original data. 

In [7]:
df = df.join(new_cont)

In [8]:
df

| | gender | continent | browser | Asia | Europe | North America |
|---|---|---|---|---|---|---|
| 0 | male | Europe | Internet Explorer | 0 | 1 | 0 |
| 1 | female | Europe | Firefox | 0 | 1 | 0 |
| 2 | male | Asia | Internet Explorer | 1 | 0 | 0 |
| 3 | male | Asia | Internet Explorer | 1 | 0 | 0 |
| 4 | female | North America | Chrome | 0 | 0 | 1 |
| 5 | female | North America | Firefox | 0 | 0 | 1 |


Using the same process, I will now convert the browser feature into one-hot encoding. 

In [9]:
lb = LabelBinarizer()
new_browser = lb.fit_transform(df['browser'])
new_browser = pd.DataFrame(new_browser, columns=lb.classes_, index=df.index)
df = df.join(new_browser)
df

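| | gender | continent | browser | Asia | Europe | North America | Chrome | Firefox | Internet Explorer |
|---|---|---|---|---|---|---|---|---|---|
| 0 | male | Europe | Internet Explorer | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | female | Europe | Firefox | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | male | Asia | Internet Explorer | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | male | Asia | Internet Explorer | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | female | North America | Chrome | 0 | 0 | 1 | 1 | 0 | 0 |
| 5 | female | North America | Firefox | 0 | 0 | 1 | 0 | 1 | 0 |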


Let's look at gender, which in this case has two possible labels. Extrapolating what we've learned so far, you'd think that turning gender into a one-hot encoding would result in two columns, since there are two possible labels. One column for male, the other for female. 

However, when we use the label binarizer, we see different behavior.

In [10]:
lb = LabelBinarizer()
new_gender = lb.fit_transform(df['gender'])
new_gender

array([[1],
       [0],
       [1],
       [1],
       [0],
       [0]])

We only get one column, instead of two. With browser and continent, there were three labels, which got turned into three new features. However, with gender there are two labels and we only have one new feature. 

The reason for this is simple. In the case of two mutually exclusive labels - and *only* in this case - to state that a data point **has label A** is the logical equivalent of saying the data point **does not have label B**, and vice versa. Here, to say someone is **male** is equivalent to saying they're **not female**. As such, we only need one column to store two-label data. It indicates whether a data point *has* or *does not have* one of the labels.
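To see that a second column would carry no extra information, the sketch below builds it from the first. The names `is_male`, `is_female`, and `two_columns` are chosen for this example; the gender values are copied from the dataset above.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# The gender column from the dataset above.
gender = ["male", "female", "male", "male", "female", "female"]

lb = LabelBinarizer()
is_male = lb.fit_transform(gender)  # shape (6, 1); 1 means "male"

# The column a two-column encoding would add is fully determined by the first.
is_female = 1 - is_male
two_columns = np.hstack([is_female, is_male])

print(two_columns)
```

Each row of `two_columns` sums to 1, so knowing either column is enough to reconstruct the other.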

In [11]:
lb.classes_

array(['female', 'male'], 
      dtype='<U6')

Here, we can see that female is represented by 0 and male by 1. This follows from the order of ``lb.classes_``, which sklearn sorts alphabetically: the second class, male, is the one coded as 1. Let's add this to the original data.

In [12]:
new_gender = pd.DataFrame(new_gender, columns=['bgender'], index=df.index)

In [13]:
df = df.join(new_gender)

In [14]:
df

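| | gender | continent | browser | Asia | Europe | North America | Chrome | Firefox | Internet Explorer | bgender |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | male | Europe | Internet Explorer | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 1 | female | Europe | Firefox | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | male | Asia | Internet Explorer | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | male | Asia | Internet Explorer | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4 | female | North America | Chrome | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 5 | female | North America | Firefox | 0 | 0 | 1 | 0 | 1 | 0 | 0 |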


We have successfully turned our categorical features into numerical features. 

Using pandas subsetting we can extract only the numerical features to send to a machine learning algorithm.

In [15]:
to_algo = df[["Asia", "Europe", "North America", "Chrome", "Firefox", "Internet Explorer", "bgender"]]
to_algo

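| | Asia | Europe | North America | Chrome | Firefox | Internet Explorer | bgender |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 5 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |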
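The fit, wrap, and join steps were repeated once per feature above. One way to avoid the repetition is a small loop. The helper below, `one_hot_frame`, is a hypothetical name introduced for this sketch and is not part of pandas or sklearn. It assumes every listed column has at least two distinct labels, and it names a two-label column after the label coded as 1 (so `gender_male` instead of `bgender`).

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

def one_hot_frame(frame, columns):
    """Hypothetical helper: binarize each listed column and join the results."""
    pieces = []
    for column in columns:
        lb = LabelBinarizer()
        encoded = lb.fit_transform(frame[column])
        if encoded.shape[1] == 1:
            # Two-label feature: a single 0/1 column, named after the label coded as 1.
            names = [f"{column}_{lb.classes_[1]}"]
        else:
            # Three or more labels: one column per label.
            names = list(lb.classes_)
        pieces.append(pd.DataFrame(encoded, columns=names, index=frame.index))
    return pd.concat(pieces, axis=1)

df = pd.DataFrame([
    ["male", "Europe", "Internet Explorer"],
    ["female", "Europe", "Firefox"],
    ["male", "Asia", "Internet Explorer"],
    ["male", "Asia", "Internet Explorer"],
    ["female", "North America", "Chrome"],
    ["female", "North America", "Firefox"],
], columns=["gender", "continent", "browser"])

numeric = one_hot_frame(df, ["continent", "browser", "gender"])
print(numeric)
```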


# Representing Raw Text Numerically

To represent raw text numerically, we use something called the **vector space model**. Please watch the lecture "Representing text as Numbers" on Canvas for a detailed explanation of what this is.

In [16]:
texts = [
"We went to the bank to get some money. Usually, there is a lot of money there. But today, the bank had no money.",
"At the bank, there are a lot of people who work with money. The store money, count money, and try to make more money from their money.",
"We like to take walks along the bank, because the view of the water is beautiful. But today, unfortunately, the water had overran the bank and so we couldn't walk.",
"We drove our boat through the water next to the left bank of the river. There were people standing on the bank waving at us, and some were out in the water swimming."
]

To make a vector space model representation of a set of texts, we use ``CountVectorizer``.

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

In [18]:
cv = CountVectorizer()

In [19]:
vectors = cv.fit_transform(texts)
vectors

<4x59 sparse matrix of type '<class 'numpy.int64'>'
	with 87 stored elements in Compressed Sparse Row format>

The output is a sparse matrix. Most sklearn tools accept it directly, including ``.fit``, ``cross_val_score``, and ``train_test_split``. For the purposes of this demonstration only, I am converting it into a NumPy array using ``toarray``.
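As a rough illustration of passing a sparse matrix straight to an estimator, the sketch below fits a naive Bayes classifier without calling ``toarray``. The mini-corpus `docs` and the labels `finance` and `river` are invented for this example and are not part of the course data. ``MultinomialNB`` is just one classifier that is known to accept sparse input.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical mini-corpus and labels, invented for this sketch.
docs = [
    "the bank had no money",
    "count the money at the bank",
    "water overran the river bank",
    "a boat on the water by the bank",
]
labels = ["finance", "finance", "river", "river"]

X = CountVectorizer().fit_transform(docs)  # scipy sparse matrix
print(type(X).__name__, X.shape, X.nnz)

# The sparse matrix is passed to fit and predict as-is; no toarray() needed.
model = MultinomialNB().fit(X, labels)
print(model.predict(X))
```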

In [20]:
vectors = vectors.toarray()

In [21]:
vectors[0, :]

array([0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 3, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 2, 0, 2, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0], dtype=int64)

Each column of ``vectors`` corresponds to a word in the text corpus. Each row corresponds to a text. This is the row corresponding to the first text. It shows that it had zero instances of word 0. It had zero instances of word 1. It had 2 instances of word 4.
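The counts in this row can be checked by hand. The sketch below approximates the default tokenization of ``CountVectorizer`` (lowercase the text, then keep tokens of two or more word characters) and counts the words of the first text with ``collections.Counter``. It is an illustration of the idea, not sklearn's actual code path.

```python
import re
from collections import Counter

# The first text from the corpus above.
text = ("We went to the bank to get some money. Usually, there is a lot of "
        "money there. But today, the bank had no money.")

# Lowercase, then keep tokens made of two or more word characters.
tokens = re.findall(r"\b\w\w+\b", text.lower())
counts = Counter(tokens)

# "a" is dropped because it is a single character.
print(counts["bank"], counts["money"], counts["a"])  # prints: 2 3 0
```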

To know which column corresponds to which word, the Count Vectorizer object has an attribute called ``.vocabulary_``. 

In [22]:
cv.vocabulary_

{'along': 0,
 'and': 1,
 'are': 2,
 'at': 3,
 'bank': 4,
 'beautiful': 5,
 'because': 6,
 'boat': 7,
 'but': 8,
 'couldn': 9,
 'count': 10,
 'drove': 11,
 'from': 12,
 'get': 13,
 'had': 14,
 'in': 15,
 'is': 16,
 'left': 17,
 'like': 18,
 'lot': 19,
 'make': 20,
 'money': 21,
 'more': 22,
 'next': 23,
 'no': 24,
 'of': 25,
 'on': 26,
 'our': 27,
 'out': 28,
 'overran': 29,
 'people': 30,
 'river': 31,
 'so': 32,
 'some': 33,
 'standing': 34,
 'store': 35,
 'swimming': 36,
 'take': 37,
 'the': 38,
 'their': 39,
 'there': 40,
 'through': 41,
 'to': 42,
 'today': 43,
 'try': 44,
 'unfortunately': 45,
 'us': 46,
 'usually': 47,
 'view': 48,
 'walk': 49,
 'walks': 50,
 'water': 51,
 'waving': 52,
 'we': 53,
 'went': 54,
 'were': 55,
 'who': 56,
 'with': 57,
 'work': 58}
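Since ``vocabulary_`` maps words to column indices, going the other way (from a column index to its word) takes one extra step. A minimal sketch, using a two-sentence corpus invented for this example and the hypothetical names `index_to_word` and `first_doc`:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, invented for this sketch.
docs = ["the bank had no money", "the water overran the bank"]

cv = CountVectorizer()
vectors = cv.fit_transform(docs).toarray()

# vocabulary_ maps word -> column index; invert it to map column index -> word.
index_to_word = {index: word for word, index in cv.vocabulary_.items()}

# Report the non-zero counts of the first document by word rather than by column number.
first_doc = {index_to_word[j]: int(count) for j, count in enumerate(vectors[0]) if count > 0}
print(first_doc)
```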

# TF-IDF

Another way to represent texts numerically is not to *count* instances of a word, but instead to calculate their TF-IDF weight. TF-IDF weighting is a way to put more emphasis on words that are characteristic of a document. For example, a given document may have 150 instances of the word *the* and 20 instances of the word *astrophysics*. According to the count vector method described above, the word *the* is more important than *astrophysics*, because it occurs more often. However, the word *the* also appears in *many other documents*, while *astrophysics* does not. So intuitively, *astrophysics* should be a more important word for this document than *the*. TF-IDF will therefore weight the words such that *astrophysics* has a higher weight for the document than *the*.

Please see the video lecture on Canvas for a far more detailed explanation of TF-IDF. 

Once you understand the difference between count vectorizing and TF-IDF vectorizing, the way this is done in sklearn is essentially the same. 

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer()
vectors = tv.fit_transform(texts)
vectors = vectors.toarray()
vectors[0,:]

array([ 0.        ,  0.        ,  0.        ,  0.        ,  0.24006301,
        0.        ,  0.        ,  0.        ,  0.18134668,  0.        ,
        0.        ,  0.        ,  0.        ,  0.23001526,  0.18134668,
        0.        ,  0.18134668,  0.        ,  0.        ,  0.18134668,
        0.        ,  0.54404003,  0.        ,  0.        ,  0.23001526,
        0.12003151,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.18134668,  0.        ,
        0.        ,  0.        ,  0.        ,  0.24006301,  0.        ,
        0.29363153,  0.        ,  0.24006301,  0.18134668,  0.        ,
        0.        ,  0.        ,  0.23001526,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.14681576,  0.23001526,
        0.        ,  0.        ,  0.        ,  0.        ])

As with count vectors, each element in the vector corresponds to a word in the vocabulary. However, instead of indicating *how many times this word appeared in this text*, the vector element represents the *TF-IDF weighting of that word*.
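For readers who want to see where such weights come from, the sketch below reproduces them by hand on a small corpus invented for this example. It assumes the default settings of ``TfidfVectorizer``: the term frequency is the raw count, the inverse document frequency is smoothed as $\ln\frac{1 + n}{1 + \mathrm{df}} + 1$ (with $n$ documents and $\mathrm{df}$ the number of documents containing the word), and each row is then scaled to unit Euclidean length. Other settings would give different numbers.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus, invented for this sketch.
docs = [
    "the bank had no money",
    "the bank had money money",
    "the water overran the bank",
]

counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]

# Document frequency: the number of documents that contain each word.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (assumed default of TfidfVectorizer).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the raw counts by idf, then scale each row to unit Euclidean length.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

A word that appears in every document, such as *the* here, gets the smallest possible idf of 1, while rarer words get a larger multiplier.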