![](../img/330-banner.png)

Lecture 6: `sklearn` `ColumnTransformer` and Text Features
------------
UBC 2022-23 W2

Instructor: Amir Abdi
 - Office Hours: Mondays 5-6 (or 5-7)

<br><br><br>

iclicker link: https://join.iclicker.com/EMMJ   
<img src="img_aa/iclicker_qr_code.png" height="300" width="300"> 


## Announcements, and LO

### Announcements

- Homework 3 is due Feb 1, 11:59pm
- We're working on Homework 2 grading. The grades will be released later this week.  

### Learning outcomes 

From this lecture, you will be able to 

- use `ColumnTransformer` to build all our transformations together into one object and use it with `sklearn` pipelines;  
- define `ColumnTransformer` where transformers contain more than one step;
- explain `handle_unknown="ignore"` hyperparameter of `scikit-learn`'s `OneHotEncoder`;
- explain `drop="if_binary"` argument of `OneHotEncoder`;
- **identify when it's appropriate to apply ordinal encoding vs one-hot encoding;**
- **explain strategies to deal with categorical variables with too many categories;**

Text Data:
- explain why **text** data needs a different treatment than categorical variables;
- use `scikit-learn`'s `CountVectorizer` to encode text data;
- explain different hyperparameters of `CountVectorizer`.
- incorporate text features in a machine learning pipeline

## Legends

    
| <img src="https://upload.wikimedia.org/wikipedia/commons/f/f8/This_is_the_photo_of_Arthur_Samuel.jpg" width="100"> | <img src="http://www.cs.cmu.edu/~tom/TomHead2-6-22-22.jpg" width="100">  | <img src="https://upload.wikimedia.org/wikipedia/commons/4/49/John_McCarthy_Stanford.jpg" width="100"> | <img src="https://datascience.columbia.edu/wp-content/uploads/2020/08/Vapnik_web.png" width="100"> |
| :-----------: | :-----------: | :-----------: | :-----------: |
| Arthur Samuel | Tom Mitchell | John McCarthy | Vladimir N. Vapnik |
| 1901 – 1990 | 1951 – present | 1927 – 2011 | 1936 – present |
| First computer learning program | 1997 ML textbook, CMU professor | Co-coined the term AI, Lisp,<br> time-sharing, garbage collection | SVM |



<img src="https://upload.wikimedia.org/wikipedia/commons/a/a1/Alan_Turing_Aged_16.jpg" width="300">

**Alan Turing**  
**1912 – 1954 (42 years)**

- Known for: 
  - Turing Test (1950): a test of a machine's ability to exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human.
  - Turing Machine: an abstract machine that manipulates symbols on an infinite strip of tape and is capable of implementing any computer program.
  - and many more that are beyond **my limited scope of knowledge**...
- Turing was prosecuted in 1952 for homosexual acts. Turing did not deny the charges.
  - An official public apology was made on behalf of the British government for "the appalling way Turing was treated". 


Other places where you might hear his name:
- Turing Award (the "Nobel Prize of Computing")
- Turing Completeness
- Alan Turing law: Refers to a 2017 law in the United Kingdom that retroactively pardoned men cautioned or convicted under historical legislation that outlawed homosexual acts.


> Turing “took his time finding the right words,” and a BBC radio producer had called Turing a very difficult person to interview for that reason.

<br><br><br><br>

# sklearn's [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)

**Imports**

In [1]:
import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import HTML

sys.path.append("../code/.")
pd.set_option("display.max_colwidth", 200)

from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


- In most applications, some features are categorical, some are continuous, some are binary, and some are ordinal. 

- When we want to develop supervised machine learning pipelines on real-world datasets, very often we want to **apply different transformations to different columns**. 

- Enter `sklearn`'s `ColumnTransformer`!! 

- Let's look at a toy example: 

In [None]:
df = pd.read_csv("../data/quiz2-grade-toy-col-transformer.csv")
df

In [None]:
df.info()

## Transformations on the toy data

In [None]:
df.head()

- Scaling on numeric features
- One-hot encoding on the categorical feature `major` and binary feature `enjoy_course`
- Ordinal encoding on the ordinal feature `class_attendance`
- Imputation on the `lab2` feature
- None on the `ml_experience` feature

In [None]:
X = df.drop(columns=["quiz2"])
y = df["quiz2"]
X.columns

In [None]:
X.head()

In [None]:
numeric_feats = ["university_years", "lab1", "lab3", "lab4", "quiz1"]  # apply scaling
categorical_feats = ["major"]  # apply one-hot encoding
passthrough_feats = ["ml_experience"]  # do not apply any transformation

# Not needed, as drop is the default behaviour
drop_feats = [
    "lab2",
    "class_attendance",
    "enjoy_course",
]  # do not include these features in modeling

For simplicity, let's only focus on scaling and one-hot encoding first. 

### `ColumnTransformer` Interface

- Each transformation is specified by a name, a transformer object, and the columns this transformer should be applied to. 

In [None]:
from sklearn.compose import ColumnTransformer

# -------- New Class ------------
ct = ColumnTransformer(
    [
        ("MyScaling", StandardScaler(), numeric_feats),
        ("MyOnehot", OneHotEncoder(sparse_output=False), categorical_feats),  # `sparse=False` before scikit-learn 1.2
        ("MyPassthrough", "passthrough", passthrough_feats),
        # ("MyDrop", "drop", drop_feats),  # not needed, drop is the default behaviour
    ]
)
# -------------------------------
ct

### `make_column_transformer` Interface

- Similar to `make_pipeline` syntax, there is a convenient `make_column_transformer` syntax. 
- The syntax automatically names each step based on its class. 
- We'll be mostly using this syntax. 

In [None]:
from sklearn.compose import make_column_transformer

ct = make_column_transformer(    
    (StandardScaler(), numeric_feats),  # scaling on numeric features
    ("passthrough", passthrough_feats),  # no transformations on the binary features    
    (OneHotEncoder(), categorical_feats),  # OHE on categorical features
    # ("drop", drop_feats),   # not needed, drop is the default behaviour
)
ct

In [None]:
ct.fit(X)
transformed = ct.transform(X)

# if we had test data
# transform_X_test = ct.transform(X_test)

# Alternatively, you could have called:
# transformed = ct.fit_transform(X)

In [None]:
X.shape

In [None]:
transformed.shape

In [None]:
type(X)

In [None]:
type(transformed)

- When we `fit_transform`, each transformer is applied to its specified columns and the results of the transformations are **concatenated horizontally**. 
- A big advantage here is that we build all our transformations together into one object, and that way we're sure we do the same operations to all splits of the data.
- Otherwise we might, for example, do the OHE on both train and test but forget to scale the test data.
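
The point about treating every split identically can be sketched as follows. This is a minimal illustration on made-up data (the dataframes `train` and `test` and the name `demo_ct` are assumptions for this sketch, not part of the lecture dataset): the transformer is fit once on the training split and then reused unchanged on the test split.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the values are made up for this sketch.
train = pd.DataFrame(
    {"lab1": [90.0, 70.0, 80.0, 60.0], "major": ["CS", "Math", "CS", "Math"]}
)
test = pd.DataFrame({"lab1": [75.0, 85.0], "major": ["Math", "CS"]})

demo_ct = make_column_transformer(
    (StandardScaler(), ["lab1"]),
    (OneHotEncoder(), ["major"]),
)

# Learn the scaling statistics and the categories from the training split only ...
train_t = demo_ct.fit_transform(train)
# ... then apply exactly the same fitted transformations to the test split.
test_t = demo_ct.transform(test)

print(train_t.shape, test_t.shape)
```

Because scaling and one-hot encoding live in one object, there is no way to apply one of them to the test split and forget the other.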

<br><br><br><br><br><br><br>
**[study at home]**

### Convert `numpy.ndarray` back to `DataFrame`
<br><br>
Note that the returned object is not a dataframe, so there are no column names.

In [None]:
transformed



- How can we view our transformed data as a dataframe? 
- We are adding more columns. 
- So the original columns won't directly map to the transformed data. 
- Let's create column names for the transformed data. 

In [None]:
column_names = (
    numeric_feats
    + passthrough_feats    
    + ct.named_transformers_["onehotencoder"].get_feature_names_out().tolist()
)
column_names

In [None]:
ct.named_transformers_

<br><br>
Note that the order of the columns in the transformed data depends upon the order of the features we pass to the `ColumnTransformer` and can be different than the order of the features in the original dataframe.  
<br><br>

In [None]:
pd.DataFrame(transformed, columns=column_names)

<br><br><br><br><br><br>
### Summary

<br>

![](../img/column-transformer.png)
<!-- <img src='./img/column-transformer.png' width="1500"> -->

[Adapted from here.](https://amueller.github.io/COMS4995-s20/slides/aml-04-preprocessing/#37)

## Training models with transformed data
- We can now pass the `ColumnTransformer` object as a step in a **pipeline**. 

In [None]:
# Same as before, just passing ColumnTransformer (ct) to pipeline
pipe = make_pipeline(ct, SVC())
pipe.fit(X, y)

# I don't care about results on train data; this is a toy problem
pipe.predict(X)

In [None]:
pipe

<br><br>

## ❓❓ Questions for you 

### (iClicker) Exercise 6.1 

**iClicker cloud join link: https://join.iclicker.com/EMMJ**

**Select all of the following statements which are TRUE.**

1. You could carry out cross-validation by passing a `ColumnTransformer` object directly to `cross_validate`. 
2. After applying column transformer, the order of the columns in the transformed data has to be the same as the order of the columns in the original data. 
3. After applying a column transformer, the transformed data is always going to be of different shape than the original data. 
4. When you call `fit_transform` on a `ColumnTransformer` object, you get a numpy ndarray. 

Answers True: 4

<br><br><br><br>

---------------
<br><br><br><br><br><br>
**[Study on your own - Random details on how to set the output type of ScikitLearn]**

### `sklearn` `set_config`

In [None]:
from sklearn import set_config

In [None]:
set_config(display="text")
ct

In [None]:
set_config(display="diagram")
ct

<br><br><br><br>

-----------------

### Multiple transformations in a transformer with pipeline

We can nest a pipeline inside a transformer

<br><br>

Recall that `lab2` has missing values. 


In [None]:
X.head(10)

- So we would like to apply more than one transformation on it: imputation and scaling.  
- We can treat `lab2` separately, but we can also include it into `numeric_feats` and apply both transformations on all numeric columns.

In [None]:
numeric_feats = [
    "university_years",
    "lab1",
    "lab2",
    "lab3",
    "lab4",
    "quiz1",
]  # apply scaling
categorical_feats = ["major"]  # apply one-hot encoding
passthrough_feats = ["ml_experience"]  # do not apply any transformation

# Not needed, default behaviour
drop_feats = ["class_attendance", "enjoy_course"]

<br><br><br>
**To apply more than one transformation we can define a pipeline inside a column transformer to chain different transformations.**
<br><br><br>

In [None]:
ct = make_column_transformer(
  # ---------- important -------------------
    (      
        make_pipeline(SimpleImputer(), StandardScaler()),
        numeric_feats,
    ),
    # -------------------------------------------
    ("passthrough", passthrough_feats),  # no transformations on the binary features    
    (OneHotEncoder(), categorical_feats),  # OHE on categorical features
)

In [None]:
ct

In [None]:
X_transformed = ct.fit_transform(X)

In [None]:
column_names = (
    numeric_feats
    + passthrough_feats    
    + ct.named_transformers_["onehotencoder"].get_feature_names_out().tolist()
)
column_names

In [None]:
pd.DataFrame(X_transformed, columns=column_names)

<br><br>

<br><br>

### Incorporating ordinal feature `class_attendance` 

- The `class_attendance` column is different than the `major` column in that there is some ordering of the values. 
    - Excellent > Good > Average > Poor    

In [None]:
X.head()

Let's try applying `OrdinalEncoder` on this column. 

In [None]:
X_toy = X[["class_attendance"]]
enc = OrdinalEncoder()
enc.fit(X_toy)
X_toy_ord = enc.transform(X_toy)
df = pd.DataFrame(
    data=X_toy_ord,
    columns=["class_attendance_enc"],
    index=X_toy.index,
)

In [None]:
pd.concat([X_toy, df], axis=1).head(10)

- What's the problem here? 
    - The encoder doesn't know the order. 
- We can examine unique categories manually, order them based on our intuitions, and then provide this human knowledge to the transformer. 

What are the unique categories of `class_attendance`? 

In [None]:
X_toy["class_attendance"].unique()

**Let's order them manually.**

In [None]:
class_attendance_levels = ["Poor", "Average", "Good", "Excellent"]

Let's make sure that we have included all categories in our manual ordering.  

In [None]:
assert set(class_attendance_levels) == set(X_toy["class_attendance"].unique())

In [None]:
oe = OrdinalEncoder(categories=[class_attendance_levels], dtype=int)
oe.fit(X_toy[["class_attendance"]])
ca_transformed = oe.transform(X_toy[["class_attendance"]])
df = pd.DataFrame(
    data=ca_transformed, columns=["class_attendance_enc"], index=X_toy.index
)
print(oe.categories_)
pd.concat([X_toy, df], axis=1).head(10)

The encoded categories are looking better now! 

#### More than one ordinal column?

- We can pass the manually ordered categories when we create an `OrdinalEncoder` object as a list of lists. 
- If you have more than one ordinal column
    - manually create a list of ordered categories for each column
    - pass a list of lists to `OrdinalEncoder`, where each inner list is the manually ordered list of categories for the corresponding ordinal column. 
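
The list-of-lists idea can be sketched as follows. The second ordinal column, `satisfaction`, and its levels are made up for illustration; only `class_attendance` comes from the toy data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: `satisfaction` is a made-up second ordinal column.
demo_df = pd.DataFrame(
    {
        "class_attendance": ["Good", "Poor", "Excellent", "Average"],
        "satisfaction": ["high", "low", "medium", "high"],
    }
)

attendance_levels = ["Poor", "Average", "Good", "Excellent"]
satisfaction_levels = ["low", "medium", "high"]

# One inner list per column, in the same order as the columns being encoded.
demo_oe = OrdinalEncoder(
    categories=[attendance_levels, satisfaction_levels], dtype=int
)
encoded = demo_oe.fit_transform(demo_df[["class_attendance", "satisfaction"]])
print(encoded)
```

Each column is encoded against its own inner list, so the integer `2` means "Good" in the first column and "high" in the second.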
    

Now let's incorporate ordinal encoding of `class_attendance` in our column transformer. 

In [None]:
X

In [None]:
numeric_feats = [
    "university_years",
    "lab1",
    "lab2",
    "lab3",
    "lab4",
    "quiz1",
]  # apply scaling
categorical_feats = ["major"]  # apply one-hot encoding
ordinal_feats = ["class_attendance"]  # apply ordinal encoding
passthrough_feats = ["ml_experience"]  # do not apply any transformation
drop_feats = ["enjoy_course"]  # do not include these features

In [None]:
ct = make_column_transformer(
    (
        make_pipeline(SimpleImputer(), StandardScaler()),
        numeric_feats,
    ),
    (
        OrdinalEncoder(categories=[class_attendance_levels], dtype=int),
        ordinal_feats,
    ),  # Ordinal encoding on ordinal features
    ("passthrough", passthrough_feats),  # no transformations on the binary features
    (OneHotEncoder(), categorical_feats),  # OHE on categorical features    
)

In [None]:
ct

In [None]:
X_transformed = ct.fit_transform(X)

In [None]:
column_names = (
    numeric_feats
    + ordinal_feats
    + passthrough_feats    
    + ct.named_transformers_["onehotencoder"].get_feature_names_out().tolist()
)
column_names

In [None]:
pd.DataFrame(X_transformed, columns=column_names)

<br><br>

### Dealing with unknown categories

Let's create a pipeline with the column transformer and pass it to `cross_validate`. 

In [None]:
pipe = make_pipeline(ct, SVC())
pipe

In [None]:
# This will fail
# scores = cross_validate(pipe, X, y, return_train_score=True)

- What's going on here??
- Let's look at the error message:
`ValueError: Found unknown categories ['Biology'] in column 0 during transform`

In [None]:
X["major"].value_counts()

- **There is only one instance of Biology.**
- During **cross-validation**, this is getting put into the validation split.
- By default, `OneHotEncoder` throws an error because you might want to know about this.

Simplest fix:
- Pass `handle_unknown="ignore"` argument to `OneHotEncoder`
- An unknown category is then encoded as all zeros in the one-hot columns of that feature. 
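
A minimal sketch of this behaviour on made-up data (the frames `train_major` and `valid_major` are assumptions for this example): "Biology" is absent when the encoder is fit, so at transform time it gets no column of its own.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical train/validation split where "Biology" never appears in training.
train_major = pd.DataFrame({"major": ["CS", "Math", "CS", "Physics"]})
valid_major = pd.DataFrame({"major": ["Math", "Biology"]})

demo_ohe = OneHotEncoder(handle_unknown="ignore")
demo_ohe.fit(train_major)

# The output is sparse by default; convert to dense only to print it.
encoded_valid = demo_ohe.transform(valid_major).toarray()
print(demo_ohe.categories_[0])  # categories learned from the training split
print(encoded_valid)  # the "Biology" row contains only zeros
```

Note that every unknown category ends up with the same all-zero encoding, so the model cannot tell two different unknown majors apart.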

In [None]:
ct = make_column_transformer(
    (
        make_pipeline(SimpleImputer(), StandardScaler()),
        numeric_feats,
    ),
    (
        OrdinalEncoder(categories=[class_attendance_levels], dtype=int),
        ordinal_feats,
    ),
    ("passthrough", passthrough_feats),
    (
        OneHotEncoder(handle_unknown="ignore"), # --> new code
        categorical_feats,
    ),
)

In [None]:
ct

In [None]:
pipe = make_pipeline(ct, SVC())
scores = cross_validate(pipe, X, y, cv=5, return_train_score=True)
scores['test_score'].mean()

- With this approach, all unknown categories will be represented with all zeros and cross-validation is running OK now. 

Ask yourself the following questions when you work with categorical variables   
- Do you want this behaviour? 
- Are you expecting to get many unknown categories? Do you want to be able to distinguish between them?

**Learning about all possible categories of a given feature doesn't break the Golden Rule** because:
- The train and test data are supposed to originate from the **same distribution**.
- This applies when the feature has a fixed number of categories, for example provinces in Canada or majors taught at UBC. We know the categories in advance, so this is one of the cases where it might be OK to look beyond the training split and get a list of all possible values for the categorical variable. 
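
If the full set of categories is known in advance, one option is to hand it to the encoder through its `categories` argument. The sketch below assumes a made-up list `all_majors`; it is not the actual list of majors in the toy data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical list of every major we expect to see; made up for this sketch.
all_majors = ["Biology", "CS", "Math", "Physics"]

train_major = pd.DataFrame({"major": ["CS", "Math", "CS", "Physics"]})
valid_major = pd.DataFrame({"major": ["Biology"]})

# `categories` takes one list per column, just like `OrdinalEncoder`.
demo_ohe = OneHotEncoder(categories=[all_majors])
demo_ohe.fit(train_major)

# "Biology" now has its own column even though it was absent from training.
encoded_valid = demo_ohe.transform(valid_major).toarray()
print(encoded_valid)
```

Compared with `handle_unknown="ignore"`, this keeps a dedicated column for every known category, although the model still sees no training examples for the ones missing from the training split.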

<br><br>

In [None]:
X["enjoy_course"].head()

In [None]:
ohe_enc = OneHotEncoder(drop="if_binary", dtype=int, sparse_output=False)  # `sparse=False` before scikit-learn 1.2
ohe_enc.fit(X[["enjoy_course"]])
transformed = ohe_enc.transform(X[["enjoy_course"]])
df = pd.DataFrame(data=transformed, columns=["enjoy_course_enc"], index=X.index)
pd.concat([X[["enjoy_course"]], df], axis=1).head(10)

In [None]:
numeric_feats = [
    "university_years",
    "lab1",
    "lab2",
    "lab3",
    "lab4",
    "quiz1",
]  # apply scaling
categorical_feats = ["major"]  # apply one-hot encoding
ordinal_feats = ["class_attendance"]  # apply ordinal encoding
binary_feats = ["enjoy_course"]  # apply one-hot encoding with drop="if_binary"
passthrough_feats = ["ml_experience"]  # do not apply any transformation
drop_feats = []

In [None]:
ct = make_column_transformer(
    (
        make_pipeline(SimpleImputer(), StandardScaler()),
        numeric_feats,
    ),
    (
        OrdinalEncoder(categories=[class_attendance_levels], dtype=int),
        ordinal_feats,
    ),
    (
        OneHotEncoder(drop="if_binary", dtype=int),  # --> new code
        binary_feats,
    ),  # OHE on binary features
    ("passthrough", passthrough_feats),
    (
        OneHotEncoder(handle_unknown="ignore"),
        categorical_feats,
    )
)

In [None]:
ct

In [None]:
pipe = make_pipeline(ct, SVC())
scores = cross_validate(pipe, X, y, cv=5, return_train_score=True)

## Break (5 min)

![](../img/eva-coffee.png)


<br><br><br><br>

---------
**[Study the section at home]**
# End2end example

In [None]:
housing_df = pd.read_csv("../data/housing.csv")
train_df, test_df = train_test_split(housing_df, test_size=0.1, random_state=123)

train_df.head()

Some column values are mean/median but some are not. 

Let's add some new features to the dataset which could help predict the target: `median_house_value`. 

In [None]:
train_df = train_df.assign(
    rooms_per_household=train_df["total_rooms"] / train_df["households"]
)
test_df = test_df.assign(
    rooms_per_household=test_df["total_rooms"] / test_df["households"]
)

train_df = train_df.assign(
    bedrooms_per_household=train_df["total_bedrooms"] / train_df["households"]
)
test_df = test_df.assign(
    bedrooms_per_household=test_df["total_bedrooms"] / test_df["households"]
)

train_df = train_df.assign(
    population_per_household=train_df["population"] / train_df["households"]
)
test_df = test_df.assign(
    population_per_household=test_df["population"] / test_df["households"]
)

In [None]:
train_df.head()

In [None]:
# Let's keep both numeric and categorical columns in the data.
X_train = train_df.drop(columns=["median_house_value", "total_rooms", "total_bedrooms", "population"])
y_train = train_df["median_house_value"]

X_test = test_df.drop(columns=["median_house_value", "total_rooms", "total_bedrooms", "population"])
y_test = test_df["median_house_value"]

In [None]:
from sklearn.compose import ColumnTransformer, make_column_transformer

In [None]:
X_train.head(10)

In [None]:
X_train.columns

In [None]:
# Identify the categorical and numeric columns
numeric_features = [
    "longitude",
    "latitude",
    "housing_median_age",
    "households",
    "median_income",
    "rooms_per_household",
    "bedrooms_per_household",
    "population_per_household",
]

categorical_features = ["ocean_proximity"]
target = "median_house_value"

- Let's create a `ColumnTransformer` for our dataset. 

In [None]:
X_train.info()

In [None]:
X_train["ocean_proximity"].value_counts()

In [None]:
numeric_transformer = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = make_column_transformer(
    (numeric_transformer, numeric_features),
    (categorical_transformer, categorical_features),
)

In [None]:
preprocessor

In [None]:
X_train_pp = preprocessor.fit_transform(X_train)

- When we `fit` the preprocessor, it calls `fit` on _all_ the transformers
- When we `transform` the preprocessor, it calls `transform` on _all_ the transformers. 

We can get the new names of the columns that were generated by the one-hot encoding:

In [None]:
preprocessor

In [None]:
preprocessor.named_transformers_["onehotencoder"].get_feature_names_out(
    categorical_features
)

Combining this with the numeric feature names gives us all the column names:

In [None]:
column_names = numeric_features + list(
    preprocessor.named_transformers_["onehotencoder"].get_feature_names_out(
        categorical_features
    )
)
column_names

Let's visualize the preprocessed training data as a dataframe. 

In [None]:
pd.DataFrame(X_train_pp, columns=column_names)

In [None]:
from utils import mean_std_cross_val_scores
results_dict = {}
dummy = DummyRegressor()
results_dict["dummy"] = mean_std_cross_val_scores(
    dummy, X_train, y_train, return_train_score=True
)
pd.DataFrame(results_dict).T

In [None]:
from sklearn.neighbors import KNeighborsRegressor
knn_pipe = make_pipeline(preprocessor, KNeighborsRegressor())

In [None]:
knn_pipe

In [None]:
results_dict["imp + scaling + ohe + KNN"] = mean_std_cross_val_scores(
    knn_pipe, X_train, y_train, return_train_score=True
)

In [None]:
pd.DataFrame(results_dict).T

In [None]:
from sklearn.svm import SVR

svr_pipe = make_pipeline(preprocessor, SVR())
results_dict["imp + scaling + ohe + SVR (default)"] = mean_std_cross_val_scores(
    svr_pipe, X_train, y_train, return_train_score=True
)

In [None]:
pd.DataFrame(results_dict).T

The results with `scikit-learn`'s default SVR hyperparameters are pretty bad. 

In [None]:
svr_C_pipe = make_pipeline(preprocessor, SVR(C=10000))
results_dict["imp + scaling + ohe + SVR (C=10000)"] = mean_std_cross_val_scores(
    svr_C_pipe, X_train, y_train, return_train_score=True
)

In [None]:
pd.DataFrame(results_dict).T

With a bigger value for `C` the results are much better. We need to carry out systematic hyperparameter optimization to get better results. (Coming up next week.)

- Note that categorical features are different than free text features. Sometimes there are columns containing free text information and we'll look at ways to deal with them in the later part of this lecture. 

------------------------

<br><br><br><br><br><br><br><br>
# [Responsible AI] Do we actually want to use certain features for prediction?

- Do you want to use certain features such as **gender** or **race** in prediction?
- Remember that the systems you build are going to be used in some applications. 
- It's extremely important to be mindful of the consequences of including certain features in your predictive model. 
<br><br><br><br><br><br><br><br>

As responsible researchers, we should exclude certain features from the data **even if they improve model performance**.
<br><br><br><br><br>

## OHE with many categories

- Do we have enough data for **rare categories** to learn anything meaningful? 
- How about **grouping** them into bigger categories?
    - Example: country names into continents such as "South America" or "Asia"
- Or having **"other"** category for rare cases? 
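
The "other" bucket can be sketched with plain pandas. The series `countries` and the threshold `min_count` are made up for this illustration; in a real project the counts would be computed on the training split only and the threshold would be something to tune.

```python
import pandas as pd

# Hypothetical column with a few frequent and several rare categories.
countries = pd.Series(
    ["Canada", "Canada", "India", "India", "India", "Chile", "Fiji", "Canada"]
)

min_count = 2  # assumed threshold; in practice this is something to tune
counts = countries.value_counts()
rare = counts[counts < min_count].index

# Keep frequent categories as they are and replace every rare one with "other".
grouped = countries.where(~countries.isin(rare), "other")
print(grouped.value_counts())
```

After grouping, one-hot encoding produces three columns instead of four, and the "other" column is backed by more examples than any single rare category.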

<br><br><br><br><br><br><br>
Any decision we make in ML is a hyper-parameter
<br><br><br><br><br><br><br>

### How about the `target` (label) values? Should we preprocess them?

- Generally there is **no** need for this when doing classification, but in some cases it is useful.
  - Example: in regression it makes sense in some cases to transform the target. (More on this later.)
- For classification, you often don't need to do much (in some libraries you might need to encode the labels as integers, for example with ordinal encoding).
  - Example: `sklearn` is fine with categorical labels ($y$-values) for classification problems. 

<br><br><br><br>

# Encoding text data  

In [None]:
toy_spam = [
    [
        "URGENT!! As a valued network customer you have been selected to receive a £900 prize reward!",
        "spam",
    ],
    ["Lol you are always so convincing.", "non spam"],
    ["Nah I don't think he goes to usf, he lives around here though", "non spam"],
    [
        "URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot!",
        "spam",
    ],
    [
        "Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030",
        "spam",
    ],
    ["Congrats! I can't wait to see you!!", "non spam"],
]
toy_df = pd.DataFrame(toy_spam, columns=["sms", "target"])
toy_df

## Spam/non spam toy example 

- What if the feature is in the form of raw text?
- The feature `sms` below is neither categorical nor ordinal. 
- How can we encode it so that we can pass it to the machine learning algorithms we have seen so far? 

In [None]:
toy_df

- How can we encode or represent raw text data as a fixed number of features so that we can learn some useful patterns from it?  
- This is a well studied problem in the field of **Natural Language Processing (NLP)**, which is concerned with giving computers the ability to understand written and spoken language. 
- Some popular representations of raw text include: 
    - **Bag of words** 
    - TF-IDF
    - Embedding representations 

## Bag of words (BOW) representation (unigram model)

- One of the most popular representations of raw text 
- Ignores the syntax and word order
- It has two components: 
    - The vocabulary (all unique words in all documents) 
    - A value indicating either the presence or absence or the count of each word in the document. 
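
To make the two components concrete, here is a small sketch that builds a bag-of-words representation by hand, using only the standard library. The two documents are made up, and the whitespace tokenization is a deliberate simplification.

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# Tokenize naively on whitespace; real tokenizers do more than this.
tokenized = [doc.split() for doc in docs]

# Component 1: the vocabulary, i.e. all unique words across the documents.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Component 2: one count vector per document, aligned with the vocabulary.
bow = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for row in bow:
    print(row)
```

Word order is gone: "the cat sat on the mat" and "the mat sat on the cat" would produce the same count vector.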


<center>
<img src='../img/bag-of-words.png' width="600">
</center>

[Source](https://web.stanford.edu/~jurafsky/slp3/4.pdf)       

### Extracting BOW features using `scikit-learn`
- `CountVectorizer`
    - Converts a collection of text documents to a matrix of word counts.  
    - Each row represents a "document" (e.g., a text message in our example). 
    - Each column represents a word in the vocabulary (the set of unique words) in the training data. 
    - Each cell represents how often the word occurs in the document.       

<br><br><br><br><br>
In the Natural Language Processing (NLP) community, a collection of text data is referred to as a **corpus** (plural: corpora). 
<br><br><br><br><br>

As usual, start with the documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# -------- New Code -----------------------
vec = CountVectorizer()
X_counts = vec.fit_transform(toy_df["sms"])
# -----------------------------------------
bow_df = pd.DataFrame(
    X_counts.toarray(), columns=vec.get_feature_names_out(), index=toy_df["sms"]
)
bow_df

In [None]:
X_counts

<br><br><br><br>
If you have more than one text column, you need to define a separate `CountVectorizer` transformer for each of them.    

In [None]:
type(toy_df["sms"])
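
With several text columns, one way to wire this up is a `ColumnTransformer` with one `CountVectorizer` per column. The sketch below uses a made-up dataframe (`demo_text`, with hypothetical columns `subject` and `body`). Each column is given as a plain string rather than a list, because `CountVectorizer` expects a 1-D collection of documents.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical dataframe with two text columns; the names are made up.
demo_text = pd.DataFrame(
    {
        "subject": ["free prize", "lunch today", "prize inside"],
        "body": ["call now to win", "see you at noon", "win a free phone"],
    }
)

# Each column name is passed as a string, not a list, so that the vectorizer
# receives a 1-D Series of documents rather than a 2-D dataframe.
text_ct = make_column_transformer(
    (CountVectorizer(), "subject"),
    (CountVectorizer(), "body"),
)
demo_counts = text_ct.fit_transform(demo_text)

n_subject = len(text_ct.named_transformers_["countvectorizer-1"].vocabulary_)
n_body = len(text_ct.named_transformers_["countvectorizer-2"].vocabulary_)
print(demo_counts.shape, n_subject, n_body)
```

Each vectorizer learns its own vocabulary, so a word such as "free" gets one column for `subject` and a separate column for `body`.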

### Why sparse matrices? 

- Most words do not appear in a given document.
- We get massive computational savings if we only store the nonzero elements.
- There is a bit of overhead, because we also need to store the locations:
    - e.g. "location (3,27): 1".
    
- However, if the fraction of nonzero is small, this is a huge win.
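
The "location plus value" idea can be seen directly in SciPy's coordinate (COO) format. The matrix below is made up for illustration. COO is used here because it mirrors the description above; `CountVectorizer` itself returns a compressed sparse row (CSR) matrix, which stores the same information more compactly.

```python
import numpy as np
from scipy.sparse import coo_matrix

# A small, mostly-zero count matrix (hypothetical values).
dense = np.array(
    [
        [0, 0, 2, 0, 0],
        [0, 0, 0, 0, 0],
        [1, 0, 0, 0, 3],
    ]
)
demo_sparse = coo_matrix(dense)

# COO format keeps three parallel arrays: row index, column index, value.
for r, c, v in zip(demo_sparse.row, demo_sparse.col, demo_sparse.data):
    print(f"location ({r},{c}): {v}")

print("stored values:", demo_sparse.nnz, "out of", dense.size)
```

Only the nonzero cells are stored, each with its two indices, so the saving grows as the matrix gets larger and emptier.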

In [None]:
print("The total number of elements: ", np.prod(X_counts.shape))
print("The number of non-zero elements: ", X_counts.nnz)
print(
    "Proportion of non-zero elements: %0.4f" % (X_counts.nnz / np.prod(X_counts.shape))
)
print(
    "The value at cell 3,%d is: %d"
    % (vec.vocabulary_["jackpot"], X_counts[3, vec.vocabulary_["jackpot"]])
)

<br><br><br><br><br><br>
**Reminder/Note:`OneHotEncoder` and sparse features**
- By default, `OneHotEncoder` also creates sparse features. 
- You could set `sparse_output=False` (`sparse=False` before scikit-learn 1.2) to get a regular `numpy` array. 
- If there are a huge number of categories, it may be beneficial to keep them sparse.
- For smaller number of categories, it doesn't matter much.
<br><br><br><br><br><br>


### Important hyperparameters of `CountVectorizer` 

Check the doc: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- `binary`
    - whether to use absence/presence feature values or counts (If True, all non zero counts are set to 1)
- `max_features`
    - only consider top `max_features` ordered by frequency in the corpus
- `max_df`
    - ignore features which occur in more than `max_df` documents 
- `min_df` 
    - ignore features which occur in fewer than `min_df` documents 
- `ngram_range`
    - consider word sequences in the given range 

Let's look at all features, i.e., words (along with their frequencies).

In [None]:
vec = CountVectorizer()
X_counts = vec.fit_transform(toy_df["sms"])
bow_df = pd.DataFrame(
    X_counts.toarray(), columns=vec.get_feature_names_out(), index=toy_df["sms"]
)
bow_df

When we use `max_features=8`, we limit the number of features to 8

In [None]:
vec8 = CountVectorizer(max_features=8)  # --> change: max_features=8
X_counts = vec8.fit_transform(toy_df["sms"])
bow_df = pd.DataFrame(
    X_counts.toarray(), columns=vec8.get_feature_names_out(), index=toy_df["sms"]
)
bow_df

<br><br><br>
Here, we say: we are only interested in whether the word exists in the doc or not (**ignore the count**)

In [None]:
vec8_binary = CountVectorizer(binary=True, max_features=8)  # --> change: max_features=8, binary=True
X_counts = vec8_binary.fit_transform(toy_df["sms"])
bow_df = pd.DataFrame(
    X_counts.toarray(), columns=vec8_binary.get_feature_names_out(), index=toy_df["sms"]
)
bow_df.head()

<br><br><br><br><br><br><br>
**Read the following if the difference in feature names of the above two DataFrames is confusing**

------------
Notice that `vec8` and `vec8_binary` have different vocabularies, which is kind of unexpected behaviour and doesn't match the documentation of `scikit-learn`. 

The **binarization** is done **before limiting the features to `max_features`**, and so now we are actually looking at the document counts (**in how many documents the token occurs**) rather than term count.

The ties in counts between different words make it even more confusing. I don't think it'll have a big impact on the results but this is good to know! Remember that `scikit-learn` developers are also humans who are prone to make mistakes. So it's always a good habit to question whatever tools we use every now and then. 



In [None]:
vec8 = CountVectorizer(max_features=8)
X_counts = vec8.fit_transform(toy_df["sms"])
pd.DataFrame(
    data=X_counts.sum(axis=0).tolist()[0],
    index=vec8.get_feature_names_out(),
    columns=["counts"],
).sort_values("counts", ascending=False)

In [None]:
vec8_binary = CountVectorizer(binary=True, max_features=8)
X_counts = vec8_binary.fit_transform(toy_df["sms"])
pd.DataFrame(
    data=X_counts.sum(axis=0).tolist()[0],
    index=vec8_binary.get_feature_names_out(),
    columns=["counts"],
).sort_values("counts", ascending=False)

------------
<br><br><br><br><br><br><br>

<br><br><br><br><br><br><br><br>
#### Question: 
Is it OK to fit `CountVectorizer` on the **test data** to make sure we include all of its "words"? (After all, we care about the **count**, right?)

<br><br><br><br><br><br><br><br>

### Preprocessing in `CountVectorizer`

- Note that `CountVectorizer` comes with some default arguments, and does some pre-processing on the text by default
    - example: Converting words to lowercase (`lowercase=True`)
    - example: getting rid of punctuation and special characters (`token_pattern ='(?u)\\b\\w\\w+\\b'`)
    - Learn more here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html


In [None]:
pipe = make_pipeline(CountVectorizer(), SVC())

In [None]:
pipe.fit(toy_df["sms"], toy_df["target"])

In [None]:
# This is a toy problem; 6 samples, we don't have train and test sets.

# if we had a test set, we would have predicted the labels of the test set:
# pipe.predict(X_test["sms"])

# and we would have scored our model on the test set with:
# pipe.score(X_test["sms"], y_test)

### Is this a realistic representation of text data? 

- Of course this is not a great representation of language
    - We are throwing out everything we know about language and losing a lot of information. 
    - **Bag Of Words** assumes that **there is no syntax**, **semantics** and **compositional meaning** in language.  
- But it works surprisingly well for many tasks. 
- We will learn more expressive representations in the coming weeks. 

<br><br>

<br><br><br><br><br><br><br><br>
**[Run this section at home; here, we only focus on the Vocabulary section]** 
----------------------
## Demo of incorporating text features

Recall that we dropped the `song_title` feature when we worked with the Spotify dataset. 

Let's try to include it in our pipeline and examine whether we get better results. 

In [None]:
spotify_df = pd.read_csv("../data/spotify.csv", index_col=0)
X_spotify = spotify_df.drop(columns=["target"])
y_spotify = spotify_df["target"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X_spotify, y_spotify, test_size=0.2, random_state=123
)

In [None]:
X_train.shape

In [None]:
X_train

Let's look at the distribution of values in the `song_title` column. 

In [None]:
X_train["song_title"].value_counts()

- Most of the song titles are unique, which makes sense. 
- What would happen if we apply one-hot encoding to this feature? 
- Can we encode this as a text feature? 

In [None]:
X_train.columns

In [None]:
numeric_features = [
    "acousticness",
    "danceability",
    "duration_ms",
    "energy",
    "instrumentalness",
    "key",
    "liveness",
    "loudness",
    "mode",
    "speechiness",
    "tempo",
    "time_signature",
    "valence",
]
drop_features = ['artist']

# Unlike the other feature types, `text_feature` is a string, not a list:
# `CountVectorizer` expects a 1-D collection of documents, and a string column
# selector makes `ColumnTransformer` pass a 1-D column instead of a 2-D DataFrame.
text_feature = "song_title"

preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    (CountVectorizer(max_features=2000, stop_words="english"), text_feature),
    ("drop", drop_features)
)
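
The string-versus-list distinction matters because `ColumnTransformer` passes a 1-D column to the transformer when the column is given as a string, and a 2-D DataFrame when it is given as a list. The sketch below calls `CountVectorizer` directly on a small made-up DataFrame to show what would go wrong with 2-D input: iterating over a DataFrame yields its column names, so the vectorizer sees a single "document".

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical titles for illustration
toy = pd.DataFrame({"song_title": ["Blue Moon", "Moon River", "Blue Sky"]})

# 1-D input (what a string column selector provides): one row per title
as_series = CountVectorizer().fit_transform(toy["song_title"])
print(as_series.shape)  # (3, 4)

# 2-D input (what a list column selector provides): the vectorizer iterates
# over the column names, so it sees only the single "document" "song_title"
as_frame = CountVectorizer().fit_transform(toy[["song_title"]])
print(as_frame.shape)  # (1, 1)
```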

### Explore the transformed data 

In [None]:
transformed = preprocessor.fit_transform(X_train, y_train)
transformed.shape

In [None]:
vocab = preprocessor.named_transformers_["countvectorizer"].get_feature_names_out()

In [None]:
vocab[40:80]

In [None]:
vocab.shape

In [None]:
column_names = numeric_features + vocab.tolist()

In [None]:
df = pd.DataFrame(transformed.toarray(), columns=column_names, index=X_train.index)
df

### Explore the learned vocabulary 

In [None]:
vocab[0:10]

In [None]:
vocab[500:510]

In [None]:
vocab[1800:1810]

In [None]:
vocab[0::100]

<br><br><br><br><br><br>
**[Explore on your own]**
-------------

Let's find songs whose titles contain the word _earth_. 

In [None]:
earth_index_vocab = np.where(vocab == "earth")
print(earth_index_vocab)
print('index of "earth" in the vocabulary list is:', earth_index_vocab[0][0])

In [None]:
earth_index_in_df = len(numeric_features) + earth_index_vocab[0][0]
earth_index_in_df

In [None]:
# A title may contain the word more than once, so check for a count of at least 1.
earth_songs = df[df.iloc[:, earth_index_in_df] >= 1]
earth_songs.iloc[:, earth_index_in_df - 2 : earth_index_in_df + 2]

In [None]:
earth_songs.index

In [None]:
X_train.loc[earth_songs.index]["song_title"]

------------
<br><br><br><br><br><br><br>

### Model building 

Let's create a pipeline using SVC. 
- SVC works well with sparse features. 

In [None]:
pipe = make_pipeline(preprocessor, SVC())

In [None]:
results = pd.DataFrame(cross_validate(pipe, X_train, y_train, return_train_score=True))
print('validation score:', results.mean()['test_score'])

--------
- Is our CV score **improving** after incorporating this feature?
- Let's examine what numbers we get when we don't include it. 
--------

In [None]:
pipe_num = make_pipeline(StandardScaler(), SVC())

X_train_num = X_train.drop(columns=["song_title", 'artist'])

In [None]:
results = pd.DataFrame(
    cross_validate(pipe_num, X_train_num, y_train, return_train_score=True)
)
print('validation score:', results.mean()['test_score'])

- Not a big difference in the results. 

- What about the `artist` column?
- Does it make sense to apply BOW encoding to it? 
- Let's look at the distribution of values in the `artist` column. 

In [None]:
X_train['artist'].value_counts()

In [None]:
most_frequent = X_train["artist"].value_counts().iloc[:15]
most_frequent

- We have many unique artists. It's probably not worth creating an "other" category here. 

In [None]:
numeric_features = [
    "acousticness",
    "danceability",
    "duration_ms",
    "energy",
    "instrumentalness",
    "key",
    "liveness",
    "loudness",
    "mode",
    "speechiness",
    "tempo",
    "time_signature",
    "valence",
]
categorical_features = ['artist']
text_feature = "song_title"  # note that we are not creating a list here.

preprocessor_artist = make_column_transformer(
    (StandardScaler(), numeric_features),
    (
        # `sparse_output` replaced `sparse` in scikit-learn 1.2; use `sparse=False` on older versions.
        OneHotEncoder(
            sparse_output=False,
            dtype=int,
            handle_unknown="ignore",
            categories=[most_frequent.index.values],
        ),
        categorical_features,
    ),
    (CountVectorizer(max_features=2000, stop_words="english"), text_feature),
)
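
To see what combining `categories=[...]` with `handle_unknown="ignore"` does, here is a minimal sketch on a made-up `artist` column (the names and counts are hypothetical). Only the most frequent artists get their own column, and every other artist is encoded as a row of zeros:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: two frequent artists and one rare one
toy = pd.DataFrame(
    {"artist": ["Drake", "Drake", "Drake", "Rick Ross", "Rick Ross", "Local Band"]}
)

# Keep only the two most frequent artists as categories
top_artists = toy["artist"].value_counts().iloc[:2].index.values

ohe = OneHotEncoder(handle_unknown="ignore", categories=[top_artists])
encoded = ohe.fit_transform(toy).toarray()

print(ohe.categories_[0])  # the two retained artists
print(encoded)             # the last row ("Local Band") is all zeros
```

This gives a fixed, small number of columns regardless of how many distinct artists appear, at the cost of treating all remaining artists identically.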

In [None]:
pipe = make_pipeline(preprocessor_artist, SVC())

In [None]:
results = pd.DataFrame(cross_validate(pipe, X_train, y_train, return_train_score=True))
print('validation score:', results.mean()['test_score'])

A **tiny** improvement in the mean CV scores, but we are still overfitting. 

<br><br><br><br><br><br><br>
When a feature adds complexity without adding much value to the product, we sometimes decide not to include it.

We call such features with **minimal impact** **epsilon features**.
<br><br><br><br><br><br><br>

<br><br>

## ❓❓ Questions for you 

### (iClicker) Exercise 6.2 

**iClicker cloud join link: https://join.iclicker.com/EMMJ**

**Select all of the following statements which are TRUE.**

- (A) `handle_unknown="ignore"` would treat all unknown categories equally. 
- (B) As you increase the value of the `max_features` hyperparameter of `CountVectorizer`, the training score is likely to go up. 
- (C) Suppose you are encoding text data using `CountVectorizer`. If you encounter a word in the validation or the test split that's not available in the training data, you'll get an error. 
- (D) In the code below, inside `cross_validate`, each fold might have a slightly different number of features (columns).

```
pipe = make_pipeline(CountVectorizer(), SVC())
cross_validate(pipe, X_train, y_train)
```

<br><br><br><br><br>
## Other Language Preprocessors and Models

### TF-IDF (term frequency–inverse document frequency)

In [None]:
toy_spam = [
    [
        "URGENT!! As a valued network customer you have been selected to receive a £900 prize reward!",
        "spam",
    ],
    ["Lol you are always so convincing.", "non spam"],
    ["Nah I don't think he goes to usf, he lives around here though", "non spam"],
    [
        "URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot!",
        "spam",
    ],
    [
        "Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030",
        "spam",
    ],
    ["Congrats! I can't wait to see you!!", "non spam"],
]
toy_df = pd.DataFrame(toy_spam, columns=["sms", "target"])

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = toy_df['sms']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

In [None]:
print(vectorizer.get_feature_names_out())

In [None]:
X.toarray()
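
To get a feel for where these numbers come from, the sketch below recomputes TF-IDF by hand on three made-up messages and compares the result with `TfidfVectorizer`. It assumes the default settings, under which `scikit-learn` documents the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization of each row; with non-default arguments the numbers would differ.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["free prize free", "see you soon", "free call"]  # made-up messages

# Term counts per document (the bag-of-words matrix)
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each term occurs
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (default smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by idf, then L2-normalize each row (default norm="l2")
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

from_sklearn = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, from_sklearn))  # True
```

Terms that occur in many documents (here `free`) get a smaller idf weight than terms that occur in only one.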

<br><br><br><br>

<br><br><br><br><br><br>
## What is n-gram?

That's something you folks will learn about on your own, because we didn't have time to cover it :D
<br><br><br><br><br><br>

## What did we learn today?

- Motivation to use `ColumnTransformer`
- `ColumnTransformer` syntax
- Defining transformers with multiple transformations
- How to visualize transformed features in a dataframe 
- More on ordinal features 
- Different arguments of `OneHotEncoder`
    - `handle_unknown="ignore"`
    - `drop="if_binary"`
- Dealing with text features
    - Bag of words representation: `CountVectorizer`

![](../img/eva-talksoon.png)