### UBC Extended Learning
#### Instructor: Socorro Dominguez
#### Module 06

Overview:

- [ ] Explain the `handle_unknown="ignore"` hyperparameter of scikit-learn's `OneHotEncoder`.
- [ ] Identify when it's appropriate to apply ordinal encoding vs one-hot encoding.
- [ ] Explain strategies to deal with categorical variables with too many categories.
- [ ] Explain why text data needs a different treatment than categorical variables.
- [ ] Use scikit-learn's `CountVectorizer` to encode text data.
- [ ] Explain different hyperparameters of `CountVectorizer`.
- [ ] Use `ColumnTransformer` to build all our transformations together into one object and use it with scikit-learn pipelines.

#### One Hot Encoder
- Converts categorical variables into a binary matrix (one-hot encoded).
- Each category becomes a binary column, indicating its presence (1) or absence (0).

**Example:**
   - Original Data: ['Red', 'Blue', 'Green']
   - One-Hot Encoded:
     ```markdown
     Red   Blue   Green
     1     0      0
     0     1      0
     0     0      1
     ```

- Avoids introducing false ordinal relationships in nominal categorical data.


- The `handle_unknown` hyperparameter determines how the encoder handles categories at transform time that it did not see during `fit`. The default, `"error"`, raises a `ValueError`.

- When set to `"ignore"`, the encoder does not raise an error if it encounters an unknown category during transformation.

- Instead, it assigns zeros to all of the one-hot encoded columns for that observation.

- If the encoder from the example above later encountered a `Yellow` observation, that row would be all zeros:

     ```markdown
     Red   Blue   Green
     1     0      0
     0     1      0
     0     0      1
     0     0      0
     ```

```python
from sklearn.preprocessing import OneHotEncoder

# Example data with known categories during training
X_train = [['R'], ['G'], ['B']]

# Example data with a new category during testing
X_test = [['R'], ['Y'], ['Y']]

# Create the OneHotEncoder with handle_unknown='ignore'
encoder = OneHotEncoder(handle_unknown='ignore')

# Fit and transform the training data
X_train_encoded = encoder.fit_transform(X_train)

# Transform the test data
X_test_encoded = encoder.transform(X_test)

print("Encoded training data:")
print(X_train_encoded.toarray())

print("Encoded test data:")
print(X_test_encoded.toarray())
```

```
Encoded training data:
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
Encoded test data:
[[0. 0. 1.]
 [0. 0. 0.]
 [0. 0. 0.]]
```
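
For contrast, here is a minimal sketch of what happens with the default `handle_unknown='error'` on the same kind of data. The variable names (`strict_encoder`, `raised`) are made up for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

# Same idea as above: 'Y' never appears in the training data
X_train = [['R'], ['G'], ['B']]
X_test = [['R'], ['Y']]

# handle_unknown='error' is the default; it is spelled out here for clarity
strict_encoder = OneHotEncoder(handle_unknown='error')
strict_encoder.fit(X_train)

try:
    strict_encoder.transform(X_test)
    raised = False
except ValueError as err:
    # The encoder refuses to transform a category it has not seen
    raised = True
    print("Transform failed:", err)
```

Choosing `"ignore"` trades that explicit failure for an all-zero row, which is usually what you want when new categories can show up at prediction time.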


#### One Hot Encoding or Ordinal Encoding?

##### Ordinal Encoding
- Assigns numerical labels to categorical variables with a meaningful order or hierarchy.

**Example:**
- Original Data: ['Low', 'Medium', 'High']
- Ordinal Encoded:
     ```
     0
     1
     2
     ```

- Appropriate for categorical variables with a clear order (e.g., 'low' < 'medium' < 'high').
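
As a rough sketch of how this looks in scikit-learn, `OrdinalEncoder` accepts a `categories` argument that fixes the order explicitly. Without it, the categories are sorted alphabetically, which would not match 'Low' < 'Medium' < 'High'. The sample data below is made up for illustration.

```python
from sklearn.preprocessing import OrdinalEncoder

# Illustrative observations of an ordered variable
X = [['Low'], ['High'], ['Medium'], ['Low']]

# Pass the categories in their meaningful order: Low=0, Medium=1, High=2
ordered_encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
ordered = ordered_encoder.fit_transform(X)
print(ordered.ravel())

# Without `categories`, the order is alphabetical: High=0, Low=1, Medium=2
default_encoder = OrdinalEncoder()
alphabetical = default_encoder.fit_transform(X)
print(alphabetical.ravel())
```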

#### ColumnTransformer for Multiple Categories

- Too many categories in a categorical variable can lead to a high-dimensional representation.

- We can use `ColumnTransformer` to apply different transformations to specific columns.

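**Example:**
   ```python
   import pandas as pd
   from sklearn.compose import ColumnTransformer
   from sklearn.preprocessing import OneHotEncoder
   from sklearn.feature_extraction.text import CountVectorizer

   # Small illustrative dataset (made up for this example)
   data = pd.DataFrame({
       'category': ['A', 'B', 'A'],
       'text': ['the cat sat', 'the mat was blue', 'the cat was happy']
   })

   # Apply different transformations to different column types.
   # CountVectorizer expects a 1-D sequence of strings, so its column is
   # given as a plain string ('text'), not as a list (['text']).
   transformer = ColumnTransformer(
       transformers=[
           ('categorical', OneHotEncoder(handle_unknown='ignore'), ['category']),
           ('text', CountVectorizer(), 'text')
       ])

   transformed_data = transformer.fit_transform(data)
   ```

The **'category'** column is one-hot encoded, and the **'text'** column is vectorized using `CountVectorizer`.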
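
A `ColumnTransformer` can also be the first step of a scikit-learn pipeline, so that preprocessing and the model are fit together. The sketch below assumes a tiny made-up dataset and uses `LogisticRegression` purely as a placeholder model; neither comes from the course material.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy training data
train = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B'],
    'text': ['great product', 'terrible service',
             'really great', 'terrible quality'],
})
y_train = [1, 0, 1, 0]

preprocessor = ColumnTransformer(
    transformers=[
        ('categorical', OneHotEncoder(handle_unknown='ignore'), ['category']),
        ('text', CountVectorizer(), 'text'),
    ])

# The pipeline fits the preprocessor and the model in one call
pipe = make_pipeline(preprocessor, LogisticRegression())
pipe.fit(train, y_train)

# New data with an unseen category ('C') and an unseen word ('value')
test = pd.DataFrame({'category': ['C'], 'text': ['great value']})
predictions = pipe.predict(test)
print(predictions)
```

Because the encoders live inside the pipeline, they are fit only on the training data and then reused unchanged on the test data.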


##### Text Data Treatment vs. Categorical Variables

- **Text Data:**
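  - Free-form natural language: almost every value is unique, so treating each whole string as one category is rarely useful.
  - Needs tokenization, and often normalization such as stemming or lemmatization.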

- **Categorical Variables:**
  - Represent discrete categories.
  - Encoded using methods like one-hot or ordinal encoding.

#### Understanding Bag of Words (CountVectorizer)

- Bag of Words (BoW) is a fundamental Natural Language Processing (NLP) technique that converts text documents into numerical feature vectors.
- It represents a document as a collection (or "bag") of words, disregarding grammar and word order but keeping track of word frequency.
- BoW is widely used for text classification, sentiment analysis, and clustering tasks in NLP.


![](https://miro.medium.com/v2/resize:fit:1200/1*3K9GIOVLNu0cRvQap_KaRg.png)

- How **BoW** works:
  1. Tokenization: Break down the text into individual words or tokens.
  2. Counting: Count the occurrence of each word in the document.
  3. Vectorization: Convert the word counts into numerical feature vectors.

#### Example Text Document

- Let's consider an example text document: "The cat sat on the mat. The mat was blue."
- We will apply Bag of Words to this document to convert it into a numerical representation.

- The tokenization process separates the document into individual words:
  ```
    ["The", "cat", "sat", "on", "the", "mat", "The", "mat", "was", "blue"]
  ```
- Counting the occurrences of each word (case-sensitively, so "The" and "the" are counted separately), we get:

    ```
    {"The": 2,
    "cat": 1,
    "sat": 1,
    "on": 1,
    "the": 1,
    "mat": 2,
    "was": 1,
    "blue": 1}
    ```
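
The tokenize-and-count steps above can be sketched by hand with the standard library. The regular expression used for tokenization is a simplifying assumption, not what any particular library does internally.

```python
import re
from collections import Counter

document = "The cat sat on the mat. The mat was blue."

# Tokenization: split into runs of word characters, dropping punctuation
tokens = re.findall(r"\w+", document)

# Counting, case-sensitive: "The" and "the" are different tokens
case_sensitive_counts = Counter(tokens)
print(case_sensitive_counts)

# Counting after lowercasing: "The" and "the" are merged
lowercase_counts = Counter(token.lower() for token in tokens)
print(lowercase_counts)
```

Lowercasing first merges "The" and "the" into a single token with a count of 3, which is also the default behaviour of scikit-learn's `CountVectorizer`.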

#### CountVectorizer in Python 

- scikit-learn's `CountVectorizer` automates the Bag of Words process.
- It converts a collection of text documents to a matrix of token counts, with one row per document and one column per token.
- Let's see how to implement Bag of Words using `CountVectorizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Example text documents
documents = [
    "The cat sat on the mat.",
    "The mat was blue and the cat was happy."
]

# Create a CountVectorizer instance
vectorizer = CountVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Get the feature names (tokens)
feature_names = vectorizer.get_feature_names_out()

# Print the feature names and the numerical representation of the documents
print("Feature names (tokens):", feature_names)
print("Numerical representation of documents:")
print(X.toarray())
```

```
Feature names (tokens): ['and' 'blue' 'cat' 'happy' 'mat' 'on' 'sat' 'the' 'was']
Numerical representation of documents:
[[0 0 1 0 1 1 1 2 0]
 [1 1 1 1 1 0 0 2 2]]
```


#### Applications of Bag of Words
- Bag of Words has numerous applications in NLP, including:
  - Text classification: Identifying the topic or sentiment of a document.
  - Document clustering: Grouping similar documents together.
  - Information retrieval: Ranking documents based on relevance to a query.


#### Important hyperparameters:
- `stop_words`: Removes very common words that carry little meaning (is, the, a, ...) from the vocabulary, for example with `stop_words="english"`.
- `max_features`: Keeps only the N most frequent tokens. Vocabularies can be massive, which makes things slow; capping the size speeds things up, which is especially useful while you are setting up the pipeline.
- `lowercase`: Converts all text to lowercase before tokenizing (default `True`).

#### Remember
- Bag of Words is a powerful technique for converting text data into numerical representations.
- scikit-learn's `CountVectorizer` simplifies the implementation of Bag of Words.
- Understanding Bag of Words is essential for various NLP tasks, enabling efficient text analysis and feature extraction.