# **Module 3: Supervised Machine Learning for Text**

## Section I Text Classification

#### 1. **Classification**

##### **(1) Definition**
Given a set of classes, **classification** is the task of <u>assigning</u> the correct *class label* to a given *input*  
*<p style = "color:grey" align="center">![1](assets/example_of_class_1.png)</p>*
*<p align="center" style="color:grey">an example of text classification</p>*
##### **(2) Examples**
- **Topic identification**: Is this news article about *politics*, *sports*, or *technology*?
- **Spam detection**: Is this email *spam* or *not spam*?
- **Sentiment analysis**: Is this movie review *positive* or *negative*?
- **Spelling correction**: *whether* or *weather* (different sentences)? *color* or *colour* (different styles)?

#### 2. **Supervised Learning**

##### **(1) Definition**
The process by which a machine learns from past (labeled) instances so that it can make predictions on future, unseen instances
*<p align="center">![Alt text](assets/supervised_learning.png)</p>*
*<p align="center" style="color:grey">What is supervised learning?</p>*

##### **(2) Supervised classification**
- Definition: Learn a <u>classification model</u> on properties (*features*) and their importance (*weights*) from *labeled instances*
    - <u>classification model</u>: a *predictive* model that uses input to predict the category or class (discrete labels)
- *Features*: *X* - set of $n$ attributes or features $\{x^{[1]}, x^{[2]}, ..., x^{[n]}\}$, where the $i$-th feature takes one of $m_i$ possible values: $x^{[i]} \in \{x^{[i]}_1, x^{[i]}_2, ..., x^{[i]}_{m_i}\}$
- *Labels*: *y* - a "class" label from the label set $y \in \{y_1, y_2, ..., y_k\}$
- Training phase: dataset with each sample having both *features* and *labels*
- Inference phase: dataset with each sample having only *features* and *no labels*
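
The two phases can be sketched with a tiny perceptron, one simple way to learn a weight per feature from labeled instances. This is only an illustrative sketch: the perceptron, the toy spam data, and the names `featurize`, `train_perceptron`, and `predict` are assumptions made for this example, not something prescribed by the notes.

```python
from collections import defaultdict

def featurize(text):
    """Represent a text as a set of lowercase word features."""
    return set(text.lower().split())

def train_perceptron(labeled_instances, epochs=10):
    """Training phase: learn one weight per feature from (text, label) pairs.

    Labels are +1 (spam) or -1 (not spam).
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, label in labeled_instances:
            features = featurize(text)
            score = sum(weights[f] for f in features)
            predicted = 1 if score > 0 else -1
            if predicted != label:
                # Move the weights of the active features toward the true label.
                for f in features:
                    weights[f] += label
    return weights

def predict(weights, text):
    """Inference phase: only features are available, no label."""
    score = sum(weights.get(f, 0.0) for f in featurize(text))
    return 1 if score > 0 else -1

# Hypothetical toy training set: features (the text) plus labels.
training_data = [
    ("win free money now", 1),
    ("free prize claim now", 1),
    ("meeting agenda for monday", -1),
    ("lunch on monday", -1),
]
weights = train_perceptron(training_data)

print(predict(weights, "claim your free money"))  # 1
print(predict(weights, "agenda for lunch"))       # -1
```

The learned `weights` play the role of feature importance: words seen in misclassified spam examples end up with positive weight, and unseen words contribute nothing to the score.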

##### **(3) Train-Validation-Test split (knowledge recall)**
For a labeled dataset:

- *Training dataset*: the portion used for learning the model and its parameters
- *Validation dataset*: used to tune the model's settings (e.g., hyperparameters) and to choose between candidate models
- *Test dataset*: used to evaluate the final model

*<p align="center">![train-validation-test split](assets/train-val-test_split.png)</p>*
*<p align="center" style="color:grey">train-validation-test split</p>*
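
A three-way split can be sketched with the standard library alone. The 60/20/20 proportions, the fixed seed, and the helper name `train_val_test_split` are assumptions chosen for illustration; the notes do not specify particular proportions.

```python
import random

def train_val_test_split(instances, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle labeled instances and cut them into train/validation/test parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible

    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)

    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical labeled dataset: (text, label) pairs.
data = [(f"document {i}", i % 2) for i in range(10)]
train, val, test = train_val_test_split(data)

print(len(train), len(val), len(test))  # 6 2 2
```

Every instance lands in exactly one of the three parts, so the test set stays unseen while the model is trained and tuned.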

##### **(4) Classification paradigms**

- **Binary classification**: when there are only two possible classes (only $y_1$ and $y_2$)
- **Multi-class classification**: when there are more than two possible classes ($y_1, y_2, ..., y_k$ with $k > 2$), and each instance gets exactly one of them
- **Multi-label classification**: when data instances can have two or more labels ($Y = \{y^{[1]},y^{[2]},...y^{[n]}\}$) (*uncommon*)
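
One way to see the difference between the paradigms is in how the target is encoded. The sketch below assumes a hypothetical three-topic label set and two illustrative helpers, `one_hot` and `multi_hot`; neither name comes from the notes.

```python
TOPICS = ["politics", "sports", "technology"]  # hypothetical label set

def one_hot(label, label_set):
    """Multi-class target: exactly one position is 1."""
    return [1 if candidate == label else 0 for candidate in label_set]

def multi_hot(labels, label_set):
    """Multi-label target: any number of positions may be 1."""
    return [1 if candidate in labels else 0 for candidate in label_set]

print(one_hot("sports", TOPICS))                      # [0, 1, 0]
print(multi_hot({"politics", "technology"}, TOPICS))  # [1, 0, 1]
```

Binary classification is the special case of the multi-class encoding in which the label set has only two entries.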

##### **(5) Questions to ask**
- During training phase:
    - What are the features?
    - How to represent the features?
    - What is the classification model/algorithm?
    - What are the model parameters?
- During inference phase:
    - What is the expected performance?
    - What is a good measure for expected performance?


#### 3. **Identifying Features from Text**

##### **(1) Uniqueness of text**

- Textual data has unique challenges
- All the information needed is in the text
- Features can be pulled out from text in <u>different granularities</u>

##### **(2) Types of textual features**

- ***Words***
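    - **Features**:
        - Words are by far the most common type of feature
        - A text collection can yield a very large number of word features
    - **Challenges**:
        - Handling commonly occurring words (*stop words*; e.g., "the")
        - Normalization: convert to *lower case* vs. leave *as-is*
        - Stemming/Lemmatization: reduce words to a base form (e.g., remove *plurals*)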
- ***Characteristics of words***
    - **Examples**:
        -  *Capitalization*
            - U.S. vs. us
            - White House vs. white house
        - *Parts of speech (POS) of words in a sentence*
            - weather vs. whether: a preceding determiner such as *"the"* indicates *weather* rather than *whether*
        - *Grammatical structure* and *sentence parsing*
            - v. + n.
        - *Grouping words of similar meaning or semantics*
            - {buy, purchase}
            - {Mr., Ms., Dr., Prof.}
            - Numbers/Digits
            - Dates
- ***Other types of features***
    - **Inside the words**:
        - subsequences in words: "-ing" (continuous form of *v.*), "-ion" (indicator of a *n.*), ...
    - **From word sequences**:
        - bigram - "*White House*"
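
Several of the feature types listed above can be pulled from a sentence with a few lines of standard-library Python. This is a rough sketch under stated assumptions: the tiny stop-word list, the two suffixes, the letters-only tokenizer, and the function name `extract_features` are all illustrative choices, and real pipelines typically rely on a dedicated NLP library instead.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "is", "in"}  # tiny illustrative list
SUFFIXES = ("ing", "ion")                                # subsequences inside words

def extract_features(text):
    """Extract word, capitalization, suffix, and bigram features from a sentence."""
    # Simplistic tokenizer: runs of letters only (it would split "U.S." apart).
    tokens = re.findall(r"[A-Za-z]+", text)

    # Words: lower-cased, with stop words removed.
    words = [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

    # Capitalization: ignore the sentence-initial token, which is always capitalized.
    capitalized = [t for i, t in enumerate(tokens) if i > 0 and t[0].isupper()]

    # Inside the words: which of the chosen suffixes occur.
    suffixes = [s for t in tokens for s in SUFFIXES if t.lower().endswith(s)]

    # Word sequences: adjacent token pairs (bigrams).
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

    return {
        "words": words,
        "capitalized": capitalized,
        "suffixes": suffixes,
        "bigrams": bigrams,
    }

features = extract_features("The White House is announcing a decision")
print(features["words"])        # ['white', 'house', 'announcing', 'decision']
print(features["capitalized"])  # ['White', 'House']
print(features["suffixes"])     # ['ing', 'ion']
print("White House" in features["bigrams"])  # True
```

Note the trade-off the example makes visible: lower-casing the word features loses the distinction between *White House* and *white house*, which is why capitalization and bigrams are kept as separate features.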