# Tutorial (Text Data Processing)

(Last updated: Feb 16, 2023)

This tutorial will familiarize you with the data science pipeline for processing text data.

[pipeline pic]

You can use the following links to jump to the tasks and assignments:

[table of contents]

## Scenario

[AG News Classification Dataset](https://www.kaggle.com/datasets/amananandrai/ag-news-classification-dataset)

## Import Packages

We put all the packages that are needed for this tutorial below:

In [None]:
import nltk
import numpy as np
import pandas as pd
import spacy
import matplotlib.pyplot as plt
import torch

## Task Answers

The code block below contains answers for the assignments in this tutorial. **Do not check the answers in the next cell before practicing the tasks.**

In [None]:
def check_answer_df(df_result, df_answer, n=1):
    """
    This function checks if two output dataframes are the same.
    
    Parameters
    ----------
    df_result : pandas.DataFrame
        The result from the output of a function.
    df_answer : pandas.DataFrame
        The expected output of the function.
    n : int
        The numbering of the test case.
    """
    try:
        assert df_answer.equals(df_result)
        print("Test case %d passed." % n)
    except AssertionError:
        print("Test case %d failed." % n)
        print("")
        print("Your output is:")
        print(df_result)
        print("")
        print("Expected output is:")
        print(df_answer)

## Task 3: Preprocess Text Data

In this task, we will process the text data from [dataset]. First, we need to load [the dataset].

In [None]:
# Load dataset
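
A minimal sketch of the loading step. It reads a tiny in-memory sample so that it runs without any download; the column names `Class Index`, `Title`, and `Description` are an assumption about the Kaggle CSV layout, and the two rows are made up. With the real file you would pass its path to `pd.read_csv` instead of the `StringIO` object.

```python
import io

import pandas as pd

# Toy stand-in for the Kaggle CSV file (made-up rows, assumed column names).
sample_csv = io.StringIO(
    '"Class Index","Title","Description"\n'
    '3,"Stocks rally","Markets closed higher on Friday."\n'
    '2,"Cup final","The home team won the match."\n'
)

df = pd.read_csv(sample_csv)

# Combine title and description into a single text column for later steps.
df["text"] = df["Title"] + " " + df["Description"]

print(df.shape)
print(df.head())
```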

### Tokenization 
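
As a rough illustration of tokenization, the sketch below uses NLTK's regex-based `wordpunct_tokenize`, chosen here only because it needs no extra data downloads; the example sentence is made up. Other tokenizers, such as `nltk.word_tokenize`, split text differently and require additional NLTK resources.

```python
from nltk.tokenize import wordpunct_tokenize

text = "The markets closed higher, but oil prices fell."

# Split into word and punctuation tokens, then lowercase them.
tokens = [t.lower() for t in wordpunct_tokenize(text)]

# Keep only alphabetic tokens (drops punctuation and numbers).
words = [t for t in tokens if t.isalpha()]

print(tokens)
print(words)
```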

### Lemmatization or stemming

The tutorial shows both, but stick to one of them for the assignment.
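
A small sketch of stemming with NLTK's `PorterStemmer`, which works without downloading extra data; the token list is made up. Lemmatization with `nltk.stem.WordNetLemmatizer` follows the same pattern but needs the WordNet corpus to be downloaded first, so it is not run here.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative tokens, assumed to be already lowercased.
tokens = ["markets", "running", "studies"]

# A stemmer strips suffixes by rule, so the result is not always a real word.
stems = [stemmer.stem(t) for t in tokens]

for token, stem in zip(tokens, stems):
    print(token, "->", stem)
```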

### Stopword removal

- Assignment: Show the top-n word frequencies, both overall and per topic in the dataset, before and after pre-processing
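
One possible way to count word frequencies with and without stopwords, shown on made-up token lists. The stopword set below is a tiny illustrative assumption; NLTK's English stopword list is much longer and has to be downloaded separately. The helper name `top_n` is hypothetical.

```python
from collections import Counter

# Tiny illustrative stopword set (an assumption, not NLTK's list).
STOPWORDS = {"the", "a", "of", "and", "in", "on", "to"}

# Made-up tokenized documents.
docs = [
    ["the", "price", "of", "oil", "rose"],
    ["the", "team", "won", "the", "cup"],
    ["oil", "and", "gas", "stocks", "rose"],
]


def top_n(token_lists, n=3, stopwords=frozenset()):
    """Return the n most frequent tokens, skipping any in `stopwords`."""
    counts = Counter(
        token
        for tokens in token_lists
        for token in tokens
        if token not in stopwords
    )
    return counts.most_common(n)


before = top_n(docs)
after = top_n(docs, stopwords=STOPWORDS)

print("Before stopword removal:", before)
print("After stopword removal:", after)
```

For per-topic counts, the same helper could be applied to the token lists of each class separately, for example after a pandas `groupby` on the label column.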

Sources:

- https://www.kaggle.com/code/vukglisovic/classification-combining-lda-and-word2vec/notebook

## Task 4: Another option: spaCy

- Make a separate column that holds each text converted to a spaCy `Doc`
- Show that tokens, lemmas, and stopword flags are immediately available on the `Doc` (spaCy does not ship a stemmer, so stems have to come from another library such as NLTK)
- Assignment: make a version of the text that is tokenized, lemmatized or stemmed, and has stopwords removed
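
A hedged sketch of the spaCy workflow on a made-up DataFrame. It uses `spacy.blank("en")`, which provides the tokenizer and stopword flags without downloading a model; this is an assumption made to keep the example runnable. Lemmas (`token.lemma_`) are only filled in by a trained pipeline such as `en_core_web_sm`, so they are left out here.

```python
import pandas as pd
import spacy

# Blank English pipeline: tokenizer and lexical flags only, no lemmatizer.
nlp = spacy.blank("en")

# Made-up example texts.
df = pd.DataFrame(
    {"text": ["The markets closed higher.", "The team won the cup."]}
)

# Store the processed spaCy object for each row in its own column.
df["doc"] = df["text"].apply(nlp)

# Token attributes such as is_alpha and is_stop are available directly.
df["tokens"] = df["doc"].apply(
    lambda doc: [t.text.lower() for t in doc if t.is_alpha and not t.is_stop]
)

print(df[["text", "tokens"]])
```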

## Task 5: Unsupervised Learning - Topic Modelling

- Use LDA to transform preprocessed text into features
- Run simple k-means to make clusters
- Let the student pick the number of clusters (elbow method)
- Evaluate using `adjusted_mutual_info_score` and `adjusted_rand_score`
- Similar to an assignment from the Applied ML course these students had last year, but LDA was cut for their year
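
The steps above could be sketched with scikit-learn roughly as follows. The eight texts and their labels are made up, and the number of topics and the range of cluster counts are arbitrary assumptions for illustration; on real data these would need tuning.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

# Made-up preprocessed texts with made-up topic labels (0 = business, 1 = sports).
texts = [
    "oil price rose market", "stocks market fell oil",
    "bank profit market rose", "oil stocks bank profit",
    "team won cup match", "player scored match goal",
    "coach team player cup", "goal match team won",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Bag-of-words counts, then LDA topic proportions as features.
counts = CountVectorizer().fit_transform(texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
features = lda.fit_transform(counts)

# Elbow method: record the k-means inertia for several cluster counts.
inertias = {}
for k in range(1, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    inertias[k] = km.inertia_

# Cluster with a chosen k and compare against the known labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
ami = adjusted_mutual_info_score(labels, clusters)
ari = adjusted_rand_score(labels, clusters)

print("Inertia per k:", inertias)
print("AMI:", ami, "ARI:", ari)
```

Plotting `inertias` against `k` with matplotlib gives the elbow curve from which the number of clusters can be picked.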

## Task 6: Word embeddings

- Show code to make embeddings based on pre-processed text using both NLTK and spaCy
- Assignment: Let student apply it to dataframe
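
To illustrate how token-level vectors become one feature vector per document, the sketch below averages word vectors per row of a DataFrame. The 3-dimensional `toy_vectors` lookup is entirely made up and the helper name `embed` is hypothetical; in practice the vectors would come from a pretrained or trained embedding model.

```python
import numpy as np
import pandas as pd

# Made-up 3-dimensional word vectors, for illustration only.
toy_vectors = {
    "oil": np.array([1.0, 0.0, 0.0]),
    "price": np.array([0.0, 1.0, 0.0]),
    "team": np.array([0.0, 0.0, 1.0]),
}
DIM = 3


def embed(tokens, vectors=toy_vectors, dim=DIM):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Made-up tokenized documents.
df = pd.DataFrame({"tokens": [["oil", "price"], ["team", "won"], ["unknown"]]})
df["embedding"] = df["tokens"].apply(embed)

# Stack the per-document vectors into a feature matrix for later models.
X = np.vstack(df["embedding"].to_list())
print(X)
```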

Sources:
- https://www.shanelynn.ie/word-embeddings-in-python-with-spacy-and-gensim/
- https://www.kaggle.com/code/vukglisovic/classification-combining-lda-and-word2vec/notebook

## Task 7: Supervised Learning - Topic Classification

- Using the word embeddings features, train a small neural net 
- Picture of the desired neural net we want to make
- Don't give the full torch code, only one layer to let them do something with torch
- Hyperparameter tuning (either some hinted ones, or see if Ray Tune is worth it for this task)
- Evaluate using a confusion matrix against the true labels
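
In keeping with the plan to show only one layer, here is a minimal sketch of a single linear layer trained on random stand-in features, followed by a confusion matrix built with NumPy. The embedding size, the learning rate, and the number of training steps are arbitrary assumptions; four classes are assumed to match the AG News topics. The data is random, so the resulting numbers carry no meaning.

```python
import numpy as np
import torch
from torch import nn

torch.manual_seed(0)

EMBED_DIM = 16   # assumed embedding size
NUM_CLASSES = 4  # assumed number of topics

# Random stand-ins for document embeddings and their labels.
X = torch.randn(8, EMBED_DIM)
y = torch.randint(0, NUM_CLASSES, (8,))

# A single linear layer mapping embeddings to class scores (logits).
layer = nn.Linear(EMBED_DIM, NUM_CLASSES)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)

# A few gradient steps on the full batch.
for _ in range(20):
    optimizer.zero_grad()
    loss = loss_fn(layer(X), y)
    loss.backward()
    optimizer.step()

# Predicted class is the index of the largest logit.
with torch.no_grad():
    logits = layer(X)
    preds = logits.argmax(dim=1)

# Confusion matrix: rows are true labels, columns are predictions.
confusion = np.zeros((NUM_CLASSES, NUM_CLASSES), dtype=int)
for true_label, pred_label in zip(y.tolist(), preds.tolist()):
    confusion[true_label, pred_label] += 1

print(confusion)
```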

Sources:
- https://pytorch.org/tutorials/beginner/hyperparameter_tuning_tutorial.html
- https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html