# **NLP-Notebook Contents**


**1. How to extract information**<br>
**2. How to classify**<br>
**3. How to generate**

# **1 - How to extract information**
## 1.1 Text Summarization with Hugging Face

**In 1.1 we'll take a first look at some tools from Hugging Face's transformers library. The best thing about this library: it's very (!) convenient. You can work with large, powerful models without writing much code, how great is that? One thing to keep in mind, though: powerful models are usually big models, so make sure you have enough free disk space (the summarization model used below is a little under 900MB).**

In [27]:
# Import ML-tools
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# Import UI-tools
from IPython.display import display 
import ipywidgets as widgets 
from ipywidgets import interact, Layout 

In [28]:
# Load pretrained models

# Tokenizer: This thing splits your text into so-called tokens (~ words or word pieces)
tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-summarize-news")
# Model: This thing does the actual work (summarize). T5 is an encoder-decoder model,
# so we load it with AutoModelForSeq2SeqLM (AutoModelWithLMHead is deprecated)
model = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/t5-base-finetuned-summarize-news")

# JFYI: This model here is a little under 900MB in size



In [29]:
# Nomen est omen! Here we take in text, tokenize it, and generate a summary
def summarize(text, max_length=150):
    # Turn the input text into a batch of token IDs (a PyTorch tensor)
    input_ids = tokenizer.encode(text, return_tensors="pt", add_special_tokens=True)
    # Generate with beam search; max_length caps the summary length in tokens
    generated_ids = model.generate(
        input_ids=input_ids, num_beams=2, max_length=max_length,
        repetition_penalty=2.5, length_penalty=1.0, early_stopping=True,
    )
    # Decode the first (and only) generated sequence back into text
    return tokenizer.decode(generated_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
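
The `num_beams=2` argument makes `generate` use beam search instead of always picking the single most likely next token. The sketch below is not how the transformers library implements it (for example, it ignores the length penalty); it's a toy with made-up probabilities, and `TOY_MODEL` and `beam_search` are hypothetical names, meant only to show why keeping two candidates around can pay off.

```python
import math

# Hypothetical toy "model": maps the last token to next-token probabilities.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.4, "dog": 0.35, "</s>": 0.25},
    "a": {"dog": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def beam_search(num_beams, max_length=5):
    # Each beam is a pair: (log-probability so far, list of tokens)
    beams = [(0.0, ["<s>"])]
    for _ in range(max_length):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "</s>":
                candidates.append((score, tokens))  # finished beams carry over
                continue
            for token, prob in TOY_MODEL[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Keep only the num_beams highest-scoring candidates
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
        if all(tokens[-1] == "</s>" for _, tokens in beams):
            break
    return beams[0][1]

greedy = beam_search(num_beams=1)  # one beam = greedy decoding
beam = beam_search(num_beams=2)
print(greedy)
print(beam)
```

With one beam the toy commits to "the" early (probability 0.6) and ends up with a sequence of total probability 0.24. With two beams it also keeps "a" alive and finds "a dog", which has total probability 0.36.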

In [30]:
# An example we'll run through our summary function
# Each line ends with a space before the backslash so that words are not glued together at the line breaks
news_to_be_summarized = "BERLIN—Germany’s political parties on Monday began what could \
be a monthslong bargaining process to form the next government, with two smaller parties \
in a position to decide who will succeed Angela Merkel as chancellor. \
Sunday’s election marked a leftward shift for the country, with the center-left \
Social Democratic Party, or SPD, coming first and the Greens scoring strong gains. \
But together they don’t hold enough seats in parliament to form a government and \
would need to bring in the pro-market Free Democratic Party as a third partner, \
forcing them to dilute their agenda. \
Other constellations are arithmetically possible, some of them involving the \
defeated conservatives, complicating talks. \
Whatever the shape of the next government, it will likely be broadly centrist—like \
the Merkel-led left-right alliance that preceded it—because many of the partners’ \
more radical or controversial proposals could cancel each other out. \
It is also likely to have a strong focus on measures to combat climate \
change, which all four parties highlighted in their campaigns and opinion polls \
show is the dominant concern for German voters. Such a focus could have far-reaching \
implications for an economy where manufacturing, especially car making, \
plays an outsize role. \
Yet the negotiations to get there could take months. And for the first time \
they will hinge on the Greens and the FDP, Germany’s new kingmakers. The two \
parties said on Sunday that they would talk to each other before entering \
negotiations with the bigger conservative bloc and the SPD. \
The center-left Greens stand for climate policies and social justice while the FDP \
is a pro-business group that has called for tax cuts and a smaller state. While \
they both qualify as centrist parties, their platforms have little overlap. \
Courting them are Olaf Scholz, the SPD candidate who secured a narrow victory \
with 25.7% of the vote, and Armin Laschet, the conservative candidate who \
delivered his party’s worst-ever result of 24.1%."

# Let's see how well our summarizer does ...
print("+++Summarized News+++")
summarize(news_to_be_summarized)

+++Summarized News+++


'smaller parties on Monday began what could be a monthslong bargaining process to form the next government, with smaller parties in a position to decide who will succeed Angela Merkel as chancellor. The SPD came first and the Greens scored strong gains. Meanwhile, the FDP is a pro-business group that has called for tax cuts and a smaller state. Notably, the next government will likely be broadly centrist, with many of the partners’ more radical or controversial proposals cancelling each other out. But the negotiations could take months.'

**Looks quite good! Let's now try to integrate this summary-functionality into a small User Interface.**

In [31]:
# A small User Interface 
textbox1 = widgets.Text(description='Input Text');display(textbox1); 
button = widgets.Button(description='Summarize!', layout=Layout(width='200px')); 
button.style.button_color = 'lightblue';display(button); 
textbox2 = widgets.Text(description='Summary');display(textbox2); 

# Connect Widget to Summary function (jfyi: can take a few seconds to summarize)
def on_button_clicked(sender): 
    input_text = textbox1.value
    summary = summarize(input_text)
    textbox2.value = summary
    
button.on_click(on_button_clicked)

Text(value='', description='Input Text')

Button(description='Summarize!', layout=Layout(width='200px'), style=ButtonStyle(button_color='lightblue'))

Text(value='', description='Summary')

## 1.2 Topic Modeling with scikit-learn and pyLDAvis

**In this part of the notebook we'll take a text and generate a summary. Summarization? Again? Yes! But we'll generate a way cooler summary this time - a so-called "topic map".**

In [32]:
# Imports
from __future__ import print_function
import pyLDAvis
# Note: newer pyLDAvis releases (3.4+) provide this module as pyLDAvis.lda_model
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [33]:
# Load data
newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
docs_raw = newsgroups.data
print(len(docs_raw))

11314


In [34]:
# Process text (e.g., remove stopwords)
tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                stop_words = 'english',
                                lowercase = True,
                                token_pattern = r'\b[a-zA-Z]{3,}\b',
                                max_df = 0.5, 
                                min_df = 10)
dtm_tf = tf_vectorizer.fit_transform(docs_raw)
tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())
dtm_tfidf = tfidf_vectorizer.fit_transform(docs_raw)
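
To get a feeling for what `token_pattern`, `stop_words`, `min_df` and `max_df` do, here is a pure-Python sketch of the same idea on four toy sentences. It is not scikit-learn's implementation: `toy_docs`, `STOP_WORDS` and `build_vocabulary` are made up for illustration, and the thresholds are shrunk to fit the tiny corpus (an integer `min_df` is a document count, a float `max_df` is a fraction of all documents).

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b[a-zA-Z]{3,}\b")  # same pattern as in the cell above
STOP_WORDS = {"the", "and", "for"}               # tiny stand-in for the 'english' list

toy_docs = [
    "The engine of the car is loud",
    "The car engine needs oil",
    "Space probes need fuel and a huge engine",
    "The space station orbits the Earth",
]

def tokenize(doc):
    # Lowercase, keep alphabetic tokens with 3+ letters, drop stop words
    return [t for t in TOKEN_PATTERN.findall(doc.lower()) if t not in STOP_WORDS]

def build_vocabulary(docs, min_df, max_df):
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    max_count = max_df * len(docs)
    # Keep terms that are neither too rare nor too common
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)

vocabulary = build_vocabulary(toy_docs, min_df=2, max_df=0.5)
# Document-term matrix: one row per document, one column per vocabulary term
dtm = [[tokenize(doc).count(term) for term in vocabulary] for doc in toy_docs]
print(vocabulary)
print(dtm)
```

Here "engine" is dropped because it shows up in three of four documents (more than `max_df=0.5` allows), and words like "loud" or "oil" are dropped because they appear in only one document (fewer than `min_df=2`).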

In [35]:
# for TF DTM
lda_tf = LatentDirichletAllocation(n_components=20, random_state=0)
lda_tf.fit(dtm_tf)
# for TFIDF DTM
lda_tfidf = LatentDirichletAllocation(n_components=20, random_state=0)
lda_tfidf.fit(dtm_tfidf)



LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=20, n_jobs=1,
             n_topics=None, perp_tol=0.1, random_state=0,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)
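
A fitted LDA model is essentially a topics-by-words weight matrix (`lda_tf.components_` in scikit-learn), so topics can also be inspected as plain text by listing the highest-weighted words per row. The sketch below uses a small made-up matrix instead of the real one; `feature_names`, `topic_word` and `top_words` are illustrative names and the numbers carry no meaning.

```python
# Made-up topic-word weights standing in for lda_tf.components_
# (rows = topics, columns = vocabulary terms); the numbers are illustrative only.
feature_names = ["car", "engine", "orbit", "space", "team", "game"]
topic_word = [
    [9.1, 7.4, 0.2, 0.3, 0.1, 0.5],  # topic 0
    [0.2, 1.1, 6.8, 8.9, 0.1, 0.2],  # topic 1
    [0.3, 0.1, 0.2, 0.4, 5.5, 7.7],  # topic 2
]

def top_words(weights, names, n=2):
    # Sort term indices by descending weight and keep the first n
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [names[i] for i in ranked[:n]]

topics = [top_words(row, feature_names) for row in topic_word]
for idx, words in enumerate(topics):
    print(f"Topic {idx}: {', '.join(words)}")
```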

In [36]:
pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer)



# **2 - How to classify**
## 2.1 Text Classification with Pre-Trained Word Embeddings

**In this part of the notebook we'll have a look at a text classifier. And this time, because classification is somewhat task-specific, we actually have to train a model.**

In [37]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.manifold import TSNE
from sklearn.feature_extraction.text import TfidfVectorizer

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dense, Dropout, Embedding, Flatten, Conv1D, MaxPooling1D, LSTM
from tensorflow.keras import utils
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

import gensim

import re
import numpy as np
import os
from collections import Counter
import time
import pickle
import itertools

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/christoph/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [38]:
# Dataset-related parameters
DATASET_COLUMNS = ["target", "ids", "date", "flag", "user", "text"]
DATASET_ENCODING = "ISO-8859-1"
TRAIN_SIZE = 0.8
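# Matches @mentions, links, and runs of non-alphanumeric characters (raw string avoids invalid-escape warnings)
TEXT_CLEANING_RE = r"@\S+|https?:\S+|[^A-Za-z0-9]+"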

# Word2Vec-Model parameters 
W2V_SIZE = 300
W2V_WINDOW = 7
W2V_EPOCH = 32
W2V_MIN_COUNT = 10

# Training parameters
SEQUENCE_LENGTH = 300
EPOCHS = 4
BATCH_SIZE = 1024

# Sentiment decoding
POSITIVE = "POSITIVE"
NEGATIVE = "NEGATIVE"
NEUTRAL = "NEUTRAL"
SENTIMENT_THRESHOLDS = (0.4, 0.7)

In [39]:
# Link to dataset: https://www.kaggle.com/kazanova/sentiment140
dataset_filename = os.listdir("./sentiment140/")[0]
dataset_path = os.path.join("./sentiment140/", dataset_filename)

print("Open file:", dataset_path)
df = pd.read_csv(dataset_path, encoding=DATASET_ENCODING, names=DATASET_COLUMNS)

# Cut the dataset to make things a little faster (every 100th row)
df = df.iloc[::100, :].copy()  # .copy() avoids pandas' SettingWithCopyWarning on later assignments
print("Dataset size:", len(df))

Open file: ./sentiment140/training.1600000.processed.noemoticon.csv
Dataset size: 16000


In [40]:
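# Map the numeric Sentiment140 labels to readable class names
decode_map = {0: "NEGATIVE", 2: "NEUTRAL", 4: "POSITIVE"}
def decode_label(label):
    # Named decode_label so it does not clash with decode_sentiment(score, ...) defined further down
    return decode_map[int(label)]

df.target = df.target.apply(decode_label)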
stop_words = set(stopwords.words("english"))  # a set makes the membership test in preprocess() fast
stemmer = SnowballStemmer("english")

In [41]:
def preprocess(text, stem=False):
    # Remove links, user mentions and special characters
    text = re.sub(TEXT_CLEANING_RE, ' ', str(text).lower()).strip()
    tokens = []
    for token in text.split():
        if token not in stop_words:
            if stem:
                tokens.append(stemmer.stem(token))
            else:
                tokens.append(token)
    return " ".join(tokens)
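
To see what this cleaning step does to a tweet, here is a stand-alone sketch. It assumes a cleaning pattern without the stray `http?:\S` alternative and swaps NLTK's stop-word list for a tiny made-up set (`TOY_STOP_WORDS`), so it runs without any downloads; the tweet itself is invented.

```python
import re

CLEANING_RE = r"@\S+|https?:\S+|[^A-Za-z0-9]+"
TOY_STOP_WORDS = {"i", "the", "is", "at", "a", "this"}  # stand-in for NLTK's list

def toy_preprocess(text):
    # Replace mentions, links and non-alphanumeric runs with a space
    text = re.sub(CLEANING_RE, " ", str(text).lower()).strip()
    # Drop stop words and rejoin the remaining tokens
    return " ".join(t for t in text.split() if t not in TOY_STOP_WORDS)

tweet = "@TUM_ai I LOVED the talk!!! Slides: https://example.com/slides :)"
cleaned = toy_preprocess(tweet)
print(cleaned)
```

The mention, the link, the punctuation and the stop words are gone; only the lowercased content words remain.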

In [42]:
df.text = df.text.apply(lambda x: preprocess(x))
df_train, df_test = train_test_split(df, test_size=1-TRAIN_SIZE, random_state=42)
documents = [_text.split() for _text in df_train.text] 

In [43]:
# Note: this is the gensim 3.x API; gensim 4.x renamed `size` to `vector_size`
w2v_model = gensim.models.word2vec.Word2Vec(size=W2V_SIZE, 
                                            window=W2V_WINDOW, 
                                            min_count=W2V_MIN_COUNT, 
                                            workers=8)

w2v_model.build_vocab(documents)

In [44]:
# Note: gensim 4.x replaced `wv.vocab` with `wv.key_to_index`
words = w2v_model.wv.vocab.keys()
vocab_size = len(words)
print("Vocab size", vocab_size)

Vocab size 1405


In [45]:
# Train w2v-model and tokenize text
w2v_model.train(documents, total_examples=len(documents), epochs=W2V_EPOCH)
# Careful: this rebinds the global name `tokenizer`, which summarize() in 1.1 relies on
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df_train.text)

vocab_size = len(tokenizer.word_index) + 1
print("Total words", vocab_size)

Total words 16233
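
Why 1405 words in one vocabulary and 16233 in the other? Word2Vec throws away every word that occurs fewer than `W2V_MIN_COUNT` times, while the Keras `Tokenizer` indexes every word it sees (plus one extra slot, because index 0 is reserved for padding). A toy sketch of that difference, with made-up documents and a hypothetical `TOY_MIN_COUNT`:

```python
from collections import Counter

toy_documents = [
    ["good", "morning", "world"],
    ["good", "night", "world"],
    ["good", "coffee"],
]
TOY_MIN_COUNT = 2  # hypothetical; the notebook uses W2V_MIN_COUNT = 10

counts = Counter(token for doc in toy_documents for token in doc)

# Word2Vec-style vocabulary: only words seen at least TOY_MIN_COUNT times
w2v_vocab = {word for word, n in counts.items() if n >= TOY_MIN_COUNT}

# Keras-Tokenizer-style index: every word gets an index, starting at 1,
# most frequent first; index 0 is reserved for padding
word_index = {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

print(len(w2v_vocab))       # words that would get an embedding
print(len(word_index) + 1)  # rows needed in the embedding matrix
```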


In [46]:
# Preprocessing 1
x_train = pad_sequences(tokenizer.texts_to_sequences(df_train.text), maxlen=SEQUENCE_LENGTH)
x_test = pad_sequences(tokenizer.texts_to_sequences(df_test.text), maxlen=SEQUENCE_LENGTH)

labels = df_train.target.unique().tolist()
labels.append(NEUTRAL)
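
`pad_sequences` brings every tweet to exactly `SEQUENCE_LENGTH` token IDs. With the Keras defaults, short sequences are padded with zeros at the front and long ones are cut off at the front. The helper below, `toy_pad_sequences`, is a hypothetical pure-Python imitation of that default behaviour, not the Keras function itself.

```python
def toy_pad_sequences(sequences, maxlen, value=0):
    # Mimics the Keras defaults: pad at the front, truncate from the front
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                           # keep the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)  # left-pad with zeros
    return padded

padded = toy_pad_sequences([[5, 2], [7, 1, 4, 9, 3, 8]], maxlen=4)
print(padded)
```

Since tweets are short and `SEQUENCE_LENGTH` is 300, most rows of `x_train` will consist mainly of leading zeros.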

In [47]:
# Preprocessing 2
encoder = LabelEncoder()
encoder.fit(df_train.target.tolist())
y_train = encoder.transform(df_train.target.tolist())
y_test = encoder.transform(df_test.target.tolist())
y_train = y_train.reshape(-1,1)
y_test = y_test.reshape(-1,1)

embedding_matrix = np.zeros((vocab_size, W2V_SIZE))
for word, i in tokenizer.word_index.items():
    if word in w2v_model.wv:
        embedding_matrix[i] = w2v_model.wv[word]
print(embedding_matrix.shape)

(16233, 300)
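
The loop above copies a Word2Vec vector into row `i` for every word the Word2Vec model knows; all other rows stay zero. Since only 1405 words survived the `min_count` filter, most of the 16233 rows are presumably all zeros, and with `trainable=False` they stay that way. A miniature version with made-up vectors (`TOY_DIM`, `toy_word_index` and `toy_vectors` are illustrative stand-ins):

```python
TOY_DIM = 3  # hypothetical; the notebook uses W2V_SIZE = 300

# Stand-ins for tokenizer.word_index and the trained Word2Vec vectors
toy_word_index = {"good": 1, "world": 2, "coffee": 3}
toy_vectors = {"good": [0.1, 0.2, 0.3], "world": [0.4, 0.5, 0.6]}

# One row per index; row 0 stays all zeros because index 0 is the padding index
toy_matrix = [[0.0] * TOY_DIM for _ in range(len(toy_word_index) + 1)]
for word, i in toy_word_index.items():
    if word in toy_vectors:  # "coffee" has no vector, so its row stays zero
        toy_matrix[i] = toy_vectors[word]

for row in toy_matrix:
    print(row)
```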


In [48]:
# Build the model
embedding_layer = Embedding(vocab_size, W2V_SIZE, weights=[embedding_matrix], input_length=SEQUENCE_LENGTH, trainable=False)
# Careful: this rebinds the global name `model`, which summarize() in 1.1 relies on
model = Sequential()
model.add(embedding_layer)
model.add(Dropout(0.5))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

In [49]:
# Get the model ready for training
model.compile(loss='binary_crossentropy',
              optimizer="adam",
              metrics=['accuracy'])

callbacks = [ReduceLROnPlateau(monitor='val_loss', patience=5, cooldown=0)]

In [50]:
# Train the model
history = model.fit(x_train, y_train,
                    batch_size=BATCH_SIZE,
                    epochs=EPOCHS,
                    validation_split=0.1,
                    verbose=1,
                    callbacks=callbacks)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


In [51]:
def decode_sentiment(score, include_neutral=True):
    if include_neutral:        
        label = NEUTRAL
        if score <= SENTIMENT_THRESHOLDS[0]:
            label = NEGATIVE
        elif score >= SENTIMENT_THRESHOLDS[1]:
            label = POSITIVE
        return label
    else:
        return NEGATIVE if score < 0.5 else POSITIVE
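
The network has a single sigmoid output, so NEUTRAL is not a class it learned; it's simply the band between the two thresholds. The stand-alone sketch below repeats the same decision logic (`toy_decode` is a hypothetical copy so the snippet runs on its own) and applies it to three made-up scores.

```python
THRESHOLDS = (0.4, 0.7)  # same values as SENTIMENT_THRESHOLDS above

def toy_decode(score, include_neutral=True):
    if not include_neutral:
        # Plain binary decision at 0.5
        return "NEGATIVE" if score < 0.5 else "POSITIVE"
    if score <= THRESHOLDS[0]:
        return "NEGATIVE"
    if score >= THRESHOLDS[1]:
        return "POSITIVE"
    return "NEUTRAL"  # everything strictly between the two thresholds

scores = [0.15, 0.55, 0.92]
with_neutral = [toy_decode(s) for s in scores]
binary = [toy_decode(s, include_neutral=False) for s in scores]
print(with_neutral)
print(binary)
```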

In [52]:
def predict(text, include_neutral=True):
    x_test = pad_sequences(tokenizer.texts_to_sequences([text]), maxlen=SEQUENCE_LENGTH)
    # predict() returns an array of shape (1, 1); pull out the single scalar score
    score = model.predict(x_test)[0][0]
    label = decode_sentiment(score, include_neutral=include_neutral)
    return {"label": label, "score": float(score)}  

In [53]:
predict("I love TUM.ai!")

{'label': 'POSITIVE', 'score': 0.7013869285583496}

**Let's package all this in a small UI again.**

In [54]:
# A small User Interface 
textbox3 = widgets.Text(description='Input Text');display(textbox3); 
button2 = widgets.Button(description='Classify!', layout=Layout(width='200px')); 
button2.style.button_color = 'lightblue';display(button2); 
textbox4 = widgets.Text(description='Label');display(textbox4); 

# You know the drill ...
def on_button_clicked2(sender): 
    input_text = textbox3.value
    prediction = predict(input_text)
    textbox4.value = prediction["label"]
    
button2.on_click(on_button_clicked2)

Text(value='', description='Input Text')

Button(description='Classify!', layout=Layout(width='200px'), style=ButtonStyle(button_color='lightblue'))

Text(value='', description='Label')

# **3 - How to generate**
## 3.1 Text Generation with Hugging Face

**Tool-wise, this part of the notebook is rather similar to part 1: there's not much code, but depending on the quality and length of the text you want to generate, you may need some big (2GB+) models.**

In [55]:
# get transformers
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

# Careful: this rebinds the global name `tokenizer` again, which predict() in part 2 relies on
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
GPT2 = TFGPT2LMHeadModel.from_pretrained("gpt2-medium", pad_token_id=tokenizer.eos_token_id)


All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2-medium.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [61]:
MAX_LEN = 23
input_sequence = "I don't know about you, but"

# encode context the generation is conditioned on
input_ids = tokenizer.encode(input_sequence, return_tensors='tf')
# generate text until the output length (which includes the context length) reaches MAX_LEN tokens
greedy_output = GPT2.generate(input_ids, max_length = MAX_LEN)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens = True))

Output:
----------------------------------------------------------------------------------------------------
I don't know about you, but I'm not going to be able to afford to buy a new car.
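
Called like this, without beams or sampling, `generate` decodes greedily: at every step it appends the single most probable next token until `max_length` is reached. The toy below shows that loop with a hypothetical lookup table (`TOY_NEXT`) in place of GPT-2; a real model scores the next token from the whole context, not just the last word.

```python
# Hypothetical next-token probabilities; a real model computes these from the
# whole context, not just the last token.
TOY_NEXT = {
    "but": {"I": 0.7, "we": 0.3},
    "I": {"am": 0.6, "think": 0.4},
    "we": {"are": 1.0},
    "am": {"tired": 0.55, "happy": 0.45},
    "think": {"so": 1.0},
}

def greedy_generate(prompt_tokens, max_length):
    tokens = list(prompt_tokens)
    # max_length counts the prompt tokens too, as in the call above
    while len(tokens) < max_length and tokens[-1] in TOY_NEXT:
        options = TOY_NEXT[tokens[-1]]
        tokens.append(max(options, key=options.get))  # always take the most probable token
    return tokens

output = greedy_generate(["you", ",", "but"], max_length=5)
print(" ".join(output))
```

Because the choice at each step is deterministic, the same prompt always yields the same continuation.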
