# **NLP for FFXIV.**

Decided to kill a Saturday with some NLP based on FFXIV lore.

Corpus:

"Yours is a long road, my friend, and it stretches on to places beyond imagining. With your every step, these grand adventures shall grow more distant and faint. And there may come a day when you forget the faces and voices of those you have met along the way. On that day, I bid you remember this... That no matter how far your journey may take you, you stand where you stand by virtue of the road you walked to get there. For in times of hardship, when you fear you cannot go on... The joy you have known, the pain you have felt, the prayers you have whispered and answered—they shall ever be your strength and your comfort. This I hope—I believe, here at memory's end."

- G'raha Tia to the Warrior of Darkness

We're going to use parts of this corpus throughout the notebook.

## Document Term Matrix

In [1]:
# import libraries.
from sklearn.feature_extraction.text import TfidfVectorizer

import numpy as np
import pandas as pd

# G'raha Tia's line to the Warrior of Darkness.
corpus = [
    "Yours is a long road, my friend,",
    "and it stretches on to places beyond imagining.",
    "With your every step, these grand adventures shall grow more distant and faint.",
]

vectorizer = TfidfVectorizer()
corpus_tfidf = vectorizer.fit_transform(corpus)
print(f"The vocabulary size is {len(vectorizer.vocabulary_)}")
print(f"The document-term matrix shape is {corpus_tfidf.shape}")

# Turn the TF-IDF matrix into a dataframe with one column per term.
df = pd.DataFrame(np.round(corpus_tfidf.toarray(), 2))
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
df.columns = vectorizer.get_feature_names_out()
# Document-term matrix.
print(df)

The vocabulary size is 26
The document-term matrix shape is (3, 26)
   adventures   and  beyond  distant  every  ...  these    to  with  your  yours
0        0.00  0.00    0.00     0.00   0.00  ...   0.00  0.00  0.00  0.00   0.41
1        0.00  0.28    0.36     0.00   0.00  ...   0.00  0.36  0.00  0.00   0.00
2        0.28  0.21    0.00     0.28   0.28  ...   0.28  0.00  0.28  0.28   0.00

[3 rows x 26 columns]
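
To see where these numbers come from, here is a minimal standard-library sketch that recomputes the weights by hand. It assumes the `TfidfVectorizer` defaults: lowercasing, tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names (`tokenize`, `tfidf_row`) are made up for illustration.

```python
import math
import re

corpus = [
    "Yours is a long road, my friend,",
    "and it stretches on to places beyond imagining.",
    "With your every step, these grand adventures shall grow more distant and faint.",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # which drops the single-letter word "a".
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(line) for line in corpus]
vocabulary = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = {term: sum(term in doc for doc in docs) for term in vocabulary}
# Smoothed inverse document frequency (assumed scikit-learn default).
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}

def tfidf_row(doc):
    # Raw term count times IDF, then scale the row to unit L2 length.
    weights = {term: doc.count(term) * idf[term] for term in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

rows = [tfidf_row(doc) for doc in docs]

print(len(vocabulary))
print(round(rows[0]["yours"], 2), round(rows[1]["and"], 2), round(rows[2]["and"], 2))
```

"and" is the only term shared by two lines, so it gets a lower IDF than everything else, which is why its weights (0.28 and 0.21) sit below those of its neighbours in the same rows.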


## Classification

In [2]:
# SVM classifier trained on the TF-IDF features.
from sklearn.svm import SVC

# Toy labels, one per corpus line.
labels = [0, 1, 0]

clf = SVC()
clf.fit(df.to_numpy(), labels)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

## Prediction

In [3]:
# Predict on the training rows themselves, so this only checks that the model fit them.
clf.predict(df.to_numpy())

array([0, 1, 0])
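
scikit-learn's `make_pipeline` can chain the vectorizer and the SVM so that raw text goes in directly. The following is only a sketch of that idea, not necessarily what was intended here. The `unseen` sentence is borrowed from the full quote, and with three training lines the prediction for it carries no real meaning.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

corpus = [
    "Yours is a long road, my friend,",
    "and it stretches on to places beyond imagining.",
    "With your every step, these grand adventures shall grow more distant and faint.",
]
labels = [0, 1, 0]  # same toy labels as above

# The pipeline vectorizes the text and then feeds the features to the SVM.
text_clf = make_pipeline(TfidfVectorizer(), SVC())
text_clf.fit(corpus, labels)

# Predictions on the training lines.
predictions = text_clf.predict(corpus)

# A line the model has not seen; the vectorizer ignores words outside its vocabulary.
unseen = ["And there may come a day when you forget the faces and voices"]
unseen_prediction = text_clf.predict(unseen)

print(predictions, unseen_prediction)
```

Keeping the vectorizer inside the pipeline means new text is transformed with the vocabulary learned during `fit`, so there is no separate dataframe to keep in sync.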

# Transformers

In [None]:
!pip install transformers

In [11]:
# We download a pretrained tokenizer and model for BERT.
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [12]:
tokenizer(corpus)

{'input_ids': [[101, 6737, 2003, 1037, 2146, 2346, 1010, 2026, 2767, 1010, 102], [101, 1998, 2009, 14082, 2006, 2000, 3182, 3458, 16603, 1012, 102], [101, 2007, 2115, 2296, 3357, 1010, 2122, 2882, 7357, 4618, 4982, 2062, 6802, 1998, 8143, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
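
Each sequence starts with 101 and ends with 102, the `[CLS]` and `[SEP]` ids in `bert-base-uncased`. The three sequences have different lengths, so they cannot be stacked into one tensor as they are. Passing `padding=True` to the tokenizer handles that; the sketch below imitates the idea with plain lists, assuming right-padding with id 0 (`[PAD]`). `pad_batch` is a made-up helper, not a library function.

```python
# Token ids copied from the tokenizer output above.
input_ids = [
    [101, 6737, 2003, 1037, 2146, 2346, 1010, 2026, 2767, 1010, 102],
    [101, 1998, 2009, 14082, 2006, 2000, 3182, 3458, 16603, 1012, 102],
    [101, 2007, 2115, 2296, 3357, 1010, 2122, 2882, 7357, 4618, 4982,
     2062, 6802, 1998, 8143, 1012, 102],
]

CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased

def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore.
    masks = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return padded, masks

padded_ids, attention_mask = pad_batch(input_ids)

print([len(ids) for ids in input_ids])   # lengths before padding
print([len(ids) for ids in padded_ids])  # lengths after padding
```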

In [15]:
# return_tensors="pt" makes the tokenizer return PyTorch tensors.
encoded_input = tokenizer("The joy you have known, the pain you have felt, the prayers you have whispered and answered—they shall ever be your strength and your comfort. This I hope—I believe, here at memory's end.", return_tensors="pt")

In [16]:
# The output holds the contextual embeddings, one vector per token, in output.last_hidden_state.
output = model(**encoded_input)

## Fill Mask test



In [25]:
from transformers import pipeline

unmasker = pipeline('fill-mask', model='bert-base-uncased', top_k=10)
unmasker("That no matter how far your [MASK] may take you.")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.3505774736404419,
  'sequence': 'that no matter how far your life may take you.',
  'token': 2166,
  'token_str': 'life'},
 {'score': 0.04271148145198822,
  'sequence': 'that no matter how far your actions may take you.',
  'token': 4506,
  'token_str': 'actions'},
 {'score': 0.030055005103349686,
  'sequence': 'that no matter how far your career may take you.',
  'token': 2476,
  'token_str': 'career'},
 {'score': 0.026767902076244354,
  'sequence': 'that no matter how far your journey may take you.',
  'token': 4990,
  'token_str': 'journey'},
 {'score': 0.014478245750069618,
  'sequence': 'that no matter how far your past may take you.',
  'token': 2627,
  'token_str': 'past'},
 {'score': 0.012551093474030495,
  'sequence': 'that no matter how far your dreams may take you.',
  'token': 5544,
  'token_str': 'dreams'},
 {'score': 0.011550728231668472,
  'sequence': 'that no matter how far your feelings may take you.',
  'token': 5346,
  'token_str': 'feelings'},
 {'score'

In [27]:
# Let's put this in a dataframe.
pd.DataFrame(unmasker("That no matter how far your [MASK] may take you."))

| | sequence | score | token | token_str |
|---|---|---|---|---|
| 0 | that no matter how far your life may take you. | 0.350577 | 2166 | life |
| 1 | that no matter how far your actions may take you. | 0.042711 | 4506 | actions |
| 2 | that no matter how far your career may take you. | 0.030055 | 2476 | career |
| 3 | that no matter how far your journey may take you. | 0.026768 | 4990 | journey |
| 4 | that no matter how far your past may take you. | 0.014478 | 2627 | past |
| 5 | that no matter how far your dreams may take you. | 0.012551 | 5544 | dreams |
| 6 | that no matter how far your feelings may take ... | 0.011551 | 5346 | feelings |
| 7 | that no matter how far your choices may take you. | 0.009691 | 9804 | choices |
| 8 | that no matter how far your quest may take you. | 0.007554 | 8795 | quest |
| 9 | that no matter how far your father may take you. | 0.007221 | 2269 | father |


Findings: "journey", the word actually used in the corpus, ranked fourth. Not bad.
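
The rank can also be looked up in code rather than read off the table. A small sketch, with the token and score pairs copied from the dataframe above as literal data so that it runs without downloading the model:

```python
# (token_str, score) pairs copied from the dataframe above, already sorted by score.
predictions = [
    ("life", 0.350577),
    ("actions", 0.042711),
    ("career", 0.030055),
    ("journey", 0.026768),
    ("past", 0.014478),
    ("dreams", 0.012551),
    ("feelings", 0.011551),
    ("choices", 0.009691),
    ("quest", 0.007554),
    ("father", 0.007221),
]

target = "journey"  # the word used in the original quote
tokens = [token for token, _ in predictions]
rank = tokens.index(target) + 1  # 1-based rank

# How much probability the ten candidates cover together.
top10_mass = sum(score for _, score in predictions)

print(f"'{target}' is ranked {rank} of {len(predictions)}")
print(f"Probability mass covered by the top 10: {top10_mass:.3f}")
```

The ten candidates together cover only about half of the probability mass, and "life" alone takes 0.35 of it, so the remaining guesses are all fairly close to each other.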

## Text Generation with GPT-2

In [31]:
from transformers import set_seed

generator = pipeline("text-generation", model='gpt2')
set_seed(42)

generator("For in times of hardship, when you fear you cannot go on...", max_length=30, num_return_sequences=5) 

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'For in times of hardship, when you fear you cannot go on...what can you do? For once you can, and we think you can."'},
 {'generated_text': 'For in times of hardship, when you fear you cannot go on... and you are outstretched for comfort, how has being with so many of you'},
 {'generated_text': 'For in times of hardship, when you fear you cannot go on...the world is a place of safety and strength. A place of happiness and comfort'},
 {'generated_text': "For in times of hardship, when you fear you cannot go on... you will see... The sun's shining behind his light. Then you realize you"},
 {'generated_text': 'For in times of hardship, when you fear you cannot go on...'}]

Findings: Not even close to the original. Then again, I wasn't expecting it to match word for word.
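
One rough way to put a number on "not even close" is word overlap. The sketch below computes the Jaccard similarity between each generated continuation and the line that follows the prompt in the original quote. The metric is my choice for illustration, not something GPT-2 or the pipeline reports, and the continuations are copied from the output above.

```python
import re

# The line that follows the prompt in the original quote (\u2014 is the dash in the source).
original = (
    "The joy you have known, the pain you have felt, the prayers you have "
    "whispered and answered\u2014they shall ever be your strength and your comfort."
)

# Text after the prompt in each generated sequence, copied from the output above.
continuations = [
    'what can you do? For once you can, and we think you can."',
    " and you are outstretched for comfort, how has being with so many of you",
    "the world is a place of safety and strength. A place of happiness and comfort",
    " you will see... The sun's shining behind his light. Then you realize you",
    "",  # the last sequence returned only the prompt
]

def words(text):
    # Lowercased set of words; apostrophes stay inside words like "sun's".
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard(a, b):
    # Shared words divided by all distinct words in either set.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

reference = words(original)
scores = [jaccard(words(text), reference) for text in continuations]

for text, score in zip(continuations, scores):
    print(f"{score:.2f}  {text.strip()[:40]!r}")
```

The third continuation shares the most words with the original ("strength" and "comfort" both show up), but even that overlap is mostly common words like "the" and "and".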

## Zero-shot classification

In [32]:
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

There's a character in the game named Giott who insults you differently depending on your character's race.

In [34]:
sequence_to_classify = "Hey, you! Honorless lout! Half-empty flagon! Lanky dullard! Your mother was a hob and your face looks like a newborn's arse!"
candidate_labels = ['Hyur', 'Elezen', 'Lalafell', 'Miqote', 'Roegadyn', 'Au Ra', 'Viera', 'Hrothgar']

classifier(sequence_to_classify, candidate_labels)

{'labels': ['Hyur',
  'Miqote',
  'Viera',
  'Au Ra',
  'Hrothgar',
  'Elezen',
  'Lalafell',
  'Roegadyn'],
 'scores': [0.24867142736911774,
  0.13656800985336304,
  0.12068362534046173,
  0.11791294068098068,
  0.10476922243833542,
  0.09226266294717789,
  0.09110128879547119,
  0.0880308747291565],
 'sequence': "Hey, you! Honorless lout! Half-empty flagon! Lanky dullard! Your mother was a hob and your face looks like a newborn's arse!"}

Seems to have guessed right that this insult is aimed at a Hyur, though with a score of only about 0.25 it wasn't very sure.
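
To get a feel for how confident that guess is, the scores can be compared with a uniform guess over the eight races. A minimal sketch, using the result printed above as literal data (the scores sum to roughly 1 because the pipeline, in its default single-label mode, normalizes across the candidate labels):

```python
# Labels and scores copied from the classifier output above.
result = {
    "labels": ["Hyur", "Miqote", "Viera", "Au Ra", "Hrothgar", "Elezen", "Lalafell", "Roegadyn"],
    "scores": [0.24867142736911774, 0.13656800985336304, 0.12068362534046173,
               0.11791294068098068, 0.10476922243833542, 0.09226266294717789,
               0.09110128879547119, 0.0880308747291565],
}

top_label = result["labels"][0]
top_score = result["scores"][0]
runner_up_score = result["scores"][1]

# Score each label would get if the model had no preference at all.
uniform = 1 / len(result["labels"])

margin = top_score - runner_up_score
ratio = top_score / uniform

print(f"Top label: {top_label} ({top_score:.3f})")
print(f"Margin over runner-up: {margin:.3f}")
print(f"Top score vs. uniform guess ({uniform:.3f}): {ratio:.2f}x")
```

The top score is about twice what a uniform guess would give, so Hyur is a clear first choice but far from a certain one.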

## Summarization

In [35]:
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig

In [36]:
model = BartForConditionalGeneration.from_pretrained('sshleifer/distilbart-cnn-12-6')

tokenizer = BartTokenizer.from_pretrained('sshleifer/distilbart-cnn-12-6')

nlp=pipeline("summarization", model=model, tokenizer=tokenizer)

In [37]:
text=""""Yours is a long road, my friend, and it stretches on to places beyond imagining. With your every step, these grand adventures shall grow more distant and faint. And there may come a day when you forget the faces and voices of those you have met along the way. On that day, I bid you remember this... That no matter how far your journey may take you, you stand where you stand by virtue of the road you walked to get there. For in times of hardship, when you fear you cannot go on... The joy you have known, the pain you have felt, the prayers you have whispered and answered—they shall ever be your strength and your comfort. This I hope—I believe, here at memory's end."""

In [38]:
q=nlp(text)

import pprint
pp = pprint.PrettyPrinter(indent=0, width=100)
pp.pprint(q[0]['summary_text'])

(' "No matter how far your journey may take you, you stand where you stand by virtue of the road '
 'you walked to get there," he says . "The joy you have known, the pain you have felt, the prayers '
 'you have whispered and answered—they shall ever be your strength and your comfort," he writes .')
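
The summary looks extractive: both quoted spans seem to be lifted straight from the speech, while "he says" and "he writes" appear nowhere in it. My guess is that these attributions come from the news-style data this checkpoint was fine-tuned on, but that is an assumption. A small sketch to check the verbatim part, with both texts copied from above (`is_verbatim` is a made-up helper):

```python
import re

source = (
    "Yours is a long road, my friend, and it stretches on to places beyond imagining. "
    "With your every step, these grand adventures shall grow more distant and faint. "
    "And there may come a day when you forget the faces and voices of those you have "
    "met along the way. On that day, I bid you remember this... That no matter how far "
    "your journey may take you, you stand where you stand by virtue of the road you "
    "walked to get there. For in times of hardship, when you fear you cannot go on... "
    "The joy you have known, the pain you have felt, the prayers you have whispered and "
    "answered\u2014they shall ever be your strength and your comfort. This I hope\u2014I "
    "believe, here at memory's end."
)

summary = (
    ' "No matter how far your journey may take you, you stand where you stand by virtue of the road '
    'you walked to get there," he says . "The joy you have known, the pain you have felt, the prayers '
    'you have whispered and answered\u2014they shall ever be your strength and your comfort," he writes .'
)

def is_verbatim(span, text):
    # Ignore case and the trailing comma the summarizer put inside the quotes.
    return span.rstrip(", ").lower() in text.lower()

# Pull out everything between double quotes in the summary.
quoted_spans = re.findall(r'"([^"]+)"', summary)
verbatim = [is_verbatim(span, source) for span in quoted_spans]

# Attributions the summarizer added on its own.
added = [phrase for phrase in ("he says", "he writes") if phrase not in source.lower()]

# Summary length relative to the source, in words.
compression = len(summary.split()) / len(source.split())

print(verbatim, added, round(compression, 2))
```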
