# Natural language processing (NLP) demo

## Sources

Code and examples:

- The demo follows the example tutorial: https://www.tensorflow.org/hub/tutorials/tf2_text_classification
- Parts are also taken from these tutorials:
- https://medium.com/intro-to-artificial-intelligence/entity-extraction-using-deep-learning-8014acac6bb8
- https://medium.com/swlh/using-xlnet-for-sentiment-classification-cfa948e65e85
- https://github.com/kcmankar/pytorch-sentiment-analysis-using-XLNet/blob/master/xlnet_sentiment_analysis.ipynb
- https://news.machinelearning.sg/posts/sentiment_analysis_on_movie_reviews_with_xlnet/
- https://deepnote.com/@datacloudgui/1-Millions-of-movies-rllcqn7nRk6lEmB3xRhlKw

Data:

- Sentiment data: https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb
- Sentiment data, presumably the original source: https://ai.stanford.edu/~amaas/data/sentiment/
- IMDb data, raw: https://www.imdb.com/interfaces/

Methods:

- XLNet: https://github.com/zihangdai/xlnet
- XLNet paper: https://arxiv.org/pdf/1906.08237.pdf
- GLUE website: https://gluebenchmark.com/
- GLUE paper: https://openreview.net/pdf?id=rJ4km2R5t7
- A treasure trove of transformer neural networks: https://github.com/huggingface/transformers

Other sources:

- GitHub repo tracking progress in NLP: https://github.com/sebastianruder/NLP-progress
- Another NLP GitHub repo: https://github.com/keon/awesome-nlp
- Turku NLP: https://turkunlp.org/
- Cross-linguistically consistent dependency annotations (Universal Dependencies): https://universaldependencies.org/

## Demo

### Import the required Python packages

In [None]:
!pip install pymongo

In [None]:
import os
import pickle
import numpy as np
import pandas as pd

# For NLP:

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # Suppress the error messages about a missing GPU

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

from tensorflow import keras

import matplotlib.pyplot as plt

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")

from browse_imdb_data import get_review_param

### Dig through the data to get an overview

In [None]:
data_dir = '/home/ml_user/data/IMDb_raw'

file_name_meta = 'imdb_meta.pkl'
file_name_meta_full = os.path.join(data_dir, file_name_meta)

if os.path.exists(file_name_meta_full): # Load the cached metadata
    
    df_meta = pd.read_pickle(file_name_meta_full)

else: # Build the metadata from the raw files
    
    df_tr_neg = get_review_param(data_dir, 'aclImdb/train/neg', 'aclImdb/train/urls_neg.txt')
    df_tr_pos = get_review_param(data_dir, 'aclImdb/train/pos', 'aclImdb/train/urls_pos.txt')
    df_tr_unsup = get_review_param(data_dir, 'aclImdb/train/unsup', 'aclImdb/train/urls_unsup.txt')

    df_te_neg = get_review_param(data_dir, 'aclImdb/test/neg', 'aclImdb/test/urls_neg.txt')
    df_te_pos = get_review_param(data_dir, 'aclImdb/test/pos', 'aclImdb/test/urls_pos.txt')

    df_reviews_all = pd.concat([df_tr_neg, df_tr_pos, df_tr_unsup, df_te_neg, df_te_pos])

    df_reviews_all.sort_values(by=['imdb_id'], inplace=True, ignore_index=True)

    del df_tr_neg, df_tr_pos, df_tr_unsup, df_te_neg, df_te_pos

    df_uniq_titles = df_reviews_all.imdb_id.drop_duplicates() # Keep each title ID only once
    
    file_name = os.path.join(data_dir, 'title.basics.tsv')
    df_titles = pd.read_csv(file_name, sep='\t', low_memory=False)
    df_titles = df_titles[df_titles.tconst.isin(df_uniq_titles)]
    
    df_titles = \
        df_titles.sort_values(by=['tconst'],
                              ignore_index=True).drop_duplicates(subset=['tconst'],
                                                                 ignore_index=True, keep='last')
    df_titles.drop(columns=['primaryTitle', 'isAdult'], inplace=True)    
    
        
    # Principal cast and crew (role) data
    file_name = os.path.join(data_dir, 'title.principals.tsv')
    df_principals = pd.read_csv(file_name, sep='\t', low_memory=False)
    df_principals.drop(columns=['job', 'characters'], inplace=True)
    df_principals = df_principals[df_principals.tconst.isin(df_uniq_titles)]
    # Numeric part of the person ID: take everything after the 'nm' prefix,
    # because a fixed-width slice would truncate IDs with more than seven digits
    df_principals['nconst_ind'] = df_principals.nconst.str.slice(2)
    df_principals['nconst_ind'] = df_principals['nconst_ind'].astype('int')
    df_principals.drop(columns=['nconst'], inplace=True)

    # Person name data
    file_name = os.path.join(data_dir, 'name.basics.tsv')
    df_names = pd.read_csv(file_name, sep='\t', low_memory=False)
    df_names.drop(columns=['primaryProfession', 'knownForTitles'], inplace=True)
    df_names['nconst_ind'] = df_names.nconst.str.slice(2)
    df_names['nconst_ind'] = df_names['nconst_ind'].astype('int')
    df_names.drop(columns=['nconst'], inplace=True)

    df_meta = df_principals.merge(df_titles,
                                  how='left',
                                  left_on='tconst', 
                                  right_on='tconst').merge(df_names,
                                                           how='left',
                                                           left_on='nconst_ind',
                                                           right_on='nconst_ind')
    df_meta.replace('\\N', '', inplace=True)
    
    # Missing values are empty strings at this point; convert them to NaN before casting
    df_meta.startYear = df_meta.startYear.replace('', np.nan).astype('float')
    df_meta.endYear = df_meta.endYear.replace('', np.nan).astype('float')
    df_meta.birthYear = df_meta.birthYear.replace('', np.nan).astype('float')
    df_meta.deathYear = df_meta.deathYear.replace('', np.nan).astype('float')
    
    df_meta.to_pickle(file_name_meta_full) # Save to disk

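The person identifiers in the IMDb files have the form `nm` followed by digits. The following is a minimal sketch of the string-to-integer conversion used for the merge key. The helper name `nconst_to_int` and the sample identifiers are made up for illustration and are not part of the notebook.

```python
def nconst_to_int(nconst: str) -> int:
    """Convert an IMDb person identifier such as 'nm0000123' to the integer 123."""
    if not nconst.startswith("nm"):
        raise ValueError(f"unexpected identifier: {nconst!r}")
    return int(nconst[2:])  # everything after the two-letter prefix


# Made-up sample identifiers, one of them with eight digits
sample_ids = ["nm0000001", "nm0000123", "nm10000001"]
print([nconst_to_int(i) for i in sample_ids])

# A fixed-width slice of seven characters would drop the last digit of the long ID
print(int("nm10000001"[2:9]))
```

Taking the whole remainder of the string keeps identifiers of different lengths distinct, which matters when the integer is later used as a join key.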

### Look at the contents of the data file

In [None]:
df_meta.head()

In [None]:
df_meta.describe()

In [None]:
df_meta_uniq = df_meta.drop_duplicates(subset=['tconst'], ignore_index=True)

### Draw a few descriptive plots

In [None]:
df_group = df_meta_uniq.groupby('startYear').count()

fig, ax = plt.subplots(1, 1, figsize=(15,5))
ax.bar(df_group.index, df_group.tconst)
ax.set_xlabel('year')
ax.set_ylabel('movies')

In [None]:
df_genres = df_meta_uniq.genres.str.lower().str.split(',', expand=True)
df_group = df_genres.stack().reset_index().groupby(0).count().sort_values(by='level_0', ascending=False)

fig, ax = plt.subplots(1, 1, figsize=(10,5))
ax.barh(df_group.index, df_group.level_0)
ax.set_xlabel('movies')
ax.set_ylabel('genre')

In [None]:
df_group = df_meta.groupby(by='primaryName').count().sort_values(by='tconst', ascending=False).iloc[:20]

fig, ax = plt.subplots(1, 1, figsize=(10,5))
ax.barh(df_group.index, df_group.tconst)
ax.set_xlabel('mentions')
ax.set_ylabel('person')

In [None]:
def plot_meta_by_name(name):
    
    print(df_meta.loc[df_meta.primaryName == name, 
                      ['category', 'originalTitle', 'titleType', 'startYear']].sort_values(by='startYear'))
    

In [None]:
plot_meta_by_name('William Shakespeare')

In [None]:
plot_meta_by_name('Renny Harlin')

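The genre bar chart above is built by splitting the comma-separated `genres` string and counting each genre. The same idea can be sketched without pandas on a few made-up rows. The sample data and the variable names here are illustrative only.

```python
from collections import Counter

# Made-up sample of the comma-separated genres column; '' marks a missing value
genres_column = ["Comedy,Drama", "Drama", "Action,Comedy,Drama", ""]

# Lowercase, split on commas, and skip empty strings left by missing values
genre_counts = Counter(
    genre
    for row in genres_column
    for genre in row.lower().split(",")
    if genre
)

for genre, n in genre_counts.most_common():
    print(f"{genre:10s} {n}")
```

Skipping the empty strings keeps titles without a genre from showing up as an unnamed bar in the chart.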
## Sentiment analysis

### Split the data into training and test sets

In [None]:
train_data, test_data = tfds.load(name="imdb_reviews", split=["train", "test"], 
                                  batch_size=-1, as_supervised=True)
train_examples, train_labels = tfds.as_numpy(train_data)
test_examples, test_labels = tfds.as_numpy(test_data)

In [None]:
print("Training entries: {}, test entries: {}".format(len(train_examples), len(test_examples)))

### Look at one negative and one positive review

In [None]:
train_examples[0]

In [None]:
train_labels[0] # 0 == negative

In [None]:
train_examples[5]

In [None]:
train_labels[5] # 1 == positive

### Fetch a neural network pretrained on news text

In [None]:
# Use a separate name for the URL so that `model` refers only to the Keras model built below
model_url = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(model_url, input_shape=[], dtype=tf.string, trainable=True)
# hub_layer(train_examples[:3])

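To give an idea of what a text-embedding layer of this kind does, here is a toy sketch that turns a sentence into one fixed-length vector. It assumes the layer splits the text on spaces, looks up one vector per token, and combines them by summing and dividing by the square root of the token count. This is an assumption about the pretrained module, not something this notebook verifies. The three-dimensional table and the zero vector for unknown words are invented for illustration, whereas the real layer produces 50 dimensions.

```python
import math

# Hypothetical 3-dimensional embedding table (the real layer uses 50 dimensions)
EMBEDDINGS = {
    "great": [0.9, 0.1, 0.0],
    "movie": [0.2, 0.5, 0.3],
    "boring": [-0.8, 0.2, 0.1],
}
UNKNOWN = [0.0, 0.0, 0.0]  # simplistic stand-in for out-of-vocabulary handling


def embed_sentence(sentence: str) -> list:
    """Sum the token vectors and scale by 1/sqrt(number of tokens)."""
    tokens = sentence.split()
    if not tokens:
        return list(UNKNOWN)
    vectors = [EMBEDDINGS.get(token, UNKNOWN) for token in tokens]
    summed = [sum(dimension) for dimension in zip(*vectors)]
    scale = math.sqrt(len(tokens))
    return [value / scale for value in summed]


print(embed_sentence("great movie"))
print(embed_sentence("boring movie"))
```

Whatever the sentence length, the output has the same number of dimensions, which is what lets ordinary dense layers be stacked on top of it.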
### Add fresh layers to the network for the new problem

In [None]:
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))

model.summary()

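As a cross-check of the summary, the parameter counts of the two dense layers can be worked out by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch assumes the 50-dimensional output implied by the name of the pretrained layer. The helper `dense_params` is illustrative only.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus biases (n_units) of a dense layer."""
    return n_inputs * n_units + n_units


embedding_dim = 50  # output size of the pretrained embedding layer (assumed from its name)

hidden_params = dense_params(embedding_dim, 16)  # Dense(16, activation='relu')
output_params = dense_params(16, 1)              # Dense(1)

print(hidden_params, output_params, hidden_params + output_params)  # 816 17 833
```

The new layers are tiny compared with the pretrained embedding table, so nearly all trainable weights sit in the first layer.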
### Classic structure of a neural network

![title](neural_net.png)

Source: https://www.cs.mcgill.ca/~jcheung/teaching/fall-2016/comp599/lectures/lecture23.pdf (Goldberg et al. 2015)

### Define the target metrics (optimization and comparison)

In [None]:
model.compile(optimizer='adam',
              loss=tf.losses.BinaryCrossentropy(from_logits=True),
              metrics=[tf.metrics.BinaryAccuracy(threshold=0.0, name='accuracy')])

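The last layer has no activation, so the model outputs a raw score (a logit). That is why the loss uses `from_logits=True` and the accuracy threshold is 0.0. A small plain-Python sketch of both ideas follows. The function names are illustrative, and the loss formula is the standard numerically stable form of binary cross-entropy, not code taken from Keras.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bce_from_logit(logit: float, label: int) -> float:
    """Binary cross-entropy computed directly from a logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


for logit in (-3.0, -0.5, 0.7, 4.0):
    # Thresholding the logit at 0 equals thresholding the probability at 0.5
    assert (logit > 0.0) == (sigmoid(logit) > 0.5)
    print(logit, round(sigmoid(logit), 3), round(bce_from_logit(logit, 1), 3))
```

Working on logits avoids taking the logarithm of a probability that has been rounded to exactly 0 or 1.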
### Split the training data further into training and validation sets

In [None]:
train_cutoff = 10000

x_val = train_examples[:train_cutoff]
partial_x_train = train_examples[train_cutoff:]

y_val = train_labels[:train_cutoff]
partial_y_train = train_labels[train_cutoff:]

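The slicing above takes the first items for validation and leaves the rest for training. The sketch below shows the same split on made-up data, together with a quick class-balance check. `head_split` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter


def head_split(examples, labels, n_val):
    """Use the first n_val items for validation and the rest for training."""
    if not 0 < n_val < len(examples):
        raise ValueError("n_val must be between 1 and len(examples) - 1")
    return (examples[n_val:], labels[n_val:]), (examples[:n_val], labels[:n_val])


# Made-up labels and placeholder review texts
toy_labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
toy_examples = [f"review {i}" for i in range(len(toy_labels))]

(train_x, train_y), (val_x, val_y) = head_split(toy_examples, toy_labels, n_val=4)
print(len(train_x), len(val_x))
print(Counter(train_y), Counter(val_y))
```

A split by position is only representative if the data is already shuffled, so checking the label counts of both parts is a cheap safeguard.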
### Train the model (would take about 12 min on a laptop)

In [None]:
model_dir = '../work/nlp_model'

if os.path.exists(model_dir):
    model = keras.models.load_model(model_dir)

    with open(os.path.join(model_dir, 'hist.pkl'), 'rb') as handle:
        hist = pickle.load(handle)
else:
    history = model.fit(partial_x_train,
                        partial_y_train,
                        epochs=40,
                        batch_size=512,
                        validation_data=(x_val, y_val),
                        verbose=1)
    model.save(model_dir)
                                
    with open(os.path.join(model_dir, 'hist.pkl'), 'wb') as handle:
        pickle.dump(history.history, handle, protocol=pickle.HIGHEST_PROTOCOL)
        
    hist = history.history

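The cell above follows a load-or-compute pattern: reuse a saved result if one exists, otherwise do the expensive work and save it. A generic sketch of that pattern with `pickle` is shown here. `load_or_compute` and `fake_training` are hypothetical names, and the history values are made up.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the pickled result at `path`, or compute, save, and return it."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    with open(path, "wb") as handle:
        pickle.dump(result, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return result


calls = []


def fake_training():
    """Stand-in for an expensive training run; records that it was called."""
    calls.append(1)
    return {"loss": [0.7, 0.5], "accuracy": [0.6, 0.8]}


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "hist.pkl")
    first = load_or_compute(cache_path, fake_training)   # runs the computation
    second = load_or_compute(cache_path, fake_training)  # served from the cache

print(first == second, len(calls))
```

Only load pickle files you created yourself, since unpickling can execute arbitrary code.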
### Evaluate the model on the test set

In [None]:
results = model.evaluate(test_examples, test_labels)

In [None]:
history_dict = hist
history_dict.keys()

### See how the model's loss evolved during training

In [None]:
acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

### See how the model's accuracy evolved during training

In [None]:
plt.clf()   # clear figure

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

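Besides reading the curves by eye, the epoch with the lowest validation loss can be picked out of the history dictionary directly. The numbers below are invented for illustration and are not results from this model. `best_epoch` is a hypothetical helper.

```python
def best_epoch(val_loss):
    """Return the 1-based epoch number with the lowest validation loss."""
    return min(range(len(val_loss)), key=val_loss.__getitem__) + 1


# Made-up history in the same shape as Keras' history.history
history_example = {
    "val_loss": [0.62, 0.41, 0.33, 0.31, 0.32, 0.35],
    "val_accuracy": [0.70, 0.81, 0.85, 0.86, 0.86, 0.85],
}

epoch = best_epoch(history_example["val_loss"])
print(epoch, history_example["val_accuracy"][epoch - 1])
```

If the validation loss starts rising while the training loss keeps falling, the later epochs are mostly overfitting and training could stop earlier.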
### Final model accuracy is about 85 %: consider whether that is sufficient

## A modern alternative: XLNet
https://github.com/zihangdai/xlnet

### Beneficial properties of XLNet, including memory and attention

![title](transformer_self_attention.png)

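The attention idea can be sketched as plain scaled dot-product self-attention in NumPy. This is a simplified illustration with random weights. XLNet's own variant (two-stream attention with relative position information) is more involved and is not shown. All names and sizes below are made up.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of token vectors."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ v, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens, 8 features each (toy sizes)
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape)
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```

Each output row is a weighted mix of all token values, so every position can draw on information from the whole sequence.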

### XLNet accuracy: ~96 % (paper), ~93 % (validated with an example in Google Colab, since the laptop has no GPU)