# Deep learning notebook

This notebook covers the deep learning part of the project. The goal is to obtain an n-dimensional vector representation of each text and then average those vectors over a given time period in the embedding space (z-space). The averaged vectors can then be used for prediction.

In [10]:
import pandas as pd
import tensorflow as tf
import numpy as np
from transformers import BertModel, BertTokenizer

In [3]:
tok = BertTokenizer.from_pretrained('KB/bert-base-swedish-cased')
model = BertModel.from_pretrained('KB/bert-base-swedish-cased')

Some weights of the model checkpoint at KB/bert-base-swedish-cased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Import data

In [28]:
df = pd.read_csv('../dataset/lawline_data.csv')

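Collapse newlines and repeated whitespace by splitting each text on whitespace and rejoining it with single spaces. Then split each document into a list of sentences and restore the sentence-final period that the split removes.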

In [29]:
df['text_clean'] = [' '.join(str(item).split()) for item in df['text']]

In [42]:
doc_as_list = [item.split('. ') for item in df['text_clean']]

In [44]:
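# Restore the period removed by the split; skip empty fragments and
# leave sentences that already end in punctuation unchanged.
doc_as_list = [[sent if sent.endswith(('.', '!', '?')) else sent + '.'
                for sent in doc if sent]
               for doc in doc_as_list]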
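
The cleaning and splitting steps above can be tried on a toy string. This is a minimal sketch: the sample text is made up, and the helper name `split_sentences` is hypothetical (it is not part of the notebook). It uses `str.endswith` and skips empty fragments so that empty strings cannot cause an error.

```python
def split_sentences(text):
    """Normalize whitespace, split on '. ', and restore end punctuation."""
    # Collapse newlines, tabs, and repeated spaces into single spaces.
    clean = ' '.join(str(text).split())
    # Naive split on period + space; drop empty fragments.
    parts = [s for s in clean.split('. ') if s]
    # Re-attach the period removed by the split, unless the
    # sentence already ends in punctuation.
    return [s if s.endswith(('.', '!', '?')) else s + '.' for s in parts]


# Made-up sample text with a newline and a double space.
sample = "Hej!\nJag har en bil.  Vad kostar\nden? Tack."
sentences = split_sentences(sample)
print(sentences)
```

Note that splitting on `'. '` alone keeps a sentence ending in `!` or `?` attached to the sentence that follows it, so the resulting items are not always single sentences.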

Turn each document into a single embedding vector using BERT: embed every sentence, then average the sentence vectors.

In [60]:
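def get_embedding(document, tokenizer, model):
    """Return one vector per document: the mean of its sentence embeddings."""
    # Tokenize all sentences as one padded batch of PyTorch tensors;
    # sentences longer than 512 tokens are truncated.
    results = tokenizer(document, max_length=512,
                        truncation=True, padding=True,
                        return_tensors="pt")

    # Index 1 of the model output is the pooled [CLS] output,
    # one vector per sentence.
    sentence_embs = model(**results)[1].detach().numpy()
    # Average over the sentence axis to get the document vector.
    doc_emb = np.mean(sentence_embs, axis=0)

    return doc_emb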
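

The averaging step in `get_embedding` can be illustrated without loading the model. The sketch below uses random numbers as stand-ins for the pooled sentence vectors, which is an assumption made purely for illustration. The width of 768 is the hidden size of BERT-base models.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for model(**results)[1]: 4 sentences, 768 dimensions each.
sentence_embs = rng.normal(size=(4, 768)).astype(np.float32)

# Averaging over axis 0 collapses the sentence axis,
# leaving one 768-dimensional vector for the whole document.
doc_emb = np.mean(sentence_embs, axis=0)

print(sentence_embs.shape, doc_emb.shape)
```

Because the mean gives every sentence equal weight, a very short sentence contributes as much to the document vector as a long one.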

In [61]:
emb_from_doc = [get_embedding(doc, tok, model) for doc in doc_as_list]

In [58]:
emb_from_doc[0].shape

(768,)
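
The goal stated at the top is an average embedding per time period. The notebook does not show that step, so the following is only a sketch. The `date` column, the monthly period, and the random 768-dimensional stand-in vectors are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: four documents with dates and stand-in embeddings.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-03', '2020-01-20',
                            '2020-02-11', '2020-02-25']),
})
demo_embs = rng.normal(size=(len(demo), 768))

# One row per document, one column per embedding dimension.
emb_df = pd.DataFrame(demo_embs, index=demo.index)

# Group the rows by calendar month and average each dimension.
period = demo['date'].dt.to_period('M')
period_means = emb_df.groupby(period).mean()

print(period_means.shape)
```

With the real data, `demo_embs` would presumably be replaced by `np.vstack(emb_from_doc)` and `demo['date']` by whichever column of `df` holds the document date.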