## Creating sentence embedding tensors from a list of sentences

The model from the research paper "From None to Severe: Predicting Severity in Movie Scripts" (https://github.com/RiTUAL-UH/Predicting-Severity-in-Movie-Scripts) takes the movie script as input to its RNN. The script text is first split into sentences, and each sentence is then encoded, which gives one tensor per script that stacks the embeddings of all its sentences. Each script has roughly 10,000 words, which we estimate to be around 1,000 sentences, and the BERT encoder used produces a 768-dimensional vector per sentence. Each script therefore yields a tensor of roughly 1000 × 768 values, and the resulting files become very large. For that reason, we ran this step once as part of data preprocessing.
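The size estimate can be made concrete with a little arithmetic. The sketch below is only a back-of-the-envelope calculation: it assumes the embeddings are stored as 32-bit floats (4 bytes per value), and it takes 479 as the number of scripts because that is the file count shown by the progress bar in the run further down.

```python
# Rough size of the sentence embeddings (not a measurement).
sentences_per_script = 1000  # estimate from the text
embedding_dim = 768          # embedding size stated in the text
bytes_per_value = 4          # assumption: float32 storage
num_scripts = 479            # file count shown by the progress bar

bytes_per_script = sentences_per_script * embedding_dim * bytes_per_value
total_bytes = bytes_per_script * num_scripts

print(f"{bytes_per_script / 1e6:.1f} MB per script")
print(f"{total_bytes / 1e9:.2f} GB for {num_scripts} scripts")
```

Under these assumptions the embeddings take about 3 MB per script and close to 1.5 GB for the whole collection, before any pickle overhead.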

In [1]:
import spacy
from tqdm import tqdm
import glob
import pandas as pd
import re
from sentence_transformers import SentenceTransformer

In [6]:
files = glob.glob('../../data/script/*')  # one file per movie script
# 'stsb-bert-base' is the encoder used by the model in the paper
model = SentenceTransformer('stsb-bert-base')

def get_sentence_emb(whole_list_of_sentences):
    """Encode a list of sentences into one tensor with one row per sentence."""
    embeddings = model.encode(whole_list_of_sentences, convert_to_tensor=True)
    return embeddings

def split_sentences(text):
    """Split text into sentences at whitespace that follows '.' or '?'."""
    # The two negative lookbehinds avoid splitting after abbreviations such as
    # "e.g." or "Dr."; note that '!' is not treated as a sentence boundary.
    pattern = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s'

    sentences = re.split(pattern, text)
    # collapse runs of whitespace (including newlines) into single spaces
    sentences = [re.sub(r'\s+', ' ', sentence).strip() for sentence in sentences]
    # drop empty strings left over from the split
    sentences = [sentence for sentence in sentences if sentence]
    return sentences

Some weights of the model checkpoint at C:\Users\Jakob/.cache\torch\sentence_transformers\sbert.net_models_stsb-bert-base\0_BERT were not used when initializing BertModel: ['classifier.weight', 'classifier.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
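To see what the splitting pattern does in practice, here is a small check on an invented sample. The function body is copied from the cell above so that the snippet runs on its own; the sample text is made up for illustration and does not come from the dataset.

```python
import re

def split_sentences(text):
    # Same pattern as in the cell above: split at whitespace after '.' or '?',
    # except after abbreviations like "p.m." or "Dr."
    pattern = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s'
    sentences = re.split(pattern, text)
    sentences = [re.sub(r'\s+', ' ', s).strip() for s in sentences]
    return [s for s in sentences if s]

sample = "Dr. Smith arrives. Where is he? He left at 5 p.m. today. Wait! Stop."
result = split_sentences(sample)
for sentence in result:
    print(repr(sentence))
```

The sample is split into four pieces: the abbreviations "Dr." and "p.m." do not trigger a split, and "Wait! Stop." stays together because the pattern does not treat "!" as a sentence boundary.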


In [12]:
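import os

imdb_id_with_tensor = []
for idx, filepath in tqdm(enumerate(files), total=len(files)):

    if idx == 113:  # this file is skipped explicitly
        continue

    with open(filepath, 'r') as f:
        try:
            script = f.read()
        except Exception:  # e.g. the file cannot be decoded; Ctrl+C still stops the run
            print("skipping", filepath)
            continue

    # file name without directory or extension, independent of the OS path separator
    imdb_id = os.path.basename(filepath).split('.')[0]

    list_of_sentences = split_sentences(script)
    try:
        sentences_emb = get_sentence_emb(list_of_sentences)
    except Exception:  # skip scripts that cannot be encoded
        print("skipping", filepath)
        continue
    imdb_id_with_tensor.append([imdb_id, sentences_emb])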

    if idx % 50 == 0:  # save every 50 files to keep file sizes manageable
        df = pd.DataFrame(imdb_id_with_tensor, columns=['imdb_id', 'sentences_emb'])
        df.to_pickle('../baseline/output/imdb_id_with_embSentencesList{0}.pkl'.format(idx))
        imdb_id_with_tensor = []  # start a new chunk so each pickle holds only the scripts since the last save


# save whatever is left after the last checkpoint
df = pd.DataFrame(imdb_id_with_tensor, columns=['imdb_id', 'sentences_emb'])
print(df)
df.to_pickle('../baseline/output/imdb_id_with_embSentencesListLast.pkl')

  0%|          | 1/479 [00:47<6:21:12, 47.85s/it]

[['tt0032138', tensor([[ 0.1276, -0.1368,  0.7954,  ...,  0.7945,  0.1805,  0.3932],
        [ 0.6367,  0.3156,  1.3399,  ...,  0.6728, -0.0384, -0.0278],
        [-0.5704,  0.1159,  0.9195,  ...,  0.8791, -0.4666,  0.4746],
        ...,
        [-0.6384,  0.8375,  0.4908,  ...,  0.6937,  0.5297,  0.0019],
        [-0.3690,  0.5031,  1.3547,  ...,  1.1087, -0.5365,  0.0094],
        [-0.5727,  1.3154,  1.2028,  ...,  0.1058, -0.2700,  0.3916]])]]


  0%|          | 1/479 [01:06<8:50:07, 66.54s/it]


KeyboardInterrupt: 
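The checkpointing idea in the loop above (write a file every 50 scripts) can be isolated into a small helper. This is a minimal sketch rather than the code used here: it uses the standard-library `pickle` instead of pandas, placeholder records instead of real embeddings, and the names `save_in_chunks`, `_dump`, `chunk_size` and the `chunk_000.pkl` file naming are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

def _dump(chunk, out_dir, index):
    """Pickle one chunk to out_dir and return the path of the file written."""
    path = out_dir / f"chunk_{index:03d}.pkl"
    with open(path, "wb") as f:
        pickle.dump(chunk, f)
    return path

def save_in_chunks(records, chunk_size, out_dir):
    """Write records to out_dir in pickle files of at most chunk_size items."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths, chunk = [], []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            paths.append(_dump(chunk, out_dir, len(paths)))
            chunk = []  # start a fresh chunk so files do not keep growing
    if chunk:  # leftover records that did not fill a whole chunk
        paths.append(_dump(chunk, out_dir, len(paths)))
    return paths

# Placeholder records standing in for [imdb_id, sentences_emb] pairs.
records = [[f"tt{i:07d}", [0.0] * 4] for i in range(120)]

with tempfile.TemporaryDirectory() as tmp:
    paths = save_in_chunks(records, chunk_size=50, out_dir=tmp)
    sizes = []
    for p in paths:
        with open(p, "rb") as f:
            sizes.append(len(pickle.load(f)))

print(sizes)
```

With 120 placeholder records and a chunk size of 50, this writes three files holding 50, 50 and 20 records. Counting successfully processed records, rather than testing the loop index, also means a skipped script never causes a checkpoint to be missed.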