# Make example predictions on RCT abstracts from the wild
In this notebook, we make predictions on abstracts from the wild using the model trained in the previous notebook. To do that, we will:
- Get the data from: https://pubmed.ncbi.nlm.nih.gov/
- Preprocess the data into the same format as our training data
- Use our model to make predictions on the preprocessed data

# Set up dependencies

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

from torch.utils.data import DataLoader, Dataset

import numpy as np
import pandas as pd
import tensorflow as tf

import json

from spacy.lang.en import English

In [None]:
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/skimlit_example_abstracts.json
!wget https://raw.githubusercontent.com/vishalrk1/pytorch/main/Pytorch_Helper.py

!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

# importing helper function
from helper_functions import create_tensorboard_callback, plot_loss_curves, pred_and_plot, unzip_data, walk_through_dir
from Pytorch_Helper import Tokenizer, LabelEncoder

--2023-06-22 20:15:04--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/skimlit_example_abstracts.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6737 (6.6K) [text/plain]
Saving to: ‘skimlit_example_abstracts.json’


2023-06-22 20:15:04 (58.6 MB/s) - ‘skimlit_example_abstracts.json’ saved [6737/6737]

--2023-06-22 20:15:04--  https://raw.githubusercontent.com/vishalrk1/pytorch/main/Pytorch_Helper.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11016 (11K) [text/plain]
Saving to: ‘Pytorch_H

In [None]:
base_dir = "/content/drive/MyDrive/ml-and-ds/Project/Classification/Skimlit/"

tokenizer = Tokenizer.load(fp=base_dir+"utils/tokenizer.json")
label_encoder = LabelEncoder.load(fp=base_dir+"utils/label_encoder.json")

In [None]:
# Downloading glove embeddings files
!wget http://nlp.stanford.edu/data/glove.6B.zip
unzip_data('/content/glove.6B.zip')

def load_glove_embeddings(embeddings_file):
    """Load embeddings from a file."""
    embeddings = {}
    with open(embeddings_file, "r") as fp:
        for index, line in enumerate(fp):
            values = line.split()
            word = values[0]
            embedding = np.asarray(values[1:], dtype='float32')
            embeddings[word] = embedding
    return embeddings

def make_embeddings_matrix(embeddings, word_index, embedding_dim):
    """Create embeddings matrix to use in Embedding layer."""
    embedding_matrix = np.zeros((len(word_index), embedding_dim))
    for word, i in word_index.items():
        embedding_vector = embeddings.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

EMBEDDING_DIM = 300
HIDDEN_DIM = 128

# Create embeddings
embeddings_file = '/content/glove.6B.{0}d.txt'.format(EMBEDDING_DIM)
glove_embeddings = load_glove_embeddings(embeddings_file=embeddings_file)

embedding_matrix = make_embeddings_matrix(
    embeddings=glove_embeddings, word_index=tokenizer.token_to_index,
    embedding_dim=EMBEDDING_DIM)

print (f"<Embeddings(words={embedding_matrix.shape[0]}, dim={embedding_matrix.shape[1]})>")

--2023-06-22 20:15:08--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2023-06-22 20:15:08--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2023-06-22 20:15:08--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip.1’


2
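To make the embedding-matrix step concrete, here is a minimal sketch on toy data. The three-dimensional vectors and the `toy_word_index` vocabulary (including the `<PAD>` and `<UNK>` entries) are made up for illustration and are not taken from the real tokenizer. It shows that tokens missing from GloVe keep an all-zero row.

```python
import numpy as np

def make_embeddings_matrix(embeddings, word_index, embedding_dim):
    """Same logic as the notebook function: one row per token index."""
    embedding_matrix = np.zeros((len(word_index), embedding_dim))
    for word, i in word_index.items():
        vector = embeddings.get(word)
        if vector is not None:
            embedding_matrix[i] = vector
    return embedding_matrix

# Hypothetical 3-dimensional "GloVe" vectors for two words
toy_embeddings = {
    "fish": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "oil": np.array([0.4, 0.5, 0.6], dtype="float32"),
}
# Hypothetical vocabulary; "ppd" has no pretrained vector
toy_word_index = {"<PAD>": 0, "<UNK>": 1, "fish": 2, "oil": 3, "ppd": 4}

toy_matrix = make_embeddings_matrix(toy_embeddings, toy_word_index, embedding_dim=3)

# Fraction of vocabulary rows that received a pretrained vector
coverage = np.any(toy_matrix != 0, axis=1).mean()
print(toy_matrix.shape, coverage)
```

Checking the coverage on the real vocabulary is a quick way to see how many tokens start from zeros rather than from a pretrained vector.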

In [None]:
def gather_last_relevant_hidden(hiddens, seq_lens):
    """Extract and collect the last relevant
    hidden state based on the sequence length."""
    seq_lens = seq_lens.long().detach().cpu().numpy() - 1
    out = []
    for batch_index, column_index in enumerate(seq_lens):
        out.append(hiddens[batch_index, column_index])
    return torch.stack(out)
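The indexing that `gather_last_relevant_hidden` performs can be illustrated with a small NumPy analogue. This is only a sketch on a toy array; the function name `gather_last_relevant_hidden_np` is hypothetical and it is not a drop-in replacement for the PyTorch version above.

```python
import numpy as np

def gather_last_relevant_hidden_np(hiddens, seq_lens):
    """Pick hiddens[b, seq_lens[b] - 1] for every batch row b."""
    last_idx = np.asarray(seq_lens, dtype=np.int64) - 1
    batch_idx = np.arange(hiddens.shape[0])
    return hiddens[batch_idx, last_idx]

# Toy batch: 2 sequences, 4 time steps, hidden size 3
hiddens = np.arange(2 * 4 * 3, dtype=np.float32).reshape(2, 4, 3)
seq_lens = [2, 4]  # first sequence has 2 real tokens, second has 4

last_hidden = gather_last_relevant_hidden_np(hiddens, seq_lens)
print(last_hidden)
```

For the first sequence the hidden state at step 1 is selected rather than the one at step 3, so the padded positions never reach the fully connected layer.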

In [None]:
# Define the model
class SkimlitModel(nn.Module):
    def __init__(self, embedding_dim, vocab_size, hidden_dim, n_layers, linear_output, num_classes, pretrained_embeddings=None, padding_idx=0):
        super(SkimlitModel, self).__init__()

        # Initializing embeddings
        if pretrained_embeddings is None:
            self.embeddings = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
        else:
            pretrained_embeddings = torch.from_numpy(pretrained_embeddings).float()
            self.embeddings = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim, _weight=pretrained_embeddings, padding_idx=padding_idx)

        # LSTM layers
        self.lstm1 = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers, batch_first=True, bidirectional=True)

        # FC layers
        self.fc_text = nn.Linear(2*hidden_dim, linear_output)

        self.fc_line_num = nn.Linear(20, 64)
        self.fc_total_line = nn.Linear(24, 64)

        self.fc_final = nn.Linear((64+64+linear_output), num_classes)
        self.dropout = nn.Dropout(0.3)

    def forward(self, inputs):
        x_in, seq_lens, line_nums, total_lines = inputs
        x_in = self.embeddings(x_in)

        # RNN outputs
        out, b_n = self.lstm1(x_in)
        x_1 = gather_last_relevant_hidden(hiddens=out, seq_lens=seq_lens)

        # FC layers output
        x_1 = F.relu(self.fc_text(x_1))
        x_2 = F.relu(self.fc_line_num(line_nums))
        x_3 = F.relu(self.fc_total_line(total_lines))

        x = torch.cat((x_1, x_2, x_3), dim=1)
        x = self.dropout(x)
        x = self.fc_final(x)
        return x

In [None]:
vocab_size = len(tokenizer)
num_classes = len(label_encoder)
print(num_classes)

class_names = label_encoder.class_to_index.keys()
class_names

5


dict_keys(['BACKGROUND', 'CONCLUSIONS', 'METHODS', 'OBJECTIVE', 'RESULTS'])

In [None]:
# Load the model
model = SkimlitModel(embedding_dim=300,
                     vocab_size=vocab_size,
                     hidden_dim=128,
                     n_layers=3,
                     linear_output=128,
                     num_classes=num_classes,
                     pretrained_embeddings=embedding_matrix)

model.load_state_dict(torch.load(base_dir+"utils/skimlit-model-final-1.pt", map_location="cpu"))

<All keys matched successfully>

In [None]:
model

SkimlitModel(
  (embeddings): Embedding(38740, 300, padding_idx=0)
  (lstm1): LSTM(300, 128, num_layers=3, batch_first=True, bidirectional=True)
  (fc_text): Linear(in_features=256, out_features=128, bias=True)
  (fc_line_num): Linear(in_features=20, out_features=64, bias=True)
  (fc_total_line): Linear(in_features=24, out_features=64, bias=True)
  (fc_final): Linear(in_features=256, out_features=5, bias=True)
  (dropout): Dropout(p=0.3, inplace=False)
)

# Preparing Data for predictions

In [None]:
with open("skimlit_example_abstracts.json", "r") as f:
    example_abstracts = json.load(f)

abstracts = pd.DataFrame(example_abstracts)
abstracts.head()

Unnamed: 0,abstract,source,details
0,This RCT examined the efficacy of a manualized...,https://pubmed.ncbi.nlm.nih.gov/20232240/,RCT of a manualized social treatment for high-...
1,Postpartum depression (PPD) is the most preval...,https://pubmed.ncbi.nlm.nih.gov/28012571/,Formatting removed (can be used to compare mod...
2,"Mental illness, including depression, anxiety ...",https://pubmed.ncbi.nlm.nih.gov/28942748/,Effect of nutrition on mental health
3,Hepatitis C virus (HCV) and alcoholic liver di...,https://pubmed.ncbi.nlm.nih.gov/22244707/,Baclofen promotes alcohol abstinence in alcoho...


In [None]:
abstracts["abstract"][1]

"Postpartum depression (PPD) is the most prevalent mood disorder associated with childbirth. No single cause of PPD has been identified, however the increased risk of nutritional deficiencies incurred through the high nutritional requirements of pregnancy may play a role in the pathology of depressive symptoms. Three nutritional interventions have drawn particular interest as possible non-invasive and cost-effective prevention and/or treatment strategies for PPD; omega-3 (n-3) long chain polyunsaturated fatty acids (LCPUFA), vitamin D and overall diet. We searched for meta-analyses of randomised controlled trials (RCT's) of nutritional interventions during the perinatal period with PPD as an outcome, and checked for any trials published subsequently to the meta-analyses. Fish oil: Eleven RCT's of prenatal fish oil supplementation RCT's show null and positive effects on PPD symptoms. Vitamin D: no relevant RCT's were identified, however seven observational studies of maternal vitamin D 

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm") # set up the English sentence parser
doc = nlp(abstracts["abstract"][1]) # create "doc" of parsed sequences
abstract_lines = [str(sent) for sent in list(doc.sents)] # list of lines as strings (not spaCy type)

In [None]:
# Get total number of lines
total_lines_in_sample = len(abstract_lines)

# Go through each line in abstract and create a list of dictionaries containing features for each line
sample_lines = []
for i, line in enumerate(abstract_lines):
    sample_dict = {}
    sample_dict["text"] = str(line)
    sample_dict["line_number"] = i
    sample_dict["total_lines"] = total_lines_in_sample - 1
    sample_lines.append(sample_dict)

sample_lines

[{'text': 'Postpartum depression (PPD) is the most prevalent mood disorder associated with childbirth.',
  'line_number': 0,
  'total_lines': 11},
 {'text': 'No single cause of PPD has been identified, however the increased risk of nutritional deficiencies incurred through the high nutritional requirements of pregnancy may play a role in the pathology of depressive symptoms.',
  'line_number': 1,
  'total_lines': 11},
 {'text': 'Three nutritional interventions have drawn particular interest as possible non-invasive and cost-effective prevention and/or treatment strategies for PPD; omega-3 (n-3) long chain polyunsaturated fatty acids (LCPUFA), vitamin D and overall diet.',
  'line_number': 2,
  'total_lines': 11},
 {'text': "We searched for meta-analyses of randomised controlled trials (RCT's) of nutritional interventions during the perinatal period with PPD as an outcome, and checked for any trials published subsequently to the meta-analyses.",
  'line_number': 3,
  'total_lines': 11},

In [None]:
df = pd.DataFrame(sample_lines)
df

Unnamed: 0,text,line_number,total_lines
0,Postpartum depression (PPD) is the most preval...,0,11
1,"No single cause of PPD has been identified, ho...",1,11
2,Three nutritional interventions have drawn par...,2,11
3,We searched for meta-analyses of randomised co...,3,11
4,Fish oil:,4,11
5,Eleven RCT's of prenatal fish oil supplementat...,5,11
6,"Vitamin D: no relevant RCT's were identified, ...",6,11
7,Diet:,7,11
8,Two Australian RCT's with dietary advice inter...,8,11
9,"With the exception of fish oil, few RCT's with...",9,11


In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re

nltk.download("stopwords")
STOPWORDS = stopwords.words("english")
porter = PorterStemmer()

def preprocess(text, stopwords=STOPWORDS):
    """Conditional preprocessing on our text unique to our task."""
    # Lower
    text = text.lower()

    # Remove stopwords
    pattern = re.compile(r"\b(" + r"|".join(stopwords) + r")\b\s*")
    text = pattern.sub("", text)

    # Remove words in parentheses
    text = re.sub(r"\([^)]*\)", "", text)

    # Spacing and filters
    text = re.sub(r"([-;;.,!?<=>])", r" \1 ", text)
    text = re.sub("[^A-Za-z0-9]+", " ", text) # remove non alphanumeric chars
    text = re.sub(" +", " ", text)  # remove multiple spaces
    text = text.strip()

    return text

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
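To see what each step of `preprocess` does without downloading anything, the sketch below repeats the same regular expressions with a tiny hand-written stopword list. `TOY_STOPWORDS` is an assumption made for illustration; the notebook itself uses the full NLTK English list.

```python
import re

# Hypothetical, much shorter stand-in for the NLTK stopword list
TOY_STOPWORDS = ["no", "of", "has", "been", "the", "is"]

def preprocess(text, stopwords=TOY_STOPWORDS):
    """Same steps as the notebook's preprocess, with a toy stopword list."""
    text = text.lower()
    # Remove stopwords (whole words only) and the whitespace after them
    pattern = re.compile(r"\b(" + r"|".join(stopwords) + r")\b\s*")
    text = pattern.sub("", text)
    # Remove anything inside parentheses
    text = re.sub(r"\([^)]*\)", "", text)
    # Space out punctuation, drop non-alphanumerics, collapse spaces
    text = re.sub(r"([-;;.,!?<=>])", r" \1 ", text)
    text = re.sub("[^A-Za-z0-9]+", " ", text)
    text = re.sub(" +", " ", text)
    return text.strip()

cleaned = preprocess("No single cause of PPD (postpartum depression) has been identified.")
short = preprocess("Fish oil:")
print(cleaned)
print(short)
```

Note that a heading-like sentence such as "Fish oil:" survives as a two-token line, so the model still has to assign it a label.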


In [None]:
df.text = df.text.apply(preprocess)
df

Unnamed: 0,text,line_number,total_lines
0,postpartum depression prevalent mood disorder ...,0,11
1,single cause ppd identified however increased ...,1,11
2,three nutritional interventions drawn particul...,2,11
3,searched meta analyses randomised controlled t...,3,11
4,fish oil,4,11
5,eleven rct prenatal fish oil supplementation r...,5,11
6,vitamin relevant rct identified however seven ...,6,11
7,diet,7,11
8,two australian rct dietary advice intervention...,8,11
9,exception fish oil rct nutritional interventio...,9,11


In [None]:
text_seq = tokenizer.texts_to_sequences(texts=df['text'])

In [None]:
def pad_sequences(sequences, max_seq_len=0):
    """Pad sequences to max length in sequence."""
    max_seq_len = max(max_seq_len, max(len(sequence) for sequence in sequences))
    padded_sequences = np.zeros((len(sequences), max_seq_len))
    for i, sequence in enumerate(sequences):
        padded_sequences[i][:len(sequence)] = sequence
    return padded_sequences
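A quick check of `pad_sequences` on two short sequences shows the shape and dtype it returns. The token IDs below are arbitrary illustrative values.

```python
import numpy as np

def pad_sequences(sequences, max_seq_len=0):
    """Pad sequences with zeros up to the longest sequence (or max_seq_len)."""
    max_seq_len = max(max_seq_len, max(len(sequence) for sequence in sequences))
    padded_sequences = np.zeros((len(sequences), max_seq_len))
    for i, sequence in enumerate(sequences):
        padded_sequences[i][:len(sequence)] = sequence
    return padded_sequences

sequences = [[1224, 207, 1954], [138, 523, 4793, 455, 92]]  # arbitrary token IDs

padded = pad_sequences(sequences)                       # pad to the longest sequence
padded_wide = pad_sequences(sequences, max_seq_len=8)   # pad to a fixed width

print(padded)
print(padded.dtype, padded_wide.shape)
```

Because `np.zeros` defaults to `float64`, the result has to be cast to an integer type before it can be turned into a `LongTensor`, which is what the `astype(np.int32)` call in `collate_fn` does.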

In [None]:
class SkimlitDataset(Dataset):
  def __init__(self, text_seq, line_num, total_line):
    self.text_seq = text_seq
    self.line_num_one_hot = line_num
    self.total_line_one_hot = total_line

  def __len__(self):
    return len(self.text_seq)

  def __str__(self):
    return f"<Dataset(N={len(self)})>"

  def __getitem__(self, index):
    X = self.text_seq[index]
    line_num = self.line_num_one_hot[index]
    total_line = self.total_line_one_hot[index]
    return [X, len(X), line_num, total_line]

  def collate_fn(self, batch):
    """Processing on a batch"""
    # Getting inputs; dtype=object keeps the ragged token sequences intact
    # and avoids NumPy's ragged-array warning (an error in newer versions)
    batch = np.array(batch, dtype=object)
    text_seq = batch[:, 0]
    seq_lens = batch[:, 1]
    line_nums = batch[:, 2].astype(np.int64)
    total_lines = batch[:, 3].astype(np.int64)

    # padding inputs
    pad_text_seq = pad_sequences(sequences=text_seq) # max_seq_len=max_length

    # converting line nums into one-hot encoding
    line_nums = tf.one_hot(line_nums, depth=20)

    # converting total lines into one-hot encoding
    total_lines = tf.one_hot(total_lines, depth=24)

    # converting inputs to tensors
    pad_text_seq = torch.LongTensor(pad_text_seq.astype(np.int32))
    seq_lens = torch.LongTensor(seq_lens.astype(np.int32))
    line_nums = torch.tensor(line_nums.numpy())
    total_lines = torch.tensor(total_lines.numpy())

    return pad_text_seq, seq_lens, line_nums, total_lines

  def create_dataloader(self, batch_size, shuffle=False, drop_last=False):
    dataloader = DataLoader(dataset=self, batch_size=batch_size, collate_fn=self.collate_fn, shuffle=shuffle, drop_last=drop_last, pin_memory=True)
    return dataloader
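The one-hot step in `collate_fn` fixes the depth at 20 for line numbers and 24 for total lines. The NumPy sketch below illustrates what that means for abstracts that exceed those limits. `one_hot_np` is a hypothetical helper; it assumes, as we understand the `tf.one_hot` documentation, that an out-of-range index produces an all-zero row rather than an error.

```python
import numpy as np

def one_hot_np(indices, depth):
    """One row of length `depth` per index; out-of-range indices give all zeros."""
    indices = np.asarray(indices, dtype=np.int64)
    encoded = np.zeros((len(indices), depth), dtype=np.float32)
    in_range = (indices >= 0) & (indices < depth)
    encoded[np.nonzero(in_range)[0], indices[in_range]] = 1.0
    return encoded

line_numbers = [0, 1, 25]  # 25 is beyond the depth of 20
encoded = one_hot_np(line_numbers, depth=20)
print(encoded.shape, encoded.sum(axis=1))
```

Under that assumption, a line numbered 20 or higher would carry no position signal at all, which is worth keeping in mind for unusually long abstracts.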


In [None]:
dataset = SkimlitDataset(text_seq=text_seq, line_num=df['line_number'], total_line=df['total_lines'])

In [None]:
dataloader = dataset.create_dataloader(batch_size=2)

In [None]:
batch_text_seq, batch_seq_len, batch_line_num, batch_total_line = next(iter(dataloader))
batch_line_num.shape, batch_total_line.shape, batch_line_num


(torch.Size([2, 20]),
 torch.Size([2, 24]),
 tensor([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0.],
         [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0.]]))

In [None]:
batch_text_seq

tensor([[1224,  207, 1954, 1196,  510,   34, 4487,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0],
        [ 138,  523, 4793,  455,   92,   57,   27, 1174, 5795, 9870,   58, 1174,
         1901,  481,   49, 1765,  460, 3243,  560,  102]])

In [None]:
def model_prediction(model, dataloader):
  """Prediction step: return class probabilities for every line."""
  # Set model to eval mode (disables dropout)
  model.eval()
  y_probs = []
  # Iterate over batches without tracking gradients
  with torch.no_grad():
    for batch in dataloader:
      # Forward pass w/ inputs (everything stays on the CPU here)
      z = model(batch)
      # Store softmax probabilities
      y_prob = F.softmax(z, dim=1).cpu().numpy()
      y_probs.extend(y_prob)
  return np.vstack(y_probs)
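The softmax applied to the model's logits can be sketched in NumPy as follows. The logits are invented for illustration; the point is that each row becomes a probability distribution over the five classes while the position of the largest value stays the same.

```python
import numpy as np

def softmax_np(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical logits for two lines and five classes
logits = np.array([[2.0, 0.5, -1.0, 4.0, 0.0],
                   [0.1, 3.0, 0.2, 0.3, 1.0]])

probs = softmax_np(logits)
print(probs.sum(axis=1))     # each row sums to 1
print(probs.argmax(axis=1))  # same winners as the raw logits
```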

In [None]:
y_pred = model_prediction(model, dataloader)
y_pred


array([[8.74797925e-02, 9.02174274e-04, 4.10179753e-04, 9.11109507e-01,
        9.83146892e-05],
       [7.66822770e-02, 8.60808909e-01, 6.15286781e-03, 1.59580614e-02,
        4.03977782e-02],
       [4.08619761e-01, 3.31020087e-01, 1.89653691e-02, 1.29672036e-01,
        1.11722834e-01],
       [2.27151457e-02, 1.13745010e-03, 9.51012135e-01, 1.77959464e-02,
        7.33933691e-03],
       [2.29490504e-01, 2.66931385e-01, 3.44053239e-01, 3.64570245e-02,
        1.23068005e-01],
       [1.91957166e-03, 1.82036653e-01, 2.19638404e-02, 6.30196417e-04,
        7.93449759e-01],
       [2.68303514e-01, 4.63551819e-01, 1.07842740e-02, 5.97177595e-02,
        1.97642624e-01],
       [2.10290566e-01, 6.02336228e-02, 6.04235709e-01, 2.89643854e-02,
        9.62757245e-02],
       [2.06474103e-02, 7.14093819e-03, 9.22720015e-01, 1.54736722e-02,
        3.40179615e-02],
       [3.11644264e-02, 3.87566537e-01, 3.68912607e-01, 1.51796862e-02,
        1.97176769e-01],
       [1.70349088e-02, 9.7521

In [None]:
pred = y_pred.argmax(axis=1)
pred = label_encoder.decode(pred)
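The decoding step can be reproduced by hand with plain NumPy. The sketch below uses the first two probability rows shown above (rounded) and assumes that `label_encoder.decode` maps each index to the class name in the order printed earlier; that mapping is an assumption about the helper, not something verified here.

```python
import numpy as np

# Class order as printed by label_encoder.class_to_index.keys()
class_names = ["BACKGROUND", "CONCLUSIONS", "METHODS", "OBJECTIVE", "RESULTS"]

# First two rows of y_pred, rounded to four decimals
y_pred_example = np.array([
    [0.0875, 0.0009, 0.0004, 0.9111, 0.0001],
    [0.0767, 0.8608, 0.0062, 0.0160, 0.0404],
])

pred_indices = y_pred_example.argmax(axis=1)          # most probable class per line
pred_labels = [class_names[i] for i in pred_indices]  # index -> class name
confidences = y_pred_example.max(axis=1)              # probability of that class

for label, confidence in zip(pred_labels, confidences):
    print(f"{label} ({confidence:.2f})")
```

Keeping the maximum probability next to the label makes it easy to spot low-confidence lines, such as the third row above where no class reaches 0.5.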

In [None]:
# Visualize abstract lines and predicted sequence labels
for i, line in enumerate(abstract_lines):
    print(f"{pred[i]}: {line}")

OBJECTIVE: Postpartum depression (PPD) is the most prevalent mood disorder associated with childbirth.
CONCLUSIONS: No single cause of PPD has been identified, however the increased risk of nutritional deficiencies incurred through the high nutritional requirements of pregnancy may play a role in the pathology of depressive symptoms.
BACKGROUND: Three nutritional interventions have drawn particular interest as possible non-invasive and cost-effective prevention and/or treatment strategies for PPD; omega-3 (n-3) long chain polyunsaturated fatty acids (LCPUFA), vitamin D and overall diet.
METHODS: We searched for meta-analyses of randomised controlled trials (RCT's) of nutritional interventions during the perinatal period with PPD as an outcome, and checked for any trials published subsequently to the meta-analyses.
METHODS: Fish oil:
RESULTS: Eleven RCT's of prenatal fish oil supplementation RCT's show null and positive effects on PPD symptoms.
CONCLUSIONS: Vitamin D: no relevant RCT's 

# Creating the final function

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re

def download_stopwords():
  """Download the NLTK stopword list and return it with a Porter stemmer."""
  nltk.download("stopwords")
  STOPWORDS = stopwords.words("english")
  porter = PorterStemmer()
  return STOPWORDS, porter

# Define STOPWORDS here so the default argument below also works in a fresh session
STOPWORDS, porter = download_stopwords()

def preprocess(text, stopwords=STOPWORDS):
    """Conditional preprocessing on our text unique to our task."""
    # Lower
    text = text.lower()

    # Remove stopwords
    pattern = re.compile(r"\b(" + r"|".join(stopwords) + r")\b\s*")
    text = pattern.sub("", text)

    # Remove words in parentheses
    text = re.sub(r"\([^)]*\)", "", text)

    # Spacing and filters
    text = re.sub(r"([-;;.,!?<=>])", r" \1 ", text)
    text = re.sub("[^A-Za-z0-9]+", " ", text) # remove non alphanumeric chars
    text = re.sub(" +", " ", text)  # remove multiple spaces
    text = text.strip()

    return text

In [None]:
def load_glove_embeddings(embeddings_file):
    """Load embeddings from a file."""
    embeddings = {}
    with open(embeddings_file, "r") as fp:
        for index, line in enumerate(fp):
            values = line.split()
            word = values[0]
            embedding = np.asarray(values[1:], dtype='float32')
            embeddings[word] = embedding
    return embeddings

def make_embeddings_matrix(embeddings, word_index, embedding_dim):
    """Create embeddings matrix to use in Embedding layer."""
    embedding_matrix = np.zeros((len(word_index), embedding_dim))
    for word, i in word_index.items():
        embedding_vector = embeddings.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

def get_embeddings(embedding_file_path, tokenizer, embedding_dim):
  """Load GloVe vectors from embedding_file_path and build the embedding matrix."""
  glove_embeddings = load_glove_embeddings(embeddings_file=embedding_file_path)
  embedding_matrix = make_embeddings_matrix(embeddings=glove_embeddings, word_index=tokenizer.token_to_index, embedding_dim=embedding_dim)
  return embedding_matrix

In [None]:
import spacy

def spacy_function(abstract):
  nlp = spacy.load("en_core_web_sm") # set up the English sentence parser
  doc = nlp(abstract) # create "doc" of parsed sequences
  abstract_lines = [str(sent) for sent in list(doc.sents)] # list of lines as strings (not spaCy type)

  return abstract_lines

# ---------------------------------------------------------------------------------------------------------------------------

def model_prediction(model, dataloader):
  """Prediction step: return class probabilities for every line."""
  # Set model to eval mode (disables dropout)
  model.eval()
  y_probs = []
  # Iterate over batches without tracking gradients
  with torch.no_grad():
    for batch in dataloader:
      # Forward pass w/ inputs (everything stays on the CPU here)
      z = model(batch)
      # Store softmax probabilities
      y_prob = F.softmax(z, dim=1).cpu().numpy()
      y_probs.extend(y_prob)
  return np.vstack(y_probs)

# ---------------------------------------------------------------------------------------------------------------------------

def make_predictions(text, embeding_path, model_path, tokenizer, label_encoder):
  # splitting the abstract into separate lines (sentences)
  abstract_lines = spacy_function(text)

  # Get total number of lines
  total_lines_in_sample = len(abstract_lines)

  # Go through each line in abstract and create a list of dictionaries containing features for each line
  sample_lines = []
  for i, line in enumerate(abstract_lines):
    sample_dict = {}
    sample_dict["text"] = str(line)
    sample_dict["line_number"] = i
    sample_dict["total_lines"] = total_lines_in_sample - 1
    sample_lines.append(sample_dict)

  # converting sample line list into pandas Dataframe
  df = pd.DataFrame(sample_lines)

  # getting stopwords
  STOPWORDS, porter = download_stopwords()

  # applying preprocessing function to lines
  df.text = df.text.apply(lambda x: preprocess(x, STOPWORDS))

  # converting texts into numerical sequences
  text_seq = tokenizer.texts_to_sequences(texts=df['text'])

  # creating Dataset
  dataset = SkimlitDataset(text_seq=text_seq, line_num=df['line_number'], total_line=df['total_lines'])

  # creating dataloader
  dataloader = dataset.create_dataloader(batch_size=2)

  # Preparing embeddings
  embedding_matrix = get_embeddings(embeding_path, tokenizer, 300)

  # creating model
  model = SkimlitModel(embedding_dim=300, vocab_size=len(tokenizer), hidden_dim=128, n_layers=3, linear_output=128, num_classes=len(label_encoder), pretrained_embeddings=embedding_matrix)

  # loading model weights from the path passed in as model_path
  model.load_state_dict(torch.load(model_path, map_location='cpu'))

  # setting model into evaluation mode
  model.eval()

  # getting predictions
  y_pred = model_prediction(model, dataloader)

  # converting predictions into label class
  pred = y_pred.argmax(axis=1)
  pred = label_encoder.decode(pred)

  return abstract_lines, pred
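Since `make_predictions` returns the lines and their labels as two parallel sequences, one way to make the result easier to skim is to group the lines by predicted label. The helper below, `group_lines_by_label`, is a hypothetical addition and not part of the original pipeline; the example lines and labels are taken from the first example abstract.

```python
from collections import defaultdict

def group_lines_by_label(abstract_lines, pred):
    """Collect abstract lines under their predicted label, keeping line order."""
    grouped = defaultdict(list)
    for label, line in zip(pred, abstract_lines):
        grouped[label].append(line.strip())
    return dict(grouped)

example_lines = [
    "This RCT examined the efficacy of a manualized social intervention for children with HFASDs.",
    "Participants were randomly assigned to treatment or wait-list conditions.",
    "Secondary measures based on staff ratings (treatment group only) corroborated gains reported by parents.",
]
example_pred = ["OBJECTIVE", "METHODS", "RESULTS"]

grouped = group_lines_by_label(example_lines, example_pred)
for label, lines in grouped.items():
    print(f"{label}:")
    for line in lines:
        print(f"  {line}")
```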

# Prediction 1

In [None]:
tokenizer = Tokenizer.load(fp='/content/drive/MyDrive/Datasets/SkimLit/utils/tokenizer.json')
label_encoder = LabelEncoder.load(fp='/content/drive/MyDrive/Datasets/SkimLit/utils/label_encoder.json')

abstract_lines, pred = make_predictions(
    abstracts.abstract[1],
    '/content/glove.6B.300d.txt',
    '/content/drive/MyDrive/Datasets/SkimLit/utils/skimlit-model-final-1.pt',
    tokenizer,
    label_encoder,
)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# Visualize abstract lines and predicted sequence labels
for i, line in enumerate(abstract_lines):
    print(f"{pred[i]}: {line}")

OBJECTIVE: Postpartum depression (PPD) is the most prevalent mood disorder associated with childbirth.
CONCLUSIONS: No single cause of PPD has been identified, however the increased risk of nutritional deficiencies incurred through the high nutritional requirements of pregnancy may play a role in the pathology of depressive symptoms.
CONCLUSIONS: Three nutritional interventions have drawn particular interest as possible non-invasive and cost-effective prevention and/or treatment strategies for PPD; omega-3 (n-3) long chain polyunsaturated fatty acids (LCPUFA), vitamin D and overall diet.
METHODS: We searched for meta-analyses of randomised controlled trials (RCT's) of nutritional interventions during the perinatal period with PPD as an outcome, and checked for any trials published subsequently to the meta-analyses.
METHODS: Fish oil:
RESULTS: Eleven RCT's of prenatal fish oil supplementation RCT's show null and positive effects on PPD symptoms.
CONCLUSIONS: Vitamin D: no relevant RCT's

# Prediction 2

In [None]:
abstract_lines, pred = make_predictions(
    abstracts.abstract[0],
    '/content/glove.6B.300d.txt',
    '/content/drive/MyDrive/Datasets/SkimLit/utils/skimlit-model-final-1.pt',
    tokenizer,
    label_encoder,
)

print(abstracts.abstract[0])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


This RCT examined the efficacy of a manualized social intervention for children with HFASDs. Participants were randomly assigned to treatment or wait-list conditions. Treatment included instruction and therapeutic activities targeting social skills, face-emotion recognition, interest expansion, and interpretation of non-literal language. A response-cost program was applied to reduce problem behaviors and foster skills acquisition. Significant treatment effects were found for five of seven primary outcome measures (parent ratings and direct child measures). Secondary measures based on staff ratings (treatment group only) corroborated gains reported by parents. High levels of parent, child and staff satisfaction were reported, along with high levels of treatment fidelity. Standardized effect size estimates were primarily in the medium and large ranges and favored the treatment group.



In [None]:
# Visualize abstract lines and predicted sequence labels
for i, line in enumerate(abstract_lines):
    print(f"{pred[i]}: {line}")

OBJECTIVE: This RCT examined the efficacy of a manualized social intervention for children with HFASDs.
METHODS: Participants were randomly assigned to treatment or wait-list conditions.
METHODS: Treatment included instruction and therapeutic activities targeting social skills, face-emotion recognition, interest expansion, and interpretation of non-literal language.
CONCLUSIONS: A response-cost program was applied to reduce problem behaviors and foster skills acquisition.
RESULTS: Significant treatment effects were found for five of seven primary outcome measures (parent ratings and direct child measures).
RESULTS: Secondary measures based on staff ratings (treatment group only) corroborated gains reported by parents.
RESULTS: High levels of parent, child and staff satisfaction were reported, along with high levels of treatment fidelity.
RESULTS: Standardized effect size estimates were primarily in the medium and large ranges and favored the treatment group.


# Prediction 3

In [None]:
abstract_lines, pred = make_predictions(
    abstracts.abstract[2],
    '/content/glove.6B.300d.txt',
    '/content/drive/MyDrive/Datasets/SkimLit/utils/skimlit-model-final-1.pt',
    tokenizer,
    label_encoder,
)

abstracts.abstract[2]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


'Mental illness, including depression, anxiety and bipolar disorder, accounts for a significant proportion of global disability and poses a substantial social, economic and heath burden. Treatment is presently dominated by pharmacotherapy, such as antidepressants, and psychotherapy, such as cognitive behavioural therapy; however, such treatments avert less than half of the disease burden, suggesting that additional strategies are needed to prevent and treat mental disorders. There are now consistent mechanistic, observational and interventional data to suggest diet quality may be a modifiable risk factor for mental illness. This review provides an overview of the nutritional psychiatry field. It includes a discussion of the neurobiological mechanisms likely modulated by diet, the use of dietary and nutraceutical interventions in mental disorders, and recommendations for further research. Potential biological pathways related to mental disorders include inflammation, oxidative stress, t

In [None]:
# Visualize abstract lines and predicted sequence labels
for i, line in enumerate(abstract_lines):
    print(f"{pred[i]}: {line}")

BACKGROUND: Mental illness, including depression, anxiety and bipolar disorder, accounts for a significant proportion of global disability and poses a substantial social, economic and heath burden.
BACKGROUND: Treatment is presently dominated by pharmacotherapy, such as antidepressants, and psychotherapy, such as cognitive behavioural therapy; however, such treatments avert less than half of the disease burden, suggesting that additional strategies are needed to prevent and treat mental disorders.
CONCLUSIONS: There are now consistent mechanistic, observational and interventional data to suggest diet quality may be a modifiable risk factor for mental illness.
BACKGROUND: This review provides an overview of the nutritional psychiatry field.
BACKGROUND: It includes a discussion of the neurobiological mechanisms likely modulated by diet, the use of dietary and nutraceutical interventions in mental disorders, and recommendations for further research.
BACKGROUND: Potential biological pathwa