# Model 3: toxchange.toxicology.org

In this section we describe the approach we used to build our first model, drawing on what we learned from the baseline model, and the steps taken to make it effective. For this experiment we used a controlled dataset.

In [1]:
# imports
import sys
import os
import numpy as np
import pandas as pd
import ujson
import re
import spacy
import math
from sklearn.model_selection import train_test_split

from gensim.models import Word2Vec

from nltk.corpus import stopwords
stop_words = stopwords.words('english')

## Model

In this approach we use a Word2Vec model, which learns vector representations of words called word embeddings. We first learn embeddings for the entire corpus. We then go through each document and, for each word, look up its most similar words in the corpus. A word is marked as relevant in the document when enough of those nearest neighbours exceed a cosine-similarity threshold.
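
The selection rule described above can be illustrated without a trained model. The following is a minimal sketch, assuming hand-made 2-d vectors in place of real Word2Vec embeddings; the names `cosine`, `is_relevant`, and `toy_embeddings`, and the small thresholds, are hypothetical and chosen only to keep the example readable.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_relevant(word, embeddings, top_n=3, sim_threshold=0.7, min_neighbours=2):
    """Flag a word when enough of its top_n nearest neighbours are close to it."""
    sims = sorted(
        (cosine(embeddings[word], vec)
         for other, vec in embeddings.items() if other != word),
        reverse=True,
    )[:top_n]
    return sum(s >= sim_threshold for s in sims) >= min_neighbours

# Hypothetical 2-d embeddings, hand-made for illustration only
toy_embeddings = {
    "toxicity": [0.9, 0.1],
    "dose": [0.8, 0.2],
    "exposure": [0.85, 0.15],
    "luncheon": [0.1, 0.9],
}

relevant = [w for w in toy_embeddings if is_relevant(w, toy_embeddings)]
print(relevant)
```

In this toy vocabulary the three words that cluster together are flagged, while the isolated word is not, which is the behaviour the approach relies on.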


## Dataset

For this experiment we used a controlled dataset consisting only of blog entries from one blog site: https://toxchange.toxicology.org/p/bl/et/blogid=9

In [2]:
# Data file
data_file ='../../../Datasets/selected/toxchange.toxicology.org.xlsx'
source_data = pd.read_excel(data_file)

print("Shape:",source_data.shape)
source_data.head()

Shape: (2340, 4)


Unnamed: 0,article_date,article_title,article_content,article_url
0,2019-09-19 16:30:00,Apply for a 2020 Supported Award,"Each year, SOT presents awards in partnership ...",https://toxchange.toxicology.org/p/bl/et/bloga...
1,2019-09-19 16:30:00,Nominate a Scientist or Clinician for the 2020...,The SOT Translational/Bridging Travel Award is...,https://toxchange.toxicology.org/p/bl/et/bloga...
2,2019-09-19 16:30:00,National Postdoc Appreciation Week: A Message ...,I would like to thank everyone who participate...,https://toxchange.toxicology.org/p/bl/et/bloga...
3,2019-09-19 16:30:00,Nominations Are Open for the 2020 SOT Translat...,The SOT Translational Impact Award is presente...,https://toxchange.toxicology.org/p/bl/et/bloga...
4,2019-09-19 16:30:00,SOT/SOT Endowment Fund/IUTOX Travel Awards: Fu...,The SOT/SOT Endowment Fund/IUTOX Travel Awards...,https://toxchange.toxicology.org/p/bl/et/bloga...


In [3]:
source_data['word_count'] = source_data['article_content'].str.split().str.len()

# View some metrics of data
print("Number of Blogs:",f'{source_data.shape[0]:,}')
print("Minimum Article Date:",min(source_data['article_date']).strftime("%b %d %Y"))
print("Maximum Article Date:",max(source_data['article_date']).strftime("%b %d %Y"))
print("Minimum Word Count:",min(source_data['word_count']))
print("Maximum Word Count:",f'{max(source_data["word_count"]):,}')

Number of Blogs: 2,340
Minimum Article Date: Mar 11 2012
Maximum Article Date: Sep 19 2019
Minimum Word Count: 10.0
Maximum Word Count: 3,189.0


In [4]:
# Preview some blogs
print("------ Blog 1--------")
print(source_data["article_title"][0],source_data["article_content"][0][:500])
print("------ Blog 2--------")
print(source_data["article_title"][1],source_data["article_content"][1][:500])
print("------ Blog 3--------")
print(source_data["article_title"][2],source_data["article_content"][2][:500])

------ Blog 1--------
Apply for a 2020 Supported Award Each year, SOT presents awards in partnership with the Colgate-Palmolive Company and Syngenta during the SOT Annual Meeting. These companies sponsor several awards, grants, and fellowships allowing the recipients to conduct research for the following year. The deadline for these awards, which are featured here, is October 9, 2019.

The purpose of the Colgate-Palmolive Award for Student Research Training in Alternative Methods is to enhance student research training using in vitro methods or al
------ Blog 2--------
Nominate a Scientist or Clinician for the 2020 SOT Translational/Bridging Travel Award The SOT Translational/Bridging Travel Award is given to assist up to two individuals with travel to the SOT Annual Meeting. The SOT Awards Committee provides this award to mid- or senior-level scientists/clinicians with at least 10 years of experience (postdoctoral research/clinical practice) and who either have an active research pr

## Data Preprocessing

The data preprocessing steps we follow in order to feed the data to the model are:
- Combine title with blog content
- Remove line breaks
- Remove special characters
- Remove short words (fewer than 3 letters)
- Convert text to lowercase
- Remove stop words
- Tokenize
- Lemmatize
- Remove custom stop words
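
The string-level steps in this list can be sketched on a single entry as follows. This is an illustration only: `preprocess` and `TOY_STOP_WORDS` are hypothetical names, the stop-word set is a tiny stand-in for the NLTK English list, and lemmatization and the custom stop-word lookup are left out because they depend on spaCy and an external file.

```python
import re

# Tiny stand-in stop-word list (an assumption; the notebook uses NLTK's English list)
TOY_STOP_WORDS = {"the", "for", "and", "with", "are"}

def preprocess(title, content, stop_words=TOY_STOP_WORDS):
    """Apply the string-level cleaning steps to one blog entry and return its tokens."""
    text = f"{title} {content}"                        # combine title with content
    text = text.replace("\n", " ")                     # remove line breaks
    text = re.sub(r"[^a-zA-Z#]", " ", text)            # keep letters and '#', drop the rest
    words = [w for w in text.split() if len(w) > 2]    # drop words shorter than 3 letters
    words = [w.lower() for w in words]                 # convert to lowercase
    return [w for w in words if w not in stop_words]   # remove stop words

tokens = preprocess("Apply for a 2020 Award", "Awards are presented\nwith SOT in 2019.")
print(tokens)
```

Note that digits and punctuation are replaced by spaces before the length filter runs, so fragments left over from tokens such as dates are dropped by the short-word step.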

In [5]:
# Custom stop words
custom_stopwords_file ='../lookups/custom_stopwords.txt'
custom_stopwords_df = pd.read_csv(custom_stopwords_file, header=None)
print("Shape:",custom_stopwords_df.shape)
custom_stopwords = custom_stopwords_df[0].tolist()

Shape: (1138, 1)


In [6]:
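# Utilities to perform data cleaning and preparation

# Note: the 'en' shortcut works on spaCy 2.x; spaCy 3 needs a full model name such as 'en_core_web_sm'
nlp = spacy.load('en', disable=['parser', 'ner'])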

# function to remove stopwords
def remove_stopwords(rev):
    rev_new = " ".join([i for i in rev if i not in stop_words])
    return rev_new

def lemmatization(texts, tags=['NOUN', 'ADJ']):
    output = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        output.append([token.lemma_ for token in doc if token.pos_ in tags])
    return output

# function to remove custom stopwords
def remove_custom_stopwords(texts):
    output = []
    for sent in texts:
        output.append([word for word in sent if word not in custom_stopwords])
    return output

In [7]:
# Merge title with content
source_data['text'] = source_data['article_title'] + " " + source_data["article_content"]

# Convert column to str
source_data['text'] = source_data['text'].apply(str)

# Replace line breaks
article_text = source_data['text'].str.replace("\n", " ")

# remove unwanted characters, numbers and symbols
# (regex=True is explicit because newer pandas versions default to literal matching)
article_text = article_text.str.replace("[^a-zA-Z#]", " ", regex=True)

# remove short words (length < 3)
article_text = article_text.apply(lambda x: ' '.join([w for w in x.split() if len(w)>2]))

# make entire text lowercase
article_text = [r.lower() for r in article_text]

# remove stopwords from the text
article_text = [remove_stopwords(r.split()) for r in article_text]

# Tokenize
tokenized_text = pd.Series(article_text).apply(lambda x: x.split())
# Lemmatize
tokenized_text = lemmatization(tokenized_text)
# Remove custom stopwords
tokenized_text = remove_custom_stopwords(tokenized_text)

flattened_text = []
for i in range(len(tokenized_text)):
    flattened_text.append(' '.join(tokenized_text[i]))

source_data['text'] = flattened_text

# Update word count
source_data['word_count'] = source_data['text'].str.split().str.len()

# Drop documents with fewer than 15 words left after preprocessing
source_data = source_data[source_data['word_count'] > 14]
source_data = source_data.reset_index()

In [8]:
print("Shape:",source_data.shape)
source_data.head()

Shape: (1727, 7)


Unnamed: 0,index,article_date,article_title,article_content,article_url,word_count,text
0,0,2019-09-19 16:30:00,Apply for a 2020 Supported Award,"Each year, SOT presents awards in partnership ...",https://toxchange.toxicology.org/p/bl/et/bloga...,49,syngenta recipient deadline replace refine tox...
1,1,2019-09-19 16:30:00,Nominate a Scientist or Clinician for the 2020...,The SOT Translational/Bridging Travel Award is...,https://toxchange.toxicology.org/p/bl/et/bloga...,18,nominate assist mid clinician clinical clinica...
2,2,2019-09-19 16:30:00,National Postdoc Appreciation Week: A Message ...,I would like to thank everyone who participate...,https://toxchange.toxicology.org/p/bl/et/bloga...,35,assembly pda nationwide accomplishment integra...
3,3,2019-09-19 16:30:00,Nominations Are Open for the 2020 SOT Translat...,The SOT Translational Impact Award is presente...,https://toxchange.toxicology.org/p/bl/et/bloga...,19,nomination nonmember outstanding clinical toxi...
4,4,2019-09-19 16:30:00,SOT/SOT Endowment Fund/IUTOX Travel Awards: Fu...,The SOT/SOT Endowment Fund/IUTOX Travel Awards...,https://toxchange.toxicology.org/p/bl/et/bloga...,16,iutox iutox toxicology junior toxicology under...


In [9]:
# Preview some pre processed text
print("------ Blog 1--------")
print(source_data["text"][0][:500])
print("------ Blog 2--------")
print(source_data["text"][1][:500])
print("------ Blog 3--------")
print(source_data["text"][2][:500])

------ Blog 1--------
syngenta recipient deadline replace refine toxicological nonmammalian modeling structure methodology contribute dissertation toxicology expense expense consistent sponsor stipend trainee preference applicant identifie refine validate acceptable formulation nonanimal acute toxicity maximum lump payment progression eligible submit awardee subsequent mode dependent causal sequence toxicity quantitative extrapolation dose trainee recipient recipient communiqu
------ Blog 2--------
nominate assist mid clinician clinical clinical toxicology ceremony nomination nomination package nomination nominee maximum nonmember qualified bridging recipient
------ Blog 3--------
assembly pda nationwide accomplishment integral pda recruitment pda formulate toxicology ideal stay scholar involvement pda scholar scholar advancement dedicated volunteer pda provide excited planning pda upcoming poster informational luncheon assembly luncheon outstanding traineeship exceptional toxicology


## Build Word2Vec


In [10]:
# Train Test Split
train,test = train_test_split(source_data,test_size=0.01, shuffle=False)
train = train.reset_index()
test = test.reset_index()

tokenized_text_train = train['text'].apply(lambda x: x.split()).tolist()
tokenized_text_test = test['text'].apply(lambda x: x.split()).tolist()

print("Shape Train:",train.shape)
print("Shape Test:",test.shape)

Shape Train: (1709, 8)
Shape Test: (18, 8)
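
The 1709/18 split follows from how a fractional `test_size` is turned into a row count. The sketch below assumes scikit-learn's rounding rule (the test count is rounded up and the remainder goes to training); `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Row counts for a fractional test_size: the test set is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_train, n_test = split_sizes(1727, 0.01)
print(n_train, n_test)  # 1709 18
```

Because `shuffle=False` is passed, the test set is simply the last 18 rows in file order rather than a random sample.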


In [11]:
# Set parameters
feature_size = 20    # Word vector dimensionality
window_context = 30  # Context window size
min_word_count = 1   # Minimum word count
sample = 1e-3        # Downsample setting for frequent words

w2v_model = Word2Vec(tokenized_text_train, size=feature_size, 
                          window=window_context, min_count=min_word_count,
                          sample=sample, iter=100)
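
The `size` and `iter` keyword arguments used above belong to the gensim 3.x API; gensim 4 renamed them to `vector_size` and `epochs`. A minimal sketch of a version-aware way to build the arguments is shown below. The helper `word2vec_kwargs` is hypothetical and only assembles a dictionary, so it runs without gensim installed.

```python
def word2vec_kwargs(gensim_version, feature_size=20, window_context=30,
                    min_word_count=1, sample=1e-3, epochs=100):
    """Build Word2Vec keyword arguments matching the given gensim major version."""
    kwargs = {"window": window_context, "min_count": min_word_count, "sample": sample}
    major = int(gensim_version.split(".")[0])
    if major >= 4:
        # gensim 4 renamed `size` to `vector_size` and `iter` to `epochs`
        kwargs.update(vector_size=feature_size, epochs=epochs)
    else:
        # gensim 3.x names, as used in the cell above
        kwargs.update(size=feature_size, iter=epochs)
    return kwargs

print(word2vec_kwargs("3.8.3"))
print(word2vec_kwargs("4.3.2"))
```

With gensim imported, this could be used as `Word2Vec(tokenized_text_train, **word2vec_kwargs(gensim.__version__))`, assuming the version string starts with the major version number.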

In [12]:
TOP_N_SIMILAR_WORDS = 10
COSINE_SIMILARITY_THRESHOLD = 0.70
NUM_SIMILAR_WORDS_THRESHOLD = 6

# Find relevant words in the train set
train["relevant_words"] = ""
for index, row in train.iterrows():
    # Get tokenized text
    tokenized_text = row["text"].split()
    # Get the unique words in the document
    unique_words = set(tokenized_text)
    relevant_words = []
    for word in unique_words:
        # Find similar words from the corpus
        similar_words = w2v_model.wv.most_similar([word], topn=TOP_N_SIMILAR_WORDS)
        similar_words = [x[0] for x in similar_words if x[1] >= COSINE_SIMILARITY_THRESHOLD]
        if len(similar_words) >= NUM_SIMILAR_WORDS_THRESHOLD:
            relevant_words.append(word)
    
    train.at[index, "relevant_words"]=relevant_words

print("Shape Train:",train.shape)
train.head()

Shape Train: (1709, 9)


Unnamed: 0,level_0,index,article_date,article_title,article_content,article_url,word_count,text,relevant_words
0,0,0,2019-09-19 16:30:00,Apply for a 2020 Supported Award,"Each year, SOT presents awards in partnership ...",https://toxchange.toxicology.org/p/bl/et/bloga...,49,syngenta recipient deadline replace refine tox...,"[validate, acceptable, nonanimal, methodology,..."
1,1,1,2019-09-19 16:30:00,Nominate a Scientist or Clinician for the 2020...,The SOT Translational/Bridging Travel Award is...,https://toxchange.toxicology.org/p/bl/et/bloga...,18,nominate assist mid clinician clinical clinica...,"[nonmember, recipient, nomination]"
2,2,2,2019-09-19 16:30:00,National Postdoc Appreciation Week: A Message ...,I would like to thank everyone who participate...,https://toxchange.toxicology.org/p/bl/et/bloga...,35,assembly pda nationwide accomplishment integra...,"[volunteer, nationwide, informational, trainee..."
3,3,3,2019-09-19 16:30:00,Nominations Are Open for the 2020 SOT Translat...,The SOT Translational Impact Award is presente...,https://toxchange.toxicology.org/p/bl/et/bloga...,19,nomination nonmember outstanding clinical toxi...,"[nonmember, recipient, nomination, seconding]"
4,4,4,2019-09-19 16:30:00,SOT/SOT Endowment Fund/IUTOX Travel Awards: Fu...,The SOT/SOT Endowment Fund/IUTOX Travel Awards...,https://toxchange.toxicology.org/p/bl/et/bloga...,16,iutox iutox toxicology junior toxicology under...,"[download, expense]"


## Model Evaluation

We visually inspect a few of the blogs to see whether the relevant words identified by the model make sense.

In [13]:
def evaluate_results(row):
    # Highlight whole tokens only, so a relevant word is never marked inside a longer word
    relevant_words = set(row["relevant_words"])
    highlighted = [
        '\x1b[1;03;31;46m' + word + '\x1b[0m' if word in relevant_words else word
        for word in row["text"].split()
    ]

    print(" ".join(highlighted))

In [14]:
# View some results
evaluate_results(train.loc[0])

**syngenta** **recipient** deadline replace refine toxicological **nonmammalian** modeling **structure** **methodology** contribute dissertation toxicology **expense** **expense** consistent sponsor stipend trainee preference applicant identifie refine **validate** **acceptable** formulation **nonanimal** acute toxicity maximum **lump** payment **progression** eligible submit awardee subsequent mode **dependent** causal **sequence** toxicity **quantitative** extrapolation dose trainee **recipient** **recipient** communiqu


In [15]:
# View some results
evaluate_results(train.loc[1])

nominate assist mid clinician clinical clinical toxicology ceremony **nomination** **nomination** package **nomination** nominee maximum **nonmember** qualified bridging **recipient**


In [16]:
# View some results
evaluate_results(train.loc[2])

assembly pda **nationwide** accomplishment integral pda recruitment pda formulate toxicology ideal stay scholar involvement pda scholar scholar advancement dedicated **volunteer** pda provide **excited** planning pda upcoming poster **informational** luncheon assembly luncheon outstanding **traineeship** exceptional toxicology


In [17]:
# View some results
evaluate_results(train.loc[3])

**nomination** **nonmember** outstanding clinical toxicological multidisciplinary toxicity toxicologist clinician **seconding** **nomination** toxicology maximum accomplishment sufficient **nomination** deadline **nomination** **recipient**


In [18]:
# View some results
evaluate_results(train.loc[5])

scholar assembly scholar toxicology outstanding toxicological pleased announce outstanding toxicology assembly luncheon awardee plaque recognition accomplishment **confidentiality** **nondisclosure** **recipient** headquarters adviser **nomination** applicant **significance** headquarters
