
Rerun Pre-Doc2Vec for first phase using the entire corpus #8

Open
Soudeh-Jahanshahi opened this issue Mar 28, 2024 · 3 comments

Soudeh-Jahanshahi commented Mar 28, 2024

Please consider the following brief report @ljgarcia , @rohitharavinder , @endri16-lab :

  • I reviewed my previously generated results and discovered that all of the inserted P@5 scores for the three-class-precision case had been replaced with the P@5 score of the first hyperparameter set: I may have accidentally dragged the first cell of the first column in the spreadsheet. The values in the spreadsheet have now been corrected.

  • However, I reproduced the results using the new code, which follows the style of Suhasini's training code. Specifically, it includes the test phase of Suhasini's training code inside a for-loop iterating over the different hyperparameter sets. Additionally, in the function generate_embeddings of utilities.py I replaced model.infer_vector(doc) with model.docvecs[str(pmid)], since model.docvecs provides access to the pre-learned document vectors from the training corpus, while model.infer_vector generates vector representations for new, unseen documents (see the sketch after this list). You may find the code I used and the documentation here.

  • I uploaded the new results to the spreadsheet, after the results produced by Leyla: as you can see, my reproduced results are pretty similar to my previous results.

  • I have no idea why my results differ from Leyla's and Endri's results. Is it possible that I used a different JSON file, and hence different hyperparameters, to generate the data compared to the one you used? I used the same JSON file that I used for the Word2Vec models, with sg replaced by dm.
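As a minimal, self-contained sketch (not taken from utilities.py; the tokens and PMID tags below are invented), this is the difference between looking up a pre-learned document vector and inferring one for new text with gensim's Doc2Vec, which is why switching between the two can change the downstream similarity scores:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus for illustration only; real PMIDs and tokens come from RELISH.
docs = [
    TaggedDocument(words=["protein", "binding", "assay"], tags=["10000001"]),
    TaggedDocument(words=["gene", "expression", "profiling"], tags=["10000002"]),
]
model = Doc2Vec(vector_size=50, min_count=1, epochs=20)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)

# Pre-learned vector of a training document, addressed by its tag (the PMID);
# model.docvecs[...] is the older gensim spelling of model.dv[...].
trained_vector = model.dv["10000001"]

# Vector inferred from raw tokens; inference is stochastic, so repeated calls
# (even on a training document) generally do not reproduce trained_vector.
inferred_vector = model.infer_vector(["protein", "binding", "assay"])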

@Soudeh-Jahanshahi Soudeh-Jahanshahi transferred this issue from zbmed-semtec/doc2vec-doc-relevance Mar 28, 2024

Soudeh-Jahanshahi commented Apr 10, 2024

@ljgarcia and @rohitharavinder : I executed the code modified by Rohitha (according to this issue) and uploaded the new results at the end of this sheet. Note that I only got the results for the first 6 hyperparameter sets.

  • The new results are pretty similar to the ones provided by Leyla and Endri ...

Now I should explain how I got my first set of results from almost the same code:

  • I generated the embeddings via the command python3 code/embeddings/create_model.py --input RELISH_Annotated_Tokenized_Removed_Stopwords.npy --output data/RELISH_hybrid.model --params_json data/hyperparameters_doc2vec.json --embeddings embeddings/, where hyperparameters_doc2vec.json is the same JSON file that I used for the Word2Vec models, with sg replaced by dm (a hedged sketch of the expected structure of this file follows after this list).

  • However, I had first made the following changes to the script create_model.py:

def save_doc2vec_embeddings(model: Doc2Vec, pmids: List[int], output_path: str, param_iteration: int):
    # Store the trained document vectors (model.dv) together with their PMIDs.
    embeddings_df = pd.DataFrame(list(zip(pmids, model.dv.vectors.tolist())), columns=['pmids', 'embeddings'])
    embeddings_df = embeddings_df.sort_values('pmids')
    os.makedirs(f"{output_path}/{param_iteration}", exist_ok=True)
    embeddings_df.to_pickle(f'{output_path}/{param_iteration}/embeddings.pkl')

and also the following changes to the main block of the script:

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--input", type=str,
                        help="Path to input .NPY or .TSV file.", required=True)
    parser.add_argument("-o", "--output", type=str,
                        help="Path to output Doc2Vec model.")
    parser.add_argument("-pj", "--params_json", type=str,
                        help="File location of word2vec parameter list.")
    parser.add_argument("--embeddings", type=str,
                        help="Path to the output embeddings in pickle format.")
    args = parser.parse_args()

    # Process output file
    if args.output:
        output_path = args.output
        if not output_path.endswith(".model"):
            output_path = output_path + ".model"
    else:
        output_path = "hybrid_d2v.model"

    pmid, join_text = load_tokens(args.input)
    tagged_data = generate_TaggedDocument(pmid, join_text)

    params = []
    with open(args.params_json, "r") as openfile:
        params = json.load(openfile)

    for iteration in range(0, len(params)):  # len(params)
        print(f'start for param {iteration}')
        # ----------------------------------------------------
        # pmid, join_text = load_tokens(args.input)
        # tagged_data = generate_TaggedDocument(pmid, join_text)
        # --------------------------------------------------------
        model = generate_doc2vec_model(tagged_data, params[iteration])
        train_doc2vec_model(model, tagged_data, verbose=1)
        save_doc2vec_embeddings(model, pmid, args.embeddings, iteration)
        save_doc2vec_model(model, output_path)
  • Then, for generating the cosine similarities, I used generate_cosine_existing_pairs.py from the Word2Doc2Vec repository (a generic sketch of this step also follows below).
  • Now we can be sure that there is a discrepancy between the two sets of code. Note that I got results similar to my first results using a compact code based on the test phase of Suhasini's training code, and I couldn't find anything suspect in this compact code.
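For reference, the script expects --params_json to point to a list of parameter dictionaries. The actual hyperparameters_doc2vec.json is not reproduced in this issue, but based on the parameter set quoted later in this thread it would look roughly like the following sketch (all values except the second set are invented):

# Hypothetical illustration of the structure behind hyperparameters_doc2vec.json;
# create_model.py iterates over this list with params[iteration].
example_params = [
    {"dm": 1, "vector_size": 200, "window": 5, "min_count": 5, "epochs": 15, "workers": 8},
    {"dm": 1, "vector_size": 400, "window": 7, "min_count": 5, "epochs": 15, "workers": 8},
]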

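The following is only a generic sketch of what computing cosine similarities for existing PMID pairs from the pickled embeddings could look like; it is not the actual generate_cosine_existing_pairs.py from the Word2Doc2Vec repository, and the pairs file name and its column names are assumptions:

import numpy as np
import pandas as pd

# Assumed inputs: the embeddings pickle written by save_doc2vec_embeddings above,
# and a TSV of existing RELISH pairs with columns 'pmid1' and 'pmid2' (assumed names).
embeddings = pd.read_pickle("embeddings/0/embeddings.pkl").set_index("pmids")["embeddings"]
pairs = pd.read_csv("relevance_pairs.tsv", sep="\t")  # hypothetical file name

def cosine(u, v):
    # Plain cosine similarity between two embedding vectors.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

pairs["cosine"] = [
    cosine(embeddings[p1], embeddings[p2])
    for p1, p2 in zip(pairs["pmid1"], pairs["pmid2"])
]
pairs.to_csv("cosine_existing_pairs.tsv", sep="\t", index=False)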

Soudeh-Jahanshahi commented Apr 11, 2024

Today I went through the new modified code, and it turns out that the script create_model.py that I modified and used the first time in December to generate the doc-embeddings is not used by the script run_embeddings.py that is utilized in the new modified code ...

@Soudeh-Jahanshahi

@ljgarcia and @rohitharavinder : Today at the meeting Leyla suggested a great idea for checking the two groups of code: let's call them the compact code (which yields similar results to the script create_model.py) and the script run_embeddings.py (i.e. the code used by Leyla, Endri, and yesterday by me for my last try). She suggested regenerating the results via the script run_embeddings.py, but this time with a different order of hyperparameter sets. By doing so, we can determine whether the best results consistently correspond to the first hyperparameter set provided to the code, regardless of the specific parameter values, or not.

And I did this test (a minimal sketch of the reordering step follows the result below):

  • For the Doc2Vec model generated with params {'dm': 1, 'vector_size': 400, 'window': 7, 'min_count': 5, 'epochs': 15, 'workers': 8} at the current order 0 (and previous order 17), the script run_embeddings.py (i.e. the code used by Leyla, Endri, and yesterday by me for my last try) yields the three-class precision scores 0.4976, 0.4624, 0.4429, 0.4289, 0.4171, 0.3741. The same code, when the model was at order 17, yields the three-class precision scores 0.3609, 0.3616, 0.3612, 0.3611, 0.3597, 0.3587. The compact code (which gives similar results to the script create_model.py) yields 0.4981, 0.4624, 0.4439, 0.4296, 0.4172, 0.3742 for the same model regarding three-class precision.
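A minimal sketch of the reordering step, assuming the parameter file names used earlier in this thread and that the list contains at least 18 sets (the output file name and the particular permutation are just examples):

import json

# Load the original parameter list and write a reordered copy, so that
# run_embeddings.py can be re-run with a different hyperparameter order.
with open("data/hyperparameters_doc2vec.json") as f:
    params = json.load(f)

# Example permutation: move the set that was previously at position 17 to the front.
reordered = [params[17]] + params[:17] + params[18:]

with open("data/hyperparameters_doc2vec_reordered.json", "w") as f:  # hypothetical output name
    json.dump(reordered, f, indent=2)

# If the best P@5 scores still appear for whatever set sits at position 0, the
# ranking depends on the position in the file rather than on the parameter values.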
