
Rerun Pre-Doc2Vec for first phase using the entire corpus #8

Open
Soudeh-Jahanshahi opened this issue Mar 28, 2024 · 3 comments

Soudeh-Jahanshahi commented Mar 28, 2024

Please consider the following brief report @ljgarcia , @rohitharavinder , @endri16-lab :

  • I reviewed my previously generated results and discovered that all of the inserted P@5 scores for the three-class-precision case had been replaced with the P@5 score of the first hyperparameter set: I may have accidentally dragged the first cell of the first column in the spreadsheet. The values in the spreadsheet have now been corrected.

  • However, I reproduced the results using the new code, which follows the style of Suhasini's training code. Specifically, it includes the test phase of Suhasini's training code inside a for-loop iterating over the different hyperparameter sets. Additionally, in the function generate_embeddings of utilities.py I replaced model.infer_vector(doc) with model.docvecs[str(pmid)], since model.docvecs provides access to the pre-learned document vectors from the training corpus, while model.infer_vector generates vector representations for new, unseen documents (see the sketch after this list). You may find the code I used and the documentation here.

  • I uploaded the new results to the spreadsheet, after the results produced by Leyla: as you can see, my reproduced results are pretty similar to my previous results.

  • I have no idea why my results differ from Leyla's and Endri's results. Is it possible that I used a different JSON file, and hence different hyperparameters, to generate the data compared to the one you used? I used the same JSON file that I used for the Word2Vec models, with sg replaced by dm.
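As a minimal, self-contained sketch (not taken from utilities.py; the tokens and PMID tags below are invented), this is the difference between looking up a pre-learned document vector and inferring one for new text with gensim's Doc2Vec, which is why switching between the two can change the downstream similarity scores:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus for illustration only; real PMIDs and tokens come from RELISH.
docs = [
    TaggedDocument(words=["protein", "binding", "assay"], tags=["10000001"]),
    TaggedDocument(words=["gene", "expression", "profiling"], tags=["10000002"]),
]
model = Doc2Vec(vector_size=50, min_count=1, epochs=20)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)

# Pre-learned vector of a training document, addressed by its tag (the PMID);
# model.docvecs[...] is the older gensim spelling of model.dv[...].
trained_vector = model.dv["10000001"]

# Vector inferred from raw tokens; inference is stochastic, so repeated calls
# (even on a training document) generally do not reproduce trained_vector.
inferred_vector = model.infer_vector(["protein", "binding", "assay"])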

@Soudeh-Jahanshahi Soudeh-Jahanshahi transferred this issue from zbmed-semtec/doc2vec-doc-relevance Mar 28, 2024

Soudeh-Jahanshahi commented Apr 10, 2024

@ljgarcia and @rohitharavinder : I executed the code modified by Rohitha (according to this issue) and uploaded the new results at the end of this sheet. Note that I only got the results for the first 6 hyperparameter sets.

  • The new results are pretty similar to the ones provided by Leyla and Endri ...

Now I should explain how I got my first set of results from almost the same code:

  • I generated the embeddings via the command python3 code/embeddings/create_model.py --input RELISH_Annotated_Tokenized_Removed_Stopwords.npy --output data/RELISH_hybrid.model --params_json data/hyperparameters_doc2vec.json --embeddings embeddings/, where hyperparameters_doc2vec.json is the same JSON file that I used for the Word2Vec models, with sg replaced by dm (a hedged sketch of the expected structure of this file follows after this list).

  • However, I had first made the following changes to the script create_model.py:

def save_doc2vec_embeddings(model: Doc2Vec, pmids: List[int], output_path: str, param_iteration: int):
    # Store the trained document vectors (model.dv) together with their PMIDs.
    embeddings_df = pd.DataFrame(list(zip(pmids, model.dv.vectors.tolist())), columns=['pmids', 'embeddings'])
    embeddings_df = embeddings_df.sort_values('pmids')
    os.makedirs(f"{output_path}/{param_iteration}", exist_ok=True)
    embeddings_df.to_pickle(f'{output_path}/{param_iteration}/embeddings.pkl')

and also the following changes to the main block of the script:

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--input", type=str,
                        help="Path to input .NPY or .TSV file.", required=True)
    parser.add_argument("-o", "--output", type=str,
                        help="Path to output Doc2Vec model.")
    parser.add_argument("-pj", "--params_json", type=str,
                        help="File location of word2vec parameter list.")
    parser.add_argument("--embeddings", type=str,
                        help="Path to the output embeddings in pickle format.")
    args = parser.parse_args()

    # Process output file
    if args.output:
        output_path = args.output
        if not output_path.endswith(".model"):
            output_path = output_path + ".model"
    else:
        output_path = "hybrid_d2v.model"

    pmid, join_text = load_tokens(args.input)
    tagged_data = generate_TaggedDocument(pmid, join_text)

    params = []
    with open(args.params_json, "r") as openfile:
        params = json.load(openfile)

    for iteration in range(0, len(params)):  # len(params)
        print(f'start for param {iteration}')
        # ----------------------------------------------------
        # pmid, join_text = load_tokens(args.input)
        # tagged_data = generate_TaggedDocument(pmid, join_text)
        # --------------------------------------------------------
        model = generate_doc2vec_model(tagged_data, params[iteration])
        train_doc2vec_model(model, tagged_data, verbose=1)
        save_doc2vec_embeddings(model, pmid, args.embeddings, iteration)
        save_doc2vec_model(model, output_path)
  • Then, for generating the cosine similarities, I used generate_cosine_existing_pairs.py from the Word2Doc2Vec repository (a generic sketch of this step also follows below).
  • Now we can be sure that there is a discrepancy between the two sets of code. Note that I got results similar to my first results using a compact code based on the test phase of Suhasini's training code, and I couldn't find anything suspect in this compact code.
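For reference, the script expects --params_json to point to a list of parameter dictionaries. The actual hyperparameters_doc2vec.json is not reproduced in this issue, but based on the parameter set quoted later in this thread it would look roughly like the following sketch (all values except the second set are invented):

# Hypothetical illustration of the structure behind hyperparameters_doc2vec.json;
# create_model.py iterates over this list with params[iteration].
example_params = [
    {"dm": 1, "vector_size": 200, "window": 5, "min_count": 5, "epochs": 15, "workers": 8},
    {"dm": 1, "vector_size": 400, "window": 7, "min_count": 5, "epochs": 15, "workers": 8},
]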

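The following is only a generic sketch of what computing cosine similarities for existing PMID pairs from the pickled embeddings could look like; it is not the actual generate_cosine_existing_pairs.py from the Word2Doc2Vec repository, and the pairs file name and its column names are assumptions:

import numpy as np
import pandas as pd

# Assumed inputs: the embeddings pickle written by save_doc2vec_embeddings above,
# and a TSV of existing RELISH pairs with columns 'pmid1' and 'pmid2' (assumed names).
embeddings = pd.read_pickle("embeddings/0/embeddings.pkl").set_index("pmids")["embeddings"]
pairs = pd.read_csv("relevance_pairs.tsv", sep="\t")  # hypothetical file name

def cosine(u, v):
    # Plain cosine similarity between two embedding vectors.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

pairs["cosine"] = [
    cosine(embeddings[p1], embeddings[p2])
    for p1, p2 in zip(pairs["pmid1"], pairs["pmid2"])
]
pairs.to_csv("cosine_existing_pairs.tsv", sep="\t", index=False)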

Soudeh-Jahanshahi commented Apr 11, 2024

Today I went through the new modified code, and it turns out that the script create_model.py that I modified and used the first time in December to generate the doc-embeddings is not used by the script run_embeddings.py that is utilized in the new modified code ...

@Soudeh-Jahanshahi

@ljgarcia and @rohitharavinder : Today at the meeting Leyla suggested a great idea for checking the two groups of code: let's call them the compact code (which yields similar results to the script create_model.py) and the script run_embeddings.py (i.e. the code used by Leyla, Endri, and yesterday by me for my last try). She suggested regenerating the results via the script run_embeddings.py, but this time with a different order of hyperparameter sets. By doing so, we can determine whether the best results consistently correspond to the first hyperparameter set provided to the code, regardless of the specific parameter values, or not.

And I did this test (a minimal sketch of the reordering step follows the result below):

  • For the Doc2Vec model generated with params {'dm': 1, 'vector_size': 400, 'window': 7, 'min_count': 5, 'epochs': 15, 'workers': 8} at the current order 0 (and previous order 17), the script run_embeddings.py (i.e. the code used by Leyla, Endri, and yesterday by me for my last try) yields the three-class precision scores 0.4976, 0.4624, 0.4429, 0.4289, 0.4171, 0.3741. The same code, when the model was at order 17, yields the three-class precision scores 0.3609, 0.3616, 0.3612, 0.3611, 0.3597, 0.3587. The compact code (which gives similar results to the script create_model.py) yields 0.4981, 0.4624, 0.4439, 0.4296, 0.4172, 0.3742 for the same model regarding three-class precision.
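A minimal sketch of the reordering step, assuming the parameter file names used earlier in this thread and that the list contains at least 18 sets (the output file name and the particular permutation are just examples):

import json

# Load the original parameter list and write a reordered copy, so that
# run_embeddings.py can be re-run with a different hyperparameter order.
with open("data/hyperparameters_doc2vec.json") as f:
    params = json.load(f)

# Example permutation: move the set that was previously at position 17 to the front.
reordered = [params[17]] + params[:17] + params[18:]

with open("data/hyperparameters_doc2vec_reordered.json", "w") as f:  # hypothetical output name
    json.dump(reordered, f, indent=2)

# If the best P@5 scores still appear for whatever set sits at position 0, the
# ranking depends on the position in the file rather than on the parameter values.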
