## Resources
- PyTorch BigGraph
    - [Blog](https://torchbiggraph.readthedocs.io/en/latest/pretrained_embeddings.html)
        - [How to read data](https://torchbiggraph.readthedocs.io/en/latest/downstream_tasks.html#tsv-format)
    - [Video](https://www.youtube.com/watch?v=zDFUu4ggVTg&ab_channel=ConnorShorten)


- https://huggingface.co/docs/transformers/add_new_model

### Others
- About Unicode: [unicode in python](https://towardsdatascience.com/a-guide-to-unicode-utf-8-and-strings-in-python-757a232db95c), [utf-8 video](https://www.youtube.com/watch?v=MijmeoH9LT4), [character conversion](https://note.nkmk.me/en/python-chr-ord-unicode-code-point/#:~:text=In%20Python%2C%20the%20built%2Din,Unicode%20code%20points%20and%20characters.&text=A%20character%20can%20also%20be,U%20in%20a%20string%20literal.), [python docs](https://docs.python.org/3/howto/unicode.html)


### Papers:
- [KG-BERT](https://arxiv.org/abs/1909.03193): We treat triples as textual sequences and propose KG-BERT to model these triples. Our method takes entity and relation descriptions of a triple as input and computes scoring function of the triple with the KG-BERT language model. Experimental results on multiple benchmark knowledge graphs show that our method can achieve state-of-the-art performance in triple classification, link prediction and relation prediction tasks.
- [KG + Paragraph Vector in paper networks](https://www.dfki.de/~declerck/semdeep-4/papers/SemDeep-4_paper_2.pdf): They use the paragraph vector (PV) as one feature and the KG embedding as another, then perform logistic regression. We can use this for classification tasks.
- [Consider KG triples as sentences](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9170315): They try to complete KG by considering triples as sentences and run it through RNN and CNN. They initialize the embedding matrix with TransE KG embedding and then have it modified during training.
- Contrastive learning: Try to bring closer KG and BERT embeddings by using negative samples.
- [Semantic similarity, word sense disambiguation using KG vecs, WE vecs, and both](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7566472/): They concatenate the KG and BERT (sentence) embeddings.
- [Align embedding by using Wikipedia-Wikidata Anchors](https://aclanthology.org/attachments/D14-1167.Attachment.pdf)
    - The anchoring method is not clear to me. Another way to align is to use the WE of the text in the KG.
    - **Idea**: Use the text in Wikidata descriptions and in Wikipedia pages linked from Wikidata entities. Create embeddings of these texts and pull them closer to the entities' embeddings, i.e. the word embedding of the word "human" should come close to the embedding of "Q5", to the text "human" in Wikidata, and maybe even to the embedding of a Wikipedia page titled "human".
        - The name of an entity is ambiguous because different entities sometimes share the same name, so the name graph (a graph with embeddings from words) may contaminate the knowledge embedding.
- [Author name disambiguation using KGE](https://link.springer.com/article/10.1007/s11192-022-04426-2): adopts LiteralE (Kristiadi et al., 2019) embedding model to learn the KGEs.
    - They literally concatenate and pass it through a layer, like I said! "This architecture incorporates textual embeddings extracted from the titles of the entities (scholarly articles) into their representations by means of a linear transformation defined as follows: $g_{lin}(\mathbf{e},\mathbf{l})=\mathbf{W}[\mathbf{e},\mathbf{l}]$, where $\mathbf{e}\in\mathbb{R}^{h}$ is the vector associated to the ith entity in a KG, $\mathbf{l}\in\mathbb{R}^{d}$ is the title embedding (from **BERT**), $\mathbf{W}\in\mathbb{R}^{h\times(d+h)}$ is a linear transformation matrix and $[\mathbf{e},\mathbf{l}]\in\mathbb{R}^{h+d}$ is the **concatenation** vector of the entity embedding $\mathbf{e}$ and the literal embedding $\mathbf{l}$"
- [Integrating Knowledge Graph embedding and pretrained Language Models in Hypercomplex Spaces](https://arxiv.org/pdf/2208.02743.pdf): Combines 4 sources of embeddings (KGE, WE, SE, DE) and uses them for link prediction. The text embeddings come from entity descriptions. They turn everything into entity embeddings and perform link prediction.
    - [Quaternions](https://www.youtube.com/watch?v=d4EgbgTm0Bg)
    - Another line of work [8–11] propose different approaches to incorporate pretrained language models into KGE models. They provide different approaches that incorporate KGE and masked language models by unifying the two loss functions for each of the models.
- [Pretrain-KGE](https://aclanthology.org/2020.findings-emnlp.25.pdf): Train KGE by using SE of entity definition from BERT and do the usual negative sampling contrastive learning for KGE. Then evaluate on link prediction.
- [KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation](https://arxiv.org/pdf/1911.06136.pdf):  In KEPLER, we encode textual entity descriptions with a PLM (pre-trained LM) as their embeddings, and then jointly optimize the KE and language modeling objectives. Experimental results show that KEPLER achieves state-of-the-art performances on various **NLP tasks**, and also works remarkably well as an inductive KE model on KG link prediction.
    - Inspired by Xie et al. (2016), we use entity descriptions to bridge the gap between KE and PLM, and align the semantic space of text to the symbol space of KG 
    - We encode the texts and entities into a unified semantic space with the same PLM as the encoder, and jointly optimize the KE and the masked language modeling (MLM) objectives. For the KE objective, we encode the entity descriptions as entity embeddings and then train them in the same way as conventional KE methods. For the MLM objective, we follow the approach of existing PLMs.
    - In KEPLER, instead of using stored embeddings, we encode entities into vectors by using their corresponding text.
    - text data sampled for KE and MLM are not (necessarily) the same. This is because seeing a variety of text (instead of just entity descriptions) in MLM can help the model to have better language understanding ability.
    - Summary: They simply train KE on top of RoBERTa, using the encoded entity description as the entity embedding, so the KE part still has to be trained. Because the model may *forget* its MLM-trained fluency, they keep MLM as a task as well. The two losses are added, but they don't have to come from the same input.
    
- [ERNIE/THU-ERNIE](https://aclanthology.org/P19-1139.pdf): They add a pretraining task.
- **[KnowBERT](https://aclanthology.org/D19-1005.pdf)**: End-to-end training with both KG and LM objective. For each KB, we first use an integrated entity linker to retrieve relevant entity embeddings, then update contextual word representations via a form of **word-to-entity attention**. In contrast to previous approaches, the entity linkers and self-supervised language modeling objective are jointly trained end-to-end in a multitask setting that combines a small amount of entity linking supervision with a large amount of raw text.
    - They *fact-check* by building template sentences from wikidata triples and masking sub/obj to get the answer.
- \*\*[Barack’s Wife Hillary: Using Knowledge Graphs for Fact-Aware Language Modeling](https://aclanthology.org/P19-1598.pdf): Introduces the knowledge graph language model (KGLM), a neural language model with mechanisms for selecting and copying facts from a knowledge graph that are relevant to the context. These mechanisms enable the model to render information it has never seen before, as well as generate out-of-vocabulary tokens.
    - <span style="color:red">[Linked Wikitext-2 Dataset](https://rloganiv.github.io/linked-wikitext-2/#/explore): links phrases in wikitext-2 dataset (a small-good subset of wikipedia articles) to wikidata entities.</span>
    - They *fact-check* by building template sentences from wikidata triples and masking sub/obj to get the answer.
- [K-BERT](https://ojs.aaai.org/index.php/AAAI/article/view/5681/5537): It looks like they trained BERT from scratch, but they mention that they don’t add the KG in the pre-training phase. So can we use pre-trained models? Not sure, I cannot understand their architecture yet.
- ** [K-Adapters](https://openreview.net/pdf?id=CLnj31GZ4cI) (good literature): Wang et al. propose K-Adapters, a knowledge infusion mechanism on top of pre-trained LMs. With K-Adapters, you don’t need to train a large Transformer stack from scratch. Instead, the authors suggest placing a few adapter layers in between the layers of already pre-trained frozen models (they experiment with BERT and RoBERTa).
- [ERICA](https://arxiv.org/pdf/2012.15022.pdf): Contrastive loss and two new pre-training tasks. This will require significant training
- ** [google: source blog, KELM](https://ai.googleblog.com/2021/05/kelm-integrating-knowledge-graphs-with.html) KB + LM by converting triples into Sentences [paper](https://aclanthology.org/2021.naacl-main.278.pdf): This uses LMs (T5) to generate the sentences from KG triples
- * [BERT-MK](https://aclanthology.org/2020.findings-emnlp.207.pdf): Subgraph (medical) KG + BERT: Can use pretrained BERT and add an aggregator with a triple-learner to incorporate KG information. Need to read paper/see code in-depth to understand the details.
- ** [Text representation (BERT) + meta data (Graph) for document classification](https://arxiv.org/pdf/1909.08402.pdf): Uses pytorch big graph. Concats BERT embedding with KG embedding for each book author and runs it through 1/2 MLP layers. We either need to perform classification task and follow this process exactly, or generalize it and use on various tasks including text generation
- [THU-ERNIE](https://paddlepedia-readthedocs-io.translate.goog/en/latest/tutorials/pretrain_model/THU-ERNIE.html?_x_tr_sl=zh-CN&_x_tr_tl=en&_x_tr_hl=en&_x_tr_pto=sc): Adds a pretraining task to incorporate KG information
- [KELM](https://openreview.net/pdf?id=FkG-sX5CE_) (good literature): uses pretrained BERT, but also trains KG embeddings and uses a GNN
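
The concatenate-then-project step quoted in the author name disambiguation entry above, g_lin(e, l) = W[e, l], can be sketched in a few lines. This is only an illustration under assumed toy sizes (h = 3, d = 2) with made-up vectors; the function name `concat_linear` is hypothetical and plain Python lists stand in for tensors so that it runs without any dependency.

```python
import random


def concat_linear(entity_vec, literal_vec, weight):
    """Compute g_lin(e, l) = W [e, l] for a single entity.

    entity_vec:  list of h floats (KG embedding e)
    literal_vec: list of d floats (text embedding l, e.g. from BERT)
    weight:      h rows, each of length h + d (the matrix W)
    """
    joined = entity_vec + literal_vec  # concatenation [e, l], length h + d
    if any(len(row) != len(joined) for row in weight):
        raise ValueError("each row of W must have length h + d")
    return [sum(w * x for w, x in zip(row, joined)) for row in weight]


# Toy sizes, chosen only for illustration: h = 3, d = 2.
h, d = 3, 2
rng = random.Random(0)
e = [rng.uniform(-1, 1) for _ in range(h)]
l = [rng.uniform(-1, 1) for _ in range(d)]

# W = [I_h | 0] passes e through unchanged; a trained W would mix in l as well.
W_identity = [[1.0 if j == i else 0.0 for j in range(h + d)] for i in range(h)]
fused = concat_linear(e, l, W_identity)
print(len(fused))
```

The output has the same size h as the KG embedding, so the fused vector can be dropped into a scoring function that expects entity embeddings.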


### Ideas
    
- **Possible training task**:
    - Contrastive learning using text already in the KG.
    - MASKing and regenerating text (it feels like the LM may overpower here. How to gauge the effectiveness of the KG during training? Maybe test with various alpha, where alpha\*KG + (1-alpha)\*LM)
    - Combined loss of MLM and Contrastive KE loss
    - Cross the embedding and loss. So emb from LM is used to perform KE task, and embedding from KE is used to perform MLM task. Then combine both losses.
- **Possible Integration modules:**
    - Generate the WE and KGE of the text in the KG, concat them, pass the result through a layer, and perform some task like MASK reconstruction (making sure the layer doesn't just choose to use the WE or KG alone)
        - Maybe also try to incorporate Q-id embeddings that come close to the literal values of KG. Concat multiple embeddings together.
    - Make these embeddings come closer using l1/l2 loss function? So pass the WE through a few layers, KGE through some layers, and then l2 of the final outputs. This will learn intermediate representations that bring these 2 spaces together. Another way is to change the BERT and KG layer values instead of adding layers separately for KG and WE.
    - Contrastive learning on the KG texts' WE and KGE. Use the same text and other random text. KG:human -- WE:human (positive), KG:human -- WE:dog (negative)
    - Wikipedia Anchoring: Use Wikipedia text as MASKed input, generate its WE/SE, and use its entities' KGE. Then combine these two vectors however you wish and perform a MASK word generation task like BERT.
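
Two of the ideas above, the KG:human / WE:human / WE:dog contrastive pairs and the alpha\*KG + (1-alpha)\*LM weighting, can be tried on toy numbers before any model exists. The sketch below is an assumption-laden illustration: the 3-d vectors are made up, the hinge form and the `margin` value are one common choice rather than a decision taken in these notes, and both embeddings are assumed to already live in a shared space (real 200-d KG vectors and 768-d BERT vectors would need a projection first).

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def margin_contrastive_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss: zero once the positive beats the negative by `margin`."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))


def combined_loss(kg_loss, lm_loss, alpha):
    """alpha * KG + (1 - alpha) * LM, with alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * kg_loss + (1.0 - alpha) * lm_loss


# Made-up vectors: KG embedding of "human", word embeddings of "human" and "dog".
kg_human = [1.0, 0.0, 0.0]
we_human = [0.9, 0.1, 0.0]
we_dog = [0.0, 1.0, 0.0]

kg_loss = margin_contrastive_loss(kg_human, we_human, we_dog)
for alpha in (0.0, 0.5, 1.0):
    print(alpha, combined_loss(kg_loss, lm_loss=2.0, alpha=alpha))
```

Sweeping `alpha` like this is one way to see how much of the total loss the KG term contributes once the LM term dominates.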
    
### Evaluation:
- [blog/paper](https://wandb.ai/othmanelhoufi/LM-for-fact-checking/reports/Language-Models-for-Fact-Checking-Claim-Assessment--VmlldzoyMTIzNzA2) An automated solution for fact-checking using state-of-the-art language models used today for NLP tasks (BERT, RoBERTa, XLNet etc.) and five well known datasets (FEVER, MultiFC, Liar, COVID19, and ANTiVax) containing annotated claims/tweets to fine-tune each LM and classify a given claim.
- [fb blog](https://ai.facebook.com/research/publications/language-models-as-fact-checkers/) In a closed-book setting, we show that our zero-shot LM approach outperforms a random baseline on the standard FEVER task, and that our finetuned LM compares favorably with standard baselines.
- [Debiasing fact verification models](https://arxiv.org/pdf/1908.05267.pdf)
- [Fact verification survey](https://www.mdpi.com/2411-9660/5/3/42)
- [Survey on automated fact checking](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00454/109469/A-Survey-on-Automated-Fact-Checking)
- [Automated fact-checking: A survey](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00454/109469/A-Survey-on-Automated-Fact-Checking)
- [LAMA probing](https://aclanthology.org/D19-1250.pdf)

### Alignment of two embedding spaces:
- https://arxiv.org/pdf/2010.13688.pdf This is a survey of the topic
- [The autoencoder-structure type work](https://www.sciencedirect.com/science/article/pii/S1570826822000270?dgcid=rss_sd_all) It’s between different knowledge graphs, but the structure can be generalized to our problem without using any pre-knowledge.

## Imports

In [1]:
import numpy as np
import pandas as pd

from IPython.display import display, clear_output

## Load Embeddings

In [2]:
emb_file = "wiki_trans_v1_vec.npy"
emb_key_file = "wiki_trans_v1_labels.json"
emb_tsv_file = "wikidata_translation_v1.tsv"

### npy file

In [4]:
embeddings = np.load(emb_file, mmap_mode='r')

In [5]:
embeddings.shape

(78413185, 200)

### json keys

In [None]:
with open(emb_key_file) as fileobject:
    for line in fileobject:
        print(len(line))

In [16]:
with open(emb_key_file) as f:
    # Peek at the first 1024 characters only; the rest of the file is not read.
    print(f.read(1024))

["<http://schema.org/Dataset>","<http://wikiba.se/ontology#Item>","<http://www.wikidata.org/entity/Q13442814>","\"wetenschappelijk artikel\"@nl","\"article cient\\u00EDfic\"@ca","\"bilimsel makale\"@tr","\"wissenschaftlicher Artikel\"@de","\"vetenskaplig artikel\"@sv","\"\\u043D\\u0430\\u0443\\u0447\\u043D\\u0438 \\u0447\\u043B\\u0430\\u043D\\u0430\\u043A\"@sr","\"articolo scientifico\"@it","\"vedeck\\u00FD \\u010Dl\\u00E1nok\"@sk","\"v\\u011Bdeck\\u00FD \\u010Dl\\u00E1nek\"@cs","\"art\\u00EDculu cient\\u00EDficu\"@ast","\"artikull shkencor\"@sq","\"\\u043D\\u0430\\u0443\\u0447\\u043D\\u0430\\u044F \\u0441\\u0442\\u0430\\u0442\\u044C\\u044F\"@ru","\"scienca artikolo\"@eo","\"artigo cient\\u00EDfico\"@gl","\"vitenskapelig artikkel\"@nb","\"articol \\u0219tiin\\u021Bific\"@ro","\"vitskapeleg artikkel\"@nn","\"\\u05DE\\u05D0\\u05DE\\u05E8 \\u05DE\\u05D3\\u05E2\\u05D9\"@he","\"tieteellinen artikkeli\"@fi","\"artyku\\u0142 naukowy\"@pl","\"\\u043D\\u0430\\u0443\\u0447\\u043D\\u0438 \\u0447\
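
The preview above shows that the labels file is a JSON array of names, and that non-ASCII characters are stored as literal `\uXXXX` sequences. A small sketch of how such a list could be turned into a name-to-row lookup, and how one label could be decoded, is given below. It rests on two assumptions that are not verified here: that entry i of the list names row i of the `.npy` matrix, and that stored labels are pure ASCII as in the preview. The helper name `decode_label` is hypothetical.

```python
import json

# Tiny stand-in for the labels file, shaped like the preview printed above.
raw = r'["<http://www.wikidata.org/entity/Q13442814>","\"article cient\\u00EDfic\"@ca","\"bilimsel makale\"@tr"]'
labels = json.loads(raw)

# Assumption: entry i of this list names row i of the embedding matrix.
row_of = {name: i for i, name in enumerate(labels)}


def decode_label(name):
    """Turn literal \\uXXXX sequences left in a label into real characters.

    Only valid while the stored label is pure ASCII, as in the sample above.
    """
    return name.encode("ascii").decode("unicode_escape")


print(row_of["<http://www.wikidata.org/entity/Q13442814>"])
print(decode_label(labels[1]))
```

With the real files, `embeddings[row_of[name]]` would then fetch a single 200-d row from the memory-mapped array without loading the whole matrix, provided the alignment assumption holds.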

### tsv file

In [3]:
for chunk in pd.read_csv(emb_tsv_file, 
                         delimiter='\t', 
                         header=None, 
                         chunksize=200, 
                         skiprows=1, 
                         encoding='unicode-escape'):
    
    display(chunk.tail(50))
    break

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,191,192,193,194,195,196,197,198,199,200
150,ვიკიპედია:კატეგორიზაცია@ka,0.2374,0.2334,-1.2317,-0.1052,-0.8996,-0.3708,0.936,0.1636,0.1508,...,-0.0195,-0.084,0.3218,-0.3902,-0.388,-0.5567,-0.378,0.1094,0.1718,-0.3888
151,تصنيف بتاع ويكيميديا@arz,0.2158,-0.1931,-0.7697,-0.3644,-0.3956,-0.1725,0.9886,-0.2262,-0.0761,...,0.0464,-0.3647,0.4353,-0.3675,-0.3837,-0.5399,-0.2618,0.1787,0.4213,-0.867
152,kategorya ng Wikimedia@tl,0.4686,-0.2089,-0.9573,-0.0756,-0.8581,-0.3367,1.1049,-0.3336,-0.4785,...,0.0061,-0.4368,0.2318,-0.2441,-0.5975,-0.7506,-0.7489,0.3858,0.3886,-0.2948
153,หน้าหมวดหมู่วิกิมีเดีย@th,0.1272,-0.0376,-0.8983,0.0561,-0.8466,-0.0873,1.0121,0.2025,-0.0954,...,0.0917,-0.1776,0.3266,-0.3865,-0.245,-0.3989,-0.1365,0.1301,0.1676,-0.3834
154,Викимедиа проектындагы төркем@tt-cyrl,0.3033,0.0791,-1.0701,-0.301,-0.723,-0.3469,0.6648,-0.0299,-0.1165,...,0.3813,-0.3917,0.5761,-0.482,-0.4692,-0.4877,-0.2935,0.163,0.1791,-0.3645
155,विकिमिडिया श्रेणी@dty,0.6586,0.0836,-1.0962,0.097,-0.9746,0.0705,0.9287,-0.0819,-0.0967,...,0.1462,-0.2997,0.4251,-0.5384,-0.6325,-0.249,-0.3294,0.2013,0.2414,-0.2909
156,Wikimedia proyektındağı törkem@tt-latn,0.4144,-0.3263,-0.7766,-0.2418,-0.7586,-0.1447,0.6765,-0.3042,-0.3844,...,0.1784,-0.3673,0.1457,-0.2463,-0.7155,-0.5209,0.0242,0.1365,0.1784,-0.4464
157,Viciméid catagóir@ga,0.3664,0.0028,-1.0975,-0.0877,-0.7209,-0.3146,1.1576,-0.0249,0.1022,...,0.5356,-0.2766,0.7667,-0.3906,-0.3088,-0.2794,-0.3257,0.0174,0.8024,0.016
158,زمرو:وڪيپيڊيا زمرا بندي@sd,0.6462,-0.0593,-0.9375,-0.2932,-0.6293,-0.0925,0.701,-0.2661,-0.0224,...,0.008,-0.0725,0.3348,-0.2812,-0.5549,-0.1902,-0.2302,0.5304,-0.043,-0.2436
159,Vikimédia ñemohenda@gn,0.4163,-0.0442,-1.0986,0.0494,-0.5605,-0.0159,0.9727,-0.0382,0.2371,...,0.0232,-0.3363,0.432,-0.7403,-0.4469,-0.7115,-0.3073,0.0461,-0.2179,-0.5703


### Explore

In [17]:
df_belt = pd.DataFrame(columns=range(201))  # accumulator: column 0 = label/URI, 1-200 = vector
df_belt

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,191,192,193,194,195,196,197,198,199,200


In [18]:
for chunk in pd.read_csv(emb_tsv_file, 
                         delimiter='\t', 
                         header=None, 
                         chunksize=20000, 
                         skiprows=1, 
                         encoding='unicode-escape'):
    
    df = chunk[chunk[0].str.lower().str.startswith("belt@", na=False)]
    if len(df)>0:
        df_belt = pd.concat([df_belt, df])
        print(len(df_belt))
#     break

1
2
3
4
9
10
21


Kinds of belts:
- https://www.wikidata.org/wiki/Q134560
- https://www.wikidata.org/wiki/Q623755
- https://www.wikidata.org/wiki/Q961712
- https://www.wikidata.org/wiki/Q1853245
- https://www.wikidata.org/wiki/Q37181413

In [19]:
items = ["Q134560", "Q623755", "Q961712", "Q1853245", "Q37181413"]
items_uri = ["<http://www.wikidata.org/entity/"+Qid+">" for Qid in items]

for chunk in pd.read_csv(emb_tsv_file, 
                         delimiter='\t', 
                         header=None, 
                         chunksize=20000, 
                         skiprows=1, 
                         encoding='unicode-escape'):
    
    df = chunk[chunk[0].isin(items_uri)]
    if len(df)>0:
        df_belt = pd.concat([df_belt, df])
        print(len(df_belt))
#     break

22
23
24
25
26
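
The two cells above each read the whole TSV once, first for the `belt@` labels and then for the five entity URIs. Both tests could be applied in a single pass. The sketch below shows the idea with the standard `csv` module on made-up 3-d rows; `scan_rows` is a hypothetical helper, and unlike the cells above it does not skip the first row or apply the `unicode-escape` decoding.

```python
import csv
import io


def scan_rows(lines, label_prefix, uris):
    """Yield (name, vector) for rows whose first column matches either test."""
    wanted = set(uris)
    # QUOTE_NONE keeps the double quotes that are part of some labels.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue
        name = row[0]
        if name in wanted or name.lower().startswith(label_prefix):
            yield name, [float(x) for x in row[1:]]


# Made-up 3-d rows standing in for the 200-d file.
sample = io.StringIO(
    "belt@en\t0.1\t0.2\t0.3\n"
    "<http://www.wikidata.org/entity/Q134560>\t0.4\t0.5\t0.6\n"
    "Beltane@en\t0.7\t0.8\t0.9\n"
    "Belt@en-gb\t1.0\t1.1\t1.2\n"
)
hits = list(scan_rows(sample, "belt@", ["<http://www.wikidata.org/entity/Q134560>"]))
for name, vec in hits:
    print(name, vec)
```

Because the function is a generator over any iterable of lines, an open file handle can be passed in place of the `StringIO` object, so memory use stays flat regardless of file size.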


In [21]:
df_belt.to_csv("belt.csv", index=False)

In [29]:
qid_belt = df_belt[df_belt[0].str.startswith('<')]
qid_belt

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,191,192,193,194,195,196,197,198,199,200
82755,<http://www.wikidata.org/entity/Q37181413>,-0.0027,0.0881,0.4699,0.1606,-0.5265,0.0653,0.1324,-0.2479,-0.1804,...,-0.1619,0.1962,-0.2125,0.0187,-0.0563,-0.2305,-0.2767,0.1607,-0.5335,-0.0385
343305,<http://www.wikidata.org/entity/Q134560>,-0.6188,-0.355,0.4352,-0.503,-0.4783,-0.2958,0.3138,-0.2007,-0.7412,...,0.5383,0.0437,0.1995,-0.1997,-0.4154,-0.1166,0.3945,-0.0963,-0.0322,0.4677
1812901,<http://www.wikidata.org/entity/Q623755>,-0.3563,0.051,0.6262,-0.369,-0.1299,-0.0339,0.2315,-0.2041,-0.4801,...,-0.1942,0.0842,0.0286,0.3489,-0.2796,-0.1287,0.2743,-0.2667,-0.3131,0.4419
7024620,<http://www.wikidata.org/entity/Q961712>,-0.2786,0.4477,0.2711,-0.1358,-0.5924,-0.0394,-0.256,-0.2628,-0.1407,...,0.0718,-0.0269,-0.1646,0.229,0.0948,0.2196,-0.0253,-0.28,-0.5821,-0.1342
25465685,<http://www.wikidata.org/entity/Q1853245>,-0.0178,-0.2253,-0.4834,-0.1349,-0.4299,-0.1861,-0.1403,-0.0367,-0.0787,...,-0.2538,-0.1128,-0.2929,0.3921,-0.0128,-0.127,0.1866,0.1016,-0.0698,0.238


In [30]:
label_belt = df_belt[df_belt[0].str.lower().str.startswith('belt@en')]
label_belt

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,191,192,193,194,195,196,197,198,199,200
39505488,Belt@en,-0.2008,0.3438,-0.5905,-0.2869,-0.6121,-0.0868,0.0126,-0.3297,-0.0499,...,-0.0546,-0.2613,-0.1742,0.2119,0.1862,0.1878,-0.1574,0.4914,-0.2788,0.0812
46680710,belt@en,-0.4692,0.1622,-0.2928,-0.3075,-0.2424,-0.3216,0.1769,-0.1455,-0.0845,...,0.2958,0.0059,-0.1028,-0.0344,-0.469,-0.0298,0.0286,0.2428,0.157,0.3287
57748481,Belt@en-ca,-0.2597,0.2131,0.1443,-0.2932,-0.3756,-0.2273,-0.0531,-0.2891,-0.2113,...,0.2604,0.0547,0.0931,0.1408,0.1131,-0.0498,-0.1386,0.0928,-0.2841,0.1067
57748482,Belt@en-gb,-0.2824,0.2828,0.2554,-0.1884,-0.263,-0.1893,0.0827,-0.3436,-0.136,...,0.2371,0.072,0.1714,0.1785,0.0731,-0.0246,-0.1455,0.1502,-0.2427,0.1757


In [34]:
from scipy import spatial

def cosine_similarity(v1, v2):
    return 1 - spatial.distance.cosine(v1, v2)

In [43]:
for qid, vec1 in qid_belt.set_index(0).iterrows():
    vec1 = vec1.tolist()
    for label, vec2 in label_belt.set_index(0).iterrows():
        vec2 = vec2.tolist()
        print("similarity of", '\t' ,qid.split('/')[-1][:-1], '\t' ,label, '\t' ,cosine_similarity(vec1, vec2))
    print()

similarity of 	 Q37181413 	 Belt@en 	 0.530300725365445
similarity of 	 Q37181413 	 belt@en 	 0.12833555942337782
similarity of 	 Q37181413 	 Belt@en-ca 	 0.6460636411904157
similarity of 	 Q37181413 	 Belt@en-gb 	 0.6422865418227702

similarity of 	 Q134560 	 Belt@en 	 0.23912143045718182
similarity of 	 Q134560 	 belt@en 	 0.5894263343082636
similarity of 	 Q134560 	 Belt@en-ca 	 0.5970326362846458
similarity of 	 Q134560 	 Belt@en-gb 	 0.6043148009882696

similarity of 	 Q623755 	 Belt@en 	 0.2778154448339696
similarity of 	 Q623755 	 belt@en 	 0.4257392511489878
similarity of 	 Q623755 	 Belt@en-ca 	 0.5189452273715972
similarity of 	 Q623755 	 Belt@en-gb 	 0.5581850745648513

similarity of 	 Q961712 	 Belt@en 	 0.5606613394528832
similarity of 	 Q961712 	 belt@en 	 0.17489949976247643
similarity of 	 Q961712 	 Belt@en-ca 	 0.587964726854505
similarity of 	 Q961712 	 Belt@en-gb 	 0.5488954707498972

similarity of 	 Q1853245 	 Belt@en 	 0.6359258645273805
similarity of 	 Q1853245 	 

- Belt@en-gb and Belt@en-ca score above 0.5 cosine similarity with every Qid whose value is visible above.
- Belt@en scores above 0.5 with 3 Qids.
- belt@en (clothing belt) matches its relevant Qid (Q134560) best, at about 0.59. Q623755 scores about 0.43, and the other visible Qids are below 0.2.
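
The observations can be checked mechanically against the printed similarities. The snippet below copies those values, rounded to four places, into a dictionary and counts how many Qids each label exceeds a threshold with. The 0.5 threshold is an arbitrary cut-off chosen for this check, and the Q1853245 row is cut off after `Belt@en` in the output, so its other three values are simply absent.

```python
# Cosine similarities copied (rounded to 4 places) from the printed output above.
sims = {
    "Belt@en": {"Q37181413": 0.5303, "Q134560": 0.2391, "Q623755": 0.2778,
                "Q961712": 0.5607, "Q1853245": 0.6359},
    "belt@en": {"Q37181413": 0.1283, "Q134560": 0.5894, "Q623755": 0.4257,
                "Q961712": 0.1749},
    "Belt@en-ca": {"Q37181413": 0.6461, "Q134560": 0.5970, "Q623755": 0.5189,
                   "Q961712": 0.5880},
    "Belt@en-gb": {"Q37181413": 0.6423, "Q134560": 0.6043, "Q623755": 0.5582,
                   "Q961712": 0.5489},
}

threshold = 0.5  # arbitrary cut-off for "matches"
summary = {}
for label, row in sims.items():
    above = sorted(q for q, s in row.items() if s > threshold)
    best = max(row, key=row.get)  # Qid with the highest similarity
    summary[label] = (len(above), best)
    print(f"{label:12s} above {threshold}: {len(above)} of {len(row)}  best: {best}")
```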

In [45]:
for qid1, vec1 in qid_belt.set_index(0).iterrows():
    vec1 = vec1.tolist()
    for qid2, vec2 in qid_belt.set_index(0).iterrows():
        vec2 = vec2.tolist()
        print("similarity of", '\t' ,qid1.split('/')[-1][:-1], '\t' ,qid2.split('/')[-1][:-1], '\t' ,cosine_similarity(vec1, vec2))
    print()

similarity of 	 Q37181413 	 Q37181413 	 1
similarity of 	 Q37181413 	 Q134560 	 0.3462588474541677
similarity of 	 Q37181413 	 Q623755 	 0.5022191108000803
similarity of 	 Q37181413 	 Q961712 	 0.741787888098731
similarity of 	 Q37181413 	 Q1853245 	 0.5430454714760878

similarity of 	 Q134560 	 Q37181413 	 0.3462588474541677
similarity of 	 Q134560 	 Q134560 	 1
similarity of 	 Q134560 	 Q623755 	 0.6220318700597309
similarity of 	 Q134560 	 Q961712 	 0.2741023904060377
similarity of 	 Q134560 	 Q1853245 	 0.22816735453317416

similarity of 	 Q623755 	 Q37181413 	 0.5022191108000803
similarity of 	 Q623755 	 Q134560 	 0.6220318700597309
similarity of 	 Q623755 	 Q623755 	 1
similarity of 	 Q623755 	 Q961712 	 0.43465689834956056
similarity of 	 Q623755 	 Q1853245 	 0.39012673268146403

similarity of 	 Q961712 	 Q37181413 	 0.741787888098731
similarity of 	 Q961712 	 Q134560 	 0.2741023904060377
similarity of 	 Q961712 	 Q623755 	 0.43465689834956056
similarity of 	 Q961712 	 Q961712 	

# LM Models
- T5 variations:
    - [mt5-small](https://huggingface.co/google/mt5-small)
    - [t5-v1.1-xxl](https://huggingface.co/google/t5-v1_1-xxl)
    - [long-t5](https://huggingface.co/google/long-t5-tglobal-base)
    - [t5-large-lm-adapt](https://huggingface.co/google/t5-large-lm-adapt)
    - [mt5-base](https://huggingface.co/google/mt5-base)
    - [mt5-large](https://huggingface.co/google/mt5-large)
    - [t5-base-finetuned-common_gen](https://huggingface.co/mrm8488/t5-base-finetuned-common_gen)
    - [t5-efficient-tiny](https://huggingface.co/google/t5-efficient-tiny)
- Longformer Encoder-Decoder (LED)
    - [LED base](https://huggingface.co/allenai/led-base-16384)
    - [LED large](https://huggingface.co/allenai/led-large-16384)
- Big Science T0
    - [t0-3B - Big Science](https://huggingface.co/bigscience/T0_3B)
    - [t0 - Big Science](https://huggingface.co/bigscience/T0)
    - [t0pp - Big Science](https://huggingface.co/bigscience/T0pp)
- BERT
    - [bert-base-uncased](https://huggingface.co/bert-base-uncased)
    - [bert-base-cased](https://huggingface.co/bert-base-cased)
    - [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased)
    - [bert-large-uncased](https://huggingface.co/bert-large-uncased)
    - [bert-base-multilingual-uncased](https://huggingface.co/bert-base-multilingual-uncased)
    - [bert-large-cased](https://huggingface.co/bert-large-cased)
    - [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased)
    - [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased)
    - [distilbert-base-cased](https://huggingface.co/distilbert-base-cased)
- GPT
    - [gpt2](https://huggingface.co/gpt2)
    - [distilgpt2](https://huggingface.co/distilgpt2)
    - [EleutherAI/gpt-neo-1.3B](https://huggingface.co/EleutherAI/gpt-neo-1.3B)
    - [gpt2-medium](https://huggingface.co/gpt2-medium)
    - [EleutherAI/gpt-j-6B](https://huggingface.co/EleutherAI/gpt-j-6B)
    - [EleutherAI/gpt-neo-2.7B](https://huggingface.co/EleutherAI/gpt-neo-2.7B)
    - [EleutherAI/gpt-neo-125M](https://huggingface.co/EleutherAI/gpt-neo-125M)
    - [gpt2-xl](https://huggingface.co/gpt2-xl)
    - [gpt2-large](https://huggingface.co/gpt2-large)
    - [openai-gpt](https://huggingface.co/openai-gpt)
- RoBERTa
    - [roberta-base](https://huggingface.co/roberta-base)
    - [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)
    - [roberta-large](https://huggingface.co/roberta-large)
    - [distilroberta-base](https://huggingface.co/distilroberta-base)

In [2]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelWithLMHead, AutoModelForCausalLM, AutoModelForMaskedLM, AutoModelForPreTraining
import transformers

# BertForMaskedLM


In [75]:
bert_model_name = "bert-base-cased"

model = AutoModelForPreTraining.from_pretrained(bert_model_name)
tokenizer = AutoTokenizer.from_pretrained(bert_model_name)

Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [95]:
help(tokenizer)

Help on BertTokenizerFast in module transformers.models.bert.tokenization_bert_fast object:

class BertTokenizerFast(transformers.tokenization_utils_fast.PreTrainedTokenizerFast)
 |  BertTokenizerFast(vocab_file=None, tokenizer_file=None, do_lower_case=True, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', tokenize_chinese_chars=True, strip_accents=None, **kwargs)
 |  
 |  Construct a "fast" BERT tokenizer (backed by HuggingFace's *tokenizers* library). Based on WordPiece.
 |  
 |  This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
 |  refer to this superclass for more information regarding those methods.
 |  
 |  Args:
 |      vocab_file (`str`):
 |          File containing the vocabulary.
 |      do_lower_case (`bool`, *optional*, defaults to `True`):
 |          Whether or not to lowercase the input when tokenizing.
 |      unk_token (`str`, *optional*, defaults to `"[UNK]"`):

In [93]:
help(transformers.BertForPreTraining)

Help on class BertForPreTraining in module transformers.models.bert.modeling_bert:

class BertForPreTraining(BertPreTrainedModel)
 |  BertForPreTraining(config)
 |  
 |  Bert Model with two heads on top as done during the pretraining: a `masked language modeling` head and a `next
 |  sentence prediction (classification)` head.
 |  
 |  
 |  This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
 |  library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
 |  etc.)
 |  
 |  This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
 |  Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
 |  and behavior.
 |  
 |  Parameters:
 |      config ([`BertConfig`]): Model configuration class with all the parameters of the model.
 |          Initializing with a config file

In [76]:
model

BertForPreTraining(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine

In [127]:
help(model)

Help on BertForMaskedLM in module transformers.models.bert.modeling_bert object:

class BertForMaskedLM(BertPreTrainedModel)
 |  BertForMaskedLM(config)
 |  
 |  Bert Model with a `language modeling` head on top.
 |  
 |  This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
 |  library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
 |  etc.)
 |  
 |  This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
 |  Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
 |  and behavior.
 |  
 |  Parameters:
 |      config ([`BertConfig`]): Model configuration class with all the parameters of the model.
 |          Initializing with a config file does not load the weights associated with the model, only the
 |          configuration. Check out the [`~PreTrainedMode

In [98]:
inputs = tokenizer("Hello, my dog is cute. [SEP] My shoes are tight.", return_tensors="pt")
outputs = model(**inputs)

inputs, outputs

({'input_ids': tensor([[  101,   101,  8667,   117,  1139,  3676,  1110, 10509,   119,   102,
           1422,  5743,  1132,  3600,   119,   102,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])},
 BertForPreTrainingOutput(loss=None, prediction_logits=tensor([[[ -7.3181,  -7.2345,  -7.3501,  ...,  -6.2456,  -5.9510,  -6.3604],
          [ -7.3316,  -7.2501,  -7.3701,  ...,  -6.2620,  -5.9732,  -6.3748],
          [ -7.6375,  -7.9497,  -7.6809,  ...,  -6.4824,  -6.2852,  -6.7270],
          ...,
          [-12.6466, -12.4578, -12.6983,  ..., -10.0199, -10.2367, -11.2309],
          [-14.6324, -14.3464, -14.7202,  ..., -12.3296, -12.8787, -13.6785],
          [-14.6324, -14.3464, -14.7202,  ..., -12.3296, -12.8787, -13.6785]]],
        grad_fn=<ViewBackward0>), seq_relationship_logits=tensor([[ 1.1897, -0.2228]], grad_fn=<AddmmBackward0>), hidden_states=None, attentio

In [99]:
tokenizer.tokenize("Hello, my dog is cute. [SEP] My shoes are tight.")

['Hello',
 ',',
 'my',
 'dog',
 'is',
 'cute',
 '.',
 '[SEP]',
 'My',
 'shoes',
 'are',
 'tight',
 '.']

In [78]:
prediction_scores, seq_relationship_scores = outputs[:2]

prediction_scores, seq_relationship_scores

(tensor([[[ -7.3146,  -7.2150,  -7.3546,  ...,  -6.2915,  -5.9520,  -6.3191],
          [ -7.8435,  -8.0283,  -7.8237,  ...,  -6.5801,  -6.3865,  -6.6978],
          [-12.6162, -12.2843, -12.7444,  ...,  -8.2236,  -9.7125, -10.8290],
          ...,
          [-14.2960, -14.4208, -13.3845,  ...,  -8.7427, -10.3113,  -9.9714],
          [ -8.7748,  -8.1111,  -8.3881,  ...,  -6.5583,  -6.6553,  -7.8282],
          [-14.0976, -14.0160, -14.4813,  ..., -11.5246, -12.0814, -12.7681]]],
        grad_fn=<ViewBackward0>),
 tensor([[ 1.1983, -0.2422]], grad_fn=<AddmmBackward0>))

In [92]:
outputs[0].shape, outputs[1].shape

(torch.Size([1, 15, 28996]), torch.Size([1, 2]))
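
The second output, `seq_relationship_logits`, has shape `[1, 2]`: one pair of logits from the next-sentence-prediction head. The sketch below turns such a pair into probabilities with a softmax, in plain Python so that it runs without loading a model. The logit values are copied from the `seq_relationship_scores` output above. Reading index 0 as "sentence B continues sentence A" follows the Hugging Face BERT documentation. The helper name `softmax` is illustrative only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logit pair copied from the seq_relationship_scores output above.
seq_relationship_logits = [1.1983, -0.2422]
p_is_next, p_random = softmax(seq_relationship_logits)
print(f"P(continuation) = {p_is_next:.3f}, P(random) = {p_random:.3f}")
```

Note that the input here was a single string containing a literal `[SEP]`, so `token_type_ids` is all zeros. Passing the two sentences as separate arguments, `tokenizer(sentence_a, sentence_b, return_tensors="pt")`, is the usual way to get segment ids 0 and 1 for this head.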

In [96]:
outputs

BertForPreTrainingOutput(loss=None, prediction_logits=tensor([[[ -7.3472,  -7.2642,  -7.3727,  ...,  -6.2709,  -5.9648,  -6.3878],
         [ -7.9827,  -8.1628,  -7.9748,  ...,  -6.7188,  -6.5194,  -6.9336],
         [-12.5089, -12.5157, -12.7470,  ...,  -8.4077,  -9.6678, -11.3322],
         ...,
         [ -8.9322,  -8.2632,  -8.3811,  ...,  -8.6831,  -7.2777,  -7.6629],
         [-11.7479, -11.5592, -11.8093,  ...,  -9.2796,  -9.5640, -10.4089],
         [-14.7984, -14.4865, -14.8255,  ..., -12.3701, -12.9534, -13.6202]]],
       grad_fn=<ViewBackward0>), seq_relationship_logits=tensor([[ 1.1441, -0.1681]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [124]:
import torch

text = "This is a great [MASK] for [MASK]."
inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).prediction_logits
# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices.tolist()

for ix in range(5):
#     print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")
    print(tokenizer.decode([top_5_tokens[0][ix], top_5_tokens[1][ix]]))

opportunity me
day you
place us
idea him
time them


In [108]:
inputs.input_ids.shape

torch.Size([1, 10])

In [109]:
token_logits.shape

torch.Size([1, 10, 28996])

In [112]:
mask_token_index

tensor([5, 7])

In [111]:
mask_token_logits.shape

torch.Size([2, 28996])
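
The loop in the cell above prints the `ix`-th ranked candidate for the first `[MASK]` next to the `ix`-th ranked candidate for the second one. A minimal plain-Python sketch of that pairing, using a toy six-token vocabulary and made-up logits (both are assumptions for illustration, not values from the model):

```python
def topk_indices(scores, k):
    """Return the indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy stand-in for mask_token_logits: 2 mask positions x 6-token vocabulary.
toy_vocab = ["time", "day", "me", "you", "place", "us"]
toy_mask_logits = [
    [2.0, 1.5, -1.0, -0.5, 1.0, -2.0],   # first [MASK]
    [-1.0, -2.0, 3.0, 2.5, -0.5, 2.0],   # second [MASK]
]

top_3 = [topk_indices(row, 3) for row in toy_mask_logits]

# Pair the ix-th ranked candidate of each mask, as in the loop above.
pairs = [(toy_vocab[top_3[0][ix]], toy_vocab[top_3[1][ix]]) for ix in range(3)]
for first, second in pairs:
    print(first, second)
```

The pairs are aligned by rank only. Each mask is scored independently, so a printed pair is not necessarily the most probable joint completion.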

## MLM

In [3]:
from transformers import AutoModelForMaskedLM, AutoTokenizer

bert_model_name = "distilbert-base-uncased"  # alternative: "bert-base-cased"
model = AutoModelForMaskedLM.from_pretrained(bert_model_name)
tokenizer = AutoTokenizer.from_pretrained(bert_model_name)

In [5]:
inputs = tokenizer("Hello, my dog is cute. [SEP] My shoes are tight.", return_tensors="pt")
outputs = model(**inputs)

inputs, outputs

({'input_ids': tensor([[  101,  7592,  1010,  2026,  3899,  2003, 10140,  1012,   102,  2026,
           6007,  2024,  4389,  1012,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])},
 MaskedLMOutput(loss=None, logits=tensor([[[ -6.2956,  -6.2546,  -6.2639,  ...,  -5.5336,  -5.4668,  -3.3986],
          [ -7.3770,  -7.4773,  -7.5341,  ...,  -7.0813,  -6.6342,  -4.9592],
          [-12.6205, -12.6647, -12.8700,  ..., -10.9594,  -9.5039,  -9.1719],
          ...,
          [ -9.1824,  -9.4474,  -9.2105,  ...,  -7.9900,  -7.8203,  -4.0722],
          [-10.6852, -10.7796, -10.7815,  ...,  -8.6925,  -8.9572,  -6.1142],
          [-10.9742, -11.3112, -11.1756,  ...,  -9.2756,  -9.3903,  -7.7605]]],
        grad_fn=<ViewBackward0>), hidden_states=None, attentions=None))

In [4]:
model

DistilBertForMaskedLM(
  (activation): GELUActivation()
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inp

In [6]:
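from torch import nn
from transformers import AutoModelForMaskedLM
from transformers.modeling_outputs import MaskedLMOutput


class BERTModified(nn.Module):
    """Frozen masked-LM backbone with a trainable linear layer over its vocabulary logits."""

    def __init__(self, bert_model_name="bert-base-cased"):
        super().__init__()

        self.bert = AutoModelForMaskedLM.from_pretrained(bert_model_name)
        # Vocabulary size: 28996 for bert-base-cased, 30522 for distilbert-base-uncased
        self.num_labels = self.bert.config.vocab_size
        self.dense = nn.Linear(in_features=self.num_labels, out_features=self.num_labels, bias=True)
        self.loss_fct = nn.CrossEntropyLoss()

        ## set to eval
        self.bert.eval()

        ## freeze model
        for param in self.bert.parameters():
            param.requires_grad = False

    def forward(self, tokenizer_output, labels=None):

        # The wrapped model returns an output object; the dense layer needs its logits tensor
        bert_outputs = self.bert(**tokenizer_output)
        logits = self.dense(bert_outputs.logits)
        loss = None

        if labels is not None:
            # Positions labelled -100 are skipped by CrossEntropyLoss (default ignore_index)
            loss = self.loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        return MaskedLMOutput(
            loss=loss,
            logits=logits,
            hidden_states=None,
            attentions=None,
        )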

In [8]:
from transformers import DataCollatorForWholeWordMask

data_collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

In [9]:
from datasets import load_dataset
dataset = load_dataset("zhengxuanzenwu/wikitext-2-split-128", split="test")

Using custom data configuration zhengxuanzenwu--wikitext-2-split-128-f504347a654a9463
Found cached dataset parquet (/home/a2khatun/.cache/huggingface/datasets/zhengxuanzenwu___parquet/zhengxuanzenwu--wikitext-2-split-128-f504347a654a9463/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)


In [10]:
dataset

Dataset({
    features: ['text'],
    num_rows: 8192
})

In [11]:
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


# Use batched=True to activate fast multithreading!
tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

tokenized_datasets

Dataset({
    features: ['input_ids', 'attention_mask', 'word_ids'],
    num_rows: 8192
})
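
The `word_ids` column added by `tokenize_function` maps each token to the index of the word it came from, with `None` for special tokens. A minimal sketch of how such a list groups subword tokens into whole words; the example list and the helper name `group_tokens_by_word` are hypothetical:

```python
def group_tokens_by_word(word_ids):
    """Map each word index to the positions of its subword tokens."""
    groups = {}
    for position, word_id in enumerate(word_ids):
        if word_id is None:        # special tokens such as [CLS] / [SEP]
            continue
        groups.setdefault(word_id, []).append(position)
    return groups

# Hypothetical word_ids for a 3-word input whose middle word splits in two.
example_word_ids = [None, 0, 1, 1, 2, None]
word_groups = group_tokens_by_word(example_word_ids)
print(word_groups)
```

Masking all positions of a group together is the idea behind whole-word masking. Whether a particular collator reads this column may depend on the library version; this notebook pops `word_ids` before calling the collator.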

In [12]:
tokenizer.model_max_length

512

In [13]:
tokenizer.decode(tokenized_datasets[0]["input_ids"]) ,dataset[0]

('[CLS] most of the north korean units began conducting delaying actions attempting to get as much of their army as possible into north korea. [SEP]',
 {'text': 'Most of the North Korean units began conducting delaying actions attempting to get as much of their army as possible into North Korea .'})

In [14]:
tokenizer.decode(tokenized_datasets[1]["input_ids"]) ,dataset[1]

('[CLS] by this time her health had deteriorated. [SEP]',
 {'text': 'By this time her health had deteriorated .'})

In [15]:
samples = [tokenized_datasets[i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] most of the [MASK] [MASK] unitsrix conducting delaying actions attempting to get as much of their army as [MASK] into north korea. [SEP]'

'>>> [CLS] by [MASK] [MASK] her health had deteriorated. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'
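
The first decoded example contains `[MASK]` tokens and also an unexpected piece (`unitsrix`) where `began` used to be. That is consistent with BERT-style masking, in which a selected token is usually replaced by `[MASK]`, sometimes by a random token, and sometimes left unchanged. The sketch below shows that 80/10/10 rule in plain Python. It is an illustration under assumptions, not the library's implementation: the token ids are made up, the function name `mask_tokens` is hypothetical, and special tokens are not skipped as a real collator would do.

```python
import random

IGNORE_INDEX = -100  # label value that the loss skips

def mask_tokens(token_ids, mask_id, vocab_size, mlm_probability=0.15, seed=0):
    """Return (masked_inputs, labels) following the 80/10/10 rule."""
    rng = random.Random(seed)
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if rng.random() >= mlm_probability:
            labels.append(IGNORE_INDEX)            # not selected: no loss here
            continue
        labels.append(tok)                         # selected: predict the original
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels

toy_ids = list(range(1000, 1040))                  # hypothetical token ids
masked, labels = mask_tokens(toy_ids, mask_id=103, vocab_size=30522)
```

Only positions whose label is not `-100` contribute to the masked-language-modelling loss.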


In [16]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /home/a2khatun/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


In [17]:
train_size = 1000
test_size = 100

downsampled_dataset = tokenized_datasets.train_test_split(train_size=train_size, test_size=test_size, seed=42)
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 100
    })
})

In [24]:
from transformers import TrainingArguments

batch_size = 1

# Show the training loss with every epoch
logging_steps = len(downsampled_dataset['train']) // batch_size
model_name = "BERTModified"
output_dir = f"{model_name}-finetuned-wikitext-test"

training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=True,
    fp16=True,
    logging_steps=logging_steps,
    num_train_epochs=1,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
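
With `batch_size = 1` and 1000 training rows, `logging_steps` evaluates to 1000, so the training loss is logged once per epoch. A small sketch of the arithmetic for a single-device run without gradient accumulation (a simplifying assumption; the helper name `optimization_steps` is hypothetical):

```python
import math

def optimization_steps(num_examples, batch_size, epochs=1):
    """Optimizer updates for one device, no gradient accumulation."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return batches_per_epoch * epochs

train_rows, batch_size = 1000, 1
logging_steps = train_rows // batch_size
total_steps = optimization_steps(train_rows, batch_size, epochs=1)
print(logging_steps, total_steps)
```

A larger batch size shrinks both numbers together, so the once-per-epoch logging is preserved.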


In [36]:
model

DistilBertForMaskedLM(
  (activation): GELUActivation()
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inp

In [None]:
BERTModified_model = BERTModified()
BERTModified_model

In [19]:
from transformers import Trainer

trainer = Trainer(
    model=BERTModified_model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
)

/home/a2khatun/Downloads/KG/project/BERTModified-finetuned-wikitext-test is already a clone of https://huggingface.co/Aisha/BERTModified-finetuned-wikitext-test. Make sure you pull the latest changes with `repo.git_pull()`.
Using cuda_amp half precision backend


In [20]:
trainer.train()
# trainer.save_model("output/models/BERTModified")

The following columns in the training set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `DistilBertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1000
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 1000


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 2.6958        |                 |


Saving model checkpoint to BERTModified-finetuned-wikitext-test/checkpoint-500
Configuration saved in BERTModified-finetuned-wikitext-test/checkpoint-500/config.json
Model weights saved in BERTModified-finetuned-wikitext-test/checkpoint-500/pytorch_model.bin
Saving model checkpoint to BERTModified-finetuned-wikitext-test/checkpoint-1000
Configuration saved in BERTModified-finetuned-wikitext-test/checkpoint-1000/config.json
Model weights saved in BERTModified-finetuned-wikitext-test/checkpoint-1000/pytorch_model.bin
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `DistilBertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 100
  Batch size = 1


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1000, training_loss=2.695760986328125, metrics={'train_runtime': 213.7372, 'train_samples_per_second': 4.679, 'train_steps_per_second': 4.679, 'total_flos': 8960823461880.0, 'train_loss': 2.695760986328125, 'epoch': 1.0})

In [29]:
import math

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `DistilBertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 100
  Batch size = 1


>>> Perplexity: nan


In [30]:
eval_results

{'eval_loss': nan,
 'eval_runtime': 2.9038,
 'eval_samples_per_second': 34.437,
 'eval_steps_per_second': 34.437,
 'epoch': 1.0}
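
Perplexity is the exponential of the mean cross-entropy loss, so a `nan` loss gives a `nan` perplexity. One candidate explanation for the `nan` (an assumption; the notebook does not confirm it) is that with an evaluation batch size of 1 and random masking, a short example can end up with no masked tokens, leaving every label at `-100` and nothing to average. Mixed precision (`fp16=True`) is another possible source. The plain-Python sketch below reproduces only the first effect with toy logits; `mean_cross_entropy` is a hypothetical helper, not a library function.

```python
import math

IGNORE_INDEX = -100

def mean_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    losses = []
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_z - row[label])
    # Averaging over zero positions is 0/0; return nan to mirror that.
    return sum(losses) / len(losses) if losses else float("nan")

toy_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
loss_some = mean_cross_entropy(toy_logits, [0, IGNORE_INDEX])
loss_none = mean_cross_entropy(toy_logits, [IGNORE_INDEX, IGNORE_INDEX])
perplexity = math.exp(loss_some)
```

A single `nan` batch is enough to make the averaged evaluation loss `nan`, so a larger evaluation batch size or a fixed (pre-masked) evaluation set are possible things to try.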

In [28]:
trainer.push_to_hub()

Saving model checkpoint to BERTModified-finetuned-wikitext-test
Configuration saved in BERTModified-finetuned-wikitext-test/config.json
Model weights saved in BERTModified-finetuned-wikitext-test/pytorch_model.bin


remote: Scanning LFS files for validity, may be slow...        
remote: LFS file scan complete.        
To https://huggingface.co/Aisha/BERTModified-finetuned-wikitext-test
   3fdb790..1ceebf7  main -> main

Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Masked Language Modeling', 'type': 'fill-mask'}}
To https://huggingface.co/Aisha/BERTModified-finetuned-wikitext-test
   1ceebf7..4d1b6ef  main -> main



'https://huggingface.co/Aisha/BERTModified-finetuned-wikitext-test/commit/1ceebf7adcf85a4c05833ef659f00c9657c470bb'

In [26]:
# from huggingface_hub import get_full_repo_name

# repo_name = get_full_repo_name(output_dir)
# repo_name

'Aisha/BERTModified-finetuned-wikitext-test'

In [27]:
# from huggingface_hub import Repository

# repo = Repository(output_dir, clone_from=repo_name)

/home/a2khatun/Downloads/KG/project/BERTModified-finetuned-wikitext-test is already a clone of https://huggingface.co/Aisha/BERTModified-finetuned-wikitext-test. Make sure you pull the latest changes with `repo.git_pull()`.


In [32]:
tokenizer

PreTrainedTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

In [33]:
from transformers import pipeline

# Initialize MLM pipeline
mlm = pipeline('fill-mask', model="Aisha/BERTModified-finetuned-wikitext-test", tokenizer=tokenizer)

# Get mask token
mask = mlm.tokenizer.mask_token

# Get result for particular masked phrase
phrase = f'Wikipedia is a free online {mask}, created and edited by volunteers around the world'

result = mlm(phrase)

# Print result
print(result)

loading configuration file config.json from cache at /home/a2khatun/.cache/huggingface/hub/models--Aisha--BERTModified-finetuned-wikitext-test/snapshots/4d1b6ef288e8bed892389281a8f3da50455499dd/config.json
Model config DistilBertConfig {
  "_name_or_path": "Aisha/BERTModified-finetuned-wikitext-test",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.23.1",
  "vocab_size": 30522
}

[{'score': 0.8920441269874573, 'token': 12204, 'token_str': 'encyclopedia', 'sequence': 'wikipedia is a free online encyclopedia, created and edited by volunteers around the world'}, {'score': 0.021595323458313942, 'token': 2932, 'token_str': 'magazine', 'sequence': 'wikipedia is a free online magazine, created and edited by volunteers around the world'}, {'score': 0.020654980093240738, 'token': 3485, 'token_str': 'journal', 'sequence': 'wikipedia is a free online journal, created and edited by volunteers around the world'}, {'score': 0.013326575979590416, 'token': 4772, 'token_str': 'publication', 'sequence': 'wikipedia is a free online publication, created and edited by volunteers around the world'}, {'score': 0.010755684226751328, 'token': 9206, 'token_str': 'dictionary', 'sequence': 'wikipedia is a free online dictionary, created and edited by volunteers around the world'}]


In [35]:
for x in result:
    print(f">>> {x['sequence']}")

>>> wikipedia is a free online encyclopedia, created and edited by volunteers around the world
>>> wikipedia is a free online magazine, created and edited by volunteers around the world
>>> wikipedia is a free online journal, created and edited by volunteers around the world
>>> wikipedia is a free online publication, created and edited by volunteers around the world
>>> wikipedia is a free online dictionary, created and edited by volunteers around the world


https://huggingface.co/course/chapter7/3?fw=pt#perplexity-for-language-models

In [None]:
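## compare model weights

def models_equal(model1, model2):
    """Return True if every pair of corresponding parameters is element-wise equal."""
    for p1, p2 in zip(model1.parameters(), model2.parameters()):
        if p1.data.ne(p2.data).sum() > 0:
            return False
    return True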