Issue in detokenization of token_ids in BertTokenizer #949

### **Why does detokenizing `hindi_token_ids` return bytes rather than strings?**
<img width="433" alt="image" src="https://user-images.githubusercontent.com/91084615/173204064-03a5dec3-7647-41e4-8c81-9838e9deb087.png">
<img width="673" alt="Actual" src="https://user-images.githubusercontent.com/91084615/173204112-ce1c204c-7391-4e36-ae14-faa07c496045.PNG">
<br>
<br>

### **Detokenizing the ids for some other languages works perfectly fine.**
![image](https://user-images.githubusercontent.com/91084615/173204213-ba1d9940-befb-4151-a340-2be4593dea5b.png)

<img width="682" alt="English_language" src="https://user-images.githubusercontent.com/91084615/173204404-7a91e472-e5a1-4a9f-b18d-9f2b0116ae7d.PNG">
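One possible explanation, offered as a guess since the screenshots are the only evidence: if the detokenizer returns UTF-8 encoded `bytes` objects for every language, ASCII text such as English prints readably (`b'hello'`), while Devanagari text prints as `\x..` escape sequences, so only the Hindi output looks wrong. The sketch below uses plain Python only. `hindi_tokens`, `english_tokens`, and `decode_tokens` are hypothetical names standing in for the detokenized output and are not part of any tokenizer API.

```python
# Hypothetical stand-ins for detokenized output: lists of UTF-8 encoded bytes.
# Assumption: the real detokenizer output surfaces in Python as bytes objects.
hindi_tokens = ["नमस्ते".encode("utf-8"), "दुनिया".encode("utf-8")]
english_tokens = [b"hello", b"world"]


def decode_tokens(tokens, encoding="utf-8"):
    """Decode each bytes token to str; tokens that are already str pass through."""
    return [t.decode(encoding) if isinstance(t, bytes) else t for t in tokens]


# ASCII bytes display readably, non-ASCII bytes display as \x.. escapes.
print(english_tokens[0])
print(hindi_tokens[0])

# After decoding, both languages are ordinary Python strings.
print(decode_tokens(english_tokens))
print(decode_tokens(hindi_tokens))
```

If this assumption holds, the ids are being detokenized correctly in both cases and only the display differs; decoding the bytes as UTF-8 recovers the Hindi text.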



