HurtLex is a lexicon of offensive, aggressive, and hateful words in over 50 languages. The words are divided into 17 categories, plus a macro-category indicating whether a stereotype is involved. The 17 categories are:
|PS||negative stereotypes ethnic slurs|
|RCI||locations and demonyms|
|PA||professions and occupations|
|DDF||physical disabilities and diversity|
|DDP||cognitive disabilities and diversity|
|DMC||moral and behavioral defects|
|IS||words related to social and economic disadvantage|
|PR||words related to prostitution|
|OM||words related to homosexuality|
|QAS||with potential negative connotations|
|RE||felonies and words related to crime and immoral behavior|
|SVP||words related to the seven deadly sins of the Christian tradition|
HurtLex has a two-level structure. Each lemma belongs to one of these levels:
- conservative: obtained by translating offensive senses of the words in the original lexicon.
- inclusive: obtained by translating all the potentially relevant senses of the words in the original lexicon.
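As a rough illustration of how the two levels can be used, the sketch below keeps only the entries at a given level, optionally restricted to some categories. It assumes a tab-separated file with a header row and columns named `id`, `category`, `stereotype`, `lemma`, and `level`. These column names, the function `load_lemmas`, and the sample rows are illustrative assumptions, so check them against the actual files before relying on them.

```python
import csv
import io

# Hypothetical sample in the assumed tab-separated layout; the rows are
# placeholders, not real lexicon entries.
SAMPLE_TSV = (
    "id\tcategory\tstereotype\tlemma\tlevel\n"
    "X1\tPS\tyes\tlemma_a\tconservative\n"
    "X2\tPS\tyes\tlemma_b\tinclusive\n"
    "X3\tRE\tno\tlemma_c\tconservative\n"
)


def load_lemmas(tsv_text, level="conservative", categories=None):
    """Return the lemmas labelled with `level`, optionally limited to `categories`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row["lemma"]
        for row in reader
        if row["level"] == level
        and (categories is None or row["category"] in categories)
    ]


print(load_lemmas(SAMPLE_TSV))                     # ['lemma_a', 'lemma_c']
print(load_lemmas(SAMPLE_TSV, categories={"PS"}))  # ['lemma_a']
```

To read a real file, pass its contents (for example `open(path, encoding="utf-8").read()`) in place of `SAMPLE_TSV`.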
The table below gives the up-to-date list of HurtLex word lists for every language, with the available versions.
|AF Afrikaans||1.0 1.1 1.2|
|AR Arabic||1.0 1.1 1.2|
|BG Bulgarian||1.0 1.1 1.2|
|BN Bengali||1.0 1.1 1.2|
|CA Catalan||1.0 1.1 1.2|
|CS Czech||1.0 1.1 1.2|
|CY Welsh||1.0 1.1 1.2|
|DA Danish||1.0 1.1 1.2|
|DE German||1.0 1.1 1.2|
|EL Greek||1.0 1.1 1.2|
|EN English||1.0 1.1 1.2|
|EO Esperanto||1.0 1.1 1.2|
|ES Spanish||1.0 1.1 1.2|
|ET Estonian||1.0 1.1 1.2|
|EU Basque||1.0 1.1 1.2|
|FA Persian||1.0 1.1 1.2|
|FI Finnish||1.0 1.1 1.2|
|FR French||1.0 1.1 1.2|
|GA Irish||1.0 1.1 1.2|
|GL Galician||1.0 1.1 1.2|
|HE Hebrew||1.0 1.1 1.2|
|HI Hindi||1.0 1.1 1.2|
|HR Croatian||1.0 1.1 1.2|
|HU Hungarian||1.0 1.1 1.2|
|ID Indonesian||1.0 1.1 1.2|
|IS Icelandic||1.0 1.1 1.2|
|IT Italian||1.0 1.1 1.2|
|JA Japanese||1.0 1.1 1.2|
|KO Korean||1.0 1.1 1.2|
|LT Lithuanian||1.0 1.1 1.2|
|LV Latvian||1.0 1.1 1.2|
|MK Macedonian||1.0 1.1 1.2|
|MS Malay||1.0 1.1 1.2|
|MT Maltese||1.0 1.1 1.2|
|NL Dutch||1.0 1.1 1.2|
|NO Norwegian||1.0 1.1 1.2|
|PL Polish||1.0 1.1 1.2|
|PT Portuguese||1.0 1.1 1.2|
|RO Romanian||1.0 1.1 1.2|
|RU Russian||1.0 1.1 1.2|
|SIMPLE Simple English||1.0 1.1 1.2|
|SK Slovak||1.0 1.1 1.2|
|SL Slovenian||1.0 1.1 1.2|
|SQ Albanian||1.0 1.1 1.2|
|SR Serbian||1.0 1.1 1.2|
|SV Swedish||1.0 1.1 1.2|
|SW Swahili||1.0 1.1 1.2|
|TH Thai||1.0 1.1 1.2|
|TL Tagalog||1.0 1.1 1.2|
|TR Turkish||1.0 1.1 1.2|
|UK Ukrainian||1.0 1.1 1.2|
|VI Vietnamese||1.0 1.1 1.2|
|ZH Chinese||1.0 1.1 1.2|
New in version 1.2: a table aligning lemmas across languages is also available.
Revised Hurtlex (IT)
The Revised HurtLex is a lexicon in which every headword is annotated with an offensiveness score. Focusing on the Italian entries, we revised the terms in HurtLex and derived an offensiveness score for each lexical item by applying an Item Response Theory model to the ratings provided by a large number of annotators.
Hurtlex is described in this paper:
Elisa Bassignana, Valerio Basile, Viviana Patti. Hurtlex: A Multilingual Lexicon of Words to Hurt. In Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-It 2018)
The Revised HurtLex dictionary is described in a paper currently under review.
Contributions are welcome, in the form of revised lexica. Every native speaker of a language is invited to fork the repository and file a pull request.
Please try to limit your modifications to the following operations:
- add: add a new item to a lexicon, by creating a new line. Fill in all the column values, including category and stereotype, set level="conservative", and add a new unique ID for the lemma.
- remove: remove an item considered wrong for a lexicon, by removing the corresponding line.
- update: change the lemma or the category of an item, e.g. because of a misspelling.
- add offensiveness score: create a new column with a real value between 0 and 1 to indicate a score for the offensiveness of an item in a lexicon.
- Some languages are written in more than one script (e.g. Hindi, Bangla, Bulgarian, Russian): in these cases it is good practice to harmonize the lexicon by adding the missing spelling and keeping the same ID for the same lemma written in different scripts.
- Some lexicons contain inflected forms instead of lemmas. These are mistakes introduced by the automatic processing. It is safe to remove such words if the corresponding lemma is already in the lexicon, or to modify them if it is not.
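Before filing a pull request, it may help to run a quick sanity check over the rows you edited. The sketch below is one possible check, not an official tool: it verifies that the required columns are filled in, that the level is one of the two known values, and that an offensiveness score, if present, is a real number between 0 and 1. The column names, including `score`, and the function `check_rows` are assumptions made for this example.

```python
VALID_LEVELS = {"conservative", "inclusive"}
# Assumed column names; adjust them to match the lexicon file you are editing.
REQUIRED = ("id", "category", "stereotype", "lemma", "level")


def check_rows(rows):
    """Return a list of human-readable problems found in `rows` (a list of dicts)."""
    problems = []
    for n, row in enumerate(rows, start=1):
        # Every required column must be present and non-empty.
        missing = [col for col in REQUIRED if not row.get(col)]
        if missing:
            problems.append(f"row {n}: missing {', '.join(missing)}")
        # The level must be one of the two documented values.
        level = row.get("level")
        if level and level not in VALID_LEVELS:
            problems.append(f"row {n}: unknown level {level!r}")
        # The optional score must parse as a number in the range [0, 1].
        score = row.get("score")
        if score not in (None, ""):
            try:
                in_range = 0.0 <= float(score) <= 1.0
            except ValueError:
                in_range = False
            if not in_range:
                problems.append(f"row {n}: score {score!r} is not a number in [0, 1]")
    return problems


# Placeholder rows: the first is well formed, the second has three problems.
rows = [
    {"id": "X1", "category": "PS", "stereotype": "yes",
     "lemma": "lemma_a", "level": "conservative", "score": "0.4"},
    {"id": "X2", "category": "RE", "stereotype": "no",
     "lemma": "", "level": "strict", "score": "1.5"},
]
for problem in check_rows(rows):
    print(problem)
```

The check deliberately does not reject repeated IDs, because the same ID is expected when one lemma is written in several scripts.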
Please create a new version directory for the lexicon you submit. If yours is the first manually corrected version of a lexicon (that is, the last version is 1.*), please create the directory for version 2.0. Otherwise, proceed incrementally (2.0 -> 2.1, 2.1 -> 2.2, ...).
Finally, do not forget to add a README.md file in your newly created directory, indicating what has changed and giving your contact details for due credit.
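The version-numbering rule above can be sketched as a small helper. This is only an illustration: the function name `next_version` is hypothetical, and it assumes version strings always have the form `major.minor`.

```python
def next_version(latest):
    """Return the directory name for a new manually corrected version.

    Assumes `latest` looks like "1.2" or "2.0" (major.minor).
    """
    major, minor = (int(part) for part in latest.split("."))
    if major < 2:
        # The last version is 1.*: this is the first manual correction.
        return "2.0"
    # Otherwise proceed incrementally within the same major version.
    return f"{major}.{minor + 1}"


print(next_version("1.2"))  # 2.0
print(next_version("2.0"))  # 2.1
print(next_version("2.1"))  # 2.2
```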
Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
You are free to:
- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
- Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- NonCommercial — You may not use the material for commercial purposes.
- ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
- No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
- You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation.
- No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.