Gensim is an open-source Python library for natural language processing (NLP), focused on unsupervised learning from text data. It is widely used to train and load text-based models such as Word2Vec, Doc2Vec, and Latent Dirichlet Allocation (LDA) topic models.

In [1]:
!pip install gensim



In [2]:
import gensim.downloader as api

In [3]:
info = api.info()
print(info)

{'corpora': {'semeval-2016-2017-task3-subtaskBC': {'num_records': -1, 'record_format': 'dict', 'file_size': 6344358, 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/semeval-2016-2017-task3-subtaskB-eng/__init__.py', 'license': 'All files released for the task are free for general research use', 'fields': {'2016-train': ['...'], '2016-dev': ['...'], '2017-test': ['...'], '2016-test': ['...']}, 'description': 'SemEval 2016 / 2017 Task 3 Subtask B and C datasets contain train+development (317 original questions, 3,169 related questions, and 31,690 comments), and test datasets in English. The description of the tasks and the collected data is given in sections 3 and 4.1 of the task paper http://alt.qcri.org/semeval2016/task3/data/uploads/semeval2016-task3-report.pdf linked in section “Papers” of https://github.com/RaRe-Technologies/gensim-data/issues/18.', 'checksum': '701ea67acd82e75f95e1d8e62fb0ad29', 'file_name': 'semeval-2016-2017-task3-subtaskBC.gz',

In [4]:
info['models']  # lists the available embedding models (see each entry's file_name)

{'fasttext-wiki-news-subwords-300': {'num_records': 999999,
  'file_size': 1005007116,
  'base_dataset': 'Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens)',
  'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/fasttext-wiki-news-subwords-300/__init__.py',
  'license': 'https://creativecommons.org/licenses/by-sa/3.0/',
  'parameters': {'dimension': 300},
  'description': '1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).',
  'read_more': ['https://fasttext.cc/docs/en/english-vectors.html',
   'https://arxiv.org/abs/1712.09405',
   'https://arxiv.org/abs/1607.01759'],
  'checksum': 'de2bb3a20c46ce65c9c131e1ad9a77af',
  'file_name': 'fasttext-wiki-news-subwords-300.gz',
  'parts': 1},
 'conceptnet-numberbatch-17-06-300': {'num_records': 1917247,
  'file_size': 1225497562,
  'base_dataset': 'ConceptNet, word2vec, GloVe, and OpenSubtitles 2016',
  'reader_code': 'https:/

In [5]:
# Print each model's name, vocabulary size, and the start of its description
for model_name, model_data in sorted(info['models'].items()):
    print(model_name, model_data.get('num_records', -1),
          model_data['description'][:40] + '...')

__testing_word2vec-matrix-synopsis -1 [THIS IS ONLY FOR TESTING] Word vecrors ...
conceptnet-numberbatch-17-06-300 1917247 ConceptNet Numberbatch consists of state...
fasttext-wiki-news-subwords-300 999999 1 million word vectors trained on Wikipe...
glove-twitter-100 1193514 Pre-trained vectors based on  2B tweets,...
glove-twitter-200 1193514 Pre-trained vectors based on 2B tweets, ...
glove-twitter-25 1193514 Pre-trained vectors based on 2B tweets, ...
glove-twitter-50 1193514 Pre-trained vectors based on 2B tweets, ...
glove-wiki-gigaword-100 400000 Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-200 400000 Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-300 400000 Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-50 400000 Pre-trained vectors based on Wikipedia 2...
word2vec-google-news-300 3000000 Pre-trained vectors trained on a part of...
word2vec-ruscorpora-300 184973 Word2vec Continuous Skipgram vectors tra...


Word2Vec: learns one dense vector per word. These vectors capture semantic relationships between words, which lets the model represent analogies such as "king" is to "queen" as "man" is to "woman".
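The analogy can be illustrated with plain vector arithmetic. The sketch below uses hypothetical, hand-picked 3-dimensional vectors (`toy_vectors`) purely for illustration; real Word2Vec vectors are learned from data and have hundreds of dimensions, so this is not how the pre-trained model stores or computes anything. The helper names `cosine` and `analogy` are also assumptions made for this example.

```python
import math

# Hypothetical toy vectors, chosen by hand for illustration only.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# "man" is to "king" as "woman" is to ...?
result = analogy("man", "king", "woman", toy_vectors)
print(result)
```

With a loaded Gensim model, a comparable query would be `wv.most_similar(positive=['king', 'woman'], negative=['man'])`; the scores it returns are cosine similarities.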

Limitations:

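- Context-independent: each word has a single vector, whatever sentence it appears in

- Out-of-vocabulary (OOV) words: a word not seen during training has no vector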
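Both limitations can be seen with a static lookup table. The sketch below is a minimal illustration that assumes a plain dictionary (`toy_vectors`, with made-up values) as a stand-in for a trained model; the `embed` helper is hypothetical and not part of Gensim.

```python
# Made-up 2-dimensional vectors standing in for a trained vocabulary.
toy_vectors = {
    "bank":  [0.4, 0.6],
    "river": [0.1, 0.9],
    "money": [0.8, 0.2],
}

def embed(sentence, vectors):
    """Map each token to its static vector; None marks an out-of-vocabulary token."""
    return [(tok, vectors.get(tok)) for tok in sentence.lower().split()]

s1 = embed("river bank", toy_vectors)
s2 = embed("bank loan", toy_vectors)

# Context independence: "bank" gets the same vector in both sentences.
same_vector = s1[1][1] == s2[0][1]

# OOV: "loan" is not in the toy vocabulary, so no vector is available.
oov_tokens = [tok for tok, vec in s2 if vec is None]

print(same_vector, oov_tokens)
```

With a Gensim `KeyedVectors` object, membership can be checked with `word in wv` before a lookup, since indexing an unknown word raises a `KeyError`.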



In [6]:
wv = api.load('word2vec-google-news-300')



GloVe: GloVe differs from Word2Vec in how it builds word vectors. Word2Vec is a predictive approach over local context windows (CBOW predicts a word from its context, Skip-Gram predicts the context from a word), whereas GloVe relies on the global statistics of a corpus, using word co-occurrence counts to produce the embeddings.

The main idea behind GloVe is that the ratio of the co-occurrence probabilities of two words with a third, probe word should carry meaningful information about how those two words relate.

P(A ∩ B): the probability that words A and B occur together
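The ratio idea can be sketched on a tiny corpus. The example below assumes a made-up set of three-word "sentences" and a symmetric context window; the function names are hypothetical, and the probability is taken as P(k | i) = X_ik / X_i, where X_ik counts how often k appears in the context of i. It only illustrates the statistic GloVe starts from, not the training procedure itself.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each context word occurs within `window` tokens of a word."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

def cooccurrence_prob(counts, word, context):
    """P(context | word) = X[word][context] / sum over k of X[word][k]."""
    total = sum(counts[word].values())
    return counts[word][context] / total if total else 0.0

# Made-up toy corpus, for illustration only.
sentences = [
    "ice solid water", "ice solid cold", "ice solid water", "ice gas water",
    "steam gas water", "steam gas hot", "steam gas water", "steam solid water",
]
counts = cooccurrence_counts(sentences)

# Ratio P(k | ice) / P(k | steam) for three probe words k.
ratios = {
    k: cooccurrence_prob(counts, "ice", k) / cooccurrence_prob(counts, "steam", k)
    for k in ("solid", "gas", "water")
}
print(ratios)
```

In this toy corpus the ratio is well above 1 for "solid" (related to ice), well below 1 for "gas" (related to steam), and exactly 1 for "water", which is equally related to both.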

In [7]:
glove = api.load('glove-twitter-100')



FastText: a word-embedding method developed by Facebook's AI Research lab (FAIR). It extends Word2Vec and is designed to improve the quality of word embeddings, particularly for out-of-vocabulary (OOV) words, by taking subword information into account.

FastText relies on the same Skip-gram and CBOW models as Word2Vec, but adds character n-gram subwords as an additional feature. The main difference from Word2Vec is that each word is represented by the sum of the vectors of its subwords, rather than being treated as a single unit.
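The subword idea can be sketched as follows. This is a simplified illustration, not the fastText implementation: it assumes `<` and `>` as word-boundary markers and an n-gram range of 3 to 6, and it uses a hypothetical dictionary of n-gram vectors (`ngram_vectors`, made-up values). The real library also hashes n-grams into a fixed number of buckets and includes the whole word as an extra unit, both of which are omitted here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word` wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

def word_vector(word, ngram_vectors, dim):
    """Sum the vectors of the known n-grams of `word` (zero vector if none are known)."""
    total = [0.0] * dim
    for gram in char_ngrams(word):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total

trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)

# Hypothetical n-gram vectors, as if learned from words seen during training.
ngram_vectors = {"<wh": [1.0, 0.0], "re>": [0.0, 2.0]}

# A misspelled word never seen in training still shares n-grams with known
# words, so it still receives a non-zero vector.
unseen = word_vector("whare", ngram_vectors, dim=2)
print(unseen)
```

Because the vector is assembled from shared subwords, misspellings and rare inflections land near the words they resemble, which is what makes the approach useful for OOV words.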

In [8]:
fasttext = api.load('fasttext-wiki-news-subwords-300')



In [12]:
wv.most_similar('engineer')

[('electrical_engineer', 0.7500534653663635),
 ('mechanical_engineer', 0.7456685304641724),
 ('Engineer', 0.6911254525184631),
 ('engineering', 0.6766180396080017),
 ('engineeer', 0.611038863658905),
 ('engineers', 0.6096685528755188),
 ('technician', 0.6021502614021301),
 ('electrician', 0.5883190631866455),
 ('Mechanical_Engineer', 0.5782192945480347),
 ('architect', 0.5779430270195007)]

In [11]:
wv.most_similar('tennis')

[('Tennis', 0.7291241884231567),
 ('volleyball', 0.6842045783996582),
 ('badminton', 0.6706023216247559),
 ('basketball', 0.6559170484542847),
 ('soccer', 0.616338312625885),
 ('beach_volleyball', 0.6127015352249146),
 ('Steffi_Graff', 0.6037653088569641),
 ('André_Agassi', 0.5987064242362976),
 ('golf', 0.5907120108604431),
 ('TENNIS', 0.5897902250289917)]

In [10]:
wv.most_similar('Tunisia')

[('Tunisian', 0.7654338479042053),
 ('Egypt', 0.7524809241294861),
 ('Algeria', 0.6980152726173401),
 ('Morocco', 0.6671290397644043),
 ('Tunisians', 0.6517289876937866),
 ('Tunis', 0.6499572992324829),
 ('Tunisa', 0.6263828277587891),
 ('Yemen', 0.6162666082382202),
 ('Mauritania', 0.6132903695106506),
 ('Burkina_Faso', 0.6121270060539246)]

In [13]:
glove.most_similar('engineer')

[('technician', 0.8341956734657288),
 ('engineering', 0.8169041872024536),
 ('architect', 0.8005006313323975),
 ('administrator', 0.7952088117599487),
 ('specialist', 0.7943467497825623),
 ('consultant', 0.7862327694892883),
 ('developer', 0.7659751176834106),
 ('analyst', 0.760871410369873),
 ('supervisor', 0.7608408331871033),
 ('assistant', 0.7426972389221191)]

In [14]:
glove.most_similar('tennis')

[('soccer', 0.8014198541641235),
 ('rugby', 0.7587509155273438),
 ('volleyball', 0.751390278339386),
 ('hockey', 0.7435187697410583),
 ('golf', 0.7418977618217468),
 ('wimbledon', 0.730772852897644),
 ('handball', 0.7275217175483704),
 ('badminton', 0.7260984778404236),
 ('sports', 0.7228738069534302),
 ('sport', 0.7226900458335876)]

In [16]:
glove.most_similar('padel')

[('pádel', 0.7668576240539551),
 ('partidito', 0.7094596028327942),
 ('partidillo', 0.6582919955253601),
 ('billar', 0.6578191518783569),
 ('futbolin', 0.6517313122749329),
 ('voley', 0.6515988111495972),
 ('entreno', 0.6423068642616272),
 ('basquet', 0.6400604248046875),
 ('baloncesto', 0.6377711296081543),
 ('fronton', 0.6376537680625916)]
