# Word vectors using Gensim

## Gensim
Gensim is an open-source Python library for natural language processing (NLP), best known for topic modeling, document similarity, and word embeddings. Here’s a brief overview:

### Key Features
- Topic Modeling: Identifies themes in large text collections using algorithms such as Latent Dirichlet Allocation (LDA).
- Text Pre-processing: Offers utilities for cleaning and preparing text data (e.g., tokenization, stopword removal, stemming).
- Scalability: Streams corpora from disk, so it can handle datasets that do not fit in memory.
- Performance: Runs quickly because the core routines are implemented in optimized, multi-threaded code.

### Applications
- Information Retrieval: Helps find the texts most relevant to a query.
- Sentiment Analysis: Supplies word and document vectors that can serve as features for a sentiment classifier.
- Document Similarity: Measures how similar different texts are.

Gensim is widely used because it handles large text collections efficiently.

In [2]:
!pip install gensim

Collecting gensim
  Using cached gensim-4.3.3-cp311-cp311-win_amd64.whl.metadata (8.2 kB)
Collecting scipy<1.14.0,>=1.7.0 (from gensim)
  Using cached scipy-1.13.1-cp311-cp311-win_amd64.whl.metadata (60 kB)
Downloading gensim-4.3.3-cp311-cp311-win_amd64.whl (24.0 MB)

In [3]:
import gensim.downloader as api
# This is a large model (~1.6 GB): the first call downloads it, and loading takes some time

wv = api.load('word2vec-google-news-300')

In [4]:
# Cosine similarity between the vectors of two words
wv.similarity(w1="great", w2="good")

0.729151

In [5]:
wv.similarity(w1="great", w2="great")

1.0
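
The two results above follow from how `similarity` works: it returns the cosine similarity of the two word vectors, so a word compared with itself scores 1.0. Below is a minimal sketch of that computation in plain Python. The 3-dimensional vectors are made up for illustration (the real model uses 300 dimensions), and `cosine_similarity` is a hypothetical helper, not a Gensim function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, NOT taken from the real model.
great = [0.9, 0.1, 0.3]
good = [0.8, 0.3, 0.2]

sim_self = cosine_similarity(great, great)  # a vector compared with itself
sim_pair = cosine_similarity(great, good)   # two vectors pointing in similar directions

print(round(sim_self, 6), round(sim_pair, 3))
```

Because only the angle between the vectors matters, scaling a vector by a positive constant leaves the score unchanged.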

In [9]:
wv.most_similar("fancy")

[('fancier', 0.6242973208427429),
 ('snazzy', 0.6128054261207581),
 ('fancy_schmancy', 0.5927230715751648),
 ('flashy', 0.5657637715339661),
 ('Fancy', 0.5621656775474548),
 ('highfalutin', 0.5348057150840759),
 ('frou_frou', 0.5235176086425781),
 ('pricey', 0.5195214152336121),
 ('fancy_shmancy', 0.5183699727058411),
 ('swanky', 0.49819421768188477)]

In [8]:
wv.most_similar("iron")

[('treat_iron_deficiency', 0.5767540335655212),
 ('irons', 0.5468516945838928),
 ('zero_valent', 0.5203249454498291),
 ('dogleg_par', 0.5148035287857056),
 ('Iron_sharpens', 0.5139581561088562),
 ('HBI_DRI', 0.5099007487297058),
 ('Shindle_Pa.', 0.5066496729850769),
 ('titanium_oxide_TiO', 0.5005055069923401),
 ('pond_fronting', 0.48712262511253357),
 ('wood', 0.4768836200237274)]

In [11]:
# Analogy query: king - man + woman = ?
wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=5)

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321839332581)]
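
The analogy query above can be read as vector arithmetic: add the `positive` vectors, subtract the `negative` ones, and rank the remaining words by cosine similarity to the result. The sketch below imitates that idea in plain Python. It is only an approximation of what Gensim does internally, the toy vectors are invented for illustration, and `analogy` and `normalize` are hypothetical helpers.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        for i, x in enumerate(normalize(vectors[word])):
            query[i] += x
    for word in negative:
        for i, x in enumerate(normalize(vectors[word])):
            query[i] -= x
    query = normalize(query)
    # The query words themselves are excluded from the ranking.
    exclude = set(positive) | set(negative)
    scores = [
        (word, sum(a * b for a, b in zip(query, normalize(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

# Invented toy vectors with axes [royalty, maleness, femaleness].
toy_vectors = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.1, 0.8],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "prince": [0.8, 0.9, 0.1],
    "apple":  [0.1, 0.1, 0.1],
}

ranked = analogy(toy_vectors, positive=["king", "woman"], negative=["man"], topn=3)
best_word = ranked[0][0]
print(best_word)
```

With these toy vectors, subtracting `man` removes most of the "maleness" component of `king` and adding `woman` supplies "femaleness", so the nearest remaining word is `queen`.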

In [13]:
# Returns the word that does not belong with the others
wv.doesnt_match(["facebook", "cat", "google", "microsoft"])

'cat'
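
One way to understand `doesnt_match` is as an outlier test: average the unit-length vectors of all the words, then return the word whose vector is least similar to that average. The sketch below follows that idea in plain Python. It is a rough approximation rather than Gensim's actual code, the 2-dimensional vectors are invented, and `odd_one_out` and `unit` are hypothetical helpers.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def odd_one_out(vectors, words):
    """Return the word least similar to the mean of all the words' vectors."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    # The lowest dot product with the mean marks the outlier.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Invented toy vectors with axes [tech company, animal].
toy_vectors = {
    "facebook":  [0.90, 0.10],
    "google":    [0.95, 0.05],
    "microsoft": [0.85, 0.10],
    "cat":       [0.10, 0.90],
}

outlier = odd_one_out(toy_vectors, ["facebook", "cat", "google", "microsoft"])
print(outlier)
```

Three of the four toy vectors point along the "tech company" axis, so the mean does too, and `cat` ends up farthest from it.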

### Gensim: GloVe

In [14]:
# 25-dimensional GloVe vectors trained on tweets (a much smaller download than the word2vec model)
glv = api.load("glove-twitter-25")



In [15]:
glv.most_similar("good")

[('too', 0.9648017287254333),
 ('day', 0.9533665180206299),
 ('well', 0.9503170847892761),
 ('nice', 0.9438973665237427),
 ('better', 0.9425962567329407),
 ('fun', 0.9418926239013672),
 ('much', 0.9413353800773621),
 ('this', 0.9387555122375488),
 ('hope', 0.9383506774902344),
 ('great', 0.9378516674041748)]

In [17]:
glv.doesnt_match("facebook cat google microsoft".split())

'cat'