# Project: Create a Word2Vec Model

### Step 1: Import libraries

In [1]:
import os
import nltk
from nltk.corpus import stopwords

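### Step 2: Download stopwords and punkt
- Execute the following cell
    - **stopwords** provides the stop word lists; **punkt** provides the models used by **nltk.sent_tokenize**.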

In [2]:
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /home/adel/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/adel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Step 3: Read content and split into sentences
- Initialize an empty list called **all_sentences**
- For each filename in **'files/holmes'**:
    - HINT: Use **os.listdir(...)** ([docs](https://docs.python.org/3/library/os.html#os.listdir))
    - Open the file and read the content.
    - Convert the content to lowercase with **lower()**.
    - Apply **nltk.sent_tokenize** to the lowercased content and add the resulting sentences to **all_sentences**.

In [3]:
all_sentences = []

for filename in os.listdir('files/holmes'):
    with open(f'files/holmes/{filename}') as f:
        content = f.read()
        all_sentences += nltk.sent_tokenize(content.lower())

### Step 4: Tokenize each sentence
- Apply **nltk.word_tokenize** to each sentence in **all_sentences** and assign the resulting list of token lists to **all_words**
    - HINT: Use list comprehension

In [4]:
all_words = [nltk.word_tokenize(sent) for sent in all_sentences]

In [5]:
all_words[0][:10]

['the', '``', 'gloria', 'scott', "''", '``', 'i', 'have', 'some', 'papers']

### Step 5: Remove all stop words
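- Use **stopwords.words('english')** to remove the stop words from every sentence in **all_words**
    - HINT: iterate over the indices of **all_words**; for each index use list comprehension
    - HINT: convert the stop word list to a set once, instead of calling **stopwords.words('english')** for every word

In [6]:
stop_words = set(stopwords.words('english'))  # build the set once; membership tests are fast

for i in range(len(all_words)):
    all_words[i] = [w for w in all_words[i] if w not in stop_words]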

### Step 6: Remove special characters
- Iterate over the items in **all_words** and keep only the tokens that consist entirely of letters (this drops punctuation, numbers, and tokens such as **``**)
    - HINT: Use **isalpha()** ([doc](https://docs.python.org/3/library/stdtypes.html#str.isalpha))

In [7]:
for i in range(len(all_words)):
    all_words[i] = [w for w in all_words[i] if w.isalpha()]
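
Steps 5 and 6 each make one pass over **all_words**; the two filters could also be applied in a single pass. The following is a minimal sketch under assumptions: `toy_stop_words` is a small hand-written stand-in for `stopwords.words('english')`, and `clean_sentence` is a hypothetical helper name that does not appear in the notebook.

```python
# Hypothetical stand-in for set(stopwords.words('english')); the real list is much longer.
toy_stop_words = {'the', 'i', 'have', 'some'}

def clean_sentence(tokens, stop_words):
    """Keep tokens that are purely alphabetic and not stop words."""
    return [w for w in tokens if w.isalpha() and w not in stop_words]

# Tokens copied from the output of all_words[0][:10] in Step 4.
sample = ['the', '``', 'gloria', 'scott', "''", '``', 'i', 'have', 'some', 'papers']
cleaned = clean_sentence(sample, toy_stop_words)
print(cleaned)  # ['gloria', 'scott', 'papers']
```

Applied to the notebook data, this would look like `all_words = [clean_sentence(sent, set(stopwords.words('english'))) for sent in all_words]`, ideally with the set built once beforehand.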

### Step 7: Install gensim and python-Levenshtein
- Run the following cells

In [8]:
!pip install gensim



In [9]:
!pip install python-Levenshtein



### Step 8: Import Word2Vec from gensim
- Run the following cell

In [10]:
from gensim.models import Word2Vec

### Step 9: Create a model
- Use **Word2Vec** on **all_words**
    - Use **min_count=2** to ignore all words whose total frequency is lower than 2.

In [11]:
model = Word2Vec(all_words, min_count=2)
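
To see what **min_count=2** does to the vocabulary, the sketch below counts tokens across a few toy sentences and keeps only the words that occur at least twice. It mimics the effect of the parameter with `collections.Counter`; it is not gensim's implementation, and `toy_sentences` is made-up data.

```python
from collections import Counter

# Made-up tokenized sentences, in the same list-of-lists shape as all_words.
toy_sentences = [
    ['holmes', 'watson', 'pipe'],
    ['holmes', 'violin'],
    ['watson', 'holmes'],
]

min_count = 2
counts = Counter(w for sent in toy_sentences for w in sent)

# Words with a total frequency below min_count are dropped from the vocabulary.
vocabulary = sorted(w for w, c in counts.items() if c >= min_count)
print(vocabulary)  # ['holmes', 'watson']
```

Words removed this way get no vector, so looking them up in **model.wv** raises a `KeyError`.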

### Step 10: Find distances
- Try to run **model.wv.distance('holmes', 'watson')**
- Try to run **model.wv.distance('holmes', 'water')**

In [12]:
model.wv.distance('holmes', 'watson')

0.0005608201026916504

In [13]:
model.wv.distance('holmes', 'water')

0.0012046098709106445
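
**model.wv.distance** returns the cosine distance between two word vectors, that is, 1 minus their cosine similarity, so values close to 0 mean the vectors point in nearly the same direction. A minimal pure-Python sketch of that computation follows, using made-up 2-dimensional vectors rather than vectors from the trained model; `cosine_distance` is a hypothetical helper name.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

# Made-up vectors for illustration only.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])  # parallel: close to 0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])      # perpendicular: 1
```

Both distances printed by the notebook are close to 0, with 'watson' slightly closer to 'holmes' than 'water' is.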

### Step 11: Find closest words
- Get all the words
    - HINT: **words = model.wv.index_to_key**
- Implement a function **closets_words(word)**
    - HINT: **distances = {w: model.wv.distance(word, w) for w in words}**
    - HINT: **sorted(distances, key=lambda w: distances[w])[:15]**

In [14]:
words = model.wv.index_to_key

def closets_words(word):
    distances = {w: model.wv.distance(word, w) for w in words}
    return sorted(distances, key=lambda w: distances[w])[:15]

In [15]:
closets_words('holmes')

['holmes',
 'friend',
 'hand',
 'made',
 'without',
 'eyes',
 'turned',
 'first',
 'colonel',
 'yet',
 'must',
 'quite',
 'come',
 'little',
 'words']
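
gensim's `KeyedVectors` also provides **most_similar**, which ranks words by cosine similarity and leaves the query word itself out of the result. A hedged sketch of an alternative to **closets_words** built on it follows; `closest_words_builtin` is a hypothetical name, and the function takes the trained model as an argument so that it does not depend on a global variable.

```python
def closest_words_builtin(model, word, topn=15):
    """Return the topn words most similar to `word`, without the scores."""
    # most_similar returns a list of (word, similarity) pairs, most similar first.
    return [w for w, _ in model.wv.most_similar(word, topn=topn)]
```

Called as `closest_words_builtin(model, 'holmes')`, it should give a ranking comparable to the list above without the leading 'holmes', since sorting by ascending cosine distance and by descending cosine similarity order the words the same way.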