### Importing Libraries

In [1]:
import os
import sys
import numpy as np
import pandas as pd

### Adding `utils` directory to `sys.path`

In [2]:
sys.path.append(os.path.abspath("../utils"))

### Reading Cleaned Data

In [3]:
# Importing load_csv function from read_data module
from read_data import load_csv

In [4]:
# Loading cleaned data
cleaned_df = load_csv('clean_data', 'cleaned_data.csv')
cleaned_df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha..."
2,206647,Spectre,a cryptic message from bond’s past sends him o...
3,49026,The Dark Knight Rises,following the death of district attorney harve...
4,49529,John Carter,"john carter is a war-weary, former military ca..."


### Text Vectorization

#### Vectorization
- Vectorization is the process of converting text into numbers, specifically into vectors, so that a computer can work with it.
- Computers don't understand text directly; they operate on numbers.
- So, before we feed text to a machine learning model, we need to turn the words into numerical form.
- For example, imagine training a model to detect whether a message is positive or negative.
- If you just give it the sentence `"I love this movie!"`, the model doesn't know what "love" or "movie" means.
- You need to translate those words into numbers first.

<hr>

Let's suppose we have three sentences which we are going to convert into numbers.
- `I love NLP`
- `I love deep learning`
- `NLP is fun`

#### Step 1 : Build a Vocabulary (Unique Words)
- From all 3 sentences, list every unique word :

| Word         |
|--------------|
| I            |
| love         |
| NLP          |
| deep         |
| learning     |
| is           |
| fun          |

- So the vocabulary size is 7.

#### Step 2 : Bag of Words (BoW)
- Now we represent each sentence as a vector of word counts using the vocabulary above.

| Sentence                 | I | love | NLP | deep | learning | is | fun |
|--------------------------|---|------|-----|------|----------|----|-----|
| I love NLP | 1 | 1 | 1   | 0    | 0        | 0  | 0   |
| I love deep learning | 1 | 1    | 0   | 1    | 1        | 0  | 0   |
| NLP is fun | 0 | 0    | 1   | 0    | 0        | 1  | 1   |

- Each row = one sentence.
- Each column = one word from the vocabulary.
- Each cell = how many times that word appeared in that sentence.
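
The two steps above can be sketched in plain Python. This is only a minimal illustration, not the notebook's actual pipeline (which uses scikit-learn's `CountVectorizer` further down). The helper names `build_vocabulary` and `bag_of_words` are made up for this example, and tokens are split on whitespace without lowercasing so that the columns match the table.

```python
from collections import Counter

sentences = ["I love NLP", "I love deep learning", "NLP is fun"]

def build_vocabulary(docs):
    """Collect unique words in order of first appearance (Step 1)."""
    vocab = []
    for doc in docs:
        for word in doc.split():
            if word not in vocab:
                vocab.append(word)
    return vocab

def bag_of_words(doc, vocab):
    """Count how often each vocabulary word occurs in one document (Step 2)."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vocabulary = build_vocabulary(sentences)
bow_matrix = [bag_of_words(s, vocabulary) for s in sentences]

print(vocabulary)
for sentence, row in zip(sentences, bow_matrix):
    print(f"{sentence!r}: {row}")
```

Note that `CountVectorizer` lowercases text and ignores single-character tokens such as `I` by default, so its output for these sentences would differ slightly from the table.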

<hr>

#### Stop Words
- Stop words are the common, frequently occurring words in a language that don't carry much meaning on their own.
- They are often removed from text because :
    - They don't help distinguish between documents/sentences.
    - They add noise and increase vector size in models like Bag of Words (BoW).

#### Example
> `the`, `is`, `in`, `on`, `and`, `a`, `an`, `to`, `of`, `with`, `for`, `from`, `that`, `this`, `it`, etc.
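
As a rough sketch of what stop word removal does, the snippet below filters a sentence against a small hand-picked set built from the examples above. The set and the helper name `remove_stop_words` are assumptions for illustration; the notebook itself relies on scikit-learn's built-in English list via `stop_words='english'`.

```python
# A tiny stop word set taken from the examples above (illustrative only)
STOP_WORDS = {"the", "is", "in", "on", "and", "a", "an", "to", "of",
              "with", "for", "from", "that", "this", "it"}

def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

filtered = remove_stop_words("NLP is fun")
print(filtered)
```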

<hr>

#### What is Stemming?
- Stemming is the process of reducing a word to its base or root form (called the "stem"). The stem does not have to be a dictionary word, as `studi` in the table below shows.

| Word | Stem |
|:--:|:---:|
| running | run |
| runs | run |
| runner | runner |
| studied | studi |
| studies | studi |

#### Problem with BoW without Stemming
- BoW treats every unique word as different, even if they are grammatically related. Like :
    - `"He is running in the park."`
    - `"She runs every day."`
- Without stemming, `running` ≠ `runs`.
- So BoW thinks these are completely different words, even though they refer to the same root concept : "`run`".

#### Benefit of Stemming
- Stemming reduces related words to their common root form, like :
    - `running`, `runs` → `run` (an irregular form such as `ran` is typically left unchanged by suffix-stripping stemmers like Porter)
    - `studies`, `studying`, `studied` → `studi`
- Now BoW will group them together!

#### BoW Matrix (Without Stemming)
- Vocabulary : `["he", "is", "running", "in", "the", "park", "she", "runs", "every", "day"]`

| Sentence          | he | is | running | in | the | park | she | runs | every | day |
|-------------------|----|----|---------|----|-----|------|-----|------|-------|-----|
| Sentence 1        | 1  | 1  | 1       | 1  | 1   | 1    | 0   | 0    | 0     | 0   |
| Sentence 2        | 0  | 0  | 0       | 0  | 0   | 0    | 1   | 1    | 1     | 1   |

- The two vectors share no non-zero column, even though both sentences are about running.

#### BoW Matrix (With Stemming)
- New Vocabulary : `["he", "is", "run", "in", "the", "park", "she", "every", "day"]`

| Sentence          | he | is | run | in | the | park | she | every | day |
|-------------------|----|----|-----|----|-----|------|-----|-------|-----|
| Sentence 1        | 1  | 1  | 1   | 1  | 1   | 1    | 0   | 0     | 0   |
| Sentence 2        | 0  | 0  | 1   | 0  | 0   | 0    | 1   | 1     | 1   |

- Now "`running`" and "`runs`" are treated the same as "`run`".
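
The effect can be reproduced with a small sketch. The `toy_stem` function below is a deliberately crude suffix stripper written for this example; it is not the Porter algorithm and not the `stem` helper from `text_stemming` used in the next cell. It only shows how mapping `running` and `runs` to `run` gives the two sentences a shared vocabulary entry.

```python
def toy_stem(word):
    """Very rough suffix stripping for illustration; NOT a real stemmer."""
    for suffix in ("ing", "s"):
        # Only strip when at least three characters remain (keeps "is" intact)
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Undo consonant doubling left behind by "-ing" (e.g. "runn" -> "run")
    if len(word) >= 3 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

def shared_words(a, b, stem=None):
    """Return the set of tokens two sentences have in common."""
    tokens_a, tokens_b = a.split(), b.split()
    if stem is not None:
        tokens_a = [stem(t) for t in tokens_a]
        tokens_b = [stem(t) for t in tokens_b]
    return set(tokens_a) & set(tokens_b)

sentence_1 = "he is running in the park"
sentence_2 = "she runs every day"

shared_before = shared_words(sentence_1, sentence_2)
shared_after = shared_words(sentence_1, sentence_2, stem=toy_stem)
print("Without stemming:", shared_before)
print("With stemming:", shared_after)
```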

In [5]:
# Importing stem function from text_stemming module
from text_stemming import stem

In [6]:
# Applying stemming on tags column
cleaned_df['tags'] = cleaned_df['tags'].apply(stem)
cleaned_df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a parapleg marin is dispa..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believ to be dead, ha c..."
2,206647,Spectre,a cryptic messag from bond’ past send him on a...
3,49026,The Dark Knight Rises,follow the death of district attorney harvey d...
4,49529,John Carter,"john carter is a war-weary, former militari ca..."


### Applying Bag of Words on `tags` Column

In [7]:
# Importing CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=5000, stop_words='english')

In [8]:
# Applying BOW on tags column
vectors = cv.fit_transform(cleaned_df['tags']).toarray()

In [9]:
# Shape of vectors array
vectors.shape
# Each of the 4381 rows is the BoW vector for one movie's tags
# Each of the 5000 columns is one of the 5000 most frequent words (English stop words excluded)

(4381, 5000)

In [10]:
# List of unique words (Top 5000 most common words)
cv.get_feature_names_out()

array(['000', '007', '10', ..., 'zone', 'zoo', 'zooeydeschanel'],
      dtype=object)

In [11]:
# Built-in English stop word list used by the vectorizer (not derived from the tags column)
print(cv.get_stop_words())

frozenset({'front', 'here', 'other', 'behind', 'forty', 'thereupon', 'us', 'will', 'have', 'interest', 'ours', 'them', 'thereby', 'became', 'sincere', 'con', 'even', 'every', 'do', 'whereby', 'yourselves', 'only', 'next', 'two', 'de', 'back', 'wherein', 'herself', 'elsewhere', 'take', 'thereafter', 'be', 'from', 'can', 'too', 'done', 'while', 'this', 'about', 'then', 'before', 'six', 'un', 'how', 'had', 'least', 'keep', 'enough', 'couldnt', 'i', 'with', 'to', 'your', 'along', 'together', 'wherever', 'or', 'fifty', 'sixty', 'a', 'each', 'because', 'etc', 'and', 'becomes', 'thence', 'both', 'herein', 'something', 'those', 'empty', 'formerly', 'ourselves', 'still', 'some', 'they', 'she', 'noone', 'cry', 'thus', 'amount', 'former', 'hasnt', 'anyway', 'under', 'made', 'whereupon', 'same', 'himself', 'please', 'are', 'afterwards', 'nowhere', 'amoungst', 'go', 'anyone', 'not', 'we', 'latter', 'whereafter', 'much', 'however', 'someone', 'third', 'becoming', 'eg', 'across', 'describe', 'toward'

### Calculating Cosine Similarity of Vectors

#### Cosine Similarity
- In the context of Bag of Words (BoW), cosine similarity is a metric used to measure the similarity between two text vectors.
- Since BoW transforms text into vectors, cosine similarity lets you compare how similar two texts are, based on their word distributions.

<hr>

#### Why Cosine Similarity?
- Cosine similarity is often preferred because it focuses on the angle between vectors rather than their magnitude.
- This means it's more concerned with how the words are distributed in each document rather than how long or short the documents are.
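- It ranges between -1 and 1 in general; for BoW count vectors, which have no negative entries, it stays between 0 and 1 :
    - 1 means the vectors point in the same direction (the documents have the same relative word counts).
    - 0 means the vectors are orthogonal (the documents share no words).
    - -1 would indicate exactly opposite vectors, which cannot happen with non-negative word counts.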
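
These properties can be checked with a few lines of plain Python. The function name `cosine_similarity_sketch` and the three toy count vectors are assumptions made for this illustration; the notebook itself uses scikit-learn's `cosine_similarity` later on.

```python
import math

def cosine_similarity_sketch(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

short_doc = [1, 1, 0]      # counts for three vocabulary words
long_doc = [3, 3, 0]       # same word proportions, three times as long
unrelated_doc = [0, 0, 2]  # shares no words with the other two

same_direction = cosine_similarity_sketch(short_doc, long_doc)
no_overlap = cosine_similarity_sketch(short_doc, unrelated_doc)
print(same_direction, no_overlap)
```

The first score should come out at (approximately) 1 despite the difference in length, and the second at 0. The sketch does not guard against all-zero vectors, which would cause a division by zero.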

<hr>

#### Formula for Cosine Similarity
Given two vectors, $A$ and $B$, the cosine similarity is calculated as :
$$\text{cosine similarity} = \frac{A \cdot B}{|A| \cdot |B|}$$
where:
- $A \cdot B$ is the dot product of the two vectors.
- $|A|$ and $|B|$ are the Euclidean norms (magnitudes) of the vectors.

<hr>

#### Example
- Let's say we have two documents and after applying Bag of Words, we get the following vectors (with word counts) :

| Word  | Doc 1 | Doc 2 |
|-------|-------|-------|
| cat   | 2     | 1     |
| dog   | 1     | 2     |
| fish  | 0     | 1     |

- Now, let's compute the cosine similarity between `Doc 1` and `Doc 2` :

#### 1. Dot product of `Doc 1` and `Doc 2`:
$$(2 \times 1) + (1 \times 2) + (0 \times 1) = 2 + 2 + 0 = 4$$

#### 2. Magnitude of `Doc 1`:
$$|\text{Doc 1}| = \sqrt{2^2 + 1^2 + 0^2} = \sqrt{4 + 1 + 0} = \sqrt{5}$$

#### 3. Magnitude of `Doc 2`:
$$|\text{Doc 2}| = \sqrt{1^2 + 2^2 + 1^2} = \sqrt{1 + 4 + 1} = \sqrt{6}$$

#### 4. Cosine Similarity:
$$\text{cosine similarity} = \frac{4}{\sqrt{5} \times \sqrt{6}} = \frac{4}{\sqrt{30}} \approx 0.7303$$
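
The four steps above can be verified numerically. The snippet below is a small check using the three-component vectors from the calculation, $(2, 1, 0)$ for `Doc 1` and $(1, 2, 1)$ for `Doc 2`; the variable names are chosen for this example only.

```python
import math

doc_1 = [2, 1, 0]
doc_2 = [1, 2, 1]

dot_product = sum(x * y for x, y in zip(doc_1, doc_2))        # step 1
magnitude_1 = math.sqrt(sum(x * x for x in doc_1))            # step 2
magnitude_2 = math.sqrt(sum(x * x for x in doc_2))            # step 3
similarity_score = dot_product / (magnitude_1 * magnitude_2)  # step 4

print(dot_product, round(similarity_score, 4))
```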

In [12]:
# Importing cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(vectors)

In [13]:
# Shape of similarity array
# Entry [i, j] is the similarity between the vectors of movie i and movie j
similarity.shape

(4381, 4381)

In [14]:
# Similarity of 1st movie with every other movie
similarity[0]

array([1.        , 0.08238526, 0.0836242 , ..., 0.01887128, 0.04484485,
       0.        ])
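
One way such a row might be used is to rank the other movies by their score. The sketch below is an assumption about a possible next step, not code from this notebook; `most_similar` is a hypothetical helper and `toy_row` contains made-up scores rather than values from the real `similarity` matrix.

```python
def most_similar(similarity_row, own_index, top_n=3):
    """Return (index, score) pairs for the top_n highest scores, skipping the item itself."""
    scored = [(i, s) for i, s in enumerate(similarity_row) if i != own_index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Made-up scores for illustration; position 0 is the movie compared with itself
toy_row = [1.0, 0.08, 0.31, 0.0, 0.27]
top_matches = most_similar(toy_row, own_index=0, top_n=2)
print(top_matches)
```

The movie's own entry is skipped because its similarity with itself is always 1 and would otherwise rank first.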

### Exporting `similarity` as `pickle` file

In [15]:
# Importing export_as_pickle function from serialization module
from serialization import export_as_pickle

In [16]:
# Exporting similarity as pickle file 
export_as_pickle(similarity, 'models', 'similarity.pkl')

Object saved as 'similarity.pkl' successfully.
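
The `serialization` module is not shown in this notebook. As a hedged sketch, a helper with the same call shape could be written with the standard `pickle` module as below. The names `export_as_pickle_sketch` and `load_pickle_sketch` are hypothetical, and how the real helper resolves the `'models'` directory is unknown, so the sketch simply treats it as a path.

```python
import os
import pickle
import tempfile

def export_as_pickle_sketch(obj, directory, filename):
    """Serialize obj to directory/filename, creating the directory if needed."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with open(path, "wb") as file:
        pickle.dump(obj, file)
    print(f"Object saved as '{filename}' successfully.")
    return path

def load_pickle_sketch(path):
    """Read back an object written by export_as_pickle_sketch."""
    with open(path, "rb") as file:
        return pickle.load(file)

# Round trip with a tiny stand-in matrix inside a temporary directory
toy_matrix = [[1.0, 0.5], [0.5, 1.0]]
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = export_as_pickle_sketch(toy_matrix, tmp_dir, "similarity.pkl")
    restored = load_pickle_sketch(saved_path)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.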
