# Vectorization
Vectorization is the process of converting textual or categorical data into numerical vectors (arrays of numbers) that machine learning models or similarity algorithms can process. In recommendation systems — especially content-based filtering — we need to measure the similarity between items based on their features. <br><br>
With vectors, we can compute how similar two items are using techniques like:
   - Cosine Similarity
   - Euclidean Distance
   - Dot Product
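
As a minimal sketch of these three measures, the snippet below compares two small vectors using only the Python standard library. The vectors `a` and `b` and the helper names are made up for illustration and do not come from the movie dataset.

```python
import math

def dot(u, v):
    """Dot product: sum of element-wise products."""
    return sum(x * y for x, y in zip(u, v))

def euclidean_distance(u, v):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def cosine_sim(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)

a = [1, 2, 0]
b = [2, 4, 0]  # same direction as a, twice the magnitude

print(dot(a, b))                 # 10
print(euclidean_distance(a, b))  # sqrt(5), about 2.236
print(cosine_sim(a, b))          # approximately 1.0
```

Because `b` points in the same direction as `a`, the cosine similarity is (up to floating-point rounding) 1 even though the Euclidean distance is not zero. This is the sense in which cosine similarity ignores magnitude.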

## Libraries Description
| <div align="left">Library / Function</div> | <div align="left">Purpose</div> |
| --- | --- |
| <p align="left">**`NumPy`**</p> | <p align="left">Used for working with arrays and matrices and for performing mathematical operations.</p> |
| <p align="left">**`Pandas`**</p> | <p align="left">Used for loading, cleaning, and transforming tabular data.</p> |
| <p align="left">**`CountVectorizer`**</p> | <p align="left">scikit-learn class that converts text documents into a matrix of token counts, used here to vectorize text for similarity comparison.</p> |
| <p align="left">**`cosine_similarity`**</p> | <p align="left">scikit-learn function that computes **cosine similarity** between vectors, measuring how similar two items are based on their vectorized features.</p> |
| <p align="left">**`PorterStemmer`**</p> | <p align="left">NLTK implementation of the **Porter stemmer**, which reduces words to their stem so that text is normalized for more effective comparison.</p> |


In [1]:
# import libraries
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.stem.porter import PorterStemmer

## Load Preprocessed Dataset

In [2]:
# reading the updated data
df = pd.read_csv("files/movie_cleaned.csv")

In [3]:
# checking dataset
df.columns

Index(['ID', 'Title', 'Overview', 'Release Date', 'Tags'], dtype='object')

In [4]:
df.head()

| | ID | Title | Overview | Release Date | Tags |
| --- | --- | --- | --- | --- | --- |
| 0 | 19995 | Avatar | In the 22nd century, a paraplegic Marine is di... | 2009-12-10 | in the 22nd century, a paraplegic marine is di... |
| 1 | 285 | Pirates of the Caribbean: At World's End | Captain Barbossa, long believed to be dead, ha... | 2007-05-19 | captain barbossa, long believed to be dead, ha... |
| 2 | 206647 | Spectre | A cryptic message from Bond’s past sends him o... | 2015-10-26 | a cryptic message from bond’s past sends him o... |
| 3 | 49026 | The Dark Knight Rises | Following the death of District Attorney Harve... | 2012-07-16 | following the death of district attorney harve... |
| 4 | 49529 | John Carter | John Carter is a war-weary, former military ca... | 2012-03-07 | john carter is a war-weary, former military ca... |


## Vectorizing Contents
**CountVectorizer()** is used to convert a collection of text documents into a matrix of token (word) counts.<br><br>
**Argument Explanation:**
- **`max_features=5000`**: Limits the vocabulary size to the top 5,000 most frequent words across all documents. Helps reduce dimensionality and ignore rarely used words.
- **`stop_words='english'`**: Removes common English stopwords (like "the", "is", "and") which do not contribute meaningful information for comparison or classification.
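
To make the two arguments concrete, here is a simplified, standard-library-only sketch of what a count vectorizer does. It is not scikit-learn's implementation: the function `count_vectorize`, the tiny `STOP_WORDS` set, the alphabetical tie-breaking, and the sample documents are all assumptions made for illustration, and tokenization here is a plain lowercase whitespace split.

```python
from collections import Counter

# Tiny illustrative stop-word set (scikit-learn's English list is much longer)
STOP_WORDS = {"the", "is", "and", "a", "on"}

def count_vectorize(docs, max_features=None, stop_words=frozenset()):
    """Return (vocabulary, count matrix) for a list of text documents."""
    # Lowercase, split on whitespace, and drop stop words
    tokenized = [
        [w for w in doc.lower().split() if w not in stop_words]
        for doc in docs
    ]
    # Total frequency of every remaining word across all documents
    totals = Counter(w for tokens in tokenized for w in tokens)
    # Keep the most frequent words; ties are broken alphabetically here
    ranked = sorted(totals, key=lambda w: (-totals[w], w))
    vocab = sorted(ranked[:max_features])
    # One row per document, one column per vocabulary word
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[w] for w in vocab])
    return vocab, matrix

docs = [
    "the hero saves the city",
    "the hero and the villain fight",
    "a city on fire",
]
vocab, matrix = count_vectorize(docs, max_features=3, stop_words=STOP_WORDS)
print(vocab)   # ['city', 'fight', 'hero']
print(matrix)  # [[1, 0, 1], [0, 1, 1], [1, 0, 0]]
```

With `max_features=3`, only the three most frequent non-stop words survive as columns, so rare words such as "fire" no longer contribute to any vector.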

In [5]:
# initializing vectorizer
cv = CountVectorizer(max_features=5000, stop_words='english')

In [6]:
# vectorizing the tags
vectors = cv.fit_transform(df['Tags']).toarray()

In [7]:
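# checking words
# (get_feature_names() was removed in scikit-learn 1.2; use cv.get_feature_names_out() on newer versions)
cv.get_feature_names()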

['000',
 '007',
 '10',
 '100',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '17',
 '17th',
 '18',
 '18th',
 '19',
 '1930s',
 '1940s',
 '1950s',
 '1960s',
 '1970s',
 '1980',
 '1980s',
 '1985',
 '1990s',
 '19th',
 '19thcentury',
 '20',
 '200',
 '2009',
 '20th',
 '23',
 '24',
 '25',
 '30',
 '300',
 '3d',
 '40',
 '50',
 '500',
 '60',
 '60s',
 '70',
 '70s',
 'aaron',
 'aaroneckhart',
 'abandoned',
 'abducted',
 'abigailbreslin',
 'abilities',
 'ability',
 'able',
 'aboard',
 'abuse',
 'abusive',
 'academy',
 'accept',
 'accepted',
 'accepts',
 'access',
 'accident',
 'accidental',
 'accidentally',
 'accompanied',
 'accomplish',
 'account',
 'accountant',
 'accused',
 'ace',
 'achieve',
 'act',
 'acting',
 'action',
 'actionhero',
 'actions',
 'activist',
 'activities',
 'activity',
 'actor',
 'actors',
 'actress',
 'acts',
 'actual',
 'actually',
 'adam',
 'adams',
 'adamsandler',
 'adamshankman',
 'adaptation',
 'adapted',
 'addict',
 'addicted',
 'addiction',
 'adolescence',
 'adopt',
 'ado

## Stemming
Stemming reduces words to a common stem so that different forms of a word are treated as the same token during analysis. For example, the Porter stemmer maps "running" and "runs" to "run"; irregular forms such as "ran" are left unchanged, because handling them requires lemmatization. This reduces redundancy and variation in the vocabulary.<br><br>
It ensures that related terms are recognized as equivalent, improving:
   - Similarity calculations
   - Search results
   - Recommendation quality

In [8]:
# create stemming object
ps = PorterStemmer()
ps

<PorterStemmer>

In [9]:
# helper function to apply stemming to every word in a text
def stem_word(text):
    return " ".join(ps.stem(word) for word in text.split())

In [10]:
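# stemming the text inside the tags column
# note: `vectors` (In [6]) was built from the unstemmed tags; re-run cv.fit_transform
# on the stemmed column if the vectors should reflect stemming
df['Tags'] = df['Tags'].apply(stem_word)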

In [11]:
# checking stemming
df['Tags'][0]

'in the 22nd century, a parapleg marin is dispatch to the moon pandora on a uniqu mission, but becom torn between follow order and protect an alien civilization. cultureclash futur spacewar spacecoloni societi spacetravel futurist romanc space alien tribe alienplanet cgi marin soldier battl loveaffair antiwar powerrel mindandsoul 3d action adventur fantasi sciencefict samworthington zoesaldana sigourneyweav jamescameron'

## cosine_similarity() - Get the similarity between one movie and another
Cosine similarity measures how similar two vectors are by calculating the cosine of the angle between them. `cosine_similarity()` computes this for every pair of rows and returns the results as a similarity matrix. For non-negative count vectors such as ours, the score ranges from 0 (no terms in common) to 1 (same direction). This makes it well suited to text data, where the direction of a word-count vector matters more than its magnitude. The similarity matrix lets us recommend the movies that are most similar to a given movie.
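
For reference, the standard definition for two vectors $A$ and $B$ with $n$ components is given below, followed by a small worked example. The example vectors are made up for illustration and are not taken from the dataset.

$$
\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
= \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
$$

With $A = (1, 1, 0)$ and $B = (1, 0, 1)$:

$$
\cos(\theta) = \frac{1 \cdot 1 + 1 \cdot 0 + 0 \cdot 1}{\sqrt{2} \, \sqrt{2}} = \frac{1}{2} = 0.5
$$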

In [12]:
# computes the cosine similarity between all pairs of vectorized items 
similar = cosine_similarity(vectors)

In [13]:
# similarity matrix
similar[0]

array([1.        , 0.08980265, 0.05986843, ..., 0.0248452 , 0.02777778,
       0.        ])

# Main Functionality of Recommendation System
To recommend items similar to a given one, we need to rank the other items based on how similar they are to it — and that means sorting.<br><br>
**The Logic Behind Sorting:**
- Each row in the `similar` matrix contains the similarity scores of one item with every other item.
- To find the most similar items to a given movie (say, index i), we must:
   1. Look at **`similar[i]`** (i.e., all similarity scores for movie i)
   2. Sort these scores in descending order
   3. Pick the top N highest scores (excluding the movie itself).
<br><br>

**Purpose of Sorting:**
- Ensures we recommend items that are most relevant/similar to the target.
- Prevents returning random or poorly matched results.
- Allows us to limit recommendations to top N items (e.g., top 5).
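
The logic above can be sketched with the standard library alone. The function name `top_n_similar` and the sample scores are hypothetical, with `scores` standing in for one row `similar[i]`. The sketch excludes the target by its index instead of assuming it sorts into first place, which is a design choice of this example and not something the notebook does.

```python
import heapq

def top_n_similar(scores, target_idx, n=5):
    """Return the n highest (index, score) pairs, skipping the target itself."""
    # Pair every score with its index and drop the target's own entry
    candidates = (
        (idx, score) for idx, score in enumerate(scores) if idx != target_idx
    )
    # nlargest returns the pairs in descending order of score
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])

# Hypothetical similarity row for the item at index 1
scores = [0.09, 1.0, 0.41, 0.02, 0.27]
print(top_n_similar(scores, target_idx=1, n=3))
# [(2, 0.41), (4, 0.27), (0, 0.09)]
```

`heapq.nlargest` avoids sorting the whole row when only a few results are needed; a full `sorted(...)` followed by slicing gives the same result.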

## Sorting Technique
In a recommendation system, we use cosine similarity to find how closely related other items are to a target item.<br><br>
If we want to recommend the most similar movies, we need to know:
   1. The similarity score, and
   2. Which movie (index) it corresponds to

So in this case, the actual index number is very important. But if we sort the similarity scores directly, the original index positions are lost. That’s where **`enumerate()`** comes in: it pairs each score with its index, so when we sort the list of scores, the movie index stays attached to the score. This lets us later fetch the actual movie details based on the index.

In [14]:
# pair each similarity score with its index position
vector_position = enumerate(similar[1])
vector_position

<enumerate at 0x209c9a0ce40>

In [15]:
# similarity of movie 1 to every other movie according to tags
distance_idx = list(vector_position)
distance_idx[:5]

[(0, 0.08980265101338746),
 (1, 1.0000000000000002),
 (2, 0.06451612903225808),
 (3, 0.021166687833365085),
 (4, 0.08216865534971651)]

In [16]:
# sorting by similarity score (descending), not by index
sorted(distance_idx, reverse=True, key=lambda x: x[1])[:5]

[(1, 1.0000000000000002),
 (12, 0.4147806778921701),
 (17, 0.27371875400769585),
 (199, 0.24906774069335896),
 (3572, 0.21552636243212991)]

## Function to get recommended movies

In [17]:
def recommendation(movie_name):
    # getting movie index (first row whose title matches)
    movie_idx = df[df['Title'] == movie_name].index[0]

    # similarity scores between this movie and every other movie
    dst = similar[movie_idx]

    # top 5 suggested movies; position 0 is skipped because it is normally the movie itself
    suggested = sorted(enumerate(dst), reverse=True, key=lambda x: x[1])[1:6]

    print("According To Your Interest, You Can Also Watch The Following: ")
    print()
    for movie in suggested:
        print(df.iloc[movie[0]].Title)

In [18]:
# testing Recommendation
recommendation("Batman Begins")

According To Your Interest, You Can Also Watch The Following: 

The Dark Knight
The Dark Knight Rises
Batman
Batman
Batman & Robin
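
A possible variation, sketched below under stated assumptions, returns the titles as a list instead of printing them and handles a title that is not in the dataset, a case in which the `index[0]` lookup would raise an `IndexError`. The name `recommend_titles` and the toy `titles` and `sim` data are hypothetical stand-ins for `df['Title']` and `similar`.

```python
def recommend_titles(movie_name, titles, sim, n=5):
    """Return up to n titles most similar to movie_name, or [] if it is unknown."""
    if movie_name not in titles:
        return []
    # First matching position, mirroring the notebook's index[0] lookup
    idx = titles.index(movie_name)
    # Keep each score paired with its index while sorting in descending order
    ranked = sorted(enumerate(sim[idx]), key=lambda pair: pair[1], reverse=True)
    # Skip the movie itself by index, then keep the first n titles
    return [titles[i] for i, _ in ranked if i != idx][:n]

# Toy data: four titles and a symmetric similarity matrix
titles = ["A", "B", "C", "D"]
sim = [
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.5],
    [0.7, 0.3, 1.0, 0.0],
    [0.1, 0.5, 0.0, 1.0],
]

print(recommend_titles("A", titles, sim, n=2))   # ['C', 'B']
print(recommend_titles("Missing", titles, sim))  # []
```

Returning a list keeps the ranking logic separate from presentation, so the same function could feed a print loop or a web page.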


## Store Similarity Value For Further Operations

In [19]:
import pickle

file = "files/similarities.pkl"

# write the similarity matrix to disk; the with-block closes the file afterwards
with open(file, 'wb') as f:
    pickle.dump(similar, f)
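
To reuse the saved matrix later (for example, in a separate script), it can be loaded back with `pickle.load`. The sketch below round-trips a small stand-in matrix through a temporary file; the data and the temporary path are illustrative and are not the notebook's `files/similarities.pkl`.

```python
import os
import pickle
import tempfile

# Small stand-in for the similarity matrix
matrix = [[1.0, 0.25], [0.25, 1.0]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "similarities.pkl")

    # Save: open in binary write mode
    with open(path, "wb") as f:
        pickle.dump(matrix, f)

    # Load: open in binary read mode
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == matrix)  # True
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.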

> # This dataset is prepared by MD. TUSHAR SHIHAB, Dept. of CSE