## Theory: Predicting Movie Preferences

### The Hypothesis
My theory is that a person's taste in movies can be predicted. Specifically, if someone liked one movie, it is possible to predict whether they would like another. For example, if someone enjoyed a good action movie, they are likely to enjoy another action movie.

### The Idea
The idea is to build a neural network that takes a comparison of two movies as input and predicts whether the movies are compatible or not.

### Definition of Compatible Movies
In my theory, two movies are considered compatible if:
1. Liking one of them means I would like the other.
2. Hating one of them means I would hate the other.

Conversely, two movies are considered incompatible if:
1. Liking one of them means I would hate the other.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset, TensorDataset
import mysql.connector
from mysql.connector import Error
import numpy as np
from collections import defaultdict
import random
import hashlib
from sklearn.preprocessing import StandardScaler


In [2]:
connection = mysql.connector.connect(
    host="127.0.0.1",
    user="root",
    password="yanovsky",
    database="final_project_db"
)

In [3]:
if connection.is_connected():
    print("Connected to MySQL database")
    cursor = connection.cursor()
    cursor2 = connection.cursor(dictionary=True)

Connected to MySQL database


## Data Collection

### The Dataset
The first step was to collect movies and TV shows to work with. Due to time constraints and the limitations of the TMDb API I used to gather the data, I decided to focus on the most popular movies and TV shows.

### Reasoning
The rationale behind this choice is that popular movies and series are more likely to come to mind when thinking about specific examples. As a result, there is a high probability that the dataset will include movies or series that most people are familiar with.

In [10]:
query=f"SELECT COUNT(*) AS row_count from media_data"
cursor2.execute(query)
print ("The number of media collected: " + str(cursor2.fetchall()[0]["row_count"]))

The number of media collected: 13938


## User Collection

### The Process
The website's user interface exposed the IMDb ID of each media item (movie or TV show). Using a web crawler, I collected a large number of users who had interacted with or provided feedback on these movies or series.


In [9]:
query=f"SELECT COUNT(*) AS row_count from users"
cursor2.execute(query)
print ("The number of Users collected: " + str(cursor2.fetchall()[0]["row_count"]))

The number of Users collected: 108481


## Collecting Rating Lists

### Purpose
To create a system capable of generating realistic insights, it was necessary to collect a significant number of real user rating lists. This data allows us to learn what each person liked or disliked and draw connections based on that information.

### The Process
After gathering data from more than 100,000 users, I collected ratings only for the movies and TV shows stored in my database. While this approach might seem to disregard a substantial number of ratings, the stored media were the most popular ones. As a result, approximately 50% of the ratings were retained and used, which still leaves a large dataset for analysis.

In [11]:
query=f"SELECT COUNT(*) AS row_count from rating"
cursor2.execute(query)
print ("The number of rating collected: " + str(cursor2.fetchall()[0]["row_count"]))

The number of rating collected: 546145


In [15]:
query=f"SELECT COUNT(User_ID) AS user_count, SUM(rating_count) AS total_ratings_sum FROM (SELECT User_ID, COUNT(*) AS rating_count FROM rating GROUP BY User_ID HAVING COUNT(*) >= 5) AS subquery;"
cursor2.execute(query)
x=cursor2.fetchall()[0]
print ("The number of users who have 5 or more ratings: " + str(x["user_count"]))
print ("The total number of ratings from users who have 5 or more ratings: " + str(x["total_ratings_sum"]))

The number of users who have 5 or more ratings: 37362
The total number of ratings from users who have 5 or more ratings: 436644


## Movie Representation: Matrix Idea

### The Challenge
I needed to devise a way to represent each movie as a matrix so that similar movies would have similar matrices. Additionally, the representation needed to provide a wide range of diversity to capture meaningful differences.

### The Solution
I decided to represent movies using matrices based on their genres. The TMDb API provided the genres associated with each movie, along with a hierarchy indicating which genres are more dominant in the movie.

### Representation Details
- There are 19 genres in total.
- Each movie can be represented by up to 5 genres.
- The matrix representation of a movie is based on this genre information, with the hierarchy dictating the order and weight of the genres.

This representation ensures that similar movies (in terms of genre) will have similar matrices, providing a meaningful way to compare them.


In [17]:
query=f"SELECT COUNT(DISTINCT genres) AS unique_genres_count FROM media_data;"
cursor2.execute(query)
x=cursor2.fetchall()[0]
print ("The number of distinct movie representations: " +str(x["unique_genres_count"]))

The number of distinct movie representations: 2792


## Collecting Users and Their Ratings as Python Objects

In [19]:
query=f"SELECT rating.*, media_data.genres FROM rating JOIN media_data ON rating.media_ID = media_data.id;"
cursor2.execute(query)

rating_results= cursor2.fetchall()

In [22]:
query=f"SELECT * FROM media_data "
cursor2.execute(query)

movie_data= cursor2.fetchall()

In [23]:
query=f"SELECT id from users"
cursor2.execute(query)
users=cursor2.fetchall()

In [24]:
grouped_results = defaultdict(list)
for rating in rating_results:
    grouped_results[rating['User_ID']].append(rating)

In [27]:
user_id = users[543]["id"]
user_ratings = grouped_results.get(user_id)
if user_ratings:
    # Print the ID that was actually looked up instead of a hard-coded one
    print(f"Ratings for User_ID {user_id} :")
    for rating in user_ratings:
        print(rating)

Ratings for User_ID 104471337628996976085 :
{'ID': 'aad3e99c6332', 'media_ID': '440021', 'is_movie': 1, 'User_ID': '104471337628996976085', 'rating': 5, 'rating_date': datetime.datetime(2024, 9, 22, 23, 50, 15), 'genres': '27,9648,35,53,'}
{'ID': 'e72dd716ad41', 'media_ID': '446354', 'is_movie': 1, 'User_ID': '104471337628996976085', 'rating': 7, 'rating_date': datetime.datetime(2024, 9, 22, 23, 50, 15), 'genres': '18,36,'}
{'ID': 'fdd9e348d81b', 'media_ID': '274857', 'is_movie': 1, 'User_ID': '104471337628996976085', 'rating': 8, 'rating_date': datetime.datetime(2024, 9, 22, 23, 50, 15), 'genres': '28,18,14,'}
{'ID': 'd110e29f585c', 'media_ID': '4442', 'is_movie': 1, 'User_ID': '104471337628996976085', 'rating': 6, 'rating_date': datetime.datetime(2024, 9, 22, 23, 50, 15), 'genres': '12,14,28,35,53,'}
{'ID': '039b23105dbb', 'media_ID': '381719', 'is_movie': 1, 'User_ID': '104471337628996976085', 'rating': 7, 'rating_date': datetime.datetime(2024, 9, 22, 23, 50, 15), 'genres': '16,12,1

## Explanation of Matrix Representation

### Structure of the Matrices
- There are 19 genres, each represented as a column.
- Each movie can have up to 5 hierarchical levels of genres, represented as rows.
- The matrix has 19 columns and 5 rows.
- Each row contains a `1` in the column corresponding to the genre in the hierarchy for the movie.

### Movie Comparison
The matrix representation allows for comparing two movies to determine their compatibility. This is the basis for training the neural network to predict whether two movies are compatible.

### Compatibility Matrices
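To assess compatibility, the matrices of two movies are compared element-wise using the formula:

\[
C = (x - y) \odot (x - y)
\]

Where:
- \(x\) and \(y\) are the 5 × 19 genre matrices representing the two movies.
- \(\odot\) is the element-wise product (the `*` operator on NumPy arrays), so \(C\) contains a `1` wherever exactly one of the two movies has a given genre at a given level, and a `0` elsewhere.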

### Scalability
- With 19 genres and up to 5 levels, there are over 2,500 unique matrix representations for individual movies.
- When comparing these matrices, the potential combinations lead to millions of unique compatibility matrices, providing a robust dataset for training the neural network.
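
As a rough upper bound on how many distinct movie matrices this encoding could produce, the sketch below counts ordered selections of up to 5 distinct genres out of 19. It assumes that a movie never lists the same genre twice and that a movie with no genres counts as one (all-zero) representation. Both are assumptions, and the function name is illustrative.

```python
from math import perm

NUM_GENRES = 19   # columns in the matrix
MAX_LEVELS = 5    # rows in the matrix


def count_possible_representations(num_genres: int = NUM_GENRES,
                                   max_levels: int = MAX_LEVELS) -> int:
    """Count ordered lists of 0..max_levels distinct genres."""
    return sum(perm(num_genres, k) for k in range(max_levels + 1))


print(count_possible_representations())  # 1494560
```

Only a small fraction of these combinations occurs in practice: the dataset above contains 2,792 distinct genre strings.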

In [28]:
def create_genre_matrix(genres_str):
    """Encode a comma-separated string of TMDb genre IDs as a 5x19 one-hot matrix."""
    genre_ids = [int(genre_id) for genre_id in genres_str.split(',') if genre_id]
    num_rows = 5   # Up to 5 hierarchy levels per movie
    num_cols = 19  # TMDb has 19 standard movie genres
    matrix = np.zeros((num_rows, num_cols), dtype=float)
    # Complete mapping of TMDb genre IDs to column indices
    genre_id_to_col = {
        28: 0,      # Action
        12: 1,      # Adventure
        16: 2,      # Animation
        35: 3,      # Comedy
        80: 4,      # Crime
        99: 5,      # Documentary
        18: 6,      # Drama
        10751: 7,   # Family
        14: 8,      # Fantasy
        36: 9,      # History
        27: 10,     # Horror
        10402: 11,  # Music
        9648: 12,   # Mystery
        10749: 13,  # Romance
        878: 14,    # Science Fiction
        10770: 15,  # TV Movie
        53: 16,     # Thriller
        10752: 17,  # War
        37: 18,     # Western
    }
    # Row i gets a 1 in the column of the i-th listed genre; genres beyond the fifth are dropped
    for i, genre_id in enumerate(genre_ids[:num_rows]):
        col_index = genre_id_to_col.get(genre_id)
        if col_index is not None:  # IDs missing from the mapping leave the row empty
            matrix[i, col_index] = 1.0

    return matrix

In [29]:
x=create_genre_matrix("28,12,878,")
y=create_genre_matrix("16,10751,12,14,35,")
print(x)
print("\n")
print (y)
print("\n")
print((x-y)*(x-y) )

[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]


[[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]


[[1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]


In [32]:
test_users = []
training_usres = []
trainnum = 0
testnum = 0
for usr in users:
    user_ratings = grouped_results.get(usr["id"])
    # Skip users with no ratings or with fewer than 5 ratings
    if not user_ratings or len(user_ratings) < 5:
        continue
    # Assign roughly 70% of the eligible users to training, the rest to testing
    if random.random() < 0.7:
        training_usres.append(usr["id"])
        trainnum += len(user_ratings)
    else:
        test_users.append(usr["id"])
        testnum += len(user_ratings)
print("Number of users to train: " + str(len(training_usres)))
print("Number of users to test: " + str(len(test_users)))

Number of users to train: 26577
Number of users to test: 11968


In [33]:
print(training_usres[0])
print("the number of rating for training:" +str(trainnum))

100004649219881563245
the number of rating for training:319388


## Generating Training Data for the Neural Network

### Step 1: Assigning Compatibility Labels
The first step in creating the training data is to assign a compatibility label to each pair of movies based on user preferences:

- **Label 1 (Compatible)**: If the user liked both movies or disliked both movies.
- **Label 0 (Incompatible)**: If the user liked one movie but disliked the other.

This labeling process ensures that the neural network learns to predict whether two movies are compatible based on user preferences.

In [34]:
trainingdata = []
totallabels = 0
numlable1 = 0
for usr in training_usres:
    user_ratings = grouped_results.get(usr)
    # Build each movie's genre matrix once per user instead of once per pair
    rated = [(create_genre_matrix(r['genres']), r['rating'] >= 7) for r in user_ratings]
    # Every ordered pair is used, including a movie paired with itself
    for x, liked_x in rated:
        for y, liked_y in rated:
            # Label 1 when both ratings fall on the same side of the threshold (7)
            label = 1 if liked_x == liked_y else 0
            numlable1 += label
            totallabels += 1
            trainingdata.append(((x - y) * (x - y), label))
print ("Number of training matrix at stage 1 :" + str(len(trainingdata)))

Number of training matrix at stage 1 :4715822


### Step 2: Compressing the Data

After assigning compatibility labels, the next step is to compress the information:

1. **Matrix Compression**: Compress the compatibility matrices into unique representations to reduce redundancy.
2. **Label Count**: For each unique representation:
   - Count how many times it received the label `1` (compatible).
   - Count how many times it received the label `0` (incompatible).

This step removes duplicate inputs and replaces the many individual 0/1 labels of each matrix with two counts, which the next step turns into probabilities.

In [36]:
# Each counter starts at 1 rather than 0 (add-one smoothing),
# so no estimated probability is ever exactly 0 or 1
matrix_label_count = defaultdict(lambda: {'label_1': 1, 'label_0': 1})

# Process each tuple (matrix, label)
for matrix, label in trainingdata:
    # Convert matrix to a hashable type (tuple of tuples)
    matrix_hashable = tuple(map(tuple, matrix))

    # Update counts based on the label
    if label == 1:
        matrix_label_count[matrix_hashable]['label_1'] += 1
    elif label == 0:
        matrix_label_count[matrix_hashable]['label_0'] += 1

In [37]:
a, b = list(matrix_label_count.items())[0]
matrix_data_traing=list(matrix_label_count.items())

In [39]:
print (matrix_data_traing[345])

(((0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)), {'label_1': 461, 'label_0': 229})


In [23]:
print (a)
print (b)

{'label_1': 398145, 'label_0': 37153}


In [24]:
print (len(matrix_data_traing))
print (len(trainingdata))

322059
4783940


### Step 3: Converting Labels to Probabilities

The final step is to transform the compatibility labels into probabilities:

1. **Probability Calculation**: 
   - For each unique matrix representation, calculate the estimated probabilities:
     - The probability of being compatible (`label 1`) is computed as the ratio of times it received label `1` to the total occurrences.
     - The probability of being incompatible (`label 0`) is computed similarly.

2. **Training Matrix Filtering**:
   - Exclude matrix representations that appeared only a few times, as they do not provide reliable probability estimates.
   - Retain only the matrices with sufficient occurrences for robust training data.

This process ensures that the neural network is trained to predict the likelihood of compatibility and incompatibility for comparison matrices.
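
As a worked example of the probability calculation, take the counts stored for one of the matrices printed earlier (`label_1`: 461, `label_0`: 229). Writing \(n_1\) and \(n_0\) for the stored counts, the estimates are:

\[
\hat{p}_1 = \frac{n_1}{n_1 + n_0} = \frac{461}{690} \approx 0.668,
\qquad
\hat{p}_0 = \frac{n_0}{n_1 + n_0} = \frac{229}{690} \approx 0.332
\]

Because the counters in the compression step start at 1 rather than 0, each stored count is one higher than the number of observed pairs. The two estimates always sum to 1, which matches the two-value softmax output the network is trained to produce.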

In [41]:
matrix_triplets_for_training = []

# Iterate through the (matrix, counts) pairs
for matrix, counts in matrix_data_traing:
    label_1_count = counts['label_1']
    label_0_count = counts['label_0']
    total_count = label_1_count + label_0_count

    # Both counters start at 1, so total_count > 7 means at least 6 observed pairs
    if total_count > 7:
        label_1_ratio = label_1_count / total_count
        label_0_ratio = label_0_count / total_count

        # Create the triplet (matrix, P(label 1), P(label 0)) and add it to the list
        matrix_triplet = (matrix, label_1_ratio, label_0_ratio)
        matrix_triplets_for_training.append(matrix_triplet)

In [42]:
a ,b ,c= matrix_triplets_for_training[564]
print (a)
print("\n")
print ("chance to get 1: " +str(b) )
print ("\n")
print ("chance to get 0: " +str(c))

((0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0))


chance to get 1: 0.5491803278688525


chance to get 0: 0.45081967213114754


In [44]:
print ("Training data at stage 3: "+ str(len (matrix_triplets_for_training)))

Training data at stage 3: 105794


In [29]:
# Define the neural network model
class DNNModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(5 * 19, 128)  # Flatten 5x19 input to 95 features, then 128 neurons in the first layer
        self.fc2 = nn.Linear(128, 64)      # Second layer with 64 neurons
        self.fc3 = nn.Linear(64, 64)       # Third layer with 64 neurons
        self.fc4 = nn.Linear(64, 32)       # Fourth layer with 32 neurons
        self.fc5 = nn.Linear(32, 2)        # Output layer for 2 classes (class 1 and class 0)

    def forward(self, x):
        x = x.view(-1, 5 * 19)                   # Flatten the matrix input
        x = F.leaky_relu(self.fc1(x), negative_slope=0.01)  # Leaky ReLU with negative slope 0.01
        x = F.leaky_relu(self.fc2(x), negative_slope=0.01)  # Leaky ReLU for second layer
        x = F.leaky_relu(self.fc3(x), negative_slope=0.01)  # Leaky ReLU for third layer
        x = F.leaky_relu(self.fc4(x), negative_slope=0.01)  # Leaky ReLU for fourth layer
        x = torch.softmax(self.fc5(x), dim=1)    # Softmax output for class probabilities
        return x

# Instantiate the model, loss function, and optimizer
model = DNNModel()

# Mean squared error between the predicted and target probability pairs.
# nn.KLDivLoss is a possible alternative, but it expects log-probabilities
# (log_softmax) as model output rather than the softmax used here.
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.0001)

In [30]:
matrices = [triplet[0] for triplet in matrix_triplets_for_training]  # 5x19 matrices
labels = [[triplet[1], triplet[2]] for triplet in matrix_triplets_for_training]  # Probabilities for label 1 and label 0
print (matrices[0])
print  (labels[0])

((0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0))
[0.9146492747497117, 0.0853507252502883]


In [31]:
matrices_tensor = torch.tensor(matrices, dtype=torch.float32)  # Ensure float type
labels_tensor = torch.tensor(labels, dtype=torch.float32)      # Float for softmax output

In [32]:
class MatrixTripletDataset(Dataset):
    def __init__(self, matrices, labels):
        self.matrices = matrices
        self.labels = labels

    def __len__(self):
        return len(self.matrices)

    def __getitem__(self, idx):
        # Convert to tensors
        matrix = torch.tensor(self.matrices[idx], dtype=torch.float32)
        label = torch.tensor(self.labels[idx], dtype=torch.float32)  # Probabilities for both class 1 and class 0
        return matrix, label

# Create the dataset
dataset = MatrixTripletDataset(matrices, labels)

# Create DataLoader instance
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)

In [33]:
def train_model(model, train_loader, criterion, optimizer, num_epochs=10):
    """
    Trains the neural network model.
    
    Args:
    - model: The neural network model to be trained.
    - train_loader: DataLoader containing the training data.
    - criterion: Loss function.
    - optimizer: Optimizer for updating model weights.
    - num_epochs: Number of epochs to train the model (default = 10).
    
    Returns:
    - model: The trained model.
    """
    model.train()  # Set the model to training mode
    
    for epoch in range(num_epochs):
        running_loss = 0.0  # To accumulate loss for each epoch
        
        # Iterate over the batches in train_loader
        for inputs, target_probs in train_loader:
            # Ensure inputs and target_probs are in the correct format (float32)
            inputs, target_probs = inputs.to(torch.float32), target_probs.to(torch.float32)
            
            # Zero the parameter gradients
            optimizer.zero_grad()
            
            # Forward pass: compute model's predictions
            outputs = model(inputs)
            
            # Compute loss (difference between predicted and true probabilities)
            loss = criterion(outputs, target_probs)
            
            # Backward pass: compute the gradients
            loss.backward()
            
            # Update the weights
            optimizer.step()
            
            # Accumulate the running loss
            running_loss += loss.item()
        
        # Calculate average loss for the epoch and print it
        epoch_loss = running_loss / len(train_loader)
        print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {epoch_loss:.4f}")
    
    print('Training complete.')
    return model

In [34]:
def init_weights(m):
    # Xavier-initialize every linear layer and set a small constant bias
    if isinstance(m, nn.Linear):
        torch.nn.init.xavier_uniform_(m.weight)
        m.bias.data.fill_(0.01)

model.apply(init_weights)
trained_model = train_model(model, train_loader, criterion, optimizer, num_epochs=30)

Epoch [1/30], Loss: 0.0329
Epoch [2/30], Loss: 0.0324
Epoch [3/30], Loss: 0.0321
Epoch [4/30], Loss: 0.0319
Epoch [5/30], Loss: 0.0316
Epoch [6/30], Loss: 0.0314
Epoch [7/30], Loss: 0.0311
Epoch [8/30], Loss: 0.0309
Epoch [9/30], Loss: 0.0306
Epoch [10/30], Loss: 0.0304
Epoch [11/30], Loss: 0.0301
Epoch [12/30], Loss: 0.0299
Epoch [13/30], Loss: 0.0297
Epoch [14/30], Loss: 0.0295
Epoch [15/30], Loss: 0.0292
Epoch [16/30], Loss: 0.0290
Epoch [17/30], Loss: 0.0288
Epoch [18/30], Loss: 0.0286
Epoch [19/30], Loss: 0.0284
Epoch [20/30], Loss: 0.0282
Epoch [21/30], Loss: 0.0280
Epoch [22/30], Loss: 0.0279
Epoch [23/30], Loss: 0.0277
Epoch [24/30], Loss: 0.0275
Epoch [25/30], Loss: 0.0273
Epoch [26/30], Loss: 0.0272
Epoch [27/30], Loss: 0.0270
Epoch [28/30], Loss: 0.0269
Epoch [29/30], Loss: 0.0267
Epoch [30/30], Loss: 0.0266
Training complete.


In [35]:
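def check_compatibility(usr_rating, candidate):
    """Return 1 if the model predicts the user will like `candidate`, else 0."""
    like = 0
    dislike = 0
    movie_entry = next((movie for movie in movie_data if movie['id'] == candidate), None)
    if movie_entry:  # Check the lookup before reading movie_entry['genres']
        matrix_candidate = create_genre_matrix(movie_entry['genres'])
        for rating in usr_rating:
            if candidate != rating["media_ID"]:
                matrix_rating = create_genre_matrix(rating['genres'])
                mat = (matrix_candidate - matrix_rating) * (matrix_candidate - matrix_rating)
                mat_tensor = torch.tensor(mat, dtype=torch.float32)
                with torch.no_grad():  # Inference only, no gradients needed
                    probability = trained_model(mat_tensor)
                p1 = probability[0, 0].item()  # Predicted chance of label 1 (compatible)
                p2 = probability[0, 1].item()  # Predicted chance of label 0 (incompatible)
                if rating['rating'] >= 7:
                    # The user liked this movie: compatibility votes for "like"
                    like = like + p1
                    dislike = dislike + p2
                else:
                    # The user disliked this movie: compatibility votes for "dislike"
                    like = like + p2
                    dislike = dislike + p1
    # Ties (including a candidate missing from movie_data) default to 1
    if like >= dislike:
        return 1
    else:
        return 0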

In [36]:
def check_user_rating_list(user_id):
    """Count confusion-matrix entries over one user's ratings."""
    TP = 0.0
    TN = 0.0
    FP = 0.0
    FN = 0.0

    user_ratings = grouped_results.get(user_id)
    for rating in user_ratings:
        b = check_compatibility(user_ratings, rating["media_ID"])
        if rating['rating'] >= 7:  # The user actually liked the movie
            if b == 1:
                TP = TP + 1
            else:
                FN = FN + 1  # Liked, but predicted as disliked
        else:                      # The user actually disliked the movie
            if b == 0:
                TN = TN + 1
            else:
                FP = FP + 1  # Disliked, but predicted as liked
    return TP, TN, FP, FN

In [37]:
TP = 0.0
TN = 0.0
FP = 0.0
FN = 0.0

for user_id in test_users:
    a, b, c, d = check_user_rating_list(user_id)
    TP = TP + a
    TN = TN + b
    FP = FP + c
    FN = FN + d


In [40]:
print ("TP: " + str(TP))
print ("TN: " + str(TN))
print ("FP: " + str(FP))
print ("FN: " + str(FN))
print ("Accuracy: " + str((TP + TN) / (TP + TN + FP + FN)))
print ("Recall (true positive rate): " + str(TP / (TP + FN)))

TP: 63115.0
TN: 32479.0
FP: 26625.0
FN: 20104.0
Accuracy: 0.6716693717810895
Recall (true positive rate): 0.7584205529987142
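
The summary rates can all be derived from the four confusion-matrix counts in one place. The helper below is a minimal sketch: the function name `classification_metrics` and the small counts in the usage lines are illustrative and are not part of the original notebook.

```python
def classification_metrics(tp: float, tn: float, fp: float, fn: float) -> dict:
    """Derive standard rates from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0.
    """
    def safe_div(num: float, den: float) -> float:
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)  # Of the predicted likes, how many were real likes
    recall = safe_div(tp, tp + fn)     # Of the real likes, how many were predicted
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Hypothetical counts, only to show the call
metrics = classification_metrics(tp=8, tn=5, fp=2, fn=5)
print(metrics)
```

Reporting precision and recall side by side makes it easier to see whether errors come mostly from recommending movies the user disliked or from missing movies the user liked.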


## Algorithms for Selecting Good Candidates

Even though we now have a method to determine whether a movie is recommended for a user, it is still crucial to start with good candidate movies. For example, if someone loves action movies, a 1960s action movie might still pass the compatibility check, but a more relevant candidate would make a better recommendation.

### Two Algorithms for Selecting Candidates

#### 1. Popularity-Based Algorithm
This algorithm selects candidates based on their popularity:
- Filter movies with an average rating above 7.
- Count the number of users who rated each movie.
- Movies with a higher number of ratings are more likely to be selected as candidates.
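
A minimal sketch of the popularity-based selection follows. It assumes ratings arrive as dictionaries with `media_ID` and `rating` keys (the shape of the rows shown earlier). The function name, the `top_n` parameter, and ranking strictly by rating count are assumptions.

```python
from collections import defaultdict


def popular_candidates(ratings, min_average=7.0, top_n=10):
    """Rank well-rated movies by how many users rated them."""
    totals = defaultdict(lambda: [0, 0])  # media_ID -> [sum of ratings, number of ratings]
    for row in ratings:
        entry = totals[row["media_ID"]]
        entry[0] += row["rating"]
        entry[1] += 1

    # Keep movies whose average rating is above the threshold
    qualified = [
        (media_id, count)
        for media_id, (rating_sum, count) in totals.items()
        if rating_sum / count > min_average
    ]
    # Most-rated movies first
    qualified.sort(key=lambda item: item[1], reverse=True)
    return [media_id for media_id, _ in qualified[:top_n]]


# Hypothetical ratings, only to show the call
sample = [
    {"media_ID": "a", "rating": 9}, {"media_ID": "a", "rating": 8},
    {"media_ID": "b", "rating": 10},
    {"media_ID": "c", "rating": 4}, {"media_ID": "c", "rating": 5},
    {"media_ID": "c", "rating": 6},
]
print(popular_candidates(sample))  # ['a', 'b']
```

If candidates should be drawn at random with a probability proportional to their rating count, the final sort could be replaced by weighted sampling.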

#### 2. Keyword-Based Algorithm
This algorithm uses keywords to find relevant candidates:
- The user interface provides keywords associated with each movie.
- Collect all the keywords from movies the user liked.
- Compare the keywords using natural language processing (NLP):
  - Identify frequently occurring keywords.
  - If no keyword appears frequently, group similar keywords using NLP techniques and count them as one until a sufficient number of grouped keywords is found.

Using the selected keywords:
- Find movies that share the same keywords.
- Select these movies as candidates for recommendations.
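
The frequency-counting and candidate-lookup steps can be sketched as follows. The NLP grouping of similar keywords is left out here. The data shapes (`liked_movie_keywords` as a list of keyword-name lists, `keyword_to_movies` as a lookup table), the function names, and the `min_count` threshold are assumptions.

```python
from collections import Counter


def frequent_keywords(liked_movie_keywords, min_count=2):
    """Return keywords that occur in at least `min_count` liked movies."""
    counts = Counter()
    for keywords in liked_movie_keywords:
        counts.update(set(keywords))  # Count each keyword once per movie
    return [kw for kw, n in counts.most_common() if n >= min_count]


def keyword_candidates(selected_keywords, keyword_to_movies, already_rated=()):
    """Collect movies that share any selected keyword, skipping rated ones."""
    seen = set(already_rated)
    candidates = []
    for kw in selected_keywords:
        for movie_id in keyword_to_movies.get(kw, []):
            if movie_id not in seen:
                seen.add(movie_id)
                candidates.append(movie_id)
    return candidates


# Hypothetical data, only to show the calls
liked = [["dream", "heist"], ["heist", "hero"], ["hero", "heist"]]
lookup = {"heist": [1, 2, 3], "hero": [3, 4]}
selected = frequent_keywords(liked)
print(selected)                                                 # ['heist', 'hero']
print(keyword_candidates(selected, lookup, already_rated=[2]))  # [1, 3, 4]
```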

These two algorithms ensure a robust approach for selecting relevant movie candidates tailored to user preferences.


In [1]:
# Illustrative keyword data for a single movie
# (the variable names are placeholders added so that the cell is valid Python)
movie_keywords = {
  "id": 27205,
  "keywords": [
    {"id": 1234, "name": "dream"},
    {"id": 5678, "name": "subconscious"},
    {"id": 91011, "name": "heist"},
    {"id": 121314, "name": "time dilation"}
  ]
}

# Illustrative list of movies that share one keyword
keyword_movies = {
  "id": 1701,
  "name": "hero",
  "movies": [
    {
      "id": 299536,
      "title": "Avengers: Infinity War",
      "release_date": "2018-04-27",
      "popularity": 300.0
    },
    {
      "id": 299537,
      "title": "Captain Marvel",
      "release_date": "2019-03-08",
      "popularity": 250.0
    }
    # Additional movies...
  ]
}

In [None]:
from itertools import combinations

import spacy

# Assumption: the original does not name the spaCy model; similarity needs a
# pipeline with word vectors, such as "en_core_web_md"
nlp = spacy.load("en_core_web_md")


def calculate_keyword_similarity(keywords):
    """
    Calculate semantic similarity between each pair of keywords in the input array.

    Parameters:
        keywords (list): A list of dictionaries, each containing 'count', 'id', and 'name' of a keyword.

    Returns:
        list: A list of dictionaries containing pairs of keywords and their similarity score.
    """
    similarities = []
    for kw1, kw2 in combinations(keywords, 2):
        doc1 = nlp(kw1['name'])
        doc2 = nlp(kw2['name'])
        similarity = doc1.similarity(doc2)

        # Keep only pairs that are similar enough to be counted as one keyword
        if similarity >= 0.80:
            similarities.append({
                "keyword1": kw1['name'],
                "keyword2": kw2['name'],
                "id1": kw1["id"],
                "id2": kw2["id"],
                "count1": kw1["count"],
                "count2": kw2["count"],
                "similarity": similarity
            })

    # Sort by similarity in descending order
    similarities = sorted(similarities, key=lambda x: x['similarity'], reverse=True)

    return similarities
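
To count similar keywords as one, as described in the keyword-based algorithm, the pairs returned by `calculate_keyword_similarity` could be merged into groups. The sketch below uses a small union-find over keyword ids. This grouping strategy and the names `group_similar_keywords` and `find` are assumptions, not the method used in the original project.

```python
def group_similar_keywords(keywords, similar_pairs):
    """Merge keywords linked by a similar pair and sum their counts.

    keywords: list of dicts with 'id', 'name', and 'count'.
    similar_pairs: list of dicts with 'id1' and 'id2'.
    """
    parent = {kw["id"]: kw["id"] for kw in keywords}

    def find(node):
        # Follow parent links to the group's representative, shortening the path
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for pair in similar_pairs:
        root1, root2 = find(pair["id1"]), find(pair["id2"])
        if root1 != root2:
            parent[root2] = root1

    groups = {}
    for kw in keywords:
        group = groups.setdefault(find(kw["id"]), {"names": [], "count": 0})
        group["names"].append(kw["name"])
        group["count"] += kw["count"]
    # Largest combined count first
    return sorted(groups.values(), key=lambda g: g["count"], reverse=True)


# Hypothetical input, only to show the call
kws = [
    {"id": 1, "name": "hero", "count": 3},
    {"id": 2, "name": "superhero", "count": 2},
    {"id": 3, "name": "heist", "count": 4},
]
pairs = [{"id1": 1, "id2": 2}]
grouped = group_similar_keywords(kws, pairs)
print(grouped)
```

In this example "hero" and "superhero" are merged into one group with a combined count of 5, which then ranks ahead of "heist".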