**This Jupyter Notebook serves as a proof of concept for a content-based video game recommendation system. It was created in part with a guide found [here](http://www.datacamp.com/tutorial/recommender-systems-python). It draws on IMDb data for over 20k video games, including user ratings and the total number of users who rated each game. The recommender system applies Natural Language Processing to the 'plot' column of each game. It then uses cosine similarity to measure how similar an example game is to every other game in our dataset. Mathematically, cosine similarity is represented by the following equation:**

![](https://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1590782185/cos_aalkpq.png)
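
**As a minimal sketch of that formula (the dot product of two vectors divided by the product of their lengths), the helper below uses only the standard library. The function name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of the notebook's pipeline.**

```python
import math

def cosine_similarity(a, b):
    """Return cos(theta) between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score 1, unrelated (orthogonal) vectors score 0
same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])
orthogonal = cosine_similarity([1, 0, 0], [0, 3, 0])
partial = cosine_similarity([1, 1, 0], [1, 0, 0])
```

**Because the score depends only on direction, a long plot and a short plot that use the same words in the same proportions are treated as equally similar.**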

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.feature_extraction.text import TfidfVectorizer # NLP and TF-IDF Vectorization
from sklearn.metrics.pairwise import linear_kernel # Dot-product kernel; equals cosine similarity on L2-normalized TF-IDF rows
import matplotlib.pyplot as plt
import seaborn as sns
import string
import re
import nltk

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
games_df = pd.read_csv('../input/imdb-video-games/imdb-videogames.csv')

games_df.info()

**Now we need to remove the video game entries that have no plot information. In this dataset, these entries contain the placeholder string "Add a Plot" in the 'plot' column. We also need to remove any entries that have duplicate names: looking up a duplicated name returns more than one index, so the similarity lookup returns more than one array, which causes a ValueError. After dropping these rows, we need to re-index our dataframe so that positional lookups stay within range.**
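
**The duplicate-name problem can be seen on a toy dataframe. This is a sketch under assumptions: the game names and plots below are made up, and `lookup` stands in for a name-to-position Series of the kind a title search would use.**

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["Game A", "Game B", "Game A", "Game C"],
    "plot": ["A hero rises", "Add a Plot", "A hero falls", "Aliens invade"],
})

# With duplicate names, a label lookup returns a Series, not a single position
lookup = pd.Series(toy.index, index=toy["name"])
ambiguous = lookup["Game A"]

# Drop placeholder plots and duplicate names, then renumber rows from 0
cleaned = toy[toy["plot"] != "Add a Plot"]
cleaned = cleaned.drop_duplicates(subset="name", keep="first")
cleaned = cleaned.reset_index(drop=True)

# After cleaning, every name maps to exactly one row position
clean_lookup = pd.Series(cleaned.index, index=cleaned["name"])
```

**Here `ambiguous` holds two positions, while each entry of `clean_lookup` is a single integer that can safely be used to select one row of a similarity matrix.**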

In [None]:
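# Find rows whose plot is only the placeholder text
no_plot = games_df[games_df['plot'] == 'Add a Plot'].index

games_df = games_df.drop(index=no_plot)
games_df = games_df.drop_duplicates(subset='name', keep='first')
# drop=True discards the old index instead of keeping it as an extra 'index' column
games_df = games_df.reset_index(drop=True)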

games_df.info()

**This greatly shortens our list of games, from 20,803 to 11,096.**

**Now we will perform the language processing on the 'plot' column, removing the stop words from each plot, and use the remaining words to build a TF-IDF matrix with one row per game and one column per word:**
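
**To make the weighting concrete, here is a small pure-Python sketch of TF-IDF. It is modelled on the smoothed IDF, ln((1 + n) / (1 + df)) + 1, and the L2 row normalization that scikit-learn documents as `TfidfVectorizer` defaults, but it is an illustration rather than the library's code. The tiny `STOP_WORDS` set, the helper names, and the example plots are all assumptions.**

```python
import math
import re
from collections import Counter

# Tiny hypothetical stop-word list; the real English list is much longer
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "must", "his"}

def tokenize(text):
    # Lowercase, keep tokens of two or more word characters, drop stop words
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def tfidf_rows(docs):
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(dot(row, row))
        rows.append([v / norm if norm else 0.0 for v in row])
    return vocab, rows

plots = [
    "A samurai must defend his village",
    "A samurai seeks revenge",
    "Aliens invade the city",
]
vocab, rows = tfidf_rows(plots)
shared = dot(rows[0], rows[1])     # both plots mention "samurai"
unrelated = dot(rows[0], rows[2])  # no words in common
```

**Words that appear in many plots, such as "samurai" here, receive a lower weight than words unique to one plot, and plots with no words in common end up with a dot product of zero.**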

In [None]:
#Define a TF-IDF Vectorizer Object. We will also get rid of all english "stop words" such as 'the', 'a', or 'an'
tfidf = TfidfVectorizer(stop_words='english')

#Replace any NaN values with an empty string
games_df['plot'] = games_df['plot'].fillna('')

#Create the required TF-IDF matrix by fitting and transforming the data, taking the 'plot' column as an input.
tfidf_matrix = tfidf.fit_transform(games_df['plot'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

**Next, we can take a look at some of the words that are included in our TF-IDF matrix:**

In [None]:
tfidf.get_feature_names_out()[2000:2100]

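**Next we compute the cosine similarity matrix. Because TfidfVectorizer L2-normalizes each row by default, the plain dot product computed by linear_kernel is equal to the cosine similarity. We can then look at a few of its entries, and at its shape, to make sure it looks correct:**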
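
**Why a plain dot product is enough can be checked numerically. The sketch below uses random stand-in data rather than the real TF-IDF matrix: once every row is scaled to unit length, the matrix of dot products matches the full cosine formula. The variable names are illustrative.**

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.random((4, 6))  # stand-in for 4 documents x 6 terms

# Scale every row to unit (L2) length
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Dot products of unit-length rows
dot_products = unit @ unit.T

# Full cosine formula applied to the unnormalized rows
norms = np.linalg.norm(raw, axis=1)
cosine = (raw @ raw.T) / np.outer(norms, norms)

matches = np.allclose(dot_products, cosine)
```

**Skipping the division by the norms is what makes the dot-product route cheaper on a matrix with thousands of rows.**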

In [None]:
# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

cosine_sim

In [None]:
cosine_sim.shape

**We can get a rough, though not very detailed, look at this large matrix by plotting a slice of it:**

In [None]:
sns.set_theme(style="white")
fig, ax = plt.subplots(figsize=(10,10))

plot_cos = cosine_sim[900:1050, 900:1050]

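# vmax=0.4 clips the color scale: every value of 0.4 or more is drawn in the top color
cax = ax.matshow(plot_cos, interpolation='nearest',
                 cmap=sns.diverging_palette(230, 30, as_cmap=True), vmax=0.4)

plt.title('Video Game Similarity matrix')

# Ticks beyond vmax fall outside the colorbar, so stop at 0.4
fig.colorbar(cax, ticks=[0, 0.1, 0.2, 0.3, 0.4])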
plt.show()

**As can be seen, the large majority of game plots share little similarity. The color scale has actually been adjusted to make similarity more visible: the maximum value has been set to 0.4 rather than the usual 1.0, so anything higher than 0.4 is drawn in the same color as 0.4. Additionally, we have zoomed in on a 150 x 150 slice of the 11,096 x 11,096 matrix to show more granularity. The diagonal is always 1, because each game is compared with itself. Given how dissimilar the dataset is, any other value of 1 in the matrix would most likely indicate a duplicate entry rather than two separate, similar data points.**

**Now, in order to create a function that returns a list of similar games given the title of one game, we need a mapping from each game's title string to its index in the dataframe. We will create that now:**

In [None]:
indices = pd.Series(games_df.index, index=games_df['name'])

**Now we need to create our recommendation function. It has to do several things:**
1. Retrieve the index of the inputted game, given its title
2. Get that game's cosine similarity to every game on our list (11,096)
3. Convert that list of similarity scores into a list of tuples, where the first element is the game's position and the second is its similarity score
4. Sort the list of tuples by similarity score, in descending order
5. Get the top 10 results, ignoring the 1st result, as it will refer to the inputted game itself (it has a score of 1 in similarity, after all)
6. Return the names that match the indices of those top 10 results, along with their similarity scores
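
**The steps above can be traced on a hand-made 4 x 4 similarity matrix. This is a toy sketch: the game names, the scores, and the helper `top_similar` are hypothetical, and it excludes the inputted game by position rather than by dropping the first sorted entry.**

```python
names = ["Alpha", "Bravo", "Charlie", "Delta"]  # hypothetical titles
sim = [
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.4],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.4, 0.3, 1.0],
]
name_to_pos = {name: pos for pos, name in enumerate(names)}

def top_similar(name, k=2):
    pos = name_to_pos[name]                         # step 1: title -> position
    scores = list(enumerate(sim[pos]))              # steps 2-3: (position, score) tuples
    scores.sort(key=lambda p: p[1], reverse=True)   # step 4: best score first
    top = [p for p in scores if p[0] != pos][:k]    # step 5: skip the game itself
    return [(names[i], score) for i, score in top]  # step 6: positions -> names

result = top_similar("Alpha")
```

**Filtering on position gives the same answer as skipping the first sorted entry whenever the inputted game is the only one with a score of 1, and it stays correct if another game happens to tie with it.**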


In [None]:
def game_recommendation(name, cosine_sim=cosine_sim):
    #Get the index of the inputted game
    indx = indices[name]
    
    #Get the inputted game's list of similarity scores with all other games
    similarity_scores = list(enumerate(cosine_sim[indx]))
    
    #Sort the games by similarity score
    similarity_scores = sorted(similarity_scores, key=lambda x:x[1], reverse=True)
    
    #Get the top ten results, skipping the first entry (the inputted game itself)
    similarity_scores = similarity_scores[1:11]
    
    #Get the game indices
    game_indices = [i[0] for i in similarity_scores]
    
    #Return top 10 most similar games
    return games_df['name'].iloc[game_indices], similarity_scores

In [None]:
game_recommendation("Samurai Warriors 3")

**The system seems to have correctly picked up on a 'Samurai' theme.**

In [None]:
game_recommendation("Yakuza Kiwami")

In [None]:
game_recommendation("The Last of Us")

In [None]:
videogame = input("Please enter a game to search: ")

game_recommendation(videogame)
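
**A title typed by hand may not match any name in the dataset exactly, in which case a label lookup on a pandas Series raises a KeyError. One possible guard, sketched here with the standard library's `difflib`, checks the title first and offers close matches. The helper `resolve_title` and the short title list are assumptions for illustration.**

```python
import difflib

def resolve_title(query, known_titles, cutoff=0.6):
    """Return (exact_title, []) on a match, else (None, close_suggestions)."""
    if query in known_titles:
        return query, []
    suggestions = difflib.get_close_matches(query, known_titles, n=3, cutoff=cutoff)
    return None, suggestions

titles = ["Samurai Warriors 3", "Yakuza Kiwami", "The Last of Us"]

exact, _ = resolve_title("Yakuza Kiwami", titles)
missing, hints = resolve_title("Yakuza Kiwam", titles)  # typo: final letter dropped
```

**With the real data, `known_titles` could be the list of values in the 'name' column, and the recommendation function would only be called once an exact title has been resolved.**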