## 1. Introduction <a id="1"></a>

The [Video Game Sales](https://www.kaggle.com/gregorut/videogamesales) dataset provides an interesting way to assess the most popular games through the lens of total sales. But what if we wanted to look at collections of similar games, i.e., game franchises, rather than individual games? There is no unique ID that identifies a franchise, and since games in the same franchise have slightly different names (e.g. Pokémon Red, Pokémon Gold, Pokémon Crystal), we cannot simply use the `groupby` method on them.

This notebook explores how the [Python Record Linkage Toolkit](https://pypi.org/project/recordlinkage/) can help with this task by grouping similarly named games together to identify potential franchises.

<a><img src="https://images.unsplash.com/photo-1566577134648-41d15b958d80?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1650&q=80"></a> <p style="text-align:center">*In 2020, the Final Fantasy game franchise consists of 15 titles*</p>


### **Table of Contents**
* [Introduction](#1)
* [Setup](#2)
* [Data Exploration](#3)
* [Indexing](#4)
* [Pair Comparison](#5)
* [Testing the Results](#6)
* [Final Considerations](#7)

## Setup <a id="2"></a>

In [None]:
# Install the Record Linkage library (it is not part of the standard Kaggle notebook environment)
%pip install recordlinkage

# Import libraries
import pandas as pd

import plotly.express as px

import recordlinkage as rl

import networkx as nx

# Set notebook properties
pd.options.display.float_format = "{:.2f}".format
pd.options.display.max_rows = 100

# Import the data
df = pd.read_csv('../input/videogamesales/vgsales.csv')

In [None]:
# Make all column names lower for ease of typing
df.columns = df.columns.str.lower()

In [None]:
# Inspect the dataset holistically 
df.info()

There are missing values in the `year` and `publisher` variables. Although not ideal, this will not impact further analysis, so we can leave them blank.

In [None]:
# Inspect the numeric variables
df.describe()

Nothing particularly stands out here, so we can move forward.

## Data Exploration <a id="3"></a>

In this section we will explore the data variables in a bit more detail. This will give us a better understanding of the overall context.

In [None]:
# Print unique values of the relevant columns 
print(f'Unique games in the dataset: {df.name.nunique()}')
print(f'Unique publishers in the dataset: {df.publisher.nunique()}')
print(f'Unique platforms in the dataset: {df.platform.nunique()}')
print(f'Unique genres in the dataset: {df.genre.nunique()}')

There are around 11.5K unique game titles in the dataset. This is the number we will try to reduce by grouping the games into franchises.

In [None]:
# Check if names are duplicated
df.groupby(['name'])['name'].count().sort_values(ascending=False)

In [None]:
# Inspect one of the duplicated names
df[df['name'] == 'Need for Speed: Most Wanted']

Duplicates exist. This is because the same game can be released on multiple platforms and rereleased in different years. However, we do not need to deduplicate, as we want to see the total sales for each franchise regardless of platform or release year.

In [None]:
fig = px.bar(df.sort_values(by=['global_sales'],ascending=False)[:20], x='name', y='global_sales')
fig.show()

In [None]:
by_year = df.groupby(['name', 'year'], as_index=False)['global_sales'].sum()
by_year = by_year[by_year['year'] <= 2016]
by_year = by_year.sort_values(by=['year','global_sales'],ascending=False)

fig = px.bar(by_year, x="name", y="global_sales", animation_frame="year", hover_name="name")

fig["layout"].pop("updatemenus") # optional, drop animation buttons
fig.show()

A few observations:
* Data for 2016 seems to be incomplete.
* Wii Sports is the best-selling game in the dataset by far. This is particularly remarkable considering that Wii Sports was released on only one gaming platform.

## Indexing <a id="4"></a>

Now, we will begin the process of matching the games.

To assess whether two games are related, we need to compare them. Since we have 11,493 titles in the dataset, we would have to create 66,038,778 comparison pairs (11,493 x 11,492 / 2). Even if one pair took only 1 second to compare, the full computation would take about as long as it took to develop Dark Souls (around 2 years).
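
The pair count and the time estimate above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the one-second cost per comparison is the illustrative figure used in the text, not a measurement, and the variable names are made up for this example.

```python
from math import comb

n_titles = 11_493  # unique game names reported in the Data Exploration section

# Number of unordered pairs: n * (n - 1) / 2
n_pairs = comb(n_titles, 2)

# Illustrative assumption: every comparison takes one second
seconds_per_year = 365 * 24 * 60 * 60
years = n_pairs / seconds_per_year

print(f"{n_pairs:,} pairs, about {years:.1f} years at 1 s per pair")
```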

One way to reduce this complexity is to limit the number of pairs we compare by assuming that most related games have similar names. This makes sense: once a studio has a best-selling game, it tends to keep a similar name for subsequent releases, which makes the franchise more recognisable and can increase sales while reducing marketing costs.

The approach that the Record Linkage library uses to limit the number of pairs is called Sorted Neighbourhood Indexing. It is a fairly simple method that follows these steps:

1. Sort the entire dataset alphabetically by the chosen key (here, the game name).
2. Slide a small, fixed-size window over the sorted records.
3. Create candidate pairs only from records that fall within the same window.
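
To make the idea concrete, here is a minimal pure-Python sketch of the steps above. It is a simplification for illustration and not the library's implementation: the function name `sorted_neighbourhood_pairs` is hypothetical, duplicate names are dropped first, and the window is assumed to be odd so that it extends equally on both sides of a record.

```python
def sorted_neighbourhood_pairs(names, window=9):
    """Return candidate pairs of names that lie close together once sorted.

    Simplified sketch: each name is paired with the (window - 1) // 2
    names that follow it in alphabetical order.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    reach = (window - 1) // 2
    ordered = sorted(set(names))  # step 1: sort the unique names
    pairs = []
    for i, left in enumerate(ordered):
        # steps 2 and 3: pair each name with its nearest following neighbours
        for right in ordered[i + 1 : i + 1 + reach]:
            pairs.append((left, right))
    return pairs


titles = [
    "Pokemon Red", "FIFA 15", "Pokemon Gold",
    "FIFA 16", "Tetris", "Pokemon Crystal",
]
for pair in sorted_neighbourhood_pairs(titles, window=3):
    print(pair)
```

With a window of 3, each title is only paired with its direct neighbour in the sorted list, so "FIFA 15" is never compared with "Tetris". The number of pairs grows linearly with the number of titles rather than quadratically.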

In [None]:
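# Index by name and keep one row per title, since we compare unique names
df = df.set_index(['name'], drop=False)
df = df.drop_duplicates(subset=['name'])

# Build candidate pairs from titles that fall within a window of 9 sorted names
# (the older SortedNeighbourhoodIndex name is deprecated in recent recordlinkage releases)
indexer = rl.index.SortedNeighbourhood('name', window=9)
candidate_pairs = indexer.index(df)

len(candidate_pairs)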

Sorted Neighbourhood Indexing generated 45,962 candidate pairs for us to assess, a tiny fraction of the roughly 66 million possible pairs.

## Pair Comparison <a id="5"></a>

Now that we have the candidate pairs, we need to assess if those pairs are actual matches (belong to the same franchise) or not.

Record Linkage lets us choose from multiple similarity metrics. One that works well with short strings such as names is [Jaro-Winkler](https://statisticaloddsandends.wordpress.com/2019/09/11/what-is-jaro-jaro-winkler-similarity/). It assigns a similarity score between 0 and 1 to any two strings based on:
* The number of matching characters
* The number of transpositions
* Whether the differences occur at the beginning or the end of the string
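
For intuition, the metric can be sketched in pure Python. This is a textbook-style version written for illustration; the library's own implementation may differ in details. The function names and the `prefix_weight` and `max_prefix` parameters (set to the commonly used values 0.1 and 4) are choices made for this sketch.

```python
def jaro(s1, s2):
    """Jaro similarity: based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are at most this far apart
    match_distance = max(max(len1, len2) // 2 - 1, 0)
    s1_matched = [False] * len1
    s2_matched = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - match_distance)
        end = min(len2, i + match_distance + 1)
        for j in range(start, end):
            if not s2_matched[j] and s2[j] == ch:
                s1_matched[i] = s2_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order
    s1_chars = [c for c, m in zip(s1, s1_matched) if m]
    s2_chars = [c for c, m in zip(s2, s2_matched) if m]
    transpositions = sum(a != b for a, b in zip(s1_chars, s2_chars)) // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1, max_prefix=4):
    """Jaro similarity with a bonus for a shared prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)


for a, b in [("Pokemon Red", "Pokemon Gold"), ("Pokemon Red", "Tetris")]:
    print(f"{a!r} vs {b!r}: {jaro_winkler(a, b):.3f}")
```

The prefix bonus is what makes the metric a good fit for franchise names, which usually share their first few characters and differ towards the end.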

In addition, the library lets us compare other variables. For example, here we assign a value of 1 where the platform of two games matches exactly and 0 otherwise, and do the same for the genre.

In [None]:
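# Define the comparisons to run on every candidate pair
compare_cp = rl.Compare()

# Jaro-Winkler similarity of the names (could use the threshold parameter)
compare_cp.string('name','name',method='jarowinkler',label='name')
# Exact-match flags (1 or 0) for platform and genre
compare_cp.exact('platform','platform',label='platform')
compare_cp.exact('genre','genre',label='genre')

# One row per candidate pair, one column per comparison
features = compare_cp.compute(candidate_pairs, df)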

In [None]:
features.head()

We have our results! Now, we need a way to decide whether the pairs are actual matches or should be discarded. We can do this in multiple ways:
* Simply use the Jaro-Winkler metric alone.
* Use the Jaro-Winkler metric plus a combination of the other values. 
* If we manually inspected and labelled the data, we could actually create a machine learning model that would help us pick the best combination of variables and their values to determine matches.

In [None]:
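# Extra comparison columns, used only by the stricter alternative below
column_list = ['platform','genre']

# Keep the pairs whose name similarity exceeds the threshold
matches = features[(features['name'] > 0.85)]

# Stricter alternative: also require the platform or the genre to match
#matches = features[(features['name'] > 0.85) & (features[column_list].sum(axis=1) >= 1)]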

For now, we simply apply an arbitrary threshold of 0.85 to the Jaro-Winkler similarity.

To move from pairs to groups, we can use the NetworkX library. It allows us to efficiently connect the individual pairs and generate a list of sets, each containing the games of one group.
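
What "connecting the pairs" means can be shown without any library. The sketch below groups linked names with a small union-find structure; it illustrates the idea of connected components and is not how NetworkX implements it. The function name `group_pairs` and the sample pairs are made up for this example.

```python
def group_pairs(pairs):
    """Group items linked by pairs into connected components (union-find)."""
    parent = {}

    def find(item):
        # Follow parent links up to the representative of the item's group
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # shorten the path as we go
            item = parent[item]
        return item

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)
    return list(groups.values())


sample_pairs = [
    ("Pokemon Red", "Pokemon Gold"),
    ("Pokemon Gold", "Pokemon Crystal"),
    ("FIFA 15", "FIFA 16"),
]
for group in group_pairs(sample_pairs):
    print(sorted(group))
```

Note that grouping is transitive: "Pokemon Red" and "Pokemon Crystal" end up together even though they were never matched directly. The same chaining effect can pull unrelated titles into one group through a sequence of borderline matches.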

In [None]:
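# Each matched pair becomes an edge between two game names
edges = matches.index.tolist()

G = nx.Graph()
G.add_edges_from(edges)

# Every connected component is one group of related titles
groups = list(nx.connected_components(G))

groups[:5]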

The code seems to be working, although some false positives can be seen.

In [None]:
group_ids = {name:k for k,comp in enumerate(groups) for name in comp}

groups_df = pd.DataFrame.from_dict(group_ids, orient='index', columns=['group']).rename_axis('name').reset_index()

In [None]:
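# Drop the name index so that 'name' is only a regular column
df = df.reset_index(drop=True)

# Attach the group ID to every game
merged = df.merge(groups_df, on='name', how='left')

# Games without any match form a group of their own, labelled by their name
merged['group'] = merged['group'].fillna(merged['name'])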

In [None]:
# Label each group with the name of its best-selling title
top_sales = merged.groupby(['group', 'name'], as_index=False)['global_sales'].max()
top_sales = top_sales.sort_values(by='global_sales', ascending=False)
top_sales = top_sales.drop_duplicates(subset=['group'], keep='first')

# Keep only the mapping from group to label
top_sales = top_sales.drop(columns='global_sales')
top_sales = top_sales.rename(columns={'name':'name_rl'})

# Attach the group label to every game
df_rl = merged.merge(top_sales, on='group', how='left')

## Testing the Results <a id="6"></a>

In [None]:
df_rl.name_rl.nunique()

It looks like we managed to reduce the number of unique game names by roughly half!

## Final Considerations <a id="7"></a>

Overall, the approach is effective at finding matching names, reducing the number of unique titles in the dataset by roughly half.

The library and approach discussed above are domain agnostic, so they can be applied to a wide variety of projects. They can also be used for related problems, such as finding duplicate records within a dataset or matching records when merging two datasets.

However, it is worth noting that the method has its flaws:
1. It assumes that games in the same franchise will have similar names. While this assumption is usually correct based on the logic discussed in the indexing step, there are exceptions. Consider the following games in the Star Wars universe: [X-Wing](https://en.wikipedia.org/wiki/Star_Wars:_X-Wing), [Clone Wars Adventures](https://en.wikipedia.org/wiki/Clone_Wars_Adventures), [Vader Immortal](https://en.wikipedia.org/wiki/Darth_Vader#Virtual_reality_game). The names have no relation to each other, and the approach discussed would not be able to group them.

2. There is no effective way to assess the quality of the results other than manually inspecting them.
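
Manual inspection can at least be made systematic by reviewing a reproducible random sample of groups instead of browsing them ad hoc. The helper below is a hypothetical sketch of that idea: the function name, the default sample size and the seed are arbitrary choices, and `groups` is assumed to be a list of sets like the one produced in the Pair Comparison section.

```python
import random


def sample_groups_for_review(groups, k=20, seed=0):
    """Pick up to k multi-title groups at random for manual inspection."""
    # Single-title groups contain no match decision, so skip them
    candidates = sorted(sorted(group) for group in groups if len(group) > 1)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(candidates, min(k, len(candidates)))


example_groups = [
    {"Pokemon Red", "Pokemon Gold"},
    {"Tetris"},
    {"FIFA 15", "FIFA 16", "FIFA 17"},
]
for group in sample_groups_for_review(example_groups, k=2):
    print(group)
```

The share of sampled groups that turn out to be genuine franchises gives a rough estimate of how often the grouping is right, although it says nothing about franchises that were missed.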

It is also worth noting that the Python Record Linkage Toolkit is not the only library that exists for matching entities. Two other popular libraries are Dedupe and Fuzzywuzzy.