In [None]:
# Bruno Vieira Ribeiro

In this project we will go through our collected data from [Ludopedia](https://ludopedia.com.br/) ranking of boardgames. We'll perform basic data cleaning and EDA.

Columns in our dataset:
* 'age': recommended age to play the game
* 'artist': Names of artists that worked on the game (separated by commas)
* 'designer': Names of game designers (separated by commas)
* 'dominio': Domain of the game (Expert, Family or Child)
* 'imagem': URL to an image of the game cover
* 'mecanicas': List of mechanics involved in the game
* 'media': mean score given by users
* 'notaRank': score given by website (Bayesian average)
* 'notas': Amount of users giving score to game
* 'numOfPlayers': Number of players that can play the game
* 'position': Position in ranking
* 'timeOfPlay': Estimated time of single play through
* 'title': title of game
* 'year': year of release

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [None]:
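# Set style and default figure size in one call; a separate sns.set(...) would reset the style to its default
sns.set_theme(style="whitegrid", rc={'figure.figsize': (10, 6)})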

# Reading and cleaning the data

In [None]:
df = pd.read_csv('../input/ludopedia-rank/all_pages2.csv')

In [None]:
df.head()

In [None]:
df.columns

We can reorder the columns just to get a better visual of the data. Namely, I would like to see the `position` and `title` columns up front.

In [None]:
cols = ['position', 'title', 'year', 'notaRank', 'media', 'notas', 'age',
        'numOfPlayers', 'timeOfPlay', 'dominio', 'mecanicas', 'artist', 'designer',
        'imagem']

df = df[cols]

In [None]:
df.head()

## `position` column

The `position` column has a 'º' character that serves no purpose in this context, so we can remove it and convert the column to a numeric type:

In [None]:
df['position'] = df['position'].apply(lambda s: s.replace("º",""))

In [None]:
df['position'] = pd.to_numeric(df['position'])

Now, to get some general info:

In [None]:
df.info()

## `age` column

**Question**: do all entries in the `age` column have a `+` string at the end?

To answer this, we can count the number of elements in this column containing the string `+`:

In [None]:
# regex=False matches a literal '+'; as a regex, ' +' would mean "one or more spaces"
len(df.loc[df['age'].str.contains('+', regex=False)])

So: yes, they all do! Let's get rid of that and all white spaces:

In [None]:
df['age'] = df['age'].apply(lambda s: s.replace("+",""))
df['age'] = df['age'].apply(lambda s: s.replace(" ",""))

In [None]:
df['age']

Looking good, but we still need to convert these to numeric:

In [None]:
df['age'] = pd.to_numeric(df['age'])
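
A note on checking for the `+` sign: `str.contains` treats its pattern as a regular expression by default, so a literal `+` needs `regex=False` (or an escaped pattern). The sketch below uses a made-up toy series, not the real `age` column, to show the difference:

```python
import pandas as pd

# Toy stand-in for the raw `age` column (illustrative values, not from the dataset)
ages = pd.Series(["14 +", "8 +", "10 anos"])

# Default regex=True: ' +' means "one or more spaces",
# so every entry containing a space matches, plus sign or not
spaces_mask = ages.str.contains(" +")

# regex=False looks for the literal character '+'
plus_mask = ages.str.contains("+", regex=False)

n_with_plus = int(plus_mask.sum())
all_have_plus = bool(plus_mask.all())
```

Comparing `n_with_plus` with `len(ages)` (or checking `all_have_plus`) answers the question directly.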

We can do a simple bar plot to see how many games we have per age recommendation:

In [None]:
sns.countplot(y = 'age', data = df,
              order = df['age'].value_counts().index);

Also, let's check distribution of scores by age:

In [None]:
fig, axs = plt.subplots(2, 1, sharex=True)

sns.boxplot(x='age', y = 'notaRank', data = df, ax = axs[0])
sns.boxplot(x='age', y = 'media', data = df, ax = axs[1]);

## `numOfPlayers` column

All elements of this column have a `jogadores` string at the end. Once again, we can get rid of this:

In [None]:
df['numOfPlayers'] = df['numOfPlayers'].apply(lambda s: s.replace("jogadores","").strip())

From this column we can create two different columns: `minPlayers` and `maxPlayers`.

First let's check if all entries have the same format:

In [None]:
df['numOfPlayers']

Some entries have a range of players while others only allow for a fixed number of players. We can count how many of each we have in our data:

In [None]:
# With range
len(df.loc[df['numOfPlayers'].str.contains('a')])

In [None]:
# Without (for completeness)
len(df.loc[df['numOfPlayers'].str.contains('a') == False])

A quick way of dealing with both types of entries (a single number or a range) is to define `minPlayers` as the first character of the `numOfPlayers` entry and `maxPlayers` as the remaining string after the character `a` (if there is a character `a`). Note that this assumes the minimum number of players is a single digit.

In [None]:
df['minPlayers'] = df['numOfPlayers'].apply(lambda s: int(s[0]))

df['maxPlayers'] = df['numOfPlayers'].apply(lambda s: s[s.find('a')+2:] if len(s)>1 else s[0])
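
As an alternative sketch, we could extract the numbers with a regular expression instead of slicing by character position. This also handles two-digit minimums and returns integers for both columns. The entries below are made up for illustration and assume the `N a M` / `N` shapes described above:

```python
import pandas as pd

# Toy entries in the two shapes described above (illustrative values)
num_players = pd.Series(["2 a 4", "2", "1 a 10", "10 a 12"])

# Pull every run of digits out of each entry:
# one number for a fixed count, two numbers for a range
numbers = num_players.str.findall(r"\d+")

# First number is the minimum, last number is the maximum
min_players = numbers.str[0].astype(int)
max_players = numbers.str[-1].astype(int)
```

For a fixed player count the first and last number coincide, so no special case is needed.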

In [None]:
df.head()

In [None]:
sns.countplot(data=df, x='maxPlayers');

In [None]:
sns.countplot(data=df, x='minPlayers');

## `year` column

This one has all years between parentheses. We can just use the `strip` method to remove them and later convert to numeric:

In [None]:
df['year'] = df['year'].apply(lambda s: s.strip('( )'))
df['year'] = pd.to_numeric(df['year'])

In [None]:
df.info()

Since we are looking at this column, we can do a quick **countplot** to check the number of released games per year:

In [None]:
plt.figure(figsize=(20,6))
plt.xticks(rotation=90)
sns.countplot(x='year', data = df);

After the turn of the 2000s, there was a huge boom in boardgame releases, with a peak in 2015.

## `timeOfPlay` column

These entries are all measured in **min**. We can remove this substring and convert everything to numeric:

In [None]:
len(df.loc[df['timeOfPlay'].str.contains('min')])

In [None]:
df['timeOfPlay'] = df['timeOfPlay'].apply(lambda s: s.replace(" min",""))

In [None]:
df['timeOfPlay'] = pd.to_numeric(df['timeOfPlay'])
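
If we were not sure that every entry follows the `N min` pattern, one option is to let `pd.to_numeric` turn anything unparseable into `NaN` and then inspect those rows. A minimal sketch on made-up values (the `"n/a"` entry is hypothetical):

```python
import pandas as pd

# Toy stand-in for the raw `timeOfPlay` column (illustrative values)
raw_times = pd.Series(["30 min", "120 min", "1000 min", "n/a"])

minutes = pd.to_numeric(
    raw_times.str.replace(" min", "", regex=False),
    errors="coerce",  # unparseable entries become NaN instead of raising
)

# Entries that could not be converted, kept for manual inspection
unparsed = raw_times[minutes.isna()]
```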

We can plot a simple histogram to check the distribution of `timeOfPlay`:

In [None]:
sns.histplot(data = df, x = 'timeOfPlay');

In [None]:
unique_times = df['timeOfPlay'].unique()
np.sort(unique_times)

In [None]:
sns.boxplot(x='timeOfPlay', y = 'notaRank', data = df)
plt.xticks(rotation=90);

Which game has a play time of 1000 min?

In [None]:
df[df['timeOfPlay'] == 1000]

From the game's (The 7th Continent) description:
> Unlike most board games, it will take you many, MANY hours of exploring and searching the seventh continent until you eventually discover how to remove the curse(s)...or die trying.
The 7th Continent features an easy saving system so that you can stop playing at any time and resume your adventure later on, just like in a video game!

## Sorting by `position` column

In [None]:
df = df.sort_values(by=['position'])
df = df.reset_index(drop=True)

In [None]:
df.head()

## `dominio` column

We'll start by checking the distribution of `notaRank` for all unique `dominio`s:

In [None]:
sns.boxplot(x='dominio', y = 'notaRank', data = df);

In [None]:
df['dominio'].unique()

Well, it appears we have **mixed** entries here. There are only three possible domains for a game in the original site:
* Expert
* Family
* For children.

Let's try to avoid these mixed categories by doing a case to case analysis.

First off, we can count the number of games in each possible entry in the `dominio` column:

In [None]:
df['dominio'].value_counts()

Dealing with multiple entries in `dominio`:

In [None]:
df[df['dominio']=='Jogos Familiares,Jogos Expert,Jogos Infantis']

This game is classified as `Family game` in [BGG](https://boardgamegeek.com/boardgame/382/heimlich-co). So, let's change the entry:

In [None]:
df.at[1798, 'dominio'] = 'Jogos Familiares'

In [None]:
df.at[1798, 'dominio']

In [None]:
df['dominio'].value_counts()

Great! Let's, now, deal with the `Jogos Infantis,Jogos Familiares` case:

In [None]:
df[df['dominio']=='Jogos Infantis,Jogos Familiares']

Again, going to [BGG](https://boardgamegeek.com/boardgame/217362/frogriders), we choose to classify `Frogriders` as a family game:

In [None]:
df.at[1020, 'dominio'] = 'Jogos Familiares'
df.at[1020, 'dominio']

The other game in this list is a national production (with no BGG entry to consult). We will set it to the `Jogos Infantis` class, as it is recommended for ages 5+:

In [None]:
df.at[690, 'dominio'] = 'Jogos Infantis'
df.at[690, 'dominio']

On to the next cases:

In [None]:
df['dominio'].value_counts()

In [None]:
df[df['dominio']=='Jogos Expert,Jogos Familiares']

First thing that stands out here is `War`. It has been around long enough and played enough for us to consider it a family game. Since it has the same age recommendation as Catan and this awesome grandparent of modern board games has a complexity ranking of 2.32/5 (see [BGG](https://boardgamegeek.com/boardgame/13/catan)), we will set both as family games.

In [None]:
df.at[147, 'dominio'] = 'Jogos Familiares'
df.at[2016, 'dominio'] = 'Jogos Familiares'
print(df.at[147, 'dominio'], df.at[2016, 'dominio'])

The game **O Bom do Videogame** is another national entry with a complexity rating of 2.5/5 at [BGG](https://boardgamegeek.com/boardgame/234432/o-bom-do-videogame). We will set it as a family game also:

In [None]:
df.at[255, 'dominio'] = 'Jogos Familiares'
df.at[255, 'dominio']

The final game in this list is trickier. The game **Quilombolas – O Refúgio dos Palmares** is an awesome looking game (not yet fully released). It has an important historical theme and some interesting combinations of mechanics. We will choose to set it as an expert game for the depth of its historical theme and dynamic rules.

In [None]:
df.at[602, 'dominio'] = 'Jogos Expert'
df.at[602, 'dominio']

Now, on to the final multiclassified domains:

In [None]:
df['dominio'].value_counts()

In [None]:
df[df['dominio']=='Jogos Familiares,Jogos Infantis']

All 4 of these will be classified as `Infantis` as they are very kid friendly with a low complexity rating on BGG.

In [None]:
df.at[1110, 'dominio'] = 'Jogos Infantis'
df.at[1361, 'dominio'] = 'Jogos Infantis'
df.at[2136, 'dominio'] = 'Jogos Infantis'
df.at[2438, 'dominio'] = 'Jogos Infantis'
print(df.at[1110, 'dominio'], df.at[1361, 'dominio'], df.at[2136, 'dominio'], df.at[2438, 'dominio'])

Finally, all games are classified in one of three possible **domains**:

In [None]:
df['dominio'].value_counts()
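
The manual fixes above address rows by their index label, which only stays valid as long as the sorting and index do not change. As a hedged alternative, the same decisions could be keyed by title. The sketch below uses a toy frame; the "Hypothetical" titles and their classes are made up for illustration:

```python
import pandas as pd

# Toy frame (only the Frogriders decision comes from the text above)
games = pd.DataFrame({
    "title": ["Frogriders", "Hypothetical Game A", "Hypothetical Game B"],
    "dominio": ["Jogos Infantis,Jogos Familiares", "Jogos Expert", None],
})

# Manual decisions keyed by title instead of row label
manual_dominio = {
    "Frogriders": "Jogos Familiares",
    "Hypothetical Game B": "Jogos Expert",
}

for title, dominio in manual_dominio.items():
    mask = games["title"] == title
    # Fail loudly if a title is missing or ambiguous rather than editing the wrong row
    assert mask.sum() == 1, f"expected exactly one row for {title!r}"
    games.loc[mask, "dominio"] = dominio
```

Keeping the decisions in one dictionary also documents them in a single place.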

Let's check for missing values of the `dominio` feature:

In [None]:
df[df['dominio'].isnull()]

The first game in this list is a collection of **Carson City** and its expansions. The base game is classified as `expert`, as can be seen here:

In [None]:
df[df['title']=='Carson City']

So, this collection will also be in this class:

In [None]:
df.at[279, 'dominio'] = 'Jogos Expert'
df.at[279, 'dominio']

Next, we have **Futebol de botão**. Most Brazilian kids know (and love) this game. It is definitely challenging and tons of fun to play (commonly with your dad or grandad). So, we'll choose to set its domain as a family game:

In [None]:
df.at[902, 'dominio'] = 'Jogos Familiares'
df.at[902, 'dominio']

Next, we have **Lotus**. Following [BGG](https://boardgamegeek.com/boardgame/198525/lotus), we'll classify it as a family game:

In [None]:
df.at[1176, 'dominio'] = 'Jogos Familiares'
df.at[1176, 'dominio']

Following this, we have `Stratego (Revised Edition)`. Let's find the original edition and reuse its classification:

In [None]:
df[df['title'] == 'Stratego']

In [None]:
df.at[2075, 'dominio'] = 'Jogos Familiares'
df.at[2075, 'dominio']

Since `Afluentes` has an age recommendation of 12, we will set it as a family game:

In [None]:
df.at[2357, 'dominio'] = 'Jogos Familiares'
df.at[2357, 'dominio']

`Toru` is a very fast paced party game, so we'll classify it as a family game also:

In [None]:
df.at[2388, 'dominio'] = 'Jogos Familiares'
df.at[2388, 'dominio']

Let's check if we are done:

In [None]:
df[df['dominio'].isnull()]

Finally! To close this column cleaning, we can create a countplot for all three domains:

In [None]:
sns.countplot(y = 'dominio', data = df);

And back to our boxplot for distributions of `notaRank`:

In [None]:
sns.boxplot(x='dominio', y = 'notaRank', data = df);

Great! Now to inspect further missing values:

## `mecanicas` column

In [None]:
df.isnull().sum()

Doing a similar case to case analysis, we can look into the missing `mecanicas` entry:

In [None]:
df[df['mecanicas'].isnull()]

The first game is **Rhino Hero: Super Battle**. In [BGG](https://boardgamegeek.com/boardgame/218333/rhino-hero-super-battle), there is a list of mechanics for this game. We can compare it with the official Ludopedia [mechanics list](https://www.ludopedia.com.br/mecanicas) and see what we can use:

According to BGG, the mechanics are:
* Dice Rolling
* Single Loser Game
* Stacking and Balancing

There is only one corresponding mechanic in Ludopedia:
* Rolagem de Dados

So, we will use this in our dataframe:

In [None]:
df.at[1093, 'mecanicas'] = 'Rolagem de Dados'
df.at[1093, 'mecanicas']

Next, we have **Timeline: Discoveries**. According to the game description:
> Players take turns placing a card from their hand in a row on the table.

So, we will set its mechanics as `Gestão de Mão`, which seems a good fit.

In [None]:
df.at[1235, 'mecanicas'] = 'Gestão de Mão'
df.at[1235, 'mecanicas']

Next: **Jenga Tetris**. Again, we look to [BGG](https://boardgamegeek.com/boardgame/145259/jenga-tetris) for any clues. In there, we find the mechanics to be `Push Your Luck`, which actually has a very nice correspondent in Ludopedia (`Force sua sorte`).

In [None]:
df.at[2317, 'mecanicas'] = 'Force sua sorte'
df.at[2317, 'mecanicas']

Next: **Jogo dos conquistadores**.

According to [BGG](https://boardgamegeek.com/boardgame/24069/jogo-dos-conquistadores), the mechanics are:
* Area Movement
* Dice Rolling
* Variable Phase Order

Which have equivalents in Ludopedia as:
* Movimento de Área
* Rolagem de Dados
* Ordem de Fases Variável

So, we set all this in the dataframe.

In [None]:
df.at[2347, 'mecanicas'] = 'Movimento de Área,Rolagem de Dados,Ordem de Fases Variável'
df.at[2347, 'mecanicas']

Since we are peeking into this game, we notice it has a missing `designer` entry. However, the designers are listed in BGG, so we can include them:

In [None]:
df.at[2347, 'designer'] = 'Sérgio Halaban,André Zatz'
df.at[2347, 'designer']

We will skip to the last game for now (reasons will become clear later). The last game in this list is **Pick Up Sticks**. In [BGG](https://boardgamegeek.com/boardgame/6424/pick-sticks), the listed mechanics are
* Physical Removal
* Push Your Luck
* Set Collection
* Take That

Which have equivalents in Ludopedia as:
* -
* Force sua sorte
* Colecionar Componentes
* Toma essa

So, we will use these in our dataframe:

In [None]:
df.at[2513, 'mecanicas'] = 'Force sua sorte,Colecionar Componentes,Toma essa'
df.at[2513, 'mecanicas']

Ok, so the game we skipped: `Clube Grow` is a collection of classic games (according to the description: Mico, Sobe-Desce, Ludo, Trilha, Resta 1, Damas Chinesas, Mini Can-can, Mega Trunfo e Gamão). Because it is such a collection and has three missing features, we will drop it from our dataframe. First we will create a copy to keep all games, then we will use the new copy to start dropping and making more changes:

In [None]:
df_ludo = df.copy()

In [None]:
df_ludo.drop(2370, inplace = True)

Check if everything worked out:

In [None]:
df_ludo[df_ludo['mecanicas'].isnull()]

Reset the indexes:

In [None]:
df_ludo = df_ludo.reset_index(drop=True)

## `artist` and `designer` columns

What else is missing:

In [None]:
df_ludo.isnull().sum()

**NOTE: some games have `(Uncredited)` as designer entry.**

First off, we can explore games with both artist and designer information missing:

In [None]:
# Games with no artist AND no designer info
df_ludo[df_ludo['artist'].isnull() & df_ludo['designer'].isnull()]

We can use `np.where` to assign the string 'No info' to the designer feature for every game without both designer and artist data:

In [None]:
df_ludo['designer'] = np.where(df_ludo['artist'].isnull() & df_ludo['designer'].isnull(),
                               'No info',
                               df_ludo['designer'])

df_ludo['artist'] = np.where(df_ludo['designer'] == 'No info',
                               'No info',
                               df_ludo['artist'])

In [None]:
df_ludo.isnull().sum()

Next, we check for games with no artist information:

In [None]:
# Games with no artist AND WITH designer info
df_ludo[df_ludo['artist'].isnull() & df_ludo['designer'].notnull()]

For these games, we will fill the missing values for `artist` with the string 'Missing artist'.

In [None]:
df_ludo['artist'] = np.where(df_ludo['artist'].isnull() & df_ludo['designer'].notnull(),
                               'Missing artist',
                               df_ludo['artist'])

Finally, games with no designer:

In [None]:
# Games WITH artist AND no designer info
df_ludo[df_ludo['artist'].notnull() & df_ludo['designer'].isnull()]

Same as before, we will use the string 'Missing designer'.

In [None]:
df_ludo['designer'] = np.where(df_ludo['artist'].notnull() & df_ludo['designer'].isnull(),
                               'Missing designer',
                               df_ludo['designer'])

Now, let's inspect missing values:

In [None]:
df_ludo.isnull().sum()
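
The three cases handled above (both missing, only artist missing, only designer missing) can be summarised in one small sketch. It computes the masks once, before any placeholder is written, so the order of the assignments does not matter. The names in the toy frame are made up:

```python
import numpy as np
import pandas as pd

# Toy frame covering all four combinations (illustrative names)
credits = pd.DataFrame({
    "artist":   [None, None, "Artist X", "Artist Y"],
    "designer": [None, "Designer P", None, "Designer Q"],
})

# Compute the masks up front, before any placeholder is filled in
no_artist = credits["artist"].isnull()
no_designer = credits["designer"].isnull()
both_missing = no_artist & no_designer

credits["artist"] = np.where(
    both_missing, "No info",
    np.where(no_artist, "Missing artist", credits["artist"]),
)
credits["designer"] = np.where(
    both_missing, "No info",
    np.where(no_designer, "Missing designer", credits["designer"]),
)
```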

# EDA

## Unique values

Great! Now we can explore some numbers.

It is not straightforward to see the number of unique artists (or designers or mechanics...). But we can use a helper function to do so:

In [None]:
def uniques(feature):
    '''
    Takes a string 'feature' and returns a set of unique
    features in the entire df.
    '''
    unique_feat = set()
    for i in df_ludo[feature]:
        feats = set(i.split(','))
        unique_feat.update(feats)

    print('Number of unique '+feature+':' ,len(unique_feat))
    return unique_feat

Basically, the function loops through the `feature` column and updates a set containing all entries within each row. As sets don't allow for duplicates, we end up with an iterable containing only unique items of that feature. The function returns this set and prints the number of items in it (using `len()`).
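
For comparison, the same idea can be sketched without an explicit loop by splitting each entry and calling `explode`. The function name `uniques_explode` and the toy names are hypothetical, and the extra `str.strip()` is an assumption that stray spaces around commas should be ignored:

```python
import pandas as pd

# Toy frame with comma-separated names (illustrative values)
toy = pd.DataFrame({
    "designer": ["Ana,Bruno", "Bruno", "Carla,Ana"],
})

def uniques_explode(frame, feature):
    """Return the set of distinct comma-separated values in frame[feature]."""
    # One row per individual value, with surrounding whitespace removed
    values = frame[feature].str.split(",").explode().str.strip()
    return set(values)

unique_designers_toy = uniques_explode(toy, "designer")
```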

In [None]:
unique_designers = uniques('designer')

In [None]:
unique_artists = uniques('artist')

In [None]:
unique_mecanicas = uniques('mecanicas')

Now, I would like to see the distribution of, say, the average score given by users (the `media` column) for games with a specific mechanic.

For this, we can write another helper function:

In [None]:
def make_dist(df, feature):
    '''
    Function to plot a histogram of the distribution of 'feature' within 'df'.
    '''
    # Check distribution of feature:
    mean_feat = df[feature].mean()
    num_of_values = len(df)

    fig = px.histogram(df, x=feature,
                       title='Mean value of ' +feature+ ': ' +str(round(mean_feat,2))+
                           ' - (with '+str(num_of_values)+' games)',
                       opacity = 0.6)

    fig.show()

We test out this function to see the distribution of `media` in all games with the `Cooperativo` mechanic:

In [None]:
make_dist(df_ludo.loc[df_ludo['mecanicas'].str.contains('Cooperativo')], 'media')

Another interesting quantity to explore is how many games contain each unique mechanic (`mecanica`).

We can create a dictionary to count the appearances of each unique mechanic in the dataframe:

In [None]:
count_mecanicas = {}
for mec in unique_mecanicas:
    # regex=False: treat the mechanic name as plain text, not as a regular expression
    count_mecanicas[mec] = len(df_ludo.loc[df_ludo['mecanicas'].str.contains(mec, regex=False)])

In [None]:
count_mecanicas
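
Substring matching can overcount if one mechanic's name happens to be contained in another's. A possible alternative, sketched here on a toy frame, is to split the entries and count exact values with `value_counts`. Note that a game listing the same mechanic twice would be counted twice in this version:

```python
import pandas as pd

# Toy frame reusing mechanic names that appear earlier in this notebook
toy = pd.DataFrame({
    "mecanicas": [
        "Rolagem de Dados,Force sua sorte",
        "Rolagem de Dados",
        "Gestão de Mão,Rolagem de Dados",
    ],
})

# One row per (game, mechanic) pair, then count how often each mechanic appears
mechanic_counts = (
    toy["mecanicas"]
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)
```

The result is already sorted in descending order, so it can be passed to a bar plot directly.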

To better visualize this, we can create a dataframe object and plot it with seaborn as a barplot:

In [None]:
mec_df = pd.DataFrame(count_mecanicas.items(), columns=['mecanica', 'games_count'])
# Sorting
mec_df = mec_df.sort_values(by=['games_count'], ascending = False)

In [None]:
plt.figure(figsize=(10,20))
sns.barplot(x = 'games_count', y = 'mecanica', data = mec_df);

We can get a similar plot for the designers, however, since there are 1526 unique designers, we can filter the top ones. Since we can do the same for artists, let's create a helper function for the process.

In [None]:
# Define a function for this
def plot_uniques(df, unique_values, feature, max_plot=20):
    '''
    Plots a horizontal bar plot for the value count of unique features.
    Returns a dataframe with two columns: [feature, games_count]
    This second column counts the number of games containing that feature.
    
    '''
    count_feat = {}
    for val in unique_values:
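        # str.match only tests the start of the string and reads val as a regex;
        # a literal "contains" check counts every game listing this value
        count_feat[val] = len(df.loc[df[feature].str.contains(val, regex=False)])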
    
    feat_df = pd.DataFrame(count_feat.items(), columns=[feature, 'games_count'])
    feat_df = feat_df.sort_values(by=['games_count'], ascending = False)
    
    plt.figure(figsize=(10,10))
    sns.barplot(x = 'games_count', y = feature, data = feat_df.head(max_plot))
    
    return feat_df

In [None]:
designer_df = plot_uniques(df_ludo, unique_designers, 'designer', max_plot=20)

For top 20 artists:

In [None]:
artist_df = plot_uniques(df_ludo, unique_artists, 'artist', max_plot=20)

# Relations between games (getting into graph building)

We are now going to see how these games relate to each other by sharing features. For instance, we would like to see how artists work for various games and what kind of game they collaborate with the most.

Let's first build out our helper functions:

In [None]:
def common_member(a, b):
    '''Check if list a and list b have at least one common member'''
    return not set(a).isdisjoint(b)

In [None]:
def shares_feature_id_iter(df, idx, feature):
    '''Returns a string with indexes of games having
       at least one shared feature separated by commas'''
    print(idx)
    shared = ''
    feat_list = df.iloc[idx][feature]
    feat_list = list( feat_list.split(',') )
    for i, row in df.iterrows():
        if i == idx:
            continue
        if common_member(feat_list, list(row[feature].split(','))):
            shared += str(i)
            shared += ','
    
    print('Shared ', feature, ' - Done index ',idx)
    return shared[:-1]

def shared_feat_id_series_iter(df, feature):
    return pd.Series([
        shares_feature_id_iter(df, idx, feature) for idx in df.index
    ])
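
The pairwise comparison above checks every game against every other game. A possibly faster sketch builds an inverted index first (value to rows that mention it) and then collects neighbours from it. The function name `shared_feature_ids` and the toy data are hypothetical; the output format (comma-separated row labels, empty string when nothing is shared) follows the functions above:

```python
from collections import defaultdict

import pandas as pd

# Toy frame with comma-separated artist names (illustrative values)
toy = pd.DataFrame({
    "artist": ["Ana,Bruno", "Bruno", "Carla", "Ana,Carla"],
})

def shared_feature_ids(frame, feature):
    """For each row, list the labels of other rows sharing at least one value."""
    # Map each individual value to the set of rows that mention it
    rows_by_value = defaultdict(set)
    for idx, entry in frame[feature].items():
        for value in entry.split(","):
            rows_by_value[value].add(idx)

    shared = {}
    for idx, entry in frame[feature].items():
        neighbours = set()
        for value in entry.split(","):
            neighbours |= rows_by_value[value]
        neighbours.discard(idx)  # a game does not share with itself
        shared[idx] = ",".join(str(i) for i in sorted(neighbours))
    return pd.Series(shared)

shared_toy = shared_feature_ids(toy, "artist")
```

Placeholder values such as 'No info' would still link all games that carry them, exactly as in the loop-based version.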

Now we can use these functions to create a `shared_artists` column with the indexes of games that share at least one artist with each game in a given row. **This will take some time.**

In [None]:
df_ludo['shared_artists'] = shared_feat_id_series_iter(df_ludo, 'artist')

In [None]:
df_ludo.tail()

In [None]:
df_ludo['shared_designers'] = shared_feat_id_series_iter(df_ludo, 'designer')

In [None]:
df_ludo.tail()

## Shared artists

We can inspect how many games don't share artists with other games:

There are 306 games with no shared artists.

Besides these, there are games with the entries **No info** or **Missing artist**. Let's check how many of these we have:

In [None]:
df_ludo[(df_ludo['artist']=='No info') | (df_ludo['artist']=='Missing artist')]

There are 458 games within this filtered df. Let's get all conditions out in one call:

In [None]:
df_ludo[(df_ludo['artist']=='No info') | (df_ludo['artist']=='Missing artist') | (df_ludo['shared_artists']=='')]

So, the remaining dataframe that is of interest for analyzing the shared artists can be obtained with:

In [None]:
# Df for relevant info on shared artists
# (.copy() so that columns can be added to df_sa later without a SettingWithCopyWarning)
df_sa = df_ludo[~((df_ludo['artist']=='No info') 
        | (df_ludo['artist']=='Missing artist') 
        | (df_ludo['shared_artists']==''))].copy()

In [None]:
df_sa.head()

With these new columns, we can create an `edges` object in the format required by [Dash Cytoscape](https://dash.plotly.com/cytoscape).

In [None]:
edges_art = []

for i, row in df_sa.iterrows():
    for shared in list( row['shared_artists'].split(',') ):    
        if int(shared) > i:
            artists = list(set(row['artist'].split(',')).intersection(df_ludo.iloc[int(shared)]['artist'].split(',')))
            artists = ','.join([str(item) for item in artists])
            temp_dict = {'data': {'id':str(i)+'-'+shared, 'source': str(i), 'target':str(shared), 'shared':artists}}
            edges_art.append(temp_dict)

In [None]:
len(edges_art)
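
To make the edge structure concrete, here is the same loop on a three-game toy frame. The `j > i` check keeps each undirected pair only once, from the side of the lower index. Artist names and the precomputed `shared_artists` strings are made up:

```python
import pandas as pd

# Toy frame: game 0 shares Bruno with game 1 and Ana with game 2
toy = pd.DataFrame({
    "artist": ["Ana,Bruno", "Bruno", "Ana,Carla"],
    "shared_artists": ["1,2", "0", "0"],
})

edges = []
for i, row in toy.iterrows():
    for shared in row["shared_artists"].split(","):
        j = int(shared)
        if j > i:  # keep each undirected pair once
            common = set(row["artist"].split(",")) & set(toy.loc[j, "artist"].split(","))
            edges.append({
                "data": {
                    "id": f"{i}-{j}",
                    "source": str(i),
                    "target": str(j),
                    "shared": ",".join(sorted(common)),
                }
            })
```

Each element is a dictionary with a `data` key holding `id`, `source` and `target`, plus the extra `shared` field used here to label the edge.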

And we need some nodes for our graph in cytoscape:

In [None]:
# Using only games that have an edge with at least one game (node)
nodes_art = [
    {
        "data": {
            "id": str(i),
            "label": str(i),
            "title": row['title'],
            "year": row['year'],
            "notarank": row['notaRank'],
            "media": row['media'],
            "age": row['age'],
            "players": row['numOfPlayers'],
            "time": row['timeOfPlay'],
            "dominio": row['dominio'],
            "img-src": row['imagem'],
            "edges": list( row['shared_artists'].split(',') ),
            "conns": len(list( row['shared_artists'].split(',') )),
            "node_size": np.sqrt(len(list( row['shared_artists'].split(',') )))
        },
        "selectable": True,
        "grabbable": False,
    }
    for i, row in df_sa.iterrows()
]

## Visualizing shared artists

Now that we have a list (not actually a `python list object`) of games that shared artists with each game, we can plot a distribution of the number of games with shared artists.

To do so, we can create a new column to store the number of games with shared artists by using a `len()` call on a list created from the 'shared_artists' column:

In [None]:
df_ludo['games_with_sa'] = df_ludo['shared_artists'].apply(lambda s: len(s.split(',')) if s!= '' else 0)

In [None]:
df_ludo.head()

In [None]:
sns.histplot(data = df_ludo, x='games_with_sa');

This is very messy. The one bin with over 300 games sharing artists corresponds to our games with missing artists. We can use our `df_sa` to do this task and avoid these games:

In [None]:
df_sa['games_with_sa'] = df_sa['shared_artists'].apply(lambda s: len(s.split(',')) if s!= '' else 0)

In [None]:
df_sa.head()

In [None]:
sns.histplot(data = df_sa, x='games_with_sa',
             bins = 30);

In [None]:
# displot is figure-level and creates its own figure, so its size is set via height
sns.displot(data = df_sa, x='games_with_sa',
             kind='kde',
            height = 6)
plt.title("Games with at least one shared artist");

In [None]:
# df_sa[df_sa['games_with_sa']>100]

Let's see some numbers for this distribution:

In [None]:
df_sa['games_with_sa'].describe()

Thanks for reading! Stay safe all!