# Candy Crush: analysis of level popularity and difficulty

A quick exploratory data analysis of level difficulty and player engagement.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
from sklearn.linear_model import LinearRegression

matplotlib.rcParams['font.size'] = 14

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('../input/candy-crush/candy_crush.csv')
df.head()

In [None]:
df.info()

In [None]:
print(f"Num players: {df.player_id.nunique()}")

# How many levels did each player play?

In [None]:
# Number of levels played per player
player_df = df[['player_id','level']].groupby("player_id").count()
player_df.rename({'level':'num_level_played'},inplace=True,axis=1)
sns.histplot(player_df)
plt.title('Distribution of the number of levels played per player')
_ = plt.xlabel("# levels played by a player")

# How many players played each level?

In [None]:
temp = df[['level','player_id']].groupby('level').count()
sns.barplot(data=temp, x=temp.index, y='player_id')
plt.ylabel('Number of players')

# How many attempts, successes and failures for each level?

First, I create a new dataframe with information for each level.

In [None]:
level_df = df[['level','num_attempts','num_success']].groupby('level').sum()
# Num player who won that level
temp = df[['level','num_success']].groupby('level').apply(lambda x:(x>0).sum())
level_df['num_winning_players'] = temp['num_success']
temp = df[['level','num_success']].groupby('level').apply(lambda x:(x==0).sum())
level_df['num_losing_players'] = temp['num_success']
level_df['num_players'] = level_df['num_winning_players']+level_df['num_losing_players']
# Average number of attempts to win a level: # attempts/ # success
level_df['attempts_success_ratio'] = level_df['num_attempts']/level_df['num_success']
# player_win_lose_ratio: # winning players / total # players
level_df['player_win_lose_ratio'] = level_df['num_winning_players']/(level_df['num_winning_players']+level_df['num_losing_players'])
level_df
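
The per-level aggregation above can be illustrated on a small toy table. This is only a sketch: the toy rows are made up, and it assumes (not verified for this dataset) that a player may appear in several rows for the same level, for example one row per play session. Under that assumption, collapsing to one row per (level, player) first ensures each player is counted once.

```python
import pandas as pd

# Toy data (hypothetical values): one row per player, level and play session
toy = pd.DataFrame({
    "player_id":    ["a", "a", "b", "b", "c", "c"],
    "level":        [1,   1,   1,   2,   2,   2],
    "num_attempts": [2,   1,   3,   4,   5,   1],
    "num_success":  [0,   1,   0,   1,   0,   0],
})

# Collapse to one row per (level, player) so that each player counts once
per_player = (
    toy.groupby(["level", "player_id"], as_index=False)[["num_attempts", "num_success"]]
    .sum()
)
per_player["won"] = per_player["num_success"] > 0

# Per-level totals, number of distinct players and number of winners
toy_level_df = per_player.groupby("level").agg(
    num_attempts=("num_attempts", "sum"),
    num_success=("num_success", "sum"),
    num_players=("player_id", "nunique"),
    num_winning_players=("won", "sum"),
)
toy_level_df["player_win_lose_ratio"] = (
    toy_level_df["num_winning_players"] / toy_level_df["num_players"]
)
print(toy_level_df)
```

In this toy table, player `a` needs two sessions to win level 1 but is still counted as a single winning player.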

Then I visualize the main information.

In [None]:
fig, ax = plt.subplots(5,1, figsize=(8,13), gridspec_kw={'hspace':.8})
plt.sca(ax[0])
sns.barplot(data=level_df, x=level_df.index, y='num_players')
plt.title("Number of players who attempted this level")

plt.sca(ax[1])
sns.barplot(data=level_df, x=level_df.index, y='num_success')
plt.title("total num success for each level")
plt.xlabel('')

plt.sca(ax[2])
sns.barplot(data=level_df, x=level_df.index, y='num_attempts')
plt.title("total num attempts for each level")
plt.xlabel('')

plt.sca(ax[3])
sns.barplot(data=level_df, x=level_df.index, y='attempts_success_ratio')
plt.title("num_attempts/num_success")
plt.xlabel('')

plt.sca(ax[4])
sns.barplot(data=level_df, x=level_df.index, y='player_win_lose_ratio')
plt.title("success ratio: # players who succeeded / # players who attempted")

In [None]:
level_df.describe()

## Are success, attempts, winning ratio or number of players correlated?

In [None]:
sns.heatmap(level_df.corr(),cmap='RdBu',annot=True)

## Analysis

The number of players per level is quite unequal, with a minimum of 674 and a maximum of 3373 (for level 15). Although the number of players varies a lot, the number of successes is rather constant across levels, with a mean of 705 and a standard deviation of 137. Level 15 is again an exception, with 1157 successes. The bar plots suggest that the *number of players*, the *number of attempts* and the *attempts/success ratio* are strongly positively correlated, while the *success ratio* is negatively correlated with these three variables. The *number of successes* is rather independent of these values. The heatmap of the correlation coefficients confirms this observation.

This means that, as expected, the hard levels (low *player_win_lose_ratio*) require more attempts on average to succeed. But surprisingly, the hard levels are also the most played ones. Below I show the relation between *difficulty* (taken as *1/player_win_lose_ratio*) and the *number of players*.
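
As a small worked example of this definition, the sketch below computes the difficulty as the inverse of the win ratio and its Pearson correlation with the number of players. The three levels and their values are hypothetical and only serve to show the arithmetic. Note that a level nobody wins would have a ratio of 0 and therefore an infinite difficulty.

```python
import pandas as pd

# Hypothetical per-level summary (values are made up for illustration)
toy_levels = pd.DataFrame(
    {
        "player_win_lose_ratio": [0.8, 0.5, 0.25],
        "num_players": [700, 1200, 2400],
    },
    index=pd.Index([1, 2, 3], name="level"),
)

# Difficulty is the inverse of the fraction of players who won the level
toy_levels["difficulty"] = 1.0 / toy_levels["player_win_lose_ratio"]

# Pearson correlation between difficulty and number of players
corr = toy_levels["difficulty"].corr(toy_levels["num_players"])
print(toy_levels)
print(f"correlation: {corr:.3f}")
```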

In [None]:
# Difficulty is taken as the inverse of the win ratio
difficulty = (1./level_df['player_win_lose_ratio']).rename('difficulty')

# Robust linear fit of the number of players against difficulty
sns.regplot(data=level_df, x=difficulty, y='num_players',order=1,robust=True)

The number of players seems to be linearly correlated with difficulty. If true, the number of players of level 8 is representative, while the number of players of level 15 is an outlier. My interpretation is that players are more likely to share their score with friends (e.g. on social media) when they succeed or get a high score on a hard level. Their friends then attempt the level too, and so on. Obviously, as a company, King would be interested in producing levels like 8 and 15 that bring in more players. However, easier levels are important to attract new players, and for many players to play the same hard level, the number of hard levels must not be too large (otherwise everybody would be playing different levels and the number of plays could not snowball).
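
One possible way to check which levels deviate from a linear trend is to look at the residuals of a straight-line fit. The sketch below uses an ordinary least-squares fit with `np.polyfit` (not the robust fit used in the plot above) on made-up numbers, so it only illustrates the idea and says nothing about the actual levels.

```python
import numpy as np

# Hypothetical difficulty and player counts; the third point is made to stick out
difficulty = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
num_players = np.array([1000.0, 1500.0, 4000.0, 2500.0, 3000.0, 3500.0])

# Ordinary least-squares straight line: num_players ~ slope * difficulty + intercept
slope, intercept = np.polyfit(difficulty, num_players, deg=1)

# Residual = observed value minus fitted value
residuals = num_players - (slope * difficulty + intercept)

# The point with the largest absolute residual is the strongest outlier candidate
outlier_idx = int(np.argmax(np.abs(residuals)))
print(f"slope={slope:.1f}, intercept={intercept:.1f}")
print("residuals:", np.round(residuals, 1))
print("largest deviation at index", outlier_idx)
```

An ordinary least-squares line is itself pulled toward outliers, which is one reason a robust fit can be preferable for the real data.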

# Profile of the players playing hard vs easy levels

In [None]:
A = df[['player_id','level','num_attempts']].pivot_table(index='player_id',columns='level',values='num_attempts',fill_value=False,aggfunc=(lambda x: np.sum(x)>0))


In [None]:
A.head()

In [None]:
plt.spy(A.values.astype(float)[:10,:])

# Levels most often played together

In [None]:
from scipy.sparse import csr_matrix
A_df = df[['player_id','level','num_attempts']].pivot_table(index='player_id',columns='level',values='num_attempts',fill_value=False,aggfunc=(lambda x: np.sum(x)>0))
A = csr_matrix(A_df.values.astype(float))
co_occ_mat = A.T@A


In [None]:
co_occ_mat

In [None]:
plt.imshow(co_occ_mat.todense(),vmin=0,vmax=1000)
plt.colorbar()

The above graph is the co-occurrence matrix of levels, which indicates which levels are played together by the same player. The non-zero values are grouped in blocks around the diagonal, which means that players tend to play a few levels in succession, or they might start with a level and play the ones before or after. Most players who played the first five levels did not play the last five, and vice versa. It looks like most players played five to ten levels. We also note that the players who attempted the hardest level, level 15, often did not play the earlier, easier levels. Thus, the abnormally low *player_win_lose_ratio* for level 15 may be explained by the fact that most players who attempted this level did not train on the earlier, easier levels, and thus may not have been well prepared for that challenge.
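
To make the reading of the co-occurrence matrix concrete, here is a minimal sketch on a hypothetical 3-player, 3-level indicator matrix. Entry (i, j) of `A.T @ A` counts the players who played both level i and level j, and the diagonal gives the number of players per level.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical indicator matrix: rows are players, columns are levels 1..3
# A 1 means the player attempted that level at least once
played = np.array([
    [1, 1, 0],   # player 0 played levels 1 and 2
    [1, 1, 0],   # player 1 played levels 1 and 2
    [0, 1, 1],   # player 2 played levels 2 and 3
], dtype=float)

A_toy = csr_matrix(played)

# Level-by-level co-occurrence counts
co_occ = (A_toy.T @ A_toy).toarray()
print(co_occ)
```

Here levels 1 and 3 have a co-occurrence of 0 because no toy player attempted both, which is the same kind of pattern as the empty off-diagonal corners in the plot.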

# Conclusion

Harder levels tend to be more popular. My interpretation is that players are prouder to finish a hard level, and are therefore more likely to share their success with their friends. Some casual players might attempt hard levels even if they haven't played the previous levels. Players who attempt hard levels would also typically play a few levels before or after.
Hard levels seem important to engage the community, but easier levels must also be important to keep more casual players engaged.