## 1. Candy Crush Saga

<p><a href="https://king.com/game/candycrush">Candy Crush Saga</a> is a hit mobile game developed by King (part of Activision|Blizzard) that is played by millions of people all around the world. The game is structured as a series of levels where players need to match similar candies to (hopefully) clear the level and keep progressing on the level map. If you are one of the few who haven't played Candy Crush, here's a short demo:</p>

<p align= 'middle'><a href="https://youtu.be/HGLGxnfs_t8"><img src="https://assets.datacamp.com/production/project_139/img/candy_crush_video.jpeg" style = "width: 700px"></a></p>

<p>Candy Crush has more than 3000 levels, and new ones are added every week. That is a lot of levels! And with that many levels, it's important to get <em>level difficulty</em> just right. Too easy and the game gets boring, too hard and players become frustrated and quit playing.</p>
<p>In this project, we will see how we can use data collected from players to estimate level difficulty. Let's start by loading in the packages we're going to need.</p>

In [15]:
import pandas as pd
import numpy as np

# For visualization
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
import plotly.graph_objects as go

# Own layout design library
from vizformatter.standards import layout_plotly

# Load layout base objects
sign, layout = layout_plotly(height= 720, width= 1000, font_size= 15)

## 2. The data set
<p>The dataset we will use contains one week of data from a sample of players who played Candy Crush back in 2014. The data is also from a single <em>episode</em>, that is, a set of 15 levels. It has the following columns:</p>
<ul>
<li><strong>player_id</strong>: a unique player id</li>
<li><strong>dt</strong>: the date</li>
<li><strong>level</strong>: the level number within the episode, from 1 to 15.</li>
<li><strong>num_attempts</strong>: number of level attempts for the player on that level and date.</li>
<li><strong>num_success</strong>: number of level attempts that resulted in a success/win for the player on that level and date.</li>
</ul>
<p>The granularity of the dataset is player, date, and level. That is, there is a row for every player, day, and level recording the total number of attempts and how many of those resulted in a win.</p>
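
The granularity described above can be checked directly. The following is a minimal sketch on a hypothetical toy frame (the rows are invented for illustration and are not taken from the dataset): if every row is one player, date, and level, no combination of those three keys should repeat, and wins should never exceed attempts.

```python
import pandas as pd

# Hypothetical toy rows that mimic the columns described above (not real data)
toy = pd.DataFrame({
    "player_id": ["a", "a", "b", "b"],
    "dt": ["2014-01-01", "2014-01-02", "2014-01-01", "2014-01-01"],
    "level": [1, 1, 1, 2],
    "num_attempts": [3, 2, 1, 4],
    "num_success": [0, 1, 1, 1],
})

# If the granularity is (player, date, level), no key combination repeats
key = ["player_id", "dt", "level"]
n_duplicated_keys = int(toy.duplicated(subset=key).sum())

# Wins can never exceed attempts on a given row
rows_with_more_wins_than_attempts = int((toy.num_success > toy.num_attempts).sum())

print(n_duplicated_keys, rows_with_more_wins_than_attempts)
```

The same two checks could be run on the real data frame once it is loaded, by swapping `toy` for `df`.
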
<p>Now, let's load in the dataset and take a look at the first couple of rows. </p>

In [16]:
df = pd.read_csv("datasets/candy_crush.csv")
df.head()

Unnamed: 0,player_id,dt,level,num_attempts,num_success
0,6dd5af4c7228fa353d505767143f5815,2014-01-04,4,3,1
1,c7ec97c39349ab7e4d39b4f74062ec13,2014-01-01,8,4,1
2,c7ec97c39349ab7e4d39b4f74062ec13,2014-01-05,12,6,0
3,a32c5e9700ed356dc8dd5bb3230c5227,2014-01-03,11,1,1
4,a32c5e9700ed356dc8dd5bb3230c5227,2014-01-07,15,6,0


In [17]:
print(df.dtypes)

player_id       object
dt              object
level            int64
num_attempts     int64
num_success      int64
dtype: object


In [18]:
# Function to print the percentage of missing values per column
def na_counter(df):
    print("NaN Values per column:")
    print("")
    for i in df.columns:
        # Share of missing values in this column, in percent
        percentage = df[i].isna().mean() * 100

        # Only report columns with more than 5% of NA values
        if percentage > 5:
            print(i+" has "+ str(round(percentage)) +"% of Null Values")

# Execute function
na_counter(df)

NaN Values per column:



In [19]:
# Drop rows with exactly 258 attempts (presumably treated as an outlier), then summarize
df = df[df.num_attempts != 258]
df.describe()

Unnamed: 0,level,num_attempts,num_success
count,16864.0,16864.0,16864.0
mean,9.28712,5.520458,0.627135
std,4.343586,7.059875,0.864729
min,1.0,0.0,0.0
25%,6.0,1.0,0.0
50%,9.0,3.0,1.0
75%,14.0,7.0,1.0
max,15.0,138.0,55.0


## 3. Checking the data set
<p>Now that we have loaded the dataset, let's count how many players we have in the sample and how many days' worth of data we have.</p>

In [20]:
# Count and display the number of unique players
print("Number of players: \n", df.player_id.nunique(), '\n',
        "Number of records: \n", len(df.player_id),'\n')


# Display the date range of the data
print("Period for which we have data: \nFrom: ",
        min(df.dt), ' To:', max(df.dt))

Number of players: 
 6814 
 Number of records: 
 16864 

Period for which we have data: 
From:  2014-01-01  To: 2014-01-07


Now let's see how many distinct levels were recorded for each player.

In [21]:
from plotly.subplots import make_subplots

# Count the distinct levels recorded for each player
countdf = df.groupby('player_id')['level'].nunique().reset_index()

# Count how many players have each number of recorded levels
countdf = countdf.groupby('level')['player_id'].nunique().reset_index()

# Label the level counts and sort by number of players (descending)
countdf.level = [str(i)+'s' for i in countdf.level]
countdf = countdf.sort_values('player_id', ascending= False)

# Generate CumSum
cumulative_per =  countdf.player_id / countdf.player_id.sum() * 100
cumulative_per = cumulative_per.cumsum()

# Format new DF
countdf = pd.concat([countdf, cumulative_per], axis = 1)
countdf.columns = ["levels","players","Cum_per"]

# Pareto Chart
linec = make_subplots(specs=[[{"secondary_y": True}]])

# Bar plot graphic object
linec.add_trace(go.Bar(x = countdf.levels, y = countdf.players, name = "Players", marker_color= "#007FFF"),
                        secondary_y = False)

# Scatter plot graphic object
linec.add_trace(go.Scatter(x = countdf.levels, y = countdf.Cum_per/100, mode='lines+markers', name = "Percentage", marker_color= "#FF5A5F"),
                        secondary_y = True)

# Layout
linec.update_layout(title = {'text':'Pareto Chart of Number of Levels recorded by player'},
                    xaxis = {"title":"Number of Levels recorded"},
                    yaxis = {"title":"Unique players"})
linec.update_layout(layout)
linec.update_yaxes(tickformat = "0%", title = "Cumulative Percentage", secondary_y = True)
linec.update_layout(showlegend=False)
linec.add_hline(y=0.8, line_dash = "dash", line_color="red", opacity=0.5, secondary_y = True)
linec.add_annotation(sign)
linec.show()

From the Pareto chart above, we can see that about 80% of the players have at most 3 of the 15 levels recorded. However, since the data is a random sample, we won't dwell on this.

## 4. Computing level difficulty
<p>Within each Candy Crush episode, there is a mix of easier and tougher levels. Luck and individual skill make the number of attempts required to pass a level different from player to player. The assumption is that difficult levels require more attempts on average than easier ones. That is, <em>the harder</em> a level is, <em>the lower</em> the probability to pass that level in a single attempt is.</p>
<p>A simple approach to model this probability is as a <a href="https://en.wikipedia.org/wiki/Bernoulli_process">Bernoulli process</a>; as a binary outcome (you either win or lose) characterized by a single parameter <em>p<sub>win</sub></em>: the probability of winning the level in a single attempt. This probability can be estimated for each level as:</p>
<p><img src="https://assets.datacamp.com/production/project_139/img/latex1.png" style="width:150px"></p>
<!-- $$p_{win} = \frac{\sum wins}{\sum attempts}$$ -->
<p>For example, let's say a level has been played 10 times and 2 of those attempts ended up in a victory. Then the probability of winning in a single attempt would be <em>p<sub>win</sub></em> = 2 / 10 = 20%.</p>
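
To make the formula concrete, here is a minimal sketch with hypothetical rows (two invented player records for one level, not real data). It reproduces the 2 / 10 example and shows why wins and attempts are summed first: averaging per-row win rates would weight a player with one attempt the same as a player with nine.

```python
import pandas as pd

# Hypothetical per-player rows for a single level (illustrative values only)
rows = pd.DataFrame({
    "num_attempts": [1, 9],
    "num_success": [1, 1],
})

# Pooled estimate: total wins divided by total attempts
p_win_pooled = rows.num_success.sum() / rows.num_attempts.sum()

# Averaging per-row win rates instead weights every row equally
p_win_row_mean = (rows.num_success / rows.num_attempts).mean()

print(round(p_win_pooled, 3), round(p_win_row_mean, 3))
```

The pooled value is the 20% from the example, while the row-wise mean comes out much higher because the single lucky attempt dominates it.
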
<p>Now, let's compute the difficulty <em>p<sub>win</sub></em> separately for each of the 15 levels.</p>

In [29]:
difficulty = df.groupby('level').agg(attempts = ('num_attempts', 'sum'),
                                    wins =('num_success', 'sum')).reset_index()
difficulty['p_win'] = difficulty.wins / difficulty.attempts
difficulty

Unnamed: 0,level,attempts,wins,p_win
0,1,1322,818,0.618759
1,2,1285,666,0.518288
2,3,1546,662,0.428202
3,4,1893,705,0.372425
4,5,6937,634,0.091394
5,6,1591,668,0.419862
6,7,4526,614,0.135661
7,8,15816,641,0.040529
8,9,8241,670,0.081301
9,10,3282,617,0.187995


## 5. Plotting difficulty profile

<p><img src="https://assets.datacamp.com/production/project_139/img/tiffi.jpeg" style="width:150px; float:left" hspace="20"> </p>

<p>Great! We now have the difficulty for all 15 levels in the episode. Keep in mind that, as we measure difficulty as the probability to pass a level in a single attempt, a <em>lower</em> value (a smaller probability of winning the level) implies a <em>higher</em> level difficulty.</p>
<p>Now that we have the difficulty of the episode we should plot it. Let's plot a line graph with the levels on the X-axis and the difficulty (<em>p<sub>win</sub></em>) on the Y-axis. We call this plot the <em>difficulty profile</em> of the episode.</p>

In [23]:
# Lineplot of Success Probability per Level
line1 = px.line(difficulty, x='level', y="p_win",
                title = "Probability of Level Success at first attempt",
                labels = {"p_win":"Probability", "level":"Episode Level"})
line1.update_layout(layout)
line1.update_xaxes(range=[1,15], tick0 = 1, dtick = 1)
line1.update_yaxes(range=[0,0.7], tick0 = 0, dtick = 0.1)
line1.update_layout(yaxis_tickformat = "0%")
line1.add_annotation(sign)
line1.show()

## 6. Spotting hard levels
<p>What constitutes a <em>hard</em> level is subjective. However, to keep things simple, we could define a threshold of difficulty, say 10%, and label levels with <em>p<sub>win</sub></em> &lt; 10% as <em>hard</em>. It's relatively easy to spot these hard levels on the plot, but we can make the plot more friendly by explicitly highlighting the hard levels.</p>

In [24]:
# Lineplot of Success Probability per Level
line2 = go.Figure(go.Scatter(
    x = difficulty.level,
    y = difficulty.p_win))
line2.update_layout(title = {'text':'Probability of Level Success at first attempt with Hard Levels'},
                    xaxis = {"title":"Episode Level"},
                    yaxis = {"title":"Probability"})
line2.update_layout(layout)
line2.update_xaxes(range=[1,15], tick0 = 1, dtick = 1)
line2.update_layout(yaxis_tickformat = "0%")
line2.update_layout(showlegend=False)
line2.add_hrect(y0=0.02, y1=0.1, line_width=0, fillcolor="red", opacity=0.2)
line2.add_annotation(sign)
line2.show()
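
Besides shading the plot, the 10% threshold can be applied to the table itself. The sketch below rebuilds only the ten rows of `difficulty` that were printed earlier; the names `difficulty_shown`, `HARD_THRESHOLD`, and `is_hard` are introduced here for illustration and are not part of the original notebook.

```python
import pandas as pd

# First ten rows of the `difficulty` table as printed earlier in this notebook
difficulty_shown = pd.DataFrame({
    "level": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "attempts": [1322, 1285, 1546, 1893, 6937, 1591, 4526, 15816, 8241, 3282],
    "wins": [818, 666, 662, 705, 634, 668, 614, 641, 670, 617],
})
difficulty_shown["p_win"] = difficulty_shown.wins / difficulty_shown.attempts

HARD_THRESHOLD = 0.10  # the 10% cut-off chosen in the text
difficulty_shown["is_hard"] = difficulty_shown.p_win < HARD_THRESHOLD

# Levels whose estimated win probability falls below the threshold
hard_levels = difficulty_shown.loc[difficulty_shown.is_hard, "level"].tolist()
print(hard_levels)
```

Among the ten levels shown, this flags levels 5, 8, and 9, which are the points that fall inside the red band.
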

## 7. Computing uncertainty
<p><img src="https://assets.datacamp.com/production/project_139/img/mr_toffee.jpeg" style="width:150px; float:right" hspace="20"> </p>
<p>As Data Scientists we should always report some measure of the uncertainty of any provided numbers. Maybe tomorrow, another sample will give us slightly different values for the difficulties? Here we will simply use the <a href="https://en.wikipedia.org/wiki/Standard_error"><em>Standard error</em></a> as a measure of uncertainty:</p>

<p><img src="https://assets.datacamp.com/production/project_139/img/latex2.png" style="width:150px; float:left" hspace="20"></p>

<!-- $$
\sigma_{error} \approx \frac{\sigma_{sample}}{\sqrt{n}}
$$ -->

<p>Here <em>n</em> is the number of datapoints and <em>σ<sub>sample</sub></em> is the sample standard deviation. For a Bernoulli process, the sample standard deviation is: </p>
<p><img src="https://assets.datacamp.com/production/project_139/img/latex3.png" style="width:150px; float:left" hspace="20"></p>

<!-- $$
\sigma_{sample} = \sqrt{p_{win} (1 - p_{win})} 
$$ -->

<p>Therefore, we can calculate the standard error like this:</p>
<p><img src="https://assets.datacamp.com/production/project_139/img/latex4.png" style="width:150px; float:left" hspace="20"></p>

<!-- $$
\sigma_{error} \approx \sqrt{\frac{p_{win}(1 - p_{win})}{n}}
$$ -->
<p>We already have all we need in the <code>difficulty</code> data frame! Every level has been played <em>n</em> times and we have its difficulty <em>p<sub>win</sub></em>. Now, let's calculate the standard error for each level.</p>

In [30]:
# Computing the standard error of p_win for each level
difficulty['error'] = np.sqrt(difficulty.p_win * (1 - difficulty.p_win) / difficulty.attempts)
difficulty

Unnamed: 0,level,attempts,wins,p_win,error
0,1,1322,818,0.618759,0.013358
1,2,1285,666,0.518288,0.013939
2,3,1546,662,0.428202,0.012585
3,4,1893,705,0.372425,0.011112
4,5,6937,634,0.091394,0.00346
5,6,1591,668,0.419862,0.012373
6,7,4526,614,0.135661,0.00509
7,8,15816,641,0.040529,0.001568
8,9,8241,670,0.081301,0.003011
9,10,3282,617,0.187995,0.00682
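
As a sanity check on the formula, the standard error of a single level can be recomputed by hand. This is a minimal sketch using the level 8 counts from the table above; the helper name `bernoulli_standard_error` is hypothetical and introduced only for this example.

```python
import math

def bernoulli_standard_error(wins: int, attempts: int) -> float:
    """Standard error of the win-rate estimate, sqrt(p * (1 - p) / n)."""
    p_win = wins / attempts
    return math.sqrt(p_win * (1 - p_win) / attempts)

# Level 8 from the table above: 641 wins in 15816 attempts
se_level_8 = bernoulli_standard_error(641, 15816)
print(round(se_level_8, 6))
```

The result should agree with the `error` column for level 8 (roughly 0.0016). Because *n* sits in the denominator, the most-played levels get the tightest estimates.
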


## 8. Showing uncertainty
<p>Now that we have a measure of uncertainty for each level's difficulty estimate, let's use <em>error bars</em> to show this uncertainty in the plot. We will set the length of the error bars to one standard error. The upper limit and the lower limit of each error bar should then be <em>p<sub>win</sub></em> + <em>σ<sub>error</sub></em> and <em>p<sub>win</sub></em> - <em>σ<sub>error</sub></em>, respectively.</p>

In [26]:
# Lineplot of Success Probability per Level
line3 = px.line(difficulty, x='level', y="p_win",
                title = "Probability of Level Success at first attempt with Error Bars",
                labels = {"p_win":"Probability", "level":"Episode Level"},
                error_y="error")
line3.update_layout(layout)
line3.update_xaxes(range=[0,16], tick0 = 1, dtick = 1)
line3.update_yaxes(range=[0,0.65], tick0 = 0, dtick = 0.1)
line3.update_layout(yaxis_tickformat = "0%")
line3.add_hrect(y0=0.02, y1=0.1, line_width=0, fillcolor="red", opacity=0.2)
line3.add_annotation(sign)
line3.show()
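
The limits of the error bars can also be written out as numbers. The sketch below copies three rows from the `difficulty` table shown earlier and adds `upper` and `lower` columns; those two column names and the `bars` frame are assumptions made for this illustration.

```python
import pandas as pd

# Three rows copied from the `difficulty` table shown earlier
bars = pd.DataFrame({
    "level": [1, 5, 8],
    "p_win": [0.618759, 0.091394, 0.040529],
    "error": [0.013358, 0.003460, 0.001568],
})

# One standard error above and below the estimate
bars["upper"] = bars.p_win + bars.error
bars["lower"] = bars.p_win - bars.error

# Level 5 stays under the 10% "hard" threshold even at its upper limit
level_5_upper = bars.loc[bars.level == 5, "upper"].iloc[0]
print(bars)
print(level_5_upper < 0.10)
```

Checking the upper limit against the threshold is one simple way to see whether a level's "hard" label depends on sampling noise.
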

## 9. A final metric
<p>It looks like our difficulty estimates are pretty precise! Using this plot, a level designer can quickly spot where the hard levels are and also see if there seem to be too many hard levels in the episode.</p>
<p>One question a level designer might ask is: "How likely is it that a player will complete the episode without losing a single time?" Let's calculate this using the estimated level difficulties!</p>

In [27]:
# The probability of completing the episode without losing a single time
prob = 1

for i in difficulty.p_win:
    prob = prob*i

# Printing it out
print("Probability of Success in one single attempt \nfor whole episode: ", prob*100, "%")

Probability of Success in one single attempt 
for whole episode:  9.889123140886191e-10 %
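
The loop above multiplies the per-level win probabilities, which implicitly treats the attempts on different levels as independent; that independence is an assumption of this simple model. A minimal sketch with hypothetical probabilities (invented values, not the real estimates) shows the same calculation with `np.prod`:

```python
import numpy as np

# Hypothetical per-level win probabilities (illustrative, not the real estimates)
p_win_example = np.array([0.5, 0.2, 0.1])

# Winning every level on the first try, assuming independent attempts,
# is the product of the per-level probabilities
prob_no_loss = float(np.prod(p_win_example))

# Same result as multiplying in a loop
prob_loop = 1.0
for p in p_win_example:
    prob_loop *= p

print(f"{prob_no_loss:.2%}")
```

Even with only three levels, the product shrinks quickly, which is why the probability for the full 15-level episode is so small.
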


## 10. Should our level designer worry?
<p>Given the probability we just calculated, should our level designer worry that a lot of players might complete the episode in one attempt?</p>

In [28]:
# Should our level designer worry that a lot of
# players will complete the episode in one attempt?
should_the_designer_worry = False