In [2]:
import pandas as pd  # type: ignore
from pandasql import sqldf # type: ignore
data = pd.read_csv('IndieZ.csv')

## **1\. User experience comparison**

**Main problem**: Identifying how the user experience changed between the two versions.

**Hypothesis**: Given the reports of a poor tutorial experience and my own experience over several test runs, the tutorial section appears to fail to explain the game's rules. I theorized that, being uncertain about how to play, a sizeable share of players would struggle at level 1 even though it is supposed to be the easiest level. To test this, I first compiled a query to establish the tutorial completion rate.

In [3]:
sqldf('''
SELECT 
    nop.version,
    nop.user as num_of_player,
    notc.user as tut_completed,
    (CAST(notc.user as FLOAT) / CAST(nop.user as FLOAT)) *100 
        as tut_complete_percentage
FROM (
    SELECT 
        version,
        COUNT(DISTINCT user) as user 
    FROM 
        data
    GROUP BY 
        version
) as nop
JOIN (
    SELECT 
        version,
        COUNT(DISTINCT user) as user
    FROM 
        data
    WHERE 
        quantity = -2 
    GROUP BY 
        version
) as notc
ON nop.version = notc.version 
''')

Unnamed: 0,version,num_of_player,tut_completed,tut_complete_percentage
0,1.5.2,6671,6341,95.053215
1,1.6.0,6929,6571,94.833309


The tutorial completion rate is nearly identical across the two versions (95.05% vs. 94.83%). Using this as a baseline, I compiled another query to count the players who completed the tutorial but lost at level 1:

In [4]:
sqldf('''
SELECT 
    tut_complete.version,
    tut_complete.user as tut_completed_user,
    lv_1_lost.user as lv_1_lost_player,
    ROUND(CAST(lv_1_lost.user as FLOAT) / CAST(tut_complete.user as FLOAT) * 100, 2) as percentage
FROM (
    SELECT 
        version,
        COUNT(DISTINCT user) as user 
    FROM 
        data
    WHERE 
        quantity = -2 
    GROUP BY 
        version
) as tut_complete
JOIN (
    SELECT 
        version,
        COUNT(DISTINCT user) as user
    FROM 
        data
    WHERE 
        user IN (
            SELECT  
                user
            FROM 
                data
            WHERE 
                event_name = 'tutorial' AND quantity = -2) 
        AND level = 1 AND win = 0
    GROUP BY 
        version
) as lv_1_lost
ON tut_complete.version = lv_1_lost.version 
''')

Unnamed: 0,version,tut_completed_user,lv_1_lost_player,percentage
0,1.5.2,6341,1561,24.62
1,1.6.0,6571,1373,20.89


Following the same reasoning that links uncertainty about the game's rules to the level 1 loss rate, I hypothesized that the number of losses would drop significantly at level 2, since most players would understand how the game works by then:

In [5]:
sqldf('''
SELECT 
    tut_complete.version,
    tut_complete.player as num_of_lv_2_player,
    lost.player as lv_2_lost_player,
    ROUND(CAST(lost.player as FLOAT) / CAST(tut_complete.player as FLOAT) * 100, 2) as percentage
FROM (
    SELECT 
        version,
        COUNT(DISTINCT user) as player 
    FROM 
        data
    WHERE 
        level = 2
    GROUP BY 
        version
) as tut_complete
JOIN (
    SELECT 
        version,
        COUNT(DISTINCT user) as player
    FROM 
        data
    WHERE 
        user IN (
            SELECT  
                user
            FROM 
                data
            WHERE 
                event_name = 'tutorial' AND quantity = -2) 
            AND 
                level = 2 AND win = 0
    GROUP BY 
        version
) as lost
ON tut_complete.version = lost.version
''')

Unnamed: 0,version,num_of_lv_2_player,lv_2_lost_player,percentage
0,1.5.2,6223,56,0.9
1,1.6.0,6396,48,0.75


As expected, the loss rate dropped dramatically in both versions, from more than 20% at level 1 to less than 1% at level 2.
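The same comparison can be extended to every level at once. The following is a minimal pandas sketch, not the method used in the queries above: it assumes the columns `user`, `version`, `level`, and `win` behave as they do in those queries, it does not apply the tutorial-completion filter, and the small `events` frame is made-up illustration data rather than rows from `IndieZ.csv`.

```python
import pandas as pd

def loss_rate_by_level(events: pd.DataFrame) -> pd.DataFrame:
    """Share of distinct players who lost at least once, per version and level."""
    keys = ['version', 'level']
    # Distinct players who reached each level
    players = events.groupby(keys)['user'].nunique()
    # Distinct players with at least one lost attempt (win == 0) at each level
    losers = events[events['win'] == 0].groupby(keys)['user'].nunique()
    out = pd.DataFrame({'players': players, 'lost_players': losers})
    # Levels that nobody lost have no row in `losers`; treat them as zero
    out = out.fillna({'lost_players': 0})
    out['loss_pct'] = (out['lost_players'] / out['players'] * 100).round(2)
    return out

# Hypothetical toy data: three players, one of whom loses at level 1
events = pd.DataFrame({
    'user':    ['a', 'b', 'c', 'a', 'b', 'c'],
    'version': ['1.5.2'] * 6,
    'level':   [1, 1, 1, 2, 2, 2],
    'win':     [0, 1, 1, 1, 1, 1],
})
loss_table = loss_rate_by_level(events)
print(loss_table)
```

Calling `loss_rate_by_level(data)` on the real data frame would give one row per version and level, which makes it easier to see whether level 1 stands out from the later levels.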

**Key findings**

- The poor user experience caused by the vague tutorial section can be identified from the number of players who struggle at the very first level of the game. This is further supported by the large difference between the level 1 and level 2 loss rates. Loss rates at levels beyond 1 and 2 most likely reflect each level's difficulty.

- The percentage of players who completed the tutorial but lost at the first level dropped from 24.62% in version 1.5.2 to 20.89% in version 1.6.0, which may indicate that the new tutorial provides a better user experience. However, further inspection is required to establish the real impact of version 1.6.0's changes.
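One way to check whether the drop in the level 1 loss rate is larger than chance would explain is a two-proportion z-test. The sketch below is not part of the original analysis. It uses the counts from the level 1 query output (1561 of 6341 for version 1.5.2, 1373 of 6571 for version 1.6.0) and assumes players in the two versions are independent samples. The function name `two_proportion_z_test` is introduced here for illustration only.

```python
import math

def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    # Pooled proportion under the null hypothesis that both rates are equal
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Counts taken from the level 1 query output above
z, p_value = two_proportion_z_test(1561, 6341, 1373, 6571)
print(f"z = {z:.2f}, two-sided p = {p_value:.2g}")
```

A small p-value would only indicate that the two loss rates differ. It would not by itself show that the new tutorial caused the change.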

## **2\. Version evaluation**

**Main problem**: Whether the changes in version 1.6.0 are significant enough for it to replace version 1.5.2.

**Hypothesis**: Based on the earlier inspection, version 1.6.0 could benefit from further improvement before launching.

### **Retention rate** 

Using both SQL and pandas, I compiled a data frame to calculate the retention rate of each version:

In [8]:
df_152 = sqldf('''
    SELECT 
        day_diff,
        COUNT(DISTINCT user) num_of_player
    FROM 
        data 
    WHERE 
        version = '1.5.2'
    GROUP BY 
        day_diff
''')
# Total number of distinct players in version 1.5.2 (denominator of the retention rate)
player_count_152 = data[data['version'] == '1.5.2'].groupby('user')['day_diff'].max().value_counts().sum() 

df_160 = sqldf('''
    SELECT 
        day_diff,
        COUNT(DISTINCT user) num_of_player
    FROM 
        data 
    WHERE 
        version = '1.6.0'
    GROUP BY 
        day_diff
''')
# Total number of distinct players in version 1.6.0 (denominator of the retention rate)
player_count_160 = data[data['version'] == '1.6.0'].groupby('user')['day_diff'].max().value_counts().sum()

retention_rate = pd.DataFrame({
    'ver_152': df_152['num_of_player'],
    'retention_rate_152' : round(df_152['num_of_player'] / player_count_152 * 100, 2),
    'ver_160': df_160['num_of_player'],
    'retention_rate_160' : round(df_160['num_of_player'] / player_count_160 * 100, 2),
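    # Despite its name, '160/152' is a difference in percentage points (1.6.0 minus 1.5.2), not a ratio
    '160/152' : round(df_160['num_of_player'] / player_count_160 * 100, 2) - round(df_152['num_of_player'] / player_count_152 * 100, 2)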
})
retention_rate.index.name = 'day_dif'
retention_rate

day_dif,ver_152,retention_rate_152,ver_160,retention_rate_160,160/152
0,6663,99.88,6903,99.62,-0.26
1,1780,26.68,1981,28.59,1.91
2,891,13.36,1036,14.95,1.59
3,517,7.75,714,10.3,2.55
4,385,5.77,520,7.5,1.73
5,275,4.12,423,6.1,1.98
6,202,3.03,334,4.82,1.79
7,183,2.74,289,4.17,1.43


While the difference in head count between the versions is noticeable, the actual retention rate stays low in both. Version 1.6.0 gains only about 1.4 to 2.6 percentage points from day 1 through day 7, an increase that could be deemed insignificant.
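The retention table above combines `df_152` and `df_160` by row position, which matches `day_diff` only when every day is present in both versions. As an alternative, the sketch below computes the same kind of table in pandas alone and aligns the versions on `day_diff` explicitly. It is an illustration under the assumption that `user`, `version`, and `day_diff` have the meaning used in the cell above. The `events` frame is made-up toy data, and `retention_by_version` is a hypothetical helper name.

```python
import pandas as pd

def retention_by_version(events: pd.DataFrame) -> pd.DataFrame:
    """Percentage of each version's players who were active on each day_diff."""
    # Distinct active players per day, one column per version
    active = (
        events.groupby(['version', 'day_diff'])['user']
        .nunique()
        .unstack('version')
    )
    # Cohort size: all distinct players seen in each version
    cohort = events.groupby('version')['user'].nunique()
    # DataFrame / Series aligns the Series index with the version columns
    return (active / cohort * 100).round(2)

# Hypothetical toy data: two players per version, one returning on day 1
events = pd.DataFrame({
    'user':     ['a', 'a', 'b', 'c', 'c', 'd'],
    'version':  ['1.5.2', '1.5.2', '1.5.2', '1.6.0', '1.6.0', '1.6.0'],
    'day_diff': [0, 1, 0, 0, 1, 0],
})
rates = retention_by_version(events)
print(rates)
```

With the real data, `retention_by_version(data)` would return one column per version indexed by `day_diff`, and a day missing from one version would show up as `NaN` instead of silently shifting rows.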