<img src="https://bento.cdn.pbs.org/hostedbento-prod/blog/20170114_200556_794501_pk-channel-16x9.jpeg"/>

# What is PBS Kids

The Public Broadcasting Service (PBS) is an American public broadcaster and television program distributor.<br>It is a nonprofit organization and the most prominent provider of educational television programming to public television stations in the United States.<br>
<b>Subsidiary:</b> PBS KIDS, World<br>
<b>Geographic scope:</b> United States<br>
PBS Kids is the brand for most of the children's programming aired by the Public Broadcasting Service (PBS) in the United States.

# Introduction

PBS KIDS, a trusted name in early childhood education for decades, aims to gain insights into how media can help children learn important skills for success in school and life. In this challenge, you'll use anonymous gameplay data, including knowledge of videos watched and games played, from the PBS KIDS Measure Up! app, a game-based learning tool developed as a part of the CPB-PBS Ready To Learn Initiative with funding from the U.S. Department of Education. Competitors will be challenged to predict scores on in-game assessments and create an algorithm that will lead to better-designed games and improved learning outcomes. Your solutions will aid in discovering important relationships between engagement with high-quality educational media and learning processes.

The outcomes in this competition are grouped into 4 groups (labeled accuracy_group in the data):<br>
<br>
3: the assessment was solved on the first attempt<br>
2: the assessment was solved on the second attempt<br>
1: the assessment was solved after 3 or more attempts<br>
0: the assessment was never solved<br>
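
The mapping above can be sketched as a small function. This is an illustrative helper and not part of the competition code: the name `to_accuracy_group` is an assumption, and its arguments are modelled on the `num_correct` and `num_incorrect` columns of train_labels.csv.

```python
def to_accuracy_group(num_correct: int, num_incorrect: int) -> int:
    """Map the attempt counts of one assessment to an accuracy group (0-3)."""
    if num_correct == 0:
        return 0  # never solved
    if num_incorrect == 0:
        return 3  # solved on the first attempt
    if num_incorrect == 1:
        return 2  # solved on the second attempt
    return 1      # solved after three or more attempts


for correct, incorrect in [(1, 0), (1, 1), (1, 4), (0, 3)]:
    print(correct, incorrect, "->", to_accuracy_group(correct, incorrect))
```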

More information about game play can be found in https://www.kaggle.com/c/data-science-bowl-2019/discussion/117019#latest-680222

With that background, the task is clear.<br>
We are given gameplay data for each kid and need to predict the accuracy_group.

# Let's start with analysis

In [None]:
import pandas as pd
from time import time
import datetime
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
from plotly.offline import  init_notebook_mode
import random
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from catboost import CatBoostClassifier
from sklearn.metrics import confusion_matrix
import colorlover as cl
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases
sns.set(rc={'figure.figsize':(11.7,8.27)})
init_notebook_mode(connected=True)

In [None]:
train_labels_df = pd.read_csv("/kaggle/input/data-science-bowl-2019/train_labels.csv")
train_df = pd.read_csv("/kaggle/input/data-science-bowl-2019/train.csv")
specs_df = pd.read_csv("/kaggle/input/data-science-bowl-2019/specs.csv")
test_df = pd.read_csv("/kaggle/input/data-science-bowl-2019/test.csv")
submission_df = pd.read_csv("/kaggle/input/data-science-bowl-2019/sample_submission.csv")

In [None]:
print ("In train dataset we have total of " + str(train_df['installation_id'].nunique()) + " unique Installation ID")
print ("In test dataset we have total of " + str(test_df['installation_id'].nunique()) + " unique Installation ID")

train.csv & test.csv<br>
These are the main data files which contain the gameplay events.<br>
<br>
event_id - Randomly generated unique identifier for the event type. Maps to event_id column in specs table.<br>
game_session - Randomly generated unique identifier grouping events within a single game or video play session.<br>
timestamp - Client-generated datetime<br>
event_data - Semi-structured JSON formatted string containing the events parameters. Default fields are: event_count, event_code, and game_time; otherwise fields are determined by the event type.<br>
installation_id - Randomly generated unique identifier grouping game sessions within a single installed application instance.<br>
event_count - Incremental counter of events within a game session (offset at 1). Extracted from event_data.<br>
event_code - Identifier of the event 'class'. Unique per game, but may be duplicated across games. E.g. event code '2000' always identifies the 'Start Game' event for all games. Extracted from event_data.<br>
game_time - Time in milliseconds since the start of the game session. Extracted from event_data.<br>
title - Title of the game or video.<br>
type - Media type of the game or video. Possible values are: 'Game', 'Assessment', 'Activity', 'Clip'.<br>
world - The section of the application the game or video belongs to. Helpful to identify the educational curriculum goals of the media. Possible values are: 'NONE' (at the app's start screen), 'TREETOPCITY' (Length/Height), 'MAGMAPEAK' (Capacity/Displacement), 'CRYSTALCAVES' (Weight).<br>
<br>
<br>
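
Since event_data is a JSON string, its default fields can be pulled out with the standard `json` module. The snippet below is a minimal sketch on a made-up event string (the value of `raw_event` is hypothetical; real rows carry extra fields that depend on the event type).

```python
import json

# Hypothetical event_data string containing only the three default fields.
raw_event = '{"event_code": 2000, "event_count": 1, "game_time": 0}'

event = json.loads(raw_event)
default_fields = {key: event[key] for key in ("event_count", "event_code", "game_time")}
print(default_fields)
```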

In [None]:
train_df.head()


specs.csv<br>
This file gives the specification of the various event types.<br>

event_id - Global unique identifier for the event type. Joins to event_id column in events table.<br>
info - Description of the event.<br>
args - JSON formatted string of event arguments. Each argument contains:<br>
name - Argument name.<br>
type - Type of the argument (string, int, number, object, array).<br>
info - Description of the argument.<br>
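
The args column can be decoded the same way. The sketch below builds a made-up args string in the shape described above (the argument names and descriptions are illustrative assumptions) and collects each argument's type by name.

```python
import json

# Hypothetical args value: a JSON list with name, type and info per argument.
raw_args = json.dumps([
    {"name": "game_time", "type": "int", "info": "time since the start of the session"},
    {"name": "event_count", "type": "int", "info": "session event counter"},
])

arg_types = {arg["name"]: arg["type"] for arg in json.loads(raw_args)}
print(arg_types)
```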

In [None]:
specs_df.head()

train_labels.csv<br>
This file demonstrates how to compute the ground truth for the assessments in the training set.<br>

In [None]:
train_labels_df.head()

50% of the kids finish the assessment on the first attempt.<br>
23.91% of the kids never solved the assessment.<br>
3: the assessment was solved on the first attempt<br>
2: the assessment was solved on the second attempt<br>
1: the assessment was solved after 3 or more attempts<br>
0: the assessment was never solved<br>

In [None]:
temp_df = train_labels_df.accuracy_group.value_counts(normalize = True) *100
temp_df = temp_df.round(2)
text = [str(x) + "%" for x in temp_df.values]
fig = go.Figure(data = go.Bar(x = temp_df.index,y = temp_df.values, text = text,textposition='auto'))
fig.update_traces(marker_color='#D95219', marker_line_color='#D95219',marker_line_width=1.5, opacity=0.6)
fig.update_layout(title={'text': "Percentage of accuracy group",'y':0.9,'x':0.5,'xanchor': 'center','yanchor': 'top'} )
fig.show()

The game mainly has three worlds:<br>
MAGMAPEAK (Capacity)<br>
CRYSTALCAVES (Weight)<br>
TREETOPCITY (Height & Length)<br>
About 44.3% of all recorded events come from the MAGMAPEAK world.

In [None]:
temp_df = train_df['world'].value_counts(normalize = True) * 100
temp_df = temp_df.round(2)
text = [str(x) + "%" for x in temp_df.values]
fig = go.Figure(data = go.Bar(x = temp_df.values,y = temp_df.index, text = text,textposition='auto',orientation='h'))
fig.update_traces(marker_color='#611F8D', marker_line_color='#611F8D',marker_line_width=1.5, opacity=0.6)
fig.update_layout(title={'text': "Percentage of World",'y':0.9,'x':0.5,'xanchor': 'center','yanchor': 'top'} )
fig.show()

There are four media types in total: Activity, Game, Assessment and Clip. Let's see how they are distributed across the worlds.

NONE indicates the app's start screen and has only one media type, Clip.<br>
MAGMAPEAK and TREETOPCITY have more activities.<br>
CRYSTALCAVES has more games.<br>
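
The per-world shares can also be read from a normalized cross-tabulation. This is a minimal sketch on a toy frame (the rows in `toy` are made up and only mimic the `world` and `type` columns), shown as an alternative view rather than the method used in this kernel.

```python
import pandas as pd

# Toy stand-in for the world/type columns of train.csv.
toy = pd.DataFrame({
    "world": ["NONE", "MAGMAPEAK", "MAGMAPEAK", "MAGMAPEAK", "CRYSTALCAVES", "CRYSTALCAVES"],
    "type": ["Clip", "Activity", "Activity", "Game", "Game", "Assessment"],
})

# normalize="index" makes every row (world) sum to 1, so * 100 gives percentages.
share = pd.crosstab(toy["world"], toy["type"], normalize="index") * 100
print(share.round(2))
```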

In [None]:
temp_df = train_df.groupby('world')['type'].value_counts(normalize = True).reset_index(name="percentage")
temp_df['percentage'] = temp_df['percentage'] *100
temp_df = temp_df.round(2)
data = []
type_ = temp_df['type'].unique()
colors = [x.replace(")","").replace("rgb(","") for x in cl.scales['4']['qual']['Paired']]
count = 0
for i in type_:

    text = [str(x) + "%" for x in temp_df[temp_df['type'] == i]['percentage'].values]
    data.append(go.Bar(name = i, x =temp_df[temp_df['type'] == i]['world'].values,text = text,textposition='auto',
                      y =  temp_df[temp_df['type'] == i]['percentage'].values,marker=dict(
        color='rgba(' + colors[count] + ',0.6)',
        line=dict(color='rgba(' + colors[count] + ',1.0)', width=1)
    )))
    count = count + 1
fig = go.Figure(data=data)
fig.update_layout(barmode='stack')
fig.update_layout(title={'text': "Percentage of media types in each world",'y':0.9,'x':0.5,'xanchor': 'center','yanchor': 'top'} )
fig.show()


Chow Time stands out as a popular activity.

In [None]:
# Let's see the favourite titles; strip the "(Activity)" and "(Assessment)" suffixes first
temp_df = train_df['title'].str.replace(r"\((Activity|Assessment)\)", "", regex=True)
text = ' ' .join(val for val in temp_df)
wordcloud = WordCloud(width=1600, height=800, stopwords = {'None','etc','and','other'}).generate(text)
plt.figure(figsize=(20,10), facecolor='k')
plt.imshow(wordcloud,interpolation="bilinear")
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

From the three graphs above, Magmapeak is the most played world and Sandcastle Builder is among the most popular activities.

# Time based analysis 

In [None]:
train_df['timestamp'] = pd.to_datetime(train_df.timestamp)
train_df['date'] = train_df['timestamp'].dt.date
train_df['month'] = train_df['timestamp'].dt.month_name()
train_df['weekday_name'] = train_df['timestamp'].dt.day_name()  # dt.weekday_name was removed in pandas 1.0
train_df['hour'] = train_df['timestamp'].dt.hour
train_df['minute'] = train_df['timestamp'].dt.minute

> Kids are more active from 10 AM till midnight.<br>
> Kids are more active on Friday (Weekend starts)<br>
> September has more traffic<br>
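
As a small illustration of how such counts can be produced in calendar order, the sketch below counts events per weekday on three made-up timestamps (the values in `ts` are hypothetical) and reindexes the result from Monday to Sunday.

```python
import pandas as pd

# Hypothetical client timestamps in the same ISO style as the timestamp column.
ts = pd.to_datetime(pd.Series([
    "2019-09-06T17:53:46.937Z",  # a Friday
    "2019-09-06T22:10:00.000Z",  # the same Friday
    "2019-09-08T09:00:00.000Z",  # a Sunday
]))

order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
# value_counts sorts by frequency, so reindex to restore calendar order.
per_weekday = ts.dt.day_name().value_counts().reindex(order, fill_value=0)
print(per_weekday)
```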

In [None]:
date_df = train_df.groupby("date")['event_id'].count()
month_df = train_df.groupby("month")['event_id'].count().reset_index(name="count")
month_df['month'] = pd.Categorical(month_df['month'],categories=['December','November','October','September','August','July','June','May','April','March','February','January'],ordered=True)
month_df = month_df.sort_values('month',ascending=False)

weekday_df = train_df.groupby("weekday_name")['event_id'].count().reset_index(name="count")
weekday_df['weekday'] = pd.Categorical(weekday_df['weekday_name'],categories=['Saturday','Friday','Thursday','Wednesday','Tuesday','Monday','Sunday'],ordered=True)
weekday_df = weekday_df.sort_values('weekday',ascending=False)

hour_df = train_df.groupby("hour")['event_id'].count()
minute_df = train_df.groupby("minute")['event_id'].count()
fig = make_subplots(rows = 5,cols = 1)

installation_df = train_df.groupby("date")['installation_id'].count()
fig.append_trace(go.Scatter(x = minute_df.index, y = minute_df.values, mode = "lines", name = "Minute"),row=1,col=1)
fig.append_trace(go.Scatter(x = hour_df.index, y = hour_df.values, mode = "markers", name = "Hour"),row=2,col=1)
fig.append_trace(go.Scatter(x = weekday_df['weekday'], y = weekday_df['count'], mode = "lines+markers", name = "Week Day"),row=3,col=1)
fig.append_trace(go.Scatter(x = date_df.index, y = date_df.values, mode = "lines+markers", name = "Date"),row=4,col=1)
fig.append_trace(go.Scatter(x = month_df['month'], y = month_df['count'], mode = "lines", name = "Month"),row=5,col=1)




fig.update_layout(height=1000)
fig.show()

Cart Balancer is the most easily solved assessment.<br>
Most of the kids have not finished Chest Sorter.<br>

In [None]:
temp_df = train_labels_df.groupby('title')['accuracy_group'].value_counts(normalize=True).reset_index(name="percentage")
temp_df['percentage'] = temp_df['percentage']*100
temp_df = temp_df.round(2)
temp_df['title'] = temp_df['title'].str.replace(r"\(Assessment\)", "", regex=True)
colors = [x.replace(")","").replace("rgb(","") for x in cl.scales['4']['qual']['Dark2']]
data = []
for i in range(4):
    text = [str(x) + "%" for x in temp_df[temp_df['accuracy_group'] == i]['percentage'].values]
    data.append(go.Bar(name = i, x = temp_df[temp_df['accuracy_group'] == i]['title'].values,
                       text = text,textposition='auto',
                      y = temp_df[temp_df['accuracy_group'] == i]['percentage'].values,marker=dict(
        color='rgba(' + colors[i] + ',0.6)',
        line=dict(color='rgba(' + colors[i] + ',1.0)', width=1)
    )))
fig = go.Figure(data=data)
fig.update_layout(barmode='stack', title={'text': "Percentage of accuracy group for different type of Assessment",'y':0.9,'x':0.5,'xanchor': 'center','yanchor': 'top'})
fig.show()

In [None]:
temp_df =  train_labels_df['title'].value_counts()
data = go.Bar(x = temp_df.index,y = temp_df.values,text = temp_df.values,  textposition='auto')
fig = go.Figure(data = data)
fig.update_traces(marker_color='#C5197D', marker_line_color='#8E0052',marker_line_width=1.5, opacity=0.6)
fig.update_layout(barmode='stack', title={'text': "Different types of Assessment",'y':0.9,'x':0.5,'xanchor': 'center','yanchor': 'top'})
fig.show()

# Let's understand the relationship between installation_id, game_session and event_id

An installation_id identifies a single installed instance of the app.<br>
Let's say the installation ID 0001e90f belongs to Alex.<br>
Alex has a total of 1357 rows in the train data set.

In [None]:
temp_df = train_df[train_df.installation_id=="0001e90f"]
temp_df

In [None]:
print("Out of 1357 rows we have " + str(temp_df.event_id.nunique()) + " unique event ID and " + str(temp_df.game_session.nunique()) + " unique game session")

A game session is the total period of time devoted to one activity.<br>
Let's take an example game session and see what it contains.

In [None]:
temp_df[temp_df.game_session == "0848ef14a8dc6892"]

This session entirely belongs to Sandcastle Builder
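
The nesting of installations, sessions and events can be summarised with a grouped aggregation. The sketch below uses a made-up frame (`toy` and its ids are hypothetical) that only mimics the three identifier columns.

```python
import pandas as pd

# Toy stand-in: one installation with two sessions, another with one session.
toy = pd.DataFrame({
    "installation_id": ["inst_a"] * 5 + ["inst_b"] * 2,
    "game_session": ["s1", "s1", "s1", "s2", "s2", "s3", "s3"],
    "event_id": ["e1", "e2", "e2", "e1", "e3", "e1", "e2"],
})

# Rows, distinct sessions and distinct event types per installation.
summary = toy.groupby("installation_id").agg(
    rows=("event_id", "size"),
    sessions=("game_session", "nunique"),
    event_types=("event_id", "nunique"),
)
print(summary)
```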

Now let's discuss the event id.

Each event id is described in a separate table, specs_df.<br>
This table has 368 unique events.<br>
These events can be anything, such as the user's x,y coordinates, a tutorial being played, or a player clicking something.
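
To attach a description to each logged event, the events can be joined to the specs table on event_id. This is a minimal sketch on toy frames (the ids and descriptions in `events` and `specs` are made up), not a step this kernel performs.

```python
import pandas as pd

# Toy stand-ins for train.csv and specs.csv.
events = pd.DataFrame({"event_id": ["e1", "e2", "e1"], "game_session": ["s1", "s1", "s2"]})
specs = pd.DataFrame({"event_id": ["e1", "e2"], "info": ["start of game", "player clicked an object"]})

# Many events map to one spec row; validate raises if specs has duplicate ids.
described = events.merge(specs, on="event_id", how="left", validate="many_to_one")
print(described)
```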

In [None]:
specs_df

# Game time distribution

Let's understand the distribution of game time in the train set, taking only the first 1000000 records.<br>
We apply np.log1p to the game time because it is heavily skewed towards large values.
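
As a quick illustration of the transform, the sketch below applies `np.log1p` to a few made-up game times in milliseconds (the values in `game_time_ms` are hypothetical). A zero stays at zero, and values spanning several orders of magnitude end up on a comparable scale.

```python
import numpy as np

# Hypothetical game times: 0 ms, 1 second, 1 minute, 1 hour.
game_time_ms = np.array([0, 1_000, 60_000, 3_600_000])

game_time_log = np.log1p(game_time_ms)  # log(1 + x), defined at x = 0
print(game_time_log.round(2))

# np.expm1 inverts the transform when the original scale is needed again.
recovered = np.expm1(game_time_log)
```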

> for interactive visualization please uncomment the code

In [None]:
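# log1p compresses the long right tail of game_time (milliseconds)
train_df['game_time_log'] = np.log1p(train_df['game_time'])
# keep only the first 1,000,000 rows, as noted above
train_df = train_df.head(1000000)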
# fig = px.box(train_df, y="game_time_log",x = "type",color='month',title={'text': "Distribution of game_time by type based on month",'y':0.9,'x':0.5,'xanchor': 'center','yanchor': 'top'},
#              color_discrete_sequence=cl.scales['3']['qual']['Dark2'])
# fig.show()
ax = sns.catplot(x="type", y="game_time_log", data=train_df,col="month",kind="box", aspect=.7)

In [None]:
# fig = px.box(train_df, y="game_time_log",x = "type",color='weekday_name',title={'text': "Distribution of game_time by type based on weekday",'y':0.9,'x':0.5,'xanchor': 'center','yanchor': 'top'},
#              color_discrete_sequence=cl.scales['3']['qual']['Dark2'])
# fig.show()
ax = sns.catplot(x="type", y="game_time_log", data=train_df,col="weekday_name",kind="box", aspect=.7)

In [None]:
# fig = px.box(train_df, y="game_time_log",x = "type",color='world',title={'text': "Distribution of game_time by type based on world",'y':0.9,'x':0.5,'xanchor': 'center','yanchor': 'top'},
#              color_discrete_sequence=cl.scales['3']['qual']['Dark2'])
# fig.show()
# catplot creates its own figure, so no separate plt.figure() call is needed
ax = sns.catplot(x="type", y="game_time_log", data=train_df,col="world",kind="strip", aspect=.7)

In [None]:
# fig = px.strip(train_df, y="game_time_log",x = "world",title={'text': "Distribution of game_time by world",'y':0.9,'x':0.5,'xanchor': 'center','yanchor': 'top'},
#              color_discrete_sequence=cl.scales['3']['qual']['Dark2'])
# fig.show()

ax = sns.catplot(x="world", y="game_time_log", data=train_df,kind="strip", aspect=.7)

# How hard are the assessments?

A few notes:
* For a given assessment, accuracy group 3 shows the same count in both the incorrect and the correct plot, because every first-attempt solve has exactly one correct and zero incorrect answers. For example, Bird Measurer (accuracy group = 3) has a count of 693 for incorrect and 693 for correct.
* Chest Sorter is the toughest assessment.
* One kid attempted Bird Measurer 85 times.
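
One way to put a number on difficulty is to compare the mean accuracy group and the share of unsolved attempts per title. The sketch below does this on a toy frame (the titles and values in `toy_labels` are made up and only mimic train_labels.csv).

```python
import pandas as pd

# Toy stand-in for the title and accuracy_group columns of train_labels.csv.
toy_labels = pd.DataFrame({
    "title": ["Assessment A"] * 4 + ["Assessment B"] * 4,
    "accuracy_group": [3, 3, 2, 0, 0, 0, 1, 3],
})

# Lower mean group and higher unsolved share both point to a harder assessment.
difficulty = toy_labels.groupby("title").agg(
    mean_group=("accuracy_group", "mean"),
    share_unsolved=("accuracy_group", lambda s: (s == 0).mean()),
).sort_values("mean_group")
print(difficulty)
```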

In [None]:
incorrect = train_labels_df.groupby(['title','accuracy_group'])['num_incorrect'].value_counts().reset_index(name="count")
correct = train_labels_df.groupby(['title','accuracy_group'])['num_correct'].value_counts().reset_index(name="count")

In [None]:
px.scatter(incorrect[incorrect['title'] == "Bird Measurer (Assessment)"], x="accuracy_group", y="count",color = "num_incorrect",size = "count",hover_name="accuracy_group",title="Bird Measurer incorrect answers")

In [None]:
px.scatter(correct[correct['title'] == "Bird Measurer (Assessment)"], x="accuracy_group", y="count",color = "num_correct",size = "count",hover_name="accuracy_group",title="Bird Measurer correct answers")

In [None]:
px.scatter(incorrect[incorrect['title'] == "Mushroom Sorter (Assessment)"], x="accuracy_group", y="count",color = "num_incorrect",size = "count",hover_name="accuracy_group",title="Mushroom Sorter incorrect answers")

In [None]:
px.scatter(correct[correct['title'] == "Mushroom Sorter (Assessment)"], x="accuracy_group", y="count",color = "num_correct",size = "count",hover_name="accuracy_group",title="Mushroom Sorter correct answers")

In [None]:
px.scatter(incorrect[incorrect['title'] == "Cauldron Filler (Assessment)"], x="accuracy_group", y="count",color = "num_incorrect",size = "count",hover_name="accuracy_group",title="Cauldron Filler incorrect answers")

In [None]:
px.scatter(correct[correct['title'] == "Cauldron Filler (Assessment)"], x="accuracy_group", y="count",color = "num_correct",size = "count",hover_name="accuracy_group",title="Cauldron Filler correct answers")

In [None]:
px.scatter(incorrect[incorrect['title'] == "Chest Sorter (Assessment)"], x="accuracy_group", y="count",color = "num_incorrect",size = "count",hover_name="accuracy_group",title="Chest Sorter incorrect answers")

In [None]:
px.scatter(correct[correct['title'] == "Chest Sorter (Assessment)"], x="accuracy_group", y="count",color = "num_correct",size = "count",hover_name="accuracy_group",title="Chest Sorter correct answers")

In [None]:
px.scatter(incorrect[incorrect['title'] == "Cart Balancer (Assessment)"], x="accuracy_group", y="count",color = "num_incorrect",size = "count",hover_name="accuracy_group",title="Cart Balancer incorrect answers")

In [None]:
px.scatter(correct[correct['title'] == "Cart Balancer (Assessment)"], x="accuracy_group", y="count",color = "num_correct",size = "count",hover_name="accuracy_group",title="Cart Balancer correct answers")

# Let's build the model

Let's use CatBoost.

Please upvote this kernel too: https://www.kaggle.com/mhviraf/a-new-baseline-for-dsb-2019-catboost-model<br>
(The feature engineering code is taken from there.) @mhviraf, thank you for this amazing kernel.

Submissions are scored based on the quadratic weighted kappa, which measures the agreement between two outcomes. This metric typically varies from 0 (random agreement) to 1 (complete agreement). In the event that there is less agreement than expected by chance, the metric may go below 0.

The outcomes in this competition are grouped into 4 groups (labeled accuracy_group in the data):

3: the assessment was solved on the first attempt
2: the assessment was solved on the second attempt
1: the assessment was solved after 3 or more attempts
0: the assessment was never solved

More about QWK https://www.kaggle.com/c/data-science-bowl-2019/discussion/114133
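
The metric can be written out as follows. This is the usual quadratic weighted kappa definition, with notation chosen here for illustration: $O$ is the normalized confusion matrix of actual versus predicted groups, $E$ is the normalized outer product of the two label histograms, and $N = 4$ is the number of groups.

```latex
\kappa = 1 - \frac{\sum_{i,j} W_{ij}\, O_{ij}}{\sum_{i,j} W_{ij}\, E_{ij}},
\qquad
W_{ij} = \frac{(i-j)^2}{(N-1)^2}
```

Because the weight grows with the squared distance between groups, predicting group 0 for a true group 3 is penalized nine times as heavily as being off by one group.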

In [None]:
def qwk(act,pred,n=4,hist_range=(0,3)):
    
    # pass labels explicitly so O is always n x n, even if a group is missing
    O = confusion_matrix(act,pred,labels=list(range(n)))
    O = np.divide(O,np.sum(O))
    
    W = np.zeros((n,n))
    for i in range(n):
        for j in range(n):
            W[i][j] = ((i-j)**2)/((n-1)**2)
            
    act_hist = np.histogram(act,bins=n,range=hist_range)[0]
    prd_hist = np.histogram(pred,bins=n,range=hist_range)[0]
    
    E = np.outer(act_hist,prd_hist)
    E = np.divide(E,np.sum(E))
    
    num = np.sum(np.multiply(W,O))
    den = np.sum(np.multiply(W,E))
        
    return 1-np.divide(num,den)
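
As a sanity check for a hand-written implementation, scikit-learn's `cohen_kappa_score` with `weights="quadratic"` computes the same statistic. The sketch below uses made-up labels (`actual` and `predicted` are hypothetical) with two mistakes, one off by one group and one off by two.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical accuracy groups for eight assessments.
actual = [0, 1, 2, 3, 3, 2, 1, 0]
predicted = [0, 2, 2, 3, 1, 2, 1, 0]

kappa = cohen_kappa_score(actual, predicted, weights="quadratic")
perfect = cohen_kappa_score(actual, actual, weights="quadratic")
print(round(kappa, 4), perfect)
```

Comparing the output of a custom function against this value on a few label lists is a cheap way to catch indexing mistakes in the weight matrix.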

In [None]:
list_of_user_activities = list(set(train_df['title'].unique()).union(set(test_df['title'].unique())))
activities_map = dict(zip(list_of_user_activities, np.arange(len(list_of_user_activities))))

train_df['title'] = train_df['title'].map(activities_map)
test_df['title'] = test_df['title'].map(activities_map)
train_labels_df['title'] = train_labels_df['title'].map(activities_map)

win_code = dict(zip(activities_map.values(), (4100*np.ones(len(activities_map))).astype('int')))
win_code[activities_map['Bird Measurer (Assessment)']] = 4110

train_df['timestamp'] = pd.to_datetime(train_df['timestamp'])
test_df['timestamp'] = pd.to_datetime(test_df['timestamp'])

In [None]:
def get_data(user_sample, test_set=False):
    last_activity = 0
    user_activities_count = {'Clip':0, 'Activity': 0, 'Assessment': 0, 'Game':0}
    accuracy_groups = {0:0, 1:0, 2:0, 3:0}
    all_assessments = []
    accumulated_accuracy_group = 0
    accumulated_accuracy=0
    accumulated_correct_attempts = 0 
    accumulated_uncorrect_attempts = 0 
    accumulated_actions = 0
    counter = 0
    durations = []
    for i, session in user_sample.groupby('game_session', sort=False):
        session_type = session['type'].iloc[0]
        session_title = session['title'].iloc[0]
        # Test data keeps every assessment session; training keeps only sessions with more than one event.
        second_condition = test_set or len(session) > 1

        if (session_type == 'Assessment') and second_condition:
            all_attempts = session.query(f'event_code == {win_code[session_title]}')
            true_attempts = all_attempts['event_data'].str.contains('true').sum()
            false_attempts = all_attempts['event_data'].str.contains('false').sum()
            features = user_activities_count.copy()
            features['session_title'] = session['title'].iloc[0] 
            features['accumulated_correct_attempts'] = accumulated_correct_attempts
            features['accumulated_uncorrect_attempts'] = accumulated_uncorrect_attempts
            accumulated_correct_attempts += true_attempts 
            accumulated_uncorrect_attempts += false_attempts
            if durations == []:
                features['duration_mean'] = 0
            else:
                features['duration_mean'] = np.mean(durations)
            durations.append((session.iloc[-1, 2] - session.iloc[0, 2] ).seconds)
            features['accumulated_accuracy'] = accumulated_accuracy/counter if counter > 0 else 0
            accuracy = true_attempts/(true_attempts+false_attempts) if (true_attempts+false_attempts) != 0 else 0
            accumulated_accuracy += accuracy
            if accuracy == 0:
                features['accuracy_group'] = 0
            elif accuracy == 1:
                features['accuracy_group'] = 3
            elif accuracy == 0.5:
                features['accuracy_group'] = 2
            else:
                features['accuracy_group'] = 1

            features.update(accuracy_groups)
            features['accumulated_accuracy_group'] = accumulated_accuracy_group/counter if counter > 0 else 0
            features['accumulated_actions'] = accumulated_actions
            accumulated_accuracy_group += features['accuracy_group']
            accuracy_groups[features['accuracy_group']] += 1
            if test_set == True:
                all_assessments.append(features)
            else:
                if true_attempts+false_attempts > 0:
                    all_assessments.append(features)
                
            counter += 1

        accumulated_actions += len(session)
        if last_activity != session_type:
            user_activities_count[session_type] += 1
            last_activity = session_type

    if test_set:
        return all_assessments[-1] 
    return all_assessments

In [None]:
compiled_data = []
installation_id = train_df['installation_id'].nunique()
for i, (ins_id, user_sample) in tqdm(enumerate(train_df.groupby('installation_id', sort=False)), total=installation_id):
    compiled_data += get_data(user_sample)

In [None]:
new_train = pd.DataFrame(compiled_data)
del compiled_data
new_train.shape

In [None]:
new_train.head()

In [None]:
all_features = [x for x in new_train.columns if x not in ['accuracy_group']]
cat_features = ['session_title']
X, y = new_train[all_features], new_train['accuracy_group']
del train_df

We will use CatBoostClassifier.

In [None]:
clf = CatBoostClassifier(loss_function='MultiClass',task_type="CPU",learning_rate=0.05,iterations=3000,od_type="Iter",early_stopping_rounds=500,random_seed=21)
clf.fit(X, y, verbose=500, cat_features=cat_features)
del X, y

In [None]:
new_test = []
for ins_id, user_sample in tqdm(test_df.groupby('installation_id', sort=False), total=test_df['installation_id'].nunique()):
    a = get_data(user_sample, test_set=True)
    new_test.append(a)
    
X_test = pd.DataFrame(new_test)
del test_df

In [None]:
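# select the training feature columns so the label column built by get_data is not passed to the model
preds = clf.predict(X_test[all_features])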
del X_test

In [None]:
# flatten the predictions to one dimension before assigning them to the column
submission_df['accuracy_group'] = np.round(preds).astype('int').ravel()
submission_df.to_csv('submission.csv', index=False)
submission_df.head()

> WIP<br>
> Please upvote if you find this kernel interesting