# pbskids
In this dataset, you are provided with game analytics for the PBS KIDS Measure Up! app. In this app, children navigate a map and complete various levels, which may be activities, video clips, games, or assessments. Each assessment is designed to test a child's comprehension of a certain set of measurement-related skills. There are five assessments: Bird Measurer, Cart Balancer, Cauldron Filler, Chest Sorter, and Mushroom Sorter.


The intent of the competition is to use the gameplay data to forecast how many attempts a child will take to pass a given assessment (an incorrect answer is counted as an attempt).


Each application install is represented by an installation_id. This will typically correspond to one child, 
but you should expect noise from issues such as shared devices. 


In the training set, you are provided the full history of gameplay data. 
In the test set, we have truncated the history after the start event of a single assessment, chosen randomly, for which you must predict the number of attempts. 


Note that the training set contains many installation_ids which never took assessments, whereas every installation_id in the test set made an attempt on at least **one assessment.**


The outcomes in this competition are grouped into 4 groups (labeled accuracy_group in the data):
- 3: the assessment was solved on the first attempt
- 2: the assessment was solved on the second attempt
- 1: the assessment was solved after 3 or more attempts
- 0: the assessment was never solved

The file train_labels.csv has been provided to show how these groups would be computed on the assessments in the training set. 
Assessment attempts are captured in event_code 4100 for all assessments except for Bird Measurer, which uses event_code 4110. 
If the attempt was correct, it contains "correct":true.
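
The grouping rule above can be sketched in pandas. This is a minimal sketch, not the official labelling code: it assumes assessment titles look like `Bird Measurer (Assessment)` (following the title pattern seen elsewhere in the data), that a session ends at its first correct attempt, and the helper names `attempts_to_group` and `label_sessions` are made up for illustration. It has not been checked against train_labels.csv.

```python
import json

import numpy as np
import pandas as pd


def attempts_to_group(num_correct, num_incorrect):
    """Map one session's attempt counts to an accuracy_group (0-3)."""
    if num_correct == 0:
        return 0  # never solved
    if num_incorrect == 0:
        return 3  # solved on the first attempt
    if num_incorrect == 1:
        return 2  # solved on the second attempt
    return 1  # solved after three or more attempts


def label_sessions(events):
    """Count attempts per assessment session and derive accuracy_group."""
    # Bird Measurer attempts use event_code 4110; other assessments use 4100.
    is_bird = events["title"].str.startswith("Bird Measurer")
    attempt_code = np.where(is_bird, 4110, 4100)
    mask = (events["type"] == "Assessment") & (events["event_code"] == attempt_code)
    attempts = events.loc[mask].copy()

    # A correct attempt carries "correct": true in its event_data JSON.
    attempts["num_correct"] = attempts["event_data"].map(
        lambda raw: int(bool(json.loads(raw).get("correct", False)))
    )
    attempts["num_incorrect"] = 1 - attempts["num_correct"]

    counts = (
        attempts.groupby(["installation_id", "game_session"])[
            ["num_correct", "num_incorrect"]
        ]
        .sum()
        .reset_index()
    )
    counts["accuracy_group"] = [
        attempts_to_group(c, i)
        for c, i in zip(counts["num_correct"], counts["num_incorrect"])
    ]
    return counts


# Toy events, made up for illustration (not taken from the dataset).
toy_events = pd.DataFrame(
    [
        ("inst1", "s1", "Cart Balancer (Assessment)", "Assessment", 4100, '{"correct": false}'),
        ("inst1", "s1", "Cart Balancer (Assessment)", "Assessment", 4100, '{"correct": true}'),
        ("inst1", "s2", "Bird Measurer (Assessment)", "Assessment", 4100, '{"correct": true}'),
        ("inst1", "s2", "Bird Measurer (Assessment)", "Assessment", 4110, '{"correct": false}'),
        ("inst2", "s3", "Chest Sorter (Assessment)", "Assessment", 4100, '{"correct": true}'),
    ],
    columns=["installation_id", "game_session", "title", "type", "event_code", "event_data"],
)
toy_labels = label_sessions(toy_events)
print(toy_labels)
```

Note that the Bird Measurer row with event_code 4100 is ignored on purpose, so session `s2` counts only its 4110 attempt.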

## train.csv & test.csv
These are the main data files which contain the gameplay events.

event_id 
- Randomly generated unique identifier for the event type. Maps to event_id column in specs table.

game_session
- Randomly generated unique identifier grouping events within a single game or video play session.

timestamp
- Client-generated datetime

event_data
- Semi-structured JSON formatted string containing the events parameters. Default fields are: event_count, event_code, and game_time; otherwise fields are determined by the event type.

installation_id
- Randomly generated unique identifier grouping game sessions within a single installed application instance.

event_count
- Incremental counter of events within a game session (offset at 1). Extracted from event_data.

event_code
- Identifier of the event 'class'. Unique per game, but may be duplicated across games. E.g. event code '2000' always identifies the 'Start Game' event for all games. Extracted from event_data.

game_time
- Time in milliseconds since the start of the game session. Extracted from event_data.

title
- Title of the game or video.

type
- Media type of the game or video. Possible values are: 'Game', 'Assessment', 'Activity', 'Clip'.

world
- The section of the application the game or video belongs to. 
- Helpful to identify the educational curriculum goals of the media. Possible values are: 'NONE' (at the app's start screen), 'TREETOPCITY' (Length/Height), 'MAGMAPEAK' (Capacity/Displacement), 'CRYSTALCAVES' (Weight).

## specs.csv
This file gives the specification of the various event types.

event_id
- Global unique identifier for the event type. Joins to event_id column in events table.

info
- Description of the event.

args
- JSON formatted string of event arguments. Each argument contains:

name
- Argument name.

type
- Type of the argument (string, int, number, object, array).

info
- Description of the argument.
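
As a rough illustration of the `args` layout described above, the sketch below flattens the JSON into one row per argument. The function name `explode_spec_args` and the toy specs row (including its event_id `abc12345`) are hypothetical; only the `name`/`type`/`info` keys come from the description.

```python
import json

import pandas as pd


def explode_spec_args(specs):
    """Return one row per (event_id, argument) from the JSON in specs['args']."""
    rows = []
    for event_id, raw_args in zip(specs["event_id"], specs["args"]):
        for arg in json.loads(raw_args):
            rows.append(
                {
                    "event_id": event_id,
                    "arg_name": arg.get("name"),
                    "arg_type": arg.get("type"),
                    "arg_info": arg.get("info"),
                }
            )
    return pd.DataFrame(rows, columns=["event_id", "arg_name", "arg_type", "arg_info"])


# Hypothetical specs row, made up for illustration.
toy_specs = pd.DataFrame(
    {
        "event_id": ["abc12345"],
        "info": ["Example event description."],
        "args": [
            '[{"name": "game_time", "type": "int", "info": "ms since session start"},'
            ' {"name": "coordinates", "type": "object", "info": "click position"}]'
        ],
    }
)
spec_args = explode_spec_args(toy_specs)
print(spec_args)
```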

## Situations
<i> Refer to : https://www.kaggle.com/c/data-science-bowl-2019/discussion/117019#latest-671781</i>

Though the game seems pretty straightforward, it took me some time to understand the data. After going through the app and the discussions/kernels, I now have a better understanding of the data and what we are expected to predict. I am sharing it here so that it might be of help to someone.


I will try to explain with a fictitious example. Imagine that Janet, a 5-year-old kid, wants to play the Measure Up! app installed on her dad's iPad (installation_id - unique per **installation**).


When she opens the app (game session), she has a choice between **three "worlds"**, each with its own theme:
- Treetop City
- Magma Peak
- Crystal Caves

Within each world, there are multiple **media types**
- Video Clip
- Activity
- Game
- Assessment


The learning path is designed as follows:
1. Exposure (video clip)
2. Exploration (activity)
3. Practice (game)
4. Demonstration (assessment)

**But Janet doesn't have to follow this order; she is free to choose.**

For example, she can straight away start with the assessment, or start with a game, then an activity followed by an assessment. 

She can also repeat the assessment as many times as she wishes. 
Maybe she gets it right the first time, but she likes it so much that she wants to do the assessment again
(there is nothing in the game that prevents her from doing so).


**Some games/assessments might have multiple rounds/levels.**

The level of difficulty is the same for everyone;
it doesn't become progressively more difficult based on performance.


There are multiple media **titles** belonging to the different types.
For example, within 'Treetop City' there are, amongst others:
- All star sorting (Game)
- Treasure map (Clip)
- Fireworks (Activity)


Janet likes dinosaurs so much that she wants to play the game 'All star sorting'. 

In this game, she needs to place the dinosaurs in their homes in the order of their heights. 

The system records **all the actions (both Janet's as well as the system response) as events**.

For example, the events could be the following. She:
- presses the play button,
- drags the dinosaurs,
- places the dinosaurs in the right home.

The system, in turn, gives instructions and feedback on whether her action is correct.


**Each of these events has a corresponding event_id and event_code.**

**An event_code can be thought of as a category,** and an **event_id is specific to the title or game**.
For example,
**the event_ids 6043a2b4 (All star sorting) and d3640339 (Dino dive) both represent event_code 4090 (player clicks the help button), but because they belong to different games they have different event_ids.**
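
One way to check this relationship in the data is to count distinct event_ids per event_code, and vice versa. The sketch below uses a toy frame built from the two event_ids quoted above; the exact title strings are assumptions, and on the real data `toy_events` would be replaced by the loaded training frame.

```python
import pandas as pd

# Toy rows built from the example above; title strings are assumed.
toy_events = pd.DataFrame(
    {
        "event_id": ["6043a2b4", "d3640339", "6043a2b4"],
        "event_code": [4090, 4090, 4090],
        "title": ["All Star Sorting", "Dino Dive", "All Star Sorting"],
    }
)

# Unique (event_code, title, event_id) combinations.
pairs = toy_events[["event_code", "title", "event_id"]].drop_duplicates()

# How many distinct event_ids share each event_code (category)?
ids_per_code = pairs.groupby("event_code")["event_id"].nunique()

# How many event_codes does each event_id map to? Expected to be 1.
codes_per_id = pairs.groupby("event_id")["event_code"].nunique()

print(ids_per_code)
print(codes_per_id)
```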


When an event is recorded, the event_data captures detailed information based on the event. 
For example, if Janet clicks on a dinosaur, the x,y coordinates of the click are also noted.
So the information captured can differ from event to event.
**The event_data field holds this detailed information in JSON format,**
**and we can use the args column in the specs dataset to parse it.**
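
A minimal sketch of pulling fields out of event_data, assuming click events carry a `coordinates` object with `x` and `y` keys (as in the sample rows of train.csv). The helper name `extract_click_xy` and the output column names are made up for illustration; events without coordinates get missing values.

```python
import json

import pandas as pd


def extract_click_xy(events):
    """Add click_x / click_y columns parsed from the event_data JSON string."""
    parsed = events["event_data"].map(json.loads)
    out = events.copy()
    # Not every event type has coordinates, so fall back to an empty dict.
    out["click_x"] = parsed.map(lambda d: d.get("coordinates", {}).get("x"))
    out["click_y"] = parsed.map(lambda d: d.get("coordinates", {}).get("y"))
    return out


# Toy rows, shortened from the kind of event_data seen in train.csv.
toy_events = pd.DataFrame(
    {
        "event_code": [2000, 4070],
        "event_data": [
            '{"event_code": 2000, "event_count": 1}',
            '{"coordinates": {"x": 583, "y": 605}, "event_code": 4070}',
        ],
    }
)
clicks = extract_click_xy(toy_events)
print(clicks[["event_code", "click_x", "click_y"]])
```

Parsing every row with `json.loads` is slow on the full training set, so in practice it may be worth filtering to the event codes of interest first.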


Key points to note:


You should expect noise from issues such as shared devices. Imagine that
**Joe (Janet's little brother) also shares the same iPad, and often plays the game.**


The start of a title (game/activity/clip/assessment) is recorded with event_code 2000.
Assessment attempts (with their outcome, correct or incorrect) are captured in event_code 4100 for all assessments except Bird Measurer, which uses event_code 4110.

Note: In the data, we do see an event_code 4100 for Bird Measurer;
maybe it is related to another level of this assessment.
But for this competition, when it comes to Bird Measurer, we are interested in 4110.

Test data - Per installation_id, the last row contains the start event of the assessment (event_code 2000). We need to predict the accuracy_group for this assessment. Note: In the test data, you may find previous assessments (with their outcomes, event_code 4100 or 4110).
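
To locate the assessment that has to be predicted, one option is to take the last event per installation_id after sorting by timestamp. This is a sketch under the assumption that the test frame has the same columns as train.csv; the helper name `last_event_per_installation` and the toy rows are made up for illustration.

```python
import pandas as pd


def last_event_per_installation(test_events):
    """Return the final row (by timestamp) for each installation_id."""
    ordered = test_events.sort_values(["installation_id", "timestamp"])
    return ordered.groupby("installation_id").tail(1)


# Toy test events, made up for illustration.
toy_test = pd.DataFrame(
    {
        "installation_id": ["a", "a", "b", "b", "a"],
        "timestamp": pd.to_datetime(
            [
                "2019-09-06T17:00:00Z",
                "2019-09-06T17:01:00Z",
                "2019-09-06T18:00:00Z",
                "2019-09-06T18:05:00Z",
                "2019-09-07T09:00:00Z",
            ]
        ),
        "event_code": [2000, 4100, 2000, 2000, 2000],
        "type": ["Assessment", "Assessment", "Clip", "Assessment", "Assessment"],
        "title": [
            "Cart Balancer (Assessment)",
            "Cart Balancer (Assessment)",
            "Crystal Caves - Level 1",
            "Mushroom Sorter (Assessment)",
            "Chest Sorter (Assessment)",
        ],
    }
)
targets = last_event_per_installation(toy_test)
print(targets[["installation_id", "title", "event_code"]])
```

In the toy data, installation `a` has an earlier assessment with a recorded 4100 attempt; that history stays available for feature engineering, while only the last row is the prediction target.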


### Official 
<i>https://www.kaggle.com/c/data-science-bowl-2019/discussion/115034#latest-675608</i>

Hello!

I'm a data scientist at PBS KIDS and I'm part of the team that developed and helped collect the competition dataset.

First off, our team will be submitting solutions but we will not be eligible for the cash prizes. 
I am not a ML expert so I will be relying mostly on feature engineering and my understanding of the subject matter.

On that front, I thought it would be helpful for us to give a more detailed description of the media types that are presented in the PBS KIDS Measure Up! app. **These media types are tagged in the dataset as Clip, Activity, Game, or Assessment.**

The app is designed to try to guide the kids through an idealized learning path, 
which is intended to present players with a pattern of exposure->exploration->practice->demonstration (as in demonstration of knowledge).
Each of the worlds in the app may have one or more such sequences of media objects, 
and sometimes the app does not follow this exact formula. However, kids are not required to follow the path that is laid out for them, 
and whether the suggested linear progression leads to better learning outcomes than a random path is not yet clear.
Perhaps this competition will give us some insights into this question as well!

Each content type can be loosely thought of as corresponding to a phase of the learning cycle.

- Clips
Videos are intended to expose the kid to a topic or a problem solving approach. 
Videos typically model or explain things. There is no interactive component to videos. 
Clips can further be classified into:
- Interstitials: short transitional videos between worlds or sections of the world, in which the protagonists of the adventure (Del, Dot and Dee) are seen exploring the island. Aside from the introductory video titled 'Welcome To The Lost Lagoon!', these can be identified by the title specifying the world and the relevant section (e.g. 'Crystal Caves - Level 1'). These videos merely hint at the subject matter.
- Longer clips (2-3 minutes in length): these videos explain an important subject or approach with the help of familiar characters from the PBS KIDS world. Typically these videos have been excerpted from longer television episodes.

Keep in mind in the dataset only the start of the video playback is captured. 
Therefore there are far fewer events corresponding to clips than there are to games or assessments. 
That does not mean clips are less popular! Also, lack of interactivity notwithstanding,
there is good evidence that video contributes significantly to learning outcomes.

- Activities
Activities are open-ended mini-games that allow kids to practice their skills in an environment that mimics real life play patterns to support “messing about”. 
Activities do not have a defined goal, but they do typically model cause and effect. 
We sometimes refer to Activities as 'sandboxes' or 'toys'.

- Games
These are the typical video games most people are familiar with. 
Games help kids practice their skills with the goal of solving a specific problem. 
Each challenge may belong to a progressively more challenging round (marked in the data), 
and multiple rounds may be grouped into levels. Games do not end until the player finishes the game or decides to exit the play session.
If a final goal is achieved, there is usually an option to replay the entire game from the start.

- Assessments
Assessments are interactives that are designed specifically with the goal of measuring a player’s knowledge of the subject matter. 
Metrics that represent the intrinsic knowledge of the user are typically derived either from first principles rooted in childhood educational psychometry or from a posteriori data observations. One such (simple) metric might be the number of incorrect answers leading to the assessment solution, but many others can be formulated.

In [4]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from IPython.display import display,HTML,Image, display_png
import pickle
import re
from itertools import zip_longest
from collections import Counter
import datetime
from datetime import datetime as dt
import copy
from functools import reduce
from tqdm import tqdm_notebook as tqdm
import os
import multiprocessing

## Set up 

In [3]:
df_train = pd.read_csv('../source/train.csv.zip')

In [5]:
df_specs = pd.read_csv('../source/specs.csv')

### Clean data

In [7]:
df_train.head(10)

Unnamed: 0,event_id,game_session,timestamp,event_data,installation_id,event_count,event_code,game_time,title,type,world
0,27253bdc,45bb1e1b6b50c07b,2019-09-06T17:53:46.937Z,"{""event_code"": 2000, ""event_count"": 1}",0001e90f,1,2000,0,Welcome to Lost Lagoon!,Clip,NONE
1,27253bdc,17eeb7f223665f53,2019-09-06T17:54:17.519Z,"{""event_code"": 2000, ""event_count"": 1}",0001e90f,1,2000,0,Magma Peak - Level 1,Clip,MAGMAPEAK
2,77261ab5,0848ef14a8dc6892,2019-09-06T17:54:56.302Z,"{""version"":""1.0"",""event_count"":1,""game_time"":0...",0001e90f,1,2000,0,Sandcastle Builder (Activity),Activity,MAGMAPEAK
3,b2dba42b,0848ef14a8dc6892,2019-09-06T17:54:56.387Z,"{""description"":""Let's build a sandcastle! Firs...",0001e90f,2,3010,53,Sandcastle Builder (Activity),Activity,MAGMAPEAK
4,1bb5fbdb,0848ef14a8dc6892,2019-09-06T17:55:03.253Z,"{""description"":""Let's build a sandcastle! Firs...",0001e90f,3,3110,6972,Sandcastle Builder (Activity),Activity,MAGMAPEAK
5,1325467d,0848ef14a8dc6892,2019-09-06T17:55:06.279Z,"{""coordinates"":{""x"":583,""y"":605,""stage_width"":...",0001e90f,4,4070,9991,Sandcastle Builder (Activity),Activity,MAGMAPEAK
6,1325467d,0848ef14a8dc6892,2019-09-06T17:55:06.913Z,"{""coordinates"":{""x"":601,""y"":570,""stage_width"":...",0001e90f,5,4070,10622,Sandcastle Builder (Activity),Activity,MAGMAPEAK
7,1325467d,0848ef14a8dc6892,2019-09-06T17:55:07.546Z,"{""coordinates"":{""x"":250,""y"":665,""stage_width"":...",0001e90f,6,4070,11255,Sandcastle Builder (Activity),Activity,MAGMAPEAK
8,1325467d,0848ef14a8dc6892,2019-09-06T17:55:07.979Z,"{""coordinates"":{""x"":279,""y"":629,""stage_width"":...",0001e90f,7,4070,11689,Sandcastle Builder (Activity),Activity,MAGMAPEAK
9,1325467d,0848ef14a8dc6892,2019-09-06T17:55:08.566Z,"{""coordinates"":{""x"":839,""y"":654,""stage_width"":...",0001e90f,8,4070,12272,Sandcastle Builder (Activity),Activity,MAGMAPEAK


In [9]:
df_train.timestamp = pd.to_datetime(df_train.timestamp,format='%Y-%m-%dT%H:%M:%S.%f')

In [15]:
# Check type
df_train.dtypes

event_id                        object
game_session                    object
timestamp          datetime64[ns, UTC]
event_data                      object
installation_id                 object
event_count                      int64
event_code                       int64
game_time                        int64
title                           object
type                            object
world                           object
dtype: object

In [16]:
# Check for missing values
df_train.isnull().sum()

event_id           0
game_session       0
timestamp          0
event_data         0
installation_id    0
event_count        0
event_code         0
game_time          0
title              0
type               0
world              0
dtype: int64

In [26]:
for _col in df_train.columns:
    
    print(_col+' ',end="")
    if df_train[_col].dtype=='object' :
        print(df_train[_col].value_counts())
    print(" ")

event_id 1325467d    274673
bb3e370b    256179
cf82af56    224694
5e812b27    206129
cfbd47c8    199734
             ...  
4074bac2         1
5dc079d8         1
119b5b02         1
dcb1663e         1
1b54d27f         1
Name: event_id, Length: 384, dtype: int64
 
game_session 6e6e697f2e593de1    3368
bb1f09ec062b6660    3182
33495c8f126e2ef9    2505
34c82b23355e378c    2456
8fe0ab3c3e448a04    2398
                    ... 
62f2230887cd06dc       1
fef0b3861e79283a       1
19da2ad05bb1f7b1       1
4dc0be6f75f1787c       1
118cbeda2a3037fb       1
Name: game_session, Length: 303319, dtype: int64
 
timestamp  
event_data {"event_code": 2000, "event_count": 1}                                                                                                                                                                                                                                                                  183676
{"version":"1.0","event_count":1,"game_time":0,"event_code":2000}        

In [21]:
df_train[_col].value_counts()

MAGMAPEAK       5023687
CRYSTALCAVES    3232546
TREETOPCITY     3061231
NONE              23578
Name: world, dtype: int64

In [42]:
def save_pickle(filename,obj):
    path_folder = '../pickle/'
    with open(path_folder+filename+'.pickle','wb') as f:
        pickle.dump(obj,f)

In [43]:
save_pickle('df_train',df_train)
save_pickle('df_spec',df_specs)

OSError: [Errno 22] Invalid argument
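
One commonly reported cause of `OSError: [Errno 22] Invalid argument` when pickling a large object is a single write of more than about 2 GB, seen on macOS with some Python 3 versions. Whether that applies here is an assumption, since the platform is not stated. Under that assumption, a possible workaround is to serialize to bytes and write in smaller chunks. The function names below are hypothetical, and this approach briefly holds the whole pickle in memory.

```python
import os
import pickle
import tempfile

MAX_CHUNK = 2**31 - 1  # stay below 2 GiB per read/write call


def save_pickle_chunked(path, obj, chunk_size=MAX_CHUNK):
    """Pickle obj to path, writing the bytes in chunks of chunk_size."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "wb") as f:
        for start in range(0, len(payload), chunk_size):
            f.write(payload[start:start + chunk_size])


def load_pickle_chunked(path, chunk_size=MAX_CHUNK):
    """Read a pickle written by save_pickle_chunked, chunk by chunk."""
    buf = bytearray()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
    return pickle.loads(bytes(buf))


# Round trip on a small object, with a tiny chunk size to exercise the loop.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.pickle")
    save_pickle_chunked(demo_path, {"rows": list(range(1000))}, chunk_size=64)
    restored = load_pickle_chunked(demo_path, chunk_size=64)

print(len(restored["rows"]))
```

For the training frame, the call would look like `save_pickle_chunked('../pickle/df_train.pickle', df_train)`, assuming the `../pickle/` folder already exists.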