# About this competition from Kaggle

In this dataset, you are provided with game analytics for the PBS KIDS Measure Up! app. In this app, children navigate a map and complete various levels, which may be activities, video clips, games, or assessments. Each assessment is designed to test a child's comprehension of a certain set of measurement-related skills. There are five assessments: Bird Measurer, Cart Balancer, Cauldron Filler, Chest Sorter, and Mushroom Sorter.

The intent of the competition is to use the gameplay data to forecast how many attempts a child will take to pass a given assessment (an incorrect answer is counted as an attempt). Each application install is represented by an installation_id. This will typically correspond to one child, but you should expect noise from issues such as shared devices. In the training set, you are provided the full history of gameplay data. In the test set, we have truncated the history after the start event of a single assessment, chosen randomly, for which you must predict the number of attempts. Note that the training set contains many installation_ids which never took assessments, whereas every installation_id in the test set made an attempt on at least one assessment.

The outcomes in this competition are grouped into 4 groups (labeled accuracy_group in the data):

3: the assessment was solved on the first attempt
2: the assessment was solved on the second attempt
1: the assessment was solved after 3 or more attempts
0: the assessment was never solved

The file train_labels.csv has been provided to show how these groups would be computed on the assessments in the training set. Assessment attempts are captured in event_code 4100 for all assessments except for Bird Measurer, which uses event_code 4110. If the attempt was correct, it contains "correct":true.
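The grouping rule above can be sketched in a few lines. This is only an illustration, not the code Kaggle used to build train_labels.csv: the helper names `accuracy_group` and `label_session`, the toy rows, and the assumption that assessment titles begin with the assessment name (so Bird Measurer can be detected with `startswith`) are my own.

```python
import json

import pandas as pd


def accuracy_group(num_correct, num_incorrect):
    """Map attempt counts for one assessment session to the 0-3 group."""
    if num_correct == 0:
        return 0          # never solved
    if num_incorrect == 0:
        return 3          # solved on the first attempt
    if num_incorrect == 1:
        return 2          # solved on the second attempt
    return 1              # solved after 3 or more attempts


def label_session(session, title):
    """Count attempts in one session's events and return its group."""
    # Bird Measurer logs attempts under 4110, every other assessment under 4100
    attempt_code = 4110 if title.startswith('Bird Measurer') else 4100
    attempts = session[session['event_code'] == attempt_code]
    correct = attempts['event_data'].map(
        lambda s: json.loads(s).get('correct', False)
    ).astype(bool)
    return accuracy_group(int(correct.sum()), int((~correct).sum()))


# Toy session (made-up rows): two wrong answers, then a correct one
toy = pd.DataFrame({
    'event_code': [2000, 4100, 4100, 4100],
    'event_data': ['{"event_code": 2000}', '{"correct": false}',
                   '{"correct": false}', '{"correct": true}'],
})
print(label_session(toy, 'Cart Balancer (Assessment)'))  # third attempt -> 1
```
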

### Inspiration / References for this piece:

Erik Bruin's extensive EDA and baseline kernel https://www.kaggle.com/erikbruin/data-science-bowl-2019-eda-and-baseline

Gabriel Preda's detailed data exploration plots https://www.kaggle.com/gpreda/2019-data-science-bowl-eda#Data-exploration

Guillaume Martin's memory reduction function in this kernel - https://www.kaggle.com/gemartin/load-data-reduce-memory-usage. 

We start by importing the necessary libraries 

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_selection import VarianceThreshold


# Data input

Reading all the data files

In [None]:
%%time
train = pd.read_csv('../input/data-science-bowl-2019/train.csv')
train_labels = pd.read_csv('../input/data-science-bowl-2019/train_labels.csv')
test = pd.read_csv('../input/data-science-bowl-2019/test.csv')
specs = pd.read_csv('../input/data-science-bowl-2019/specs.csv')
sample_submission = pd.read_csv('../input/data-science-bowl-2019/sample_submission.csv')

In [None]:
np.random.seed(123)

Note that reading the files took about 1 min 20 s. As Kaggle points out, this is a synchronous rerun code competition and the private test set has approximately 8 million rows, so we should be mindful of memory in our notebooks to avoid submission errors.
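One way to keep memory down from the start is to tell pandas which dtypes to use while reading, instead of shrinking the columns afterwards. The sketch below uses a tiny in-memory CSV rather than the real files, and the integer widths are assumptions that should be checked against the actual value ranges before being applied to train.csv or test.csv.

```python
import io

import pandas as pd

# Tiny stand-in for train.csv (made-up rows, subset of the real columns)
toy_csv = io.StringIO(
    "event_id,game_session,event_count,event_code,game_time,type\n"
    "a1,s1,1,2000,0,Clip\n"
    "a2,s1,2,4070,1530,Game\n"
)

# Assumed dtypes: narrow integers for counters, category for low-cardinality text
dtypes = {
    'event_count': 'int16',
    'event_code': 'int16',
    'game_time': 'int32',
    'type': 'category',
}

small = pd.read_csv(toy_csv, dtype=dtypes)
print(small.dtypes)
print(small.memory_usage(deep=True).sum(), 'bytes')
```
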

# Data Formatting

In [None]:
train.sample(10)

In [None]:
train.shape

So the train data has 11 features and about 11 million rows.

Kaggle describes the columns of the main data files, which contain the gameplay events, as follows:

* event_id - Randomly generated unique identifier for the event type. Maps to event_id column in specs table.
* game_session - Randomly generated unique identifier grouping events within a single game or video play session.
* timestamp - Client-generated datetime
* event_data - Semi-structured JSON formatted string containing the events parameters. Default fields are: event_count, event_code, and game_time; otherwise fields are determined by the event type.
* installation_id - Randomly generated unique identifier grouping game sessions within a single installed application instance.
* event_count - Incremental counter of events within a game session (offset at 1). Extracted from event_data.
* event_code - Identifier of the event 'class'. Unique per game, but may be duplicated across games. E.g. event code '2000' always identifies the 'Start Game' event for all games. Extracted from event_data.
* game_time - Time in milliseconds since the start of the game session. Extracted from event_data.
* title - Title of the game or video.
* type - Media type of the game or video. Possible values are: 'Game', 'Assessment', 'Activity', 'Clip'.
* world - The section of the application the game or video belongs to. Helpful to identify the educational curriculum goals of the media. Possible values are: 'NONE' (at the app's start screen), 'TREETOPCITY' (Length/Height), 'MAGMAPEAK' (Capacity/Displacement), 'CRYSTALCAVES' (Weight).

In [None]:
test.sample(10)

In [None]:
test.shape

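The test data has the same 11 features. Note that the public test.csv is much smaller than train (about 1.1 million rows rather than 11 million); the 8 million figure quoted by Kaggle refers to the private test set.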

Since Kaggle states that only assessments are scored, it makes no sense to retain installation_ids that never took an assessment.

In [None]:
assessed_only = train[train.type == 'Assessment'].drop_duplicates(subset='installation_id')[['installation_id']]
train = train[train.installation_id.isin(assessed_only['installation_id'])]
train.shape

This has reduced the train data to 8 million rows. 

To understand how the data comes together, we will see if there are any common installation ids in the test and train 

In [None]:
len(set(train.installation_id.unique()) & (set(test.installation_id.unique())))

Ok, that's good: the overlap is empty, so kids (or kids sharing a device under the same installation_id) who are in train are not present in test.
What about game sessions? Are there any common values in train and test?

In [None]:
len(set(train.game_session.unique()) & (set(test.game_session.unique())))

None, again. How about event_ids?

In [None]:
len(set(train.event_id.unique()) & (set(test.event_id.unique())))

We know that event_ids are randomly generated identifiers for event types. To understand what these events are, we can look at the specs data, which maps each identifier to a description of the event.

In [None]:
pd.options.display.max_colwidth = 150
specs.sample(10)

Ok, so these are all events triggered within the app. Some examples include:
* When the player hovers mouse over an interactive object
* When the player clicks on the help button
* When the player picks a mushroom in the resource area

It therefore makes sense that 365 event_ids are common to both train and test: they identify event types, not individual players or sessions. Furthermore, Kaggle defines the columns of specs.csv as follows:

* event_id - Global unique identifier for the event type. Joins to event_id column in events table.
* info - Description of the event.
* args - JSON formatted string of event arguments. Each argument contains:
    * name - Argument name.
    * type - Type of the argument (string, int, number, object, array).
    * info - Description of the argument.
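As a rough sketch of how these columns fit together, the snippet below parses the `args` JSON and attaches the event description to a few event rows. The ids and description strings are made up for illustration; only the column names come from the description above.

```python
import json

import pandas as pd

# Made-up specs row following the documented layout
toy_specs = pd.DataFrame({
    'event_id': ['abc123'],
    'info': ['Toy description of a start event'],
    'args': ['[{"name": "game_time", "type": "int", "info": "ms since start"}]'],
})

# Each args entry is a JSON list of {name, type, info} dictionaries
toy_specs['arg_names'] = toy_specs['args'].map(
    lambda s: [arg['name'] for arg in json.loads(s)]
)

# Made-up event rows; 'zzz999' has no matching spec on purpose
toy_events = pd.DataFrame({'event_id': ['abc123', 'abc123', 'zzz999']})

# Left join keeps every event row and adds the description where one exists
described = toy_events.merge(
    toy_specs[['event_id', 'info', 'arg_names']], on='event_id', how='left'
)
print(described)
```
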

Before getting into the data analysis, we need to be careful about memory usage.
I found a memory reduction function in this kernel - https://www.kaggle.com/gemartin/load-data-reduce-memory-usage. All credit goes to Guillaume Martin.

In [None]:
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype.name

        if col_type not in ['object', 'category', 'datetime64[ns, UTC]']:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

In [None]:
reduce_mem_usage(train)

In [None]:
reduce_mem_usage(test)

In [None]:
reduce_mem_usage(specs)

In [None]:
reduce_mem_usage(train_labels)
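One caveat about the function above: it downcasts float columns to float16 whenever the values fit in range, and float16 keeps only about three significant decimal digits. The toy values below are made up to show the rounding; whether this matters depends on how the float columns are used later.

```python
import numpy as np

# Made-up float values that all fit inside the float16 range
values = np.array([2049.0, 0.123456, 65000.0])
as_half = values.astype(np.float16)

print(as_half)                     # values after rounding to half precision
print(np.finfo(np.float16).max)    # largest representable float16
print(np.abs(values - as_half.astype(np.float64)))  # rounding error per value
```
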

# Data Exploration 

Plotting event code by row counts

In [None]:
plt.figure(figsize=(12,6))


sns.countplot(x='event_code',data=train, palette = 'Blues_d',
              order = train['event_code'].value_counts().index).set_title('Count by Event Code - Train')
plt.xticks(rotation=90,fontsize=8)
plt.show()

In [None]:
plt.figure(figsize=(12,6))


sns.countplot(x='event_code',data=test, palette = 'Blues_d',
              order = test['event_code'].value_counts().index).set_title('Count by Event Code - Test')
plt.xticks(rotation=90,fontsize=8)
plt.show()

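The event code counts are heavily skewed: a few codes account for most of the rows. This will cause problems when we turn event_code into dummy variables, so we will deal with it later.

There is also an event_count field. Which event codes have the highest total event_count? Note that event_count is a running counter within a game session, so this sum gives more weight to events that occur late in long sessions; it is not a row count.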

In [None]:
train.groupby('event_code')[['event_count']].agg('sum').sort_values(by = 'event_count',ascending=False).head(10)

Event 4070 seems to have the highest event count. What about the test data?

In [None]:
test.groupby('event_code')[['event_count']].agg('sum').sort_values(by = 'event_count',ascending=False).head(10)

Ok that's more or less similar. Let's now look at the train labels data

In [None]:
train_labels.sample(10)

So this data contains our target variable - the accuracy group.

Kaggle mentions that this file demonstrates how to compute the ground truth for the assessments in the training set.

We can also take out installation_ids that don't have the target variable in the train labels data. 

In [None]:
train = train[train.installation_id.isin(train_labels.installation_id.unique())]

So now the train data has reduced to 7 million rows

In [None]:
train.shape

Are there any missing values in the data?

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

In [None]:
specs.isnull().sum()

In [None]:
train_labels.isnull().sum()

There appear to be no missing values in the data.

In order to understand the data better, we plot out some of the variables

In [None]:
sns.set(font_scale=1.5,palette = 'Blues_d')
sns.set_style('whitegrid')
plt.figure(figsize=(12,6))


sns.countplot(x='type',data=train,
              order = train['type'].value_counts().index).set_title('Count by game type - Train')

In [None]:
sns.set(font_scale=1.5,palette = 'Blues_d')
sns.set_style('whitegrid')
plt.figure(figsize=(12,6))


sns.countplot(x='type',data=test,
              order = test['type'].value_counts().index).set_title('Count by game type - Test')

This thread (https://www.kaggle.com/c/data-science-bowl-2019/discussion/115034) contains some information about the different types of content. 

Each content type can be loosely thought of as corresponding to a phase of the learning cycle.

Clips
Videos are intended to expose the kid to a topic or a problem solving approach. Videos typically model or explain things. There is no interactive component to videos. Clips can further be classified into:
- Interstitials: short transitional videos between worlds or sections of the world, in which the protagonists of the adventure (Del, Dot and Dee) are seen exploring the island. Aside from the introductory video titled 'Welcome To The Lost Lagoon!', these can be identified by the title specifying the world and the relevant section (e.g. 'Crystal Caves - Level 1'). These videos merely hint to the subject matter.
- Longer clips (2-3 minutes in length): these videos explain an important subject or approach with the help of familiar characters from the PBS KIDS world. Typically these videos have been excerpted from longer television episodes.

Keep in mind in the dataset only the start of the video playback is captured. Therefore there are far fewer events corresponding to clips than there are to games or assessments. That does not mean clips are less popular! Also, lack of interactivity notwithstanding, there is good evidence that video contributes significantly to learning outcomes.

Activities
Activities are open-ended mini-games that allow kids to practice their skills in an environment that mimics real life play patterns to support “messing about”. Activities do not have a defined goal, but they do typically model cause and effect. We sometimes refer to Activities as 'sandboxes' or 'toys'.

Games
These are the typical video games most people are familiar with. Games help kids practice their skills with the goal of solving a specific problem. Each challenge may belong to a progressively more challenging round (marked in the data), and multiple rounds may be grouped into levels. Games do not end until the player finishes the game or decides to exit the play session. If a final goal is achieved, there is usually an option to replay the entire game from the start.

Assessments
Assessments are interactives that are designed specifically with the goal of measuring a player’s knowledge of the subject matter. Metrics that represent the intrinsic knowledge of the user are typically derived either from first principles rooted in childhood educational psychometry or from a posteriori data observations. One such (simple) metric might be the number of incorrect answers leading to the assessment solution, but many others can be formulated.

There are some interesting points to note here:

* Our plot shows far fewer clip rows than other content types because the dataset only captures the start of a video, so there are fewer events for videos
* Since games help kids practice their skills, we may be able to say that kids who frequently replay / easily progress through games have a higher chance of getting a correct answer in the assessment at the first attempt
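The second point could be turned into a feature by counting, for each installation_id, how many sessions of each content type were played. The snippet below is only a sketch on made-up rows; it has not been tested as a predictor of accuracy_group.

```python
import pandas as pd

# Made-up events: installation 'a' plays two game sessions,
# installation 'b' watches a clip and takes an assessment
toy = pd.DataFrame({
    'installation_id': ['a', 'a', 'a', 'b', 'b'],
    'game_session': ['s1', 's1', 's2', 's3', 's4'],
    'type': ['Game', 'Game', 'Game', 'Clip', 'Assessment'],
})

# One row per session, then count sessions per installation and type
sessions_per_type = (
    toy.drop_duplicates(['installation_id', 'game_session'])
       .groupby(['installation_id', 'type'])
       .size()
       .unstack(fill_value=0)
)
print(sessions_per_type)
```
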

So what about worlds?

In [None]:
plt.figure(figsize=(12,7))

sns.countplot(x='world',data=train,
             order = train['world'].value_counts().index).set_title('Count by World - Train')

In [None]:
plt.figure(figsize=(12,7))

sns.countplot(x='world',data=test,
             order = test['world'].value_counts().index).set_title('Count by World - Test')

MAGMAPEAK has the highest count in both the train and test sets. It is not clear why.

In [None]:
plt.figure(figsize=(12,12))

sns.countplot(y='title',data=train,palette = 'Blues_d',
             order = train['title'].value_counts().index).set_title('Count by Title - Train')

In [None]:
plt.figure(figsize=(12,12))

sns.countplot(y='title',data=test,palette = 'Blues_d',
             order = test['title'].value_counts().index).set_title('Count by Title - Test')

There are a lot of titles. We will plan later how to one-hot encode these values for our model. Also, for some reason, Bottle Filler and Scrub-A-Dub seem to be the most frequent titles in train and test.

In [None]:
plt.figure(figsize=(12,6))

sns.countplot(y='title',data=train_labels,palette = 'Blues_d',
             order = train_labels['title'].value_counts().index).set_title('Count by Assessment - Train labels')

In [None]:
plt.figure(figsize=(12,6))

sns.countplot(y='accuracy_group',data=train_labels,palette = 'Blues_d',
             order = train_labels['accuracy_group'].value_counts().index).set_title('Count by Accuracy Group - Train labels')

Ok so obviously there are lots of incorrect answers before a successful result in an assessment. 

Now, to understand the timestamp field, we'll plot event counts over time. First we convert timestamp to datetime:

In [None]:
train['timestamp'] = pd.to_datetime(train['timestamp'])
test['timestamp'] = pd.to_datetime(test['timestamp'])

And plotting the day of the week counts

In [None]:
plt.figure(figsize=(12,6))

sns.countplot(x=train['timestamp'].dt.dayofweek,data=train,palette = 'Blues_d').set_title('Count by Day of the Week - Train')

In [None]:
plt.figure(figsize=(12,6))

sns.countplot(x=test['timestamp'].dt.dayofweek,data=test,palette = 'Blues_d').set_title('Count by Day of the Week - Test')

0 denotes Monday, 1 denotes Tuesday and so on...
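If the numeric codes are hard to read on the axis, pandas can return the day names directly. A small sketch on two made-up timestamps in the same ISO format as the data:

```python
import pandas as pd

# Made-up timestamps; the trailing 'Z' marks them as UTC
stamps = pd.to_datetime(pd.Series([
    '2019-09-06T17:53:46.937Z',
    '2019-09-08T01:12:25.248Z',
]))

print(stamps.dt.dayofweek.tolist())   # [4, 6]
print(stamps.dt.day_name().tolist())  # ['Friday', 'Sunday']
```
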

Train and test look slightly different. Counts are slightly higher on Fridays and Saturdays in the train data, whereas in the test data Fridays are much higher than the other days.

Now, to look at hour of day...

In [None]:
plt.figure(figsize=(12,6))

sns.countplot(x=train['timestamp'].dt.hour,data=train,palette = 'Blues_d').set_title('Count by Hour of the Day - Train')

In [None]:
plt.figure(figsize=(12,6))

sns.countplot(x=test['timestamp'].dt.hour,data=test,palette = 'Blues_d').set_title('Count by Hour of the Day - Test')

There is an obvious pattern here: usage is lower in the early hours of the day, up until around noon. Keep in mind that the timestamps are parsed as UTC, so these hours are not the players' local time.

In [None]:
train=train.sort_values('timestamp')
test=test.sort_values('timestamp')

In [None]:
plt.figure(figsize=(15,8))

sns.countplot(x=train['timestamp'].dt.date,data=train,palette = 'Blues_d').set_title('Count by Date - Train')
plt.xticks(rotation=90,fontsize=8)
plt.show()

In [None]:
plt.figure(figsize=(15,8))

sns.countplot(x=test['timestamp'].dt.date,data=test,palette = 'Blues_d').set_title('Count by Date - Test')
plt.xticks(rotation=90,fontsize=8)
plt.show()

Plotting counts by date, we don't see a strong trend in train; the graph looks fairly uniform. In test, however, there is an obvious peak in August.

To prepare the data, we need to join train and train_labels, and also format the train and test datasets so they can be fed into the model.

We will left-join train with train_labels on installation_id and game_session, keeping only the train_labels columns we need: game_session, installation_id and accuracy_group (our target variable). Once joined, we check the number of rows to make sure the join worked as intended.

In [None]:
train.shape

The new train data should have the same number of rows as shown above

In [None]:
train_labels.shape

In [None]:
train_labels

In [None]:
train_new = pd.merge(train, train_labels.filter(['game_session','installation_id','accuracy_group'],axis=1), on=['installation_id','game_session'], how='left')

In [None]:
train = train_new

Ok, so we have the same number of rows, with the accuracy_group column added. Rows from sessions that have no entry in train_labels get a missing accuracy_group.
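The row-count check can also be built into the join itself. The sketch below uses made-up rows and the `validate` argument of `pd.merge`, which raises an error if a (installation_id, game_session) pair appears more than once on the labels side; that uniqueness is an assumption about train_labels, not something verified here.

```python
import pandas as pd

# Made-up events: one labelled session (s1) and one unlabelled session (s2)
events = pd.DataFrame({
    'installation_id': ['a', 'a', 'b'],
    'game_session': ['s1', 's1', 's2'],
    'event_code': [2000, 4100, 2000],
})
labels = pd.DataFrame({
    'installation_id': ['a'],
    'game_session': ['s1'],
    'accuracy_group': [3],
})

merged = events.merge(
    labels,
    on=['installation_id', 'game_session'],
    how='left',
    validate='many_to_one',  # fail loudly if a label key is duplicated
)

# A left join against unique keys never changes the number of rows
assert len(merged) == len(events)

# Sessions without a label end up with NaN in accuracy_group
print(merged['accuracy_group'].isna().sum())
```
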

# Feature Engineering

Now we will write a function to prepare the train and test datasets. We also need to one-hot encode the categorical columns. Before we worry about too many dummy variables, let's look at the number of unique values in each column.

In [None]:
train.nunique()

Title and event code will cause problems for us as they contain several values that need to be converted to dummy variables. Maybe we can group the smaller ones?

In [None]:
grouped_events = train.groupby(['event_code'])['event_code'].count().rename('count').reset_index().sort_values('count', ascending=False)
grouped_events['perc'] = grouped_events['count'] / grouped_events['count'].sum()
grouped_events

In [None]:
grouped_events_test = test.groupby(['event_code'])['event_code'].count().rename('count').reset_index().sort_values('count', ascending=False)
grouped_events_test['perc'] = grouped_events_test['count'] / grouped_events_test['count'].sum()
grouped_events_test

As we can see, there are several smaller events that can be merged together as 'other'. To keep things simple, we keep the six most frequent event codes and group the rest.

It is also worth checking that the same main events appear in the test data.

In [None]:
set(grouped_events.head(6).event_code).intersection(grouped_events_test.head(6).event_code)

Any event_code other than 4070, 4030, 3010, 3110, 4020, 2020 can be grouped together as 'other' or any dummy event_code like '0000'. Let's first convert them into strings.

In [None]:
train['event_code'] = train['event_code'].astype(str)
test['event_code'] = test['event_code'].astype(str)

In [None]:
# event_code is now a string column, so the codes to keep must be strings too
main_events = grouped_events.head(6).event_code.astype(str)
train[~train.event_code.isin(main_events)]
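A detail that is easy to miss here: once event_code holds strings, `isin` only matches string values. The toy series below shows what happens when the codes to keep are still integers.

```python
import pandas as pd

# Toy string column standing in for the converted event_code
codes = pd.Series(['4070', '2000', '4030'])

# Integer codes never match string values, so nothing is kept
print(codes.isin([4070, 4030]).tolist())      # [False, False, False]

# String codes match as intended
print(codes.isin(['4070', '4030']).tolist())  # [True, False, True]
```
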

Next, we should apply the same logic to the title field

In [None]:
grouped_titles = train.groupby(['title'])['title'].count().rename('count').reset_index().sort_values('count', ascending=False)
grouped_titles['perc'] = grouped_titles['count'] / grouped_titles['count'].sum()
grouped_titles

In [None]:
grouped_titles_test = test.groupby(['title'])['title'].count().rename('count').reset_index().sort_values('count', ascending=False)
grouped_titles_test['perc'] = grouped_titles_test['count'] / grouped_titles_test['count'].sum()
grouped_titles_test

In [None]:
set(grouped_titles.head(6).title) & set(grouped_titles_test.head(6).title)

In [None]:
main_titles = grouped_titles.head(6).title
train[~train.title.isin(main_titles)]

In [None]:
def prepare_data(df):

    # drop any unnecessary columns (drop returns a new frame,
    # so the dataframe passed in is not modified)
    out = df.drop(['timestamp','event_data','game_session','event_id'], axis = 1)

    # Adding all the time columns
    out['month'] = df['timestamp'].dt.month
    out['hour'] = df['timestamp'].dt.hour
    out['year'] = df['timestamp'].dt.year
    out['dayofweek'] = df['timestamp'].dt.dayofweek

    # merge all smaller event codes / titles together
    # (compare as strings, because event_code was converted to str above)
    keep_codes = [str(code) for code in main_events]
    out.loc[~out.event_code.isin(keep_codes), 'event_code'] = '0000'
    out.loc[~out.title.isin(main_titles), 'title'] = 'Other'

    # convert into dummy variables
    dummies = pd.get_dummies(out[['type','title','world','event_code']])

    # replace the categorical columns with their dummy columns
    out = out.drop(['type','title', 'world','event_code'], axis = 1)
    out = pd.concat([out, dummies], axis=1)

    return out

In [None]:
train_prep = prepare_data(train)
test_prep = prepare_data(test)
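Because the dummy variables are created separately for train and test, the two prepared frames can end up with different columns if a category is present in one set but not the other. One common remedy, sketched here on made-up titles, is to reindex the test dummies to the train columns; the frame names are hypothetical and this step is not part of `prepare_data` above.

```python
import pandas as pd

# Made-up categorical columns: 'Bottle Filler' is missing from the test toy data
train_toy = pd.DataFrame({'title': ['Bottle Filler', 'Other', 'Scrub-A-Dub']})
test_toy = pd.DataFrame({'title': ['Other', 'Scrub-A-Dub']})

train_d = pd.get_dummies(train_toy, dtype='uint8')
test_d = pd.get_dummies(test_toy, dtype='uint8')

# Give test exactly the train columns, filling absent categories with 0
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(train_d.columns))
print(test_d)
```
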

In [None]:
%whos DataFrame

Baseline model - to be continued...