# Predict Student Performance using GamePlay

### Data Collection

- session_id - the ID of the session the event took place in
- index - the index of the event for the session
- elapsed_time - how much time has passed (in milliseconds) between the start of the session and when the event was recorded
- event_name - the name of the event type
- name - the event name (e.g. identifies whether a notebook_click is opening or closing the notebook)
- level - what level of the game the event occurred in (0 to 22)
- page - the page number of the event (only for notebook-related events)
- room_coor_x - the coordinates of the click in reference to the in-game room (only for click events)
- room_coor_y - the coordinates of the click in reference to the in-game room (only for click events)
- screen_coor_x - the coordinates of the click in reference to the player’s screen (only for click events)
- screen_coor_y - the coordinates of the click in reference to the player’s screen (only for click events)
- hover_duration - how long (in milliseconds) the hover happened for (only for hover events)
- text - the text the player sees during this event
- fqid - the fully qualified ID of the event
- room_fqid - the fully qualified ID of the room the event took place in
- text_fqid - the fully qualified ID of the
- fullscreen - whether the player is in fullscreen mode
- hq - whether the game is in high-quality
- music - whether the game music is on or off
- level_group - which group of levels - and group of questions - this row belongs to (0-4, 5-12, 13-22)
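
Since the column descriptions above fix the type and range of several fields, one option is to pass explicit dtypes when reading the file to reduce memory use. The following is only a sketch: the `DTYPES` mapping and the `read_events` helper are hypothetical, the chosen types are assumptions drawn from the descriptions (e.g. level runs from 0 to 22, so `int8` is wide enough), and the two sample rows are abbreviated to a subset of the columns.

```python
import io

import pandas as pd

# Hypothetical dtype mapping (an assumption, not the notebook's own settings).
DTYPES = {
    "session_id": "int64",      # IDs are too large for int32
    "event_name": "category",   # small set of repeated strings
    "name": "category",
    "level": "int8",            # described as 0 to 22
    "page": "float32",          # float so that missing values stay NaN
    "room_fqid": "category",
    "fullscreen": "int8",       # flags shown as 0/1 in the data
    "hq": "int8",
    "music": "int8",
    "level_group": "category",  # '0-4', '5-12', '13-22'
}


def read_events(source):
    """Read an events CSV using the assumed dtype mapping."""
    return pd.read_csv(source, dtype=DTYPES)


# Two abbreviated sample rows, used only to demonstrate the call.
sample = io.StringIO(
    "session_id,index,elapsed_time,event_name,name,level,page,"
    "room_fqid,fullscreen,hq,music,level_group\n"
    "20090312431273200,0,0,cutscene_click,basic,0,,"
    "tunic.historicalsociety.closet,0,0,1,0-4\n"
    "20090312431273200,1,1323,person_click,basic,0,,"
    "tunic.historicalsociety.closet,0,0,1,0-4\n"
)
events = read_events(sample)
print(events.dtypes)
```

For the real file, the same helper could be called as `read_events('data/train.csv')`, provided the assumed types actually fit the data.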

In [5]:
#importing the required library
import pandas as pd

In [7]:
#reading the datasets
train_df=pd.read_csv('data/train.csv')
train_labels_df=pd.read_csv('data/train_labels.csv')

In [11]:
train_df

Unnamed: 0,session_id,index,elapsed_time,event_name,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,0,0,cutscene_click,basic,0,,-413.991405,-159.314686,380.0,494.0,,undefined,intro,tunic.historicalsociety.closet,tunic.historicalsociety.closet.intro,0,0,1,0-4
1,20090312431273200,1,1323,person_click,basic,0,,-413.991405,-159.314686,380.0,494.0,,"Whatcha doing over there, Jo?",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
2,20090312431273200,2,831,person_click,basic,0,,-413.991405,-159.314686,380.0,494.0,,Just talking to Teddy.,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
3,20090312431273200,3,1147,person_click,basic,0,,-413.991405,-159.314686,380.0,494.0,,I gotta run to my meeting!,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
4,20090312431273200,4,1863,person_click,basic,0,,-412.991405,-159.314686,381.0,494.0,,"Can I come, Gramps?",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26296941,22100221145014656,1600,5483231,navigate_click,undefined,22,,343.887291,36.701026,483.0,273.0,,,,tunic.capitol_2.hall,,0,0,1,13-22
26296942,22100221145014656,1601,5485166,navigate_click,undefined,22,,332.696070,141.493178,545.0,221.0,,,chap4_finale_c,tunic.capitol_2.hall,,0,0,1,13-22
26296943,22100221145014656,1602,5485917,navigate_click,undefined,22,,369.912859,140.569205,611.0,217.0,,,,tunic.capitol_2.hall,,0,0,1,13-22
26296944,22100221145014656,1603,5486753,navigate_click,undefined,22,,252.299653,123.805889,526.0,232.0,,,chap4_finale_c,tunic.capitol_2.hall,,0,0,1,13-22


In [12]:
train_labels_df

Unnamed: 0,session_id,correct
0,20090312431273200_q1,1
1,20090312433251036_q1,0
2,20090312455206810_q1,1
3,20090313091715820_q1,0
4,20090313571836404_q1,1
...,...,...
424111,22100215342220508_q18,1
424112,22100215460321130_q18,1
424113,22100217104993650_q18,1
424114,22100219442786200_q18,1


Here, train_df holds the input data and train_labels_df holds the output data, so we need to combine the two dataframes on session_id.
- Note: In train_labels_df, the session_id column contains the question ID along with the session ID (e.g. 20090312431273200_q1), so we need to split the values on '_q'.

In [31]:
#splitting the values in session_id in the train_labels_df dataframe
sid_split_vals=train_labels_df['session_id'].str.split('_q')
train_labels_df['session_id_actual']=sid_split_vals.str[0]
train_labels_df['question_id']=sid_split_vals.str[1]
train_labels_df['session_id_actual']=train_labels_df['session_id_actual'].astype('int64')
train_labels_df['question_id']=train_labels_df['question_id'].astype(int)

In [32]:
train_labels_df.head()

Unnamed: 0,session_id,correct,session_id_actual,question_id
0,20090312431273200_q1,1,20090312431273200,1
1,20090312433251036_q1,0,20090312433251036,1
2,20090312455206810_q1,1,20090312455206810,1
3,20090313091715820_q1,0,20090313091715820,1
4,20090313571836404_q1,1,20090313571836404,1
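
As a possible alternative to splitting on '_q', the two parts can be pulled out with a regular expression, which also makes it easy to detect values that do not follow the `<session>_q<number>` pattern. This is a minimal sketch on two label values taken from the output above; the `labels` and `parts` names are illustrative, and the pattern assumes every ID is all digits followed by `_q` and a question number.

```python
import pandas as pd

# Two label rows copied from the train_labels_df output, for illustration.
labels = pd.DataFrame(
    {
        "session_id": ["20090312431273200_q1", "22100215342220508_q18"],
        "correct": [1, 1],
    }
)

# Named groups become the column names of the extracted dataframe.
pattern = r"^(?P<session_id_actual>\d+)_q(?P<question_id>\d+)$"
parts = labels["session_id"].str.extract(pattern)

# Rows that do not match the pattern come back as NaN, so check first.
if parts.isna().any().any():
    raise ValueError("some session_id values do not match the expected pattern")

labels["session_id_actual"] = parts["session_id_actual"].astype("int64")
labels["question_id"] = parts["question_id"].astype(int)
print(labels)
```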


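Next, we merge the labels with train_df. The question_id column is renamed to level so that it can be used as a merge key together with session_id. Note that this treats the question number as a game level.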

In [33]:
actual_tl_df=train_labels_df[['session_id_actual','question_id','correct']]

In [34]:
actual_tl_df.columns=['session_id','level','correct']

In [35]:
actual_tl_df.head()

Unnamed: 0,session_id,level,correct
0,20090312431273200,1,1
1,20090312433251036,1,0
2,20090312455206810,1,1
3,20090313091715820,1,0
4,20090313571836404,1,1


In [36]:
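#inner merge (the pandas default): only rows with a matching session_id and level are kept
final_train_data=train_df.merge(actual_tl_df,on=['session_id','level'],how='inner')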

In [39]:
final_train_data.head()

Unnamed: 0,session_id,index,elapsed_time,event_name,name,level,page,room_coor_x,room_coor_y,screen_coor_x,...,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group,correct
0,20090312431273200,28,28113,navigate_click,undefined,1,,-587.657879,-27.916913,441.0,...,,,retirement_letter,tunic.historicalsociety.closet,,0,0,1,0-4,1
1,20090312431273200,29,32229,notification_click,basic,1,,-182.558163,-1.906501,767.0,...,,Gramps is in trouble for losing papers?,,tunic.historicalsociety.closet,tunic.historicalsociety.closet.retirement_lett...,0,0,1,0-4,1
2,20090312431273200,30,33063,notification_click,basic,1,,-182.500704,-55.888296,767.0,...,,This can't be right!,,tunic.historicalsociety.closet,tunic.historicalsociety.closet.retirement_lett...,0,0,1,0-4,1
3,20090312431273200,31,34245,notification_click,basic,1,,-182.486523,-55.883804,767.0,...,,Gramps is a great historian!,,tunic.historicalsociety.closet,tunic.historicalsociety.closet.retirement_lett...,0,0,1,0-4,1
4,20090312431273200,32,36433,object_click,close,1,,-113.484832,241.116732,836.0,...,,,retirement_letter,tunic.historicalsociety.closet,,0,0,1,0-4,1


In [44]:
final_train_data['level'].unique()

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18])

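After the merge, the level column only contains values from 1 to 18. The merge keeps only the rows whose level matches a question number in the labels, so events at level 0 and levels 19 to 22 are dropped.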

In [45]:
print('Initial train data shape: ',train_df.shape)
print('Final train data shape:   ',final_train_data.shape)

Initial train data shape:  (26296946, 20)
Final train data shape:    (20732578, 21)
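
The shapes above show that the merge removed 26296946 - 20732578 = 5564368 rows and added one column (correct). The toy example below sketches why: an inner merge on session_id and level drops every event whose level has no matching label row. The dataframes here are made-up stand-ins, not the real data, and the `indicator=True` left merge is just one way to list the rows that would be lost.

```python
import pandas as pd

# Made-up events: one session with events at levels 0, 1, 1 and 2.
events = pd.DataFrame(
    {
        "session_id": [1, 1, 1, 1],
        "level": [0, 1, 1, 2],
        "event_name": ["cutscene_click", "navigate_click", "object_click", "person_click"],
    }
)

# Made-up labels: only levels 1 and 2 have a label.
labels = pd.DataFrame(
    {
        "session_id": [1, 1],
        "level": [1, 2],
        "correct": [1, 0],
    }
)

# Inner merge keeps only events with a matching (session_id, level) pair.
inner = events.merge(labels, on=["session_id", "level"], how="inner")

# Left merge with an indicator column shows which events found no label.
checked = events.merge(labels, on=["session_id", "level"], how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

print(inner.shape)                  # (3, 4)
print(unmatched["level"].tolist())  # [0]
```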


In [50]:
#saving the final_train_data dataframe to .csv file
final_train_data.to_csv('data/new_train_data.csv',index=False)