# Predict Student Performance using GamePlay

### Feature Engineering

- session_id - the ID of the session the event took place in
- index - the index of the event for the session
- elapsed_time - how much time has passed (in milliseconds) between the start of the session and when the event was recorded
- event_name - the name of the event type
- name - the event name (e.g. identifies whether a notebook_click is opening or closing the notebook)
- level - what level of the game the event occurred in (0 to 22)
- page - the page number of the event (only for notebook-related events)
- room_coor_x - the coordinates of the click in reference to the in-game room (only for click events)
- room_coor_y - the coordinates of the click in reference to the in-game room (only for click events)
- screen_coor_x - the coordinates of the click in reference to the player’s screen (only for click events)
- screen_coor_y - the coordinates of the click in reference to the player’s screen (only for click events)
- hover_duration - how long (in milliseconds) the hover happened for (only for hover events)
- text - the text the player sees during this event
- fqid - the fully qualified ID of the event
- room_fqid - the fully qualified ID of the room the event took place in
- text_fqid - the fully qualified ID of the
- fullscreen - whether the player is in fullscreen mode
- hq - whether the game is in high-quality
- music - whether the game music is on or off
- level_group - which group of levels - and group of questions - this row belongs to (0-4, 5-12, 13-22)

In [178]:
#import libraries
import pandas as pd
import numpy as np

In [179]:
#reading the dataset
data_frame=pd.read_csv('data/test.csv')

In [180]:
data_frame.head()

Unnamed: 0,session_id,index,elapsed_time,event_name,name,level,page,room_coor_x,room_coor_y,screen_coor_x,...,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group,session_level
0,20090109393214576,0,0,cutscene_click,basic,0,,-413.991405,75.685314,380.0,...,,undefined,intro,tunic.historicalsociety.closet,tunic.historicalsociety.closet.intro,0,0,1,0-4,0
1,20090109393214576,1,1965,person_click,basic,0,,-105.991405,-63.314686,688.0,...,,"Whatcha doing over there, Jo?",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4,0
2,20090109393214576,2,3614,person_click,basic,0,,-418.991405,47.685314,375.0,...,,Just talking to Teddy.,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4,0
3,20090109393214576,3,5330,person_click,basic,0,,-110.991405,-57.314686,683.0,...,,I gotta run to my meeting!,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4,0
4,20090109393214576,4,6397,person_click,basic,0,,-110.991405,-57.314686,683.0,...,,"Can I come, Gramps?",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4,0


In [181]:
data_frame.shape

(3728, 21)

## Handling null values

In [182]:
data_frame.isnull().sum()

session_id           0
index                0
elapsed_time         0
event_name           0
name                 0
level                0
page              3575
room_coor_x        362
room_coor_y        362
screen_coor_x      362
screen_coor_y      362
hover_duration    3375
text              2566
fqid              1223
room_fqid            0
text_fqid         2566
fullscreen           0
hq                   0
music                0
level_group          0
session_level        0
dtype: int64

#### Null values in page column

In [183]:
data_frame['page'].unique()

array([nan,  0.,  1.,  2.,  3.,  4.,  5.,  6.])

Getting the event names associated with each page number

In [184]:
data_frame.groupby('page')['event_name'].unique()

page
0.0    [notebook_click]
1.0    [notebook_click]
2.0    [notebook_click]
3.0    [notebook_click]
4.0    [notebook_click]
5.0    [notebook_click]
6.0    [notebook_click]
Name: event_name, dtype: object

In [185]:
data_frame['event_name'].unique()

array(['cutscene_click', 'person_click', 'navigate_click',
       'observation_click', 'notification_click', 'object_click',
       'object_hover', 'notebook_click', 'map_hover', 'map_click',
       'checkpoint'], dtype=object)

Page numbers exist only for notebook_click (notebook-related) events, so page is null for every other value of event_name. We fill the null values in the page column with -1, because page is discrete and 0 is already a valid page number.
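
Filling with a placeholder is only safe if that value does not already occur in the column. A minimal sketch of that check, using a hypothetical helper `fill_with_sentinel` and a toy series (neither is part of the original notebook):

```python
import numpy as np
import pandas as pd

def fill_with_sentinel(series, sentinel):
    """Fill nulls with `sentinel`, refusing if the value already occurs in the data.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    if (series == sentinel).any():
        raise ValueError(f"{sentinel!r} already appears in the column")
    return series.fillna(sentinel)

# toy stand-in for the page column: NaN for non-notebook events
toy_page = pd.Series([np.nan, 0.0, 1.0, np.nan])
filled_page = fill_with_sentinel(toy_page, -1)
print(filled_page.tolist())
```

Calling the helper with 0 as the sentinel on the same toy series would raise, which mirrors why 0 is not a suitable fill value for page.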

In [186]:
data_frame['page']=data_frame['page'].fillna(-1)

In [187]:
data_frame['page'].isnull().sum()

0

#### Null values in coordinate columns

In [188]:
#there are 4 coordinate columns: room_coor_x, room_coor_y, screen_coor_x and screen_coor_y

Coordinate values are recorded only for click events.

In [189]:
#shape of dataframe where room_coor_x is not null
data_frame[data_frame['room_coor_x'].isna()==False].shape

(3366, 21)

In [190]:
#shape of dataframe where room_coor_x is null
data_frame[data_frame['room_coor_x'].isna()].shape

(362, 21)

In [191]:
#checking the null values for all the coordinate columns
data_frame[['room_coor_x','room_coor_y','screen_coor_x','screen_coor_y']].isnull().sum()

room_coor_x      362
room_coor_y      362
screen_coor_x    362
screen_coor_y    362
dtype: int64

In [192]:
#Below query shows that all the coordinate values are null if any coordinate value is null.

In [193]:
data_frame[data_frame['room_coor_x'].isna()&data_frame['room_coor_y'].isna()&
          data_frame['screen_coor_x'].isna()&data_frame['screen_coor_y'].isna()]

Unnamed: 0,session_id,index,elapsed_time,event_name,name,level,page,room_coor_x,room_coor_y,screen_coor_x,...,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group,session_level
67,20090109393214576,67,116880,object_hover,undefined,2,-1.0,,,,...,5168.0,,tunic.hub.slip,tunic.historicalsociety.collection,,0,0,1,0-4,0
96,20090109393214576,96,166545,map_hover,basic,3,-1.0,,,,...,132.0,,tunic.historicalsociety,tunic.historicalsociety.entry,,0,0,1,0-4,0
109,20090109393214576,109,184627,object_hover,undefined,3,-1.0,,,,...,33.0,,plaque.face.date,tunic.kohlcenter.halloffame,,0,0,1,0-4,0
114,20090109393214576,114,190745,object_hover,undefined,3,-1.0,,,,...,3816.0,,plaque.face.date,tunic.kohlcenter.halloffame,,0,0,1,0-4,0
136,20090109393214576,136,207212,map_hover,basic,4,-1.0,,,,...,885.0,,toentry,tunic.kohlcenter.halloffame,,0,0,1,0-4,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3721,20090312331414616,999,1580531,map_hover,basic,22,-1.0,,,,...,35.0,,tunic.humanecology,tunic.historicalsociety.entry,,0,0,1,13-22,8
3722,20090312331414616,1000,1581029,map_hover,basic,22,-1.0,,,,...,16.0,,tunic.humanecology,tunic.historicalsociety.entry,,0,0,1,13-22,8
3723,20090312331414616,1001,1581679,map_hover,basic,22,-1.0,,,,...,484.0,,tunic.wildlife,tunic.historicalsociety.entry,,0,0,1,13-22,8
3724,20090312331414616,1002,1583044,map_hover,basic,22,-1.0,,,,...,783.0,,tunic.capitol_2,tunic.historicalsociety.entry,,0,0,1,13-22,8
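
One way to confirm that the four coordinate columns are always null together is to compare the rows where any coordinate is null with the rows where all of them are null. A small sketch on toy data (the frame `toy` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

coor_cols = ['room_coor_x', 'room_coor_y', 'screen_coor_x', 'screen_coor_y']

# toy frame: the second row mimics a hover event with no coordinates
toy = pd.DataFrame({
    'room_coor_x': [-413.9, np.nan, -105.9],
    'room_coor_y': [75.6, np.nan, -63.3],
    'screen_coor_x': [380.0, np.nan, 688.0],
    'screen_coor_y': [259.0, np.nan, 398.0],
})

null_flags = toy[coor_cols].isna()
rows_any_null = null_flags.any(axis=1)   # at least one coordinate missing
rows_all_null = null_flags.all(axis=1)   # every coordinate missing

# if the two masks match, nulls never occur in only some of the columns
nulls_move_together = rows_any_null.equals(rows_all_null)
print(nulls_move_together)
```

The same comparison could be run on `data_frame[coor_cols]` in the notebook.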


Let us replace all the null coordinate values with 0, so that a missing coordinate pair becomes (0, 0).
Before that, we check whether a (0, 0) pair already exists in the coordinate columns.

In [194]:
#checking for room coordinates
data_frame[(data_frame['room_coor_x']==0)&(data_frame['room_coor_y']==0)]

Unnamed: 0,session_id,index,elapsed_time,event_name,name,level,page,room_coor_x,room_coor_y,screen_coor_x,...,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group,session_level


In [195]:
#checking for screen coordinates
data_frame[(data_frame['screen_coor_x']==0)&(data_frame['screen_coor_y']==0)]

Unnamed: 0,session_id,index,elapsed_time,event_name,name,level,page,room_coor_x,room_coor_y,screen_coor_x,...,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group,session_level


We can conclude that no existing coordinate pair is (0, 0), so we can fill the null values in the coordinate columns with 0.

In [196]:
data_frame['room_coor_x']=data_frame['room_coor_x'].fillna(0)
data_frame['room_coor_y']=data_frame['room_coor_y'].fillna(0)
data_frame['screen_coor_x']=data_frame['screen_coor_x'].fillna(0)
data_frame['screen_coor_y']=data_frame['screen_coor_y'].fillna(0)

In [197]:
#re-checking the null values for all the coordinate columns
data_frame.isnull().sum()

session_id           0
index                0
elapsed_time         0
event_name           0
name                 0
level                0
page                 0
room_coor_x          0
room_coor_y          0
screen_coor_x        0
screen_coor_y        0
hover_duration    3375
text              2566
fqid              1223
room_fqid            0
text_fqid         2566
fullscreen           0
hq                   0
music                0
level_group          0
session_level        0
dtype: int64

#### Null values in hover_duration column

In [198]:
0 in data_frame['hover_duration'].unique()

False

Let us replace the null values in the hover_duration column with 0, since rows without a hover event have no hover time to measure.

In [199]:
data_frame['hover_duration']=data_frame['hover_duration'].fillna(0)

In [200]:
data_frame.isnull().sum()

session_id           0
index                0
elapsed_time         0
event_name           0
name                 0
level                0
page                 0
room_coor_x          0
room_coor_y          0
screen_coor_x        0
screen_coor_y        0
hover_duration       0
text              2566
fqid              1223
room_fqid            0
text_fqid         2566
fullscreen           0
hq                   0
music                0
level_group          0
session_level        0
dtype: int64

In [201]:
#the remaining columns are of object type

The null values in fqid and text_fqid are related to each other, so the two columns are filled together.

In [202]:
#getting the data_frame records with specific columns
fqid_data=data_frame[['level','index','event_name','name','fqid','room_fqid','text_fqid']].copy()

In [203]:
#filling the null values for extracted_fqid
#creating the new column extracted_fqid, which starts as a copy of the fqid column
fqid_data['extracted_fqid']=fqid_data['fqid']
#storing the records where fqid is null but text_fqid is not
text_fqid_nn=fqid_data[fqid_data['fqid'].isna()&fqid_data['text_fqid'].notna()]

In [204]:
#assigning extracted_fqid from text_fqid_nn back to fqid_data at the text_fqid_nn index values
#(.loc is used instead of chained .iloc indexing, which raises SettingWithCopyWarning).
#note: fqid is null for every row of text_fqid_nn, so extracted_fqid is null there too
#and this assignment leaves the column unchanged.
fqid_data.loc[text_fqid_nn.index,'extracted_fqid']=text_fqid_nn['extracted_fqid']


In [205]:
fqid_data.head()

Unnamed: 0,level,index,event_name,name,fqid,room_fqid,text_fqid,extracted_fqid
0,0,0,cutscene_click,basic,intro,tunic.historicalsociety.closet,tunic.historicalsociety.closet.intro,intro
1,0,1,person_click,basic,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,gramps
2,0,2,person_click,basic,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,gramps
3,0,3,person_click,basic,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,gramps
4,0,4,person_click,basic,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,gramps


In [206]:
#creating a new column named extracted_text_fqid and storing text_fqid column values.
fqid_data['extracted_text_fqid']=fqid_data['text_fqid']

In [207]:
#getting the index values of fqid_data where text_fqid is null and extracted_fqid is not null
ex_text_fqid_index=fqid_data[(fqid_data['text_fqid'].isna())&(fqid_data['extracted_fqid'].notna())].index

In [208]:
#assigning room_fqid + '.' + extracted_fqid to extracted_text_fqid at the index values taken above
#(.loc is used instead of chained .iloc indexing, which raises SettingWithCopyWarning)
fqid_data.loc[ex_text_fqid_index,'extracted_text_fqid']=fqid_data.loc[ex_text_fqid_index,'room_fqid']+'.'+fqid_data.loc[ex_text_fqid_index,'extracted_fqid']


In [209]:
#creating a new column combine by joining the values in event_name, name, room_fqid and level as a single string. 
fqid_data['combine']=fqid_data['event_name']+fqid_data['name']+fqid_data['room_fqid']+fqid_data['level'].astype(str)

In [210]:
#creating a new column combine_copy and storing extracted_fqid column's values.
fqid_data['combine_copy']=fqid_data['extracted_fqid']

In [211]:
#taking limited columns event_name, name, room_fqid, extracted_text_fqid, 'extracted_fqid', level 
#to replace the null values of fqid column
sample_data=fqid_data[['event_name','name','room_fqid','extracted_text_fqid','extracted_fqid','level']]
#getting the records from the above dataframe where the extracted_fqid is not null and dropping the duplicates in it.
sample_data_uni=sample_data[sample_data['extracted_fqid'].isna()==False].drop_duplicates()

In [212]:
#shape of above dataframe after dropping duplicates
sample_data_uni.shape

(541, 6)

In [213]:
#creating a dictionary to store the values of fqid
fqid_dict=dict()
#combining all the string values of event_name, name, room_fqid and level and storing in a new column 'combine'
sample_data_uni['combine']=sample_data_uni['event_name']+sample_data_uni['name']+sample_data_uni['room_fqid']+sample_data_uni['level'].astype(str)
#iterating through the unique values of combine column 
for comb in sample_data_uni.drop_duplicates()['combine'].unique():
    #checking whether the extracted_fqid has only one unique value through each combine value
    if len(sample_data_uni[sample_data_uni['combine']==comb]['extracted_fqid'].unique())==1:
        #if yes, then printing the combine value and the extracted_fqid value
        print(comb)
        print('---------')
        print(sample_data_uni[sample_data_uni['combine']==comb]['extracted_fqid'].unique())
        print('----------------------------------------------------------------------')
        #assigning the extracted_fqid value to the fqid_dict where combine is the key
        fqid_dict[comb]=sample_data_uni[sample_data_uni['combine']==comb]['extracted_fqid'].unique()[0]

cutscene_clickbasictunic.historicalsociety.closet0
---------
['intro']
----------------------------------------------------------------------
observation_clickbasictunic.historicalsociety.closet0
---------
['photo']
----------------------------------------------------------------------
navigate_clickundefinedtunic.historicalsociety.closet1
---------
['tobasement']
----------------------------------------------------------------------
observation_clickbasictunic.historicalsociety.basement1
---------
['janitor']
----------------------------------------------------------------------
cutscene_clickbasictunic.historicalsociety.entry1
---------
['groupconvo']
----------------------------------------------------------------------
object_clickbasictunic.historicalsociety.entry1
---------
['report']
----------------------------------------------------------------------
object_clickclosetunic.historicalsociety.entry1
---------
['report']
----------------------------------------------------------

object_clickclosetunic.library.microfiche10
---------
['reader']
----------------------------------------------------------------------
person_clickbasictunic.humanecology.frontdesk11
---------
['worker']
----------------------------------------------------------------------
map_clickundefinedtunic.humanecology.frontdesk11
---------
['tunic.capitol_1']
----------------------------------------------------------------------
person_clickbasictunic.capitol_1.hall11
---------
['boss']
----------------------------------------------------------------------
navigate_clickundefinedtunic.kohlcenter.halloffame11
---------
['toentry']
----------------------------------------------------------------------
map_hoverbasictunic.kohlcenter.halloffame11
---------
['tunic.capitol_1']
----------------------------------------------------------------------
map_clickundefinedtunic.kohlcenter.halloffame11
---------
['tunic.historicalsociety']
-------------------------------------------------------------------
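
The dictionary built by the loop above keeps a key only when it points to exactly one fqid. A more compact sketch of the same idea with `groupby`, using a hypothetical helper `build_unique_mapping` and toy data (both are assumptions for illustration, not the notebook's code):

```python
import pandas as pd

def build_unique_mapping(df, key_col, value_col):
    """Map each key to its value only when the key has exactly one distinct non-null value.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    known = df.dropna(subset=[value_col])
    n_unique = known.groupby(key_col)[value_col].nunique()
    unambiguous_keys = n_unique[n_unique == 1].index
    first_values = known.groupby(key_col)[value_col].first()
    return first_values.loc[unambiguous_keys].to_dict()

# toy data: key 'a' always maps to 'x', key 'b' is ambiguous, key 'c' has no known value
toy = pd.DataFrame({
    'combine': ['a', 'a', 'b', 'b', 'c'],
    'extracted_fqid': ['x', 'x', 'y', 'z', None],
})
toy_mapping = build_unique_mapping(toy, 'combine', 'extracted_fqid')
print(toy_mapping)
```

Ambiguous keys are left out on purpose, so rows with those keys stay null rather than receiving a guessed fqid.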

In [214]:
test_df=sample_data_uni

In [215]:
len(fqid_dict)

160

In [216]:
len(test_df['combine'].unique())

279

In [217]:
len(test_df['combine'])

541

In [218]:
#getting the index value from the fqid_data after mapping the values with the values in fqid_dict
#where extracted_fqid is null
null_fqid_data_ind=fqid_data[fqid_data['extracted_fqid'].isna()]['combine'].map(fqid_dict).index

In [219]:
#filling combine_copy, at the rows where extracted_fqid is null, with the fqid looked up
#from fqid_dict through the combine key (.loc is used to avoid chained assignment)
fqid_data.loc[null_fqid_data_ind,'combine_copy']=fqid_data.loc[null_fqid_data_ind,'combine'].map(fqid_dict)
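
The lookup-and-fill step can also be written as a single `fillna` call. A minimal sketch on toy data, where the frame `toy` and the dictionary `lookup` are made up for illustration:

```python
import pandas as pd

# toy data: 'combine' is the lookup key, 'fqid' has gaps
toy = pd.DataFrame({
    'combine': ['k1', 'k2', 'k1', 'k3'],
    'fqid': ['intro', None, None, None],
})
lookup = {'k1': 'intro', 'k2': 'report'}   # no entry for 'k3'

# map() returns NaN for keys missing from the dictionary, so those rows stay null
toy['fqid_filled'] = toy['fqid'].fillna(toy['combine'].map(lookup))
print(toy['fqid_filled'].tolist()[:3])
```

Rows whose key is absent from the dictionary remain null, which matches the notebook, where some fqid values are still missing after this step.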

In [220]:
fqid_data.head()

Unnamed: 0,level,index,event_name,name,fqid,room_fqid,text_fqid,extracted_fqid,extracted_text_fqid,combine,combine_copy
0,0,0,cutscene_click,basic,intro,tunic.historicalsociety.closet,tunic.historicalsociety.closet.intro,intro,tunic.historicalsociety.closet.intro,cutscene_clickbasictunic.historicalsociety.clo...,intro
1,0,1,person_click,basic,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,gramps,tunic.historicalsociety.closet.gramps.intro_0_...,person_clickbasictunic.historicalsociety.closet0,gramps
2,0,2,person_click,basic,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,gramps,tunic.historicalsociety.closet.gramps.intro_0_...,person_clickbasictunic.historicalsociety.closet0,gramps
3,0,3,person_click,basic,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,gramps,tunic.historicalsociety.closet.gramps.intro_0_...,person_clickbasictunic.historicalsociety.closet0,gramps
4,0,4,person_click,basic,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,gramps,tunic.historicalsociety.closet.gramps.intro_0_...,person_clickbasictunic.historicalsociety.closet0,gramps


In [221]:
#getting the index of fqid_data where extracted_text_fqid is null and combine_copy is not null
cc_nn_index=fqid_data[fqid_data['extracted_text_fqid'].isna()&(fqid_data['combine_copy'].isna()==False)].index

In [222]:
#assigning the values of extracted_text_fqid to the newly created column extracted_text_fqid_copy 
fqid_data['extracted_text_fqid_copy']=fqid_data['extracted_text_fqid']

In [223]:
#building room_fqid + '.' + combine_copy for the rows in cc_nn_index and storing it in
#extracted_text_fqid_copy (.loc is used to avoid chained assignment)
fqid_data.loc[cc_nn_index,'extracted_text_fqid_copy']=fqid_data.loc[cc_nn_index,'room_fqid']+'.'+fqid_data.loc[cc_nn_index,'combine_copy']


In [224]:
fqid_data.isnull().sum()

level                          0
index                          0
event_name                     0
name                           0
fqid                        1223
room_fqid                      0
text_fqid                   2566
extracted_fqid              1223
extracted_text_fqid         1147
combine                        0
combine_copy                1015
extracted_text_fqid_copy     939
dtype: int64

In [225]:
fqid_data.head()

Unnamed: 0,level,index,event_name,name,fqid,room_fqid,text_fqid,extracted_fqid,extracted_text_fqid,combine,combine_copy,extracted_text_fqid_copy
0,0,0,cutscene_click,basic,intro,tunic.historicalsociety.closet,tunic.historicalsociety.closet.intro,intro,tunic.historicalsociety.closet.intro,cutscene_clickbasictunic.historicalsociety.clo...,intro,tunic.historicalsociety.closet.intro
1,0,1,person_click,basic,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,gramps,tunic.historicalsociety.closet.gramps.intro_0_...,person_clickbasictunic.historicalsociety.closet0,gramps,tunic.historicalsociety.closet.gramps.intro_0_...
2,0,2,person_click,basic,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,gramps,tunic.historicalsociety.closet.gramps.intro_0_...,person_clickbasictunic.historicalsociety.closet0,gramps,tunic.historicalsociety.closet.gramps.intro_0_...
3,0,3,person_click,basic,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,gramps,tunic.historicalsociety.closet.gramps.intro_0_...,person_clickbasictunic.historicalsociety.closet0,gramps,tunic.historicalsociety.closet.gramps.intro_0_...
4,0,4,person_click,basic,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,gramps,tunic.historicalsociety.closet.gramps.intro_0_...,person_clickbasictunic.historicalsociety.closet0,gramps,tunic.historicalsociety.closet.gramps.intro_0_...


In [226]:
#checking the records where extracted_text_fqid_copy is not null and combine_copy is null
fqid_data[(fqid_data['extracted_text_fqid_copy'].isna()==False)&(fqid_data['combine_copy'].isna())]

Unnamed: 0,level,index,event_name,name,fqid,room_fqid,text_fqid,extracted_fqid,extracted_text_fqid,combine,combine_copy,extracted_text_fqid_copy
23,0,23,notification_click,basic,,tunic.historicalsociety.closet,tunic.historicalsociety.closet.notebook,,tunic.historicalsociety.closet.notebook,notification_clickbasictunic.historicalsociety...,,tunic.historicalsociety.closet.notebook
65,2,65,notification_click,basic,,tunic.historicalsociety.collection,tunic.historicalsociety.collection.tunic.slip,,tunic.historicalsociety.collection.tunic.slip,notification_clickbasictunic.historicalsociety...,,tunic.historicalsociety.collection.tunic.slip
66,2,66,notification_click,basic,,tunic.historicalsociety.collection,tunic.historicalsociety.collection.tunic.slip,,tunic.historicalsociety.collection.tunic.slip,notification_clickbasictunic.historicalsociety...,,tunic.historicalsociety.collection.tunic.slip
112,3,112,notification_click,basic,,tunic.kohlcenter.halloffame,tunic.kohlcenter.halloffame.plaque.face.date,,tunic.kohlcenter.halloffame.plaque.face.date,notification_clickbasictunic.kohlcenter.hallof...,,tunic.kohlcenter.halloffame.plaque.face.date
113,3,113,notification_click,basic,,tunic.kohlcenter.halloffame,tunic.kohlcenter.halloffame.plaque.face.date,,tunic.kohlcenter.halloffame.plaque.face.date,notification_clickbasictunic.kohlcenter.hallof...,,tunic.kohlcenter.halloffame.plaque.face.date
...,...,...,...,...,...,...,...,...,...,...,...,...
3694,21,972,notification_click,basic,,tunic.historicalsociety.stacks,tunic.historicalsociety.stacks.journals_flag.p...,,tunic.historicalsociety.stacks.journals_flag.p...,notification_clickbasictunic.historicalsociety...,,tunic.historicalsociety.stacks.journals_flag.p...
3695,21,973,notification_click,basic,,tunic.historicalsociety.stacks,tunic.historicalsociety.stacks.journals_flag.p...,,tunic.historicalsociety.stacks.journals_flag.p...,notification_clickbasictunic.historicalsociety...,,tunic.historicalsociety.stacks.journals_flag.p...
3700,21,978,notification_click,basic,,tunic.historicalsociety.stacks,tunic.historicalsociety.stacks.journals_flag.p...,,tunic.historicalsociety.stacks.journals_flag.p...,notification_clickbasictunic.historicalsociety...,,tunic.historicalsociety.stacks.journals_flag.p...
3701,21,979,notification_click,basic,,tunic.historicalsociety.stacks,tunic.historicalsociety.stacks.journals_flag.p...,,tunic.historicalsociety.stacks.journals_flag.p...,notification_clickbasictunic.historicalsociety...,,tunic.historicalsociety.stacks.journals_flag.p...


In [227]:
#assigning the values in combine_copy of fqid_data to fqid column in data_frame
data_frame['fqid']=fqid_data['combine_copy']

In [228]:
fqid_data['extracted_text_fqid_copy'].str.split('.').str[3:]

0                      [intro]
1       [gramps, intro_0_cs_0]
2       [gramps, intro_0_cs_0]
3       [gramps, intro_0_cs_0]
4       [gramps, intro_0_cs_0]
                 ...          
3723         [tunic, wildlife]
3724        [tunic, capitol_2]
3725        [tunic, capitol_2]
3726          [chap4_finale_c]
3727          [chap4_finale_c]
Name: extracted_text_fqid_copy, Length: 3728, dtype: object

In [229]:
#assigning the values in extracted_text_fqid_copy of fqid_data to text_fqid column in data_frame
data_frame['text_fqid']=fqid_data['extracted_text_fqid_copy']

In [230]:
data_frame.head()

Unnamed: 0,session_id,index,elapsed_time,event_name,name,level,page,room_coor_x,room_coor_y,screen_coor_x,...,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group,session_level
0,20090109393214576,0,0,cutscene_click,basic,0,-1.0,-413.991405,75.685314,380.0,...,0.0,undefined,intro,tunic.historicalsociety.closet,tunic.historicalsociety.closet.intro,0,0,1,0-4,0
1,20090109393214576,1,1965,person_click,basic,0,-1.0,-105.991405,-63.314686,688.0,...,0.0,"Whatcha doing over there, Jo?",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4,0
2,20090109393214576,2,3614,person_click,basic,0,-1.0,-418.991405,47.685314,375.0,...,0.0,Just talking to Teddy.,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4,0
3,20090109393214576,3,5330,person_click,basic,0,-1.0,-110.991405,-57.314686,683.0,...,0.0,I gotta run to my meeting!,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4,0
4,20090109393214576,4,6397,person_click,basic,0,-1.0,-110.991405,-57.314686,683.0,...,0.0,"Can I come, Gramps?",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4,0


In [231]:
#re-checking the null values in the data frame
data_frame.isnull().sum()

session_id           0
index                0
elapsed_time         0
event_name           0
name                 0
level                0
page                 0
room_coor_x          0
room_coor_y          0
screen_coor_x        0
screen_coor_y        0
hover_duration       0
text              2566
fqid              1015
room_fqid            0
text_fqid          939
fullscreen           0
hq                   0
music                0
level_group          0
session_level        0
dtype: int64

In [232]:
#replacing the remaining null values in the fqid and text_fqid column with the string value 'Missing'

In [233]:
#checking whether missing exists in fqid column
'missing' in data_frame['fqid'].unique()

False

In [234]:
'Missing' in data_frame['fqid'].unique()

False

In [235]:
#checking whether missing exists in text_fqid column
'missing' in data_frame['text_fqid'].unique()

False

In [236]:
'Missing' in data_frame['text_fqid'].unique()

False

In [237]:
#replace the null values in fqid column with 'Missing' value

In [238]:
data_frame['fqid']=data_frame['fqid'].fillna('Missing')

In [239]:
#keeping only the part of fqid before the first '.'
data_frame['fqid']=data_frame['fqid'].str.split('.').str[0]

In [240]:
#replace the null values in text_fqid column with 'Missing' value

In [241]:
data_frame['text_fqid']=data_frame['text_fqid'].fillna('Missing')

In [242]:
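#rebuilding text_fqid as room_fqid + '.' + fqid; this overwrites every value in the column,
#including the 'Missing' values filled in the previous cell
data_frame['text_fqid']=data_frame['room_fqid']+'.'+data_frame['fqid']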

In [243]:
#re-check the null value count
data_frame.isnull().sum()

session_id           0
index                0
elapsed_time         0
event_name           0
name                 0
level                0
page                 0
room_coor_x          0
room_coor_y          0
screen_coor_x        0
screen_coor_y        0
hover_duration       0
text              2566
fqid                 0
room_fqid            0
text_fqid            0
fullscreen           0
hq                   0
music                0
level_group          0
session_level        0
dtype: int64

In [244]:
#removing the redundant columns 'text' and 'session_id'

In [245]:
data_frame=data_frame.drop(['text','session_id'],axis=1)

In [246]:
data_frame.head()

Unnamed: 0,index,elapsed_time,event_name,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group,session_level
0,0,0,cutscene_click,basic,0,-1.0,-413.991405,75.685314,380.0,259.0,0.0,intro,tunic.historicalsociety.closet,tunic.historicalsociety.closet.intro,0,0,1,0-4,0
1,1,1965,person_click,basic,0,-1.0,-105.991405,-63.314686,688.0,398.0,0.0,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps,0,0,1,0-4,0
2,2,3614,person_click,basic,0,-1.0,-418.991405,47.685314,375.0,287.0,0.0,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps,0,0,1,0-4,0
3,3,5330,person_click,basic,0,-1.0,-110.991405,-57.314686,683.0,392.0,0.0,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps,0,0,1,0-4,0
4,4,6397,person_click,basic,0,-1.0,-110.991405,-57.314686,683.0,392.0,0.0,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps,0,0,1,0-4,0


### Label Encoding

Some columns have object dtype and must be encoded as numerical values.

In [247]:
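#separating the object (string) columns from the numeric columns
object_cols=[]
numeric_cols=[]
for cols in data_frame.columns:
    if data_frame[cols].dtype=='O':
        object_cols.append(cols)
    else:
        try:
            #keeping the column only if it can be cast to int
            data_frame[cols].astype(int)
            numeric_cols.append(cols)
        except (ValueError,TypeError):
            #columns that cannot be cast to int are skipped
            pass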
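
A similar split can be sketched with `DataFrame.select_dtypes`. The toy frame below is made up for illustration, and this is an alternative formulation rather than the notebook's own method:

```python
import pandas as pd

# toy frame with one text column and two numeric columns
toy = pd.DataFrame({
    'event_name': ['cutscene_click', 'person_click'],
    'level': [0, 1],
    'page': [-1.0, 2.0],
})

# numeric columns (int and float) versus everything else
numeric_cols_toy = toy.select_dtypes(include='number').columns.tolist()
other_cols_toy = toy.select_dtypes(exclude='number').columns.tolist()
print(numeric_cols_toy, other_cols_toy)
```

Unlike the loop above, this version does not test whether each numeric column can be cast to int, so the two approaches could differ on columns that still contain nulls.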

In [248]:
object_cols

['event_name', 'name', 'fqid', 'room_fqid', 'text_fqid', 'level_group']

In [249]:
numeric_cols

['index',
 'elapsed_time',
 'level',
 'page',
 'room_coor_x',
 'room_coor_y',
 'screen_coor_x',
 'screen_coor_y',
 'hover_duration',
 'fullscreen',
 'hq',
 'music',
 'session_level']

Working on the column names in object_cols list

In [250]:
object_data=data_frame[object_cols]

In [251]:
object_data.head()

Unnamed: 0,event_name,name,fqid,room_fqid,text_fqid,level_group
0,cutscene_click,basic,intro,tunic.historicalsociety.closet,tunic.historicalsociety.closet.intro,0-4
1,person_click,basic,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps,0-4
2,person_click,basic,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps,0-4
3,person_click,basic,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps,0-4
4,person_click,basic,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps,0-4


In [252]:
def label_encoding(col_name,df):
    ranking=df[col_name].value_counts().index
    mapping={i:k for k,i in enumerate(ranking,0)}
    return df[col_name].map(mapping)
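
To see what this function produces, here is a small sketch that repeats the same logic on a toy column (the toy values are made up): the most frequent category receives code 0, the next most frequent code 1, and so on.

```python
import pandas as pd

def label_encoding(col_name, df):
    # same logic as the notebook's function: rank categories by frequency
    ranking = df[col_name].value_counts().index
    mapping = {category: code for code, category in enumerate(ranking)}
    return df[col_name].map(mapping)

# toy column: person_click appears 3 times, navigate_click twice, checkpoint once
toy = pd.DataFrame({'event_name': [
    'person_click', 'navigate_click', 'person_click',
    'checkpoint', 'person_click', 'navigate_click',
]})
encoded = label_encoding('event_name', toy)
print(encoded.tolist())
```

Because the codes depend on the value frequencies of whichever frame is passed in, the same category could receive a different code in another file. Reusing a single mapping across files would avoid that.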

In [253]:
#working on an explicit copy so that the assignment below does not raise SettingWithCopyWarning
object_data=object_data.copy()
for col_name in object_data.columns:
    object_data[col_name]=label_encoding(col_name,object_data)


In [254]:
object_data.head()

Unnamed: 0,event_name,name,fqid,room_fqid,text_fqid,level_group
0,3,0,63,14,116,2
1,1,0,6,14,61,2
2,1,0,6,14,61,2
3,1,0,6,14,61,2
4,1,0,6,14,61,2


Working on Numerical columns

In [255]:
numeric_cols

['index',
 'elapsed_time',
 'level',
 'page',
 'room_coor_x',
 'room_coor_y',
 'screen_coor_x',
 'screen_coor_y',
 'hover_duration',
 'fullscreen',
 'hq',
 'music',
 'session_level']

In [256]:
num_data=data_frame[numeric_cols]

In [257]:
num_data.head()

Unnamed: 0,index,elapsed_time,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,fullscreen,hq,music,session_level
0,0,0,0,-1.0,-413.991405,75.685314,380.0,259.0,0.0,0,0,1,0
1,1,1965,0,-1.0,-105.991405,-63.314686,688.0,398.0,0.0,0,0,1,0
2,2,3614,0,-1.0,-418.991405,47.685314,375.0,287.0,0.0,0,0,1,0
3,3,5330,0,-1.0,-110.991405,-57.314686,683.0,392.0,0.0,0,0,1,0
4,4,6397,0,-1.0,-110.991405,-57.314686,683.0,392.0,0.0,0,0,1,0


In [258]:
from sklearn.preprocessing import StandardScaler

In [259]:
final_data_frame=pd.concat([object_data,num_data],axis=1)

In [260]:
final_data_frame.head()

Unnamed: 0,event_name,name,fqid,room_fqid,text_fqid,level_group,index,elapsed_time,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,fullscreen,hq,music,session_level
0,3,0,63,14,116,2,0,0,0,-1.0,-413.991405,75.685314,380.0,259.0,0.0,0,0,1,0
1,1,0,6,14,61,2,1,1965,0,-1.0,-105.991405,-63.314686,688.0,398.0,0.0,0,0,1,0
2,1,0,6,14,61,2,2,3614,0,-1.0,-418.991405,47.685314,375.0,287.0,0.0,0,0,1,0
3,1,0,6,14,61,2,3,5330,0,-1.0,-110.991405,-57.314686,683.0,392.0,0.0,0,0,1,0
4,1,0,6,14,61,2,4,6397,0,-1.0,-110.991405,-57.314686,683.0,392.0,0.0,0,0,1,0


In [261]:
scaler=StandardScaler()

In [262]:
transformed_data=scaler.fit_transform(final_data_frame)

In [263]:
final_input_data=pd.DataFrame(transformed_data)

In [264]:
final_input_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,0.57281,-0.912381,4.099225,1.821716,3.305951,2.012536,-1.63062,-0.993054,-1.937641,-0.187175,-0.825175,0.896769,-0.143707,-0.563536,-0.146064,0.0,0.0,0.0,-1.707
1,-0.342311,-0.912381,-0.30018,1.821716,1.273612,2.012536,-1.628188,-0.992109,-1.937641,-0.187175,-0.190921,0.254304,1.000591,0.242797,-0.146064,0.0,0.0,0.0,-1.707
2,-0.342311,-0.912381,-0.30018,1.821716,1.273612,2.012536,-1.625755,-0.991316,-1.937641,-0.187175,-0.835472,0.767351,-0.162283,-0.401109,-0.146064,0.0,0.0,0.0,-1.707
3,-0.342311,-0.912381,-0.30018,1.821716,1.273612,2.012536,-1.623323,-0.99049,-1.937641,-0.187175,-0.201217,0.282036,0.982015,0.207991,-0.146064,0.0,0.0,0.0,-1.707
4,-0.342311,-0.912381,-0.30018,1.821716,1.273612,2.012536,-1.620891,-0.989977,-1.937641,-0.187175,-0.201217,0.282036,0.982015,0.207991,-0.146064,0.0,0.0,0.0,-1.707
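
The columns that come out as 0.0 in every row (fullscreen, hq and music) are consistent with constant input columns. A NumPy sketch of the standardization that StandardScaler applies, as far as its documented behaviour goes (population standard deviation, with zero-variance columns left at 0), using a made-up toy array and a hypothetical `standardize` helper:

```python
import numpy as np

def standardize(values):
    """Column-wise (x - mean) / std, with constant columns mapped to 0.

    Hypothetical helper for illustration; not scikit-learn's implementation.
    """
    values = np.asarray(values, dtype=float)
    mean = values.mean(axis=0)
    std = values.std(axis=0)                    # population std (ddof=0)
    std_safe = np.where(std == 0, 1.0, std)     # avoid dividing by zero
    return (values - mean) / std_safe

# toy array: the first column varies, the second column is constant
toy = np.array([[1.0, 5.0],
                [2.0, 5.0],
                [3.0, 5.0]])
scaled = standardize(toy)
print(scaled)
```

A constant column carries no information after scaling, so every value becomes 0.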


In [265]:
final_input_data.columns=final_data_frame.columns

In [271]:
#restoring the original (unscaled) session_level values
final_input_data['session_level']=data_frame['session_level']

In [273]:
final_input_data.head()

Unnamed: 0,event_name,name,fqid,room_fqid,text_fqid,level_group,index,elapsed_time,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,fullscreen,hq,music,session_level
0,0.57281,-0.912381,4.099225,1.821716,3.305951,2.012536,-1.63062,-0.993054,-1.937641,-0.187175,-0.825175,0.896769,-0.143707,-0.563536,-0.146064,0.0,0.0,0.0,0
1,-0.342311,-0.912381,-0.30018,1.821716,1.273612,2.012536,-1.628188,-0.992109,-1.937641,-0.187175,-0.190921,0.254304,1.000591,0.242797,-0.146064,0.0,0.0,0.0,0
2,-0.342311,-0.912381,-0.30018,1.821716,1.273612,2.012536,-1.625755,-0.991316,-1.937641,-0.187175,-0.835472,0.767351,-0.162283,-0.401109,-0.146064,0.0,0.0,0.0,0
3,-0.342311,-0.912381,-0.30018,1.821716,1.273612,2.012536,-1.623323,-0.99049,-1.937641,-0.187175,-0.201217,0.282036,0.982015,0.207991,-0.146064,0.0,0.0,0.0,0
4,-0.342311,-0.912381,-0.30018,1.821716,1.273612,2.012536,-1.620891,-0.989977,-1.937641,-0.187175,-0.201217,0.282036,0.982015,0.207991,-0.146064,0.0,0.0,0.0,0


In [274]:
final_input_data.shape

(3728, 19)

In [275]:
final_input_data.to_csv('data/clean_test_data.csv',index=False)