# __DATA ANALYSIS__
# Cyberscapes: Serious Games as a Strategy for Enhancing Filipino Cybersecurity Literacy and Interest

## Preliminaries

In [66]:
# Import libraries
import pandas as pd
import openpyxl

# Open sheet document
# workbook = openpyxl.load_workbook(filename = 'SurveyResponses.xlsx')
# worksheet = workbook['TalliedResponses_Anon']
# worksheet.max_column
worksheet = pd.read_csv("SurveyResponses.csv")


# Pre-test/Post-test survey responses
df_raw = worksheet.iloc[1:]

df_raw[0:5]


Unnamed: 0,Survey Language,Timestamp,Age,Gender,Preferred Languages,Current Occupation,Last School Attended,Highest Educational Attainment,Associate Program,Undergraduate Program D,...,POSTK07,POSTK08,POSTK09,POSTK10,POSTK11,POSTK12,POSTK13,POSTK14,POSTK15,POSTK16
1,ENG,6/28/2023 12:58:40,21.0,Male,English,Student,UP Diliman,"Undergraduate (e.g., BA, BBA, BS)",,,...,1,0,1,1,1,1,1,1,1,1
2,ENG,6/28/2023 13:04:31,20.0,Male,"Filipino, Bisaya",Student,UP Diliman,"Undergraduate (e.g., BA, BBA, BS)",,,...,1,1,1,1,1,1,1,1,1,1
3,ENG,6/28/2023 13:14:18,22.0,Male,"Filipino, English",Student,UP Diliman,"Undergraduate (e.g., BA, BBA, BS)",,,...,1,1,1,1,1,1,1,1,1,1
4,ENG,6/28/2023 13:16:18,22.0,Male,English,Student,LPU Laguna,"Undergraduate (e.g., BA, BBA, BS)",,,...,1,0,1,1,1,1,1,1,1,1
5,ENG,6/28/2023 13:17:55,22.0,Female,"Filipino, English",Student,UPLB,"Undergraduate (e.g., BA, BBA, BS)",,,...,1,1,1,1,1,1,1,1,1,1


## Summary of Responses

In [67]:
# Summary of Responses (Tallied Cybersecurity Interest and Knowledge Scores)
# .copy() makes df_summary an independent frame, so the dtype conversion below
# does not raise SettingWithCopyWarning
df_summary = df_raw[["Age", "Gender", "Preferred Languages", "Current Occupation", "Undergraduate Program B", "PRE Interest", "POST Interest", "PRE Knowledge", "POST Knowledge"]].copy()

score_cols = ["PRE Interest", "POST Interest", "PRE Knowledge", "POST Knowledge"]
df_summary[score_cols] = df_summary[score_cols].astype(int)
df_summary[0:5]


Unnamed: 0,Age,Gender,Preferred Languages,Current Occupation,Undergraduate Program B,PRE Interest,POST Interest,PRE Knowledge,POST Knowledge
1,21.0,Male,English,Student,BS Computer Science,41,43,15,15
2,20.0,Male,"Filipino, Bisaya",Student,BS Food Technology,18,40,14,14
3,22.0,Male,"Filipino, English",Student,BS Applied Physics,30,42,15,16
4,22.0,Male,English,Student,BS Industrial Engineering,42,49,15,14
5,22.0,Female,"Filipino, English",Student,BS Chemistry,37,43,16,16
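
When a column subset of a DataFrame is assigned to afterwards, pandas may warn that the assignment targets a copy of a slice. A minimal sketch of one way to avoid that, using a small hypothetical frame (`raw`, with made-up values) in place of the survey data:

```python
import pandas as pd

# Hypothetical stand-in for the raw survey frame (values are illustrative only).
raw = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "PRE Interest": ["41", "18", "30"],
    "POST Interest": ["43", "40", "42"],
})

score_cols = ["PRE Interest", "POST Interest"]

# .copy() makes the subset an independent frame, so the assignment below
# is unambiguous and leaves `raw` untouched.
summary = raw[["Gender"] + score_cols].copy()
summary[score_cols] = summary[score_cols].astype(int)

print(summary.dtypes)
```

The original `raw` frame keeps its string values, while `summary` holds the converted scores.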


### Right-tailed Paired Samples t-Test

> $H_0$: $\mu_1 \leq \mu_2$ (population 1 mean is *less than* or *equal to* population 2 mean) \\
> $H_1$: $\mu_1 \gt \mu_2$ (population 1 mean is *greater than* population 2 mean)
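
A paired samples t-test is equivalent to a one-sample t-test on the per-participant differences. The sketch below illustrates this with made-up scores (the `pre` and `post` arrays are hypothetical, not survey data) and checks the manual computation against `scipy.stats.ttest_rel`:

```python
import numpy as np
from scipy import stats

# Illustrative scores only (not taken from the survey data).
pre = np.array([30, 25, 41, 37, 28, 33])
post = np.array([35, 29, 40, 43, 31, 39])

# A paired test works on the per-participant differences.
diff = post - pre
n = diff.size
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))

# Right-tailed p-value: probability of a t at least this large under H0.
p_manual = stats.t.sf(t_manual, df=n - 1)

result = stats.ttest_rel(post, pre, alternative="greater")
print(t_manual, p_manual)
print(result.statistic, result.pvalue)
```

Passing the POST scores first makes a positive statistic correspond to an increase from PRE to POST.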

#### Is POST Interest $\gt$ PRE Interest?

> $H_0$: CyberScapes did NOT enhance the cybersecurity interest of the participants. \\
> $H_1$: CyberScapes enhanced the cybersecurity interest of the participants.

In [68]:
from scipy.stats import ttest_rel
# Paired Samples t-Test (PRE vs POST Cybersecurity Interest)

ttest_rel(df_summary['POST Interest'], df_summary['PRE Interest'], alternative='greater')

TtestResult(statistic=5.0789900097492895, pvalue=5.8997108231260355e-06, df=36)

A right-tailed paired samples t-test on the observed POST and PRE Interest scores yields $p \approx 5.90 \times 10^{-6}$, which is far below $0.05$, so we reject the null hypothesis. The results therefore indicate that CyberScapes enhanced the cybersecurity interest of the participants.
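
As a cross-check, the right-tailed p-value can be recovered from the reported t statistic and degrees of freedom using the survival function of the t distribution. This is a small sketch; the two numbers are copied from the `TtestResult` output above:

```python
from scipy import stats

# Values copied from the TtestResult output for the interest scores.
t_stat = 5.0789900097492895
dof = 36

# Right-tailed p-value, matching alternative='greater'.
p_right = stats.t.sf(t_stat, dof)

# Two-sided p-value, shown only for comparison.
p_two_sided = 2 * stats.t.sf(abs(t_stat), dof)

print(f"right-tailed p = {p_right:.3e}")
print(f"two-sided p    = {p_two_sided:.3e}")
```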

#### Is POST Knowledge $\gt$ PRE Knowledge?

$H_0$: CyberScapes did NOT enhance the cybersecurity knowledge of the participants. \\
$H_1$: CyberScapes enhanced the cybersecurity knowledge of the participants.

In [69]:
# Paired Samples t-Test (PRE vs POST Cybersecurity Knowledge)

ttest_rel(df_summary['POST Knowledge'], df_summary['PRE Knowledge'], alternative='greater')

TtestResult(statistic=0.5131350685123557, pvalue=0.3054958117264494, df=36)

A right-tailed paired samples t-test on the observed POST and PRE Knowledge scores yields $p \approx 0.305$, which is greater than $0.05$, so we fail to reject the null hypothesis. Thus, there is no statistically significant evidence that CyberScapes improved the cybersecurity knowledge of the participants.
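
The p-values say whether an improvement is detectable, not how large it is. One common convention for paired designs is the standardized mean difference $d_z = t / \sqrt{n}$. The sketch below applies it to the two reported t statistics, assuming $n = 37$ participants (the reported `df=36` plus one). The effect size is not part of the original analysis and is shown only as an illustration.

```python
import math

n = 36 + 1  # reported df + 1 participants in a paired design

# t statistics copied from the two TtestResult outputs above.
t_interest = 5.0789900097492895
t_knowledge = 0.5131350685123557


def paired_effect_size(t_stat: float, n_pairs: int) -> float:
    """Cohen's d_z for paired samples: mean difference / SD of differences."""
    return t_stat / math.sqrt(n_pairs)


d_interest = paired_effect_size(t_interest, n)
d_knowledge = paired_effect_size(t_knowledge, n)

print(f"d_z (interest)  = {d_interest:.3f}")
print(f"d_z (knowledge) = {d_knowledge:.3f}")
```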

### Two Samples Independent t-Test

$H_0$: $\mu_1 = \mu_2$ (population 1 mean is *equal to* population 2 mean) \\
$H_1$: $\mu_1 \neq \mu_2$ (population 1 mean is *unequal to* population 2 mean)

#### Is there a significant difference in the cybersecurity interest enhancement of Computer Science undergraduates and NON-Computer Science undergraduates?

$H_0$: The cybersecurity interest improvement of Computer Science undergraduates is equal to the cybersecurity interest improvement of NON-Computer Science undergraduates. \\
$H_1$: The cybersecurity interest improvements of Computer Science and Non-Computer Science undergraduates are unequal.

In [70]:
from scipy.stats import ttest_ind

# Independent Two Sample t-Test (Com Sci vs. Non-Com Sci Cybersecurity Interest)

df_comsci = df_summary.loc[df_summary['Undergraduate Program B'] == "BS Computer Science"]
df_noncomsci = df_summary.loc[df_summary['Undergraduate Program B'] != "BS Computer Science"]

ttest_ind(df_comsci['POST Interest'] - df_comsci['PRE Interest'], df_noncomsci['POST Interest'] - df_noncomsci['PRE Interest'])

Ttest_indResult(statistic=0.13683410325130582, pvalue=0.8919460213690391)

#### Is there a significant difference in the PRIOR cybersecurity **knowledge** of Computer Science undergraduates and NON-Computer Science undergraduates?

$H_0$: The PRIOR cybersecurity knowledge of Computer Science and Non-Computer Science undergraduates is *equal*. \\
$H_1$: The PRIOR cybersecurity knowledge of Computer Science and Non-Computer Science undergraduates is *unequal*.

In [71]:
# Independent Two Sample t-Test (Com Sci vs. Non-Com Sci Knowledge)

ttest_ind(df_comsci['PRE Knowledge'], df_noncomsci['PRE Knowledge'])

Ttest_indResult(statistic=1.3137435166034428, pvalue=0.19747911819959105)

#### Is there a significant difference in the PRIOR cybersecurity **interest** of Computer Science undergraduates and NON-Computer Science undergraduates?

$H_0$: The PRIOR cybersecurity interest of Computer Science and Non-Computer Science undergraduates is *equal*. \\
$H_1$: The PRIOR cybersecurity interest of Computer Science and Non-Computer Science undergraduates is *unequal*.

In [72]:
# Independent Two Sample t-Test (Com Sci vs. Non-Com Sci Interest)

ttest_ind(df_comsci['PRE Interest'], df_noncomsci['PRE Interest'])

Ttest_indResult(statistic=0.9560605988827486, pvalue=0.3455990491662936)

#### Is there a significant difference in the PRIOR cybersecurity **interest** of Male vs Female participants?

$H_0$: The PRIOR cybersecurity interest of Male and Female participants is *equal*. \\
$H_1$: The PRIOR cybersecurity interest of Male and Female participants is *unequal*.

In [73]:
# Independent Two Sample t-Test (Male vs. Female PRE Cybersecurity Interest)

df_male = df_summary.loc[df_summary['Gender'] == "Male"]
df_female = df_summary.loc[df_summary['Gender'] == "Female"]


ttest_ind(df_male['PRE Interest'], df_female['PRE Interest'])

Ttest_indResult(statistic=-0.02503648714837312, pvalue=0.9801680642060083)

#### Is there a significant difference in the PRIOR cybersecurity **interest** of Male vs Female Non-Computer Science undergraduates?

$H_0$: The PRIOR cybersecurity interest of Male and Female non-Computer Science undergraduate participants is *equal*. \\
$H_1$: The PRIOR cybersecurity interest of Male and Female non-Computer Science undergraduate participants is *unequal*.

In [74]:
# Independent Two Sample t-Test (NON-ComSci Male vs. Female PRE Cybersecurity Interest)

df_noncomsci_male = df_noncomsci.loc[df_noncomsci['Gender'] == "Male"]
df_noncomsci_female = df_noncomsci.loc[df_noncomsci['Gender'] == "Female"]


ttest_ind(df_noncomsci_male['PRE Interest'], df_noncomsci_female['PRE Interest'])

Ttest_indResult(statistic=0.16609521314262957, pvalue=0.8694733425697461)
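
The five independent-samples results above are reported without a stated decision. The sketch below tabulates them, assuming the same $0.05$ significance level used for the paired tests applies here (an assumption, since the notebook does not state a level for these tests). The p-values are copied from the outputs above and the labels are shorthand.

```python
ALPHA = 0.05  # assumed significance level, as in the paired tests

# p-values copied from the Ttest_indResult outputs in this section.
p_values = {
    "CS vs Non-CS: interest improvement": 0.8919460213690391,
    "CS vs Non-CS: PRE knowledge": 0.19747911819959105,
    "CS vs Non-CS: PRE interest": 0.3455990491662936,
    "Male vs Female: PRE interest": 0.9801680642060083,
    "Male vs Female (Non-CS): PRE interest": 0.8694733425697461,
}

decisions = {
    label: ("reject H0" if p < ALPHA else "fail to reject H0")
    for label, p in p_values.items()
}

for label, decision in decisions.items():
    print(f"{label:40s} p = {p_values[label]:.3f} -> {decision}")
```

Under that assumption, none of the five comparisons shows a statistically significant difference between the groups.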

### Percentage of Correct Responses per Question (Knowledge)

In [75]:
# Gained points for Individual Questions (Knowledge)

# PRE Knowledge
df_points_PREK = df_raw[['Undergraduate Program B', "PREK01", "PREK02", "PREK03", "PREK04", "PREK05", "PREK06", "PREK07", "PREK08", "PREK09", "PREK10", "PREK11", "PREK12", "PREK13", "PREK14", "PREK15", "PREK16"]]
df_points_PREK_sum = df_points_PREK.drop(['Undergraduate Program B'], axis=1).astype(int).sum()

# POST Knowledge
df_points_POSTK = df_raw[['Undergraduate Program B', "POSTK01", "POSTK02", "POSTK03", "POSTK04", "POSTK05", "POSTK06", "POSTK07", "POSTK08", "POSTK09", "POSTK10", "POSTK11", "POSTK12", "POSTK13", "POSTK14", "POSTK15", "POSTK16"]]
df_points_POSTK_sum = df_points_POSTK.drop(['Undergraduate Program B'], axis=1).astype(int).sum()

# Put in one Dataframe
df_points_knowledge = pd.DataFrame()
df_points_knowledge["PRE Knowledge"] = df_points_PREK_sum.values/df_summary.shape[0]
df_points_knowledge["POST Knowledge"] = df_points_POSTK_sum.values/df_summary.shape[0]

df_points_knowledge

Unnamed: 0,PRE Knowledge,POST Knowledge
0,0.945946,1.0
1,0.972973,1.0
2,0.810811,0.810811
3,1.0,0.972973
4,0.891892,0.945946
5,0.756757,0.648649
6,0.972973,1.0
7,0.837838,0.837838
8,0.945946,0.972973
9,0.972973,1.0


Each question testing the participants' cybersecurity knowledge was answered correctly by a majority (roughly $80\%$ or more) of the participants, except question $\#6$ (mitigation of risks associated with typosquatting), which only $\approx 65\%$ answered correctly on the POST test (down from $\approx 76\%$ on the PRE test).
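
Because each knowledge item is scored 0 or 1, the fraction of correct answers per question is simply the column mean. A minimal sketch of that computation, using a hypothetical four-participant frame (`items`, with made-up values) and 1-based question numbers:

```python
import pandas as pd

# Hypothetical 0/1 item scores for four participants (illustrative only).
items = pd.DataFrame({
    "PREK01": [1, 1, 0, 1],
    "PREK02": [1, 0, 0, 1],
    "POSTK01": [1, 1, 1, 1],
    "POSTK02": [1, 1, 0, 1],
})

pre_cols = ["PREK01", "PREK02"]
post_cols = ["POSTK01", "POSTK02"]

# The mean of a 0/1 column is the fraction of correct answers.
per_question = pd.DataFrame(
    {
        "PRE Knowledge": items[pre_cols].mean().to_numpy(),
        "POST Knowledge": items[post_cols].mean().to_numpy(),
    },
    index=pd.Index([1, 2], name="Question"),
)

print((per_question * 100).round(1))
```

Indexing by question number avoids the off-by-one between a 0-based row label and the question it refers to.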

### Agreement per Statement in Likert Scale (Interest)

In [76]:
# Gained points for Individual Questions (Interest)

# PRE Interest
df_points_PREI = df_raw[['Undergraduate Program B', "PREI01", "PREI02", "PREI03", "PREI04", "PREI05", "PREI06", "PREI07", "PREI08", "PREI09", "PREI10"]]
df_points_PREI_sum = df_points_PREI.drop([ 'Undergraduate Program B'], axis=1).astype(int).sum()

# POST Interest
df_points_POSTI = df_raw[['Undergraduate Program B', "POSTI01", "POSTI02", "POSTI03", "POSTI04", "POSTI05", "POSTI06", "POSTI07", "POSTI08", "POSTI09", "POSTI10"]]
df_points_POSTI_sum = df_points_POSTI.drop([ 'Undergraduate Program B'], axis=1).astype(int).sum()

# Put in one Dataframe
df_points_interest = pd.DataFrame()
df_points_interest["PRE Interest"] = (df_points_PREI_sum.values/df_summary.shape[0])
df_points_interest["POST Interest"] = (df_points_POSTI_sum.values/df_summary.shape[0])

int_statements = ["Knowledgeable in C.S.",
"Interested in learning C.S.",
"Actively seeking info about C.S.",
"Updated on news about C.S.",
"Usability of C.S. in daily life",
"Ensuring safety of accounts",
"Securing online presence",
"Confidence in protecting info",
"Pursuing degree re C.S.",
"Pursuing work re C.S."]

df_points_interest.insert(0, "Statement", int_statements, True)

df_points_interest

Unnamed: 0,Statement,PRE Interest,POST Interest
0,Knowledgeable in C.S.,3.27027,4.027027
1,Interested in learning C.S.,4.189189,4.243243
2,Actively seeking info about C.S.,3.540541,4.054054
3,Updated on news about C.S.,2.783784,4.027027
4,Usability of C.S. in daily life,4.216216,4.648649
5,Ensuring safety of accounts,4.189189,4.594595
6,Securing online presence,4.0,4.405405
7,Confidence in protecting info,3.540541,4.135135
8,Pursuing degree re C.S.,2.675676,2.918919
9,Pursuing work re C.S.,2.621622,2.945946


Statements representing participants' cybersecurity interest received moderate to high agreement, except those pertaining to pursuing a degree or career related to cybersecurity. On average, participants fall between *disinclined* and *neutral* about pursuing a degree or career related to cybersecurity.
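
To see which statements moved the most, the PRE and POST means from the table above can be differenced and ranked. This sketch re-enters the tabulated means by hand. The `Change` column is an addition for illustration and the statement labels are shorthand.

```python
import pandas as pd

# Mean agreement per statement, copied from the table above.
interest = pd.DataFrame({
    "Statement": [
        "Knowledgeable in C.S.",
        "Interested in learning C.S.",
        "Actively seeking info about C.S.",
        "Updated on news about C.S.",
        "Usability of C.S. in daily life",
        "Ensuring safety of accounts",
        "Securing online presence",
        "Confidence in protecting info",
        "Pursuing degree re C.S.",
        "Pursuing work re C.S.",
    ],
    "PRE": [3.27027, 4.189189, 3.540541, 2.783784, 4.216216,
            4.189189, 4.0, 3.540541, 2.675676, 2.621622],
    "POST": [4.027027, 4.243243, 4.054054, 4.027027, 4.648649,
             4.594595, 4.405405, 4.135135, 2.918919, 2.945946],
})

# Positive values mean higher average agreement after playing.
interest["Change"] = interest["POST"] - interest["PRE"]
ranked = interest.sort_values("Change", ascending=False).reset_index(drop=True)

print(ranked.round(3))
```

Every statement shows a positive change. The largest is for staying updated on cybersecurity news and the smallest is for interest in learning about cybersecurity, which already had a high PRE mean.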

In [77]:
# Per Item PRE Interest of Non-CS vs. CS Participants

df_points_PREI_sum_comsci = df_points_PREI.loc[df_points_PREI['Undergraduate Program B'] == "BS Computer Science"].drop(['Undergraduate Program B'], axis=1).astype(int).sum()
df_points_PREI_sum_noncomsci = df_points_PREI.loc[df_points_PREI['Undergraduate Program B'] != "BS Computer Science"].drop(['Undergraduate Program B'], axis=1).astype(int).sum()

# Per Item POST Interest of Non-CS vs. CS Participants

df_points_POSTI_sum_comsci = df_points_POSTI.loc[df_points_POSTI['Undergraduate Program B'] == "BS Computer Science"].drop(['Undergraduate Program B'], axis=1).astype(int).sum()
df_points_POSTI_sum_noncomsci = df_points_POSTI.loc[df_points_POSTI['Undergraduate Program B'] != "BS Computer Science"].drop(['Undergraduate Program B'], axis=1).astype(int).sum()

mux = pd.MultiIndex.from_product([['Non-CS Interest','CS Interest'], ['PRE','POST']])
df_points_interest_ugrad = pd.DataFrame(columns=mux)

# Filling up the dataframe with average interest scores per statement

df_points_interest_ugrad["Non-CS Interest", "PRE"] = (df_points_PREI_sum_noncomsci.values/df_noncomsci.shape[0])
df_points_interest_ugrad["Non-CS Interest", "POST"] = (df_points_POSTI_sum_noncomsci.values/df_noncomsci.shape[0])
df_points_interest_ugrad["CS Interest", "PRE"] = (df_points_PREI_sum_comsci.values/df_comsci.shape[0])
df_points_interest_ugrad["CS Interest", "POST"] = (df_points_POSTI_sum_comsci.values/df_comsci.shape[0])
df_points_interest_ugrad.insert(0, "Statement", int_statements, True)

df_points_interest_ugrad


Unnamed: 0_level_0,Statement,Non-CS Interest,Non-CS Interest,CS Interest,CS Interest
Unnamed: 0_level_1,Unnamed: 1_level_1,PRE,POST,PRE,POST
0,Knowledgeable in C.S.,3.153846,4.038462,3.545455,4.0
1,Interested in learning C.S.,4.076923,4.076923,4.454545,4.636364
2,Actively seeking info about C.S.,3.423077,3.884615,3.818182,4.454545
3,Updated on news about C.S.,2.576923,3.923077,3.272727,4.272727
4,Usability of C.S. in daily life,4.269231,4.653846,4.090909,4.636364
5,Ensuring safety of accounts,4.230769,4.615385,4.090909,4.545455
6,Securing online presence,4.038462,4.307692,3.909091,4.636364
7,Confidence in protecting info,3.576923,4.153846,3.454545,4.090909
8,Pursuing degree re C.S.,2.538462,2.769231,3.0,3.272727
9,Pursuing work re C.S.,2.461538,2.807692,3.0,3.272727


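Recall that the statistical test above shows NO significant difference between the two groups in their degree of improvement in cybersecurity interest through the game. Descriptively, however, Computer Science students report higher agreement on the statements about learning, seeking information, keeping up with news, and pursuing a degree or career in cybersecurity, although the difference in PRIOR interest between the groups was not statistically significant ($p \approx 0.346$).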
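
The per-group averages can also be obtained with a `groupby`, which avoids dividing sums by group sizes by hand. A minimal sketch on a hypothetical four-row frame (made-up values; the `Group` column is an illustrative helper, not part of the survey data):

```python
import numpy as np
import pandas as pd

# Hypothetical responses (illustrative only); column names mirror the notebook.
responses = pd.DataFrame({
    "Undergraduate Program B": ["BS Computer Science", "BS Chemistry",
                                "BS Computer Science", "BS Food Technology"],
    "PREI01": [4, 3, 3, 2],
    "POSTI01": [5, 4, 4, 4],
})

# Label each participant as CS or Non-CS, then average within each group.
responses["Group"] = np.where(
    responses["Undergraduate Program B"] == "BS Computer Science", "CS", "Non-CS"
)
per_group = responses.groupby("Group")[["PREI01", "POSTI01"]].mean()

print(per_group)
```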

# Gameplay Experience Feedback

Participants' ratings of the gameplay experience, particularly with regard to UI/Graphics, Responsiveness and Ease of Gameplay, Sense of Achievement, and Replayability.

In [78]:
df_feedback_sum = df_raw[["FB01", "FB02", "FB03", "FB04", "FB05", "FB06", "FB07", "FB08", "FB09", "FB10", "FB11"]].astype(int).sum()

df_feedback = pd.DataFrame()
df_feedback["Rating"] = (df_feedback_sum.values/df_summary.shape[0])

fb_statements = ["Intuitiveness",
"Navigation",
"Graphics",
"Difficulty",
"Responsiveness",
"Glitches/Bugs",
"Completion Pride",
"High Score Pride",
"Playing Again",
"Game Variation",
"Non-Repetitive"]

df_feedback.insert(0, "Statement", fb_statements, True)

df_feedback

Unnamed: 0,Statement,Rating
0,Intuitiveness,4.567568
1,Navigation,4.72973
2,Graphics,4.675676
3,Difficulty,4.594595
4,Responsiveness,4.72973
5,Glitches/Bugs,4.756757
6,Completion Pride,4.486486
7,High Score Pride,4.621622
8,Playing Again,4.405405
9,Game Variation,4.135135
