Column descriptions from: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset?fbclid=IwAR1JNHGZbEhtzY0LdMW5YjkyT4JhRp1A8qzUhEFxai2b-2nRhw4JFQcNx-0

* id: unique identifier
* gender: "Male", "Female" or "Other"
* age: age of the patient
* hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
* heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
* ever_married: "No" or "Yes"
* work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
* Residence_type: "Rural" or "Urban"
* avg_glucose_level: average glucose level in blood
* bmi: body mass index
* smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
* stroke: 1 if the patient had a stroke or 0 if not
*Note: "Unknown" in smoking_status means that the information is unavailable for this patient

In [None]:
install.packages("moments")
install.packages("dplyr")
install.packages("ggpubr")
install.packages("nortest")
install.packages("ggplot2")


Load the data:

In [None]:
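# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0),
# which the factor summaries and level-based NA handling below rely on.
df <- read.csv("./data/healthcare-dataset-stroke-data.xls", stringsAsFactors = TRUE)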
head(df)

### Basic information about the data

First we look for anomalies such as missing values or values that are not realistic.

From the summary below we can see, for example, that the 'gender' variable has 3 distinct categories where only 2 are expected, and that the third category contains a single observation.

In addition, 'bmi' is read as categorical data, although we would expect a numeric variable.

In [None]:
str(df)

In [None]:
summary(df)
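
The two observations above can also be checked directly instead of being read off the summary. A minimal sketch on a small made-up data frame; the `toy` object and its values are illustrative assumptions, not rows from the dataset:

```r
# Toy data standing in for df; the values are invented for illustration.
toy <- data.frame(
    gender = factor(c("Male", "Female", "Female", "Other")),
    bmi    = factor(c("36.6", "N/A", "32.5", "28.1"))
)

# Frequency of each category: a level with a single observation stands out.
gender_counts <- table(toy$gender)
print(gender_counts)

# Entries that cannot be parsed as numbers explain why bmi is read as categorical.
bmi_chr <- as.character(toy$bmi)
non_numeric <- unique(bmi_chr[is.na(suppressWarnings(as.numeric(bmi_chr)))])
print(non_numeric)
```

On the real data the same calls would be applied to `df$gender` and `df$bmi`.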

The identifier is not interesting in itself, but we expect every value in this column to be unique:

In [None]:
nrow(unique(df[c("id")])) == nrow(df[c("id")])
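
If the check above returned FALSE, we would also want to know which identifiers repeat. A small sketch using base R's `anyDuplicated()` and `duplicated()`; the `ids` vector is a hypothetical example, not taken from the dataset:

```r
# Hypothetical identifiers with one repeated value.
ids <- c(9046L, 51676L, 31112L, 51676L)

# anyDuplicated() returns 0 when all values are unique,
# otherwise the index of the first duplicated element.
all_unique <- anyDuplicated(ids) == 0

# The values that occur more than once.
dup_values <- unique(ids[duplicated(ids)])

print(all_unique)
print(dup_values)
```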

The 'gender' column has 3 unique values. The third value is 'Other' and occurs in a single observation. Because it is only one observation and we cannot determine the gender, we remove it and consider only 2 genders.

In [None]:
unique(df[c("gender")])

In [None]:
df = df[df$gender != 'Other', ]

In [None]:
unique(df[c("gender")])
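
One detail worth keeping in mind: when 'gender' is stored as a factor, filtering out the rows does not remove the now-empty level, so 'Other' can still appear with a count of 0 in tables and plots. A sketch on an invented vector, using `droplevels()`:

```r
# Invented example values, not rows from the dataset.
gender <- factor(c("Male", "Female", "Other", "Female"))

# Filtering removes the observation but keeps the unused level.
kept <- gender[gender != "Other"]
levels_before <- levels(kept)

# droplevels() discards levels that no longer occur.
kept <- droplevels(kept)
levels_after <- levels(kept)

print(levels_before)
print(levels_after)
```

For a whole data frame, `df <- droplevels(df)` would do the same for every factor column.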

The 'bmi' variable was read as categorical because its missing values are coded as the string "N/A", which R does not recognize as NA. We therefore convert them to NA. For consistency, we also convert the 'Unknown' category in 'smoking_status' to NA.

In [None]:
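# Recode the placeholder values "N/A" and "Unknown" as real NA values.
# Factor columns are handled through their levels, character columns directly;
# all other columns are returned unchanged.
df[] <- lapply(df, function(x) {
    if (is.factor(x)) {
        is.na(levels(x)) <- levels(x) %in% c("N/A", "Unknown")
    } else if (is.character(x)) {
        x[x %in% c("N/A", "Unknown")] <- NA
    }
    x
})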
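
To make the effect of the level-based recoding concrete, here is a sketch on a single invented factor. Setting a level to NA turns every observation with that level into NA and removes the level itself:

```r
# Invented example values, not rows from the dataset.
smoking <- factor(c("smokes", "Unknown", "never smoked", "Unknown"))

# Mark the "Unknown" level as missing.
is.na(levels(smoking)) <- levels(smoking) == "Unknown"

n_missing <- sum(is.na(smoking))
remaining_levels <- levels(smoking)

print(smoking)
print(n_missing)
print(remaining_levels)
```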

Now we can see which columns contain NA:

In [None]:
# TRUE for every column that contains at least one NA
sapply(df, anyNA)
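
Beyond a yes/no answer per column, the number and share of missing values is often more useful when deciding how to treat them. A sketch with `colSums()` and `colMeans()` on a made-up data frame; `toy` and its values are assumptions for illustration:

```r
# Invented example values, not rows from the dataset.
toy <- data.frame(
    age            = c(67, 61, 80),
    bmi            = c(36.6, NA, 32.5),
    smoking_status = c("smokes", NA, NA)
)

# Number of missing values per column.
na_counts <- colSums(is.na(toy))

# Percentage of missing values per column.
na_share <- round(100 * colMeans(is.na(toy)), 1)

print(na_counts)
print(na_share)
```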

In [None]:
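# Convert bmi to numeric. Going through as.character() matters for factors:
# as.numeric() on a factor would return the internal level codes instead.
df$bmi <- as.numeric(as.character(df$bmi))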
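
The reason for converting through `as.character()` first is easy to miss, so here is a short sketch on invented values showing what happens without it:

```r
# Invented example values stored as a factor, with one missing entry.
bmi <- factor(c("36.6", NA, "32.5", "28.1"))

# as.numeric() on a factor returns the internal level codes.
codes <- as.numeric(bmi)

# Converting to character first recovers the actual measurements.
values <- as.numeric(as.character(bmi))

print(codes)
print(values)
```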

In [None]:
summary(df)

## Age

Sample size (N)

In [None]:
NROW(df$age)

Mean

In [None]:
mean(df$age)

Median

In [None]:
median(df$age)

Mode

In [None]:
modus <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}
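
Note that `modus` returns a single value. When several values share the highest frequency, it returns the one that appears first in the data. The sketch below repeats the definition so it runs on its own and adds a hypothetical helper, `all_modes`, which is not part of the original notebook and returns every tied value:

```r
# Same definition as in the notebook, repeated so this block is self-contained.
modus <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Hypothetical helper (an assumption, not from the notebook):
# returns all values that share the highest frequency.
all_modes <- function(v) {
    uniqv <- unique(v)
    counts <- tabulate(match(v, uniqv))
    uniqv[counts == max(counts)]
}

x <- c(3, 1, 3, 2, 1)    # 3 and 1 both occur twice
first_mode <- modus(x)
every_mode <- all_modes(x)

print(first_mode)
print(every_mode)
```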

In [None]:
modus(df$age)

In [None]:
hist(df$age
    , xlab="Age"
    , ylab="Count"
    , main="Histogram of age"
    , col="lightblue")

Min

In [None]:
min(df$age)

Max

In [None]:
max(df$age)

Quantiles

In [None]:
quantile(df$age)

Boxplot of age

In [None]:
boxplot(df$age, horizontal=TRUE, xlab="Age", main="Boxplot of age", col="lightblue")
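
The quantiles and the boxplot are connected: with the default settings, points farther than 1.5 times the interquartile range from the box are drawn as outliers. A sketch on invented numbers that computes these fences by hand and compares the result with `boxplot.stats()`. Note that the boxplot uses hinges from `fivenum()`, which can differ slightly from `quantile()` for some sample sizes:

```r
# Invented example values with one obvious outlier.
x <- c(10, 12, 13, 15, 16, 18, 60)

q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Fences at 1.5 * IQR below Q1 and above Q3.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]

print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
print(boxplot.stats(x)$out)   # outliers as the boxplot itself would flag them
```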

Variance

In [None]:
var(df$age)

Standard deviation

In [None]:
sd(df$age)

In [None]:
library(moments)

Skewness coefficient

In [None]:
skewness(df$age)

Kurtosis

In [None]:
kurtosis(df$age)
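
To interpret these two numbers it helps to see how they can be computed. The sketch below uses base R only and the moment-based definitions: skewness as m3 / m2^(3/2) and kurtosis as m4 / m2^2, where mk is the k-th central moment divided by n. To our knowledge this matches what the `moments` package reports, including kurtosis without subtracting 3 (so a normal distribution gives a value near 3), but treat that as an assumption and check `?kurtosis`. The helper `central_moment` and the data are made up for illustration:

```r
# Hypothetical helper: k-th central moment with divisor n.
central_moment <- function(x, k) mean((x - mean(x))^k)

skew_manual <- function(x) central_moment(x, 3) / central_moment(x, 2)^(3 / 2)
kurt_manual <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

sym   <- c(1, 2, 3, 4, 5)     # symmetric around 3
right <- c(1, 1, 2, 3, 10)    # long right tail

skew_sym   <- skew_manual(sym)     # 0 for symmetric data
kurt_sym   <- kurt_manual(sym)
skew_right <- skew_manual(right)   # positive for a right tail

print(c(skew_sym = skew_sym, kurt_sym = kurt_sym, skew_right = skew_right))
```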

Are the data normally distributed? <br>
H0: The data come from a normal distribution. <br>
H1: The data do not come from a normal distribution.

Q-Q plot

In [None]:
qqnorm(df$age)
qqline(df$age, col = 2)

In [None]:
library(nortest)

In [None]:
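# shapiro.test() accepts at most 5000 observations, so only the first 5000 are tested.
# head() is used because R indexing starts at 1, not 0.
shap_test <- shapiro.test(head(df$age, 5000))
shap_test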
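
Testing only the first 5000 rows makes the result depend on the order of the rows in the file. One alternative is to test a random subsample instead. A sketch on simulated data; the exponential sample, the seed and the variable names are assumptions chosen only to make the example reproducible:

```r
set.seed(42)

# Simulated, deliberately non-normal data with more than 5000 values.
x <- rexp(6000)

# Draw a random subsample only when the vector exceeds the limit.
max_n <- 5000
x_sub <- if (length(x) > max_n) sample(x, max_n) else x

res <- shapiro.test(x_sub)
print(length(x_sub))
print(res$p.value)
```

With a fixed seed the subsample, and therefore the p-value, is the same on every run.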

In [None]:
cat("p-value: ", shap_test$p.value, "\n")
if (shap_test$p.value > 0.05) {
    print("Cannot reject H0")
} else print("Reject H0: the data are not normally distributed")

In [None]:
#Anderson-Darling normality test
ad_test <- ad.test(df$age)

In [None]:
cat("p-value: ", ad_test$p.value, "\n")
if (ad_test$p.value > 0.05) {
    print("Cannot reject H0")
} else print("Reject H0: the data are not normally distributed")
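
The same decision rule is applied after every test in this notebook, so it could be wrapped in a small function. A sketch; `report_normality` and its `alpha` parameter are hypothetical names, not part of the original notebook:

```r
# Hypothetical helper: turns a p-value into the decision about H0.
report_normality <- function(p_value, alpha = 0.05) {
    if (p_value > alpha) {
        "Cannot reject H0"
    } else {
        "Reject H0: the data are not normally distributed"
    }
}

print(report_normality(0.30))
print(report_normality(0.001))
print(report_normality(0.03, alpha = 0.01))
```

It would be called as `report_normality(shap_test$p.value)` or `report_normality(ad_test$p.value)`.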

## BMI

In [None]:
# Remove missing BMI values; the result is a plain numeric vector
bmi_no_na <- na.omit(df$bmi)
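
It is worth recording how many observations are dropped at this step, since every BMI statistic below is based on the reduced sample. A sketch on invented values, which also shows that `na.rm = TRUE` gives the same mean without creating a separate vector:

```r
# Invented example values with two missing entries.
bmi <- c(36.6, NA, 32.5, 28.1, NA)

bmi_clean <- na.omit(bmi)
n_removed <- length(bmi) - length(bmi_clean)

mean_clean <- mean(bmi_clean)
mean_na_rm <- mean(bmi, na.rm = TRUE)

print(n_removed)
print(c(mean_clean, mean_na_rm))
```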

Sample size (N)

In [None]:
NROW(bmi_no_na)

Mean

In [None]:
# bmi_no_na is a vector, so apply() would fail; call mean() directly
mean(bmi_no_na)

Median

In [None]:
median(bmi_no_na)

Mode

In [None]:
modus <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

In [None]:
modus(bmi_no_na)

In [None]:
hist(bmi_no_na
    , xlab="BMI"
    , ylab="Count"
    , main="Histogram of BMI"
    , col="lightblue")

Min

In [None]:
min(bmi_no_na)

Max

In [None]:
max(bmi_no_na)

Quantiles

In [None]:
quantile(bmi_no_na)

Boxplot of BMI

In [None]:
boxplot(bmi_no_na, horizontal=TRUE, xlab="BMI", main="Boxplot of BMI", col="lightblue")

Variance

In [None]:
var(bmi_no_na)

Standard deviation

In [None]:
sd(bmi_no_na)

In [None]:
library(moments)

Skewness coefficient

In [None]:
skewness(bmi_no_na)

Kurtosis

In [None]:
kurtosis(bmi_no_na)

Are the data normally distributed? <br>
H0: The data come from a normal distribution. <br>
H1: The data do not come from a normal distribution.

Q-Q plot

In [None]:
qqnorm(bmi_no_na)
qqline(bmi_no_na, col = 2)

In [None]:
library(nortest)

In [None]:
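# shapiro.test() accepts at most 5000 observations, so at most the first 5000 are tested.
# head() avoids the 0-based index and never reads past the end of the vector.
shap_test <- shapiro.test(head(bmi_no_na, 5000))
shap_test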

In [None]:
cat("p-value: ", shap_test$p.value, "\n")
if (shap_test$p.value > 0.05) {
    print("Cannot reject H0")
} else print("Reject H0: the data are not normally distributed")

In [None]:
#Anderson-Darling normality test
ad_test <- ad.test(bmi_no_na)

In [None]:
cat("p-value: ", ad_test$p.value, "\n")
if (ad_test$p.value > 0.05) {
    print("Cannot reject H0")
} else print("Reject H0: the data are not normally distributed")