# **Phishing URL detection: chi-squared tests on the synthetic dataset**

**RQ1:** What characteristics provide the greatest discriminatory information for identifying phishing sites compared to legitimate ones?  

**RQ2:** Do synthetic data generated by Large Language Models preserve the same statistical properties as real data?

**RQ3:** What are the main differences between synthetic data generated by LLMs and real data in regression and clustering contexts?

**RQ4:** Can the features generated by LLMs be mapped to known statistical distributions?

<br>

**Author:** Raffaele Aurucci

## **Reading filtered dataset**

In [1]:
download.file("https://drive.google.com/uc?id=1Sd9obB-lHiCWhDgXsmR6rtXpeupCnWYX&export=download", "Phishing_URL_Synthetic_Dataset_3_Filtered.csv")

In [2]:
df <- read.csv('Phishing_URL_Synthetic_Dataset_3_Filtered.csv', sep = ",")

In [3]:
str(df)

'data.frame':	10124 obs. of  21 variables:
 $ URLLength            : int  45 60 28 75 90 55 42 100 39 62 ...
 $ DomainLength         : int  25 30 18 40 50 35 22 60 27 33 ...
 $ TLDEncoding          : num  0.32 0.2 0.4 0.1 0.25 0.35 0.3 0.15 0.38 0.27 ...
 $ NoOfLettersInURL     : int  30 35 12 45 40 28 20 50 15 32 ...
 $ NoOfDigitsInURL      : int  4 5 2 10 6 4 2 8 3 5 ...
 $ NoOfSpecialCharsInURL: int  5 4 1 6 2 3 2 8 2 5 ...
 $ IsHTTPS              : int  0 0 0 0 0 0 0 0 1 0 ...
 $ LineOfCode           : int  300 210 100 400 150 320 250 500 80 90 ...
 $ LargestLineLength    : int  2000 1800 1500 3000 2500 2200 1000 4500 900 2000 ...
 $ HasTitle             : int  0 0 1 0 1 0 0 0 0 0 ...
 $ NoOfReference        : int  5 2 8 0 3 4 2 10 1 0 ...
 $ DomainTitleMatchScore: num  10.5 0 15 0 5.5 20 0 25.4 10 0 ...
 $ URLTitleMatchScore   : num  25.4 0 22 0 10.2 35.7 0 30.1 12 0 ...
 $ HasFavicon           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Robots               : int  0 0 0 0 0 0 0 0 0 0 ...
 

In [4]:
set.seed(42)

## **Two-tailed chi-squared test for normal distribution**

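The two-tailed chi-squared goodness-of-fit test is used to assess whether a sample is consistent with a normal distribution. The null hypothesis (H₀) states that the data follow a normal distribution whose mean and standard deviation are estimated from the sample, while the alternative hypothesis (H₁) states that they do not. For each feature, the real line is split into r = 5 intervals that are equiprobable under the fitted normal distribution (probability 0.2 each), and the observed interval counts are compared with the expected count n × 0.2.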
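
In symbols, the computation carried out in each cell of this section can be written as follows. The notation is ours and mirrors the variable names in the code: $n_i$ is the observed count in the $i$-th interval (`nint`), $r = 5$ is the number of intervals, $p_i$ is the probability of each interval under the fitted normal distribution, and $k = 2$ is the number of estimated parameters.

$$
\chi^2 = \sum_{i=1}^{r} \frac{(n_i - n\,p_i)^2}{n\,p_i}, \qquad p_i = \frac{1}{r} = 0.2
$$

$$
H_0 \text{ is not rejected} \iff \chi^2_{\alpha/2,\; r-k-1} < \chi^2 < \chi^2_{1-\alpha/2,\; r-k-1}
$$

With $\alpha = 0.05$ and $r - k - 1 = 2$ degrees of freedom, the two bounds are the values printed by every cell: approximately 0.0506 and 7.3778.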

The critical interval is bounded by the α/2 and 1 − α/2 quantiles of the chi-squared distribution with r − k − 1 = 2 degrees of freedom, where k = 2 is the number of estimated parameters (mean and standard deviation) and α = 0.05. If the test statistic falls within this interval, the null hypothesis cannot be rejected, and the data are consistent with normality. If it falls outside the interval, the null hypothesis is rejected, indicating a deviation from the normal distribution.
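
Since the same steps are repeated for every feature, they can be collected in one function. The following is a minimal sketch, not part of the original notebook: the name `chi2_normal_check` and its return structure are assumptions made for illustration, and `findInterval()` is used in place of the explicit `which()` comparisons (it produces the same left-closed intervals).

```r
# Hypothetical helper (not in the notebook): chi-squared goodness-of-fit
# check of a numeric sample against a normal distribution fitted to it.
chi2_normal_check <- function(x, r = 5, alpha = 0.05) {
  n <- length(x)
  m <- mean(x)
  d <- sd(x)

  # Interior cut points: quantiles of N(m, d) that split the real line
  # into r intervals of equal probability 1 / r
  cuts <- qnorm(seq_len(r - 1) / r, mean = m, sd = d)

  # Observed counts per interval; findInterval() uses left-closed
  # intervals [a, b), matching the comparisons used in the cells
  nint <- tabulate(findInterval(x, cuts) + 1, nbins = r)

  # Test statistic with expected count n / r in every interval
  chi2 <- sum((nint - n / r)^2 / (n / r))

  # k = 2 estimated parameters: mean and standard deviation
  k <- 2
  lower <- qchisq(alpha / 2, df = r - k - 1)
  upper <- qchisq(1 - alpha / 2, df = r - k - 1)

  list(counts = nint, chi2 = chi2, lower = lower, upper = upper,
       reject = chi2 < lower || chi2 > upper)
}

# Small synthetic, strongly skewed sample used only to exercise the helper
res <- chi2_normal_check(c(rep(1, 900), rep(1000, 100)))
cat("Counts:", res$counts, "| chi2:", res$chi2, "| reject H0:", res$reject, "\n")
```

Applied to a column of the dataset, for example `chi2_normal_check(df$URLLength)`, the helper should reproduce the counts and statistic printed by the corresponding cell, since it follows the same steps.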


### Feature **URLLength**

In [5]:
data <- df$URLLength

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Mean of the sample
m <- mean(data)
cat("Sample mean (m):", m, "\n")

# Standard deviation of the sample
d <- sd(data)
cat("Sample standard deviation (d):", d, "\n")

# Using quantiles of the normal distribution to determine subsets
a <- numeric(4)
for (i in 1:4)
  a[i] <- qnorm(0.2 * i, mean = m, sd = d)

cat("Quantiles (a):", a, "\n")

# Number of intervals
r <- 5

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculating the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which((data >= a[1]) & (data < a[2])))
nint[3] <- length(which((data >= a[2]) & (data < a[3])))
nint[4] <- length(which((data >= a[3]) & (data < a[4])))
nint[5] <- length(which(data >= a[4]))

cat("Frequencies of intervals (nint):", nint, "\n")

# Calculating the chi-squared test value
chi2 <- sum(((nint - n * 0.2) / sqrt(n * 0.2))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specifying k and the significance level alpha
k <- 2
alpha <- 0.05

# Calculating the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)

cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 10124 
Sample mean (m): 172.6041 
Sample standard deviation (d): 298.0365 
Quantiles (a): -78.22974 97.09742 248.1108 423.438 
Frequencies of intervals (nint): 0 6819 1960 501 844 
Chi-squared test value (chi2): 15213.66 
Lower critical value: 0.05063562 
Upper critical value: 7.377759 
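
As a cross-check of the manual computation, the statistic can also be obtained with R's built-in `chisq.test()`. The sketch below reuses the interval counts printed above for URLLength. Note that `chisq.test()` works with r − 1 = 4 degrees of freedom and does not account for the two estimated parameters, so only its statistic, not its p-value, is comparable with the procedure used in this notebook.

```r
# Interval counts copied from the URLLength output above
nint <- c(0, 6819, 1960, 501, 844)

# Built-in goodness-of-fit test against five equiprobable intervals
fit <- chisq.test(x = nint, p = rep(0.2, 5))

# Should agree with the manually computed statistic (about 15213.66)
chi2_builtin <- unname(fit$statistic)
cat("Chi-squared statistic from chisq.test():", chi2_builtin, "\n")

# Degrees of freedom used by chisq.test(): r - 1, not r - k - 1
cat("Degrees of freedom used by chisq.test():", unname(fit$parameter), "\n")
```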


### Feature **DomainLength**

In [6]:
data <- df$DomainLength

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Mean of the sample
m <- mean(data)
cat("Sample mean (m):", m, "\n")

# Standard deviation of the sample
d <- sd(data)
cat("Sample standard deviation (d):", d, "\n")

# Using quantiles of the normal distribution to determine subsets
a <- numeric(4)
for (i in 1:4)
  a[i] <- qnorm(0.2 * i, mean = m, sd = d)

cat("Quantiles (a):", a, "\n")

# Number of intervals
r <- 5

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculating the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which((data >= a[1]) & (data < a[2])))
nint[3] <- length(which((data >= a[2]) & (data < a[3])))
nint[4] <- length(which((data >= a[3]) & (data < a[4])))
nint[5] <- length(which(data >= a[4]))

cat("Frequencies of intervals (nint):", nint, "\n")

# Calculating the chi-squared test value
chi2 <- sum(((nint - n * 0.2) / sqrt(n * 0.2))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specifying k and the significance level alpha
k <- 2
alpha <- 0.05

# Calculating the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)

cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 10124 
Sample mean (m): 39.01146 
Sample standard deviation (d): 34.58139 
Quantiles (a): 9.907022 30.25036 47.77255 68.11589 
Frequencies of intervals (nint): 370 4743 2793 1335 883 
Chi-squared test value (chi2): 6171.788 
Lower critical value: 0.05063562 
Upper critical value: 7.377759 


### Feature **TLDEncoding**

In [7]:
data <- df$TLDEncoding

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Mean of the sample
m <- mean(data)
cat("Sample mean (m):", m, "\n")

# Standard deviation of the sample
d <- sd(data)
cat("Sample standard deviation (d):", d, "\n")

# Using quantiles of the normal distribution to determine subsets
a <- numeric(4)
for (i in 1:4)
  a[i] <- qnorm(0.2 * i, mean = m, sd = d)

cat("Quantiles (a):", a, "\n")

# Number of intervals
r <- 5

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculating the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which((data >= a[1]) & (data < a[2])))
nint[3] <- length(which((data >= a[2]) & (data < a[3])))
nint[4] <- length(which((data >= a[3]) & (data < a[4])))
nint[5] <- length(which(data >= a[4]))

cat("Frequencies of intervals (nint):", nint, "\n")

# Calculating the chi-squared test value
chi2 <- sum(((nint - n * 0.2) / sqrt(n * 0.2))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specifying k and the significance level alpha
k <- 2
alpha <- 0.05

# Calculating the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)

cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 10124 
Sample mean (m): 0.5479511 
Sample standard deviation (d): 0.247687 
Quantiles (a): 0.3394924 0.4852003 0.6107019 0.7564098 
Frequencies of intervals (nint): 2150 1596 1869 2222 2287 
Chi-squared test value (chi2): 163.6976 
Lower critical value: 0.05063562 
Upper critical value: 7.377759 


### Feature **NoOfLettersInURL**

In [8]:
data <- df$NoOfLettersInURL

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Mean of the sample
m <- mean(data)
cat("Sample mean (m):", m, "\n")

# Standard deviation of the sample
d <- sd(data)
cat("Sample standard deviation (d):", d, "\n")

# Using quantiles of the normal distribution to determine subsets
a <- numeric(4)
for (i in 1:4)
  a[i] <- qnorm(0.2 * i, mean = m, sd = d)

cat("Quantiles (a):", a, "\n")

# Number of intervals
r <- 5

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculating the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which((data >= a[1]) & (data < a[2])))
nint[3] <- length(which((data >= a[2]) & (data < a[3])))
nint[4] <- length(which((data >= a[3]) & (data < a[4])))
nint[5] <- length(which(data >= a[4]))

cat("Frequencies of intervals (nint):", nint, "\n")

# Calculating the chi-squared test value
chi2 <- sum(((nint - n * 0.2) / sqrt(n * 0.2))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specifying k and the significance level alpha
k <- 2
alpha <- 0.05

# Calculating the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)

cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 10124 
Sample mean (m): 37.43965 
Sample standard deviation (d): 34.07063 
Quantiles (a): 8.76508 28.80795 46.07134 66.11422 
Frequencies of intervals (nint): 375 4730 2678 1169 1172 
Chi-squared test value (chi2): 5890.102 
Lower critical value: 0.05063562 
Upper critical value: 7.377759 


### Feature **NoOfDigitsInURL**

In [10]:
data <- df$NoOfDigitsInURL

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Mean of the sample
m <- mean(data)
cat("Sample mean (m):", m, "\n")

# Standard deviation of the sample
d <- sd(data)
cat("Sample standard deviation (d):", d, "\n")

# Using quantiles of the normal distribution to determine subsets
a <- numeric(4)
for (i in 1:4)
  a[i] <- qnorm(0.2 * i, mean = m, sd = d)

cat("Quantiles (a):", a, "\n")

# Number of intervals
r <- 5

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculating the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which((data >= a[1]) & (data < a[2])))
nint[3] <- length(which((data >= a[2]) & (data < a[3])))
nint[4] <- length(which((data >= a[3]) & (data < a[4])))
nint[5] <- length(which(data >= a[4]))

cat("Frequencies of intervals (nint):", nint, "\n")

# Calculating the chi-squared test value
chi2 <- sum(((nint - n * 0.2) / sqrt(n * 0.2))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specifying k and the significance level alpha
k <- 2
alpha <- 0.05

# Calculating the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)

cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 10124 
Sample mean (m): 9.178289 
Sample standard deviation (d): 24.76018 
Quantiles (a): -11.66041 2.905369 15.45121 30.01698 
Frequencies of intervals (nint): 0 5084 3768 558 714 
Chi-squared test value (chi2): 10058.75 
Lower critical value: 0.05063562 
Upper critical value: 7.377759 


### Feature **NoOfSpecialCharsInURL**

In [11]:
data <- df$NoOfSpecialCharsInURL

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Mean of the sample
m <- mean(data)
cat("Sample mean (m):", m, "\n")

# Standard deviation of the sample
d <- sd(data)
cat("Sample standard deviation (d):", d, "\n")

# Using quantiles of the normal distribution to determine subsets
a <- numeric(4)
for (i in 1:4)
  a[i] <- qnorm(0.2 * i, mean = m, sd = d)

cat("Quantiles (a):", a, "\n")

# Number of intervals
r <- 5

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculating the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which((data >= a[1]) & (data < a[2])))
nint[3] <- length(which((data >= a[2]) & (data < a[3])))
nint[4] <- length(which((data >= a[3]) & (data < a[4])))
nint[5] <- length(which(data >= a[4]))

cat("Frequencies of intervals (nint):", nint, "\n")

# Calculating the chi-squared test value
chi2 <- sum(((nint - n * 0.2) / sqrt(n * 0.2))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specifying k and the significance level alpha
k <- 2
alpha <- 0.05

# Calculating the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)

cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 10124 
Sample mean (m): 7.30482 
Sample standard deviation (d): 23.52284 
Quantiles (a): -12.4925 1.345377 13.26426 27.10214 
Frequencies of intervals (nint): 0 2971 6522 227 404 
Chi-squared test value (chi2): 15349.17 
Lower critical value: 0.05063562 
Upper critical value: 7.377759 


### Feature **LineOfCode**

In [12]:
data <- df$LineOfCode

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Mean of the sample
m <- mean(data)
cat("Sample mean (m):", m, "\n")

# Standard deviation of the sample
d <- sd(data)
cat("Sample standard deviation (d):", d, "\n")

# Using quantiles of the normal distribution to determine subsets
a <- numeric(4)
for (i in 1:4)
  a[i] <- qnorm(0.2 * i, mean = m, sd = d)

cat("Quantiles (a):", a, "\n")

# Number of intervals
r <- 5

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculating the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which((data >= a[1]) & (data < a[2])))
nint[3] <- length(which((data >= a[2]) & (data < a[3])))
nint[4] <- length(which((data >= a[3]) & (data < a[4])))
nint[5] <- length(which(data >= a[4]))

cat("Frequencies of intervals (nint):", nint, "\n")

# Calculating the chi-squared test value
chi2 <- sum(((nint - n * 0.2) / sqrt(n * 0.2))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specifying k and the significance level alpha
k <- 2
alpha <- 0.05

# Calculating the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)

cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 10124 
Sample mean (m): 1999.729 
Sample standard deviation (d): 2892.78 
Quantiles (a): -434.8957 1266.852 2732.607 4434.354 
Frequencies of intervals (nint): 0 7238 823 403 1660 
Chi-squared test value (chi2): 17525.14 
Lower critical value: 0.05063562 
Upper critical value: 7.377759 


### Feature **LargestLineLength**

In [13]:
data <- df$LargestLineLength

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Mean of the sample
m <- mean(data)
cat("Sample mean (m):", m, "\n")

# Standard deviation of the sample
d <- sd(data)
cat("Sample standard deviation (d):", d, "\n")

# Using quantiles of the normal distribution to determine subsets
a <- numeric(4)
for (i in 1:4)
  a[i] <- qnorm(0.2 * i, mean = m, sd = d)

cat("Quantiles (a):", a, "\n")

# Number of intervals
r <- 5

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculating the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which((data >= a[1]) & (data < a[2])))
nint[3] <- length(which((data >= a[2]) & (data < a[3])))
nint[4] <- length(which((data >= a[3]) & (data < a[4])))
nint[5] <- length(which(data >= a[4]))

cat("Frequencies of intervals (nint):", nint, "\n")

# Calculating the chi-squared test value
chi2 <- sum(((nint - n * 0.2) / sqrt(n * 0.2))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specifying k and the significance level alpha
k <- 2
alpha <- 0.05

# Calculating the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)

cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 10124 
Sample mean (m): 7421.699 
Sample standard deviation (d): 11711.79 
Quantiles (a): -2435.189 4454.552 10388.85 17278.59 
Frequencies of intervals (nint): 0 7820 391 228 1685 
Chi-squared test value (chi2): 21581.1 
Lower critical value: 0.05063562 
Upper critical value: 7.377759 


### Feature **NoOfReference**

In [14]:
data <- df$NoOfReference

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Mean of the sample
m <- mean(data)
cat("Sample mean (m):", m, "\n")

# Standard deviation of the sample
d <- sd(data)
cat("Sample standard deviation (d):", d, "\n")

# Using quantiles of the normal distribution to determine subsets
a <- numeric(4)
for (i in 1:4)
  a[i] <- qnorm(0.2 * i, mean = m, sd = d)

cat("Quantiles (a):", a, "\n")

# Number of intervals
r <- 5

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculating the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which((data >= a[1]) & (data < a[2])))
nint[3] <- length(which((data >= a[2]) & (data < a[3])))
nint[4] <- length(which((data >= a[3]) & (data < a[4])))
nint[5] <- length(which(data >= a[4]))

cat("Frequencies of intervals (nint):", nint, "\n")

# Calculating the chi-squared test value
chi2 <- sum(((nint - n * 0.2) / sqrt(n * 0.2))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specifying k and the significance level alpha
k <- 2
alpha <- 0.05

# Calculating the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)

cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 10124 
Sample mean (m): 30.41318 
Sample standard deviation (d): 49.43622 
Quantiles (a): -11.1934 17.88865 42.9377 72.01975 
Frequencies of intervals (nint): 0 5890 2529 884 821 
Chi-squared test value (chi2): 10887.18 
Lower critical value: 0.05063562 
Upper critical value: 7.377759 


### Feature **URLTitleMatchScore**

In [15]:
data <- df$URLTitleMatchScore

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Mean of the sample
m <- mean(data)
cat("Sample mean (m):", m, "\n")

# Standard deviation of the sample
d <- sd(data)
cat("Sample standard deviation (d):", d, "\n")

# Using quantiles of the normal distribution to determine subsets
a <- numeric(4)
for (i in 1:4)
  a[i] <- qnorm(0.2 * i, mean = m, sd = d)

cat("Quantiles (a):", a, "\n")

# Number of intervals
r <- 5

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculating the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which((data >= a[1]) & (data < a[2])))
nint[3] <- length(which((data >= a[2]) & (data < a[3])))
nint[4] <- length(which((data >= a[3]) & (data < a[4])))
nint[5] <- length(which(data >= a[4]))

cat("Frequencies of intervals (nint):", nint, "\n")

# Calculating the chi-squared test value
chi2 <- sum(((nint - n * 0.2) / sqrt(n * 0.2))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specifying k and the significance level alpha
k <- 2
alpha <- 0.05

# Calculating the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)

cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 10124 
Sample mean (m): 43.31626 
Sample standard deviation (d): 27.57864 
Quantiles (a): 20.10549 36.32929 50.30322 66.52702 
Frequencies of intervals (nint): 2894 1791 1292 1811 2336 
Chi-squared test value (chi2): 735.7382 
Lower critical value: 0.05063562 
Upper critical value: 7.377759 


### Feature **DomainTitleMatchScore**

In [16]:
data <- df$DomainTitleMatchScore

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Mean of the sample
m <- mean(data)
cat("Sample mean (m):", m, "\n")

# Standard deviation of the sample
d <- sd(data)
cat("Sample standard deviation (d):", d, "\n")

# Using quantiles of the normal distribution to determine subsets
a <- numeric(4)
for (i in 1:4)
  a[i] <- qnorm(0.2 * i, mean = m, sd = d)

cat("Quantiles (a):", a, "\n")

# Number of intervals
r <- 5

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculating the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which((data >= a[1]) & (data < a[2])))
nint[3] <- length(which((data >= a[2]) & (data < a[3])))
nint[4] <- length(which((data >= a[3]) & (data < a[4])))
nint[5] <- length(which(data >= a[4]))

cat("Frequencies of intervals (nint):", nint, "\n")

# Calculating the chi-squared test value
chi2 <- sum(((nint - n * 0.2) / sqrt(n * 0.2))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specifying k and the significance level alpha
k <- 2
alpha <- 0.05

# Calculating the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)

cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 10124 
Sample mean (m): 45.12907 
Sample standard deviation (d): 29.82302 
Quantiles (a): 20.02938 37.5735 52.68465 70.22876 
Frequencies of intervals (nint): 3062 1354 1815 1632 2261 
Chi-squared test value (chi2): 879.0275 
Lower critical value: 0.05063562 
Upper critical value: 7.377759 


### Feature **NoOfExternalFiles**

In [17]:
data <- df$NoOfExternalFiles

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Mean of the sample
m <- mean(data)
cat("Sample mean (m):", m, "\n")

# Standard deviation of the sample
d <- sd(data)
cat("Sample standard deviation (d):", d, "\n")

# Using quantiles of the normal distribution to determine subsets
a <- numeric(4)
for (i in 1:4)
  a[i] <- qnorm(0.2 * i, mean = m, sd = d)

cat("Quantiles (a):", a, "\n")

# Number of intervals
r <- 5

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculating the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which((data >= a[1]) & (data < a[2])))
nint[3] <- length(which((data >= a[2]) & (data < a[3])))
nint[4] <- length(which((data >= a[3]) & (data < a[4])))
nint[5] <- length(which(data >= a[4]))

cat("Frequencies of intervals (nint):", nint, "\n")

# Calculating the chi-squared test value
chi2 <- sum(((nint - n * 0.2) / sqrt(n * 0.2))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specifying k and the significance level alpha
k <- 2
alpha <- 0.05

# Calculating the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)

cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 10124 
Sample mean (m): 28.38996 
Sample standard deviation (d): 29.59371 
Quantiles (a): 3.483273 20.89248 35.88744 53.29666 
Frequencies of intervals (nint): 301 4836 2764 1186 1037 
Chi-squared test value (chi2): 6469.816 
Lower critical value: 0.05063562 
Upper critical value: 7.377759 
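
The decisions for the twelve numeric features can be summarized in one table. The sketch below is not part of the original notebook: it copies the test statistics from the outputs above and applies the two-tailed decision rule with r − k − 1 = 2 degrees of freedom. The names `chi2_values` and `reject_h0` are ours.

```r
# Test statistics copied from the cell outputs above
chi2_values <- c(
  URLLength = 15213.66, DomainLength = 6171.788, TLDEncoding = 163.6976,
  NoOfLettersInURL = 5890.102, NoOfDigitsInURL = 10058.75,
  NoOfSpecialCharsInURL = 15349.17, LineOfCode = 17525.14,
  LargestLineLength = 21581.1, NoOfReference = 10887.18,
  URLTitleMatchScore = 735.7382, DomainTitleMatchScore = 879.0275,
  NoOfExternalFiles = 6469.816
)

# Critical interval for r = 5 intervals and k = 2 estimated parameters
alpha <- 0.05
dof <- 5 - 2 - 1
lower <- qchisq(alpha / 2, df = dof)
upper <- qchisq(1 - alpha / 2, df = dof)

# H0 is rejected when the statistic lies outside (lower, upper)
reject_h0 <- chi2_values < lower | chi2_values > upper
print(data.frame(chi2 = chi2_values, reject_h0 = reject_h0))
```

Every statistic lies far above the upper critical value (the smallest, for TLDEncoding, is about 163.7 against 7.38), so the hypothesis of normality is rejected for all twelve features.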


## **Two-tailed chi-squared test for binomial distribution**

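The two-tailed chi-squared test is used to assess whether a binary (0/1) feature is consistent with a binomial distribution with a single trial, i.e. a Bernoulli distribution. The null hypothesis (H₀) states that the data follow this distribution with the class proportions printed as "Intervals (a)", while the alternative hypothesis (H₁) asserts that they do not. The two proportions are estimated from the sample and rounded to three decimal places.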

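The critical interval uses r − k − 1 = 1 degree of freedom, with r = 2 classes, k = 0, and α = 0.05. If the chi-squared test statistic falls within the critical interval, the null hypothesis cannot be rejected, suggesting that the data distribution is consistent with a binomial distribution. Conversely, if the test statistic falls outside the critical interval, the null hypothesis is rejected, indicating a potential deviation from the binomial distribution. Because the expected proportions are taken from the same sample, the statistic differs from zero only through the three-decimal rounding of those proportions.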
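
The steps repeated for each binary feature can be sketched as a single function. This is an illustration under our own naming (`chi2_binary_check` is hypothetical and not part of the notebook). It counts the two classes directly with `x == 0` and `x == 1`, and it is exercised here on a vector rebuilt from the IsHTTPS counts reported in the output that follows.

```r
# Hypothetical helper (not in the notebook): chi-squared check of a 0/1
# feature against proportions estimated from the same sample.
chi2_binary_check <- function(x, alpha = 0.05, digits = 3) {
  n <- length(x)

  # Observed counts of 0s and 1s
  nint <- c(sum(x == 0), sum(x == 1))

  # Expected proportions: sample proportion of 0s, rounded as in the cells
  p0 <- round(nint[1] / n, digits = digits)
  a <- c(p0, 1 - p0)

  # Test statistic against the expected counts n * a
  chi2 <- sum((nint - n * a)^2 / (n * a))

  # r = 2 classes and k = 0, as in the notebook cells
  r <- 2
  k <- 0
  lower <- qchisq(alpha / 2, df = r - k - 1)
  upper <- qchisq(1 - alpha / 2, df = r - k - 1)

  list(p = a, counts = nint, chi2 = chi2, lower = lower, upper = upper,
       reject = chi2 < lower || chi2 > upper)
}

# Vector rebuilt from the IsHTTPS counts: 4280 zeros and 5844 ones
res <- chi2_binary_check(rep(c(0, 1), times = c(4280, 5844)))
cat("Proportions:", res$p, "| chi2:", res$chi2, "| reject H0:", res$reject, "\n")
```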


### Feature **IsHTTPS**

In [18]:
data <- df$IsHTTPS

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Calculate the absolute frequencies for interval 0 and interval 1
freq <- table(data)

# Generate intervals for the chi-square test
a <- numeric(2)
a[1] <- round(freq[1] / n, digits = 3)
a[2] <- 1 - a[1]
cat("Intervals (a):", a, "\n")

# Number of intervals
r <- 2

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Count the observations in each class (0 and 1)
nint[1] <- sum(data == 0)
nint[2] <- sum(data == 1)
cat("Frequencies of intervals (nint):", nint, "\n")

# Calculate the chi-square test value
chi2 <- sum(((nint - n * a) / sqrt(n * a))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specify k and the significance level alpha
k <- 0
alpha <- 0.05

# Calculate the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)
cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 10124 
Intervals (a): 0.423 0.577 
Frequencies of intervals (nint): 4280 5844 
Chi-squared test value (chi2): 0.002433171 
Lower critical value: 0.0009820691 
Upper critical value: 5.023886 


### Feature **HasTitle**

In [19]:
data <- df$HasTitle

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Calculate the absolute frequencies for interval 0 and interval 1
freq <- table(data)

# Generate intervals for the chi-square test
a <- numeric(2)
a[1] <- round(freq[1] / n, digits = 3)
a[2] <- 1 - a[1]
cat("Intervals (a):", a, "\n")

# Number of intervals
r <- 2

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Count the observations in each class (0 and 1)
nint[1] <- sum(data == 0)
nint[2] <- sum(data == 1)
cat("Frequencies of intervals (nint):", nint, "\n")

# Calculate the chi-square test value
chi2 <- sum(((nint - n * a) / sqrt(n * a))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specify k and the significance level alpha
k <- 0
alpha <- 0.05

# Calculate the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)
cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 10124 
Intervals (a): 0.35 0.65 
Frequencies of intervals (nint): 3547 6577 
Chi-squared test value (chi2): 0.005626929 
Lower critical value: 0.0009820691 
Upper critical value: 5.023886 


### Feature **HasFavicon**

In [20]:
data <- df$HasFavicon

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Calculate the absolute frequencies for interval 0 and interval 1
freq <- table(data)

# Generate intervals for the chi-square test
a <- numeric(2)
a[1] <- round(freq[1] / n, digits = 3)
a[2] <- 1 - a[1]
cat("Intervals (a):", a, "\n")

# Number of intervals
r <- 2

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Count the observations in each class (0 and 1)
nint[1] <- sum(data == 0)
nint[2] <- sum(data == 1)
cat("Frequencies of intervals (nint):", nint, "\n")

# Calculate the chi-square test value
chi2 <- sum(((nint - n * a) / sqrt(n * a))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specify k and the significance level alpha
k <- 0
alpha <- 0.05

# Calculate the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)
cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 10124 
Intervals (a): 0.575 0.425 
Frequencies of intervals (nint): 5826 4298 
Chi-squared test value (chi2): 0.008928671 
Lower critical value: 0.0009820691 
Upper critical value: 5.023886 


### Feature **IsResponsive**

In [21]:
data <- df$IsResponsive

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Calculate the absolute frequencies for interval 0 and interval 1
freq <- table(data)

# Generate intervals for the chi-square test
a <- numeric(2)
a[1] <- round(freq[1] / n, digits = 3)
a[2] <- 1 - a[1]
cat("Intervals (a):", a, "\n")

# Number of intervals
r <- 2

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Count the observations in each class (0 and 1)
nint[1] <- sum(data == 0)
nint[2] <- sum(data == 1)
cat("Frequencies of intervals (nint):", nint, "\n")

# Calculate the chi-square test value
chi2 <- sum(((nint - n * a) / sqrt(n * a))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specify k and the significance level alpha
k <- 0
alpha <- 0.05

# Calculate the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)
cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 10124 
Intervals (a): 0.508 0.492 
Frequencies of intervals (nint): 5144 4980 
Chi-squared test value (chi2): 0.0004015504 
Lower critical value: 0.0009820691 
Upper critical value: 5.023886 


### Feature **Robots**

In [22]:
data <- df$Robots

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Calculate the absolute frequencies for interval 0 and interval 1
freq <- table(data)

# Generate intervals for the chi-square test
a <- numeric(2)
a[1] <- round(freq[1] / n, digits = 3)
a[2] <- 1 - a[1]
cat("Intervals (a):", a, "\n")

# Number of intervals
r <- 2

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Count the observations in each class (0 and 1)
nint[1] <- sum(data == 0)
nint[2] <- sum(data == 1)
cat("Frequencies of intervals (nint):", nint, "\n")

# Calculate the chi-square test value
chi2 <- sum(((nint - n * a) / sqrt(n * a))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specify k and the significance level alpha
k <- 0
alpha <- 0.05

# Calculate the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)
cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 10124 
Intervals (a): 0.608 0.392 
Frequencies of intervals (nint): 6156 3968 
Chi-squared test value (chi2): 0.0001532023 
Lower critical value: 0.0009820691 
Upper critical value: 5.023886 


### Feature **HasSocialNet**

In [23]:
data <- df$HasSocialNet

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Calculate the absolute frequencies for interval 0 and interval 1
freq <- table(data)

# Generate intervals for the chi-square test
a <- numeric(2)
a[1] <- round(freq[1] / n, digits = 3)
a[2] <- 1 - a[1]
cat("Intervals (a):", a, "\n")

# Number of intervals
r <- 2

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Count the observations in each class (0 and 1)
nint[1] <- sum(data == 0)
nint[2] <- sum(data == 1)
cat("Frequencies of intervals (nint):", nint, "\n")

# Calculate the chi-square test value
chi2 <- sum(((nint - n * a) / sqrt(n * a))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specify k and the significance level alpha
k <- 0
alpha <- 0.05

# Calculate the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)
cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 10124 
Intervals (a): 0.549 0.451 
Frequencies of intervals (nint): 5559 4565 
Chi-squared test value (chi2): 0.0003405986 
Lower critical value: 0.0009820691 
Upper critical value: 5.023886 


### Feature **HasDescription**

In [24]:
data <- df$HasDescription

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Calculate the absolute frequencies for interval 0 and interval 1
freq <- table(data)

# Generate intervals for the chi-square test
a <- numeric(2)
a[1] <- round(freq[1] / n, digits = 3)
a[2] <- 1 - a[1]
cat("Intervals (a):", a, "\n")

# Number of intervals
r <- 2

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Count the observations in each class (0 and 1)
nint[1] <- sum(data == 0)
nint[2] <- sum(data == 1)
cat("Frequencies of intervals (nint):", nint, "\n")

# Calculate the chi-square test value
chi2 <- sum(((nint - n * a) / sqrt(n * a))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specify k and the significance level alpha
k <- 0
alpha <- 0.05

# Calculate the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)
cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 10124 
Intervals (a): 0.585 0.415 
Frequencies of intervals (nint): 5924 4200 
Chi-squared test value (chi2): 0.0008672606 
Lower critical value: 0.0009820691 
Upper critical value: 5.023886 


### Feature **HasCopyrightInfo**

In [25]:
data <- df$HasCopyrightInfo

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Calculate the absolute frequencies for interval 0 and interval 1
freq <- table(data)

# Generate intervals for the chi-square test
a <- numeric(2)
a[1] <- round(freq[1] / n, digits = 3)
a[2] <- 1 - a[1]
cat("Intervals (a):", a, "\n")

# Number of intervals
r <- 2

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Count the observations in each class (0 and 1)
nint[1] <- sum(data == 0)
nint[2] <- sum(data == 1)
cat("Frequencies of intervals (nint):", nint, "\n")

# Calculate the chi-square test value
chi2 <- sum(((nint - n * a) / sqrt(n * a))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specify k and the significance level alpha
k <- 0
alpha <- 0.05

# Calculate the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)
cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 10124 
Intervals (a): 0.664 0.336 
Frequencies of intervals (nint): 6719 3405 
Chi-squared test value (chi2): 0.004927114 
Lower critical value: 0.0009820691 
Upper critical value: 5.023886 
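
The outcomes for the eight binary features can be compared side by side. The sketch below is not part of the original notebook: it copies the statistics from the outputs above and checks each one against the critical interval with 1 degree of freedom. The names `chi2_values` and `inside` are ours.

```r
# Test statistics copied from the cell outputs above
chi2_values <- c(
  IsHTTPS = 0.002433171, HasTitle = 0.005626929, HasFavicon = 0.008928671,
  IsResponsive = 0.0004015504, Robots = 0.0001532023,
  HasSocialNet = 0.0003405986, HasDescription = 0.0008672606,
  HasCopyrightInfo = 0.004927114
)

# Critical interval for r = 2 classes and k = 0
alpha <- 0.05
dof <- 2 - 0 - 1
lower <- qchisq(alpha / 2, df = dof)
upper <- qchisq(1 - alpha / 2, df = dof)

# TRUE when the statistic lies inside the critical interval
inside <- chi2_values > lower & chi2_values < upper
print(data.frame(chi2 = chi2_values, inside_interval = inside))
```

Read mechanically against the two-tailed rule, IsHTTPS, HasTitle, HasFavicon and HasCopyrightInfo fall inside the interval, while IsResponsive, Robots, HasSocialNet and HasDescription fall below the lower critical value of about 0.00098. Since the expected proportions are rounded estimates from the same sample, our reading is that these very small values reflect how close the rounded proportion is to the exact one rather than a property of the data.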
