# **Phishing URL detection: chi-squared test**

**RQ1:** What characteristics provide the greatest discriminatory information for identifying phishing sites compared to legitimate ones?  

**RQ2:** Do synthetic data generated by Large Language Models preserve the same statistical properties as real data?

**RQ3:** What are the main differences between synthetic data generated by LLMs and real data in regression and clustering contexts?

**RQ4:** Can the features generated by LLMs be mapped to known statistical distributions?

<br>

**Author:** Raffaele Aurucci

## **Reading filtered dataset**

In [None]:
download.file("https://drive.google.com/uc?id=1Hq5AkkiOBiPLmPMEzLbgPj83Hs_lzUnY&export=download", "Phishing_URL_Dataset_3_Filtered.csv")

In [None]:
df <- read.csv('Phishing_URL_Dataset_3_Filtered.csv', sep = ",")

In [None]:
str(df)

'data.frame':	20153 obs. of  21 variables:
 $ URLLength            : int  462 379 285 437 22 221 318 397 21 473 ...
 $ DomainLength         : int  14 14 14 14 14 11 11 13 13 13 ...
 $ TLDEncoding          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ NoOfLettersInURL     : int  298 227 171 152 0 145 201 245 0 307 ...
 $ NoOfDegitsInURL      : int  87 81 54 264 11 29 58 88 10 100 ...
 $ NoOfSpecialCharsInURL: int  54 48 37 14 4 24 36 57 4 59 ...
 $ IsHTTPS              : int  1 1 1 0 0 1 1 0 0 1 ...
 $ LineOfCode           : int  2 2 11 242 17 11 2 2 11 125 ...
 $ LargestLineLength    : int  1638 1638 564 446 234 493 1638 1638 257 52977 ...
 $ HasTitle             : int  1 1 1 1 1 1 1 1 1 1 ...
 $ NoOfReference        : int  0 0 1 1 2 1 0 0 1 7 ...
 $ DomainTitleMatchScore: num  0 0 0 0 0 0 0 0 0 0 ...
 $ URLTitleMatchScore   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ HasFavicon           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Robots               : int  0 0 0 0 0 0 0 0 0 0 ...
 $ IsResponsive         : int  1 1

In [None]:
set.seed(42)

## **Two-tailed chi-squared test for normal distribution**

The two-tailed chi-squared goodness-of-fit test is used to assess whether a sample is consistent with a normal distribution. The null hypothesis (H₀) states that the data follow a normal distribution whose mean and standard deviation are estimated from the sample; the alternative hypothesis (H₁) states that they do not.

If the chi-squared test statistic falls between the lower and upper critical values, the null hypothesis cannot be rejected, so the data are consistent with normality. If the statistic falls outside this interval, the null hypothesis is rejected, indicating a deviation from the normal distribution.
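
In symbols, let $n_i$ be the observed count in class $i$, $n$ the sample size and $p_i$ the probability of class $i$ under H₀. The cells in this section delimit the classes with the 20%, 40%, 60% and 80% quantiles of the fitted normal, so $p_i = 0.2$ for each of the $r = 5$ classes, and $k = 2$ parameters (mean and standard deviation) are estimated from the sample. The notation $\chi^2_{q;\,\nu}$ for the $q$-quantile of a chi-squared distribution with $\nu$ degrees of freedom is introduced here for convenience and is not the notebook's own.

$$
\chi^2 = \sum_{i=1}^{r} \frac{(n_i - n p_i)^2}{n p_i},
\qquad
\text{do not reject } H_0 \iff \chi^2_{\alpha/2;\, r-k-1} < \chi^2 < \chi^2_{1-\alpha/2;\, r-k-1}
$$

With $r = 5$, $k = 2$ and $\alpha = 0.05$ the bounds are quantiles of a chi-squared distribution with 2 degrees of freedom, which is why every cell in this section prints the same critical values (about 0.0506 and 7.3778).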


### Feature **URLLength**

In [10]:
data <- df$URLLength

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Mean of the sample
m <- mean(data)
cat("Sample mean (m):", m, "\n")

# Standard deviation of the sample
d <- sd(data)
cat("Sample standard deviation (d):", d, "\n")

# Using quantiles of the normal distribution to determine subsets
a <- numeric(4)
for (i in 1:4)
  a[i] <- qnorm(0.2 * i, mean = m, sd = d)

cat("Quantiles (a):", a, "\n")

# Number of intervals
r <- 5

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculating the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which((data >= a[1]) & (data < a[2])))
nint[3] <- length(which((data >= a[2]) & (data < a[3])))
nint[4] <- length(which((data >= a[3]) & (data < a[4])))
nint[5] <- length(which(data >= a[4]))

cat("Frequencies of intervals (nint):", nint, "\n")

# Calculating the chi-squared test value
chi2 <- sum(((nint - n * 0.2) / sqrt(n * 0.2))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# k = number of parameters estimated from the sample (mean and standard deviation);
# alpha = significance level
k <- 2
alpha <- 0.05

# Calculating the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)

cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 20153 
Sample mean (m): 36.52469 
Sample standard deviation (d): 34.01521 
Quantiles (a): 7.896763 27.90703 45.14234 65.15261 
Frequencies of intervals (nint): 0 9341 7763 1504 1545 
Chi-squared test value (chi2): 17600.06 
Lower critical value: 0.05063562 
Upper critical value: 7.377759 
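
The cell above is repeated almost verbatim for every numeric feature. As a minimal sketch, the same steps could be collected in one function; the name `chisq_normal_gof` and its return structure are hypothetical and not part of the original notebook, and `findInterval` is used here in place of the explicit `which()` comparisons.

```r
# Hypothetical helper (not from the original notebook): equiprobable-class
# chi-squared goodness-of-fit check against N(mean(x), sd(x)^2).
chisq_normal_gof <- function(x, r = 5, k = 2, alpha = 0.05) {
  n <- length(x)
  m <- mean(x)
  d <- sd(x)

  # Interior class boundaries: the 1/r, ..., (r-1)/r quantiles of the fitted normal
  a <- qnorm(seq_len(r - 1) / r, mean = m, sd = d)

  # Observed counts per class. findInterval() returns 0 for x < a[1],
  # i for a[i] <= x < a[i + 1], and r - 1 for x >= a[r - 1],
  # so the classes are closed on the left, as in the cell above.
  nint <- tabulate(findInterval(x, a) + 1, nbins = r)

  # Pearson statistic with expected count n / r in every class
  expected <- n / r
  chi2 <- sum((nint - expected)^2 / expected)

  # Two-tailed critical values with r - k - 1 degrees of freedom
  dof <- r - k - 1
  lower <- qchisq(alpha / 2, df = dof)
  upper <- qchisq(1 - alpha / 2, df = dof)

  list(nint = nint, chi2 = chi2, lower = lower, upper = upper,
       reject = chi2 < lower || chi2 > upper)
}

# Usage on simulated data (illustrative only, not the phishing dataset)
set.seed(42)
res <- chisq_normal_gof(rnorm(1000, mean = 10, sd = 2))
cat("Frequencies:", res$nint, "\n")
cat("chi2:", res$chi2, " reject H0:", res$reject, "\n")
```

On the notebook's data the call would be, for example, `chisq_normal_gof(df$URLLength)`. The `reject` field makes the decision explicit: for URLLength the printed statistic (17600.06) lies far above the upper critical value (7.377759), so H₀ is rejected.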


### Feature **DomainLength**

In [11]:
data <- df$DomainLength

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Mean of the sample
m <- mean(data)
cat("Sample mean (m):", m, "\n")

# Standard deviation of the sample
d <- sd(data)
cat("Sample standard deviation (d):", d, "\n")

# Using quantiles of the normal distribution to determine subsets
a <- numeric(4)
for (i in 1:4)
  a[i] <- qnorm(0.2 * i, mean = m, sd = d)

cat("Quantiles (a):", a, "\n")

# Number of intervals
r <- 5

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculating the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which((data >= a[1]) & (data < a[2])))
nint[3] <- length(which((data >= a[2]) & (data < a[3])))
nint[4] <- length(which((data >= a[3]) & (data < a[4])))
nint[5] <- length(which(data >= a[4]))

cat("Frequencies of intervals (nint):", nint, "\n")

# Calculating the chi-squared test value
chi2 <- sum(((nint - n * 0.2) / sqrt(n * 0.2))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specifying k and the significance level alpha
k <- 2
alpha <- 0.05

# Calculating the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)

cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 20153 
Sample mean (m): 22.00362 
Sample standard deviation (d): 9.863465 
Quantiles (a): 13.70232 19.50474 24.5025 30.30492 
Frequencies of intervals (nint): 2266 7263 5100 2961 2563 
Chi-squared test value (chi2): 4466.763 
Lower critical value: 0.05063562 
Upper critical value: 7.377759 


### Feature **TLDEncoding**

In [12]:
data <- df$TLDEncoding

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Mean of the sample
m <- mean(data)
cat("Sample mean (m):", m, "\n")

# Standard deviation of the sample
d <- sd(data)
cat("Sample standard deviation (d):", d, "\n")

# Using quantiles of the normal distribution to determine subsets
a <- numeric(4)
for (i in 1:4)
  a[i] <- qnorm(0.2 * i, mean = m, sd = d)

cat("Quantiles (a):", a, "\n")

# Number of intervals
r <- 5

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculating the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which((data >= a[1]) & (data < a[2])))
nint[3] <- length(which((data >= a[2]) & (data < a[3])))
nint[4] <- length(which((data >= a[3]) & (data < a[4])))
nint[5] <- length(which(data >= a[4]))

cat("Frequencies of intervals (nint):", nint, "\n")

# Calculating the chi-squared test value
chi2 <- sum(((nint - n * 0.2) / sqrt(n * 0.2))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specifying k and the significance level alpha
k <- 2
alpha <- 0.05

# Calculating the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)

cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 20153 
Sample mean (m): 0.5000744 
Sample standard deviation (d): 0.2807711 
Quantiles (a): 0.2637715 0.4289419 0.571207 0.7363773 
Frequencies of intervals (nint): 4479 500 10287 648 4239 
Chi-squared test value (chi2): 15703.41 
Lower critical value: 0.05063562 
Upper critical value: 7.377759 


### Feature **NoOfLettersInURL**

In [13]:
data <- df$NoOfLettersInURL

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Mean of the sample
m <- mean(data)
cat("Sample mean (m):", m, "\n")

# Standard deviation of the sample
d <- sd(data)
cat("Sample standard deviation (d):", d, "\n")

# Using quantiles of the normal distribution to determine subsets
a <- numeric(4)
for (i in 1:4)
  a[i] <- qnorm(0.2 * i, mean = m, sd = d)

cat("Quantiles (a):", a, "\n")

# Number of intervals
r <- 5

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculating the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which((data >= a[1]) & (data < a[2])))
nint[3] <- length(which((data >= a[2]) & (data < a[3])))
nint[4] <- length(which((data >= a[3]) & (data < a[4])))
nint[5] <- length(which(data >= a[4]))

cat("Frequencies of intervals (nint):", nint, "\n")

# Calculating the chi-squared test value
chi2 <- sum(((nint - n * 0.2) / sqrt(n * 0.2))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specifying k and the significance level alpha
k <- 2
alpha <- 0.05

# Calculating the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)

cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 20153 
Sample mean (m): 21.00362 
Sample standard deviation (d): 24.65215 
Quantiles (a): 0.2558477 14.75807 27.24917 41.7514 
Frequencies of intervals (nint): 12 9461 7267 1697 1716 
Chi-squared test value (chi2): 16601.93 
Lower critical value: 0.05063562 
Upper critical value: 7.377759 


### Feature **NoOfDigitsInURL**

In [14]:
data <- df$NoOfDegitsInURL  # the column name is spelled "Degits" in the dataset

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Mean of the sample
m <- mean(data)
cat("Sample mean (m):", m, "\n")

# Standard deviation of the sample
d <- sd(data)
cat("Sample standard deviation (d):", d, "\n")

# Using quantiles of the normal distribution to determine subsets
a <- numeric(4)
for (i in 1:4)
  a[i] <- qnorm(0.2 * i, mean = m, sd = d)

cat("Quantiles (a):", a, "\n")

# Number of intervals
r <- 5

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculating the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which((data >= a[1]) & (data < a[2])))
nint[3] <- length(which((data >= a[2]) & (data < a[3])))
nint[4] <- length(which((data >= a[3]) & (data < a[4])))
nint[5] <- length(which(data >= a[4]))

cat("Frequencies of intervals (nint):", nint, "\n")

# Calculating the chi-squared test value
chi2 <- sum(((nint - n * 0.2) / sqrt(n * 0.2))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specifying k and the significance level alpha
k <- 2
alpha <- 0.05

# Calculating the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)

cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 20153 
Sample mean (m): 2.235498 
Sample standard deviation (d): 8.081593 
Quantiles (a): -4.566142 0.1880503 4.282947 9.037139 
Frequencies of intervals (nint): 0 14974 2445 1418 1316 
Chi-squared test value (chi2): 37888.3 
Lower critical value: 0.05063562 
Upper critical value: 7.377759 


### Feature **NoOfSpecialCharsInURL**

In [15]:
data <- df$NoOfSpecialCharsInURL

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Mean of the sample
m <- mean(data)
cat("Sample mean (m):", m, "\n")

# Standard deviation of the sample
d <- sd(data)
cat("Sample standard deviation (d):", d, "\n")

# Using quantiles of the normal distribution to determine subsets
a <- numeric(4)
for (i in 1:4)
  a[i] <- qnorm(0.2 * i, mean = m, sd = d)

cat("Quantiles (a):", a, "\n")

# Number of intervals
r <- 5

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculating the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which((data >= a[1]) & (data < a[2])))
nint[3] <- length(which((data >= a[2]) & (data < a[3])))
nint[4] <- length(which((data >= a[3]) & (data < a[4])))
nint[5] <- length(which(data >= a[4]))

cat("Frequencies of intervals (nint):", nint, "\n")

# Calculating the chi-squared test value
chi2 <- sum(((nint - n * 0.2) / sqrt(n * 0.2))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specifying k and the significance level alpha
k <- 2
alpha <- 0.05

# Calculating the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)

cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 20153 
Sample mean (m): 2.716816 
Sample standard deviation (d): 3.932356 
Quantiles (a): -0.5927377 1.720565 3.713067 6.02637 
Frequencies of intervals (nint): 0 10099 5790 2821 1443 
Chi-squared test value (chi2): 15959.29 
Lower critical value: 0.05063562 
Upper critical value: 7.377759 


### Feature **LineOfCode**

In [16]:
data <- df$LineOfCode

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Mean of the sample
m <- mean(data)
cat("Sample mean (m):", m, "\n")

# Standard deviation of the sample
d <- sd(data)
cat("Sample standard deviation (d):", d, "\n")

# Using quantiles of the normal distribution to determine subsets
a <- numeric(4)
for (i in 1:4)
  a[i] <- qnorm(0.2 * i, mean = m, sd = d)

cat("Quantiles (a):", a, "\n")

# Number of intervals
r <- 5

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculating the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which((data >= a[1]) & (data < a[2])))
nint[3] <- length(which((data >= a[2]) & (data < a[3])))
nint[4] <- length(which((data >= a[3]) & (data < a[4])))
nint[5] <- length(which(data >= a[4]))

cat("Frequencies of intervals (nint):", nint, "\n")

# Calculating the chi-squared test value
chi2 <- sum(((nint - n * 0.2) / sqrt(n * 0.2))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specifying k and the significance level alpha
k <- 2
alpha <- 0.05

# Calculating the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)

cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 20153 
Sample mean (m): 1020.963 
Sample standard deviation (d): 4001.97 
Quantiles (a): -2347.18 7.075463 2034.851 4389.106 
Frequencies of intervals (nint): 0 3152 14408 1709 884 
Chi-squared test value (chi2): 34734.04 
Lower critical value: 0.05063562 
Upper critical value: 7.377759 


### Feature **LargestLineLength**

In [17]:
data <- df$LargestLineLength

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Mean of the sample
m <- mean(data)
cat("Sample mean (m):", m, "\n")

# Standard deviation of the sample
d <- sd(data)
cat("Sample standard deviation (d):", d, "\n")

# Using quantiles of the normal distribution to determine subsets
a <- numeric(4)
for (i in 1:4)
  a[i] <- qnorm(0.2 * i, mean = m, sd = d)

cat("Quantiles (a):", a, "\n")

# Number of intervals
r <- 5

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculating the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which((data >= a[1]) & (data < a[2])))
nint[3] <- length(which((data >= a[2]) & (data < a[3])))
nint[4] <- length(which((data >= a[3]) & (data < a[4])))
nint[5] <- length(which(data >= a[4]))

cat("Frequencies of intervals (nint):", nint, "\n")

# Calculating the chi-squared test value
chi2 <- sum(((nint - n * 0.2) / sqrt(n * 0.2))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specifying k and the significance level alpha
k <- 2
alpha <- 0.05

# Calculating the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)

cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 20153 
Sample mean (m): 6398.606 
Sample standard deviation (d): 15242.78 
Quantiles (a): -6430.038 2536.893 10260.32 19227.25 
Frequencies of intervals (nint): 0 12703 4800 1109 1541 
Chi-squared test value (chi2): 26492.85 
Lower critical value: 0.05063562 
Upper critical value: 7.377759 


### Feature **NoOfReference**

In [18]:
data <- df$NoOfReference

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Mean of the sample
m <- mean(data)
cat("Sample mean (m):", m, "\n")

# Standard deviation of the sample
d <- sd(data)
cat("Sample standard deviation (d):", d, "\n")

# Using quantiles of the normal distribution to determine subsets
a <- numeric(4)
for (i in 1:4)
  a[i] <- qnorm(0.2 * i, mean = m, sd = d)

cat("Quantiles (a):", a, "\n")

# Number of intervals
r <- 5

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculating the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which((data >= a[1]) & (data < a[2])))
nint[3] <- length(which((data >= a[2]) & (data < a[3])))
nint[4] <- length(which((data >= a[3]) & (data < a[4])))
nint[5] <- length(which(data >= a[4]))

cat("Frequencies of intervals (nint):", nint, "\n")

# Calculating the chi-squared test value
chi2 <- sum(((nint - n * 0.2) / sqrt(n * 0.2))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specifying k and the significance level alpha
k <- 2
alpha <- 0.05

# Calculating the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)

cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 20153 
Sample mean (m): 101.9698 
Sample standard deviation (d): 344.4722 
Quantiles (a): -187.9453 14.69879 189.2409 391.885 
Frequencies of intervals (nint): 0 10409 6201 2427 1116 
Chi-squared test value (chi2): 18038.7 
Lower critical value: 0.05063562 
Upper critical value: 7.377759 


### Feature **URLTitleMatchScore**

In [19]:
data <- df$URLTitleMatchScore

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Mean of the sample
m <- mean(data)
cat("Sample mean (m):", m, "\n")

# Standard deviation of the sample
d <- sd(data)
cat("Sample standard deviation (d):", d, "\n")

# Using quantiles of the normal distribution to determine subsets
a <- numeric(4)
for (i in 1:4)
  a[i] <- qnorm(0.2 * i, mean = m, sd = d)

cat("Quantiles (a):", a, "\n")

# Number of intervals
r <- 5

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculating the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which((data >= a[1]) & (data < a[2])))
nint[3] <- length(which((data >= a[2]) & (data < a[3])))
nint[4] <- length(which((data >= a[3]) & (data < a[4])))
nint[5] <- length(which(data >= a[4]))

cat("Frequencies of intervals (nint):", nint, "\n")

# Calculating the chi-squared test value
chi2 <- sum(((nint - n * 0.2) / sqrt(n * 0.2))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specifying k and the significance level alpha
k <- 2
alpha <- 0.05

# Calculating the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)

cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 20153 
Sample mean (m): 47.63797 
Sample standard deviation (d): 49.61464 
Quantiles (a): 5.881237 35.06824 60.20769 89.3947 
Frequencies of intervals (nint): 10435 72 14 229 9403 
Chi-squared test value (chi2): 28813.27 
Lower critical value: 0.05063562 
Upper critical value: 7.377759 


### Feature **DomainTitleMatchScore**

In [20]:
data <- df$DomainTitleMatchScore

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Mean of the sample
m <- mean(data)
cat("Sample mean (m):", m, "\n")

# Standard deviation of the sample
d <- sd(data)
cat("Sample standard deviation (d):", d, "\n")

# Using quantiles of the normal distribution to determine subsets
a <- numeric(4)
for (i in 1:4)
  a[i] <- qnorm(0.2 * i, mean = m, sd = d)

cat("Quantiles (a):", a, "\n")

# Number of intervals
r <- 5

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculating the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which((data >= a[1]) & (data < a[2])))
nint[3] <- length(which((data >= a[2]) & (data < a[3])))
nint[4] <- length(which((data >= a[3]) & (data < a[4])))
nint[5] <- length(which(data >= a[4]))

cat("Frequencies of intervals (nint):", nint, "\n")

# Calculating the chi-squared test value
chi2 <- sum(((nint - n * 0.2) / sqrt(n * 0.2))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specifying k and the significance level alpha
k <- 2
alpha <- 0.05

# Calculating the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)

cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 20153 
Sample mean (m): 45.04443 
Sample standard deviation (d): 49.46018 
Quantiles (a): 3.417689 32.51383 57.57502 86.67117 
Frequencies of intervals (nint): 10866 168 7 187 8925 
Chi-squared test value (chi2): 28918.81 
Lower critical value: 0.05063562 
Upper critical value: 7.377759 


### Feature **NoOfExternalFiles**

In [21]:
data <- df$NoOfExternalFiles

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Mean of the sample
m <- mean(data)
cat("Sample mean (m):", m, "\n")

# Standard deviation of the sample
d <- sd(data)
cat("Sample standard deviation (d):", d, "\n")

# Using quantiles of the normal distribution to determine subsets
a <- numeric(4)
for (i in 1:4)
  a[i] <- qnorm(0.2 * i, mean = m, sd = d)

cat("Quantiles (a):", a, "\n")

# Number of intervals
r <- 5

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculating the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which((data >= a[1]) & (data < a[2])))
nint[3] <- length(which((data >= a[2]) & (data < a[3])))
nint[4] <- length(which((data >= a[3]) & (data < a[4])))
nint[5] <- length(which(data >= a[4]))

cat("Frequencies of intervals (nint):", nint, "\n")

# Calculating the chi-squared test value
chi2 <- sum(((nint - n * 0.2) / sqrt(n * 0.2))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specifying k and the significance level alpha
k <- 2
alpha <- 0.05

# Calculating the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)

cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 20153 
Sample mean (m): 39.5288 
Sample standard deviation (d): 327.322 
Quantiles (a): -235.9523 -43.39727 122.4549 315.0099 
Frequencies of intervals (nint): 0 0 18901 1103 149 
Chi-squared test value (chi2): 68788.25 
Lower critical value: 0.05063562 
Upper critical value: 7.377759 


## **Two-tailed chi-squared test for binomial distribution**

The same two-tailed chi-squared test is applied to the binary (0/1) features to assess whether a sample is consistent with a binomial distribution. The null hypothesis (H₀) states that the data follow a binomial distribution; the alternative hypothesis (H₁) states that they do not.

If the test statistic falls between the lower and upper critical values, the null hypothesis cannot be rejected, so the data are consistent with a binomial distribution. If the statistic falls outside this interval, the null hypothesis is rejected, indicating a deviation from the binomial distribution.


### Feature **IsHTTPS**

In [40]:
data <- df$IsHTTPS

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Calculate the absolute frequencies for interval 0 and interval 1
freq <- table(data)

# Expected class probabilities: observed share of 0s (rounded to 3 decimals)
# and its complement
a <- numeric(2)
a[1] <- round(freq[1] / n, digits = 3)
a[2] <- 1 - a[1]
cat("Intervals (a):", a, "\n")

# Number of intervals
r <- 2

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Observed counts: a[1] lies strictly between 0 and 1, so data < a[1]
# selects the 0s and data >= a[1] selects the 1s
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which(data >= a[1]))
cat("Frequencies of intervals (nint):", nint, "\n")

# Calculate the chi-square test value
chi2 <- sum(((nint - n * a) / sqrt(n * a))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specify k and the significance level alpha
k <- 0
alpha <- 0.05

# Calculate the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)
cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 20153 
Intervals (a): 0.241 0.759 
Frequencies of intervals (nint): 4866 15287 
Chi-squared test value (chi2): 0.02259735 
Lower critical value: 0.0009820691 
Upper critical value: 5.023886 
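
In the cell above the expected probabilities `a` are the sample's own relative frequencies rounded to three decimals, so the observed and expected counts differ only by the rounding error; this is why the statistic is so close to zero. Estimating the probability from the same sample would also normally count as one estimated parameter (k = 1), which leaves no degrees of freedom with two classes, whereas the cell uses k = 0. As a point of comparison, the sketch below tests a binary feature against a probability fixed in advance. The function name `chisq_binary_gof` and the value `p0 = 0.5` are hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not from the original notebook): chi-squared check of a
# 0/1 feature against a probability p0 of observing a 0 that is fixed in advance.
chisq_binary_gof <- function(x, p0, alpha = 0.05) {
  stopifnot(all(x %in% c(0, 1)), p0 > 0, p0 < 1)
  n <- length(x)

  # Observed counts of 0s and 1s, and their probabilities under H0
  nint <- c(sum(x == 0), sum(x == 1))
  p <- c(p0, 1 - p0)

  # Pearson statistic
  chi2 <- sum((nint - n * p)^2 / (n * p))

  # r - k - 1 with r = 2 classes and k = 0 (nothing is estimated from x)
  dof <- 2 - 0 - 1
  lower <- qchisq(alpha / 2, df = dof)
  upper <- qchisq(1 - alpha / 2, df = dof)

  list(nint = nint, chi2 = chi2, lower = lower, upper = upper,
       reject = chi2 < lower || chi2 > upper)
}

# Rebuild a 0/1 vector from the IsHTTPS counts printed above
x <- rep(c(0, 1), times = c(4866, 15287))

res_rounded <- chisq_binary_gof(x, p0 = 0.241)  # the rounded sample share used above
res_fixed   <- chisq_binary_gof(x, p0 = 0.5)    # illustrative hypothesis only

cat("chi2 with p0 = 0.241:", res_rounded$chi2, "\n")
cat("chi2 with p0 = 0.5:  ", res_fixed$chi2, "\n")
```

With `p0 = 0.241` the function reproduces the statistic reported for IsHTTPS, which confirms that the small value comes from rounding alone. Any other choice of `p0` is a modelling assumption that would need to be justified separately.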


### Feature **HasTitle**

In [41]:
data <- df$HasTitle

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Calculate the absolute frequencies for interval 0 and interval 1
freq <- table(data)

# Generate intervals for the chi-square test
a <- numeric(2)
a[1] <- round(freq[1] / n, digits = 3)
a[2] <- 1 - a[1]
cat("Intervals (a):", a, "\n")

# Number of intervals
r <- 2

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculate the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which(data >= a[1]))
cat("Frequencies of intervals (nint):", nint, "\n")

# Calculate the chi-square test value
chi2 <- sum(((nint - n * a) / sqrt(n * a))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specify k and the significance level alpha
k <- 0
alpha <- 0.05

# Calculate the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)
cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 20153 
Intervals (a): 0.159 0.841 
Frequencies of intervals (nint): 3195 16958 
Chi-squared test value (chi2): 0.03228131 
Lower critical value: 0.0009820691 
Upper critical value: 5.023886 


### Feature **HasFavicon**

In [42]:
data <- df$HasFavicon

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Calculate the absolute frequencies for interval 0 and interval 1
freq <- table(data)

# Generate intervals for the chi-square test
a <- numeric(2)
a[1] <- round(freq[1] / n, digits = 3)
a[2] <- 1 - a[1]
cat("Intervals (a):", a, "\n")

# Number of intervals
r <- 2

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculate the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which(data >= a[1]))
cat("Frequencies of intervals (nint):", nint, "\n")

# Calculate the chi-square test value
chi2 <- sum(((nint - n * a) / sqrt(n * a))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specify k and the significance level alpha
k <- 0
alpha <- 0.05

# Calculate the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)
cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 20153 
Intervals (a): 0.677 0.323 
Frequencies of intervals (nint): 13634 6519 
Chi-squared test value (chi2): 0.02083007 
Lower critical value: 0.0009820691 
Upper critical value: 5.023886 


### Feature **IsResponsive**

In [43]:
data <- df$IsResponsive

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Calculate the absolute frequencies for interval 0 and interval 1
freq <- table(data)

# Generate intervals for the chi-square test
a <- numeric(2)
a[1] <- round(freq[1] / n, digits = 3)
a[2] <- 1 - a[1]
cat("Intervals (a):", a, "\n")

# Number of intervals
r <- 2

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculate the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which(data >= a[1]))
cat("Frequencies of intervals (nint):", nint, "\n")

# Calculate the chi-square test value
chi2 <- sum(((nint - n * a) / sqrt(n * a))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specify k and the significance level alpha
k <- 0
alpha <- 0.05

# Calculate the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)
cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 20153 
Intervals (a): 0.416 0.584 
Frequencies of intervals (nint): 8374 11779 
Chi-squared test value (chi2): 0.01901204 
Lower critical value: 0.0009820691 
Upper critical value: 5.023886 


### Feature **Robots**

In [44]:
data <- df$Robots

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Calculate the absolute frequencies for interval 0 and interval 1
freq <- table(data)

# Generate intervals for the chi-square test
a <- numeric(2)
a[1] <- round(freq[1] / n, digits = 3)
a[2] <- 1 - a[1]
cat("Intervals (a):", a, "\n")

# Number of intervals
r <- 2

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculate the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which(data >= a[1]))
cat("Frequencies of intervals (nint):", nint, "\n")

# Calculate the chi-square test value
chi2 <- sum(((nint - n * a) / sqrt(n * a))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specify k and the significance level alpha
k <- 0
alpha <- 0.05

# Calculate the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)
cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 20153 
Intervals (a): 0.754 0.246 
Frequencies of intervals (nint): 15205 4948 
Chi-squared test value (chi2): 0.02485007 
Lower critical value: 0.0009820691 
Upper critical value: 5.023886 


### Feature **HasSocialNet**

In [45]:
data <- df$HasSocialNet

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Calculate the absolute frequencies for interval 0 and interval 1
freq <- table(data)

# Generate intervals for the chi-square test
a <- numeric(2)
a[1] <- round(freq[1] / n, digits = 3)
a[2] <- 1 - a[1]
cat("Intervals (a):", a, "\n")

# Number of intervals
r <- 2

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculate the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which(data >= a[1]))
cat("Frequencies of intervals (nint):", nint, "\n")

# Calculate the chi-square test value
chi2 <- sum(((nint - n * a) / sqrt(n * a))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specify k and the significance level alpha
k <- 0
alpha <- 0.05

# Calculate the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)
cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 20153 
Intervals (a): 0.602 0.398 
Frequencies of intervals (nint): 12142 8011 
Chi-squared test value (chi2): 0.0202733 
Lower critical value: 0.0009820691 
Upper critical value: 5.023886 


### Feature **HasDescription**

In [46]:
data <- df$HasDescription

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Calculate the absolute frequencies for interval 0 and interval 1
freq <- table(data)

# Generate intervals for the chi-square test
a <- numeric(2)
a[1] <- round(freq[1] / n, digits = 3)
a[2] <- 1 - a[1]
cat("Intervals (a):", a, "\n")

# Number of intervals
r <- 2

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculate the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which(data >= a[1]))
cat("Frequencies of intervals (nint):", nint, "\n")

# Calculate the chi-square test value
chi2 <- sum(((nint - n * a) / sqrt(n * a))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specify k and the significance level alpha
k <- 0
alpha <- 0.05

# Calculate the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)
cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 20153 
Intervals (a): 0.612 0.388 
Frequencies of intervals (nint): 12340 7813 
Chi-squared test value (chi2): 0.008463256 
Lower critical value: 0.0009820691 
Upper critical value: 5.023886 


### Feature **HasCopyrightInfo**

In [47]:
data <- df$HasCopyrightInfo

# Length of the sample
n <- length(data)
cat("Sample size (n):", n, "\n")

# Calculate the absolute frequencies for interval 0 and interval 1
freq <- table(data)

# Generate intervals for the chi-square test
a <- numeric(2)
a[1] <- round(freq[1] / n, digits = 3)
a[2] <- 1 - a[1]
cat("Intervals (a):", a, "\n")

# Number of intervals
r <- 2

# Initializing a numeric vector to store interval frequencies
nint <- numeric(r)

# Calculate the frequencies of the intervals
nint[1] <- length(which(data < a[1]))
nint[2] <- length(which(data >= a[1]))
cat("Frequencies of intervals (nint):", nint, "\n")

# Calculate the chi-square test value
chi2 <- sum(((nint - n * a) / sqrt(n * a))^2)
cat("Chi-squared test value (chi2):", chi2, "\n")

# Specify k and the significance level alpha
k <- 0
alpha <- 0.05

# Calculate the critical values for the chi-squared test
critical_lower <- qchisq(alpha / 2, df = r - k - 1)
critical_upper <- qchisq(1 - alpha / 2, df = r - k - 1)
cat("Lower critical value:", critical_lower, "\n")
cat("Upper critical value:", critical_upper, "\n")

Sample size (n): 20153 
Intervals (a): 0.569 0.431 
Frequencies of intervals (nint): 11472 8681 
Chi-squared test value (chi2): 0.004943699 
Lower critical value: 0.0009820691 
Upper critical value: 5.023886 
