# (PART) STATISTICAL ANALYSIS {-}

# How to Perform Statistical Analysis in Python and R?

## Explanation

Statistical analysis helps us understand the characteristics of our dataset, identify patterns, and make data-driven decisions. In this section, we will cover basic statistical measures such as mean, median, variance, and correlation.
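As a quick illustration of what these measures compute, the sketch below works through the mean, median, and variance of a small made-up sample using only Python's standard `statistics` module. The sample values are hypothetical. Note that `statistics.variance()` divides by n - 1 (as pandas `var()` and R's `var()` do by default), while `statistics.pvariance()` divides by n.

```python
import statistics

# Hypothetical sample, used only for illustration
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean_value = statistics.mean(sample)
median_value = statistics.median(sample)
sample_var = statistics.variance(sample)       # divides by n - 1
population_var = statistics.pvariance(sample)  # divides by n

print(f"mean = {mean_value}, median = {median_value}")
print(f"sample variance = {sample_var:.4f}")
print(f"population variance = {population_var:.4f}")
```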

```{r, echo=FALSE, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE,
  cache = FALSE,
  comment = NA
)

if (!require("tidyverse")) {
  install.packages("tidyverse")
  library(tidyverse)
}
```

## Python Code

```python
import pandas as pd

# Load dataset
df = pd.read_csv("data/iris.csv")

# Summary statistics
summary_stats = df.describe()

# Calculate variance for numerical columns
variance = df.var(numeric_only=True)

# Calculate correlation between numerical variables
correlation = df.corr(numeric_only=True)

# Display results
print("Summary Statistics:\n", summary_stats)
print("\nVariance:\n", variance)
print("\nCorrelation:\n", correlation)
```

```text
Summary Statistics:
        sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

Variance:
 sepal_length    0.685694
sepal_width     0.189979
petal_length    3.116278
petal_width     0.581006
dtype: float64

Correlation:
               sepal_length  sepal_width  petal_length  petal_width
sepal_length      1.000000    -0.117570      0.871754     0.817941
sepal_width      -0.117570     1.000000     -0.428440    -0.366126
petal_length      0.871754    -0.428440      1.000000     0.962865
petal_width       0.817941    -0.366126      0.962865     1.000000
```

## R Code

```{r}
# Load dataset
df <- read.csv("data/iris.csv")

# Summary statistics
summary_stats <- summary(df)

# Calculate variance for numerical columns
variance <- apply(df[, 1:4], 2, var)

# Calculate correlation between numerical variables
correlation <- cor(df[, 1:4])

# Display results (cat() interprets "\n"; print() would show it literally)
cat("Summary Statistics:\n")
print(summary_stats)
cat("\nVariance:\n")
print(variance)
cat("\nCorrelation:\n")
print(correlation)

```

# How to Calculate Skewness and Kurtosis in Python and R?

## Explanation

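Skewness and kurtosis describe the shape of a distribution.

- **Skewness** measures the asymmetry of the data distribution. A skewness of 0 indicates a perfectly symmetric distribution; positive values indicate a longer right tail and negative values a longer left tail.
- **Kurtosis** measures the "tailedness" of the distribution. A normal distribution has a kurtosis of 3. Both `scipy.stats.kurtosis()` and `e1071::kurtosis()` report *excess* kurtosis (kurtosis minus 3) by default, so in the outputs below a normal distribution corresponds to 0, positive values indicate heavier tails, and negative values indicate lighter tails.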
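To make the two definitions concrete, here is a minimal sketch that computes moment-based skewness and excess kurtosis from scratch with the standard library. The helper names and the two small samples are hypothetical; the formulas are the simple "divide by n" versions, which is what `scipy.stats.skew()` and `scipy.stats.kurtosis()` use by default, so other estimators (for example bias-corrected ones) can give slightly different numbers.

```python
def central_moment(values, k):
    """Return the k-th central moment, dividing by n."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** k for v in values) / n


def skewness(values):
    """Third central moment scaled by the 1.5 power of the second."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth central moment over the squared second, minus 3 (normal = 0)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


# Hypothetical samples, used only for illustration
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.0, 2.0, 2.0, 10.0]

sym_skew = skewness(symmetric)
sym_kurt = excess_kurtosis(symmetric)
right_skew = skewness(right_skewed)

print(f"symmetric:    skewness = {sym_skew:.4f}, excess kurtosis = {sym_kurt:.4f}")
print(f"right-skewed: skewness = {right_skew:.4f}")
```

The symmetric sample has zero skewness and a negative excess kurtosis (lighter tails than a normal distribution), while the sample with one large value has positive skewness.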

## Python Code



```python
import pandas as pd
from scipy.stats import skew, kurtosis

# Load dataset
df = pd.read_csv("data/iris.csv")

# Compute skewness for the numeric columns (all but the last column, species)
skewness = df.iloc[:, :-1].apply(skew)

# Compute kurtosis (scipy returns excess kurtosis by default, so normal = 0)
kurt = df.iloc[:, :-1].apply(kurtosis)

# Display results
print("Skewness:\n", skewness)
print("\nKurtosis:\n", kurt)
```

```text
Skewness:
 sepal_length    0.311753
sepal_width     0.315767
petal_length   -0.272128
petal_width    -0.101934
dtype: float64

Kurtosis:
 sepal_length   -0.573568
sepal_width     0.180976
petal_length   -1.395536
petal_width    -1.336067
dtype: float64
```


## R Code

```{r}
# Check and load necessary libraries from CRAN mirror
if(!require(tidyverse)) install.packages("tidyverse", dependencies = TRUE, repos = "https://cloud.r-project.org/")
if(!require(e1071)) install.packages("e1071", dependencies = TRUE, repos = "https://cloud.r-project.org/")

library(tidyverse)
library(e1071)

# Load dataset
df <- read_csv("data/iris.csv")

# Compute skewness and kurtosis
skewness_values <- df %>%
  select(-species) %>%
  summarise(across(everything(), skewness))

kurtosis_values <- df %>%
  select(-species) %>%
  summarise(across(everything(), kurtosis))

# Display results
print("Skewness:")
print(skewness_values)

print("Kurtosis:")
print(kurtosis_values)
```

# How to Perform a t-test in Python and R?

## Explanation

**t-tests** are used to compare the means of two groups and determine whether they are significantly different from each other. In the iris dataset, we can compare the sepal length of two species to see if their means differ significantly.

There are different types of t-tests:

- **Independent t-test**: Compares the means of two independent groups.
- **Paired t-test**: Compares the means of paired measurements, such as the same subjects measured at two time points.
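For the independent case, the test statistic is the difference in means divided by its standard error. The sketch below computes it two ways with the standard library: with a pooled variance (Student's version, which assumes equal variances) and with separate variances (Welch's version). The function names and the two small samples are illustrative assumptions, and the p-value is left out because it requires the t distribution, which the standard library does not provide.

```python
import math
import statistics


def pooled_t(a, b):
    """Student's t statistic assuming equal variances; returns (t, df)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_var = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se, na + nb - 2


def welch_t(a, b):
    """Welch's t statistic; returns (t, approximate degrees of freedom)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a) / na, statistics.variance(b) / nb
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Small illustrative samples (not the full iris data)
group_a = [5.1, 4.9, 4.7, 4.6, 5.0]
group_b = [7.0, 6.4, 6.9, 5.5, 6.5]

t_pooled, df_pooled = pooled_t(group_a, group_b)
t_welch, df_welch = welch_t(group_a, group_b)

print(f"pooled: t = {t_pooled:.4f}, df = {df_pooled}")
print(f"Welch:  t = {t_welch:.4f}, df = {df_welch:.2f}")
```

With equal group sizes the two statistics coincide and only the degrees of freedom differ; with unequal sizes and unequal variances the statistics themselves diverge.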

## Python Code

In Python, we use `scipy.stats.ttest_ind()` for an independent t-test. By default it assumes that both groups have equal variances; passing `equal_var=False` runs Welch's t-test, which drops that assumption.

```python
import pandas as pd
from scipy import stats

# Load dataset
df = pd.read_csv("data/iris.csv")

# Filter two species for comparison
setosa = df[df['species'] == 'setosa']['sepal_length']
versicolor = df[df['species'] == 'versicolor']['sepal_length']

# Perform independent t-test (equal variances assumed by default)
t_stat, p_value = stats.ttest_ind(setosa, versicolor)

print(f"t-statistic: {t_stat}, p-value: {p_value}")
```

```text
t-statistic: -10.52098626754911, p-value: 8.985235037487079e-18
```


# How to Compute the Mean, Median, and Mode of a Dataset?

## Explanation
- **Mean**: The average of all values in the dataset.
- **Median**: The middle value when the data is sorted.
- **Mode**: The value that appears most frequently in the dataset.
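A minimal sketch of the three measures on a small hypothetical list, using only the standard `statistics` module. It also shows a detail that is easy to miss: a dataset can have more than one mode, and `statistics.multimode()` returns all of them. In pandas, `mode()` likewise returns every tied value, so taking `.iloc[0]` keeps only the first one.

```python
import statistics

# Hypothetical measurements, used only for illustration
values = [5.0, 4.9, 5.0, 4.6, 5.1, 4.9, 5.4]

mean_value = statistics.mean(values)
median_value = statistics.median(values)
modes = statistics.multimode(values)  # every value tied for most frequent

print(f"mean = {mean_value:.4f}")
print(f"median = {median_value}")
print(f"modes = {sorted(modes)}")
```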

## Python Code


```python
import pandas as pd

# Load dataset
df = pd.read_csv("data/iris.csv")

# Keep only the numeric columns
numeric_df = df.drop(columns=["species"])

# Compute mean, median, and mode
mean_values = numeric_df.mean()
median_values = numeric_df.median()
mode_values = numeric_df.mode().iloc[0]  # first mode if several values tie

# Display results
print("Mean:\n")
print(mean_values)

print("\nMedian:\n")
print(median_values)

print("\nMode:\n")
print(mode_values)
```

```text
Mean:

sepal_length    5.843333
sepal_width     3.057333
petal_length    3.758000
petal_width     1.199333
dtype: float64

Median:

sepal_length    5.80
sepal_width     3.00
petal_length    4.35
petal_width     1.30
dtype: float64

Mode:

sepal_length    5.0
sepal_width     3.0
petal_length    1.4
petal_width     0.2
Name: 0, dtype: float64
```


## R Code

```{r}
# Load necessary libraries
library(tidyverse)

# Load dataset
df <- read_csv("data/iris.csv")

# Compute mean, median, and mode
mean_values <- df %>%
  select(-species) %>%
  summarise(across(everything(), mean))

median_values <- df %>%
  select(-species) %>%
  summarise(across(everything(), median))

# table() stores the values as names (character), so convert back to numeric
mode_values <- df %>%
  select(-species) %>%
  summarise(across(everything(), ~ as.numeric(names(sort(table(.x), decreasing = TRUE))[1])))

# Display results
print("Mean:")
print(mean_values)

print("Median:")
print(median_values)

print("Mode:")
print(mode_values)

```

# What is the Difference Between an F-test and an ANOVA Test?

## Overview  
Both the **F-test** and **ANOVA** use the **F-statistic**, but they serve different purposes in statistical analysis.  

## Key Aspects  

| **Aspect**         | **F-test** (Variance Comparison)                   | **ANOVA** (Mean Comparison)                     |
|--------------------|---------------------------------------------------|-------------------------------------------------|
| **Purpose**       | Compares the variances of two groups.              | Compares the means of three or more groups.     |
| **Hypotheses**    | - \( H_0 \): Variances are equal. <br> - \( H_a \): Variances are different. | - \( H_0 \): All group means are equal. <br> - \( H_a \): At least one mean is different. |
| **When to Use?**  | Before a t-test, to check variance equality.        | When analyzing differences among multiple groups. |
| **Test Statistic** | \( F = \frac{s_1^2}{s_2^2} \)  (Ratio of sample variances) | \( F = \frac{\text{Between-group variance}}{\text{Within-group variance}} \) |
| **Python Function** | No dedicated SciPy function; compute the ratio and use the `scipy.stats.f` distribution. `levene()` and `bartlett()` test the same hypothesis with different statistics. | `f_oneway()` from `scipy.stats` (for means). |
| **R Function**     | `var.test(group1, group2)`.                        | `aov(response ~ group, data = df)`. |

## Key Differences  

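- The **F-test** is used to compare **variances** between two groups.  
- **ANOVA** is used to compare **means** among **three or more groups**.  
- The **F-test** is often used **before a t-test** to check whether the assumption of equal variances holds. Before ANOVA, where there are three or more groups, **Levene's test** or **Bartlett's test** plays the same role.  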

## What If Variances Are Not Equal?  
If the assumption of equal variances is violated, consider using:

- **Welch’s ANOVA**, which does not assume equal variances.
- **Non-parametric tests**, such as the **Kruskal-Wallis test** (for comparing medians).

---


# How to Perform an F-test in Python and R?

## Explanation  
An **F-test** is used to compare the variances of two independent groups. It helps determine if the groups have equal variances, which is important in statistical tests like t-tests and ANOVA.

The null hypothesis (\(H_0\)) assumes that the variances of the two groups are equal, while the alternative hypothesis (\(H_a\)) states that they are different.

The **F-statistic** for the F-test is calculated as:

\[
F = \frac{\text{variance of group 1}}{\text{variance of group 2}}
\]

If the p-value is small (typically \( p < 0.05 \)), we reject the null hypothesis and conclude that the variances are significantly different.
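The statistic itself is simple to compute by hand. The sketch below, using only the standard library, returns the variance ratio together with the two degrees of freedom that identify the reference F distribution. The function name and the sample values are illustrative assumptions. Turning the ratio into a p-value needs the F distribution, which the standard library lacks; with SciPy installed, one option would be the `cdf` and `sf` methods of `scipy.stats.f`.

```python
import statistics


def variance_ratio(a, b):
    """Return (F, df1, df2) where F = sample variance of a / sample variance of b."""
    f_value = statistics.variance(a) / statistics.variance(b)
    return f_value, len(a) - 1, len(b) - 1


# Small illustrative samples (not the full iris data)
group1 = [5.1, 4.9, 4.7, 4.6, 5.0]
group2 = [7.0, 6.4, 6.9, 5.5, 6.5]

f_value, df1, df2 = variance_ratio(group1, group2)
print(f"F = {f_value:.4f} with ({df1}, {df2}) degrees of freedom")

# A ratio far from 1 in either direction suggests unequal variances;
# the p-value would come from the F(df1, df2) distribution.
```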

## Python Code

SciPy has no dedicated function for the two-sample variance-ratio F-test, so this example uses **Levene's test** from `scipy.stats`. It tests the same null hypothesis of equal variances but is less sensitive to departures from normality, and its statistic is not the variance ratio shown above.

```python
import pandas as pd
from scipy.stats import levene

# Load dataset
df = pd.read_csv("data/iris.csv")

# Select two species for comparison
group1 = df[df["species"] == "setosa"]["sepal_length"]
group2 = df[df["species"] == "versicolor"]["sepal_length"]

# Test equality of variances with Levene's test
stat, p_value = levene(group1, group2)

# Display results
print(f"Levene statistic: {stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Interpretation
if p_value < 0.05:
    print("Reject H0: The variances of Setosa and Versicolor are significantly different.")
else:
    print("Fail to reject H0: No significant difference in variances.")
```

```text
Levene statistic: 8.1727
P-value: 0.0052
Reject H0: The variances of Setosa and Versicolor are significantly different.
```


## R Code

```{r}
# Load dataset
df <- read.csv("data/iris.csv")

# Subset data for two species
group1 <- df[df$species == "setosa", "sepal_length"]
group2 <- df[df$species == "versicolor", "sepal_length"]

# Perform the F-test for equality of variances (var.test() is in base R's stats package)
var_test_result <- var.test(group1, group2)
print(var_test_result)
```

# How to Perform an ANOVA Test in Python and R?

## Explanation

An ANOVA (Analysis of Variance) test is used to compare the means of three or more groups. It helps to determine whether there are any statistically significant differences between the means of the groups.

The null hypothesis (\(H_0\)) assumes that all group means are equal, while the alternative hypothesis (\(H_a\)) states that at least one mean is different.

The F-statistic in ANOVA is calculated as:

\[
F = \frac{\text{Between-group variance}}{\text{Within-group variance}}
\]

If the p-value is small (typically \( p < 0.05 \)), we reject the null hypothesis and conclude that at least one group mean is significantly different.
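To see where the F-statistic comes from, the following sketch computes it directly from the between-group and within-group sums of squares, using only the standard library. The function name and the three tiny groups are hypothetical; the p-value is omitted because it requires the F distribution.

```python
def one_way_anova_f(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group: how far each group mean is from the grand mean
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group: how far each value is from its own group mean
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical groups, used only for illustration
f_value, df_between, df_within = one_way_anova_f(
    [1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]
)
print(f"F = {f_value:.4f} with ({df_between}, {df_within}) degrees of freedom")
```

Here the group means (2, 3, and 6) are spread much more widely than the values inside each group, which is what produces a large F.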

## Python Code

```python
import pandas as pd
from scipy.stats import f_oneway

# Load dataset
df = pd.read_csv("data/iris.csv")

# Select groups for comparison
group1 = df[df["species"] == "setosa"]["sepal_length"]
group2 = df[df["species"] == "versicolor"]["sepal_length"]
group3 = df[df["species"] == "virginica"]["sepal_length"]

# Perform ANOVA
f_stat, p_value = f_oneway(group1, group2, group3)

# Display results
print(f"F-statistic: {f_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Interpretation
if p_value < 0.05:
    print("Reject H0: At least one group mean is significantly different.")
else:
    print("Fail to reject H0: No significant difference in group means.")
```

```text
F-statistic: 119.2645
P-value: 0.0000
Reject H0: At least one group mean is significantly different.
```


## R Code 

```{r}
# Load dataset
df <- read.csv("data/iris.csv")

# Perform ANOVA (sepal length by species)
anova_result <- aov(sepal_length ~ species, data = df)
summary(anova_result)
```

# How to Perform a Chi-Square Test in Python and R?

## Explanation  
A **Chi-Square Test** is used to determine whether there is a significant association between two categorical variables. It compares the observed frequencies in each category to the frequencies we would expect if the variables were independent.

The **Chi-Square Statistic** is calculated as:

\[
\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
\]

Where:

- \(O_i\) is the observed frequency in category \(i\),
- \(E_i\) is the expected frequency in category \(i\).

The null hypothesis (\(H_0\)) assumes that there is no association between the variables (i.e., the variables are independent), while the alternative hypothesis (\(H_a\)) states that there is an association between them.

If the p-value is small (typically \( p < 0.05 \)), we reject the null hypothesis and conclude that there is a significant association between the variables.
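The expected frequencies come from the row and column totals: under independence, each cell is expected to hold (row total × column total) / grand total observations. The sketch below applies that rule and the formula above to a hypothetical 2 × 3 table, using only the standard library. The function name and the counts are assumptions for illustration, and no continuity correction is applied.

```python
def chi_square(observed):
    """Return (statistic, degrees of freedom, expected table) for a 2-D count table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    # Expected count for each cell if rows and columns were independent
    expected = [[r * c / total for c in col_totals] for r in row_totals]

    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# Hypothetical contingency table: 2 groups by 3 categories
table = [[10, 20, 30],
         [20, 20, 20]]

chi2_value, dof, expected = chi_square(table)
print(f"chi-square = {chi2_value:.4f}, degrees of freedom = {dof}")
print("expected:", expected)
```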

## Python Code

Because `sepal_width` is continuous, the example first bins it into three equal-width intervals with `pd.cut()` so that it can be cross-tabulated against `species`.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Load dataset
df = pd.read_csv("data/iris.csv")

# Create a contingency table for 'species' and binned 'sepal_width'
contingency_table = pd.crosstab(df['species'], pd.cut(df['sepal_width'], bins=3))

# Perform Chi-Square Test
chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)

# Display results
print(f"Chi-Square Statistic: {chi2_stat:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of Freedom: {dof}")
print(f"Expected Frequencies: \n{expected}")

# Interpretation
if p_value < 0.05:
    print("Reject H0: There is a significant association between species and sepal width.")
else:
    print("Fail to reject H0: No significant association between species and sepal width.")
```

```text
Chi-Square Statistic: 45.1247
P-value: 0.0000
Degrees of Freedom: 4
Expected Frequencies: 
[[15.66666667 29.33333333  5.        ]
 [15.66666667 29.33333333  5.        ]
 [15.66666667 29.33333333  5.        ]]
Reject H0: There is a significant association between species and sepal width.
```


## R Code
Chi-Square Test - Association Between Categorical Variables

```{r}
# Load dataset
df <- read.csv("data/iris.csv")

# Create a contingency table for 'species' and 'sepal_width' (categorical grouping)
df$sepal_width_cat <- cut(df$sepal_width, breaks = 3)
contingency_table <- table(df$species, df$sepal_width_cat)

# Perform Chi-Square Test
chi2_result <- chisq.test(contingency_table)

# Display results
print(chi2_result)

# Interpretation
if (chi2_result$p.value < 0.05) {
    print("Reject H0: There is a significant association between species and sepal width.")
} else {
    print("Fail to reject H0: No significant association between species and sepal width.")
}
```

# How to Perform a Pearson Correlation Test in Python and R?

## Explanation  
The **Pearson Correlation Test** is used to determine the linear relationship between two continuous variables. It measures the strength and direction of the relationship, with a correlation coefficient (\(r\)) ranging from -1 to 1:

- \( r = 1 \) indicates a perfect positive linear relationship.
- \( r = -1 \) indicates a perfect negative linear relationship.
- \( r = 0 \) indicates no linear relationship.

The null hypothesis (\(H_0\)) assumes that there is no linear correlation between the two variables, while the alternative hypothesis (\(H_a\)) states that there is a linear correlation.

If the p-value is small (typically \( p < 0.05 \)), we reject the null hypothesis and conclude that there is a significant linear relationship between the two variables.
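The coefficient itself is the covariance of the two variables divided by the product of their standard deviations. A minimal sketch of that calculation with the standard library follows; the function name and the data are hypothetical, and the p-value is omitted because it requires the t distribution.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired observations
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]

r = pearson_r(x, y)
r_perfect = pearson_r(x, [2 * v + 1 for v in x])  # exact linear relationship

print(f"r = {r:.4f}")
print(f"r for an exact linear relationship = {r_perfect:.4f}")
```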

## Python Code


```python
import pandas as pd
from scipy.stats import pearsonr

# Load dataset
df = pd.read_csv("data/iris.csv")

# Select two variables for correlation test
x = df['sepal_length']
y = df['sepal_width']

# Perform Pearson Correlation Test
corr_coefficient, p_value = pearsonr(x, y)

# Display results
print(f"Pearson Correlation Coefficient: {corr_coefficient:.4f}")
print(f"P-value: {p_value:.4f}")

# Interpretation
if p_value < 0.05:
    print("Reject H0: There is a significant linear correlation between sepal length and sepal width.")
else:
    print("Fail to reject H0: No significant linear correlation between sepal length and sepal width.")
```

```text
Pearson Correlation Coefficient: -0.1176
P-value: 0.1519
Fail to reject H0: No significant linear correlation between sepal length and sepal width.
```


## R Code

```{r}
# Load dataset
df <- read.csv("data/iris.csv")

# Select two variables for correlation test
x <- df$sepal_length
y <- df$sepal_width

# Perform Pearson Correlation Test
cor_result <- cor.test(x, y)

# Display results
print(cor_result)

# Interpretation
if (cor_result$p.value < 0.05) {
    print("Reject H0: There is a significant linear correlation between sepal length and sepal width.")
} else {
    print("Fail to reject H0: No significant linear correlation between sepal length and sepal width.")
}
```

# How to Perform a Simple Linear Regression in Python and R?

## Explanation  
**Simple Linear Regression** is used to model the relationship between a dependent variable (\(Y\)) and an independent variable (\(X\)) by fitting a linear equation to observed data. The goal is to find the best-fitting line, represented by the equation:

\[
Y = \beta_0 + \beta_1 X + \epsilon
\]

Where:

- \(Y\) is the dependent variable,
- \(X\) is the independent variable,
- \(\beta_0\) is the intercept,
- \(\beta_1\) is the slope,
- \(\epsilon\) is the error term.

The null hypothesis (\(H_0\)) assumes that the slope of the regression line is zero, meaning there is no linear relationship between \(X\) and \(Y\). The alternative hypothesis (\(H_a\)) states that the slope is non-zero, indicating a significant relationship.

If the p-value is small (typically \( p < 0.05 \)), we reject the null hypothesis and conclude that there is a significant relationship between \(X\) and \(Y\).
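For a single predictor, the least-squares estimates have a closed form: the slope is the sum of cross-deviations divided by the sum of squared deviations of \(X\), and the intercept makes the line pass through the point of means. The sketch below applies these formulas with the standard library; the function name and data are hypothetical.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # line passes through (mean_x, mean_y)
    return intercept, slope


# Hypothetical observations
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

intercept, slope = fit_line(x, y)
print(f"fitted line: y = {intercept:.2f} + {slope:.2f} * x")
```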

## Python Code

```python
import pandas as pd
import statsmodels.api as sm

# Load dataset
df = pd.read_csv("data/iris.csv")

# Select independent and dependent variables
X = df['sepal_length']
y = df['sepal_width']

# Add a constant (intercept) to the independent variable
X = sm.add_constant(X)

# Perform linear regression
model = sm.OLS(y, X).fit()

# Display results
print(model.summary())

# Interpretation (look up the slope's p-value by name, not by position)
if model.pvalues['sepal_length'] < 0.05:
    print("Reject H0: There is a significant linear relationship between sepal length and sepal width.")
else:
    print("Fail to reject H0: No significant linear relationship between sepal length and sepal width.")
```

```text
                            OLS Regression Results                            
Dep. Variable:            sepal_width   R-squared:                       0.014
Model:                            OLS   Adj. R-squared:                  0.007
Method:                 Least Squares   F-statistic:                     2.074
Date:                Tue, 18 Mar 2025   Prob (F-statistic):              0.152
Time:                        14:56:34   Log-Likelihood:                -86.732
No. Observations:                 150   AIC:                             177.5
Df Residuals:                     148   BIC:                             183.5
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const            3.4189      0.254     13.484   
```

## R Code
Simple Linear Regression


```{r}
# Load dataset
df <- read.csv("data/iris.csv")

# Select independent and dependent variables
X <- df$sepal_length
y <- df$sepal_width

# Perform linear regression
model <- lm(y ~ X)

# Display results
summary(model)

# Interpretation
if (summary(model)$coefficients[2,4] < 0.05) {
    print("Reject H0: There is a significant linear relationship between sepal length and sepal width.")
} else {
    print("Fail to reject H0: No significant linear relationship between sepal length and sepal width.")
}
```

# How to Perform Multiple Linear Regression in Python and R?

## Explanation

**Multiple Linear Regression** is an extension of simple linear regression that models the relationship between a dependent variable (\(Y\)) and two or more independent variables (\(X_1, X_2, \dots, X_n\)). The equation for multiple linear regression is:

\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \epsilon
\]

Where:

- \(Y\) is the dependent variable,
- \(X_1, X_2, \dots, X_n\) are the independent variables,
- \(\beta_0\) is the intercept,
- \(\beta_1, \beta_2, \dots, \beta_n\) are the coefficients (slopes) for the predictors,
- \(\epsilon\) is the error term.

For the model as a whole, the null hypothesis (\(H_0\)) assumes that all slope coefficients (\(\beta_1, \dots, \beta_n\)) are zero, meaning no linear relationship exists between the predictors and the dependent variable. The alternative hypothesis (\(H_a\)) states that at least one slope coefficient is non-zero. This is the hypothesis tested by the overall F-statistic of the regression.

If the p-value for a coefficient is small (typically \(p < 0.05\)), we reject the null hypothesis for that predictor and conclude that it significantly affects \(Y\).
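The coefficient estimates themselves solve the normal equations \((X^\top X)\,\beta = X^\top Y\), where \(X\) includes a column of ones for the intercept. The sketch below builds and solves that system with plain Python lists so each step is visible. The helper names and the data are hypothetical, and the response is generated exactly as \(1 + 2X_1 + 3X_2\) so the recovered coefficients can be checked by eye. In practice a library such as statsmodels does this more robustly.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so row operations apply to both sides
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols_coefficients(predictors, response):
    """Return [b0, b1, ..., bn] from the normal equations."""
    rows = [[1.0] + list(p) for p in predictors]  # prepend intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, response)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data with an exact linear relationship: y = 1 + 2*x1 + 3*x2
predictors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
response = [1 + 2 * x1 + 3 * x2 for x1, x2 in predictors]

coefficients = ols_coefficients(predictors, response)
print([round(c, 6) for c in coefficients])
```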

## Python Code


The response variable must not also appear among the predictors. If `sepal_length` is included on both sides, the model reproduces it perfectly (an R-squared of 1) and the coefficients carry no information, so the example below predicts sepal length from the other three measurements.

```python
import pandas as pd
import statsmodels.api as sm

# Load dataset (you can use any dataset here)
df = pd.read_csv("data/iris.csv")

# Select independent variables (predictors); the response is excluded
X = df[['sepal_width', 'petal_length', 'petal_width']]

# Add a constant (intercept) to the model
X = sm.add_constant(X)

# Dependent variable (response)
y = df['sepal_length']

# Fit the model
model = sm.OLS(y, X).fit()

# Display the results
print(model.summary())
```

## R Code
Multiple Linear Regression

```{r}
# Check if 'caret' is installed, if not, install it
if (!require(caret)) {
  # Set CRAN mirror
  options(repos = c(CRAN = "https://cran.rstudio.com/"))
  install.packages("caret")
  library(caret)
}
# Load dataset (use any dataset available)
df <- read.csv("data/iris.csv")

# Create training and test sets
set.seed(123)  # For reproducibility
trainIndex <- createDataPartition(df$sepal_length, p = 0.7, list = FALSE)
trainData <- df[trainIndex, ]
testData <- df[-trainIndex, ]

# Train the linear regression model
model <- train(sepal_length ~ sepal_width + petal_length + petal_width,
               data = trainData,
               method = "lm")

# Display the model details
print(model)

# Predict on the test set
predictions <- predict(model, testData)

# Display predictions and actual values
predictions
testData$sepal_length

# Evaluate the model using RMSE (Root Mean Squared Error)
rmse <- sqrt(mean((predictions - testData$sepal_length)^2))
cat("RMSE: ", rmse)
```