# (PART) STATISTICAL ANALYSIS {-}

# How to Perform Statistical Analysis in Python and R?

## Explanation

Statistical analysis helps us understand the characteristics of our dataset, identify patterns, and make data-driven decisions. In this section, we will cover basic statistical measures such as mean, median, variance, and correlation.

```{r, echo=FALSE, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE,
  cache = FALSE,
  comment = NA
)

# Install tidyverse on first use, then load it
if (!require("tidyverse")) {
  install.packages("tidyverse")
  library(tidyverse)
}
```

## Python Code

```python
import pandas as pd

# Load dataset
df = pd.read_csv("data/iris.csv")

# Summary statistics
summary_stats = df.describe()

# Calculate variance for numerical columns
variance = df.var(numeric_only=True)

# Calculate correlation between numerical variables
correlation = df.corr(numeric_only=True)

# Display results
print("Summary Statistics:\n", summary_stats)
print("\nVariance:\n", variance)
print("\nCorrelation:\n", correlation)
```

```text
Summary Statistics:
        sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

Variance:
 sepal_length    0.685694
sepal_width     0.189979
petal_length    3.116278
petal_width     0.581006
dtype: float64

Correlation:
               sepal_length  sepal_width  petal_length  petal_width
sepal_length      1.000000    -0.117570      0.871754     0.817941
sepal_width      -0.117570     1.000000     -0.428440    -0.366126
petal_length      0.871754    -0.428440      1.000000     0.962865
petal_width       0.817941    -0.366126      0.962865     1.000000
```
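
To make the variance and correlation figures less of a black box, here is a minimal sketch of the two formulas using only the Python standard library. The helper names `sample_variance` and `pearson_correlation` and the toy lists are illustrative assumptions, not part of the original analysis. The variance divides by `n - 1`, which is the default in pandas (`ddof=1`), and the correlation is the Pearson coefficient, which is also the pandas default.

```python
import math


def sample_variance(values):
    """Sample variance: squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def pearson_correlation(xs, ys):
    """Pearson correlation: co-deviation scaled by both spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(ss_x * ss_y)


# Hypothetical toy measurements (not taken from iris.csv)
lengths = [1.0, 2.0, 3.0, 4.0, 5.0]
widths = [2.0, 4.0, 6.0, 8.0, 10.0]  # exactly twice the lengths

var_lengths = sample_variance(lengths)
corr = pearson_correlation(lengths, widths)

print("Variance of lengths:", var_lengths)
print("Correlation:", corr)
```

Because `widths` is an exact multiple of `lengths`, the correlation comes out at 1, the upper bound of the coefficient.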

## R Code

```{r}
# Load dataset
df <- read.csv("data/iris.csv")

# Summary statistics
summary_stats <- summary(df)

# Calculate variance for numerical columns
variance <- apply(df[, 1:4], 2, var)

# Calculate correlation between numerical variables
correlation <- cor(df[, 1:4])

# Display results (cat() renders "\n" as a line break; print() would show it literally)
cat("Summary Statistics:\n")
print(summary_stats)
cat("\nVariance:\n")
print(variance)
cat("\nCorrelation:\n")
print(correlation)
```

# How to Calculate Skewness and Kurtosis in Python and R?

## Explanation

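Skewness and kurtosis describe the shape of a distribution.

- **Skewness** measures the asymmetry of the distribution. A skewness of 0 indicates a perfectly symmetric distribution; positive values indicate a longer right tail and negative values a longer left tail.
- **Kurtosis** measures the "tailedness" of the distribution. A normal distribution has a kurtosis of 3. Note that `scipy.stats.kurtosis` (by default) and `e1071::kurtosis` report *excess* kurtosis, which subtracts 3 so that a normal distribution scores 0. On that scale, positive values indicate heavier tails than the normal distribution and negative values indicate lighter tails.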
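
The definitions above can be sketched directly from central moments. This is a minimal illustration using only the Python standard library; the function names, the `excess` flag, and the toy lists are assumptions made for the example. It uses the simple moment-based (biased) estimators, which is what `scipy.stats.skew` and `scipy.stats.kurtosis` compute by default.

```python
def central_moment(values, k):
    """k-th central moment: mean of (value - mean) ** k."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** k for v in values) / n


def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values, excess=True):
    """Moment-based kurtosis: m4 / m2 ** 2, minus 3 when excess=True."""
    raw = central_moment(values, 4) / central_moment(values, 2) ** 2
    return raw - 3.0 if excess else raw


# Hypothetical toy data (not taken from iris.csv)
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.0, 1.0, 2.0, 10.0]

print("Skewness (symmetric):", skewness(symmetric))
print("Skewness (right-skewed):", skewness(right_skewed))
print("Excess kurtosis (symmetric):", kurtosis(symmetric))
print("Raw kurtosis (symmetric):", kurtosis(symmetric, excess=False))
```

The symmetric list has zero skewness, while the list with one large value has positive skewness. The two kurtosis calls differ by exactly 3, which is the difference between the excess scale (normal distribution at 0) and the raw scale (normal distribution at 3).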

## Python Code



```python
import pandas as pd
from scipy.stats import skew, kurtosis

# Load dataset
df = pd.read_csv("data/iris.csv")

# Compute skewness for every column except the last one (species)
skewness = df.iloc[:, :-1].apply(skew)

# Compute kurtosis for every column except the last one (species)
kurt = df.iloc[:, :-1].apply(kurtosis)

# Display results
print("Skewness:\n", skewness)
print("\nKurtosis:\n", kurt)
```

```text
Skewness:
 sepal_length    0.311753
sepal_width     0.315767
petal_length   -0.272128
petal_width    -0.101934
dtype: float64

Kurtosis:
 sepal_length   -0.573568
sepal_width     0.180976
petal_length   -1.395536
petal_width    -1.336067
dtype: float64
```


## R Code

```{r}
# Check and load necessary libraries from CRAN mirror
if(!require(tidyverse)) install.packages("tidyverse", dependencies = TRUE, repos = "https://cloud.r-project.org/")
if(!require(e1071)) install.packages("e1071", dependencies = TRUE, repos = "https://cloud.r-project.org/")

library(tidyverse)
library(e1071)

# Load dataset
df <- read_csv("data/iris.csv")

# Compute skewness and kurtosis
skewness_values <- df %>%
  select(-species) %>%
  summarise(across(everything(), skewness))

kurtosis_values <- df %>%
  select(-species) %>%
  summarise(across(everything(), kurtosis))

# Display results
print("Skewness:")
print(skewness_values)

print("Kurtosis:")
print(kurtosis_values)
```

# How to Perform a t-test in Python and R?

## Explanation

**t-tests** are used to compare the means of two groups and determine whether they are significantly different from each other. In the iris dataset, we can compare the sepal length of two species to see if their means differ significantly.

There are different types of t-tests:

**Independent t-test**: Compares means between two independent groups.

**Paired t-test**: Compares the means of two sets of measurements taken on the same subjects, for example before and after a treatment.
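
To show how the two variants differ, here is a minimal sketch of both test statistics using only the Python standard library. The function names and the toy numbers are assumptions made for illustration. The independent version uses the pooled-variance (equal variances) form of Student's t-test, and the paired version works on the per-subject differences. Only the t-statistic and degrees of freedom are computed; the p-value requires the t distribution, which is not in the standard library.

```python
import math
from statistics import mean, stdev, variance


def independent_t(a, b):
    """Student's t statistic for two independent groups (pooled variance)."""
    n_a, n_b = len(a), len(b)
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / dof
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / std_err, dof


def paired_t(first, second):
    """Paired t statistic computed on the per-subject differences."""
    diffs = [y - x for x, y in zip(first, second)]
    n = len(diffs)
    std_err = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / std_err, n - 1


# Hypothetical toy data (not taken from iris.csv)
group_a = [5.0, 6.0, 7.0]
group_b = [8.0, 9.0, 10.0]
before = [5.0, 6.0, 7.0]
after = [6.0, 8.0, 10.0]

t_ind, dof_ind = independent_t(group_a, group_b)
t_pair, dof_pair = paired_t(before, after)

print(f"Independent: t = {t_ind:.4f}, dof = {dof_ind}")
print(f"Paired:      t = {t_pair:.4f}, dof = {dof_pair}")
```

The paired test has fewer degrees of freedom because it analyses one difference per subject rather than two separate samples.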

## Python Code

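In Python, we use **scipy.stats.ttest_ind()** for an independent t-test. By default it assumes that both groups have equal variances (Student's t-test); pass `equal_var=False` to run Welch's t-test instead.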

```python
import pandas as pd
from scipy import stats

# Load dataset
df = pd.read_csv("data/iris.csv")

# Filter two species for comparison
setosa = df[df['species'] == 'setosa']['sepal_length']
versicolor = df[df['species'] == 'versicolor']['sepal_length']

# Perform independent t-test
t_stat, p_value = stats.ttest_ind(setosa, versicolor)

print(f"t-statistic: {t_stat}, p-value: {p_value}")
```

```text
t-statistic: -10.52098626754911, p-value: 8.985235037487079e-18
```


# How to Compute the Mean, Median, and Mode of a Dataset?

## Explanation
- **Mean**: The average of all values in the dataset.
- **Median**: The middle value when the data is sorted.
- **Mode**: The value that appears most frequently in the dataset.
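
As a quick illustration of these three definitions, the following sketch uses Python's built-in `statistics` module on two small, hypothetical lists (they are not taken from the iris data). `multimode` returns every value tied for most frequent, which makes the behaviour with ties explicit.

```python
from statistics import mean, median, multimode

# Hypothetical toy data with a single most frequent value
values = [2.0, 3.0, 3.0, 5.0, 7.0]

mean_value = mean(values)        # sum of the values divided by their count
median_value = median(values)    # middle value of the sorted data
mode_values = multimode(values)  # all values tied for most frequent

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode(s):", mode_values)

# Hypothetical toy data where two values are equally frequent
tied = [1.0, 1.0, 2.0, 2.0, 3.0]
tied_modes = multimode(tied)
print("Mode(s) with a tie:", tied_modes)
```

When several values tie, a dataset has more than one mode. This matters for the pandas approach of calling `.mode().iloc[0]`, which keeps only the first of the tied values in each column.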

## Python Code


```python
import pandas as pd

# Load dataset
df = pd.read_csv("data/iris.csv")

# Compute mean, median, and mode
mean_values = df.drop(columns=["species"]).mean()
median_values = df.drop(columns=["species"]).median()
mode_values = df.drop(columns=["species"]).mode().iloc[0]

# Display results
print("Mean:\n")
print(mean_values)

print("\nMedian:\n")
print(median_values)

print("\nMode:\n")
print(mode_values)
```

```text
Mean:

sepal_length    5.843333
sepal_width     3.057333
petal_length    3.758000
petal_width     1.199333
dtype: float64

Median:

sepal_length    5.80
sepal_width     3.00
petal_length    4.35
petal_width     1.30
dtype: float64

Mode:

sepal_length    5.0
sepal_width     3.0
petal_length    1.4
petal_width     0.2
Name: 0, dtype: float64
```


## R Code

```{r}
# Load necessary libraries
library(tidyverse)

# Load dataset
df <- read_csv("data/iris.csv")

# Compute mean, median, and mode
mean_values <- df %>%
  select(-species) %>%
  summarise(across(everything(), mean))

median_values <- df %>%
  select(-species) %>%
  summarise(across(everything(), median))

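# Most frequent value per column; table() names are character, so convert back to numeric
mode_values <- df %>%
  select(-species) %>%
  summarise(across(everything(), ~ as.numeric(names(sort(table(.x), decreasing = TRUE))[1])))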

# Display results
print("Mean:")
print(mean_values)

print("Median:")
print(median_values)

print("Mode:")
print(mode_values)

```