## Similarity of each variable to the normal distribution

In [8]:
import pandas as pd
import numpy as np
from scipy.stats import norm, skew, kurtosis
from scipy.stats import shapiro
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pingouin as pg

In [9]:
data = pd.read_csv('../Dataset/csv_format/2023Data.csv')

In [10]:
# Drop the 'Ladder score' column and its standard error
data.drop(columns=['Ladder score', 'Standard error of ladder score'], inplace=True)

# Drop the row where the country name is 'State of Palestine'
data = data[data['Country name'] != 'State of Palestine']

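We can assess the normality of each variable using skewness, kurtosis, and a normality test such as the Shapiro-Wilk test. For a normal distribution the skewness is 0 and the excess kurtosis (which `scipy.stats.kurtosis` returns by default) is also 0. A Shapiro-Wilk p-value below 0.05 is conventionally read as evidence against normality.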
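To make these three diagnostics easier to compare across columns, they can be collected into one table. The following is a minimal sketch on synthetic data, not part of the original analysis: `normality_summary`, the 0.05 threshold, and the `demo` frame are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import kurtosis, shapiro, skew


def normality_summary(df, alpha=0.05):
    """Hypothetical helper: one row of normality diagnostics per numeric column."""
    rows = []
    for col in df.select_dtypes(include="number").columns:
        values = df[col].dropna()
        # Shapiro-Wilk needs at least 3 points and says nothing useful
        # about a constant column, so such columns are skipped.
        if len(values) < 3 or values.nunique() < 2:
            continue
        stat, p_val = shapiro(values)
        rows.append({
            "column": col,
            "skewness": skew(values),
            "excess_kurtosis": kurtosis(values),  # Fisher definition: normal = 0
            "shapiro_stat": stat,
            "p_value": p_val,
            "reject_normality": p_val < alpha,
        })
    return pd.DataFrame(rows).set_index("column")


# Synthetic demo data (assumed, for illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(size=200),
    "right_skewed": rng.exponential(size=200),
    "constant": np.ones(200),
})
summary = normality_summary(demo)
print(summary.round(3))
```

Dropping missing values and skipping constant columns before testing avoids the `nan` results and runtime warnings that such columns can otherwise produce.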

In [11]:
# Select numerical columns
numerical_cols = data.select_dtypes(include=['float64', 'int64']).columns

# Loop through each numerical column
for col in numerical_cols:
    # Calculate skewness and kurtosis
    skew_val = skew(data[col])
    kurt_val = kurtosis(data[col])
    print(f"Skewness of {col}: {skew_val:.2f}")
    print(f"Kurtosis of {col}: {kurt_val:.2f}")
    
    # Shapiro-Wilk test for normality
    stat, p_val = shapiro(data[col])
    print(f"Shapiro-Wilk test for {col}: statistic={stat:.4f}, p-value={p_val:.4f}")
    print("-" * 30)

Skewness of upperwhisker: -0.51
Kurtosis of upperwhisker: 0.17
Shapiro-Wilk test for upperwhisker: statistic=0.9801, p-value=0.0437
------------------------------
Skewness of lowerwhisker: -0.42
Kurtosis of lowerwhisker: -0.11
Shapiro-Wilk test for lowerwhisker: statistic=0.9831, p-value=0.0898
------------------------------
Skewness of Logged GDP per capita: -0.50
Kurtosis of Logged GDP per capita: -0.35
Shapiro-Wilk test for Logged GDP per capita: statistic=0.9641, p-value=0.0012
------------------------------
Skewness of Social support: -0.96
Kurtosis of Social support: 0.41
Shapiro-Wilk test for Social support: statistic=0.9173, p-value=0.0000
------------------------------
Skewness of Healthy life expectancy: -0.40
Kurtosis of Healthy life expectancy: -0.79
Shapiro-Wilk test for Healthy life expectancy: statistic=0.9568, p-value=0.0003
------------------------------
Skewness of Freedom to make life choices: -0.94
Kurtosis of Freedom to make life choices: 1.01
Shapiro-Wilk test for



## Internal consistency

Internal consistency describes how reliably a set of variables or items that are meant to capture the same construct actually do so. For this dataset, which contains various socioeconomic indicators, we can calculate Cronbach's alpha across all numerical columns to assess internal consistency.

In [12]:
# Select the numerical columns to treat as items
columns = data.select_dtypes(include=['float64', 'int64']).columns

# Calculate Cronbach's alpha; pingouin returns a tuple of (alpha, confidence interval)
cronbach_alpha = pg.cronbach_alpha(data=data[columns])

print(f"Cronbach's alpha for the dataset: {cronbach_alpha[0]:.4f}")

Cronbach's alpha for the dataset: 0.6434


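#### Cronbach's alpha is a measure of internal consistency that usually falls between 0 and 1, although it can be negative when items are negatively correlated. A higher value (e.g., > 0.7) is conventionally taken to indicate that the variables measure the same underlying construct consistently; the value obtained here, 0.6434, falls below that threshold.
#### The code delegates the calculation to `pg.cronbach_alpha`, which works from the covariance matrix of the numerical variables (the ratio of the summed item variances to the variance of the summed score) rather than from the correlation matrix.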
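For reference, the usual variance-based definition of Cronbach's alpha is short enough to write out by hand. The sketch below is an illustration on synthetic data, not pingouin's implementation: `cronbach_alpha_manual` and both demo frames are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd


def cronbach_alpha_manual(items):
    """Hypothetical helper: alpha = k/(k-1) * (1 - sum(item variances) / variance(total))."""
    items = items.dropna()                           # listwise deletion of incomplete rows
    k = items.shape[1]                               # number of items (columns)
    item_variances = items.var(axis=0, ddof=1)       # sample variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of the row-wise sum
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)


# Synthetic demo data (assumed, for illustration only)
rng = np.random.default_rng(0)
base = rng.normal(size=100)

# Three identical items are perfectly consistent
identical = pd.DataFrame({"a": base, "b": base, "c": base})

# Four noisy measurements of the same underlying signal
noisy = pd.DataFrame(
    {f"item_{i}": base + rng.normal(scale=0.5, size=100) for i in range(4)}
)

print(f"identical items: {cronbach_alpha_manual(identical):.4f}")
print(f"noisy items:     {cronbach_alpha_manual(noisy):.4f}")
```

Because the formula depends on raw variances, columns measured on very different scales can dominate the result, which is worth keeping in mind when the items are heterogeneous indicators rather than questionnaire responses.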

## Correlation

We can calculate the correlation between the variables to understand the strength and direction of the linear relationship between them.

In [13]:
# Calculate correlation matrix
corr_matrix = data.select_dtypes(include=['float64', 'int64']).corr()
print("Correlation matrix:\n", corr_matrix)

Correlation matrix:
                                             upperwhisker  lowerwhisker  \
upperwhisker                                    1.000000      0.997714   
lowerwhisker                                    0.997714      1.000000   
Logged GDP per capita                           0.776073      0.790438   
Social support                                  0.835510      0.839536   
Healthy life expectancy                         0.737112      0.755561   
Freedom to make life choices                    0.663457      0.659802   
Generosity                                      0.044760      0.035022   
Perceptions of corruption                      -0.467916     -0.472646   
Ladder score in Dystopia                             NaN           NaN   
Explained by: Log GDP per capita                0.776049      0.790412   
Explained by: Social support                    0.835608      0.839646   
Explained by: Healthy life expectancy           0.736880      0.755335   
Explained by: Fre

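#### The correlation matrix shows the pairwise Pearson correlation coefficients between all numerical variables. The `NaN` entries for `Ladder score in Dystopia` are what `corr()` typically returns for a column with no variance.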
#### A correlation coefficient ranges from -1 to 1, where -1 indicates a perfect negative correlation, 0 indicates no correlation, and 1 indicates a perfect positive correlation.
#### The correlation matrix can help identify multicollinearity issues (highly correlated variables) and potential relationships between variables.
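A full correlation matrix is hard to scan for multicollinearity, so it can help to list only the strongly correlated pairs. The following is a minimal sketch on synthetic data: `high_corr_pairs`, the 0.9 threshold, and the `demo` frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.9):
    """Hypothetical helper: variable pairs whose |Pearson r| is at least `threshold`."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal of ones is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN comparisons evaluate to False, so undefined correlations are dropped here too.
    strong = pairs[pairs.abs() >= threshold]
    return strong.sort_values(key=np.abs, ascending=False)


# Synthetic demo data (assumed, for illustration only)
rng = np.random.default_rng(0)
a = rng.normal(size=100)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + 1,              # exact linear function of "a"
    "c": rng.normal(size=100),   # independent noise
})
pairs = high_corr_pairs(demo)
print(pairs)
```

Applied to the numerical columns of the dataset, a helper like this would surface pairs such as `upperwhisker` and `lowerwhisker`, whose correlation in the matrix above is 0.997714.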

## Outlier detection

Outliers can significantly affect the analysis and modeling of data. We can use various techniques to detect them, such as the Z-score method or Tukey's method.

In [14]:
# Select the numerical columns once and reuse them
numeric = data.select_dtypes(include=['float64', 'int64'])

# Detect outliers using Z-score method
z_scores = np.abs((numeric - numeric.mean()) / numeric.std())
outliers = (z_scores > 3).any(axis=1)
print(f"Number of outliers detected (Z-score method): {outliers.sum()}")

# Detect outliers using Tukey's method
Q1 = numeric.quantile(0.25)
Q3 = numeric.quantile(0.75)
IQR = Q3 - Q1
outliers = (numeric < (Q1 - 1.5 * IQR)) | (numeric > (Q3 + 1.5 * IQR))
print(f"Number of outliers detected (Tukey's method): {outliers.any(axis=1).sum()}")

Number of outliers detected (Z-score method): 8
Number of outliers detected (Tukey's method): 24


#### The Z-score method calculates the number of standard deviations a data point is away from the mean. Data points with an absolute Z-score greater than 3 are often considered outliers.
#### Tukey's method is based on the interquartile range (IQR = Q3 - Q1). Data points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are considered outliers.
#### Both methods identify the number of rows (observations) that contain at least one outlier value across all numerical columns.
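Row-level counts do not show which variables are responsible for the flagged observations. A per-column breakdown using Tukey's fences could look like the sketch below, which runs on synthetic data: `tukey_outlier_counts` and the `demo` frame are hypothetical names introduced for illustration.

```python
import pandas as pd


def tukey_outlier_counts(df, k=1.5):
    """Hypothetical helper: number of values outside Tukey's fences, per numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # A value is flagged if it lies below Q1 - k*IQR or above Q3 + k*IQR
    flags = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return flags.sum().sort_values(ascending=False)


# Synthetic demo data (assumed, for illustration only)
demo = pd.DataFrame({
    "with_outlier": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "clean": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})
counts = tukey_outlier_counts(demo)
print(counts)
```

Summing the flags per column rather than taking `any(axis=1)` keeps the same fences as the row-level count but attributes each flagged value to its variable.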