## Similarity of each variable to the normal distribution

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import norm, skew, kurtosis
from scipy.stats import shapiro
from statsmodels.stats.outliers_influence import variance_inflation_factor

We can assess the normality of each variable using skewness, kurtosis, and normality tests like the Shapiro-Wilk test.

In [2]:
# Load the data
data = pd.read_csv('2023Data.csv')

# Select numerical columns
numerical_cols = data.select_dtypes(include=['float64', 'int64']).columns

# Loop through each numerical column
for col in numerical_cols:
    # Drop missing values so the statistics are not returned as NaN
    values = data[col].dropna()

    # Skewness and excess (Fisher) kurtosis; a normal distribution scores 0 on both
    skew_val = skew(values)
    kurt_val = kurtosis(values)
    print(f"Skewness of {col}: {skew_val:.2f}")
    print(f"Kurtosis of {col}: {kurt_val:.2f}")

    # Shapiro-Wilk test for normality
    stat, p_val = shapiro(values)
    print(f"Shapiro-Wilk test for {col}: statistic={stat:.4f}, p-value={p_val:.4f}")
    print("-" * 30)

Skewness of Ladder score: -0.45
Kurtosis of Ladder score: 0.02
Shapiro-Wilk test for Ladder score: statistic=0.9825, p-value=0.0767
------------------------------
Skewness of Standard error of ladder score: 1.17
Kurtosis of Standard error of ladder score: 1.57
Shapiro-Wilk test for Standard error of ladder score: statistic=0.9199, p-value=0.0000
------------------------------
Skewness of upperwhisker: -0.50
Kurtosis of upperwhisker: 0.17
Shapiro-Wilk test for upperwhisker: statistic=0.9808, p-value=0.0508
------------------------------
Skewness of lowerwhisker: -0.41
Kurtosis of lowerwhisker: -0.11
Shapiro-Wilk test for lowerwhisker: statistic=0.9838, p-value=0.1059
------------------------------
Skewness of Logged GDP per capita: -0.48
Kurtosis of Logged GDP per capita: -0.36
Shapiro-Wilk test for Logged GDP per capita: statistic=0.9653, p-value=0.0015
------------------------------
Skewness of Social support: -0.97
Kurtosis of Social support: 0.44
Shapiro-Wilk test for Social support



#### Skewness measures the asymmetry of a distribution, with a value of 0 indicating perfect symmetry. Positive skewness indicates a longer right tail, while negative skewness indicates a longer left tail.
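#### Kurtosis measures how heavy the tails of a distribution are (it is often loosely described as "peakedness"). `scipy.stats.kurtosis` returns excess (Fisher) kurtosis by default, so a normal distribution scores 0 rather than 3. Positive values indicate heavier tails than the normal distribution, while negative values indicate lighter tails.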
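#### The Shapiro-Wilk test checks the null hypothesis that a sample comes from a normal distribution. A low p-value (e.g., < 0.05) is evidence that the data is not normally distributed, while a higher p-value only means normality cannot be rejected. In the output above, normality is rejected for Standard error of ladder score (p = 0.0000) and Logged GDP per capita (p = 0.0015), but not for Ladder score (p = 0.0767).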
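
The kurtosis convention affects how the numbers above should be read. The following is a minimal sketch on synthetic samples (not the values from `2023Data.csv`) that reports both kurtosis definitions offered by `scipy.stats.kurtosis` alongside skewness and the Shapiro-Wilk p-value. The helper name `describe_shape` and the two samples are illustrative assumptions.

```python
import numpy as np
from scipy.stats import kurtosis, shapiro, skew

# Synthetic samples for illustration only (assumption: not taken from the dataset)
rng = np.random.default_rng(0)
normal_sample = rng.normal(size=500)
skewed_sample = rng.exponential(size=500)


def describe_shape(values):
    """Return skewness, both kurtosis conventions, and the Shapiro-Wilk p-value."""
    _, p_value = shapiro(values)
    return {
        "skewness": skew(values),
        # Fisher (excess) kurtosis: the scipy default, normal distribution -> 0
        "excess_kurtosis": kurtosis(values),
        # Pearson kurtosis: normal distribution -> 3
        "pearson_kurtosis": kurtosis(values, fisher=False),
        "shapiro_p": p_value,
    }


normal_summary = describe_shape(normal_sample)
skewed_summary = describe_shape(skewed_sample)
print(normal_summary)
print(skewed_summary)
```

The two kurtosis values always differ by exactly 3, so a reported kurtosis near 0 from the default call is what a roughly normal variable would produce.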

## Internal consistency

Internal consistency measures the reliability of a set of variables or items that are meant to measure the same construct. This dataset contains various socioeconomic indicators, so we can calculate Cronbach's alpha across its numerical columns to assess their internal consistency.

In [3]:
# Calculate an alpha-style statistic from the pairwise correlations
# Note: this is not the standard Cronbach's alpha formula, and any NaN
# correlation (e.g., from a constant column) propagates to the result
numeric = data.select_dtypes(include=['float64', 'int64'])
corr = numeric.corr()
alpha = corr.values[np.triu_indices_from(corr.values, k=1)]
alpha = alpha[alpha != 1]
alpha = 1 - alpha.sum() / ((len(alpha) - len(numeric.columns)) / 2)
print(f"Cronbach's alpha for the dataset: {alpha:.4f}")

Cronbach's alpha for the dataset: nan


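#### Cronbach's alpha is a measure of internal consistency that usually falls between 0 and 1, although it can be negative when items are negatively correlated on average. A higher value (e.g., > 0.7) suggests that the variables consistently measure the same underlying construct.
#### The code computes the correlation matrix of the numerical variables and combines its off-diagonal (upper-triangle) elements into a single value. This expression is not the standard Cronbach's alpha formula, which is based on the item variances and the variance of the total score. The result is nan because at least one correlation is NaN (Ladder score in Dystopia shows NaN in the correlation matrix below), and NaN propagates through the sum.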
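
For comparison, Cronbach's alpha is conventionally defined from variances: alpha = k / (k - 1) * (1 - sum of item variances / variance of the total score), where k is the number of items. The following is a minimal sketch of that definition. The helper name `cronbach_alpha` and the toy data are assumptions for illustration and are not part of the original analysis. Dropping incomplete rows and constant columns first is one possible choice for avoiding a nan result.

```python
import pandas as pd


def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha from item variances (rows = observations, columns = items)."""
    items = items.dropna()                          # listwise deletion of incomplete rows
    items = items.loc[:, items.var(ddof=1) > 0]     # drop constant (zero-variance) columns
    k = items.shape[1]
    if k < 2:
        return float("nan")                         # alpha is undefined for fewer than 2 items
    item_variance_sum = items.var(ddof=1).sum()
    total_score_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variance_sum / total_score_variance)


# Toy items for illustration only (assumption: not taken from the dataset)
toy = pd.DataFrame({
    "item_a": [1, 2, 3, 4, 5],
    "item_b": [2, 3, 4, 5, 6],
    "item_c": [1, 3, 3, 5, 5],
    "constant": [7, 7, 7, 7, 7],
})
toy_alpha = cronbach_alpha(toy)
print(f"Cronbach's alpha for the toy items: {toy_alpha:.4f}")
```

Alpha is only meaningful when the items are on comparable scales and point in the same direction, so applying it across every numerical column of this dataset (where, for example, Perceptions of corruption correlates negatively with Ladder score) would need care in interpretation.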

## Correlation

We can calculate the correlation between the variables to understand the strength and direction of the linear relationship between them.

In [4]:
# Calculate correlation matrix
corr_matrix = data.select_dtypes(include=['float64', 'int64']).corr()
print("Correlation matrix:\n", corr_matrix)

Correlation matrix:
                                             Ladder score  \
Ladder score                                    1.000000   
Standard error of ladder score                 -0.512628   
upperwhisker                                    0.999401   
lowerwhisker                                    0.999448   
Logged GDP per capita                           0.784367   
Social support                                  0.834532   
Healthy life expectancy                         0.746928   
Freedom to make life choices                    0.662924   
Generosity                                      0.044082   
Perceptions of corruption                      -0.471911   
Ladder score in Dystopia                             NaN   
Explained by: Log GDP per capita                0.784342   
Explained by: Social support                    0.834604   
Explained by: Healthy life expectancy           0.746699   
Explained by: Freedom to make life choices      0.662909   
Explained by: Gener

#### The correlation matrix shows the pairwise correlation coefficients between all numerical variables.
#### A correlation coefficient ranges from -1 to 1, where -1 indicates a perfect negative correlation, 0 indicates no correlation, and 1 indicates a perfect positive correlation.
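#### The correlation matrix can help identify multicollinearity (highly correlated variables) and potential relationships between variables. For example, upperwhisker and lowerwhisker both correlate with Ladder score at about 0.999, and each "Explained by" column shown correlates with Ladder score almost identically to its raw counterpart. A NaN entry, as for Ladder score in Dystopia, typically indicates a column with no variance.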
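
Scanning a large correlation matrix by eye is error-prone, so one option is to list only the strongly correlated pairs. The following is a minimal sketch under stated assumptions: the helper name `high_correlation_pairs`, the 0.9 threshold, and the synthetic frame are all illustrative and not taken from the original analysis.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(frame: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return variable pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair is listed once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # NaN correlations (e.g., from constant columns) fail the comparison and are excluded
    selected = pairs[pairs.abs() > threshold]
    return selected.sort_values(key=np.abs, ascending=False)


# Synthetic data for illustration only (assumption: not taken from the dataset)
rng = np.random.default_rng(1)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "score": base,
    "score_upper": base + 0.05 * rng.normal(size=200),  # nearly a copy of "score"
    "unrelated": rng.normal(size=200),
    "constant": np.ones(200),                           # zero variance -> NaN correlations
})
flagged = high_correlation_pairs(toy)
print(flagged)
```

Pairs flagged this way are candidates for dropping or combining before fitting a regression model, since near-duplicate predictors make coefficient estimates unstable.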

## Outlier detection

Outliers can significantly impact the analysis and modeling of data. We can use various techniques to detect them, such as the Z-score method or Tukey's method (the IQR rule).

In [5]:
# Select the numerical columns once and reuse them
numeric = data.select_dtypes(include=['float64', 'int64'])

# Detect outliers using Z-score method
z_scores = np.abs((numeric - numeric.mean()) / numeric.std())
outliers = (z_scores > 3).any(axis=1)
print(f"Number of outliers detected (Z-score method): {outliers.sum()}")

# Detect outliers using Tukey's method
Q1 = numeric.quantile(0.25)
Q3 = numeric.quantile(0.75)
IQR = Q3 - Q1
outliers = (numeric < (Q1 - 1.5 * IQR)) | (numeric > (Q3 + 1.5 * IQR))
print(f"Number of outliers detected (Tukey's method): {outliers.any(axis=1).sum()}")

Number of outliers detected (Z-score method): 10
Number of outliers detected (Tukey's method): 26


#### The Z-score method calculates the number of standard deviations a data point is away from the mean. Data points with an absolute Z-score greater than 3 are often considered outliers.
#### Tukey's method is based on the interquartile range (IQR = Q3 - Q1). Data points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are considered outliers.
#### Both counts are the number of rows (observations) that contain at least one outlier value in any numerical column.
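
The two methods can disagree because an extreme value inflates the mean and standard deviation that the Z-score itself relies on, whereas quartiles are barely affected. This may be one reason the Tukey count above is larger, although that is not verified here. The following is a minimal sketch on a toy column. The helper names `zscore_outlier_mask` and `tukey_outlier_mask` and the toy values are assumptions for illustration.

```python
import pandas as pd


def zscore_outlier_mask(frame: pd.DataFrame, threshold: float = 3.0) -> pd.DataFrame:
    """Flag cells more than `threshold` sample standard deviations from the column mean."""
    z_scores = (frame - frame.mean()) / frame.std()
    return z_scores.abs() > threshold


def tukey_outlier_mask(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Flag cells below Q1 - k * IQR or above Q3 + k * IQR for each column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    return (frame < q1 - k * iqr) | (frame > q3 + k * iqr)


# Toy column with one extreme value (assumption: not taken from the dataset)
toy = pd.DataFrame({"x": [10, 11, 12, 11, 10, 12, 11, 10, 12, 11, 30]})

# Count rows with at least one flagged value, mirroring the cell above
z_count = int(zscore_outlier_mask(toy).any(axis=1).sum())
tukey_count = int(tukey_outlier_mask(toy).any(axis=1).sum())
print(f"Z-score method: {z_count}, Tukey's method: {tukey_count}")
```

In this toy column the value 30 sits just under 3 standard deviations from the mean because it inflates the standard deviation itself, so only the IQR rule flags it.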