## Similarity of each variable to the normal distribution

In [10]:
import pandas as pd
import numpy as np
from scipy.stats import norm, skew, kurtosis
from scipy.stats import shapiro
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pingouin as pg

In [11]:
# Load the data
data = pd.read_csv('../Dataset/csv_format/combined.csv')


In [12]:
# Drop the 'Life Ladder' column
data.drop(columns=['Life Ladder'], inplace=True)

In [13]:
# Check for missing values
missing_values = data.isnull().sum()
print("Missing Values:")
print(missing_values)

# Check for duplicate rows
duplicate_rows = data.duplicated().sum()
print("\nDuplicate Rows:")
print(duplicate_rows)

# Count negative values in each numerical column as a rough sanity check
# Note: a negative value is not necessarily an error (Generosity, for example, can legitimately be negative)
numeric_columns = data.select_dtypes(include=['float64', 'int64']).columns
unexpected_values = data[numeric_columns].lt(0).sum()
print("\nUnexpected Values:")
print(unexpected_values)

# Check data types
print("\nData Types:")
print(data.dtypes)

# Summary statistics for numerical columns
print("\nSummary Statistics for Numerical Columns:")
print(data.describe())

# Check for outliers using box plots or histograms
# For example, you can visualize the distribution of numerical columns

# Check for inconsistencies in categorical variables
# For example, check if there are inconsistent capitalizations or spelling errors in categorical columns


Missing Values:
Country name                          0
year                                  0
Log GDP per capita                   20
Social support                       13
Healthy life expectancy at birth     54
Freedom to make life choices         33
Generosity                           73
Perceptions of corruption           116
Positive affect                      24
Negative affect                      16
dtype: int64

Duplicate Rows:
0

Unexpected Values:
year                                   0
Log GDP per capita                     0
Social support                         0
Healthy life expectancy at birth       0
Freedom to make life choices           0
Generosity                          1187
Perceptions of corruption              0
Positive affect                        0
Negative affect                        0
dtype: int64

Data Types:
Country name                         object
year                                  int64
Log GDP per capita                  float64
Socia

We can assess the normality of each variable using skewness, kurtosis, and normality tests like the Shapiro-Wilk test.

In [14]:

# Select numerical columns
numerical_cols = data.select_dtypes(include=['float64', 'int64']).columns

# Loop through each numerical column
for col in numerical_cols:
    # Calculate skewness and excess kurtosis
    # (columns containing NaN produce nan here; see the output below)
    skew_val = skew(data[col])
    kurt_val = kurtosis(data[col])
    print(f"Skewness of {col}: {skew_val:.2f}")
    print(f"Kurtosis of {col}: {kurt_val:.2f}")
    
    # Shapiro-Wilk test for normality (also affected by NaN values)
    stat, p_val = shapiro(data[col])
    print(f"Shapiro-Wilk test for {col}: statistic={stat:.4f}, p-value={p_val:.4f}")
    print("-" * 30)

Skewness of year: -0.08
Kurtosis of year: -1.07
Shapiro-Wilk test for year: statistic=0.9616, p-value=0.0000
------------------------------
Skewness of Log GDP per capita: nan
Kurtosis of Log GDP per capita: nan
Shapiro-Wilk test for Log GDP per capita: statistic=nan, p-value=1.0000
------------------------------
Skewness of Social support: nan
Kurtosis of Social support: nan
Shapiro-Wilk test for Social support: statistic=nan, p-value=1.0000
------------------------------
Skewness of Healthy life expectancy at birth: nan
Kurtosis of Healthy life expectancy at birth: nan
Shapiro-Wilk test for Healthy life expectancy at birth: statistic=nan, p-value=1.0000
------------------------------
Skewness of Freedom to make life choices: nan
Kurtosis of Freedom to make life choices: nan
Shapiro-Wilk test for Freedom to make life choices: statistic=nan, p-value=1.0000
------------------------------
Skewness of Generosity: nan
Kurtosis of Generosity: nan
Shapiro-Wilk test for Generosity: statistic=

#### Skewness measures the asymmetry of a distribution, with a value of 0 indicating perfect symmetry. Positive skewness indicates a longer right tail, while negative skewness indicates a longer left tail.
#### Kurtosis measures how heavy the tails of a distribution are relative to a normal distribution. `scipy.stats.kurtosis` returns excess (Fisher) kurtosis by default, so a normal distribution scores 0 (3 under the Pearson definition). Positive values indicate heavier tails, while negative values indicate lighter tails.
#### The Shapiro-Wilk test is a statistical test that checks if a sample comes from a normal distribution. A low p-value (e.g., < 0.05) suggests that the data is not normally distributed. The `nan` results above appear because those columns contain missing values, which propagate through the calculations; a `nan` statistic with a p-value of 1.0000 is therefore not evidence of normality.
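
One way to avoid the `nan` results is to drop the missing values from each column before computing the statistics. The sketch below illustrates this on synthetic data rather than on `combined.csv`; the helper name `normality_summary` and the `demo` frame are assumptions introduced for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew, kurtosis, shapiro


def normality_summary(df):
    """Per-column skewness, excess kurtosis and Shapiro-Wilk results, ignoring NaNs."""
    rows = []
    for col in df.select_dtypes(include="number").columns:
        values = df[col].dropna()  # remove missing values before testing
        stat, p_val = shapiro(values)
        rows.append({
            "column": col,
            "n": len(values),
            "skewness": skew(values),
            "excess_kurtosis": kurtosis(values),  # Fisher definition: normal -> 0
            "shapiro_stat": stat,
            "shapiro_p": p_val,
        })
    return pd.DataFrame(rows).set_index("column")


# Synthetic stand-in data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "roughly_normal": rng.normal(size=200),
    "right_skewed": rng.exponential(size=200),
})
demo.loc[::10, "roughly_normal"] = np.nan  # plant some missing values

summary = normality_summary(demo)
print(summary.round(3))
```

Reporting `n` next to each statistic makes it visible how many observations survived the removal of missing values in each column.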

## Internal consistency

Internal consistency measures the reliability of a set of variables or items that are meant to measure the same construct. For this dataset, which contains various socioeconomic indicators, we can calculate Cronbach's alpha across all numerical columns to assess internal consistency.

In [19]:
# Select numerical columns
columns = data.select_dtypes(include=['float64', 'int64']).columns

# Calculate Cronbach's alpha
cronbach_alpha = pg.cronbach_alpha(data=data[columns])

print(f"Cronbach's alpha for the dataset: {cronbach_alpha[0]:.4f}")

Cronbach's alpha for the dataset: 0.3020


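#### Cronbach's alpha is a measure of internal consistency. It is at most 1 and usually falls between 0 and 1, although it can be negative when items covary negatively. A higher value (e.g., > 0.7) is conventionally taken to indicate that the variables measure the same underlying construct consistently; the value of 0.3020 obtained here is well below that threshold.
#### `pg.cronbach_alpha` computes alpha from the covariance matrix of the numerical variables, comparing the sum of the item variances (the diagonal) with the sum of all entries. It returns the estimate together with a confidence interval, which is why the code prints element `[0]`.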
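
To make the calculation concrete, the sketch below computes Cronbach's alpha by hand from the item variances and the variance of the row totals, using the standard formula alpha = k / (k - 1) * (1 - sum of item variances / variance of the total score). It is an illustration on synthetic data, not the pingouin implementation: the function name `cronbach_alpha_manual` is hypothetical, and it uses listwise deletion of missing values, whereas pingouin's default is pairwise deletion, so the two can differ on data with gaps.

```python
import numpy as np
import pandas as pd


def cronbach_alpha_manual(items):
    """Cronbach's alpha from item variances and the variance of the row totals."""
    items = items.dropna()  # listwise deletion (an assumption of this sketch)
    k = items.shape[1]  # number of items
    item_variance_sum = items.var(ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Synthetic data (hypothetical): three noisy copies of one latent variable
rng = np.random.default_rng(1)
latent = rng.normal(size=300)
consistent = pd.DataFrame({
    f"item_{i}": latent + rng.normal(scale=0.1, size=300) for i in range(3)
})

# Two items that move in opposite directions
opposed = pd.DataFrame({
    "a": latent,
    "b": -latent + rng.normal(scale=0.5, size=300),
})

alpha_consistent = cronbach_alpha_manual(consistent)
alpha_opposed = cronbach_alpha_manual(opposed)
print(f"consistent items: {alpha_consistent:.3f}")
print(f"opposed items:    {alpha_opposed:.3f}")
```

The second example shows that alpha is not bounded below by 0: items that covary negatively shrink the variance of the total score and push the estimate below zero.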

## Correlation

We can calculate the correlation between the variables to understand the strength and direction of the linear relationship between them.

In [16]:
# Calculate correlation matrix
corr_matrix = data.select_dtypes(include=['float64', 'int64']).corr()
print("Correlation matrix:\n", corr_matrix)

Correlation matrix:
                                       year  Log GDP per capita  \
year                              1.000000            0.077767   
Log GDP per capita                0.077767            1.000000   
Social support                   -0.029741            0.683590   
Healthy life expectancy at birth  0.163500            0.818126   
Freedom to make life choices      0.234135            0.367525   
Generosity                        0.005641           -0.000854   
Perceptions of corruption        -0.081394           -0.352847   
Positive affect                   0.019226            0.237933   
Negative affect                   0.205329           -0.247541   

                                  Social support  \
year                                   -0.029741   
Log GDP per capita                      0.683590   
Social support                          1.000000   
Healthy life expectancy at birth        0.597659   
Freedom to make life choices            0.409326   
Genero

#### The correlation matrix shows the pairwise Pearson correlation coefficients (the pandas default) between all numerical variables.
#### A correlation coefficient ranges from -1 to 1, where -1 indicates a perfect negative linear relationship, 0 indicates no linear relationship, and 1 indicates a perfect positive linear relationship.
#### The correlation matrix can help identify multicollinearity issues (highly correlated variables) and potential relationships between variables. For example, Log GDP per capita and Healthy life expectancy at birth have a correlation of 0.818.
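
A full matrix is hard to scan for multicollinearity, so it can help to list only the strongly correlated pairs. The sketch below does this on synthetic data; the helper name `strong_pairs` and the 0.8 threshold are assumptions chosen for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd


def strong_pairs(df, threshold=0.8):
    """Return variable pairs whose absolute Pearson correlation meets the threshold."""
    corr = df.select_dtypes(include="number").corr()
    # Keep the strict upper triangle so each pair appears once and the diagonal is skipped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # NaN entries (masked cells) fail the comparison and are dropped here
    strong = pairs[pairs.abs() >= threshold]
    return strong.sort_values(key=np.abs, ascending=False)


# Synthetic data (hypothetical): y closely tracks x, z is unrelated
rng = np.random.default_rng(2)
x = rng.normal(size=500)
demo = pd.DataFrame({
    "x": x,
    "y": x + rng.normal(scale=0.1, size=500),
    "z": rng.normal(size=500),
})

result = strong_pairs(demo)
print(result)
```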

## Outlier detection

Outliers can significantly impact the analysis and modeling of data. We can use various techniques to detect them, such as the Z-score method or Tukey's method.

In [17]:
# Work on the numerical columns only
numeric = data.select_dtypes(include=['float64', 'int64'])

# Detect outliers using Z-score method
z_scores = np.abs((numeric - numeric.mean()) / numeric.std())
outliers = (z_scores > 3).any(axis=1)
print(f"Number of outliers detected (Z-score method): {outliers.sum()}")

# Detect outliers using Tukey's method
Q1 = numeric.quantile(0.25)
Q3 = numeric.quantile(0.75)
IQR = Q3 - Q1
outliers = (numeric < (Q1 - 1.5 * IQR)) | (numeric > (Q3 + 1.5 * IQR))
print(f"Number of outliers detected (Tukey's method): {outliers.any(axis=1).sum()}")

Number of outliers detected (Z-score method): 110
Number of outliers detected (Tukey's method): 321


#### The Z-score method calculates the number of standard deviations a data point is away from the mean. Data points with an absolute Z-score greater than 3 are often considered outliers.
#### Tukey's method is based on the interquartile range (IQR). Data points beyond 1.5 times the IQR from the first (Q1) and third (Q3) quartiles are considered outliers.
#### Both counts are the number of rows (observations) that contain at least one outlier value in any numerical column, so a row is counted once even if several of its values are flagged.
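
Because the row-level totals do not show which variables drive them, a per-column breakdown can be a useful complement. The sketch below counts flagged values per column for both methods on synthetic data; the helper name `outlier_counts` and its parameters are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd


def outlier_counts(df, z_thresh=3.0, iqr_factor=1.5):
    """Count flagged values per numerical column for the Z-score and Tukey methods."""
    numeric = df.select_dtypes(include="number")

    # Z-score method: distance from the column mean in standard deviations
    z = (numeric - numeric.mean()) / numeric.std()
    z_flags = z.abs() > z_thresh

    # Tukey's method: values beyond the fences Q1 - f*IQR and Q3 + f*IQR
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    tukey_flags = (numeric < q1 - iqr_factor * iqr) | (numeric > q3 + iqr_factor * iqr)

    return pd.DataFrame({"z_score": z_flags.sum(), "tukey": tukey_flags.sum()})


# Synthetic data (hypothetical): column "a" has one planted extreme value
demo = pd.DataFrame({
    "a": np.append(np.linspace(-1, 1, 99), 50.0),
    "b": np.linspace(0, 1, 100),
})

counts = outlier_counts(demo)
print(counts)
```

Missing values compare as `False` in both checks, so they are never counted as outliers in this sketch.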