# INTRODUCTION

In this notebook, an exploratory data analysis (EDA) is conducted on the 2023 Legatum Prosperity Index dataset. The Legatum Prosperity Index is a comprehensive ranking of countries based on various dimensions of prosperity and development. It covers twelve indicators: safety and security, personal freedom, governance, social capital, investment environment, enterprise conditions, market access and infrastructure, economic quality, living conditions, health, education, and natural environment. Examining these indicators gives insight into the factors associated with the overall prosperity of nations.

The analysis includes:

- **Data Cleaning and Preparation:** The dataset is tidied and prepared for analysis by addressing any inconsistencies or missing values.
- **Descriptive Statistics:** The central tendencies, dispersions, and distributions of the indicators are summarized.
- **Visualizations:** Various plots are created to illustrate the relationships and distributions of the indicators, including a world map to visualize the geographical distribution of the average scores.
- **Correlation Analysis:** The correlations between different indicators are investigated to understand how they interact with each other.
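
As a minimal sketch of the kind of checks the data-cleaning step could involve, the base-R snippet below counts missing values, looks for duplicated countries, and flags scores outside the 0 to 100 range. The data frame `toy` and its values are made up for illustration; only the column names `Country`, `AveragScore`, and `Health` are taken from the dataset.

```r
# Toy stand-in for the real dataset; the values are invented for illustration.
toy <- data.frame(
  Country     = c("A", "B", "C", "C"),
  AveragScore = c(55.2, NA, 71.4, 71.4),
  Health      = c(60.1, 48.3, 102.0, 102.0)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))

# Countries that appear more than once.
dup_countries <- toy$Country[duplicated(toy$Country)]

# Number of values outside the expected 0-100 score range, per numeric column.
num_cols <- names(toy)[sapply(toy, is.numeric)]
out_of_range <- sapply(toy[num_cols],
                       function(x) sum(x < 0 | x > 100, na.rm = TRUE))

print(na_counts)
print(dup_countries)
print(out_of_range)
```

On the real data, the same three checks would be run on `data` directly; any non-zero count points to rows worth inspecting before plotting.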

In subsequent stages, separate notebooks will be used for:

- **[Principal Component Analysis (PCA)](https://www.kaggle.com/code/tarktunataalt/global-prosperity-pca-distance-insights):** To reduce the dimensionality of the dataset and identify the most significant indicators.
- **[K-Means Clustering](https://www.kaggle.com/code/tarktunataalt/k-means-clustering-global-prosperity-2023):** To group countries into clusters based on their prosperity indicators.
- **[K-Medoids Clustering](https://www.kaggle.com/code/tarktunataalt/k-medoids-clustering-global-prosperity-2023):** To provide an alternative clustering method that is less sensitive to outliers.

This EDA aims to uncover patterns and relationships in the data that help explain differences in prosperity across countries. It also serves as a foundation for more advanced modeling and hypothesis testing in future studies.


In [None]:
library(ggplot2)
library(tidyr)
library(dplyr)
library(corrplot)
library(rnaturalearth)
library(rnaturalearthdata)
library(stringr)

data <- read.csv("/kaggle/input/2023-global-country-development-and-prosperity-index/data.csv")
head(data)

# BOXPLOT VISUALIZATIONS

In [None]:


data_long <- data %>%
  pivot_longer(cols = -Country, names_to = "Indicator", values_to = "Value")

ggplot(data_long, aes(x = Indicator, y = Value, fill = Indicator)) +
  geom_boxplot() +
  coord_flip() +  
  labs(title = "Boxplots of Prosperity Indicators",
       x = "Indicator",
       y = "Value") +
  theme_minimal() +
  theme(legend.position = "none")

The average score (AveragScore) ranges from roughly 25 to 75, with a median around 50. Overall development levels are therefore widely spread, with many countries clustered near the middle of the range. The Safety and Security (SafetySecurity) indicator generally shows high values, although a few outlying countries have much lower security levels. The Personal Freedom (PersonelFreedom) and Governance (Governance) indicators also span a wide range, with median values around 70 and 60, respectively, which highlights significant differences in personal freedom and governance quality among countries.

Social Capital (SocialCapital), Investment Environment (InvestmentEnvironment), and Enterprise Conditions (EnterpriseConditions) indicators also vary among countries. Social capital, in particular, ranges from 30 to 80, while investment environment and enterprise conditions range from 40 to 80. These indicators reflect the diversity in social and economic structures across countries.

The Market Access and Infrastructure (MarketAccessInfrastructure) and Economic Quality (EconomicQuality) indicators are distributed between roughly 50 and 80, showing variability in infrastructure, market access, and economic quality across nations. The Living Conditions (LivingConditions) indicator ranges from about 40 to 90, with a median around 70, indicating substantial differences in living standards among countries.

The Health (Health) and Education (Education) indicators cover a broad range (about 40 to 90), with median values around 60 and 70, respectively. This illustrates significant variation in healthcare services and education quality among countries. The Natural Environment (NaturalEnvironment) indicator ranges from about 30 to 80, with a median around 60, which underscores the differences in environmental preservation and sustainability across nations.
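
The ranges and medians read off the boxplots can also be computed directly. The following base-R sketch, run on a made-up data frame `toy`, builds a five-number summary per indicator and counts points beyond 1.5 times the interquartile range from the quartiles, the same rule `geom_boxplot()` uses by default to draw outliers. The helper name `summarise_indicator` is hypothetical.

```r
# Toy stand-in for the real dataset; the values are invented for illustration.
toy <- data.frame(
  Country        = paste0("C", 1:8),
  SafetySecurity = c(20, 62, 65, 68, 70, 72, 75, 78),
  Health         = c(55, 58, 60, 61, 63, 66, 68, 70)
)

# Hypothetical helper: five-number summary plus outlier count for one indicator.
summarise_indicator <- function(x) {
  q   <- quantile(x, c(0.25, 0.5, 0.75), na.rm = TRUE)
  iqr <- q[[3]] - q[[1]]
  # Points further than 1.5 * IQR from the quartiles count as outliers.
  n_out <- sum(x < q[[1]] - 1.5 * iqr | x > q[[3]] + 1.5 * iqr, na.rm = TRUE)
  c(min = min(x, na.rm = TRUE), q1 = q[[1]], median = q[[2]],
    q3 = q[[3]], max = max(x, na.rm = TRUE), outliers = n_out)
}

# One row per indicator, one column per statistic.
indicator_summary <- t(sapply(toy[-1], summarise_indicator))
print(indicator_summary)
```

Applied to the real data, `toy[-1]` would be replaced by the numeric columns of `data`, giving a table to check the approximate figures quoted above.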

In [None]:
scaled_data <- data %>%
  select(-Country) %>%
  scale(center = TRUE, scale = TRUE) %>%
  as.data.frame()

scaled_data$Country <- data$Country

scaled_data_long <- scaled_data %>%
  pivot_longer(cols = -Country, names_to = "Indicator", values_to = "Value")

ggplot(scaled_data_long, aes(x = Indicator, y = Value, fill = Indicator)) +
  geom_boxplot() +
  coord_flip() +  
  labs(title = "Boxplots of Scaled Prosperity Indicators",
       x = "Indicator",
       y = "Scaled Value") +
  theme_minimal() +
  theme(legend.position = "none")

The scaled boxplot shows standardized values: `scale()` subtracts each indicator's mean and divides by its standard deviation, so every indicator is centred at 0 with unit variance and the distributions can be compared on a common footing. In this dataset, however, all indicators are already scored from 0 to 100, so working with the unscaled data is also sensible. The unscaled graph keeps the indicators in their original score units, which makes them easier to interpret in context, and differences in level and spread between indicators are directly visible. For example, SafetySecurity and Health have wide distributions with several outliers, and LivingConditions is spread across a wider range than most other indicators.

In the scaled graph, differences in level and spread are removed, so what remains is the shape of each distribution; skewness and outliers are easier to compare across indicators. Outliers are present in both graphs; after standardization they are expressed as standard deviations from each indicator's mean. Because all indicators share the 0 to 100 range, performing clustering analysis without scaling is also defensible. It simplifies the analysis and retains the original context of the indicators, at the cost of letting indicators with a larger spread carry more weight in distance calculations.
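
To make the effect of `scale()` concrete, the sketch below standardizes two made-up indicator columns and compares their standard deviations before and after. The data frame `toy` is an assumption for illustration; only `scale()` and its arguments match the cell above.

```r
# Toy stand-in for two indicators; the values are invented for illustration.
toy <- data.frame(
  Country   = c("A", "B", "C", "D"),
  Health    = c(50, 60, 70, 80),
  Education = c(40, 45, 50, 55)
)

num <- toy[-1]

# Spread of each indicator on the original 0-100 scale.
raw_sd <- sapply(num, sd)

# scale() computes (x - mean(x)) / sd(x) column by column.
scaled <- as.data.frame(scale(num, center = TRUE, scale = TRUE))

print(raw_sd)                       # spreads differ before scaling
print(round(colMeans(scaled), 10))  # means are 0 after scaling
print(sapply(scaled, sd))           # standard deviations are 1 after scaling
```

Comparing `raw_sd` across the real indicators is a quick way to judge how much the unscaled data would favour high-variance indicators in a distance-based method.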

# CORRELATION VISUALIZATIONS

In [None]:
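# Pearson correlation between the numeric columns (columns 2 to 14);
# cor() uses a single method, so it is named explicitly here.
corr <- cor(data[, 2:14], method = "pearson")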
corr

In [None]:
corrplot.mixed(corr, 
               tl.cex = 0.4, 
               number.cex = 0.8, 
               upper = "pie")

The correlation plot presents the relationships between the indicators in the 2023 Legatum Prosperity Index data. Each cell shows the correlation coefficient between two indicators, with values ranging from -1 to 1. Values close to 1 indicate a strong positive correlation, while values close to -1 indicate a strong negative correlation. Overall, AveragScore shows a high degree of positive correlation with the other indicators, especially InvestmentEnvironment, MarketAccessInfrastructure, and Governance. Since the average score summarizes the individual indicators, some positive correlation is expected by construction; these three indicators track it most closely. SafetySecurity has a strong positive correlation with AveragScore but only moderate correlations with the other indicators, suggesting that safety and security is somewhat independent of the other factors.

PersonelFreedom shows high correlations with Governance and NaturalEnvironment, while Governance has very high correlations with InvestmentEnvironment and EnterpriseConditions. This indicates that governance is closely related to the investment environment and business conditions. SocialCapital has moderate correlations with other indicators, with the highest being with Governance. InvestmentEnvironment has very high correlations with Governance and MarketAccessInfrastructure, showing that the investment environment is strongly linked to good governance and market access. EnterpriseConditions has high correlations with InvestmentEnvironment and Governance, indicating that business conditions are closely related to the investment environment and governance quality.

MarketAccessInfrastructure shows very high correlations with InvestmentEnvironment and Governance, indicating that market access and infrastructure go hand in hand with the investment environment and governance quality. EconomicQuality has high correlations with MarketAccessInfrastructure and Governance, so economic quality tends to be higher where infrastructure and governance are stronger. LivingConditions has very high correlations with Education and Health, and Health and Education are in turn highly correlated with each other, so these three indicators move together closely. NaturalEnvironment has only moderate correlations with the other indicators, the highest being with PersonelFreedom. These are associations only; correlation coefficients alone do not show which indicator drives which.
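
Reading the strongest pairs off a plot is error-prone, so it can help to list them in a table. The base-R sketch below, using a made-up data frame `toy` with three of the dataset's column names, flattens the upper triangle of a correlation matrix into one row per pair and sorts by absolute correlation.

```r
# Toy stand-in for three indicators; the values are invented for illustration.
toy <- data.frame(
  Governance            = c(30, 45, 60, 75, 90),
  InvestmentEnvironment = c(35, 50, 58, 80, 88),
  NaturalEnvironment    = c(60, 40, 70, 50, 65)
)

corr <- cor(toy, method = "pearson")

# Take the upper triangle only, so each pair appears once and the diagonal is dropped.
idx <- which(upper.tri(corr), arr.ind = TRUE)

pairs <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = corr[idx]
)

# Strongest relationships first, regardless of sign.
pairs <- pairs[order(-abs(pairs$r)), ]
print(pairs)
```

With the real `corr` matrix from the cells above, `head(pairs, 10)` would give the ten strongest pairs to compare against the description in the text.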

# WORLD MAP VISUALIZATION OF AVERAGE SCORES

In order to align the country names in our dataset with those in the world map data, several manual adjustments are required. These discrepancies often arise from differences in naming conventions, such as short forms, official names, or local language variations. For example, "United States of America" in our dataset is adjusted to "United States" to match the name used in the world map data. Similarly, "South Korea" is changed to "Korea," and "Ivory Coast" is adjusted to "Côte d'Ivoire." These manual adjustments ensure that each country in our dataset correctly corresponds to the names used in the world map, allowing for accurate visualization and analysis.
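
One way to find which names need adjusting is to compare the two name lists programmatically rather than by eye. In the base-R sketch below, `map_names` stands in for `world$name` and `dataset_names` for `data$Country`; both vectors and the lookup `recode` are hypothetical examples, with the mappings taken from the adjustments described above.

```r
# Hypothetical stand-ins for world$name and data$Country.
map_names     <- c("United States", "Korea", "Czech Rep.", "France")
dataset_names <- c("United States of America", "South Korea",
                   "Czech Republic", "France ")

# Remove stray whitespace before comparing.
dataset_names <- trimws(dataset_names)

# Names in the dataset that have no exact match in the map data.
unmatched <- setdiff(dataset_names, map_names)
print(unmatched)

# Lookup of dataset name -> map name for the mismatches found above.
recode <- c("United States of America" = "United States",
            "South Korea"              = "Korea",
            "Czech Republic"           = "Czech Rep.")

# Replace names that appear in the lookup and keep all others unchanged.
fixed <- unname(ifelse(dataset_names %in% names(recode),
                       recode[dataset_names], dataset_names))

# After recoding, nothing should be left unmatched.
still_unmatched <- setdiff(fixed, map_names)
print(still_unmatched)
```

Re-running the `setdiff()` check after recoding confirms that no country silently drops out of the join and ends up grey on the map.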

In [None]:
world <- ne_countries(scale = "medium", returnclass = "sf")

world$name

In [None]:
data$Country

In [None]:
data$Country <- str_trim(data$Country)

data <- data %>%
  mutate(Country = case_when(
    Country == "United States of America" ~ "United States",
    Country == "South Korea" ~ "Korea",
    Country == "North Korea" ~ "Dem. Rep. Korea",
    Country == "Ivory Coast" ~ "Côte d'Ivoire",
    Country == "Czech Republic" ~ "Czech Rep.",
    Country == "Dominican Republic" ~ "Dominican Rep.",
    Country == "Bosnia and Herzegovina" ~ "Bosnia and Herz.",
    Country == "Cabo Verde" ~ "Cape Verde",
    Country == "São Tomé and Príncipe" ~ "São Tomé and Principe",
    Country == "Equatorial Guinea" ~ "Eq. Guinea",
    Country == "Democratic Republic of Congo" ~ "Dem. Rep. Congo",
    Country == "Central African Republic" ~ "Central African Rep.",
    Country == "South Sudan" ~ "S. Sudan",
    Country == "Laos" ~ "Lao PDR",
    TRUE ~ Country
  ))

In [None]:
world_data <- left_join(world, data, by = c("name" = "Country"))

ggplot(data = world_data) +
  geom_sf(aes(fill = AveragScore)) +
  # `values` needs one position per colour, so six quantile breakpoints are used for the six colours
  scale_fill_gradientn(colors = c("red", "orange", "yellow", "green", "blue", "purple"),
                       values = scales::rescale(unname(quantile(data$AveragScore,
                                                                probs = seq(0, 1, length.out = 6),
                                                                na.rm = TRUE))),
                       na.value = "lightgrey", name = "Average Score") +
  theme_minimal() +
  labs(title = "World Map Colored by Average Score",
       caption = "Source: 2023 Legatum Prosperity Index") +
  theme(legend.position = "bottom")

This world map is colored according to the average scores of countries based on the 2023 Legatum Prosperity Index. In the map, countries with low prosperity levels are shown in red and orange tones, those with medium levels in yellow and green tones, and countries with high prosperity levels in blue and purple tones. North America, Western Europe, Australia, and some Asian countries stand out with high levels of prosperity, while some regions in Africa and South Asia exhibit lower levels of prosperity. 