# Political Polarization and Economic Factors (GINI & CPI) in OECD Countries

## Project Overview

This notebook investigates the potential relationships between key economic indicators and political polarization within OECD (Organisation for Economic Co-operation and Development) member countries. Specifically, we aim to explore whether income inequality, as measured by the **GINI coefficient**, and inflation, as measured by the **Consumer Price Index (CPI)**, correlate with or can help explain variations in political polarization.

Political polarization, broadly defined as the divergence of political attitudes toward ideological extremes, is a topic of considerable interest in contemporary political science. Understanding its potential drivers, including socio-economic factors, is crucial.

**Research Question:** How do income inequality (GINI) and inflation (CPI) relate to the ideological dispersion of political parties in OECD countries?

**Data Sources:**
1.  **Manifesto Project Dataset (MPD):** Provides coded data from political party manifestos, including the `rile` score, which estimates a party's position on a left-right ideological scale. This is our primary source for measuring party ideology.
2.  **OECD Dataset:** Contains various economic indicators for member countries, including GINI coefficients and CPI data, which serve as our independent variables.

**Methodology Outline:**
1.  **Data Loading and Merging:** Load the pre-merged dataset which combines party-level manifesto data with country-level OECD economic indicators.
2.  **Filtering:** Focus the analysis on OECD member countries, based on information within the dataset.
3.  **Quantifying Polarization:** Calculate a political polarization score for each country-year. Our chosen measure is the **vote-share weighted standard deviation of party `rile` scores**. This captures the ideological spread of parties, giving more weight to those with greater electoral support.
4.  **Dataset Preparation:** Aggregate data to the country-year level, creating a panel dataset for analysis.
5.  **Exploratory Analysis:** Conduct correlation analysis to identify linear associations between polarization, GINI, and CPI.
6.  **Statistical Modeling:** Employ linear regression models (OLS) to further examine these relationships, assess statistical significance, and quantify potential impacts.

This notebook documents each step of the analysis, from data preparation to the interpretation of results.

## 1. Setup and Library Imports

This section imports the necessary Python libraries for data manipulation, numerical computation, statistical analysis, and visualization. Each library plays a specific role:

-   `pandas`: For data manipulation and analysis, particularly for working with DataFrames.
-   `numpy`: For numerical operations, especially for array manipulations and mathematical functions.
-   `os`: For interacting with the operating system, primarily used here for constructing file paths in a system-agnostic way.
-   `statsmodels`: A powerful library for estimating and interpreting statistical models, including OLS regression and weighted statistics.
-   `matplotlib.pyplot` and `seaborn`: For creating static, interactive, and animated visualizations. Seaborn builds on Matplotlib and provides a high-level interface for drawing attractive and informative statistical graphics.

In [None]:
import pandas as pd
import numpy as np
import os
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.weightstats import DescrStatsW
import matplotlib.pyplot as plt
import seaborn as sns

# Set plotting style for consistency and aesthetics
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6) # Default figure size

## 2. Configuration and File Paths

Properly managing file paths is crucial for reproducibility. This section defines the paths to input data files and specifies where output (like plots) might be saved. The setup assumes a project structure where this notebook might be in a `notebooks/` directory, with data located in a sibling `output/` directory (which contains the results from a preceding data merging script).

**Rationale:** Using `os.path.join` and dynamic path construction (like `os.path.dirname(os.path.abspath(__file__))` in scripts, or `os.getcwd()` as a fallback here) makes the notebook more portable across different operating systems and local file structures.
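
The same idea can be sketched with `pathlib`, which some readers find easier to follow than nested `os.path` calls. This is only an illustration under the layout assumed above (notebook one level below the project root); the helper name `resolve_merged_data_path` is hypothetical and is not used elsewhere in this notebook.

```python
from pathlib import Path


def resolve_merged_data_path(notebook_dir, output_folder="output",
                             filename="merged_political_oecd_data.csv"):
    """Return <project root>/<output_folder>/<filename>, assuming the
    notebook directory sits one level below the project root."""
    project_root = Path(notebook_dir).resolve().parent
    return project_root / output_folder / filename


# Hypothetical layout: <cwd>/notebooks is treated as the notebook directory.
example_path = resolve_merged_data_path(Path.cwd() / "notebooks")
print(example_path)
```

`Path` objects join with `/` and use the correct separator for the host operating system, so the result is equivalent to the `os.path.join` calls in the cell below.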

In [None]:
# Locate the notebook's directory (e.g., /path/to/project/notebooks).
# Prefer the path that VS Code provides; otherwise fall back to the current working
# directory, which is usually the notebook's folder on a standard Jupyter server.
try:
    # __vsc_ipynb_file__ is a variable injected by VS Code's Jupyter extension, holding the path to the current notebook
    notebook_dir = os.path.dirname(os.path.abspath(__vsc_ipynb_file__))
except NameError:
    # Fallback if not in VS Code or __vsc_ipynb_file__ is not defined (e.g., running in a standard Jupyter server)
    notebook_dir = os.getcwd()

# Assume the project_base_dir is one level up from the notebook_dir (e.g., if notebooks are in 'project_root/notebooks/')
project_base_dir = os.path.dirname(notebook_dir)

output_folder_name = "output" # Folder where merged data and plots are stored
output_dir = os.path.join(project_base_dir, output_folder_name)

merged_data_filename = "merged_political_oecd_data.csv"
merged_data_path = os.path.join(output_dir, merged_data_filename)

print(f"Project base directory (assumed): {project_base_dir}")
print(f"Output directory: {output_dir}")
print(f"Looking for merged data at: {merged_data_path}")

## 3. Load Data

This step loads the pre-merged dataset. This dataset is the result of a prior process (executed by the `merge_data.py` script) that combined two primary sources:

1.  **Manifesto Project Dataset (MPDataset_MPDS2024a.csv):**
    * **Content:** Contains detailed information coded from political party election manifestos from a wide range of countries and years. For each party in an election, it includes variables representing their policy positions on various domains.
    * **Key Variable for this study:** `rile`. This score estimates a party's overall position on a one-dimensional left-right scale. It is central to our measure of party ideology and thus polarization.
    * **Other relevant variables:** `countryname`, `year` (derived from `date`), `partyname`, `pervote` (percentage of votes received by the party), `oecdmember` (indicator of OECD membership).

2.  **OECD Economic Indicators (OECD_gini_cpi_data.csv):**
    * **Content:** Provides country-year level data for various economic indicators for OECD and other countries.
    * **Key Variables for this study:**
        * `GINI`: The GINI coefficient measures income inequality within a country, with 0 representing perfect equality and 1 (or 100) representing perfect inequality.
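        * `CPI`: The Consumer Price Index measures the average change over time in the prices paid by consumers for a representative basket of goods and services. Inflation is usually reported as the percentage change in the CPI over a period, so whether the `CPI` column holds an index level or a growth rate affects how its coefficients should be read.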

The `merge_data.py` script handled the transformation of the OECD data (pivoting to get GINI and CPI as columns) and then performed a left merge with the Manifesto Project data on `countryname` and `year`.

**Rationale for `low_memory=False`:** By default, `pd.read_csv` may process a large file in internal chunks and infer each column's type chunk by chunk, which can produce mixed-type columns and a `DtypeWarning`. Setting `low_memory=False` parses the file in a single pass so that each column gets one consistently inferred type, at the cost of higher memory use during loading.
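
An alternative that avoids type inference for the columns that matter is to declare their types explicitly. The sketch below uses a small in-memory CSV with invented values; the column names follow this notebook, but the `"missing"` placeholder is an assumption made for illustration and may not occur in the real file.

```python
import io

import pandas as pd

# Toy CSV standing in for the merged file; the values are invented.
csv_text = (
    "countryname,year,oecdmember,pervote\n"
    "Sweden,2006,10,35.0\n"
    "Sweden,2006,10,missing\n"
)

toy = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"year": "int64", "oecdmember": "int64"},  # declared, not inferred
    na_values=["missing"],                           # treat this placeholder as NaN
    low_memory=False,
)
print(toy.dtypes)
```

Declaring `dtype` makes the load fail loudly if a column contains unexpected values, which can be preferable to silently receiving an `object` column.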

In [None]:
if not os.path.exists(merged_data_path):
    print(f"Error: Merged data file not found at {merged_data_path}")
    print("Please ensure you have run the data merging script (e.g., merge_data.py from scripts folder) first ")
    print("and the file 'merged_political_oecd_data.csv' is in the 'output' directory relative to the project root.")
    df = pd.DataFrame() # Create an empty DataFrame to prevent further errors if file not found
else:
    df = pd.read_csv(merged_data_path, low_memory=False)
    print(f"Loaded data with {df.shape[0]} rows and {df.shape[1]} columns.")
    print("First 3 rows of the loaded dataframe:")
    display(df.head(3))

## 4. Inspect and Filter for OECD Countries

**Objective:** To focus the analysis on a set of comparable, developed economies.

**Rationale for focusing on OECD Countries:**
-   **Comparability:** OECD countries generally share similar levels of economic development and democratic institutions, making comparisons of political and economic phenomena more meaningful.
-   **Data Availability:** Data for economic indicators like GINI and CPI are often more consistently available and standardized for OECD members.
-   **Research Scope:** Aligning with the common scope of studies on polarization in established democracies.

**Process:**
1.  **Inspect `oecdmember` column:** Verify the data type and unique values to understand how OECD membership is coded. In this dataset (as discovered during initial scripting), OECD members are coded as `10`.
2.  **Convert to Numeric:** Ensure the column is treated numerically for reliable filtering.
3.  **Filter:** Select only rows where `oecdmember_numeric` is `10`.
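
To see why the numeric conversion in step 2 matters, consider a toy column in which the same membership code appears as a string, an integer, and a float. The values below are invented; they only illustrate how `pd.to_numeric(..., errors='coerce')` behaves.

```python
import pandas as pd

# Invented values showing the kinds of encodings a raw column might contain.
raw = pd.Series(["10", 10, 10.0, "0", "unknown", None], name="oecdmember")

numeric = pd.to_numeric(raw, errors="coerce")  # unparseable entries become NaN
is_oecd = numeric == 10                        # NaN compares as False

print(numeric.tolist())
print(is_oecd.tolist())
```

After coercion all three spellings of `10` match the filter, while unparseable or missing entries are excluded rather than raising an error.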

In [None]:
if not df.empty and 'oecdmember' in df.columns:
    print("--- Inspecting 'oecdmember' column prior to filtering ---")
    print(f"Data type of 'oecdmember' column: {df['oecdmember'].dtype}")
    print(f"Unique values in 'oecdmember' column: {df['oecdmember'].unique()}")
    print("Value counts for 'oecdmember' (including NaNs if any):")
    print(df['oecdmember'].value_counts(dropna=False))
    
    # Convert to numeric, coercing errors to NaN, then filter for OECD countries (coded as 10)
    df['oecdmember_numeric'] = pd.to_numeric(df['oecdmember'], errors='coerce')
    df_oecd = df[df['oecdmember_numeric'] == 10].copy() # .copy() is used to avoid SettingWithCopyWarning
    
    print(f"\nFiltered for OECD countries (where 'oecdmember_numeric' == 10): {df_oecd.shape[0]} rows remaining.")
    if df_oecd.empty:
        print("Warning: No OECD member data found after filtering. Subsequent analysis will be affected.")
    else:
        print("Sample of filtered OECD data (first 3 rows):")
        display(df_oecd.head(3))
else:
    print("DataFrame is empty or 'oecdmember' column is missing. Cannot proceed with OECD filtering.")
    df_oecd = pd.DataFrame() # Ensure df_oecd is an empty DataFrame to avoid errors in subsequent cells

## 5. Calculate Political Polarization Score

**Concept of Political Polarization:**
Political polarization refers to the extent to which political opinions and party ideologies diverge towards the extremes of a political spectrum and/or cluster into distinct, opposing camps. For this analysis, we focus on **ideological polarization**, specifically the spread of political parties along the left-right dimension.

**Chosen Measure: Vote-Share Weighted Standard Deviation of Party `rile` Scores**

-   **`rile` Score:** The Manifesto Project's `rile` score provides a continuous measure of a party's overall ideological position on a left-right scale, derived from the content analysis of their election manifestos. Positive scores indicate a rightward leaning, negative scores a leftward leaning.
-   **Standard Deviation:** The standard deviation is a statistical measure of dispersion or spread. A higher standard deviation of `rile` scores among parties in a given country-year indicates greater ideological distance between them.
-   **Vote-Share Weighted:** To reflect the political landscape more accurately, the standard deviation is weighted by each party's vote share (`pervote`). This means that the ideological positions of parties with more votes have a greater influence on the overall polarization score. A country with two large, ideologically distant parties would be considered more polarized than a country with one large centrist party and several small, ideologically distant fringe parties.
-   **`ddof=0` in `DescrStatsW`:** This argument in `statsmodels.stats.weightstats.DescrStatsW` calculates the weighted standard deviation as if the data represents the entire population (or the specific set of parties in an election), rather than a sample. This is appropriate here as we are characterizing the observed party system in each election.
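
The effect of vote-share weighting can be checked with a small NumPy sketch. The two party systems below are invented to mirror the comparison described above; the helper `weighted_std` is written for illustration and should agree with `DescrStatsW(..., ddof=0).std` for the same inputs.

```python
import numpy as np


def weighted_std(values, weights):
    """Population-style (ddof=0) weighted standard deviation."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mean = np.average(values, weights=weights)
    variance = np.average((values - mean) ** 2, weights=weights)
    return float(np.sqrt(variance))


# Invented party systems: rile scores with vote shares in percent.
two_blocs = weighted_std([-30, 30], [50, 50])
centre_plus_fringe = weighted_std([0, -30, 30], [80, 10, 10])

print(two_blocs, centre_plus_fringe)
```

Both systems span the same range of `rile` scores, but the second scores much lower because most of the vote sits at the centre.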

**Process:**
1.  Define a function `calculate_weighted_std` that takes a group of parties (for a specific country-year).
2.  Within the function, filter out parties with missing `rile` or `pervote` data, or zero vote share.
3.  Ensure at least two valid parties exist to calculate a spread.
4.  Use `statsmodels.stats.weightstats.DescrStatsW` to compute the weighted standard deviation of `rile` scores, using `pervote` as weights.
5.  Group the `df_oecd` dataframe by `countryname` and `year`, and apply this function to calculate the `PolarizationScore` for each group.
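
In symbols, the quantity computed in step 4 can be written as follows. The notation is introduced here for illustration: $r_i$ is the `rile` score and $v_i$ the `pervote` value of party $i$ in country $c$ and year $t$, and the sums run over the parties that pass the filters in steps 2 and 3.

$$
\bar{r}_{ct} = \frac{\sum_{i} v_{i}\, r_{i}}{\sum_{i} v_{i}},
\qquad
\text{PolarizationScore}_{ct} = \sqrt{\frac{\sum_{i} v_{i}\,\left(r_{i} - \bar{r}_{ct}\right)^{2}}{\sum_{i} v_{i}}}
$$

Dividing by $\sum_i v_i$ with no degrees-of-freedom correction corresponds to the `ddof=0` setting described above.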

In [None]:
def calculate_weighted_std(group):
    """
    Calculates the vote-share weighted standard deviation of 'rile' scores.
    'pervote' is used as the weight.
    """
    # Work on a copy so the assignments below do not trigger SettingWithCopyWarning
    valid_data = group.copy()
    # Coerce 'rile' and 'pervote' to numeric; unparseable entries become NaN
    valid_data['rile'] = pd.to_numeric(valid_data['rile'], errors='coerce')
    valid_data['pervote'] = pd.to_numeric(valid_data['pervote'], errors='coerce')
    # Drop rows where either critical value is missing
    valid_data = valid_data.dropna(subset=['rile', 'pervote'])
    
    # Filter for parties with positive vote share
    valid_data = valid_data[valid_data['pervote'] > 0]

    # Need at least two parties with valid rile and positive vote share to calculate standard deviation.
    if len(valid_data) < 2: 
        return np.nan
    # Ensure weights are not all zero (though filtered above, an explicit check is good practice)
    if valid_data['pervote'].sum() == 0:
        return np.nan

    try:
        # Using statsmodels DescrStatsW for weighted statistics
        weighted_stats = DescrStatsW(valid_data['rile'], weights=valid_data['pervote'], ddof=0) # ddof=0 for population-like std
        return weighted_stats.std
    except Exception as e:
        # print(f"Could not calculate weighted_std for group. Error: {e}") # Optional: for debugging specific groups
        return np.nan

if not df_oecd.empty:
    print("Calculating Political Polarization Score...")
    # Select only the columns the function needs before apply. This also avoids the
    # DeprecationWarning that pandas 2.2+ raises when apply operates on the grouping columns.
    polarization_scores = df_oecd.groupby(['countryname', 'year'])[['rile', 'pervote']].apply(calculate_weighted_std)
    polarization_df = polarization_scores.reset_index(name='PolarizationScore')
    
    print("Polarization scores calculated. Sample:")
    display(polarization_df.head())
    print(f"Number of country-years with successfully calculated polarization scores: {polarization_df['PolarizationScore'].notna().sum()}")
    print(f"Number of country-years where polarization score calculation resulted in NaN: {polarization_df['PolarizationScore'].isna().sum()}")
else:
    print("Skipping polarization calculation as df_oecd is empty.")
    polarization_df = pd.DataFrame(columns=['countryname', 'year', 'PolarizationScore']) # Ensure an empty df with correct columns

## 6. Prepare Final Dataset for Analysis (Country-Year Level)

**Objective:** To create a dataset where each row represents a unique country-year, containing the calculated `PolarizationScore` along with the corresponding `GINI` and `CPI` values.

**Process:**
1.  **Select Economic Data:** From the `df_oecd` (filtered OECD party-level data), select the `countryname`, `year`, `GINI`, and `CPI` columns. Since GINI and CPI are country-year level variables, they will be repeated for each party within a country-year in `df_oecd`. We need one unique value per country-year.
2.  **Remove Duplicates:** Use `drop_duplicates(subset=['countryname', 'year'])` to ensure that we only have one row for each country-year, containing the unique GINI and CPI for that specific country and year.
3.  **Merge with Polarization Scores:** Merge this country-year level economic data with `polarization_df` (already at the country-year level) using `countryname` and `year` as keys. A `left` merge keeps every country-year in `country_year_oecd_data`; because both tables are derived from `df_oecd`, their keys should align, and any country-year without a calculable score simply carries `NaN` in `PolarizationScore`.

This resulting `analysis_df` will be the primary dataset for the subsequent correlation and regression analyses.
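
The de-duplication and merge steps can be made self-checking with two optional arguments of `pd.merge`. The sketch below uses invented toy data; `validate` and `indicator` are not used in this notebook's own cell and are shown only as one possible safeguard.

```python
import pandas as pd

# Invented party-level rows: GINI and CPI repeat for each party in a country-year.
party_rows = pd.DataFrame({
    "countryname": ["A", "A", "B", "B"],
    "year": [2010, 2010, 2010, 2010],
    "GINI": [0.30, 0.30, 0.35, 0.35],
    "CPI": [2.1, 2.1, 1.4, 1.4],
})
scores = pd.DataFrame({
    "countryname": ["A", "B"],
    "year": [2010, 2010],
    "PolarizationScore": [12.5, 18.0],
})

economic = party_rows.drop_duplicates(subset=["countryname", "year"])
merged = pd.merge(
    economic, scores,
    on=["countryname", "year"],
    how="left",
    validate="one_to_one",  # raises MergeError if either side has duplicate keys
    indicator=True,         # adds a _merge column: 'both', 'left_only', ...
)
print(merged)
```

Rows marked `left_only` in `_merge` would be country-years with economic data but no polarization score.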

In [None]:
if not df_oecd.empty and not polarization_df.empty:
    # GINI and CPI are already at country-year level in the merged_political_oecd_data.csv
    # but they are repeated for each party. We need one value per country-year.
    country_year_oecd_data = df_oecd[['countryname', 'year', 'GINI', 'CPI']].drop_duplicates(subset=['countryname', 'year'])
    
    # Merge polarization scores with the unique country-year GINI and CPI data
    analysis_df = pd.merge(country_year_oecd_data, polarization_df, on=['countryname', 'year'], how='left')
    
    print("Final country-year level dataset (before dropping NaNs for specific analysis variables):")
    display(analysis_df.head())
    print(f"Shape of analysis_df: {analysis_df.shape}")
    print(f"Missing values in analysis_df:\n{analysis_df.isnull().sum()}")
else:
    print("Skipping final dataset preparation as df_oecd or polarization_df is empty.")
    analysis_df = pd.DataFrame(columns=['countryname', 'year', 'GINI', 'CPI', 'PolarizationScore'])

## 7. Handle Missing Data for Statistical Analysis

**Context:** Standard correlation and regression analyses require complete observations for all variables included in a particular model. Rows with missing values (NaNs) in any of the key variables (`PolarizationScore`, `GINI`, `CPI`) must be addressed.

**Method Chosen: Listwise Deletion**
-   The script uses `dropna(subset=['PolarizationScore', 'GINI', 'CPI'])` to remove any country-year observation if it's missing data for any of these three specific variables.

**Rationale for Listwise Deletion:**
-   **Simplicity:** It's a straightforward method to implement.
-   **Comparability:** Ensures that all statistical analyses (correlations, regressions) are performed on the exact same set of observations.

**Limitations of Listwise Deletion:**
-   **Reduced Sample Size:** This method can significantly reduce the number of available data points, potentially leading to a loss of statistical power. As seen in the previous script runs, the sample size drops from ~450 country-years with polarization scores to ~100 when requiring complete GINI and CPI data.
-   **Potential Bias:** If the data are not missing completely at random (MCAR), listwise deletion can introduce bias into the results. However, more complex imputation methods are beyond the scope of this initial analysis.

The resulting `analysis_df_complete` will be used for the subsequent statistical modeling.
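
Before dropping rows, it can help to see which variable is responsible for most of the loss. The following sketch uses an invented five-row table; the variable names match this notebook, the values do not come from the data.

```python
import numpy as np
import pandas as pd

# Invented country-year rows with gaps in different columns.
toy = pd.DataFrame({
    "PolarizationScore": [10.0, 12.0, np.nan, 15.0, 9.0],
    "GINI": [0.30, np.nan, 0.28, 0.33, np.nan],
    "CPI": [2.0, 1.5, 3.0, np.nan, 2.2],
})
key_vars = ["PolarizationScore", "GINI", "CPI"]

missing_per_column = toy[key_vars].isna().sum()  # gaps per variable
complete = toy.dropna(subset=key_vars)           # listwise deletion

print(missing_per_column)
print(f"{len(complete)} of {len(toy)} rows are complete")
```

In this toy case each variable is missing in only one or two rows, yet only one row survives, which illustrates how quickly listwise deletion can shrink a sample.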

In [None]:
if not analysis_df.empty:
    # Drop rows where any of the key analytical variables are missing
    analysis_df_complete = analysis_df.dropna(subset=['PolarizationScore', 'GINI', 'CPI'])
    print(f"Shape of analysis_df after dropping NaNs for key variables (PolarizationScore, GINI, CPI): {analysis_df_complete.shape}")
    print("Sample of the complete dataset for analysis:")
    display(analysis_df_complete.head())
    
    if analysis_df_complete.empty or len(analysis_df_complete) < 2:
        print("Warning: Too few complete data rows (<2) available for full statistical analysis after handling missing values. Correlation and regression may fail or be unreliable.")
else:
    print("analysis_df is empty. Skipping missing data handling.")
    analysis_df_complete = pd.DataFrame(columns=['countryname', 'year', 'GINI', 'CPI', 'PolarizationScore']) # Ensure empty df with correct columns

## 8. Correlation Analysis

**Objective:** To conduct an initial exploratory analysis of the linear relationships between the key variables: `PolarizationScore`, `GINI`, and `CPI`.

**Method: Pearson Correlation Coefficient**
-   Pearson's *r* measures the strength and direction of a linear association between two continuous variables.
-   Values range from -1 (perfect negative linear correlation) to +1 (perfect positive linear correlation), with 0 indicating no linear correlation.
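
As a reminder of what the coefficient computes, here is a minimal NumPy sketch of Pearson's *r* on invented values. It is meant to make the definition concrete, not to replace `DataFrame.corr`.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: co-deviation divided by the product of deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# Invented values, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

r = pearson_r(x, y)
print(round(r, 3))  # prints 0.8
```

The result matches `np.corrcoef(x, y)[0, 1]` for the same inputs.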

**Process:**
1.  Calculate the Pearson correlation matrix for the three variables.
2.  Visualize the pairwise relationships using scatter plots to help identify patterns, linearity, and potential outliers.

**Interpreting Scatter Plots:**
-   **Direction:** Upward slope (bottom-left to top-right) suggests a positive relationship; downward slope suggests a negative relationship.
-   **Strength:** How tightly clustered the points are around an imaginary line. Tighter clustering indicates a stronger relationship.
-   **Form:** Whether the relationship appears linear or if there's a curve (non-linear).
-   **Outliers:** Points that lie far away from the general pattern of the other points.

In [None]:
if not analysis_df_complete.empty and len(analysis_df_complete) >= 2:
    print("--- Correlation Analysis ---")
    correlation_matrix = analysis_df_complete[['PolarizationScore', 'GINI', 'CPI']].corr(method='pearson')
    print("Pearson Correlation Matrix:")
    display(correlation_matrix)

    # Visualize relationships
    print("\nVisualizing pairwise relationships:")
    fig, axes = plt.subplots(1, 2, figsize=(16, 6)) # Increased figure size for better readability

    sns.scatterplot(ax=axes[0], data=analysis_df_complete, x='GINI', y='PolarizationScore')
    axes[0].set_title('Polarization Score vs. GINI Coefficient')
    axes[0].set_xlabel('GINI Coefficient (Income Inequality)')
    axes[0].set_ylabel('Political Polarization Score')

    sns.scatterplot(ax=axes[1], data=analysis_df_complete, x='CPI', y='PolarizationScore')
    axes[1].set_title('Polarization Score vs. CPI')
    axes[1].set_xlabel('Consumer Price Index (CPI)')
    axes[1].set_ylabel('Political Polarization Score')

    plt.tight_layout()
    plt.show()
    
    # Optional: Save the figure to the output directory
    correlation_plot_path = os.path.join(output_dir, "correlation_plots_notebook.png")
    try:
        fig.savefig(correlation_plot_path)
        print(f"Correlation plots saved to: {correlation_plot_path}")
    except Exception as e:
        print(f"Could not save correlation plots: {e}")
else:
    print("Skipping correlation analysis due to insufficient data (less than 2 complete observations) in analysis_df_complete.")

## 9. Regression Analysis

**Objective:** To move beyond simple correlation and model the relationship between polarization (dependent variable) and our economic indicators (independent variables). Ordinary Least Squares (OLS) regression allows us to:
-   Quantify the relationship (i.e., how much polarization changes for a unit change in GINI or CPI).
-   Assess the statistical significance of these relationships (i.e., whether an estimated coefficient is distinguishable from zero given sampling variability).
-   Determine the proportion of variance in polarization that can be explained by the model (R-squared).
-   In multiple regression, assess the effect of one independent variable while statistically controlling for others.

**Models to be Estimated:**
1.  **Simple Linear Regression:** `PolarizationScore ~ GINI` (Polarization as a function of GINI alone)
2.  **Simple Linear Regression:** `PolarizationScore ~ CPI` (Polarization as a function of CPI alone)
3.  **Multiple Linear Regression:** `PolarizationScore ~ GINI + CPI` (Polarization as a function of both GINI and CPI)

**Key Regression Output Terms to Interpret:**
-   **`R-squared` / `Adj. R-squared`:** Proportion of variance in the dependent variable explained by the model. Adjusted R-squared accounts for the number of predictors.
-   **`coef` (Coefficient):** The estimated change in the dependent variable associated with a one-unit increase in an independent variable, holding any other predictors in the model constant.
-   **`P>|t|` (p-value):** If low (typically < 0.05), it suggests the coefficient is statistically significantly different from zero.
-   **`F-statistic` and `Prob (F-statistic)`:** Test the overall significance of the entire model.
-   **`Cond. No.` (Condition Number):** A diagnostic for numerical instability in the design matrix. Large values can arise from multicollinearity (high correlation between independent variables) or from predictors measured on very different scales.
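
To make these terms concrete, the sketch below fits a one-predictor OLS model with NumPy on synthetic data and reports the slope, R-squared, and the condition number of the design matrix. The data are generated purely for illustration, with the predictor centred far from zero, roughly as an index-style variable might be. The condition number here is NumPy's 2-norm condition number, which should be comparable to the value shown in the regression summary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, for illustration only: x sits far from zero.
n = 100
x = rng.normal(loc=100.0, scale=5.0, size=n)
y = 2.0 + 0.25 * x + rng.normal(scale=3.0, size=n)


def fit_ols(x, y):
    """Fit y = b0 + b1*x; return coefficients, R-squared, condition number."""
    X = np.column_stack([np.ones_like(x), x])  # intercept column plus predictor
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    r_squared = 1.0 - (residuals ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return beta, r_squared, np.linalg.cond(X)


beta_raw, r2_raw, cond_raw = fit_ols(x, y)
beta_centered, r2_centered, cond_centered = fit_ols(x - x.mean(), y)

print(f"slope: raw={beta_raw[1]:.3f}, centered={beta_centered[1]:.3f}")
print(f"R-squared: raw={r2_raw:.3f}, centered={r2_centered:.3f}")
print(f"condition number: raw={cond_raw:.0f}, centered={cond_centered:.1f}")
```

Centring the predictor leaves the slope and R-squared unchanged but lowers the condition number sharply, which shows that a large value can come from scaling alone, even with a single predictor.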

**Data Preparation for Regression:**
Ensure no `NaN` or `inf` (infinity) values are present in the subset of data used for regression, as these can cause errors.

In [None]:
if not analysis_df_complete.empty and len(analysis_df_complete) >= 2:
    print("--- Regression Analysis ---")
    # Prepare data specifically for regression, ensuring no NaN/inf values
    regression_data = analysis_df_complete[['PolarizationScore', 'GINI', 'CPI']].copy()
    regression_data.replace([np.inf, -np.inf], np.nan, inplace=True) # Replace infinities with NaN
    regression_data.dropna(inplace=True) # Drop any rows that now have NaNs (or had them before)

    if regression_data.empty or len(regression_data) < 2: # Need at least 2 data points for regression
        print("Not enough valid data points for regression after final cleaning (inf/NaN). Skipping regression.")
    else:
        print(f"Number of observations available for regression: {len(regression_data)}")
        # --- Simple Linear Regression: PolarizationScore ~ GINI ---
        print("\n1. Simple Linear Regression: PolarizationScore ~ GINI")
        try:
            model_gini = smf.ols('PolarizationScore ~ GINI', data=regression_data).fit()
            print(model_gini.summary())
            
            plt.figure(figsize=(8, 6))
            sns.regplot(x='GINI', y='PolarizationScore', data=regression_data, ci=95, line_kws={'color':'red'})
            plt.title('Regression: Polarization Score vs. GINI')
            plt.xlabel('GINI Coefficient')
            plt.ylabel('Political Polarization Score')
            plt.show()
        except Exception as e:
            print(f"Error during GINI regression: {e}")

        # --- Simple Linear Regression: PolarizationScore ~ CPI ---
        print("\n2. Simple Linear Regression: PolarizationScore ~ CPI")
        try:
            model_cpi = smf.ols('PolarizationScore ~ CPI', data=regression_data).fit()
            print(model_cpi.summary())
            
            plt.figure(figsize=(8, 6))
            sns.regplot(x='CPI', y='PolarizationScore', data=regression_data, ci=95, line_kws={'color':'red'})
            plt.title('Regression: Polarization Score vs. CPI')
            plt.xlabel('Consumer Price Index (CPI)')
            plt.ylabel('Political Polarization Score')
            plt.show()
        except Exception as e:
            print(f"Error during CPI regression: {e}")

        # --- Multiple Linear Regression: PolarizationScore ~ GINI + CPI ---
        print("\n3. Multiple Linear Regression: PolarizationScore ~ GINI + CPI")
        try:
            # Check if both GINI and CPI have enough variance to be included (at least 2 unique values)
            if regression_data['GINI'].nunique() > 1 and regression_data['CPI'].nunique() > 1:
                model_multiple = smf.ols('PolarizationScore ~ GINI + CPI', data=regression_data).fit()
                print(model_multiple.summary())
            else:
                print("Skipping multiple regression due to insufficient variance in GINI or CPI (less than 2 unique values).")
        except Exception as e:
            print(f"Error during Multiple regression: {e}")
else:
    print("Skipping regression analysis due to insufficient data (less than 2 complete observations) in analysis_df_complete.")

## 10. Discussion and Conclusion of Findings

This section synthesizes the results from the correlation and regression analyses to address the research question: *How do income inequality (GINI) and inflation (CPI) relate to the ideological dispersion of political parties in OECD countries?*

**Key Findings (based on the ~102 country-year observations with complete data for Polarization, GINI, and CPI):**

1.  **Income Inequality (GINI) and Polarization:**
    * The Pearson correlation between `PolarizationScore` and `GINI` was very weak and negative (-0.07).
    * In both simple linear regression (`PolarizationScore ~ GINI`) and multiple linear regression (`PolarizationScore ~ GINI + CPI`), the GINI coefficient was **not a statistically significant** predictor of political polarization (p-values were high, e.g., ~0.48 in simple regression).
    * The R-squared value for the GINI-only model was extremely low (0.005), indicating that GINI explains a negligible fraction of the variance in polarization scores in this dataset.
    * **Interpretation:** Within the analyzed sample and timeframe, there is no statistically significant evidence of a linear relationship between income inequality, as measured by GINI, and the observed political polarization.

2.  **Inflation (CPI) and Polarization:**
    * The Pearson correlation between `PolarizationScore` and `CPI` was positive and weak-to-moderate (0.29).
    * In both simple linear regression (`PolarizationScore ~ CPI`) and multiple linear regression, the CPI was a **statistically significant positive predictor** of political polarization (p-value ~0.003).
    * The simple regression model with CPI explained about 8.3% of the variance in polarization (R-squared = 0.083). For each one-unit increase in CPI, the polarization score is predicted to increase by approximately 0.23 points.
    * **Interpretation:** There is statistical evidence suggesting that higher levels of inflation (as measured by CPI) are associated with higher levels of political polarization in the OECD countries and years included in this analysis. This relationship persists even when controlling for GINI.

3.  **Combined Model (GINI + CPI):**
    * The multiple regression model including both GINI and CPI had an R-squared of 0.088. This is only a marginal improvement over the CPI-only model, reinforcing the finding that GINI adds little explanatory power in the presence of CPI.
    * CPI remained the only statistically significant predictor in this combined model.
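
As a consistency check on the figures above, recall that in a one-predictor OLS model R-squared equals the squared Pearson correlation, and the slope's t-statistic can be recovered from *r* and the sample size. The sketch below applies these identities to the rounded values reported here, taking n = 102 as an approximation, so small discrepancies from the regression output (for example 0.084 versus 0.083) are expected.

```python
import math

# Rounded figures as reported in the findings above; n is approximate.
n = 102
r_gini, r_cpi = -0.07, 0.29


def simple_regression_check(r, n):
    """One-predictor OLS: R-squared = r**2, t = r*sqrt(n-2)/sqrt(1-r**2)."""
    r_squared = r ** 2
    t_stat = r * math.sqrt(n - 2) / math.sqrt(1.0 - r_squared)
    return r_squared, t_stat


r2_gini, t_gini = simple_regression_check(r_gini, n)
r2_cpi, t_cpi = simple_regression_check(r_cpi, n)

print(f"GINI: R-squared={r2_gini:.3f}, t={t_gini:.2f}")
print(f"CPI:  R-squared={r2_cpi:.3f}, t={t_cpi:.2f}")
```

With about 100 degrees of freedom, an absolute t-value near 2 marks the conventional 0.05 threshold, so a t of roughly 3 for CPI and roughly -0.7 for GINI is in line with the reported p-values.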

**Overall Conclusion from this Analysis:**

The primary finding of this analysis is a statistically significant positive association between the Consumer Price Index (CPI) and political polarization (measured as vote-share weighted standard deviation of `rile` scores) in the subset of OECD country-years with complete data. Higher inflation appears to coincide with greater ideological divergence among political parties.

Conversely, income inequality (GINI coefficient) did not show a statistically significant relationship with political polarization in this particular dataset and analytical framework. The observed correlations and regression coefficients for GINI were small and not statistically distinguishable from zero.

**Important Limitations and Considerations:**

-   **Sample Size and Period:** The statistical inferences are based on approximately 102 country-year observations, primarily from the mid-2000s onwards, due to the availability of complete GINI and CPI data. This is a subset of the total period for which polarization data might be available, and the findings might not generalize to other periods or hold if more complete data were available.
-   **Explanatory Power (R-squared):** While the relationship with CPI is statistically significant, the R-squared values (around 8-9%) indicate that CPI (and GINI) explain only a small fraction of the total variation in political polarization. This strongly suggests that many other unobserved factors (e.g., institutional characteristics, specific political events, cultural shifts, other economic variables, media landscape) play a more substantial role in driving polarization.
-   **Measure of Polarization:** The chosen measure (vote-share weighted standard deviation of `rile`) is one among many. Different operationalizations of polarization (e.g., affective polarization, legislative voting patterns) might yield different results.
-   **Linearity Assumption:** The regression models assume linear relationships. The actual relationships might be non-linear or more complex.
-   **Correlation vs. Causation:** This analysis identifies associations. It **cannot establish causal links**. For instance, while higher CPI is associated with higher polarization, we cannot conclude that inflation *causes* polarization, or vice-versa. There could be a third factor influencing both, or the relationship could be bidirectional.
-   **Omitted Variable Bias:** The exclusion of other relevant variables could bias the estimated coefficients for GINI and CPI.
-   **Condition Number:** The regression outputs noted high condition numbers, particularly for the CPI model and the multiple regression model. `statsmodels` flags large condition numbers because they can indicate strong multicollinearity or other numerical problems. Since the CPI-only model has a single predictor, and GINI and CPI showed very low correlation with each other, variable scaling is a more plausible source than collinearity between predictors, though it could still affect the numerical stability of the coefficient estimates. The consistency of the CPI effect across models provides some reassurance.

**Potential Future Research Directions:**

-   Explore alternative or more comprehensive measures of polarization.
-   Incorporate a wider range of economic, social, and institutional variables.
-   Employ more advanced time-series or panel data techniques to better account for country-specific fixed effects and temporal dynamics (e.g., lagged effects).
-   Investigate potential non-linear relationships.
-   Conduct case studies of specific countries to understand the mechanisms linking economic conditions to polarization in context.
-   Address missing data through imputation techniques if appropriate, to potentially expand the sample size (though this requires careful consideration of assumptions).
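
As a pointer for the panel-data direction, the sketch below shows the "within" transformation that underlies a country fixed-effects estimate: each variable is demeaned by country before the slope is computed. The panel is invented and tiny, and this is only one of several ways such a model could be estimated; it is not part of the analysis above.

```python
import numpy as np
import pandas as pd

# Invented panel: two countries observed in three years each.
panel = pd.DataFrame({
    "countryname": ["A", "A", "A", "B", "B", "B"],
    "PolarizationScore": [10.0, 11.0, 13.0, 20.0, 21.0, 24.0],
    "CPI": [1.0, 2.0, 3.0, 2.0, 3.0, 5.0],
})

cols = ["PolarizationScore", "CPI"]
# Within transformation: subtract each country's own mean from its observations.
demeaned = panel[cols] - panel.groupby("countryname")[cols].transform("mean")

# Slope of the demeaned regression (no intercept is needed after demeaning).
x = demeaned["CPI"].to_numpy()
y = demeaned["PolarizationScore"].to_numpy()
within_slope = float((x * y).sum() / (x ** 2).sum())

print(round(within_slope, 3))
```

Because country means are removed, the slope reflects only variation within each country over time, not level differences between countries.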

## 11. Next Steps for the Project

The next phase of this project involves taking these findings and the processed data into a business intelligence tool like **Power BI** for further interactive visualization and dashboard creation. This will allow for:

-   More dynamic exploration of trends over time and across countries.
-   Creation of a shareable dashboard to present the key findings to a wider audience.
-   Potentially overlaying qualitative information or event data to help contextualize the observed patterns.

The dataset `analysis_df_complete` (containing the ~102 observations) or the broader `analysis_df` (before listwise deletion for GINI/CPI) could be exported to a new CSV file for easy import into Power BI.
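
A minimal export could look like the sketch below. The DataFrame is an invented stand-in for `analysis_df_complete`, the file name is hypothetical, and a temporary directory stands in for the project's `output` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for analysis_df_complete; the values are invented for illustration.
export_df = pd.DataFrame({
    "countryname": ["A", "B"],
    "year": [2010, 2010],
    "GINI": [0.30, 0.35],
    "CPI": [2.1, 1.4],
    "PolarizationScore": [12.5, 18.0],
})

# Hypothetical file name; a temporary directory stands in for the output folder.
export_path = Path(tempfile.mkdtemp()) / "polarization_analysis_for_powerbi.csv"
export_df.to_csv(export_path, index=False, encoding="utf-8")

# Read the file back to confirm that it round-trips cleanly.
round_trip = pd.read_csv(export_path)
print(round_trip.shape)
```

Writing with `index=False` keeps the pandas row index out of the file, so the import into Power BI sees only the five analysis columns.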