# 04 – Correspondence Analysis for Two Categorical Variables

In this notebook you will:

1. Check that the required Python modules are installed.  
2. Load a CSV file into a pandas DataFrame.  
3. Select **two categorical (qualitative) variables**.  
4. Build a **contingency table** (cross-tabulation) of the two variables.  
5. Perform **correspondence analysis (CA)** to map categories into a low-dimensional geometric space.  
6. Visualize the result with a simple **biplot** (rows and columns in the same plane).  

This method is useful, for example, when you want to study relationships between:

- social groups and occupations,  
- regions and political choices,  
- historical actors and categories, etc.


In [6]:
# Step 1: check and import required modules

print("Step 1: Checking required Python modules...")

required_modules = ["pandas", "numpy", "matplotlib"]
missing = []

for m in required_modules:
    try:
        __import__(m)
    except ImportError:
        missing.append(m)

if missing:
    print("\nThe following required modules are missing:")
    for m in missing:
        print("  -", m)
    print("\nPlease install them before continuing, for example:")
    print("  pip install " + " ".join(missing))
else:
    print("All required modules are available. Proceeding...")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

print("Modules successfully imported: pandas, numpy, matplotlib.")


Step 1: Checking required Python modules...
All required modules are available. Proceeding...
Modules successfully imported: pandas, numpy, matplotlib.


In [7]:
# Step 2: load a CSV file (with robust handling of common format problems)

print("Step 2: Loading the data from a CSV file.")
print("Please provide:")
print("- The name (or path) of the CSV file, including the extension (e.g. 'data.csv').")
print("- Later you will choose TWO categorical columns for the analysis.")
print("IMPORTANT: include the file extension, e.g. '.csv'.")

df = None

while df is None:
    csv_path = input("Enter the CSV file name or path (e.g. data.csv): ").strip()
    if not csv_path:
        print("You did not enter a file name. Please try again.\n")
        continue

    # We first try a standard read, then fall back to a more tolerant mode if needed
    try:
        df = pd.read_csv(csv_path)
        print(f"\nFile loaded with default settings from: {csv_path}")
    except FileNotFoundError:
        print(f"File not found: {csv_path}")
        print("Please check that:")
        print("- The file is in the current working directory OR you used the correct relative path;")
        print("- You included the correct extension, e.g. '.csv'.")
        print("Try again.\n")
        df = None
    except pd.errors.ParserError as e:
        print(f"ParserError with default settings: {e}")
        print("\nThis often happens when:")
        print("- The separator is not a comma (','), but a semicolon (';') or tab;")
        print("- Some lines have an extra separator inside text fields.")
        print("\nWe will now try a more tolerant reading:")
        print("- First, we try with a semicolon separator (sep=';').")
        print("- If that fails, we try engine='python' and on_bad_lines='skip' (skipping malformed lines).")
        try:
            df = pd.read_csv(csv_path, sep=";")
            print("\nFile loaded using sep=';'. Please check that the columns look correct.")
        except Exception as e2:
            print(f"Reading with sep=';' also failed: {e2}")
            try:
                df = pd.read_csv(csv_path, engine="python", on_bad_lines="skip")
                print("\nFile loaded with engine='python' and on_bad_lines='skip'.")
                print("Warning: some problematic lines may have been skipped. Use this with care in real research.")
            except Exception as e3:
                print(f"Tolerant reading also failed: {e3}")
                print("Please inspect the CSV file (e.g. in a text editor or spreadsheet), fix the format issues, and try again.\n")
                df = None
    except Exception as e:
        print(f"An unexpected error occurred while reading the file: {e}")
        print("Please fix the problem (e.g. encoding, separator) and try again.\n")
        df = None

print(f"\nData loaded successfully from: {csv_path}")
print("The first 5 rows of the dataset:")
display(df.head())

print("\nColumn names in this dataset:")
print(list(df.columns))


Step 2: Loading the data from a CSV file.
Please provide:
- The name (or path) of the CSV file, including the extension (e.g. 'data.csv').
- Later you will choose TWO categorical columns for the analysis.
IMPORTANT: include the file extension, e.g. '.csv'.


Enter the CSV file name or path (e.g. data.csv):  hbc1.csv



File loaded with default settings from: hbc1.csv

Data loaded successfully from: hbc1.csv
The first 5 rows of the dataset:


   Unnamed: 0  osoba  proces  rzecz  wydarzenie
0     kobiety     85      57     15          27
1   mężczyźni      9       6      5          14



Column names in this dataset:
['Unnamed: 0', 'osoba', 'proces', 'rzecz', 'wydarzenie']
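
A semicolon-separated file does not always raise a `ParserError`: pandas may instead load it as a single wide column, in which case the fallback above never runs. One possible alternative, sketched here on made-up in-memory text rather than a real file, is to let pandas detect the delimiter with `sep=None` and `engine="python"`. The sample content and the name `sniffed` are illustrative assumptions.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated content standing in for a real file
sample_text = "region;party\nnorth;A\nsouth;B\nnorth;B\n"

# sep=None asks pandas to detect the delimiter; this requires engine="python"
sniffed = pd.read_csv(io.StringIO(sample_text), sep=None, engine="python")

print(list(sniffed.columns))
print(sniffed.shape)
```

With a real file, pass the path instead of `io.StringIO(sample_text)`, and still check the column names after loading, since delimiter detection is a heuristic.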


In [8]:
# Step 3: choose two categorical columns and build a contingency table

print("Step 3: Choose TWO categorical (qualitative) variables.")
print("For example: 'region' and 'party', 'social_class' and 'occupation', etc.")

def ask_for_column(prompt):
    while True:
        col = input(prompt).strip()
        if col in df.columns:
            return col
        print(f"Column '{col}' was not found in the dataset. Please try again.")
        print("Available columns:")
        print(list(df.columns))

col_r = ask_for_column("Enter the name of the FIRST categorical column (rows): ")
col_c = ask_for_column("Enter the name of the SECOND categorical column (columns): ")

print("\nWe will now analyse the relationship between:")
print(f"- Rows:    {col_r}")
print(f"- Columns: {col_c}")

# Build contingency table
contingency = pd.crosstab(df[col_r], df[col_c])

print("\nContingency table (raw counts):")
display(contingency)

print("\nBasic information:")
print(f"- Number of row categories:    {contingency.shape[0]}")
print(f"- Number of column categories: {contingency.shape[1]}")
print(f"- Total count:                 {contingency.values.sum()}")


Step 3: Choose TWO categorical (qualitative) variables.
For example: 'region' and 'party', 'social_class' and 'occupation', etc.


KeyboardInterrupt: Interrupted by user
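
The preview in Step 2 suggests that `hbc1.csv` may already contain aggregated counts (one row per group, one column per category) rather than one row per observation. If that is the case, `pd.crosstab` is not needed, because the table can serve as the contingency table directly. A minimal sketch, assuming the file layout reconstructed below from the preview (the CSV text is an assumption, not the actual file):

```python
import io
import pandas as pd

# Assumed file content, reconstructed from the preview shown in Step 2
csv_text = (
    ",osoba,proces,rzecz,wydarzenie\n"
    "kobiety,85,57,15,27\n"
    "mężczyźni,9,6,5,14\n"
)

# index_col=0 turns the first (unnamed) column into row labels
contingency = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(contingency)
print("Total count:", int(contingency.values.sum()))
```

With a real file, pass its path instead of `io.StringIO(csv_text)`. The resulting `contingency` DataFrame has the same shape of object that Step 4 expects.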

In [None]:
# Step 4: perform correspondence analysis (CA)

print("Step 4: Performing correspondence analysis (CA).")

N = contingency.values.astype(float)           # raw counts as NumPy array
grand_total = N.sum()

if grand_total == 0:
    raise ValueError("The contingency table is empty (total = 0). Cannot perform CA.")

# Matrix of relative frequencies
P = N / grand_total

# Row and column masses (marginal proportions of P)
r = P.sum(axis=1)   # row masses
c = P.sum(axis=0)   # column masses

# Diagonal matrices holding the inverse square roots of the row/column masses
Dr_inv_sqrt = np.diag(1.0 / np.sqrt(r))
Dc_inv_sqrt = np.diag(1.0 / np.sqrt(c))

# Matrix of standardized residuals
rcT = np.outer(r, c)
S = Dr_inv_sqrt @ (P - rcT) @ Dc_inv_sqrt

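# Singular Value Decomposition
U, singular_values, VT = np.linalg.svd(S, full_matrices=False)

# CA has at most min(rows, columns) - 1 non-trivial dimensions, so the
# smallest singular value is zero up to rounding error; drop such dimensions.
keep = singular_values > 1e-12
U, singular_values, VT = U[:, keep], singular_values[keep], VT[keep, :]

if singular_values.size == 0:
    raise ValueError("Total inertia is zero (rows and columns are independent). Nothing to display.")

# Eigenvalues (principal inertia of each dimension)
eigenvalues = singular_values**2
total_inertia = eigenvalues.sum()
explained_inertia = eigenvalues / total_inertia

print("\nEigenvalues and percentage of inertia explained by each dimension:")
for i, (eig, prop) in enumerate(zip(eigenvalues, explained_inertia), start=1):
    print(f"Dim {i}: eigenvalue = {eig:.6f}, proportion of inertia = {prop*100:.2f}%")

# Principal coordinates for rows and columns (all retained dimensions)
F = Dr_inv_sqrt @ U @ np.diag(singular_values)    # row principal coordinates
G = Dc_inv_sqrt @ VT.T @ np.diag(singular_values) # column principal coordinates

# With a single dimension (e.g. a table with only two rows), pad with a zero
# second axis so that the two-dimensional display in Step 5 still works.
if F.shape[1] < 2:
    F = np.hstack([F, np.zeros((F.shape[0], 1))])
    G = np.hstack([G, np.zeros((G.shape[0], 1))])

dim1, dim2 = 0, 1

row_coords = F[:, [dim1, dim2]]
col_coords = G[:, [dim1, dim2]]

print("\nWe will use the first two dimensions for the graphical display.")
print(f"Dim 1 explains {explained_inertia[0]*100:.2f}% of inertia.")
if len(explained_inertia) > 1:
    print(f"Dim 2 explains {explained_inertia[1]*100:.2f}% of inertia.")
else:
    print("Warning: only one non-zero dimension was found; all points will lie on the horizontal axis.")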
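
A useful sanity check on Step 4 is that the total inertia equals the Pearson chi-square statistic divided by the grand total. The sketch below verifies this on the counts shown in the Step 2 preview, typed in by hand as an array. It recomputes everything from scratch and does not depend on the variables defined above.

```python
import numpy as np

# Counts copied from the data preview in Step 2
# (rows: kobiety, mężczyźni; columns: osoba, proces, rzecz, wydarzenie)
N = np.array([[85, 57, 15, 27],
              [9, 6, 5, 14]], dtype=float)
n = N.sum()

# Total inertia via the standardized residuals, as in Step 4
P = N / n
r = P.sum(axis=1)
c = P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
singular_values = np.linalg.svd(S, compute_uv=False)
total_inertia = (singular_values ** 2).sum()

# Pearson chi-square statistic computed directly from expected counts
expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / n
chi2 = ((N - expected) ** 2 / expected).sum()

print(f"total inertia = {total_inertia:.6f}")
print(f"chi2 / n      = {chi2 / n:.6f}")
```

The two printed values should agree up to rounding. Because this table has only two rows, only one singular value is meaningfully greater than zero, so a two-dimensional map of it carries information on the first axis only.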


In [None]:
# Step 5: biplot of row and column categories

print("Step 5: Creating a simple biplot of row and column categories.")

row_labels = contingency.index.to_list()
col_labels = contingency.columns.to_list()

x_rows = row_coords[:, 0]
y_rows = row_coords[:, 1]

x_cols = col_coords[:, 0]
y_cols = col_coords[:, 1]

plt.figure(figsize=(8, 8))

# Plot row categories
plt.scatter(x_rows, y_rows, marker="o", label="Row categories")
for x, y, label in zip(x_rows, y_rows, row_labels):
    plt.text(x, y, " " + str(label), verticalalignment="center")

# Plot column categories
plt.scatter(x_cols, y_cols, marker="s", label="Column categories")
for x, y, label in zip(x_cols, y_cols, col_labels):
    plt.text(x, y, " " + str(label), verticalalignment="center")

plt.axhline(0)
plt.axvline(0)
plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.title("Correspondence Analysis – Biplot (rows and columns)")
plt.legend()
plt.tight_layout()
plt.show()

print("\nInterpretation tips:")
print("- Row categories that lie close to each other have similar profiles (likewise for column categories);")
print("- Categories far from the origin have profiles that differ strongly from the average profile;")
print("- A row point and a column point lying in the same direction from the origin suggest an association, but row-to-column distances are not directly interpretable in this symmetric map;")
print("- Always interpret these patterns in the historical and social context of your data.")
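
Distances on a CA map are only meaningful if both axes use the same scale. The following sketch shows one way to enforce that and to label each axis with its share of inertia. The helper `plot_ca_map` and the coordinates passed to it are hypothetical and made up for illustration; the helper assumes `explained_inertia` has at least two entries.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_ca_map(row_coords, col_coords, row_labels, col_labels, explained_inertia):
    """Draw rows and columns on one map with equal axis scaling (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.scatter(row_coords[:, 0], row_coords[:, 1], marker="o", label="Row categories")
    ax.scatter(col_coords[:, 0], col_coords[:, 1], marker="s", label="Column categories")
    # Offset the labels slightly so they do not sit on top of the markers
    for (x, y), label in zip(row_coords, row_labels):
        ax.annotate(str(label), (x, y), xytext=(4, 4), textcoords="offset points")
    for (x, y), label in zip(col_coords, col_labels):
        ax.annotate(str(label), (x, y), xytext=(4, 4), textcoords="offset points")
    ax.axhline(0, color="grey", linewidth=0.8)
    ax.axvline(0, color="grey", linewidth=0.8)
    # One unit on the x axis is drawn as long as one unit on the y axis
    ax.set_aspect("equal", adjustable="datalim")
    ax.set_xlabel(f"Dimension 1 ({explained_inertia[0] * 100:.1f}% of inertia)")
    ax.set_ylabel(f"Dimension 2 ({explained_inertia[1] * 100:.1f}% of inertia)")
    ax.legend()
    return ax

# Made-up coordinates for illustration only
demo_rows = np.array([[0.30, 0.10], [-0.20, -0.05], [-0.10, 0.15]])
demo_cols = np.array([[0.25, -0.10], [-0.15, 0.20], [-0.05, -0.12]])
ax = plot_ca_map(demo_rows, demo_cols,
                 ["r1", "r2", "r3"], ["c1", "c2", "c3"],
                 explained_inertia=np.array([0.7, 0.3]))
```

To use it with the results of Step 4, pass `row_coords`, `col_coords`, the index and column labels of `contingency`, and `explained_inertia`.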


## Summary of what this notebook did

1. **Loaded a CSV file** and allowed you to choose **two categorical variables**.  
2. Built a **contingency table** (cross-tabulation) of the two variables.  
3. Computed the **matrix of relative frequencies** and the row and column **masses** (marginal proportions).  
4. Constructed the matrix of **standardized residuals** and performed an SVD to obtain the principal dimensions.  
5. Calculated **eigenvalues** and the **percentage of inertia** explained by each dimension.  
6. Plotted a simple **biplot**, where:  
   - row categories and column categories are displayed in the same geometric space;  
   - distances and relative positions reflect similarities and associations in the contingency table.  

You can adapt this notebook by:

- Using other pairs of categorical variables from your dataset;  
- Filtering out very rare categories before running CA;  
- Comparing results for different time periods or subgroups;  
- Combining the geometric interpretation with close reading of historical sources and qualitative analysis.
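
As an illustration of the rare-category suggestion above, the sketch below drops rows and columns whose totals fall under a threshold. The helper name `drop_rare_categories`, the default threshold of 5, and the example counts are all assumptions made for this illustration.

```python
import pandas as pd

def drop_rare_categories(table, min_count=5):
    """Drop rows, then columns, whose marginal total is below min_count (hypothetical helper)."""
    kept_rows = table.sum(axis=1) >= min_count
    filtered = table.loc[kept_rows]
    kept_cols = filtered.sum(axis=0) >= min_count
    return filtered.loc[:, kept_cols]

# Made-up counts for illustration only
example = pd.DataFrame(
    [[40, 25, 1], [30, 35, 0], [1, 0, 1]],
    index=["north", "south", "other"],
    columns=["party_a", "party_b", "party_c"],
)

filtered = drop_rare_categories(example, min_count=5)
print(filtered)
```

This is a single pass: removing columns can push a row total below the threshold again, so it may be worth re-checking the margins of the result. Report any categories you remove, since rare categories can strongly influence a CA map.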
