# Filter Dataset by Required Wikipedia Languages

## Purpose
Filters the cross-verified notable entities database to retain only entities with Wikipedia articles in **all five languages**: en, it, fr, es, de.

## Research Focus
Analyze bias across three dimensions:
- **Gender**: Male vs Female representation
- **Geographic**: Western vs Non-Western regions (UN subregion)
- **Temporal**: Historical periods (bigperiod_birth)

## Data Source
- **Input**: `cross-verified-database.csv.gz` from [BHHT Datascape](https://medialab.github.io/bhht-datascape/)
- **Output**: `data/entities_filtered_by_languages.csv`

**Note**: The five languages are Western Wikipedia editions, selected by European speaker population. TODO: Discuss with the professor whether to expand to non-Western editions.

## 1. Import Libraries

In [11]:
import pandas as pd
from typing import List

## 2. Configuration

In [12]:
# Configuration constants
# Western Wikipedia editions sorted by European speaker population
# Source: https://meta.wikimedia.org/wiki/List_of_Wikipedias
REQUIRED_LANGUAGES = ['en', 'it', 'fr', 'es', 'de']

INPUT_FILE = '../data/cross-verified-database.csv.gz'
OUTPUT_FILE = '../data/entities_filtered_by_languages.csv'

## 3. Load Dataset

In [13]:
# Load the dataset
df = pd.read_csv(INPUT_FILE, compression='gzip', encoding='latin-1')

# Rename wikidata_code to wikidata_id for consistency across the pipeline
df = df.rename(columns={'wikidata_code': 'wikidata_id'})
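
The full file has about 2.3 million rows and 49 columns, but the later cells use only a handful of them. As a minimal sketch (not part of the original pipeline), `usecols` in `pd.read_csv` can restrict loading to those columns. The miniature file, its rows, and the `NEEDED_COLUMNS` name are hypothetical and exist only to make the example runnable.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical miniature of the input file, written to a temp path for illustration.
sample_csv = (
    "wikidata_code,birth,gender,un_subregion,bigperiod_birth,"
    "list_wikipedia_editions,name\n"
    "Q1,1922,Male,Eastern Europe,5.Contemporary period 1901-2020AD,"
    "enwiki|dewiki,Example A\n"
    "Q2,1886,Female,Western Europe,4.Mid Modern Period 1751-1900AD,"
    "frwiki,Example B\n"
)
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv.gz")
with gzip.open(sample_path, "wt", encoding="latin-1") as fh:
    fh.write(sample_csv)

# Assumption: these are the only columns the rest of the notebook needs.
NEEDED_COLUMNS = [
    "wikidata_code", "birth", "gender", "un_subregion",
    "bigperiod_birth", "list_wikipedia_editions",
]

# Read only the needed columns, then apply the same rename as the notebook.
df_small = pd.read_csv(
    sample_path,
    compression="gzip",
    encoding="latin-1",
    usecols=NEEDED_COLUMNS,
).rename(columns={"wikidata_code": "wikidata_id"})

print(df_small.columns.tolist())
```

The trade-off is that the column overview in the next cell would then list only the selected columns, so this is better suited to downstream scripts than to exploratory inspection.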

In [14]:
# Display basic dataset information
print(f"Dataset shape: {df.shape[0]:,} rows × {df.shape[1]} columns\n")

# Show all available columns
print("Available columns in the dataset:")
print("=" * 80)
for i, col in enumerate(df.columns, 1):
    print(f"{i:2d}. {col}")
print("=" * 80)

Dataset shape: 2,291,817 rows × 49 columns

Available columns in the dataset:
 1. wikidata_id
 2. birth
 3. death
 4. updated_death_date
 5. approx_birth
 6. approx_death
 7. birth_min
 8. birth_max
 9. death_min
10. death_max
11. gender
12. level1_main_occ
13. name
14. un_subregion
15. birth_estimation
16. death_estimation
17. bigperiod_birth_graph_b
18. bigperiod_death_graph_b
19. curid
20. level2_main_occ
21. freq_main_occ
22. freq_second_occ
23. level2_second_occ
24. level3_main_occ
25. bigperiod_birth
26. bigperiod_death
27. wiki_readers_2015_2018
28. non_missing_score
29. total_count_words_b
30. number_wiki_editions
31. total_noccur_links_b
32. sum_visib_ln_5criteria
33. ranking_visib_5criteria
34. all_geography_groups
35. string_citizenship_raw_d
36. citizenship_1_b
37. citizenship_2_b
38. list_areas_of_rattach
39. area1_of_rattachment
40. area2_of_rattachment
41. list_wikipedia_editions
42. un_region
43. group_wikipedia_editions
44. bplo1
45. dplo1
46. bpla1
47. dpla1
48. panth

### Examine Key Columns for Analysis
We'll inspect the identifier column and the columns relevant to the three bias dimensions:
- **Identity Column**: `wikidata_id` - Unique identifier for each entity
- **Gender**: `gender` - Male/Female representation
- **Geographic**: `un_subregion` - Geographic regions based on UN classification
- **Temporal**: `bigperiod_birth` - Historical period bins, `birth` - Year of birth
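
Before inspecting these columns, it can help to confirm that they all exist after the rename. The helper below is a small sketch. `missing_columns` and `EXPECTED` are hypothetical names, and the toy frame stands in for `df`.

```python
import pandas as pd

# Columns listed above as relevant for the analysis.
EXPECTED = ["wikidata_id", "gender", "bigperiod_birth", "un_subregion", "birth"]


def missing_columns(frame: pd.DataFrame, expected: list) -> list:
    """Return the expected column names that are absent from the frame."""
    return [col for col in expected if col not in frame.columns]


# Hypothetical frame lacking two of the expected columns.
toy = pd.DataFrame(columns=["wikidata_id", "gender", "un_subregion"])
print(missing_columns(toy, EXPECTED))  # ['bigperiod_birth', 'birth']
```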

In [15]:
# Examine the key columns for our analysis
analysis_columns = ['wikidata_id', 'gender', 'bigperiod_birth', 'un_subregion']

print("Sample data from key columns:")
print("=" * 80)
print(df[analysis_columns].head(10))
print("\n")

Sample data from key columns:
  wikidata_id  gender                    bigperiod_birth      un_subregion
0    Q1000002    Male  5.Contemporary period 1901-2020AD    Western Europe
1    Q1000005    Male    4.Mid Modern Period 1751-1900AD    Western Europe
2    Q1000006    Male  5.Contemporary period 1901-2020AD    Western Europe
3    Q1000015    Male  5.Contemporary period 1901-2020AD    Western Europe
4    Q1000023  Female  5.Contemporary period 1901-2020AD    Western Europe
5    Q1000026    Male  5.Contemporary period 1901-2020AD  Northern America
6    Q1000034    Male    4.Mid Modern Period 1751-1900AD    Western Europe
7    Q1000044  Female  5.Contemporary period 1901-2020AD    Western Europe
8    Q1000045    Male  5.Contemporary period 1901-2020AD    Western Europe
9    Q1000048    Male    4.Mid Modern Period 1751-1900AD    Western Europe




In [16]:
# Display unique values for categorical columns
print("GENDER - Unique values:")
print("-" * 80)
gender_counts = df['gender'].value_counts(dropna=False)
for value, count in gender_counts.items():
    print(f"  {value}: {count:,} ({count/len(df)*100:.1f}%)")

print("\n" + "=" * 80)
print("TEMPORAL (bigperiod_birth) - Unique values:")
print("-" * 80)
birth_periods = df['bigperiod_birth'].value_counts(dropna=False).sort_index()
for period, count in birth_periods.items():
    print(f"  {period}: {count:,} ({count/len(df)*100:.1f}%)")

print("\n" + "=" * 80)
print("GEOGRAPHIC (un_subregion) - Unique values:")
print("-" * 80)
regions = df['un_subregion'].value_counts(dropna=False)
for region, count in regions.items():
    print(f"  {region}: {count:,} ({count/len(df)*100:.1f}%)")

GENDER - Unique values:
--------------------------------------------------------------------------------
  Male: 1,901,904 (83.0%)
  Female: 387,906 (16.9%)
  nan: 1,398 (0.1%)
  Other: 609 (0.0%)

TEMPORAL (bigperiod_birth) - Unique values:
--------------------------------------------------------------------------------
  1.Ancient History Before 500AD: 4,992 (0.2%)
  2.Post-Classical History 501-1500AD: 31,220 (1.4%)
  3.Early Modern Period 1501-1750AD: 92,494 (4.0%)
  4.Mid Modern Period 1751-1900AD: 480,273 (21.0%)
  5.Contemporary period 1901-2020AD: 1,486,919 (64.9%)
  Missing: 195,919 (8.5%)

GEOGRAPHIC (un_subregion) - Unique values:
--------------------------------------------------------------------------------
  Western Europe: 779,670 (34.0%)
  Northern America: 474,983 (20.7%)
  Southern Europe: 223,089 (9.7%)
  Northern Europe: 138,728 (6.1%)
  South America: 108,613 (4.7%)
  Eastern Europe: 106,823 (4.7%)
  Oceania Western World: 79,116 (3.5%)
  Eastern Asia: 72,716 (3.2

## 4. Define Target Languages

Filter for entities with Wikipedia articles in **all five** languages: en (English), it (Italian), fr (French), es (Spanish), de (German).

In [17]:
# Generate Wikipedia edition format (e.g., 'enwiki', 'itwiki')
# This format is used in the list_wikipedia_editions column
required_languages_wiki = [f"{lang}wiki" for lang in REQUIRED_LANGUAGES]

print(f'Base language codes: {REQUIRED_LANGUAGES}')
print(f'Wikipedia editions: {required_languages_wiki}')

Base language codes: ['en', 'it', 'fr', 'es', 'de']
Wikipedia editions: ['enwiki', 'itwiki', 'frwiki', 'eswiki', 'dewiki']


## 5. Filter by Required Languages

Keep only entities with articles in **all 5 languages** (multilayer network requirement).

In [18]:
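# Check whether every required edition appears in a pipe-separated edition string
def has_required_languages(lang_list_str: str, required: List[str]) -> bool:
    # Missing values (NaN) are not strings; treat them as having no editions
    if not isinstance(lang_list_str, str):
        return False
    langs = set(lang_list_str.split('|'))
    return all(lang in langs for lang in required)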

# Apply the filter using Wikipedia edition format
df_filtered = df[df['list_wikipedia_editions'].apply(
    lambda x: has_required_languages(x, required_languages_wiki)
)]

print(f"Rows before filtering: {len(df):,}")
print(f"Rows after filtering: {len(df_filtered):,}")
print(f"Retention rate: {len(df_filtered)/len(df)*100:.1f}%")

Rows before filtering: 2,291,817
Rows after filtering: 94,403
Retention rate: 4.1%
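
To make the filter's behaviour concrete, here is a small self-contained sketch on made-up rows. The IDs and edition strings are hypothetical, not taken from the dataset, and treating a missing edition list as "no match" is an assumption. Splitting on `|` and comparing whole tokens means an edition matches only on its exact name, never as a substring of another edition.

```python
from typing import List

import pandas as pd

REQUIRED = ["enwiki", "itwiki", "frwiki", "eswiki", "dewiki"]


def has_required_languages(lang_list_str, required: List[str]) -> bool:
    # Assumption: a missing edition list (None/NaN) never satisfies the filter.
    if not isinstance(lang_list_str, str):
        return False
    # Compare whole tokens, so only exact edition names count.
    return set(required).issubset(lang_list_str.split("|"))


# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "wikidata_id": ["Q_a", "Q_b", "Q_c", "Q_d"],
    "list_wikipedia_editions": [
        "enwiki|dewiki|frwiki|eswiki|itwiki|ruwiki",  # all five plus one extra
        "enwiki|dewiki|frwiki",                       # missing itwiki and eswiki
        "simplewiki|enwiki|itwiki|frwiki|eswiki",     # missing dewiki
        None,                                         # no edition list at all
    ],
})

mask = toy["list_wikipedia_editions"].apply(has_required_languages, required=REQUIRED)
kept = toy.loc[mask, "wikidata_id"].tolist()
print(kept)  # ['Q_a']
```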


## 6. Preview Filtered Data

In [19]:
# Display the first few rows of the filtered DataFrame
df_filtered.head()

# Optionally, save the filtered DataFrame to a new CSV file
# df_filtered.to_csv('../data/cross-verified-database_filtered.csv.gz', index=False, compression='gzip')

Unnamed: 0,wikidata_id,birth,death,updated_death_date,approx_birth,approx_death,birth_min,birth_max,death_min,death_max,...,area2_of_rattachment,list_wikipedia_editions,un_region,group_wikipedia_editions,bplo1,dplo1,bpla1,dpla1,pantheon_1,level3_all_occ
10,Q100005,1922.0,1951.0,,,,1922.0,1922.0,1951.0,1951.0,...,Missing,enwiki|dewiki|frwiki|eswiki|ruwiki|itwiki|nlwi...,Europe,grA,28.657778,21.01111,50.254444,52.23,0,D:_journalist_poet_writer_critic_journalist_P:...
32,Q1000203,1886.0,1945.0,,,,1886.0,1886.0,1945.0,1945.0,...,Missing,cawiki|dewiki|enwiki|eswiki|frwiki|itwiki|jawi...,Europe,grA,2.351389,2.351389,48.856945,48.856945,0,D:_architect_production_designer_P:_architect_...
51,Q1000296,1979.0,,,,,1979.0,1979.0,,,...,Missing,eswiki|itwiki|zhwiki|fawiki|huwiki|fiwiki|dewi...,America,grA,-43.941944,,-19.928055,,0,D:_football_P:_football_soccer_English_footbal...
66,Q1000379,1929.0,2010.0,,,,1929.0,1929.0,2010.0,2010.0,...,Missing,bgwiki|dewiki|enwiki|eswiki|etwiki|frwiki|huwi...,Africa,grA,11.4333,11.518056,3.08333,3.857778,0,D:_writer_diplomat_politician_P:_diplomat_poli...
100,Q1000592,1988.0,,,,,1988.0,1988.0,,,...,Ireland,dewiki|enwiki|frwiki|nlwiki|ukwiki|plwiki|jawi...,Europe,grA,-2.233333,,53.466667,,0,D:_boxer_P:_ boxer_magazine_champion_English_b...


## 7. Save Selected Columns

In [20]:
# Create final filtered dataset with selected columns
selected_columns = ['wikidata_id', 'birth', 'bigperiod_birth', 'un_subregion', 'gender']
df_final = df_filtered[selected_columns].copy()

print(f"Selected columns: {selected_columns}")
print(f"\nFinal dataset: {df_final.shape[0]:,} rows × {df_final.shape[1]} columns")
print("\nSample:")
print("=" * 80)
print(df_final.head(10))
print("\n")

# Save to CSV
df_final.to_csv(OUTPUT_FILE, index=False)
print(f"✓ Saved to: {OUTPUT_FILE}")

Selected columns: ['wikidata_id', 'birth', 'bigperiod_birth', 'un_subregion', 'gender']

Final dataset: 94,403 rows × 5 columns

Sample:
    wikidata_id   birth                      bigperiod_birth  \
10      Q100005  1922.0    5.Contemporary period 1901-2020AD   
32     Q1000203  1886.0      4.Mid Modern Period 1751-1900AD   
51     Q1000296  1979.0    5.Contemporary period 1901-2020AD   
66     Q1000379  1929.0    5.Contemporary period 1901-2020AD   
100    Q1000592  1988.0    5.Contemporary period 1901-2020AD   
110     Q100063  1991.0    5.Contemporary period 1901-2020AD   
124      Q10007  1491.0  2.Post-Classical History 501-1500AD   
131    Q1000729  1898.0      4.Mid Modern Period 1751-1900AD   
141    Q1000791  1991.0    5.Contemporary period 1901-2020AD   
153    Q1000874  1388.0  2.Post-Classical History 501-1500AD   

         un_subregion  gender  
10     Eastern Europe    Male  
32     Western Europe    Male  
51      South America    Male  
66     Central Africa    Male 
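
Because the research focus is representation bias, one natural follow-up is to check how the five-language filter shifts each category's share. The sketch below is one possible way to do that and is not part of the original notebook. `share_table` is a hypothetical helper and the toy series are invented. Missing values are excluded from the shares because `value_counts` drops them by default.

```python
import pandas as pd


def share_table(before: pd.Series, after: pd.Series) -> pd.DataFrame:
    """Percentage share of each category before and after filtering."""
    table = pd.DataFrame({
        "before_pct": before.value_counts(normalize=True) * 100,
        "after_pct": after.value_counts(normalize=True) * 100,
    }).fillna(0.0)  # categories absent on one side get a 0% share
    # Difference in percentage points (positive = over-represented after filtering)
    table["diff_pp"] = table["after_pct"] - table["before_pct"]
    return table.round(1)


# Hypothetical toy data: 10 entities before filtering, 4 retained afterwards.
before = pd.Series(["Male"] * 6 + ["Female"] * 4)
after = pd.Series(["Male"] * 3 + ["Female"] * 1)

table = share_table(before, after)
print(table)

# In the notebook, the same helper could be applied per dimension, e.g.:
# share_table(df["gender"], df_filtered["gender"])
# share_table(df["un_subregion"], df_filtered["un_subregion"])
```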

## Summary

**Completed**: Entity filtering for multilayer network analysis

- **Filter**: Entities with articles in all 5 languages (en, it, fr, es, de)
- **Output**: `entities_filtered_by_languages.csv` with 5 columns (`wikidata_id`, `birth`, `bigperiod_birth`, `un_subregion`, `gender`)
- **Next**: Graph construction with `python src/main.py build`

See [README.md](../README.md) for full pipeline documentation.