## Hypothesis

Between 2018 and 2024, paper titles from US, Chinese, and corporate affiliations at ICLR and ICML disproportionately emphasize different machine learning subfields, revealing regional and institutional specialization in research focus.

Specifically, titles from US institutions are more likely to highlight areas such as Fairness, Causal Inference, and Graph Learning; titles from Chinese institutions tend to focus on Federated Learning, Semi-supervised Learning, and Adversarial Attacks; and corporate-affiliated papers (e.g., from Google, Meta, Microsoft) emphasize topics like Large Language Models, Self-supervised Learning, and Optimization.
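
One way to make this hypothesis checkable is to map each subfield to a few title phrases and compare, per group, the share of titles that mention them. The sketch below does this on made-up titles. The `SUBFIELD_PATTERNS` mapping, the function name, and the toy titles are illustrative assumptions, not part of the dataset or of the analysis that follows.

```python
import re
from collections import defaultdict

# Hypothetical keyword patterns per subfield (an assumption for illustration).
SUBFIELD_PATTERNS = {
    "Fairness": re.compile(r"\bfair(ness)?\b"),
    "Federated Learning": re.compile(r"\bfederated\b"),
    "Large Language Models": re.compile(r"\b(large language models?|llms?)\b"),
}

def subfield_shares(titles_by_group):
    """Return {group: {subfield: fraction of titles matching the pattern}}."""
    shares = defaultdict(dict)
    for group, titles in titles_by_group.items():
        lowered = [t.lower() for t in titles]
        for subfield, pattern in SUBFIELD_PATTERNS.items():
            hits = sum(1 for t in lowered if pattern.search(t))
            shares[group][subfield] = hits / len(lowered) if lowered else 0.0
    return dict(shares)

# Made-up titles, only to show the shape of the computation.
toy_titles = {
    "United States": ["Fairness in Ranking", "Causal Bandits"],
    "China": ["Federated Learning with Noisy Labels", "Fair Clustering"],
    "Industry": ["Scaling Large Language Models", "LLMs as Optimizers"],
}
shares = subfield_shares(toy_titles)
```

Keyword shares like these depend heavily on the chosen phrases, so they are best read alongside the data-driven comparisons in the Analysis section.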

## Pre-processing

In [1]:
import json
import time

import openai
import pandas as pd
from tqdm import tqdm

%config InlineBackend.figure_format = 'retina'

In [2]:
icml_iclr_neurips = pd.read_csv("../papers.csv")

In [3]:
icml_iclr = icml_iclr_neurips[icml_iclr_neurips["Conference"].isin(["ICML", "ICLR"])]
icml_iclr = icml_iclr.reset_index(drop=True)
icml_iclr

Unnamed: 0,Conference,Year,Title,Author,Affiliation
0,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Max Jaderberg,DeepMind
1,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Wojciech Czarnecki,DeepMind
2,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Simon Osindero,DeepMind
3,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Oriol Vinyals,DeepMind
4,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Alex Graves,DeepMind
...,...,...,...,...,...
75039,ICML,2024,Irregular Multivariate Time Series Forecasting...,Weijia Zhang,The Hong Kong University of Science and Techno...
75040,ICML,2024,Irregular Multivariate Time Series Forecasting...,Chenlong Yin,The Hong Kong University of Science and Techno...
75041,ICML,2024,Irregular Multivariate Time Series Forecasting...,Hao Liu,The Hong Kong University of Science and Techno...
75042,ICML,2024,Irregular Multivariate Time Series Forecasting...,Xiaofang Zhou,The Hong Kong University of Science and Techno...


In [4]:
icml_iclr["Affiliation"].unique()

array(['DeepMind', 'Google DeepMind', 'IST Austria', ...,
       'The Swiss AI Lab IDSIA, USI', 'Chinese University of HongKong',
       'Baidu/Rutgers University'], shape=(7030,), dtype=object)

### Splitting Industry vs Academia

In [5]:
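def label_affiliation(aff):
    """Coarsely label an affiliation string as Academia, Industry, or Unknown."""
    if pd.isnull(aff):
        return "Unknown"
    aff_lower = str(aff).lower()
    # Academic keywords are checked first, so mixed affiliations such as
    # "Baidu/Rutgers University" are labelled Academia.
    if any(x in aff_lower for x in ["university", "institute", "college", "school of", "laboratory", "academy"]):
        return "Academia"
    # Plain substring tests: short keys such as "intel" or "meta" also match
    # inside longer words (e.g. "intelligence").
    elif any(x in aff_lower for x in ["google", "microsoft", "deepmind", "facebook", "amazon", "meta", "apple", "nvidia", "ibm", "intel", "openai", "baidu", "alibaba", "bytedance", "tencent", "deep mind"]):
        return "Industry"
    else:
        return "Unknown"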

icml_iclr["Affiliation Label"] = icml_iclr["Affiliation"].apply(label_affiliation)
icml_iclr

Unnamed: 0,Conference,Year,Title,Author,Affiliation,Affiliation Label
0,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Max Jaderberg,DeepMind,Industry
1,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Wojciech Czarnecki,DeepMind,Industry
2,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Simon Osindero,DeepMind,Industry
3,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Oriol Vinyals,DeepMind,Industry
4,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Alex Graves,DeepMind,Industry
...,...,...,...,...,...,...
75039,ICML,2024,Irregular Multivariate Time Series Forecasting...,Weijia Zhang,The Hong Kong University of Science and Techno...,Academia
75040,ICML,2024,Irregular Multivariate Time Series Forecasting...,Chenlong Yin,The Hong Kong University of Science and Techno...,Academia
75041,ICML,2024,Irregular Multivariate Time Series Forecasting...,Hao Liu,The Hong Kong University of Science and Techno...,Academia
75042,ICML,2024,Irregular Multivariate Time Series Forecasting...,Xiaofang Zhou,The Hong Kong University of Science and Techno...,Academia
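
Because `label_affiliation` uses substring tests, a key such as `"intel"` also fires on "intelligence" and `"meta"` on "metabolic". A minimal sketch of a stricter variant using word boundaries is shown here. The names `label_affiliation_strict`, `ACADEMIA_RE`, and `INDUSTRY_RE` are hypothetical, and applying this variant would change the label counts reported below.

```python
import re

ACADEMIA_TERMS = ["university", "institute", "college", "school of",
                  "laboratory", "academy"]
INDUSTRY_TERMS = ["google", "microsoft", "deepmind", "deep mind", "facebook",
                  "amazon", "meta", "apple", "nvidia", "ibm", "intel",
                  "openai", "baidu", "alibaba", "bytedance", "tencent"]

def _compile(terms):
    # One alternation, anchored on word boundaries so "intel" does not
    # match inside "intelligence".
    return re.compile(r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b")

ACADEMIA_RE = _compile(ACADEMIA_TERMS)
INDUSTRY_RE = _compile(INDUSTRY_TERMS)

def label_affiliation_strict(aff):
    """Word-boundary variant of the keyword labelling (illustrative only)."""
    if aff is None or aff != aff:  # None or NaN (NaN is not equal to itself)
        return "Unknown"
    text = str(aff).lower()
    if ACADEMIA_RE.search(text):
        return "Academia"
    if INDUSTRY_RE.search(text):
        return "Industry"
    return "Unknown"
```

With this variant, "Intel Labs" is still labelled Industry, while "Vector Intelligence Lab" falls through to Unknown instead of being counted as Industry.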


In [6]:
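# Load previously saved annotations (affiliation string -> label).
with open("saves/affiliation_annotations.json", "r") as file:
    loaded_annotations = json.load(file)

# Only rows still labelled "Unknown" are overridden; affiliations missing
# from the annotation file keep their current label.
icml_iclr["Affiliation Label"] = icml_iclr.apply(
    lambda row: loaded_annotations.get(row["Affiliation"], row["Affiliation Label"])
    if row["Affiliation Label"] == "Unknown" else row["Affiliation Label"],
    axis=1
)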

icml_iclr

Unnamed: 0,Conference,Year,Title,Author,Affiliation,Affiliation Label
0,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Max Jaderberg,DeepMind,Industry
1,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Wojciech Czarnecki,DeepMind,Industry
2,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Simon Osindero,DeepMind,Industry
3,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Oriol Vinyals,DeepMind,Industry
4,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Alex Graves,DeepMind,Industry
...,...,...,...,...,...,...
75039,ICML,2024,Irregular Multivariate Time Series Forecasting...,Weijia Zhang,The Hong Kong University of Science and Techno...,Academia
75040,ICML,2024,Irregular Multivariate Time Series Forecasting...,Chenlong Yin,The Hong Kong University of Science and Techno...,Academia
75041,ICML,2024,Irregular Multivariate Time Series Forecasting...,Hao Liu,The Hong Kong University of Science and Techno...,Academia
75042,ICML,2024,Irregular Multivariate Time Series Forecasting...,Xiaofang Zhou,The Hong Kong University of Science and Techno...,Academia


In [7]:
icml_iclr["Affiliation Label"].value_counts()

Affiliation Label
Academia    53515
Industry    14605
Unknown      6924
Name: count, dtype: int64

### Splitting China vs US vs Industry

In [8]:
us_uni = pd.read_csv("../uni_csv/us_universities.csv")
china_uni = pd.read_csv("../uni_csv/universities_china.csv")

In [9]:
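# Lower-cased name sets for case-insensitive substring matching.
us_universities = set(us_uni['name'].str.lower())
china_universities = set(china_uni['university'].str.lower())

def update_affiliation_label(row):
    """Relabel Academia/Unknown rows as United States or China when a listed
    university name occurs in the affiliation string."""
    # Industry rows are left untouched.
    if row["Affiliation Label"] in ["Academia", "Unknown"]:
        affiliation = str(row["Affiliation"]).lower()
        # US names are checked first, so an affiliation matching both lists
        # is labelled United States.
        if any(uni in affiliation for uni in us_universities):
            return "United States"
        elif any(uni in affiliation for uni in china_universities):
            return "China"
    return row["Affiliation Label"]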

icml_iclr["Affiliation Label"] = icml_iclr.apply(update_affiliation_label, axis=1)
icml_iclr

Unnamed: 0,Conference,Year,Title,Author,Affiliation,Affiliation Label
0,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Max Jaderberg,DeepMind,Industry
1,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Wojciech Czarnecki,DeepMind,Industry
2,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Simon Osindero,DeepMind,Industry
3,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Oriol Vinyals,DeepMind,Industry
4,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Alex Graves,DeepMind,Industry
...,...,...,...,...,...,...
75039,ICML,2024,Irregular Multivariate Time Series Forecasting...,Weijia Zhang,The Hong Kong University of Science and Techno...,China
75040,ICML,2024,Irregular Multivariate Time Series Forecasting...,Chenlong Yin,The Hong Kong University of Science and Techno...,China
75041,ICML,2024,Irregular Multivariate Time Series Forecasting...,Hao Liu,The Hong Kong University of Science and Techno...,China
75042,ICML,2024,Irregular Multivariate Time Series Forecasting...,Xiaofang Zhou,The Hong Kong University of Science and Techno...,China
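
The row-wise `any(uni in affiliation ...)` test scans every university name for every row. A possible speed-up, sketched under the assumption that plain substring semantics should be kept, is to compile each name list into one regular expression. `build_matcher`, `country_label`, and the two small name sets are hypothetical stand-ins for the real lists.

```python
import re

def build_matcher(names):
    """Compile one regex that matches if any name occurs as a substring."""
    if not names:
        return re.compile(r"(?!)")  # never matches, like any() over an empty set
    escaped = [re.escape(n) for n in sorted(names, key=len, reverse=True)]
    return re.compile("|".join(escaped))

# Toy stand-ins for us_universities / china_universities (already lower-cased).
us_names = {"stanford university", "carnegie mellon university"}
china_names = {"tsinghua university", "peking university"}
US_RE = build_matcher(us_names)
CHINA_RE = build_matcher(china_names)

def country_label(affiliation, current_label):
    """Same decision order as update_affiliation_label: US first, then China."""
    if current_label not in ("Academia", "Unknown"):
        return current_label
    text = str(affiliation).lower()
    if US_RE.search(text):
        return "United States"
    if CHINA_RE.search(text):
        return "China"
    return current_label
```

On the real data the matchers would be built from `us_universities` and `china_universities`, and the result could be computed once per unique affiliation string rather than once per row.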


In [10]:
icml_iclr["Affiliation Label"].value_counts()

Affiliation Label
United States    18346
Academia         18092
China            17337
Industry         14605
Unknown           6664
Name: count, dtype: int64

### Dropping Unknown and other affiliations (not US, China, or Industry)

In [12]:
icml_iclr = pd.read_csv("saves/icml_iclr_affiliation_labels.csv")
icml_iclr = icml_iclr[~icml_iclr["Affiliation Label"].isin(["Unknown", "Academia"])]
icml_iclr = icml_iclr.reset_index(drop=True)
icml_iclr

Unnamed: 0,Conference,Year,Title,Author,Affiliation,Affiliation Label
0,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Max Jaderberg,DeepMind,Industry
1,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Wojciech Czarnecki,DeepMind,Industry
2,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Simon Osindero,DeepMind,Industry
3,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Oriol Vinyals,DeepMind,Industry
4,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Alex Graves,DeepMind,Industry
...,...,...,...,...,...,...
50284,ICML,2024,Learning Graph Representation via Graph Entrop...,Jicong Fan,"The Chinese University of Hong Kong, Shenzhen",China
50285,ICML,2024,Irregular Multivariate Time Series Forecasting...,Weijia Zhang,The Hong Kong University of Science and Techno...,China
50286,ICML,2024,Irregular Multivariate Time Series Forecasting...,Chenlong Yin,The Hong Kong University of Science and Techno...,China
50287,ICML,2024,Irregular Multivariate Time Series Forecasting...,Hao Liu,The Hong Kong University of Science and Techno...,China


In [13]:
icml_iclr["Affiliation Label"].value_counts()

Affiliation Label
United States    18346
China            17338
Industry         14605
Name: count, dtype: int64

## Analysis

In [14]:
df_us = icml_iclr[icml_iclr["Affiliation Label"] == "United States"]
df_china = icml_iclr[icml_iclr["Affiliation Label"] == "China"]
df_industry = icml_iclr[icml_iclr["Affiliation Label"] == "Industry"]
df_us = df_us.reset_index(drop=True)
df_china = df_china.reset_index(drop=True)
df_industry = df_industry.reset_index(drop=True)
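
Each row in these frames is a (paper, author) pair, so a title with five authors in one group contributes its words five times to any word count. If paper-level counts are preferred, one option is to keep a single row per paper and group before counting. The sketch below shows the idea on a made-up frame; treating `Conference`, `Year`, and `Title` together as a paper identifier is an assumption.

```python
import pandas as pd

# Toy frame with the same columns as icml_iclr (values are made up).
toy = pd.DataFrame({
    "Conference": ["ICML", "ICML", "ICML", "ICLR"],
    "Year": [2017, 2017, 2017, 2018],
    "Title": ["Paper A", "Paper A", "Paper A", "Paper B"],
    "Author": ["a1", "a2", "a3", "b1"],
    "Affiliation Label": ["Industry", "Industry", "United States", "China"],
})

# Keep one row per (paper, group): a paper with authors from two groups
# still counts once for each of those groups.
papers = toy.drop_duplicates(
    subset=["Conference", "Year", "Title", "Affiliation Label"]
).reset_index(drop=True)
```

Here "Paper A" is kept once for Industry and once for United States, so `papers` has three rows instead of four.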

In [19]:
df_industry.head()

Unnamed: 0,Conference,Year,Title,Author,Affiliation,Affiliation Label
0,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Max Jaderberg,DeepMind,Industry
1,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Wojciech Czarnecki,DeepMind,Industry
2,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Simon Osindero,DeepMind,Industry
3,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Oriol Vinyals,DeepMind,Industry
4,ICML,2017,Decoupled Neural Interfaces using Synthetic Gr...,Alex Graves,DeepMind,Industry


### Pairwise log odds 

In [22]:
import numpy as np
import pandas as pd
from collections import Counter
import re
from math import log

In [23]:
def tokenize(text):
    return re.findall(r'\b\w+\b', text.lower())

def get_counts(df):
    all_tokens = []
    for title in df['Title']:
        all_tokens.extend(tokenize(title))
    return Counter(all_tokens)

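def log_odds_ratio(counts_a, counts_b, counts_ref, alpha=0.01):
    """Return z-scored log-odds ratios of corpus A vs. corpus B, sorted descending.

    counts_ref only contributes words to the vocabulary; the smoothing is a
    uniform pseudo-count alpha per word.
    """
    vocab = set(counts_a) | set(counts_b) | set(counts_ref)
    # Smoothed corpus sizes, computed once instead of once per word.
    total_a = sum(counts_a.values()) + alpha * len(vocab)
    total_b = sum(counts_b.values()) + alpha * len(vocab)
    result = {}

    for word in vocab:
        pa = counts_a.get(word, 0) + alpha
        pb = counts_b.get(word, 0) + alpha

        # Difference of smoothed log odds between the two corpora.
        log_odds = log(pa / (total_a - pa)) - log(pb / (total_b - pb))

        # Approximate variance of the log-odds difference.
        variance = 1 / pa + 1 / pb
        z_score = log_odds / np.sqrt(variance)

        result[word] = z_score
    return dict(sorted(result.items(), key=lambda x: -x[1]))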

In [None]:
counts_us = get_counts(df_us)
counts_china = get_counts(df_china)
counts_industry = get_counts(df_industry)
counts_all = get_counts(pd.concat([df_us, df_china, df_industry]))


us_vs_china = log_odds_ratio(counts_us, counts_china, counts_all)

us_vs_industry = log_odds_ratio(counts_us, counts_industry, counts_all)

china_vs_industry = log_odds_ratio(counts_china, counts_industry, counts_all)

# Print top distinctive words
print("Top US vs China:", list(us_vs_china.items())[:10])
print("Top US vs Industry:", list(us_vs_industry.items())[:10])
print("Top China vs Industry:", list(china_vs_industry.items())[:10])

Top US vs China: [('linear', np.float64(8.68072296596462)), ('and', np.float64(8.242209093469576)), ('bandits', np.float64(7.640295281693778)), ('fairness', np.float64(7.331163206116767)), ('robustness', np.float64(7.145821751515402)), ('of', np.float64(6.804174691804477)), ('poisoning', np.float64(6.4545323013842)), ('convergence', np.float64(6.450225310983126)), ('guarantees', np.float64(6.225271437609789)), ('fair', np.float64(6.209415329987384))]
Top US vs Industry: [('robust', np.float64(8.066730610685692)), ('graph', np.float64(7.352917175984356)), ('markov', np.float64(7.042346238866983)), ('convergence', np.float64(6.775009335920867)), ('robustness', np.float64(6.562204078008988)), ('provable', np.float64(6.496932200803566)), ('approximation', np.float64(6.375815386069381)), ('attacks', np.float64(6.346094849789103)), ('analysis', np.float64(6.333144648983983)), ('data', np.float64(6.148643349939811))]
Top China vs Industry: [('graph', np.float64(10.218636606515712)), ('heterog
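
The function above accepts `counts_ref` but uses it only to extend the vocabulary, and function words such as 'and' and 'of' still reach the top of the lists. One common variant, the informative Dirichlet prior described by Monroe, Colaresi and Quinn (2008), uses the reference counts as per-word pseudo-counts so that frequent words are shrunk more strongly. The following is a sketch of that variant, not the computation that produced the output above. `prior_scale` and `floor` are hypothetical parameters.

```python
from collections import Counter
from math import log, sqrt

def log_odds_informative_prior(counts_a, counts_b, counts_ref,
                               prior_scale=1.0, floor=0.01):
    """Z-scored log-odds of A vs. B with reference counts as the prior."""
    vocab = set(counts_a) | set(counts_b) | set(counts_ref)
    # Per-word pseudo-count, proportional to the reference frequency.
    # The small floor keeps words absent from counts_ref strictly positive.
    prior = {w: prior_scale * counts_ref.get(w, 0) + floor for w in vocab}
    prior_total = sum(prior.values())
    total_a = sum(counts_a.values()) + prior_total
    total_b = sum(counts_b.values()) + prior_total

    scores = {}
    for w in vocab:
        ya = counts_a.get(w, 0) + prior[w]
        yb = counts_b.get(w, 0) + prior[w]
        delta = log(ya / (total_a - ya)) - log(yb / (total_b - yb))
        variance = 1 / ya + 1 / yb
        scores[w] = delta / sqrt(variance)
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))

# Made-up counts: "fairness" is over-represented in A, the rest is balanced.
toy_a = Counter({"fairness": 30, "learning": 100, "of": 50})
toy_b = Counter({"fairness": 5, "learning": 100, "of": 50})
z = log_odds_informative_prior(toy_a, toy_b, toy_a + toy_b)
```

In this toy case 'fairness' gets the only positive score, while 'learning' and 'of' come out slightly negative because corpus A is larger overall.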

### One-vs-rest log odds (not very useful)

In [20]:
import pandas as pd
import numpy as np
from collections import Counter
import re
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)  
    return text

def get_token_counts(df):
    titles = df['Title'].dropna().astype(str).apply(clean_text)
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(titles)
    counts = np.asarray(X.sum(axis=0)).flatten()
    vocab = vectorizer.get_feature_names_out()
    return dict(zip(vocab, counts))

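def compute_log_odds(counts_a, counts_b, prior=0.01):
    """Return the smoothed log ratio of relative frequencies (A vs. B), sorted descending."""
    all_words = set(counts_a) | set(counts_b)
    # Corpus sizes, computed once instead of once per word.
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    log_odds = {}
    for word in all_words:
        a = counts_a.get(word, 0)
        b = counts_b.get(word, 0)
        # Additive smoothing keeps the log finite when a word is absent from one side.
        log_odds[word] = np.log((a + prior) / total_a) - np.log((b + prior) / total_b)
    return dict(sorted(log_odds.items(), key=lambda x: x[1], reverse=True))
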
us_counts = get_token_counts(df_us)
china_counts = get_token_counts(df_china)
industry_counts = get_token_counts(df_industry)


others_counts = Counter(china_counts) + Counter(industry_counts)
log_odds_us = compute_log_odds(us_counts, others_counts)


others_counts = Counter(us_counts) + Counter(industry_counts)
log_odds_china = compute_log_odds(china_counts, others_counts)


others_counts = Counter(us_counts) + Counter(china_counts)
log_odds_industry = compute_log_odds(industry_counts, others_counts)

# ---------- Display top 10 distinctive terms ----------
print("\nTop US distinctive words:")
print(dict(list(log_odds_us.items())[:10]))

print("\nTop China distinctive words:")
print(dict(list(log_odds_china.items())[:10]))

print("\nTop Industry distinctive words:")
print(dict(list(log_odds_industry.items())[:10]))


Top US distinctive words:
{'harbor': np.float64(8.073219785016084), 'interpolants': np.float64(7.90146815552437), 'semidefinite': np.float64(7.836971274161117), 'changes': np.float64(7.768025988864663), 'deserve': np.float64(7.69397292104016), 'chatgpt': np.float64(7.613994264607639), 'connect': np.float64(7.613994264607639), 'hardwareaware': np.float64(7.613994264607639), 'remote': np.float64(7.613994264607639), 'sotopia': np.float64(7.527058579250335)}

Top China distinctive words:
{'energyguided': np.float64(8.476146149598161), 'stone': np.float64(8.051466476545373), 'glmb': np.float64(7.990878597146226), 'symbol': np.float64(7.9263817157829735), 'underwater': np.float64(7.857436430486517), 'lowerlevel': np.float64(7.783383362662017), 'superior': np.float64(7.7034047062294935), 'highprobability': np.float64(7.7034047062294935), 'linearised': np.float64(7.6164690208721915), 'curbench': np.float64(7.6164690208721915)}

Top Industry distinctive words:
{'trillions': np.float64(8.863874
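
The top words above ('harbor', 'stone', 'underwater') are rare terms that are absent from the comparison side: with a prior of 0.01, a zero count on one side produces a very large ratio. A simple mitigation, sketched here on made-up counts, is to score only words above a minimum combined frequency. The `min_count` threshold and its default of 20 are arbitrary assumptions.

```python
from collections import Counter
from math import log

def log_ratio_min_count(counts_a, counts_b, prior=0.01, min_count=20):
    """Smoothed log frequency ratio, restricted to reasonably frequent words."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    scores = {}
    for word in set(counts_a) | set(counts_b):
        a = counts_a.get(word, 0)
        b = counts_b.get(word, 0)
        if a + b < min_count:
            continue  # too rare to say anything reliable
        scores[word] = log((a + prior) / total_a) - log((b + prior) / total_b)
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))

# Made-up counts: "harbor" appears only three times, and only in A.
toy_a = Counter({"harbor": 3, "fairness": 40, "learning": 500})
toy_b = Counter({"fairness": 10, "learning": 600})
scores = log_ratio_min_count(toy_a, toy_b)
```

In the toy example 'harbor' is dropped and 'fairness' ranks first. The z-scored pairwise version in the previous section addresses the same problem through its variance term.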

### LDA (Latent Dirichlet Allocation)

In [25]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
def run_lda(df, label, n_topics=5):
    """Fit an LDA model on the titles in df and print the top 10 words per topic."""
    # Drop English stop words, plus terms in >95% of titles or in fewer than 2.
    vectorizer = CountVectorizer(stop_words='english', max_df=0.95, min_df=2)
    X = vectorizer.fit_transform(df['Title'])

    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    lda.fit(X)

    words = vectorizer.get_feature_names_out()
    print(f"Top topics for {label}:\n")
    for idx, topic in enumerate(lda.components_):
        # argsort is ascending, so take the last 10 indices and reverse them.
        print(f"Topic {idx+1}: ", [words[i] for i in topic.argsort()[-10:][::-1]])
    print("\n" + "-"*50 + "\n")

run_lda(df_us, "US")
run_lda(df_china, "China")
run_lda(df_industry, "Industry")


Top topics for US:

Topic 1:  ['learning', 'policy', 'contrastive', 'domain', 'representation', 'time', 'self', 'shot', 'data', 'adaptation']
Topic 2:  ['learning', 'neural', 'networks', 'deep', 'reinforcement', 'data', 'efficient', 'graph', 'robust', 'training']
Topic 3:  ['learning', 'model', 'generation', 'graph', 'optimization', 'self', 'based', 'diffusion', 'aware', 'text']
Topic 4:  ['optimization', 'stochastic', 'gradient', 'convergence', 'private', 'adversarial', 'convex', 'estimation', 'non', 'neural']
Topic 5:  ['models', 'language', 'large', 'model', 'multi', 'data', 'generative', 'scale', 'training', 'inference']

--------------------------------------------------

Top topics for China:

Topic 1:  ['learning', 'multi', 'neural', 'label', 'data', 'domain', 'representation', 'deep', 'distribution', 'estimation']
Topic 2:  ['neural', 'networks', 'learning', 'graph', 'deep', 'self', 'training', 'local', 'detection', 'bayesian']
Topic 3:  ['learning', 'reinforcement', 'models', 
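
Because `run_lda` fits a separate model per group, Topic 1 for the US and Topic 1 for China are unrelated topics, which makes side-by-side comparison hard. One possible alternative, sketched here on made-up titles, is to fit a single model on all titles and compare the average topic weight per group. The function name `topic_shares_by_group` and the toy data are assumptions.

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def topic_shares_by_group(df, n_topics=5, text_col="Title",
                          group_col="Affiliation Label"):
    """Fit one LDA model on all titles and average topic weights per group."""
    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(df[text_col])
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    doc_topics = lda.fit_transform(X)  # one row per title, rows sum to 1
    shares = pd.DataFrame(doc_topics, index=df.index).groupby(df[group_col]).mean()
    shares.columns = [f"Topic {i + 1}" for i in range(n_topics)]
    return shares

# Made-up titles, only to show the shape of the result.
toy = pd.DataFrame({
    "Title": [
        "Fair graph learning",
        "Federated graph learning",
        "Large language models",
        "Language models scale",
        "Fair federated optimization",
        "Scaling language optimization",
    ],
    "Affiliation Label": ["United States", "China", "Industry",
                          "Industry", "China", "United States"],
})
shares = topic_shares_by_group(toy, n_topics=2)
```

The result has one row per group and one column per topic, so each topic refers to the same word distribution across groups.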