<a href="https://colab.research.google.com/github/hunkim/ACL-2020-Papers/blob/master/generate_paper_list_with_arxiv_link.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Load Paper List

In [0]:
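def read_papers(path):
    """Read a paper list: a title line, then an authors line, with a blank line between entries."""
    papers = [[]]
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                papers[-1].append(line)
            else:
                # A blank line starts the next entry.
                papers.append([])
    # Every entry must be exactly [title, authors].
    for p in papers:
        assert len(p) == 2, f"malformed entry: {p!r}"
    return papers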
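
The parser above expects exactly one blank line between entries and none at the end of the file; a stray extra blank line produces an empty entry and trips the assertion. A minimal sketch of a more forgiving variant follows, assuming the same title/authors layout. The name `read_papers_tolerant` and the sample data are hypothetical and not part of the original notebook.

```python
import os
import tempfile

def read_papers_tolerant(path):
    """Parse [title, authors] pairs separated by one or more blank lines."""
    papers, current = [], []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                current.append(line)
            elif current:
                # A blank line closes the entry collected so far.
                papers.append(current)
                current = []
    if current:
        # The file ended without a trailing blank line.
        papers.append(current)
    for p in papers:
        assert len(p) == 2, f"malformed entry: {p!r}"
    return papers

# Made-up sample with a doubled separator and a trailing blank line.
sample = (
    "First Paper Title\n"
    "Alice Example and Bob Example\n"
    "\n"
    "\n"
    "Second Paper Title\n"
    "Carol Example\n"
    "\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write(sample)
parsed = read_papers_tolerant(tmp.name)
os.remove(tmp.name)
print(parsed)
```

The length check is kept, so a title without an authors line still fails loudly.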

In [48]:
longp = read_papers("./data/long.txt")
longp[:3]

[['2kenize: Tying Subword Sequences for Chinese Script Conversion',
  'Pranav A and Isabelle Augenstein'],
 ['A Batch Normalized Inference Network Keeps the KL Vanishing Away',
  'Qile Zhu, Wei Bi, Xiaojiang Liu, Xiyao Ma, Xiaolin Li and Dapeng Wu'],
 ['A Call for More Rigor in Unsupervised Cross-lingual Learning',
  'Mikel Artetxe, Sebastian Ruder, Dani Yogatama, Gorka Labaka and Eneko Agirre']]

In [49]:
len(longp)

571

In [46]:
short = read_papers("./data/short.txt")
short[:3]

[['A Complete Shift-Reduce Chinese Discourse Parser with Robust Dynamic Oracle',
  'Shyh-Shiun Hung, Hen-Hsen Huang and Hsin-Hsi Chen'],
 ['A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers',
  'Shen-yun Miao, Chao-Chun Liang and Keh-Yih Su'],
 ['A Frame-based Sentence Representation for Machine Reading Comprehension',
  'Shaoru Guo, Ru Li, Hongye Tan, Xiaoli Li, Yong Guan, Hongyan Zhao and Yueping Zhang']]

In [0]:
len(short)

208

In [0]:
demo = read_papers("./data/demo.txt")
demo[:3]

[['ADVISER: A Toolkit for Developing Multi-modal, Multi-domain and Socially-engaged Conversational Agents',
  'Chia-Yu Li, Daniel Ortega, Dirk Väth, Florian Lux, Lindsey Vanderlyn, Maximilian Schmidt, Michael Neumann, Moritz Völkel, Pavel Denisov, Sabrina Jenne, Zorica Kacarevic and Ngoc Thang Vu'],
 ['BENTO: A Visual Platform for Building Clinical NLP Pipelines Based on CodaLab',
  'Yonghao Jin, Fei Li and Hong Yu'],
 ['Clinical-Coder: Assigning Interpretable ICD-10 Codes to Chinese Clinical Notes',
  'Pengfei Cao, Chenwei Yan, xiangling fu, Yubo Chen, Kang Liu, Jun Zhao, Shengping Liu and Weifeng Chong']]

In [0]:
len(demo)

43

In [9]:
student = read_papers("./data/student.txt")
student[:3]

[['#NotAWhore! A Computational Linguistic Perspective of Rape Culture and Victimization on Social Media',
  'Ashima Suvarna and Grusha Bhalla'],
 ['A Geometry-Inspired Attack for Generating Natural Language Adversarial Examples',
  'Zhao Meng and Roger Wattenhofer'],
 ['A Simple and Effective Dependency parser for Telugu',
  'Sneha Nallani, Manish Shrivastava and Dipti Sharma']]

In [0]:
len(student)

49

# Sorting by Topic



In [18]:
import gensim
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess
from nltk.stem import SnowballStemmer, WordNetLemmatizer
import nltk

nltk.download('wordnet')
stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(text):
  # Lemmatize as a verb first, then stem the result
  return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

# Tokenize, drop stopwords and short tokens, then lemmatize and stem
def preprocess(text):
  result = []
  for token in simple_preprocess(text):
    if token not in STOPWORDS and len(token) > 3:
      result.append(lemmatize_stemming(token))
  return result




[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [0]:
#FIXME: Better way to get human readable topic names from LDA topics?
def list2topiclist(papers, num_topics = 8):
  # Preprocess each title once; line[0] is the title, line[1] the authors
  processed_docs = [preprocess(line[0]) for line in papers]

  # Build the dictionary and bag-of-words corpus once, after all titles are processed
  dictionary = gensim.corpora.Dictionary(processed_docs)
  bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

  lda = gensim.models.LdaModel(bow_corpus, num_topics,
                               id2word = dictionary, passes = 10)

  def get_topic_title(idx, topn=3):
    topn_terms = [dictionary[x[0]] for x in lda.get_topic_terms(idx, topn)]
    return " ".join(topn_terms)

  # Create topic title
  list_topic_titles = [get_topic_title(i) for i in range(num_topics)]
  print(list_topic_titles)

  # Assign each paper to its most probable topic
  topic_dict = {}
  for line, bow_vector in zip(papers, bow_corpus):
    line_topic = sorted(lda.get_document_topics(bow_vector),
                        key=lambda tup: tup[1], reverse=True)
    topic_title = list_topic_titles[line_topic[0][0]]
    topic_dict.setdefault(topic_title, []).append(line)

  return topic_dict
      


In [86]:
topic_long = list2topiclist(longp)
for topic in topic_long:
  print(topic)

['model languag dialogu', 'model languag generat', 'languag relat natur', 'machin translat learn', 'cross neural pars', 'generat semant question', 'generat learn model', 'languag word extract']
languag relat natur
generat semant question
cross neural pars
languag word extract
generat learn model
model languag generat
model languag dialogu
machin translat learn
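
The topic names printed above are built from stems (`languag`, `dialogu`), which is what the FIXME in `list2topiclist` complains about. One possible approach, sketched here under assumptions, is to remember the most frequent surface form seen for each stem and substitute it back into the title. The helpers `build_stem_lookup` and `readable_title` are hypothetical, and `toy_stem` is only a crude stand-in for the `SnowballStemmer` so that the sketch runs without NLTK.

```python
from collections import Counter, defaultdict

def toy_stem(word):
    # Crude stand-in for a real stemmer: strips trailing "e"/"s" characters.
    return word.rstrip("es")

def build_stem_lookup(tokens, stem):
    """Map each stem to the surface form that produced it most often."""
    forms = defaultdict(Counter)
    for token in tokens:
        forms[stem(token)][token] += 1
    return {s: counts.most_common(1)[0][0] for s, counts in forms.items()}

def readable_title(stemmed_title, lookup):
    """Replace each stem in a topic title with its most common surface form."""
    return " ".join(lookup.get(s, s) for s in stemmed_title.split())

# Made-up token stream; in the notebook these would be the tokens of the paper titles.
tokens = ["language", "language", "languages",
          "model", "model", "models", "dialogue"]
lookup = build_stem_lookup(tokens, toy_stem)
title = readable_title("model languag dialogu", lookup)
print(title)
```

In the notebook, the lookup would be filled inside `preprocess`, pairing each token with the value returned by `lemmatize_stemming`. Stems that never appear in the lookup are left unchanged.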


# Search arXiv Link

In [0]:
from googlesearch import search
import urllib.request
from bs4 import BeautifulSoup
from difflib import SequenceMatcher
from tqdm import tqdm
import time

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()


def search_arxiv_link(title, sleep=1):
    time.sleep(sleep)
    link = None
    for j in search(title, tld="co.in", num=10, stop=1, pause=0.5):
        if 'arxiv.org/abs' in j:
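            thepage = urllib.request.urlopen(j)
            soup = BeautifulSoup(thepage, "html.parser")
            # The arXiv page title starts with the bracketed paper id; drop that first token.
            searched_title = ' '.join(soup.title.text.lower().split()[1:])
            # Accept the link only when the arXiv title closely matches the query title.
            if similarity(title, searched_title) > 0.8: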
                link = j
                break
            else:
                print("NOT MATCHED")
                print(title)
                print(searched_title)
    return link
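
The 0.8 threshold on `SequenceMatcher.ratio()` is sensitive to word order. In the recorded run, the short paper "multimodal and multiresolution speech recognition with transformers" is reported as NOT MATCHED against an arXiv title that differs only in the order of two words. A minimal sketch of an order-insensitive comparison follows; `token_sort_similarity` is a hypothetical helper, not part of the notebook, and it may also raise the score of unrelated titles that share many words.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def token_sort_similarity(a, b):
    """Compare two titles after sorting their lowercased words."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return similarity(norm_a, norm_b)

# Title pair taken from the NOT MATCHED output of the short-paper run.
acl_title = "multimodal and multiresolution speech recognition with transformers"
arxiv_title = "multiresolution and multimodal speech recognition with transformers"

plain_match = similarity(acl_title, arxiv_title) > 0.8
sorted_match = token_sort_similarity(acl_title, arxiv_title) > 0.8
print(plain_match, sorted_match)

# A clearly different pair from the long-paper run should still be rejected.
other_match = token_sort_similarity(
    "attentive pooling with learnable norms for text representation",
    "attentive pooling networks",
) > 0.8
print(other_match)
```

Taking the larger of the two scores would keep the existing behaviour for titles that already match while recovering reordered ones.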

In [0]:
def generate_paper_list_with_arxiv_link(f, papers):
    for p in tqdm(papers):
        title, authors = p
        link = search_arxiv_link(title.lower())
        if link:
            f.write(f"- {title} [[arXiv]]({link})\n")
        else:
            f.write(f"- {title}\n")
    f.write("\n")
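
To check the Markdown this function emits without querying a search engine, the lookup can be passed in as a parameter and replaced by a stub. The sketch below assumes that refactoring; `write_paper_list`, `fake_lookup`, the sample titles, and the placeholder URL are all hypothetical.

```python
import io

def write_paper_list(f, papers, lookup):
    """Write one Markdown bullet per paper, adding an arXiv link when lookup finds one."""
    for title, _authors in papers:
        link = lookup(title.lower())
        if link:
            f.write(f"- {title} [[arXiv]]({link})\n")
        else:
            f.write(f"- {title}\n")
    f.write("\n")

def fake_lookup(title):
    # Stand-in for search_arxiv_link: a fixed table with a placeholder URL.
    known = {"a paper with a preprint": "https://example.org/abs/placeholder"}
    return known.get(title)

papers = [
    ["A Paper With a Preprint", "Alice Example"],
    ["A Paper Without One", "Bob Example"],
]
buf = io.StringIO()
write_paper_list(buf, papers, fake_lookup)
output = buf.getvalue()
print(output)
```

Passing `search_arxiv_link` as `lookup` would reproduce the behaviour of the original function.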

In [0]:
def generate_paper_list_with_arxiv_link_topic(f, papers):
  topic_papers = list2topiclist(papers)
  for topic in topic_papers:
    f.write("### " + topic + "\n")
    generate_paper_list_with_arxiv_link(f, topic_papers[topic])


In [0]:
with open("papers_with_arxiv_link_topic.md", "w", encoding="utf-8") as f:
  f.write("## Long Papers\n\n")
  generate_paper_list_with_arxiv_link_topic(f, longp)
  f.write("## Short Papers\n\n")
  generate_paper_list_with_arxiv_link_topic(f, short)
  f.write("## System Demonstrations\n\n")
  generate_paper_list_with_arxiv_link_topic(f, demo)
  f.write("## Student Research Workshop\n\n")
  generate_paper_list_with_arxiv_link_topic(f, student)


['languag model natur', 'graph generat inform', 'learn multi detect', 'semant learn model', 'cross model lingual', 'word generat domain', 'languag natur understand', 'neural translat machin']

  0%|          | 0/87 [00:00<?, ?it/s]
  1%|          | 1/87 [00:02<03:32,  2.48s/it]
  2%|▏         | 2/87 [00:05<03:36,  2.54s/it]
  3%|▎         | 3/87 [00:07<03:33,  2.54s/it]
  5%|▍         | 4/87 [00:09<03:21,  2.43s/it]
  6%|▌         | 5/87 [00:12<03:11,  2.34s/it]
  7%|▋         | 6/87 [00:14<03:09,  2.34s/it]
  8%|▊         | 7/87 [00:16<03:10,  2.38s/it]
  9%|▉         | 8/87 [00:18<03:02,  2.31s/it]

In [0]:
with open("papers_with_arxiv_link.md", "w", encoding="utf-8") as f:
    f.write("## Long Papers\n\n")
    generate_paper_list_with_arxiv_link(f, longp)
    f.write("## Short Papers\n\n")
    generate_paper_list_with_arxiv_link(f, short)
    f.write("## System Demonstrations\n\n")
    generate_paper_list_with_arxiv_link(f, demo)
    f.write("## Student Research Workshop\n\n")
    generate_paper_list_with_arxiv_link(f, student)

  5%|▌         | 31/571 [01:07<21:59,  2.44s/it]

NOT MATCHED
adaptive compression of word embeddings
online embedding compression for text classification using low rank matrix factorization


  6%|▌         | 32/571 [01:10<22:15,  2.48s/it]

NOT MATCHED
addressing posterior collapse with mutual information for improved variational neural machine translation
improved variational neural machine translation by promoting mutual information


  9%|▉         | 53/571 [01:54<18:55,  2.19s/it]

NOT MATCHED
attentive pooling with learnable norms for text representation
attentive pooling networks


 12%|█▏        | 68/571 [02:31<21:03,  2.51s/it]

NOT MATCHED
bilingual dictionary based neural machine translation without using parallel sentences
bridging neural machine translation and bilingual dictionaries


 13%|█▎        | 73/571 [02:43<19:23,  2.34s/it]

NOT MATCHED
boosting neural machine translation with similar translations
neural machine translation from simplified translations


 17%|█▋        | 96/571 [03:36<18:33,  2.34s/it]

NOT MATCHED
contextualized weak supervision for text classification
weakly-supervised neural text classification


 18%|█▊        | 103/571 [03:50<17:57,  2.30s/it]

NOT MATCHED
cross-lingual unsupervised sentiment classification with multi-view transfer learning
multi-source cross-lingual model transfer: learning what to share


 19%|█▉        | 109/571 [04:07<19:52,  2.58s/it]

NOT MATCHED
curriculum learning for natural language understanding
visualizing and understanding curriculum learning for long short-term memory networks


 22%|██▏       | 126/571 [04:50<17:29,  2.36s/it]

NOT MATCHED
distilling annotations via active imitation learning
random expert distillation: imitation learning via expert policy support estimation


 26%|██▌       | 147/571 [05:45<17:43,  2.51s/it]

NOT MATCHED
effective inter-clause modeling for end-to-end emotion-cause pair extraction
end-to-end emotion-cause pair extraction via learning to link


 31%|███       | 176/571 [06:53<17:02,  2.59s/it]

NOT MATCHED
explicit semantic decomposition for definition generation
semantic composition and decomposition: from recognition to generation


 38%|███▊      | 218/571 [08:31<14:40,  2.49s/it]

NOT MATCHED
graph neural news recommendation with unsupervised preference disentanglement
graph neural news recommendation with long-term and short-term interest modeling


 44%|████▍     | 251/571 [09:43<11:39,  2.18s/it]

NOT MATCHED
improving disentangled text representation learning with information-theoretic guidance
improving disentangled representation learning with the beta bernoulli process


 45%|████▍     | 255/571 [09:52<11:58,  2.27s/it]

NOT MATCHED
improving image captioning with better use of caption
hidden state guidance: improving image captioning using an image conditioned autoencoder


 46%|████▌     | 264/571 [10:15<12:37,  2.47s/it]

NOT MATCHED
in neural machine translation, what does transfer learning transfer?
exploring benefits of transfer learning in neural machine translation


 53%|█████▎    | 303/571 [11:47<10:44,  2.40s/it]

NOT MATCHED
learning constraints for structured prediction using rectifier networks
adversarial constraint learning for structured prediction


 54%|█████▍    | 308/571 [11:58<10:49,  2.47s/it]

NOT MATCHED
learning to ask more: semi-autoregressive sequential question generation under dual-graph interaction
semi-autoregressive neural machine translation


 57%|█████▋    | 325/571 [12:39<10:59,  2.68s/it]

NOT MATCHED
low-resource generation of multi-hop reasoning questions
reinforced multi-task approach for multi-hop question generation


 58%|█████▊    | 333/571 [12:57<09:05,  2.29s/it]

NOT MATCHED
meta-reinforced multi-domain state generator for dialogue systems
transferable multi-domain state generator for task-oriented dialogue systems


 62%|██████▏   | 354/571 [13:43<08:24,  2.32s/it]

NOT MATCHED
multi-hypothesis machine translation evaluation
pairwise neural machine translation evaluation


 71%|███████▏  | 407/571 [15:50<05:31,  2.02s/it]

NOT MATCHED
predicting the topical stance and political leaning of media using tweets
predicting the topical stance of media and popular twitter users


 72%|███████▏  | 409/571 [15:55<06:08,  2.27s/it]

NOT MATCHED
premise selection in natural language mathematical texts
natural language premise selection: finding supporting statements for mathematical text


 76%|███████▌  | 434/571 [16:59<05:52,  2.57s/it]

NOT MATCHED
reinceptione: relation-aware inception network with joint local-global structural information for knowledge graph embedding
relation-aware entity alignment for heterogeneous knowledge graphs


 86%|████████▌ | 492/571 [19:24<03:03,  2.32s/it]

NOT MATCHED
structural information preserving for graph-to-text generation
structural neural encoders for amr-to-text generation


 96%|█████████▌| 547/571 [21:33<00:55,  2.30s/it]

NOT MATCHED
unknown intent detection using gaussian mixture model with an application to zero-shot intent classification
zero-shot user intent detection via capsule neural networks


100%|█████████▉| 569/571 [22:29<00:04,  2.46s/it]

NOT MATCHED
zero-shot text classification via reinforced self-training
transductive zero-shot learning with a self-training dictionary approach


100%|██████████| 571/571 [22:33<00:00,  2.35s/it]
 13%|█▎        | 28/208 [01:03<06:52,  2.29s/it]

NOT MATCHED
camouflaged chinese spam content detection with semi-supervised generative active learning
gans for semi-supervised opinion spam detection


 18%|█▊        | 37/208 [01:26<07:13,  2.54s/it]

NOT MATCHED
content word aware neural machine translation
selective attention for context-aware neural machine translation


 32%|███▏      | 66/208 [02:53<10:01,  4.23s/it]

NOT MATCHED
entity-aware dependency-based deep graph attention network for comparative preference classification
exploiting typed syntactic dependencies for targeted sentiment classification using graph attention neural network


 45%|████▍     | 93/208 [04:07<05:58,  3.11s/it]

NOT MATCHED
interpretable operational risk classification with semi-supervised variational autoencoder
disentangled variational auto-encoder for semi-supervised learning


 49%|████▉     | 102/208 [04:29<04:46,  2.70s/it]

NOT MATCHED
learning low-resource end-to-end goal-oriented dialog for fast and reliable system deployment
learning end-to-end goal-oriented dialog


 58%|█████▊    | 121/208 [05:21<05:06,  3.52s/it]

NOT MATCHED
multimodal and multiresolution speech recognition with transformers
multiresolution and multimodal speech recognition with transformers


 61%|██████    | 126/208 [05:33<03:56,  2.88s/it]

NOT MATCHED
neural graph matching networks for chinese short text matching
graph matching networks for learning the similarity of graph structured objects


 91%|█████████ | 189/208 [08:17<00:52,  2.75s/it]

NOT MATCHED
tree-structured neural topic model
structured neural topic models for reviews


100%|█████████▉| 207/208 [09:01<00:02,  2.87s/it]

NOT MATCHED
``you sound just like your father’’ commercial machine translation systems include stylistic biases
reducing gender bias in neural machine translation as a domain adaptation problem


100%|██████████| 208/208 [09:03<00:00,  2.49s/it]
100%|██████████| 43/43 [01:42<00:00,  2.15s/it]
  4%|▍         | 2/49 [00:04<01:36,  2.06s/it]

NOT MATCHED
a geometry-inspired attack for generating natural language adversarial examples
a geometry-inspired decision-based attack


 49%|████▉     | 24/49 [00:50<00:53,  2.12s/it]

NOT MATCHED
υbleu: uncertainty-aware automatic evaluation method for open-domain dialogue systems
better automatic evaluation of open-domain dialogue systems with contextualized embeddings


100%|██████████| 49/49 [01:43<00:00,  1.84s/it]
