From 594827fbbb90ca87dc4f3a525d5b7e0401231c49 Mon Sep 17 00:00:00 2001 From: Antonia Mouawad Date: Mon, 19 Jul 2021 13:47:48 -0400 Subject: [PATCH 01/21] Add information extraction md --- _posts/2021-07-20-information-extraction-at-scribd.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) create mode 100644 _posts/2021-07-20-information-extraction-at-scribd.md diff --git a/_posts/2021-07-20-information-extraction-at-scribd.md b/_posts/2021-07-20-information-extraction-at-scribd.md new file mode 100644 index 0000000..e69de29 From 11904893a226bb9528e2eab04b9381b1abf67ef5 Mon Sep 17 00:00:00 2001 From: AntoniaMouawad Date: Mon, 19 Jul 2021 14:06:37 -0400 Subject: [PATCH 02/21] Update 2021-07-20-information-extraction-at-scribd.md --- _posts/2021-07-20-information-extraction-at-scribd.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/_posts/2021-07-20-information-extraction-at-scribd.md b/_posts/2021-07-20-information-extraction-at-scribd.md index e69de29..dbe17da 100644 --- a/_posts/2021-07-20-information-extraction-at-scribd.md +++ b/_posts/2021-07-20-information-extraction-at-scribd.md @@ -0,0 +1,11 @@ +--- +layout: post +title: "Identifying Document Types at Scribd" +tags: +- machinelearning +- data +- featured +team: Applied Research +author: jonathanr +--- + From 4e34453574ddff8094dafc2083a8ef8b0c875185 Mon Sep 17 00:00:00 2001 From: Rafael Lacerda Date: Mon, 19 Jul 2021 14:12:17 -0400 Subject: [PATCH 03/21] added images --- _posts/2021-07-20-information-extraction-at-scribd.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/_posts/2021-07-20-information-extraction-at-scribd.md b/_posts/2021-07-20-information-extraction-at-scribd.md index dbe17da..e66fe1d 100644 --- a/_posts/2021-07-20-information-extraction-at-scribd.md +++ b/_posts/2021-07-20-information-extraction-at-scribd.md @@ -9,3 +9,8 @@ team: Applied Research author: jonathanr --- +![topics image 
description](https://user-images.githubusercontent.com/11147367/126206921-31cea5fb-989c-4468-bb0e-508935f20636.png) + +![wiki image description](https://user-images.githubusercontent.com/11147367/126206932-a5612459-e597-4340-a379-d62da58a29dc.png) + +![flowchart image description](https://user-images.githubusercontent.com/11147367/126206943-9deabf5f-6add-4a01-9e20-5ed8f9e10069.png) From 3c24266c5a4cfea0ddc63a4d990449f11cb72422 Mon Sep 17 00:00:00 2001 From: AntoniaMouawad Date: Mon, 19 Jul 2021 14:12:38 -0400 Subject: [PATCH 04/21] Update authors.yml --- _data/authors.yml | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/_data/authors.yml b/_data/authors.yml index 5ad6289..ebe3bfe 100644 --- a/_data/authors.yml +++ b/_data/authors.yml @@ -146,3 +146,8 @@ jonathanr: about: | Jonathan is a data scientist on the Applied Research team building machine learning models to understand and connect our content. +antoniam: + name: Antonia Mouawad + github: AntoniaMouawad + about: | + Antonia is a data scientist on the Applied Research team building machine learning models to understand and connect our content. 
From 05a70c62c408850f9f59b1f7132fd2d36c4b7bd9 Mon Sep 17 00:00:00 2001 From: AntoniaMouawad Date: Mon, 19 Jul 2021 15:25:11 -0400 Subject: [PATCH 05/21] Update 2021-07-20-information-extraction-at-scribd.md --- ...-07-20-information-extraction-at-scribd.md | 176 +++++++++++++++++- 1 file changed, 173 insertions(+), 3 deletions(-) diff --git a/_posts/2021-07-20-information-extraction-at-scribd.md b/_posts/2021-07-20-information-extraction-at-scribd.md index e66fe1d..482101c 100644 --- a/_posts/2021-07-20-information-extraction-at-scribd.md +++ b/_posts/2021-07-20-information-extraction-at-scribd.md @@ -6,11 +6,181 @@ tags: - data - featured team: Applied Research -author: jonathanr +authors: +- antoniam +- rafael --- +This is part 2 in a series of blog posts describing a multi-component machine learning system we built to extract metadata from our documents in order to enrich downstream discovery models. In this post, we present the challenges and limitations we faced and the solutions we came up with when building information extraction NLP models for our text-heavy documents. -![topics image description](https://user-images.githubusercontent.com/11147367/126206921-31cea5fb-989c-4468-bb0e-508935f20636.png) +As mentioned in part 1, we now have a way of identifying text-heavy documents. Having done that, we want to build dedicated models to deepen our semantic understanding of them. We do this by extracting key phrases and entities. + +![flowchart image description](https://user-images.githubusercontent.com/11147367/126206943-9deabf5f-6add-4a01-9e20-5ed8f9e10069.png) + +Figure 1: Diagram of our multi-component machine learning system. + +Key phrases are phrases that represent major themes/topics, whereas entities are proper nouns such as people, places and organizations. For example, when a user uploads a document about the Manhattan project, we will first detect it is text-heavy, then extract key phrases and entities. 
Potential key phrases would be “atomic bomb” and “nuclear weapons”, and potential entities would be “Robert Oppenheimer” and “Los Alamos”.
+
+Because keyphrase extraction brings out the general topics discussed in a document, it helps put a cap on the amount of information kept per document, resulting in a roughly uniform representation of documents irrespective of their original size. Entity extraction, on the other hand, identifies elements in a text that aren’t necessarily reflected by keyphrases alone. We found the combination of keyphrase and entity extraction to provide a rich semantic description of each document.
+
+The rest of this post will explain how we approached key phrase and entity extraction, how we identify whether a subset of these keyphrases and entities is present in a knowledge base (also known as linking), and how we use them to categorize documents.
+
+## Keyphrase Extraction
+
+Typically, a keyphrase extraction system operates in two steps, as indicated in this survey:
+
+- Using heuristics to extract a list of words/phrases that serve as candidate keyphrases, such as part-of-speech language patterns, stopword filtering, and n-grams matched against Wikipedia article titles
+
+- Determining which of these candidates are most likely to be key phrases, using one of two approaches:
+
+  - Supervised approaches, such as binary classification of candidates (useful/not useful), structural features based on positional encoding, etc.
+
+  - Unsupervised approaches, such as selecting terms with the highest tf-idf and clustering.
+
+Training a decent supervised model to extract keyphrases across a wide variety of topics would require a large amount of training data and might generalize poorly. For this reason, we took the unsupervised approach.
+
+Our implementation of key phrase extraction is optimized for speed without sacrificing much key phrase quality.
We employ both a statistical method and language-specific rules to identify them efficiently.
+
+We start by filtering out stopwords and extracting n-grams with a base n (bigrams in our case, n=2). This step is fast and straightforward, and results in an initial set of candidate n-grams.
+
+Limiting the results to a single n-gram class, however, produces split key phrases, which makes linking them to a knowledge base challenging. To address this, we attempt to agglomerate lower-order n-grams into longer key phrases, as long as the longer phrase occurs at a predetermined minimum frequency relative to the shorter n-gram, based on the following pattern:
+
+`A sequence of nouns (NN) possibly interleaved with either Coordinating Conjunctions (CC) or Prepositions and Subordinating Conjunctions (IN).`
+
+Here are a few examples:
+
+Assuming the minimum agglomeration frequency is 0.5, we would replace the bigram `world (NN) health (NN)` with `world (NN) health (NN) organization (NN)` only if `world health organization` occurs at least 50% as often as `world health`.
+
+Likewise, we would replace `Human (NNP) Development (NNP)` with `Center (NNP) for (IN) Global (NNP) Development (NNP)` only if the latter occurs at least a predetermined percentage of the time relative to the former.
+
+This method results in more coherent and complete key phrases that can be linked more accurately to a knowledge base entry.
+
+Finally, we use the count of occurrences of a candidate keyphrase as a proxy for its importance. This works well for longer documents, as the repetition of a keyphrase tends to indicate its centrality to the document’s topic.
+
+## Named Entities
+
+Keyphrases are only one side of finding what’s important in a document. To further capture what a document is about, we must also consider the named entities that are present.
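Before moving on to entities, the candidate extraction and agglomeration steps described in the previous section can be sketched in a few lines of Python. This is a simplified illustration, not Scribd’s production code: the stopword list, function names, and the restriction to a single bigram-to-trigram merge are our own assumptions, and the part-of-speech pattern check is omitted.

```python
from collections import Counter

# Illustrative stopword list; a real system would use a much fuller one.
STOPWORDS = {"the", "a", "an", "of", "for", "and", "in", "is", "at", "to"}

def ngrams(tokens, n):
    """All contiguous n-grams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def candidate_keyphrases(tokens, min_ratio=0.5):
    """Extract bigram candidates, then agglomerate a bigram into a longer
    phrase when the longer phrase occurs at least `min_ratio` times as
    often as the bigram (only bigram -> trigram here, for brevity)."""
    tokens = [t.lower() for t in tokens]
    bigrams = Counter(g for g in ngrams(tokens, 2)
                      if not any(t in STOPWORDS for t in g))
    trigrams = Counter(ngrams(tokens, 3))
    phrases = Counter()
    for bg, count in bigrams.items():
        # Pick the most frequent trigram that extends this bigram and
        # clears the relative-frequency threshold, if any does.
        best = max((tg for tg in trigrams
                    if tg[:2] == bg and trigrams[tg] >= min_ratio * count),
                   key=trigrams.get, default=None)
        phrases[" ".join(best or bg)] += count
    return phrases
```

On a document where `world health organization` occurs at least half as often as `world health`, the bigram is absorbed into the full trigram; otherwise the bigram is kept as the candidate.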
+Named Entity Extraction systems identify instances of named entities in a text, which we can count in order to represent their importance in the document, similar to what we did with keyphrases.
+
+Naively counting named entities through exact string matches surfaces an interesting problem: a single entity may go by many names or aliases, which makes string frequency an unreliable measure of importance. In the example given in Figure 2, we know that “Mill”, “John Stuart Mill” and “Stuart Mill” all refer to the same person. This means that Mill is even more central to the document than the table indicates, since he is referred to a total of 8 times instead of 5.
 ![wiki image description](https://user-images.githubusercontent.com/11147367/126206932-a5612459-e597-4340-a379-d62da58a29dc.png)
+
+Figure 2: Excerpt from John Stuart Mill’s Wikipedia page (left) and Top 5 Named Entity counts of the first few paragraphs (right).
+
+To address this counting problem, let’s introduce a few abstractions:
+
+Named Entity refers to a unique person, place or organization. Because of their uniqueness, we can represent them with a unique identifier (ID).
+
+Named Entity Alias (or simply Alias) is one of possibly many names associated with a particular entity.
+
+Canonical Alias is the preferred name for an entity.
+
+Named Entity Mention (or simply Mention) refers to each occurrence in a text where a Named Entity is referred to, regardless of which Alias was used.
+
+Knowledge Base is a collection of entities, allowing us to query for ID, canonical name, aliases and other information that might be relevant for the task at hand. One example is Wikidata.
+
+The first step to solve the counting problem is to normalize the names a document uses to refer to a named entity.
Using our abstractions, this means we want to find all the mentions in a document, and use its alias to find the named entity it belongs to. Then, replace it with either the canonical name or the named entity ID - this distinction will become clearer later on. + +Entity Normalization + +Given a set of aliases that appear in a document, we developed heuristics (e.g. common tokens, initials) to identify which subset of aliases refer to the same named entity. This allowed us to limit our search space when comparing aliases. + +Using our previous example to illustrate this method, we start by assuming the canonical alias is the longest alias in a text for a given entity, and attempt to merge aliases together by evaluating which aliases match the heuristics we developed.  + +Entities’ Aliases Mentioned in the Document + +Mill + +John Stuart Mill + +Robert Grosvenor + +William Henry Smith + +Stuart Mill + +Table 1: Top 5 occurring aliases in the first few paragraphs of John Stuart Mill’s Wikipedia page, some referring to the same person. + +Comparing entities with each other using exact token matching as a heuristic would solve this: + +Merges + +Comparison Left + +Comparison Right + +Result + +Mill + +John Stuart Mill + +(Mill, John Stuart Mill) + +(Mill, John Stuart Mill) + +Robert Grosvenor + +- + +(Mill, John Stuart Mill) + +William Henry Smith + +- + +(Mill, John Stuart Mill) + +Stuart Mill + +(Mill, John Stuart Mill, Stuart Mill) + +[ the remaining comparisons do not yield merges and were omitted ] + +Table 2: Pairwise alias comparisons and resulting merges. Matches highlighted in bold. + +By replacing all mentions with its corresponding canonical alias, we are able to find the correct named entity counts. + +One edge case is when an alias might refer to more than one entity: e.g. the alias “Potter” could refer to the named entities “Harry Potter” or “James Potter” within the Harry Potter universe. 
To solve this, we built an Entity Linker, which determines which named entity is the most likely to match the alias given the context. This process is further explained in the Linking to a Knowledge Base section. + +When an entity is not present in a knowledge base, we cannot use Named Entity Linking to disambiguate. In this case, our solution uses a fallback method that assigns the ambiguous mention (Potter) to the closest occurring unambiguous mention that matches the heuristics (e.g. Harry).  + +Linking to a Knowledge Base + +Given that many keyphrases and entities mentioned in a document are notable, they are likely present in a knowledge base. This allows us to leverage extra information present in the knowledge base to improve the normalization step as well as downstream tasks. + +Entity Linking assists normalization by providing information that an alias matches a named entity, which otherwise wouldn’t match a heuristic (e.g. “Honest Abe” versus “Abraham Lincoln”). Furthermore, information in a knowledge base can be used to embed linked entities and keyphrases in the same space as text. + +Being able to embed entities in the same space as text is useful, as this unlocks the ability to compare possible matching named entity IDs with the context in which they’re mentioned, and make a decision on whether an alias we’re considering might be one of the entities in the knowledge base (in which case we will use IDs), or whether the alias doesn’t match any entity in the knowledge base, in which case we fall back to using the assumed canonical alias.  + +At Scribd we make use of Entity Linking to not only improve the Entity Normalization step, but also to take advantage of entity and keyphrase embeddings as supplemental features. + +Discussion + +Putting all of this together, we can: + +Link documents to keyphrases and entities + +Find the relative importance of each in a document. 
+ +Take advantage of relevant information in knowledge bases + +This has enabled some interesting projects: + +In one of them, we built a graph of documents along with their related keyphrases and entities. Embedding documents, keyphrases and entities in the same space allowed us to discover documents by analogy. For example, take The Count of Monte Cristo by Alexandre Dumas, a 19th century French novel about revenge. If we add to its embedding the embedding of science_fiction, it leads us to a collection of science fiction novels by Jules Verne (another 19th century French author), such as 20,000 Leagues Under the Sea and Journey to the Center of the Earth. + +Keyphrase extractions have also been useful in adding explainability to document clusters. By extracting the most common keyphrases of a cluster, we can derive a common theme for the cluster’s content: + +![topics image description](https://user-images.githubusercontent.com/11147367/126206921-31cea5fb-989c-4468-bb0e-508935f20636.png) + +Figure 3: Top keyphrases in a document cluster. The keywords imply that the documents therein are related to dentistry & healthcare, which was confirmed by manually inspecting the documents. + +In yet another project, we leveraged precomputed knowledge base embeddings to represent a document in space through a composition of the entities and keyphrases it contains. These features allowed us to understand the documents uploaded by our users and improve the content discovery on the platform. + +To see how we use the information extracted to classify documents into a taxonomy, make sure to check out part 3: ! 
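The document-analogy trick described in the discussion above boils down to vector arithmetic in the shared embedding space. Here is a toy sketch with fabricated 3-dimensional embeddings and identifiers (real embeddings are learned and much higher-dimensional):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(embeddings, doc, concept, k=1):
    """Rank items by similarity to embedding(doc) + embedding(concept),
    excluding the two query items themselves."""
    query = [a + b for a, b in zip(embeddings[doc], embeddings[concept])]
    ranked = sorted((name for name in embeddings if name not in (doc, concept)),
                    key=lambda name: cosine(query, embeddings[name]),
                    reverse=True)
    return ranked[:k]

# Fabricated space; axes roughly stand for
# (19th-century French fiction, science fiction, revenge).
embeddings = {
    "count_of_monte_cristo":       [0.9, 0.0, 0.8],
    "science_fiction":             [0.0, 1.0, 0.0],
    "20000_leagues_under_the_sea": [0.8, 0.9, 0.1],
    "pride_and_prejudice":         [0.1, 0.0, 0.2],
}
```

Adding the `science_fiction` vector to the novel’s vector moves the query toward 19th-century French science fiction, so the nearest neighbour becomes the Jules Verne title.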
+ + + + + From 3a07c52b66a15b92e309343b1ddf805027707bee Mon Sep 17 00:00:00 2001 From: AntoniaMouawad Date: Mon, 19 Jul 2021 15:31:11 -0400 Subject: [PATCH 06/21] Update 2021-07-20-information-extraction-at-scribd.md --- ...1-07-20-information-extraction-at-scribd.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/_posts/2021-07-20-information-extraction-at-scribd.md b/_posts/2021-07-20-information-extraction-at-scribd.md index 482101c..767e3ad 100644 --- a/_posts/2021-07-20-information-extraction-at-scribd.md +++ b/_posts/2021-07-20-information-extraction-at-scribd.md @@ -70,19 +70,19 @@ Figure 2: Excerpt from John Stuart Mill’s Wikipedia page (left) and Top 5 Name To address this counting problem, let’s introduce a few abstractions: -Named Entity refers to a unique person, place or organization. Because of their uniqueness, we can represent them with a unique identifier (ID).  +`Named Entity` refers to a unique person, place or organization. Because of their uniqueness, we can represent them with a unique identifier (ID).  -Named Entity Alias (or simply Alias), is one of possibly many names associated with a particular entity. +`Named Entity Alias` (or simply Alias), is one of possibly many names associated with a particular entity. -Canonical Alias is the preferred name for an entity. +`Canonical Alias` is the preferred name for an entity. -Named Entity Mention (or simply Mention), refers to each occurrence in a text that a Named Entity was referred to, regardless of which Alias was used. +`Named Entity Mention` (or simply `Mention`), refers to each occurrence in a text that a Named Entity was referred to, regardless of which Alias was used. -Knowledge Base is a collection of entities, allowing us to query for ID, canonical name, aliases and other information that might be relevant for the task at hand. One example is Wikidata. 
+`Knowledge Base` is a collection of entities, allowing us to query for ID, canonical name, aliases and other information that might be relevant for the task at hand. One example is [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page). The first step to solve the counting problem is to normalize the names a document uses to refer to a named entity. Using our abstractions, this means we want to find all the mentions in a document, and use its alias to find the named entity it belongs to. Then, replace it with either the canonical name or the named entity ID - this distinction will become clearer later on. -Entity Normalization +### Entity Normalization Given a set of aliases that appear in a document, we developed heuristics (e.g. common tokens, initials) to identify which subset of aliases refer to the same named entity. This allowed us to limit our search space when comparing aliases. @@ -150,9 +150,9 @@ Linking to a Knowledge Base Given that many keyphrases and entities mentioned in a document are notable, they are likely present in a knowledge base. This allows us to leverage extra information present in the knowledge base to improve the normalization step as well as downstream tasks. -Entity Linking assists normalization by providing information that an alias matches a named entity, which otherwise wouldn’t match a heuristic (e.g. “Honest Abe” versus “Abraham Lincoln”). Furthermore, information in a knowledge base can be used to embed linked entities and keyphrases in the same space as text. +Entity Linking assists normalization by providing information that an alias matches a named entity, which otherwise wouldn’t match a heuristic (e.g. “Honest Abe” versus “Abraham Lincoln”). Furthermore, [information in a knowledge base can be used to embed linked entities and keyphrases in the same space as text](https://arxiv.org/abs/1601.01343). 
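With entities and mention contexts embedded in the same space, disambiguating an alias against knowledge-base candidates can reduce to a similarity comparison. The sketch below is a toy illustration, not Scribd’s actual linker: the vectors, threshold, and fallback rule are invented for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def link_alias(alias, context_vec, candidates, threshold=0.5):
    """Return the ID of the knowledge-base candidate whose embedding best
    matches the mention's context, or fall back to the alias itself when
    no candidate is similar enough (the entity is likely absent from the KB)."""
    best_id, best_score = None, threshold
    for entity_id, entity_vec in candidates.items():
        score = cosine(context_vec, entity_vec)
        if score > best_score:
            best_id, best_score = entity_id, score
    return best_id if best_id is not None else alias

# Hypothetical 2-d embeddings for two KB entities sharing the alias "Potter".
candidates = {"harry_potter": [1.0, 0.0], "james_potter": [0.0, 1.0]}
```

A mention of “Potter” whose context sits close to one candidate’s embedding resolves to that entity ID; a context far from every candidate falls back to the raw alias, mirroring the fallback behaviour described for entities missing from the knowledge base.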
-Being able to embed entities in the same space as text is useful, as this unlocks the ability to compare possible matching named entity IDs with the context in which they’re mentioned, and make a decision on whether an alias we’re considering might be one of the entities in the knowledge base (in which case we will use IDs), or whether the alias doesn’t match any entity in the knowledge base, in which case we fall back to using the assumed canonical alias.  +Being able to embed entities in the same space as text is useful, as this unlocks the ability to [compare possible matching named entity IDs with the context in which they’re mentioned](https://arxiv.org/abs/1911.03814), and make a decision on whether an alias we’re considering might be one of the entities in the knowledge base (in which case we will use IDs), or whether the alias doesn’t match any entity in the knowledge base, in which case we fall back to using the assumed canonical alias.  At Scribd we make use of Entity Linking to not only improve the Entity Normalization step, but also to take advantage of entity and keyphrase embeddings as supplemental features. @@ -168,7 +168,7 @@ Take advantage of relevant information in knowledge bases This has enabled some interesting projects: -In one of them, we built a graph of documents along with their related keyphrases and entities. Embedding documents, keyphrases and entities in the same space allowed us to discover documents by analogy. For example, take The Count of Monte Cristo by Alexandre Dumas, a 19th century French novel about revenge. If we add to its embedding the embedding of science_fiction, it leads us to a collection of science fiction novels by Jules Verne (another 19th century French author), such as 20,000 Leagues Under the Sea and Journey to the Center of the Earth. +In one of them, we built a graph of documents along with their related keyphrases and entities. 
Embedding documents, keyphrases and entities in the same space allowed us to discover documents by analogy. For example, take `The Count of Monte Cristo` by Alexandre Dumas, a 19th century French novel about revenge. If we add to its embedding the embedding of science_fiction, it leads us to a collection of science fiction novels by Jules Verne (another 19th century French author), such as `20,000 Leagues Under the Sea` and `Journey to the Center of the Earth`. Keyphrase extractions have also been useful in adding explainability to document clusters. By extracting the most common keyphrases of a cluster, we can derive a common theme for the cluster’s content: From 07f4b862d9ec610810c4d771e6af4bd8af4a1a8b Mon Sep 17 00:00:00 2001 From: AntoniaMouawad Date: Mon, 19 Jul 2021 15:56:57 -0400 Subject: [PATCH 07/21] Update 2021-07-20-information-extraction-at-scribd.md --- ...-07-20-information-extraction-at-scribd.md | 31 +++++++++++-------- 1 file changed, 18 insertions(+), 13 deletions(-) diff --git a/_posts/2021-07-20-information-extraction-at-scribd.md b/_posts/2021-07-20-information-extraction-at-scribd.md index 767e3ad..17d47ee 100644 --- a/_posts/2021-07-20-information-extraction-at-scribd.md +++ b/_posts/2021-07-20-information-extraction-at-scribd.md @@ -88,17 +88,22 @@ Given a set of aliases that appear in a document, we developed heuristics (e.g. Using our previous example to illustrate this method, we start by assuming the canonical alias is the longest alias in a text for a given entity, and attempt to merge aliases together by evaluating which aliases match the heuristics we developed.  -Entities’ Aliases Mentioned in the Document - -Mill - -John Stuart Mill - -Robert Grosvenor - -William Henry Smith - -Stuart Mill + + + + + + + + + + + + + + + +
Entities’ Aliases Mentioned in the Document
<td>Mill</td>
<td>John Stuart Mill</td>
<td>Robert Grosvenor</td>
<td>William Henry Smith</td>
<td>Stuart Mill</td>
Table 1: Top 5 occurring aliases in the first few paragraphs of John Stuart Mill’s Wikipedia page, some referring to the same person. @@ -178,9 +183,9 @@ Figure 3: Top keyphrases in a document cluster. The keywords imply that the docu In yet another project, we leveraged precomputed knowledge base embeddings to represent a document in space through a composition of the entities and keyphrases it contains. These features allowed us to understand the documents uploaded by our users and improve the content discovery on the platform. -To see how we use the information extracted to classify documents into a taxonomy, make sure to check out part 3: ! - +To see how we use the information extracted to classify documents into a taxonomy, make sure to check out part 3: `Categorizing user-uploaded documents` +If you're interested to learn more about the problems Applied Research is solving or the systems which are built around those solutions, check out [our open positions!](/careers/#open-positions) From 9bd27e70db9f9c0201612b4769cf43d1ed01eaa7 Mon Sep 17 00:00:00 2001 From: AntoniaMouawad Date: Mon, 19 Jul 2021 16:06:33 -0400 Subject: [PATCH 08/21] Update 2021-07-20-information-extraction-at-scribd.md --- ...-07-20-information-extraction-at-scribd.md | 76 +++++++++---------- 1 file changed, 38 insertions(+), 38 deletions(-) diff --git a/_posts/2021-07-20-information-extraction-at-scribd.md b/_posts/2021-07-20-information-extraction-at-scribd.md index 17d47ee..94a87a6 100644 --- a/_posts/2021-07-20-information-extraction-at-scribd.md +++ b/_posts/2021-07-20-information-extraction-at-scribd.md @@ -109,39 +109,39 @@ Table 1: Top 5 occurring aliases in the first few paragraphs of John Stuart Mill Comparing entities with each other using exact token matching as a heuristic would solve this: -Merges - -Comparison Left - -Comparison Right - -Result - -Mill - -John Stuart Mill - -(Mill, John Stuart Mill) - -(Mill, John Stuart Mill) - -Robert Grosvenor - -- - -(Mill, John Stuart 
Mill) - -William Henry Smith - -- - -(Mill, John Stuart Mill) - -Stuart Mill - -(Mill, John Stuart Mill, Stuart Mill) - -[ the remaining comparisons do not yield merges and were omitted ] + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Merges
<th>Comparison Left</th> <th>Comparison Right</th> <th>Result</th>
<td>Mill</td> <td>John Stuart Mill</td> <td>(Mill, John Stuart Mill)</td>
<td>(Mill, John Stuart Mill)</td> <td>Robert Grosvenor</td> <td>-</td>
<td>(Mill, John Stuart Mill)</td> <td>William Henry Smith</td> <td>-</td>
<td>(Mill, John Stuart Mill)</td> <td>Stuart Mill</td> <td>(Mill, John Stuart Mill, Stuart Mill)</td>
[ the remaining comparisons do not yield merges and were omitted ]
Table 2: Pairwise alias comparisons and resulting merges. Matches highlighted in bold. @@ -151,7 +151,7 @@ One edge case is when an alias might refer to more than one entity: e.g. the ali When an entity is not present in a knowledge base, we cannot use Named Entity Linking to disambiguate. In this case, our solution uses a fallback method that assigns the ambiguous mention (Potter) to the closest occurring unambiguous mention that matches the heuristics (e.g. Harry).  -Linking to a Knowledge Base +## Linking to a Knowledge Base Given that many keyphrases and entities mentioned in a document are notable, they are likely present in a knowledge base. This allows us to leverage extra information present in the knowledge base to improve the normalization step as well as downstream tasks. @@ -161,15 +161,15 @@ Being able to embed entities in the same space as text is useful, as this unlock At Scribd we make use of Entity Linking to not only improve the Entity Normalization step, but also to take advantage of entity and keyphrase embeddings as supplemental features. -Discussion +## Discussion Putting all of this together, we can: -Link documents to keyphrases and entities +1. Link documents to keyphrases and entities -Find the relative importance of each in a document. +1. Find the relative importance of each in a document -Take advantage of relevant information in knowledge bases +1. 
Take advantage of relevant information in knowledge bases This has enabled some interesting projects: From 99d616203735ced62ce55154f4f628ad792e87dc Mon Sep 17 00:00:00 2001 From: Rafael Lacerda Date: Mon, 19 Jul 2021 16:28:23 -0400 Subject: [PATCH 09/21] rafael's author name --- _posts/2021-07-20-information-extraction-at-scribd.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2021-07-20-information-extraction-at-scribd.md b/_posts/2021-07-20-information-extraction-at-scribd.md index 94a87a6..dc81cd3 100644 --- a/_posts/2021-07-20-information-extraction-at-scribd.md +++ b/_posts/2021-07-20-information-extraction-at-scribd.md @@ -8,7 +8,7 @@ tags: team: Applied Research authors: - antoniam -- rafael +- rafaelp --- This is part 2 in a series of blog posts describing a multi-component machine learning system we built to extract metadata from our documents in order to enrich downstream discovery models. In this post, we present the challenges and limitations we faced and the solutions we came up with when building information extraction NLP models for our text-heavy documents. 
From b019a11cae5fbfca0f8c2727c287a13d75723e10 Mon Sep 17 00:00:00 2001 From: AntoniaMouawad Date: Mon, 19 Jul 2021 16:29:28 -0400 Subject: [PATCH 10/21] Update 2021-07-20-information-extraction-at-scribd.md --- _posts/2021-07-20-information-extraction-at-scribd.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2021-07-20-information-extraction-at-scribd.md b/_posts/2021-07-20-information-extraction-at-scribd.md index dc81cd3..a61d53e 100644 --- a/_posts/2021-07-20-information-extraction-at-scribd.md +++ b/_posts/2021-07-20-information-extraction-at-scribd.md @@ -1,6 +1,6 @@ --- layout: post -title: "Identifying Document Types at Scribd" +title: "Information Extraction at Scribd" tags: - machinelearning - data From e5835a2537b57354fad74dd8bfa6873327011d6c Mon Sep 17 00:00:00 2001 From: Rafael Lacerda Date: Mon, 19 Jul 2021 16:37:53 -0400 Subject: [PATCH 11/21] Update 2021-07-20-information-extraction-at-scribd.md --- .../2021-07-20-information-extraction-at-scribd.md | 14 ++++---------- 1 file changed, 4 insertions(+), 10 deletions(-) diff --git a/_posts/2021-07-20-information-extraction-at-scribd.md b/_posts/2021-07-20-information-extraction-at-scribd.md index a61d53e..d2ed19c 100644 --- a/_posts/2021-07-20-information-extraction-at-scribd.md +++ b/_posts/2021-07-20-information-extraction-at-scribd.md @@ -14,9 +14,7 @@ This is part 2 in a series of blog posts describing a multi-component machine le As mentioned in part 1, we now have a way of identifying text-heavy documents. Having done that, we want to build dedicated models to deepen our semantic understanding of them. We do this by extracting key phrases and entities. -![flowchart image description](https://user-images.githubusercontent.com/11147367/126206943-9deabf5f-6add-4a01-9e20-5ed8f9e10069.png) - -Figure 1: Diagram of our multi-component machine learning system. 
+![flowchart image description](https://user-images.githubusercontent.com/11147367/126206943-9deabf5f-6add-4a01-9e20-5ed8f9e10069.png) Figure 1: Diagram of our multi-component machine learning system. Key phrases are phrases that represent major themes/topics, whereas entities are proper nouns such as people, places and organizations. For example, when a user uploads a document about the Manhattan project, we will first detect it is text-heavy, then extract key phrases and entities. Potential key phrases would be “atomic bomb” and “nuclear weapons” and potential entities would be “Robert Oppenheimer” and “Los Alamos”. @@ -64,9 +62,7 @@ Named Entity Extraction systems identify instances of named entities in a text, Naively counting named entities through exact string matches surfaces an interesting problem: a single entity may go by many names or aliases, which means string frequency is an unreliable measurement of importance. In the example given in Figure 2, we know that “MIll”, “John Stuart Mill” and “Stuart Mill” all refer to the same person. This means that Mill is even more central to the document than the table indicates, since he is referred to a total of 8 times instead of 5. -![wiki image description](https://user-images.githubusercontent.com/11147367/126206932-a5612459-e597-4340-a379-d62da58a29dc.png) - -Figure 2: Excerpt from John Stuart Mill’s Wikipedia page (left) and Top 5 Named Entity counts of the first few paragraphs (right). +![wiki image description](https://user-images.githubusercontent.com/11147367/126206932-a5612459-e597-4340-a379-d62da58a29dc.png) Figure 2: Excerpt from John Stuart Mill’s Wikipedia page (left) and Top 5 Named Entity counts of the first few paragraphs (right). To address this counting problem, let’s introduce a few abstractions: @@ -177,13 +173,11 @@ In one of them, we built a graph of documents along with their related keyphrase Keyphrase extractions have also been useful in adding explainability to document clusters. 
By extracting the most common keyphrases of a cluster, we can derive a common theme for the cluster’s content: -![topics image description](https://user-images.githubusercontent.com/11147367/126206921-31cea5fb-989c-4468-bb0e-508935f20636.png) - -Figure 3: Top keyphrases in a document cluster. The keywords imply that the documents therein are related to dentistry & healthcare, which was confirmed by manually inspecting the documents. +![topics image description](https://user-images.githubusercontent.com/11147367/126206921-31cea5fb-989c-4468-bb0e-508935f20636.png) Figure 3: Top keyphrases in a document cluster. The keywords imply that the documents therein are related to dentistry & healthcare, which was confirmed by manually inspecting the documents. In yet another project, we leveraged precomputed knowledge base embeddings to represent a document in space through a composition of the entities and keyphrases it contains. These features allowed us to understand the documents uploaded by our users and improve the content discovery on the platform. -To see how we use the information extracted to classify documents into a taxonomy, make sure to check out part 3: `Categorizing user-uploaded documents` +To see how we use the information extracted to classify documents into a taxonomy, make sure to check out part 3: `Categorizing user-uploaded documents` (coming soon). 
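To illustrate that composition idea (the embeddings and phrase names below are made up for illustration; the post doesn't disclose the composition used in production), a document vector can be obtained by averaging the knowledge-base embeddings of the entities and keyphrases extracted from it:

```python
# Hypothetical precomputed knowledge base embeddings, 2-dimensional for brevity.
KB_EMBEDDINGS = {
    "atomic bomb": [0.9, 0.1],
    "robert oppenheimer": [0.8, 0.3],
    "los alamos": [0.7, 0.2],
}

def document_embedding(phrases, embeddings=KB_EMBEDDINGS):
    """Average the embeddings of the extracted phrases found in the
    knowledge base; phrases that fail to link are skipped."""
    vectors = [embeddings[p.lower()] for p in phrases if p.lower() in embeddings]
    if not vectors:
        return None
    return [sum(component) / len(vectors) for component in zip(*vectors)]

print(document_embedding(["atomic bomb", "Los Alamos", "some unlinked phrase"]))
# averages the two linked vectors, roughly [0.8, 0.15]
```

Skipping unlinked phrases keeps the composition robust to extraction noise, at the cost of returning nothing for documents whose phrases never link.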
If you're interested to learn more about the problems Applied Research is solving or the systems which are built around those solutions, check out [our open positions!](/careers/#open-positions) From 97615cb2236937980fa637cbba5456c75af5b03e Mon Sep 17 00:00:00 2001 From: Rafael Lacerda Date: Mon, 19 Jul 2021 16:40:10 -0400 Subject: [PATCH 12/21] Update authors.yml --- _data/authors.yml | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/_data/authors.yml b/_data/authors.yml index ebe3bfe..9b660a6 100644 --- a/_data/authors.yml +++ b/_data/authors.yml @@ -151,3 +151,11 @@ antoniam: github: AntoniaMouawad about: | Antonia is a data scientist on the Applied Research team building machine learning models to understand and connect our content. + +rafaelp: + name: Rafael Lacerda + github: lacerda + blog: https://blog.lacerda.ch/ + about: | + Rafael is a data scientist on the Applied Research team building machine learning models to understand and connect our content. + From ef7a2f8d58ec55b3c2c1a6f8300bb62e54e489f6 Mon Sep 17 00:00:00 2001 From: Antonia Mouawad Date: Mon, 19 Jul 2021 17:03:18 -0400 Subject: [PATCH 13/21] rename post --- ...t-scribd.md => 2021-07-15-information-extraction-at-scribd.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename _posts/{2021-07-20-information-extraction-at-scribd.md => 2021-07-15-information-extraction-at-scribd.md} (100%) diff --git a/_posts/2021-07-20-information-extraction-at-scribd.md b/_posts/2021-07-15-information-extraction-at-scribd.md similarity index 100% rename from _posts/2021-07-20-information-extraction-at-scribd.md rename to _posts/2021-07-15-information-extraction-at-scribd.md From 74f02be40129436bd6f0ac19ab94a4ccc63c2fd7 Mon Sep 17 00:00:00 2001 From: Rafael Lacerda Date: Mon, 19 Jul 2021 17:07:33 -0400 Subject: [PATCH 14/21] image html --- ...21-07-15-information-extraction-at-scribd.md | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git 
a/_posts/2021-07-15-information-extraction-at-scribd.md b/_posts/2021-07-15-information-extraction-at-scribd.md index d2ed19c..2b499f8 100644 --- a/_posts/2021-07-15-information-extraction-at-scribd.md +++ b/_posts/2021-07-15-information-extraction-at-scribd.md @@ -14,7 +14,10 @@ This is part 2 in a series of blog posts describing a multi-component machine le As mentioned in part 1, we now have a way of identifying text-heavy documents. Having done that, we want to build dedicated models to deepen our semantic understanding of them. We do this by extracting key phrases and entities. -![flowchart image description](https://user-images.githubusercontent.com/11147367/126206943-9deabf5f-6add-4a01-9e20-5ed8f9e10069.png) Figure 1: Diagram of our multi-component machine learning system. +
+ Figure 1: Diagram of our multi-component machine learning system. +
+
Key phrases are phrases that represent major themes/topics, whereas entities are proper nouns such as people, places and organizations. For example, when a user uploads a document about the Manhattan project, we will first detect it is text-heavy, then extract key phrases and entities. Potential key phrases would be “atomic bomb” and “nuclear weapons” and potential entities would be “Robert Oppenheimer” and “Los Alamos”. @@ -62,7 +65,11 @@ Named Entity Extraction systems identify instances of named entities in a text, Naively counting named entities through exact string matches surfaces an interesting problem: a single entity may go by many names or aliases, which means string frequency is an unreliable measurement of importance. In the example given in Figure 2, we know that “MIll”, “John Stuart Mill” and “Stuart Mill” all refer to the same person. This means that Mill is even more central to the document than the table indicates, since he is referred to a total of 8 times instead of 5. -![wiki image description](https://user-images.githubusercontent.com/11147367/126206932-a5612459-e597-4340-a379-d62da58a29dc.png) Figure 2: Excerpt from John Stuart Mill’s Wikipedia page (left) and Top 5 Named Entity counts of the first few paragraphs (right). + +
+ Figure 2: Excerpt from John Stuart Mill’s Wikipedia page (left) and Top 5 Named Entity counts of the first few paragraphs (right). +
+
To address this counting problem, let’s introduce a few abstractions: @@ -173,7 +180,11 @@ In one of them, we built a graph of documents along with their related keyphrase Keyphrase extractions have also been useful in adding explainability to document clusters. By extracting the most common keyphrases of a cluster, we can derive a common theme for the cluster’s content: -![topics image description](https://user-images.githubusercontent.com/11147367/126206921-31cea5fb-989c-4468-bb0e-508935f20636.png) Figure 3: Top keyphrases in a document cluster. The keywords imply that the documents therein are related to dentistry & healthcare, which was confirmed by manually inspecting the documents. + +
+ Figure 3: Top keyphrases in a document cluster. The keywords imply that the documents therein are related to dentistry & healthcare, which was confirmed by manually inspecting the documents. +
+
In yet another project, we leveraged precomputed knowledge base embeddings to represent a document in space through a composition of the entities and keyphrases it contains. These features allowed us to understand the documents uploaded by our users and improve the content discovery on the platform. From 9ece4521d3ea9429dbb2d28ebfde9611b39e003e Mon Sep 17 00:00:00 2001 From: Antonia Mouawad Date: Mon, 19 Jul 2021 17:18:31 -0400 Subject: [PATCH 15/21] Link to info extraction --- _posts/2021-07-12-identifying-document-types.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2021-07-12-identifying-document-types.md b/_posts/2021-07-12-identifying-document-types.md index 140586c..e673888 100644 --- a/_posts/2021-07-12-identifying-document-types.md +++ b/_posts/2021-07-12-identifying-document-types.md @@ -121,7 +121,7 @@ While there are different ways of dealing with this, our approach involved two s Now that we have a model to filter documents based on visual cues, we can build dedicated information extraction models for each document type – sheet music, text-heavy, comics, tables. This is exactly how we proceed from here, and we start with extracting information from text-heavy documents. -Part 2 in this series will dive deeper into the challenges and solutions our +[Part 2](/blog/2021/information-extraction-at-scribd.html) in this series will dive deeper into the challenges and solutions our team encountered while building these models. 
If you're interested to learn more about the problems Applied Research is solving or the systems which are built around those solutions, check out [our open positions!](/careers/#open-positions) From 9d7d0f5843ec49c49f095437d369b8f258fd2495 Mon Sep 17 00:00:00 2001 From: Rafael Lacerda Date: Mon, 19 Jul 2021 17:15:45 -0400 Subject: [PATCH 16/21] add tables as images (weren't rendering properly) --- ...-07-15-information-extraction-at-scribd.md | 64 ++++--------------- 1 file changed, 11 insertions(+), 53 deletions(-) diff --git a/_posts/2021-07-15-information-extraction-at-scribd.md b/_posts/2021-07-15-information-extraction-at-scribd.md index 2b499f8..3b7db19 100644 --- a/_posts/2021-07-15-information-extraction-at-scribd.md +++ b/_posts/2021-07-15-information-extraction-at-scribd.md @@ -91,62 +91,20 @@ Given a set of aliases that appear in a document, we developed heuristics (e.g. Using our previous example to illustrate this method, we start by assuming the canonical alias is the longest alias in a text for a given entity, and attempt to merge aliases together by evaluating which aliases match the heuristics we developed.  - - - - - - - - - - - - - - - -
Entities’ Aliases Mentioned in the Document
Mill | John Stuart Mill | Robert Grosvenor | William Henry Smith | Stuart Mill
- -Table 1: Top 5 occurring aliases in the first few paragraphs of John Stuart Mill’s Wikipedia page, some referring to the same person. +
+ Table 1: Top 5 occurring aliases in the first few paragraphs of John Stuart Mill’s Wikipedia page, some referring to the same person.
+ +
+
Comparing entities with each other using exact token matching as a heuristic would solve this: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Merges
Comparison Left | Comparison Right | Result
Mill | John Stuart Mill | (Mill, John Stuart Mill)
(Mill, John Stuart Mill) | Robert Grosvenor | -
(Mill, John Stuart Mill) | William Henry Smith | -
(Mill, John Stuart Mill) | Stuart Mill | (Mill, John Stuart Mill, Stuart Mill)
[ the remaining comparisons do not yield merges and were omitted ]
- -Table 2: Pairwise alias comparisons and resulting merges. Matches highlighted in bold. +
+ Table 2: Pairwise alias comparisons and resulting merges. Matches highlighted in bold. +
+
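A rough sketch of such pairwise merging (an illustrative token-subset heuristic only — the post's actual heuristics are more involved): an alias joins a group when all of its tokens appear in the group's longest alias, which we assume is the canonical one.

```python
def merge_aliases(aliases):
    """Group aliases by token overlap: an alias joins a group when its
    tokens are a subset of the group's (longest, assumed canonical) alias."""
    groups = []  # (canonical alias, member aliases)
    for alias in sorted(aliases, key=len, reverse=True):
        tokens = set(alias.split())
        for canonical, members in groups:
            if tokens <= set(canonical.split()):
                members.append(alias)
                break
        else:  # no group matched: this alias starts its own group
            groups.append((alias, [alias]))
    return dict(groups)

print(merge_aliases(["Mill", "John Stuart Mill", "Robert Grosvenor",
                     "William Henry Smith", "Stuart Mill"]))
# "Stuart Mill" and "Mill" merge into "John Stuart Mill", mirroring Table 2
```

Processing aliases longest-first guarantees a candidate canonical alias already exists before its shorter variants are considered.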
By replacing all mentions with its corresponding canonical alias, we are able to find the correct named entity counts. From 02309db9035610566307e906241764ced62a587a Mon Sep 17 00:00:00 2001 From: Rafael Lacerda Date: Mon, 19 Jul 2021 17:17:31 -0400 Subject: [PATCH 17/21] Update 2021-07-15-information-extraction-at-scribd.md --- _posts/2021-07-15-information-extraction-at-scribd.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2021-07-15-information-extraction-at-scribd.md b/_posts/2021-07-15-information-extraction-at-scribd.md index 3b7db19..6a40999 100644 --- a/_posts/2021-07-15-information-extraction-at-scribd.md +++ b/_posts/2021-07-15-information-extraction-at-scribd.md @@ -12,7 +12,7 @@ authors: --- This is part 2 in a series of blog posts describing a multi-component machine learning system we built to extract metadata from our documents in order to enrich downstream discovery models. In this post, we present the challenges and limitations we faced and the solutions we came up with when building information extraction NLP models for our text-heavy documents. -As mentioned in part 1, we now have a way of identifying text-heavy documents. Having done that, we want to build dedicated models to deepen our semantic understanding of them. We do this by extracting key phrases and entities. +As mentioned in [part 1](https://tech.scribd.com/blog/2021/identifying-document-types.html), we now have a way of identifying text-heavy documents. Having done that, we want to build dedicated models to deepen our semantic understanding of them. We do this by extracting key phrases and entities.
Figure 1: Diagram of our multi-component machine learning system. From beaf361435e01bc79956de6f1cb9768ef34a4892 Mon Sep 17 00:00:00 2001 From: Antonia Mouawad Date: Mon, 19 Jul 2021 17:33:21 -0400 Subject: [PATCH 18/21] Update blog post --- ...07-19-information-extraction-at-scribd.md} | 32 +++++++++---------- 1 file changed, 16 insertions(+), 16 deletions(-) rename _posts/{2021-07-15-information-extraction-at-scribd.md => 2021-07-19-information-extraction-at-scribd.md} (75%) diff --git a/_posts/2021-07-15-information-extraction-at-scribd.md b/_posts/2021-07-19-information-extraction-at-scribd.md similarity index 75% rename from _posts/2021-07-15-information-extraction-at-scribd.md rename to _posts/2021-07-19-information-extraction-at-scribd.md index 6a40999..0d8f119 100644 --- a/_posts/2021-07-15-information-extraction-at-scribd.md +++ b/_posts/2021-07-19-information-extraction-at-scribd.md @@ -12,18 +12,18 @@ authors: --- This is part 2 in a series of blog posts describing a multi-component machine learning system we built to extract metadata from our documents in order to enrich downstream discovery models. In this post, we present the challenges and limitations we faced and the solutions we came up with when building information extraction NLP models for our text-heavy documents. -As mentioned in [part 1](https://tech.scribd.com/blog/2021/identifying-document-types.html), we now have a way of identifying text-heavy documents. Having done that, we want to build dedicated models to deepen our semantic understanding of them. We do this by extracting key phrases and entities. +As mentioned in [part 1](/blog/2021/identifying-document-types.html), we now have a way of identifying text-heavy documents. Having done that, we want to build dedicated models to deepen our semantic understanding of them. We do this by extracting keyphrases and entities.
Figure 1: Diagram of our multi-component machine learning system.
-Key phrases are phrases that represent major themes/topics, whereas entities are proper nouns such as people, places and organizations. For example, when a user uploads a document about the Manhattan project, we will first detect it is text-heavy, then extract key phrases and entities. Potential key phrases would be “atomic bomb” and “nuclear weapons” and potential entities would be “Robert Oppenheimer” and “Los Alamos”. +Keyphrases are phrases that represent major themes/topics, whereas entities are proper nouns such as people, places and organizations. For example, when a user uploads a document about the Manhattan project, we will first detect it is text-heavy, then extract keyphrases and entities. Potential keyphrases would be “atomic bomb” and “nuclear weapons” and potential entities would be “Robert Oppenheimer” and “Los Alamos”. As keyphrase extraction brings out the general topics discussed in a document, it helps put a cap on the amount of information kept per document, resulting in a somewhat uniform representation of documents irrespective of their original size. Entity extraction, on the other hand, identifies elements in a text that aren’t necessarily reflected by keyphrases only. We found the combination of keyphrase and entity extraction to provide a rich semantic description of each document. -The rest of this post will explain how we approached key phrase and entity extraction, and how we identified whether a subset of these keyphrases and entities are present in a knowledge base (also known as linking), and introduce how we use them to categorize documents. +The rest of this post will explain how we approached keyphrase and entity extraction, and how we identified whether a subset of these keyphrases and entities are present in a knowledge base (also known as linking), and introduce how we use them to categorize documents. 
## Keyphrase Extraction @@ -31,7 +31,7 @@ Typically a keyphrase extraction system operates in two steps as indicated in th - Using heuristics to extract a list of words/phrases that serve as candidate keyphrases, such as part-of-speech language patterns, stopwords filtering, and n-grams with Wikipedia article titles -- Determining which of these candidate keyphrases are most likely to be key phrases, using one of the two approaches: +- Determining which of these candidate keyphrases are most likely to be keyphrases, using one of the two approaches: - Supervised approaches such as binary classification of candidates (useful/not useful), structural features based on positional encoding, etc. @@ -39,21 +39,21 @@ Typically a keyphrase extraction system operates in two steps as indicated in th Training a decent supervised model to be able to extract keyphrases across a wide variety of topics would require a large amount of training data, and might generalize very poorly. For this reason, we decided to take the unsupervised approach. -Our implementation of key phrase extraction is optimized for speed without sacrificing key phrase quality much. We employ both a statistical method and language specific rules to identify them efficiently. +Our implementation of keyphrase extraction is optimized for speed without sacrificing keyphrase quality much. We employ both a statistical method and language specific rules to identify them efficiently. We simply start by filtering out stopwords and extracting the n-grams with a base n (bigrams in our case, n=2). This step is fast and straightforward and results in an initial set of candidate n-grams.  -Limiting the results to a single n-gram class, however, results in split key phrases, which makes linking them to a knowledge base a challenging task. 
For that, we attempt to agglomerate lower order n-grams into potentially longer key phrases, as long as they occur at a predetermined minimum frequency as compared to the shorter n_gram, based on the following a pattern:  +Limiting the results to a single n-gram class, however, results in split keyphrases, which makes linking them to a knowledge base a challenging task. For that, we attempt to agglomerate lower order n-grams into potentially longer keyphrases, as long as they occur at a predetermined minimum frequency as compared to the shorter n_gram, based on the following a pattern:  `A sequence of nouns (NN) possibly interleaved with either Coordinating Conjunctions (CC) or Prepositions and Subordinating Conjunctions (IN).` Here are a few examples: -Assuming the minimum frequency of agglomeration is 0.5, that means we would only replace the bigram `world (NN) health (NN)` by `world (NN) health (NN) organization (NN)` as long as `world health organization` occurs at least 50% as much as `world health` occurs.  +- Assuming the minimum frequency of agglomeration is 0.5, that means we would only replace the bigram `world (NN) health (NN)` by `world (NN) health (NN) organization (NN)` as long as `world health organization` occurs at least 50% as much as `world health` occurs.  -Replace `Human (NNP) Development (NNP)` with `Center(NNP) for (IN) Global (NNP) Development (NNP)` only if the latter occurs at least a predetermined percentage of time as compared to the former. +- Replace `Human (NNP) Development (NNP)` with `Center(NNP) for (IN) Global (NNP) Development (NNP)` only if the latter occurs at least a predetermined percentage of time as compared to the former. -This method results in more coherent and complete key phrases that could be linked more accurately to a knowledge base entry. +This method results in more coherent and complete keyphrases that could be linked more accurately to a knowledge base entry. 
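Sketched in Python under stated assumptions (the helper name `agglomerate` and the counting details are illustrative — the post only describes the rule), the frequency check above amounts to promoting a shorter n-gram to a longer one when the longer phrase clears the minimum ratio:

```python
from collections import Counter

def agglomerate(short_counts, long_counts, min_ratio=0.5):
    """Replace a shorter n-gram with a longer n-gram that extends it,
    provided the longer phrase occurs at least `min_ratio` as often."""
    merged = Counter(short_counts)
    for long_phrase, long_count in long_counts.items():
        tokens = long_phrase.split()
        base = " ".join(tokens[:-1])  # the shorter n-gram it extends
        base_count = short_counts.get(base, 0)
        if base_count and long_count / base_count >= min_ratio:
            merged[long_phrase] = long_count
            del merged[base]
    return merged

bigrams = Counter({"world health": 10})
trigrams = Counter({"world health organization": 6})
print(agglomerate(bigrams, trigrams))
# 6 >= 0.5 * 10, so "world health" is replaced by "world health organization"
```

With a lower trigram count (say 4 against 10), the ratio check fails and `world health` is kept instead.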
Finally we use the count of occurrences of the candidate keyphrase as a proxy to its importance. This method is reliable for longer documents, as the repetition of a keyphrase tends to reliably indicate its centrality to the document’s topic.  @@ -73,15 +73,15 @@ Naively counting named entities through exact string matches surfaces an interes To address this counting problem, let’s introduce a few abstractions: -`Named Entity` refers to a unique person, place or organization. Because of their uniqueness, we can represent them with a unique identifier (ID).  +- `Named Entity` refers to a unique person, place or organization. Because of their uniqueness, we can represent them with a unique identifier (ID).  -`Named Entity Alias` (or simply Alias), is one of possibly many names associated with a particular entity. +- `Named Entity Alias` (or simply Alias), is one of possibly many names associated with a particular entity. -`Canonical Alias` is the preferred name for an entity. +- `Canonical Alias` is the preferred name for an entity. -`Named Entity Mention` (or simply `Mention`), refers to each occurrence in a text that a Named Entity was referred to, regardless of which Alias was used. +- `Named Entity Mention` (or simply `Mention`), refers to each occurrence in a text that a Named Entity was referred to, regardless of which Alias was used. -`Knowledge Base` is a collection of entities, allowing us to query for ID, canonical name, aliases and other information that might be relevant for the task at hand. One example is [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page). +- `Knowledge Base` is a collection of entities, allowing us to query for ID, canonical name, aliases and other information that might be relevant for the task at hand. One example is [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page). The first step to solve the counting problem is to normalize the names a document uses to refer to a named entity. 
Using our abstractions, this means we want to find all the mentions in a document, and use its alias to find the named entity it belongs to. Then, replace it with either the canonical name or the named entity ID - this distinction will become clearer later on. @@ -134,7 +134,7 @@ Putting all of this together, we can: This has enabled some interesting projects: -In one of them, we built a graph of documents along with their related keyphrases and entities. Embedding documents, keyphrases and entities in the same space allowed us to discover documents by analogy. For example, take `The Count of Monte Cristo` by Alexandre Dumas, a 19th century French novel about revenge. If we add to its embedding the embedding of science_fiction, it leads us to a collection of science fiction novels by Jules Verne (another 19th century French author), such as `20,000 Leagues Under the Sea` and `Journey to the Center of the Earth`. +In one of them, we built a graph of documents along with their related keyphrases and entities. Embedding documents, keyphrases and entities in the same space allowed us to discover documents by analogy. For example, take `The Count of Monte Cristo` by Alexandre Dumas, a 19th century French novel about revenge. If we add to its embedding the embedding of `science_fiction`, it leads us to a collection of science fiction novels by Jules Verne (another 19th century French author), such as `20,000 Leagues Under the Sea` and `Journey to the Center of the Earth`. Keyphrase extractions have also been useful in adding explainability to document clusters. By extracting the most common keyphrases of a cluster, we can derive a common theme for the cluster’s content: @@ -146,7 +146,7 @@ Keyphrase extractions have also been useful in adding explainability to document In yet another project, we leveraged precomputed knowledge base embeddings to represent a document in space through a composition of the entities and keyphrases it contains. 
These features allowed us to understand the documents uploaded by our users and improve the content discovery on the platform. -To see how we use the information extracted to classify documents into a taxonomy, make sure to check out part 3: `Categorizing user-uploaded documents` (coming soon). +To see how we use the information extracted to classify documents into a taxonomy, make sure to check out part 3 (coming soon): `Categorizing user-uploaded documents` If you're interested to learn more about the problems Applied Research is solving or the systems which are built around those solutions, check out [our open positions!](/careers/#open-positions) From b88665cb3033e4def9d1289fa32b31f301d61125 Mon Sep 17 00:00:00 2001 From: "R. Tyler Croy" Date: Tue, 20 Jul 2021 09:36:06 -0700 Subject: [PATCH 19/21] Initial pass of copy editing and spell checking Tidied up the intro paragraph --- ...-07-19-information-extraction-at-scribd.md | 45 ++++++++++++------- 1 file changed, 28 insertions(+), 17 deletions(-) diff --git a/_posts/2021-07-19-information-extraction-at-scribd.md b/_posts/2021-07-19-information-extraction-at-scribd.md index 0d8f119..cb40800 100644 --- a/_posts/2021-07-19-information-extraction-at-scribd.md +++ b/_posts/2021-07-19-information-extraction-at-scribd.md @@ -8,11 +8,18 @@ tags: team: Applied Research authors: - antoniam -- rafaelp --- -This is part 2 in a series of blog posts describing a multi-component machine learning system we built to extract metadata from our documents in order to enrich downstream discovery models. In this post, we present the challenges and limitations we faced and the solutions we came up with when building information extraction NLP models for our text-heavy documents. -As mentioned in [part 1](/blog/2021/identifying-document-types.html), we now have a way of identifying text-heavy documents. Having done that, we want to build dedicated models to deepen our semantic understanding of them. 
We do this by extracting keyphrases and entities. +Extracting metadata from our documents is an important part of our discovery +and recommendation pipeline, but discerning useful and relevant details +from text-heavy user-uploaded documents can be challenging. This is +part 2 in a series of blog posts describing a multi-component machine learning +system we built to extract metadata from our documents in order to enrich +downstream discovery models. In this post, we present the challenges and +limitations we faced and the solutions we came up with when building +information extraction NLP models for our text-heavy documents. + +As mentioned in [part 1](/blog/2021/identifying-document-types.html), we now have a way of identifying text-heavy documents. Having done that, we want to build dedicated models to deepen our semantic understanding of them. We do this by extracting keyphrases and entities.
Figure 1: Diagram of our multi-component machine learning system. @@ -21,7 +28,7 @@ As mentioned in [part 1](/blog/2021/identifying-document-types.html), we now hav Keyphrases are phrases that represent major themes/topics, whereas entities are proper nouns such as people, places and organizations. For example, when a user uploads a document about the Manhattan project, we will first detect it is text-heavy, then extract keyphrases and entities. Potential keyphrases would be “atomic bomb” and “nuclear weapons” and potential entities would be “Robert Oppenheimer” and “Los Alamos”. -As keyphrase extraction brings out the general topics discussed in a document, it helps put a cap on the amount of information kept per document, resulting in a somewhat uniform representation of documents irrespective of their original size. Entity extraction, on the other hand, identifies elements in a text that aren’t necessarily reflected by keyphrases only. We found the combination of keyphrase and entity extraction to provide a rich semantic description of each document. +As keyphrase extraction brings out the general topics discussed in a document, it helps put a cap on the amount of information kept per document, resulting in a somewhat uniform representation of documents irrespective of their original size. Entity extraction, on the other hand, identifies elements in a text that aren't necessarily reflected by keyphrases only. We found the combination of keyphrase and entity extraction to provide a rich semantic description of each document. The rest of this post will explain how we approached keyphrase and entity extraction, and how we identified whether a subset of these keyphrases and entities are present in a knowledge base (also known as linking), and introduce how we use them to categorize documents. 
@@ -41,15 +48,15 @@ Training a decent supervised model to be able to extract keyphrases across a wid Our implementation of keyphrase extraction is optimized for speed without sacrificing keyphrase quality much. We employ both a statistical method and language specific rules to identify them efficiently. -We simply start by filtering out stopwords and extracting the n-grams with a base n (bigrams in our case, n=2). This step is fast and straightforward and results in an initial set of candidate n-grams.  +We simply start by filtering out stopwords and extracting the n-grams with a base n (bi-grams in our case, n=2). This step is fast and straightforward and results in an initial set of candidate n-grams.  -Limiting the results to a single n-gram class, however, results in split keyphrases, which makes linking them to a knowledge base a challenging task. For that, we attempt to agglomerate lower order n-grams into potentially longer keyphrases, as long as they occur at a predetermined minimum frequency as compared to the shorter n_gram, based on the following a pattern:  +Limiting the results to a single n-gram class, however, results in split keyphrases, which makes linking them to a knowledge base a challenging task. For that, we attempt to agglomerate lower order n-grams into potentially longer keyphrases, as long as they occur at a predetermined minimum frequency as compared to the shorter n-gram, based on the following a pattern:  `A sequence of nouns (NN) possibly interleaved with either Coordinating Conjunctions (CC) or Prepositions and Subordinating Conjunctions (IN).` Here are a few examples: -- Assuming the minimum frequency of agglomeration is 0.5, that means we would only replace the bigram `world (NN) health (NN)` by `world (NN) health (NN) organization (NN)` as long as `world health organization` occurs at least 50% as much as `world health` occurs.  
+- Assuming the minimum frequency of agglomeration is 0.5, that means we would only replace the bi-gram `world (NN) health (NN)` by `world (NN) health (NN) organization (NN)` as long as `world health organization` occurs at least 50% as much as `world health` occurs.  - Replace `Human (NNP) Development (NNP)` with `Center(NNP) for (IN) Global (NNP) Development (NNP)` only if the latter occurs at least a predetermined percentage of time as compared to the former. @@ -71,7 +78,7 @@ Naively counting named entities through exact string matches surfaces an interes
Figure 2: Excerpt from John Stuart Mill’s Wikipedia page (left) and Top 5 Named Entity counts of the first few paragraphs (right).
-To address this counting problem, let’s introduce a few abstractions: +To address this counting problem, let's introduce a few abstractions: - `Named Entity` refers to a unique person, place or organization. Because of their uniqueness, we can represent them with a unique identifier (ID).  @@ -116,9 +123,9 @@ When an entity is not present in a knowledge base, we cannot use Named Entity Li Given that many keyphrases and entities mentioned in a document are notable, they are likely present in a knowledge base. This allows us to leverage extra information present in the knowledge base to improve the normalization step as well as downstream tasks. -Entity Linking assists normalization by providing information that an alias matches a named entity, which otherwise wouldn’t match a heuristic (e.g. “Honest Abe” versus “Abraham Lincoln”). Furthermore, [information in a knowledge base can be used to embed linked entities and keyphrases in the same space as text](https://arxiv.org/abs/1601.01343). +Entity Linking assists normalization by providing information that an alias matches a named entity, which otherwise wouldn't match a heuristic (e.g. “Honest Abe” versus “Abraham Lincoln”). Furthermore, [information in a knowledge base can be used to embed linked entities and keyphrases in the same space as text](https://arxiv.org/abs/1601.01343). -Being able to embed entities in the same space as text is useful, as this unlocks the ability to [compare possible matching named entity IDs with the context in which they’re mentioned](https://arxiv.org/abs/1911.03814), and make a decision on whether an alias we’re considering might be one of the entities in the knowledge base (in which case we will use IDs), or whether the alias doesn’t match any entity in the knowledge base, in which case we fall back to using the assumed canonical alias.  
+Being able to embed entities in the same space as text is useful, as this unlocks the ability to [compare possible matching named entity IDs with the context in which they’re mentioned](https://arxiv.org/abs/1911.03814), and make a decision on whether an alias we’re considering might be one of the entities in the knowledge base (in which case we will use IDs), or whether the alias doesn't match any entity in the knowledge base, in which case we fall back to using the assumed canonical alias.

 At Scribd we make use of Entity Linking to not only improve the Entity Normalization step, but also to take advantage of entity and keyphrase embeddings as supplemental features.

@@ -132,11 +139,11 @@ Putting all of this together, we can:
 1. Take advantage of relevant information in knowledge bases

-This has enabled some interesting projects:
+This has enabled some interesting projects:

 In one of them, we built a graph of documents along with their related keyphrases and entities. Embedding documents, keyphrases and entities in the same space allowed us to discover documents by analogy. For example, take `The Count of Monte Cristo` by Alexandre Dumas, a 19th century French novel about revenge. If we add to its embedding the embedding of `science_fiction`, it leads us to a collection of science fiction novels by Jules Verne (another 19th century French author), such as `20,000 Leagues Under the Sea` and `Journey to the Center of the Earth`.

-Keyphrase extractions have also been useful in adding explainability to document clusters. By extracting the most common keyphrases of a cluster, we can derive a common theme for the cluster’s content:
+Keyphrase extractions have also been useful in adding clarity to document clusters. By extracting the most common keyphrases of a cluster, we can derive a common theme for the cluster’s content:
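The fallback logic described in the added paragraph (score each candidate knowledge-base ID against the mention's context embedding, and fall back to the canonical alias when nothing scores high enough) might look roughly like this sketch; the embeddings, the IDs and the threshold are all illustrative, not Scribd's actual model:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def link_alias(context_vec, candidates, threshold=0.7):
    """Return the knowledge-base ID whose embedding best matches the
    mention's context, or None to signal falling back to the alias."""
    best_id, best_score = None, threshold
    for kb_id, entity_vec in candidates.items():
        score = cosine(context_vec, entity_vec)
        if score > best_score:
            best_id, best_score = kb_id, score
    return best_id

# Hypothetical 3-d embeddings for two candidate entities; the context
# around an alias like "Honest Abe" should land near the right one.
candidates = {"Q91": [0.9, 0.1, 0.0], "Q517": [0.0, 0.9, 0.4]}
print(link_alias([1.0, 0.2, 0.0], candidates))
```

A `None` result is the cue to keep the assumed canonical alias instead of a knowledge-base ID.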
@@ -144,11 +151,15 @@ Keyphrase extractions have also been useful in adding explainability to document
Figure 3: Top keyphrases in a document cluster. The keywords imply that the documents therein are related to dentistry & healthcare, which was confirmed by manually inspecting the documents.
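The cluster-theme idea behind Figure 3 amounts to counting keyphrases across a cluster's documents. A minimal sketch with made-up keyphrases:

```python
from collections import Counter

def cluster_theme(cluster_keyphrases, top_n=5):
    """Given the keyphrases extracted from each document in a cluster,
    return the most common ones as a rough theme for the cluster."""
    counts = Counter(kp for doc in cluster_keyphrases for kp in doc)
    return [kp for kp, _ in counts.most_common(top_n)]

# Invented keyphrase lists for three documents in one cluster.
docs = [
    ["dentistry", "oral hygiene", "healthcare"],
    ["dentistry", "tooth decay", "healthcare"],
    ["dentistry", "healthcare", "dental implants"],
]
print(cluster_theme(docs, top_n=2))
```

Here the dominant keyphrases immediately suggest a dentistry/healthcare theme, mirroring the figure.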
-In yet another project, we leveraged precomputed knowledge base embeddings to represent a document in space through a composition of the entities and keyphrases it contains. These features allowed us to understand the documents uploaded by our users and improve the content discovery on the platform.
-
-To see how we use the information extracted to classify documents into a taxonomy, make sure to check out part 3 (coming soon): `Categorizing user-uploaded documents`
-
-If you're interested to learn more about the problems Applied Research is solving or the systems which are built around those solutions, check out [our open positions!](/careers/#open-positions)
+In yet another project, we leveraged precomputed knowledge base embeddings to represent a document in space through a composition of the entities and keyphrases it contains. These features allowed us to understand the documents uploaded by our users and improve the content discovery on the platform.
+To see how we use the information extracted to classify documents into a
+taxonomy, make sure to check out part 3 which will be coming soon:
+*Categorizing user-uploaded documents*.
+This post was written in collaboration with my colleague [Rafael
+Lacerda](https://blog.lacerda.ch) on the Applied Research team. If you're
+interested in learning more about the problems Applied Research is solving, or the
+systems which are built around those solutions, check out [our open
+positions!](/careers/#open-positions)

From b2bd077f0c6949001d74fb599fd10f2ae83aa564 Mon Sep 17 00:00:00 2001
From: "R. Tyler Croy"
Date: Tue, 20 Jul 2021 09:36:52 -0700
Subject: [PATCH 20/21] Move the publish date to Wednesday

---
 ...t-scribd.md => 2021-07-21-information-extraction-at-scribd.md} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename _posts/{2021-07-19-information-extraction-at-scribd.md => 2021-07-21-information-extraction-at-scribd.md} (100%)

diff --git a/_posts/2021-07-19-information-extraction-at-scribd.md b/_posts/2021-07-21-information-extraction-at-scribd.md
similarity index 100%
rename from _posts/2021-07-19-information-extraction-at-scribd.md
rename to _posts/2021-07-21-information-extraction-at-scribd.md

From d676e9bd97ab8cc917af06570f6c76933145bed4 Mon Sep 17 00:00:00 2001
From: Rafael Lacerda
Date: Tue, 20 Jul 2021 14:55:43 -0400
Subject: [PATCH 21/21] highlighting the AR team perspective

Readers might wonder whether this is something R&A built vs. something the AR team built. This should clarify the perspective: this is an AR team effort, as explained by R&A.
---
 ...-07-21-information-extraction-at-scribd.md | 21 +++++++++----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/_posts/2021-07-21-information-extraction-at-scribd.md b/_posts/2021-07-21-information-extraction-at-scribd.md
index cb40800..e90b9f8 100644
--- a/_posts/2021-07-21-information-extraction-at-scribd.md
+++ b/_posts/2021-07-21-information-extraction-at-scribd.md
@@ -8,16 +8,17 @@ tags:
 team: Applied Research
 authors:
 - antoniam
+- rafaelp
 ---

 Extracting metadata from our documents is an important part of our discovery and recommendation pipeline, but discerning useful and relevant details from text-heavy user-uploaded documents can be challenging.

 This is part 2 in a series of blog posts describing a multi-component machine learning
-system we built to extract metadata from our documents in order to enrich
-downstream discovery models. In this post, we present the challenges and
-limitations we faced and the solutions we came up with when building
-information extraction NLP models for our text-heavy documents.
+system the Applied Research team built to extract metadata from our documents in order
+to enrich downstream discovery models. In this post, we present the challenges and
+limitations the team faced when building information extraction NLP models for Scribd's
+text-heavy documents and how they were solved.

 As mentioned in [part 1](/blog/2021/identifying-document-types.html), we now have a way of identifying text-heavy documents. Having done that, we want to build dedicated models to deepen our semantic understanding of them. We do this by extracting keyphrases and entities.

@@ -141,7 +142,7 @@ Putting all of this together, we can:

 This has enabled some interesting projects:

-In one of them, we built a graph of documents along with their related keyphrases and entities. Embedding documents, keyphrases and entities in the same space allowed us to discover documents by analogy. For example, take `The Count of Monte Cristo` by Alexandre Dumas, a 19th century French novel about revenge. If we add to its embedding the embedding of `science_fiction`, it leads us to a collection of science fiction novels by Jules Verne (another 19th century French author), such as `20,000 Leagues Under the Sea` and `Journey to the Center of the Earth`.
+In one of them, the Applied Research team built a graph of documents along with their related keyphrases and entities. Embedding documents, keyphrases and entities in the same space allowed us to discover documents by analogy. For example, take `The Count of Monte Cristo` by Alexandre Dumas, a 19th century French novel about revenge. If we add to its embedding the embedding of `science_fiction`, it leads us to a collection of science fiction novels by Jules Verne (another 19th century French author), such as `20,000 Leagues Under the Sea` and `Journey to the Center of the Earth`.

 Keyphrase extractions have also been useful in adding clarity to document clusters. By extracting the most common keyphrases of a cluster, we can derive a common theme for the cluster’s content:

@@ -151,15 +152,13 @@ Keyphrase extractions have also been useful in adding clarity to document cluste
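The analogy arithmetic described above (add a keyphrase embedding to a document embedding, then look for the nearest neighbour) can be illustrated with toy 2-d vectors; all the embeddings and titles below are invented for the example:

```python
import math

def nearest(query, library, exclude=()):
    """Return the library item whose embedding is closest to `query`
    by cosine similarity, skipping anything in `exclude`."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
    return max((t for t in library if t not in exclude), key=lambda t: cos(query, library[t]))

# Toy space: dimension 0 ~ "19th century French fiction", dimension 1 ~ "science fiction".
emb = {
    "the_count_of_monte_cristo": [1.0, 0.0],
    "20000_leagues_under_the_sea": [1.0, 1.0],
    "pride_and_prejudice": [0.2, 0.0],
}
science_fiction = [0.0, 1.0]
query = [a + b for a, b in zip(emb["the_count_of_monte_cristo"], science_fiction)]
print(nearest(query, emb, exclude={"the_count_of_monte_cristo"}))
```

Adding the `science_fiction` direction moves the query away from Dumas and toward the Verne-like point, which is the behaviour the blog post describes.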
Figure 3: Top keyphrases in a document cluster. The keywords imply that the documents therein are related to dentistry & healthcare, which was confirmed by manually inspecting the documents.
-In yet another project, we leveraged precomputed knowledge base embeddings to represent a document in space through a composition of the entities and keyphrases it contains. These features allowed us to understand the documents uploaded by our users and improve the content discovery on the platform.
+In yet another project, the team leveraged precomputed knowledge base embeddings to represent a document in space through a composition of the entities and keyphrases it contains. These features allowed us to understand the documents uploaded by our users and improve the content discovery on the platform.

 To see how we use the information extracted to classify documents into a
 taxonomy, make sure to check out part 3 which will be coming soon:
 *Categorizing user-uploaded documents*.

-This post was written in collaboration with my colleague [Rafael
-Lacerda](https://blog.lacerda.ch) on the Applied Research team. If you're
-interested to learn more about the problems Applied Research is solving, or the
-systems which are built around those solutions, check out [our open
-positions!](/careers/#open-positions)
+If you're interested in learning more about the problems Applied Research
+is solving, or the systems which are built around those solutions,
+check out [our open positions!](/careers/#open-positions)
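One simple way to realize the "composition of the entities and keyphrases" mentioned in this final hunk is to average their precomputed embeddings. This is only a sketch of that assumption, with hypothetical vectors rather than real knowledge-base embeddings:

```python
def compose_document_embedding(vectors):
    """Represent a document as the element-wise average of the embeddings
    of the entities and keyphrases it contains (one simple composition)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Hypothetical precomputed knowledge-base embeddings.
kb = {"dentistry": [1.0, 0.0], "healthcare": [0.0, 1.0]}
doc_entities = ["dentistry", "healthcare"]
print(compose_document_embedding([kb[e] for e in doc_entities]))
```

The resulting vector places the document between its constituent concepts, which is what makes it usable as a supplemental feature downstream.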