# Run content extraction on an HTML document

#### Readability and Title Extraction
In the following example, we run the `readability` extractor and the `title` extractor on an HTML document.  
The extraction is controlled by an `extraction_config`, and the input is a JSON document with the HTML document embedded in it.  

Let's break down the `extraction_config`:  
> `input_path`  

specifies the path to the HTML content within the JSON document  
> `extractors.readability`  

sets up `etk` to run the `readability` extractor on the content at `input_path`

> `strict: yes`

makes the `readability` extractor more precise. `strict: no` results in more of the HTML being extracted as main content  
> `extractors.title`  

sets up `etk` to run the `title` extractor
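
To make `input_path` concrete: in this example it is set to `raw_content`, so the HTML is expected under that key of the input JSON document. The sketch below only illustrates that lookup for a simple top-level key. The sample document and the `get_html` helper are hypothetical and are not part of `etk`.

```python
import json

# Hypothetical input document; only the "raw_content" key matters here.
doc = json.loads(
    '{"url": "http://example.com/1", '
    '"raw_content": "<html><head><title>Hi</title></head><body><p>Hello</p></body></html>"}'
)


def get_html(document, input_path):
    """Illustrative helper (not an etk function): fetch the HTML stored at a top-level key."""
    return document.get(input_path)


html = get_html(doc, "raw_content")
print(html[:6])
```

If the key is missing, this helper returns `None`, which is a choice made for the sketch rather than documented `etk` behavior.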

#### Output  
The output is placed under the field `content_extraction` in the input JSON document.   
It should contain `content_relaxed`, corresponding to `strict: no`, and `content_strict`, corresponding to `strict: yes`.  
It should also contain `title`.

```
{
    content_extraction: {
        content_strict: {
            text: "..."
        },
        content_relaxed: {
            text: "..."
        },
        title: {
            text: "..."
        }
    }
}
```

**Every content extraction value is stored in a field called `text` under its corresponding key in `content_extraction`.**
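
As a small sketch of how such a result could be read back, the snippet below pulls the `text` value out of each field. The `result` dictionary is a hypothetical stand-in shaped like the structure described here, and `extracted_text` is an illustrative helper, not an `etk` function.

```python
# Hypothetical result with the shape described above.
result = {
    "content_extraction": {
        "content_strict": {"text": "strict main content"},
        "content_relaxed": {"text": "relaxed main content"},
        "title": {"text": "page title"},
    }
}


def extracted_text(doc, field):
    """Return the `text` value of a content_extraction field, or None if it is absent."""
    return doc.get("content_extraction", {}).get(field, {}).get("text")


for name in ("content_strict", "content_relaxed", "title"):
    print("{}: {}".format(name, extracted_text(result, name)))
```

Returning `None` for a missing field keeps the caller from having to check each nesting level separately.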


In [6]:
import codecs
import json

from etk.core import Core

# content_extraction config: run readability (strict and relaxed) and title
extraction_config = {
    "content_extraction": {
        "input_path": "raw_content",
        "extractors": {
            "readability": [
                {
                    "strict": "yes",
                    "extraction_policy": "keep_existing"
                },
                {
                    "strict": "no",
                    "extraction_policy": "keep_existing",
                    "field_name": "content_relaxed"
                }
            ],
            "title": {
                "extraction_policy": "keep_existing"
            }
        }
    }
}

# read the json document from disk
with codecs.open('etk/unit_tests/ground_truth/1.jl', 'r') as f:
    doc = json.load(f)

c = Core(extraction_config=extraction_config)
r = c.process(doc)

print(json.dumps(r['content_extraction'], indent=2))


{
  "content_relaxed": {
    "text": "\n \n \n \n \n \n \n smoothlegs24  28 \n \n \n chrissy391  27 \n \n \n My name is Helena height 160cms weight 55 kilos  contact me at escort.here@gmail.com           jefferson ave         age: 23 HrumpMeNow  28 \n \n \n xxtradition  24 \n \n \n jumblyjumb  26 \n \n \n claudia77  26 \n \n \n gushinPuss  28 \n \n \n Littlexdit  25 \n \n \n PinkSweets2  28 \n \n \n withoutlimit  27 \n \n \n bothOfUs3  28 \n \n \n lovelylips  27 \n \n \n killerbod  27 \n \n \n Littlexdit  27 \n \n \n azneyes  23 \n \n \n \n \n \n Escort's Phone: \n \n \n323-452-2013  \n \n Escort's Location: \nLos Angeles, California  \n Escort's Age:   23   Date of Escort Post:   Jan 02nd 6:46am \n REVIEWS:   \n READ AND CREATE REVIEWS FOR THIS ESCORT   \n \n \n \n \n \nThere are  50  girls looking in  .\n VIEW GIRLS \n \nHey I'm luna 3234522013 Let's explore , embrace and indulge in your favorite fantasy  % independent. discreet no drama Firm Thighs and Sexy. My Soft skin & Tight Gri

  
  
#### Landmark Extractor
Now, let's run `landmark-extractor` on the HTML document. You can read about Inferlink's landmark-extractor [here](https://github.com/inferlink/landmark-extractor).  
**tl;dr** landmark-extractor applies a set of pre-trained rules to the HTML document. These rules are regex-based and can be created in the landmark-extractor tool.
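
As a rough illustration of what a regex-based rule of this kind does, the sketch below captures the text found between a "begin" pattern and an "end" pattern. It uses only Python's `re` module. It is not the landmark-extractor rule format or API, and the HTML fragment, the patterns, and the `extract_between` helper are all made up for the example.

```python
import re

# Made-up HTML fragment for illustration.
html = "<div>Age: <b>23</b></div><div>Location: <b>Los Angeles, California</b></div>"


def extract_between(text, begin_regex, end_regex):
    """Return the shortest text between the begin and end patterns, or None if no match."""
    pattern = "(?:" + begin_regex + ")(.*?)(?:" + end_regex + ")"
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else None


age = extract_between(html, r"Age:\s*<b>", r"</b>")
location = extract_between(html, r"Location:\s*<b>", r"</b>")
print(age)
print(location)
```

A rule whose begin pattern does not occur in the page simply yields no value, which is the sense in which a rule can fail on a given document.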

`extraction_config` for landmark-extractor:  

> `extractors.landmark`  

sets up `etk` to run landmark-extractor on the content at `input_path`

> `extractors.landmark.landmark_threshold`  

the minimum ratio of successful landmark rules to the total number of landmark rules for that domain. If the ratio is below this number, `etk` ignores the landmark extraction

> `resources.landmark`

the place in the `extraction_config` where the landmark rules files are specified
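
The threshold check described for `landmark_threshold` can be sketched as plain arithmetic. This is an illustration of the stated rule, not `etk`'s actual code: the function name is hypothetical, and treating a domain with zero rules as a failure is an assumption.

```python
def passes_landmark_threshold(successful_rules, total_rules, landmark_threshold):
    """Illustrative check: keep the extraction only if the success ratio reaches the threshold."""
    if total_rules == 0:
        # Assumption for this sketch: no rules for the domain means nothing to keep.
        return False
    ratio = float(successful_rules) / total_rules
    return ratio >= landmark_threshold


print(passes_landmark_threshold(5, 8, 0.5))  # ratio 0.625
print(passes_landmark_threshold(3, 8, 0.5))  # ratio 0.375
```

With `landmark_threshold` set to `0.5`, a page where 5 of 8 rules succeed is kept, while a page where only 3 of 8 succeed is ignored.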


#### Output  
The output is placed under the field `content_extraction` in the input JSON document.   

It should contain the field `inferlink_extractions`.

```
{
    content_extraction: {
        inferlink_extractions: {
            inferlink_age: {
                text: "..."
            },
            inferlink_posting-date: {
                text: "..."
            },
            ...
        }
    }
}
```

In [5]:
import codecs
import json

from etk.core import Core

rules_file_path = 'etk/unit_tests/resources/consolidated_rules.json'

# resources.landmark lists the rules files; extractors.landmark enables the extractor
e_config = {
    "resources": {
        "landmark": [
            rules_file_path
        ]
    },
    "content_extraction": {
        "input_path": "raw_content",
        "extractors": {
            "landmark": {
                "extraction_policy": "keep_existing",
                "landmark_threshold": 0.5
            }
        }
    }
}

# read the json document from disk
with codecs.open('etk/unit_tests/ground_truth/1.jl', 'r') as f:
    doc = json.load(f)

c = Core(extraction_config=e_config)
r = c.process(doc)

print(json.dumps(r['content_extraction'], indent=2))

{
  "inferlink_extractions": {
    "inferlink_location": {
      "text": "Los Angeles, California"
    }, 
    "inferlink_age": {
      "text": "23"
    }, 
    "inferlink_phone": {
      "text": "323-452-2013"
    }, 
    "inferlink_posting-date": {
      "text": "2017-01-02 06:46"
    }, 
    "inferlink_description": {
      "text": "Hey I'm luna 3234522013 Let's explore , embrace and indulge in your favorite fantasy % independent. discreet no drama Firm Thighs and Sexy. My Soft skin & Tight Grip is exactly what you deserve Call or text Fetish friendly Fantasy friendly Party friendly 140 Hr SPECIALS 3234522013"
    }
  }
}


  
# Run data extraction on text

Now let's extract some data types from the text. This step is controlled by the `data_extraction` section of the `extraction_config`.  
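
The configuration in the cell that follows combines two extractors for the `name` field: a regex (`extract_using_regex`) and a dictionary lookup over lowercased single tokens (`extract_using_dictionary` with `ngrams: 1`). The standalone sketch below mimics both ideas with Python's `re` module only, so it does not call `etk`. The regex is the one from the config; the tiny `women_names` set is a hypothetical stand-in for the dictionary file, and the tokenizer is a simplification.

```python
import re

text = "My name is Helena height 160cms weight 55 kilos. Hey I'm luna"

# Same pattern as in the config, with the JSON escaping removed.
name_regex = re.compile(r"(?:my[\s]+name[\s]+is[\s]+([-a-z0-9@$!]+))", re.IGNORECASE)
regex_names = [m.group(1) for m in name_regex.finditer(text)]

# Hypothetical stand-in for the "women_name" dictionary resource.
women_names = {"helena", "luna"}

# Simplified tokenizer: lowercase the text (like the x.lower() pre_process step)
# and split it into alphanumeric tokens, then look up each single token (ngrams = 1).
tokens = re.findall(r"[a-z0-9]+", text.lower())
dictionary_names = [t for t in tokens if t in women_names]

print(regex_names)
print(dictionary_names)
```

The regex only finds names introduced by "my name is", while the dictionary lookup finds any token that appears in the name list, which is why the two extractors can return different results for the same text.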


In [4]:
import codecs
import json

from etk.core import Core

extraction_config = {
  "resources": {
    "dictionaries": {
      "women_name": "etk/unit_tests/resources/female-names.json.gz"
    }
  },
  "data_extraction": [
    {
      "input_path": "*.*.text.`parent`",
      "fields": {
        "name": {
          "extractors": {
            "extract_using_dictionary": {
              "config": {
                "dictionary": "women_name",
                "ngrams": 1,
                "joiner": " ",
                "pre_process": [
                  "x.lower()"
                ],
                "pre_filter": [
                  "x"
                ],
                "post_filter": [
                  "isinstance(x, basestring)"
                ]
              },
              "extraction_policy": "keep_existing"
            },
            "extract_using_regex": {
              "config": {
                "include_context": "true",
                "regex": "(?:my[\\s]+name[\\s]+is[\\s]+([-a-z0-9@$!]+))",
                "regex_options": [
                  "IGNORECASE"
                ],
                "pre_filter": [
                  "x.replace('\\n', '')",
                  "x.replace('\\r', '')"
                ]
              },
              "extraction_policy": "replace"
            }
          }
        }
      }
    }
  ]
}
# read the json document from disk
with codecs.open('etk/unit_tests/ground_truth/1_content_extracted.jl', 'r') as f:
    doc = json.load(f)

c = Core(extraction_config=extraction_config)
r = c.process(doc)

# print the data extraction results attached to the strict readability content
print(json.dumps(r['content_extraction']['content_strict']['data_extraction'], indent=2))

{
  "name": {
    "extract_using_dictionary": {
      "results": [
        {
          "origin": {
            "score": 1.0, 
            "segment": "readability_strict", 
            "method": "other_method"
          }, 
          "context": {
            "start": 10, 
            "end": 11, 
            "text": "my name is helena height 160cms weight"
          }, 
          "value": "helena"
        }, 
        {
          "origin": {
            "score": 1.0, 
            "segment": "readability_strict", 
            "method": "other_method"
          }, 
          "context": {
            "start": 136, 
            "end": 137, 
            "text": "i ' m luna 3234522013 let '"
          }, 
          "value": "luna"
        }
      ]
    }, 
    "extract_using_regex": {
      "results": [
        {
          "origin": {
            "score": 1.0, 
            "segment": "readability_strict", 
            "method": "other_method"
          }, 
          "context": {
            "st