## URL matching

**Purpose**: for each outlet document, determine whether its URL was (re-)tweeted and, if so, how many times.

1. Recompute the `standardized_url` field with the updated version of urlExpander.
2. Add matches: for each outlet document, check whether `standardized_url` matches one or more documents of the `tweets2_url` doctype.

In [1]:
# matplotlib still emits log messages even though disable_existing_loggers=yes in logging_config.yaml
# https://stackoverflow.com/a/51529172/7016397
# workaround: set matplotlib's log level manually before creating my own logger
import logging
logging.getLogger('matplotlib').setLevel(logging.WARNING)

from usrightmedia.shared.loggers import get_logger
LOGGER = get_logger(filename = '01-url-matching', logger_type='main')

- Pre-processing uses the Bulk API through the Python Elasticsearch client.
- Elasticsearch: ["How Big Is Too Big?"](https://www.elastic.co/guide/en/elasticsearch/guide/current/bulk.html#_how_big_is_too_big)
> The entire bulk request needs to be loaded into memory by the node that receives our request, so the bigger the request, the less memory available for other requests. There is an optimal size of bulk request. Above that size, performance no longer improves and may even drop off. The optimal size, however, is not a fixed number. It depends entirely on your hardware, your document size and complexity, and your indexing and search load.
>
> Fortunately, it is easy to find this sweet spot: Try indexing typical documents in batches of increasing size. When performance starts to drop off, your batch size is too big. A good place to start is with batches of 1,000 to 5,000 documents or, if your documents are very large, with even smaller batches.
>
> It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start playing with is around 5-15MB in size.
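
The quoted advice about watching the physical size of bulk requests can be checked roughly before choosing `bulksize`. The following is a minimal sketch, assuming documents are JSON-serializable dicts; `chunked`, `estimate_bulk_bytes`, and `sample_docs` are hypothetical names for illustration and are not part of INCA or the Elasticsearch client. It ignores the per-document action metadata line that a real bulk request also carries, so it slightly underestimates.

```python
import json
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def estimate_bulk_bytes(docs):
    """Approximate payload size: UTF-8 bytes of each JSON-serialized doc plus one newline."""
    return sum(len(json.dumps(doc).encode("utf-8")) + 1 for doc in docs)


# Made-up sample documents, for illustration only
sample_docs = [
    {"standardized_url": f"https://example.com/{i}", "text": "x" * 1000}
    for i in range(5000)
]

# Estimated size in MB of each batch of up to 2000 documents
batch_sizes_mb = [
    estimate_bulk_bytes(batch) / 1e6 for batch in chunked(sample_docs, 2000)
]
print(batch_sizes_mb)
```

If the estimates fall well outside the 5-15MB range mentioned in the quote, that may be a hint to try a different batch size, although actual indexing throughput is the better guide.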

In [2]:
bulksize = 2000

## Import INCA and check doctypes

In [3]:
from inca import Inca
myinca = Inca()
myinca.database.list_doctypes()



{'tweets2': 889739,
 'tweets2_url': 285447,
 'foxnews': 63571,
 'breitbart': 39412,
 'dailycaller': 29463,
 'oneamericanews': 23008,
 'washingtonexaminer': 20793,
 'newsmax': 12345,
 'gatewaypundit': 9508,
 'infowars': 6751,
 'vdare': 6545,
 'dailystormer': 4005,
 'rushlimbaugh': 2533,
 'americanrenaissance': 2284,
 'seanhannity': 1232}

### 1. Add `standardized_url` field to documents representing (re-)tweeted URLs and outlets' articles
- `force=True` re-processes documents that already have a `standardized_url`, so the values are recomputed with the updated version of urlExpander (public fork).

In [4]:
doctypes_to_process = [
    "americanrenaissance",
    "breitbart",
    "dailycaller",
    "dailystormer",
    "foxnews",
    "gatewaypundit",
    "infowars",
    "newsmax",
    "oneamericanews",
    "rushlimbaugh",
    "seanhannity",
    "vdare",
    "washingtonexaminer",
    "tweets2_url"
]

In [5]:
for doctype in doctypes_to_process:
    
    try:
        
        docs = myinca.processing.standardize_url(docs_or_query=doctype,
                                                 field="resolved_url",
                                                 save=True,
                                                 new_key="standardized_url",
                                                 action="batch",
                                                 bulksize=bulksize,
                                                 force=True)
        for doc in docs:
            # iterating over the result is what triggers processing of each doc
            pass

    except Exception as e:
        LOGGER.warning(e)

100%|██████████| 2284/2284 [00:41<00:00, 55.62it/s]  
100%|██████████| 39412/39412 [10:16<00:00, 63.97it/s] 
100%|██████████| 29463/29463 [07:00<00:00, 70.11it/s] 
100%|██████████| 4005/4005 [00:53<00:00, 74.57it/s]  
100%|██████████| 63571/63571 [18:43<00:00, 56.57it/s] 
100%|██████████| 9508/9508 [01:46<00:00, 89.00it/s]  
100%|██████████| 6751/6751 [01:19<00:00, 84.40it/s]  
100%|██████████| 12345/12345 [02:55<00:00, 70.17it/s] 
100%|██████████| 23008/23008 [05:28<00:00, 70.14it/s] 
100%|██████████| 2533/2533 [00:45<00:00, 55.07it/s]  
100%|██████████| 1232/1232 [00:00<00:00, 4569.37it/s]
100%|██████████| 6545/6545 [01:56<00:00, 56.04it/s]  
100%|██████████| 20793/20793 [05:06<00:00, 67.90it/s] 
100%|██████████| 285447/285447 [33:39<00:00, 141.32it/s]
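
The empty `for doc in docs: pass` loop above only consumes the result, which suggests that the processor returns a lazy iterator and does its work as items are pulled from it. That is an inference from this notebook, not something confirmed from INCA's source. A minimal sketch of the same pattern with a stand-in generator, where `fake_processor` and `drain` are hypothetical names:

```python
from collections import deque


def fake_processor(docs, processed):
    """Stand-in for a lazy processor: the side effect happens only when iterated."""
    for doc in docs:
        processed.append(doc)  # simulated "save" step
        yield doc


def drain(iterable):
    """Consume an iterable purely for its side effects, discarding every item."""
    deque(iterable, maxlen=0)


processed = []
gen = fake_processor(["a", "b", "c"], processed)
nothing_ran_yet = processed == []  # creating the generator does no work

drain(gen)  # equivalent to: for doc in gen: pass
print(nothing_ran_yet, processed)
```

`deque(iterable, maxlen=0)` is a common idiom for exhausting an iterator without keeping its items; the explicit `for ... pass` loop used in the notebook does the same thing.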


### 2. Find matches
- For each outlet's document, check if `standardized_url` matches any documents of the `tweets2_url` doctype.
- Store the results in the new keys: `tweets2_url_ids` (list of matched IDs), `tweets2_url_match_count` (number of matches), and `tweets2_url_match_ind` (boolean indicator of 1+ matches).
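
The matching logic can be illustrated in memory as follows. This is a minimal sketch under stated assumptions, not INCA's implementation: the notebook's processor queries Elasticsearch per document, whereas here `build_url_index` and `match_article` are hypothetical helpers over plain dicts, and the `_id` field and sample documents are made up. The output key names follow the `new_key` values passed in the code cells of this section.

```python
from collections import defaultdict


def build_url_index(tweet_url_docs):
    """Map each standardized URL to the IDs of the tweet-URL docs that carry it."""
    index = defaultdict(list)
    for doc in tweet_url_docs:
        url = doc.get("standardized_url")
        if url:
            index[url].append(doc["_id"])
    return index


def match_article(article, index):
    """Return the three match fields for one outlet article."""
    ids = list(index.get(article.get("standardized_url"), []))
    return {
        "tweets2_url_ids": ids,
        "tweets2_url_match_count": len(ids),
        "tweets2_url_match_ind": len(ids) > 0,
    }


# Made-up documents, for illustration only
tweet_url_docs = [
    {"_id": "t1", "standardized_url": "https://example.com/story"},
    {"_id": "t2", "standardized_url": "https://example.com/story"},
    {"_id": "t3", "standardized_url": "https://example.org/other"},
]
index = build_url_index(tweet_url_docs)

matched = match_article({"standardized_url": "https://example.com/story"}, index)
unmatched = match_article({"standardized_url": "https://example.net/none"}, index)
print(matched)
print(unmatched)
```

The count and indicator are both derived from the ID list, which mirrors how the later cells read `tweets2_url_ids` as their input field.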

In [7]:
outlet_doctypes = [
    "americanrenaissance",
    "breitbart",
    "dailycaller",
    "dailystormer",
    "foxnews",
    "gatewaypundit",
    "infowars",
    "newsmax",
    "oneamericanews",
    "rushlimbaugh",
    "seanhannity",
    "vdare",
    "washingtonexaminer",
]

In [8]:
# action="run" rather than action="batch" because this processor sends an individual HTTP request to Elasticsearch per document anyway
for doctype in outlet_doctypes:
    
    try:
        
        docs = myinca.processing.match_outlet_articles_to_tweets2_urls(docs_or_query=doctype,
                                                                       field="standardized_url",
                                                                       save=True,
                                                                       new_key="tweets2_url_ids",
                                                                       action="run",
                                                                       force=True)
        for doc in docs:
            # iterating over the result is what triggers processing of each doc
            pass

    except Exception as e:
        LOGGER.warning(e)

100%|██████████| 2284/2284 [00:46<00:00, 49.15it/s]
100%|██████████| 39412/39412 [12:09<00:00, 54.03it/s]
100%|██████████| 29463/29463 [08:43<00:00, 56.29it/s]
100%|██████████| 4005/4005 [01:06<00:00, 60.14it/s]
100%|██████████| 63571/63571 [18:54<00:00, 56.02it/s]
100%|██████████| 9508/9508 [02:25<00:00, 65.41it/s] 
100%|██████████| 6751/6751 [01:46<00:00, 63.27it/s]
100%|██████████| 12345/12345 [03:19<00:00, 61.75it/s]
100%|██████████| 23008/23008 [06:17<00:00, 60.96it/s] 
100%|██████████| 2533/2533 [00:54<00:00, 46.69it/s]
100%|██████████| 1232/1232 [00:18<00:00, 67.24it/s]
100%|██████████| 6545/6545 [02:06<00:00, 51.72it/s]
100%|██████████| 20793/20793 [05:36<00:00, 61.72it/s]


In [9]:
for doctype in outlet_doctypes:
    
    try:
        
        docs = myinca.processing.match_outlet_articles_to_tweets2_urls_count(docs_or_query=doctype,
                                                                             field="tweets2_url_ids",
                                                                             save=True,
                                                                             new_key="tweets2_url_match_count",
                                                                             action="batch",
                                                                             bulksize=bulksize,
                                                                             force=True)
        for doc in docs:
            # iterating over the result is what triggers processing of each doc
            pass

    except Exception as e:
        LOGGER.warning(e)

100%|██████████| 2284/2284 [00:38<00:00, 59.09it/s]  
100%|██████████| 39412/39412 [10:45<00:00, 61.10it/s] 
100%|██████████| 29463/29463 [07:17<00:00, 67.32it/s] 
100%|██████████| 4005/4005 [00:54<00:00, 73.69it/s]  
100%|██████████| 63571/63571 [18:51<00:00, 56.16it/s] 
100%|██████████| 9508/9508 [01:55<00:00, 82.26it/s]  
100%|██████████| 6751/6751 [01:32<00:00, 72.68it/s]  
100%|██████████| 12345/12345 [03:06<00:00, 66.23it/s] 
100%|██████████| 23008/23008 [05:50<00:00, 65.68it/s] 
100%|██████████| 2533/2533 [00:47<00:00, 53.68it/s]  
100%|██████████| 1232/1232 [00:00<00:00, 4053.64it/s]
100%|██████████| 6545/6545 [02:15<00:00, 48.14it/s]  
100%|██████████| 20793/20793 [05:13<00:00, 66.33it/s] 


In [10]:
for doctype in outlet_doctypes:
    
    try:
        
        docs = myinca.processing.match_outlet_articles_to_tweets2_urls_ind(docs_or_query=doctype,
                                                                           field="tweets2_url_ids",
                                                                           save=True,
                                                                           new_key="tweets2_url_match_ind",
                                                                           action="batch",
                                                                           bulksize=bulksize,
                                                                           force=True)
        for doc in docs:
            # iterating over the result is what triggers processing of each doc
            pass

    except Exception as e:
        LOGGER.warning(e)

100%|██████████| 2284/2284 [00:41<00:00, 55.05it/s]  
100%|██████████| 39412/39412 [10:27<00:00, 62.77it/s] 
100%|██████████| 29463/29463 [07:21<00:00, 66.66it/s] 
100%|██████████| 4005/4005 [00:57<00:00, 70.14it/s]  
100%|██████████| 63571/63571 [18:59<00:00, 55.78it/s]
100%|██████████| 9508/9508 [01:52<00:00, 84.30it/s]  
100%|██████████| 6751/6751 [01:33<00:00, 72.48it/s]  
100%|██████████| 12345/12345 [03:09<00:00, 65.31it/s] 
100%|██████████| 23008/23008 [05:53<00:00, 65.11it/s] 
100%|██████████| 2533/2533 [00:51<00:00, 49.41it/s]  
100%|██████████| 1232/1232 [00:00<00:00, 4596.51it/s]
100%|██████████| 6545/6545 [02:04<00:00, 52.38it/s]  
100%|██████████| 20793/20793 [05:03<00:00, 68.44it/s] 
