# Process text

**Purpose**: pre-process article text in preparation for NLP tasks.

0. clean text with regex (Fox News only)
1. remove HTML tags
2. remove punctuation
3. remove stopwords
4. clean whitespace
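
To make the order of steps 1 to 4 concrete, here is a minimal standard-library sketch applied to a single string. It is an illustration only: the function bodies, the tiny `STOPWORDS` set, and the sample sentence are assumptions, and INCA's own processors may behave differently.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list; a real list is much longer.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def remove_html_tags(text: str) -> str:
    """Step 1: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def remove_punctuation(text: str) -> str:
    """Step 2: delete ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_stopwords(text: str) -> str:
    """Step 3: drop tokens found in STOPWORDS (case-insensitive).

    Splitting on a single space keeps empty tokens, so extra whitespace
    survives this step and is left for step 4 to clean up.
    """
    return " ".join(t for t in text.split(" ") if t.lower() not in STOPWORDS)


def clean_whitespace(text: str) -> str:
    """Step 4: collapse whitespace runs and trim both ends."""
    return " ".join(text.split())


raw = "<p>The  senator   spoke to the press, in <b>Ohio</b>.</p>"
text = raw
for step in (remove_html_tags, remove_punctuation, remove_stopwords, clean_whitespace):
    text = step(text)

print(text)  # senator spoke press Ohio
```

Whitespace cleaning runs last because each earlier step can leave gaps where tags, punctuation, or stopwords used to be.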

In [1]:
# matplotlib still emits log messages even though disable_existing_loggers=yes in logging_config.yaml
# (see https://stackoverflow.com/a/51529172/7016397).
# Workaround: set matplotlib's log level manually before creating the project logger.
import logging
logging.getLogger('matplotlib').setLevel(logging.WARNING)

from usrightmedia.shared.loggers import get_logger
LOGGER = get_logger(filename='01-text-processing', logger_type='main')

- Pre-processing uses the Bulk API through the Python Elasticsearch client.
- Elasticsearch guidance on choosing a batch size, from ["How Big Is Too Big?"](https://www.elastic.co/guide/en/elasticsearch/guide/current/bulk.html#_how_big_is_too_big):

>The entire bulk request needs to be loaded into memory by the node that receives our request, so the bigger the request, the less memory available for other requests. There is an optimal size of bulk request. Above that size, performance no longer improves and may even drop off. The optimal size, however, is not a fixed number. It depends entirely on your hardware, your document size and complexity, and your indexing and search load.
>
>Fortunately, it is easy to find this sweet spot: Try indexing typical documents in batches of increasing size. When performance starts to drop off, your batch size is too big. A good place to start is with batches of 1,000 to 5,000 documents or, if your documents are very large, with even smaller batches.
>
>It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start playing with is around 5-15MB in size.
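
One way to act on the guidance quoted above is to check both the document count and the approximate payload size of a batch before settling on a `bulksize`. The sketch below is not part of INCA or the Elasticsearch client: `batched` and `approx_bytes` are hypothetical helpers, and the documents are made up (roughly 1 KB each).

```python
import json
from itertools import islice


def batched(docs, size):
    """Yield lists of at most `size` documents from any iterable."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def approx_bytes(batch):
    """Rough payload size: UTF-8 length of each document serialized as JSON."""
    return sum(len(json.dumps(doc).encode("utf-8")) for doc in batch)


# Made-up documents standing in for real articles.
docs = [{"_id": i, "article_maintext": "x" * 1000} for i in range(4500)]

batches = list(batched(docs, 2000))
print([len(b) for b in batches])  # [2000, 2000, 500]
print(round(approx_bytes(batches[0]) / 1e6, 2), "MB in the first batch")
```

If the measured size of a typical batch is far above the 5-15MB range mentioned in the quote, a smaller batch size is probably worth trying.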

In [2]:
bulksize = 2000

## Import INCA and check doctypes

In [3]:
from inca import Inca
myinca = Inca()
myinca.database.list_doctypes()



{'tweets2': 889739,
 'tweets2_url': 285447,
 'foxnews': 63571,
 'breitbart': 39412,
 'dailycaller': 29463,
 'oneamericanews': 23008,
 'washingtonexaminer': 20793,
 'newsmax': 12345,
 'gatewaypundit': 9508,
 'infowars': 6751,
 'vdare': 6545,
 'dailystormer': 4005,
 'rushlimbaugh': 2533,
 'americanrenaissance': 2284,
 'seanhannity': 1232}

In [4]:
outlet_doctypes = [
    "americanrenaissance",
    "breitbart",
    "dailycaller",
    "dailystormer",
    "foxnews",
    "gatewaypundit",
    "infowars",
    "newsmax",
    "oneamericanews",
    "rushlimbaugh",
    "seanhannity",
    "vdare",
    "washingtonexaminer",
]

## Cleaning steps

#### 0. regex (Fox News only)

- Rule 1: `"\\n[A-Z0-9 :,\\'!@\$\(\)\-\.\?\:\;\/]+(?:\\n|$)"`

    ```
    Removes substrings which begin with a line break ("\n") and end with a line break or end of string ("$"), where the substring contains only uppercase letters, digits, spaces, or common punctuation:
    - generic promo links
    - links to other news content
    - subheadings within article

    Examples:
    - "\nCLICK HERE TO GET THE FOX NEWS APP\n" (unrelated to the article)
    - "\nTRUMP SAYS HE WILL LEAVE OFFICE IF ELECTORAL COLLEGE VOTES FOR BIDEN\n" (unrelated to the article)
    - "\nCLICK HERE FOR MORE SPORTS COVERAGE ON FOXNEWS.COM\n" (unrelated to the article)
    - "\nSTOCK UP\n" (article subheading)
    - "\nWHAT NEEDS HELP\n" (article subheading)
    
    ```


- Rule 2: `"Get all the latest news on coronavirus and more delivered daily to your inbox\. Sign up here"`

    ```
    Removes generic email signup
    ```  
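
The two rules can be tried out on sample strings with the standard `re` module. This sketch assumes the rules are applied in order as plain regex substitutions; how INCA's `multireplace` applies them internally is not shown here, and `apply_rules` plus the sample texts are made up for illustration.

```python
import re

# The same patterns as Rule 1 and Rule 2, written as raw strings.
RULE_1 = r"\n[A-Z0-9 :,\'!@\$\(\)\-\.\?\:\;\/]+(?:\n|$)"
RULE_2 = r"Get all the latest news on coronavirus and more delivered daily to your inbox\. Sign up here"


def apply_rules(text: str) -> str:
    """Apply both rules in order, replacing each match with an empty string."""
    for pattern in (RULE_1, RULE_2):
        text = re.sub(pattern, "", text)
    return text


promo = "First paragraph.\nCLICK HERE TO GET THE FOX NEWS APP\nSecond paragraph."
print(repr(apply_rules(promo)))
# 'First paragraph.Second paragraph.'

stacked = "Intro.\nSTOCK UP\nWHAT NEEDS HELP\nBody."
print(repr(apply_rules(stacked)))
# 'Intro.WHAT NEEDS HELP\nBody.'
```

Two properties of Rule 1 show up here. A match consumes both surrounding line breaks, so the neighbouring paragraphs are joined without a separator. And when two all-caps lines are directly adjacent, the second one survives, because the line break it would need at its start was already consumed by the first match.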

In [5]:
try:
    # Raw strings keep regex escapes such as \$ and \. from being read as invalid Python string escapes.
    rules_fox = [
        {"regexp": r"\n[A-Z0-9 :,\'!@\$\(\)\-\.\?\:\;\/]+(?:\n|$)", "replace_with": ""},
        {
            "regexp": r"Get all the latest news on coronavirus and more delivered daily to your inbox\. Sign up here",
            "replace_with": "",
        },
    ]

    # generator
    docs_regexp = myinca.processing.multireplace(
        docs_or_query="foxnews",
        field="article_maintext",
        rules=rules_fox,
        save=True,
        new_key="article_maintext_0",
        action="batch",    
        bulksize=bulksize,
    )
    for doc in docs_regexp:
        # runs process on doc
        pass

except Exception as e:
    LOGGER.warning(e)


100%|██████████| 63571/63571 [02:38<00:00, 401.35it/s]
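
The empty `for ... pass` loop matters because the processor returns a generator: no document is processed until the generator is consumed. The toy example below illustrates this with a stand-in function (`process_all` is hypothetical, not an INCA call) and shows `collections.deque(..., maxlen=0)` as an alternative way to exhaust a generator.

```python
from collections import deque

processed = []


def process_all(docs):
    """Stand-in for a lazy processor: work happens only as items are requested."""
    for doc in docs:
        processed.append(doc)  # side effect, comparable to saving a new key
        yield doc


gen = process_all(["a", "b", "c"])
print(processed)  # [] because nothing has run yet

deque(gen, maxlen=0)  # consume the generator without keeping the results
print(processed)  # ['a', 'b', 'c']
```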


#### 1. remove HTML tags
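- Fox News and the other outlets are run separately because their input keys differ: Fox News reads `article_maintext_0` (the output of step 0), while the other outlets read `article_maintext`.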
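
The key chain used across the cleaning steps can be summarized in a small helper. `input_field` is a hypothetical function written for this note, not part of INCA; it only encodes which field each step reads from in this notebook.

```python
def input_field(doctype: str, step: int) -> str:
    """Return the field that cleaning step `step` (0 to 4) reads from."""
    if step == 0:
        # Step 0 (regex) is only run for Fox News and reads the raw text.
        return "article_maintext"
    if step == 1:
        # Only Fox News has a step-0 output; other outlets start from the raw text.
        return "article_maintext_0" if doctype == "foxnews" else "article_maintext"
    # Steps 2 to 4 read the previous step's output for every outlet.
    return f"article_maintext_{step - 1}"


print(input_field("foxnews", 1))    # article_maintext_0
print(input_field("breitbart", 1))  # article_maintext
print(input_field("vdare", 3))      # article_maintext_2
```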

In [6]:
try:
    # generator
    docs_rmv_html = myinca.processing.remove_html_tags(
        docs_or_query="foxnews",
        field="article_maintext_0",
        save=True,
        new_key="article_maintext_1",
        action="batch",    
        bulksize=bulksize,
    )
    for doc in docs_rmv_html:
        # runs process on doc
        pass
except Exception as e:
    LOGGER.warning(e)


100%|██████████| 63571/63571 [02:45<00:00, 383.84it/s]


In [7]:
remaining_doctypes = [doctype for doctype in outlet_doctypes if doctype != "foxnews"]
remaining_doctypes

['americanrenaissance',
 'breitbart',
 'dailycaller',
 'dailystormer',
 'gatewaypundit',
 'infowars',
 'newsmax',
 'oneamericanews',
 'rushlimbaugh',
 'seanhannity',
 'vdare',
 'washingtonexaminer']

In [8]:
for doctype in remaining_doctypes:
    try:
        docs_rmv_html = myinca.processing.remove_html_tags(
            docs_or_query=doctype,
            field="article_maintext",
            save=True,
            new_key="article_maintext_1",
            action="batch",
            bulksize=bulksize,
        )
        for doc in docs_rmv_html:
            # runs process on doc
            pass
    except Exception as e:
        LOGGER.warning(e)


100%|██████████| 2284/2284 [00:00<00:00, 3157.24it/s]
100%|██████████| 39412/39412 [01:34<00:00, 417.28it/s]
100%|██████████| 29463/29463 [01:01<00:00, 477.43it/s]
100%|██████████| 4005/4005 [00:06<00:00, 594.76it/s] 
100%|██████████| 9508/9508 [00:21<00:00, 439.90it/s] 
100%|██████████| 6751/6751 [00:15<00:00, 429.17it/s] 
100%|██████████| 12345/12345 [00:27<00:00, 446.75it/s]
100%|██████████| 23008/23008 [00:48<00:00, 475.35it/s]
100%|██████████| 2533/2533 [00:01<00:00, 2270.90it/s]
100%|██████████| 1232/1232 [00:00<00:00, 4375.05it/s]
100%|██████████| 6545/6545 [00:21<00:00, 306.06it/s] 
100%|██████████| 20793/20793 [00:44<00:00, 470.39it/s]


#### 2. remove punctuation

In [9]:
for doctype in outlet_doctypes:
    try:
        docs_rmv_punc = myinca.processing.remove_punctuation(
            docs_or_query=doctype,
            field="article_maintext_1",
            save=True,
            new_key="article_maintext_2",
            action="batch",
            bulksize=bulksize,
        )
        for doc in docs_rmv_punc:
            # runs process on doc
            pass
    except Exception as e:
        LOGGER.warning(e)


100%|██████████| 2284/2284 [00:00<00:00, 3280.38it/s]
100%|██████████| 39412/39412 [01:37<00:00, 403.72it/s]
100%|██████████| 29463/29463 [01:06<00:00, 446.02it/s]
100%|██████████| 4005/4005 [00:06<00:00, 601.55it/s] 
100%|██████████| 63571/63571 [02:44<00:00, 385.93it/s]
100%|██████████| 9508/9508 [00:20<00:00, 471.61it/s] 
100%|██████████| 6751/6751 [00:14<00:00, 464.04it/s] 
100%|██████████| 12345/12345 [00:27<00:00, 454.07it/s]
100%|██████████| 23008/23008 [00:47<00:00, 485.57it/s]
100%|██████████| 2533/2533 [00:01<00:00, 2275.16it/s]
100%|██████████| 1232/1232 [00:00<00:00, 4667.93it/s]
100%|██████████| 6545/6545 [00:19<00:00, 337.99it/s] 
100%|██████████| 20793/20793 [00:44<00:00, 464.00it/s]


#### 3. remove stopwords

In [10]:
for doctype in outlet_doctypes:
    try:
        docs_rmv_stopwords = myinca.processing.remove_stopwords(
            stopwords="english",
            docs_or_query=doctype,
            field="article_maintext_2",
            save=True,
            new_key="article_maintext_3",
            action="batch",
            bulksize=bulksize,
        )
        for doc in docs_rmv_stopwords:
            # runs process on doc
            pass
    except Exception as e:
        LOGGER.warning(e)


100%|██████████| 2284/2284 [00:00<00:00, 2417.96it/s]
100%|██████████| 39412/39412 [11:26<00:00, 57.44it/s] 
100%|██████████| 29463/29463 [07:27<00:00, 65.78it/s] 
100%|██████████| 4005/4005 [00:46<00:00, 85.88it/s]  
100%|██████████| 63571/63571 [19:11<00:00, 55.20it/s] 
100%|██████████| 9508/9508 [02:22<00:00, 66.78it/s]  
100%|██████████| 6751/6751 [01:38<00:00, 68.66it/s]  
100%|██████████| 12345/12345 [03:16<00:00, 62.68it/s] 
100%|██████████| 23008/23008 [05:43<00:00, 67.04it/s] 
100%|██████████| 2533/2533 [00:00<00:00, 2707.17it/s]
100%|██████████| 1232/1232 [00:00<00:00, 4549.78it/s]
100%|██████████| 6545/6545 [02:05<00:00, 52.32it/s]  
100%|██████████| 20793/20793 [04:58<00:00, 69.70it/s] 


#### 4. clean whitespace

In [11]:
for doctype in outlet_doctypes:
    try:
        docs_clean_whitespace = myinca.processing.clean_whitespace(
            docs_or_query=doctype,
            field="article_maintext_3",
            save=True,
            new_key="article_maintext_4",
            action="batch",
            bulksize=bulksize,
        )
        for doc in docs_clean_whitespace:
            # runs process on doc
            pass
    except Exception as e:
        LOGGER.warning(e)


100%|██████████| 2284/2284 [00:00<00:00, 2947.98it/s]
100%|██████████| 39412/39412 [11:53<00:00, 55.27it/s] 
100%|██████████| 29463/29463 [07:37<00:00, 64.43it/s] 
100%|██████████| 4005/4005 [00:47<00:00, 84.91it/s]  
100%|██████████| 63571/63571 [19:35<00:00, 54.09it/s] 
100%|██████████| 9508/9508 [02:25<00:00, 65.46it/s]  
100%|██████████| 6751/6751 [01:40<00:00, 67.10it/s]  
100%|██████████| 12345/12345 [03:26<00:00, 59.91it/s] 
100%|██████████| 23008/23008 [05:59<00:00, 63.98it/s] 
100%|██████████| 2533/2533 [00:01<00:00, 2516.02it/s]
100%|██████████| 1232/1232 [00:00<00:00, 4104.96it/s]
100%|██████████| 6545/6545 [02:09<00:00, 50.50it/s]  
100%|██████████| 20793/20793 [05:08<00:00, 67.45it/s] 
