## 06-add-topics-to-inca

**Purpose**: Add topic assignments from multiple topic models to the Elasticsearch (ES) documents that contain the right-wing media outlets' articles.

**Steps**:
1. Create a list of dictionaries where each dict corresponds to a `doc_id`.
2. Each dict contains the `doc_id` and multiple topic model-related keys.

    - `lda_tfidf_texts_10_top_topic`
    - `lda_tfidf_texts_10_top_topic_pct`
    - `lda_tfidf_texts_10_topic_tokens`
    - `lda_tfidf_texts_10_doc_tokens`
    
    - `lda_tfidf_texts_25_top_topic`
    - `lda_tfidf_texts_25_top_topic_pct`
    - `lda_tfidf_texts_25_topic_tokens`
    - `lda_tfidf_texts_25_doc_tokens`
    
    - `lda_tfidf_texts_40_top_topic`
    - `lda_tfidf_texts_40_top_topic_pct`
    - `lda_tfidf_texts_40_topic_tokens`
    - `lda_tfidf_texts_40_doc_tokens`

3. INCA has already been modified so it can update a document based on a `doc_id` and add multiple new fields in one go (per [05-softcosine-clusters-inca.ipynb](https://github.com/wlmwng/us-right-media/blob/develop/usrightmedia/code/07-newsevents/)).

In [1]:
import os
import pandas as pd
import copy

# wildcard import; presumably the source of MODELS_DIR used in load() below
from usrightmedia.shared.topics_utils import *

In [2]:
from usrightmedia.shared.loggers import get_logger
LOGGER = get_logger(filename = '06-add-topics-to-inca', logger_type='main')

In [3]:
from inca import Inca
myinca = Inca()



### 1.0 Load `doc_id`s and topic assignments

In [4]:
def load(model):
    """Load a model's top-topic assignments and prefix every column except doc_id."""
    df = pd.read_pickle(os.path.join(MODELS_DIR, model, f"{model}_top_topic_with_ids.pkl"))
    # https://stackoverflow.com/a/70311963: add prefix to all columns except doc_id
    # model.split("_")[-1] is the number of topics, e.g. "10"
    df = df.set_index('doc_id').add_prefix(f'lda_tfidf_texts_{model.split("_")[-1]}_').reset_index()
    return df
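
The prefixing step in `load()` can be tried on a toy dataframe. This is only a sketch: the outlet names, column names, and values below are made up for illustration and are not taken from the pickled model outputs.

```python
import pandas as pd

# Toy stand-in for one model's "*_top_topic_with_ids.pkl" contents (values are made up).
toy = pd.DataFrame(
    {
        "doc_id": ["OutletA_1", "OutletB_2"],
        "top_topic": [0.0, 1.0],
        "top_topic_pct": [91.1, 69.8],
    }
)

model = "lda_corpus_tfidf_docs_texts_topics_10"
n_topics = model.split("_")[-1]  # last underscore-separated piece of the model name

# Moving doc_id into the index shields it from add_prefix, which only renames columns.
prefixed = (
    toy.set_index("doc_id")
    .add_prefix(f"lda_tfidf_texts_{n_topics}_")
    .reset_index()
)

print(list(prefixed.columns))
```

Because `reset_index()` puts the index back as the first column, `doc_id` keeps its original name while every other column gains the model-specific prefix.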

In [5]:
%%time
df10 = load('lda_corpus_tfidf_docs_texts_topics_10')
df25 = load('lda_corpus_tfidf_docs_texts_topics_25')
df40 = load('lda_corpus_tfidf_docs_texts_topics_40')

CPU times: user 53.1 s, sys: 7.52 s, total: 1min
Wall time: 1min


In [6]:
df10

| | doc_id | lda_tfidf_texts_10_top_topic | lda_tfidf_texts_10_top_topic_pct | lda_tfidf_texts_10_topic_tokens | lda_tfidf_texts_10_doc_tokens |
|---|---|---|---|---|---|
| 0 | AmericanRenaissance_1128638341 | 0.0 | 91.110001 | police, people, man, white, officer, black, at... | [federal, government, study, reparation, desce... |
| 1 | Breitbart_621129461 | 0.0 | 88.419998 | police, people, man, white, officer, black, at... | [portrait, break, county, headquarters, vandal... |
| 2 | Breitbart_1483020896 | 1.0 | 69.800003 | coronavirus, virus, health, game, pandemic, va... | [death, sharply, american, community, multinat... |
| 3 | Breitbart_1483567174 | 1.0 | 58.980000 | coronavirus, virus, health, game, pandemic, va... | [ride, giant, courier, service, lawsuit, brake... |
| 4 | AmericanRenaissance_1812166693 | 2.0 | 63.770000 | investigation, russian, election, email, campa... | [dark, money, network, life, nearly, taxpayer,... |
| ... | ... | ... | ... | ... | ... |
| 727660 | WashingtonExaminer_999923116 | 0.0 | 56.650002 | police, people, man, white, officer, black, at... | [host, border, official, appalling, federal, l... |
| 727661 | WashingtonExaminer_999923435 | 0.0 | 94.099998 | police, people, man, white, officer, black, at... | [evening, struggle, civil, right, history, tim... |
| 727662 | WashingtonExaminer_999951831 | 2.0 | 56.059998 | investigation, russian, election, email, campa... | [cohost, recently, commander, chief, pornograp... |
| 727663 | WashingtonExaminer_999952161 | 2.0 | 90.060005 | investigation, russian, election, email, campa... | [federal, judge, person, special, counsel, inv... |


#### 1.1 Convert dataframes to dictionaries

In [7]:
%%time
d10 = df10.to_dict("records")
d25 = df25.to_dict("records")
d40 = df40.to_dict("records")

CPU times: user 23.4 s, sys: 199 ms, total: 23.6 s
Wall time: 23.5 s
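
As a small illustration of what `to_dict("records")` produces, the sketch below converts a two-row toy dataframe (made-up values, not the real data) into a list with one dict per row, which is the shape the later steps rely on.

```python
import pandas as pd

# Toy dataframe with the same layout as the prefixed model outputs (values are made up).
toy = pd.DataFrame(
    {
        "doc_id": ["OutletA_1", "OutletB_2"],
        "lda_tfidf_texts_10_top_topic": [0.0, 1.0],
    }
)

# "records" orientation: one dict per row, keyed by column name.
records = toy.to_dict("records")

for record in records:
    print(record)
```

Each row becomes its own Python dict, so on a dataframe with hundreds of thousands of rows the conversion takes noticeable time and memory, as the timings above suggest.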


In [8]:
d10[-5:-3]

[{'doc_id': 'WashingtonExaminer_999923116',
  'lda_tfidf_texts_10_top_topic': 0.0,
  'lda_tfidf_texts_10_top_topic_pct': 56.650001525878906,
  'lda_tfidf_texts_10_topic_tokens': 'police, people, man, white, officer, black, attack, military, country, gun',
  'lda_tfidf_texts_10_doc_tokens': ['host',
   'border',
   'official',
   'appalling',
   'federal',
   'law',
   'enforcement',
   'officer',
   'life',
   'line',
   'american',
   'people',
   'safe',
   'nazi',
   'spokesman',
   'type',
   'inflammatory',
   'unacceptable',
   'rhetoric',
   'target',
   'back',
   'great',
   'law',
   'enforcement',
   'episode',
   'border',
   'agent',
   'illegal',
   'immigrant',
   'parent',
   'child',
   'family',
   'press',
   'secretary',
   'child',
   'mother',
   'arm',
   'breast',
   'child',
   'away',
   'shower',
   'away',
   'shower',
   'people',
   'shower',
   'trick',
   'room',
   'well',
   'shower',
   'right',
   'spokesman',
   'comment',
   'inappropriate',
   'ap

### 2.0 Prepare inputs for INCA

- Similar steps to [05-softcosine-clusters-inca.ipynb](https://github.com/wlmwng/us-right-media/blob/5156cf1590341828ca00adceeb9b8beebc612fbe/usrightmedia/code/07-newsevents/05-softcosine-clusters-inca.ipynb)

In [9]:
def prep(docs):
    """Prepare docs' input format for INCA.
    Renames 'doc_id' to '_id' so INCA will know which ES document to update.
    
    Args: 
        docs (list of dicts): modified in place    
    
    Returns:
        None
    """
    
    # Move 'doc_id' to '_id' so it isn't added as a redundant field on the ES document. Note this modifies the original object.
    for doc in docs:
        doc["_id"] = doc.pop("doc_id")
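
`prep()` changes the dicts it is given. If the original records need to stay intact, for example to re-run a cell, a non-mutating variant could look like the sketch below. `prep_copy` is a hypothetical helper and is not part of the notebook or of INCA.

```python
def prep_copy(docs):
    """Hypothetical non-mutating variant of prep(): returns new dicts keyed by '_id'."""
    formatted = []
    for doc in docs:
        # Shallow copy of every field except doc_id; nested lists are shared, not duplicated.
        new_doc = {key: value for key, value in doc.items() if key != "doc_id"}
        new_doc["_id"] = doc["doc_id"]
        formatted.append(new_doc)
    return formatted


# Toy record with made-up values.
original = [{"doc_id": "OutletA_1", "lda_tfidf_texts_10_top_topic": 0.0}]
formatted = prep_copy(original)

print(original)   # still contains 'doc_id'
print(formatted)  # contains '_id' instead
```

The trade-off is memory: a second list of dicts is held alongside the first, which matters at the scale of this corpus.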

In [10]:
%%time
prep(d10)
prep(d25)
prep(d40)

CPU times: user 519 ms, sys: 61.2 ms, total: 580 ms
Wall time: 578 ms


### 3.0 Update Elasticsearch database through INCA

In [11]:
def update(docs):
    # each dict in docs will no longer have an "_id" as it's popped off in INCA
    # https://github.com/wlmwng/inca/blob/1c1ed382b2db34682234af0298cf4c19ecbc7074/inca/core/database.py#L209
    myinca.database.update_documents(docs, batchsize=2000)
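
The `batchsize=2000` argument suggests the documents are sent in chunks. The sketch below shows generic chunking and the resulting batch count. It is not INCA's implementation (see the linked `database.py` for that), and the document count is an assumption inferred from the last index shown in the `df10` output.

```python
import math


def chunks(items, batchsize):
    """Yield consecutive slices of at most `batchsize` items (generic sketch, not INCA code)."""
    for start in range(0, len(items), batchsize):
        yield items[start:start + batchsize]


n_docs = 727_664  # assumption: inferred from the last index (727663) in the df10 output
batchsize = 2000
n_batches = math.ceil(n_docs / batchsize)
print(n_batches)

# Tiny demonstration of the chunking itself.
print([len(batch) for batch in chunks(list(range(5)), 2)])
```

Under that assumption the batch count matches the 364 iterations shown in the progress bars of the update cells.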

- check Kibana after running `update()` on `AmericanRenaissance_1128638341`

```
GET /inca_alias/_search
{
    "_source": {
        "excludes": [ "META" ]
    },
    "query": {
        "bool": {
            "filter": [
                {"term": {"_id": "AmericanRenaissance_1128638341"}}
            ]
        }
    }
}
```
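
For spot-checking other documents, the same query body can be built in Python and pasted into Kibana. `build_id_query` is a hypothetical helper introduced here for illustration. It only constructs the request body and does not contact Elasticsearch.

```python
import json


def build_id_query(doc_id):
    """Hypothetical helper: build the search body used above for a given document id."""
    return {
        "_source": {"excludes": ["META"]},
        "query": {"bool": {"filter": [{"term": {"_id": doc_id}}]}},
    }


body = build_id_query("AmericanRenaissance_1128638341")
print(json.dumps(body, indent=4))
```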


In [12]:
%%time
update(d10)

100%|██████████| 364/364 [25:27<00:00,  4.20s/it]


CPU times: user 57.6 s, sys: 960 ms, total: 58.6 s
Wall time: 25min 35s


In [13]:
%%time
update(d25)

100%|██████████| 364/364 [25:25<00:00,  4.19s/it]


CPU times: user 50.5 s, sys: 964 ms, total: 51.4 s
Wall time: 25min 25s


In [14]:
%%time
update(d40)

100%|██████████| 364/364 [26:16<00:00,  4.33s/it]


CPU times: user 50.8 s, sys: 925 ms, total: 51.7 s
Wall time: 26min 17s


In [16]:
d40[11:12]

[{'lda_tfidf_texts_40_top_topic': 12.0,
  'lda_tfidf_texts_40_top_topic_pct': 50.019996643066406,
  'lda_tfidf_texts_40_topic_tokens': 'police, officer, shooting, protester, protest, city, police_officer, gun, man, suspect',
  'lda_tfidf_texts_40_doc_tokens': ['surge',
   'left',
   'wing',
   'extremist',
   'attack',
   'police',
   'presence',
   'city',
   'emergency',
   'service',
   'potential',
   'attack',
   'announcement',
   'left',
   'wing',
   'extremist',
   'arson',
   'attack',
   'police',
   'vehicle',
   'fire',
   'vehicle',
   'city',
   'police',
   'police',
   'directorate',
   'result',
   'german',
   'tabloid',
   'spokesman',
   'power',
   'distribution',
   'box',
   'fire',
   'different',
   'location',
   'fire',
   'radio',
   'mast',
   'permanent',
   'damage',
   'structure',
   'viciously',
   'home',
   'police',
   'special',
   'unit',
   'left',
   'wing',
   'extremism',
   'investigation',
   'incident',
   'note',
   'online',
   'far',
  