## Topic model inputs with associated `doc_id`
- This step was run after the topic models were trained, so that topic assignments can be mapped back to their associated `doc_id`s.


### 1. Pre-process docs again
- *Note to self*: next time, don't forget to generate the gensim dictionary and corpus along with the associated IDs...

In [1]:
import os
import pickle  # used explicitly when saving the merged dataframe at the end
import pandas as pd

In [2]:
from usrightmedia.shared.topics_utils import *

In [3]:
df_data_inputs = pd.read_pickle(os.path.join(INPUTS_DIR, 'df_data_inputs.pkl'))

In [4]:
df_data_inputs

Unnamed: 0,doc_id,doctype,title,lead,article_maintext
0,AmericanRenaissance_1128638341,americanrenaissance,Congresswoman Hopes Reparations Bill is Path t...,Congresswoman Hopes Reparations Bill is Path t...,"Nicholas Ballasy, PJ Media, December 30, 2018\..."
1,Breitbart_621129461,breitbart,Portrait of Ronald Reagan Defaced During Break...,Portrait of Ronald Reagan Defaced During Break...,Someone vandalized a portrait of former Presid...
2,Breitbart_1483020896,breitbart,Study: Opioid Deaths Rise in Towns Where U.S. ...,Study: Opioid Deaths Rise in Towns Where U.S. ...,Opioid deaths sharply rise in American communi...
3,Breitbart_1483567174,breitbart,"Uber, Postmates Sue California to Stop Gig Wor...","Uber, Postmates Sue California to Stop Gig Wor...",Ride-sharing giant Uber and courier service Po...
4,AmericanRenaissance_1812166693,americanrenaissance,Dark Money Behemoth That Hosts BLM Foundation ...,Dark Money Behemoth That Hosts BLM Foundation ...,"Joe Schoffstall, Washington Free Beacon, Decem..."
...,...,...,...,...,...
727743,WashingtonExaminer_999923116,washingtonexaminer,"White House, DHS rip Joe Scarborough for compa...","White House, DHS rip Joe Scarborough for compa...",The White House and Department of Homeland Sec...
727744,WashingtonExaminer_999923435,washingtonexaminer,The 50 years since MLK's assassination,The 50 years since MLK's assassinationFifty ye...,"Fifty years ago this evening, the Rev. Dr. Mar..."
727745,WashingtonExaminer_999951831,washingtonexaminer,Mika Brzezinski says Trump is upset he can't w...,Mika Brzezinski says Trump is upset he can't w...,“Morning Joe” cohost Mika Brzezinski said some...
727746,WashingtonExaminer_999952161,washingtonexaminer,First person sentenced in Robert Mueller's Rus...,First person sentenced in Robert Mueller's Rus...,A federal judge on Tuesday sentenced the first...


In [5]:
%%time
# CPU times: user 16min 48s, sys: 635 ms, total: 16min 49s
# Wall time: 16min 49s
df_titles = preprocess_docs_with_doc_ids(df_data_inputs['doc_id'], df_data_inputs['title'], 'titles', INPUTS_DIR)

In [6]:
%%time
# CPU times: user 2h 16min 6s, sys: 23.5 s, total: 2h 16min 30s
# Wall time: 2h 16min 30s
df_leads = preprocess_docs_with_doc_ids(df_data_inputs['doc_id'], df_data_inputs['lead'], 'leads', INPUTS_DIR)

In [7]:
%%time
# CPU times: user 10h 47min 39s, sys: 1h 47min 22s, total: 12h 35min 2s
# Wall time: 12h 35min 20s
df_texts = preprocess_docs_with_doc_ids(df_data_inputs['doc_id'], df_data_inputs['article_maintext'], 'texts', INPUTS_DIR)
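
The helper `preprocess_docs_with_doc_ids` comes from `usrightmedia.shared.topics_utils`, and its implementation is not shown in this notebook. As a rough sketch of the general idea only, the snippet below pairs each processed doc with its `doc_id`. The toy tokenizer, the function names, and the choice to drop docs that end up with no tokens are all assumptions for illustration, not the package's actual behavior.

```python
import re

import pandas as pd


def toy_preprocess(text):
    """Hypothetical stand-in for the real pre-processing step:
    keep lowercase alphabetic tokens of four or more letters."""
    return re.findall(r"[a-z]{4,}", str(text).lower())


def preprocess_with_ids(doc_ids, docs):
    """Return a dataframe that pairs each doc_id with its processed tokens."""
    df = pd.DataFrame({
        "doc_id": list(doc_ids),
        "processed_doc": [toy_preprocess(d) for d in docs],
    })
    # Assumption: docs left with no tokens are dropped, so the output
    # can have fewer rows than the input.
    return df[df["processed_doc"].map(len) > 0].reset_index(drop=True)


sample = pd.DataFrame({
    "doc_id": ["A_1", "B_2", "C_3"],
    "title": ["Dark Money Network", "Uber Sues Over Law", "50 !!"],
})
df_sample_titles = preprocess_with_ids(sample["doc_id"], sample["title"])
```

Because the IDs travel with the tokens, a doc that is filtered out simply disappears from the result instead of shifting the positions of the docs after it.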

### 2. Merge processed versions into `df_data_inputs`

#### 2.1 load files

In [8]:
df_titles = pd.read_pickle(os.path.join(INPUTS_DIR, "docs", "docs_titles_with_inca_ids.pkl"))
df_leads = pd.read_pickle(os.path.join(INPUTS_DIR, "docs", "docs_leads_with_inca_ids.pkl"))
df_texts = pd.read_pickle(os.path.join(INPUTS_DIR, "docs", "docs_texts_with_inca_ids.pkl"))

In [9]:
# not necessary: df_data_inputs was already loaded at the top of the notebook
# df_data_inputs = pd.read_pickle(os.path.join(INPUTS_DIR, 'df_data_inputs.pkl'))

#### 2.2 rename columns to prep for merge

In [10]:
df_titles = df_titles.rename(columns={"processed_doc": "processed_title"})
df_leads = df_leads.rename(columns={"processed_doc": "processed_lead"})
df_texts = df_texts.rename(columns={"processed_doc": "processed_text"})

#### 2.3 preview dataframes which will be merged

In [11]:
df_titles

Unnamed: 0,doc_id,processed_title
0,AmericanRenaissance_1128638341,[damage]
1,Breitbart_621129461,[portrait]
2,Breitbart_1483020896,"[study, town, plant]"
3,Breitbart_1483567174,"[uber, law]"
4,AmericanRenaissance_1812166693,"[dark, money]"
...,...,...
625390,WashingtonExaminer_999923116,"[border, official, nazi]"
625391,WashingtonExaminer_999923435,[assassination]
625392,WashingtonExaminer_999951831,"[upset, porn]"
625393,WashingtonExaminer_999952161,"[person, investigation, prison]"


In [12]:
df_leads

Unnamed: 0,doc_id,processed_lead
0,AmericanRenaissance_1128638341,"[damage, federal, government, study, reparatio..."
1,Breitbart_621129461,"[portrait, portrait, break, county, headquarte..."
2,Breitbart_1483020896,"[study, town, plant, death, sharply, american,..."
3,Breitbart_1483567174,"[uber, giant, courier, service, lawsuit, brake..."
4,AmericanRenaissance_1812166693,"[dark, money, dark, money, network, life, near..."
...,...,...
727705,WashingtonExaminer_999923116,"[border, official, host, border, official, app..."
727706,WashingtonExaminer_999923435,"[evening, struggle, civil, right, history, tim..."
727707,WashingtonExaminer_999951831,"[upset, porn, cohost, recently, commander, chi..."
727708,WashingtonExaminer_999952161,"[person, investigation, federal, judge, person..."


In [13]:
df_texts

Unnamed: 0,doc_id,processed_text
0,AmericanRenaissance_1128638341,"[federal, government, study, reparation, desce..."
1,Breitbart_621129461,"[portrait, break, county, headquarters, vandal..."
2,Breitbart_1483020896,"[death, sharply, american, community, multinat..."
3,Breitbart_1483567174,"[ride, giant, courier, service, lawsuit, brake..."
4,AmericanRenaissance_1812166693,"[dark, money, network, life, nearly, taxpayer,..."
...,...,...
727660,WashingtonExaminer_999923116,"[host, border, official, appalling, federal, l..."
727661,WashingtonExaminer_999923435,"[evening, struggle, civil, right, history, tim..."
727662,WashingtonExaminer_999951831,"[cohost, recently, commander, chief, pornograp..."
727663,WashingtonExaminer_999952161,"[federal, judge, person, special, counsel, inv..."


In [14]:
df_data_inputs

Unnamed: 0,doc_id,doctype,title,lead,article_maintext
0,AmericanRenaissance_1128638341,americanrenaissance,Congresswoman Hopes Reparations Bill is Path t...,Congresswoman Hopes Reparations Bill is Path t...,"Nicholas Ballasy, PJ Media, December 30, 2018\..."
1,Breitbart_621129461,breitbart,Portrait of Ronald Reagan Defaced During Break...,Portrait of Ronald Reagan Defaced During Break...,Someone vandalized a portrait of former Presid...
2,Breitbart_1483020896,breitbart,Study: Opioid Deaths Rise in Towns Where U.S. ...,Study: Opioid Deaths Rise in Towns Where U.S. ...,Opioid deaths sharply rise in American communi...
3,Breitbart_1483567174,breitbart,"Uber, Postmates Sue California to Stop Gig Wor...","Uber, Postmates Sue California to Stop Gig Wor...",Ride-sharing giant Uber and courier service Po...
4,AmericanRenaissance_1812166693,americanrenaissance,Dark Money Behemoth That Hosts BLM Foundation ...,Dark Money Behemoth That Hosts BLM Foundation ...,"Joe Schoffstall, Washington Free Beacon, Decem..."
...,...,...,...,...,...
727743,WashingtonExaminer_999923116,washingtonexaminer,"White House, DHS rip Joe Scarborough for compa...","White House, DHS rip Joe Scarborough for compa...",The White House and Department of Homeland Sec...
727744,WashingtonExaminer_999923435,washingtonexaminer,The 50 years since MLK's assassination,The 50 years since MLK's assassinationFifty ye...,"Fifty years ago this evening, the Rev. Dr. Mar..."
727745,WashingtonExaminer_999951831,washingtonexaminer,Mika Brzezinski says Trump is upset he can't w...,Mika Brzezinski says Trump is upset he can't w...,“Morning Joe” cohost Mika Brzezinski said some...
727746,WashingtonExaminer_999952161,washingtonexaminer,First person sentenced in Robert Mueller's Rus...,First person sentenced in Robert Mueller's Rus...,A federal judge on Tuesday sentenced the first...


#### 2.4 merge dataframes

In [15]:
df_data_inputs = df_data_inputs.merge(right=df_titles,
                                      on="doc_id",
                                      how="left",
                                      validate="one_to_one")

df_data_inputs = df_data_inputs.merge(right=df_leads,
                                      on="doc_id",
                                      how="left",
                                      validate="one_to_one")

df_data_inputs = df_data_inputs.merge(right=df_texts,
                                      on="doc_id",
                                      how="left",
                                      validate="one_to_one")
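
The processed dataframes previewed above have fewer rows than `df_data_inputs`, so a left merge leaves `NaN` in the processed columns for docs without a match. A minimal sketch of how this could be checked on toy data is shown below. The small frames and variable names are made up for illustration; `indicator=True` and `validate="one_to_one"` are standard `DataFrame.merge` options.

```python
import pandas as pd

inputs = pd.DataFrame({"doc_id": ["A_1", "B_2", "C_3"],
                       "title": ["t1", "t2", "t3"]})
titles = pd.DataFrame({"doc_id": ["A_1", "C_3"],
                       "processed_title": [["dark", "money"], ["study"]]})

# indicator=True adds a "_merge" column saying where each row came from
merged = inputs.merge(right=titles, on="doc_id", how="left",
                      validate="one_to_one", indicator=True)

# Rows present only in the left frame have no processed version
n_unmatched = int((merged["_merge"] == "left_only").sum())
missing_ids = merged.loc[merged["processed_title"].isna(), "doc_id"].tolist()

# validate="one_to_one" raises if a doc_id is duplicated on either side
dupes = pd.concat([titles, titles.iloc[[0]]], ignore_index=True)
try:
    inputs.merge(right=dupes, on="doc_id", how="left", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
```

With `how="left"`, the merged frame keeps every row of the left frame, which is why the row count of `df_data_inputs` stays the same after all three merges.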

In [16]:
df_data_inputs

Unnamed: 0,doc_id,doctype,title,lead,article_maintext,processed_title,processed_lead,processed_text
0,AmericanRenaissance_1128638341,americanrenaissance,Congresswoman Hopes Reparations Bill is Path t...,Congresswoman Hopes Reparations Bill is Path t...,"Nicholas Ballasy, PJ Media, December 30, 2018\...",[damage],"[damage, federal, government, study, reparatio...","[federal, government, study, reparation, desce..."
1,Breitbart_621129461,breitbart,Portrait of Ronald Reagan Defaced During Break...,Portrait of Ronald Reagan Defaced During Break...,Someone vandalized a portrait of former Presid...,[portrait],"[portrait, portrait, break, county, headquarte...","[portrait, break, county, headquarters, vandal..."
2,Breitbart_1483020896,breitbart,Study: Opioid Deaths Rise in Towns Where U.S. ...,Study: Opioid Deaths Rise in Towns Where U.S. ...,Opioid deaths sharply rise in American communi...,"[study, town, plant]","[study, town, plant, death, sharply, american,...","[death, sharply, american, community, multinat..."
3,Breitbart_1483567174,breitbart,"Uber, Postmates Sue California to Stop Gig Wor...","Uber, Postmates Sue California to Stop Gig Wor...",Ride-sharing giant Uber and courier service Po...,"[uber, law]","[uber, giant, courier, service, lawsuit, brake...","[ride, giant, courier, service, lawsuit, brake..."
4,AmericanRenaissance_1812166693,americanrenaissance,Dark Money Behemoth That Hosts BLM Foundation ...,Dark Money Behemoth That Hosts BLM Foundation ...,"Joe Schoffstall, Washington Free Beacon, Decem...","[dark, money]","[dark, money, dark, money, network, life, near...","[dark, money, network, life, nearly, taxpayer,..."
...,...,...,...,...,...,...,...,...
727743,WashingtonExaminer_999923116,washingtonexaminer,"White House, DHS rip Joe Scarborough for compa...","White House, DHS rip Joe Scarborough for compa...",The White House and Department of Homeland Sec...,"[border, official, nazi]","[border, official, host, border, official, app...","[host, border, official, appalling, federal, l..."
727744,WashingtonExaminer_999923435,washingtonexaminer,The 50 years since MLK's assassination,The 50 years since MLK's assassinationFifty ye...,"Fifty years ago this evening, the Rev. Dr. Mar...",[assassination],"[evening, struggle, civil, right, history, tim...","[evening, struggle, civil, right, history, tim..."
727745,WashingtonExaminer_999951831,washingtonexaminer,Mika Brzezinski says Trump is upset he can't w...,Mika Brzezinski says Trump is upset he can't w...,“Morning Joe” cohost Mika Brzezinski said some...,"[upset, porn]","[upset, porn, cohost, recently, commander, chi...","[cohost, recently, commander, chief, pornograp..."
727746,WashingtonExaminer_999952161,washingtonexaminer,First person sentenced in Robert Mueller's Rus...,First person sentenced in Robert Mueller's Rus...,A federal judge on Tuesday sentenced the first...,"[person, investigation, prison]","[person, investigation, federal, judge, person...","[federal, judge, person, special, counsel, inv..."


In [17]:
%%time
with open(os.path.join(INPUTS_DIR, 'df_data_inputs_with_processed_docs.pkl'), 'wb') as file:
    pickle.dump(df_data_inputs, file)

CPU times: user 38.3 s, sys: 5.62 s, total: 43.9 s
Wall time: 44.1 s
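
As a quick sanity check, a pickled dataframe can be reloaded and compared against the original. The sketch below uses a toy dataframe and a temporary directory (both assumptions for illustration) with pandas' own `to_pickle` and `read_pickle`, which are equivalent in spirit to the `pickle.dump` call above.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "doc_id": ["A_1", "B_2"],
    "processed_title": [["dark", "money"], ["uber"]],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "df_example.pkl")
    df.to_pickle(path)
    df_reloaded = pd.read_pickle(path)

# The list-valued column comes back as real Python lists
same_ids = df_reloaded["doc_id"].tolist() == df["doc_id"].tolist()
same_tokens = (df_reloaded["processed_title"].tolist()
               == df["processed_title"].tolist())
```

Pickle is a reasonable choice here because the processed columns hold lists of tokens; a CSV round trip would turn those lists into strings.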
