## 1 **Factiva Article Dataset Construction**

#### 1.1 **Purpose:** This code takes html-formatted articles from Factiva search results, converts each html file to a pandas dataframe of articles (df2), and collects these into a master dataframe (df). It also shows post-hoc clean-up of df, such as removing duplicate articles and oddly formatted quotation marks. The finished df is then fed into a different Python project that extracts and attributes quotes and paraphrases from decision makers within these articles. 
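
The clean-up step could look roughly like the sketch below. It is an illustration under stated assumptions, not the exact clean-up used here: it assumes the `AN` field uniquely identifies an article, that "oddly formatted quotation marks" means curly quotes, and that the article text sits in the `HD`, `LP`, and `TD` columns. The name `clean_articles` and its parameters are hypothetical.

```python
import pandas as pd

# Assumed mapping: curly quotes -> plain ASCII quotes
QUOTE_MAP = {
    "\u201c": '"', "\u201d": '"',  # curly double quotes
    "\u2018": "'", "\u2019": "'",  # curly single quotes / apostrophes
}


def clean_articles(df, id_col="AN", text_cols=("HD", "LP", "TD")):
    """Drop duplicate articles and normalize quotation marks (hypothetical helper)."""
    # Keep the first copy of each article, judged by the assumed identifier column
    out = df.drop_duplicates(subset=id_col).reset_index(drop=True)
    table = str.maketrans(QUOTE_MAP)
    for col in text_cols:
        if col in out.columns:
            # Missing values stay missing; only string cells are translated
            out[col] = out[col].str.translate(table)
    return out
```

Normalizing quotes before quote extraction matters because a matcher that looks for `"` would otherwise miss text wrapped in curly quotes.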

##### NOTE: Factiva does not permit scraping of its website. It also limits each downloaded report to 100 articles, so I went through every firm-year with more than 100 articles and split the search date range into windows that each return as close to 100 articles as possible. 
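
As a starting point for that splitting, a date range can be cut into equal-length windows, as in the sketch below. The function `split_date_range` is hypothetical, and equal-length windows do not guarantee equal article counts, so each window's count still has to be checked in Factiva and adjusted by hand.

```python
from datetime import date, timedelta


def split_date_range(start, end, parts):
    """Split the inclusive range [start, end] into `parts` contiguous windows."""
    total_days = (end - start).days + 1
    if parts < 1 or parts > total_days:
        raise ValueError("parts must be between 1 and the number of days in the range")
    base, extra = divmod(total_days, parts)
    windows = []
    window_start = start
    for i in range(parts):
        # Spread any leftover days over the first `extra` windows
        length = base + (1 if i < extra else 0)
        window_end = window_start + timedelta(days=length - 1)
        windows.append((window_start, window_end))
        window_start = window_end + timedelta(days=1)
    return windows


# Example: a firm-year with roughly 400 articles might start with four windows
for window_start, window_end in split_date_range(date(2019, 1, 1), date(2019, 12, 31), 4):
    print(window_start.isoformat(), window_end.isoformat())
```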

#### 1.2 **Input data:** Create Factiva search queries manually, downloading each search result in RTF format. As part of this manual process I recommend running a Python script that moves each downloaded file to a folder of your choice and renames it. I did this with two of everything (screens, Factiva, tracking, renaming Python code, etc.) to speed up the process. 
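
A minimal sketch of such a move-and-rename helper follows. It assumes the browser saves into a single downloads folder and that files should be named `firm_vertexid_start_end`, the pattern that the parsing code in section 2.2 splits on. The function name `file_download` and its arguments are hypothetical, not the script actually used.

```python
import shutil
from pathlib import Path


def file_download(downloads_dir, target_dir, firm, vertexid, start, end):
    """Move the newest file in downloads_dir to target_dir as firm_vertexid_start_end.<ext>.

    Hypothetical helper: the folder layout and naming scheme are assumptions.
    """
    downloads = [p for p in Path(downloads_dir).iterdir() if p.is_file()]
    if not downloads:
        raise FileNotFoundError(f"no files found in {downloads_dir}")
    # The most recently modified file is taken to be the report just downloaded
    newest = max(downloads, key=lambda p: p.stat().st_mtime)

    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    new_path = target / f"{firm}_{vertexid}_{start}_{end}{newest.suffix}"
    if new_path.exists():
        # Refuse to overwrite an earlier download for the same firm and window
        raise FileExistsError(f"{new_path} already exists")
    shutil.move(str(newest), str(new_path))
    return new_path
```

Because the later parsing code splits file names on underscores, the firm name, id, and dates passed in should not themselves contain `_`.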

## 2.0: Results 

### 2.1: Set up environment

In [1]:
import time
from tkinter import *
import datetime
import os
import glob
import timeit
import striprtf
import PyRTF
import re
import sqlite3
import csv
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import numpy as np
import pickle as pkl


### 2.2: Run code on htmls

#### This code converts each html file into a pandas dataframe with one row per article. While the specific results below are from articles that specifically reference middle manager titles, the same code was used on the initial round of Factiva queries.  

In [None]:
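# Set the path to the folder that holds the downloaded html files:
path = r'C:/Users/danwilde/Dropbox (Penn)/Project - Fusion Industry/htmls'

# Factiva field codes; they appear as the row labels of each article table
fields = ['AN', 'SE', 'HD', 'CR', 'WC', 'PD', 'ET', 'SN', 'SC', 'ED', 'PG', 'LA', 'CY', 'LP', 'TD', 'CT', 'RF', 'CO',
          'IN', 'NS', 'RE', 'IPC', 'IPD', 'PUB']

frames = []  # one dataframe of articles (df2) per html file

t0 = time.time()
n = 0
for f in glob.glob(os.path.join(path, "*.html")):
    t1 = time.time()
    # File names follow the pattern firm_vertexid_start_end.html
    name = os.path.basename(f)
    file_div = name.split('_')
    firm = file_div[0]
    vertexid = file_div[1]
    s = file_div[2]
    e = file_div[3].split('.')[0]

    # read_html returns every table in the file; keep only the article tables (those with an 'HD' row)
    df1 = pd.read_html(f, index_col=0)
    df2 = pd.concat([l for l in df1 if 'HD' in l.index.values], axis=1).T
    df2['Firm'] = firm
    df2['vertex.id'] = vertexid
    df2['start'] = s
    df2['end'] = e
    frames.append(df2)
    n += 1
    t2 = time.time()
    # "mean rate" is the average number of seconds per file so far
    print(n, firm, vertexid, s, e, "time run:", round(t2-t1,2), "total hours:", round((t2-t0)/(60*60),2), "mean rate:", round((t2-t0)/n,2))

# DataFrame.append was removed in pandas 2.0, so combine all files with a single concat
df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame(columns=fields)
# Put the Factiva fields first, followed by the added columns (Firm, vertex.id, start, end)
df = df.reindex(columns=fields + [c for c in df.columns if c not in fields])

print("done")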
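
To see why the concat-and-transpose step yields one row per article, the toy example below uses hand-built stand-ins for the tables that `pd.read_html(f, index_col=0)` returns. The values and the non-article table are made up for illustration; real Factiva files contain many more fields.

```python
import pandas as pd

# Stand-ins for parsed tables: one column of values, indexed by field code
art1 = pd.DataFrame({1: ["Headline one", "120 words"]}, index=["HD", "WC"])
art2 = pd.DataFrame({1: ["Headline two", "English"]}, index=["HD", "LA"])
other = pd.DataFrame({1: ["Search summary"]}, index=["Text"])  # not an article

tables = [art1, other, art2]

# Keep only tables that have a headline row, as the loop above does
articles = [t for t in tables if "HD" in t.index.values]

# Side-by-side concat aligns the field codes; transposing turns each article
# column into a row, and fields an article lacks become NaN
rows = pd.concat(articles, axis=1).T.reset_index(drop=True)
print(rows)
```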

## 3.0 **Review**
### This is the df of all articles from the middle manager Factiva search queries

In [None]:
df