This notebook contains:
- code to load ISIC descriptions and web pages
- examples of the summarization and embedding pipelines
- domain-sector similarity calculation and exploration

In [1]:
import os
import pandas as pd
from warcio.archiveiterator import ArchiveIterator

import lxml
from bs4 import BeautifulSoup as bs
from bs4.dammit import EncodingDetector
from sklearn.metrics.pairwise import cosine_similarity
from sklearn import decomposition
import matplotlib

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from sentence_transformers import SentenceTransformer

# Loading inputs

## International Standard Industrial Classification (ISIC)

Sectors are taken from the UN's International Standard Industrial Classification of All Economic Activities (ISIC), Revision 4.

In [2]:
sic_divs_df = pd.read_csv('data/ISICRev4 NT input.csv.csv').rename({'ISIC Rev. 4 label':'division_label'}, axis=1)

In [3]:
# Keep only the divisions whose 'Inclusions' description is longer than 15 characters
sic_divs_temp_filt = sic_divs_df.query('Inclusions.str.len() > 15')

## Web pages

### Helper functions

In [4]:
def html_to_text(page, warc_headers):
    """Return the visible text of an HTML page, or '' if parsing fails."""
    try:
        # Prefer the charset recorded in the WARC headers; otherwise take the first detected encoding
        encoding = warc_headers.get_header('WARC-Identified-Content-Charset')
        if not encoding:
            encoding = next(iter(EncodingDetector(page, is_html=True).encodings), None)
        soup = bs(page, 'lxml', from_encoding=encoding)
        # Remove script and style elements, which carry no visible text
        for script in soup(['script', 'style']):
            script.extract()
        return soup.get_text(' ', strip=True)
    except Exception as e:
        print(e)
        return ''

def open_warc(filename, record_offset):
    with open(os.path.join(WARC_FOLDER, filename), 'rb') as file:
        file.seek(int(record_offset))
        record = next(ArchiveIterator(file))
        page = record.content_stream().read()
        warc_headers = record.rec_headers
        return page, warc_headers

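def get_soup(page, warc_headers):
    """Parse a page into a BeautifulSoup tree without scripts and styles; return None on failure."""
    try:
        encoding = warc_headers.get_header('WARC-Identified-Content-Charset')
        if not encoding:
            # Fall back to the first encoding detected from the content
            encoding = next(iter(EncodingDetector(page, is_html=True).encodings), None)
        soup = bs(page, 'lxml', from_encoding=encoding)
        for script in soup(['script', 'style']):
            script.extract()
        return soup
    except Exception as e:
        print(e)
        return None

def get_filtered_paragraphs(soup):
    """Return the text of every <p> tag longer than 40 characters."""
    # A page that could not be parsed yields no paragraphs instead of raising an AttributeError
    if soup is None:
        return []
    return [p_tag.text for p_tag in soup.find_all('p') if len(p_tag.text) > 40]

def ps_from_row(row):
    """Load the WARC record referenced by an index row and extract its paragraphs."""
    content, headers = open_warc(row.filename, row.offset)
    soup = get_soup(content, headers)
    return get_filtered_paragraphs(soup)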

### Read disk

The previous notebook, _downloading and filtering_, explains how the CDXJ index was created.

In [5]:
WARC_FOLDER = ''
PUB_CDXJ = 'published_sample.cdxj'

In [6]:
cdxj_file_name = PUB_CDXJ

In [8]:
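import json

# At the end of this code block, the cdxj file is accessible as a pandas dataframe
full_path = os.path.join(WARC_FOLDER, cdxj_file_name)

# The separator never occurs in the file, so each line is read into a single column
local_index = pd.read_csv(full_path, sep='read_one_line_per_row', header=None, engine='python')

# Each line has the form "<page key> <timestamp> <JSON fields>"
expanded_local_index = local_index[0].str.split(' ', n=2, expand=True)
page_timestamp_df = expanded_local_index.rename({0:'page', 1:'timestamp'}, axis=1).drop(2, axis=1)
# Parse the JSON fields with json.loads rather than eval, which would execute arbitrary code
json_columns = expanded_local_index[2].apply(json.loads).apply(pd.Series)

clean_local_index = pd.concat([page_timestamp_df, json_columns], axis=1)

# The domain is the part of the page key before the closing parenthesis
clean_local_index['domain'] = clean_local_index.page.str.split(')', expand=True)[0]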
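
The parsing above treats each CDXJ line as three space-separated fields: a page key, a timestamp and a JSON object. As a minimal sketch of how a single line breaks down, using only the standard library: the helper `parse_cdxj_line` and the sample line are hypothetical and invented for illustration, and only the `filename` and `offset` fields are known to be used by this notebook.

```python
import json

def parse_cdxj_line(line):
    """Split one CDXJ line into its page key, timestamp and JSON fields."""
    # Split on the first two spaces only: the JSON part may itself contain spaces.
    page, timestamp, json_blob = line.strip().split(' ', 2)
    record = json.loads(json_blob)
    record['page'] = page
    record['timestamp'] = timestamp
    # The domain is the part of the page key before the closing parenthesis.
    record['domain'] = page.split(')', 1)[0]
    return record

# Invented example line (placeholder values, not real data).
sample_line = 'ke,co,example)/about 20240101000000 {"filename": "sample.warc.gz", "offset": "1234"}'
parsed = parse_cdxj_line(sample_line)
print(parsed['domain'], parsed['filename'], parsed['offset'])
```
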

### Select websites' homepage

In [9]:
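# Approximate each domain's homepage by its shortest page key
clean_local_index["page_uri_len"] = clean_local_index.page.str.len()

# .copy() so that columns can be added later without a SettingWithCopyWarning
approx_homepages = clean_local_index.loc[clean_local_index.groupby("domain").page_uri_len.idxmin()].copy()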
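
The homepage is only approximated: the shortest page key per domain is kept, which is the root page when it was crawled and some deeper page otherwise. A small sketch of the heuristic in plain Python, with an invented list of page keys and a hypothetical helper name:

```python
from collections import defaultdict

def shortest_page_per_domain(pages):
    """Return, for each domain, the page key with the fewest characters."""
    by_domain = defaultdict(list)
    for page in pages:
        domain = page.split(')', 1)[0]
        by_domain[domain].append(page)
    return {domain: min(keys, key=len) for domain, keys in by_domain.items()}

# Invented page keys: the root of the second domain is missing from the index.
pages = [
    'ke,co,example)/contact',
    'ke,co,example)/',
    'ke,co,other)/shop/items',
    'ke,co,other)/shop',
]
homepages = shortest_page_per_domain(pages)
print(homepages)
```

In this toy input the second domain has no root page, so a deeper page is selected instead, which is why the result is called approximate.
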

In [10]:
# TODO: find a better way to extract content from web pages; about a quarter of them contain no paragraphs

In [11]:
%%time
# Extract paragraphs for a random sample of 3000 homepages; rows outside the sample are left as NaN
approx_homepages["all_ps"] = approx_homepages.sample(3000).apply(ps_from_row, axis=1)

CPU times: user 1min 18s, sys: 2.71 s, total: 1min 20s
Wall time: 5min 1s


# Embedding

In this section, pre-trained Hugging Face models are loaded and used to summarize SIC classes and domains into keywords, and then to embed those keywords.

## Loading models

In [12]:
# the models used are gated, use your own access token
access_token = 'hf_YOURACCESSTOKEN'
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id, token=access_token)
gen_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    token=access_token
)

embeddings_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')


## SIC

In [13]:
llm_input_df = sic_divs_temp_filt
llm_sic_answers = []

In [14]:
%%time
for i, row in llm_input_df.iterrows():
    division_description = row.Inclusions
        
    messages = [
        {"role": "user", "content": division_description},
        {"role": "assistant", "content": "What is this?"},
        {"role": "user", "content": "This is a description of an economic sector. Generate a list of sector-describing keywords, one per line, numbered.\
        Keywords can be made up of multiple words."}
    ]
    
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(gen_model.device)
    
    terminators = [
        tokenizer.eos_token_id,
        tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]
    
    # Sample up to 50 new tokens with nucleus sampling
    outputs = gen_model.generate(
        input_ids,
        max_new_tokens=50,
        eos_token_id=terminators,
        do_sample=True,
        temperature=1,
        top_p=0.9,
    )
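    # Drop the prompt tokens and keep only the newly generated ones
    response = outputs[0][input_ids.shape[-1]:]

    # The answer is expected to be an introductory sentence, a blank line, then the numbered list;
    # keep the list only (this raises an IndexError if the answer contains no blank line)
    llm_sic_answers.append(tokenizer.decode(response, skip_special_tokens=True).split('\n\n')[1])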

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.

CPU times: user 2min 49s, sys: 385 ms, total: 2min 49s
Wall time: 2min 50s


The raw output of the language model is processed into individual keywords and embedded.

In [16]:
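# One row per line of each answer, indexed by division label
raw_sic_keywords = pd.Series(llm_sic_answers, name='keywords', index=sic_divs_temp_filt.division_label).str.split('\n').explode()

# Drop the leading number of each item ("1. Keyword" -> "Keyword"); lines without a space are discarded
clean_sic_keywords = raw_sic_keywords.str.strip().str.split(' ', n=1).str[1].dropna()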
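
The post-processing above relies on the answer having one fixed shape (an introduction, a blank line, then "number, space, keyword" lines). As an alternative sketch, not the method used in this notebook, the numbered lines can be picked out with a regular expression so that answers with or without an introduction are handled the same way. The helper `extract_keywords` and the sample answer are hypothetical.

```python
import re

# Matches lines such as "1. Tobacco", "2) Cigarette manufacturing" or "3 - Retail".
NUMBERED_ITEM = re.compile(r'^\s*\d+\s*[.):\-]\s*(.*\S)\s*$')

def extract_keywords(answer):
    """Return the keyword of every numbered line in an LLM answer."""
    keywords = []
    for line in answer.splitlines():
        match = NUMBERED_ITEM.match(line)
        if match:
            keywords.append(match.group(1))
    return keywords

# Invented answer, for illustration only.
answer = "Here are the keywords:\n\n1. Tobacco\n2. Cigarette manufacturing\n3) Processing\n"
keywords = extract_keywords(answer)
print(keywords)
```

Lines that do not start with a number followed by a separator are ignored, so an introductory sentence or a trailing remark never ends up among the keywords.
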

In [17]:
sic_kw_embeddings = embeddings_model.encode(clean_sic_keywords.values) 

sic_kw_embeddings_df = pd.DataFrame(sic_kw_embeddings, index=clean_sic_keywords.index)

## Domains

In [18]:
sample_size=200
sample_state=16

approx_homepages_sample = approx_homepages.query('all_ps.notna()').sample(sample_size, random_state=sample_state)

In [21]:
# %%time
llm_input_df = approx_homepages_sample
llm_homepage_answers = []

for i, row in llm_input_df.iterrows():
    company = row.domain
    paragraphs = " ".join(row.all_ps)

    messages = [
        {"role": "user", "content": paragraphs},
        {"role": "assistant", "content": "What is this?"},
        {"role": "user", "content": "This is unstructured data downloaded from a webpage. \
        Generate a list of sector-describing keywords (not the geography), one per line, numbered. Keywords can be made up of multiple words.\
        Start your answer by stating if the web content is somehow descriptive of the company or not. \
        If it's not, you must include the word NONE in your answer, all capitals. If it is, generate the list."}
    ]

    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(gen_model.device)
    
    terminators = [
        tokenizer.eos_token_id,
        tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]
    
    # Sample up to 100 new tokens with nucleus sampling
    outputs = gen_model.generate(
        input_ids,
        max_new_tokens=100,
        eos_token_id=terminators,
        do_sample=True,
        temperature=1,
        top_p=0.9,
    )
    response = outputs[0][input_ids.shape[-1]:]

    llm_homepage_answers.append(tokenizer.decode(response, skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.

The raw output of the language model is processed into individual keywords and embedded.

In [22]:
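# Discard the answers flagged as non-descriptive
raw_domain_answer = pd.DataFrame(llm_homepage_answers, columns=["answer"], index=approx_homepages_sample.domain).query("~answer.str.contains('None|NONE')")
# Keep what follows the literal "1." (escaped, because multi-character patterns are treated as regular expressions)
raw_domain_keyword = raw_domain_answer.answer.str.split(r"1\.", n=1).str[1].str.split("\n").explode()
# Drop the leading number (or leading space) of each item
clean_domain_keyword = raw_domain_keyword.str.split(' ', n=1).str[1].dropna()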

n=5
topn_kw_per_domain = clean_domain_keyword.groupby(level=0).head(n)

In [23]:
domain_kw_embeddings = embeddings_model.encode(topn_kw_per_domain.values)

domain_kw_embeddings_df = pd.DataFrame(domain_kw_embeddings, index=topn_kw_per_domain.index)

# Companies-SIC classes similarities

In [26]:
# Create a dataframe of similarities between all keywords
domain_division_kw_similarities = pd.DataFrame(cosine_similarity(domain_kw_embeddings_df, Y=sic_kw_embeddings_df), index=domain_kw_embeddings_df.index, columns=sic_kw_embeddings_df.index)

In [27]:
# Average over each domain's keywords, then over each division's keywords (average of averages; see the report)
domain_division_similarities = domain_division_kw_similarities.groupby(level=0).mean().transpose().groupby(level=0).mean()
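
To make the two-step averaging concrete, here is a small sketch in plain Python on invented scores. The function name, the dictionary layout and the numbers are assumptions made for illustration; the notebook itself performs the same two steps with `groupby` on the similarity dataframe.

```python
from collections import defaultdict
from statistics import mean

def average_of_averages(similarities):
    """Collapse keyword-level similarities into one score per (division, domain) pair.

    `similarities` maps (domain, domain_keyword, division, division_keyword) to a score.
    """
    # Step 1: for each division keyword, average over the keywords of each domain.
    per_division_keyword = defaultdict(list)
    for (domain, _domain_kw, division, division_kw), score in similarities.items():
        per_division_keyword[(domain, division, division_kw)].append(score)
    # Step 2: for each division, average those means over its keywords.
    per_pair = defaultdict(list)
    for (domain, division, _division_kw), scores in per_division_keyword.items():
        per_pair[(division, domain)].append(mean(scores))
    return {pair: mean(values) for pair, values in per_pair.items()}

# Invented scores: one domain with keywords a and b, one division with keywords x and y.
toy = {
    ('d', 'a', 'S', 'x'): 0.25,
    ('d', 'b', 'S', 'x'): 0.75,
    ('d', 'a', 'S', 'y'): 0.5,
    ('d', 'b', 'S', 'y'): 1.0,
}
result = average_of_averages(toy)
print(result)
```

For a single domain-division pair with no missing values, this two-step mean equals the plain mean over all keyword pairs, since every domain keyword is compared with every division keyword.
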

## Explore the results here

In [208]:
domain_division_similarities.idxmax()

domain
ke,co,3moverseaseducation    Activities of head offices; management consult...
ke,co,acatalyst              Activities of head offices; management consult...
ke,co,accurate               Repair and installation of machinery and equip...
ke,co,aerocruise-safaris     Travel agency, tour operator, reservation serv...
ke,co,africaninteriors             Manufacture of rubber and plastics products
                                                   ...                        
ke,co,wigotgardens           Travel agency, tour operator, reservation serv...
ke,co,wkadvocates                              Legal and accounting activities
ke,co,wurth                  Manufacture of computer, electronic and optica...
ke,co,yes                    Activities auxiliary to financial service and ...
ke,co,zannproperties         Activities of head offices; management consult...
Length: 114, dtype: object

In [209]:
inspected_domain = "ke,co,wigotgardens"

In [210]:
approx_homepages_sample.query("domain == @inspected_domain").all_ps.values

array([list(['A virtual tour is a 360 degree view of this property. Simply turn left,right,up or down to view any side by either using the navigation panels at the bottom or clicking the VT screen and using the keyboard direction keys.Enjoy...', 'info@wigotgardens.co.kereservations@wigotgardens.co.ke'])],
      dtype=object)

In [211]:
topn_kw_per_domain.loc[inspected_domain]

domain
ke,co,wigotgardens               Virtual Tour
ke,co,wigotgardens              Property View
ke,co,wigotgardens                 Navigation
ke,co,wigotgardens             Panoramic View
ke,co,wigotgardens    Interactive Exploration
Name: answer, dtype: object

In [212]:
cosine_similarity(domain_kw_embeddings_df.loc[inspected_domain]).mean()

0.45638436
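
Note that the mean above is taken over the full similarity matrix, which includes each keyword's similarity with itself (always 1), so it overstates how close distinct keywords are. A minimal sketch of both variants in plain Python, on invented vectors and with hypothetical function names:

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_full_matrix_similarity(vectors):
    """Mean over all ordered pairs, self-similarities included."""
    n = len(vectors)
    return sum(cosine(u, v) for u in vectors for v in vectors) / (n * n)

def mean_pairwise_similarity(vectors):
    """Mean over distinct pairs only (diagonal excluded)."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError('need at least two vectors')
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Invented two-dimensional "embeddings", for illustration only.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
with_diagonal = mean_full_matrix_similarity(vectors)
off_diagonal = mean_pairwise_similarity(vectors)
print(with_diagonal, off_diagonal)
```

With `n` keywords and a full-matrix mean `m`, the off-diagonal mean is `(n * m - 1) / (n - 1)`, so one value can be converted into the other without recomputing the similarities.
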

In [213]:
domain_division_similarities[inspected_domain].sort_values().tail(20)

division_label
Computer programming, consultancy and related activities                                                                           0.160226
Programming and broadcasting activities                                                                                            0.161077
Manufacture of wood and of products of wood and cork, except furniture; manufacture of articles of straw and plaiting materials    0.161523
Other professional, scientific and technical activities                                                                            0.161808
Activities of head offices; management consultancy activities                                                                      0.162496
Scientific research and development                                                                                                0.163449
Manufacture of other transport equipment                                                                                           0.169089
Land 

In [214]:
inspected_sector = 'Manufacture of tobacco products'

In [188]:
clean_sic_keywords[inspected_sector]

division_label
Manufacture of tobacco products     Agricultural
Manufacture of tobacco products       Processing
Manufacture of tobacco products          Tobacco
Manufacture of tobacco products          Primary
Manufacture of tobacco products             Food
Manufacture of tobacco products    Manufacturing
Manufacture of tobacco products         Consumer
Manufacture of tobacco products       Industrial
Name: keywords, dtype: object

In [189]:
cosine_similarity(sic_kw_embeddings_df.loc[inspected_sector]).mean()

0.43145382