# Classify companies based on their websites

This notebook explores how well OpenAI's GPT-3.5 can classify enterprises based on their company websites. To do so, it works through the following problems:

- Find an enterprise's website given nothing more than a proprietary email address (i.e. excluding addresses such as @gmail.com or @outlook.com)
- Extract relevant keywords from the enterprise's website
- Classify the entry based on the extracted keywords
- Allow the analyst to export the result into a .csv file based on their requirements
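
The first point can be sketched as a small filter that derives the website domain from an email address and rejects free-mail providers. This is a minimal sketch, not part of the notebook's pipeline: the helper name `extract_corporate_domain` and the provider list `FREE_EMAIL_DOMAINS` are assumptions and would need to be extended for real data.

```python
# Hypothetical helper; the provider list is illustrative and incomplete.
FREE_EMAIL_DOMAINS = {"gmail.com", "outlook.com", "hotmail.com", "yahoo.com", "gmx.de", "web.de"}

def extract_corporate_domain(email):
  """Return the lower-cased domain of a proprietary address, or None."""
  # Reject non-strings and anything without exactly one '@'
  if not isinstance(email, str) or email.count("@") != 1:
    return None
  local_part, domain = email.strip().lower().split("@")
  # Reject empty local parts and domains without a dot
  if not local_part or "." not in domain:
    return None
  # Free-mail addresses say nothing about the company website
  if domain in FREE_EMAIL_DOMAINS:
    return None
  return domain

print(extract_corporate_domain("dummy@test.com"))
print(extract_corporate_domain("someone@gmail.com"))
```

Rows for which the helper returns `None` could be skipped or reported before anything is written to the database.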


In [1]:
# Import necessary libraries
import requests
import os
import nltk
import pandas as pd
import json
import openai
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

load_dotenv()

# Set constant variables
mongo_url = os.getenv("MONGO_URL")
openai.api_key = os.getenv("OPENAI_API_KEY")
client = MongoClient(mongo_url, server_api=ServerApi('1'))

# Set configuration variables
# These are used to determine whether to run a task that might already have taken place,
# e.g. crawl a website, update keywords or predicting a category with OpenAI
update_category = False

# Download stopwords
nltk.download('stopwords')
nltk.download('punkt')

# Establish MongoDB Connection
db = client['python_openapi']
collection = db['enterprises']


[nltk_data] Downloading package stopwords to /home/tobiq/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/tobiq/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Test connection (optional)

Create the necessary DB structure with a dummy entry to test the connection.

In [100]:
# try:
#   enterprise_dummy_entry = {
#     "name": "Test Enterprises GmbH",
#     "domain": "test.com",
#     "email": "dummy@test.com",
#     "category": "",
#     "keywords": ["one", "two", "three"]
#   }
#   collection.insert_one(enterprise_dummy_entry)
#   print("Created dummy entry")
# except Exception as e:
#   print(e)

# Delete all dummy entries again
# try:
#   db = client['python_openapi']
#   collection = db['enterprises']
#   collection.delete_many({"domain": "test.com"})
# except Exception as e:
#   print(e)

## Upload CSV Data

The minimum data structure provided should be

| Name | Email |
| ---- | ----- |
| Test Enterprises GmbH | dummy@test.com |

Use the following template to start: 

```csv
name;email
Test Enterprises GmbH;dummy@test.com
```

Data will be stored in MongoDB. Duplicated domains will be overwritten by the newest email address and name.
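
Note that the import cell passes the complete entry to `$set`, so an existing domain also has its `corpus`, `keywords` and `industry` fields reset to their empty defaults. If the intent is to refresh only the name and email, one possible approach is an upsert that separates refreshed fields from insert-only defaults. The following is a sketch under that assumption; the helper name `build_upsert` is hypothetical, while `$set`, `$setOnInsert` and `upsert=True` are standard MongoDB features.

```python
# Sketch only: field names follow the import cell; the helper name is hypothetical.
def build_upsert(entry):
  """Split an entry into a MongoDB filter and an update document."""
  # Fields that should always reflect the newest CSV row
  refreshed = {"name": entry["name"], "email": entry["email"]}
  # Everything else is only written when the document is first created
  defaults = {
    key: value
    for key, value in entry.items()
    if key not in refreshed and key != "domain"
  }
  update = {"$set": refreshed, "$setOnInsert": defaults}
  return {"domain": entry["domain"]}, update

example_entry = {
  "name": "Test Enterprises GmbH",
  "domain": "test.com",
  "email": "dummy@test.com",
  "corpus": "",
  "industry": "",
  "keywords": [],
  "confidence_keywords": 0,
  "confidence_industry": 0
}
query, update = build_upsert(example_entry)
print(query)
print(update)

# With the notebook's `collection`, the call would be:
# collection.update_one(query, update, upsert=True)
```

With this shape, a single `update_one` call replaces the `find_one` / `insert_one` / `update_one` branching, and previously crawled data survives a re-import.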

In [2]:
csv_input_path = './enterprises.csv'
csv_input_data = pd.read_csv(csv_input_path, sep=";")
json_enterprise_data = []

try:
  for _, row in csv_input_data.iterrows():
    entry = {
      "name": row["name"],
      "domain": row["email"].split('@')[1],
      "email": row["email"],
      "corpus": "",
      "industry": "",
      "keywords": [],
      "confidence_keywords": 0,
      "confidence_industry": 0
    }
    json_enterprise_data.append(entry)

  json_data = json.loads(json.dumps(json_enterprise_data, indent=4))

  for entry in json_data:
    if collection.find_one({"domain": entry["domain"]}) is None:
      print('Inserting entry for domain', entry["domain"])
      collection.insert_one(entry)
    else:
      print('Updating entry for domain', entry["domain"])
      collection.update_one({"domain": entry["domain"]}, {"$set": entry})
  print("Done importing " + str(len(json_data)) + " entries from csv file " + csv_input_path)
except Exception as e:
  print(e)

Inserting entry for domain apple.com
Done importing 1 entries from csv file ./enterprises.csv


## Crawl the website from the given domain

Assuming all domains stored in MongoDB are valid (you could add a validation step right above), the following code will extract all defined tags from the corporate website and enrich the MongoDB entry accordingly. It will

1. Read all enterprise entries
2. Check whether the entry is to be updated (to avoid multiple, accidental crawls of the same domain)
3. Crawl the website by its domain using the https protocol
4. Append the response corpus to the MongoDB entry

> Note: This process runs synchronously and may take a while. To avoid recrawling already existing entries, set the variable 'update_corpus' to False.
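
If the synchronous run becomes too slow for larger inputs, the crawls could be spread over a thread pool, since most of the time is spent waiting on the network. The following is a sketch of that idea, not something the notebook does: `crawl_all` and `fake_fetch` are hypothetical names, and `fake_fetch` merely stands in for `create_corpus_from_domain` so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def crawl_all(domains, fetch, max_workers=4):
  """Run fetch(domain) for every domain in a thread pool.

  Returns a dict mapping each domain to its corpus; a failed fetch maps to "".
  """
  results = {}
  with ThreadPoolExecutor(max_workers=max_workers) as pool:
    futures = {pool.submit(fetch, domain): domain for domain in domains}
    for future in as_completed(futures):
      domain = futures[future]
      try:
        results[domain] = future.result()
      except Exception as error:
        # Keep going if a single site fails
        print("Error while crawling " + domain + ": " + str(error))
        results[domain] = ""
  return results

# Stand-in for create_corpus_from_domain (assumption, for offline use)
def fake_fetch(domain):
  if domain == "broken.example":
    raise ValueError("unreachable")
  return "corpus for " + domain

print(crawl_all(["test.com", "broken.example"], fake_fetch))
```

The database updates can then be applied in a plain loop over the returned dictionary, which keeps all MongoDB writes in one place.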

In [3]:
enterprises = collection.find({})
html_tags_to_crawl = ['h1', 'h2', 'h3', 'p'] # Add and remove tags you want to find keywords in
http_protocol = 'https'                      # Set to 'http' or 'https'
languages = ['english', 'german']            # Add whatever languages should be filtered for stopwords
update_corpus = True                         # Whether or not to re-crawl a website
max_corpus_size = 500                        # Maximum corpus size (in characters) to be sent to OpenAI

# Remove stopwords in all defined languages
def remove_stopwords(text):
  # Build the stopword set once per call instead of once per word
  stopword_set = set()
  for language in languages:
    stopword_set.update(stopwords.words(language))
  return [word for word in text if word.lower() not in stopword_set]

# Crawl a website and return a text corpus
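def create_corpus_from_domain(domain):
  url = http_protocol + '://' + domain
  try:
    # The timeout keeps an unresponsive site from blocking the run (10 s is an arbitrary choice)
    response = requests.get(url, timeout=10)
  except requests.RequestException as e:
    print("Error while trying to crawl domain " + domain + ": " + str(e))
    return ""
  if(response.status_code != 200):
    # status_code is an int and must be converted before concatenation
    print("Error while trying to crawl domain " + domain + ", status code: " + str(response.status_code))
    return ""
  else:
    soup = BeautifulSoup(response.content, 'html.parser')
    snippets = []
    for tag in soup.find_all(html_tags_to_crawl):
      text = word_tokenize(tag.text)
      filtered_text_list = remove_stopwords(text)
      # Cap each tag's contribution at 100 characters
      snippets.append(" ".join(filtered_text_list)[0:100])
    # Join with a space so words from adjacent tags do not run together
    return " ".join(snippets)[0:max_corpus_size]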

# Update database entries with the crawled text corpus
def enrich_enterprise_with_corpus(enterprise):
  # Update existing corpus data
  if(enterprise['corpus'] != "" and update_corpus == True):
    print("Updating corpus with " + str(max_corpus_size) + " characters for domain " + enterprise["domain"])
    enterprise_corpus = create_corpus_from_domain(domain=enterprise["domain"])
    enterprise["corpus"] = enterprise_corpus
    collection.update_one({"domain": enterprise["domain"]}, {"$set": enterprise})

  # If the corpus is not empty and updates are disabled, skip this entry
  elif(enterprise['corpus'] != "" and update_corpus == False):
    print("Skipping corpus update for domain " + enterprise["domain"])

  # Default: Crawl a new entry
  else:
    print("Crawling domain " + enterprise["domain"])
    enterprise_corpus = create_corpus_from_domain(domain=enterprise["domain"])
    enterprise["corpus"] = enterprise_corpus
    print("Adding corpus with " + str(max_corpus_size) + " characters for domain " + enterprise["domain"])
    collection.update_one({"domain": enterprise["domain"]}, {"$set": enterprise})

for enterprise in enterprises:
  enrich_enterprise_with_corpus(enterprise)

Crawling domain apple.com
Adding corpus with 500 characters for domain apple.com


## Generate keywords and update entry

After the entries are created and their websites crawled, we'll use OpenAI's GPT-3.5 to generate keywords for each entry and update the MongoDB entry accordingly. The following code will:

1. Read all enterprise entries
2. Generate keywords for each entry
3. Generate keyword confidence
4. Update the MongoDB entry accordingly

### Prompt

You are a marketing expert at an international agency. Extract keywords from a text corpus describing a product or service and return a JSON Object adhering to the following rules:

- Omit comments or explanations and return the JSON object.
- Format the JSON Object as: `{ keywords: <keywords>, confidence: <confidence> }`
- Use a JSON Array for `<keywords>` and include only the top ten keywords that best describe the product in the text corpus.
- For `<confidence>`, assign a score from 1 (lowest) to 10 (highest) indicating how well the keywords match the product or service in the text corpus.
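
Because the model's reply is plain text, `json.loads` can fail on it, or succeed on an object that does not follow the rules above. A minimal validation sketch, assuming the reply is expected to be a JSON object with a `keywords` array of strings and an integer `confidence` from 1 to 10; the helper name `parse_keyword_response` is hypothetical.

```python
import json

def parse_keyword_response(raw_text, max_keywords=10):
  """Validate the model reply; return (keywords, confidence) or raise ValueError."""
  try:
    data = json.loads(raw_text)
  except json.JSONDecodeError as error:
    raise ValueError("Reply is not valid JSON: " + str(error))
  if not isinstance(data, dict):
    raise ValueError("Reply is not a JSON object")
  keywords = data.get("keywords")
  if not isinstance(keywords, list) or not all(isinstance(k, str) for k in keywords):
    raise ValueError("'keywords' must be a JSON array of strings")
  confidence = data.get("confidence")
  # bool is a subclass of int in Python, so exclude it explicitly
  if isinstance(confidence, bool) or not isinstance(confidence, int) or not 1 <= confidence <= 10:
    raise ValueError("'confidence' must be an integer from 1 to 10")
  # Enforce the "top ten keywords" rule even if the model returns more
  return keywords[:max_keywords], confidence

keywords, confidence = parse_keyword_response('{"keywords": ["phone", "camera"], "confidence": 8}')
print(keywords, confidence)
```

Catching the `ValueError` per enterprise would let a run continue past a single malformed reply instead of aborting the whole loop.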

In [5]:
enterprises = collection.find({})
update_keywords = False
system_prompt_keywords = "You are a marketing expert at an international agency. Extract keywords from a text corpus describing a product or service and return a JSON Object adhering to the following rules:\n\n- Omit comments or explanations and return the JSON object.\n- Format the JSON Object as: `{ keywords: <keywords>, confidence: <confidence> }`\n- Use a JSON Array for `<keywords>` and include only the top ten keywords that best describe the product in the text corpus.\n- For `<confidence>`, assign a score from 1 (lowest) to 10 (highest) indicating how well the keywords match the product or service in the text corpus."

def fetch_keywords_for_enterprise(enterprise):
  response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
      { "role": "system", "content": system_prompt_keywords },
      { "role": "user", "content": enterprise['corpus'] }
    ],
    temperature=0,
    max_tokens=256,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0
  )
  response_json = json.loads(response['choices'][0]['message']['content'])
  return response_json

def enrich_enterprise_with_keywords(enterprise):
  # Throw exception if corpus is missing
  if(enterprise['corpus'] == ""):
    raise Exception("Corpus is missing for domain " + enterprise["domain"] + '. Make sure to run the Crawler first!')
  # Skip entry if keywords are already present and update is turned off
  elif(len(enterprise['keywords']) > 0 and update_keywords == False):
    print("Skipping keywords update for domain " + enterprise["domain"])
  else:
    keyword_response = fetch_keywords_for_enterprise(enterprise)
    enterprise['keywords'] = keyword_response['keywords']
    print("Keywords for domain " + enterprise["domain"] + " are " + str(enterprise['keywords']))
    enterprise['confidence_keywords'] = keyword_response['confidence']
    collection.update_one({"domain": enterprise["domain"]}, {"$set": enterprise})


for enterprise in enterprises:
    enrich_enterprise_with_keywords(enterprise)


Keywords for domain apple.com are ['Apple', 'iPhone 15 Pro', 'Titanium', 'strong', 'light', 'New camera', 'New design', 'Newphoria', 'Killers Flower Moontheaters', 'Available early next year U.S']


## Generate industry and update entry

Finally, the domain will be categorized into one of the available industries. The following code will:

- Read all enterprise entries
- Generate industry categories for each entry
- Update the MongoDB entry accordingly

### Prompt

As a marketing expert at an international agency, categorize the company based on the provided JSON array of keywords into one of the following industries and return a JSON Object adhering to the following rules:

- Omit comments or explanations and return the JSON object.
- Format the JSON Object as: `{ industry: <industry>, confidence: <confidence> }`
- Replace `<industry>` with the industry name based on your best categorization result.
- Replace `<confidence>` with a score from 1 (lowest) to 10 (highest) indicating how well the industry matches the provided keywords.
- Before you answer, verify that your reply is part of the industry array below
- If it is not, choose an industry that is closest to your reply.

These are the industries: 
```
[ "Technology", "Healthcare", "Finance", "Retail", "Manufacturing", "Automotive", "Energy", "Telecommunications", "Aerospace", "Hospitality", "Entertainment", "Consumer Goods", "Pharmaceuticals", "Construction", "Transportation", "Real Estate", "Food and Beverage", "Media", "Insurance", "Consulting", "Other Services"]
```
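
The prompt asks the model to check its own answer against the list, but nothing guarantees that it does. The same check can be repeated in code before the result is stored. This is a sketch under assumptions: the helper name `normalize_industry` is hypothetical, and falling back to "Other Services" for unknown replies is an illustrative choice, not the notebook's behavior.

```python
INDUSTRIES = [
  "Technology", "Healthcare", "Finance", "Retail", "Manufacturing", "Automotive",
  "Energy", "Telecommunications", "Aerospace", "Hospitality", "Entertainment",
  "Consumer Goods", "Pharmaceuticals", "Construction", "Transportation",
  "Real Estate", "Food and Beverage", "Media", "Insurance", "Consulting",
  "Other Services"
]

def normalize_industry(candidate, fallback="Other Services"):
  """Map a model reply onto the allowed list, ignoring case and surrounding whitespace."""
  lookup = {name.lower(): name for name in INDUSTRIES}
  if not isinstance(candidate, str):
    return fallback
  # Unknown labels fall back to the catch-all category (assumption)
  return lookup.get(candidate.strip().lower(), fallback)

print(normalize_industry(" technology "))
print(normalize_industry("Space Mining"))
```

Logging the replies that hit the fallback would show how often the model leaves the list, which is useful when judging its classification quality.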

In [6]:
enterprises = collection.find({})
update_industry = False
system_prompt_industry = "As a marketing expert at an international agency, categorize the company based on the provided JSON array of keywords into one of the following industries and return a JSON Object adhering to the following rules:\n\n- Omit comments or explanations and return the JSON object.\n- Format the JSON Object as: `{ industry: <industry>, confidence: <confidence> }`\n- Replace `<industry>` with the industry name based on your best categorization result.\n- Replace `<confidence>` with a score from 1 (lowest) to 10 (highest) indicating how well the industry matches the provided keywords.\n- Before you answer, verify that your reply is part of the industry array below\n- If it is not, choose an industry that is closest to your reply.\n\nThese are the industries: \n```\n[ \"Technology\", \"Healthcare\", \"Finance\", \"Retail\", \"Manufacturing\", \"Automotive\", \"Energy\", \"Telecommunications\", \"Aerospace\", \"Hospitality\", \"Entertainment\", \"Consumer Goods\", \"Pharmaceuticals\", \"Construction\", \"Transportation\", \"Real Estate\", \"Food and Beverage\", \"Media\", \"Insurance\", \"Consulting\", \"Other Services\"]"

def fetch_industry_for_enterprise(enterprise):
  response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
      {
        "role": "system",
        "content": system_prompt_industry
      },
      {
        "role": "user",
        "content": str(enterprise['keywords'])
      }
    ],
    temperature=1,
    max_tokens=256,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0
  )
  response_json = json.loads(response['choices'][0]['message']['content'])
  return response_json

def enrich_enterprise_with_industry(enterprise):
  # Throw exception if keywords array has 0 length
  if(len(enterprise['keywords']) == 0):
    raise Exception("Keywords array is empty for domain " + enterprise["domain"] + '. Make sure to run the first prompt before trying to categorize!')
  # Skip entry if an industry is already present and update is turned off
  elif(enterprise['industry'] != "" and update_industry == False):
    print("Skipping industry update for domain " + enterprise["domain"])
  else:
    industry_response = fetch_industry_for_enterprise(enterprise)
    enterprise['industry'] = industry_response['industry']
    enterprise['confidence_industry'] = industry_response['confidence']
    print("Industry for domain " + enterprise["domain"] + " is " + enterprise['industry'] + " with a confidence of " + str(enterprise['confidence_industry']))
    collection.update_one({"domain": enterprise["domain"]}, {"$set": enterprise})

for enterprise in enterprises:
  enrich_enterprise_with_industry(enterprise)

Industry for domain apple.com is Technology with a confidence of 9
