In [3]:
import pandas as pd
from collections import Counter
from Levenshtein import distance as levenshtein_distance
import dns.resolver
import requests

In [4]:
def average_levenshtein_distance(domain_list):
    total_distance = 0
    num_pairs = 0
    
    for i in range(len(domain_list)):
        for j in range(i + 1, len(domain_list)):
            total_distance += levenshtein_distance(domain_list[i], domain_list[j])
            num_pairs += 1
    
    return total_distance / num_pairs if num_pairs > 0 else 0

In [5]:
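def check_domain_availability(domain):
    """Return 1 if the domain looks unregistered (no A record found), else 0.

    This is a DNS-based heuristic, not a registrar lookup: a registered domain
    without an A record is also reported as available.
    """
    try:
        dns.resolver.resolve(domain, 'A')
        return 0  # an A record exists, so the domain is in use
    except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
        return 1  # no A record, or no such name
    except Exception:
        # Timeouts, unreachable nameservers, etc.: conservatively report as taken
        return 0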

In [6]:
def test_api(business_desc: str):
    print(business_desc)
    print(requests.get(f'http://127.0.0.1:8000/generate_domain_name/{business_desc}').json())
    print()

In [7]:
df = pd.DataFrame({'zero_shot': ['sweetpieemporium.com', 'flakyfingersbakery.com', 'piepassionate.com', 'crustcrafters.com', 'pastriesbypastryqueen.com'], 
                   'few_shot': ["PieParadise.com", "SweetCrust.com", "PieMaker.com", "PieEmporium.com", "PieNirvana.com"],
                   'llm_prompt': ["PiePerfection.com", "SweetPieSensations.com", "PieParadise.com", "CrispyCrustCreations.com", "PiePassionate.com"],
                   'fine_tuned': ["piebaking.co", "piebaking.org", "bakingsolutions.biz", "piehub.io", "piebaking.shop"]})
df = df.reset_index().melt(id_vars=['index']).rename(columns={'variable': 'prompt_type', 'value': 'address'})[['prompt_type', 'address']]
df['address'] = df['address'].str.lower()
df['domain'] = df['address'].str.split('.').str[0]

In [8]:
diversity_dict = {}
distance_dict = {}
availability_dict = {}
for row in df.groupby('prompt_type')['domain'].apply(lambda x: ' '.join(x)).reset_index().itertuples():
    diversity_dict[row.prompt_type] = len(Counter(row.domain.split()).keys()) / len(row.domain.split())
for row in df.groupby('prompt_type')['address'].apply(lambda x: list(x)).reset_index().itertuples():
    distance_dict[row.prompt_type] = average_levenshtein_distance(row.address)
for row in df.groupby('prompt_type')['address'].apply(lambda x: list(x)).reset_index().itertuples():
    availability_dict[row.prompt_type] = sum([check_domain_availability(address) for address in row.address]) / len(row.address)

In [9]:
data = [diversity_dict, distance_dict, availability_dict]
row_names = ['diversity_score', 'levenshteins_distance', 'availability_score']
scores_df = pd.DataFrame(data, index=row_names)

### Testing the API with the Fine-Tuned Model

In [None]:
businesses = ["pet grooming", "hosting services", "baking pies"]
for business in businesses:
    test_api(business)

pet grooming
['groomingpet.io', 'groomingpetcentral.shop', 'petgrooming.online', 'grooming.com', 'petgroomingzone.tech', 'petgrooming.org']

hosting services
['hosting.co', 'hostinghub.online', 'hostingsolutions.org', 'hosting.org', 'hostingservicesnow.site', 'hostingsolutions.tech', 'hostingsolutions.com', 'hostingservices.tech']

baking pies
['piesbaking.site', 'piesbaking.online', 'pies.online', 'piebaking.tech', 'piesbaking.net', 'pie.online', 'pies.co']



# Introduction

The goal of this experiment is to test, evaluate, and implement a feature that suggests a domain name for a business website based on a user-provided business description. To determine the best approach, two options were considered: prompt engineering and fine-tuning an open-source LLM. The widely used Mistral-7B LLM was used for all tests.

### Prompt Engineering

Three tactics were tested for constructing a prompt that gives the best results:
- Zero-shot. No example was given to the LLM. Prompt: _Suggest 5 creative domain names in csv format for a business with a description of: __business_description__._
- Few-shot. Several examples were given for the LLM to follow. Prompt: _If a business description is 'pet grooming' then their domains could be 'HappyTails.com', 'FurCare.com', 'HappyPaws.com' and similar. If a business description is 'web hosting services' then their domain could be 'hostinger.com'. Suggest 5 unique, short and memorable domain names in csv format for a business with a description of: __business_description__ "_
- Automatic prompt. The LLM was asked to propose the best prompt for an LLM to generate domain names. Prompt: _Generate 5 unique domain names in for a __business_description__. The names should be catchy, easy to remember, and relevant to the __business_description__ industry. provide names in csv format._
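
The three tactics above can be sketched as plain string templates. This is a minimal illustration, not the notebook's actual code: the prompt texts are copied verbatim from the list, while `PROMPT_TEMPLATES`, `build_prompt`, and `parse_csv_names` are hypothetical names introduced here.

```python
# Prompt texts copied verbatim from the three tactics listed above.
# The dictionary keys mirror the column names used in the evaluation DataFrame.
# PROMPT_TEMPLATES, build_prompt and parse_csv_names are illustrative names only.
PROMPT_TEMPLATES = {
    "zero_shot": (
        "Suggest 5 creative domain names in csv format for a business "
        "with a description of: {business_description}."
    ),
    "few_shot": (
        "If a business description is 'pet grooming' then their domains could be "
        "'HappyTails.com', 'FurCare.com', 'HappyPaws.com' and similar. "
        "If a business description is 'web hosting services' then their domain "
        "could be 'hostinger.com'. Suggest 5 unique, short and memorable domain "
        "names in csv format for a business with a description of: "
        "{business_description}"
    ),
    "llm_prompt": (
        "Generate 5 unique domain names in for a {business_description}. "
        "The names should be catchy, easy to remember, and relevant to the "
        "{business_description} industry. provide names in csv format."
    ),
}


def build_prompt(prompt_type: str, business_description: str) -> str:
    """Fill the chosen template with the user's business description."""
    return PROMPT_TEMPLATES[prompt_type].format(
        business_description=business_description
    )


def parse_csv_names(response: str) -> list[str]:
    """Split a comma-separated model response into cleaned, lowercased names."""
    return [
        name.strip().strip("'\"").lower()
        for name in response.split(",")
        if name.strip()
    ]
```

Keeping the templates in one dictionary makes it easy to run the same business description through every tactic and compare the outputs side by side.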

### Fine Tuning

A pretrained LLM and a small dataset of 1,000 {business description : domain name} pairs were used to fine-tune the model. Data quality plays a crucial role in fine-tuning an LLM successfully. Fine-tuning is also computationally expensive, but it can be optimized in several ways, such as splitting the training process into smaller chunks, which reduces the memory required at the cost of longer training time.
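
One common technique that fits this memory-for-time trade-off is gradient accumulation, where gradients from several small chunks are summed before a single weight update. The text does not say which technique was actually used, so the toy example below is only an assumption-laden sketch: a one-parameter linear model, with all function names invented for illustration. It shows that chunked accumulation yields the same gradient as processing the whole batch at once.

```python
# Toy illustration of gradient accumulation on a 1-D linear model y = w * x.
# Assumption: this is one possible reading of "splitting the training process
# into chunks"; the write-up does not name the technique that was used.

def squared_error_gradient_sum(w, xs, ys):
    """Sum (not mean) of d/dw (w*x - y)^2 over the given samples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))


def full_batch_gradient(w, xs, ys):
    """Mean gradient computed with every sample processed in one go."""
    return squared_error_gradient_sum(w, xs, ys) / len(xs)


def accumulated_gradient(w, xs, ys, micro_batch_size):
    """Same mean gradient, but only one small chunk is processed at a time."""
    total = 0.0
    for start in range(0, len(xs), micro_batch_size):
        chunk_x = xs[start:start + micro_batch_size]
        chunk_y = ys[start:start + micro_batch_size]
        # Accumulate the chunk's contribution; no weight update happens yet.
        total += squared_error_gradient_sum(w, chunk_x, chunk_y)
    return total / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Both paths agree up to floating-point rounding.
print(full_batch_gradient(w, xs, ys), accumulated_gradient(w, xs, ys, 2))
```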

### Experiment Results

All four approaches (three prompts and the fine-tuned model) appeared to do reasonably well. To evaluate the results, three metrics were chosen: Levenshtein distance, diversity score, and availability score. Levenshtein distance is a string metric that measures the difference between two sequences as the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other. Applying it to every pair of suggestions from one approach and averaging gives a measure of how different the suggested domain names are from each other. Together with the diversity score (unique domain names, ignoring the extension, divided by total suggestions), it shows how much freedom the approach has in generating names. Whether or not this is desirable, the zero-shot approach produced the most varied suggestions, which makes sense, since no example was provided for the LLM to follow.
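
For reference, the two string-based scores computed in the notebook can be written as follows. The notation is introduced here and is not part of the original: $a_1, \dots, a_n$ are the $n$ suggested addresses from one approach, $d_1, \dots, d_n$ are the same names with the extension removed, and $\mathrm{lev}$ is the Levenshtein distance.

$$
\bar{d} = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \mathrm{lev}(a_i, a_j),
\qquad
\text{diversity} = \frac{\left|\{d_1, \dots, d_n\}\right|}{n}
$$

With $n = 5$ suggestions per approach, the average runs over $\binom{5}{2} = 10$ pairs. As a worked example of the diversity score, the fine-tuned suggestions `piebaking.co`, `piebaking.org`, `bakingsolutions.biz`, `piehub.io`, and `piebaking.shop` contain three distinct names (`piebaking`, `bakingsolutions`, `piehub`), so the score is $3/5 = 0.6$.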

In [10]:
scores_df

| | few_shot | fine_tuned | llm_prompt | zero_shot |
|---|---|---|---|---|
| diversity_score | 1.0 | 0.6 | 1.0 | 1.0 |
| levenshteins_distance | 7.8 | 9.5 | 11.5 | 14.9 |
| availability_score | 0.4 | 1.0 | 0.6 | 0.8 |


The availability score is the fraction of suggested domains that are not taken at the moment of the check. The checks showed that prompt-engineered names were more often already taken (scores of 0.4 to 0.8) than names from the fine-tuned model (1.0).

It is worth noting that Levenshtein distance could also be used to filter the suggestions. For example, we would like to avoid domains that are similar to well-known brands, company names, and domains. If a suggestion has a low distance from a well-known domain, it should be considered for exclusion. Even when a suggestion does not exactly match a known brand, being very close to one makes it look like the kind of scam website often used for phishing.
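
A minimal sketch of such a filter is shown below. The brand list and the distance threshold are hypothetical values chosen only for illustration, and the function names are invented here. The distance is implemented in pure Python so the example runs on its own; the notebook uses the `Levenshtein` package for the same computation.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution (0 if equal)
            ))
        previous = current
    return previous[-1]


# Hypothetical brand list and threshold, for illustration only.
KNOWN_BRANDS = ["google", "amazon", "hostinger"]
MAX_SUSPICIOUS_DISTANCE = 2


def too_close_to_brand(address: str) -> bool:
    """True if the name (without extension) is within the threshold of any brand."""
    name = address.lower().split(".")[0]
    return any(
        levenshtein(name, brand) <= MAX_SUSPICIOUS_DISTANCE
        for brand in KNOWN_BRANDS
    )


def filter_suggestions(addresses):
    """Drop suggestions that look like near-copies of a known brand."""
    return [address for address in addresses if not too_close_to_brand(address)]


print(filter_suggestions(["gooogle.com", "piehub.io", "amazan.net"]))
# prints ['piehub.io']
```

A fixed threshold is the simplest choice; for short brand names a threshold relative to the name length might be more appropriate, since two edits can change most of a four-letter name.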

### Recommendations

Prompt engineering is a good option for rapid prototyping and generating domain names quickly. It's flexible and doesn't require a large dataset.

Fine-tuning with a custom dataset is a powerful technique that can improve creativity and relevance, yielding domain names that are more contextually appropriate and unique. However, high-quality data may not be available. Together with the higher cost of training the model, this makes the approach best suited to a long-term, high-importance feature.

A good domain name should be short and memorable, and should not closely resemble popular, already-taken domains, so evaluating each suggested name is crucial. Availability is even more important: suggesting an already-taken domain would be counterproductive. Unfortunately, this check has to happen after the LLM produces its suggestions, since the model cannot be trained to suggest only free domains.
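
The post-generation check described above could look roughly like this. It is a sketch under stated assumptions: `keep_available` is a hypothetical helper, and the availability check is passed in as a callable so that a DNS-based function such as the notebook's `check_domain_availability` (wrapped as `lambda a: check_domain_availability(a) == 1`) or a registrar lookup can be plugged in.

```python
from typing import Callable, Iterable


def keep_available(
    suggestions: Iterable[str],
    is_available: Callable[[str], bool],
    limit: int = 5,
) -> list[str]:
    """Return up to `limit` unique suggestions that pass the availability check."""
    seen = set()
    kept = []
    for address in suggestions:
        address = address.strip().lower()
        if not address or address in seen:
            continue  # skip blanks and duplicates before spending a lookup
        seen.add(address)
        if is_available(address):
            kept.append(address)
        if len(kept) == limit:
            break
    return kept


# Stand-in checker with made-up data; a real one would query DNS or a registrar.
pretend_taken = {"taken-name.com"}
print(keep_available(
    ["Taken-Name.com", "free-name.org", "free-name.org"],
    lambda address: address not in pretend_taken,
))
# prints ['free-name.org']
```

Asking the model for more names than are ultimately shown leaves room for this filter to discard taken or duplicate suggestions.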

Additional information from the user, gathered through feedback or tracking (for example, preferred extensions such as .com or .net), could be useful to further fine-tune the results.
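
One lightweight way to use such a preference without retraining the model is to re-rank the suggestions after generation. The helper below is a hypothetical sketch, not part of the original project; it moves names with the user's preferred extensions to the front while keeping the model's original order otherwise.

```python
def rank_by_preferred_extensions(addresses, preferred):
    """Stable sort: preferred extensions first, in the user's order of preference."""
    # Map each preferred extension (with or without a leading dot) to its rank.
    order = {ext.lower().lstrip("."): rank for rank, ext in enumerate(preferred)}

    def sort_key(address):
        extension = address.rsplit(".", 1)[-1].lower()
        # Extensions the user did not mention all share the lowest priority.
        return order.get(extension, len(order))

    return sorted(addresses, key=sort_key)


suggestions = ["hosting.co", "hostinghub.online", "hostingsolutions.com", "hosting.org"]
print(rank_by_preferred_extensions(suggestions, [".com", ".org"]))
# prints ['hostingsolutions.com', 'hosting.org', 'hosting.co', 'hostinghub.online']
```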

Both the fine-tuned and base models can be deployed and exposed via an API for easy access by others.

### Running the API Locally

To run the API locally:
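- `python -m venv venv`
- activate the virtual environment (`source venv/bin/activate` on Linux/macOS, `venv\Scripts\activate` on Windows)
- `pip install -r requirements.txt`
- `uvicorn main:app --reload`
- open http://localhost:8000/docs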
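
Once the server is running, the endpoint can also be called from a script. The sketch below uses only the standard library and assumes the `/generate_domain_name/{business_desc}` route used by `test_api` earlier in the notebook; `endpoint_url` and `suggest_domains` are hypothetical helper names.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # address served by the uvicorn command above


def endpoint_url(business_desc: str, base_url: str = BASE_URL) -> str:
    """Build the request URL, percent-encoding spaces in the description."""
    return f"{base_url}/generate_domain_name/{quote(business_desc, safe='')}"


def suggest_domains(business_desc: str) -> list:
    """Call the running API and return the parsed JSON response."""
    with urlopen(endpoint_url(business_desc)) as response:
        return json.load(response)


print(endpoint_url("baking pies"))
# prints http://localhost:8000/generate_domain_name/baking%20pies
```

Calling `suggest_domains("baking pies")` requires the local server to be up; building the URL does not.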

Technically, the fine-tuned model is hosted on Hugging Face Spaces (https://huggingface.co/spaces/Rytis-J/domain-generator), but on the free tier, running on limited CPU, it is practically unusable.

It does work, though:
![Hosted domain generator](data/hosted_domain_generator.png)