# Domain Name Synthesizer

This is the creative component of <i>IntelliSearch</i>. It uses Markov chains trained on existing startup names to generate similar-sounding ones. Each startup name is tagged with the industry or industries it belongs to, and these tags are taken into account during generation so that every generated name is automatically categorized under its industry.

### Markov Chain Name Generator By Category

This is a Markov chain name generator that takes the top 70,000 startups on Crunchbase, groups them by industry, and generates available .com domain names relevant to each industry in real time.

The tool could also be extended with additional data and startup descriptions (more on this later).

In [2]:
# Load the dataset: the top 70K companies on Crunchbase,
# retrieved from https://github.com/datahoarder/crunchbase-october-2013

import pandas as pd

df = pd.read_csv('datasets/crunchbase-companies.csv')
df.head()

Unnamed: 0,permalink,name,homepage_url,category_list,funding_total_usd,status,country_code,state_code,region,city,funding_rounds,founded_at,first_funding_at,last_funding_at
0,/organization/-fame,#fame,http://livfame.com,Media,10000000,operating,IND,16,Mumbai,Mumbai,1,,2015-01-05,2015-01-05
1,/organization/-qounter,:Qounter,http://www.qounter.com,Application Platforms|Real Time|Social Network...,700000,operating,USA,DE,DE - Other,Delaware City,2,2014-09-04,2014-03-01,2014-10-14
2,/organization/-the-one-of-them-inc-,"(THE) ONE of THEM,Inc.",http://oneofthem.jp,Apps|Games|Mobile,3406878,operating,,,,,1,,2014-01-30,2014-01-30
3,/organization/0-6-com,0-6.com,http://www.0-6.com,Curated Web,2000000,operating,CHN,22,Beijing,Beijing,1,2007-01-01,2008-03-19,2008-03-19
4,/organization/004-technologies,004 Technologies,http://004gmbh.de/en/004-interact,Software,-,operating,USA,IL,"Springfield, Illinois",Champaign,1,2010-01-01,2014-07-24,2014-07-24


In [3]:
def build_markov_chain(data, n):
    # '_initial' counts the n-character states that start a name;
    # '_names' keeps the training names so duplicates can be rejected later
    chain = {
        '_initial': {},
        '_names': set(data)
    }
    for word in data:
        # '.' marks the end of a name
        word_wrapped = str(word) + '.'
        for i in range(0, len(word_wrapped) - n):
            # consecutive states overlap by n - 1 characters
            state = word_wrapped[i:i + n]
            next_state = word_wrapped[i + 1:i + n + 1]

            if state not in chain:
                entry = chain[state] = {}
            else:
                entry = chain[state]

            if i == 0:
                if state not in chain['_initial']:
                    chain['_initial'][state] = 1
                else:
                    chain['_initial'][state] += 1

            if next_state not in entry:
                entry[next_state] = 1
            else:
                entry[next_state] += 1
    return chain

In [32]:
# Example

chain = build_markov_chain(df['name'].tolist(), 3)
print(chain['sta'])

{'ta.': 50, 'tar': 84, 'ta ': 35, 'tat': 65, 'tac': 21, 'tal': 40, 'tay': 10, 'tau': 21, 'tad': 4, 'taM': 2, 'taS': 5, 'tas': 6, 'tan': 52, 'tam': 6, 'tab': 14, 'tag': 17, 'tak': 1, 'tai': 21, 'taf': 4, 'taT': 2, 'tax': 1, 'ta™': 1, 'tah': 2, 'taq': 2, 'tav': 5, 'taB': 3, 'taC': 1, 'taE': 1, 'taG': 3, 'taJ': 1, 'taL': 1, 'tap': 3, 'taR': 1, 'taF': 1, 'taa': 1}
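The structure of the chain is easier to see on a toy dataset. The sketch below is a compact re-implementation of the same counting logic, not the function used in this notebook: `toy_chain` is a hypothetical helper, and it relies on `collections.Counter` and `defaultdict` instead of plain dicts. It uses two made-up names and `n = 2`.

```python
from collections import Counter, defaultdict

def toy_chain(names, n):
    """Count transitions between overlapping n-character states."""
    initial = Counter()                 # states that start a name
    transitions = defaultdict(Counter)  # state -> counts of following states
    for name in names:
        wrapped = name + '.'            # '.' marks the end of a name
        for i in range(len(wrapped) - n):
            state = wrapped[i:i + n]
            following = wrapped[i + 1:i + n + 1]
            if i == 0:
                initial[state] += 1
            transitions[state][following] += 1
    return initial, transitions

initial, transitions = toy_chain(['anna', 'anne'], 2)
print(dict(initial))
# {'an': 2}
print({state: dict(counts) for state, counts in transitions.items()})
# {'an': {'nn': 2}, 'nn': {'na': 1, 'ne': 1}, 'na': {'a.': 1}, 'ne': {'e.': 1}}
```

Both names share the prefix `an`, so the only branching point is the state `nn`, which is followed by `na` or `ne` with equal weight.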


In [11]:
import random

# select a random key from a dict of counts, weighted by its count
def select_random_item(items):
    rnd = random.random() * sum(items.values())
    for item in items:
        rnd -= items[item]
        if rnd < 0:
            return item
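`select_random_item` draws a key with probability proportional to its count. As a rough cross-check, the sketch below does the same thing with the standard library's `random.choices` and compares the observed frequencies with the expected ones. `weighted_pick` is a hypothetical name, and the counts are three entries taken from the `chain['sta']` output above.

```python
import random
from collections import Counter

def weighted_pick(items, rng=random):
    """Pick a key with probability proportional to its count."""
    keys = list(items)
    return rng.choices(keys, weights=[items[k] for k in keys], k=1)[0]

rng = random.Random(0)  # seeded so the experiment is repeatable
counts = {'tar': 84, 'tat': 65, 'ta.': 50}
draws = Counter(weighted_pick(counts, rng) for _ in range(10_000))

total = sum(counts.values())
for key in counts:
    observed = draws[key] / 10_000
    expected = counts[key] / total
    print(key, round(observed, 3), round(expected, 3))
```

The observed and expected columns should agree to within about a percentage point at this sample size.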

In [12]:
def generate(chain):

    # pick a starting state, weighted by how often each
    # n-character prefix begins a name in the dataset
    state = select_random_item(chain['_initial'])
    result = [state]

    # walk the Markov chain, sampling each next state in
    # proportion to its count, until the end marker '.' is drawn
    while True:
        state = select_random_item(chain[state])
        last_character = state[-1]
        if last_character == '.':
            break
        result.append(last_character)

    # if the generated name is already present in the original
    # dataset, recursively generate another one; otherwise return it
    generated = ''.join(result)
    if generated not in chain['_names']:
        return generated
    else:
        return generate(chain)

In [37]:
generate(chain)

'SensSucuma LaColderscapeution Live Securisor'
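The sample above is far longer than a usable domain name, and `generate` retries duplicates through recursion. One possible variant, sketched below under stated assumptions, abandons a walk as soon as it exceeds a length limit and retries in a loop with a bounded number of attempts. `generate_bounded`, `pick`, `max_attempts`, and the hand-written toy chain are all hypothetical and not part of the original notebook.

```python
import random

def pick(items, rng):
    """Pick a key with probability proportional to its count."""
    keys = list(items)
    return rng.choices(keys, weights=[items[k] for k in keys], k=1)[0]

def generate_bounded(chain, max_length, max_attempts=1000, rng=random):
    """Return a new name of at most max_length characters, or None."""
    for _ in range(max_attempts):
        state = pick(chain['_initial'], rng)
        result = state
        while len(result) <= max_length:
            state = pick(chain[state], rng)
            if state[-1] == '.':
                if result not in chain['_names']:
                    return result
                break  # duplicate of a training name: start a new walk
            result += state[-1]
        # falling out of the while loop means the name grew too long
    return None

# Toy chain for the made-up names 'anna' and 'hanne' with n = 2,
# in the same layout that build_markov_chain produces.
toy = {
    '_initial': {'an': 1, 'ha': 1},
    '_names': {'anna', 'hanne'},
    'ha': {'an': 1},
    'an': {'nn': 2},
    'nn': {'na': 1, 'ne': 1},
    'na': {'a.': 1},
    'ne': {'e.': 1},
}
print(generate_bounded(toy, max_length=4, rng=random.Random(0)))  # anne
```

With `max_length=4` the only reachable name that is not in the training set is `anne`; raising the limit to 5 also allows `hanna`.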

In [18]:
pip install python-whois

Collecting python-whois
  Downloading python-whois-0.7.3.tar.gz (91 kB)
Collecting future
  Downloading future-0.18.2.tar.gz (829 kB)
Building wheels for collected packages: python-whois, future
  Building wheel for python-whois (setup.py) ... done
  Created wheel for python-whois: filename=python_whois-0.7.3-py3-none-any.whl size=87701 sha256=86103d6cc807afabb35d07f486582580d67494044c7ba4750f66109de5a1ea4d
  Stored in directory: /home/jovyan/.cache/pip/wheels/11/05/f7/895ce5a73665f77c8274a7d55e34fb3e6b4abbb9a7637e215b
  Building wheel for future (setup.py) ... done
  Created wheel for future: filename=future-0.18.2-py3-none-any.whl size=491059 sha256=890625227c4244c772b86d2b91dd68fa40c79ba53a8793ff65725ad512f94924
  Stored in directory: /home/jovyan/.cache/pip/wheels/56/b0/fe/4410d17b32f1f0c3cf54cdfb2bc04d7b4b8f4ae377e2229ba0
Succ

In [19]:
import whois

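def isAvailable(domain):
    # A failed WHOIS lookup is treated as "available". This also covers
    # network errors, so a True result is a hint rather than a guarantee.
    try:
        whois.whois(domain)
        return False
    except Exception:
        return True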
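WHOIS lookups are slow and often rate-limited, so a cheap pre-filter can help. The sketch below is one possible approach, not part of the original notebook: a domain that already resolves in DNS is almost certainly registered, so only names that do not resolve need a WHOIS check. It assumes the local resolver does not rewrite failed lookups, and a domain without DNS records may still be registered. `resolves`, `needs_whois`, and `stub_resolver` are hypothetical names; the stub keeps the demonstration offline.

```python
import socket

def resolves(domain, resolver=socket.gethostbyname):
    """Return True if the domain currently resolves to an address."""
    try:
        resolver(domain)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False

def needs_whois(domains, resolver=socket.gethostbyname):
    """Keep only the domains that do not resolve; these still need WHOIS."""
    return [d for d in domains if not resolves(d, resolver)]

# Offline demonstration with a stub resolver (hypothetical data).
def stub_resolver(domain):
    if domain == 'taken.example':
        return '192.0.2.1'
    raise socket.gaierror('name not found')

print(needs_whois(['taken.example', 'free.example'], stub_resolver))
# ['free.example']
```

In the notebook, the surviving candidates would then be passed to `isAvailable` as before.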

In [20]:
df.category_list.unique()

array(['Media', 'Application Platforms|Real Time|Social Network Media',
       'Apps|Games|Mobile', ...,
       'Advertising|Mobile|Web Development|Wireless',
       'Consumer Electronics|Internet of Things|Telecommunications',
       'Consumer Goods|E-Commerce|Internet'], dtype=object)
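Many `category_list` values combine several industries separated by `|`, so an exact string comparison only matches companies tagged with a single category. The sketch below shows one way to select every name whose tag list contains a given category. `names_tagged` is a hypothetical helper, and the sample rows are taken from the `df.head()` and `unique()` outputs above.

```python
def names_tagged(rows, category):
    """Return names whose pipe-separated category list contains `category`."""
    matches = []
    for name, category_list in rows:
        if not isinstance(category_list, str):
            continue  # missing category, e.g. NaN in the DataFrame
        if category in category_list.split('|'):
            matches.append(name)
    return matches

rows = [
    ('#fame', 'Media'),
    (':Qounter', 'Application Platforms|Real Time|Social Network Media'),
    ('(THE) ONE of THEM,Inc.', 'Apps|Games|Mobile'),
    ('004 Technologies', 'Software'),
    ('no-category', float('nan')),  # hypothetical row with a missing value
]
print(names_tagged(rows, 'Games'))  # ['(THE) ONE of THEM,Inc.']
print(names_tagged(rows, 'Media'))  # ['#fame']
```

With the real data, the rows could be supplied as `zip(df['name'], df['category_list'])`. Note that `'Media'` does not match `'Social Network Media'`, because whole tags are compared rather than substrings.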

In [21]:
def generate_amount_by_category(category, amount, max_word_length):
    # category must equal a full value of the category_list column,
    # e.g. 'Enterprise Software'
    chain = build_markov_chain(df[df['category_list'] == category]['name'].tolist(), 3)
    count = 0
    forbidden_chars = ['0','1','2','3','4','5','6','7','8','9',' ',':','(',')', '-', '#', '!']
    while count < amount:
        name = generate(chain)
        if len(name) > max_word_length or any(char in name for char in forbidden_chars):
            continue
        # names containing spaces were rejected above, so lowercasing is enough
        domain = name.lower() + '.com'
        if isAvailable(domain):
            print(domain)
            count += 1

In [41]:
generate_amount_by_category('Enterprise Software', 10, 12)

backtreedsky.com
sunstreat.com
verspikeiron.com
ektreamcorio.com
hyperarise.com
kidardbookia.com
totacopy.com
byteactions.com
omedications.com
virtigua.com
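The `forbidden_chars` list blocks digits and a handful of symbols, but other characters seen in the dataset (such as `™` or `,`) would still pass. An allow-list is one possible alternative. The sketch below accepts only ASCII letters and respects the 63-character limit on a DNS label; `to_domain` and `LABEL_RE` are hypothetical names, and this filter is deliberately stricter than the one used above.

```python
import re

# ASCII letters only, 1 to 63 characters (63 is the DNS label length limit)
LABEL_RE = re.compile(r'[a-z]{1,63}')

def to_domain(name, tld='.com', max_length=12):
    """Return name as a lowercase domain, or None if it is not usable."""
    label = name.lower()
    if len(label) > max_length or not LABEL_RE.fullmatch(label):
        return None
    return label + tld

print(to_domain('Virtigua'))        # virtigua.com
print(to_domain('Live Securisor'))  # None (too long, contains a space)
print(to_domain('0-6'))             # None (digits and hyphen)
```

Hyphens and digits are valid in domain names, so the pattern could be relaxed if such names are wanted.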


## Potential extensions

1. Crunchbase startup investments dataset (more data): https://www.kaggle.com/arindam235/startup-investments-crunchbase
2. YC dataset: https://github.com/andrewzaldivar/YC-HN-Startup-Success/blob/master/Datasets/CleanStartupsFull3.csv

