## Process Dataset

The Dmoz dataset (https://dmoz-odp.org/) assigns one of 15 categories to each domain. This notebook cleans and filters it for the prediction model in [create_model.ipynb](create_model.ipynb).

In [3]:
import pandas as pd 
import numpy as np

from sklearn.model_selection import train_test_split

In [8]:
df = pd.read_csv('dmoz.csv', index_col=0, header=None, names=['domain', 'category'])

In [9]:
df.sample(5)

Unnamed: 0,domain,category
1195983,http://www.sungrubbies.com,Shopping
739905,http://www.breastimagingcenter.com/,Health
317752,http://www.sgdphoto.com/,Business
376434,http://www.szborui.com,Business
1153724,http://www.absinth24.net,Shopping


In [10]:
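def process_domain(row):
    # Non-string entries (e.g. NaN) become an empty string
    if not isinstance(row, str):
        return ''
    # Drop 'http://', 'www.' and '.html' wherever they occur in the string
    row = row.replace('http://', '').replace('www.', '').replace('.html', '')
    # endswith() is safe on an empty string, unlike row[-1]
    if row.endswith('/'):
        row = row[:-1]
    return row.lower()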
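
As a quick spot check of the normalization step, the sketch below runs a self-contained copy of the function on a few inputs. The example URLs are hypothetical and not taken from the dataset. The copy uses `endswith('/')` so that a string left empty after the replacements does not raise an error.

```python
def process_domain(url):
    # Self-contained copy of the normalization step, for illustration only.
    if not isinstance(url, str):
        return ''
    url = url.replace('http://', '').replace('www.', '').replace('.html', '')
    if url.endswith('/'):
        url = url[:-1]
    return url.lower()

# Hypothetical inputs: a typical URL, a URL with a page, a missing value,
# and a string that is empty once the scheme is removed.
examples = ['http://www.Example.com/', 'http://example.org/index.html',
            float('nan'), 'http://']
cleaned = [process_domain(u) for u in examples]
print(cleaned)
```

Note that `https://` prefixes are not handled by this step; whether the dataset contains any is not checked here.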

In [11]:
df.domain = df.domain.apply(process_domain)

In [12]:
df.sample(5)

Unnamed: 0,domain,category
100346,dslextreme.com/users/lisam9/tsui,Arts
992236,internetschoolhouse.com,Reference
620087,icann.org/committees/idn,Computers
8600,nudeblack-girls.com/index,Adult
1507553,pensacolagreyhoundpark.com,Sports


In [16]:
# Note: this name shadows Python's built-in filter()
def filter(row):
    # Keep only short domains (at most 30 characters)
    if len(row) > 30:
        return False
    # Reject entries with a path, more than one dot (e.g. a subdomain) or a query string
    if '/' in row or row.count('.') > 1 or '?' in row:
        return False
    # only domains with .com suffix are selected
    return row.endswith('.com')
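
To make the filter rules concrete, here is a minimal sketch that applies the same conditions to a handful of inputs. The name `keep_domain` and all example domains are hypothetical and chosen only for illustration.

```python
def keep_domain(domain):
    # Same rules as the filter: at most 30 characters, no path,
    # at most one dot, no query string, and a .com suffix.
    if len(domain) > 30:
        return False
    if '/' in domain or domain.count('.') > 1 or '?' in domain:
        return False
    return domain.endswith('.com')

# Hypothetical inputs covering each rule.
examples = {
    'example.com': keep_domain('example.com'),              # kept
    'blog.example.com': keep_domain('blog.example.com'),    # more than one dot
    'example.com/about': keep_domain('example.com/about'),  # contains a path
    'example.org': keep_domain('example.org'),              # not a .com domain
    'a' * 27 + '.com': keep_domain('a' * 27 + '.com'),      # 31 characters
    '': keep_domain(''),                                    # empty string
}
for domain, kept in examples.items():
    print(domain, kept)
```

Only the first entry passes; every other one is rejected by exactly one of the rules.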

In [14]:
mask = df.domain.apply(filter)
df = df[mask]
df = df.reset_index(drop=True)

In [15]:
df.sample(5)

Unnamed: 0,domain,category
480851,cropcircle-archive.com,Society
108612,gzbosi.com,Business
403799,bineshiiwildrice.com,Shopping
113948,lucearchitects.com,Business
295738,lebaronfamily.com,Health


In [None]:
df.to_csv('domain_category_dataset.csv')
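
By default, `to_csv` also writes the index as an unnamed first column. The sketch below shows one way a downstream notebook might read the file back without ending up with an extra `Unnamed: 0` column. It uses a tiny hypothetical frame and an in-memory buffer instead of the real file.

```python
import io
import pandas as pd

# Tiny hypothetical frame standing in for the filtered dataset.
df = pd.DataFrame({'domain': ['example.com', 'sample.com'],
                   'category': ['Business', 'Arts']})

buffer = io.StringIO()
df.to_csv(buffer)   # the index is written as an unnamed first column
buffer.seek(0)

# index_col=0 reads that first column back as the index.
restored = pd.read_csv(buffer, index_col=0)
print(restored.columns.tolist())
```

Passing `index=False` to `to_csv` would be an alternative if the index is not needed later.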