# News Content Extractor

In this project, my goal is to build a model that extracts the title and the main textual content from a news web page. When we crawl websites to build a search engine or for other purposes, extraction turns out to be very difficult because of noisy content that is unrelated to the main article. Such a model would be useful on those occasions.

## Approach
When we visit a website, we can easily distinguish the title and the content of the main article. My goal is to give the machine that same sense.

In my opinion, we reach that conclusion based on visual perception: for example, the position on the display and the size, color, and weight of the text.

On top of that, I think developers adopt similar practices when building their web pages. There may be similarities such as using the same HTML tags to create similar components, similar hierarchical structures of HTML elements, and so on.

Therefore, my model will be trained to capture that visual sense and those similarities.

One thing to note is that I will train two models, because I am looking for both the title and the content of the page.

First, I will train the model on a dataset extracted from Mongolian news websites, because those were convenient for me. I hope it will work for any language because HTML is universal. Once I reach my goal, I will retrain the model on websites in other languages if necessary.

## Dataset

To create the training dataset, I wrote a very simple JavaScript (TypeScript) web scraper that opens a browser (using [puppeteer](https://github.com/GoogleChrome/puppeteer)), loads websites, and collects HTML elements and their attributes. You can find the web scraper and its usage [here](../webscraper).

#### Attributes
**site**: the name of the website from which the element is extracted<br/>
**url**: the URL of the page from which the element is extracted<br/>
**tagName**: the element's HTML tag<br/>
**left**: X coordinate of the top-left point of the element on the page<br/>
**top**: Y coordinate of the top-left point of the element on the page<br/>
**width**: the element's width on the page<br/>
**height**: the element's height on the page<br/>
**children**: count of direct child elements<br/>
**textCount**: length of the text in the element<br/>
**parentCount**: count of the ancestor elements<br/>
**fontSize**: font size of the text<br/>
**linkCount**: count of the &#60;a&#62; elements in the element<br/>
**paragraphCount**: count of the &#60;p&#62; elements in the element<br/>
**imageCount**: count of the &#60;img&#62; elements in the element<br/>
**colorRed**: the red component of the RGB color of the text in the element<br/>
**colorGreen**: the green component of the RGB color of the text in the element<br/>
**colorBlue**: the blue component of the RGB color of the text in the element<br/>
**backgroundRed**: the red component of the RGB background color of the element<br/>
**backgroundGreen**: the green component of the RGB background color of the element<br/>
**backgroundBlue**: the blue component of the RGB background color of the element<br/>
**backgroundAlpha**: the transparency (alpha) component of the background of the element<br/>
**textAlign**: alignment of the text in the element<br/>
**marginTop**: top margin of the element<br/>
**marginRight**: right margin of the element<br/>
**marginBottom**: bottom margin of the element<br/>
**marginLeft**: left margin of the element<br/>
**paddingTop**: top padding of the element<br/>
**paddingRight**: right padding of the element<br/>
**paddingBottom**: bottom padding of the element<br/>
**paddingLeft**: left padding of the element<br/>
**descendants**: count of descendant elements<br/>
**relPosX**: horizontal position relative to the page (in the sample rows below, *left* divided by the page width)<br/>
**relPosY**: vertical position relative to the page<br/>
**title**: whether the element is the title of the article<br/>
**content**: whether the element is the main content of the article

The dataset is highly imbalanced: on each web page only one element is the title and one element is the content, while a page contains from several hundred to several thousand elements.
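
To make the imbalance concrete, here is a minimal sketch that measures the share of positive rows. It runs on a toy frame whose values are invented for illustration; only the column names `site`, `title`, and `content` come from the dataset.

```python
import pandas as pd

# Toy stand-in for the scraped data: two sites, five elements each.
# The values are made up; only the column names match the real dataset.
toy = pd.DataFrame({
    "site":    ["a"] * 5 + ["b"] * 5,
    "title":   [False, True, False, False, False] * 2,
    "content": [False, False, True, False, False] * 2,
})

# Fraction of rows that belong to each positive class.
positive_share = toy[["title", "content"]].mean()
print(positive_share)

# Positives per site: ideally exactly one title and one content element.
per_site = toy.groupby("site")[["title", "content"]].sum()
print(per_site)
```

On the real data the same two calls would show how far below one percent the positive share is, and whether any site is missing a labelled title or content element.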

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("../webscraper/out.csv", quotechar='"', skipinitialspace=True)

print(data.columns)
print(data.shape)
print(data['site'].unique())

Index(['site', 'url', 'tagName', 'left', 'top', 'width', 'height', 'children',
       'textCount', 'parentCount', 'fontSize', 'linkCount', 'paragraphCount',
       'imageCount', 'colorRed', 'colorGreen', 'colorBlue', 'backgroundRed',
       'backgroundGreen', 'backgroundBlue', 'backgroundAlpha', 'textAlign',
       'marginTop', 'marginRight', 'marginBottom', 'marginLeft', 'paddingTop',
       'paddingRight', 'paddingBottom', 'paddingLeft', 'descendants',
       'relPosX', 'relPosY', 'title', 'content'],
      dtype='object')
(101922, 35)
['ikon' 'gogo' 'news' 'peak' 'polit' 'zuv' 'updown' 'newspress' 'gereg'
 'nertur' 'livetv' 'sonin' 'olloo' 'itoim' 'medee' 'arslan' 'udriintoim'
 'mongolcom' 'news1' 'zarig' 'sosa' 'dardas' 'mminfo' 'asuudal' 'zindaa'
 'seruuleg' 'newsmedia' 'bolod' 'inews' 'unen' 'paparatsi' 'unuudur'
 'niigmiintoli' '24barimt' 'zaluu' 'amjilt' 'tur' 'fact' 'shuurhai'
 'control' 'jirgee' 'tonshuul' 'mongolcomment' 'scandal' 'miss' 'ontslokh'
 'inet' 'kingnews' 'tusgaa



## Data Preprocessing

In this phase, I will prepare the dataset for training.

### 1. Removing fields that are not useful
Of course, the model should work independently of the website, so I will remove the *url* attribute. The *site* attribute will be used when separating the training and testing datasets, so I will keep it for now.

In [2]:
data=data.drop(['url'],axis=1)
data.head()

Unnamed: 0,site,tagName,left,top,width,height,children,textCount,parentCount,fontSize,...,marginLeft,paddingTop,paddingRight,paddingBottom,paddingLeft,descendants,relPosX,relPosY,title,content
0,ikon,DIV,0.0,35.0,1920,0,0,0,1,12.0,...,0,0,0,0,0,0,0.0,0.004828,False,False
1,ikon,DIV,0.0,35.0,1920,7060,2,2111,1,12.0,...,0,0,0,0,0,430,0.0,0.004828,False,False
2,ikon,DIV,0.0,0.0,1920,60,1,5,2,12.0,...,0,0,0,0,0,18,0.0,0.0,False,False
3,ikon,DIV,240.0,0.0,1440,60,3,5,3,12.0,...,240,0,0,0,0,17,0.125,0.0,False,False
4,ikon,DIV,258.0,10.0,103,38,1,0,4,12.0,...,18,0,0,0,0,3,0.134375,0.001379,False,False


### 2. Mapping boolean classes to numerical classes
The *title* and *content* columns are the classes we need to predict. They hold boolean values, which classifiers can handle directly, so technically I do not need to map them to numerical values. I will map them to 0 and 1 just for convenience.

In [3]:
class_mapping={False:0,True:1}
data['title']=data['title'].map(class_mapping)
data['content']=data['content'].map(class_mapping)
data.head()

Unnamed: 0,site,tagName,left,top,width,height,children,textCount,parentCount,fontSize,...,marginLeft,paddingTop,paddingRight,paddingBottom,paddingLeft,descendants,relPosX,relPosY,title,content
0,ikon,DIV,0.0,35.0,1920,0,0,0,1,12.0,...,0,0,0,0,0,0,0.0,0.004828,0,0
1,ikon,DIV,0.0,35.0,1920,7060,2,2111,1,12.0,...,0,0,0,0,0,430,0.0,0.004828,0,0
2,ikon,DIV,0.0,0.0,1920,60,1,5,2,12.0,...,0,0,0,0,0,18,0.0,0.0,0,0
3,ikon,DIV,240.0,0.0,1440,60,3,5,3,12.0,...,240,0,0,0,0,17,0.125,0.0,0,0
4,ikon,DIV,258.0,10.0,103,38,1,0,4,12.0,...,18,0,0,0,0,3,0.134375,0.001379,0,0


### 3. Non-numerical values

Now I will make sure that no feature contains non-numerical data.

In [4]:
print(data.dtypes.unique())
data.columns[data.dtypes=='O']

[dtype('O') dtype('float64') dtype('int64')]


Index(['site', 'tagName', 'textAlign', 'marginTop', 'marginRight',
       'marginBottom', 'marginLeft', 'paddingTop', 'paddingRight',
       'paddingBottom', 'paddingLeft'],
      dtype='object')

The *site* column will not be used as a training feature, so we can ignore it for now. *tagName* and *textAlign* are categorical features, and I will deal with them later. That leaves the **margin** and **padding** features, which are supposed to hold continuous values but contain non-numerical data.

So let's look at what those non-numerical values are.

In [5]:
#margins are not numeric
from collections import Counter
nan_columns=['marginTop','marginRight','marginBottom','marginLeft','paddingTop','paddingRight','paddingBottom','paddingLeft']
for c in nan_columns:
    string_values=[]
    s=data[c]
    for index, value in s.items():
        try:
            float(value)
        except ValueError:
            string_values.append(value)
    print(c,Counter(string_values))

marginTop Counter({'auto': 44, '10%': 2, '15%': 1})
marginRight Counter({'auto': 497, '15%': 4, '2%': 3, '-100%': 2, '0.95%': 1, '5%': 1, '1%': 1})
marginBottom Counter({'auto': 47, '2%': 15, '10%': 1})
marginLeft Counter({'auto': 505, '2%': 14, '15%': 4, '-100%': 3, 'calc(-10% + 84.48)': 1, '3.4%': 1, '-50%': 1, '8.33333%': 1, '5%': 1, '75%': 1, '1%': 1})
paddingTop Counter({'75%': 7, '60%': 2, '1%': 1, '62%': 1, '67%': 1, '56.1912%': 1, '56.25%': 1})
paddingRight Counter({'2%': 2, '3%': 1})
paddingBottom Counter({'69.2308%': 12, '67%': 10, '50%': 8, '56%': 3, '75%': 2, '1%': 1, '3%': 1, '83.3333%': 1, '70%': 1, '62%': 1, '61%': 1, '65%': 1, '66%': 1, '100%': 1})
paddingLeft Counter({'2%': 2, '3%': 1})


For the margins and padding, these non-numerical values can be replaced with the most frequent value of each column.

In [6]:
for c in nan_columns:
    data[c]=pd.to_numeric(data[c],errors='coerce')

In [7]:
print(data[nan_columns].mode())
data[nan_columns] = data[nan_columns].fillna(data[nan_columns].mode().iloc[0])
data.head()

   marginTop  marginRight  marginBottom  marginLeft  paddingTop  paddingRight  \
0        0.0          0.0           0.0         0.0         0.0           0.0   

   paddingBottom  paddingLeft  
0            0.0          0.0  


Unnamed: 0,site,tagName,left,top,width,height,children,textCount,parentCount,fontSize,...,marginLeft,paddingTop,paddingRight,paddingBottom,paddingLeft,descendants,relPosX,relPosY,title,content
0,ikon,DIV,0.0,35.0,1920,0,0,0,1,12.0,...,0.0,0.0,0.0,0.0,0.0,0,0.0,0.004828,0,0
1,ikon,DIV,0.0,35.0,1920,7060,2,2111,1,12.0,...,0.0,0.0,0.0,0.0,0.0,430,0.0,0.004828,0,0
2,ikon,DIV,0.0,0.0,1920,60,1,5,2,12.0,...,0.0,0.0,0.0,0.0,0.0,18,0.0,0.0,0,0
3,ikon,DIV,240.0,0.0,1440,60,3,5,3,12.0,...,240.0,0.0,0.0,0.0,0.0,17,0.125,0.0,0,0
4,ikon,DIV,258.0,10.0,103,38,1,0,4,12.0,...,18.0,0.0,0.0,0.0,0.0,3,0.134375,0.001379,0,0
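
The two steps above (coerce to numeric, then fill with the mode) can be sketched on a toy column. The values are invented for illustration; they only mimic the mix of pixel values, `auto`, and percentages seen in the counters above.

```python
import pandas as pd

# Toy margin column mixing pixel values with CSS keywords and percentages.
margins = pd.Series(["0", "18", "auto", "10%", "0", "240"])

# Anything that is not a plain number becomes NaN.
numeric = pd.to_numeric(margins, errors="coerce")

# Replace NaN with the most frequent numeric value (the mode).
filled = numeric.fillna(numeric.mode().iloc[0])
print(filled.tolist())
```

Note that this discards the information carried by `auto` and percentage values. That seems acceptable here because they affect only a small fraction of the rows.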


## Models
From this point, I will split the dataset for the two models: one for the title and one for the content.

In [8]:
data_title=data.drop(['content'],axis=1)
data_content=data.drop(['title'],axis=1)

# Training model for 'Title'

### 1. Categorical features
My dataset has two categorical features, *tagName* and *textAlign*, and both are nominal, so I will use one-hot encoding. I will also remove the most frequent dummy feature of each, so that a tag name or text alignment first seen in the testing or production phase can simply be ignored: it is encoded as all zeros, just like the most frequent category.

But before doing that, to reduce dimensionality, I can filter the tag names and remove the rows whose tags will never get the positive class.
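
As a rough sketch of how an unseen category could be handled at prediction time, the example below aligns newly encoded data to the training columns. The toy frames and the tag `CUSTOM-TAG` are hypothetical, and the `reindex` step is an assumption about how production data might be processed; it is not part of this notebook.

```python
import pandas as pd

train = pd.DataFrame({"tagName": ["DIV", "H1", "DIV", "P"]})
new = pd.DataFrame({"tagName": ["H1", "CUSTOM-TAG", "DIV"]})

# One-hot encode the training data, then drop the most frequent category.
most_frequent = train["tagName"].value_counts().idxmax()
train_dummy = pd.get_dummies(train, columns=["tagName"]).drop(
    columns=["tagName_" + most_frequent]
)

# Encode new data and align it to the training columns. Categories that
# were not seen in training (and the dropped one) end up as all zeros.
new_dummy = pd.get_dummies(new, columns=["tagName"]).reindex(
    columns=train_dummy.columns, fill_value=0
)
print(new_dummy.astype(int))
```

Here both `CUSTOM-TAG` and `DIV` become rows of zeros, so the model treats an unknown tag like the baseline category.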

In [9]:
# some tag names are lowercase. it is better all of them are uppercase.
data_title['tagName']=data_title['tagName'].str.upper()
data_title['tagName'].unique()

array(['DIV', 'A', 'SPAN', 'IMG', 'H1', 'P', 'STRONG', 'H2', 'EM', 'H4',
       'TABLE', 'TBODY', 'TR', 'TD', 'IFRAME', 'BR', 'SCRIPT', 'INPUT',
       'LABEL', 'TEXTAREA', 'BUTTON', 'I', 'NOSCRIPT', 'SECTION', 'SVG',
       'RECT', 'UL', 'LI', 'FORM', 'CIRCLE', 'PATH', 'NAV', 'LINE', 'HR',
       'ARTICLE', 'H5', 'FIELDSET', 'FOOTER', 'HEADER', 'H6', 'H3',
       'SMALL', 'VIDEO', 'SOURCE', 'ASIDE', 'STYLE', 'SUP', 'TITLE', 'G',
       'BLOCKQUOTE', 'FIGURE', 'FIGCAPTION', 'PROGRESS', 'B', 'OL',
       'CENTER', 'META', 'TEXT', 'TIME', 'INS', 'LINK', 'SELECT',
       'OPTION', 'ABBR', 'MARQUEE', 'U', 'FONT', 'MAIN', 'AMP-ANALYTICS',
       'DEFS', 'THEAD', 'TH', 'MAP', 'AREA', 'AUDIO', 'CLIPPATH',
       'LEGEND', 'CITE', 'XDOOR-ICON', 'SYMBOL', 'TWITTER-WIDGET', 'USE',
       'PICTURE', 'QMT_START', 'QMT_END', 'DL', 'DT', 'S', 'DD', 'SUB',
       'SBSTICKY', 'NOBR', 'DAC-IVT-OGV', 'ADDRESS', 'FB:LIKE', 'POLYGON',
       'FB:RECOMMENDATIONS-BAR', 'ELLIPSE', 'LINEARGRADIENT', 'STOP',
 

In [10]:
tag_blacklist=['A','IMG','TABLE','TBODY','TR','IFRAME','BR','SCRIPT','INPUT','LABEL','TEXTAREA','BUTTON','NOSCRIPT','SVG','RECT','UL','LI','FORM','CIRCLE','PATH',
              'NAV','LINE','HR','ARTICLE','FIELDSET','FOOTER', 'HEADER',
       'SMALL', 'VIDEO', 'SOURCE', 'ASIDE', 'STYLE', 'SUP', 'G',
       'BLOCKQUOTE', 'FIGURE', 'FIGCAPTION', 'PROGRESS',  'OL',
         'META', 'TEXT', 'TIME', 'INS', 'LINK', 'SELECT',
       'OPTION', 'ABBR', 'MARQUEE', 'U',  'MAIN', 'AMP-ANALYTICS',
       'DEFS', 'THEAD', 'TH', 'MAP', 'AREA', 'AUDIO', 'CLIPPATH',
       'LEGEND', 'CITE', 'XDOOR-ICON', 'SYMBOL', 'TWITTER-WIDGET', 'USE',
       'PICTURE', 'QMT_START', 'QMT_END', 'DL', 'DT', 'S', 'DD', 'SUB',
       'SBSTICKY', 'NOBR', 'DAC-IVT-OGV', 'ADDRESS', 'FB:LIKE', 'POLYGON',
       'FB:RECOMMENDATIONS-BAR', 'ELLIPSE', 'LINEARGRADIENT', 'STOP',
       'RADIALGRADIENT', 'TQWIDGET', 'IMAGE', 'MENU', 'VIDEOPLAYER',
       'AMP-ANIMATION', 'AMP-POSITION-OBSERVER', 'AMP-IMG',
       'I-AMPHTML-SIZER', 'AMP-LIST', 'TEMPLATE', 'AMP-SIDEBAR', 'AMP-AD',
       'AMP-SOCIAL-SHARE', 'AMP-IFRAME', 'I-AMPHTML-SCROLL-CONTAINER',
       'AMP-FACEBOOK-COMMENTS', 'AMP-EMBED', 'AMP-STICKY-AD',
       'AMP-STICKY-AD-TOP-PADDING', 'PRE', 'OBJECT', 'POLYLINE']
data_title=data_title[~data_title['tagName'].isin(tag_blacklist)]
data_title['tagName']
data_title.shape

(51526, 33)

In [11]:
print(data_title['tagName'].value_counts().idxmax())
print(data_title['textAlign'].value_counts().idxmax())
data_title_dummy=pd.get_dummies(data_title,columns=['tagName','textAlign'])
print(data_title_dummy.columns)

DIV
start
Index(['site', 'left', 'top', 'width', 'height', 'children', 'textCount',
       'parentCount', 'fontSize', 'linkCount', 'paragraphCount', 'imageCount',
       'colorRed', 'colorGreen', 'colorBlue', 'backgroundRed',
       'backgroundGreen', 'backgroundBlue', 'backgroundAlpha', 'marginTop',
       'marginRight', 'marginBottom', 'marginLeft', 'paddingTop',
       'paddingRight', 'paddingBottom', 'paddingLeft', 'descendants',
       'relPosX', 'relPosY', 'title', 'tagName_B', 'tagName_CENTER',
       'tagName_DIV', 'tagName_EM', 'tagName_FONT', 'tagName_H1', 'tagName_H2',
       'tagName_H3', 'tagName_H4', 'tagName_H5', 'tagName_H6', 'tagName_I',
       'tagName_P', 'tagName_SECTION', 'tagName_SPAN', 'tagName_STRONG',
       'tagName_TD', 'tagName_TITLE', 'textAlign_-webkit-center',
       'textAlign_-webkit-left', 'textAlign_-webkit-right', 'textAlign_center',
       'textAlign_justify', 'textAlign_left', 'textAlign_right',
       'textAlign_start'],
      dtype='object')


In [12]:
data_title_dummy=data_title_dummy.drop(['tagName_DIV','textAlign_start'],axis=1)
print(data_title_dummy.columns)

Index(['site', 'left', 'top', 'width', 'height', 'children', 'textCount',
       'parentCount', 'fontSize', 'linkCount', 'paragraphCount', 'imageCount',
       'colorRed', 'colorGreen', 'colorBlue', 'backgroundRed',
       'backgroundGreen', 'backgroundBlue', 'backgroundAlpha', 'marginTop',
       'marginRight', 'marginBottom', 'marginLeft', 'paddingTop',
       'paddingRight', 'paddingBottom', 'paddingLeft', 'descendants',
       'relPosX', 'relPosY', 'title', 'tagName_B', 'tagName_CENTER',
       'tagName_EM', 'tagName_FONT', 'tagName_H1', 'tagName_H2', 'tagName_H3',
       'tagName_H4', 'tagName_H5', 'tagName_H6', 'tagName_I', 'tagName_P',
       'tagName_SECTION', 'tagName_SPAN', 'tagName_STRONG', 'tagName_TD',
       'tagName_TITLE', 'textAlign_-webkit-center', 'textAlign_-webkit-left',
       'textAlign_-webkit-right', 'textAlign_center', 'textAlign_justify',
       'textAlign_left', 'textAlign_right'],
      dtype='object')


In [13]:
sites_title=data_title_dummy['site']
y_title=data_title_dummy['title']
X_title=data_title_dummy.drop(['site','title'],axis=1)

In [14]:
from sklearn.preprocessing import MinMaxScaler
mms_title=MinMaxScaler()
X_title_sc=mms_title.fit_transform(X_title)
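
For reference, min-max scaling maps each column to the range 0 to 1 using that column's minimum and maximum. The NumPy sketch below uses made-up numbers and learns the statistics from training rows only. The cell above fits the scaler on all rows before cross-validation; fitting on the training folds only is a common, stricter alternative.

```python
import numpy as np

train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 40.0]])
test = np.array([[2.5, 25.0]])

# Learn per-column minimum and range from the training rows only.
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min
col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns

# Apply the same transformation to both sets.
train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range
print(train_scaled)
print(test_scaled)
```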

In [15]:
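def choose_max_probability(sites,pred_proba):
    """Predict exactly one positive element per site.

    Assumes that rows of the same site are stored contiguously in `sites`.
    """
    y_pred=[]
    i=0
    while i<sites.shape[0]:
        s=sites[i]
        s_index=(sites==s)
        n_rows=np.sum(s_index)

        # position, within this site's rows, of the highest positive-class probability
        max_prob=pred_proba[s_index].argmax(axis=0)[1]
        y_site=[0]*n_rows
        y_site[max_prob]=1
        y_pred+=y_site
        i+=n_rows
    return y_pred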
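
The function above turns class probabilities into exactly one positive prediction per site. A grouped variant of the same idea is sketched below on toy inputs. The name `choose_max_probability_grouped` and the data are hypothetical; unlike the loop above, this version does not need the rows of a site to be adjacent.

```python
import numpy as np
import pandas as pd

def choose_max_probability_grouped(sites, pred_proba):
    """Mark, for every site, the row with the highest positive-class probability."""
    positive = pd.Series(pred_proba[:, 1])
    # idxmax per group returns the row position of the top-scoring element.
    winners = positive.groupby(np.asarray(sites)).idxmax()
    y_pred = np.zeros(len(positive), dtype=int)
    y_pred[winners.to_numpy()] = 1
    return y_pred

sites = np.array(["a", "a", "a", "b", "b"])
pred_proba = np.array(
    [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3], [0.4, 0.6]]
)
result = choose_max_probability_grouped(sites, pred_proba)
print(result)
```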

In [16]:
import numpy as np
import sklearn.metrics as metrics
import time
from imblearn.over_sampling import RandomOverSampler

class KFoldProba:
    
    def __init__(self,k=10, random_state=1):
        self._k=k
        self._folds=[]
        self.random_state=random_state
    
    def fit(self, sites, X, y):
        random=np.random.RandomState(self.random_state)
        self._folds=[]
        
        sites_unq=np.unique(sites)
        random.shuffle(sites_unq)
        
        foldsize=int(len(sites_unq)/self._k)
        remainder=len(sites_unq)-self._k*foldsize
        start=0
        for i in range(self._k):
            this_fold_size=foldsize
            if(i<remainder): this_fold_size+=1
            fold_sites=sites_unq[start:(start+this_fold_size)]
            fold_X=X[np.isin(sites,fold_sites)]
            fold_y=y[np.isin(sites,fold_sites)]
            fold={'sites':sites[np.isin(sites,fold_sites)],"X":fold_X,"y":fold_y}
            self._folds.append(fold)
            start+=this_fold_size
    
    def estimate(self, model):
        scores=dict({"accuracy":0,"precision":0,"recall":0,"f1":0})
        for f in range(self._k):
            start_time = time.time()
            
            test_fold=self._folds[f]
            
            train_X=np.concatenate([self._folds[ff]['X'] for ff in range(self._k) if ff!=f],axis=0) 
            train_y=np.concatenate([self._folds[ff]['y'] for ff in range(self._k) if ff!=f],axis=0) 
            
            ros_title = RandomOverSampler(random_state=self.random_state)
            
            train_X_resampled, train_y_resampled = ros_title.fit_resample(train_X, train_y)
            print("\t\tfold:%d, presampled:%d oversampled:%d"%(f, train_X.shape[0], train_X_resampled.shape[0]))
            
            model.fit(train_X_resampled,train_y_resampled)

            pred_proba=model.predict_proba(test_fold['X'])
            y_pred=choose_max_probability(test_fold['sites'],pred_proba)
            
            
            a=metrics.accuracy_score(test_fold['y'],y_pred)
            p=metrics.precision_score(test_fold['y'],y_pred)
            r=metrics.recall_score(test_fold['y'],y_pred)
            f1=metrics.f1_score(test_fold['y'],y_pred)
            scores["accuracy"]+=a
            scores["precision"]+=p
            scores["recall"]+=r
            scores["f1"]+=f1
            print("\t\tfold:%d, time:%d secs, [%.3f, %.3f,%.3f,%.3f] "%(f,time.time()-start_time,a,p,r,f1))
        return {k: v/self._k for k, v in scores.items()}
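
The class above builds its folds from site names, not from individual rows, so the elements of a given site never appear on both the training and the test side. A minimal sketch of that splitting idea, with invented site labels, could look like this:

```python
import numpy as np

rng = np.random.RandomState(3)
sites = np.array(["a", "a", "b", "b", "b", "c", "d", "d", "e"])

# Shuffle the unique site names, not the rows.
unique_sites = np.unique(sites)
rng.shuffle(unique_sites)

# Split the site names into k groups; earlier groups absorb the remainder.
k = 3
site_folds = np.array_split(unique_sites, k)

for test_sites in site_folds:
    test_mask = np.isin(sites, test_sites)
    train_sites = set(sites[~test_mask])
    # No site contributes rows to both the training and the test side.
    assert train_sites.isdisjoint(test_sites)
    print(sorted(test_sites), int(test_mask.sum()))
```

scikit-learn's `GroupKFold` provides a comparable group-wise split. The hand-written class is still useful here because it keeps the site labels of each test fold, which the one-prediction-per-site step needs.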

In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import datetime
def grid_validation(kfold,models):
    results=[]
    grid_size=0
    for m in models:
        grid_size+=len(m['params'])
    i=0;
    start_time=datetime.datetime.now()
    
    for m in models:
        now=datetime.datetime.now()
        print("Starting classifer: %d:%d - "%(now.hour,now.minute),m)
        for p in m['params']:
            i+=1
            now=datetime.datetime.now()
            print("\tStarting %d out of %d testing: %d:%d"%(i,grid_size,now.hour,now.minute))
            
            mod=m['model'](**p)
            metrics=kfold.estimate(mod)
            t=datetime.datetime.now()
            print("\tFinished %d out of %d testing: %d:%d, time:%d secs"%(i,grid_size,t.hour,t.minute,t.timestamp()-now.timestamp()))
            print('\tMetrics accuracy:%.3f precision:%.3f recall:%.3f f1:%.3f'%(metrics['accuracy'],metrics['precision'],metrics['recall'],metrics['f1']))
            results.append({'model_name':m,'model_param':p,'metrics':metrics,'model':mod})
    # return the collected results so that the caller does not receive None
    return results

kfold=KFoldProba(k=10,random_state=3)
kfold.fit(sites_title.to_numpy(),X_title_sc,y_title.to_numpy())
result_title=grid_validation(
    kfold=kfold,
    models=[
        {
            'model':KNeighborsClassifier,
            'params':[
                {'weights':'uniform','n_neighbors':5},
                {'weights':'uniform','n_neighbors':10},
                {'weights':'uniform','n_neighbors':15},
                {'weights':'distance','n_neighbors':5},
                {'weights':'distance','n_neighbors':10},
                {'weights':'distance','n_neighbors':15}
            ]
        },
        {
             'model':LogisticRegression,
             'params':[
                {'C':0.001,'solver':'liblinear','random_state':5},
                {'C':0.01,'solver':'liblinear','random_state':5},
                {'C':0.1,'solver':'liblinear','random_state':5},
                {'C':1,'solver':'liblinear','random_state':5},
                {'C':10,'solver':'liblinear','random_state':5},
                {'C':100,'solver':'liblinear','random_state':5},
                {'penalty':'l1','C':0.001,'solver':'liblinear','random_state':5},
                {'penalty':'l1','C':0.01,'solver':'liblinear','random_state':5},
                {'penalty':'l1','C':0.1,'solver':'liblinear','random_state':5},
                {'penalty':'l1','C':1,'solver':'liblinear','random_state':5},
                {'penalty':'l1','C':10,'solver':'liblinear','random_state':5},
                {'penalty':'l1','C':100,'solver':'liblinear','random_state':5},
            ]
        }
    ]
)

Starting classifer: 9:55 -  {'model': <class 'sklearn.neighbors.classification.KNeighborsClassifier'>, 'params': [{'weights': 'uniform', 'n_neighbors': 5}, {'weights': 'uniform', 'n_neighbors': 10}, {'weights': 'uniform', 'n_neighbors': 15}, {'weights': 'distance', 'n_neighbors': 5}, {'weights': 'distance', 'n_neighbors': 10}, {'weights': 'distance', 'n_neighbors': 15}]}
	Starting 1 out of 18 testing: 9:55
		fold:0, presampled:46239 oversampled:92300
		fold:0, time:13 secs, [0.997, 0.300,0.300,0.300] 
		fold:1, presampled:45206 oversampled:90234
		fold:1, time:12 secs, [0.998, 0.500,0.500,0.500] 
		fold:2, presampled:44819 oversampled:89460
		fold:2, time:16 secs, [0.998, 0.300,0.300,0.300] 
		fold:3, presampled:47593 oversampled:95008
		fold:3, time:10 secs, [0.998, 0.700,0.700,0.700] 
		fold:4, presampled:48952 oversampled:97726
		fold:4, time:8 secs, [0.997, 0.600,0.600,0.600] 
		fold:5, presampled:47066 oversampled:93954
		fold:5, time:32 secs, [0.999, 0.700,0.700,0.700] 
		fold:6,

		fold:2, time:0 secs, [0.998, 0.400,0.400,0.400] 
		fold:3, presampled:47593 oversampled:95008
		fold:3, time:0 secs, [0.999, 0.900,0.900,0.900] 
		fold:4, presampled:48952 oversampled:97726
		fold:4, time:0 secs, [0.997, 0.600,0.600,0.600] 
		fold:5, presampled:47066 oversampled:93954
		fold:5, time:0 secs, [0.999, 0.700,0.700,0.700] 
		fold:6, presampled:47527 oversampled:94876
		fold:6, time:0 secs, [0.998, 0.600,0.600,0.600] 
		fold:7, presampled:45173 oversampled:90168
		fold:7, time:0 secs, [0.999, 0.600,0.600,0.600] 
		fold:8, presampled:44361 oversampled:88544
		fold:8, time:0 secs, [0.999, 0.600,0.600,0.600] 
		fold:9, presampled:46798 oversampled:93416
		fold:9, time:0 secs, [0.999, 0.700,0.778,0.737] 
	Finished 7 out of 18 testing: 10:7, time:3 secs
	Metrics accuracy:0.999 precision:0.630 recall:0.638 f1:0.634
	Starting 8 out of 18 testing: 10:7
		fold:0, presampled:46239 oversampled:92300
		fold:0, time:0 secs, [0.998, 0.600,0.600,0.600] 
		fold:1, presampled:45206 oversam

		fold:7, time:4 secs, [0.999, 0.700,0.700,0.700] 
		fold:8, presampled:44361 oversampled:88544
		fold:8, time:8 secs, [0.999, 0.700,0.700,0.700] 
		fold:9, presampled:46798 oversampled:93416
		fold:9, time:8 secs, [0.998, 0.400,0.444,0.421] 
	Finished 14 out of 18 testing: 10:10, time:83 secs
	Metrics accuracy:0.999 precision:0.660 recall:0.664 f1:0.662
	Starting 15 out of 18 testing: 10:10
		fold:0, presampled:46239 oversampled:92300
		fold:0, time:7 secs, [0.999, 0.700,0.700,0.700] 
		fold:1, presampled:45206 oversampled:90234
		fold:1, time:9 secs, [0.999, 0.600,0.600,0.600] 
		fold:2, presampled:44819 oversampled:89460
		fold:2, time:24 secs, [0.999, 0.700,0.700,0.700] 
		fold:3, presampled:47593 oversampled:95008
		fold:3, time:11 secs, [0.998, 0.700,0.700,0.700] 
		fold:4, presampled:48952 oversampled:97726
		fold:4, time:11 secs, [0.999, 0.900,0.900,0.900] 
		fold:5, presampled:47066 oversampled:93954
		fold:5, time:6 secs, [0.999, 0.800,0.800,0.800] 
		fold:6, presampled:47527



		fold:2, time:85 secs, [0.999, 0.700,0.700,0.700] 
		fold:3, presampled:47593 oversampled:95008
		fold:3, time:57 secs, [0.998, 0.600,0.600,0.600] 
		fold:4, presampled:48952 oversampled:97726
		fold:4, time:43 secs, [0.998, 0.800,0.800,0.800] 
		fold:5, presampled:47066 oversampled:93954
		fold:5, time:23 secs, [0.999, 0.800,0.800,0.800] 
		fold:6, presampled:47527 oversampled:94876
		fold:6, time:28 secs, [0.999, 0.800,0.800,0.800] 
		fold:7, presampled:45173 oversampled:90168
		fold:7, time:91 secs, [1.000, 0.900,0.900,0.900] 
		fold:8, presampled:44361 oversampled:88544
		fold:8, time:50 secs, [0.999, 0.700,0.700,0.700] 
		fold:9, presampled:46798 oversampled:93416
		fold:9, time:44 secs, [0.998, 0.500,0.556,0.526] 
	Finished 18 out of 18 testing: 10:27, time:607 secs
	Metrics accuracy:0.999 precision:0.730 recall:0.736 f1:0.733
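
The logged accuracy stays near 0.999 even when precision and recall are much lower. The sketch below, with made-up numbers, shows why accuracy says little on data this imbalanced: a page with a thousand candidates and a single wrong pick still scores almost perfectly.

```python
# One page with 1000 candidate elements and exactly one true title.
n_elements = 1000
y_true = [0] * n_elements
y_true[10] = 1

# The model picks exactly one element, but the wrong one.
y_pred = [0] * n_elements
y_pred[11] = 1

true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
correct = sum(t == p for t, p in zip(y_true, y_pred))

accuracy = correct / n_elements
precision = true_pos / sum(y_pred)
recall = true_pos / sum(y_true)
print(accuracy, precision, recall)
```

Because exactly one element is predicted per site, precision and recall coincide whenever every site in a fold has exactly one labelled positive. This matches most of the fold lines above, where the two values are identical.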


# Training model for 'content'

In [18]:
# some tag names are lowercase. it is better all of them are uppercase.
data_content['tagName']=data_content['tagName'].str.upper()
data_content['tagName'].unique()

array(['DIV', 'A', 'SPAN', 'IMG', 'H1', 'P', 'STRONG', 'H2', 'EM', 'H4',
       'TABLE', 'TBODY', 'TR', 'TD', 'IFRAME', 'BR', 'SCRIPT', 'INPUT',
       'LABEL', 'TEXTAREA', 'BUTTON', 'I', 'NOSCRIPT', 'SECTION', 'SVG',
       'RECT', 'UL', 'LI', 'FORM', 'CIRCLE', 'PATH', 'NAV', 'LINE', 'HR',
       'ARTICLE', 'H5', 'FIELDSET', 'FOOTER', 'HEADER', 'H6', 'H3',
       'SMALL', 'VIDEO', 'SOURCE', 'ASIDE', 'STYLE', 'SUP', 'TITLE', 'G',
       'BLOCKQUOTE', 'FIGURE', 'FIGCAPTION', 'PROGRESS', 'B', 'OL',
       'CENTER', 'META', 'TEXT', 'TIME', 'INS', 'LINK', 'SELECT',
       'OPTION', 'ABBR', 'MARQUEE', 'U', 'FONT', 'MAIN', 'AMP-ANALYTICS',
       'DEFS', 'THEAD', 'TH', 'MAP', 'AREA', 'AUDIO', 'CLIPPATH',
       'LEGEND', 'CITE', 'XDOOR-ICON', 'SYMBOL', 'TWITTER-WIDGET', 'USE',
       'PICTURE', 'QMT_START', 'QMT_END', 'DL', 'DT', 'S', 'DD', 'SUB',
       'SBSTICKY', 'NOBR', 'DAC-IVT-OGV', 'ADDRESS', 'FB:LIKE', 'POLYGON',
       'FB:RECOMMENDATIONS-BAR', 'ELLIPSE', 'LINEARGRADIENT', 'STOP',
 

In [19]:
tag_content_blacklist=['A', 'IMG', 'H1',  'STRONG', 'H2', 'EM', 'H4',
       'TABLE', 'TBODY', 'TR', 'IFRAME', 'BR', 'SCRIPT', 'INPUT',
       'LABEL', 'TEXTAREA', 'BUTTON', 'I', 'NOSCRIPT', 'SVG',
       'RECT', 'UL', 'LI', 'FORM', 'CIRCLE', 'PATH', 'NAV', 'LINE', 'HR',
       'H5', 'FIELDSET', 'FOOTER', 'HEADER', 'H6', 'H3',
       'SMALL', 'VIDEO', 'SOURCE', 'ASIDE', 'STYLE', 'SUP', 'TITLE', 'G',
       'BLOCKQUOTE', 'FIGURE', 'FIGCAPTION', 'PROGRESS', 'B', 'OL',
       'CENTER', 'META', 'TEXT', 'TIME', 'INS', 'LINK', 'SELECT',
       'OPTION', 'ABBR', 'MARQUEE', 'U', 'FONT', 'MAIN', 'AMP-ANALYTICS',
       'DEFS', 'THEAD', 'TH', 'MAP', 'AREA', 'AUDIO', 'CLIPPATH',
       'LEGEND', 'CITE', 'XDOOR-ICON', 'SYMBOL', 'TWITTER-WIDGET', 'USE',
       'PICTURE', 'QMT_START', 'QMT_END', 'DL', 'DT', 'S', 'DD', 'SUB',
       'SBSTICKY', 'NOBR', 'DAC-IVT-OGV', 'ADDRESS', 'FB:LIKE', 'POLYGON',
       'FB:RECOMMENDATIONS-BAR', 'ELLIPSE', 'LINEARGRADIENT', 'STOP',
       'RADIALGRADIENT', 'TQWIDGET', 'IMAGE', 'MENU', 'VIDEOPLAYER',
       'AMP-ANIMATION', 'AMP-POSITION-OBSERVER', 'AMP-IMG',
       'I-AMPHTML-SIZER', 'AMP-LIST', 'TEMPLATE', 'AMP-SIDEBAR', 'AMP-AD',
       'AMP-SOCIAL-SHARE', 'AMP-IFRAME', 'I-AMPHTML-SCROLL-CONTAINER',
       'AMP-FACEBOOK-COMMENTS', 'AMP-EMBED', 'AMP-STICKY-AD',
       'AMP-STICKY-AD-TOP-PADDING', 'PRE', 'OBJECT', 'POLYLINE']
data_content=data_content[~data_content['tagName'].isin(tag_content_blacklist)]
data_content['tagName']
data_content.shape

(44637, 33)

In [20]:
print(data_content['tagName'].value_counts().idxmax())
print(data_content['textAlign'].value_counts().idxmax())
data_content_dummy=pd.get_dummies(data_content,columns=['tagName','textAlign'])
print(data_content_dummy.columns)

DIV
start
Index(['site', 'left', 'top', 'width', 'height', 'children', 'textCount',
       'parentCount', 'fontSize', 'linkCount', 'paragraphCount', 'imageCount',
       'colorRed', 'colorGreen', 'colorBlue', 'backgroundRed',
       'backgroundGreen', 'backgroundBlue', 'backgroundAlpha', 'marginTop',
       'marginRight', 'marginBottom', 'marginLeft', 'paddingTop',
       'paddingRight', 'paddingBottom', 'paddingLeft', 'descendants',
       'relPosX', 'relPosY', 'content', 'tagName_ARTICLE', 'tagName_DIV',
       'tagName_P', 'tagName_SECTION', 'tagName_SPAN', 'tagName_TD',
       'textAlign_-webkit-center', 'textAlign_-webkit-left',
       'textAlign_-webkit-right', 'textAlign_center', 'textAlign_justify',
       'textAlign_left', 'textAlign_right', 'textAlign_start'],
      dtype='object')


In [21]:
data_content_dummy=data_content_dummy.drop(['tagName_DIV','textAlign_start'],axis=1)
print(data_content_dummy.columns)

Index(['site', 'left', 'top', 'width', 'height', 'children', 'textCount',
       'parentCount', 'fontSize', 'linkCount', 'paragraphCount', 'imageCount',
       'colorRed', 'colorGreen', 'colorBlue', 'backgroundRed',
       'backgroundGreen', 'backgroundBlue', 'backgroundAlpha', 'marginTop',
       'marginRight', 'marginBottom', 'marginLeft', 'paddingTop',
       'paddingRight', 'paddingBottom', 'paddingLeft', 'descendants',
       'relPosX', 'relPosY', 'content', 'tagName_ARTICLE', 'tagName_P',
       'tagName_SECTION', 'tagName_SPAN', 'tagName_TD',
       'textAlign_-webkit-center', 'textAlign_-webkit-left',
       'textAlign_-webkit-right', 'textAlign_center', 'textAlign_justify',
       'textAlign_left', 'textAlign_right'],
      dtype='object')


In [22]:
sites_content=data_content_dummy['site']
y_content=data_content_dummy['content']
X_content=data_content_dummy.drop(['site','content'],axis=1)

In [23]:
from sklearn.preprocessing import MinMaxScaler
mms_content=MinMaxScaler()
X_content_sc=mms_content.fit_transform(X_content)

In [24]:
kfold=KFoldProba(k=10,random_state=3)
kfold.fit(sites_content.to_numpy(),X_content_sc,y_content.to_numpy())
result_content=grid_validation(
    kfold=kfold,
    models=[        
        {
            'model':KNeighborsClassifier,
            'params':[
                {'weights':'uniform','n_neighbors':5},
                {'weights':'uniform','n_neighbors':10},
                {'weights':'uniform','n_neighbors':15},
                {'weights':'distance','n_neighbors':5},
                {'weights':'distance','n_neighbors':10},
                {'weights':'distance','n_neighbors':15}
            ]
        },
        {
             'model':LogisticRegression,
             'params':[
                {'C':0.001,'solver':'liblinear','random_state':5},
                {'C':0.01,'solver':'liblinear','random_state':5},
                {'C':0.1,'solver':'liblinear','random_state':5},
                {'C':1,'solver':'liblinear','random_state':5},
                {'C':10,'solver':'liblinear','random_state':5},
                {'C':100,'solver':'liblinear','random_state':5},
                {'penalty':'l1','C':0.001,'solver':'liblinear','random_state':5},
                {'penalty':'l1','C':0.01,'solver':'liblinear','random_state':5},
                {'penalty':'l1','C':0.1,'solver':'liblinear','random_state':5},
                {'penalty':'l1','C':1,'solver':'liblinear','random_state':5},
                {'penalty':'l1','C':10,'solver':'liblinear','random_state':5},
                {'penalty':'l1','C':100,'solver':'liblinear','random_state':5},
            ]
        }
    ]
)

Starting classifer: 10:27 -  {'model': <class 'sklearn.neighbors.classification.KNeighborsClassifier'>, 'params': [{'weights': 'uniform', 'n_neighbors': 5}, {'weights': 'uniform', 'n_neighbors': 10}, {'weights': 'uniform', 'n_neighbors': 15}, {'weights': 'distance', 'n_neighbors': 5}, {'weights': 'distance', 'n_neighbors': 10}, {'weights': 'distance', 'n_neighbors': 15}]}
	Starting 1 out of 18 testing: 10:27
		fold:0, presampled:40994 oversampled:81810
		fold:0, time:3 secs, [0.997, 0.400,0.400,0.400] 
		fold:1, presampled:40507 oversampled:80836
		fold:1, time:4 secs, [0.997, 0.300,0.300,0.300] 
		fold:2, presampled:36789 oversampled:73400
		fold:2, time:7 secs, [0.998, 0.400,0.400,0.400] 
		fold:3, presampled:42092 oversampled:84006
		fold:3, time:3 secs, [0.998, 0.800,0.800,0.800] 
		fold:4, presampled:41942 oversampled:83706
		fold:4, time:3 secs, [0.998, 0.700,0.700,0.700] 
		fold:5, presampled:40858 oversampled:81538
		fold:5, time:4 secs, [0.997, 0.500,0.500,0.500] 
		fold:6, pr

		fold:3, time:0 secs, [0.993, 0.100,0.100,0.100] 
		fold:4, presampled:41942 oversampled:83706
		fold:4, time:0 secs, [0.994, 0.200,0.200,0.200] 
		fold:5, presampled:40858 oversampled:81538
		fold:5, time:0 secs, [0.995, 0.100,0.100,0.100] 
		fold:6, presampled:40576 oversampled:80974
		fold:6, time:0 secs, [0.996, 0.200,0.200,0.200] 
		fold:7, presampled:39050 oversampled:77922
		fold:7, time:0 secs, [0.996, 0.000,0.000,0.000] 
		fold:8, presampled:38789 oversampled:77400
		fold:8, time:0 secs, [0.997, 0.100,0.100,0.100] 
		fold:9, presampled:40136 oversampled:80092
		fold:9, time:0 secs, [0.996, 0.111,0.111,0.111] 
	Finished 7 out of 18 testing: 10:32, time:2 secs
	Metrics accuracy:0.996 precision:0.091 recall:0.091 f1:0.091
	Starting 8 out of 18 testing: 10:32
		fold:0, presampled:40994 oversampled:81810
		fold:0, time:0 secs, [0.995, 0.100,0.100,0.100] 
		fold:1, presampled:40507 oversampled:80836
		fold:1, time:0 secs, [0.995, 0.000,0.000,0.000] 
		fold:2, presampled:36789 overs

		fold:8, time:4 secs, [0.997, 0.000,0.000,0.000] 
		fold:9, presampled:40136 oversampled:80092
		fold:9, time:3 secs, [0.997, 0.222,0.222,0.222] 
	Finished 14 out of 18 testing: 10:33, time:45 secs
	Metrics accuracy:0.996 precision:0.242 recall:0.242 f1:0.242
	Starting 15 out of 18 testing: 10:33
		fold:0, presampled:40994 oversampled:81810
		fold:0, time:14 secs, [0.997, 0.500,0.500,0.500] 
		fold:1, presampled:40507 oversampled:80836
		fold:1, time:13 secs, [0.997, 0.400,0.400,0.400] 
		fold:2, presampled:36789 oversampled:73400
		fold:2, time:12 secs, [0.998, 0.400,0.400,0.400] 
		fold:3, presampled:42092 oversampled:84006
		fold:3, time:17 secs, [0.996, 0.500,0.500,0.500] 
		fold:4, presampled:41942 oversampled:83706
		fold:4, time:19 secs, [0.998, 0.700,0.700,0.700] 
		fold:5, presampled:40858 oversampled:81538
		fold:5, time:13 secs, [0.997, 0.400,0.400,0.400] 
		fold:6, presampled:40576 oversampled:80974
		fold:6, time:17 secs, [0.998, 0.600,0.600,0.600] 
		fold:7, presampled:3



		fold:0, time:311 secs, [0.997, 0.500,0.500,0.500] 
		fold:1, presampled:40507 oversampled:80836
		fold:1, time:383 secs, [0.998, 0.600,0.600,0.600] 
		fold:2, presampled:36789 oversampled:73400
		fold:2, time:349 secs, [0.999, 0.600,0.600,0.600] 
		fold:3, presampled:42092 oversampled:84006
		fold:3, time:291 secs, [0.997, 0.600,0.600,0.600] 
		fold:4, presampled:41942 oversampled:83706
		fold:4, time:347 secs, [0.998, 0.700,0.700,0.700] 
		fold:5, presampled:40858 oversampled:81538
		fold:5, time:227 secs, [0.997, 0.500,0.500,0.500] 
		fold:6, presampled:40576 oversampled:80974
		fold:6, time:174 secs, [0.998, 0.600,0.600,0.600] 
		fold:7, presampled:39050 oversampled:77922
		fold:7, time:108 secs, [0.998, 0.500,0.500,0.500] 
		fold:8, presampled:38789 oversampled:77400
		fold:8, time:297 secs, [0.998, 0.300,0.300,0.300] 
		fold:9, presampled:40136 oversampled:80092
		fold:9, time:307 secs, [0.998, 0.444,0.444,0.444] 
	Finished 17 out of 18 testing: 11:30, time:2798 secs
	Metrics accuracy:0.998 precision:0.534 recall:0.534 f1:0.534
	Starting 18 out of 18 testing: 11:30
		fold:0, presampled:40994 oversampled:81810
		fold:0, time:309 secs, [0.997, 0.500,0.500,0.500] 
		fold:1, presampled:40507 oversampled:80836
		fold:1, time:259 secs, [0.998, 0.600,0.600,0.600] 
		fold:2, presampled:36789 oversampled:73400
		fold:2, time:300 secs, [0.999, 0.500,0.500,0.500] 
		fold:3, presampled:42092 oversampled:84006
		fold:3, time:307 secs, [0.997, 0.600,0.600,0.600] 
		fold:4, presampled:41942 oversampled:83706
		fold:4, time:325 secs, [0.998, 0.700,0.700,0.700] 
		fold:5, presampled:40858 oversampled:81538
		fold:5, time:315 secs, [0.997, 0.500,0.500,0.500] 
		fold:6, presampled:40576 oversampled:80974
		fold:6, time:326 secs, [0.998, 0.600,0.600,0.600] 
		fold:7, presampled:39050 oversampled:77922
		fold:7, time:269 secs, [0.998, 0.500,0.500,0.500] 
		fold:8, presampled:38789 oversampled:77400
		fold:8, time:296 secs, [0.998, 0.300,0.300,0.300] 
		fold:9, presampled:40136 oversampled:80092
		fold:9, time:362 secs, [0.998, 0.444,0.444,0.444] 
	Finished 18 out of 18 testing: 12:21, time:3072 secs
	Metrics accuracy:0.998 precision:0.524 recall:0.524 f1:0.524
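
The log above is long, so it can help to pull out only the per-configuration summary lines when comparing runs. The following is a minimal sketch, assuming the log text has been captured into a string. `summarize_log` and `SUMMARY_RE` are hypothetical names, not part of the notebook, and the sample contains only the two summaries that are fully visible above.

```python
import re

# Hypothetical helper (not in the original notebook): matches a
# "Finished N out of M testing" line followed by its "Metrics" line.
SUMMARY_RE = re.compile(
    r"Finished (\d+) out of \d+ testing:.*?\n\s*Metrics accuracy:([\d.]+) "
    r"precision:([\d.]+) recall:([\d.]+) f1:([\d.]+)"
)

def summarize_log(log_text):
    """Return {configuration number: metrics dict} for every summary found."""
    summary = {}
    for match in SUMMARY_RE.finditer(log_text):
        index = int(match.group(1))
        accuracy, precision, recall, f1 = (float(g) for g in match.groups()[1:])
        summary[index] = {
            "accuracy": accuracy,
            "precision": precision,
            "recall": recall,
            "f1": f1,
        }
    return summary

# Two summaries copied from the log above.
sample_log = (
    "\tFinished 17 out of 18 testing: 11:30, time:2798 secs\n"
    "\tMetrics accuracy:0.998 precision:0.534 recall:0.534 f1:0.534\n"
    "\tFinished 18 out of 18 testing: 12:21, time:3072 secs\n"
    "\tMetrics accuracy:0.998 precision:0.524 recall:0.524 f1:0.524\n"
)
summary = summarize_log(sample_log)
# Best configuration among the parsed ones only, ranked by F1.
best = max(summary, key=lambda k: summary[k]["f1"])
```

Because the log shown here is truncated, `best` only ranks the configurations whose summary lines are present in the text passed in.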




In [25]:
# Final title model: L1-regularised logistic regression, fitted on the full dataset
title_clf=LogisticRegression(penalty='l1',C=1,solver='liblinear',random_state=5)
# Oversample the minority class so that both classes are equally represented
ros_title = RandomOverSampler(random_state=3)
train_X_resampled, train_y_resampled = ros_title.fit_resample(X_title_sc,y_title)
print("\t\tpresampled:%d oversampled:%d"%(X_title.shape[0], train_X_resampled.shape[0]))

title_clf.fit(train_X_resampled,train_y_resampled)

# Note: these metrics are computed on the same data the model was fitted on
pred_proba=title_clf.predict_proba(X_title_sc)
y_pred=choose_max_probability(sites_title.to_numpy(),pred_proba)
a=metrics.accuracy_score(y_title,y_pred)
p=metrics.precision_score(y_title,y_pred)
r=metrics.recall_score(y_title,y_pred)
f=metrics.f1_score(y_title,y_pred)
print('accuracy:%0.3f, precision:%0.3f, recall:%0.3f, f1:%0.3f'%(a,p,r,f))

		presampled:51526 oversampled:102854
accuracy:0.999, precision:0.770, recall:0.778, f1:0.774


In [26]:
from joblib import dump
dump(title_clf, 'title_classifier.joblib') 

['title_classifier.joblib']

In [27]:
# Final content model: same settings as the title model, fitted on the full dataset
content_clf=LogisticRegression(penalty='l1',C=1,solver='liblinear',random_state=5)
# Oversample the minority class so that both classes are equally represented
ros_content = RandomOverSampler(random_state=3)
train_X_resampled, train_y_resampled = ros_content.fit_resample(X_content_sc,y_content)
print("\t\tpresampled:%d oversampled:%d"%(X_content.shape[0], train_X_resampled.shape[0]))

content_clf.fit(train_X_resampled,train_y_resampled)

# Note: these metrics are computed on the same data the model was fitted on
pred_proba=content_clf.predict_proba(X_content_sc)
y_pred=choose_max_probability(sites_content.to_numpy(),pred_proba)
a=metrics.accuracy_score(y_content,y_pred)
p=metrics.precision_score(y_content,y_pred)
r=metrics.recall_score(y_content,y_pred)
f=metrics.f1_score(y_content,y_pred)
print('accuracy:%0.3f, precision:%0.3f, recall:%0.3f, f1:%0.3f'%(a,p,r,f))

		presampled:44637 oversampled:89076
accuracy:0.998, precision:0.566, recall:0.566, f1:0.566


In [28]:
dump(content_clf, 'content_classifier.joblib') 

['content_classifier.joblib']

In [29]:
dump(mms_title,"scaler_title.joblib")
dump(mms_content,"scaler_content.joblib")

['scaler_content.joblib']
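
The saved classifiers and scalers can later be loaded together and applied to newly scraped elements. The sketch below assumes the four `.joblib` files written above sit in one directory, that the positive class is labelled `1`, and that new feature rows have the same columns, in the same order, as the training data. `load_extractor` and `positive_probabilities` are hypothetical helper names, not part of the notebook.

```python
import os
from joblib import load

def load_extractor(model_dir, kind):
    """Load the scaler and classifier saved above; kind is 'title' or 'content'."""
    scaler = load(os.path.join(model_dir, "scaler_%s.joblib" % kind))
    clf = load(os.path.join(model_dir, "%s_classifier.joblib" % kind))
    return scaler, clf

def positive_probabilities(scaler, clf, X):
    """Scale raw feature rows and return the probability of class 1 for each row."""
    X_sc = scaler.transform(X)
    # Assumption: the positive class is labelled 1 in the training data.
    positive_col = list(clf.classes_).index(1)
    return clf.predict_proba(X_sc)[:, positive_col]
```

The scaler has to be the one fitted during training, because the classifier only ever saw features scaled with those minimum and maximum values.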

In [30]:
import json
freq=X_title[['marginTop','marginRight','marginBottom','marginLeft','paddingTop','paddingRight','paddingBottom',
 'paddingLeft']].mode().iloc[0].to_dict()
cols_title=X_title.columns.values.tolist()
cols_content=X_content.columns.values.tolist()

# Write the dataset metadata, one item per line
with open("datasetinfo.txt", "w") as f:
    f.write(json.dumps(freq))
    f.write('\n')
    f.write(','.join(cols_title))
    f.write('\n')
    f.write(','.join(tag_blacklist))
    f.write('\n')
    f.write(','.join(cols_content))
    f.write('\n')
    f.write(','.join(tag_content_blacklist))
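
For reference, `datasetinfo.txt` can be read back with a small parser that mirrors the write order above: one JSON line followed by four comma-separated lines. This is a minimal sketch. `parse_dataset_info` is a hypothetical name, and it assumes that no column or tag name contains a comma.

```python
import json

def parse_dataset_info(text):
    """Parse the five lines of datasetinfo.txt, in the order they were written."""
    lines = text.split("\n")

    def split_csv(line):
        # An empty line means an empty list, not a list with one empty string.
        return line.split(",") if line else []

    return {
        "freq": json.loads(lines[0]),
        "cols_title": split_csv(lines[1]),
        "tag_blacklist": split_csv(lines[2]),
        "cols_content": split_csv(lines[3]),
        "tag_content_blacklist": split_csv(lines[4]),
    }
```

A typical call would be `parse_dataset_info(open("datasetinfo.txt").read())`, which returns the most frequent margin and padding values together with the column lists needed to rebuild the feature matrix in the training order.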