# News Content Extractor

In this project, I set out to build a model that extracts the title and the main textual content from a news web page. When we crawl websites to build a search engine or for other purposes, extraction turns out to be difficult because pages are full of noisy content unrelated to the main article. Such a model is useful on those occasions.

## Approach
When we visit a website, we can easily distinguish the title and the content of the main article. My goal is to give the machine that sense.

In my opinion, we reach that conclusion mostly from visual perception: for example, the position on the display, the size of the text, and the color and weight of the text.

On top of that, I think developers adopt similar practices when building their web pages. Pages may therefore resemble each other, for example in the HTML tags used to create similar components and in the hierarchical structure of the HTML elements.

Therefore, my models are trained to capture that visual sense and those structural similarities.

One more thing to note: I trained two models, because the **title** and the **content** need to be extracted in different ways.

## Application 

I developed a Node.js application with which I can build my training dataset and extract the title and the content
from web pages using the models trained in this project.

The application is located in the [webscraper](https://github.com/uuganbold/news_extractor/tree/master/webscraper) directory of the source [code](https://github.com/uuganbold/news_extractor).


## Dataset

The dataset was scraped from 120 different websites, one page from each website. 
It has 119807 rows, each of which represents one HTML element of a web page. 
That means an average page contains about 1000 HTML elements. 

The dataset is highly imbalanced because each web page contains only one title element and one content element
among roughly a thousand HTML elements.
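
To put a number on the imbalance, the short sketch below uses only the figures quoted above (119807 rows, 120 pages). It assumes exactly one positive element per page for a given label, which is how the dataset is described here. It also shows why plain accuracy is a weak yardstick for this task: a classifier that marks every element as negative already scores very high.

```python
# Figures quoted in the text above; one positive per page is an assumption.
n_rows = 119807
n_pages = 120
n_positive = n_pages  # assumed: exactly one title (or content) element per page

positive_rate = n_positive / n_rows
# Accuracy of a useless classifier that labels every element as negative.
all_negative_accuracy = (n_rows - n_positive) / n_rows

print(f"elements per page: {n_rows / n_pages:.0f}")
print(f"positive rate: {positive_rate:.4%}")
print(f"all-negative accuracy: {all_negative_accuracy:.4%}")
```

For this reason precision, recall and F1 on the positive class are more informative here than accuracy.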

#### Attributes
**site**: the name of the website from which the element is extracted<br/>
**url**: the web URL from which the element is extracted<br/>
**tagName**: the element's HTML tag<br/>
**left**: X coordinate of the top-left point of the element on the page<br/>
**top**: Y coordinate of the top-left point of the element on the page<br/>
**width**: the element's width on the page<br/>
**height**: the element's height on the page<br/>
**children**: count of direct child elements<br/>
**textCount**: length of the text in the element<br/>
**parentCount**: count of the ancestor elements<br/>
**fontSize**: font size of the text<br/>
**linkCount**: count of the &#60;a&#62; elements in the element<br/>
**paragraphCount**: count of the &#60;p&#62; elements in the element<br/>
**imageCount**: count of the &#60;img&#62; elements in the element<br/>
**colorRed**: the red component of the RGB color of the text in the element<br/>
**colorGreen**: the green component of the RGB color of the text in the element<br/>
**colorBlue**: the blue component of the RGB color of the text in the element<br/>
**backgroundRed**: the red component of the RGB background color of the element<br/>
**backgroundGreen**: the green component of the RGB background color of the element<br/>
**backgroundBlue**: the blue component of the RGB background color of the element<br/>
**backgroundAlpha**: the transparency (alpha) component of the background of the element<br/>
**textAlign**: alignment of the text in the element<br/>
**marginTop**: top margin of the element<br/>
**marginRight**: right margin of the element<br/>
**marginBottom**: bottom margin of the element<br/>
**marginLeft**: left margin of the element<br/>
**paddingTop**: top padding of the element<br/>
**paddingRight**: right padding of the element<br/>
**paddingBottom**: bottom padding of the element<br/>
**paddingLeft**: left padding of the element<br/>
**descendants**: count of descendant elements<br/>
**relPosX**: relative X position on the page<br/>
**relPosY**: relative Y position on the page<br/>
**title**: whether the element is the title of the article<br/>
**content**: whether the element is the main content of the article

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data_raw = pd.read_csv("../webscraper/out1.csv", quotechar='"', skipinitialspace=True)

print(data_raw.columns)
print(data_raw.shape)
print(data_raw['site'].unique())
print(len(data_raw['site'].unique()))

Index(['site', 'url', 'tagName', 'left', 'top', 'width', 'height', 'children',
       'textCount', 'parentCount', 'fontSize', 'linkCount', 'paragraphCount',
       'imageCount', 'colorRed', 'colorGreen', 'colorBlue', 'backgroundRed',
       'backgroundGreen', 'backgroundBlue', 'backgroundAlpha', 'textAlign',
       'marginTop', 'marginRight', 'marginBottom', 'marginLeft', 'paddingTop',
       'paddingRight', 'paddingBottom', 'paddingLeft', 'descendants',
       'relPosX', 'relPosY', 'title', 'content'],
      dtype='object')
(119807, 35)
['ikon' 'gogo' 'news' 'peak' 'polit' 'zuv' 'updown' 'newspress' 'gereg'
 'nertur' 'livetv' 'sonin' 'olloo' 'itoim' 'medee' 'arslan' 'udriintoim'
 'mongolcom' 'news1' 'zarig' 'sosa' 'dardas' 'mminfo' 'asuudal' 'zindaa'
 'seruuleg' 'newsmedia' 'bolod' 'inews' 'paparatsi' 'unuudur'
 'niigmiintoli' '24barimt' 'zaluu' 'amjilt' 'tur' 'fact' 'shuurhai'
 'control' 'jirgee' 'tonshuul' 'mongolcomment' 'scandal' 'miss' 'ontslokh'
 'inet' 'kingnews' 'tusgaar' 'mur



## Data Preprocessing

In this phase, I prepare the dataset for training. 

### 1. Removing fields that are not useful
The model should not depend on any particular website, so I remove the *url* attribute. The *site* attribute is needed to separate the training and test sets, so it is kept for now.

In [2]:
data_raw=data_raw.drop(['url'],axis=1)
data_raw.head()

Unnamed: 0,site,tagName,left,top,width,height,children,textCount,parentCount,fontSize,...,marginLeft,paddingTop,paddingRight,paddingBottom,paddingLeft,descendants,relPosX,relPosY,title,content
0,ikon,DIV,0.0,35.0,1920,0,0,0,1,12.0,...,0,0,0,0,0,0,0.0,0.004624,False,False
1,ikon,DIV,0.0,35.0,1920,7380,2,2361,1,12.0,...,0,0,0,0,0,458,0.0,0.004624,False,False
2,ikon,DIV,0.0,0.0,1920,60,1,5,2,12.0,...,0,0,0,0,0,18,0.0,0.0,False,False
3,ikon,DIV,240.0,0.0,1440,60,3,5,3,12.0,...,240,0,0,0,0,17,0.125,0.0,False,False
4,ikon,DIV,258.0,10.0,103,38,1,0,4,12.0,...,18,0,0,0,0,3,0.134375,0.001321,False,False


### 2. Mapping boolean classes to numerical classes
The *title* and *content* columns are the classes we need to predict. They hold boolean values, which classifiers can handle directly, so technically I do not need to map them to numerical values; I map them to 0 and 1 just for convenience.

In [3]:
class_mapping={False:0,True:1}
data_raw['title']=data_raw['title'].map(class_mapping)
data_raw['content']=data_raw['content'].map(class_mapping)
data_raw.head()

Unnamed: 0,site,tagName,left,top,width,height,children,textCount,parentCount,fontSize,...,marginLeft,paddingTop,paddingRight,paddingBottom,paddingLeft,descendants,relPosX,relPosY,title,content
0,ikon,DIV,0.0,35.0,1920,0,0,0,1,12.0,...,0,0,0,0,0,0,0.0,0.004624,0,0
1,ikon,DIV,0.0,35.0,1920,7380,2,2361,1,12.0,...,0,0,0,0,0,458,0.0,0.004624,0,0
2,ikon,DIV,0.0,0.0,1920,60,1,5,2,12.0,...,0,0,0,0,0,18,0.0,0.0,0,0
3,ikon,DIV,240.0,0.0,1440,60,3,5,3,12.0,...,240,0,0,0,0,17,0.125,0.0,0,0
4,ikon,DIV,258.0,10.0,103,38,1,0,4,12.0,...,18,0,0,0,0,3,0.134375,0.001321,0,0


# Splitting test and train set

In [4]:
sites_unq=data_raw['site'].unique()
random=np.random.RandomState(25)
random.shuffle(sites_unq)
sites_train=sites_unq[0:100]
sites_test=sites_unq[100:]
data=data_raw[data_raw['site'].isin(sites_train)]
print(data.shape)
data_test=data_raw[data_raw['site'].isin(sites_test)]
print(data_test.shape)
data.head()

(101855, 34)
(17952, 34)


Unnamed: 0,site,tagName,left,top,width,height,children,textCount,parentCount,fontSize,...,marginLeft,paddingTop,paddingRight,paddingBottom,paddingLeft,descendants,relPosX,relPosY,title,content
0,ikon,DIV,0.0,35.0,1920,0,0,0,1,12.0,...,0,0,0,0,0,0,0.0,0.004624,0,0
1,ikon,DIV,0.0,35.0,1920,7380,2,2361,1,12.0,...,0,0,0,0,0,458,0.0,0.004624,0,0
2,ikon,DIV,0.0,0.0,1920,60,1,5,2,12.0,...,0,0,0,0,0,18,0.0,0.0,0,0
3,ikon,DIV,240.0,0.0,1440,60,3,5,3,12.0,...,240,0,0,0,0,17,0.125,0.0,0,0
4,ikon,DIV,258.0,10.0,103,38,1,0,4,12.0,...,18,0,0,0,0,3,0.134375,0.001321,0,0
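
The split above is made by site rather than by row, so every element of a page lands on the same side and the test set only contains websites the model has never seen. As a hedged alternative sketch (not the method used in this notebook), scikit-learn's `GroupShuffleSplit` expresses the same idea; the toy frame and its values are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy frame: four sites with a few elements each.
toy = pd.DataFrame({
    "site": ["a", "a", "b", "b", "b", "c", "c", "d", "d", "d"],
    "textCount": [5, 120, 0, 33, 7, 900, 12, 4, 61, 8],
    "title": [0, 1, 0, 1, 0, 1, 0, 0, 1, 0],
})

# Hold out 25% of the sites (not 25% of the rows).
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=25)
train_idx, test_idx = next(splitter.split(toy, groups=toy["site"]))

train, test = toy.iloc[train_idx], toy.iloc[test_idx]
train_sites, test_sites = set(train["site"]), set(test["site"])
print("train sites:", sorted(train_sites))
print("test sites:", sorted(test_sites))
```

Whichever way the split is produced, the key property to check is that no site appears in both sets.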


### 3. Non-numerical values

Now I will make sure that no feature contains non-numerical data.

In [5]:
print(data.dtypes.unique())
data.columns[data.dtypes=='O']

[dtype('O') dtype('float64') dtype('int64')]


Index(['site', 'tagName', 'textAlign', 'marginTop', 'marginRight',
       'marginBottom', 'marginLeft', 'paddingTop', 'paddingRight',
       'paddingBottom', 'paddingLeft'],
      dtype='object')

*site* is not used as a training feature, so we can ignore it for now. *tagName* and *textAlign* are categorical features, and I will tackle them later. According to the result, that leaves the **margin** and **padding** features, which are supposed to hold continuous values but contain non-numerical data.

So let's look at what those non-numerical values are.

In [6]:
#margins are not numeric
from collections import Counter
nan_columns=['marginTop','marginRight','marginBottom','marginLeft','paddingTop','paddingRight','paddingBottom','paddingLeft']
for c in nan_columns:
    string_values=[]
    s=data[c]
    for index, value in s.items():
        try:
            float(value)
        except ValueError:
            string_values.append(value)
    print(c,Counter(string_values))

marginTop Counter({'auto': 124, '10%': 2, '15%': 1})
marginRight Counter({'auto': 671, '1.5%': 161, '2%': 18, '3%': 5, '15%': 4, '1%': 2, '-100%': 1, '0.95%': 1})
marginBottom Counter({'auto': 127, '2%': 5, '10%': 1})
marginLeft Counter({'auto': 678, '15%': 5, '2.5641%': 4, '2%': 3, '1%': 2, 'calc(-10% + 84.48)': 1, '3.4%': 1, '-50%': 1, '8.33333%': 1, '75%': 1})
paddingTop Counter({'5%': 49, '75%': 7, '62%': 1, '67%': 1, '56.1912%': 1, '56.2559%': 1, '145%': 1})
paddingRight Counter({'5%': 49, '1%': 2, '3%': 1})
paddingBottom Counter({'5%': 50, '15%': 17, '50%': 8, '66%': 3, '66.6667%': 2, '83.3333%': 2, '3%': 1, '100%': 1, '66.6747%': 1, '66.6722%': 1, '66.5344%': 1})
paddingLeft Counter({'5%': 49, '6.66667%': 3, '1%': 2, '3%': 1})


For the margins and paddings, these non-numerical values are first converted to missing values and then replaced with the most frequent value of the column.

In [7]:
data=data.copy()  # explicit copy avoids SettingWithCopyWarning
for c in nan_columns:
    # 'auto', '5%' and similar strings cannot be parsed and become NaN
    data[c]=pd.to_numeric(data[c],errors='coerce')


In [8]:
print(data[nan_columns].mode())
data.loc[:,nan_columns] = data.loc[:,nan_columns].fillna(data[nan_columns].mode().iloc[0])
data.head()

   marginTop  marginRight  marginBottom  marginLeft  paddingTop  paddingRight  \
0        0.0          0.0           0.0         0.0         0.0           0.0   

   paddingBottom  paddingLeft  
0            0.0          0.0  


Unnamed: 0,site,tagName,left,top,width,height,children,textCount,parentCount,fontSize,...,marginLeft,paddingTop,paddingRight,paddingBottom,paddingLeft,descendants,relPosX,relPosY,title,content
0,ikon,DIV,0.0,35.0,1920,0,0,0,1,12.0,...,0.0,0.0,0.0,0.0,0.0,0,0.0,0.004624,0,0
1,ikon,DIV,0.0,35.0,1920,7380,2,2361,1,12.0,...,0.0,0.0,0.0,0.0,0.0,458,0.0,0.004624,0,0
2,ikon,DIV,0.0,0.0,1920,60,1,5,2,12.0,...,0.0,0.0,0.0,0.0,0.0,18,0.0,0.0,0,0
3,ikon,DIV,240.0,0.0,1440,60,3,5,3,12.0,...,240.0,0.0,0.0,0.0,0.0,17,0.125,0.0,0,0
4,ikon,DIV,258.0,10.0,103,38,1,0,4,12.0,...,18.0,0.0,0.0,0.0,0.0,3,0.134375,0.001321,0,0
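
The coerce-then-fill step used above can be tried in isolation on a small series. The values below are made up, but they mimic what the margin columns contain: numbers mixed with CSS keywords such as `auto` and percentages. Note that this is a simplification, since a percentage loses its meaning entirely once it is replaced by the mode.

```python
import pandas as pd

# Hypothetical margin column mixing numbers with CSS keywords and percentages.
margin = pd.Series(["0", "18", "auto", "0", "5%", "240", "0"])

numeric = pd.to_numeric(margin, errors="coerce")  # 'auto' and '5%' become NaN
most_frequent = numeric.mode().iloc[0]            # most common numeric value
filled = numeric.fillna(most_frequent)

print(filled.tolist())
```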


## Models
From this point on, I split my dataset for two models: one for the title and one for the content.

In [9]:
data_title=data.drop(['content'],axis=1)
data_content=data.drop(['title'],axis=1)

# Training model for 'Title'

### 1. Categorical features.
My dataset has two categorical features, *tagName* and *textAlign*, and both are nominal, so I use one-hot encoding. I also remove the dummy feature of the most frequent category, so that a new tag name or text alignment introduced in the testing or production phase can simply be ignored (it is then encoded like the most frequent category, with all dummy features set to 0).

Before doing that, to reduce dimensionality, I filter the tag names and remove the rows whose tags will never get the positive class. 

In [10]:
# some tag names are lowercase. it is better all of them are uppercase.
data_title['tagName']=data_title['tagName'].str.upper()
data_title['tagName'].unique()

array(['DIV', 'A', 'SPAN', 'IMG', 'H1', 'P', 'STRONG', 'H2', 'EM', 'H4',
       'TABLE', 'TBODY', 'TR', 'TD', 'IFRAME', 'BR', 'SCRIPT', 'INPUT',
       'LABEL', 'TEXTAREA', 'BUTTON', 'I', 'NOSCRIPT', 'HEADER', 'UL',
       'LI', 'FORM', 'NAV', 'SVG', 'CIRCLE', 'SECTION', 'ARTICLE', 'H6',
       'H5', 'H3', 'SMALL', 'VIDEO', 'SOURCE', 'FOOTER', 'ASIDE',
       'PROGRESS', 'FIGURE', 'STYLE', 'B', 'OL', 'HR', 'PATH', 'POLYLINE',
       'RECT', 'BLOCKQUOTE', 'META', 'TEXT', 'TIME', 'INS', 'LINE',
       'LINK', 'ABBR', 'SUP', 'MARQUEE', 'CENTER', 'FONT', 'MAIN',
       'TITLE', 'G', 'AMP-ANALYTICS', 'T', 'SELECT', 'OPTION', 'DEFS',
       'THEAD', 'TH', 'MAP', 'AREA', 'CITE', 'PICTURE', 'FIGCAPTION',
       'USE', 'POLYGON', 'XDOOR-ICON', 'AUDIO', 'SYMBOL', 'FIELDSET',
       'TWITTER-WIDGET', 'DL', 'DT', 'S', 'NOBR', 'DAC-IVT-OGV', 'DD',
       'ADDRESS', 'FB:LIKE', 'FB:RECOMMENDATIONS-BAR', 'U', 'ELLIPSE',
       'LINEARGRADIENT', 'STOP', 'RADIALGRADIENT', 'CLIPPATH', 'TQWIDGET',
       

In [11]:
tag_blacklist=['IMG','TABLE', 'TBODY', 'TR', 'IFRAME', 'BR', 'SCRIPT', 'INPUT',
       'LABEL', 'TEXTAREA', 'BUTTON', 'I', 'NOSCRIPT', 'HEADER', 'UL',
       'LI', 'FORM', 'NAV', 'SVG', 'CIRCLE', 'SECTION', 'ARTICLE',
       'SMALL', 'VIDEO', 'SOURCE', 'FOOTER', 'ASIDE',
       'PROGRESS', 'FIGURE', 'STYLE','OL', 'HR', 'PATH', 'POLYLINE',
       'RECT', 'BLOCKQUOTE', 'META', 'TEXT', 'TIME', 'INS', 'LINE',
       'LINK', 'ABBR', 'SUP', 'MARQUEE','MAIN',
        'G', 'AMP-ANALYTICS', 'T', 'SELECT', 'OPTION', 'DEFS',
       'THEAD', 'MAP', 'AREA', 'CITE', 'PICTURE', 'FIGCAPTION',
       'USE', 'POLYGON', 'XDOOR-ICON', 'AUDIO', 'SYMBOL', 'FIELDSET',
       'TWITTER-WIDGET', 'DL', 'DT', 'S', 'NOBR', 'DAC-IVT-OGV', 'DD',
       'ADDRESS', 'FB:LIKE', 'FB:RECOMMENDATIONS-BAR', 'U', 'ELLIPSE',
       'LINEARGRADIENT', 'STOP', 'RADIALGRADIENT', 'CLIPPATH', 'TQWIDGET',
       'LEGEND', 'IMAGE', 'MENU', 'VIDEOPLAYER', 'AMP-WEB-PUSH',
       'AMP-INSTALL-SERVICEWORKER', 'AMP-USER-NOTIFICATION',
       'AMP-WEB-PUSH-WIDGET', 'AMP-IMG', 'I-AMPHTML-SIZER',
       'AMP-DATE-DISPLAY', 'TEMPLATE', 'AMP-POSITION-OBSERVER',
       'AMP-ANIMATION', 'AMP-IMAGE-LIGHTBOX', 'AMP-SOCIAL-SHARE',
       'AMP-IFRAME', 'AMP-AD', 'AMP-TWITTER', 'AMP-FACEBOOK-COMMENTS',
       'AMP-EMBED', 'AMP-STATE', 'AMP-SIDEBAR', 'AMP-ACCORDION',
       'AMP-LIVE-LIST', 'AMP-LIGHTBOX', 'AMP-PIXEL',
       'AMP-LIGHTBOX-GALLERY', 'OBJECT', 'APP-EFC-MESSAGING', 'CNX',
       'NTV-DIV', 'MASK', 'H0', 'APP-ROOT', 'BETA-FLAG', 'ROUTER-OUTLET',
       'APP', 'NAVBAR', 'CHANNELS-LIST', 'SUB-CHANNELS-LIST', 'STREAM',
       'PAGE', 'FBS-AD', 'PAGE-STANDARD', 'METRICS',
       'CONTRIB-BLOCK', 'CONTRIB-BYLINE', 'GROUP-BLOG-BLURB',
       'DISCLAIMER', 'ARTICLE-BODY-CONTAINER', 'SHARING', 'FBS-ACCORDION',
       'SIG-FILE', 'CONTRIB-FULL-BIO', 'PRINTBAR', 'MEDIANET', 'SIDENAV',
       'FBS-VIDEO', 'PARAM', 'FBS-AD-RAIL', 'DESC', 'PRE', 'FB:COMMENTS']
data_title=data_title[~data_title['tagName'].isin(tag_blacklist)]
data_title['tagName']
data_title.shape

(66688, 33)

In [12]:
print(data_title['tagName'].value_counts().idxmax())
print(data_title['textAlign'].value_counts().idxmax())
data_title_dummy=pd.get_dummies(data_title,columns=['tagName','textAlign'])
print(data_title_dummy.columns)

DIV
start
Index(['site', 'left', 'top', 'width', 'height', 'children', 'textCount',
       'parentCount', 'fontSize', 'linkCount', 'paragraphCount', 'imageCount',
       'colorRed', 'colorGreen', 'colorBlue', 'backgroundRed',
       'backgroundGreen', 'backgroundBlue', 'backgroundAlpha', 'marginTop',
       'marginRight', 'marginBottom', 'marginLeft', 'paddingTop',
       'paddingRight', 'paddingBottom', 'paddingLeft', 'descendants',
       'relPosX', 'relPosY', 'title', 'tagName_A', 'tagName_ARTICLE-HEADER',
       'tagName_B', 'tagName_CENTER', 'tagName_DIV', 'tagName_EM',
       'tagName_FONT', 'tagName_H1', 'tagName_H2', 'tagName_H3', 'tagName_H4',
       'tagName_H5', 'tagName_H6', 'tagName_P', 'tagName_SPAN',
       'tagName_STRONG', 'tagName_TD', 'tagName_TH', 'tagName_TITLE',
       'textAlign_-webkit-center', 'textAlign_-webkit-left',
       'textAlign_-webkit-right', 'textAlign_center', 'textAlign_justify',
       'textAlign_left', 'textAlign_right', 'textAlign_start'],
     

In [13]:
data_title_dummy=data_title_dummy.drop(['tagName_DIV','textAlign_start'],axis=1)
print(data_title_dummy.columns)

Index(['site', 'left', 'top', 'width', 'height', 'children', 'textCount',
       'parentCount', 'fontSize', 'linkCount', 'paragraphCount', 'imageCount',
       'colorRed', 'colorGreen', 'colorBlue', 'backgroundRed',
       'backgroundGreen', 'backgroundBlue', 'backgroundAlpha', 'marginTop',
       'marginRight', 'marginBottom', 'marginLeft', 'paddingTop',
       'paddingRight', 'paddingBottom', 'paddingLeft', 'descendants',
       'relPosX', 'relPosY', 'title', 'tagName_A', 'tagName_ARTICLE-HEADER',
       'tagName_B', 'tagName_CENTER', 'tagName_EM', 'tagName_FONT',
       'tagName_H1', 'tagName_H2', 'tagName_H3', 'tagName_H4', 'tagName_H5',
       'tagName_H6', 'tagName_P', 'tagName_SPAN', 'tagName_STRONG',
       'tagName_TD', 'tagName_TH', 'tagName_TITLE', 'textAlign_-webkit-center',
       'textAlign_-webkit-left', 'textAlign_-webkit-right', 'textAlign_center',
       'textAlign_justify', 'textAlign_left', 'textAlign_right'],
      dtype='object')
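
Dropping the dummy of the most frequent category only helps with unseen categories if new data is encoded into exactly the same columns as the training data. One way to do that, sketched here under assumptions, is to remember the training columns and align new data with `DataFrame.reindex`. The toy frames, the `BLINK` tag and the name `train_columns` are hypothetical and do not appear in the project code.

```python
import pandas as pd

train = pd.DataFrame({"tagName": ["DIV", "DIV", "H1", "P", "DIV"]})
new = pd.DataFrame({"tagName": ["H1", "DIV", "BLINK"]})  # 'BLINK' never seen in training

# One-hot encode, then drop the dummy of the most frequent category.
baseline = "tagName_" + train["tagName"].value_counts().idxmax()
train_dummy = pd.get_dummies(train, columns=["tagName"]).drop(columns=[baseline])
train_columns = train_dummy.columns

# Encode new data and align it to the training columns:
# unseen dummies are discarded, missing ones are filled with 0.
new_dummy = pd.get_dummies(new, columns=["tagName"]).reindex(columns=train_columns, fill_value=0)
print(new_dummy.astype(int))
```

In this sketch both the baseline tag (`DIV`) and the unseen tag end up as all-zero rows, which is the behaviour described above.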


In [14]:
sites_title=data_title_dummy['site']
y_title=data_title_dummy['title']
X_title=data_title_dummy.drop(['site','title'],axis=1)

In [15]:
from sklearn.preprocessing import MinMaxScaler
mms_title=MinMaxScaler()
X_title_sc=mms_title.fit_transform(X_title)
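
When the fitted scaler is later applied to the held-out test set or to new pages, it is usually `transform` (not `fit_transform`) that should be called, so the training minimum and maximum are reused. A minimal sketch with made-up numbers; the two columns merely stand in for features such as fontSize and textCount.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrices with two columns.
X_train = np.array([[12.0, 0.0], [16.0, 500.0], [32.0, 2000.0]])
X_new = np.array([[24.0, 1000.0], [40.0, 100.0]])

scaler = MinMaxScaler()
X_train_sc = scaler.fit_transform(X_train)  # learns per-column min and max
X_new_sc = scaler.transform(X_new)          # reuses the training min and max

print(X_train_sc)
print(X_new_sc)  # values outside [0, 1] are possible for unseen data
```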

The function below selects, for each site, the single element with the highest predicted probability of the positive class.

In [16]:
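def choose_max_probability(sites,pred_proba):
    # For each site, mark only the element with the highest positive-class probability as 1.
    # Assumes the rows of each site are stored contiguously in `sites`.
    y_pred=[]
    i=0
    while i<sites.shape[0]:
        s=sites[i]
        s_index=(sites==s)
        n_site=np.sum(s_index)

        # position (within the site) of the row with the highest P(class=1)
        max_prob=pred_proba[s_index][:,1].argmax()
        y_site=[0]*n_site
        y_site[max_prob]=1
        y_pred+=y_site
        i+=n_site
    return y_pred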
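
To see the selection step on concrete numbers, here is a small self-contained variant. `pick_one_per_site` is a hypothetical name, and unlike the function above it does not require the rows of a site to be contiguous; otherwise it follows the same idea. The probabilities are made up.

```python
import numpy as np

def pick_one_per_site(sites, pred_proba):
    """Return 0/1 labels with exactly one 1 per site (highest P(class=1))."""
    sites = np.asarray(sites)
    positive = np.asarray(pred_proba)[:, 1]
    y_pred = np.zeros(len(sites), dtype=int)
    for s in np.unique(sites):
        rows = np.flatnonzero(sites == s)          # row numbers of this site
        y_pred[rows[positive[rows].argmax()]] = 1  # best-scoring row wins
    return y_pred

sites = np.array(["a", "a", "a", "b", "b"])
pred_proba = np.array([
    [0.9, 0.1],
    [0.3, 0.7],
    [0.6, 0.4],
    [0.8, 0.2],
    [0.95, 0.05],
])
print(pick_one_per_site(sites, pred_proba))
```

Site `b` shows the main design choice: even when every probability is below 0.5, one element is still picked, because each page is assumed to contain exactly one title.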

In [17]:
import numpy as np
import sklearn.metrics as metrics
import time
from imblearn.over_sampling import RandomOverSampler

class KFoldProba:
    
    def __init__(self,k=10, random_state=1):
        self._k=k
        self._folds=[]
        self.random_state=random_state
    
    def fit(self, sites, X, y):
        random=np.random.RandomState(self.random_state)
        self._folds=[]
        
        sites_unq=np.unique(sites)
        random.shuffle(sites_unq)
        
        foldsize=int(len(sites_unq)/self._k)
        remainder=len(sites_unq)-self._k*foldsize
        start=0
        for i in range(self._k):
            this_fold_size=foldsize
            if(i<remainder): this_fold_size+=1
            fold_sites=sites_unq[start:(start+this_fold_size)]
            fold_X=X[np.isin(sites,fold_sites)]
            fold_y=y[np.isin(sites,fold_sites)]
            fold={'sites':sites[np.isin(sites,fold_sites)],"X":fold_X,"y":fold_y}
            self._folds.append(fold)
            start+=this_fold_size
    
    def estimate(self, model):
        scores=dict({"accuracy":0,"precision":0,"recall":0,"f1":0})
        for f in range(self._k):
            start_time = time.time()
            
            test_fold=self._folds[f]
            
            train_X=np.concatenate([self._folds[ff]['X'] for ff in range(self._k) if ff!=f],axis=0) 
            train_y=np.concatenate([self._folds[ff]['y'] for ff in range(self._k) if ff!=f],axis=0) 
            
            ros_title = RandomOverSampler(random_state=self.random_state)
            
            train_X_resampled, train_y_resampled = ros_title.fit_resample(train_X, train_y)
            print("\t\tfold:%d, presampled:%d oversampled:%d"%(f, train_X.shape[0], train_X_resampled.shape[0]))
            
            model.fit(train_X_resampled,train_y_resampled)

            pred_proba=model.predict_proba(test_fold['X'])
            y_pred=choose_max_probability(test_fold['sites'],pred_proba)
            
            
            a=metrics.accuracy_score(test_fold['y'],y_pred)
            p=metrics.precision_score(test_fold['y'],y_pred)
            r=metrics.recall_score(test_fold['y'],y_pred)
            f1=metrics.f1_score(test_fold['y'],y_pred)
            scores["accuracy"]+=a
            scores["precision"]+=p
            scores["recall"]+=r
            scores["f1"]+=f1
            print("\t\tfold:%d, time:%d secs, [%.3f, %.3f,%.3f,%.3f] "%(f,time.time()-start_time,a,p,r,f1))
        return {k: v/self._k for k, v in scores.items()}
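
The class above builds its folds from whole sites, so a page never contributes rows to both the training folds and the validation fold. scikit-learn's `GroupKFold` offers comparable grouping, although it assigns sites to folds differently and, by default, without shuffling. The sketch below uses made-up data and leaves out the oversampling step; it only checks that no site leaks across a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: 6 sites with 3 elements each, one positive per site.
sites = np.repeat(["s1", "s2", "s3", "s4", "s5", "s6"], 3)
X = np.arange(len(sites) * 2, dtype=float).reshape(-1, 2)
y = np.tile([0, 1, 0], 6)

overlaps = []
for fold, (train_idx, val_idx) in enumerate(GroupKFold(n_splits=3).split(X, y, groups=sites)):
    train_sites, val_sites = set(sites[train_idx]), set(sites[val_idx])
    overlaps.append(train_sites & val_sites)  # should always be empty
    print(fold, sorted(val_sites), "positives in validation:", int(y[val_idx].sum()))
```

As in the class above, any resampling should be applied to the training folds only, so the validation fold keeps the natural class balance.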

In [18]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import datetime
def grid_validation(kfold,models):
    results=[]
    grid_size=0
    for m in models:
        grid_size+=len(m['params'])
    i=0
    start_time=datetime.datetime.now()
    
    for m in models:
        now=datetime.datetime.now()
        print("Starting classifer: %d:%d - "%(now.hour,now.minute),m)
        for p in m['params']:
            i+=1
            now=datetime.datetime.now()
            print("\tStarting %d out of %d testing: %d:%d"%(i,grid_size,now.hour,now.minute))
            
            mod=m['model'](**p)
            metrics=kfold.estimate(mod)
            t=datetime.datetime.now()
            print("\tFinished %d out of %d testing: %d:%d, time:%d secs"%(i,grid_size,t.hour,t.minute,t.timestamp()-now.timestamp()))
            print('\tMetrics accuracy:%.3f precision:%.3f recall:%.3f f1:%.3f'%(metrics['accuracy'],metrics['precision'],metrics['recall'],metrics['f1']))
            results.append({'model_name':m,'model_param':p,'metrics':metrics,'model':mod})
    return results

kfold=KFoldProba(k=10,random_state=3)
kfold.fit(sites_title.to_numpy(),X_title_sc,y_title.to_numpy())
result_title=grid_validation(
    kfold=kfold,
    models=[
        {
            'model':KNeighborsClassifier,
            'params':[
                {'weights':'uniform','n_neighbors':5},
                {'weights':'uniform','n_neighbors':10},
                {'weights':'uniform','n_neighbors':15},
                {'weights':'distance','n_neighbors':5},
                {'weights':'distance','n_neighbors':10},
                {'weights':'distance','n_neighbors':15}
            ]
        },
        {
             'model':LogisticRegression,
             'params':[
                {'C':0.001,'solver':'liblinear','random_state':5},
                {'C':0.01,'solver':'liblinear','random_state':5},
                {'C':0.1,'solver':'liblinear','random_state':5},
                {'C':1,'solver':'liblinear','random_state':5},
                {'C':10,'solver':'liblinear','random_state':5},
                {'C':100,'solver':'liblinear','random_state':5},
                {'penalty':'l1','C':0.001,'solver':'liblinear','random_state':5},
                {'penalty':'l1','C':0.01,'solver':'liblinear','random_state':5},
                {'penalty':'l1','C':0.1,'solver':'liblinear','random_state':5},
                {'penalty':'l1','C':1,'solver':'liblinear','random_state':5},
                {'penalty':'l1','C':10,'solver':'liblinear','random_state':5},
                {'penalty':'l1','C':100,'solver':'liblinear','random_state':5},
            ]
        },
        {
            'model':RandomForestClassifier,
            'params':[
                {'n_estimators':10, 'random_state':6},
                {'n_estimators':20, 'random_state':6},
                {'n_estimators':50, 'random_state':6},
                {'n_estimators':70, 'random_state':6},
                {'n_estimators':100, 'random_state':6}
            ]
        }
    ]
)

Starting classifer: 12:7 -  {'model': <class 'sklearn.neighbors.classification.KNeighborsClassifier'>, 'params': [{'weights': 'uniform', 'n_neighbors': 5}, {'weights': 'uniform', 'n_neighbors': 10}, {'weights': 'uniform', 'n_neighbors': 15}, {'weights': 'distance', 'n_neighbors': 5}, {'weights': 'distance', 'n_neighbors': 10}, {'weights': 'distance', 'n_neighbors': 15}]}
	Starting 1 out of 23 testing: 12:7
		fold:0, presampled:59128 oversampled:118076
		fold:0, time:18 secs, [0.999, 0.600,0.600,0.600] 
		fold:1, presampled:61481 oversampled:122782
		fold:1, time:9 secs, [0.998, 0.600,0.600,0.600] 
		fold:2, presampled:60095 oversampled:120010
		fold:2, time:12 secs, [0.999, 0.600,0.600,0.600] 
		fold:3, presampled:60542 oversampled:120904
		fold:3, time:13 secs, [0.999, 0.600,0.600,0.600] 
		fold:4, presampled:61108 oversampled:122036
		fold:4, time:12 secs, [0.998, 0.500,0.500,0.500] 
		fold:5, presampled:59959 oversampled:119738
		fold:5, time:11 secs, [0.999, 0.500,0.500,0.500] 
		f

		fold:2, time:0 secs, [0.999, 0.600,0.600,0.600] 
		fold:3, presampled:60542 oversampled:120904
		fold:3, time:0 secs, [0.999, 0.600,0.600,0.600] 
		fold:4, presampled:61108 oversampled:122036
		fold:4, time:0 secs, [0.999, 0.700,0.700,0.700] 
		fold:5, presampled:59959 oversampled:119738
		fold:5, time:0 secs, [0.999, 0.500,0.500,0.500] 
		fold:6, presampled:62859 oversampled:125538
		fold:6, time:0 secs, [0.998, 0.700,0.700,0.700] 
		fold:7, presampled:59416 oversampled:118652
		fold:7, time:0 secs, [0.999, 0.700,0.700,0.700] 
		fold:8, presampled:60649 oversampled:121118
		fold:8, time:0 secs, [0.999, 0.800,0.800,0.800] 
		fold:9, presampled:54955 oversampled:109730
		fold:9, time:0 secs, [0.999, 0.400,0.400,0.400] 
	Finished 7 out of 23 testing: 12:21, time:5 secs
	Metrics accuracy:0.999 precision:0.640 recall:0.640 f1:0.640
	Starting 8 out of 23 testing: 12:21
		fold:0, presampled:59128 oversampled:118076
		fold:0, time:0 secs, [0.999, 0.700,0.700,0.700] 
		fold:1, presampled:614

		fold:6, time:6 secs, [0.999, 0.900,0.900,0.900] 
		fold:7, presampled:59416 oversampled:118652
		fold:7, time:2 secs, [0.999, 0.700,0.700,0.700] 
		fold:8, presampled:60649 oversampled:121118
		fold:8, time:11 secs, [0.999, 0.800,0.800,0.800] 
		fold:9, presampled:54955 oversampled:109730
		fold:9, time:13 secs, [0.999, 0.400,0.400,0.400] 
	Finished 14 out of 23 testing: 12:24, time:76 secs
	Metrics accuracy:0.999 precision:0.680 recall:0.680 f1:0.680
	Starting 15 out of 23 testing: 12:24
		fold:0, presampled:59128 oversampled:118076
		fold:0, time:10 secs, [0.999, 0.800,0.800,0.800] 
		fold:1, presampled:61481 oversampled:122782
		fold:1, time:14 secs, [0.999, 0.700,0.700,0.700] 
		fold:2, presampled:60095 oversampled:120010
		fold:2, time:8 secs, [0.999, 0.600,0.600,0.600] 
		fold:3, presampled:60542 oversampled:120904
		fold:3, time:13 secs, [0.999, 0.700,0.700,0.700] 
		fold:4, presampled:61108 oversampled:122036
		fold:4, time:5 secs, [0.999, 0.700,0.700,0.700] 
		fold:5, presam

		fold:6, time:6 secs, [1.000, 1.000,1.000,1.000] 
		fold:7, presampled:59416 oversampled:118652
		fold:7, time:6 secs, [1.000, 1.000,1.000,1.000] 
		fold:8, presampled:60649 oversampled:121118
		fold:8, time:6 secs, [1.000, 1.000,1.000,1.000] 
		fold:9, presampled:54955 oversampled:109730
		fold:9, time:5 secs, [0.999, 0.700,0.700,0.700] 
	Finished 21 out of 23 testing: 12:36, time:60 secs
	Metrics accuracy:1.000 precision:0.940 recall:0.940 f1:0.940
	Starting 22 out of 23 testing: 12:36
		fold:0, presampled:59128 oversampled:118076
		fold:0, time:11 secs, [1.000, 1.000,1.000,1.000] 
		fold:1, presampled:61481 oversampled:122782
		fold:1, time:9 secs, [0.999, 0.800,0.800,0.800] 
		fold:2, presampled:60095 oversampled:120010
		fold:2, time:12 secs, [1.000, 1.000,1.000,1.000] 
		fold:3, presampled:60542 oversampled:120904
		fold:3, time:12 secs, [1.000, 1.000,1.000,1.000] 
		fold:4, presampled:61108 oversampled:122036
		fold:4, time:9 secs, [1.000, 1.000,1.000,1.000] 
		fold:5, presampl
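
In the logs above, precision, recall and F1 are identical in every fold while accuracy stays near 0.999. This follows from the evaluation setup, assuming each site has exactly one true element and exactly one element is predicted per site: every wrong pick produces one false positive and one false negative, so the three scores collapse to the fraction of sites predicted correctly. The sketch below reproduces the effect on made-up labels.

```python
import numpy as np
from sklearn import metrics

# Hypothetical fold: 10 sites x 100 elements, one true title per site.
n_sites, n_elems = 10, 100
y_true = np.zeros(n_sites * n_elems, dtype=int)
y_pred = np.zeros(n_sites * n_elems, dtype=int)
for s in range(n_sites):
    y_true[s * n_elems] = 1                      # true title is the first element
    hit = s < 6                                  # pretend 6 of 10 sites are predicted correctly
    y_pred[s * n_elems + (0 if hit else 1)] = 1  # exactly one prediction per site

accuracy = metrics.accuracy_score(y_true, y_pred)
precision = metrics.precision_score(y_true, y_pred)
recall = metrics.recall_score(y_true, y_pred)
f1 = metrics.f1_score(y_true, y_pred)
print(f"[{accuracy:.3f}, {precision:.3f},{recall:.3f},{f1:.3f}]")
```

Accuracy barely moves because almost all elements are true negatives, so the shared precision/recall/F1 value is the number worth comparing between models.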

# Training model for 'content'

In [19]:
# some tag names are lowercase. it is better all of them are uppercase.
data_content['tagName']=data_content['tagName'].str.upper()
data_content['tagName'].unique()

array(['DIV', 'A', 'SPAN', 'IMG', 'H1', 'P', 'STRONG', 'H2', 'EM', 'H4',
       'TABLE', 'TBODY', 'TR', 'TD', 'IFRAME', 'BR', 'SCRIPT', 'INPUT',
       'LABEL', 'TEXTAREA', 'BUTTON', 'I', 'NOSCRIPT', 'HEADER', 'UL',
       'LI', 'FORM', 'NAV', 'SVG', 'CIRCLE', 'SECTION', 'ARTICLE', 'H6',
       'H5', 'H3', 'SMALL', 'VIDEO', 'SOURCE', 'FOOTER', 'ASIDE',
       'PROGRESS', 'FIGURE', 'STYLE', 'B', 'OL', 'HR', 'PATH', 'POLYLINE',
       'RECT', 'BLOCKQUOTE', 'META', 'TEXT', 'TIME', 'INS', 'LINE',
       'LINK', 'ABBR', 'SUP', 'MARQUEE', 'CENTER', 'FONT', 'MAIN',
       'TITLE', 'G', 'AMP-ANALYTICS', 'T', 'SELECT', 'OPTION', 'DEFS',
       'THEAD', 'TH', 'MAP', 'AREA', 'CITE', 'PICTURE', 'FIGCAPTION',
       'USE', 'POLYGON', 'XDOOR-ICON', 'AUDIO', 'SYMBOL', 'FIELDSET',
       'TWITTER-WIDGET', 'DL', 'DT', 'S', 'NOBR', 'DAC-IVT-OGV', 'DD',
       'ADDRESS', 'FB:LIKE', 'FB:RECOMMENDATIONS-BAR', 'U', 'ELLIPSE',
       'LINEARGRADIENT', 'STOP', 'RADIALGRADIENT', 'CLIPPATH', 'TQWIDGET',
       

In [20]:
tag_content_blacklist=['A',  'IMG', 'H1', 'STRONG', 'H2', 'EM', 'H4',
       'TABLE', 'TBODY', 'TR',  'IFRAME', 'BR', 'SCRIPT', 'INPUT',
       'LABEL', 'TEXTAREA', 'BUTTON', 'I', 'NOSCRIPT', 'HEADER', 'UL',
       'LI', 'FORM', 'NAV', 'SVG', 'CIRCLE', 'H6',
       'H5', 'H3', 'SMALL', 'VIDEO', 'SOURCE', 'FOOTER', 'ASIDE',
       'PROGRESS', 'FIGURE', 'STYLE', 'B', 'OL', 'HR', 'PATH', 'POLYLINE',
       'RECT', 'BLOCKQUOTE', 'META', 'TEXT', 'TIME', 'INS', 'LINE',
       'LINK', 'ABBR', 'SUP', 'MARQUEE', 'CENTER', 'FONT', 
       'TITLE', 'G', 'AMP-ANALYTICS', 'T', 'SELECT', 'OPTION', 'DEFS',
       'THEAD', 'TH', 'MAP', 'AREA', 'CITE', 'PICTURE', 'FIGCAPTION',
       'USE', 'POLYGON', 'XDOOR-ICON', 'AUDIO', 'SYMBOL', 'FIELDSET',
       'TWITTER-WIDGET', 'DL', 'DT', 'S', 'NOBR', 'DAC-IVT-OGV', 'DD',
       'ADDRESS', 'FB:LIKE', 'FB:RECOMMENDATIONS-BAR', 'U', 'ELLIPSE',
       'LINEARGRADIENT', 'STOP', 'RADIALGRADIENT', 'CLIPPATH', 'TQWIDGET',
       'LEGEND', 'IMAGE', 'MENU', 'VIDEOPLAYER', 'AMP-WEB-PUSH',
       'AMP-INSTALL-SERVICEWORKER', 'AMP-USER-NOTIFICATION',
       'AMP-WEB-PUSH-WIDGET', 'AMP-IMG', 'I-AMPHTML-SIZER',
       'AMP-DATE-DISPLAY', 'TEMPLATE', 'AMP-POSITION-OBSERVER',
       'AMP-ANIMATION', 'AMP-IMAGE-LIGHTBOX', 'AMP-SOCIAL-SHARE',
       'AMP-IFRAME', 'AMP-AD', 'AMP-TWITTER', 'AMP-FACEBOOK-COMMENTS',
       'AMP-EMBED', 'AMP-STATE', 'AMP-SIDEBAR', 'AMP-ACCORDION',
       'AMP-LIVE-LIST', 'AMP-LIGHTBOX', 'AMP-PIXEL',
       'AMP-LIGHTBOX-GALLERY', 'OBJECT', 'APP-EFC-MESSAGING', 'CNX',
       'NTV-DIV', 'MASK', 'H0', 'APP-ROOT', 'BETA-FLAG', 'ROUTER-OUTLET',
       'APP', 'NAVBAR', 'CHANNELS-LIST', 'SUB-CHANNELS-LIST', 'STREAM',
       'PAGE', 'FBS-AD', 'PAGE-STANDARD', 'ARTICLE-HEADER', 'METRICS',
       'CONTRIB-BLOCK', 'CONTRIB-BYLINE', 'GROUP-BLOG-BLURB',
       'DISCLAIMER', 'SHARING', 'FBS-ACCORDION',
       'SIG-FILE', 'CONTRIB-FULL-BIO', 'PRINTBAR', 'MEDIANET', 'SIDENAV',
       'FBS-VIDEO', 'PARAM', 'FBS-AD-RAIL', 'DESC', 'PRE', 'FB:COMMENTS']
data_content=data_content[~data_content['tagName'].isin(tag_content_blacklist)]
data_content['tagName']
data_content.shape

(43189, 33)

In [21]:
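# The most frequent level of each categorical column serves as the baseline
# category; its dummy column is dropped after one-hot encoding.
print(data_content['tagName'].value_counts().idxmax())
print(data_content['textAlign'].value_counts().idxmax())
data_content_dummy=pd.get_dummies(data_content,columns=['tagName','textAlign'])
print(data_content_dummy.columns)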

DIV
start
Index(['site', 'left', 'top', 'width', 'height', 'children', 'textCount',
       'parentCount', 'fontSize', 'linkCount', 'paragraphCount', 'imageCount',
       'colorRed', 'colorGreen', 'colorBlue', 'backgroundRed',
       'backgroundGreen', 'backgroundBlue', 'backgroundAlpha', 'marginTop',
       'marginRight', 'marginBottom', 'marginLeft', 'paddingTop',
       'paddingRight', 'paddingBottom', 'paddingLeft', 'descendants',
       'relPosX', 'relPosY', 'content', 'tagName_ARTICLE',
       'tagName_ARTICLE-BODY-CONTAINER', 'tagName_DIV', 'tagName_MAIN',
       'tagName_P', 'tagName_SECTION', 'tagName_SPAN', 'tagName_TD',
       'textAlign_-webkit-center', 'textAlign_-webkit-left',
       'textAlign_-webkit-right', 'textAlign_center', 'textAlign_justify',
       'textAlign_left', 'textAlign_right', 'textAlign_start'],
      dtype='object')


In [22]:
data_content_dummy=data_content_dummy.drop(['tagName_DIV','textAlign_start'],axis=1)
print(data_content_dummy.columns)

Index(['site', 'left', 'top', 'width', 'height', 'children', 'textCount',
       'parentCount', 'fontSize', 'linkCount', 'paragraphCount', 'imageCount',
       'colorRed', 'colorGreen', 'colorBlue', 'backgroundRed',
       'backgroundGreen', 'backgroundBlue', 'backgroundAlpha', 'marginTop',
       'marginRight', 'marginBottom', 'marginLeft', 'paddingTop',
       'paddingRight', 'paddingBottom', 'paddingLeft', 'descendants',
       'relPosX', 'relPosY', 'content', 'tagName_ARTICLE',
       'tagName_ARTICLE-BODY-CONTAINER', 'tagName_MAIN', 'tagName_P',
       'tagName_SECTION', 'tagName_SPAN', 'tagName_TD',
       'textAlign_-webkit-center', 'textAlign_-webkit-left',
       'textAlign_-webkit-right', 'textAlign_center', 'textAlign_justify',
       'textAlign_left', 'textAlign_right'],
      dtype='object')
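
The two cells above one-hot encode `tagName` and `textAlign`, then drop the dummy column of the most frequent level (`DIV` and `start`) so that it acts as the baseline. A minimal sketch of the same pattern on a toy frame follows; the frame and its values are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy frame standing in for the real element table (values are made up).
toy = pd.DataFrame({
    "tagName": ["DIV", "DIV", "P", "ARTICLE"],
    "textAlign": ["start", "start", "center", "start"],
    "width": [100, 200, 150, 300],
})

# Most frequent level of each categorical column: this becomes the baseline.
baseline = {col: toy[col].value_counts().idxmax() for col in ["tagName", "textAlign"]}

# One indicator column per level, then drop the indicator of each baseline level.
dummies = pd.get_dummies(toy, columns=["tagName", "textAlign"])
dummies = dummies.drop([f"{col}_{level}" for col, level in baseline.items()], axis=1)

print(baseline)
print(list(dummies.columns))
```

Dropping one level per column avoids perfectly collinear indicator columns, which matters mostly for the linear model in the grid.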


In [23]:
sites_content=data_content_dummy['site']
y_content=data_content_dummy['content']
X_content=data_content_dummy.drop(['site','content'],axis=1)

In [24]:
from sklearn.preprocessing import MinMaxScaler
mms_content=MinMaxScaler()
X_content_sc=mms_content.fit_transform(X_content)
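
As a small illustration of what the scaler does, the sketch below fits `MinMaxScaler` on a toy training array and then applies it to a new row. The arrays are made up; the point is that `transform` reuses the minimum and maximum learned in `fit`, so unseen values can fall outside the [0, 1] range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy training data: two features with different ranges (values are made up).
train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
new = np.array([[2.5, 40.0]])

scaler = MinMaxScaler()
train_sc = scaler.fit_transform(train)  # learns per-column min and max
new_sc = scaler.transform(new)          # reuses the training min and max

print(train_sc)
print(new_sc)  # second feature exceeds 1 because 40 > training max of 30
```

This is why the fitted scaler has to be kept and reused for test pages rather than refitted on them.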

In [25]:
kfold=KFoldProba(k=10,random_state=3)
kfold.fit(sites_content.to_numpy(),X_content_sc,y_content.to_numpy())
result_content=grid_validation(
    kfold=kfold,
    models=[        
        {
            'model':KNeighborsClassifier,
            'params':[
                {'weights':'uniform','n_neighbors':5},
                {'weights':'uniform','n_neighbors':10},
                {'weights':'uniform','n_neighbors':15},
                {'weights':'distance','n_neighbors':5},
                {'weights':'distance','n_neighbors':10},
                {'weights':'distance','n_neighbors':15}
            ]
        },
        {
             'model':LogisticRegression,
             'params':[
                {'C':0.001,'solver':'liblinear','random_state':5},
                {'C':0.01,'solver':'liblinear','random_state':5},
                {'C':0.1,'solver':'liblinear','random_state':5},
                {'C':1,'solver':'liblinear','random_state':5},
                {'C':10,'solver':'liblinear','random_state':5},
                {'C':100,'solver':'liblinear','random_state':5},
                {'penalty':'l1','C':0.001,'solver':'liblinear','random_state':5},
                {'penalty':'l1','C':0.01,'solver':'liblinear','random_state':5},
                {'penalty':'l1','C':0.1,'solver':'liblinear','random_state':5},
                {'penalty':'l1','C':1,'solver':'liblinear','random_state':5},
                {'penalty':'l1','C':10,'solver':'liblinear','random_state':5},
                {'penalty':'l1','C':100,'solver':'liblinear','random_state':5},
            ]
        },
        {
            'model':RandomForestClassifier,
            'params':[
                {'n_estimators':10, 'random_state':6},
                {'n_estimators':20, 'random_state':6},
                {'n_estimators':50, 'random_state':6},
                {'n_estimators':70, 'random_state':6},
                {'n_estimators':100, 'random_state':6}
            ]
        }
    ]
)

Starting classifer: 12:40 -  {'model': <class 'sklearn.neighbors.classification.KNeighborsClassifier'>, 'params': [{'weights': 'uniform', 'n_neighbors': 5}, {'weights': 'uniform', 'n_neighbors': 10}, {'weights': 'uniform', 'n_neighbors': 15}, {'weights': 'distance', 'n_neighbors': 5}, {'weights': 'distance', 'n_neighbors': 10}, {'weights': 'distance', 'n_neighbors': 15}]}
	Starting 1 out of 23 testing: 12:40
		fold:0, presampled:38507 oversampled:76838
		fold:0, time:6 secs, [0.997, 0.300,0.300,0.300] 
		fold:1, presampled:40080 oversampled:79980
		fold:1, time:4 secs, [0.997, 0.500,0.625,0.556] 
		fold:2, presampled:38771 oversampled:77366
		fold:2, time:5 secs, [0.998, 0.500,0.500,0.500] 
		fold:3, presampled:38926 oversampled:77676
		fold:3, time:3 secs, [0.999, 0.700,0.700,0.700] 
		fold:4, presampled:39428 oversampled:78680
		fold:4, time:3 secs, [0.998, 0.700,0.700,0.700] 
		fold:5, presampled:38633 oversampled:77090
		fold:5, time:4 secs, [0.997, 0.300,0.300,0.300] 
		fold:6, pr

		fold:3, time:0 secs, [0.996, 0.200,0.200,0.200] 
		fold:4, presampled:39428 oversampled:78680
		fold:4, time:0 secs, [0.996, 0.300,0.300,0.300] 
		fold:5, presampled:38633 oversampled:77090
		fold:5, time:0 secs, [0.996, 0.100,0.100,0.100] 
		fold:6, presampled:40649 oversampled:81122
		fold:6, time:0 secs, [0.994, 0.200,0.200,0.200] 
		fold:7, presampled:38158 oversampled:76140
		fold:7, time:0 secs, [0.997, 0.200,0.200,0.200] 
		fold:8, presampled:39838 oversampled:79500
		fold:8, time:0 secs, [0.995, 0.100,0.100,0.100] 
		fold:9, presampled:35711 oversampled:71246
		fold:9, time:0 secs, [0.998, 0.100,0.100,0.100] 
	Finished 7 out of 23 testing: 12:45, time:2 secs
	Metrics accuracy:0.996 precision:0.160 recall:0.163 f1:0.161
	Starting 8 out of 23 testing: 12:45
		fold:0, presampled:38507 oversampled:76838
		fold:0, time:0 secs, [0.997, 0.400,0.400,0.400] 
		fold:1, presampled:40080 oversampled:79980
		fold:1, time:0 secs, [0.995, 0.100,0.125,0.111] 
		fold:2, presampled:38771 overs

		fold:8, time:6 secs, [0.996, 0.300,0.300,0.300] 
		fold:9, presampled:35711 oversampled:71246
		fold:9, time:3 secs, [0.998, 0.300,0.300,0.300] 
	Finished 14 out of 23 testing: 12:47, time:58 secs
	Metrics accuracy:0.997 precision:0.360 recall:0.365 f1:0.362
	Starting 15 out of 23 testing: 12:47
		fold:0, presampled:38507 oversampled:76838
		fold:0, time:19 secs, [0.998, 0.500,0.500,0.500] 
		fold:1, presampled:40080 oversampled:79980
		fold:1, time:30 secs, [0.997, 0.400,0.500,0.444] 
		fold:2, presampled:38771 oversampled:77366
		fold:2, time:28 secs, [0.998, 0.500,0.500,0.500] 
		fold:3, presampled:38926 oversampled:77676
		fold:3, time:26 secs, [0.998, 0.600,0.600,0.600] 
		fold:4, presampled:39428 oversampled:78680
		fold:4, time:22 secs, [0.998, 0.600,0.600,0.600] 
		fold:5, presampled:38633 oversampled:77090
		fold:5, time:24 secs, [0.998, 0.600,0.600,0.600] 
		fold:6, presampled:40649 oversampled:81122
		fold:6, time:14 secs, [0.997, 0.600,0.600,0.600] 
		fold:7, presampled:3



		fold:0, time:39 secs, [0.998, 0.500,0.500,0.500] 
		fold:1, presampled:40080 oversampled:79980
		fold:1, time:323 secs, [0.998, 0.600,0.750,0.667] 
		fold:2, presampled:38771 oversampled:77366
		fold:2, time:405 secs, [0.998, 0.500,0.500,0.500] 
		fold:3, presampled:38926 oversampled:77676
		fold:3, time:291 secs, [0.998, 0.500,0.500,0.500] 
		fold:4, presampled:39428 oversampled:78680
		fold:4, time:257 secs, [0.998, 0.600,0.600,0.600] 
		fold:5, presampled:38633 oversampled:77090
		fold:5, time:161 secs, [0.999, 0.700,0.700,0.700] 
		fold:6, presampled:40649 oversampled:81122
		fold:6, time:152 secs, [0.998, 0.700,0.700,0.700] 
		fold:7, presampled:38158 oversampled:76140
		fold:7, time:183 secs, [0.998, 0.600,0.600,0.600] 
		fold:8, presampled:39838 oversampled:79500
		fold:8, time:184 secs, [0.996, 0.400,0.400,0.400] 
		fold:9, presampled:35711 oversampled:71246
		fold:9, time:144 secs, [1.000, 0.900,0.900,0.900] 
	Finished 17 out of 23 testing: 13:31, time:2142 secs
	Metrics accuracy:0.998 precision:0.600 recall:0.615 f1:0.607
	Starting 18 out of 23 testing: 13:31
		fold:0, presampled:38507 oversampled:76838
		fold:0, time:77 secs, [0.998, 0.500,0.500,0.500] 
		fold:1, presampled:40080 oversampled:79980
		fold:1, time:305 secs, [0.998, 0.600,0.750,0.667] 
		fold:2, presampled:38771 oversampled:77366
		fold:2, time:213 secs, [0.998, 0.500,0.500,0.500] 
		fold:3, presampled:38926 oversampled:77676
		fold:3, time:336 secs, [0.998, 0.500,0.500,0.500] 
		fold:4, presampled:39428 oversampled:78680
		fold:4, time:304 secs, [0.998, 0.600,0.600,0.600] 
		fold:5, presampled:38633 oversampled:77090
		fold:5, time:238 secs, [0.999, 0.700,0.700,0.700] 
		fold:6, presampled:40649 oversampled:81122
		fold:6, time:233 secs, [0.998, 0.700,0.700,0.700] 
		fold:7, presampled:38158 oversampled:76140
		fold:7, time:262 secs, [0.998, 0.600,0.600,0.600] 
		fold:8, presampled:39838 oversampled:79500
		fold:8, time:205 secs, [0.996, 0.400,0.400,0.400] 
		fold:9, presampled:35711 oversampled:71246
		fold:9, time:240 secs, [1.000, 0.900,0.900,0.900] 
	Finished 18 out of 23 testing: 14:11, time:2416 secs
	Metrics accuracy:0.998 precision:0.600 recall:0.615 f1:0.607
Starting classifer: 14:11 -  {'model': <class 'sklearn.ensemble.forest.RandomForestClassifier'>, 'params': [{'n_estimators': 10, 'random_state': 6}, {'n_estimators': 20, 'random_state': 6}, {'n_estimators': 50, 'random_state': 6}, {'n_estimators': 70, 'random_state': 6}, {'n_estimators': 100, 'random_state': 6}]}
	Starting 19 out of 23 testing: 14:11
		fold:0, presampled:38507 oversampled:76838
		fold:0, time:0 secs, [0.999, 0.700,0.700,0.700] 
		fold:1, presampled:40080 oversampled:79980
		fold:1, time:0 secs, [0.997, 0.500,0.625,0.556] 
		fold:2, presampled:38771 oversampled:77366
		fold:2, time:0 secs, [0.998, 0.600,0.600,0.600] 
		fold:3, presampled:38926 oversampled:77676
		fold:3, time:0 secs, [0.999, 0.800,0.800,0.800] 
		fold:4, presampled:39428 oversampled:78680
		fold:4, time:0 secs, [0.999, 0.900,0.900,0.900]
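
The log above reports, for every fold, the training size before and after oversampling. `KFoldProba` and `grid_validation` are helpers defined elsewhere in the notebook, so the sketch below is not their implementation. It only illustrates, under the assumption that folds are split by site and that only the training part is oversampled, how such a loop could look with scikit-learn's `GroupKFold` and a hypothetical `oversample` helper on toy data.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(3)

# Toy data: 6 sites, 5 elements each, exactly one positive element per site.
sites = np.repeat(np.arange(6), 5)
X = rng.normal(size=(30, 4))
y = np.zeros(30, dtype=int)
y[::5] = 1

def oversample(X_train, y_train, rng):
    """Duplicate minority rows at random until both classes have equal counts."""
    pos = np.flatnonzero(y_train == 1)
    neg = np.flatnonzero(y_train == 0)
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    idx = np.concatenate([neg, pos, extra])
    return X_train[idx], y_train[idx]

fold_sizes = []
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=sites):
    # No site appears on both sides of the split.
    assert set(sites[train_idx]).isdisjoint(sites[test_idx])
    # Only the training part is oversampled; the test part stays untouched.
    X_res, y_res = oversample(X[train_idx], y[train_idx], rng)
    fold_sizes.append((len(train_idx), len(y_res)))
    print("presampled:%d oversampled:%d" % fold_sizes[-1])
```

Splitting by site keeps all elements of one page on the same side of the split, so a fold score reflects pages from sites the model has not seen.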

In [26]:
def bestModel(models,metric_to_compare):
    """Return the result entry with the highest value of the given metric.

    Returns None when no entry scores above 0.
    """
    best_model_metric=0
    best_model=None
    for model in models:
        if model['metrics'][metric_to_compare] > best_model_metric:
            best_model_metric=model['metrics'][metric_to_compare]
            best_model=model
    return best_model
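
For illustration, the same selection can be written with the built-in `max`. The result entries below are hypothetical and only mimic the keys that the later cells read (`model_name`, `model_param`, `metrics`); the helper name `best_by` is also made up.

```python
# Hypothetical result entries in the shape the later cells read (values are made up).
results = [
    {"model_name": "knn", "model_param": {"n_neighbors": 5}, "metrics": {"f1": 0.40}},
    {"model_name": "logreg", "model_param": {"C": 1}, "metrics": {"f1": 0.61}},
    {"model_name": "forest", "model_param": {"n_estimators": 50}, "metrics": {"f1": 0.84}},
]

def best_by(results, metric):
    """Return the entry with the highest value of `metric`, or None if the list is empty."""
    return max(results, key=lambda r: r["metrics"][metric], default=None)

best = best_by(results, "f1")
print(best["model_name"], best["model_param"])
```

Both versions keep the first entry on ties. They differ in one edge case: `bestModel` returns `None` when every score is 0, while `max` would still return the first entry.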

## Best Models

In [27]:
title_best_model=bestModel(result_title,'f1')
print("Best model for TITLE:")
print("Classifier:",title_best_model['model_name'])
print("Parameters:",title_best_model['model_param'])
print("Metrics:",title_best_model['metrics'])

# Refit the winning classifier on the full, oversampled training set.
title_clf=title_best_model['model']
#title_clf=LogisticRegression(penalty='l1',C=1,solver='liblinear',random_state=5)
ros_title = RandomOverSampler(random_state=3)
train_X_resampled, train_y_resampled = ros_title.fit_resample(X_title_sc,y_title)
print("\t\tpre-oversampled:%d oversampled:%d"%(X_title.shape[0], train_X_resampled.shape[0]))

title_clf.fit(train_X_resampled,train_y_resampled)

# Sanity check on the training data itself: these scores are optimistic
# and do not estimate performance on unseen sites.
pred_proba=title_clf.predict_proba(X_title_sc)
y_pred=choose_max_probability(sites_title.to_numpy(),pred_proba)
a=metrics.accuracy_score(y_title,y_pred)
p=metrics.precision_score(y_title,y_pred)
r=metrics.recall_score(y_title,y_pred)
f=metrics.f1_score(y_title,y_pred)
print('accuracy:%0.3f, precision:%0.3f, recall:%0.3f, f1:%0.3f'%(a,p,r,f))

Best model for TITLE:
Classifier: {'model': <class 'sklearn.ensemble.forest.RandomForestClassifier'>, 'params': [{'n_estimators': 10, 'random_state': 6}, {'n_estimators': 20, 'random_state': 6}, {'n_estimators': 50, 'random_state': 6}, {'n_estimators': 70, 'random_state': 6}, {'n_estimators': 100, 'random_state': 6}]}
Parameters: {'n_estimators': 50, 'random_state': 6}
Metrics: {'accuracy': 0.9998423204193703, 'precision': 0.9399999999999998, 'recall': 0.9399999999999998, 'f1': 0.9400000000000001}
		pre-oversampled:66688 oversampled:133176
accuracy:1.000, precision:1.000, recall:1.000, f1:1.000
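
`choose_max_probability` is defined elsewhere in the notebook and its code is not shown here. Since each page has a single title, a plausible reading is that it marks, for every site, only the element with the highest positive-class probability. The sketch below implements that reading under this assumption; the function name and the toy inputs are hypothetical.

```python
import numpy as np

def choose_max_probability_sketch(sites, pred_proba):
    """For each site, label only the row with the highest positive-class probability as 1."""
    sites = np.asarray(sites)
    positive = np.asarray(pred_proba)[:, 1]  # column 1 = P(class 1) for a binary classifier
    y_pred = np.zeros(len(sites), dtype=int)
    for site in np.unique(sites):
        rows = np.flatnonzero(sites == site)
        y_pred[rows[np.argmax(positive[rows])]] = 1
    return y_pred

# Toy input: three elements from site "a", two from site "b" (values are made up).
sites = np.array(["a", "a", "a", "b", "b"])
proba = np.array([[0.9, 0.1], [0.4, 0.6], [0.7, 0.3], [0.8, 0.2], [0.95, 0.05]])
picked = choose_max_probability_sketch(sites, proba)
print(picked)
```

With exactly one prediction per site, a wrong pick costs one false positive and one false negative at once, which would explain why precision and recall are equal in most of the folds logged above.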


In [34]:
content_best_model=bestModel(result_content,'f1')
print("Best model for CONTENT:")
print("Classifier:",content_best_model['model_name'])
print("Parameters:",content_best_model['model_param'])
print("Metrics:",content_best_model['metrics'])

# Refit the winning classifier on the full, oversampled training set.
content_clf=content_best_model['model']
#content_clf=LogisticRegression(C=10,solver='liblinear',random_state=5)
ros_content = RandomOverSampler(random_state=3)
train_X_resampled, train_y_resampled = ros_content.fit_resample(X_content_sc,y_content)
print("\t\tpre-oversampled:%d oversampled:%d"%(X_content.shape[0], train_X_resampled.shape[0]))

content_clf.fit(train_X_resampled,train_y_resampled)

# Sanity check on the training data itself: these scores are optimistic
# and do not estimate performance on unseen sites.
pred_proba=content_clf.predict_proba(X_content_sc)
y_pred=choose_max_probability(sites_content.to_numpy(),pred_proba)
a=metrics.accuracy_score(y_content,y_pred)
p=metrics.precision_score(y_content,y_pred)
r=metrics.recall_score(y_content,y_pred)
f=metrics.f1_score(y_content,y_pred)
print('accuracy:%0.3f, precision:%0.3f, recall:%0.3f, f1:%0.3f'%(a,p,r,f))

Best model for CONTENT:
Classifier: {'model': <class 'sklearn.ensemble.forest.RandomForestClassifier'>, 'params': [{'n_estimators': 10, 'random_state': 6}, {'n_estimators': 20, 'random_state': 6}, {'n_estimators': 50, 'random_state': 6}, {'n_estimators': 70, 'random_state': 6}, {'n_estimators': 100, 'random_state': 6}]}
Parameters: {'n_estimators': 50, 'random_state': 6}
Metrics: {'accuracy': 0.9991566797705819, 'precision': 0.8300000000000001, 'recall': 0.85, 'f1': 0.8388888888888889}
		pre-oversampled:43189 oversampled:86182
accuracy:1.000, precision:0.980, recall:1.000, f1:0.990
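
The `pre-oversampled` and `oversampled` counts printed above also reveal how imbalanced the data is. Assuming `RandomOverSampler` ran with its default strategy, which resamples the minority class up to the size of the majority class, the class counts can be recovered with a little arithmetic. The helper name `class_counts` is made up for this sketch.

```python
def class_counts(pre_total, post_total):
    """Recover (majority, minority) row counts from the totals before and after
    oversampling, assuming the result is balanced: post_total == 2 * majority."""
    majority, remainder = divmod(post_total, 2)
    assert remainder == 0, "a balanced two-class total must be even"
    return majority, pre_total - majority

title_counts = class_counts(66688, 133176)    # totals printed for the title data
content_counts = class_counts(43189, 86182)   # totals printed for the content data
print(title_counts)    # (66588, 100)
print(content_counts)  # (43091, 98)
```

Under that assumption, about 100 positive rows sit among tens of thousands of negatives, which is why accuracy stays near 1.000 in every fold while precision and recall vary widely.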


## Testing my best models

In [35]:
def test(sites,X,y,black_list,model_cols,freqs,scaler,model):
    # Drop elements whose tag is blacklisted, using one shared mask.
    keep=~X['tagName'].isin(black_list)
    sites_white=sites[keep]
    X_white=X[keep]
    y_white=y[keep]
    X_dummy=pd.get_dummies(X_white,columns=['tagName','textAlign'])
    X_final=pd.DataFrame(index=X_dummy.index)

    # Align with the training columns; dummies missing from the test data become 0,
    # and dummies unknown to the model are left out.
    for col in model_cols:
        if col in X_dummy.columns:
            X_final[col]=X_dummy[col]
        else:
            X_final[col]=0

    # Coerce object columns to numbers and fill unparseable values with the training mode.
    for obj_col in X_final.columns[X_final.dtypes=='O'].values:
        X_final[obj_col]=pd.to_numeric(X_final[obj_col],errors='coerce')
        X_final[obj_col] = X_final[obj_col].fillna(freqs[obj_col])

    # Reuse the scaler fitted on the training data, then pick one element per site.
    X_sc=scaler.transform(X_final)
    pred_proba=model.predict_proba(X_sc)
    y_pred=choose_max_probability(sites_white.to_numpy(),pred_proba)

    a=metrics.accuracy_score(y_white,y_pred)
    p=metrics.precision_score(y_white,y_pred)
    r=metrics.recall_score(y_white,y_pred)
    f1=metrics.f1_score(y_white,y_pred)
    print('Metrics accuracy:%.3f precision:%.3f recall:%.3f f1:%.3f'%(a,p,r,f1))
    
test(data_test['site'],
     data_test.drop(['site','content','title'],axis=1),
     data_test['title'],
     tag_blacklist,
     X_title.columns.values.tolist(),
     X_title[['marginTop','marginRight','marginBottom','marginLeft','paddingTop','paddingRight','paddingBottom',
 'paddingLeft']].mode().iloc[0],
    mms_title,
    title_clf)

test(data_test['site'],
     data_test.drop(['site','content','title'],axis=1),
     data_test['content'],
     tag_content_blacklist,
     X_content.columns.values.tolist(),
     X_content[['marginTop','marginRight','marginBottom','marginLeft','paddingTop','paddingRight','paddingBottom',
 'paddingLeft']].mode().iloc[0],
    mms_content,
    content_clf)

Metrics accuracy:1.000 precision:0.900 recall:0.900 f1:0.900
Metrics accuracy:0.999 precision:0.750 recall:0.750 f1:0.750
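
The column loop inside `test` makes the test features match the columns the model was trained on. As a sketch of the same idea, pandas' `reindex` can do the alignment in one call. The column list and the toy page below are made up for illustration.

```python
import pandas as pd

# Columns the (hypothetical) model was trained on, in training order.
model_cols = ["width", "tagName_ARTICLE", "tagName_P", "textAlign_center"]

# Toy elements from a new page; SPAN and left were never seen as model columns.
new_page = pd.DataFrame({
    "width": [120, 80],
    "tagName": ["P", "SPAN"],
    "textAlign": ["left", "center"],
})
X_dummy = pd.get_dummies(new_page, columns=["tagName", "textAlign"])

# Keep exactly the training columns, in training order; absent ones become 0.
X_final = X_dummy.reindex(columns=model_cols, fill_value=0)
print(X_final.astype(int))
```

Dummy columns the model has never seen (`tagName_SPAN`, `textAlign_left` here) are dropped, and training columns missing from the page are filled with 0, which is the same outcome as the explicit loop.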


## Persisting my models

In [36]:
from joblib import dump
dump(title_clf, 'title_classifier.joblib') 

['title_classifier.joblib']

In [37]:
dump(content_clf, 'content_classifier.joblib') 

['content_classifier.joblib']

In [38]:
dump(mms_title,"scaler_title.joblib")
dump(mms_content,"scaler_content.joblib")

['scaler_content.joblib']
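
To show how the dumped objects can be read back, the sketch below round-trips a small fitted scaler through `joblib.dump` and `joblib.load`. It uses a temporary directory and a made-up file name so that it does not touch the files written above.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.preprocessing import MinMaxScaler

# A small fitted scaler standing in for mms_title / mms_content.
scaler = MinMaxScaler().fit(np.array([[0.0], [10.0]]))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scaler_demo.joblib")  # hypothetical file name
    dump(scaler, path)
    restored = load(path)

# The restored object carries the learned min and max, so it scales identically.
print(restored.transform(np.array([[5.0]])))
```

The classifiers saved as `title_classifier.joblib` and `content_classifier.joblib` can be restored with `load` in the same way.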

In [39]:
import json
# Most frequent margin/padding values, used to fill values that cannot be parsed.
freq=X_title[['marginTop','marginRight','marginBottom','marginLeft','paddingTop','paddingRight','paddingBottom',
 'paddingLeft']].mode().iloc[0].to_dict()
cols_title=X_title.columns.values.tolist()
cols_content=X_content.columns.values.tolist()

# One item per line: modes (JSON), title columns, title blacklist,
# content columns, content blacklist.
with open("datasetinfo.txt", "w") as f:
    f.write(json.dumps(freq))
    f.write('\n')
    f.write(','.join(cols_title))
    f.write('\n')
    f.write(','.join(tag_blacklist))
    f.write('\n')
    f.write(','.join(cols_content))
    f.write('\n')
    f.write(','.join(tag_content_blacklist))
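
The file written above has five lines: the JSON-encoded modes, then four comma-separated lists. A minimal sketch of a reader for that layout follows. The function name `read_dataset_info` and the stand-in file contents are hypothetical, and the sketch assumes that no column or tag name contains a comma.

```python
import json
import os
import tempfile

def read_dataset_info(path):
    """Parse the five-line layout: JSON modes, then four comma-separated lists."""
    with open(path) as f:
        lines = f.read().split("\n")
    return {
        "freq": json.loads(lines[0]),
        "cols_title": lines[1].split(","),
        "tag_blacklist": lines[2].split(","),
        "cols_content": lines[3].split(","),
        "tag_content_blacklist": lines[4].split(","),
    }

# Tiny stand-in file with made-up values in the same layout.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "datasetinfo_demo.txt")
    with open(path, "w") as f:
        f.write("\n".join([
            json.dumps({"marginTop": 0.0}),
            "left,top",
            "SCRIPT,STYLE",
            "left,top,tagName_P",
            "SCRIPT,FB:LIKE",
        ]))
    info = read_dataset_info(path)

print(info["freq"])
print(info["cols_content"])
```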