# News Content Extractor

In this project, my goal is to build a model that extracts the title and main text content from a news web page. When we crawl websites to build a search engine or for other purposes, extraction turns out to be difficult because of noisy content unrelated to the main article. Such a model would be useful in those cases.

## Approach
When we visit a website, we can easily distinguish the title and content of the main article. My goal is to give the machine that same sense.

In my opinion, we reach that conclusion based on visual perception: for example, the position on the display and the size, color and weight of the text.

On top of that, I think developers adopt similar practices when building web pages. There may be similarities such as using the same HTML tags for similar components and similar hierarchical structures of HTML elements.

Therefore, my model will be trained to capture that visual sense and those structural similarities.

One more thing to note: I will train two models, one for the title and one for the content of the page.

First, I will train the model on a dataset extracted from Mongolian news websites because they were convenient for me. I hope it will work for any language because HTML is universal; once I reach my goal, I will retrain the model on websites in other languages if necessary.

## Dataset

To create the training dataset, I wrote a very simple JavaScript (TypeScript) web scraper that opens a browser (using [puppeteer](https://github.com/GoogleChrome/puppeteer)), loads websites, and collects HTML elements and their attributes. You can see the web scraper and its usage [here](../webscraper).

#### Attributes
**site**: name of the website from which the element was extracted<br/>
**url**: URL of the page from which the element was extracted<br/>
**tagName**: the element's HTML tag<br/>
**left**: X coordinate of the element's top-left corner on the page<br/>
**top**: Y coordinate of the element's top-left corner on the page<br/>
**width**: the element's width on the page<br/>
**height**: the element's height on the page<br/>
**children**: count of direct child elements<br/>
**textCount**: length of the text in the element<br/>
**parentCount**: count of ancestor elements<br/>
**fontSize**: font size of the text<br/>
**linkCount**: count of &#60;a&#62; elements in the element<br/>
**paragraphCount**: count of &#60;p&#62; elements in the element<br/>
**imageCount**: count of &#60;img&#62; elements in the element<br/>
**colorRed**: red component of the RGB color of the text in the element<br/>
**colorGreen**: green component of the RGB color of the text in the element<br/>
**colorBlue**: blue component of the RGB color of the text in the element<br/>
**backgroundRed**: red component of the RGB background color of the element<br/>
**backgroundGreen**: green component of the RGB background color of the element<br/>
**backgroundBlue**: blue component of the RGB background color of the element<br/>
**backgroundAlpha**: transparency (alpha) component of the background color of the element<br/>
**textAlign**: alignment of the text in the element<br/>
**marginTop**: top margin of the element<br/>
**marginRight**: right margin of the element<br/>
**marginBottom**: bottom margin of the element<br/>
**marginLeft**: left margin of the element<br/>
**paddingTop**: top padding of the element<br/>
**paddingRight**: right padding of the element<br/>
**paddingBottom**: bottom padding of the element<br/>
**paddingLeft**: left padding of the element<br/>
**descendants**: count of descendant elements<br/>
**title**: whether the element is the title of the article<br/>
**content**: whether the element is the main content of the article

The dataset is heavily imbalanced: on each web page only one element is the title and one other element is the content, while a page contains from several hundred to several thousand elements.
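To make the imbalance concrete, here is a minimal sketch that measures the share of positive rows for each label. The `toy` frame and the `positive_ratio` helper are hypothetical stand-ins, not part of the scraper output; the same helper could be pointed at the real `title` and `content` columns.

```python
import pandas as pd

# Hypothetical stand-in for one scraped page:
# 1 title element, 1 content element, 998 other elements.
toy = pd.DataFrame({
    "title":   [True] + [False] * 999,
    "content": [False, True] + [False] * 998,
})

def positive_ratio(frame: pd.DataFrame, column: str) -> float:
    """Share of rows where the boolean label column is True."""
    return float(frame[column].mean())

title_ratio = positive_ratio(toy, "title")
content_ratio = positive_ratio(toy, "content")
print(f"title: {title_ratio:.4f}, content: {content_ratio:.4f}")
```

With a ratio this small, a classifier that always answers "negative" is almost always right, which is why accuracy alone says little here.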

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("../webscraper/out.csv", quotechar='"', skipinitialspace=True)

print(data.columns)
print(data.shape)

Index(['site', 'url', 'tagName', 'left', 'top', 'width', 'height', 'children',
       'textCount', 'parentCount', 'fontSize', 'linkCount', 'paragraphCount',
       'imageCount', 'colorRed', 'colorGreen', 'colorBlue', 'backgroundRed',
       'backgroundGreen', 'backgroundBlue', 'backgroundAlpha', 'textAlign',
       'marginTop', 'marginRight', 'marginBottom', 'marginLeft', 'paddingTop',
       'paddingRight', 'paddingBottom', 'paddingLeft', 'descendants', 'title',
       'content'],
      dtype='object')
(52275, 33)


## Data Preprocessing

In this phase, I will prepare the dataset for the training. 

### 1. Removing fields that are not useful
The model should work independently of the website, so I will remove the *url* attribute. The *site* attribute is needed later to separate the training and validation datasets, so I will keep it for now.

In [2]:
data=data.drop(['url'],axis=1)
data.head()

Unnamed: 0,site,tagName,left,top,width,height,children,textCount,parentCount,fontSize,...,marginRight,marginBottom,marginLeft,paddingTop,paddingRight,paddingBottom,paddingLeft,descendants,title,content
0,ikon,DIV,0.0,35.0,1920,0,0,0,1,12.0,...,0,0,0,0.0,0.0,0.0,0.0,0,False,False
1,ikon,DIV,0.0,35.0,1920,6170,2,1300,1,12.0,...,0,0,0,0.0,0.0,0.0,0.0,360,False,False
2,ikon,DIV,0.0,0.0,1920,60,1,5,2,12.0,...,0,0,0,0.0,0.0,0.0,0.0,18,False,False
3,ikon,DIV,240.0,0.0,1440,60,3,5,3,12.0,...,240,0,240,0.0,0.0,0.0,0.0,17,False,False
4,ikon,DIV,258.0,10.0,103,38,1,0,4,12.0,...,0,0,18,0.0,0.0,0.0,0.0,3,False,False


### 2. Mapping boolean classes to numerical classes
The *title* and *content* columns are the classes we need to predict. They hold boolean values, which classifiers can handle directly, so technically I do not need to map them to numerical values, but I will map them to 0 and 1 for convenience.

In [3]:
class_mapping={False:0,True:1}
data['title']=data['title'].map(class_mapping)
data['content']=data['content'].map(class_mapping)
data.head()

Unnamed: 0,site,tagName,left,top,width,height,children,textCount,parentCount,fontSize,...,marginRight,marginBottom,marginLeft,paddingTop,paddingRight,paddingBottom,paddingLeft,descendants,title,content
0,ikon,DIV,0.0,35.0,1920,0,0,0,1,12.0,...,0,0,0,0.0,0.0,0.0,0.0,0,0,0
1,ikon,DIV,0.0,35.0,1920,6170,2,1300,1,12.0,...,0,0,0,0.0,0.0,0.0,0.0,360,0,0
2,ikon,DIV,0.0,0.0,1920,60,1,5,2,12.0,...,0,0,0,0.0,0.0,0.0,0.0,18,0,0
3,ikon,DIV,240.0,0.0,1440,60,3,5,3,12.0,...,240,0,240,0.0,0.0,0.0,0.0,17,0,0
4,ikon,DIV,258.0,10.0,103,38,1,0,4,12.0,...,0,0,18,0.0,0.0,0.0,0.0,3,0,0
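As a side note, for columns that really are of boolean dtype, a cast gives the same result as an explicit mapping dictionary. The sketch below uses a hypothetical three-row `labels` frame to compare the two; it assumes the CSV was parsed into true booleans rather than strings.

```python
import pandas as pd

# Hypothetical labels; assumes the columns are real booleans, not strings.
labels = pd.DataFrame({
    "title":   [False, True, False],
    "content": [False, False, True],
})

# Explicit dictionary mapping, as in the cell above.
mapped = labels["title"].map({False: 0, True: 1})

# Equivalent cast for boolean columns.
cast = labels["title"].astype(int)

same = mapped.tolist() == cast.tolist()
print(same, cast.tolist())
```

One difference worth knowing: `map` turns any value missing from the dictionary into `NaN`, whereas `astype(int)` raises on values it cannot convert.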


### 3. Categorical features
My dataset has two categorical features, *tagName* and *textAlign*, and both are nominal, so I will use one-hot encoding. I will also drop the dummy column of the most frequent category in each feature. That category then acts as the baseline (all zeros), so a tag name or text alignment that first appears in the testing or production phase can be treated like the baseline instead of breaking the feature set.

Before doing that, to reduce dimensionality, I can filter the tag names and remove the rows whose tags will never belong to a positive class.

In [5]:
# some tag names are lowercase. it is better all of them are uppercase.
data['tagName']=data['tagName'].str.upper()
data['tagName'].unique()

array(['DIV', 'A', 'SPAN', 'IMG', 'H1', 'P', 'STRONG', 'H2', 'EM', 'H4',
       'TABLE', 'TBODY', 'TR', 'TD', 'IFRAME', 'BR', 'SCRIPT', 'INPUT',
       'LABEL', 'TEXTAREA', 'BUTTON', 'I', 'NOSCRIPT', 'SECTION', 'SVG',
       'RECT', 'UL', 'LI', 'FORM', 'CIRCLE', 'PATH', 'NAV', 'LINE', 'HR',
       'ARTICLE', 'H5', 'FIELDSET', 'FOOTER', 'HEADER', 'H6', 'H3',
       'SMALL', 'VIDEO', 'SOURCE', 'ASIDE', 'STYLE', 'SUP', 'TITLE', 'G',
       'BLOCKQUOTE', 'FIGURE', 'FIGCAPTION', 'PROGRESS', 'B', 'OL',
       'META', 'TEXT', 'TIME', 'INS', 'LINK', 'SELECT', 'OPTION', 'ABBR',
       'MARQUEE', 'CENTER', 'U', 'FONT', 'MAIN', 'AMP-ANALYTICS', 'DEFS',
       'THEAD', 'TH', 'MAP', 'AREA'], dtype=object)

In [7]:
tag_blacklist=['IMG','IFRAME','SCRIPT','INPUT','TEXTAREA','BUTTON','NOSCRIPT','SVG','RECT','FORM','CIRCLE','PATH',
              'LINE','HR','FIELDSET','VIDEO','SOURCE','STYLE','SUP','TITLE','G','FIGURE','FIGCAPTION','PROGRESS','META',
              'SELECT','OPTION','AMP-ANALYTICS','DEFS','MAP','AREA']
data=data[~data['tagName'].isin(tag_blacklist)]

In [8]:
print(data['tagName'].value_counts().idxmax())
print(data['textAlign'].value_counts().idxmax())
data_dummy=pd.get_dummies(data,columns=['tagName','textAlign'])
print(data_dummy.columns)
data_dummy=data_dummy.drop(['tagName_DIV','textAlign_start'],axis=1)
print(data_dummy.columns)

DIV
start
Index(['site', 'left', 'top', 'width', 'height', 'children', 'textCount',
       'parentCount', 'fontSize', 'linkCount', 'paragraphCount', 'imageCount',
       'colorRed', 'colorGreen', 'colorBlue', 'backgroundRed',
       'backgroundGreen', 'backgroundBlue', 'backgroundAlpha', 'marginTop',
       'marginRight', 'marginBottom', 'marginLeft', 'paddingTop',
       'paddingRight', 'paddingBottom', 'paddingLeft', 'descendants', 'title',
       'content', 'tagName_A', 'tagName_ABBR', 'tagName_ARTICLE',
       'tagName_ASIDE', 'tagName_B', 'tagName_BLOCKQUOTE', 'tagName_BR',
       'tagName_CENTER', 'tagName_DIV', 'tagName_EM', 'tagName_FONT',
       'tagName_FOOTER', 'tagName_H1', 'tagName_H2', 'tagName_H3',
       'tagName_H4', 'tagName_H5', 'tagName_H6', 'tagName_HEADER', 'tagName_I',
       'tagName_INS', 'tagName_LABEL', 'tagName_LI', 'tagName_LINK',
       'tagName_MAIN', 'tagName_MARQUEE', 'tagName_NAV', 'tagName_OL',
       'tagName_P', 'tagName_SECTION', 'tagName_SMALL', '
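The idea that an unseen category falls back to the dropped baseline can be sketched as follows. This is an illustration under assumptions, not the notebook's own code: the `train` and `new` frames are made up, and aligning with `reindex` is one possible way to apply the training columns to new data.

```python
import pandas as pd

# Hypothetical training tags and new tags; DIALOG never occurs in training.
train = pd.DataFrame({"tagName": ["DIV", "DIV", "P", "H1"]})
new = pd.DataFrame({"tagName": ["P", "DIALOG", "DIV"]})

# One-hot encode the training data and drop the most frequent category (DIV).
train_dummy = pd.get_dummies(train, columns=["tagName"]).drop(columns=["tagName_DIV"])

# Encode the new data, then align it to the training columns:
# columns unknown to training are discarded, missing ones are filled with 0.
new_dummy = pd.get_dummies(new, columns=["tagName"])
new_aligned = new_dummy.reindex(columns=train_dummy.columns, fill_value=0).astype(int)

print(list(new_aligned.columns))
print(new_aligned.values.tolist())
```

The unseen `DIALOG` row ends up with all zeros, exactly like the `DIV` baseline row, so the feature matrix keeps the shape the model was trained on.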

### 4. Specifying features and targets
Before transforming the features, I need to separate the targets from the features and the training dataset from the validation dataset.

1. Splitting the training dataset from the validation dataset
I will split the dataset by site because all elements of the same site should belong to the same dataset.

In [9]:
sites_unq=data['site'].unique()

random=np.random.RandomState(23)
random.shuffle(sites_unq)
print(sites_unq)
size=len(sites_unq)
train_size=int(size*.8)

train_sites=sites_unq[0:train_size]
val_sites=sites_unq[train_size:]

data_train=data_dummy[data_dummy['site'].isin(train_sites)]
print("Training set:",data_train.shape)
data_val=data_dummy[data_dummy['site'].isin(val_sites)]
print("Validation set:",data_val.shape)

['medee' 'news1' 'mongolcom' 'niigmiintoli' 'itoim' '24barimt' 'sosa'
 'gereg' 'tur' 'inews' 'unen' 'livetv' 'miss' 'peak' 'mminfo' 'tusgaar'
 'paparatsi' 'amjilt' 'zindaa' 'asuudal' 'polit' 'mongolcomment'
 'newspress' 'gogo' 'murch' 'tonshuul' 'ontslokh' 'arslan' 'inet'
 'udriintoim' 'zaluu' 'fact' 'ikon' 'zuv' 'dardas' 'sonin' 'news'
 'scandal' 'updown' 'seruuleg' 'newsmedia' 'control' 'bolod' 'olloo'
 'unuudur' 'kingnews' 'nertur' 'jirgee' 'shuurhai' 'zarig']
Training set: (36383, 79)
Validation set: (11234, 79)
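The site-based split can also be expressed with scikit-learn's `GroupShuffleSplit`, which keeps all rows of one group on the same side. The sketch below is an alternative under assumptions, not the split used in this notebook: the `sites` array is a toy sample, and the resulting partition will not match the manual shuffle above.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: two elements per site, five sites (names borrowed for illustration).
sites = np.array(["ikon", "ikon", "gogo", "gogo", "news",
                  "news", "zarig", "zarig", "unen", "unen"])
X = np.arange(len(sites)).reshape(-1, 1)

# Hold out roughly 20% of the sites (not 20% of the rows) for validation.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=23)
train_idx, val_idx = next(splitter.split(X, groups=sites))

train_sites = set(sites[train_idx])
val_sites = set(sites[val_idx])
print("train:", sorted(train_sites))
print("val:", sorted(val_sites))
```

Whichever way the split is made, the property to check is that no site appears in both sets; otherwise the validation score would reward memorizing a site's layout.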


2. Now I am going to separate the features from the targets.

In [10]:
#sites
sites_train=data_train['site']
sites_val=data_val['site']
print('sites_train:',sites_train.shape)
print('sites_val:',sites_val.shape)

#y title
y_title_train=data_train['title']
y_title_val=data_val['title']
print('y_title_train:',y_title_train.shape)
print('y_title_val:',y_title_val.shape)

#y content
y_content_train=data_train['content']
y_content_val=data_val['content']
print('y_content_train:',y_content_train.shape)
print('y_content_val:',y_content_val.shape)

#X
X_train=data_train.drop(['title','content','site'],axis=1)
X_val=data_val.drop(['title','content','site'],axis=1)
print('X_train:',X_train.shape)
print('X_val:',X_val.shape)



sites_train: (36383,)
sites_val: (11234,)
y_title_train: (36383,)
y_title_val: (11234,)
y_content_train: (36383,)
y_content_val: (11234,)
X_train: (36383, 76)
X_val: (11234, 76)


### 5. Non-numerical values
Now I will make sure that no feature contains non-numerical data.

In [11]:
print(X_train.dtypes.unique())
X_train.columns[X_train.dtypes=='O']

[dtype('float64') dtype('int64') dtype('O') dtype('uint8')]


Index(['marginTop', 'marginRight', 'marginBottom', 'marginLeft'], dtype='object')

According to the result, the features **marginTop, marginRight, marginBottom, marginLeft** contain non-numerical data. These margins are supposed to hold continuous values, so I will work on them next.

Let's look at what those non-numerical values are.

In [12]:
#margins are not numeric
from collections import Counter
for c in ['marginTop','marginRight','marginBottom','marginLeft']:
    string_values=[]
    s=X_train[c]
    for index, value in s.items():
        try:
            float(value)
        except ValueError:
            string_values.append(value)
    print(c,Counter(string_values))


marginTop Counter({'auto': 7})
marginRight Counter({'auto': 62, '15%': 4, '2%': 3})
marginBottom Counter({'auto': 10, '2%': 3})
marginLeft Counter({'auto': 60, '15%': 4})


For the margins, these values can be replaced with the most frequent value.

In [13]:
for c in ['marginTop','marginRight','marginBottom','marginLeft']:
    X_train[c]=pd.to_numeric(X_train[c],errors='coerce')
    X_val[c]=pd.to_numeric(X_val[c],errors='coerce')

In [14]:
null_columns=['marginTop','marginRight','marginBottom','marginLeft']
print(X_train[null_columns].mode())
X_train[null_columns] = X_train[null_columns].fillna(X_train[null_columns].mode().iloc[0])
X_train.head()

X_val[null_columns]=X_val[null_columns].fillna(X_train[null_columns].mode().iloc[0])

   marginTop  marginRight  marginBottom  marginLeft
0        0.0          0.0           0.0         0.0


### 6. Feature scaling
I will also normalize the numerical columns using MinMaxScaler. The scaler is fitted on the training set only and then applied to the validation set.

In [15]:
from sklearn.preprocessing import MinMaxScaler
mms=MinMaxScaler()
X_train_sc=mms.fit_transform(X_train)
X_val_sc=mms.transform(X_val)
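A small sketch of what min-max scaling does, using made-up font sizes (`train_font` and `val_font` are hypothetical). Each value is mapped to `(x - min) / (max - min)`, where `min` and `max` come from the training data only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical font sizes: training range is 10..20.
train_font = np.array([[10.0], [12.0], [20.0]])
val_font = np.array([[15.0], [30.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_font)  # learns min=10, max=20
val_scaled = scaler.transform(val_font)          # reuses the training min/max

print(train_scaled.ravel())
print(val_scaled.ravel())
```

Because the validation data is scaled with the training minimum and maximum, a validation value beyond the training range (30 here) lands outside [0, 1]. That is expected and avoids leaking validation statistics into preprocessing.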


## Training the model

I will train my model using LogisticRegression, SVM and KNN classifiers. 

### Over Sampling
My dataset is quite imbalanced, so a model trained on it as-is would tend to predict the negative class for everything.
I therefore used random over-sampling to balance the training set.

In [16]:
from imblearn.over_sampling import RandomOverSampler
ros_title = RandomOverSampler(random_state=0)
X_title_train_resampled, y_title_train_resampled = ros_title.fit_resample(X_train_sc, y_title_train)
print(X_title_train_resampled.shape)

ros_content=RandomOverSampler(random_state=1)
X_content_train_resampled, y_content_train_resampled=ros_content.fit_resample(X_train_sc,y_content_train)

(72686, 76)
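To show what random over-sampling does conceptually, here is a NumPy-only sketch. It is not the imbalanced-learn implementation; `random_oversample` and the toy arrays are hypothetical. Minority rows are duplicated at random, with replacement, until every class is as large as the biggest one.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate random minority rows until all classes match the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]  # start with every original row
    for cls, count in zip(classes, counts):
        if count < target:
            pool = np.flatnonzero(y == cls)
            # Draw the missing rows from this class, with replacement.
            keep.append(rng.choice(pool, size=target - count, replace=True))
    idx = np.concatenate(keep)
    return X[idx], y[idx]

# Toy data: 9 negatives and 1 positive.
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([0] * 9 + [1])
X_bal, y_bal = random_oversample(X_toy, y_toy)
print(X_bal.shape, np.bincount(y_bal))
```

The resampled set has twice the majority-class count. The `(72686, 76)` shape printed above fits that pattern: 72686 / 2 = 36343 would be the number of negative rows among the 36383 training rows.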


### Support vector machine

In [42]:
from sklearn.svm import SVC
from sklearn.metrics import f1_score,accuracy_score,precision_score,recall_score
def train_and_evaluate(svmTesting,args):   
    svmTesting.fit(args[0],args[1])
    y_pred=svmTesting.predict(args[2])
    a=accuracy_score(args[3],y_pred)
    p=precision_score(args[3],y_pred)
    r=recall_score(args[3],y_pred)
    f=f1_score(args[3],y_pred)
    print('kernel:%s, C:%s, gamma:%s, accuracy:%0.3f, precision:%0.3f, recall:%0.3f, f1:%0.3f'%(svmTesting.kernel,svmTesting.C,svmTesting.gamma,a,p,r,f))
    return a,p,r,f





def crossValidation(args):
    # Despite the name, this is a manual grid search: every candidate is fitted on the
    # training data (args[0], args[1]) and scored once on the validation data (args[2], args[3]).
    tuned_parameters = [{'kernel': 'rbf', 'gamma': [1e-1,1e-2,1e-3,1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': 'linear', 'C': [1, 10, 100, 1000]},
                   {'kernel': 'sigmoid', 'gamma': [1e-1,1e-2,1e-3,1e-4],
                     'C': [1, 10, 100, 1000]},
                   {'kernel': 'poly', 'gamma': [1e-1,1e-2,1e-3,1e-4],
                     'C': [1, 10, 100, 1000]}]

    # Best result so far; named `best` so the built-in max() is not shadowed.
    best={'kernel': None, 'gamma': None, 'C':None, 'Accuracy':0, 'Precision':0, 'Recall':0, 'F1':0}
    for tp in tuned_parameters:
        kernel=tp['kernel']
        Cs=tp['C']
        for c in Cs:
            if 'gamma' in tp:
                gammas=tp['gamma']
                for gamma in gammas:
                    svmTesting=SVC(kernel=kernel,gamma=gamma, C=c)
                    cur=train_and_evaluate(svmTesting,args)
                    # Candidates are ranked by recall (cur[2]); ties keep the earlier one.
                    if cur[2]>best['Recall']:
                        best={'kernel': kernel, 'gamma': gamma, 'C':c, 'Accuracy':cur[0], 'Precision':cur[1], 'Recall':cur[2], 'F1':cur[3]}
            else:
                # The linear kernel has no gamma to tune.
                svmTesting=SVC(kernel=kernel, C=c)
                cur=train_and_evaluate(svmTesting,args)
                if cur[2]>best['Recall']:
                    best={'kernel': kernel, 'C':c, 'Accuracy':cur[0], 'Precision':cur[1], 'Recall':cur[2], 'F1':cur[3]}
    print()
    print('Parameter values for best performance\n')
    print(best)


print("Training title:")
crossValidation([X_title_train_resampled,y_title_train_resampled,X_val_sc,y_title_val])
print("Training content:")
crossValidation([X_content_train_resampled,y_content_train_resampled,X_val_sc,y_content_val])


Training title:
kernel:rbf, C:1, gamma:0.1, accuracy:0.995, precision:0.109, recall:0.700, f1:0.189
kernel:rbf, C:1, gamma:0.01, accuracy:0.987, precision:0.058, recall:0.900, f1:0.109
kernel:rbf, C:1, gamma:0.001, accuracy:0.988, precision:0.063, recall:0.900, f1:0.118
kernel:rbf, C:1, gamma:0.0001, accuracy:0.988, precision:0.065, recall:0.900, f1:0.121
kernel:rbf, C:10, gamma:0.1, accuracy:0.995, precision:0.113, recall:0.600, f1:0.190
kernel:rbf, C:10, gamma:0.01, accuracy:0.994, precision:0.127, recall:0.900, f1:0.222
kernel:rbf, C:10, gamma:0.001, accuracy:0.979, precision:0.037, recall:0.900, f1:0.071
kernel:rbf, C:10, gamma:0.0001, accuracy:0.988, precision:0.063, recall:0.900, f1:0.118
kernel:rbf, C:100, gamma:0.1, accuracy:0.997, precision:0.061, recall:0.200, f1:0.093
kernel:rbf, C:100, gamma:0.01, accuracy:0.995, precision:0.136, recall:0.800, f1:0.232
kernel:rbf, C:100, gamma:0.001, accuracy:0.993, precision:0.108, recall:0.900, f1:0.194
kernel:rbf, C:100, gamma:0.0001, ac

kernel:poly, C:1, gamma:0.01, accuracy:0.367, precision:0.001, recall:0.900, f1:0.003
kernel:poly, C:1, gamma:0.001, accuracy:0.070, precision:0.001, recall:1.000, f1:0.002
kernel:poly, C:1, gamma:0.0001, accuracy:0.070, precision:0.001, recall:1.000, f1:0.002
kernel:poly, C:10, gamma:0.1, accuracy:0.992, precision:0.098, recall:1.000, f1:0.179
kernel:poly, C:10, gamma:0.01, accuracy:0.712, precision:0.003, recall:0.900, f1:0.006
kernel:poly, C:10, gamma:0.001, accuracy:0.070, precision:0.001, recall:1.000, f1:0.002
kernel:poly, C:10, gamma:0.0001, accuracy:0.070, precision:0.001, recall:1.000, f1:0.002
kernel:poly, C:100, gamma:0.1, accuracy:0.996, precision:0.167, recall:0.900, f1:0.281
kernel:poly, C:100, gamma:0.01, accuracy:0.919, precision:0.011, recall:1.000, f1:0.021
kernel:poly, C:100, gamma:0.001, accuracy:0.070, precision:0.001, recall:1.000, f1:0.002
kernel:poly, C:100, gamma:0.0001, accuracy:0.070, precision:0.001, recall:1.000, f1:0.002
kernel:poly, C:1000, gamma:0.1, acc
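The rows above combine very high accuracy with very low precision, which is easier to read with the formulas spelled out. The sketch below computes the four metrics from confusion-matrix counts. The counts are an assumption: the notebook does not print a confusion matrix, and `tp=7, fp=57, fn=3` were chosen only because they reproduce the first `rbf` row for a validation set of 11234 rows.

```python
def metrics(tp, fp, fn, total):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    tn = total - tp - fp - fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # how many predicted positives are correct
    recall = tp / (tp + fn)      # how many real positives were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Assumed counts (not taken from the notebook output): 10 real titles,
# 7 of them found, plus 57 other elements wrongly flagged as titles.
acc, prec, rec, f1 = metrics(tp=7, fp=57, fn=3, total=11234)
print(f"accuracy:{acc:.3f}, precision:{prec:.3f}, recall:{rec:.3f}, f1:{f1:.3f}")
```

Under these assumed counts, 60 mistakes out of 11234 rows barely move accuracy, while a few dozen false positives are enough to push precision toward 0.1. That is why recall, precision and F1 are more informative than accuracy for this task.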

### K-nearest neighbors

In [51]:
from sklearn.neighbors import KNeighborsClassifier

def train_and_evaluate_knn(knnTesting,args):   
    knnTesting.fit(args[0],args[1])
    y_pred=knnTesting.predict(args[2])
    a=accuracy_score(args[3],y_pred)
    p=precision_score(args[3],y_pred)
    r=recall_score(args[3],y_pred)
    f=f1_score(args[3],y_pred)
    print('weights:%s, n_neighbors:%s, p:%s, accuracy:%0.3f, precision:%0.3f, recall:%0.3f, f1:%0.3f'%(knnTesting.weights,knnTesting.n_neighbors,knnTesting.p,a,p,r,f))
    return a,p,r,f

def crossValidation_knn(args):
    # Manual grid search over KNN settings, scored once on the validation data.
    tuned_parameters={
        'weights':['uniform','distance'],
        'n_neighbors':[10,15,20,25,30],
        'p':[2,3]
    }

    # Best result so far; named `best` so the built-in max() is not shadowed.
    best={'weights': None, 'n_neighbors': None, 'p':None, 'Accuracy':0, 'Precision':0, 'Recall':0, 'F1':0}
    for w in tuned_parameters['weights']:
        for n in tuned_parameters['n_neighbors']:
            for p in tuned_parameters['p']:
                knnTesting=KNeighborsClassifier(n_neighbors=n,weights=w, p=p)
                cur=train_and_evaluate_knn(knnTesting,args)
                # Candidates are ranked by recall (cur[2]); ties keep the earlier one.
                if cur[2]>best['Recall']:
                    best={'weights': w, 'n_neighbors': n, 'p':p, 'Accuracy':cur[0], 'Precision':cur[1], 'Recall':cur[2], 'F1':cur[3]}
    print()
    print('Parameter values for best performance\n')
    print(best)

print("Training KNN")
print("Training title:")
crossValidation_knn([X_title_train_resampled,y_title_train_resampled,X_val_sc,y_title_val])
print("Training content:")
crossValidation_knn([X_content_train_resampled,y_content_train_resampled,X_val_sc,y_content_val])

Training KNN
Training title:
weights:uniform, n_neighbors:10, p:2, accuracy:0.995, precision:0.094, recall:0.500, f1:0.159
weights:uniform, n_neighbors:10, p:3, accuracy:0.995, precision:0.093, recall:0.500, f1:0.156
weights:uniform, n_neighbors:15, p:2, accuracy:0.995, precision:0.093, recall:0.500, f1:0.156
weights:uniform, n_neighbors:15, p:3, accuracy:0.995, precision:0.088, recall:0.500, f1:0.149
weights:uniform, n_neighbors:20, p:2, accuracy:0.994, precision:0.075, recall:0.500, f1:0.130
weights:uniform, n_neighbors:20, p:3, accuracy:0.995, precision:0.085, recall:0.500, f1:0.145
weights:uniform, n_neighbors:25, p:2, accuracy:0.993, precision:0.066, recall:0.500, f1:0.116
weights:uniform, n_neighbors:25, p:3, accuracy:0.994, precision:0.069, recall:0.500, f1:0.122
weights:uniform, n_neighbors:30, p:2, accuracy:0.993, precision:0.066, recall:0.500, f1:0.116
weights:uniform, n_neighbors:30, p:3, accuracy:0.994, precision:0.069, recall:0.500, f1:0.122
weights:distance, n_neighbors:1