<font size="6">Deep Learning Competition 1: Text Feature Engineering</font>

By 110062342 楊博聖

<font size = "5">1. Data</font>

Load the training data and the testing data.

In [20]:
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
import pandas as pd
from IPython.display import display, HTML

display(HTML("<style>.output { max-height: 1000px; overflow-y: auto; }</style>"))
df_train = pd.read_csv('./dataset/train.csv')
df_test = pd.read_csv('./dataset/test.csv')
# print(df_train)

<font size = "5">2. Preprocess: Data Cleaning</font>

First, take a look at what the unprocessed data contains.

In [21]:
print(df_train.loc[0, 'Page content'])

<html><head><div class="article-info"> <span class="byline basic">Clara Moskowitz</span> for <a href="/publishers/space-com/">Space.com</a> <time datetime="Wed, 19 Jun 2013 15:04:30 +0000">2013-06-19 15:04:30 UTC</time> </div></head><body><h1 class="title">NASA's Grand Challenge: Stop Asteroids From Destroying Earth</h1><figure class="article-image"><img class="microcontent" data-fragment="lead-image" data-image="http://i.amz.mshcdn.com/I7b9cUsPSztew7r1WT6_iBLjflo=/950x534/2013%2F06%2F19%2Ffe%2FDactyl.44419.jpg" data-micro="1" data-url="http://mashable.com/2013/06/19/nasa-grand-challenge-asteroid/" src="http://i.amz.mshcdn.com/I7b9cUsPSztew7r1WT6_iBLjflo=/950x534/2013%2F06%2F19%2Ffe%2FDactyl.44419.jpg"/></figure><article data-channel="world"><section class="article-content"> <p>There may be killer asteroids headed for Earth, and NASA has decided to do something about it. The space agency announced a new "Grand Challenge" on June 18 to find all dangerous space rocks and figure out how t

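As the output shows, the raw data is very messy and contains a lot of HTML and XML markup, but these tags also carry information that is well suited to splitting the data into fields.

Here I use BeautifulSoup, which the TAs taught on the course website, for data cleaning. The features I extracted this time are: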

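&nbsp; -title: article title <br>
&nbsp; -title_len: title length (number of words in the title) <br>
&nbsp; -author: author (with the leading by/By removed and everything converted to lowercase)<br>
&nbsp; -channel: channel<br>
&nbsp; -topic: the article's topic tags<br>
&nbsp; -Day and Date: publication time, including the date (year, month, day) and the day of the week; if an article lacks this information, Fri, 11 Oct 2013 22:10:39 is used instead. The output is weekday, year, month, day<br>
&nbsp; -Content_len: article length (number of characters in the article content)<br>
&nbsp; -image_num: number of images<br>
&nbsp; -link_num: number of links<br>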

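After many experiments, I found that author, topic, weekday, year, month, day, and content_len were the most effective features, so these are the values that `preprocessor` returns.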
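
The date handling described in the feature list can be tried in isolation. The following is a minimal sketch, assuming the timestamp has the form seen in the sample article (`Wed, 19 Jun 2013 15:04:30 +0000`). The helper name `parse_date_features` is hypothetical and not part of the notebook, and as an added assumption it also falls back when the string is present but does not match the pattern.

```python
import re

# Fallback timestamp from the feature list, used when an article has no date.
FALLBACK = 'Fri, 11 Oct 2013 22:10:39'
DATE_RE = re.compile(r'(\w+),\s+(\d+)\s+(\w+)\s+(\d+)\s+(\d+):(\d+):(\d+)')

def parse_date_features(date_time):
    """Return (weekday, year, month, day) as strings; hypothetical helper."""
    match_obj = DATE_RE.search(date_time or '')
    if match_obj is None:
        # Missing or malformed timestamp: parse the fallback instead.
        match_obj = DATE_RE.search(FALLBACK)
    weekday, day, month, year, _hour, _minute, _second = match_obj.groups()
    return weekday.lower(), year, month.lower(), day

print(parse_date_features('Wed, 19 Jun 2013 15:04:30 +0000'))
print(parse_date_features(None))
```

Note that all four values come back as strings, because they are regular-expression groups.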

In [22]:
import re
from bs4 import BeautifulSoup

def preprocessor(text):
    
    # parse the raw HTML
    soup = BeautifulSoup(text, 'html.parser')

    # title
    title_tag = soup.body.h1
    title = title_tag.string.strip().lower() if title_tag else ''
    
    # title len
    title_len = len(title_tag.get_text().split()) if title_tag else 0


    # author
    article_info = soup.head.find('div', {'class': 'article-info'})
    author_name = article_info.find('span', {'class': 'author_name'})
    if author_name is not None:
        author = author_name.get_text()
    elif article_info.span is not None:
        author = article_info.span.string
    else:
        author = article_info.a.string

    author = author.lower()

    if author.startswith('by '):
        author = author[3:]
    author = re.sub('&.*;', '&', author.replace(' and ', ' & '))

    # channel
    channel = soup.body.article['data-channel'].strip().lower()

    # topic
    topic_element = soup.find(attrs={'class': 'article-topics'})
    topic = topic_element.get_text().replace('Topics', '').replace(':', '').replace(',', '').strip().lower() if topic_element else ''

    # day and date
    article_info = soup.head.find('div', {'class': 'article-info'})
    try:
        date_time = article_info.time['datetime']
    except (AttributeError, KeyError, TypeError):
        # no <time> tag or no datetime attribute: use the fallback timestamp
        date_time = 'Fri, 11 Oct 2013 22:10:39'
    match_obj = re.search(r'(\w+),\s+(\d+)\s+(\w+)\s+(\d+)\s+(\d+):(\d+):(\d+)', date_time)
    weekday, day, month, year, hour, minute, second = match_obj.groups()
    weekday, month = weekday.lower(), month.lower()
    
    # len of content
    content = soup.body.find('section', {'class': 'article-content'}).get_text()
    content_len = len(content)

    # image num
    image_num = len(soup.body.find_all('img'))

    # link num
    link_num = len(soup.body.find_all('a'))

    return author, topic, weekday, year, month, day, content_len

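Next, some of the text fields need further processing: the day of the week and the month are both converted to numbers. The processed data is shown at the end of the output below.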

In [23]:
from tqdm import tqdm

feature_list = []

# Processing training data with tqdm progress bar
for text in tqdm(df_train['Page content'], desc="Processing train data"):
    feature_list.append(preprocessor(text))

# Processing test data with tqdm progress bar
for text in tqdm(df_test['Page content'], desc="Processing test data"):
    feature_list.append(preprocessor(text))

df_extract = pd.DataFrame(
    feature_list,
    columns=['Author', 'Topic', 'Weekday', 'year', 'month', 'day','Length of Content']
)

day = {'mon': 1, 'tue': 2, 'wed': 3,'thu': 4, 'fri': 5, 'sat': 6, 'sun': 7}

month = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6, 'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}

df_final = df_extract.copy()
df_final['Weekday'] = df_final['Weekday'].map(day)
df_final['month'] = df_final['month'].map(month)

print(df_final.head(5))


Processing train data: 100%|██████████| 27643/27643 [01:52<00:00, 246.59it/s]
Processing test data: 100%|██████████| 11847/11847 [00:53<00:00, 220.85it/s]

             Author                                              Topic  \
0   clara moskowitz  asteroid asteroids challenge earth space u.s. ...   
1  christina warren  apps and software google open source opn pledg...   
2         sam laird      entertainment nfl nfl draft sports television   
3         sam laird                    sports video videos watercooler   
4   connor finnegan  entertainment instagram instagram video nfl sp...   

   Weekday  year  month day  Length of Content  
0        3  2013      6  19               3591  
1        4  2013      3  28               1843  
2        3  2014      5  07               6646  
3        5  2013     10  11               1821  
4        4  2014      4  17               8919  
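
In the printed table, `day` still shows zero-padded values such as `07`, which suggests that `year` and `day` remain strings taken from the regular-expression groups. If integer columns are preferred, one option is to cast them together with the weekday and month lookups. The sketch below uses plain Python rather than pandas, and the helper name `encode_row` is hypothetical.

```python
# Same lookup tables as in the cell above.
WEEKDAY_TO_INT = {'mon': 1, 'tue': 2, 'wed': 3, 'thu': 4,
                  'fri': 5, 'sat': 6, 'sun': 7}
MONTH_TO_INT = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
                'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}

def encode_row(weekday, year, month, day):
    """Turn the four date strings into integers (hypothetical helper)."""
    # int('07') gives 7, so zero-padded days are handled.
    return WEEKDAY_TO_INT[weekday], int(year), MONTH_TO_INT[month], int(day)

print(encode_row('wed', '2014', 'may', '07'))
```

A dictionary lookup raises `KeyError` for an unknown abbreviation, whereas `Series.map` with a dictionary yields `NaN`, so this variant fails loudly on unexpected input.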





<font size = "5">3. Preprocess: Word Stemming and Stop-word Removal</font>

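1. Word Stemming <br>
PorterStemmer is used for stemming.
Each word is reduced to its stem, e.g. "running" becomes "run", so that inflected forms of the same word have less influence on the analysis.

2. Stop-word Removal <br>
English stop words are loaded from NLTK's stop-word list. After tokenization, these common but low-information words, such as "the", "is", and "in", are removed to improve text processing.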
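
The cleanup and stop-word steps can be sketched without NLTK. This is only an illustration: `STOP_WORDS` is a tiny stand-in set (an assumption, not NLTK's list), `clean_and_filter` is a hypothetical name, and stemming is left out because it needs `PorterStemmer`.

```python
import re

# Tiny illustrative stop-word set (assumption); the notebook uses NLTK's list.
STOP_WORDS = {'the', 'is', 'in'}

def clean_and_filter(text):
    """Normalize punctuation, split on whitespace, drop stop words."""
    # Drop the part after an apostrophe: "nasa's" -> "nasa".
    text = re.sub(r"(\w+)'\w+", lambda m: m.group(1), text)
    # Remove periods so abbreviations such as "u.s." become "us".
    text = re.sub(r'\.', '', text)
    # Replace every other run of non-word characters with a space.
    text = re.sub(r'[^\w]+', ' ', text)
    return [w for w in text.strip().split()
            if w not in STOP_WORDS and re.match('[a-zA-Z]+', w)]

print(clean_and_filter("nasa's asteroid-hunting u.s. challenge is in the news"))
```

The input is already lowercase here, mirroring the topic strings, which are lowercased during preprocessing.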

In [24]:
import numpy as np
import nltk

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')
stop = stopwords.words('english')



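def tokenizer(text):
    # the input may arrive as a one-element array; unwrap it
    if isinstance(text, np.ndarray):
        text = text[0]
    return re.split(r'\s+', text.strip())

def tokenizer_stem_nostop(text):
    if isinstance(text, np.ndarray):
        text = text[0]

    # drop the part after an apostrophe: "nasa's" -> "nasa"
    text = re.sub(r"(\w+)'\w+", lambda match_obj: match_obj.group(1), text)
    # remove periods so abbreviations such as "u.s." become "us"
    text = re.sub(r'\.', '', text)
    # replace every other run of non-word characters with a space
    text = re.sub(r'[^\w]+', ' ', text)
    porter = PorterStemmer()
    # keep tokens that start with a letter and are not stop words, then stem them
    return [porter.stem(w) for w in re.split(r'\s+', text.strip())
            if w not in stop and re.match('[a-zA-Z]+', w)]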


[nltk_data] Downloading package wordnet to /home/stylish/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/stylish/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/stylish/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


<font size = "5">4. BoW (Bag of Words)</font>

This step uses the BoW approach that the TAs provided on the course website.

In [25]:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

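# Author (column 0) is dropped and Topic (column 1) is turned into bag-of-words counts;
# the remaining columns pass through unchanged. The title feature was removed,
# although the variable name still mentions it.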
trans_author_topic_title = ColumnTransformer(
    [('Author', 'drop', [0]),
     ('Topic', CountVectorizer(tokenizer=tokenizer_stem_nostop, lowercase=False), [1])],
    n_jobs=-1,
    remainder='passthrough'
)
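
To make the bag-of-words idea concrete, here is a plain-Python sketch of the counting involved. It is not `CountVectorizer` itself, which returns a sparse matrix and applies the stemming tokenizer; the function name `bag_of_words` and the two sample topic strings are illustrative assumptions.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    # The vocabulary is every distinct token, in sorted order.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One count per vocabulary term; absent terms count as 0.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(['sports video videos watercooler',
                                    'sports nfl sports'])
print(vocabulary)
print(vectors)
```

Each document becomes a fixed-length vector with one position per vocabulary term, which is the form a classifier can consume.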

<font size = "5">5. Data Split (8:2)</font>

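The combined data is first separated back into the train and test sets, then the training data is split 8:2 into training and validation sets. The first five rows are printed to confirm that the data has the expected form.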

In [26]:
from sklearn.model_selection import train_test_split


X_train_dataset = df_final.values[:df_train.shape[0]]
y_train_dataset = (df_train['Popularity'].values == 1).astype(int)
X_test_dataset = df_final.values[df_train.shape[0]:]

X_train, X_valid, y_train, y_valid = train_test_split(X_train_dataset, y_train_dataset, test_size=0.2, random_state=0)
print(X_train[0:5])

[['seth fiegerman'
  'apple business gadgets ipad ipad air ipad mini stocks' 5 '2013' 11
  '01' 1567]
 ['jason abbruzzese' 'business media vice media' 4 '2014' 9 '04' 2101]
 ['brian womack' 'advertising business mobile twitter' 3 '2013' 12 '25'
  1812]
 ['tricia gilbride'
  'coffee john oliver last week tonight with john oliver pumpkin television video videos viral video watercooler'
  1 '2014' 10 '13' 695]
 ['emily banks' 'ces tech' 6 '2013' 1 '12' 5737]]
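
The 8:2 split amounts to shuffling the row indices with a fixed seed and holding out one fifth of them. The sketch below shows that idea in plain Python. It will not reproduce the exact rows chosen by `train_test_split`, and the name `split_train_valid` is hypothetical.

```python
import random

def split_train_valid(rows, valid_fraction=0.2, seed=0):
    """Shuffle indices with a fixed seed and hold out a validation share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_valid = round(len(rows) * valid_fraction)
    valid_idx = set(indices[:n_valid])
    train = [row for i, row in enumerate(rows) if i not in valid_idx]
    valid = [row for i, row in enumerate(rows) if i in valid_idx]
    return train, valid

rows = list(range(10))
train, valid = split_train_valid(rows)
print(len(train), len(valid))
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing classifiers on the same validation set.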


<font size = "5">6. Model Training</font>

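A simple training function is used: it takes a classifier and its name, so the output shows which classifier is being trained, and it evaluates the result with roc_auc_score on both the training and validation sets.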

In [27]:
from sklearn.metrics import roc_auc_score

def training(classifier_name, classifier):
    classifier.fit(X_train, y_train)
    print(f'Classifier Name: {classifier_name}, train score: {roc_auc_score(y_train, classifier.predict_proba(X_train)[:, 1]):.4f}')
    print(f'Classifier Name: {classifier_name}, valid score: {roc_auc_score(y_valid, classifier.predict_proba(X_valid)[:, 1]):.4f}')
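
For intuition about the metric, ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The following is a plain-Python sketch of that definition, not scikit-learn's implementation; the name `roc_auc` is hypothetical, and the pairwise loop is quadratic, so it only suits small inputs.

```python
def roc_auc(labels, scores):
    """Pairwise-comparison estimate of ROC AUC for binary labels (0/1)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError('need at least one positive and one negative example')
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is the scale on which the train and valid scores are read.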

<font size = "5">7-1. Model: Random Forest Classifier</font>

In [28]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

RF = Pipeline([('ct', trans_author_topic_title),
                 ('clf', RandomForestClassifier(n_jobs=-1, random_state=0, n_estimators=500))])

training("Random Forest Classifier", RF)



Classifier Name: Random Forest Classifier, train score: 1.0000
Classifier Name: Random Forest Classifier, valid score: 0.5867


<font size = "5">7-2. Model: LGBM</font>

In [29]:
from lightgbm import LGBMClassifier
from sklearn.pipeline import Pipeline

LGBM = Pipeline([('ct', trans_author_topic_title),
                 ('clf', LGBMClassifier(force_row_wise=True, random_state=0, learning_rate=0.009, n_estimators=400))])

training("Light Gradient Booster Machine Classifier", LGBM)




[LightGBM] [Info] Number of positive: 10885, number of negative: 11229
[LightGBM] [Info] Total Bins 2740
[LightGBM] [Info] Number of data points in the train set: 22114, number of used features: 824
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.492222 -> initscore=-0.031114
[LightGBM] [Info] Start training from score -0.031114
Classifier Name: Light Gradient Booster Machine Classifier, train score: 0.6786
Classifier Name: Light Gradient Booster Machine Classifier, valid score: 0.5946


<font size = "5">8. Prediction</font>

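After many rounds of parameter changes and experiments, and judging by the final scores submitted to Kaggle, the LGBM classifier performed best, so it was chosen for the prediction.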

In [30]:
y_score = LGBM.predict_proba(X_test_dataset)[:, 1]
df_pred = pd.DataFrame({'Id': df_test['Id'], 'Popularity': y_score})
df_pred.to_csv('prediction.csv', index=False)

<font size = "5">9. Conclusion</font>

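This lab was my first time working on text feature engineering. At first I was afraid I would not be able to produce anything, but I am very grateful to the TAs for providing such a detailed and rich tutorial site, which gave me a basic direction and understanding. I learned how to use Beautiful Soup to process HTML data, and I learned more data-cleaning techniques, such as word stemming and stop-word removal. In the end I also managed to find a suitable model and parameters for training. This was also my first time training directly with a package's built-in classifiers. Unlike before, when I had to build the whole model myself, using an existing implementation is much more convenient, although it comes with limitations: I have to follow the package's parameter settings, so it is less flexible by comparison. I also had to look up what each classifier does and what it is capable of before deciding which one to use. In the end I settled on two classifiers for testing. On Kaggle, I ranked 42nd on the public leaderboard and rose to 31st on the private leaderboard, which is a fairly good result.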