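### Assignment goal: text classification with tree-based models

This assignment uses the [All Beauty subset of the Amazon Review data](https://nijianmo.github.io/amazon/index.html) to classify review ratings (a text classification task).

The dataset rates each review from 1 to 5. In this assignment we regroup the ratings into three classes: negative, neutral, and positive (1,2 --> 1 negative, 3 --> 2 neutral, 4,5 --> 3 positive).

### Load packages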

In [1]:
import json
import re
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

### Data preprocessing
The text data is fairly large, so we use only the first 10,000 records for this exercise.
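
Since only the first 10,000 records are used, one option is to stop reading after that many lines instead of loading the whole file. The sketch below assumes the file holds one JSON object per line, as the loading code in this section implies; `load_first_reviews` is a hypothetical helper name, not part of the assignment template.

```python
import json
from itertools import islice


def load_first_reviews(path, limit=10000):
    """Parse the JSON records found in the first `limit` lines of `path`."""
    reviews = []
    with open(path, 'r', encoding='utf-8') as f:
        # islice stops after `limit` lines, so the rest of the file is never parsed
        for line in islice(f, limit):
            line = line.strip()
            if line:  # tolerate blank lines
                reviews.append(json.loads(line))
    return reviews
```

With this helper, `all_reviews = load_first_reviews('All_Beauty.json')` would hold the same leading records that the assignment's code obtains by slicing after a full load.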

In [5]:
#load json data
all_reviews = []
###<your code>###
with open('All_Beauty.json','r', encoding='utf-8') as f:
    for line in f:
        all_reviews.append(json.loads(line))


all_reviews[0]

{'overall': 1.0,
 'verified': True,
 'reviewTime': '02 19, 2015',
 'reviewerID': 'A1V6B6TNIC10QE',
 'asin': '0143026860',
 'reviewerName': 'theodore j bigham',
 'reviewText': 'great',
 'summary': 'One Star',
 'unixReviewTime': 1424304000}

In [8]:
#parse label(overall) and corpus(reviewText)
corpus = []
labels = []

###<your code>###
for review in all_reviews[:10000]:
    # skip records that lack either the text or the rating
    if 'reviewText' not in review or 'overall' not in review:
        continue
        continue
    corpus.append(review['reviewText'])
    labels.append(review['overall'])
#transform labels: 1,2 --> 1 and 3 --> 2 and 4,5 --> 3
label_map={1:1,2:1,3:2,4:3,5:3}
###<your code>###
labels= [label_map[int(label)] for label in labels]
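
Before training, it can help to check how the three classes are distributed after the mapping; the test-set support reported later in this section shows that positive reviews dominate. A minimal sketch on made-up ratings (the `raw_ratings` list is illustrative, not taken from the dataset):

```python
from collections import Counter

# same mapping as above: 1,2 -> 1 (negative), 3 -> 2 (neutral), 4,5 -> 3 (positive)
label_map = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}

# illustrative ratings in the same float format as the `overall` field
raw_ratings = [5.0, 4.0, 1.0, 3.0, 5.0, 2.0, 5.0]
mapped = [label_map[int(r)] for r in raw_ratings]

# count how many samples fall into each class and print the share of each
distribution = Counter(mapped)
for cls in sorted(distribution):
    print(cls, distribution[cls], f"{distribution[cls] / len(mapped):.0%}")
```

Running the same `Counter` call on the real `labels` list gives the class counts for the 10,000-record subset.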

In [9]:
#preprocessing data
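#remove email addresses, punctuation, and newline symbols (\n)

###<your code>###
# \S*@\S* matches email-like tokens, \\n matches a literal backslash-n sequence,
# and \W matches any other non-word character (including real newline characters)
pattern = r'\S*@\S*|\\n|\W'
# split() never returns empty strings, so join/split also collapses repeated spaces
preprocess_text = lambda x: ' '.join(re.sub(pattern, ' ', x).split())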
corpus = [preprocess_text(text) for text in corpus]
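
To see what the pattern does, here is a small check on a made-up sentence (the sample text and email address are invented for illustration). It applies the same regular expression as the cell above.

```python
import re

pattern = r'\S*@\S*|\\n|\W'


def preprocess_text(text):
    # replace every match with a space, then collapse whitespace via split/join
    return ' '.join(re.sub(pattern, ' ', text).split())


sample = "Great product!\nEmail me at foo@bar.com"
print(preprocess_text(sample))  # Great product Email me at
```

One side effect worth knowing: because the apostrophe is a non-word character, contractions are split, so `don't` becomes `don t`.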

In [10]:
#split corpus and label into train and test
###<your code>###
x_train, x_test, y_train, y_test = train_test_split(corpus, labels, test_size=0.15, random_state=1)

len(x_train), len(x_test), len(y_train), len(y_test)

(8495, 1500, 8495, 1500)
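
Because the classes in this dataset are imbalanced, a plain random split can leave the test set with very few samples of a minority class. One common option, not used in the cell above, is the `stratify` argument of `train_test_split`, which keeps the class proportions roughly equal in both parts. A sketch on toy data (the `toy_*` names and values are made up):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# toy data: 90 positive (3) and 10 negative (1) samples
toy_texts = [f"doc {i}" for i in range(100)]
toy_labels = [3] * 90 + [1] * 10

tr_x, te_x, tr_y, te_y = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=1, stratify=toy_labels
)

# both parts keep the 9:1 class ratio of the full toy data
print(Counter(tr_y), Counter(te_y))
```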

In [11]:
#change corpus into vector
#you can use tfidf or BoW here
###<your code>###
vectorizer = TfidfVectorizer()
vectorizer.fit(x_train)

#transform training and testing corpus into vector form
x_train = vectorizer.transform(x_train) ###<your code>###
x_test = vectorizer.transform(x_test) ###<your code>###
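
The comment in the cell above mentions bag-of-words (BoW) as an alternative to TF-IDF. As a rough illustration of the difference, the sketch below runs `CountVectorizer` on a three-document toy corpus (made up for this example). BoW stores raw term counts, whereas TF-IDF reweights them by how rare each term is across documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["great product", "bad product", "great great value"]

bow = CountVectorizer()
# fit learns the vocabulary; transform turns each document into a count vector
counts = bow.fit_transform(toy_corpus)

# vocabulary_ maps each term to its column index (terms are sorted alphabetically)
print(sorted(bow.vocabulary_.items(), key=lambda kv: kv[1]))
print(counts.toarray())
```

As in the cell above, the vectorizer should be fitted on the training corpus only and then applied to the test corpus with `transform`.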

### Training and prediction

In [12]:
#build classification model (decision tree, random forest, or adaboost)
#start training
###<your code>###
# note: scikit-learn 1.2 renamed `base_estimator` to `estimator`, and 1.4 removed the old name
adaboost_cls = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(criterion='gini',
                                                                        max_depth=3,
                                                                        min_samples_split=10,
                                                                        min_samples_leaf=5),
                                  n_estimators=50,
                                  learning_rate=0.8)
adaboost_cls.fit(x_train,y_train)

AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                         class_weight=None,
                                                         criterion='gini',
                                                         max_depth=3,
                                                         max_features=None,
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=5,
                                                         min_samples_split=10,
                                                         min_weight_fraction_leaf=0.0,
                                                         presort='deprecated',
                         

In [13]:
#start inference
y_pred = adaboost_cls.predict(x_test) ###<your code>###

In [14]:
#calculate accuracy
###<your code>###
print(f"Accuracy: {adaboost_cls.score(x_test,y_test)}")

Accuracy: 0.9046666666666666


In [15]:
#calculate confusion matrix, precision, recall, and f1-score
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.71      0.25      0.38       106
           2       0.21      0.11      0.14        47
           3       0.92      0.98      0.95      1347

    accuracy                           0.90      1500
   macro avg       0.61      0.45      0.49      1500
weighted avg       0.88      0.90      0.89      1500

[[  27    5   74]
 [   3    5   39]
 [   8   14 1325]]
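
To see where the numbers in the report come from, precision and recall can be recomputed from the confusion matrix by hand. In scikit-learn's `confusion_matrix`, rows are true labels and columns are predicted labels, so recall divides the diagonal entry by its row sum and precision divides it by its column sum. The `report` dictionary below is just a container for this illustration.

```python
# confusion matrix from the output above (rows: true class, columns: predicted class)
cm = [
    [27, 5, 74],
    [3, 5, 39],
    [8, 14, 1325],
]
classes = [1, 2, 3]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))
print(f"accuracy: {correct / total:.4f}")

report = {}
for i, cls in enumerate(classes):
    tp = cm[i][i]
    row_sum = sum(cm[i])                 # number of true samples of this class
    col_sum = sum(row[i] for row in cm)  # number of predictions of this class
    report[cls] = {
        'precision': tp / col_sum,
        'recall': tp / row_sum,
        # F1 = 2*TP / (2*TP + FP + FN), and row_sum + col_sum = 2*TP + FN + FP
        'f1': 2 * tp / (row_sum + col_sum),
    }
    print(cls, {k: round(v, 2) for k, v in report[cls].items()})
```

For class 1, for example, recall is 27 / 106 and precision is 27 / 38, which round to the 0.25 and 0.71 shown in the report.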


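The results above show that the model does well on positive reviews (both precision and recall are high) but poorly on negative reviews, where recall is only 0.25. The confusion matrix shows that most negative reviews (74 of 106) and most neutral reviews (39 of 47) are predicted as positive, which is consistent with the strong class imbalance: 1,347 of the 1,500 test reviews are positive.
Try the methods you have learned to improve the model's performance.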
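
As one possible starting point, class weighting is a common way to counter class imbalance. The sketch below is an illustration, not the assignment's reference solution: it trains a `RandomForestClassifier` with `class_weight='balanced'` on a tiny made-up corpus, and the texts, labels, and hyperparameters are all placeholders.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# made-up, deliberately imbalanced corpus: 3 = positive, 2 = neutral, 1 = negative
toy_texts = [
    "love it", "works great", "great value", "very happy", "excellent product",
    "it is okay", "average quality",
    "broke quickly", "terrible smell",
]
toy_labels = [3, 3, 3, 3, 3, 2, 2, 1, 1]

toy_vectorizer = TfidfVectorizer()
features = toy_vectorizer.fit_transform(toy_texts)

# 'balanced' weights each class inversely to its frequency in the training labels
forest = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=1)
forest.fit(features, toy_labels)

preds = forest.predict(toy_vectorizer.transform(["great product", "terrible quality"]))
print(preds)
```

On the real data, compare per-class recall before and after such a change, since overall accuracy alone can hide poor performance on the minority classes.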