## Load files (run one of the two options)

### Load files (upload)

Upload these files:
- result_online.csv
- result_online_cat2.csv
- offline_result.csv
- dict.txt.big

In [None]:
from google.colab import files
uploaded = files.upload()

### Load files (Colab/Google Drive)
**Change the `%cd` path below to match your own Drive folder**

Files:
- result_online.csv
- result_online_cat2.csv
- offline_result.csv
- dict.txt.big

In [None]:
from google.colab import drive
drive.mount('./drive')

In [None]:
%cd /content/drive/MyDrive/New/DL/Final

/content/drive/MyDrive/New/DL/Final


## install / import / constants

In [3]:
!pip install fasttext
!pip install jieba

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting fasttext
  Downloading fasttext-0.9.2.tar.gz (68 kB)
     |████████████████████████████████| 68 kB 3.0 MB/s
Collecting pybind11>=2.2
  Using cached pybind11-2.9.2-py2.py3-none-any.whl (213 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp37-cp37m-linux_x86_64.whl size=3143847 sha256=f12e8e290c835fedebb2f4d3465bea6786f696532717beeb721bc8eb691ae4df
  Stored in directory: /root/.cache/pip/wheels/4e/ca/bf/b020d2be95f7641801a6597a29c8f4f19e38f9c02a345bab9b
Successfully built fasttext
Installing collected packages: pybind11, fasttext
Successfully installed fasttext-0.9.2 pybind11-2.9.2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [4]:
import fasttext
import jieba
import importlib
import os
import random
import csv
import numpy as np
from sklearn.model_selection import train_test_split
import sklearn

jieba.set_dictionary('dict.txt.big')

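TESTPART = 0.2   # fraction of each class held out for testing in load_dataset
TESTCOUNT = 200  # per-class test size for large classes (C1C2: 200, C1: 1000)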
MEDIAN = 200

## Functions

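- load_dataset : loads the dataset with a hand-written per-class train/test split and manual balancing
- load_dataset_split : loads the dataset and splits it with sklearn's train_test_split()
- precision_and_recall : computes precision / recall for each class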

In [12]:
def load_dataset(file_name):
    
    Xall = []
    Yall = []
    class_all = []

    train_X = []
    train_Y = []
    test_X = []
    test_Y = []

    classes = set()
    dic = {}
    class_dic = {}

    with open(file_name, encoding='utf-8', newline='') as f:

        data = csv.reader(f)

        for idx, row in enumerate(data):
            if idx == 0:
                continue

            Xall.append(row[1])
            class_all.append(row[2])
            classes.add(row[2])
                    
            if class_dic.get(class_all[idx - 1]) is None:
                class_dic[class_all[idx - 1]] = [Xall[idx - 1]]
            else:
                class_dic[class_all[idx - 1]].append(Xall[idx - 1])

        #Class => No.
        classes = list(classes)
        for idx, c in enumerate(classes):
            dic[c] = idx + 1
        
#Debug
#       print(len(classes))
#       print(classes[0])

#1-hot
        '''
        LEN = len(classes)
        for idx, x in enumerate(Xall):
            Yall.append([0] * LEN)
            Yall[idx][dic[class_all[idx]]] = 1
        '''

#Debug
#           if idx == 0:
#               print(Xall[idx])
#               print(class_all[idx])
#               print(dic[class_all[idx]], classes[dic[class_all[idx]]])

        for l in class_dic:
            random.shuffle(class_dic[l])

        part = int(len(Xall) * TESTPART)

        for idx, l in enumerate(class_dic):

            # A class with a single sample goes into both sets, so train and test overlap for it.
            if len(class_dic[l]) == 1:
                train_X.append(class_dic[l][0])
                train_Y.append(l)
                test_X.append(class_dic[l][0])
                test_Y.append(l)

            else:

                if len(class_dic[l]) <= TESTCOUNT:
                    part = max(1, int(len(class_dic[l]) * TESTPART))
                else:
                    part = TESTCOUNT

                for c in class_dic[l][:part]:
                    test_X.append(c)
                    test_Y.append(l)
                for c in class_dic[l][part:]:
                    train_X.append(c)
                    train_Y.append(l)

    return train_X, train_Y, test_X, test_Y, classes

#Debug
#train_X, train_Y, test_X, test_Y, classes = load_dataset()
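
The per-class rule inside `load_dataset` is easier to see on toy data. The following is a minimal sketch of that rule for one class, not the notebook's function: `split_class` and its parameter names are hypothetical, and the special case for single-item classes is left out (here a single item would land in the test part only).

```python
import random

def split_class(items, test_frac=0.2, test_count=200, rng=random):
    """Split one class's items into (train, test).

    Classes with at most test_count items hold out test_frac of them
    (at least one); larger classes hold out exactly test_count items.
    """
    items = list(items)
    rng.shuffle(items)
    if len(items) <= test_count:
        n_test = max(1, int(len(items) * test_frac))
    else:
        n_test = test_count
    return items[n_test:], items[:n_test]

rng = random.Random(0)
for size in (50, 201, 1000):
    train, test = split_class(range(size), rng=rng)
    print(size, len(train), len(test))
# 50 40 10
# 201 1 200
# 1000 800 200
```

The middle case shows a property of this rule worth keeping in mind: a class just above `TESTCOUNT` keeps almost nothing for training.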


In [13]:
def load_dataset_split(file_name):
    X = []
    Y = []
    classes = set()

    with open(file_name, encoding='utf-8', newline='') as f:

        data = csv.reader(f)

        for idx, rows in enumerate(data):
            if idx == 0:
                continue
            X.append(rows[1])
            Y.append(rows[2])
            classes.add(rows[2])

    classes = list(classes)

    # Split X and Y in one call so each title stays paired with its label.
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size = 0.8, random_state=32)

    return X_train, Y_train, X_test, Y_test, classes

In [14]:
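#Parameters:
#   prediction: (list) predicted classes; Y: (list) true classes in the same order; classes: list of all classes
#Returns:
#   overall accuracy, list of (class, precision, recall),
#   list of per-class (correct, predicted count), list of per-class (correct, actual count)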
def precision_and_recall(prediction, Y, classes):

    count_dic = {}
    correct_dic = {}

    all_prec = float()
    class_prec = {}
    class_recall = {}
    
    ret = []
    item_count_recall = []
    item_count_prec = []

    #Overall precision/accuracy

    correct = 0
    for idx, s in enumerate(prediction):
        if s == Y[idx]:
            correct += 1
    all_prec = correct / len(Y)

    #Recall of classes
    for s in classes:
        count_dic[s] = 0
        correct_dic[s] = 0

    for idx, s in enumerate(Y):
        count_dic[s] += 1
        if s == prediction[idx]:
            correct_dic[s] += 1

    for s in classes:
        if count_dic[s] == 0:
            class_recall[s] = None
        else:
            class_recall[s] = correct_dic[s] / count_dic[s]

        item_count_recall.append((correct_dic[s], count_dic[s]))
            
    #Precision of classes
    for s in classes:
        count_dic[s] = 0
        correct_dic[s] = 0

    for idx, s in enumerate(prediction):
        count_dic[s] += 1
        if s == Y[idx]:
            correct_dic[s] += 1

    for s in classes:
        if count_dic[s] == 0:
            class_prec[s] = None
        else:
            class_prec[s] = correct_dic[s] / count_dic[s]
            
        item_count_prec.append((correct_dic[s], count_dic[s]))

    #Return
    for s in classes:
        ret.append((s, class_prec[s], class_recall[s]))

    return all_prec, ret, item_count_prec, item_count_recall
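
As a cross-check of what `precision_and_recall` counts, the same per-class tallies can be sketched with `collections.Counter` on toy data. The helper name `per_class_counts` and the labels `a`, `b`, `c` are made up for illustration.

```python
from collections import Counter

def per_class_counts(predictions, answers):
    """Return {class: (correct, predicted_count, actual_count)}."""
    predicted = Counter(predictions)
    actual = Counter(answers)
    hits = Counter(p for p, a in zip(predictions, answers) if p == a)
    return {c: (hits[c], predicted[c], actual[c])
            for c in set(predicted) | set(actual)}

# Toy data with hypothetical class labels.
pred = ['a', 'a', 'b', 'c', 'a']
gold = ['a', 'b', 'b', 'b', 'a']

counts = per_class_counts(pred, gold)
for cls in sorted(counts):
    correct, n_pred, n_gold = counts[cls]
    precision = correct / n_pred if n_pred else None  # None: class never predicted
    recall = correct / n_gold if n_gold else None     # None: class absent from answers
    print(cls, (correct, n_pred, n_gold), precision, recall)
```

Class `c` is predicted once but never appears in the answers, so its recall is `None`, which is the same convention the function above uses.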

## Load dataset + balance

### Parameters

In [None]:
#1: Category1, 2:Category2
TRAIN_CATEGORY = 1

#Whether to split with train_test_split()
TTS = True

### Load and balance the dataset

In [63]:
## Load data

if TRAIN_CATEGORY == 1:
#Cat1
    DATA_SIZE = 1000
    FILENAME = 'result_online_cat2.csv'
elif TRAIN_CATEGORY == 2:
#Cat2
    DATA_SIZE = 200  
    FILENAME = 'result_online.csv'

if TTS:
    X_train_orig, Y_train_orig, X_test, Y_test, classes = load_dataset_split(FILENAME)

    X_train = []
    Y_train = []

    ### Balance train data
    class_dict = {}

    for c in classes:
        class_dict[c] = []

    for idx, c in enumerate(Y_train_orig):
        class_dict[c].append(X_train_orig[idx])

    print(class_dict[classes[0]][:3])

    for l in class_dict:
        if len(class_dict[l]) > 0:
            balanced = sklearn.utils.resample(class_dict[l], n_samples = DATA_SIZE, random_state = 32)
        
            X_train += balanced
            Y_train += [l] * DATA_SIZE

    print(len(balanced), balanced[:3])
else:
    X_train, Y_train, X_test, Y_test, classes = load_dataset(FILENAME)

['瓶裝酒櫃德國CASO雙溫控紅酒櫃', '公升玻璃門右開單門冷藏櫃', 'GrandCru瓶裝崁入型酒櫃電子濕度調節耗電量低恆溫避震']
200 ['立體聲重低音藍芽喇叭', '聲道三件式重低音喇叭', 'MAP重低音藍牙喇叭']
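
With its default `replace=True`, `sklearn.utils.resample` draws with replacement, so small classes are padded with repeated titles and large classes are cut down to `DATA_SIZE` draws. A stdlib-only sketch of the same idea follows; `oversample` and the toy data are hypothetical, and `random.choices` stands in for `resample`.

```python
import random

def oversample(by_class, n_samples, seed=32):
    """Draw n_samples items per class with replacement, skipping empty classes."""
    rng = random.Random(seed)
    X, Y = [], []
    for label, items in by_class.items():
        if not items:
            continue  # nothing to draw from
        X += rng.choices(items, k=n_samples)
        Y += [label] * n_samples
    return X, Y

toy = {'tea': ['t1', 't2', 't3'], 'wine': ['w1'], 'empty': []}
X_bal, Y_bal = oversample(toy, n_samples=4)
print(len(X_bal), Y_bal.count('tea'), Y_bal.count('wine'))  # 8 4 4
```

Every non-empty class contributes exactly `n_samples` rows, which is why the training set size in the notebook is a multiple of `DATA_SIZE`.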


In [64]:
print(X_train[50])
print(Y_train[50])

print(len(X_train), len(X_test))
print(len(classes))

#name_dict['保健/醫療']

#print(len(vocab_data))
#print(vocab_data[10] + '\n')

#print(len(idf_data))
#print(idf_data[10])

節能無聲客房冰箱冷藏箱小冰箱紅酒櫃
家電|酒櫃
81200 35046
407


## Train model

In [65]:
#Preprocess train set to fasttext format

CUTALL = True

processed = []

name_dict = {}
for idx, c in enumerate(classes):
    name_dict[c] = idx + 1

for idx, s in enumerate(X_train):
    lbl = '__label__' + str(name_dict[Y_train[idx]])
    cutted = ' '.join(jieba.cut(s, cut_all = CUTALL))
    processed.append(lbl + ' , ' + cutted)

print(processed[0])

random.shuffle(processed)

__label__1 , GrandCru 炫 黑 玻璃 玻璃門 系列 瓶裝 桌上 桌上型 酒櫃 電子 濕 度 調節 UV 隔熱
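
fastText's supervised format is one example per line: a label token carrying the `__label__` prefix (the library default), followed by whitespace-separated word tokens. The sketch below builds such a line; `to_fasttext_line` and the toy classes are hypothetical. Note that the cell above also writes ` , ` after the label, which fastText reads as an ordinary `,` token on every line; the sketch leaves it out.

```python
LABEL_PREFIX = '__label__'  # fastText's default label prefix

def to_fasttext_line(label_id, tokens):
    """Build one training line: label token, then space-separated word tokens."""
    return LABEL_PREFIX + str(label_id) + ' ' + ' '.join(tokens)

toy_classes = ['wine cabinet', 'speaker']                 # hypothetical class list
toy_ids = {c: i + 1 for i, c in enumerate(toy_classes)}   # 1-based ids, as in name_dict

line = to_fasttext_line(toy_ids['speaker'], ['bluetooth', 'speaker'])
print(line)  # __label__2 bluetooth speaker
```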


In [66]:
#Output training file

with open('train.txt', 'w', encoding='utf-8') as f:
    for s in processed:
        f.write(s + '\n')

In [67]:
#Fasttext

model = fasttext.train_supervised(input = 'train.txt', dim = 100, epoch = 20, lr = 0.1, loss = 'softmax')
model.save_model('main.model')

## Test model

In [None]:
#Tokenize the test titles with jieba

X_test_proc = []
for s in X_test:
    X_test_proc.append(' '.join(jieba.cut(s, cut_all = CUTALL)))

print(X_test_proc[0])

挺 好 風雨 風雨衣 雨衣 通用 鞋 套 黑色 橘色


### Test Model(model.test)

In [None]:
#Preprocess test set to fasttext format

print(X_test[0])

CUTALL = True

processed_test = []

for idx, s in enumerate(X_test_proc):
    lbl = '__label__' + str(name_dict[Y_test[idx]])
    processed_test.append(lbl + ' , ' + s)

print(processed_test[0])

random.shuffle(processed_test)

挺好風雨衣通用鞋套黑色橘色
__label__32 , 挺 好 風雨 風雨衣 雨衣 通用 鞋 套 黑色 橘色


In [None]:
#Output testing file

with open('test.txt', 'w', encoding='utf-8') as f:
    for s in processed_test:
        f.write(s + '\n')

In [None]:
res = model.test('test.txt')

print("Overall Precision: " , res[1])

(35045, 0.637665858182337, 0.637665858182337)
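
`model.test` returns a tuple of (number of examples, precision at k, recall at k), with k = 1 by default. When every line has exactly one gold label and one predicted label, the two ratios share the same numerator and denominator, so they coincide and both equal accuracy, which matches the identical values in the tuple above. A small sketch of that arithmetic, using a hypothetical helper and toy labels:

```python
def precision_recall_at_k(predicted, gold):
    """Micro-averaged precision and recall; each entry is a list of labels."""
    hits = sum(len(set(p) & set(g)) for p, g in zip(predicted, gold))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    return hits / n_pred, hits / n_gold

# One predicted label and one gold label per example (k = 1).
predicted = [['a'], ['b'], ['a'], ['c']]
gold = [['a'], ['b'], ['b'], ['c']]

p_at_1, r_at_1 = precision_recall_at_k(predicted, gold)
print(p_at_1, r_at_1)  # 0.75 0.75
```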

### Test model(Manual)

In [None]:
#Test for single prediction
print(X_test_proc[4500])
label = model.predict(X_test_proc[4500])

print(classes[int(label[0][0].split('__')[2]) - 1])

斜 口 醒酒 酒瓶 波爾 波爾多 爾多 紅酒 酒杯 禮盒 組 LSG
餐廚用品|酒器酒杯


In [None]:
#Prediction

X_pred = []

print(label[0])
print(len(classes))

for s in X_test_proc:
    label = model.predict(s)
    X_pred.append(classes[int(label[0][0].split('__')[2]) - 1])

('__label__141',)
407


In [None]:
#Precision

prec, pr, item_count_prec, item_count_recall = precision_and_recall(X_pred, Y_test, classes)
print('Precision:', prec, '\n')
print('Precision & recall of classes:\n')
for idx, s in enumerate(pr):
    print('Class: ', s[0])
    print('\nPrecision: \n', item_count_prec[idx][0], '/', item_count_prec[idx][1], '\n', s[1])

    print('\nRecall: \n', item_count_recall[idx][0], '/', item_count_recall[idx][1], '\n', s[2], '\n\n', sep='')

## Train set prediction

In [None]:
#Tokenize the training titles with jieba

X_train_temp = []
for s in X_train:
    X_train_temp.append(' '.join(jieba.cut(s, cut_all = CUTALL)))

print(X_train_temp[0])

觸 控 充電 充電式 好 攜帶 可摺疊 摺疊 LED 化妝 化妝鏡 二 入 組


In [None]:
#Train Prediction

X_train_pred = []

print(label[0])
print(len(classes))

for s in X_train_temp:
    label = model.predict(s)
    X_train_pred.append(classes[int(label[0][0].split('__')[2]) - 1])

('__label__282',)
407


In [None]:
print(X_train_temp[4500])
label = model.predict(X_train_temp[4500])

print(classes[int(label[0][0].split('__')[2]) - 1])

福利 福利品 GalayAGGG 吋
手機/平板|3C福利品


In [None]:
#Train Precision

prec, pr, item_count_prec, item_count_recall = precision_and_recall(X_train_pred, Y_train, classes)
print('Precision:', prec, '\n')

Precision: 0.9616379310344828 



## Test offline

### Read file

In [68]:
Offline_X_orig = []
Offline_Y_c1 = []
Offline_Y_c1c2 = []

with open('offline_result.csv') as f:

    data = csv.reader(f)

    for idx, row in enumerate(data):
        if idx == 0:
            continue  # skip the header row
        if idx > 191:
            break  # use only the first 191 data rows
        Offline_X_orig.append(row[3])
        Offline_Y_c1.append(row[4])
        Offline_Y_c1c2.append(row[6])

print(Offline_X_orig[119])
print(Offline_Y_c1[119])
print(Offline_Y_c1c2[119])

泳具配件飾品其他
服飾
服飾|泳裝/比基尼


### Test C1C2

In [69]:
#Preprocess test set to fasttext format
#C1C2

CUTALL = True

Offline_X = []

for idx, s in enumerate(Offline_X_orig):
    lbl = '__label__' + str(name_dict[Offline_Y_c1c2[idx]])
    cutted = ' '.join(jieba.cut(s, cut_all = CUTALL))
    Offline_X.append(lbl + ' , ' + cutted)

print(Offline_X[0])

#random.shuffle(Offline_X)

__label__102 , 小熊 餅乾 乾草 草莓 蛋糕 風味 g


In [70]:
#Test for single prediction
print(Offline_X[100])
label = model.predict(Offline_X[100])

print('Ans = ', Offline_Y_c1c2[100])
print(classes[int(label[0][0].split('__')[2]) - 1])

__label__7 , 茶 裏 王 青 心 烏龍 烏龍茶
Ans =  食品飲料|飲料
食品飲料|飲料


In [71]:
#Predict

Offline_pred = []

for s in Offline_X:
    label = model.predict(s)
    Offline_pred.append(classes[int(label[0][0].split('__')[2]) - 1])

print(Offline_pred[0])

食品飲料|進口零食


In [72]:
#Precision

prec, pr, item_count_prec, item_count_recall = precision_and_recall(Offline_pred, Offline_Y_c1c2, classes)
print('Precision:', prec, '\n')
print('Precision & recall of classes:\n')
for idx, s in enumerate(pr):
    if item_count_recall[idx][1] + item_count_prec[idx][1] > 0:
        print('Class: ', s[0])
        print('\nPrecision: \n', item_count_prec[idx][0], '/', item_count_prec[idx][1], '\n', s[1])

        print('\nRecall: \n', item_count_recall[idx][0], '/', item_count_recall[idx][1], '\n', s[2], '\n\n', sep='')

Precision: 0.2774869109947644 

Precision & recall of classes:

Class:  文具樂器|文具用品

Precision: 
 0 / 1 
 0.0

Recall: 
0/0
None


Class:  食品飲料|飲料

Precision: 
 3 / 3 
 1.0

Recall: 
3/9
0.3333333333333333


Class:  彩妝保養|專櫃保養品牌

Precision: 
 1 / 3 
 0.3333333333333333

Recall: 
1/8
0.125


Class:  食品飲料|罐頭/食材/烘焙

Precision: 
 4 / 15 
 0.26666666666666666

Recall: 
4/14
0.2857142857142857


Class:  看看買|快閃活動

Precision: 
 0 / 1 
 0.0

Recall: 
0/1
0.0


Class:  服飾|女裝

Precision: 
 0 / 1 
 0.0

Recall: 
0/0
None


Class:  圖書影音|生活風格

Precision: 
 0 / 1 
 0.0

Recall: 
0/0
None


Class:  修繕園藝|五金工具

Precision: 
 0 / 1 
 0.0

Recall: 
0/0
None


Class:  加值/軟體|myBook電子書

Precision: 
 0 / 2 
 0.0

Recall: 
0/0
None


Class:  母嬰玩具|積木

Precision: 
 0 / 2 
 0.0

Recall: 
0/0
None


Class:  彩妝保養|開架保養品牌

Precision: 
 0 / 0 
 None

Recall: 
0/1
0.0


Class:  彩妝保養|醫美保養品牌

Precision: 
 0 / 3 
 0.0

Recall: 
0/1
0.0


Class:  圖書影音|文學小說

Precision: 
 0 / 12 
 0.0

Recall: 
0/0
None


Class:  車|機車/用品

Precis

### Test C1

In [57]:
#Preprocess test set to fasttext format
#C1

CUTALL = True

Offline_X = []

for idx, s in enumerate(Offline_X_orig):
    lbl = '__label__' + str(name_dict[Offline_Y_c1[idx]])
    cutted = ' '.join(jieba.cut(s, cut_all = CUTALL))
    Offline_X.append(lbl + ' , ' + cutted)

print(Offline_X[0])

#random.shuffle(Offline_X)

__label__22 , 小熊 餅乾 乾草 草莓 蛋糕 風味 g


In [59]:
#Test for single prediction
print(Offline_X[100])
label = model.predict(Offline_X[100])

print('Ans = ', Offline_Y_c1[100])
print(classes[int(label[0][0].split('__')[2]) - 1])

__label__22 , 茶 裏 王 青 心 烏龍 烏龍茶
Ans =  食品飲料
食品飲料


In [60]:
#Predict

Offline_pred = []

for s in Offline_X:
    label = model.predict(s)
    Offline_pred.append(classes[int(label[0][0].split('__')[2]) - 1])

print(Offline_pred[0])

食品飲料


In [62]:
#Precision

prec, pr, item_count_prec, item_count_recall = precision_and_recall(Offline_pred, Offline_Y_c1, classes)
print('Precision:', prec, '\n')
print('Precision & recall of classes:\n')
for idx, s in enumerate(pr):
    if item_count_recall[idx][1] + item_count_prec[idx][1] > 0:
        print('Class: ', s[0])
        print('\nPrecision: \n', item_count_prec[idx][0], '/', item_count_prec[idx][1], '\n', s[1])

        print('\nRecall: \n', item_count_recall[idx][0], '/', item_count_recall[idx][1], '\n', s[2], '\n\n', sep='')

Precision: 0.5445026178010471 

Precision & recall of classes:

Class:  生鮮

Precision: 
 9 / 28 
 0.32142857142857145

Recall: 
9/21
0.42857142857142855


Class:  直配大陸

Precision: 
 0 / 3 
 0.0

Recall: 
0/0
None


Class:  運動/按摩

Precision: 
 0 / 1 
 0.0

Recall: 
0/0
None


Class:  看看買

Precision: 
 0 / 3 
 0.0

Recall: 
0/1
0.0


Class:  修繕園藝

Precision: 
 0 / 2 
 0.0

Recall: 
0/0
None


Class:  鞋包箱

Precision: 
 0 / 1 
 0.0

Recall: 
0/0
None


Class:  車

Precision: 
 0 / 2 
 0.0

Recall: 
0/0
None


Class:  母嬰玩具

Precision: 
 0 / 6 
 0.0

Recall: 
0/0
None


Class:  日用/紙品

Precision: 
 1 / 1 
 1.0

Recall: 
1/1
1.0


Class:  宗教/藝術

Precision: 
 0 / 4 
 0.0

Recall: 
0/1
0.0


Class:  手機/平板

Precision: 
 0 / 2 
 0.0

Recall: 
0/0
None


Class:  家電

Precision: 
 0 / 0 
 None

Recall: 
0/1
0.0


Class:  內衣

Precision: 
 2 / 4 
 0.5

Recall: 
2/4
0.5


Class:  綠色生活

Precision: 
 0 / 4 
 0.0

Recall: 
0/0
None


Class:  保健/醫療

Precision: 
 3 / 8 
 0.375

Recall: 
3/3
1.0


Class:  個人清潔