Pattern Recognition Course Project: Product Classification_Huang Guozheng_20210529

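Note: this notebook is built on the Kaggle competition Shopee - Price Match Guarantee, which makes the data easy to load. All data come from the training set of the Shopee dataset shopee-product-matching and can be read directly after joining the competition.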

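This notebook implements most of the classical text-classification and image-classification methods. The neural-network approaches are in PCbighomework_network_hgz, and the combined text-and-image approach is in PCbighomework_mix_hgz.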

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

# Load all required packages
# Data handling
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import random

# Plotting and image reading
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.image as mpimg 
import seaborn as sns

# Machine learning
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC,SVC
import sklearn.svm as svm
from sklearn.metrics import accuracy_score,precision_score, recall_score, f1_score,roc_auc_score,average_precision_score
from sklearn import neighbors
from sklearn import datasets
from sklearn.datasets import load_digits  # load_boston dropped: it was removed in scikit-learn 1.2
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingRegressor,VotingClassifier,RandomForestClassifier,AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing

# Text processing
import spacy
from spacy.util import minibatch, compounding
from sklearn.feature_extraction.text import TfidfVectorizer

# Image processing
import os
import cv2
import glob
import joblib
# Run the cells in order from top to bottom to avoid errors caused by reused variable names
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
print("Set up complete!")

I. Data Preparation

In [None]:
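# Load the full data from the Kaggle dataset shopee-product-matching (for a local copy, just change the path)
totaldata = pd.read_csv("../input/shopee-product-matching/train.csv")
# Show the first few rows: image is the image file name, image_phash is the perceptual hash of the image,
# and title is the product text. These are the features used below. The target is label_group,
# whose values are large and widely spread, so we look at its distribution first.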
totaldata.head()

In [None]:
# Inspect the distribution of label_group with several seaborn plots
totallabel=totaldata['label_group']
pd.plotting.register_matplotlib_converters()
# Set the width and height of the figure
plt.figure(figsize=(16,6))
plt.title("label group")
# Line chart
sns.lineplot(data=totallabel)

In [None]:
# Scatter plot: the label values are fairly evenly spread
sns.scatterplot(data=totallabel)

In [None]:
# Density (KDE) plot; fill=True replaces the deprecated shade=True
sns.kdeplot(data=totallabel, fill=True)

In [None]:
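# Next, estimate how many samples each label has on average. First extract the unique labels
labeluni=totallabel.unique()
# Print the number of unique labels; it shows that the average number of samples per label is small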
len(labeluni)

In [None]:
# Scatter plot of the unique labels: still evenly spread
sns.scatterplot(data=labeluni)

In [None]:
# Density (KDE) plot; fill=True replaces the deprecated shade=True
sns.kdeplot(data=labeluni, fill=True)

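How we deal with the small number of samples per label: when building the mini dataset, every sample of each selected label is added, so that the subset is not dominated by labels that have only a single sample. At test time we also account for test samples whose label never appears in the training set and handle them with a corresponding interpolation step. In our tests this greatly improves accuracy.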
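
The check for labels that appear only in the test split can be sketched in plain Python. This is a minimal illustration with made-up label ids; `label_report` is a hypothetical helper, not part of the notebook, and it only reports the situation rather than performing the interpolation step.

```python
from collections import Counter

def label_report(train_labels, test_labels):
    """Summarise label coverage for a train/test split."""
    train_counts = Counter(train_labels)
    # Labels that occur in the test split but never in the training split
    unseen = sorted(set(test_labels) - set(train_counts))
    # Average number of training samples per distinct label
    mean_per_label = len(train_labels) / len(train_counts)
    return {"mean_per_label": mean_per_label, "unseen_in_test": unseen}

# Hypothetical label ids, for illustration only
report = label_report([11, 11, 22, 33, 33, 33], [11, 44, 22])
print(report)
```

A non-empty `unseen_in_test` list means some test samples cannot be classified correctly by any model trained on that split, which puts a ceiling on the reachable accuracy.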

In [None]:
# Look further at the labels by sorting them
totaldata.label_group.sort_values()

In [None]:
# Look at samples with adjacent label values; here we pick 4294197112 and 4293276364
totaldata.loc[totaldata.label_group == 4293276364]


In [None]:
totaldata.loc[totaldata.label_group == 4294197112]

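Adjacent label values do not imply similar products. The titles also contain Indonesian, so the product text may be a poor fit for models trained on English; this needs to be kept in mind.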

II. Classification with Text Information

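Because of limited compute and the constraints of Kaggle notebooks, all training below uses toy datasets carved out of the original training set. Following the discussion above, we first fix a set of labels and then collect every sample carrying those labels for training and testing. Since the algorithms differ in speed, the number of labels varies slightly between demonstrations, but it is adjustable; when it is expanded to the full length, the toy dataset is the full training set.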
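
The selection rule can be sketched without pandas as follows. This is an illustration on made-up rows; `make_toy_subset` and `n_seed_rows` are hypothetical names, and the notebook itself performs the same step with `isin` on the DataFrame.

```python
def make_toy_subset(rows, n_seed_rows):
    """Return every row whose label appears among the first n_seed_rows rows."""
    seed_labels = {row["label_group"] for row in rows[:n_seed_rows]}
    return [row for row in rows if row["label_group"] in seed_labels]

# Made-up rows, for illustration only
rows = [
    {"title": "a", "label_group": 1},
    {"title": "b", "label_group": 2},
    {"title": "c", "label_group": 1},
    {"title": "d", "label_group": 3},
    {"title": "e", "label_group": 2},
]
toy = make_toy_subset(rows, n_seed_rows=2)
toy_titles = [row["title"] for row in toy]
print(toy_titles)
```

Seeding with the first two rows fixes the labels 1 and 2, and rows "c" and "e" are then pulled in even though they lie outside the seed slice.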

In [None]:
# Tiny toy sample and split
toydata=totaldata[0:100]
# Pull in every sample of each label present in toydata; this greatly improves accuracy
toytrickdata=totaldata.loc[totaldata.label_group.isin(toydata.label_group)]
toydata=toytrickdata
tx_train,tx_test, ty_train, ty_test = train_test_split(toydata,toydata.label_group,test_size=0.3, random_state=0)

In [None]:
# Tiny toy text sample and split
toytextdata=toydata[['title','label_group']]
ttextx_train,ttextx_test, ttexty_train, ttexty_test = train_test_split(toytextdata,toytextdata.label_group,test_size=0.1, random_state=0)

1. Word embedding: use a pretrained spaCy model to map each title to a vector, which is then used for classification.
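
As far as I understand, spaCy's document vector is by default the average of the token vectors. The sketch below shows that averaging on a tiny made-up embedding table; `doc_vector` and `toy_vectors` are hypothetical, and real spaCy tokenization is more involved than splitting on whitespace.

```python
def doc_vector(text, word_vectors, dim):
    """Average the vectors of the whitespace-separated tokens in text."""
    tokens = text.lower().split()
    total = [0.0] * dim
    for token in tokens:
        # Tokens missing from the table contribute a zero vector
        vec = word_vectors.get(token, [0.0] * dim)
        total = [t + v for t, v in zip(total, vec)]
    return [t / len(tokens) for t in total] if tokens else total

# Made-up 2-dimensional embeddings, for illustration only
toy_vectors = {"red": [1.0, 0.0], "shoe": [0.0, 1.0]}
vec = doc_vector("Red shoe", toy_vectors, dim=2)
print(vec)
```

Words the pretrained model does not know, such as many Indonesian words in an English model, pull the average toward zero, which is one reason the embedding features may be weaker on this dataset.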

In [None]:
# Word embedding approach
# Need to load the large model to get the vectors
nlp = spacy.load('en_core_web_lg')
# Disabling other pipes because we don't need them and it'll speed up this part a bit
with nlp.disable_pipes():
    doc_vectors = np.array([nlp(text).vector for text in toytextdata.title])
# Split the data
ttextdataemx_train, ttextdataemx_test, ttextdataemy_train, ttextdataemy_test = train_test_split(doc_vectors, toytextdata.label_group,
                                                    test_size=0.3, random_state=1)

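We then call the corresponding scikit-learn estimators to classify the embedding vectors with SVM, KNN, decision tree, random forest, AdaBoost, gradient boosting, and voting, and print accuracy, precision, recall, and F1 score to measure the predictions. For some methods we also vary the parameters to help pick the best model.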
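
For reference, the macro-averaged scores printed in the cells below can be computed by hand as in this sketch. `macro_scores` is a hypothetical helper written for illustration; it is meant to mirror `average='macro'` with `zero_division=0`, under the assumption that the label set is the union of true and predicted labels.

```python
def macro_scores(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        # A zero denominator yields a score of 0 for that label
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Made-up labels, for illustration only
acc, prec, rec, f1 = macro_scores(["a", "a", "b", "c"], ["a", "b", "b", "c"])
print(acc, prec, rec, f1)
```

Macro averaging weights every label equally, so with many labels that have only one or two samples a few misclassified rare labels can move precision and recall noticeably more than accuracy.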

In [None]:
# SVM prediction on the word embeddings
# Set dual=False to speed up training, and it's not needed
svc = LinearSVC(random_state=1, dual=False, max_iter=10000)
svc.fit(ttextdataemx_train, ttextdataemy_train)
print(f"Embedding SVM Accuracy: {svc.score(ttextdataemx_test, ttextdataemy_test) * 100:.3f}%", )
print(f"Embedding SVM Precision score: {precision_score(ttextdataemy_test,svc.predict(ttextdataemx_test) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"Embedding SVM Recall score: {recall_score(ttextdataemy_test,svc.predict(ttextdataemx_test) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"Embedding SVM F1 score: {f1_score(ttextdataemy_test,svc.predict(ttextdataemx_test),average='macro',zero_division=0) * 100:.3f}%", )

In [None]:
# KNN on the word embeddings
# k is tunable
for k in [3,4,5,6,7,8,9,10,11,12]:
    clf = neighbors.KNeighborsClassifier(n_neighbors=k)
    clf.fit(ttextdataemx_train, ttextdataemy_train)
    print("k="+str(k))
    print(f"Embedding KNN Accuracy: {clf.score(ttextdataemx_test, ttextdataemy_test) * 100:.3f}%", )
    print(f"Embedding KNN Precision score: {precision_score(ttextdataemy_test,clf.predict(ttextdataemx_test) ,average='macro',zero_division=0) * 100:.3f}%", )
    print(f"Embedding KNN Recall score: {recall_score(ttextdataemy_test,clf.predict(ttextdataemx_test) ,average='macro',zero_division=0) * 100:.3f}%", )
    print(f"Embedding KNN F1 score: {f1_score(ttextdataemy_test,clf.predict(ttextdataemx_test),average='macro',zero_division=0) * 100:.3f}%", )

Here the best KNN parameter is k=3.

In [None]:
# Decision tree and random forest on the embeddings
# The decision tree's criterion, splitter, max_depth, etc. can be tuned; omitted here
dt = DecisionTreeClassifier(random_state=1)
dt.fit(ttextdataemx_train,ttextdataemy_train)
print(f"Embedding DecisionTree Accuracy: {dt.score(ttextdataemx_test, ttextdataemy_test) * 100:.3f}%", )
print(f"Embedding DecisionTree Precision score: {precision_score(ttextdataemy_test,dt.predict(ttextdataemx_test) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"Embedding DecisionTree Recall score: {recall_score(ttextdataemy_test,dt.predict(ttextdataemx_test) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"Embedding DecisionTree F1 score: {f1_score(ttextdataemy_test,dt.predict(ttextdataemx_test),average='macro',zero_division=0) * 100:.3f}%", )
# The random forest's n_estimators is tunable; a larger n_estimators generally works better
rf = RandomForestClassifier(random_state=1)
rf.fit(ttextdataemx_train,ttextdataemy_train)
print(f"Embedding RandomForest Accuracy: {rf.score(ttextdataemx_test, ttextdataemy_test) * 100:.3f}%", )
print(f"Embedding RandomForest Precision score: {precision_score(ttextdataemy_test,rf.predict(ttextdataemx_test) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"Embedding RandomForest Recall score: {recall_score(ttextdataemy_test,rf.predict(ttextdataemx_test) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"Embedding RandomForest F1 score: {f1_score(ttextdataemy_test,rf.predict(ttextdataemx_test),average='macro',zero_division=0) * 100:.3f}%", )

In [None]:
# AdaBoost
# base_estimator, n_estimators and learning_rate are all tunable
for l in [0.01,0.1,0.3,0.5,1]:
    print("learning rate="+str(l))
    ab = AdaBoostClassifier(learning_rate=l,base_estimator=DecisionTreeClassifier(),random_state=1)
    ab.fit(ttextdataemx_train,ttextdataemy_train)
    print(f"Embedding AdaBoost Accuracy: {ab.score(ttextdataemx_test, ttextdataemy_test) * 100:.3f}%", )
    print(f"Embedding AdaBoost Precision score: {precision_score(ttextdataemy_test,ab.predict(ttextdataemx_test) ,average='macro',zero_division=0) * 100:.3f}%", )
    print(f"Embedding AdaBoost Recall score: {recall_score(ttextdataemy_test,ab.predict(ttextdataemx_test) ,average='macro',zero_division=0) * 100:.3f}%", )
    print(f"Embedding AdaBoost F1 score: {f1_score(ttextdataemy_test,ab.predict(ttextdataemx_test),average='macro',zero_division=0) * 100:.3f}%", )

AdaBoost's default learning rate of 1 still gives the best result.

In [None]:
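# Gradient boosting. Note: GradientBoostingRegressor treats label_group as a numeric target,
# so score() returns the R^2 coefficient of determination, not a classification accuracy.
for l in [0.01,0.03,0.1,0.3,0.5]:
    print("learning rate="+str(l))
    gbt = GradientBoostingRegressor(learning_rate=l,random_state=1)
    gbt.fit(ttextdataemx_train,ttextdataemy_train)
    print(f"Embedding GradientBoosting R^2 score: {gbt.score(ttextdataemx_test,ttextdataemy_test):.3f}")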

Here gradient boosting scores best at a learning rate of 0.1.

In [None]:
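# Voting Classifier, an ensemble method. Hard voting takes a majority vote over the predicted labels;
# soft voting takes the arg-max of the averaged class probabilities. Both accept optional per-estimator weights.
# The ensemble combines several estimators; each one's parameters can be tuned separately (omitted here)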
estimators = [ 
    ('rf',RandomForestClassifier(random_state=1,n_estimators=20)),
    ('svc',SVC(kernel='rbf', probability=True,random_state=1)),
    ('knc',KNeighborsClassifier()),
    ('abc',AdaBoostClassifier(base_estimator=DecisionTreeClassifier() ,n_estimators=20,random_state=1)),
    ('lr',LogisticRegression(random_state=1)) 
]
# Hard voting setup
vc = VotingClassifier(estimators=estimators, voting='hard')
vc.fit(ttextdataemx_train,ttextdataemy_train)
print(f"Embedding HardVoting Classifier Accuracy: {vc.score(ttextdataemx_test, ttextdataemy_test) * 100:.3f}%", )
print(f"Embedding HardVoting Classifier Precision score: {precision_score(ttextdataemy_test,vc.predict(ttextdataemx_test) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"Embedding HardVoting Classifier Recall score: {recall_score(ttextdataemy_test,vc.predict(ttextdataemx_test) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"Embedding HardVoting Classifier F1 score: {f1_score(ttextdataemy_test,vc.predict(ttextdataemx_test),average='macro',zero_division=0) * 100:.3f}%", )
# Print the metrics of each individual estimator (each one is refit on the raw labels first)
for est,name in zip(vc.estimators_,vc.estimators):
    est.fit(ttextdataemx_train,ttextdataemy_train)
    print(name[0],f"Accuracy: {est.score(ttextdataemx_test, ttextdataemy_test) * 100:.3f}%", )
    print(name[0],f"Precision score: {precision_score(ttextdataemy_test,est.predict(ttextdataemx_test) ,average='macro',zero_division=0) * 100:.3f}%", )
    print(name[0],f"Recall score: {recall_score(ttextdataemy_test,est.predict(ttextdataemx_test) ,average='macro',zero_division=0) * 100:.3f}%", )
    print(name[0],f"F1 score: {f1_score(ttextdataemy_test,est.predict(ttextdataemx_test),average='macro',zero_division=0) * 100:.3f}%", )


In [None]:
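# Soft voting weighted by the individual accuracies. The weights are tunable; here they rank the accuracies printed above, with a higher accuracy getting a larger weight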
vc = VotingClassifier(estimators=estimators, voting='soft', weights=[3,2,4,1,5])
vc.fit(ttextdataemx_train,ttextdataemy_train)
print(f"Embedding SoftVoting Classifier Accuracy: {vc.score(ttextdataemx_test, ttextdataemy_test) * 100:.3f}%", )
print(f"Embedding SoftVoting Classifier Precision score: {precision_score(ttextdataemy_test,vc.predict(ttextdataemx_test) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"Embedding SoftVoting Classifier Recall score: {recall_score(ttextdataemy_test,vc.predict(ttextdataemx_test) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"Embedding SoftVoting Classifier F1 score: {f1_score(ttextdataemy_test,vc.predict(ttextdataemx_test),average='macro',zero_division=0) * 100:.3f}%", )

Soft voting exceeds the accuracy of lr, the best individual estimator so far, so the ensemble helps.
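
The difference between the two voting modes can be shown on a single sample. This is a minimal sketch with made-up probabilities; `hard_vote` and `soft_vote` are hypothetical helpers and not the scikit-learn implementation.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote over the labels predicted by each classifier."""
    return Counter(predictions).most_common(1)[0][0]

def soft_vote(probabilities, weights):
    """Weighted average of the per-class probabilities, then arg-max."""
    classes = list(probabilities[0])
    total = sum(weights)
    averaged = {
        c: sum(w * p[c] for p, w in zip(probabilities, weights)) / total
        for c in classes
    }
    return max(averaged, key=averaged.get)

# Made-up class probabilities from three classifiers for one sample
probs = [
    {"A": 0.9, "B": 0.1},
    {"A": 0.4, "B": 0.6},
    {"A": 0.45, "B": 0.55},
]
hard = hard_vote([max(p, key=p.get) for p in probs])
soft_equal = soft_vote(probs, weights=[1, 1, 1])
soft_weighted = soft_vote(probs, weights=[1, 3, 3])
print(hard, soft_equal, soft_weighted)
```

With equal weights, one confident classifier can outvote two hesitant ones under soft voting, while hard voting only counts heads. Raising the weights of the second and third classifiers flips the soft vote back, which is the effect the ranked weights above are meant to exploit.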

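2. TF-IDF features: a standard feature in text analysis that builds the representation from term-frequency information. It still works on text that mixes languages and can even be more discriminative there. The same methods as before are tuned and evaluated.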
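
A hand-rolled version of the weighting is sketched below. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is my understanding of the `TfidfVectorizer` defaults; the tokenization here is a plain whitespace split with no stop-word removal, and `tfidf` is a hypothetical helper.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return the sorted vocabulary and one L2-normalized TF-IDF row per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(docs)
    # Document frequency: in how many documents each token occurs
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        row = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(x * x for x in row))
        rows.append([x / norm if norm else 0.0 for x in row])
    return vocab, rows

# Made-up Indonesian-style titles, for illustration only
vocab, rows = tfidf(["sepatu merah", "sepatu biru", "tas merah murah"])
print(vocab)
print(rows[2])
```

Because the weights depend only on token counts, no language model is needed, which fits titles that mix English and Indonesian. In the third title the rare token "tas" receives a larger weight than "merah", which also occurs in another title.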

In [None]:
# TF-IDF model
# Reload the data
# Tiny toy sample and split
toydata=totaldata[0:100]
toytrickdata=totaldata.loc[totaldata.label_group.isin(toydata.label_group)]
toydata=toytrickdata
tx_train,tx_test, ty_train, ty_test = train_test_split(toydata,toydata.label_group,test_size=0.3, random_state=0)
# Tiny toy text sample and split
toytextdata=toydata[['title','label_group']]
ttextx_train,ttextx_test, ttexty_train, ttexty_test = train_test_split(toytextdata,toytextdata.label_group,test_size=0.3, random_state=0)

# TF-IDF feature extraction
count_vec = TfidfVectorizer(binary=False, decode_error='ignore', stop_words='english')
response = count_vec.fit_transform(ttextx_train.title) # s must be string
trainfeature = response.toarray()
response = count_vec.transform(ttextx_test.title) # s must be string
testfeature = response.toarray()

In [None]:
# SVM prediction on the TF-IDF features
# Set dual=False to speed up training, and it's not needed
svc = LinearSVC(random_state=1, dual=False, max_iter=10000)
svc.fit(trainfeature, ttexty_train)
print(f"TF-IDF SVM Accuracy: {svc.score(testfeature, ttexty_test) * 100:.3f}%", )
print(f"TF-IDF SVM Precision score: {precision_score(ttexty_test,svc.predict(testfeature) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"TF-IDF SVM Recall score: {recall_score(ttexty_test,svc.predict(testfeature) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"TF-IDF SVM F1 score: {f1_score(ttexty_test,svc.predict(testfeature),average='macro',zero_division=0) * 100:.3f}%", )

The accuracy is very high.

In [None]:
# KNN on the TF-IDF features
# k is tunable
for k in [3,4,5,6,7,8,9,10,11,12]:
    print("k="+str(k))
    clf = neighbors.KNeighborsClassifier(n_neighbors=k)
    clf.fit(trainfeature, ttexty_train)
    print(f"TF-IDF KNN Accuracy: {clf.score(testfeature, ttexty_test) * 100:.3f}%", )
    print(f"TF-IDF KNN Precision score: {precision_score(ttexty_test,clf.predict(testfeature) ,average='macro',zero_division=0) * 100:.3f}%", )
    print(f"TF-IDF KNN Recall score: {recall_score(ttexty_test,clf.predict(testfeature) ,average='macro',zero_division=0) * 100:.3f}%", )
    print(f"TF-IDF KNN F1 score: {f1_score(ttexty_test,clf.predict(testfeature),average='macro',zero_division=0) * 100:.3f}%", )

Here the best KNN parameter is k=3.

In [None]:
# Decision tree and random forest on the TF-IDF features
dt = DecisionTreeClassifier(random_state=1)
dt.fit(trainfeature, ttexty_train)
print(f"TF-IDF DecisionTree Accuracy: {dt.score(testfeature, ttexty_test) * 100:.3f}%", )
print(f"TF-IDF DecisionTree Precision score: {precision_score(ttexty_test,dt.predict(testfeature) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"TF-IDF DecisionTree Recall score: {recall_score(ttexty_test,dt.predict(testfeature) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"TF-IDF DecisionTree F1 score: {f1_score(ttexty_test,dt.predict(testfeature),average='macro',zero_division=0) * 100:.3f}%", )
# n_estimators is tunable
rf = RandomForestClassifier(random_state=1)
rf.fit(trainfeature, ttexty_train)
print(f"TF-IDF RandomForest Accuracy: {rf.score(testfeature, ttexty_test) * 100:.3f}%", )
print(f"TF-IDF RandomForest Precision score: {precision_score(ttexty_test,rf.predict(testfeature) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"TF-IDF RandomForest Recall score: {recall_score(ttexty_test,rf.predict(testfeature) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"TF-IDF RandomForest F1 score: {f1_score(ttexty_test,rf.predict(testfeature),average='macro',zero_division=0) * 100:.3f}%", )

In [None]:
# AdaBoost
# base_estimator, n_estimators and learning_rate are all tunable
for l in [0.01,0.1,0.3,0.5,1]:
    print("learning rate="+str(l))
    ab = AdaBoostClassifier(learning_rate=l,base_estimator=DecisionTreeClassifier(),random_state=1)
    ab.fit(trainfeature, ttexty_train)
    print(f"TF-IDF AdaBoost Accuracy: {ab.score(testfeature, ttexty_test) * 100:.3f}%", )
    print(f"TF-IDF AdaBoost Precision score: {precision_score(ttexty_test,ab.predict(testfeature) ,average='macro',zero_division=0) * 100:.3f}%", )
    print(f"TF-IDF AdaBoost Recall score: {recall_score(ttexty_test,ab.predict(testfeature) ,average='macro',zero_division=0) * 100:.3f}%", )
    print(f"TF-IDF AdaBoost F1 score: {f1_score(ttexty_test,ab.predict(testfeature),average='macro',zero_division=0) * 100:.3f}%", )

Here the AdaBoost learning rate does not affect the accuracy.

In [None]:
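# Gradient boosting. Note: GradientBoostingRegressor treats label_group as a numeric target,
# so score() returns the R^2 coefficient of determination, not a classification accuracy.
for l in [0.01,0.03,0.1,0.3,0.5,1]:
    print("learning rate="+str(l))
    gbt = GradientBoostingRegressor(learning_rate=l,random_state=1)
    gbt.fit(trainfeature, ttexty_train)
    print(f"TF-IDF GradientBoosting R^2 score: {gbt.score(testfeature, ttexty_test):.3f}")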

Here gradient boosting scores best at a learning rate of 0.5.

In [None]:
# Voting Classifier
# Ensemble of several estimators
estimators = [ 
    ('rf',RandomForestClassifier(random_state=1,n_estimators=20)),
    ('svc',SVC(kernel='rbf', probability=True,random_state=1)),
    ('knc',KNeighborsClassifier()),
    ('abc',AdaBoostClassifier(base_estimator=DecisionTreeClassifier() ,n_estimators=20,random_state=1)),
    ('lr',LogisticRegression(random_state=1)) 
]
# Hard voting setup
vc = VotingClassifier(estimators=estimators, voting='hard')
vc.fit(trainfeature, ttexty_train)
print(f"TF-IDF HardVoting Classifier Accuracy: {vc.score(testfeature, ttexty_test) * 100:.3f}%", )
print(f"TF-IDF HardVoting Classifier Precision score: {precision_score(ttexty_test,vc.predict(testfeature) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"TF-IDF HardVoting Classifier Recall score: {recall_score(ttexty_test,vc.predict(testfeature) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"TF-IDF HardVoting Classifier F1 score: {f1_score(ttexty_test,vc.predict(testfeature),average='macro',zero_division=0) * 100:.3f}%", )
# Print the metrics of each individual estimator (each one is refit on the raw labels first)
for est,name in zip(vc.estimators_,vc.estimators):
    est.fit(trainfeature, ttexty_train)
    print(name[0],f"Accuracy: {est.score(testfeature, ttexty_test) * 100:.3f}%", )
    print(name[0],f"Precision score: {precision_score(ttexty_test,est.predict(testfeature) ,average='macro',zero_division=0) * 100:.3f}%", )
    print(name[0],f"Recall score: {recall_score(ttexty_test,est.predict(testfeature) ,average='macro',zero_division=0) * 100:.3f}%", )
    print(name[0],f"F1 score: {f1_score(ttexty_test,est.predict(testfeature),average='macro',zero_division=0) * 100:.3f}%", )


In [None]:
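# Soft voting weighted by the individual accuracies. The weights are tunable; here they rank the accuracies printed above, with a higher accuracy getting a larger weight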
vc = VotingClassifier(estimators=estimators, voting='soft', weights=[5,2,3,4,1])
vc.fit(trainfeature, ttexty_train)
print(f"TF-IDF SoftVoting Classifier Accuracy: {vc.score(testfeature, ttexty_test) * 100:.3f}%", )
print(f"TF-IDF SoftVoting Classifier Precision score: {precision_score(ttexty_test,vc.predict(testfeature) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"TF-IDF SoftVoting Classifier Recall score: {recall_score(ttexty_test,vc.predict(testfeature) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"TF-IDF SoftVoting Classifier F1 score: {f1_score(ttexty_test,vc.predict(testfeature),average='macro',zero_division=0) * 100:.3f}%", )

Soft voting beats most of the individual classifiers and is fairly robust.

3. spaCy also provides a TextCategorizer component for text classification. It requires a language to be set; here we set it to English for the test.
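
The cells below pass each label to the categorizer as a dictionary mapping every category to 0 or 1 and take the highest-scoring category as the prediction. A minimal sketch of that data shape, with hypothetical label values and a made-up model output:

```python
def to_cats(label, all_labels):
    """One-hot dict of the form {category: 0 or 1} for a single sample."""
    return {category: int(category == label) for category in all_labels}

def top_category(scores):
    """Return the category with the highest score."""
    return max(scores, key=scores.get)

all_labels = ["111", "222", "333"]  # hypothetical label_group values cast to str
gold = to_cats("222", all_labels)
predicted_scores = {"111": 0.1, "222": 0.7, "333": 0.2}  # made-up model output
print(gold, top_category(predicted_scores))
```

The labels are cast to strings because the category names are used as dictionary keys and compared with the predicted category name.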

In [None]:
# spaCy TextCategorizer approach: the labels are cast to strings before classification
# Reload the data
# Tiny toy sample and split
toydata=totaldata[0:100]
toytrickdata=totaldata.loc[totaldata.label_group.isin(toydata.label_group)]
toydata=toytrickdata
tx_train,tx_test, ty_train, ty_test = train_test_split(toydata,toydata.label_group,test_size=0.3, random_state=0)
# Tiny toy text sample and split
toytextdata=toydata[['title','label_group']]
ttextx_train,ttextx_test, ttexty_train, ttexty_test = train_test_split(toytextdata,toytextdata.label_group,test_size=0.3, random_state=0)

In [None]:
# spaCy TextCategorizer approach: cast the labels to strings before classification
ttextx_train=ttextx_train.astype('str')
ttextx_test=ttextx_test.astype('str')
ttexty_train=ttexty_train.astype('str')
ttexty_test=ttexty_test.astype('str')

# Create an empty model. The text is not all English, which may affect the result
nlp = spacy.blank("en")

# Create the TextCategorizer with exclusive classes and "bow" architecture
textcat = nlp.create_pipe(
              "textcat",
              config={
                "exclusive_classes": True,
                "architecture": "bow"})

# Add the TextCategorizer to the empty model
nlp.add_pipe(textcat)
#add label
for label in ttextx_train.label_group.unique():
    textcat.add_label(label)
# Training data
train =ttextx_train.apply(lambda row: (row['title'],row['label_group']), axis=1).tolist()

In [None]:
# Load the data
def load_data(train, split=0.8): 

    # Shuffle the data
    random.shuffle(train)
    texts, labels = zip(*train)
    # get the categories for each review
    categories = ttextx_train.label_group

    cats = []
    for y in labels:
        cat = {category: 0 for category in categories}
        cat[y] = 1
        cats.append(cat)

    # Splitting the training and evaluation data
    split = int(len(train) * split)
    return (texts[:split], cats[:split]), (texts[split:], cats[split:])

# Calling the load_data() function 
(train_texts, train_cats), (dev_texts, dev_cats) = load_data(train)

# Processing the final format of training data
train_data = list(zip(train_texts,[{'cats': cats} for cats in train_cats]))


In [None]:
# Accuracy computation
def evaluate(tokenizer, textcat, texts, cats):
    docs = (tokenizer(text) for text in texts)
    correct = 0
    for i, doc in enumerate(textcat.pipe(docs)):
        # The gold label is the key with the largest value in the one-hot dict
        gold = max(cats[i], key=cats[i].get)
        # The prediction is the category with the highest score
        predict = max(doc.cats, key=doc.cats.get)
        if gold == predict:
            correct += 1

    # Fraction of correctly classified texts, i.e. the accuracy
    accuracy = correct / len(texts)
    return accuracy


In [None]:
# Iterative training; this takes a while
n_iter = 10
# Disable the other pipeline components
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
with nlp.disable_pipes(*other_pipes):  # train only textcat
    optimizer = nlp.begin_training()

    print("Training the model...")
    print('{:^5}\t{:^5}'.format('LOSS', 'ACCURACY'))

    # Start training
    for i in range(n_iter):
        losses = {}
        batches = minibatch(train_data, size=compounding(4., 32., 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2,
                      losses=losses)

        with textcat.model.use_params(optimizer.averages):
            score = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats) 

            print('{0:.3f}\t{1:.3f}'.format(losses['textcat'], score))


In [None]:
# Test accuracy
test =ttextx_test.apply(lambda row: (row['title'],row['label_group']), axis=1).tolist()
texts, labels = zip(*test)
docs = (nlp.tokenizer(text) for text in texts)
correct = 0
for gold, doc in zip(labels, textcat.pipe(docs)):
    # The prediction is the category with the highest score
    predict = max(doc.cats, key=doc.cats.get)
    if gold == predict:
        correct += 1

accuracy = correct / len(texts)

print(f"Spacy nlp Accuracy: {accuracy * 100:.3f}%")

The results are fairly good.

III. Classification with Image Information

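We again train and predict on the toydata subset. Besides reading and processing the image files, the image_phash column can also be used for prediction. The neural-network methods more commonly used for images are in the notebook PCbighomework_network_hgz.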

In [None]:
# Tiny toy sample and split
toydata=totaldata[0:100]
# Pull in every sample of each label present in toydata; this greatly improves accuracy
toytrickdata=totaldata.loc[totaldata.label_group.isin(toydata.label_group)]
toydata=toytrickdata
tx_train,tx_test, ty_train, ty_test = train_test_split(toydata,toydata.label_group,test_size=0.3, random_state=0)
# Tiny toy image sample and split
toyimagedata=toydata[['image','label_group']]
timagex_train,timagex_test, timagey_train, timagey_test = train_test_split(toyimagedata,toyimagedata.label_group,test_size=0.3, random_state=0)


1. Start with a simple feature extraction: flatten each image directly into a one-dimensional vector, on which KNN can then be run.
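
The idea can be sketched on tiny made-up "images" in plain Python. `flatten` and `knn_predict` are hypothetical helpers; the notebook itself uses full-size images and scikit-learn's `KNeighborsClassifier`, and `math.dist` assumes Python 3.8 or later.

```python
import math
from collections import Counter

def flatten(image):
    """Flatten a rows x columns x channels nested list into one vector."""
    return [value for row in image for pixel in row for value in pixel]

def knn_predict(train_x, train_y, query, k):
    """Majority label among the k training vectors closest to query (Euclidean)."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Made-up 2x1 RGB images, for illustration only
images = [
    [[(0, 0, 0)], [(10, 10, 10)]],
    [[(5, 5, 5)], [(0, 0, 0)]],
    [[(250, 250, 250)], [(255, 255, 255)]],
    [[(240, 240, 240)], [(245, 245, 245)]],
]
labels = ["dark", "dark", "bright", "bright"]
train_x = [flatten(img) for img in images]
query = flatten([[(20, 20, 20)], [(15, 15, 15)]])
prediction = knn_predict(train_x, labels, query, k=3)
print(prediction)
```

Raw pixel distance reacts to brightness, shifts and background, so two photos of the same product can lie far apart in this space.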

In [None]:
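# Build the image matrix: rescale each image to 512x512 and flatten it into a 1-D vector
toyimagearraydata=np.zeros((len(toyimagedata),512*512*3), dtype=np.float32)
for i in range(len(toyimagedata)):
    a = cv2.imread('../input/shopee-product-matching/train_images/'+toyimagedata.image[toyimagedata.index[i]])
    # cv2.resize rescales the picture; np.resize would only truncate or repeat the raw pixel buffer
    toyimagearraydata[i] = cv2.resize(a, (512, 512)).reshape(-1)
# Split the samples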
timagearrayx_train,timagearrayx_test, timagearrayy_train, timagearrayy_test = train_test_split(toyimagearraydata,toyimagedata.label_group,test_size=0.3, random_state=0)

In [None]:
# KNN on the flattened 1-D image vectors
# k is tunable
for k in [3,4,5,6,7,8,9,10,11,12]:
    print("k="+str(k))
    clf = neighbors.KNeighborsClassifier(n_neighbors=k)
    clf.fit(timagearrayx_train, timagey_train.values)
    Z = clf.predict(timagearrayx_test)
    print(f"1dimensionimage KNN Accuracy: {clf.score(timagearrayx_test,timagey_test.values) * 100:.3f}%", )
    print(f"1dimensionimage KNN Precision score: {precision_score(timagey_test.values,clf.predict(timagearrayx_test) ,average='macro',zero_division=0) * 100:.3f}%", )
    print(f"1dimensionimage KNN Recall score: {recall_score(timagey_test.values,clf.predict(timagearrayx_test) ,average='macro',zero_division=0) * 100:.3f}%", )
    print(f"1dimensionimage KNN F1 score: {f1_score(timagey_test.values,clf.predict(timagearrayx_test),average='macro',zero_division=0) * 100:.3f}%", )


Here the best KNN parameter is k=3.

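The flattened 1-D representation clearly loses too much information, and the accuracy is low. SVM and similar analyses take too long on it and are not shown here.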

2. SIFT features. The SIFT functions in OpenCV (cv2) extract keypoints from an image, and these serve as features for image classification.
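
Every image yields a different number of SIFT descriptors, so the cells below convert them into a fixed-length bag-of-visual-words histogram: each descriptor is assigned to its nearest cluster center and the assignments are counted. A minimal sketch of that counting step, with made-up 2-D descriptors instead of 128-D SIFT descriptors and a hypothetical helper `bow_histogram`:

```python
import math

def bow_histogram(descriptors, centers):
    """Count how many descriptors fall nearest to each cluster center."""
    hist = [0] * len(centers)
    for d in descriptors:
        # Index of the center with the smallest Euclidean distance to d
        nearest = min(range(len(centers)), key=lambda j: math.dist(d, centers[j]))
        hist[nearest] += 1
    return hist

# Made-up centers and descriptors, for illustration only
centers = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(1.0, 0.0), (9.0, 11.0), (0.5, 0.5)]
hist = bow_histogram(descriptors, centers)
print(hist)
```

The histogram length equals the number of cluster centers (50 in the notebook), whatever the number of keypoints, which is what lets the earlier classifiers be reused unchanged.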

In [None]:
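# SIFT image processing
# Tiny toy sample and split; 100 rows gives an acceptable speed
toydata=totaldata[0:100]
# Pull in every sample of each label present in toydata; this greatly improves accuracy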
toytrickdata=totaldata.loc[totaldata.label_group.isin(toydata.label_group)]
toydata=toytrickdata
tx_train,tx_test, ty_train, ty_test = train_test_split(toydata,toydata.label_group,test_size=0.3, random_state=0)
# Tiny toy image sample and split
toyimagedata=toydata[['image','label_group']]
timagex_train,timagex_test, timagey_train, timagey_test = train_test_split(toyimagedata,toyimagedata.label_group,test_size=0.3, random_state=0)


In [None]:
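# SIFT feature extraction; this takes a while

def calcSiftFeature(img):
    # Create a SIFT detector with default settings (pass nfeatures=200 to cap the keypoints at 200)
    sift = cv2.SIFT_create()
    # Compute the keypoints of the image and their descriptors
    keypoints, features = sift.detectAndCompute(img, None)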
    return features

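# Learn the visual vocabulary (bag of words)
def learnVocabulary(features):
    wordCnt = 50
    # criteria is the stopping rule: stop at a precision of eps = 0.1 or after max_iter = 20 iterations
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 0.1)
    # Choose the initial k-means centers at random
    flags = cv2.KMEANS_RANDOM_CENTERS
    # compactness, labels, centers = kmeans(input data (descriptors), number of clusters K, preset labels, stopping criteria, number of attempts, initial-center flag)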
    compactness, labels, centers = cv2.kmeans(features, wordCnt, None,criteria, 20, flags)
    return centers

def calcFeatVec(features, centers):
    featVec = np.zeros((1, 50))
    for i in range(0, features.shape[0]):
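        # The i-th descriptor of this image
        fi = features[i]
        diffMat = np.tile(fi, (50, 1)) - centers
        # Sum along axis=1 (per row), giving the distance from the descriptor to every center
        sqSum = (diffMat**2).sum(axis=1)
        dist = sqSum**0.5
        # Sort in ascending order
        sortedIndices = dist.argsort()
        # Take the smallest distance, i.e. the nearest center
        idx = sortedIndices[0]
        # Increment the count for that center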
        featVec[0][idx] += 1
    return featVec

features = np.float32([]).reshape(0, 128)  # pooled SIFT descriptors of all images
for i in range(len(toyimagedata)):
    img = cv2.imread('../input/shopee-product-matching/train_images/'+toyimagedata.image[toyimagedata.index[i]])
    # Get the SIFT descriptors of the image
    img_f = calcSiftFeature(img)
    # Add the descriptors to the pool used for clustering
    features = np.append(features, img_f, axis=0)
# Visual vocabulary learned from the pooled descriptors
centers = learnVocabulary(features)

data_vec = np.float32([]).reshape(0, 50)  # bag-of-words histogram of each image
labels = np.float32([])
for i in range(len(toyimagedata)):
    img = cv2.imread('../input/shopee-product-matching/train_images/'+toyimagedata.image[toyimagedata.index[i]])
    img_f = calcSiftFeature(img)
    img_vec = calcFeatVec(img_f, centers)
    data_vec = np.append(data_vec,img_vec,axis=0)
    labels = np.append(labels,toyimagedata.label_group[toyimagedata.index[i]])



In [None]:
# The values are quite spread out, so normalize them (the norm is adjustable)
tt=data_vec
data_vec = preprocessing.normalize(tt, norm='l2')
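
For reference, row-wise L2 normalization divides each histogram by its Euclidean length, so that images with many keypoints and images with few become comparable. A minimal sketch with a hypothetical helper `l2_normalize`, which leaves an all-zero row unchanged:

```python
import math

def l2_normalize(row):
    """Scale a vector to unit Euclidean length; an all-zero vector is returned as is."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm else list(row)

normalized = l2_normalize([3.0, 4.0])
print(normalized)
```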

In [None]:
# Split the SIFT data
tsiftdatax_train, tsiftdatax_test, tsiftdatay_train, tsiftdatay_test = train_test_split(data_vec, labels,
                                                    test_size=0.3, random_state=1)

After the SIFT transform, the data can again be classified with the same methods as before.

In [None]:
# SVM prediction on the SIFT features
svc = LinearSVC(random_state=1, dual=False, max_iter=10000)
svc.fit(tsiftdatax_train, tsiftdatay_train)
print(f"SIFT SVM Accuracy: {svc.score(tsiftdatax_test, tsiftdatay_test) * 100:.3f}%", )
print(f"SIFT SVM Precision score: {precision_score(tsiftdatay_test,svc.predict(tsiftdatax_test) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"SIFT SVM Recall score: {recall_score(tsiftdatay_test,svc.predict(tsiftdatax_test) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"SIFT SVM F1 score: {f1_score(tsiftdatay_test,svc.predict(tsiftdatax_test),average='macro',zero_division=0) * 100:.3f}%", )


In [None]:
# KNN on the SIFT features
# k is tunable
for k in [3,4,5,6,7,8,9,10,11,12]:
    print("k="+str(k))
    clf = neighbors.KNeighborsClassifier(n_neighbors=k)
    clf.fit(tsiftdatax_train, tsiftdatay_train)
    Z = clf.predict(tsiftdatax_test)
    print(f"SIFT KNN Accuracy: {clf.score(tsiftdatax_test, tsiftdatay_test) * 100:.3f}%", )
    print(f"SIFT KNN Precision score: {precision_score(tsiftdatay_test,clf.predict(tsiftdatax_test) ,average='macro',zero_division=0) * 100:.3f}%", )
    print(f"SIFT KNN Recall score: {recall_score(tsiftdatay_test,clf.predict(tsiftdatax_test) ,average='macro',zero_division=0) * 100:.3f}%", )
    print(f"SIFT KNN F1 score: {f1_score(tsiftdatay_test,clf.predict(tsiftdatax_test),average='macro',zero_division=0) * 100:.3f}%", )

Here the best KNN parameter is k=7.

In [None]:
# Decision tree and random forest on the SIFT features
dt = DecisionTreeClassifier(random_state=1)
dt.fit(tsiftdatax_train, tsiftdatay_train)
print(f"SIFT DecisionTree Accuracy: {dt.score(tsiftdatax_test, tsiftdatay_test) * 100:.3f}%", )
print(f"SIFT DecisionTree Precision score: {precision_score(tsiftdatay_test,dt.predict(tsiftdatax_test) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"SIFT DecisionTree Recall score: {recall_score(tsiftdatay_test,dt.predict(tsiftdatax_test) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"SIFT DecisionTree F1 score: {f1_score(tsiftdatay_test,dt.predict(tsiftdatax_test),average='macro',zero_division=0) * 100:.3f}%", )
rf = RandomForestClassifier(random_state=1)
rf.fit(tsiftdatax_train, tsiftdatay_train)
print(f"SIFT RandomForest Accuracy: {rf.score(tsiftdatax_test, tsiftdatay_test) * 100:.3f}%", )
print(f"SIFT RandomForest Precision score: {precision_score(tsiftdatay_test,rf.predict(tsiftdatax_test) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"SIFT RandomForest Recall score: {recall_score(tsiftdatay_test,rf.predict(tsiftdatax_test) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"SIFT RandomForest F1 score: {f1_score(tsiftdatay_test,rf.predict(tsiftdatax_test),average='macro',zero_division=0) * 100:.3f}%", )

In [None]:
# AdaBoost
# base_estimator, n_estimators and learning_rate are all tunable
for l in [0.01,0.1,0.3,0.5,1]:
    print("learning rate="+str(l))
    ab = AdaBoostClassifier(learning_rate=l,base_estimator=DecisionTreeClassifier(),random_state=1)
    ab.fit(tsiftdatax_train, tsiftdatay_train)
    print(f"SIFT AdaBoost Accuracy: {ab.score(tsiftdatax_test, tsiftdatay_test) * 100:.3f}%", )
    print(f"SIFT AdaBoost Precision score: {precision_score(tsiftdatay_test,ab.predict(tsiftdatax_test) ,average='macro',zero_division=0) * 100:.3f}%", )
    print(f"SIFT AdaBoost Recall score: {recall_score(tsiftdatay_test,ab.predict(tsiftdatax_test) ,average='macro',zero_division=0) * 100:.3f}%", )
    print(f"SIFT AdaBoost F1 score: {f1_score(tsiftdatay_test,ab.predict(tsiftdatax_test),average='macro',zero_division=0) * 100:.3f}%", )


Here the AdaBoost learning rate does not affect the results.

In [None]:
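# Gradient boosting. Note: GradientBoostingRegressor treats label_group as a numeric target,
# so score() returns the R^2 coefficient of determination, not a classification accuracy.
for l in [0.01,0.03,0.1,0.3,0.5,1]:
    print("learning rate="+str(l))
    gbt = GradientBoostingRegressor(learning_rate=l,random_state=1)
    gbt.fit(tsiftdatax_train, tsiftdatay_train)
    print(f"SIFT GradientBoosting R^2 score: {gbt.score(tsiftdatax_test, tsiftdatay_test):.3f}")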


Here gradient boosting scores best at a learning rate of 0.1.

In [None]:
# Voting Classifier
# Ensemble of several estimators
estimators = [ 
    ('rf',RandomForestClassifier(random_state=1,n_estimators=20)),
    ('svc',SVC(kernel='rbf', probability=True,random_state=1)),
    ('knc',KNeighborsClassifier()),
    ('abc',AdaBoostClassifier(base_estimator=DecisionTreeClassifier() ,n_estimators=20,random_state=1)),
    ('lr',LogisticRegression(random_state=1)) 
]
# Hard voting setup
vc = VotingClassifier(estimators=estimators, voting='hard')
vc.fit(tsiftdatax_train, tsiftdatay_train)
print(f"SIFT HardVoting Classifier Accuracy: {vc.score(tsiftdatax_test, tsiftdatay_test) * 100:.3f}%", )
print(f"SIFT HardVoting Classifier Precision score: {precision_score(tsiftdatay_test,vc.predict(tsiftdatax_test) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"SIFT HardVoting Classifier Recall score: {recall_score(tsiftdatay_test,vc.predict(tsiftdatax_test) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"SIFT HardVoting Classifier F1 score: {f1_score(tsiftdatay_test,vc.predict(tsiftdatax_test),average='macro',zero_division=0) * 100:.3f}%", )
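# Report the metrics of each individual estimator
# VotingClassifier fits its sub-estimators on label-encoded targets, so each one
# is refit on the original labels before scoring (this overwrites the fitted copies in vc)
for est,name in zip(vc.estimators_,vc.estimators):
    est.fit(tsiftdatax_train, tsiftdatay_train)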
    print(name[0],f"Accuracy: {est.score(tsiftdatax_test, tsiftdatay_test) * 100:.3f}%", )
    print(name[0],f"Precision score: {precision_score(tsiftdatay_test,est.predict(tsiftdatax_test) ,average='macro',zero_division=0) * 100:.3f}%", )
    print(name[0],f"Recall score: {recall_score(tsiftdatay_test,est.predict(tsiftdatax_test) ,average='macro',zero_division=0) * 100:.3f}%", )
    print(name[0],f"F1 score: {f1_score(tsiftdatay_test,est.predict(tsiftdatax_test),average='macro',zero_division=0) * 100:.3f}%", )


In [None]:
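# Soft voting weighted by the accuracy of each method. The weights are tunable and can be adjusted using the accuracies reported above; here they follow the ranking, so a higher earlier accuracy gets a larger weight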
vc = VotingClassifier(estimators=estimators, voting='soft', weights=[5,1,4,2,3])
vc.fit(tsiftdatax_train, tsiftdatay_train)
print(f"SIFT SoftVoting Classifier Accuracy: {vc.score(tsiftdatax_test, tsiftdatay_test) * 100:.3f}%", )
print(f"SIFT SoftVoting Classifier Precision score: {precision_score(tsiftdatay_test,vc.predict(tsiftdatax_test) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"SIFT SoftVoting Classifier Recall score: {recall_score(tsiftdatay_test,vc.predict(tsiftdatax_test) ,average='macro',zero_division=0) * 100:.3f}%", )
print(f"SIFT SoftVoting Classifier F1 score: {f1_score(tsiftdatay_test,vc.predict(tsiftdatax_test),average='macro',zero_division=0) * 100:.3f}%", )

Soft voting slightly outperforms each individual classifier, but the overall accuracy is still low.
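Soft voting averages the per-class probabilities of the individual classifiers using the given weights and predicts the class with the highest average. The sketch below reproduces that computation with numpy and shows one possible way to turn accuracies into rank weights; the accuracies and probabilities are made up.

```python
import numpy as np

# Made-up accuracies for three classifiers (illustration only).
accuracies = np.array([0.40, 0.25, 0.35])
# Rank weights: the lowest accuracy gets 1, the highest gets len(accuracies).
weights = accuracies.argsort().argsort() + 1

# Made-up class probabilities, shape (n_classifiers, n_samples, n_classes).
probas = np.array([
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    [[0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],
    [[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]],
])

# Weighted average over the classifier axis, then argmax over classes.
avg = np.average(probas, axis=0, weights=weights)
pred = avg.argmax(axis=1)
print(weights, pred)
```

Rank weights ignore how large the accuracy gaps are; weighting by the accuracies themselves would be an alternative worth trying.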

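3. Analysis using the image pHash values. The training set provides image_phash, a perceptual hash of each image. With the Hamming distance defined as a new metric on these hashes, samples can be classified with KNN.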
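As a rough illustration, assuming image_phash is stored as a hexadecimal string, the Hamming distance can be taken over the underlying bits. The helper name and the hash values below are invented for this sketch.

```python
def phash_hamming_bits(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two hex-encoded hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 wherever the two hashes disagree; count those bits.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")

# Invented example hashes, not taken from the dataset.
a = "a1b2c3d4e5f60718"
b = "a1b2c3d4e5f6071b"
print(phash_hamming_bits(a, b))
```

The cells in this section compare the hash strings character by character instead, which counts any change within a hex digit as one difference regardless of how many bits changed.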

In [None]:
# Tiny test sample and its split
toydata=totaldata[0:100]
# Pull in every sample whose label appears in toydata; this greatly improves accuracy
toytrickdata=totaldata.loc[totaldata.label_group.isin(toydata.label_group)]
toydata=toytrickdata
# Tiny image-hash sample and its split
toyphashdata=toydata[['image_phash','label_group']]
tphashx_train,tphashx_test, tphashy_train, tphashy_test = train_test_split(toyphashdata,toyphashdata.label_group,test_size=0.3, random_state=0)

In [None]:
# Build the image-hash samples: split each hash string into an array of characters
tphasharraydata = []
for i in range(len(toyphashdata)): 
    tphasharraydata.append(np.array(list(toyphashdata['image_phash'][toyphashdata.index[i]])))
tphasharraydata=np.array(tphasharraydata)
# Train/test split
tphasharrayx_train,tphasharrayx_test, tphasharrayy_train, tphasharrayy_test = train_test_split(tphasharraydata,toyphashdata.label_group,test_size=0.3, random_state=0)

In [None]:
# Hamming distance between image hashes
def hamming(image_code, ref_code):
    """
    Normalized Hamming distance between an image hash and a reference hash.
    Args:
        image_code: sequence of hash characters
        ref_code: sequence of hash characters, same length as image_code
    Returns:
        fraction of positions at which the two sequences differ
    """
    assert len(image_code) == len(ref_code)
    return np.sum(np.array(image_code) != np.array(ref_code)) / len(image_code)

# Hand-written KNN

# dist[i, j]: distance between training sample i and test sample j
dist = np.zeros([tphasharrayx_train.shape[0],tphasharrayx_test.shape[0]])
for i in range(tphasharrayx_train.shape[0]):
    for j in range(tphasharrayx_test.shape[0]):
        dist[i,j] = hamming(tphasharrayx_train[i], tphasharrayx_test[j])
# k is a tunable hyperparameter
for k in [3,4,5,6,7,8,9,10,11,12]:
    print("k="+str(k))
    max_index = []
    for j in range(len(tphasharrayx_test)):
        list1 = dist[:,j]
        list2 = sorted(list1)
        max_num = list2[:k]
        # Training samples whose distance is among the k smallest values (ties included)
        max_index.append([y for y,i in enumerate(list1) if i in max_num])
    # Classify by majority vote among the neighbours
    pre = [pd.Series(tphasharrayy_train.values[max_index[i]]).value_counts().idxmax() for i in range(len(tphasharrayx_test))]

    # Report results (sklearn metrics take y_true first, then y_pred)
    print(f"ImagePhash KNN Accuracy: {accuracy_score(tphasharrayy_test.values, pre) * 100:.3f}%")
    print(f"ImagePhash KNN Precision score: {precision_score(tphasharrayy_test.values, pre, average='macro', zero_division=0) * 100:.3f}%")
    print(f"ImagePhash KNN Recall score: {recall_score(tphasharrayy_test.values, pre, average='macro', zero_division=0) * 100:.3f}%")
    print(f"ImagePhash KNN F1 score: {f1_score(tphasharrayy_test.values, pre, average='macro', zero_division=0) * 100:.3f}%")

Here the best KNN parameter is k=3.
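The nested distance loop in the KNN cell can also be written with numpy broadcasting. The following is a sketch on tiny invented data, not the notebook's implementation: the function name is hypothetical, and unlike the cell above it takes exactly k neighbours via argsort rather than including all ties.

```python
import numpy as np
from collections import Counter

def hamming_knn_predict(x_train, y_train, x_test, k=3):
    """Predict labels by majority vote among the k nearest training rows."""
    # dist[i, j] = fraction of positions where test row i and train row j differ.
    dist = (x_test[:, None, :] != x_train[None, :, :]).mean(axis=2)
    # Indices of the k closest training rows for every test row.
    nearest = np.argsort(dist, axis=1, kind="stable")[:, :k]
    return np.array([Counter(y_train[idx]).most_common(1)[0][0] for idx in nearest])

# Invented character arrays standing in for split hash strings.
x_train = np.array([list("aaaa"), list("aaab"), list("ffff"), list("fffe"), list("aaba")])
y_train = np.array([1, 1, 2, 2, 1])
x_test = np.array([list("aaac"), list("ffef")])
pred = hamming_knn_predict(x_train, y_train, x_test, k=3)
print(pred)
```

Broadcasting builds an array of shape (n_test, n_train, hash_length), which is fine for the toy subset used here but may need batching on the full training set.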

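In summary, we extracted features from the text and the image information separately and used them to predict product labels. Overall, the text information gives higher accuracy, because descriptions of products with the same label are very similar, whereas the images are harder to tell apart. PCbighomework_network_hgz trains neural networks on the image information and obtains clearly better results; PCbighomework_mix_hgz attempts to combine the best text and image models into a new classifier. See the experiment report for a detailed comparison of the results.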