### 项目说明
**需求说明**

在搜索竞价广告系统中，用户通过在搜索引擎输入具体的查询词来获取相关信息。因此，用户的历史查询词与用户的基本属性及潜在需求有密切的关系。

举例如下：

    年龄在19岁至23岁区间的自然人会有较多的搜索行为与大学生活、社交等主题有关

    男性相比女性会在军事、汽车等主题有更多的搜索行为

    高学历人群会更加倾向于获取社会、经济等主题的信息

根据用户历史一个月的查询记录，以用户的人口属性（包括性别、年龄、学历）为标签，通过机器学习、数据挖掘技术构建分类算法来对新增用户的人口属性进行判定。

**数据说明**

字段	说明

    ID	加密后的ID
    Age	0：未知年龄; 1：0-18岁; 2：19-23岁; 3：24-30岁; 4：31-40岁; 5：41-50岁; 6： 51-999岁
    Gender	0：未知  1：男性  2：女性
    Education	0：未知学历; 1：博士; 2：硕士; 3：大学生; 4：高中; 5：初中; 6：小学
    Query List	搜索词列表

### 数据加载

#### 导包

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

#### 验证编码格式

In [2]:
def read_file(file_path,mode='rb'):
    with open(file_path,mode=mode) as f:
        i = 0
        while True:
            line = f.readline()
            i += 1
            yield line
train_file = read_file('../../dataset/personas/user_tag_query.10W.TRAIN.csv')
test_file = read_file('../../dataset/personas/user_tag_query.10W.TEST.csv')

In [3]:
import chardet
print(chardet.detect(next(train_file)))
print(chardet.detect(next(test_file)))

{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
{'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}


In [4]:
import chardet
print(chardet.detect(next(train_file)))
print(chardet.detect(next(test_file)))

{'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
{'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}


#### 观察文件格式

In [5]:
next(train_file).decode('GB2312')[:500]

'43CC3AF5A8D6430A3B572337A889AFE4,2,1,3,"广州厨宝烤箱\t世情薄,人情恶,雨送黄昏花易落,晓风干,泪痕\t厦门酒店用品批发市场\t我只是不想让你支付,原谅我的无情,对不起\t处女座代表的花朵\t烤鸡胸肉的做法 烤箱\t不曾忘记,也不曾想起\t一辈子那么长 一天没走到终点\t联塑pvc排水管规格表\t大王椰\t性格文静是什么意思\t250ml牛奶用多少克奶粉冲\t化蝶去寻花 夜夜栖芳草什么意思\t学会爱自己 才会真正懂得爱1001学会爱自己 \t福睿斯\t斗鱼tv\t厨宝是什么\t厨宝烤箱\t禹州城市广场电影院\t大王棕树图片\t王棕树图片\t发酵箱\t6920882798458\t我只是不想让你支付,原谅我的无情,对不起10\t你住的城市下雨了,很想问你有没有带伞 可是\t250ml牛奶用多少克奶粉\t烤翅的做法烤箱\t全脂奶粉泡的比例\t牛肉怎么炒才嫩\t厨宝牌是谁的\t猴配猴的婚姻怎么样\t可惜没如果mv爱情公寓\t守不住的才是个笑话\t犯太岁是什么意思\t禹州城市广场电影院1001禹州城市广场电影院\t烤箱什么牌子好\t全脂奶粉怎么泡\t红铁人面粉蛋白质含量\t你走后我却活成你的样子\t鸡排的做法大全\t放爱\t斗鱼直播\t'

In [6]:
next(test_file).decode('GB2312')[:500]

'CA9F675A024FB2353849350A35CF8B0F\t黑暗文\tlpl夏季赛\t大富豪电玩城\t英雄联盟之电竞称王\t手机怎么扫描手机上的二维码\t重庆重钢老板\t资阳俊士\t2016lpl夏季赛积分\t主角修炼无情道的小说\t重庆重钢\t风流小农民艳遇记\t主角以无情杀戮修炼\t网店推广\t2016lpl夏季赛omg是否保级成功\t成都工商管理学院学费多少 小说\t手机上怎么扫描二维码\t全能运动员\t男主很阴暗残忍的小说\tvertu手机\t胸型号\t中英文在线翻译\t神秘的性奖励班会小说\t成都新都龙桥地图\t薛之谦张婉清\t网店的开张以及宣传 小说\t重钢亏六十亿\t英雄联盟之观战系统\t人民公敌\t黑暗文小说\t主角修魔杀戮小说\t成都旅游经典\t史上第一大魔神\t价格超过十万的手机\twww6090青苹果影院\t周星驰多少岁出名\t薛之谦\t2016八月中国经济比去年增长多少\t中英文翻译 小说\t阿斯顿马丁\t亵渎\t主角杀戮无女主小说\tgood luck什么意思\tcm朋克ufc首秀\t新都龙桥自行车\t演员薛之谦张婉清\t重庆重钢集团资产多少\t男主角性格阴暗的小说\t隐藏分查询\t最强特种兵之王\t中国迪士尼乐园在哪个城市\t主角是反派无女主小说\t野'

#### 文件格式化

train_data 是以'，'分隔的标准csv文件

test_data 是以空格分开的csv文件，同时QueryList字段内的搜索字段也是以空格分开，故只能按空格分割一次

##### 预测集格式化

In [7]:
%%time
import chardet
import csv
def load_file(file_path):
    with open(file_path,encoding='GB2312',errors='ignore') as f:
        while True:
                line = f.readline()
                yield line
                if not line:
                    print('文件加载结束')
                    break
def format_and_save_file(file_text, save_file_path,columns=None):       
    csv_writer = csv.writer(open(save_file_path, 'w',encoding='utf-8',newline=''))
    if columns:
        csv_writer.writerow(columns)
    while True:
        try:
            csv_writer.writerow(next(file_text).replace('\n','').split('\t',maxsplit=1))
        except Exception as e:
            print(e)
            print('文件格式化结束')
            break
test_file_text = load_file('../../dataset/personas/user_tag_query.10W.TEST.csv')
format_and_save_file(test_file_text, '../../dataset/personas/utf_8_user_tag_query.10W.TEST.csv',['ID','QueryList'])

文件加载结束

文件格式化结束
Wall time: 8.56 s


##### 训练集格式化

In [8]:
train_file_data  = pd.read_csv('../../dataset/personas/user_tag_query.10W.TRAIN.csv', encoding='gb18030')
train_file_data.to_csv('../../dataset/personas/utf_8_user_tag_query.10W.TRAIN.csv', encoding='utf-8', index=False)

#### 文件加载

In [3]:
train_data  = pd.read_csv('../../dataset/personas/utf_8_user_tag_query.10W.TRAIN.csv')

In [4]:
test_data  = pd.read_csv('../../dataset/personas/utf_8_user_tag_query.10W.TEST.csv')

### 数据集成

#### 检查字段一致性

In [11]:
train_data.columns

Index(['ID', 'age', 'Gender', 'Education', 'QueryList'], dtype='object')

In [12]:
train_data.head()

Unnamed: 0,ID,age,Gender,Education,QueryList
0,22DD920316420BE2DF8D6EE651BA174B,1,1,4,柔和双沟\t女生\t中财网首页 财经\thttp://pan.baidu.com/s/1pl...
1,43CC3AF5A8D6430A3B572337A889AFE4,2,1,3,"广州厨宝烤箱\t世情薄,人情恶,雨送黄昏花易落,晓风干,泪痕\t厦门酒店用品批发市场\t我只..."
2,E97654BFF5570E2CCD433EA6128EAC19,4,1,0,钻石之泪耳机\t盘锦到沈阳\t旅顺公交\t辽宁阜新车牌\tbaidu\tk715\tk716...
3,6931EFC26D229CCFCEA125D3F3C21E57,4,2,3,最受欢迎狗狗排行榜\t舶怎么读\t场景描 写范例\t三维绘图软件\t枣和酸奶能一起吃吗\t好...
4,E780470C3BB0D340334BD08CDCC3C71A,2,2,4,干槽症能自愈吗\t太太万岁叶舒心去没去美国\t干槽症\t右眼皮下面一直跳是怎么回事\t麦当劳...


In [13]:
test_data.columns

Index(['ID', 'QueryList'], dtype='object')

#### 数据集成

In [14]:
data = pd.concat([train_data.reindex(columns=['ID', 'QueryList','age', 'Gender', 'Education']),test_data],sort=False,ignore_index=True)

#### 检查数据集成结果

In [15]:
data.shape

(199832, 5)

In [16]:
data.head()

Unnamed: 0,ID,QueryList,age,Gender,Education
0,22DD920316420BE2DF8D6EE651BA174B,柔和双沟\t女生\t中财网首页 财经\thttp://pan.baidu.com/s/1pl...,1.0,1.0,4.0
1,43CC3AF5A8D6430A3B572337A889AFE4,"广州厨宝烤箱\t世情薄,人情恶,雨送黄昏花易落,晓风干,泪痕\t厦门酒店用品批发市场\t我只...",2.0,1.0,3.0
2,E97654BFF5570E2CCD433EA6128EAC19,钻石之泪耳机\t盘锦到沈阳\t旅顺公交\t辽宁阜新车牌\tbaidu\tk715\tk716...,4.0,1.0,0.0
3,6931EFC26D229CCFCEA125D3F3C21E57,最受欢迎狗狗排行榜\t舶怎么读\t场景描 写范例\t三维绘图软件\t枣和酸奶能一起吃吗\t好...,4.0,2.0,3.0
4,E780470C3BB0D340334BD08CDCC3C71A,干槽症能自愈吗\t太太万岁叶舒心去没去美国\t干槽症\t右眼皮下面一直跳是怎么回事\t麦当劳...,2.0,2.0,4.0


In [17]:
data.tail()

Unnamed: 0,ID,QueryList,age,Gender,Education
199827,4AB983FE74DCB5B04FA0A8CE7779E2EC,东北一家人沈腾\t叶小白有关的小说\t姚启圣\t李小狼\t七煌老板孙博文\t珠海鸿景花园\t...,,,
199828,8FCE58D7DA890DF4F6365283E02F936D,沈阳天士力药房\t儿童支气管炎\t肺炎10天点滴还咳嗽\t熊岳虹吸谷\t过敏性咳嗽\t饭团子...,,,
199829,0821784C7EFD4FC3C96FE8EE52989551,乡村小神医\t脸上各种斑图片及名称\t猕猴桃是热性还是凉性\t梦见外公又死了\t离婚了你还爱...,,,
199830,BF98531D782D4C31CC26202081E71E4B,经营策略分析\t经营组织论第一节\t腾讯微博\t盛世光年婚礼视频优酷\ttopik等级划分\...,,,
199831,,,,,


#### 删除无用数据

In [18]:
data.drop(199831, inplace=True)
data.tail()

Unnamed: 0,ID,QueryList,age,Gender,Education
199826,B4D8E2DA560327C4D6F5D66CEB451209,安乃近\t大太平\t温秀 造句\t似乎造句\t若是\t无声之手\t洋葱\t不过\t清吉太平\...,,,
199827,4AB983FE74DCB5B04FA0A8CE7779E2EC,东北一家人沈腾\t叶小白有关的小说\t姚启圣\t李小狼\t七煌老板孙博文\t珠海鸿景花园\t...,,,
199828,8FCE58D7DA890DF4F6365283E02F936D,沈阳天士力药房\t儿童支气管炎\t肺炎10天点滴还咳嗽\t熊岳虹吸谷\t过敏性咳嗽\t饭团子...,,,
199829,0821784C7EFD4FC3C96FE8EE52989551,乡村小神医\t脸上各种斑图片及名称\t猕猴桃是热性还是凉性\t梦见外公又死了\t离婚了你还爱...,,,
199830,BF98531D782D4C31CC26202081E71E4B,经营策略分析\t经营组织论第一节\t腾讯微博\t盛世光年婚礼视频优酷\ttopik等级划分\...,,,


### 数据观察

#### 数据规模观察

In [19]:
data.shape

(199831, 5)

In [20]:
train_data.shape

(99831, 5)

In [21]:
test_data.shape

(100001, 2)

#### 数据字段观察

In [22]:
train_data.columns

Index(['ID', 'age', 'Gender', 'Education', 'QueryList'], dtype='object')

In [23]:
test_data.columns

Index(['ID', 'QueryList'], dtype='object')

In [24]:
data.columns

Index(['ID', 'QueryList', 'age', 'Gender', 'Education'], dtype='object')

#### 数据基本信息观察

In [25]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 199831 entries, 0 to 199830
Data columns (total 5 columns):
ID           199831 non-null object
QueryList    199831 non-null object
age          99831 non-null float64
Gender       99831 non-null float64
Education    99831 non-null float64
dtypes: float64(3), object(2)
memory usage: 9.1+ MB


In [26]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99831 entries, 0 to 99830
Data columns (total 5 columns):
ID           99831 non-null object
age          99831 non-null int64
Gender       99831 non-null int64
Education    99831 non-null int64
QueryList    99831 non-null object
dtypes: int64(3), object(2)
memory usage: 3.8+ MB


In [27]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100001 entries, 0 to 100000
Data columns (total 2 columns):
ID           100000 non-null object
QueryList    100000 non-null object
dtypes: object(2)
memory usage: 1.5+ MB


In [28]:
train_data.describe()

Unnamed: 0,age,Gender,Education
count,99831.0,99831.0,99831.0
mean,2.082089,1.387184,3.903547
std,1.184173,0.529536,1.521979
min,0.0,0.0,0.0
25%,1.0,1.0,3.0
50%,2.0,1.0,4.0
75%,3.0,2.0,5.0
max,6.0,2.0,6.0


#### 数据字段属性值观察

In [29]:
data.head()

Unnamed: 0,ID,QueryList,age,Gender,Education
0,22DD920316420BE2DF8D6EE651BA174B,柔和双沟\t女生\t中财网首页 财经\thttp://pan.baidu.com/s/1pl...,1.0,1.0,4.0
1,43CC3AF5A8D6430A3B572337A889AFE4,"广州厨宝烤箱\t世情薄,人情恶,雨送黄昏花易落,晓风干,泪痕\t厦门酒店用品批发市场\t我只...",2.0,1.0,3.0
2,E97654BFF5570E2CCD433EA6128EAC19,钻石之泪耳机\t盘锦到沈阳\t旅顺公交\t辽宁阜新车牌\tbaidu\tk715\tk716...,4.0,1.0,0.0
3,6931EFC26D229CCFCEA125D3F3C21E57,最受欢迎狗狗排行榜\t舶怎么读\t场景描 写范例\t三维绘图软件\t枣和酸奶能一起吃吗\t好...,4.0,2.0,3.0
4,E780470C3BB0D340334BD08CDCC3C71A,干槽症能自愈吗\t太太万岁叶舒心去没去美国\t干槽症\t右眼皮下面一直跳是怎么回事\t麦当劳...,2.0,2.0,4.0


In [30]:
data.tail()

Unnamed: 0,ID,QueryList,age,Gender,Education
199826,B4D8E2DA560327C4D6F5D66CEB451209,安乃近\t大太平\t温秀 造句\t似乎造句\t若是\t无声之手\t洋葱\t不过\t清吉太平\...,,,
199827,4AB983FE74DCB5B04FA0A8CE7779E2EC,东北一家人沈腾\t叶小白有关的小说\t姚启圣\t李小狼\t七煌老板孙博文\t珠海鸿景花园\t...,,,
199828,8FCE58D7DA890DF4F6365283E02F936D,沈阳天士力药房\t儿童支气管炎\t肺炎10天点滴还咳嗽\t熊岳虹吸谷\t过敏性咳嗽\t饭团子...,,,
199829,0821784C7EFD4FC3C96FE8EE52989551,乡村小神医\t脸上各种斑图片及名称\t猕猴桃是热性还是凉性\t梦见外公又死了\t离婚了你还爱...,,,
199830,BF98531D782D4C31CC26202081E71E4B,经营策略分析\t经营组织论第一节\t腾讯微博\t盛世光年婚礼视频优酷\ttopik等级划分\...,,,


In [31]:
train_data.head()

Unnamed: 0,ID,age,Gender,Education,QueryList
0,22DD920316420BE2DF8D6EE651BA174B,1,1,4,柔和双沟\t女生\t中财网首页 财经\thttp://pan.baidu.com/s/1pl...
1,43CC3AF5A8D6430A3B572337A889AFE4,2,1,3,"广州厨宝烤箱\t世情薄,人情恶,雨送黄昏花易落,晓风干,泪痕\t厦门酒店用品批发市场\t我只..."
2,E97654BFF5570E2CCD433EA6128EAC19,4,1,0,钻石之泪耳机\t盘锦到沈阳\t旅顺公交\t辽宁阜新车牌\tbaidu\tk715\tk716...
3,6931EFC26D229CCFCEA125D3F3C21E57,4,2,3,最受欢迎狗狗排行榜\t舶怎么读\t场景描 写范例\t三维绘图软件\t枣和酸奶能一起吃吗\t好...
4,E780470C3BB0D340334BD08CDCC3C71A,2,2,4,干槽症能自愈吗\t太太万岁叶舒心去没去美国\t干槽症\t右眼皮下面一直跳是怎么回事\t麦当劳...


In [32]:
test_data.head()

Unnamed: 0,ID,QueryList
0,ED89D43B9F602F96D96C25255F7C228C,陈学冬将出的作品\t刘昊然与谭松韵\t211学校的分数线\t谁唱的味道好听\t吻戏是真吻还是...
1,83C3B7B4AAF8074655A8079F561A76D6,e的0.0052次方\tqq怎么快速提现\t绝色倾城飞烟\t马克思主义基本原理概论\t康世恩...
2,CA9F675A024FB2353849350A35CF8B0F,黑暗文\tlpl夏季赛\t大富豪电玩城\t英雄联盟之电竞称王\t手机怎么扫描手机上的二维码\...
3,DE45B5C4E57AAEBCF3FDFA2A774093BF,中秋水库钓鱼\t鱼竿\t用蚯蚓钓鱼怎样调漂\t传统钓\t3号鱼钩\t鲫鱼汤的做法大全\t鱼饵...
4,406A681FB3DF81EC0E561796AE50AE50,号码吉凶\t退休干部死后配偶\t郫县有哪些大学\t胜利油田属于中石化还是中石油\t苏珊米勒狮...


### 语料库准备

#### 停用词加载

In [33]:
stopwords_list = []
with open('../../common/stopwords1893.txt', 'r', encoding='utf-8') as f:
    while True:
        line = f.readline()
        if not line:
            break
        stopwords_list.append(line.replace('\n', ''))

In [34]:
stopwords_list[:5]

['!', '"', '#', '$', '%']

#### 分词、停用词去除、词性过滤

In [35]:
%%time
# 根据词性过滤
import jieba
import jieba.posseg

# 添加自定义词语
jieba.add_word('王者荣耀',tag='n')
jieba.add_word('百度云',tag='n')
jieba.add_word('徽信',tag='n')
jieba.add_word('电子发票',tag='n')
jieba.add_word( '表情包',tag='n')
jieba.add_word( '萌妻',tag='n')
jieba.add_word('水煮',tag='v')

posseg_corpus = []
i = 0
rows = float(data.QueryList.shape[0]) /100
def tokenize_posseg(sentences):
    global i
    cleaned_sentences = sentences.replace('\t', ' ')
    token = jieba.posseg.cut(sentences)
    flag_list = ['n', 'nz', 'nr','ns', 'nt','ng', 'v','vn', 'j', 'i']
#     flag_list = ['n', 'v','j']
    word_list = [word.word for word in list(token) 
                 if word.word not in stopwords_list 
                 and word.flag in flag_list
                ]
    if not word_list:
        posseg_corpus.append(' '.join([word.word for word in list(token) 
                 if word.word not in stopwords_list 
                ]))
    else:
        posseg_corpus.append(' '.join(word_list))
    i +=1
    print('\r 进度：%0.4f%%' % (i/rows), end="")
    return ' '.join(word_list)
data['PossegCleanedQueryList'] = data.QueryList.apply(tokenize_posseg)

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\xiaoyu\AppData\Local\Temp\jieba.cache
Loading model cost 0.680 seconds.
Prefix dict has been built succesfully.


 进度：100.0000%Wall time: 4h 13min 13s


In [36]:
data.PossegCleanedQueryList.head()

0    双沟 女生 中财网 财经 周公 解梦 查询 曹云金 郭德纲 总裁 大人 行行好 中财网 财经...
1    广州 厨宝 烤箱 世情 人情 雨 送 花易落 晓 风干 泪痕 厦门 酒店用品 批发市场 不想...
2    钻石 泪 耳机 盘锦 沈阳 旅顺 公交 辽宁 阜新 车牌 盘锦 台安 网游 网游 辽 北镇 ...
3    受欢迎 狗狗 排行榜 舶 读 场景 描 写 范例 绘图 软件 枣 酸奶 吃 租 衣服 网站 ...
4    干槽症 太太 叶 没去 美国 干槽症 眼皮 跳 麦 旋风 勺子 吉林市 鹿王 制药 股份 有...
Name: PossegCleanedQueryList, dtype: object

#### 语料库数据持久化

In [37]:
%%time
# 经过词性过滤
posseg_save_file_path = '../../dataset/personas/posseg_word_corpus.text'
with open(posseg_save_file_path, 'w', encoding='utf-8') as f:
    for posseg_corpu in posseg_corpus:
        if isinstance(posseg_corpu,list):
            posseg_corpu = ' '.join(posseg_corpu)
        f.write(posseg_corpu)
        f.write('\n')

Wall time: 1.77 s


#### 结构化数据查看

In [38]:
# pandas读取
posseg_cleaned_query_list = pd.read_csv('../../dataset/personas/posseg_word_corpus.text',header=None, names=['posseg_cleaned_query_list'])

In [39]:
posseg_cleaned_query_list.head()

Unnamed: 0,posseg_cleaned_query_list
0,双沟 女生 中财网 财经 周公 解梦 查询 曹云金 郭德纲 总裁 大人 行行好 中财网 财经...
1,广州 厨宝 烤箱 世情 人情 雨 送 花易落 晓 风干 泪痕 厦门 酒店用品 批发市场 不想...
2,钻石 泪 耳机 盘锦 沈阳 旅顺 公交 辽宁 阜新 车牌 盘锦 台安 网游 网游 辽 北镇 ...
3,受欢迎 狗狗 排行榜 舶 读 场景 描 写 范例 绘图 软件 枣 酸奶 吃 租 衣服 网站 ...
4,干槽症 太太 叶 没去 美国 干槽症 眼皮 跳 麦 旋风 勺子 吉林市 鹿王 制药 股份 有...


In [40]:
posseg_corpus[0][:100]

'双沟 女生 中财网 财经 周公 解梦 查询 曹云金 郭德纲 总裁 大人 行行好 中财网 财经 传媒 教师节 全文 男子 砸毁 墓碑 黄岩岛 填 海图 缘 落跑 甜心 梁朝伟 替身 框 笑傲江湖 电视剧'

In [41]:
posseg_cleaned_query_list.tail()

Unnamed: 0,posseg_cleaned_query_list
199822,安乃近 太平 温秀 造句 造句 手 洋葱 清吉 太平 笙箫 电视剧 倾心 心 倾城 吃 西药...
199823,东北 一家人 沈腾 叶小白 小说 姚启圣 李小狼 七煌 老板 博文 珠海 鸿景 花园 云菲菲...
199824,沈阳 天士力 药房 儿童 支气管炎 肺炎 天 点滴 咳嗽 熊岳 谷 过敏性 咳嗽 饭团子 饭...
199825,乡村 神医 斑 图片 名称 猕猴桃 热性 凉性 梦见 外公 死 离婚 爱我吗 女孩子 发型 ...
199826,经营策略 分析 经营 组织 腾讯 博 盛世 光年 婚礼 视频 划分 临摹 照片 济南 恒隆 ...


In [42]:
data['PossegCleanedQueryList'].shape

(199831,)

In [43]:
data.head()

Unnamed: 0,ID,QueryList,age,Gender,Education,PossegCleanedQueryList
0,22DD920316420BE2DF8D6EE651BA174B,柔和双沟\t女生\t中财网首页 财经\thttp://pan.baidu.com/s/1pl...,1.0,1.0,4.0,双沟 女生 中财网 财经 周公 解梦 查询 曹云金 郭德纲 总裁 大人 行行好 中财网 财经...
1,43CC3AF5A8D6430A3B572337A889AFE4,"广州厨宝烤箱\t世情薄,人情恶,雨送黄昏花易落,晓风干,泪痕\t厦门酒店用品批发市场\t我只...",2.0,1.0,3.0,广州 厨宝 烤箱 世情 人情 雨 送 花易落 晓 风干 泪痕 厦门 酒店用品 批发市场 不想...
2,E97654BFF5570E2CCD433EA6128EAC19,钻石之泪耳机\t盘锦到沈阳\t旅顺公交\t辽宁阜新车牌\tbaidu\tk715\tk716...,4.0,1.0,0.0,钻石 泪 耳机 盘锦 沈阳 旅顺 公交 辽宁 阜新 车牌 盘锦 台安 网游 网游 辽 北镇 ...
3,6931EFC26D229CCFCEA125D3F3C21E57,最受欢迎狗狗排行榜\t舶怎么读\t场景描 写范例\t三维绘图软件\t枣和酸奶能一起吃吗\t好...,4.0,2.0,3.0,受欢迎 狗狗 排行榜 舶 读 场景 描 写 范例 绘图 软件 枣 酸奶 吃 租 衣服 网站 ...
4,E780470C3BB0D340334BD08CDCC3C71A,干槽症能自愈吗\t太太万岁叶舒心去没去美国\t干槽症\t右眼皮下面一直跳是怎么回事\t麦当劳...,2.0,2.0,4.0,干槽症 太太 叶 没去 美国 干槽症 眼皮 跳 麦 旋风 勺子 吉林市 鹿王 制药 股份 有...


In [44]:
data.PossegCleanedQueryList.head()

0    双沟 女生 中财网 财经 周公 解梦 查询 曹云金 郭德纲 总裁 大人 行行好 中财网 财经...
1    广州 厨宝 烤箱 世情 人情 雨 送 花易落 晓 风干 泪痕 厦门 酒店用品 批发市场 不想...
2    钻石 泪 耳机 盘锦 沈阳 旅顺 公交 辽宁 阜新 车牌 盘锦 台安 网游 网游 辽 北镇 ...
3    受欢迎 狗狗 排行榜 舶 读 场景 描 写 范例 绘图 软件 枣 酸奶 吃 租 衣服 网站 ...
4    干槽症 太太 叶 没去 美国 干槽症 眼皮 跳 麦 旋风 勺子 吉林市 鹿王 制药 股份 有...
Name: PossegCleanedQueryList, dtype: object

In [45]:
data.PossegCleanedQueryList.shape

(199831,)

In [46]:
posseg_cleaned_query_list.shape

(199827, 1)

### word2vec词向量化

#### 模型训练

In [48]:
%%time
# 分词，形成词语列表
import jieba 
corpus = []
lenth = len(posseg_corpus)/100
i = 0
for corpu in posseg_corpus:
    if not isinstance(corpu, list):
        corpus.append(list(jieba.cut(corpu)))
    else:
        corpus.append(corpu)
    i += 1
    print('\r 进度：%0.4f%%' % (i/lenth), end="")
print('\n')


 进度：100.0000%

Wall time: 13min 55s


In [49]:
%%time
# word2vec
from gensim.models import Word2Vec
word2vec_posseg_model = Word2Vec(corpus, size=300, min_count=1, window=10,workers=4)

Wall time: 12min 5s


In [50]:
word2vec_posseg_model.corpus_count

199831

In [51]:
len(word2vec_posseg_model.wv.vocab)

752385

In [52]:
word2vec_posseg_model.wv.most_similar('北京')

[('北京市', 0.7586333751678467),
 ('顺义', 0.7031264901161194),
 ('昌平', 0.6952042579650879),
 ('密云', 0.6840555667877197),
 ('丰台', 0.6791456937789917),
 ('石景山', 0.6734213829040527),
 ('燕郊', 0.6636086106300354),
 ('望京', 0.6555204391479492),
 ('怀柔', 0.6548206210136414),
 ('房山', 0.6521821022033691)]

In [53]:
word2vec_posseg_model.wv.most_similar('房子')

[('楼房', 0.7757467031478882),
 ('新房', 0.7520249485969543),
 ('房屋', 0.711332380771637),
 ('毛坯房', 0.7046459913253784),
 ('老房子', 0.701375424861908),
 ('买房子', 0.6962460279464722),
 ('新房子', 0.6938353776931763),
 ('房能', 0.6696431636810303),
 ('三室一厅', 0.6683304905891418),
 ('商品房', 0.6485531330108643)]

In [54]:
word2vec_posseg_model.wv.similarity('兄弟','姐妹')

0.24910423

In [55]:
word2vec_posseg_model.wv.similarity('房子','车子')

0.29615608

In [56]:
word2vec_posseg_model.wv.similarity('老婆','孩子')

0.22770692

#### 模型保存

In [57]:
word2vec_posseg_model.save('../../models/personas_word2vec_posseg.model')

#### 模型加载

In [58]:
%%time
from gensim.models import Word2Vec
word2vec_posseg_model = Word2Vec.load('../../models/personas_word2vec_posseg.model')

Wall time: 20.5 s


In [59]:
word2vec_posseg_model.wv.most_similar('北京')

[('北京市', 0.7586333751678467),
 ('顺义', 0.7031264901161194),
 ('昌平', 0.6952042579650879),
 ('密云', 0.6840555667877197),
 ('丰台', 0.6791456937789917),
 ('石景山', 0.6734213829040527),
 ('燕郊', 0.6636086106300354),
 ('望京', 0.6555204391479492),
 ('怀柔', 0.6548206210136414),
 ('房山', 0.6521821022033691)]

### 搜索结果平均向量化

#### 求平均向量
其目的在于构造每个用户搜索特征向量

In [60]:
%%time
posseg_file_name = '../../dataset/personas/posseg_word_corpus.text'
with open(posseg_file_name, 'r',encoding='utf-8') as f:
    index = 0
    lines = f.readlines()
    posseg_doc_vec = np.zeros((len(lines),300))
    for line in lines:
        word_vec = np.zeros((1,300))
        wrod_num = 0
        for word in line:
            if word in word2vec_posseg_model:
                wrod_num += 1
                word_vec += np.array([word2vec_posseg_model[word]])
        if sum(np.isnan(word_vec)).sum():
            print(line)
            print(index)
        print('\r [index]:%d' % index, end='')
        posseg_doc_vec[index] = word_vec / float(wrod_num)
        index += 1

 [index]:199830Wall time: 49min 29s


#### 搜索特征向量持久化

In [61]:
posseg_doc_vec.tofile('../../models/personas_posseg_sent_vec.text')

### 缺失值填充

#### 检查是否存在nan的向量

In [62]:
for index, i in enumerate(posseg_doc_vec):
    if sum(np.isnan(i)).sum():
        print(index)

32557
60085
87498
119773


In [63]:
%%time
empty_index = 0
def empty_list(obj):
    global empty_index
    index = empty_index
    if not obj:
        print(index)
    empty_index += 1
    return ' '.join(obj)
data_empty = data.PossegCleanedQueryList.apply(empty_list)

32557
60085
87498
119773
Wall time: 18.1 s


In [64]:
data[data.PossegCleanedQueryList=='']

Unnamed: 0,ID,QueryList,age,Gender,Education,PossegCleanedQueryList
32557,7077737F3F2D58E1EB0D3D2E8824B41D,pan.baidu.com/s/1esytqsi\tpan.baidu.com/s/1mhb...,2.0,1.0,3.0,
60085,5FAF74A9D9E2B55C6724BDEAA6D7574B,denied\tchimpanzee\tsanitation\tpredisposed\ti...,2.0,2.0,4.0,
87498,F723D21F70AA59932E0A77C12A8F4B95,substandard\twages\toutlay\tgovernance\tkeyboa...,3.0,1.0,4.0,
119773,CC302AED793BE054C6158160729F8687,550476145267\t550476291385\t666759024744\t5504...,,,,


In [65]:
data[data.PossegCleanedQueryList==''].index.tolist()

[32557, 60085, 87498, 119773]

In [66]:
data_index = data.index.tolist()

In [67]:
for i in [32557, 60085, 87498, 119773]:
    data.drop(i, inplace=True)

In [68]:
data[data.PossegCleanedQueryList=='']

Unnamed: 0,ID,QueryList,age,Gender,Education,PossegCleanedQueryList


In [69]:
data.shape

(199827, 6)

#### 删除nan值的数据

In [70]:
nan_index_list = [32557,60085,87498,119773]
posseg_doc_vec[nan_index_list]

array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan]])

In [71]:
cleaned_posseg_doc_vec = np.delete(posseg_doc_vec,[32557,60085,87498,119773],axis=0)

In [72]:
cleaned_posseg_doc_vec.shape

(199827, 300)

#### 过滤为0的缺失值数据

In [73]:
data[0:99828].tail()

Unnamed: 0,ID,QueryList,age,Gender,Education,PossegCleanedQueryList
99826,E797FFCDCAF3899AB4D17B61170D8BFF,梦三生\t逆行天后漫画结局\t英雄联盟角色介绍大全\t韩剧网最新韩国电视剧\t119宣传语是...,1.0,1.0,5.0,梦 逆行 漫画 结局 英雄 联盟 角色 介绍 韩剧 网 韩国 电视剧 宣传语 斗破 苍穹 肆...
99827,E06375F7D092ABDE78C2D79E4725D6B0,中国军队配枪\t女生\t央视版权问题\t重庆\t一次性手术刀\t精神枷锁\t人民检察官\t舆...,1.0,1.0,5.0,中国 军队 配枪 女生 央视 版权 重庆 手术刀 精神枷锁 检察官 舆论 精神 禁锢 百度 ...
99828,D55119CB0B9366B20974522B58C00912,英文翻译\t幼儿园面试讲课\t学前教育书第二版\tshock to\t幼儿园教师资格证面试讲...,2.0,2.0,5.0,英文翻译 幼儿园 面试 讲课 书 版 幼儿园 教师 资格证 面试 讲课 设计 签名 免费 艺...
99829,EB4DBBD602C6459A19A77F09035E170C,哈尔滨祖研中医院地址\t指甲盖侧面的肉怎么是白色\t补骨质有副作用吗\t哈尔滨去呼兰\t黑龙...,3.0,2.0,3.0,哈尔滨 祖研 医院地址 指甲盖 肉 白色 补 骨质 副作用 哈尔滨 呼兰 黑龙江 祖研 中医...
99830,61CF81DB79423CB89E5DAA752BC4D9DD,陈翔毛晓彤分手\t青春偶像剧校园电影\t九尾狐家族\t王俊凯小说\t女娲的成长日记女娲变身\...,1.0,2.0,5.0,陈翔 分手 青春偶像 剧 校园 电影 九尾狐 家族 王俊凯 小说 女娲 成长 日记 女娲 变...


#### 训练集准备

In [74]:
%%time
import numpy as np
# 年龄
age_index = np.nonzero(data[0:99828].age)
# age_X = doc_vec[age_index]
posseg_age_X = cleaned_posseg_doc_vec[age_index]
age_y = data.age.iloc[age_index]

# 性别
gender_index = np.nonzero(data[0:99828].Gender)
# gender_X = doc_vec[gender_index]
posseg_gender_X = cleaned_posseg_doc_vec[gender_index]
gender_y = data.Gender.iloc[gender_index]

# 学历
education_index = np.nonzero(data[0:99828].Education)
# education_X = doc_vec[education_index]
posseg_education_X = cleaned_posseg_doc_vec[education_index]
education_y = data.Education.iloc[education_index]

Wall time: 1.48 s


In [75]:
print(posseg_age_X.shape,age_y.shape)

(98162, 300) (98162,)


In [76]:
print(posseg_gender_X.shape,gender_y.shape)

(97675, 300) (97675,)


In [77]:
print(posseg_education_X.shape,education_y.shape)

(90569, 300) (90569,)


#### 逻辑回归预测缺失值

In [78]:
%%time
# 数据集重新采样
from sklearn.model_selection import train_test_split
posseg_age_train_X, posseg_age_test_X, posseg_age_train_y, posseg_age_test_y = train_test_split(posseg_age_X, age_y, test_size=0.3,random_state=0)
posseg_gender_train_X, posseg_gender_test_X, posseg_gender_train_y, posseg_gender_test_y = train_test_split(posseg_gender_X, gender_y, test_size=0.3,random_state=0)
posseg_education_train_X, posseg_education_test_X, posseg_education_train_y, posseg_education_test_y = train_test_split(posseg_education_X, education_y, test_size=0.3,random_state=0)

Wall time: 2.49 s


In [79]:
%%time
from sklearn.linear_model import LogisticRegression
def lr_model(train_X, test_X, train_y, test_y):
    model = LogisticRegression()
    model.fit(train_X, train_y)
    print(model.score(test_X, test_y))
    return model

# 年龄训练
age_model = lr_model(posseg_age_train_X, posseg_age_test_X, posseg_age_train_y, posseg_age_test_y)

# 性别训练
gender_model = lr_model(posseg_gender_train_X, posseg_gender_test_X, posseg_gender_train_y, posseg_gender_test_y)

# 学历训练
education_model = lr_model(posseg_education_train_X, posseg_education_test_X, posseg_education_train_y, posseg_education_test_y)

0.5099663825596794
0.7779749513701669
0.5424533509992271
Wall time: 1min 30s


In [80]:
%%time
from sklearn.multiclass import OneVsRestClassifier
ada_model = OneVsRestClassifier(LogisticRegression(random_state=42))
ada_model.fit(posseg_age_train_X, posseg_age_train_y)
print(ada_model.score(posseg_age_test_X, posseg_age_test_y))

0.5099663825596794
Wall time: 41 s


In [81]:
%%time
from sklearn.multiclass import OneVsOneClassifier
ada_model = OneVsOneClassifier(LogisticRegression(random_state=42))
ada_model.fit(posseg_age_train_X, posseg_age_train_y)
print(ada_model.score(posseg_age_test_X, posseg_age_test_y))

0.5161465584569934
Wall time: 33.5 s


In [82]:
# 提取缺失值
# 年龄
age_zero_index = np.array((data[0:99828][data[0:99828].age==0]).index)
age_predict_X = cleaned_posseg_doc_vec[age_zero_index]

# 性别
gender_zero_index = np.array((data[0:99828][data[0:99828].Gender==0]).index)
gender_predict_X = cleaned_posseg_doc_vec[gender_zero_index]

# 学历
enducation_zero_index = np.array((data[0:99828][data[0:99828].Education==0]).index)
enducation_predict_X = cleaned_posseg_doc_vec[enducation_zero_index]


#### 缺失值预测

In [83]:
# 年龄预测
age_predict_y = age_model.predict(age_predict_X)

# 性别预测
gender_predict_y = gender_model.predict(gender_predict_X)

# 学历预测
enducation_predict_y = education_model.predict(enducation_predict_X)

#### 预测值填充

In [84]:
# 年龄填充
data.loc[data.age==0,'age'] = list(age_predict_y)

# 性别填充
data.loc[data.Gender==0,'Gender'] = list(gender_predict_y)

# 学历填充
data.loc[data.Education==0,'Education'] = list(enducation_predict_y)

In [85]:
# 检查填充是否成功
print(sum(data.age==0))
print(sum(data.Gender==0))
print(sum(data.Education==0))

0
0
0


### 模型训练

In [86]:
%%time
train = cleaned_posseg_doc_vec[0:99828]
test = cleaned_posseg_doc_vec[99828:]
age_label = data[0:99828].age
gender_label = data[0:99828].Gender
education_label = data[0:99828].Education

Wall time: 1.99 ms


In [87]:
%%time
# 数据集重新采样
from sklearn.model_selection import train_test_split
lr_age_train_X, lr_age_test_X, lr_age_train_y, lr_age_test_y = train_test_split(train, age_label, test_size=0.3,random_state=0)
lr_gender_train_X, lr_gender_test_X, lr_gender_train_y, lr_gender_test_y = train_test_split(train, gender_label, test_size=0.3,random_state=0)
lr_education_train_X, lr_education_test_X, lr_education_train_y, lr_education_test_y = train_test_split(train, education_label, test_size=0.3,random_state=0)

Wall time: 455 ms


In [88]:
lr_age_test_X.shape

(29949, 300)

In [89]:
%%time
# 年龄训练
lr_age_model = lr_model(lr_age_train_X, lr_age_test_X, lr_age_train_y, lr_age_test_y)

# 性别训练
lr_gender_model = lr_model(lr_gender_train_X, lr_gender_test_X, lr_gender_train_y, lr_gender_test_y)

# 学历训练
lr_education_model = lr_model(lr_education_train_X, lr_education_test_X, lr_education_train_y, lr_education_test_y)

0.514942068182577
0.7780560285819226
0.5480316538114796
Wall time: 1min 36s


In [90]:
%%time
from sklearn.multiclass import OneVsOneClassifier
ada_model = OneVsOneClassifier(LogisticRegression(random_state=42))
ada_model.fit(lr_age_train_X, lr_age_train_y)
print(ada_model.score(lr_age_test_X, lr_age_test_y))

0.5199171925606865
Wall time: 32.1 s


In [91]:
%%time
from sklearn.multiclass import OneVsOneClassifier
ada_model = OneVsOneClassifier(LogisticRegression(random_state=42))
ada_model.fit(lr_education_train_X, lr_education_train_y)
print(ada_model.score(lr_education_test_X, lr_education_test_y))

0.5505693011452804
Wall time: 34.2 s


In [92]:
train[:1].sum()

-35.66463068974919

### 模型预测

In [93]:
lr_age_predict_y = lr_age_model.predict(lr_age_test_X)
lr_gender_predict_y = lr_gender_model.predict(lr_gender_test_X)
lr_education_predict_y = lr_education_model.predict(lr_education_test_X)
lr_age_prediction = lr_age_model.predict(test)
lr_gender_prediction = lr_gender_model.predict(test)
lr_education_prediction = lr_education_model.predict(test)

### 模型评估

#### 分类报告评估

In [94]:
from sklearn.metrics import classification_report
print(classification_report(lr_age_test_y, lr_age_predict_y))

              precision    recall  f1-score   support

         1.0       0.57      0.82      0.68     12040
         2.0       0.46      0.38      0.41      8177
         3.0       0.41      0.40      0.41      5572
         4.0       0.36      0.08      0.14      3098
         5.0       0.14      0.00      0.00       920
         6.0       0.00      0.00      0.00       142

   micro avg       0.51      0.51      0.51     29949
   macro avg       0.32      0.28      0.27     29949
weighted avg       0.47      0.51      0.47     29949



In [95]:
print(classification_report(lr_gender_test_y, lr_gender_predict_y))

              precision    recall  f1-score   support

         1.0       0.79      0.84      0.82     17493
         2.0       0.75      0.70      0.72     12456

   micro avg       0.78      0.78      0.78     29949
   macro avg       0.77      0.77      0.77     29949
weighted avg       0.78      0.78      0.78     29949



In [96]:
print(classification_report(lr_education_test_y, lr_education_predict_y))

              precision    recall  f1-score   support

         1.0       0.00      0.00      0.00       103
         2.0       0.00      0.00      0.00       149
         3.0       0.51      0.41      0.45      6082
         4.0       0.48      0.38      0.42      9132
         5.0       0.59      0.82      0.69     12789
         6.0       0.18      0.01      0.01      1694

   micro avg       0.55      0.55      0.55     29949
   macro avg       0.29      0.27      0.26     29949
weighted avg       0.51      0.55      0.51     29949



#### 交叉验证评估

In [97]:
%%time
from sklearn.model_selection import cross_val_score
def cross_score(estimator, X, y, num_validations=5):
    accuracy=cross_val_score(estimator, X, y,
                             scoring='accuracy',cv=num_validations)
    print('准确率：{:.2f}%'.format(accuracy.mean()*100))
    precision=cross_val_score(estimator, X, y,
                             scoring='precision_weighted',cv=num_validations)
    print('精确度：{:.2f}%'.format(precision.mean()*100))
    recall=cross_val_score(estimator, X, y,
                             scoring='recall_weighted',cv=num_validations)
    print('召回率：{:.2f}%'.format(recall.mean()*100))
    f1=cross_val_score(estimator, X, y,
                             scoring='f1_weighted',cv=num_validations)
    print('F1  值：{:.2f}%'.format(f1.mean()*100))
    return accuracy, precision, recall, f1
print('age_cross_score:')
age_cross_score = cross_score(lr_age_model, lr_age_test_X, lr_age_test_y )
print('gender_cross_score:')
gender_cross_score = cross_score(lr_gender_model, lr_gender_test_X, lr_gender_test_y )
print('education_cross_score:')
education_cross_score = cross_score(lr_education_model, lr_education_test_X, lr_education_test_y )

age_cross_score:
准确率：51.02%
精确度：47.39%
召回率：51.02%
F1  值：46.95%
gender_cross_score:
准确率：77.56%
精确度：77.43%
召回率：77.56%
F1  值：77.41%
education_cross_score:
准确率：53.89%
精确度：50.45%
召回率：53.89%
F1  值：50.36%
Wall time: 8min 56s


#### 评估结论

**分类问题**
- 逻辑回归适合典型的二分类问题，本项目中，除了性别为二分类，年龄、学历为按区分段分开，不适用于逻辑回归
- 年龄、学历按区间段分开，可以将其当成多分类问题
- 文本的多分类问题，可以采用结合基础的分类器结合OneVsOne、OneVsRest策略进行建模
- 文本多分类，也可以用多分类的朴素贝叶斯分类器，但因词向量存在小于0的情况，不符合贝叶斯定理
- 文本多分类，也可以使用支持向量机，但由于维度过大且文本向量具有稀疏性，计算困难，故不适合

### 模型筛选

### 模型改进

#### 特征改用tf_idf

##### 训练tf_idf向量

In [5]:
%%time
# 普通文件读取
corpus = []
i = 0
with open('../../dataset/personas/posseg_word_corpus.text', 'r', encoding='utf-8') as f:
    while True:
        line = f.readline()
        if not line:
            break
        if 'nan' in line:
            print(i)
            continue
        corpus.append(line.replace('\n', ''))
        i += 1 
        

Wall time: 2.49 s


In [6]:
tf_data = pd.concat([train_data.reindex(columns=['ID', 'QueryList','age', 'Gender', 'Education']),test_data],sort=False,ignore_index=True)

In [7]:
tf_data.drop(199831, inplace=True)
tf_data.tail()

Unnamed: 0,ID,QueryList,age,Gender,Education
199826,B4D8E2DA560327C4D6F5D66CEB451209,安乃近\t大太平\t温秀 造句\t似乎造句\t若是\t无声之手\t洋葱\t不过\t清吉太平\...,,,
199827,4AB983FE74DCB5B04FA0A8CE7779E2EC,东北一家人沈腾\t叶小白有关的小说\t姚启圣\t李小狼\t七煌老板孙博文\t珠海鸿景花园\t...,,,
199828,8FCE58D7DA890DF4F6365283E02F936D,沈阳天士力药房\t儿童支气管炎\t肺炎10天点滴还咳嗽\t熊岳虹吸谷\t过敏性咳嗽\t饭团子...,,,
199829,0821784C7EFD4FC3C96FE8EE52989551,乡村小神医\t脸上各种斑图片及名称\t猕猴桃是热性还是凉性\t梦见外公又死了\t离婚了你还爱...,,,
199830,BF98531D782D4C31CC26202081E71E4B,经营策略分析\t经营组织论第一节\t腾讯微博\t盛世光年婚礼视频优酷\ttopik等级划分\...,,,


In [9]:
%%time
# 训练tf_idf向量
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
tf_idf_vec = vectorizer.fit_transform(corpus)

Wall time: 50.7 s


##### 加载语料库

In [10]:
# 加载语料库
tf_corpus  = pd.DataFrame({'corpus':corpus})
tf_corpus.head()

Unnamed: 0,corpus
0,双沟 女生 中财网 财经 周公 解梦 查询 曹云金 郭德纲 总裁 大人 行行好 中财网 财经...
1,广州 厨宝 烤箱 世情 人情 雨 送 花易落 晓 风干 泪痕 厦门 酒店用品 批发市场 不想...
2,钻石 泪 耳机 盘锦 沈阳 旅顺 公交 辽宁 阜新 车牌 盘锦 台安 网游 网游 辽 北镇 ...
3,受欢迎 狗狗 排行榜 舶 读 场景 描 写 范例 绘图 软件 枣 酸奶 吃 租 衣服 网站 ...
4,干槽症 太太 叶 没去 美国 干槽症 眼皮 跳 麦 旋风 勺子 吉林市 鹿王 制药 股份 有...


##### 训练集准备

In [11]:
%%time
# 训练集准备
import numpy as np
# 年龄
tf_idf_age_index = np.nonzero(tf_data[0:99831].age)
tf_idf_age_X = vectorizer.transform(tf_corpus.iloc[tf_idf_age_index].corpus.tolist())
tf_idf_age_y = tf_data.age.iloc[tf_idf_age_index]

# 性别
tf_idf_gender_index = np.nonzero(tf_data[0:99831].Gender)
tf_idf_gender_X = vectorizer.transform(tf_corpus.iloc[tf_idf_gender_index].corpus.tolist())
tf_idf_gender_y = tf_data.Gender.iloc[tf_idf_gender_index]

# 学历
tf_idf_education_index = np.nonzero(tf_data[0:99831].Education)
tf_idf_education_X = vectorizer.transform(tf_corpus.iloc[tf_idf_education_index].corpus.tolist())
tf_idf_education_y = tf_data.Education.iloc[tf_idf_education_index]

Wall time: 1min 2s


In [12]:
print(tf_idf_age_X.shape,tf_idf_age_y.shape)

(98165, 802507) (98165,)


In [13]:
print(tf_idf_gender_X.shape,tf_idf_gender_y.shape)

(97678, 802507) (97678,)


In [14]:
print(tf_idf_education_X.shape,tf_idf_education_y.shape)

(90572, 802507) (90572,)


##### 逻辑回归 + OneVsOne预测缺失值

In [15]:
%%time
# 数据集重新采样
from sklearn.model_selection import train_test_split
tf_idf_age_train_X, tf_idf_age_test_X, tf_idf_age_train_y, tf_idf_age_test_y = train_test_split(
    tf_idf_age_X, tf_idf_age_y, test_size=0.3,random_state=0)

tf_idf_gender_train_X, tf_idf_gender_test_X, tf_idf_gender_train_y, tf_idf_gender_test_y = train_test_split(
    tf_idf_gender_X, tf_idf_gender_y, test_size=0.3,random_state=0)

tf_idf_education_train_X, tf_idf_education_test_X, tf_idf_education_train_y, tf_idf_education_test_y = train_test_split(
    tf_idf_education_X, tf_idf_education_y, test_size=0.3,random_state=0)

Wall time: 1.21 s


In [16]:
%%time
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier
def one_lr_model(train_X, test_X, train_y, test_y):
    model = OneVsOneClassifier(LogisticRegression(random_state=42))
    model.fit(train_X, train_y)
    print(model.score(test_X, test_y))
    return model

# 年龄训练
tf_idf_age_model = one_lr_model(tf_idf_age_train_X, tf_idf_age_test_X, 
                     tf_idf_age_train_y, tf_idf_age_test_y)

# 性别训练
tf_idf_gender_model = one_lr_model(tf_idf_gender_train_X, tf_idf_gender_test_X, 
                        tf_idf_gender_train_y, tf_idf_gender_test_y)

# 学历训练
tf_idf_education_model = one_lr_model(tf_idf_education_train_X, tf_idf_education_test_X, 
                           tf_idf_education_train_y, tf_idf_education_test_y)


0.5721561969439728
0.815997815997816
0.6038569115265715
Wall time: 49.6 s


##### 提取缺失值

In [17]:
%%time
# 年龄
tf_idf_age_zero_index = np.array((tf_data[0:99831][tf_data[0:99831].age==0]).index)
tf_idf_age_predict_X = vectorizer.transform(tf_corpus.iloc[tf_idf_age_zero_index].corpus.tolist())

# 性别
tf_idf_gender_zero_index = np.array((tf_data[0:99831][tf_data[0:99831].Gender==0]).index)
tf_idf_gender_predict_X = vectorizer.transform(tf_corpus.iloc[tf_idf_gender_zero_index].corpus.tolist())

# 学历
tf_idf_enducation_zero_index = np.array((tf_data[0:99831][tf_data[0:99831].Education==0]).index)
tf_idf_enducation_predict_X = vectorizer.transform(tf_corpus.iloc[tf_idf_enducation_zero_index].corpus.tolist())


Wall time: 3.04 s


In [18]:
print(tf_idf_age_predict_X.shape)
print(tf_idf_gender_predict_X.shape)
print(tf_idf_enducation_predict_X.shape)

(1666, 802507)
(2153, 802507)
(9259, 802507)


##### 缺失值预测

In [19]:
# 年龄预测
tf_idf_age_predict_y = tf_idf_age_model.predict(tf_idf_age_predict_X)

# 性别预测
tf_idf_gender_predict_y = tf_idf_gender_model.predict(tf_idf_gender_predict_X)

# 学历预测
tf_idf_enducation_predict_y = tf_idf_education_model.predict(tf_idf_enducation_predict_X)

##### 预测值填充

In [20]:
tf_idf_data = tf_data
tf_idf_data.head()

Unnamed: 0,ID,QueryList,age,Gender,Education
0,22DD920316420BE2DF8D6EE651BA174B,柔和双沟\t女生\t中财网首页 财经\thttp://pan.baidu.com/s/1pl...,1.0,1.0,4.0
1,43CC3AF5A8D6430A3B572337A889AFE4,"广州厨宝烤箱\t世情薄,人情恶,雨送黄昏花易落,晓风干,泪痕\t厦门酒店用品批发市场\t我只...",2.0,1.0,3.0
2,E97654BFF5570E2CCD433EA6128EAC19,钻石之泪耳机\t盘锦到沈阳\t旅顺公交\t辽宁阜新车牌\tbaidu\tk715\tk716...,4.0,1.0,0.0
3,6931EFC26D229CCFCEA125D3F3C21E57,最受欢迎狗狗排行榜\t舶怎么读\t场景描 写范例\t三维绘图软件\t枣和酸奶能一起吃吗\t好...,4.0,2.0,3.0
4,E780470C3BB0D340334BD08CDCC3C71A,干槽症能自愈吗\t太太万岁叶舒心去没去美国\t干槽症\t右眼皮下面一直跳是怎么回事\t麦当劳...,2.0,2.0,4.0


In [21]:
# 年龄填充
tf_idf_data.loc[tf_idf_data.age==0,'age'] = list(tf_idf_age_predict_y)

# 性别填充
tf_idf_data.loc[tf_idf_data.Gender==0,'Gender'] = list(tf_idf_gender_predict_y)

# 学历填充
tf_idf_data.loc[tf_idf_data.Education==0,'Education'] = list(tf_idf_enducation_predict_y)

##### 检查填充是否成功

In [22]:
# 检查填充是否成功
print(sum(tf_idf_data.age==0))
print(sum(tf_idf_data.Gender==0))
print(sum(tf_idf_data.Education==0))

0
0
0


##### 模型训练

In [25]:
%%time
# 训练集准备
import numpy as np
# 年龄
tfidf_age_X = vectorizer.transform(tf_corpus[0:99831].corpus.tolist())
tfidf_age_y = tf_idf_data.age[0:99831]

# 性别
tfidf_gender_X = vectorizer.transform(tf_corpus[0:99831].corpus.tolist())
tfidf_gender_y = tf_idf_data.Gender[0:99831]

# 学历
tfidf_education_X = vectorizer.transform(tf_corpus[0:99831].corpus.tolist())
tfidf_education_y = tf_idf_data.Education[0:99831]


Wall time: 1min 5s


In [26]:
print(tfidf_age_X.shape,tfidf_age_y.shape)

(99831, 802507) (99831,)


In [27]:
print(tfidf_gender_X.shape,tfidf_gender_y.shape)

(99831, 802507) (99831,)


In [28]:
print(tfidf_education_X.shape,tfidf_education_y.shape)

(99831, 802507) (99831,)


In [29]:
%%time
# 数据集重新采样
from sklearn.model_selection import train_test_split
tfidf_age_train_X, tfidf_age_test_X, tfidf_age_train_y, tfidf_age_test_y = train_test_split(
    tfidf_age_X, tfidf_age_y, test_size=0.3,random_state=0)

tfidf_gender_train_X, tfidf_gender_test_X, tfidf_gender_train_y, tfidf_gender_test_y = train_test_split(
    tfidf_gender_X, tfidf_gender_y, test_size=0.3,random_state=0)

tfidf_education_train_X, tfidf_education_test_X, tfidf_education_train_y, tfidf_education_test_y = train_test_split(
    tfidf_education_X, tfidf_education_y, test_size=0.3,random_state=0)

Wall time: 1.29 s


In [30]:
%%time
# 逻辑回归 + OneVsOne
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier
def one_lr_model(train_X, test_X, train_y, test_y):
    model = OneVsOneClassifier(LogisticRegression(random_state=42))
    model.fit(train_X, train_y)
    print(model.score(test_X, test_y))
    return model

# 年龄训练
tfidf_age_model = one_lr_model(tfidf_age_train_X, tfidf_age_test_X, 
                                tfidf_age_train_y, tfidf_age_test_y)

# 性别训练
tfidf_gender_model = one_lr_model(tfidf_gender_train_X, tfidf_gender_test_X, 
                                   tfidf_gender_train_y, tfidf_gender_test_y)
                  

# 学历训练
tfidf_education_model = one_lr_model(tfidf_education_train_X, tfidf_education_test_X, 
                                      tfidf_education_train_y, tfidf_education_test_y)


0.5815692821368948
0.8158931552587646
0.6351919866444073
Wall time: 50.7 s


In [31]:
%%time
# 直接用学历预测年龄、直接用年龄预测学历
# 训练集准备
education_2_age_X = pd.DataFrame({'education':tf_idf_data.Education[0:99831].values})
age_2_education_X = pd.DataFrame({'education':tf_idf_data.age[0:99831].values})

# 数据集重新采样
from sklearn.model_selection import train_test_split
education_2_age_train_X, education_2_age_test_X, education_2_age_train_y, education_2_age_test_y = train_test_split(
    education_2_age_X, tfidf_age_y, test_size=0.3,random_state=0)

age_2_education_train_X, age_2_education_test_X, age_2_education_train_y, age_2_education_test_y = train_test_split(
    age_2_education_X, tfidf_education_y, test_size=0.3,random_state=0)

# 年龄训练
education_2_age_model = one_lr_model(education_2_age_train_X, education_2_age_test_X, 
                                      education_2_age_train_y, education_2_age_test_y)

# 学历训练
age_2_education_model = one_lr_model(age_2_education_train_X, age_2_education_test_X, 
                                      age_2_education_train_y, age_2_education_test_y)

0.6585308848080134
0.6385308848080133
Wall time: 542 ms


##### 模型预测

In [32]:
tfidf_age__predict_y = tfidf_age_model.predict(tfidf_age_test_X)
tfidf_gender__predict_y = tfidf_gender_model.predict(tfidf_gender_test_X)
tfidf_education__predict_y = tfidf_education_model.predict(tfidf_education_test_X)

##### 模型评估

In [33]:
from sklearn.metrics import classification_report
print(classification_report(tfidf_age_test_y, tfidf_age__predict_y))

              precision    recall  f1-score   support

         1.0       0.67      0.81      0.73     11892
         2.0       0.56      0.52      0.54      8104
         3.0       0.45      0.52      0.48      5671
         4.0       0.42      0.18      0.25      3233
         5.0       0.29      0.00      0.00       898
         6.0       0.00      0.00      0.00       152

   micro avg       0.58      0.58      0.58     29950
   macro avg       0.40      0.34      0.33     29950
weighted avg       0.56      0.58      0.56     29950



In [34]:
print(classification_report(tfidf_gender_test_y, tfidf_gender__predict_y))

              precision    recall  f1-score   support

         1.0       0.82      0.87      0.84     17349
         2.0       0.80      0.75      0.77     12601

   micro avg       0.82      0.82      0.82     29950
   macro avg       0.81      0.81      0.81     29950
weighted avg       0.82      0.82      0.81     29950



In [35]:
print(classification_report(tfidf_education_test_y, tfidf_education__predict_y))

              precision    recall  f1-score   support

         1.0       0.00      0.00      0.00       108
         2.0       0.00      0.00      0.00       173
         3.0       0.62      0.55      0.58      6205
         4.0       0.56      0.57      0.56      9379
         5.0       0.70      0.82      0.75     12370
         6.0       0.46      0.07      0.12      1715

   micro avg       0.64      0.64      0.64     29950
   macro avg       0.39      0.33      0.34     29950
weighted avg       0.62      0.64      0.61     29950



##### 交叉验证评估

In [36]:
%%time
from sklearn.model_selection import cross_val_score
def tf_idf_cross_score(estimator, X, y, num_validations=5):
    accuracy=cross_val_score(estimator, X, y,
                             scoring='accuracy',cv=num_validations)
    print('准确率：{:.2f}%'.format(accuracy.mean()*100))
    precision=cross_val_score(estimator, X, y,
                             scoring='precision_weighted',cv=num_validations)
    print('精确度：{:.2f}%'.format(precision.mean()*100))
    recall=cross_val_score(estimator, X, y,
                             scoring='recall_weighted',cv=num_validations)
    print('召回率：{:.2f}%'.format(recall.mean()*100))
    f1=cross_val_score(estimator, X, y,
                             scoring='f1_weighted',cv=num_validations)
    print('F1  值：{:.2f}%'.format(f1.mean()*100))
    return accuracy, precision, recall, f1
print('age_cross_score:')
tf_idf_age_cross_score = tf_idf_cross_score(tfidf_age_model, tfidf_age_test_X, tfidf_age_test_y )
print('gender_cross_score:')
tf_idf_gender_cross_score = tf_idf_cross_score(tfidf_gender_model, tfidf_gender_test_X, tfidf_gender_test_y )
print('education_cross_score:')
tf_idf_education_cross_score = tf_idf_cross_score(tfidf_education_model, tfidf_education_test_X, tfidf_education_test_y )

age_cross_score:
准确率：55.70%
精确度：51.85%
召回率：55.70%
F1  值：52.01%
gender_cross_score:
准确率：80.52%
精确度：80.47%
召回率：80.52%
F1  值：80.35%
education_cross_score:
准确率：61.47%
精确度：59.87%
召回率：61.47%
F1  值：58.93%
Wall time: 6min 24s


##### tf_idf特征提取改进结果
**tf_idf模型特征改进效果**

**max_feature设置最大维度**
- 由默认设置为400000
- 逻辑回归模型：0.5601001669449082
- 多分类OneVsOne + 逻辑回归模型：0.5657429048414023

**max_df参数设置为0.6**
- 由默认设置为0.6
- 逻辑回归模型：0.5606343906510851
- 多分类OneVsOne + 逻辑回归模型：0.5653756260434056

**特征融合**
- 将结构化的query特征与学历特征进行拼接
- 逻辑回归模型：0.5606343906510851
- 多分类OneVsOne + 逻辑回归模型：0.5653756260434056

**原模型效果**
- 逻辑回归模型：0.5602003338898164
- 多分类OneVsOne + 逻辑回归模型：0.5656761268781302

**tf_idf与word2vec比较**
- tf_idf向量化的模型比word2ve模型的准确度提升了5个百分点左右，所有tf_idf特征提取效果比较好


#### 模型筛选

###### 决策树效果

In [37]:
%%time
from sklearn.tree import DecisionTreeClassifier
def tree_model(train_X, test_X, train_y, test_y):
    model = DecisionTreeClassifier(random_state=42)
    model.fit(train_X, train_y)
    print(model.score(test_X, test_y))
    return model

# 年龄训练
tree_age_tree_model = tree_model(tfidf_age_train_X, tfidf_age_test_X, 
                                tfidf_age_train_y, tfidf_age_test_y)

# 性别训练
tree_gender_tree_model = tree_model(tfidf_gender_train_X, tfidf_gender_test_X, 
                                   tfidf_gender_train_y, tfidf_gender_test_y)
                  

# 学历训练
tree_education_tree_model = tree_model(tfidf_education_train_X, tfidf_education_test_X, 
                                      tfidf_education_train_y, tfidf_education_test_y)

0.4192654424040067
0.6719198664440734
0.4565609348914858
Wall time: 27min 35s


###### 随机森林

In [41]:
%%time
from sklearn.ensemble import RandomForestClassifier 


def rfc_model(train_X, test_X, train_y, test_y):
    model = RandomForestClassifier()
    model.fit(train_X, train_y)
    print(model.score(test_X, test_y))
    return model

# 年龄训练
rfc_age_tree_model = rfc_model(tfidf_age_train_X, tfidf_age_test_X, 
                                tfidf_age_train_y, tfidf_age_test_y)

# 性别训练
rfc_gender_tree_model = rfc_model(tfidf_gender_train_X, tfidf_gender_test_X, 
                                   tfidf_gender_train_y, tfidf_gender_test_y)
                  

# 学历训练
rfc_education_tree_model = rfc_model(tfidf_education_train_X, tfidf_education_test_X, 
                                      tfidf_education_train_y, tfidf_education_test_y)

0.4758263772954925
0.7319866444073456
0.5215692821368948
Wall time: 5min 17s


###### GBDT

In [42]:
%%time
from sklearn.ensemble import GradientBoostingClassifier


def gbdt_model(train_X, test_X, train_y, test_y):
    model = GradientBoostingClassifier()
    model.fit(train_X, train_y)
    print(model.score(test_X, test_y))
    return model

# 年龄训练
gbdt_age_tree_model = gbdt_model(tfidf_age_train_X, tfidf_age_test_X, 
                                tfidf_age_train_y, tfidf_age_test_y)

# 性别训练
gbdt_gender_tree_model = gbdt_model(tfidf_gender_train_X, tfidf_gender_test_X, 
                                   tfidf_gender_train_y, tfidf_gender_test_y)
                  

# 学历训练
gbdt_education_tree_model = gbdt_model(tfidf_education_train_X, tfidf_education_test_X, 
                                      tfidf_education_train_y, tfidf_education_test_y)

0.5510183639398998
0.7655091819699499
0.5903839732888146
Wall time: 6h 36min 27s


###### AdaBoost

In [43]:
%%time
from sklearn.ensemble import AdaBoostClassifier


def ada_model(train_X, test_X, train_y, test_y):
    model = AdaBoostClassifier()
    model.fit(train_X, train_y)
    print(model.score(test_X, test_y))
    return model

# 年龄训练
ada_age_tree_model = ada_model(tfidf_age_train_X, tfidf_age_test_X, 
                                tfidf_age_train_y, tfidf_age_test_y)

# 性别训练
ada_gender_tree_model = ada_model(tfidf_gender_train_X, tfidf_gender_test_X, 
                                   tfidf_gender_train_y, tfidf_gender_test_y)
                  

# 学历训练
ada_education_tree_model = ada_model(tfidf_education_train_X, tfidf_education_test_X, 
                                      tfidf_education_train_y, tfidf_education_test_y)

0.4714524207011686
0.7542237061769616
0.5092153589315526
Wall time: 42min 36s


###### XGBoost

In [44]:
%%time
import xgboost
from xgboost import XGBClassifier


def xgb_model(train_X, test_X, train_y, test_y):
    model = XGBClassifier()
    model.fit(train_X, train_y)
    print(model.score(test_X, test_y))
    return model

# 年龄训练
xgb_age_tree_model = xgb_model(tfidf_age_train_X, tfidf_age_test_X, 
                                tfidf_age_train_y, tfidf_age_test_y)

# 性别训练
xgb_gender_tree_model = xgb_model(tfidf_gender_train_X, tfidf_gender_test_X, 
                                   tfidf_gender_train_y, tfidf_gender_test_y)
                  

# 学历训练
xgb_education_tree_model = xgb_model(tfidf_education_train_X, tfidf_education_test_X, 
                                      tfidf_education_train_y, tfidf_education_test_y)

0.5418030050083472
0.7656761268781302
0.5792988313856428
Wall time: 57min 11s


#### 增强词性过滤

#### 模型特征选择

### 总结
**特征相关**
- word2vec与tf_idf的作为词向量特征时，两者对模型的贡献值是不同的，tf_idf可以是模型的区分效果更好一些，大概提升5%。

    
**模型相关**
- 增加特征标签探索
- 从模型的效果来看，单颗决策树模型训练时间长，而且效果不佳

**存在的问题**
- 如何将学历特征与查询词特征融合以增强区分度，加强模型的预测能力
    - 一种方式PCA融合
    - 一种方式是将学历特征加入word2vec中训练
- 文本特征稀疏性的问题
    - 可以尝试LDA模型，看看做出来的特征是否能加强数据的区分度