**关键词提取**   

> 按**栏目**分，得到每一个标题的所有分词：
>
> > 将每一条的留言内容、标题、回复内容合并
> >
> > 过滤掉非中、英文
> >
> > 进行分词
> >
> > 分词得到的内容与应用词匹配，筛选应用词中出现过的分词
>
> TF-IDF关键词提取：
>
> > 将内容按栏目合并
> >
> > 过滤掉非中、英文
> >
> > 进行分词，且与应用词匹配
> >
> > 计算TF-IDF，注意默认参数
>
> TextRank关键词提取：
>
> > 按记录合并留言内容、标题、回复内容
> >
> > 过滤掉非中、英文
> >
> > 进行分词，且与应用词匹配
> >
> > 按照标题进行TextRank计算

In [1]:
import os
import time
from glob import glob
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import jieba
import jieba.analyse
import jieba.posseg as pseg
from wordcloud import WordCloud
import re
import imageio
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['font.sans-serif'] = 'SimHei' 
plt.rcParams['axes.unicode_minus'] = False 
pd.set_option('max_rows',10000)
pd.set_option('max_columns',200)
from sklearn.feature_extraction._dict_vectorizer import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from text_tools import * 

In [2]:
df = pd.read_excel('../data/processed/员工心声明细未去重删掉前三列为空.xlsx').iloc[:,1:]
print('原始形状：',df.shape)
df = df.drop_duplicates(subset=['标题','留言内容','回复内容'])
print('去重后形状：',df.shape)

原始形状： (11948, 23)
去重后形状： (11874, 23)


In [3]:
df['回复内容'].fillna('0',inplace=True)
df['sentence'] = df['标题'] + df['留言内容'] + df['回复内容']

## 按栏目分词    
将原始数据按照栏目(col)分组，每个表为这个col的所有的sentence组成的字符串   
分别对col进行分词，并且写入字典。   
>col_words的value是每个col分词后的列表，后期可以用来分词    
>col_words_unique的value是每个col分词后且去重的列表，用来检测每个词属于哪个col

In [4]:
col_words = {}
col_words_unique = {}
for col,df_col in df.groupby(['栏目']):
    df_col['sentence'] = df_col['sentence'].astype('str')
    ser_col = list(df_col['sentence'])
    # 将同一个栏目的内容合并成一个字符串
    tmp = '0'.join(ser_col)
    # col_words加载
    tmp_res = cut_word(pd.Series(tmp),distinct=False)
    col_words[col] = tmp_res[0]
    # col_words_unique加载
    tmp_res_unique = cut_word(pd.Series(tmp),distinct=True)
    col_words[col] = tmp_res_unique[0]
    print('-'*5,'{}栏处理完成'.format(col),'-'*5)
print('所有栏目处理完成')

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\lenovo\AppData\Local\Temp\jieba.cache
Loading model cost 1.649 seconds.
Prefix dict has been built successfully.


结巴分词完成，现在进行能用词筛选
通过能用词筛选完成。总共用时：3.285597324371338s
结巴分词完成，现在进行能用词筛选
通过能用词筛选完成。总共用时：1.7045302391052246s
----- 三农业务栏处理完成 -----
结巴分词完成，现在进行能用词筛选
通过能用词筛选完成。总共用时：3.075683116912842s
结巴分词完成，现在进行能用词筛选
通过能用词筛选完成。总共用时：2.8610169887542725s
----- 个人金融栏处理完成 -----
结巴分词完成，现在进行能用词筛选
通过能用词筛选完成。总共用时：0.2734522819519043s
结巴分词完成，现在进行能用词筛选
通过能用词筛选完成。总共用时：0.32566118240356445s
----- 交易银行栏处理完成 -----
结巴分词完成，现在进行能用词筛选
通过能用词筛选完成。总共用时：1.0188822746276855s
结巴分词完成，现在进行能用词筛选
通过能用词筛选完成。总共用时：0.9494633674621582s
----- 人力资源栏处理完成 -----
结巴分词完成，现在进行能用词筛选
通过能用词筛选完成。总共用时：0.2655465602874756s
结巴分词完成，现在进行能用词筛选
通过能用词筛选完成。总共用时：0.31151795387268066s
----- 信息科技栏处理完成 -----
结巴分词完成，现在进行能用词筛选
通过能用词筛选完成。总共用时：0.5288293361663818s
结巴分词完成，现在进行能用词筛选
通过能用词筛选完成。总共用时：0.5762555599212646s
----- 信用卡栏处理完成 -----
结巴分词完成，现在进行能用词筛选
通过能用词筛选完成。总共用时：0.03133654594421387s
结巴分词完成，现在进行能用词筛选
通过能用词筛选完成。总共用时：0.03288745880126953s
----- 信用审批栏处理完成 -----
结巴分词完成，现在进行能用词筛选
通过能用词筛选完成。总共用时：1.3061490058898926s
结巴分词完成，现在进行能用词筛选
通过能用词筛选完成。总共用时：1.0804104804992676s
----- 公司金融栏处理

In [5]:
# col_seq：记录每个栏目的名称
# col_cont：每个栏目分词后的结果用空格链接成字符串，col_cont记录每个栏目的字符串处理内容
col_seq = []
col_cont = []
for key in col_words.keys():
    col_seq.append(key)
    col_cont.append(' '.join(col_words[key]))

## TF-IDF关键词提取

In [6]:
# 每个栏目所有的内容为一个文档，将所有的栏目作为数组进行tfidf处理，得到每个栏目的关键词
tfidf_res,tfidf_info = tfidf_process(col_seq,col_cont)

文档名称数量和文档内容数量相等，继续进行
tfidf处理完成


## TextRank关键词提取

In [7]:
allow_pos = ('a'  ,'ad' ,'ag' ,'al' ,'an','d','v' ,'vd'   ,'vf'   ,'vg'   ,'vi'   ,'vl'   ,'vn'   ,'vshi' ,'vx'   ,'vyou' ,'n'    ,'ng'   ,'nl'   ,'nr'   ,'nr1'  ,'nr2'  ,'nrf'  ,'nrfg'  ,'nrj'  ,'ns'   ,'nsf'  ,'nt'   ,'nz')

In [8]:
# 每个栏目为一个文档，分别处理每个栏目，得到每个栏目的关键词
trk_res = textrank_process(col_seq,col_cont,allow_pos)

文档名称数量和文档内容数量相等，继续进行
TextRank处理完成


## 综合TF-IDF和TextRank分析    
匹配每个栏目中TF-IDF和TextRank的筛选出的关键词

In [9]:
tfidf_info

Unnamed: 0,三农业务,个人金融,交易银行,人力资源,信息科技,信用卡,信用审批,公司金融,其他,内控合规,安全保卫,小企业金融,工会,授信管理,消费信贷,网络金融,财务管理,运营管理,采购管理,金融市场,门户建设,风险管理
count,8370.0,8370.0,8370.0,8370.0,8370.0,8370.0,8370.0,8370.0,8370.0,8370.0,8370.0,8370.0,8370.0,8370.0,8370.0,8370.0,8370.0,8370.0,8370.0,8370.0,8370.0,8370.0
mean,0.006893,0.008149,0.004426,0.006391,0.004701,0.005319,0.001974,0.006633,0.00708,0.003465,0.002595,0.004228,0.002991,0.005981,0.006447,0.005322,0.003169,0.007454,0.002068,0.002305,0.004978,0.003087
std,0.008484,0.007285,0.009995,0.008868,0.009868,0.009549,0.010751,0.008688,0.008328,0.010367,0.010619,0.01008,0.010514,0.009149,0.008827,0.009548,0.010461,0.007995,0.010734,0.010685,0.009731,0.010486
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.009136,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.007237,0.0,0.0,0.0,0.0
75%,0.014104,0.014309,0.0,0.013174,0.0,0.012034,0.0,0.014366,0.014344,0.0,0.0,0.0,0.0,0.0134,0.014037,0.012031,0.0,0.014246,0.0,0.0,0.0,0.0
80%,0.015821,0.015422,0.0,0.01547,0.013128,0.015034,0.0,0.016025,0.016204,0.0,0.0,0.0,0.0,0.015495,0.015588,0.01503,0.0,0.016239,0.0,0.0,0.01475,0.0
85%,0.018035,0.016785,0.017816,0.018706,0.017892,0.017324,0.0,0.018102,0.017345,0.0,0.0,0.016984,0.0,0.018195,0.017485,0.01732,0.0,0.017503,0.0,0.0,0.017727,0.0


### 三农业务

In [10]:
snyw_res = kw_merge(tfidf_res,trk_res,'三农业务',0.9,0.9,major='inner')

In [11]:
snyw_res

Unnamed: 0,word,tfidf_weight,ttk_weight,word_flag
0,齐心,0.026409,0.570685,形容词
1,瞄准,0.026409,0.48392,动词
2,精通,0.026409,0.480225,形容词
3,助保,0.026409,0.498347,动词
4,空表,0.026409,0.577929,名词
5,医疗机构,0.026409,0.577688,名词
6,单同,0.026409,0.578189,形容词
7,确诊,0.026409,0.600326,动词
8,砥砺,0.026409,0.573025,动词
9,双流,0.026409,0.570146,名词


### 个人金融

In [12]:
grjr_res = kw_merge(tfidf_res,trk_res,'个人金融',0.9,0.9,major='inner')

In [13]:
# grjr_res

### 交易银行

In [14]:
jyyh_res = kw_merge(tfidf_res,trk_res,'交易银行',0.9,0.9,major='inner')

In [15]:
# jyyh_res

### 人力资源

In [16]:
rlzy_res = kw_merge(tfidf_res,trk_res,'人力资源',0.9,0.9,major='inner')

In [17]:
# rlzy_res

### 信息科技

In [18]:
xxkj_res = kw_merge(tfidf_res,trk_res,'信息科技',0.9,0.9,major='inner')

In [19]:
# xxkj_res

### 信用卡

In [20]:
credit_card_res = kw_merge(tfidf_res,trk_res,'信用卡',0.9,0.9,major='inner')

In [21]:
# credit_card_res

### 信用审批

In [22]:
xysp_res = kw_merge(tfidf_res,trk_res,'信用审批',0.98,0.9,major='inner')

In [23]:
# xysp_res

### 公司金融

In [24]:
gsjr_res = kw_merge(tfidf_res,trk_res,'公司金融',0.98,0.9,major='inner')

In [25]:
# gsjr_res

### 其他

In [26]:
others_res = kw_merge(tfidf_res,trk_res,'其他',0.98,0.9,major='inner')

In [27]:
# others_res

### 内控合规

In [28]:
nkhg_res = kw_merge(tfidf_res,trk_res,'其他',0.98,0.9,major='inner')

In [29]:
# nkhg_res

### 安全保卫

In [30]:
aqbw_res = kw_merge(tfidf_res,trk_res,'其他',0.98,0.9,major='inner')

In [31]:
# aqbw_res

### 小企业金融

In [32]:
xqyjr_res = kw_merge(tfidf_res,trk_res,'安全保卫',0.9,0.9,major='inner')

In [33]:
# xqyjr_res

### 工会

In [34]:
labor_union_res = kw_merge(tfidf_res,trk_res,'工会',0.9,0.9,major='inner')

In [35]:
# labor_union_res

### 授信管理

In [36]:
credit_mgm_res = kw_merge(tfidf_res,trk_res,'授信管理',0.9,0.9,major='inner')

In [37]:
# credit_mgm_res

### 消费信贷

In [38]:
xfxd_res = kw_merge(tfidf_res,trk_res,'授信管理',0.8,0.9,major='inner')

In [39]:
# xfxd_res

### 网络金融

In [40]:
network_finance_res = kw_merge(tfidf_res,trk_res,'网络金融',0.8,0.9,major='inner')

In [41]:
# network_finance_res

### 财务管理

In [42]:
financeal_mgm_res = kw_merge(tfidf_res,trk_res,'财务管理',0.96,0.9,major='inner')

In [43]:
# financeal_mgm_res

### 运营管理

In [44]:
operation_mgm_res = kw_merge(tfidf_res,trk_res,'运营管理',0.6,0.9,major='inner')

In [45]:
# operation_mgm_res

### 采购管理

In [46]:
purchase_res = kw_merge(tfidf_res,trk_res,'采购管理',0.98,0.9,major='inner')

In [47]:
# purchase_res

### 金融市场

In [48]:
financial_mkt = kw_merge(tfidf_res,trk_res,'金融市场',0.98,0.9,major='inner')

In [49]:
# financial_mkt

### 门户建设

In [50]:
mhjs_res = kw_merge(tfidf_res,trk_res,'门户建设',0.9,0.9,major='inner')

In [51]:
# mhjs_res

### 风险管理

In [52]:
risk_mgm_res = kw_merge(tfidf_res,trk_res,'风险管理',0.96,0.9,major='inner')

In [53]:
risk_mgm_res.head()

Unnamed: 0,word,tfidf_weight,ttk_weight,word_flag
0,揭示,0.066815,0.717792,动词
1,法务,0.066815,0.610323,名词
2,受害,0.066815,0.628186,动词
3,货车,0.066815,0.719579,名词
4,作风,0.066815,0.711341,名词


### 全部写入excel

In [10]:
wrt1 = pd.ExcelWriter('../result/v2/按栏目抽取关键词结果_v1.xlsx')
wrt2 = pd.ExcelWriter('../result/v2/按栏目抽取关键词结果_v2.xlsx')
for col in col_seq:
    tb1 = kw_merge(tfidf_res,trk_res,col,0.9,0.9,major='inner')
    tb1.to_excel(wrt1,sheet_name=col,index=False)
    tb2 = kw_concat(tfidf_res,trk_res,col)
    tb2.to_excel(wrt2,sheet_name=col,index=False)
wrt1.save()
wrt2.save()