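# Modularization goals

## Text vectorization + random forest

* Input: doc_term_dists, the document-term distribution matrix
* Output: df_importance, which assistants and researchers can use to build a taxonomy
* Text vectorization
   * feature extraction
* Random forest
   * used here as a dimensionality-reduction step (terms are ranked by importance)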

## Unified terminology

* Document-term (doc_term) matrix
* Features
    * here, terms / keywords
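
To make the terminology concrete, here is a minimal sketch of a document-term matrix built with the standard library only. The toy documents, the pre-tokenized input, and the helper name `build_doc_term_matrix` are assumptions for illustration; the notebook's own `feature_extraction` module may build its matrix differently.

```python
from collections import Counter


def build_doc_term_matrix(docs):
    """Return (terms, rows): a sorted vocabulary and one count row per document.

    `docs` maps a document label to a list of already-tokenized terms.
    """
    counts = {label: Counter(tokens) for label, tokens in docs.items()}
    terms = sorted({t for c in counts.values() for t in c})
    # A Counter returns 0 for missing terms, so every row has the same length.
    rows = {label: [c[t] for t in terms] for label, c in counts.items()}
    return terms, rows


# Toy corpus (hypothetical): each document is a list of terms.
docs = {
    "d1": ["data", "report", "data"],
    "d2": ["report", "risk"],
}
terms, rows = build_doc_term_matrix(docs)
```

Here `terms` is the feature axis, and `rows[label][j]` is the number of times `terms[j]` occurs in that document.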
    
## Learning objectives

* Observe the inputs and outputs
* Follow the data flow

In [1]:
fn = {
    "LDA_input": "data_sets/LDA_input.csv",  # input
    "for_term_coders": "data_sets/_for_term_coders.xlsx",  # output
}
import pandas as pd
from collections import OrderedDict, defaultdict
import pickle

# Input: simple text data
Required document columns, at minimum:
* doc_label
* doc_content

In [2]:
df = pd.read_csv(fn["LDA_input"], sep="\t", encoding="utf8", index_col=0)
corpus = df.doc_content.to_dict(into=OrderedDict)  # upgraded from to_list() to to_dict()
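
The loading step can be tried without the data file. The sketch below reads a small tab-separated string (a made-up stand-in for `data_sets/LDA_input.csv`) and shows what `to_dict(into=OrderedDict)` keeps that `to_list()` would drop: the row index as keys, in row order. The names `raw`, `df_demo`, and `corpus_demo` are illustrative only.

```python
import io
from collections import OrderedDict

import pandas as pd

# Hypothetical stand-in for the input file: tab-separated, first column is
# the index, followed by doc_label and doc_content.
raw = "\tdoc_label\tdoc_content\n1\ta\tfirst text\n2\tb\tsecond text\n"

df_demo = pd.read_csv(io.StringIO(raw), sep="\t", index_col=0)

# to_dict(into=OrderedDict) keeps the index as keys, in row order,
# whereas to_list() would return only the values.
corpus_demo = df_demo.doc_content.to_dict(into=OrderedDict)
```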

# Extract first, then reduce dimensionality
* Source: [dimensionality-reduction-techniques](https://www.analyticsvidhya.com/blog/2018/08/dimensionality-reduction-techniques-python/)

## Text vectorization: feature extraction

In [3]:
import feature_extraction as FE

M = FE.doc_vectorizer(corpus, max_df=0.5, min_df=0.1, max_features=50000)  # feature thresholds
display(type(M))
M.keys()

dict

dict_keys(['corpus_index', 'doc_lengths', 'term', 'term_frequency', 'doc_term_dists'])
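
The `max_df`, `min_df`, and `max_features` arguments look like the document-frequency thresholds used by scikit-learn's vectorizers. That is an assumption, since `FE.doc_vectorizer` itself is not shown. Under that reading, a term is kept only if the fraction of documents containing it lies between `min_df` and `max_df`, and `max_features` then caps the vocabulary at the most frequent terms. A standard-library sketch of that rule, with a hypothetical helper `select_terms`:

```python
from collections import Counter


def select_terms(docs, min_df=0.1, max_df=0.5, max_features=None):
    """Keep terms whose document-frequency ratio lies within [min_df, max_df]."""
    n_docs = len(docs)
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term
    for tokens in docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    kept = [t for t, count in doc_freq.items()
            if min_df <= count / n_docs <= max_df]
    # Most frequent terms first; ties broken alphabetically for determinism.
    kept.sort(key=lambda t: (-term_freq[t], t))
    return kept[:max_features] if max_features else kept


# Toy tokenized corpus (made up for illustration).
docs = [
    ["data", "report"],
    ["data", "risk", "risk"],
    ["data", "opinion"],
    ["data", "report"],
]
selected = select_terms(docs, min_df=0.3, max_df=0.5)
```

In this toy corpus "data" appears in every document and is dropped as too common, "risk" and "opinion" appear in only one document each and are dropped as too rare, and "report" survives.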

## Document-term matrix: before dimensionality reduction

In [4]:
# Note how the kind parameter changes the result
dfm = FE.gen_df_doc_term_matrix_from_model(M, kind="corpus_index")
dfm

Rows are indexed by corpus_index; the column axis is named term.

| corpus_index | app | h5 | ip | vr | 一体化 | 上线 | 下游 | 下滑 | 专业化 | 专业性 | ... | 风险 | 首届 | 驱动 | 高品质 | 高峰论坛 | 高效 | 高校 | 高端 | 高质量 | 龙头 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 7 | 6 | 2 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 2 | 0 | 2 | 0 | 3 | 0 | 1 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 2 | 0 |
| 5 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 1 | 0 | 0 |
| 7 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 6 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 8 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 9 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 10 | 0 | 4 | 0 | 0 | 1 | 4 | 1 | 0 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |


In [5]:
dfm = FE.gen_df_doc_term_matrix_from_model(M)
dfm

Rows are indexed by index (0-based position); the column axis is named term.

| index | app | h5 | ip | vr | 一体化 | 上线 | 下游 | 下滑 | 专业化 | 专业性 | ... | 风险 | 首届 | 驱动 | 高品质 | 高峰论坛 | 高效 | 高校 | 高端 | 高质量 | 龙头 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 7 | 6 | 2 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 2 | 0 | 2 | 0 | 3 | 0 | 1 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 2 | 0 |
| 4 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 1 | 0 | 0 |
| 6 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 6 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 7 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 8 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 9 | 0 | 4 | 0 | 0 | 1 | 4 | 1 | 0 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
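
Comparing the two outputs above, `kind="corpus_index"` appears to label rows with the original corpus keys (1 to 10), while the default labels them by position (0 to 9). The cell values are identical. The sketch below reproduces that difference with plain pandas. The function `doc_term_frame`, its handling of `kind`, and the assumed layout of the model dict are hypothetical stand-ins, not the module's actual code.

```python
import pandas as pd


def doc_term_frame(model, kind="index"):
    """Build a document-term DataFrame from a dict shaped like M (assumed layout)."""
    frame = pd.DataFrame(model["doc_term_dists"], columns=model["term"])
    frame.columns.name = "term"
    if kind == "corpus_index":
        # Label rows with the original corpus keys.
        frame.index = pd.Index(model["corpus_index"], name="corpus_index")
    else:
        # Keep the default 0-based positional index.
        frame.index.name = "index"
    return frame


# Toy model using a subset of the keys that M exposes (values are made up).
model = {
    "corpus_index": [1, 2, 3],
    "term": ["app", "h5"],
    "doc_term_dists": [[0, 0], [0, 0], [1, 0]],
}
by_position = doc_term_frame(model)
by_corpus = doc_term_frame(model, kind="corpus_index")
```

Keeping the corpus keys as the index makes it easier to trace a row back to its source document, while a positional index is convenient when only the matrix values are passed on.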


## Document-term matrix: after dimensionality reduction
Result after applying the random forest.

In [6]:
# Dimensionality reduction: 2d -> 1d
df_importance = FE.gen_df_importance_using_RandomForest(dfm, max_depth=30)  # the shallower max_depth, the ..?
df_importance

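Rows are indexed by rank_importance; the column axis is named feature.

| rank_importance | feature | importance |
| --- | --- | --- |
| 1 | 数据 | 0.057099 |
| 2 | 舆情 | 0.032421 |
| 3 | 企事业 | 0.023225 |
| 4 | 报告 | 0.021648 |
| 5 | 舆论 | 0.017774 |
| ... | ... | ... |
| 494 | 电视 | 0.000011 |
| 495 | 商城 | 0.000000 |
| 496 | 后台 | 0.000000 |
| 497 | 网易 | 0.000000 |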
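
The "2d -> 1d" step turns the document-term matrix into one importance score per term. A random forest needs a target to fit against, and the call above passes only `dfm`, so how `FE.gen_df_importance_using_RandomForest` obtains one is not visible here. The sketch below therefore assumes a per-document label `y` is available and uses scikit-learn's `RandomForestClassifier`. The function name, the toy data, and the labels are all illustrative assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def rank_feature_importance(dfm, y, max_depth=30, random_state=0):
    """Fit a random forest and return features ranked by importance."""
    forest = RandomForestClassifier(
        n_estimators=100, max_depth=max_depth, random_state=random_state
    )
    forest.fit(dfm.values, y)
    out = pd.DataFrame(
        {"feature": list(dfm.columns), "importance": forest.feature_importances_}
    ).sort_values("importance", ascending=False, ignore_index=True)
    # Rank 1 is the most important feature, mirroring the table layout above.
    out.index = pd.RangeIndex(1, len(out) + 1, name="rank_importance")
    out.columns.name = "feature"
    return out


# Toy document-term counts; y is a made-up label per document (assumption).
dfm_demo = pd.DataFrame(
    {"data": [3, 2, 0, 0], "report": [1, 1, 1, 1], "risk": [0, 0, 2, 3]}
)
y_demo = [0, 0, 1, 1]
ranked = rank_feature_importance(dfm_demo, y_demo)
```

A term with the same count in every document, like "report" here, can never be used for a split and ends up with zero importance, which matches the zero-importance tail in the table above. Re-running with different `max_depth` values is one way to explore the open question in the code comment.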


In [7]:
# Prepare the sheet for the term coders
df_out = FE.gen_df_term_coder(df_importance, columns_name="userdefined")
df_out

Rows are indexed by rank_importance; the column axis is named userdefined.

| rank_importance | feature | importance | 类别 | 修正 | memo | label |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 数据 | 0.057099 |  |  |  | 数据 |
| 2 | 舆情 | 0.032421 |  |  |  | 舆情 |
| 3 | 企事业 | 0.023225 |  |  |  | 企事业 |
| 4 | 报告 | 0.021648 |  |  |  | 报告 |
| 5 | 舆论 | 0.017774 |  |  |  | 舆论 |
| ... | ... | ... | ... | ... | ... | ... |
| 494 | 电视 | 0.000011 |  |  |  | 电视 |
| 495 | 商城 | 0.000000 |  |  |  | 商城 |
| 496 | 后台 | 0.000000 |  |  |  | 后台 |
| 497 | 网易 | 0.000000 |  |  |  | 网易 |
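
Judging from the output, the coder sheet is the importance table plus three blank columns for the human coder (类别, 修正, memo) and a `label` column that starts as a copy of `feature`. A minimal pandas sketch of that shape follows. The helper `make_term_coder_sheet` is hypothetical, and filling the blank cells with empty strings rather than NaN is an assumption.

```python
import pandas as pd


def make_term_coder_sheet(df_importance, columns_name="userdefined"):
    """Append blank coding columns and a label column (assumed layout)."""
    out = df_importance.copy()  # leave the caller's DataFrame untouched
    for col in ["类别", "修正", "memo"]:
        out[col] = ""              # left blank for the human coder to fill in
    out["label"] = out["feature"]  # defaults to the term itself
    out.columns.name = columns_name
    return out


# First two rows of the importance table shown above.
demo = pd.DataFrame(
    {"feature": ["数据", "舆情"], "importance": [0.057099, 0.032421]},
    index=pd.RangeIndex(1, 3, name="rank_importance"),
)
sheet = make_term_coder_sheet(demo)
```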


-----
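# Output: feature (term) importance

Document terms, RF (random forest), dimensionality reduction, feature importance

* [Using openpyxl for advanced Excel report updates or creation](https://openpyxl.readthedocs.io/en/latest/pandas.html)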



In [8]:
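# Each DataFrame becomes one sheet, named after its columns.name
list_df_xls_df = [df_importance, df_out]
list_df_xls_sheetnames = [x.columns.name for x in list_df_xls_df]
# Despite its name, list_df_xls is a dict: {sheet name: DataFrame}
list_df_xls = dict(zip(list_df_xls_sheetnames, list_df_xls_df))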
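
Because the sheet names come from `columns.name`, a DataFrame without one would end up under the key `None`, and two DataFrames with the same name would silently overwrite each other in the dict. A slightly more defensive variant might look like the sketch below. The helper `frames_by_sheet_name` is hypothetical and not part of the notebook's modules.

```python
import pandas as pd


def frames_by_sheet_name(frames):
    """Map each DataFrame to a sheet name taken from its columns.name."""
    sheets = {}
    for frame in frames:
        name = frame.columns.name
        if not name:
            raise ValueError("DataFrame has no columns.name to use as sheet name")
        if name in sheets:
            raise ValueError(f"duplicate sheet name: {name!r}")
        sheets[name] = frame
    return sheets


# Two toy frames named like the ones in the notebook.
a = pd.DataFrame({"x": [1]})
a.columns.name = "feature"
b = pd.DataFrame({"y": [2]})
b.columns.name = "userdefined"
sheets = frames_by_sheet_name([a, b])
```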

In [11]:
import xls_io as xls
xls.check_and_write_xls(fn["for_term_coders"], list_df_xls)
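
The `xls_io` module is not shown, so what `check_and_write_xls` checks is unknown. One plausible reading is "refuse to overwrite an existing workbook, otherwise write one sheet per DataFrame". The sketch below implements that reading with `pandas.ExcelWriter` and the openpyxl engine. The function name `write_sheets`, the `overwrite` flag, and the demo path are assumptions.

```python
import os
import tempfile

import pandas as pd


def write_sheets(path, sheets, overwrite=False):
    """Write {sheet_name: DataFrame} to one .xlsx file.

    Refuses to overwrite an existing file unless overwrite=True.
    """
    if os.path.exists(path) and not overwrite:
        raise FileExistsError(path)
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        for name, frame in sheets.items():
            frame.to_excel(writer, sheet_name=name)


# Demo: write one small sheet into a temporary directory.
demo_sheets = {
    "feature": pd.DataFrame({"feature": ["数据"], "importance": [0.057099]}),
}
demo_path = os.path.join(tempfile.mkdtemp(), "_for_term_coders_demo.xlsx")
write_sheets(demo_path, demo_sheets)
```

Reading the workbook back with `pd.read_excel(demo_path, sheet_name=None)` returns a dict of DataFrames keyed by sheet name, which is a quick way to confirm what the coders will receive.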