In [3]:
# !pip install jieba



In [41]:
import pandas as pd
import jieba.posseg as pseg
import jieba
import jieba.analyse
import os

In [53]:
jieba.__version__

'0.42.1'

In [3]:
os.chdir('shopee-product-title-translation-open')

In [5]:
train_en = pd.read_csv('train_en.csv')
test = pd.read_csv('test.csv')
train_tcn = pd.read_csv('train_tcn.csv')

In [6]:
train_tcn

Unnamed: 0,product_title,category
0,Gucci Gucci Guilty Pour Femme Stud Edition 罪愛女...,Health & Beauty
1,（二手）PS4 GTA 5 俠盜獵車手5 Grand Theif Auto V繁體 中文版,Game Kingdom
2,百獸卡,Life & Entertainment
3,nac nac活氧全效柔衣素,Mother & Baby
4,#Nike耐吉官方F.C. 男子足球長褲新款標準型 拒水 拉鏈褲腳\nCD0557,Men's Apparel
...,...,...
499995,Dress,Women's Apparel
499996,Lilian Lin,Food & Beverages
499997,77 抹茶杏仁乳加 77乳加 減甜 大人味 大人的77 宇治抹茶 杏仁 宇治抹茶杏仁 抹茶 ...,Food & Beverages
499998,Panasonic 國際牌 電動 牙刷頭 (EW-DM81 專用刷頭) WEW0974-W,Home Electronic


In [54]:
train_en

Unnamed: 0,product_title,category
0,Recollections Color Splash Clear Stamps & Stencil,Hobbies & Stationery
1,"soap,lotion scrub set 400",Health & Personal Care
2,Spigen Galaxy S10e Case Tough Armor Gunmetal,Mobile Accessories
3,Acrylic Lanalon Bright Red,Hobbies & Stationery
4,303 FLAT SHEET/Blanket 100% cotton,Home & Living
...,...,...
499995,rocker arm roller racing mio,Motors
499996,Secosana (preloved bag),Women's Bags
499997,jag bag,Women's Bags
499998,Baby wipes 15 sheets (Alcohol and Paraben Free...,Babies & Kids


## Resources

https://github.com/crownpku/Awesome-Chinese-NLP

# Pre-processing Jieba
https://github.com/fxsjy/jieba

Jieba supports three segmentation modes:
- Accurate Mode attempts to cut the sentence into the most accurate segmentations, which is suitable for text analysis.
- Full Mode gets all the possible words from the sentence. Fast but not accurate.
- Search Engine Mode, based on the Accurate Mode, attempts to cut long words into several short words, which can raise the recall rate. Suitable for search engines.
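To get a feel for why Full Mode is described as "fast but not accurate", the sketch below lists every dictionary word that occurs anywhere in a sentence. It is a toy illustration only: the vocabulary is hand-written for this example, and `all_dictionary_words` is a hypothetical helper, not Jieba's actual algorithm.

```python
def all_dictionary_words(sentence, vocabulary, max_len=4):
    """Return every substring of 2..max_len characters that is in vocabulary."""
    found = []
    for start in range(len(sentence)):
        last_end = min(len(sentence), start + max_len)
        for end in range(start + 2, last_end + 1):
            piece = sentence[start:end]
            if piece in vocabulary:
                found.append(piece)
    return found


# Hand-written toy vocabulary (an assumption for illustration, not Jieba's dictionary).
vocabulary = {'永和', '和服', '服装', '装饰', '饰品', '有限', '公司', '有限公司'}
sentence = '永和服装饰品有限公司'

full_mode_like = all_dictionary_words(sentence, vocabulary)
print(full_mode_like)
```

Words such as 和服 and 装饰 straddle the boundaries of the intended reading (永和 / 服装 / 饰品 / 有限公司), which is the kind of over-generation that Accurate Mode is meant to avoid.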

**How to apply this in the Shopee challenge?**
- Add custom words to the dictionary
    - need to find updated libraries
    - subwords
    - maybe take the most common English words and translate them to Chinese to add as custom words
- Use jieba.cut_for_search to tokenize each title
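One possible starting point for the custom-word idea is to count the Latin-script tokens (brand names, model codes) that appear in the titles and review the most frequent ones by hand. The sketch below uses only the standard library; `latin_token_counts` and the sample titles are assumptions made for illustration, and the translation step is left out.

```python
import re
from collections import Counter


def latin_token_counts(titles):
    """Count lower-cased ASCII alphanumeric runs of two or more characters."""
    counts = Counter()
    for title in titles:
        tokens = re.findall(r"[A-Za-z][A-Za-z0-9]+", title)
        counts.update(token.lower() for token in tokens)
    return counts


# A few titles copied from the train_tcn preview above.
sample_titles = [
    "Gucci Gucci Guilty Pour Femme",
    "#Nike耐吉官方F.C. 男子足球長褲",
    "Panasonic 國際牌 電動 牙刷頭",
    "nac nac活氧全效柔衣素",
]

counts = latin_token_counts(sample_titles)
print(counts.most_common(5))
```

On the full dataset the same function could be applied to `train_tcn.product_title`, and the top entries fed to `jieba.add_word` after review.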

https://medium.com/@jjsham/nlp-tokenizing-chinese-phases-3302da4336bf
- The major difference between Chinese and English is the structure of the writing. Chinese has no whitespace between phrases, so splitting on whitespace, as is done for English, may not work as well. It can work in some cases, but very likely will not. The article explains when it can work and what to do when it does not. Its experiments use traditional Chinese characters to avoid the problems of simplified Chinese.
- Jieba is a good package for preprocessing, but it is best not to use it to preprocess written Cantonese.
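The whitespace point is easy to check with plain `str.split`. The titles below are taken from the dataset previews above; the third one shows the case where splitting "could work" because the seller already inserted spaces.

```python
english_title = "Spigen Galaxy S10e Case Tough Armor Gunmetal"
chinese_title = "男子足球長褲新款標準型"
mixed_title = "Panasonic 國際牌 電動 牙刷頭 (EW-DM81 專用刷頭) WEW0974-W"

english_tokens = english_title.split()
chinese_tokens = chinese_title.split()
mixed_tokens = mixed_title.split()

print(len(english_tokens), english_tokens)
print(len(chinese_tokens), chinese_tokens)  # the whole title comes back as one token
print(len(mixed_tokens), mixed_tokens)
```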

## DOCUMENTATION

### Cut
- The jieba.cut function accepts three input parameters: the first parameter is the string to be cut; the second parameter is cut_all, controlling the cut mode; the third parameter is to control whether to use the Hidden Markov Model.
- **jieba.cut_for_search accepts two parameters: the string to be cut, and whether to use the Hidden Markov Model. This will cut the sentence into short words suitable for search engines.**

In [37]:
train_tcn.head().product_title.apply(lambda x: list(jieba.cut_for_search(x)))

0    [Gucci,  , Gucci,  , Guilty,  , Pour,  , Femme...
1    [（, 二手, ）, PS4,  , GTA,  , 5,  , 俠盜, 獵車, 手, 5,...
2                                              [百獸, 卡]
3                           [nac,  , nac, 活氧全, 效柔, 衣素]
4    [#, Nike, 耐吉, 官方, F, ., C, .,  , 男子, 足球, 長, 褲,...
Name: product_title, dtype: object

jieba.lcut and jieba.lcut_for_search return a list directly.

In [38]:
train_tcn.head().product_title.apply(jieba.lcut_for_search)

0    [Gucci,  , Gucci,  , Guilty,  , Pour,  , Femme...
1    [（, 二手, ）, PS4,  , GTA,  , 5,  , 俠盜, 獵車, 手, 5,...
2                                              [百獸, 卡]
3                           [nac,  , nac, 活氧全, 效柔, 衣素]
4    [#, Nike, 耐吉, 官方, F, ., C, .,  , 男子, 足球, 長, 褲,...
Name: product_title, dtype: object

jieba.Tokenizer(dictionary=DEFAULT_DICT) 
- creates a new customized Tokenizer, which enables you to use different dictionaries at the same time. jieba.dt is the default Tokenizer, to which almost all global functions are mapped.

### Load dictionary
- The dictionary format is the same as that of dict.txt: one word per line; each line is divided into three parts separated by a space: word, word frequency, POS tag. If file_name is a path or a file opened in binary mode, the dictionary must be UTF-8 encoded.
- The word frequency and POS tag can be omitted respectively. The word frequency will be filled with a suitable value if omitted.

In [None]:
jieba.load_userdict(file_name)
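As a sketch of the dictionary format described above, the snippet below writes a small UTF-8 user dictionary with the frequency and POS tag optionally omitted. The helper `format_userdict_line`, the example entries, and the file name are assumptions made for illustration.

```python
import os
import tempfile


def format_userdict_line(word, freq=None, tag=None):
    """Build one 'word [freq] [tag]' line, skipping the omitted parts."""
    parts = [word]
    if freq is not None:
        parts.append(str(freq))
    if tag is not None:
        parts.append(tag)
    return " ".join(parts)


# Hypothetical entries: (word, frequency, POS tag).
entries = [("百獸卡", 10, "n"), ("柔衣素", None, "n"), ("俠盜獵車手", None, None)]
lines = [format_userdict_line(*entry) for entry in entries]

userdict_path = os.path.join(tempfile.mkdtemp(), "userdict.txt")
with open(userdict_path, "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")

print(lines)
# With jieba imported, the file could then be loaded with:
# jieba.load_userdict(userdict_path)
```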

### Modify dictionary

- Use add_word(word, freq=None, tag=None) and del_word(word) to modify the dictionary dynamically in programs.
- Use suggest_freq(segment, tune=True) to adjust the frequency of a single word so that it can (or cannot) be segmented.

In [None]:
jieba.add_word(word, freq=None, tag=None)  # word is required; freq and tag are optional

In [None]:
jieba.suggest_freq(('中', '将'), True)

### Keyword Extraction

- sentence: the text to extract keywords from
- topK: how many keywords with the highest TF-IDF weights to return. The default value is 20
- withWeight: whether to return the TF-IDF weights with the keywords. The default value is False
- allowPOS: keep only words with the listed POS tags. Empty for no filtering.

**Not sure if we will need this for Shopee, since we need as many words as possible in the query.**

In [None]:
jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
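To make `topK` and `withWeight` concrete, here is a simplified TF-IDF ranking over tokens that are already segmented. It is a sketch under stated assumptions, not Jieba's implementation: the IDF values are made up, and `top_keywords` with its `default_idf` parameter is a hypothetical helper.

```python
from collections import Counter


def top_keywords(tokens, idf, top_k=20, with_weight=False, default_idf=1.0):
    """Rank tokens by term frequency times inverse document frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    scored = {
        word: (count / total) * idf.get(word, default_idf)
        for word, count in counts.items()
    }
    # Highest weight first; ties are broken alphabetically for determinism.
    ranked = sorted(scored.items(), key=lambda item: (-item[1], item[0]))[:top_k]
    return ranked if with_weight else [word for word, _ in ranked]


tokens = ["抹茶", "杏仁", "抹茶", "乳加"]
idf = {"抹茶": 2.0, "杏仁": 3.0, "乳加": 1.0}  # made-up IDF values

keywords = top_keywords(tokens, idf, top_k=2)
weighted = top_keywords(tokens, idf, top_k=2, with_weight=True)
print(keywords)
print(weighted)
```

With `top_k=2` the lowest-weighted token is dropped, which is exactly the behaviour that may be unwanted for the Shopee titles.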

Developers can specify their own custom IDF corpus in jieba keyword extraction
- Custom Corpus Sample：https://github.com/fxsjy/jieba/blob/master/extra_dict/idf.txt.big

Developers can specify their own custom stop words corpus in jieba keyword extraction
- Custom Corpus Sample：https://github.com/fxsjy/jieba/blob/master/extra_dict/stop_words.txt

In [None]:
jieba.analyse.set_idf_path(file_name)
jieba.analyse.set_stop_words(file_name) # file_name is the path for the custom corpus

### POS tagging

Tags the POS of each word after segmentation, using labels compatible with ictclas.

jieba.posseg.POSTokenizer(tokenizer=None) creates a new customized Tokenizer. tokenizer specifies the jieba.Tokenizer to internally use. jieba.posseg.dt is the default POSTokenizer.

### Parallel Processing

Principle: Split target text by line, assign the lines into multiple Python processes, and then merge the results, which is considerably faster.

Note that parallel processing supports only default tokenizers, jieba.dt and jieba.posseg.dt.

In [42]:
jieba.enable_parallel(4)

NotImplementedError: jieba: parallel mode only supports posix system

Parallel mode currently does not support Windows.
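The split-by-line, map, and merge principle can also be sketched by hand, which may be useful where `enable_parallel` is unavailable. In the sketch below `segment_lines` and `char_cut` are hypothetical names; `char_cut` is only a stand-in tokenizer so the example runs without Jieba.

```python
def segment_lines(text, cut, map_fn=map):
    """Split text into lines, apply cut to each line, and merge the results."""
    lines = text.splitlines()
    return list(map_fn(cut, lines))


def char_cut(line):
    """Stand-in tokenizer: one token per non-space character."""
    return [ch for ch in line if not ch.isspace()]


titles = "百獸卡\nnac nac活氧全效柔衣素"
segmented = segment_lines(titles, char_cut)
print(segmented)
```

Passing the `map` method of a `multiprocessing.Pool` as `map_fn` would distribute the lines across processes. In that case `cut` has to be a picklable module-level function, and on Windows the pool must be created under an `if __name__ == "__main__":` guard.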

### Tokenize
- The input must be unicode

#### Default mode

In [47]:
?jieba.tokenize

In [43]:
result = jieba.tokenize(u'永和服装饰品有限公司')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))


word 永和		 start: 0 		 end:2
word 服装		 start: 2 		 end:4
word 饰品		 start: 4 		 end:6
word 有限公司		 start: 6 		 end:10


#### Search mode - for finer segmentation

In [46]:
result = jieba.tokenize(u'永和服装饰品有限公司', mode='search')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))


word 永和		 start: 0 		 end:2
word 服装		 start: 2 		 end:4
word 饰品		 start: 4 		 end:6
word 有限		 start: 6 		 end:8
word 公司		 start: 8 		 end:10
word 有限公司		 start: 6 		 end:10


### ChineseAnalyzer for Whoosh

In [48]:
from jieba.analyse import ChineseAnalyzer

ImportError: cannot import name 'ChineseAnalyzer' from 'jieba.analyse' (C:\ProgramData\Anaconda3\lib\site-packages\jieba\analyse\__init__.py)

The import most likely fails because Whoosh is not installed: jieba.analyse only exposes ChineseAnalyzer when the whoosh package can be imported.

### Dictionaries to Use

It is possible to use your own dictionary with Jieba, and there are also two dictionaries ready for download:

1. A smaller dictionary for a smaller memory footprint: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small

2. **IMPT: There is also a bigger dictionary that has better support for traditional Chinese (繁體): https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big**

In [51]:
jieba.set_dictionary(r"D:\Shopee\Open Round 4_Title Translation\dict.txt.big.txt")  # raw string so the backslashes are not read as escapes

## Tutorials

1. How to improve the performance of chinese text tokenization in python and jieba
https://levelup.gitconnected.com/how-to-improve-the-performance-of-chinese-text-tokenization-in-python-and-jieba-26add53f3756
    - adding custom words 

In [11]:
text = train_tcn.iloc[4,0]
words = pseg.cut(text)
for w in words:
    print('%s %s' % (w.word, w.flag))

Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\rdico\AppData\Local\Temp\jieba.cache
Loading model cost 2.885 seconds.
Prefix dict has been built successfully.


# x
Nike eng
耐吉 nz
官方 n
F eng
. m
C eng
. m
  x
男子 n
足球 n
長 a
褲 n
新款 n
標準 n
型 k
  x
拒水 v
  x
拉 v
鏈 ng
褲 ng
腳 n
\ x
nCD0557 eng


### add custom words
add_word(word, freq=None, tag=None)

In [12]:
jieba.add_word('于吉', freq=None, tag='nr')

In [None]:
#sample code for adding nouns
with open('noun_list.txt', 'r', encoding='utf8') as f:
    custom_noun = f.readlines()
    for noun in custom_noun:
        jieba.add_word(noun.replace('\n', ''), freq=None, tag='n')

2. Adding spaces

https://breezegeography.wordpress.com/2018/01/25/how-to-segment-chinese-texts-putting-in-spaces-with-jieba/

This command tells Jieba to put a space between 非 and 农业. Alternatively, if Jieba defaulted to separating the characters and I wanted them to be considered one unit, I’d alter the code inside the parentheses to read ‘非农业’ .

In [14]:
jieba.suggest_freq(('非','农业'), True)

4

The first line triggers the actual segmentation process. Jieba has a couple of segmentation modes; by setting “cut_all = False” I indicate that I want Jieba to run its slower, more accurate mode.

In [29]:
words = jieba.cut(text, cut_all=False)
list(words)
# for w in words:
#     print('%s %s' % (w.word, w.flag))

['#',
 'Nike',
 '耐吉',
 '官方',
 'F',
 '.',
 'C',
 '.',
 ' ',
 '男子',
 '足球',
 '長',
 '褲',
 '新款',
 '標準',
 '型',
 ' ',
 '拒水',
 ' ',
 '拉鏈',
 '褲腳',
 '\\',
 'nCD0557']
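
To actually "put in spaces", the tokens can be joined back into a single string. The sketch below reuses the token list printed above and drops the whitespace-only tokens so that no double spaces appear.

```python
# Token list copied from the jieba.cut output above.
tokens = ['#', 'Nike', '耐吉', '官方', 'F', '.', 'C', '.', ' ', '男子', '足球',
          '長', '褲', '新款', '標準', '型', ' ', '拒水', ' ', '拉鏈', '褲腳',
          '\\', 'nCD0557']

# Keep only tokens that contain something other than whitespace.
spaced = ' '.join(token for token in tokens if token.strip())
print(spaced)
```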

# Pre-processing: hanlp

The multilingual NLP library for researchers and companies, built on TensorFlow 2.0, for advancing state-of-the-art deep learning techniques in both academia and industry.

https://github.com/hankcs/HanLP

## Tokenization

HanLP will automatically resolve the identifier CTB6_CONVSEG to a URL, then download and unzip it.

In [2]:
import hanlp
tokenizer = hanlp.load('CTB6_CONVSEG')

ModuleNotFoundError: No module named 'hanlp'

In [None]:
tokenizer('商品和服务')
# ['商品', '和', '服务']

Prediction can be made much faster by batching. In the era of deep learning, batched computation usually gives a speed-up that scales roughly linearly with batch_size, so you can predict multiple sentences at once, at the cost of GPU memory.
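
The batching idea can be sketched independently of HanLP. Below, `batched`, `predict_in_batches`, and `fake_tokenizer` are hypothetical names; `fake_tokenizer` stands in for a model that accepts a list of sentences, so the example runs without a GPU or any download.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_in_batches(sentences, predict_batch, batch_size=2):
    """Call predict_batch once per batch and concatenate the results."""
    results = []
    for batch in batched(sentences, batch_size):
        results.extend(predict_batch(batch))
    return results


def fake_tokenizer(batch):
    """Stand-in for a batched model: splits each sentence into characters."""
    return [list(sentence) for sentence in batch]


sentences = ['商品和服务', '希望和平', '百獸卡']
batches = list(batched(sentences, 2))
predictions = predict_in_batches(sentences, fake_tokenizer, batch_size=2)
print(batches)
print(predictions)
```

A larger `batch_size` means fewer model calls but more memory per call, which is the trade-off mentioned above.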

In [None]:
hanlp.pretrained.ALL  # list all pre-trained models
# hanlp.pretrained.* (e.g. hanlp.pretrained.pos, hanlp.pretrained.ner): browse pre-trained models by NLP task

## POS tagging

Note the different POS tags for the same word 希望 ("hope") in the example for this section. The first occurrence means "my dream" as a noun, while the latter means "want" as a verb. This tagger uses fastText as its embedding layer, which is free from out-of-vocabulary (OOV) problems.

In [None]:
tagger = hanlp.load(hanlp.pretrained.pos.CTB5_POS_RNN_FASTTEXT_ZH)
tagger(['我', '的', '希望', '是', '希望', '和平'])
# ['PN', 'DEG', 'NN', 'VC', 'VV', 'NN']

## Named Entity Recognition

The NER component requires tokenized tokens as input, then outputs the entities along with their types and spans.

In [None]:
recognizer = hanlp.load(hanlp.pretrained.ner.MSRA_NER_BERT_BASE_ZH)
recognizer([list('上海华安工业（集团）公司董事长谭旭光和秘书张晚霞来到美国纽约现代艺术博物馆参观。'),
                list('萨哈夫说，伊拉克将同联合国销毁伊拉克大规模杀伤性武器特别委员会继续保持合作。')])
# [[('上海华安工业（集团）公司', 'NT', 0, 12), ('谭旭光', 'NR', 15, 18), ('张晚霞', 'NR', 21, 24), ('美国', 'NS', 26, 28), ('纽约现代艺术博物馆', 'NS', 28, 37)],
#  [('萨哈夫', 'NR', 0, 3), ('伊拉克', 'NS', 5, 8), ('联合国销毁伊拉克大规模杀伤性武器特别委员会', 'NT', 10, 31)]]
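
The spans in the output are character offsets into the input token list, with an exclusive end index. The sketch below checks this against the first sentence and groups the entities by type; it uses only the output shown above and does not call HanLP.

```python
from collections import defaultdict

sentence = '上海华安工业（集团）公司董事长谭旭光和秘书张晚霞来到美国纽约现代艺术博物馆参观。'
chars = list(sentence)  # the recognizer was given one character per token

# Entities copied from the recognizer output above: (text, type, start, end).
entities = [('上海华安工业（集团）公司', 'NT', 0, 12), ('谭旭光', 'NR', 15, 18),
            ('张晚霞', 'NR', 21, 24), ('美国', 'NS', 26, 28),
            ('纽约现代艺术博物馆', 'NS', 28, 37)]

# Slicing the token list with each span should reproduce the entity text.
recovered = [''.join(chars[start:end]) for _, _, start, end in entities]

by_type = defaultdict(list)
for text, label, _, _ in entities:
    by_type[label].append(text)

print(recovered)
print(dict(by_type))
```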

## Pipelines

Since parsers require part-of-speech tagging and tokenization, while taggers expect tokenization to be done beforehand, wouldn't it be nice if we had a pipeline to connect the inputs and outputs, like a computation graph?

In [None]:
# syntactic_parser and semantic_parser are not defined in this notebook; they must be loaded with hanlp.load(...) first
pipeline = hanlp.pipeline() \
    .append(hanlp.utils.rules.split_sentence, output_key='sentences') \
    .append(tokenizer, output_key='tokens') \
    .append(tagger, output_key='part_of_speech_tags') \
    .append(syntactic_parser, input_key=('tokens', 'part_of_speech_tags'), output_key='syntactic_dependencies') \
    .append(semantic_parser, input_key=('tokens', 'part_of_speech_tags'), output_key='semantic_dependencies')
pipeline

In [None]:
print(pipeline(text))  # the output is a JSON-style document

In [None]:
pipeline.save('zh.json')  # saves the pipeline itself, not the output of a call
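
The way `input_key` and `output_key` wire components together can be illustrated with a toy pipeline. This is a sketch of the idea only, not HanLP's implementation: `MiniPipeline` and the three step functions are hypothetical, and each step simply reads named fields from a dictionary and writes one back.

```python
class MiniPipeline:
    """Toy pipeline: each step reads named fields and writes one named field."""

    def __init__(self):
        self.steps = []

    def append(self, func, output_key, input_key=None):
        self.steps.append((func, input_key, output_key))
        return self  # returning self allows chained .append calls

    def __call__(self, text):
        doc = {}
        previous = text
        for func, input_key, output_key in self.steps:
            if input_key is None:
                args = (previous,)  # default: consume the previous output
            elif isinstance(input_key, tuple):
                args = tuple(doc[key] for key in input_key)
            else:
                args = (doc[input_key],)
            previous = func(*args)
            doc[output_key] = previous
        return doc


def split_sentences(text):
    return [s for s in text.split("。") if s]


def char_tokens(sentences):
    return [list(s) for s in sentences]


def token_counts(sentences, tokens):
    return [len(t) for t in tokens]


mini = (MiniPipeline()
        .append(split_sentences, output_key="sentences")
        .append(char_tokens, output_key="tokens")
        .append(token_counts, input_key=("sentences", "tokens"),
                output_key="counts"))
result = mini("商品和服务。希望和平。")
print(result)
```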

## Train own models

Writing DL models is not hard; the real challenge is writing a model that can reproduce the scores reported in papers. The snippet below shows how to train a Chinese word segmentation (CWS) model with 97% F1 on the MSR corpus.

In [None]:
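# NgramConvTokenizer, the SIGHAN2005_MSR_* datasets and CONVSEG_W2V_NEWS_TENSITE_CHAR must be imported from hanlp first
import tensorflow as tf

tokenizer = NgramConvTokenizer()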
save_dir = 'data/model/cws/convseg-msr-nocrf-noembed'
tokenizer.fit(SIGHAN2005_MSR_TRAIN,
              SIGHAN2005_MSR_VALID,
              save_dir,
              word_embed={'class_name': 'HanLP>Word2VecEmbedding',
                          'config': {
                              'trainable': True,
                              'filepath': CONVSEG_W2V_NEWS_TENSITE_CHAR,
                              'expand_vocab': False,
                              'lowercase': False,
                          }},
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.001,
                                                 epsilon=1e-8, clipnorm=5),
              epochs=100,
              window_size=0,
              metrics='f1',
              weight_norm=True)
tokenizer.evaluate(SIGHAN2005_MSR_TEST, save_dir=save_dir)
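
For reference, the F1 score for word segmentation is commonly computed over word spans: a predicted word counts as correct only if both its start and end offsets match a gold word. The sketch below is a minimal version of that computation under this assumption; it is not HanLP's evaluation code, and `word_spans` and `segmentation_f1` are hypothetical names.

```python
def word_spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    spans = set()
    position = 0
    for token in tokens:
        spans.add((position, position + len(token)))
        position += len(token)
    return spans


def segmentation_f1(gold_tokens, predicted_tokens):
    """Harmonic mean of span precision and span recall."""
    gold = word_spans(gold_tokens)
    predicted = word_spans(predicted_tokens)
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = ['商品', '和', '服务']
wrong = ['商品', '和服', '务']  # only 商品 has matching boundaries

perfect_score = segmentation_f1(gold, gold)
partial_score = segmentation_f1(gold, wrong)
print(perfect_score, partial_score)
```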

# TO READ

- https://medium.com/the-artificial-impostor/nlp-four-ways-to-tokenize-chinese-documents-f349eb6ba3c3
- https://github.com/google/sentencepiece
- https://pypi.org/project/hanlp/#description
- https://medium.com/@makcedward/how-subword-helps-on-your-nlp-model-83dd1b836f46
- https://github.com/crownpku/Awesome-Chinese-NLP
- https://towardsdatascience.com/beginners-guide-to-sentiment-analysis-for-simplified-chinese-using-snownlp-ce88a8407efb