# Materials

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import sys
sys.path.append("../")

In [2]:
from IPython.display import Markdown

In [3]:
from MWE2019.utils import tqdm
from MWE2019.materials import Materials

In [4]:
materials = Materials()

Remove NGram frequency < 50: 3588 removed
QIE removed: 0
Idiom removed: 3588
Character count before removal: 12883
Character count after removal: 2824
Remove character not in ngrams: 10059 removed
load CwnNodeVec from cache:  ..\MWE2019\..\data\cache_cwn_node_vec\cwn_node_vec_homophily.pkl
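
The log above reports two filtering steps: ngrams below a frequency threshold are dropped, and then characters that no longer occur in any remaining ngram are dropped. The following is a minimal sketch of that kind of two-stage filter, not the actual `Materials` implementation. The function `filter_materials`, its parameters, and the toy frequencies are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def filter_materials(ngram_freq, char_freq, min_freq):
    """Hypothetical helper: keep ngrams with frequency >= min_freq,
    then keep only the characters that occur in a kept ngram."""
    kept_ngrams = {ng: f for ng, f in ngram_freq.items() if f >= min_freq}
    used_chars = {ch for ng in kept_ngrams for ch in ng}
    kept_chars = {ch: f for ch, f in char_freq.items() if ch in used_chars}
    return kept_ngrams, kept_chars


# Toy data (assumption): these frequencies are invented, not corpus counts.
ngram_freq = {"資料來源": 120, "不可思議": 80, "罕見成語": 3}
char_freq = Counter("資料來源不可思議罕見成語其他")

kept_ngrams, kept_chars = filter_materials(ngram_freq, char_freq, min_freq=50)
print(len(ngram_freq) - len(kept_ngrams), "ngrams removed")
print(len(char_freq) - len(kept_chars), "characters removed")
```

In the log, the same bookkeeping holds for the character counts: 12883 characters before removal minus 10059 removed leaves 2824.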


In [5]:
Markdown(materials.describe())


## Materials of MWE2019
### NGram list (3996 ngrams)
  * QIE (`qies`): 2478 QIEs, e.g. 資料來源, 綜合報導, 警方調查, 主辦單位, 億元台幣 ...  
    These QIEs are selected for high PMI (top 25%) and low enclosing frequency (top 25%), 
    and none of them is an idiom as defined by MOE.
  * Idiom (`idioms`): 1518 idioms, e.g. 不可思議, 不約而同, 迫不及待, 層出不窮, 脫穎而出, ...
  * All of these ngrams (QIEs and idioms) occur 5 or more times in the 1.3 billion corpus.

### Characters 
  * There are 2824 characters used across all ngrams.
  * CWN contains information on 2451 characters.
  * `2451` characters appear both in CWN and in the ngrams, and
    `373` characters appear only in the 
    ngrams and not in CWN.
  * Some characters lack S-vectors because of data problems; only 1983 characters have 
    valid senses and S-vectors.
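
The coverage counts above amount to a set intersection and a set difference. Here is a small sketch with toy character sets; the variable names and the sets themselves are made up for illustration and are not how `Materials` stores its data.

```python
# Toy stand-ins (assumption): the real sets hold 2824 and 2451 characters.
ngram_chars = set("資料來源不可思議")  # characters used by the ngrams
cwn_chars = set("資料來源不可")        # characters that CWN has information on

in_both = ngram_chars & cwn_chars      # covered by CWN and used in ngrams
ngram_only = ngram_chars - cwn_chars   # used in ngrams but missing from CWN

# The two groups partition the ngram characters.
assert len(in_both) + len(ngram_only) == len(ngram_chars)
print(len(in_both), "in both;", len(ngram_only), "only in ngrams")
```

The reported numbers follow the same partition: 2451 + 373 = 2824.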

### Frequency
  * NGram frequency (`ngFreq`): frequency of the QIEs and idioms in the 1.3 billion corpus.
  * Character frequency (`chFreq`): frequency of those `2824` characters, 
    as used by the ngrams, in the same corpus.

### Vector representation
  * Character M-Vector (Morphological Vector)  
    computed from the CWN network with node2vec (`CwnNodeVec`). Each of the 2451 characters 
    is mapped to a vector of length 100.
  * Character S-Vector (Sense Vector)  
    computed from the example sentences of CWN senses, as described in the GWA2019 paper. Each of 
    the 2451 characters is mapped to a vector of length 3072.
  * NGram S-Vector  
    computed from sentences extracted from the corpus. Each of the 3988 ngrams is mapped
    to a vector of length 3072.


In [6]:
materials.charS["我"]

{'05238701': array([ 1.3702985 , -0.42867818,  0.37206087, ..., -1.3290256 ,
         0.23013055,  0.30289397], dtype=float32),
 '05238702': array([ 1.0225841 , -0.50791353,  0.14265805, ..., -0.82802534,
         0.4293771 , -0.3263795 ], dtype=float32),
 '05238703': array([ 0.85308725, -0.30901477, -0.44316593, ..., -0.9147484 ,
        -0.26282156, -0.12963648], dtype=float32)}
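
As the output shows, `materials.charS` returns a dictionary for each character, keyed by what appear to be sense IDs, with one `float32` vector per sense. If a single vector per character is needed, one simple option is to average the sense vectors. The sketch below shows this on synthetic data. Averaging is an assumption made here for illustration, not a step described in the notebook, and `mean_sense_vector` and the toy sense IDs are hypothetical.

```python
import numpy as np


def mean_sense_vector(sense_vectors):
    """Hypothetical helper: average a {sense_id: vector} mapping
    into a single vector of the same length."""
    stacked = np.stack(list(sense_vectors.values()))  # shape: (n_senses, dim)
    return stacked.mean(axis=0)


# Synthetic stand-in (assumption) for an entry such as materials.charS["我"]:
# three senses, each a float32 vector of length 3072.
rng = np.random.default_rng(0)
toy_char_s = {
    sense_id: rng.normal(size=3072).astype(np.float32)
    for sense_id in ("sense_a", "sense_b", "sense_c")
}

char_vec = mean_sense_vector(toy_char_s)
print(char_vec.shape)
```

With the notebook's objects, the same call would be `mean_sense_vector(materials.charS["我"])`.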

In [7]:
materials.charM["我"]

array([ 1.4744961 ,  0.4337109 ,  1.4106735 , -2.2810156 ,  0.35942197,
        0.4173864 ,  0.1422289 , -0.1577536 , -1.1994652 ,  0.55580264,
        2.4018683 ,  0.39741367,  0.8287281 , -1.0472449 ,  0.7825135 ,
       -0.43116054,  0.6863932 , -1.7577618 , -0.38924882, -0.9499124 ,
        0.5906449 ,  1.3905882 , -0.07124268,  0.9719028 , -0.18468164,
        0.47104996,  0.99415255,  0.10736081, -1.1961255 , -1.1752406 ,
        0.43250856, -1.4576637 , -0.02280311,  0.72732574, -0.47921476,
        0.25154743, -0.61630523, -1.1154647 ,  0.8614917 , -0.8633631 ,
       -0.06870596,  0.23455188, -0.13793357, -0.75500125,  1.890468  ,
       -0.12898102, -1.3052497 , -0.1247216 , -0.84704137, -1.9184508 ,
       -1.231895  , -0.82128197, -2.1493657 ,  0.06145079,  1.2174122 ,
        0.53087527, -0.06972771,  0.12918875, -0.5679427 , -0.9584813 ,
       -1.1210693 ,  1.1679254 , -0.23634249,  2.3687775 ,  0.4184395 ,
        0.39574358,  0.67770433, -0.28224826, -1.2258294 ,  0.02

In [8]:
materials.ngramS["層出不窮"]

array([ 0.4420729 ,  0.21298479, -0.22277981, ..., -0.16774593,
        0.169826  ,  0.62633693], dtype=float32)

In [9]:
materials.ngramS["主辦單位"]

array([ 0.52642787,  0.22308645,  0.20355526, ...,  0.13169022,
       -0.77774876,  0.3822952 ], dtype=float32)
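
Since every ngram S-vector has the same length, two ngrams can be compared directly. A common choice for such embeddings is cosine similarity. The sketch below is illustrative only: `cosine_similarity` is a hypothetical helper, not part of `MWE2019`, and the vectors are random stand-ins for the real S-vectors.

```python
import numpy as np


def cosine_similarity(u, v):
    """Hypothetical helper: cosine of the angle between two 1-D vectors."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(u, v) / denom)


# Random stand-ins (assumption) for two entries of materials.ngramS.
rng = np.random.default_rng(42)
vec_a = rng.normal(size=3072).astype(np.float32)
vec_b = rng.normal(size=3072).astype(np.float32)

print(cosine_similarity(vec_a, vec_a))  # a vector compared with itself
print(cosine_similarity(vec_a, vec_b))  # two unrelated random vectors
```

With the notebook's objects, the call would be `cosine_similarity(materials.ngramS["層出不窮"], materials.ngramS["主辦單位"])`.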