#### CSCE 670 :: Information Storage and Retrieval :: Texas A&M University :: Spring 2020


# Spotlight - Ben D'Antonio

*Summary of this Spotlight:* This spotlight covers Tencent AI Lab's Embedding Corpus for Chinese Words and Phrases. Tencent is a Chinese conglomerate with many subsidiaries specializing in Internet-related technology. Tencent has released an NLP dataset of more than 8 million Chinese words and phrases, each embedded with machine learning as a vector of 200 feature values. Features include freshness, popularity, and domain-specific features in addition to traditional embedding features.

*Contents:* This spotlight will show where to obtain the dataset and how to get started with it, demonstrate a simple model to show the dataset's effectiveness, and provide information, resources, and discussion about the dataset.

## Resources

**Data Source URL**:
* https://ai.tencent.com/ailab/nlp/embedding.html

**Brief article discussing the dataset**:
* https://medium.com/syncedreview/tencent-ai-lab-open-sources-8m-word-chinese-nlp-vector-dataset-564764b1abc8

**A previous implementation of the dataset**:
* https://github.com/BridgeMia/Tencent-Word2Vec-Augmentation

## Getting started with Tencent's Chinese Word Corpus

Let's start by grabbing the file! Navigate to https://ai.tencent.com/ailab/nlp/embedding.html if you haven't yet and download the dataset. After it is completely unzipped, it should be about 13GB. Place it in the same directory as this notebook. By default, its file name should match what's written in the code below. Now you're ready!

The dataset format is as follows:
* [\# of terms] [\# of features]
* [first term] [first feature] [second feature] ...
* [second term] [first feature] [second feature] ...
* ...

A pretty simple format!

The file contains 8,824,330 terms, where each term has exactly 200 feature values that line up across all terms, with each value being in the range `(-1, 1)`.
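
To make the layout concrete, here is a minimal sketch that parses a tiny in-memory sample in the same format. The sample values, the three-feature width, and the helper name `parse_embeddings` are illustrative assumptions, not taken from the real file. Splitting each line from the right is also an assumption, chosen so that a term containing a space would stay in one piece.

```python
import io
import numpy as np

# Hypothetical sample: a header line, then one term per line with 3 features.
sample = io.StringIO(
    "2 3\n"
    "电话 0.1 -0.2 0.3\n"
    "短信 0.4 0.5 -0.6\n"
)

def parse_embeddings(handle):
    """Read the header, then map each term to its feature vector."""
    num_terms, num_feats = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        # Split from the right so the last num_feats tokens are the features.
        parts = line.rstrip("\n").rsplit(" ", num_feats)
        if len(parts) != num_feats + 1:
            continue  # skip lines that do not match the expected layout
        vectors[parts[0]] = np.asarray(parts[1:], dtype=float)
    return num_terms, num_feats, vectors

num_terms, num_feats, vectors = parse_embeddings(sample)
print(num_terms, num_feats, sorted(vectors))
```

The header tells us how many features to expect per line, so the parser does not need the value 200 hard-coded.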

We can parse it into a dictionary as follows. The dataset is far too large to load in its entirety here, so we will use a small subset from the start of the file: the first 10,000 entries. Despite this small size, we will see that we can still generate compelling results. After running the code blocks below, we will be able to see the indexed terms and each term's first feature value.

In [None]:
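import numpy as np

numFeats = 200       # feature values per term
numEntries = 10000   # number of terms to load from the start of the file
numFormatting = 1    # header lines to skip ("[# of terms] [# of features]")

database = dict()

def loadDB():
    # The with-statement closes the file even though we stop reading early.
    with open('Tencent_AILab_ChineseEmbedding.txt', 'r', encoding="utf8") as dataFile:
        for cnt, line in enumerate(dataFile):
            if cnt < numFormatting:
                continue  # skip the header line
            if cnt >= numFormatting + numEntries:
                break
            itemList = line.split()
            # The first token is the term; the remaining tokens are its feature values.
            database[itemList[0]] = np.asarray(itemList[1:], dtype='float')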

In [None]:
loadDB()

for t, f in database.items():
    print(t, f[0])

**Please note**: If some or all of the Chinese characters (hanzi) fail to display in the output, you may have an encoding issue in your local environment. This notebook was developed on Windows.

As we can see scrolling down, the dataset has catalogued just about everything, from individual characters and punctuation, to complex terms, emoticons, and even slang!

Now, let's create a simple model to demonstrate the dataset's embeddings. We will rank terms by their cosine distance to a few example words. SciPy's `distance.cosine` returns the cosine distance, which is 1 minus the cosine similarity, so a score of 0 means the two embeddings point in exactly the same direction. Thus, our top results (the words most similar to our chosen word) will be the embeddings with the lowest cosine distance.
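
As a small worked example of the scoring, the sketch below computes with plain NumPy the same quantity that `scipy.spatial.distance.cosine` returns: 1 minus the cosine of the angle between two vectors. The helper name `cosine_distance` and the toy vectors are illustrative assumptions.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); assumes neither vector is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

same_direction = cosine_distance([1, 2], [2, 4])   # parallel vectors
orthogonal = cosine_distance([1, 0], [0, 1])       # perpendicular vectors
opposite = cosine_distance([1, 0], [-1, 0])        # opposite directions

print(same_direction, orthogonal, opposite)
```

Parallel vectors score approximately 0, perpendicular vectors 1, and opposite vectors 2, which is why the lowest scores mark the most similar terms. Vector length does not affect the score.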

In [None]:
import numpy as np
import scipy.spatial as sSpatial

def cosineAccessDatabase(term, numResults):
    # Raises KeyError if the term is not among the loaded entries.
    qFeats = database[term]

    # Cosine distance from the query to every other well-formed entry.
    scores = dict()
    for dTerm, dFeats in database.items():
        if dTerm == term or len(dFeats) != numFeats:
            continue  # skip the query itself and malformed entries
        scores[dTerm] = sSpatial.distance.cosine(qFeats, dFeats)

    # Smallest distance first, i.e. most similar terms first.
    return sorted(scores.items(), key=lambda item: item[1])[:numResults]
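
The function above calls SciPy once per entry, which is fine for 10,000 terms. For larger subsets, one possible alternative is to stack the vectors into a matrix and compute all distances at once. The sketch below is an assumption-laden illustration, not the notebook's method: the name `top_k_similar` and the toy two-feature vectors are hypothetical, and it assumes every vector has the same length and is non-zero.

```python
import numpy as np

def top_k_similar(term, vectors, k):
    """Return the k terms closest to `term` by cosine distance."""
    terms = list(vectors)
    matrix = np.vstack([vectors[t] for t in terms])
    # Normalise each row so that a dot product equals the cosine similarity.
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    distances = 1.0 - unit @ unit[terms.index(term)]
    order = np.argsort(distances)  # smallest distance first
    ranked = [(terms[i], float(distances[i])) for i in order if terms[i] != term]
    return ranked[:k]

# Hypothetical toy embeddings with two features each.
toy = {
    "a": np.array([1.0, 0.0]),
    "b": np.array([0.9, 0.1]),
    "c": np.array([0.0, 1.0]),
    "d": np.array([-1.0, 0.0]),
}
nearest = top_k_similar("a", toy, 2)
print(nearest)
```

The return value has the same shape as the loop version: a list of `(term, distance)` pairs ordered from most to least similar.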

## Examples

Let's look at some examples! As I don't speak Chinese, I used Google Translate to see roughly what the words mean. Even so, the results generally demonstrate obvious connections. The examples shown were not cherry-picked for returning strong results; I just chose a few words of varying length from different parts of speech.

### 1. '电话' : "Phone"

In [None]:
cosineAccessDatabase('电话', 5)

**Results**:
* '打电话': "Call",
* '短信': "SMS",
* '拨打': "Dial",
* '联系方式': "contact details",
* '关机': "Shutdown"

### 2. '。' : "." (Chinese period punctuation mark)

In [None]:
cosineAccessDatabase('。', 5)

**Results**:
* '的': "Of",
* '也': "and also",
* '和': "with",
* '而': "and",
* ',': ","

### 3. '轻轻的' : "Gently"

In [None]:
cosineAccessDatabase('轻轻的', 5)

**Results**:
* '轻轻地': "Gently",
* '轻轻': "lightly",
* '轻声': "Softly",
* '伸手': "Reach out",
* '微微': "pico-"

### 4. '情人节' : "Valentine's Day"

In [None]:
cosineAccessDatabase('情人节', 5)

**Results**:
* '圣诞节': "Christmas",
* '节日': "festival",
* '生日': "birthday",
* '元旦': "New year's day",
* '春节': "Chinese New Year"

### 5. '希望能够' : "Hope to"

In [None]:
cosineAccessDatabase('希望能够', 5)

**Results**:
* '希望能': "hope to",
* '希望': "hope",
* '能够': "were able",
* '如果能': "If possible",
* '争取': "Fight for"

## Discussion

It seems that even with a simple cosine similarity model, we can generate compelling results with this dataset. The embeddings seem to capture not only the notion of words that belong to similar activities but also more abstract semantics.

* The first example shows that handling common nouns such as phones is a fairly easy task.
* The punctuation mark results were interesting because the embeddings seem to treat punctuation and conjunctions as extremely similar. In many sentences, a period can be substituted for a conjunction, so this is an interesting connection.
* Clearly the system can also handle adverbs. The first three results started with the same character, 轻, which I found interesting. I'm not quite sure what to make of this since I don't speak Chinese. Grabbing a prefix "pico-" is also notable.
* "Valentine's Day" returned similar holidays and festivities as opposed to something like "hearts" or "love." Perhaps none of those terms appeared in the first 10,000 entries.
* The semantics of "Hope to" seem aptly captured, returning not only similar words but also the beginning of a subjective clause expressing hope ("If possible"). 
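
The guess in the Valentine's Day bullet, that related terms may simply be missing from the loaded subset, is easy to check with a membership test. The sketch below uses a hypothetical stand-in vocabulary so that it runs on its own; in the notebook, the loaded `database` dictionary could be passed in its place. The helper name `split_by_vocabulary` and the candidate terms ('爱' for "love", '心' for "heart") are illustrative assumptions.

```python
def split_by_vocabulary(candidates, vocabulary):
    """Separate candidate terms into those present in and absent from the vocabulary."""
    present = [t for t in candidates if t in vocabulary]
    missing = [t for t in candidates if t not in vocabulary]
    return present, missing

# Hypothetical stand-in for the loaded dictionary of term -> vector.
toy_vocabulary = {"情人节": None, "节日": None}

present, missing = split_by_vocabulary(["情人节", "爱", "心"], toy_vocabulary)
print("present:", present)
print("missing:", missing)
```

Any term that lands in `missing` cannot appear in the results, no matter how close its embedding would be.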

I also tested the English word "on" and got back other English function words such as "with" and "and". Lastly, I tested a Chinese emoticon ('xx'). The emoticon is generally used to express dissatisfaction with a situation. The results included '某某' ("So and so") and '……', suggesting that the embeddings can capture the semantics of emoticons.

I regret that I could not intentionally grab slang terms to evaluate, but maybe someone who speaks Chinese well can evaluate that on their own.

**And that's it!** Even with an extremely small subset of the dataset, a simple cosine model can generate compelling results. I hope this spotlight has given a thorough overview of Tencent's dataset and provided some motivation for further exploring it. 

Feel free to try your own terms or change the number of entries loaded!

In [None]:
cosineAccessDatabase('[your term]', 5)