# European Summer School in Chinese Digital Humanities

## Word embeddings
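In this notebook I will introduce a script that will allow you to train and explore word embeddings by only changing a few options. This notebook will train a word2vec model on a corpus with gensim, query it for similar tokens, and visualize the resulting vectors in three dimensions using principal component analysis (PCA).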

### The imports
There are a number of items from various Python libraries that we need to import to conduct the analysis we are interested in. It is, of course, possible for us to write all of the necessary code ourselves, but it is much better to rely on tools that other people have already created.

In [None]:
# Libraries for loading data
import os, re

# Libraries for analysis
import gensim
import pandas as pd

# Library for visualization
import plotly.express as px
from sklearn.decomposition import PCA

# Custom local modules with useful utilities
from clean import clean # for cleaning the text
from totrad import Convert # to convert to traditional characters

# set gdrive_loc (this is only necessary when using these notebooks via Google Colab;
# set it to "" when running locally)
gdrive_loc = "My Drive/europeanchinesedh-main"

### Set analysis options

#### corpus_folder_name
Provide the name of the corpus folder as a string. Leave it as "demo_corpus" to use the supplied corpus.

#### convert_to_traditional
Set this to False to leave the characters in the files unchanged. Set it to True if you would like the text to be automatically converted to traditional characters.

In [None]:
corpus_folder_name = "demo_corpus"

convert_to_traditional = False

From this point on you don't need to change any of the code to run the analysis, but you are welcome to mix things up if you like.

### Tokenization note
Note that in this case we are using a custom tokenization routine, because gensim expects its input in a different shape from the one we saw while using sklearn. It expects a list of lists: the outer list holds all of the sentences in the corpus, and each inner list holds the tokens of one sentence.
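
To make that shape concrete, here is a minimal standalone sketch of character-level tokenization applied to a short two-sentence string. The helper names `split_sentences` and `char_tokens` and the sample text are illustrative assumptions, not part of the notebook's own code, and this sketch drops the empty fragment that splitting leaves after the final punctuation mark.

```python
import re

# Hypothetical helper names, used for illustration only.
def split_sentences(text, sent_div=r"[。！？]"):
    # Split on sentence-final punctuation and drop empty or whitespace-only pieces.
    return [s.strip() for s in re.split(sent_div, text) if s.strip()]

def char_tokens(text, n=1):
    # Cut each sentence into consecutive, non-overlapping chunks of n characters.
    return [[s[i:i + n] for i in range(0, len(s), n)] for s in split_sentences(text)]

sample = "學而時習之。不亦說乎？"
corpus = char_tokens(sample)

# One inner list per sentence, one string per token.
print(corpus)
```

With `n=1` every character becomes its own token. Passing `n=2` would instead cut each sentence into non-overlapping two-character chunks, with a shorter final chunk when the sentence length is odd.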

In [None]:
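def sent_tokenize(text, sent_div=r"[。！？]"):
    # split on sentence-final punctuation and drop empty fragments
    return [sent for sent in re.split(sent_div, text) if sent.strip()]

def tokenize(text, n=1):
    # one inner list per sentence, holding consecutive n-character tokens
    return [[sentence[i:i+n] for i in range(0, len(sentence), n)] for sentence in sent_tokenize(text)]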

ignore = {".DS_Store", ".txt"}

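sentences = []

# create the converter only if it is needed
# (assumes Convert takes no constructor arguments)
if convert_to_traditional:
    c = Convert()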

if gdrive_loc:
    corpus_folder_name = f"{gdrive_loc}/{corpus_folder_name}"

for root, dirs, files in os.walk(corpus_folder_name):
    for filename in files:
        if filename not in ignore:
            with open(os.path.join(root,filename), 'r', encoding='utf8') as rf:
                # gensim's word2vec model expects sentences that contain words,
                # so each file is passed through the tokenize function defined above

                # if convert_to_traditional is set to True, convert
                if convert_to_traditional:
                    text = c.to_trad(rf.read())
                else:
                    text = rf.read()
            
                sentences.extend(tokenize(text))


# create and train the model; the options below can be tuned
word2Vec_model = gensim.models.Word2Vec(
                    sentences, # input sentences
                    sg=1, # use skip-gram (0 = CBOW); skip-gram tends to work better on smaller corpora
                    vector_size=100, # the number of dimensions in each vector
                    min_count=5 # ignore tokens that appear fewer than this many times
                    )

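# we can continue training the model with train
# (tokens that are not already in the vocabulary are ignored
# unless build_vocab(..., update=True) is called first)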
# word2Vec_model.train([["four", "score", "and", "seven", "years", "ago"]], 
#                     total_examples=word2Vec_model.corpus_count,
#                     epochs=word2Vec_model.epochs)

# Let's save the model to file!
# word2Vec_model.save('my_vecs.p')

# let's take a look at some of the methods you can use:
# most similar
print(word2Vec_model.wv.most_similar('之'))

# calculate similarity (cosine similarity)
# distance will calculate cosine distance (1 - cosine similarity)
print(word2Vec_model.wv.similarity('之', '的')) 
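
Cosine similarity, which is what `similarity` reports, compares the direction of two vectors and ignores their length; the matching cosine distance is one minus that value. The sketch below works through the arithmetic in plain Python on made-up three-dimensional vectors rather than vectors from the trained model. The function names are illustrative and are not gensim's API.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    # Cosine distance is defined as one minus cosine similarity.
    return 1.0 - cosine_similarity(u, v)

vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a, twice the length
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # 0: nothing in common
print(cosine_distance(vec_a, vec_c))
```

Because length is ignored, a frequent token and a rare token can still score as highly similar as long as their vectors point the same way.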



In [None]:
# dimensionality reduction for visualization
my_pca = PCA(n_components=3)

word_vecs = word2Vec_model.wv
# use a list so the vocabulary can serve as the DataFrame index and the plot labels
vocab = list(word_vecs.key_to_index)
vecs = [word_vecs[word] for word in vocab]
    
transformed_data = my_pca.fit_transform(vecs)
df = pd.DataFrame(index=vocab, columns=['Dim1', 'Dim2', 'Dim3'], data=transformed_data)


In [None]:
fig = px.scatter_3d(df, x='Dim1', y='Dim2', z='Dim3', text=vocab)
fig.show()