# European Summer School in Chinese Digital Humanities

## Stylometry: HCA
In this notebook I will introduce a script that lets you conduct stylometric analysis by changing only a few options. The notebook performs hierarchical cluster analysis (HCA).

### The imports
There are a number of items from various Python libraries that we need to import to conduct the analysis we are interested in. It is, of course, possible to write all of the necessary code ourselves, but it is much preferable to rely on what other people have already created.

In [24]:
# Library for loading data
import os

# Libraries for analysis
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from scipy.cluster.hierarchy import linkage

# Library for visualization
import plotly
from plotly.figure_factory import create_dendrogram

# Custom local modules with useful utilities
from clean import clean # for cleaning the text
from totrad import Convert # to convert to traditional characters

### Set analysis options

#### corpus_folder_name
Provide the name of the corpus folder as a string. Leave as "demo_corpus" to use the supplied corpus.

#### analysis_vocab_file
If you want to use a custom set of words for the analysis, provide the name of a text file that contains the words, one word per line. This should be a string like
"analysis_vocab.txt"

#### most_common_words
Set the number of most common terms to use for your analysis. This is ignored if you provide a vocab file. Set it to None to analyze every term in the corpus. Otherwise, this should be an integer like
100

#### convert_to_traditional
Set this to False to leave the characters in the files unmodified. Set it to True if you would like to automatically convert the texts to traditional characters.

In [25]:
corpus_folder_name = "demo_corpus"

analysis_vocab_file = None

most_common_words = 100

convert_to_traditional = False
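If you set `analysis_vocab_file`, the file is expected to hold one term per line. As a minimal sketch of that format, the snippet below writes a small vocabulary file to a temporary folder and reads it back in the same way the loading cell does. The four characters are illustrative assumptions, not a recommended vocabulary.

```python
import os
import tempfile

# Hypothetical vocabulary: four illustrative characters, one per line in the file
example_terms = ["之", "也", "而", "其"]

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "analysis_vocab.txt")

    # Write one term per line; the trailing newline is harmless
    with open(vocab_path, "w", encoding="utf8") as wf:
        wf.write("\n".join(example_terms) + "\n")

    # Read the file back, skipping empty lines
    with open(vocab_path, "r", encoding="utf8") as rf:
        vocab = [line for line in rf.read().split("\n") if line != ""]

print(vocab)
```

Empty lines are dropped, so a blank line at the end of the file does not become a term.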

In [27]:
# create containers for the data
texts, titles, dynasties, sikus, subcats, authors = [], [], [], [], [], []

if convert_to_traditional:
    c = Convert(preserve_multiple=False)

if analysis_vocab_file:
    with open(analysis_vocab_file, 'r', encoding='utf8') as rf:
        vocab = [v for v in rf.read().split("\n") if v != ""]
else:
    vocab = None


    
# go through every file in the corpus folder
for root, dirs, files in os.walk(corpus_folder_name):
    for fname in files:
        # check that the file in question is a .txt file
        if fname.endswith(".txt"):
            # open the file
            with open(os.path.join(root, fname), 'r', encoding='utf8') as rf:

                # read and clean the file
                text = clean(rf.read())

                # if convert_to_traditional is set to True, convert
                if convert_to_traditional:
                    text = c.to_trad(text)

                # append to the texts list
                texts.append(text)

                # extract the metadata from the name of the file
                title, dynasty, siku, subcat, author = fname.replace(".txt", "").split("_")

                # append the metadata to the appropriate label list
                titles.append(title)
                dynasties.append(dynasty)
                sikus.append(siku)
                subcats.append(subcat)
                authors.append(author)
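The loading cell relies on every filename having exactly five fields joined by underscores (title, dynasty, siku, subcat, author). If you add your own files, it may help to check the names first. The following is a minimal sketch. `parse_filename` and the sample filename are hypothetical and not part of the notebook's utilities.

```python
def parse_filename(fname):
    """Split 'title_dynasty_siku_subcat_author.txt' into a metadata dict."""
    fields = fname.replace(".txt", "").split("_")
    if len(fields) != 5:
        raise ValueError(
            f"expected 5 underscore-separated fields, got {len(fields)}: {fname!r}"
        )
    keys = ("title", "dynasty", "siku", "subcat", "author")
    return dict(zip(keys, fields))


# Hypothetical filename that follows the convention
meta = parse_filename("SampleTitle_Ming_Zi_Xiaoshuo_SampleAuthor.txt")
print(meta["title"], meta["author"])
```

A filename with too few or too many underscores raises a `ValueError` that names the offending file, which is easier to trace than an unpacking error inside the loop.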






In [None]:
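# Run the analysis
# Build L2-normalized character-count vectors (use_idf=False turns off IDF weighting).
# If vocab is provided, max_features is ignored and only the listed terms are counted.
vectorizer = TfidfVectorizer(vocabulary=vocab, max_features=most_common_words, use_idf=False, analyzer="char")

# One row per text, one column per term
vecs = vectorizer.fit_transform(texts)

# Pairwise Euclidean distances between all texts
distances = euclidean_distances(vecs)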
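To make the analysis cell less of a black box, here is a rough sketch in plain Python of what it computes: pick the most frequent characters in the corpus, count them in each text, scale each count vector to unit length, and measure Euclidean distances between the vectors. This only approximates the scikit-learn pipeline (its own preprocessing, such as lowercasing and whitespace handling, is not reproduced), and the toy texts and `top_n` are illustrative assumptions.

```python
import math
from collections import Counter

# Toy corpus (illustrative strings, not taken from the demo corpus)
toy_texts = ["之之之也也而", "之之也也也而", "而而而之其其"]
top_n = 3  # plays the role of most_common_words

# Pick the top_n most frequent characters across the whole corpus
corpus_counts = Counter("".join(toy_texts))
features = sorted(ch for ch, _ in corpus_counts.most_common(top_n))


def char_vector(text):
    """Count the feature characters in text and scale the vector to unit length."""
    counts = Counter(text)
    raw = [counts[ch] for ch in features]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


vectors = [char_vector(t) for t in toy_texts]

# Square distance matrix: entry [i][j] is the distance between text i and text j
toy_distances = [[euclidean(a, b) for b in vectors] for a in vectors]
for row in toy_distances:
    print([round(d, 3) for d in row])
```

The first two toy texts use the same characters in similar proportions, so the distance between them is smaller than the distance from either to the third text.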

In [30]:
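# Visualize the results as a dendrogram
# Note: create_dendrogram treats its first argument as a matrix of observations
# and applies its own distance function (SciPy's pdist by default) before linkage
fig = create_dendrogram(distances, labels=titles, orientation="right", linkagefun=lambda x: linkage(x, "ward"))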


fig.update_layout(width=800, height=800)

# Edit yaxis2
fig.update_layout(yaxis2={
    'showticklabels': True,
    'ticks':""
})

fig.show()
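The clustering step behind the dendrogram can also be inspected without plotting. The sketch below uses SciPy directly on four made-up two-dimensional vectors (the labels and values are illustrative assumptions): `pdist` gives the condensed pairwise distances, `linkage` with `"ward"` builds the merge tree, and `dendrogram` with `no_plot=True` returns the leaf order.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# Four hypothetical texts as 2-D feature vectors: two "A" texts and two "B" texts
toy_vectors = np.array([
    [1.0, 0.0],  # A1
    [0.9, 0.1],  # A2
    [0.0, 1.0],  # B1
    [0.2, 0.8],  # B2
])
toy_labels = ["A1", "A2", "B1", "B2"]

# Condensed pairwise Euclidean distances between the observations
condensed = pdist(toy_vectors, metric="euclidean")

# Ward linkage: each row of Z records one merge (cluster i, cluster j, height, size)
Z = linkage(condensed, method="ward")

# Compute the dendrogram layout without drawing it
tree = dendrogram(Z, labels=toy_labels, no_plot=True)
leaf_order = tree["ivl"]

print(Z)
print(leaf_order)
```

Each row of `Z` is one merge, so reading it from top to bottom shows which texts were joined first and at what height.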