# European Summer School in Chinese Digital Humanities

## Stylometry: HCA
In this notebook I will introduce a script that allows you to conduct stylometric analysis by changing only a few options. This notebook performs hierarchical cluster analysis (HCA).

### The imports
There are a number of items from various Python libraries that we need to import to conduct the analysis we are interested in. It is, of course, possible for us to write all of the necessary code ourselves, but it is much preferable to rely on things that other people have created for us.

In [None]:
# Library for loading data
import os

# Libraries for analysis
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

# Library for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Custom local modules with useful utilities
from clean import clean # for cleaning the text
from totrad import Convert # to convert to traditional characters

### Set analysis options

#### corpus_folder_name
Provide the name of the corpus folder as a string. Leave as "demo_corpus" to use the supplied corpus.

#### analysis_vocab_file
If you want to provide a custom set of words to use for the analysis, provide the name of a text file that contains the words, one word to a line. This should be a string like
"analysis_vocab.txt"

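As a rough illustration of the expected file layout, the sketch below writes a small vocab file with one term per line and reads it back by splitting on newlines and dropping empty lines. The three terms and the temporary folder are assumptions made purely for the example.

```python
import os
import tempfile

# Hypothetical terms chosen only for illustration
terms = ["之", "也", "而"]

# Write one term per line (a temporary folder keeps this sketch self-contained)
vocab_path = os.path.join(tempfile.mkdtemp(), "analysis_vocab.txt")
with open(vocab_path, "w", encoding="utf8") as wf:
    wf.write("\n".join(terms) + "\n")

# Read the file back, skipping empty lines such as the one after the final newline
with open(vocab_path, "r", encoding="utf8") as rf:
    vocab = [line for line in rf.read().split("\n") if line != ""]

print(vocab)
```
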
#### most_common_words
Set the number of most common terms to use for your analysis. This is ignored if you provide a vocab file. Set this to None to analyze every term in the corpus; otherwise it should be an integer like
100

#### n_gram:
By default this is set to work on <i>n</i>-grams where n is 1, meaning individual characters will be at the root of the analysis. You are welcome to play around with this as you see fit. The higher the n, the sparser the data.
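
To make the sparsity point concrete, here is a minimal plain-Python sketch of character n-grams. It is not scikit-learn's implementation, and the helper name `char_ngrams` and the sample sentence are assumptions for illustration only.

```python
def char_ngrams(text, n):
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# A short toy text with punctuation removed (chosen for its repeated characters)
sample = "知之為知之不知為不知是知也"

for n in (1, 2, 3):
    grams = char_ngrams(sample, n)
    # total n-grams versus distinct n-grams
    print(n, len(grams), len(set(grams)))
```

In this toy text the share of distinct n-grams grows as n increases, so each individual feature is observed less often. That is the sparsity described above.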

#### convert_to_traditional
Set this to False to leave the characters in the files unmodified. Set it to True if you would like to automatically convert them to traditional characters.


#### label_types
This is a tuple of strings that describe the metadata convention you are using to name your corpus. This assumes that you have followed the naming convention I suggested in my overview.
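
Judging from the loading code in this notebook, which strips the `.txt` extension and splits the filename on underscores, the metadata can be recovered roughly as sketched below. The filename is made up for illustration and is not a file from the demo corpus.

```python
label_types = ("title", "dynasty", "siku", "subcat", "author")

# Hypothetical filename, invented for this example
fname = "SomeTitle_Tang_Ji_Poetry_SomeAuthor.txt"

# Drop the ".txt" extension and split on underscores
values = fname[:-4].split("_")

# A length mismatch means the filename does not follow the convention
assert len(values) == len(label_types)

# Pair each label type with the value found in the filename
metadata = dict(zip(label_types, values))
print(metadata)
```

A quick check like the `assert` above can catch misnamed files before they cause an index error later in the analysis.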

#### color_value:
Here you will set the INDEX of the feature you want to highlight in color. Remember that Python indexing starts at 0, so for the default demo corpus 0 will be title, 1 will be dynasty, 2 will be siku, 3 will be subcat, and 4 will be author. It is good practice to use color only when there are relatively few unique values, so I would suggest only using 1, 2, or 3 here.

#### label_value:
This is the INDEX of the feature you want to use as the leaf label in the dendrogram. The same indexes as color_value apply, but any of them are fine. I generally find title to be the most informative, but you could easily also pair siku color with dynasty label (or vice versa).

In [None]:
corpus_folder_name = "demo_corpus"

analysis_vocab_file = None

most_common_words = 100

n_gram = 1

convert_to_traditional = False

# Types of labels for documents in the corpus
# This must match your metadata naming scheme!
label_types = ('title', 'dynasty', 'siku', 'subcat', 'author') # tuple with strings

# Some of these labels will set the color used to differentiate the points in the plot.
# The label at this index is used to set Color:
color_value = 3 # Index of label to use for color (integer). Here 3 points to "subcat"

# What do you want to use to label the dendrogram?
label_value = 0 # 0 selects title

# Set the font to use for the graph. By default matplotlib does not work well with Chinese, so you will have to set this manually.
# For Macs: Heiti TC
# For Windows: SimHei
font_to_use = "Heiti TC"

From this point on you don't need to change any of the code to run the analysis, but you are welcome to mix things up if you like.

In [None]:
# create containers for the data
texts, labels = [], []

# Set up text converter if so desired
if convert_to_traditional:
    c = Convert(preserve_multiple=False)

if analysis_vocab_file:
    with open(analysis_vocab_file, 'r', encoding='utf8') as rf:
        vocab = [v for v in rf.read().split("\n") if v != ""]
else:
    vocab = None


    
# go through every file in the corpus folder
for root, dirs, files in os.walk(corpus_folder_name):
  for fname in files:
    # check that the file in question is a .txt file
    if fname.endswith(".txt"):
      # open the file
      with open(os.path.join(root, fname), 'r', encoding='utf8') as rf:

        # clean filename
        fname = fname[:-4]
            
        # read and clean the file
        text = clean(rf.read())

        # if convert_to_traditional is set to True, convert
        if convert_to_traditional:
            fname = c.to_trad(fname)
            text = c.to_trad(text)
        
        # append to text list
        texts.append(text)

        # extract the metadata from the name of the file
        labels.append(fname.split("_"))


In [None]:
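# Run the analysis: character n-gram frequencies (use_idf=False disables IDF weighting).
# Note that scikit-learn ignores max_features when a vocabulary is supplied.
vectorizer = TfidfVectorizer(
    ngram_range=(n_gram, n_gram),
    vocabulary=vocab,
    max_features=most_common_words,
    use_idf=False,
    analyzer="char",
)

vecs = vectorizer.fit_transform(texts)
vecs = vecs.toarray()  # pdist needs a dense array
# pairwise Euclidean distances between documents
distances = pdist(vecs, metric='euclidean')

# build the cluster hierarchy with Ward's method
linkages = linkage(distances, 'ward')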

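With `use_idf=False` and the vectorizer's default L2 normalization, each document becomes a vector of term counts scaled to unit length, and the documents are then compared by Euclidean distance. The plain-Python sketch below imitates that idea on toy data. The helper names, the two-character vocabulary, and the toy documents are all assumptions for illustration, not the library's internals.

```python
import math
from collections import Counter

def unit_frequency_vector(text, vocab):
    """Count each vocab character in text, then scale the counts to length 1."""
    counts = Counter(text)
    raw = [counts[ch] for ch in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw] if norm else raw

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy "documents", invented for illustration
docs = {"A": "之之之也", "B": "之之之之之之也也", "C": "也也也之"}
vocab = ["之", "也"]

vectors = {name: unit_frequency_vector(text, vocab) for name, text in docs.items()}

d_ab = euclidean(vectors["A"], vectors["B"])  # same proportions, different lengths
d_ac = euclidean(vectors["A"], vectors["C"])  # opposite proportions
print(d_ab, d_ac)
```

Because of the normalization, A and B are treated as nearly identical even though B is twice as long, so a clustering built on these distances would be expected to join them before either is joined with C.
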
In [None]:



plt.rcParams['font.sans-serif'] = [font_to_use]
plt.rcParams["figure.figsize"] = (10,15)
dendrogram(linkages, labels=[l[label_value] for l in labels], orientation="left", leaf_font_size=14)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
plt.tight_layout()

# let's get an axis object so we can manipulate the color of the labels
ax = plt.gca()
# get the labels
tick_labels = ax.get_ymajorticklabels()

# find all the unique values of the feature used for color
# (sorted so that the color assignment is the same on every run)
unique_label_values = sorted(set(l[color_value] for l in labels))

# map each unique value to a color
colorpalette = sns.color_palette("husl", len(unique_label_values)).as_hex()
color_dictionary = dict(zip(unique_label_values, colorpalette))

# the tick labels follow the leaf order of the dendrogram, so look up each
# leaf's document directly; this stays correct even when several documents
# share the same label text (e.g. when labeling by dynasty)
leaf_order = dendrogram(linkages, no_plot=True)["leaves"]

# set the color of each label based on the color dictionary
for label, doc_index in zip(tick_labels, leaf_order):
    label.set_color(color_dictionary[labels[doc_index][color_value]])

plt.show()