# European Summer School in Chinese Digital Humanities

## Stylometry: PCA
In this notebook I introduce a script that lets you conduct a stylometric analysis by changing only a few options. The notebook performs principal component analysis (PCA).

### The imports
There are a number of items from various Python libraries that we need to import to conduct the analysis we are interested in. It is, of course, possible for us to write all of the necessary code ourselves, but it is much preferable to rely on things that other people have created for us.

In [None]:
from google.colab import drive
drive.mount('/content/drive/')
%cd drive/MyDrive/europeanchinesedh-main/

# set up Chinese font
# the URL is quoted so the shell does not treat "&" as a command separator
!wget -O TaipeiSansTCBeta-Regular.ttf "https://drive.google.com/uc?id=1eGAsTN1HBpJAkeVM57_C7ccp7hbgSz3_&export=download"
import matplotlib as mpl
import matplotlib.pyplot as plt 
from matplotlib.font_manager import fontManager

fontManager.addfont('TaipeiSansTCBeta-Regular.ttf')
mpl.rc('font', family='Taipei Sans TC Beta')

In [None]:
# Library for loading and exporting data
import os, json

# Libraries for analysis
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
import numpy as np

# Library for visualization
import seaborn as sns
import matplotlib.colors

# Custom local modules with useful utilities
from clean import clean # for cleaning the text
from totrad import Convert # to convert to traditional characters


### Set analysis options

#### corpus_folder_name
Provide the name of the corpus folder in a string. Leave as "demo_corpus" to use the supplied corpus.

#### analysis_vocab_file
If you want to provide a custom set of words to use for the analysis, provide the name of a text file that contains the words, one word to a line. This should be a string like
"analysis_vocab.txt"
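
As a minimal sketch of the expected file format, the snippet below writes a small vocabulary file and reads it back the same way the loading code later in this notebook does. The file name `example_vocab.txt` and the three terms are invented for illustration.

```python
import os
import tempfile

# Hypothetical vocabulary: one term per line (these terms are only examples)
example_terms = ["之", "也", "而"]

# Write the file to a temporary folder so nothing in your project is overwritten
vocab_path = os.path.join(tempfile.mkdtemp(), "example_vocab.txt")
with open(vocab_path, "w", encoding="utf8") as wf:
    wf.write("\n".join(example_terms) + "\n")

# Read it back, skipping empty lines (such as the one after the final newline)
with open(vocab_path, "r", encoding="utf8") as rf:
    loaded_vocab = [v for v in rf.read().split("\n") if v != ""]

print(loaded_vocab)
```

Because the analysis counts character n-grams of a fixed length, terms whose length differs from n_gram would never be matched in the texts.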

#### most_common_words
Set the number of most common terms to use for your analysis. This is ignored if you provide a vocab file. Set it to None to analyze every term in the corpus. Otherwise this should be an integer like
100
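
The sketch below illustrates, on two invented strings, how limiting the analysis to the most common terms works. It assumes the same vectorizer options this notebook uses later, with `max_features` playing the role of most_common_words.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy texts, invented for illustration.
# Across both texts: 之 appears 5 times, 也 3 times, 而 once, 乎 once.
toy_texts = ["之之之也也而", "之之也乎"]

# Keep only the 2 most frequent single characters
vectorizer = TfidfVectorizer(ngram_range=(1, 1), max_features=2,
                             use_idf=False, analyzer="char")
vectorizer.fit(toy_texts)

# vocabulary_ maps each kept term to its column index
kept_terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(kept_terms)
```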

#### n_gram
By default this is set to work on <i>n</i>-grams where <i>n</i> is 1, meaning individual characters will be at the root of the analysis. You are welcome to play around with this as you see fit. The higher the <i>n</i>, the sparser the data.
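
To make the point about sparsity concrete, here is a small sketch that counts how many terms two invented texts share when n is 1 and when n is 2. The helper `shared_terms` is hypothetical and exists only for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy texts, invented for illustration
toy_texts = ["學而時習之不亦說乎", "有朋自遠方來不亦樂乎"]

def shared_terms(texts, n):
    """Return the character n-grams that occur in every text."""
    vectorizer = TfidfVectorizer(ngram_range=(n, n), use_idf=False, analyzer="char")
    matrix = vectorizer.fit_transform(texts).toarray()
    # A term is shared when its column is non-zero in every row
    return sorted(term for term, idx in vectorizer.vocabulary_.items()
                  if (matrix[:, idx] > 0).all())

shared_unigrams = shared_terms(toy_texts, 1)
shared_bigrams = shared_terms(toy_texts, 2)
print(len(shared_unigrams), "shared characters;", len(shared_bigrams), "shared bigrams")
```

With longer n-grams, fewer terms are shared between texts, so most cells of the document-term matrix are zero.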


#### convert_to_traditional
Set this to False to leave the characters in the files unmodified. Set it to True if you would like to automatically convert the texts to traditional characters.

#### pca_components
Set this integer to the number of components you would like the algorithm to return. This notebook is only set up to visualize the first 2 components, but this is available for your use should you like to dive further in.
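
If you do ask for more than two components, a sketch like the following shows what comes back. The random matrix is only a stand-in for the real document vectors, and the choice of 3 components is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the document-term matrix: 6 documents, 5 terms
rng = np.random.default_rng(0)
toy_vectors = rng.random((6, 5))

pca = PCA(n_components=3)
coords = pca.fit_transform(toy_vectors)

# One row per document, one column per component
print(coords.shape)

# Share of the total variance captured by each component, largest first
variance_share = pca.explained_variance_ratio_
print(variance_share)
```

The number of components cannot exceed the smaller of the number of documents and the number of terms.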

In [None]:
corpus_folder_name = "demo_corpus"

analysis_vocab_file = None

most_common_words = 100

n_gram = 1

convert_to_traditional = False

pca_components = 2 # Only useful for digging even deeper in the data


## Adjustable parameters: Appearance
These parameters will help you set the appearance of the plot itself.

### label_types 
A tuple that specifies the nature of the corpus labeling. The labels are read from each file name by splitting it on underscores, so with the tuple used in this notebook a file should be named with the convention title_dynasty_siku_subcat_author.txt. Each type of label is one element in this tuple, in the same order as it appears in the file name.
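
As a rough sketch of how a file name maps onto the label types, assuming the five-element tuple set in the options cell of this notebook and an invented file name:

```python
# Label types, in the order they appear in each file name
label_types = ("title", "dynasty", "siku", "subcat", "author")

# Hypothetical file name, made up for this example
example_filename = "論語_周_經_四書_孔子.txt"

# Drop the ".txt" extension and split on underscores
parts = example_filename[:-4].split("_")

# The number of parts should match the number of label types
labels_match = len(parts) == len(label_types)

# Pair each label type with its value
labels_for_file = dict(zip(label_types, parts))
print(labels_match, labels_for_file)
```

A file whose name has more or fewer parts than the tuple has elements will not line up with the label types, so it is worth checking names before running the analysis.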

### color_value 
This integer specifies which label should be used to generate a color scheme for the plot. For example, 2 points to the 3rd element in the tuple, the siku categorization. There are three different siku categories reflected in the dataset, making this a good option. Pick whichever label your analysis is focused on. More than 8 or so distinct values, however, will generate colors that are hard to tell apart.
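
One way to check a candidate index before plotting is to count the distinct values it would produce. The labels below are invented placeholders, and the threshold of 8 simply mirrors the rule of thumb above.

```python
# Invented labels for four files, in the order (title, dynasty, siku, subcat, author)
labels = [
    ["text1", "Tang", "shi", "zhengshi", "authorA"],
    ["text2", "Song", "zi", "rujia", "authorB"],
    ["text3", "Song", "shi", "biannian", "authorC"],
    ["text4", "Ming", "ji", "bieji", "authorD"],
]

color_value = 2  # index of the label used for color (here the third label)

# Every distinct value at that index becomes one color in the plot
categories = sorted({lab[color_value] for lab in labels})
print(categories)

# More than about 8 categories tends to produce colors that are hard to tell apart
too_many = len(categories) > 8
print("Too many categories for clear colors:", too_many)
```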

### label_value 
This integer specifies which label should be used for labeling the points in the plot. 0 points to the 1st element in the tuple, the title.

### point_size 
is an integer that sets how large the points in the plot are.

### point_labels 
is a boolean (True or False) that specifies if the points should be labeled.

### plot_loadings 
is a boolean that specifies if the vocabulary should be drawn on the plot (which will aid in interpretation). The further a term is from the center of the plot, the more it is influencing texts in a given direction.
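
The following sketch shows one way to rank terms by their distance from the center of the plot. It uses four invented strings and the same vectorizer options as this notebook; the ranking itself is only an aid to reading the plot, not part of the script.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Four toy texts, invented for illustration
toy_texts = ["之乎者也之乎", "者也者也之之", "而已而已矣乎", "矣乎矣乎而已"]

vectorizer = TfidfVectorizer(ngram_range=(1, 1), use_idf=False, analyzer="char")
vectors = vectorizer.fit_transform(toy_texts).toarray()

pca = PCA(n_components=2)
pca.fit(vectors)

# Terms in column order, and the loadings (one row per component, one column per term)
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
loadings = pca.components_

# Distance of each term from the origin in the plane of the first two components
distance = np.sqrt(loadings[0] ** 2 + loadings[1] ** 2)

# Terms furthest from the center come first
ranked = sorted(zip(terms, distance), key=lambda pair: pair[1], reverse=True)
for term, dist in ranked:
    print(term, round(float(dist), 3))
```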

### hide_points 
is a boolean that specifies whether the points should be hidden. Set to True to see the loadings better.

### output_dimensions 
is a tuple that sets the width and height of the output plot in inches. The inner values can be either integers or floats.

### output_file 
contains the name of the file where the plot will be saved. The file extension determines the file type; png, pdf, jpg, tif, and others are all valid selections. On Macs, because of an oddity of the plotting library, PDFs will be very large because the entire font is embedded in the file. You can fix this by opening the file with Adobe Illustrator (or a similar program) and then saving a copy.

In [None]:
# Types of labels for documents in the corpus
# This must match your metadata naming scheme!
label_types = ('title', 'dynasty', 'siku', 'subcat', 'author') # tuple with strings

# Some of these labels will set the color used to differentiate the points in the plot.
# The label at this index is used to set Color:
color_value = 3 # Index of label to use for color (integer). Here 3 points to "subcat"

# Index of label to use for plot labels (if points are labeled)
label_value = 0 # Index of label to use for labels (integer). Here 0 points to "title"

# Point size (integer)
point_size = 8

# Show point labels (add labels for each text):
point_labels = False # True or False

# Plot loadings (write the characters to the plot)
plot_loadings = False # True or False

# Hide points (useful for seeing loadings better):
hide_points = False # True or False

# Output file info (dimensions are in inches (width, height)):
output_dimensions = (10, 7.5) # Tuple of integers or floats

# Output file extension determines output type. Save as a PDF if you want to edit in Illustrator
# PDF output on Mac is very large, but just opening and saving a copy in Illustrator will fix this
output_file = "myfigure.png"

From this point on you don't need to change any of the code to run the analysis, but you are welcome to mix things up if you like.

In [None]:
# set up the converter (if requested) and the optional custom vocabulary

if convert_to_traditional:
    c = Convert(preserve_multiple=False)

if analysis_vocab_file:
    with open(analysis_vocab_file, 'r', encoding='utf8') as rf:
        vocab = [v for v in rf.read().split("\n") if v != ""]
else:
    vocab = None


    
##############
# Load Texts #
##############

print("Loading, cleaning, and tokenizing")
# Go through each document in the corpus folder and save info to lists
texts = []
labels = []

for root, dirs, files in os.walk(corpus_folder_name):
    for i, f in enumerate(files):
        if f.endswith(".txt"):
            # add the labels to the label list
            labels.append(f[:-4].split("_"))

            # Open the text, clean it, and tokenize it
            with open(os.path.join(root,f),"r", encoding='utf8', errors='ignore') as rf:
                # read and clean the file
                text = clean(rf.read())

                # if convert_to_traditional is set to True, convert
                if convert_to_traditional:
                    text = c.to_trad(text)
                texts.append(text)
            

In [None]:
####################
# Perform Analysis #
####################

vectorizer = TfidfVectorizer(vocabulary=vocab, ngram_range=(n_gram, n_gram), max_features=most_common_words, use_idf=False, analyzer="char")
vecs = vectorizer.fit_transform(texts)
vecs = vecs.toarray()

# Let's perform PCA on the vectors:
pca = PCA(n_components=pca_components)
my_pca = pca.fit_transform(vecs)


##############
# Plot Setup #
##############

print("Setting plot info")
# set the plot size
plt.figure(figsize=output_dimensions)

# find all the unique values for each of the label types
unique_label_values = [set() for i in range(len(label_types))]

for label_list in labels:
    for i, label in enumerate(label_list):
        unique_label_values[i].add(label)

# create color dictionaries for all labels
color_dictionaries = []
for unique_labels in unique_label_values:
    colorpalette = sns.color_palette("husl",len(unique_labels)).as_hex()
    color_dictionaries.append(dict(zip(unique_labels,colorpalette)))

# Now we need the Unique Labels
unique_color_labels = list(unique_label_values[color_value])
# Let's get a number for each class
number_for_class = [i for i in range(len(unique_color_labels))]

# Make a dictionary! This is new syntax for us! It just makes a dictionary where
# the keys are the unique labels and the values are found in number_for_class
label_for_class_number = dict(zip(unique_color_labels,number_for_class))

# Let's make a new representation for each document that is just these integers
# and it needs to be a numpy array
text_class = np.array([label_for_class_number[lab[color_value]] for lab in labels])


# Make a list of the colors
colors = [color_dictionaries[color_value][lab] for lab in unique_color_labels]

if hide_points:
    point_size = 0

###################
# Create the plot #
###################

print("Plotting texts")
for col, class_number, lab in zip(colors, number_for_class, unique_color_labels):
    plt.scatter(my_pca[text_class==class_number,0],my_pca[text_class==class_number,1],label=lab,c=col, s=point_size)

# Let's label individual points so we know WHICH document they are
if point_labels:
    print("Adding Labels")
    for lab, datapoint in zip(labels, my_pca):
        plt.annotate(str(lab[label_value]),xy=datapoint)

# Let's graph component loadings
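# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocabulary = vectorizer.get_feature_names_out().tolist()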
loadings = pca.components_
if plot_loadings:
    print("Rendering Loadings")    
    for i, word in enumerate(vocabulary):
        plt.annotate(word, xy=(loadings[0, i], loadings[1,i]))
    

# Let's add a legend! matplotlib will make this for us based on the data we 
# gave the scatter function.
plt.legend()
plt.savefig(output_file)


############################################
# Output data for JavaScript Visualization #
############################################

data = []
for datapoint in my_pca:
    pcDict = {}
    for i, dp in enumerate(datapoint):
        pcDict[f"PC{str(i + 1)}"] = dp
    data.append(pcDict)

jsLoadings = []
for i, word in enumerate(vocabulary):
    temploading = {}
    for j,dp in enumerate(loadings):
        temploading[f"PC{str(j+1)}"] = dp[i]
    jsLoadings.append([word, temploading])

color_dictionaries_list = []
for cd in color_dictionaries:
    cdlist = [v for v in cd.values()]
    color_dictionaries_list.append(cdlist)

colorstrings = json.dumps(color_dictionaries_list)
labelstrings = json.dumps(labels)
valuetypes = json.dumps([k for k in data[0].keys()])
datastrings = json.dumps(data)

limited_label_types = []
for i, t in enumerate(label_types):
    if len(unique_label_values[i]) <= 20:
        limited_label_types.append(t)

cattypestrings = json.dumps(limited_label_types)
loadingstrings = json.dumps(jsLoadings)
# Use the JSON strings built above so every value is a valid JavaScript literal
stringlist = [f"var colorDictionaries = {colorstrings};", f"var labels = {labelstrings};",
            f"var data = {datastrings};", f"var categoryTypes = {json.dumps(list(label_types))};",
            f"var loadings = {loadingstrings};", f"var valueTypes = {valuetypes};",
            f"var limitedCategories = {cattypestrings};",
            f"var activecatnum = {color_value};", f"var activelabelnum = {label_value};",
            f"var explainedvariance = [{round(pca.explained_variance_[0],3)},{round(pca.explained_variance_[1],3)}];"]


with open("pca_viz/data.js", "w", encoding="utf8") as wf:
    wf.write("\n".join(stringlist))



# Show the plot
plt.show()
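
The export step writes each variable as a JavaScript assignment. As a small sketch of why `json.dumps` is a convenient way to build such lines, the example below serializes invented values and checks that they can be read back unchanged. The variable names and values here are placeholders, not output from the analysis.

```python
import json

# Invented stand-ins for the PCA coordinates and the file labels
example_points = [{"PC1": 0.25, "PC2": -0.5}]
example_labels = [["書名", "唐"]]

# json.dumps produces text that is also a valid JavaScript literal
data_line = f"var data = {json.dumps(example_points)};"

# ensure_ascii=False keeps Chinese characters readable instead of \u escapes
labels_line = f"var labels = {json.dumps(example_labels, ensure_ascii=False)};"

print(data_line)
print(labels_line)

# Reading the JSON part back gives the original values
round_trip = json.loads(labels_line[len("var labels = "):-1])
```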