For detailed descriptions and full references, see the Final report.
To process a set of files:
Make a folder pdfs under the top-level project folder.
Place files in that folder. They can be any of the ~2000 types supported by Apache Tika: https://tika.apache.org/0.9/formats.html
Run the process_files.py script from the project folder:
> python process_files.py
process_files.py takes an optional parameter naming a folder, which must already exist under the project folder.
You will now have a file, lines.txt, in the chosen folder. Copy or move it to the folder ./src/pdfs, relative to the project folder.
For more information, execute
> python process_files.py --help
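The copy step described above can be done by hand. As a minimal sketch, assuming the default folder name pdfs and that process_files.py has already written lines.txt there, it might look like this in Python. The helper name stage_lines_file is hypothetical and is not part of the project.

```python
import shutil
from pathlib import Path


def stage_lines_file(project_root, source_folder="pdfs"):
    """Copy <project_root>/<source_folder>/lines.txt into <project_root>/src/pdfs.

    Hypothetical helper: the project itself only documents a manual copy/move.
    """
    project_root = Path(project_root)
    source = project_root / source_folder / "lines.txt"
    if not source.is_file():
        raise FileNotFoundError(f"{source} not found; run process_files.py first")

    # Create ./src/pdfs if it does not exist yet.
    target_dir = project_root / "src" / "pdfs"
    target_dir.mkdir(parents=True, exist_ok=True)

    # copy2 keeps the file's timestamps; the original stays in place.
    return Path(shutil.copy2(source, target_dir / "lines.txt"))
```

Calling stage_lines_file(".") from the project folder copies the file and returns the new path. Pass a different source_folder if you gave process_files.py a folder name.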
Next, as per the file pdfs_notebook.ipynb, run the following code:
from deep_cluster import DeepCluster

dc = DeepCluster(
    'test-pdfs-dec',
    dims=[768, 500, 500, 2000, 40],
    entity_count=10,
    train_size=0,
    num_clusters=25,
    maxiter=2000)
dc.train_and_evaluate_model(10000, verbose=1, folder="pdfs/lines.txt")
Initialisation options include:
run_name: the folder name for saving results
train_size: how many samples to use for training
num_clusters: how many clusters to produce
cluster: the clustering algorithm used for seeding, one of:
  Kmeans for k-means
  GMM for Gaussian mixture models
  OPTICS
  agg for agglomerative clustering
entities: optional list of spaCy entity labels to find
entity_count: how many entity classes to find. Do not use together with entities
dims: optional list of numbers giving the size of each layer in the encoder
loss_weights: optional list of 3 weights [0.0-1.0] for the losses used:
  KL divergence
  Mean squared error
  Cluster accuracy loss
maxiter: how many iterations to train
verbose: 0 is quiet, 1 is verbose, 2 is very verbose
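The constraints listed above (the four clustering names, entities versus entity_count, three loss weights in [0.0, 1.0], and the verbose levels) can be checked before constructing the model. The following is a minimal sketch only: check_options is a hypothetical helper, not part of deep_cluster, and it assumes the class does not already perform these checks itself.

```python
# Clustering names as listed in the options above.
VALID_CLUSTERERS = {"Kmeans", "GMM", "OPTICS", "agg"}


def check_options(cluster=None, entities=None, entity_count=None,
                  loss_weights=None, verbose=0):
    """Return a list of problems found in the options (empty if none).

    Hypothetical pre-flight check; it only encodes the rules stated in
    this README, not the library's actual validation.
    """
    problems = []
    if cluster is not None and cluster not in VALID_CLUSTERERS:
        problems.append(f"cluster must be one of {sorted(VALID_CLUSTERERS)}")
    # An empty list or a count of 0 is treated as "not supplied".
    if entities and entity_count:
        problems.append("use either entities or entity_count, not both")
    if loss_weights is not None:
        if len(loss_weights) != 3:
            problems.append("loss_weights needs exactly 3 values")
        elif not all(0.0 <= w <= 1.0 for w in loss_weights):
            problems.append("each loss weight must lie in [0.0, 1.0]")
    if verbose not in (0, 1, 2):
        problems.append("verbose must be 0, 1 or 2")
    return problems
```

For example, check_options(cluster="Kmeans", entity_count=10, verbose=1) returns an empty list, while supplying both entities and entity_count returns one problem message.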
To use the latent representation network:
from deep_latent import DeepLatentCluster

dc = DeepLatentCluster(
    run_name='test-latent-all-Kmeans',
    config={
        'train_size': 0,
        'reconstr_weight': 1.0,
        'latent_weight': 1e-5,
        'cluster': 'Kmeans'
    })
dc.make_model()
dc.train_model()
dc.evaluate_model('test-latent-all', sample_size=4000)
DeepLatentCluster has many configuration options that can be supplied to the config parameter in dictionary form. For a comprehensive list, see the class init method, but common ones include:
train_size: how many samples to include
reconstr_weight: weight applied to the loss from the autoencoder reconstruction
latent_weight: weight applied to the loss from the latent representation network
cluster: the clustering algorithm to use, one of:
  Kmeans for k-means
  GMM for Gaussian mixture models
  OPTICS
  agg for agglomerative clustering
opt: the optimizer function
noise_factor: [0.0-1.0] how much Gaussian noise to add to the training data
entity_count: how many spaCy entity classes
num_clusters: target number of identified clusters, including the spaCy entities
max_iter: how many training iterations per epoch
epochs: how many training epochs
tolerance: [float] the fraction of change of cluster allocations needed to continue training
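To make the tolerance option concrete, here is a minimal sketch of the kind of stopping rule it describes: compare cluster allocations between two successive checks and continue only while enough of them changed. The function names are hypothetical, and the exact comparison (at least tolerance versus strictly more) is an assumption; the class source is the authority on how DeepLatentCluster actually applies it.

```python
def changed_fraction(previous, current):
    """Fraction of samples whose cluster allocation differs between two checks."""
    if len(previous) != len(current):
        raise ValueError("allocation lists must have the same length")
    if not current:
        return 0.0
    changed = sum(1 for p, c in zip(previous, current) if p != c)
    return changed / len(current)


def should_continue(previous, current, tolerance):
    """Assumed rule: keep training while at least `tolerance` of allocations changed."""
    return changed_fraction(previous, current) >= tolerance
```

With allocations [0, 1, 1, 2] followed by [0, 1, 2, 2], one of four samples moved, so the changed fraction is 0.25 and training would continue for any tolerance up to 0.25.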
Results will be stored in ./results/[run_name], where run_name is the run name supplied to the class initialiser.
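As a small convenience, the results location can be built from the run name. This is a sketch under the assumption that the results folder sits directly under the project folder. Both helper names are hypothetical, and no particular output file names are assumed.

```python
from pathlib import Path


def results_dir(run_name, project_root="."):
    """Build the path ./results/[run_name] relative to the project folder."""
    return Path(project_root) / "results" / run_name


def list_results(run_name, project_root="."):
    """Return the sorted names of whatever a run wrote, or [] if it has not run."""
    folder = results_dir(run_name, project_root)
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.iterdir())
```

For the first example above, list_results("test-pdfs-dec") lists the contents of ./results/test-pdfs-dec once the run has finished.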