# Understanding the Voynich manuscript using deep learning techniques

The [voynisch manuscript](https://en.wikipedia.org/wiki/Voynich_manuscript) is a mysterical manuscript that has been the subject to studies for over 100 years but is still surrounded by a lot of questions and isn't translated until today. Theories include the manuscript being a haox, cipher or a natural language that does not exist anymore. 

For this research project I want to see if we can answers some of the questions that surround this manuscript using deep learning techniques or even can try to partly translate it.

## Collect datasets 

As a start i decided to gather some textual data i could use, my idea is to look at properties over languages i can actually translate and understand to see what could work for the voynich manuscript. The voynich manuscript contains about 170,000 chars spread over 35,000 whitespaces seperated groups(probably words). Based on the illustrations it suspected that the following topics are covered:

- Herbal (112 folios)
- Astronomical (21 folios)
- Biological (20 folios) 
- Cosmological (13 folios)
- Pharmaceutical (34 folios)
- Recipes (22 folios)

My idea is to select pages from wikipedia that correlate to these topics from different languages and use the text on this pages as a corpus for our analysis. The fact that topics correlate means that it's more likely that the words in our dataset share sementatic meaning to the words in the manuscript. This definitely doesn't mean that exact translations will be available, and also note that the semenatic overlap between the different wikipedia datasets is likely to be higher than between the manuscript and the wikipedia pages. But i have to select some pages anyway, since we want an equal 

In [1]:
import wikipedia
from nltk.tokenize import word_tokenize
from bs4 import BeautifulSoup
import re
import os
import requests
import codecs
path = "/Users/stijnvoss/Documents/uni/capita-selecta-ai/datasets/"
wm_titles = {
    'Herbal': ['Vanilla','Mustard plant','Mustard seed','Pineapple','Pumpkin','Lime (fruit)','Parsley','Rosemary', 'Thymus (plant)', 'Apocynaceae', 'Basil','Oregano','Ballota_nigra','Ginger','Lavandula','Cumin','Nutmeg','Ruta graveolens','Anise','Peppermint','Mentha','Saffron','Achillea millefolium'],
    'Astronomical': ['Geocentric model', 'Astronomical object','Planet','Heliocentrism','Galaxy'],
    'Biological': ['Human body', 'Cardiology','Circulatory system'],
    'Cosmological': ['Plato', 'God'],
    'Pharmaceutical':['Antipyretic','Phytotherapy','Pharmaceutical_drug','Pharmacognosy'],
    'Recipes':['Brussels sprout','Doberge cake','Angel cake','Gingerbread','Ontbijtkoek', "Recipe","Milk", "Risotto", "Paella"]
}

weights = {
    'Herbal': 112,
    'Astronomical': 21,
    'Biological': 20,
    'Cosmological': 13,
    'Pharmaceutical':34,
    'Recipes':22
}

total_chars = 170000
def find_titles_for_lang(titles, lang):
    if lang == 'en':
        return dict([(t,t) for t in titles])
    else:
        r = requests.get("https://en.wikipedia.org/w/api.php",{
                "action":'query',
                'titles': "|".join(titles),
                'prop':'langlinks',
                'llinlanguagecode':'en',
                'lllang':lang,
                'lllimit':100,
                'format':'json'})
        return dict([(p['title'], p['langlinks'][0]['*']) for p in r.json()['query']['pages'].values() if 'langlinks' in p])

def assemble_dataset(lang,ds):
    global titles, weights, total_chars, path
    total_weight = float(sum(weights.values()))
    lang_folder = os.path.join(path,lang)
    if not os.path.exists(lang_folder):
        os.makedirs(lang_folder)
    ds_chars = 0
    for topic, titles in wm_titles.iteritems():
        topic_chars = 0
        min_chars = int(total_chars * (weights[topic]/total_weight))
        translated_titles = find_titles_for_lang(titles, lang)
        with codecs.open(os.path.join(lang_folder, topic +".txt"), 'w',encoding='utf8') as io:
            for title in translated_titles.values():
                chars, content = extract_content(title, lang)
                for c in content:
                    if c.strip() != '':
                        min_chars = min_chars - 1
                        topic_chars += 1
                        ds_chars = ds_chars + 1
                    if min_chars <= 0:
                        break
                    io.write(c)
                if min_chars <= 0:
                    break

        print topic,"chars found", topic_chars, min_chars
    print lang, "total chars found", ds_chars
    
def extract_content(wiki_title,lang):
    wikipedia.set_lang(lang)
    try:
        p = wikipedia.page(wiki_title)
        content = p.content.replace("=","")
        return len(re.sub(r'\s+', '', content)), content
    except wikipedia.PageError as e:
        return 0, ""
    except wikipedia.DisambiguationError as e:
        return 0, ""
#find_titles_for_lang(['Geocentric model', 'Astronomical object','Planet','Heliocentrism','Galaxy'],'es')
assemble_dataset('es', path)
assemble_dataset('en', path)
assemble_dataset('nl', path)

Pharmaceutical chars found 26036 0
Cosmological chars found 9954 0
Astronomical chars found 16081 0
Herbal chars found 85765 0
Biological chars found 15315 0
Recipes chars found 16846 0
es total chars found 169997


AttributeError: 'list' object has no attribute 'iteritems'

In [120]:
1

1