Lab 3 - Part 6: JSON and XML formats
=============
In this lab, we will parse XML files and create JSON files. Three different modules from the [Python Standard Library](https://docs.python.org/3/library/index.html) will be used:
- [ElementTree XML](https://docs.python.org/3/library/xml.etree.elementtree.html)
- [JSON encoder and decoder](https://docs.python.org/3/library/json.html)
- [Collections](https://docs.python.org/3/library/collections.html)

We will use a [term list](https://sprakresurser.isof.se/myndighetstermlistan/) that is developed and made available by the Institute for Language and Folklore. It is in XML format, more specifically a term format that is called TBX. Therefore, the file has the suffix '.tbx'.

The focus of this lab is to practice understanding Python code that someone else has written, rather than to write much code yourself. In steps 4, 6, 10 and 11, however, you will change the code.


I. Understand the XML parsing, create a readable JSON file and get to know the Counter class
------------------------------------------------------------------------------
1) The XML file that is being parsed is a [term list](https://sprakresurser.isof.se/myndighetstermlistan/) from the Government Authority Terminology group. Take a look at the structure of the XML file. What kind of information does it contain? **`<term>activity report</term>` is one of the XML elements in the file.** 

**In this element, what is the *tag*, what is the *closing tag* and what is the *text*?** 

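> __Answer__: A tag is written inside angle brackets, e.g. `<a_tag>`, and is closed with `</a_tag>`.
  In `<term>activity report</term>`, the tag is `term`, the closing tag is `</term>` and the text
  is `activity report`. Everything between the opening and closing tags is the content of the element.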

**Can you give an example of a tag that has an *attribute*?**

> __Answer__: The `termEntry` tag has the attribute `id`.

**Can you give an example of an element that contains another element?**

> __Answer__: The element `termEntry` with `id="aktivitetsrapport"` contains several `langSet` elements

**Give another example (besides ``term``) of an element that contains *text*.**

> __Answer__: For example `note`
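
The terminology above can be checked directly against the parser. The sketch below is a minimal illustration and does not use the real TBX file: the snippet is hand-written and simplified, and only the element names `termEntry`, `langSet`, `term` and `note` are taken from the lab. The text of the `note` element is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified snippet (the real TBX file has more structure)
snippet = """
<termEntry id="aktivitetsrapport">
  <langSet>
    <term>activity report</term>
    <note>an example note</note>
  </langSet>
</termEntry>
""".strip()

entry = ET.fromstring(snippet)

print(entry.tag)     # the tag name of the outermost element
print(entry.attrib)  # its attributes, as a dict

# iter() visits the element itself and all elements nested inside it
for element in entry.iter():
    # Container elements only hold whitespace as text, so strip it away
    text = (element.text or "").strip()
    print(element.tag, repr(text))
```

Printing `element.tag` and `element.text` in this way is also a quick method for exploring the real file in step 3.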

2) The code below contains three different ways of importing components from the Python standard library. How does the syntax differ between the three ways they are imported? The way the components are imported affects the syntax when using the components. Spot all code lines where the imported components are used. How does the syntax differ, depending on the import statement syntax used?

> __Answer__: `import x as y` lets you refer to the imported module by the alias `y`,
  often to get a shorter identifier, e.g. `ET.parse(...)`. The plain variant, `import x`, requires
  explicit referencing through the module name, e.g. `json.dump(...)`. The final variant,
  `from x import y`, lets you use `y` as if it were declared locally, e.g. `Counter(...)`.
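
As a compact illustration of the three import styles, the sketch below uses each imported component once. The variable names and the one-element XML string are made up for the example.

```python
import xml.etree.ElementTree as ET   # alias: refer to the module as ET
import json                          # plain import: refer to it as json
from collections import Counter     # direct import: use Counter without a prefix

element = ET.fromstring("<term>activity report</term>")  # ET.<name>
as_json = json.dumps({"terms": [element.text]})          # json.<name>
counts = Counter(element.text)                           # no prefix needed

print(as_json)
print(counts["t"])
```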

3) Run and make sure you understand the code for parsing the XML below. There are many ways to parse XML files, and the code below is just an example. If there are parts of the code you don't understand, try to print the elements that are being looped over, to understand each part.

4) CHANGE THE CODE: Take a look at the resulting JSON file (`en-term-entries.json`), for instance by downloading it and looking at it on your own computer.

The way the JSON is formatted now makes it very hard to read. Read the [json documentation](https://docs.python.org/3/library/json.html) to see if it is possible to format it better, by adding an indentation.

**Most lists associated with the key ``terms`` contain one element. What are the two exceptions?**

5) **What is the difference between json.dump() and json.dumps()?** Which is used here, and why? 

> __Answer__: `json.dump()` writes the JSON text to a file-like object (a text stream), whereas `json.dumps()` returns it as a string. Here `json.dump()` is used, since the output is written to a file.
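
The difference can be seen in a small sketch. It uses `io.StringIO` as an in-memory stand-in for an opened file, and a made-up one-entry dictionary. It also shows the effect of the `indent` parameter from step 4.

```python
import io
import json

data = {"terms": ["aktivitetsrapport"]}

# dumps() returns the JSON text as a string
compact = json.dumps(data)
pretty = json.dumps(data, indent=2)

# dump() writes the same text to a file-like object instead of returning it
buffer = io.StringIO()  # in-memory stand-in for a file opened with open(..., "w")
json.dump(data, buffer, indent=2)

print(compact)
print(pretty)
print(buffer.getvalue() == pretty)
```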

6) CHANGE THE CODE: There is a code line below with just an empty dictionary, `language_dict = {}`.
Instead of this code line, write code that reads the content of the JSON file you have just saved. Read the [json documentation](https://docs.python.org/3/library/json.html) to find the opposite of `json.dump()`, i.e. how to load the content from a file instead.
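
A minimal round trip could look like the sketch below. The file name and the dictionary are made up for the example, and the file is written to a temporary directory so that nothing is left behind.

```python
import json
import os
import tempfile

original = {"en": [{"terms": ["activity report"]}]}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, only used for this example
    path = os.path.join(tmp_dir, "example-term-entries.json")

    # Save the dictionary as JSON ...
    with open(path, "w") as fp:
        json.dump(original, fp, indent=2)

    # ... and read it back into a Python dictionary
    with open(path) as fp:
        loaded = json.load(fp)

print(loaded == original)
```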

7) Take a look at the [Counter](https://docs.python.org/3/library/collections.html#collections.Counter) class.
In addition to the Python [built-in types](https://docs.python.org/3/library/stdtypes.html), such as [lists](https://docs.python.org/3/library/stdtypes.html#sequence-types-list-tuple-range) and [dicts](https://docs.python.org/3/library/stdtypes.html#mapping-types-dict), there is also a module for somewhat more specialised collections. The Counter class is one example of such a specialised collection. To use it, you need to import it from the module.

**When and for what purpose is the Counter class used in the code below?**

> __Answer__: In `count_letters`, it counts how often each character occurs in the joined, lower-cased list of strings, like a histogram. Spaces and hyphens are counted as well.
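
The sketch below shows the same idea on a made-up list of two strings. `most_common()` is the `Counter` method that returns the entries ordered by frequency.

```python
from collections import Counter

words = ["activity report", "report"]

# A string is iterable character by character, so Counter counts characters
counter = Counter("".join(words).lower())

print(counter.most_common(3))  # the three most frequent characters
print(counter["r"])
print(counter["z"])            # a missing key gives 0 instead of a KeyError
```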

8) Make sure you understand what the `count_letters_for_entry_type`, `get_flattened_from_list_list` and `count_letters` functions do. If you don't understand, go through the functions step by step, by printing the variables.

[List comprehensions](https://www.w3schools.com/python/python_lists_comprehension.asp) are used again (as in the file reading/writing lab). This time, they are used together with a condition.
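
A list comprehension with a condition is equivalent to a loop with an `if` statement. The sketch below compares the two forms on a made-up list of entries, one of which lacks the `terms` key.

```python
entries = [
    {"terms": ["aktivitetsrapport"]},
    {},  # an entry without the key
    {"terms": ["a", "b"]},
]

# Loop version
terms_loop = []
for entry in entries:
    if "terms" in entry:
        terms_loop.append(entry["terms"])

# Equivalent list comprehension with a condition
terms_comprehension = [entry["terms"] for entry in entries if "terms" in entry]

print(terms_comprehension)
```

The condition filters out the empty entry, so indexing with `entry["terms"]` never raises a `KeyError`.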

9) What are the five most frequent letters for the English terms? (Remember them for later.)

> __Answer__: e a r i t

In [7]:
# Import a module, and rename it to ET (a bit strange name for a module)
import xml.etree.ElementTree as ET

# Import a module, and don't rename it
import json

# Import a specific class, Counter, from the module collections
from collections import Counter

###########
# Constants
###########

# Keys for the json file
TERMS = "terms"
DEFINITIONS = "definitions"

# Language codes
EN = "en"
SV = "sv"

############
# Functions
###########

def count_letters(str_list):
    joined_str_lower = "".join(str_list).lower()   
    letter_counter = Counter(list(joined_str_lower))
    return letter_counter

# Flattens one nested level of a list
def get_flattened_from_list_list(list_list):
    flat_list = []
    for lst in list_list:
        flat_list.extend(lst)
    return flat_list
    
def count_letters_for_entry_type(terms_dict, language, entry_type):
    # Check that there are term entries for the language
    if language not in terms_dict:
        print(f"No term entries for the language: {language}. Return None")
        return None
    
    terms_entries_for_lang = terms_dict[language]
    print(f"\nFound {len(terms_entries_for_lang)} term entries for the language: {language}")
    
    if len(terms_entries_for_lang) > 1:
        print(f"\nFirst and second term entries look like this: {terms_entries_for_lang[0:2]}\n")
    
    # Get all the terms in the term entries
    all_terms_list_list = \
    [term[entry_type] for term in terms_entries_for_lang if entry_type in term]
    
    # all_terms_list_list will be a list of lists. 
    # But 'count_letters' takes a flat list of strings as parameter.
    # Therefore, flatten the list
    all_terms_list = get_flattened_from_list_list(all_terms_list_list)
    
    # Count all letters for the terms
    letter_dict_terms = count_letters(all_terms_list)
    
    return letter_dict_terms


def parse_tbx_file_for_language(tbx_filename, json_output_filename, language_code):
    # Constants to use for parsing
    NAME_SPACE_STR = "{http://www.w3.org/XML/1998/namespace}"
    LANG_ATTR = NAME_SPACE_STR + "lang"
    TERM = "term"
    DESCRIPTIONS = "descrip" # USE THIS CONSTANT FOR STEP 11
    
    tree = ET.parse(tbx_filename)
    root = tree.getroot()
    print(root)
    
    all_terms = []
    for term_entry in root.iter('termEntry'):
        term_entry_dict = {}
        
        for language in term_entry.iter('langSet'):
            if language.attrib[LANG_ATTR] == language_code:
                term_entry_dict[TERMS] = [term.text for term in language.iter(TERM)]
                #term_entry_dict[DEFINITIONS] # ... # CHANGE CODE HERE, FOR STEP 11 IN THE INSTRUCTIONS
                term_entry_dict[DEFINITIONS] = [desc.text for desc in language.iter(DESCRIPTIONS)]
                
        all_terms.append(term_entry_dict)
   
    language_output_dict = {language_code: all_terms}
    
    
    # Save the output as a json file
    with open(json_output_filename, "w") as fp:
        json.dump(language_output_dict, fp, indent=2) # CHANGE THE CODE HERE, FOR STEP 4 in the instructions
        #json.dump(language_output_dict, fp) 

######################
# The code starts here
########################

# What language to use
#language_code = EN # CHANGE CODE HERE, FOR STEP 10 IN THE INSTRUCTIONS
language_code = SV # CHANGE CODE HERE, FOR STEP 10 IN THE INSTRUCTIONS

# Do the XML-parsing
tbx_filename = "/kaggle/input/terms-from-the-swedish-authority-terminology-group/authority_term_list.tbx"
json_filename = language_code + "-term-entries.json"
parse_tbx_file_for_language(tbx_filename, json_filename, language_code)


# Read the saved output from the json file
# CHANGE CODE HERE, FOR STEP 6 IN THE INSTRUCTIONS. 
# DON'T CREATE AN EMPTY DICTIONARY, BUT LOAD DATA FROM FILE INSTEAD
language_dict = {} # Load the json that you have saved to file here
with open(json_filename) as fp:
    language_dict = json.load(fp)
#print(language_dict)
    
letter_count_terms = count_letters_for_entry_type(language_dict, language_code, TERMS)
print(language_code,"Nr of letters for terms:", letter_count_terms)
letter_count_definitions = count_letters_for_entry_type(language_dict, language_code, DEFINITIONS)
print(language_code, "Nr of letters for definitions:", letter_count_definitions)


<Element 'martifHeader' at 0x7d554327d4e0>

Found 24 term entries for the language: sv

First and second term entries look like this: [{'terms': ['aktivitetsrapport'], 'definitions': ['redovisning av aktiviteter som utförts under en viss tidsperiod']}, {'terms': ['aktivitetsrapportera'], 'definitions': ['(Arbetsförmedlingen:) redovisa genomförda eller planerade aktiviteter som bidrar till att man får ett arbete eller börjar studera']}]

sv Nr of letters for terms: Counter({'a': 36, 't': 36, 'n': 34, 'e': 30, 's': 28, 'i': 27, 'r': 27, 'k': 20, 'd': 17, 'l': 15, 'o': 14, 'm': 13, 'g': 11, 'ä': 10, ' ': 7, 'p': 6, 'å': 6, 'y': 6, 'h': 6, 'v': 5, 'ö': 3, 'u': 3, 'b': 2, 'c': 2, '-': 1, 'j': 1})

Found 24 term entries for the language: sv

First and second term entries look like this: [{'terms': ['aktivitetsrapport'], 'definitions': ['redovisning av aktiviteter som utförts under en viss tidsperiod']}, {'terms': ['aktivitetsrapportera'], 'definitions': ['(Arbetsförmedlingen:) redovisa genom

II. Parsing the Swedish part of the term entry
-----------------------------------------------

10) CHANGE THE CODE: so that it extracts the Swedish part of the term entry instead. Run it to produce a new JSON file. (There is already a constant, `SV`, that you can use.) Are the same letters the most common among the Swedish terms?

11) CHANGE THE CODE: so that it also extracts the Swedish *definitions* for the term entry, i.e., the content that matches `<descrip type="definition">`. It should be possible to have several definitions for a language in a term entry (although this particular TBX file contains no entry with several different definitions).

You can mirror the code for extracting and saving terms to solve this task. There is a commented-out code line that you can use as a starting point: `#term_entry_dict[DEFINITIONS] # ...`

The resulting JSON file should have the following format when extracting the definitions:
```
{
    "terms": [
            "aktivitetsrapport"
            ],
    "definitions": [
                "redovisning av aktiviteter som utf\u00f6rts under en viss tidsperiod"
            ]
}
```
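
One possible way to collect only the elements that match `<descrip type="definition">` is sketched below. It mirrors the list comprehension used for the terms, but adds a condition on the `type` attribute, which is a variation and not necessarily how the lab code has to be written. The snippet is hand-written and simplified, and the `descrip` element with `type="other"` is made up to show the effect of the filter.

```python
import xml.etree.ElementTree as ET

# xml:lang is stored by ElementTree under this namespaced attribute name
LANG_ATTR = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical, simplified entry (not copied from the real TBX file)
snippet = """
<termEntry id="aktivitetsrapport">
  <langSet xml:lang="sv">
    <term>aktivitetsrapport</term>
    <descrip type="definition">redovisning av aktiviteter</descrip>
    <descrip type="other">not a definition</descrip>
  </langSet>
  <langSet xml:lang="en">
    <term>activity report</term>
  </langSet>
</termEntry>
""".strip()

entry = ET.fromstring(snippet)
entry_dict = {}
for lang_set in entry.iter("langSet"):
    if lang_set.get(LANG_ATTR) == "sv":
        entry_dict["terms"] = [term.text for term in lang_set.iter("term")]
        # Keep only the descrip elements whose type attribute is "definition"
        entry_dict["definitions"] = [
            descrip.text
            for descrip in lang_set.iter("descrip")
            if descrip.get("type") == "definition"
        ]

print(entry_dict)
```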

12) **In the *Swedish definitions*: how many 'm's and how many 's's are there? Note the numbers down to be able to answer the quiz.**
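> __Answer__: m's: 13, s's: 28. Note that these values equal the counts printed above for the Swedish *terms*; the counts for the definitions should be read from the second `Counter` in the output.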

Extra
-------
1) There are many other ways to parse XML using Python, for instance with XPath. Rewrite the code above so that it does the same thing, but using [XPath](https://docs.python.org/3/library/xml.etree.elementtree.html#elementtree-xpath).
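
As a starting point, the sketch below shows the XPath subset that `findall()` in ElementTree supports, applied to a hand-written snippet. The nesting and the `body` and `tig` element names are assumptions made for the example and may differ from the real file.

```python
import xml.etree.ElementTree as ET

LANG_ATTR = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical, simplified snippet (assumed nesting)
snippet = """
<body>
  <termEntry id="a">
    <langSet xml:lang="sv"><tig><term>aktivitetsrapport</term></tig></langSet>
    <langSet xml:lang="en"><tig><term>activity report</term></tig></langSet>
  </termEntry>
</body>
""".strip()

root = ET.fromstring(snippet)

# ".//" searches all descendants; "[@id]" keeps elements that have an id attribute
entries = root.findall(".//termEntry[@id]")

sv_terms = [
    term.text
    for entry in entries
    for lang_set in entry.findall("langSet")  # direct children only
    if lang_set.get(LANG_ATTR) == "sv"
    for term in lang_set.findall(".//term")   # terms at any depth below
]
print(sv_terms)
```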

2) There are many ways in which a list can be flattened. Rewrite the loop above with another method to [flatten lists](https://realpython.com/python-flatten-list/).
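
Two common alternatives are sketched below on a made-up list of lists: `itertools.chain.from_iterable` and a nested list comprehension. Both flatten exactly one level, like the loop in `get_flattened_from_list_list`.

```python
from itertools import chain

list_list = [["aktivitetsrapport"], ["a", "b"], []]

# chain.from_iterable yields the items of each inner list in turn
flat_chain = list(chain.from_iterable(list_list))

# Nested list comprehension: the outer loop is written first
flat_comprehension = [item for lst in list_list for item in lst]

print(flat_chain)
print(flat_comprehension)
```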
    
3) Rewrite this part, so that it is handled by catching an exception instead.
```
    if language not in terms_dict:
        return None
    terms_entries_for_lang = terms_dict[language] 
   
```
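
One possible shape of such a rewrite is sketched below. The function `get_entries_for_language` is a hypothetical helper introduced only for this example; in the lab, the same `try`/`except` structure would go inside `count_letters_for_entry_type`.

```python
def get_entries_for_language(terms_dict, language):
    """Hypothetical helper: return the entries for a language, or None."""
    try:
        # Indexing a dict with a missing key raises KeyError
        return terms_dict[language]
    except KeyError:
        print(f"No term entries for the language: {language}. Return None")
        return None


terms_dict = {"sv": [{"terms": ["aktivitetsrapport"]}]}
print(get_entries_for_language(terms_dict, "sv"))
print(get_entries_for_language(terms_dict, "en"))
```

This follows the "try first, handle the failure afterwards" style, whereas the original code checks the condition before accessing the dictionary.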