# Introduction to Python Programming

## 3. Nested Data Structures

You now know about the most commonly-used Python data types. There are more, each with characteristics suited to a particular task or situation and, if you like, you can make your own as well. You can see the full list of built-in data structures here: https://docs.python.org/3.5/library/stdtypes.html. You should also have an idea of the ways that you can repeat actions on lists and dictionaries with `for` loops, as well as take decisions based on data using `if` statements. This knowledge and these skills can be applied to some quite sophisticated programming. 

One other key to writing effective programs is the ability to encapsulate your data and allow them to be accessed and analysed in a way that is efficient and appropriate. If you don't take the time to consider the best way to capture, store, and access your data, you can end up writing programs that are much more complicated and error-prone than they need to be. This can cost you a lot more time in the future. As with any project, a little time spent planning at the beginning can save a lot of time later on.

#### _Warmup exercise_

Imagine that you woke up this morning with several ideas for things you want to do today, each of which can be described by a word or a short sentence. You want to write those ideas down (that is, store them in Python). Which data type do you think is best suited to holding all of that data?
- an integer
- a string
- a list
- a group
- a dictionary
- other

[write down answer]

After compiling your collection of all the things you want to get done today, you realise that they can and should be categorised into different projects such as 'work', 'home', etc.
Given your response to the question above, how would you store your data in multiple levels?

[write down answer]

#### Combining Structures

Often, datasets are best encapsulated by combining the basic data structures together to form larger, well-organised collections that fit the characteristics of the dataset. This combination of multiple data structures is often referred to as 'nesting'. Getting used to dealing with these structures as they grow larger and more complicated can take a bit of time, but it becomes easier with practice and is very important for good programming.

To make this easier to understand, let's begin with some examples. Let's say that you teach seminars to three different groups of students, and you are writing a program that will keep track of the names of these students. For each seminar group, you could store the names in a list.

In [42]:
GroupA = ['Sarah', 'Richard', 'Matthew', 'Fiona', 'Sally', 'Samuel']
GroupB = ['Richard', 'Sally', 'Sandy', 'Peter', 'Rebecca', 'Steve', 'Lesley', 'Stuart']
GroupC = ['Simon', 'Laura', 'Gareth', 'Alan', 'Helen']

Now, you might want to go through all of the names for all of the groups. You could write three loops:

In [43]:
for student in GroupA:
    print(student)

Sarah
Richard
Matthew
Fiona
Sally
Samuel


In [44]:
for student in GroupB:
    print(student)

Richard
Sally
Sandy
Peter
Rebecca
Steve
Lesley
Stuart


In [45]:
for student in GroupC:
    print(student)

Simon
Laura
Gareth
Alan
Helen


Teaching can be quite time-consuming at the best of times, and this is getting quite repetitive! Of course, it gets even worse if you teach even more groups, or if you are working with a larger dataset of some other kind. Instead, you could combine the lists together into a single list,

In [46]:
AllGroups = GroupA + GroupB + GroupC

but you might not want to lose track of which students belong to which class. So instead, it is a good idea to store the individual lists as entries in another data structure that can be processed iteratively just like the lists themselves. We have a few options here, but let's begin with a list of lists - where each entry in the list is itself another list:

In [47]:
AllGroups = [ GroupA , GroupB , GroupC ]
# This is equivalent to the below, but much easier to read!
AllGroups = [ ['Sarah', 'Richard', 'Matthew', 'Fiona', 'Sally', 'Samuel'] ,
              ['Richard', 'Sally', 'Sandy', 'Peter', 'Rebecca', 'Steve', 'Lesley', 'Stuart'] ,
              ['Simon', 'Laura', 'Gareth', 'Alan', 'Helen'] ]

Now, to work through all the names of all the students, we need to iterate over each entry in the top-level list `AllGroups` and, because we know that each of these entries is itself a list, iterate over the entries of this second-level list as well. As you will remember from the previous worksheet, iterating over a list can be achieved with a `for` loop. In this case, to iterate over everything in our two-level nested lists we will just use two nested `for` loops.

In [48]:
for group in AllGroups:
    for student in group:
        print(student)

Sarah
Richard
Matthew
Fiona
Sally
Samuel
Richard
Sally
Sandy
Peter
Rebecca
Steve
Lesley
Stuart
Simon
Laura
Gareth
Alan
Helen


Great! So now we can get a full list of our students' names, without writing an individual `for` loop for every class we teach. But we have lost some information in the process, because we can't tell by looking at the list above where the names for one class end and another begin.
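
As a minimal sketch of what the list of lists does and does not record: each group can only be reached by its position in `AllGroups`, so the only label available for a group is an index number. The names below are copied from the lists defined earlier, and `enumerate()` is a Python built-in that yields each position alongside the entry.

```python
GroupA = ['Sarah', 'Richard', 'Matthew', 'Fiona', 'Sally', 'Samuel']
GroupB = ['Richard', 'Sally', 'Sandy', 'Peter', 'Rebecca', 'Steve', 'Lesley', 'Stuart']
GroupC = ['Simon', 'Laura', 'Gareth', 'Alan', 'Helen']
AllGroups = [GroupA, GroupB, GroupC]

# Stacked indices: the first picks a group, the second picks a student.
third_student_in_second_group = AllGroups[1][2]
print(third_student_in_second_group)

# The position is the only "name" a group has in this structure.
for position, group in enumerate(AllGroups):
    print(position, len(group), 'students')
```

Nothing in this structure says that position `1` means "Group B"; that knowledge lives only in the programmer's head.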

#### _Exercise 3.1_

What other combination of data structures might we use to encapsulate our class lists? Before you look at the next section, try to figure out what the best nested structure might be for this data.

In [49]:
# a dictionary would probably be a better fit for this data, as we have a class name associated with each list of students. see example below...

#### Choosing An Appropriate Structure

Consider what information we have for each class: it has a name (A, B, C) and a list (the student names). This sounds a lot like the kind of information that is best stored in a dictionary - we have a key (the group name) and an associated value (the list of students). So, we can combine the data into a second type of nested structure: a dictionary of lists.

In [24]:
AllGroups = { 'Group A': GroupA,
              'Group B': GroupB,
              'Group C': GroupC }

Now that we have produced this dictionary, we can iterate over it, and each of the lists contained within, using a very similar nested `for` loop structure to the one we had before for our list of lists:

In [25]:
for group in AllGroups.keys():
    print(group)
    for student in AllGroups[group]:
        print(student)

Group A
Sarah
Richard
Matthew
Fiona
Sally
Samuel
Group B
Richard
Sally
Sandy
Peter
Rebecca
Steve
Lesley
Stuart
Group C
Simon
Laura
Gareth
Alan
Helen


Things have improved a little. Now we are printing the name of the group each time we start a new one, but the group names are quite hard to spot amongst the names of the students, and they appear in alphabetical order only because that is the order in which we added them to the dictionary. From Python 3.7 onwards, dictionaries are guaranteed to return their keys in insertion order (CPython 3.6 already behaved this way as an implementation detail); in earlier versions, the order in which keys are returned by `.keys()` can't be relied upon or predicted. To make sure that you understand what just happened, let's take a look at that code block above step-by-step.

```Python
for group in AllGroups.keys():
```

Here, we create a `for` loop to iterate over every key in the dictionary `AllGroups`. Hopefully, you recognise this from the previous worksheet. At the start of each iteration of this `for` loop, the variable `group` is assigned the next key from the dictionary.

```Python
    print(group)
```

We print the group name before starting to loop through the list of student names.

```Python
    for student in AllGroups[group]:
```

Now, we are defining the second `for` loop, which assigns the name of the next student in the list for the current group to the variable `student`. Here, remember that `AllGroups[group]` returns the value associated with the key `group` in the dictionary `AllGroups`. `group` is whichever group name we are currently dealing with in this iteration of the first `for` loop. That is, if the first key returned by `AllGroups.keys()` is `'Group B'`, then:
 - `group` has been assigned the value `'Group B'`,  
 - so `AllGroups[group]` is currently equivalent to `AllGroups['Group B']`,  
 - which returns the value associated with the key `'Group B'`, which is the list `GroupB`.

If you followed that, then you will understand that the last line

```Python
        print(student)
```

will print out the name of the next student in the list referred to by `AllGroups[group]`.

The order of the four lines of code above is very important. Consider what would happen if you put the `print(group)` line within the second `for` loop. You can even give it a try and see if your hypothesis was right.
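
The lookup step in the walkthrough above can be checked directly. The following is a small sketch, using the same lists as before, which confirms that `AllGroups['Group B']` hands back the very list that was stored. It also shows one possible alternative, the dictionary method `.items()`, which yields each key together with its value so that no lookup is needed inside the loop.

```python
GroupA = ['Sarah', 'Richard', 'Matthew', 'Fiona', 'Sally', 'Samuel']
GroupB = ['Richard', 'Sally', 'Sandy', 'Peter', 'Rebecca', 'Steve', 'Lesley', 'Stuart']
GroupC = ['Simon', 'Laura', 'Gareth', 'Alan', 'Helen']
AllGroups = {'Group A': GroupA, 'Group B': GroupB, 'Group C': GroupC}

group = 'Group B'
# Looking up the key returns the same list object that was stored.
print(AllGroups[group] is GroupB)

# .items() yields (key, value) pairs: the group name and its list of students.
for group_name, students in AllGroups.items():
    print(group_name)
    for student in students:
        print(student)
```

Both loop styles visit the same names in the same order; which one reads better is largely a matter of taste.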

#### _Exercise 3.2_

As mentioned before, there are a couple of problems with the output that we are getting from the code block above. Try to find a way to make the names of the groups stand out a bit more from the names of the students.  
After you have achieved that, see if you can find a way to explicitly control the order in which the groups are displayed, so that they appear alphabetically - Group A, Group B, Group C.

In [26]:
for group in AllGroups.keys():
    print(group.upper())
    for student in AllGroups[group]:
        print(student)

GROUP A
Sarah
Richard
Matthew
Fiona
Sally
Samuel
GROUP B
Richard
Sally
Sandy
Peter
Rebecca
Steve
Lesley
Stuart
GROUP C
Simon
Laura
Gareth
Alan
Helen


In [27]:
for group in AllGroups.keys():
    print()
    print(group.upper())
    for student in AllGroups[group]:
        print(student)


GROUP A
Sarah
Richard
Matthew
Fiona
Sally
Samuel

GROUP B
Richard
Sally
Sandy
Peter
Rebecca
Steve
Lesley
Stuart

GROUP C
Simon
Laura
Gareth
Alan
Helen


In [28]:
groups = list(AllGroups.keys())
groups.sort()
for group in groups:
    print()
    print(group.upper())
    for student in AllGroups[group]:
        print(student)


GROUP A
Sarah
Richard
Matthew
Fiona
Sally
Samuel

GROUP B
Richard
Sally
Sandy
Peter
Rebecca
Steve
Lesley
Stuart

GROUP C
Simon
Laura
Gareth
Alan
Helen


In [29]:
groups = sorted(AllGroups.keys())
print(type(groups))
for group in groups:
    print()
    print(group.upper())
    for student in sorted(AllGroups[group]):
        print(student)

<class 'list'>

GROUP A
Fiona
Matthew
Richard
Sally
Samuel
Sarah

GROUP B
Lesley
Peter
Rebecca
Richard
Sally
Sandy
Steve
Stuart

GROUP C
Alan
Gareth
Helen
Laura
Simon
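
The last two solutions above order the keys in two different ways. As a brief sketch of the difference (the variable names here are illustrative only): the list method `.sort()` reorders an existing list in place and returns `None`, whereas the built-in `sorted()` accepts any iterable and returns a new list, leaving the original untouched.

```python
group_names = ['Group C', 'Group A', 'Group B']  # deliberately out of order

# sorted() builds a new list; group_names itself is unchanged.
ordered_copy = sorted(group_names)
print(ordered_copy)
print(group_names)

# .sort() reorders the list in place and returns None.
result = group_names.sort()
print(result)
print(group_names)
```

This is why `sorted(AllGroups.keys())` works in one step, while the `.sort()` version first has to convert the keys into a list.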


#### More Nesting

There is one more type of nested data structure that we need to consider. To help with this, we need to expand the dataset slightly. Being the diligent and conscientious tutor that you are, you spent hours preparing, running and marking an assessment for each seminar group. You have collected the results, which are given below:

__Group A__  

| Student | Mark |
|---------|------|
| Sarah   | 78   |
| Richard | 65   |
| Matthew | 53   |
| Fiona   | 71   |
| Sally   | 43   |
| Samuel  | 80   |

__Group B__  

| Student | Mark |
|---------|------|
| Richard | 57   |
| Sally   | 89   |
| Sandy   | 65   |
| Peter   | 77   |
| Rebecca | 62   |
| Steve   | 71   |
| Lesley  | 75   |
| Stuart  | 80   |

__Group C__  

| Student | Mark |
|---------|------|
| Simon   | 47   |
| Laura   | 91   |
| Gareth  | 74   |
| Alan    | 61   |
| Helen   | 74   |


Now, we have pairs of data for each group - the student's name and their mark. We want to store this data in a way that keeps all of the results together for all of the groups, but allows the data for each group and individual student to be accessed independently. If you first consider the groups individually - pairs of names and marks - this type of dataset is clearly best represented as a dictionary. And, for the reasons discussed before, we know that storing each group and its data in a dictionary that can be accessed by the name of the group is a good idea too. So, to store all of this information we want to build a dictionary of dictionaries. First, let's create our individual dictionaries for each group.

__Note__ If you're not working interactively with the IPython Notebook version of this workbook, you could be about to do a lot of typing to enter all this data yourself. To save you the trouble, the data is available as a file from [GitHub](https://git.embl.de/grp-bio-it/ITPP/blob/master/seminarGroupMarks.txt). However, we won't cover how to read data from a file until the next worksheet. So for now, you can either skip ahead to find out how to do it (but make sure that you come back later!), read through and try to follow along without typing everything in yourself (you might find it difficult to understand exactly what's going on this way), or type/copy the whole lot. Sorry!

In [30]:
GroupA_Results = { 'Sarah'  : 78 ,
                   'Richard': 65 ,
                   'Matthew': 53 ,
                   'Fiona'  : 71 ,
                   'Sally'  : 43 ,
                   'Samuel' : 80 }
GroupB_Results = { 'Richard': 57 ,
                   'Sally'  : 89 ,
                   'Sandy'  : 65 ,
                   'Peter'  : 77 ,
                   'Rebecca': 62 ,
                   'Steve'  : 71 ,
                   'Lesley' : 75 ,
                   'Stuart' : 80 }
GroupC_Results = { 'Simon'  : 47 ,
                   'Laura'  : 91 ,
                   'Gareth' : 74 ,
                   'Alan'   : 61 ,
                   'Helen'  : 74 }

Now, we can create a top-level dictionary with three entries, one for each group. The keys are the names of the groups, and the associated values are the dictionaries of results for the students in each group.

In [31]:
AllGroupResults = { 'Group A' : GroupA_Results ,
                    'Group B' : GroupB_Results ,
                    'Group C' : GroupC_Results }

All of the assessment results are stored in a single dictionary. You can access the results for a particular group quite easily, using the approach introduced earlier:

In [32]:
AllGroupResults['Group C']

{'Simon': 47, 'Laura': 91, 'Gareth': 74, 'Alan': 61, 'Helen': 74}

But what if you want to know how a particular student scored? How can we access the value for a particular key in the dictionary that is itself the value associated with a key at the top level? Well, above we accessed the value associated with the key `'Group C'` using the syntax `dictionary[key]`. In this case, we know that that returns another dictionary. So, if we now want to get the score for a particular student in that group, we just query that dictionary in the same way.

In [33]:
AllGroupResults['Group C']['Laura']

91

If this looks strange to you, or you're struggling to make sense of it, remember that Python will interpret the line from left to right:  
- first, it comes across the variable `AllGroupResults`, which it identifies as a dictionary  
- then, it sees that you want to extract the value associated with the key `'Group C'`  
- it fetches that value, `AllGroupResults['Group C']`, and identifies it as a dictionary, too  
- then it moves on to the last part of the line, `['Laura']`, and recognises that you want to extract the value associated with the key `'Laura'` in the dictionary given by `AllGroupResults['Group C']`  
- finally, it fetches that value and returns it.
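
The left-to-right steps above can be sketched by splitting the stacked lookup into two separate lines. Only Group C is included here to keep the example short, and the intermediate variable names are illustrative.

```python
AllGroupResults = {
    'Group C': {'Simon': 47, 'Laura': 91, 'Gareth': 74, 'Alan': 61, 'Helen': 74},
}

# Step 1: fetch the inner dictionary for the group.
group_c_results = AllGroupResults['Group C']
# Step 2: fetch the mark for one student from that inner dictionary.
lauras_mark = group_c_results['Laura']

# The stacked form gives the same value in a single expression.
print(lauras_mark == AllGroupResults['Group C']['Laura'])
```

Splitting a long chain of lookups like this can also make it easier to find out which key is missing when something goes wrong.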

#### _Exercise 3.3_

Using what you've learned above about iterating over nested data structures, write a program that will loop through every student in every group and print out their mark. Make sure that the output of your program identifies the group that each student belongs to.  
Once you have achieved this, try changing the behaviour of your program to calculate and output the (mean) average mark for each group.

In [37]:
for group in AllGroupResults:
    print()
    print('Results for {}'.format(group))
    for student in AllGroupResults[group]:
        print('{}:\t{}'.format(student, AllGroupResults[group][student]))


Results for Group A
Sarah:	78
Richard:	65
Matthew:	53
Fiona:	71
Sally:	43
Samuel:	80

Results for Group B
Richard:	57
Sally:	89
Sandy:	65
Peter:	77
Rebecca:	62
Steve:	71
Lesley:	75
Stuart:	80

Results for Group C
Simon:	47
Laura:	91
Gareth:	74
Alan:	61
Helen:	74


In [38]:
for group in AllGroupResults:
    print()
    print('Results for {}'.format(group))
    for student in AllGroupResults[group]:
        print('{}:\t{}'.format(student, AllGroupResults[group][student]))
    group_scores = list(AllGroupResults[group].values())
    mean_score = sum(group_scores) / len(group_scores)
    print('Mean score for {}: {}'.format(group, mean_score))


Results for Group A
Sarah:	78
Richard:	65
Matthew:	53
Fiona:	71
Sally:	43
Samuel:	80
Mean score for Group A: 65.0

Results for Group B
Richard:	57
Sally:	89
Sandy:	65
Peter:	77
Rebecca:	62
Steve:	71
Lesley:	75
Stuart:	80
Mean score for Group B: 72.0

Results for Group C
Simon:	47
Laura:	91
Gareth:	74
Alan:	61
Helen:	74
Mean score for Group C: 69.4
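
As one possible variation on the solution above, the mean can also be calculated with the `mean()` function from Python's standard `statistics` module, rather than dividing `sum()` by `len()` by hand. This sketch uses the marks as entered in the dictionaries earlier; the `group_means` dictionary is an illustrative name, not part of the original solution.

```python
from statistics import mean

AllGroupResults = {
    'Group A': {'Sarah': 78, 'Richard': 65, 'Matthew': 53,
                'Fiona': 71, 'Sally': 43, 'Samuel': 80},
    'Group B': {'Richard': 57, 'Sally': 89, 'Sandy': 65, 'Peter': 77,
                'Rebecca': 62, 'Steve': 71, 'Lesley': 75, 'Stuart': 80},
    'Group C': {'Simon': 47, 'Laura': 91, 'Gareth': 74,
                'Alan': 61, 'Helen': 74},
}

# Map each group name to the mean of the marks in its inner dictionary.
group_means = {}
for group in AllGroupResults:
    group_means[group] = mean(AllGroupResults[group].values())

for group in sorted(group_means):
    print('Mean score for {}: {:.1f}'.format(group, group_means[group]))
```

Storing the means in their own dictionary keeps the calculation separate from the printing, which is handy if the values are needed again later.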


#### Scaling Up

Being able to store and access data in a suitable structure is all well and good, but the benefits of it become more obvious when you have a repetitive task that you need to complete. This is especially true when you consider this kind of problem applied to a dataset much larger than the toy examples that we are working with here. What if you ran a whole faculty, with tens of different seminar groups, containing hundreds of students, who each take multiple exams/assessments?!

In this kind of situation, manual entry of the data isn't practical. Instead, it is much more helpful to be able to read the information that you need from a file. Working with input from other sources, and output of results, will be covered in the next worksheet.

#### _Exercise 3.4_

The exercises in the next worksheet are designed to be more challenging for newcomers to programming/Python. In preparation for this, here is one last exercise designed to consolidate the things that you have learned so far.  
Now that you have finished marking your students' assessments, you need to let them know how they got on. Write a program that will print out the body of an email to each student according to this template:

Dear _[name]_,  
I have finished marking the assessment for your seminar group, _[group name]_. You scored _[their mark]_.  
_[an additional comment according to their score (see below)]_  
Kind regards,  
_[your name]_

The additional comment on the third line of the email should be chosen according to the mark that they scored on the assessment.  

| Score range | Comment                                                                |
|-------------|------------------------------------------------------------------------|
| <60:        | 'You must try harder next time. Are you taking this course seriously?' |
| 60-79:      | 'Well done, that's a good score.'                                      |
| >79:        | 'Congratulations! That's an excellent score!'                          |


In [41]:
my_name = 'Toby'

def get_feedback_message(score):
    if score < 60:
        return 'You must try harder next time. Are you taking this course seriously?'
    elif 60 <= score <= 79:
        return "Well done, that's a good score."
    else:
        return "Congratulations! That's an excellent score!"

template = '''Dear {}, 
I have finished marking the assessment for your seminar group, {}. 
You scored {}. 
{}
Kind regards,
{}'''
for group in AllGroupResults:
    for student in AllGroupResults[group]:
        score = AllGroupResults[group][student]
        print(template.format(student, 
                              group, 
                              score, 
                              get_feedback_message(score), 
                              my_name))
        print()

Dear Sarah, 
I have finished marking the assessment for your seminar group, Group A. 
You scored 78. 
Well done, that's a good score.
Kind regards,
Toby

Dear Richard, 
I have finished marking the assessment for your seminar group, Group A. 
You scored 65. 
Well done, that's a good score.
Kind regards,
Toby

Dear Matthew, 
I have finished marking the assessment for your seminar group, Group A. 
You scored 53. 
You must try harder next time. Are you taking this course seriously?
Kind regards,
Toby

Dear Fiona, 
I have finished marking the assessment for your seminar group, Group A. 
You scored 71. 
Well done, that's a good score.
Kind regards,
Toby

Dear Sally, 
I have finished marking the assessment for your seminar group, Group A. 
You scored 43. 
You must try harder next time. Are you taking this course seriously?
Kind regards,
Toby

Dear Samuel, 
I have finished marking the assessment for your seminar group, Group A. 
You scored 80. 
Congratulations! That's an excellent score!
Kind 

Of course, the potential for nested data structures doesn't stop at two-level lists of lists, or lists of dictionaries, or dictionaries of lists, or dictionaries of dictionaries! Depending on the situation, you might want to combine some of the other different structure types (which we don't cover so much here), build up three- or four-level structures, and so on. Be warned, though: with every additional layer, your program becomes more and more complicated. This makes it harder for you to keep track of what you're dealing with while you're writing it, harder to read when you or someone else comes back to the program at a later date, and more difficult to identify and correct mistakes in the code itself.
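
To give a feel for what a third level looks like, here is a small sketch in which each student has a list of marks rather than a single mark. The structure name `AllGroupMarks` and the marks in it are invented purely for illustration and are not part of the seminar dataset above.

```python
# Hypothetical data: group -> student -> list of marks (values are made up).
AllGroupMarks = {
    'Group A': {'Sarah': [78, 82], 'Richard': [65, 70]},
    'Group B': {'Sandy': [65, 59]},
}

# Three levels of data need three nested loops to reach every mark.
for group in AllGroupMarks:
    for student in AllGroupMarks[group]:
        for mark in AllGroupMarks[group][student]:
            print(group, student, mark)

# Three stacked lookups: group key, student key, then list index.
first_mark_for_sarah = AllGroupMarks['Group A']['Sarah'][0]
print(first_mark_for_sarah)
```

Even in this tiny example, the innermost line has to mention three names at once, which hints at how quickly deeper structures become hard to follow.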

#### Summary

* The elements of data types like lists or dictionaries can themselves be things like lists or tuples or dictionaries, allowing arbitrarily complex data structures to be built up.
* `for` loops can be nested in order to access every entry in these complex structures.
* Individual entries can be accessed by stacking up indices/keys, which Python interprets one at a time, from left to right.
* The values of variables can be inserted into strings, with their format controlled, using the `.format()` string method.
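
As a brief sketch of what controlling the format can look like: the replacement fields passed to `.format()` may carry a format specification after a colon. In the example below, `{:<10}` pads the group name to ten characters, left-aligned, and `{:>6.1f}` right-aligns the number in six characters with one decimal place. The group means are the ones calculated in Exercise 3.3.

```python
group_means = {'Group A': 65.0, 'Group B': 72.0, 'Group C': 69.4}

lines = []
for group in sorted(group_means):
    # {:<10} left-aligns in 10 characters; {:>6.1f} right-aligns a float
    # in 6 characters, rounded to 1 decimal place.
    lines.append('{:<10}{:>6.1f}'.format(group, group_means[group]))

print('\n'.join(lines))
```

Fixed-width fields like these keep columns lined up without relying on tab characters.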

#### _Debugging Exercise_

In this exercise, you are presented with some code that doesn't work as it should. Your task is to debug it.

In [53]:
# You start with a group of students and scores and you 
# would like to print the name and the score of each.
# Can you spot and fix the errors?
# Hint: Each line contains at least one error/mistake

group = {'Sarah':20, 'Richard':30, 'Matthew':40, 'Fiona':50, 'Sally':60, 'Samuel':70}

for student in group:
    print("The score of {} is {}".format(student, group[student]))


The score of Sarah is 20
The score of Richard is 30
The score of Matthew is 40
The score of Fiona is 50
The score of Sally is 60
The score of Samuel is 70
