> **IMPORTANT:** Every week, you will be solving exercises in a Jupyter notebook that looks like this one. Because you are cloning a GitHub repository that only I can push to, you should **NOT EDIT** any of the files you pull from GitHub. Instead, either make a new notebook and write your solutions there, or make a copy of this notebook and **save it somewhere else** on your computer (not inside the `caobd_s19` folder that you cloned) and write your answers in the copy. **If you don't follow this advice, your solutions may be overwritten and lost**.

# Week 1: Coding with data in Python

We start out with the basics. The exercises today cover:

* Writing Python code and Markdown in Jupyter notebooks
* Introductory Python
* Getting some data from Reddit

**Feedback:** I'm always trying to improve. If you find errors or have concerns you can voice them safely and anonymously at https://ulfaslak.com/vent. You can also send me an email at ulfaslak@gmail.com or talk to me in class. I care about everything you have to say.

## Exercises

### Part 1: Know thy notebook

This document is what we call a *Jupyter notebook*. We will be using these extensively throughout the course so **READ THIS CLOSELY**. There are two basic things you need to know about Jupyter notebooks:

1. A notebook is nothing but a list of cells. A cell can either be a **code cell** or a **Markdown cell**. Code cells are for writing executable code, and Markdown cells (like this one) are for explaining things in text and making your notebook more readable. A typical workflow that you will soon get used to is something like: solving a problem with some code in a *code cell* and explaining your reasoning or the results you obtained in a *Markdown cell*. You can toggle cell type when you are in *command mode* by pressing <kbd>y</kbd> for code and <kbd>m</kbd> for Markdown. **Try to do that**. Change this Markdown cell to a code cell, and change it back again. What happens if you execute (<kbd>shift</kbd>+<kbd>enter</kbd>) this cell as a code cell, compared to when it is a Markdown cell?

2. The notebook has two *modes*: **edit mode** and **command mode**. You enter command mode by pressing <kbd>esc</kbd> or clicking outside a cell, and edit mode by clicking a cell and pressing <kbd>enter</kbd> or double clicking a cell. When you're in edit mode, the outline of the current cell turns green (not in `jupyter lab`, though; there the bar is always blue) and whatever you type into your keyboard goes into that cell, whether it is a code or Markdown cell. [Here](http://maxmelnick.com/2016/04/19/python-beginner-tips-and-tricks.html)'s a nice rundown of the different commands you can use. **Beware of <kbd>x</kbd> and <kbd>d</kbd>**. Read the full list of hotkeys by pressing <kbd>h</kbd> in command mode to figure out why.

>*Heads up:* Because we'll be using Jupyter notebooks so much in this course, I strongly recommend investing 5 minutes more than you would normally, playing around with cell types, modes and hotkeys. It will save you heaps of time down the road.

When you run a code cell by pressing <kbd>shift</kbd> + <kbd>enter</kbd>, the code gets evaluated by the Python interpreter installed on your computer. If the last line of the cell is an expression, its value is displayed below the cell; if you store the value in a variable instead, nothing is displayed. In general, you will use code cells for doing analysis and working with data.

*Markdown* is a simple markup language for formatting text (similar to *HTML* or $\LaTeX$, which you may know). You will typically use it for writing explanations about how you solve the exercises and the results you get, and styling your notebook with sections and subsections. It can do **bold**, *italics* and $\LaTeX$ formatting (for equations), and much much more. You can read about the Markdown language [here](http://daringfireball.net/projects/markdown/).

Below is your first exercise. The exercises are numbered by the convention `[session]`.`[section]`.`[problem]`.`[subproblem]`. For example, exercise 4.2.3.1 is in week 4, section 2, problem 3, and subproblem 1.

>**Ex. 1.0.1**: In the Markdown cell below, write a short text that shows that you can:
>* Create sections
>* Write words in bold and italics
>* Write an equation in LaTeX formatting
>* Create bullet lists
>* Create [hyperlinks](https://en.wikipedia.org/wiki/Hyperlink)

>*Hint: Remember to execute the cell (<kbd>shift</kbd>+<kbd>enter</kbd>) so the Markdown gets rendered.*

>#### Answer to Ex. 1.0.1
>
>Text in **bold** and *italics*.
>
>LaTeX equation: $x = y^2$
>
>bullet list:
>* one
>* two
>
>hyperlink: [google](https://www.google.com)

### Part 2: Essential Python (DSFS Chapter 2)

These exercises take you through some very basic Python functionality. Use them to calibrate your expectations: If you find them hard, you must spend some more time getting up to speed (see the [preparation goals](https://canvas.disabroad.org/courses/2500/pages/sessions) for today's session).

>**Ex. 1.1.1**: Create a list `a` that contains the numbers from $1$ to $1110$ (including $1$ and $1110$), incremented by one, using the `range` function.

In [9]:
a = list(range(1, 1111))
print(a[-1])

1110


>**Ex. 1.1.2**: Show that you understand [slicing](http://stackoverflow.com/questions/509211/explain-pythons-slice-notation) in Python by extracting a list `b` with the numbers from $760$ to $769$ (including both) from the list created above.

In [11]:
b = a[759:769]
print(b[0], b[-1])

760 769
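
Why `a[759:769]` and not `a[760:770]`? A minimal sketch of the index arithmetic, assuming the list holds the numbers 1 to 1110 as the exercise asks. The names `nums` and `window` are made up for this illustration.

```python
# Values 1..1110 live at indices 0..1109, so the value v sits at index v - 1.
nums = list(range(1, 1111))

# A slice includes the start index and excludes the stop index,
# so indices 759..768 hold the values 760..769.
window = nums[759:769]

print(window[0], window[-1])  # 760 769
print(len(window))            # 10

# Negative indices count from the end of the list.
print(nums[-3:])              # [1108, 1109, 1110]
```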


>**Ex. 1.1.3**: Define a function that takes as input a number $x$ and outputs the number multiplied by the sum of itself and three, $f(x) = x(x+3)$.

In [16]:
def fff(x):
    return x*x+3*x
print(fff(2))

10


>**Ex. 1.1.4**: Apply this function to every element of the list `b` using a `for` loop and append the results to a new list `c`. Print `c`.

In [18]:
c = []
for x in b:
    c.append(fff(x))
    
print(c)

[579880, 581404, 582930, 584458, 585988, 587520, 589054, 590590, 592128, 593668]


>**Ex. 1.1.5**: Do the exact same thing using a *list comprehension*.

In [20]:
d = [fff(x) for x in b]
print(d)

[579880, 581404, 582930, 584458, 585988, 587520, 589054, 590590, 592128, 593668]


>**Ex. 1.1.6**: Write the numbers in `c` to a text file with one number per line.

In [89]:
# [Answer to Ex. 1.1.6]

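# Open the file for writing; the `with` block closes it automatically.
with open("output-week1.txt", "w") as f:
    for i in c:
        # Write each number on its own line and echo it to the notebook.
        f.write(str(i) + "\n")
        print(i)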

579880
581404
582930
584458
585988
587520
589054
590590
592128
593668
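
One way to check that the file really contains one number per line is to read it back. This is only a sketch: it writes to a throwaway temporary file (the name `output-check.txt` and the variables `values` and `read_back` are assumptions for the example) so it does not touch `output-week1.txt`.

```python
import os
import tempfile

values = [579880, 581404, 582930]  # the first three numbers from `c`

# Write to a throwaway directory so the sketch leaves your working folder alone.
path = os.path.join(tempfile.mkdtemp(), "output-check.txt")

# `with` closes the file automatically, even if an error occurs while writing.
with open(path, "w") as f:
    for value in values:
        f.write(str(value) + "\n")

# Read the file back; int() ignores the trailing newline on each line.
with open(path) as f:
    read_back = [int(line) for line in f]

print(read_back == values)  # True
```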


>**Ex. 1.1.7**: Show that you understand how strings work in Python. You should:
>
>1. Add a comment above each line of code that explains it.
>2. Find all the lines where **a string** is put into a string. How many are there?
>3. Explain the difference between `%d`, `%s` and `%r`.
>
>[Source](https://learnpythonthehardway.org/book/ex6.html)

In [39]:
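# This is an example of a comment

# Insert the integer 10 into the string with %d.
x = "There are %d types of people." % 10
# Store two plain strings in variables.
binary = "binary"
do_not = "don't"
# Insert both strings into a third string with %s.
y = "Those who know %s and those who %s." % (binary, do_not)

# Print the two formatted strings.
print(x)
print(y)

# %r inserts the representation of the object, which can be any type of object.
# For a string, that representation includes the surrounding quotes.
print("I said: %r." % x)
# %s inserts the plain text of y; the single quotes are typed in the format string.
print("I also said: '%s'." % y)

# A boolean, and a format string with one %r placeholder.
hilarious = False
joke_evaluation = "Isn't that joke so funny?! %r"

# Fill the placeholder with the representation of False.
print(joke_evaluation % hilarious)

# Two halves of a sentence, stored as separate strings.
w = "This is the left side of..."
e = "a string with a right side."

# Join the two strings by concatenation with +.
print(w + e)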

There are 10 types of people.
Those who know binary and those who don't.
I said: 'There are 10 types of people.'.
I also said: 'Those who know binary and those who don't.'.
Isn't that joke so funny?! False
This is the left side of...a string with a right side.


[Answer to Ex. 1.1.7.2]

[Answer to Ex. 1.1.7.3]

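`%d` formats a number as a decimal integer. `%s` inserts `str()` of the object, i.e. its readable text. `%r` inserts `repr()` of the object, so it works for any type of object and shows strings with their quotes. That is useful when you don't know the exact type, or for e.g. a list with heterogeneous objects.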
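
A short sketch that makes the three format codes visible side by side. The variable names (`value`, `as_s`, `as_r`, `as_d`) are made up for the illustration.

```python
value = "don't"

as_s = "%s" % value  # uses str(value): just the text
as_r = "%r" % value  # uses repr(value): the text with quotes around it
as_d = "%d" % 10     # formats a number as a decimal integer

print(as_s)  # don't
print(as_r)  # "don't"
print(as_d)  # 10

# %d only accepts numbers, so a string raises a TypeError.
try:
    "%d" % "ten"
except TypeError as err:
    print("TypeError:", err)
```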

>**Ex. 1.1.8**: Why does `5 // 2 == 2` in Python 3.7? How is division different between Python 2 and 3?

In [2]:
5 // 2

2
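
A possible answer, as a sketch: in Python 3, `/` is always true division and returns a float, while `//` is floor division and rounds down to the nearest whole number. In Python 2, `/` between two integers already behaved like floor division, so `5 / 2` gave `2`. The variable names below are only for illustration.

```python
true_division = 5 / 2     # always a float in Python 3
floor_division = 5 // 2   # rounds down to a whole number
negative_floor = -5 // 2  # floors toward minus infinity, not toward zero
float_floor = 5.0 // 2    # floor division of a float stays a float

print(true_division)   # 2.5
print(floor_division)  # 2
print(negative_floor)  # -3
print(float_floor)     # 2.0
```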

>**Ex. 1.1.9**: What is the point of using `try` and `except`? Write some code that shows how to use these.

In [33]:
# try/except lets you attempt something that might raise an error and handle that case
# right where the basic functionality is written, instead of letting the program crash.
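
The exercise also asks for code. A minimal sketch, where `safe_divide` is a hypothetical helper invented for this example:

```python
def safe_divide(numerator, denominator):
    """Return numerator / denominator, or None if the denominator is zero."""
    try:
        return numerator / denominator
    except ZeroDivisionError:
        # Handle only the failure we expect; any other error still propagates.
        return None


print(safe_divide(10, 4))  # 2.5
print(safe_divide(10, 0))  # None
```

Catching a specific exception such as `ZeroDivisionError`, rather than using a bare `except:`, keeps unrelated bugs from being silently hidden.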

>**Ex. 1.1.10**: `dict`s and `defaultdict`s.
>1. What is a `defaultdict`? How would you say it is different from a normal Python `dict`?
>2. Write some code that takes a list of tuples:
>
>        l = [("a", 1), ("b", 3), ("a", None), ("c", False), ("b", True), ("a", None)]
>
>     and produces a `defaultdict` object
>
>        defaultdict(<class 'list'>, {'a': [1, None, None], 'b': [3, True], 'c': [False]})
>
>*Hint: you can import `defaultdict` from `collections`*

In [22]:
# A normal dict raises a KeyError when you look up a key that doesn't exist.
# A defaultdict instead calls its default factory (e.g. int, list or dict) and stores
# the result (0, [] or {}) under the missing key before returning it.

from collections import defaultdict

l = [("a", 1), ("b", 3), ("a", None), ("c", False), ("b", True), ("a", None)]

dd = defaultdict(list)
for x,y in l:
    dd[x].append(y)
    
print(dd)


defaultdict(<class 'list'>, {'a': [1, None, None], 'b': [3, True], 'c': [False]})
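
To see the difference directly, here is a small sketch comparing the two. The names `plain`, `grouped` and `counts` are made up for the illustration.

```python
from collections import defaultdict

# A normal dict raises a KeyError for a missing key.
plain = {}
try:
    plain["a"].append(1)
except KeyError as err:
    print("KeyError:", err)

# defaultdict(list) calls list() for a missing key, giving an empty list.
grouped = defaultdict(list)
grouped["a"].append(1)
print(dict(grouped))  # {'a': [1]}

# defaultdict(int) calls int() for a missing key, giving 0.
counts = defaultdict(int)
counts["x"] += 1
print(dict(counts))   # {'x': 1}

# Side effect to be aware of: merely reading a missing key inserts it.
_ = grouped["z"]
print("z" in grouped)  # True
```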


>**Ex. 1.1.11**: Take a list `a = list("justreadtheinstructions")` and
>1. count the number of times each element occurs using `Counter`,
>2. report the two most common elements.
>
>*Hint: you can import `Counter` from `collections`*

In [78]:
# [Answer to Ex. 1.1.11]

a = list("justreadtheinstructions")
print(a)

from collections import Counter 

counter_a = Counter(a)
print(counter_a)

for letter, count in counter_a.most_common(2):
    print(letter, count)

type(counter_a)

['j', 'u', 's', 't', 'r', 'e', 'a', 'd', 't', 'h', 'e', 'i', 'n', 's', 't', 'r', 'u', 'c', 't', 'i', 'o', 'n', 's']
Counter({'t': 4, 's': 3, 'u': 2, 'r': 2, 'e': 2, 'i': 2, 'n': 2, 'j': 1, 'a': 1, 'd': 1, 'h': 1, 'c': 1, 'o': 1})
t 4
s 3


collections.Counter

>**Ex. 1.1.12**: Take another list `b = list("ofcourseistillloveyou")` and
>1. get the `set` of characters that exist in both `a` and `b` (intersection),
>2. get the `set` of characters that exist in either `a` or `b` (union), and
>3. compute the [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index) between the distinct elements in `a` and `b`.
>
>*Hint: use the `set` function to get a `set`-type object of distinct elements from a list*

In [79]:
# [Answer to Ex. 1.1.12]

b = list("ofcourseistillloveyou")
print(b)

set_a = set(a)
print(set_a)

set_b = set(b)
print(set_b)

intersection = set_a.intersection(set_b)
print(intersection)

union = set_a.union(set_b)
print(union)

jaccard_similarity = len(intersection) / len(union)
print(jaccard_similarity)

['o', 'f', 'c', 'o', 'u', 'r', 's', 'e', 'i', 's', 't', 'i', 'l', 'l', 'l', 'o', 'v', 'e', 'y', 'o', 'u']
{'j', 'e', 't', 'h', 's', 'n', 'r', 'd', 'c', 'u', 'a', 'o', 'i'}
{'l', 'v', 'f', 'e', 't', 's', 'r', 'y', 'c', 'u', 'o', 'i'}
{'e', 't', 's', 'r', 'c', 'u', 'o', 'i'}
{'l', 'f', 'j', 'e', 'h', 's', 'n', 'y', 'c', 'a', 'o', 'i', 'v', 't', 'r', 'd', 'u'}
0.47058823529411764
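
The same computation can be wrapped in a reusable function using the set operators `&` (intersection) and `|` (union). This is a sketch: the function name `jaccard` and the choice to return `0.0` for two empty inputs are assumptions made here, not part of the exercise.

```python
def jaccard(first, second):
    """Jaccard similarity between the distinct elements of two iterables."""
    set_first, set_second = set(first), set(second)
    union = set_first | set_second
    if not union:
        # Both inputs are empty; returning 0.0 is a convention chosen for this sketch.
        return 0.0
    return len(set_first & set_second) / len(union)


# 8 shared characters out of 17 distinct characters in total.
print(jaccard("justreadtheinstructions", "ofcourseistillloveyou"))
```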


### Part 3: A little bit of real data

>**Ex. 1.2.1**: Learn about JSON by reading the **[Wikipedia page](https://en.wikipedia.org/wiki/JSON)**. Then answer the following questions in the cell below.
>
>1. What do the letters stand for?
>2. What is `json`?
>3. Why is `json` superior to `xml`? (... or why not?)

[Answers to Ex. 1.2.1.1-3]

1. JavaScript Object Notation
2. JSON is a language-independent data format. It was derived from JavaScript, but many modern programming languages include code to generate and parse JSON-format data.
3. JSON is promoted as a low-overhead alternative to XML as both of these formats have widespread support for creation, reading, and decoding in the real-world situations where they are commonly used. XML has been used to describe structured data and to serialize objects. Various XML-based protocols exist to represent the same kind of data structures as JSON for the same kind of data interchange purposes. Data can be encoded in XML in several ways. The most expansive form using tag pairs results in a much larger representation than JSON, but if data is stored in attributes and 'short tag' form where the closing tag is replaced with '/>', the representation is often about the same size as JSON or just a little larger. However, an XML attribute can only have a single value and each attribute can appear at most once on each element.

XML separates "data" from "metadata" (via the use of elements and attributes), while JSON does not have such a concept.

Another key difference is the addressing of values. JSON has objects with a simple "key" → "value" mapping, whereas in XML addressing happens on "nodes", which all receive a unique ID via the XML processor. Additionally, the XML standard defines a common attribute "xml:id", that can be used by the user, to set an ID explicitly.

XML values are strings of characters, with no built-in type safety. XML has the concept of schema, that permits strong typing, user-defined types, predefined tags, and formal structure, allowing for formal validation of an XML stream. JSON has strong typing built-in, and has a similar schema concept in JSON Schema.

XML supports comments, but JSON does not.
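
The point about built-in types can be seen with Python's `json` module, which maps JSON objects, arrays, numbers, `true`/`false` and `null` onto Python `dict`, `list`, `int`/`float`, `bool` and `None`. A small sketch with a made-up record (the keys and values are invented for the example):

```python
import json

# A hypothetical JSON document, written as a Python string.
text = '{"name": "Ghost", "legs": 4, "direwolf": true, "owner": null, "tags": ["white", "quiet"]}'

record = json.loads(text)

print(type(record).__name__)                # dict
print(type(record["legs"]).__name__)        # int
print(record["direwolf"], record["owner"])  # True None
print(type(record["tags"]).__name__)        # list

# Serializing and parsing again gives back an equal object.
round_trip = json.loads(json.dumps(record))
print(round_trip == record)                 # True
```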

>**Ex. 1.2.2**: Working with JSON files
>1. Use [`requests`](https://www.google.dk/search?q=python+requests+get+json&gws_rd=cr&ei=M5OdWaewD8Ti6AS54J24Bg), or another Python module, to store **[this data](https://www.reddit.com/r/gameofthrones/.json)** in a new variable `data`.
>2. What is the [type](https://stackoverflow.com/questions/2225038/determine-the-type-of-an-object) of `data`?

In [62]:
# [Answer to Ex. 1.2.2.1]

import urllib.request as request
import json

url = "https://www.reddit.com/r/gameofthrones/.json"

response = request.urlopen(url)

source = response.read()
data = json.loads(source)
print(data)

{'kind': 'Listing', 'data': {'modhash': '', 'dist': 25, 'children': [{'kind': 't3', 'data': {'approved_at_utc': None, 'subreddit': 'gameofthrones', 'selftext': '', 'author_fullname': 't2_37cuaoqv', 'saved': False, 'mod_reason_title': None, 'gilded': 0, 'clicked': False, 'title': '[No Spoilers] Hodor and Ghost!!❤️', 'link_flair_richtext': [{'e': 'text', 't': 'No Spoilers'}], 'subreddit_name_prefixed': 'r/gameofthrones', 'hidden': False, 'pwls': 6, 'link_flair_css_class': 's-none', 'downs': 0, 'thumbnail_height': 121, 'hide_score': False, 'name': 't3_cr533u', 'quarantine': False, 'link_flair_text_color': 'dark', 'author_flair_background_color': 'transparent', 'subreddit_type': 'public', 'ups': 11650, 'total_awards_received': 1, 'media_embed': {}, 'thumbnail_width': 140, 'author_flair_template_id': 'b3bb5b30-0975-11e9-8cca-0e74e6cf765c', 'is_original_content': False, 'user_reports': [], 'secure_media': None, 'is_reddit_media_domain': True, 'is_meta': False, 'category': None, 'secure_media

In [63]:
type(data)

dict

>**Ex. 1.2.3**: Let's try to inspect the data you retrieved. 
>
>1. Use the `json` module to print your data variable as a string with `indent=4`.
>2. The data is a dictionary, a type of Python object that stores data as key-value pairs. Print the keys.
>
>*Hint: 1. Use the `json` function `dumps`. 2. Call `.keys()` on the variable.*

In [71]:
# [Answer to Ex. 1.2.3.1]

import json
print(json.dumps(data, indent=4))

{
    "kind": "Listing",
    "data": {
        "modhash": "",
        "dist": 25,
        "children": [
            {
                "kind": "t3",
                "data": {
                    "approved_at_utc": null,
                    "subreddit": "gameofthrones",
                    "selftext": "",
                    "author_fullname": "t2_37cuaoqv",
                    "saved": false,
                    "mod_reason_title": null,
                    "gilded": 0,
                    "clicked": false,
                    "title": "[No Spoilers] Hodor and Ghost!!\u2764\ufe0f",
                    "link_flair_richtext": [
                        {
                            "e": "text",
                            "t": "No Spoilers"
                        }
                    ],
                    "subreddit_name_prefixed": "r/gameofthrones",
                    "hidden": false,
                    "pwls": 6,
                    "link_flair_css_class": "s-none",
                

In [72]:
# [Answer to Ex. 1.2.3.2]

print(data.keys())

dict_keys(['kind', 'data'])


>**Ex. 1.2.4**: The URL reveals that the data is from reddit/r/gameofthrones, but can you recover that information from the data? Give your answer by 'keying' into the JSON data using square brackets.

>*Hint: 'Keying' is a word I just made up. By it, I mean the following. Consider a JSON object such as:*
>
>        my_json_obj = {
>            'cats': {
>                'awesome': ['Missy'],
>                'useless': ['Kim', 'Frank', 'Sandy']
>            },
>            'dogs': {
>                'awesome': ['Finn', 'Dolores', 'Fido', 'Casper'],
>                'useless': []
>            }
>        }
>
>*I can get the list of useless cats by keying into `my_json_obj` like such:*
>
>        >>> my_json_obj['cats']['useless']
>        Out [ ]: ['Kim', 'Frank', 'Sandy']
>
>*`my_json_obj['cats']` returns the dictionary `{'awesome': ['Missy'], 'useless': ['Kim', 'Frank', 'Sandy']}` and getting '`useless`' from that eventually gives us `['Kim', 'Frank', 'Sandy']`. If any of those list items were a list or a dictionary themselves, we could have kept keying deeper into the structure.*

In [75]:
# [Answer to Ex. 1.2.4]
print(data['data']['children'][0]['data']['subreddit'])

gameofthrones
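
Long chains of square brackets get hard to read. One option, sketched here, is a small helper that follows a list of keys and indices one step at a time. The function `key_into` and the tiny `listing` dictionary are hypothetical; `listing` only mimics the shape of the Reddit response.

```python
def key_into(obj, path):
    """Follow a sequence of dict keys and list indices into a nested structure."""
    for key in path:
        obj = obj[key]  # works for both dict keys and list indices
    return obj


# A tiny stand-in with the same nesting as the Reddit data.
listing = {"data": {"children": [{"data": {"subreddit": "gameofthrones"}}]}}

print(key_into(listing, ["data", "children", 0, "data", "subreddit"]))  # gameofthrones
```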


>**Ex. 1.2.5**: Write two `for` loops (or list comprehensions for extra street cred) that:
>1. Count the number of spoilers.
>2. Print only the headlines that aren't spoilers.

In [76]:
# [Answer to Ex. 1.2.5.1]
# isolate each entry; check the spoiler field

n_spoilers = 0
for child in data['data']['children']:
    n_spoilers += child['data']['spoiler']
print(n_spoilers)

9
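
For the extra street cred, both parts of the exercise can also be written as comprehensions. The sketch below runs on a small made-up `sample` with the same structure as the Reddit data (the titles are invented), so it works without a network connection.

```python
# A hypothetical stand-in for `data`, with three posts.
sample = {"data": {"children": [
    {"data": {"title": "[NO SPOILERS] First post", "spoiler": False}},
    {"data": {"title": "[SPOILERS] Second post", "spoiler": True}},
    {"data": {"title": "[No Spoilers] Third post", "spoiler": False}},
]}}

posts = [child["data"] for child in sample["data"]["children"]]

# True counts as 1 and False as 0, so summing the flags counts the spoilers.
n_spoilers = sum(post["spoiler"] for post in posts)

# Keep only the titles of posts that are not marked as spoilers.
safe_titles = [post["title"] for post in posts if not post["spoiler"]]

print(n_spoilers)  # 1
for title in safe_titles:
    print(title)
```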


In [77]:
# [Answer to Ex. 1.2.5.2]
for child in data['data']['children']:
    if not child['data']['spoiler']:
        print(child['data']['title'])

[No Spoilers] Hodor and Ghost!!❤️
[NO SPOILERS] Vladimir Furdik, the guy who plays the Night King, looks like a mix between Bronn and Ser Jorah
[NO SPOILERS] Rare picture of the Starks of Winterfell.
[NO SPOILERS] I had no idea why my buddies called me Tarly until they showed me this.
[NO SPOILERS] A simplistic Weirwood pixel art
[NO SPOILERS] Just started watching the series, and I have to say, this guy is so good looking, best on the show, he looks like a real life Prince Charming.
[No Spoilers] Vladimir Frudik(Who plays Night King) with Night King ice sculpture!!
[No Spoilers] A TV Murder Mystery: Who Killed Game of Thrones?
[NO SPOILERS] Game Of Thrones - Main Theme (Intro) 🎮 Retro/Chiptune Version 🎮
[NO SPOILERS] Game of Thrones - Winterfell - 3d Timelapse
[No Spoilers] Where to find "Beautiful Death" posters?
[No Spoilers] Yall if I were to buy the whole GOT book series, how long do will it take me to finish? I need a prediction duration.
[NO SPOILERS] Possible GRRM inspiration
[