# Seminar 02 - Jupyter notebooks and IS data analysis 101

This is a Markdown cell. Markdown cells will contain instructions and explain what you have to do.

Execute the Python cell below by selecting it and pressing the `▶ Run` button in the toolbar.

In [None]:
# This is a comment inside a Python cell
print("Hello World")

Let us now start with a few simple tasks that will prepare you for further data analysis.

### Task 1

In the cell below, write a function to print all numbers in the range from 1 to 10.
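
As a reminder, Python's built-in `range(start, stop)` includes `start` but excludes `stop`. A minimal sketch with different bounds than the task asks for (the variable name is only illustrative):

```python
# range(start, stop) is half-open: `stop` itself is not produced
small_range = list(range(1, 4))
print(small_range)  # [1, 2, 3]
```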

In [None]:
# TODO: implement this function
def print10():
    pass

print10()

### Task 2
The following cell shows how we can extract a datetime object from a string. 

Details and meaning of the `"%Y-%m-%dT%H:%M:%S"` formatting string can be found in the documentation:
https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
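
The same directives work in both directions: `strptime` parses a string, `strftime` formats a datetime back into one. A minimal sketch, assuming a made-up timestamp in a different layout than the one used in this task (all names here are illustrative):

```python
from datetime import datetime

# A made-up timestamp in a day.month.year layout (not taken from the IS data)
sample = "24.12.2019 17:54"

# strptime reads the string according to the directives
parsed = datetime.strptime(sample, "%d.%m.%Y %H:%M")

# strftime writes the datetime back out, here in the ISO-like layout
iso_text = parsed.strftime("%Y-%m-%dT%H:%M:%S")
print(iso_text)  # 2019-12-24T17:54:00
```

The documentation table linked above lists further directives, including ones that produce names rather than numbers.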

Your task is to determine the day of the week (Monday, Tuesday, ...) corresponding to this date.

Hint: Look into the documentation.

In [None]:
# The import syntax is quite rich and allows renaming to make the code more succinct
from datetime import datetime as dt

# This is the datetime value we are interested in parsing
value = "2021-06-30T18:01:56"

# Parse the string
date = dt.strptime(value, "%Y-%m-%dT%H:%M:%S")

# Finally, we can access the parsed attributes
print(date.year == 2021)

# TODO: print the name of the week day corresponding to this date
pass

### Task 3

In order to analyze the private data the IS stores, we need to get familiar with the XML format. Thankfully, Python's Beautiful Soup library will be of great help.

We will start by working with a simple dataset before we move on to work with the real data.

In [None]:
# BeautifulSoup4 is a library for parsing HTML and XML code
from bs4 import BeautifulSoup

# Let us start with a small dummy XML tree
# Think of this as an inline representation of an XML file
xml_tree = '''
<?xml version="1.0" encoding="utf-8" ?>
<vsechny_udaje>
  <osoba verze="1.10" system="is.muni.cz" vytvoreno="2015-11-30T17:01:26">
    <zakladni_udaje>
      <uco>123456789</uco>
      <jmeno>Jan</jmeno>
      <prijmeni>Novak</prijmeni>
    </zakladni_udaje>
    <skupiny_log>
      <skupina id="2" nazev="Vstupní dveře budovy, FI Botanická 68a" cip="3ba123626429a8d4273a">
        <p d="2015-04-19T22:58:14" o="in"/>
        <p d="2015-10-07T07:24:00" o="in"/>
        <p d="2016-04-20T08:56:20" o="in"/>
        <p d="2016-12-15T02:56:15" o="in"/>
        <p d="2017-09-09T10:24:47" o="in"/>
        <p d="2017-12-24T17:54:07" o="in"/>
        <p d="2018-07-20T09:16:59" o="in"/>
        <p d="2018-10-19T20:48:36" o="in"/>
        <p d="2019-04-19T08:43:22" o="in"/>
        <p d="2019-11-09T04:21:51" o="in"/>
      </skupina>
    </skupiny_log>
  </osoba>
</vsechny_udaje>
'''

# In this step we parse the XML tree into a Python data structure
soup = BeautifulSoup(xml_tree, 'lxml')

Of course, we can *see* all the data right away, but try to interact with it through Python.

In [None]:
# To print the UCO we need to find that element in the tree
# This is easy because there is only a single element with the `uco` tag
print("The UCO is:", soup.find('uco').text)

# TODO: Try to access and print also the `jmeno` and `prijmeni`
pass

In [None]:
# What if we wanted to access the date inside the `osoba` tag?
# We can start by finding that tag
osoba = soup.find('osoba')

# And then look into its attributes
print(osoba.attrs)

# `osoba.attrs` is a Python dictionary that we can access as usual
print(isinstance(osoba.attrs, dict))

# TODO: Finish the code to print the year and month from the `vytvoreno` attribute
# Use the datetime parsing code above (`dt` is still accessible, no need to import it again)
pass

### Task 4

The `skupiny_log` tag contains a log of the times when the person opened a door using their card. Try to calculate the difference in days between the first and the last visit to the faculty.
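
Subtracting one datetime from another yields a `timedelta`. A minimal sketch, assuming two made-up timestamps in the same format as the `d` attributes (the values and variable names are not taken from the dataset):

```python
from datetime import datetime

# Two made-up timestamps in the same layout as the `d` attributes
start = datetime.strptime("2020-01-01T08:00:00", "%Y-%m-%dT%H:%M:%S")
end = datetime.strptime("2020-01-11T20:00:00", "%Y-%m-%dT%H:%M:%S")

# Subtracting two datetimes gives a timedelta
delta = end - start
print(delta.days)             # 10 (whole days only)
print(delta.total_seconds())  # 907200.0
```

Note that `days` drops the remaining hours, while `total_seconds()` covers the whole interval.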

In [None]:
# You can use `find_all` to find all the tags of a given name
# and also limit the findings to a subtree of the XML, that is, 
# "chain" the `find` and `find_all` calls.

# TODO: get all logs of faculty entries
open_tags = soup.find('skupiny_log').find_all(_)

# Once you have the first and the last visit, you can simply subtract one from the other
# to obtain a so-called `timedelta`, i.e. the difference between two datetimes.
# A timedelta has several attributes; `days` is one of them.

# TODO: print the time difference between the first and the last entry
pass

### Task 5
If you haven't already downloaded the data that the IS stores about you, do it now. You'll find it here:
https://is.muni.cz/auth/privacy/data_access_and_portability

Download all of the data in XML file format (it might take a minute or so). Then load it into this notebook.

In [None]:
# Now you will load up the real dataset
# TODO: fill in the name of your data file
with open(_, 'r') as handle:
    raw_xml = handle.read()
    
# TODO: Finish up the code to parse the `raw_xml`
pass

In [None]:
# Let us look for some interesting information
# Find out the code of your ISIC card. Hint: it is the `cip` attribute of the `skupina` tag
skupina_tag = _
isic_code = _

### Task 6
The IS also stores _all_ of your clicks on the website from the past 6 months. Those are the `klik` tags. Find all the unique IP addresses that you have used. Also check out the [`python-geoip`](https://pythonhosted.org/python-geoip/) library, which can tell you the geolocation of an IP address.
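
Once the addresses are in a list, removing duplicates is a job for a `set`. A minimal sketch, assuming a hypothetical list of addresses; how the address is stored inside a `klik` tag is not shown here, so inspect your own data first:

```python
from collections import Counter

# Hypothetical addresses standing in for values read from the `klik` tags
addresses = ["192.0.2.10", "198.51.100.7", "192.0.2.10", "203.0.113.5", "192.0.2.10"]

# A set keeps each value once; sorting makes the output stable
unique_addresses = sorted(set(addresses))
print(unique_addresses)

# Counter additionally tells how often each address appears
most_common = Counter(addresses).most_common(1)
print(most_common)  # [('192.0.2.10', 3)]
```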

In [None]:
# TODO

### Task 7
List all the doors/classrooms where you have used your ISIC card to enter. When was the first and the last time you entered each of them?
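
One possible approach is to group the timestamps by door and then take the minimum and maximum of each group. A minimal sketch, assuming hypothetical (door name, timestamp) pairs that you would first extract from the `skupina` and `p` tags yourself:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (door name, timestamp) pairs standing in for the real data
entries = [
    ("Door A", "2019-04-19T08:43:22"),
    ("Door B", "2018-07-20T09:16:59"),
    ("Door A", "2017-09-09T10:24:47"),
    ("Door B", "2019-11-09T04:21:51"),
]

# Group the parsed timestamps by door
by_door = defaultdict(list)
for door, stamp in entries:
    by_door[door].append(datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S"))

# min/max give the earliest and the latest entry per door
first_last = {door: (min(times), max(times)) for door, times in by_door.items()}
for door, (first, last) in first_last.items():
    print(door, first.date(), last.date())
```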

In [None]:
# TODO

### Task 8
When analyzing data, it can be quite handy to plot it. Use the [`matplotlib`](https://matplotlib.org/stable/users/index.html) library to get better insight into the dataset (think about plotting dates of access, user `klik`s, IP address usage, ...).
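
Timestamps usually need to be aggregated before they can be plotted. A minimal sketch, assuming a few made-up timestamps, that counts events per hour of the day and draws them as a bar chart (the data and variable names are illustrative only):

```python
from collections import Counter
from datetime import datetime

import matplotlib.pyplot as plt

# Made-up timestamps standing in for values parsed from the XML
stamps = [
    "2019-04-19T08:43:22",
    "2018-07-20T08:16:59",
    "2019-04-19T20:10:00",
    "2019-11-09T04:21:51",
]
hours = [datetime.strptime(s, "%Y-%m-%dT%H:%M:%S").hour for s in stamps]

# Count how many events fall into each hour of the day
per_hour = Counter(hours)

# Bar chart: hour on the x axis, number of events on the y axis
plt.bar(list(per_hour.keys()), list(per_hour.values()))
plt.xlabel("Hour of day")
plt.ylabel("Number of events")
plt.title("Events per hour (sample data)")
```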

In [None]:
import matplotlib.pyplot as plt
import random

# to plot a simple line
plt.plot([random.randint(1, 10) for _ in range(100)])
plt.plot([random.randint(4, 8) for _ in range(100)])

In [None]:
# histograms quickly show the frequency of a value
plt.hist([random.randint(1, 10) for _ in range(1000)])