## Teaching the Teacher: Python
# Day 1 - Afternoon - Python Basics + Jupyter Notebooks



28.06.2021


*Theo Araujo, Department of Communication Science, University of Amsterdam*


# Agenda

1. Jupyter notebooks
2. Functions and iterations (again)

# 1. Jupyter Notebooks

Some of our classes use Jupyter Notebook (and JupyterLab), as notebooks are one of the most popular and powerful ways to run data analyses in Python. A Jupyter Notebook combines:
* the Python code
* the results of the Python code
* annotations that you can make in Markdown

*Let's dive into this in a bit more detail*

In [None]:
print('This is a Python code')

What we see above is that ```print('This is a Python code')``` is the code you will run. When you run it, the cell receives an execution number on its left side, showing the order in which the cells were run. The results of the code appear immediately below it.

In this case, the print command shows something on the screen - whatever is between the parentheses. We'll get to how it works in a bit.

The annotations are written in Markdown, a simple formatting language that I am using right now to write this text. It does not run any code, but it is helpful for documenting what you are doing.

It has a few interesting options:

# Here I can make a first level heading
## Now a second level heading
### or a third level heading

I can also write things in **bold** or in *italic*.

And I can make unordered lists:
* Item 1
* Item 2
* Item 3

Or ordered lists:
1. Item 1
2. Item 2
3. Item 3

## Jupyter notebooks & Open Science

Notebooks are not *the only way* to ensure code replicability, but are an *important* way to do so. Some of the advantages in research and in teaching:
* **Combination** of markdown, code, output, and markdown:<br>the *why*, the *how*, the *what* and the *what it means* <br>&nbsp;<br>
* **Portable**: exporting to html or PDF (handy for assignments)<br>&nbsp;<br>
* **Online visualisations**: GitHub renders notebooks directly in the browser

**Not the solution to everything:**
* Sometimes slower than running directly in Python
* Not very handy when there's a lot of output
* Can crash - when there's **just.too.much.output**.


# 2. Functions and iterations (again)

## Conditions

Conditions are a way to test if something is happening and, if so, to do something about it. Let's explore a bit.


In [None]:
a = 1

In [None]:
if a == 1:
    print('a is 1!!!')

In [None]:
a = 0

In [None]:
if a == 1:
    print('a is 1!')

In [None]:
if a == 2:
    print('a is 2!')
else:
    print('a is not 2!')

Some important things that we've done above:
* First, we indicated a condition by checking IF SOMETHING is equal (==) to SOMETHING ELSE.
* Notice that after the condition, we have a colon (:). What follows it is what needs to be done if that condition is TRUE
* After the colon, we have indented text (usually four spaces or a tab). Everything within that indented area (including the lines below, as long as they are also indented) belongs to that condition
* We also used ELSE to indicate what to do if the condition was FALSE


Let's look at another example:

In [None]:
mylist = [1, 2, 3]

if 3 in mylist:
    print("there's a 3 in my list")
    print("my list has", len(mylist), "items")
else:
    print("there's no 3 in my list")

In [None]:
if 4 in mylist:
    print("there's a 4 in my list")
    print("my list has", len(mylist), "items")
else:
    print("there's no 4 in my list")

In [None]:
a = 1
b = 1000000

if a > b: 
    print('a > b')
elif a < b: 
    print('a < b')
elif a == b: 
    print('a == b')

The ```elif``` keyword is short for ```else if```. It lets us check one condition and, only if that is not true, check another. In the example above, separate ```if``` statements would give the same result, because only one of the three conditions can be true at a time. Compare the two examples below to see when it does matter.

In [None]:
a = 0
b = 0


if a == b: 
    print('a == b')
elif a == 0: 
    print('a == 0')


In [None]:
a = 0
b = 0


if a == b: 
    print('a == b')
if a == 0: 
    print('a == 0')
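
The difference between the two cells above can be made explicit with a small sketch. The helper names `with_elif` and `with_two_ifs` are made up for this illustration; instead of printing, each one collects the messages that would be printed so they can be compared side by side.

```python
def with_elif(a, b):
    """Collect messages using if/elif: at most one branch runs."""
    messages = []
    if a == b:
        messages.append('a == b')
    elif a == 0:
        # Only checked when the first condition was False
        messages.append('a == 0')
    return messages


def with_two_ifs(a, b):
    """Collect messages using two separate ifs: each one is checked on its own."""
    messages = []
    if a == b:
        messages.append('a == b')
    if a == 0:
        messages.append('a == 0')
    return messages


print(with_elif(0, 0))     # ['a == b']
print(with_two_ifs(0, 0))  # ['a == b', 'a == 0']
```

With ```elif```, the second check is skipped as soon as the first condition is true; with two separate ```if``` statements, both checks always run.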

## Iterations
Iterations are a way to ask Python to do something repeatedly - until a certain condition is met. While it is possible to create a loop that never ends (one without an exit condition), we should avoid that... otherwise, the code would run forever. Let's see some basic loops.

In [None]:
counter = 0
while counter < 10:
    print('counter at', counter)
    counter += 1

print('counter finished at', counter)

In [None]:
mylist = [1,2,3,4,5,6,7,8,9,10]

for i in mylist:
    print(i)

**while** and **for** are two common ways to do loops in Python:
* while will repeat the code (that is in the indented area) as long as its condition is true, and stops once the condition becomes false
* for will loop for each element of a list (or a string, or tuple) until there are no elements anymore

In [None]:
for letter in 'this is a sentence!':
    print(letter)
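
A short sketch of the two points above, using only built-in Python: a `for` loop over a tuple, and a `while` loop that would never end on its own but is given an explicit exit with `break`. The variable names and the cap of 5 attempts are arbitrary choices for this illustration.

```python
# for also works on tuples, not only on lists and strings
seen = []
for platform in ('Facebook', 'Twitter', 'YouTube'):
    seen.append(platform)
print(seen)

# The condition of this while loop is always True, so it needs an exit:
# break leaves the loop as soon as the safety cap is reached
attempts = 0
max_attempts = 5  # arbitrary safety cap, chosen for this example
while True:
    attempts += 1
    if attempts >= max_attempts:
        break

print('stopped after', attempts, 'attempts')
```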

## Functions

In [None]:
visitors = ['Facebook', 'Twitter', 'Twitter', 'YouTube', 'NYT',
            'Facebook', 'WP', 'YouTube', 'NYT', 'NYT', 'Instagram']

In [None]:
visitors

In [None]:
def check_website(website):
    if website in ['Facebook', 'Twitter', 'Instagram']:
        return 'Social Media'
    if website in ['NYT', 'WP']:
        return 'News website'
    return 'not categorized'

In [None]:
for item in visitors:
    print(check_website(item))
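
Because the function returns a value instead of printing it, the result can be reused elsewhere. As a minimal sketch, the categories could be counted with `collections.Counter` from the standard library. The name `category_counts` is only for illustration, and the list and function are repeated here so the block runs on its own.

```python
from collections import Counter

visitors = ['Facebook', 'Twitter', 'Twitter', 'YouTube', 'NYT',
            'Facebook', 'WP', 'YouTube', 'NYT', 'NYT', 'Instagram']


def check_website(website):
    if website in ['Facebook', 'Twitter', 'Instagram']:
        return 'Social Media'
    if website in ['NYT', 'WP']:
        return 'News website'
    return 'not categorized'


# Apply the function to every visitor and count the resulting categories
category_counts = Counter(check_website(item) for item in visitors)

for category, count in category_counts.items():
    print(category, count)
```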

### Why are functions important?

First, a detour: Share of effort in a Computational Communication Science project

**Expectations...**

| Activity                            | Expectation for time spent  |
|-------------------------------------|-----------------------------|
|Data collection                      | 10%                         |
|Data cleaning                        | 10%                         |
|Data exploration / visualisation     | 40%                         |
|Model building & analysis            | 40%                         |


**My reality...**

| Activity                            | Expectation for time spent  | (My) Reality   |
|-------------------------------------|-----------------------------|----------------|
|Data collection                      | 10%                         | 60%            |
|Data cleaning                        | 10%                         | 60%            |
|Data exploration / visualisation     | 40%                         | 10%            |
|Model building & analysis            | 40%                         | 5%             |

*(Yep, it doesn't add up to 100% - it almost **always** takes more time than expected)*

### What do we use functions for?

* Data collection
* Data cleaning while collecting data
* Data cleaning before doing the analysis
* And some more data cleaning (because there's always something that doesn't work the first time)



### My approach to building functions

1. Create the sample code for one example
2. Encapsulate with a function
3. Iterate through a list of examples
4. Go wild
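
The first three steps can be sketched on a toy task. Everything in this block is hypothetical (the messy source names and the `clean_source` function are made up for illustration); it only shows the shape of the workflow.

```python
# 1. Create the sample code for one example (a made-up, messy source name)
raw = '  Facebook '
print(raw.strip().lower())


# 2. Encapsulate it with a function
def clean_source(name):
    """Strip surrounding whitespace and lowercase a source name."""
    return name.strip().lower()


# 3. Iterate through a list of examples
raw_sources = ['  Facebook ', 'TWITTER', 'nyt  ']
cleaned = []
for name in raw_sources:
    cleaned.append(clean_source(name))

print(cleaned)  # ['facebook', 'twitter', 'nyt']
```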


#### Something recent: anonymizers for Facebook data donation

Loading some synthetic data from [OSD2F](https://github.com/uvacw/osd2f).

In [None]:
import json

In [None]:
# Open the file in a with block so it is closed again after loading
with open('mockdata/mockdata-your_posts_1.json', 'r') as f:
    data = json.load(f)

In [None]:
len(data)

In [None]:
data

Challenge: I want to anonymize the titles for now. At the very least, I want to replace the username with the tag ```<user>```

In [None]:
data[0]

In [None]:
title = data[0]['title']

In [None]:
title

In [None]:
title.split('wrote on')

In [None]:
title.split('wrote on')[0]

In [None]:
title.split('wrote on')[0].strip()

In [None]:
user = title.split('wrote on')[0].strip()

In [None]:
user

In [None]:
title.replace(user, '<user>')
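
Since the cell outputs are not shown here, the same steps can be traced on a made-up title. The names below are hypothetical and do not come from the mock data; the title only assumes the "X wrote on Y's timeline" pattern.

```python
# A made-up title following the "X wrote on Y's timeline" pattern
title = "Jane Doe wrote on John Smith's timeline."

# Split the title at 'wrote on': everything before it is the user
parts = title.split('wrote on')
print(parts)  # ['Jane Doe ', " John Smith's timeline."]

# Remove the trailing space from the first part
user = parts[0].strip()
print(user)   # Jane Doe

# Replace the user's name with the tag
anonymized = title.replace(user, '<user>')
print(anonymized)  # <user> wrote on John Smith's timeline.
```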

In [None]:
def anonymize_user(title):
    # Some safety measures first...
    # 1. Maybe the title is empty (None)
    if not title:
        return title
    # 2. Maybe "wrote on" is not in the title...
    if 'wrote on' not in title:
        return title

    # Now it's time to do the magic...
    user = title.split('wrote on')[0].strip()
    # 3. Maybe nothing comes before "wrote on"
    #    (replacing an empty string would mangle the whole title)
    if not user:
        return title

    return title.replace(user, '<user>')

In [None]:
example_titles = [None,
                  "someone wrote whatever somewhere else",
                  "someone wrote on another person's timeline"]

In [None]:
for title in example_titles:
    print(anonymize_user(title))

Trying with a subset of the data

In [None]:
testdata = data[:15]

In [None]:
testdata

In [None]:
for item in testdata:
    if 'title' in item:
        item['title'] = anonymize_user(item['title'])

In [None]:
testdata

I guess it works... now let's try it on all the data

In [None]:
for item in data:
    if 'title' in item:
        item['title'] = anonymize_user(item['title'])

In [None]:
data

# Your turn:

Using the same mockdata, extend the anonymization steps to:
1. Also remove the name of the person on whose timeline the user has written (replace it with the tag ```<alter>```)
2. Remove the tags from the data