## Teaching the Teacher: Python
# Day 1 - Afternoon - Python Basics + Jupyter Notebooks



28.06.2021


*Theo Araujo, Department of Communication Science, University of Amsterdam*


# Agenda

1. Jupyter notebooks
2. Functions and iterations (again)

# 1. Jupyter Notebooks

Some of our classes use Jupyter Notebook (and JupyterLab), as it is one of the most popular and powerful ways to run data analyses in Python. A Jupyter Notebook combines:
* the Python code
* the results of the Python code
* annotations that you can make in Markdown

*Let's dive into this in a bit more detail*

In [1]:
print('This is a Python code')

This is a Python code


What we see above is that ```print('This is a Python code')``` is the code you will run. When you run it, the cell receives an execution number (```In [1]```) on its left side, which shows the order in which the cells were run. The results of the code appear immediately below it.

In this case, the print command shows something on the screen: whatever is between the parentheses. We'll get to how it works in a bit.

The annotations you can make are in Markdown. This is a simple formatting language, which I am using now to write this text. It does not run any code, but it is helpful for documenting what you are doing.

It has a few interesting options:

# Here I can make a first level heading
## Now a second level heading
### or a third level heading

I can also write things in **bold** or in *italic*.

And I can make unordered lists:
* Item 1
* Item 2
* Item 3

Or ordered lists:
1. Item 1
2. Item 2
3. Item 3

## Jupyter notebooks & Open Science

Notebooks are not *the only way* to ensure code replicability, but are an *important* way to do so. Some of the advantages in research and in teaching:
* **Combination** of markdown, code, output, and markdown:<br>the *why*, the *how*, the *what* and the *what it means* <br>&nbsp;<br>
* **Portable**: exporting to html or PDF (handy for assignments)<br>&nbsp;<br>
* **Online visualisations**: GitHub renders notebooks directly in the browser

**Not the solution to everything:**
* Sometimes slower than running directly in Python
* Not very handy when there's a lot of output
* Can crash - when there's **just.too.much.output**.


# 2. Functions and iterations (again)

## Conditions

Conditions are a way to test whether something is true and, if so, to do something about it. Let's explore a bit.


In [2]:
a = 1

In [3]:
a

1

In [4]:
if a == 1:
    print('a is 1!!!')

a is 1!!!


In [5]:
a = 0

In [6]:
if a == 1:
    print('a is 1!')

In [7]:
if a == 2:
    print('a is 2!')
else:
    print('a is not 2!')

a is not 2!


In [8]:
a = 2

In [9]:
if a == 2:
    print('a is 2!')
if a > 0:
    print('a is larger than 0')
else:
    print('a is not 2!')

a is 2!
a is larger than 0


Some important things that we've done above:
* First, we indicated a condition by checking IF SOMETHING is equal (==) to SOMETHING ELSE.
* Notice that after the condition, we have a colon (:). It introduces what needs to be done if that condition is TRUE
* After the colon, we have indented text (a tab or, more commonly, four spaces). Everything within that indented block (including the lines below, as long as they are also indented) belongs to that condition
* We also used ELSE to indicate what to do if the condition was FALSE
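To make the indentation rule concrete, here is a minimal sketch. The variable `temperature` and the threshold are made up for illustration. The indented lines belong to their branch; the last line is not indented, so it runs either way.

```python
temperature = 25  # hypothetical value, just for illustration

if temperature > 20:
    # Both indented lines belong to the condition above
    message = 'warm'
    print('it is warm')
else:
    # These lines only run if the condition is False
    message = 'cold'
    print('it is cold')

# Not indented: this line runs regardless of the condition
print('done checking')
```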


Let's look at another example:

In [10]:
mylist = [1, 2, 3]

if 3 in mylist:
    print("there's a 3 in my list")
    print("my list has", len(mylist), "items")
else:
    print("there's no 3 in my list")

there's a 3 in my list
my list has 3 items


In [11]:
if 4 in mylist:
    print("there's a 4 in my list")
    print("my list has", len(mylist), "items")
else:
    print("there's no 4 in my list")

there's no 4 in my list


In [12]:
'a' in 'abc'

True

In [13]:
'a' in ['a', 'b']

True

In [14]:
1 in {'a': 1, 'b': 2}

False
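The last result may look surprising: for a dictionary, ```in``` checks the keys, not the values. A short sketch of how to check each (the names `mydict`, `key_found`, `value_as_key` and `value_found` are only for illustration):

```python
mydict = {'a': 1, 'b': 2}

key_found = 'a' in mydict             # `in` looks at the keys
value_as_key = 1 in mydict            # 1 is a value, not a key
value_found = 1 in mydict.values()    # check the values explicitly

print(key_found, value_as_key, value_found)
```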

In [15]:
a = 1
b = 1000000

if a > b: 
    print('a > b')
elif a < b: 
    print('a < b')
elif a == b: 
    print('a == b')

a < b


The ```elif``` condition is a shorthand for ```else if```. This allows us to check one condition and, if it is not true, check another condition. In the example above, only one of the conditions can be true at a time, so using ```elif``` or separate ```if``` statements would give the same result. Check the two examples below to see when it matters.

In [16]:
a = 0
b = 0


if a == b: 
    print('a == b')
elif a == 0: 
    print('a == 0')


a == b


In [17]:
a = 0
b = 0


if a == b: 
    print('a == b')
if a == 0: 
    print('a == 0')

a == b
a == 0


In [18]:
if a == 0:
    if b == 0:
        print('0,0')
    else:
        print('0,1')
else:
    print('1,1')

0,0
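The difference between ```elif``` and a second ```if``` can be summarised in a small sketch. The two helper functions (`describe_elif` and `describe_if`, names made up here) collect messages instead of printing them, so the results are easy to compare: with ```elif```, the second check is skipped once the first one is true; with two separate ```if``` statements, both checks run.

```python
def describe_elif(a, b):
    messages = []
    if a == b:
        messages.append('a == b')
    elif a == 0:                 # only checked if a == b was False
        messages.append('a == 0')
    return messages


def describe_if(a, b):
    messages = []
    if a == b:
        messages.append('a == b')
    if a == 0:                   # always checked
        messages.append('a == 0')
    return messages


print(describe_elif(0, 0))
print(describe_if(0, 0))
```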


## Iterations
Iterations are a way to ask Python to do something repeatedly, until a certain condition is met. While it is possible to create a loop that does not end (without an exit condition), we should never do that... otherwise, the code would run forever. Let's see some basic loops.

In [19]:
counter = 0
while counter < 10:
    print('counter at', counter)
    counter += 1

print('counter finished at', counter)

counter at 0
counter at 1
counter at 2
counter at 3
counter at 4
counter at 5
counter at 6
counter at 7
counter at 8
counter at 9
counter finished at 10


In [20]:
mylist = [1,2,3,4,5,6,7,8,9,10]

for aaa in mylist:
    print(aaa)

1
2
3
4
5
6
7
8
9
10


**while** and **for** are some common ways to do loops in Python:
* while will repeat the code (that is in the indented area) as long as a certain condition is true
* for will loop for each element of a list (or a string, or tuple) until there are no elements anymore
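As a rough sketch of both loops side by side (the names `numbers`, `total` and `attempts` are illustrative): the ```for``` loop stops by itself when the list is exhausted, while the ```while``` loop needs a condition that eventually becomes false. ```break``` is one way to leave a loop early.

```python
numbers = [1, 2, 3, 4, 5]

# for: runs once per element, then stops by itself
total = 0
for number in numbers:
    total += number
print('total:', total)

# while: keeps running as long as the condition is True
attempts = 0
while attempts < 100:        # upper bound guards against an endless loop
    attempts += 1
    if attempts == 3:
        break                # leave the loop early
print('stopped after', attempts, 'attempts')
```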

In [21]:
for x in 'this is a sentence!':
    print(x)

t
h
i
s
 
i
s
 
a
 
s
e
n
t
e
n
c
e
!


Another example: let's select and print only the items that are strings

In [22]:
orgs = ['DUO', 55, None, 'Belastingdienst', 'UWW', False]

In [23]:
orgs

['DUO', 55, None, 'Belastingdienst', 'UWW', False]

In [24]:
for org in orgs:
    if type(org) == str:
        print(org)

DUO
Belastingdienst
UWW


In [25]:
for org in orgs:
    if not org:
        print(org)

None
False


In [26]:
for org in orgs:
    if org is None:
        print(org)

None


In [27]:
None == 'None'

False

In [28]:
type(['1',])

list

In [29]:
type('1')

str
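Besides comparing ```type(...)``` with ```==```, Python offers ```isinstance()``` for type checks and ```is None``` for testing against ```None```. The sketch below reuses the `orgs` list from above (redefined so the block runs by itself); `strings_only`, `missing` and `falsy` are names introduced here. Note that ```not org``` is broader than ```org is None```: it is also true for ```False```, ```0``` and empty strings or lists.

```python
orgs = ['DUO', 55, None, 'Belastingdienst', 'UWW', False]

# Keep only the items that are strings
strings_only = [org for org in orgs if isinstance(org, str)]

# Keep only the items that are exactly None
missing = [org for org in orgs if org is None]

# Keep everything that counts as "empty" or False
falsy = [org for org in orgs if not org]

print(strings_only)
print(missing)
print(falsy)
```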

## Functions

In [30]:
visitors = ['Facebook', 'Twitter', 'Twitter', 'YouTube', 'NYT',
           'Facebook', 'WP', 'YouTube', 'NYT', 'NYT', 'Instagram']

In [31]:
visitors

['Facebook',
 'Twitter',
 'Twitter',
 'YouTube',
 'NYT',
 'Facebook',
 'WP',
 'YouTube',
 'NYT',
 'NYT',
 'Instagram']

In [32]:
def check_website(website):
    if website in ['Facebook', 'Twitter', 'Instagram']:
        return 'Social Media'
    if website in ['NYT', 'WP']:
        return 'News website'
    return 'not categorized'

In [33]:
for website in visitors:
    print(website)
    if website in ['Facebook', 'Twitter', 'Instagram', 'WP']:
        print('Social Media')
    if website in ['NYT', 'WP']:
        print('News website')
    else:
        print('not categorized')

Facebook
Social Media
not categorized
Twitter
Social Media
not categorized
Twitter
Social Media
not categorized
YouTube
not categorized
NYT
News website
Facebook
Social Media
not categorized
WP
Social Media
News website
YouTube
not categorized
NYT
News website
NYT
News website
Instagram
Social Media
not categorized


In [34]:
for item in visitors:
    print(check_website(item))

Social Media
Social Media
Social Media
not categorized
News website
Social Media
News website
not categorized
News website
News website
Social Media


In [35]:
def add_one(x):
    return x + 1

In [36]:
add_one(1)

2

In [37]:
add_one(2)

3

In [38]:
def add_values(x, y):
    return x + y

In [39]:
add_values(1,2)

3

In [40]:
add_values('a', 'b')

'ab'

In [41]:
a = add_values(1,2)

In [42]:
a

3

In [43]:
def add_values_print(x, y):
    print(x + y)

In [44]:
b = add_values_print(1, 3)

4


In [45]:
b
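The cell above shows no output because `add_values_print` prints the sum but does not return it; a function without a ```return``` statement returns ```None```. A minimal sketch of the difference, restating the two functions so the block is self-contained:

```python
def add_values(x, y):
    return x + y      # hands the result back to the caller


def add_values_print(x, y):
    print(x + y)      # shows the result, but returns nothing


a = add_values(1, 3)
b = add_values_print(1, 3)

print(a)
print(b is None)
```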

### Why are functions important?

First, a detour: Share of effort in a Computational Communication Science project

**Expectations...**

| Activity                            | Expectation for time spent  |
|-------------------------------------|-----------------------------|
|Data collection                      | 10%                         |
|Data cleaning                        | 10%                         |
|Data exploration / visualisation     | 40%                         |
|Model building & analysis.           | 40%                         |


**My reality...**

| Activity                            | Expectation for time spent  | (My) Reality   |
|-------------------------------------|-----------------------------|----------------|
|Data collection                      | 10%                         | 60%            |
|Data cleaning                        | 10%                         | 60%            |
|Data exploration / visualisation     | 40%                         | 10%            |
|Model building & analysis.           | 40%                         | 5%             |

*(Yep, it doesn't add up to 100%: it **always** takes more time than expected)*

### What do we use functions for?

* Data collection
* Data cleaning while collecting data
* Data cleaning before doing the analysis
* And some more data cleaning (because there's always something that doesn't work the first time)



### My approach to building functions

1. Create the sample code for one example
2. Encapsulate with a function
3. Iterate through a list of examples
4. Go wild
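The first three steps can be sketched with a made-up cleaning task (trimming and lower-casing a website name). The function `clean_name` and the example values are illustrative, not part of the course material:

```python
# 1. Create the sample code for one example
example = '  FaceBook '
print(example.strip().lower())


# 2. Encapsulate with a function
def clean_name(name):
    return name.strip().lower()


# 3. Iterate through a list of examples
examples = ['  FaceBook ', 'TWITTER', ' nyt']
cleaned = []
for name in examples:
    cleaned.append(clean_name(name))

print(cleaned)
```

Step 4 is then a matter of running the same loop over the real data.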


#### Something recent: anonymizers for Facebook data donation

Loading some synthetic data from [OSD2F](https://github.com/uvacw/osd2f).

In [46]:
import json

In [47]:
# Open the file in a with-block so that it is closed again after loading
with open('mockdata/mockdata-your_posts_1.json', 'r') as f:
    data = json.load(f)

In [48]:
len(data)

100

In [49]:
data

[{'timestamp': 1607922261,
  'title': "charlesthompson wrote on Veronica Olson's timeline.",
  'tags': ['Taylor Cabrera', 'Jennifer Young', 'Virginia Thompson'],
  'data': []},
 {'timestamp': 1620342917,
  'title': "charlesthompson wrote on Ashley Schneider's timeline.",
  'tags': [],
  'attachments': {'data': [{'external_context': {'url': 'http://www.carroll-craig.com/'}}]},
  'data': []},
 {'timestamp': 1599939342,
  'title': "charlesthompson wrote on James Dalton's timeline.",
  'tags': [],
  'attachments': {'data': [{'external_context': {'url': 'http://www.simpson-moore.com/'}}]},
  'data': []},
 {'timestamp': 1595402633,
  'title': "charlesthompson wrote on Tiffany Cantu's timeline.",
  'tags': ['Eric Henry', 'Sean Roman'],
  'data': []},
 {'timestamp': 1590761620,
  'title': "charlesthompson wrote on Kathy Murphy's timeline.",
  'tags': ['John Howell'],
  'attachments': {'data': [{'external_context': {'url': 'http://www.glenn.org/'}}]},
  'data': [{'post': 'Itself carry control j

Challenge: I want to anonymize the titles for now. At least, I want to replace the username with the tag ```<user>```

In [50]:
data[0]
# this is a comment

{'timestamp': 1607922261,
 'title': "charlesthompson wrote on Veronica Olson's timeline.",
 'tags': ['Taylor Cabrera', 'Jennifer Young', 'Virginia Thompson'],
 'data': []}

I am having a problem here

In [51]:
title = data[0]['title']

In [52]:
title

"charlesthompson wrote on Veronica Olson's timeline."

In [53]:
title.split('wrote on')

['charlesthompson ', " Veronica Olson's timeline."]

In [54]:
title.split(' ')

['charlesthompson', 'wrote', 'on', 'Veronica', "Olson's", 'timeline.']

In [55]:
title.split('wrote on')[0]

'charlesthompson '

In [56]:
title.split('wrote on')[0].strip()

'charlesthompson'

In [57]:
user = title.split('wrote on')[0].strip()

In [58]:
user

'charlesthompson'

In [59]:
title

"charlesthompson wrote on Veronica Olson's timeline."

In [60]:
title.replace(user, '<user>')

"<user> wrote on Veronica Olson's timeline."

In [61]:
def anonymize_user(title):
    # Some safety measures first... 
    # 1. Maybe the title is empty (None)
    if not title:
        return title
    # 2. Maybe "wrote on" is not in the title...
    if 'wrote on' not in title:
        return title
    
    # Now it's time to do the magic...
    user = title.split('wrote on')[0].strip()
    
    return title.replace(user, '<user>')
        

In [62]:
example_titles = [None, 
                  "someone wrote whatever somewhere else",
                 "someone wrote on another person's timeline"]

In [63]:
for title in example_titles:
    print(anonymize_user(title))

None
someone wrote whatever somewhere else
<user> wrote on another person's timeline


In [64]:
exampletitles = ['charles thompson wrote on someone timeline', 'charlesthompson wrote on someone else timeline']

In [65]:
exampletitles

['charles thompson wrote on someone timeline',
 'charlesthompson wrote on someone else timeline']

In [66]:
exampletitles[0].split(' ')

['charles', 'thompson', 'wrote', 'on', 'someone', 'timeline']

In [67]:
exampletitles[1].split(' ')

['charlesthompson', 'wrote', 'on', 'someone', 'else', 'timeline']
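One caveat about `anonymize_user` is worth knowing: ```str.replace``` replaces every occurrence of the username, so a title in which the username also appears inside the other person's name would be altered twice. A possible variant, sketched under the assumption that the username always comes first in the title, limits the replacement to the first occurrence via the optional third argument of ```str.replace```. The name `anonymize_user_once` and the example title are made up for illustration.

```python
def anonymize_user_once(title):
    # Same safety checks as before
    if not title or 'wrote on' not in title:
        return title
    user = title.split('wrote on')[0].strip()
    # The third argument limits the number of replacements to one
    return title.replace(user, '<user>', 1)


tricky_title = "anna wrote on annabel's timeline"

replaced_everywhere = tricky_title.replace('anna', '<user>')
replaced_once = anonymize_user_once(tricky_title)

print(replaced_everywhere)
print(replaced_once)
```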

Trying with a subset of the data

In [68]:
testdata = data[:15]

In [69]:
testdata

[{'timestamp': 1607922261,
  'title': "charlesthompson wrote on Veronica Olson's timeline.",
  'tags': ['Taylor Cabrera', 'Jennifer Young', 'Virginia Thompson'],
  'data': []},
 {'timestamp': 1620342917,
  'title': "charlesthompson wrote on Ashley Schneider's timeline.",
  'tags': [],
  'attachments': {'data': [{'external_context': {'url': 'http://www.carroll-craig.com/'}}]},
  'data': []},
 {'timestamp': 1599939342,
  'title': "charlesthompson wrote on James Dalton's timeline.",
  'tags': [],
  'attachments': {'data': [{'external_context': {'url': 'http://www.simpson-moore.com/'}}]},
  'data': []},
 {'timestamp': 1595402633,
  'title': "charlesthompson wrote on Tiffany Cantu's timeline.",
  'tags': ['Eric Henry', 'Sean Roman'],
  'data': []},
 {'timestamp': 1590761620,
  'title': "charlesthompson wrote on Kathy Murphy's timeline.",
  'tags': ['John Howell'],
  'attachments': {'data': [{'external_context': {'url': 'http://www.glenn.org/'}}]},
  'data': [{'post': 'Itself carry control j

In [70]:
for aaaaaaa in testdata:
    print(aaaaaaa)

{'timestamp': 1607922261, 'title': "charlesthompson wrote on Veronica Olson's timeline.", 'tags': ['Taylor Cabrera', 'Jennifer Young', 'Virginia Thompson'], 'data': []}
{'timestamp': 1620342917, 'title': "charlesthompson wrote on Ashley Schneider's timeline.", 'tags': [], 'attachments': {'data': [{'external_context': {'url': 'http://www.carroll-craig.com/'}}]}, 'data': []}
{'timestamp': 1599939342, 'title': "charlesthompson wrote on James Dalton's timeline.", 'tags': [], 'attachments': {'data': [{'external_context': {'url': 'http://www.simpson-moore.com/'}}]}, 'data': []}
{'timestamp': 1595402633, 'title': "charlesthompson wrote on Tiffany Cantu's timeline.", 'tags': ['Eric Henry', 'Sean Roman'], 'data': []}
{'timestamp': 1590761620, 'title': "charlesthompson wrote on Kathy Murphy's timeline.", 'tags': ['John Howell'], 'attachments': {'data': [{'external_context': {'url': 'http://www.glenn.org/'}}]}, 'data': [{'post': 'Itself carry control job staff. People out same. Give right door po

In [71]:
for item in testdata:
    if 'title' in item.keys():
        print(anonymize_user(item['title']))

<user> wrote on Veronica Olson's timeline.
<user> wrote on Ashley Schneider's timeline.
<user> wrote on James Dalton's timeline.
<user> wrote on Tiffany Cantu's timeline.
<user> wrote on Kathy Murphy's timeline.
<user> wrote on Nathan Silva's timeline.
<user> wrote on Michelle Smith's timeline.
<user> wrote on Laura Crosby's timeline.
<user> wrote on Ashley Webb's timeline.
<user> wrote on Martin Marshall's timeline.
<user> wrote on Kenneth Anderson's timeline.
<user> wrote on Joseph Rocha DDS's timeline.
<user> wrote on Albert Thompson's timeline.
<user> wrote on Elizabeth Mosley's timeline.
<user> wrote on Eric Curtis's timeline.


In [72]:
for item in testdata:
    if 'title' in item.keys():
        item['title'] = anonymize_user(item['title'])
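Note that the loop above changes the dictionaries in place. Slicing a list (```data[:15]```) creates a new list, but its items are still the same dictionaries as in `data`, so the first 15 items of `data` are changed as well. If the original should stay untouched while testing, one option is ```copy.deepcopy```. A small sketch with made-up data (`records`, `subset` and `safe_subset` are illustrative names):

```python
import copy

records = [{'title': 'a wrote on b'}, {'title': 'c wrote on d'}]

# deepcopy creates independent dictionaries
safe_subset = copy.deepcopy(records[:1])
safe_subset[0]['title'] = 'changed in the copy'
print(records[0]['title'])     # the original is unchanged

# A plain slice is a new list that holds the same dictionaries
subset = records[:1]
subset[0]['title'] = 'changed via the slice'
print(records[0]['title'])     # the original changed too
```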

In [73]:
testdata

[{'timestamp': 1607922261,
  'title': "<user> wrote on Veronica Olson's timeline.",
  'tags': ['Taylor Cabrera', 'Jennifer Young', 'Virginia Thompson'],
  'data': []},
 {'timestamp': 1620342917,
  'title': "<user> wrote on Ashley Schneider's timeline.",
  'tags': [],
  'attachments': {'data': [{'external_context': {'url': 'http://www.carroll-craig.com/'}}]},
  'data': []},
 {'timestamp': 1599939342,
  'title': "<user> wrote on James Dalton's timeline.",
  'tags': [],
  'attachments': {'data': [{'external_context': {'url': 'http://www.simpson-moore.com/'}}]},
  'data': []},
 {'timestamp': 1595402633,
  'title': "<user> wrote on Tiffany Cantu's timeline.",
  'tags': ['Eric Henry', 'Sean Roman'],
  'data': []},
 {'timestamp': 1590761620,
  'title': "<user> wrote on Kathy Murphy's timeline.",
  'tags': ['John Howell'],
  'attachments': {'data': [{'external_context': {'url': 'http://www.glenn.org/'}}]},
  'data': [{'post': 'Itself carry control job staff. People out same. Give right door po

I guess it works... now let's try it on all the data

In [74]:
for item in data:
    if 'title' in item.keys():
        item['title'] = anonymize_user(item['title'])

In [75]:
data

[{'timestamp': 1607922261,
  'title': "<user> wrote on Veronica Olson's timeline.",
  'tags': ['Taylor Cabrera', 'Jennifer Young', 'Virginia Thompson'],
  'data': []},
 {'timestamp': 1620342917,
  'title': "<user> wrote on Ashley Schneider's timeline.",
  'tags': [],
  'attachments': {'data': [{'external_context': {'url': 'http://www.carroll-craig.com/'}}]},
  'data': []},
 {'timestamp': 1599939342,
  'title': "<user> wrote on James Dalton's timeline.",
  'tags': [],
  'attachments': {'data': [{'external_context': {'url': 'http://www.simpson-moore.com/'}}]},
  'data': []},
 {'timestamp': 1595402633,
  'title': "<user> wrote on Tiffany Cantu's timeline.",
  'tags': ['Eric Henry', 'Sean Roman'],
  'data': []},
 {'timestamp': 1590761620,
  'title': "<user> wrote on Kathy Murphy's timeline.",
  'tags': ['John Howell'],
  'attachments': {'data': [{'external_context': {'url': 'http://www.glenn.org/'}}]},
  'data': [{'post': 'Itself carry control job staff. People out same. Give right door po

## Extra questions (from the morning):

* How do I clean a dataset/list?
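There is no single answer, but a common pattern is to loop over the list and keep only the items that pass a check, cleaning each item along the way. A minimal sketch, assuming that "cleaning" here means dropping non-string items and removing surrounding whitespace (the list `raw` and the function `clean_list` are made up for illustration):

```python
def clean_list(items):
    cleaned = []
    for item in items:
        if not isinstance(item, str):
            continue                 # skip None, numbers, etc.
        item = item.strip()
        if item:                     # skip strings that are now empty
            cleaned.append(item)
    return cleaned


raw = [' DUO', None, 'Belastingdienst ', 55, '   ', 'UWV']
print(clean_list(raw))
```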



# Your turn:

Using the same mock data, extend the anonymization steps to:
1. Also remove the name of the person on whose timeline the user has written (name the person ```<alter>```)
2. Remove the tags from the data
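For the second step, one possible starting point is ```dict.pop```, which removes a key from a dictionary; passing ```None``` as the second argument avoids an error when the key is missing. The sketch uses two made-up items shaped like the mock data rather than the real file, and it assumes that "removing the tags" means dropping the whole `tags` key:

```python
items = [
    {'timestamp': 1, 'title': 'x wrote on y', 'tags': ['Some Name']},
    {'timestamp': 2, 'title': 'x wrote on z'},   # no tags key at all
]

for item in items:
    item.pop('tags', None)   # remove the key if present, otherwise do nothing

print(items)
```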

In [77]:
title

"someone wrote on another person's timeline"

In [78]:
title.split('wrote on')[0]

'someone '

In [79]:
title.split('wrote on')[1]

" another person's timeline"

In [80]:
receiver = title.split('wrote on')[1]

In [81]:
receiver

" another person's timeline"

In [82]:
title.replace(receiver, " <receiver>'s timeline")

"someone wrote on <receiver>'s timeline"

In [83]:
receiver

" another person's timeline"

In [84]:
receiver.split("'s")

[' another person', ' timeline']

In [85]:
name_receiver = receiver.split("'s")[0]

In [86]:
name_receiver

' another person'

In [87]:
title.replace(name_receiver, '<receiver>')

"someone wrote on<receiver>'s timeline"

In [88]:
def anonymize_receiver(title):
    # Some safety measures first... 
    # 1. Maybe the title is empty (None)
    if not title:
        return title
    # 2. Maybe "wrote on" is not in the title...
    if 'wrote on' not in title:
        return title
        
    # Now it's time to do the magic...
    receiver = title.split('wrote on')[1]
    name_receiver = receiver.split("'s")[0].strip()
    
    return title.replace(name_receiver, '<receiver>')
    

In [89]:
anonymize_receiver("someone wrote on another person's timeline")

"someone wrote on <receiver>'s timeline"

In [90]:
anonymize_receiver("someone wrote on Mary Smith's timeline")

"someone wrote on <receiver>'s timeline"

In [91]:
title

"someone wrote on another person's timeline"

In [92]:
testdata = data[:15]
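Both anonymization steps can also be combined in a single function. The sketch below is one possible variant, not the only solution: it uses ```str.partition``` to cut the title at ```wrote on``` and ```str.rpartition``` to cut at the last ```'s```, and it rebuilds the title from the pieces instead of calling ```replace```, so a name that happens to occur twice is not replaced in the wrong place. The function name `anonymize_title` is made up here, it keeps the ```<receiver>``` tag used in the cells above, and it assumes titles of the form "name wrote on name's timeline."

```python
def anonymize_title(title):
    # Same safety checks as in the functions above
    if not title or ' wrote on ' not in title:
        return title
    # Split into the part before and the part after ' wrote on '
    user, separator, rest = title.partition(' wrote on ')
    # Split the rest at the last "'s": the receiver's name comes before it
    name, possessive, tail = rest.rpartition("'s")
    if not possessive:
        # No "'s" found: only the user can be anonymized
        return '<user>' + separator + rest
    return '<user>' + separator + '<receiver>' + possessive + tail


print(anonymize_title("charlesthompson wrote on Veronica Olson's timeline."))
print(anonymize_title("someone wrote whatever somewhere else"))
print(anonymize_title(None))
```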

In [93]:
for item in testdata:
    if 'title' in item.keys():
        print(item['title'])
        print(anonymize_receiver(item['title']))
        print(anonymize_user(item['title']))
        print(anonymize_user(anonymize_receiver(item['title'])))
        print('\n')

<user> wrote on Veronica Olson's timeline.
<user> wrote on <receiver>'s timeline.
<user> wrote on Veronica Olson's timeline.
<user> wrote on <receiver>'s timeline.


<user> wrote on Ashley Schneider's timeline.
<user> wrote on <receiver>'s timeline.
<user> wrote on Ashley Schneider's timeline.
<user> wrote on <receiver>'s timeline.


<user> wrote on James Dalton's timeline.
<user> wrote on <receiver>'s timeline.
<user> wrote on James Dalton's timeline.
<user> wrote on <receiver>'s timeline.


<user> wrote on Tiffany Cantu's timeline.
<user> wrote on <receiver>'s timeline.
<user> wrote on Tiffany Cantu's timeline.
<user> wrote on <receiver>'s timeline.


<user> wrote on Kathy Murphy's timeline.
<user> wrote on <receiver>'s timeline.
<user> wrote on Kathy Murphy's timeline.
<user> wrote on <receiver>'s timeline.


<user> wrote on Nathan Silva's timeline.
<user> wrote on <receiver>'s timeline.
<user> wrote on Nathan Silva's timeline.
<user> wrote on <receiver>'s timeline.


<user> wrote o