*****************************************************************
#  The Social Web 
- Instructors: Davide Ceolin, Filip Ilievski and Zubaria Inayat.
- TAs: Sandro Barres-Hamers, Alexander Schmatz, Márton Bodó and Danae Mitsea.
- Exercises for Hands-on session 1
*****************************************************************

Prerequisites:
- Python 3.8
- Python packages: mastodon.py, prettytable, matplotlib, ipython (`re` is part of Python's standard library and needs no installation)

First you need to know how to retrieve some social web data. Exercises 1 and 2 will show you how to retrieve trends and search results from Mastodon. 

But let's check first if we're running a sufficiently new version of Python:

In [1]:
import platform
import sys
print("This jupyter notebook is running on Python " + platform.python_version())
# It's good practice to assert package requirements at the beginning of a script
# (3.8 matches the prerequisite listed above):
assert sys.version_info >= (3, 8)

Let's now install the required packages for this hands-on session:


In [1]:
!pip install mastodon.py prettytable matplotlib

## Part 1: Setting up your Mastodon API in Python

If you do not have an account already:
1. Go to https://joinmastodon.org/ and create a Mastodon account. Unlike Twitter, "Mastodon is not a single website. To use it, you need to make an account with a provider — we call them servers — that lets you connect with other people across Mastodon" (from their website). Pick the general server or find a server/instance that suits you on https://instances.social/.
2. You will receive an email to confirm your account.
3. Confirm your account.

We will use Mastodon.py in this assignment. It is a Python wrapper of the Mastodon API. This makes it possible to interact with Mastodon servers through Python. For the documentation, check https://mastodonpy.readthedocs.io/en/stable/.
 

Register your app. This only needs to be done once. Substitute your own information in the code below. The outputs are confidential, so delete them before submitting this notebook.

In [10]:
from mastodon import Mastodon

Mastodon.create_app(
    'your_app_name',
    api_base_url = 'https://mastodon.social', # determined by your chosen server; if you picked the general server, don't change this line
    to_file = 'your_client_credential_file_name.secret'
)


Then, log in. This can be done every time your application starts (e.g. when writing a simple bot), or you can use the persisted information:
(Note that this won’t work when using 2FA; you’ll have to use OAuth in that case.)

In [None]:
API = Mastodon(client_id = 'your_client_credential_file_name.secret')
API.log_in(
    'your_email@address.nl', # lowercase only
    'verysecretpassword',
    to_file = 'your_user_credential_file_name.secret'
)
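
A password typed into a notebook cell is easy to leak when you submit the notebook. One option is to read it from the environment instead. The sketch below is only an illustration: the variable names `MASTODON_EMAIL` and `MASTODON_PASSWORD` and the helper `read_credentials` are hypothetical and not part of Mastodon.py.

```python
import os

def read_credentials(env=os.environ):
    """Return (email, password) taken from environment variables.

    MASTODON_EMAIL and MASTODON_PASSWORD are hypothetical names chosen
    for this sketch; set them in your shell before starting Jupyter.
    """
    email = env.get("MASTODON_EMAIL")
    password = env.get("MASTODON_PASSWORD")
    if not email or not password:
        raise RuntimeError("Set MASTODON_EMAIL and MASTODON_PASSWORD first.")
    return email, password

# Usage (commented out so the cell also runs when the variables are not set):
# email, password = read_credentials()
# API.log_in(email, password, to_file="<your user credential file>")
```

This keeps the secret out of both the code and the saved cell outputs.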

### 1.1: Retrieving information about your instance (your server)

`API.instance()` returns a dictionary with a lot of information about your instance.
Look at the dictionary and see what kind of information is being returned.
Can you see how many users populate your server?
How long can toots (Mastodon's counterpart to tweets) be?

In [18]:
API.instance()

You can also retrieve dictionaries containing information about trending hashtags or trending statuses (toots).
Below you can see code to display the status that is currently at the top of the trending list.
Check the documentation on how to get the trending hashtags.

In [22]:
from IPython.display import HTML # renders an HTML string in the notebook
HTML(API.trending_statuses()[0]["content"])

### Task 1 
Write code that prints out the first 3 trending hashtags.

In [24]:
#Your Code

### 1.2: Retrieving recent Toots

In [39]:
from IPython.display import HTML, display # the API returns raw HTML, so we use these to render it nicely

q = "#tbt"
search_results = API.search_v2(q)
# Slice to at most 10 results, so a shorter result list does not raise an IndexError
for n, status in enumerate(search_results["statuses"][:10]):
    print(f"Toot {n+1}\n\n")
    display(HTML(status["content"]))
    print("_"*100)

### Task 2

In the cell below, create a second variable (e.g. `search_results2`) that holds the results of a query other than the one presented above. Think about a query that would yield very different results than the first one, for example one that may yield a shorter output or about a different topic.

In [1]:
#Your Code

## Part 2: Extracting text, screen names, and hashtags from tweets 

Simply printing all the search results to screen is nice, but to really start analysing them, it is handy to select the interesting parts and store them in a different structure such as a list. 

The examples below use a construct called a "list comprehension".

### 2.1 List Comprehensions
A list comprehension is a powerful construct that lets you build a list succinctly.
With it you can process items from any iterable (e.g. dictionaries, lists, tuples, iterators...) and output a list while optionally performing an operation on each value.

Here are a few examples from Mining the Social Web:

In [31]:
# double all values from 0 to 9
double_list = [i*2 for i in range(10)]

# raise to the power of 2, but only if the number is odd
power_odd_list = [i**2 for i in range(10) if i%2!=0]

# clean strings in a tuple
stripped_lines = [x.strip() for x in ('The\n', 'Social\n', 'Web\n')]

# return length of each string in stripped_lines
len_str_lines = [len(s) for s in stripped_lines]

# finally, a comprehension with two `for` clauses can flatten a list of lists:
list_of_lists = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
range_9 = [x for y in list_of_lists for x in y]

#print output
print(double_list)
print(power_odd_list)
print(stripped_lines)
print(len_str_lines)
print(range_9)
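
To see what a comprehension does, it can help to compare it with the equivalent explicit loop. The sketch below rewrites a filtered comprehension and a flattening comprehension as plain `for` loops. The variable names are introduced here for illustration only.

```python
# Filtered comprehension: squares of the odd numbers below 10
squares_of_odds = [i**2 for i in range(10) if i % 2 != 0]

# Equivalent explicit loop
squares_of_odds_loop = []
for i in range(10):
    if i % 2 != 0:
        squares_of_odds_loop.append(i**2)

# Comprehension with two `for` clauses: flattens a list of lists
list_of_lists = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
flat = [x for y in list_of_lists for x in y]

# Equivalent nested loops; the `for` clauses keep the same left-to-right order
flat_loop = []
for y in list_of_lists:
    for x in y:
        flat_loop.append(x)

print(squares_of_odds == squares_of_odds_loop)
print(flat == flat_loop)
```

Reading the `for` clauses from left to right as outer-to-inner loops is usually the easiest way to decode a comprehension with more than one clause.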

### 2.2 Parsing text, screen names and hashtags from tweets
*(from Example 1-6 in Mining the Social Web)*

Next, we create a variable `toots` of type list. \
The list is filled with the `content` element of each `toot`, where each `toot` comes from looping through all `statuses` in the `search_results` dict. \
Look up list comprehensions in your Python reference materials to make sure you understand what's happening here. 

In [40]:
ids = [ toot["id"] for toot in \
       search_results["statuses"] ]

# The backslash "\" lets a statement continue on the next line. Inside
# brackets it is not strictly necessary, but breaking long lines can make
# code more readable for your fellow programmers.

toots = [ toot['content'] for toot in search_results["statuses"] ]

# Compute a collection of all words from all tweets
words = [ w for t in toots for w in t.split() ]

import json
# print toots and words in JSON format
print(json.dumps(toots, indent=1))
print(json.dumps(words, indent=1))

What happened here?
The toots you retrieve from your instance are in HTML format. While the HTML markup may be valuable for certain types of analysis, it is not needed for purely text-based analysis. Remove the HTML tags by applying the following function to make the text more readable.

In [41]:
import re #regular expressions module for text processing

def strip_html_tags(html_text):
    plain_text = re.sub(r'<.*?>', '', html_text)
    return plain_text

toots = [ strip_html_tags(toot['content']) for toot in search_results["statuses"] ]

# Compute a collection of all words from all tweets
words = [ w for t in toots for w in t.split() ]

import json
#print cleaned list of toots and words with indentation for better readability
print(json.dumps(toots, indent=1))
print(json.dumps(words, indent=1))
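
To make the effect of `strip_html_tags` concrete, here is a small sketch on a made-up toot (the HTML string is invented for illustration and is not real API output). It also shows that the regular expression leaves HTML entities such as `&amp;` untouched. The standard-library function `html.unescape` can decode those if your analysis needs it.

```python
import html
import re

def strip_html_tags(html_text):
    # Remove anything that looks like an HTML tag: "<", then as few
    # characters as possible, then ">"
    return re.sub(r'<.*?>', '', html_text)

# Invented example string, loosely shaped like a toot's `content` field
sample = '<p>Happy <a href="https://example.org/tags/tbt">#<span>tbt</span></a> &amp; friends</p>'

without_tags = strip_html_tags(sample)   # tags removed, entities still encoded
decoded = html.unescape(without_tags)    # entities decoded as well

print(without_tags)
print(decoded)
```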

### Task 3

You are now ready to parse usernames, hashtags and text (words) from the results you previously obtained in Task 2 (e.g. `search_results2`). While doing so, make sure to leave the variables created in 2.2 untouched. Instead, create your own variable names (e.g. `words2` for the list of words, which later cells rely on).


In [None]:
#Your Code

## Part 3: Creating a basic frequency distribution from words in tweets
*(from Examples 1-7 in Mining the Social Web)* 


In the cell below we display the 10 most common words:

In [43]:
from collections import Counter

c = Counter(words)

print(c.most_common(10)) # top 10

Your output should look something like this: \
`[('ThrowbackThursday', 34), ('throwbackthursday', 11), ('TBT', 6), ('ThrowBackThursday', 6), ('Trivia', 3), ('madoka_magica', 2), ('New', 2), ('EURO2020', 2), ('artists', 2)]`
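
The sample output above counts `ThrowbackThursday`, `throwbackthursday` and `ThrowBackThursday` as three different items, because `Counter` compares strings exactly. Below is a minimal sketch of one possible remedy, lowercasing before counting, on an invented word list. Whether merging case variants is appropriate depends on your analysis.

```python
from collections import Counter

# Invented sample; real data would come from `words`
sample_words = ["TBT", "tbt", "Trivia", "ThrowbackThursday",
                "throwbackthursday", "ThrowBackThursday", "tbt"]

exact = Counter(sample_words)                           # case-sensitive counts
case_folded = Counter(w.lower() for w in sample_words)  # case variants merged

print(exact.most_common(3))
print(case_folded.most_common(3))
```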

### Task 4
Show the hashtag frequencies for the results that you obtained in Task 3. Think about possible explanations for the different results you get from the analyses for the different queries.


In [44]:
#Your Code

### 3.1 Storing your results
So far, we have been storing the data in working memory. Often it's handy to store your data to disk so you can retrieve it in a later session. 

The pickle module lets you do exactly that, by serializing data in a binary format.


In [46]:
import pickle

filepath = "my_data.pickle"
# The `with` statement below uses what Python calls a "context manager".
# Everything inside its indented block can use f as the file handle
# for filepath. The shorthand `wb` stands for "write binary",
# which is how we serialize data to disk.
with open(filepath, "wb") as f:
    pickle.dump(words, f) # write the contents of your list 'words' to file 'f'
    
# Note that, after the end of the indented block, the file is automatically closed.
# Hence, no memory resource on your system is wasted idly.

If you browse to your working directory, you should find a file there named "my_data.pickle". Because the format is binary, it is not meant to be read in a text editor. Instead, load its contents back into a variable to do some more analyses on it.


In [47]:
# open the my_data.pickle file and store its contents into variable 'words'

with open(filepath, "rb") as f:
    words = pickle.load(f)
print(words)
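
A quick way to convince yourself that nothing is lost is a round trip: dump, load, compare. The sketch below does this in a temporary directory so that it does not touch `my_data.pickle`. The sample list and the file name `round_trip.pickle` are made up. Keep in mind that `pickle.load` can execute arbitrary code, so only unpickle files that you created yourself or otherwise trust.

```python
import os
import pickle
import tempfile

sample_words = ["the", "social", "web"]  # made-up data for the round trip

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "round_trip.pickle")
    with open(path, "wb") as f:   # "wb": write binary
        pickle.dump(sample_words, f)
    with open(path, "rb") as f:   # "rb": read binary
        restored = pickle.load(f)

# The restored object is an equal copy, not the same object
print(restored == sample_words)
print(restored is sample_words)
```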

### 3.2 Using prettytable to display tuples in a nice way



In [48]:
from prettytable import PrettyTable


pt = PrettyTable(field_names=['Words', 'Count'])
c = Counter(words2)
for kv in c.most_common(10): # a plain loop is clearer than a comprehension used only for its side effects
    pt.add_row(kv)
pt.align["Words"], pt.align['Count'] = 'l', 'r' # Set column alignment
print(pt) 
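
If `prettytable` is not available, a comparable plain-text table can be sketched with f-string alignment from the standard library. This is an alternative written for illustration, not how `prettytable` works internally, and the sample words are made up.

```python
from collections import Counter

sample_words = ["web", "social", "web", "the", "web", "social"]  # made-up data
rows = Counter(sample_words).most_common(10)

# Width of the first column: the longest word, but at least as wide as the header
width = max([len("Words")] + [len(word) for word, _ in rows])

# "<" left-aligns the words, ">" right-aligns the counts
lines = [f"{'Words':<{width}}  {'Count':>5}"]
for word, count in rows:
    lines.append(f"{word:<{width}}  {count:>5}")

table = "\n".join(lines)
print(table)
```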

### 3.3 Calculating lexical diversity for tweets 
*(from Example 1-9 in Mining the Social Web)*:

In [52]:
# Define a function for computing lexical diversity
def lexical_diversity(tokens):
    return len(set(tokens))/len(tokens)

# Define a function for computing the average number of words per toot
def average_words(statuses):
    total_words = sum(len(s.split()) for s in statuses)
    return total_words/len(statuses)

# Let's use these functions:
print(lexical_diversity(words))
print(average_words(toots))

### Task 5: What do the printed numbers indicate? Try to explain them.

(*Double click this cell to write your answer*)

### 3.4 Looking up users who have retweeted a status 
*(from Example 1-11 in Mining the Social Web):*

In [54]:
retooter = [user["username"] for user in API.status_reblogged_by(111222929039002402)] #might need to insert a different toot id if on different server
print("Users who've reblogged the toot:\n")
print(retooter)

### Task 6 (advanced)

If you have a Mastodon account with a nontrivial number of toots, you can do some analysis of your own account.
Check the documentation on how to access your toots.



What are the most common terms that appear in your toots? \
Which toot received the most replies? \
How many of your toots have been reblogged (and why do you think this is the case)?

In [72]:
#Your Code

### 3.5 Plotting frequencies of words 
*(from Example 1-12 in Mining the Social Web)*

In [73]:
#!pip install matplotlib
word_counts = sorted(Counter(words).values(), reverse=True)
import matplotlib.pyplot as plt
plt.loglog(word_counts)
plt.ylabel("Freq")
plt.xlabel("Word Rank")
plt.show()
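
Before plotting, it can help to look at what `word_counts` actually contains. In the small sketch below (on made-up words), the words themselves are discarded and only their counts remain, sorted in descending order, so the position in the list acts as the rank. The explicit 1-based `ranked` pairs are added here for illustration only.

```python
from collections import Counter

sample_words = ["web", "social", "web", "the", "web", "social"]  # made-up data

# Counts only, most frequent first (these are the values on the y-axis)
word_counts = sorted(Counter(sample_words).values(), reverse=True)

# Pair each count with an explicit rank starting at 1
ranked = list(enumerate(word_counts, start=1))

print(word_counts)
print(ranked)
```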

### Generating histograms of words, screen names, and hashtags 
*(from Example 1-13 in Mining the Social Web):*

In [74]:
c = Counter(words)
plt.hist(list(c.values())) # convert the dict view to a list before plotting
    
plt.title("Histogram of word frequencies")
plt.ylabel("Number of items in bin")
plt.xlabel("Bins (number of times an item appeared)")
    
plt.show()

In [75]:
# extra: seaborn plots with a one-liner (single line of code)
#!pip install seaborn
import seaborn as sns
sns.histplot(word_counts,kde=False)