# Getting Tumblr data from people with the ESFJ MBTI type

Created by: Sara Jakša

First we need to import the libraries that we are going to need.

Note: because the API library for accessing Tumblr had not yet been ported to Python 3 at the time of writing, this script uses Python 2.

In [1]:
from bs4 import BeautifulSoup
import pytumblr
import codecs
import re

Add your own consumer key and secret here. You can get them by registering your app at [https://www.tumblr.com/oauth/apps](https://www.tumblr.com/oauth/apps)

In [2]:
consumer_key = ''
consumer_secret = ''
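As an alternative to pasting the credentials into the notebook, they could be read from environment variables. The following is only a sketch: the variable names `TUMBLR_CONSUMER_KEY` and `TUMBLR_CONSUMER_SECRET` and the helper `load_credentials` are hypothetical and not part of the original script.

```python
import os


def load_credentials(environ=os.environ):
    """Read the consumer key and secret from a mapping such as os.environ.

    The variable names below are assumptions made for this example.
    """
    key = environ.get("TUMBLR_CONSUMER_KEY", "")
    secret = environ.get("TUMBLR_CONSUMER_SECRET", "")
    if not key or not secret:
        raise RuntimeError("Tumblr consumer key and secret are not set")
    return key, secret


# Demonstration with a plain dictionary instead of the real environment.
demo_key, demo_secret = load_credentials({
    "TUMBLR_CONSUMER_KEY": "my-key",
    "TUMBLR_CONSUMER_SECRET": "my-secret",
})
```

In the notebook itself, calling `load_credentials()` without arguments would read the real environment.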

Here you can specify the tag that is going to be searched and the name of the file where the final results will be saved.

In [3]:
tag = "ESFJ"
filename = "tumblr-esfj.csv"

In [4]:
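# Matches the blog host, e.g. "some-blog.tumblr.com"; the dots are escaped so they match literally.
tumblr_url = r"[\w-]+\.tumblr\.com"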

Here we start the client that will enable us to access Tumblr data.

In [5]:
client = pytumblr.TumblrRestClient(
    consumer_key,
    consumer_secret,
)

The following part first collects the posts that use a specific tag. If you provide a timestamp, the search runs backward from that time. Otherwise it runs backward from the moment of the request.

In [7]:
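def getData(tag, timestamp=None):
    # Page backward through the tagged posts until no more are returned.
    posturlsusingtag = list()
    while True:
        result = getTumblrPosts(before=timestamp, tag=tag, limit=20, posturlsusingtag=posturlsusingtag)
        # Stop when the helper signals that there are no more posts.
        if not result or not result[0]:
            break
        timestamp, posturlsusingtag = result
        print(timestamp)
    # Return the last timestamp seen together with the collected post URLs.
    return timestamp, posturlsusingtag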

In [6]:
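def getTumblrPosts(before=None, tag="ESFJ", limit=20, posturlsusingtag=None):
    # Avoid a mutable default argument that would be shared between calls.
    if posturlsusingtag is None:
        posturlsusingtag = []
    posts = client.tagged(tag, limit=limit, before=before)
    if len(posts) == 0:
        # No more posts: still return a pair so that the caller can unpack it.
        return None, posturlsusingtag
    for post in posts:
        if post["type"] == "text":
            posturlsusingtag.append(post[u"post_url"])
    # The timestamp of the last post is the starting point for the next request.
    return post[u"timestamp"], posturlsusingtag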

In [8]:
timestamp, posturlsusingtag = getData(tag)

1490713434
1489999136
1489188836
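The backward paging described above can be tried without network access. This is a minimal sketch: `fake_tagged` is a hypothetical in-memory stand-in for `client.tagged`, the sample posts are invented, and it assumes that `before` returns only posts strictly older than the given timestamp.

```python
# Invented sample data: 45 posts, newest first; every fifth one is a photo post.
FAKE_POSTS = [
    {
        "type": "photo" if i % 5 == 0 else "text",
        "post_url": "https://sample-blog.tumblr.com/post/%d" % i,
        "timestamp": 1000 - i,
    }
    for i in range(45)
]


def fake_tagged(tag, limit=20, before=None):
    """Hypothetical stand-in for client.tagged: posts older than `before`."""
    older = [p for p in FAKE_POSTS if before is None or p["timestamp"] < before]
    return older[:limit]


def collect_text_post_urls(tag, before=None, limit=20):
    """Page backward with a timestamp cursor until a page comes back empty."""
    urls = []
    while True:
        page = fake_tagged(tag, limit=limit, before=before)
        if not page:
            break
        urls.extend(p["post_url"] for p in page if p["type"] == "text")
        # The oldest post on this page becomes the cursor for the next request.
        before = page[-1]["timestamp"]
    return before, urls


last_timestamp, urls = collect_text_post_urls("ESFJ")
print("last timestamp: %d, text posts: %d" % (last_timestamp, len(urls)))
```

The loop ends on the first empty page, so the returned timestamp is that of the oldest post seen and can be passed in again later to resume the search.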


This part extracts the Tumblr blog names from the URLs of the posts.

In [9]:
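def getTumblrNameFromUrl(urllist):
    tumblrnames = []
    for url in urllist:
        match = re.search(tumblr_url, url)
        # Skip URLs that do not contain a <name>.tumblr.com host.
        if match:
            # Strip the trailing ".tumblr.com" (11 characters) to keep only the name.
            tumblrnames.append(match.group()[:-11])
    # Remove duplicates, since one blog can have several tagged posts.
    return list(set(tumblrnames))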

In [10]:
names = getTumblrNameFromUrl(posturlsusingtag)
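To see what the name extraction does, here is a small self-contained sketch. The URLs are invented examples, and the helper `blog_names` is a hypothetical variant that uses a capture group instead of slicing off the last 11 characters.

```python
import re

# The capture group holds the blog name; the dots are escaped to match literally.
TUMBLR_HOST = re.compile(r"([\w-]+)\.tumblr\.com")


def blog_names(urls):
    """Return the sorted, de-duplicated blog names found in the given URLs."""
    names = set()
    for url in urls:
        match = TUMBLR_HOST.search(url)
        if match:
            names.add(match.group(1))
    return sorted(names)


sample = [
    "https://some-blog.tumblr.com/post/123/hello",
    "http://some-blog.tumblr.com/post/456",
    "https://another.tumblr.com/post/789",
    "https://example.com/not-tumblr",
]
print(blog_names(sample))
```

The two posts from `some-blog` collapse into one name, and the URL without a Tumblr host is skipped.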

Since people discussing personality types also use the tags of types different from their own, the tag is not a good indication of which type they are. But people often simply include their type in their blog description. This is what the following function checks: whether the type name is mentioned in the description.

In [11]:
def getTumblrDescription(urllist, tag):
    blogurls = []
    for url in urllist:
        blog = client.blog_info(url + ".tumblr.com")
        try:
            # Case-insensitive check for the type name in the blog description.
            if tag.lower() in blog[u"blog"][u"description"].lower():
                blogurls.append(url)
        except KeyError:
            # The response has no blog description (e.g. an error response).
            pass
    return list(set(blogurls))

In [12]:
typenamesonly = getTumblrDescription(names, tag)
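The description check itself is a case-insensitive substring test. Here is a sketch of it on invented data, with no API calls: the blog names, the descriptions, and the helper `blogs_mentioning` are all hypothetical.

```python
def blogs_mentioning(tag, descriptions):
    """Return the sorted names of blogs whose description contains the tag."""
    needle = tag.lower()
    return sorted(
        name
        for name, description in descriptions.items()
        # A missing description (None or empty) never matches.
        if description and needle in description.lower()
    )


descriptions = {
    "blog-one": "ESFJ | 24 | she/her",
    "blog-two": "I reblog posts about all sixteen types",
    "blog-three": "enneagram 2, esfj, coffee",
    "blog-four": None,
}
print(blogs_mentioning("ESFJ", descriptions))
```

Because this is a plain substring test, a description such as "my sister is an ESFJ" would also match, so the resulting list may still contain some blogs of other types.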

Now that we have the blogs of people who actually have the type we are searching for, we can simply get the 20 posts that the API returns by default for each of them and save them to a file.

In [13]:
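def getBlogPosts(filename, blogurls):
    # Append one tab-separated line per post: blog name, post id, text.
    with codecs.open(filename, "a", "utf-8") as write:
        for name in blogurls:
            blog = client.posts(name)
            try:
                posts = blog["posts"]
            except (KeyError, TypeError):
                # The API returned an error response instead of posts.
                print(blog)
                continue
            for post in posts:
                try:
                    body = BeautifulSoup(post["body"], 'html.parser')
                except KeyError:
                    # Posts without a "body" field (non-text posts) are skipped.
                    continue
                # Strip the HTML markup and keep only the text.
                body = body.get_text()
                # The title can be missing or None.
                title = post.get("title") or ""
                text = title + body
                # Keep each post on one line so that tabs and newlines stay unambiguous.
                text = text.replace("\t", " ").replace("\n", " ").replace("\r", " ")
                write.write(name + "\t" +
                            str(post["id"]) + "\t" +
                            text + "\n")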

In [14]:
getBlogPosts(filename, typenamesonly)
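Although the file name ends in `.csv`, the columns are separated by tabs, so the file should be read back as tab-separated text. The following sketch shows one way to do that. The helper `read_posts` is hypothetical, and the two sample lines are invented and written to a temporary file so that the example does not depend on real data.

```python
import io
import os
import tempfile


def read_posts(path):
    """Read (blog name, post id, text) rows from a tab-separated file."""
    rows = []
    with io.open(path, encoding="utf-8") as handle:
        for line in handle:
            # Split on the first two tabs only: name, id, then the text.
            name, post_id, text = line.rstrip("\n").split("\t", 2)
            rows.append((name, int(post_id), text))
    return rows


# Write two invented sample lines in the same layout as the script's output.
path = os.path.join(tempfile.mkdtemp(), "sample.tsv")
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(u"blog-one\t101\tFirst title first body\n")
    handle.write(u"blog-two\t202\tSecond post\n")

rows = read_posts(path)
print(len(rows))
```

Since the script replaces tabs and line breaks inside the text with spaces, each line holds exactly one post and splitting on tabs is safe.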