# Working with RSS feeds in Python

In [1]:
import feedparser

In [24]:
blog_feed = feedparser.parse("https://publications.iamroyakash.com/rss.xml")

In [25]:
# The items are available in feed.entries, which is a list.
# You access items in the list in the same order in which they appear in the original feed, so the first item is available in feed.entries[0].

In [26]:
type(blog_feed)

feedparser.util.FeedParserDict

In [27]:
type(blog_feed.entries), blog_feed.feed.subtitle

(list,
 'Computer scientist theroyakash researches computer vision and artificial intelligence. This is the publication from theroyakash.')

In [28]:
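# len(blog_feed) counts the top-level keys of the parsed result, not the posts;
# the number of posts is len(blog_feed.entries).
len(blog_feed), len(blog_feed.entries), len(blog_feed['entries'])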

(10, 7, 7)

In [29]:
blog_feed['feed']['title'], blog_feed.feed.title

('theroyakash publications', 'theroyakash publications')

In [30]:
blog_feed['feed']['link'], blog_feed.feed.link, blog_feed.version

('https://publications.iamroyakash.com',
 'https://publications.iamroyakash.com',
 'rss20')

In [46]:
print(blog_feed.entries[6].title)
print(blog_feed.entries[6].link)
print(blog_feed.entries[6].author)
print(blog_feed.entries[6].published)

Welcome to theroyakash Publication
https://publications.iamroyakash.com/welcome
theroyakash
Sun, 04 Oct 2020 04:06:06 GMT


In [32]:
print(blog_feed.entries[0].tags)

[{'term': 'Python', 'scheme': None, 'label': None}, {'term': 'caching', 'scheme': None, 'label': None}, {'term': 'speed', 'scheme': None, 'label': None}, {'term': 'life-hack', 'scheme': None, 'label': None}, {'term': 'Christmas Hackathon', 'scheme': None, 'label': None}]


In [61]:
'tags' in blog_feed.entries[6]

False

In [34]:
print(blog_feed.entries[0].summary)

Let's say you have a function that is a super-slow function. Not sure how you can find which is a super slow function? Measure it with this.
Now there is now way you can optimize the function, what you can do instead is that you can store results fro...


In [36]:
print(blog_feed.entries[0].title_detail.base)

https://publications.iamroyakash.com/rss.xml


In [37]:
print(blog_feed.entries[0].tags[0].term)

Python


In [38]:
authors = [author.name for author in blog_feed.entries[0].authors]
print(authors)

['theroyakash']


In [47]:
posts = blog_feed.entries
posts_details = []
for post in posts:
    # Entries are dict-like, so .get() returns a default instead of raising
    # when a field is absent (entries[6] above has no 'tags', for example).
    temp = {
        'title': post.get('title'),
        'link': post.get('link'),
        'author': post.get('author'),
        'time_published': post.get('published'),
        'tags': [tag.get('term') for tag in post.get('tags', [])],
        'authors': [author.get('name') for author in post.get('authors', [])],
    }
    posts_details.append(temp)


In [53]:
import feedparser


def get_posts_details(rss=None):
    """
    Take the link of an RSS feed and return the feed and post details as a dict.
    """
    if rss is None:
        return None

    blog_feed = feedparser.parse(rss)
    posts_details = {"Blog title": blog_feed.feed.get("title"),
                     "Blog link": blog_feed.feed.get("link")}
    post_list = []
    for post in blog_feed.entries:
        # .get() keeps one missing field from dropping the fields after it,
        # which the earlier bare "except: pass" silently did.
        post_list.append({
            "title": post.get("title"),
            "link": post.get("link"),
            "author": post.get("author"),
            "time_published": post.get("published"),
            "summary": post.get("summary"),
            "tags": [tag.get("term") for tag in post.get("tags", [])],
            "authors": [author.get("name") for author in post.get("authors", [])],
        })
    posts_details["posts"] = post_list
    return posts_details

In [54]:
blog_rss = "https://publications.iamroyakash.com/rss.xml"

data = get_posts_details(rss = blog_rss)

import json
print(json.dumps(data, indent=2))

{
  "Blog title": "theroyakash publications",
  "Blog link": "https://publications.iamroyakash.com",
  "posts": [
    {
      "title": "Speed up your python code by caching",
      "link": "https://publications.iamroyakash.com/cache-your-code",
      "author": "theroyakash",
      "time_published": "Sat, 26 Dec 2020 09:32:36 GMT",
      "summary": "Let's say you have a function that is a super-slow function. Not sure how you can find which is a super slow function? Measure it with this.\nNow there is now way you can optimize the function, what you can do instead is that you can store results fro...",
      "tags": [
        "Python",
        "caching",
        "speed",
        "life-hack",
        "Christmas Hackathon"
      ],
      "authors": [
        "theroyakash"
      ]
    },
    {
      "title": "How to benchmark your python program?",
      "link": "https://publications.iamroyakash.com/benchmark-your-python-program",
      "author": "theroyakash",
      "time_published": "Sat,

In [22]:
blog_feed.version

'rss20'

# BLOG

In this article, we will see how to extract feed and post details from the RSS feed of a Hashnode blog. Although the examples use a blog on Hashnode, the same approach works for other feeds as well.

### What is RSS?

RSS stands for Rich Site Summary or Really Simple Syndication and uses standard web feed formats to publish
frequently updated information: blog entries, news headlines, audio, video.

An RSS document (called “feed”, “web feed”, or “channel”) includes full or
summarized text, and metadata, like publishing date and author’s name.

With RSS it is possible to distribute up-to-date web content from one web site to thousands of other web sites around the world.

It is written in XML.

The most commonly used elements in RSS feeds are “title”, “link”, “description”,
“publication date”, and “entry ID”.

The less commonly used elements are “image”, “categories”, “enclosures”
and “cloud”.
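
To make this structure concrete, here is a minimal sketch that reads a hand-written RSS 2.0 document with Python's standard-library `xml.etree.ElementTree`. The feed content (titles and example.com links) is made up for illustration, and the field names in the resulting dictionaries are just one possible choice.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed: one channel holding two items (illustrative data only).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>https://example.com</link>
    <description>A hypothetical blog used for illustration.</description>
    <item>
      <title>First post</title>
      <link>https://example.com/first-post</link>
      <description>Summary of the first post.</description>
      <pubDate>Sun, 04 Oct 2020 04:06:06 GMT</pubDate>
      <guid>https://example.com/first-post</guid>
      <category>Python</category>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second-post</link>
      <description>Summary of the second post.</description>
      <pubDate>Sat, 26 Dec 2020 09:32:36 GMT</pubDate>
      <guid>https://example.com/second-post</guid>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_RSS)
channel = root.find("channel")

# Feed-level metadata lives directly under <channel>.
feed_title = channel.findtext("title")

# Each <item> is one post; <category> is optional and may repeat.
items = [
    {
        "title": item.findtext("title"),
        "link": item.findtext("link"),
        "published": item.findtext("pubDate"),
        "categories": [c.text for c in item.findall("category")],
    }
    for item in channel.findall("item")
]

print(feed_title)
for entry in items:
    print(entry["title"], "->", entry["link"])
```

Reading the XML by hand like this works for a well-formed RSS 2.0 feed, but a dedicated parser saves you from handling the differences between feed formats yourself.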

### Why use RSS?
RSS was designed to show selected data.

Without RSS, users have to check your site daily for new updates, which is too time-consuming for many of them. With an RSS feed (often called a news feed), they can follow your site more easily using an RSS aggregator (a site or program that gathers and sorts RSS feeds).

### Parsing feeds with Feedparser
Feedparser is a Python library that parses feeds in all known formats, including
Atom, RSS, and RDF.

### Installing feedparser

```
pip install feedparser
```

### Getting the RSS feed

```
import feedparser

blog_feed = feedparser.parse("https://vaibhavkumar.hashnode.dev/rss.xml")
```

#### Title of the feed
```
blog_feed.feed.title
```

#### Link of the feed

```
blog_feed.feed.link
```

#### Number of posts/entries
```
len(blog_feed.entries)
```
Each entry in the feed is a dictionary-like object, so its fields can be read either as attributes or as keys. Use [0] to access the first entry.

```
print(blog_feed.entries[0].title)
print(blog_feed.entries[0].link)
print(blog_feed.entries[0].author)
print(blog_feed.entries[0].published)
```
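
The `published` value is a plain string. For sorting or filtering posts by date it helps to convert it to a `datetime`. The sketch below assumes the feed uses the date style seen in the notebook output above (for example `Sun, 04 Oct 2020 04:06:06 GMT`) and works on hard-coded sample strings rather than a live feed. Feedparser also exposes an already parsed variant as `published_parsed`, which may be enough on its own.

```python
from email.utils import parsedate_to_datetime

# Date strings in the style that entry.published has for this RSS 2.0 feed.
published_strings = [
    "Sun, 04 Oct 2020 04:06:06 GMT",
    "Sat, 26 Dec 2020 09:32:36 GMT",
]

# parsedate_to_datetime understands this e-mail/RSS date format and
# returns timezone-aware datetime objects for strings ending in GMT.
published_dates = [parsedate_to_datetime(s) for s in published_strings]

# Aware datetimes compare correctly, so max() picks the most recent post.
newest = max(published_dates)
print(newest.isoformat())
```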

#### Getting tags and authors

```
tags = [tag.term for tag in blog_feed.entries[0].tags]
authors= [author.name for author in blog_feed.entries[0].authors]
```

#### Other attributes
```
blog_feed.version
blog_feed.headers
blog_feed.headers.get('content-type')
```

### Putting it together
Now use the above code to write a function that takes the link of an RSS feed and returns its details.


```
import feedparser


def get_posts_details(rss=None):
    """
    Take the link of an RSS feed and return the feed and post details as a dict.
    """
    if rss is None:
        return None

    blog_feed = feedparser.parse(rss)
    posts_details = {"Blog title": blog_feed.feed.get("title"),
                     "Blog link": blog_feed.feed.get("link")}
    post_list = []
    for post in blog_feed.entries:
        # Entries are dict-like; .get() returns None or [] for fields a post
        # lacks (some posts have no tags), instead of raising an error.
        post_list.append({
            "title": post.get("title"),
            "link": post.get("link"),
            "author": post.get("author"),
            "time_published": post.get("published"),
            "tags": [tag.get("term") for tag in post.get("tags", [])],
            "authors": [author.get("name") for author in post.get("authors", [])],
            "summary": post.get("summary"),
        })
    posts_details["posts"] = post_list
    return posts_details
```

Usage:
```
import json

blog_rss = "https://vaibhavkumar.hashnode.dev/rss.xml"

data = get_posts_details(rss = blog_rss)

print(json.dumps(data, indent=2))
```

Using this, one can quickly get the list of posts, their links, and other details. Once we have all the post links, we can also crawl them one by one and scrape details such as the number of likes and comments on each post.
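
As a sketch of the first step of such a crawl, the links can be pulled out of the returned dictionary. `sample_details` here is a made-up stand-in shaped like the return value of `get_posts_details`; with a real feed you would pass the function's result instead.

```python
# Made-up stand-in for the dict returned by get_posts_details().
sample_details = {
    "Blog title": "Example Blog",
    "Blog link": "https://example.com",
    "posts": [
        {"title": "First post", "link": "https://example.com/first-post"},
        {"title": "Second post", "link": "https://example.com/second-post"},
        {"title": "Draft without link", "link": None},
    ],
}

# Keep only posts that actually have a link, in feed order.
post_links = [post["link"] for post in sample_details["posts"] if post.get("link")]

for link in post_links:
    print(link)
```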

We can also use this to expose the details via JSON-based APIs.
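
One possible way to do that, sketched with only the standard library: a tiny HTTP handler that answers every GET request with the details as JSON. `SAMPLE_DETAILS`, `build_json_body`, `FeedDetailsHandler`, and `serve` are all names invented for this example, and the data is a hard-coded stand-in for the output of `get_posts_details`. A real service would more likely use a web framework and refresh the feed periodically.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Made-up stand-in for the dict returned by get_posts_details().
SAMPLE_DETAILS = {
    "Blog title": "Example Blog",
    "Blog link": "https://example.com",
    "posts": [
        {"title": "First post", "link": "https://example.com/first-post"},
    ],
}


def build_json_body(details):
    """Serialize the details dict to UTF-8 encoded JSON bytes."""
    return json.dumps(details, indent=2).encode("utf-8")


class FeedDetailsHandler(BaseHTTPRequestHandler):
    """Answer every GET request with the feed details as JSON."""

    def do_GET(self):
        body = build_json_body(SAMPLE_DETAILS)
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8000):
    """Start a blocking development server; stop it with Ctrl+C."""
    HTTPServer((host, port), FeedDetailsHandler).serve_forever()

# serve() is intentionally not called here, because it blocks until interrupted.
```

Calling `serve()` and opening `http://127.0.0.1:8000` in a browser should then show the JSON document.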

Try it with your own blog's RSS feed link.

Thanks for reading. Do give your suggestions and feedback down in the comments.