# BDA Final Project
# Spotify Podcasts Analysis

Xavier Cucurull Salamero

December 2021


<center><img src='Docs/img/spotify_bda_logo.png'  width="500" ></center>


10,048,969 Catalan speakers
https://www.plataforma-llengua.cat/media/upload/pdf/informecat2018_1528713023.pdf

A total of 580 million people speak Spanish in the world, 7.6% of the world’s population.
https://blogs.cervantes.es/londres/2019/10/15/spanish-a-language-spoken-by-580-million-people-and-only-483-million-of-them-native/

Spanish has the second-largest number of native speakers
https://www.babbel.com/en/magazine/the-10-most-spoken-languages-in-the-world

Spotify now offers podcasts
https://www.theverge.com/2015/5/20/8629335/spotify-adds-podcasts-videos




## 1. Introduction

Podcasts are an audio-only means of communication that is growing rapidly. Spotify, one of the major audio streaming services, has recently paid more attention to podcasting. Last month the company expanded its Podcast Subscription program to global markets [1], and in its Q3 shareholder letter [2] it pointed out that the number of podcasts on Spotify rose to 3.2M from 2.9M the previous quarter, while also maintaining strong engagement among monthly active users.

With the growth of podcasting, getting insights into the market's evolution is crucial for many stakeholders. Podchaser, the "IMDb for podcasts", which offers paid services aimed at marketers, PR firms, and other professionals who need relevant insights and metrics, announced $4M in funding in January [3].

I am a podcast enthusiast and I have recently found myself consuming content that is only available online, as opposed to podcasts that are also broadcast live on the radio, with some podcasts being Spotify exclusives. In addition, during the COVID-19 lockdown I had the impression that new podcasts were created every week, as if almost everybody wanted to make one. As a result, I thought that an analysis of the podcast market could be really interesting and a good challenge in which to apply the knowledge obtained during the course.

This project is divided into two parts. First, the development of a podcast scraper using the Spotify API [4], with the goal of building a database containing information about each show, such as name, publisher and release date. Second, a big data analysis of the resulting dataset, with the goal of gaining insight into the evolution and current state of the podcast market. For simplicity, given the long processing time needed to build the database, the scope of the study has been narrowed down to podcasts in Catalan and Spanish, which are my two native languages.


## 2. Data collection
As a Spotify user, I first looked at their API to see what could be done and found that last year Spotify introduced the podcasts API [5]. This seemed promising because, compared to other services such as Podchaser, which limits its free plan to 25,000 requests per month, Spotify doesn't restrict the total number of queries; it only enforces a quota on the number of requests that can be made in a 30-second window.
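
As a rough illustration of working within that quota, the sketch below retries a request when the API answers with HTTP 429 (rate limited) and waits for the number of seconds given in the `Retry-After` header. The `send` callable, the function name and the default values are hypothetical stand-ins for whatever HTTP client and settings are actually used; this is not the project's code.

```python
import time


def call_with_retry(send, max_attempts=5, default_wait=5.0, sleep=time.sleep):
    """Call send() until it stops returning HTTP 429 (rate limited).

    send is a hypothetical zero-argument callable returning an object with
    status_code and headers attributes (as a requests.Response has).
    """
    for _ in range(max_attempts):
        response = send()
        if response.status_code != 429:
            return response
        # Retry-After holds the number of seconds to wait before retrying;
        # default_wait is an assumed fallback if the header is absent
        wait = float(response.headers.get("Retry-After", default_wait))
        sleep(wait)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the helper easy to exercise without actually waiting.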

The main Spotify web API function is "Search", which returns catalog information about albums, artists, playlists, tracks, shows or episodes that match a keyword string. So in order to get a list of all the podcasts in the Spotify catalog, it was necessary to find the keywords that would retrieve them. With the API limit of 1000 results per query, this seemed like an impossible task. How could I think of the words or strings that would lead me to discover the titles of all the podcasts available?

Before using the Spotify API to get information about the shows, I needed a list of podcast names. To get one, I turned to another podcast provider, Apple's iTunes [6]. Being part of the Apple Developer Program costs $99, which is not affordable for the scope of this project. Since I just needed a list of the names of all the podcasts available in the iTunes catalog, I could write a simple Python script to perform the web scraping. Once I had this list, I could search for each of the names using the Spotify API. The diagram below shows the database creation process.

<center><img src='Docs/img/database_creation_diagram.png'></center>
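
To make the Search step more concrete, here is a minimal sketch of how a request for shows matching one title could be built and how show names could be read from the parsed JSON response. The function names, the `limit` default and the `market` value are assumptions made for illustration; the request itself (which also needs an `Authorization: Bearer <token>` header) is not sent here.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.spotify.com/v1/search"


def build_search_url(title, limit=10, market="ES"):
    """Build a Search request URL for shows matching a podcast title.

    limit and market are illustrative defaults, not the project's settings.
    """
    params = {"q": title, "type": "show", "limit": limit, "market": market}
    return f"{SEARCH_URL}?{urlencode(params)}"


def show_names(search_response):
    """Return the names of the shows in a parsed Search response."""
    items = search_response.get("shows", {}).get("items", [])
    # Skip empty entries defensively in case an item is missing
    return [item["name"] for item in items if item]
```

For example, `build_search_url("Example Show")` produces a URL whose query string carries the title, `type=show`, the limit and the market.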


### 2.1. Apple Podcasts scraper
The first part of the database construction is the Apple Podcasts scraper. This Python script parses the Apple Podcasts page, exploring all the different categories, and creates a CSV table with the title and genre of all the podcasts in the catalog. The libraries used are ```requests``` and ```beautifulsoup```. This process results in 2M titles.
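
The extraction step can be sketched roughly as below. To keep the example dependency-free it uses the standard library's `html.parser` in place of `beautifulsoup`, and it works on an HTML string instead of downloading a page with `requests`. The class and function names are hypothetical, and the rule used to recognise links to show pages (URLs containing `/podcast/` and `/id`) is an assumption about the page structure, not the script's actual logic.

```python
from html.parser import HTMLParser


class PodcastLinkParser(HTMLParser):
    """Collect the text of links that look like Apple Podcasts show pages."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_show_link = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        # Assumption: show pages have URLs like .../podcast/<slug>/id<digits>
        if tag == "a" and "/podcast/" in href and "/id" in href:
            self._in_show_link = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_show_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_show_link:
            title = "".join(self._buffer).strip()
            if title:
                self.titles.append(title)
            self._in_show_link = False


def parse_titles(html, genre):
    """Return (title, genre) rows found in the HTML of one category page."""
    parser = PodcastLinkParser()
    parser.feed(html)
    return [(title, genre) for title in parser.titles]
```

Each returned row maps directly onto one line of the CSV table (title, genre).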

The code of the script is available in TODO

TODO: add processing time, example of output.


### 2.2. Spotify Podcasts scraper
The second part of the database construction is the Spotify Podcasts scraper. Using the list of Apple podcasts and the Spotify web API, this Python script iterates over each of the titles in the previously obtained table and uses the Search API to get the shows that match the provided string. Since some titles available in the Apple podcast list might not be present in the Spotify catalog and vice versa, up to 10 shows (a configurable maximum) are retrieved for each title. This allows for more "exploration" of the Spotify catalog, making it possible to discover shows that were not present in the original podcast list.

When searching for a show, the Spotify API returns the **id**, the **name**, the **description**, the **publisher**, the **total number of episodes**, whether it is flagged as **explicit**, the **languages** and the **media type**. To obtain the release date and the date of the last episode, the next step is to retrieve the episodes of each show. Since getting episode information adds a lot of overhead to the database creation process, at this step only shows whose languages include Catalan ("ca") or Spanish ("es") are kept. In addition, to reduce the number of API calls, at most 100 episodes (including the first and the last) are retrieved. From those episodes, we obtain the **release date** of the show, the **last date** on which the show published an episode and the **average duration** of the episodes.

Finally, the show is added to the database. Since the show id is used as the unique id in the database, duplicate entries are avoided.
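
The language filter and the per-show episode summary described above could look roughly like the following sketch. It is not the code of the scraper: the function names are hypothetical, and it assumes that every retrieved episode carries a day-precision `release_date` string ("YYYY-MM-DD") and a `duration_ms` value.

```python
from datetime import date

TARGET_LANGUAGES = {"ca", "es"}


def has_target_language(show):
    """True if the show lists Catalan ("ca") or Spanish ("es")."""
    return any(code in TARGET_LANGUAGES for code in show.get("languages") or [])


def date_parts(day):
    """Store a date as a year/month/day sub-document (all None if missing)."""
    if day is None:
        return {"year": None, "month": None, "day": None}
    return {"year": day.year, "month": day.month, "day": day.day}


def summarize_episodes(episodes):
    """Derive release date, last date and mean duration from episodes.

    Assumes day-precision "release_date" strings and "duration_ms" values.
    """
    if not episodes:
        return {"release_date": date_parts(None),
                "last_date": date_parts(None),
                "average_duration_min": None}
    days = [date.fromisoformat(ep["release_date"]) for ep in episodes]
    mean_ms = sum(ep["duration_ms"] for ep in episodes) / len(episodes)
    return {"release_date": date_parts(min(days)),
            "last_date": date_parts(max(days)),
            "average_duration_min": mean_ms / 60000}
```

Taking the minimum and maximum dates means the result does not depend on the order in which the episodes were retrieved.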

The code of the script is available in [TODO] which implements the class ```SpotifyScraper```.

TODO: give some more details about python class, execution time, problems.

### 2.3. MongoDB database
Given the relative simplicity of the database and the fact that only one database (the podcasts database) is needed, MongoDB was chosen for the project.
TODO more details?

### 2.4. Limitations
This method, although effective for building a large dataset that is representative of the podcast market, has some limitations. First of all, although most podcasters try to have their show on platforms such as iTunes or Spotify, that is not always the case. In addition, with Spotify's increasing interest in podcasting, there are nowadays shows that are exclusive to Spotify. The titles of these shows don't appear in the Apple Podcasts list and, as a result, might be missed when scraping Spotify. One example is the podcast "Oye Polo", from Radio Primavera Sound, which was not captured by the database construction method.

### 2.5. Code Documentation
All the Python scripts developed in this project have been documented using docstrings, and the corresponding documentation has been generated using Sphinx. For reference, the code documentation can be found in the ```Docs/build/html``` directory or [here](Docs/build/html/py-modindex.html).

## 3. Data description

### 3.3. Missing values/outliers

## 5. Data analysis
Once the database is constructed, the data analysis is performed in this Jupyter Notebook.

In [1]:
import pymongo
import json
from bson.son import SON

client = pymongo.MongoClient('localhost', 27017, username='mongoadmin', password='pass1234')

# Get the already created project database
database = client['finalproject']

# Get the podcasts collection (MongoDB creates it on first insert if it does not exist)
podcasts = database.podcasts

In [2]:
print(podcasts.count_documents({}))

254160


Missing values are stored as ```null``` in MongoDB. pymongo maps these ```null``` values to ```None```, so to find missing values from Python we query for ```None```.

Some ```null``` values appear when a podcast has no episodes. For shows where ```total_episodes``` is 0, ```average_duration_min``` and the year, month and day fields of ```release_date``` and ```last_date``` are ```null```.

In [6]:
res = podcasts.find({'total_episodes': 0}, {'release_date' : 1, 'last_date': 1, 'average_duration_min' : 1})

list(res)[0]

{'_id': ObjectId('61b276d16015f8bce7c4ea04'),
 'average_duration_min': None,
 'release_date': {'year': None, 'month': None, 'day': None},
 'last_date': {'year': None, 'month': None, 'day': None}}
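
To spot such gaps more systematically, a small helper can list every field of a retrieved document whose value is ```None```, descending into sub-documents. This is only a sketch: the function name `missing_fields` is hypothetical, and the example dictionary copies the shape of the document shown above, without its `_id`.

```python
def missing_fields(document, prefix=""):
    """Return dotted paths of all fields whose value is None."""
    missing = []
    for key, value in document.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into sub-documents such as release_date
            missing.extend(missing_fields(value, prefix=f"{path}."))
        elif value is None:
            missing.append(path)
    return missing


# Same shape as the document shown above (the _id is omitted)
example = {
    "average_duration_min": None,
    "release_date": {"year": None, "month": None, "day": None},
    "last_date": {"year": None, "month": None, "day": None},
}
print(missing_fields(example))
```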

In [None]:
# TODO:

# Try to "visualize" each of the variables
# Compare languages and comment
# Episode frequency: total days / total episodes
# Episode release dates -> compare most popular days, months
# Year increase of podcasts
#
# Plus: genres?
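
One of the planned metrics, episode frequency (total days / total episodes), could be computed from the stored dates roughly as follows. This is a sketch under assumptions: the function name `days_per_episode` is hypothetical, and the show document is assumed to have the `release_date`, `last_date` and `total_episodes` fields in the shape shown earlier.

```python
from datetime import date


def days_per_episode(show):
    """Average number of days per episode, or None if it is undefined."""
    first, last = show["release_date"], show["last_date"]
    total = show.get("total_episodes") or 0
    # Shows without episodes have None in their date fields
    if total == 0 or None in first.values() or None in last.values():
        return None
    span = (date(last["year"], last["month"], last["day"])
            - date(first["year"], first["month"], first["day"]))
    return span.days / total
```

Returning ```None``` for shows without episodes keeps the missing-value convention used in the database.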

## 6. Conclusions

1. Describe the problem, including any required background, and explain why you believe it is important / interesting.
2. Collect the data and store it. Apply the techniques we have seen in the course. Relational vs. No-SQL, which one makes more sense?
3. Describe the data for the reader. Which fields are included by default? Which are you going to use for your analysis? For example, Twitter includes geolocation information, but just on a very small number of tweets, and the detected language may not always be correct.
4. Does the data publisher have any convention to indicate missing values? How are you going to treat them? Do you see any strange values that do not adhere to that convention (outliers)?
5. Analyse your clean data set to answer the proposed problem.
6. Conclusions.


## References

[1] https://finance.yahoo.com/news/spotifys-podcast-subscriptions-expand-global-173408899.html?guccounter=1


[2] https://www.sec.gov/Archives/edgar/data/1639920/000119312521308632/d174858dex991.htm

[3] https://www.podchaser.com/articles/inside-podchaser/podchaser-raises-4m

[4] https://developer.spotify.com/documentation/web-api

[5] https://developer.spotify.com/community/news/2020/03/20/introducing-podcasts-api/

[6] https://podcasts.apple.com/es/genre/podcasts/id26

In [13]:
import sqlite3

import pandas as pd

# Open the SQLite database. Any connection error is left to propagate,
# because nothing below can run without a valid connection.
conn = sqlite3.connect("podcastindex_feeds.db")

# To read the data into a pandas DataFrame we first need to know the table name
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(f"Table Name : {cursor.fetchall()}")

# Load the whole podcasts table and release the connection
df = pd.read_sql_query('SELECT * FROM podcasts', conn)
conn.close()

Table Name : [('podcasts',)]


In [17]:
df.head()   # language

rslt_df = df[df['language']=='ca']

In [None]:
len(rslt_df)

In [20]:
rslt_df = df[df['language']=='ca']
print(len(rslt_df))

11673


In [21]:
rslt_df = df[df['language']=='es']
print(len(rslt_df))

582750
