In [1]:
import pandas as pd
import numpy as np

# iTunes Podcasts Dataset

In this notebook, we preprocess the iTunes podcasts dataset following the steps described by the authors of the research paper.

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/odenizgiz/Podcasts-Data/master/df_popular_podcasts.csv")

In [3]:
df.shape

(10155, 9)

In [4]:
df.head()

Unnamed: 0,Name,Artwork,Genre IDs,Episode Count,Episode Durations,iTunes URL,Feed URL,Podcast URL,Description
0,"Selling the Couch with Melvin Varghese, Ph.D. ...",http://is2.mzstatic.com/image/thumb/Music71/v4...,"['1410', '26', '1321', '1307', '1315']",129,"[37.7, 36.18, 30.83, 30.87, 32.75, 36.15, 33.8...",https://itunes.apple.com/us/podcast/selling-co...,http://sellingthecouch.libsyn.com/rss,http://www.SellingTheCouch.com,Selling the Couch is the #1 podcast for aspiri...
1,Susan Gregg's Food for the Soul,http://is5.mzstatic.com/image/thumb/Music62/v4...,"['1444', '26', '1314', '1307', '1324', '1443',...",50,0,https://itunes.apple.com/us/podcast/susan-greg...,http://www.susangregg.com/feed/podcast,http://www.susangregg.com,You can feel safe and be happy no matter what ...
2,Field Days,http://is2.mzstatic.com/image/thumb/Music62/v4...,"['1474', '26', '1325']",58,"[17.13, 21.95, 23.48, 22.35, 33.47, 42.75, 26....",https://itunes.apple.com/us/podcast/field-days...,http://feeds.soundcloud.com/users/soundcloud:u...,http://soundcloud.com/field-days,The Michigan Department of Corrections - Field...
3,Card Talk: The World's Greatest Magic: the Gat...,http://is1.mzstatic.com/image/thumb/Music6/v4/...,"['1461', '26', '1323']",36,"[97.62, 94.87, 94.35, 74.22, 78.5, 84.43, 76.8...",https://itunes.apple.com/us/podcast/card-talk-...,http://taltcast.libsyn.com/rss,http://cardtalk.libsyn.com,Sometime Magic Pro GCB & semi-competitive Magi...
4,The UAV Digest,http://is3.mzstatic.com/image/thumb/Music62/v4...,"['1455', '26', '1323', '1318']",50,"[29.02, 33.55, 30.47, 31.43, 33.5, 33.98, 33.0...",https://itunes.apple.com/us/podcast/the-uav-di...,http://theuavdigest.com/feed/podcast/,http://theuavdigest.com,The UAV Digest is an audio podcast devoted to ...


In [5]:
df.columns

Index(['Name', 'Artwork', 'Genre IDs', 'Episode Count', 'Episode Durations',
       'iTunes URL', 'Feed URL', 'Podcast URL', 'Description'],
      dtype='object')

## Relevant Columns

In [6]:
df = df[["Name", "Description"]]

We keep only the two relevant columns: the name of the podcast and its description.

In [7]:
df.shape

(10155, 2)

In [8]:
df = df.rename({'Name': 'title', 'Description': 'description'}, axis=1)

We rename the columns to be consistent with the expected input to the algorithms in the research paper.

In [9]:
df.shape

(10155, 2)

In [10]:
df.head()

Unnamed: 0,title,description
0,"Selling the Couch with Melvin Varghese, Ph.D. ...",Selling the Couch is the #1 podcast for aspiri...
1,Susan Gregg's Food for the Soul,You can feel safe and be happy no matter what ...
2,Field Days,The Michigan Department of Corrections - Field...
3,Card Talk: The World's Greatest Magic: the Gat...,Sometime Magic Pro GCB & semi-competitive Magi...
4,The UAV Digest,The UAV Digest is an audio podcast devoted to ...


## Unique Podcast Titles

In [11]:
df.nunique()

title          10080
description     9973
dtype: int64

In [12]:
df = df.drop_duplicates("title")

We keep only unique podcast titles, as the paper suggests.

In [13]:
df.shape

(10080, 2)
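
As a small illustration of why the row count after deduplication matches the number of unique titles, the sketch below applies the same `drop_duplicates` call to a toy frame. The rows are made up for illustration and are not taken from the iTunes dataset.

```python
import pandas as pd

# Toy data (hypothetical rows, not taken from the iTunes dataset).
toy = pd.DataFrame({
    "title": ["Show A", "Show B", "Show A"],
    "description": ["first description", "another show", "second description"],
})

# With a column subset, drop_duplicates keeps the first row for each title
# (keep="first" is the pandas default) and preserves the original index labels.
deduped = toy.drop_duplicates("title")

print(deduped)
# The number of remaining rows equals the number of unique titles.
print(toy["title"].nunique() == len(deduped))
```

Note that the later description of a duplicated title is discarded, so which description survives depends on row order.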

## Filtering Records

The authors state that they keep only those rows of the dataset "*with the concatenations of title and description longer than 3 terms*". We first interpret terms as whitespace-separated words, but this does not reproduce the authors' result: we end up with 10050 rows to keep, compared to the 9859 the authors ended up with. The authors' interpretation must therefore be different, as not all words are likely to be considered terms. However, the research paper does not state explicitly which techniques were used in this step. Because of this mismatch, the cells below only compute and display the filter; it is not applied to `df`.


In [14]:
# Number of whitespace-separated words in each title and description
title_lengths = df["title"].str.split().str.len()
description_lengths = df["description"].str.split().str.len()

In [15]:
concat_lengths = np.add(title_lengths, description_lengths)

In [16]:
keep = concat_lengths > 3

In [17]:
sum(keep)

10050

In [18]:
df[keep]

Unnamed: 0,title,description
0,"Selling the Couch with Melvin Varghese, Ph.D. ...",Selling the Couch is the #1 podcast for aspiri...
1,Susan Gregg's Food for the Soul,You can feel safe and be happy no matter what ...
2,Field Days,The Michigan Department of Corrections - Field...
3,Card Talk: The World's Greatest Magic: the Gat...,Sometime Magic Pro GCB & semi-competitive Magi...
4,The UAV Digest,The UAV Digest is an audio podcast devoted to ...
...,...,...
10150,"MTG Pro Tutor - Insights, Tips & Advice from M...",MTG Pro Tutor is a top rated Magic: The Gather...
10151,National CASA Podcast,"Today, more than 1,000 CASA program offices op..."
10152,CAR DISCOVERY｜PodcastLife,【www.adelcars.com】アデル・カーズ株式会社提供番組：CAR DISCOVER...
10153,Tottenham Hotspur Family,The Tottenham Hotspur Family Podcast\n\n\nA gl...
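
Since the paper does not say what counts as a term, the following is only a sketch of one stricter interpretation: lowercase alphabetic tokens with a few stop words removed. The stop-word list, the regular expression, and the helper names `count_terms` and `keep_row` are assumptions made for illustration; nothing here is confirmed to be the authors' method.

```python
import re

# Hypothetical stop-word list for illustration only; the paper does not specify one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "with"}


def count_terms(text: str) -> int:
    """Count lowercase alphabetic tokens that are not stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for tok in tokens if tok not in STOP_WORDS)


def keep_row(title: str, description: str, min_terms: int = 3) -> bool:
    """Keep a row if title plus description has more than `min_terms` terms."""
    return count_terms(f"{title} {description}") > min_terms


# Seven words, five of which survive stop-word removal.
print(keep_row("The UAV Digest", "A podcast about drones"))
# Four words, but only three terms once "the" is removed.
print(keep_row("Field Days", "The show"))
```

The second toy row passes the plain word-count filter (4 words) but fails this stricter one (3 terms), which is the kind of difference that could explain a lower row count. A pattern like `[a-z]+` would also drop all non-Latin text, so it is not a neutral choice.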


## Language Recognition

Lastly, we experiment with language recognition on the podcast descriptions. The paper does not state that language recognition was used on this dataset. However, the detector finds 28 different languages among the descriptions. Keeping only English podcasts leaves 9580 rows, which is fewer than the 9859 rows the authors worked with. The authors therefore most likely did not apply this filter here, even though they did use language recognition on the other datasets. We nevertheless think it makes sense to use it here as well.


In [19]:
!pip install fasttext-langdetect

Collecting fasttext-langdetect
  Downloading fasttext-langdetect-1.0.5.tar.gz (6.8 kB)
  Preparing metadata (setup.py) ... done
Collecting fasttext>=0.9.1 (from fasttext-langdetect)
  Downloading fasttext-0.9.2.tar.gz (68 kB)
  Preparing metadata (setup.py) ... done
Collecting pybind11>=2.2 (from fasttext>=0.9.1->fasttext-langdetect)
  Using cached pybind11-2.11.1-py3-none-any.whl (227 kB)
Building wheels for collected packages: fasttext-langdetect, fasttext
  Building wheel for fasttext-langdetect (setup.py) ... done
  Created wheel for fasttext-langdetect: filename=fasttext_langdetect-1.0.5-py3-none-any.whl size=7502 sha256=e0b0049ac0b0c161792f55cb7013469a99c89aa9f54c437333827198ffa7c6c0
  Stored in directory: /root/.cache/pip/wheels/e6/5b/5d/47e2fd5c2ff1028722739ce35f365e8f6eeb89ec97aa63e621
  Building wheel for fasttext (s

In [20]:
from ftlangdetect import detect

In [21]:
langs = []
for desc in df["description"]:
    # Collapse all whitespace (including newlines) into single spaces before detection
    txt = " ".join(desc.split())
    result = detect(text=txt, low_memory=False)
    langs.append(result["lang"])



In [22]:
lang_keep = np.array(langs) == "en"

In [23]:
df[lang_keep].shape

(9580, 2)
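
The number of distinct languages mentioned above can be obtained by tallying the detected labels. The sketch below shows one way to do this with a pluggable detector. The helper `count_languages` and the stand-in `toy_detector` are hypothetical names introduced here so that the example runs without downloading a model; they are not part of fasttext-langdetect.

```python
from collections import Counter
from typing import Callable, Iterable


def count_languages(descriptions: Iterable[str],
                    detector: Callable[[str], str]) -> Counter:
    """Tally the language label returned by `detector` for each description."""
    counts = Counter()
    for desc in descriptions:
        # Collapse all whitespace (including newlines) into single spaces first.
        counts[detector(" ".join(desc.split()))] += 1
    return counts


# Hypothetical stand-in detector, for illustration only.
# With fasttext-langdetect one might instead pass:
#   lambda t: detect(text=t, low_memory=False)["lang"]
def toy_detector(text: str) -> str:
    return "en" if text.isascii() else "other"


counts = count_languages(
    ["A show about\nmagic", "番組の説明", "News daily"],
    toy_detector,
)
print(len(counts), counts.most_common())
```

On the real data, `len(counts)` would give the number of detected languages and `counts["en"]` the number of rows kept by the English filter.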

In [24]:
df[lang_keep].head()

Unnamed: 0,title,description
0,"Selling the Couch with Melvin Varghese, Ph.D. ...",Selling the Couch is the #1 podcast for aspiri...
1,Susan Gregg's Food for the Soul,You can feel safe and be happy no matter what ...
2,Field Days,The Michigan Department of Corrections - Field...
3,Card Talk: The World's Greatest Magic: the Gat...,Sometime Magic Pro GCB & semi-competitive Magi...
4,The UAV Digest,The UAV Digest is an audio podcast devoted to ...


## Dataset Export



In [25]:
df[lang_keep].to_csv("itunes.tsv", sep='\t', index=False, header=True)
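
Some descriptions contain newlines, so it may be worth checking that the exported file can be read back unchanged. The sketch below does a round trip on toy rows through an in-memory buffer instead of `itunes.tsv`; the data is made up, and it assumes the default pandas quoting behaviour, which quotes fields containing the delimiter or a line break.

```python
import io

import pandas as pd

# Toy rows (hypothetical); the second description contains newlines and a tab.
toy = pd.DataFrame({
    "title": ["Show A", "Show B"],
    "description": ["plain text", "line one\n\nline two\twith tab"],
})

# Write with the same options as the export cell, but to an in-memory buffer.
buffer = io.StringIO()
toy.to_csv(buffer, sep="\t", index=False, header=True)
buffer.seek(0)

# Read the TSV back and compare cell values with the original frame.
restored = pd.read_csv(buffer, sep="\t")
print(restored.values.tolist() == toy.values.tolist())
```

Tools that split the file naively on line breaks and tabs would not handle such quoted fields, so downstream readers should use a CSV-aware parser.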