<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-Cleaning-and-Feature-Engineering" data-toc-modified-id="Data-Cleaning-and-Feature-Engineering-1">Data Cleaning and Feature Engineering</a></span></li><li><span><a href="#Podcast-Date-/-Time-Series-Features" data-toc-modified-id="Podcast-Date-/-Time-Series-Features-2">Podcast Date / Time Series Features</a></span><ul class="toc-item"><li><span><a href="#Social-Feed-/-External-Site-Classifications" data-toc-modified-id="Social-Feed-/-External-Site-Classifications-2.1">Social Feed / External Site Classifications</a></span></li></ul></li></ul></div>

In [1]:
import pickle
import random
import sys, os
from datetime import datetime
import pandas as pd
import math

In [5]:
with open('../scraped/channel/by_category/Arts.pickle','rb') as file:
    df = pickle.load(file)

df.head(2)

Unnamed: 0,title,chan_url,num_comments,author,isExplicit,sub_count,play_count,ch_feed-socials,ep_total,recent_eps,hover_text_concat,chan_desc,cover_img_url,first_release,description
0,Fresh Air,https://castbox.fm/channel/Fresh-Air-id431951,126,NPR,0,167880,4705927,"[https://twitter.com/nprfreshair, https://www....",50,"[[2019-10-03, 00:48:44, 0], [2019-10-02, 00:49...",President Trump made building a border wall be...,"Fresh Air from WHYY, the Peabody Award-winning...",https://is3-ssl.mzstatic.com/image/thumb/Podca...,2019-08-07,
1,The Moth,https://castbox.fm/channel/The-Moth-id12,129,The Moth,0,143346,2419216,"[https://twitter.com/TheMoth, https://www.face...",157,"[[2019-10-01, 00:51:10, 8], [2019-09-24, 00:51...","This week, Two women meet by chance on a dark ...","Since its launch in 1997, The Moth has present...",https://is5-ssl.mzstatic.com/image/thumb/Podca...,2017-03-28,


## Data Cleaning and Feature Engineering

## Podcast Date / Time Series Features

Use the `recent_eps` and `first_release` columns to build features that capture how recently, how frequently, and for how long each podcast has released episodes.

We'll prototype each feature function on a single sample podcast, then apply it to the full dataframe.

In [24]:
df.iloc[0]

title                                                        Fresh Air
chan_url                 https://castbox.fm/channel/Fresh-Air-id431951
num_comments                                                       126
author                                                             NPR
isExplicit                                                           0
sub_count                                                       167880
play_count                                                     4705927
ch_feed-socials      [https://twitter.com/nprfreshair, https://www....
ep_total                                                            50
recent_eps           [[2019-10-03, 00:48:44, 0], [2019-10-02, 00:49...
hover_text_concat    President Trump made building a border wall be...
chan_desc            Fresh Air from WHYY, the Peabody Award-winning...
cover_img_url        https://is3-ssl.mzstatic.com/image/thumb/Podca...
first_release                                               2019-08-07
descri

In [7]:
sample_eps = df.iloc[0].recent_eps
sample_eps

[['2019-10-03', '00:48:44', 0],
 ['2019-10-02', '00:49:21', 6],
 ['2019-10-01', '00:49:13', 4],
 ['2019-09-30', '00:50:06', 4],
 ['2019-09-27', '00:50:24', 2],
 ['2019-09-27', '00:48:28', 1],
 ['2019-09-26', '00:48:59', 9],
 ['2019-09-25', '00:49:41', 2],
 ['2019-09-24', '00:49:44', 11],
 ['2019-09-23', '00:49:55', 4]]

In [16]:
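# Let's build a column called 'recent_ep_rate',
# which calculates the average number of days between episode releases
# for the last 10 or so episodes.

def convert_ep_date(string):
    '''Parse a 'YYYY-MM-DD' string into a datetime.'''
    return datetime.strptime(string, '%Y-%m-%d')

def recent_ep_mean_dist(recent_eps):
    '''Mean gap, in days, between consecutive entries of recent_eps.'''
    # The first element of each entry is the release date.
    ep_list = [convert_ep_date(ep[0]) for ep in recent_eps]

    # Absolute gap between each pair of neighbouring episodes.
    days_bt_eps = []
    for i in range(len(ep_list)-1):
        days_bt_eps += [abs(ep_list[i+1] - ep_list[i])]

    # With fewer than two episodes there is no gap to average.
    if not days_bt_eps:
        return math.nan

    return pd.Series(days_bt_eps).dt.days.mean()


recent_ep_mean_dist(sample_eps)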

1.1111111111111112
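
The same average can also be computed without an explicit loop. The following is a minimal sketch, not the notebook's own method: `recent_ep_mean_gap` is a hypothetical name, and it assumes every entry starts with a `'YYYY-MM-DD'` date string as in the sample above.

```python
import pandas as pd

def recent_ep_mean_gap(recent_eps):
    """Mean absolute gap, in days, between consecutive episode dates."""
    # Pull the date string out of each entry and parse the whole column at once.
    dates = pd.to_datetime(pd.Series([ep[0] for ep in recent_eps]))
    # diff() leaves NaT in the first row; mean() skips it.
    return dates.diff().abs().dt.days.mean()

# The ten Fresh Air entries shown above.
sample_eps = [['2019-10-03', '00:48:44', 0],
              ['2019-10-02', '00:49:21', 6],
              ['2019-10-01', '00:49:13', 4],
              ['2019-09-30', '00:50:06', 4],
              ['2019-09-27', '00:50:24', 2],
              ['2019-09-27', '00:48:28', 1],
              ['2019-09-26', '00:48:59', 9],
              ['2019-09-25', '00:49:41', 2],
              ['2019-09-24', '00:49:44', 11],
              ['2019-09-23', '00:49:55', 4]]

mean_gap = recent_ep_mean_gap(sample_eps)
```

For the sample, the nine gaps are 1, 1, 1, 3, 0, 1, 1, 1, 1 days, which sum to 10, so `mean_gap` matches the 10/9 ≈ 1.11 computed by the loop version.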

Apply the function to every row and store the result in the dataframe as `recent_ep_rate`.

In [37]:
ep_cols = ['title','ep_total','first_release','recent_eps','recent_ep_rate']
df['recent_ep_rate'] = df.recent_eps.apply(recent_ep_mean_dist)
df[ep_cols].head(3)

Unnamed: 0,title,ep_total,first_release,recent_eps,recent_ep_rate
0,Fresh Air,50,2019-08-07,"[[2019-10-03, 00:48:44, 0], [2019-10-02, 00:49...",1.111111
1,The Moth,157,2017-03-28,"[[2019-10-01, 00:51:10, 8], [2019-09-24, 00:51...",4.666667
2,TED Talks Daily,975,2016-02-08,"[[2019-10-03, 00:13:43, 3], [2019-10-02, 00:16...",1.111111


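Let's build a column called `lifetime_ep_freq`, which takes the time between the first release and the most recent episode and divides it by the total episode count. The result is the average number of days per episode over the podcast's lifetime.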

In [58]:
ep_cols_2 = ep_cols + ['lifetime_ep_freq']


def lifetime_ep_freq(row):
    '''
    Average time between episodes over the podcast's lifetime:

    = ([last ep date] - [first ep date]) / [ep count]

    Returns a timedelta.
    '''
    # recent_eps is ordered newest first, so entry 0 is the latest episode.
    last_ep_date = convert_ep_date(row['recent_eps'][0][0])
    release_date = convert_ep_date(row['first_release'])
    ep_total = row['ep_total']

    freq = (last_ep_date - release_date) / ep_total

    return freq


# Convert the resulting timedeltas to fractional days.
df['lifetime_ep_freq'] = df.apply(lifetime_ep_freq, axis=1).dt.total_seconds() / (24. * 60. * 60.)
df[ep_cols_2].head(3)

Unnamed: 0,title,ep_total,first_release,recent_eps,recent_ep_rate,lifetime_ep_freq
0,Fresh Air,50,2019-08-07,"[[2019-10-03, 00:48:44, 0], [2019-10-02, 00:49...",1.111111,1.14
1,The Moth,157,2017-03-28,"[[2019-10-01, 00:51:10, 8], [2019-09-24, 00:51...",4.666667,5.840764
2,TED Talks Daily,975,2016-02-08,"[[2019-10-03, 00:13:43, 3], [2019-10-02, 00:16...",1.111111,1.367179
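
The introduction to this section also mentions recency and age, which the frequency columns above do not cover directly. The following is a minimal sketch of how those two features might be derived. The function name, the output keys `days_since_last_ep` and `podcast_age_days`, and the `as_of` reference date are all hypothetical; the real reference date would be whenever the data was scraped.

```python
from datetime import datetime

DATE_FMT = '%Y-%m-%d'

def days_between(start, end):
    """Whole days from `start` to `end`, both 'YYYY-MM-DD' strings."""
    return (datetime.strptime(end, DATE_FMT) - datetime.strptime(start, DATE_FMT)).days

def recency_and_age(first_release, recent_eps, as_of):
    """Recency and age features relative to the reference date `as_of`."""
    # Assumes recent_eps is ordered newest first, as in the sample.
    latest = recent_eps[0][0]
    return {
        'days_since_last_ep': days_between(latest, as_of),
        'podcast_age_days': days_between(first_release, as_of),
    }

# Fresh Air's values from the table, with an assumed reference date.
feats = recency_and_age('2019-08-07',
                        [['2019-10-03', '00:48:44', 0]],
                        as_of='2019-10-04')
# feats == {'days_since_last_ep': 1, 'podcast_age_days': 58}
```

On the full dataframe this could be applied row by row with `df.apply(..., axis=1)`, in the same way as `lifetime_ep_freq`.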


Next, let's calculate the average episode length across `recent_eps` and save it as `avg_ep_len`.

In [None]:
ep_cols_3 = ep_cols + ['avg_ep_len']

def avg_ep_len(recent_eps):
    '''
    Average length, in minutes, of the recent 10 or so episodes.

    Assumes the second element of each entry is a duration string
    formatted 'HH:MM:SS', as in the sample above.
    '''
    durations = pd.to_timedelta(pd.Series([ep[1] for ep in recent_eps]))

    # Minutes are a choice made here; change the divisor for other units.
    return durations.dt.total_seconds().mean() / 60.


df['avg_ep_len'] = df.recent_eps.apply(avg_ep_len)
df[ep_cols_3].head(3)

### Social Feed / External Site Classifications

Create feature columns that count each podcast's social and external-site links, based on the `ch_feed-socials` column.

A follow-up step should scrape each social feed to assess how active it is.

Unnamed: 0,title,chan_url,num_comments,author,isExplicit,sub_count,play_count,ch_feed-socials,ep_total,recent_eps,hover_text_concat,chan_desc,cover_img_url,first_release,description
0,Fresh Air,https://castbox.fm/channel/Fresh-Air-id431951,126,NPR,0,167880,4705927,"[https://twitter.com/nprfreshair, https://www....",50,"[[2019-10-03, 00:48:44, 0], [2019-10-02, 00:49...",President Trump made building a border wall be...,"Fresh Air from WHYY, the Peabody Award-winning...",https://is3-ssl.mzstatic.com/image/thumb/Podca...,2019-08-07,
1,The Moth,https://castbox.fm/channel/The-Moth-id12,129,The Moth,0,143346,2419216,"[https://twitter.com/TheMoth, https://www.face...",157,"[[2019-10-01, 00:51:10, 8], [2019-09-24, 00:51...","This week, Two women meet by chance on a dark ...","Since its launch in 1997, The Moth has present...",https://is5-ssl.mzstatic.com/image/thumb/Podca...,2017-03-28,
2,TED Talks Daily,https://castbox.fm/channel/TED-Talks-Daily-id4541,710,TED,0,1382334,27571564,"[https://twitter.com/TEDTalks, https://www.fac...",975,"[[2019-10-03, 00:13:43, 3], [2019-10-02, 00:16...",With fascinating research and hilarious anecdo...,"Want TED Talks on the go? Every weekday, this ...",https://is1-ssl.mzstatic.com/image/thumb/Podca...,2016-02-08,
3,99% Invisible,https://castbox.fm/channel/id18,208,Roman Mars,0,177413,3149718,"[https://twitter.com/romanmars, http://99perce...",404,"[[2019-10-01, 00:33:53, 11], [2019-09-24, 00:4...",There’s an idea in city planning called “infor...,"Design is everywhere in our lives, perhaps mos...",https://is4-ssl.mzstatic.com/image/thumb/Podca...,2010-09-23,
4,Snap Judgment Presents: Spooked,https://castbox.fm/channel/Snap-Judgment-Prese...,208,Snap Judgment and WNYC Studios,0,23327,322310,"[https://twitter.com/SpookedPod, https://www.f...",36,"[[2019-09-19, 00:21:11, 7], [2019-09-13, 00:30...",Jennifer Percy was just a young graduate stude...,Spooked features true-life supernatural storie...,https://is2-ssl.mzstatic.com/image/thumb/Podca...,2017-09-01,
5,Myths and Legends,https://castbox.fm/channel/Myths-and-Legends-i...,327,Jason Weiser Carissa Weiser / Bardic,0,152126,2568707,"[https://twitter.com/MythPodcast, https://www....",224,"[[2019-10-01, 00:39:57, 4], [2019-09-24, 00:40...","Two stories, one from Italy, one from England....","Jason Weiser tells stories from myths, legends...",https://is1-ssl.mzstatic.com/image/thumb/Podca...,2015-04-27,
6,Nice Try!,https://castbox.fm/channel/Nice-Try!-id2112178,6,Curbed,0,17593,58893,"[https://twitter.com/curbed, https://www.faceb...",9,"[[2019-08-15, 00:46:51, 5], [2019-07-18, 00:36...",Before she began writing for the New York Time...,Avery Trufelman explores stories of people who...,https://is2-ssl.mzstatic.com/image/thumb/Podca...,2019-05-03,
7,Snap Judgment,https://castbox.fm/channel/Snap-Judgment-id131,127,Snap Judgment and WNYC Studios,0,43501,829582,"[https://twitter.com/snapjudgment, https://www...",248,"[[2019-09-26, 00:28:37, 8], [2019-09-19, 00:36...",A dad surprises his son by taking him on his f...,"Snap Judgment (Storytelling, with a BEAT) mixe...",https://is1-ssl.mzstatic.com/image/thumb/Podca...,2015-10-16,
8,The Beauty Closet,https://castbox.fm/channel/The-Beauty-Closet-i...,1,Goop Inc and Cadence 13,0,354,4732,[https://goop.com/thepodcast],11,"[[2019-09-25, 00:43:28, 1], [2019-09-18, 00:50...",Legendary makeup artist Bobbi Brown is as famo...,What’s the secret to the world’s glowiest skin...,https://is5-ssl.mzstatic.com/image/thumb/Podca...,2019-07-09,
9,LeVar Burton Reads,https://castbox.fm/channel/LeVar-Burton-Reads-...,153,LeVar Burton and Stitcher,1,29068,493187,"[https://twitter.com/levarburton, https://www....",75,"[[2019-10-01, 00:52:15, 5], [2019-09-24, 00:39...",A castle's mysteries begin to slowly unravel. ...,"The best short fiction, handpicked by the worl...",https://is4-ssl.mzstatic.com/image/thumb/Podca...,2017-06-02,
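
One way to derive the feed-count features described above is to classify each URL by its domain. The following is a rough sketch under stated assumptions: `ch_feed-socials` is assumed to hold a list of URL strings per podcast, the `SOCIAL_DOMAINS` set is an illustrative guess rather than a vetted list, and the output keys `n_feeds`, `n_social`, and `n_external` are hypothetical names.

```python
from urllib.parse import urlparse

# Hypothetical set of domains treated as "social"; extend as needed.
SOCIAL_DOMAINS = {'twitter.com', 'facebook.com', 'instagram.com', 'youtube.com'}

def classify_feeds(urls):
    """Count social versus external links in a list of URL strings."""
    urls = urls or []  # treat a missing value as an empty list
    n_social = 0
    for url in urls:
        host = urlparse(url).netloc.lower()
        # Drop a leading 'www.' so 'www.facebook.com' matches 'facebook.com'.
        if host.startswith('www.'):
            host = host[4:]
        if host in SOCIAL_DOMAINS:
            n_social += 1
    return {
        'n_feeds': len(urls),
        'n_social': n_social,
        'n_external': len(urls) - n_social,
    }

# The Beauty Closet lists a single external site in the table above.
beauty = classify_feeds(['https://goop.com/thepodcast'])
# beauty == {'n_feeds': 1, 'n_social': 0, 'n_external': 1}

# Illustrative mixed list built from two URLs that appear in the table.
mixed = classify_feeds(['https://twitter.com/TheMoth', 'https://goop.com/thepodcast'])
# mixed == {'n_feeds': 2, 'n_social': 1, 'n_external': 1}
```

The dictionaries can be expanded into columns with `df['ch_feed-socials'].apply(classify_feeds).apply(pd.Series)`, assuming `pandas` is imported as `pd`.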
