# Predict Podcast Listening Time

Podcast_Name
Type: string
Values: Names of popular podcasts.

Episode_Title
Type: string
Values: Titles of the episodes.

Episode_Length
Type: float (minutes)
Values: Length of the episode in minutes. Example: 5.0, 10.0, 30.0, 45.0, 60.0, 90.0.

Genre
Type: string
Values: "Technology", "Education", "Comedy", "Health", "True Crime", "Business", "Sports", "Lifestyle", "News", "Music".

Host_Popularity
Type: float (scale 0-100)
Values: A score indicating the popularity of the host. Example: 50.0, 75.0, 90.0.

Publication_Day
Type: string
Values: Day of the week the episode was published. Example: "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday".

Publication_Time
Type: string
Values: "Morning", "Afternoon", "Evening", "Night".

Guest_Popularity
Type: float (scale 0-100)
Values: A score indicating the popularity of the guest (if any). Example: 20.0, 50.0, 85.0.

Number_of_Ads
Type: int
Values: Number of advertisements within the episode. Example: 0, 1, 2, 3.

Episode_Sentiment
Type: string
Values: Sentiment of the episode's content. Example: "Positive", "Neutral", "Negative".

Listening_Time
Type: float (minutes)
Values: The actual average listening duration (target variable).

In [118]:
import pandas as pd
import numpy as np
import itertools
import polars as pl
import pyarrow
import matplotlib.pyplot as plt
import seaborn as sns
import os
from pathlib import Path
import re
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix




In [119]:
df_train= pd.read_csv('input/train.csv', sep=',')
df_train

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,Listening_Time_minutes
0,0,Mystery Matters,Episode 98,,True Crime,74.81,Thursday,Night,,0.0,Positive,31.41998
1,1,Joke Junction,Episode 26,119.80,Comedy,66.95,Saturday,Afternoon,75.95,2.0,Negative,88.01241
2,2,Study Sessions,Episode 16,73.90,Education,69.97,Tuesday,Evening,8.97,0.0,Negative,44.92531
3,3,Digital Digest,Episode 45,67.17,Technology,57.22,Monday,Morning,78.70,2.0,Positive,46.27824
4,4,Mind & Body,Episode 86,110.51,Health,80.07,Monday,Afternoon,58.68,3.0,Neutral,75.61031
...,...,...,...,...,...,...,...,...,...,...,...,...
749995,749995,Learning Lab,Episode 25,75.66,Education,69.36,Saturday,Morning,,0.0,Negative,56.87058
749996,749996,Business Briefs,Episode 21,75.75,Business,35.21,Saturday,Night,,2.0,Neutral,45.46242
749997,749997,Lifestyle Lounge,Episode 51,30.98,Lifestyle,78.58,Thursday,Morning,84.89,0.0,Negative,15.26000
749998,749998,Style Guide,Episode 47,108.98,Lifestyle,45.39,Thursday,Morning,93.27,0.0,Negative,100.72939


# EDA and Data Prep

In [120]:
df_train[df_train['Podcast_Name']=='Mystery Matters']

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,Listening_Time_minutes
0,0,Mystery Matters,Episode 98,,True Crime,74.81,Thursday,Night,,0.0,Positive,31.41998
67,67,Mystery Matters,Episode 13,57.93,True Crime,72.48,Saturday,Afternoon,90.00,1.0,Neutral,30.89510
95,95,Mystery Matters,Episode 40,94.45,True Crime,53.37,Friday,Afternoon,36.36,0.0,Positive,66.59645
142,142,Mystery Matters,Episode 50,23.74,True Crime,86.73,Friday,Afternoon,,2.0,Neutral,22.10401
155,155,Mystery Matters,Episode 59,108.34,True Crime,64.26,Monday,Afternoon,8.37,0.0,Neutral,76.97396
...,...,...,...,...,...,...,...,...,...,...,...,...
749801,749801,Mystery Matters,Episode 66,92.54,True Crime,31.97,Tuesday,Morning,70.61,0.0,Negative,76.19639
749818,749818,Mystery Matters,Episode 39,33.28,True Crime,91.90,Saturday,Night,,0.0,Positive,24.44947
749822,749822,Mystery Matters,Episode 20,32.41,True Crime,90.45,Friday,Morning,61.55,0.0,Positive,23.78321
749868,749868,Mystery Matters,Episode 80,,True Crime,33.15,Thursday,Night,27.19,0.0,Positive,89.95920


In [121]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 750000 entries, 0 to 749999
Data columns (total 12 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           750000 non-null  int64  
 1   Podcast_Name                 750000 non-null  object 
 2   Episode_Title                750000 non-null  object 
 3   Episode_Length_minutes       662907 non-null  float64
 4   Genre                        750000 non-null  object 
 5   Host_Popularity_percentage   750000 non-null  float64
 6   Publication_Day              750000 non-null  object 
 7   Publication_Time             750000 non-null  object 
 8   Guest_Popularity_percentage  603970 non-null  float64
 9   Number_of_Ads                749999 non-null  float64
 10  Episode_Sentiment            750000 non-null  object 
 11  Listening_Time_minutes       750000 non-null  float64
dtypes: float64(5), int64(1), object(6)
memory usage: 68.7+ MB


In [122]:
df_train.describe()

Unnamed: 0,id,Episode_Length_minutes,Host_Popularity_percentage,Guest_Popularity_percentage,Number_of_Ads,Listening_Time_minutes
count,750000.0,662907.0,750000.0,603970.0,749999.0,750000.0
mean,374999.5,64.504738,59.859901,52.236449,1.348855,45.437406
std,216506.495284,32.969603,22.873098,28.451241,1.15113,27.138306
min,0.0,0.0,1.3,0.0,0.0,0.0
25%,187499.75,35.73,39.41,28.38,0.0,23.17835
50%,374999.5,63.84,60.05,53.58,1.0,43.37946
75%,562499.25,94.07,79.53,76.6,2.0,64.81158
max,749999.0,325.24,119.46,119.91,103.91,119.97


## To-dos

- Fill the null values in Episode_Length_minutes with the podcast's mean episode length
- Create the ad_per_minute column (Number_of_Ads divided by Episode_Length_minutes)
- Create H_G_Popularity by summing the host and guest popularity columns
- Encode the string columns
- Create a weekend column
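
The to-do items above can be sketched as one reusable function. This is a minimal sketch under stated assumptions, not the notebook's own code: the function name `add_features` and the column name `is_weekend` are hypothetical, and treating Saturday and Sunday as the weekend is an assumption. String encoding is left to the modelling pipeline.

```python
import numpy as np
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with imputed values and engineered columns."""
    out = df.copy()

    # Treat zero-length episodes as missing, then fill with the podcast's mean length.
    length = out["Episode_Length_minutes"].replace(0, np.nan)
    podcast_mean = length.groupby(out["Podcast_Name"]).transform("mean")
    out["Episode_Length_minutes"] = length.fillna(podcast_mean)

    # Fill missing ad counts with the overall median before dividing.
    ads = out["Number_of_Ads"].fillna(out["Number_of_Ads"].median())
    out["Number_of_Ads"] = ads
    out["ad_per_minute"] = ads / out["Episode_Length_minutes"]

    # Fill missing guest popularity with the podcast's median, then add host popularity.
    guest = out["Guest_Popularity_percentage"]
    guest = guest.fillna(guest.groupby(out["Podcast_Name"]).transform("median"))
    out["Guest_Popularity_percentage"] = guest
    out["H_G_Popularity"] = out["Host_Popularity_percentage"] + guest

    # Hypothetical weekend flag: 1 for Saturday/Sunday, 0 otherwise.
    out["is_weekend"] = out["Publication_Day"].isin(["Saturday", "Sunday"]).astype(int)
    return out


# Toy rows for illustration only; these values are not from the dataset.
toy = pd.DataFrame({
    "Podcast_Name": ["A", "A", "A", "B"],
    "Episode_Length_minutes": [10.0, 30.0, 0.0, 40.0],
    "Number_of_Ads": [1.0, np.nan, 2.0, 0.0],
    "Host_Popularity_percentage": [50.0, 60.0, 70.0, 80.0],
    "Guest_Popularity_percentage": [20.0, np.nan, 40.0, 10.0],
    "Publication_Day": ["Monday", "Saturday", "Sunday", "Friday"],
})
features = add_features(toy)
print(features[["Episode_Length_minutes", "ad_per_minute", "H_G_Popularity", "is_weekend"]])
```

Returning a copy keeps the raw frame untouched, so the same function could be applied to both the training and the test frame.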

In [123]:
df_train['Podcast_Name'].value_counts()

Podcast_Name
Tech Talks             22847
Sports Weekly          20053
Funny Folks            19635
Tech Trends            19549
Fitness First          19488
Business Insights      19480
Style Guide            19364
Game Day               19272
Melody Mix             18889
Criminal Minds         17735
Finance Focus          17628
Detective Diaries      17452
Crime Chronicles       17374
Athlete's Arena        17327
Fashion Forward        17280
Tune Time              17254
Business Briefs        17012
Lifestyle Lounge       16661
True Crime Stories     16373
Sports Central         16191
Digital Digest         16171
Humor Hub              16144
Mystery Matters        16002
Comedy Corner          15927
Joke Junction          15074
Wellness Wave          15009
Sport Spot             14778
Gadget Geek            14770
Home & Living          14686
Laugh Line             14673
Life Lessons           14464
World Watch            14043
Sound Waves            13928
Global News            13649
M

### Fill the null values in Episode_Length_minutes with the podcast's mean episode length

In [124]:
# Replace zeros with NaN so that zero-length episodes are imputed too
df_train['Episode_Length_minutes'] = df_train['Episode_Length_minutes'].replace(0, np.nan)
# Mean episode length per podcast
mean_map = df_train.groupby('Podcast_Name')['Episode_Length_minutes'].mean().round(2)
# Fill NaN with the podcast's mean length
df_train['Episode_Length_minutes'] = (
    df_train['Episode_Length_minutes']
    .fillna(df_train['Podcast_Name'].map(mean_map)))

In [125]:
mean_map

Podcast_Name
Athlete's Arena        65.84
Brain Boost            63.46
Business Briefs        66.67
Business Insights      62.85
Comedy Corner          62.63
Crime Chronicles       66.10
Criminal Minds         61.45
Current Affairs        62.01
Daily Digest           65.29
Detective Diaries      65.83
Digital Digest         63.42
Educational Nuggets    65.07
Fashion Forward        64.63
Finance Focus          62.05
Fitness First          65.38
Funny Folks            63.73
Gadget Geek            65.03
Game Day               62.71
Global News            65.78
Health Hour            64.42
Healthy Living         64.54
Home & Living          66.53
Humor Hub              65.42
Innovators             64.53
Joke Junction          61.35
Laugh Line             63.03
Learning Lab           65.28
Life Lessons           64.58
Lifestyle Lounge       64.31
Market Masters         65.21
Melody Mix             67.91
Mind & Body            66.28
Money Matters          66.90
Music Matters          65.19
M

In [126]:
df_train.describe()

Unnamed: 0,id,Episode_Length_minutes,Host_Popularity_percentage,Guest_Popularity_percentage,Number_of_Ads,Listening_Time_minutes
count,750000.0,750000.0,750000.0,603970.0,749999.0,750000.0
mean,374999.5,64.502542,59.859901,52.236449,1.348855,45.437406
std,216506.495284,31.001355,22.873098,28.451241,1.15113,27.138306
min,0.0,1.24,1.3,0.0,0.0,0.0
25%,187499.75,39.42,39.41,28.38,0.0,23.17835
50%,374999.5,64.42,60.05,53.58,1.0,43.37946
75%,562499.25,90.31,79.53,76.6,2.0,64.81158
max,749999.0,325.24,119.46,119.91,103.91,119.97


In [127]:
df_train[df_train['Episode_Length_minutes']==0]

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,Listening_Time_minutes


In [128]:
df_train[df_train['Podcast_Name']=='Fashion Forward']

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,Listening_Time_minutes
160,160,Fashion Forward,Episode 15,6.91,Lifestyle,47.16,Monday,Evening,15.93,1.0,Neutral,6.90383
219,219,Fashion Forward,Episode 42,34.82,Lifestyle,88.68,Sunday,Evening,14.63,0.0,Positive,16.86000
233,233,Fashion Forward,Episode 55,92.80,Lifestyle,49.70,Tuesday,Night,86.59,1.0,Negative,83.69482
247,247,Fashion Forward,Episode 71,64.63,Lifestyle,70.14,Monday,Morning,,2.0,Negative,9.01001
274,274,Fashion Forward,Episode 44,56.26,Lifestyle,74.26,Sunday,Night,16.92,1.0,Neutral,31.52727
...,...,...,...,...,...,...,...,...,...,...,...,...
749824,749824,Fashion Forward,Episode 78,23.84,Lifestyle,49.89,Tuesday,Night,50.66,3.0,Positive,12.13000
749857,749857,Fashion Forward,Episode 3,67.75,Lifestyle,77.61,Saturday,Night,15.88,2.0,Neutral,40.69137
749863,749863,Fashion Forward,Episode 27,70.43,Lifestyle,37.63,Sunday,Afternoon,10.09,3.0,Neutral,46.46165
749955,749955,Fashion Forward,Episode 42,39.08,Lifestyle,91.53,Tuesday,Night,,0.0,Negative,21.99000


### Create ad_per_minute column

In [129]:
df_train.describe()

Unnamed: 0,id,Episode_Length_minutes,Host_Popularity_percentage,Guest_Popularity_percentage,Number_of_Ads,Listening_Time_minutes
count,750000.0,750000.0,750000.0,603970.0,749999.0,750000.0
mean,374999.5,64.502542,59.859901,52.236449,1.348855,45.437406
std,216506.495284,31.001355,22.873098,28.451241,1.15113,27.138306
min,0.0,1.24,1.3,0.0,0.0,0.0
25%,187499.75,39.42,39.41,28.38,0.0,23.17835
50%,374999.5,64.42,60.05,53.58,1.0,43.37946
75%,562499.25,90.31,79.53,76.6,2.0,64.81158
max,749999.0,325.24,119.46,119.91,103.91,119.97


In [130]:
df_train[pd.isna(df_train['Number_of_Ads'])]

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,Listening_Time_minutes
247170,247170,Game Day,Episode 33,35.66,Sports,27.35,Friday,Evening,49.87,,Negative,23.94516


In [131]:
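# Replace NaN with the median in the 'Number_of_Ads' column
# (assign the result back; fillna(inplace=True) on a selected column is chained assignment)
df_train['Number_of_Ads'] = df_train['Number_of_Ads'].fillna(df_train['Number_of_Ads'].median())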

In [132]:
df_train['ad_per_minute'] = df_train['Number_of_Ads']/df_train['Episode_Length_minutes']

### Create H_G_Popularity by summing the two popularity columns

In [133]:
# Median guest popularity per podcast
guest_median_map = df_train.groupby('Podcast_Name')['Guest_Popularity_percentage'].median().round(2)
# Fill NaN with the podcast's median guest popularity
df_train['Guest_Popularity_percentage'] = (
    df_train['Guest_Popularity_percentage']
    .fillna(df_train['Podcast_Name'].map(guest_median_map)))

In [134]:
df_train['H_G_Popularity'] = (
    df_train['Host_Popularity_percentage'] + df_train['Guest_Popularity_percentage']
)

In [135]:
df_train.describe()

Unnamed: 0,id,Episode_Length_minutes,Host_Popularity_percentage,Guest_Popularity_percentage,Number_of_Ads,Listening_Time_minutes,ad_per_minute,H_G_Popularity
count,750000.0,750000.0,750000.0,750000.0,750000.0,750000.0,750000.0,750000.0
mean,374999.5,64.502542,59.859901,52.486399,1.348854,45.437406,0.033272,112.3463
std,216506.495284,31.001355,22.873098,25.540986,1.15113,27.138306,0.051559,34.630894
min,0.0,1.24,1.3,0.0,0.0,0.0,0.0,7.51
25%,187499.75,39.42,39.41,34.55,0.0,23.17835,0.0,87.5
50%,374999.5,64.42,60.05,53.72,1.0,43.37946,0.018932,112.86
75%,562499.25,90.31,79.53,71.04,2.0,64.81158,0.040617,137.28
max,749999.0,325.24,119.46,119.91,103.91,119.97,0.945238,213.33


### Create encoding for string columns

In [136]:
df_train.head(5)

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,Listening_Time_minutes,ad_per_minute,H_G_Popularity
0,0,Mystery Matters,Episode 98,64.39,True Crime,74.81,Thursday,Night,53.38,0.0,Positive,31.41998,0.0,128.19
1,1,Joke Junction,Episode 26,119.8,Comedy,66.95,Saturday,Afternoon,75.95,2.0,Negative,88.01241,0.016694,142.9
2,2,Study Sessions,Episode 16,73.9,Education,69.97,Tuesday,Evening,8.97,0.0,Negative,44.92531,0.0,78.94
3,3,Digital Digest,Episode 45,67.17,Technology,57.22,Monday,Morning,78.7,2.0,Positive,46.27824,0.029775,135.92
4,4,Mind & Body,Episode 86,110.51,Health,80.07,Monday,Afternoon,58.68,3.0,Neutral,75.61031,0.027147,138.75


In [137]:
df_train['Genre'].value_counts()

Genre
Sports        87606
Technology    86256
True Crime    85059
Lifestyle     82461
Comedy        81453
Business      80521
Health        71416
News          63385
Music         62743
Education     49100
Name: count, dtype: int64

In [138]:
df_train['Publication_Day'].value_counts()

Publication_Day
Sunday       115946
Monday       111963
Friday       108237
Wednesday    107886
Thursday     104360
Saturday     103505
Tuesday       98103
Name: count, dtype: int64

In [139]:
df_train['Publication_Time'].value_counts()

Publication_Time
Night        196849
Evening      195778
Afternoon    179460
Morning      177913
Name: count, dtype: int64

In [140]:
df_train['Episode_Sentiment'].value_counts()

Episode_Sentiment
Neutral     251291
Negative    250116
Positive    248593
Name: count, dtype: int64

In [141]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 750000 entries, 0 to 749999
Data columns (total 14 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           750000 non-null  int64  
 1   Podcast_Name                 750000 non-null  object 
 2   Episode_Title                750000 non-null  object 
 3   Episode_Length_minutes       750000 non-null  float64
 4   Genre                        750000 non-null  object 
 5   Host_Popularity_percentage   750000 non-null  float64
 6   Publication_Day              750000 non-null  object 
 7   Publication_Time             750000 non-null  object 
 8   Guest_Popularity_percentage  750000 non-null  float64
 9   Number_of_Ads                750000 non-null  float64
 10  Episode_Sentiment            750000 non-null  object 
 11  Listening_Time_minutes       750000 non-null  float64
 12  ad_per_minute                750000 non-null  float64
 13 

In [142]:
df_train

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,Listening_Time_minutes,ad_per_minute,H_G_Popularity
0,0,Mystery Matters,Episode 98,64.39,True Crime,74.81,Thursday,Night,53.38,0.0,Positive,31.41998,0.000000,128.19
1,1,Joke Junction,Episode 26,119.80,Comedy,66.95,Saturday,Afternoon,75.95,2.0,Negative,88.01241,0.016694,142.90
2,2,Study Sessions,Episode 16,73.90,Education,69.97,Tuesday,Evening,8.97,0.0,Negative,44.92531,0.000000,78.94
3,3,Digital Digest,Episode 45,67.17,Technology,57.22,Monday,Morning,78.70,2.0,Positive,46.27824,0.029775,135.92
4,4,Mind & Body,Episode 86,110.51,Health,80.07,Monday,Afternoon,58.68,3.0,Neutral,75.61031,0.027147,138.75
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
749995,749995,Learning Lab,Episode 25,75.66,Education,69.36,Saturday,Morning,53.70,0.0,Negative,56.87058,0.000000,123.06
749996,749996,Business Briefs,Episode 21,75.75,Business,35.21,Saturday,Night,52.18,2.0,Neutral,45.46242,0.026403,87.39
749997,749997,Lifestyle Lounge,Episode 51,30.98,Lifestyle,78.58,Thursday,Morning,84.89,0.0,Negative,15.26000,0.000000,163.47
749998,749998,Style Guide,Episode 47,108.98,Lifestyle,45.39,Thursday,Morning,93.27,0.0,Negative,100.72939,0.000000,138.66


## Test Data Prep

In [143]:
df_test = pd.read_csv('input/test.csv', sep=',')
df_test

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment
0,750000,Educational Nuggets,Episode 73,78.96,Education,38.11,Saturday,Evening,53.33,1.0,Neutral
1,750001,Sound Waves,Episode 23,27.87,Music,71.29,Sunday,Morning,,0.0,Neutral
2,750002,Joke Junction,Episode 11,69.10,Comedy,67.89,Friday,Evening,97.51,0.0,Positive
3,750003,Comedy Corner,Episode 73,115.39,Comedy,23.40,Sunday,Morning,51.75,2.0,Positive
4,750004,Life Lessons,Episode 50,72.32,Lifestyle,58.10,Wednesday,Morning,11.30,2.0,Neutral
...,...,...,...,...,...,...,...,...,...,...,...
249995,999995,Mind & Body,Episode 100,21.05,Health,65.77,Saturday,Evening,96.40,3.0,Negative
249996,999996,Joke Junction,Episode 85,85.50,Comedy,41.47,Saturday,Night,30.52,2.0,Negative
249997,999997,Joke Junction,Episode 63,12.11,Comedy,25.92,Thursday,Evening,73.69,1.0,Neutral
249998,999998,Market Masters,Episode 46,113.46,Business,43.47,Friday,Night,93.59,3.0,Positive


In [144]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 11 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           250000 non-null  int64  
 1   Podcast_Name                 250000 non-null  object 
 2   Episode_Title                250000 non-null  object 
 3   Episode_Length_minutes       221264 non-null  float64
 4   Genre                        250000 non-null  object 
 5   Host_Popularity_percentage   250000 non-null  float64
 6   Publication_Day              250000 non-null  object 
 7   Publication_Time             250000 non-null  object 
 8   Guest_Popularity_percentage  201168 non-null  float64
 9   Number_of_Ads                250000 non-null  float64
 10  Episode_Sentiment            250000 non-null  object 
dtypes: float64(4), int64(1), object(6)
memory usage: 21.0+ MB


In [145]:
# Replace zeros with NaN so that zero-length episodes are imputed too
df_test['Episode_Length_minutes'] = df_test['Episode_Length_minutes'].replace(0, np.nan)
# Mean episode length per podcast
mean_map_test = df_test.groupby('Podcast_Name')['Episode_Length_minutes'].mean().round(2)
# Fill NaN with the podcast's mean length
df_test['Episode_Length_minutes'] = (
    df_test['Episode_Length_minutes']
    .fillna(df_test['Podcast_Name'].map(mean_map_test)))

# Replace NaN with the median in the 'Number_of_Ads' column
# (assign the result back; fillna(inplace=True) on a selected column is chained assignment)
df_test['Number_of_Ads'] = df_test['Number_of_Ads'].fillna(df_test['Number_of_Ads'].median())

df_test['ad_per_minute'] = df_test['Number_of_Ads']/df_test['Episode_Length_minutes']

# Median guest popularity per podcast
guest_median_map_test = df_test.groupby('Podcast_Name')['Guest_Popularity_percentage'].median().round(2)
# Fill NaN with the podcast's median guest popularity
df_test['Guest_Popularity_percentage'] = (
    df_test['Guest_Popularity_percentage']
    .fillna(df_test['Podcast_Name'].map(guest_median_map_test)))

df_test['H_G_Popularity'] = df_test['Host_Popularity_percentage']+df_test['Guest_Popularity_percentage']

df_test

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,ad_per_minute,H_G_Popularity
0,750000,Educational Nuggets,Episode 73,78.96,Education,38.11,Saturday,Evening,53.33,1.0,Neutral,0.012665,91.44
1,750001,Sound Waves,Episode 23,27.87,Music,71.29,Sunday,Morning,53.28,0.0,Neutral,0.000000,124.57
2,750002,Joke Junction,Episode 11,69.10,Comedy,67.89,Friday,Evening,97.51,0.0,Positive,0.000000,165.40
3,750003,Comedy Corner,Episode 73,115.39,Comedy,23.40,Sunday,Morning,51.75,2.0,Positive,0.017333,75.15
4,750004,Life Lessons,Episode 50,72.32,Lifestyle,58.10,Wednesday,Morning,11.30,2.0,Neutral,0.027655,69.40
...,...,...,...,...,...,...,...,...,...,...,...,...,...
249995,999995,Mind & Body,Episode 100,21.05,Health,65.77,Saturday,Evening,96.40,3.0,Negative,0.142518,162.17
249996,999996,Joke Junction,Episode 85,85.50,Comedy,41.47,Saturday,Night,30.52,2.0,Negative,0.023392,71.99
249997,999997,Joke Junction,Episode 63,12.11,Comedy,25.92,Thursday,Evening,73.69,1.0,Neutral,0.082576,99.61
249998,999998,Market Masters,Episode 46,113.46,Business,43.47,Friday,Night,93.59,3.0,Positive,0.026441,137.06
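
The cell above recomputes the imputation statistics on the test set itself. An alternative, shown here only as a sketch, is to learn the per-podcast statistics from the training data and reuse them on the test set so that both frames are filled with the same values. The helper `fill_from_train` is hypothetical, and falling back to the overall training mean for podcasts that never appear in training is an assumption.

```python
import numpy as np
import pandas as pd


def fill_from_train(train: pd.DataFrame, test: pd.DataFrame,
                    column: str, key: str = "Podcast_Name") -> pd.Series:
    """Fill NaNs in test[column] with the per-key mean learned from train."""
    per_key_mean = train.groupby(key)[column].mean()
    overall_mean = train[column].mean()
    # Keys unseen in training map to NaN, so fall back to the overall training mean.
    fill_values = test[key].map(per_key_mean).fillna(overall_mean)
    return test[column].fillna(fill_values)


# Toy frames for illustration only; these values are not from the dataset.
train_toy = pd.DataFrame({
    "Podcast_Name": ["A", "A", "B"],
    "Episode_Length_minutes": [10.0, 30.0, 50.0],
})
test_toy = pd.DataFrame({
    "Podcast_Name": ["A", "B", "C"],
    "Episode_Length_minutes": [np.nan, 45.0, np.nan],
})
filled = fill_from_train(train_toy, test_toy, "Episode_Length_minutes")
print(filled.tolist())
```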


In [146]:
df_test.describe()

Unnamed: 0,id,Episode_Length_minutes,Host_Popularity_percentage,Guest_Popularity_percentage,Number_of_Ads,ad_per_minute,H_G_Popularity
count,250000.0,250000.0,250000.0,250000.0,250000.0,250000.0,250000.0
mean,874999.5,418.5513,59.716491,52.417905,1.355852,0.034187,112.134396
std,72168.927986,156975.0,22.880028,25.52655,4.274399,0.579268,34.617272
min,750000.0,2.47,2.49,0.0,0.0,0.0,7.65
25%,812499.75,39.37,39.25,34.55,0.0,0.0,87.23
50%,874999.5,64.43,59.9,53.32,1.0,0.018909,112.67
75%,937499.25,90.81,79.39,70.99,2.0,0.040363,137.04
max,999999.0,78486260.0,117.76,116.82,2063.0,288.531469,205.08
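
The summary above shows test values far outside the training range: a maximum episode length of 78,486,260 minutes and a maximum of 2,063 ads, against training maxima of 325.24 and 103.91. One option, not applied in this notebook, is to clip the test columns to the bounds observed in training before computing derived features. The helper name `clip_to_train_range` is hypothetical, and clipping to the training minimum and maximum is only one of several possible choices.

```python
import pandas as pd


def clip_to_train_range(train: pd.DataFrame, test: pd.DataFrame,
                        columns: list[str]) -> pd.DataFrame:
    """Return a copy of test with each listed column clipped to train's min/max."""
    clipped = test.copy()
    for col in columns:
        clipped[col] = clipped[col].clip(lower=train[col].min(), upper=train[col].max())
    return clipped


# Toy frames that reuse a few of the summary values shown above.
train_toy = pd.DataFrame({"Episode_Length_minutes": [1.24, 64.42, 325.24]})
test_toy = pd.DataFrame({"Episode_Length_minutes": [2.47, 64.43, 78486260.0]})
clipped_toy = clip_to_train_range(train_toy, test_toy, ["Episode_Length_minutes"])
print(clipped_toy)
```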


In [147]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 13 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           250000 non-null  int64  
 1   Podcast_Name                 250000 non-null  object 
 2   Episode_Title                250000 non-null  object 
 3   Episode_Length_minutes       250000 non-null  float64
 4   Genre                        250000 non-null  object 
 5   Host_Popularity_percentage   250000 non-null  float64
 6   Publication_Day              250000 non-null  object 
 7   Publication_Time             250000 non-null  object 
 8   Guest_Popularity_percentage  250000 non-null  float64
 9   Number_of_Ads                250000 non-null  float64
 10  Episode_Sentiment            250000 non-null  object 
 11  ad_per_minute                250000 non-null  float64
 12  H_G_Popularity               250000 non-null  float64
dtyp

# Pipeline

In [148]:
df_train

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,Listening_Time_minutes,ad_per_minute,H_G_Popularity
0,0,Mystery Matters,Episode 98,64.39,True Crime,74.81,Thursday,Night,53.38,0.0,Positive,31.41998,0.000000,128.19
1,1,Joke Junction,Episode 26,119.80,Comedy,66.95,Saturday,Afternoon,75.95,2.0,Negative,88.01241,0.016694,142.90
2,2,Study Sessions,Episode 16,73.90,Education,69.97,Tuesday,Evening,8.97,0.0,Negative,44.92531,0.000000,78.94
3,3,Digital Digest,Episode 45,67.17,Technology,57.22,Monday,Morning,78.70,2.0,Positive,46.27824,0.029775,135.92
4,4,Mind & Body,Episode 86,110.51,Health,80.07,Monday,Afternoon,58.68,3.0,Neutral,75.61031,0.027147,138.75
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
749995,749995,Learning Lab,Episode 25,75.66,Education,69.36,Saturday,Morning,53.70,0.0,Negative,56.87058,0.000000,123.06
749996,749996,Business Briefs,Episode 21,75.75,Business,35.21,Saturday,Night,52.18,2.0,Neutral,45.46242,0.026403,87.39
749997,749997,Lifestyle Lounge,Episode 51,30.98,Lifestyle,78.58,Thursday,Morning,84.89,0.0,Negative,15.26000,0.000000,163.47
749998,749998,Style Guide,Episode 47,108.98,Lifestyle,45.39,Thursday,Morning,93.27,0.0,Negative,100.72939,0.000000,138.66


## Train-test split

In [149]:
df_train_pipeline = df_train[['Podcast_Name', 'Episode_Length_minutes', 'ad_per_minute', 'H_G_Popularity','Genre', 'Publication_Day', 'Publication_Time', 'Episode_Sentiment']]
df_train_pipeline

Unnamed: 0,Podcast_Name,Episode_Length_minutes,ad_per_minute,H_G_Popularity,Genre,Publication_Day,Publication_Time,Episode_Sentiment
0,Mystery Matters,64.39,0.000000,128.19,True Crime,Thursday,Night,Positive
1,Joke Junction,119.80,0.016694,142.90,Comedy,Saturday,Afternoon,Negative
2,Study Sessions,73.90,0.000000,78.94,Education,Tuesday,Evening,Negative
3,Digital Digest,67.17,0.029775,135.92,Technology,Monday,Morning,Positive
4,Mind & Body,110.51,0.027147,138.75,Health,Monday,Afternoon,Neutral
...,...,...,...,...,...,...,...,...
749995,Learning Lab,75.66,0.000000,123.06,Education,Saturday,Morning,Negative
749996,Business Briefs,75.75,0.026403,87.39,Business,Saturday,Night,Neutral
749997,Lifestyle Lounge,30.98,0.000000,163.47,Lifestyle,Thursday,Morning,Negative
749998,Style Guide,108.98,0.000000,138.66,Lifestyle,Thursday,Morning,Negative


In [150]:
X_train, X_test, y_train, y_test = train_test_split(df_train_pipeline, df_train['Listening_Time_minutes'], test_size=0.2, random_state=35)


## Linear Regression (Ridge)

Best Public Score: 614,61970

In [151]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error, root_mean_squared_error   

In [152]:
df_train.drop('Listening_Time_minutes', axis=1)

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,ad_per_minute,H_G_Popularity
0,0,Mystery Matters,Episode 98,64.39,True Crime,74.81,Thursday,Night,53.38,0.0,Positive,0.000000,128.19
1,1,Joke Junction,Episode 26,119.80,Comedy,66.95,Saturday,Afternoon,75.95,2.0,Negative,0.016694,142.90
2,2,Study Sessions,Episode 16,73.90,Education,69.97,Tuesday,Evening,8.97,0.0,Negative,0.000000,78.94
3,3,Digital Digest,Episode 45,67.17,Technology,57.22,Monday,Morning,78.70,2.0,Positive,0.029775,135.92
4,4,Mind & Body,Episode 86,110.51,Health,80.07,Monday,Afternoon,58.68,3.0,Neutral,0.027147,138.75
...,...,...,...,...,...,...,...,...,...,...,...,...,...
749995,749995,Learning Lab,Episode 25,75.66,Education,69.36,Saturday,Morning,53.70,0.0,Negative,0.000000,123.06
749996,749996,Business Briefs,Episode 21,75.75,Business,35.21,Saturday,Night,52.18,2.0,Neutral,0.026403,87.39
749997,749997,Lifestyle Lounge,Episode 51,30.98,Lifestyle,78.58,Thursday,Morning,84.89,0.0,Negative,0.000000,163.47
749998,749998,Style Guide,Episode 47,108.98,Lifestyle,45.39,Thursday,Morning,93.27,0.0,Negative,0.000000,138.66


### Training

In [153]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error, make_scorer
import numpy as np


# Define numeric and categorical columns
numeric_columns = ['Episode_Length_minutes', 'ad_per_minute', 'H_G_Popularity']
categorical_columns = ['Podcast_Name', 'Genre', 'Publication_Day', 'Publication_Time', 'Episode_Sentiment']

# Preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_columns),
        ('cat', Pipeline([
            ('onehot', OneHotEncoder(handle_unknown='ignore')),
            ('scale', StandardScaler(with_mean=False))  # StandardScaler after OneHotEncoder
        ]), categorical_columns)
    ]
)

# Define the pipeline with Ridge regression
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', Ridge())
])

# Define the parameter grid for GridSearchCV
param_grid = {
    'regressor__alpha': [0.1, 1.0, 10.0],  # Regularization strength
    'regressor__solver': ['auto', 'svd', 'cholesky', 'lsqr']  # Solver options
}

# Define a custom scoring function (e.g., RMSE)
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

scoring = make_scorer(rmse, greater_is_better=False)

# Initialize GridSearchCV
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    scoring=scoring,
    cv=5,  # 5-fold cross-validation
    verbose=2,
    n_jobs=-1
)
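
scikit-learn also ships a built-in scorer string, `'neg_root_mean_squared_error'`, which should behave like the custom `make_scorer(rmse, greater_is_better=False)` defined above. A minimal sketch on synthetic data follows; the data and the small grid are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

search = GridSearchCV(
    Ridge(),
    {"alpha": [0.1, 1.0, 10.0]},
    scoring="neg_root_mean_squared_error",  # negated so that higher is better
    cv=5,
)
search.fit(X, y)

# Flip the sign back to report a positive RMSE.
best_rmse = -search.best_score_
print("Best RMSE:", best_rmse)
```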



In [154]:
# Fit the model
grid_search.fit(X_train, y_train)

# Print the best parameters and RMSE
print("Best parameters:", grid_search.best_params_)
print("Best RMSE:", -grid_search.best_score_)

# Use the best model to make predictions
best_model = grid_search.best_estimator_

Fitting 5 folds for each of 12 candidates, totalling 60 fits


30 fits failed out of a total of 60.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\thiag.NOTEAVELL\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\thiag.NOTEAVELL\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\thiag.NOTEAVELL\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\pipeline.py", line 475, in fit
    self._final

Best parameters: {'regressor__alpha': 10.0, 'regressor__solver': 'lsqr'}
Best RMSE: 13.521524884029512


### Test

In [155]:
y_pred = best_model.predict(X_test)

### Evaluation

In [156]:
# Evaluate model performance using MSE, RMSE, and R² score.
mse = mean_squared_error(y_test, y_pred)
# RMSE is the square root of MSE; named rmse_test so it does not shadow the rmse() scorer function
rmse_test = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse_test)
print("R² Score:", r2)

Mean Squared Error: 182.9884839977374
Root Mean Squared Error: 13.527323608080698
R² Score: 0.7514229401640069
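
As a reading aid for the three numbers above: RMSE is the square root of MSE, and R² equals one minus MSE divided by the population variance of the targets. A small sketch on made-up values shows the relationship; the arrays are illustrative and not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([30.0, 45.0, 60.0, 75.0])
y_hat = np.array([32.0, 44.0, 57.0, 78.0])

mse_toy = mean_squared_error(y_true, y_hat)
rmse_toy = float(np.sqrt(mse_toy))
# np.var defaults to the population variance (ddof=0), which is what R² uses.
r2_manual = 1.0 - mse_toy / float(np.var(y_true))

print(mse_toy, rmse_toy, r2_manual, r2_score(y_true, y_hat))
```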


## Submission

In [157]:
# Use the trained pipeline to make predictions on the test set.
y_final = best_model.predict(df_test)
print("Test set predictions:")
y_final

Test set predictions:


array([55.34856973, 17.47934927, 50.04615918, ...,  4.38806917,
       83.65186896, 55.25668003])

In [158]:
df_test['Listening_Time_minutes'] = y_final
df_test['Listening_Time_minutes'] = round(df_test['Listening_Time_minutes'], 3)
df_test

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,ad_per_minute,H_G_Popularity,Listening_Time_minutes
0,750000,Educational Nuggets,Episode 73,78.96,Education,38.11,Saturday,Evening,53.33,1.0,Neutral,0.012665,91.44,55.349
1,750001,Sound Waves,Episode 23,27.87,Music,71.29,Sunday,Morning,53.28,0.0,Neutral,0.000000,124.57,17.479
2,750002,Joke Junction,Episode 11,69.10,Comedy,67.89,Friday,Evening,97.51,0.0,Positive,0.000000,165.40,50.046
3,750003,Comedy Corner,Episode 73,115.39,Comedy,23.40,Sunday,Morning,51.75,2.0,Positive,0.017333,75.15,83.668
4,750004,Life Lessons,Episode 50,72.32,Lifestyle,58.10,Wednesday,Morning,11.30,2.0,Neutral,0.027655,69.40,50.359
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
249995,999995,Mind & Body,Episode 100,21.05,Health,65.77,Saturday,Evening,96.40,3.0,Negative,0.142518,162.17,11.639
249996,999996,Joke Junction,Episode 85,85.50,Comedy,41.47,Saturday,Night,30.52,2.0,Negative,0.023392,71.99,60.193
249997,999997,Joke Junction,Episode 63,12.11,Comedy,25.92,Thursday,Evening,73.69,1.0,Neutral,0.082576,99.61,4.388
249998,999998,Market Masters,Episode 46,113.46,Business,43.47,Friday,Night,93.59,3.0,Positive,0.026441,137.06,83.652


In [159]:
df_final_lr = df_test[['id', 'Listening_Time_minutes']]
df_final_lr

Unnamed: 0,id,Listening_Time_minutes
0,750000,55.349
1,750001,17.479
2,750002,50.046
3,750003,83.668
4,750004,50.359
...,...,...
249995,999995,11.639
249996,999996,60.193
249997,999997,4.388
249998,999998,83.652


In [160]:
from datetime import datetime

# Get the current date and time
now = datetime.now()

# Format the datetime object into the desired string format
timestamp_str = now.strftime('%Y%m%d_%H%M%S')

In [161]:
df_final_lr.to_csv(f'output/submission_file_lin_r_{timestamp_str}.csv', sep=',', index=False)
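The timestamp and file-name steps above are repeated for each model, and `to_csv` will typically fail if the `output/` folder does not exist yet. A minimal sketch of a reusable helper is shown below; `submission_path` and its parameters are hypothetical names, not part of the notebook, and the fixed date is an arbitrary example used only to make the result reproducible.

```python
from datetime import datetime
from pathlib import Path


def submission_path(prefix, out_dir="output", now=None):
    """Build a timestamped CSV path and make sure its directory exists.

    Hypothetical helper: the name and parameters are not part of the notebook.
    """
    now = now or datetime.now()
    directory = Path(out_dir)
    # Create the folder up front so a later to_csv call has somewhere to write.
    directory.mkdir(parents=True, exist_ok=True)
    return directory / f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}.csv"


# Arbitrary fixed timestamp for illustration; omit `now` to use the current time.
path = submission_path("submission_file_lin_r", now=datetime(2025, 4, 1, 9, 30, 0))
print(path.name)  # submission_file_lin_r_20250401_093000.csv
```

With such a helper, the export would read `df_final_lr.to_csv(submission_path('submission_file_lin_r'), index=False)`.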

# XGBoost/SGDRegressor Pipeline With GridSearchCV



In [162]:
df_train

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,Listening_Time_minutes,ad_per_minute,H_G_Popularity
0,0,Mystery Matters,Episode 98,64.39,True Crime,74.81,Thursday,Night,53.38,0.0,Positive,31.41998,0.000000,128.19
1,1,Joke Junction,Episode 26,119.80,Comedy,66.95,Saturday,Afternoon,75.95,2.0,Negative,88.01241,0.016694,142.90
2,2,Study Sessions,Episode 16,73.90,Education,69.97,Tuesday,Evening,8.97,0.0,Negative,44.92531,0.000000,78.94
3,3,Digital Digest,Episode 45,67.17,Technology,57.22,Monday,Morning,78.70,2.0,Positive,46.27824,0.029775,135.92
4,4,Mind & Body,Episode 86,110.51,Health,80.07,Monday,Afternoon,58.68,3.0,Neutral,75.61031,0.027147,138.75
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
749995,749995,Learning Lab,Episode 25,75.66,Education,69.36,Saturday,Morning,53.70,0.0,Negative,56.87058,0.000000,123.06
749996,749996,Business Briefs,Episode 21,75.75,Business,35.21,Saturday,Night,52.18,2.0,Neutral,45.46242,0.026403,87.39
749997,749997,Lifestyle Lounge,Episode 51,30.98,Lifestyle,78.58,Thursday,Morning,84.89,0.0,Negative,15.26000,0.000000,163.47
749998,749998,Style Guide,Episode 47,108.98,Lifestyle,45.39,Thursday,Morning,93.27,0.0,Negative,100.72939,0.000000,138.66


In [163]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 750000 entries, 0 to 749999
Data columns (total 14 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           750000 non-null  int64  
 1   Podcast_Name                 750000 non-null  object 
 2   Episode_Title                750000 non-null  object 
 3   Episode_Length_minutes       750000 non-null  float64
 4   Genre                        750000 non-null  object 
 5   Host_Popularity_percentage   750000 non-null  float64
 6   Publication_Day              750000 non-null  object 
 7   Publication_Time             750000 non-null  object 
 8   Guest_Popularity_percentage  750000 non-null  float64
 9   Number_of_Ads                750000 non-null  float64
 10  Episode_Sentiment            750000 non-null  object 
 11  Listening_Time_minutes       750000 non-null  float64
 12  ad_per_minute                750000 non-null  float64
 13 

## Training

In [164]:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from xgboost import XGBRegressor

def regression_pipeline(numerical_cols, categorical_cols):
    # Preprocessing for numerical and categorical features
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), numerical_cols),
            ('cat', Pipeline([
                ('onehot', OneHotEncoder(handle_unknown='ignore')),
                # Scale the one-hot columns; with_mean=False keeps the matrix sparse
                ('scale', StandardScaler(with_mean=False))
            ]), categorical_cols)  # use the function argument, not a global variable
        ],
        sparse_threshold=1.0
    )

    # Create pipeline with placeholder estimator
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('estimator', XGBRegressor(tree_method='hist', n_jobs=1))
    ])

    # Custom metrics definitions
    def rmse(y_true, y_pred):
        return np.sqrt(mean_squared_error(y_true, y_pred))

    # Scoring dictionary
    scoring = {
        'rmse': make_scorer(rmse, greater_is_better=False),
        'mse': make_scorer(mean_squared_error, greater_is_better=False),
        'r2': make_scorer(r2_score)
    }

    # Parameter grid
    param_grid = [
        {   # XGBoost parameters
            'estimator': [XGBRegressor(tree_method='hist', n_jobs=1)],
            'estimator__n_estimators': [100, 200],
            'estimator__max_depth': [3, 6],
            'estimator__learning_rate': [0.1, 0.03, 0.06],
            'estimator__subsample': [0.8, 1.0],
            'estimator__colsample_bytree': [0.8, 1.0]
        },
        {   # SGDRegressor parameters
            'estimator': [SGDRegressor()],
            'estimator__penalty': ['l1', 'l2', 'elasticnet'],
            'estimator__alpha': [0.0001, 0.001],
            'estimator__l1_ratio': [0.15, 0.3],  # only used when penalty='elasticnet'
            'estimator__max_iter': [1000],
            'estimator__tol': [1e-3],
            'estimator__early_stopping': [True]
        }
    ]

    # Configure grid search
    grid_search = GridSearchCV(
        pipeline,
        param_grid,
        scoring=scoring,
        refit='rmse',  # Refit best model based on RMSE
        cv=4,
        verbose=2,
        n_jobs=-2,
        error_score='raise'
    )

    return grid_search

# Usage example:
# numerical = ['col1', 'col2']
# categorical = ['cat_col1', 'cat_col2']
# model = regression_pipeline(numerical, categorical)
# model.fit(X_train, y_train)

# To view metrics after fitting:
# print(f"Best RMSE: {-model.best_score_:.4f}")  # Note the negative sign
# cv_results = pd.DataFrame(model.cv_results_)
# print(cv_results[['params', 'mean_test_rmse', 'mean_test_mse', 'mean_test_r2']])

In [165]:
model = regression_pipeline(numeric_columns, categorical_columns)
model.fit(X_train, y_train)

Fitting 4 folds for each of 60 candidates, totalling 240 fits
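The "60 candidates, 240 fits" figure follows directly from the two sub-grids in `param_grid`: each sub-grid contributes the Cartesian product of its value lists, and every candidate is fitted once per fold. The sketch below reproduces that count with the standard library only; the dictionaries copy the values from the grid above, and `count_candidates` is an illustrative helper name.

```python
from itertools import product

# Hyperparameter values copied from param_grid above (the 'estimator' entries
# are omitted, since each sub-grid lists exactly one estimator).
xgb_grid = {
    "n_estimators": [100, 200],
    "max_depth": [3, 6],
    "learning_rate": [0.1, 0.03, 0.06],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
}
sgd_grid = {
    "penalty": ["l1", "l2", "elasticnet"],
    "alpha": [0.0001, 0.001],
    "l1_ratio": [0.15, 0.3],
    "max_iter": [1000],
    "tol": [1e-3],
    "early_stopping": [True],
}


def count_candidates(grid):
    # One candidate per element of the Cartesian product of all value lists.
    return sum(1 for _ in product(*grid.values()))


n_xgb = count_candidates(xgb_grid)
n_sgd = count_candidates(sgd_grid)
cv_folds = 4
print(n_xgb, n_sgd, n_xgb + n_sgd, (n_xgb + n_sgd) * cv_folds)  # 48 12 60 240
```

Because `l1_ratio` only affects the `elasticnet` penalty, the `l1` and `l2` candidates are fitted twice with effectively identical settings; restricting `l1_ratio` to a separate elastic-net sub-grid would likely trim the SGD part from 12 to 8 candidates.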


In [166]:
# To view metrics after fitting:
print(f"Best RMSE: {-model.best_score_:.4f}")  # Note the negative sign
cv_results = pd.DataFrame(model.cv_results_)
cv_results[['params', 'mean_test_rmse', 'mean_test_mse', 'mean_test_r2']]

Best RMSE: 13.1898


Unnamed: 0,params,mean_test_rmse,mean_test_mse,mean_test_r2
0,"{'estimator': XGBRegressor(base_score=None, bo...",-13.257705,-175.768335,0.761369
1,"{'estimator': XGBRegressor(base_score=None, bo...",-13.257491,-175.76264,0.761377
2,"{'estimator': XGBRegressor(base_score=None, bo...",-13.23654,-175.207539,0.76213
3,"{'estimator': XGBRegressor(base_score=None, bo...",-13.238364,-175.255891,0.762065
4,"{'estimator': XGBRegressor(base_score=None, bo...",-13.206699,-174.418429,0.763202
5,"{'estimator': XGBRegressor(base_score=None, bo...",-13.2076,-174.442438,0.763169
6,"{'estimator': XGBRegressor(base_score=None, bo...",-13.191558,-174.018785,0.763744
7,"{'estimator': XGBRegressor(base_score=None, bo...",-13.195953,-174.134903,0.763587
8,"{'estimator': XGBRegressor(base_score=None, bo...",-13.52366,-182.890891,0.7517
9,"{'estimator': XGBRegressor(base_score=None, bo...",-13.524659,-182.917999,0.751663


In [167]:
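# Note: scores are negated RMSE (greater_is_better=False), so .min() selects the
# WORST candidate (rank 60 in the output below). Use .max() or model.best_index_
# to inspect the best one.
cv_results[cv_results['mean_test_rmse'] == cv_results['mean_test_rmse'].min()]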

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_estimator,param_estimator__colsample_bytree,param_estimator__learning_rate,param_estimator__max_depth,param_estimator__n_estimators,param_estimator__subsample,...,mean_test_mse,std_test_mse,rank_test_mse,split0_test_r2,split1_test_r2,split2_test_r2,split3_test_r2,mean_test_r2,std_test_r2,rank_test_r2
50,4.538713,0.188105,0.673355,0.055584,SGDRegressor(),,,,,,...,-185.084806,1.191615,60,0.750504,0.74783,0.749847,0.746703,0.748721,0.001526,60
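The row returned by the filter in cell 167 has rank 60, which is the worst candidate: because the scorer was built with `greater_is_better=False`, `mean_test_rmse` holds negated RMSE values, and the best candidate is the one with the largest score. The sketch below illustrates this with the ten `mean_test_rmse` values displayed earlier; it covers only those visible rows, not the full 60-row table.

```python
# mean_test_rmse values copied from the ten rows displayed above (index -> score).
# The full cv_results table has 60 rows; only these are visible in the output.
mean_test_rmse = {
    0: -13.257705, 1: -13.257491, 2: -13.23654, 3: -13.238364, 4: -13.206699,
    5: -13.2076, 6: -13.191558, 7: -13.195953, 8: -13.52366, 9: -13.524659,
}

# Scores are negated RMSE, so the largest score is the smallest error.
best_idx = max(mean_test_rmse, key=mean_test_rmse.get)
worst_idx = min(mean_test_rmse, key=mean_test_rmse.get)

print(best_idx, -mean_test_rmse[best_idx])    # 6 13.191558
print(worst_idx, -mean_test_rmse[worst_idx])  # 9 13.524659
```

On the fitted search object, `model.best_index_` and `model.best_params_` give the winning row directly without any manual filtering.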


## Test

In [168]:
y_pred = model.predict(X_test)

## Evaluation

In [169]:
# Evaluate model performance using MSE, RMSE, and R² Score.
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("R² Score:", r2)

Mean Squared Error: 173.93726079434919
Root Mean Squared Error: 13.188527620411202
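To put this result in context, the hold-out RMSE can be compared with the one printed for the earlier pipeline. The sketch below only redoes the arithmetic on the two printed values and assumes both evaluations used the same `X_test`/`y_test` split; the variable names are illustrative.

```python
# Hold-out RMSE values copied from the two evaluation outputs in this notebook.
first_pipeline_rmse = 13.527323608080698  # earlier pipeline (cell 156)
grid_search_rmse = 13.188527620411202     # XGBoost/SGD grid search (cell 169)

# Absolute and relative RMSE reduction (illustrative names).
rmse_drop = first_pipeline_rmse - grid_search_rmse
relative_drop = rmse_drop / first_pipeline_rmse
print(f"RMSE drop: {rmse_drop:.4f} minutes ({relative_drop:.2%})")
```

The improvement is modest, roughly a third of a minute per prediction, which matches the small difference in R² between the two runs.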
R² Score: 0.7637183939688779


## Submission

In [170]:
y_final = model.predict(df_test)

In [171]:
df_test['Listening_Time_minutes'] = y_final
df_test['Listening_Time_minutes'] = df_test['Listening_Time_minutes'].round(3)
df_test

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,ad_per_minute,H_G_Popularity,Listening_Time_minutes
0,750000,Educational Nuggets,Episode 73,78.96,Education,38.11,Saturday,Evening,53.33,1.0,Neutral,0.012665,91.44,55.180000
1,750001,Sound Waves,Episode 23,27.87,Music,71.29,Sunday,Morning,53.28,0.0,Neutral,0.000000,124.57,18.052000
2,750002,Joke Junction,Episode 11,69.10,Comedy,67.89,Friday,Evening,97.51,0.0,Positive,0.000000,165.40,51.519001
3,750003,Comedy Corner,Episode 73,115.39,Comedy,23.40,Sunday,Morning,51.75,2.0,Positive,0.017333,75.15,80.166000
4,750004,Life Lessons,Episode 50,72.32,Lifestyle,58.10,Wednesday,Morning,11.30,2.0,Neutral,0.027655,69.40,49.195000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
249995,999995,Mind & Body,Episode 100,21.05,Health,65.77,Saturday,Evening,96.40,3.0,Negative,0.142518,162.17,11.500000
249996,999996,Joke Junction,Episode 85,85.50,Comedy,41.47,Saturday,Night,30.52,2.0,Negative,0.023392,71.99,57.226002
249997,999997,Joke Junction,Episode 63,12.11,Comedy,25.92,Thursday,Evening,73.69,1.0,Neutral,0.082576,99.61,6.551000
249998,999998,Market Masters,Episode 46,113.46,Business,43.47,Friday,Night,93.59,3.0,Positive,0.026441,137.06,79.196999


In [172]:
df_final_xgb = df_test[['id', 'Listening_Time_minutes']]
df_final_xgb

Unnamed: 0,id,Listening_Time_minutes
0,750000,55.180000
1,750001,18.052000
2,750002,51.519001
3,750003,80.166000
4,750004,49.195000
...,...,...
249995,999995,11.500000
249996,999996,57.226002
249997,999997,6.551000
249998,999998,79.196999


In [173]:
from datetime import datetime

# Get the current date and time
now = datetime.now()

# Format the datetime object into the desired string format
timestamp_str = now.strftime('%Y%m%d_%H%M%S')

In [174]:
df_final_xgb.to_csv(f'output/submission_file_xgb_{timestamp_str}.csv', sep=',', index=False)