# Predict Podcast Listening Time

Podcast_Name
Type: string
Values: Names of popular podcasts.

Episode_Title
Type: string
Values: Titles of the episodes.

Episode_Length
Type: float (minutes)
Values: Length of the episode in minutes. Example: 5.0, 10.0, 30.0, 45.0, 60.0, 90.0.

Genre
Type: string
Values: "Technology", "Education", "Comedy", "Health", "True Crime", "Business", "Sports", "Lifestyle", "News", "Music".

Host_Popularity
Type: float (scale 0-100)
Values: A score indicating the popularity of the host. Example: 50.0, 75.0, 90.0.

Publication_Day
Type: string
Values: Day of the week the episode was published. Example: "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday".

Publication_Time
Type: string
Values: "Morning", "Afternoon", "Evening", "Night".

Guest_Popularity
Type: float (scale 0-100)
Values: A score indicating the popularity of the guest (if any). Example: 20.0, 50.0, 85.0.

Number_of_Ads
Type: int
Values: Number of advertisements within the episode. Example: 0, 1, 2, 3.

Episode_Sentiment
Type: string
Values: Sentiment of the episode's content. Example: "Positive", "Neutral", "Negative".

Listening_Time
Type: float (minutes)
Values: The actual average listening duration (target variable).

In [33]:
import pandas as pd
import numpy as np
import itertools
import polars as pl
import pyarrow
import matplotlib.pyplot  as plt
import seaborn as sns
import os
from pathlib import Path
import re
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix




In [34]:
df_train= pd.read_csv('input/train.csv', sep=',')
df_train

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,Listening_Time_minutes
0,0,Mystery Matters,Episode 98,,True Crime,74.81,Thursday,Night,,0.0,Positive,31.41998
1,1,Joke Junction,Episode 26,119.80,Comedy,66.95,Saturday,Afternoon,75.95,2.0,Negative,88.01241
2,2,Study Sessions,Episode 16,73.90,Education,69.97,Tuesday,Evening,8.97,0.0,Negative,44.92531
3,3,Digital Digest,Episode 45,67.17,Technology,57.22,Monday,Morning,78.70,2.0,Positive,46.27824
4,4,Mind & Body,Episode 86,110.51,Health,80.07,Monday,Afternoon,58.68,3.0,Neutral,75.61031
...,...,...,...,...,...,...,...,...,...,...,...,...
749995,749995,Learning Lab,Episode 25,75.66,Education,69.36,Saturday,Morning,,0.0,Negative,56.87058
749996,749996,Business Briefs,Episode 21,75.75,Business,35.21,Saturday,Night,,2.0,Neutral,45.46242
749997,749997,Lifestyle Lounge,Episode 51,30.98,Lifestyle,78.58,Thursday,Morning,84.89,0.0,Negative,15.26000
749998,749998,Style Guide,Episode 47,108.98,Lifestyle,45.39,Thursday,Morning,93.27,0.0,Negative,100.72939


In [35]:
df_train[df_train['Podcast_Name']=='Mystery Matters']

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,Listening_Time_minutes
0,0,Mystery Matters,Episode 98,,True Crime,74.81,Thursday,Night,,0.0,Positive,31.41998
67,67,Mystery Matters,Episode 13,57.93,True Crime,72.48,Saturday,Afternoon,90.00,1.0,Neutral,30.89510
95,95,Mystery Matters,Episode 40,94.45,True Crime,53.37,Friday,Afternoon,36.36,0.0,Positive,66.59645
142,142,Mystery Matters,Episode 50,23.74,True Crime,86.73,Friday,Afternoon,,2.0,Neutral,22.10401
155,155,Mystery Matters,Episode 59,108.34,True Crime,64.26,Monday,Afternoon,8.37,0.0,Neutral,76.97396
...,...,...,...,...,...,...,...,...,...,...,...,...
749801,749801,Mystery Matters,Episode 66,92.54,True Crime,31.97,Tuesday,Morning,70.61,0.0,Negative,76.19639
749818,749818,Mystery Matters,Episode 39,33.28,True Crime,91.90,Saturday,Night,,0.0,Positive,24.44947
749822,749822,Mystery Matters,Episode 20,32.41,True Crime,90.45,Friday,Morning,61.55,0.0,Positive,23.78321
749868,749868,Mystery Matters,Episode 80,,True Crime,33.15,Thursday,Night,27.19,0.0,Positive,89.95920


In [36]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 750000 entries, 0 to 749999
Data columns (total 12 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           750000 non-null  int64  
 1   Podcast_Name                 750000 non-null  object 
 2   Episode_Title                750000 non-null  object 
 3   Episode_Length_minutes       662907 non-null  float64
 4   Genre                        750000 non-null  object 
 5   Host_Popularity_percentage   750000 non-null  float64
 6   Publication_Day              750000 non-null  object 
 7   Publication_Time             750000 non-null  object 
 8   Guest_Popularity_percentage  603970 non-null  float64
 9   Number_of_Ads                749999 non-null  float64
 10  Episode_Sentiment            750000 non-null  object 
 11  Listening_Time_minutes       750000 non-null  float64
dtypes: float64(5), int64(1), object(6)
memory usage: 68.7+ MB


In [37]:
df_train.describe()

Unnamed: 0,id,Episode_Length_minutes,Host_Popularity_percentage,Guest_Popularity_percentage,Number_of_Ads,Listening_Time_minutes
count,750000.0,662907.0,750000.0,603970.0,749999.0,750000.0
mean,374999.5,64.504738,59.859901,52.236449,1.348855,45.437406
std,216506.495284,32.969603,22.873098,28.451241,1.15113,27.138306
min,0.0,0.0,1.3,0.0,0.0,0.0
25%,187499.75,35.73,39.41,28.38,0.0,23.17835
50%,374999.5,63.84,60.05,53.58,1.0,43.37946
75%,562499.25,94.07,79.53,76.6,2.0,64.81158
max,749999.0,325.24,119.46,119.91,103.91,119.97


## To-dos

- Fill the null values in Episode_Length_minutes with the mean episode length of the same podcast
- Create an ad_per_minute column
- Create H_G_Popularity by summing the two popularity columns
- Encode the string columns
- Create a weekend column
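
The weekend column from the list above is not built anywhere else in this notebook. A minimal sketch of how it could be derived from `Publication_Day`, using a toy frame in place of `df_train`; the column name `is_weekend` is an assumption, not something defined by the dataset:

```python
import pandas as pd

# Toy frame standing in for df_train; only Publication_Day is needed here.
demo = pd.DataFrame(
    {"Publication_Day": ["Monday", "Saturday", "Sunday", "Thursday"]}
)

# 1 when the episode was published on Saturday or Sunday, otherwise 0.
# "is_weekend" is a hypothetical column name chosen for this sketch.
demo["is_weekend"] = (
    demo["Publication_Day"].isin(["Saturday", "Sunday"]).astype(int)
)
print(demo)
```

If this feature were used, it would need to be added to both the training and the test frame before fitting.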

In [38]:
df_train['Podcast_Name'].value_counts()

Podcast_Name
Tech Talks             22847
Sports Weekly          20053
Funny Folks            19635
Tech Trends            19549
Fitness First          19488
Business Insights      19480
Style Guide            19364
Game Day               19272
Melody Mix             18889
Criminal Minds         17735
Finance Focus          17628
Detective Diaries      17452
Crime Chronicles       17374
Athlete's Arena        17327
Fashion Forward        17280
Tune Time              17254
Business Briefs        17012
Lifestyle Lounge       16661
True Crime Stories     16373
Sports Central         16191
Digital Digest         16171
Humor Hub              16144
Mystery Matters        16002
Comedy Corner          15927
Joke Junction          15074
Wellness Wave          15009
Sport Spot             14778
Gadget Geek            14770
Home & Living          14686
Laugh Line             14673
Life Lessons           14464
World Watch            14043
Sound Waves            13928
Global News            13649
M

## Fill the null values in Episode_Length_minutes with the per-podcast mean

In [39]:
# Replace zeros with NaN so they are imputed as well
df_train['Episode_Length_minutes'] = df_train['Episode_Length_minutes'].replace(0, np.nan)
# Get mean values per podcast
mean_map = df_train.groupby('Podcast_Name')['Episode_Length_minutes'].mean().round(2)
# Map the means to fill NaN
df_train['Episode_Length_minutes'] = (
    df_train['Episode_Length_minutes']
    .fillna(df_train['Podcast_Name'].map(mean_map)))

In [40]:
mean_map

Podcast_Name
Athlete's Arena        65.84
Brain Boost            63.46
Business Briefs        66.67
Business Insights      62.85
Comedy Corner          62.63
Crime Chronicles       66.10
Criminal Minds         61.45
Current Affairs        62.01
Daily Digest           65.29
Detective Diaries      65.83
Digital Digest         63.42
Educational Nuggets    65.07
Fashion Forward        64.63
Finance Focus          62.05
Fitness First          65.38
Funny Folks            63.73
Gadget Geek            65.03
Game Day               62.71
Global News            65.78
Health Hour            64.42
Healthy Living         64.54
Home & Living          66.53
Humor Hub              65.42
Innovators             64.53
Joke Junction          61.35
Laugh Line             63.03
Learning Lab           65.28
Life Lessons           64.58
Lifestyle Lounge       64.31
Market Masters         65.21
Melody Mix             67.91
Mind & Body            66.28
Money Matters          66.90
Music Matters          65.19
M

In [41]:
df_train.describe()

Unnamed: 0,id,Episode_Length_minutes,Host_Popularity_percentage,Guest_Popularity_percentage,Number_of_Ads,Listening_Time_minutes
count,750000.0,750000.0,750000.0,603970.0,749999.0,750000.0
mean,374999.5,64.502542,59.859901,52.236449,1.348855,45.437406
std,216506.495284,31.001355,22.873098,28.451241,1.15113,27.138306
min,0.0,1.24,1.3,0.0,0.0,0.0
25%,187499.75,39.42,39.41,28.38,0.0,23.17835
50%,374999.5,64.42,60.05,53.58,1.0,43.37946
75%,562499.25,90.31,79.53,76.6,2.0,64.81158
max,749999.0,325.24,119.46,119.91,103.91,119.97


In [42]:
df_train[df_train['Episode_Length_minutes']==0]

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,Listening_Time_minutes


In [43]:
df_train[df_train['Podcast_Name']=='Fashion Forward']

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,Listening_Time_minutes
160,160,Fashion Forward,Episode 15,6.91,Lifestyle,47.16,Monday,Evening,15.93,1.0,Neutral,6.90383
219,219,Fashion Forward,Episode 42,34.82,Lifestyle,88.68,Sunday,Evening,14.63,0.0,Positive,16.86000
233,233,Fashion Forward,Episode 55,92.80,Lifestyle,49.70,Tuesday,Night,86.59,1.0,Negative,83.69482
247,247,Fashion Forward,Episode 71,64.63,Lifestyle,70.14,Monday,Morning,,2.0,Negative,9.01001
274,274,Fashion Forward,Episode 44,56.26,Lifestyle,74.26,Sunday,Night,16.92,1.0,Neutral,31.52727
...,...,...,...,...,...,...,...,...,...,...,...,...
749824,749824,Fashion Forward,Episode 78,23.84,Lifestyle,49.89,Tuesday,Night,50.66,3.0,Positive,12.13000
749857,749857,Fashion Forward,Episode 3,67.75,Lifestyle,77.61,Saturday,Night,15.88,2.0,Neutral,40.69137
749863,749863,Fashion Forward,Episode 27,70.43,Lifestyle,37.63,Sunday,Afternoon,10.09,3.0,Neutral,46.46165
749955,749955,Fashion Forward,Episode 42,39.08,Lifestyle,91.53,Tuesday,Night,,0.0,Negative,21.99000


## Create ad_per_minute column

In [44]:
df_train.describe()

Unnamed: 0,id,Episode_Length_minutes,Host_Popularity_percentage,Guest_Popularity_percentage,Number_of_Ads,Listening_Time_minutes
count,750000.0,750000.0,750000.0,603970.0,749999.0,750000.0
mean,374999.5,64.502542,59.859901,52.236449,1.348855,45.437406
std,216506.495284,31.001355,22.873098,28.451241,1.15113,27.138306
min,0.0,1.24,1.3,0.0,0.0,0.0
25%,187499.75,39.42,39.41,28.38,0.0,23.17835
50%,374999.5,64.42,60.05,53.58,1.0,43.37946
75%,562499.25,90.31,79.53,76.6,2.0,64.81158
max,749999.0,325.24,119.46,119.91,103.91,119.97


In [58]:
df_train[pd.isna(df_train['Number_of_Ads'])]

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,Listening_Time_minutes,ad_per_minute,H_G_Popularity
247170,247170,Game Day,Episode 33,35.66,Sports,27.35,Friday,Evening,49.87,,Negative,23.94516,,77.22


In [59]:
# Replace NaN with 0 in the 'Number_of_Ads' column
df_train['Number_of_Ads'] = df_train['Number_of_Ads'].fillna(0)

In [60]:
df_train['ad_per_minute'] = df_train['Number_of_Ads']/df_train['Episode_Length_minutes']

## Create H_G_Popularity by summing the two popularity columns

In [48]:
# Replace NaN with 0 in the 'Guest_Popularity_percentage' column
df_train['Guest_Popularity_percentage'] = df_train['Guest_Popularity_percentage'].fillna(0)

In [49]:
df_train['H_G_Popularity'] = df_train['Host_Popularity_percentage']+df_train['Guest_Popularity_percentage']
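
Filling a missing guest score with 0 makes "no guest" indistinguishable from "guest with zero popularity". A small sketch of one way to keep that information, assuming a missing `Guest_Popularity_percentage` means the episode had no guest (the data description only says "if any"); the `has_guest` column is hypothetical:

```python
import numpy as np
import pandas as pd

# Toy rows standing in for df_train.
demo = pd.DataFrame(
    {
        "Host_Popularity_percentage": [74.81, 66.95],
        "Guest_Popularity_percentage": [np.nan, 75.95],
    }
)

# Record whether a guest score was present BEFORE filling the gaps.
# "has_guest" is a hypothetical helper column for this sketch.
demo["has_guest"] = demo["Guest_Popularity_percentage"].notna().astype(int)

# Same fill and sum as in the cells above.
demo["Guest_Popularity_percentage"] = demo["Guest_Popularity_percentage"].fillna(0)
demo["H_G_Popularity"] = (
    demo["Host_Popularity_percentage"] + demo["Guest_Popularity_percentage"]
)
print(demo)
```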

## Create encoding for string columns

In [51]:
df_train.head(5)

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,Listening_Time_minutes,ad_per_minute,H_G_Popularity
0,0,Mystery Matters,Episode 98,64.39,True Crime,74.81,Thursday,Night,0.0,0.0,Positive,31.41998,0.0,74.81
1,1,Joke Junction,Episode 26,119.8,Comedy,66.95,Saturday,Afternoon,75.95,2.0,Negative,88.01241,0.016694,142.9
2,2,Study Sessions,Episode 16,73.9,Education,69.97,Tuesday,Evening,8.97,0.0,Negative,44.92531,0.0,78.94
3,3,Digital Digest,Episode 45,67.17,Technology,57.22,Monday,Morning,78.7,2.0,Positive,46.27824,0.029775,135.92
4,4,Mind & Body,Episode 86,110.51,Health,80.07,Monday,Afternoon,58.68,3.0,Neutral,75.61031,0.027147,138.75


In [52]:
df_train['Genre'].value_counts()

Genre
Sports        87606
Technology    86256
True Crime    85059
Lifestyle     82461
Comedy        81453
Business      80521
Health        71416
News          63385
Music         62743
Education     49100
Name: count, dtype: int64

In [53]:
df_train['Publication_Day'].value_counts()

Publication_Day
Sunday       115946
Monday       111963
Friday       108237
Wednesday    107886
Thursday     104360
Saturday     103505
Tuesday       98103
Name: count, dtype: int64

In [54]:
df_train['Publication_Time'].value_counts()

Publication_Time
Night        196849
Evening      195778
Afternoon    179460
Morning      177913
Name: count, dtype: int64

In [55]:
df_train['Episode_Sentiment'].value_counts()

Episode_Sentiment
Neutral     251291
Negative    250116
Positive    248593
Name: count, dtype: int64

In [61]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 750000 entries, 0 to 749999
Data columns (total 14 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           750000 non-null  int64  
 1   Podcast_Name                 750000 non-null  object 
 2   Episode_Title                750000 non-null  object 
 3   Episode_Length_minutes       750000 non-null  float64
 4   Genre                        750000 non-null  object 
 5   Host_Popularity_percentage   750000 non-null  float64
 6   Publication_Day              750000 non-null  object 
 7   Publication_Time             750000 non-null  object 
 8   Guest_Popularity_percentage  750000 non-null  float64
 9   Number_of_Ads                750000 non-null  float64
 10  Episode_Sentiment            750000 non-null  object 
 11  Listening_Time_minutes       750000 non-null  float64
 12  ad_per_minute                750000 non-null  float64
 13 

In [62]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def label_encode_columns(df, columns_to_encode, suffix='_encoded'):
    """
    Apply label encoding to specified columns and create new encoded columns with a suffix.
    
    Parameters:
        df (pd.DataFrame): Input DataFrame
        columns_to_encode (list): List of column names to encode
        suffix (str): Suffix to append to new encoded columns (default: '_encoded')
    
    Returns:
        pd.DataFrame: DataFrame with new encoded columns
    """
    df = df.copy()  # Avoid modifying the original DataFrame
    
    for col in columns_to_encode:
        # Create new column name
        new_col_name = f"{col}{suffix}"
        
        # Initialize LabelEncoder
        le = LabelEncoder()
        
        # Fit and transform the column
        df[new_col_name] = le.fit_transform(df[col].astype(str))  # Handle mixed/non-str types
        
    return df
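
The label-encoding helper above fits a fresh encoder on whatever frame it receives, so calling it separately on train and test could assign different integers to the same category. A hedged sketch of an alternative that fits once on the training data and reuses the mapping, here with scikit-learn's `OrdinalEncoder` on toy data; mapping unseen categories to `-1` is a choice made for this example:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames standing in for df_train and df_test.
train = pd.DataFrame({"Genre": ["Comedy", "News", "Sports"]})
test = pd.DataFrame({"Genre": ["Sports", "Comedy", "Music"]})

# Categories not seen during fit are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on the training data only, then apply the same mapping to the test data.
train["Genre_encoded"] = encoder.fit_transform(train[["Genre"]]).ravel().astype(int)
test["Genre_encoded"] = encoder.transform(test[["Genre"]]).ravel().astype(int)

print(train)
print(test)
```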

In [None]:
# Imports needed by this cell (the notebook only imports them further down)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Names of the columns to be processed
numeric_columns = ['Episode_Length_minutes', 'ad_per_minute', 'H_G_Popularity']
categorical_columns = ['Genre', 'Publication_Day', 'Publication_Time', 'Episode_Sentiment']

# Features and target taken from the training DataFrame
X = df_train[numeric_columns + categorical_columns]
target = df_train['Listening_Time_minutes']

# Create a ColumnTransformer that:
# - Applies StandardScaler to the numeric columns.
# - Applies OneHotEncoder to the categorical columns.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_columns),
        ('cat', OneHotEncoder(), categorical_columns)
    ]
)

# Build a pipeline that first preprocesses the data and then applies linear regression.
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

# Train the pipeline with the features and the target variable.
pipeline.fit(X, target)

# To view the transformed data, reuse the preprocessor fitted by the pipeline.
transformed_data = preprocessor.transform(X)
print("Transformed data:")
print(transformed_data)

In [65]:
df_train = label_encode_columns(df_train, columns_to_encode=['Podcast_Name', 'Genre', 'Publication_Day', 'Episode_Sentiment'], suffix='_encoded')

In [66]:
df_train

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,Listening_Time_minutes,ad_per_minute,H_G_Popularity,Genre_encoded,Publication_Day_encoded,Episode_Sentiment_encoded,Podcast_Name_encoded
0,0,Mystery Matters,Episode 98,64.39,True Crime,74.81,Thursday,Night,0.00,0.0,Positive,31.41998,0.000000,74.81,9,4,2,34
1,1,Joke Junction,Episode 26,119.80,Comedy,66.95,Saturday,Afternoon,75.95,2.0,Negative,88.01241,0.016694,142.90,1,2,0,24
2,2,Study Sessions,Episode 16,73.90,Education,69.97,Tuesday,Evening,8.97,0.0,Negative,44.92531,0.000000,78.94,2,5,0,40
3,3,Digital Digest,Episode 45,67.17,Technology,57.22,Monday,Morning,78.70,2.0,Positive,46.27824,0.029775,135.92,8,1,2,10
4,4,Mind & Body,Episode 86,110.51,Health,80.07,Monday,Afternoon,58.68,3.0,Neutral,75.61031,0.027147,138.75,3,1,1,31
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
749995,749995,Learning Lab,Episode 25,75.66,Education,69.36,Saturday,Morning,0.00,0.0,Negative,56.87058,0.000000,69.36,2,2,0,26
749996,749996,Business Briefs,Episode 21,75.75,Business,35.21,Saturday,Night,0.00,2.0,Neutral,45.46242,0.026403,35.21,0,2,1,2
749997,749997,Lifestyle Lounge,Episode 51,30.98,Lifestyle,78.58,Thursday,Morning,84.89,0.0,Negative,15.26000,0.000000,163.47,4,4,0,28
749998,749998,Style Guide,Episode 47,108.98,Lifestyle,45.39,Thursday,Morning,93.27,0.0,Negative,100.72939,0.000000,138.66,4,4,0,41


# Pipeline

In [69]:
df_train_num = df_train[['Episode_Length_minutes'
                             ,'Host_Popularity_percentage'
                             ,'ad_per_minute'
                             ,'H_G_Popularity'
                             ,'Genre_encoded'
                             ,'Publication_Day_encoded'
                             ,'Episode_Sentiment_encoded'
                             ,'Podcast_Name_encoded']]
df_train_num

Unnamed: 0,Episode_Length_minutes,Host_Popularity_percentage,ad_per_minute,H_G_Popularity,Genre_encoded,Publication_Day_encoded,Episode_Sentiment_encoded,Podcast_Name_encoded
0,64.39,74.81,0.000000,74.81,9,4,2,34
1,119.80,66.95,0.016694,142.90,1,2,0,24
2,73.90,69.97,0.000000,78.94,2,5,0,40
3,67.17,57.22,0.029775,135.92,8,1,2,10
4,110.51,80.07,0.027147,138.75,3,1,1,31
...,...,...,...,...,...,...,...,...
749995,75.66,69.36,0.000000,69.36,2,2,0,26
749996,75.75,35.21,0.026403,35.21,0,2,1,2
749997,30.98,78.58,0.000000,163.47,4,4,0,28
749998,108.98,45.39,0.000000,138.66,4,4,0,41


## Test set preprocessing

In [86]:
df_test = pd.read_csv('input/test.csv', sep=',')
df_test

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment
0,750000,Educational Nuggets,Episode 73,78.96,Education,38.11,Saturday,Evening,53.33,1.0,Neutral
1,750001,Sound Waves,Episode 23,27.87,Music,71.29,Sunday,Morning,,0.0,Neutral
2,750002,Joke Junction,Episode 11,69.10,Comedy,67.89,Friday,Evening,97.51,0.0,Positive
3,750003,Comedy Corner,Episode 73,115.39,Comedy,23.40,Sunday,Morning,51.75,2.0,Positive
4,750004,Life Lessons,Episode 50,72.32,Lifestyle,58.10,Wednesday,Morning,11.30,2.0,Neutral
...,...,...,...,...,...,...,...,...,...,...,...
249995,999995,Mind & Body,Episode 100,21.05,Health,65.77,Saturday,Evening,96.40,3.0,Negative
249996,999996,Joke Junction,Episode 85,85.50,Comedy,41.47,Saturday,Night,30.52,2.0,Negative
249997,999997,Joke Junction,Episode 63,12.11,Comedy,25.92,Thursday,Evening,73.69,1.0,Neutral
249998,999998,Market Masters,Episode 46,113.46,Business,43.47,Friday,Night,93.59,3.0,Positive


In [89]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 13 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           250000 non-null  int64  
 1   Podcast_Name                 250000 non-null  object 
 2   Episode_Title                250000 non-null  object 
 3   Episode_Length_minutes       250000 non-null  float64
 4   Genre                        250000 non-null  object 
 5   Host_Popularity_percentage   250000 non-null  float64
 6   Publication_Day              250000 non-null  object 
 7   Publication_Time             250000 non-null  object 
 8   Guest_Popularity_percentage  250000 non-null  float64
 9   Number_of_Ads                250000 non-null  float64
 10  Episode_Sentiment            250000 non-null  object 
 11  ad_per_minute                250000 non-null  float64
 12  H_G_Popularity               250000 non-null  float64
dtyp

In [90]:
df_test.describe()

Unnamed: 0,id,Episode_Length_minutes,Host_Popularity_percentage,Guest_Popularity_percentage,Number_of_Ads,ad_per_minute,H_G_Popularity
count,250000.0,250000.0,250000.0,250000.0,250000.0,250000.0,250000.0
mean,874999.5,418.5513,59.716491,41.998082,1.355852,0.034187,101.714573
std,72168.927986,156975.0,22.880028,32.851723,4.274399,0.579268,40.175951
min,750000.0,2.47,2.49,0.0,0.0,0.0,7.65
25%,812499.75,39.37,39.25,7.79,0.0,0.0,71.81
50%,874999.5,64.43,59.9,42.09,1.0,0.018909,99.78
75%,937499.25,90.81,79.39,70.99,2.0,0.040363,131.47
max,999999.0,78486260.0,117.76,116.82,2063.0,288.531469,205.08
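
The summary above contains values that do not look plausible (an episode length of about 78 million minutes, 2063 ads, popularity scores above 100). A minimal sketch for counting such rows per column on toy data; the popularity limit follows the 0-100 scale in the data description, while the limits for episode length and number of ads are assumptions chosen only for illustration:

```python
import pandas as pd

# Toy rows standing in for df_test, including one row of extreme values.
demo = pd.DataFrame(
    {
        "Episode_Length_minutes": [64.43, 78486260.0, 12.11],
        "Host_Popularity_percentage": [59.9, 117.76, 25.92],
        "Number_of_Ads": [1.0, 2063.0, 0.0],
    }
)

# Hypothetical plausible ranges (inclusive); adjust to domain knowledge.
bounds = {
    "Episode_Length_minutes": (0, 600),
    "Host_Popularity_percentage": (0, 100),
    "Number_of_Ads": (0, 10),
}

# Count the rows that fall outside the assumed range for each column.
out_of_range = {
    col: int((~demo[col].between(low, high)).sum())
    for col, (low, high) in bounds.items()
}
print(out_of_range)
```

Whether such rows should be clipped, replaced, or left alone depends on how the model reacts to them; the count is only a first diagnostic.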


In [87]:
# Replace zeros with NaN so they are imputed as well
df_test['Episode_Length_minutes'] = df_test['Episode_Length_minutes'].replace(0, np.nan)
# Get mean values per podcast
mean_map_test = df_test.groupby('Podcast_Name')['Episode_Length_minutes'].mean().round(2)
# Map the means to fill NaN
df_test['Episode_Length_minutes'] = (
    df_test['Episode_Length_minutes']
    .fillna(df_test['Podcast_Name'].map(mean_map_test)))

# Replace NaN with 0 in the 'Number_of_Ads' column
df_test['Number_of_Ads'] = df_test['Number_of_Ads'].fillna(0)

df_test['ad_per_minute'] = df_test['Number_of_Ads']/df_test['Episode_Length_minutes']

# Replace NaN with 0 in the 'Guest_Popularity_percentage' column
df_test['Guest_Popularity_percentage'] = df_test['Guest_Popularity_percentage'].fillna(0)

df_test['H_G_Popularity'] = df_test['Host_Popularity_percentage']+df_test['Guest_Popularity_percentage']

#Encoding string columns with label encoder
#df_test = label_encode_columns(df_test, columns_to_encode=['Podcast_Name', 'Genre', 'Publication_Day', 'Episode_Sentiment'], suffix='_encoded')

df_test

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,ad_per_minute,H_G_Popularity
0,750000,Educational Nuggets,Episode 73,78.96,Education,38.11,Saturday,Evening,53.33,1.0,Neutral,0.012665,91.44
1,750001,Sound Waves,Episode 23,27.87,Music,71.29,Sunday,Morning,0.00,0.0,Neutral,0.000000,71.29
2,750002,Joke Junction,Episode 11,69.10,Comedy,67.89,Friday,Evening,97.51,0.0,Positive,0.000000,165.40
3,750003,Comedy Corner,Episode 73,115.39,Comedy,23.40,Sunday,Morning,51.75,2.0,Positive,0.017333,75.15
4,750004,Life Lessons,Episode 50,72.32,Lifestyle,58.10,Wednesday,Morning,11.30,2.0,Neutral,0.027655,69.40
...,...,...,...,...,...,...,...,...,...,...,...,...,...
249995,999995,Mind & Body,Episode 100,21.05,Health,65.77,Saturday,Evening,96.40,3.0,Negative,0.142518,162.17
249996,999996,Joke Junction,Episode 85,85.50,Comedy,41.47,Saturday,Night,30.52,2.0,Negative,0.023392,71.99
249997,999997,Joke Junction,Episode 63,12.11,Comedy,25.92,Thursday,Evening,73.69,1.0,Neutral,0.082576,99.61
249998,999998,Market Masters,Episode 46,113.46,Business,43.47,Friday,Night,93.59,3.0,Positive,0.026441,137.06
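
The test cell computes its own per-podcast means (`mean_map_test`), so train and test are imputed with different statistics. A hedged sketch of the alternative of reusing the means learned on the training data, with the overall training mean as a fallback for podcasts that do not appear in training; toy frames stand in for `df_train` and `df_test`:

```python
import numpy as np
import pandas as pd

# Toy frames standing in for df_train and df_test.
train = pd.DataFrame(
    {
        "Podcast_Name": ["A", "A", "B"],
        "Episode_Length_minutes": [60.0, 70.0, 40.0],
    }
)
test = pd.DataFrame(
    {
        "Podcast_Name": ["A", "B", "C"],
        "Episode_Length_minutes": [np.nan, 45.0, np.nan],
    }
)

# Statistics are computed on the training data only.
train_mean_map = train.groupby("Podcast_Name")["Episode_Length_minutes"].mean()
global_mean = train["Episode_Length_minutes"].mean()

# Fill with the training mean of the same podcast, then with the overall
# training mean when the podcast was never seen in training (podcast "C").
test["Episode_Length_minutes"] = (
    test["Episode_Length_minutes"]
    .fillna(test["Podcast_Name"].map(train_mean_map))
    .fillna(global_mean)
)
print(test)
```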


In [88]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 13 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           250000 non-null  int64  
 1   Podcast_Name                 250000 non-null  object 
 2   Episode_Title                250000 non-null  object 
 3   Episode_Length_minutes       250000 non-null  float64
 4   Genre                        250000 non-null  object 
 5   Host_Popularity_percentage   250000 non-null  float64
 6   Publication_Day              250000 non-null  object 
 7   Publication_Time             250000 non-null  object 
 8   Guest_Popularity_percentage  250000 non-null  float64
 9   Number_of_Ads                250000 non-null  float64
 10  Episode_Sentiment            250000 non-null  object 
 11  ad_per_minute                250000 non-null  float64
 12  H_G_Popularity               250000 non-null  float64
dtyp

## Applying Linear Regression

In [91]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error, root_mean_squared_error   

In [108]:
df_train.drop('Listening_Time_minutes', axis=1)

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,ad_per_minute,H_G_Popularity,Genre_encoded,Publication_Day_encoded,Episode_Sentiment_encoded,Podcast_Name_encoded
0,0,Mystery Matters,Episode 98,64.39,True Crime,74.81,Thursday,Night,0.00,0.0,Positive,0.000000,74.81,9,4,2,34
1,1,Joke Junction,Episode 26,119.80,Comedy,66.95,Saturday,Afternoon,75.95,2.0,Negative,0.016694,142.90,1,2,0,24
2,2,Study Sessions,Episode 16,73.90,Education,69.97,Tuesday,Evening,8.97,0.0,Negative,0.000000,78.94,2,5,0,40
3,3,Digital Digest,Episode 45,67.17,Technology,57.22,Monday,Morning,78.70,2.0,Positive,0.029775,135.92,8,1,2,10
4,4,Mind & Body,Episode 86,110.51,Health,80.07,Monday,Afternoon,58.68,3.0,Neutral,0.027147,138.75,3,1,1,31
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
749995,749995,Learning Lab,Episode 25,75.66,Education,69.36,Saturday,Morning,0.00,0.0,Negative,0.000000,69.36,2,2,0,26
749996,749996,Business Briefs,Episode 21,75.75,Business,35.21,Saturday,Night,0.00,2.0,Neutral,0.026403,35.21,0,2,1,2
749997,749997,Lifestyle Lounge,Episode 51,30.98,Lifestyle,78.58,Thursday,Morning,84.89,0.0,Negative,0.000000,163.47,4,4,0,28
749998,749998,Style Guide,Episode 47,108.98,Lifestyle,45.39,Thursday,Morning,93.27,0.0,Negative,0.000000,138.66,4,4,0,41


In [113]:
df_train_pipeline = df_train[['Episode_Length_minutes', 'ad_per_minute', 'H_G_Popularity', 'Podcast_Name','Genre', 'Publication_Day', 'Publication_Time', 'Episode_Sentiment']]
df_train_pipeline

Unnamed: 0,Episode_Length_minutes,ad_per_minute,H_G_Popularity,Podcast_Name,Genre,Publication_Day,Publication_Time,Episode_Sentiment
0,64.39,0.000000,74.81,Mystery Matters,True Crime,Thursday,Night,Positive
1,119.80,0.016694,142.90,Joke Junction,Comedy,Saturday,Afternoon,Negative
2,73.90,0.000000,78.94,Study Sessions,Education,Tuesday,Evening,Negative
3,67.17,0.029775,135.92,Digital Digest,Technology,Monday,Morning,Positive
4,110.51,0.027147,138.75,Mind & Body,Health,Monday,Afternoon,Neutral
...,...,...,...,...,...,...,...,...
749995,75.66,0.000000,69.36,Learning Lab,Education,Saturday,Morning,Negative
749996,75.75,0.026403,35.21,Business Briefs,Business,Saturday,Night,Neutral
749997,30.98,0.000000,163.47,Lifestyle Lounge,Lifestyle,Thursday,Morning,Negative
749998,108.98,0.000000,138.66,Style Guide,Lifestyle,Thursday,Morning,Negative


In [114]:
# Splitting the dataset into training and testing sets.
# Here, we set aside 20% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(df_train_pipeline, df_train['Listening_Time_minutes'], test_size=0.2, random_state=35)

# Specify the column names to be processed
numeric_columns  = ['Episode_Length_minutes', 'ad_per_minute', 'H_G_Popularity']
categorical_columns  = ['Podcast_Name','Genre', 'Publication_Day', 'Publication_Time', 'Episode_Sentiment']

# Create a ColumnTransformer that:
# - Applies StandardScaler to the numeric columns.
# - Applies OneHotEncoder to the categorical columns.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_columns),
        ('cat', OneHotEncoder(), categorical_columns)
    ]
)

# Build a pipeline that first preprocesses the data and then applies linear regression.
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

# Train the pipeline using the training set.
pipeline.fit(X_train, y_train)
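
A single 80/20 split gives one estimate of the error. As a sketch of a complementary check, the same kind of pipeline can be scored with k-fold cross-validation. The data below is synthetic and only mimics two of the columns; `handle_unknown="ignore"` is an assumption added here so that a category missing from a training fold does not raise an error at prediction time:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the training data (not the competition data).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame(
    {
        "Episode_Length_minutes": rng.uniform(5, 120, n),
        "Genre": rng.choice(["Comedy", "News", "Sports"], n),
    }
)
y = 0.7 * X["Episode_Length_minutes"] + rng.normal(0, 5, n)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Episode_Length_minutes"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Genre"]),
    ]
)
model = Pipeline(
    steps=[("preprocessor", preprocessor), ("regressor", LinearRegression())]
)

# scikit-learn maximizes scores, so RMSE is returned negated.
scores = cross_val_score(
    model, X, y, cv=5, scoring="neg_root_mean_squared_error"
)
rmse_per_fold = -scores
print("RMSE per fold:", rmse_per_fold)
print("Mean RMSE:", rmse_per_fold.mean())
```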




In [115]:
# Use the trained pipeline to make predictions on the test set.
y_pred = pipeline.predict(X_test)
print("Test set predictions:")
print(y_pred)


Test set predictions:
[ 1.18251044 52.91633175 69.27507388 ... 67.41428066  3.87604258
 48.06904555]


In [116]:
# Evaluate model performance using MSE, RMSE, and R² Score.
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("R² Score:", r2)

# Retrieve and print the coefficients and intercept of the linear regression model.
# Note that these coefficients correspond to the transformed features.
regressor = pipeline.named_steps['regressor']
print("Coefficients:", regressor.coef_)
print("Intercept:", regressor.intercept_)

Mean Squared Error: 183.03864826407687
Root Mean Squared Error: 13.529177664000013
R² Score: 0.7513547955159773
Coefficients: [ 2.31731774e+01 -6.37724825e-01  3.49826671e-01 -1.35653003e-01
 -1.24200952e+00  1.34967014e-01  6.54848909e-02 -1.63982503e-01
  1.13663948e+00  4.62393160e-01 -7.81648395e-01 -9.24322798e-01
  2.91795540e-01  5.60457259e-01 -1.36424892e+00  5.47199419e-01
 -2.50655351e-01  1.03634077e+00  8.62501810e-02  1.57075848e+00
 -3.59898264e-01 -1.58948161e+00  1.37228759e+00  1.09254840e+00
  2.88576044e-02 -3.12905733e-01  7.42264489e-01 -5.32805805e-01
 -1.85797193e-01 -1.38354775e+00  4.63733181e-01  8.29625789e-01
  1.44694389e-01 -4.44895382e-01  1.42143866e+00  5.80022054e-01
 -4.04759448e-01  5.86839106e-01 -8.00236716e-01 -1.09647139e+00
 -1.05208294e+00 -1.13275559e+00 -7.52774742e-01 -5.30975219e-01
  4.44459548e-01  1.18868256e+00  4.57021230e-01  6.25887048e-01
 -2.20496876e-01  1.07903023e+00 -1.28727293e+00 -3.54128035e-02
  2.17295872e-02  1.01446442e

# Submit


In [106]:
df_test

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,ad_per_minute,H_G_Popularity
0,750000,Educational Nuggets,Episode 73,78.96,Education,38.11,Saturday,Evening,53.33,1.0,Neutral,0.012665,91.44
1,750001,Sound Waves,Episode 23,27.87,Music,71.29,Sunday,Morning,0.00,0.0,Neutral,0.000000,71.29
2,750002,Joke Junction,Episode 11,69.10,Comedy,67.89,Friday,Evening,97.51,0.0,Positive,0.000000,165.40
3,750003,Comedy Corner,Episode 73,115.39,Comedy,23.40,Sunday,Morning,51.75,2.0,Positive,0.017333,75.15
4,750004,Life Lessons,Episode 50,72.32,Lifestyle,58.10,Wednesday,Morning,11.30,2.0,Neutral,0.027655,69.40
...,...,...,...,...,...,...,...,...,...,...,...,...,...
249995,999995,Mind & Body,Episode 100,21.05,Health,65.77,Saturday,Evening,96.40,3.0,Negative,0.142518,162.17
249996,999996,Joke Junction,Episode 85,85.50,Comedy,41.47,Saturday,Night,30.52,2.0,Negative,0.023392,71.99
249997,999997,Joke Junction,Episode 63,12.11,Comedy,25.92,Thursday,Evening,73.69,1.0,Neutral,0.082576,99.61
249998,999998,Market Masters,Episode 46,113.46,Business,43.47,Friday,Night,93.59,3.0,Positive,0.026441,137.06


In [117]:
# Use the trained pipeline to make predictions on the competition test set (df_test).
y_final = pipeline.predict(df_test)
print("Test set predictions:")
y_final

Test set predictions:


array([55.53431541, 17.05129147, 49.95849414, ...,  4.52680419,
       83.65009477, 55.41655455])
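
A linear model is not constrained to a valid range, so it can return negative listening times or values longer than the episode itself. A small sketch of optional post-processing with `np.clip`; capping at the episode length is an assumption about the data, not something the notebook verifies, and the arrays are toy values:

```python
import numpy as np

# Toy predictions and episode lengths standing in for y_final and
# df_test['Episode_Length_minutes'].
predictions = np.array([-1.5, 17.051, 130.0])
episode_length = np.array([10.0, 27.87, 113.46])

# Lower bound 0 minutes, upper bound the length of each episode (assumption).
clipped = np.clip(predictions, 0.0, episode_length)
print(clipped)
```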

In [129]:
df_test['Listening_Time_minutes'] = y_final
df_test['Listening_Time_minutes'] = df_test['Listening_Time_minutes'].round(3)
df_test

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,ad_per_minute,H_G_Popularity,Listening_Time_minutes
0,750000,Educational Nuggets,Episode 73,78.96,Education,38.11,Saturday,Evening,53.33,1.0,Neutral,0.012665,91.44,55.534
1,750001,Sound Waves,Episode 23,27.87,Music,71.29,Sunday,Morning,0.00,0.0,Neutral,0.000000,71.29,17.051
2,750002,Joke Junction,Episode 11,69.10,Comedy,67.89,Friday,Evening,97.51,0.0,Positive,0.000000,165.40,49.958
3,750003,Comedy Corner,Episode 73,115.39,Comedy,23.40,Sunday,Morning,51.75,2.0,Positive,0.017333,75.15,83.894
4,750004,Life Lessons,Episode 50,72.32,Lifestyle,58.10,Wednesday,Morning,11.30,2.0,Neutral,0.027655,69.40,50.596
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
249995,999995,Mind & Body,Episode 100,21.05,Health,65.77,Saturday,Evening,96.40,3.0,Negative,0.142518,162.17,11.577
249996,999996,Joke Junction,Episode 85,85.50,Comedy,41.47,Saturday,Night,30.52,2.0,Negative,0.023392,71.99,60.432
249997,999997,Joke Junction,Episode 63,12.11,Comedy,25.92,Thursday,Evening,73.69,1.0,Neutral,0.082576,99.61,4.527
249998,999998,Market Masters,Episode 46,113.46,Business,43.47,Friday,Night,93.59,3.0,Positive,0.026441,137.06,83.650


In [130]:
df_final = df_test[['id', 'Listening_Time_minutes']]
df_final

Unnamed: 0,id,Listening_Time_minutes
0,750000,55.534
1,750001,17.051
2,750002,49.958
3,750003,83.894
4,750004,50.596
...,...,...
249995,999995,11.577
249996,999996,60.432
249997,999997,4.527
249998,999998,83.650


In [131]:
from datetime import datetime

# Get the current date and time
now = datetime.now()

# Format the datetime object into the desired string format
timestamp_str = now.strftime('%Y%m%d_%H%M%S')

In [132]:
df_final.to_csv(f'output/submission_file_{timestamp_str}.csv', sep=',', index=False)
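
`to_csv` fails if the `output` directory does not exist yet. A minimal sketch that creates the directory first; a temporary directory is used as the base path so the example can run anywhere, and the two-row frame is a stand-in for `df_final`:

```python
import tempfile
from datetime import datetime
from pathlib import Path

import pandas as pd

# Two-row stand-in for df_final.
submission = pd.DataFrame(
    {"id": [750000, 750001], "Listening_Time_minutes": [55.534, 17.051]}
)

# A temporary directory stands in for the notebook's working directory.
base_dir = Path(tempfile.mkdtemp())
output_dir = base_dir / "output"

# Create the folder (and any parents) if it is missing; no error if it exists.
output_dir.mkdir(parents=True, exist_ok=True)

timestamp_str = datetime.now().strftime("%Y%m%d_%H%M%S")
submission_path = output_dir / f"submission_file_{timestamp_str}.csv"
submission.to_csv(submission_path, index=False)

# Read the file back to confirm the layout of the written submission.
check = pd.read_csv(submission_path)
print(check)
```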