## <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>Introduction</div>

## When I'm not on Kaggle, well I'm watching ... Netflix


With this dataset in particular, I wanted to put special **emphasis on the exploratory analysis** and produce
visuals that are not only appealing but also meaningful.<br><br> 
Some questions I'll try to answer:

* Which genres are the most common and/or most successful?
* What's the most common language?
* What's the average runtime per genre/language?
* Are there special relations between some of the features and our target variable, the IMDB Score?

After a thorough analysis I will also try to **predict the IMDB Score**.
To do so, I will build some baseline models and compare their performance at the end.
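
When comparing baseline models by their error, it can help to know what a trivial predictor achieves. The following is only a sketch and not part of the original analysis: it uses made-up scores and a hypothetical `rmse` helper to show how the root mean squared error of a constant mean prediction could serve as a reference point.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# made-up scores, purely for illustration (not taken from the dataset)
y_train = np.array([6.0, 7.0, 5.0, 6.0])
y_test = np.array([5.0, 7.0])

# trivial reference: always predict the mean of the training scores
mean_prediction = np.full(len(y_test), y_train.mean())
baseline_rmse = rmse(y_test, mean_prediction)
print(f"RMSE of the mean predictor: {baseline_rmse:.2f}")
```

A model is arguably only worth keeping if its RMSE on the test split is clearly below this reference value.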

Thanks in advance for checking out my notebook. Feel free to leave a comment, an upvote or just say hi :)

**Take a look at some of my other work here:**
* [Water-Quality EDA & Model-Comparison](https://www.kaggle.com/mlanhenke/waterquality-eda-baseline-model-comparison)
* [Student-Test-Scores - EDA & Score Prediction](https://www.kaggle.com/mlanhenke/test-scores-epic-eda-prediction-cb-xgb-lgbm)

## <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>Import Data</div>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.lines as lines
import matplotlib.gridspec as gridspec
import seaborn as sns

from scipy.stats import probplot
from warnings import filterwarnings
filterwarnings('ignore')

plt.rcParams['font.family'] = 'monospace'

In [None]:
df = pd.read_csv('../input/netflix-original-films-imdb-scores/NetflixOriginals.csv')

## <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>Color Palettes</div>

In [None]:
colors = ['#751a2c','#b33a3a','#d57056','#f2b0a5','#261421']
bg_color = '#fbfbfb'
txt_color = '#5c5c5c'

sns.palplot(colors)

## <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>Basic Overview</div>

In [None]:
print(f"Shape: {df.shape}")
print(':'*25)
df.head(5)

In [None]:
df.info()

## <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>Feature Engineering</div>

To analyze this dataset, we first have to create some time-based features. So let's get to work.
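
To make the idea of time-based features concrete, here is a minimal sketch on a toy frame with two made-up premiere dates (not rows from the dataset). It assumes a pandas version that provides `dt.isocalendar()` for the ISO week number.

```python
import pandas as pd

# toy frame with made-up premiere dates (illustration only)
toy = pd.DataFrame({'Premiere': ['2019-08-05', '2020-12-25']})
toy['Premiere'] = pd.to_datetime(toy['Premiere'])

# derive calendar features from the datetime column
toy['Year'] = toy['Premiere'].dt.year
toy['Month'] = toy['Premiere'].dt.month
toy['Week'] = toy['Premiere'].dt.isocalendar().week   # ISO week number
toy['DayName'] = toy['Premiere'].dt.day_name()
toy['Weekday'] = toy['Premiere'].dt.weekday            # Monday = 0

print(toy)
```

Each derived column can then be used as a grouping key, for example to average the score per month or per weekday.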

In [None]:
# create time based features
df['Premiere'] = pd.to_datetime(df['Premiere'])
df['Year'] = df['Premiere'].dt.year
df['Month'] = df['Premiere'].dt.month
df['Week'] = df['Premiere'].dt.isocalendar().week  # dt.week was removed in pandas 2.0
df['DayName'] = df['Premiere'].dt.day_name()
df['Weekday'] = df['Premiere'].dt.weekday

# create string based features
df['TitleLen'] = df['Title'].str.len()

## <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>Target Analysis: IMDB Score</div>
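
The plots in this section judge the shape of the score distribution by eye. As a complement, the shape can also be checked numerically. The following is a minimal sketch on synthetic, deliberately left-skewed data (not the real IMDB scores), assuming SciPy's `skew` and `shapiro` functions are available.

```python
import numpy as np
from scipy.stats import skew, shapiro

# synthetic, deliberately left-skewed sample (not the real IMDB scores)
rng = np.random.default_rng(0)
sample = 10 - rng.gamma(shape=2.0, scale=1.0, size=500)

sample_skew = skew(sample)        # a negative value means a longer tail to the left
stat, p_value = shapiro(sample)   # a small p-value is evidence against normality

print(f"skewness: {sample_skew:.2f}, Shapiro-Wilk p-value: {p_value:.4f}")
```

Applied to `df['IMDB Score']`, the same two calls would put a number on how far the distribution departs from a normal one.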

In [None]:
fig = plt.figure(tight_layout=True, figsize=(15,9))
gs = gridspec.GridSpec(nrows=2, ncols=2, width_ratios=[3,1])

fig.patch.set_facecolor(bg_color)

ax0 = fig.add_subplot(gs[:,0])
ax1 = fig.add_subplot(gs[0,1])
ax2 = fig.add_subplot(gs[1,1])

ax0.set_facecolor(bg_color)
ax1.set_facecolor(bg_color)
ax2.set_facecolor(bg_color)

mean = df['IMDB Score'].mean()
median = df['IMDB Score'].median()

#################
#### KDE-PLOT####
#################
ax0.axvline(x=mean, ymin=0, ymax=1, zorder=2, color='#fff', alpha=0.5, lw=2, ls='--')
ax0.axvline(x=median, ymin=0, ymax=1, zorder=2, color='#fff', alpha=0.5, lw=2, ls=':')

ax0.annotate(
    f"Mean: {np.round(mean,1)}",  # passed positionally: the `s` keyword was removed from annotate
    xy=(mean, 0.25),
    xytext=(mean - 0.8,0.3),
    color=txt_color,
    fontsize=14, fontweight='light', 
    fontfamily='calibri', fontstyle='italic',
    va='center', ha='center',
    bbox=dict(
        boxstyle='square,pad=0.3',
        facecolor=bg_color,edgecolor=txt_color
    ),
    arrowprops=dict(
        arrowstyle='->', 
        color='#000',
        connectionstyle='arc3, rad=0.5'
    )
)

ax0.annotate(
    f"Median: {np.round(median,1)}",  # passed positionally: the `s` keyword was removed from annotate
    xy=(median, 0.2),
    xytext=(median + 1.1, 0.25),
    color=txt_color,
    fontsize=14, fontweight='light', 
    fontfamily='calibri', fontstyle='italic',
    va='center', ha='center',
    bbox=dict(
        boxstyle='square,pad=0.3',
        facecolor=bg_color,edgecolor=txt_color
    ),
    arrowprops=dict(
        arrowstyle='->', 
        color='#000',
        connectionstyle='arc3, rad=-0.45'
    )
)

sns.kdeplot(
    data=df, x='IMDB Score', fill=True, color=colors[0],
    edgecolor=colors[4], lw=1, alpha=0.8, ax=ax0, zorder=1
)

ax0.set_xlabel('')
ax0.set_ylabel('')
ax0.set_yticks([])

##################
#### BOX-PLOT ####
##################

ax1.boxplot(
    data=df, x='IMDB Score',
    vert=False, patch_artist=True,
    boxprops=dict(facecolor=colors[4], color='#fff', lw=0),
    whiskerprops=dict(color='gray', lw=1, ls='--'),
    capprops=dict(color='gray', lw=1, ls='--'),
    medianprops=dict(color='#fff', lw=2),
    flierprops=dict(markerfacecolor=colors[0],alpha=0.75)
)

ax1.annotate(
    'left-outliers',  # passed positionally: the `s` keyword was removed from annotate
    xy=(39.5, 165),
    xytext=(0,225),
    color=txt_color,
    fontsize=14, fontweight='light', 
    fontfamily='calibri', fontstyle='italic',
    xycoords='axes points',
    arrowprops=dict(arrowstyle='-[, widthB=1.75')
)

ax1.set_xlabel('')
ax1.set_ylabel('')
ax1.set_xticks([])
ax1.set_yticks([])

###################
#### PROB-PLOT ####
###################

res = probplot(x=df['IMDB Score'], plot=ax2)

l0 = ax2.get_lines()[0]
l1 = ax2.get_lines()[1]

l0.set_marker('D')
l0.set_alpha(0.25)
l0.set_color(colors[3])
l1.set_color(colors[4])
l1.set_linestyle('--')
l1.set_linewidth(0.5)
l1.set_alpha(0.75)

ax2.set_xlabel('')
ax2.set_ylabel('')
ax2.set_xticks([])
ax2.set_yticks([])
ax2.set_title('')

# Text & Titles
fig.text(
    s=':IMDB Score - Distribution',
    x=0, y=0.975,
    color=txt_color,
    fontsize=17, fontweight='bold'
)

fig.text(
    s='''
    The IMDB Score is roughly normally distributed.
    However, there are some outliers to the left, which can
    be seen in the boxplot as well as in the probability
    plot.
    ''',
    x=0, y=0.875,
    color=txt_color
)

fig.text(
    s='Box-Plot', rotation=90, 
    x=0.72, y=0.90,
    color=txt_color
)

fig.text(
    s='Probability-Plot', rotation=90, 
    x=0.72, y=0.275,
    color=txt_color
)

# separation lines
sl1 = lines.Line2D(xdata=[0.73,0.73], ydata=[0.05,0.4], lw=1, alpha=0.5, color='#aeaeae', transform=fig.transFigure, figure=fig)
sl2 = lines.Line2D(xdata=[0.73,0.73], ydata=[0.6,0.95], lw=1, alpha=0.5, color='#aeaeae', transform=fig.transFigure, figure=fig)
fig.lines.extend([sl1, sl2])

# despine
for spine in ['top','left','right','bottom']:
    ax0.spines[spine].set_visible(False)
    ax1.spines[spine].set_visible(False)
    ax2.spines[spine].set_visible(False)

# show
plt.show()

## <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>Feature Analysis</div>
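
The analysis in this section relies on grouped summaries: the number of titles, the mean runtime and the mean score per category. A minimal sketch of that aggregation on a made-up toy frame (the titles and values are invented for illustration) could look like this:

```python
import pandas as pd

# invented rows, only to illustrate the aggregation
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'Genre': ['Drama', 'Drama', 'Comedy', 'Drama'],
    'Runtime': [100, 120, 90, 110],
    'IMDB Score': [6.0, 7.0, 5.0, 8.0],
})

# one row per genre: title count, mean runtime, mean score
summary = (
    toy.groupby('Genre')
       .agg(Count=('Title', 'count'),
            MeanRuntime=('Runtime', 'mean'),
            MeanScore=('IMDB Score', 'mean'))
       .sort_values('Count', ascending=False)
       .reset_index()
)
print(summary)
```

Sorting by `Count` puts the most common category first, so slicing the first five rows gives a "top 5" table.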

In [None]:
# create a helper function
def group_df(df:pd.DataFrame, col:str) -> pd.DataFrame:
    tmp = df.groupby(col).agg({'Title':'count','Runtime':'mean','IMDB Score':'mean'})
    tmp = tmp.sort_values(by='Title', ascending=False).reset_index()
    tmp = tmp.rename(columns={'Title':'Count', 'Runtime':'MeanRuntime','IMDB Score':'MeanScore'})
    return tmp

In [None]:
# create grouped dataframes for analysis
df_genre = group_df(df, 'Genre')[:5]
df_language = group_df(df, 'Language')[:5]

# calculate ratio for alpha values
df_genre['Ratio'] = df_genre['Count'].apply(lambda x: x / df_genre['Count'].sum())

#### <div style='background:#5c5c5c;color:white;padding:0.5em;border-radius:0.2em'>Titles</div>

In [None]:
# basic overview how many titles over time
df_time = df.groupby('Year')['Title'].nunique().reset_index()
df_time = df_time[df_time['Year'] <= 2020]
df_time = df_time.rename(columns={'Title':'Count'})
sum_titles = df_time['Count'].sum()

# plot
fig, ax = plt.subplots(figsize=(12,6))
fig.patch.set_facecolor(bg_color)
ax.set_facecolor(bg_color)

ax.plot(df_time['Year'], df_time['Count'], color=colors[4], lw=0.5)  # x and y must be passed positionally
ax.fill_between(x=df_time['Year'], y1=0, y2=df_time['Count'], color=colors[0], alpha=0.85)

ax.axhline(y=0, color=colors[4], lw=2, alpha=1)
ax.set_xlim(df_time['Year'].min(), df_time['Year'].max())

ax.yaxis.tick_right()
ax.tick_params(axis='both', which='both', length=0)

# Text & Titles
fig.text(
    s=':Number of Titles over Time (until 2020)',
    x=0, y=0.975,
    color=txt_color,
    fontsize=17, fontweight='bold'
)

fig.text(
    s='''
    The number of titles added
    has risen steadily over the years
    ''',
    x=-0.01, y=0.89,
    color=txt_color
)

fig.text(
    s='Total Movies:',
    x=0.762, y=0.84,
    color=txt_color,
    fontsize=9,
)

fig.text(
    s=sum_titles,
    x=0.785, y=0.80,
    color=txt_color,
    fontsize=14,fontweight='bold'
)

# separation lines
sl1 = lines.Line2D(xdata=[0.75,0.75], ydata=[0.78,0.86], lw=1, alpha=0.5, color='#aeaeae', transform=fig.transFigure, figure=fig)
sl2 = lines.Line2D(xdata=[0.75,0.80], ydata=[0.78,0.78], lw=1, alpha=0.5, color='#aeaeae', transform=fig.transFigure, figure=fig)
fig.lines.extend([sl1,sl2])

# despine
for spine in ['top','left','right','bottom']:
    ax.spines[spine].set_visible(False)

plt.show()

In [None]:
import squarify

# select the score column before averaging so non-numeric columns are never passed to mean()
df_top_titles = df.groupby('Title')['IMDB Score'].mean().nlargest(5)
df_flop_titles = df.groupby('Title')['IMDB Score'].mean().nsmallest(5).sort_values(ascending=False)
df_titles = pd.concat([df_top_titles, df_flop_titles])
df_titles = pd.DataFrame({'Title':df_titles.index,'Score':df_titles.values})

# create labels for treemap
labels = [label +'\n'+ str(score) +' Score' for label, score in zip(df_titles['Title'],df_titles['Score'])]

fig, ax = plt.subplots(tight_layout=True, figsize=(15,9))
fig.patch.set_facecolor(bg_color)
ax.set_facecolor(bg_color)

# create treemap
squarify.plot(
    sizes=df_titles['Score'], label=labels, color=colors, alpha=0.8,
    pad=0.05, ax=ax, text_kwargs=dict(color='white', fontsize=9, fontweight='light')
)

ax.axis('off')

# Text & Titles
fig.text(
    s=':Top & Flop Titles',
    x=0, y=1.1,
    color=txt_color,
    fontsize=17, fontweight='bold'
)

fig.text(
    s='''
    Congratulations David Attenborough,
    I also like your movies ...
    ''',
    x=-0.01, y=1.04,
    color=txt_color
)

plt.show()

#### <div style='background:#5c5c5c;color:white;padding:0.5em;border-radius:0.2em'>Genres</div>

In [None]:
!pip install circlify

In [None]:
import circlify

fig = plt.figure(tight_layout=True, figsize=(15,10))
gs = gridspec.GridSpec(nrows=1, ncols=2, width_ratios=[1.5,0.5])

fig.patch.set_facecolor(bg_color)
fig.subplots_adjust(wspace=1, right=2)

ax0 = fig.add_subplot(gs[0,0])
ax1 = fig.add_subplot(gs[0,1])

ax0.set_facecolor(bg_color)
ax1.set_facecolor(bg_color)

# create circles based on title count
circles = circlify.circlify(
    data=df_genre['Count'].tolist(),
    show_enclosure=False,
    target_enclosure=circlify.Circle(x=0,y=0,r=1)
)

# find and set limit
lim = max(
    max(
        abs(circle.x) + circle.r,
        abs(circle.y) + circle.r
    )
    for circle in circles
)

ax0.set_xlim(-lim, lim)
ax0.set_ylim(-lim, lim)

# labels
labels = df_genre['Genre'][::-1]
scores = df_genre['MeanScore'][::-1]
ratios = df_genre['Ratio'][::-1]

# draw circles
for label, score, ratio, circle in zip(labels, scores, ratios, circles):
    x,y,r = circle
    ax0.add_patch(
        plt.Circle(
            (x,y), r, 
            alpha=min(1.0, ratio+0.5), lw=1,  # clamp, since alpha must stay within 0-1
            fill=True, facecolor=colors[0]
            )
        )
    ax0.annotate(
        f"{label}\n{np.round(score,1)}",
        xy=(x,y),
        va='center',ha='center', color='#fff'
    )

# average runtime per genre
ax1.set_xlim(0, df_genre['MeanRuntime'].max()+10)

ax1 = sns.scatterplot(
    data=df_genre, x=10, y='Genre', color='#000', s=200, ax=ax1
)
ax1 = sns.scatterplot(
    data=df_genre, x='MeanRuntime', y='Genre', color=colors[0], s=2e3, ax=ax1
)

for idx in range(0,len(df_genre['Genre'])):
    xmin = 10/(df_genre['MeanRuntime'].max()+10)
    xmax = df_genre['MeanRuntime'][idx]/(df_genre['MeanRuntime'].max()+10)

    ax1.axhline(
        y=df_genre['Genre'][idx], 
        xmin=xmin, 
        xmax=xmax,
        color=txt_color, zorder=0
    )

    ax1.annotate(
        f"{int(df_genre['MeanRuntime'][idx])}\nmin",
        xy=(df_genre['MeanRuntime'][idx],df_genre['Genre'][idx]),
        va='center', ha='center',
        color='#fff'
        
    )

ax1.set_xticks([])
ax1.set_xlabel('')
ax1.set_ylabel('')
ax1.tick_params(axis='both', which='both', length=0)

# despine
for spine in ['top','left','right','bottom']:
    ax1.spines[spine].set_visible(False)

ax0.axis('off')

# Text & Titles
fig.text(
    s=':TOP 5 - Genres',
    x=0, y=0.975,
    color=txt_color,
    fontsize=17, fontweight='bold'
)

fig.text(
    s='''
    by number of titles (size)
    average score and runtime
    ''',
    x=-0.01, y=0.925,
    color=txt_color
)

fig.text(
    s='''
    Documentaries are not only 
    the largest genre (159 titles)
    but also, on average, the
    highest-scoring one.
    ''',
    x=0.51, y=0.7,
    color=txt_color,
    fontsize=9,alpha=0.5
)

fig.text(
    s='avg. Runtime',
    rotation=90,
    x=0.665, y=0.875,
    color=txt_color,
    fontsize=9,alpha=0.5
)

# separation lines
sl1 = lines.Line2D(xdata=[0.525,0.525], ydata=[0.68,0.78], lw=2, alpha=0.5, color='#aeaeae', transform=fig.transFigure, figure=fig)
sl2 = lines.Line2D(xdata=[0.675,0.675], ydata=[0.05,0.95], lw=1, alpha=0.25, color='#aeaeae', transform=fig.transFigure, figure=fig)
fig.lines.extend([sl1,sl2])

plt.show()

In [None]:
# prepare data for the top-genre violin plot: keep only rows from the top genres
top_genres = df_genre['Genre'].tolist()
data = df[df['Genre'].isin(top_genres)]

# violin plot
fig, ax = plt.subplots(figsize=(15,6))

fig.patch.set_facecolor(bg_color)
ax.set_facecolor(bg_color)

sns.violinplot(data=data, x='Genre', y='IMDB Score', palette=colors, saturation=0.5, ax=ax)

ax.set_ylabel('')
ax.set_xlabel('')
ax.tick_params(axis='x',length=0)

# despine
for spine in ['top','left','right']:
    ax.spines[spine].set_visible(False)

# Text & Titles
fig.text(
    s=':Genres vs. IMDB Score',
    x=0.1, y=1.1,
    color=txt_color,
    fontsize=17, fontweight='bold'
)

fig.text(
    s='''
    Documentaries are the highest scoring genre,
    but we can spot a few outliers (low scores) too.
    ''',
    x=0.09, y=1.02,
    color=txt_color
)

plt.show()

#### <div style='background:#5c5c5c;color:white;padding:0.5em;border-radius:0.2em'>Languages</div>

In [None]:
# !pip install squarify
# import squarify

fig = plt.figure(figsize=(15,10))

gs = gridspec.GridSpec(nrows=2, ncols=2, height_ratios=[3,1])

ax0 = fig.add_subplot(gs[0,:])
ax1 = fig.add_subplot(gs[1,0])
ax2 = fig.add_subplot(gs[1,1])

fig.patch.set_facecolor(bg_color)
fig.subplots_adjust(wspace=0.2, hspace=0.1)
ax0.set_facecolor(bg_color)
ax1.set_facecolor(bg_color)
ax2.set_facecolor(bg_color)

# create labels for treemap
labels = [label +'\n#'+ str(count) +' Titles' for label, count in zip(df_language['Language'],df_language['Count'])]

# create treemap
squarify.plot(
    sizes=df_language['Count'], label=labels, color=colors, 
    pad=True, ax=ax0, text_kwargs=dict(color='white', fontsize=13, fontweight='light'))

# average runtime
ax1.bar(
    x=df_language['Language'], height=df_language['MeanRuntime'],
    color='#000', edgecolor='#000', lw=1, alpha=0.45
)


ax1.tick_params(length=0)
ax1.set_yticks([])
ax1.set_ylabel('')

# average scores
ax2.bar(
    x=df_language['Language'], height=df_language['MeanScore'],
    color='#000', edgecolor='#000', lw=1, alpha=0.45
)

ax2.tick_params(length=0)
ax2.set_yticks([])
ax2.set_ylabel('')

# annotations
for idx in range(0,len(df_language['Language'])):
    ax1.annotate(
        f"Ø {int(df_language['MeanRuntime'][idx])} min",
        xy=(df_language['Language'][idx], 60),
        rotation=90,
        va='center', ha='center',
        color='#fff', fontsize=9
    )
    ax2.annotate(
        f"Ø\n{np.round(df_language['MeanScore'][idx],1)}",
        xy=(df_language['Language'][idx], 4),
        va='center', ha='center',
        color='#fff', fontsize=9
    )
    
# despine
ax0.axis('off')
for spine in ['top','left','right']:
    ax1.spines[spine].set_visible(False)
    ax2.spines[spine].set_visible(False)

# Text & Titles
fig.text(
    s=':TOP 5 - Languages',
    x=0.1, y=0.975,
    color=txt_color,
    fontsize=17, fontweight='bold'
)

fig.text(
    s='''
    by number of titles (size)
    average score and runtime
    ''',
    x=0.09, y=0.925,
    color=txt_color
)

fig.text(
    s='avg. Runtime',
    rotation=90,
    x=0.1075, y=0.2,
    color=txt_color,
    fontsize=9,alpha=0.5
)

fig.text(
    s='avg. Score',
    rotation=90,
    x=0.5275, y=0.2,
    color=txt_color,
    fontsize=9,alpha=0.5
)

sl1 = lines.Line2D(xdata=[0.115,0.115], ydata=[0.15,0.3], lw=1, alpha=0.25, color='#aeaeae', transform=fig.transFigure, figure=fig)
sl2 = lines.Line2D(xdata=[0.535,0.535], ydata=[0.15,0.3], lw=1, alpha=0.25, color='#aeaeae', transform=fig.transFigure, figure=fig)
fig.lines.extend([sl1,sl2])

plt.show()

#### <div style='background:#5c5c5c;color:white;padding:0.5em;border-radius:0.2em'>Runtime</div>
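
Whether runtime and score are related can be read off a scatterplot, but a correlation coefficient gives a number to go with the picture. The following is a minimal sketch on invented values (not the dataset); the Spearman coefficient is computed here as the Pearson correlation of the ranks, which keeps the example to pandas only.

```python
import pandas as pd

# invented values, only to illustrate the computation
toy = pd.DataFrame({
    'Runtime': [90, 100, 110, 120, 130],
    'IMDB Score': [6.0, 5.0, 7.0, 5.5, 6.5],
})

# Pearson: strength of the linear relation
pearson = toy['Runtime'].corr(toy['IMDB Score'])

# Spearman: Pearson correlation of the ranks, captures monotonic relations
spearman = toy['Runtime'].rank().corr(toy['IMDB Score'].rank())

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

Values close to zero would support the reading that there is no clear relation, while values near -1 or 1 would contradict it.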

In [None]:
# figure, grid
fig = plt.figure(tight_layout=True, figsize=(15,9))
gs = gridspec.GridSpec(nrows=2, ncols=2)

fig.patch.set_facecolor(bg_color)

ax0 = fig.add_subplot(gs[0,0])
ax1 = fig.add_subplot(gs[1,0])
ax2 = fig.add_subplot(gs[:,1])

# plots
ax0 = sns.kdeplot(
    data=df, x='Runtime', ax=ax0,
    fill=True, color=colors[0],
    edgecolor=colors[4], lw=1, alpha=0.8
)

ax1.boxplot(
    data=df, x='Runtime',
    vert=False, patch_artist=True,
    boxprops=dict(facecolor=colors[4], color='#fff', lw=0),
    whiskerprops=dict(color='gray', lw=1, ls='--'),
    capprops=dict(color='gray', lw=1, ls='--'),
    medianprops=dict(color='#fff', lw=2),
    flierprops=dict(markerfacecolor=colors[0],alpha=0.75)
)

ax2.scatter(
    y=df['Runtime'], x=df['IMDB Score'],
    color=colors[3], alpha=0.5, s=1*df['Runtime']
)

# Text & Titles
fig.text(
    s=':Runtime - Distribution & Relation',
    x=0, y=1.1,
    color=txt_color,
    fontsize=17, fontweight='bold'
)

fig.text(
    s='''
    As we can see, the runtime is negatively skewed with
    outliers to the left. The scatterplot shows no clear
    relation between runtime and IMDB Score.
    ''',
    x=0, y=1.02,
    color=txt_color
)

fig.text(
    s='''
    IMDB Score vs. Runtime
    ''',
    x=0.535, y=0.89,
    color=txt_color
)

fig.text(
    s='''
    Runtime Distribution
    ''',
    x=0.05, y=0.89,
    color=txt_color
)

fig.text(
    s='''
    Runtime Outliers
    ''',
    x=0.05, y=0.34,
    color=txt_color
)

# separation lines
sl1 = lines.Line2D(xdata=[0.535,0.535], ydata=[0.85,0.95], lw=1, alpha=0.5, color='#aeaeae', transform=fig.transFigure, figure=fig)
sl2 = lines.Line2D(xdata=[0.05,0.05], ydata=[0.85,0.95], lw=1, alpha=0.5, color='#aeaeae', transform=fig.transFigure, figure=fig)
sl3 = lines.Line2D(xdata=[0.05,0.05], ydata=[0.3,0.4], lw=1, alpha=0.5, color='#aeaeae', transform=fig.transFigure, figure=fig)
fig.lines.extend([sl1,sl2,sl3])

# ax colors
ax0.set_facecolor(bg_color)
ax1.set_facecolor(bg_color)
ax2.set_facecolor(bg_color)

# labels & ticks
ax0.set_xlabel('')
ax0.set_ylabel('')
ax0.set_yticks([])

ax1.set_yticks([])
ax1.set_xticks([])

ax2.set_yticks([])

ax0.tick_params(length=0, colors=txt_color)
ax2.tick_params(length=0, colors=txt_color)

# despine
for spine in ['top','left','right','bottom']:
    ax1.spines[spine].set_visible(False)
    
for spine in ['top','left','right']:
    ax0.spines[spine].set_visible(False)
    ax2.spines[spine].set_visible(False)

ax2.spines['bottom'].set_color(txt_color)
ax2.spines['bottom'].set_alpha(0.25)

plt.show()

#### <div style='background:#5c5c5c;color:white;padding:0.5em;border-radius:0.2em'>Premiere (Time)</div>

In [None]:
df_month = df.groupby('Month')[['IMDB Score']].mean().reset_index()

fig, ax = plt.subplots(figsize=(15,6))

fig.patch.set_facecolor(bg_color)
ax.set_facecolor(bg_color)

ax.plot(np.arange(0,12), df_month['IMDB Score'], color=colors[4], lw=10)  # same x positions as the fill below
ax.fill_between(x=np.arange(0,12), y1=df_month['IMDB Score'], color=colors[0], alpha=0.05, label='avg. Score')

sns.swarmplot(data=df, x='Month', y='IMDB Score', palette=colors, ax=ax)

ax.set_ylabel('')
ax.set_xlabel('')
ax.set_ylim(0,8)
ax.tick_params(axis='both',length=0)

# despine
for spine in ['top','left','right']:
    ax.spines[spine].set_visible(False)

# Text & Titles
fig.text(
    s=':Month vs. IMDB Score',
    x=0.1, y=1.1,
    color=txt_color,
    fontsize=17, fontweight='bold'
)

fig.text(
    s='''
    The average score is distributed
    evenly across the different months.
    There are small peaks around June
    and October, but no strong relation
    to the IMDB Score is evident.
    ''',
    x=0.09, y=0.94,
    color=txt_color
)

plt.legend(loc='lower center',frameon=False)
plt.show()

In [None]:
# fix the weekday order so the swarm plot and the average line share the same x positions
day_order = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
df_day = df.groupby('DayName')['IMDB Score'].mean().reindex(day_order).reset_index()

fig, ax = plt.subplots(figsize=(15,6))

fig.patch.set_facecolor(bg_color)
ax.set_facecolor(bg_color)

sns.swarmplot(data=df, x='DayName', y='IMDB Score', order=day_order, palette=colors, ax=ax)

x_pos = np.arange(len(day_order))
ax.plot(x_pos, df_day['IMDB Score'], color=colors[4], lw=10)
ax.fill_between(x=x_pos, y1=0, y2=df_day['IMDB Score'], color=colors[0], alpha=0.05, label='avg. Score')

ax.set_ylim(0,8)

ax.set_ylabel('')
ax.set_xlabel('')
ax.tick_params(axis='both',length=0)

# despine
for spine in ['top','left','right']:
    ax.spines[spine].set_visible(False)

# Text & Titles
fig.text(
    s=':Day vs. IMDB Score',
    x=0.1, y=1.1,
    color=txt_color,
    fontsize=17, fontweight='bold'
)

fig.text(
    s='''
    Friday is release day. The average score
    is relatively evenly distributed; however,
    Wednesday seems to be a bad day for a new movie.
    ''',
    x=0.09, y=0.98,
    color=txt_color
)

plt.legend(loc='lower center',frameon=False)
plt.show()

## <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>Modeling</div>
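
The preprocessing in this section one-hot encodes the features with `handle_unknown='ignore'`. The following minimal sketch, on an invented two-row example, shows what that setting does when the test split contains a category that never appeared in training: the unseen value is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# invented example data
train = pd.DataFrame({'Genre': ['Drama', 'Comedy', 'Drama']})
test = pd.DataFrame({'Genre': ['Comedy', 'Thriller']})  # 'Thriller' never seen in training

encoder = OneHotEncoder(handle_unknown='ignore')
train_encoded = encoder.fit_transform(train)        # sparse matrix, one column per category
test_encoded = encoder.transform(test).toarray()    # dense copy for inspection

print(encoder.categories_)
print(test_encoded)
```

With a small dataset and many rare categories, a noticeable share of test rows may end up with all-zero blocks, which limits what the models can learn from those features.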

In [None]:
# preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# prepare dataset
X = df.drop(columns=['Title', 'Premiere','DayName','IMDB Score'])
y = df['IMDB Score']

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=1)

# encoder = OrdinalEncoder(handle_unknown='use_encoded_value',unknown_value=999,dtype='int')
# X_train[['Genre','Language']] = encoder.fit_transform(X=X_train[['Genre','Language']])
# X_test[['Genre','Language']] = encoder.transform(X=X_test[['Genre','Language']])

encoder = OneHotEncoder(handle_unknown='ignore')
X_train = encoder.fit_transform(X=X_train)
X_test = encoder.transform(X=X_test)

In [None]:
# fit decision-tree-based models
from sklearn.metrics import mean_squared_error
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor

models = [
    ('lgbm',LGBMRegressor()),
    ('catb',CatBoostRegressor(verbose=0)),
    ('xgb',XGBRegressor(verbosity=0)),
    ('RF',RandomForestRegressor(verbose=0))
]

results = dict()

for name, model in models:
    model.fit(X_train, y_train)
    y_hat = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_hat))  # RMSE without the deprecated squared=False argument
    results[name] = rmse

In [None]:
df_results = pd.DataFrame([results])
df_results = df_results.transpose()
df_results = df_results.rename(columns={0:'RMSE'})

fig, ax = plt.subplots(figsize=(12,6))

ax = sns.barplot(
    data=df_results,
    x='RMSE',
    y=df_results.index,
    color=colors[0],
    saturation=0.5,
    ax = ax
)

for idx in range(0, len(df_results['RMSE'])):
    ax.annotate(
        f"{np.round(df_results['RMSE'].iloc[idx],2)}",
        xy=(df_results['RMSE'].iloc[idx]-0.05,idx),
        va='center', ha='right',
        color='#fff'
    )

# Text & Titles
fig.text(
    s=':Model Evaluation - RMSE',
    x=0.1, y=1.05,
    color=txt_color,
    fontsize=17, fontweight='bold'
)

fig.text(
    s='''
    CatBoost performs best with an RMSE of 0.92,
    followed by LGBM (0.95) and RandomForest (0.96)
    ''',
    x=0.09, y=0.96,
    color=txt_color
)

ax.set_xlabel('')
ax.set_xticks([])
ax.tick_params(length=0)

for spine in ['top','left','right','bottom']:
    ax.spines[spine].set_visible(False)

plt.show()

## <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>Conclusion</div>

I had a **lot of fun** working with this (small) dataset and I put a lot of **effort into the visuals**.<br>
When it came to **predicting** the IMDB Score, the small size of the dataset and the sparse feature set were noticeable.<br>
To **improve model performance**, one would have to gather more data with more relevant (correlated) features
as well as tune some hyperparameters.
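
As a pointer for the hyperparameter tuning mentioned above, here is a minimal sketch of a cross-validated grid search. It is not part of the original analysis: the data is synthetic, the grid values are arbitrary, and `RandomForestRegressor` merely stands in for any of the models used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# synthetic regression data (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=120)

# small, arbitrary grid; a real search would cover more values
param_grid = {'n_estimators': [20, 50], 'max_depth': [2, 4]}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    scoring='neg_root_mean_squared_error',  # scikit-learn maximizes, hence the negated RMSE
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(f"cross-validated RMSE: {-search.best_score_:.2f}")
```

Because the score is averaged over several folds, it is less dependent on one particular train/test split than a single hold-out RMSE.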

<div style='border-radius:3px;background:#b1d3e3;padding:2em;text-align:left;font-family:monospace;font-weight:light;font-size:1.1em;color:black'>
    <b>Thanks for checking out my notebook!</b><br>
    Feel free to leave a comment, a suggestion, an upvote or just a simple message to say hello :)
</div>