# Introduction
Hi! Welcome to my new notebook. I hope you find something interesting here. I would really appreciate feedback on my work: I am a beginner and still learning how to prepare a proper EDA and good graphs. Thanks!

# 0. Preparing data...

In [None]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('../input/netflix-original-films-imdb-scores/NetflixOriginals.csv', index_col='Title')
df.head()

In [None]:
# df.info()
# df.isnull().sum()

# 1. Comparison of the three most popular genres

In [None]:
# df['Genre'].value_counts()

# Choosing the three most popular genres
df1 = df.copy().query("Genre == ['Comedy', 'Documentary', 'Drama']").sort_values('Genre')
df1.head()

In [None]:
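# Calculating the mean runtime and IMDB score per genre
# (numeric_only=True skips text columns such as Premiere and Language)
df_mean = df1.groupby('Genre').mean(numeric_only=True)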

In [None]:
# Creating graph
point_colors = ['lightgreen', 'lightblue', 'violet']
mean_colors = ['seagreen', 'steelblue', 'purple']

sns.set_theme(style="darkgrid")
plt.subplots(figsize=(8, 8))
sns.scatterplot(data=df1, x="Runtime", y="IMDB Score", hue='Genre', palette=point_colors, s=50)
sns.scatterplot(data=df_mean, x='Runtime', y='IMDB Score', hue=df_mean.index, 
                palette=mean_colors, s=150, marker='o', legend=None)
plt.title('Comparison of the three most popular genres')
plt.legend(loc='upper left', fontsize='large')

## 1.1 First graph interpretation
As you can see, there are differences between the three most popular genres. Firstly, on average dramas are the longest and documentaries the shortest films in this comparison. Moreover, according to the data, a film shorter than 60 minutes is more likely to be a documentary than a comedy or a drama. Secondly, consider the IMDB score: documentaries get, on average, the best score of the three genres, while comedies get the worst. Finally, for dramas there appears to be a positive relationship between film length and score: the graph suggests that longer dramas tend to be rated higher.
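
The claim about dramas could be checked with a number instead of by eye. Below is a minimal sketch that computes the Pearson correlation between runtime and score within each genre. The `toy` frame and its values are invented for illustration, and `runtime_score_corr` is a hypothetical helper; in the notebook, `df1` would be passed in instead.

```python
import pandas as pd

# Toy stand-in for df1; these values are invented, not taken from the dataset.
toy = pd.DataFrame({
    "Genre": ["Drama"] * 4 + ["Comedy"] * 4,
    "Runtime": [90, 100, 110, 120, 90, 100, 110, 120],
    "IMDB Score": [6.0, 6.5, 7.0, 7.5, 6.0, 5.0, 6.0, 5.0],
})


def runtime_score_corr(frame):
    """Return the Pearson correlation of Runtime and IMDB Score for each genre."""
    return {
        genre: group["Runtime"].corr(group["IMDB Score"])
        for genre, group in frame.groupby("Genre")
    }


corr_by_genre = runtime_score_corr(toy)
print(corr_by_genre)
```

A value close to 1 would support the "longer dramas are rated higher" reading, while a value near 0 would suggest the pattern in the scatterplot is weak.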


# 2. Comparison of the most popular languages used in films

In [None]:
df2 = df[['Language', 'IMDB Score', 'Runtime']].copy()
df2.head()

In [None]:
# Counting language occurrences
# df2['Language'].value_counts()

In [None]:
# Grouping languages with fewer than 4 films into 'Other'
counts = df2['Language'].value_counts()

df2.loc[df2['Language'].map(counts) < 4, 'Language'] = 'Other'
df2.head()

In [None]:
# Calculating IMDB score and runtime means
df2 = df2.groupby('Language').mean().round(1)

# Sorting values
df2.sort_values(['IMDB Score'], inplace=True)

In [None]:
# Creating graphs
fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(10, 5))
sns.set_theme(style="darkgrid")
sns.set_color_codes("pastel")

sns.barplot(x=df2.index, y="Runtime", data=df2, color="b", ax=ax1)
sns.barplot(x=df2.index, y="IMDB Score", data=df2, color="b", ax=ax2)

ax1.xaxis.set_visible(False)

plt.sca(ax2)
plt.xticks(rotation = 35)

ax1.bar_label(ax1.containers[0], label_type='center')
ax2.bar_label(ax2.containers[0], label_type='center')

ax1.set_title('Comparison of the most popular languages used in films')
ax1.set_ylabel('Average runtime')
ax2.set_ylabel('Average IMDB score')

## 2.1 Second graph interpretation
The upper graph suggests that runtime varies with language: Hindi films are the longest on average, while English and Spanish films are noticeably shorter than the rest. The lower graph shows that IMDB users rate Japanese films the highest on average, unlike Italian films, whose average score is only 5.5.
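
An average over a handful of films is less reliable than one over hundreds, so it may help to show the number of films next to each mean. Here is a small sketch using named aggregation; the `toy` frame is invented for illustration and stands in for the `df2` frame before it is averaged.

```python
import pandas as pd

# Toy stand-in for df2 (before averaging); the values are invented.
toy = pd.DataFrame({
    "Language": ["English"] * 5 + ["Hindi"] * 2 + ["Japanese"],
    "IMDB Score": [6.0, 6.5, 5.5, 7.0, 6.0, 6.0, 7.0, 8.0],
    "Runtime": [90, 100, 95, 105, 110, 130, 150, 100],
})

# One row per language: how many films, plus mean score and mean runtime.
summary = (
    toy.groupby("Language")
    .agg(
        films=("IMDB Score", "size"),
        mean_score=("IMDB Score", "mean"),
        mean_runtime=("Runtime", "mean"),
    )
    .round(1)
    .sort_values("mean_score")
)
print(summary)
```

In this toy example the top-rated language is backed by a single film, which is exactly the kind of case the `films` column makes visible.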

# 3. Comparison of films' release dates

In [None]:
import datetime
def change_date_format(date):
    date = date.split(' ')
    date[0] = datetime.datetime.strptime(date[0], "%B").month
    date[1] = date[1].replace(',', '')
    return '{}-{}-{}'.format(date[2], date[0], date[1])

In [None]:
df3 = df[['Premiere', 'IMDB Score']].copy()

# Converting the premiere date to 'year-month-day' and splitting it into columns
df3['Premiere'] = df3['Premiere'].apply(change_date_format)
df3[['Prem_Year', 'Prem_Month', 'Prem_Day']] = df3['Premiere'].str.split("-", expand=True)
df3.drop(columns=['Premiere'], inplace=True)

df3.head()
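
As an alternative to splitting the date string by hand, pandas can parse it directly. The sketch below assumes every date looks like `August 5, 2019` (the format string is my assumption, and the sample dates are made up); if some rows in the real file deviate from it, passing `errors='coerce'` to `pd.to_datetime` would turn them into `NaT` instead of raising an error.

```python
import pandas as pd

# Made-up sample dates in the assumed "Month D, YYYY" format.
premiere = pd.Series(["August 5, 2019", "December 13, 2018", "March 1, 2021"])

# Parse the strings into real datetimes, then pull out the parts as integers.
parsed = pd.to_datetime(premiere, format="%B %d, %Y")
parts = pd.DataFrame({
    "Prem_Year": parsed.dt.year,
    "Prem_Month": parsed.dt.month,
    "Prem_Day": parsed.dt.day,
})
print(parts)
```

With this approach the year, month and day come out as numbers, so no later type conversion is needed and sorting by day works numerically.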

In [None]:
# Changing the type of variables
df3['Prem_Month'] = df3['Prem_Month'].astype(int)

# df3['Prem_Day'] = df3['Prem_Day'].astype(float)
# df3['Prem_Day'] = df3['Prem_Day'].astype(int)

# Sorting values
df3.sort_values(['Prem_Year', 'Prem_Month', 'Prem_Day'], inplace=True)

# Dropping films released in 2014
# df3['Prem_Year'].value_counts()
df3 = df3[df3['Prem_Year'] != '2014']

In [None]:
# Creating graphs
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(20, 7))
sns.set_theme(style="darkgrid")

sns.boxplot(data=df3, x="Prem_Year", y='IMDB Score', palette='pastel', ax=ax1)
sns.boxplot(data=df3, x="Prem_Month", y='IMDB Score', palette='pastel', ax=ax2)

ax1.set_title("Comparison of films' release year", size=20)
ax2.set_title("Comparison of films' release month", size=20)

## 3.1 Third graph interpretation
Let's begin with the left graph. The IMDB score shows a decreasing trend from 2015 to 2021. In addition, the third quartile in 2021 is the lowest of any year. However, the minimum of the 2021 boxplot is higher than it was a year earlier. As for the right graph, films released in October appear to have the highest median IMDB score, and their maximum is also the highest of any month. Another observation concerns the distance between the first and third quartiles (the interquartile range): it is smallest in March and June, so films released in these months are rated quite similarly by users.
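
The quartiles read off the boxplots can also be computed directly, which makes comparisons such as "the smallest distance between quartiles" easier to verify. This is a minimal sketch on invented data; `score_spread` is a hypothetical helper, and in the notebook it would be called with `df3` and either `'Prem_Year'` or `'Prem_Month'`.

```python
import pandas as pd

# Toy stand-in for df3; the scores are invented for illustration.
toy = pd.DataFrame({
    "Prem_Month": [3, 3, 3, 3, 10, 10, 10, 10],
    "IMDB Score": [6.0, 6.2, 6.4, 6.6, 5.0, 6.0, 7.0, 8.0],
})


def score_spread(frame, by):
    """Return q1, median, q3 and the interquartile range of IMDB Score per group."""
    quartiles = frame.groupby(by)["IMDB Score"].quantile([0.25, 0.5, 0.75]).unstack()
    quartiles.columns = ["q1", "median", "q3"]
    quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]
    return quartiles


spread = score_spread(toy, "Prem_Month")
print(spread)
print("Smallest IQR in month:", spread["iqr"].idxmin())
```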

# 4. The best rated films in the 10 most popular genres

In [None]:
df4 = df[['Genre', 'IMDB Score']].copy()
df4.head()

In [None]:
# Extracting the 10 most popular genres (each has at least 7 films)
condition = df4['Genre'].value_counts() >= 7
df4 = df4[df4['Genre'].apply(lambda l: condition[l])]


# Finding the best rated film(s) of each genre
df4['Max_value'] = df4.groupby(['Genre'])['IMDB Score'].transform('max')
condition = df4['IMDB Score'] == df4['Max_value']

# Filtering and dropping the helper column in one step returns a new frame
df4 = df4[condition].drop(columns=['Max_value'])
df4.head()
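
A shorter way to pick one top film per genre is `idxmax`, sketched below on invented data (the film titles and scores are hypothetical). Note the difference from the cell above: `idxmax` returns only the first film when several share the top score, whereas comparing against `Max_value` keeps all of them. It also assumes the titles in the index are unique.

```python
import pandas as pd

# Toy stand-in for df4 with titles as the index; all values are invented.
toy = pd.DataFrame(
    {
        "Genre": ["Comedy", "Comedy", "Drama", "Drama", "Documentary"],
        "IMDB Score": [6.0, 7.1, 8.0, 7.5, 8.2],
    },
    index=pd.Index(["Film A", "Film B", "Film C", "Film D", "Film E"], name="Title"),
)

# idxmax gives the index label (title) of the highest score within each genre.
best_titles = toy.groupby("Genre")["IMDB Score"].idxmax()
best = toy.loc[best_titles]
print(best)
```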

In [None]:
# Sorting values
df4 = df4.sort_values(by='IMDB Score')

# Joining title and genre
df4['Genre'] = '"' + df4.index + '" (' + df4['Genre'] + ')'
df4 = df4.rename(columns={'Genre': 'Title (Genre)'})

In [None]:
# Creating graph
fig, ax = plt.subplots(figsize=(7, 7))
sns.set_theme(style="darkgrid")

df4.plot.barh(x='Title (Genre)', y='IMDB Score', ax=ax, color='silver')
ax.bar_label(ax.containers[0], label_type='center')

plt.title('The best rated films in the most popular genres')

# 5. Words in title and IMDB score: an additional, ridiculous graph

In [None]:
# Preparing data and counting the words in each title
# (the 'Title' column is overwritten with the word count)
df5 = pd.read_csv('../input/netflix-original-films-imdb-scores/NetflixOriginals.csv', 
                 usecols=['Title', 'IMDB Score'])
df5['Title'] = df5['Title'].str.split(' ').str.len()
df5.head()

In [None]:
# Removing outliers
df5 = df5.query("Title < 8")

# Calculating mean IMDB score
df5_mean = df5.groupby('Title').mean().sort_values('IMDB Score')

In [None]:
# Creating graph
plt.subplots(figsize=(6, 4))
sns.barplot(x=df5_mean.index, y="IMDB Score", data=df5_mean, color="seagreen")
plt.xlabel('Number of words in title')

## 5.1 Fifth graph interpretation
The answer to the question 'how to make a good film' is to give it a title that consists of many words. The more words, the better the rating ;).
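
To see how seriously (or not) to take this, the mean per word count can be put next to the number of films behind it, together with a single correlation value. The sketch below uses made-up titles and scores; `n_words` is a hypothetical column name, used here so that the title text is not overwritten.

```python
import pandas as pd

# Made-up titles and scores, for illustration only.
toy = pd.DataFrame({
    "Title": ["Up", "The Long Road", "A Very Long Title Indeed", "Home", "Two Words"],
    "IMDB Score": [6.0, 6.5, 7.0, 5.5, 6.0],
})

# Count the words in each title.
toy["n_words"] = toy["Title"].str.split().str.len()

# Mean score per word count, alongside how many films each mean is based on.
by_words = toy.groupby("n_words")["IMDB Score"].agg(["size", "mean"])

# Pearson correlation between word count and score across all films.
r = toy["n_words"].corr(toy["IMDB Score"])

print(by_words)
print("Correlation:", round(r, 2))
```

If the groups with many words contain only a few films, a high bar in the graph says little on its own.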

## Thank you for your attention!