# Examples using Pandas and Seaborn

In this notebook, we build some visualizations using Python and the seaborn package. We use the [Netflix Dataset](https://www.kaggle.com/shivamb/netflix-shows) available on Kaggle Datasets. With Pandas and Seaborn, we can build useful plots to answer some general questions about this dataset.

<img src="https://www.ksat.com/resizer/6elnlC7JWa-cqF4kuu6Imx2AKPI=/1600x900/smart/filters:format(jpeg):strip_exif(true):strip_icc(true):no_upscale(true):quality(65)/cloudfront-us-east-1.images.arcpublishing.com/gmg/6AKDZZENWRHF5OGJGDCCMID5LY.jpg">

In [None]:
# Setup environment 
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Set matplotlib config
%matplotlib inline
sns.set_theme()  # Applies the seaborn look; the 'seaborn' style name is no longer accepted by newer Matplotlib releases
print("Setup Complete")

In [None]:
# Load data
data = pd.read_csv("../input/netflix-shows/netflix_titles.csv")
data.head()


How has the number of Movies and TV shows released evolved over the last twenty years?

In [None]:
# Aggregate data  
data_year = data.groupby(["release_year","type"],as_index = False).size()

# Keep only recent Movies and TV shows (released in 2000 or later)
data_year = data_year[data_year.release_year >= 2000]

# Set size
plt.figure(figsize=(20,6))

# Print a lineplot
sns.lineplot(x="release_year", y="size", hue="type", data=data_year)
# Set some features
plt.xlabel("Release year")
plt.ylabel("Frequency")
plt.title("Release of Movies and TV Shows", size=24); # A trailing semicolon stops Jupyter from echoing the returned object when the statement ends a cell

# Annotate the end of the line to flag that something unusual happened there
plt.annotate('COVID effect?', xy=(2020, 30), xytext=(2016, 50), arrowprops={'facecolor':'red', 'shrink':0.05}); 

We can observe a trend in which movies are more numerous than TV shows; however, the number of movies drops sharply in 2020, likely because of COVID-19.
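
The counts behind this line plot come from a grouped `size()`. As a minimal sketch of that aggregation, the frame below uses made-up rows (not taken from the dataset) and assumes a pandas version in which `as_index=False` names the count column `size`, as the notebook itself relies on. The `pivot` step is an addition for readability only.

```python
import pandas as pd

# Made-up rows standing in for netflix_titles.csv (illustrative values only)
toy = pd.DataFrame({
    "release_year": [2018, 2018, 2018, 2019, 2019],
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show"],
})

# as_index=False keeps the grouping keys as columns and names the count column "size"
counts = toy.groupby(["release_year", "type"], as_index=False).size()

# One row per year and one column per type makes the comparison easier to read
wide = counts.pivot(index="release_year", columns="type", values="size")
print(wide)
```

Reading the table row by row gives the same comparison the two lines in the plot show.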

In [None]:
# Find the countries where most Movies and TV shows are produced
countries = data[['country','type']].copy()

# Some titles list several countries in a single comma-separated string
#print(countries)

# Split each string into a list, then use explode to get one row per country
countries['country'] = countries['country'].str.split(',')
countries_split = countries.explode('country')
# Strip the whitespace left after each comma so that " India" and "India" are counted together
countries_split['country'] = countries_split['country'].str.strip()

# We can see the result
print(countries_split)

Which countries produce the most Movies and TV shows?

In [None]:
# We can group based on country column
data_country = countries_split.groupby(['country','type'],as_index = False).size()

# Build the top 10 for each type: one for TV shows and one for movies
country_tv = data_country[data_country['type'] == 'TV Show'].sort_values('size',ascending = False).head(10)
country_movies = data_country[data_country['type'] == 'Movie'].sort_values('size',ascending = False).head(10)

# Rebuild data_country with only the top countries for each type
data_country = pd.concat([country_tv,country_movies])

#Set size
plt.figure(figsize=(20,6))

# Create a barplot
sns.barplot(x='country', y="size", hue="type", data=data_country);
plt.xlabel('Country')
plt.ylabel('Number')
plt.title('Numbers of Movies and TV shows by country', size=24);
plt.legend(loc='upper right');

We see that the United States and India are the countries with the most movies. For TV shows, the United States and the United Kingdom are the leaders.
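
These country counts depend on how multi-country entries are split. The sketch below uses a made-up frame to show the split and explode steps, and why stripping whitespace matters: without it, a label such as " India" is counted separately from "India". The `ignore_index=True` argument is a convenience added here and is not part of the notebook's code.

```python
import pandas as pd

# Made-up rows; real entries in the "country" column look similar
toy = pd.DataFrame({
    "country": ["United States, India", "India", "United Kingdom,United States"],
    "type": ["Movie", "Movie", "TV Show"],
})

# One list per row, then one row per list element
exploded = toy.assign(country=toy["country"].str.split(",")).explode(
    "country", ignore_index=True
)

# Before stripping, " India" and "India" are distinct labels
raw_labels = exploded["country"].nunique()

exploded["country"] = exploded["country"].str.strip()
clean_labels = exploded["country"].nunique()

print(raw_labels, clean_labels)
print(exploded["country"].value_counts())
```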

What are the most popular genres?

In [None]:
# Identify the mainstream genres. The "listed_in" column stores them as a single
# comma-separated string per title, so it has to be ungrouped first.

# Split into one column per genre (the code assumes at most three per title)
data_genres = data["listed_in"]
data_list = data_genres.str.split(",",expand = True)
data_list = data_list.rename(columns={0: "First", 1: "Second",2 :"Third"})

# Stack the three columns into a single Series and count each genre
genre_list = pd.concat([data_list["First"],data_list["Second"],data_list["Third"]],ignore_index = True).dropna()
ungrouped = genre_list.groupby(genre_list).size().sort_values(ascending = False)
#print(ungrouped)

# "Dramas" and " Dramas" were counted separately because of the leading
# whitespace left by the split, so strip it before counting again
genre_list2 = genre_list.str.strip()
ungrouped2 = genre_list2.groupby(genre_list2).size().sort_values(ascending = False)
#print(ungrouped2)

# Here is the code for our plot

# Set size
plt.figure(figsize=(20,6))

# Draw a bar plot of the counts (the Series index supplies the x labels)
ungrouped2.plot.bar(rot=90);
plt.xlabel('Genres')
plt.ylabel('Frequency')
plt.title('Genres in Movies and TV shows', size=24);

Based on this graph, we can observe that 'International Movies', 'Dramas' and 'Comedies' are the most frequent classes. However, the first one can hardly be considered a genre, so we can say that Drama and Comedy are the most popular genres. Someone may want to dig deeper and reclassify some categories to make the result clearer.
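
One possible way to do that reclassification is a lookup table from catalogue labels to broader classes. The sketch below is only an illustration: the counts are invented, and the `broader` mapping is a hypothetical choice, not something defined by the dataset.

```python
import pandas as pd

# Invented counts; the labels mimic values found in "listed_in"
genre_counts = pd.Series(
    {"International Movies": 5, "Dramas": 4, "TV Dramas": 2, "Comedies": 3, "TV Comedies": 1}
)

# Hypothetical mapping (an assumption). None marks labels that describe
# origin or format rather than a genre, so they are left out of the result.
broader = {
    "International Movies": None,
    "Dramas": "Drama",
    "TV Dramas": "Drama",
    "Comedies": "Comedy",
    "TV Comedies": "Comedy",
}

df = genre_counts.rename_axis("label").reset_index(name="count")
df["genre"] = df["label"].map(broader)

# Drop unmapped labels, then add up the counts of each broader class
reclassified = (
    df.dropna(subset=["genre"])
      .groupby("genre")["count"]
      .sum()
      .sort_values(ascending=False)
)
print(reclassified)
```

How labels are merged is a judgment call, so the mapping should be reviewed against the full list of labels before drawing conclusions from it.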

In [None]:
# Finally, create a word cloud from the text of all the descriptions.

# First, join every description into a single string
text = " ".join(review for review in data.description)
# Check the length of the resulting string
print(len(text))

# Create and generate a word cloud image
wordcloud = WordCloud(background_color="white").generate(text)

# Display the generated image
plt.figure(figsize=(20,6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title('Most frequent words in the descriptions of Movies and TV shows', size=24);
plt.show()

We can see that "life", "family", "find", "love" and "world" are the most frequent words in this set of Movies and TV shows.
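
A word cloud shows relative frequency by size, but it does not give numbers. As a rough cross-check, the words can be counted directly with the standard library. The descriptions and the tiny stopword set below are made up for illustration; `WordCloud` applies its own, much longer `STOPWORDS` set by default.

```python
from collections import Counter
import re

# Made-up descriptions, not taken from the dataset
descriptions = [
    "A family must find a new life after the war.",
    "Two friends find love in a world that is falling apart.",
    "Her life changes when her family moves abroad.",
]

# Tiny illustrative stopword list (an assumption)
stopwords = {"a", "the", "in", "that", "is", "her", "when", "after", "must", "two"}

# Lowercase everything, keep alphabetic tokens, and drop the stopwords
words = re.findall(r"[a-z']+", " ".join(descriptions).lower())
freq = Counter(w for w in words if w not in stopwords)

print(freq.most_common(5))
```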

A rating is a classification that indicates the target audience of each Movie or TV show. Some movies are aimed at a general audience, while others are intended for people older than 18. Based on this, we can ask: what is the proportion of Movies and TV shows by rating?

In [None]:
# Count the number of titles per rating
data_rating = data[['rating']].groupby('rating',as_index = False).size()
print(data_rating.head())

# Create a pie chart
plt.figure(figsize=(20,6))

# Build a custom legend instead of slice labels, because there are many ratings
patches, texts = plt.pie(data_rating['size'], startangle=90, radius=1.2)
labels = ['{0} - {1:1.2f} %'.format(i,j) for i,j in zip(data_rating['rating'], (data_rating['size']/data_rating['size'].sum())*100)]

# Sort the legend entries from the largest slice to the smallest
sort_legend = True
if sort_legend:
    patches, labels, dummy =  zip(*sorted(zip(patches, labels, data_rating['size']),
                                          key=lambda x: x[2],
                                          reverse=True))

plt.legend(patches, labels, loc='best', bbox_to_anchor=(-0.1, 1.),
           fontsize=12)
plt.title('Proportion of Movies and TV shows based on rating', size=24);
plt.show()

The most common ratings are TV-14 and TV-MA.
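
The percentages shown in the legend can also be obtained without a plot. This sketch uses `value_counts(normalize=True)` on a made-up list of ratings, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Made-up ratings, including one missing value
ratings = pd.Series(
    ["TV-MA", "TV-14", "TV-MA", "PG-13", "TV-14", "TV-MA", None], name="rating"
)

# normalize=True returns fractions; missing ratings are dropped by default
share = ratings.value_counts(normalize=True).mul(100).round(2)

# Same "label - percent" format as the pie chart legend
legend_labels = [f"{rating} - {pct:.2f} %" for rating, pct in share.items()]
print(legend_labels)
```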

Another interesting aspect is which directors appear most often. Who are the directors with the most movies on Netflix?

In [None]:
# As with countries, some movies list more than one director.
data_directors = data[data['type'] == 'Movie'][['director']].copy()
#print(data_directors)

# Split them into one row per director and strip the whitespace left after each comma
data_directors['director'] = data_directors['director'].str.split(',')
data_directors = data_directors.explode('director')
data_directors['director'] = data_directors['director'].str.strip()
#print(data_directors)

# We obtain the final dataset
directors = data_directors.groupby('director',as_index = False).size().sort_values('size',ascending = False)

# Create the barplot with the 10 most popular directors
plt.figure(figsize=(20,6))
sns.barplot(x='director', y="size", data=directors.iloc[0:10]);
plt.title("Most popular directors of Netflix's Movies", fontsize=24)
plt.xlabel('Directors')
plt.ylabel('Number of movies')
plt.show()

As we can see, Jan Suter and Raul Campos are the directors with the most movies on the Netflix platform.

Now for a slightly more complex analysis: the duration of Movies and how it has changed in recent years. We first need to prepare the data to get the right information.

In [None]:
# Prepare the data. We only consider Movies here
data_duration_movie = data[data['type'] == 'Movie'][['duration','release_year']].copy()

# Durations are given in minutes
#print(data_duration_movie.head())

# Check that every value uses the minute format
#print(data_duration_movie[~data_duration_movie['duration'].str.contains("min")]) # Returns an empty dataframe

# Convert the column to integer
data_duration_movie['duration'] = data_duration_movie['duration'].str.replace(' min', '').astype(int)

# Do the same with the TV shows data
data_duration_tv = data[data['type'] == 'TV Show'][['duration','release_year']].copy()

# Durations are given in seasons
#print(data_duration_tv.head())

# Convert the column to integer. Both "Season" and "Seasons" appear, so remove either form
data_duration_tv['duration'] = data_duration_tv['duration'].str.replace('Season.*', '',regex=True).astype(int)

# Finally, we can create a histogram and a barplot to show data
fig, ax = plt.subplots(1,2, figsize=(20, 6))

# First plot
f1 = sns.histplot(x = data_duration_movie['duration'],kde = True,bins = 20,ax = ax[0]);
f1.set_title("Duration of Movies", fontsize=24)
f1.set_xlabel('Minutes')
f1.set_ylabel('Frequency')

# Second plot
f2 = sns.barplot(x='duration', y="size", data=data_duration_tv.groupby('duration',as_index = False).size(),ax = ax[1]);
f2.set_title("Duration of TV Shows", fontsize=24)
f2.set_xlabel('Number of seasons')
f2.set_ylabel('Frequency')
plt.show()

In [None]:
# We can also draw a violin plot to see the distribution of movie durations per release year over the last 10 years
data_duration_movie_year = data_duration_movie[data_duration_movie['release_year']>= 2010]

# Create the plot
plt.figure(figsize=(20,6))
sns.violinplot(x="release_year", y="duration", data=data_duration_movie_year) 
plt.title("Duration of Movies over the last 10 years", fontsize=24)
plt.xlabel('Release year')
plt.ylabel('Minutes')
plt.show()

Overall, Movies run for around 100 minutes, and the general trend is slightly downward.
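
A violin plot shows the shape of each distribution, but a trend is easier to confirm with a single number per year. The sketch below computes the median duration per release year on made-up rows; the values are invented and only illustrate the calculation.

```python
import pandas as pd

# Made-up movies; "duration" uses the same "<n> min" format as the dataset
movies = pd.DataFrame({
    "release_year": [2018, 2018, 2019, 2019, 2020],
    "duration": ["110 min", "90 min", "100 min", "94 min", "88 min"],
})

# Same conversion as in the notebook: remove the unit, then cast to integer
movies["minutes"] = movies["duration"].str.replace(" min", "", regex=False).astype(int)

# The median is less sensitive to a few very long or very short titles than the mean
median_by_year = movies.groupby("release_year")["minutes"].median()
print(median_by_year)
```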

In [None]:
# Now explore how the number of seasons of TV shows has changed over the last 10 years
data_duration_tv_year = data_duration_tv[data_duration_tv['release_year']>= 2010]

# Create the plot
plt.figure(figsize=(20,6))
sns.violinplot(x="release_year", y="duration", data=data_duration_tv_year) 
plt.title("Number of seasons of TV shows over the last 10 years", fontsize=24)
plt.xlabel('Release year')
plt.ylabel('Number of seasons')
plt.show()

In this case, we can see that the number of seasons of TV shows has stayed relatively flat in recent years. I find it amazing that a TV show has 17 seasons :O.
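
To find out which title sits at the top of the violin, the row with the largest season count can be looked up directly. The titles below are placeholders, and `str.extract` is used here as an alternative to the notebook's `str.replace`, since it handles "Season" and "Seasons" with one pattern.

```python
import pandas as pd

# Placeholder titles; "duration" uses the same format as the dataset
shows = pd.DataFrame({
    "title": ["Show A", "Show B", "Show C"],
    "duration": ["1 Season", "17 Seasons", "3 Seasons"],
})

# Pull out the leading number regardless of singular or plural wording
shows["seasons"] = shows["duration"].str.extract(r"(\d+)", expand=False).astype(int)

# idxmax returns the index label of the row with the most seasons
longest = shows.loc[shows["seasons"].idxmax()]
print(longest["title"], longest["seasons"])
```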

Some references used in creating this notebook:
* https://www.kaggle.com/radmirzosimov/netflix-eda-with-plotly-seaborn
* https://stackoverflow.com/questions/13682044/remove-unwanted-parts-from-strings-in-a-column
* https://seaborn.pydata.org/generated/seaborn.violinplot.html
* https://towardsdatascience.com/complete-guide-to-data-visualization-with-python-2dd74df12b5e