# **Netflix: Movies and TV Shows (preprocessing and cleaning)**
 
The purpose of this notebook is to split the comma-separated columns into tables of __unique actors, directors, countries, and genres__. We then analyze the reshaped data to look for interesting patterns and to meet the task's requirements.

**Task Details:**
As mentioned above, several columns in this dataset hold comma-separated values, which makes it difficult to count how many titles an actor or actress appeared in or how many titles a director has made.

**Expected Submission:**
Cleanse the comma-separated values into tables for unique actors, directors, countries, and genres that can be linked back to the original dataset via the **"show_id"** field.
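
As a minimal sketch of what such a linked table could look like, the toy example below splits a comma-separated column into one row per value while keeping `show_id`. The sample rows and the names `toy`, `actors_long`, and `titles_per_actor` are made up for illustration; this long format is an alternative to the one-hot tables built later in the notebook, not a replacement for them.

```python
import pandas as pd

# Hypothetical sample rows that mimic the structure of the Netflix data
toy = pd.DataFrame({
    'show_id': ['s1', 's2', 's3'],
    'cast': ['Actor A, Actor B', 'Actor B', None],
})

# Split on the ', ' separator and emit one row per (show_id, actor) pair
actors_long = (
    toy.dropna(subset=['cast'])
       .assign(actor=lambda d: d['cast'].str.split(', '))
       .explode('actor')[['show_id', 'actor']]
       .reset_index(drop=True)
)

# Titles per actor, which is hard to compute from the raw column
titles_per_actor = actors_long.groupby('actor')['show_id'].nunique()
```

Each row of `actors_long` still carries `show_id`, so it can be joined back to the source table at any time.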

In [None]:
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
from mlxtend.frequent_patterns import apriori
from mlxtend.preprocessing import TransactionEncoder

In [None]:
# Import dataset
df = pd.read_csv('../input/netflix-shows/netflix_titles.csv')

# Observe dataset
df.head()

In [None]:
# Inspect dataset
df.info()

In [None]:
# Check for null values
df.isna().sum()

# Country Cleanse
---
Let's start with `country`, the country where the movie/show was produced. Of the columns we need to clean, it has the fewest null values: 507. Let's look at the first 5 of these records.

In [None]:
df[df.country.isna()].head()

Instead of dropping the records with null countries, we can try to restore the information by merging with another dataset. One large, well-known movie/TV show dataset that comes to mind is IMDb.

In [None]:
imdb = pd.read_csv('../input/imdb-extensive-dataset/IMDb movies.csv', low_memory=False)
imdb.head()

Because the IMDb dataset has no key in common with the Netflix data, we'll merge on title and production year. First, though, we need to clean the IMDb dataset.

In [None]:
# one record in the 'year' column of IMDb has inappropriate formatting
imdb[imdb.year.str.contains(' ')]

In [None]:
# keep only the trailing year of the malformed value
bad_year = imdb.year.str.contains(' ')
imdb.loc[bad_year, 'year'] = imdb.loc[bad_year, 'year'].str.split(' ').str[-1]

In [None]:
# parse both year columns and keep the integer year so the merge keys share a dtype
df.release_year = pd.to_datetime(df.release_year, format='%Y').dt.year
imdb.year = pd.to_datetime(imdb.year, format='%Y').dt.year

In [None]:
# drop duplicates in subset
imdb = imdb.drop_duplicates(subset=['original_title', 'year'])

In [None]:
# merge 2 datasets by title and release year
df = df.merge(imdb.add_suffix('_imdb'), how='left', left_on=['title','release_year'], right_on=['original_title_imdb','year_imdb'])

In [None]:
# number of null countries that IMDb can fill
len(df[(df.country.isna()) & (df.country_imdb.notnull())])

Unfortunately, we can restore only **13** records. Still, it's better than nothing. The low match count might be because the IMDb dataset is not big enough and lacks some of the movies/TV shows in the Netflix dataset. A possible solution is to merge with a _larger movie/TV show dataset_ that also contains titles in foreign languages.
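
One way to see how many Netflix rows found a partner at all is the `indicator` option of `DataFrame.merge`. The sketch below uses hypothetical stand-in tables (`netflix_toy`, `imdb_toy`) rather than the real data, so the numbers only illustrate the mechanics.

```python
import pandas as pd

# Hypothetical stand-ins for the Netflix and IMDb tables
netflix_toy = pd.DataFrame({
    'title': ['Film A', 'Film B', 'Film C'],
    'release_year': [2018, 2019, 2020],
})
imdb_toy = pd.DataFrame({
    'original_title': ['Film A', 'Film C'],
    'year': [2018, 2021],          # 'Film C' has a different year, so no match
    'country': ['France', 'Japan'],
})

merged = netflix_toy.merge(
    imdb_toy, how='left',
    left_on=['title', 'release_year'],
    right_on=['original_title', 'year'],
    indicator=True,                # adds a '_merge' column
)

# Share of Netflix rows that found an IMDb partner
match_rate = (merged['_merge'] == 'both').mean()
```

A low `match_rate` on the real data would suggest that the join keys, rather than the fill step, limit how many records can be restored.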

In [None]:
# replace these null country records 
df.loc[df.country.isna(), 'country'] = df.loc[df.country.isna(), 'country_imdb']

Let's perform the same procedure for cast and director columns:

In [None]:
# number of replaceable cast records
len(df.loc[(df.cast.isna()) & (df.actors_imdb.notnull())])

In [None]:
# replace null cast records by actors from IMDb
df.loc[(df.cast.isna()) & (df.actors_imdb.notnull()), 'cast'] = df.loc[(df.cast.isna()) & (df.actors_imdb.notnull()), 'actors_imdb']

In [None]:
# number of replaceable director records
len(df.loc[(df.director.isna()) & (df.director_imdb.notnull())])

In [None]:
# replace null director records by director from IMDb
df.loc[(df.director.isna()) & (df.director_imdb.notnull()), 'director'] = df.loc[(df.director.isna()) & (df.director_imdb.notnull()), 'director_imdb']

As a result, we can restore the cast for 11 records and the director for 30 records from IMDb. Now we drop the IMDb columns because they are not needed for further analysis.
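
The same fill-from-a-second-column step can also be written with `Series.fillna`, which only touches missing entries. This is a sketch on made-up rows; the `toy` frame and the `restorable` name are illustrative and not part of the notebook's data.

```python
import pandas as pd

# Hypothetical rows: one value present, one restorable, one missing in both
toy = pd.DataFrame({
    'country': ['Brazil', None, None],
    'country_imdb': ['Brazil', 'Spain', None],
})

# Number of missing values that the second column can fill
restorable = (toy['country'].isna() & toy['country_imdb'].notna()).sum()

# Fill only the missing entries; existing values are left untouched
toy['country'] = toy['country'].fillna(toy['country_imdb'])
```

Counting `restorable` before filling mirrors the `len(...)` checks in the cells above.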

In [None]:
# drop the IMDb columns (all carry the '_imdb' suffix added during the merge)
df = df.drop(columns=[c for c in df.columns if c.endswith('_imdb')])

In [None]:
# check decreased null values
df.isna().sum()

# Country Analysis
---
Rather than list the production countries with the highest counts and build a bar plot from them (to be honest, you can find that kind of information in other notebooks), I would like to analyze how countries interact in the production process (for example, which country the US co-produces the most movies/TV shows with) and to create a new dataframe that meets the task's requirements.

One-hot encoding can help with this. The approach is typically used in market basket analysis to create association rules, but I think it is suitable in our case as well.
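
Before applying the encoder to the real data, a pandas-only sketch on toy rows shows what one-hot encoding produces. It uses `Series.str.get_dummies` instead of the `TransactionEncoder` used in this notebook, and the three sample rows are invented for illustration.

```python
import pandas as pd

# Hypothetical sample rows keyed by show_id
toy = pd.DataFrame({
    'show_id': ['s1', 's2', 's3'],
    'country': ['Brazil', 'United States, Canada', 'United States'],
}).set_index('show_id')

# One boolean column per unique country; rows stay keyed by show_id
onehot_toy = toy['country'].str.get_dummies(sep=', ').astype(bool)

# Share of titles in which each country took part
share_toy = onehot_toy.mean()
```

Because each column is boolean, its mean is the share of titles in which that country took part.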

In [None]:
# subset dataset and split 
country = df.loc[df.country.notnull(), 'country'].astype('str').apply(lambda t: t.split(', '))

# Convert DataFrame column into list of strings
country = list(country)

# number of movies/TV shows without null values
len(country)

In [None]:
# Instantiate encoder and identify unique country
encoder = TransactionEncoder().fit(country)

# One-hot encode
onehot_country = encoder.transform(country)

# Convert one-hot encoded data to DataFrame and set show_id as index
onehot_country = pd.DataFrame(onehot_country, columns = encoder.columns_, index=df.loc[df.country.notnull(), 'show_id'])

# Print the one-hot encoded country dataset
onehot_country.head()

This table is not hard to interpret. For example, the movie/TV show with __show_id__ _"s1"_ was made in Brazil, so the Brazil column is _True_ for this row and the remaining columns are _False_. Using this new table, we can calculate each country's share of movie/TV show production in the Netflix dataset.

In [None]:
# Print the one-hot encoded country share dataset
country_share = onehot_country.mean().sort_values(ascending=False).round(4) * 100
country_share

In [None]:
# keep countries with a share above 1%
country_share = country_share[country_share > 1]
labels = country_share.round(3).astype('str') + ' %'

fig1, ax1 = plt.subplots(figsize=(20,15), facecolor='white')
ax1.pie(country_share, labels=labels, labeldistance=1.05,
        shadow=True)
plt.title('Percent of produced Movies/TV Show by Country', fontsize=20)
plt.legend(labels=country_share.index, loc='upper right')
plt.show()

According to the pie chart above, 45.2% of movies/TV shows were produced by the United States, either alone or in collaboration with other countries. Now I am interested in which countries collaborate most with the United States to produce movies/TV shows.

In [None]:
# Compute frequency using the Apriori algorithm
frequency = apriori(onehot_country[onehot_country['United States'] == True], 
                    min_support = 0.0001, 
                    max_len = 4, 
                    use_colnames = True).rename({'support':'frequency', 'itemsets':'Countries'}, axis=1)

# keep itemsets that contain 'United States' and at least one other country, then sort by frequency
frequency = frequency[(frequency.Countries.apply(lambda t: 'United States' in t)) & (frequency.Countries.apply(lambda t: len(t) >= 2))]\
                    .sort_values('frequency', ascending=False).round(3)

# Print a preview of the frequency
frequency.head()

According to the table above, 7% of all movies/shows in which the US participated were made in collaboration with the United Kingdom, 6% with Canada, and 3% with France.
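
The pairwise figures can be cross-checked without Apriori: among the rows where the United States is _True_, the mean of any other country column equals the frequency of that two-country combination. The sketch below demonstrates this on a hypothetical one-hot table (`onehot_toy`), not on the real `onehot_country`.

```python
import pandas as pd

# Hypothetical one-hot table: rows are titles, columns are countries
onehot_toy = pd.DataFrame(
    {
        'United States':  [True, True, True, True, False],
        'United Kingdom': [True, False, False, True, True],
        'Canada':         [False, True, False, False, False],
    },
    index=['s1', 's2', 's3', 's4', 's5'],
)

# Keep only titles with US participation
us_rows = onehot_toy[onehot_toy['United States']]

# Share of US titles that also list each other country
pair_share = (
    us_rows.drop(columns='United States')
           .mean()
           .sort_values(ascending=False)
)
```

In the toy table, two of the four US titles also list the United Kingdom and one lists Canada, so `pair_share` is 0.5 and 0.25 respectively.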

Finally, we reset the index of the one-hot encoded country dataset so that "show_id" becomes a regular column and the table can be linked back to the original dataset.

In [None]:
onehot_country = onehot_country.reset_index()
onehot_country.head()

# Genre Cleanse and Analysis
---
We can use the same one-hot encoding technique to clean and analyze the genre variable.

In [None]:
# subset dataset and split 
genre = df['listed_in'].apply(lambda t: t.split(', '))

# Convert DataFrame column into list of strings
genre = list(genre)

# number of movies/TV Shows
len(genre)

In [None]:
# Instantiate transaction encoder and identify unique items
encoder = TransactionEncoder().fit(genre)

# One-hot encode transactions
onehot = encoder.transform(genre)

# Convert one-hot encoded data to DataFrame and set show_id as index
onehot_genre = pd.DataFrame(onehot, columns = encoder.columns_, index=df['show_id'])

# Shape of the one-hot encoded genre dataset
onehot_genre.shape

There are 42 different genres in our dataset across 7787 movies/TV shows. We can calculate the number of movies/TV shows for each genre and visualize it.

In [None]:
genre_count = onehot_genre.sum().sort_values(ascending=False)
genre_count.head()

In [None]:
plt.style.use('ggplot')
plt.figure(figsize=(20, 10))
genre_count.plot(kind='bar')
plt.xticks(rotation=90)
plt.tick_params(axis='x', labelsize=15)
plt.title('Number of Movies/TV Shows by genre', fontsize=20)
plt.show()

It looks like the most popular genre is "International Movies", but to me that label doesn't make much sense as a genre. So let's look at the genres most commonly combined with "International Movies".

In [None]:
# Compute frequent itemsets using the Apriori algorithm
frequency = apriori(onehot_genre[onehot_genre['International Movies'] == True], 
                    min_support = 0.0001, 
                    max_len = 2, 
                    use_colnames = True).rename({'support':'frequency', 'itemsets':'Genre'}, axis=1)

# keep itemsets that contain 'International Movies' and one other genre, then sort by frequency
frequency = frequency[(frequency.Genre.apply(lambda t: 'International Movies' in t)) & (frequency.Genre.apply(lambda t: len(t) >= 2))]\
                    .sort_values('frequency', ascending=False).round(3)

frequency.head()

According to the table above, 53% of all International Movies (2437) are dramas, 30% are comedies, and 15% are in the action & adventure genre.

You are probably wondering why the percentages don't add up to 100%. The reason is that a title can have a combination of genres, for example, drama and comedy. The Apriori algorithm counts each combination across all records and returns its frequency.

For instance, suppose we have two international movies with genres "dramas, crime, documentaries" and "dramas, comedies". In this case, 100% of international movies are drama and 50% are crime. 
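
The two-movie example above can be reproduced in a few lines. This is a pandas-only sketch with the two hypothetical titles; it uses `Series.str.get_dummies` in place of the encoder and Apriori steps used in the notebook.

```python
import pandas as pd

# The two hypothetical international movies from the paragraph above
toy = pd.Series(
    ['Dramas, Crime, Documentaries', 'Dramas, Comedies'],
    index=['s1', 's2'],
)

# Share of the two titles that carry each genre
genre_share = toy.str.get_dummies(sep=', ').mean()

# Shares overlap because one title can carry several genres
total = genre_share.sum()
```

Here `genre_share` is 1.0 for Dramas and 0.5 for each of the other genres, so `total` is 2.5 rather than 1.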

Let's look at the International TV Shows genres:

In [None]:
# Compute frequent itemsets using the Apriori algorithm
frequency = apriori(onehot_genre[onehot_genre['International TV Shows'] == True], 
                    min_support = 0.0001, 
                    max_len = 2, 
                    use_colnames = True).rename({'support':'frequency', 'itemsets':'Genre'}, axis=1)

# keep itemsets that contain 'International TV Shows' and one other genre, then sort by frequency
frequency = frequency[(frequency.Genre.apply(lambda t: 'International TV Shows' in t)) & (frequency.Genre.apply(lambda t: len(t) >= 2))]\
                    .sort_values('frequency', ascending=False).round(3)

frequency.head()

The most popular genre among International TV Shows is also drama (40%). However, romantic and crime shows take 2nd and 3rd place respectively, which makes sense for the TV segment: people tend to watch detective and romantic dramas on TV.

Now, again, we reset the index of the one-hot genre dataset in case we want to merge it with the initial dataset.

In [None]:
# reset index with show_id information
onehot_genre = onehot_genre.reset_index()
onehot_genre.head()
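
With `show_id` as a regular column, linking an encoded table back to the source data is a single merge. The sketch below uses hypothetical miniature tables (`source_toy`, `onehot_genre_toy`) to show the idea.

```python
import pandas as pd

# Hypothetical miniature versions of the source and encoded tables
source_toy = pd.DataFrame({
    'show_id': ['s1', 's2'],
    'title': ['Title A', 'Title B'],
})
onehot_genre_toy = pd.DataFrame({
    'show_id': ['s1', 's2'],
    'Dramas': [True, False],
    'Comedies': [False, True],
})

# Join on the shared key, then filter the source rows by a genre flag
linked = source_toy.merge(onehot_genre_toy, on='show_id', how='left')
drama_titles = linked.loc[linked['Dramas'], 'title'].tolist()
```

Naming the key with `on='show_id'` keeps the join explicit even if the two tables happen to share other column names.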

# Cast Cleanse and Analysis
---
We apply the same procedure to the cast column.

In [None]:
# subset dataset and split 
cast = df.loc[df.cast.notnull(),'cast'].astype('str').apply(lambda t: t.split(', '))

# Convert DataFrame column into list of strings
cast = list(cast)

# number of movies/TV Shows
len(cast)

In [None]:
# Instantiate encoder and identify unique records
encoder = TransactionEncoder().fit(cast)

# One-hot encode
onehot = encoder.transform(cast)

# Convert one-hot encoded data to DataFrame and set show_id as index
onehot_cast = pd.DataFrame(onehot, columns = encoder.columns_, index=df.loc[df.cast.notnull(),'show_id'])

# Shape of the one-hot encoded cast dataset
onehot_cast.shape

According to the one-hot encoded dataset, there are 32966 actors/actresses in a total of 7080 Movies/TV Shows. Let's look at the top 5 actors/actresses with the highest numbers of Movies/TV shows:

In [None]:
onehot_cast.sum().sort_values(ascending=False).head()

Unfortunately, these names don't mean much to me. So I wrote a function that returns the Netflix records for a given name:

In [None]:
# return the Netflix rows whose cast string contains the given name (substring match)
def cast(actor):
    data = df[df.cast.astype('str').apply(lambda t: actor in t)]
    return data

In [None]:
# apply a function to the top actor and look at the first 5 Movies/TV Shows
cast('Anupam Kher').head()
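
Note that the `cast` helper matches substrings, so a short name could also match a longer name that contains it. If exact names matter, one option is to split the string first and test list membership. The sketch below shows the difference on invented rows; `titles_with` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Hypothetical rows: 'Ann Lee' is a substring of 'Mary Ann Lee'
toy = pd.DataFrame({
    'show_id': ['s1', 's2'],
    'cast': ['Ann Lee, Bob Ray', 'Mary Ann Lee'],
})

def titles_with(name, data=toy):
    """Return rows whose cast list contains exactly `name`."""
    names = data['cast'].fillna('').str.split(', ')
    return data[names.apply(lambda lst: name in lst)]

# Substring search matches both rows; exact matching only the first
substring_hits = toy[toy['cast'].str.contains('Ann Lee', regex=False)]
exact_hits = titles_with('Ann Lee')
```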

Now I am interested in which actor appears most often in movies that the US took part in producing. To find out, we need to reset the index and merge the "onehot_cast" dataset with the appropriate data. We could merge with the initial Netflix data, but we want to group all movies in which the US was involved in production, including co-productions. The initial dataset cannot easily give us that, but the encoded country dataset can :)

This also shows how the newly created datasets can be used.

In [None]:
# reset index
onehot_cast = onehot_cast.reset_index()

In [None]:
# merge show type first, joining explicitly on show_id
cast_country = onehot_cast.merge(df[['show_id', 'type']], how='left', on='show_id')

# merge one-hot encoded country dataset
cast_country = cast_country.merge(onehot_country, how='left', on='show_id')

In [None]:
# filter by movie type and the US country
cast_us = cast_country.loc[(cast_country.type == 'Movie') & (cast_country['United States'] == True)]

In [None]:
# calculate the total number of American movies of actors/actresses
us_cast_count = cast_us.loc[:,onehot_cast.columns].drop('show_id', axis=1)\
                       .sum().sort_values(ascending=False)

# Top-5 actors/actresses
us_cast_count.head()
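
The wide one-hot cast table has one column per actor, which makes this merge fairly heavy. As a sketch of a lighter alternative, the same question can be answered with long tables that hold one row per (show_id, value). The sample rows and the `to_long` helper are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical sample rows in the shape of the Netflix data
toy = pd.DataFrame({
    'show_id': ['s1', 's2', 's3'],
    'type': ['Movie', 'Movie', 'TV Show'],
    'country': ['United States, Canada', 'India', 'United States'],
    'cast': ['Actor A, Actor B', 'Actor A', 'Actor B'],
})

def to_long(data, column, name):
    """One row per (show_id, value) for a comma-separated column."""
    out = data[['show_id', column]].dropna()
    out = out.assign(**{name: out[column].str.split(', ')}).explode(name)
    return out[['show_id', name]]

countries = to_long(toy, 'country', 'country_name')
actors = to_long(toy, 'cast', 'actor')

# show_ids of movies with US participation
us_movies = toy.loc[toy['type'] == 'Movie', ['show_id']].merge(
    countries[countries['country_name'] == 'United States'], on='show_id'
)

# Number of US movies per actor
us_actor_count = (
    actors[actors['show_id'].isin(us_movies['show_id'])]['actor']
    .value_counts()
)
```

In the toy data only `s1` is a movie with US participation, so each of its two actors is counted once.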

It looks like Adam Sandler appears more than anyone else in movies produced by the US, alone or with other countries. The Netflix dataset contains 19 of his movies. Let's look at them using our function:

In [None]:
# First 5 movies with Adam Sandler
cast('Adam Sandler').head()

In [None]:
# number of actors/actresses in American movies
len(us_cast_count[us_cast_count > 0])

# Director Cleanse
---

In [None]:
# subset dataset and split 
director = df.loc[df.director.notnull(),'director'].astype('str').apply(lambda t: t.split(', '))

# Convert DataFrame column into list of strings
director = list(director)

# number of movies/TV Shows
len(director)

In [None]:
# Instantiate encoder and identify unique records
encoder = TransactionEncoder().fit(director)

# One-hot encode
onehot = encoder.transform(director)

# Convert one-hot encoded data to DataFrame and set show_id as index
onehot_director = pd.DataFrame(onehot, columns = encoder.columns_, 
                               index=df.loc[df.director.notnull(),'show_id'])

# Shape of the one-hot encoded director dataset
onehot_director.shape

There are 4501 directors of 5428 Movies/TV Shows according to the one-hot encoded dataset.

In [None]:
# number of Movies/TV Shows by director
onehot_director.sum().sort_values(ascending=False)

In [None]:
# return the Netflix rows whose director string contains the given name (substring match)
def director(name):
    data = df[df.director.astype('str').apply(lambda t: name in t)]
    return data

In [None]:
# Countries where Jan Suter made his Movies/TV Shows
director('Jan Suter').country.unique()

The next question is: which director made the most TV shows in the US?

In [None]:
# merge with Netflix dataset, joining explicitly on show_id
onehot_director_us = onehot_director.reset_index().merge(df, how='left', on='show_id')

# Filter the data
onehot_director_us = onehot_director_us.loc[(onehot_director_us.country == 'United States') & (onehot_director_us.type == 'TV Show')]

In [None]:
# Top-5 directors with the highest number of US TV shows
onehot_director_us.loc[:,onehot_director.columns].sum().sort_values(ascending=False).head()

Finally, I would like to know which movies and TV shows by my favourite director, Quentin Tarantino, the Netflix dataset contains:

In [None]:
# Quentin Tarantino's filmography
director('Quentin Tarantino')

# Summary
In this work we cleaned, reshaped, and visualized the __actors, directors, countries, and genres__ columns of the Netflix dataset. These columns are not easy to analyze because each cell can contain several values. To solve this, each variable now has its own encoded dataset with "show_id" as the key, so you can merge it with the source data to answer your own questions about actors, directors, countries, or genres. I performed basic visualization just to show how these tables can be used. Feel free to use the tables and functions for more complex exploratory data analysis. I hope I have met all of the task's requirements and that my code will be useful for further, more sophisticated analysis.