The diagram below is from my Whimsical account. It shows the complete machine learning pipeline, which I follow most of the time.

https://whimsical.com/machine-learning-M8Eq1mUB4jp89Mz7PpqchY

Netflix is one of the most popular entertainment platforms today. Here, we will extract, clean, and then visualize the Netflix dataset.

These are some questions to answer using the Netflix dataset.

1. How much content was added across all years?
2. Top 10 countries contributing to Netflix.
3. Top genres on Netflix.
4. Amount of content by rating.
5. Top directors on Netflix.
6. Top 10 actors on Netflix.
7. How genre affects the rating.
8. In which month should a movie be released to help producers make a profit?
9. Top 10 best movies of 2020 that you must watch.
10. Understanding what content is available in different countries.

# **Familiarize with the Data.**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from sklearn.preprocessing import LabelEncoder
from pandas_profiling import ProfileReport

from IPython.display import Image

# Shared encoder instance, used later in the Feature Encoding section
e = LabelEncoder()

In [None]:
df = pd.read_csv('/kaggle/input/netflix-shows/netflix_titles.csv')
df.head()

In this data set, there are 12 features.

**show_id** = Unique ID for every Movie / TV Show

**type** = Identifier - A Movie / TV Show

**title** = Title of Movie / TV Show

**director** = Director of the Movie

**cast** = Actors involved in the Movie / TV Show

**country** = Country where the Movie / TV Show was produced

**date_added** = Date it was added on Netflix

**release_year** = Actual Release year of the Movie / TV Show

**rating** = TV Rating of the Movie / TV Show

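**duration** = Total Duration - in minutes or number of seasons

**listed_in** = Genres of the Movie / TV Show (renamed to Genre below)

**description** = Summary description of the Movie / TV Show

Let's review each feature and figure out how to use it for visualization.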

**Dropping the columns that we will not use.**

In [None]:
df.drop(['show_id','description'], axis = 1, inplace=True)

**Rename the columns.**

In [None]:
df.rename(columns = {'listed_in' : 'Genre', 'date_added' : 'date'}, inplace= True)

Explore and visualize the dataset the easy way.

The profiling report shows the properties of each column.

In [None]:
profile = ProfileReport(df, title='Netflix each column Report', html={'style':{'full_width':False}})

In [None]:
profile.to_notebook_iframe()

From the profiling report above:

**director** has a high cardinality: 4049 distinct values	

**cast** has a high cardinality: 6831 distinct values	

**country** has a high cardinality: 681 distinct values	

**date_added** has a high cardinality: 1565 distinct values

**duration** has a high cardinality: 216 distinct values	

**listed_in** has a high cardinality: 492 distinct values	

**description** has a high cardinality: 7769 distinct values	


In [None]:
profile.to_widgets()

# **Data Preprocessing**

# **Missing values**

In [None]:
df.info()

**You can clearly see that missing values occur in director, cast, country, date, and rating.**


In [None]:
missing_no=df.isna().sum()
missing_no

In [None]:
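# Iterate over (column name, missing count) pairs instead of positional integer lookups,
# which are deprecated on a Series with a string index
for col, n_missing in missing_no.items():
  if n_missing > 0:
    print('missing value rate in {} column = {:.2f}%'.format(col, n_missing/len(df)*100))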
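
The same missing-value rate can also be computed without a loop. The following is a minimal sketch on a made-up toy frame (the `toy` data is an assumption for illustration, not the Netflix file); it relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import pandas as pd

# Toy frame standing in for the Netflix data (values are made up)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "rating": ["R", "PG", None, "R"],
})

# isna() gives a boolean mask; its column-wise mean is the fraction missing
missing_rate = toy.isna().mean() * 100

# Keep only the columns that actually have missing values, largest first
missing_rate = missing_rate[missing_rate > 0].sort_values(ascending=False)
print(missing_rate)
```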

**Missing value plot**

In [None]:
a4_dims = (12, 8)
fig, ax = plt.subplots(figsize=a4_dims)
sns.heatmap(ax=ax,data=df.isnull(), cbar=False)

In [None]:
msno.matrix(df)

# **Deletion**

**Drop the director and cast columns.**

**Then clean the missing values in country, rating, and date (originally date_added).**

In [None]:
df.drop(['director','cast'], axis = 1, inplace=True)

# **Feature Imputations**

# **Forward imputations**

In [None]:
df['date'].unique()

Convert the date format, for example 'August 14, 2020' to 2020-08-14, to make the dates easier to handle.

In [None]:
df['date'] = pd.to_datetime(df.date)
df.head()

In [None]:
# fillna(method='ffill') is deprecated in recent pandas; ffill() does the same forward fill
df['date'] = df['date'].ffill()
df.head()
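
To make the forward-fill step concrete, here is a small sketch on made-up dates (the `toy` frame is an assumption for illustration). Note that a forward fill copies the date of the previous row, so the imputed value depends on row order.

```python
import pandas as pd

# Made-up dates in the same 'Month day, year' style as the dataset
toy = pd.DataFrame({"date": ["August 14, 2020", None, "January 1, 2019", None]})

# Parse strings to datetimes; missing entries become NaT
toy["date"] = pd.to_datetime(toy["date"])

# Forward fill: each NaT takes the value of the closest earlier row
toy["date"] = toy["date"].ffill()
print(toy)
```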

**Check that there are now no missing values in the date column.**

In [None]:
# Number of rows where date is still missing (should be 0)
no_missing = df['date'].isna().sum()
no_missing

# **Feature imputation using Unique values**.

**Handling missing values in rating**

In [None]:
df['rating'].unique()

In [None]:
df[df['rating'].isna()]

In [None]:
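# Ratings assigned by hand to the rows where rating is missing
rating_1 = ['TV-MA', 'R', 'PG-13', 'TV-14', 'TV-PG', 'NR', 'TV-G']
# Positional row indices of those rows (named row_ids to avoid shadowing the built-in id)
row_ids = [67, 2359, 3660, 3736, 3737, 3738, 4323]
# Look up the column position by name instead of hard-coding 5
rating_col = df.columns.get_loc('rating')

for row, rating in zip(row_ids, rating_1):
  df.iloc[row, rating_col] = rating

df.iloc[67, rating_col]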

**Check below: there are no NA values left in the rating column.**

In [None]:
df.isna().sum()

# **Frequent Imputation.**

**Handling missing values in Country Column**

In [None]:
df['country'].value_counts()

Frequent Imputation using mode.

In [None]:
df['country'] = df['country'].fillna(df['country'].mode()[0])
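
As a small illustration of frequent (mode) imputation, the sketch below uses a made-up Series (an assumption, not the real column). `mode()` ignores missing values and returns the most frequent value first.

```python
import pandas as pd

# Made-up country values with two gaps
countries = pd.Series(["United States", None, "India", "United States", None])

# mode() skips NaN by default; [0] takes the most frequent value
most_common = countries.mode()[0]

# Every missing entry is replaced with that single value
filled = countries.fillna(most_common)
print(filled.tolist())
```

Because every gap receives the same value, this inflates the count of the most common country, which is worth keeping in mind when reading the country plots.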


**Create p_country from the first country listed in each row.**

In [None]:
df['p_country'] = df['country'].apply(lambda x: x.split(",")[0])
df.head()
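
The same first-country extraction can be written with the vectorized `.str` accessor. This is a sketch on made-up values (an assumption for illustration); the extra `strip()` guards against stray spaces around the name.

```python
import pandas as pd

# Made-up multi-country strings
countries = pd.Series(["United States, India", "France", " Canada , Japan"])

# Split on commas, take the first piece, and trim surrounding whitespace
primary = countries.str.split(",").str[0].str.strip()
print(primary.tolist())
```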

**Check null values in country**

There are no NA values left.

In [None]:
df.isnull().sum()

**Checking the country column, since it can contain multiple country names in a single row.**

* **The country column can contain multiple country names in a single row. Check this in the dataframe.**

* **The dataset contains 69 different countries.**

* **The country column has 681 unique values, including rows with multiple country names.**

* **The number of multi-country rows that contain United States or United Kingdom is computed below.**

In [None]:
c2 = []
c3 = []
c4 = []
c5 = []
c6 = []
for x in range(len(df)):
  if len(df['country'][x].split(',')) == 1:
    c2.append(df['country'][x])
  elif 'United States' in df['country'][x].split(','):
    c3.append(df['country'][x])
  elif ' United States' in df['country'][x].split(','):
    c4.append(df['country'][x])
  elif 'United Kingdom' in df['country'][x].split(','):
    c5.append(df['country'][x])
  elif ' United Kingdom' in df['country'][x].split(','):
    c6.append(df['country'][x])
csmn = np.array(c2)
print(pd.DataFrame(csmn, columns = ['country']))
print('\n Unique countries: \n', np.unique(csmn))
print('\nlength of the rows which contain single country: ',len(c2))
print('\nlength of the rows which contain multiple countries: ', len(df) - len(c2))
print('\nlength of the rows which contain united states in multiple countries: ',len(c3)+len(c4))
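# c5 and c6 hold United Kingdom rows; rows that also contain United States were already counted above
print('\nlength of the rows which contain united kingdom in multiple countries: ',len(c5)+len(c6))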


**csmn = countries single movie name => the rows whose country column contains a single country name**
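
The counts above can also be sketched without the `if`/`elif` chain. The example below uses made-up rows (an assumption for illustration) and strips each name first, so 'United States' and ' United States' are treated as the same country. Unlike the chain above, it counts United Kingdom rows independently of whether they also contain United States.

```python
import pandas as pd

# Made-up country strings
countries = pd.Series([
    "United States",
    "India, United States",
    "United Kingdom, France",
    "Japan",
])

# Turn each row into a list of trimmed country names
country_lists = countries.apply(lambda s: [c.strip() for c in s.split(",")])
n_countries = country_lists.apply(len)

single_rows = int((n_countries == 1).sum())
multi_rows = int((n_countries > 1).sum())

# Among multi-country rows, count those mentioning a given country
multi = country_lists[n_countries > 1]
multi_with_us = int(multi.apply(lambda cs: "United States" in cs).sum())
multi_with_uk = int(multi.apply(lambda cs: "United Kingdom" in cs).sum())
print(single_rows, multi_rows, multi_with_us, multi_with_uk)
```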

# **Genre Column**

**The Genre column contains multiple genre names per row, and we have to deal with them.**

**Grouping based on type: Movies and TV Show**

In [None]:
df_g = df.groupby('type')

**The Genre column lists many categories for each movie. Some of them we do not want in that column, so we will keep only the common genres. First we need to check how many categories the Genre column contains.**

In [None]:
df_g1 = df_g.get_group('Movie')
# Renumber the rows from 0 and discard the old index in one step
df_g1 = df_g1.reset_index(drop=True)
df_g1.head()

**All the genre words in the Genre column.**

In [None]:
app = []
for st1 in range(len(df_g1)):
  for st in range(len(df_g1['Genre'][st1].split(','))):
    app.append(df_g1['Genre'][st1].split(',')[st].strip())

Total_Genre_in_Genre_column = np.array(app)
print('All Genre Categories in Genre column:\n',Total_Genre_in_Genre_column)
print('\n Length of total Categories in Genre column:',len(Total_Genre_in_Genre_column))

print('\n Total different Categories in Genre: \n\n',np.unique(Total_Genre_in_Genre_column))

Value counts for each of the different categories present in the Genre column.

In [None]:
G = pd.DataFrame(Total_Genre_in_Genre_column)
G.value_counts()
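
The split-and-count steps above can be sketched more compactly with `explode`, which turns each list element into its own row. The genre strings here are made up for illustration.

```python
import pandas as pd

# Made-up genre strings in the same comma-separated style
genres = pd.Series(["Dramas, International Movies", "Comedies", "Dramas, Comedies"])

# Split into lists, give each genre its own row, trim spaces, then count
genre_counts = genres.str.split(",").explode().str.strip().value_counts()
print(genre_counts)
```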

**From the counts above, I will focus only on the following common genres.**

***Wanted Genre :*** 


*   **Action & Adventure** 
*   **Comedies**
*   **Documentaries**
*   **Dramas**
*   **Horror Movies** 
*   **Music & Musicals**
*   **Romantic Movies** 
*   **Sci-Fi & Fantasy** 
*   **Sports Movies**
*   **Thrillers**
*   **Stand-up Comedy**


***UnWanted Genre :***

*   **Anime Features**
*   **Children & Family Movies**
*   **Classic Movies**
*   **Cult Movies**
*   **Faith & Spirituality**
*   **Independent Movies**
*   **International Movies**
*   **LGBTQ Movies**
*   **Movies**



**Some rows hold multiple genre categories, and others hold a single category.**


**First we have to deal with the rows of the Genre column that hold a single category.**

**The difference between the two:**

1.   **Single Category in a Single Row**:

      *   For example: check index 4, ['Dramas']. It contains only a single category.
      

2.   **Multiple Categories in a Single Row**:

      *  For example: check index 1, ['Dramas', 'International Movies'], and index 3, ['Action & Adventure', 'Independent Movies', 'Sci-Fi & Fantasy']



In [None]:
arr1_sngl = df_g1.Genre.apply(lambda sngl: sngl.split(','))
arr1_sngl

**Count how many rows hold a single category, two categories (binary), or more than two (multiple).**

**We have to improve this: we want more single and binary rows.**

In [None]:
count1=0
count2=0
count3=0
for category in arr1_sngl:
  if len(category) == 1:
    count1=count1+1
  elif len(category) == 2:
    count2=count2+1
  elif len(category) > 2:
    count3=count3+1
    # print(category)
print('There are {} Single categories in Genre Column'.format(count1))
print('There are {} Binary categories in Genre Column'.format(count2))
print('There are {} Multiple categories in Genre Column'.format(count3))
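
As an alternative to the three counters, the row sizes can be labelled and counted in one pass. This is a sketch on made-up genre strings; the helper name `size_label` is hypothetical.

```python
import pandas as pd

# Made-up genre strings
genres = pd.Series([
    "Dramas",
    "Dramas, Comedies",
    "Action & Adventure, Dramas, Thrillers",
    "Comedies",
])

# Number of comma-separated categories in each row
n_genres = genres.str.split(",").str.len()

def size_label(n):
    """Map a category count to the labels used in this notebook."""
    if n == 1:
        return "single"
    if n == 2:
        return "binary"
    return "multiple"

counts = n_genres.map(size_label).value_counts()
print(counts)
```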

### **Single Category of Genre in each row.**

**See the unwanted genres listed above.**

**Change ['Unwanted Genre'] ==>** **'Action & Adventure'.**



In [None]:
# Split each Genre string into a list of trimmed genre names
arr = df_g1.Genre.apply(lambda s: [g.strip() for g in s.split(",")])
arr

**The rows printed below are all the single-category rows with an unwanted genre. There are only a few of them, so they don't affect our prediction much. Therefore, we convert all of them to 'Action & Adventure'.**

**In other words, we convert each unwanted genre to a wanted one.**

In [None]:
genre_we_dont_want = ['Anime Features', 'Children & Family Movies','Classic Movies', 'Cult Movies','Faith & Spirituality', 'Independent Movies',
                        'International Movies', 'LGBTQ Movies', 'Movies']
def movie(dont_want):
  for i in range(len(arr)):
    if arr[i][0] == dont_want:
        if len(arr[i]) == 1:
          print(i,arr[i])
for i1 in range(len(genre_we_dont_want)):
    movie(genre_we_dont_want[i1])  

**Index 1201 holds the string 'International Movies'. We will use this index to check whether it gets changed to 'Action & Adventure'.**

In [None]:
df_g1.Genre[1201]

**Replace all the single-category unwanted genres above with 'Action & Adventure'.**

In [None]:
# A comma was missing between 'Children & Family Movies' and 'Classic Movies',
# which silently joined them into one string
Unwanted_Genre = ['Anime Features', 'Children & Family Movies', 'Classic Movies', 'Cult Movies','Faith & Spirituality', 'Independent Movies',
                        'International Movies', 'LGBTQ Movies', 'Movies']
# Replace the genre in place when it is the only genre in the row
def movie(dont_want):
  for i4 in range(len(arr)):
    if arr[i4][0] == dont_want:
        if len(arr[i4]) == 1:
          arr[i4][0] = 'Action & Adventure'
for i5 in range(len(Unwanted_Genre)):
    movie(Unwanted_Genre[i5])  
print(arr)
print('\nInternational Movies replaced with:', arr[1201])
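
The replacement rule can be sketched as a small pure function, which makes it easy to check on a few hand-written rows. The function name `replace_single_unwanted` and the sample rows are hypothetical; the genre names come from the unwanted list above.

```python
# Unwanted genres, taken from the list in this notebook
UNWANTED = {
    "Anime Features", "Children & Family Movies", "Classic Movies",
    "Cult Movies", "Faith & Spirituality", "Independent Movies",
    "International Movies", "LGBTQ Movies", "Movies",
}

def replace_single_unwanted(genres, replacement="Action & Adventure"):
    """Replace the genre only when the row has exactly one, unwanted, genre."""
    if len(genres) == 1 and genres[0] in UNWANTED:
        return [replacement]
    return list(genres)

# Made-up rows: one single unwanted, one multi-genre, one single wanted
rows = [["International Movies"], ["Dramas", "International Movies"], ["Comedies"]]
cleaned = [replace_single_unwanted(r) for r in rows]
print(cleaned)
```

Using a set for the lookup also avoids the missing-comma problem that a long list literal invites, since each name is checked independently.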

**Create a Genre1 column in our dataset. In this column, the unwanted genres of single-category rows have been converted to a wanted genre. We will handle the multi-category rows after finishing with this one.**

In [None]:
df_g1['Genre1'] = arr

**Convert the row back to str from list.**

In [None]:
# Join each list of genres back into a single comma-separated string
df_g1['Genre1'] = df_g1['Genre1'].apply(', '.join)
df_g1.Genre1

**Check that this row has changed from 'International Movies' to 'Action & Adventure'.**

In [None]:
df_g1.Genre1[1201]

**In the split output above, the second element of each list starts with a space. We have to eliminate this. Check below.**

**We have now eliminated all the single unwanted categories of Genre.**

# **Multiple Categories in Single Row**

In [None]:
import re
# Raw string for the regex avoids an invalid-escape warning for \s
df_g1['real_Genre']=df_g1.Genre1.apply(
    lambda x: re.split(r'\s*,\s*', x)).apply(
        lambda x: [g for g in x if g not in [ 'International Movies', 'Independent Movies','Children & Family Movies','Anime Features','Classic Movies', 'Cult Movies', 'Faith & Spirituality', 'LGBTQ Movies', 'Movies']])
df_g1
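
To see what this filtering does to individual rows, here is a minimal sketch with a hypothetical helper, `keep_wanted`, applied to made-up strings. It also shows an edge case: a row whose genres are all unwanted ends up with an empty list.

```python
import re

UNWANTED = {
    "Anime Features", "Children & Family Movies", "Classic Movies",
    "Cult Movies", "Faith & Spirituality", "Independent Movies",
    "International Movies", "LGBTQ Movies", "Movies",
}

def keep_wanted(genre_string):
    """Split on commas (ignoring surrounding spaces) and drop unwanted genres."""
    parts = re.split(r"\s*,\s*", genre_string.strip())
    return [g for g in parts if g not in UNWANTED]

mixed = keep_wanted("Dramas, International Movies")
spaced = keep_wanted("Action & Adventure ,Sci-Fi & Fantasy")
all_unwanted = keep_wanted("Independent Movies, International Movies")
print(mixed, spaced, all_unwanted)
```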

**Below, we convert real_Genre from list format to string format.**

**Compare real_Genre above and below to see the difference.**

In [None]:
df_g1['real_Genre']=df_g1['real_Genre'].apply(lambda s: ', '.join(s))
df_g1

In [None]:
df_g1['real_Genre'][1201]

**Check below: there are three genre columns: Genre, Genre1, and real_Genre.**

1. **Genre: The original column.**


2. **Genre1: Derived from Genre, with every unwanted genre converted to a wanted genre in single-category rows.**

3. **real_Genre: Derived from Genre1, with all unwanted genres removed from the multi-category rows. Everything else is carried over from Genre1.**

**Therefore, we don't need Genre1.**

In [None]:
df_g1

**Drop Genre1. real_Genre is derived from Genre1, so we keep only real_Genre.**

In [None]:
df_g1.drop('Genre1', axis=1, inplace = True)
df_g1.head()

### **See the difference from the original data**

**We have to check how many single, binary, and multiple categories are in real_Genre.**

In [None]:
# Split each real_Genre string into a list of trimmed genre names
arr2_sngl = df_g1.real_Genre.apply(lambda s: [g.strip() for g in s.split(',')])
arr2_sngl

**Count the single, binary, and multiple rows again.**

**We have improved it: there are now more single rows and more binary rows. Compare with the counts at the start, where there were more multiple-category rows.**

In [None]:
count1=0
count2=0
count3=0
sn = []
mt = []
for category in arr2_sngl:
  if len(category) == 1:
    count1=count1+1
    sn.append(category)

  elif len(category) == 2:
    count2=count2+1 

  elif len(category) > 2:
    count3=count3+1

sn = pd.DataFrame(sn)

print(sn.value_counts())
print('There are {} Single categories in real_Genre Column'.format(count1))
print('There are {} Binary categories in real_Genre Column'.format(count2))
print('There are {} Multiple categories in real_Genre Column'.format(count3))

**We still have to deal with the binary and multiple categories.**

# **Binary and Multiple Categories.**

**As the output below shows, the following categories occur most often:**

1. **Binary: Dramas, Comedies, Romantic Movies, Action & Adventure, Thrillers.**

2. **Multiple: Dramas, Comedies, Romantic Movies, Action & Adventure, Thrillers.**

In [None]:
# Named bin_genres / mlt_genres to avoid shadowing the built-in bin()
bin_genres=[]
mlt_genres=[]
for category in arr2_sngl:
  if len(category) == 2:
    bin_genres.extend(category)
  elif len(category) > 2:
    mlt_genres.extend(category)

bin_genres = pd.DataFrame(bin_genres)
mlt_genres = pd.DataFrame(mlt_genres)


print('Binary:\n')
print(bin_genres.value_counts())
print('\nMultiple:\n')
print(mlt_genres.value_counts())

**Single_Genre: all the rows whose real_Genre contains exactly one category**

**Binary_Genre: all the rows whose real_Genre contains exactly two categories**

**Multiple_Genre: all the rows whose real_Genre contains more than two categories**

In [None]:
# Collect the genre lists and row positions for each group
mlt_rows=[]
bin_rows=[]
sng_rows=[]
bin_index = []
mlt_index = []
sng_index = []
for index,category in enumerate(arr2_sngl):
  if len(category) == 1:
    sng_rows.append(category)
    sng_index.append(index)
  elif len(category) == 2:
    bin_rows.append(category)
    bin_index.append(index)
  elif len(category) > 2:
    mlt_rows.append(category)
    mlt_index.append(index)

bin_rows = pd.DataFrame(bin_rows)
mlt_rows = pd.DataFrame(mlt_rows)


# Select rows by position, renumber them from 0, and drop the original Genre column
Binary_Genre = df_g1.iloc[bin_index].reset_index(drop=True).drop('Genre', axis=1)

Multiple_Genre = df_g1.iloc[mlt_index].reset_index(drop=True).drop('Genre', axis=1)

Single_Genre = df_g1.iloc[sng_index].reset_index(drop=True).drop('Genre', axis=1)


Binary_Genre

In [None]:
Single_Genre



In [None]:
Multiple_Genre

# **Feature Encoding**

# **LabelEncoder**

**Country Column**

In [None]:
label_c=e.fit_transform(df_g1.p_country)


Check the encoded p_country column below.

In [None]:
enc = df_g1.copy()
enc['p_country'] = label_c
enc
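
For reference, here is a minimal sketch of what `LabelEncoder` does, on made-up country names. The classes are sorted, and each value is replaced by the position of its class in that sorted order.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up country values
countries = ["United States", "India", "France", "India"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ holds the sorted unique labels: France=0, India=1, United States=2
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps the integer codes back to the original labels
decoded = encoder.inverse_transform(codes)
```

The integer codes carry no real ordering between countries, so they are best treated as identifiers rather than magnitudes.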

**Type Column.**

In [None]:
dummy_type = pd.get_dummies(df_g1['type'])
dummy_type

Check the encoded type column below.

In [None]:
enc['type'] = dummy_type['Movie']
enc
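
A short sketch of one-hot encoding with `pd.get_dummies`, on a made-up `type` column that contains both values. Since `df_g1` holds only the 'Movie' group, its dummy frame has a single 'Movie' column; the toy data below is an assumption chosen to show both columns.

```python
import pandas as pd

# Made-up type values containing both categories
types = pd.Series(["Movie", "TV Show", "Movie"], name="type")

# One column per category; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(types, dtype=int)
print(dummies)
```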

**Real_Genre Column**

In [None]:
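# Each fit_transform call refits the encoder, so the integer codes are only
# meaningful within their own frame and are not comparable across the three
label_s = e.fit_transform(Single_Genre.real_Genre)
label_b = e.fit_transform(Binary_Genre.real_Genre)
label_m = e.fit_transform(Multiple_Genre.real_Genre)

Multiple_Genre['real_Genre_encode'] = label_m
Binary_Genre['real_Genre_encode'] = label_b
Single_Genre['real_Genre_encode'] = label_s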


In [None]:
Binary_Genre.head()

In [None]:
Single_Genre.head()

In [None]:
Multiple_Genre.head()

# **Visualizations**

In [None]:
df_real = pd.read_csv('../input/netflix-movies-along-with-all-information/Netflix Movies.csv')

**Exploring the countries by the amount of content they produce on Netflix.**

In [None]:
plt.figure(figsize=(13,7))

g = sns.countplot(y = df_real.p_country, order=df_real.p_country.value_counts().index[:15])
plt.title('Top 15 Contributing Countries on Netflix')
plt.xlabel('Titles')
plt.ylabel('Country')
plt.show()

**To find the most frequent directors, we can visualize the number of titles per director.**

In [None]:
app = []
for st1 in range(len(df_real)):
  for st in range(len(df_real['director'][st1].split(','))):
    app.append(df_real['director'][st1].split(',')[st].strip())

# Named directors rather than dir to avoid shadowing the built-in dir()
directors = pd.DataFrame(app)


In [None]:
plt.figure(figsize=(13,7))

sns.countplot(y = directors[0], order=directors[0].value_counts().index[:10], palette='PuRd_r')
plt.title('Top 10 Directors by Number of Titles')
plt.xlabel('Count')
plt.ylabel('Director')
plt.show()

**Top Genres on Netflix**

In [None]:
app = []
for st1 in range(len(df_real)):
  for st in range(len(df_real['Genre'][st1].split(','))):
    app.append(df_real['Genre'][st1].split(',')[st].strip())

gen = pd.DataFrame(app)


In [None]:
plt.figure(figsize=(13,7))

sns.countplot(y = gen[0], order=gen[0].value_counts().index[:10], palette='gist_rainbow')
plt.title(label='Top 10 Genre',fontsize=30,color="black")

plt.xlabel('Count')
plt.ylabel('Genre')
plt.show()

**How genre relates to the age rating.**

**Most of the movies are rated for adults.**

In [None]:
plt.figure(figsize=(13,7))

sns.countplot(y = Single_Genre.real_Genre, hue = df_real.rating_age, palette='gist_rainbow')

**Top Actor on Netflix based on the number of titles**

In [None]:
app = []
for st1 in range(len(df_real)):
  for st in range(len(df_real['cast'][st1].split(','))):
    app.append(df_real['cast'][st1].split(',')[st].strip())

actor = pd.DataFrame(app)




In [None]:
plt.figure(figsize=(13,7))

sns.countplot(y = actor[0], order=actor[0].value_counts().index[:10], palette='gist_rainbow_r')
plt.title(label='Top Actor on Netflix based on the number of titles',fontsize=30,color="black")

plt.xlabel('Count')
plt.ylabel('Actor')
plt.show()
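
The director, genre, and actor cells above repeat the same split-and-count loop. A possible way to factor it out is sketched below; the helper name `top_values` and the sample data are hypothetical. It also drops missing values first, which the loops above would fail on.

```python
import pandas as pd

def top_values(series, n=10):
    """Count individual names in a comma-separated string column."""
    return (
        series.dropna()          # skip rows with no value
        .str.split(",")          # one list of names per row
        .explode()               # one name per row
        .str.strip()             # trim spaces left by the split
        .value_counts()
        .head(n)
    )

# Made-up cast strings, including a missing row
cast = pd.Series(["Actor A, Actor B", None, "Actor B", "Actor C, Actor B"])
top_cast = top_values(cast, n=2)
print(top_cast)
```

The resulting index can be passed as `order=` to `sns.countplot`, in the same way the cells above use `value_counts().index[:10]`.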