# Netflix shows analysis
--- 

The dataset covers shows and movies released between 1940 and 2017 that are available on
Netflix, a globally popular video-streaming service. Titles rated from “_G_” (General
Audiences) to “_R_” (strong violence, sexual content and adult language) are all included, so the
catalogue appeals to a wide audience. Each title comes with a rating, rating description, rating
level, release year, user rating score and user rating size.

The average user rating score (out of 100) is 84. ‘**13 reasons why**’ has the highest score,
99, while ‘**Life Unexpected**’ and ‘**Curious George**’ share the lowest score in the list,
55.

### Problems found in the dataset
Several rows have missing values, and a few rows are duplicated. Genre details are not
provided, which makes it difficult to categorize the shows by genre. A couple of
column names appear to be swapped.

### Solutions implemented
* Basic data cleaning (remove duplicate rows).
* Add additional columns (Age Restriction, Genre).
* Fill in missing values based on the most common value (a missing rating
  description is replaced with the most frequent description for that rating).

### Things to Analyse
* Trending genre for shows and movies in each decade.
* Trending rating for shows and movies in each decade.
* What is common among highly rated shows/movies?
* Do popular shows or movies have mature content?
* Ratio of popular general-audience titles to popular age-restricted titles overall.

## Essential Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import time
from bs4 import BeautifulSoup
import requests
from wordcloud import WordCloud, STOPWORDS
%matplotlib inline

In [None]:
import warnings

warnings.filterwarnings('ignore')

## Read the file

In [None]:
nf = pd.read_csv('../input/netflix-shows/Netflix Shows.csv', encoding='latin-1')
nf.head(10)

## Drop duplicate rows

In [None]:
df = nf.drop_duplicates() 
df

## Rename columns

The original dataset seems to have swapped two column names. Since little documentation is available about it, I renamed these columns to make them easier to understand.

In [None]:
df.rename(columns = {'ratingLevel':'ratingDescription','ratingDescription':'ratingLevel' }, inplace = True)  

In [None]:
df = df.reset_index(drop=True)  # renumber rows 0..n-1 without hard-coding the row count
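
A minimal sketch of de-duplicating and re-indexing on a toy frame (the titles and ratings below are invented for illustration, not taken from the dataset). `reset_index(drop=True)` renumbers the rows from 0 without needing to know how many rows remain:

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "title": ["A", "B", "B", "C"],
    "rating": ["G", "PG", "PG", "R"],
})

# Remove exact duplicate rows, then renumber the index 0..n-1.
toy_clean = toy.drop_duplicates().reset_index(drop=True)
print(toy_clean)
```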

## Describing the DataFrame

In [None]:
df.describe()

In [None]:
df['ratingDescription'] = df['ratingDescription'].fillna("Not_Filled")  # assign back instead of chained inplace fill

## Frequency of each rating from the column 'rating'

In [None]:
Rating_Count = (df["rating"].value_counts()).sort_index()
Rating_Count

## Fill in missing data

Several values in the 'ratingDescription' column were missing, so I filled each of them with the most frequent description found for the same rating.

In [None]:
Ser = []

for i in Rating_Count.index:   # iterate through each rating, e.g. 'G'
    desc = df.loc[df['rating'] == i, 'ratingDescription']  # descriptions of the rows with this rating
    counts = desc[desc != "Not_Filled"].value_counts().sort_index()  # description counts, ignoring the placeholder
    # most frequent description; keep the placeholder if this rating has no real description
    Ser.append(counts.idxmax() if len(counts) else "Not_Filled")
    
Ser
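
The same "most frequent description per rating" idea can be sketched with a `groupby`. This is an alternative under assumptions, not the approach used above: the toy descriptions are invented, and ties are resolved by whichever label `value_counts` lists first.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "rating": ["G", "G", "G", "PG", "PG"],
    "ratingDescription": ["all ages", "all ages", None, "guidance", None],
})

# Drop the gaps, then pick the most frequent description within each rating.
most_common = (
    toy.dropna(subset=["ratingDescription"])
       .groupby("rating")["ratingDescription"]
       .agg(lambda s: s.value_counts().idxmax())
)
print(most_common)
```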

## Create a Data Frame out of the resultant values.

In [None]:
new_df = pd.DataFrame({'rating': Rating_Count.index, 'Max_found_description': Ser}) # new DataFrame holding the most frequent description for each rating
new_df

## Fill in the missing values in 'ratingDescription' with corresponding values in 'Max_found_description' column for each rating

In [None]:
for i in df.index:
    if 'Not_Filled' in df.loc[i, 'ratingDescription']:
        rating = df.loc[i, 'rating']
        description = new_df.loc[new_df['rating'] == rating, 'Max_found_description'].values[0]
        df.loc[i, 'ratingDescription'] = description  # .loc avoids chained assignment
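
For comparison, a vectorized sketch of the same fill. It assumes the gaps are still real missing values rather than a placeholder string, and the `lookup` mapping and toy rows are hypothetical:

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "rating": ["G", "G", "PG"],
    "ratingDescription": ["all ages", None, None],
})

# Hypothetical lookup: rating -> most frequent description.
lookup = {"G": "all ages", "PG": "guidance"}

# fillna accepts an index-aligned Series, so only the missing entries are replaced.
toy["ratingDescription"] = toy["ratingDescription"].fillna(toy["rating"].map(lookup))
print(toy)
```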
        

The missing values in 'ratingDescription' are now filled.

In [None]:
df[:10]

## Adding new information regarding genre of each show/movie through web scraping

Since the dataset has no genre information, I wrote a script that pulls it from [wikipedia](https://www.wikipedia.org/) and [rotten-tomatoes](https://www.rottentomatoes.com/) and used the result to look for trends. The script uses 'BeautifulSoup', a library for web scraping.



The function below pulls the genre from wikipedia and returns a list of the genres available for a given show/movie title.

In [None]:
def get_genre(lst_val): #gets genre from wikipedia : returns a List
    url = "https://en.wikipedia.org/wiki/" + lst_val #lst_val is the title of show/movie
    page = requests.get(url) 
    soup = BeautifulSoup(page.content, 'html.parser')
    capi = soup.find_all("td",class_="category")
    fill_val = [re.sub(r"\[\d+\]", "", i.get_text().strip().replace("\n",", ")) for i in capi]
    return fill_val
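
The text clean-up inside `get_genre` can be tried without any network access. The sketch below applies the same three steps to a made-up cell text; the helper name `clean_genre_text` is hypothetical and exists only for this illustration:

```python
import re

def clean_genre_text(raw):
    """Apply the same clean-up as get_genre to one scraped cell (hypothetical helper)."""
    text = raw.strip().replace("\n", ", ")   # one genre per line -> comma-separated
    return re.sub(r"\[\d+\]", "", text)      # drop footnote markers such as [1]

# Made-up cell text resembling an infobox entry.
cleaned = clean_genre_text("Drama[1]\nMystery[2]\n")
print(cleaned)
```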

The function below pulls details from 'rotten-tomatoes' and returns a Series of the information available for a given show/movie.

In [None]:
def get_info_rt(url_val): #gets details from rotten tomatoes : returns a Series
    url_mov ="m/"
    url_tv = "tv/"
    url = "https://www.rottentomatoes.com/"

    page = requests.get(url + url_mov + url_val)
    soup = BeautifulSoup(page.content, 'html.parser')

    capi = soup.select("ul.content-meta.info li")
    fill_val = [i.find(class_="meta-value").get_text().replace("\n","").strip().replace("and",",").replace(" ","") for i in capi] 
    fill_lbl = [i.find(class_="meta-label").get_text().replace("\n","").strip(":").replace("and",",").replace(" ","") for i in capi] 
    final_ser = pd.Series(fill_val,index=fill_lbl)
    
    if(len(final_ser) <= 1):
        page = requests.get(url + url_tv + url_val)
        soup = BeautifulSoup(page.content, 'html.parser')

        cont = soup.select("div.panel-body.content_body td")
        fill_lbl = [i.get_text().replace(" \n","").strip(": \n").replace(" ","") for i in cont][::2]
        fill_val = [i.get_text().replace(" \n","").strip(": \n").replace(" ","") for i in cont][1::2]
        final_ser = pd.Series(fill_val,index=fill_lbl)
        return final_ser

    return final_ser
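
The TV fallback relies on table cells alternating between label and value. A small sketch of that slicing on a made-up list of cell texts (the strings are invented, not scraped):

```python
import pandas as pd

# Made-up cell texts in page order: label, value, label, value, ...
cells = ["Genre", "Drama", "Network", "Netflix"]

labels = cells[::2]    # every other item starting at 0 -> labels
values = cells[1::2]   # every other item starting at 1 -> values
info = pd.Series(values, index=labels)

print(info.get("Genre"))     # value stored under the 'Genre' label
print(info.get("Premiere"))  # Series.get returns None when the label is absent
```

Because `Series.get` returns `None` for a missing label, a page without a genre entry yields `None` rather than raising an error.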

The script below gets the genre of each show/movie if it is available on these websites. The values are stored in a new column, 'Genre'.

In [None]:
#Script to pull Genre from wikipedia and rotten tomatoes
t1 = time.time()

df['Genre'] = "not found"
list_titles =  [i.strip().replace(" ","_") for i in df['title']]

# If a particular title doesn't work, try appending the suffixes below to the title
change_titles = ['_(TV_series)','_(American_TV_series)','_(franchise)']

# Many shows have a title track of the same name, so the lookup sometimes returns the song
# instead of the TV show/movie. Results whose first genre appears in this string are skipped.
songs_to_avoid = "alternative rock ska pop funk hip hop electronic film score jazz classical orchestra country feature film soundtrack hindi"


for i in range(len(df.index[:])):
        wiki_status = 0
        #print(i)
        fill_val = get_genre(list_titles[i])
        

        if len(fill_val) > 0: 
            test_val = fill_val[0].lower().split(",",1)[0]

            x = test_val.strip() not in songs_to_avoid

            if x:
                wiki_status = 1
                df.loc[i, 'Genre'] = " ".join(fill_val)
                continue
            
        for j in change_titles:
            changed_title = list_titles[i] + j
            fill_val2 = get_genre(changed_title)
            

            if len(fill_val2) > 0:
                wiki_status = 1
                df.loc[i, 'Genre'] = " ".join(fill_val2)
                break

        # Nothing usable on wikipedia: fall back to rotten tomatoes
        if wiki_status == 0 :
            value_change = list_titles[i].lower()
            from_rt = get_info_rt(value_change)
            if len(from_rt)>0:
                df.loc[i, 'Genre'] = from_rt.get(key = 'Genre')
                
            else:
                # retry with the release year appended to the title
                year = str(df.loc[i, 'release year'])
                value_change2 = value_change + "_" + year
                from_rt2 = get_info_rt(value_change2)
                df.loc[i, 'Genre'] = from_rt2.get(key = 'Genre')
                
            
t2= time.time()
time_ = t2-t1
print("Done in seconds : ",time_)
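
The wikipedia lookup order used by the script (plain title first, then the title with each suffix) can be sketched as a small helper. The name `candidate_pages` is hypothetical, and the example title is only an illustration:

```python
def candidate_pages(title, suffixes=("_(TV_series)", "_(American_TV_series)", "_(franchise)")):
    """Return page names to try, in order: the bare title, then title + each suffix."""
    base = title.strip().replace(" ", "_")
    return [base] + [base + suffix for suffix in suffixes]

pages = candidate_pages(" Black Mirror ")
for page in pages:
    print("https://en.wikipedia.org/wiki/" + page)
```

Generating the candidates up front keeps the request loop simple: try each page in turn and stop at the first one that returns a genre.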

The rating description mentions the minimum age of a viewer for each show/movie, which can be stored in a separate column.

In [None]:
#Create additional column for age restriction details

df['Age Restriction'] = "None"
for i in df.index:
    # collect every number mentioned in the description, e.g. "14"
    num = " ".join(re.findall(r'[0-9]+', df.loc[i, 'ratingDescription']))
    if num:
        df.loc[i, 'Age Restriction'] = num + "+"
    else:
        df.loc[i, 'Age Restriction'] = ""
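
A vectorized sketch of the same extraction, using pandas string methods on a toy Series. The second description is invented for illustration; the logic mirrors the loop (join all numbers, append "+", leave an empty string when no number is found):

```python
import pandas as pd

desc = pd.Series([
    "Suitable for all ages.",
    "May be unsuitable for children ages 14 and under.",  # invented example text
])

# Find all digit runs per row and join them with a space ("" when there are none).
nums = desc.str.findall(r"[0-9]+").str.join(" ")

# Keep "" where no number was found, otherwise append "+".
age = nums.where(nums == "", nums + "+")
print(age.tolist())
```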

In [None]:
df[:10]

In [None]:
df.to_csv('netflix_shows_with_genre.csv', index=False)  # skip the index so no 'Unnamed: 0' column appears on reload

In [None]:
df2 = pd.read_csv('./netflix_shows_with_genre.csv', encoding='UTF-8')
df2.head(10)
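
One thing worth knowing about this round trip: with default settings, `read_csv` turns empty strings into missing values, so rows whose 'Age Restriction' was `""` come back as `NaN`. A minimal sketch on an in-memory buffer (toy rows, invented for illustration):

```python
import io
import pandas as pd

toy = pd.DataFrame({"title": ["A", "B"], "Age Restriction": ["", "14+"]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
default_read = pd.read_csv(buffer)                       # "" is read back as NaN
buffer.seek(0)
keep_empty = pd.read_csv(buffer, keep_default_na=False)  # "" stays an empty string

print(default_read["Age Restriction"].isna().tolist())
print(keep_empty["Age Restriction"].tolist())
```

Since `value_counts` and `dropna` treat `NaN` differently from `""`, this choice affects the counts and filters computed from the reloaded frame.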

## Find out trending rating and genre for shows/movies in each decade

In [None]:
#Script to obtain trending Genre and rating in each decade

from collections import OrderedDict 
start = df2['release year'].min()

labels = []
values = [] 
genre = [] 
values_acc_score =[]
genre_acc_score = []

for i in range(len(df2)):
    
    end = start + 10
    span = str(start) + "-" + str(end)
    new_df = df2[(df2['release year'] >=start) & (df2['release year'] < end)]
    new_df = new_df[new_df['Genre'].notna()]
    new_df = new_df.set_index(np.arange(0,len(new_df)))

    if len(new_df) > 0:
        labels.append(span)

        #find trending rating and genre according to rating level
        max_user_rating = [new_df['rating'][i] for i in range(len(new_df)) if new_df['ratingLevel'][i] == new_df['ratingLevel'].max()]
        max_genre = [str(new_df['Genre'][i]).lower() for i in range(len(new_df)) if new_df['ratingLevel'][i] == new_df['ratingLevel'].max()]
        values.append(list(OrderedDict.fromkeys(max_user_rating)))
        genre.append(list(OrderedDict.fromkeys(max_genre)))

        #find trending rating and genre according to user rating score
        max_rating_score =[new_df['rating'][i] for i in range(len(new_df)) if new_df['user rating score'][i] == new_df['user rating score'].max()]
        max_genre_score = [new_df['Genre'][i].lower() for i in range(len(new_df)) if new_df['user rating score'][i] == new_df['user rating score'].max()]
        values_acc_score.append(list(OrderedDict.fromkeys(max_rating_score)))
        genre_acc_score.append(list(OrderedDict.fromkeys(max_genre_score)))

    start = start+ 10
    if start > 2017 :
        break

Trends = pd.DataFrame( columns=['Trending rating based on ratingLevel','Trending genre based on ratingLevel',
        'Trending rating based on user rating score','Trending genre based on user rating score'], index=labels)
Trends['Trending rating based on ratingLevel'] =[",".join(set(i)) for i in values]
Trends['Trending genre based on ratingLevel'] = [",".join(set(i)) for i in genre]
Trends['Trending rating based on user rating score'] = [",".join(i) for i in values_acc_score]
Trends['Trending genre based on user rating score'] = [",".join(i) for i in genre_acc_score]
# De-duplicate the comma-separated genre tokens in the rows at positions 4 and 5 (order preserved)
genre_col = 'Trending genre based on ratingLevel'
for pos in (4, 5):
    tokens = Trends[genre_col].iloc[pos].split(",")
    Trends.iloc[pos, Trends.columns.get_loc(genre_col)] = ",".join(OrderedDict.fromkeys(tokens))

        
Trends
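
The ten-year spans built in the loop above can also be derived arithmetically. The sketch below uses invented rows and differs from the script in one respect: `idxmax` keeps only the first top-scoring row per span, whereas the loop keeps all ties.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "release year": [1940, 1948, 1951, 2017],
    "user rating score": [70, 80, 90, 99],
    "rating": ["G", "PG", "PG", "TV-MA"],
})

# Spans start at the earliest year and are 10 years wide, as in the loop.
start = toy["release year"].min()
bin_start = start + (toy["release year"] - start) // 10 * 10
toy["span"] = bin_start.astype(str) + "-" + (bin_start + 10).astype(str)

# Rating of the top-scoring row in each span (first row wins on ties).
best = toy.loc[toy.groupby("span")["user rating score"].idxmax(), ["span", "rating"]]
print(best)
```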

## Wordcloud for trending Genre for shows in 20th and 21st Century

In [None]:
#Trending Genre in 20th century

comment_words = '' 
stopwords = set(STOPWORDS) 

lst = Trends['Trending genre based on ratingLevel'][:4].values.tolist() + Trends['Trending genre based on user rating score'][:4].values.tolist()

for val in lst: 
    val = str(val) 
    tokens = val.split() 

    for i in range(len(tokens)): 
        tokens[i] = tokens[i].lower() 
      
    comment_words += " ".join(tokens)+" "
  
wordcloud = WordCloud(width = 600, height = 400, background_color ='black', stopwords = stopwords, min_font_size = 10).generate(comment_words) 
                       
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
  
plt.show()

In [None]:
#Trending Genre for shows in 21st century

comment_words = '' 
stopwords = set(STOPWORDS) 
lst = Trends['Trending genre based on ratingLevel'][4:].values.tolist() + Trends['Trending genre based on user rating score'][4:].values.tolist()

for val in lst: 
    val = str(val) 
    tokens = val.split() 

    for i in range(len(tokens)): 
        tokens[i] = tokens[i].lower() 
      
    comment_words += " ".join(tokens)+" "
  
wordcloud = WordCloud(width = 600, height = 400, background_color ='black', stopwords = stopwords, min_font_size = 10).generate(comment_words) 
                       
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
  
plt.show()
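
The two word-cloud cells build their input text in the same way, so that step could be factored out. A sketch with a hypothetical helper name, `genre_text`; it produces the same lowercase, space-separated text as the loops, minus the trailing space:

```python
def genre_text(values):
    """Lowercase every whitespace-separated token and join them with single spaces."""
    return " ".join(token.lower() for value in values for token in str(value).split())

# Made-up genre strings for illustration.
text = genre_text(["Crime Drama", "comedy,thriller"])
print(text)
```

The returned string can be passed directly to `WordCloud(...).generate(text)`.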

## Rating frequency overall

In [None]:
rating_freq = df2['rating'].value_counts()

plt.figure(figsize=(15,10))
rating_freq.plot.bar(color='teal',alpha=0.5)
plt.xticks(rotation=50)
plt.xlabel("Ratings")
plt.ylabel("Frequency of ratings")
plt.title("Rating Frequency Overall")
plt.show()
plt.close()

In [None]:
age_freq = df2['Age Restriction'].value_counts().sort_values(ascending= True)

fig = plt.figure(figsize = (8, 6))
ax = fig.add_subplot()
explode = [0] * (len(age_freq) - 1) + [0.1]  # pull out the largest slice, whatever the number of categories
ax.pie(age_freq.values, explode=explode,labels = age_freq.index,shadow=True,
autopct = '%1.1f%%',textprops = {'fontsize': 15, 'color' : "black"})
ax.set_title("Age restriction frequency on Shows/Movies")
ax.axis('equal')
plt.show()

## Top shows/movies

In [None]:
df_top = df2[df2['user rating score']>0].dropna()
df_top = df_top.reset_index(drop=True)  # renumber rows without hard-coding the row count
df_top

In [None]:
score_freq = df_top['user rating score'].value_counts()

plt.figure(figsize=(15,10))
score_freq.plot.bar(color='red', alpha=0.5)
plt.xticks(rotation=50)
plt.xlabel("User Rating scores")
plt.ylabel("Frequency of user rating score")
plt.title("User Rating Score Frequency")
plt.show()

In [None]:
df_top.sort_values("user rating score", axis = 0, ascending = False, inplace = True, na_position ='last')

In [None]:
uniq = df_top['user rating score'].unique()

In [None]:
year_freq = df2['release year'].value_counts().sort_values(ascending= True)
year_freq

## Total Shows/Movies released in each year

In [None]:
plt.figure(figsize=(15,10))
year_freq.plot.bar(color='blue', alpha=0.5)
plt.xticks(rotation=50)
plt.xlabel("Release Years")
plt.ylabel("Shows/Movies Released")
plt.title("Shows/Movies Released In Each Year")

plt.show()

In [None]:
year_freq2 = df2['release year'].value_counts().sort_index(ascending= True)
fig = plt.figure(figsize = (20, 10))

ax2 = fig.add_subplot()
x = year_freq2.index
y = year_freq2.values
ax2.plot(x,y, marker='o', linestyle='-', color='b', 
label='Shows/Movies',alpha=0.5) 
ax2.set_xlabel('Release Years')
ax2.set_ylabel('Shows/Movies Released') 
ax2.set_title('Shows/Movies Released In Each Year')
ax2.legend(loc = "upper left") 

for a,b in zip(x, y): 
    plt.text(a, b, str(b))

plt.show()

## Frequency for rating level

In [None]:
rl_freq = df2['ratingLevel'].value_counts().sort_values(ascending= True)

plt.figure(figsize=(13,8))
rl_freq.plot.bar(color='blue', alpha=0.5)
plt.xticks(rotation=50)
plt.xlabel("Rating Levels")
plt.ylabel("Shows/Movies")
plt.title("Shows/Movies having rating Level")

plt.show()

In [None]:
rating_ = df2['rating'].unique()
year_ = df2['release year'].unique()
year_.sort()
print("Rating unique : ", rating_)
print("year unique : ", year_)

## Get the frequency of each rating in every year

In [None]:
c_list = []
rating_count_lst = []
final_c = 0

for i in range(len(year_)):
    for j in range(len(rating_)):
        count = 0
        for k in range(len(df2['rating'])):  # iterate over df2, the frame being counted
            if df2['rating'][k] == rating_[j] and df2['release year'][k] == year_[i]:
                count += 1
                final_c +=1
        c_list.append(count)
    rating_count_lst.append(c_list)
    c_list =[]   

print("rating freq in each year",rating_count_lst)
#print(series_rating)
print(len(rating_count_lst))
print(final_c)
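
The same year-by-rating count table can be sketched in one call with `pd.crosstab`. The rows below are invented for illustration; on the real data the inputs would be the 'release year' and 'rating' columns:

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "release year": [2015, 2016, 2016, 2016],
    "rating": ["G", "TV-MA", "TV-MA", "G"],
})

# Rows are years, columns are ratings, cells are counts (0 where a pair never occurs).
counts = pd.crosstab(toy["release year"], toy["rating"])
print(counts)
```

This avoids the triple loop, and the sum of all cells equals the number of rows counted.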


## Count of ratings every year -- Create a DataFrame

In [None]:
each_year = pd.DataFrame(rating_count_lst,columns=rating_, index= year_) #Create a dataframe for the result obtained above
each_year

In [None]:
maxValues = each_year.idxmax(axis = 1) #Most frequent rating in each year 
print(maxValues) 

In [None]:
rating_overall = maxValues.value_counts()
rating_overall

In [None]:
fig = plt.figure(figsize = (8, 6))
ax = fig.add_subplot()
explode = [0.1] + [0] * (len(rating_overall) - 1)  # pull out the first (most frequent) slice
ax.pie(rating_overall.values, explode=explode,labels = rating_overall.index,shadow=True,
autopct = '%1.1f%%',textprops = {'fontsize': 14, 'color' : "black"})
ax.set_title(" Yearly Dominating Rating Frequency")
ax.axis('equal')
plt.show()

## Heat Map for the DataFrame created above

In [None]:
import seaborn as sns

plt.figure(figsize=(15,15))
sns.heatmap(each_year,linewidths=1,annot=True,fmt='2.0f',cmap="viridis")
plt.title('Rating Frequency from 1940 To 2017')

## Drop rows with negligible values

In [None]:
year_df = each_year.drop(axis=0,index=[1940,1976,1978,1982,1987,1986,1989,1990]) 
year_df

## Visualization of Frequency of each rating from 1990-2017

In [None]:
fig = plt.figure(figsize = (20, 10))
ax2 = fig.add_subplot()
x= year_df.index
c=['blue','slateblue','darkslateblue','indigo','orangered','olive','cadetblue','purple','darkred','peru','darkgreen','fuchsia','teal']

for i in range(len(rating_)):
    ax2.plot(x,year_df[rating_[i]], marker='o', linestyle='-', color=c[i], label=rating_[i],alpha =0.5) 

ax2.set_xlabel('Release Years')
ax2.set_ylabel('Movie Ratings Frequency') 
ax2.set_title('Frequency of ratings of Shows/Movies In Each Year')
ax2.legend(loc = "upper left") 


plt.show()

In [None]:
# Mark titles whose description says they are suitable for all ages
all_ages = df2['ratingDescription'].str.contains('Suitable for all ages.', regex=False, na=False)
df2.loc[all_ages, 'Age Restriction'] = 'No Restriction'

In [None]:
age_freq = df2['Age Restriction'].value_counts().sort_values(ascending= True)

fig = plt.figure(figsize = (8, 9))
ax = fig.add_subplot()
ax.pie(age_freq.values,labels = age_freq.index,shadow=True,
autopct = '%1.1f%%',textprops = {'fontsize': 15, 'color' : "black"})
ax.set_title("Age restriction frequency on Shows/Movies")
ax.axis('equal')
plt.show()

In [None]:
df2[10:20]

In [None]:
df_general_shows = df2[(df2['Age Restriction'] == 'No Restriction') & (df2['user rating score'] > 75)]
x = df_general_shows.shape[0] #rows
x

In [None]:
df_Mature_shows = df2[(df2['Age Restriction'] != 'No Restriction') & (df2['user rating score'] > 75)]  # same frame and label as the general filter
y = df_Mature_shows.shape[0] #rows
y

## Ratio of popular general vs age-restricted shows/movies overall

In [None]:
print("Ratio of popular general : mature movies/shows in netflix --> ",int(x/10), ":", int(y/10))
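
Dividing both counts by 10 and truncating gives only a rough ratio. As an alternative sketch, the counts can be reduced exactly by their greatest common divisor; the helper name `simplify_ratio` and the example counts are hypothetical:

```python
from math import gcd

def simplify_ratio(a, b):
    """Reduce a:b to lowest terms; returns the pair unchanged when both are 0."""
    divisor = gcd(a, b) or 1   # gcd(0, 0) is 0, so fall back to 1 to avoid dividing by zero
    return a // divisor, b // divisor

general, mature = simplify_ratio(30, 90)   # made-up counts for illustration
print("general : mature -->", general, ":", mature)
```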

In [None]:
df_top = df2[df2['user rating score']>0].dropna()
df_top = df_top.reset_index(drop=True)  # renumber rows without hard-coding the row count
df_top

## Words commonly used in Titles among popular Movies

In [None]:
comment_words = '' 
stopwords = set(STOPWORDS) 
   
for val in df_top.title[:]: 
    val = str(val) 
    tokens = val.split() 
    for i in range(len(tokens)): 
        tokens[i] = tokens[i].lower() 
      
    comment_words += " ".join(tokens)+" "
  
wordcloud = WordCloud(width = 700, height = 500, background_color ='black', stopwords = stopwords, min_font_size = 10).generate(comment_words) 
                        
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
  
plt.show()

## Trending Genre overall

In [None]:
comment_words = '' 
stopwords = set(STOPWORDS) 

for val in df_top.Genre: 
    val = str(val) 
    tokens = val.split() 

    for i in range(len(tokens)): 
        tokens[i] = tokens[i].lower() 
      
    comment_words += " ".join(tokens)+" "
  
wordcloud = WordCloud(width = 600, height = 400, background_color ='black', stopwords = stopwords, min_font_size = 10).generate(comment_words) 
                       
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
  
plt.show()

## Final dataset to CSV file

Final dataset with two additional columns, 'Genre' and 'Age Restriction'. 

In [None]:
df2.to_csv('500_Netflix_Shows.csv', index=False) 