In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

df = pd.read_csv('../input/Netflix Shows.csv', encoding='cp437')
df.head()

In [2]:
df.info()

There are many missing values for the user rating score and some missing values for rating level. All other values are present.
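The counts that `df.info()` reports can also be turned into a per-column table of missing values. The sketch below is one way to do that; the helper name `missing_summary` and the toy frame are hypothetical stand-ins (the toy column names only mimic the ones discussed above), so the same function would need to be called on `df` to get the real numbers.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Count and percentage of missing values per column, most missing first."""
    counts = frame.isnull().sum()
    percent = (100 * counts / len(frame)).round(1)
    summary = pd.DataFrame({'missing': counts, 'percent': percent})
    return summary.sort_values('missing', ascending=False)


# Toy data for illustration only; not taken from the Netflix dataset.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'rating level': ['x', None, 'y', 'z'],
    'user rating score': [90.0, np.nan, np.nan, 75.0],
})
print(missing_summary(toy))
```

In the notebook this would be called as `missing_summary(df)`.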

In [3]:
df.describe()

The oldest release date is 1940 and the most recent is from this year (2017).

First, let's check whether there are any duplicate titles.

In [4]:
df['title'].value_counts().head()

In [5]:
df.drop_duplicates(inplace=True)
df['title'].value_counts().head()

In [6]:
multiple_titles = df['title'].value_counts().iloc[0:4].keys()
df[df['title'].isin(multiple_titles)]

A few shows still share a title after the duplicate rows are removed. These appear to be different shows. For example, there were two different shows titled "Bordertown" in 2017.

In [7]:
df.info()

Over half of the entries in the original dataset were duplicates.
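The share of duplicates can be measured directly rather than read off two `df.info()` outputs. This is a minimal sketch, assuming it is run on the frame before `drop_duplicates` is applied; `duplicate_share` is a hypothetical helper and the toy rows are made up for illustration.

```python
import pandas as pd


def duplicate_share(frame):
    """Fraction of rows that exactly repeat an earlier row."""
    return frame.duplicated().mean()


# Toy data for illustration only: 'A' is repeated exactly,
# while the two 'B' rows differ in release year and are both kept.
toy = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'B'],
    'release year': [2016, 2016, 2016, 2017, 2015],
})
share = duplicate_share(toy)
deduped = toy.drop_duplicates()
print(f'{share:.0%} of rows are duplicates; {len(deduped)} rows remain')
```

Because `duplicated()` compares whole rows, titles that share a name but differ in any other column survive, which matches the repeated titles seen above.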

Next, I will investigate whether there is any relationship between the release year and the user rating score.

In [8]:
# Set the x-axis limits through jointplot itself so they apply to the joint axes
sns.jointplot(data=df, x='release year', y='user rating score', xlim=(1939, 2018))

There does not appear to be a relationship between the release year and the user rating score.
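A correlation coefficient is one way to put a number on what the plot suggests. The sketch below computes the Pearson correlation between the two columns after dropping rows with a missing score; `year_score_correlation` is a hypothetical helper, and the toy data is deliberately linear so the function can be checked, so it says nothing about the real dataset.

```python
import numpy as np
import pandas as pd


def year_score_correlation(frame):
    """Pearson correlation between release year and user rating score."""
    pair = frame[['release year', 'user rating score']].dropna()
    return pair['release year'].corr(pair['user rating score'])


# Toy data for illustration only: scores rise linearly with the year,
# and the last row has a missing score that is dropped.
toy = pd.DataFrame({
    'release year': [2000, 2005, 2010, 2015, 2017],
    'user rating score': [60.0, 70.0, 80.0, 90.0, np.nan],
})
r = year_score_correlation(toy)
print(round(r, 3))
```

On the real data, a value of `year_score_correlation(df)` close to zero would be consistent with the scatter plot.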

Next, I will look at user rating scores by content rating.

In [9]:
order = np.sort(df['rating'].unique())
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, y='user rating score', x='rating',order=order)
plt.title('User scores by Rating')

Rating groups, from the [Netflix Official Site][1]:

 - Ratings for Little Kids: G, TV-Y, TV-G.
 - Ratings for Older Kids: PG, TV-Y7, TV-Y7-FV, TV-PG.
 - Ratings for Teens: PG-13, TV-14.
 - Ratings for Adults: R, NC-17, NR, UR, TV-MA.

Ratings will be combined into age groups.

  [1]: https://help.netflix.com/en/node/2064

In [10]:
def age_group(rating):
    little = ['G','TV-Y','TV-G']
    older = ['PG', 'TV-Y7', 'TV-Y7-FV', 'TV-PG']
    teens = ['PG-13', 'TV-14']
    adult = ['R', 'NC-17', 'NR', 'UR', 'TV-MA']
    
    if rating in little:
        return 'Little Kids'
    elif rating in older:
        return 'Older Kids'
    elif rating in teens:
        return 'Teens'
    elif rating in adult:
        return 'Adults'
    else:
        return 'Missing'
    
df['age_group'] = df['rating'].apply(age_group)
df.head()
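The same grouping can be written as a lookup table instead of an `if`/`elif` chain. This is only an alternative sketch: the names `AGE_GROUPS`, `RATING_TO_GROUP` and `age_group_series` are hypothetical, and the rating lists are copied from the function above.

```python
import pandas as pd

# Age group -> ratings, copied from the lists used in age_group().
AGE_GROUPS = {
    'Little Kids': ['G', 'TV-Y', 'TV-G'],
    'Older Kids': ['PG', 'TV-Y7', 'TV-Y7-FV', 'TV-PG'],
    'Teens': ['PG-13', 'TV-14'],
    'Adults': ['R', 'NC-17', 'NR', 'UR', 'TV-MA'],
}

# Invert to rating -> age group for direct lookup.
RATING_TO_GROUP = {
    rating: group
    for group, ratings in AGE_GROUPS.items()
    for rating in ratings
}


def age_group_series(ratings):
    """Map a Series of ratings to age groups; unknown ratings become 'Missing'."""
    return ratings.map(RATING_TO_GROUP).fillna('Missing')


# Toy ratings for illustration only; 'XYZ' is not a real rating.
toy_ratings = pd.Series(['TV-Y', 'PG', 'TV-14', 'TV-MA', 'XYZ'])
toy_groups = age_group_series(toy_ratings)
print(toy_groups.tolist())
```

With this version the assignment would read `df['age_group'] = age_group_series(df['rating'])`, and adding a rating means editing only the dictionary.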

In [11]:
# Check whether any rating failed to map to an age group (would show as 'Missing')
df['age_group'].unique()

In [12]:
order = ['Little Kids','Older Kids','Teens','Adults']
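# Pass the frame and column by keyword; recent seaborn versions reject a positional Series here
sns.countplot(data=df, x='age_group', order=order)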

In [13]:
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, y='user rating score', x='age_group', order=order)
plt.title('User scores by Age Group')

In [14]:
print('Highest and lowest user rated shows by age group:')

for group in order:
    print('\n' + group+':')
    print('Highest')
    print(df[df['age_group']==group].sort_values('user rating score',ascending=False)[['title','user rating score']].head(1))
    print('\nLowest')
    print(df[df['age_group']==group].sort_values('user rating score')[['title','user rating score']].head(1))
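The loop above can also be collapsed into a single table with `groupby` plus `idxmax`/`idxmin`. This is a sketch of that alternative, not a drop-in replacement: `score_extremes` is a hypothetical helper, the toy rows are invented, and on ties `idxmax`/`idxmin` return the first matching row, which may differ from the row the sorted version prints.

```python
import numpy as np
import pandas as pd


def score_extremes(frame):
    """Highest and lowest scored title per age group, as one table."""
    # Rows without a score cannot be ranked, so drop them first.
    scored = frame.dropna(subset=['user rating score'])
    grouped = scored.groupby('age_group')['user rating score']
    highest = scored.loc[grouped.idxmax()].set_index('age_group')
    lowest = scored.loc[grouped.idxmin()].set_index('age_group')
    return pd.DataFrame({
        'highest title': highest['title'],
        'highest score': highest['user rating score'],
        'lowest title': lowest['title'],
        'lowest score': lowest['user rating score'],
    })


# Toy data for illustration only; not taken from the Netflix dataset.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E'],
    'age_group': ['Teens', 'Teens', 'Adults', 'Adults', 'Adults'],
    'user rating score': [70.0, 90.0, 85.0, np.nan, 60.0],
})
extremes = score_extremes(toy)
print(extremes)
```

In the notebook, `score_extremes(df).reindex(order)` would list the groups in the same order as the plots.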