In [99]:
import pandas as pd

## Task 1: Dataset Overview

In [100]:
df_netflix = pd.read_csv('Data/netflix_titles.csv')
df_netflix.sample(5)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
2412,80097072,TV Show,NATURE: Natural Born Hustlers,,Kevin Draine,United States,"March 1, 2017",2016,TV-G,1 Season,"Docuseries, Science & Nature TV",Sometimes being shady is the only way to survi...
3055,81121179,Movie,NOVA: Extreme Animal Weapons,Peter Fison,Doug Averill,United States,"July 1, 2019",2017,TV-PG,53 min,Documentaries,"From huge tusks to giant horns, some animals s..."
1706,80080768,Movie,The Blackcoat's Daughter,Osgood Perkins,"Emma Roberts, Kiernan Shipka, Lucy Boynton, La...","Canada, United States","May 18, 2019",2015,R,95 min,"Horror Movies, Independent Movies, Thrillers",When their parents fail to pick them up for wi...
321,70058023,Movie,Superbad,Greg Mottola,"Jonah Hill, Michael Cera, Christopher Mintz-Pl...",United States,"September 1, 2019",2007,R,113 min,"Comedies, Cult Movies",Two best friends' quest to buy booze for a par...
1120,80125671,Movie,Shot Caller,Ric Roman Waugh,"Nikolaj Coster-Waldau, Omari Hardwick, Lake Be...",United States,"November 24, 2019",2017,R,121 min,"Dramas, Thrillers","Trying to go straight, a once-successful busin..."


In [101]:
df_netflix.dtypes

show_id          int64
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object

In [102]:
df_netflix.shape

(6234, 12)

## Task 2: Identifying missing data

In [103]:
# Number of missing values in each column
df_netflix.isnull().sum().sort_values(ascending = False)

director        1969
cast             570
country          476
date_added        11
rating            10
show_id            0
type               0
title              0
release_year       0
duration           0
listed_in          0
description        0
dtype: int64

In [104]:
# Percentage of missing values in each column (scale to percent first, then round)
(df_netflix.isnull().mean() * 100).round(2).sort_values(ascending = False)

director        31.58
cast             9.14
country          7.64
date_added       0.18
rating           0.16
show_id          0.00
type             0.00
title            0.00
release_year     0.00
duration         0.00
listed_in        0.00
description      0.00
dtype: float64
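
The counts and percentages above can also be viewed side by side. The following is a minimal sketch on a made-up toy frame; `missing_summary` is a hypothetical helper name, not part of pandas or of this notebook.

```python
import pandas as pd

# Toy frame standing in for df_netflix (values are invented for illustration)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": [None, "X", None, "Y"],
    "rating": ["R", None, "PG", "R"],
})

def missing_summary(df):
    """Return per-column missing counts and percentages, largest first."""
    counts = df.isnull().sum()
    percent = (df.isnull().mean() * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    return summary.sort_values("missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df_netflix)` should give the same numbers as the two cells above in a single table, assuming the frame is loaded as shown earlier.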

## Task 3: Dealing with missing data

In [105]:
# Drop 'director' column
# df_netflix.drop('director', axis = 1)

In [106]:
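# Drop the rows where 'director' is missing
# (no_director is reused in later cells, so it must be defined here;
#  drop() returns a new DataFrame and leaves df_netflix unchanged)
no_director = df_netflix[df_netflix['director'].isnull()].index
# df_netflix.drop(no_director, axis = 0)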

In [107]:
df_netflix.drop(no_director, axis = 0).isnull().sum()

show_id           0
type              0
title             0
director          0
cast            356
country         171
date_added        0
release_year      0
rating            6
duration          0
listed_in         0
description       0
dtype: int64

In [108]:
# Boolean mask: ~ negates .isnull(), keeping only rows that have a director
# df_netflix[~(df_netflix['director'].isnull())]

In [109]:
# dropna()
df_netflix.dropna(subset = ['director'])

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...
6,70304989,Movie,Automata,Gabe Ibáñez,"Antonio Banderas, Dylan McDermott, Melanie Gri...","Bulgaria, United States, Spain, Canada","September 8, 2017",2014,R,110 min,"International Movies, Sci-Fi & Fantasy, Thrillers","In a dystopian future, an insurance adjuster f..."
7,80164077,Movie,Fabrizio Copano: Solo pienso en mi,"Rodrigo Toro, Francisco Schultz",Fabrizio Copano,Chile,"September 8, 2017",2017,TV-MA,60 min,Stand-Up Comedy,Fabrizio Copano takes audience participation t...
9,70304990,Movie,Good People,Henrik Ruben Genz,"James Franco, Kate Hudson, Tom Wilkinson, Omar...","United States, United Kingdom, Denmark, Sweden","September 8, 2017",2014,R,90 min,"Action & Adventure, Thrillers",A struggling couple can't believe their luck w...
...,...,...,...,...,...,...,...,...,...,...,...,...
6142,80063224,TV Show,The Great British Baking Show,Andy Devonshire,"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Ho...",United Kingdom,"August 30, 2019",2019,TV-PG,7 Seasons,"British TV Shows, Reality TV",A talented batch of amateur bakers face off in...
6158,80164216,TV Show,Miraculous: Tales of Ladybug & Cat Noir,Thomas Astruc,"Cristina Vee, Bryce Papenbrook, Keith Silverst...","France, South Korea, Japan","August 2, 2019",2018,TV-Y7,4 Seasons,"Kids' TV, TV Action & Adventure","When Paris is in peril, Marinette becomes Lady..."
6167,80115328,TV Show,Sacred Games,"Vikramaditya Motwane, Anurag Kashyap","Saif Ali Khan, Nawazuddin Siddiqui, Radhika Ap...","India, United States","August 15, 2019",2019,TV-MA,2 Seasons,"Crime TV Shows, International TV Shows, TV Dramas",A link in their pasts leads an honest cop to a...
6182,80176842,TV Show,Men on a Mission,Jung-ah Im,"Ho-dong Kang, Soo-geun Lee, Sang-min Lee, Youn...",South Korea,"April 9, 2019",2019,TV-14,4 Seasons,"International TV Shows, Korean TV Shows, Stand...",Male celebs play make-believe as high schooler...
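
The three approaches in this task (dropping by index, filtering with a negated null mask, and `dropna` with `subset`) should select the same rows. A small sketch on an invented toy frame, to check that claim rather than take it on faith:

```python
import pandas as pd

# Toy frame with two rows missing a director (values are invented)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": [None, "X", None, "Y"],
})

# 1. Drop by the index labels of the rows where 'director' is missing
no_director = toy[toy["director"].isnull()].index
by_index = toy.drop(no_director, axis=0)

# 2. Keep the rows where the null mask is False
by_mask = toy[~toy["director"].isnull()]

# 3. dropna restricted to a single column
by_dropna = toy.dropna(subset=["director"])

# All three return new frames; the original toy frame is not modified
same = by_index.equals(by_mask) and by_mask.equals(by_dropna)
print(same)
```

None of these calls changes the original frame, which is why `df_netflix` still contains rows without a director in later cells.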


In [110]:
# Use fillna() to replace NaN with the mean, median or mode

df_netflix['rating'].mode()

0    TV-MA
Name: rating, dtype: object

In [111]:
# mode() returns a Series (ties are possible); take its first entry
mode = df_netflix['rating'].mode()[0]

In [112]:
df_netflix['rating'] = df_netflix['rating'].fillna(mode)
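
Because `Series.mode()` returns a Series rather than a single value, it is worth seeing what happens when two values tie. The sketch below uses invented ratings, not the Netflix data.

```python
import pandas as pd

# Invented ratings with two missing entries
ratings = pd.Series(["TV-MA", "R", None, "TV-MA", None, "PG"], name="rating")

# mode() ignores NaN by default; [0] picks the first (here the only) mode
most_common = ratings.mode()[0]
filled = ratings.fillna(most_common)

# With a tie, mode() returns every tied value in sorted order
tied = pd.Series(["R", "PG", "R", "PG"]).mode()

print(most_common)
print(filled.tolist())
print(tied.tolist())
```

Taking `[0]` on a tie picks the first value in sorted order, which is an arbitrary but deterministic choice; whether that is acceptable depends on the analysis.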

In [113]:
df_netflix.sample(5)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
3409,80156940,Movie,Pyar Ke Do Pal,Rajiv Mehra,"Mithun Chakraborty, Jayapradha, Simple Kapadia...",India,"January 15, 2018",1986,TV-PG,153 min,"Dramas, International Movies, Music & Musicals",Twins separated by a court order meet at camp ...
5644,80109399,TV Show,Kibaoh Klashers,,"Ben Diskin, Cherami Leigh, Dakota Basseri, Kei...",China,"October 6, 2017",2017,TV-Y7,2 Seasons,Kids' TV,Young beetle Dylan and his friends Hailey and ...
4248,81020868,TV Show,Conan Without Borders,,,United States,"December 31, 2018",2018,TV-14,1 Season,"Docuseries, TV Comedies",Late-night talk show host Conan O'Brien hits t...
1572,81104372,TV Show,What If?,,"Monther Rayahnah, Khaled Ameen, Aseel Omran, R...",,"May 7, 2019",2019,TV-14,1 Season,"International TV Shows, TV Dramas",Four individuals at a crossroads in life are g...
3347,81038583,TV Show,Strongland,,,,"January 18, 2019",2018,TV-PG,1 Season,"Docuseries, International TV Shows",From Spain's countryside to Scotland's stony t...


In [114]:
df_netflix.drop(no_director, axis = 0).isnull().sum()

show_id           0
type              0
title             0
director          0
cast            356
country         171
date_added        0
release_year      0
rating            0
duration          0
listed_in         0
description       0
dtype: int64

## Task 4: Extracting Data

In [115]:
# .copy() makes df_movie an independent DataFrame, so adding a column
# does not trigger SettingWithCopyWarning
df_movie = df_netflix[df_netflix['type'] == 'Movie'].copy()
df_movie['duration (min)'] = df_movie['duration'].str.split(expand = True)[0].astype(int)


In [116]:
df_movie.sample(4)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,duration (min)
4607,81035888,Movie,Good and Prosperous,Sameh Abdulaziz,"Ali Rabee, Mohamed Abdel-Rahman, Bayoumi Fouad...",Egypt,"December 10, 2019",2017,TV-14,101 min,"Comedies, International Movies",As two jobless brothers search aimlessly for c...,101
87,80200087,Movie,Domino,Brian De Palma,"Nikolaj Coster-Waldau, Carice van Houten, Eriq...","Denmark, France, Italy, Belgium, Netherlands","September 28, 2019",2019,R,89 min,"International Movies, Thrillers",A Copenhagen police officer hunts for the man ...,89
325,70136074,Movie,The Last Exorcism,Daniel Stamm,"Patrick Fabian, Ashley Bell, Iris Bahr, Louis ...","France, United States","September 1, 2019",2010,PG-13,88 min,"Horror Movies, Independent Movies, Thrillers",Ready to expose his miraculous deeds as mere t...,88
731,70280748,Movie,Russell Peters: Notorious,Dave Higby,Russell Peters,United States,"October 14, 2013",2013,NR,72 min,Stand-Up Comedy,Global comedy star Russell Peters leaves no et...,72


In [117]:
df_movie.dtypes

show_id            int64
type              object
title             object
director          object
cast              object
country           object
date_added        object
release_year       int64
rating            object
duration          object
listed_in         object
description       object
duration (min)     int64
dtype: object
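
Splitting on whitespace works here because every movie's duration has the form `N min`. As an alternative sketch, assuming that same format, a regular expression can pull out the minutes and leave anything else (such as `1 Season` or a missing value) as `<NA>` instead of raising an error. The sample values are invented.

```python
import pandas as pd

# Invented durations, including a TV-show style value and a missing one
durations = pd.Series(["90 min", "1 Season", "113 min", None], name="duration")

# Capture the digits only when the value looks like "<number> min";
# expand=False returns a Series because there is a single capture group
extracted = durations.str.extract(r"^(\d+)\s*min$", expand=False)

# Convert to a nullable integer dtype so unmatched rows stay as <NA>
minutes = pd.to_numeric(extracted, errors="coerce").astype("Int64")
print(minutes)
```

This trades a little verbosity for not having to filter to movies first.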

In [118]:
# Assign on an explicit copy to avoid SettingWithCopyWarning;
# .str.strip() removes the leading space left by splitting on ','
df_movie = df_movie.copy()
df_movie['year added'] = df_movie['date_added'].str.split(',', expand = True)[1].str.strip()


In [119]:
df_movie.sample(4)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,duration (min),year added
4919,70219529,Movie,Joker,Shirish Kunder,"Akshay Kumar, Sonakshi Sinha, Shreyas Talpade,...",India,"August 2, 2018",2012,TV-PG,98 min,"Comedies, International Movies, Music & Musicals",A remote village situated neither in India or ...,98,2018
5320,80003150,Movie,Secrets of Althorp - The Spencers,Kasia Uscinska,Samuel West,United States,"April 22, 2017",2013,TV-PG,54 min,"Documentaries, International Movies","Princess Diana's brother, Charles, the ninth E...",54,2017
907,70229048,Movie,Thaandavam,Vijay,"Vikram, Jagapathi Babu, Anushka Shetty, Amy Ja...",India,"October 1, 2018",2012,TV-14,156 min,"Action & Adventure, International Movies, Musi...","In London, a mysterious blind man named Kenny ...",156,2018
2820,70250364,Movie,Katt Williams: Kattpacalypse,Marcus Raboy,Katt Williams,United States,"July 3, 2018",2012,NR,61 min,Stand-Up Comedy,Urban comic Katt Williams ushers in Kattpacaly...,61,2018
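
Splitting `date_added` on the comma yields the year as text. If a numeric year is wanted, one possible approach is to parse the column as dates. This is a sketch on invented values, assuming the `Month D, YYYY` format seen in the samples above and an English locale for month names.

```python
import pandas as pd

# Invented values in the same "Month D, YYYY" format, one with stray whitespace
date_added = pd.Series(["August 2, 2018", " April 22, 2017", None], name="date_added")

# errors="coerce" turns unparseable or missing entries into NaT
parsed = pd.to_datetime(date_added.str.strip(), format="%B %d, %Y", errors="coerce")

# Nullable integer dtype keeps missing years as <NA> rather than forcing floats
year_added = parsed.dt.year.astype("Int64")
print(year_added)
```

Parsing also makes the month and day available through `parsed.dt.month` and `parsed.dt.day`.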
