### **TDI DATA SCIENCE TRACK || WEEK 8**

In [1]:
import pandas as pd
import numpy as np
import re

In [3]:
df = pd.read_csv("netflix_titles.csv")
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


### **Section A: Data Cleaning and Transformation**
In this section, the focus is on cleaning, manipulating, and transforming data using pandas to ensure consistency and usability.



**1.	Remove unnecessary columns like show_id, description, and director.**

In [4]:
df.drop(columns = ["show_id", "description","director"], inplace = True)

**2.	Eliminate rows with missing values from the dataset.**

In [5]:
df.isna().sum()

type              0
title             0
cast            825
country         831
date_added       10
release_year      0
rating            4
duration          3
listed_in         0
dtype: int64

In [6]:
df.dropna(axis = "index", inplace= True)

In [8]:
df.shape

(7290, 9)

**3.	Convert the date_added and release_year columns into datetime format, and extract year, month, and day from date_added, along with extracting the year from release_year.**

In [14]:
df["date_added"].tail()

8801        March 9, 2016
8802    November 20, 2019
8804     November 1, 2019
8805     January 11, 2020
8806        March 2, 2019
Name: date_added, dtype: object

In [12]:
df["date_added"] = pd.to_datetime(df["date_added"])

ValueError: time data " August 4, 2017" doesn't match format "%B %d, %Y", at position 1365. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.

**4.	Calculate the total number of cast members for each show or movie and determine how to extract the lead actor’s name from the cast column, assuming the first actor listed is the lead.**

In [20]:
# total number of cast for each show
df["cast"]

1       Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...
4       Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...
7       Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...
8       Mel Giedroyc, Sue Perkins, Mary Berry, Paul Ho...
9       Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...
                              ...                        
8801    Ali Suliman, Saleh Bakri, Yasa, Ali Al-Jabri, ...
8802    Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...
8804    Jesse Eisenberg, Woody Harrelson, Emma Stone, ...
8805    Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...
8806    Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...
Name: cast, Length: 7290, dtype: object

**6.	Standardize the duration column by converting it into an integer format.**

In [23]:
df["duration"].str()

array(['2 Seasons', '125 min', '9 Seasons', '104 min', '127 min',
       '4 Seasons', '5 Seasons', '166 min', '103 min', '97 min',
       '106 min', '3 Seasons', '1 Season', '96 min', '124 min', '116 min',
       '98 min', '91 min', '115 min', '122 min', '99 min', '88 min',
       '100 min', '6 Seasons', '102 min', '93 min', '95 min', '85 min',
       '83 min', '182 min', '147 min', '90 min', '128 min', '143 min',
       '119 min', '114 min', '118 min', '108 min', '117 min', '121 min',
       '142 min', '113 min', '154 min', '120 min', '82 min', '94 min',
       '109 min', '101 min', '105 min', '86 min', '229 min', '76 min',
       '89 min', '110 min', '156 min', '112 min', '129 min', '107 min',
       '135 min', '136 min', '165 min', '150 min', '133 min', '145 min',
       '92 min', '7 Seasons', '64 min', '59 min', '111 min', '87 min',
       '148 min', '189 min', '141 min', '130 min', '10 Seasons', '68 min',
       '131 min', '8 Seasons', '17 Seasons', '126 min', '155 min',
       '1

**7.	Clean and capitalize the country names in the country column, ensuring no extra spaces.**

In [29]:
df["country"]

AttributeError: 'Series' object has no attribute 'trim'

**8.	Calculate how many genres each show or movie is listed under (i.e., column to be added is genre count).**

In [33]:
df["listed_in"].unique()

array(['International TV Shows, TV Dramas, TV Mysteries',
       'International TV Shows, Romantic TV Shows, TV Comedies',
       'Dramas, Independent Movies, International Movies',
       'British TV Shows, Reality TV', 'Comedies, Dramas',
       'Dramas, International Movies', 'TV Comedies, TV Dramas',
       'Crime TV Shows, Spanish-Language TV Shows, TV Dramas',
       'International TV Shows, TV Action & Adventure, TV Dramas',
       'Comedies, International Movies, Romantic Movies',
       'Docuseries, International TV Shows, Reality TV', 'Comedies',
       'Horror Movies, Sci-Fi & Fantasy', 'Thrillers',
       'British TV Shows, International TV Shows, TV Comedies',
       "Kids' TV, TV Comedies", 'Action & Adventure, Dramas', "Kids' TV",
       "Kids' TV, TV Sci-Fi & Fantasy",
       'Action & Adventure, Classic Movies, Dramas',
       'Dramas, Horror Movies, Thrillers',
       'Action & Adventure, Horror Movies, Thrillers',
       'Action & Adventure', 'Dramas, Thrillers',
   

**9.	Transform columns like listed_in -> genre, country, and cast -> actor into list format.**

In [36]:
df["cast"]

TypeError: 'list' object is not callable

**10.	Explode the genre, country, and actor columns into individual rows, and remove any null values.**

**11.	Filter out rows with empty country fields and drop unnecessary columns like date_added, cast, and listed_in.**

**12.	Preview the first few rows of the cleaned and transformed dataset to verify the changes.**

In [37]:
df.head()

Unnamed: 0,type,title,cast,country,date_added,release_year,rating,duration,listed_in
1,TV Show,Blood & Water,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries"
4,TV Show,Kota Factory,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ..."
7,Movie,Sankofa,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...","United States, Ghana, Burkina Faso, United Kin...","September 24, 2021",1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies"
8,TV Show,The Great British Baking Show,"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Ho...",United Kingdom,"September 24, 2021",2021,TV-14,9 Seasons,"British TV Shows, Reality TV"
9,Movie,The Starling,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",United States,"September 24, 2021",2021,PG-13,104 min,"Comedies, Dramas"


### **Section B: Grouping and Aggregation**
This section involves using groupby and aggregation methods to derive insights from the dataset based on specific criteria. <br>Select and complete 7 of the following questions.

**1.	What is the most popular genre based on the number of movies or TV shows listed?**

**2.	Which country has the highest number of movies or TV shows listed?**

In [40]:
df.groupby("country")["type"].count().sort_values(ascending=False).max()

np.int64(2479)

**3.	Which movie or TV show has the most listings?**

In [None]:
df[]

**4.	Which movie or TV show has the largest cast?**

**5.	Which movie or TV show has been broadcast the most times?**

**6.	Which movie or TV show is listed in the most unique countries?**

**7.	Which actor has had the most lead roles?**

**8.	Which actor has appeared in the most movies or TV shows?**

**9.	What is the most common rating assigned to movies or TV shows?**

**10.	Are there more TV shows or movies listed on Netflix?**

**11.	On which day of the week are the most movies or TV shows added to Netflix?**

**12.	In which year were the most movies or TV shows added to Netflix?**

**13.	Which month sees the highest number of movies or TV shows added to Netflix?**

**14.	Which release year had the most movies or TV shows?**