# Lab Instructions

Find a dataset that interests you. I'd recommend starting on [Kaggle](https://www.kaggle.com/). Read through all of the material about the dataset and download a .CSV file.

1. Write a short summary of the data.  Where did it come from?  How was it collected?  What are the features in the data?  Why is this dataset interesting to you?  

2. Identify 5 interesting questions about your data that you can answer using Pandas methods.  

3. Answer those questions!  You may use any method you want (including LLMs) to help you write your code; however, you should use Pandas to find the answers.  LLMs will not always write code in this way without specific instruction.  

4. Write the answer to your question in a text box underneath the code you used to calculate the answer.



Summary of the Data:

Source: Kaggle dataset compiled by Shivam Bansal.

Collection Method: The data was scraped from Netflix’s official site and catalog APIs in 2021, then cleaned into CSV format.

Features:

- show_id – Unique identifier for each show
- type – Movie or TV Show
- title – Title of the show
- director – Director(s)
- cast – Main cast members
- country – Country of origin
- date_added – When the show was added to Netflix
- release_year – Original release year
- rating – TV or movie content rating (e.g., PG, TV-MA)
- duration – Length in minutes or number of seasons
- listed_in – Genre categories
- description – Summary of the show

Why It’s Interesting:

Netflix has become a major global entertainment platform, and this dataset lets us explore trends in media production, genres, and content ratings over time.

We can answer real-world business and cultural questions, like:

- What types of content dominate Netflix’s catalog?
- Which countries contribute the most content?
- What are the most popular genres?

The five questions answered below:

1. Which country has produced the most Netflix content?
2. What are the top 5 most common genres on Netflix?
3. How has the number of releases changed over time?
4. Which directors have the most titles on Netflix?
5. What’s the average duration of movies on Netflix?

In [1]:
import pandas as pd

# Load Netflix dataset
df = pd.read_csv('netflix_titles.csv')

# Preview
print(df.head())
print("\nDataset Info:")
print(df.info())


  show_id     type                  title         director  \
0      s1    Movie   Dick Johnson Is Dead  Kirsten Johnson   
1      s2  TV Show          Blood & Water              NaN   
2      s3  TV Show              Ganglands  Julien Leclercq   
3      s4  TV Show  Jailbirds New Orleans              NaN   
4      s5  TV Show           Kota Factory              NaN   

                                                cast        country  \
0                                                NaN  United States   
1  Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...   South Africa   
2  Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...            NaN   
3                                                NaN            NaN   
4  Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...          India   

           date_added  release_year rating   duration  \
0  September 25, 2021          2020  PG-13     90 min   
1  September 24, 2021          2021  TV-MA  2 Seasons   
2  September 24, 2021        
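
The preview shows `NaN` in `director`, `cast`, and `country`, so it may be worth measuring how much data is missing before counting anything. A minimal sketch on a small made-up DataFrame (the `toy` rows and the `summary` name are illustrative, not taken from the real file):

```python
import pandas as pd

# Made-up rows standing in for netflix_titles.csv
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, "Y", None],
    "country": ["United States", "South Africa", None, None],
})

# Number of missing cells per column, and the same figure as a share of rows
missing_counts = toy.isna().sum()
missing_share = toy.isna().mean().round(2)

summary = pd.DataFrame(
    {"missing": missing_counts, "share": missing_share}
).sort_values("missing", ascending=False)
print(summary)
```

On the real data, the same two calls on `df` would show which columns need `dropna()` or similar handling before counting.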

In [2]:
country_counts = df['country'].value_counts().head(10)
print(country_counts)


country
United States     2818
India              972
United Kingdom     419
Japan              245
South Korea        199
Canada             181
Spain              145
France             124
Mexico             110
Egypt              106
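Name: count, dtype: int64

Answer: The United States has the most titles by a wide margin (2,818), followed by India (972) and the United Kingdom (419).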
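
One caveat with the counts above: `value_counts()` treats each distinct string as its own category. If a `country` cell lists several countries separated by commas (an assumption worth checking with `df['country'].str.contains(',').sum()`), a co-production is counted under the combined string rather than under each country. A minimal sketch of splitting before counting, using made-up rows and a hypothetical helper named `count_split_values`:

```python
import pandas as pd


def count_split_values(frame: pd.DataFrame, column: str, sep: str = ",") -> pd.Series:
    """Count individual values in a column whose cells may hold a delimited list.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    values = (
        frame[column]
        .dropna()                      # ignore rows with no value
        .str.split(sep)                # "A, B" -> ["A", " B"]
        .explode()                     # one row per list element
        .str.strip()                   # remove spaces left over from the split
        .loc[lambda s: s != ""]        # drop empties caused by stray separators
    )
    return values.value_counts()


# Made-up rows standing in for netflix_titles.csv
toy = pd.DataFrame({
    "country": [
        "United States",
        "United States, India",
        "India",
        None,
        "Japan, United States",
    ],
})

country_totals = count_split_values(toy, "country")
print(country_totals.head(10))
```

The same helper could be applied to `director`, where the output further down shows at least one comma-separated pair counted as a single entry.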


In [3]:
# Split the genre strings and count them individually
genres = df['listed_in'].dropna().str.split(', ')
genre_counts = genres.explode().value_counts().head(5)
print(genre_counts)


listed_in
International Movies      2752
Dramas                    2427
Comedies                  1674
International TV Shows    1351
Documentaries              869
Name: count, dtype: int64

Answer: The five most common genres are International Movies (2,752), Dramas (2,427), Comedies (1,674), International TV Shows (1,351), and Documentaries (869).


In [4]:
release_trend = df['release_year'].value_counts().sort_index()
print(release_trend.tail(10))  # Show recent years


release_year
2012     237
2013     288
2014     352
2015     560
2016     902
2017    1032
2018    1147
2019    1030
2020     953
2021     592
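Name: count, dtype: int64

Answer: The number of titles per release year climbs from 237 (2012) to a peak of 1,147 (2018), then falls to 953 in 2020. The 2021 figure (592) covers only part of the year, since the data was collected in 2021.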
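
Note that `release_year` is the original release year, so the table above describes when titles were first released, not when they arrived on Netflix. A sketch of the second view, assuming `date_added` holds strings like `September 25, 2021` as in the preview, possibly with stray whitespace (the whitespace is an assumption; the `toy` rows are made up):

```python
import pandas as pd

# Made-up rows standing in for the date_added column
toy = pd.DataFrame({
    "date_added": ["September 25, 2021", " August 4, 2020", "January 1, 2020", None],
})

# Strip whitespace, then parse; values that do not match become NaT instead of raising
added = pd.to_datetime(
    toy["date_added"].str.strip(),
    format="%B %d, %Y",
    errors="coerce",
)

# Count titles per year added, ignoring rows with no usable date
added_per_year = added.dropna().dt.year.value_counts().sort_index()
print(added_per_year)
```

Comparing this series with the `release_year` counts would separate growth in Netflix's catalog from trends in production.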


In [5]:
top_directors = df['director'].value_counts().head(5)
print(top_directors)


director
Rajiv Chilaka             19
Raúl Campos, Jan Suter    18
Suhas Kadav               16
Marcus Raboy              16
Jay Karas                 14
Name: count, dtype: int64

Answer: Rajiv Chilaka has the most titles (19), followed by the team of Raúl Campos and Jan Suter (18), Suhas Kadav and Marcus Raboy (16 each), and Jay Karas (14).


In [6]:
# Filter to only movies
# .copy() makes movies an independent DataFrame, which avoids SettingWithCopyWarning below
movies = df[df['type'] == 'Movie'].copy()

# Extract the numeric minutes from the duration string (e.g. "90 min" -> 90.0)
movies['duration_mins'] = movies['duration'].str.replace(' min', '', regex=False).astype(float)

average_duration = movies['duration_mins'].mean()
print(average_duration)


99.57718668407311

Answer: The average movie on Netflix runs about 99.6 minutes.
