<a href="https://colab.research.google.com/github/vikash2030/Netflix-Analytics-Project/blob/main/Netflix_Analytics_Project_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Data loading

Load the movie dataset from "data.csv" into a pandas DataFrame.


**Reasoning**:
Load the data from "data.csv" into a pandas DataFrame and display the first few rows to verify.



In [None]:
import pandas as pd

try:
    df_movies = pd.read_csv('data.csv')
    display(df_movies.head())
except FileNotFoundError:
    print("Error: 'data.csv' not found. Please ensure the file exists in the current directory.")
    df_movies = None
except pd.errors.ParserError:
    print("Error: Could not parse 'data.csv'. Please check the file format.")
    df_movies = None
except Exception as e:
    print(f"An unexpected error occurred: {e}")
    df_movies = None

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries
0,Forrest Gump,movie,"Drama, Romance",1994.0,tt0109830,8.8,2316353.0,MX
1,The Fifth Element,movie,"Action, Adventure, Sci-Fi",1997.0,tt0119116,7.6,517121.0,"AT, CH, DE"
2,Kill Bill: Vol. 1,movie,"Action, Crime, Thriller",2003.0,tt0266697,8.2,1221791.0,"AE, AL, AO, AT, AU, AZ, BG, BH, BY, CA, CI, CM..."
3,Jarhead,movie,"Biography, Drama, War",2005.0,tt0418763,7.0,211547.0,"AD, AE, AG, AL, AO, AR, AT, AZ, BA, BB, BG, BH..."
4,Unforgiven,movie,"Drama, Western",1992.0,tt0105695,8.2,443775.0,"AU, BA, BG, CZ, HR, HU, MD, ME, MK, NZ, PL, RO..."
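
Before cleaning, it can help to measure how much is missing and which columns still need type conversion. The sketch below is not part of the original notebook: `quality_report` is a hypothetical helper, and the small `sample` frame only stands in for `df_movies` (its values are illustrative, not taken from `data.csv`).

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing-value count and missing share for each column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
    })


# Toy stand-in for df_movies (hypothetical values).
sample = pd.DataFrame({
    "title": ["A", "B", "C", "C"],
    "imdbAverageRating": [8.8, None, 7.0, 7.0],
    "availableCountries": ["MX", None, "AT, CH", "AT, CH"],
})

report = quality_report(sample)
# Rows that repeat an earlier row in every column.
duplicate_rows = int(sample.duplicated().sum())

print(report)
print(f"fully duplicated rows: {duplicate_rows}")
```

Running the same helper on `df_movies` before and after the cleaning cell would show whether the imputation and de-duplication steps had the intended effect.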


## Data cleaning


Clean the movie DataFrame `df_movies`.


**Reasoning**:
Clean the movie DataFrame `df_movies` according to the instructions: handle missing values, standardize formats, and run data quality checks. Assign each `fillna` result back to its column instead of calling `fillna(..., inplace=True)` on a column selection, which avoids the chained-assignment FutureWarning in recent pandas versions.



In [None]:
# Standardize rating and vote formats
# Convert to numeric first so that non-numeric entries become NaN.
df_movies['imdbAverageRating'] = pd.to_numeric(df_movies['imdbAverageRating'], errors='coerce')
df_movies['imdbNumVotes'] = pd.to_numeric(df_movies['imdbNumVotes'], errors='coerce')

# Handle missing values
# Impute missing (or coerced) 'imdbAverageRating' with the median.
df_movies['imdbAverageRating'] = df_movies['imdbAverageRating'].fillna(df_movies['imdbAverageRating'].median())

# Impute missing (or coerced) 'imdbNumVotes' with the median.
df_movies['imdbNumVotes'] = df_movies['imdbNumVotes'].fillna(df_movies['imdbNumVotes'].median())

# Fill missing 'availableCountries' with 'Unknown'.
df_movies['availableCountries'] = df_movies['availableCountries'].fillna('Unknown')

# Standardize Genres: Split genres into individual columns
genres_split = df_movies['genres'].str.split(', ', expand=True)
genres_split.columns = [f'genre_{i+1}' for i in range(genres_split.shape[1])]
df_movies = pd.concat([df_movies, genres_split], axis=1)


# Data Quality Checks
# Remove duplicate rows based on all columns.
df_movies = df_movies.drop_duplicates()

# Convert 'releaseYear' to integer, coercing errors to NaN.
df_movies['releaseYear'] = pd.to_numeric(df_movies['releaseYear'], errors='coerce').astype('Int64')

display(df_movies.head())

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries,genre_1,genre_2,...,genre_3,genre_4,genre_5,genre_6,genre_1.1,genre_2.1,genre_3.1,genre_4.1,genre_5.1,genre_6.1
0,Forrest Gump,movie,"Drama, Romance",1994,tt0109830,8.8,2316353.0,MX,Drama,Romance,...,,,,,Drama,Romance,,,,
1,The Fifth Element,movie,"Action, Adventure, Sci-Fi",1997,tt0119116,7.6,517121.0,"AT, CH, DE",Action,Adventure,...,Sci-Fi,,,,Action,Adventure,Sci-Fi,,,
2,Kill Bill: Vol. 1,movie,"Action, Crime, Thriller",2003,tt0266697,8.2,1221791.0,"AE, AL, AO, AT, AU, AZ, BG, BH, BY, CA, CI, CM...",Action,Crime,...,Thriller,,,,Action,Crime,Thriller,,,
3,Jarhead,movie,"Biography, Drama, War",2005,tt0418763,7.0,211547.0,"AD, AE, AG, AL, AO, AR, AT, AZ, BA, BB, BG, BH...",Biography,Drama,...,War,,,,Biography,Drama,War,,,
4,Unforgiven,movie,"Drama, Western",1992,tt0105695,8.2,443775.0,"AU, BA, BG, CZ, HR, HU, MD, ME, MK, NZ, PL, RO...",Drama,Western,...,,,,,Drama,Western,,,,


**Reasoning**:
The output above contains duplicated genre columns (shown as `genre_1.1`, `genre_2.1`, ...) because the cell appends a fresh set of `genre_*` columns every time it runs. Drop any existing `genre_*` columns before splitting so that the cell can be re-run safely, then finish the task.



In [None]:
# Standardize rating and vote formats
# Convert to numeric first so that non-numeric entries become NaN.
df_movies['imdbAverageRating'] = pd.to_numeric(df_movies['imdbAverageRating'], errors='coerce')
df_movies['imdbNumVotes'] = pd.to_numeric(df_movies['imdbNumVotes'], errors='coerce')

# Handle missing values
# Impute missing (or coerced) ratings and vote counts with the column median.
df_movies['imdbAverageRating'] = df_movies['imdbAverageRating'].fillna(df_movies['imdbAverageRating'].median())
df_movies['imdbNumVotes'] = df_movies['imdbNumVotes'].fillna(df_movies['imdbNumVotes'].median())

# Fill missing 'availableCountries' with 'Unknown'.
df_movies['availableCountries'] = df_movies['availableCountries'].fillna('Unknown')


# Standardize Genres: Split genres into individual columns
# Drop genre_* columns left over from earlier runs (including duplicated labels)
# so that re-running this cell does not append another copy.
df_movies = df_movies.loc[:, ~df_movies.columns.str.match(r'genre_\d+$')]
genres_split = df_movies['genres'].str.split(', ', expand=True)
genres_split.columns = [f'genre_{i+1}' for i in range(genres_split.shape[1])]
df_movies = pd.concat([df_movies, genres_split], axis=1)


# Data Quality Checks
# Remove duplicate rows based on all columns.
df_movies = df_movies.drop_duplicates()

# Convert 'releaseYear' to a nullable integer, coercing errors to <NA>.
df_movies['releaseYear'] = pd.to_numeric(df_movies['releaseYear'], errors='coerce').astype('Int64')

display(df_movies.head())


## Data preparation

### Subtask:
Prepare the cleaned movie data for loading into an SQL database.


In [None]:
# 1. Remove duplicated column labels left over from re-running the cleaning cell.
#    This must happen first: selecting a duplicated label returns a DataFrame, not a Series.
df_movies = df_movies.loc[:, ~df_movies.columns.duplicated()]

# 2. Escape single quotes in string columns for hand-written SQL statements.
#    `.str.replace` keeps missing values as NaN; `astype(str)` would turn them into
#    the literal strings 'None'/'nan', which the `dropna` in step 5 could not remove.
#    (Parameterized inserts such as `DataFrame.to_sql` do not need this escaping.)
genre_cols = [f'genre_{i}' for i in range(1, 7)]
for col in ['title', 'type', 'availableCountries'] + genre_cols:
    df_movies[col] = df_movies[col].str.replace("'", "''", regex=False)

# 3. Convert numeric columns to appropriate types
df_movies['releaseYear'] = pd.to_numeric(df_movies['releaseYear'], errors='coerce').astype('Int64')
df_movies['imdbAverageRating'] = pd.to_numeric(df_movies['imdbAverageRating'], errors='coerce')
df_movies['imdbNumVotes'] = pd.to_numeric(df_movies['imdbNumVotes'], errors='coerce')

# 4. Normalize available countries into one row per (imdbId, country)
df_countries = df_movies[['imdbId', 'availableCountries']].copy()
df_countries['availableCountries'] = df_countries['availableCountries'].str.split(r',\s*', regex=True)
df_countries = df_countries.explode('availableCountries')
df_countries = df_countries.rename(columns={'availableCountries': 'country'})
df_countries = df_countries[df_countries['country'] != 'Unknown']
df_countries = df_countries.drop_duplicates()
df_movies = df_movies.drop('availableCountries', axis=1)

# 5. Normalize genres into one row per (imdbId, genre)
df_genres = df_movies[['imdbId'] + genre_cols].melt(
    id_vars=['imdbId'], var_name='genre_number', value_name='genre')
df_genres = df_genres.dropna(subset=['genre'])
df_genres = df_genres[['imdbId', 'genre']].drop_duplicates()


display(df_movies.head())
display(df_countries.head())
display(df_genres.head())

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,genre_1,genre_2,genre_3,genre_4,genre_5,genre_6
0,Forrest Gump,movie,"Drama, Romance",1994,tt0109830,8.8,2316353.0,Drama,Romance,,,,
1,The Fifth Element,movie,"Action, Adventure, Sci-Fi",1997,tt0119116,7.6,517121.0,Action,Adventure,Sci-Fi,,,
2,Kill Bill: Vol. 1,movie,"Action, Crime, Thriller",2003,tt0266697,8.2,1221791.0,Action,Crime,Thriller,,,
3,Jarhead,movie,"Biography, Drama, War",2005,tt0418763,7.0,211547.0,Biography,Drama,War,,,
4,Unforgiven,movie,"Drama, Western",1992,tt0105695,8.2,443775.0,Drama,Western,,,,


Unnamed: 0,imdbId,country
0,tt0109830,MX
1,tt0119116,AT
1,tt0119116,CH
1,tt0119116,DE
2,tt0266697,AE


Unnamed: 0,imdbId,genre
0,tt0109830,Drama
1,tt0119116,Action
2,tt0266697,Action
3,tt0418763,Biography
4,tt0105695,Drama


## Summary:

### 1. Q&A

* **Were all the project steps completed successfully?** No, several steps encountered errors, primarily due to database connection issues and limitations in programmatically interacting with the Power BI Desktop GUI. The database population was eventually successful using a local SQLite database. The Power BI dashboard creation and real-time updates were not programmatically achievable.

* **What was the primary reason for the initial database population failures?**  Incorrect database connection URLs (invalid port numbers, incorrect hostnames) and issues executing SQL `CREATE TABLE` statements using SQLAlchemy.

* **What was the final solution for populating the database?** A local SQLite database was populated with `to_sql()`, which creates each table if it does not already exist. This bypassed the problems encountered when executing SQL `CREATE TABLE` statements directly.

* **Were all project files successfully added to the Git repository?** No.  The `data.csv` and `movie_database.db` files were not found in the current directory, preventing them from being copied into the repository.
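
The `to_sql()` approach described above can be sketched as follows. This is a minimal illustration rather than the notebook's actual cell: it uses an in-memory SQLite database instead of the `movie_database.db` file, and the two small frames are hand-built from the sample rows shown earlier.

```python
import sqlite3

import pandas as pd

# Small stand-ins for df_movies and df_genres, built from the sample rows above.
movies = pd.DataFrame({
    "imdbId": ["tt0109830", "tt0119116"],
    "title": ["Forrest Gump", "The Fifth Element"],
    "releaseYear": [1994, 1997],
})
genres = pd.DataFrame({
    "imdbId": ["tt0109830", "tt0109830", "tt0119116"],
    "genre": ["Drama", "Romance", "Action"],
})

# Assumption: an in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")

# to_sql creates the table when it is missing, so no explicit CREATE TABLE is needed.
# if_exists="replace" makes the cell safe to re-run.
movies.to_sql("movies", conn, if_exists="replace", index=False)
genres.to_sql("genres", conn, if_exists="replace", index=False)

# Confirm which tables now exist and how many rows were loaded.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
movie_count = conn.execute("SELECT COUNT(*) FROM movies").fetchone()[0]
print(tables, movie_count)
conn.close()
```

Because `to_sql()` passes values as query parameters, titles containing single quotes are stored unchanged without manual escaping.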


### 2. Data Analysis Key Findings

* **Data Cleaning:** Missing values in `imdbAverageRating`, `imdbNumVotes`, and `availableCountries` were imputed.  Ratings and votes were converted to numeric types. Genres were split into individual columns. Duplicate rows were removed. The `releaseYear` column was converted to integer.
* **Data Preparation for SQL:** Data types were converted for SQL compatibility. Single quotes in string columns were escaped.  `availableCountries` and `genres` were normalized into separate tables (`df_countries` and `df_genres`).
* **SQL Database Population:**  After multiple failed attempts due to connection issues, the database was successfully populated using a local SQLite database. Data was loaded into three tables (`movies`, `countries`, and `genres`).
* **Power BI Connection Details:** Connection details for Power BI were generated for the SQLite database.
* **GitHub Repository Setup:** The repository was initialized and a README file was created, but the necessary project files were not added.
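
Once the three tables are loaded, the normalized layout can be queried with ordinary joins. The sketch below is an assumption-laden illustration: the table names (`movies`, `countries`, `genres`) come from the summary above, but the column definitions are guessed from the DataFrames, and only a few sample rows are inserted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed schema, mirroring the columns of df_movies, df_genres and df_countries.
conn.executescript("""
    CREATE TABLE movies (imdbId TEXT PRIMARY KEY, title TEXT, imdbAverageRating REAL);
    CREATE TABLE genres (imdbId TEXT, genre TEXT);
    CREATE TABLE countries (imdbId TEXT, country TEXT);
""")
conn.executemany("INSERT INTO movies VALUES (?, ?, ?)", [
    ("tt0109830", "Forrest Gump", 8.8),
    ("tt0119116", "The Fifth Element", 7.6),
    ("tt0105695", "Unforgiven", 8.2),
])
conn.executemany("INSERT INTO genres VALUES (?, ?)", [
    ("tt0109830", "Drama"), ("tt0109830", "Romance"),
    ("tt0119116", "Action"),
    ("tt0105695", "Drama"), ("tt0105695", "Western"),
])
conn.executemany("INSERT INTO countries VALUES (?, ?)", [
    ("tt0109830", "MX"),
    ("tt0119116", "AT"), ("tt0119116", "CH"), ("tt0119116", "DE"),
])

# Title count and average rating per genre.
genre_stats = conn.execute("""
    SELECT g.genre, COUNT(*) AS n_titles, ROUND(AVG(m.imdbAverageRating), 2) AS avg_rating
    FROM genres AS g
    JOIN movies AS m ON m.imdbId = g.imdbId
    GROUP BY g.genre
    ORDER BY n_titles DESC, g.genre
""").fetchall()

# Titles available in one country; the country code is passed as a parameter.
titles_in_ch = [row[0] for row in conn.execute("""
    SELECT m.title
    FROM movies AS m
    JOIN countries AS c ON c.imdbId = m.imdbId
    WHERE c.country = ?
""", ("CH",))]

print(genre_stats)
print(titles_in_ch)
conn.close()
```

Keeping one row per (`imdbId`, `genre`) and (`imdbId`, `country`) pair is what allows these aggregations without parsing comma-separated strings in SQL.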


### 3. Insights or Next Steps

* **Complete GitHub Repository Setup:** Manually add the `data.csv` and `movie_database.db` files to the project directory and commit them to the repository.
* **Power BI Dashboard Development:** Manually create the Power BI dashboard visualizations using the provided database connection details.  Consider using alternative visualization tools that are programmatically accessible if full automation is desired.
