# Movies analysis using Python Pandas and Object-oriented Approach (OOP)

The code in this notebook analyses the movies dataset downloaded from https://www.kaggle.com/rounakbanik/the-movies-dataset by using the data from the `movies_metadata.csv` and `ratings.csv` files according to the requirements specified in the file [Data_Engineering_Task.txt](https://github.com/x4x3r/scaling-octo-eureka/blob/main/Data-Engineer_task.txt).

Solutions to the data engineering task are given in points 1-7.

Refer to section **8. Program to analyse the movies dataset using Pandas and object-oriented approach (OOP)** for the object-oriented solution.

## Import the required libraries

In [2]:
import pandas as pd
import os
import numpy as np
from datetime import datetime, date

In [None]:
# Create directory named "data" and unzip the downloaded movies data
!mkdir '/home/user/Desktop/data_engineering_task/data'
!unzip 'archive.zip' -d '/home/user/Desktop/data_engineering_task/data'

In [None]:
# Check the unzipped data
%ls -l data/

## Data exploration and processing

The number of movies can be calculated from the `movies_metadata.csv` dataset.

In [None]:
#  Read the movies_metadata.csv dataset into a pandas DataFrame
movies_df = pd.read_csv('data/movies_metadata.csv', header=0)

# Display the first 5 rows of the DataFrame to get a glimpse of the data
movies_df.head(5)

In [None]:
movies_df['genres'].iloc[0]

In [None]:
# Check which columns can be used for counting the movies
movies_df.columns

In [None]:
# Determine the total number of rows in the movies metadata dataset
len(movies_df)

**Explanation**: `movies_metadata.csv` has 45466 rows. `movies_df.columns` shows two candidate identifier columns: 'id' and 'imdb_id'. The 'id' column has no NA values, whereas 'imdb_id' is missing for 17 movies. The 'id' column is therefore the better choice both for counting the movies and as the identifier column of the dataset.

In [None]:
# Check the number of rows in the 'id' column that have NA values
movies_df['id'].isna().sum()

**Explanation**: There are no rows in the 'id' column with NA values, indicating that it is a complete and suitable column for further analysis.

In [None]:
# Check the number of rows in the 'imdb_id' column that have NA values
movies_df['imdb_id'].isna().sum()

**Explanation**: There are 17 movies in the dataset that are not indexed by IMDB but still exist in the movies database on Kaggle. These movies have missing values in the 'imdb_id' column.

In [None]:
# Display the 17 movies that have missing 'imdb_id' values
movies_df[movies_df['imdb_id'].isna()]

**Explanation**: The above DataFrame shows the 17 movies that do not have an 'imdb_id'. These movies might have limited information available, making it difficult to use them in further analysis.
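The comparison of the two candidate key columns can be reproduced on a small frame. The sketch below uses invented rows (not taken from the dataset) and checks the two properties discussed above: no missing values and no repeated values.

```python
import pandas as pd

# Toy stand-in for movies_df; all values are invented for illustration
toy_df = pd.DataFrame({
    'id': ['1', '2', '3', '4'],
    'imdb_id': ['tt0000001', None, 'tt0000003', None],
})

# Count missing values per candidate key column
missing_per_column = toy_df[['id', 'imdb_id']].isna().sum()
print(missing_per_column)

# A usable key column has no missing values and no repeated values
is_complete = toy_df['id'].notna().all()
is_unique = toy_df['id'].is_unique
print(is_complete, is_unique)
```

On the real data the same two checks would be run on `movies_df` instead of `toy_df`.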

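Also, `pd.read_csv` raises **DtypeWarning: Columns (10) have mixed types**. The column position is zero-based and can be looked up with `movies_df.columns[10]`. A mixed-type column usually indicates malformed rows, so we check whether the 'id' column contains values that are not alphanumeric.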

In [None]:
# Show the rows whose 'id' value contains non-alphanumeric symbols
movies_df[~movies_df['id'].str.isalnum()]

**Explanation**: The 'id' column contains three rows with release dates instead of alphanumeric values. Since these rows do not provide much useful information and lack proper 'id' numbers, it is best to drop them from the DataFrame.
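An alternative check, sketched here on invented values, converts the column with `pd.to_numeric(..., errors='coerce')` and inspects what fails to convert. Unlike `str.isalnum`, this also rejects purely alphabetic strings. It is shown as an option, not as the method used in this notebook.

```python
import pandas as pd

# Toy 'id' column: two valid ids and one date that slipped into the column
ids = pd.Series(['862', '1997-08-20', '8844'], name='id')

# Values that cannot be parsed as numbers become NaN
numeric_ids = pd.to_numeric(ids, errors='coerce')

# Rows whose id failed to convert
bad_rows = ids[numeric_ids.isna()]
print(bad_rows.tolist())

# Keep only the valid rows and cast them to int
clean_ids = numeric_ids.dropna().astype(int)
print(clean_ids.tolist())
```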

In [None]:
# Dropping the non-alphanumeric rows
movies_df = movies_df[movies_df['id'].str.isalnum()]

It is also worthwhile checking for duplicate rows:

In [None]:
# Check for duplicate rows in the 'id' column
movies_df[movies_df['id'].duplicated(keep=False)].sort_values('id')

**Explanation**: The above DataFrame displays the rows whose 'id' value occurs more than once, which helps identify inconsistent or redundant data.

In [None]:
# Keep only the rows with unique 'id' numbers and update the dataframe
movies_df = movies_df.drop_duplicates(subset='id')
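The difference between inspecting duplicates and dropping them can be seen on a toy frame. The rows below are invented; only `duplicated` and `drop_duplicates` are used, with the same arguments as above.

```python
import pandas as pd

# Invented rows: id '10' appears twice
toy_df = pd.DataFrame({
    'id': ['10', '20', '10', '30'],
    'title': ['A', 'B', 'A (copy)', 'C'],
})

# keep=False marks every member of a duplicated group, which is useful for inspection
all_duplicates = toy_df[toy_df['id'].duplicated(keep=False)]
print(all_duplicates)

# drop_duplicates keeps the first occurrence of each id by default
deduplicated = toy_df.drop_duplicates(subset='id')
print(deduplicated)
```

Because the first occurrence is kept, the row order of the source file decides which duplicate survives.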

## 1. Load the dataset from a CSV file.

We can write all data processing operations as a single function.

In [3]:
def data_processing(ratings_filepath, movies_filepath):
    """1. Load the dataset from a CSV file."""
    
    # Read 'id' as text so that malformed values can be filtered out below
    dtypes = {
        'genres': object,
        'id': str,
        'release_date': str,
        'title': str,
    }
    
    # Read only the required columns of movies_metadata.csv
    movie_columns = ['genres', 'id', 'release_date', 'title']
    movies_df = pd.read_csv(movies_filepath, header=0, sep=',',
                            usecols=movie_columns,
                            dtype=dtypes)
    
    # Drop the rows in movies_df where 'id' column contains non-alphanumeric symbols
    movies_df = movies_df[movies_df['id'].str.isalnum()]
    
    # Keep only the rows with unique 'id' numbers; copy to avoid SettingWithCopyWarning
    movies_df = movies_df.drop_duplicates(subset='id').copy()
    
    # Convert 'id' column to int data type
    movies_df['id'] = movies_df['id'].astype(int)
    
    # Set the 'id' column as the index of movies_df
    movies_df = movies_df.set_index('id')
    
    # Read the ratings.csv file into ratings_df DataFrame
    # Note: nrows=10000 loads only the first 10000 ratings
    columns = ['movieId', 'rating']
    ratings_df = pd.read_csv(ratings_filepath,
                             usecols=columns,
                             index_col='movieId',
                             dtype={'movieId': int, 'rating': float},
                             nrows=10000)
    
    return ratings_df, movies_df

ratings_filepath = os.path.join(os.getcwd(), 'data', 'ratings.csv')
movies_filepath = os.path.join(os.getcwd(), 'data', 'movies_metadata.csv')

ratings_df, movies_df = data_processing(ratings_filepath, movies_filepath)

## 2. Print the number of movies in the dataset.

In [None]:
# Print the number of unique movies
len(movies_df)

## 3. Print the average rating of all the movies.

In [None]:
def average_rating_loops(file_path):
    """Calculate the average rating of all movies"""
        
    # Number of rows to read per chunk
    chunk_size = 5000
    
    # Initialize variables for average rating calculation
    total_sum = 0
    total_count = 0
    
    # Read only the 'rating' column in chunks to limit memory usage
    for chunk in pd.read_csv(file_path, usecols=['rating'], chunksize=chunk_size):
        # Update the running sum and the number of non-missing ratings
        total_sum += chunk['rating'].sum()
        total_count += chunk['rating'].count()
    
    # Calculate the average rating for all movies
    average_rating = total_sum/total_count
    
    # Print the average rating of all the movies
    print('Average movie rating: ' + str(np.round(average_rating, 2)))

file_path='data/ratings.csv'
average_rating_loops(file_path)
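That the chunked accumulation gives the same result as a single pass can be checked on a small in-memory file. This is a minimal sketch with invented ratings; `io.StringIO` stands in for the real `ratings.csv`.

```python
import io
import pandas as pd

# In-memory stand-in for ratings.csv (invented values)
csv_text = "userId,movieId,rating\n1,10,4.0\n1,20,3.0\n2,10,5.0\n2,30,2.0\n3,20,1.0\n"

total_sum = 0.0
total_count = 0

# Read two rows at a time and accumulate the sum and the count
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=['rating'], chunksize=2):
    total_sum += chunk['rating'].sum()
    total_count += chunk['rating'].count()  # count() ignores missing ratings

chunked_mean = total_sum / total_count

# Reference value computed in one pass
full_mean = pd.read_csv(io.StringIO(csv_text))['rating'].mean()
print(chunked_mean, full_mean)
```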

Another way is to have Pandas load only the columns needed to calculate the average rating, as `data_processing` does for `ratings_df`.

In [None]:
def average_rating(ratings):
    """Calculate the average rating of all movies"""
    
    # Use the DataFrame passed as an argument rather than the global ratings_df
    mean_rating = ratings['rating'].mean()
    print('Average movie rating: ' + str(np.round(mean_rating, 2)))

average_rating(ratings_df)

## 4. Print the top 5 highest rated movies.

In [4]:
def top_5_movies(movies, ratings):
    """4. Print the top 5 highest rated movies."""
    
    # Calculate the average rating by movieId
    avg_ratings_df = ratings.groupby('movieId')[['rating']].mean()
    
    # Merge the dataframes on their indexes (ratings 'movieId' and movies 'id')
    merged_df = pd.merge(avg_ratings_df, movies, left_index=True, right_index=True)
    
    # Select the top 5 movies by average rating
    top_5 = merged_df[['rating', 'title']].sort_values('rating', ascending=False).head(5)
    
    return top_5

result = top_5_movies(movies=movies_df, ratings=ratings_df)
result

| | rating | title |
|---|---|---|
| 290 | 5.0 | Barton Fink |
| 1950 | 5.0 | Lucky You |
| 506 | 5.0 | Marnie |
| 47122 | 5.0 | Flaming Creatures |
| 1649 | 5.0 | Bill & Ted's Bogus Journey |
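Two properties of this ranking are worth seeing on toy data: the index-on-index merge is an inner join, and a movie with a single 5.0 vote ranks first. The frames below are invented for illustration.

```python
import pandas as pd

# Invented ratings: movie 1 has one vote, movie 2 has two, movie 99 has no metadata
ratings = pd.DataFrame(
    {'rating': [5.0, 4.0, 4.5, 3.0]},
    index=pd.Index([1, 2, 2, 99], name='movieId'),
)
# Invented metadata: movie 3 has no ratings
movies = pd.DataFrame(
    {'title': ['Single Vote', 'Two Votes', 'No Votes']},
    index=pd.Index([1, 2, 3], name='id'),
)

# Mean rating per movie
avg_ratings = ratings.groupby(level='movieId')[['rating']].mean()

# Inner join on the index: ids present in only one frame are dropped
merged = pd.merge(avg_ratings, movies, left_index=True, right_index=True)
ranking = merged.sort_values('rating', ascending=False)
print(ranking)
```

Movies 3 and 99 disappear from the result, and 'Single Vote' outranks 'Two Votes' on the strength of one rating.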


This code can be improved by using a weighted average, so that a movie with a single 5.0 rating does not outrank movies with many ratings. A first attempt passed the per-row rating counts to `np.average` as `weights` and failed with `TypeError: 1D weights expected when shapes of a and weights differ.` The version below instead shrinks each movie's mean rating towards the global mean in proportion to its number of votes. The formula and the `min_votes` default are assumptions made for illustration.

In [6]:
import pandas as pd

def top_5_movies(ratings, movies, min_votes=10):
    """Print the top 5 highest-rated movies"""

    # Per-movie mean rating and number of votes
    stats = ratings.groupby(level='movieId')['rating'].agg(['mean', 'count'])

    # Weighted rating (assumed formula): blend the movie mean with the global mean
    global_mean = ratings['rating'].mean()
    votes = stats['count']
    stats['avg_rating'] = (votes * stats['mean'] + min_votes * global_mean) / (votes + min_votes)

    # Merge the dataframes based on the index
    merged_df = pd.merge(stats, movies, left_index=True, right_index=True)

    # Get the top 5 movies
    top_5 = merged_df[['avg_rating', 'title']].sort_values(by='avg_rating', ascending=False).head(5)

    return top_5

# Assuming you have your ratings and movies DataFrames defined, call the function like this:
result = top_5_movies(ratings=ratings_df, movies=movies_df)
print(result)
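The `TypeError` quoted above arises because `np.average` needs `weights` to line up with the values being averaged. A minimal sketch on invented ratings shows a call with matching shapes: one mean and one vote count per movie.

```python
import numpy as np
import pandas as pd

# Invented ratings: movie 1 has one vote, movie 2 has three
ratings = pd.DataFrame(
    {'rating': [5.0, 3.0, 4.0, 3.5]},
    index=pd.Index([1, 2, 2, 2], name='movieId'),
)

# One row per movie: mean rating and number of votes
stats = ratings.groupby(level='movieId')['rating'].agg(['mean', 'count'])

# Values and weights have the same length here (one entry per movie)
weighted_mean = np.average(stats['mean'], weights=stats['count'])

# Weighting the per-movie means by vote count recovers the overall mean rating
overall_mean = ratings['rating'].mean()
print(weighted_mean, overall_mean)
```

Weighting by count therefore describes the dataset as a whole. It does not by itself rank individual movies.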

## 5. Print the number of movies released each year.

In [None]:
def movies_per_year(movies):
    """Print the number of movies per year"""
    
    # Extract the year from 'release_date'; unparseable dates become NaN
    release_year = pd.to_datetime(movies['release_date'], errors='coerce').dt.year
    
    # Drop the rows without a valid release year and cast the rest to integer
    release_year = release_year.dropna().astype(int)
    
    # Count the number of movies released each year
    movies_per_year = release_year.value_counts().sort_index()
    
    # Create a movies_per_year_df DataFrame from the Pandas Series
    movies_per_year_df = pd.DataFrame({'Year': movies_per_year.index, 'Count': movies_per_year.values})
    
    return movies_per_year_df
    
n_movies = movies_per_year(movies_df)

n_movies
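The effect of `errors='coerce'` is easiest to see on a few invented dates, including a missing and a malformed one. This is a sketch of the same steps on a plain Series.

```python
import pandas as pd

# Invented release dates with one missing and one malformed entry
release_dates = pd.Series(['1995-10-30', None, 'not a date', '1995-12-15', '2001-01-01'])

# Unparseable or missing entries become NaT instead of raising an error
years = pd.to_datetime(release_dates, errors='coerce').dt.year

# value_counts skips missing values by default
per_year = years.value_counts().sort_index()
per_year.index = per_year.index.astype(int)
print(per_year.to_dict())
```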

## 6. Print the number of movies in each genre.

In [None]:
import ast

def movies_per_genre(movies):
    """Print the number of movies in each genre"""
    
    # Convert the string stored in column 'genres' to lists of dictionaries
    if isinstance(movies['genres'].iloc[0], str):
        movies['genres'] = movies['genres'].apply(ast.literal_eval)

    # Extract the relevant information from the nested dictionaries
    movies['genre_names'] = movies['genres'].apply(lambda x: [genre['name'] for genre in x])
    
    # Count the number of movies per genre
    movies_per_genre = movies['genre_names'].explode().value_counts()
    
    # Return a pandas.DataFrame with explicit column names
    return movies_per_genre.rename_axis('genre_names').reset_index(name='count')

movies_per_genre(movies=movies_df)
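The parse, extract and explode steps can be followed on three invented rows. The sketch assumes the 'genres' values are stringified lists of dicts with a 'name' key, as the function above does.

```python
import ast
import pandas as pd

# Invented 'genres' values in the stringified list-of-dicts format
genres = pd.Series([
    "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
    "[{'id': 35, 'name': 'Comedy'}]",
    "[]",
])

# Parse each string into a list of dicts, then keep only the genre names
genre_names = genres.apply(ast.literal_eval).apply(lambda items: [g['name'] for g in items])

# explode() yields one row per (movie, genre); empty lists become NaN and are not counted
genre_counts = genre_names.explode().value_counts()
print(genre_counts.to_dict())
```

A movie with several genres is counted once per genre, so the counts can sum to more than the number of movies.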

## 7. Save the dataset to a JSON file.

In [None]:
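# Save the computed results; to_json is a DataFrame method, so it is called on
# the returned DataFrames rather than on the functions themselves
result.to_json('path/to/top_5_movies.json')
n_movies.to_json('path/to/movies_per_year.json')
movies_per_genre(movies=movies_df).to_json('path/to/movies_per_genre.json')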
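The layout of the JSON output depends on the `orient` argument of `to_json`. As a sketch on an invented result table, `orient='records'` writes one object per row, and `pd.read_json` restores the table.

```python
import io
import pandas as pd

# Invented result table
result_df = pd.DataFrame({'Year': [1995, 1996], 'Count': [2, 5]})

# Without a path, to_json returns the JSON text; 'records' gives one object per row
json_text = result_df.to_json(orient='records')
print(json_text)

# Read the JSON back to confirm that the round trip preserves the table
restored = pd.read_json(io.StringIO(json_text), orient='records')
print(restored)
```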

## 8. Program to analyse the movies dataset using Pandas and object-oriented approach (OOP)

This program can be used in this notebook and as a standalone application that prints the results in the console window.

In [None]:
import pandas as pd
import numpy as np
import ast
from datetime import datetime, date


class MoviesAnalisys():
    """Data analysis of the movies dataset"""


    def __init__(self, movies_filepath, ratings_filepath):
        """Initialize attributes"""
        
        # 1. Load the dataset from a CSV file.
        # Define the data types for each column; 'id' is read as text
        # so that malformed values can be filtered out below
        dtypes_movies = {
            'genres': object,
            'id': str,
            'release_date': str,
            'title': str,
        }
        
        # Read the movies_metadata.csv file
        movies_columns = ['genres', 'id', 'release_date', 'title']
        self.movies = pd.read_csv(movies_filepath, header=0, sep=',',
                                  usecols=movies_columns,
                                  dtype=dtypes_movies)
        
        # Drop the non-alphanumeric rows
        self.movies = self.movies[self.movies['id'].str.isalnum()]
        
        # Keep only the rows with unique 'id' numbers; copy to avoid SettingWithCopyWarning
        self.movies = self.movies.drop_duplicates(subset='id').copy()
        
        # Convert 'id' column to int data type
        self.movies['id'] = self.movies['id'].astype(int)
        
        # Set the 'id' column as self.movies index
        self.movies = self.movies.set_index('id')
        
        # Read the ratings.csv file
        ratings_columns = ['movieId', 'rating']
        self.ratings = pd.read_csv(ratings_filepath,
                                   usecols=ratings_columns,
                                   index_col='movieId',
                                   dtype={'movieId': int, 'rating': float})


    def count_movies(self):
        """2. Print the number of movies in the dataset."""

        print('Number of movies in the dataset: ' + str(len(self.movies)))


    def avg_rating(self):
        """3. Print the average rating of all the movies."""
        
        average_rating = self.ratings['rating'].mean()
        print('Average movie rating: ' + str(np.round(average_rating, 2)))


    def top_5_movies(self):
        """4. Print the top 5 highest rated movies."""
        
        # Calculate the average rating by movieId
        avg_ratings_df = self.ratings.groupby('movieId')[['rating']].mean()

        # Merge the dataframes on their indexes (ratings 'movieId' and movies 'id')
        merged_df = pd.merge(avg_ratings_df, self.movies, left_index=True, right_index=True)
        
        # Select the top 5 movies by average rating
        top_5 = merged_df[['rating', 'title']].sort_values('rating', ascending=False).head(5)
        
        return top_5

    
    def movies_per_genre(self):
        """6. Print the number of movies in each genre."""
        
        # Convert the string stored in column 'genres' to lists of dictionaries
        if isinstance(self.movies['genres'].iloc[0], str):
            self.movies['genres'] = self.movies['genres'].apply(ast.literal_eval)
    
        # Extract the relevant information from the nested dictionaries
        self.movies['genre_names'] = self.movies['genres'].apply(lambda x: [genre['name'] for genre in x])
        
        # Count the number of movies per genre
        movies_per_genre = self.movies['genre_names'].explode().value_counts()
    
        # Return a pandas.DataFrame with explicit column names
        return movies_per_genre.rename_axis('genre_names').reset_index(name='count')

    
    def movies_per_year(self):
        """5. Print the number of movies released each year."""
        
        # Extract the year from 'release_date'; unparseable dates become NaN
        release_year = pd.to_datetime(self.movies['release_date'], errors='coerce').dt.year
        
        # Drop the rows without a valid release year and cast the rest to integer
        release_year = release_year.dropna().astype(int)
        
        # Count the number of movies released each year
        movies_per_year = release_year.value_counts().sort_index()
        
        # Create a movies_per_year_df pandas.DataFrame from the pandas.Series
        movies_per_year_df = pd.DataFrame({'Year': movies_per_year.index, 'Count': movies_per_year.values})
    
        return movies_per_year_df
    
    
    def save_to_json(self, result, filepath):
        """7. Save the dataset to a JSON file."""
        
        result.to_json(filepath)

In [None]:
movies_filepath = 'data/movies_metadata.csv'
ratings_filepath = 'data/ratings.csv'

# Create an instance of the MoviesAnalysis class
movies_analysis = MoviesAnalisys(movies_filepath, ratings_filepath)

In [None]:
# Print the number of movies in the dataset (the method prints and returns None)
movies_analysis.count_movies()

In [None]:
# Print the average rating of all movies (the method prints and returns None)
movies_analysis.avg_rating()

In [None]:
# Extract the top 5 movies
top_5_movies_result = movies_analysis.top_5_movies()
top_5_movies_result

In [None]:
# Calculate the number of movies in each genre
movies_per_genre_result = movies_analysis.movies_per_genre()
movies_per_genre_result

In [None]:
# Calculate the number of movies released each year
movies_per_year_result = movies_analysis.movies_per_year()
movies_per_year_result

In [None]:
# Store some of the resulting dataframes as JSON files
top_5_movies_result_filepath = 'results_json/top_5_movies.json'
movies_per_genre_result_filepath = 'results_json/movies_per_genre.json'
movies_per_year_result_filepath = 'results_json/movies_per_year.json'

movies_analysis.save_to_json(top_5_movies_result, top_5_movies_result_filepath)
movies_analysis.save_to_json(movies_per_genre_result, movies_per_genre_result_filepath)
movies_analysis.save_to_json(movies_per_year_result, movies_per_year_result_filepath)

## 9. Improved implementation of the OOP featuring an explicit data-centered workflow

Compared with the previous implementation, this one promotes modularity, an explicit data flow, and improved reusability and readability. All these improvements are intended to make the code easier to use and maintain.

- **Modularity**: in the new implementation the program is divided into a set of classes (`DataIO`, `DataCleaning` and `DataAnalysis`), each responsible for a set of specific tasks.

- **Explicit workflow**: the new implementation follows an explicit data-centered workflow. It starts with the `DataIO` class for reading and writing data, which passes the data to the `DataCleaning` class for preprocessing and cleaning, and finally to the `DataAnalysis` class for the analysis tasks. This makes the code more logical and easier to understand and maintain.

- **Code reusability**: each class has well defined methods, which can be used in other projects and parts of the program. For example, the data cleaning methods can be reused in different analysis tasks without modifying the original class.

- **Readability and Debugging**: The new implementation is more readable and easier to debug due to its functional organisation (explicit data-first workflow). Each class is responsible for a specific part of the process, making it easier to locate and fix issues if they arise.
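The data flow described in the list above can be sketched with three small functions before looking at the classes. The function names and the in-memory CSV are hypothetical and exist only for this illustration; the classes that follow are the actual implementation.

```python
import io
import pandas as pd

def read_movies(source):
    """I/O stage: load the raw rows, keeping ids as strings."""
    return pd.read_csv(source, dtype={'id': str})

def clean_movies(raw_df):
    """Cleaning stage: drop malformed and duplicated ids, then index by integer id."""
    cleaned = raw_df[raw_df['id'].str.isdigit()].drop_duplicates(subset='id').copy()
    cleaned['id'] = cleaned['id'].astype(int)
    return cleaned.set_index('id')

def count_movies(clean_df):
    """Analysis stage: number of movies."""
    return len(clean_df)

# Invented input: one duplicated id and one malformed id
csv_text = "id,title\n1,A\n2,B\n1,A\n1997-08-20,broken\n"

# Each stage receives the output of the previous one
movie_count = count_movies(clean_movies(read_movies(io.StringIO(csv_text))))
print(movie_count)
```

Each stage takes data in and returns data out, so any stage can be tested or replaced without touching the others.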

In [1]:
import pandas as pd
import numpy as np
import ast
from datetime import datetime, date


class DataIO():
    """Data input-output (reading from and writing to file)"""


    def __init__(self, movies_filepath, ratings_filepath):
        """Initialize attributes"""
        self.ratings_filepath = ratings_filepath
        self.movies_filepath = movies_filepath
        self.movies_df = pd.DataFrame()
        self.ratings_df = pd.DataFrame()
        
        
    def read_data(self):
        """1. Load the dataset from a CSV file."""
        
        # Define the data types for each column; 'id' is read as text
        # so that malformed values can be filtered out during cleaning
        dtypes_movies = {
            'genres': object,
            'id': str,
            'release_date': str,
            'title': str,
        }
        
        # Read the movies_metadata.csv file from the path stored on the instance
        movies_columns = ['genres', 'id', 'release_date', 'title']
        self.movies_df = pd.read_csv(self.movies_filepath, header=0, sep=',',
                                     usecols=movies_columns,
                                     dtype=dtypes_movies)
        
        
        # Read the ratings.csv file from the path stored on the instance
        ratings_columns = ['movieId', 'rating']
        self.ratings_df = pd.read_csv(self.ratings_filepath,
                                      usecols=ratings_columns,
                                      index_col='movieId',
                                      dtype={'movieId': int, 'rating': float})
        

        return self.movies_df, self.ratings_df


    def write_data(self, result, filepath):
        """7. Save the dataset to a JSON file."""
        
        result.to_json(filepath)



class DataCleaning():
    """Data cleaning and preprocessing"""

    def __init__(self, movies_df, ratings_df):
        """Initialize attributes"""
        self.movies_df = movies_df
        self.ratings_df = ratings_df


    def clean_data(self):
        """Clean the data before analysis"""
        
        # Drop the non-alphanumeric rows
        self.movies_df = self.movies_df[self.movies_df['id'].str.isalnum()]
        
        # Keep only the rows with unique 'id' numbers; copy to avoid SettingWithCopyWarning
        self.movies_df = self.movies_df.drop_duplicates(subset='id').copy()
        
        # Convert 'id' column to int data type
        self.movies_df['id'] = self.movies_df['id'].astype(int)
        
        # Set the 'id' column as the index of self.movies_df
        self.movies_df = self.movies_df.set_index('id')
        
        return self.movies_df



class DataAnalisys():
    """Data analysis of the movies dataset"""


    def __init__(self, ratings_df, movies_df):
        """Initialize attributes"""
        self.movies_df = movies_df
        self.ratings_df = ratings_df


    def count_movies(self):
        """2. Print the number of movies in the dataset."""
        print('Number of movies in the dataset: ' + str(len(self.movies_df)))


    def avg_rating(self):
        """3. Print the average rating of all the movies."""        
        average_rating = self.ratings_df['rating'].mean()
        print('Average movie rating: ' + str(np.round(average_rating, 2)))


    def top_5_movies(self):
        """4. Print the top 5 highest rated movies."""
        
        # Calculate the average rating by movieId
        avg_ratings_df = self.ratings_df.groupby('movieId')[['rating']].mean()

        # Merge the dataframes on their indexes (ratings 'movieId' and movies 'id')
        merged_df = pd.merge(avg_ratings_df, self.movies_df, left_index=True, right_index=True)
        
        # Select the top 5 movies by average rating
        top_5 = merged_df[['rating', 'title']].sort_values('rating', ascending=False).head(5)
        
        return top_5

    
    def movies_per_genre(self):
        """6. Print the number of movies in each genre."""
        
        # Convert the string stored in column 'genres' to lists of dictionaries
        if isinstance(self.movies_df['genres'].iloc[0], str):
            self.movies_df['genres'] = self.movies_df['genres'].apply(ast.literal_eval)
    
        # Extract the relevant information from the nested dictionaries
        self.movies_df['genre_names'] = self.movies_df['genres'].apply(lambda x: [genre['name'] for genre in x])
        
        # Count the number of movies per genre
        movies_per_genre = self.movies_df['genre_names'].explode().value_counts()
    
        # Return a pandas.DataFrame with explicit column names
        return movies_per_genre.rename_axis('genre_names').reset_index(name='count')

    
    def movies_per_year(self):
        """5. Print the number of movies released each year."""
        
        # Extract the year from 'release_date'; unparseable dates become NaN
        release_year = pd.to_datetime(self.movies_df['release_date'], errors='coerce').dt.year
        
        # Drop the rows without a valid release year and cast the rest to integer
        release_year = release_year.dropna().astype(int)
        
        # Count the number of movies released each year
        movies_per_year = release_year.value_counts().sort_index()
        
        # Create a movies_per_year_df pandas.DataFrame from the pandas.Series
        movies_per_year_df = pd.DataFrame({'Year': movies_per_year.index, 'Count': movies_per_year.values})
    
        return movies_per_year_df


In [2]:
movies_filepath = 'data/movies_metadata.csv'
ratings_filepath = 'data/ratings.csv'

# Create a data_io object and initialize it with filepaths
data_io = DataIO(movies_filepath, ratings_filepath)

# Read the data from the files
movies_df, ratings_df = data_io.read_data()

In [6]:
# Create a data_cleaning object and initialize it with DataFrames
data_cleaning = DataCleaning(ratings_df=ratings_df, movies_df=movies_df)

# Call the clean_data method on movies_df
movies_df = data_cleaning.clean_data()

In [7]:
# Create a data_analysis object and initialize it with DataFrames
data_analysis = DataAnalisys(ratings_df=ratings_df, movies_df=movies_df)

# 2. Print the number of movies in the dataset.
data_analysis.count_movies()

Number of movies in the dataset: 45433


In [8]:
# 3. Print the average rating of all the movies
data_analysis.avg_rating()

Average movie rating: 3.53


In [9]:
# 4. Print the top 5 highest rated movies.
data_analysis.top_5_movies()

| | rating | title |
|---|---|---|
| 95977 | 5.0 | The Man Behind The Gun |
| 167666 | 5.0 | Monster High: Escape from Skull Shores |
| 130544 | 5.0 | Palermo or Wolfsburg |
| 129530 | 5.0 | Brutal |
| 164278 | 5.0 | Harvey |


In [10]:
# 6. Print the number of movies in each genre
data_analysis.movies_per_genre()

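| | genre_names | count |
|---|---|---|
| 0 | Drama | 20244 |
| 1 | Comedy | 13176 |
| 2 | Thriller | 7619 |
| 3 | Romance | 6730 |
| 4 | Action | 6592 |
| 5 | Horror | 4671 |
| 6 | Crime | 4304 |
| 7 | Documentary | 3930 |
| 8 | Adventure | 3490 |
| 9 | Science Fiction | 3044 |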
9,Science Fiction,3044


In [11]:
# 5. Print the number of movies released each year
data_analysis.movies_per_year()

| | Year | Count |
|---|---|---|
| 0 | 1874 | 1 |
| 1 | 1878 | 1 |
| 2 | 1883 | 1 |
| 3 | 1887 | 1 |
| 4 | 1888 | 2 |
| ... | ... | ... |
| 130 | 2015 | 1904 |
| 131 | 2016 | 1604 |
| 132 | 2017 | 532 |
| 133 | 2018 | 5 |
