# Movie Ratings Data Analysis

`Author:` [Syed Muhammad Ebad](https://www.kaggle.com/syedmuhammadebad)
`Date:` 25-Sept-2024
[Send me an email](mailto:mohammadebad1@hotmail.com)
[Visit my GitHub profile](https://github.com/smebad)

[Dataset used in this notebook](https://www.kaggle.com/datasets/tunguz/movietweetings)

## Introduction

In this notebook, we explore and analyze a dataset of movie ratings and movie metadata. The primary goal is to understand how ratings are distributed and to identify popular movies based on those ratings. We visualize the data with interactive charts and highlight the most rated movies. By the end, you should have a clear picture of the dataset and its rating patterns.

**Dataset Overview:**
- **Movies dataset**: Contains information about movies including their titles and genres.
- **Ratings dataset**: Contains user ratings for the movies.

We will merge these datasets based on the movie ID (`MovieID`) and proceed with the analysis.

---

## Step 1: Importing the Necessary Libraries

In [None]:
# importing the libraries
import pandas as pd
import numpy as np
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")

We will use pandas for data manipulation, numpy for numerical operations, and plotly.express for creating interactive visualizations.

## Step 2: Loading and Reviewing the Movies Dataset

In [None]:
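# "::" is a multi-character separator (needs engine="python"); header=None keeps the first record as data
df = pd.read_csv("/kaggle/input/movietweetings/movies.dat", delimiter="::", engine="python", header=None)
df.head()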


Observations:
* The dataset does not have column names, so we will assign them manually.

In [None]:
df.info()

Fixing Missing Column Names:

In [None]:
# creating the column names
df.columns = ["MovieID", "Title", "Genres"]
df.head()
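
As a minimal sketch of the same idea, the column names can also be supplied while reading the file instead of afterwards. The rows below are made up for illustration (they are not taken from `movies.dat`); only the `::` layout mirrors the real file, and `sample_movies` is a hypothetical name.

```python
import io

import pandas as pd

# Hypothetical sample in the same "::"-separated layout (not real rows from movies.dat)
sample = io.StringIO(
    "1::Example Movie A (2001)::Drama|Romance\n"
    "2::Example Movie B (2010)::Comedy\n"
    "3::Example Movie C (2015)::Action|Thriller\n"
)

# header=None tells pandas there is no header row; names= assigns the columns directly
sample_movies = pd.read_csv(
    sample,
    sep="::",
    engine="python",  # multi-character separators are handled by the Python engine
    header=None,
    names=["MovieID", "Title", "Genres"],
)
print(sample_movies)
```

Reading this way keeps every record as data and avoids a separate renaming step.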

## Step 3: Loading and Reviewing the Ratings Dataset

In [None]:
# Same format as movies.dat: "::" separator and no header row
df1 = pd.read_csv("/kaggle/input/movietweetings/ratings.dat", delimiter="::", engine="python", header=None)
df1.head()

Observations:
* Similar to the movies dataset, the ratings dataset lacks column names. We will define them as well.

Fixing Missing Column Names:

In [None]:
# Creating the column names for the ratings dataset
df1.columns = ["UserID", "MovieID", "Rating", "Timestamp"]
df1.head()

## Step 4: Merging the Datasets
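Now that both datasets have column names, we will merge them on the common column `MovieID` to create a unified dataset for analysis. `pd.merge` performs an inner join by default, so only rows whose `MovieID` appears in both datasets are kept.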

In [None]:
movies = pd.merge(df, df1, on="MovieID")
movies.head()
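
Before relying on the merged table, it can help to check how many rows match on each side. The sketch below uses two tiny made-up tables (`movies_small` and `ratings_small` are hypothetical names, and the values are invented) to show the `indicator` and `validate` options of `pd.merge`.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (invented values)
movies_small = pd.DataFrame(
    {
        "MovieID": [1, 2, 3],
        "Title": ["A", "B", "C"],
        "Genres": ["Drama", "Comedy", "Action"],
    }
)
ratings_small = pd.DataFrame(
    {
        "UserID": [10, 11, 12, 13],
        "MovieID": [1, 1, 2, 4],
        "Rating": [8, 10, 6, 7],
        "Timestamp": [1000, 1001, 1002, 1003],
    }
)

# An outer join with indicator=True labels each row as both / left_only / right_only.
# validate="one_to_many" raises an error if a MovieID repeats in the movies table.
checked = pd.merge(
    movies_small,
    ratings_small,
    on="MovieID",
    how="outer",
    validate="one_to_many",
    indicator=True,
)
print(checked["_merge"].value_counts())

# The inner join keeps only the rows that matched on both sides
merged_small = pd.merge(movies_small, ratings_small, on="MovieID", how="inner")
print(merged_small)
```

Rows labeled `left_only` are movies without ratings, and rows labeled `right_only` are ratings whose `MovieID` has no entry in the movies table. Both groups are dropped by the inner join.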

## Step 5: Visualizing the Data
### 5.1: Distribution of Movie Ratings
We will create a donut chart (a pie chart with a hole in the center) to visualize how ratings are distributed across the dataset.

In [None]:
ratings = movies["Rating"].value_counts()
numbers = ratings.index
frequency = ratings.values

# values and names are passed as arrays, so no data frame argument is needed
fig = px.pie(
    values=frequency,
    names=numbers,
    title="Distribution of movie ratings",
    hole=0.5,
)
fig.update_traces(textposition="inside", textinfo="percent+label")
fig.show()
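
The percentages shown in the chart can also be computed as a plain table. This is a small sketch on invented ratings (`sample_ratings` is a hypothetical name), using `value_counts(normalize=True)` to get each rating's share.

```python
import pandas as pd

# Invented ratings, used only to illustrate the calculation
sample_ratings = pd.Series([10, 8, 8, 7, 10, 6, 8, 9, 10, 7], name="Rating")

# normalize=True returns fractions; sort_index orders the table by rating value
share = (sample_ratings.value_counts(normalize=True).sort_index() * 100).round(1)
print(share)
```

On the real data, replacing `sample_ratings` with `movies["Rating"]` would give the same kind of table for the merged dataset.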

### 5.2: Top 10 Movies Based on Ratings
This bar chart will display the top 10 movies based on the number of ratings they received.

In [None]:
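# Count ratings per title and keep the 10 most-rated movies
top_10 = movies.groupby("Title").size().nlargest(10).reset_index(name="NumRatings")

fig = px.bar(
    top_10,
    x="NumRatings",
    y="Title",
    orientation="h",
    title="Top 10 movies by number of ratings",
    color_discrete_sequence=["blue"],
    template="plotly_white",
)
fig.update_yaxes(autorange="reversed")  # show the most-rated movie at the top
fig.show()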

### 5.3: Distribution of Ratings
We will now plot a histogram to explore the distribution of ratings in the dataset, which gives us an idea of how frequent each rating is.

In [None]:
fig = px.histogram(
    movies,
    x="Rating",
    color="Rating",
    marginal="box",
    hover_data=movies.columns,
    title="Distribution of Movie Ratings",
)
fig.update_layout(bargap=0.1)
fig.show()
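
The marginal box plot summarizes the median and quartiles of the ratings. As a rough sketch, the same statistics can be computed with `Series.quantile`. The ratings below are invented, and plotly may use a slightly different quartile convention, so the numbers need not match the plot exactly.

```python
import pandas as pd

# Invented ratings, used only to illustrate the calculation
sample_ratings = pd.Series([1, 5, 6, 7, 7, 8, 8, 9, 10, 10], name="Rating")

# Quartiles with pandas' default linear interpolation
quartiles = sample_ratings.quantile([0.25, 0.5, 0.75])

# Interquartile range: the width of the box in a box plot
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]

print(quartiles)
print("IQR:", iqr)
```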

## Step 6: Identifying the Top Movies with a Perfect Rating (10)
In this section, we will check which movies received the highest number of perfect ratings (i.e., a rating of 10).

In [None]:
top_rated = movies.query("Rating == 10")
print(top_rated["Title"].value_counts().head(10))
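
A raw count of perfect ratings favors movies that were rated many times. One optional way to add context, sketched here on invented data with hypothetical column names (`PerfectRatings`, `TotalRatings`, `PerfectShare`), is to show each movie's total number of ratings next to its number of perfect ratings.

```python
import pandas as pd

# Invented ratings, used only to illustrate the calculation
sample = pd.DataFrame(
    {
        "Title": ["A", "A", "A", "B", "B", "C"],
        "Rating": [10, 10, 7, 10, 6, 10],
    }
)

summary = (
    sample.assign(IsPerfect=sample["Rating"].eq(10))  # True where the rating is 10
    .groupby("Title")
    .agg(PerfectRatings=("IsPerfect", "sum"), TotalRatings=("Rating", "size"))
    .sort_values("PerfectRatings", ascending=False)
)

# Fraction of each movie's ratings that are perfect
summary["PerfectShare"] = summary["PerfectRatings"] / summary["TotalRatings"]
print(summary.head(10))
```

A movie with a single rating of 10 has a perfect share of 1.0, so the share is most informative when read together with `TotalRatings`.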

## Summary
In this notebook, we performed an exploratory analysis of the movie ratings dataset. Below are the key takeaways:

* Data Cleaning: We imported the movies and ratings datasets, assigned proper column names, and merged them into a single dataset.

* Visualizing the Ratings: The distribution of ratings was visualized using a pie chart and a histogram. Together, the charts show how ratings are spread across the scale and how large the share of perfect scores (10/10) is.

* Top Movies: We identified the top 10 most rated movies and the top movies that received a perfect score.

* Rating Patterns: The histogram and marginal box plot gave us insights into the overall distribution of ratings, indicating a spread of ratings from 1 to 10, with some peaks at the higher end.

This exploratory analysis shows the main rating patterns and the most popular movies in the dataset. It also provides a foundation for further analysis or modeling tasks, such as predicting movie ratings or clustering user preferences.