# Read in the data
Use pandas to read the data. This dataset comes from Kaggle and contains top-rated movies from the IMDb website, spanning different genres and languages. Movies from the early 1930s through the current year have been collected, cleaned, and arranged in this dataset.

In [30]:
import pandas as pd
data = pd.read_csv('IMDb_All_Genres_etf_clean1.csv')

Show the top 5 rows for reference. For this project, I focus on the "Rating" column. I will calculate the mean, median, and mode of the ratings.

In [67]:
data.head()

Unnamed: 0,Movie_Title,Year,Director,Actors,Rating,Runtime(Mins),Censor,Total_Gross,main_genre,side_genre
0,Kantara,2022,Rishab Shetty,"Rishab Shetty, Sapthami Gowda, Kishore Kumar G...",9.3,148,UA,Gross Unkown,Action,"Adventure, Drama"
1,The Dark Knight,2008,Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart, M...",9.0,152,UA,$534.86M,Action,"Crime, Drama"
2,The Lord of the Rings: The Return of the King,2003,Peter Jackson,"Elijah Wood, Viggo Mortensen, Ian McKellen, Or...",9.0,201,U,$377.85M,Action,"Adventure, Drama"
3,Inception,2010,Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellio...",8.8,148,UA,$292.58M,Action,"Adventure, Sci-Fi"
4,The Lord of the Rings: The Two Towers,2002,Peter Jackson,"Elijah Wood, Ian McKellen, Viggo Mortensen, Or...",8.8,179,UA,$342.55M,Action,"Adventure, Drama"
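Before computing statistics on "Rating", it can help to confirm that the column is numeric and has no missing values. The following is a minimal sketch of one way to check this, not part of the original analysis. It rebuilds the five rows shown above into a small frame (`sample` is a hypothetical name) so that it runs without the CSV file.

```python
import pandas as pd

# Hypothetical stand-in for `data`: the five rows shown above, two columns only.
sample = pd.DataFrame({
    "Movie_Title": [
        "Kantara",
        "The Dark Knight",
        "The Lord of the Rings: The Return of the King",
        "Inception",
        "The Lord of the Rings: The Two Towers",
    ],
    "Rating": [9.3, 9.0, 9.0, 8.8, 8.8],
})

# Coerce to numbers; anything unparseable becomes NaN instead of raising.
ratings = pd.to_numeric(sample["Rating"], errors="coerce")
missing = int(ratings.isna().sum())

print(f"dtype: {ratings.dtype}, missing: {missing}")
```

With the full dataset, the same two lines could be applied to `data["Rating"]`.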


# Use Pandas to Compute the Mean, Median and Mode

### Use Pandas to compute the mean

In [37]:
mean_rating_pd = data['Rating'].mean()
print(f"Mean Rating: {mean_rating_pd:.2f}")

Mean Rating: 6.76


### Use Pandas to compute the median

In [40]:
median_rating_pd = data['Rating'].median()
print(f'Median Rating: {median_rating_pd:.2f}')

Median Rating: 6.80


### Use Pandas to compute the mode

In [43]:
mode_rating_pd = data['Rating'].mode()[0]
print(f'Mode Rating: {mode_rating_pd:.2f}')

Mode Rating: 7.30
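Note that `Series.mode()` returns a Series containing every most-frequent value, sorted in ascending order, so `[0]` reports only the smallest one when there is a tie. The sketch below illustrates this with a small made-up sample (`tied` is hypothetical and not drawn from the dataset).

```python
import pandas as pd

# Hypothetical ratings with a tie: 7.3 and 6.8 both occur twice.
tied = pd.Series([7.3, 6.8, 7.3, 6.8, 5.9])

all_modes = tied.mode()      # Series of every most-frequent value, sorted ascending
first_mode = all_modes[0]    # what `.mode()[0]` would report

print("All modes:", all_modes.tolist())
print("First mode:", first_mode)
```

If ties matter for the analysis, printing `all_modes.tolist()` avoids silently dropping the other values.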


### Findings
The average rating across all movies is 6.76. The middle rating value (when all ratings are ordered from lowest to highest) is 6.80. The most frequently occurring rating is 7.30. 

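The mean (6.76) is slightly smaller than the median (6.80), which in turn is smaller than the mode (7.30). This ordering suggests a mildly left-skewed (negatively skewed) distribution: most movies are rated relatively well, while a smaller number of low-rated movies pull the average down. Comparing the mean and median is only a rule of thumb, so a histogram or a skewness statistic would be needed to confirm the shape.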
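To illustrate the relationship between mean, median, and skew, here is a small sketch on made-up numbers (the `toy` series is hypothetical, not taken from the dataset). It uses `Series.skew()`, which returns the sample skewness; a negative value indicates a longer left tail.

```python
import pandas as pd

# Hypothetical left-skewed sample: one low outlier among otherwise high ratings.
toy = pd.Series([2.0, 6.5, 7.0, 7.0, 7.5, 8.0])

toy_mean = toy.mean()
toy_median = toy.median()
toy_skew = toy.skew()   # sample skewness; negative means a longer left tail

print(f"mean={toy_mean:.2f}, median={toy_median:.2f}, skew={toy_skew:.2f}")
```

The same call on `data['Rating']` would give a direct measure of skew for the real ratings.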

# Using Python Standard Library to Compute the Mean, Median and Mode

In [74]:
import csv

In [51]:
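# Accumulators for the loop below. Re-run this cell before re-running the
# loop; otherwise the totals and counts from the previous run carry over.
rating_counts = {}  # rating -> number of occurrences (for the mode)
ratings_list = []   # every rating, kept for the median
total = 0           # running sum of ratings (for the mean)
count = 0           # number of valid ratings read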

### Use Pure Python to Compute the Mean

In [76]:
with open('IMDb_All_Genres_etf_clean1.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    header = next(reader)

    # The code below was written with GenAI assistance.
    try:
        rating_index = header.index("Rating")
    except ValueError:
        print("Column 'Rating' not found in the header.")
        rating_index = None
    # The code above was written with GenAI assistance.
    
    if rating_index is not None:
        for row in reader:
            if len(row) > rating_index:
                rating_str = row[rating_index].strip()

                # The code below was written with GenAI assistance.
                try:
                    rating = float(rating_str)
                    ratings_list.append(rating)

                    total += rating
                    count += 1

                    # Tally occurrences of each rating for the mode.
                    rating_counts[rating] = rating_counts.get(rating, 0) + 1
                except ValueError:
                    print(f"Skipping non-numeric rating: {rating_str}")
                # The code above was written with GenAI assistance.
            
            else:
                print(f"Skipping row with missing rating: {row}")

print(f"Average Rating: {total / count if count > 0 else 0:.2f}")

Average Rating: 6.76
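As an alternative to looking up the column index by hand, `csv.DictReader` maps each row to a dictionary keyed by the header. The sketch below is one possible variant, not the approach used above. It reads from an in-memory string (`sample_csv`, hypothetical) instead of the real file, and the "Broken Row" line is invented to show the error handling.

```python
import csv
import io

# Hypothetical in-memory CSV standing in for the real file.
# The first three ratings come from the rows shown earlier; the last row is made up.
sample_csv = io.StringIO(
    "Movie_Title,Rating\n"
    "Kantara,9.3\n"
    "The Dark Knight,9.0\n"
    "Inception,8.8\n"
    "Broken Row,not-a-number\n"
)

ratings = []
skipped = 0
for row in csv.DictReader(sample_csv):
    try:
        ratings.append(float(row["Rating"]))
    except (KeyError, TypeError, ValueError):
        # Missing column, missing cell (None), or non-numeric text.
        skipped += 1

average = sum(ratings) / len(ratings) if ratings else 0.0
print(f"Average Rating: {average:.2f} (skipped {skipped})")
```

Keeping the accumulators inside the same cell as the loop also means the cell can be re-run without double counting.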


### Use Pure Python to Compute the Median

In [79]:
ratings_list.sort()
n = len(ratings_list)
if n % 2 == 1:
    median_rating = ratings_list[n // 2]
else:
    median_rating = (ratings_list[n // 2 - 1] + ratings_list[n // 2]) / 2

print(f"Median Rating: {median_rating:.2f}")

Median Rating: 6.80


### Use Pure Python to Compute the Mode

In [82]:
max_count = max(rating_counts.values())
mode_rating_list = [rating for rating, count in rating_counts.items() if count == max_count]

mode_rating = ", ".join(f"{float(rating):.2f}" for rating in mode_rating_list)

print(f"Mode Rating: {mode_rating}")

Mode Rating: 7.30
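The hand-written calculations can be cross-checked against the standard library's `statistics` module. The sketch below uses a small made-up list (`sample_ratings` is hypothetical); with the real data, `ratings_list` would take its place. `statistics.multimode` requires Python 3.8 or later and returns every mode, which matches the tie handling in the cell above.

```python
import statistics

# Hypothetical sample; with the real data this would be `ratings_list`.
sample_ratings = [7.3, 6.8, 7.3, 6.1, 6.8, 7.3, 5.9]

check_mean = statistics.mean(sample_ratings)
check_median = statistics.median(sample_ratings)
check_modes = statistics.multimode(sample_ratings)  # list of all modes

print(f"Mean Rating: {check_mean:.2f}")
print(f"Median Rating: {check_median:.2f}")
print("Mode Rating:", ", ".join(f"{m:.2f}" for m in check_modes))
```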


# Creating a data visualization

In [85]:
# int() truncates toward zero, so 6.76 and 6.80 both become 6 stars.
mean_stars = int(mean_rating_pd)
median_stars = int(median_rating_pd)
mode_stars = int(mode_rating_pd)

print("Mean Rating:", '*' * mean_stars)
print("Median Rating:", '*' * median_stars)
print("Mode Rating:", '*' * mode_stars)

Mean Rating: ******
Median Rating: ******
Mode Rating: *******
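Because whole-star bars hide differences smaller than one rating point, a slightly finer text chart may be easier to read. The sketch below is one possible variation. The `star_bar` helper and the half-point resolution are assumptions, not part of the original notebook, and the three values are the ones reported earlier.

```python
def star_bar(value, stars_per_point=2):
    """Return a bar with `stars_per_point` stars per rating point, rounded to the nearest star."""
    return "*" * round(value * stars_per_point)

# Values reported earlier in this notebook.
stats = {"Mean": 6.76, "Median": 6.80, "Mode": 7.30}

for name, value in stats.items():
    # Print the number next to the bar so close values stay distinguishable.
    print(f"{name + ' Rating:':<15}{star_bar(value)} ({value:.2f})")
```

Even at half-point resolution the mean and median produce bars of the same length, which is why the numeric value is printed alongside each bar.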
