<div style="border-radius: 10px; border: #6B8E23 solid; padding: 15px; background-color: #F5F5DC; font-size: 100%; text-align: left">

<h3 align="left"><font color='#556B2F'>📜 Introduction : </font></h3>

**What Drives a User to Make a Purchase?**

When a user decides to buy something, they typically go to an e-commerce site, search for a product, and are shown suggestions and alternatives. One of the most influential factors in the final decision is social proof: the user examines reviews and ratings left by people who have already bought the product and forms an opinion from them. This behavior rests largely on a belief in the wisdom of crowds.

Now, let's assume our user has narrowed the options down to two products and, setting brand and price aside, looks only at their ratings. Perception naturally shifts toward the product with the higher rating, on the assumption that it is the better one.

In another scenario, the products display both their ratings and their number of reviews. For example, one product has a rating of 5 from 7 reviews, and another has a rating of 4.5 from 256 reviews. Here we might well choose the product rated 4.5, because 256 reviews provide much stronger social proof than 7.

In essence, the buyer's primary goal is to find the best product in terms of price and performance. The goal of the marketplace/seller is to deliver products to users as accurately as possible. Sellers may even secure sponsorships to ensure their products appear at the top of search results. Since social proof heavily influences users' purchasing decisions, it is crucial to present it accurately wherever it appears and to protect it from manipulation.

Supporting the purchasing process raises several challenges for developers. Topics of interest include:

* Calculating product ratings
* Sorting products
* Sorting user reviews on product detail pages
* Designs of pages, processes, and interaction areas
* Feature tests
* Testing possible actions and reactions

Throughout this process, we will examine various scientific aspects of topics that collectively influence the user experience:

* Rating Products
* Sorting Products
* Sorting Reviews

<center><img src="https://i.imgur.com/jIuU22M.png" width="800" height="800"></center>

# Content

1. [💯 Rating Products 💯](#1)
    * [Average](#2)
    * [Time-Based Weighted Average](#3)
    * [User-Based Weighted Average](#4)
    * [Weighted Rating](#5)
1. [🤩 Sorting Products 🤩](#6)
    * [Sorting by Rating](#7)
    * [Sorting by Comment Count or Purchase Count](#8)
    * [Sorting by Rating, Comment and Purchase](#9)
    * [Bayesian Average Rating Score](#10)
    * [Hybrid Sorting](#11)
    * [IMDB Movie Scoring and Sorting](#12)
        * [Importing Libraries & First Look](#13)
        * [Sorting by Vote Average](#14)
    * [IMDB Weighted Rating](#15)
    * [Bayesian Average Rating Score (BAR Score)](#16)
1. [🙏 Sorting Reviews 🙏](#17)
    * [Up-Down Difference Score](#18)
    * [Average Rating (Up Ratio)](#19)
    * [Wilson Lower Bound Score](#20)
    * [Case Study](#21)

<a id="1"></a>
<h1 style="border-radius: 10px; border: 2px solid #6B8E23; background-color: #F5F5DC; font-family: 'Pacifico', cursive; font-size: 200%; text-align: center; border-radius: 15px 50px; padding: 15px; box-shadow: 5px 5px 5px #556B2F; color: #556B2F;">💯 Rating Products 💯</h1>

"Rating Products" refers to the application of assigning or displaying ratings for various products, services, or items. This is done to provide potential customers or users with an indicator of the quality, satisfaction, or performance of products. Ratings are typically presented as a numerical value, often on a scale such as 1 to 5 stars, where a higher rating indicates a better or more favorable product.

Rating products is done to help users make informed decisions among different options. Users can use ratings and reviews provided by other customers to assess the overall quality of a product and its suitability for their needs. Higher ratings are generally associated with positive user experiences and can guide potential buyers towards choosing a highly-rated product.

Rating products can be found on various platforms such as e-commerce websites, online marketplaces, mobile applications, and product review websites. Users are encouraged to leave their ratings and written reviews to share their personal experiences and opinions, which can contribute to building trust and transparency in the marketplace.

<a id = "2"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨Average✨</p>

In [1]:
import pandas as pd
import math
import scipy.stats as st
from sklearn.preprocessing import MinMaxScaler

import warnings

warnings.filterwarnings("ignore")

# pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)
# pd.set_option('display.width', 500)
# pd.set_option('display.expand_frame_repr', False)
# pd.set_option('display.float_format', lambda x: '%.5f' % x)

<div style="border-radius:10px; border:#D0C2F0 solid; padding: 15px; background-color: #F8E8EE; font-size:100%; text-align:left">

<h3 align="left"><font color='#5E5273'>📄 Notes: </font></h3>

**Sample Scenario for Examination:**

- (50+ Hours) Python A-Z™: Data Science and Machine Learning
- Rating: 4.8 (4.764925)
- Total Ratings: 4611
- Rating Percentages: 75, 20, 4, 1, <1
- Approximate Numerical Equivalents: 3458, 922, 184, 46, 6

In [2]:
df = pd.read_csv("/kaggle/input/course-reviewscsv/course_reviews.csv")

In [3]:
df.head()

Unnamed: 0,Rating,Timestamp,Enrolled,Progress,Questions Asked,Questions Answered
0,5.0,2021-02-05 07:45:55,2021-01-25 15:12:08,5.0,0.0,0.0
1,5.0,2021-02-04 21:05:32,2021-02-04 20:43:40,1.0,0.0,0.0
2,4.5,2021-02-04 20:34:03,2019-07-04 23:23:27,1.0,0.0,0.0
3,5.0,2021-02-04 16:56:28,2021-02-04 14:41:29,10.0,0.0,0.0
4,4.0,2021-02-04 15:00:24,2020-10-13 03:10:07,10.0,0.0,0.0


<div style="border-radius: 10px; border: #6B8E23 solid; padding: 15px; background-color: #F5F5DC; font-size: 100%; text-align: left">

<h3 align="left"><font color='#556B2F'>👀 Features : </font></h3>

1. **Rating:** The score given to the course.
2. **Timestamp:** The date when the rating was given.
3. **Enrolled:** The date of enrollment.
4. **Progress:** The percentage of the course that has been completed.
5. **Questions Asked:** The number of questions asked.
6. **Questions Answered:** The number of questions answered.

In [4]:
df.shape # 4323 reviews.

(4323, 6)

In [5]:
df["Rating"].value_counts() # rating distribution.

Rating
5.0    3267
4.5     475
4.0     383
3.5      96
3.0      62
1.0      15
2.0      12
2.5      11
1.5       2
Name: count, dtype: int64

In [6]:
df["Questions Asked"].value_counts() 

# Distribution of the number of questions asked per review.

Questions Asked
0.0     3867
1.0      276
2.0       80
3.0       43
4.0       15
5.0       13
6.0        9
8.0        5
9.0        3
14.0       2
11.0       2
7.0        2
10.0       2
15.0       2
22.0       1
12.0       1
Name: count, dtype: int64

In [7]:
# What is the average rating based on the number of questions asked?

df.groupby("Questions Asked").agg({"Questions Asked": "count",
                                   "Rating": "mean"})

Unnamed: 0_level_0,Questions Asked,Rating
Questions Asked,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0,3867,4.765193
1.0,276,4.740942
2.0,80,4.80625
3.0,43,4.744186
4.0,15,4.833333
5.0,13,4.653846
6.0,9,5.0
7.0,2,4.75
8.0,5,4.9
9.0,3,5.0


In [8]:
df["Rating"].mean()

4.764284061993986

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
* To calculate the average, you can use "**mean()**."
* A plain average is not right or wrong in itself; whether to use it is a matter of preference.
* Because the same calculation is applied to every other course as well, any bias it introduces affects all courses alike, so we can set it aside when comparing them.
* The result is therefore a usable indicator.
* The disadvantage is that we might miss recent satisfaction trends. For example, a course may have had high satisfaction in the first 3 months after its release but entered a negative trend in the next 3 months. Therefore, other factors should also be considered.
* Of course, this situation can also apply to the rating of a product on an e-commerce website, not just courses.
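
As a quick cross-check, the mean above can be reproduced from the rating distribution printed by `value_counts()`. The sketch below hard-codes those counts (copied from the output above) instead of reading the CSV; the variable names are ours, for illustration only.

```python
# Rating distribution copied from the value_counts() output above.
rating_counts = {5.0: 3267, 4.5: 475, 4.0: 383, 3.5: 96, 3.0: 62,
                 1.0: 15, 2.0: 12, 2.5: 11, 1.5: 2}

total_reviews = sum(rating_counts.values())

# The plain average is the count-weighted sum of the rating values
# divided by the total number of reviews.
mean_rating = sum(rating * count for rating, count in rating_counts.items()) / total_reviews

print(total_reviews)
print(round(mean_rating, 6))
```

The total matches the 4323 rows reported by `df.shape`, and the result agrees with `df["Rating"].mean()`.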

<a id = "3"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨Time-Based Weighted Average✨</p>

"Time-Based Weighted Average" refers to the average value of data in a data series calculated with different weights based on time. This method is used in situations where it is desired to assign more importance to newer or more recent data in the dataset. This is because, when operations are solely based on the average, the quality of recent events, such as shipping services, packaging services, etc., or recent customer satisfaction in an e-commerce site, might be overlooked.

The calculation of the Time-Based Weighted Average is based on the age or timing of each data point when determining its weight. In other words, more recent or newer data points have higher weights, while older data points have lower weights. This helps ensure that future predictions or analyses are based on more up-to-date information.

For example, a financial analyst might use Time-Based Weighted Average to predict the price movements of stocks. In this case, the stock prices of the past few days would be given more weight, while prices from earlier periods would be given less weight. This ensures that predictions are more current and relevant.

In [9]:
df.head()

Unnamed: 0,Rating,Timestamp,Enrolled,Progress,Questions Asked,Questions Answered
0,5.0,2021-02-05 07:45:55,2021-01-25 15:12:08,5.0,0.0,0.0
1,5.0,2021-02-04 21:05:32,2021-02-04 20:43:40,1.0,0.0,0.0
2,4.5,2021-02-04 20:34:03,2019-07-04 23:23:27,1.0,0.0,0.0
3,5.0,2021-02-04 16:56:28,2021-02-04 14:41:29,10.0,0.0,0.0
4,4.0,2021-02-04 15:00:24,2020-10-13 03:10:07,10.0,0.0,0.0


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4323 entries, 0 to 4322
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rating              4323 non-null   float64
 1   Timestamp           4323 non-null   object 
 2   Enrolled            4323 non-null   object 
 3   Progress            4323 non-null   float64
 4   Questions Asked     4323 non-null   float64
 5   Questions Answered  4323 non-null   float64
dtypes: float64(4), object(2)
memory usage: 202.8+ KB


In [11]:
# Converting the data type from object to datetime;

df["Timestamp"] = pd.to_datetime(df["Timestamp"]) 

In [12]:
current_date = pd.to_datetime('2021-02-10 0:0:0')

# We provide a string value, convert it to the datetime type, and assign it to `current_date`.

In [13]:
df["days"] = (current_date - df["Timestamp"]).dt.days

In [14]:
df.head()

Unnamed: 0,Rating,Timestamp,Enrolled,Progress,Questions Asked,Questions Answered,days
0,5.0,2021-02-05 07:45:55,2021-01-25 15:12:08,5.0,0.0,0.0,4
1,5.0,2021-02-04 21:05:32,2021-02-04 20:43:40,1.0,0.0,0.0,5
2,4.5,2021-02-04 20:34:03,2019-07-04 23:23:27,1.0,0.0,0.0,5
3,5.0,2021-02-04 16:56:28,2021-02-04 14:41:29,10.0,0.0,0.0,5
4,4.0,2021-02-04 15:00:24,2020-10-13 03:10:07,10.0,0.0,0.0,5
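
Note that `.dt.days` keeps only the whole-day component of each time difference, so a review written 4 days and 16 hours before `current_date` is counted as 4 days old. A small check using the first two timestamps from the table (the variable names here are ours):

```python
import pandas as pd

current_date = pd.to_datetime("2021-02-10 0:0:0")

# First two review timestamps from the table above.
timestamps = pd.Series(pd.to_datetime(["2021-02-05 07:45:55",
                                       "2021-02-04 21:05:32"]))

elapsed = current_date - timestamps   # full timedeltas, including hours
days = elapsed.dt.days                # whole days only

print(elapsed.tolist())
print(days.tolist())  # [4, 5]
```

This matters at the bucket edges used later: a review that is 30 days and 23 hours old still satisfies `days <= 30`.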


In [15]:
df[df["days"] <= 30].count()

# There have been 194 comments made in the last 30 days.

Rating                194
Timestamp             194
Enrolled              194
Progress              194
Questions Asked       194
Questions Answered    194
days                  194
dtype: int64

In [16]:
# Average rating over the last 30 days:

df.loc[df["days"] <= 30, "Rating"].mean()

4.775773195876289

In [17]:
# Average rating for reviews more than 30 and up to 90 days old:

df.loc[(df["days"] > 30) & (df["days"] <= 90), "Rating"].mean()

4.763833992094861

In [18]:
# Average rating for reviews more than 90 and up to 180 days old:

df.loc[(df["days"] > 90) & (df["days"] <= 180), "Rating"].mean()

4.752503576537912

In [19]:
# Average rating for reviews older than 180 days:

df.loc[(df["days"] > 180), "Rating"].mean()

4.76641586867305

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>

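* As we can see, the most recent 30 days have a slightly higher average (4.776) than the 30-90 day (4.764) and 90-180 day (4.753) windows, which suggests a modest recent rise in course satisfaction.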
* We calculated averages over different time intervals, but our goal is to introduce a weighting parameter for these time intervals.

In [20]:
# Weighted average rating calculation over time;

df.loc[df["days"] <= 30, "Rating"].mean() * 28/100 + \
    df.loc[(df["days"] > 30) & (df["days"] <= 90), "Rating"].mean() * 26/100 + \
    df.loc[(df["days"] > 90) & (df["days"] <= 180), "Rating"].mean() * 24/100 + \
    df.loc[(df["days"] > 180), "Rating"].mean() * 22/100

4.765025682267194

In [21]:
def time_based_weighted_average(dataframe, w1=28, w2=26, w3=24, w4=22):
    
    return dataframe.loc[dataframe["days"] <= 30, "Rating"].mean() * w1 / 100 + \
           dataframe.loc[(dataframe["days"] > 30) & (dataframe["days"] <= 90), "Rating"].mean() * w2 / 100 + \
           dataframe.loc[(dataframe["days"] > 90) & (dataframe["days"] <= 180), "Rating"].mean() * w3 / 100 + \
           dataframe.loc[(dataframe["days"] > 180), "Rating"].mean() * w4 / 100

In [22]:
time_based_weighted_average(df)

4.765025682267194

In [23]:
# We can modify the weight values;

time_based_weighted_average(df, 30, 26, 22, 22)

4.765491074653962

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
* The weights must sum to **100**.
* Small differences after the decimal point matter: on e-commerce websites, even slight changes in a rating have been observed to affect profits significantly, so we should take them into account.
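
Since the weights are percentages, it can help to validate them before combining the bucket means. A minimal sketch, assuming the four bucket means printed above; `combine_bucket_means` is a hypothetical helper of ours, not part of the notebook:

```python
import math


def combine_bucket_means(bucket_means, weights):
    """Combine per-bucket mean ratings using percentage weights."""
    if len(bucket_means) != len(weights):
        raise ValueError("need exactly one weight per bucket")
    if not math.isclose(sum(weights), 100):
        raise ValueError(f"weights must sum to 100, got {sum(weights)}")
    return sum(mean * weight / 100 for mean, weight in zip(bucket_means, weights))


# Mean ratings for the buckets <=30, 31-90, 91-180 and >180 days (outputs above).
means = [4.775773195876289, 4.763833992094861,
         4.752503576537912, 4.76641586867305]

default_score = combine_bucket_means(means, [28, 26, 24, 22])
shifted_score = combine_bucket_means(means, [30, 26, 22, 22])

print(round(default_score, 6))
print(round(shifted_score, 6))
```

Both values agree with the results of `time_based_weighted_average` above, and a call such as `combine_bucket_means(means, [30, 26, 24, 22])` raises a `ValueError` instead of silently returning an inflated score.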

<a id = "4"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨User-Based Weighted Average✨</p>

"User-Based Weighted Average" refers to a weighted average value calculated based on the preferences or characteristics of a user or individual. This type of average is used in situations where users assign different weights to various features or factors. It is commonly applied in areas such as marketing, recommendation systems, and delivering personalized content.

The calculation of the User-Based Weighted Average includes weights that reflect how much importance each user places on specific features or preferences. These weights are utilized to better understand user data and provide them with more tailored services or content.

For example, an online shopping platform may use User-Based Weighted Average to offer product recommendations based on users' past purchase history and preferences. By giving more weight to users' previous buying habits and preferences, personalized product suggestions can be provided, enhancing the user experience and potentially increasing sales.

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
1. Different User Groups May Have Different Weights: The ratings given by users should not have the same weight. This is because some users may have only watched a small portion of the course and left a rating, so their ratings may have a different weight than those of users who have watched a larger portion and contributed. This is important to better reflect the actual contributions of users and prevent fake ratings.

2. The Concept of Social Proof and Fake Ratings: Ratings from users who have made only a single purchase, or who gave a high rating after seeing only a small portion of the course, may not accurately reflect the concept of "Social Proof." Such ratings may not objectively represent the quality of a product or service, and some of them may be fake. Therefore, having different weights for different user groups is crucial to creating a more balanced and reliable rating system.

In [24]:
df.head()

Unnamed: 0,Rating,Timestamp,Enrolled,Progress,Questions Asked,Questions Answered,days
0,5.0,2021-02-05 07:45:55,2021-01-25 15:12:08,5.0,0.0,0.0,4
1,5.0,2021-02-04 21:05:32,2021-02-04 20:43:40,1.0,0.0,0.0,5
2,4.5,2021-02-04 20:34:03,2019-07-04 23:23:27,1.0,0.0,0.0,5
3,5.0,2021-02-04 16:56:28,2021-02-04 14:41:29,10.0,0.0,0.0,5
4,4.0,2021-02-04 15:00:24,2020-10-13 03:10:07,10.0,0.0,0.0,5


In [25]:
df.groupby("Progress").agg({"Rating": "mean"}).head(10)

Unnamed: 0_level_0,Rating
Progress,Unnamed: 1_level_1
0.0,4.673913
1.0,4.642691
2.0,4.654762
3.0,4.663551
4.0,4.777328
5.0,4.69821
6.0,4.755102
7.0,4.732558
8.0,4.741935
9.0,4.83125


In [26]:
df.groupby("Progress").agg({"Rating": "mean"}).tail(10)

Unnamed: 0_level_0,Rating
Progress,Unnamed: 1_level_1
87.0,5.0
89.0,4.794118
90.0,4.923077
91.0,5.0
93.0,4.833333
94.0,5.0
95.0,4.794118
97.0,5.0
98.0,5.0
100.0,4.866319


<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
* Average ratings tend to rise as progress through the course increases.
* In other words, there appears to be a relationship between the percentage of the course completed and the rating given.
* We can therefore weight ratings by course progress to reduce the effect of fraudulent or misleading ratings.

In [27]:
df.loc[df["Progress"] <= 10, "Rating"].mean() * 22 / 100 + \
    df.loc[(df["Progress"] > 10) & (df["Progress"] <= 45), "Rating"].mean() * 24 / 100 + \
    df.loc[(df["Progress"] > 45) & (df["Progress"] <= 75), "Rating"].mean() * 26 / 100 + \
    df.loc[(df["Progress"] > 75), "Rating"].mean() * 28 / 100

4.800257704672543

In [28]:
def user_based_weighted_average(dataframe, w1=22, w2=24, w3=26, w4=28):
    
    return dataframe.loc[dataframe["Progress"] <= 10, "Rating"].mean() * w1 / 100 + \
           dataframe.loc[(dataframe["Progress"] > 10) & (dataframe["Progress"] <= 45), "Rating"].mean() * w2 / 100 + \
           dataframe.loc[(dataframe["Progress"] > 45) & (dataframe["Progress"] <= 75), "Rating"].mean() * w3 / 100 + \
           dataframe.loc[(dataframe["Progress"] > 75), "Rating"].mean() * w4 / 100

In [29]:
user_based_weighted_average(df, 20, 24, 26, 30)

4.803286469062915
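
The four `.loc` filters generalize naturally to `pd.cut`. The following is a sketch under our own assumptions, not the method used in this notebook: the function name `bucketed_weighted_average`, its parameters, and the toy data are hypothetical. Unlike the cells above, it renormalizes the weights over non-empty buckets, because the mean of an empty bucket is `NaN` and would otherwise turn the whole score into `NaN`.

```python
import pandas as pd


def bucketed_weighted_average(dataframe, column, edges, weights, target="Rating"):
    """Weighted mean of the per-bucket averages of `target`, bucketed on `column`."""
    if len(weights) != len(edges) - 1:
        raise ValueError("need exactly one weight per bucket")

    # include_lowest=True closes the first bucket on the left, so a value
    # equal to edges[0] (e.g. Progress == 0) is not dropped.
    buckets = pd.cut(dataframe[column], bins=edges, include_lowest=True)
    bucket_means = dataframe.groupby(buckets, observed=False)[target].mean()

    w = pd.Series(weights, index=bucket_means.index, dtype=float)
    non_empty = bucket_means.notna()

    # Assumption (ours): renormalize the weights over non-empty buckets.
    return (bucket_means[non_empty] * w[non_empty]).sum() / w[non_empty].sum()


# Toy data, invented for illustration only.
toy = pd.DataFrame({"Progress": [5, 20, 60, 90, 100],
                    "Rating": [4.0, 4.5, 5.0, 5.0, 4.0]})
edges = [0, 10, 45, 75, 100]  # same bucket boundaries as the cells above

full_score = bucketed_weighted_average(toy, "Progress", edges, [22, 24, 26, 28])

# Removing the only row in the (45, 75] bucket leaves that bucket empty.
sparse = toy[toy["Progress"] != 60]
sparse_score = bucketed_weighted_average(sparse, "Progress", edges, [22, 24, 26, 28])

print(full_score, sparse_score)
```

When every bucket is populated and the weights sum to 100, this reduces to the same calculation as `user_based_weighted_average`. Values outside the outermost edges receive no bucket and are ignored, so the edges should cover the full range of the column.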

<a id = "5"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨Weighted Rating✨</p>

"Weighted Rating" represents an average value calculated by assigning different weights to various factors or features when determining the rating of a particular item or product. This is used to assess the quality or value of a product or item more precisely. Weights are determined based on the priority of specific factors or features.

For example, on a movie review site, when calculating the overall rating of a film, different weights can be assigned to critics' ratings and user ratings. Giving a higher weight to critics' ratings assumes that critics are generally better judges of the artistic or technical quality of the film. Conversely, assigning a lower weight to user ratings means that the opinion of the general audience counts for less.

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>

* At this stage, we will attempt to combine time-based and user-based weight calculations into a single function to achieve a more consistent result.

In [30]:
def course_weighted_rating(dataframe, time_w=50, user_w=50):
    
    return time_based_weighted_average(dataframe) * time_w/100 + user_based_weighted_average(dataframe)*user_w/100

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
* **time_w:** The weight (in percent) of the time-based weighted average.
* **user_w:** The weight (in percent) of the user-based weighted average.

In [31]:
course_weighted_rating(df)

4.782641693469868

In [32]:
course_weighted_rating(df, time_w=40, user_w=60)

4.786164895710403
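
To make the combination explicit, the two results above can be reproduced by hand from the component scores. This is only an arithmetic check using numbers printed earlier in the notebook; `combine` is a name of ours.

```python
# Component scores printed earlier in the notebook (default weights).
time_based = 4.765025682267194   # time_based_weighted_average(df)
user_based = 4.800257704672543   # user-based average with weights 22, 24, 26, 28


def combine(time_score, user_score, time_w=50, user_w=50):
    """Blend the two component scores using percentage weights."""
    return time_score * time_w / 100 + user_score * user_w / 100


equal_mix = combine(time_based, user_based)
user_heavy_mix = combine(time_based, user_based, time_w=40, user_w=60)

print(round(equal_mix, 6))
print(round(user_heavy_mix, 6))
```

Because the user-based score is the higher of the two components, shifting weight toward `user_w` raises the final rating.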

<a id="6"></a>
<h1 style="border-radius: 10px; border: 2px solid #6B8E23; background-color: #F5F5DC; font-family: 'Pacifico', cursive; font-size: 200%; text-align: center; border-radius: 15px 50px; padding: 15px; box-shadow: 5px 5px 5px #556B2F; color: #556B2F;">🤩 Sorting Products 🤩</h1>

"Sorting Products" is a process on an online shopping platform or e-commerce website where products are arranged based on specific criteria chosen by users. Users typically want to sort products during online shopping based on factors such as price, size, color, brand, popularity, or other features.

For example, on a clothing website, a user can quickly find products within their budget by selecting the "low to high price" sorting option. This allows users to easily choose and compare the products or categories they are interested in.

<a id = "7"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨Sorting by Rating✨</p>

In [33]:
import pandas as pd
import math
import scipy.stats as st
from sklearn.preprocessing import MinMaxScaler

# pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)
# pd.set_option('display.expand_frame_repr', False)
# pd.set_option('display.float_format', lambda x: '%.5f' % x)

In [34]:
df = pd.read_csv("/kaggle/input/product-sortingdataset/product_sorting.csv")

In [35]:
df.head()

Unnamed: 0,course_name,instructor_name,purchase_count,rating,commment_count,5_point,4_point,3_point,2_point,1_point
0,(50+ Saat) Python A-Z™: Veri Bilimi ve Machine...,Veri Bilimi Okulu,17380,4.8,4621,3466,924,185,46,6
1,Python: Yapay Zeka ve Veri Bilimi için Python ...,Veri Bilimi Okulu,48291,4.6,4488,2962,1122,314,45,45
2,5 Saatte Veri Bilimci Olun (Valla Billa),Instructor_1,18693,4.4,2362,1582,567,165,24,24
3,R ile Veri Bilimi ve Machine Learning (35 Saat),Veri Bilimi Okulu,6626,4.6,1027,688,257,51,10,21
4,(2020) Python ile Makine Öğrenmesi (Machine Le...,Veri Bilimi Okulu,11314,4.6,969,717,194,38,10,10


In [36]:
df.shape

(32, 10)

<div style="border-radius: 10px; border: #6B8E23 solid; padding: 15px; background-color: #F5F5DC; font-size: 100%; text-align: left">

<h3 align="left"><font color='#556B2F'>👀 Features : </font></h3>

- **course_name:** Courses that do not belong to the "Veri Bilimi Okulu" instructor are listed as "Course_1, Course_2...".
- **instructor_name:** The name of only the "Veri Bilimi Okulu" instructor is explicitly mentioned, while others are named as "Instructor_1, Instructor_2...".
- **purchase_count:** The number of purchases for the course.
- **rating:** The average rating of the course.
- **commment_count:** The number of comments or reviews received by the course (the column name is spelled with three m's in the dataset).
- **5_point, 4_point, 3_point, 2_point, 1_point:** The distribution of comments' ratings.

In [37]:
df.sort_values("rating", ascending=False).head(10)

Unnamed: 0,course_name,instructor_name,purchase_count,rating,commment_count,5_point,4_point,3_point,2_point,1_point
0,(50+ Saat) Python A-Z™: Veri Bilimi ve Machine...,Veri Bilimi Okulu,17380,4.8,4621,3466,924,185,46,6
10,İleri Düzey Excel|Dashboard|Excel İp Uçları,Veri Bilimi Okulu,9554,4.8,2266,1654,499,91,22,0
19,Alıştırmalarla SQL Öğreniyorum,Veri Bilimi Okulu,3155,4.8,235,200,31,4,0,0
5,Course_1,Instructor_2,4601,4.8,213,164,45,4,0,0
6,Course_2,Instructor_3,3171,4.7,856,582,205,51,9,9
14,Uçtan Uca SQL Server Eğitimi,Veri Bilimi Okulu,12893,4.7,2425,1722,510,145,24,24
8,A'dan Z'ye Apache Spark (Scala & Python),Veri Bilimi Okulu,6920,4.7,214,154,41,13,2,4
13,Course_5,Instructor_6,6056,4.7,144,82,46,12,1,3
27,Course_15,Instructor_1,1164,4.6,98,65,24,6,0,3
1,Python: Yapay Zeka ve Veri Bilimi için Python ...,Veri Bilimi Okulu,48291,4.6,4488,2962,1122,314,45,45


<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
* **"Course_1"** has a significantly low number of comments in relation to the purchase count.
* **"purchase_count"** and **"comment_count"** are the most influential factors for social proof.

<a id = "8"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨Sorting by Comment Count or Purchase Count✨</p>

In [38]:
df.sort_values("purchase_count", ascending=False).head(10)

Unnamed: 0,course_name,instructor_name,purchase_count,rating,commment_count,5_point,4_point,3_point,2_point,1_point
1,Python: Yapay Zeka ve Veri Bilimi için Python ...,Veri Bilimi Okulu,48291,4.6,4488,2962,1122,314,45,45
11,Course_3,Instructor_4,24809,4.3,250,95,87,51,12,5
2,5 Saatte Veri Bilimci Olun (Valla Billa),Instructor_1,18693,4.4,2362,1582,567,165,24,24
0,(50+ Saat) Python A-Z™: Veri Bilimi ve Machine...,Veri Bilimi Okulu,17380,4.8,4621,3466,924,185,46,6
20,Course_9,Instructor_3,12946,4.5,3371,2191,877,203,33,67
14,Uçtan Uca SQL Server Eğitimi,Veri Bilimi Okulu,12893,4.7,2425,1722,510,145,24,24
15,Uygulamalarla SQL Öğreniyorum,Veri Bilimi Okulu,11397,4.5,2353,1435,705,165,24,24
4,(2020) Python ile Makine Öğrenmesi (Machine Le...,Veri Bilimi Okulu,11314,4.6,969,717,194,38,10,10
10,İleri Düzey Excel|Dashboard|Excel İp Uçları,Veri Bilimi Okulu,9554,4.8,2266,1654,499,91,22,0
8,A'dan Z'ye Apache Spark (Scala & Python),Veri Bilimi Okulu,6920,4.7,214,154,41,13,2,4


<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>

* **"Course_3"** stands out directly. It may potentially be available for free because, despite having the second-highest purchase count, its comment count is significantly low.
* This metric doesn't seem to be successful on its own. Likewise, as the next cell shows, comment counts alone fall short in conveying meaningful information.

In [39]:
df.sort_values("commment_count", ascending=False).head(10)

Unnamed: 0,course_name,instructor_name,purchase_count,rating,commment_count,5_point,4_point,3_point,2_point,1_point
0,(50+ Saat) Python A-Z™: Veri Bilimi ve Machine...,Veri Bilimi Okulu,17380,4.8,4621,3466,924,185,46,6
1,Python: Yapay Zeka ve Veri Bilimi için Python ...,Veri Bilimi Okulu,48291,4.6,4488,2962,1122,314,45,45
20,Course_9,Instructor_3,12946,4.5,3371,2191,877,203,33,67
14,Uçtan Uca SQL Server Eğitimi,Veri Bilimi Okulu,12893,4.7,2425,1722,510,145,24,24
2,5 Saatte Veri Bilimci Olun (Valla Billa),Instructor_1,18693,4.4,2362,1582,567,165,24,24
15,Uygulamalarla SQL Öğreniyorum,Veri Bilimi Okulu,11397,4.5,2353,1435,705,165,24,24
10,İleri Düzey Excel|Dashboard|Excel İp Uçları,Veri Bilimi Okulu,9554,4.8,2266,1654,499,91,22,0
3,R ile Veri Bilimi ve Machine Learning (35 Saat),Veri Bilimi Okulu,6626,4.6,1027,688,257,51,10,21
4,(2020) Python ile Makine Öğrenmesi (Machine Le...,Veri Bilimi Okulu,11314,4.6,969,717,194,38,10,10
9,Modern R Programlama Eğitimi,Veri Bilimi Okulu,6537,4.4,901,559,252,72,9,9


<a id = "9"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨Sorting by Rating, Comment and Purchase✨</p>

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>

* Our goal is to examine these three factors (ratings, purchases, and reviews) together.
* One idea is to multiply the three factors together. However, the **"purchase_count"** and **"comment_count"** values are much larger than the **"rating"** values, so **"rating"** would have very little effect on the result. The reason for this is the difference in their scales.
* We can therefore bring all three variables onto the same scale by rescaling the two counts to the 1-5 range used by **"rating."**

In [40]:
df["purchase_count_scaled"] = MinMaxScaler(feature_range=(1, 5)). \
    fit(df[["purchase_count"]]). \
    transform(df[["purchase_count"]])

In [41]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
purchase_count,32.0,7110.71875,9760.893396,30.0,877.5,3687.5,9994.0,48291.0
rating,32.0,4.35625,0.447889,3.1,4.275,4.5,4.625,4.8
commment_count,32.0,882.0625,1321.498903,7.0,87.75,194.5,983.5,4621.0
5_point,32.0,598.09375,920.140114,1.0,49.25,112.5,695.25,3466.0
4_point,32.0,211.53125,312.262915,2.0,19.75,45.5,253.25,1122.0
3_point,32.0,54.125,76.576232,0.0,6.75,14.5,56.25,314.0
2_point,32.0,9.53125,12.991273,0.0,0.75,3.0,10.5,46.0
1_point,32.0,8.96875,14.570151,0.0,2.0,3.0,9.0,67.0
purchase_count_scaled,32.0,1.586869,0.809009,1.0,1.070243,1.303143,1.825843,5.0


In [42]:
df["comment_count_scaled"] = MinMaxScaler(feature_range=(1, 5)). \
    fit(df[["commment_count"]]). \
    transform(df[["commment_count"]])

In [43]:
df.head()

Unnamed: 0,course_name,instructor_name,purchase_count,rating,commment_count,5_point,4_point,3_point,2_point,1_point,purchase_count_scaled,comment_count_scaled
0,(50+ Saat) Python A-Z™: Veri Bilimi ve Machine...,Veri Bilimi Okulu,17380,4.8,4621,3466,924,185,46,6,2.438014,5.0
1,Python: Yapay Zeka ve Veri Bilimi için Python ...,Veri Bilimi Okulu,48291,4.6,4488,2962,1122,314,45,45,5.0,4.884699
2,5 Saatte Veri Bilimci Olun (Valla Billa),Instructor_1,18693,4.4,2362,1582,567,165,24,24,2.546839,3.041612
3,R ile Veri Bilimi ve Machine Learning (35 Saat),Veri Bilimi Okulu,6626,4.6,1027,688,257,51,10,21,1.546694,1.884265
4,(2020) Python ile Makine Öğrenmesi (Machine Le...,Veri Bilimi Okulu,11314,4.6,969,717,194,38,10,10,1.935248,1.833984
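
`MinMaxScaler(feature_range=(1, 5))` maps the column minimum to 1 and the column maximum to 5 linearly. The sketch below applies the same formula by hand to two values from the table, using the minimum and maximum reported by `df.describe()`; `min_max_scale` is a helper name of ours, not part of scikit-learn.

```python
def min_max_scale(x, col_min, col_max, new_min=1.0, new_max=5.0):
    """Linearly map x from [col_min, col_max] onto [new_min, new_max]."""
    return new_min + (x - col_min) / (col_max - col_min) * (new_max - new_min)


# purchase_count of the first course; column min 30, max 48291.
purchase_scaled = min_max_scale(17380, 30, 48291)

# commment_count of the second course; column min 7, max 4621.
comment_scaled = min_max_scale(4488, 7, 4621)

print(round(purchase_scaled, 6))  # 2.438014, as in the table above
print(round(comment_scaled, 6))   # 4.884699, as in the table above
```

Because the mapping depends on the column minimum and maximum, a single extreme course (here the one with 48291 purchases) compresses all other scaled values toward 1, which is visible in the `purchase_count_scaled` quartiles.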


In [44]:
(df["comment_count_scaled"] * 32 / 100 +
 df["purchase_count_scaled"] * 26 / 100 +
 df["rating"] * 42 / 100)

0     4.249884
1     4.795104
2     3.483494
3     2.937105
4     3.022039
5     2.751651
6     2.857214
7     2.522386
8     2.759901
9     2.816233
10    3.427921
11    2.987387
12    2.528686
13    2.721863
14    3.501984
15    3.365772
16    2.517977
17    2.282315
18    2.458066
19    2.726593
20    3.681563
21    2.519056
22    2.538280
23    2.354586
24    2.264794
25    2.021050
26    2.436116
27    2.561682
28    2.544666
29    1.925836
30    1.924000
31    2.273764
dtype: float64

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>

* They are weighted scores resulting from the conditions we have defined.
* When creating this score, the most important factor for us is **"rating,"** followed by **"comment_count_scaled,"** and finally **"purchase_count_scaled."**
* **"purchase_count_scaled"** gets the lowest weight because purchase counts may include enrollments made while a course was free. However, to make a general inference, we still include it in the score calculation rather than ignoring it completely.
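
As a quick check of the weighting described above, the following sketch recomputes the score for the first two rows of the table in plain Python. The helper name `weighted_score` and the hard-coded rows are illustrative assumptions; the values are copied from the `df.head()` output.

```python
# Illustrative sketch: recompute the weighted score for two rows by hand.
# `weighted_score` is a hypothetical helper, not part of the notebook.

def weighted_score(comment_scaled, purchase_scaled, rating, w1=32, w2=26, w3=42):
    """Combine three factors with percentage weights that sum to 100."""
    assert w1 + w2 + w3 == 100, "weights are expected to sum to 100"
    return (comment_scaled * w1 + purchase_scaled * w2 + rating * w3) / 100

# (comment_count_scaled, purchase_count_scaled, rating) copied from df.head()
row_0 = (5.0, 2.438014, 4.8)
row_1 = (4.884699, 5.0, 4.6)

score_0 = weighted_score(*row_0)
score_1 = weighted_score(*row_1)

print(round(score_0, 4), round(score_1, 4))
```

Row 1 ends up ahead of row 0 even though its rating is lower, because its purchase count is much higher and purchases still carry 26% of the weight.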

In [45]:
def weighted_sorting_score(dataframe, w1=32, w2=26, w3=42):
    
    return (dataframe["comment_count_scaled"] * w1 / 100 +
            dataframe["purchase_count_scaled"] * w2 / 100 +
            dataframe["rating"] * w3 / 100)

In [46]:
df["weighted_sorting_score"] = weighted_sorting_score(df)

In [47]:
df.sort_values("weighted_sorting_score", ascending=False).head(10)

Unnamed: 0,course_name,instructor_name,purchase_count,rating,commment_count,5_point,4_point,3_point,2_point,1_point,purchase_count_scaled,comment_count_scaled,weighted_sorting_score
1,Python: Yapay Zeka ve Veri Bilimi için Python ...,Veri Bilimi Okulu,48291,4.6,4488,2962,1122,314,45,45,5.0,4.884699,4.795104
0,(50+ Saat) Python A-Z™: Veri Bilimi ve Machine...,Veri Bilimi Okulu,17380,4.8,4621,3466,924,185,46,6,2.438014,5.0,4.249884
20,Course_9,Instructor_3,12946,4.5,3371,2191,877,203,33,67,2.070512,3.916342,3.681563
14,Uçtan Uca SQL Server Eğitimi,Veri Bilimi Okulu,12893,4.7,2425,1722,510,145,24,24,2.06612,3.096229,3.501984
2,5 Saatte Veri Bilimci Olun (Valla Billa),Instructor_1,18693,4.4,2362,1582,567,165,24,24,2.546839,3.041612,3.483494
10,İleri Düzey Excel|Dashboard|Excel İp Uçları,Veri Bilimi Okulu,9554,4.8,2266,1654,499,91,22,0,1.789374,2.958388,3.427921
15,Uygulamalarla SQL Öğreniyorum,Veri Bilimi Okulu,11397,4.5,2353,1435,705,165,24,24,1.942127,3.03381,3.365772
4,(2020) Python ile Makine Öğrenmesi (Machine Le...,Veri Bilimi Okulu,11314,4.6,969,717,194,38,10,10,1.935248,1.833984,3.022039
11,Course_3,Instructor_4,24809,4.3,250,95,87,51,12,5,3.053749,1.210663,2.987387
3,R ile Veri Bilimi ve Machine Learning (35 Saat),Veri Bilimi Okulu,6626,4.6,1027,688,257,51,10,21,1.546694,1.884265,2.937105


<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>

* Once the three factors are combined according to their weights, the ranking changes. We can have more confidence in this ranking than in one based on any single factor.

In [48]:
df[df["course_name"].str.contains("Veri Bilimi")].sort_values("weighted_sorting_score", ascending=False)

Unnamed: 0,course_name,instructor_name,purchase_count,rating,commment_count,5_point,4_point,3_point,2_point,1_point,purchase_count_scaled,comment_count_scaled,weighted_sorting_score
1,Python: Yapay Zeka ve Veri Bilimi için Python ...,Veri Bilimi Okulu,48291,4.6,4488,2962,1122,314,45,45,5.0,4.884699,4.795104
0,(50+ Saat) Python A-Z™: Veri Bilimi ve Machine...,Veri Bilimi Okulu,17380,4.8,4621,3466,924,185,46,6,2.438014,5.0,4.249884
3,R ile Veri Bilimi ve Machine Learning (35 Saat),Veri Bilimi Okulu,6626,4.6,1027,688,257,51,10,21,1.546694,1.884265,2.937105
7,Veri Bilimi için İstatistik: Python ile İstati...,Veri Bilimi Okulu,929,4.5,126,88,26,9,0,3,1.074512,1.103164,2.522386


<a id = "10"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨Bayesian Average Rating Score✨</p>

The Bayesian Average Rating Score is a method, used mainly for analyzing user reviews and ratings, that applies Bayesian statistics to evaluate user feedback more accurately.

Traditional average rating calculations take the plain arithmetic mean of the scores given by users. However, this can distort the result, especially when only a few ratings are available or some users give extreme scores. The Bayesian Average Rating Score aims to minimize such effects.

The Bayesian approach combines the observed ratings with a prior distribution, that is, an assumption about how ratings are distributed before any data is seen. This prior is taken into account when calculating how much each observed rating contributes to the score. Consequently, the impact of sparse or extreme ratings is balanced, resulting in more reliable outcomes.
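
To make the effect of the prior concrete, here is a minimal sketch that assumes a uniform prior of one pseudo-vote per star level, which is the same `+ 1` smoothing that appears in the `bayesian_average_rating` function of this notebook. The function names `raw_mean` and `smoothed_mean` and the vote counts are illustrative assumptions.

```python
# Illustrative sketch: how one pseudo-vote per star level tempers a small sample.

def raw_mean(counts):
    """Plain average; counts[k] is the number of (k + 1)-star votes."""
    total = sum(counts)
    return sum((k + 1) * c for k, c in enumerate(counts)) / total

def smoothed_mean(counts):
    """Average after adding one pseudo-vote to every star level."""
    total = sum(counts) + len(counts)
    return sum((k + 1) * (c + 1) for k, c in enumerate(counts)) / total

few_votes = [0, 0, 0, 0, 7]      # seven 5-star votes (made-up example)
many_votes = [0, 0, 0, 0, 700]   # seven hundred 5-star votes (made-up example)

print(raw_mean(few_votes), round(smoothed_mean(few_votes), 3))
print(raw_mean(many_votes), round(smoothed_mean(many_votes), 3))
```

Both items have a raw mean of 5.0, but the smoothed mean of the item with seven votes is pulled noticeably toward the middle of the scale, while the item with 700 votes barely moves.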

<div style="border-radius:10px; border:#D0C2F0 solid; padding: 15px; background-color: #F8E8EE; font-size:100%; text-align:left">

<h3 align="left"><font color='#5E5273'>📄 Notes: </font></h3>

* Other names for this topic:
    * Sorting Products with 5 Star Rated
    * Sorting Products According to Distribution of 5 Star Rating

In [49]:
def bayesian_average_rating(n, confidence=0.95):
    
    if sum(n) == 0:
        
        return 0
    
    K = len(n)
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    N = sum(n)
    
    first_part = sum((k + 1) * (n_k + 1) / (N + K) for k, n_k in enumerate(n))
    second_part = sum((k + 1) * (k + 1) * (n_k + 1) / (N + K) for k, n_k in enumerate(n))
    
    score = first_part - z * math.sqrt((second_part - first_part * first_part) / (N + K + 1))
    
    return score

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
The Bayesian Average Rating is a probabilistic method that calculates a weighted average based on the distribution of ratings.

This method uses the distribution of ratings from 1 to 5 in the dataset to compute a conservative (lower-bound) estimate of the average rating.

- `n`: The observed number of votes for each star level, ordered from 1 star to 5 stars (so `n[0]` is the number of 1-star votes).
- `confidence`: The confidence level used to obtain the z-score (0.95 gives z ≈ 1.96). A higher confidence level subtracts a larger margin from the average and therefore yields a lower score.
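
The role of the two parameters can be sketched with the standard library only. This version swaps `scipy.stats.norm.ppf` for `statistics.NormalDist().inv_cdf`, which is an assumption made to keep the example dependency-free; the name `bar_lower_bound` is hypothetical, and the vote counts are those of one course in this dataset (index 19).

```python
import math
from statistics import NormalDist

def bar_lower_bound(n, confidence=0.95):
    """Lower bound of the smoothed average rating; n[k] = votes for (k + 1) stars."""
    N = sum(n)
    if N == 0:
        return 0.0
    K = len(n)
    # Two-sided z-score, about 1.96 for confidence=0.95
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Smoothed mean and mean of squares (one pseudo-vote per star level)
    mean = sum((k + 1) * (n_k + 1) / (N + K) for k, n_k in enumerate(n))
    mean_sq = sum((k + 1) ** 2 * (n_k + 1) / (N + K) for k, n_k in enumerate(n))
    return mean - z * math.sqrt((mean_sq - mean * mean) / (N + K + 1))

counts = [0, 0, 4, 31, 200]  # 1-star .. 5-star votes
score_95 = bar_lower_bound(counts, confidence=0.95)
score_99 = bar_lower_bound(counts, confidence=0.99)
print(round(score_95, 3), round(score_99, 3))
```

Raising `confidence` from 0.95 to 0.99 widens the margin that is subtracted, so the second score is lower than the first for the same votes.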

In [50]:
df.head()

Unnamed: 0,course_name,instructor_name,purchase_count,rating,commment_count,5_point,4_point,3_point,2_point,1_point,purchase_count_scaled,comment_count_scaled,weighted_sorting_score
0,(50+ Saat) Python A-Z™: Veri Bilimi ve Machine...,Veri Bilimi Okulu,17380,4.8,4621,3466,924,185,46,6,2.438014,5.0,4.249884
1,Python: Yapay Zeka ve Veri Bilimi için Python ...,Veri Bilimi Okulu,48291,4.6,4488,2962,1122,314,45,45,5.0,4.884699,4.795104
2,5 Saatte Veri Bilimci Olun (Valla Billa),Instructor_1,18693,4.4,2362,1582,567,165,24,24,2.546839,3.041612,3.483494
3,R ile Veri Bilimi ve Machine Learning (35 Saat),Veri Bilimi Okulu,6626,4.6,1027,688,257,51,10,21,1.546694,1.884265,2.937105
4,(2020) Python ile Makine Öğrenmesi (Machine Le...,Veri Bilimi Okulu,11314,4.6,969,717,194,38,10,10,1.935248,1.833984,3.022039


In [51]:
df["bar_score"] = df.apply(lambda x: bayesian_average_rating(x[["1_point",
                                                                "2_point",
                                                                "3_point",
                                                                "4_point",
                                                                "5_point"]]), axis=1)

# It can also be named as "bar_sorting_score," "bar_rating," and "bar_average_rating."

In [52]:
df.sort_values("weighted_sorting_score", ascending=False).head(10)

Unnamed: 0,course_name,instructor_name,purchase_count,rating,commment_count,5_point,4_point,3_point,2_point,1_point,purchase_count_scaled,comment_count_scaled,weighted_sorting_score,bar_score
1,Python: Yapay Zeka ve Veri Bilimi için Python ...,Veri Bilimi Okulu,48291,4.6,4488,2962,1122,314,45,45,5.0,4.884699,4.795104,4.516038
0,(50+ Saat) Python A-Z™: Veri Bilimi ve Machine...,Veri Bilimi Okulu,17380,4.8,4621,3466,924,185,46,6,2.438014,5.0,4.249884,4.665857
20,Course_9,Instructor_3,12946,4.5,3371,2191,877,203,33,67,2.070512,3.916342,3.681563,4.480627
14,Uçtan Uca SQL Server Eğitimi,Veri Bilimi Okulu,12893,4.7,2425,1722,510,145,24,24,2.06612,3.096229,3.501984,4.568162
2,5 Saatte Veri Bilimci Olun (Valla Billa),Instructor_1,18693,4.4,2362,1582,567,165,24,24,2.546839,3.041612,3.483494,4.51521
10,İleri Düzey Excel|Dashboard|Excel İp Uçları,Veri Bilimi Okulu,9554,4.8,2266,1654,499,91,22,0,1.789374,2.958388,3.427921,4.641679
15,Uygulamalarla SQL Öğreniyorum,Veri Bilimi Okulu,11397,4.5,2353,1435,705,165,24,24,1.942127,3.03381,3.365772,4.454811
4,(2020) Python ile Makine Öğrenmesi (Machine Le...,Veri Bilimi Okulu,11314,4.6,969,717,194,38,10,10,1.935248,1.833984,3.022039,4.595674
11,Course_3,Instructor_4,24809,4.3,250,95,87,51,12,5,3.053749,1.210663,2.987387,3.877743
3,R ile Veri Bilimi ve Machine Learning (35 Saat),Veri Bilimi Okulu,6626,4.6,1027,688,257,51,10,21,1.546694,1.884265,2.937105,4.482079


In [53]:
df.sort_values("bar_score", ascending=False).head(10)

Unnamed: 0,course_name,instructor_name,purchase_count,rating,commment_count,5_point,4_point,3_point,2_point,1_point,purchase_count_scaled,comment_count_scaled,weighted_sorting_score,bar_score
19,Alıştırmalarla SQL Öğreniyorum,Veri Bilimi Okulu,3155,4.8,235,200,31,4,0,0,1.259008,1.197659,2.726593,4.729128
0,(50+ Saat) Python A-Z™: Veri Bilimi ve Machine...,Veri Bilimi Okulu,17380,4.8,4621,3466,924,185,46,6,2.438014,5.0,4.249884,4.665857
10,İleri Düzey Excel|Dashboard|Excel İp Uçları,Veri Bilimi Okulu,9554,4.8,2266,1654,499,91,22,0,1.789374,2.958388,3.427921,4.641679
5,Course_1,Instructor_2,4601,4.8,213,164,45,4,0,0,1.378857,1.178587,2.751651,4.634477
4,(2020) Python ile Makine Öğrenmesi (Machine Le...,Veri Bilimi Okulu,11314,4.6,969,717,194,38,10,10,1.935248,1.833984,3.022039,4.595674
14,Uçtan Uca SQL Server Eğitimi,Veri Bilimi Okulu,12893,4.7,2425,1722,510,145,24,24,2.06612,3.096229,3.501984,4.568162
1,Python: Yapay Zeka ve Veri Bilimi için Python ...,Veri Bilimi Okulu,48291,4.6,4488,2962,1122,314,45,45,5.0,4.884699,4.795104,4.516038
2,5 Saatte Veri Bilimci Olun (Valla Billa),Instructor_1,18693,4.4,2362,1582,567,165,24,24,2.546839,3.041612,3.483494,4.51521
6,Course_2,Instructor_3,3171,4.7,856,582,205,51,9,9,1.260334,1.736021,2.857214,4.50797
3,R ile Veri Bilimi ve Machine Learning (35 Saat),Veri Bilimi Okulu,6626,4.6,1027,688,257,51,10,21,1.546694,1.884265,2.937105,4.482079


<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
The "bar_score" provides a ranking based solely on ratings. Because it is built from the distribution of ratings, it can be considered a statistically grounded score for the rating dimension.

Looking at the ranking above, courses with low comment and purchase counts still appear at the top. The main reason is that **"bar_score"** is created solely from the ratings.

Comparing indices 1 and 5 below shows why: Course_1 (index 5) has zero values in **"1_point"** and **"2_point"**, so it gets a higher score despite having far fewer comments and purchases.

In [54]:
df.loc[[5, 1]].sort_values("bar_score", ascending=False)

Unnamed: 0,course_name,instructor_name,purchase_count,rating,commment_count,5_point,4_point,3_point,2_point,1_point,purchase_count_scaled,comment_count_scaled,weighted_sorting_score,bar_score
5,Course_1,Instructor_2,4601,4.8,213,164,45,4,0,0,1.378857,1.178587,2.751651,4.634477
1,Python: Yapay Zeka ve Veri Bilimi için Python ...,Veri Bilimi Okulu,48291,4.6,4488,2962,1122,314,45,45,5.0,4.884699,4.795104,4.516038


<a id = "11"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨Hybrid Sorting✨</p>

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
* The steps we've taken so far are as follows:

    - First, we calculated the Average, the Time-Based Weighted Average, and the User-Based Weighted Average.
    - After calculating these, we looked at the weighted average of these three values.
    - As observed, we can use the Bayesian Average Rating Score for scoring. When ratings are calculated solely with the Bayesian method, the existing scores may drop slightly, which makes using it on its own debatable.
    - Another method is to add the Bayesian value as a weighted component of the calculated scores.
    - Additionally, we noticed that we couldn't sort products using only Rating, Comment, or Purchase. We created a scoring system (WSS) from these three factors, which can solve many of our problems.
    - Taking it a step further, we calculated the Bayesian Average Rating Score, which is a new scoring system based on the distribution of ratings.
    - Now, we are going to combine the Bar Score with the Rating, Comment, and Purchase factors.
    - Although the Bar Score is statistically reliable, a hybrid calculation can account for factors it overlooks.

In [55]:
def hybrid_sorting_score(dataframe, bar_w=60, wss_w=40):
    
    bar_score = dataframe.apply(lambda x: bayesian_average_rating(x[["1_point",
                                                                     "2_point",
                                                                     "3_point",
                                                                     "4_point",
                                                                     "5_point"]]), axis=1)
    wss_score = weighted_sorting_score(dataframe)

    return bar_score*bar_w/100 + wss_score*wss_w/100

In [56]:
df["hybrid_sorting_score"] = hybrid_sorting_score(df)

In [57]:
df.sort_values("hybrid_sorting_score", ascending=False).head(10)

Unnamed: 0,course_name,instructor_name,purchase_count,rating,commment_count,5_point,4_point,3_point,2_point,1_point,purchase_count_scaled,comment_count_scaled,weighted_sorting_score,bar_score,hybrid_sorting_score
1,Python: Yapay Zeka ve Veri Bilimi için Python ...,Veri Bilimi Okulu,48291,4.6,4488,2962,1122,314,45,45,5.0,4.884699,4.795104,4.516038,4.627664
0,(50+ Saat) Python A-Z™: Veri Bilimi ve Machine...,Veri Bilimi Okulu,17380,4.8,4621,3466,924,185,46,6,2.438014,5.0,4.249884,4.665857,4.499468
20,Course_9,Instructor_3,12946,4.5,3371,2191,877,203,33,67,2.070512,3.916342,3.681563,4.480627,4.161001
10,İleri Düzey Excel|Dashboard|Excel İp Uçları,Veri Bilimi Okulu,9554,4.8,2266,1654,499,91,22,0,1.789374,2.958388,3.427921,4.641679,4.156176
14,Uçtan Uca SQL Server Eğitimi,Veri Bilimi Okulu,12893,4.7,2425,1722,510,145,24,24,2.06612,3.096229,3.501984,4.568162,4.141691
2,5 Saatte Veri Bilimci Olun (Valla Billa),Instructor_1,18693,4.4,2362,1582,567,165,24,24,2.546839,3.041612,3.483494,4.51521,4.102524
15,Uygulamalarla SQL Öğreniyorum,Veri Bilimi Okulu,11397,4.5,2353,1435,705,165,24,24,1.942127,3.03381,3.365772,4.454811,4.019195
4,(2020) Python ile Makine Öğrenmesi (Machine Le...,Veri Bilimi Okulu,11314,4.6,969,717,194,38,10,10,1.935248,1.833984,3.022039,4.595674,3.96622
19,Alıştırmalarla SQL Öğreniyorum,Veri Bilimi Okulu,3155,4.8,235,200,31,4,0,0,1.259008,1.197659,2.726593,4.729128,3.928114
5,Course_1,Instructor_2,4601,4.8,213,164,45,4,0,0,1.378857,1.178587,2.751651,4.634477,3.881346


<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
The position of **Course_9** now looks more plausible, and the new calculation gives a more realistic ranking overall. Course_1 reaching 10th place is also meaningful: we know that this course is new, so we treat it as a promising course for the future and give it the necessary importance. What enables this is the 60% weight given to the bar_score. This way, we also elevate potentially promising courses.
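
The influence of the 60% weight can be explored with a small sketch. The helper `hybrid` is a hypothetical stand-in for `hybrid_sorting_score` that works on two plain numbers; the scores are copied from the tables above, and the alternative weights (40 and 95) are arbitrary values chosen only for illustration.

```python
# Illustrative sketch: how the bar_score weight shifts two courses relative to each other.

def hybrid(bar_score, wss_score, bar_w=60, wss_w=40):
    """Blend the two scores with percentage weights."""
    return bar_score * bar_w / 100 + wss_score * wss_w / 100

# (bar_score, weighted_sorting_score) copied from the tables above
popular = (4.516038, 4.795104)   # index 1: many purchases and comments
newcomer = (4.634477, 2.751651)  # index 5 (Course_1): few comments, clean ratings

for bar_w in (40, 60, 95):
    p = hybrid(*popular, bar_w=bar_w, wss_w=100 - bar_w)
    c = hybrid(*newcomer, bar_w=bar_w, wss_w=100 - bar_w)
    print(bar_w, round(p, 3), round(c, 3))
```

With the default 60/40 split the popular course stays ahead, and the newcomer only overtakes it when almost all of the weight goes to the bar_score. The weight is therefore a business decision about how strongly new but well-rated courses should be promoted.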

In [58]:
df[df["course_name"].str.contains("Veri Bilimi")].sort_values("hybrid_sorting_score", ascending=False).head()

Unnamed: 0,course_name,instructor_name,purchase_count,rating,commment_count,5_point,4_point,3_point,2_point,1_point,purchase_count_scaled,comment_count_scaled,weighted_sorting_score,bar_score,hybrid_sorting_score
1,Python: Yapay Zeka ve Veri Bilimi için Python ...,Veri Bilimi Okulu,48291,4.6,4488,2962,1122,314,45,45,5.0,4.884699,4.795104,4.516038,4.627664
0,(50+ Saat) Python A-Z™: Veri Bilimi ve Machine...,Veri Bilimi Okulu,17380,4.8,4621,3466,924,185,46,6,2.438014,5.0,4.249884,4.665857,4.499468
3,R ile Veri Bilimi ve Machine Learning (35 Saat),Veri Bilimi Okulu,6626,4.6,1027,688,257,51,10,21,1.546694,1.884265,2.937105,4.482079,3.86409
7,Veri Bilimi için İstatistik: Python ile İstati...,Veri Bilimi Okulu,929,4.5,126,88,26,9,0,3,1.074512,1.103164,2.522386,4.342189,3.614267


<a id = "12"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨IMDB Movie Scoring and Sorting✨</p>

<a id = "13"></a><br>
<div style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#E5788F; font-size:150%; text-align:left; padding: 0px;">Importing Libraries & First Look</div>

In [59]:
import pandas as pd
import math
import scipy.stats as st
from sklearn.preprocessing import MinMaxScaler

# pd.set_option('display.max_columns', None)
# pd.set_option('display.expand_frame_repr', False)
# pd.set_option('display.float_format', lambda x: '%.5f' % x)

In [60]:
df = pd.read_csv("/kaggle/input/imdb-movies/movies_metadata.csv",
                 usecols=["title", "vote_average", "vote_count"])

In [61]:
df.head()

Unnamed: 0,title,vote_average,vote_count
0,Toy Story,7.7,5415.0
1,Jumanji,6.9,2413.0
2,Grumpier Old Men,6.5,92.0
3,Waiting to Exhale,6.1,34.0
4,Father of the Bride Part II,5.7,173.0


In [62]:
df.shape # 45466 movies.

(45466, 3)

<a id = "14"></a><br>
<div style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#E5788F; font-size:150%; text-align:left; padding: 0px;">Sorting by Vote Average</div>

In [63]:
df.sort_values("vote_average", ascending=False).head(10)

Unnamed: 0,title,vote_average,vote_count
21642,Ice Age Columbus: Who Were the First Americans?,10.0,1.0
15710,If God Is Willing and da Creek Don't Rise,10.0,1.0
22396,Meat the Truth,10.0,1.0
22395,Marvin Hamlisch: What He Did For Love,10.0,1.0
35343,Elaine Stritch: At Liberty,10.0,1.0
186,Reckless,10.0,1.0
45047,The Human Surge,10.0,1.0
22377,The Guide,10.0,1.0
22346,هیچ کجا هیچ کس,10.0,1.0
1634,Other Voices Other Rooms,10.0,1.0


In [64]:
df["vote_count"].describe([0.10, 0.25, 0.50, 0.70, 0.80, 0.90, 0.95, 0.99]).T

count    45460.000000
mean       109.897338
std        491.310374
min          0.000000
10%          1.000000
25%          3.000000
50%         10.000000
70%         25.000000
80%         50.000000
90%        160.000000
95%        434.000000
99%       2183.820000
max      14075.000000
Name: vote_count, dtype: float64

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
* We don't want movies with very few votes. To choose a filter, we look at the distribution in the `describe()` output.
    
* The median vote count is 10 and the mean is about 110, while the 95th percentile is 434 votes. Let's try a cutoff of 400 votes as the filter.

In [65]:
df[df["vote_count"] > 400].sort_values("vote_average", ascending=False).head(10)

Unnamed: 0,title,vote_average,vote_count
10309,Dilwale Dulhania Le Jayenge,9.1,661.0
40251,Your Name.,8.5,1030.0
834,The Godfather,8.5,6024.0
314,The Shawshank Redemption,8.5,8358.0
1152,One Flew Over the Cuckoo's Nest,8.3,3001.0
1176,Psycho,8.3,2405.0
1178,The Godfather: Part II,8.3,3418.0
292,Pulp Fiction,8.3,8670.0
1184,Once Upon a Time in America,8.3,1104.0
5481,Spirited Away,8.3,3968.0


<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>

Here, "**vote_average**" and "**vote_count**" do not tell us much on their own. Therefore, let's scale the "**vote_count**" values to a more interpretable range of 1-10 so they can be combined with the average.
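
For reference, min-max scaling to a target range can be written out by hand. The sketch below assumes the minimum (0) and maximum (14075) vote counts reported by `describe()` above; `min_max_scale` is a hypothetical helper that mirrors what `MinMaxScaler(feature_range=(1, 10))` computes for a single column.

```python
# Illustrative sketch: min-max scaling of vote counts to the 1-10 range.

def min_max_scale(x, x_min, x_max, low=1.0, high=10.0):
    """Map x from [x_min, x_max] linearly onto [low, high]."""
    return low + (x - x_min) / (x_max - x_min) * (high - low)

VOTE_MIN, VOTE_MAX = 0.0, 14075.0  # min and max of vote_count from describe()

toy_story = min_max_scale(5415.0, VOTE_MIN, VOTE_MAX)  # Toy Story's vote count
jumanji = min_max_scale(2413.0, VOTE_MIN, VOTE_MAX)    # Jumanji's vote count
print(round(toy_story, 4), round(jumanji, 4))
```

Because the vote counts are heavily skewed toward small values, most movies land very close to 1 on this scale, and only a handful of blockbusters approach 10.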

In [66]:
df["vote_count_score"] = MinMaxScaler(feature_range=(1, 10)). \
    fit(df[["vote_count"]]). \
    transform(df[["vote_count"]])

In [67]:
df.head()

Unnamed: 0,title,vote_average,vote_count,vote_count_score
0,Toy Story,7.7,5415.0,4.462522
1,Jumanji,6.9,2413.0,2.542948
2,Grumpier Old Men,6.5,92.0,1.058828
3,Waiting to Exhale,6.1,34.0,1.021741
4,Father of the Bride Part II,5.7,173.0,1.110622


In [68]:
df["average_count_score"] = df["vote_average"] * df["vote_count_score"]

In [69]:
df.sort_values("average_count_score", ascending=False).head(10)

Unnamed: 0,title,vote_average,vote_count,vote_count_score,average_count_score
15480,Inception,8.1,14075.0,10.0,81.0
12481,The Dark Knight,8.3,12269.0,8.845187,73.415048
22879,Interstellar,8.1,11187.0,8.153321,66.041904
17818,The Avengers,7.4,12000.0,8.673179,64.181528
14551,Avatar,7.2,12114.0,8.746075,62.971737
26564,Deadpool,7.4,11444.0,8.317655,61.55065
2843,Fight Club,8.3,9678.0,7.188419,59.663879
20051,Django Unchained,7.8,10297.0,7.584227,59.156973
23753,Guardians of the Galaxy,7.9,10014.0,7.403268,58.485819
292,Pulp Fiction,8.3,8670.0,6.543872,54.314139


<a id = "15"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨IMDB Weighted Rating✨</p>

<div style="border-radius:10px; border:#D0C2F0 solid; padding: 15px; background-color: #F8E8EE; font-size:100%; text-align:left">

<h3 align="left"><font color='#5E5273'>📄 Notes: </font></h3>

weighted_rating = (v/(v+M) * r) + (M/(v+M) * C)

* r : vote average, the rating of the movie
* v : vote count
* M : minimum votes required to be listed in the Top 250
* C : the mean vote across the whole report (currently 7.0)

When calculating the IMDB weighted rating, two factors come to the forefront:

* The "C" value, which is the overall average of all movies
* The minimum "M" value required to be listed in the rankings

Taking a general look at the formula:

* v: the number of votes for the movie
* (v+M): the movie's votes plus the minimum number of votes required
* The ratio v/(v+M) determines how much of the score comes from the movie's own rating; the remainder, M/(v+M), comes from the overall average "C."

<div style="border-radius:10px; border:#632626 solid; padding: 15px; background-color: #FDF6EC; font-size:100%; text-align:left">

<h3 align="left"><font color='#11324D'>💡 An Example: </font></h3>
    
* First Part: (v/(v+M) * r)
* Second Part: (M/(v+M) * C)

Let's consider two movies:

* Movie-1: A movie with 1,000 votes.

    * r = 8
    * M = 500
    * v = 1,000
    * C = 7

(1,000 / (1,000 + 500)) * 8 = 5.333

(500 / (1,000 + 500)) * 7 = 2.333

Total Score = 5.333 + 2.333 ≈ 7.67

* Movie-2: A movie with more votes than Movie-1.

    * r = 8
    * M = 500
    * v = 3,000
    * C = 7

(3,000 / (3,000 + 500)) * 8 = 6.857

(500 / (3,000 + 500)) * 7 = 1.000

Total Score = 6.857 + 1.000 ≈ 7.86

We can see how the adjustment changes as the number of votes increases. The more votes a movie has relative to the required minimum M, the more its own rating r dominates, so Movie-2 ends up closer to r = 8 than Movie-1. At the same time, every movie still receives part of its score from the overall average of the audience, C.
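
The two worked examples can be reproduced with a few lines of Python. This is a minimal sketch of the formula given in the notes; the third call, with zero votes, is an added edge case for illustration.

```python
# Illustrative sketch: the weighted rating formula applied to the example movies.

def weighted_rating(r, v, M, C):
    """Blend the movie's own rating r with the overall mean C, weighted by votes v."""
    return (v / (v + M)) * r + (M / (v + M)) * C

movie_1 = weighted_rating(r=8, v=1_000, M=500, C=7)
movie_2 = weighted_rating(r=8, v=3_000, M=500, C=7)
no_votes = weighted_rating(r=8, v=0, M=500, C=7)  # edge case: falls back to C

print(round(movie_1, 2), round(movie_2, 2), no_votes)
```

With no votes the score equals C exactly, and as v grows the score moves toward r without ever exceeding it.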

One of the conclusions that can be drawn from this is that a business can create its own scoring method and use it.

Let's continue with the coding part.

In [70]:
# Let's assign values to M and C;

M = 2500
C = df['vote_average'].mean()

In [71]:
def weighted_rating(r, v, M, C):
    return (v / (v + M) * r) + (M / (v + M) * C)

In [72]:
df.sort_values("average_count_score", ascending=False).head(15)

Unnamed: 0,title,vote_average,vote_count,vote_count_score,average_count_score
15480,Inception,8.1,14075.0,10.0,81.0
12481,The Dark Knight,8.3,12269.0,8.845187,73.415048
22879,Interstellar,8.1,11187.0,8.153321,66.041904
17818,The Avengers,7.4,12000.0,8.673179,64.181528
14551,Avatar,7.2,12114.0,8.746075,62.971737
26564,Deadpool,7.4,11444.0,8.317655,61.55065
2843,Fight Club,8.3,9678.0,7.188419,59.663879
20051,Django Unchained,7.8,10297.0,7.584227,59.156973
23753,Guardians of the Galaxy,7.9,10014.0,7.403268,58.485819
292,Pulp Fiction,8.3,8670.0,6.543872,54.314139


<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
Some positions in this ranking are questionable, for example that of Deadpool. Let's examine a few movies with the weighted rating.

In [73]:
weighted_rating(7.40000, 11444.00000, M, C) # Deadpool

7.080544896574546

In [74]:
weighted_rating(8.10000, 14075.00000, M, C) # Inception

7.725672279809078

In [75]:
weighted_rating(8.50000, 8358.00000, M, C) # The Shawshank Redemption

7.83648167598411

In [76]:
df["weighted_rating"] = weighted_rating(df["vote_average"],
                                        df["vote_count"], M, C)

In [77]:
df.sort_values("weighted_rating", ascending=False).head(10)

Unnamed: 0,title,vote_average,vote_count,vote_count_score,average_count_score,weighted_rating
12481,The Dark Knight,8.3,12269.0,8.845187,73.415048,7.846044
314,The Shawshank Redemption,8.5,8358.0,6.344369,53.92714,7.836482
2843,Fight Club,8.3,9678.0,7.188419,59.663879,7.74946
15480,Inception,8.1,14075.0,10.0,81.0,7.725672
292,Pulp Fiction,8.3,8670.0,6.543872,54.314139,7.699778
834,The Godfather,8.5,6024.0,4.851936,41.241456,7.6548
22879,Interstellar,8.1,11187.0,8.153321,66.041904,7.646688
351,Forrest Gump,8.2,8147.0,6.209449,50.917485,7.593775
7000,The Lord of the Rings: The Return of the King,8.1,8226.0,6.259964,50.705712,7.521547
4863,The Lord of the Rings: The Fellowship of the Ring,8.0,8892.0,6.685826,53.486607,7.47731


<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
* This way, we have obtained a more reliable ranking.

<a id = "16"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨BAR Score✨</p>

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>

* Up to this point, the top 5 films we obtained with the rating system IMDb used until 2015 were as follows:

     * 12481 - The Dark Knight
     * 314 - The Shawshank Redemption
     * 2843 - Fight Club
     * 15480 - Inception
     * 292 - Pulp Fiction

In [78]:
def bayesian_average_rating(n, confidence=0.95):
    
    if sum(n) == 0:
        
        return 0
    
    K = len(n)
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    N = sum(n)
    first_part = 0.0
    second_part = 0.0
    
    for k, n_k in enumerate(n):
        
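        # Use the loop value n_k rather than n[k], so that a pandas Series
        # with string labels is not indexed by integer position.
        first_part += (k + 1) * (n_k + 1) / (N + K)
        second_part += (k + 1) * (k + 1) * (n_k + 1) / (N + K)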
        
    score = first_part - z * math.sqrt((second_part - first_part * first_part) / (N + K + 1))
    
    return score

In [79]:
bayesian_average_rating([34733, 4355, 4704, 6561, 13515, 26183, 87368, 273082, 600260, 1295351])

# We entered the rating numbers from 1 to 10 stars for The Shawshank Redemption.

9.14538444560111

In [80]:
bayesian_average_rating([37128, 5879, 6268, 8419, 16603, 30016, 78538, 199430, 402518, 837905])

# We entered the rating numbers from 1 to 10 stars for The Godfather.

8.940007324860396

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
* With this Bayesian calculation, we are getting closer to the current IMDb ratings. The remaining difference between our result and the current IMDb ratings could be reduced by introducing a user quality factor.

* User Quality is a metric determined based on the user's comment count and their influence on the platform.

* In the "movies_metadata.csv" dataset, we have the total number of votes, but we don't have the distribution of votes. Let's continue with the "imdb_ratings.csv" dataset, which contains these distributions, and perform a general calculation to rank the movies.

In [81]:
df = pd.read_csv("/kaggle/input/imdb-rating/imdb_ratings.csv")

In [82]:
df = df.iloc[0:, 1:]

In [83]:
df.head()

Unnamed: 0,id,movieName,rating,ten,nine,eight,seven,six,five,four,three,two,one
0,111161,1. The Shawshank Redemption (1994),9.2,1295382,600284,273091,87368,26184,13515,6561,4704,4355,34733
1,68646,2. The Godfather (1972),9.1,837932,402527,199440,78541,30016,16603,8419,6268,5879,37128
2,71562,3. The Godfather: Part II (1974),9.0,486356,324905,175507,70847,26349,12657,6210,4347,3892,20469
3,468569,4. The Dark Knight (2008),9.0,1034863,649123,354610,137748,49483,23237,11429,8082,7173,30345
4,50083,5. 12 Angry Men (1957),8.9,246765,225437,133998,48341,15773,6278,2866,1723,1478,8318


In [84]:
df["bar_score"] = df.apply(lambda x: bayesian_average_rating(x[["one", "two", "three", "four", "five",
                                                                "six", "seven", "eight", "nine", "ten"]]), axis=1)

In [85]:
df.sort_values("bar_score", ascending=False).head(15)

Unnamed: 0,id,movieName,rating,ten,nine,eight,seven,six,five,four,three,two,one,bar_score
0,111161,1. The Shawshank Redemption (1994),9.2,1295382,600284,273091,87368,26184,13515,6561,4704,4355,34733,9.145389
1,68646,2. The Godfather (1972),9.1,837932,402527,199440,78541,30016,16603,8419,6268,5879,37128,8.940016
3,468569,4. The Dark Knight (2008),9.0,1034863,649123,354610,137748,49483,23237,11429,8082,7173,30345,8.895962
2,71562,3. The Godfather: Part II (1974),9.0,486356,324905,175507,70847,26349,12657,6210,4347,3892,20469,8.812499
4,50083,5. 12 Angry Men (1957),8.9,246765,225437,133998,48341,15773,6278,2866,1723,1478,8318,8.767934
6,167260,7. The Lord of the Rings: The Return of ...,8.9,703093,433087,270113,117411,44760,21818,10873,7987,6554,28990,8.752038
5,108052,6. Schindler's List (1993),8.9,453906,383584,220586,82367,27219,12922,6234,4572,4289,19328,8.743609
11,109830,12. Forrest Gump (1994),8.8,622104,553654,373644,151284,51140,22720,11692,7647,5941,12110,8.699152
12,1375666,13. Inception (2010),8.7,724798,627987,408686,174229,60668,26910,13436,8703,6932,17621,8.693148
10,137523,11. Fight Club (1999),8.8,637087,572654,371752,152295,53059,24755,12648,8606,6948,17435,8.674475


<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
* This way, we have brought our ranking into alignment with the current IMDb ranking. By incorporating several factors and applying weighted statistical calculations, we get closer to the most accurate results.
    
* Up to this point, we have focused on ranking products. The key takeaways about product ranking are:

    - Important factors from a business perspective should be taken into account.
    - If there are multiple factors, they should first be standardized to a common scale so that their effects can be considered together; if their importance differs, this should be expressed through weights.
    - While some statistical methods may seem reliable on their own according to the literature, they should be used in combination with domain knowledge rather than alone.

<a id="17"></a>
<h1 style="border-radius: 10px; border: 2px solid #6B8E23; background-color: #F5F5DC; font-family: 'Pacifico', cursive; font-size: 200%; text-align: center; border-radius: 15px 50px; padding: 15px; box-shadow: 5px 5px 5px #556B2F; color: #556B2F;">🙏 Sorting Reviews 🙏</h1>

The term "Sorting Reviews" generally refers to arranging customer reviews or evaluations in a specific order or by category. It makes a large volume of user reviews of a product, service, or other subject easier to navigate and interpret. Reviews can be grouped by criteria, ratings, or topics and presented in an organized way so that users can absorb the information more easily.

For example, when reviewing user comments on a product on an e-commerce site, users often have the option to sort reviews based on criteria such as "most helpful," "most recent," "highest rated," or "lowest rated."

Additionally, in this context, we are not concerned with whether the reviews have high or low scores. As a marketplace, we aim to deliver the most accurate social proof to users. Therefore, even a negative review can be highlighted if it reflects a consensus or is found useful by the community.

Factors such as a "User Quality Score" also matter here: they can decide the relative order of reviews that are otherwise equally useful.

<a id = "18"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨Up-Down Difference Score✨</p>

<div style="border-radius:10px; border:#D0C2F0 solid; padding: 15px; background-color: #F8E8EE; font-size:100%; text-align:left">

<h3 align="left"><font color='#5E5273'>📄 Notes: </font></h3>

* Review 1: 600 up 400 down total 1000
* Review 2: 5500 up 4500 down total 10000

In [86]:
def score_up_down_diff(up, down):
    
    return up - down

In [87]:
# Review 1 Score:

score_up_down_diff(600, 400)

200

In [88]:
# Review 2 Score:

score_up_down_diff(5500, 4500)

1000

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>

* By this score, Review 2 ranks higher. However, Review 1 has an up ratio of 60%, while Review 2 has only 55%, so it is questionable whether Review 2 deserves the higher position. The raw difference ignores the proportion of positive votes and therefore does not give a reliable ranking.

<a id = "19"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨Average Rating (Up Ratio)✨</p>

In [89]:
def score_average_rating(up, down):
    
    if up + down == 0:
        
        return 0
    
    return up / (up + down)

In [90]:
score_average_rating(600, 400)

0.6

In [91]:
score_average_rating(5500, 4500)

0.55

<div style="border-radius:10px; border:#D0C2F0 solid; padding: 15px; background-color: #F8E8EE; font-size:100%; text-align:left">

<h3 align="left"><font color='#5E5273'>📄 Notes: </font></h3>

Another scenario:

* Review 1: 2 up 0 down total 2
* Review 2: 100 up 1 down total 101

In [92]:
score_average_rating(2, 0)

1.0

In [93]:
score_average_rating(100, 1)

0.9900990099009901

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>

* In the second scenario, Review 1 appears to win. However, it does not make sense for a comment with only 2 upvotes to outrank one with 100 upvotes. This method falls short because the ratio discards frequency information, that is, how many votes the ratio is based on.

<a id = "20"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨Wilson Lower Bound Score✨</p>

In [94]:
def wilson_lower_bound(up, down, confidence=0.95): # The confidence level is typically set to 0.95.
    
    """
    Calculate the Wilson Lower Bound Score

    - The lower limit of the confidence interval to be calculated for the Bernoulli parameter p is considered as the WLB score.
    - The calculated score is used for product ranking.
    - Note:
    If the scores are in the range of 1-5, they are marked as 1-3 negative and 4-5 positive, making them suitable for Bernoulli distribution.
    However, this introduces some problems. Therefore, Bayesian average rating should be used.

    Parameters
    ----------
    up: int
        number of up votes
    down: int
        number of down votes
    confidence: float
        confidence level of the interval, e.g. 0.95

    Returns
    -------
    wilson score: float

    """
    n = up + down
    
    if n == 0:
        
        return 0
    
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    phat = 1.0 * up / n
    
    return (phat + z * z / (2 * n) - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)

<div style="border-radius:10px; border:#D0C2F0 solid; padding: 15px; background-color: #F8E8EE; font-size:100%; text-align:left">

<h3 align="left"><font color='#5E5273'>📄 Notes: </font></h3>

The `wilson_lower_bound` function calculates the Wilson Lower Bound Score for use in ranking items based on user reviews. The score is a lower limit of the confidence interval for the Bernoulli parameter "p," which represents the probability of success in a binary outcome (e.g., upvotes and downvotes on a product).

A breakdown of the function:

- It takes three parameters:
  - `up`: The number of "positive" outcomes (e.g., upvotes or positive reviews).
  - `down`: The number of "negative" outcomes (e.g., downvotes or negative reviews).
  - `confidence`: The desired confidence level for the lower bound score (default is 0.95 for 95% confidence).

- The function first calculates the total number of trials (reviews) as `n = up + down`.

- If there are no reviews (i.e., `n == 0`), the function returns a score of 0, indicating that there is no data to calculate a score.

- The function then calculates a critical value `z` based on the desired confidence level using the inverse cumulative distribution function (PPF) of the standard normal distribution.

- It computes the estimated probability of success `phat` as the ratio of upvotes to the total number of reviews.

- Finally, it calculates the Wilson Lower Bound Score using the Wilson Score interval formula, incorporating the critical value `z`, the estimated success probability `phat`, and the total number of reviews `n`.

The Wilson Lower Bound Score is a useful metric for ranking items based on user feedback while accounting for the uncertainty associated with a small number of reviews. Items with a higher Wilson Lower Bound Score are ranked higher, indicating a higher level of user confidence in the product. This method is particularly valuable when dealing with binary rating systems (e.g., thumbs up or thumbs down) and can help prevent items with a few positive ratings from being ranked higher than they should be.
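
Written out, the expression in the `return` statement is the lower limit of the Wilson score interval, where $\hat{p}$ corresponds to `phat`, $n$ to `up + down`, and $z$ to the critical value for the chosen confidence level:

$$
\text{WLB} = \frac{\hat{p} + \dfrac{z^2}{2n} - z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}
$$

As a quick check, when every vote is positive ($\hat{p} = 1$) the square-root term reduces to $z/(2n)$, the two middle terms cancel, and the bound simplifies to $1 / (1 + z^2/n)$. For 2 upvotes and 0 downvotes at 95% confidence ($z \approx 1.96$, $z^2 \approx 3.84$), this gives roughly $1 / (1 + 3.84/2) \approx 0.34$.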

<div style="border-radius:10px; border:#D0C2F0 solid; padding: 15px; background-color: #F8E8EE; font-size:100%; text-align:left">

<h3 align="left"><font color='#5E5273'>📄 Notes: </font></h3>

The Bernoulli parameter "p" is the probability parameter of a Bernoulli distribution. The Bernoulli distribution models random experiments with only two possible outcomes, typically referred to as "success" and "failure."

The parameter "p" represents the probability of success during any given trial. For example, in a coin toss, the probability of a "successful" outcome (e.g., getting heads) is expressed as "p." Like every probability, "p" always lies between 0 and 1.

The Bernoulli distribution is commonly used to model binary random experiments. For instance, it can be used to model whether an advertisement is clicked or not, whether a product is defective or not, or whether a student passes an exam. The "p" parameter represents the probability of success in such experiments.
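
To make this concrete for votes, here is a small simulation sketch. It assumes each vote on a review is an independent Bernoulli trial with a hypothetical "true" upvote probability `p_true`; the helper `simulate_votes` and the value 0.6 are chosen purely for illustration.

```python
import random

random.seed(42)  # fixed seed so repeated runs give the same draws
p_true = 0.6     # assumed "true" upvote probability (illustration only)

def simulate_votes(n, p):
    """Draw n independent Bernoulli(p) outcomes as a list of 0s and 1s."""
    return [1 if random.random() < p else 0 for _ in range(n)]

for n in (10, 100, 10_000):
    votes = simulate_votes(n, p_true)
    up = sum(votes)
    phat = up / n  # sample estimate of p from the observed votes
    print(f"n={n:>6}  up={up:>5}  phat={phat:.3f}")
```

With only a handful of votes the estimate can land far from `p_true`, while with many votes it settles close to it. This uncertainty at small `n` is what the Wilson lower bound is designed to account for.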

<div style="border-radius:10px; border:#D0C2F0 solid; padding: 15px; background-color: #F8E8EE; font-size:100%; text-align:left">

<h3 align="left"><font color='#5E5273'>📄 Notes: </font></h3>
    
The inverse cumulative distribution function of the standard normal distribution, often referred to as the Percent Point Function (PPF), is used to calculate values that correspond to a specific probability level (typically denoted as alpha). The standard normal distribution is a probability distribution with a mean of 0 and a standard deviation of 1.

The PPF returns the threshold for a given probability level: the value that the random variable stays at or below with that probability. In other words, it is the inverse of the cumulative distribution function (CDF) of the standard normal distribution. The CDF gives the probability of a value being less than or equal to a given point, while the PPF gives the point that corresponds to a given probability.

The PPF is commonly used in statistical analyses and hypothesis testing, mainly to obtain critical values. A test statistic is compared against the critical value that the PPF returns for the chosen significance level (alpha). The related p-value is the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one obtained; it is computed from the CDF rather than the PPF. If the p-value is below alpha, the test result is considered significant.

In summary, the PPF is employed to calculate critical values for a given probability level by inverting the probability values associated with the standard normal distribution. It is a widely used tool in statistical analyses, hypothesis testing, and the computation of confidence intervals.
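
The relationship between the PPF and the CDF can be checked in a few lines. This sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8+) as a stand-in for `st.norm`; its `inv_cdf` method plays the role of `st.norm.ppf`.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

confidence = 0.95
alpha = 1 - confidence

# Two-sided critical value: the point with probability 1 - alpha/2 to its left.
z = std_normal.inv_cdf(1 - alpha / 2)
print(round(z, 4))  # 1.96

# The CDF undoes the PPF: feeding z back in recovers 1 - alpha/2.
print(round(std_normal.cdf(z), 4))  # 0.975
```

This is the same `z` that `wilson_lower_bound` computes for its default confidence level of 0.95.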

**Scenario - 1**

In [95]:
wilson_lower_bound(600, 400)

0.5693094295142663

In [96]:
wilson_lower_bound(5500, 4500)

0.5402319557715324

**Scenario - 2**

In [97]:
wilson_lower_bound(2, 0)

0.3423802275066531

In [98]:
wilson_lower_bound(100, 1)

0.9460328420055449

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>

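* These results are more reliable. In Scenario 1, Review 1 (about 0.569) ranks above Review 2 (about 0.540), in line with their up ratios. In Scenario 2, Review 2 (about 0.946) ranks well above Review 1 (about 0.342), because two votes are too little evidence to support a high score.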
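
To see how the score reacts to the amount of evidence, the sketch below holds the up ratio at 60% and increases the number of votes. `wilson_lower_bound_sketch` is a standard-library restatement of the function above, written only so this example runs on its own; it uses `statistics.NormalDist` in place of `st.norm`.

```python
import math
from statistics import NormalDist

def wilson_lower_bound_sketch(up, down, confidence=0.95):
    """Stdlib-only restatement of wilson_lower_bound, for illustration."""
    n = up + down
    if n == 0:
        return 0.0
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    phat = up / n
    centre = phat + z * z / (2 * n)
    margin = z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
    return (centre - margin) / (1 + z * z / n)

# Same 60% up ratio, increasing number of votes.
for up, down in [(6, 4), (60, 40), (600, 400), (6000, 4000)]:
    score = wilson_lower_bound_sketch(up, down)
    print(f"{up:>5} up / {down:>5} down -> {score:.4f}")
```

The score rises toward 0.6 as votes accumulate but stays below it: the same ratio earns more trust when it is backed by more votes.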

<a id = "21"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨Case Study✨</p>

In [99]:
up = [15, 70, 14, 4, 2, 5, 8, 37, 21, 52, 28, 147, 61, 30, 23, 40, 37, 61, 54, 18, 12, 68]

down = [0, 2, 2, 2, 15, 2, 6, 5, 23, 8, 12, 2, 1, 1, 5, 1, 2, 6, 2, 0, 2, 2]

In [100]:
comments = pd.DataFrame({"up": up, "down": down})

In [101]:
comments.head()

# Each row corresponds to a comment, and we can see the up-down values received by that comment.

Unnamed: 0,up,down
0,15,0
1,70,2
2,14,2
3,4,2
4,2,15


In [102]:
# score_pos_neg_diff

comments["score_pos_neg_diff"] = comments.apply(lambda x: score_up_down_diff(x["up"],
                                                                             x["down"]), axis=1)

In [103]:
# score_average_rating

comments["score_average_rating"] = comments.apply(lambda x: score_average_rating(x["up"], x["down"]), axis=1)

In [104]:
# Wilson Lower Bound

comments["wilson_lower_bound"] = comments.apply(lambda x: wilson_lower_bound(x["up"], x["down"]), axis=1)

In [105]:
comments.sort_values("wilson_lower_bound", ascending=False).head(10)

Unnamed: 0,up,down,score_pos_neg_diff,score_average_rating,wilson_lower_bound
11,147,2,145,0.986577,0.952384
12,61,1,60,0.983871,0.914133
1,70,2,68,0.972222,0.904258
21,68,2,66,0.971429,0.901677
18,54,2,52,0.964286,0.878812
15,40,1,39,0.97561,0.874049
13,30,1,29,0.967742,0.838059
16,37,2,35,0.948718,0.831144
19,18,0,18,1.0,0.824121
17,61,6,55,0.910448,0.818072
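
To compare how the three scores order the same comments, here is a plain-Python sketch that lists the top five comment indices under each method. It reuses the `up` and `down` lists from the case study; the helper names (`diff_score`, `ratio_score`, `wlb`, `top_k`) are introduced here for illustration, and `wlb` is a standard-library restatement of `wilson_lower_bound`.

```python
import math
from statistics import NormalDist

up = [15, 70, 14, 4, 2, 5, 8, 37, 21, 52, 28, 147, 61, 30, 23, 40, 37, 61, 54, 18, 12, 68]
down = [0, 2, 2, 2, 15, 2, 6, 5, 23, 8, 12, 2, 1, 1, 5, 1, 2, 6, 2, 0, 2, 2]

def diff_score(u, d):
    return u - d

def ratio_score(u, d):
    return u / (u + d) if u + d else 0

def wlb(u, d, confidence=0.95):
    n = u + d
    if n == 0:
        return 0.0
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    phat = u / n
    margin = z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
    return (phat + z * z / (2 * n) - margin) / (1 + z * z / n)

def top_k(score, k=5):
    """Indices of the k highest-scoring comments (ties keep original order)."""
    order = sorted(range(len(up)), key=lambda i: score(up[i], down[i]), reverse=True)
    return order[:k]

print("difference:", top_k(diff_score))
print("up ratio:  ", top_k(ratio_score))
print("wilson:    ", top_k(wlb))
```

Under the up ratio, the two comments with no downvotes (15 up and 18 up) take the first two places, whereas the Wilson lower bound places comments with far more votes and only one or two downvotes ahead of them.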


<div style="border-radius:10px; border:#D0C2F0 solid; padding: 15px; background-color: #F8E8EE; font-size:100%; text-align:left">

<h3 align="left"><font color='#5E5273'>👻 Analysis Results: </font></h3>
    
In this notebook, we examined several methods for evaluating and sorting content, starting with product ratings and the limits of a plain average. We covered Time-Based Weighted Average, User-Based Weighted Average, Weighted Rating, and Bayesian Average Rating Score, each of which gives a more nuanced evaluation than the simple mean. We then turned to sorting products by rating, comment count, and purchase count, looked at hybrid sorting methods, and drew on the well-known IMDB Movie Scoring and Sorting example.

The latter part of the notebook covered practical implementation: importing libraries, examining the data, and working through case studies. We compared sorting by Vote Average, IMDB Weighted Rating, and Bayesian Average Rating Score (BAR Score), and then sorted reviews with the Up-Down Difference Score, Average Rating (Up Ratio), and Wilson Lower Bound Score.

In conclusion, by combining theoretical discussion with practical case studies, this notebook offers a set of tools for assessing and ranking content more reliably.