### Goal: Build an application that computes the most accurate rating for a product by evaluating the ratings it has received in several ways.

###### Scenario: (50 + saat) Python A-Z: Veri Bilimi ve Machine Learning, Score: 4.8 (4.764925)
###### Total Ratings: 4611
###### Rating Percentages: 75, 20, 4, 1, <1
###### Approximate Rating Counts: 3458, 922, 184, 46, 6
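
The approximate counts above follow from applying each percentage to the total of 4611. A minimal sketch of that arithmetic, assuming the percentages are listed from 5 stars down to 1 star (the source does not say so explicitly) and leaving out the "<1" share because no exact percentage is given; the variable names are illustrative:

```python
# Total number of ratings and the published percentage shares.
# Assumption: the shares are ordered from 5 stars down to 2 stars.
total_ratings = 4611
percentages = {5: 75, 4: 20, 3: 4, 2: 1}

# Convert each percentage into an approximate number of ratings.
approx_counts = {
    stars: round(total_ratings * pct / 100)
    for stars, pct in percentages.items()
}

print(approx_counts)
```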

In [40]:
import pandas as pd
import math
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st
from sklearn.preprocessing import MinMaxScaler
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 15) 
pd.set_option('display.width', 500)
pd.set_option('display.expand_frame_repr', False) 
pd.set_option('display.float_format', lambda x: '%.5f' % x)

In [41]:
import warnings 
warnings.filterwarnings('ignore')

In [42]:
df_ = pd.read_csv('/Users/yagizkarakaya/Desktop/DSMLBootcamp/Measurement_Problems/measurement_problems/datasets/course_reviews.csv')

In [43]:
df = df_.copy()

In [44]:
def missing_values_analysis(dataframe):
    na_columns_ = [col for col in dataframe.columns if dataframe[col].isnull().sum() > 0]
    n_miss = dataframe[na_columns_].isnull().sum().sort_values(ascending=True)
    ratio_ = (dataframe[na_columns_].isnull().sum() / dataframe.shape[0] * 100).sort_values(ascending=True)
    missing_df = pd.concat([n_miss, np.round(ratio_, 2)], axis=1, keys=['Total Missing Values', 'Ratio'])
    missing_df = pd.DataFrame(missing_df)
    return missing_df

In [45]:
def check_df(dataframe, head=5):
    print("--------------------- SHAPE ---------------------")
    print(dataframe.shape)

    print("---------------------- TYPES --------------------")
    print(dataframe.dtypes)

    print("--------------------- HEAD ---------------------")
    print(dataframe.head(head))

    print("--------------------- Missing Value Analysis ---------------------")
    print(missing_values_analysis(dataframe))

    print("--------------------- QUANTILES ---------------------")
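    # numeric_only=True restricts the quantiles to the numeric columns
    # (Timestamp and Enrolled are stored as strings at this point).
    print(dataframe.quantile([0, 0.05, 0.50, 0.95, 0.99, 1], numeric_only=True).T)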

In [46]:
check_df(df)

--------------------- SHAPE ---------------------
(4323, 6)
---------------------- TYPES --------------------
Rating                float64
Timestamp              object
Enrolled               object
Progress              float64
Questions Asked       float64
Questions Answered    float64
dtype: object
--------------------- HEAD ---------------------
   Rating            Timestamp             Enrolled  Progress  Questions Asked  Questions Answered
0 5.00000  2021-02-05 07:45:55  2021-01-25 15:12:08   5.00000          0.00000             0.00000
1 5.00000  2021-02-04 21:05:32  2021-02-04 20:43:40   1.00000          0.00000             0.00000
2 4.50000  2021-02-04 20:34:03  2019-07-04 23:23:27   1.00000          0.00000             0.00000
3 5.00000  2021-02-04 16:56:28  2021-02-04 14:41:29  10.00000          0.00000             0.00000
4 4.00000  2021-02-04 15:00:24  2020-10-13 03:10:07  10.00000          0.00000             0.00000
--------------------- Missing Value Analysis --------

###### Rating Distribution

In [47]:
df['Rating'].value_counts()

5.00000    3267
4.50000     475
4.00000     383
3.50000      96
3.00000      62
1.00000      15
2.00000      12
2.50000      11
1.50000       2
Name: Rating, dtype: int64

###### Questions Asked Distribution

In [48]:
df['Questions Asked'].value_counts()

###### Number of ratings and average rating by number of questions asked

In [49]:
df.groupby('Questions Asked').agg({'Questions Asked': 'count',
                                    'Rating' : 'mean'})

                 Questions Asked   Rating
Questions Asked                          
0.00000                     3867  4.76519
1.00000                      276  4.74094
2.00000                       80  4.80625
3.00000                       43  4.74419
4.00000                       15  4.83333
...                          ...      ...
11.00000                       2  5.00000
12.00000                       1  5.00000
14.00000                       2  4.50000
15.00000                       2  3.00000


###### Average Score


In [50]:
df['Rating'].mean()

4.764284061993986
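
As a cross-check, the same average can be recovered from the rating distribution printed earlier: multiply each rating by its count, sum, and divide by the number of reviews. A small sketch in plain Python (the dictionary simply restates the `value_counts()` output; the variable names are illustrative):

```python
# Rating distribution as printed by value_counts(): rating -> number of reviews.
rating_counts = {
    5.0: 3267, 4.5: 475, 4.0: 383, 3.5: 96, 3.0: 62,
    1.0: 15, 2.0: 12, 2.5: 11, 1.5: 2,
}

total_reviews = sum(rating_counts.values())

# Sum of (rating * count) divided by the number of reviews is the plain mean.
mean_rating = sum(r * c for r, c in rating_counts.items()) / total_reviews

print(total_reviews, round(mean_rating, 5))
```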

### Time-Based Weighted Average

In [51]:
df.head()

   Rating            Timestamp             Enrolled  Progress  Questions Asked  Questions Answered
0     5.0  2021-02-05 07:45:55  2021-01-25 15:12:08       5.0              0.0                 0.0
1     5.0  2021-02-04 21:05:32  2021-02-04 20:43:40       1.0              0.0                 0.0
2     4.5  2021-02-04 20:34:03  2019-07-04 23:23:27       1.0              0.0                 0.0
3     5.0  2021-02-04 16:56:28  2021-02-04 14:41:29      10.0              0.0                 0.0
4     4.0  2021-02-04 15:00:24  2020-10-13 03:10:07      10.0              0.0                 0.0


In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4323 entries, 0 to 4322
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rating              4323 non-null   float64
 1   Timestamp           4323 non-null   object 
 2   Enrolled            4323 non-null   object 
 3   Progress            4323 non-null   float64
 4   Questions Asked     4323 non-null   float64
 5   Questions Answered  4323 non-null   float64
dtypes: float64(4), object(2)
memory usage: 202.8+ KB


###### The Timestamp column is stored as object (string), so it must be converted to datetime before we can compute how old each rating is.

In [53]:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

In [54]:
current_date = pd.to_datetime('2021-02-10 00:00:00')

In [55]:
df['days'] = (current_date - df['Timestamp']).dt.days
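
The `days` column counts whole days between each rating and the reference date, with any partial day dropped. A minimal sketch of the same subtraction using the standard library's `datetime` instead of pandas, with the first two timestamps from `df.head()`:

```python
from datetime import datetime

# Reference date used in the notebook.
current_date = datetime(2021, 2, 10)

# First two Timestamp values shown by df.head().
timestamps = [
    datetime(2021, 2, 5, 7, 45, 55),
    datetime(2021, 2, 4, 21, 5, 32),
]

# timedelta.days keeps only the whole-day part of each difference.
days = [(current_date - ts).days for ts in timestamps]

print(days)  # [4, 5]
```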

###### Count the ratings submitted in the last 30 days.

In [56]:
df[df['days'] <= 30].count()

Rating                194
Timestamp             194
Enrolled              194
Progress              194
Questions Asked       194
Questions Answered    194
days                  194
dtype: int64

###### Average rating over the last 30 days.

In [57]:
df.loc[df['days'] <= 30, 'Rating'].mean()

4.775773195876289

###### Average rating for 31 to 90 days ago

In [58]:
df.loc[(df['days'] > 30) & (df['days'] <= 90), 'Rating'].mean()

4.763833992094861

###### Average rating for 91 to 180 days ago

In [59]:
df.loc[(df['days'] > 90) & (df['days'] <= 180), 'Rating'].mean()

4.752503576537912

###### Averages for ratings older than 180 days

In [60]:
df[df['days'] > 180].mean(numeric_only=True)

Rating                 4.76642
Progress              27.55848
Questions Asked        0.26436
Questions Answered     0.43536
days                 386.66074
dtype: float64

In [61]:
def time_based_weighted_average(dataframe, w1=28, w2=26, w3=24, w4=22):
    # Weights are percentages and should sum to 100; newer ratings weigh more.
    # Each comparison is parenthesized because & binds tighter than > and <=.
    return \
        dataframe.loc[dataframe['days'] <= 30, 'Rating'].mean() * w1 / 100 + \
        dataframe.loc[(dataframe['days'] > 30) & (dataframe['days'] <= 90), 'Rating'].mean() * w2 / 100 + \
        dataframe.loc[(dataframe['days'] > 90) & (dataframe['days'] <= 180), 'Rating'].mean() * w3 / 100 + \
        dataframe.loc[dataframe['days'] > 180, 'Rating'].mean() * w4 / 100

In [62]:
time_based_weighted_average(df)

###### With the four bucket means computed above (4.77577, 4.76383, 4.75250, 4.76642), the default weights give a time-weighted rating of about 4.7650.

In [63]:
time_based_weighted_average(df, 30, 26, 22, 22)

###### Shifting weight toward the most recent 30 days gives about 4.7655.
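
The time-based score is a weighted sum of the four recency-bucket means. A short worked sketch of that arithmetic in plain Python, using the bucket means printed in the cells above rounded to five decimals (the dictionary keys and variable names are illustrative):

```python
# Mean rating of each recency bucket, rounded from the outputs above.
bucket_means = {
    "0-30 days": 4.77577,
    "31-90 days": 4.76383,
    "91-180 days": 4.75250,
    "over 180 days": 4.76642,
}

# Percentage weights for the same buckets; they should sum to 100.
weights = {"0-30 days": 28, "31-90 days": 26, "91-180 days": 24, "over 180 days": 22}
assert sum(weights.values()) == 100

# Weighted sum of bucket means.
weighted_rating = sum(bucket_means[b] * weights[b] / 100 for b in bucket_means)

print(round(weighted_rating, 4))
```

Because the bucket means differ only in the second decimal place, changing the weights moves the final score very little.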

### User-Based Weighted Average

##### Goal: Weight each rating by how much of the course the user has completed, so that ratings from users with more progress count more.

In [64]:
df.head()

   Rating            Timestamp             Enrolled  Progress  Questions Asked  Questions Answered  days
0     5.0  2021-02-05 07:45:55  2021-01-25 15:12:08       5.0              0.0                 0.0     4
1     5.0  2021-02-04 21:05:32  2021-02-04 20:43:40       1.0              0.0                 0.0     5
2     4.5  2021-02-04 20:34:03  2019-07-04 23:23:27       1.0              0.0                 0.0     5
3     5.0  2021-02-04 16:56:28  2021-02-04 14:41:29      10.0              0.0                 0.0     5
4     4.0  2021-02-04 15:00:24  2020-10-13 03:10:07      10.0              0.0                 0.0     5


In [65]:
df.groupby('Progress').agg({'Rating': 'mean'})

           Rating
Progress         
0.00000   4.67391
1.00000   4.64269
2.00000   4.65476
3.00000   4.66355
4.00000   4.77733
...           ...
94.00000  5.00000
95.00000  4.79412
97.00000  5.00000
98.00000  5.00000


In [66]:
# Weight the mean rating of each Progress bucket; every range check uses
# Progress on both sides and the last term is reduced with .mean() as well.
df.loc[df['Progress'] <= 10, 'Rating'].mean() * 22 / 100 + \
    df.loc[(df['Progress'] > 10) & (df['Progress'] <= 45), 'Rating'].mean() * 24 / 100 + \
    df.loc[(df['Progress'] > 45) & (df['Progress'] <= 75), 'Rating'].mean() * 26 / 100 + \
    df.loc[df['Progress'] > 75, 'Rating'].mean() * 28 / 100
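
Range masks like the ones in this cell only work as intended when each comparison is wrapped in its own parentheses, because Python evaluates `&` before `>` and `<=`. A minimal illustration with a plain integer (the value is hypothetical; the same precedence rule applies to pandas Series):

```python
progress = 80  # hypothetical progress value for one user

# Without parentheses on the second comparison, & is applied first,
# so this parses as ((progress > 10) & progress) <= 45.
unparenthesized = (progress > 10) & progress <= 45

# Parenthesizing both comparisons expresses the intended range check.
parenthesized = (progress > 10) & (progress <= 45)

print(unparenthesized, parenthesized)  # True False
```

A value of 80 is outside the 10 to 45 range, yet the unparenthesized form reports `True`, which is how rows end up in the wrong bucket.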

###### Wrapping the calculation in a function

In [67]:
def user_based_weighted_average(dataframe, w1=22, w2=24, w3=26, w4=28):
    # Weights are percentages and should sum to 100; higher progress weighs more.
    # The w1..w4 parameters are used instead of hard-coded values.
    return \
        dataframe.loc[dataframe['Progress'] <= 10, 'Rating'].mean() * w1 / 100 + \
        dataframe.loc[(dataframe['Progress'] > 10) & (dataframe['Progress'] <= 45), 'Rating'].mean() * w2 / 100 + \
        dataframe.loc[(dataframe['Progress'] > 45) & (dataframe['Progress'] <= 75), 'Rating'].mean() * w3 / 100 + \
        dataframe.loc[dataframe['Progress'] > 75, 'Rating'].mean() * w4 / 100

In [68]:
user_based_weighted_average(df)
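
To make the progress-bucket logic easier to follow, here is a small sketch of the same idea in plain Python. The `(progress, rating)` pairs are hypothetical and not taken from the dataset; the bucket boundaries and weights mirror the defaults used in this section:

```python
from statistics import mean

# Hypothetical (progress, rating) pairs; not taken from the dataset.
reviews = [(5, 4.0), (10, 5.0), (30, 4.5), (60, 5.0), (90, 5.0), (100, 4.0)]

# Buckets as (lower bound exclusive, upper bound inclusive, weight in percent).
buckets = [(-1, 10, 22), (10, 45, 24), (45, 75, 26), (75, 100, 28)]

score = 0.0
for low, high, weight in buckets:
    # Ratings of users whose progress falls inside this bucket.
    bucket_ratings = [r for p, r in reviews if low < p <= high]
    score += mean(bucket_ratings) * weight / 100

print(round(score, 3))  # 4.63
```

Note that `mean` raises an error for an empty bucket, so real data with sparse progress ranges would need a guard for that case.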

### Weighted Rating

In [69]:
def course_weighted_rating(dataframe, time_w=40, user_w=60):
    return \
    time_based_weighted_average(dataframe) * time_w / 100 + \
    user_based_weighted_average(dataframe) * user_w / 100
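
The final rating blends the two scores with percentage weights, 40 for recency and 60 for progress by default. A minimal sketch of that blend, where `combine_ratings` is a hypothetical helper and the two input scores are made-up values rather than results from the dataset:

```python
def combine_ratings(time_score, user_score, time_w=40, user_w=60):
    """Blend two scores using percentage weights that should sum to 100."""
    return time_score * time_w / 100 + user_score * user_w / 100


# Hypothetical inputs, not results computed from the dataset.
blended = combine_ratings(4.76, 4.80)

print(round(blended, 3))  # 4.784
```

The blend only makes sense when both inputs are single numbers, which is why each weighted-average function must reduce every bucket with `.mean()`.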