# Mining Big Datasets - Assignment 1

## Business Analytics
### Athens University of Economics and Business 


##### In this assignment we implement a simple workflow that assesses the similarity between supermarket customers. For any input customer, the workflow computes and suggests a list of the 10 most similar other customers. We then use these results to predict the rating of a customer. To calculate the (dis)similarity between customers, we first compute the dissimilarity for every given attribute.

---
>Georgia Vlassi p2822001<br />
>Stylianos Vretteas p2822003<br />
>Dimitrios Mentakis p2822024<br />
---


Before we start implementing the questions, we import the Python libraries we need. This is the first step towards loading our dataset.


In [1]:
import pandas as pd
import numpy as np
import itertools
import matplotlib.pyplot as plt


* Next, we collect the files that make up our final dataset. Whenever a step of a question requires a file, we briefly explain why we chose it and what it contains.

* All files provided with the assignment must be downloaded and, regardless of their type, placed in the same folder as this notebook.

### 1) Import and pre-process the dataset with customers

The dataset loaded from `groceries.csv` contains demographic characteristics of 10000 supermarket customers, along with the list of groceries each of them bought.

In [2]:
# Read the dataset and print the first 10 rows
df = pd.read_csv('groceries.csv',sep =';')
df.head(10)

Unnamed: 0,Customer_ID,Age,Sex,Marital_Status,Education,Income,Customer_Rating,Persons_in_Household,Occupation,Groceries
0,1,75,male,married,primary,20000,very_good,3,retired,"citrus fruit,semi-finished bread,margarine,rea..."
1,2,61,female,single,secondary,28000,good,1,housemaid,"tropical fruit,yogurt,coffee"
2,3,32,male,single,secondary,34000,very_good,1,blue-collar,whole milk
3,4,62,male,married,primary,31000,very_good,3,blue-collar,"pip fruit,yogurt,cream cheese,meat spreads"
4,5,66,female,married,secondary,19000,good,3,retired,"other vegetables,whole milk,condensed milk,lon..."
5,6,55,female,single,secondary,35000,very_good,1,unemployed,"whole milk,butter,yogurt,rice,abrasive cleaner"
6,7,23,female,married,tertiary,21000,good,3,housemaid,rolls/buns
7,8,26,female,single,secondary,30000,good,2,blue-collar,"other vegetables,UHT-milk,rolls/buns,bottled b..."
8,9,29,female,married,secondary,32000,very_good,3,blue-collar,potted plants
9,10,57,female,married,secondary,26000,good,3,entrepreneur,"whole milk,cereals"


In [3]:
# Show the type of each column
df.dtypes

Customer_ID              int64
Age                     object
Sex                     object
Marital_Status          object
Education               object
Income                  object
Customer_Rating         object
Persons_in_Household     int64
Occupation              object
Groceries               object
dtype: object

* Observing the data, there are missing values in the `Age` and `Income` columns, stored as a single blank space.
In order to handle these values, we replace them with NaNs.


In [4]:
df.replace(" ", np.nan, inplace=True)

* As the results below show, the `Age` of the 16th customer (row index 15) is missing.

In [5]:
pd.isna(df['Age']).head(20)

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15     True
16    False
17    False
18    False
19    False
Name: Age, dtype: bool

* Thus, having replaced the missing values, we can count the NAs of each column, and show the total amount per column.

In [6]:
df.isna().sum()

Customer_ID               0
Age                     473
Sex                       0
Marital_Status            0
Education                 0
Income                  477
Customer_Rating           0
Persons_in_Household      0
Occupation                0
Groceries                 0
dtype: int64

Moreover, according to the instructions of the question, we have to replace the missing values with the average value of 
the attribute in the dataset (keeping the integer part of the average).
However, to calculate the mean of a numeric attribute we first have to convert its type to float.

In [7]:
# Convert Age and Income columns to float and show their types.
df = df.astype({'Age': float})
df = df.astype({'Income': float})
df.dtypes

Customer_ID               int64
Age                     float64
Sex                      object
Marital_Status           object
Education                object
Income                  float64
Customer_Rating          object
Persons_in_Household      int64
Occupation               object
Groceries                object
dtype: object

In [8]:
# Average Age as an integer (round() rounds to the nearest integer; int() would keep only the integer part)
avg_age = round(df['Age'].mean())
avg_age

53

In [9]:
# Average Income as an integer (round() rounds to the nearest integer; int() would keep only the integer part)
avg_income = round(df['Income'].mean())
avg_income

30037

In [10]:
# Fill the null values of Age with the average
df['Age'] = df['Age'].fillna(avg_age)

In [11]:
# Fill the null values of Income with the average
df['Income'] = df['Income'].fillna(avg_income)

Show the first 20 lines to confirm the replacements at the `Age` of the 16th customer and the `Income` of the 17th customer.

In [12]:
df.head(20)

Unnamed: 0,Customer_ID,Age,Sex,Marital_Status,Education,Income,Customer_Rating,Persons_in_Household,Occupation,Groceries
0,1,75.0,male,married,primary,20000.0,very_good,3,retired,"citrus fruit,semi-finished bread,margarine,rea..."
1,2,61.0,female,single,secondary,28000.0,good,1,housemaid,"tropical fruit,yogurt,coffee"
2,3,32.0,male,single,secondary,34000.0,very_good,1,blue-collar,whole milk
3,4,62.0,male,married,primary,31000.0,very_good,3,blue-collar,"pip fruit,yogurt,cream cheese,meat spreads"
4,5,66.0,female,married,secondary,19000.0,good,3,retired,"other vegetables,whole milk,condensed milk,lon..."
5,6,55.0,female,single,secondary,35000.0,very_good,1,unemployed,"whole milk,butter,yogurt,rice,abrasive cleaner"
6,7,23.0,female,married,tertiary,21000.0,good,3,housemaid,rolls/buns
7,8,26.0,female,single,secondary,30000.0,good,2,blue-collar,"other vegetables,UHT-milk,rolls/buns,bottled b..."
8,9,29.0,female,married,secondary,32000.0,very_good,3,blue-collar,potted plants
9,10,57.0,female,married,secondary,26000.0,good,3,entrepreneur,"whole milk,cereals"


In [13]:
#Verify that we do not have any NAs at the columns 
df.isna().sum()

Customer_ID             0
Age                     0
Sex                     0
Marital_Status          0
Education               0
Income                  0
Customer_Rating         0
Persons_in_Household    0
Occupation              0
Groceries               0
dtype: int64
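
The pre-processing steps of this section can be condensed into a short sketch. The toy frame and its values below are hypothetical and only illustrate the pattern; note that `int()` keeps the integer part of the mean, whereas rounding could move it up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a single blank space marks a missing entry
toy = pd.DataFrame({"Age": ["75", " ", "32"],
                    "Income": ["20000", "28000", " "]})

# Blank entries become NaN, then the columns can be converted to float
toy = toy.replace(" ", np.nan).astype(float)

for col in ["Age", "Income"]:
    # int() truncates, i.e. it keeps the integer part of the mean
    # (the mean Age here is 53.5, which becomes 53)
    fill_value = int(toy[col].mean())
    toy[col] = toy[col].fillna(fill_value)

print(toy)
```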

### 2) Compute data (dis-)similarity

Initially, we have to distinguish the type of each attribute:
* `Age` is a numeric variable
* `Sex` is a nominal variable
* `Marital_Status` is a nominal variable
* `Education` will be transformed to an ordinal variable
* `Income` is a numeric variable
* `Customer_Rating` will be transformed to an ordinal variable
* `Persons_in_Household` is a numeric variable
* `Occupation` is a nominal variable
* `Groceries` is a set

To turn the ordinal variables (`Customer_Rating` and `Education`) into numeric ranks, we create one dictionary per variable that maps each label to its rank.

In [14]:
# Create dictionaries  
rating_rank = {'poor': 1,'fair': 2, 'good':3, 'very_good':4, 'excellent':5} 
education_rank = {'primary': 1,'secondary': 2, 'tertiary':3}

In [15]:
# Use them in order to add new columns
df['Customer_Rating_rank'] = [rating_rank[i] for i in df.Customer_Rating] 
df['Education_rank'] = [education_rank[i] for i in df.Education]
df.head(10)

Unnamed: 0,Customer_ID,Age,Sex,Marital_Status,Education,Income,Customer_Rating,Persons_in_Household,Occupation,Groceries,Customer_Rating_rank,Education_rank
0,1,75.0,male,married,primary,20000.0,very_good,3,retired,"citrus fruit,semi-finished bread,margarine,rea...",4,1
1,2,61.0,female,single,secondary,28000.0,good,1,housemaid,"tropical fruit,yogurt,coffee",3,2
2,3,32.0,male,single,secondary,34000.0,very_good,1,blue-collar,whole milk,4,2
3,4,62.0,male,married,primary,31000.0,very_good,3,blue-collar,"pip fruit,yogurt,cream cheese,meat spreads",4,1
4,5,66.0,female,married,secondary,19000.0,good,3,retired,"other vegetables,whole milk,condensed milk,lon...",3,2
5,6,55.0,female,single,secondary,35000.0,very_good,1,unemployed,"whole milk,butter,yogurt,rice,abrasive cleaner",4,2
6,7,23.0,female,married,tertiary,21000.0,good,3,housemaid,rolls/buns,3,3
7,8,26.0,female,single,secondary,30000.0,good,2,blue-collar,"other vegetables,UHT-milk,rolls/buns,bottled b...",3,2
8,9,29.0,female,married,secondary,32000.0,very_good,3,blue-collar,potted plants,4,2
9,10,57.0,female,married,secondary,26000.0,good,3,entrepreneur,"whole milk,cereals",3,2
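
The label-to-rank encoding can also be written with `Series.map`, which looks every label up in the dictionary in one call. This is a minimal sketch on a hypothetical toy frame; unlike the list comprehension, `map` yields NaN for a label that is missing from the dictionary instead of raising a `KeyError`.

```python
import pandas as pd

education_rank = {"primary": 1, "secondary": 2, "tertiary": 3}

# Hypothetical toy data
toy = pd.DataFrame({"Education": ["primary", "tertiary", "secondary"]})

# Each label is replaced by its rank from the dictionary
toy["Education_rank"] = toy["Education"].map(education_rank)
print(toy)
```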


We define functions to compute dissimilarities for nominal, numeric and ordinal data, following the procedure taught in the lecture `Working with data`.

* Function to calculate dissimilarity for a nominal variable.

In [16]:
def nominal_diss(data, column, index):
    diss_list = list()
    row = data[column][index]
    for i in data[column]:
        if row == i:
            diss_list.append(0)
        else:
            diss_list.append(1)
                                
    return np.asarray(diss_list)

The dissimilarity of the 3rd customer with himself is 0.

In [17]:
#Example of dissimilarity at the Sex column of the 3rd customer
nominal_diss(df,'Sex',2)

array([0, 1, 0, ..., 0, 1, 0])
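
The same 0/1 rule can be expressed without an explicit loop. The following is a sketch under the assumption that the column holds plain labels; the name `nominal_diss_vec` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def nominal_diss_vec(data, column, index):
    """Return 0 where a row matches the value of row `index`, 1 otherwise."""
    values = data[column].to_numpy()
    # Element-wise comparison gives booleans; casting turns them into 0/1
    return (values != data[column][index]).astype(int)

# Hypothetical toy data
toy = pd.DataFrame({"Sex": ["male", "female", "male"]})
print(nominal_diss_vec(toy, "Sex", 2))
```

On 10000 rows the comparison runs in a single NumPy operation, which matters once the function is called for many customers.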

* Function to calculate dissimilarity for a numerical variable

In [18]:
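def numeric_diss(data, column, index):
    diss_list = list()
    row = data[column][index]
    # Range of the column, used to scale every difference into [0, 1];
    # named col_max / col_min to avoid shadowing the built-ins max and min
    col_max = data[column].max()
    col_min = data[column].min()
    
    for i in data[column]:
        diff = abs(row - i)
        diss_list.append(diff / (col_max - col_min))

    return np.asarray(diss_list)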

The dissimilarity of the 3rd customer with himself is 0.

In [19]:
#Example of dissimilarity at the Age column of the 3rd customer
numeric_diss(df,'Age',2)

array([0.66153846, 0.44615385, 0.        , ..., 0.09230769, 0.18461538,
       0.41538462])
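
The range-normalised difference can likewise be computed in one vectorised step. This is a sketch, not the function used for the results in this notebook; the name `numeric_diss_vec`, the toy values and the guard for a constant column are assumptions.

```python
import numpy as np
import pandas as pd

def numeric_diss_vec(data, column, index):
    """Absolute difference to row `index`, scaled by the column range."""
    values = data[column].to_numpy(dtype=float)
    value_range = values.max() - values.min()
    if value_range == 0:
        # Assumption: a constant column carries no dissimilarity at all
        return np.zeros(len(values))
    return np.abs(values - data[column][index]) / value_range

# Hypothetical toy data
toy = pd.DataFrame({"Age": [75, 61, 32]})
print(numeric_diss_vec(toy, "Age", 2))
```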

* Function to calculate dissimilarity for an ordinal variable.

In [20]:
def ordinal_diss(data, column, index):
    diss_list = list()
    row = data[column][index]
    
    for i in data[column]:
        a = row
        b = i 
        diff = abs(a - b)  
        if diff == 0:
            diss_list.append(0)
        elif diff == 1:
            diss_list.append(0.5)
        else:
            diss_list.append(1)
                                      
    return np.asarray(diss_list)

The dissimilarity of the 3rd customer with himself is 0.

In [21]:
# Example of dissimilarity at the Customer_Rating_rank column of the 3rd customer
ordinal_diss(df,'Customer_Rating_rank',2)

array([0. , 0.5, 0. , ..., 1. , 1. , 0.5])
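
`ordinal_diss` maps a rank difference of 0, 1 and 2 or more to 0, 0.5 and 1. A common alternative divides the rank difference by the number of levels minus one. For the three-level `Education_rank` both rules give the same values, while for the five-level `Customer_Rating_rank` they differ. The sketch below shows that alternative; the name `ordinal_diss_normalized`, the parameter `n_levels` and the toy ranks are hypothetical.

```python
import numpy as np
import pandas as pd

def ordinal_diss_normalized(data, column, index, n_levels):
    """Rank difference to row `index`, divided by (n_levels - 1)."""
    ranks = data[column].to_numpy(dtype=float)
    return np.abs(ranks - data[column][index]) / (n_levels - 1)

# Hypothetical toy ranks on a 1 to 5 scale
toy = pd.DataFrame({"Customer_Rating_rank": [4, 3, 4, 1, 5]})
print(ordinal_diss_normalized(toy, "Customer_Rating_rank", 2, n_levels=5))
```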

In [22]:
# create dictionary for each basket

#Basket_dict = {}

#for i in df.index:
#    Basket_dict.update({i : df['Groceries'][i].split(',')})

#Basket_dict

* Function to calculate dissimilarity for sets using `Jaccard Distance`.

In [23]:
def sets_diss(data, column, index):
    diss_list = list()

    row = data[column][index]
    for i in data[column]:

        # Set similarity: Jaccard Index
        intersection = len(list(set(row).intersection(i)))
        union = (len(row) + len(i)) - intersection
        # we subtract from 1 in order to get the DISSIMILARITY score 
        diss_list.append (1 - ((intersection) / union))

    return np.asarray(diss_list)

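The dissimilarity of the 3rd customer with himself is not 0 here (0.18181818 at position 2 of the output below). The reason is that `Groceries` still holds the raw comma-separated string, so `set(row)` is a set of characters, while `len(row)` and `len(i)` count characters, repetitions included. A Jaccard distance computed on sets of grocery items would be exactly 0 for identical baskets.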

In [24]:
sets_diss(df,'Groceries',2)

array([0.89655172, 0.84848485, 0.18181818, ..., 0.92307692, 0.93069307,
       0.87719298])
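
Because `Groceries` is stored as one comma-separated string per customer, a Jaccard distance over grocery items needs each string split into a set first. The following is a minimal sketch of that variant, not the function used for the results in this notebook; the name `sets_diss_items` and the toy baskets are hypothetical.

```python
import numpy as np
import pandas as pd

def sets_diss_items(data, column, index):
    """Jaccard distance between the basket of row `index` and every basket."""
    reference = set(data[column][index].split(","))
    distances = []
    for basket in data[column]:
        items = set(basket.split(","))
        # Jaccard index = |A intersect B| / |A union B|; distance = 1 - index
        distances.append(1 - len(reference & items) / len(reference | items))
    return np.asarray(distances)

# Hypothetical toy baskets
toy = pd.DataFrame({"Groceries": ["whole milk", "whole milk,yogurt", "coffee"]})
print(sets_diss_items(toy, "Groceries", 0))
```

With item sets, a basket compared with itself has distance 0 and two baskets without a common item have distance 1.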

We define a function that calculates the dissimilarity between one customer and every customer in the dataset. It takes as input the dataframe and the row index of the customer, and returns an array with the average dissimilarity score (over the nine attributes) between this customer and each customer of the groceries dataset.

In [25]:
def calc_diss(data, index):

    Age = numeric_diss(data, 'Age', index)
    Sex = nominal_diss(data, 'Sex', index)
    Marital_Status = nominal_diss(data, 'Marital_Status', index)
    Education = ordinal_diss(data,"Education_rank",index)
    Income = numeric_diss(data, 'Income', index)
    Customer_Rating = ordinal_diss(data,"Customer_Rating_rank", index)
    Persons_in_Household = numeric_diss(data, 'Persons_in_Household', index)
    Occupation = nominal_diss(data, 'Occupation', index)
    Groceries = sets_diss(data, "Groceries", index )
    
    # average score    
    return (Age + Sex + Marital_Status + Education + Income +
            Customer_Rating + Persons_in_Household + Occupation + Groceries ) / 9

In [26]:
# Calculate the dissimilarity of the customer at row index 1 (Customer_ID 2) with the others
calc_diss(df,1)

array([0.61432281, 0.0704607 , 0.43157677, ..., 0.6839818 , 0.42233669,
       0.51437383])

In [27]:
# The array contains one dissimilarity score per customer of the whole dataset
len(calc_diss(df,1))

10000

### 3) Nearest Neighbor (NN) search

For this question, we start by creating a list that includes the requested customer ids.

In [28]:
my_list = [73,563,1603,2200,3703, 4263, 5300, 6129, 7800, 8555]    

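We assign the initial dataframe to a new variable before proceeding. For each customer in the list above, we calculate the 10 Nearest Neighbors according to their similarity: the dissimilarity scores are sorted in ascending order, the first row (the customer itself) is dropped, and the next 10 rows are printed as similarity scores (1 - dissimilarity). The labels in the output are zero-based row indices, so the corresponding `Customer_ID` is the label plus 1.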

In [29]:
# for loop - print the 10 - most Similar neighbors
df_diss = pd.DataFrame(df)

for cid in my_list:
    print("10 Nearest Neighbors for the customer ",cid)
    idx = df_diss.loc[df_diss['Customer_ID']== cid].index[0]
    c = calc_diss(df_diss, idx)
    c = pd.DataFrame(c)
    
    result = c.sort_values(by = 0, ascending = True)
    # Drop the first row, which indicates the customer itself
    result = result.iloc[1: , :]
    # Keep only the first 10
    result = result.head(10)
    # Similarity 
    print(1-result)

10 Nearest Neighbors for the customer  73
             0
1290  0.880924
1845  0.874125
1202  0.860964
5880  0.859923
6903  0.852086
3622  0.850309
8880  0.845020
1626  0.842249
5921  0.841838
7932  0.840532
10 Nearest Neighbors for the customer  563
             0
3633  0.910631
6167  0.893306
6928  0.889311
6195  0.889114
8269  0.884538
558   0.882292
2765  0.882246
7329  0.881686
2458  0.878888
418   0.875410
10 Nearest Neighbors for the customer  1603
             0
7344  0.900282
167   0.867797
108   0.865577
4627  0.859281
567   0.855929
4813  0.847031
8958  0.836667
7334  0.835508
6840  0.828002
9259  0.821177
10 Nearest Neighbors for the customer  2200
             0
7496  0.845853
5159  0.834709
402   0.827662
4927  0.826043
3550  0.796264
9241  0.794775
4146  0.793590
5329  0.790498
8883  0.788847
3418  0.787333
10 Nearest Neighbors for the customer  3703
             0
9941  0.886708
4837  0.880967
3351  0.877319
1603  0.874609
7783  0.872623
7193  0.871887
5852  0.866546
183
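
Dropping the first row after sorting assumes that the queried customer always sorts first, which may not hold when another customer has an equally small score. A sketch that removes the customer by position instead is given below; the helper name `top_k_neighbours` and the toy scores are hypothetical.

```python
import numpy as np
import pandas as pd

def top_k_neighbours(dissimilarities, self_position, k=10):
    """Similarity (1 - dissimilarity) of the k closest rows, excluding self."""
    # Stable sort keeps the original order among tied scores
    order = np.argsort(dissimilarities, kind="stable")
    # Remove the queried customer explicitly rather than by rank
    order = order[order != self_position]
    top = order[:k]
    return pd.Series(1 - dissimilarities[top], index=top)

# Hypothetical toy scores: positions 1 and 2 are tied at 0.0
scores = np.array([0.3, 0.0, 0.0, 0.5, 0.1])
print(top_k_neighbours(scores, self_position=2, k=3))
```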

### 4) Customer rating prediction

#### 1. Calculate the similarities between the given customer and all other customers and compute his 10-nn (most similar) customers.  To compute the similarity calculations for this step we should exclude the customer rating attribute.

In [30]:
# Calculate the dissimilarity - exclude Customer_Rating

def calc_diss_updated(data, index):

    Age = numeric_diss(data, 'Age', index)
    Sex = nominal_diss(data, 'Sex', index)
    Marital_Status = nominal_diss(data, 'Marital_Status', index)
    Education = ordinal_diss(data,"Education_rank",index)
    Income = numeric_diss(data, 'Income', index)
    #Customer_Rating = ordinal_diss(data,"Customer_Rating_rank", index) - excluded as per assignment instructions
    Persons_in_Household = numeric_diss(data, 'Persons_in_Household', index)
    Occupation = nominal_diss(data, 'Occupation', index)
    Groceries = sets_diss(data, "Groceries", index )
    
    # average score - exclude Customer_Rating and divide by 8
    return (Age + Sex + Marital_Status + Education + Income +
            Persons_in_Household + Occupation + Groceries ) / 8

We can calculate the updated similarities between a given customer and all other customers and show the similarity score of the 10 Nearest Neighbors.

In [31]:
cid2 = int(input("Enter Customer_id: "))

df_new = pd.DataFrame(df)
print("10 Nearest Neighbors for the customer ",cid2)
idx2 = df_new.loc[df_new['Customer_ID']== cid2].index[0]
c2 = calc_diss_updated(df_new, idx2)
c2 = pd.DataFrame(c2)

result2 = c2.sort_values(by = 0, ascending = True)
# Drop the first row, which indicates the customer itself
result2 = result2.iloc[1: , :]
# Keep only the first 10
result2 = result2.head(10)
# Similarity 
print(1-result2)

Enter Customer_id: 268
10 Nearest Neighbors for the customer  268
             0
7809  0.875019
6180  0.869407
6848  0.868651
8840  0.864550
3610  0.864203
7194  0.862178
8536  0.860917
5198  0.858305
4655  0.857987
5682  0.854507


#### 2. Based only on the 10 most similar customers computed in the previous step, predict the customer rating

Initially, we create a new dataframe based on the list of the 10 most similar customers computed above.

In [32]:
top10_df_list = list((1-result2).index)

top10_df = df.loc[top10_df_list]

top10_df

Unnamed: 0,Customer_ID,Age,Sex,Marital_Status,Education,Income,Customer_Rating,Persons_in_Household,Occupation,Groceries,Customer_Rating_rank,Education_rank
7809,7810,23.0,male,single,secondary,31000.0,good,2,blue-collar,"other vegetables,soft cheese,soda,bottled beer",3,2
6180,6181,39.0,male,single,secondary,31000.0,good,2,blue-collar,"soda,misc. beverages,candy",3,2
6848,6849,32.0,male,single,secondary,34000.0,excellent,2,blue-collar,oil,5,2
8840,8841,37.0,male,single,secondary,38000.0,good,2,blue-collar,"bottled water,napkins",3,2
3610,3611,37.0,male,single,secondary,25000.0,excellent,2,blue-collar,"beef,pip fruit,root vegetables,whole milk,yogu...",5,2
7194,7195,34.0,male,single,secondary,24000.0,good,2,blue-collar,chocolate marshmallow,3,2
8536,8537,43.0,male,single,secondary,31000.0,excellent,2,blue-collar,"whole milk,frozen meals,ice cream",5,2
5198,5199,26.0,male,single,secondary,30037.0,excellent,1,blue-collar,"whole milk,yogurt,spread cheese,bottled water,...",5,2
4655,4656,27.0,male,single,secondary,30037.0,good,1,blue-collar,"whole milk,yogurt",3,2
5682,5683,26.0,male,single,secondary,29000.0,good,1,blue-collar,"frankfurter,finished products,chicken,other ve...",3,2


From the above dataframe we keep only the `Customer_Rating_rank` column in order to compute the average and the weighted average. Additionally, the similarity scores are assigned to a new column named `Similarity_Score`.

In [33]:
# take the Customer_Rating_rank for the top 10 similar
top10_df = pd.DataFrame(df.loc[top10_df_list]["Customer_Rating_rank"])
top10_df["Similarity_Score"] = 1 - result2 
top10_df

# average rating of the 10 most similar customers
avg = round(top10_df["Customer_Rating_rank"].mean())

# weighted average rating of the 10 most similar customers
w_avg = round(sum(top10_df["Customer_Rating_rank"] * top10_df["Similarity_Score"]) / sum(top10_df["Similarity_Score"]))
top10_df

Unnamed: 0,Customer_Rating_rank,Similarity_Score
7809,3,0.875019
6180,3,0.869407
6848,5,0.868651
8840,3,0.86455
3610,5,0.864203
7194,3,0.862178
8536,5,0.860917
5198,5,0.858305
4655,3,0.857987
5682,3,0.854507


In [34]:
# Predicted average rating for given customer
avg

4

In [35]:
# Predicted weighted average rating for given customer
w_avg

4
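
Both predictions can be reproduced from the ranks and similarity scores listed in the table above. This is a minimal sketch that uses `numpy.average` for the weighted mean; the variable names are hypothetical.

```python
import numpy as np

# Ranks and similarity scores of the 10 nearest neighbours of customer 268
ranks = np.array([3, 3, 5, 3, 5, 3, 5, 5, 3, 3])
sims = np.array([0.875019, 0.869407, 0.868651, 0.864550, 0.864203,
                 0.862178, 0.860917, 0.858305, 0.857987, 0.854507])

# Plain average: every neighbour counts equally
avg_raw = ranks.mean()
# Weighted average: sum(rank * similarity) / sum(similarity)
w_avg_raw = np.average(ranks, weights=sims)

print(avg_raw, int(round(avg_raw)))
print(w_avg_raw, int(round(w_avg_raw)))
```

Because the ten similarity scores are nearly equal, the weights barely change the result, and both values round to the same rank.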

* To print the label of the predicted customer rating, we list the keys and the values of the rating dictionary created in section 2. With these two lists we can look up the position of a predicted rank among the values and read the label at the same position among the keys, for both the predicted average and the predicted weighted average.

In [36]:
# take the key from the dictionary
# list out keys and values separately
key_list = list(rating_rank.keys())
val_list = list(rating_rank.values())
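
An equivalent lookup can be done with an inverted dictionary, which maps each rank straight back to its label. This is a sketch; the name `rank_to_label` is hypothetical, and the approach assumes that every rank appears only once in `rating_rank`.

```python
rating_rank = {'poor': 1, 'fair': 2, 'good': 3, 'very_good': 4, 'excellent': 5}

# Swap keys and values: rank -> label
rank_to_label = {rank: label for label, rank in rating_rank.items()}

print(rank_to_label[4])
```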

* We combine the previous steps including the updated dissimilarity function with the calculations of the average and the weighted average rankings of a given customer.

In [37]:
cid2 = int(input("Enter Customer_id: "))

print("10 Nearest Neighbors for the customer ",cid2)
print(" ")
idx2 = df_new.loc[df_new['Customer_ID']== cid2].index[0]
c2 = calc_diss_updated(df_new, idx2)
c2 = pd.DataFrame(c2)

result2 = c2.sort_values(by = 0, ascending = True)
# Drop the first row, which indicates the customer itself
result2 = result2.iloc[1: , :]
# Keep only the first 10
result2 = result2.head(10)
# Similarity 
print(1-result2)

# create top10_df
# take the top 10 similar for cid2
top10_df_list = list((1-result2).index)

# take the Customer_Rating_rank for the top 10 similar
top10_df = pd.DataFrame(df_new.loc[top10_df_list]["Customer_Rating_rank"])
top10_df["Similarity_Score"] = 1 - result2 
top10_df

# average rating of the 10 most similar customers
avg = round(top10_df["Customer_Rating_rank"].mean())
# weighted average rating of the 10 most similar customers
w_avg = round(sum(top10_df["Customer_Rating_rank"] * top10_df["Similarity_Score"]) / sum(top10_df["Similarity_Score"]))
# position of avg in the list of values
position_avg = val_list.index(avg)
# position of w_avg in the list of values
position_wavg = val_list.index(w_avg)

print(" ")
print ("Average rating for customer", cid2, "is",key_list[position_avg],"(",avg,")")
print ("Weighted Average rating for customer", cid2, "is",key_list[position_wavg],"(",w_avg,")")
top10_df

Enter Customer_id: 268
10 Nearest Neighbors for the customer  268
 
             0
7809  0.875019
6180  0.869407
6848  0.868651
8840  0.864550
3610  0.864203
7194  0.862178
8536  0.860917
5198  0.858305
4655  0.857987
5682  0.854507
 
Average rating for customer 268 is very_good ( 4 )
Weighted Average rating for customer 268 is very_good ( 4 )


Unnamed: 0,Customer_Rating_rank,Similarity_Score
7809,3,0.875019
6180,3,0.869407
6848,5,0.868651
8840,3,0.86455
3610,5,0.864203
7194,3,0.862178
8536,5,0.860917
5198,5,0.858305
4655,3,0.857987
5682,3,0.854507


#### 3. For the evaluation of your classification algorithm you will use the 50 first records of the groceries dataset and predict the rating for them. Then, for all n=50 records calculate the Mean Prediction Error for both prediction methods.

We create a new dataframe in order to test the classification algorithm. Two new empty columns named `Prediction_Average` and `Prediction_Weighted_Average` are appended to this dataframe.

In [38]:
df4 = df.copy()
df4["Prediction_Average"] = " "
df4["Prediction_Weighted_Average"] = " "

We reindex the dataframe to start from 1 instead of the default 0, as the index would otherwise be misaligned with the range of customer ids used later.

In [39]:
# reindex df4
df4.index = pd.RangeIndex(start=1, stop=len(df4)+1, step=1)

In [40]:
df4.head(10)

Unnamed: 0,Customer_ID,Age,Sex,Marital_Status,Education,Income,Customer_Rating,Persons_in_Household,Occupation,Groceries,Customer_Rating_rank,Education_rank,Prediction_Average,Prediction_Weighted_Average
1,1,75.0,male,married,primary,20000.0,very_good,3,retired,"citrus fruit,semi-finished bread,margarine,rea...",4,1,,
2,2,61.0,female,single,secondary,28000.0,good,1,housemaid,"tropical fruit,yogurt,coffee",3,2,,
3,3,32.0,male,single,secondary,34000.0,very_good,1,blue-collar,whole milk,4,2,,
4,4,62.0,male,married,primary,31000.0,very_good,3,blue-collar,"pip fruit,yogurt,cream cheese,meat spreads",4,1,,
5,5,66.0,female,married,secondary,19000.0,good,3,retired,"other vegetables,whole milk,condensed milk,lon...",3,2,,
6,6,55.0,female,single,secondary,35000.0,very_good,1,unemployed,"whole milk,butter,yogurt,rice,abrasive cleaner",4,2,,
7,7,23.0,female,married,tertiary,21000.0,good,3,housemaid,rolls/buns,3,3,,
8,8,26.0,female,single,secondary,30000.0,good,2,blue-collar,"other vegetables,UHT-milk,rolls/buns,bottled b...",3,2,,
9,9,29.0,female,married,secondary,32000.0,very_good,3,blue-collar,potted plants,4,2,,
10,10,57.0,female,married,secondary,26000.0,good,3,entrepreneur,"whole milk,cereals",3,2,,
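
A single prediction cell can be filled by label with `DataFrame.loc`. The sketch below also uses NaN instead of a blank string as the placeholder, so the column stays numeric; the toy frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a one-based index, as in df4
toy = pd.DataFrame({"Customer_ID": [1, 2, 3]},
                   index=pd.RangeIndex(start=1, stop=4, step=1))

# NaN placeholder keeps the column as float rather than object
toy["Prediction_Average"] = np.nan

# Write one value by row label and column name in a single indexing step
toy.loc[2, "Prediction_Average"] = 4
print(toy)
```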


We repeat the steps followed at the previous sub-question 4.2 to calculate the `Prediction_Average` and `Prediction_Weighted_Average` for the first 50 customers of the initial dataframe.

In [41]:
for cid3 in range(1, 51):
   
    idx3 = df4.loc[df4['Customer_ID']== cid3].index[0]
    c3 = calc_diss_updated(df4, idx3)
    c3 = pd.DataFrame(c3)

    result3 = c3.sort_values(by = 0, ascending = True)
    # Drop the first row, which indicates the customer itself
    result3 = result3.iloc[1: , :]
    # Keep only the first 10
    result3 = result3.head(10)

    # take the top 10 similar for cid3
    # (zero-based positions, which match the index of df)
    top10_df_list = list(result3.index)

    # take the Customer_Rating_rank for the top 10 similar
    top10_df = pd.DataFrame(df.loc[top10_df_list]["Customer_Rating_rank"])
    top10_df["Similarity_Score"] = 1 - result3 
    
    # average rating of the 10 most similar customers
    # (.loc writes into df4 directly and avoids chained assignment)
    df4.loc[cid3, "Prediction_Average"] = round(top10_df["Customer_Rating_rank"].mean())
    # weighted average rating of the 10 most similar customers
    df4.loc[cid3, "Prediction_Weighted_Average"] = round(sum(top10_df["Customer_Rating_rank"] * top10_df["Similarity_Score"]) / sum(top10_df["Similarity_Score"]))

For the final calculations of the Mean Prediction Error, we create a dataframe including only the necessary columns.

In [42]:
# Final df: the first 50 customers (index labels 1 to 50)
df_final = df4.loc[1:50,["Customer_ID","Customer_Rating_rank","Prediction_Average","Prediction_Weighted_Average"]]
df_final

Unnamed: 0,Customer_ID,Customer_Rating_rank,Prediction_Average,Prediction_Weighted_Average
1,1,4,4,4
2,2,3,3,3
3,3,4,4,3
4,4,4,3,3
5,5,3,3,3
6,6,4,3,3
7,7,3,3,3
8,8,3,3,3
9,9,4,4,4
10,10,3,3,3


In [43]:
# Calculate the Mean Prediction Error for the average prediction
mean_pred_error_avg = sum(abs(df_final["Prediction_Average"]-df_final["Customer_Rating_rank"]))/len(df_final)
print('The Mean Prediction Error of Average is ',mean_pred_error_avg)
mean_pred_error_avg

The Mean Prediction Error of Average is  0.68


0.68

In [44]:
# Calculate the Mean Prediction Error for the weighted average prediction
mean_pred_error_wavg = sum(abs(df_final["Prediction_Weighted_Average"]-df_final["Customer_Rating_rank"]))/len(df_final)
print('The Mean Prediction Error Weighted Average is',mean_pred_error_wavg)
mean_pred_error_wavg

The Mean Prediction Error Weighted Average is 0.7


0.7

The Mean Prediction Error, computed here as the mean absolute difference between the predicted and the actual rating rank, measures the accuracy of a prediction method. 
For example, a Mean Prediction Error of 0.7 for the Weighted Average means that the predicted rank differs from the actual rank by 0.7 rating levels on average, on the 1 to 5 scale. The simple Average performs marginally better (0.68). To conclude, the results of the two prediction methods used for this assignment are not ideal.
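
The error measure of cells In [43] and In [44] can be sketched as a small function. The arrays below hold only the ten rows displayed in the final table above, so the resulting values differ from the 0.68 and 0.7 obtained over all 50 records; the function name `mean_prediction_error` is hypothetical.

```python
import numpy as np

def mean_prediction_error(predicted, actual):
    """Mean absolute difference between predicted and actual ranks."""
    return np.mean(np.abs(np.asarray(predicted) - np.asarray(actual)))

# The ten rows displayed in the final table
actual    = [4, 3, 4, 4, 3, 4, 3, 3, 4, 3]
pred_avg  = [4, 3, 4, 3, 3, 3, 3, 3, 4, 3]
pred_wavg = [4, 3, 3, 3, 3, 3, 3, 3, 4, 3]

print(mean_prediction_error(pred_avg, actual))
print(mean_prediction_error(pred_wavg, actual))
```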