# AIPI 590 - XAI | Assignment #4
### Using the imodels Library (OneRClassifier) to Build an Interpretable Model for Zomato Online-Order Prediction
### Ritu Toshniwal


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ritu1412/Interpretable-Models-using-iModels-Library/blob/main/notebooks/oneR_algorithm.ipynb)

In [None]:
import os

# Remove Colab default sample_data (-f avoids an error if the folder is already gone)
!rm -rf ./sample_data

# Clone GitHub files to colab workspace
repo_name = "Interpretable-Models-using-iModels-Library" 
git_path = 'https://github.com/ritu1412/Interpretable-Models-using-iModels-Library.git' #Change to your path
!git clone "{git_path}"

# Install dependencies from requirements.txt file
!pip install -r "{os.path.join(repo_name,'requirements.txt')}" #Add if using requirements.txt

# Change working directory to location of notebook
notebook_dir = 'notebooks'
path_to_notebook = os.path.join(repo_name,notebook_dir)
%cd "{path_to_notebook}"
%ls

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from imodels import OneRClassifier
import warnings
warnings.filterwarnings('ignore')

# Data Cleaning

Download the dataset from Kaggle and save it as `../data/zomato.csv`; it was too large to upload to GitHub. Link: https://www.kaggle.com/datasets/rishikeshkonapure/zomato/data

In [4]:
data= pd.read_csv('../data/zomato.csv')

In [5]:
data.head()

| | url | address | name | online_order | book_table | rate | votes | phone | location | rest_type | dish_liked | cuisines | approx_cost(for two people) | reviews_list | menu_item | listed_in(type) | listed_in(city) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.zomato.com/bangalore/jalsa-banasha... | 942, 21st Main Road, 2nd Stage, Banashankari, ... | Jalsa | Yes | Yes | 4.1/5 | 775 | 080 42297555\r\n+91 9743772233 | Banashankari | Casual Dining | Pasta, Lunch Buffet, Masala Papad, Paneer Laja... | North Indian, Mughlai, Chinese | 800 | [('Rated 4.0', 'RATED\n A beautiful place to ... | [] | Buffet | Banashankari |
| 1 | https://www.zomato.com/bangalore/spice-elephan... | 2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ... | Spice Elephant | Yes | No | 4.1/5 | 787 | 080 41714161 | Banashankari | Casual Dining | Momos, Lunch Buffet, Chocolate Nirvana, Thai G... | Chinese, North Indian, Thai | 800 | [('Rated 4.0', 'RATED\n Had been here for din... | [] | Buffet | Banashankari |
| 2 | https://www.zomato.com/SanchurroBangalore?cont... | 1112, Next to KIMS Medical College, 17th Cross... | San Churro Cafe | Yes | No | 3.8/5 | 918 | +91 9663487993 | Banashankari | Cafe, Casual Dining | Churros, Cannelloni, Minestrone Soup, Hot Choc... | Cafe, Mexican, Italian | 800 | [('Rated 3.0', "RATED\n Ambience is not that ... | [] | Buffet | Banashankari |
| 3 | https://www.zomato.com/bangalore/addhuri-udupi... | 1st Floor, Annakuteera, 3rd Stage, Banashankar... | Addhuri Udupi Bhojana | No | No | 3.7/5 | 88 | +91 9620009302 | Banashankari | Quick Bites | Masala Dosa | South Indian, North Indian | 300 | [('Rated 4.0', "RATED\n Great food and proper... | [] | Buffet | Banashankari |
| 4 | https://www.zomato.com/bangalore/grand-village... | 10, 3rd Floor, Lakshmi Associates, Gandhi Baza... | Grand Village | No | No | 3.8/5 | 166 | +91 8026612447\r\n+91 9901210005 | Basavanagudi | Casual Dining | Panipuri, Gol Gappe | North Indian, Rajasthani | 600 | [('Rated 4.0', 'RATED\n Very good restaurant ... | [] | Buffet | Banashankari |


In [6]:
print(data.shape)
print(data.columns)

(51717, 17)
Index(['url', 'address', 'name', 'online_order', 'book_table', 'rate', 'votes',
       'phone', 'location', 'rest_type', 'dish_liked', 'cuisines',
       'approx_cost(for two people)', 'reviews_list', 'menu_item',
       'listed_in(type)', 'listed_in(city)'],
      dtype='object')


In [7]:
data.duplicated().sum()

np.int64(0)

In [8]:
data.isnull().sum()

url                                0
address                            0
name                               0
online_order                       0
book_table                         0
rate                            7775
votes                              0
phone                           1208
location                          21
rest_type                        227
dish_liked                     28078
cuisines                          45
approx_cost(for two people)      346
reviews_list                       0
menu_item                          0
listed_in(type)                    0
listed_in(city)                    0
dtype: int64

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51717 entries, 0 to 51716
Data columns (total 17 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   url                          51717 non-null  object
 1   address                      51717 non-null  object
 2   name                         51717 non-null  object
 3   online_order                 51717 non-null  object
 4   book_table                   51717 non-null  object
 5   rate                         43942 non-null  object
 6   votes                        51717 non-null  int64 
 7   phone                        50509 non-null  object
 8   location                     51696 non-null  object
 9   rest_type                    51490 non-null  object
 10  dish_liked                   23639 non-null  object
 11  cuisines                     51672 non-null  object
 12  approx_cost(for two people)  51371 non-null  object
 13  reviews_list                 51

In [10]:
data.describe()

| | votes |
|---|---|
| count | 51717.0 |
| mean | 283.697527 |
| std | 803.838853 |
| min | 0.0 |
| 25% | 7.0 |
| 50% | 41.0 |
| 75% | 198.0 |
| max | 16832.0 |


In [11]:
data.nunique()

url                            51717
address                        11495
name                            8792
online_order                       2
book_table                         2
rate                              64
votes                           2328
phone                          14926
location                          93
rest_type                         93
dish_liked                      5271
cuisines                        2723
approx_cost(for two people)       70
reviews_list                   22513
menu_item                       9098
listed_in(type)                    7
listed_in(city)                   30
dtype: int64

In [12]:
df = data.drop(['url','phone'], axis=1)

In [13]:
df.duplicated().sum()

np.int64(43)

In [14]:
df.drop_duplicates(inplace=True)

In [15]:
df.duplicated().sum()

np.int64(0)

In [16]:
df.dropna(how='any', inplace=True)

In [17]:
df.isnull().sum()

address                        0
name                           0
online_order                   0
book_table                     0
rate                           0
votes                          0
location                       0
rest_type                      0
dish_liked                     0
cuisines                       0
approx_cost(for two people)    0
reviews_list                   0
menu_item                      0
listed_in(type)                0
listed_in(city)                0
dtype: int64

In [18]:
df = df.rename(columns={'approx_cost(for two people)': 'cost', 'listed_in(type)': 'type',
                        'listed_in(city)': 'city'})

In [19]:
df['cost'].unique()

array(['800', '300', '600', '700', '550', '500', '450', '650', '400',
       '750', '200', '850', '1,200', '150', '350', '250', '1,500',
       '1,300', '1,000', '100', '900', '1,100', '1,600', '950', '230',
       '1,700', '1,400', '1,350', '2,200', '2,000', '1,800', '1,900',
       '180', '330', '2,500', '2,100', '3,000', '2,800', '3,400', '40',
       '1,250', '3,500', '4,000', '2,400', '1,450', '3,200', '6,000',
       '1,050', '4,100', '2,300', '120', '2,600', '5,000', '3,700',
       '1,650', '2,700', '4,500'], dtype=object)

In [20]:
# Remove thousands separators (e.g. '1,200') before converting to float
df['cost'] = df['cost'].str.replace(',', '', regex=False).astype(float)

In [21]:
print(df['cost'].unique())
print('---'*10)
df.dtypes

[ 800.  300.  600.  700.  550.  500.  450.  650.  400.  750.  200.  850.
 1200.  150.  350.  250. 1500. 1300. 1000.  100.  900. 1100. 1600.  950.
  230. 1700. 1400. 1350. 2200. 2000. 1800. 1900.  180.  330. 2500. 2100.
 3000. 2800. 3400.   40. 1250. 3500. 4000. 2400. 1450. 3200. 6000. 1050.
 4100. 2300.  120. 2600. 5000. 3700. 1650. 2700. 4500.]
------------------------------


address          object
name             object
online_order     object
book_table       object
rate             object
votes             int64
location         object
rest_type        object
dish_liked       object
cuisines         object
cost            float64
reviews_list     object
menu_item        object
type             object
city             object
dtype: object

In [22]:
df['rate'].unique()

array(['4.1/5', '3.8/5', '3.7/5', '4.6/5', '4.0/5', '4.2/5', '3.9/5',
       '3.0/5', '3.6/5', '2.8/5', '4.4/5', '3.1/5', '4.3/5', '2.6/5',
       '3.3/5', '3.5/5', '3.8 /5', '3.2/5', '4.5/5', '2.5/5', '2.9/5',
       '3.4/5', '2.7/5', '4.7/5', 'NEW', '2.4/5', '2.2/5', '2.3/5',
       '4.8/5', '3.9 /5', '4.2 /5', '4.0 /5', '4.1 /5', '2.9 /5',
       '2.7 /5', '2.5 /5', '2.6 /5', '4.5 /5', '4.3 /5', '3.7 /5',
       '4.4 /5', '4.9/5', '2.1/5', '2.0/5', '1.8/5', '3.4 /5', '3.6 /5',
       '3.3 /5', '4.6 /5', '4.9 /5', '3.2 /5', '3.0 /5', '2.8 /5',
       '3.5 /5', '3.1 /5', '4.8 /5', '2.3 /5', '4.7 /5', '2.4 /5',
       '2.1 /5', '2.2 /5', '2.0 /5', '1.8 /5'], dtype=object)

In [23]:
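# Drop unrated ('NEW') restaurants; copy so the assignments below modify df itself
df = df.loc[df.rate != 'NEW'].copy()
# Strip the '/5' suffix (and stray spaces, e.g. '3.8 /5') before converting to float
df['rate'] = df['rate'].str.replace('/5', '', regex=False).str.strip().astype(float)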
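
The `rate` cleanup can be sanity-checked on a few of the raw values listed above. The following is a small standalone sketch, not part of the original notebook; the helper name `parse_rate` is hypothetical and only mirrors the steps in the cell (drop `'NEW'`, strip `/5`, convert to float).

```python
from typing import Optional


def parse_rate(raw: str) -> Optional[float]:
    """Convert a raw rating string such as '3.8 /5' to a float.

    Returns None for the 'NEW' placeholder, which carries no numeric rating.
    """
    text = raw.strip()
    if text == "NEW":
        return None
    # Keep the part before '/5' and drop stray spaces, e.g. '3.8 /5' -> '3.8'
    return float(text.split("/")[0].strip())


# Sample values taken from the df['rate'].unique() output
samples = ["4.1/5", "3.8 /5", "NEW"]
parsed = [parse_rate(s) for s in samples]
print(parsed)
```

Both spellings of a rating (`'3.8/5'` and `'3.8 /5'`) map to the same number, which is why the unique rating strings collapse into fewer numeric values after cleaning.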

# Encoding the data for training

In [24]:
# Map the Yes/No target to 1/0 (avoids chained assignment, which pandas does not guarantee to write back to df)
df['online_order'] = df['online_order'].map({'Yes': 1, 'No': 0})
df['online_order'].value_counts()

online_order
1    16378
0     6870
Name: count, dtype: int64

In [25]:
df['online_order'] = pd.to_numeric(df['online_order'])

# Map the Yes/No booking flag to 1/0 (again avoiding chained assignment)
df['book_table'] = df['book_table'].map({'Yes': 1, 'No': 0})
df['book_table'] = pd.to_numeric(df['book_table'])
df['book_table'].value_counts()

book_table
0    17191
1     6057
Name: count, dtype: int64

In [26]:
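# Encode each categorical column as integer codes (the encoder is refit for every column).
# The codes follow the sorted order of the labels, not any meaningful ranking.
le = LabelEncoder()
df.location = le.fit_transform(df.location)
df.rest_type = le.fit_transform(df.rest_type)
df.cuisines = le.fit_transform(df.cuisines)
df.menu_item = le.fit_transform(df.menu_item)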

In [27]:
df.columns

Index(['address', 'name', 'online_order', 'book_table', 'rate', 'votes',
       'location', 'rest_type', 'dish_liked', 'cuisines', 'cost',
       'reviews_list', 'menu_item', 'type', 'city'],
      dtype='object')

# Model training
Selecting the features and target for training (X and y)

In [28]:
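# Keep the target plus the model features (column positions 2,3,4,5,6,7,9,10,12)
my_data = df[['online_order', 'book_table', 'rate', 'votes', 'location',
              'rest_type', 'cuisines', 'cost', 'menu_item']]
my_data.to_csv('../data/Zomato_df.csv')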

In [29]:
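# Features only: the saved columns without the target online_order
x = df[['book_table', 'rate', 'votes', 'location', 'rest_type', 'cuisines', 'cost', 'menu_item']]
x.head()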

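| | book_table | rate | votes | location | rest_type | cuisines | cost | menu_item |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 4.1 | 775 | 1 | 20 | 1386 | 800.0 | 5047 |
| 1 | 0 | 4.1 | 787 | 1 | 20 | 594 | 800.0 | 5047 |
| 2 | 0 | 3.8 | 918 | 1 | 16 | 484 | 800.0 | 5047 |
| 3 | 0 | 3.7 | 88 | 1 | 62 | 1587 | 300.0 | 5047 |
| 4 | 0 | 3.8 | 166 | 4 | 20 | 1406 | 600.0 | 5047 |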


In [30]:
y = df['online_order']
y

0        1
1        1
2        1
3        0
4        0
        ..
51705    1
51707    0
51708    0
51711    0
51715    0
Name: online_order, Length: 23248, dtype: int64

# Building Model

In [31]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=10)

In [32]:
# Initialize the model
model = OneRClassifier()

# Train the model
model.fit(x_train, y_train)

# Predict on test data
y_pred = model.predict(x_test)

print(model)

> ------------------------------
> Greedy Rule List
> ------------------------------
↓
48.42% risk (18598 pts)
	if ~cost ==> 78.2% risk (13829 pts)
↓
5.45% risk (4769 pts)
	if ~cost ==> 52.0% risk (4402 pts)
↓
0.0% risk (367 pts)
	if ~cost ==> 6.6000000000000005% risk (305 pts)
↓
0.0% risk (62 pts)



The model uses a single feature (here, `cost`) to create its decision rules. OneR selects the one feature whose rules best predict the target on the training data, which is why `cost` was chosen.

The output shows the decision rules learned from `cost`. Each rule splits the remaining data at a threshold of `cost` and reports a "risk", i.e. the estimated probability that `online_order = 1`, for the group it captures.

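**Example of the Rules Generated:**
   - **First Rule:**
     - "If the `~cost` condition holds, then 78.2% risk (probability) that `online_order = 1`."
     - Of the 18598 training points, 13829 satisfy this condition, and for them the estimated chance that online ordering is available is 78.2%. The printed summary does not show the cutoff itself, and the `~` prefix marks a flipped (negated) split, so the direction of the comparison should be read from the fitted rule rather than assumed to be "greater than".
   - **Other rules** apply only to the points left over by the rules above them. They split `cost` into further ranges and report the probability of online ordering in each one (52.0% for 4402 points, then 6.6% for 305 points, leaving 62 points at 0.0%).
   - Each rule is accompanied by the number of data points it applies to (e.g., `13829 pts`).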
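
To make the "one feature, best split" idea concrete, here is a minimal sketch of a OneR-style selection on made-up toy data. It is a simplified illustration under stated assumptions (a single threshold per feature, scored by training accuracy), not the algorithm as implemented in `imodels`; the function names `best_split` and `one_r` are hypothetical.

```python
from typing import Dict, List, Tuple


def best_split(values: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (threshold, accuracy) of the best rule 'value <= threshold'.

    Each side of the split predicts its own majority class.
    """
    best = (values[0], 0.0)
    for threshold in sorted(set(values)):
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        # Points classified correctly when each side predicts its majority class
        correct = sum(max(side.count(0), side.count(1)) for side in (left, right))
        accuracy = correct / len(labels)
        if accuracy > best[1]:
            best = (threshold, accuracy)
    return best


def one_r(features: Dict[str, List[float]], labels: List[int]) -> Tuple[str, float, float]:
    """Pick the single feature whose best split gives the highest accuracy."""
    scored = {name: best_split(col, labels) for name, col in features.items()}
    name = max(scored, key=lambda n: scored[n][1])
    return name, scored[name][0], scored[name][1]


# Toy data, made up for illustration (NOT taken from the Zomato dataset)
toy_features = {
    "cost": [200.0, 300.0, 400.0, 1500.0, 2000.0, 3000.0],
    "votes": [10.0, 500.0, 20.0, 400.0, 30.0, 600.0],
}
toy_labels = [1, 1, 1, 0, 0, 0]

chosen, threshold, accuracy = one_r(toy_features, toy_labels)
print(chosen, threshold, accuracy)
```

In this toy case `cost` separates the classes perfectly at 400 while `votes` cannot, so `cost` is selected. The real model goes further by stacking several splits on the chosen feature into a rule list.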

### Interpretability:

The `OneRClassifier` is designed to be interpretable: it builds a short list of rules that all depend on a single feature (here, `cost`). Each rule compares `cost` against a threshold and reports the estimated probability that `online_order = 1` for the restaurants in that range, so the whole model can be read top to bottom, even by non-technical users. The model greedily chooses the best single feature and then the best splits within that feature, which keeps it simple but means every other feature is ignored.

The rules suggest that `cost` carries some signal about whether online ordering is available, and the thresholds break this relationship into understandable segments. How much predictive value that signal adds is measured by the accuracy and classification report below.
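
Reading a rule list as a prediction procedure can be sketched as follows. The probabilities are copied from the printed rule list; the cutoffs and the `<=` direction are hypothetical placeholders, because the printed summary does not show them. The names `RULES`, `predict_proba`, and `predict` are illustrative only and are not the `imodels` API.

```python
from typing import List, Optional, Tuple

# (upper bound on cost, estimated probability of online_order = 1).
# Probabilities come from the printed rule list; the cutoffs are HYPOTHETICAL.
RULES: List[Tuple[Optional[float], float]] = [
    (700.0, 0.782),
    (1500.0, 0.520),
    (3000.0, 0.066),
    (None, 0.0),  # fallback for points matched by no earlier rule
]


def predict_proba(cost: float) -> float:
    """Walk the rules top to bottom and return the first matching probability."""
    for upper, prob in RULES:
        if upper is None or cost <= upper:
            return prob
    raise ValueError("rule list must end with a fallback rule")


def predict(cost: float, cutoff: float = 0.5) -> int:
    """Turn the probability into a 0/1 prediction."""
    return int(predict_proba(cost) >= cutoff)


for cost in (300.0, 1200.0, 2500.0, 5000.0):
    print(cost, predict_proba(cost), predict(cost))
```

With a 0.5 cutoff, only the last two groups (6.6% and 0.0%) would be predicted as `online_order = 0`. Those groups hold 305 + 62 of the 18598 training points, so a model of this shape predicts `1` for almost every restaurant.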

In [33]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.723010752688172


In [34]:
from sklearn.metrics import classification_report

# Precision, Recall, F1-score
report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)

Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.08      0.14      1390
           1       0.72      1.00      0.83      3260

    accuracy                           0.72      4650
   macro avg       0.84      0.54      0.49      4650
weighted avg       0.79      0.72      0.63      4650
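
The accuracy is easier to judge next to a trivial baseline. The sketch below uses only numbers printed above (the `support` column and the accuracy) to compare the model with a classifier that always predicts the majority class; it is a back-of-the-envelope check, not an additional experiment.

```python
# Test-set class counts, copied from the "support" column of the report
support = {0: 1390, 1: 3260}
total = sum(support.values())

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = max(support.values()) / total

# Accuracy reported for the OneR model above
model_accuracy = 0.723010752688172
correct = round(model_accuracy * total)

print(f"majority baseline: {majority_baseline:.4f}")
print(f"OneR accuracy:     {model_accuracy:.4f} ({correct} of {total} correct)")
print(f"improvement:       {model_accuracy - majority_baseline:.4f}")
```

The model beats the always-predict-`1` baseline by roughly two percentage points. This matches the report: recall for class 0 is 0.08, so the model labels nearly every restaurant as offering online ordering, and accuracy alone overstates how well it separates the two classes.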

