## 1.1 Detecting and Fixing Errors in dirty_data.csv 

General approach
Look for potential dirty data in the following areas:
- violations of integrity constraints
- data entry errors
- wrong categorical data
- violations of referential integrity
- duplicated data
- values outside the valid range
- wrong encoding
- wrong representations
- wrong names and numbers

Import necessary libraries

In [2]:
import pandas as pd
import networkx as nx
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
# raw strings so the backslashes are not treated as escape sequences
dirty_data_path = r"data\Group109_dirty_data.csv"
missing_data_path = r"data\Group109_missing_data.csv"
branch_path = r"data\branches.csv"
edges_path = r"data\edges.csv"
nodes_path = r"data\nodes.csv"

dirty_df = pd.read_csv(dirty_data_path)
missing_df = pd.read_csv(missing_data_path)
branch_df = pd.read_csv(branch_path)
edges_df = pd.read_csv(edges_path)
nodes_df = pd.read_csv(nodes_path)

In [4]:
branch_df

Unnamed: 0,branch_code,branch_name,branch_lat,branch_lon
0,NS,Nickolson,-37.773803,144.983647
1,TP,Thompson,-37.861835,144.905716
2,BK,Bakers,-37.815834,145.04645


In [5]:
dirty_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   order_id                 500 non-null    object 
 1   date                     500 non-null    object 
 2   time                     500 non-null    object 
 3   order_type               500 non-null    object 
 4   branch_code              500 non-null    object 
 5   order_items              500 non-null    object 
 6   order_price              500 non-null    float64
 7   customer_lat             500 non-null    float64
 8   customer_lon             500 non-null    float64
 9   customerHasloyalty?      500 non-null    int64  
 10  distance_to_customer_KM  500 non-null    float64
 11  delivery_fee             500 non-null    float64
dtypes: float64(5), int64(1), object(6)
memory usage: 47.0+ KB


In [6]:
dirty_df.describe()

Unnamed: 0,order_price,customer_lat,customer_lon,customerHasloyalty?,distance_to_customer_KM,delivery_fee
count,500.0,500.0,500.0,500.0,500.0,500.0
mean,480.7129,-30.753946,143.504403,0.102,8.629274,13.877162
std,254.034843,25.337436,16.29963,0.302951,1.596279,2.378285
min,31.75,-37.827188,-37.822231,0.0,3.613,5.646222
25%,300.625,-37.818738,144.952786,0.0,7.7505,12.660927
50%,434.25,-37.811755,144.963914,0.0,8.6395,13.849738
75%,633.25,-37.804505,144.980037,0.0,9.6335,15.229668
max,1361.5,145.005221,145.015449,1.0,13.735,20.088572


In [7]:
dirty_df.head()

Unnamed: 0,order_id,date,time,order_type,branch_code,order_items,order_price,customer_lat,customer_lon,customerHasloyalty?,distance_to_customer_KM,delivery_fee
0,ORDX02948,2018-03-11,10:11:49,Breakfast,BK,"[('Cereal', 4), ('Eggs', 3), ('Coffee', 1), ('...",206.0,-37.80504,144.963243,0,7.615,14.403915
1,ORDC07988,2018-03-06,10:01:41,Breakfast,NS,"[('Coffee', 2), ('Cereal', 5), ('Eggs', 10), (...",437.0,-37.815989,144.983435,0,8.914,13.807773
2,ORDI00568,2018-07-05,14:05:04,Lunch,NS,"[('Fries', 6), ('Chicken', 5), ('Salad', 7), (...",507.4,-37.806281,144.94196,1,9.316,15.028384
3,ORDI06756,2018-04-23,11:43:05,Breakfast,NS,"[('Coffee', 3), ('Pancake', 6), ('Eggs', 3)]",234.0,-37.822259,144.946977,0,9.975,15.21727
4,ORDX01986,2018-04-29,11:53:14,Breakfast,BK,"[('Coffee', 6), ('Cereal', 1), ('Eggs', 2), ('...",134.25,-37.81595,144.986001,0,6.038,13.6775


As stated in the assignment brief, the following columns contain no errors in the dirty data, so we will not check or fix them:
- `order_id`
- `time`
- the numeric quantity in `order_items`
- `delivery_fee`

- [x] check if date and time values are properly formatted
- [x] check if branch code is correct
- [ ] check if order price makes sense according to number of items ordered
- [ ] check if distance to customer is the shortest-path distance
- [ ] check if customer_lat and customer_lon exist in nodes
- [x] check if order_type is correct according to time
- [ ] check if customer does / does not have loyalty according to delivery fee


Delivery fee is calculated using a different method for each branch.
The fee depends linearly (but in different ways for each branch) on:
a. weekend or weekday (1 or 0) - as a continuous variable
b. time of the day (morning 0, afternoon 1, evening 2) - as a continuous variable
c. distance between branch and customer
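
The three inputs listed above can be encoded as numbers before any per-branch linear model is fitted. The sketch below is only an illustration: `fee_features` and `linear_fee` are hypothetical helper names, and any intercept or coefficients passed to `linear_fee` are placeholders, not values from the assignment.

```python
from datetime import date

# Time-of-day codes as described above: morning 0, afternoon 1, evening 2.
TIME_OF_DAY = {"Breakfast": 0, "Lunch": 1, "Dinner": 2}

def fee_features(date_str, order_type, distance_km):
    """Return (weekend, time_of_day, distance) for one order."""
    # date.weekday(): Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
    weekend = 1 if date.fromisoformat(date_str).weekday() >= 5 else 0
    return (weekend, TIME_OF_DAY[order_type], distance_km)

def linear_fee(features, intercept, coefs):
    """Evaluate intercept + sum(coef_i * feature_i) for one branch."""
    return intercept + sum(c * x for c, x in zip(coefs, features))

print(fee_features("2018-03-11", "Breakfast", 7.615))  # 2018-03-11 is a Sunday
```

Each branch would have its own intercept and coefficients, which is what "in different ways for each branch" implies.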


In [8]:
dirty_df['branch_code'].value_counts()

TP    169
NS    163
BK    144
ns     11
tp      7
bk      6
Name: branch_code, dtype: int64

From the output above, we can see that some branch codes use the wrong representation: they were entered in lowercase instead of uppercase. We transform this column so that all values are uppercase.

In [9]:
dirty_df['branch_code'] = dirty_df['branch_code'].apply(str.upper)
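
One way to confirm the transformation worked is to compare the normalised codes against the codes listed in `branches.csv`. The snippet below is a sketch on a small made-up Series, not the assignment data; `VALID_CODES` is an assumed name.

```python
import pandas as pd

# Toy data for illustration only.
codes = pd.Series(["NS", "tp", "bk", "TP", "ns"])
VALID_CODES = {"NS", "TP", "BK"}  # the three codes shown in branches.csv

normalised = codes.str.upper()  # vectorised equivalent of apply(str.upper)
unexpected = normalised[~normalised.isin(VALID_CODES)]

print(normalised.tolist())
print(len(unexpected))  # 0 means every code is now a known branch
```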

According to the assignment brief, all string date values in column `date` should be in the format YYYY-MM-DD. We can verify if this is the case by using `pd.to_datetime` function on the `date` column and see if all date values fit the format `%Y-%m-%d`

In [10]:
try:
    pd.to_datetime(dirty_df['date'], format='%Y-%m-%d', errors='raise')
except ValueError as e:
    print("Error occurred. Unable to convert the following date:")
    print(e)
else:
    print('No error')

Error occurred. Unable to convert the following date:
time data 06-10-2018 doesn't match format specified


As we can see, not all the date values are in the correct format. Therefore we have to iterate through each date value and determine which one of the following formats it is in:
1. YYYY-MM-DD
2. DD-MM-YYYY
3. YYYY-DD-MM

In [11]:
for index, row in dirty_df.iterrows():
    # Check if date fits the format YYYY-MM-DD
    try:
        parsed_date = pd.to_datetime(row['date'], format='%Y-%m-%d')
    except ValueError:
        # If not, check if it fits the format YYYY-DD-MM
        try:
            parsed_date = pd.to_datetime(row['date'], format='%Y-%d-%m')
        # Else, fall back to the format DD-MM-YYYY
        except ValueError:
            parsed_date = pd.to_datetime(row['date'], format='%d-%m-%Y')
    dirty_df.at[index, 'date'] = parsed_date

dirty_df['date'] = pd.to_datetime(dirty_df['date'], format='%Y-%m-%d').dt.strftime('%Y-%m-%d')
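
The fallback logic can also be written as a small standalone helper. This is a sketch using only the standard library; `parse_date` and `DATE_FORMATS` are hypothetical names. One limitation applies to any approach of this kind: a value written as YYYY-DD-MM whose day is 12 or less is also a valid YYYY-MM-DD string, so it cannot be told apart from a correct date by format alone.

```python
from datetime import datetime

# Formats are tried in this order; the first one that parses wins.
DATE_FORMATS = ("%Y-%m-%d", "%Y-%d-%m", "%d-%m-%Y")

def parse_date(text):
    """Return the date as a YYYY-MM-DD string, or raise ValueError."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next candidate format
    raise ValueError(f"Unrecognised date format: {text!r}")

print(parse_date("06-10-2018"))  # DD-MM-YYYY input
print(parse_date("2018-25-03"))  # YYYY-DD-MM input
```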

Now we evaluate the `order_type` column. Since we know for certain that the time column does not have any errors in it, we can verify if the `order_type` is correct according to the time. The order should be the following according to the times:
- 08:00:00 - 12:00:00 = Breakfast
- 12:00:01 - 16:00:00 = Lunch
- 16:00:01 - 20:00:00 = Dinner

In [12]:
dirty_df['order_type'].value_counts()

Breakfast    170
Lunch        165
Dinner       165
Name: order_type, dtype: int64

To check whether the correct order type has been recorded, we create the function `find_order_type`. It converts each string value in the `time` column into a time object and checks whether it falls within the Breakfast, Lunch or Dinner time slot.

In [13]:
def find_order_type(time):
    timestamp = pd.to_datetime(time, format='%H:%M:%S').time()
    if timestamp >= pd.to_datetime('08:00:00').time() and timestamp <= pd.to_datetime('12:00:00').time():
        return 'Breakfast'
    elif timestamp >= pd.to_datetime('12:00:01').time() and timestamp <= pd.to_datetime('16:00:00').time():
        return 'Lunch'
    elif timestamp >= pd.to_datetime('16:00:01').time() and timestamp <= pd.to_datetime('20:00:00').time():
        return 'Dinner'
    else:
        return 'Error'
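
As a cross-check of the slot boundaries, the same mapping can be written with `datetime.time` objects and tested at the edge of each slot. This is an illustrative sketch; `order_type_for` and `SLOTS` are hypothetical names and are not part of the notebook.

```python
from datetime import datetime, time

# Inclusive (start, end) bounds for each meal slot, as listed above.
SLOTS = [
    (time(8, 0, 0), time(12, 0, 0), "Breakfast"),
    (time(12, 0, 1), time(16, 0, 0), "Lunch"),
    (time(16, 0, 1), time(20, 0, 0), "Dinner"),
]

def order_type_for(time_str):
    """Map an HH:MM:SS string to its meal slot, or 'Error' if outside all slots."""
    t = datetime.strptime(time_str, "%H:%M:%S").time()
    for start, end, label in SLOTS:
        if start <= t <= end:
            return label
    return "Error"

for sample in ("12:00:00", "12:00:01", "16:00:00", "16:00:01", "07:59:59"):
    print(sample, order_type_for(sample))
```

Keeping the bounds in one table makes it easier to see that every slot is closed at both ends, so an order placed at exactly 16:00:00 is still a Lunch order.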

In [14]:
# output values where calculated order type does not match the order type in the dataset
dirty_df[dirty_df['time'].apply(find_order_type) != dirty_df['order_type']].head()

Unnamed: 0,order_id,date,time,order_type,branch_code,order_items,order_price,customer_lat,customer_lon,customerHasloyalty?,distance_to_customer_KM,delivery_fee
14,ORDC09610,2018-10-06,17:07:36,Lunch,NS,"[('Fish&Chips', 5), ('Shrimp', 2), ('Pasta', 4)]",393.0,-37.826028,144.984514,0,9.578,17.378357
19,ORDC06273,2018-09-05,16:06:45,Lunch,NS,"[('Pasta', 5), ('Shrimp', 10), ('Fish&Chips', ...",1361.5,-37.814936,144.927351,0,10.813,16.896152
20,ORDB10659,2018-04-20,11:32:57,Dinner,TP,"[('Eggs', 7), ('Coffee', 10), ('Cereal', 6), (...",597.5,-37.8189,144.952797,0,8.576,11.161592
37,ORDK01676,2018-03-16,10:21:58,Dinner,BK,"[('Eggs', 2), ('Pancake', 2)]",92.5,-37.801158,144.957692,0,8.326,13.278789
54,ORDB10190,2018-10-08,17:17:44,Breakfast,TP,"[('Fish&Chips', 2), ('Salmon', 2)]",152.0,-37.813657,144.957285,0,8.419,13.088105


As shown above, there are orders whose order type does not match the time. We fix this by recomputing `order_type` from `time`:

In [15]:
dirty_df['order_type'] = dirty_df['time'].apply(find_order_type)

In [16]:
dirty_df['order_type'].value_counts()

Dinner       170
Breakfast    166
Lunch        164
Name: order_type, dtype: int64

We determine if the branch and customer locations in `dirty_df` can be accurately found in the provided Graph in `nodes.csv`

In [17]:
# return True if the customer node does NOT exist in nodes.csv
def check_customer_nodes(row):
    customer_node = nodes_df[(nodes_df['lat'] == row['customer_lat']) & (nodes_df['lon'] == row['customer_lon'])]

    if customer_node.empty:
        return True
    else:
        return False

# return True if the branch node does NOT exist in nodes.csv
def check_branch_nodes(row):
    branch_lat = branch_df[branch_df['branch_code'] == row['branch_code']]['branch_lat'].values[0]
    branch_lon = branch_df[branch_df['branch_code'] == row['branch_code']]['branch_lon'].values[0]

    branch_node = nodes_df[(nodes_df['lat'] == branch_lat) & (nodes_df['lon'] == branch_lon)]

    if branch_node.empty:
        return True 
    else:
        return False
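
For larger tables, the same membership test can be done in one pass with a left merge and `indicator=True` rather than a row-wise `apply`. The sketch below uses a handful of illustrative coordinates; `toy_nodes` and `toy_orders` are stand-ins for `nodes_df` and `dirty_df`. Like the functions above, it relies on exact floating-point equality, which assumes the coordinates were copied from `nodes.csv` without rounding.

```python
import pandas as pd

toy_nodes = pd.DataFrame({
    "node": [1, 2],
    "lat": [-37.814994, -37.81757],
    "lon": [144.960538, 145.005221],
})
toy_orders = pd.DataFrame({
    "customer_lat": [-37.814994, 37.814994, 145.005221],
    "customer_lon": [144.960538, 144.960538, -37.81757],
})

merged = toy_orders.merge(
    toy_nodes,
    how="left",
    left_on=["customer_lat", "customer_lon"],
    right_on=["lat", "lon"],
    indicator=True,  # adds a `_merge` column: 'both' or 'left_only'
)
missing = merged["_merge"] == "left_only"
print(missing.tolist())
print(int(missing.sum()))
```

If the node table could contain duplicate coordinate pairs, dropping duplicates on `lat` and `lon` first would keep the merged row count equal to the order count.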

In [18]:
# Retrieve the number of rows where the branch node does not exist in the nodes.csv file
dirty_df.apply(check_branch_nodes, axis=1).sum()

0

From the above results, we can see that all the branch nodes are accounted for and each node exists in the provided graph

In [19]:
dirty_df.apply(check_customer_nodes, axis=1).sum()

41

When checking for customer nodes, it is evident that there are 41 instances of orders where we are unable to identify the node in the graph according to the provided customer longitude and latitude values. To investigate further into the cause of this issue, let's output the first 15 instances where this is the case.

In [20]:
dirty_df[dirty_df.apply(check_customer_nodes, axis=1)][['customer_lat', 'customer_lon']].head(15)

Unnamed: 0,customer_lat,customer_lon
5,37.814994,144.960538
7,37.816395,144.93817
12,37.816482,144.964894
25,37.824,144.953766
35,37.811124,145.001788
38,37.799866,145.0028
45,37.814037,144.98548
49,37.813155,144.96836
52,37.804832,144.950241
53,37.812135,144.962341


From the outputted data above, we can see 2 prominent errors within the data:
1. <b>There are some instances in which customer latitude has a missing negative symbol at the front. </b>

Take the row at index 5 for example where latitude and longitude are 37.814994 and 144.960538 respectively. If we look at the `nodes.csv` file:

In [21]:
index = 5
# modify latitude by multiplying it by -1
lat = dirty_df.loc[index, 'customer_lat'] * -1
lon = dirty_df.loc[index, 'customer_lon']

# check if the modified latitude exists in the nodes.csv file
nodes_df[(nodes_df['lat'] == lat) & (nodes_df['lon'] == lon)]

Unnamed: 0,node,lat,lon
3793,6167489464,-37.814994,144.960538


As shown above, the node exists in the graph once the latitude is negated.


2. <b>There are some instances in which customer longitude and customer latitude have been swapped with one another</b>

Take row at index 135 for example where latitude and longitude are 145.005221 and -37.817570 respectively. It is evident that these have been inputted incorrectly and the values have swapped over.

In [22]:
index = 135
# swap the recorded latitude and longitude
lat = dirty_df.loc[index, 'customer_lon']
lon = dirty_df.loc[index, 'customer_lat']

# check if the swapped coordinates exist in the nodes.csv file
nodes_df[(nodes_df['lat'] == lat) & (nodes_df['lon'] == lon)]

Unnamed: 0,node,lat,lon
9191,1463620803,-37.81757,145.005221


As shown above, the following node exists in the graph when the customer latitude and longitude values are swapped over.

We fix the customer longitude and latitude values using the custom function `calc_customer_node`, which first tries to find the customer's node from the given latitude and longitude values. If it is unable to do so, we try the following approaches next:
1. Multiply the latitude value by -1 and check whether these coordinates exist in `nodes_df`. Else;
2. Swap the longitude and latitude values and check whether these coordinates exist in `nodes_df`

If none of these approaches finds a node, we raise a `ValueError`

In [23]:
def calc_customer_node(row):
    cus_lat = row['customer_lat']
    cus_lon = row['customer_lon']
    customer_node = nodes_df[(nodes_df['lat'] == cus_lat) & (nodes_df['lon'] == cus_lon)]

    # If the customer node does not exist, check for misinput of latitude and longitude values
    if customer_node.empty:
        # multiply the latitude by -1 
        customer_node = nodes_df[(nodes_df['lat'] == -cus_lat) & (nodes_df['lon'] == cus_lon)]
        # If the customer node still does not exist, try swapping the latitude and longitude values
        if customer_node.empty:
            customer_node = nodes_df[(nodes_df['lat'] == cus_lon) & (nodes_df['lon'] == cus_lat)]

    # If the customer node still does not exist, raise an error
    if customer_node.empty:
        raise ValueError("Customer node does not exist in the nodes.csv file")
    
    return customer_node.iloc[0]['lat'], customer_node.iloc[0]['lon']
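
The repair order can also be illustrated without pandas by testing each candidate coordinate pair against a set of known nodes. This is a sketch on a tiny example; `repair_coords` and `known` are hypothetical names, and the two nodes are the ones shown in the outputs above.

```python
def repair_coords(lat, lon, known):
    """Return the first candidate (lat, lon) found in `known`, else raise."""
    candidates = [
        (lat, lon),    # as recorded
        (-lat, lon),   # latitude missing its minus sign
        (lon, lat),    # latitude and longitude swapped
    ]
    for candidate in candidates:
        if candidate in known:
            return candidate
    raise ValueError(f"No node found for ({lat}, {lon})")

known = {(-37.814994, 144.960538), (-37.81757, 145.005221)}
print(repair_coords(37.814994, 144.960538, known))
print(repair_coords(145.005221, -37.81757, known))
```

A set lookup avoids scanning the whole node table for every order, which may matter if the graph is large.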


In [24]:
dirty_df['customer_coords'] = dirty_df.apply(calc_customer_node, axis=1)
dirty_df[['customer_lat', 'customer_lon']] = pd.DataFrame(dirty_df['customer_coords'].to_list(), index=dirty_df.index)

dirty_df.drop(columns=['customer_coords'], inplace=True)

In [25]:
dirty_df.apply(check_customer_nodes, axis=1).sum()

0

When we run `check_customer_nodes` again we can see that all of the coordinates can be accounted for and found in the `nodes.csv` file

Now we check if `distance_to_customer_KM` is correct. To start, we use the `Graph()` class from the `networkx` library to construct a graph from the nodes and edges provided with the assignment in `nodes.csv` and `edges.csv` respectively

In [26]:
G = nx.Graph()
G.add_nodes_from(nodes_df['node'])
for index, row in edges_df.iterrows():
    G.add_edge(row['u'], row['v'], weight=row['distance(m)'])

In [27]:
def find_branch_node(row):
    branch_lat = branch_df[branch_df['branch_code'] == row['branch_code']]['branch_lat'].values[0]
    branch_lon = branch_df[branch_df['branch_code'] == row['branch_code']]['branch_lon'].values[0]

    branch_node = nodes_df[(nodes_df['lat'] == branch_lat) & (nodes_df['lon'] == branch_lon)]

    return branch_node

def find_shortest_path(row):
    cus_lat = row['customer_lat']
    cus_lon = row['customer_lon']
    customer_node = nodes_df[(nodes_df['lat'] == cus_lat) & (nodes_df['lon'] == cus_lon)]
    
    branch_node = find_branch_node(row)

    # Find the shortest path between the customer node and the branch node
    try:
        # calculate the shortest path length with Dijkstra's algorithm (result in metres)
        shortest_path = nx.shortest_path_length(G, source=customer_node['node'].values[0], target=branch_node['node'].values[0], weight='weight')

        # convert metres to kilometres
        shortest_path = shortest_path/1000
    except nx.NetworkXNoPath:
        raise ValueError("No path exists between the customer node and the branch node")
    
    return shortest_path
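
For reference, the quantity returned by `nx.shortest_path_length` with a `weight` argument is the minimum total edge weight between two nodes. The following is a minimal standard-library sketch of Dijkstra's algorithm on a made-up four-node graph; it illustrates the technique and is not the `networkx` implementation. `dijkstra_length` is a hypothetical name.

```python
import heapq

def dijkstra_length(adjacency, source, target):
    """Return the minimum total weight from source to target, or None if unreachable."""
    best = {source: 0}
    queue = [(0, source)]  # (distance so far, node)
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, weight in adjacency.get(node, []):
            new_dist = dist + weight
            if new_dist < best.get(neighbour, float("inf")):
                best[neighbour] = new_dist
                heapq.heappush(queue, (new_dist, neighbour))
    return None

# Undirected toy graph; weights in metres.
edges = [("A", "B", 400), ("B", "C", 300), ("A", "C", 900), ("C", "D", 200)]
adjacency = {}
for u, v, w in edges:
    adjacency.setdefault(u, []).append((v, w))
    adjacency.setdefault(v, []).append((u, w))

metres = dijkstra_length(adjacency, "A", "D")
print(metres / 1000)  # convert metres to kilometres
```

Here the route A, B, C, D (900 m) beats the route through the direct A to C edge (1100 m), which is the kind of difference a straight-line or naive distance would miss.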

In [28]:
dirty_df['shortest_path'] = dirty_df.apply(find_shortest_path, axis=1)

In [29]:
dirty_df[dirty_df['distance_to_customer_KM'] != dirty_df['shortest_path']][['distance_to_customer_KM', 'shortest_path']].head()

Unnamed: 0,distance_to_customer_KM,shortest_path
1,8.914,8.624
11,8.33,7.699
15,8.667,7.938
18,8.245,8.787
22,5.435,11.912


Once we have calculated the shortest distance to the customer using Dijkstra's algorithm, we replace the data in the `distance_to_customer_KM` column with the recomputed distances in the `shortest_path` column

In [30]:
dirty_df['distance_to_customer_KM'] = dirty_df['shortest_path']
dirty_df.drop(columns=['shortest_path'], inplace=True)
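
When a stored distance is compared with a recomputed one, an exact `!=` can flag pairs that differ only by floating-point rounding, for example if the edge weights are not whole numbers. A tolerance-based comparison is one way to guard against that. The sketch below uses two rows from the output above plus one made-up near match; the 0.0005 km tolerance is an assumption based on the three decimal places shown.

```python
import math

stored = [8.914, 8.33, 7.615]
recomputed = [8.624, 7.699, 7.6150000000001]  # last value is a made-up near match

mismatch = [
    not math.isclose(a, b, abs_tol=0.0005)  # half of the last displayed digit
    for a, b in zip(stored, recomputed)
]
print(mismatch)
```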

Check if `order_price` is correct

In [31]:
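import ast  # literal_eval parses the stringified list of tuples without executing arbitrary code
dirty_df['order_items'] = dirty_df['order_items'].apply(ast.literal_eval)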

Run this code to determine all the menu items provided

In [32]:
dirty_df.explode('order_items')['order_items'].apply(lambda x: x[0]).unique().tolist()

['Cereal',
 'Eggs',
 'Coffee',
 'Pancake',
 'Fries',
 'Chicken',
 'Salad',
 'Burger',
 'Salmon',
 'Shrimp',
 'Fish&Chips',
 'Pasta',
 'Steak']

In [33]:
MENU_ITEMS = ['Cereal',
 'Eggs',
 'Coffee',
 'Pancake',
 'Fries',
 'Chicken',
 'Salad',
 'Burger',
 'Salmon',
 'Shrimp',
 'Fish&Chips',
 'Pasta',
 'Steak']

def create_order_dict(row):
    order_mapping = {item: index for index, item in enumerate(MENU_ITEMS)}
    order = [0 for _ in MENU_ITEMS]
    for item in row['order_items']:
        order_index = order_mapping[item[0]]
        order[order_index] += item[1]
    return order

We can use the clean data in missing_data.csv to determine the correct menu item price

In [34]:
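import ast  # literal_eval parses the stringified list of tuples without executing arbitrary code
missing_df['order_items'] = missing_df['order_items'].apply(ast.literal_eval)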

In [35]:
coef_matrix = np.array(missing_df.apply(create_order_dict, axis=1).tolist())
constants = np.array(missing_df['order_price'].tolist())
solution, residuals, _, _ = np.linalg.lstsq(coef_matrix, constants, rcond=None)
solution = solution.round(2)
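
To see why least squares recovers the unit prices, note that each order gives one linear equation: the sum of quantity times unit price equals the order price. With at least as many independent orders as menu items, the system determines the prices. The sketch below uses three made-up items and prices, not the assignment's menu.

```python
import numpy as np

# Hypothetical unit prices used only to generate the toy order totals.
true_prices = np.array([4.5, 12.0, 7.25])

# Each row is one order; each column is the quantity of one item.
quantities = np.array([
    [1, 0, 2],
    [0, 3, 1],
    [2, 2, 0],
    [1, 1, 1],
])
totals = quantities @ true_prices

recovered, _, rank, _ = np.linalg.lstsq(quantities, totals, rcond=None)
print(recovered.round(2))
print(rank)  # a rank equal to the number of items means the prices are uniquely determined
```

Checking the returned rank is a cheap safeguard: if it were lower than the number of menu items, some prices could not be separated from the others using the available orders.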

We double check to see if the item price has been calculated correctly by finding the total order price for each order

In [36]:
item_prices = {item:price for item, price in zip(MENU_ITEMS, solution)}

def calculate_order_price(row):
    order_price = 0
    for item in row['order_items']:
        order_price += item[1] * item_prices[item[0]]
    order_price = round(order_price, 2)
    return order_price

In [37]:
sum(missing_df['order_price'] != missing_df.apply(calculate_order_price, axis=1))

0

Since no order in `missing_data.csv` disagrees with the recalculated total, we can be confident that the item prices are correct. Now we use them to find the correct order prices in the `dirty_data.csv` file

In [38]:
dirty_df['calc_order_price'] = dirty_df.apply(calculate_order_price, axis=1)
dirty_df[dirty_df['order_price'] != dirty_df['calc_order_price']][['order_price', 'calc_order_price']].head()

Unnamed: 0,order_price,calc_order_price
8,387.6,856.0
21,1104.5,1122.0
51,235.0,195.0
59,289.0,541.0
88,399.6,150.5


From what we can see above, there are instances in `dirty_df` where the order_price does not align with the order price we have calculated. Since we know that the numeric quantity in order_items and our item prices are both correct, it would mean that the initial order price is incorrect. We fix these errors by assigning the values calculated in `calc_order_price` to `order_price`

In [39]:
dirty_df['order_price'] = dirty_df['calc_order_price']
dirty_df.drop(columns=['calc_order_price'], inplace=True)

Check if `customerHasloyalty?` is correct. To verify whether `customerHasloyalty?` has been recorded correctly, we again use the clean data in missing_data.csv, this time to fit a logistic regression model that predicts the `customerHasloyalty?` column. Once fitted, the model is used to recompute this column for dirty_data.csv

In [167]:
def calc_timeOfDay(order_type):
    if order_type == 'Breakfast':
        return 0
    elif order_type == 'Lunch':
        return 1
    elif order_type == 'Dinner':
        return 2

In [168]:
# keep only rows where every field needed for the model is present;
# .copy() avoids pandas' SettingWithCopyWarning when columns are added below
temp_df = missing_df.dropna(subset=['date', 'order_type', 'branch_code', 'distance_to_customer_KM', 'delivery_fee']).copy()

In [169]:
# weekend flag: dayofweek is Monday=0 ... Sunday=6, so values >= 5 are the weekend
temp_df['weekend?'] = temp_df['date'].apply(lambda x: pd.to_datetime(x).dayofweek >= 5)
temp_df['timeofday'] = temp_df['order_type'].apply(calc_timeOfDay)


In [170]:
BK_df = temp_df[temp_df['branch_code'] == 'BK']
NS_df = temp_df[temp_df['branch_code'] == 'NS']
TP_df = temp_df[temp_df['branch_code'] == 'TP']

In [171]:
# Split the BK branch data into features (X) and target variable (y)
X = BK_df[['weekend?', 'timeofday', 'distance_to_customer_KM', 'delivery_fee']]  # Features
y = BK_df['customerHasloyalty?']  # Target variable

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize logistic regression model
BK_log_model = LogisticRegression()

# Train the model
BK_log_model.fit(X_train, y_train)

# Predict on test set
y_pred = BK_log_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Additional evaluation metrics
# print(classification_report(y_test, y_pred))



Accuracy: 1.0


In [172]:
# Split your data into features (X) and target variable (y)
X = NS_df[['weekend?', 'timeofday', 'distance_to_customer_KM', 'delivery_fee']]  # Features
y = NS_df['customerHasloyalty?']  # Target variable

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize logistic regression model
NS_log_model = LogisticRegression()

# Train the model
NS_log_model.fit(X_train, y_train)

# Predict on test set
y_pred = NS_log_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 1.0


In [173]:
# Split your data into features (X) and target variable (y)
X = TP_df[['weekend?', 'timeofday', 'distance_to_customer_KM', 'delivery_fee']]  # Features
y = TP_df['customerHasloyalty?']  # Target variable

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize logistic regression model
TP_log_model = LogisticRegression()

# Train the model
TP_log_model.fit(X_train, y_train)

# Predict on test set
y_pred = TP_log_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 1.0


In [199]:
dirty_df['weekend?'] = dirty_df['date'].apply(lambda x: pd.to_datetime(x).dayofweek >= 5)
dirty_df['timeofday'] = dirty_df['order_type'].apply(calc_timeOfDay)

In [177]:
# dirty_df[dirty_df['branch_code'] == 'BK']['calc_loyalty'] =  BK_log_model.predict(dirty_df[dirty_df['branch_code'] == 'BK'][['weekend?', 'timeofday', 'distance_to_customer_KM', 'delivery_fee']])

In [178]:
# dirty_df[['branch_code', 'order_price', 'distance_to_customer_KM', 'delivery_fee']]

In [179]:
dirty_df['BK_loyalty'] = BK_log_model.predict(dirty_df[['weekend?', 'timeofday', 'distance_to_customer_KM', 'delivery_fee']])
dirty_df['NS_loyalty'] = NS_log_model.predict(dirty_df[['weekend?', 'timeofday', 'distance_to_customer_KM', 'delivery_fee']])
dirty_df['TP_loyalty'] = TP_log_model.predict(dirty_df[['weekend?', 'timeofday', 'distance_to_customer_KM', 'delivery_fee']])

In [180]:
def calc_loyalty(row):
    if row['branch_code'] == 'BK':
        return row['BK_loyalty']
    elif row['branch_code'] == 'NS':
        return row['NS_loyalty']
    elif row['branch_code'] == 'TP':
        return row['TP_loyalty']
    else:
        return None

In [181]:
dirty_df['calc_loyalty'] = dirty_df.apply(calc_loyalty, axis=1)

In [189]:
dirty_df[dirty_df['calc_loyalty'] != dirty_df['customerHasloyalty?']][['customerHasloyalty?', 'calc_loyalty']].head(10)

Unnamed: 0,customerHasloyalty?,calc_loyalty
2,1,0
13,1,0
62,1,0
67,1,0
75,1,0
80,1,0
85,0,1
125,1,0
156,1,0
157,1,0


In [192]:
(dirty_df['calc_loyalty'] != dirty_df['customerHasloyalty?']).sum()

39

From the above, we can see that `customerHasloyalty?` disagrees with the model prediction for 39 orders in the original data. Because the logistic regression models scored 100% accuracy on the held-out test split of the clean data, we treat these 39 values as dirty and assign the values from `calc_loyalty` to `customerHasloyalty?`

In [193]:
dirty_df['customerHasloyalty?'] = dirty_df['calc_loyalty']

To verify that our data has been cleaned properly, we fit another logistic regression model on the data in `dirty_df` and check whether its accuracy is 100%

In [201]:
BK_df = dirty_df[dirty_df['branch_code'] == 'BK']
NS_df = dirty_df[dirty_df['branch_code'] == 'NS']
TP_df = dirty_df[dirty_df['branch_code'] == 'TP']

In [204]:
dirty_df['branch_code']

0      BK
1      NS
2      NS
3      NS
4      BK
       ..
495    NS
496    BK
497    TP
498    TP
499    TP
Name: branch_code, Length: 500, dtype: object

In [206]:
# Split the BK branch data into features (X) and target variable (y)
X = BK_df[['weekend?', 'timeofday', 'distance_to_customer_KM', 'delivery_fee']]  # Features
y = BK_df['customerHasloyalty?']  # Target variable

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize logistic regression model
BK_log_model = LogisticRegression()

# Train the model
BK_log_model.fit(X_train, y_train)

# Predict on test set
y_pred = BK_log_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("BK branch customerHasLoyalty Accuracy:", accuracy)

# Additional evaluation metrics
# print(classification_report(y_test, y_pred))

# Split your data into features (X) and target variable (y)
X = NS_df[['weekend?', 'timeofday', 'distance_to_customer_KM', 'delivery_fee']]  # Features
y = NS_df['customerHasloyalty?']  # Target variable

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize logistic regression model
NS_log_model = LogisticRegression()

# Train the model
NS_log_model.fit(X_train, y_train)

# Predict on test set
y_pred = NS_log_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("NS branch customerHasLoyalty Accuracy:", accuracy)

# Split your data into features (X) and target variable (y)
X = TP_df[['weekend?', 'timeofday', 'distance_to_customer_KM', 'delivery_fee']]  # Features
y = TP_df['customerHasloyalty?']  # Target variable

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize logistic regression model
TP_log_model = LogisticRegression()

# Train the model
TP_log_model.fit(X_train, y_train)

# Predict on test set
y_pred = TP_log_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("TP branch customerHasLoyalty Accuracy:", accuracy)

BK branch customerHasLoyalty Accuracy: 1.0
NS branch customerHasLoyalty Accuracy: 1.0
TP branch customerHasLoyalty Accuracy: 1.0


Note: there is no need to split the model into three; it is enough to convert `branch_code` into numerical attributes and fit a single model.
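
One way to read the note above is to encode `branch_code` as indicator columns so that a single model can be fitted on all rows. The sketch below shows only the encoding step on made-up rows. Since the fee is stated to depend on the inputs differently for each branch, indicator columns alone would only shift the intercept per branch; interaction terms (shown here for distance) are one possible way to let the slopes differ as well. The generated column names are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "branch_code": ["BK", "NS", "TP", "NS"],
    "distance_to_customer_KM": [7.615, 8.914, 8.576, 9.316],
})

# One indicator column per branch: branch_BK, branch_NS, branch_TP.
dummies = pd.get_dummies(toy["branch_code"], prefix="branch").astype(int)

# Interaction terms: distance counted separately for each branch.
interactions = dummies.mul(toy["distance_to_customer_KM"], axis=0).add_suffix("_x_distance")

features = pd.concat([dummies, interactions], axis=1)
print(features.columns.tolist())
```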

In [207]:
# remove generated columns; errors='ignore' keeps the cell safe to re-run
dirty_df.drop(columns=['weekend?', 'timeofday', 'BK_loyalty', 'NS_loyalty', 'TP_loyalty', 'calc_loyalty'], inplace=True, errors='ignore')

## 1.2 Imputing data