Simple outline for this notebook:

The notebook is titled k-nearest p-median because it contains the basic code to achieve Church's model with PuLP.

1. Test the basic model with k = 5 and check the result.

2. Test the basic model with k = 1 and check the result.

3. Add the step that increases the k value, as the basis for a loop.

In [1]:
import pandas as pd
import numpy as np
import pulp
import scipy as sp
import scipy.sparse  # make sure the sparse submodule is loaded before using sp.sparse

### k-nearest p-median

In [2]:
# import data
time_df = pd.read_csv("data/example_subject_student_school_journeys.csv")
students_df = pd.read_csv("data/example_subject_students.csv")
schools_df = pd.read_csv("data/example_subject_schools.csv")

#### When k = 5, no placeholder facility is assigned: the model has an optimal solution using only the k nearest facilities

The first step is to create a new dataframe that contains only the k nearest facilities for each client, taken from the given dataframe.

In [3]:
def k_smallest_from_distance_table(
    travel_times, client_name, cost_column, facility_name, k
):
    """
    Given a table of travel times with client, facility, and cost columns
    (named by `client_name`, `facility_name`, and `cost_column`), make a new
    DataFrame that contains only the `k` lowest-cost facilities for each client.
    """
    k_per_client = (
        travel_times.groupby(client_name)[  # for each client
            cost_column
        ]  # look at their "cost" column
        .nsmallest(k)  # and keep the "k" rows with the smallest cost
        .reset_index()  # and reset the index from the groupby
    )
    k_per_client["facility"] = travel_times.loc[
        k_per_client[
            "level_1"
        ],  # look at each row using the row number from the groupby
        facility_name,  # and find the corresponding facility
    ].values  # and ignore the index on the DataFrame
    return k_per_client.drop(
        "level_1", axis=1
    )  # drop the row number column from the groupby

In [4]:
new_time_df = k_smallest_from_distance_table(time_df, "student", "time", "school", 5)
new_time_df

Unnamed: 0,student,time,facility
0,2,53,IOE01599
1,2,55,IOE01009
2,2,56,IOE01595
3,2,59,IOE00739
4,2,60,IOE01586
5,4,18,IOE01062
6,4,46,IOE00867
7,4,48,IOE00863
8,4,49,IOE00869
9,4,55,IOE01013
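
As a cross-check on `k_smallest_from_distance_table`, the same selection can be sketched by sorting on cost and keeping the first `k` rows per client. This is only a minimal sketch on a made-up table: the `toy` data and the helper name `k_nearest_pairs` are assumptions for illustration, not part of the notebook.

```python
import pandas as pd


def k_nearest_pairs(travel_times, client_name, cost_column, k):
    """Keep the k lowest-cost rows for each client (hypothetical helper)."""
    return (
        travel_times.sort_values([client_name, cost_column])  # cheapest first per client
        .groupby(client_name)
        .head(k)  # first k rows of each client group
        .reset_index(drop=True)
    )


# made-up travel times for two students and three schools
toy = pd.DataFrame(
    {
        "student": [2, 2, 2, 4, 4, 4],
        "school": ["A", "B", "C", "A", "B", "C"],
        "time": [56, 53, 55, 48, 18, 46],
    }
)
nearest_two = k_nearest_pairs(toy, "student", "time", 2)
print(nearest_two)
```

Unlike the `nsmallest` version, this keeps every original column, so no `level_1` lookup is needed to recover the facility.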


The second step is to prepare the k-nearest dataframe for the p-median model.

We need to create new indices for clients and facilities, and add the capacity information from the facility dataframe.
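
The new indices are 0-based positions in the sorted list of unique values. A minimal sketch on made-up values shows that a dense rank minus one gives the same codes as `pd.factorize(..., sort=True)`, which also returns the sorted labels for mapping back later. The variable names here are illustrative assumptions.

```python
import pandas as pd

students = pd.Series([4, 2, 2, 7])
schools = pd.Series(["IOE01599", "IOE01009", "IOE01599", "IOE00739"])

# dense rank minus one: 0-based position among the sorted unique values
student_rank_index = (students.rank(method="dense").astype(int) - 1).tolist()

# sorted factorize: the same codes, plus the sorted unique labels
student_codes, student_labels = pd.factorize(students, sort=True)
school_codes, school_labels = pd.factorize(schools, sort=True)

print(student_rank_index, student_codes.tolist())
print(school_codes.tolist(), list(school_labels))
```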

In [5]:
student_indices = range(new_time_df["student"].nunique())
student_indices

range(0, 10)

In [6]:
school_indices = range(new_time_df["facility"].nunique())
school_indices

range(0, 32)

In [7]:
# create a new 0-based index for each school/facility, following the sorted order of its code
# this index is used as the column index of the sparse cost matrix built in the model,
# so that we can easily refer to facilities in the p-median model
new_time_df["school_new_index"] = (
    new_time_df["facility"].rank(method="dense").astype(int) - 1
)
new_time_df

Unnamed: 0,student,time,facility,school_new_index
0,2,53,IOE01599,30
1,2,55,IOE01009,23
2,2,56,IOE01595,29
3,2,59,IOE00739,14
4,2,60,IOE01586,28
5,4,18,IOE01062,25
6,4,46,IOE00867,21
7,4,48,IOE00863,20
8,4,49,IOE00869,22
9,4,55,IOE01013,24


In [8]:
# also the new student index
new_time_df["student_new_index"] = (
    new_time_df["student"].rank(method="dense").astype(int) - 1
)
new_time_df

Unnamed: 0,student,time,facility,school_new_index,student_new_index
0,2,53,IOE01599,30,0
1,2,55,IOE01009,23,0
2,2,56,IOE01595,29,0
3,2,59,IOE00739,14,0
4,2,60,IOE01586,28,0
5,4,18,IOE01062,25,1
6,4,46,IOE00867,21,1
7,4,48,IOE00863,20,1
8,4,49,IOE00869,22,1
9,4,55,IOE01013,24,1


In [9]:
# this model takes capacity into account, so we add it from 'schools_df'
new_time_df = new_time_df.merge(
    schools_df[["SE2 PP: Code", "Count"]],
    left_on="facility",
    right_on="SE2 PP: Code",
    how="left",
)
new_time_df

Unnamed: 0,student,time,facility,school_new_index,student_new_index,SE2 PP: Code,Count
0,2,53,IOE01599,30,0,IOE01599,1
1,2,55,IOE01009,23,0,IOE01009,1
2,2,56,IOE01595,29,0,IOE01595,1
3,2,59,IOE00739,14,0,IOE00739,1
4,2,60,IOE01586,28,0,IOE01586,2
5,4,18,IOE01062,25,1,IOE01062,1
6,4,46,IOE00867,21,1,IOE00867,2
7,4,48,IOE00863,20,1,IOE00863,1
8,4,49,IOE00869,22,1,IOE00869,1
9,4,55,IOE01013,24,1,IOE01013,1
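
Because the capacity constraint later reads `Count` for every facility, it may be worth checking the merge before solving. The sketch below uses made-up tables (the code `IOE99999` is hypothetical) and the `validate` argument of `DataFrame.merge` to guard against duplicated school codes, then lists facilities that received no capacity.

```python
import pandas as pd

pairs = pd.DataFrame({"facility": ["IOE01599", "IOE01586", "IOE99999"]})
capacities = pd.DataFrame(
    {"SE2 PP: Code": ["IOE01599", "IOE01586"], "Count": [1, 2]}
)

merged = pairs.merge(
    capacities,
    left_on="facility",
    right_on="SE2 PP: Code",
    how="left",
    validate="many_to_one",  # raises MergeError if a school code is duplicated
)

# facilities without a matching capacity row end up with NaN in "Count"
missing_capacity = merged.loc[merged["Count"].isna(), "facility"].tolist()
print(missing_capacity)
```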


The data is now ready for the model.
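
For reference, the model coded in `setup_from_travel_table` can be written out as follows. The notation is an assumption introduced here, not taken from the notebook: $I$ is the set of clients, $N_k(i)$ the $k$ nearest facilities of client $i$, $c_{ij}$ the travel time, $q_j$ the capacity (`Count`), $x_{ij}$ the assignment variable, and $g_i$ the placeholder variable.

$$
\begin{aligned}
\min \quad & \sum_{i \in I} \Big( \sum_{j \in N_k(i)} c_{ij}\, x_{ij} + \big( \max_{j \in N_k(i)} c_{ij} + 1 \big)\, g_i \Big) \\
\text{s.t.} \quad & \sum_{j \in N_k(i)} x_{ij} + g_i = 1 && \forall i \in I \\
& \sum_{i \,:\, j \in N_k(i)} x_{ij} \le q_j && \forall j \\
& x_{ij},\ g_i \in \{0, 1\}
\end{aligned}
$$

The placeholder is therefore only one unit more expensive than a client's farthest stored facility.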

In [10]:
def setup_from_travel_table(distance_df, client_indices, facility_indices):
    """
    Set up and solve the k-nearest p-median problem
    from the distance dataframe we prepared.
    """
    # build the sparse matrix of distance/cost
    # in this matrix, only the distance between clients and k nearest facilities will be stored
    row = distance_df['student_new_index'].values
    col = distance_df['school_new_index'].values
    data = distance_df['time'].values
    sparse_matrix = sp.sparse.csr_array((data, (row, col)))

    # set up the problem
    problem = pulp.LpProblem("k-nearest-p-median", pulp.LpMinimize)

    # set the decision variable for client and k nearest facilities
    decision = pulp.LpVariable.dicts(
        "x",
        (
            (row["student_new_index"], row["school_new_index"])
            for _, row in distance_df.iterrows()
        ),
        0,
        1,
        pulp.LpBinary,
    )

    # set the decision variable for placeholder facility
    decision_g = pulp.LpVariable.dicts("g", (i for i in client_indices), 0, 1, pulp.LpBinary)

    # the placeholder facility costs each client its largest stored distance plus one,
    # so we need the maximum distance for each client
    max_distance = sparse_matrix.max(axis=1).toarray().flatten()

    # set the objective
    objective = pulp.lpSum(
        pulp.lpSum(decision.get((i, j), 0) * sparse_matrix[i, j] for j in facility_indices) + (
            decision_g[i] * (max_distance[i] + 1)
        )
        for i in client_indices
    )
    problem += objective

    # constraint 1. Each client is assigned to a facility
    for i in client_indices:
        problem += pulp.lpSum(decision.get((i, j), 0) for j in facility_indices) + decision_g[i] == 1

    # constraint 2. The number of clients assigned to a facility is no more than its capacity.
    for j in facility_indices:
        count = distance_df.loc[distance_df["school_new_index"] == j, "Count"].values[0]
        problem += pulp.lpSum(decision.get((i, j), 0) for i in client_indices) <= count

    problem.solve(pulp.PULP_CBC_CMD(msg=False))

    return problem, decision, decision_g
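
To see how the sparse cost matrix inside the function behaves, here is a minimal sketch with three made-up client-facility pairs. It builds a `csr_array` from `(data, (row, col))` triplets in the same way and shows the inferred shape, the value read for a pair that was never stored, and the row maxima.

```python
import numpy as np
from scipy.sparse import csr_array

rows = np.array([0, 0, 1])  # client indices
cols = np.array([2, 0, 1])  # facility indices
times = np.array([53, 55, 18])  # travel times for the stored pairs

travel = csr_array((times, (rows, cols)))

print(travel.shape)  # shape is inferred from the largest row and column index
print(travel[1, 0])  # a pair that was never stored reads as 0

# largest stored time per client, flattened to a 1-D array
max_time = travel.max(axis=1).toarray().flatten()
print(max_time)
```

A missing pair reads as 0, which looks like a free trip, so the matrix alone cannot tell which pairs are allowed. In the function, that role is played by `decision.get((i, j), 0)`, which drops every pair without a decision variable.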

In [11]:
prob, prob_decision, decision_g = setup_from_travel_table(new_time_df, student_indices, school_indices)

In [12]:
# check if the decision variable is correct
prob_decision

{(0, 30): x_(0,_30),
 (0, 23): x_(0,_23),
 (0, 29): x_(0,_29),
 (0, 14): x_(0,_14),
 (0, 28): x_(0,_28),
 (1, 25): x_(1,_25),
 (1, 21): x_(1,_21),
 (1, 20): x_(1,_20),
 (1, 22): x_(1,_22),
 (1, 24): x_(1,_24),
 (2, 2): x_(2,_2),
 (2, 9): x_(2,_9),
 (2, 8): x_(2,_8),
 (2, 6): x_(2,_6),
 (2, 7): x_(2,_7),
 (3, 0): x_(3,_0),
 (3, 5): x_(3,_5),
 (3, 17): x_(3,_17),
 (3, 16): x_(3,_16),
 (3, 13): x_(3,_13),
 (4, 3): x_(4,_3),
 (4, 11): x_(4,_11),
 (4, 26): x_(4,_26),
 (4, 10): x_(4,_10),
 (4, 13): x_(4,_13),
 (5, 29): x_(5,_29),
 (5, 1): x_(5,_1),
 (5, 12): x_(5,_12),
 (5, 28): x_(5,_28),
 (5, 0): x_(5,_0),
 (6, 23): x_(6,_23),
 (6, 24): x_(6,_24),
 (6, 14): x_(6,_14),
 (6, 31): x_(6,_31),
 (6, 15): x_(6,_15),
 (7, 17): x_(7,_17),
 (7, 19): x_(7,_19),
 (7, 10): x_(7,_10),
 (7, 4): x_(7,_4),
 (7, 18): x_(7,_18),
 (8, 25): x_(8,_25),
 (8, 27): x_(8,_27),
 (8, 21): x_(8,_21),
 (8, 22): x_(8,_22),
 (8, 20): x_(8,_20),
 (9, 31): x_(9,_31),
 (9, 20): x_(9,_20),
 (9, 21): x_(9,_21),
 (9, 12): x_(9

In [13]:
# also check decision variable for placeholder facility
decision_g

{0: g_0,
 1: g_1,
 2: g_2,
 3: g_3,
 4: g_4,
 5: g_5,
 6: g_6,
 7: g_7,
 8: g_8,
 9: g_9}

In [14]:
# check whether the problem has an optimal solution: a status of 1 means optimal,
# and a status of -1 means the problem is infeasible
prob.status

1
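
The numeric status can be turned into a readable label with PuLP's `LpStatus` dictionary. A minimal sketch, independent of the model above:

```python
import pulp

# map a few numeric status codes to their names
for code in (pulp.LpStatusOptimal, pulp.LpStatusInfeasible, pulp.LpStatusNotSolved):
    print(code, pulp.LpStatus[code])
```

For a solved problem, `pulp.LpStatus[prob.status]` gives the label directly.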

In [15]:
# print the model result
for i in student_indices:
    for j in school_indices:
        if (i, j) in prob_decision and prob_decision[(i, j)].value() == 1:
            print("student " + str(i) + " is served by school " + str(j))

student 0 is served by school 30
student 1 is served by school 25
student 2 is served by school 9
student 3 is served by school 0
student 4 is served by school 11
student 5 is served by school 29
student 6 is served by school 23
student 7 is served by school 17
student 8 is served by school 27
student 9 is served by school 31
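
The printed result uses the new indices rather than the original identifiers. One way to translate them back is sketched below, with three rows copied from the prepared table. The lookup dictionaries and the `assigned` list are illustrative assumptions.

```python
import pandas as pd

pairs = pd.DataFrame(
    {
        "student": [2, 2, 4],
        "facility": ["IOE01599", "IOE01009", "IOE01062"],
        "school_new_index": [30, 23, 25],
        "student_new_index": [0, 0, 1],
    }
)

# one entry per new index, pointing back to the original identifier
school_code = (
    pairs.drop_duplicates("school_new_index")
    .set_index("school_new_index")["facility"]
    .to_dict()
)
student_id = (
    pairs.drop_duplicates("student_new_index")
    .set_index("student_new_index")["student"]
    .to_dict()
)

assigned = [(0, 30), (1, 25)]  # (student index, school index) pairs from the solution
for i, j in assigned:
    print(f"student {student_id[i]} is served by school {school_code[j]}")
```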


In [16]:
# check if any placeholder facility is assigned/selected
# here, no placeholder facility is assigned
for i in student_indices:
    if decision_g[i].value() > 0:
        print("student " + str(i) + " is served by schools far away ")

#### When k = 1, a placeholder facility is assigned: the model would be infeasible with only the k nearest facilities

We prepare the data in the same way as in the k = 5 case.

In [17]:
new_time_df_k_1 = k_smallest_from_distance_table(time_df, "student", "time", "school", 1)
new_time_df_k_1

Unnamed: 0,student,time,facility
0,2,53,IOE01599
1,4,18,IOE01062
2,7,8,IOE00128
3,9,130,IOE00044
4,13,83,IOE00172
5,14,39,IOE01595
6,22,78,IOE01009
7,25,102,IOE00812
8,35,79,IOE01062
9,41,96,IOE01839


In [18]:
school_indices_k_1 = range(new_time_df_k_1["facility"].nunique())
school_indices_k_1

range(0, 9)

In [19]:
new_time_df_k_1["school_new_index"] = (
    new_time_df_k_1["facility"].rank(method="dense").astype(int) - 1
)
new_time_df_k_1["student_new_index"] = (
    new_time_df_k_1["student"].rank(method="dense").astype(int) - 1
)
new_time_df_k_1

Unnamed: 0,student,time,facility,school_new_index,student_new_index
0,2,53,IOE01599,7,0
1,4,18,IOE01062,5,1
2,7,8,IOE00128,1,2
3,9,130,IOE00044,0,3
4,13,83,IOE00172,2,4
5,14,39,IOE01595,6,5
6,22,78,IOE01009,4,6
7,25,102,IOE00812,3,7
8,35,79,IOE01062,5,8
9,41,96,IOE01839,8,9


In [20]:
new_time_df_k_1 = new_time_df_k_1.merge(
    schools_df[["SE2 PP: Code", "Count"]],
    left_on="facility",
    right_on="SE2 PP: Code",
    how="left",
)
new_time_df_k_1

Unnamed: 0,student,time,facility,school_new_index,student_new_index,SE2 PP: Code,Count
0,2,53,IOE01599,7,0,IOE01599,1
1,4,18,IOE01062,5,1,IOE01062,1
2,7,8,IOE00128,1,2,IOE00128,1
3,9,130,IOE00044,0,3,IOE00044,1
4,13,83,IOE00172,2,4,IOE00172,1
5,14,39,IOE01595,6,5,IOE01595,1
6,22,78,IOE01009,4,6,IOE01009,1
7,25,102,IOE00812,3,7,IOE00812,1
8,35,79,IOE01062,5,8,IOE01062,1
9,41,96,IOE01839,8,9,IOE01839,1


In [21]:
prob_k_1, decision_k_1, decision_k_1_g = setup_from_travel_table(new_time_df_k_1, student_indices, school_indices_k_1)

In [22]:
prob_k_1.status

1

In [23]:
decision_k_1

{(0, 7): x_(0,_7),
 (1, 5): x_(1,_5),
 (2, 1): x_(2,_1),
 (3, 0): x_(3,_0),
 (4, 2): x_(4,_2),
 (5, 6): x_(5,_6),
 (6, 4): x_(6,_4),
 (7, 3): x_(7,_3),
 (8, 5): x_(8,_5),
 (9, 8): x_(9,_8)}

In [24]:
decision_k_1_g

{0: g_0,
 1: g_1,
 2: g_2,
 3: g_3,
 4: g_4,
 5: g_5,
 6: g_6,
 7: g_7,
 8: g_8,
 9: g_9}

In [25]:
for i in student_indices:
    for j in school_indices_k_1:
        if (i, j) in decision_k_1 and decision_k_1[(i, j)].value() == 1:
            print("student " + str(i) + " is served by school " + str(j))

student 0 is served by school 7
student 2 is served by school 1
student 3 is served by school 0
student 4 is served by school 2
student 5 is served by school 6
student 6 is served by school 4
student 7 is served by school 3
student 8 is served by school 5
student 9 is served by school 8


In [26]:
for i in student_indices:
    if decision_k_1_g[i].value() > 0:
        print("student " + str(i) + " is served by schools far away ")

student 1 is served by schools far away 


From the model results, we can see that:
1. The model has an optimal solution.
2. In the printed assignments from `decision_k_1`, `student 1` is missing.
3. Meanwhile, the value of `decision_k_1_g[1]` is 1 (greater than 0), showing that this placeholder facility is used.
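
It is worth noting why student 1, rather than student 8, ends up on the placeholder. Both have IOE01062 as their only candidate, and with k = 1 the placeholder costs each client its own time plus one. The arithmetic below uses the times from the k = 1 table and assumes these two students interact with the rest of the problem only through this school.

```python
# travel times to the shared nearest school (IOE01062), from the k = 1 table
time_student_1 = 18
time_student_8 = 79

# with k = 1, the placeholder costs the client's only stored time plus one
placeholder_1 = time_student_1 + 1
placeholder_8 = time_student_8 + 1

# total cost of the two students under each possible outcome
cost_if_1_displaced = placeholder_1 + time_student_8
cost_if_8_displaced = time_student_1 + placeholder_8
print(cost_if_1_displaced, cost_if_8_displaced)
```

Both outcomes cost the same, so the objective does not distinguish between them and the solver is free to return either.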

#### If any g_i is nonzero, increase the k_i value for that observation and try again. 

In [28]:
# check if any g_i is nonzero, and increase the k value for client i
# create the new k value list
k_replace = [1] * len(student_indices)
for i in student_indices:
    if decision_k_1_g[i].value() > 0:
        k_replace[i] = 2

In [29]:
k_replace

[1, 2, 1, 1, 1, 1, 1, 1, 1, 1]

In [32]:
# the first way is to build a new dataframe and pass it to the model again
# this 'restarts' the model from scratch every time
def recreate_k_smallest_from_distance_table(travel_times, client_name, cost_column, facility_name, k_list):
    """
    Keep the `k` lowest-cost rows for each client, with `k` read per client from `k_list`.
    `k_list` is matched to clients in order of first appearance in `travel_times`,
    so that order must agree with `student_new_index`.
    """
    pieces = []  # one small DataFrame per client

    for client, k in zip(travel_times[client_name].unique(), k_list):
        k_per_client = (
            travel_times[travel_times[client_name] == client]
            .nsmallest(k, cost_column)
            .reset_index(drop=True)
        )
        pieces.append(k_per_client)

    # concatenate once at the end instead of growing a DataFrame inside the loop
    result = pd.concat(pieces, ignore_index=True)
    result["facility"] = result[facility_name]
    result = result.drop(columns=[facility_name])

    return result

In [49]:
new_time_df_k_1_list = recreate_k_smallest_from_distance_table(time_df, "student", "time", "school", k_replace)
new_time_df_k_1_list

Unnamed: 0,student,time,message,facility
0,2,53,Walk to Garston (Herts) Rail Station THEN West...,IOE01599
1,4,18,Walk to Barwick Drive THEN A10 bus to The Gree...,IOE01062
2,4,46,Walk to Merrimans Cnr /Harlington Road THEN U4...,IOE00867
3,7,8,Cycle to E2 8LS,IOE00128
4,9,130,"Walk to Hollington (Hastings), Bodiam Drive TH...",IOE00044
5,13,83,"Walk to The Ridgeway, Fetcham THEN 465 bus to ...",IOE00172
6,14,39,Walk to Harpenden Rail Station THEN Thameslink...,IOE01595
7,22,78,Walk to Aylesbury THEN Chiltern Railways to Ha...,IOE01009
8,25,102,"Walk to Strood Green (Surrey), Wellhouse Lane ...",IOE00812
9,35,79,"Walk to Maidenhead, All Saints' Church THEN 3 ...",IOE01062


In [50]:
# prepare the data like the previous steps
school_indices_k_1_list = range(new_time_df_k_1_list["facility"].nunique())

new_time_df_k_1_list["school_new_index"] = (
    new_time_df_k_1_list["facility"].rank(method="dense").astype(int) - 1
)
new_time_df_k_1_list["student_new_index"] = (
    new_time_df_k_1_list["student"].rank(method="dense").astype(int) - 1
)

new_time_df_k_1_list = new_time_df_k_1_list.merge(
    schools_df[["SE2 PP: Code", "Count"]],
    left_on="facility",
    right_on="SE2 PP: Code",
    how="left",
)
new_time_df_k_1_list

Unnamed: 0,student,time,message,facility,school_new_index,student_new_index,SE2 PP: Code,Count
0,2,53,Walk to Garston (Herts) Rail Station THEN West...,IOE01599,8,0,IOE01599,1
1,4,18,Walk to Barwick Drive THEN A10 bus to The Gree...,IOE01062,6,1,IOE01062,1
2,4,46,Walk to Merrimans Cnr /Harlington Road THEN U4...,IOE00867,4,1,IOE00867,2
3,7,8,Cycle to E2 8LS,IOE00128,1,2,IOE00128,1
4,9,130,"Walk to Hollington (Hastings), Bodiam Drive TH...",IOE00044,0,3,IOE00044,1
5,13,83,"Walk to The Ridgeway, Fetcham THEN 465 bus to ...",IOE00172,2,4,IOE00172,1
6,14,39,Walk to Harpenden Rail Station THEN Thameslink...,IOE01595,7,5,IOE01595,1
7,22,78,Walk to Aylesbury THEN Chiltern Railways to Ha...,IOE01009,5,6,IOE01009,1
8,25,102,"Walk to Strood Green (Surrey), Wellhouse Lane ...",IOE00812,3,7,IOE00812,1
9,35,79,"Walk to Maidenhead, All Saints' Church THEN 3 ...",IOE01062,6,8,IOE01062,1


In [51]:
prob_k_1_list, decision_k_1_list, decision_k_1_g_list = setup_from_travel_table(new_time_df_k_1_list, student_indices, school_indices_k_1_list)

In [52]:
prob_k_1_list.status

1

In [53]:
decision_k_1_list

{(0, 8): x_(0,_8),
 (1, 6): x_(1,_6),
 (1, 4): x_(1,_4),
 (2, 1): x_(2,_1),
 (3, 0): x_(3,_0),
 (4, 2): x_(4,_2),
 (5, 7): x_(5,_7),
 (6, 5): x_(6,_5),
 (7, 3): x_(7,_3),
 (8, 6): x_(8,_6),
 (9, 9): x_(9,_9)}

In [54]:
for i in student_indices:
    if decision_k_1_g_list[i].value() > 0:
        print("student " + str(i) + " is served by schools far away ")

student 8 is served by schools far away 


In [56]:
for i in student_indices:
    for j in school_indices_k_1_list:
        if (i, j) in decision_k_1_list and decision_k_1_list[(i, j)].value() == 1:
            print("student " + str(i) + " is served by school " + str(j))

student 0 is served by school 8
student 1 is served by school 6
student 2 is served by school 1
student 3 is served by school 0
student 4 is served by school 2
student 5 is served by school 7
student 6 is served by school 5
student 7 is served by school 3
student 9 is served by school 9


The result shows that 'restarting' the model may lead to unexpected outcomes.

In the earlier `k = 1` model, student 1 is assigned to the placeholder (faraway) facility, and student 8 is assigned to a school 79 minutes away. This is because student 1 and student 8 share the same nearest facility, and that facility can only accommodate one student.

In the new model, where we increase the k value to 2 for student 1 only, student 1 is assigned to its nearest facility, while student 8 is assigned to the placeholder facility.
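
This outcome follows from the objective once the costs are written out. The sketch below enumerates every option for the two students by brute force, using the times and capacities from the tables above. It assumes no other student competes for these two schools, and the data structures are illustrative only.

```python
from itertools import product

# candidate schools and travel times per student index; None means the placeholder
options = {
    1: {"IOE01062": 18, "IOE00867": 46},
    8: {"IOE01062": 79},
}
capacity = {"IOE01062": 1, "IOE00867": 2}


def placeholder_cost(times):
    """Largest stored time plus one, as in the model objective."""
    return max(times.values()) + 1


best = None
for choice_1, choice_8 in product([*options[1], None], [*options[8], None]):
    chosen = [c for c in (choice_1, choice_8) if c is not None]
    # skip combinations that put too many students in one school
    if any(chosen.count(s) > capacity[s] for s in set(chosen)):
        continue
    cost = sum(
        options[i][c] if c is not None else placeholder_cost(options[i])
        for i, c in ((1, choice_1), (8, choice_8))
    )
    if best is None or cost < best[0]:
        best = (cost, choice_1, choice_8)

print(best)
```

Keeping student 1 at the nearest school and moving student 8 to the placeholder costs 18 + 80 = 98, whereas sending student 1 to the second school costs 46 + 79 = 125. Because the placeholder penalty is only one unit above a client's farthest stored time, displacing student 8 is the cheaper choice.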

The next step I am considering is either to write the model so that it can be re-solved (`resolved`) without rebuilding it, or to keep increasing the k value, now for student 8, and `restart` again.
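
The `restarting` option could be wrapped in a loop along the following lines. This is only a sketch under stated assumptions: `grow_k_until_assigned` and `toy_solve` are hypothetical names, and `toy_solve` is a stand-in that ignores interactions between clients. In the notebook it would be replaced by rebuilding the dataframe, calling `setup_from_travel_table`, and collecting every `i` with `g_i > 0`.

```python
def grow_k_until_assigned(solve, n_clients, k_start=1, k_max=5):
    """Re-solve with a larger k for every client left on the placeholder."""
    k_list = [k_start] * n_clients
    while True:
        on_placeholder = solve(k_list)  # client indices i with g_i > 0
        can_grow = [i for i in on_placeholder if k_list[i] < k_max]
        if not can_grow:
            # either everyone is assigned, or the remaining clients hit k_max
            return k_list, on_placeholder
        for i in can_grow:
            k_list[i] += 1


# stand-in for the real model: client i is assigned once k_list[i] reaches needed_k[i]
needed_k = [1, 2, 1, 3]


def toy_solve(k_list):
    return [i for i, k in enumerate(k_list) if k < needed_k[i]]


final_k, unassigned = grow_k_until_assigned(toy_solve, len(needed_k))
print(final_k, unassigned)
```

The `k_max` cap guarantees termination even when some client can never be assigned. With the real model, raising k for one client can move a different client onto the placeholder, as happened with students 1 and 8, so several rounds may be needed.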