## Notes

### Input

The input is the response spreadsheet of the combined student form. Each row holds one student's responses, and each column corresponds to one field of the form:

> `A`: Timestamp
> `B`: Email Address
> `C`: Gender
>
> `D`: Is the last digit of your UNI even or odd?


#### Group 1 (Course bidding + Preference)

- `E-L`: Bids for *Business Analytics, Cloud Computing, Machine Learning, Data Analytics, Optimization, Stochastic, Simulation, Computational Discrete Optimization*

- `M-T`: Rank for the courses (No.1 - No.8)


#### Group 2 (Preference + Course bidding + Timeslot bidding)

- `U-AB`: Rank for the courses (No.1 - No.8)

- `AC-AJ`: Bids for *Business Analytics, Cloud Computing, Machine Learning, Data Analytics, Optimization, Stochastic, Simulation, Computational Discrete Optimization*

- `AK-AN`: Bids for time slots (9-11 am, 12-2 pm, 3-5 pm, 6-8 pm)


### Output

`(Student, [3 Courses])` assignment. Each student is assigned to at least one semi-core course.
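
The shape of this output can be checked mechanically. The following is a minimal sketch, assuming the result is a dict from student ID to a list of course codes (as `result` appears later in this notebook) and that the semi-core courses are `c1`, `c3`, `c5`, `c6` (as stated under Course Names). The helper name `is_valid_assignment` is hypothetical and is not part of `assign.py`.

```python
SEMI_CORE = {"c1", "c3", "c5", "c6"}  # semi-core courses, per the Course Names section

def is_valid_assignment(result, n_courses=3):
    """Return True if every student has n_courses distinct courses,
    at least one of which is semi-core."""
    for student, courses in result.items():
        if len(courses) != n_courses or len(set(courses)) != n_courses:
            return False
        if not SEMI_CORE.intersection(courses):
            return False
    return True

# Illustrative data only (not taken from the experiment)
example = {"s1": ["c2", "c7", "c1"], "s2": ["c4", "c8", "c3"]}
print(is_valid_assignment(example))                     # True
print(is_valid_assignment({"s3": ["c2", "c4", "c7"]}))  # False: no semi-core course
```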

## Methodologies to test

### Structure of Notebook

* Summary
* Data Processing
* Section 1: Preference Generator from Bids and Lotteries
* Section 2: Schedules for Round 2 algorithm
* Section 3: Two-round algorithm 


**Experimental Groups**
* a) Students give a strict ordering of the classes;
* b) Students bid on classes, so that the bids sum to <= 100;
* c) Students bid on time slots (each slot covers the classes taught in it), so that the bids sum to <= 100;


**Tests**
1. Ignore a; use b to infer students' preferences, with class preferences given by higher bids; [still ask for a, to compare with b and c]
2. Use a and b;
3. Use a and c;
4. Ignore b and c; class preferences are given by a unique lottery.



# Summary

In [1]:
import pandas as pd
import numpy as np

In [2]:
# authenticate
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
import gspread

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# your_module = drive.CreateFile({'id':'1KCokg0NCRyjucZN-qTEIE2-7pHg-XIRS'})
your_module = drive.CreateFile({'id':'1PlSWCGFR8itlXd9xKv3rdIsVQbbRfMf1'})
your_module.GetContentFile('assign.py')

# import the response spreadsheet
gc = gspread.authorize(GoogleCredentials.get_application_default())
wb = gc.open_by_url('https://docs.google.com/spreadsheets/d/1LKi4I4VMGEK2_XBi8zYewFSvqx75nb8HFX9Oo08a8qE/edit#gid=1255728861')
data = wb.sheet1.get_all_values()

**Experiment 1: With original parameters**
* varied capacities
* assume higher ranked course is always preferred

In [1]:
import pandas as pd

filename = "../StudentForm (Combined) (Responses).xlsx"
data = pd.read_excel(filename, header=None)

In [2]:
from assign import Assign

# Initialize with data
asgn = Assign(data)

# Run each applicable test on each group
test1on1 = asgn.test(test=1, group=1)
test2on1 = asgn.test(test=2, group=1)
test4on1 = asgn.test(test=4, group=1)

test1on2 = asgn.test(test=1, group=2)
test2on2 = asgn.test(test=2, group=2)
test3on2 = asgn.test(test=3, group=2)
test4on2 = asgn.test(test=4, group=2)

Assigning according to test 1 with group 1 ...
	Part I (Semi-core): Number of GS rounds: 1
	Part II (General) : Number of GS rounds: 4
Assigning according to test 2 with group 1 ...
	Part I (Semi-core): Number of GS rounds: 2
	Part II (General) : Number of GS rounds: 5
Assigning according to test 4 with group 1 ...
	Part I (Semi-core): Number of GS rounds: 2
	Part II (General) : Number of GS rounds: 4
Assigning according to test 1 with group 2 ...
	Part I (Semi-core): Number of GS rounds: 2
	Part II (General) : Number of GS rounds: 5
Assigning according to test 2 with group 2 ...
	Part I (Semi-core): Number of GS rounds: 2
	Part II (General) : Number of GS rounds: 4
Assigning according to test 3 with group 2 ...
	Part I (Semi-core): Number of GS rounds: 2
	Part II (General) : Number of GS rounds: 5
Assigning according to test 4 with group 2 ...
	Part I (Semi-core): Number of GS rounds: 2
	Part II (General) : Number of GS rounds: 6


In [13]:
# E.g., inspect the assignment result of test 1 on group 2
# the last course in each list is the semi-core requirement
result = test1on2.result

In [14]:
# check stability of the test 1 result on group 2
Assign(data).checkStability(result, test=1, group=2)

Check complete!


In [14]:
coursePair = [{'c1', 'c4'}, {'c2', 'c8'}, {'c3', 'c5'}, {'c6', 'c7'}]

def coursePairOf(c):
    for s in coursePair:
        if c in s:
            c2 = list(s)
            c2.remove(c)
            return c2[0]
        
coursePairOf('c8')

'c2'

In [22]:
# resultCourse = Assign(data).courseToStudentView(result)
# resultCourse
course = ['c1','c2','c3','c4','c5','c6','c7','c8']

resultByCourse = {c: [] for c in course}
for a in result.keys():
    for c in result[a]:
        resultByCourse[c].append(a)
        
resultByCourse

{'c1': ['mds225',
  'qz2391',
  'vml213',
  'js5553',
  'ma3973',
  'yd2547',
  'pa2561',
  'sc4619',
  'sa3763',
  'yf2507',
  'sj2993',
  'sc4617',
  'wg2347',
  'sc4811'],
 'c2': ['tnw211', 'mds225', 'sc4597', 'yp2555', 'qz2391', 'vml213'],
 'c3': ['sc4597',
  'yp2555',
  'vml213',
  'js5553',
  'da2899',
  'xm2235',
  'sg3775',
  'lh2991',
  'rs4011',
  'wr2325',
  'yw3379',
  'cf2799'],
 'c4': ['da2899', 'zp2215', 'xm2235', 'sg3775', 'lh2991', 'rs4011'],
 'c5': ['tnw211',
  'zp2215',
  'ma3973',
  'yd2547',
  'pa2561',
  'sc4619',
  'sa3763',
  'yf2507',
  'sj2993',
  'sc4617'],
 'c6': ['tnw211',
  'mds225',
  'sc4597',
  'yp2555',
  'qz2391',
  'js5553',
  'zp2215',
  'xm2235',
  'sg3775',
  'rs4011',
  'ma3973',
  'yd2547',
  'pa2561',
  'sa3763',
  'yf2507',
  'wr2325',
  'wg2347',
  'sc4811',
  'yw3379'],
 'c7': ['da2899', 'lh2991', 'sc4619', 'sj2993', 'sc4617', 'cf2799'],
 'c8': ['wr2325', 'wg2347', 'sc4811', 'yw3379', 'cf2799']}

In [24]:
(df_group1, df_group2) = Assign(data).preprocess()
pref = Assign(data).get_pref(df_group2)
pref

{'wr2325': ['c3', 'c8', 'c5', 'c4', 'c6', 'c7', 'c2', 'c1'],
 'sj2993': ['c1', 'c4', 'c3', 'c2', 'c5', 'c7', 'c8', 'c6'],
 'sa3763': ['c1', 'c6', 'c5', 'c7', 'c4', 'c3', 'c2', 'c8'],
 'yf2507': ['c4', 'c1', 'c3', 'c5', 'c6', 'c7', 'c8', 'c2'],
 'js5553': ['c1', 'c3', 'c2', 'c4', 'c5', 'c6', 'c7', 'c8'],
 'sc4597': ['c3', 'c1', 'c2', 'c5', 'c6', 'c8', 'c7', 'c4'],
 'pa2561': ['c1', 'c4', 'c5', 'c6', 'c7', 'c2', 'c3', 'c8'],
 'cf2799': ['c3', 'c4', 'c5', 'c1', 'c2', 'c7', 'c6', 'c8'],
 'sc4617': ['c1', 'c5', 'c7', 'c2', 'c3', 'c6', 'c8', 'c4'],
 'ma3973': ['c1', 'c3', 'c5', 'c2', 'c4', 'c6', 'c7', 'c8'],
 'qz2391': ['c4', 'c1', 'c2', 'c3', 'c6', 'c5', 'c7', 'c8'],
 'tnw211': ['c2', 'c1', 'c6', 'c5', 'c7', 'c3', 'c4', 'c8'],
 'xm2235': ['c4', 'c3', 'c6', 'c1', 'c7', 'c5', 'c2', 'c8'],
 'da2899': ['c4', 'c2', 'c3', 'c1', 'c7', 'c8', 'c5', 'c6'],
 'wg2347': ['c4', 'c1', 'c6', 'c8', 'c5', 'c2', 'c3', 'c7'],
 'yp2555': ['c1', 'c4', 'c3', 'c2', 'c5', 'c6', 'c7', 'c8'],
 'yw3379': ['c3', 'c2', 

In [25]:
bid = Assign(data).modified_bid(df_group2)
bid

| UNI | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 |
|---|---|---|---|---|---|---|---|---|
| wr2325 | 5.37454 | 5.950714 | 25.731994 | 10.598658 | 10.156019 | 5.155995 | 5.058084 | 35.866176 |
| sj2993 | 30.601115 | 15.708073 | 20.020584 | 25.96991 | 3.832443 | 1.212339 | 3.181825 | 3.183405 |
| sa3763 | 40.304242 | 5.524756 | 10.431945 | 20.291229 | 5.611853 | 5.139494 | 10.292145 | 5.366362 |
| yf2507 | 30.45607 | 5.785176 | 5.199674 | 50.514234 | 5.592415 | 5.04645 | 0.607545 | 0.170524 |
| js5553 | 40.065052 | 15.948886 | 25.965632 | 10.808397 | 5.304614 | 5.097672 | 0.684233 | 0.440152 |
| sc4597 | 25.122038 | 25.495177 | 40.034389 | 0.90932 | 5.25878 | 5.662522 | 0.311711 | 0.520068 |
| pa2561 | 80.54671 | 0.184854 | 0.969585 | 10.775133 | 10.939499 | 0.894827 | 0.5979 | 0.921874 |
| cf2799 | 40.088493 | 0.195983 | 35.045227 | 15.32533 | 10.388677 | 0.271349 | 0.828738 | 0.356753 |
| sc4617 | 93.280935 | 1.542696 | 1.140924 | 1.802197 | 1.074551 | 1.986887 | 1.772245 | 1.198716 |
| ma3973 | 40.005522 | 10.815461 | 10.706857 | 0.729007 | 30.77127 | 10.074045 | 0.358466 | 0.115869 |


In [33]:
coursePref = {c: bid.sort_values(by=[c], ascending=False)[c].index.values.tolist()
                      for c in course}
coursePref

{'c1': ['sc4811',
  'sc4617',
  'pa2561',
  'wg2347',
  'vml213',
  'sa3763',
  'cf2799',
  'js5553',
  'ma3973',
  'mds225',
  'sc4619',
  'sj2993',
  'yf2507',
  'yd2547',
  'qz2391',
  'sc4597',
  'rs4011',
  'tnw211',
  'yp2555',
  'lh2991',
  'xm2235',
  'zp2215',
  'wr2325',
  'da2899',
  'sg3775',
  'yw3379'],
 'c2': ['tnw211',
  'mds225',
  'sc4597',
  'yp2555',
  'qz2391',
  'vml213',
  'lh2991',
  'js5553',
  'sj2993',
  'ma3973',
  'zp2215',
  'rs4011',
  'wr2325',
  'yf2507',
  'sa3763',
  'yw3379',
  'sc4619',
  'wg2347',
  'sc4617',
  'sg3775',
  'yd2547',
  'sc4811',
  'xm2235',
  'da2899',
  'cf2799',
  'pa2561'],
 'c3': ['yw3379',
  'sg3775',
  'sc4597',
  'xm2235',
  'cf2799',
  'vml213',
  'js5553',
  'sc4619',
  'wr2325',
  'lh2991',
  'yp2555',
  'sj2993',
  'qz2391',
  'ma3973',
  'sa3763',
  'zp2215',
  'rs4011',
  'tnw211',
  'mds225',
  'yf2507',
  'wg2347',
  'sc4617',
  'sc4811',
  'pa2561',
  'yd2547',
  'da2899'],
 'c4': ['da2899',
  'yf2507',
  'wg2347',
 

In [34]:
lastStudentRank = {c: max([coursePref[c].index(a) 
                                   for a in resultByCourse[c]])
                           for c in course}
lastStudentRank

{'c1': 14, 'c2': 5, 'c3': 25, 'c4': 10, 'c5': 22, 'c6': 21, 'c7': 23, 'c8': 19}

In [35]:
c3 = list(coursePref['c3'])  # copy, so that remove() below does not mutate coursePref
print(c3)
print(len(c3))
for a in result.keys():
    if result[a][2] == 'c3':
        print(a)
        c3.remove(a)
        
print(len(c3))
c3

['yw3379', 'sg3775', 'sc4597', 'xm2235', 'cf2799', 'vml213', 'js5553', 'sc4619', 'wr2325', 'lh2991', 'yp2555', 'sj2993', 'qz2391', 'ma3973', 'sa3763', 'zp2215', 'rs4011', 'tnw211', 'mds225', 'yf2507', 'wg2347', 'sc4617', 'sc4811', 'pa2561', 'yd2547', 'da2899']
26
sc4597
yp2555
da2899
xm2235
sg3775
lh2991
rs4011
wr2325
yw3379
cf2799
16


['vml213',
 'js5553',
 'sc4619',
 'sj2993',
 'qz2391',
 'ma3973',
 'sa3763',
 'zp2215',
 'tnw211',
 'mds225',
 'yf2507',
 'wg2347',
 'sc4617',
 'sc4811',
 'pa2561',
 'yd2547']

In [5]:
[]+[-1]

[-1]

In [5]:
# Check starting capacity
test2on1.start_cap

{'c1': 6, 'c2': 4, 'c3': 8, 'c4': 4, 'c5': 14, 'c6': 14, 'c7': 14, 'c8': 14}

In [6]:
# Check ending capacity
test2on1.end_cap

{'c1': 0, 'c2': 0, 'c3': 0, 'c4': 0, 'c5': 9, 'c6': 6, 'c7': 12, 'c8': 9}

In [28]:
# Check test average rank
test2on1.testAvgRank

1.7381

#### Try different starting capacities



In [32]:
# Initialize with data, cap_same=True
asgn_ = Assign(data, cap_same=True, cap_buffer=0)

# Run each applicable test on each group
test1on1_ = asgn_.test(test=1, group=1)
test2on1_ = asgn_.test(test=2, group=1)
test4on1_ = asgn_.test(test=4, group=1)

test1on2_ = asgn_.test(test=1, group=2)
test2on2_ = asgn_.test(test=2, group=2)
test3on2_ = asgn_.test(test=3, group=2)
test4on2_ = asgn_.test(test=4, group=2)

Assigning according to test 1 with group 1 ...
	Part I (Semi-core): Number of GS rounds: 1
	Part II (General) : Number of GS rounds: 2
Assigning according to test 2 with group 1 ...
	Part I (Semi-core): Number of GS rounds: 2
	Part II (General) : Number of GS rounds: 3
Assigning according to test 4 with group 1 ...
	Part I (Semi-core): Number of GS rounds: 2
	Part II (General) : Number of GS rounds: 3
Assigning according to test 1 with group 2 ...
	Part I (Semi-core): Number of GS rounds: 2
	Part II (General) : Number of GS rounds: 3
Assigning according to test 2 with group 2 ...
	Part I (Semi-core): Number of GS rounds: 2
	Part II (General) : Number of GS rounds: 5
Assigning according to test 3 with group 2 ...
	Part I (Semi-core): Number of GS rounds: 2
	Part II (General) : Number of GS rounds: 7
Assigning according to test 4 with group 2 ...
	Part I (Semi-core): Number of GS rounds: 2
	Part II (General) : Number of GS rounds: 5


In [30]:
print("start_cap:",test2on1_.start_cap)
print("end_cap:",test2on1_.end_cap)
print("testAvgRank:", test2on1_.testAvgRank)

start_cap: {'c1': 8, 'c2': 8, 'c3': 8, 'c4': 8, 'c5': 8, 'c6': 8, 'c7': 8, 'c8': 8}
end_cap: {'c1': 0, 'c2': 0, 'c3': 0, 'c4': 6, 'c5': 3, 'c6': 1, 'c7': 6, 'c8': 6}
testAvgRank: 1.5001


**Observations from trying different starting capacities**

* In general:
  * Fewer GS rounds are needed in Part II of the algorithm with `cap_same=True` than with `cap_same=False` in 5 of the 7 runs

* Using test 2 on group 1:
  * With `cap_same=False`, the capacities of `c2` and `c4` are much lower (4 seats each), so `c4` fills up. With `cap_same=True`, `c4` does not fill up (6 of its 8 seats remain)
  * test average rank is lower for `cap_same=True` (1.5001) than for `cap_same=False` (1.7381), even when `cap_buffer=0`
    * *I don't think `cap_buffer>0` makes a difference, though we can check in detail*
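
The "5 out of 7" count can be reproduced from the two printed logs. This is a small sketch that copies the Part II round counts from those logs by hand; the dictionary names are illustrative and do not come from `assign.py`.

```python
# Part II GS rounds, copied from the two printed logs,
# keyed by (test, group)
rounds_varied = {(1, 1): 4, (2, 1): 5, (4, 1): 4,
                 (1, 2): 5, (2, 2): 4, (3, 2): 5, (4, 2): 6}   # cap_same=False
rounds_same = {(1, 1): 2, (2, 1): 3, (4, 1): 3,
               (1, 2): 3, (2, 2): 5, (3, 2): 7, (4, 2): 5}     # cap_same=True

# Runs where equal capacities needed strictly fewer rounds
fewer = [k for k in rounds_varied if rounds_same[k] < rounds_varied[k]]
exceptions = sorted(set(rounds_varied) - set(fewer))

print(f"{len(fewer)} of {len(rounds_varied)} runs need fewer rounds with cap_same=True")
print("exceptions (test, group):", exceptions)
```

The two exceptions are tests 2 and 3 on group 2, where equal capacities needed more rounds.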
 

### Rank review

It may not be appropriate to assume a unit difference between consecutive courses in the preference list.

In [7]:
# Compute average ranking of the assigned courses for each student in the test

print('For df_group1, average ranks:')
print('  - Test 1:', test1on1.testAvgRank)
print('  - Test 2:', test2on1.testAvgRank)
print('  - Test 4:', test4on1.testAvgRank)
# average over 8 different seeds for test 4
test4avgRank = 0
for s in range(42,50):
    asgn4 = Assign(data, seed=s)
    test4 = asgn4.test(test=4, group=1,verbose=False)
    test4avgRank += test4.testAvgRank/8
print('  - Test 4 (averaged seeds):', round(test4avgRank,4))


print()
print('For df_group2, average ranks:')
print('  - Test 1:', test1on2.testAvgRank)
print('  - Test 2:', test2on2.testAvgRank)
print('  - Test 3:', test3on2.testAvgRank)
print('  - Test 4:', test4on2.testAvgRank)
# average over 8 different seeds for test 4
test4avgRank = 0
for s in range(42,50):
    asgn4 = Assign(data, seed=s)
    test4 = asgn4.test(test=4, group=2, verbose=False)
    test4avgRank += test4.testAvgRank/8
print('  - Test 4 (averaged seeds):', round(test4avgRank,4))

For df_group1, average ranks:
  - Test 1: 2.4286
  - Test 2: 1.7381
  - Test 4: 1.9524
  - Test 4 (averaged seeds): 1.9614

For df_group2, average ranks:
  - Test 1: 2.7693
  - Test 2: 2.1026
  - Test 3: 2.2308
  - Test 4: 2.3077
  - Test 4 (averaged seeds): 2.3365


### Analysis by Group across tests

Comparing:
* Group 1: Test 1 (main) vs Test 2 (alternative)
* Group 1: Test 1 (main) vs Test 4 (baseline)
* Group 2: Test 1 (alternative) vs Test 2 (main)
* Group 2: Test 2 (main) vs Test 3 (main)
* Group 2: Test 2 (main) vs Test 4 (baseline)
* Group 2: Test 3 (main) vs Test 4 (baseline)
 
 *Tests can be combined; just to highlight what the comparisons are*

**Group 1**: Comparing Test 1 (bids only) to Test 2 (bids + preferences)

In [8]:
# Looking at overall average test ranks
pd.DataFrame((test1on1.testAvgRank, test2on1.testAvgRank, test1on1.testAvgRank - test2on1.testAvgRank),
              index=['test1on1','test2on1','improvement_1to2']).transpose()

| | test1on1 | test2on1 | improvement_1to2 |
|---|---|---|---|
| 0 | 2.4286 | 1.7381 | 0.6905 |


In [16]:
# Looking at individual students
studentAvgRank = pd.DataFrame((test1on1.studentAvgRank, test2on1.studentAvgRank),index=['test1on1','test2on1']).transpose()
studentAvgRank['improvement_1to2'] = studentAvgRank.test1on1 - studentAvgRank.test2on1
improved = len(studentAvgRank[studentAvgRank.improvement_1to2 > 0])
print(f"{improved} out of {len(studentAvgRank)} students have improved average rank ({round((improved/len(studentAvgRank))*100,3)}%)")
studentAvgRank

9 out of 14 students have improved average rank (64.286%)


| | test1on1 | test2on1 | improvement_1to2 |
|---|---|---|---|
| zs2440 | 3.0 | 1.667 | 1.333 |
| xt2230 | 3.667 | 1.667 | 2.0 |
| atc214 | 1.333 | 1.333 | 0.0 |
| ih2350 | 1.0 | 1.0 | 0.0 |
| rrb215 | 2.333 | 1.667 | 0.666 |
| jy3026 | 2.667 | 2.0 | 0.667 |
| zl2856 | 3.667 | 3.333 | 0.334 |
| tg2718 | 2.667 | 2.0 | 0.667 |
| la2836 | 1.0 | 1.0 | 0.0 |
| qt2131 | 1.667 | 1.667 | 0.0 |


# Code in Detail

## Pre Processing
Goal: To split the raw response data into two data frames, one per student group: Group 1 (course bids and rankings) and Group 2 (rankings, course bids and time-slot bids).

**Course Names** 

Let the following courses be denoted by: <br>
`c1`: Business Analytics <br>
`c2`: Cloud Computing <br>
`c3`: Machine Learning <br>
`c4`: Data Analytics <br>
`c5`: Optimization <br>
`c6`: Stochastic <br>
`c7`: Simulation <br>
`c8`: Computational Discrete Optimization 

where `c1`,`c3`,`c5`,`c6` are semi-core

**Time Slots**

Let the following time slots be denoted by: <br>
`t1`: 9-11am (`c6`,`c7`)<br>
`t2`: 12-2pm (`c1`,`c4`)<br>
`t3`: 3-5pm (`c2`,`c8`)<br>
`t4`: 6-8pm (`c3`,`c5`)
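
The slot-to-course mapping above can be written as a lookup table, which makes timing conflicts easy to test. A minimal sketch follows; the names `SLOT_COURSES`, `COURSE_SLOT` and `has_conflict` are illustrative and are not taken from `assign.py`.

```python
from itertools import combinations

# Time slot -> courses, as listed above
SLOT_COURSES = {"t1": ["c6", "c7"], "t2": ["c1", "c4"],
                "t3": ["c2", "c8"], "t4": ["c3", "c5"]}
# Inverted view: course -> time slot
COURSE_SLOT = {c: t for t, cs in SLOT_COURSES.items() for c in cs}

def has_conflict(courses):
    """True if any two of the given courses share a time slot."""
    return any(COURSE_SLOT[a] == COURSE_SLOT[b]
               for a, b in combinations(courses, 2))

print(COURSE_SLOT["c8"])                 # t3
print(has_conflict(["c1", "c4", "c3"]))  # True: c1 and c4 are both in t2
print(has_conflict(["c1", "c2", "c3"]))  # False
```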

**Course Capacities**

Let the course capacities be denoted by `k` (a dict).

### Separate Groups

In [None]:
# authentication
from google.colab import auth
auth.authenticate_user()

import gspread
from oauth2client.client import GoogleCredentials

gc = gspread.authorize(GoogleCredentials.get_application_default())

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# import StudentForm (Combined) data from Google Sheets
wb = gc.open_by_url('https://docs.google.com/spreadsheets/d/1LKi4I4VMGEK2_XBi8zYewFSvqx75nb8HFX9Oo08a8qE/edit#gid=1255728861')
data = wb.sheet1.get_all_values()

In [None]:
df = pd.DataFrame(data)
df.shape

(41, 40)

In [None]:
df= df.replace(to_replace = {'Business Analytics':'c1', 
                         'Cloud Computing' : 'c2',
                         'Machine Learning' : 'c3',
                         'Data Analytics': 'c4',
                         'Optimization': 'c5',
                         'Stochastic': 'c6',
                         'Simulation': 'c7',
                         'Computational Discrete Optimization': 'c8'
                         },
           value = None)

In [None]:
pd.set_option('display.max_columns', None)
df.head()
# df.iloc[:,3]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39
0,Timestamp,Email Address,Gender,Is the last digit of your UNI even or odd?,Business Analytics - place your bid between 0 ...,Cloud Computing - place your bid between 0 and...,Machine Learning - place your bid between 0 an...,Data Analytics - place your bid between 0 and ...,Optimization - place your bid between 0 and 10...,Stochastic - place your bid between 0 and 100 ...,Simulation - place your bid between 0 and 100 ...,Computational Discrete Optimization - place yo...,Rank the courses - please indicate strict orde...,Rank the courses - please indicate strict orde...,Rank the courses - please indicate strict orde...,Rank the courses - please indicate strict orde...,Rank the courses - please indicate strict orde...,Rank the courses - please indicate strict orde...,Rank the courses - please indicate strict orde...,Rank the courses - please indicate strict orde...,Rank the courses - please indicate strict orde...,Rank the courses - please indicate strict orde...,Rank the courses - please indicate strict orde...,Rank the courses - please indicate strict orde...,Rank the courses - please indicate strict orde...,Rank the courses - please indicate strict orde...,Rank the courses - please indicate strict orde...,Rank the courses - please indicate strict orde...,Business Analytics - place your bid between 0 ...,Cloud Computing - place your bid between 0 and...,Machine Learning - place your bid between 0 an...,Data Analytics - place your bid between 0 and ...,Optimization - place your bid between 0 and 10...,Stochastic - place your bid between 0 and 100 ...,Simulation - place your bid between 0 and 100 ...,Computational Discrete Optimization - place yo...,9-11 am - place your bid between 0 and 100 (re...,12-2 pm - place your bid between 0 and 100 (re...,3-5 pm - place your bid between 0 and 100 (rem...,6-8 pm - place your bid between 0 and 100 (rem...
1,7/30/2020 15:21:12,qt2131@columbia.edu,Female,"Even (0, 2, 4, 6, 8)",1,2,3,4,5,6,7,72,c8,c7,c6,c5,c4,c3,c2,c1,,,,,,,,,,,,,,,,,,,,
2,7/30/2020 15:45:22,wr2325@columbia.edu,Male,"Odd (1, 3, 5, 7, 9)",,,,,,,,,,,,,,,,,c3,c8,c5,c4,c6,c7,c2,c1,5,5,25,10,10,5,5,35,10,20,40,30
3,7/31/2020 17:06:49,zs2440@columbia.edu,Female,"Even (0, 2, 4, 6, 8)",0,100,0,0,0,0,0,0,c2,c1,c3,c4,c5,c6,c7,c8,,,,,,,,,,,,,,,,,,,,
4,7/31/2020 17:07:11,sj2993@columbia.edu,Male,"Odd (1, 3, 5, 7, 9)",,,,,,,,,,,,,,,,,c1,c4,c3,c2,c5,c7,c8,c6,30,15,20,25,3,1,3,3,40,30,20,10


In [None]:
# extract group 1 and group 2 data from combined df
df_group1 = df[df[3]=="Even (0, 2, 4, 6, 8)"].drop(columns=range(20,40)).replace(to_replace={'Even (0, 2, 4, 6, 8)':'Group1'})
df_group2 = df[df[3]=="Odd (1, 3, 5, 7, 9)"].drop(columns=range(4,20)).replace(to_replace={'Odd (1, 3, 5, 7, 9)':'Group2'})

# set colnames
df_group1_colnames = ['Timestamp','Student','Gender','Group',
                      'c1','c2','c3','c4','c5','c6','c7','c8', # bids on courses
                      'R1','R2','R3','R4','R5','R6','R7','R8'  # ranks
                      ]
df_group2_colnames = ['Timestamp','Student','Gender','Group',
                      'R1','R2','R3','R4','R5','R6','R7','R8', # ranks
                      'c1','c2','c3','c4','c5','c6','c7','c8', # bids on courses
                      't1','t2','t3','t4'                      # bids on time slots
                      ]

df_group1.columns = df_group1_colnames
df_group2.columns = df_group2_colnames

In [None]:
# change datatype of bids from str to int
df_group1 = df_group1.apply(pd.to_numeric, downcast='integer', errors='ignore')
df_group2 = df_group2.apply(pd.to_numeric, downcast='integer', errors='ignore')

# index each student using their UNI
df_group1.index = df_group1.Student.str[:6]
df_group1.index.name = 'UNI'
df_group2.index = df_group2.Student.str[:6]
df_group2.index.name = 'UNI'

# check bid criteria is met
df_group1['CourseBidCriteria'] = (df_group1.loc[:,'c1':'c8'].sum(axis=1) == 100)
df_group2['CourseBidCriteria'] = (df_group2.loc[:,'c1':'c8'].sum(axis=1) == 100)
df_group2['TimeBidCriteria'] = (df_group2.loc[:,'t1':'t4'].sum(axis=1) == 100)

In [None]:
df_group1.head()

| UNI | Timestamp | Student | Gender | Group | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 | CourseBidCriteria |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| qt2131 | 7/30/2020 15:21:12 | qt2131@columbia.edu | Female | Group1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 72 | c8 | c7 | c6 | c5 | c4 | c3 | c2 | c1 | True |
| zs2440 | 7/31/2020 17:06:49 | zs2440@columbia.edu | Female | Group1 | 0 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | c2 | c1 | c3 | c4 | c5 | c6 | c7 | c8 | True |
| xt2230 | 7/31/2020 18:23:25 | xt2230@columbia.edu | Female | Group1 | 0 | 65 | 0 | 0 | 0 | 0 | 0 | 35 | c2 | c8 | c4 | c3 | c1 | c5 | c6 | c7 | True |
| sjl222 | 7/31/2020 19:33:42 | sjl2220@columbia.edu | Male | Group1 | 45 | 0 | 45 | 0 | 0 | 10 | 0 | 0 | c1 | c3 | c6 | c5 | c7 | c4 | c8 | c2 | True |
| zl2856 | 7/31/2020 20:06:01 | zl2856@columbia.edu | Female | Group1 | 15 | 10 | 15 | 10 | 15 | 15 | 10 | 10 | c1 | c5 | c3 | c6 | c7 | c2 | c8 | c4 | True |


In [None]:
df_group2.head()

| UNI | Timestamp | Student | Gender | Group | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | t1 | t2 | t3 | t4 | CourseBidCriteria | TimeBidCriteria |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wr2325 | 7/30/2020 15:45:22 | wr2325@columbia.edu | Male | Group2 | c3 | c8 | c5 | c4 | c6 | c7 | c2 | c1 | 5 | 5 | 25 | 10 | 10 | 5 | 5 | 35 | 10 | 20 | 40 | 30 | True | True |
| sj2993 | 7/31/2020 17:07:11 | sj2993@columbia.edu | Male | Group2 | c1 | c4 | c3 | c2 | c5 | c7 | c8 | c6 | 30 | 15 | 20 | 25 | 3 | 1 | 3 | 3 | 40 | 30 | 20 | 10 | True | True |
| sa3763 | 7/31/2020 17:14:37 | sa3763@columbia.edu | Female | Group2 | c1 | c6 | c5 | c7 | c4 | c3 | c2 | c8 | 40 | 5 | 10 | 20 | 5 | 5 | 10 | 5 | 30 | 40 | 20 | 10 | True | True |
| yf2507 | 7/31/2020 17:17:24 | yf2507@columbia.edu | Female | Group2 | c4 | c1 | c3 | c5 | c6 | c7 | c8 | c2 | 30 | 5 | 5 | 50 | 5 | 5 | 0 | 0 | 25 | 25 | 25 | 25 | True | True |
| js5553 | 7/31/2020 17:22:47 | js5553@columbia.edu | Male | Group2 | c1 | c3 | c2 | c4 | c5 | c6 | c7 | c8 | 40 | 15 | 25 | 10 | 5 | 5 | 0 | 0 | 0 | 50 | 20 | 30 | True | True |


### Response Summary

In [None]:
def countBidder(df, course):
  # number of students who placed a strictly positive bid on `course`
  return int((df[course] > 0).sum())

def report(df):
  print(f"# of participants: {len(df)}")
  # look genders up by label rather than by position
  gender = df["Gender"].value_counts()
  print(f"Female: {gender.get('Female', 0)}, Male: {gender.get('Male', 0)}")
  print("\n# of (non-zero) bidders for each course:")
  for c in ['c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8']:
    print(f"\t{c}: {countBidder(df, c)}")

In [None]:
report(df_group1)

# of participants: 14
Female: 6, Male: 8

# of (non-zero) bidders for each course:
	c1: 10
	c2: 9
	c3: 8
	c4: 6
	c5: 6
	c6: 9
	c7: 6
	c8: 8


In [None]:
report(df_group2)

# of participants: 26
Female: 11, Male: 15

# of (non-zero) bidders for each course:
	c1: 23
	c2: 19
	c3: 22
	c4: 22
	c5: 23
	c6: 19
	c7: 13
	c8: 9


### Generate Course Capacities

In [None]:
sc = ['c1','c3','c5','c6']
course = ['c1','c2','c3','c4','c5','c6','c7','c8']
hd = ['c1','c2','c3','c4']
ld = ['c5','c6','c7','c8']

# Seats per course: varied capacities (same=False) or equal capacities (same=True)
def capacity(df, same=False, buffer=3):
    cap = {c: 0 for c in course}
    
    if not same:
        # 4 semi-core courses take at least len(df) people
        capOfSC = round(len(df)/4)
        lastSC = len(df) - 3*capOfSC
        for c in sc:
            cap[c] += capOfSC
        cap[sc[0]] = lastSC
        
        remainTotalSeats = 3*len(df) - 3*capOfSC - lastSC
        capOfAll = round(remainTotalSeats/8)
        for c in course:
            cap[c] += capOfAll

        ldCap = {c: len(df) for c in ld}
        cap.update(ldCap)

    if same:
        # as per Yuri's recc, give all classes same capacity
        # allCap = minimum capacity per class + some buffer
        allCap = round(3*len(df)/8) + buffer
        cap = {c: allCap for c in course} 

    return cap

In [None]:
capacity(df_group1, same=True)

{'c1': 8, 'c2': 8, 'c3': 8, 'c4': 8, 'c5': 8, 'c6': 8, 'c7': 8, 'c8': 8}

In [None]:
capacity(df_group1, same=False)

{'c1': 6, 'c2': 4, 'c3': 8, 'c4': 4, 'c5': 14, 'c6': 14, 'c7': 14, 'c8': 14}
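
The varied capacities for Group 1 can be traced by hand. The sketch below repeats the arithmetic of the `same=False` branch of `capacity()` for 14 students; the variable names are illustrative, and the comments show each intermediate value.

```python
n = 14                                    # students in Group 1

cap_sc = round(n / 4)                     # round(3.5) -> 4 seats per semi-core course
last_sc = n - 3 * cap_sc                  # 14 - 12 -> 2, which replaces the 4 for c1
remaining = 3 * n - 3 * cap_sc - last_sc  # 42 - 12 - 2 -> 28 seats still to place
cap_all = round(remaining / 8)            # round(3.5) -> 4 extra seats for every course

c1 = last_sc + cap_all                    # 6
c3 = cap_sc + cap_all                     # 8
c2 = c4 = cap_all                         # 4 (not semi-core)
ld_cap = n                                # c5..c8 are overwritten with len(df) = 14

print(c1, c2, c3, c4, ld_cap)             # 6 4 8 4 14
```

Python's built-in `round()` rounds halves to the nearest even integer, so `round(3.5)` is 4 but `round(2.5)` is 2. Group sizes that land on a different half can therefore give capacities that look one seat lower than expected.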

In [None]:
capacity(df_group2, same=True)

{'c1': 13,
 'c2': 13,
 'c3': 13,
 'c4': 13,
 'c5': 13,
 'c6': 13,
 'c7': 13,
 'c8': 13}

## Section 1: Preference Generator from Bids and Lotteries


### Student-side preferences

```python
def get_pref(df, sc=False):     # actual preferences
def get_bid_pref(df, sc=False): # implied preferences from bids
```
Functions return output of the following structure:
```
{'student1': [c2, c3, ...],
 'student2': [c1, c4, ...],
  ...
}
```

TODO: Function to get dictionary of student preferences from data frame


In [None]:
def get_pref(df, sc=False):
  '''
  Returns a dictionary of students' preferences, with student UNI as the key
  sc=False gives all courses, sc=True gives only semi-core courses
  '''
  pref_dict = {}
  if sc:
    for UNI, row in df.loc[:,'R1':'R8'].iterrows():
      sc_list = []
      for c in row.values:
        if c in ['c1','c3','c5','c6']: # if course is semi-core
          sc_list.append(c)
      pref_dict[UNI] = sc_list
    
  else:  
    for UNI, row in df.loc[:,'R1':'R8'].iterrows():
      pref_dict[UNI] = list(row.values)
  
  return pref_dict

In [None]:
# get_pref(df_group2)
# get_pref(df_group2, sc=True)

TODO: Function to generate implied preferences from bids (Test 1).

In [None]:
def get_bid_pref(df, sc=False):
  '''
  Returns a dictionary of students' preferences, derived from course bids.
  Student UNI as key.
  sc=False gives all courses, sc=True gives only semi-core courses
  '''
  pref_dict = {}
  if sc:
    for UNI, row in df.loc[:,'c1':'c8'].iterrows():
      sc_list = []
      for c in row.sort_values(ascending=False).index.values:
        if c in ['c1','c3','c5','c6']: # if course is semi-core
          sc_list.append(c)
      pref_dict[UNI] = sc_list

  else:
    for UNI, row in df.loc[:,'c1':'c8'].iterrows():
      pref_dict[UNI] = list(row.sort_values(ascending=False).index.values)
  
  return pref_dict
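
Implied preferences say nothing about courses with equal bids, which matters when many bids are zero. The sketch below uses plain Python rather than pandas, so it is not the implementation above; `implied_pref` is a hypothetical helper. It relies on `sorted()` being stable, so tied courses keep their `c1`..`c8` order, whereas the tie order in the pandas version depends on the sort algorithm pandas uses.

```python
def implied_pref(bids):
    """Order courses by bid, highest first. sorted() is stable, so courses
    with equal bids keep their original (c1..c8) order."""
    return sorted(bids, key=lambda c: -bids[c])

# Bids of student xt2230, from the Group 1 table
bids = {"c1": 0, "c2": 65, "c3": 0, "c4": 0,
        "c5": 0, "c6": 0, "c7": 0, "c8": 35}
stated = ["c2", "c8", "c4", "c3", "c1", "c5", "c6", "c7"]  # xt2230's R1..R8

implied = implied_pref(bids)
print(implied)                      # ['c2', 'c8', 'c1', 'c3', 'c4', 'c5', 'c6', 'c7']
print(implied[:2] == stated[:2])    # True: the two positive bids match the stated top 2
print(implied == stated)            # False: the zero-bid courses are ordered arbitrarily
```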

In [None]:
# get_bid_pref(df_group1)
# get_bid_pref(df_group1, sc=True)

### Course-side Preferences

```python
def get_course_pref(df, sc = False, rank = True, seed = 42)
# when rank = True (default), returns dictionary of courses' rankings 
# when rank = False, returns dictionary of courses' bids

def get_time_pref(df, sc = False, rank = True, seed = 42)
# converts time bids into course bids, and runs get_course_pref()
```
Function returns a dictionary of the following structure:
```
{'c1': ['student1', 'student3', ...],
 'c2': ['student2', 'student3', ...],
  ...
}
```


TODO: Function to get students' bids for courses/ timeslots (Tests 1-3)

In [None]:
def modified_bid(df, seed = 42):
  '''
  Draws a random real number x from the uniform distribution on [0, 1) for each student-course pair
  Modifies every bid b (including zero bids) as b' = b + x, so that ties are broken at random
  Returns a modified bid df
  '''
  np.random.seed(seed)
  df_ = df.loc[:,'c1':'c8']
  X = np.random.uniform(size=df_.shape)
  mod_bids = df_ + X
  # mod_bids[mod_bids < 1] = 0   # commented out since we want to rank 0 bids
  return mod_bids
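
Because the noise is drawn from [0, 1) and the bids are integers, the noise can only reorder students whose bids are tied; it never moves a lower bid above a higher one. A minimal sketch of this property on made-up bids follows. It uses `np.random.default_rng` instead of the `np.random.seed` call above, which is a substitution made for this illustration.

```python
import numpy as np
import pandas as pd

# Illustrative integer bids on one course (not from the experiment)
bids = pd.Series([30, 30, 5, 0, 0], index=["s1", "s2", "s3", "s4", "s5"])

rng = np.random.default_rng(42)              # assumption: any seeded generator works here
noisy = bids + rng.uniform(size=len(bids))   # noise in [0, 1)

# Strict ranking: s1/s2 first (in random order), then s3, then s4/s5 (in random order)
ranking = noisy.sort_values(ascending=False).index.tolist()
print(ranking)
```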

In [None]:
def get_course_pref(df, sc = False, rank = True, seed = 42):
  '''
  Returns course ranks when rank = True, or (original) course bids when rank = False.

  sc = True returns only semi-core courses
  rank = True (default) returns courses' strict ranking on students, based on bids.
          Ties are broken by adding a random number from a uniform distribution to the actual bid value
  
  Uses modified_bid():
    seed = 42 is default value; only relevant when rank = True
  '''  
  # for course bids
  if not rank:
    if sc:
      course_dict = df.loc[:,['c1','c3','c5','c6']].to_dict()
    if not sc:
      course_dict = df.loc[:,'c1':'c8'].to_dict()

  # for strict course ranks
  if rank:
    new_df = modified_bid(df, seed = seed)  # get modified bid matrix
    course_dict = {}
    if not sc:       
      for c in new_df:
        course_dict[c] = list(new_df[c].sort_values(ascending=False).index.values)
    if sc:
      for c in ['c1','c3','c5','c6']:
        course_dict[c] = list(new_df[c].sort_values(ascending=False).index.values)
  
  return course_dict

In [None]:
# get_course_pref(df_group1)

In [None]:
def get_time_pref(df, sc = False, rank = True, seed = 42):
  '''
  Converts time bids to course bids.
  Returns course ranks when rank = True, or (original) course bids when rank = False.

  Only for df_group2. Returns students' bids on timeslots, represented as bids on the individual courses.
  
  Uses get_course_pref():
      sc = True returns only semi-core courses
      rank = True (default) returns courses' strict ranking on students, based on bids.
          Ties are broken by adding a random number from a uniform distribution to the actual bid value
      seed = 42 is default value; only relevant when rank = True  
  '''
  # check that the dataframe passed into function contains time bids
  try:
    df.loc[:,'t1':'t4']
  except KeyError:
    raise ValueError("No timeslot keys in df. This function only works for df_group2.")

  # convert time bids to course bids
  new_df = pd.DataFrame()
  new_df['c1'] = df.t2; new_df['c2'] = df.t3; new_df['c3'] = df.t4; new_df['c4'] = df.t2
  new_df['c5'] = df.t4; new_df['c6'] = df.t1; new_df['c7'] = df.t1; new_df['c8'] = df.t3

  # get_course_pref()
  out_dict = get_course_pref(new_df, sc=sc, rank=rank, seed=seed)

  return out_dict

In [None]:
# get_time_pref(df_group2)

TODO: Function to get course preferences using unique lottery (Test 4)

In [None]:
# np.random.uniform()

In [None]:
# Refer to edited lottery function in Test 4

# def lottery(ds, reverse=False):
#     np.random.seed(42)
#     ds = ds.reset_index()
#     if reverse:
#       return np.flip(np.random.permutation(ds['UNI']))
#     return np.random.permutation(ds['UNI'])

# lottery(df_group1)


array(['mz2776', 'qt2131', 'atc214', 'wx2226', 'xt2230', 'zs2440',
       'jy3026', 'zl2856', 'rg3266', 'sjl222', 'tg2718'], dtype=object)

In [None]:
# lottery(df_group1, reverse=True)

array(['tg2718', 'sjl222', 'rg3266', 'zl2856', 'jy3026', 'zs2440',
       'xt2230', 'wx2226', 'atc214', 'qt2131', 'mz2776'], dtype=object)

## Section 2: Schedules for the Second Round

**no longer necessary since this is done within the assign() function**

TODO: For each student, given their preference list (excluding the assigned semi-core course) `pref_r2` and their assigned semi-core course `sc_assigned`, generate the list of schedules that satisfy the following constraints:
* All 3 courses do not overlap in course timing
* Includes exactly 2 courses (excluding assigned semi-core course)
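
A minimal sketch of how such schedules could be enumerated, assuming the time-conflict pairs used elsewhere in this notebook. The helper names `has_conflict` and `feasible_schedules` and the example student are hypothetical; this is not the logic inside `assign()`.

```python
from itertools import combinations

# Courses sharing a timeslot (same pairs as the notebook's conflict list)
CONFLICT_PAIRS = [{'c1', 'c4'}, {'c2', 'c8'}, {'c3', 'c5'}, {'c6', 'c7'}]

def has_conflict(courses):
    """True if any two of the given courses share a timeslot."""
    chosen = set(courses)
    return any(pair <= chosen for pair in CONFLICT_PAIRS)

def feasible_schedules(pref_r2, sc_assigned):
    """List [course_a, course_b, sc_assigned] schedules without time conflicts.

    pref_r2 is ordered from most to least preferred, so the schedules come out
    in the order combinations() walks that list.
    """
    return [[a, b, sc_assigned]
            for a, b in combinations(pref_r2, 2)
            if not has_conflict([a, b, sc_assigned])]

# Hypothetical student: assigned c1, remaining preferences in order
schedules = feasible_schedules(['c2', 'c4', 'c8', 'c6'], 'c1')
print(schedules)
```

Here `c4` is dropped from every schedule because it clashes with the assigned `c1`, and the pair (`c2`, `c8`) is dropped because the two clash with each other.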



## Section 3: Two-round Algorithm 

TODO: Generate course capacities based on the total count of participants.

TODO: The student-proposing algorithms are similar (sc or non-sc). Consider building a common method for student-proposing.

Summary:
* `test_1()`: course bids only
  * main testing: `df_group1`
  * comparison testing: `df_group2`

* `test_2()`: course bids + preferences
  * main testing: `df_group2`
  * comparison testing: `df_group1`

* `test_3()`: time bids + preferences
  * main testing: `df_group2`

* `test_4()`: unique lottery + preferences
  * main testing: `df_group1` and `df_group2`


### Test 1 & 2 (Course bidding)

* Test 1: using implied preferences obtained from bids (`get_bid_pref()`)
  
* Test 2: using actual preferences (`get_pref()`)

For `df_group1`, the main test of interest is Test 1. We carry out Test 2 for comparison.

For `df_group2`, the main test of interest is Test 2. We carry out Tests 1 and 4 for comparison.

**Part I. Code to assign 1 semi-core to each student**

- For each student, take their bid on the first semi-core (sc) course on their preference list as `b_sc_1`;
- For each semi-core course:
  - Order students by their bid `b_sc_1` from high to low
  - Assign the top *k* students to this course (*k* = course capacity)
- If some students have not been assigned:
  - Excluding students who already have an sc course, take each remaining student's bid on the second sc course on their preference list as `b_sc_2`;
  - ...
- Repeat the above until every student has 1 sc course.


**Note:** Students assigned during this part will not be rejected later.
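
The steps above can be sketched on toy data. This is a simplified illustration of the procedure as described in the list, not of `assignSC()` itself; the function name `assign_semi_core`, the students, their bids, and the capacities are all hypothetical.

```python
def assign_semi_core(prefs, bids, capacity):
    """Assign one semi-core course per student, round by round.

    prefs:    {student: [semi-core courses, most preferred first]}
    bids:     {student: {course: bid}}
    capacity: {course: seats}
    Students admitted in an earlier round keep their seat.
    """
    seats = dict(capacity)
    assigned = {c: [] for c in capacity}
    unassigned = set(prefs)
    k = 0                                          # position on the preference list
    max_len = max(len(p) for p in prefs.values())
    while unassigned and k < max_len:
        for c in capacity:
            # unassigned students whose k-th semi-core choice is c
            applicants = [u for u in sorted(unassigned)
                          if len(prefs[u]) > k and prefs[u][k] == c]
            applicants.sort(key=lambda u: bids[u][c], reverse=True)
            admitted = applicants[:seats[c]]       # top bidders fill the free seats
            assigned[c].extend(admitted)
            seats[c] -= len(admitted)
            unassigned -= set(admitted)
        k += 1
    return assigned

# Hypothetical data: two semi-core courses, three students
prefs = {'s1': ['c1', 'c3'], 's2': ['c1', 'c3'], 's3': ['c3', 'c1']}
bids = {'s1': {'c1': 60, 'c3': 10},
        's2': {'c1': 40, 'c3': 30},
        's3': {'c1': 5, 'c3': 50}}
result = assign_semi_core(prefs, bids, {'c1': 1, 'c3': 2})
print(result)
```

In this example `s2` loses `c1` to the higher bidder `s1` in the first round and takes the remaining seat in `c3` in the second.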

In [None]:
sc = ['c1','c3','c5','c6']
course = ['c1','c2','c3','c4','c5','c6','c7','c8']

In [None]:
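def newCap(c, k, bid_sc_k):
    """Return the seats left in course c: capacity k minus the students already assigned in bid_sc_k."""
    a = 0
    if bid_sc_k.get(c):
        a = len(bid_sc_k.get(c))
    return k - a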

In [None]:
# assign 1 semi-core
def assignSC(df, usePref=True, test4=False, seed=42):
    stop = False
    r = 0                                    # starts with 0 (first round)
    cap = capacity(df)                       # init capacity
    if usePref:
        pref_sc = get_pref(df, sc=True)      # get pref list of sc
    else:
        pref_sc = get_bid_pref(df, sc=True)  # get implied pref list of sc from bids
    pref_sc_z = {}                           # pref to be updated each round
    rejected = pref_sc.keys()

    while not stop:
        # get first(round#) sc course on each one's pref list
        pref_sc_r = {u: x[r] for (u,x) in pref_sc.items() if u in rejected}
        # modify pref list
        pref_sc_z.update(pref_sc_r)          # here because later updating bid list would be the same

        # rank students for each sc by bidding (Test 4 uses lottery numbers as bids)
        bids = lottery(df, seed=seed) if test4 else modified_bid(df, seed=seed)
        bid_sc = {c: sorted([(bids.loc[u,c], u) 
                      for u in pref_sc_z.keys() 
                      if pref_sc_z[u] == c], reverse=True) 
                  for c in sc}
          
        # keep top k students
        bid_sc_k = {c: s[:cap[c]] for (c, s) in bid_sc.items()}

        # find the list of unmatched student unis
        rejected = [i[1] for l in [s[cap[c]:] for (c, s) in bid_sc.items()] for i in l]

        if rejected: # not empty
            r += 1
        else:
            stop = True

    # update capacity
    cap = {c: newCap(c, k, bid_sc_k) for (c, k) in cap.items()}
    # print("Part I: Number of GS rounds:",r+1)
    return bid_sc_k, cap

In [None]:
# assignSC(df_group1, usePref=False)

**Part II. Code to assign 2 courses to each student**

- Remove the assigned course and any time-conflicting course from each student's preference list;

- Students propose to the top 2 courses on their modified preference list;
  - Each course rejects its lowest-ranked proposers once it exceeds capacity and tentatively holds the rest, who may still be rejected in later rounds;

- Repeat until no one gets rejected.

**Note:** The output of the above algorithm is semi-core stable and always exists.
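
A simplified sketch of the proposal rounds, assuming preference lists from which the semi-core course and its time conflicts have already been removed. It ignores conflicts between the two newly proposed courses, which `assign()` handles separately. The function name `deferred_acceptance` and all data are hypothetical.

```python
def deferred_acceptance(prefs, bids, capacity, quota=2):
    """Student-proposing deferred acceptance where each student seeks `quota` courses.

    prefs:    {student: [courses, most preferred first]}
    bids:     {student: {course: bid}}  (courses rank students by bid, high first)
    capacity: {course: seats}
    Returns {course: [students held, highest bid first]}.
    """
    next_idx = {u: 0 for u in prefs}     # next untried position on each list
    n_held = {u: 0 for u in prefs}       # seats currently held by each student
    held = {c: [] for c in capacity}     # tentative acceptances per course

    while True:
        proposals = []
        for u in sorted(prefs):
            need = quota - n_held[u]     # seats the student still lacks
            new = prefs[u][next_idx[u]:next_idx[u] + need]
            next_idx[u] += len(new)
            proposals.extend((u, c) for c in new)
        if not proposals:                # nobody can or needs to propose
            return held
        for u, c in proposals:
            held[c].append(u)
            n_held[u] += 1
        for c in held:
            held[c].sort(key=lambda u: bids[u][c], reverse=True)
            for u in held[c][capacity[c]:]:   # over capacity: reject lowest bids
                n_held[u] -= 1
            held[c] = held[c][:capacity[c]]

# Hypothetical data: three students, three courses with two seats each
prefs = {'s1': ['c2', 'c7', 'c8'], 's2': ['c2', 'c7', 'c8'], 's3': ['c2', 'c8', 'c7']}
bids = {'s1': {'c2': 50, 'c7': 30, 'c8': 20},
        's2': {'c2': 40, 'c7': 40, 'c8': 20},
        's3': {'c2': 30, 'c7': 35, 'c8': 60}}
matching = deferred_acceptance(prefs, bids, {'c2': 2, 'c7': 2, 'c8': 2})
print(matching)
```

In this run `s1` is first held by `c7`, then displaced when `s3` proposes with a higher bid, which shows why acceptances stay tentative until no one is rejected.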

In [None]:
from collections import Counter

def courseToStudentView(courseView):
    """
    Convert {course: [(bid, uni)]} to {uni: [courses]}
    """
    courseViewUni = {c: [s[1] for s in courseView[c]] for c in courseView.keys()}
    studentView = {u: [] for l in courseViewUni.values() for u in l}
    for c in courseViewUni.keys():
        for u in courseViewUni[c]:
            studentView[u].append(c)
    return studentView

# time conflict course pair
coursePair = [{'c1', 'c4'}, {'c2', 'c8'}, {'c3', 'c5'}, {'c6', 'c7'}]

def resolveTimeConflict(df, courseView, usePref=True):
    # update preference list
    if usePref:
        updatedPref = get_pref(df)
    else:
        updatedPref = get_bid_pref(df)
    studentView = courseToStudentView(courseView)
    for u in studentView.keys():
        assignedSC = studentView[u][0]
        for pair in coursePair:
            if assignedSC in pair:
                updatedPref[u] = [i for i in updatedPref[u] if i not in pair]

    return updatedPref


# general assignment (2 courses default)
def assign(df, pref, cap, courseNum=2, test4=False, seed=42):
    stop = False
    r = 0                                                # round, not used in this function
    rejected = pref.keys()
    nextProposeQuota = {u: courseNum for u in rejected}  # # of courses rejected last turn
    propose = {u: [] for u in pref.keys()}               # store courses to be proposed in each turn
    proposed = {u: [] for u in pref.keys()}              # store proposed courses (separate dict, not an alias of propose)

    while not stop:
        # first propose to 2 courses, then propose to quota courses
        newPropose = {u: [c for c in x if c not in proposed[u]][:nextProposeQuota[u]]
                          for (u,x) in pref.items() if u in set(rejected)}
        # update propose and proposed
        for u in newPropose.keys():
            propose[u].extend(newPropose[u]) 
            proposed[u].extend(newPropose[u])
        # print(propose)  # uncomment this to see glitch

        # index bids (Test 4 uses lottery numbers as bids)
        bids = lottery(df, seed=seed) if test4 else modified_bid(df, seed=seed)
        bid1 = {c: [(bids.loc[u,c], u) for u in propose.keys() if propose[u][0] == c] 
                for c in course}
        bid2 = {c: [(bids.loc[u,c], u) for u in propose.keys() if propose[u][1] == c]
                for c in course}
        bid = {c: sorted(l + bid2[c], reverse=True) for (c,l) in bid1.items()}

        # keep top k students
        bid_k = {c: s[:cap[c]] for (c, s) in bid.items()}

        # find the list of unmatched student unis
        rejected = [i[1] for l in [s[cap[c]:] for (c, s) in bid.items()] for i in l]

        # record current (successful) proposal
        propose = {u: [] for u in pref.keys()}
        propose.update(courseToStudentView(bid_k))

        # if no one get rejected, check time conflict
        if all([len(cl)==2 for cl in propose.values()]):
            # reject second course (less preferred) due to time conflict
            rejected = [u for u in propose.keys() if set(propose[u]) in coursePair]
            # update propose
            for u in rejected:
                propose[u].pop()

        if rejected: # not empty
            r += 1
            nextProposeQuota = Counter(rejected)
        else:
            stop = True
    # print("Part II: Number of GS rounds:",r+1)
    return bid_k

def test_1(group=1, returnAvgRank=False, seed=42):
    # (df_group1, df_group2) = self.preprocess(filename)
    if group == 1:
        df = df_group1
    elif group == 2:
        df = df_group2
    else:
        raise ValueError("Group index out of range!")
        
    (courseViewSC, cap) = assignSC(df, usePref=False, seed=seed)
    updatedPref = resolveTimeConflict(df, courseViewSC, usePref=False)
    courseView = assign(df, updatedPref, cap, seed=seed)
    
    # convert to {uni: course-list}
    studentViewSC = courseToStudentView(courseViewSC)
    studentView = courseToStudentView(courseView)
    
    # combine sc with other courses
    for u in studentView.keys():
        studentView[u].extend(studentViewSC[u])

    # compute average rank
    trueRank = {u: {c: i for i, c in enumerate(prefs)} 
                    for u, prefs in get_pref(df).items()}
    studentAvgRank = {u: round(sum([trueRank.get(u).get(c) for c in m])/3,3) 
                                                   for u, m in studentView.items()}
    testAvgRank = round(sum([r for (u,r) in studentAvgRank.items()])/len(trueRank),4)

    # if returnAvgRank=True, return avg ranks instead of matching
    if returnAvgRank:
      return testAvgRank, studentAvgRank

    print('Test Average Rank:', testAvgRank)
    print()
        
    # The last (third) course is the semi-core requirement
    return studentView

def test_2(group=1, returnAvgRank=False, seed=42):
    # (df_group1, df_group2) = self.preprocess(filename)
    if group == 1:
        df = df_group1
    elif group == 2:
        df = df_group2
    else:
        raise ValueError("Group index out of range!")
        
    (courseViewSC, cap) = assignSC(df, seed=seed)
    updatedPref = resolveTimeConflict(df, courseViewSC)
    courseView = assign(df, updatedPref, cap, seed=seed)
    
    # convert to {uni: course-list}
    studentViewSC = courseToStudentView(courseViewSC)
    studentView = courseToStudentView(courseView)
    
    # combine sc with other courses
    for u in studentView.keys():
        studentView[u].extend(studentViewSC[u])

    # compute average rank
    trueRank = {u: {c: i for i, c in enumerate(prefs)} 
                    for u, prefs in get_pref(df).items()}
    studentAvgRank = {u: round(sum([trueRank.get(u).get(c) for c in m])/3,3) 
                                                   for u, m in studentView.items()}
    testAvgRank = round(sum([r for (u,r) in studentAvgRank.items()])/len(trueRank),4)
    
    # if returnAvgRank=True, return avg ranks instead of matching
    if returnAvgRank:
      return testAvgRank, studentAvgRank

    print('Test Average Rank:', testAvgRank)
    print()
        
    # The last (third) course is the semi-core requirement
    return studentView
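
The average-rank metric computed inside `test_1()` and `test_2()` can be illustrated in isolation. The sketch below uses hypothetical students and preferences and a hypothetical helper name `average_rank`; unlike the notebook code it divides by the number of assigned courses rather than a fixed 3, which is the same thing when every student has three courses.

```python
def average_rank(matching, true_prefs):
    """Average 0-based rank of the assigned courses on the true preference lists.

    matching:   {student: [assigned courses]}
    true_prefs: {student: [courses, most preferred first]}
    Returns (overall average, {student: average}). Lower is better.
    """
    per_student = {}
    for u, courses in matching.items():
        rank = {c: i for i, c in enumerate(true_prefs[u])}
        per_student[u] = sum(rank[c] for c in courses) / len(courses)
    overall = sum(per_student.values()) / len(per_student)
    return overall, per_student

# Hypothetical example
true_prefs = {'s1': ['c1', 'c2', 'c3', 'c4'], 's2': ['c3', 'c1', 'c4', 'c2']}
matching = {'s1': ['c2', 'c3', 'c1'], 's2': ['c4', 'c2', 'c3']}
overall, per_student = average_rank(matching, true_prefs)
print(round(overall, 4), per_student)
```

Because ranks start at 0, a student who receives their top three courses scores (0 + 1 + 2) / 3 = 1.0, which is the best value possible for a three-course schedule.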


**Part III. Carry out the matching using `test_1()` and `test_2()`**

`df_group1`: 
* Main testing: using `test_1()`
* Comparison testing: using `test_2()`

`df_group2`:
* Main testing: using `test_2()`
* Comparison testing: using `test_1()`

In [None]:
# Main test for df_group1
test_1(1)

Test Average Rank: 1.9762



{'atc214': ['c2', 'c5', 'c1'],
 'ih2350': ['c2', 'c6', 'c3'],
 'jy3026': ['c2', 'c3', 'c1'],
 'la2836': ['c5', 'c8', 'c6'],
 'mz2776': ['c6', 'c8', 'c3'],
 'qt2131': ['c5', 'c8', 'c6'],
 'rg3266': ['c7', 'c8', 'c1'],
 'rrb215': ['c3', 'c8', 'c1'],
 'sjl222': ['c1', 'c6', 'c3'],
 'tg2718': ['c3', 'c7', 'c1'],
 'wx2226': ['c2', 'c7', 'c1'],
 'xt2230': ['c2', 'c5', 'c6'],
 'zl2856': ['c1', 'c3', 'c6'],
 'zs2440': ['c2', 'c5', 'c6']}

In [None]:
# Comparison test for df_group1
test_2(1)

Test Average Rank: 1.5001



{'atc214': ['c2', 'c5', 'c1'],
 'ih2350': ['c2', 'c6', 'c3'],
 'jy3026': ['c2', 'c3', 'c1'],
 'la2836': ['c6', 'c8', 'c5'],
 'mz2776': ['c2', 'c6', 'c3'],
 'qt2131': ['c5', 'c8', 'c6'],
 'rg3266': ['c2', 'c5', 'c1'],
 'rrb215': ['c3', 'c6', 'c1'],
 'sjl222': ['c3', 'c6', 'c1'],
 'tg2718': ['c5', 'c7', 'c1'],
 'wx2226': ['c2', 'c7', 'c1'],
 'xt2230': ['c2', 'c4', 'c3'],
 'zl2856': ['c3', 'c6', 'c1'],
 'zs2440': ['c2', 'c4', 'c3']}

In [None]:
# Main test for df_group2
test_2(2)

Test Average Rank: 1.6667



{'cf2799': ['c4', 'c7', 'c3'],
 'da2899': ['c4', 'c7', 'c3'],
 'js5553': ['c2', 'c3', 'c1'],
 'lh2991': ['c2', 'c4', 'c3'],
 'ma3973': ['c2', 'c5', 'c1'],
 'mds225': ['c2', 'c6', 'c1'],
 'pa2561': ['c5', 'c6', 'c1'],
 'qz2391': ['c2', 'c4', 'c3'],
 'rs4011': ['c2', 'c4', 'c3'],
 'sa3763': ['c5', 'c6', 'c1'],
 'sc4597': ['c2', 'c6', 'c3'],
 'sc4617': ['c5', 'c7', 'c1'],
 'sc4619': ['c5', 'c7', 'c1'],
 'sc4811': ['c6', 'c8', 'c1'],
 'sg3775': ['c4', 'c6', 'c3'],
 'sj2993': ['c2', 'c5', 'c1'],
 'tnw211': ['c2', 'c5', 'c6'],
 'vml213': ['c2', 'c3', 'c1'],
 'wg2347': ['c6', 'c8', 'c1'],
 'wr2325': ['c4', 'c8', 'c3'],
 'xm2235': ['c4', 'c6', 'c3'],
 'yd2547': ['c5', 'c6', 'c1'],
 'yf2507': ['c5', 'c6', 'c1'],
 'yp2555': ['c2', 'c4', 'c3'],
 'yw3379': ['c2', 'c4', 'c3'],
 'zp2215': ['c2', 'c4', 'c5']}

In [None]:
# Comparison test for df_group2
test_1(2)

Test Average Rank: 2.3077



{'cf2799': ['c3', 'c8', 'c1'],
 'da2899': ['c4', 'c8', 'c6'],
 'js5553': ['c2', 'c3', 'c1'],
 'lh2991': ['c2', 'c4', 'c3'],
 'ma3973': ['c5', 'c6', 'c1'],
 'mds225': ['c2', 'c6', 'c1'],
 'pa2561': ['c5', 'c8', 'c1'],
 'qz2391': ['c2', 'c4', 'c3'],
 'rs4011': ['c4', 'c8', 'c6'],
 'sa3763': ['c7', 'c8', 'c1'],
 'sc4597': ['c2', 'c6', 'c3'],
 'sc4617': ['c7', 'c8', 'c1'],
 'sc4619': ['c2', 'c3', 'c1'],
 'sc4811': ['c7', 'c8', 'c1'],
 'sg3775': ['c4', 'c8', 'c3'],
 'sj2993': ['c2', 'c7', 'c1'],
 'tnw211': ['c2', 'c7', 'c3'],
 'vml213': ['c2', 'c3', 'c1'],
 'wg2347': ['c7', 'c8', 'c1'],
 'wr2325': ['c4', 'c8', 'c3'],
 'xm2235': ['c4', 'c7', 'c3'],
 'yd2547': ['c4', 'c5', 'c6'],
 'yf2507': ['c5', 'c6', 'c1'],
 'yp2555': ['c2', 'c4', 'c3'],
 'yw3379': ['c2', 'c4', 'c3'],
 'zp2215': ['c2', 'c4', 'c5']}

### Test 3 (Preferences + Timeslot bidding)


In [None]:
def timeToCourse(df):
  """Converts time bids to course bids, keeping the rank columns R1-R8."""

  # check that the dataframe passed into function contains time bids
  try:
    df.loc[:,'t1':'t4']
  except KeyError:
    raise ValueError("No timeslot keys found. This function only works for df_group2.")

  new_df = df.loc[:,'R1':'R8'].copy()  # copy so the new columns are not written to a slice of df
  new_df['c1'] = df.t2; new_df['c2'] = df.t3; new_df['c3'] = df.t4; new_df['c4'] = df.t2
  new_df['c5'] = df.t4; new_df['c6'] = df.t1; new_df['c7'] = df.t1; new_df['c8'] = df.t3


  return new_df

In [None]:
# timeToCourse(df_group2)

In [None]:
def test_3(group=2, returnAvgRank=False, seed=42):
    # (df_group1, df_group2) = self.preprocess(filename)
    if group == 1:
        raise ValueError("Test 3 only applies to df_group2.")
    elif group == 2:
        df = timeToCourse(df_group2)
    else:
        raise ValueError("Group index out of range!")
        
    (courseViewSC, cap) = assignSC(df, seed=seed)
    updatedPref = resolveTimeConflict(df, courseViewSC)
    courseView = assign(df, updatedPref, cap, seed=seed)
    
    # convert to {uni: course-list}
    studentViewSC = courseToStudentView(courseViewSC)
    studentView = courseToStudentView(courseView)
    
    # combine sc with other courses
    for u in studentView.keys():
        studentView[u].extend(studentViewSC[u])

    # compute average rank
    trueRank = {u: {c: i for i, c in enumerate(prefs)} 
                    for u, prefs in get_pref(df).items()}
    studentAvgRank = {u: round(sum([trueRank.get(u).get(c) for c in m])/3,3) 
                                                   for u, m in studentView.items()}
    testAvgRank = round(sum([r for (u,r) in studentAvgRank.items()])/len(trueRank),4)

    # if returnAvgRank=True, return avg ranks instead of matching
    if returnAvgRank:
      return testAvgRank, studentAvgRank
    
    print('Test Average Rank:', testAvgRank)
    print()
    
    # The last (third) course is the semi-core requirement
    return studentView

**Carry out the matching using `test_3()`**

For `df_group2` only


In [None]:
test_3(2)

Test Average Rank: 1.7563



{'cf2799': ['c2', 'c4', 'c3'],
 'da2899': ['c4', 'c7', 'c3'],
 'js5553': ['c2', 'c3', 'c1'],
 'lh2991': ['c2', 'c4', 'c3'],
 'ma3973': ['c2', 'c3', 'c1'],
 'mds225': ['c4', 'c6', 'c3'],
 'pa2561': ['c5', 'c6', 'c1'],
 'qz2391': ['c2', 'c6', 'c1'],
 'rs4011': ['c2', 'c4', 'c3'],
 'sa3763': ['c5', 'c6', 'c1'],
 'sc4597': ['c2', 'c6', 'c3'],
 'sc4617': ['c5', 'c7', 'c1'],
 'sc4619': ['c2', 'c5', 'c1'],
 'sc4811': ['c6', 'c8', 'c1'],
 'sg3775': ['c4', 'c6', 'c3'],
 'sj2993': ['c2', 'c5', 'c1'],
 'tnw211': ['c2', 'c5', 'c6'],
 'vml213': ['c2', 'c5', 'c1'],
 'wg2347': ['c6', 'c8', 'c1'],
 'wr2325': ['c4', 'c8', 'c3'],
 'xm2235': ['c4', 'c6', 'c3'],
 'yd2547': ['c5', 'c6', 'c1'],
 'yf2507': ['c4', 'c6', 'c3'],
 'yp2555': ['c2', 'c5', 'c1'],
 'yw3379': ['c2', 'c4', 'c3'],
 'zp2215': ['c4', 'c6', 'c5']}

### Test 4 (Preferences only)

We ignore the bids and only use student preferences to conduct the assignment.

For course preferences, we use a unique lottery to represent 'first-come-first-serve'.

In [None]:
def lottery(ds, seed=42, reverse=False):
    '''Uses a unique lottery to create a fake bid df for matching'''
    np.random.seed(seed)
    ds = ds.reset_index()
    if reverse:
      list_ = np.flip(np.random.permutation(ds['UNI']))
    else:
      list_ = np.random.permutation(ds['UNI'])
    lotteryBids = pd.DataFrame([([u]+[100-r]*8) for r,u in enumerate(list_)]).set_index(0)
    lotteryBids.columns = course
    return lotteryBids
    

In [None]:
lottery(df_group1)

| UNI | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 |
|---|---|---|---|---|---|---|---|---|
| atc214 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| la2836 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 |
| qt2131 | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 98 |
| ih2350 | 97 | 97 | 97 | 97 | 97 | 97 | 97 | 97 |
| mz2776 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96 |
| jy3026 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 95 |
| xt2230 | 94 | 94 | 94 | 94 | 94 | 94 | 94 | 94 |
| zs2440 | 93 | 93 | 93 | 93 | 93 | 93 | 93 | 93 |
| rrb215 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92 |
| zl2856 | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 91 |


**Part I. Assign 1 semi-core to each student**

In [None]:
# checking that assignSC() works for test 4
assignSC(df_group2, test4=True)

({'c1': [(100, 'sc4617'),
   (97, 'rs4011'),
   (96, 'tnw211'),
   (95, 'ma3973'),
   (93, 'sj2993'),
   (90, 'sa3763'),
   (88, 'yp2555'),
   (87, 'yf2507'),
   (86, 'js5553'),
   (85, 'mds225'),
   (83, 'sc4619'),
   (81, 'vml213'),
   (80, 'yd2547')],
  'c3': [(99, 'yw3379'),
   (98, 'wr2325'),
   (94, 'da2899'),
   (92, 'sg3775'),
   (91, 'sc4597'),
   (89, 'xm2235'),
   (82, 'lh2991'),
   (79, 'cf2799'),
   (78, 'qz2391')],
  'c5': [(84, 'zp2215'), (75, 'pa2561')],
  'c6': [(77, 'wg2347'), (76, 'sc4811')]},
 {'c1': 0,
  'c2': 13,
  'c3': 4,
  'c4': 13,
  'c5': 11,
  'c6': 11,
  'c7': 13,
  'c8': 13})

**Part II. Assign 2 courses to each student**

In [None]:
def test_4(group=1, returnAvgRank=False, seed=42):
    # (df_group1, df_group2) = self.preprocess(filename)
    if group == 1:
        df = df_group1
    elif group == 2:
        df = df_group2
    else:
        raise ValueError("Group index out of range!")
        
    (courseViewSC, cap) = assignSC(df, test4=True, seed=seed)
    updatedPref = resolveTimeConflict(df, courseViewSC)
    courseView = assign(df, updatedPref, cap, test4=True)  # note: seed is not forwarded, so assign() uses its default seed=42
    
    # convert to {uni: course-list}
    studentViewSC = courseToStudentView(courseViewSC)
    studentView = courseToStudentView(courseView)
    
    # combine sc with other courses
    for u in studentView.keys():
        studentView[u].extend(studentViewSC[u])

    # compute average rank
    trueRank = {u: {c: i for i, c in enumerate(prefs)} 
                    for u, prefs in get_pref(df).items()}
    studentAvgRank = {u: round(sum([trueRank.get(u).get(c) for c in m])/3,3) 
                                                   for u, m in studentView.items()}
    testAvgRank = round(sum([r for (u,r) in studentAvgRank.items()])/len(trueRank),4)

    # if returnAvgRank=True, return avg ranks instead of matching
    if returnAvgRank:
      return testAvgRank, studentAvgRank

    print('Test Average Rank:', testAvgRank)
    print()
        
    # The last (third) course is the semi-core requirement
    return studentView

**Part III. Carry out test 4**

For both `df_group1` and `df_group2`

In [None]:
# Test 4 with df_group1
test_4(1)

Test Average Rank: 1.4049



{'atc214': ['c2', 'c5', 'c1'],
 'ih2350': ['c2', 'c6', 'c3'],
 'jy3026': ['c2', 'c3', 'c1'],
 'la2836': ['c6', 'c8', 'c5'],
 'mz2776': ['c2', 'c6', 'c3'],
 'qt2131': ['c5', 'c8', 'c6'],
 'rg3266': ['c2', 'c3', 'c1'],
 'rrb215': ['c3', 'c6', 'c1'],
 'sjl222': ['c5', 'c6', 'c1'],
 'tg2718': ['c4', 'c7', 'c5'],
 'wx2226': ['c2', 'c7', 'c1'],
 'xt2230': ['c2', 'c4', 'c3'],
 'zl2856': ['c3', 'c6', 'c1'],
 'zs2440': ['c2', 'c3', 'c1']}

In [None]:
# Test 4 with df_group2
test_4(2)

Test Average Rank: 1.641



{'cf2799': ['c4', 'c7', 'c3'],
 'da2899': ['c2', 'c4', 'c3'],
 'js5553': ['c2', 'c5', 'c1'],
 'lh2991': ['c2', 'c4', 'c3'],
 'ma3973': ['c2', 'c3', 'c1'],
 'mds225': ['c2', 'c6', 'c1'],
 'pa2561': ['c4', 'c6', 'c5'],
 'qz2391': ['c4', 'c6', 'c3'],
 'rs4011': ['c2', 'c3', 'c1'],
 'sa3763': ['c5', 'c6', 'c1'],
 'sc4597': ['c2', 'c6', 'c3'],
 'sc4617': ['c5', 'c7', 'c1'],
 'sc4619': ['c2', 'c5', 'c1'],
 'sc4811': ['c4', 'c8', 'c6'],
 'sg3775': ['c4', 'c6', 'c3'],
 'sj2993': ['c2', 'c3', 'c1'],
 'tnw211': ['c2', 'c6', 'c1'],
 'vml213': ['c5', 'c7', 'c1'],
 'wg2347': ['c4', 'c8', 'c6'],
 'wr2325': ['c4', 'c8', 'c3'],
 'xm2235': ['c4', 'c6', 'c3'],
 'yd2547': ['c5', 'c6', 'c1'],
 'yf2507': ['c5', 'c6', 'c1'],
 'yp2555': ['c2', 'c3', 'c1'],
 'yw3379': ['c2', 'c4', 'c3'],
 'zp2215': ['c2', 'c4', 'c5']}

## Section 4: Average Rank Points Comparison

TODO: Calculate the average rank points for each test+group, and compare across tests for each group.

*for Test 4, it may be a good idea to run the test with different seeds and find the average ranks over all the tests*
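
One way to organise such a seed sweep is to report the spread alongside the mean. The sketch below is an illustration only: `average_over_seeds` and `toy_metric` are hypothetical names, and `toy_metric` stands in for a call such as `test_4(1, returnAvgRank=True, seed=s)[0]`.

```python
import random
from statistics import mean, stdev

def average_over_seeds(metric, seeds):
    """Evaluate metric(seed) for every seed; return (mean, sample standard deviation)."""
    values = [metric(s) for s in seeds]
    return mean(values), stdev(values)

def toy_metric(seed):
    """Stand-in for a seeded test run: a value near 2.0 that depends on the seed."""
    rng = random.Random(seed)
    return 2.0 + rng.uniform(-0.1, 0.1)

avg, spread = average_over_seeds(toy_metric, range(42, 50))
print(round(avg, 4), round(spread, 4))
```

Reporting the standard deviation shows whether a difference between two tests is larger than the variation caused by the lottery draw alone.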

In [None]:
print('For df_group1, average ranks:')
print('  - Test 1:', test_1(1, returnAvgRank=True)[0])
print('  - Test 2:', test_2(1, returnAvgRank=True)[0])
print('  - Test 4:', test_4(1, returnAvgRank=True)[0])
# average over 8 different seeds for test 4
test4avgRank = 0
for s in range(42,50):
  test4avgRank += test_4(1, returnAvgRank=True, seed=s)[0]/8
print('  - Test 4 (averaged seeds):', round(test4avgRank,4))
print()
print('For df_group2, average ranks:')
print('  - Test 1:', test_1(2, returnAvgRank=True)[0])
print('  - Test 2:', test_2(2, returnAvgRank=True)[0])
print('  - Test 3:', test_3(2, returnAvgRank=True)[0])
print('  - Test 4:', test_4(2, returnAvgRank=True)[0])
# average over 8 different seeds for test 4
test4avgRank = 0
for s in range(42,50):
  test4avgRank += test_4(2, returnAvgRank=True, seed=s)[0]/8
print('  - Test 4 (averaged seeds):', round(test4avgRank,4))


For df_group1, average ranks:
  - Test 1: 2.4286
  - Test 2: 1.7381
  - Test 4: 1.9524
  - Test 4 (averaged seeds): 1.872

For df_group2, average ranks:
  - Test 1: 2.7693
  - Test 2: 2.1026
  - Test 3: 2.2308
  - Test 4: 2.3077
  - Test 4 (averaged seeds): 2.3493


First, using students' true preferences gives better outcomes (a lower average rank) than inferring preferences from their bids: Test 2 beats Test 1 for both `df_group1` (1.7381 vs. 2.4286) and `df_group2` (2.1026 vs. 2.7693).

Preliminarily, first-come-first-serve with true preferences (Test 4) also performs better than bidding alone (Test 1): 1.9524 vs. 2.4286 for `df_group1` and 2.3077 vs. 2.7693 for `df_group2`.
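
The reported averages can also be lined up programmatically. The snippet below copies the single-seed numbers from the printed output above and lists each test's gap to the best test in its group; the variable names are hypothetical.

```python
# Average ranks copied from the printed output above (0-based ranks, lower is better)
reported = {
    'df_group1': {'Test 1': 2.4286, 'Test 2': 1.7381, 'Test 4': 1.9524},
    'df_group2': {'Test 1': 2.7693, 'Test 2': 2.1026, 'Test 3': 2.2308, 'Test 4': 2.3077},
}

# Best (lowest average rank) test per group
best_by_group = {g: min(r, key=r.get) for g, r in reported.items()}

for group, ranks in reported.items():
    best = best_by_group[group]
    print(group, '-> best:', best)
    for test, value in sorted(ranks.items(), key=lambda kv: kv[1]):
        print(f'  {test}: {value:.4f} (gap to best: {value - ranks[best]:+.4f})')
```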

