## Summary

This notebook was used to insert wildcards into our non-wildcard solutions in an "intelligent" way, by modelling the problem as a TSP and solving it with the LKH solver. To do so for a given string, we select two permutations beforehand and replace one character of each with a wildcard. The only effect of this is to change the distances "from" and "to" those permutations.

For example: if we decide to insert a wildcard in the first position of "**1234567**", then the distance from "1456723" to "**1234567**" drops from 7 to 4, since the "723" at the end of the first permutation now overlaps with the "123" at the start of the second permutation, whose leading "1" has become a wildcard.
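The example above can be checked with a minimal sketch. It is a simplified, string-based version of the idea and not the notebook's `perm_dist` (which works on integer tuples and uses 8 as the wildcard); the function name `overlap_distance` and the `*` wildcard symbol are assumptions made for illustration.

```python
def overlap_distance(p, q, wildcard="*"):
    """Fewest symbols to append to p so that q appears at the end.

    A `wildcard` symbol inside q matches any symbol of p.
    """
    n = len(p)
    for shift in range(n):
        # Compare the last n-shift symbols of p with the first n-shift symbols of q.
        if all(b == wildcard or a == b for a, b in zip(p[shift:], q[:n - shift])):
            return shift
    return n  # no overlap at all: q must be appended in full


print(overlap_distance("1456723", "1234567"))  # 7: no suffix of p is a prefix of q
print(overlap_distance("1456723", "*234567"))  # 4: "723" matches "*23"
```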

The strategy then relies on two factors:
1. Given a string, detecting the optimal permutations and positions where the wildcard should be inserted.
2. Having done that, processing the strings and running LKH plus extra optimizations to reduce the lengths.

These two factors are tackled as follows:
1. Since the mandatory permutations (those starting with "12") are far more frequent than those starting with a different 2-symbol prefix (three times as frequent, as they must appear in all three strings), it is harder to "stitch" them together nicely within the string. For this reason, we require the permutations that receive a wildcard to be mandatory permutations. Moreover, we found experimentally that the best position is the first one (in other words, replacing the symbol "1"). As for which two mandatory permutations to use, my teammate Guillermo García Cobo developed a heuristic which attempts to find the best choices, linked here: https://www.kaggle.com/atmguille/santa-2021-best-place-for-wildcards. This heuristic was used to get our best solution, but is not included in this notebook for clarity's sake.
2. This notebook tackles factor 2 entirely.
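To put numbers on the "three times as frequent" remark, here is a small counting sketch. It only assumes what `check_validity` enforces later in this notebook: every mandatory permutation must appear in each of the three strings, and every other permutation in at least one of them. The variable names are chosen for illustration.

```python
import itertools

all_perms = set(itertools.permutations(range(1, 8)))
mandatory = {p for p in all_perms if p[:2] == (1, 2)}
others = all_perms - mandatory

# Mandatory permutations are needed once per string, the rest only once overall.
required_occurrences = 3 * len(mandatory) + len(others)

print(len(all_perms), len(mandatory), len(others))  # 5040 120 4920
print(required_occurrences)                         # 5280
```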

**In this demonstration, we will start with a non-wildcard solution of score 2442 and lower it to a wildcard solution of score 2430. The permutations that will have a wildcard were selected as explained in step 1 of the previous list, but are entered manually here for clarity's sake. Explanations and further insights are provided throughout the notebook.**

## INPUT

We start by providing a score 2442 solution that we found using CTSP. Even though we had plenty of 2440 (optimal) solutions, none of them went below 2430, so we decided to use this one as a demonstration. We also need to specify which permutations will have a wildcard, and the wildcard position (a value from 0 to 6).

In [None]:
# Input the strings
str1="12637452163745261374526317452637145263741526374125467312647531265743125647312654732165473261547326514732654173265471326547123654712756341256734127356412735461257346215734625173462571346257314625734162573412564372156437251643725614372564137256431725643127564321756432715643275164327561432756413275641234657124365721436572413657243165724361572436517243651274365217436527143652741365274316527436152743612475361243765127346521734652713465273146527341652734615273461257436215743625174362571436257413625743162574312653471263547123657412637541273654217365427136542731654273615427365142736512436751263475216347526134752631475263417526347152634712536472153647251364725316472536147253641725364127543612753461275364125736421573642517364257136425731642573614257361245763124736512643752164375261437526413752643175264371526437125463721546372514637254163725461372546317254631275463217546327154632751463275416327546132754612347561245736125347612534671254367125367412653741253764127643512763451267453125746312563472156347251634725613472563147256341725634125764312576341256374126753421675342617534267153426751342675314267531274635123546712354761235647123567412357461235764123645712367451236754123746512674352167435261743526714352674135267431526743125476312745632174563271456327415632745163274561327456123745612347651274356125374621537462513746253174625371462537416253741265734216573426157342651734265713426573142657324165732461573246517324657132465712465732146573261457326415732645173264571326457126457321645732612734561254736215473625147362541736254713625473162547312645371246537124567312453761243756214375624137562431756243715624375162437512643571263457124635721463572416357246135724631572463517246351264735216473526147352641735264713526473152647312654371247563124763512456371245367214536724153672451367245316724536172453612473562147356241735624713562473156247351624735124673512467531243576214357624135762431576243517624357162435712543762154376251437625413762543176254371625437125674312765432176543271654327615
4327651432765413276541273645126734512673541263574216357426135742631574263517426357142635712364752136475231647523614752364175236471523647123467512345761234567124356721435672413567243156724351672435617243561274536127653412675431276354217635427163542761354276315427635142763512476531274653217465327146532741653274615327465132746512375641237546123764512376542137654231765423716542376154237651423765124637512764532176453271645327614532764153276451327645"
str2="12357461254376126573412657431256743215674325167432561743256714325674132567412356742135674231567423516742356174235671423567126453721645372614537264153726451372645317264531274536217453627145362741536274513627453162745312635741267354127635412765342176534271653427615342765134276531427653124763512364751234765213476523147652341765234716523476152347612546371254367215436725143672541367254316725436172543612754362175436271543627514362754136275431627543127465312675432167543261754326715432675143267541326754123765412365741236547126354712654372165437261543726514372654137265431726543127364521736452713645273164527361452736415273641275364123756421375642317564237156423751642375614237561237645126375412367451263475123645712356471253476215347625134762531476253417625347162534712537461253764123576421357642315764235176423571642357614235761243756124753612745631253674125364712563742156374251637425613742563174256371425637126435712643751274635217463527146352741635274613527463152746312457632145763241576324517632457163245761324576123547621354762315476235147623541762354716235471265347216534726153472651347265314726534172653412765431246573124675321467532416753246175324671532467513246751263745126345721634572613457263145726341572634517263451274365123746521374652317465237146523741652374615237461245673214567324156732451673245617324567132456712647351264573124653712546731254736125476312457361275463124765321476532417653247165324761532476513247651234756123457621345762314576234157623451762345716234571256347125643712567341257643215764325176432571643257614325764132576412736541237456127345621734562713456273145627341562734516273451237546127354621735462713546273154627351462735416273541267435126745312764531267534126537421653742615374265137426531742653714265371256473125736412576341275634217563427156342751634275613427563142756312756431257346127356412367542136754231675423617542367154236751423675124365712436751234675213467523146752341675234617523467152346712346571234567213456723145672341567234516723456172345612
743562174356271435627413562743156274351627435126475312467351243567124357612354671253467215346725134672531467253416725346172534621753462715346275134627531462753416275341275346212734651276345217634527163452761345276314527634152763412654731245637125743612574631247563214756324175632471563247516324756132475612437651245367124537612463571246375124735612473651276435217643527164352761435276413527643152764312673452167345261734526713452673145267341526734"
str3="12347651243657124653721465372416537246153724651372465317246531264753216475326147532641753264715326475132647512475631254736124537621453762415376245137624531762453716245371256437124563721456372415637245163724561372456317245631274356127546312574361237456213745623174562371456237415623745162374512637542163754261375426317542637154263751426375124673521467352416735246173524671352467315246731274563127634512734561274365124736521473652417365247136524731652473615247361253746123475621347562314756234175623471562347516234751237465124735612735642173564271356427315642735164273561427356127453612357462135746231574623517462357146235741623574126357412673542167354261735426713542673154267351426735123764521376452317645237164523761452376415237641253647123645721364572316457236145723641572364517236451273546127536421753642715364275136427531642753614275361276534125637412563471265374126574321657432615743265174326571432657413265741256743126573412573641257346123756412567342156734251673425617342567134256731425673123647512367452136745231674523617452367145236741523674123576412537642153764251376425317642537164253761425376124573621457362415736245173624571362457316245731245763125746321574632517463257146325741632574613257461234675127364512367541236574213657423165742361574236517423657142365712436752143675241367524316752436175243671524367123457612437561243576123567412534671253476124356712345671235467213546723154672351467235416723546172354612754361237546213754623175462371546237514623754162375412637451247635214763524176352471635247613524763152476312546371265473125647321564732516473256147325641732564713256472135647231564723516472356147235641723564123564721236547213654723165472361547236514723654172365412673451276354127645312674532167453261745326714532674153267451326745127365412753461275634127463512746531275643124567312546732154673251467325416732546173254671325467123547612475362147536241753624715362475136247531624753126743512764351264375124637521463752416375246137524631752463715246371264537126435721643572
614357264135726431572643517264351237654125763421576342517634257163425761342576314257631257643126754312675341253674215367425136742531674253617425367142536712547632154763251476325417632547163254761325476124536712543761246357123465721346572314657234165723461572346517234651273465124376521437652413765243176524371652437615243761254367124765312465731264735124675312645731265437126534712634571263475126354721635472613547263154726351472635417263541276543"
print("Provided strings with lengths: ", len(str1), len(str2), len(str3))

# Input the wildcard pairs for each string, and the wildcard position.
# Since WILDCARD_POS = 0, for example, this means that in string 1, "1246573" will have
# a wildcard instead of the "1".
wild1 = ((1, 2, 4, 6, 5, 7, 3), (1, 2, 6, 4, 5, 7, 3))
wild2 = ((1, 2, 7, 5, 3, 4, 6), (1, 2, 6, 7, 3, 4, 5))
wild3 = ((1, 2, 3, 5, 6, 4, 7), (1, 2, 7, 3, 5, 6, 4))
wildcards = [wild1,wild2,wild3]
WILDCARD_POS=0

We set a time limit for LKH. Since this postprocessing is not very complicated, we go for 30 seconds.

In [None]:
TIME_LIMIT = 30 # Total runtime of the notebook is around (TIME_LIMIT * 3) seconds

## UTILS AND FUNCTIONS

This section contains a set of utility and LKH-related functions. Documentation has been provided for each function, but this is just standard LKH except we update the distance matrix to take the wildcards into consideration.
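One detail of the `distances_matrix` function defined below is its `depot` flag: adding a dummy node with zero-cost edges to and from every other node lets a closed TSP tour stand for an open path, so the first and last permutations no longer need to connect cheaply. The following is a minimal brute-force sketch of that idea on a toy 3-node instance; the matrix values and the helper `tour_cost` are invented for illustration and are not part of the notebook's pipeline.

```python
import itertools
import numpy as np

# Hypothetical 3-node asymmetric distance matrix (values invented for illustration).
d = np.array([[0, 1, 7],
              [7, 0, 2],
              [1, 7, 0]])

# Add a depot as node 0 with zero-cost edges to and from every other node.
with_depot = np.zeros((len(d) + 1, len(d) + 1), dtype=int)
with_depot[1:, 1:] = d


def tour_cost(matrix, order):
    """Cost of the closed tour visiting `order` and returning to its first node."""
    return int(sum(matrix[a, b] for a, b in zip(order, order[1:] + order[:1])))


# Best closed tour without the depot vs. best tour that passes through the depot.
best_cycle = min(tour_cost(d, list(o)) for o in itertools.permutations(range(3)))
best_path = min(tour_cost(with_depot, [0] + [i + 1 for i in o])
                for o in itertools.permutations(range(3)))

print(best_cycle, best_path)  # 4 2
```

With the depot, the tour 0 → 3 → 1 → 2 → 0 costs only 2, because the expensive closing edge is replaced by free edges through the depot.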

In [None]:
# Import dependencies
import itertools
import numpy as np
import pandas as pd
import random
!wget http://webhotel4.ruc.dk/~keld/research/LKH-3/LKH-3.0.7.tgz
!tar xvfz LKH-3.0.7.tgz
!cd LKH-3.0.7; make clean; make; cp LKH ..

In [None]:
def perm_dist(p, q, string_number, use_wildcards=True):
    """
    Computes distance between two permutations. The distance
    between p and q is the minimum number of symbols we need to 
    add to p so that q appears within the extension.
    
    - p and q are tuples of integers between 1 and 8. 8 is interpreted as a wildcard.
    
    - string_number and use_wildcards allows to insert the desired wildcards in specific permutations. 
        If use_wildcards is true, then the previously set wildcards for string with number "string_number" (0 to 2)
        are considered. For example, if wildcards[0] contains (1,2,3,4,5,6,7) and string_number = 0, p =(1,2,3,4,5,6,7),
        then p is replaced internally by (8,2,3,4,5,6,7) (since WILDCARD_POS = 0).
    """
    p = list(p)
    q = list(q)
    
    
    if p==q:
        return 0
    
    if use_wildcards:
        # Apply wildcards
        for j in range(2):
            if p == list(wildcards[string_number][j]):
                p[WILDCARD_POS] = 8
            if q == list(wildcards[string_number][j]):
                q[WILDCARD_POS] = 8    
    
    # Both permutations carry a wildcard: no overlap is attempted, use the maximum distance
    if 8 in q and 8 in p:
        return 7
    
    if 8 in q:
        min_dist = 8
        for i in range(1,8):
            q2 = list(q)
            q2[q2.index(8)] = i
            for j in range(1,8):
                if p[j:]==q2[:-j]:
                    if min_dist > j:
                        min_dist = j
                        break
        return min_dist
                
    if 8 in p:
        min_dist = 8
        for i in range(1,8):
            p2 = list(p)
            p2[p2.index(8)] = i
            for j in range(1,8):
                if p2[j:]==q[:-j]:
                    if min_dist > j:
                        min_dist = j
                        break
        return min_dist
            
    i = p.index(q[0])
    return i if p[i:] == q[:7-i] else 7

def perm_dist_no_wildcards(p, q):
    """
    This is an alias to compute the distance between two permutations without wildcard
    insertion (although if p and q contain wildcards beforehand, they're used).
    """
    return perm_dist(p,q,0,False)


def perms_to_string(perms, string_number, use_wildcards=True):
    """
    Given list of permutations in order, creates a string.
    string_number and use_wildcards allow for wildcard insertion, similar to perm_dist.
    """
    perms = list(perms)
    s = [*perms[0]]
    for p, q in zip(perms, perms[1:]):
        d = perm_dist(p[-7:], q[:7], string_number, use_wildcards)
        s.extend(q[7-d:])
        if use_wildcards:
            if q == wildcards[string_number][0]:
                s[-(7-WILDCARD_POS)] = 8
            elif q == wildcards[string_number][1]:
                s[-(7-WILDCARD_POS)] = 8
    return s

def distances_matrix(perms, string_number, depot=False, use_wildcards=True):
    """
    Computes distance matrix for TSP given the desired permutations and the string number for
    wildcard insertion. Allows for depot insertion as well. (This will be important to remove
    the constraint that the starting and ending permutations are the same.)
    """
    if depot:
        m = np.zeros((len(perms)+1, len(perms)+1), dtype='int8')
    else:
        m = np.zeros((len(perms), len(perms)), dtype='int8')
    for i, p in enumerate(perms):
        for j, q in enumerate(perms):
            if depot:
                m[i+1, j+1] = perm_dist(p[-7:], q[:7], string_number, use_wildcards) + len(q) - 7
            else:
                m[i,j] = perm_dist(p[-7:], q[:7], string_number, use_wildcards) + len(q) - 7
    if depot:
        m[0,:]=0
        m[:,0]=0
    return m

def write_params_file(name="tsp"):
    """
    Writes LKH param file with given name.
    """
    with open(f'{name}.par', 'w') as f:
        print(f'PROBLEM_FILE = {name}.tsp', file=f)
        print(f'TOUR_FILE = {name}.txt', file=f)
        print(f'INITIAL_TOUR_FILE = {name}.txt', file=f)
        print('PATCHING_C = 3', file=f)
        print('PATCHING_A = 2', file=f)
        print('SPECIAL',file=f)
        print('GAIN23 = YES', file=f)
        print('MAX_TRIALS=10000000', file=f) # We want time to be the limit
        print('SEED = 69', file=f) # Fixed seed for reproducibility
        print(f'TIME_LIMIT = {TIME_LIMIT}', file=f) #seconds
        print('TRACE_LEVEL = 1', file=f) # Log detail level.


def write_problem_file(distances,name="tsp"):
    """
    Writes problem file given distance matrix and filename.
    """
    with open(f'{name}.tsp', 'w') as f:
        print('TYPE: ATSP', file=f)
        print(f'DIMENSION: {len(distances)}', file=f)
        print('EDGE_WEIGHT_TYPE: EXPLICIT', file=f)
        print('EDGE_WEIGHT_FORMAT: FULL_MATRIX\n', file=f)
        print('EDGE_WEIGHT_SECTION', file=f)
        for row in distances:
            print(' '.join(str(_) for _ in row), file=f)

def read_output_tour(perms,name="best_tour"):
    """
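    Reads resulting tour given ordered permutations and file name.
    Returns the tour permutations in order.
    """
    perms = list(perms)
    with open(f'{name}.txt') as f:
        lines = f.readlines()
    # Skip the first tour entry (node 1, the depot) and the trailing "-1" and "EOF" lines.
    tour = lines[lines.index('TOUR_SECTION\n')+2:-2]
    # Node k in the tour file corresponds to perms[k-2] (1-based indexing plus the depot).
    return [perms[int(_) - 2] for _ in tour]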
    
def solve_atsp(perms, name="santa.par"):
    """
    Runs command to solve TSP problem from file name and permutations.
    """
    # Run LKH-3 to solve ATSP instance saving log to file
    !touch lkh.log
    !./LKH $name >> lkh.log

def check_validity(str1, str2, str3):
    """
    Check whether given strings are valid strings for the competition. 
    Strings should be lists of integers. Returns true or false and prints problems, if any.
    Wildcards should be marked with an 8.
    """
    all_perms = set(itertools.permutations(range(1, 8), 7))
    mandatory_perms = set((1, 2) +  _ for _ in itertools.permutations(range(3, 8), 5))

    strings_perms = [perms_in_string(str1), perms_in_string(str2), perms_in_string(str3)]
    for i, s in enumerate(strings_perms):
        if mandatory_perms - s:
            print(f'String #{i} is missing {mandatory_perms - s}.')
            return False
    if all_perms - set.union(*strings_perms):
        print(f"missing:{len(all_perms - set.union(*strings_perms))}")
        print(f'Strings are missing {all_perms - set.union(*strings_perms)}.')
        return False
    return True

def perms_in_string(string_as_list):
    """
    Given string as list of integers (possibly 8 for wildcard), returns
    the permutation set covered by the string.
    """
    perms = set()
    for i in range(len(string_as_list)):
        perm = tuple(string_as_list[i:i+7])
        if len(set(perm))==7:
            if 8 not in perm:
                perms.add(perm)
            else:
                if perm.count(8) > 1:
                    continue
                for k in range(1,8):  # k avoids shadowing the outer loop index i
                    perm2 = list(perm)
                    perm2[perm2.index(8)] = k
                    if len(set(perm2))==7:
                        perms.add(tuple(perm2))
                
    return perms

In [None]:
def string_to_tour(string, perms, name="tour"):
    """
    Given a string and a permutation list in order, writes a starting tour file
    for TSP with given filename. Wildcards are allowed in the string as "8" symbols
    """
    seen = set()
    perms = list(perms)
    dimension = len(perms)+1
    lines = [f"DIMENSION: {dimension}\nTYPE: TOUR\nTOUR_SECTION\n1\n"]
    for j in range(len(string)-6):
        perm = tuple(string[j:j+7])
        if perm.count(8) == 1:
            for k in range(1,8):
                perm2 = list(perm)
                perm2[perm2.index(8)] = k
                if len(set(perm2)) == 7:
                    perm = tuple(perm2)
                    break
        if perm not in seen and len(set(perm))==7 and perm in perms:
            seen.add(perm)
            lines.append(f"{perms.index(perm)+2}\n")
    lines.append("-1\nEOF")
    with open(f"{name}.txt", "w") as f:
        f.writelines(lines)

In [None]:
# Define the permutation sets
all_perms = set(itertools.permutations(range(1, 8), 7))
mandatory_perms = set((1, 2) +  _ for _ in itertools.permutations(range(3, 8), 5))
non_mandatory_perms = all_perms - mandatory_perms

## Process

In this section the main process is carried out. The strategy is as follows:
1. Shorten STR1 using TSP with the selected wildcards. This is shown as "OPTIMIZING STR1". After this, STR1 is final.
2. Get the permutation set from STR2 and remove every non-mandatory permutation already covered by STR1. Shorten STR2 with this reduced set of permutations. This is shown as "OPTIMIZING STR2 WITHOUT WILDCARDS".
3. Now, shorten STR2 using TSP with the selected wildcards. This is shown as "OPTIMIZING STR2". After this, STR2 is final.
4. Get the permutation set from STR3 and remove every non-mandatory permutation already covered by STR1 or STR2. Shorten STR3 with this reduced set of permutations. This is shown as "OPTIMIZING STR3 WITHOUT WILDCARDS".
5. Now, shorten STR3 using TSP with the selected wildcards. This is shown as "OPTIMIZING STR3". After this, STR3 is final.
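The pruning in steps 2 and 4 boils down to one set expression, which the cell below applies as `perms2 - (perms2.intersection(perms1) - mandatory_perms)`. As a small sketch on made-up data (the three example permutations and the names `covered_by_str1` and `in_str2` are assumptions for illustration), mandatory permutations are kept even when an earlier string already covers them:

```python
import itertools

mandatory_perms = set((1, 2) + p for p in itertools.permutations(range(3, 8), 5))

# Toy stand-ins for the permutation sets of two strings.
covered_by_str1 = {(1, 2, 3, 4, 5, 6, 7), (2, 3, 4, 5, 6, 7, 1)}
in_str2 = {(1, 2, 3, 4, 5, 6, 7), (2, 3, 4, 5, 6, 7, 1), (3, 4, 5, 6, 7, 1, 2)}

# Drop what str1 already covers, except mandatory permutations.
redundant = in_str2.intersection(covered_by_str1) - mandatory_perms
to_keep = in_str2 - redundant

print(sorted(to_keep))
# [(1, 2, 3, 4, 5, 6, 7), (3, 4, 5, 6, 7, 1, 2)]
```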

In [None]:
# Convert
str1 = [int(e) for e in str1]
str2 = [int(e) for e in str2]
str3 = [int(e) for e in str3]

print(f"INITIAL LENGTHS\n{len(str1)} {len(str2)} {len(str3)}")

# Optimize str1
print(f"OPTIMIZING STR1")
perms1 = perms_in_string(str1)
write_params_file("str1")
write_problem_file(distances_matrix(perms1, 0, depot=True), "str1")
string_to_tour(str1,perms1,name="str1")
solve_atsp(perms1, name="str1.par")
tour1=read_output_tour(perms1,name="str1")
str1 = perms_to_string(tour1,0)
print(f"STR1 CHANGED TO {len(str1)}")

# Optimize str2
perms1 = perms_in_string(str1)
perms2 = perms_in_string(str2)
perms2 = perms2 - (perms2.intersection(perms1) - mandatory_perms)
print(f"OPTIMIZING STR2 WITHOUT WILDCARDS")
write_params_file("str2")
write_problem_file(distances_matrix(perms2, 1, depot=True, use_wildcards=False), "str2")
string_to_tour(str2,perms2,name="str2")
solve_atsp(perms2, name="str2.par")
tour2=read_output_tour(perms2,name="str2")
str2 = perms_to_string(tour2,1,use_wildcards=False)
print(f"STR2 CHANGED TO {len(str2)}")
perms2 = perms_in_string(str2)
perms2 = perms2 - (perms2.intersection(perms1) - mandatory_perms)
print(f"OPTIMIZING STR2")
write_params_file("str2")
write_problem_file(distances_matrix(perms2, 1, depot=True), "str2")
string_to_tour(str2,perms2,name="str2")
solve_atsp(perms2, name="str2.par")
tour2=read_output_tour(perms2,name="str2")
str2 = perms_to_string(tour2,1)
print(f"STR2 CHANGED TO {len(str2)}")

# Optimize str3
perms1 = perms_in_string(str1)
perms2 = perms_in_string(str2)
perms3 = perms_in_string(str3)
perms3 = perms3 - (perms3.intersection(perms1.union(perms2)) - mandatory_perms)
print(f"OPTIMIZING STR3 WITHOUT WILDCARDS")
write_params_file("str3")
write_problem_file(distances_matrix(perms3, 2, depot=True, use_wildcards=False), "str3")
string_to_tour(str3,perms3,name="str3")
solve_atsp(perms3, name="str3.par")
tour3=read_output_tour(perms3,name="str3")
str3 = perms_to_string(tour3,2, use_wildcards=False)
print(f"STR3 CHANGED TO {len(str3)}")
perms3 = perms_in_string(str3)
perms3 = perms3 - (perms3.intersection(perms1.union(perms2)) - mandatory_perms)
print(f"OPTIMIZING STR3")
write_params_file("str3")
write_problem_file(distances_matrix(perms3, 2, depot=True), "str3")
string_to_tour(str3,perms3,name="str3")
solve_atsp(perms3, name="str3.par")
tour3=read_output_tour(perms3,name="str3")
str3 = perms_to_string(tour3,2)
print(f"STR3 CHANGED TO {len(str3)}")


### Show and store

This section checks the validity of the strings, prints statistics and prints three python variable declarations so that the strings can be used somewhere else in our pipeline. It also prints wildcard information.

In [None]:
str1 = [int(e) for e in str1]
str2 = [int(e) for e in str2]
str3 = [int(e) for e in str3]
if not check_validity(str1, str2, str3):
    print("INVALID STRINGS")
else:
    print("VALID STRINGS")
print("PERMS PER STRING")
print(len(perms_in_string(str1)),len(perms_in_string(str2)),len(perms_in_string(str3)))
print("SYMBOLS PER STRING")
print(len(str1),len(str2),len(str3))
print("WILDCARDS")
print(wildcards)
print("WILDCARD POSITION")
print(WILDCARD_POS)
print("STRINGS")
print("="*50)
print("str1=",end="")
print("\"",''.join(str(e) for e in str1),"\"",sep="")
print("str2=",end="")
print("\"",''.join(str(e) for e in str2),"\"",sep="")
print("str3=",end="")
print("\"",''.join(str(e) for e in str3),"\"",sep="")

The submission file is generated here.

In [None]:
LETTERS = {
    1: '🎅',  # father christmas
    2: '🤶',  # mother christmas
    3: '🦌',  # reindeer
    4: '🧝',  # elf
    5: '🎄',  # christmas tree
    6: '🎁',  # gift
    7: '🎀',  # ribbon
    8: '🌟',  # star
}
strings = [str1, str2, str3]
sub = pd.DataFrame()
sub['schedule'] = [''.join(LETTERS[x] for x in s) for s in strings]
sub_name = f'submission.csv'
sub.to_csv(sub_name, index=False)

## Parting words

This notebook, paired with the wildcard selection heuristic mentioned at the beginning, is what got us to 2430. There is still one disadvantage to using TSP for wildcards: the permutations immediately before and after the one with the wildcard determine which number the "wildcard" becomes. In other words, suppose that we have permutation **1234567** with a wildcard in the first position, and **1723456** before it:

**...-1723456-1234567-...**

Notice that the distance of that edge is 1 because of the wildcard. However, for the permutation that goes before **1723456**, the wildcard "doesn't exist": it is forced to be a **7**, the value that **1723456** needs it to take. This is a limitation that can't be modelled as a TSP.

In order to circumvent this issue, we created further code that merged the permutations as one big group (in our example, **17234567** with the wildcard in the first "7"), which fixes the problem. This code is here: https://www.kaggle.com/miguelgonzalez2/santa-2021-wildcard-longer-patterns-groups/
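To see why the merged group works, the sketch below lists the permutations covered by the 8-symbol string in which the first "7" of **17234567** is a wildcard. It follows the same window-expansion idea as `perms_in_string` above, but on plain strings; the function name `covered_perms` and the `*` wildcard symbol are assumptions for illustration, and this is not the code from the linked notebook.

```python
def covered_perms(string, symbols="1234567", wildcard="*"):
    """Return every permutation of `symbols` matched by some window of `string`."""
    n = len(symbols)
    found = set()
    for i in range(len(string) - n + 1):
        window = string[i:i + n]
        if window.count(wildcard) > 1:
            continue  # windows with more than one wildcard are ignored
        if wildcard in window:
            candidates = [window.replace(wildcard, s) for s in symbols]
        else:
            candidates = [window]
        # Keep only the candidates that use every symbol exactly once.
        found.update(c for c in candidates if set(c) == set(symbols))
    return found


print(sorted(covered_perms("1*234567")))  # ['1234567', '1723456']
```

The single wildcard acts as a "7" for the first window and as a "1" for the second, which is what treating the two permutations as one group captures.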

If this notebook was useful, kindly consider upvoting.