In this notebook, I build a database of the 2023 NCAA Swimming Championships results. The official results are only available as PDFs, so the extracted text needs some massaging before I have a clean data set that I can analyze.

In [1]:
from PyPDF2 import PdfReader
readerW = PdfReader('2023_NCAA_Division_I_Women_-_Final_Results.pdf')
readerM = PdfReader('2023_NCAA_DI_Men_-_Final_Results.pdf')
print(len(readerW.pages), len(readerM.pages))

67 61


Once extracted as plain text, the results are much harder to read than in the original PDF layout.

In [2]:
pageText = readerW.pages[5].extract_text()
print(pageText)
lines = pageText.splitlines()
print(lines)

NCAA Division I Championship Meet HY-TEK's MEET MANAGER 7.0 - 9:01 PM  3/18/2023  Page 6
2023 NCAA Division I Women's
Swimming & Diving Championships
Results
Consolation Final ...   (Event 3  Women 500 Yard Freestyle)
Yr Name School Finals Time Prelim Time Points
Tennessee SO McCarville, Kate 4:40.54 5 4:40.43 12
1:50.59 (28.47) 1:22.12 (28.20) 53.92 (27.97) r:+0.67  25.95
3:44.75 (28.34) 3:16.41 (28.54) 2:47.87 (28.63) 2:19.24 (28.65)
4:12.97 (28.22) 4:40.54 (27.57)
Arizona St SR Looney, Lindsay 4:40.72 4 4:40.81 13
1:51.06 (28.66) 1:22.40 (28.56) 53.84 (28.02) r:+0.69  25.82
3:45.38 (28.46) 3:16.92 (28.54) 2:48.38 (28.60) 2:19.78 (28.72)
4:13.66 (28.28) 4:40.72 (27.06)
California SR Motekaitis, Mia 4:40.90 3 4:40.80 14
1:49.28 (28.38) 1:20.90 (28.07) 52.83 (27.59) r:+0.75  25.24
3:44.20 (28.90) 3:15.30 (28.85) 2:46.45 (28.66) 2:17.79 (28.51)
4:13.06 (28.86) 4:40.90 (27.84)
Florida SR Mathieu, Tylor 4:41.18 2 4:40.62 15
1:51.69 (28.72) 1:22.97 (28.33) 54.64 (28.11) r:+0.77  26.53
3:46

There are some patterns that we can use. Each page starts with a four-line header that can be skipped. Each event lists its name followed by the NCAA, Meet, American, US Open, and Pool records before the event results. Individual events have separate results sections for Preliminaries and Finals. When an event's results span multiple pages, each continuation page carries an extra heading naming the event being continued.
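
As a rough illustration of these patterns, the sketch below classifies a single line of extracted text. The regular expressions and the `classify_line` helper are my own assumptions based on the sample lines printed above; they are not part of the parsing functions used later in this notebook.

```python
import re

# Assumed patterns, derived from the sample lines printed above.
PAGE_HEADER = re.compile(r"MEET MANAGER .* Page \d+$")
EVENT_CONTINUED = re.compile(r"\(Event\s+(\d+)\s+(.+)\)\s*$")
EVENT_START = re.compile(r"^\s*Event\s+(\d+)\s+(.+)$")

def classify_line(line):
    """Return (kind, event_number) for one line of extracted text."""
    if PAGE_HEADER.search(line):
        return ('page_header', None)
    match = EVENT_CONTINUED.search(line)
    if match:
        # Heading repeated at the top of a page when an event continues.
        return ('event_continued', int(match.group(1)))
    match = EVENT_START.match(line)
    if match:
        return ('event_start', int(match.group(1)))
    return ('other', None)

print(classify_line(' Event 1  Women 200 Yard Medley Relay'))
print(classify_line('Consolation Final ...   (Event 3  Women 500 Yard Freestyle)'))
```

Anything classified as `'other'` would still need the event-specific handling described in the rest of the notebook.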

In [3]:
linesW = []
for i in range(len(readerW.pages)):
    linesW += readerW.pages[i].extract_text().splitlines()
print(linesW)

["NCAA Division I Championship Meet HY-TEK's MEET MANAGER 7.0 - 9:01 PM  3/18/2023  Page 1", "2023 NCAA Division I Women's", 'Swimming & Diving Championships', 'Results', ' Event 1  Women 200 Yard Medley Relay', 'NCAA: 1:31.51 N3/15/2023 Virginia', 'G Walsh, A Walsh, A Cuomo, K Douglass', 'Meet:1:31.51 M3/15/2023 Virginia', 'G Walsh, A Walsh, A Cuomo, K Douglass', 'American: 1:31.51 A3/15/2023 Virginia', 'G Walsh, A Walsh, A Cuomo, K Douglass', 'US Open: 1:31.51 O3/15/2023 Virginia', 'G Walsh, A Walsh, A Cuomo, K Douglass', 'Pool: 1:31.51 P3/15/2023 Virginia', 'G Walsh, A Walsh, A Cuomo, K Douglass', ' Team Relay Finals Time Seed Time Points', '    Virginia 1:31.51N 40 1:31.73 1', '1) Walsh, Gretchen SO 2) r:0.20 Walsh, Alex JR 3) r:0.01 Cuomo, Lexi SR 4) r:0.26 Douglass, Kate SR', '1:31.51 (20.34) 1:11.17 (22.10) 49.07 (26.30) r:+0.71  22.77', '    NC State 1:32.42 34 1:33.02 2', '1) Berkoff, Katharine SR 2) r:0.22 MacCausland, Heather SR 3) r:0.13 Alons, Kylee 5Y 4) r:0.22 Arens, Abb

In [4]:
len(linesW)

3751

I'll now go through the list line by line, using a few helper functions for the repeated steps, and turn it into a DataFrame, which can then be used to save the results as a CSV.
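
The general shape of that step can be sketched as follows: collect one dictionary per parsed row, build the DataFrame once at the end, and write it out as CSV. The two rows are taken from the first women's relay and trimmed to a few columns; writing to an in-memory buffer instead of a file is an assumption made only to keep the example self-contained.

```python
import io
import pandas as pd

# One dictionary per parsed result line (values from the first women's relay).
rows = [
    {'Event': 'Event 1 Women 200 Yard Medley Relay', 'School': 'Virginia',
     'Time': '1:31.51', 'Place': '1', 'Points': '40'},
    {'Event': 'Event 1 Women 200 Yard Medley Relay', 'School': 'NC State',
     'Time': '1:32.42', 'Place': '2', 'Points': '34'},
]

# Build the DataFrame in one step rather than appending row by row.
df = pd.DataFrame(rows)

# Write CSV text to a buffer; a file path would work the same way.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Building the frame from a list of dictionaries keeps the parsing code free of pandas details until the very end.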

In [5]:
import pandas as pd
import numpy as np

In [6]:
dfRecordsW = pd.DataFrame([], columns = ['Event', 'Category', 'Date', 'Team', 'Swimmer', 'Time'])
dfResultsW = pd.DataFrame([], columns = ['Event', 'Category', 'Name1', 'Year1', 'Name2', 'Year2', 
                                         'Name3', 'Year3', 'Name4', 'Year4', 'School', 
                                         'QualifyingTime', 'Time', 'Place', 'Points'])
dfResultsW

Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points


In [7]:
def checkHeader(line):
    headers = ['NCAA Division I Championship Meet','2023 NCAA Division I Women\'s', 'Swimming & Diving Championships',
              'Results', '(Event ', ' Team', 'Yr Name','2023 NCAA Division I Men\'s']
    for header in headers:
        if header in line:
            return True
    return False

def getRecords(lines, relay):
    newRows = []
    eventName = lines[0]
    lineCount = 1
    while lineCount < len(lines):
        x = lines[lineCount].split(':',1)
        category = x[0]
        if category not in ['NCAA','Meet','American','US Open','Pool', 'U. S. Open']:
            dfResults = pd.DataFrame(newRows)
            return lineCount, dfResults
        else:
            x = x[1].strip().split(' ')
            time = x[0]
            date = x[1][1:]
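            if relay:
                # Relay record: the team follows the date, the swimmers are on the next line.
                # Note: x[2] keeps only the first word, so 'NC State' is stored as 'NC'.
                team = x[2]
                swimmer = lines[lineCount+1]
            else:
                if len(x) == 5:
                    # Expected layout: time, date, one-word team, first name, last name.
                    team = x[2]
                    swimmer = x[3] + ' ' + x[4]
                else:
                    # Fallback for any other token count (multi-word team, or time and
                    # date fused into one token); these rows are repaired further down.
                    swimmer = x[-1] + ' ' + x[-2]
                    team = ' '.join(x[:-2])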
            dic = {'Event': eventName, 'Category': category, 'Date': date, 'Team': team, 
                   'Swimmer': swimmer, 'Time': time}
            newRows.append(dic)
        if relay:
            lineCount += 2
        else:
            lineCount += 1
    return -1, -1

def extractRelay(lines):
    if 'Event' not in lines[0]:
        print('Bad event data passed to extractRelay: '+lines[0])
        # The caller unpacks three values, so return three here as well.
        return -1, -1, -1
    newResults = []
    eventName = lines[0]
    lineCount = 0
    itr, newRecords = getRecords(lines, True)
    if itr == -1:
        print('Error: ', lines[0])
        return -1, -1, -1
    lineCount += itr
    lineCount += 1
    while lineCount+2 < len(lines):
        if checkHeader(lines[lineCount]):
            lineCount += 1
            continue
 
        x = lines[lineCount].strip()
        if x[:5] == 'Event' or x[:6] == 'Scores':
            dfResults = pd.DataFrame(newResults)
            dfRecords = pd.DataFrame(newRecords)
            return lineCount, dfResults, dfRecords
        x = x.split()
        place = x[-1]
        if place[0] == '*':
            place = place[1:]
        if place == '---':
            seed = x[-2]
            time = x[-3]
            points = 0  # no place was awarded, so no points (and no stale value from the previous row)
            j = -3
            lineCount += 1
        else:
            seed = x[-2]
            if int(place) > 16:
                points = 0
                j = -3
            else:
                points = x[-3]
                j = -4
            if x[j][-1].isnumeric():
                time = x[j]
            else:
                time = x[j][:-1]
        school = ' '.join(x[:j])
        x = lines[lineCount+1].split(') ')
        n1 = ' '.join(x[1].split()[:-2])
        yr1 = x[1].split()[-2]
        n2 = ' '.join(x[2].split()[1:-2])
        yr2 = x[2].split()[-2]
        n3 = ' '.join(x[3].split()[1:-2])
        yr3 = x[3].split()[-2]
        n4 = ' '.join(x[4].split()[1:-1])
        yr4 = x[4].split()[-1]
        dic = {'Event': eventName, 'Category': 'Timed Final Relay', 'Name1': n1, 'Year1': yr1, 
               'Name2': n2, 'Year2': yr2, 'Name3': n3, 'Year3': yr3, 'Name4': n4, 'Year4': yr4, 
               'School': school, 'QualifyingTime': seed, 'Time': time, 'Place': place, 'Points': points}
        newResults.append(dic)
        
        lineCount += 2
        while lines[lineCount].strip()[:2] == 'DQ' or lines[lineCount].strip()[:3] == 'DFS' \
              or lines[lineCount].strip()[0].isnumeric():
            lineCount += 1
    return -1, -1, -1

def extractIndividual(lines):
    if 'Event' not in lines[0]:
        print('Bad event data passed to extractIndividual: '+lines[0])
        # The caller unpacks three values, so return three here as well.
        return -1, -1, -1
    newResults = []
    eventName = lines[0]
    lineCount = 0
    itr, newRecords = getRecords(lines, False)
    if itr == -1:
        print('Error: ', lines[0])
        return -1, -1, -1
    lineCount += itr
    lineCount += 1
    if '1650 Yard Freestyle' in eventName:
        category = 'Timed Final Individual'
    else:
        category = lines[lineCount]
        lineCount += 1
    if 'Swim-off' in category:
        category = 'Swim-off'
    # Example result lines without a points column:
    #   'California SO Alexy, Jack 40.88  41.42 1'
    #   'Stanford SR MacAlister, Leon 45.59  45.54 2'
    while lineCount < len(lines):
        if checkHeader(lines[lineCount]):
            lineCount += 1
            continue
        if lines[lineCount] in ['Championship Final','Consolation Final','Preliminaries']:
            category = lines[lineCount]
            lineCount += 1
            continue
        x = lines[lineCount].strip()
        if x[:5] == 'Event' or x[:6] == 'Scores':
            dfResults = pd.DataFrame(newResults)
            dfRecords = pd.DataFrame(newRecords)
            return lineCount, dfResults, dfRecords
        x = x.split()
        place = x[-1]
        if place[0] == '*':
            place = place[1:]
        if place == '---':
            seed = x[-2]
            time = x[-3]
            points = 0  # no place was awarded, so no points (and no stale value from the previous row)
            j = -3
            if time == 'DQ':
                lineCount += 1
        else:
            seed = x[-2]
            if int(place) > 16 or category == 'Preliminaries' or category == 'Swim-off':
                points = 0
                j = -3
            else:
                points = x[-3]
                j = -4
            if x[j][-1].isnumeric():
                time = x[j]
            else:
                time = x[j][:-1]
        k = 1
        while x[k] not in ['FR','SO','JR','SR','5Y']:
            k += 1
        yr1 = x[k]
        school = ' '.join(x[:k])
        n1 = ' '.join(x[k+1:j])
        dic = {'Event': eventName, 'Category': category, 'Name1': n1, 'Year1': yr1, 
               'School': school, 'QualifyingTime': seed, 'Time': time, 'Place': place, 'Points': points}
        newResults.append(dic)
        
        lineCount += 2
        if lines[lineCount].strip()[:2] == 'DQ' or lines[lineCount].strip()[:3] == 'DFS':
            lineCount += 1
        while lines[lineCount].strip()[0].isnumeric():
            lineCount += 1
    return -1, -1, -1

def extractDiving(lines):
    if 'Event' not in lines[0]:
        print('Bad event data passed to extractDiving: '+lines[0])
        return -1
    eventNum = lines[0].strip().split(' ',3)[1]
    nextEvt = str(int(eventNum) + 1)
    for lineCount in range(1,len(lines)):
        if lines[lineCount].strip()[:6+len(eventNum)] == 'Event '+eventNum or \
           lines[lineCount].strip()[:6+len(nextEvt)] == 'Event '+nextEvt:
            return lineCount
    return -1

In [8]:
dfRecordsW = dfRecordsW[0:0]
dfResultsW = dfResultsW[0:0]
itr_line = 0
while itr_line < len(linesW):
    prev_itr = itr_line
    if checkHeader(linesW[itr_line]):
        itr_line += 1
        continue
    if linesW[itr_line].strip()[:5] == 'Event':
        if 'Relay' in linesW[itr_line]:
            itr, results, records = extractRelay(linesW[itr_line:])
        elif 'Diving' in linesW[itr_line]:
            itr = extractDiving(linesW[itr_line:])
            results = dfResultsW[0:0]
            records = dfRecordsW[0:0]
        else:
            itr, results, records = extractIndividual(linesW[itr_line:])
        if itr == -1:
            print('Error: ', linesW[itr_line])
            break
        dfRecordsW = pd.concat([dfRecordsW, records], ignore_index=True)
        dfResultsW = pd.concat([dfResultsW, results], ignore_index=True)
        itr_line += itr
        continue
    if itr_line == prev_itr:
        print(itr_line, linesW[itr_line])
        break

3724  Scores - Women


In [9]:
dfResultsW

Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points
0,Event 1 Women 200 Yard Medley Relay,Timed Final Relay,"Walsh, Gretchen",SO,"Walsh, Alex",JR,"Cuomo, Lexi",SR,"Douglass, Kate",SR,Virginia,1:31.73,1:31.51,1,40
1,Event 1 Women 200 Yard Medley Relay,Timed Final Relay,"Berkoff, Katharine",SR,"MacCausland, Heather",SR,"Alons, Kylee",5Y,"Arens, Abby",JR,NC State,1:33.02,1:32.42,2,34
2,Event 1 Women 200 Yard Medley Relay,Timed Final Relay,"Bray, Olivia",JR,"Elendt, Anna",JR,"Sticklen, Emma",JR,"Cooper, Grace",JR,Texas,1:33.70,1:33.22,3,32
3,Event 1 Women 200 Yard Medley Relay,Timed Final Relay,"Funderburke, Nyah",SO,Hannah,SR,Katherine,JR,"Ivan, Teresa",SO,Ohio St,1:33.95,1:33.93,4,30
4,Event 1 Women 200 Yard Medley Relay,Timed Final Relay,"Hay, Abby",SR,"Viberg, Cecilia",FR,"Regenauer, Christiana",SR,"Albiero, Gabi",JR,Louisville,1:34.23,1:34.37,5,28
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1030,Event 21 Women 400 Yard Freestyle Relay,Timed Final Relay,"Gowans, Molly",5Y,"Smith, Sierra",JR,"Rees, Meredith",SR,"Moderski, Alex",SR,Missouri,3:15.97,3:16.12,22,0
1031,Event 21 Women 400 Yard Freestyle Relay,Timed Final Relay,"McCarty, Eboni",SO,"Reinstein, Sloane",JR,"Dickinson, Callie",5Y,"Hartman, Zoie",SR,Georgia,3:14.19,3:16.36,23,0
1032,Event 21 Women 400 Yard Freestyle Relay,Timed Final Relay,"Wall, Tatum",FR,"Snyder, Sarah",SR,"Belyakov, Catherine",JR,"Chang, Yi Xuan",SO,Duke,3:15.86,3:17.64,24,0
1033,Event 21 Women 400 Yard Freestyle Relay,Timed Final Relay,"Milutinovich, Katarina",SR,"MacNeil, Maggie",5Y,"de Villiers, Michaela",FR,"Barnes, Megan",FR,LSU,3:10.57,DQ,---,0


In [10]:
dfRecordsW

Unnamed: 0,Event,Category,Date,Team,Swimmer,Time
0,Event 1 Women 200 Yard Medley Relay,NCAA,3/15/2023,Virginia,"G Walsh, A Walsh, A Cuomo, K Douglass",1:31.51
1,Event 1 Women 200 Yard Medley Relay,Meet,3/15/2023,Virginia,"G Walsh, A Walsh, A Cuomo, K Douglass",1:31.51
2,Event 1 Women 200 Yard Medley Relay,American,3/15/2023,Virginia,"G Walsh, A Walsh, A Cuomo, K Douglass",1:31.51
3,Event 1 Women 200 Yard Medley Relay,US Open,3/15/2023,Virginia,"G Walsh, A Walsh, A Cuomo, K Douglass",1:31.51
4,Event 1 Women 200 Yard Medley Relay,Pool,3/15/2023,Virginia,"G Walsh, A Walsh, A Cuomo, K Douglass",1:31.51
...,...,...,...,...,...,...
145,Event 21 Women 400 Yard Freestyle Relay,NCAA,2/18/2023,Virginia,"G Walsh, K Douglass, L Cuomo, A Walsh",3:06.83
146,Event 21 Women 400 Yard Freestyle Relay,Meet,3/19/2022,Virginia,"K Douglass, A Walsh, R Tiltmann, G Walsh",3:06.91
147,Event 21 Women 400 Yard Freestyle Relay,American,3/19/2022,Virginia,"K Douglass, A Walsh, R Tiltmann, G Walsh",3:06.91
148,Event 21 Women 400 Yard Freestyle Relay,US Open,3/19/2022,Virginia,"K Douglass, A Walsh, R Tiltmann, G Walsh",3:06.91


After massaging the PDF into a usable DataFrame, we are one step closer to analyzing the data. More cleaning is still required, but first I need to repeat the process with the men's results.

In [11]:
linesM = []
for i in range(len(readerM.pages)):
    linesM += readerM.pages[i].extract_text().splitlines()
print(linesM)

["NCAA Division I Championship Meet HY-TEK's MEET MANAGER 7.0 - 9:07 PM  3/25/2023  Page 1", "2023 NCAA Division I Men's", 'Swimming & Diving Championships', 'Results', ' Event 1  Men 200 Yard Medley Relay', 'NCAA: 1:20.67 N3/22/2023 NC State', 'K Stokowski, M Hunter, N Korstanje, D Curtiss', 'Meet:1:20.67 M3/22/2023 NC State', 'K Stokowski, M Hunter, N Korstanje, D Curtiss', 'American: 1:21.88 A3/23/2018 California', 'D Carr, C Hoppe, J Lynch, R Hoffer', 'U. S. Open: 1:20.67 O3/22/2023 NC State', 'K Stokowski, M Hunter, N Korstanje, D Curtiss', 'Pool: 1:20.67 P3/22/2023 NC State', 'K Stokowski, M Hunter, N Korstanje, D Curtiss', ' Team Relay Finals Time Seed Time Points', '    NC State 1:20.67N 40 1:22.25 1', '1) Stokowski, Kacper SR 2) r:0.15 Hunter, Mason 5Y 3) r:0.09 Korstanje, Nyls SR 4) r:0.16 Curtiss, David SO', '1:20.67 (18.21) 1:02.46 (19.15) 43.31 (22.95) r:+0.71  20.36', '    Arizona St 1:21.07 34 1:21.69 2', '1) Dolan, Jack SR 2) r:0.07 Marchand, Leon SO 3) r:0.28 McCusker,

In [12]:
dfRecordsM = pd.DataFrame([], columns = ['Event', 'Category', 'Date', 'Team', 'Swimmer', 'Time'])
dfResultsM = pd.DataFrame([], columns = ['Event', 'Category', 'Name1', 'Year1', 'Name2', 'Year2', 
                                         'Name3', 'Year3', 'Name4', 'Year4', 'School', 
                                         'QualifyingTime', 'Time', 'Place', 'Points'])

In [13]:
dfRecordsM = dfRecordsM[0:0]
dfResultsM = dfResultsM[0:0]
itr_line = 0
while itr_line < len(linesM):
    prev_itr = itr_line
    if checkHeader(linesM[itr_line]):
        itr_line += 1
        continue
    if linesM[itr_line].strip()[:5] == 'Event':
        if 'Relay' in linesM[itr_line]:
            itr, results, records = extractRelay(linesM[itr_line:])
        elif 'Diving' in linesM[itr_line]:
            itr = extractDiving(linesM[itr_line:])
            results = dfResultsM[0:0]
            records = dfRecordsM[0:0]
        else:
            itr, results, records = extractIndividual(linesM[itr_line:])
        if itr == -1:
            print('Error: ', linesM[itr_line])
            break
        dfRecordsM = pd.concat([dfRecordsM, records], ignore_index=True)
        dfResultsM = pd.concat([dfResultsM, results], ignore_index=True)
        itr_line += itr
        continue
    if itr_line == prev_itr:
        print(itr_line, linesM[itr_line])
        break

3339  Scores - Men


In [14]:
dfResultsM

Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points
0,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Stokowski, Kacper",SR,"Hunter, Mason",5Y,"Korstanje, Nyls",SR,"Curtiss, David",SO,NC State,1:22.25,1:20.67,1,40
1,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Dolan, Jack",SR,"Marchand, Leon",SO,"McCusker, Max",5Y,"Kulow, Jonny",FR,Arizona St,1:21.69,1:21.07,2,34
2,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Chaney, Adam",JR,"Savickas, Aleksas",FR,"Friese, Eric",SR,"Liendo, Josh",FR,Florida,1:21.73,1:21.14,3,32
3,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Seeliger, Bjorn",JR,"Bell, Liam",SR,"Rose, Dare",JR,"Alexy, Jack",SO,California,1:22.84,1:21.24,4,30
4,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Burns, Brendan",SR,"Mathias, Van",5Y,"Frankel, Tomer",JR,"Wight, Gavin",JR,Indiana,1:23.52,1:21.52,5,28
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
909,Event 110 Men 200 Yard Freestyle Swim-off,Swim-off,"Miroslaw, Rafael",SO,,,,,,,Indiana,1:32.28,1:34.29,2,0
910,Event 112 Men 100 Yard Backstroke Swim-off,Swim-off,"Janton, Tommy",FR,,,,,,,Notre Dame,45.54,45.12,1,0
911,Event 112 Men 100 Yard Backstroke Swim-off,Swim-off,"MacAlister, Leon",SR,,,,,,,Stanford,45.54,45.59,2,0
912,Event 119 Men 200 Yard Butterfly Swim-off,Swim-off,"Cohen Groumi, Gal",SO,,,,,,,Michigan,1:41.39,1:41.40,1,0


In [15]:
dfRecordsM

Unnamed: 0,Event,Category,Date,Team,Swimmer,Time
0,Event 1 Men 200 Yard Medley Relay,NCAA,3/22/2023,NC,"K Stokowski, M Hunter, N Korstanje, D Curtiss",1:20.67
1,Event 1 Men 200 Yard Medley Relay,Meet,3/22/2023,NC,"K Stokowski, M Hunter, N Korstanje, D Curtiss",1:20.67
2,Event 1 Men 200 Yard Medley Relay,American,3/23/2018,California,"D Carr, C Hoppe, J Lynch, R Hoffer",1:21.88
3,Event 1 Men 200 Yard Medley Relay,U. S. Open,3/22/2023,NC,"K Stokowski, M Hunter, N Korstanje, D Curtiss",1:20.67
4,Event 1 Men 200 Yard Medley Relay,Pool,3/22/2023,NC,"K Stokowski, M Hunter, N Korstanje, D Curtiss",1:20.67
...,...,...,...,...,...,...
145,Event 21 Men 400 Yard Freestyle Relay,NCAA,3/25/2023,Florida,"J Liendo, A Chaney, J Smith, M McDuff",2:44.07
146,Event 21 Men 400 Yard Freestyle Relay,Meet,3/25/2023,Florida,"J Liendo, A Chaney, J Smith, M McDuff",2:44.07
147,Event 21 Men 400 Yard Freestyle Relay,American,3/24/2018,NC,"R Held, J Ress, J Molacek, C Stewart",2:44.31
148,Event 21 Men 400 Yard Freestyle Relay,U. S. Open,3/25/2023,Florida,"J Liendo, A Chaney, J Smith, M McDuff",2:44.07


With minimal changes, the men's results are now in DataFrame format as well. For example, the men's meet needed a few swim-offs to decide finals qualifiers while the women's meet did not, so the individual-event extraction had to be updated to handle them.
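
The practical difference is whether a result line carries a points token. The sketch below parses a single individual result line from the right-hand side, in the same spirit as `extractIndividual`; the helper name `parse_individual_line` and its `has_points` flag are hypothetical, and the two sample lines are copied from earlier in this notebook.

```python
YEARS = {'FR', 'SO', 'JR', 'SR', '5Y'}

def parse_individual_line(line, has_points):
    """Split one result line into fields, reading the numeric tail from the end."""
    tokens = line.split()
    # The first class-year token (searching from index 1) separates school from name.
    k = next(i for i in range(1, len(tokens)) if tokens[i] in YEARS)
    n_tail = 4 if has_points else 3
    if has_points:
        time, points, seed, place = tokens[-n_tail:]
    else:
        time, seed, place = tokens[-n_tail:]
        points = '0'  # prelims and swim-offs score no points
    return {
        'School': ' '.join(tokens[:k]),
        'Year1': tokens[k],
        'Name1': ' '.join(tokens[k + 1:-n_tail]),
        'Time': time,
        'QualifyingTime': seed,
        'Place': place,
        'Points': points,
    }

final = parse_individual_line('Tennessee SO McCarville, Kate 4:40.54 5 4:40.43 12', True)
no_points = parse_individual_line('California SO Alexy, Jack 40.88  41.42 1', False)
print(final)
print(no_points)
```

This sketch ignores DQ rows, tie markers, and record flags appended to a time, all of which the full functions above handle separately.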

Now we're ready to clean the data frames. One obvious issue is that the records data frame has repeated entries for each event, because the records were picked up from both the prelims and the finals results. There are also several instances where the PDF reader dropped a space that is present in most of the analogous entries. These may have to be fixed by hand.
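
An alternative to fixing such rows by hand would be a record parser that tolerates the missing space. The sketch below is one possible approach, not what this notebook does: the `RECORD` pattern and `parse_record` helper are assumptions, the first two sample lines are copied from the extracted text, and the third is reconstructed from the values this notebook reports for the men's 50 freestyle Meet record rather than copied from the PDF text.

```python
import re

# Assumed layout: "<category>: <time> <flag letter><date> <team> [<first> <last>]",
# where the spaces around the time may be missing.
RECORD = re.compile(
    r'^(?P<category>[^:]+):\s*'
    r'(?P<time>(?:\d+:)?\d+\.\d+)\s*'
    r'(?P<flag>[A-Z])'
    r'(?P<date>\d{1,2}/\d{1,2}/\d{4})\s+'
    r'(?P<rest>.+)$'
)

def parse_record(line, relay):
    """Parse one record line; returns None if the line is not a record."""
    match = RECORD.match(line.strip())
    if match is None:
        return None
    record = {'Category': match['category'], 'Time': match['time'],
              'Date': match['date']}
    rest = match['rest'].split()
    if relay:
        # Relay swimmers are on the following line, so the rest is the team.
        record['Team'] = ' '.join(rest)
    else:
        # Assumption: the swimmer is always the last two tokens (first, last).
        record['Team'] = ' '.join(rest[:-2])
        record['Swimmer'] = ' '.join(rest[-2:])
    return record

print(parse_record('U. S. Open: 1:20.67 O3/22/2023 NC State', relay=True))
print(parse_record('Meet:1:31.51 M3/15/2023 Virginia', relay=True))
# Reconstructed line (hypothetical), with time and date fused together:
print(parse_record('Meet:17.63M3/22/2018 Florida Caeleb Dressel', relay=False))
```

Because the team is taken as everything between the date and the swimmer, multi-word teams such as NC State stay intact. Swimmers with more than two name tokens would still need special handling.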

In [16]:
dfRecordsW.drop_duplicates(inplace=True)
dfRecordsM.drop_duplicates(inplace=True)
dfRecordsW

Unnamed: 0,Event,Category,Date,Team,Swimmer,Time
0,Event 1 Women 200 Yard Medley Relay,NCAA,3/15/2023,Virginia,"G Walsh, A Walsh, A Cuomo, K Douglass",1:31.51
1,Event 1 Women 200 Yard Medley Relay,Meet,3/15/2023,Virginia,"G Walsh, A Walsh, A Cuomo, K Douglass",1:31.51
2,Event 1 Women 200 Yard Medley Relay,American,3/15/2023,Virginia,"G Walsh, A Walsh, A Cuomo, K Douglass",1:31.51
3,Event 1 Women 200 Yard Medley Relay,US Open,3/15/2023,Virginia,"G Walsh, A Walsh, A Cuomo, K Douglass",1:31.51
4,Event 1 Women 200 Yard Medley Relay,Pool,3/15/2023,Virginia,"G Walsh, A Walsh, A Cuomo, K Douglass",1:31.51
...,...,...,...,...,...,...
145,Event 21 Women 400 Yard Freestyle Relay,NCAA,2/18/2023,Virginia,"G Walsh, K Douglass, L Cuomo, A Walsh",3:06.83
146,Event 21 Women 400 Yard Freestyle Relay,Meet,3/19/2022,Virginia,"K Douglass, A Walsh, R Tiltmann, G Walsh",3:06.91
147,Event 21 Women 400 Yard Freestyle Relay,American,3/19/2022,Virginia,"K Douglass, A Walsh, R Tiltmann, G Walsh",3:06.91
148,Event 21 Women 400 Yard Freestyle Relay,US Open,3/19/2022,Virginia,"K Douglass, A Walsh, R Tiltmann, G Walsh",3:06.91


In [17]:
dfRecordsM

Unnamed: 0,Event,Category,Date,Team,Swimmer,Time
0,Event 1 Men 200 Yard Medley Relay,NCAA,3/22/2023,NC,"K Stokowski, M Hunter, N Korstanje, D Curtiss",1:20.67
1,Event 1 Men 200 Yard Medley Relay,Meet,3/22/2023,NC,"K Stokowski, M Hunter, N Korstanje, D Curtiss",1:20.67
2,Event 1 Men 200 Yard Medley Relay,American,3/23/2018,California,"D Carr, C Hoppe, J Lynch, R Hoffer",1:21.88
3,Event 1 Men 200 Yard Medley Relay,U. S. Open,3/22/2023,NC,"K Stokowski, M Hunter, N Korstanje, D Curtiss",1:20.67
4,Event 1 Men 200 Yard Medley Relay,Pool,3/22/2023,NC,"K Stokowski, M Hunter, N Korstanje, D Curtiss",1:20.67
...,...,...,...,...,...,...
145,Event 21 Men 400 Yard Freestyle Relay,NCAA,3/25/2023,Florida,"J Liendo, A Chaney, J Smith, M McDuff",2:44.07
146,Event 21 Men 400 Yard Freestyle Relay,Meet,3/25/2023,Florida,"J Liendo, A Chaney, J Smith, M McDuff",2:44.07
147,Event 21 Men 400 Yard Freestyle Relay,American,3/24/2018,NC,"R Held, J Ress, J Molacek, C Stewart",2:44.31
148,Event 21 Men 400 Yard Freestyle Relay,U. S. Open,3/25/2023,Florida,"J Liendo, A Chaney, J Smith, M McDuff",2:44.07


Both have 90 rows now, 5 record types for 18 events (21 total minus 3 diving events), as expected.
Now, let's take a closer look at individual entries.

In [18]:
dfRecordsM.describe()

Unnamed: 0,Event,Category,Date,Team,Swimmer,Time
count,90,90,90,90,90,90
unique,18,5,21,16,30,38
top,Event 1 Men 200 Yard Medley Relay,NCAA,3/23/2023,Florida,Leon Marchand,6:03.42
freq,5,18,10,31,12,5


In [19]:
for col in dfRecordsM.columns:
    print(col, dfRecordsM[col].unique(), dfRecordsM[col].dtype)

Event [' Event 1  Men 200 Yard Medley Relay'
 ' Event 2  Men 800 Yard Freestyle Relay'
 ' Event 3  Men 500 Yard Freestyle' ' Event 4  Men 200 Yard IM'
 ' Event 5  Men 50 Yard Freestyle'
 ' Event 7  Men 200 Yard Freestyle Relay' ' Event 8  Men 400 Yard IM'
 ' Event 9  Men 100 Yard Butterfly' ' Event 10  Men 200 Yard Freestyle'
 ' Event 11  Men 100 Yard Breaststroke'
 ' Event 12  Men 100 Yard Backstroke'
 ' Event 14  Men 400 Yard Medley Relay'
 ' Event 15  Men 1650 Yard Freestyle' ' Event 16  Men 200 Yard Backstroke'
 ' Event 17  Men 100 Yard Freestyle'
 ' Event 18  Men 200 Yard Breaststroke'
 ' Event 19  Men 200 Yard Butterfly'
 ' Event 21  Men 400 Yard Freestyle Relay'] object
Category ['NCAA' 'Meet' 'American' 'U. S. Open' 'Pool'] object
Date ['3/22/2023' '3/23/2018' '2/19/2020' '3/24/2022' '3/23/2023' '3/22/2018'
 'lorida' '2/16/2022' '3/24/2023' '3/24/2017' '3/27/2019' 'ndiana'
 '3/24/2018' '3/25/2022' 'eorgia' '3/23/2017' '2/22/2020' '3/27/2021'
 '3/26/2016' '3/25/2023' '3/25/2017'

I can see that there are some clear errors, and I will fix these by hand.

In [20]:
dfRecordsM.loc[dfRecordsM.Team.str[0].isin(['0','1','2','3','4','5','6','7','8','9'])]

Unnamed: 0,Event,Category,Date,Team,Swimmer,Time
31,Event 5 Men 50 Yard Freestyle,Meet,lorida,17.63M3/22/2018 Florida,Dressel Caeleb,17.63M3/22/2018
56,Event 9 Men 100 Yard Butterfly,Meet,lorida,42.80M3/23/2018 Florida,Dressel Caeleb,42.80M3/23/2018
76,Event 11 Men 100 Yard Breaststroke,Meet,ndiana,49.69M3/23/2018 Indiana,Finnerty Ian,49.69M3/23/2018
86,Event 12 Men 100 Yard Backstroke,Meet,eorgia,43.35M3/25/2022 Georgia,Urlando Luca,43.35M3/25/2022
104,Event 15 Men 1650 Yard Freestyle,Pool,3/24/2018,14:24.43 P3/24/2018 NC State,Ipson Anton,14:24.43
116,Event 17 Men 100 Yard Freestyle,Meet,lorida,39.90M3/24/2018 Florida,Dressel Caeleb,39.90M3/24/2018
139,Event 19 Men 200 Yard Butterfly,Pool,3/24/2018,1:38.60 P3/24/2018 NC State,Vazaios Andrea,1:38.60


In [21]:
dfRecordsW.loc[dfRecordsW.Team.str[0].isin(['0','1','2','3','4','5','6','7','8','9'])]

Unnamed: 0,Event,Category,Date,Team,Swimmer,Time
31,Event 5 Women 50 Yard Freestyle,Meet,SU,20.79M3/16/2023 LSU,MacNeil Maggie,20.79M3/16/2023
56,Event 9 Women 100 Yard Butterfly,Meet,irginia,48.46M3/17/2023 Virginia,Douglass Kate,48.46M3/17/2023
76,Event 11 Women 100 Yard Breaststroke,Meet,ndiana,55.73M3/22/2019 Indiana,King Lilly,55.73M3/22/2019
86,Event 12 Women 100 Yard Backstroke,Meet,irginia,48.26M3/17/2023 Virginia,Walsh Gretchen,48.26M3/17/2023
102,Event 15 Women 1650 Yard Freestyle,American,3/12/2023,15:01.41 A3/12/2023 Gator Swim Club,Ledecky Katie,15:01.41
103,Event 15 Women 1650 Yard Freestyle,US Open,3/12/2023,15:01.41 O3/12/2023 Gator Swim Club,Ledecky Katie,15:01.41
104,Event 15 Women 1650 Yard Freestyle,Pool,12/7/2013,15:15.17 P12/7/2013 Nation's Capital,Ledecky Katie,15:15.17
116,Event 17 Women 100 Yard Freestyle,Meet,tanford,45.56M3/18/2017 Stanford,Manuel Simone,45.56M3/18/2017


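The errors fall into two distinct categories. For one set of events the Meet record was misread: the time and date were fused into a single token. A few other records (three 1650 records for the women, two Pool records for the men) were misread in a different way: the team name has more than one word, so the time and date ended up inside the Team column. I can fix these by rearranging the column values.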
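
A compact way to express the first kind of repair is to compute the boolean mask once and reuse it. The sketch below is a minimal illustration on a two-row frame (one broken Meet row and one clean row, both copied from the men's records output); it uses `str.isdigit()` in place of the explicit list of digits, which is my substitution rather than what the cells below do.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Meet', 'American'],
    'Date': ['lorida', '3/23/2018'],
    'Team': ['17.63M3/22/2018 Florida', 'California'],
    'Swimmer': ['Dressel Caeleb', 'D Carr, C Hoppe, J Lynch, R Hoffer'],
    'Time': ['17.63M3/22/2018', '1:21.88'],
})

# Compute the mask once, before any column is modified.
bad = df['Team'].str[0].str.isdigit() & (df['Category'] == 'Meet')

# "Last First" -> "First Last"
df.loc[bad, 'Swimmer'] = df.loc[bad, 'Swimmer'].str.split().apply(
    lambda parts: ' '.join(reversed(parts)))

# Split the fused "<time>M<date>" token into its two parts.
time_date = df.loc[bad, 'Time'].str.split('M', expand=True)
df.loc[bad, 'Time'] = time_date[0]
df.loc[bad, 'Date'] = time_date[1]

# Keep everything after the first space as the team name.
df.loc[bad, 'Team'] = df.loc[bad, 'Team'].str.split(' ', n=1).str[1]
print(df)
```

Storing the mask in a variable also removes the ordering constraint in the cells below, where Team has to be fixed last because the filter depends on it.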

In [22]:
dfRecordsM.loc[(dfRecordsM.Team.str[0].isin(['0','1','2','3','4','5','6','7','8','9'])) & 
               (dfRecordsM.Category == 'Meet')]

Unnamed: 0,Event,Category,Date,Team,Swimmer,Time
31,Event 5 Men 50 Yard Freestyle,Meet,lorida,17.63M3/22/2018 Florida,Dressel Caeleb,17.63M3/22/2018
56,Event 9 Men 100 Yard Butterfly,Meet,lorida,42.80M3/23/2018 Florida,Dressel Caeleb,42.80M3/23/2018
76,Event 11 Men 100 Yard Breaststroke,Meet,ndiana,49.69M3/23/2018 Indiana,Finnerty Ian,49.69M3/23/2018
86,Event 12 Men 100 Yard Backstroke,Meet,eorgia,43.35M3/25/2022 Georgia,Urlando Luca,43.35M3/25/2022
116,Event 17 Men 100 Yard Freestyle,Meet,lorida,39.90M3/24/2018 Florida,Dressel Caeleb,39.90M3/24/2018


In [23]:
newNames = dfRecordsM.loc[(dfRecordsM.Team.str[0].isin(['0','1','2','3','4','5','6','7','8','9'])) & 
               (dfRecordsM.Category == 'Meet'), 'Swimmer'].str.split()
dfRecordsM.loc[(dfRecordsM.Team.str[0].isin(['0','1','2','3','4','5','6','7','8','9'])) & 
               (dfRecordsM.Category == 'Meet'), 'Swimmer'] = newNames.apply(lambda x: x[1] + ' ' + x[0])
dfRecordsM.loc[(dfRecordsM.Team.str[0].isin(['0','1','2','3','4','5','6','7','8','9'])) & 
               (dfRecordsM.Category == 'Meet')]

Unnamed: 0,Event,Category,Date,Team,Swimmer,Time
31,Event 5 Men 50 Yard Freestyle,Meet,lorida,17.63M3/22/2018 Florida,Caeleb Dressel,17.63M3/22/2018
56,Event 9 Men 100 Yard Butterfly,Meet,lorida,42.80M3/23/2018 Florida,Caeleb Dressel,42.80M3/23/2018
76,Event 11 Men 100 Yard Breaststroke,Meet,ndiana,49.69M3/23/2018 Indiana,Ian Finnerty,49.69M3/23/2018
86,Event 12 Men 100 Yard Backstroke,Meet,eorgia,43.35M3/25/2022 Georgia,Luca Urlando,43.35M3/25/2022
116,Event 17 Men 100 Yard Freestyle,Meet,lorida,39.90M3/24/2018 Florida,Caeleb Dressel,39.90M3/24/2018


In [24]:
time_date = dfRecordsM.loc[(dfRecordsM.Team.str[0].isin(['0','1','2','3','4','5','6','7','8','9'])) & 
               (dfRecordsM.Category == 'Meet'), 'Time'].str.split('M', expand = True)
dfRecordsM.loc[(dfRecordsM.Team.str[0].isin(['0','1','2','3','4','5','6','7','8','9'])) & 
               (dfRecordsM.Category == 'Meet'), 'Time'] = time_date[0]
dfRecordsM.loc[(dfRecordsM.Team.str[0].isin(['0','1','2','3','4','5','6','7','8','9'])) & 
               (dfRecordsM.Category == 'Meet'), 'Date'] = time_date[1]
dfRecordsM.loc[(dfRecordsM.Team.str[0].isin(['0','1','2','3','4','5','6','7','8','9'])) & 
               (dfRecordsM.Category == 'Meet')]

Unnamed: 0,Event,Category,Date,Team,Swimmer,Time
31,Event 5 Men 50 Yard Freestyle,Meet,3/22/2018,17.63M3/22/2018 Florida,Caeleb Dressel,17.63
56,Event 9 Men 100 Yard Butterfly,Meet,3/23/2018,42.80M3/23/2018 Florida,Caeleb Dressel,42.8
76,Event 11 Men 100 Yard Breaststroke,Meet,3/23/2018,49.69M3/23/2018 Indiana,Ian Finnerty,49.69
86,Event 12 Men 100 Yard Backstroke,Meet,3/25/2022,43.35M3/25/2022 Georgia,Luca Urlando,43.35
116,Event 17 Men 100 Yard Freestyle,Meet,3/24/2018,39.90M3/24/2018 Florida,Caeleb Dressel,39.9


In [25]:
new_team = dfRecordsM.loc[(dfRecordsM.Team.str[0].isin(['0','1','2','3','4','5','6','7','8','9'])) & 
               (dfRecordsM.Category == 'Meet'), 'Team'].str.split(' ', expand = True)
dfRecordsM.loc[(dfRecordsM.Team.str[0].isin(['0','1','2','3','4','5','6','7','8','9'])) & 
               (dfRecordsM.Category == 'Meet'), 'Team'] = new_team[1]
dfRecordsM.loc[(dfRecordsM.Team.str[0].isin(['0','1','2','3','4','5','6','7','8','9'])) & 
               (dfRecordsM.Category == 'Meet')]

Unnamed: 0,Event,Category,Date,Team,Swimmer,Time


In [26]:
dfRecordsM.loc[(dfRecordsM.Team.str[0].isin(['0','1','2','3','4','5','6','7','8','9']))]

Unnamed: 0,Event,Category,Date,Team,Swimmer,Time
104,Event 15 Men 1650 Yard Freestyle,Pool,3/24/2018,14:24.43 P3/24/2018 NC State,Ipson Anton,14:24.43
139,Event 19 Men 200 Yard Butterfly,Pool,3/24/2018,1:38.60 P3/24/2018 NC State,Vazaios Andrea,1:38.60


In [27]:
newNames = dfRecordsM.loc[dfRecordsM.Team.str[0].isin(['0','1','2','3','4','5','6','7','8','9']),
                          'Swimmer'].str.split()
dfRecordsM.loc[dfRecordsM.Team.str[0].isin(['0','1','2','3','4','5','6','7','8','9']),
               'Swimmer'] = newNames.apply(lambda x: x[1] + ' ' + x[0])
dfRecordsM.loc[dfRecordsM.Team.str[0].isin(['0','1','2','3','4','5','6','7','8','9'])]

Unnamed: 0,Event,Category,Date,Team,Swimmer,Time
104,Event 15 Men 1650 Yard Freestyle,Pool,3/24/2018,14:24.43 P3/24/2018 NC State,Anton Ipson,14:24.43
139,Event 19 Men 200 Yard Butterfly,Pool,3/24/2018,1:38.60 P3/24/2018 NC State,Andrea Vazaios,1:38.60


In [28]:
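# Strip everything up to and including the date, so multi-word teams such as 'NC State' survive
# (taking the second space-separated token would store the date token 'P3/24/2018' instead).
new_team = dfRecordsM.loc[dfRecordsM.Team.str[0].isin(['0','1','2','3','4','5','6','7','8','9']), 
                          'Team'].str.replace(r'^.*?\d{1,2}/\d{1,2}/\d{4}\s+', '', regex=True)
dfRecordsM.loc[dfRecordsM.Team.str[0].isin(['0','1','2','3','4','5','6','7','8','9']),
               'Team'] = new_team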
dfRecordsM.loc[dfRecordsM.Team.str[0].isin(['0','1','2','3','4','5','6','7','8','9'])]

Unnamed: 0,Event,Category,Date,Team,Swimmer,Time


In [29]:
dfRecordsW.loc[dfRecordsW.Team.str[0].isin(['0','1','2','3','4','5','6','7','8','9'])]

Unnamed: 0,Event,Category,Date,Team,Swimmer,Time
31,Event 5 Women 50 Yard Freestyle,Meet,SU,20.79M3/16/2023 LSU,MacNeil Maggie,20.79M3/16/2023
56,Event 9 Women 100 Yard Butterfly,Meet,irginia,48.46M3/17/2023 Virginia,Douglass Kate,48.46M3/17/2023
76,Event 11 Women 100 Yard Breaststroke,Meet,ndiana,55.73M3/22/2019 Indiana,King Lilly,55.73M3/22/2019
86,Event 12 Women 100 Yard Backstroke,Meet,irginia,48.26M3/17/2023 Virginia,Walsh Gretchen,48.26M3/17/2023
102,Event 15 Women 1650 Yard Freestyle,American,3/12/2023,15:01.41 A3/12/2023 Gator Swim Club,Ledecky Katie,15:01.41
103,Event 15 Women 1650 Yard Freestyle,US Open,3/12/2023,15:01.41 O3/12/2023 Gator Swim Club,Ledecky Katie,15:01.41
104,Event 15 Women 1650 Yard Freestyle,Pool,12/7/2013,15:15.17 P12/7/2013 Nation's Capital,Ledecky Katie,15:15.17
116,Event 17 Women 100 Yard Freestyle,Meet,tanford,45.56M3/18/2017 Stanford,Manuel Simone,45.56M3/18/2017


In [30]:
# Same repair for the women's records; masks are computed once, before any column changes.
bad = dfRecordsW.Team.str[0].isin(list('0123456789'))
badMeet = bad & (dfRecordsW.Category == 'Meet')
# Swap 'Last First' to 'First Last'.
newNames = dfRecordsW.loc[bad, 'Swimmer'].str.split()
dfRecordsW.loc[bad, 'Swimmer'] = newNames.apply(lambda x: x[1] + ' ' + x[0])
# Meet rows have time and date fused as '20.79M3/16/2023'; split on the 'M' flag.
time_date = dfRecordsW.loc[badMeet, 'Time'].str.split('M', expand = True)
dfRecordsW.loc[badMeet, 'Time'] = time_date[0]
dfRecordsW.loc[badMeet, 'Date'] = time_date[1]
# Strip the leaked time/date prefix from Team, leaving only the team name.
dfRecordsW.loc[bad, 'Team'] = dfRecordsW.loc[bad, 'Team'].str.replace(
    r'^.*?\d{1,2}/\d{1,2}/\d{4}\s+', '', regex=True)
# Re-run the filter: it should now come back empty.
dfRecordsW.loc[dfRecordsW.Team.str[0].isin(list('0123456789'))]

Unnamed: 0,Event,Category,Date,Team,Swimmer,Time


Now, the bad record entries have been fixed.
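
The repairs in the last few cells all follow one pattern: find the rows whose Team starts with a digit, swap the name order, and remove the leaked time/date text from Team. Below is a minimal sketch of that pattern as a single helper, run on a made-up two-row frame. The helper name `fix_leaked_team`, the placeholder swimmer in the second row, and the exact regular expressions are my own choices for illustration.

```python
import pandas as pd

# Everything up to and including the first m/d/yyyy date, plus trailing spaces.
# (Assumed pattern, based on the bad rows shown above.)
PREFIX = r'^.*?\d{1,2}/\d{1,2}/\d{4}\s+'

def fix_leaked_team(df):
    """Hypothetical helper: return a copy in which rows whose Team starts
    with a digit get 'Last First' swapped and the time/date prefix removed."""
    out = df.copy()
    bad = out.Team.str.match(r'\d', na=False)   # match() anchors at the start
    out.loc[bad, 'Swimmer'] = out.loc[bad, 'Swimmer'].str.replace(
        r'^(\S+)\s+(\S+)$', r'\2 \1', regex=True)
    out.loc[bad, 'Team'] = out.loc[bad, 'Team'].str.replace(
        PREFIX, '', regex=True)
    return out

# Toy frame: first row mimics a bad record, second row is a clean placeholder.
toy = pd.DataFrame({
    'Team': ['14:24.43 P3/24/2018 NC State', 'Florida'],
    'Swimmer': ['Ipson Anton', 'Jane Doe'],
})
fixed = fix_leaked_team(toy)
print(fixed)
```

Working on a copy keeps the original frame around, which makes it easy to compare before and after if the pattern turns out to miss a case.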

In [31]:
dfRecordsM.Date = pd.to_datetime(dfRecordsM.Date)

In [34]:
times = pd.to_datetime(dfRecordsM.Time, format='%M:%S.%f', errors='coerce').fillna(
                pd.to_datetime(dfRecordsM.Time, format='%S.%f', errors='coerce'))
dfRecordsM.Time = times
dfRecordsM.dtypes

Event               object
Category            object
Date        datetime64[ns]
Team                object
Swimmer             object
Time        datetime64[ns]
dtype: object
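
Parsing with `%M:%S.%f` stores each swim time as a timestamp on 1900-01-01, which is why the Time values appear as `1900-01-01 00:01:20.670` in the tables. If a pure duration turns out to be more convenient later, one alternative is `pd.to_timedelta`. This is a minimal sketch, assuming the times are strings of the form `M:SS.ff` or `SS.ff`. The helper name `swim_time_to_timedelta` is hypothetical and is not used elsewhere in this notebook.

```python
import pandas as pd

def swim_time_to_timedelta(times):
    """Hypothetical helper: convert 'M:SS.ff' or 'SS.ff' strings to Timedelta.
    Anything that does not look like a time becomes NaT."""
    parts = times.str.extract(r'^(?:(?P<min>\d+):)?(?P<sec>\d{1,2}\.\d+)$')
    minutes = pd.to_numeric(parts['min']).fillna(0)   # no minutes part means 0
    seconds = pd.to_numeric(parts['sec'])             # NaN when the pattern fails
    return pd.to_timedelta(minutes * 60 + seconds, unit='s')

sample = pd.Series(['14:24.43', '1:38.60', '20.79', 'DFS'])
durations = swim_time_to_timedelta(sample)
print(durations.dt.total_seconds())
```

Durations can be subtracted and averaged directly, without carrying the placeholder date around.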

In [35]:
dfRecordsM

Unnamed: 0,Event,Category,Date,Team,Swimmer,Time
0,Event 1 Men 200 Yard Medley Relay,NCAA,2023-03-22,NC,"K Stokowski, M Hunter, N Korstanje, D Curtiss",1900-01-01 00:01:20.670
1,Event 1 Men 200 Yard Medley Relay,Meet,2023-03-22,NC,"K Stokowski, M Hunter, N Korstanje, D Curtiss",1900-01-01 00:01:20.670
2,Event 1 Men 200 Yard Medley Relay,American,2018-03-23,California,"D Carr, C Hoppe, J Lynch, R Hoffer",1900-01-01 00:01:21.880
3,Event 1 Men 200 Yard Medley Relay,U. S. Open,2023-03-22,NC,"K Stokowski, M Hunter, N Korstanje, D Curtiss",1900-01-01 00:01:20.670
4,Event 1 Men 200 Yard Medley Relay,Pool,2023-03-22,NC,"K Stokowski, M Hunter, N Korstanje, D Curtiss",1900-01-01 00:01:20.670
...,...,...,...,...,...,...
145,Event 21 Men 400 Yard Freestyle Relay,NCAA,2023-03-25,Florida,"J Liendo, A Chaney, J Smith, M McDuff",1900-01-01 00:02:44.070
146,Event 21 Men 400 Yard Freestyle Relay,Meet,2023-03-25,Florida,"J Liendo, A Chaney, J Smith, M McDuff",1900-01-01 00:02:44.070
147,Event 21 Men 400 Yard Freestyle Relay,American,2018-03-24,NC,"R Held, J Ress, J Molacek, C Stewart",1900-01-01 00:02:44.310
148,Event 21 Men 400 Yard Freestyle Relay,U. S. Open,2023-03-25,Florida,"J Liendo, A Chaney, J Smith, M McDuff",1900-01-01 00:02:44.070


In [36]:
dfRecordsW.Date = pd.to_datetime(dfRecordsW.Date)
times = pd.to_datetime(dfRecordsW.Time, format='%M:%S.%f', errors='coerce').fillna(
                pd.to_datetime(dfRecordsW.Time, format='%S.%f', errors='coerce'))
dfRecordsW.Time = times
dfRecordsW

Unnamed: 0,Event,Category,Date,Team,Swimmer,Time
0,Event 1 Women 200 Yard Medley Relay,NCAA,2023-03-15,Virginia,"G Walsh, A Walsh, A Cuomo, K Douglass",1900-01-01 00:01:31.510
1,Event 1 Women 200 Yard Medley Relay,Meet,2023-03-15,Virginia,"G Walsh, A Walsh, A Cuomo, K Douglass",1900-01-01 00:01:31.510
2,Event 1 Women 200 Yard Medley Relay,American,2023-03-15,Virginia,"G Walsh, A Walsh, A Cuomo, K Douglass",1900-01-01 00:01:31.510
3,Event 1 Women 200 Yard Medley Relay,US Open,2023-03-15,Virginia,"G Walsh, A Walsh, A Cuomo, K Douglass",1900-01-01 00:01:31.510
4,Event 1 Women 200 Yard Medley Relay,Pool,2023-03-15,Virginia,"G Walsh, A Walsh, A Cuomo, K Douglass",1900-01-01 00:01:31.510
...,...,...,...,...,...,...
145,Event 21 Women 400 Yard Freestyle Relay,NCAA,2023-02-18,Virginia,"G Walsh, K Douglass, L Cuomo, A Walsh",1900-01-01 00:03:06.830
146,Event 21 Women 400 Yard Freestyle Relay,Meet,2022-03-19,Virginia,"K Douglass, A Walsh, R Tiltmann, G Walsh",1900-01-01 00:03:06.910
147,Event 21 Women 400 Yard Freestyle Relay,American,2022-03-19,Virginia,"K Douglass, A Walsh, R Tiltmann, G Walsh",1900-01-01 00:03:06.910
148,Event 21 Women 400 Yard Freestyle Relay,US Open,2022-03-19,Virginia,"K Douglass, A Walsh, R Tiltmann, G Walsh",1900-01-01 00:03:06.910


Now, on to the results data.

In [41]:
dfResultsM.describe()

Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points
count,914,914,914,914,113,113,113,113,113,113,914,914.0,914,914,914
unique,21,6,250,5,84,5,83,5,87,5,49,785.0,806,57,31
top,Event 17 Men 100 Yard Freestyle,Preliminaries,"Seeliger, Bjorn",SR,"Marchand, Leon",SR,"Lowe, Dalton",SR,"McDuff, Macguire",SR,Arizona St,19.09,DFS,---,0
freq,78,569,9,242,3,34,3,40,3,26,75,6.0,26,35,626


In [42]:
dfResultsW.describe()

Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points
count,1035,1035,1035,1035,116,116,116,116,116,116,1035,1035.0,1035,1035,1035
unique,18,5,297,5,80,5,81,5,81,5,55,918.0,932,69,31
top,Event 3 Women 500 Yard Freestyle,Preliminaries,"Berkoff, Katharine",SO,"Walsh, Alex",JR,"Jones, Emily",JR,"Cronk, Micayla",SO,Virginia,51.59,DFS,12,0
freq,84,686,10,264,4,34,4,31,4,29,79,5.0,17,32,746
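
One thing worth keeping in mind from these summaries: the most frequent Time value in both tables is `DFS` rather than an actual time, so the column mixes times with status codes. A small sketch of separating the two is below, using a toy Series in place of the real column. The regular expression for what counts as a time is my assumption.

```python
import pandas as pd

# Toy stand-in for the Time column of the results frames (illustrative values).
time_col = pd.Series(['19.09', 'DFS', '4:40.54', 'DFS'], name='Time')

# Assumed rule: a real time looks like 'SS.ff' or 'M:SS.ff';
# everything else is treated as a status code.
is_time = time_col.str.fullmatch(r'(\d+:)?\d{1,2}\.\d{2}', na=False)
status_counts = time_col[~is_time].value_counts()
print(status_counts)
```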


Looking through the entries earlier, and at these top-level summaries, things look OK. I may run into issues when working with the data at a finer grain in my next analysis notebook, but for now I will save all the data as CSV files.

In [43]:
dfResultsM.to_csv('NCAA_M2023.csv')
dfResultsW.to_csv('NCAA_W2023.csv')
dfRecordsM.to_csv('Records_M.csv')
dfRecordsW.to_csv('Records_W.csv')
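
Because `to_csv` writes the index by default, the saved files carry an unnamed first column, and the datetime columns come back as plain strings unless they are parsed on load. Below is a minimal sketch of a round trip, using an in-memory buffer and a one-row stand-in frame instead of the real files. The `index_col` and `parse_dates` settings are how I would probably reload these in the next notebook; nothing in this notebook depends on them.

```python
import io
import pandas as pd

# One-row stand-in for dfRecordsM with the same column types as above.
df = pd.DataFrame({
    'Event': ['Event 15 Men 1650 Yard Freestyle'],
    'Date': pd.to_datetime(['2018-03-24']),
    'Time': pd.to_datetime(['14:24.43'], format='%M:%S.%f'),
})

buf = io.StringIO()
df.to_csv(buf)      # the index is written as an unnamed first column
buf.seek(0)

# index_col=0 restores the index; parse_dates restores the datetime columns.
back = pd.read_csv(buf, index_col=0, parse_dates=['Date', 'Time'])
print(back.dtypes)
```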

Up next, a deeper dive into the results.
Note for future: I would like to save the splits in case I want to analyze those as well. The easiest approach would be a separate dataframe, keyed by swimmer and event so it can be joined to the results, with one column per split. It should be straightforward, assuming the PDF reader reads them all correctly, but for now I'll leave this as a to-do item for the next round.
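
As a starting point for that to-do, here is a rough sketch of pulling cumulative splits out of the raw text lines. It assumes the layout seen in the page dump near the top of this notebook: cumulative times appear bare, lap splits appear in parentheses, the reaction time is prefixed with `r:`, and the order within a line may be scrambled by the extraction. Since cumulative times only increase, sorting them recovers the swim order. The function names are hypothetical, only the split lines for one swimmer should be passed in (not the name/time header line), and I have not checked this against every page.

```python
import re

# Bare cumulative times such as '25.95' or '1:22.12'. The lookarounds skip
# lap splits in parentheses and the reaction time ('r:+0.67').
CUMULATIVE = re.compile(r'(?<![\d(:+.])(\d+:)?(\d{1,2}\.\d{2})(?![\d)])')

def to_seconds(minutes, seconds):
    """'1:' and '22.12' -> 82.12; minutes is None for times under a minute."""
    return (int(minutes[:-1]) * 60 if minutes else 0) + float(seconds)

def parse_cumulative_splits(lines):
    """Hypothetical helper: return cumulative times in seconds, in swim order."""
    found = []
    for line in lines:
        for m in CUMULATIVE.finditer(line):
            found.append(to_seconds(m.group(1), m.group(2)))
    return sorted(found)

# Split lines for the first swimmer in the page dump shown earlier.
sample = [
    '1:50.59 (28.47) 1:22.12 (28.20) 53.92 (27.97) r:+0.67  25.95',
    '3:44.75 (28.34) 3:16.41 (28.54) 2:47.87 (28.63) 2:19.24 (28.65)',
    '4:12.97 (28.22) 4:40.54 (27.57)',
]
splits = parse_cumulative_splits(sample)
print(len(splits), splits[0], splits[-1])
```

Lap times can then be recomputed as differences between consecutive cumulative times and compared with the parenthesized values as a sanity check on the extraction.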