## Analysis of Men's and Women's 2023 NCAA DI Swimming Championships

I will analyze the results of the 2023 NCAA championship meet and answer a few questions:
1. Which team(s) did best (and how should that be defined)?
2. Are there any differences in performance based on a swimmer's year?
3. What are the average improvements from qualifying times to preliminaries to finals? Do any factors impact this?

### Loading and Inspecting Data

Most of this step was handled in `buildData_2023NCAAs.ipynb` where I converted PDF results to a CSV file. I'll load the CSVs and do some inspection and any small cleaning required.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Load the men's and women's results and combine dataframes
dfResultsM = pd.read_csv('NCAA_M2023.csv')
dfResultsW = pd.read_csv('NCAA_W2023.csv')
dfResultsM['Division'] = 'Men\'s DI'
dfResultsW['Division'] = 'Women\'s DI'
dfResults = pd.concat([dfResultsM, dfResultsW], ignore_index=True)
dfResults.head()

Unnamed: 0.1,Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points,Division
0,0,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Stokowski, Kacper",SR,"Hunter, Mason",5Y,"Korstanje, Nyls",SR,"Curtiss, David",SO,NC State,1:22.25,1:20.67,1,40.0,Men's DI
1,1,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Dolan, Jack",SR,"Marchand, Leon",SO,"McCusker, Max",5Y,"Kulow, Jonny",FR,Arizona St,1:21.69,1:21.07,2,34.0,Men's DI
2,2,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Chaney, Adam",JR,"Savickas, Aleksas",FR,"Friese, Eric",SR,"Liendo, Josh",FR,Florida,1:21.73,1:21.14,3,32.0,Men's DI
3,3,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Seeliger, Bjorn",JR,"Bell, Liam",SR,"Rose, Dare",JR,"Alexy, Jack",SO,California,1:22.84,1:21.24,4,30.0,Men's DI
4,4,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Burns, Brendan",SR,"Mathias, Van",5Y,"Frankel, Tomer",JR,"Wight, Gavin",JR,Indiana,1:23.52,1:21.52,5,28.0,Men's DI


In [3]:
dfResults.describe(include='all')

Unnamed: 0.1,Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points,Division
count,1949.0,1949,1949,1949,1949,229,229,229,229,229,229,1949,1949.0,1949,1949,1949.0,1949
unique,,39,6,547,5,164,5,164,5,168,5,70,1662.0,1702,69,,2
top,,Event 3 Women 500 Yard Freestyle,Preliminaries,"Berkoff, Katharine",SR,"Walsh, Alex",SR,"Jones, Emily",SR,"Arens, Abby",SO,Florida,51.9,DFS,---,,Women's DI
freq,,84,1255,10,495,4,61,4,68,4,54,126,6.0,43,66,,1035
mean,488.628014,,,,,,,,,,,,,,,3.658286,
std,284.614824,,,,,,,,,,,,,,,7.421645,
min,0.0,,,,,,,,,,,,,,,0.0,
25%,243.0,,,,,,,,,,,,,,,0.0,
50%,487.0,,,,,,,,,,,,,,,0.0,
75%,730.0,,,,,,,,,,,,,,,4.0,


Everything looks fine after reading in the two dataframes, except that writing and then re-reading the CSVs left an extra index column, so I can drop the 'Unnamed: 0' column.

In [4]:
dfResults = dfResults.drop(['Unnamed: 0'], axis=1)

In [5]:
dfResults.head(25)

Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points,Division
0,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Stokowski, Kacper",SR,"Hunter, Mason",5Y,"Korstanje, Nyls",SR,"Curtiss, David",SO,NC State,1:22.25,1:20.67,1,40.0,Men's DI
1,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Dolan, Jack",SR,"Marchand, Leon",SO,"McCusker, Max",5Y,"Kulow, Jonny",FR,Arizona St,1:21.69,1:21.07,2,34.0,Men's DI
2,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Chaney, Adam",JR,"Savickas, Aleksas",FR,"Friese, Eric",SR,"Liendo, Josh",FR,Florida,1:21.73,1:21.14,3,32.0,Men's DI
3,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Seeliger, Bjorn",JR,"Bell, Liam",SR,"Rose, Dare",JR,"Alexy, Jack",SO,California,1:22.84,1:21.24,4,30.0,Men's DI
4,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Burns, Brendan",SR,"Mathias, Van",5Y,"Frankel, Tomer",JR,"Wight, Gavin",JR,Indiana,1:23.52,1:21.52,5,28.0,Men's DI
5,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Kammann, Bjoern",SO,"Houlie, Michael",5Y,"Crooks, Jordan",SO,"Santos, Guilherme",FR,Tennessee,1:21.43,1:21.59,6,26.0,Men's DI
6,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Lowe, Dalton",JR,"Petrashov, Denis",SR,"Elaraby, Abdelrahman",SR,"Eastman, Michael",5Y,Louisville,1:23.59,1:22.43,7,24.0,Men's DI
7,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Brownstead, Matt",JR,"Nichols, Noah",JR,"Edwards, Max",SR,"Lamb, August",SR,Virginia,1:23.03,1:22.51,8,22.0,Men's DI
8,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Stoffle, Aidan",SR,"Mikuta, Reid",JR,"Stoffle, Nate",SO,"Makinen, Kalle",FR,Auburn,1:22.98,1:22.67,9,18.0,Men's DI
9,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"MacAlister, Leon",SR,"Polonsky, Ron",SO,"Minakov, Andrei",JR,"Gu, Rafael",FR,Stanford,1:24.00,1:22.69,10,14.0,Men's DI


In [6]:
dfResults[dfResults.Category == 'Championship Final'].head()

Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points,Division
44,Event 3 Men 500 Yard Freestyle,Championship Final,"Hobson, Luke",SO,,,,,,,Texas,4:10.51,4:07.37,1,20.0,Men's DI
45,Event 3 Men 500 Yard Freestyle,Championship Final,"Johnston, David",JR,,,,,,,Texas,4:10.02,4:08.79,2,17.0,Men's DI
46,Event 3 Men 500 Yard Freestyle,Championship Final,"Magahey, Jake",JR,,,,,,,Georgia,4:10.83,4:09.24,3,16.0,Men's DI
47,Event 3 Men 500 Yard Freestyle,Championship Final,"Newmark, Jake",JR,,,,,,,Wisconsin,4:10.80,4:10.12,4,15.0,Men's DI
48,Event 3 Men 500 Yard Freestyle,Championship Final,"Mitchell, Jake",JR,,,,,,,Florida,4:11.65,4:10.54,5,14.0,Men's DI


Very brief checks by eye of results for both a relay and an individual event don't show obvious major errors. I'll quickly modify the name columns to get First + Last names.

In [7]:
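# Names are stored as "Last, First": split on the first comma and rebuild as "First Last".
# Entries without a comma have no second part, so they end up as NaN here.
for col in ['Name1','Name2','Name3','Name4']:
    parts = dfResults[col].str.split(',', n=1, expand=True)
    dfResults[col] = parts[1].str.strip() + ' ' + parts[0].str.strip()
dfResults.head()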

Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points,Division
0,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,Kacper Stokowski,SR,Mason Hunter,5Y,Nyls Korstanje,SR,David Curtiss,SO,NC State,1:22.25,1:20.67,1,40.0,Men's DI
1,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,Jack Dolan,SR,Leon Marchand,SO,Max McCusker,5Y,Jonny Kulow,FR,Arizona St,1:21.69,1:21.07,2,34.0,Men's DI
2,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,Adam Chaney,JR,Aleksas Savickas,FR,Eric Friese,SR,Josh Liendo,FR,Florida,1:21.73,1:21.14,3,32.0,Men's DI
3,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,Bjorn Seeliger,JR,Liam Bell,SR,Dare Rose,JR,Jack Alexy,SO,California,1:22.84,1:21.24,4,30.0,Men's DI
4,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,Brendan Burns,SR,Van Mathias,5Y,Tomer Frankel,JR,Gavin Wight,JR,Indiana,1:23.52,1:21.52,5,28.0,Men's DI


In [8]:
dfResults.dtypes

Event              object
Category           object
Name1              object
Year1              object
Name2              object
Year2              object
Name3              object
Year3              object
Name4              object
Year4              object
School             object
QualifyingTime     object
Time               object
Place              object
Points            float64
Division           object
dtype: object

Looking at the data types, I see everything is an object other than the points (there are some half-point results in the case of a tie, so float is required for those). The Place could be an int, although currently non-placing swimmers are given a place of '---'. This could be changed to NaN so that the column could be converted to an integer type. The times, other than 'DQ' (disqualified) and 'DFS' (declared false start), could be stored as datetimes to make operations with them easier. Everything else should stay an object.
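The Place conversion described above could be sketched as follows. This runs on a small made-up series rather than the real column, and it assumes '---' is the only non-numeric marker, which I have not verified against all 69 unique Place values. The nullable `Int64` dtype is used because a plain `int` column cannot hold missing values.

```python
import pandas as pd

# Toy stand-in for dfResults.Place (illustrative values, not taken from the meet)
place = pd.Series(['1', '2', '---', '10'])

# Anything that is not a number (here '---') is coerced to NaN,
# then the column is cast to pandas' nullable integer dtype
place_int = pd.to_numeric(place, errors='coerce').astype('Int64')
print(place_int)
```

If other markers turn up in the real column, they would also be coerced to missing values, so it is worth comparing the count of missing places before and after the conversion.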

In [9]:
dfResults.Event.unique()

array([' Event 1  Men 200 Yard Medley Relay',
       ' Event 2  Men 800 Yard Freestyle Relay',
       ' Event 3  Men 500 Yard Freestyle', ' Event 4  Men 200 Yard IM',
       ' Event 5  Men 50 Yard Freestyle',
       ' Event 7  Men 200 Yard Freestyle Relay',
       ' Event 8  Men 400 Yard IM', ' Event 9  Men 100 Yard Butterfly',
       ' Event 10  Men 200 Yard Freestyle',
       ' Event 11  Men 100 Yard Breaststroke',
       ' Event 12  Men 100 Yard Backstroke',
       ' Event 14  Men 400 Yard Medley Relay',
       ' Event 15  Men 1650 Yard Freestyle',
       ' Event 16  Men 200 Yard Backstroke',
       ' Event 17  Men 100 Yard Freestyle',
       ' Event 18  Men 200 Yard Breaststroke',
       ' Event 19  Men 200 Yard Butterfly',
       ' Event 21  Men 400 Yard Freestyle Relay',
       ' Event 110  Men 200 Yard Freestyle Swim-off',
       ' Event 112  Men 100 Yard Backstroke Swim-off',
       ' Event 119  Men 200 Yard Butterfly Swim-off',
       ' Event 1  Women 200 Yard Medley Relay',
 

In [10]:
# Get rid of extra leading/trailing spaces in string variables
# There's an extra space between event number and event name that is annoying but I'll leave as is for now
dfObj = dfResults.select_dtypes(['object'])
dfResults[dfObj.columns] = dfObj.apply(lambda x: x.str.strip())
dfResults.Event.unique()

array(['Event 1  Men 200 Yard Medley Relay',
       'Event 2  Men 800 Yard Freestyle Relay',
       'Event 3  Men 500 Yard Freestyle', 'Event 4  Men 200 Yard IM',
       'Event 5  Men 50 Yard Freestyle',
       'Event 7  Men 200 Yard Freestyle Relay',
       'Event 8  Men 400 Yard IM', 'Event 9  Men 100 Yard Butterfly',
       'Event 10  Men 200 Yard Freestyle',
       'Event 11  Men 100 Yard Breaststroke',
       'Event 12  Men 100 Yard Backstroke',
       'Event 14  Men 400 Yard Medley Relay',
       'Event 15  Men 1650 Yard Freestyle',
       'Event 16  Men 200 Yard Backstroke',
       'Event 17  Men 100 Yard Freestyle',
       'Event 18  Men 200 Yard Breaststroke',
       'Event 19  Men 200 Yard Butterfly',
       'Event 21  Men 400 Yard Freestyle Relay',
       'Event 110  Men 200 Yard Freestyle Swim-off',
       'Event 112  Men 100 Yard Backstroke Swim-off',
       'Event 119  Men 200 Yard Butterfly Swim-off',
       'Event 1  Women 200 Yard Medley Relay',
       'Event 2  Women 

In [11]:
dfResults[dfResults.Event == 'Event 1  Women 200 Yard Medley Relay']

Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points,Division
914,Event 1 Women 200 Yard Medley Relay,Timed Final Relay,Gretchen Walsh,SO,Alex Walsh,JR,Lexi Cuomo,SR,Kate Douglass,SR,Virginia,1:31.73,1:31.51,1,40.0,Women's DI
915,Event 1 Women 200 Yard Medley Relay,Timed Final Relay,Katharine Berkoff,SR,Heather MacCausland,SR,Kylee Alons,5Y,Abby Arens,JR,NC State,1:33.02,1:32.42,2,34.0,Women's DI
916,Event 1 Women 200 Yard Medley Relay,Timed Final Relay,Olivia Bray,JR,Anna Elendt,JR,Emma Sticklen,JR,Grace Cooper,JR,Texas,1:33.70,1:33.22,3,32.0,Women's DI
917,Event 1 Women 200 Yard Medley Relay,Timed Final Relay,Nyah Funderburke,SO,,SR,,JR,Teresa Ivan,SO,Ohio St,1:33.95,1:33.93,4,30.0,Women's DI
918,Event 1 Women 200 Yard Medley Relay,Timed Final Relay,Abby Hay,SR,Cecilia Viberg,FR,Christiana Regenauer,SR,Gabi Albiero,JR,Louisville,1:34.23,1:34.37,5,28.0,Women's DI
919,Event 1 Women 200 Yard Medley Relay,Timed Final Relay,Isabelle Stadden,JR,Jade Neser,JR,Mia Kragh,SO,Emma Davidson,SR,California,1:35.40,1:34.75,6,26.0,Women's DI
920,Event 1 Women 200 Yard Medley Relay,Timed Final Relay,Rhyan White,5Y,Avery Wiseman,SO,Emily Jones,FR,Kalia Antoniou,5Y,Alabama,1:34.20,1:34.83,7,24.0,Women's DI
921,Event 1 Women 200 Yard Medley Relay,Timed Final Relay,Greer Pattison,SO,Skyler Smith,SO,Ellie VanNote,SR,Grace Countie,SR,UNC,1:34.70,1:35.01,8,22.0,Women's DI
922,Event 1 Women 200 Yard Medley Relay,Timed Final Relay,Claire Curzan,FR,Allie Raab,5Y,Emma Wheal,SR,Amy Tang,SO,Stanford,1:35.42,1:35.44,9,18.0,Women's DI
923,Event 1 Women 200 Yard Medley Relay,Timed Final Relay,Caroline Famous,JR,Kaitlyn Dobler,JR,Anicka Delgado,JR,Hanna Henderson,JR,Southern California,1:35.52,1:35.52,10,14.0,Women's DI


A few relay entries lost the swimmers' last names when the PDFs were read, so those name columns ended up as NaN after the split above. There are only a handful of these and I cannot see any pattern to them, so I assume it was a PDF reader error. I will consult the published results and fix them by hand.

In [12]:
dfResults.loc[917, ['Name2','Name3']] = ['Hannah Bach', 'Katherine Zenick']

In [13]:
dfResults[dfResults.Event == 'Event 14  Men 400 Yard Medley Relay']

Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points,Division
577,Event 14 Men 400 Yard Medley Relay,Timed Final Relay,Adam Chaney,JR,Dillon Hillis,5Y,Josh Liendo,FR,Macguire McDuff,SO,Florida,2:59.48,2:58.32,1,40.0,Men's DI
578,Event 14 Men 400 Yard Medley Relay,Timed Final Relay,Brendan Burns,SR,Josh Matheny,SO,Tomer Frankel,JR,Rafael Miroslaw,SO,Indiana,3:01.53,2:59.09,2,34.0,Men's DI
579,Event 14 Men 400 Yard Medley Relay,Timed Final Relay,Jack Dolan,SR,Leon Marchand,SO,Max McCusker,5Y,Jonny Kulow,FR,Arizona St,3:01.39,2:59.18,3,32.0,Men's DI
580,Event 14 Men 400 Yard Medley Relay,Timed Final Relay,Kacper Stokowski,SR,Mason Hunter,5Y,Aiden Hayes,SO,Luke Miller,JR,NC State,3:01.10,3:00.22,4,30.0,Men's DI
581,Event 14 Men 400 Yard Medley Relay,Timed Final Relay,Destin Lasco,JR,Reece Whitley,5Y,Gabriel Jett,SO,Bjorn Seeliger,JR,California,3:01.80,3:00.38,5,28.0,Men's DI
582,Event 14 Men 400 Yard Medley Relay,Timed Final Relay,Nick Simons,FR,Lyubomir Epitropov,5Y,Jordan Crooks,SO,Guilherme Santos,FR,Tennessee,3:02.51,3:02.05,6,26.0,Men's DI
583,Event 14 Men 400 Yard Medley Relay,Timed Final Relay,Forest Webb,SR,Carles Coll Marti,JR,Youssef Ramadan,JR,Luis Dominguez Calonge,SO,Virginia Tech,3:03.40,3:02.53,7,24.0,Men's DI
584,Event 14 Men 400 Yard Medley Relay,Timed Final Relay,Carson Foster,JR,Caspar Corbeau,SR,Sterling Crane,SR,Luke Hobson,SO,Texas,3:04.57,3:03.00,8,22.0,Men's DI
585,Event 14 Men 400 Yard Medley Relay,Timed Final Relay,Jack Dahlgren,5Y,Ben Patton,SR,Clement Secchi,5Y,Grant Bochenski,SO,Missouri,3:03.14,3:03.26,9,18.0,Men's DI
586,Event 14 Men 400 Yard Medley Relay,Timed Final Relay,Matt Brownstead,JR,Noah Nichols,JR,Tim Connery,SO,Jack Aikins,SO,Virginia,3:03.29,3:03.50,10,14.0,Men's DI


In [14]:
dfResults.loc[598, ['Name2','Name3','Name4']] = ['Sean Faikish', 'Cason Wilburn', 'Thacher Scannell']

In [15]:
dfResults[(dfResults.Name3.isna()) & (~dfResults.Year3.isna())]

Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points,Division
935,Event 1 Women 200 Yard Medley Relay,Timed Final Relay,Caitlin Brooks,SR,Denise Phelan,FR,,FR,,SR,Kentucky,1:36.43,1:37.55,22,0.0,Women's DI


In [16]:
dfResults.loc[935, ['Name3','Name4']] = ['Lydia Hanlon', 'Kaitlynn Wheeler']

In [17]:
dfResults[(dfResults.Name4.isna()) & (~dfResults.Year4.isna())]

Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points,Division
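The per-column checks above (a missing name alongside a recorded year) could be folded into one helper that scans every relay slot at once. This is a minimal sketch on toy data; `rows_missing_names` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy relay rows: the second row lost Name2 but kept Year2
toy = pd.DataFrame({
    'Name1': ['A One', 'B One'], 'Year1': ['SR', 'JR'],
    'Name2': ['A Two', np.nan],  'Year2': ['SO', 'FR'],
})

def rows_missing_names(df, slots):
    """Return rows where any NameN is missing although YearN is present."""
    mask = pd.Series(False, index=df.index)
    for n in slots:
        mask |= df[f'Name{n}'].isna() & df[f'Year{n}'].notna()
    return df[mask]

broken = rows_missing_names(toy, slots=[1, 2])
print(broken.index.tolist())
```

On the real data the call would be `rows_missing_names(dfResults, slots=[1, 2, 3, 4])`, and an empty result would confirm that all hand fixes have been applied.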


In [18]:
dfResults.describe(include='all')

Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points,Division
count,1949,1949,1949,1949,229,229,229,229,229,229,1949,1949.0,1949,1949,1949.0,1949
unique,39,6,547,5,163,5,162,5,166,5,70,1662.0,1702,69,,2
top,Event 3 Women 500 Yard Freestyle,Preliminaries,Katharine Berkoff,SR,Alex Walsh,SR,Callie Dickinson,SR,Abby Arens,SO,Florida,51.9,DFS,---,,Women's DI
freq,84,1255,10,495,4,61,4,68,4,54,126,6.0,43,66,,1035
mean,,,,,,,,,,,,,,,3.658286,
std,,,,,,,,,,,,,,,7.421645,
min,,,,,,,,,,,,,,,0.0,
25%,,,,,,,,,,,,,,,0.0,
50%,,,,,,,,,,,,,,,0.0,
75%,,,,,,,,,,,,,,,4.0,


In [28]:
# Store the times as datetimes. Some entries include minutes ('1:20.67') and some do not ('51.9'),
# so parse with the minutes format first and fall back to the seconds-only format for the rest.
# DQ and DFS entries match neither format and are coerced to NaT.
# The date defaults to Jan. 1, 1900, which I will leave as is: only differences between times matter,
# and a label is a simpler way to tell meets apart than their dates (the meets are multi-day anyway).
dfResults.Time = pd.to_datetime(dfResults.Time, format='%M:%S.%f', errors='coerce').fillna(
                pd.to_datetime(dfResults.Time, format='%S.%f', errors='coerce'))
dfResults.QualifyingTime = pd.to_datetime(dfResults.QualifyingTime, format='%M:%S.%f', errors='coerce').fillna(
                pd.to_datetime(dfResults.QualifyingTime, format='%S.%f', errors='coerce'))
# Note: datetime_is_numeric exists only in pandas 1.1-1.5; pandas 2.x removed it and always summarizes datetimes numerically
dfResults.describe(datetime_is_numeric=True)

Unnamed: 0,QualifyingTime,Time,Points
count,1949,1883,1949.0
mean,1900-01-01 00:02:25.066541568,1900-01-01 00:02:27.072899328,3.658286
min,1900-01-01 00:00:17.930000,1900-01-01 00:00:18.250000,0.0
25%,1900-01-01 00:00:51.249999872,1900-01-01 00:00:51.180000,0.0
50%,1900-01-01 00:01:42.240000,1900-01-01 00:01:42.590000128,0.0
75%,1900-01-01 00:02:10.020000,1900-01-01 00:02:12.160000,4.0
max,1900-01-01 00:16:21.380000,1900-01-01 00:16:32.050000,40.0
std,,,7.421645


In [30]:
dfResults.isna().sum()

Event                0
Category             0
Name1                0
Year1                0
Name2             1720
Year2             1720
Name3             1720
Year3             1720
Name4             1720
Year4             1720
School               0
QualifyingTime       0
Time                66
Place                0
Points               0
Division             0
dtype: int64

The qualifying and result times are now datetimes. The Time column has 66 NaT entries due to DQs and DFSs, while the qualifying time has none since a qualifying time is required. There are also NaNs in the relay swimmer columns (name/year 2-4) for individual event entries.
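An alternative to datetimes, shown here only as a sketch and not what the notebook does, is to store each swim as a duration. That avoids the placeholder 1900 date and makes time drops a plain subtraction. `to_seconds` is a hypothetical helper, and the sample strings are of the same form as those in the Time column.

```python
import pandas as pd

def to_seconds(text):
    """Convert 'M:SS.ff' or 'SS.ff' to float seconds; anything else becomes NaN."""
    try:
        minutes, _, seconds = str(text).rpartition(':')
        return 60 * int(minutes or 0) + float(seconds)
    except ValueError:
        return float('nan')

raw = pd.Series(['1:20.67', '51.9', 'DQ', 'DFS'])
seconds = raw.map(to_seconds)                    # floats, NaN for DQ/DFS
durations = pd.to_timedelta(seconds, unit='s')   # timedeltas, NaT for DQ/DFS
print(durations)
```

With durations, an improvement from qualifying time to final time is just the difference of two columns, and `.dt.total_seconds()` recovers a float when one is needed.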

In [34]:
(~dfResults.Event.str.contains("Relay")).sum()

1720

Finally, in case I want to only consider finals or prelims separately, or want to drop the non-recorded times, I will split the dataframe into a few different subsets.

In [None]:
dfResultsFinals = dfResults[dfResults.Category.isin(
                            ['Timed Final Relay','Championship Final','Consolation Final', 'Timed Final Individual'])]
dfResultsPrelims = dfResults[dfResults.Category.isin(['Preliminaries','Swim-off'])]

In [None]:
print(len(dfResults), len(dfResultsFinals), len(dfResultsPrelims))

In [None]:
# Time is now a datetime column, so DQ/DFS entries are NaT rather than the original strings
dfResultsDropDQ = dfResults[dfResults.Time.notna()]
dfResultsDropDQ.describe(include='all')

In [None]:
dfResultsFinals.dtypes

In [None]:
# DQ/DFS swims are NaT after the datetime conversion, so filter on missing times
dfResultsFinals[dfResultsFinals.Time.notna()].Place.unique()

I've now cleaned the dataframes and split the data into a few different subsets that will be useful for different analyses. I can now tackle the questions I set forth to answer.

### Analyzing Data

#### Question 1: Which team(s) did best (and how should that be defined)?

The obvious answer to this question would be the team that scored the most points in the men's and women's competitions. And if I consider those 2 parts of the same team, the team with the most combined points would be the best. These answers can essentially be read right off the results (the PDF results add diving points which I will be ignoring for this analysis of purely swimming results).

But there are a few other considerations of which teams did the best at championships. Beyond just points scored, perhaps success should be measured in terms of the number of swimmers or total number of swims at championships or in a finals heat -- the most successful teams are the teams that sent the most swimmers to National Championships. Or, perhaps the opposite should be considered -- which teams had the most points per swimmer, regardless of the total number they sent.

Alternatively, maybe success should be defined by performance at this meet relative to other meets this season. Which swimmers and teams dropped the most time at this meet?

So, I will take these questions in turn and come up with a few notions of which teams had the "best" meet.

In [36]:
print(dfResults.School.unique())
print(dfResults.School.nunique())

['NC State' 'Arizona St' 'Florida' 'California' 'Indiana' 'Tennessee'
 'Louisville' 'Virginia' 'Auburn' 'Stanford' 'Alabama' 'Texas' 'Georgia'
 'Texas A&M' 'Wisconsin' 'Ohio St' 'Notre Dame' 'Michigan' 'Minnesota'
 'Pittsburgh' 'Harvard' 'Missouri' 'Virginia Tech' 'UNC' 'Kentucky'
 'Georgia Tech' 'Yale' 'Princeton' 'Florida St' 'Purdue' 'SMU'
 'West Virginia' 'Towson' 'Brown' 'Utah' 'Brigham Young'
 'Southern California' 'LSU' 'Penn St' 'Cal Baptist' 'Arizona' 'Air Force'
 'UNLV' 'Northwestern' 'Columbia' 'TCU' 'IUPUI' 'SIUC' 'South Carolina'
 'Arkansas' 'Duke' 'UCLA' 'Miami (Ohio)' 'Penn' 'Nebraska' 'Akron'
 'Oakland' 'GWU' 'Miami (FL)' 'San Diego St' 'Hawaii' 'UNC Asheville'
 'Buffalo' 'William & Mary' 'Nevada' 'Rice' 'Denver' "Florida Int'l"
 'Cincinnati' 'Washington St.']
70


In [69]:
print(dfResults[dfResults.Division=='Women\'s DI'].School.nunique())
print(dfResults[~(dfResults.Division=='Women\'s DI')].School.nunique())
print(len(np.intersect1d(dfResults[dfResults.Division=='Women\'s DI'].School.unique(), 
                            dfResults[~(dfResults.Division=='Women\'s DI')].School.unique())))

55
49
34


There are 55 unique schools in the Women's competition and 49 in the Men's. Since 34 schools appear in both, there are 70 unique schools across the two meets.
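As a quick arithmetic check on these counts, schools that appear in both meets are counted twice when the two totals are added, so the overlap has to be subtracted once:

```python
# Counts printed by the cell above
women, men, both = 55, 49, 34

# Inclusion-exclusion: schools in both meets are counted once, not twice
total = women + men - both
print(total)  # matches dfResults.School.nunique()
```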

In [63]:
pointsTable = dfResults[['School','Points','Division']].groupby(['School','Division']).sum().reset_index()
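# Select Points explicitly so the Division strings are not summed (pandas 2.x would concatenate them)
pointsTable.groupby('School')[['Points']].sum().sort_values(by='Points',ascending=False).head(10)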

Unnamed: 0_level_0,Points
School,Unnamed: 1_level_1
Texas,705.5
NC State,636.5
Virginia,619.5
California,616.0
Florida,527.0
Indiana,459.0
Stanford,454.5
Arizona St,449.0
Tennessee,398.5
Louisville,358.0


In [64]:
pointsTable.sort_values(by='Points', ascending=False).head(10)

Unnamed: 0,School,Division,Points
95,Virginia,Women's DI,541.5
15,California,Men's DI,479.0
6,Arizona St,Men's DI,430.0
50,NC State,Men's DI,373.5
83,Texas,Women's DI,365.5
21,Florida,Men's DI,353.0
82,Texas,Men's DI,340.0
78,Stanford,Women's DI,333.0
34,Indiana,Men's DI,275.0
41,Louisville,Women's DI,266.0


In [65]:
pointsTable[pointsTable.Division=='Women\'s DI'].sort_values(by='Points', ascending=False).head(10)

Unnamed: 0,School,Division,Points
95,Virginia,Women's DI,541.5
83,Texas,Women's DI,365.5
78,Stanford,Women's DI,333.0
41,Louisville,Women's DI,266.0
51,NC State,Women's DI,263.0
60,Ohio St,Women's DI,216.0
81,Tennessee,Women's DI,214.0
35,Indiana,Women's DI,184.0
22,Florida,Women's DI,174.0
16,California,Women's DI,137.0


In [66]:
pointsTable[~(pointsTable.Division=='Women\'s DI')].sort_values(by='Points', ascending=False).head(10)

Unnamed: 0,School,Division,Points
15,California,Men's DI,479.0
6,Arizona St,Men's DI,430.0
50,NC State,Men's DI,373.5
21,Florida,Men's DI,353.0
82,Texas,Men's DI,340.0
34,Indiana,Men's DI,275.0
80,Tennessee,Men's DI,184.5
96,Virginia Tech,Men's DI,127.0
77,Stanford,Men's DI,121.5
9,Auburn,Men's DI,121.0


From the above, we can see the top schools by points scored under a few different groupings. For the Women's meet the top 3 schools were Virginia, Texas, and Stanford. For the Men's the top 3 were Cal, Arizona St., and NC State. Ranking the men's and women's teams together, the highest scorers were the Virginia Women, Cal Men, and Arizona St. Men. Finally, when combining each school's men's and women's scores, the top 3 schools were Texas, NC State, and Virginia.
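The three views above (per division, ranked together, and combined) can also be read from a single wide table. The sketch below uses only the six per-division totals shown in the outputs above for Texas, NC State, and California; `wide` and `Combined` are names introduced here for illustration.

```python
import pandas as pd

# Per-division totals copied from the tables above
points = pd.DataFrame({
    'School':   ['Texas', 'Texas', 'NC State', 'NC State', 'California', 'California'],
    'Division': ["Men's DI", "Women's DI"] * 3,
    'Points':   [340.0, 365.5, 373.5, 263.0, 479.0, 137.0],
})

# One row per school, one column per division, plus a combined total
wide = points.pivot_table(index='School', columns='Division',
                          values='Points', aggfunc='sum', fill_value=0)
wide['Combined'] = wide.sum(axis=1)
print(wide.sort_values('Combined', ascending=False))
```

Using `fill_value=0` matters on the full data, since schools that entered only one of the two meets would otherwise get NaN for the other division.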

Next, we look at the related but separate question of which schools led in terms of number of swimmers.

First, I'll only consider swimmers in individual events.

In [88]:
swimmersTable = dfResults[~dfResults.Event.str.contains("Relay")][['School','Name1','Points','Division']].groupby(
    ['School','Division']).agg({'Name1':'nunique','Points':'sum'}).reset_index()

In [89]:
#swimmersTable
swimmersTable.sort_values(by='Name1', ascending=False).head(10)

Unnamed: 0,School,Division,Name1,Points
94,Virginia,Women's DI,17,341.5
50,NC State,Men's DI,17,215.5
22,Florida,Women's DI,16,78.0
21,Florida,Men's DI,16,173.0
15,California,Men's DI,16,321.0
78,Stanford,Women's DI,15,183.0
60,Ohio St,Women's DI,15,94.0
82,Texas,Men's DI,15,216.0
6,Arizona St,Men's DI,15,270.0
83,Texas,Women's DI,14,219.5


Consistency check to make sure I didn't mess anything up while grouping and summing: do these values match the points totals for teams from individual events?

In [91]:
dfResults[~dfResults.Event.str.contains("Relay")][['School','Points','Division']].groupby(
    ['School','Division']).sum().reset_index().sort_values(by='Points', ascending=False)

Unnamed: 0,School,Division,Points
94,Virginia,Women's DI,341.5
15,California,Men's DI,321.0
6,Arizona St,Men's DI,270.0
83,Texas,Women's DI,219.5
82,Texas,Men's DI,216.0
...,...,...,...
79,TCU,Men's DI,0.0
49,Missouri,Women's DI,0.0
33,IUPUI,Men's DI,0.0
31,Harvard,Men's DI,0.0


Matching teams in the two previous tables, I can see that they give the same results.
Now, I'll add a points per swimmer column to the table.

In [102]:
swimmersTable['PointsPerSwimmer'] = swimmersTable.Points / swimmersTable.Name1
swimmersTable.rename(columns = {'Name1':'Swimmers'}, inplace=True)
swimmersTable.sort_values(by='PointsPerSwimmer',ascending=False).head(10)

Unnamed: 0,School,Division,Swimmers,Points,PointsPerSwimmer
38,LSU,Men's DI,1,43.5,43.5
94,Virginia,Women's DI,17,341.5,20.088235
15,California,Men's DI,16,321.0,20.0625
6,Arizona St,Men's DI,15,270.0,18.0
56,Notre Dame,Men's DI,3,52.0,17.333333
83,Texas,Women's DI,14,219.5,15.678571
34,Indiana,Men's DI,9,139.0,15.444444
82,Texas,Men's DI,15,216.0,14.4
39,LSU,Women's DI,4,53.0,13.25
81,Tennessee,Women's DI,10,132.0,13.2


When looking at the number of swimmers and points per swimmer, we see many of the same names as in the points table. Virginia Women did well in points, swimmers, and points per swimmer, as did Cal and Arizona St. Men.

One standout when looking at points per swimmer is LSU Men who took 43.5 points with their only swimmer.

In [94]:
dfResults[(dfResults.School=='LSU') & (dfResults.Division=='Men\'s DI')]

Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points,Division
188,Event 5 Men 50 Yard Freestyle,Championship Final,Brooks Curry,SR,,,,,,,LSU,1900-01-01 00:00:18.720,1900-01-01 00:00:18.760,4,15.0,Men's DI
204,Event 5 Men 50 Yard Freestyle,Preliminaries,Brooks Curry,SR,,,,,,,LSU,1900-01-01 00:00:18.940,1900-01-01 00:00:18.720,4,0.0,Men's DI
397,Event 10 Men 200 Yard Freestyle,Championship Final,Brooks Curry,SR,,,,,,,LSU,1900-01-01 00:01:31.940,1900-01-01 00:01:31.300,4,15.0,Men's DI
413,Event 10 Men 200 Yard Freestyle,Preliminaries,Brooks Curry,SR,,,,,,,LSU,1900-01-01 00:01:33.150,1900-01-01 00:01:31.940,4,0.0,Men's DI
693,Event 17 Men 100 Yard Freestyle,Championship Final,Brooks Curry,SR,,,,,,,LSU,1900-01-01 00:00:41.170,1900-01-01 00:00:41.030,5,13.5,Men's DI
708,Event 17 Men 100 Yard Freestyle,Preliminaries,Brooks Curry,SR,,,,,,,LSU,1900-01-01 00:00:41.860,1900-01-01 00:00:41.170,4,0.0,Men's DI


All of LSU Men's swims are from Brooks Curry, who totaled 43.5 points across 3 events. Other teams that picked up lots of points per swimmer but didn't have enough total swimmers to reach the top of the team points totals include Notre Dame's Men (52 points from 3 swimmers), Indiana's Men (139 points from 9 swimmers), and LSU's Women (53 points from 4 swimmers). This also highlights Brooks Curry's great result: by himself he scored 82% of what the LSU Women's 4 swimmers scored together (43.5 of 53 points).

This also suggests another metric: points per number of swims (rather than swimmers) or average place. This picks up teams that tended to have their swimmers score lots of points (high points finishes). I expect to find lots of the same teams, but there's also a possibility of finding something similar to the LSU/Brooks Curry finding. Perhaps there are teams that didn't necessarily get lots of swimmers into finals and score lots of points, but when they did they tended to place well.

Before that, I will take a look at teams on the other end of the previous table: those who did not score many points.

In [103]:
swimmersTable.sort_values(by=['Points','Swimmers'], ascending=[True,False])

Unnamed: 0,School,Division,Swimmers,Points,PointsPerSwimmer
8,Arkansas,Women's DI,6,0.0,0.000000
74,South Carolina,Women's DI,6,0.0,0.000000
24,Florida St,Men's DI,3,0.0,0.000000
49,Missouri,Women's DI,3,0.0,0.000000
72,San Diego St,Women's DI,3,0.0,0.000000
...,...,...,...,...,...
82,Texas,Men's DI,15,216.0,14.400000
83,Texas,Women's DI,14,219.5,15.678571
6,Arizona St,Men's DI,15,270.0,18.000000
15,California,Men's DI,16,321.0,20.062500


The Arkansas and South Carolina Women's teams each qualified 6 swimmers for the national championships but did not score any points in individual events.

In [106]:
swimmersTable[swimmersTable.Points==0].sort_values(by='Swimmers', ascending=False).reset_index()

Unnamed: 0,index,School,Division,Swimmers,Points,PointsPerSwimmer
0,8,Arkansas,Women's DI,6,0.0,0.0
1,74,South Carolina,Women's DI,6,0.0,0.0
2,49,Missouri,Women's DI,3,0.0,0.0
3,72,San Diego St,Women's DI,3,0.0,0.0
4,24,Florida St,Men's DI,3,0.0,0.0
5,57,Notre Dame,Women's DI,2,0.0,0.0
6,97,Washington St.,Women's DI,2,0.0,0.0
7,75,Southern California,Men's DI,2,0.0,0.0
8,64,Pittsburgh,Women's DI,2,0.0,0.0
9,5,Arizona,Women's DI,2,0.0,0.0


Overall, 36 teams ended the meet with 0 points from individual events.

In [117]:
dfTmp = swimmersTable[['School','Points']].groupby('School').sum().reset_index()
dfTmp[dfTmp.Points==0].reset_index()

Unnamed: 0,index,School,Points
0,3,Arizona,0.0
1,5,Arkansas,0.0
2,7,Brigham Young,0.0
3,8,Brown,0.0
4,9,Buffalo,0.0
5,10,Cal Baptist,0.0
6,12,Cincinnati,0.0
7,13,Columbia,0.0
8,14,Denver,0.0
9,19,GWU,0.0


Combining Men's and Women's teams, 26 schools left the championships with no points from individual events.

Now, back to considering points per swim or average place. First I will consider total event entries per school. All entries will have a prelim entry, and those that made finals will have a repeat (except for the 1650 which has just one swim, the 'timed final'). I am only interested in counting each case as one "swimmer entry." So I want a table of total number of prelim swims per team, along with total points scored. I will also consider which schools did best at converting prelims entries into finals qualifications.

Then, I will limit the scope to finals swims only and ask which schools converted finals swims into the most points and the highest average finishing places.

In [293]:
# Individual events only, keeping the columns needed to count swims and points by category
# .copy() so that columns can be added to this subset later without a SettingWithCopyWarning
pointsTable = dfResults[~dfResults.Event.str.contains("Relay")][
    ['School','Points','Name1','Division','Category']].copy()
pointsTable.Category.unique()

array(['Championship Final', 'Consolation Final', 'Preliminaries',
       'Timed Final Individual', 'Swim-off'], dtype=object)

"Timed Final Individual" is both a prelim and final, so I will have to count it as both. I'll eventually count every swim in this type of event as if it were in both Preliminaries and the Championship Final. This messes with the analysis a little because there will be many more than just 8 swimmers in the championship final for this event, but everyone has a chance to win the championship final points, so I think this is the best option. A swim-off is essentially a re-do of a prelim so it shouldn't really be counted here. I'll make new columns that are bools for each category that I can easily sum over.

In [294]:
pointsTable['Prelim'] = (pointsTable.Category == 'Preliminaries') | (pointsTable.Category == 'Timed Final Individual')
pointsTable['BFinal'] = pointsTable.Category == 'Consolation Final'
pointsTable['AFinal'] = pointsTable.Category == 'Championship Final'
pointsTable['TimedFinal'] = pointsTable.Category == 'Timed Final Individual'
pointsTable['BPoints'] = pointsTable[pointsTable.Category == 'Consolation Final'].Points
pointsTable['APoints'] = pointsTable[pointsTable.Category == 'Championship Final'].Points
pointsTable['TimedPoints'] = pointsTable[pointsTable.Category == 'Timed Final Individual'].Points
pointsTable

Unnamed: 0,School,Points,Name1,Division,Category,Prelim,BFinal,AFinal,TimedFinal,BPoints,APoints,TimedPoints
44,Texas,20.0,Luke Hobson,Men's DI,Championship Final,False,False,True,False,,20.0,
45,Texas,17.0,David Johnston,Men's DI,Championship Final,False,False,True,False,,17.0,
46,Georgia,16.0,Jake Magahey,Men's DI,Championship Final,False,False,True,False,,16.0,
47,Wisconsin,15.0,Jake Newmark,Men's DI,Championship Final,False,False,True,False,,15.0,
48,Florida,14.0,Jake Mitchell,Men's DI,Championship Final,False,False,True,False,,14.0,
...,...,...,...,...,...,...,...,...,...,...,...,...
1918,Ohio St,0.0,Kyra Sommerstad,Women's DI,Preliminaries,True,False,False,False,,,
1919,UCLA,0.0,Paige MacEachern,Women's DI,Preliminaries,True,False,False,False,,,
1920,Duke,0.0,Catherine Purnell,Women's DI,Preliminaries,True,False,False,False,,,
1921,Rice,0.0,Arielle Hayon,Women's DI,Preliminaries,True,False,False,False,,,


In [295]:
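# numeric_only=True sums the numeric and boolean columns and skips the text columns (Name1, Category)
pointsTable = pointsTable.groupby(['School','Division']).sum(numeric_only=True).reset_index()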
pointsTable

Unnamed: 0,School,Division,Points,Prelim,BFinal,AFinal,TimedFinal,BPoints,APoints,TimedPoints
0,Air Force,Men's DI,2.0,3,1,0,0,2.0,0.0,0.0
1,Akron,Women's DI,5.0,9,2,0,0,5.0,0.0,0.0
2,Alabama,Men's DI,21.0,13,2,1,1,8.0,11.0,2.0
3,Alabama,Women's DI,73.0,20,4,2,1,22.0,31.0,20.0
4,Arizona,Men's DI,0.0,3,0,0,0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
98,West Virginia,Men's DI,0.0,3,0,0,0,0.0,0.0,0.0
99,William & Mary,Women's DI,0.0,3,0,0,0,0.0,0.0,0.0
100,Wisconsin,Men's DI,19.0,11,1,1,0,4.0,15.0,0.0
101,Wisconsin,Women's DI,82.0,21,1,4,2,3.0,63.0,16.0


Now I have a table that has a row for each team with the total points scored and number of swims in each category of prelim/final as well as how many points each team got from each type of final.
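The counting step behind this table can be illustrated on a toy example. The sketch below uses made-up rows (schools `A` and `B` and their points are hypothetical) to show how boolean category flags turn into per-team counts once they are summed in a groupby; it illustrates the idea only and does not use the meet data.

```python
import pandas as pd

# Toy data (made up): school A has a prelim/final pair, a timed final, and a swim-off.
toy = pd.DataFrame({
    'School': ['A', 'A', 'A', 'A', 'B'],
    'Category': ['Preliminaries', 'Championship Final',
                 'Timed Final Individual', 'Swim-off', 'Preliminaries'],
    'Points': [0.0, 20.0, 14.0, 0.0, 0.0],
})

# Boolean flags: a timed final also counts as a prelim entry; a swim-off matches no flag.
toy['Prelim'] = toy.Category.isin(['Preliminaries', 'Timed Final Individual'])
toy['AFinal'] = toy.Category == 'Championship Final'
toy['TimedFinal'] = toy.Category == 'Timed Final Individual'

# Summing booleans counts the True values for each school.
counts = toy.groupby('School')[['Prelim', 'AFinal', 'TimedFinal']].sum()
print(counts)
```

Because the swim-off row sets none of the flags, it drops out of every count without needing a separate filter.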

In [296]:
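# Note: Prelim already includes 'Timed Final Individual' swims, so adding TimedFinal
# here counts each timed-final swim twice in the per-prelim denominator.
pointsTable['pointsPerPrelim'] = pointsTable.Points / (pointsTable.Prelim + pointsTable.TimedFinal)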
pointsTable['pointsPerFinal'] = pointsTable.Points / (pointsTable.BFinal + pointsTable.AFinal + pointsTable.TimedFinal)
pointsTable['BPointsPercentage'] = pointsTable.BPoints / pointsTable.Points
pointsTable['APointsPercentage'] = pointsTable.APoints / pointsTable.Points
pointsTable['TimedPointsPercentage'] = pointsTable.TimedPoints / pointsTable.Points
pointsTable

Unnamed: 0,School,Division,Points,Prelim,BFinal,AFinal,TimedFinal,BPoints,APoints,TimedPoints,pointsPerPrelim,pointsPerFinal,BPointsPercentage,APointsPercentage,TimedPointsPercentage
0,Air Force,Men's DI,2.0,3,1,0,0,2.0,0.0,0.0,0.666667,2.000000,1.000000,0.000000,0.000000
1,Akron,Women's DI,5.0,9,2,0,0,5.0,0.0,0.0,0.555556,2.500000,1.000000,0.000000,0.000000
2,Alabama,Men's DI,21.0,13,2,1,1,8.0,11.0,2.0,1.500000,5.250000,0.380952,0.523810,0.095238
3,Alabama,Women's DI,73.0,20,4,2,1,22.0,31.0,20.0,3.476190,10.428571,0.301370,0.424658,0.273973
4,Arizona,Men's DI,0.0,3,0,0,0,0.0,0.0,0.0,0.000000,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,West Virginia,Men's DI,0.0,3,0,0,0,0.0,0.0,0.0,0.000000,,,,
99,William & Mary,Women's DI,0.0,3,0,0,0,0.0,0.0,0.0,0.000000,,,,
100,Wisconsin,Men's DI,19.0,11,1,1,0,4.0,15.0,0.0,1.727273,9.500000,0.210526,0.789474,0.000000
101,Wisconsin,Women's DI,82.0,21,1,4,2,3.0,63.0,16.0,3.565217,11.714286,0.036585,0.768293,0.195122
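
The blank cells in the ratio columns are NaN values: a team with no finals swims and no points produces 0/0. A minimal sketch on made-up totals (the `teams` frame and the `pointsPerEntry` name are hypothetical) makes the missing values explicit by replacing a zero finals count with NaN before dividing. It also assumes each timed-final swim should count once as an entry, which the `Prelim` flag defined earlier already does.

```python
import numpy as np
import pandas as pd

# Toy team totals (made up). Prelim is assumed to already include timed-final swims.
teams = pd.DataFrame({
    'School': ['A', 'B'],
    'Points': [30.0, 0.0],
    'Prelim': [6, 3],
    'BFinal': [1, 0],
    'AFinal': [1, 0],
    'TimedFinal': [1, 0],
})

# Total finals swims per team (timed finals count as finals).
finals = teams.BFinal + teams.AFinal + teams.TimedFinal

# Points per entry, with each entry counted once.
teams['pointsPerEntry'] = teams.Points / teams.Prelim

# A zero finals count becomes NaN, so the ratio is explicitly missing.
teams['pointsPerFinal'] = teams.Points / finals.replace(0, np.nan)
print(teams[['School', 'pointsPerEntry', 'pointsPerFinal']])
```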


In [297]:
pointsTable[['School','Division','Points','Prelim','pointsPerPrelim']].sort_values(by='pointsPerPrelim', ascending=False).head(10)

Unnamed: 0,School,Division,Points,Prelim,pointsPerPrelim
38,LSU,Men's DI,43.5,3,14.5
15,California,Men's DI,321.0,43,7.295455
94,Virginia,Women's DI,341.5,48,6.83
69,SIUC,Men's DI,13.0,2,6.5
83,Texas,Women's DI,219.5,32,6.271429
34,Indiana,Men's DI,139.0,23,6.043478
56,Notre Dame,Men's DI,52.0,8,5.777778
6,Arizona St,Men's DI,270.0,45,5.625
82,Texas,Men's DI,216.0,39,5.142857
35,Indiana,Women's DI,102.0,19,4.857143


At the top of the points per prelim metric we again see the LSU Men's team, which scored 43.5 points from Brooks Curry's 3 entries, for 14.5 points per prelim. This is nearly twice as much as the next team. Cal's Men's team scored an impressive 7.3 points per prelim with 43 entries. The Virginia Women round out the top 3 with 6.8 points per prelim and 48 total prelims. Of note in this top 10 are 2 smaller teams that performed well: SIUC's Men's Team (2 entries, 6.5 points per prelim) and Notre Dame's Men's Team (8 entries, 5.8 points per prelim).

In [229]:
dfResults[(dfResults.School=='SIUC') & (dfResults.Division=='Men\'s DI')]

Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points,Division
526,Event 12 Men 100 Yard Backstroke,Consolation Final,Ruard Van Renen,FR,,,,,,,SIUC,1900-01-01 00:00:45.170,1900-01-01 00:00:44.670,9,9.0,Men's DI
543,Event 12 Men 100 Yard Backstroke,Preliminaries,Ruard Van Renen,FR,,,,,,,SIUC,1900-01-01 00:00:44.890,1900-01-01 00:00:45.170,10,0.0,Men's DI
645,Event 16 Men 200 Yard Backstroke,Consolation Final,Ruard Van Renen,FR,,,,,,,SIUC,1900-01-01 00:01:39.730,1900-01-01 00:01:40.220,13,4.0,Men's DI
658,Event 16 Men 200 Yard Backstroke,Preliminaries,Ruard Van Renen,FR,,,,,,,SIUC,1900-01-01 00:01:40.640,1900-01-01 00:01:39.730,10,0.0,Men's DI


Ruard Van Renen was the only SIUC swimmer at the meet; he made 2 Consolation Finals and scored a total of 13 points by himself. (This strong performance also highlights once again how impressive Brooks Curry's 14.5 points per event as the only LSU Men's entry is.)

Now I'll consider teams that were able to turn prelim swims into finals entries.

In [298]:
pointsTable.assign(FinalPercentage = 
                   (pointsTable.AFinal + pointsTable.BFinal + pointsTable.TimedFinal)/pointsTable.Prelim,
                  Final = pointsTable.AFinal + pointsTable.BFinal + pointsTable.TimedFinal)[
    ['School','Division','Points','Prelim','Final','FinalPercentage']].sort_values(
    by='FinalPercentage',ascending=False).drop('FinalPercentage',axis=1).reset_index(drop=True).head(10)

Unnamed: 0,School,Division,Points,Prelim,Final
0,SIUC,Men's DI,13.0,2,2
1,LSU,Men's DI,43.5,3,3
2,Notre Dame,Men's DI,52.0,8,7
3,Indiana,Women's DI,102.0,19,12
4,California,Men's DI,321.0,43,26
5,Arizona St,Men's DI,270.0,45,27
6,Virginia,Women's DI,341.5,48,28
7,Texas,Women's DI,219.5,32,18
8,Louisville,Women's DI,140.0,28,15
9,NC State,Women's DI,155.0,34,18


Many of these teams are repeats from the earlier top 10s. The teams ranked below Indiana's Women at #4 (12 of 19 entries in finals) are all large teams that had many swimmers in prelims and finals and took home large points hauls. As discussed, SIUC's and LSU's swimmers made finals in all their events. Notre Dame's Men's Team converted 7 of 8 swims into finals (6 consolation finals + 1 timed 1650 final). The lone event where they missed the final was the 50 yard freestyle, where Chris Guiliano finished 23rd in the prelims. In the 100 Backstroke prelims, Tommy Janton tied for 16th and had to go to a swim-off, which he won to earn the last spot in the consolation final.

In [237]:
dfResults[(dfResults.School=='Notre Dame') & (dfResults.Division=='Men\'s DI') & 
          (dfResults.Category!='Timed Final Relay')]

Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points,Division
53,Event 3 Men 500 Yard Freestyle,Consolation Final,Jack Hoagland,SR,,,,,,,Notre Dame,1900-01-01 00:04:12.770,1900-01-01 00:04:12.490,10,7.0,Men's DI
72,Event 3 Men 500 Yard Freestyle,Preliminaries,Jack Hoagland,SR,,,,,,,Notre Dame,1900-01-01 00:04:14.240,1900-01-01 00:04:12.770,13,0.0,Men's DI
223,Event 5 Men 50 Yard Freestyle,Preliminaries,Chris Guiliano,SO,,,,,,,Notre Dame,1900-01-01 00:00:18.880,1900-01-01 00:00:19.170,23,0.0,Men's DI
290,Event 8 Men 400 Yard IM,Consolation Final,Jack Hoagland,SR,,,,,,,Notre Dame,1900-01-01 00:03:41.670,1900-01-01 00:03:40.820,12,5.0,Men's DI
310,Event 8 Men 400 Yard IM,Preliminaries,Jack Hoagland,SR,,,,,,,Notre Dame,1900-01-01 00:03:41.150,1900-01-01 00:03:41.670,16,0.0,Men's DI
402,Event 10 Men 200 Yard Freestyle,Consolation Final,Chris Guiliano,SO,,,,,,,Notre Dame,1900-01-01 00:01:32.360,1900-01-01 00:01:32.310,9,9.0,Men's DI
419,Event 10 Men 200 Yard Freestyle,Preliminaries,Chris Guiliano,SO,,,,,,,Notre Dame,1900-01-01 00:01:32.430,1900-01-01 00:01:32.360,10,0.0,Men's DI
531,Event 12 Men 100 Yard Backstroke,Consolation Final,Tommy Janton,FR,,,,,,,Notre Dame,1900-01-01 00:00:45.540,1900-01-01 00:00:45.430,14,3.0,Men's DI
549,Event 12 Men 100 Yard Backstroke,Preliminaries,Tommy Janton,FR,,,,,,,Notre Dame,1900-01-01 00:00:45.610,1900-01-01 00:00:45.540,16,0.0,Men's DI
603,Event 15 Men 1650 Yard Freestyle,Timed Final Individual,Jack Hoagland,SR,,,,,,,Notre Dame,1900-01-01 00:14:48.820,1900-01-01 00:14:38.640,5,14.0,Men's DI


We can also consider which teams failed to make finals, despite many prelims entries.

In [300]:
dfTmp = pointsTable.assign(FinalPercentage = 
                   (pointsTable.AFinal + pointsTable.BFinal + pointsTable.TimedFinal)/pointsTable.Prelim,
                  Final = pointsTable.AFinal + pointsTable.BFinal + pointsTable.TimedFinal)[
    ['School','Division','Points','Prelim','Final','FinalPercentage']].sort_values(
    by=['FinalPercentage','Prelim'],ascending=[True,False]).drop('FinalPercentage',axis=1).reset_index(drop=True)
dfTmp[dfTmp.Final==0]

Unnamed: 0,School,Division,Points,Prelim,Final
0,South Carolina,Women's DI,0.0,15,0
1,Arkansas,Women's DI,0.0,14,0
2,Missouri,Women's DI,0.0,7,0
3,San Diego St,Women's DI,0.0,7,0
4,Arizona,Women's DI,0.0,6,0
5,Notre Dame,Women's DI,0.0,5,0
6,Pittsburgh,Women's DI,0.0,5,0
7,Southern California,Men's DI,0.0,4,0
8,Arizona,Men's DI,0.0,3,0
9,Brigham Young,Men's DI,0.0,3,0


Despite qualifying a lot of swimmers for the national championships, South Carolina's Women and Arkansas's Women were unable to convert any of their 15 and 14 prelim swims, respectively, into finals appearances. These 2 programs had twice as many prelim swims as any other team that secured 0 individual finals.

In [301]:
dfResults[(dfResults.School.isin(['South Carolina','Arkansas'])) & (dfResults.Division=='Women\'s DI') & 
          (dfResults.Category!='Timed Final Relay')]

Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points,Division
1011,Event 3 Women 500 Yard Freestyle,Preliminaries,Luciana Thomas,5Y,,,,,,,Arkansas,1900-01-01 00:04:45.750,1900-01-01 00:04:44.230,40,0.0,Women's DI
1078,Event 4 Women 200 Yard IM,Preliminaries,Victoria Kwan,5Y,,,,,,,South Carolina,1900-01-01 00:01:55.780,1900-01-01 00:01:56.410,23,0.0,Women's DI
1108,Event 4 Women 200 Yard IM,Preliminaries,Laura Goettler,JR,,,,,,,South Carolina,1900-01-01 00:01:58.160,1900-01-01 00:01:59.560,53,0.0,Women's DI
1166,Event 5 Women 50 Yard Freestyle,Preliminaries,Andrea Sansores De La Fuent,5Y,,,,,,,Arkansas,1900-01-01 00:00:21.970,1900-01-01 00:00:22.250,28,0.0,Women's DI
1167,Event 5 Women 50 Yard Freestyle,Preliminaries,Janie Smith,SR,,,,,,,South Carolina,1900-01-01 00:00:22.280,1900-01-01 00:00:22.290,30,0.0,Women's DI
1193,Event 5 Women 50 Yard Freestyle,Preliminaries,Kobie Melton,5Y,,,,,,,Arkansas,1900-01-01 00:00:22.580,1900-01-01 00:00:22.740,56,0.0,Women's DI
1194,Event 5 Women 50 Yard Freestyle,Preliminaries,Alessia Ferraguti,5Y,,,,,,,Arkansas,1900-01-01 00:00:22.680,1900-01-01 00:00:22.790,57,0.0,Women's DI
1255,Event 8 Women 400 Yard IM,Preliminaries,Victoria Kwan,5Y,,,,,,,South Carolina,1900-01-01 00:04:07.460,1900-01-01 00:04:10.170,17,0.0,Women's DI
1265,Event 8 Women 400 Yard IM,Preliminaries,Laura Goettler,JR,,,,,,,South Carolina,1900-01-01 00:04:10.270,1900-01-01 00:04:12.750,27,0.0,Women's DI
1322,Event 9 Women 100 Yard Butterfly,Preliminaries,Nicholle Toh,JR,,,,,,,South Carolina,1900-01-01 00:00:51.890,1900-01-01 00:00:52.030,23,0.0,Women's DI


We can repeat, only considering the Championship Final.

In [238]:
pointsTable.assign(AFinalPercentage = pointsTable.AFinal / pointsTable.Prelim)[
    ['School','Division','Points','Prelim','AFinal','AFinalPercentage']].sort_values(
    by='AFinalPercentage',ascending=False).drop('AFinalPercentage',axis=1).head(10)

Unnamed: 0,School,Division,Points,Prelim,AFinal
38,LSU,Men's DI,43.5,3,3
15,California,Men's DI,321.0,43,18
94,Virginia,Women's DI,341.5,48,19
83,Texas,Women's DI,219.5,32,12
32,Hawaii,Women's DI,11.5,3,1
47,Minnesota,Women's DI,13.0,3,1
82,Texas,Men's DI,216.0,39,12
34,Indiana,Men's DI,139.0,23,7
6,Arizona St,Men's DI,270.0,45,13
41,Louisville,Women's DI,140.0,28,8


Among familiar names and impressive results from larger teams, there are 2 teams that got 1 Championship Final entry from just 3 prelim entries: Hawaii Women (Laticia-Leigh Transom, 7th in the 100 free) and Minnesota Women (Megan Van Berkom, 6th in the 400 IM).

In [243]:
dfResults[(dfResults.School.isin(['Minnesota','Hawaii'])) & (dfResults.Division=='Women\'s DI')
         & ~(dfResults.Event.str.contains('Relay'))]

Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points,Division
1091,Event 4 Women 200 Yard IM,Preliminaries,Megan Van Berkom,JR,,,,,,,Minnesota,1900-01-01 00:01:55.670,1900-01-01 00:01:57.470,35,0.0,Women's DI
1161,Event 5 Women 50 Yard Freestyle,Preliminaries,Laticia-Leigh Transom,5Y,,,,,,,Hawaii,1900-01-01 00:00:22.260,1900-01-01 00:00:22.180,24,0.0,Women's DI
1228,Event 8 Women 400 Yard IM,Championship Final,Megan Van Berkom,JR,,,,,,,Minnesota,1900-01-01 00:04:06.700,1900-01-01 00:04:05.370,6,13.0,Women's DI
1246,Event 8 Women 400 Yard IM,Preliminaries,Megan Van Berkom,JR,,,,,,,Minnesota,1900-01-01 00:04:04.860,1900-01-01 00:04:06.700,8,0.0,Women's DI
1388,Event 10 Women 200 Yard Freestyle,Preliminaries,Laticia-Leigh Transom,5Y,,,,,,,Hawaii,1900-01-01 00:01:44.890,1900-01-01 00:01:44.990,22,0.0,Women's DI
1708,Event 17 Women 100 Yard Freestyle,Championship Final,Laticia-Leigh Transom,5Y,,,,,,,Hawaii,1900-01-01 00:00:47.390,1900-01-01 00:00:47.500,7,11.5,Women's DI
1724,Event 17 Women 100 Yard Freestyle,Preliminaries,Laticia-Leigh Transom,5Y,,,,,,,Hawaii,1900-01-01 00:00:47.860,1900-01-01 00:00:47.390,8,0.0,Women's DI
1904,Event 19 Women 200 Yard Butterfly,Preliminaries,Megan Van Berkom,JR,,,,,,,Minnesota,1900-01-01 00:01:55.140,1900-01-01 00:01:56.370,33,0.0,Women's DI


In [244]:
pointsTable

Unnamed: 0,School,Division,Points,Prelim,BFinal,AFinal,TimedFinal,BPoints,APoints,TimedPoints,pointsPerPrelim,pointsPerFinal,BPointsPercentage,APointsPercentage,TimedPointsPercentage
0,Air Force,Men's DI,2.0,3,1,0,0,2.0,0.0,0.0,0.666667,2.000000,1.000000,0.000000,0.000000
1,Akron,Women's DI,5.0,9,2,0,0,5.0,0.0,0.0,0.555556,2.500000,1.000000,0.000000,0.000000
2,Alabama,Men's DI,21.0,13,2,1,1,8.0,11.0,2.0,1.500000,5.250000,0.380952,0.523810,0.095238
3,Alabama,Women's DI,73.0,20,4,2,1,22.0,31.0,20.0,3.476190,10.428571,0.301370,0.424658,0.273973
4,Arizona,Men's DI,0.0,3,0,0,0,0.0,0.0,0.0,0.000000,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,West Virginia,Men's DI,0.0,3,0,0,0,0.0,0.0,0.0,0.000000,,,,
99,William & Mary,Women's DI,0.0,3,0,0,0,0.0,0.0,0.0,0.000000,,,,
100,Wisconsin,Men's DI,19.0,11,1,1,0,4.0,15.0,0.0,1.727273,9.500000,0.210526,0.789474,0.000000
101,Wisconsin,Women's DI,82.0,21,1,4,2,3.0,63.0,16.0,3.565217,11.714286,0.036585,0.768293,0.195122


In [285]:
# Finals swims in individual events; .copy() so the Place column can be recast below
ATable = dfResults[~(dfResults.Event.str.contains("Relay")) &
                   (dfResults.Category.isin(['Championship Final','Consolation Final','Timed Final Individual']))][
    ['School','Name1','Points','Division','Category','Place']].copy()
ATable

Unnamed: 0,School,Name1,Points,Division,Category,Place
44,Texas,Luke Hobson,20.0,Men's DI,Championship Final,1
45,Texas,David Johnston,17.0,Men's DI,Championship Final,2
46,Georgia,Jake Magahey,16.0,Men's DI,Championship Final,3
47,Wisconsin,Jake Newmark,15.0,Men's DI,Championship Final,4
48,Florida,Jake Mitchell,14.0,Men's DI,Championship Final,5
...,...,...,...,...,...,...
1867,UNC,Ellie VanNote,5.0,Women's DI,Consolation Final,12
1868,Florida St,Edith Jernstedt,4.0,Women's DI,Consolation Final,13
1869,NC State,Abby Arens,3.0,Women's DI,Consolation Final,14
1870,Northwestern,Miriam Guevara,2.0,Women's DI,Consolation Final,15


In [289]:
ATable.Place = ATable.Place.astype('int')
ATable.groupby(['School','Division']).agg({
    'Name1': 'count', 'Points':'sum', 'Place':'mean'
}).sort_values(by=['Place','Name1'], ascending=[True,False]).reset_index().head(10)

Unnamed: 0,School,Division,Name1,Points,Place
0,LSU,Women's DI,3,53.0,2.0
1,LSU,Men's DI,3,43.5,4.333333
2,Utah,Men's DI,1,14.0,5.0
3,Indiana,Men's DI,11,139.0,6.0
4,Minnesota,Women's DI,1,13.0,6.0
5,California,Men's DI,26,321.0,6.192308
6,Georgia,Men's DI,6,70.0,6.666667
7,Hawaii,Women's DI,1,11.5,7.0
8,Virginia Tech,Men's DI,5,59.0,7.2
9,Virginia,Women's DI,28,341.5,7.285714


The above table gives the teams with the lowest (best) average place among entries in final heats. The LSU Women and Men both had 3 finals swims, with an average finish of 2nd place for the Women and 4.33 for the Men. Utah's Men had 1 finals swim and placed 5th. In this ranking, which rewards the best averages, the top 10 is mostly filled with teams that had only a few finals swims. Among the top 10, the teams that had both a large number of finals swims and a high average finish were Indiana Men (11 swims, 6.0 average), Cal Men (26 swims, 6.2 average), and Virginia Women (28 swims, 7.3 average). The teams that scored the most swimming points (Cal Men and Virginia Women) got lots of swimmers into finals, and those swimmers finished high on average.
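One way to keep single-swim teams from dominating an average-place ranking is to require a minimum number of finals swims before ranking. The sketch below shows the idea on made-up results; the `MIN_SWIMS` cutoff, the `finals` frame, and the summary column names are all hypothetical and not part of the analysis above.

```python
import pandas as pd

# Toy finals results (made up): school A has one swim, school B has four.
finals = pd.DataFrame({
    'School': ['A', 'B', 'B', 'B', 'B'],
    'Place': [2, 1, 4, 8, 11],
    'Points': [17.0, 20.0, 15.0, 11.0, 6.0],
})

MIN_SWIMS = 3  # hypothetical cutoff, chosen only for illustration

# Per-team count of swims, average place, and total points.
summary = finals.groupby('School').agg(
    swims=('Place', 'count'),
    avgPlace=('Place', 'mean'),
    totalPoints=('Points', 'sum'),
)

# Keep only teams with enough finals swims, then rank by average place.
ranked = summary[summary.swims >= MIN_SWIMS].sort_values('avgPlace')
print(ranked)
```

The choice of cutoff is a judgment call: a higher value favors deep rosters, while a lower value lets small teams back into the ranking.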

In [290]:
ATable.groupby(['School','Division']).agg({
    'Name1': 'count', 'Points':'mean', 'Place':'mean'
}).sort_values(by=['Points','Name1'], ascending=[False,False]).reset_index().head(10)

Unnamed: 0,School,Division,Name1,Points,Place
0,LSU,Women's DI,3,17.666667,2.0
1,LSU,Men's DI,3,14.5,4.333333
2,Utah,Men's DI,1,14.0,5.0
3,Minnesota,Women's DI,1,13.0,6.0
4,Indiana,Men's DI,11,12.636364,6.0
5,California,Men's DI,26,12.346154,6.192308
6,Virginia,Women's DI,28,12.196429,7.285714
7,Texas,Women's DI,18,12.194444,7.666667
8,Virginia Tech,Men's DI,5,11.8,7.2
9,Wisconsin,Women's DI,7,11.714286,8.571429


This table shows how the average finish converts into points. The top 10 is mostly the same whether ranked by average points or average finishing place; however, because points are not linear in place (the gap in points between adjacent places varies across places 1-16), the two rankings are not identical. It matters exactly how a team ends up with a given average place. For example, Minnesota's Women and Indiana's Men both had an average finish of 6th, but Minnesota had more points per swim (from just 1 swim, compared to 11 for Indiana). The table below shows all 12 of these results. Across its 11 swims Indiana placed 6th on average, but under the points scale the points lost on finishes below 6th outweighed the extra points gained on finishes above 6th, so the average points per swim (12.6) was lower than the 13 points awarded for a single 6th place.
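The effect of the non-linear scale can be checked with a small sketch. It compares one 6th-place swim with a hypothetical pair of swims (3rd and 9th) that also averages 6th. The `POINTS` list follows the individual-event values visible in the Points column above; the values for 8th and 16th do not appear in the rows shown and are assumed from the standard 16-place scale.

```python
# Individual-event points for places 1 through 16.
# Values for 8th (11) and 16th (1) are assumed, not taken from the rows shown above.
POINTS = [20, 17, 16, 15, 14, 13, 12, 11, 9, 7, 6, 5, 4, 3, 2, 1]

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def summarize(places):
    """Return (average place, average points) for a list of finishing places."""
    points = [POINTS[p - 1] for p in places]
    return mean(places), mean(points)

single = summarize([6])      # one 6th-place swim
spread = summarize([3, 9])   # a 3rd and a 9th, which also average 6th

print(single)
print(spread)
```

Both cases have the same average place, but the pair averages fewer points because the drop from 6th to 9th costs more than the rise from 6th to 3rd gains.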

In [292]:
dfResults[~(dfResults.Event.str.contains("Relay")) & 
                   (dfResults.Category.isin(['Championship Final','Consolation Final','Timed Final Individual']))
         & (((dfResults.School=='Minnesota') & (dfResults.Division=='Women\'s DI')) |
            ((dfResults.School=='Indiana') & (dfResults.Division=='Men\'s DI')))]

Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points,Division
195,Event 5 Men 50 Yard Freestyle,Consolation Final,Van Mathias,5Y,,,,,,,Indiana,1900-01-01 00:00:18.890,1900-01-01 00:00:18.910,11,6.0,Men's DI
334,Event 9 Men 100 Yard Butterfly,Championship Final,Tomer Frankel,JR,,,,,,,Indiana,1900-01-01 00:00:44.260,1900-01-01 00:00:44.040,3,16.0,Men's DI
340,Event 9 Men 100 Yard Butterfly,Consolation Final,Brendan Burns,SR,,,,,,,Indiana,1900-01-01 00:00:44.790,1900-01-01 00:00:44.600,9,9.0,Men's DI
406,Event 10 Men 200 Yard Freestyle,Consolation Final,Rafael Miroslaw,SO,,,,,,,Indiana,1900-01-01 00:01:32.280,1900-01-01 00:01:32.650,13,4.0,Men's DI
457,Event 11 Men 100 Yard Breaststroke,Championship Final,Van Mathias,5Y,,,,,,,Indiana,1900-01-01 00:00:50.570,1900-01-01 00:00:50.600,2,17.0,Men's DI
464,Event 11 Men 100 Yard Breaststroke,Consolation Final,Josh Matheny,SO,,,,,,,Indiana,1900-01-01 00:00:51.170,1900-01-01 00:00:50.990,9,9.0,Men's DI
518,Event 12 Men 100 Yard Backstroke,Championship Final,Brendan Burns,SR,,,,,,,Indiana,1900-01-01 00:00:44.280,1900-01-01 00:00:43.610,1,20.0,Men's DI
694,Event 17 Men 100 Yard Freestyle,Championship Final,Van Mathias,5Y,,,,,,,Indiana,1900-01-01 00:00:41.330,1900-01-01 00:00:41.390,7,12.0,Men's DI
769,Event 18 Men 200 Yard Breaststroke,Championship Final,Josh Matheny,SO,,,,,,,Indiana,1900-01-01 00:01:51.240,1900-01-01 00:01:50.120,4,15.0,Men's DI
829,Event 19 Men 200 Yard Butterfly,Championship Final,Brendan Burns,SR,,,,,,,Indiana,1900-01-01 00:01:40.510,1900-01-01 00:01:38.970,2,17.0,Men's DI


Alternatively, maybe success should be defined by performance at this meet relative to other meets this season. Which swimmers and teams dropped the most time at this meet?