## Analysis of Men's and Women's 2023 NCAA DI Swimming Championships

I will analyze the results of the 2023 NCAA championship meet and answer a few questions:
1. Which team(s) did best (and how should that be defined)?
2. Are there any differences in performance based on a swimmer's year?
3. What are the average improvements from qualifying times to preliminaries to finals? Do any factors impact this?

### Loading and Inspecting Data

Most of this step was handled in `buildData_2023NCAAs.ipynb`, where I converted the PDF results to CSV files. Here I load the CSVs, inspect them, and do whatever minor cleaning is required.

In [1]:
import pandas as pd
import numpy as np

In [9]:
dfResultsM = pd.read_csv('NCAA_M2023.csv')
dfResultsW = pd.read_csv('NCAA_W2023.csv')
dfResultsM['Division'] = "Men's DI"
dfResultsW['Division'] = "Women's DI"
dfResults = pd.concat([dfResultsM, dfResultsW])
dfResults.head()

Unnamed: 0.1,Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points,Division
0,0,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Stokowski, Kacper",SR,"Hunter, Mason",5Y,"Korstanje, Nyls",SR,"Curtiss, David",SO,NC State,1:22.25,1:20.67,1,40.0,Men's DI
1,1,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Dolan, Jack",SR,"Marchand, Leon",SO,"McCusker, Max",5Y,"Kulow, Jonny",FR,Arizona St,1:21.69,1:21.07,2,34.0,Men's DI
2,2,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Chaney, Adam",JR,"Savickas, Aleksas",FR,"Friese, Eric",SR,"Liendo, Josh",FR,Florida,1:21.73,1:21.14,3,32.0,Men's DI
3,3,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Seeliger, Bjorn",JR,"Bell, Liam",SR,"Rose, Dare",JR,"Alexy, Jack",SO,California,1:22.84,1:21.24,4,30.0,Men's DI
4,4,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Burns, Brendan",SR,"Mathias, Van",5Y,"Frankel, Tomer",JR,"Wight, Gavin",JR,Indiana,1:23.52,1:21.52,5,28.0,Men's DI
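
One thing to be aware of with the `pd.concat` call above: by default it keeps each frame's original row labels, so the combined frame can contain the same label twice (once from the men's file and once from the women's). The sketch below uses two toy frames (the values are illustrative, not taken from the results) to show the effect and how `ignore_index=True` would produce a fresh 0..n-1 index instead. Whether that matters here depends on whether later steps select rows by label.

```python
import pandas as pd

# Toy stand-ins for the men's and women's frames (illustrative values only)
men = pd.DataFrame({"School": ["Texas", "Florida"], "Division": "Men's DI"})
women = pd.DataFrame({"School": ["Virginia", "Stanford"], "Division": "Women's DI"})

# Default behaviour: row labels 0, 1 appear once per input frame
stacked = pd.concat([men, women])
print(stacked.index.is_unique)      # duplicated labels make this False
print(len(stacked.loc[0]))          # label 0 matches one row from each frame

# Alternative: discard the old labels and renumber the rows
renumbered = pd.concat([men, women], ignore_index=True)
print(renumbered.index.is_unique)
```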


In [10]:
dfResults.describe(include='all')

Unnamed: 0.1,Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points,Division
count,1949.0,1949,1949,1949,1949,229,229,229,229,229,229,1949,1949.0,1949,1949,1949.0,1949
unique,,39,6,547,5,164,5,164,5,168,5,70,1662.0,1702,69,,2
top,,Event 3 Women 500 Yard Freestyle,Preliminaries,"Berkoff, Katharine",SR,"Walsh, Alex",SR,"Jones, Emily",SR,"Arens, Abby",SO,Florida,51.9,DFS,---,,Women's DI
freq,,84,1255,10,495,4,61,4,68,4,54,126,6.0,43,66,,1035
mean,488.628014,,,,,,,,,,,,,,,3.658286,
std,284.614824,,,,,,,,,,,,,,,7.421645,
min,0.0,,,,,,,,,,,,,,,0.0,
25%,243.0,,,,,,,,,,,,,,,0.0,
50%,487.0,,,,,,,,,,,,,,,0.0,
75%,730.0,,,,,,,,,,,,,,,4.0,


Everything looks fine after reading in the two dataframes, except that writing and then re-reading the CSVs left me with a redundant index column, 'Unnamed: 0', which I can drop.

In [11]:
dfResults = dfResults.drop(['Unnamed: 0'], axis=1)
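
As an alternative to dropping the column after the fact, the extra index column can usually be avoided at write or read time. The following is a minimal sketch on a toy frame, using in-memory buffers rather than the real CSV files, and it assumes the extra column comes from `to_csv` writing the row index by default.

```python
import io
import pandas as pd

# Toy frame standing in for the real results (values are illustrative only)
df = pd.DataFrame({"School": ["Texas", "Florida"], "Points": [20.0, 14.0]})

# Writing with the default index=True stores the row labels as an unnamed column
buf_with_index = io.StringIO()
df.to_csv(buf_with_index)
buf_with_index.seek(0)
reread_extra = pd.read_csv(buf_with_index)  # gains an 'Unnamed: 0' column

# Option 1: skip the index when writing
buf_no_index = io.StringIO()
df.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
reread_clean = pd.read_csv(buf_no_index)

# Option 2: tell read_csv that the first column holds the index
buf_with_index.seek(0)
reread_indexed = pd.read_csv(buf_with_index, index_col=0)

print(list(reread_extra.columns))
print(list(reread_clean.columns))
print(list(reread_indexed.columns))
```

Option 1 requires changing the notebook that writes the files; option 2 only changes the `read_csv` call.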

In [14]:
dfResults.head(25)

Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points,Division
0,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Stokowski, Kacper",SR,"Hunter, Mason",5Y,"Korstanje, Nyls",SR,"Curtiss, David",SO,NC State,1:22.25,1:20.67,1,40.0,Men's DI
1,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Dolan, Jack",SR,"Marchand, Leon",SO,"McCusker, Max",5Y,"Kulow, Jonny",FR,Arizona St,1:21.69,1:21.07,2,34.0,Men's DI
2,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Chaney, Adam",JR,"Savickas, Aleksas",FR,"Friese, Eric",SR,"Liendo, Josh",FR,Florida,1:21.73,1:21.14,3,32.0,Men's DI
3,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Seeliger, Bjorn",JR,"Bell, Liam",SR,"Rose, Dare",JR,"Alexy, Jack",SO,California,1:22.84,1:21.24,4,30.0,Men's DI
4,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Burns, Brendan",SR,"Mathias, Van",5Y,"Frankel, Tomer",JR,"Wight, Gavin",JR,Indiana,1:23.52,1:21.52,5,28.0,Men's DI
5,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Kammann, Bjoern",SO,"Houlie, Michael",5Y,"Crooks, Jordan",SO,"Santos, Guilherme",FR,Tennessee,1:21.43,1:21.59,6,26.0,Men's DI
6,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Lowe, Dalton",JR,"Petrashov, Denis",SR,"Elaraby, Abdelrahman",SR,"Eastman, Michael",5Y,Louisville,1:23.59,1:22.43,7,24.0,Men's DI
7,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Brownstead, Matt",JR,"Nichols, Noah",JR,"Edwards, Max",SR,"Lamb, August",SR,Virginia,1:23.03,1:22.51,8,22.0,Men's DI
8,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"Stoffle, Aidan",SR,"Mikuta, Reid",JR,"Stoffle, Nate",SO,"Makinen, Kalle",FR,Auburn,1:22.98,1:22.67,9,18.0,Men's DI
9,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,"MacAlister, Leon",SR,"Polonsky, Ron",SO,"Minakov, Andrei",JR,"Gu, Rafael",FR,Stanford,1:24.00,1:22.69,10,14.0,Men's DI


In [15]:
dfResults[dfResults.Category == 'Championship Final'].head()

Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points,Division
44,Event 3 Men 500 Yard Freestyle,Championship Final,"Hobson, Luke",SO,,,,,,,Texas,4:10.51,4:07.37,1,20.0,Men's DI
45,Event 3 Men 500 Yard Freestyle,Championship Final,"Johnston, David",JR,,,,,,,Texas,4:10.02,4:08.79,2,17.0,Men's DI
46,Event 3 Men 500 Yard Freestyle,Championship Final,"Magahey, Jake",JR,,,,,,,Georgia,4:10.83,4:09.24,3,16.0,Men's DI
47,Event 3 Men 500 Yard Freestyle,Championship Final,"Newmark, Jake",JR,,,,,,,Wisconsin,4:10.80,4:10.12,4,15.0,Men's DI
48,Event 3 Men 500 Yard Freestyle,Championship Final,"Mitchell, Jake",JR,,,,,,,Florida,4:11.65,4:10.54,5,14.0,Men's DI


In [26]:
for col in ['Name1','Name2','Name3','Name4']:
    # Split "Last, First" on the first comma only, then rebuild as "First Last"
    name = dfResults[col].str.split(',', n=1, expand=True)
    dfResults[col] = name[1].str.strip() + ' ' + name[0].str.strip()
dfResults.head()

Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points,Division
0,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,Kacper Stokowski,SR,Mason Hunter,5Y,Nyls Korstanje,SR,David Curtiss,SO,NC State,1:22.25,1:20.67,1,40.0,Men's DI
1,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,Jack Dolan,SR,Leon Marchand,SO,Max McCusker,5Y,Jonny Kulow,FR,Arizona St,1:21.69,1:21.07,2,34.0,Men's DI
2,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,Adam Chaney,JR,Aleksas Savickas,FR,Eric Friese,SR,Josh Liendo,FR,Florida,1:21.73,1:21.14,3,32.0,Men's DI
3,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,Bjorn Seeliger,JR,Liam Bell,SR,Dare Rose,JR,Jack Alexy,SO,California,1:22.84,1:21.24,4,30.0,Men's DI
4,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,Brendan Burns,SR,Van Mathias,5Y,Tomer Frankel,JR,Gavin Wight,JR,Indiana,1:23.52,1:21.52,5,28.0,Men's DI


In [29]:
dfResults.dtypes

Event              object
Category           object
Name1              object
Year1              object
Name2              object
Year2              object
Name3              object
Year3              object
Name4              object
Year4              object
School             object
QualifyingTime     object
Time               object
Place              object
Points            float64
Division           object
dtype: object
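
Since `QualifyingTime` and `Time` are stored as `object` (strings such as `1:20.67` or `19.09`), comparing them numerically will need a conversion to seconds at some point. The helper below is a hypothetical sketch of one way to do that, not code from this notebook: `time_to_seconds` is a name introduced here, and it assumes times are formatted as `M:SS.ss` or `SS.ss`, with anything else (such as `DQ` or `DFS`) mapped to `NaN`.

```python
import numpy as np
import pandas as pd

def time_to_seconds(value):
    """Convert 'M:SS.ss' or 'SS.ss' strings to seconds; return NaN for non-times."""
    if pd.isna(value):
        return np.nan
    parts = str(value).strip().split(":")
    try:
        seconds = 0.0
        for part in parts:
            # Each colon-separated field is worth 60x the one after it
            seconds = seconds * 60 + float(part)
        return seconds
    except ValueError:
        return np.nan  # status codes such as 'DQ' or 'DFS'

# Illustrative inputs in the same formats as the Time column
times = pd.Series(["1:20.67", "19.09", "4:07.37", "DQ", "DFS"])
seconds = times.map(time_to_seconds)
print(seconds)
```

Applied to the real data, this would look something like `dfResults['Time'].map(time_to_seconds)`, stored in a new column so the original strings are preserved.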

In [36]:
# Keep only rows with an actual swim time (DQ = disqualified, DFS = declared false start)
dfResultsDropDQ = dfResults[(dfResults.Time != 'DQ') & (dfResults.Time != 'DFS')]
dfResultsDropDQ.describe(include='all')

Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points,Division
count,1883,1883,1883,1883,213,214,212,214,213,214,1883,1883.0,1883.0,1883.0,1883.0,1883
unique,39,6,544,5,158,5,154,5,159,5,70,1614.0,1700.0,68.0,,2
top,Event 3 Women 500 Yard Freestyle,Preliminaries,Bjorn Seeliger,SR,Alex Walsh,SR,Callie Dickinson,SR,Micayla Cronk,SO,Florida,19.09,19.1,1.0,,Women's DI
freq,84,1204,9,482,4,57,4,66,4,51,121,6.0,5.0,63.0,,1004
mean,,,,,,,,,,,,,,,3.786511,
std,,,,,,,,,,,,,,,7.518421,
min,,,,,,,,,,,,,,,0.0,
25%,,,,,,,,,,,,,,,0.0,
50%,,,,,,,,,,,,,,,0.0,
75%,,,,,,,,,,,,,,,4.0,


In [41]:
# Inspect the rows excluded above
dfResults[(dfResults.Time=='DQ') | (dfResults.Time=='DFS')].head(50)

Unnamed: 0,Event,Category,Name1,Year1,Name2,Year2,Name3,Year3,Name4,Year4,School,QualifyingTime,Time,Place,Points,Division
21,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,Eric Storms,5Y,Ben Patton,SR,Clement Secchi,5Y,Jack Dahlgren,5Y,Missouri,1:23.28,DQ,---,0.0,Men's DI
22,Event 1 Men 200 Yard Medley Relay,Timed Final Relay,Forest Webb,SR,AJ Pouch,SR,Landon Gentry,FR,Will Hayon,FR,Virginia Tech,1:24.21,DQ,---,0.0,Men's DI
106,Event 3 Men 500 Yard Freestyle,Preliminaries,Victor Johansson,5Y,,,,,,,Alabama,4:14.34,DFS,---,0.0,Men's DI
179,Event 4 Men 200 Yard IM,Preliminaries,Luke Barr,SO,,,,,,,Indiana,1:43.14,DQ,---,0.0,Men's DI
180,Event 4 Men 200 Yard IM,Preliminaries,Aleksas Savickas,FR,,,,,,,Florida,1:45.69,DFS,---,0.0,Men's DI
181,Event 4 Men 200 Yard IM,Preliminaries,Leon MacAlister,SR,,,,,,,Stanford,1:45.95,DFS,---,0.0,Men's DI
182,Event 4 Men 200 Yard IM,Preliminaries,Chris O'Grady,SO,,,,,,,Southern California,1:44.93,DFS,---,0.0,Men's DI
183,Event 4 Men 200 Yard IM,Preliminaries,Henry Bethel,SO,,,,,,,Auburn,1:45.15,DFS,---,0.0,Men's DI
184,Event 4 Men 200 Yard IM,Preliminaries,Reid Mikuta,JR,,,,,,,Auburn,1:42.90,DFS,---,0.0,Men's DI
254,Event 5 Men 50 Yard Freestyle,Preliminaries,Will Chan,5Y,,,,,,,Texas,19.68,DFS,---,0.0,Men's DI
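
The two filters above are complements of each other, so they can be built from a single boolean mask. The sketch below shows this with `Series.isin` on a toy frame; `NO_TIME_CODES` is a name introduced here for illustration, and the list would need extending if other status codes appear in the `Time` column.

```python
import pandas as pd

# Toy data; the status codes mirror the ones seen in the Time column
df = pd.DataFrame({
    "Name1": ["A", "B", "C", "D"],
    "Time": ["19.09", "DQ", "4:07.37", "DFS"],
})

NO_TIME_CODES = ["DQ", "DFS"]  # hypothetical constant, not from the notebook

# One mask, used twice: once negated for valid swims, once as-is for the rest
is_no_time = df["Time"].isin(NO_TIME_CODES)
df_valid = df[~is_no_time]
df_excluded = df[is_no_time]

# Count how many rows fall under each status code
counts = df_excluded["Time"].value_counts()
print(counts)
```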
