# Markel/USEF Young and Developing Dressage Horse Championship Data Analysis

## Introduction
The aim of this project is to analyze data from the USEF Young and Developing Horse programs over the years. The program can be divisive—some claim that horses that participate in these programs end up washing out, or that many do not make it to the FEI levels. Since these programs have now been in place for over 20 years, there is a significant amount of data available to look at regarding the competitive careers of horses involved in these programs.

## Data Acquisition and Cleaning
The current format of the competition for all divisions (4/5/6/7 Year Old, Developing Prix St. Georges, and Developing Grand Prix) involves two rounds, with round one counting for 40% of the overall score, and round two counting for 60% of the overall score. Overall scores determine the final placings. 

In its early years, the program invited more horses to participate and held both Final and Consolation Final rounds. To simplify this analysis, I did not include any Consolation Final results.

For more recent years of the competition (2021-2023), the competition software system Equestrian Hub (www.equestrian-hub.com) calculated these placings automatically, which made the data easy to obtain. Prior years required varying levels of detective work. I found official overall placing records for some years on the United States Equestrian Federation (USEF) website, but not for all. For the years that had only the scores from each class available, I wrote a small Python program that takes the scores from each round and calculates the overall score using the 40/60 formula. While I have endeavored to do this without error, there are no official overall rankings to compare against for those years, so the exact overall placings of some horses may be wrong. If you come across an error, please contact me at byrne.sio@gmail.com and I will make the correction.
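
The 40/60 weighting described above can be sketched roughly as follows. This is a minimal illustration, not the original program: the horse names, scores, and the `Round1`/`Round2` column names are hypothetical, and the tie handling (`method="min"`) is an assumption rather than the official rule.

```python
import pandas as pd

# Hypothetical round scores for three horses in one division (not real results).
rounds = pd.DataFrame({
    "Horse": ["Horse A", "Horse B", "Horse C"],
    "Round1": [7.8, 8.0, 7.2],
    "Round2": [8.2, 7.6, 7.4],
})

# Overall score: round one counts for 40%, round two for 60%.
rounds["OverallScore"] = (0.4 * rounds["Round1"] + 0.6 * rounds["Round2"]).round(3)

# Rank the highest score first; method="min" gives tied horses the same placing
# (an assumption, since the official tie-break rule is not stated here).
rounds["OverallPlacing"] = (
    rounds["OverallScore"].rank(ascending=False, method="min").astype(int)
)

print(rounds.sort_values("OverallPlacing"))
```

Because round two carries more weight, Horse A finishes ahead of Horse B in this example even though Horse B scored higher in round one.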

Next, I looked up each horse in the United States Dressage Federation (USDF) database to determine a) the highest level at which the horse competed, and b) whether the horse ever competed at the CDI (international) level. For a), I did not consider scores received; I was only concerned with whether the horse had a test on its record at that level. Therefore, this analysis does not differentiate between high- and low-scoring horses at any level. For b), I only considered horses to be CDI competitors if they had competed at a CDI at a level OTHER THAN FEI Young Horse tests.

Ironically, the data on bloodlines and breeders was the hardest to acquire. Because so many people leave this information off entry forms, much of the data was incomplete. USEF does have a horse search function on its website, but it lists only sire and dam (no damsire), and the breeder and country of birth are frequently left out. Finding this information required a lot of detective work in some cases, frequently using Horse Telex (www.horsetelex.com), a pedigree database. I also used old Eurodressage and DressageDaily articles about various years of the championships.

## Analysis Overview

For the analysis of competitive careers of participants, I am focusing on the years 2002-2019. Horses that competed in the USEF 4 Year Old division in 2019 would be 8 in 2023, and 8 is the youngest age at which a horse can compete at the highest of the FEI levels, Grand Prix. While it is rare to see a horse competing at that level at that age, it is legal, so I went with the lowest legal age versus the most common (9-10 years old).

The analysis of bloodlines and other breeding data will look at all years of the program (2002-2023). 

### 2002 - 2019, 4/5/6 Year Old Division Competitive Outcomes

#### FEI vs Non-FEI Horses
Do horses that compete in the 4, 5, and 6 Year Old divisions make it to FEI? 

To answer this question, I looked only at horses that competed in the 4/5/6 Year Old divisions during the years 2002-2019, 520 horses total.

A clear majority (363 horses, 69.81%) made it to FEI. Another 21.54% competed no higher than the USEF levels (Training-Fourth). Only 45 horses (8.65%) never competed at any level other than a Young Horse division.
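
A three-way breakdown like this one could be computed along the following lines. This is only a sketch on a made-up sample with one row per horse: the FEI level list is the one used in the notebook cells, while the young horse labels (other than "FEI 5 Year Old", which appears in the data) and the `career_category` helper are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample, one row per horse (not real data).
sample = pd.DataFrame({
    "USDFNumber": [1, 2, 3, 4, 5],
    "HighestLevel": ["Grand Prix", "Third Level", "FEI 5 Year Old",
                     "Prix St. Georges", "USEF 4 Year Old"],
})

fei_levels = ["Grand Prix", "I-2", "I-A", "I-B", "I-1", "Prix St. Georges", "FEI Junior"]
# Assumed spellings of the young horse labels; adjust to match the real data.
young_horse_levels = ["USEF 4 Year Old", "FEI 5 Year Old", "FEI 6 Year Old"]


def career_category(level: str) -> str:
    """Bucket a highest level into FEI, young-horse-only, or USEF levels."""
    if level in fei_levels:
        return "FEI"
    if level in young_horse_levels:
        return "Young Horse only"
    return "USEF levels"


sample["Category"] = sample["HighestLevel"].map(career_category)

# Share of horses in each category, as a percentage.
shares = (sample["Category"].value_counts(normalize=True) * 100).round(2)
print(shares)
```

On the real data, the rows would first need to be reduced to one per horse (for example by `USDFNumber`), since a horse can appear in several championship years.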

#### CDI vs Non-CDI Horses
Unsurprisingly, most horses (67.50%) never competed in a CDI (international competition). Most horses, even if they make it to FEI, would not necessarily be competitive on the CDI level. CDI competitions are also much more expensive and complicated to enter and compete in (higher entry fees, horses must have an FEI passport). Finally, there are simply not that many CDI competitions in the USA, and the ones there are tend to be concentrated in certain areas, requiring long travel times for people in many parts of the country. 

#### Top Ten vs Bottom Ten Placing Horses
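Are horses placing in the top ten more likely to make it to FEI than horses placing 11th-20th (the "bottom ten")? The data suggests yes. Of the horses that made it to the FEI levels, 77.13% placed in the top ten of their division at the championships, while 30.58% placed 11th-20th. The two figures add up to more than 100% because a horse that competed in more than one year can fall into both groups.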

This finding may be influenced by the years this project is analyzing. In the early years of this program, there were frequently 10 or fewer horses in some divisions. 
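
A more direct way to frame the top ten question is to compare, within each placing group, the share of horses that went on to reach FEI. The sketch below does this on a small made-up sample. The data, the `ReachedFEI` and `PlacingGroup` column names, and the simplification of one appearance per horse are all assumptions for illustration, not the notebook's method.

```python
import pandas as pd

# Hypothetical results: one championship appearance per horse (not real data).
results = pd.DataFrame({
    "USDFNumber": [1, 2, 3, 4, 5, 6],
    "OverallPlacing": [1, 4, 9, 12, 15, 18],
    "HighestLevel": ["Grand Prix", "Prix St. Georges", "Third Level",
                     "I-1", "Second Level", "First Level"],
})

fei_levels = ["Grand Prix", "I-2", "I-A", "I-B", "I-1", "Prix St. Georges", "FEI Junior"]

results["ReachedFEI"] = results["HighestLevel"].isin(fei_levels)
results["PlacingGroup"] = results["OverallPlacing"].apply(
    lambda placing: "Top ten" if placing <= 10 else "Outside top ten"
)

# Share of horses in each placing group that later reached the FEI levels.
fei_rate = (results.groupby("PlacingGroup")["ReachedFEI"].mean() * 100).round(2)
print(fei_rate)
```

With real data, a horse that appears in several years would need a rule for which placing counts (best, first, or most recent) before the groups are compared.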

### 2002 - 2023, All Divisions (4/5/6/7 Year Old, Developing Prix St. Georges, Developing Grand Prix)

#### Highest Level Shown, USA Bred vs All Other Countries
At the lower levels, there is a fairly balanced split between horses bred in the USA and those bred in other countries. However, at the FEI levels, most horses are foreign-bred, with the most noticeable difference being at the Grand Prix level (80 American-bred, 168 foreign-bred). There are likely many reasons for this:

* Many people shop for horses in Europe because it is easier to see many horses in one location vs the USA
* Depending on the exchange rate, it may be advantageous cost-wise to buy from other countries, as breeders in the USA face much higher costs to produce and raise foals
* European countries produce far more warmblood foals than the USA
* Bias against American breeders—some may think American-bred foals aren't as good as those produced in Europe
* Hard to get top American-bred foals into the hands of riders that can develop them to Grand Prix
* USA lacks a well-developed pipeline from foal to young horse to FEI
* More young horse specialists in Europe makes it easier for buyers who may not have access to a good young horse specialist in their part of the USA
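
The USA-bred versus foreign-bred split by highest level described in this section could be tabulated with a cross-tabulation. The following is a minimal sketch on made-up rows. The `Origin` column is a hypothetical helper, and dropping horses with no recorded country is an assumption about how missing values should be handled.

```python
import pandas as pd

# Hypothetical horse records (not real data); one country is missing.
horses = pd.DataFrame({
    "HighestLevel": ["Grand Prix", "Grand Prix", "Grand Prix", "Third Level", "Third Level"],
    "Country": ["USA", "Germany", "Netherlands", "USA", None],
})

# Rows with no recorded country are dropped rather than counted as foreign-bred.
known = horses.dropna(subset=["Country"]).copy()
known["Origin"] = known["Country"].eq("USA").map({True: "USA-bred", False: "Foreign-bred"})

# Count horses by highest level (rows) and origin (columns).
level_by_origin = pd.crosstab(known["HighestLevel"], known["Origin"])
print(level_by_origin)
```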


#### Top Ten Sires Represented
The stallion with the most offspring competing (2002 to 2023, all divisions) was Sandro Hit (24). The rest of the top ten were Sir Donnerhall I (18), Furstenball (17), Jazz (13), Rotspon (12), Fidertanz (12), Florestan I (12), Hotline (10), Grand Galaxy Win (9), Florencio I (9), and Sir Sinclair (9). The three-way tie for ninth place in the top ten means there are actually eleven horses in this category.
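
Ties at the cutoff, like the three-way tie above, can be kept automatically instead of hand-picking the number of rows. The sketch below uses `Series.nlargest` with `keep="all"` on placeholder sire names. It illustrates the idea only and is not the cell used in this analysis.

```python
import pandas as pd

# Placeholder sire names (not real data): A appears 3 times, B and C twice, D once.
sires = pd.Series(["A", "A", "A", "B", "B", "C", "C", "D"], name="Sire")

counts = sires.value_counts()

# Ask for the top 2, but keep every sire tied with the last qualifying count.
top = counts.nlargest(2, keep="all")
print(top)
```

Here the "top 2" contains three sires, because B and C are tied for second place.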

#### Top Ten Damsires Represented
The most-represented damsire (2002 to 2023, all divisions) was Rubinstein (25). The rest of the top ten were De Niro (18), Sandro Hit (17), Jazz (16), Rotspon (16), Weltmeyer (14), Ferro (13), Krack C (11), Sir Donnerhall I (11), and Rohdiamant (11).

Frustratingly, this column had the second most null values of all the columns (18 missing values / 3.5% of the total)—I hope to resolve this in future iterations of this project.

#### Top Ten Sires of Grand Prix Horses
The top sire of Grand Prix horses (2002-2023, all divisions) was also Sandro Hit (12). The rest of the top ten were Jazz (8), Sir Donnerhall I (7), Florestan I (5), Florencio I (5), Fidertanz (4), Quaterback (4), Rotspon (4), Belissimo M (4), and Furstenball (3).

#### Top Ten Damsires of Grand Prix Horses
The top damsire of Grand Prix horses was De Niro (7). The rest of the top ten were Rubinstein (6), Rotspon (6), Brentano II (5), Ferro (5), Weltmeyer (4), Rohdiamant (4), Jazz (4), Flemmingh (3), and Wolkenstein II (3).

Once again, the prevalence of null values (just over 7% of the data) in the damsire column affects the completeness of this data. 

#### Top Ten Breeders Represented 
The most prominent breeder over all years and divisions is DG Bar Ranch (USA), with 16 horses. The rest of the top ten were Maryanna Haymon (USA, 12 horses), Nancy Holowesko (USA, 9 horses), Leatherdale Farms (USA, 7 horses), Oak Hill Ranch (USA, 6 horses), Judy Yancey (USA, 6 horses), Horses Unlimited (USA, 6 horses), Gestut Lewitz (Germany, 6 horses), Maurine Swanson (USA, 5 horses), and Jackie Ahl-Eckhaus (USA, 5 horses).

This column had the most null values overall—47 missing values, which equates to 9% of the total. 

#### Most Championship Appearances
The horse with the most appearances at the Young and Developing Horse Championships to date is WakeUp, ridden by Emily Miles. WakeUp competed in the Four and Six Year Old Championships as a young horse, and also competed two years each in the Developing Prix St. Georges and Developing Grand Prix divisions. 

## Acknowledgements
The following sites were utilized to gather this data:

* USDF (www.usdf.org)
* USEF (www.usef.org)
* Equestrian Hub (www.equestrian-hub.com)
* Fox Village (www.foxvillage.com)
* Horse Telex (www.horsetelex.com)
* Eurodressage (www.eurodressage.com)
* DressageDaily (www.dressagedaily.com)



In [300]:
import pandas as pd
import numpy as np

In [301]:
# load CSV of championship results into a dataframe and display first 10 records
championship_df = pd.read_csv("resources/yh-championship-data.csv")
championship_df.head(10)

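|   | Year | Division | Horse | OverallPlacing | HighestLevel | CDI | USDFNumber | OverallScore |
|---|------|----------|-------|----------------|--------------|-----|------------|--------------|
| 0 | 2002 | FEI5 | Rosabella | 1 | Third Level | No | 37232 | 7.84 |
| 1 | 2002 | FEI5 | Favereux | 2 | Grand Prix | No | 38714 | 7.68 |
| 2 | 2002 | FEI5 | Devon | 3 | Third Level | No | 38984 | 7.42 |
| 3 | 2002 | FEI5 | Welfenstein | 4 | Grand Prix | Yes | 40474 | 7.14 |
| 4 | 2002 | FEI5 | R-tistik | 5 | Grand Prix | Yes | 37123 | 7.1 |
| 5 | 2002 | FEI5 | Pampero | 6 | FEI 5 Year Old | No | 41386 | 6.82 |
| 6 | 2002 | FEI6 | Oleander | 1 | Grand Prix | Yes | 35062 | 8.24 |
| 7 | 2002 | FEI6 | Freestyle | 2 | Prix St. Georges | No | 39380 | 7.4 |
| 8 | 2002 | FEI6 | Wincenzo | 3 | Prix St. Georges | No | 1026740 | 7.0 |
| 9 | 2002 | FEI6 | Olympus | 4 | Grand Prix | Yes | 42683 | 6.86 |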


In [360]:
# load CSV of horse data into a dataframe and display first 10 results
horse_df = pd.read_csv("resources/yh-horse-data.csv")
horse_df.head(10)


|   | Horse | HighestLevel | CDI | USDFNumber | Sire | Damsire | Country | Breeder | Studbook |
|---|-------|--------------|-----|------------|------|---------|---------|---------|----------|
| 0 | Rosabella | Third Level | No | 37232 | Rohdiamant | Watzmann | Germany | Kerstin Ohlemeyer | Hanoverian |
| 1 | Favereux | Grand Prix | No | 38714 | Fidermark | Fidelio | Germany | Johannes Hilgers | Rhinelander |
| 2 | Devon | Third Level | No | 38984 | Don Gregory |  |  |  | Oldenburg |
| 3 | Welfenstein | Grand Prix | Yes | 40474 | Wolkenstein II | Lauries Crusador xx | Germany | Heinz Bruns | Hanoverian |
| 4 | R-tistik | Grand Prix | Yes | 37123 | Ramires | Rex Fritz | Germany | Josef Kathmann | Oldenburg |
| 5 | Pampero | FEI 5 Year Old | No | 41386 | Ferro |  | USA | Margaret Avery | KWPN |
| 6 | Oleander | Grand Prix | Yes | 35062 | Jazz | Ulft | Netherlands | R. Van Wourdenbergh | KWPN |
| 7 | Freestyle | Prix St. Georges | No | 39380 | Florestan I | Parademarsch I | Germany |  | Westfalen |
| 8 | Wincenzo | Prix St. Georges | No | 1026740 | Werther | Graphit | Germany |  | Hanoverian |
| 9 | Olympus | Grand Prix | Yes | 42683 | Clavecimbel |  | Netherlands | G. Van Der Veen | KWPN |


In [361]:
# get count of null values by column
horse_df.isnull().sum(axis = 0)

Horse            0
HighestLevel     0
CDI              0
USDFNumber       0
Sire             1
Damsire         18
Country          8
Breeder         47
Studbook         1
dtype: int64
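
Null counts like these are easier to compare as a share of all rows. The sketch below shows one way to get per-column percentages. The sample frame uses placeholder values rather than the real horse data, and `null_pct` is a name introduced here for illustration.

```python
import pandas as pd

# Placeholder records (not real data) with a few missing values.
sample = pd.DataFrame({
    "Sire": ["Sire A", "Sire B", "Sire C", "Sire D"],
    "Damsire": ["Damsire A", None, None, "Damsire B"],
    "Breeder": [None, "Breeder A", "Breeder B", "Breeder C"],
})

# isna() marks missing cells; the column mean of those flags is the missing share.
null_pct = (sample.isna().mean() * 100).round(2)
print(null_pct)
```

Dividing by the length of the same dataframe keeps the percentage consistent with the counts it is reported next to.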

In [329]:
# get the total number of horses that competed in the 4/5/6 year old divisions from 2002-2019

total_horses = championship_df.loc[(championship_df["Year"] <= 2019) &
                             (championship_df["Division"].str.contains("USEF4|FEI5|FEI6"))]
total_horses = total_horses["USDFNumber"].nunique()

print(f"The total number of horses competing from 2002-2019 is {total_horses}.")

The total number of horses competing from 2002-2019 is 520.


In [330]:
# get the number of horses that competed in the 4/5/6 year old divisions from 2002-2019
# and have competed in at least one CDI (any level other than Young Horse divisions)

cdi_horses = championship_df.loc[(championship_df["CDI"] == "Yes") & (championship_df["Year"] <= 2019) &
                             (championship_df["Division"].str.contains("USEF4|FEI5|FEI6"))]
cdi_horses = cdi_horses["USDFNumber"].nunique()


print(f"The number of CDI competitors is {cdi_horses}.")

The number of CDI competitors is 169.


In [375]:
# get the overall percentage of horses that competed in at least one CDI 
cdi_percentage = (cdi_horses / total_horses) * 100
print(f"The percentage of CDI competitors is {cdi_percentage}%")

The percentage of CDI competitors is 32.5%


In [356]:
# get the number of horses that competed in the 4/5/6 year old divisions from 2002-2019
# and went on to compete at Grand Prix

grandprix_horses = championship_df.loc[(championship_df["HighestLevel"] == "Grand Prix") & (championship_df["Year"] <= 2019) &
                             (championship_df["Division"].str.contains("USEF4|FEI5|FEI6"))]
grandprix_horses = grandprix_horses["USDFNumber"].nunique()


print(f"The number of Grand Prix horses is {grandprix_horses}.")

The number of Grand Prix horses is 138.


In [378]:
# get the percentage of all horses that competed at Grand Prix, from 2002-2019,
# that competed in the 4/5/6 year old divisions during those years

grand_prix_percentage = (grandprix_horses / total_horses) * 100
grand_prix_percentage = round(grand_prix_percentage, 2)
print(f"The percentage of Grand Prix horses is {grand_prix_percentage}%.")

The percentage of Grand Prix horses is 26.54%.


In [358]:
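# get the number of horses from the 2002-2019 4/5/6 year old divisions that have competed at the FEI levels listed below
# note: despite its name, fei_df ends up holding a count of unique horses, not a dataframe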
levels = ['Grand Prix', 'I-2', 'I-A', 'I-B', 'I-1', 'Prix St. Georges','FEI Junior']

fei_df = championship_df[(championship_df['HighestLevel'].isin(levels)) & (championship_df["Year"] <= 2019) &
                             (championship_df["Division"].str.contains("USEF4|FEI5|FEI6"))]
fei_df = fei_df["USDFNumber"].nunique()

fei_df

363

In [379]:
# get the percentage of all horses that reached the FEI levels, from 2002-2019,
# that competed in the 4/5/6 year old divisions during those years

fei_percentage = (fei_df / total_horses) * 100
fei_percentage = round(fei_percentage, 2)
print(f"The percentage of FEI horses is {fei_percentage}%.")

The percentage of FEI horses is 69.81%.


In [359]:
# create summary table of overall horse level statistics

horse_summary = pd.DataFrame({"Total Horses": [total_horses], "Total FEI Horses": fei_df,
                            "Percentage of FEI Horses": fei_percentage, 
                            "Total CDI Horses": cdi_horses,
                            "Percentage of CDI Horses": cdi_percentage,
                            "Total Grand Prix Horses": grandprix_horses,
                            "Percentage of Grand Prix Horses": grand_prix_percentage})


horse_summary

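|   | Total Horses | Total FEI Horses | Percentage of FEI Horses | Total CDI Horses | Percentage of CDI Horses | Total Grand Prix Horses | Percentage of Grand Prix Horses |
|---|--------------|------------------|--------------------------|------------------|--------------------------|-------------------------|---------------------------------|
| 0 | 520 | 363 | 69.807692 | 169 | 32.5 | 138 | 26.538462 |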
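
The cells above repeat the same year and division filter several times. One possible way to factor it out is sketched below. The helper names `young_horse_mask` and `count_unique_horses` are hypothetical, the sample rows are made up, and `isin` is used on the assumption that division codes always match `USEF4`, `FEI5`, or `FEI6` exactly.

```python
import pandas as pd

YOUNG_HORSE_DIVISIONS = ["USEF4", "FEI5", "FEI6"]


def young_horse_mask(df: pd.DataFrame, last_year: int = 2019) -> pd.Series:
    """Boolean mask for 4/5/6 Year Old division rows up to and including last_year."""
    return (df["Year"] <= last_year) & df["Division"].isin(YOUNG_HORSE_DIVISIONS)


def count_unique_horses(df: pd.DataFrame, extra_mask=None) -> int:
    """Count distinct USDF numbers among young horse rows, optionally narrowed further."""
    mask = young_horse_mask(df)
    if extra_mask is not None:
        mask = mask & extra_mask
    return int(df.loc[mask, "USDFNumber"].nunique())


# Hypothetical appearances (not real data); horse 1 appears in two years.
sample = pd.DataFrame({
    "Year": [2002, 2003, 2019, 2021, 2010],
    "Division": ["FEI5", "FEI6", "USEF4", "FEI5", "DHPSG"],
    "USDFNumber": [1, 1, 2, 3, 4],
    "CDI": ["Yes", "Yes", "No", "Yes", "Yes"],
})

total = count_unique_horses(sample)
cdi = count_unique_horses(sample, sample["CDI"] == "Yes")
print(f"{cdi} of {total} horses ({cdi / total * 100:.2f}%) competed in a CDI.")
```

Keeping the filter in one place means a change to the year range or division list only has to be made once.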


In [338]:
# get the median scores by division over all years

median_groups = championship_df.groupby("Division")["OverallScore"]

median_summary = pd.DataFrame({"Median Score": median_groups.median()}) 

median_summary


| Division | Median Score |
|----------|--------------|
| DHGP | 64.888 |
| DHPSG | 67.339 |
| FEI5 | 7.584 |
| FEI6 | 7.504 |
| FEI7 | 71.3295 |
| USEF4 | 7.62 |


In [380]:
# get the number of FEI horses that placed in the top 10 of their division at the championships between 2002-2019

placing = [1,2,3,4,5,6,7,8,9,10]
levels = ['Grand Prix', 'I-2', 'I-A', 'I-B', 'I-1', 'Prix St. Georges','FEI Junior']

top_ten = championship_df[(championship_df['OverallPlacing'].isin(placing)) & (championship_df['HighestLevel'].isin(levels))
                          & (championship_df["Year"] <= 2019) &
                             (championship_df["Division"].str.contains("USEF4|FEI5|FEI6"))]

top_ten = top_ten["USDFNumber"].nunique()

print(f"{top_ten} horses placed in the top 10 of their division.")




280 horses placed in the top 10 of their division.


In [381]:
# get the number of FEI horses that placed in the "bottom 10" (placings 11-20) of their division between 2002-2019

lower_placing = [11,12,13,14,15,16,17,18,19,20]
levels = ['Grand Prix', 'I-2', 'I-A', 'I-B', 'I-1', 'Prix St. Georges','FEI Junior']

bottom_ten = championship_df[(championship_df['OverallPlacing'].isin(lower_placing)) & (championship_df['HighestLevel'].isin(levels))
                          & (championship_df["Year"] <= 2019) &
                             (championship_df["Division"].str.contains("USEF4|FEI5|FEI6"))]

bottom_ten = bottom_ten["USDFNumber"].nunique()

print(f"{bottom_ten} horses placed in the bottom 10 of their division.")

111 horses placed in the bottom 10 of their division.


In [382]:
# get the percentage of FEI horses that placed in the top ten of their division

top_ten_percentage = (top_ten / fei_df) * 100

top_ten_percentage = round(top_ten_percentage, 2)
print(f"The percentage of horses placing in the top ten of their division is {top_ten_percentage}%.")

The percentage of horses placing in the top ten of their division is 77.13%.


In [383]:
# get the percentage of FEI horses that placed in the bottom ten (placings 11-20) of their division

bottom_ten_percentage = (bottom_ten / fei_df) * 100

bottom_ten_percentage = round(bottom_ten_percentage, 2)
print(f"The percentage of horses placing in the bottom ten of their division is {bottom_ten_percentage}%.")

The percentage of horses placing in the bottom ten of their division is 30.58%.


In [384]:
# get the number of horses bred in the USA that competed in the years 2002-2023

usa_count = horse_df[(horse_df["Country"] == "USA")].count()["USDFNumber"]

print(f"{usa_count} horses were bred in the USA.")

307 horses were bred in the USA.


In [345]:
# get the percentage of all horses (2002-2023) that were bred in the USA
all_horses = len(horse_df)
usa_percentage = (usa_count / all_horses) * 100
print(f"The percentage of horses bred in the USA is {usa_percentage}")

The percentage of horses bred in the USA is 36.1474435196195


In [370]:
# get the top ten sires represented over all years (2002-2023)
# showing 11 horses, as there are three horses tied for the 9th place spot

top_sires = horse_df['Sire'].value_counts().head(11)

top_sires


Sandro Hit          24
Sir Donnerhall I    18
Furstenball         17
Jazz                13
Rotspon             12
Fidertanz           12
Florestan I         12
Hotline             10
Sir Sinclair         9
Florencio I          9
Grand Galaxy Win     9
Name: Sire, dtype: int64

In [373]:
# get the top ten sires of horses whose highest level is Grand Prix (2002-2023)
gp_sires = horse_df.loc[(horse_df["HighestLevel"] == "Grand Prix")]
gp_sires = gp_sires['Sire'].value_counts().head(10)
gp_sires

Sandro Hit          12
Jazz                 8
Sir Donnerhall I     7
Florestan I          5
Florencio I          5
Fidertanz            4
Quaterback           4
Rotspon              4
Belissimo M          4
Furstenball          3
Name: Sire, dtype: int64

In [372]:
# get the top ten damsires represented over all years (2002 - 2023)

top_damsires = horse_df['Damsire'].value_counts().head(10)

top_damsires

Rubinstein          25
De Niro             18
Sandro Hit          17
Jazz                16
Rotspon             16
Weltmeyer           14
Ferro               13
Krack C             11
Sir Donnerhall I    11
Rohdiamant          11
Name: Damsire, dtype: int64

In [374]:
# get the top ten damsires of horses whose highest level is Grand Prix (2002-2023)
gp_damsires = horse_df.loc[(horse_df["HighestLevel"] == "Grand Prix")]
gp_damsires = gp_damsires['Damsire'].value_counts().head(10)
gp_damsires

De Niro           7
Rubinstein        6
Rotspon           6
Brentano II       5
Ferro             5
Weltmeyer         4
Rohdiamant        4
Jazz              4
Flemmingh         3
Wolkenstein II    3
Name: Damsire, dtype: int64

In [364]:
# get the top ten most prominent breeders represented over all years (2002 - 2023)

top_breeders = horse_df['Breeder'].value_counts().head(10)

top_breeders

DG Bar Ranch          16
Maryanna Haymon       12
Nancy Holowesko        9
Leatherdale Farms      7
Gestut Lewitz          6
Oak Hill Ranch         6
Judy Yancey            6
Horses Unlimited       6
Jackie Ahl-Eckhaus     5
Maurine Swanson        5
Name: Breeder, dtype: int64

In [365]:
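# get the breeders with the most top 3 placings over all years
# combined_df is not defined in the cells shown here; it is assumed to be the championship
# results joined to the breeder column of the horse data on USDFNumber
combined_df = championship_df.merge(horse_df[["USDFNumber", "Breeder"]], on="USDFNumber", how="left")

placing = [1,2,3]

highest_placed_breeders = combined_df[(combined_df['OverallPlacing'].isin(placing))]
highest_placed_breeders = highest_placed_breeders.drop_duplicates(subset="USDFNumber")
highest_placed_breeders["Breeder"].value_counts().head(10)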




DG Bar Ranch                3
Nancy Holowesko             2
Maryanna Haymon             2
Marefield Meadows           2
J.H. Kamperman              2
Nedergaard Dressage         2
Meg Williams                2
Horses Unlimited            2
Jackie Ahl-Eckhaus          2
Catherine Haddad Staller    2
Name: Breeder, dtype: int64

In [366]:
# get the top ten countries represented over all years (2002 - 2023)

top_countries = horse_df['Country'].value_counts().head(10)

top_countries

USA              307
Germany          299
Netherlands      165
Denmark           26
Canada             7
Belgium            7
Spain              5
Great Britain      5
Norway             4
Sweden             2
Name: Country, dtype: int64

In [367]:
# get the top ten studbooks represented over all years (2002 - 2023)
top_studbooks = horse_df['Studbook'].value_counts().head(10)

top_studbooks

KWPN                  243
Hanoverian            241
Oldenburg             180
Westfalen              48
Danish Warmblood       34
Rhinelander            18
American Warmblood     10
PRE                     7
Holsteiner              7
Swedish Warmblood       6
Name: Studbook, dtype: int64

In [368]:
# get the top horses with the most championship appearances over all years (2002 - 2023)

most_appearances = championship_df['Horse'].value_counts().head(10)

most_appearances

WakeUp                  6
Quantum Jazz            5
Flavius MF              5
Sole Mio                5
Sternlicht Hilltop      5
Floretienne             5
Pikko del Cerro HU      5
Fashion Designer OLD    5
Don Cesar               5
Au Revoir               5
Name: Horse, dtype: int64