This notebook will try to answer the question "Does success in younger (Junior/Cadet/Under 21s) sections reliably lead to success in senior categories?" using a K-NN technique

Imports

In [1]:
import pandas as pd
import chardet
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

Data ingestion

In [2]:
rankings = 'data/wkf_rankings.csv'
rankings_df = pd.read_csv(rankings)

Drop the '_id' column and convert the 'date' column to a datetime type

In [3]:
rankings_df.drop('_id', axis=1, inplace=True)
rankings_df['date']= pd.to_datetime(rankings_df['date'])

## Dropping the data after world pandemic travel restrictions

On March 11th 2020 a global pandemic was declared by the World Health Organisation (WHO) in light of the spread of the COVID-19 coronavirus. This subsequently caused many countries to impose travel restrictions and cancel events to prevent the spread.

As a result, data from events after March 11 2020 is removed, because it cannot be guaranteed to be consistent with data from periods without travel restrictions, when athletes had access to their usual preparation and training facilities.

In [4]:
rankings_df = rankings_df[rankings_df['date'] <= '2020-03-11']

Remove non-breaking space characters ('\xa0') from the data

In [5]:
rankings_df.replace(u'\xa0',u'', regex=True, inplace=True)

In [6]:
rankings_df

Unnamed: 0,ranking_country,ranking_competitor,date,event,type,category,event_factor,rank,matches_won,points
0,AFG,AFG2082,2019-07-19,"16TH AKF SENIOR CHAMPIONSHIP 2019, TASHKENT, U...",Continental Championship,Male Kumite -55 Kg,6.0,Participation,0.0,30.0
1,AFG,AFG02157,2019-07-19,"16TH AKF SENIOR CHAMPIONSHIP 2019, TASHKENT, U...",Continental Championship,Male Kumite 84+ kg,6.0,Participation,2.0,150.0
2,AFG,AFG2002,2015-09-05,Karate1 Premier League - Istanbul 2015(TUR),Karate1 Premier League,Male Kumite -60 Kg,4.0,Participation,1.0,48.0
3,AFG,AFG02158,2019-07-19,"16TH AKF SENIOR CHAMPIONSHIP 2019, TASHKENT, U...",Continental Championship,Male Kumite -84 kg,6.0,Participation,0.0,30.0
4,AFG,AFG114,2012-11-21,21st World Seniors Karate Championships(FRA),World Championship,Male Kumite -67 kg,12.0,Participation,1.0,144.0
...,...,...,...,...,...,...,...,...,...,...
119803,ZIM,ZIM02080,2019-07-12,UFAK Junior & Senior Championships - Gaborone ...,Continental Championship,Male Kumite -75 Kg,6.0,Participation,0.0,30.0
119804,ZIM,ZIM02080,2018-08-31,UFAK Junior & Senior Championships 2018(RWA),Continental Championship,Male Kumite -75 Kg,6.0,Participation,0.0,30.0
119805,ZIM,ZIM2009,2019-07-12,UFAK Junior & Senior Championships - Gaborone ...,Continental Championship,Junior Kumite Female -48 kg,6.0,Participation,0.0,30.0
119806,ZIM,ZIM2008,2018-05-29,TRANSITION POINTS_ZIM2008(ZIM),Others,Male Kumite -60 Kg,1.0,Participation,0.0,30.0


## Create a dataframe of winners / medalists at worlds

To begin with, we can create a dataframe denoting all the medalists from the last 2 world championships. From there we can use previous results to figure out if earlier success predicted those medals.

In [7]:
df_EF_12 = rankings_df[rankings_df['event_factor'] == 12]

In [8]:
world_medels_df = df_EF_12.loc[(df_EF_12['date'] == '2018-11-06') | (df_EF_12['date'] == '2016-10-26')]

In [9]:
world_medels_df

Unnamed: 0,ranking_country,ranking_competitor,date,event,type,category,event_factor,rank,matches_won,points
43,ALB,ALB2002,2018-11-06,WKF Senior World Championship 2018(ESP),World Championship,Male Kumite -60 Kg,12.0,Participation,2.0,300.0
58,ALB,ALB2002,2016-10-26,WKF World Senior Championships 2016(AUT),World Championship,Male Kumite -60 Kg,12.0,Participation,2.0,300.0
95,ALB,ALB155,2018-11-06,WKF Senior World Championship 2018(ESP),World Championship,Male Kumite -75 Kg,12.0,Participation,0.0,60.0
151,ALB,ALB2001,2018-11-06,WKF Senior World Championship 2018(ESP),World Championship,Male Kumite -84 kg,12.0,Participation,3.0,420.0
160,ALB,ALB2001,2016-10-26,WKF World Senior Championships 2016(AUT),World Championship,Male Kumite -84 kg,12.0,Participation,1.0,180.0
...,...,...,...,...,...,...,...,...,...,...
119452,WAL,WAL2057,2016-10-26,WKF World Senior Championships 2016(AUT),World Championship,Male Kumite -84 kg,12.0,Participation,0.0,60.0
119473,WAL,WAL146,2018-11-06,WKF Senior World Championship 2018(ESP),World Championship,Male Kumite -75 Kg,12.0,Participation,0.0,60.0
119650,YEM,YEM2075,2016-10-26,WKF World Senior Championships 2016(AUT),World Championship,Male Kumite -60 Kg,12.0,Participation,1.0,180.0
119790,ZIM,ZIM005,2018-11-06,WKF Senior World Championship 2018(ESP),World Championship,Male Kumite -67 kg,12.0,Participation,0.0,60.0


Bin all the kata data.

In [10]:
# Keep only non-kata rows; a boolean filter avoids an in-place drop on a slice
world_medels_df = world_medels_df[~world_medels_df['category'].str.contains("Kata")]
world_medels_df = world_medels_df.reset_index(drop = True)

In [11]:
world_medels_df.head()

Unnamed: 0,ranking_country,ranking_competitor,date,event,type,category,event_factor,rank,matches_won,points
0,ALB,ALB2002,2018-11-06,WKF Senior World Championship 2018(ESP),World Championship,Male Kumite -60 Kg,12.0,Participation,2.0,300.0
1,ALB,ALB2002,2016-10-26,WKF World Senior Championships 2016(AUT),World Championship,Male Kumite -60 Kg,12.0,Participation,2.0,300.0
2,ALB,ALB155,2018-11-06,WKF Senior World Championship 2018(ESP),World Championship,Male Kumite -75 Kg,12.0,Participation,0.0,60.0
3,ALB,ALB2001,2018-11-06,WKF Senior World Championship 2018(ESP),World Championship,Male Kumite -84 kg,12.0,Participation,3.0,420.0
4,ALB,ALB2001,2016-10-26,WKF World Senior Championships 2016(AUT),World Championship,Male Kumite -84 kg,12.0,Participation,1.0,180.0


In [12]:
world_medels_df['rank'].unique()

array(['Participation', '3rd Place', '7th Place', '9th Place',
       '1st Place', '11th Place', '2nd Place', '5th Place', '13th Place'],
      dtype=object)

In [13]:
world_medels_df['category'].unique()

array(['Male Kumite -60 Kg', 'Male Kumite -75 Kg', 'Male Kumite -84 kg',
       'Male Kumite -67 kg', 'Female Kumite 68+ kg',
       'Female Kumite -50 Kg', 'Male Kumite 84+ kg',
       'Female Kumite -55 Kg', 'Female Kumite -61 kg',
       'Female Kumite -68 kg'], dtype=object)

In [14]:
world_medels_df = world_medels_df.drop(['ranking_country','date','event_factor',
                                      'matches_won', 'points','event','type','category'], axis=1)

In [15]:
world_medels_df

Unnamed: 0,ranking_competitor,rank
0,ALB2002,Participation
1,ALB2002,Participation
2,ALB155,Participation
3,ALB2001,Participation
4,ALB2001,Participation
...,...,...
1298,WAL2057,Participation
1299,WAL146,Participation
1300,YEM2075,Participation
1301,ZIM005,Participation


For now we will keep those that have medalled and not medalled together and assess whether it needs to be re-populated after the merge.

In [16]:
world_medels_df = world_medels_df.rename(columns={"rank": "Medalled?"})

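# '13th Place' is not mapped here and keeps its original label; only the 'Yes'
# rows are selected below, and every other competitor is later filled in as 'No'.
world_medels_df.replace({'Medalled?':{'Participation': 'No',
                                  '7th Place': 'No',
                                  '5th Place': 'No',
                                  '9th Place': 'No',
                                  '11th Place': 'No',
                                  '1st Place': 'Yes',
                                  '2nd Place': 'Yes',
                                  '3rd Place': 'Yes'
                                 }},inplace = True)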
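
The mapping above lists each rank by hand. As a hedged alternative sketch, assuming that only 1st, 2nd and 3rd place count as a medal and that every other rank (including '13th Place') counts as no medal, the label can be derived with `isin`. The toy frame and the names `toy_df` and `medal_ranks` are illustrative and not part of the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the 'rank' column (illustrative only)
toy_df = pd.DataFrame({
    'ranking_competitor': ['AAA1', 'AAA2', 'AAA3', 'AAA4'],
    'rank': ['1st Place', 'Participation', '13th Place', '3rd Place'],
})

# Assumption: only podium places count as a medal
medal_ranks = ['1st Place', '2nd Place', '3rd Place']

# Label every row in one vectorised step
toy_df['Medalled?'] = np.where(toy_df['rank'].isin(medal_ranks), 'Yes', 'No')
print(toy_df)
```

With this approach any rank that is not explicitly listed as a medal falls through to 'No', so new rank labels cannot slip past the mapping.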

In [17]:
len(world_medels_df['ranking_competitor'].unique())

1001

In [18]:
world_medels_df = world_medels_df.drop_duplicates()
world_medels_df = world_medels_df.reset_index(drop=True)

In [19]:
len(world_medels_df['ranking_competitor'].unique())

1001

In [20]:
world_medels_df

Unnamed: 0,ranking_competitor,Medalled?
0,ALB2002,No
1,ALB155,No
2,ALB2001,No
3,ALB161,No
4,ALB2216,No
...,...,...
1032,WAL2058,No
1033,WAL2057,No
1034,WAL146,No
1035,YEM2075,No


In [21]:
medalled_at_worlds_df = world_medels_df[world_medels_df['Medalled?'] == 'Yes']

In [22]:
medalled_at_worlds_df = medalled_at_worlds_df.reset_index(drop=True)
medalled_at_worlds_df

Unnamed: 0,ranking_competitor,Medalled?
0,ALG2021,Yes
1,AUT190,Yes
2,AUT191,Yes
3,AZE133,Yes
4,AZE236,Yes
...,...,...
64,TUR238,Yes
65,TUR505,Yes
66,UKR274,Yes
67,UKR190,Yes


In [23]:
comp_IDs = world_medels_df['ranking_competitor'].unique()

In [24]:
not_medaled_df = pd.DataFrame(columns = ['ranking_competitor'])
not_medaled_df

Unnamed: 0,ranking_competitor


In [25]:
not_medaled_df = not_medaled_df.assign(ranking_competitor = comp_IDs)

In [26]:
medalled_at_worlds_df = medalled_at_worlds_df.merge(not_medaled_df, on='ranking_competitor', how='outer')
medalled_at_worlds_df = medalled_at_worlds_df.replace(np.nan, 'No')

In [27]:
medalled_at_worlds_df

Unnamed: 0,ranking_competitor,Medalled?
0,ALG2021,Yes
1,AUT190,Yes
2,AUT191,Yes
3,AZE133,Yes
4,AZE236,Yes
...,...,...
996,WAL2058,No
997,WAL2057,No
998,WAL146,No
999,YEM2075,No


We have a list of all the medalists of the last 2 world championships along with those that entered but did not medal.
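
The steps above (filter the medallists, outer-merge with every competitor ID, fill the gaps with 'No') reduce to one label per competitor: 'Yes' if the athlete medalled at either championship. A compact sketch of that idea on toy data, with illustrative names (`toy_df`, `medalled`) that are not part of the notebook:

```python
import pandas as pd

# Toy data: AAA1 entered twice and medalled once (illustrative only)
toy_df = pd.DataFrame({
    'ranking_competitor': ['AAA1', 'AAA1', 'AAA2', 'AAA3'],
    'Medalled?': ['No', 'Yes', 'No', 'No'],
})

# A competitor counts as a medallist if any of their rows is 'Yes'
medalled = (
    toy_df['Medalled?'].eq('Yes')
    .groupby(toy_df['ranking_competitor'])
    .any()
    .map({True: 'Yes', False: 'No'})
    .rename('Medalled?')
    .reset_index()
)
print(medalled)
```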

Being entered into the worlds will need to be a condition of the analysis that MUST be highlighted in the report.

With all these values we can create the dataframe for U21 data and then merge on ranking_competitor with only the common values in the medalled_at_worlds dataframe being used.

## Getting under 21 dataframe

Only events with an event factor of 2 or greater

In [28]:
u21_df = rankings_df[rankings_df['event_factor'] >= 2].reset_index(drop=True)

No more than 4 years before the first World Championships included in the medal data.

In [29]:
u21_df = u21_df.loc[(u21_df['date'] >= '2012-10-26') & (u21_df['date'] < '2018-11-06')]

In [30]:
u21_df.sort_values(by=['date'])

Unnamed: 0,ranking_country,ranking_competitor,date,event,type,category,event_factor,rank,matches_won,points
96841,SRI,SRI017,2012-11-21,21st World Seniors Karate Championships(FRA),World Championship,Male Kata,12.0,Participation,2.0,240.0
71555,NZL,NZL136,2012-11-21,21st World Seniors Karate Championships(FRA),World Championship,Male Kumite -75 Kg,12.0,Participation,2.0,240.0
71534,NZL,NZL154,2012-11-21,21st World Seniors Karate Championships(FRA),World Championship,Female Kata,12.0,Participation,1.0,144.0
51311,ITA,ITA192,2012-11-21,21st World Seniors Karate Championships(FRA),World Championship,Male Kumite -75 Kg,12.0,1st Place,7.0,1560.0
103378,TUR,TUR238,2012-11-21,21st World Seniors Karate Championships(FRA),World Championship,Female Kumite -50 Kg,12.0,3rd Place,5.0,888.0
...,...,...,...,...,...,...,...,...,...,...
66125,MEX,MEX2657,2018-10-27,Karate1 Youth League - Cancun-Quitana Roo 2018...,Karate1 Youth League,Cadet Kumite Female -47 kg,3.0,Participation,0.0,15.0
66016,MEX,MEX2586,2018-10-27,Karate1 Youth League - Cancun-Quitana Roo 2018...,Karate1 Youth League,Cadet Kumite Male -63 kg,3.0,Participation,2.0,75.0
66333,MEX,MEX2699,2018-10-27,Karate1 Youth League - Cancun-Quitana Roo 2018...,Karate1 Youth League,Junior Kumite Female 59+ kg,3.0,7th Place,0.0,75.0
66332,MEX,MEX2698,2018-10-27,Karate1 Youth League - Cancun-Quitana Roo 2018...,Karate1 Youth League,Cadet Kumite Male -63 kg,3.0,Participation,0.0,15.0


In [31]:
u21_df['category'].unique()

array(['Male Kumite -60 Kg', 'Male Kumite -67 kg', 'Female Kata',
       'Cadet Kumite Male 70+ kg', 'Junior Kumite Male 76+ kg',
       'U21 Kumite Male -60 kg', 'Cadet Kumite Male -52 kg',
       'Cadet Kumite Female -54 kg', 'Male Kumite -75 Kg',
       'U21 Kumite Male -75 kg', 'Junior Kumite Male -68 kg',
       'Cadet Kumite Male -63 kg', 'Cadet Kumite Male -57 kg',
       'Male Kumite -84 kg', 'Cadet Kumite Male -70 kg',
       'Under 21 Kumite Male -78 kg', 'Male Kumite 84+ kg',
       'U21 Kumite Male -67 kg', 'Cadet Kata Male', 'Male Kata',
       'Junior Kumite Male -76 kg', 'Female Kumite 68+ kg',
       'Junior Kumite Male -55 kg', 'Junior Kata Female',
       'U21 Kumite Male -84 kg', 'U21 Kumite Female 68+ kg',
       'Female Kumite -61 kg', 'Female Kumite -55 Kg',
       'Under 21 Kumite Male -68 kg', 'Female Kumite -50 Kg',
       'Under 21 Kumite Female -53 kg', 'Female Kumite -68 kg',
       'Junior Kumite Female -59 kg', 'Under 21 Kata Male',
       'Under 21 Kata F

In [32]:
u21_df = u21_df[~u21_df['category'].str.contains("Kata")]
u21_df = u21_df[~u21_df['category'].str.contains("Open")]

In [33]:
u21_df['category'].unique()

array(['Male Kumite -60 Kg', 'Male Kumite -67 kg',
       'Cadet Kumite Male 70+ kg', 'Junior Kumite Male 76+ kg',
       'U21 Kumite Male -60 kg', 'Cadet Kumite Male -52 kg',
       'Cadet Kumite Female -54 kg', 'Male Kumite -75 Kg',
       'U21 Kumite Male -75 kg', 'Junior Kumite Male -68 kg',
       'Cadet Kumite Male -63 kg', 'Cadet Kumite Male -57 kg',
       'Male Kumite -84 kg', 'Cadet Kumite Male -70 kg',
       'Under 21 Kumite Male -78 kg', 'Male Kumite 84+ kg',
       'U21 Kumite Male -67 kg', 'Junior Kumite Male -76 kg',
       'Female Kumite 68+ kg', 'Junior Kumite Male -55 kg',
       'U21 Kumite Male -84 kg', 'U21 Kumite Female 68+ kg',
       'Female Kumite -61 kg', 'Female Kumite -55 Kg',
       'Under 21 Kumite Male -68 kg', 'Female Kumite -50 Kg',
       'Under 21 Kumite Female -53 kg', 'Female Kumite -68 kg',
       'Junior Kumite Female -59 kg', 'Under 21 Kumite Female 60+ kg',
       'Junior Kumite Male -61 kg', 'Junior Kumite Female 59+ kg',
       'Cadet Kumite 

In [34]:
u21_df = u21_df[u21_df['category'].str.contains("U21") |
               u21_df['category'].str.contains("Junior") |
               u21_df['category'].str.contains("Cadet") |
               u21_df['category'].str.contains("Under 21")].reset_index(drop=True)

In [35]:
u21_df

Unnamed: 0,ranking_country,ranking_competitor,date,event,type,category,event_factor,rank,matches_won,points
0,ALB,ALB2047,2018-05-25,Karate1 Youth League - Sofia 2018(BUL),Karate1 Youth League,Cadet Kumite Male 70+ kg,3.0,Participation,0.0,15.0
1,ALB,ALB2082,2018-05-25,Karate1 Youth League - Sofia 2018(BUL),Karate1 Youth League,Junior Kumite Male 76+ kg,3.0,Participation,0.0,15.0
2,ALB,ALB2002,2016-06-27,Karate1 Youth Cup Umag 2016(CRO),Karate1 Youth League,U21 Kumite Male -60 kg,3.0,1st Place,4.0,435.0
3,ALB,ALB2002,2016-02-05,"43rd EKF Junior, Cadet and U21 Championships(CYP)",Continental Championship,U21 Kumite Male -60 kg,6.0,3rd Place,4.0,510.0
4,ALB,ALB2002,2015-11-12,"World Junior, Cadet and U21 Championships 2015...",World Championship,U21 Kumite Male -60 kg,12.0,Participation,1.0,144.0
...,...,...,...,...,...,...,...,...,...,...
16816,YEM,YEM2005,2015-11-12,"World Junior, Cadet and U21 Championships 2015...",World Championship,U21 Kumite Male -60 kg,12.0,Participation,0.0,48.0
16817,YEM,YEM2003,2018-05-06,"17TH AKF CADET, JUNIOR AND U21 CHAMPIONSHIP 20...",Continental Championship,U21 Kumite Male -84 kg,6.0,Participation,0.0,30.0
16818,ZIM,ZIM2010,2017-10-25,"World Junior, Cadet and U21 Championships 2017...",World Championship,U21 Kumite Male -67 kg,12.0,Participation,0.0,60.0
16819,ZIM,ZIM2007,2017-10-25,"World Junior, Cadet and U21 Championships 2017...",World Championship,U21 Kumite Male -60 kg,12.0,Participation,0.0,60.0


In [36]:
u21_df['type'].unique()

array(['Karate1 Youth League', 'Continental Championship',
       'World Championship', 'Karate1 Premier League'], dtype=object)

In [37]:
u21_df = u21_df.drop(['ranking_country','date','event','type','category','event_factor',
                     'matches_won','points'], axis=1)

In [38]:
u21_df

Unnamed: 0,ranking_competitor,rank
0,ALB2047,Participation
1,ALB2082,Participation
2,ALB2002,1st Place
3,ALB2002,3rd Place
4,ALB2002,Participation
...,...,...
16816,YEM2005,Participation
16817,YEM2003,Participation
16818,ZIM2010,Participation
16819,ZIM2007,Participation


In [39]:
u21_df = u21_df.groupby(['ranking_competitor','rank']).size().reset_index().rename(columns={0:'count'})

In [40]:
u21_df = u21_df.pivot_table(index=['ranking_competitor'],columns=['rank'],values='count').fillna(0)
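
The groupby and pivot above build a table with one row per competitor and one column per placing. `pd.crosstab` produces the same kind of count table in a single call. A sketch on toy data (the names `toy_df` and `counts` are illustrative):

```python
import pandas as pd

# Toy results: one row per competitor per event (illustrative only)
toy_df = pd.DataFrame({
    'ranking_competitor': ['AAA1', 'AAA1', 'AAA1', 'AAA2'],
    'rank': ['1st Place', 'Participation', 'Participation', '3rd Place'],
})

# Rows: competitors, columns: placings, values: number of occurrences
counts = pd.crosstab(toy_df['ranking_competitor'], toy_df['rank'])
print(counts)
```

Unlike the pivot followed by `fillna(0)`, `crosstab` returns integer counts with zeros already filled in.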

In [41]:
u21_df

ranking_competitor,11th Place,13th Place,1st Place,2nd Place,3rd Place,5th Place,7th Place,9th Place,Participation
AHO113,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
ALB149,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
ALB155,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,3.0
ALB2001,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
ALB2002,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,2.0
...,...,...,...,...,...,...,...,...,...
YEM2005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
YEM2006,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
ZIM2007,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
ZIM2008,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [42]:
combined_df = medalled_at_worlds_df.merge(u21_df, on='ranking_competitor', how='inner')

In [43]:
combined_df

Unnamed: 0,ranking_competitor,Medalled?,11th Place,13th Place,1st Place,2nd Place,3rd Place,5th Place,7th Place,9th Place,Participation
0,AUT190,Yes,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,AUT191,Yes,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,AZE2043,Yes,0.0,0.0,3.0,1.0,1.0,1.0,0.0,0.0,1.0
3,BIH320,Yes,0.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0,4.0
4,BUL304,Yes,0.0,0.0,2.0,3.0,1.0,0.0,1.0,0.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...
485,VEN206,No,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
486,VIE2069,No,0.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,1.0
487,VIE2003,No,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
488,VIE2030,No,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0


In [44]:
combined_df = combined_df[['ranking_competitor','1st Place','2nd Place','3rd Place','5th Place','7th Place','9th Place',
                           '11th Place','13th Place','Participation','Medalled?']]

In [45]:
combined_df = combined_df.sort_values(by=['ranking_competitor']).reset_index(drop=True)
combined_df

Unnamed: 0,ranking_competitor,1st Place,2nd Place,3rd Place,5th Place,7th Place,9th Place,11th Place,13th Place,Participation,Medalled?
0,ALB155,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,3.0,No
1,ALB2001,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,No
2,ALB2002,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,No
3,ALG194,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,No
4,ALG2016,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,No
...,...,...,...,...,...,...,...,...,...,...,...
485,VEN265,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,5.0,No
486,VIE2003,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,No
487,VIE2030,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,No
488,VIE2069,1.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,No


In [46]:
# Convert the Yes/No labels to booleans in one vectorised step
combined_df['Medalled?'] = combined_df['Medalled?'].map({'Yes': True, 'No': False})

combined_df = combined_df.astype({"Medalled?": bool})

Combine 5th to 13th place into a single 'No medal' column (Participation stays as its own column) to see if we can get better results for the classifier

In [47]:
simple_combined_df = combined_df.copy()

simple_combined_df['No medal'] = simple_combined_df['5th Place'] + simple_combined_df['7th Place'] + simple_combined_df['9th Place'] + simple_combined_df['11th Place'] + simple_combined_df['13th Place']

In [48]:
simple_combined_df

Unnamed: 0,ranking_competitor,1st Place,2nd Place,3rd Place,5th Place,7th Place,9th Place,11th Place,13th Place,Participation,Medalled?,No medal
0,ALB155,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,3.0,False,2.0
1,ALB2001,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False,0.0
2,ALB2002,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,False,0.0
3,ALG194,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,False,0.0
4,ALG2016,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,False,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
485,VEN265,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,5.0,False,1.0
486,VIE2003,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,False,0.0
487,VIE2030,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,False,0.0
488,VIE2069,1.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,False,0.0


In [49]:
simple_combined_df = simple_combined_df[['ranking_competitor','1st Place','2nd Place','3rd Place','No medal','Participation','Medalled?']]

In [50]:
# 'Medalled?' is already boolean here (it was converted in combined_df before
# the copy), so no Yes/No conversion is needed; the cast is kept as a safeguard.
simple_combined_df = simple_combined_df.astype({"Medalled?": bool})

In [51]:
combined_df['Medalled?'].value_counts()

False    456
True      34
Name: Medalled?, dtype: int64

In [52]:
scaler = MinMaxScaler()

combined_normalised_df = combined_df.copy()

feature_cols = ['1st Place','2nd Place','3rd Place','5th Place','7th Place','9th Place',
                '11th Place','13th Place','Participation']

combined_normalised_df[feature_cols] = scaler.fit_transform(combined_normalised_df[feature_cols])
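
`MinMaxScaler` rescales each column independently to the range [0, 1] using x' = (x - min) / (max - min). A plain-Python sketch of that formula for a single column, using made-up counts (the helper `min_max_scale` is illustrative, not a scikit-learn function):

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1]; a constant column maps to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # No spread in the column, so every value is mapped to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


participation = [3.0, 0.0, 2.0, 10.0]  # made-up counts
scaled = min_max_scale(participation)
print(scaled)
```

Because each column is scaled by its own range, a single athlete with an unusually high count compresses everyone else in that column towards 0.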

In [53]:
combined_normalised_df

Unnamed: 0,ranking_competitor,1st Place,2nd Place,3rd Place,5th Place,7th Place,9th Place,11th Place,13th Place,Participation,Medalled?
0,ALB155,0.000000,0.333333,0.00,0.25,0.0,0.333333,0.0,0.0,0.3,False
1,ALB2001,0.000000,0.333333,0.00,0.00,0.0,0.000000,0.0,0.0,0.0,False
2,ALB2002,0.142857,0.000000,0.25,0.00,0.0,0.000000,0.0,0.0,0.2,False
3,ALG194,0.000000,0.000000,0.00,0.00,0.0,0.000000,0.0,0.0,0.2,False
4,ALG2016,0.000000,0.000000,0.00,0.00,0.0,0.000000,0.0,0.0,0.1,False
...,...,...,...,...,...,...,...,...,...,...,...
485,VEN265,0.142857,0.000000,0.00,0.25,0.0,0.000000,0.0,0.0,0.5,False
486,VIE2003,0.000000,0.000000,0.25,0.00,0.0,0.000000,0.0,0.0,0.1,False
487,VIE2030,0.000000,0.333333,0.00,0.00,0.0,0.000000,0.0,0.0,0.1,False
488,VIE2069,0.142857,0.000000,0.50,0.00,0.0,0.000000,0.0,0.0,0.1,False


In [54]:
combined_normalised_df['Medalled?'].value_counts()

False    456
True      34
Name: Medalled?, dtype: int64

In [55]:
combined_normalised_df.head(10)

Unnamed: 0,ranking_competitor,1st Place,2nd Place,3rd Place,5th Place,7th Place,9th Place,11th Place,13th Place,Participation,Medalled?
0,ALB155,0.0,0.333333,0.0,0.25,0.0,0.333333,0.0,0.0,0.3,False
1,ALB2001,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False
2,ALB2002,0.142857,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.2,False
3,ALG194,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,False
4,ALG2016,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,False
5,ALG236,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,False
6,ALG242,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.1,False
7,ALG271,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,False
8,ALG276,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,False
9,AND133,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,False


Change any remaining 'Yes'/'No' values in 'Medalled?' to booleans

In [56]:
for i in combined_normalised_df.index:
    if combined_normalised_df['Medalled?'][i] == 'No':
        combined_normalised_df['Medalled?'][i] = False
    if combined_normalised_df['Medalled?'][i] == 'Yes':
            combined_normalised_df['Medalled?'][i] = True

In [57]:
combined_normalised_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 490 entries, 0 to 489
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ranking_competitor  490 non-null    object 
 1   1st Place           490 non-null    float64
 2   2nd Place           490 non-null    float64
 3   3rd Place           490 non-null    float64
 4   5th Place           490 non-null    float64
 5   7th Place           490 non-null    float64
 6   9th Place           490 non-null    float64
 7   11th Place          490 non-null    float64
 8   13th Place          490 non-null    float64
 9   Participation       490 non-null    float64
 10  Medalled?           490 non-null    bool   
dtypes: bool(1), float64(9), object(1)
memory usage: 38.9+ KB


In [58]:
combined_normalised_df = combined_normalised_df.astype({"Medalled?": bool})

Now that we have a count of all the placements, merged on the ranking_competitor values, we can begin testing a K-NN classifier.

## Create and run a leave one out classifier

In [59]:
def classify_single_case_euclidean(trainingData_df, targetValues_ss, ix, k):
    '''Use k-NN to classify the member of trainingData_df with index
       ix using a k-nearest neighbours classifier. The classifier is
       trained on the data in trainingData_df and the classes in
       targetValues_ss, with the data point indexed by ix omitted.
       Returns the class assigned to the data point with index ix.
    '''

    # Create a classifier instance to do k-nearest neighbours
    myClassifier = KNeighborsClassifier(n_neighbors=k,
                                        metric='euclidean',
                                        weights='uniform')

    # Now apply the classifier to all data points except
    # the one indexed by ix
    myClassifier.fit(trainingData_df.drop(ix, axis='index'),
                     targetValues_ss.drop(ix))

    # Return the class predicted by the trained classifier. Need
    # to predict on list of trainingData_df.loc[ix], as predict
    # expects a list/array, rather than a single value

    return myClassifier.predict([trainingData_df.loc[ix]])[0]

In [60]:
def classify_single_case_minkowski(trainingData_df, targetValues_ss, ix, k):
    '''Use k-NN to classify the member of trainingData_df with index
       ix using a k-nearest neighbours classifier. The classifier is
       trained on the data in trainingData_df and the classes in
       targetValues_ss, with the data point indexed by ix omitted.
       Returns the class assigned to the data point with index ix.
    '''

    # Create a classifier instance to do k-nearest neighbours
    myClassifier = KNeighborsClassifier(n_neighbors=k,
                                        metric='minkowski',
                                        weights='uniform')

    # Now apply the classifier to all data points except
    # the one indexed by ix
    myClassifier.fit(trainingData_df.drop(ix, axis='index'),
                     targetValues_ss.drop(ix))

    # Return the class predicted by the trained classifier. Need
    # to predict on list of trainingData_df.loc[ix], as predict
    # expects a list/array, rather than a single value

    return myClassifier.predict([trainingData_df.loc[ix]])[0]
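
The two helper functions differ only in the `metric` argument. In scikit-learn the Minkowski metric is controlled by the parameter `p`, which defaults to 2 in `KNeighborsClassifier`, and Minkowski distance with p = 2 is the Euclidean distance. Both helpers should therefore be expected to give the same predictions unless `p` is changed. A plain-Python check of that identity (the function names are illustrative):

```python
import math


def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 0.0, 2.0]
b = [0.0, 1.0, 0.0]

# p=2 reproduces the Euclidean distance; p=1 gives the Manhattan distance
print(minkowski(a, b, 2), euclidean(a, b), minkowski(a, b, 1))
```

To make the second helper a genuinely different comparison, a value such as `p=1` could be passed to the classifier alongside `metric='minkowski'`.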

In [61]:
%%capture --no-stderr

trainingData_df = combined_df[['1st Place','2nd Place','3rd Place','5th Place','7th Place','9th Place',
                           '11th Place','13th Place','Participation']]
targetValues_ss = combined_df['Medalled?']

# Return the predicted value of the data point with index 3 for k=3
# the known value is False
classify_single_case_euclidean(trainingData_df, targetValues_ss, 3, 3)

Classifier works. We can try the leave one out algorithm.
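
Leave-one-out evaluation fits the classifier once per athlete, each time holding that athlete out and predicting their label from everyone else. As a hedged sketch on toy data, scikit-learn's `LeaveOneOut` splitter and `cross_val_predict` express the same idea without a hand-written helper. The arrays `X` and `y` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Toy data: two well-separated clusters (illustrative only)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
y = np.array([False, False, False, False, True, True, True, True])

clf = KNeighborsClassifier(n_neighbors=3, metric='euclidean', weights='uniform')

# Each sample is predicted by a model fitted on all the other samples
predictions = cross_val_predict(clf, X, y, cv=LeaveOneOut()).astype(bool)

correct = int((predictions == y).sum())        # all correct classifications
true_positives = int((predictions & y).sum())  # correct "medalled" classifications
print(correct, true_positives)
```

The two counts correspond to `count` and `yes_count` in the hand-written loops.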

In [62]:
for k in range(3, 30, 2):
    count = 0
    yes_count = 0
    for i in trainingData_df.index:
        result = classify_single_case_euclidean(trainingData_df, targetValues_ss, i, k)
        if targetValues_ss.loc[i] == result:
            count += 1
            if (result == True) & (targetValues_ss.loc[i] == True):
                yes_count += 1
    print(k, 'nearest neighbors gives a correct classification count of', count, 'with', yes_count, 'correct medalled classifications')

3 nearest neighbors gives a correct classification count of 450 with 2 correct medalled classifications
5 nearest neighbors gives a correct classification count of 451 with 1 correct medalled classifications
7 nearest neighbors gives a correct classification count of 452 with 3 correct medalled classifications
9 nearest neighbors gives a correct classification count of 451 with 0 correct medalled classifications
11 nearest neighbors gives a correct classification count of 454 with 0 correct medalled classifications
13 nearest neighbors gives a correct classification count of 456 with 0 correct medalled classifications
15 nearest neighbors gives a correct classification count of 456 with 0 correct medalled classifications
17 nearest neighbors gives a correct classification count of 456 with 0 correct medalled classifications
19 nearest neighbors gives a correct classification count of 456 with 0 correct medalled classifications
21 nearest neighbors gives a correct classification count o

In [63]:
for k in range(3, 30, 2):
    count = 0
    yes_count = 0
    for i in trainingData_df.index:
        result = classify_single_case_minkowski(trainingData_df, targetValues_ss, i, k)
        if targetValues_ss.loc[i] == result:
            count += 1
            if (result == True) & (targetValues_ss.loc[i] == True):
                yes_count += 1
    print(k, 'nearest neighbors gives a correct classification count of', count, 'with', yes_count, 'correct medalled classifications')

3 nearest neighbors gives a correct classification count of 450 with 2 correct medalled classifications
5 nearest neighbors gives a correct classification count of 451 with 1 correct medalled classifications
7 nearest neighbors gives a correct classification count of 452 with 3 correct medalled classifications
9 nearest neighbors gives a correct classification count of 451 with 0 correct medalled classifications
11 nearest neighbors gives a correct classification count of 454 with 0 correct medalled classifications
13 nearest neighbors gives a correct classification count of 456 with 0 correct medalled classifications
15 nearest neighbors gives a correct classification count of 456 with 0 correct medalled classifications
17 nearest neighbors gives a correct classification count of 456 with 0 correct medalled classifications
19 nearest neighbors gives a correct classification count of 456 with 0 correct medalled classifications
21 nearest neighbors gives a correct classification count o

And with normalised data

In [64]:
trainingData_df = combined_normalised_df[['1st Place','2nd Place','3rd Place','5th Place','7th Place','9th Place',
                           '11th Place','13th Place','Participation']]
targetValues_ss = combined_normalised_df['Medalled?']

classify_single_case_euclidean(trainingData_df, targetValues_ss, 94, 3)

False

In [65]:
for k in range(3, 30, 2):
    count = 0
    yes_count = 0
    for i in trainingData_df.index:
        result = classify_single_case_euclidean(trainingData_df, targetValues_ss, i, k)
        if targetValues_ss.loc[i] == result:
            count += 1
            if (result == True) & (targetValues_ss.loc[i] == True):
                yes_count += 1
    print(k, 'nearest neighbors gives a correct classification count of', count, 'with', yes_count, 'correct medalled classifications')

3 nearest neighbors gives a correct classification count of 453 with 1 correct medalled classifications
5 nearest neighbors gives a correct classification count of 457 with 3 correct medalled classifications
7 nearest neighbors gives a correct classification count of 454 with 0 correct medalled classifications
9 nearest neighbors gives a correct classification count of 456 with 0 correct medalled classifications
11 nearest neighbors gives a correct classification count of 456 with 0 correct medalled classifications
13 nearest neighbors gives a correct classification count of 456 with 0 correct medalled classifications
15 nearest neighbors gives a correct classification count of 456 with 0 correct medalled classifications
17 nearest neighbors gives a correct classification count of 456 with 0 correct medalled classifications
19 nearest neighbors gives a correct classification count of 456 with 0 correct medalled classifications
21 nearest neighbors gives a correct classification count o

In [66]:
for k in range(3, 30, 2):
    count = 0
    yes_count = 0
    for i in trainingData_df.index:
        result = classify_single_case_minkowski(trainingData_df, targetValues_ss, i, k)
        if targetValues_ss.loc[i] == result:
            count += 1
            if (result == True) & (targetValues_ss.loc[i] == True):
                yes_count += 1
    print(k, 'nearest neighbors gives a correct classification count of', count, 'with', yes_count, 'correct medalled classifications')

3 nearest neighbors gives a correct classification count of 453 with 1 correct medalled classifications
5 nearest neighbors gives a correct classification count of 457 with 3 correct medalled classifications
7 nearest neighbors gives a correct classification count of 454 with 0 correct medalled classifications
9 nearest neighbors gives a correct classification count of 456 with 0 correct medalled classifications
11 nearest neighbors gives a correct classification count of 456 with 0 correct medalled classifications
13 nearest neighbors gives a correct classification count of 456 with 0 correct medalled classifications
15 nearest neighbors gives a correct classification count of 456 with 0 correct medalled classifications
17 nearest neighbors gives a correct classification count of 456 with 0 correct medalled classifications
19 nearest neighbors gives a correct classification count of 456 with 0 correct medalled classifications
21 nearest neighbors gives a correct classification count o

## Try with reduced columns to cut data noise

In [67]:
trainingData_df = simple_combined_df[['1st Place','2nd Place','3rd Place','No medal','Participation']]
targetValues_ss = simple_combined_df['Medalled?']

In [68]:
for k in range(3, 30, 2):
    count = 0
    yes_count = 0
    for i in trainingData_df.index:
        result = classify_single_case_euclidean(trainingData_df, targetValues_ss, i, k)
        if targetValues_ss.loc[i] == result:
            count += 1
            if (result == True) & (targetValues_ss.loc[i] == True):
                yes_count += 1
    print(k, 'nearest neighbors gives a correct classification count of', count, 'with', yes_count, 'correct medalled classifications')

3 nearest neighbors gives a correct classification count of 450 with 2 correct medalled classifications
5 nearest neighbors gives a correct classification count of 453 with 1 correct medalled classifications
7 nearest neighbors gives a correct classification count of 455 with 3 correct medalled classifications
9 nearest neighbors gives a correct classification count of 453 with 0 correct medalled classifications
11 nearest neighbors gives a correct classification count of 455 with 0 correct medalled classifications
13 nearest neighbors gives a correct classification count of 456 with 0 correct medalled classifications
15 nearest neighbors gives a correct classification count of 456 with 0 correct medalled classifications
17 nearest neighbors gives a correct classification count of 456 with 0 correct medalled classifications
19 nearest neighbors gives a correct classification count of 456 with 0 correct medalled classifications
21 nearest neighbors gives a correct classification count o

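Even with the simplified encoding, the classifier cannot reliably identify medallists from U21 results. At larger k (13 and above in the output shown) every run scores 456 correct with 0 correct medalled classifications, which is exactly the number of non-medallists, so the classifier is simply labelling everyone as 'no medal'. Either there is not enough data (only 34 medallists among 490 athletes), or there is enough and U21 results simply do not reliably indicate senior success.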
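
Because only 34 of the 490 athletes medalled, the raw count of correct classifications is a misleading yardstick here. A short arithmetic sketch, using the counts printed earlier in this notebook, compares the k = 7 run on the unnormalised data with a trivial model that always predicts 'no medal' (variable names are illustrative):

```python
# Class counts from the value_counts() output
n_not_medalled = 456
n_medalled = 34
n_total = n_not_medalled + n_medalled

# Accuracy of a trivial model that always predicts "no medal"
baseline_accuracy = n_not_medalled / n_total

# k=7 on the unnormalised data: 452 correct overall, 3 of them medallists
knn_accuracy = 452 / n_total
knn_recall = 3 / n_medalled  # share of actual medallists that were identified

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"k=7 accuracy:      {knn_accuracy:.3f}")
print(f"k=7 medal recall:  {knn_recall:.3f}")
```

On these figures the k = 7 classifier is slightly less accurate than the trivial baseline and finds fewer than one in ten of the medallists.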