## Importing libraries & loading dataset

In [1]:
import pandas as pd
import numpy as np
from geopy.geocoders import Nominatim 
import country_converter as coco
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import math
import statsmodels.api as sm

In [2]:
raw_data = pd.read_csv('data/athlete_events.csv',index_col="ID")
regions = pd.read_csv('data/noc_regions.csv')
df = pd.merge(raw_data, regions, on='NOC', how='left')

## Model 2 : Home advantage effect in Olympic games

In this model we try to determine whether hosting the Games gives the host team an advantage.
Because the dataset contains no ready-made independent and dependent variables for this question, we derive new ones from the existing features.

Because the variables used in the model contain no missing values, no pre-processing is applied beyond the filtering steps shown in the code below.
<br> An <u>overview</u> of the notebook follows:
1. Some filtering and new variable addition to the dataset
2. Logistic regression model approach:
<br>2.1 features: sport, team
<br>2.2 feature: sport
<br>2.3 feature: team
3. Discussion about the results
4. Logit regression model approach
5. Discussion about the results
    

## 1. Some filtering and new variable addition to the dataset

In [3]:
# Creating a new dataframe that keeps only the rows with a medal
df_medal = df[df.Medal.notnull()]
df_new = df.copy()


# Dropping the unnecessary columns
df_medal = df_medal.drop(['Name', 'Sex', 'Age', 'Height', 'Weight', 'Event', 'notes', 'NOC'], axis=1)

# 'Individual Olympic Athletes' in the 'Team' column carries no country information,
# so rows with this value are dropped.
df_medal.drop( df_medal[ df_medal['Team'] == 'Individual Olympic Athletes' ].index , inplace=True)
df_new.drop( df_new[ df_new['Team'] == 'Individual Olympic Athletes' ].index , inplace=True)

# Some countries appear under several region names, so the names are normalized
# before they are converted to ISO3 codes.
df_medal.loc[df_medal.region == "US", "region"] = "United States"
df_medal.loc[df_medal.region == "USA", "region"] = "United States"
df_medal.loc[df_medal.region == "UK", "region"] = "United Kingdom"
df_medal.dropna(subset = ["region"], inplace=True)
df_new.loc[df_new.region == "US", "region"] = "United States"
df_new.loc[df_new.region == "USA", "region"] = "United States"
df_new.loc[df_new.region == "UK", "region"] = "United Kingdom"
df_new.loc[df_new.region == "Boliva", "region"] = "Bolivia"
df_new.dropna(subset = ["region"], inplace=True)

The home advantage effect is examined only for teams that have hosted the Games at least once, so the following steps are taken:

In [4]:
# Converting host cities to host countries 
geolocator = Nominatim(user_agent = "geoapiExercises") 
uniq_host_cities = list(df_medal.City.unique())
uniq_host_countries = []
for city in uniq_host_cities:
    location = geolocator.geocode(city, language='en') 
    # The country is the last comma-separated part of the address; strip the leading space
    uniq_host_countries.append(location[0].split(',')[-1].strip())
    
# Converting host countries to iso3 values
iso3_for_host = coco.convert(names=uniq_host_countries, to='ISO3', not_found=None)

# Creating a dictionary in order to match host city to iso3 code
host_city_to_iso3 = {}
for i in range(len(uniq_host_cities)):
    host_city_to_iso3[uniq_host_cities[i]] = iso3_for_host[i]
    
# Creating a list to store corresponding iso3 code of the host city
host_NOC_column = []
for i in range(len(df_medal)):
    host_NOC_column.append(host_city_to_iso3[df_medal.City.iloc[i]])
    
# Adding Host_NOC column to the dataframe
df_medal['Host_NOC'] = host_NOC_column
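
The row-by-row lookup in the cell above can also be expressed with `Series.map`. The following is a minimal sketch on toy data; `toy` and `toy_city_to_iso3` are hypothetical names, and the mapping is written by hand for illustration instead of being geocoded.

```python
import pandas as pd

# Toy stand-in for df_medal; the city-to-ISO3 mapping below is hand-written
# (an assumption for illustration), not produced by geopy or country_converter.
toy = pd.DataFrame({"City": ["London", "Sydney", "London"]})
toy_city_to_iso3 = {"London": "GBR", "Sydney": "AUS"}

# Series.map looks up every city in the dictionary in a single call
toy["Host_NOC"] = toy["City"].map(toy_city_to_iso3)
print(toy)
```

Cities missing from the dictionary would come out as `NaN`, which makes unmapped hosts easy to spot.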

In [5]:
# Some NOC values do not match the ISO3 codes (e.g. Germany -> DEU/GER).
# For consistency, the ISO3 codes are therefore derived from the region column.
region_list = list(df_medal.region.unique())
iso3_for_team = coco.convert(names=region_list, to='ISO3', not_found=None)

# Creating a dictionary in order to match region to iso3 code
region_to_iso3 = {}
for i in range(len(region_list)):
    region_to_iso3[region_list[i]] = iso3_for_team[i]
    
# Creating a list to store corresponding iso3 code of the region(team)
region_column = []
for i in range(len(df_medal)):
    region_column.append(region_to_iso3[df_medal.region.iloc[i]])
    
# Adding Team_NOC column to the dataframe
df_medal['Team_NOC'] = region_column

In [6]:
# Doing the same process with df_new dataframe
region_list = list(df_new.region.unique())
iso3_for_team = coco.convert(names=region_list, to='ISO3', not_found=None)

region_to_iso3 = {}
for i in range(len(region_list)):
    region_to_iso3[region_list[i]] = iso3_for_team[i]
    
# Creating a list to store corresponding iso3 code of the region(team)
region_column = []
for i in range(len(df_new)):
    region_column.append(region_to_iso3[df_new.region.iloc[i]])
    
# Adding Team_NOC column to the dataframe
df_new['Team_NOC'] = region_column

In [7]:
# Creating a new column that flags whether the team competed in its own country
Host=[]
for i in range(len(df_medal)):
    if df_medal.Team_NOC.iloc[i] == df_medal.Host_NOC.iloc[i]:
        Host.append(1)
    else:
        Host.append(0)

# Adding Host column to the dataframe
df_medal['Host'] = Host
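
The same flag can be produced without an explicit loop by comparing the two columns directly. This is a sketch on invented rows (the `toy` frame is hypothetical), assuming both columns already hold ISO3 codes as in the cell above.

```python
import pandas as pd

# Invented rows; in the notebook these columns come from df_medal
toy = pd.DataFrame({
    "Team_NOC": ["GBR", "USA", "AUS"],
    "Host_NOC": ["GBR", "GBR", "AUS"],
})

# Element-wise comparison gives True/False; astype(int) turns it into 1/0
toy["Host"] = (toy["Team_NOC"] == toy["Host_NOC"]).astype(int)
print(toy)
```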

In [8]:
# Determining the countries that have ever hosted the Games
host_countries = df_medal.Host_NOC.unique()

# filtering dataframe with host countries
df_medal_host_only = df_medal.loc[df_medal['Team_NOC'].isin(host_countries)]

Unlike the logit regression model in Section 4, this approach also asks whether home advantage differs by sport. We therefore keep only the team and sport combinations in which the team won medals both as host and as visitor.

In [9]:
no_host = pd.DataFrame(df_medal_host_only[df_medal_host_only.Host==0].groupby(['Team_NOC','Sport','Host']).Medal.count()).rename(
columns={'Medal' : 'Total_number_of_medals'}).reset_index()

yes_host = pd.DataFrame(df_medal_host_only[df_medal_host_only.Host==1].groupby(['Team_NOC','Sport','Host']).Medal.count()).rename(
columns={'Medal' : 'Total_number_of_medals'}).reset_index()

final_df = pd.merge(no_host, yes_host, how ='inner', on =['Team_NOC', 'Sport'])
final_df = final_df.drop(['Host_x', 'Host_y'], axis=1)
final_df.columns = ['Team_NOC','Sport','total_medals', 'total_medals_when_host']
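
To illustrate what the inner merge above keeps, here is a small sketch on made-up counts. The frames `toy_no_host` and `toy_yes_host` and the `_away`/`_host` suffixes are assumptions for illustration. Only team and sport pairs present in both frames survive the merge.

```python
import pandas as pd

# Made-up medal counts as visitor and as host
toy_no_host = pd.DataFrame({
    "Team_NOC": ["AUS", "AUS", "GBR"],
    "Sport": ["Archery", "Boxing", "Rowing"],
    "medals": [4, 4, 30],
})
toy_yes_host = pd.DataFrame({
    "Team_NOC": ["AUS", "GBR"],
    "Sport": ["Archery", "Polo"],
    "medals": [1, 2],
})

# how="inner" drops pairs that appear in only one of the two frames
merged = pd.merge(toy_no_host, toy_yes_host, how="inner",
                  on=["Team_NOC", "Sport"], suffixes=("_away", "_host"))
print(merged)
```

Here AUS/Boxing, GBR/Rowing and GBR/Polo are dropped because each has medals in only one of the two situations.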

In this approach the y variable is calculated as follows:
1. `Avg_medal_when_host` = medals won by the team in the given sport while hosting, divided by the number of years in which the team's country hosted
2. `Avg_medal_count` = medals won by the team in the given sport while not hosting, divided by the number of Games years in which the team competed in that sport
3. `advantage` = 1 if `Avg_medal_when_host` > `Avg_medal_count`; 0 otherwise
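
As a worked example of the three steps, the sketch below recomputes the label for the AUS/Archery row that appears in the `final_df.head(5)` output further down. The second row is hypothetical and is added only to show a case that yields 0.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "total_medals": [4, 10],           # first row: AUS/Archery; second row: hypothetical
    "total_years": [12, 5],
    "total_medals_when_host": [1, 1],
    "total_host_years": [2, 2],
})

toy["Avg_medal_count"] = toy.total_medals / toy.total_years
toy["Avg_medal_when_host"] = toy.total_medals_when_host / toy.total_host_years

# 1 when the hosting average is strictly larger, 0 otherwise
toy["advantage"] = np.where(toy.Avg_medal_when_host > toy.Avg_medal_count, 1, 0)
print(toy[["Avg_medal_count", "Avg_medal_when_host", "advantage"]])
```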


In [10]:
# Finding the average medal won when host and in overall olympics
total_years = []
for i in range(len(final_df)):
    total_years.append(df_new[(df_new['Team_NOC'] == final_df.Team_NOC.iloc[i]) & (df_new['Sport'] == final_df.Sport.iloc[i])].Year.nunique())
final_df['total_years'] = total_years    
final_df['Avg_medal_count']  = final_df.total_medals/final_df.total_years

total_host_years = []
for i in range(len(final_df)):
    total_host_years.append(len(df_medal[df_medal.Host_NOC==final_df.Team_NOC.iloc[i]].groupby('Year').Year.count()))    
final_df['total_host_years'] = total_host_years 
final_df['Avg_medal_when_host'] = final_df.total_medals_when_host/final_df.total_host_years


In [11]:
# Creating y variable: 1 if Avg_medal_when_host is greater than Avg_medal_count, else 0
adv_binary=[]
for i in range(len(final_df)):
    if final_df.Avg_medal_when_host.iloc[i]> final_df.Avg_medal_count.iloc[i]:
        adv_binary.append(1)
    else:
        adv_binary.append(0)
final_df['advantage'] = adv_binary

In [12]:
final_df.head(5)

| | Team_NOC | Sport | total_medals | total_medals_when_host | total_years | Avg_medal_count | total_host_years | Avg_medal_when_host | advantage |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AUS | Archery | 4 | 1 | 12 | 0.333333 | 2 | 0.5 | 1 |
| 1 | AUS | Athletics | 68 | 22 | 29 | 2.344828 | 2 | 11.0 | 1 |
| 2 | AUS | Basketball | 48 | 12 | 14 | 3.428571 | 2 | 6.0 | 1 |
| 3 | AUS | Beach Volleyball | 2 | 2 | 6 | 0.333333 | 2 | 1.0 | 1 |
| 4 | AUS | Boxing | 4 | 1 | 21 | 0.190476 | 2 | 0.5 | 1 |


## 2. Logistic regression model approach:

The data is split randomly into training and test sets, 80% and 20% respectively. To keep the split reproducible, `random_state` is set to 103 in all logistic regression models.

### 2.1 features: sport, team

In [13]:
team = pd.get_dummies(final_df['Team_NOC'],drop_first=True)
sport = pd.get_dummies(final_df['Sport'],drop_first=True)
train = final_df.drop(['Team_NOC','Sport','total_medals','total_medals_when_host','total_years',
                       'total_host_years', 'Avg_medal_count', 'Avg_medal_when_host' ],axis=1)
train = pd.concat([train,team,sport],axis=1)
# Splitting data into test and train set 
X_train, X_test, y_train, y_test = train_test_split(train.drop('advantage',axis=1), 
                                                    train['advantage'], test_size=0.20, 
                                                    random_state=103)
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.43      0.16      0.23        19
           1       0.73      0.92      0.81        48

    accuracy                           0.70        67
   macro avg       0.58      0.54      0.52        67
weighted avg       0.65      0.70      0.65        67



To make the fitted model easier to interpret, the following cell passes the intercept and coefficients through the sigmoid function to obtain the predicted probability of `advantage` for each team and sport combination. The same step is repeated after each logistic regression fit.
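
A minimal sketch of that conversion, using hypothetical coefficient values and not the fitted ones: the linear score is the intercept plus the coefficients of the active dummy columns, and the sigmoid maps it to a probability.

```python
import math

def sigmoid(z):
    """Map a linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values standing in for logmodel.intercept_ and logmodel.coef_
intercept = 1.0
coef_team = 0.5     # coefficient of one team dummy
coef_sport = -1.5   # coefficient of one sport dummy

p = sigmoid(intercept + coef_team + coef_sport)
print(f"{p:.2f}")
```

Because the dummies are created with `drop_first=True`, the dropped baseline team and sport have no coefficient of their own; their contribution to the score is 0.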

In [14]:
interc = logmodel.intercept_[0]  # intercept_ is a length-1 array; take the scalar
coefs = logmodel.coef_[0]
sports_team = []
for i in range(1, 22):
    for j in range (22, 75):
        sports_team.append(train.columns[i] +'/'+ train.columns[j])
result = []
for i in range(21):
    for j in range(21,74):
        result.append("{:.2f}".format(1/(1+ math.exp(-(interc + coefs[i] + coefs[j])))))

model_sport_team = pd.DataFrame({'Team/Sport': sports_team,
                   'Probability of advantage': result })

In [15]:
model_sport_team.head(5)

| | Team/Sport | Probability of advantage |
|---|---|---|
| 0 | AUT/Archery | 0.72 |
| 1 | AUT/Art Competitions | 0.6 |
| 2 | AUT/Athletics | 0.52 |
| 3 | AUT/Badminton | 0.54 |
| 4 | AUT/Baseball | 0.42 |


### 2.2 feature: sport

In [16]:
train = final_df.drop(['Team_NOC','Sport','total_medals','total_medals_when_host','total_years',
                       'total_host_years', 'Avg_medal_count', 'Avg_medal_when_host' ],axis=1)
train = pd.concat([train,sport],axis=1)
# Splitting data into test and train set 
X_train, X_test, y_train, y_test = train_test_split(train.drop('advantage',axis=1), 
                                                    train['advantage'], test_size=0.20, 
                                                    random_state=103)
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.33      0.05      0.09        19
           1       0.72      0.96      0.82        48

    accuracy                           0.70        67
   macro avg       0.53      0.51      0.46        67
weighted avg       0.61      0.70      0.61        67



In [17]:
interc = logmodel.intercept_[0]  # intercept_ is a length-1 array; take the scalar
coefs = logmodel.coef_[0]
sports = []
for i in range(1, len(train.columns)):
    sports.append(train.columns[i])
result = []
for i in range(len(sports)):
    result.append("{:.2f}".format(1/(1+ math.exp(-(interc + coefs[i])))))

model_sport = pd.DataFrame({'Sport': sports,
                   'Probability of advantage': result })

In [18]:
model_sport.head(5)

| | Sport | Probability of advantage |
|---|---|---|
| 0 | Archery | 0.84 |
| 1 | Art Competitions | 0.78 |
| 2 | Athletics | 0.71 |
| 3 | Badminton | 0.76 |
| 4 | Baseball | 0.58 |


### 2.3 feature: team

In [19]:
train = final_df.drop(['Team_NOC','Sport','total_medals','total_medals_when_host','total_years',
                       'total_host_years', 'Avg_medal_count', 'Avg_medal_when_host' ],axis=1)
train = pd.concat([train,team],axis=1)
# Splitting data into test and train set 
X_train, X_test, y_train, y_test = train_test_split(train.drop('advantage',axis=1), 
                                                    train['advantage'], test_size=0.20, 
                                                    random_state=103)
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.50      0.11      0.17        19
           1       0.73      0.96      0.83        48

    accuracy                           0.72        67
   macro avg       0.62      0.53      0.50        67
weighted avg       0.66      0.72      0.64        67



In [20]:
interc = logmodel.intercept_[0]  # intercept_ is a length-1 array; take the scalar
coefs = logmodel.coef_[0]
teams = []
for i in range(1, len(train.columns)):
    teams.append(train.columns[i])
result = []
for i in range(len(teams)):
    result.append("{:.2f}".format(1/(1+ math.exp(-(interc + coefs[i])))))

model_team = pd.DataFrame({'Team': teams,
                   'Probability of advantage': result })

In [21]:
model_team.head(5)

| | Team | Probability of advantage |
|---|---|---|
| 0 | AUT | 0.53 |
| 1 | BEL | 0.89 |
| 2 | BRA | 0.85 |
| 3 | CAN | 0.62 |
| 4 | CHE | 0.77 |


## 3. Discussion about the results

Precision, recall and F-score are satisfactory for class 1, but the same measures are quite low for class 0. The support values show that the labels are unbalanced: most observations are labeled as advantage (1), so predictions on the test data lean toward label 1.
In conclusion, the models do not show a strong relationship between home advantage and the sport and team variables.
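
One way to put the accuracy figures in context is to compare them with a classifier that always predicts 1. The sketch below uses only the support and accuracy values printed in the three classification reports above; the dictionary keys are labels chosen here for readability.

```python
# Test-set support taken from the classification reports
n_class0, n_class1 = 19, 48

# Accuracy of a trivial model that predicts 1 for every row
baseline_accuracy = n_class1 / (n_class0 + n_class1)

# Accuracies as printed in the reports for sections 2.1, 2.2 and 2.3
reported = {"sport + team": 0.70, "sport": 0.70, "team": 0.72}

for name, acc in reported.items():
    print(f"{name}: accuracy {acc:.2f} vs always-1 baseline {baseline_accuracy:.2f}")
```

With 48 of the 67 test rows labeled 1, the baseline reaches about 0.72, which is on par with the reported accuracies of 0.70 to 0.72.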

## 4. Logit regression model approach

Similar data filtering is applied in this section, with the following differences:

1. Up until 1992, the Winter and Summer Games were held in the same year; since then they have been held in different years. To keep a constant gap between consecutive Games, only the Summer Olympics are considered, and the code keeps only the Games from 1948 onward. Fewer medals are also awarded at the Winter Olympics.
2. Because sports are not analyzed individually here, the extra sport-level filtering used for the logistic regression models is not needed. The only filter keeps the teams that have hosted the Games.

In [22]:
raw_data = pd.read_csv('data/athlete_events.csv',index_col="ID")
regions = pd.read_csv('data/noc_regions.csv')
df = pd.merge(raw_data, regions, on='NOC', how='left')
# Creating a new dataframe that keeps only the rows with a medal
df_medal = df[df.Medal.notnull()]
df_medal = df_medal[df_medal.Year >= 1948]
df_medal = df_medal[df_medal.Season == 'Summer']
df_new = df.copy()
df_new = df_new[df_new.Year >= 1948]
df_new = df_new[df_new.Season == 'Summer']

# Dropping the unnecessary columns
df_medal = df_medal.drop(['Name', 'Sex', 'Age', 'Height', 'Weight', 'notes', 'NOC'], axis=1)

# 'Individual Olympic Athletes' in the 'Team' column carries no country information,
# so rows with this value are dropped.
df_medal.drop( df_medal[ df_medal['Team'] == 'Individual Olympic Athletes' ].index , inplace=True)
df_new.drop( df_new[ df_new['Team'] == 'Individual Olympic Athletes' ].index , inplace=True)

df_medal.loc[df_medal.region == "US", "region"] = "United States"
df_medal.loc[df_medal.region == "USA", "region"] = "United States"
df_medal.loc[df_medal.region == "UK", "region"] = "United Kingdom"
df_medal.loc[df_medal.region == "Boliva", "region"] = "Bolivia"
df_medal.dropna(subset = ["region"], inplace=True)
df_new.loc[df_new.region == "US", "region"] = "United States"
df_new.loc[df_new.region == "USA", "region"] = "United States"
df_new.loc[df_new.region == "UK", "region"] = "United Kingdom"
df_new.loc[df_new.region == "Boliva", "region"] = "Bolivia"
df_new.dropna(subset = ["region"], inplace=True)

# Converting host cities to host countries 
geolocator = Nominatim(user_agent = "geoapiExercises") 
uniq_host_cities = list(df_medal.City.unique())
uniq_host_countries = []
for city in uniq_host_cities:
    location = geolocator.geocode(city, language='en') 
    # The country is the last comma-separated part of the address; strip the leading space
    uniq_host_countries.append(location[0].split(',')[-1].strip())
    
# Converting host countries to iso3 values
iso3_for_host = coco.convert(names=uniq_host_countries, to='ISO3', not_found=None)

# Creating a dictionary in order to match host city to iso3 code
host_city_to_iso3 = {}
for i in range(len(uniq_host_cities)):
    host_city_to_iso3[uniq_host_cities[i]] = iso3_for_host[i]
    
# Creating a list to store corresponding iso3 code of the host city
host_NOC_column = []
for i in range(len(df_medal)):
    host_NOC_column.append(host_city_to_iso3[df_medal.City.iloc[i]])
    
# Adding Host_NOC column to the dataframe
df_medal['Host_NOC'] = host_NOC_column

# Some NOC values do not match the ISO3 codes (e.g. Germany -> DEU/GER).
# For consistency, the ISO3 codes are therefore derived from the region column.
region_list = list(df_medal.region.unique())
iso3_for_team = coco.convert(names=region_list, to='ISO3', not_found=None)

# Creating a dictionary in order to match region to iso3 code
region_to_iso3 = {}
for i in range(len(region_list)):
    region_to_iso3[region_list[i]] = iso3_for_team[i]
    
# Creating a list to store corresponding iso3 code of the region(team)
region_column = []
for i in range(len(df_medal)):
    region_column.append(region_to_iso3[df_medal.region.iloc[i]])
    
# Adding Team_NOC column to the dataframe
df_medal['Team_NOC'] = region_column

# Doing the same process with df_new dataframe
region_list = list(df_new.region.unique())
iso3_for_team = coco.convert(names=region_list, to='ISO3', not_found=None)

region_to_iso3 = {}
for i in range(len(region_list)):
    region_to_iso3[region_list[i]] = iso3_for_team[i]
    
# Creating a list to store corresponding iso3 code of the region(team)
region_column = []
for i in range(len(df_new)):
    region_column.append(region_to_iso3[df_new.region.iloc[i]])
    
# Adding Team_NOC column to the dataframe
df_new['Team_NOC'] = region_column

# Creating a new column that flags whether the team competed in its own country
Host=[]
for i in range(len(df_medal)):
    if df_medal.Team_NOC.iloc[i] == df_medal.Host_NOC.iloc[i]:
        Host.append(1)
    else:
        Host.append(0)

# Adding Host column to the dataframe
df_medal['Host'] = Host

# Determining the countries that have ever hosted the Games
host_countries = df_medal.Host_NOC.unique()

# filtering dataframe with host countries
df_medal_host_only = df_medal.loc[df_medal['Team_NOC'].isin(host_countries)]

The following cell builds a dictionary of host countries for each Olympic Games, with `Year` as the key and the list of host team codes as the value.

In [23]:
# For each (Year, team) pair, flag whether the team won at least one medal as host
host_df = (df_medal_host_only.groupby(['Year', 'Team_NOC']).Host.sum() > 0).reset_index()
host_df = host_df[host_df.Host].drop('Host', axis=1)

# Map each year to the list of host team codes
host_df_dict = {}
for year, team_noc in zip(host_df.Year, host_df.Team_NOC):
    host_df_dict.setdefault(year, []).append(team_noc)

In team events every team member receives a medal, so the following cell collapses each team event to a single row so that it counts as one medal.
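
For comparison, a simpler way to count a team medal once is to drop duplicate (`Year`, `NOC`, `Event`, `Medal`) rows before counting. This is an alternative sketch on invented rows, not the threshold-based method used in the cell below.

```python
import pandas as pd

# Invented rows: three members of one gold-medal team plus one individual medal
toy = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2000],
    "NOC": ["AUS", "AUS", "AUS", "USA"],
    "Event": ["Hockey Women's Hockey"] * 3 + ["Athletics Men's 100 metres"],
    "Medal": ["Gold", "Gold", "Gold", "Gold"],
})

# Keep medal rows only, then keep one row per (Year, NOC, Event, Medal)
medal_events = (toy.dropna(subset=["Medal"])
                   .drop_duplicates(["Year", "NOC", "Event", "Medal"]))

medal_counts = medal_events.groupby("NOC").size()
print(medal_counts)
```

A caveat of this shortcut is that two athletes from the same NOC who tie for the same medal in an individual event would also be collapsed into one row.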

In [24]:
df_clean = df_new.copy()
df_clean['Target_Variable'] = np.where(df_clean['Medal'].isnull(), 0, 1)
team_sports = df_clean.groupby(['Year', 'Event']).Target_Variable.sum().reset_index(name='count')
team_sport_list = team_sports[team_sports['count'] > 4].Event.unique().tolist()

df_clean['sport_type'] = ['team' if event in team_sport_list else 'individual' for event in df_clean.Event.tolist()]

# Group team events by year, NOC and event to get the average athlete of that year,
# event and team

df_individual = df_clean[df_clean.sport_type == 'individual']
df_team = df_clean[df_clean.sport_type == 'team']

df_team_grouped = df_team.groupby(['Year', 'NOC', 'Event']).agg({'Age': 'mean',
                                                                 'Height': 'mean',
                                                                 'Weight': 'mean'})

df_team_grouped.reset_index(level=['Year', 'NOC', 'Event'], inplace=True)

df_team = df_team_grouped.merge(df_team.groupby(['Year', 'NOC', 'Event']).head(1).reset_index(drop=True), 
                      on=['Year', 'NOC', 'Event'], suffixes=(None, '_individual'))

df_team.drop(columns=['Age_individual', 'Height_individual', 'Weight_individual'], inplace=True)

df_clean = pd.concat((df_individual, df_team), ignore_index=True)

In [25]:
# Number of total medal for each year
total_medals = {}
years = df_clean.Year.unique()
years.sort()
for i in years:
    total_medals[i]= df_clean[df_clean.Year == i ].Medal.count()

The independent variables are created in the following cell:

In [26]:
df_final = pd.DataFrame({ 'Team': [] ,'Year': [],
                   'Medals_winning_rate': [], 'prehost' : [], 'host' : [], 'posthost' : [] })
team_column = []
for j in host_countries:    
    df = pd.DataFrame(df_clean[df_clean.Team_NOC == j].groupby('Year').Medal.count()).rename(columns={'Medal':'Medals_winning_rate'}).reset_index()
    host_column = []
    for i in range(len(df)):
        team_column.append(j)
        if j in host_df_dict[df.Year.iloc[i]]:
            host_column.append(1)
        else:
            host_column.append(0)
    df['prehost'] = np.zeros(len(df))
    df['host'] = host_column
    df['posthost'] = np.zeros(len(df))
    df['prehost'] = np.append(list(df['host'])[1:],list([0]))
    df['posthost'] = np.append(list([0]), list(df['host'])[:-1])
    tot_med_col = []
    for i in range(len(df)):
        tot_med_col.append(total_medals[df.Year.iloc[i]])
    df['Medals_winning_rate'] = df['Medals_winning_rate']/tot_med_col
    df_final = pd.concat([df_final, df])
df_final['Team'] = team_column
# Convert the Year column to integer
df_final['Year'] = df_final['Year'].astype(int)
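
The `prehost` and `posthost` columns built in the cell above are the `host` column moved one row up and one row down. A compact way to express this, sketched here on an invented four-Games series, is `Series.shift` with `fill_value=0`.

```python
import pandas as pd

# Invented series for one team, sorted by year; the team hosts in 1952
toy = pd.DataFrame({"Year": [1948, 1952, 1956, 1960],
                    "host": [0, 1, 0, 0]})

# prehost is 1 in the row just before a hosting row
toy["prehost"] = toy["host"].shift(-1, fill_value=0)
# posthost is 1 in the row just after a hosting row
toy["posthost"] = toy["host"].shift(1, fill_value=0)
print(toy)
```

As in the notebook, "before" and "after" refer to adjacent rows for that team, so the rows need to be sorted by year.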

In [27]:
df_final.head(5)

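| | Team | Year | Medals_winning_rate | prehost | host | posthost |
|---|---|---|---|---|---|---|
| 0 | GBR | 1948 | 0.062937 | 0.0 | 1.0 | 0.0 |
| 1 | GBR | 1952 | 0.024887 | 0.0 | 0.0 | 1.0 |
| 2 | GBR | 1956 | 0.052863 | 0.0 | 0.0 | 0.0 |
| 3 | GBR | 1960 | 0.044944 | 0.0 | 0.0 | 0.0 |
| 4 | GBR | 1964 | 0.03666 | 0.0 | 0.0 | 0.0 |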


In [28]:
y = df_final['Medals_winning_rate']
X = df_final.drop(['Year', 'Medals_winning_rate', 'Team' ], axis=1)
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary2())

Optimization terminated successfully.
         Current function value: 0.577850
         Iterations 7
                          Results: Logit
Model:              Logit               Pseudo R-squared: inf     
Dependent Variable: Medals_winning_rate AIC:              328.4401
Date:               2020-12-27 14:14    BIC:              339.3338
No. Observations:   279                 Log-Likelihood:   -161.22 
Df Model:           2                   LL-Null:          0.0000  
Df Residuals:       276                 LLR p-value:      1.0000  
Converged:          1.0000              Scale:            1.0000  
No. Iterations:     7.0000                                        
--------------------------------------------------------------------
            Coef.    Std.Err.      z      P>|z|     [0.025    0.975]
--------------------------------------------------------------------
prehost    -2.7592     0.9959   -2.7707   0.0056   -4.7111   -0.8074
host       -2.3911     0.8277   -2.8888   0.0

  return 1 - self.llf/self.llnull


## 5. Discussion about the results

This approach was first tried at the team level, with a separate model fitted for each country, but none of those models gave p-values low enough to show that the variables are significant. A combined model was then fitted, which increased the number of data points. In the combined results, the p-values are low enough to call all the x variables significant. However, the Pseudo R-squared value is reported as infinity. An R-squared value close to 1.0 indicates that the fitted model predicts well, whereas a negative or infinite value is a sign of a problematic model. We did not find a thorough explanation for this problem; the explanation we did find attributes it to over-specification of the selected model. The suggested remedies are collecting extra data or dropping some variables. In our case the countries were already merged to provide more data, so extra data collection has in that sense already been tried.
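
The infinite Pseudo R-squared can be traced in the summary output itself: LL-Null is printed as 0.0000, and the warning line points at the expression `1 - self.llf/self.llnull`. The sketch below reproduces that arithmetic with NumPy floats. The function name `mcfadden_pseudo_r2` and the second, non-zero null log-likelihood are assumptions for illustration.

```python
import numpy as np

def mcfadden_pseudo_r2(llf, llnull):
    """McFadden pseudo R-squared: 1 - llf / llnull."""
    # Silence the divide-by-zero warning so the inf result is returned quietly
    with np.errstate(divide="ignore", invalid="ignore"):
        return 1.0 - np.float64(llf) / np.float64(llnull)

# Values printed in the summary: Log-Likelihood -161.22, LL-Null 0.0000
print(mcfadden_pseudo_r2(-161.22, 0.0))

# Hypothetical non-zero null log-likelihood, for contrast
print(mcfadden_pseudo_r2(-161.22, -190.0))
```

This only shows where the infinity comes from arithmetically; why the null log-likelihood is zero is a separate question. Two things worth checking are that `X` contains no constant column (`sm.Logit` does not add one automatically) and that `y` is a proportion, not a 0/1 label.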
One criticism is that merging different countries into the same dataset could introduce bias. For instance, even in a year when the USA was neither pre-host, host nor post-host, its y value could still be higher than that of a country that generally competes in fewer sports.
To conclude, different approaches and models were tried to see whether hosting confers an advantage. Based on the performance measures, none of the models is strong enough to show that hosting brings an advantage to the teams.