<h1 style="text-align: center; color:#900603">Kaggle - Olympic Medals Predictions</h1>

# 1. Introduction:

#### This project predicts how many medals a country will win at an Olympic Games from the number of athletes it enters and the number of medals it won at the previous Games.
#### A linear regression model is fit on a training set and evaluated on a held-out test set with MAE, RMSE, and the R2 score.
#### The dataset is from [Kaggle - 120 years of Olympic history](https://www.kaggle.com/datasets/heesoo37/120-years-of-olympic-history-athletes-and-results)
<br>

# 2. Import necessary libraries:

In [21]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# 3. Load the data:

In [22]:
teams = pd.read_csv("teams.csv")
teams

Unnamed: 0,team,year,athletes,events,age,height,weight,prev_medals,medals
0,AFG,1964,8,8,22.0,161.0,64.2,0.0,0
1,AFG,1968,5,5,23.2,170.2,70.0,0.0,0
2,AFG,1972,8,8,29.0,168.3,63.8,0.0,0
3,AFG,1980,11,11,23.6,168.4,63.2,0.0,0
4,AFG,2004,5,5,18.6,170.8,64.8,0.0,0
...,...,...,...,...,...,...,...,...,...
2009,ZIM,2000,26,19,25.0,179.0,71.1,0.0,0
2010,ZIM,2004,14,11,25.1,177.8,70.5,0.0,3
2011,ZIM,2008,16,15,26.1,171.9,63.7,3.0,4
2012,ZIM,2012,9,8,27.3,174.4,65.2,4.0,0


# 4. Prepare X and y:

In [23]:
# Feature matrix X: number of athletes entered and medals won at the previous Games

X = teams[["athletes", "prev_medals"]].copy() 

X

Unnamed: 0,athletes,prev_medals
0,8,0.0
1,5,0.0
2,8,0.0
3,11,0.0
4,5,0.0
...,...,...
2009,26,0.0
2010,14,0.0
2011,16,3.0
2012,9,4.0


In [24]:
# Label vector y: medals won

y = teams[["medals"]].copy()

y

Unnamed: 0,medals
0,0
1,0
2,0
3,0
4,0
...,...
2009,0
2010,3
2011,4
2012,0
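
Because `y` is selected with double brackets, it is a one-column DataFrame rather than a Series. The sketch below illustrates what that changes, using six rows copied from the teams table (the names `small`, `X_small`, `lr_2d`, and `lr_1d` are illustrative and not part of the original notebook). With a 2D target, scikit-learn returns a 2D `coef_` and an array `intercept_`; with a 1D target, it returns a flat `coef_` and a scalar `intercept_`. The fitted values are the same either way.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Six rows copied from the teams table (AFG 1964/1968, ZIM 2000-2012)
small = pd.DataFrame({
    "athletes":    [8, 5, 26, 14, 16, 9],
    "prev_medals": [0.0, 0.0, 0.0, 0.0, 3.0, 4.0],
    "medals":      [0, 0, 0, 3, 4, 0],
})
X_small = small[["athletes", "prev_medals"]]

# Double brackets: the target is a DataFrame with shape (6, 1)
lr_2d = LinearRegression().fit(X_small, small[["medals"]])
# Single brackets: the target is a Series with shape (6,)
lr_1d = LinearRegression().fit(X_small, small["medals"])

# Compare the shapes of the learned parameters
print(lr_2d.coef_.shape, np.shape(lr_2d.intercept_))
print(lr_1d.coef_.shape, np.shape(lr_1d.intercept_))
```

Either form works for this project; the 2D form is simply why the intercept and coefficients appear wrapped in arrays.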


# 5. Split data into training and testing sets:

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [26]:
print(f"Counts of samples in training set: {X_train.shape[0]}")
print(f"Counts of samples in testing set: {X_test.shape[0]}")

Counts of samples in training set: 1611
Counts of samples in testing set: 403
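
The two counts follow from `test_size = 0.2`. As a rough sketch of the arithmetic (assuming that, for a fractional `test_size`, scikit-learn rounds the test count up and assigns the remaining rows to training; the variable names are illustrative):

```python
import math

n_samples = 1611 + 403   # total rows, taken from the counts printed above
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # fractional test size is rounded up
n_train = n_samples - n_test                # the remaining rows form the training set

print(n_train, n_test)
```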


# 6. Fit a Linear Regression model:

In [27]:
lr = LinearRegression()

In [28]:
lr.fit(X_train, y_train)

LinearRegression()

In [29]:
# Intercept (bias term) learned by the model
lr.intercept_

array([-1.85483677])

In [30]:
# Coefficients learned for athletes and prev_medals, in that order
lr.coef_

array([[0.06636112, 0.76017895]])

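#### The fitted model is therefore $y = -1.85483677 + 0.06636112 x_1 + 0.76017895 x_2$, where $y$ is the predicted number of medals, $x_1$ is the number of athletes, and $x_2$ is the number of medals won at the previous Games.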
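
To make the equation concrete, the sketch below applies it by hand to two rows from the table in section 3. The helper `predict_medals` is hypothetical and only restates the fitted equation; it is not part of the original notebook.

```python
# Values copied from the lr.intercept_ and lr.coef_ outputs above
INTERCEPT = -1.85483677
COEF_ATHLETES = 0.06636112
COEF_PREV_MEDALS = 0.76017895

def predict_medals(athletes, prev_medals):
    """Evaluate the fitted linear equation for one team (hypothetical helper)."""
    return INTERCEPT + COEF_ATHLETES * athletes + COEF_PREV_MEDALS * prev_medals

# ZIM 2012: 9 athletes, 4 medals at the previous Games (actual medals: 0)
zim_2012 = predict_medals(9, 4.0)
# AFG 1964: 8 athletes, no previous medals (actual medals: 0)
afg_1964 = predict_medals(8, 0.0)

print(round(zim_2012, 3), round(afg_1964, 3))
```

A linear model is unconstrained, so a small team with no previous medals gets a slightly negative prediction. If whole, non-negative medal counts are needed, one option is to clip the predictions at zero and round them.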

# 7. Model performance:

In [31]:
y_pred = lr.predict(X_test)

In [32]:
mae = mean_absolute_error(y_test, y_pred)
# RMSE is the square root of the MSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print('The model performance for testing set')
print('--------------------------------------')
print(f'MAE is {round(mae, 3)}')
print(f'RMSE is {round(rmse, 3)}')
print(f'R2 score is {round(r2*100, 2)}%')

The model performance for testing set
--------------------------------------
MAE is 4.815
RMSE is 10.883
R2 score is 89.96%
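
For reference, the three metrics can be written out directly. This is a small NumPy sketch on made-up numbers (the arrays `y_true_demo` and `y_pred_demo` are illustrative and not taken from the dataset); it is meant to show the definitions, not to reproduce the scores above.

```python
import numpy as np

# Made-up targets and predictions, for illustration only
y_true_demo = np.array([0.0, 3.0, 4.0, 0.0])
y_pred_demo = np.array([1.0, 2.0, 5.0, -1.0])

errors = y_true_demo - y_pred_demo

mae_demo = np.mean(np.abs(errors))         # mean absolute error
rmse_demo = np.sqrt(np.mean(errors ** 2))  # root mean squared error

ss_res = np.sum(errors ** 2)                              # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)  # total sum of squares around the mean
r2_demo = 1 - ss_res / ss_tot                             # share of variance explained

print(mae_demo, rmse_demo, round(r2_demo, 3))
```

An R2 of 1 means perfect predictions, and an R2 of 0 means the model does no better than always predicting the mean of the true values.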


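#### The R2 score is nearly 90%, so on the test set the model explains about 90% of the variance in medal counts. The MAE of about 4.8 medals shows the typical size of the error for a single team.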