![123](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRHYA1RK9WHQcr3Wzlk3s9hYidWYFvDoshYsQ&usqp=CAU)

# Introduction
**The popularity of running continues to increase, which means that the incidence of running-related injuries will probably also continue to increase. Little is known about risk factors for running injuries and whether they are sex-specific.
Longitudinal cohort studies with a minimal follow-up of 1 month that investigated the association between risk factors (personal factors, running/training factors and/or health and lifestyle factors) and the occurrence of lower limb injuries in runners were included.
Of 400 articles retrieved, 15 longitudinal studies were included, of which 11 were considered high-quality studies and 4 moderate-quality studies. Overall, women were at lower risk than men for sustaining running-related injuries. Strong and moderate evidence was found that a history of previous injury and of having used orthotics/inserts was associated with an increased risk of running injuries. Age, previous sports activity, running on a concrete surface, participating in a marathon, weekly running distance (30–39 miles) and wearing running shoes for 4 to 6 months were associated with a greater risk of injury in women than in men. A history of previous injuries, having a running experience of 0–2 years, restarting running, weekly running distance (20–29 miles) and having a running distance of more than 40 miles per week were associated with a greater risk of running-related injury in men than in women.**


# Import libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling as pp

# Read Dataset

In [None]:
day_df=pd.read_csv('../input/injury-prediction-for-competitive-runners/day_approach_maskedID_timeseries.csv')

**Explore the data**

In [None]:
day_df.head(10)

In [None]:
# set seed for reproducibility
np.random.seed(0) 

In [None]:
#get the number of missing data points per column
missing_values_count = day_df.isnull().sum()
missing_values_count

In [None]:
columns_with_na_dropped = day_df.dropna(axis=1)
columns_with_na_dropped.head()

In [None]:
day_df.info()

In [None]:
day_df.describe()

In [None]:
print("Columns in original dataset: %d \n" % day_df.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1])


In [None]:
day_df.columns

In [None]:
day_df.count()

In [None]:
day_df.sum()

In [None]:
# get all the unique values in the 'total km' column
total_km_values = day_df['total km'].unique()

# sort them in ascending order and then take a closer look
total_km_values.sort()
total_km_values

In [None]:
day_df['Date'].plot.hist()

In [None]:
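# KNN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 3)
# features: every column except 'injury'; target: the 'injury' column
x,y = day_df.loc[:,day_df.columns != 'injury'], day_df.loc[:,'injury']
knn.fit(x,y)
# predictions on the same rows the model was fitted on (not a held-out evaluation)
prediction = knn.predict(x)
print('Prediction: {}'.format(prediction))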

In [None]:
# train test split
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.3,random_state = 1)
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(x_train,y_train)
prediction = knn.predict(x_test)
#print('Prediction: {}'.format(prediction))
print('With KNN (K=3) accuracy is: ',knn.score(x_test,y_test)) # accuracy

An accuracy of 98% looks high, but is it good? That is not clear yet; we will come back to it at the end of the tutorial.
The next question is why we chose K = 3, and how to choose K in general. The answer lies in model complexity.

In [None]:
# Model complexity
neig = np.arange(1, 25)
train_accuracy = []
test_accuracy = []
# Loop over different values of k
for k in neig:
    # k runs from 1 to 24 (25 is excluded)
    knn = KNeighborsClassifier(n_neighbors=k)
    # Fit with knn
    knn.fit(x_train,y_train)
    #train accuracy
    train_accuracy.append(knn.score(x_train, y_train))
    # test accuracy
    test_accuracy.append(knn.score(x_test, y_test))

# Plot
plt.figure(figsize=[13,8])
plt.plot(neig, test_accuracy, label = 'Testing Accuracy')
plt.plot(neig, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.title('k-value VS Accuracy')
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.xticks(neig)
plt.savefig('graph.png')
plt.show()
print("Best accuracy is {} with K = {}".format(np.max(test_accuracy),1+test_accuracy.index(np.max(test_accuracy))))

In [None]:
# use 'Date' as the feature and 'Athlete ID' as the target variable
x = np.array(day_df.loc[:,'Date']).reshape(-1,1)
y = np.array(day_df.loc[:,'Athlete ID']).reshape(-1,1)
# Scatter
plt.figure(figsize=[10,10])
plt.scatter(x=x,y=y)
plt.xlabel('Date')
plt.ylabel('Athlete ID')
plt.show()

In [None]:
# LinearRegression
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
# Predict space
predict_space = np.linspace(x.min(), x.max()).reshape(-1,1)
# Fit
reg.fit(x,y)
# Predict
predicted = reg.predict(predict_space)
# R^2 
print('R^2 score: ',reg.score(x, y))
# Plot regression line and scatter
plt.plot(predict_space, predicted, color='black', linewidth=3)
plt.scatter(x=x,y=y)
plt.xlabel('Date')
plt.ylabel('Athlete ID')
plt.show()

In [None]:
# CV
from sklearn.model_selection import cross_val_score
reg = LinearRegression()
k = 5
cv_result = cross_val_score(reg,x,y,cv=k) # uses R^2 as score 
print('CV Scores: ',cv_result)
print('CV scores average: ',np.mean(cv_result))

In [None]:
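# Ridge
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state = 2, test_size = 0.3)
# Ridge no longer accepts normalize (removed in scikit-learn 1.2); scale in a pipeline instead, scores may differ
ridge = make_pipeline(StandardScaler(), Ridge(alpha = 0.1))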
ridge.fit(x_train,y_train)
ridge_predict = ridge.predict(x_test)
print('Ridge score: ',ridge.score(x_test,y_test))


In [None]:
# Confusion matrix with random forest
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
x,y = day_df.loc[:,day_df.columns != 'injury'], day_df.loc[:,'injury']
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.3,random_state = 1)
rf = RandomForestClassifier(random_state = 4)
rf.fit(x_train,y_train)
y_pred = rf.predict(x_test)
cm = confusion_matrix(y_test,y_pred)
print('Confusion matrix: \n',cm)
print('Classification report: \n',classification_report(y_test,y_pred))

In [None]:
# visualize with seaborn library
sns.heatmap(cm,annot=True,fmt="d") 
plt.show()
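
The confusion matrix above can be read more directly by turning its four cells into per-class metrics. The following is a minimal sketch, assuming the binary layout that scikit-learn's `confusion_matrix` produces for labels 0 and 1 (rows are true labels, columns are predicted labels). The helper `binary_metrics` and the counts in `example_cm` are hypothetical and are not results from this dataset. It is written in plain Python so it runs without the CSV file.

```python
def binary_metrics(cm):
    """Return accuracy, precision, recall and F1 for the positive class.

    cm is a 2x2 matrix laid out as [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# hypothetical counts, chosen only to illustrate an imbalanced problem
example_cm = [[980, 5], [12, 3]]
metrics = binary_metrics(example_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the accuracy is high while recall for the positive class is low, which is the kind of gap the classification report is meant to expose. In the notebook, `binary_metrics(cm)` could be called on the `cm` computed above, provided `injury` has exactly two classes.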

In [None]:
sns.countplot(x="injury", data=day_df)
day_df.loc[:,'injury'].value_counts()
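
The class counts above help answer the earlier question of whether a high accuracy is actually good: a model that always predicts the most frequent class already reaches the share of that class. A minimal sketch of that baseline follows. The helper `majority_baseline_accuracy` and the labels in `example_labels` are hypothetical and only illustrate the idea; they are not the counts from this dataset.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    most_common_label, most_common_count = counts.most_common(1)[0]
    return most_common_label, most_common_count / len(labels)


# hypothetical labels: 985 days without injury (0) and 15 days with injury (1)
example_labels = [0] * 985 + [1] * 15
label, baseline = majority_baseline_accuracy(example_labels)
print(f"Always predicting {label} gives accuracy {baseline:.3f}")
```

In the notebook, `majority_baseline_accuracy(day_df['injury'].tolist())` would give the figure to compare the KNN accuracy against. A classifier is only informative to the extent that it beats this number.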

In [None]:
sns.countplot(x="Date", data=day_df)
day_df.loc[:,'Date'].value_counts()