# Injury Analysis using Open Source Procedures in SAS

## Data

### PROC SQL

First, we must access our data. We will use some good old-fashioned SAS to do this.

In [3]:
/* Import dataset */
filename injury url "https://raw.githubusercontent.com/rachelnisbet/SAS-Procs-Oh-My/main/injury_data.csv";

proc import datafile=injury
    out=injury_data dbms=csv replace;
run;

Now on to PROC SQL! We will start with a simple query that selects every variable from the injury dataset but keeps only players over the age of 35.

In [4]:
/* PROC SQL: Using SQL queries */
proc sql;
  select *
  from injury_data
  where Player_Age > 35;
quit;

Player_Age,Player_Weight,Player_Height,Previous_Injuries,Training_Intensity,Recovery_Time,Likelihood_of_Injury
37,70.996271268,174.58165012,0,0.226521626,6,1
38,75.820548712,206.6318235,1,0.359208747,4,0
36,79.038205587,181.52315514,1,0.8206961608,3,1
38,90.097713125,179.17352215,0,0.3625597966,3,0
39,87.869749638,175.51619781,1,0.0846883419,2,1
38,70.553624418,200.10009289,1,0.4666931209,6,1
38,88.859033213,193.97226423,0,0.8084057389,3,0
39,77.809643209,179.4111819,0,0.4591240233,2,0
36,76.805308418,165.53871967,0,0.0326590608,1,1
37,72.971537803,176.39526596,1,0.407748105,1,0


Many of our first queries will be to understand the data, but next we will likely want to do some further analysis.
For instance, we might want to understand the relationship between whether a player has had previous injuries and how long their recovery takes. To help with this, we will compute the mean training intensity for each combination of the two to see how the groups differ.

In [8]:
proc sql;
    select previous_injuries, recovery_time, mean(training_intensity) as Avg_Intensity
    from injury_data
    group by recovery_time, previous_injuries;
quit;

Previous_Injuries,Recovery_Time,Avg_Intensity
0,1,0.479132
1,1,0.528382
0,2,0.530228
1,2,0.467429
0,3,0.498622
1,3,0.4916
0,4,0.501535
1,4,0.500312
0,5,0.471778
1,5,0.449271


### PROC PYTHON

We will first import a necessary package, pandas, and then read in the injury dataset.

In [None]:
proc python;
submit;

import pandas as pd

# Read the injury CSV file into a pandas DataFrame
injury_df = pd.read_csv("https://raw.githubusercontent.com/rachelnisbet/SAS-Procs-Oh-My/main/injury_data.csv")

endsubmit;
run;

For a comparison with our PROC SQL above, let's repeat the first step: filter the data to players over the age of 35 and print the first few rows.

In [None]:
proc python;
submit;

# Filter data where Player_Age > 35 using Python
filtered_injury = injury_df[injury_df['Player_Age'] > 35]

# Print the filtered DataFrame
print(filtered_injury.head())

endsubmit;
run;

More than one way to do anything in SAS!!
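
In the same spirit, the grouped average from the second PROC SQL query can be sketched in pandas. This is only an illustration: the `sample` DataFrame below is a stand-in built from three columns of the ten rows printed by the first PROC SQL query, so the averages will not match the full-dataset output. Inside SAS, the body would sit between `submit;` and `endsubmit;`, with `injury_df` in place of `sample`.

```python
import pandas as pd

# Stand-in data: three columns from the ten rows printed by the
# first PROC SQL query (players over 35), not the full dataset.
sample = pd.DataFrame({
    "Previous_Injuries": [0, 1, 1, 0, 1, 1, 0, 0, 0, 1],
    "Recovery_Time": [6, 4, 3, 3, 2, 6, 3, 2, 1, 1],
    "Training_Intensity": [
        0.226521626, 0.359208747, 0.8206961608, 0.3625597966, 0.0846883419,
        0.4666931209, 0.8084057389, 0.4591240233, 0.0326590608, 0.407748105,
    ],
})

# Equivalent of:
#   select previous_injuries, recovery_time, mean(training_intensity)
#   group by recovery_time, previous_injuries
avg_intensity = (
    sample.groupby(["Recovery_Time", "Previous_Injuries"], as_index=False)["Training_Intensity"]
    .mean()
    .rename(columns={"Training_Intensity": "Avg_Intensity"})
)

print(avg_intensity)
```

Passing `as_index=False` keeps the grouping variables as ordinary columns, which makes the result look like the PROC SQL output rather than a frame with a two-level index.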

Now let's take advantage of the pandas describe() method, which calculates summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numeric variable.

In [None]:
proc python;
submit;

# Calculate and print summary statistics
summary = injury_df.describe()
print(summary)

endsubmit;
run;

Python has robust capabilities for analyzing variables, so let's take it up a notch with a correlation analysis and, in particular, look at how each factor correlates with the likelihood of injury.

In [None]:
proc python;
submit;

# Calculate correlation matrix
correlation_matrix = injury_df.corr()

# Print correlation matrix
print("Correlation Matrix:")
print(correlation_matrix)

# Extract correlation with 'Likelihood_of_Injury' variable
likelihood_correlation = correlation_matrix['Likelihood_of_Injury']

# Print correlation with 'Likelihood_of_Injury'
print("\nCorrelation with Likelihood_of_Injury:")
print(likelihood_correlation)

endsubmit;
run;
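
One possible follow-up is to rank the factors by how strongly they correlate with `Likelihood_of_Injury`. The sketch below assumes the same approach as the cell above but runs on a stand-in `sample` DataFrame (four columns from the ten rows printed by the first PROC SQL query), so the values are far too few to draw conclusions from; swap in `injury_df` for a real analysis.

```python
import pandas as pd

# Stand-in data: four columns from the ten rows printed by the
# first PROC SQL query, not the full dataset.
sample = pd.DataFrame({
    "Player_Age": [37, 38, 36, 38, 39, 38, 38, 39, 36, 37],
    "Training_Intensity": [
        0.226521626, 0.359208747, 0.8206961608, 0.3625597966, 0.0846883419,
        0.4666931209, 0.8084057389, 0.4591240233, 0.0326590608, 0.407748105,
    ],
    "Recovery_Time": [6, 4, 3, 3, 2, 6, 3, 2, 1, 1],
    "Likelihood_of_Injury": [1, 0, 1, 0, 1, 1, 0, 0, 1, 0],
})

# Correlation of every factor with the target, dropping the
# target's trivial correlation with itself.
corr_with_target = sample.corr()["Likelihood_of_Injury"].drop("Likelihood_of_Injury")

# Order factors by absolute correlation, strongest first,
# while keeping the sign of each coefficient.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)

print(ranked)
```

Sorting on the absolute value treats strong negative and strong positive relationships alike, while the printed coefficients still show the direction.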

### PROC IML

Let's do some R programming!

Since we are now in IML land, we need to read our dataset into an R data frame before we can use it.

In [None]:
/* PROC IML: Using the interactive matrix language */

proc iml;
submit / R;

# Read in the CSV file
injury_data <- read.csv("https://raw.githubusercontent.com/rachelnisbet/SAS-Procs-Oh-My/main/injury_data.csv")

endsubmit;
run;

We've seen it in SQL and we've seen it in Python, so we MUST see how to filter and display a dataset in R.

In [None]:
/* PROC IML: Using the interactive matrix language */

proc iml;
submit / R;

# Filter the data to players over 35, matching the PROC SQL and PROC PYTHON filters
filtered_data <- injury_data[injury_data$Player_Age > 35, ]
 
# Print data
print(filtered_data)

endsubmit;
run;

R also makes it quick and easy to compute summary statistics.

In [None]:
/* PROC IML: Using the interactive matrix language */

proc iml;
submit / R;

# Summary statistics
summary_stats <- summary(injury_data)      

# Print summary statistics
print("Here comes summary stats....")
print(summary_stats)

endsubmit;
run;

Our variable, Player_Age, can be quite helpful when determining which players may be more susceptible to injury, but it's difficult to analyze by individual ages - I mean, is 29 any different than 30?! Just kidding, don't answer that!

To better understand how age may play a role in injuries, we will section the data into age ranges.

In [None]:
/* PROC IML: Using the interactive matrix language */

proc iml;
submit / R;

# Define age groups: 18-25, 26-35, and 36+ (cut() intervals exclude the left break and include the right)
age_breaks <- c(17, 25, 35, max(injury_data$Player_Age))
age_labels <- c("18-25", "26-35", "36+")
age_groups <- cut(injury_data$Player_Age, breaks = age_breaks, labels = age_labels)


# Add age groups as a new column in the dataset
injury_data$Age_Group <- age_groups

endsubmit;

run;

Now that we have our new Age_Group column available, let's look at the frequency of players in each age range. 

In [None]:
/* PROC IML: Using the interactive matrix language */

proc iml;
submit / R; 

# Calculate and print frequencies for Age Group
summary_stats2 <- summary(injury_data$Age_Group)
print("Here comes Age Group freq counts....")
print(summary_stats2)

endsubmit;
run;
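
For comparison with the PROC PYTHON section, the same binning and frequency count can be sketched in pandas with `pd.cut`. The ages below are hypothetical values chosen to sit on both sides of each bin edge; they are not taken from the dataset. The sketch also uses an open-ended upper bound instead of the observed maximum, which is a deliberate difference from the R code above.

```python
import pandas as pd

# Hypothetical ages picked to land on each side of the bin edges
# (illustrative only, not rows from the injury dataset).
ages = pd.Series([18, 25, 26, 35, 36, 39], name="Player_Age")

# Like R's cut(), pd.cut() uses right-closed intervals by default:
# (17, 25], (25, 35], (35, inf].
age_group = pd.cut(
    ages,
    bins=[17, 25, 35, float("inf")],
    labels=["18-25", "26-35", "36+"],
)

# Frequency of players per age group, listed in bin order.
counts = age_group.value_counts().sort_index()

print(counts)
```

An infinite upper bound means the bins stay valid even if every player in the data is 35 or younger, a case where `max(injury_data$Player_Age)` would no longer give strictly increasing breaks.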

That concludes our analysis for now - but there's plenty more to explore!