# Data Science Project - Basketball Players Analysis

# WARNING: Do not re-run this Jupyter Notebook. The data is confidential and not included in this submission.

## Background and Motivation

- Basketball is one of the most popular sports in the U.S., and game day is one of the most important days in any sport. Prathusha did an internship with the basketball team, and this fascinated the rest of the group. Given our interest in basketball and the importance of game day, we try to predict our home team's game-day performance from a variety of factors.
- A variety of factors are taken into consideration for the analysis and prediction, for example, players’ performance leading up to game day, injuries, and the influence of a player on the overall team performance.


## Project Objectives

- We predict the performance of individual players as game day approaches, based on factors such as injury, player involvement, and other measures of performance for each practice. Game day is stressful, but it is also an important day to perform well. Without adding more pressure on the team, our analytics aim to help the home team win by tracking its performance on each day.
- We will use injury, the duration of each player’s practice, and each player’s performance as game day approaches in supervised and unsupervised analyses. These analyses aim to maximize the performance achieved on game day.
- Implementing strategic analysis methods should increase the probability of the team winning future games and help reduce injuries.


## Input data

In [None]:
# imports and setup 

import pandas as pd
import numpy as np
import statsmodels.formula.api as sm

from sklearn.cluster import KMeans, AgglomerativeClustering

from sklearn import tree, svm, metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score, KFold
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

import nltk
from nltk.corpus import stopwords

import re

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 6)
plt.style.use('ggplot')

In [None]:
analysis_data = pd.read_csv("Prathusha CS Project Data.csv")

In [None]:
game_data = pd.read_csv("CS Project Game Logs.csv", header=1)

In [None]:
# Here we remove any empty columns
game_data = game_data.dropna(axis=1, how='all')
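
As a minimal sketch of what the `dropna(axis=1, how='all')` call above does, the toy frame below (hypothetical column names and values, not the confidential game log) keeps a column that is only partly missing and drops one that is entirely empty:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the confidential game log.
toy_log = pd.DataFrame({
    "Opponent": ["A", "B", "C"],
    "PTS": [71.0, np.nan, 65.0],             # partly missing: kept
    "Unnamed: 3": [np.nan, np.nan, np.nan],  # entirely missing: dropped
})

# axis=1 works on columns; how="all" drops a column only if every value is NaN.
cleaned = toy_log.dropna(axis=1, how="all")
print(list(cleaned.columns))
```

Rows with individual missing values are left untouched by this call, so later steps still need to handle them.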

## Exploring the Data

In [None]:
analysis_data.describe()

There are ~5500 records. Very few of them involve an injury, and most of the recorded data is for practices.

In [None]:
analysis_data.info()

In [None]:
num_ath = len(analysis_data.PlayerID.unique())
print("There are " + str(num_ath) + " athletes on the Basketball team.")

In [None]:
game_data.describe()

In [None]:
game_data.info()

The game log file contains 62 records, covering the home team's wins and losses and its performance against different opponents.

In [None]:
sm.ols(formula="trimp ~ InjuryStatus", data=analysis_data).fit().summary()

Injuries on their own do not affect the performance of an athlete.
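
A quick complementary check to the regression is to compare the distribution of `trimp` across injury statuses. The sketch below uses made-up values and assumes `InjuryStatus` is a small set of numeric codes; the real coding is not documented here.

```python
import pandas as pd

# Hypothetical values; only the column names come from the notebook.
toy = pd.DataFrame({
    "InjuryStatus": [0, 0, 0, 1, 1, 2],
    "trimp": [120.0, 80.0, 100.0, 90.0, 110.0, 95.0],
})

# Count, mean, and spread of trimp for each injury status.
summary = toy.groupby("InjuryStatus")["trimp"].agg(["count", "mean", "std"])
print(summary)
```

If the group means are similar and the injured groups are very small, a non-significant coefficient may reflect too few injury records rather than the absence of an effect.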

## Data Visualization

### Data Exploration Of Practices VS Game VS Off Days

In [None]:
# Group the data by date and ActivityType and count the records of each ActivityType per day
date_activity = analysis_data.groupby(['date','ActivityType']).size()
# Make a DataFrame from the grouped data
activity_frame = pd.DataFrame(date_activity.reset_index())
activity_frame.columns = ['date','ActivityType','sum']
# Convert date to datetime and sort by it (assign the result so the sort is kept)
activity_frame['date'] = pd.to_datetime(activity_frame.date)
activity_frame = activity_frame.sort_values(['date']).reset_index(drop=True)
activity_frame

In [None]:
# Because there are so many dates, we group them by month.
activity_frame['year'] = activity_frame['date'].dt.year
activity_frame['month'] = activity_frame['date'].dt.month
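
The counting and date-splitting steps above can be sketched on a small made-up frame (the dates and activity codes below are hypothetical):

```python
import pandas as pd

# Hypothetical records: one row per player per day.
toy = pd.DataFrame({
    "date": ["2017-10-30", "2017-10-30", "2017-10-30", "2017-11-02"],
    "ActivityType": [1.0, 1.0, 0.0, 2.0],
})

# Count records per (date, ActivityType); name the count column 'sum'
# to match the notebook's naming.
counts = toy.groupby(["date", "ActivityType"]).size().reset_index(name="sum")

# Parse the dates, then pull out year and month with the .dt accessor.
counts["date"] = pd.to_datetime(counts["date"])
counts["year"] = counts["date"].dt.year
counts["month"] = counts["date"].dt.month
print(counts)
```

Note that the `sum` column holds record counts per day, not a sum of any measured quantity.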

In [None]:
# Sum the off-day counts by year and month.
# Note: reset_index(drop=True) discards the (year, month) labels, so each row
# is identified only by its chronological position.
off = activity_frame.loc[activity_frame['ActivityType'] == 0.0]
off = off.groupby(['year','month'])['sum'].sum().reset_index(drop=True)
off = pd.DataFrame(off)

# Sum the practice-day counts by year and month
practice = activity_frame.loc[activity_frame['ActivityType'] == 1.0]
practice = practice.groupby(['year','month'])['sum'].sum().reset_index(drop=True)
practice = pd.DataFrame(practice)

# Sum the game-day counts by year and month
game = activity_frame.loc[activity_frame['ActivityType'] == 2.0]
game = game.groupby(['year','month'])['sum'].sum().reset_index(drop=True)
game = pd.DataFrame(game)

In [None]:
# Make a grouped bar plot with one bar per activity type at each month position
plt.figure(figsize=(20,10))
plt.bar(off.index, off['sum'], width = 0.3,label='Off Days')
plt.bar(practice.index + 0.3, practice['sum'],  width = 0.3,label='Practice Days')
plt.bar(game.index + 0.6, game['sum'],  width = 0.3,label='Game Days')
plt.xlabel('Month (chronological position)')
plt.ylabel('Number of records')
plt.legend()
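
Because `reset_index(drop=True)` discards the (year, month) labels, bars at the same x position refer to the same month only if every activity type occurs in every month. One possible alternative, sketched here on hypothetical counts, is to pivot the monthly totals into one table so that missing combinations become explicit zeros:

```python
import pandas as pd

# Hypothetical counts in the same layout as activity_frame.
toy_activity = pd.DataFrame({
    "year": [2017, 2017, 2017, 2017, 2018],
    "month": [10, 10, 11, 12, 1],
    "ActivityType": [0.0, 1.0, 1.0, 2.0, 1.0],
    "sum": [3, 40, 55, 12, 30],
})

# One row per (year, month), one column per activity type,
# with 0 where an activity type did not occur in that month.
monthly = toy_activity.pivot_table(
    index=["year", "month"],
    columns="ActivityType",
    values="sum",
    aggfunc="sum",
    fill_value=0,
)
print(monthly)
```

Calling `monthly.plot.bar()` on such a table draws grouped bars that share a labeled (year, month) axis.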

### Visualization of individual players data

In [None]:
pd.plotting.scatter_matrix(analysis_data, figsize=(12, 12), diagonal='kde')

Here we can see that the longer an injury lasts and the more serious the injury type, the lower the players' performance (measured by trimp). However, when there are no injuries, a player's performance varies and cannot be easily predicted from injury data alone.

### Influence of injury on team performance

In [None]:
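# Rows whose first column is "TEAM" are team-level aggregates; .copy() makes the subset safe to modify later
team_data = analysis_data[analysis_data.iloc[:, 0] == "TEAM"].copy()

In [None]:
# All remaining rows are individual players
player_data = analysis_data[analysis_data.iloc[:, 0] != "TEAM"].copy()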
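
The split above can be illustrated on a toy frame. This sketch assumes the first column is `PlayerID` and that team-level rows carry the literal value "TEAM"; the data and the `trimp_scaled` column are hypothetical.

```python
import pandas as pd

# Hypothetical rows mixing team aggregates and individual players.
toy = pd.DataFrame({
    "PlayerID": ["TEAM", "P1", "P2", "TEAM"],
    "trimp": [300.0, 90.0, 110.0, 280.0],
})

# A boolean mask selects the team rows; its negation selects the players.
is_team = toy["PlayerID"] == "TEAM"
toy_team = toy[is_team].copy()
toy_players = toy[~is_team].copy()

# Adding a column to the copy leaves the original frame unchanged.
toy_team["trimp_scaled"] = toy_team["trimp"] / 100
print(toy_team)
```

Taking an explicit copy is one way to avoid pandas' chained-assignment warning when a subset is modified later.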

In [None]:
plt.scatter(x=player_data['InjuryStatus'],y=player_data['trimp'],c='r',marker='s')

plt.xlabel('Injury Status')
plt.ylabel('Trimp')

The plot again shows potential for better performance when there are no injuries, and decreasing performance the longer an injury lasts.

Note: We do not have access to the units for any of these variables.

In [None]:
plt.scatter(x=player_data['InjuryType'],y=player_data['trimp'],c='r',marker='s')

plt.xlabel('Injury Type')
plt.ylabel('Trimp')

Each player's performance also decreases as the injury type becomes more severe.

In [None]:
plt.scatter(x=team_data['InjuryType'],y=team_data['trimp'],c='r',marker='s')

plt.xlabel('InjuryType')
plt.ylabel('Trimp')

The team's performance also decreases as the injury type becomes more severe.

#### Influence of Injury status on the team overall performance

From the graph above, it is clear that team performance is better when there is no injury.

### Prediction of performance on game day

In [None]:
team_game_data = team_data[team_data.loc[:, "ActivityType"] == 2]

In [None]:
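# Restrict to numeric columns; recent pandas versions raise an error on text columns otherwise
team_game_data.corr(numeric_only=True)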

In [None]:
sm.ols(formula="trimp ~ rpe + dur", data=team_game_data).fit().summary()

The model shows that rpe and duration are good predictors of performance on game day. It makes sense that rpe is a good predictor because it reflects how hard the players feel they exerted themselves.

This model does not guarantee an accurate prediction for every game day, but we can predict 73% of the games correctly with it.
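
If the 73% figure is the R-squared of the fit above (an assumption, since the summary output is not shown), it measures the share of the variance in `trimp` that the model explains rather than a fraction of games classified correctly. A minimal sketch of that calculation, using made-up numbers and plain least squares in place of `statsmodels`:

```python
import numpy as np

# Hypothetical game-day values; not taken from the confidential data.
rpe = np.array([5.0, 6.0, 7.0, 8.0, 6.5, 7.5])
dur = np.array([90.0, 100.0, 95.0, 120.0, 110.0, 105.0])
trimp = np.array([450.0, 610.0, 650.0, 980.0, 700.0, 800.0])

# Design matrix with an intercept column, matching "trimp ~ rpe + dur".
X = np.column_stack([np.ones_like(rpe), rpe, dur])
coef, *_ = np.linalg.lstsq(X, trimp, rcond=None)

# R-squared = 1 - (residual sum of squares / total sum of squares).
fitted = X @ coef
ss_res = np.sum((trimp - fitted) ** 2)
ss_tot = np.sum((trimp - trimp.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
print(round(r_squared, 3))
```

An R-squared computed on the same games used to fit the model tends to be optimistic, so a held-out set of games would give a fairer picture of predictive accuracy.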

#### Relation between performance of each player and exertion rate

In [None]:
plt.scatter(x=player_data['rpe'],y=player_data['trimp'],c='r',marker='s')

plt.xlabel('Rating of perceived exertion')
plt.ylabel('Trimp')

From the plot above, the rating of perceived exertion ranges from 8 to 10, which corresponds to hard through very hard exertion. Ratings in this range have more influence on each player's performance than the extremely hard (9) and maximum (10) exertion ratings.

In [None]:
game_data['Date']

#### Influence of activity type on performance of the team

In [None]:
plt.scatter(x=team_data['ActivityType'],y=team_data['trimp'],c='r',marker='s')

plt.xlabel('Activity Type')
plt.ylabel('Trimp')

The team's performance is better on game day than on practice days.

#### Influence of rating of perceived exertion on the performance of the team on game day

In [None]:
plt.scatter(x=team_data['rpe'],y=team_data['TPR'],c='r',marker='s')

# Axis labels match the plotted columns: rpe on x, TPR on y
plt.xlabel('Rating of perceived exertion')
plt.ylabel('Team performance rating')

From the plot above, it is clear that ratings of perceived exertion scattered between 3 (moderate practice) and 5 (hard practice) have more influence on team performance than very hard (7) and maximal (10) exertion. Since the team is composed of all the players, the overall practice exertion is counted rather than that of individual players.

In [None]:
sm.ols(formula="trimp ~ ActivityType + InjuryStatus + InjuryType + rpe + dur ", data=team_data).fit().summary()

From the model, it is clear that InjuryStatus has more influence on team performance on game day than the other factors.
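
Raw regression coefficients depend on the units of each predictor, and the units of these variables are unknown, so comparing "influence" is easier after putting every variable on the same scale. The sketch below uses simulated data (all values and effect sizes are hypothetical) and plain least squares to compute standardized coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical predictors on very different scales.
injury_status = rng.integers(0, 3, size=n).astype(float)
dur = rng.normal(100.0, 20.0, size=n)
trimp = 50.0 - 5.0 * injury_status + 4.0 * dur + rng.normal(0.0, 10.0, size=n)


def zscore(a):
    """Center to mean 0 and scale to standard deviation 1."""
    return (a - a.mean()) / a.std()


# Regress standardized trimp on standardized predictors.
X = np.column_stack([np.ones(n), zscore(injury_status), zscore(dur)])
beta, *_ = np.linalg.lstsq(X, zscore(trimp), rcond=None)
print(dict(zip(["intercept", "InjuryStatus", "dur"], np.round(beta, 3))))
```

In this simulated example the raw coefficient of `injury_status` is larger in magnitude than that of `dur`, yet `dur` dominates once both are standardized, which shows why coefficient size alone can mislead.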

In [None]:
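# Reassign instead of setting a column on a filtered subset, which can trigger SettingWithCopyWarning
team_data = team_data.assign(date=pd.to_datetime(team_data["date"]))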

### Visualization of Game log data

In [None]:
game_data.dtypes

In [None]:
game_data["Date"] = pd.to_datetime(game_data["Date"])

In [None]:
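# Restrict to numeric columns; recent pandas versions raise an error on text or date columns otherwise
game_data.corr(numeric_only=True)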

In [None]:
pd.plotting.scatter_matrix(game_data.iloc[:, 4:37], diagonal='kde')