# NBA Data Analysis

## Methodology

To gain a deeper understanding of data analysis methods, we will apply seven forms of analysis to National Basketball Association (NBA) data collected over many seasons of play. These forms of analysis are:

1. Observational Statistics
2. Visualization
3. Linear System of Equations
4. Interpolation
5. Least Squares
6. Fourier Analysis
7. Principal Components

//TO-DO: Needs teammate responsibilities

This notebook supplies our Python 3.6.3 implementation of these analysis techniques alongside supporting documentation: Markdown text, figures, and the produced output. The code runs on an IPython 6.1.0 kernel.

## Data Collection

![](resources/kaggle_logo.png)

We obtained our data from [Kaggle](https://www.kaggle.com/), the self-proclaimed "home of data science and machine learning". Kaggle hosts downloadable data sets as well as data analysis competitions.

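The raw data comes in two sets: player information and season statistics. The sets contain 3,922 and 24,691 observations, respectively. Player information has 7 attributes plus an unnamed index column (8 columns as loaded), and season statistics has 53 columns, including a row identifier.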

## Setup

To start, we will import the following resources:

- Data manipulation from `numpy` and `pandas`
- Plotting capability from `matplotlib`

Then we will load the player information and season statistics that we are looking to analyze.

In [2]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd

In [3]:
# Load player information and season statistics
player_info = pd.read_csv('resources/player-info.csv')
season_stats = pd.read_csv('resources/season-stats.csv')

Next, we verify that the data loaded correctly by checking the shape of each data frame and previewing the first few rows.

In [7]:
print(player_info.shape)
print(season_stats.shape)

(3922, 8)
(24691, 53)
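The shape check above can also be written as assertions, so that an incomplete or mismatched file fails loudly instead of being spotted by eye. The following is a minimal sketch that uses a small in-memory stand-in rather than the real CSV files; `check_shape` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def check_shape(df, expected_rows, expected_cols):
    """Raise AssertionError if the frame's shape differs from the expectation."""
    rows, cols = df.shape
    assert rows == expected_rows, f"expected {expected_rows} rows, got {rows}"
    assert cols == expected_cols, f"expected {expected_cols} columns, got {cols}"
    return True


# Tiny stand-in for player_info (the real frame is 3922 x 8).
demo = pd.DataFrame({
    "name": ["Curly Armstrong", "Cliff Barker"],
    "height": [180.0, 188.0],
})
ok = check_shape(demo, 2, 2)
```

In the notebook itself, the equivalent calls would be `check_shape(player_info, 3922, 8)` and `check_shape(season_stats, 24691, 53)`.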


In [3]:
player_info.head()

Unnamed: 0.1,Unnamed: 0,name,height,weight,college,birth_year,birth_city,birth_state
0,0,Curly Armstrong,180.0,77.0,Indiana University,1918.0,,
1,1,Cliff Barker,188.0,83.0,University of Kentucky,1921.0,Yorktown,Indiana
2,2,Leo Barnhorst,193.0,86.0,University of Notre Dame,1924.0,,
3,3,Ed Bartels,196.0,88.0,North Carolina State University,1925.0,,
4,4,Ralph Beard,178.0,79.0,University of Kentucky,1927.0,Hardinsburg,Kentucky


In [4]:
season_stats.head()

Unnamed: 0,id,year,name,pos,age,tm,g,gs,mp,per,...,ft%,orb,drb,trb,ast,stl,blk,tov,pf,pts
0,0,1950.0,Curly Armstrong,G-F,31.0,FTW,63.0,,,,...,0.705,,,,176.0,,,,217.0,458.0
1,1,1950.0,Cliff Barker,SG,29.0,INO,49.0,,,,...,0.708,,,,109.0,,,,99.0,279.0
2,2,1950.0,Leo Barnhorst,SF,25.0,CHS,67.0,,,,...,0.698,,,,140.0,,,,192.0,438.0
3,3,1950.0,Ed Bartels,F,24.0,TOT,15.0,,,,...,0.559,,,,20.0,,,,29.0,63.0
4,4,1950.0,Ed Bartels,F,24.0,DNN,13.0,,,,...,0.548,,,,20.0,,,,27.0,59.0


The previews show an unnamed index column and many missing values, so some cleaning is needed before we can begin the analysis.

## Data Cleaning

Cleaning the data will involve:

1. Re-labeling the CSV files
2. Removing the "Unnamed: 0" column from player info
3. Re-indexing the player info
4. Joining the data sets
5. Removing null observations
6. Trimming statistics that we are uninterested in

### 1. Re-label the CSV files

We re-labeled the first column of `season-stats.csv` as `id` to reflect its purpose. The first, unnamed column of `player-info.csv` does not need a label: we treat the players' names as unique identifiers and remove the unnamed column in the next step.
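As an alternative to editing the CSV file, the same re-labeling could be done in pandas after loading. This is only a sketch: the inline CSV text is a made-up stand-in for the first lines of `season-stats.csv` before re-labeling, and it relies on pandas naming an unlabeled column `Unnamed: <position>`.

```python
import io

import pandas as pd

# Stand-in for a file whose first column has an empty header.
raw_csv = io.StringIO(
    ",year,name\n"
    "0,1950.0,Curly Armstrong\n"
    "1,1950.0,Cliff Barker\n"
)
stats = pd.read_csv(raw_csv)

# The unlabeled first column is loaded as "Unnamed: 0"; give it a real name.
stats = stats.rename(columns={"Unnamed: 0": "id"})
```

Renaming in code keeps the raw file untouched, so the cleaning steps can be reproduced from a fresh download.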

### 2. Remove the "Unnamed: 0" column

In [5]:
# Drop the unnamed first column.
player_info.drop('Unnamed: 0', axis=1, inplace=True)
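One caveat with an in-place drop is that re-running the cell raises a `KeyError`, because the column is already gone. A minimal sketch of a re-runnable variant, shown on a two-row stand-in frame (`demo_info` is a hypothetical name):

```python
import pandas as pd

demo_info = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "name": ["Curly Armstrong", "Cliff Barker"],
})

# errors="ignore" turns the drop into a no-op when the column is missing.
demo_info = demo_info.drop(columns="Unnamed: 0", errors="ignore")
# Running the same line again is safe.
demo_info = demo_info.drop(columns="Unnamed: 0", errors="ignore")
```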

### 3. Re-index player info

In [6]:
# Set the dataframe index to player names.
player_info.set_index('name', inplace=True)
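Using names as the index assumes that no two players share a name. That assumption can be checked before the join. The sketch below uses a stand-in frame in which the repeated name is invented purely for illustration.

```python
import pandas as pd

demo_info = pd.DataFrame({
    "name": ["Curly Armstrong", "Cliff Barker", "Cliff Barker"],
    "height": [180.0, 188.0, 190.0],
})  # the second "Cliff Barker" row is made up for this example

# List every name that occurs more than once.
is_repeated = demo_info["name"].duplicated(keep=False)
duplicated_names = demo_info.loc[is_repeated, "name"].unique()

# set_index accepts duplicate labels silently, so check uniqueness explicitly.
indexed = demo_info.set_index("name")
names_are_unique = indexed.index.is_unique
```

If `names_are_unique` were `False` for the real data, joining on names would attach more than one player record to the same season row.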

### 4. Join the sets

In [10]:
# Join player info and season stats on the players' names.
complete_stats = season_stats.join(player_info, on="name")
complete_stats.head(3)

Unnamed: 0,id,year,name,pos,age,tm,g,gs,mp,per,...,blk,tov,pf,pts,height,weight,college,birth_year,birth_city,birth_state
0,0,1950.0,Curly Armstrong,G-F,31.0,FTW,63.0,,,,...,,,217.0,458.0,180.0,77.0,Indiana University,1918.0,,
1,1,1950.0,Cliff Barker,SG,29.0,INO,49.0,,,,...,,,99.0,279.0,188.0,83.0,University of Kentucky,1921.0,Yorktown,Indiana
2,2,1950.0,Leo Barnhorst,SF,25.0,CHS,67.0,,,,...,,,192.0,438.0,193.0,86.0,University of Notre Dame,1924.0,,
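`DataFrame.join` with `on=` performs a left join by default, so any season row whose name has no match in the player info keeps its statistics and receives NaN in the player columns. One way to audit this is `DataFrame.merge` with its `validate` and `indicator` options. The following is a sketch on stand-in data, and "Unknown Player" is an invented row.

```python
import pandas as pd

demo_stats = pd.DataFrame({
    "name": ["Curly Armstrong", "Ed Bartels", "Ed Bartels", "Unknown Player"],
    "pts": [458.0, 63.0, 59.0, 10.0],
})
demo_info = pd.DataFrame({
    "name": ["Curly Armstrong", "Ed Bartels"],
    "height": [180.0, 196.0],
})

merged = demo_stats.merge(
    demo_info,
    on="name",
    how="left",
    validate="many_to_one",  # raises MergeError if a name repeats in demo_info
    indicator=True,          # adds a "_merge" column describing each row's origin
)

# Rows that found no matching player record.
unmatched = merged[merged["_merge"] == "left_only"]
```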


### 5. Remove null observations

In [49]:
# Select rows whose name equals the string "NaN". This does not match
# missing values, which pandas stores as NaN; use isna() to find those.
complete_stats[complete_stats.name == "NaN"]

Unnamed: 0,id,year,name,pos,age,tm,g,gs,mp,per,...,blk,tov,pf,pts,height,weight,college,birth_year,birth_city,birth_state
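An empty result from a comparison with the string `"NaN"` does not show that the data is free of missing names, because a missing value never compares equal to a string. A minimal sketch of the difference, on a stand-in frame with one invented missing row:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "name": ["Curly Armstrong", np.nan, "Cliff Barker"],
    "pts": [458.0, np.nan, 279.0],
})

# Comparing against the text "NaN" finds nothing.
string_matches = demo[demo["name"] == "NaN"]

# isna() finds the row with a missing name.
missing_names = demo[demo["name"].isna()]

# dropna(subset=...) removes rows that lack a name and keeps the rest.
cleaned = demo.dropna(subset=["name"])
```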


### 6. Trim uninteresting statistics
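Which statistics to keep is not decided here, so the following is only a sketch of two possible trimming approaches on a stand-in frame. The `columns_to_keep` list is a hypothetical selection, not the authors' choice.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "year": [1950.0, 1950.0],
    "name": ["Curly Armstrong", "Cliff Barker"],
    "gs": [np.nan, np.nan],
    "mp": [np.nan, np.nan],
    "pts": [458.0, 279.0],
})

# Option 1: keep an explicit list of columns (hypothetical selection).
columns_to_keep = ["year", "name", "pts"]
trimmed = demo[columns_to_keep]

# Option 2: drop only the columns that contain no data at all.
non_empty = demo.dropna(axis=1, how="all")
```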

## Analysis

As a reminder, the seven forms of analysis we will perform are:

1. Observational Statistics
2. Visualization
3. Linear System of Equations
4. Interpolation
5. Least Squares
6. Fourier Analysis
7. Principal Components

### 1. Observational Statistics

### 2. Visualization

### 3. Linear System of Equations

### 4. Interpolation

### 5. Least Squares

### 6. Fourier Analysis

### 7. Principal Components

## Reflection