# **NBA Analysis: Extract, Transform, Load**

## Objectives

* Load data set from Kaggle: 
https://www.kaggle.com/datasets/thedevastator/historical-nba-finals-and-mvp-results/data?select=NBA+Finals+and+MVP.csv

The file was renamed to lower case, with spaces replaced by underscores, for easier use.

Credit: Tristan Malherbe (https://data.world/datatouille), as required by the usage instructions of the data set.

## Inputs

* Files used: 
nba_finals_and_mvp.csv


## Outputs

* File produced: 
nba_finals_and_mvp_1980plus.csv

## Additional Comments

* This dataset contains essential information about the wins, losses, and MVP awards spanning the long-standing rivalry between the Eastern and Western Conferences in the NBA.  
* Columns in this dataset:
  * Year: The year of the NBA Finals. (Integer)
  * Western Champion: The team that won the Western Conference. (String)
  * Eastern Champion: The team that won the Eastern Conference. (String)
  * Result: The result of the NBA Finals. (String)
  * NBA Champion: The team that won the NBA Finals. (String)
  * NBA Vice-Champion: The team that lost the NBA Finals. (String)
  * Final Sweep ?: Whether or not the NBA Finals was a sweep. (Boolean)
  * MVP Name: The name of the MVP. (String)
  * MVP Height (m): The height of the MVP in meters. (Float)
  * MVP Height (ft): The height of the MVP in feet. (Float)
  * MVP Position: The position of the MVP. (String)
  * MVP Team: The team of the MVP. (String)
  * MVP Nationality: The nationality of the MVP. (String)
  * MVP status: The status of the MVP (whether the MVP won the Championship, reached the final or not). (String)
* Note: only data from 1980 onwards is required.
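The column list above can double as a quick sanity check after loading. The following is a minimal sketch, not part of the original notebook: the helper `missing_columns` and the constant `EXPECTED_COLUMNS` are hypothetical names, and the toy frame stands in for the real CSV.

```python
import pandas as pd

# Column names as listed in the data dictionary above
EXPECTED_COLUMNS = [
    "Year", "Western Champion", "Eastern Champion", "Result",
    "NBA Champion", "NBA Vice-Champion", "Final Sweep ?", "MVP Name",
    "MVP Height (m)", "MVP Height (ft)", "MVP Position", "MVP Team",
    "MVP Nationality", "MVP status",
]


def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df, in listed order."""
    return [col for col in expected if col not in df.columns]


# Toy frame with only two of the expected columns, for illustration
sample = pd.DataFrame({"Year": [1980], "MVP Name": ["K. AJabbar"]})
absent = missing_columns(sample)
print(absent[:3])
```

An empty list from `missing_columns` would indicate that every documented column is present in the loaded frame.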




---

# Section 1

# Import necessary libraries

In [15]:
import pandas as pd

# Load CSV file

In [16]:
df1 = pd.read_csv('../data/inputs/raw/nba_finals_and_mvp.csv')

# Look at data

In [17]:
df1.head()


Unnamed: 0,index,Year,Western Champion,Eastern Champion,Result,NBA Champion,NBA Vice-Champion,Final Sweep ?,MVP Name,MVP Height (m),MVP Height (ft),MVP Position,MVP Team,MVP Nationality,MVP status
0,0,1950,Minneapolis Lakers,Syracuse Nationals,4–2,Minneapolis Lakers,Syracuse Nationals,False,,,,,,,
1,1,1951,Rochester Royals,New York Knicks,4–3,Rochester Royals,New York Knicks,False,,,,,,,
2,2,1952,Minneapolis Lakers,New York Knicks,4–3,Minneapolis Lakers,New York Knicks,False,,,,,,,
3,3,1953,Minneapolis Lakers,New York Knicks,4–1,Minneapolis Lakers,New York Knicks,False,,,,,,,
4,4,1954,Minneapolis Lakers,Syracuse Nationals,4–3,Minneapolis Lakers,Syracuse Nationals,False,,,,,,,


# Look at shape

In [18]:
df1.shape

(69, 15)

# Check data types

In [19]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69 entries, 0 to 68
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   index              69 non-null     int64  
 1   Year               69 non-null     int64  
 2   Western Champion   69 non-null     object 
 3   Eastern Champion   69 non-null     object 
 4   Result             69 non-null     object 
 5   NBA Champion       69 non-null     object 
 6   NBA Vice-Champion  69 non-null     object 
 7   Final Sweep ?      69 non-null     bool   
 8   MVP Name           63 non-null     object 
 9   MVP Height (m)     63 non-null     float64
 10  MVP Height (ft)    63 non-null     float64
 11  MVP Position       63 non-null     object 
 12  MVP Team           63 non-null     object 
 13  MVP Nationality    63 non-null     object 
 14  MVP status         63 non-null     object 
dtypes: bool(1), float64(2), int64(2), object(10)
memory usage: 7.7+ KB


# Check null values

In [20]:
df1.isnull().sum()

index                0
Year                 0
Western Champion     0
Eastern Champion     0
Result               0
NBA Champion         0
NBA Vice-Champion    0
Final Sweep ?        0
MVP Name             6
MVP Height (m)       6
MVP Height (ft)      6
MVP Position         6
MVP Team             6
MVP Nationality      6
MVP status           6
dtype: int64
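Each MVP column has the same number of missing values, which suggests that whole rows lack MVP data rather than isolated cells. One way to confirm this and list the affected years is sketched below. It runs on a toy frame built from a few rows shown in this notebook, not on the full data set, and the variable names are illustrative.

```python
import pandas as pd

# Toy frame mirroring a few rows shown in this notebook (not the full data set)
sample = pd.DataFrame({
    "Year": [1950, 1951, 1980, 1981],
    "MVP Name": [None, None, "K. AJabbar", "J. Erving"],
    "MVP Height (m)": [None, None, 2.18, 2.01],
})

# Select every column whose name starts with "MVP"
mvp_cols = [col for col in sample.columns if col.startswith("MVP")]

# True for rows where all MVP columns are missing at once
all_missing = sample[mvp_cols].isnull().all(axis=1)

# Years of the rows with no MVP data at all
missing_years = sample.loc[all_missing, "Year"].tolist()
print(missing_years)
```

If the same check on `df1` returns only years before 1980, the year filter applied later removes every incomplete row.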

# Check for duplicates

In [21]:
df1.duplicated().sum()

0

# Only data from 1980 onwards is being used, so anything before this date needs to be removed.
# Filter rows where the year is 1980 or later

In [22]:
filtered_df = df1[df1['Year'] >= 1980]
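Filtering with a boolean mask returns a subset that pandas may treat as derived from the original frame, so later assignments to it can trigger a chained-assignment warning. A minimal sketch of one common way to avoid that, calling `.copy()` on the filtered result, is shown below on a toy frame. The names `sample` and `recent` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Toy frame with a handful of rows taken from the outputs shown in this notebook
sample = pd.DataFrame({
    "Year": [1953, 1954, 1980, 1981],
    "NBA Champion": ["Minneapolis Lakers", "Minneapolis Lakers",
                     "Los Angeles Lakers", "Boston Celtics"],
})

# The boolean mask keeps 1980 onwards; .copy() makes the result an independent frame
recent = sample[sample["Year"] >= 1980].copy()

# Optional: renumber rows from 0 (the notebook keeps the original row labels)
recent = recent.reset_index(drop=True)

# Editing the copy leaves the original frame untouched
recent.loc[0, "NBA Champion"] = "LA Lakers"
print(sample.loc[2, "NBA Champion"])
```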


# Check new data frame


In [23]:
filtered_df.head()

Unnamed: 0,index,Year,Western Champion,Eastern Champion,Result,NBA Champion,NBA Vice-Champion,Final Sweep ?,MVP Name,MVP Height (m),MVP Height (ft),MVP Position,MVP Team,MVP Nationality,MVP status
30,30,1980,Los Angeles Lakers,Philadelphia 76ers,4–2,Los Angeles Lakers,Philadelphia 76ers,False,K. AJabbar,2.18,7.152231,Center,Los Angeles Lakers,US,Champion
31,31,1981,Houston Rockets,Boston Celtics,2–4,Boston Celtics,Houston Rockets,False,J. Erving,2.01,6.594488,Forward,Philadelphia 76ers,US,Not reached Final
32,32,1982,Los Angeles Lakers,Philadelphia 76ers,4–2,Los Angeles Lakers,Philadelphia 76ers,False,M. Malone,2.08,6.824147,Center,Houston Rockets,US,Not reached Final
33,33,1983,Los Angeles Lakers,Philadelphia 76ers,0–4,Philadelphia 76ers,Los Angeles Lakers,True,M. Malone,2.08,6.824147,Center,Philadelphia 76ers,US,Champion
34,34,1984,Los Angeles Lakers,Boston Celtics,3–4,Boston Celtics,Los Angeles Lakers,False,L. Bird,2.06,6.75853,Forward,Boston Celtics,US,Champion


# "Kareem Abdul-Jabbar" appears here as "K. AJabbar", which may cause mismatches later when linking to other data. Replace it with "K. Abdul-Jabbar", following the data set's convention of first initial and last name. The initial may also need to be separated from the last name for use with other data sets.

In [28]:
# Work on an explicit copy so the assignment below does not raise SettingWithCopyWarning
filtered_df = filtered_df.copy()
# Correct the abbreviated name in the 'MVP Name' column
filtered_df['MVP Name'] = filtered_df['MVP Name'].replace('K. AJabbar', 'K. Abdul-Jabbar')


# Check the data frame to confirm the change (Abdul-Jabbar appears in the first row of the filtered data)

In [29]:
filtered_df.head()

Unnamed: 0,index,Year,Western Champion,Eastern Champion,Result,NBA Champion,NBA Vice-Champion,Final Sweep ?,MVP Name,MVP Height (m),MVP Height (ft),MVP Position,MVP Team,MVP Nationality,MVP status
30,30,1980,Los Angeles Lakers,Philadelphia 76ers,4–2,Los Angeles Lakers,Philadelphia 76ers,False,K. Abdul-Jabbar,2.18,7.152231,Center,Los Angeles Lakers,US,Champion
31,31,1981,Houston Rockets,Boston Celtics,2–4,Boston Celtics,Houston Rockets,False,J. Erving,2.01,6.594488,Forward,Philadelphia 76ers,US,Not reached Final
32,32,1982,Los Angeles Lakers,Philadelphia 76ers,4–2,Los Angeles Lakers,Philadelphia 76ers,False,M. Malone,2.08,6.824147,Center,Houston Rockets,US,Not reached Final
33,33,1983,Los Angeles Lakers,Philadelphia 76ers,0–4,Philadelphia 76ers,Los Angeles Lakers,True,M. Malone,2.08,6.824147,Center,Philadelphia 76ers,US,Champion
34,34,1984,Los Angeles Lakers,Boston Celtics,3–4,Boston Celtics,Los Angeles Lakers,False,L. Bird,2.06,6.75853,Forward,Boston Celtics,US,Champion
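If the initial does need to be separated from the last name, one possible approach is to split each name once on the first space. This is a sketch only: the output column names `MVP Initial` and `MVP Last Name` are hypothetical, and it assumes every name follows the "initial, period, space, last name" pattern seen in the rows above.

```python
import pandas as pd

# Names taken from the rows shown above
names = pd.Series(
    ["K. Abdul-Jabbar", "J. Erving", "M. Malone", "L. Bird"], name="MVP Name"
)

# Split once on the first space, so hyphenated last names stay in one piece
parts = names.str.split(" ", n=1, expand=True)

split_names = pd.DataFrame({
    "MVP Initial": parts[0].str.rstrip("."),  # drop the trailing period
    "MVP Last Name": parts[1],
})
print(split_names)
```

Names that do not match the assumed pattern (for example, a name with no space) would leave the last-name column empty, so it is worth checking for missing values after the split.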


# Check null values

In [30]:
filtered_df.isnull().sum()

index                0
Year                 0
Western Champion     0
Eastern Champion     0
Result               0
NBA Champion         0
NBA Vice-Champion    0
Final Sweep ?        0
MVP Name             0
MVP Height (m)       0
MVP Height (ft)      0
MVP Position         0
MVP Team             0
MVP Nationality      0
MVP status           0
dtype: int64

# Check for duplicates

In [31]:
filtered_df.duplicated().sum()

0

# Save to a new CSV file

In [32]:
filtered_df.to_csv('../data/outputs/nba_finals_and_mvp_1980plus.csv', index=False)
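`to_csv` does not create missing folders, so the save fails if the output directory does not exist yet. The sketch below shows one way to create the folder first and then confirm that the file reads back unchanged. It writes to a temporary directory so that it can run anywhere. The frame and variable names are illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for filtered_df
sample = pd.DataFrame({
    "Year": [1980, 1981],
    "MVP Name": ["K. Abdul-Jabbar", "J. Erving"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "outputs" / "nba_finals_and_mvp_1980plus.csv"

    # Create the parent folder if needed; no error if it already exists
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # index=False leaves the row labels out of the file
    sample.to_csv(out_path, index=False)

    # Read the file back and compare the values column by column
    reloaded = pd.read_csv(out_path)
    round_trip_ok = reloaded.to_dict("list") == sample.to_dict("list")

print(round_trip_ok)
```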

---


# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid Python until a folder name is filled in
except Exception as e:
  print(e)
