<a href="https://colab.research.google.com/github/ryansilalahi/pubprojs/blob/main/NBA_Data_Manipulation_and_Function_Design.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **NBA Data Manipulation and Function Design**
Ryan Silalahi

December 2022

Source of my datasets: https://www.kaggle.com/datasets/drgilermo/nba-players-stats?resource=download

# **Overview:**

In this project, I delved into the extensive NBA player statistics spanning from 1950 to 2017. The core objective was to analyze and manipulate this dataset, focusing on key player metrics: Points, Steals, Blocks, Turnovers, Rebounds, and Assists. By leveraging a variety of data manipulation techniques and through function design, I aimed to answer pivotal questions, such as identifying the highest statistical averages by year, exploring player demographics, and understanding the relationships between player attributes and points scoring.

# **Preliminary Data Processing**

In [1]:
#Importing relevant packages
import csv
import requests
from lxml import etree
import io
import pandas as pd
import numpy as np
import sqlite3 as sql

**Web Scraping and Data Extraction:**

I began the project by scraping HTML data from a basketball statistics website and handling irregularities such as missing values. When I copied the XPath of the table I wanted from "https://www.basketball-reference.com/leaders/pts_career_p.html", the path included tbody as a child of table. Scraping with that XPath returned only an empty result. Calling getchildren() on the table element showed that the parsed tree had no tbody tag under table, so I used a new XPath without tbody. I then transformed the scraped data into a structured dictionary.
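The tbody mismatch can be reproduced on a tiny example. The sketch below uses the standard library's `xml.etree.ElementTree` on a made-up, well-formed table rather than lxml on the real page, so the markup and player names are assumptions. One likely explanation for the mismatch is that browsers insert a tbody element into the DOM they display even when the raw markup has none.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for the scraped table (hypothetical data).
html = """
<table id="tot">
  <thead><tr><th>Rank</th><th>Player</th><th>PTS</th></tr></thead>
  <tr><td>1.</td><td><a href="#">Player A</a></td><td>100</td></tr>
  <tr><td>2.</td><td><a href="#">Player B</a></td><td>90</td></tr>
</table>
"""
table = ET.fromstring(html.strip())

# A path copied from browser dev tools would include tbody;
# the raw markup has none, so this matches nothing.
rows_with_tbody = table.findall("./tbody/tr")

# Listing the direct children reveals the real structure.
child_tags = [child.tag for child in table]

# Dropping tbody from the path finds the data rows.
rows_without_tbody = table.findall("./tr")
names = [a.text for a in table.findall("./tr/td/a")]

print(child_tags)
print(len(rows_with_tbody), len(rows_without_tbody), names)
```

Inspecting the children first, as the notebook does with getchildren(), is usually quicker than guessing at the path.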

In [2]:
#Boolean flags so that the table-creation steps only run once
SQLcreated = False
dbSQLcreated = False
def web_scraping(link):
  '''
  Takes website url, prints whether accessing was successful, and returns table object of the scoring statistics.
  '''
  response = requests.get(link)
  if response.status_code == 200:
    print("Success")
  else:
    print("Error")
  assert response.status_code == 200
  htmlparser = etree.HTMLParser()
  tree1 = etree.parse(io.BytesIO(response.content), htmlparser)
  root1 = tree1.getroot()
  node_table = root1.xpath('//*[@id="tot"]')
  return node_table

node = web_scraping("https://www.basketball-reference.com/leaders/pts_career_p.html")


Success


In [3]:
def data_dic(node):
  '''
  takes accessed element(node) from web_scraping and returns data in dictionary form,
  headers as keys and cell values as values
  '''
  headers = node[0].xpath('./thead/tr/th/text()')
  name = node[0].xpath('./tr/td//a/text()')
  ranknpts = node[0].xpath('./tr/td/text()') #tbody NOT a child of table..
  ranklist = list()
  ptslist = list()
  ranknpts_clean = list(filter(lambda a: a != '\n' and a != '*\n', ranknpts))
  for k in range(0,len(ranknpts_clean)):
    if k % 2 == 0:
      ranklist.append(ranknpts_clean[k])
    elif k % 2 == 1:
      ptslist.append(ranknpts_clean[k])
  for x in range(0,len(ranklist)):
    if ranklist[x] == '\xa0': #some ranks have weird '\xa0' instead of actual rank...
      ranklist[x] = str(x+1) + '.'
  data_dicform = {headers[0]: ranklist, headers[1]: name, headers[2]: ptslist}
  return data_dicform

data_dict = data_dic(node)
print(data_dict['Rank'])

['1.', '2.', '3.', '4.', '5.', '6.', '7.', '8.', '9.', '10.', '11.', '12.', '13.', '14.', '15.', '16.', '17.', '18.', '19.', '20.', '21.', '22.', '23.', '24.', '25.', '26.', '27.', '28.', '29.', '30.', '31.', '32.', '33.', '34.', '35.', '36.', '37.', '38.', '39.', '40.', '41.', '42.', '43.', '44.', '45.', '46.', '47.', '48.', '49.', '50.', '51.', '52.', '53.', '54.', '55.', '56.', '57.', '58.', '59.', '60.', '61.', '62.', '63.', '64.', '65.', '66.', '67.', '68.', '69.', '70.', '71.', '72.', '73.', '74.', '75.', '76.', '77.', '78.', '79.', '80.', '81.', '82.', '83.', '84.', '85.', '86.', '87.', '88.', '89.', '90.', '91.', '92.', '93.', '94.', '95.', '96.', '97.', '98.', '99.', '100.', '101.', '102.', '103.', '104.', '105.', '106.', '107.', '108.', '109.', '110.', '111.', '112.', '113.', '114.', '115.', '116.', '117.', '118.', '119.', '120.', '121.', '122.', '123.', '124.', '125.', '126.', '127.', '128.', '129.', '130.', '131.', '132.', '133.', '134.', '135.', '136.', '137.', '138.', '13
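The cleaning done inside data_dic can be illustrated on a short list. This is a minimal sketch with made-up cell values: it drops the newline debris, splits the alternating rank and points cells with slicing, and replaces the `'\xa0'` placeholder with the row position, as the function above does. Filling by position is this notebook's choice; a tied player could arguably share the previous rank instead.

```python
# Hypothetical scraped cells: rank, points, rank, points, ... with stray newlines.
cells = ["1.", "100", "\n", "2.", "90", "*\n", "\xa0", "90"]

# Remove the newline debris left between table cells.
clean = [cell for cell in cells if cell not in ("\n", "*\n")]

# Even positions hold ranks, odd positions hold points.
ranks = clean[0::2]
points = clean[1::2]

# Replace the non-breaking-space placeholder with the row position.
ranks = [f"{i + 1}." if rank == "\xa0" else rank for i, rank in enumerate(ranks)]

print(ranks)
print(points)
```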

In [4]:
def createSQL(data, dbname):
  '''
  creates the (empty) playoffpts table in the database dbname.
  data is currently unused; the rows are added later by insertSQL.
  '''
  connection = sql.connect(dbname)
  cursor = connection.cursor()
  #once executed, do not run again: CREATE TABLE fails if the table already exists
  exe = cursor.execute("CREATE TABLE playoffpts (Rank VARCHAR(255),Player VARCHAR(255),PTS VARCHAR(255))")
  connection.commit()
  return exe.fetchall()
if SQLcreated == False:
  createSQL(data_dict, 'project.db')
  SQLcreated = True
SQLcreated

True

**Database Creation and Data Storage:**

Utilizing SQL, I created a database to efficiently manage and store the acquired data. Challenges emerged during this process, notably with xpath discrepancies and data type inconsistencies. Thorough debugging was essential to ensure the accuracy of the stored data.
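The notebook guards table creation with the SQLcreated flag. An alternative, sketched here against an in-memory database rather than project.db, is to let SQLite skip the statement when the table already exists. The helper name `create_playoff_table` is hypothetical.

```python
import sqlite3

def create_playoff_table(connection):
    """Create the playoffpts table unless it already exists."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS playoffpts "
        "(Rank TEXT, Player TEXT, PTS TEXT)"
    )
    connection.commit()

conn = sqlite3.connect(":memory:")
create_playoff_table(conn)
create_playoff_table(conn)  # second call is a no-op instead of an error

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
)]
print(tables)
```

With this form the cell can be re-run safely, so no module-level flag is needed.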

Function that accesses a database with SQL query:

In [5]:
def insertSQL(content, dbname):
  '''
  inserts values into the table in the database.
  '''
  connection = sql.connect(dbname)
  cursor = connection.cursor()
  for values in content: #executemany() command didn't work out
    #placeholders let sqlite3 handle quoting, e.g. names containing apostrophes
    exe = cursor.execute("INSERT INTO playoffpts (Rank,Player,PTS) VALUES (?,?,?)", values)
  connection.commit()
  return exe.fetchall() #empty list: INSERT statements return no rows
val = []
for k in range(0, len(data_dict['Rank'])):
  val.append((data_dict['Rank'][k], data_dict['Player'][k], data_dict['PTS'][k]))

insertSQL(val, 'project.db')

[]
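For reference, a batched insert with `executemany()` and `?` placeholders might look like the sketch below. It runs against an in-memory database with made-up rows, so it is not a drop-in replacement for the cell above; it only shows the call shape and that quoting is handled for names containing apostrophes.

```python
import sqlite3

# Hypothetical rows in the same (Rank, Player, PTS) shape as val.
rows = [
    ("1.", "Player A", "100"),
    ("2.", "Player O'Example", "90"),  # apostrophe needs no manual escaping
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE playoffpts (Rank TEXT, Player TEXT, PTS TEXT)")

# One statement, many parameter tuples.
conn.executemany(
    "INSERT INTO playoffpts (Rank, Player, PTS) VALUES (?, ?, ?)", rows
)
conn.commit()

stored = conn.execute("SELECT Player FROM playoffpts ORDER BY rowid").fetchall()
print(stored)
```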

**Data Transformation:**

The CSV datasets were then loaded into the same SQLite database as text and read back into pandas. Because every value is stored as text, I faced challenges converting the resulting object columns back to their intended types (ints, floats, strings). Careful handling preserved the integrity of the data through this round trip.

Function that processes CSV files and stores their data in an SQLite database:

In [6]:
def database_from_csv(csv_file, database, table_name):
  #Open the csv file using csv package and with open to read the data
  with open(csv_file, 'r') as csv_data:
    reader = csv.reader(csv_data)
    data = list(reader)
  conn = sql.connect(database)
  c = conn.cursor()
  #Create a new table with the amount of columns necessary
  num_columns = len(data[0])
  c.execute("CREATE TABLE {} ({})".format(table_name, ", ".join("column{} TEXT".format(i) for i in range(1, num_columns+1))))
  #Loop through data and insert rows into the database
  for row in data:
    c.execute("INSERT INTO {} VALUES ({})".format(table_name, ", ".join("?" for i in range(num_columns))), row)
  conn.commit()
  conn.close()

#Running functions on all csv data
if not dbSQLcreated:
  database_from_csv('Seasons_Stats.csv', 'project.db', 'seasonsstats')
  database_from_csv('player_data.csv', 'project.db', 'playerdata')
  database_from_csv('Players.csv', 'project.db', 'playerdemo')
  dbSQLcreated = True

This function connects to the specified SQLite database, retrieves data from a specified table within that database, and creates pandas DataFrames using the retrieved data.

In [7]:
def show_database_data(database, table_name):
  conn = sql.connect(database)
  #Read data from the database and create a data frame using a sqlquery
  df = pd.read_sql_query("SELECT * FROM {}".format(table_name), conn)
  #Makes the first row of data the column names
  df.columns = df.iloc[0]
  df = df[1:]

  return df

seasonstats = show_database_data('project.db', 'seasonsstats')
player_born = show_database_data('project.db', 'playerdata')
player_demo = show_database_data('project.db', 'playerdemo')
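The functions above store the CSV header as an ordinary data row and promote it to column names afterwards. A possible alternative, sketched here with a made-up two-column CSV and an in-memory database, is to use the header row as the table's column names at creation time. The table name `stats` is hypothetical. The real files appear to begin with an unnamed index column, which would need a placeholder name under this approach; that detail is left out.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content standing in for one of the data files.
csv_text = "Player,PTS\nPlayer A,100\nPlayer B,90\n"

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)   # first row holds the column names
rows = list(reader)     # remaining rows hold the data

conn = sqlite3.connect(":memory:")

# Quote each header so names with spaces or symbols are valid identifiers.
columns = ", ".join('"{}" TEXT'.format(name.replace('"', '""')) for name in header)
conn.execute("CREATE TABLE stats ({})".format(columns))

placeholders = ", ".join("?" for _ in header)
conn.executemany("INSERT INTO stats VALUES ({})".format(placeholders), rows)
conn.commit()

cursor = conn.execute("SELECT * FROM stats")
column_names = [description[0] for description in cursor.description]
data = cursor.fetchall()
print(column_names, data)
```

Every value still comes back as text, so the numeric conversion step that follows is needed either way.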


Here, I tidy the season stats dataset: I remove the duplicated first column, strip asterisks from the 'Player' column so that searches by player name work, convert the numeric columns to floats, convert 'Player', 'Pos', and 'Tm' back to strings, and replace NaN values with 0. The result is the final tidy version of the dataset.

In [8]:
del(seasonstats['']) #delete the duplicate first column; run once, since a second run raises a KeyError

#removing asterisk that follows HOF player names
seasonstats['Player'] = seasonstats['Player'].str.strip('*')

#making necessary columns floats
cols = seasonstats.columns
cols = list(cols)
cols.remove('Player')
cols.remove('Pos')
cols.remove('Tm')
cols

#making all columns numeric (float)
for i in cols:
  seasonstats[i] = pd.to_numeric(seasonstats[i], errors='coerce')

#making necessary columns strings
seasonstats['Player'] = seasonstats['Player'].astype('string')
seasonstats['Pos'] = seasonstats['Pos'].astype('string')
seasonstats['Tm'] = seasonstats['Tm'].astype('string')

#making NaN = 0
seasonstats = seasonstats.replace(np.nan, 0, regex=True)

seasonstats #the final tidy version

Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,TS%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
1,1950.0,Curly Armstrong,G-F,31.0,FTW,63.0,0.0,0.0,0.0,0.368,...,0.705,0.0,0.0,0.0,176.0,0.0,0.0,0.0,217.0,458.0
2,1950.0,Cliff Barker,SG,29.0,INO,49.0,0.0,0.0,0.0,0.435,...,0.708,0.0,0.0,0.0,109.0,0.0,0.0,0.0,99.0,279.0
3,1950.0,Leo Barnhorst,SF,25.0,CHS,67.0,0.0,0.0,0.0,0.394,...,0.698,0.0,0.0,0.0,140.0,0.0,0.0,0.0,192.0,438.0
4,1950.0,Ed Bartels,F,24.0,TOT,15.0,0.0,0.0,0.0,0.312,...,0.559,0.0,0.0,0.0,20.0,0.0,0.0,0.0,29.0,63.0
5,1950.0,Ed Bartels,F,24.0,DNN,13.0,0.0,0.0,0.0,0.308,...,0.548,0.0,0.0,0.0,20.0,0.0,0.0,0.0,27.0,59.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24687,2017.0,Cody Zeller,PF,24.0,CHO,62.0,58.0,1725.0,16.7,0.604,...,0.679,135.0,270.0,405.0,99.0,62.0,58.0,65.0,189.0,639.0
24688,2017.0,Tyler Zeller,C,27.0,BOS,51.0,5.0,525.0,13.0,0.508,...,0.564,43.0,81.0,124.0,42.0,7.0,21.0,20.0,61.0,178.0
24689,2017.0,Stephen Zimmerman,C,20.0,ORL,19.0,0.0,108.0,7.3,0.346,...,0.600,11.0,24.0,35.0,4.0,2.0,5.0,3.0,17.0,23.0
24690,2017.0,Paul Zipser,SF,22.0,CHI,44.0,18.0,843.0,6.9,0.503,...,0.775,15.0,110.0,125.0,36.0,15.0,16.0,40.0,78.0,240.0


In [9]:
del(player_demo['']) #deleting the first column because it duplicates the DataFrame's own index

In [10]:
player_demo

Unnamed: 0,Player,height,weight,collage,born,birth_city,birth_state
1,Curly Armstrong,180,77,Indiana University,1918,,
2,Cliff Barker,188,83,University of Kentucky,1921,Yorktown,Indiana
3,Leo Barnhorst,193,86,University of Notre Dame,1924,,
4,Ed Bartels,196,88,North Carolina State University,1925,,
5,Ralph Beard,178,79,University of Kentucky,1927,Hardinsburg,Kentucky
...,...,...,...,...,...,...,...
3918,Troy Williams,198,97,South Carolina State University,1969,Columbia,South Carolina
3919,Kyle Wiltjer,208,108,Gonzaga University,1992,Portland,Oregon
3920,Stephen Zimmerman,213,108,"University of Nevada, Las Vegas",1996,Hendersonville,Tennessee
3921,Paul Zipser,203,97,,1994,Heidelberg,Germany


In [11]:
player_born

Unnamed: 0,name,year_start,year_end,position,height,weight,birth_date,college
1,Alaa Abdelnaby,1991,1995,F-C,6-10,240,"June 24, 1968",Duke University
2,Zaid Abdul-Aziz,1969,1978,C-F,6-9,235,"April 7, 1946",Iowa State University
3,Kareem Abdul-Jabbar,1970,1989,C,7-2,225,"April 16, 1947","University of California, Los Angeles"
4,Mahmoud Abdul-Rauf,1991,2001,G,6-1,162,"March 9, 1969",Louisiana State University
5,Tariq Abdul-Wahad,1998,2003,F,6-6,223,"November 3, 1974",San Jose State University
...,...,...,...,...,...,...,...,...
4546,Ante Zizic,2018,2018,F-C,6-11,250,"January 4, 1997",
4547,Jim Zoet,1983,1983,C,7-1,240,"December 20, 1953",Kent State University
4548,Bill Zopf,1971,1971,G,6-1,170,"June 7, 1948",Duquesne University
4549,Ivica Zubac,2017,2018,C,7-1,265,"March 18, 1997",


# **Important Definitions**

In this analysis, key basketball performance metrics are defined as follows:

**Points**: Represents the scoring of the ball, with one point for a free throw, two points within the three-point line, and three points beyond it.

**Rebound**: Involves gaining possession of the ball after a missed shot, on either offense or defense.

**Steal**: Occurs when a player takes the ball from the opponent, gaining possession for their team.

**Turnover**: Refers to losing possession to the opposing team before attempting a shot, for example through a bad pass, mishandling the ball, or a violation.

**Assist**: Involves passing the ball, leading directly to a scored basket.

**Block**: Involves stopping a shot attempt by the opponent.

In all of my datasets, the fundamental independent variable is the player. This variable links diverse statistics, including demographics and season-specific data. Without the player variable, answering pivotal questions such as "Who is the top scorer this year?" would be impossible. Furthermore, the datasets 'player_demo' and 'player_born' contain intuitively defined dependent variables like height, weight, college, birth year, birth state, birth city, and player position, providing essential demographic insights about specific players. Lastly, the 'year' variable is crucial for addressing temporal inquiries, enabling analysis such as determining the best player in a specific category during a given year.










**Refining the Data for Analysis**



In [12]:
#making a dataset that is grouped by player and year in order to easily find which players impacted the most each year
sortedstats = seasonstats.groupby(['Player', 'Year'])


#summing the main player statistics: games, points, field goals made and attempted,
#rebounds, assists, steals, blocks, turnovers, and fouls.
sortedstatsagg = sortedstats.agg({'G': ['sum'], 'PTS': ['sum'], 'FG': ['sum'], 'FGA': ['sum'],
              'TRB': ['sum'], 'AST': ['sum'], 'STL': ['sum'], 'BLK': ['sum'],
             'TOV': ['sum'], 'PF': ['sum']
              })
#creating a variable for points per game
sortedstatsagg['PPG'] = sortedstatsagg['PTS'] / sortedstatsagg['G']
#creating per-game averages for the other common stats, as with points per game above
sortedstatsagg['APG'] = sortedstatsagg['AST'] / sortedstatsagg['G']
sortedstatsagg['SPG'] = sortedstatsagg['STL'] / sortedstatsagg['G']
sortedstatsagg['BPG'] = sortedstatsagg['BLK'] / sortedstatsagg['G']
sortedstatsagg['TOPG'] = sortedstatsagg['TOV'] / sortedstatsagg['G']
sortedstatsagg['RBG'] = sortedstatsagg['TRB'] / sortedstatsagg['G']



My initial cleaning tidied the data considerably, but further refinement was needed before analysis. First, I grouped the 'seasonstats' dataset by player and year, which makes it easy to look up a given player's statistics for a given season. From that grouping I built an aggregated dataset that sums each player's games, points, field goals, rebounds, assists, steals, blocks, turnovers, and fouls per season. Finally, I computed per-game averages by dividing each season total by the games played. Averages give a fairer picture of a player's performance than raw totals because they do not depend on how many games the player appeared in. With these steps complete, the data was organized well enough to answer the questions below.
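The grouping and averaging steps can be seen on a very small example. This is a sketch with made-up rows, and it sums into flat column names instead of the `['sum']` aggregation used above. In the real data, traded players also carry a combined `TOT` row (as with Ed Bartels in 1950 in the table above), so summing all of a player's rows for a season would count those games more than once; the sketch sidesteps this by including per-team rows only.

```python
import pandas as pd

# Hypothetical rows: Player A appears twice in 2000 (one row per team).
raw = pd.DataFrame({
    "Player": ["Player A", "Player A", "Player B"],
    "Year": [2000, 2000, 2000],
    "G": [40, 42, 80],
    "PTS": [800, 840, 1200],
    "AST": [200, 210, 400],
})

# One row per (player, season), with totals summed across teams.
totals = raw.groupby(["Player", "Year"])[["G", "PTS", "AST"]].sum()

# Per-game averages: season total divided by games played.
totals["PPG"] = totals["PTS"] / totals["G"]
totals["APG"] = totals["AST"] / totals["G"]

ppg_a = totals.loc[("Player A", 2000), "PPG"]
print(totals)
```

Here Player A totals 1640 points in 82 games, so `ppg_a` is 20.0.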

**Function Writing to Answer Questions**

In order to avoid becoming overwhelmed by the many questions that arise when it comes to NBA statistics, I have focused my research on understanding player statistics as well as player demographics. Questions that I have include:

  1. What are specific players' season stats by year?

  2. What are specific players' demographics?

  3. What are the best averages by year and player?

How will I approach these questions?

1. I will define a function that takes input such as a player and a year and returns the player's season stats within those parameters.

2. I will also write a function that will take input such as a player's name and it will return that player's demographics.

3. Finally, I will write a function that will take input such as a year, and print the names of players who had the highest averages in each of the major statistical categories and what those averages were.

# **Function Design**

***When inputting player names make sure they are spelt correctly!***

e.g., Lebron James will not output anything, but LeBron James will.
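The functions below match names exactly, which is why spelling and capitalization matter. A more forgiving lookup could be sketched as follows, using only the standard library. The helper `resolve_player_name` and the 0.8 similarity cutoff are assumptions for illustration, not part of the notebook.

```python
import difflib

def resolve_player_name(query, known_names):
    """Return the stored spelling of a player name, or None if nothing is close."""
    # Map a case-insensitive form of each name back to its stored spelling.
    by_folded = {name.casefold(): name for name in known_names}
    folded = query.strip().casefold()
    if folded in by_folded:
        return by_folded[folded]
    # Fall back to the closest spelling, if any is similar enough.
    close = difflib.get_close_matches(folded, list(by_folded), n=1, cutoff=0.8)
    return by_folded[close[0]] if close else None

known = ["LeBron James", "Michael Jordan", "Magic Johnson"]
print(resolve_player_name("Lebron James", known))    # LeBron James
print(resolve_player_name("Micheal Jordan", known))  # Michael Jordan
print(resolve_player_name("Zzz Qqq", known))         # None
```

The resolved name could then be passed to the exact-match filters used in the functions that follow.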

**Function One: Finding player season stats by player and year.**

This function asks for a player's name, prints the seasons available for that player, and then asks for one of those years. It uses a loc lookup on the aggregated dataset to print that player's season stats for the chosen year.

In [13]:
def findStatYear():
  player = input('Input player first and last name: ')
  #list the seasons available for this player so the user can pick a valid year
  print(seasonstats[seasonstats['Player'] == player]['Year'].to_string(index = False))
  year = input('Please input one of the above years: ')
  print(sortedstatsagg.loc[player, int(year)].round(2))

findStatYear()



Input player first and last name: Michael Jordan
1985.0
1986.0
1987.0
1988.0
1989.0
1990.0
1991.0
1992.0
1993.0
1995.0
1996.0
1997.0
1998.0
2002.0
2003.0
Please input one of the above years: 1989
G     sum      81.00
PTS   sum    2633.00
FG    sum     966.00
FGA   sum    1795.00
TRB   sum     652.00
AST   sum     650.00
STL   sum     234.00
BLK   sum      65.00
TOV   sum     290.00
PF    sum     247.00
PPG            32.51
APG             8.02
SPG             2.89
BPG             0.80
TOPG            3.58
RBG             8.05
Name: (Michael Jordan, 1989.0), dtype: float64
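As a quick sanity check, the per-game averages in the output above can be reproduced by hand from the season totals. The sketch below hard-codes the totals printed for Michael Jordan's 1989 season; the dictionary names are illustrative only.

```python
# Season totals copied from the printed output for Michael Jordan, 1989.
totals = {"G": 81, "PTS": 2633, "AST": 650, "STL": 234,
          "BLK": 65, "TOV": 290, "TRB": 652}

# Each per-game average is the season total divided by games played.
per_game = {
    "PPG": round(totals["PTS"] / totals["G"], 2),
    "APG": round(totals["AST"] / totals["G"], 2),
    "SPG": round(totals["STL"] / totals["G"], 2),
    "BPG": round(totals["BLK"] / totals["G"], 2),
    "TOPG": round(totals["TOV"] / totals["G"], 2),
    "RBG": round(totals["TRB"] / totals["G"], 2),
}
print(per_game)
```

For example, 2633 points over 81 games gives 32.51 points per game, matching the PPG line.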


**Function Two: Finding entire player demographics by player name.**

This function asks for a player's name and returns the row of demographics whose name matches the user's input.

In [14]:
def player_demographics():
  player = input('Input a player you would like to see demographics of: ', )
  return player_born[player_born['name'] == player]

player_demographics()


Input a player you would like to see demographics of: Magic Johnson


Unnamed: 0,name,year_start,year_end,position,height,weight,birth_date,college
2034,Magic Johnson,1980,1996,G-F,6-9,215,"August 14, 1959",Michigan State University


**Function Three: Returns players who have the highest averages in each major statistic category by year.
(Points, Steals, Blocks, Assists, Rebounds, Turnovers)**

Before writing this function, I needed to add the per-game averages to the regular season stats dataset. With those columns in place, the function asks the user for any year between 1950 and 2017, finds the player with the highest (max) average in each category for that year, and prints the player's name and the average.

In [15]:
seasonstats = seasonstats[seasonstats['G'] > 21]
#Filtering out players who played 21 or fewer games (roughly 1/4 of a season) to avoid noise in stat averages
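The cell that adds the per-game columns to seasonstats is not shown, so the following is only a sketch of what that step and the later idxmax lookup could look like. The small DataFrame, its player names, and the variable names are made up.

```python
import pandas as pd

# Hypothetical season rows; Player B has a high average over very few games.
season = pd.DataFrame({
    "Year": [2000, 2000, 2000],
    "Player": ["Player A", "Player B", "Player C"],
    "G": [80, 10, 60],
    "PTS": [1600, 400, 1500],
})

# Keep players with more than 21 games, then add the per-game column.
season = season[season["G"] > 21].copy()
season["PPG"] = season["PTS"] / season["G"]

# For one year, idxmax gives the row label of the highest average.
year_rows = season[season["Year"] == 2000]
best_ppg = year_rows["PPG"].max()
leader = year_rows.loc[year_rows["PPG"].idxmax(), "Player"]
print(leader, round(best_ppg, 2))
```

Player B's 40 points per game over 10 games is filtered out, so the leader is Player C at 25.0.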

In [40]:
def maxAvgbyYear():
    year = input('Input a year between 1950 and 2017 to find the highest statistical averages in that year: ')
    tempsave = seasonstats[seasonstats['Year'] == int(year)].copy()  #Create a copy to avoid SettingWithCopyWarning

    print("Disclaimer: Some years may have NaN or empty values for max stats since these stats were not recorded in this year")

    max_PPG = tempsave['PPG'].max()
    max_APG = tempsave['APG'].max()
    max_SPG = tempsave['SPG'].max()
    max_BPG = tempsave['BPG'].max()
    max_TOPG = tempsave['TOPG'].max()
    max_RBG = tempsave['RBG'].max()

    #If a maximum is not greater than 0 (missing data for that year), nothing is printed for that statistic
    if max_PPG > 0:
        print('Highest avg PPG is', round(max_PPG, 2), 'by', tempsave.loc[tempsave['PPG'].idxmax(), 'Player'])
    if max_APG > 0:
        print('Highest avg APG is', round(max_APG, 2), 'by', tempsave.loc[tempsave['APG'].idxmax(), 'Player'])
    if max_SPG > 0:
        print('Highest avg SPG is', round(max_SPG, 2), 'by', tempsave.loc[tempsave['SPG'].idxmax(), 'Player'])
    if max_BPG > 0:
        print('Highest avg BPG is', round(max_BPG, 2), 'by', tempsave.loc[tempsave['BPG'].idxmax(), 'Player'])
    if max_TOPG > 0:
        print('Highest avg TOPG is', round(max_TOPG, 2), 'by', tempsave.loc[tempsave['TOPG'].idxmax(), 'Player'])
    if max_RBG > 0:
        print('Highest avg RBG is', round(max_RBG, 2), 'by', tempsave.loc[tempsave['RBG'].idxmax(), 'Player'])

#Call the function
maxAvgbyYear()


Input a year between 1950 and 2017 to find the highest statistical averages in that year: 1980
Disclaimer: Some years may have NaN or empty values for max stats since these stats were not recorded in this year
Highest avg PPG is 33.14 by George Gervin
Highest avg APG is 10.15 by Micheal Ray
Highest avg SPG is 3.23 by Micheal Ray
Highest avg BPG is 3.41 by Kareem Abdul-Jabbar
Highest avg TOPG is 4.38 by Micheal Ray
Highest avg RBG is 15.01 by Swen Nater


# **Answer to Questions**

With these functions I can answer the questions I posed earlier. By inputting a player's name and/or a year, I can look up basic season statistics. For example, to find the highest-scoring player of 2003, I ran maxAvgbyYear() with 2003 as input and read the first line of output, which reports points per game: Tracy McGrady led the league with about 32.1 points per game. The same approach works for the other major stats, and for demographics with the other functions.


More example questions and answers:

What were Dennis Rodman's stats in 1989?

In [17]:
findStatYear()

Input player first and last name: Dennis Rodman
1987.0
1988.0
1989.0
1990.0
1991.0
1992.0
1993.0
1994.0
1995.0
1996.0
1997.0
1998.0
1999.0
Please input one of the above years: 1989
G     sum     82.00
PTS   sum    735.00
FG    sum    316.00
FGA   sum    531.00
TRB   sum    772.00
AST   sum     99.00
STL   sum     55.00
BLK   sum     76.00
TOV   sum    126.00
PF    sum    292.00
PPG            8.96
APG            1.21
SPG            0.67
BPG            0.93
TOPG           1.54
RBG            9.41
Name: (Dennis Rodman, 1989.0), dtype: float64


What are Dennis Rodman's demographics?

In [18]:
player_demographics()

Input a player you would like to see demographics of: Dennis Rodman


Unnamed: 0,name,year_start,year_end,position,height,weight,birth_date,college
3452,Dennis Rodman,1987,2000,F,6-7,210,"May 13, 1961",Southeastern Oklahoma State University


Who had the highest statistical averages in 2003?

In [41]:
maxAvgbyYear()

Input a year between 1950 and 2017 to find the highest statistical averages in that year: 2003
Disclaimer: Some years may have NaN or empty values for max stats since these stats were not recorded in this year
Highest avg PPG is 32.09 by Tracy McGrady
Highest avg APG is 8.89 by Jason Kidd
Highest avg SPG is 2.74 by Allen Iverson
Highest avg BPG is 3.23 by Theo Ratliff
Highest avg TOPG is 3.7 by Jason Kidd
Highest avg RBG is 15.42 by Ben Wallace


# **Bonus Insights + Visualizations**

In addition to designing functions to parse through the datasets and provide me with answers to my questions, I thought data visualizations would provide me with a clearer picture on the many relationships and insights to be discovered within the data.

The ones I was most curious about:
*   How have three-point attempts changed since the three-point line was added in 1980?
*   How are age and points scored related, if at all?
*   How are heights distributed in the NBA?
*   What are the heights of all-time playoff scoring leaders, and how are they distributed?







In [42]:
import plotly.express as px

seasonstats1 = seasonstats[seasonstats['Year'] >= 1980]
#filtered year range from when 3 pointers were added to the NBA to 2017
data = seasonstats1
fig = px.histogram(data, x='Year', y='3PA', title = 'Histogram of Three Point Attempts in the NBA (1980-2017)')
fig.show()

**Evolution of Three-Point Attempts in NBA Since 1980**

The analysis of three-point attempts in the NBA since their introduction in 1980 reveals a consistent and anticipated trend: attempts have risen over the years. This trajectory aligns with expectations, mirroring the league-wide realization of the strategic advantage associated with three-point shooting. The increasing frequency of three-point attempts underscores a fundamental shift in basketball strategy, demonstrating the growing significance teams place on this long-range scoring technique.

In [43]:
seasonstats2 = seasonstats[seasonstats['Age'] > 0]
#filtered players with unknown ages

data = seasonstats2

fig = px.scatter(data, x='Age', y='PTS', title = 'Scatterplot of NBA Player Ages and Total Points Scored')
fig.show()

**Relationship Between Age and Points Scored**

Examining the relationship between age and points scored in the NBA, it becomes evident that there is a notable correlation. As players age, there is a discernible decline in the number of points they score. This observation aligns with expectations, indicating a natural progression in athletes' careers. As players grow older, their athletic abilities tend to diminish, directly influencing their scoring capabilities. Therefore, the distribution of points scored and age reflects a logical trend, where scoring dwindles as players age, in line with the anticipated decline in physical prowess.

In [44]:
player_demo_heights = pd.to_numeric(player_demo['height'].str.replace('-', ''), errors='coerce')

data1 = player_demo_heights
fig1 = px.histogram(data1, x='height', title = 'Histogram of NBA Player Heights')
fig1.show()

**Distribution of Heights in the NBA**

The distribution of heights among NBA players is roughly bell-shaped, centered around a mean of approximately 200 centimeters (about 6 feet 7 inches). This pattern aligns with the sport's inherent dynamics, where players closer to the hoop often have an advantage, leading to a prevalence of taller athletes. The typical bell-shaped distribution underscores the strategic advantage of height in basketball, shaping the demographic landscape of the NBA.
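For readers more used to feet and inches, the conversion from centimeters is a short calculation. This is a minimal sketch; the helper name `cm_to_feet_inches` is hypothetical and inches are rounded to the nearest whole number.

```python
def cm_to_feet_inches(cm):
    """Convert a height in centimeters to (feet, inches), rounding inches."""
    total_inches = cm / 2.54           # 1 inch is exactly 2.54 cm
    feet = int(total_inches // 12)
    inches = round(total_inches - feet * 12)
    if inches == 12:                   # rounding can spill into the next foot
        feet, inches = feet + 1, 0
    return feet, inches

print(cm_to_feet_inches(200))  # (6, 7)
print(cm_to_feet_inches(180))  # (5, 11)
```

So a 200 cm player stands about 6 feet 7 inches tall.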



To look at the heights of all-time playoff scoring leaders and how they are distributed, I needed to access the Basketball Reference data that I scraped earlier and added to project.db. I queried the table and loaded the result into a pandas DataFrame for analysis and visualization.

In [45]:
def sqlout(cols, table, lim):
  '''queries database made from createSQL and returns queried data
  '''
  connection = sql.connect('project.db')
  cursor = connection.cursor()
  query = f"select {cols} from {table} limit {lim}" #uses the cols argument rather than a global variable
  return cursor.execute(query).fetchall()
cols = 'Rank, Player, PTS'
table = 'playoffpts'
lim = 250
data = sqlout(cols, table, lim)

def dbtodf(arg):
  '''takes queried data from sqlout and returns pandas DataFrame
  '''
  df = pd.DataFrame(arg)
  df.columns = cols.split(',') #note: yields 'Rank', ' Player', ' PTS' (leading spaces kept), which later cells rely on
  return df

df1 = dbtodf(data)
df1[' PTS'] = df1[' PTS'].astype(str).astype(int) #had to convert PTS values from type object to int64
df_align = df1.sort_values(by=[' PTS']) #sort after converting so the order is numeric, not alphabetical
print(df_align.dtypes)
df_align.head()

Rank       object
 Player    object
 PTS        int64
dtype: object


Unnamed: 0,Rank,Player,PTS
249,250.0,Lenny Wilkens,1031
248,249.0,Nick Van Exel,1032
247,248.0,Eddie Jones,1037
246,247.0,Bryon Russell,1038
245,246.0,Donnie Freeman,1040
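The leading spaces in the ' Player' and ' PTS' column names come from splitting the column string on a bare comma. A minimal sketch of the difference, using the same `cols` string as above (the variable names `naive` and `stripped` are illustrative):

```python
cols = 'Rank, Player, PTS'

# Splitting on ',' keeps the space that followed each comma.
naive = cols.split(',')

# Stripping each piece gives clean names.
stripped = [name.strip() for name in cols.split(',')]

print(naive)     # ['Rank', ' Player', ' PTS']
print(stripped)  # ['Rank', 'Player', 'PTS']
```

The cells that follow refer to the names with the leading space, so switching to the stripped form would mean updating those references too.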


In [46]:
df_name = pd.DataFrame(player_born['name']) #organizing the new dataset using data from previous dataset
df_name.columns = [' Player'] #adding player column
df_hts = pd.DataFrame(player_born['height']) #using height from previous dataset
df_namehts = pd.concat([df_name, df_hts], axis=1) #putting these two together
df_playoffhts = df_align.set_index(' Player').join(df_namehts.set_index(' Player')) #indexing by player
df_playoffhts = df_playoffhts.reset_index()
df_playoffhts = df_playoffhts.sort_values(by=['height']) #sorting the heights
df_playoffhts.head()

Unnamed: 0,Player,Rank,PTS,height
21,Bill Keller,214.0,1139,5-10
17,Bill Bradley,191.0,1222,5-11
170,Louie Dampier,113.0,1651,6-0
159,Kyle Lowry,96.0,1794,6-0
52,Chris Paul,29.0,2981,6-0


In [47]:
import plotly.express as px
fig = px.scatter(df_playoffhts, x="height", y=" PTS", color = " Player", title = "All-Time Playoff Scoring Leaders Height Distribution")
fig.update_layout(yaxis_range=[1000,7700]) #if PTS was still dtype object, each ascending value would be the next 'tick' in y-axis
fig.show()

**Heights of All-Time Playoff Scoring Leaders and Their Distribution**

When analyzing the heights of all-time playoff scoring leaders in the NBA, height does not appear to exert a significant influence on playoff scoring within this group. Points scored are spread widely across different heights, with only a handful of outliers. This observation suggests that factors other than height, such as skill, experience, and game strategy, play a more crucial role in determining playoff performance. The absence of a clear correlation between height and playoff scoring underscores the multifaceted nature of basketball talent and the varied skill sets possessed by players across different height categories.