# AnyoneAI - Project I

# An analysis of NBA players in the 2021/2022 season

Hi! This is the first of several projects we're going to be working on during this course.
You will be expected to finish this on your own, but you can use the available channels on Discord to ask questions and help others. Please read the entire notebook before starting; this will give you a better idea of what you need to accomplish.

This project will rely heavily on the use of APIs as data sources. Unlike most machine learning challenges and competitions, working in the industry usually requires the ML Developer to work with multiple teams and use heterogeneous sources of information to get the data needed to solve a particular problem. Access to data is often provided through application programming interfaces (APIs), whether internal or external to the organization. It is very important to understand how to interact with APIs to collect data in our day-to-day work.

You might be wondering: why basketball? The main reason is availability of data.

The sport is played at a fast pace, with hundreds of plays in every game and thousands of games in a season. The number of on-court players is relatively small, so each player has many interactions with the ball. This in turn provides an opportunity to collect a large amount of data about each player's performance.

These are the objectives of the project:
- Understanding how to query an API to create a dataset with Python and Pandas
- Learning how to clean up a dataset and generate new fields from calculated data
- Storing the created dataset in a serialized manner
- Generating statistics about the data
- Visualizing data

## Introduction

### A brief description of basketball and the NBA

The National Basketball Association is the main basketball league in the United States of America. It currently features 30 teams from different cities, divided into 2 conferences (East and West) of 15 teams each. Each team plays a total of 82 games during the regular season. After that, the 8 teams with the best records from each conference are seeded in a playoff format, with the winners of the two conferences playing the finals to determine the eventual champion. NBA seasons usually run from October of one year to June of the next, so, for example, the season currently being played is called the 2021/2022 season.

As in most leagues in the world, the game is played 5 vs 5, with as many as 9 reserve players who can rotate with the starters as many times as the team wants. Games last 48 minutes, so the total combined playing time for any team in a single game with no added time is 240 minutes (5 players × 48 minutes). If the score is tied at the end of the 48 minutes, 5 minutes of extra time are played; this continues until a winner is decided.

Even though they can play multiple positions, players are usually classified according to the following positions:

- Guards
    - Point Guards
    - Shooting Guards
- Forwards
    - Small Forwards
    - Power Forwards
- Centers

We will mainly focus on the three main positions: Guards/Forwards/Centers

### The dataset

You'll be in charge of creating our dataset. We want to create a single pandas dataframe with information about all active players in the current NBA season. 
The dataset needs to have the following structure:

- Personal Information
    - player_id (int) (INDEX) 
    - player_name (str)
    - team_name (str)
    - position (str)
    - height (int) (in centimeters) 
    - weight (float) (in kilograms)
    - country of origin (str)
    - date_of_birth (datetime)
    - age (str) (years, months and days)
    - years_of_experience (int) (years since entering the league)
    - Draft position (int)
- Player career statistics
    - games played (int)
    - minutes per game (float)
    - points per game (float)
    - rebounds per game (float)
    - assists per game (float)
    - steals per game (float)
    - blocks per game (float)
- Misc
    - salary in dollars (int) (contract value for this season only)
    - next_game_date (datetime)

Here is a sample of what the final result should look like:

In [None]:
import pandas as pd

sample_dict = {
    'PLAYER_NAME': {200765: 'Rajon Rondo',  203107: 'Tomas Satoransky',  204060: 'Joe Ingles'},
    'TEAM_NAME': {200765: 'Cavaliers', 203107: 'Wizards', 204060: 'Trail Blazers'},
    'POSITION': {200765: 'Guard', 203107: 'Guard', 204060: 'Forward'},
    'HEIGHT': {200765: 185, 203107: 201, 204060: 203},
    'WEIGHT': {200765: 82, 203107: 95, 204060: 100},
    'COUNTRY': {200765: 'USA', 203107: 'Czech Republic', 204060: 'Australia'},
    'BIRTHDATE': {200765: pd.Timestamp('1986-02-22 00:00:00'), 203107: pd.Timestamp('1991-10-30 00:00:00'), 204060: pd.Timestamp('1987-10-02 00:00:00')},
    'SEASON_EXP': {200765: 15, 203107: 5, 204060: 7},
    'DRAFT_NUMBER': {200765: '21', 203107: '32', 204060: 'Undrafted'},
    'GP': {200765: 957, 203107: 388, 204060: 590},
    'MIN': {200765: 29.9, 203107: 22.2, 204060: 25.7},
    'PTS': {200765: 9.8, 203107: 6.9, 204060: 8.6},
    'REB': {200765: 4.5, 203107: 2.9, 204060: 3.2},
    'AST': {200765: 7.9, 203107: 4.1, 204060: 3.8},
    'STL': {200765: 1.6, 203107: 0.8, 204060: 0.9},
    'BLK': {200765: 0.1, 203107: 0.2, 204060: 0.2},
    'GAME_DATE': {200765: pd.Timestamp('2022-04-10 00:00:00'), 203107: pd.Timestamp('2022-04-10 00:00:00'), 204060: pd.Timestamp('2022-04-10 00:00:00')},
    'SALARY': {200765: 2641691, 203107: 10468119, 204060: 14000000},
    'AGE': {200765: '36 years, 1 months, 19 days', 203107: '30 years, 5 months, 11 days', 204060: '34 years, 6 months, 8 days'}
}
pd.DataFrame(sample_dict)

Unnamed: 0,PLAYER_NAME,TEAM_NAME,POSITION,HEIGHT,WEIGHT,COUNTRY,BIRTHDATE,SEASON_EXP,DRAFT_NUMBER,GP,MIN,PTS,REB,AST,STL,BLK,GAME_DATE,SALARY,AGE
200765,Rajon Rondo,Cavaliers,Guard,185,82,USA,1986-02-22,15,21,957,29.9,9.8,4.5,7.9,1.6,0.1,2022-04-10,2641691,"36 years, 1 months, 19 days"
203107,Tomas Satoransky,Wizards,Guard,201,95,Czech Republic,1991-10-30,5,32,388,22.2,6.9,2.9,4.1,0.8,0.2,2022-04-10,10468119,"30 years, 5 months, 11 days"
204060,Joe Ingles,Trail Blazers,Forward,203,100,Australia,1987-10-02,7,Undrafted,590,25.7,8.6,3.2,3.8,0.9,0.2,2022-04-10,14000000,"34 years, 6 months, 8 days"


## Collecting information for building our dataset

In this section, we're only going to work on collecting the raw data needed to build the required dataset. Don't worry about finishing everything here; we'll generate the appropriate fields and merge the data into a single dataframe in the next section.

To get the information, you can use any public and free API you can find, but you have to provide the code that gets the information here. We recommend using this API:
 
- https://github.com/swar/nba_api

    This is a Python library that can be used to obtain data from stats.nba.com. It provides a set of methods that spare you from making the HTTP calls yourself: the library calls the NBA stats page directly and parses the results. [Here](https://github.com/swar/nba_api/blob/master/docs/examples/Basics.ipynb) are a couple of examples of how to use it.
    

A few notes on data collection:

- Start simple. Try to get all the required information for 1 player, read the API's documentation carefully, then think about how to use it to collect the data for all players.

- Please bear in mind that most public APIs have some kind of rate limit, so you have to be careful about iterating over data and making lots of requests in a short amount of time (a 1-second delay between calls to the API should be enough). Once you've collected what you need, save it to a file so you can retrieve it later without calling the API again.

- A key consideration: we only want data about players that have played in the current season, so make sure to filter out everyone else before collecting the rest of the information.

- There is at least one piece of information you're not going to find in that API: the player contract information. Again, you can decide to use any source, but we recommend using the information provided [here](https://www.basketball-reference.com/contracts/players.html), as it lets you export the data as a CSV.
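
The throttling and caching advice in the notes above can be sketched as follows. This is only an illustration under stated assumptions: `fetch_all_with_cache`, `fetch_one` and `cache_path` are hypothetical names and are not part of nba_api. In practice `fetch_one` would wrap whichever endpoint call you choose and return JSON-serializable data.

```python
import json
import os
import time


def fetch_all_with_cache(ids, fetch_one, cache_path, delay=1.0):
    """Call fetch_one(id) for each id, pausing between calls, and cache the results.

    fetch_one is a placeholder for any function that performs one API request
    and returns JSON-serializable data.
    """
    # Reuse the saved file if it exists, so the API is not called again.
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)

    results = {}
    for i, item_id in enumerate(ids):
        if i > 0:
            time.sleep(delay)  # wait between requests to respect the rate limit
        # JSON object keys are strings, so store ids as strings from the start.
        results[str(item_id)] = fetch_one(item_id)

    with open(cache_path, "w") as f:
        json.dump(results, f)
    return results
```

Calling the function a second time with the same `cache_path` reads the file instead of hitting the API, which is the behaviour the last sentence of the rate-limit note asks for.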

1- Create a function to find all ACTIVE players, meaning players that are listed with a team in the 2021/2022 season. For now you only need the player id, name, and team. Save the dataframe to a csv named "nba_current_players_list.csv". The function should return the dataframe.

Hint: you should find an API method that can give you a list of players in just one call. This way we can first filter the players we're interested in, and later make calls for each specific player.

Consider dropping: 
- All players with TEAM_ID == 0
- All players with GAMES_PLAYED_FLAG == N
- Player with id 1630597 (This guy is a problem ;))

In [5]:
pip install nba_api

Collecting nba_api
  Downloading nba_api-1.1.11.tar.gz (125 kB)
Building wheels for coll

In [3]:
import numpy as np
import pandas as pd
import time
from nba_api.stats.endpoints import playercareerstats, commonplayerinfo, commonallplayers, playerprofilev2


### Complete in this cell: get all active players from the api

def get_and_save_players_list():
    # Retrieve the players of the current season from the API (a single call)
    all_players = commonallplayers.CommonAllPlayers(is_only_current_season=1).get_data_frames()[0]

    # Masks: drop players without a team, without games played, and the problematic id
    has_team = all_players['TEAM_ID'] != 0
    has_played = all_players['GAMES_PLAYED_FLAG'] != "N"
    not_excluded = all_players['PERSON_ID'] != 1630597

    nba_players = all_players.loc[has_team & has_played & not_excluded]

    # Rename columns, index by player id and keep only the player name
    nba_players = nba_players.rename(columns={'PERSON_ID': 'PLAYER_ID', 'DISPLAY_FIRST_LAST': 'PLAYER_NAME'})
    nba_players = nba_players.set_index('PLAYER_ID')[['PLAYER_NAME']]

    return nba_players


In [4]:
current_players_list = get_and_save_players_list()
current_players_list.to_csv("nba_current_players_list.csv")
current_players_list

Unnamed: 0_level_0,PLAYER_NAME
PLAYER_ID,Unnamed: 1_level_1
1630173,Precious Achiuwa
203500,Steven Adams
1628389,Bam Adebayo
1630583,Santi Aldama
200746,LaMarcus Aldridge
...,...
1628221,Gabe York
201152,Thaddeus Young
1629027,Trae Young
1630209,Omer Yurtseven


2- Create a function to find the personal information of all players listed in the dataframe created in the previous step, and save it to a csv file named "nba_players_personal_info.csv". The function should also return the created dataframe.

OPTIONAL: iterating over a list of players and making API calls can be complex and error-prone. Try a code block that handles exceptions (for example, a timeout from the API) and returns the partial result collected before the failure; you could also save the partial information to disk.
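
One possible shape for such an exception-handling loop is sketched below. The names `collect_with_partial_results` and `fetch_one` are hypothetical; `fetch_one` stands in for a single API call, such as the `CommonPlayerInfo` request used in the next cell.

```python
import time


def collect_with_partial_results(ids, fetch_one, delay=1.0):
    """Return (records, failed_id): what was fetched so far and the id that failed, if any."""
    records = []
    for item_id in ids:
        try:
            records.append(fetch_one(item_id))
        except Exception as exc:  # e.g. a timeout raised by the HTTP client
            print(f"Stopped at id {item_id}: {exc}")
            return records, item_id
        time.sleep(delay)  # pause between successful calls
    return records, None
```

If each record is a dataframe, the partial list can be combined with `pd.concat(records)` and written to disk with `to_csv`, and the run can later resume from `failed_id`.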

In [5]:
### Complete in this cell: Find players personal information (name, age, dob, etc), store the information in a CSV file.

def sleepAndPrint(x):
    '''Pause between API calls and print how many players have been retrieved'''
    time.sleep(0.5)
    x += 1
    print("Retrieving player n°: {0}".format(x), end="\r")
    return x

def get_players_personal_information(current_players_list):
    '''Retrieves personal information through the API'''
    frames = []   # DataFrame.append was removed in pandas 2.0, so collect the frames and concat once
    count = 0
    for player_id in current_players_list.index:
        frames.append(commonplayerinfo.CommonPlayerInfo(player_id=str(player_id), timeout=30).get_data_frames()[0])
        count = sleepAndPrint(count)
    commonInfo = pd.concat(frames)

    # Keep the needed columns, rename them and set the player id as index
    commonInfo = commonInfo[["PERSON_ID", "DISPLAY_FIRST_LAST", "TEAM_NAME", "POSITION", "HEIGHT", "WEIGHT", "COUNTRY", "BIRTHDATE", "DRAFT_NUMBER"]]
    commonInfo = commonInfo.rename(columns={'PERSON_ID': 'PLAYER_ID', 'DISPLAY_FIRST_LAST': 'PLAYER_NAME'})
    commonInfo = commonInfo.set_index('PLAYER_ID')

    return commonInfo


In [4]:
nba_players_personal_info = get_players_personal_information(current_players_list)
nba_players_personal_info.to_csv("personal_player_information.csv")   # to_csv returns None when given a path, so don't reassign


Retrieving player n°: 503

In [6]:
## Read csv from disk, Filter and rename columns ##

players_personal_info = pd.read_csv('personal_player_information.csv')
players_personal_info = players_personal_info.set_index('PLAYER_ID')
players_personal_info

Unnamed: 0_level_0,PLAYER_NAME,TEAM_NAME,POSITION,HEIGHT,WEIGHT,COUNTRY,BIRTHDATE,DRAFT_NUMBER
PLAYER_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1630173,Precious Achiuwa,Raptors,Forward,6-8,225,Nigeria,1999-09-19T00:00:00,20
203500,Steven Adams,Grizzlies,Center,6-11,265,New Zealand,1993-07-20T00:00:00,12
1628389,Bam Adebayo,Heat,Center-Forward,6-9,255,USA,1997-07-18T00:00:00,14
1630583,Santi Aldama,Grizzlies,Forward-Center,6-11,215,Spain,2001-01-10T00:00:00,30
200746,LaMarcus Aldridge,Nets,Center-Forward,6-11,250,USA,1985-07-19T00:00:00,2
...,...,...,...,...,...,...,...,...
1628221,Gabe York,Pacers,Guard,6-3,190,USA,1993-08-02T00:00:00,Undrafted
201152,Thaddeus Young,Raptors,Forward,6-8,235,USA,1988-06-21T00:00:00,12
1629027,Trae Young,Hawks,Guard,6-1,164,USA,1998-09-19T00:00:00,5
1630209,Omer Yurtseven,Heat,Center,6-11,275,Turkey,1998-06-19T00:00:00,Undrafted


3- Create a function to find players career statistics, store the information in a CSV file called "nba_players_career_stats.csv"

In [7]:
### Complete in this cell: find players career stats, save to csv file

def get_players_career_stats(current_players_list):
    frames = []   # DataFrame.append was removed in pandas 2.0, so collect the frames and concat once
    count = 0
    for player_id in current_players_list.index:
        # Pull the regular-season career totals (per game) for this player
        frames.append(playerprofilev2.PlayerProfileV2(per_mode36="PerGame", player_id=str(player_id)).career_totals_regular_season.get_data_frame())
        count = sleepAndPrint(count)
    player_career_stats = pd.concat(frames, sort=False)
    # Keep the needed columns and set the player id as index
    player_career_stats = player_career_stats[["PLAYER_ID", "GP", "MIN", "PTS", "REB", "AST", "STL", "BLK"]]
    player_career_stats = player_career_stats.set_index('PLAYER_ID')
    return player_career_stats


*Retrieving data using the API:

In [7]:
nba_players_career_stats = get_players_career_stats(current_players_list)
nba_players_career_stats.to_csv("nba_players_career_stats.csv")


Retrieving player n°: 503

*Read csv nba_players_career_stats.csv from disk:

In [8]:
players_career_stats = pd.read_csv('nba_players_career_stats.csv')
players_career_stats = players_career_stats.set_index('PLAYER_ID')
players_career_stats



Unnamed: 0_level_0,GP,MIN,PTS,REB,AST,STL,BLK
PLAYER_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1630173,134,18.4,7.2,5.1,0.8,0.4,0.5
203500,664,26.8,9.3,8.0,1.5,0.9,0.9
1628389,343,28.2,13.5,8.3,3.5,1.0,0.9
1630583,32,11.3,4.1,2.7,0.7,0.2,0.3
200746,1076,33.7,19.1,8.1,1.9,0.7,1.1
...,...,...,...,...,...,...,...
1628221,2,10.5,4.0,1.0,2.0,1.0,0.5
201152,1085,29.3,12.8,5.8,1.8,1.4,0.4
1629027,280,33.6,25.3,3.9,9.1,0.9,0.2
1630209,56,12.6,5.3,5.3,0.9,0.3,0.4


4- Create a function to find each player's next game and save the information to a csv called "nba_players_next_game.csv"

In [9]:
### Complete in this cell: find players next game

def get_players_next_game(current_players_list):
    frames = []   # DataFrame.append was removed in pandas 2.0, so collect the frames and concat once
    count = 0
    for player_id in current_players_list.index:
        # Pull the next-game data for this player from the endpoint
        player_data = playerprofilev2.PlayerProfileV2(per_mode36="PerGame", player_id=str(player_id), timeout=50).next_game.get_data_frame()
        player_data.insert(0, "PLAYER_ID", player_id, True)   # the next_game table has no player id column
        frames.append(player_data)
        count = sleepAndPrint(count)
    player_next_game = pd.concat(frames)
    player_next_game = player_next_game[["PLAYER_ID", "GAME_DATE"]].set_index('PLAYER_ID')
    return player_next_game

Get next_game from API:

In [10]:
players_next_game = get_players_next_game(current_players_list)
players_next_game.to_csv("nba_players_next_game.csv")

Retrieving player n°: 504

Unnamed: 0_level_0,GAME_DATE
PLAYER_ID,Unnamed: 1_level_1
1630173,APR 23 2022
203500,APR 23 2022
1628389,APR 24 2022
1630583,APR 23 2022
200746,APR 23 2022
...,...
1630589,APR 23 2022
1630593,APR 23 2022
201152,APR 23 2022
1629027,APR 24 2022


*Read csv nba_players_next_game.csv from disk:

In [11]:
players_next_game = pd.read_csv('nba_players_next_game.csv')
players_next_game = players_next_game.set_index('PLAYER_ID')
players_next_game

Unnamed: 0_level_0,GAME_DATE
PLAYER_ID,Unnamed: 1_level_1
1630173,APR 23 2022
203500,APR 23 2022
1628389,APR 24 2022
1630583,APR 23 2022
200746,APR 23 2022
...,...
1630589,APR 23 2022
1630593,APR 23 2022
201152,APR 23 2022
1629027,APR 24 2022


5- Create a function to find the players' salaries for this season and save the information to a csv called "nba_players_salary.csv". Make sure the format of the players' names matches the one used by the API, otherwise you won't be able to merge the data later.

Hint: Using data from the Basketball Reference page, you will have to solve 2 kinds of problems: duplicated values (for which you should keep just the first value) and players' names not matching the ones from the API. The latter problem has multiple causes; one of them is that some names are written with non-ASCII characters (there are libraries for dealing with that).
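
A minimal sketch of both fixes from the hint, using only the standard library. The helper names `to_ascii` and `keep_first_salary` are made up for this illustration. Note that this approach only strips accents: letters with no decomposed form (for example "ø") are dropped entirely, so a dedicated transliteration library may be a better fit for some names.

```python
import unicodedata


def to_ascii(name):
    """Strip accents from a name, e.g. 'Luka Dončić' becomes 'Luka Doncic'."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Encoding to ASCII with errors ignored drops the combining marks.
    return decomposed.encode("ascii", "ignore").decode("ascii").strip()


def keep_first_salary(rows):
    """Given (name, salary) pairs, keep only the first salary seen per normalized name."""
    salaries = {}
    for name, salary in rows:
        # setdefault leaves an existing entry untouched, so the first value wins.
        salaries.setdefault(to_ascii(name), salary)
    return salaries
```

On a dataframe, the same idea would be `df["PLAYER_NAME"].apply(to_ascii)` followed by `df.drop_duplicates(subset="PLAYER_NAME", keep="first")`.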

In [12]:
### Complete in this cell: find players salary, save the information to csv

def get_nba_players_salaries(csv_file_path):
    # Note: csv_file_path is not used here; the salaries are scraped from the HoopsHype table instead
    salaries = pd.read_html('https://hoopshype.com/salaries/players/')[0]
    salaries = salaries.rename(columns={'Player':'PLAYER_NAME','2021/22':'SALARY'})   # season column -> SALARY
    salaries = salaries[["PLAYER_NAME", "SALARY"]]

    return salaries

In [13]:
players_salaries = get_nba_players_salaries("contracts.csv")
players_salaries.to_csv("nba_players_salary.csv")
players_salaries

Unnamed: 0,PLAYER_NAME,SALARY
0,Stephen Curry,"$45,780,966"
1,John Wall,"$44,310,840"
2,James Harden,"$44,310,840"
3,Russell Westbrook,"$44,211,146"
4,Kevin Durant,"$42,018,900"
...,...,...
649,Craig Sword,"$53,176"
650,Luca Vildoza,"$42,789"
651,Zavier Simpson,"$37,223"
652,Mfiondu Kabengele,"$19,186"


*Read csv nba_players_salary.csv from disk:

In [15]:
players_salaries = pd.read_csv("nba_players_salary.csv",index_col=0)
players_salaries

Unnamed: 0,PLAYER_NAME,SALARY
0,Stephen Curry,"$45,780,966"
1,John Wall,"$44,310,840"
2,James Harden,"$44,310,840"
3,Russell Westbrook,"$44,211,146"
4,Kevin Durant,"$42,018,900"
...,...,...
649,Craig Sword,"$53,176"
650,Luca Vildoza,"$42,789"
651,Zavier Simpson,"$37,223"
652,Mfiondu Kabengele,"$19,186"


6- Create a function to merge the created dataframes: players_personal_info, players_career_stats, players_next_game, players_salaries. For each dataframe, select only the subset of columns needed to create the dataset described in section "The Dataset"

    - Players info: "PLAYER_NAME", "TEAM_NAME", "POSITION", "HEIGHT", "WEIGHT", "COUNTRY", "BIRTHDATE", "SEASON_EXP", "DRAFT_NUMBER"
    - Players stats: "GP", "MIN", "PTS", "REB", "AST", "STL", "BLK"
    - Misc: "GAME_DATE", "SALARY"

Save the result to a csv called "raw_nba_players_dataset.csv"

Hint: Before merging the data, you should make sure all four dataframes have the same length, are indexed by PERSON_ID and have the same keys
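
A quick way to act on this hint is to compare the index values of the dataframes before merging. The helper `check_same_keys` below is a hypothetical sketch, and the two small dataframes are made-up toy data used only to show the call.

```python
import pandas as pd


def check_same_keys(frames):
    """Return, for each named dataframe, the index values that are missing from it."""
    all_keys = set()
    for df in frames.values():
        all_keys |= set(df.index)
    return {name: sorted(all_keys - set(df.index)) for name, df in frames.items()}


# Toy data: the stats frame lacks player 2.
info = pd.DataFrame({"PLAYER_NAME": ["A", "B"]}, index=pd.Index([1, 2], name="PLAYER_ID"))
stats = pd.DataFrame({"GP": [10]}, index=pd.Index([1], name="PLAYER_ID"))
missing = check_same_keys({"info": info, "stats": stats})
print(missing)
```

An empty list for every dataframe means all of them share the same keys. The salaries dataframe has no player id, so it would first need to be indexed by a player id obtained through the player name.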

In [34]:
### Complete in this cell: merge the dataframes
from functools import reduce

def merge_dataframes(players_personal_info, players_career_stats, players_next_game, players_salaries):

    dfs = [players_personal_info, players_career_stats, players_next_game]      # dataframes indexed by PLAYER_ID

    dataframe = reduce(lambda a, b: pd.merge(a, b, on='PLAYER_ID', how="left"), dfs)   # pairwise left merge on the player id

    salaries = players_salaries.drop_duplicates(subset='PLAYER_NAME', keep='first')    # keep the first salary per player

    # Salaries have no PLAYER_ID, so merge on the name and restore the PLAYER_ID index afterwards
    dataframe = dataframe.reset_index().merge(salaries, on='PLAYER_NAME', how='inner').set_index('PLAYER_ID')

    return dataframe

In [35]:
raw_players_dataset = merge_dataframes(players_personal_info, players_career_stats, players_next_game, players_salaries)
raw_players_dataset.to_csv("raw_nba_players_dataset.csv")
raw_players_dataset

Unnamed: 0_level_0,PLAYER_NAME,TEAM_NAME,POSITION,HEIGHT,WEIGHT,COUNTRY,BIRTHDATE,DRAFT_NUMBER,GP,MIN,PTS,REB,AST,STL,BLK,GAME_DATE,SALARY
PLAYER_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
1630173,Precious Achiuwa,Raptors,Forward,6-8,225,Nigeria,1999-09-19T00:00:00,20,134,18.4,7.2,5.1,0.8,0.4,0.5,APR 23 2022,"$2,711,280"
203500,Steven Adams,Grizzlies,Center,6-11,265,New Zealand,1993-07-20T00:00:00,12,664,26.8,9.3,8.0,1.5,0.9,0.9,APR 23 2022,"$17,073,171"
1628389,Bam Adebayo,Heat,Center-Forward,6-9,255,USA,1997-07-18T00:00:00,14,343,28.2,13.5,8.3,3.5,1.0,0.9,APR 24 2022,"$28,103,500"
200746,LaMarcus Aldridge,Nets,Center-Forward,6-11,250,USA,1985-07-19T00:00:00,2,1076,33.7,19.1,8.1,1.9,0.7,1.1,APR 23 2022,"$2,641,691"
1629638,Nickeil Alexander-Walker,Jazz,Guard,6-5,205,Canada,1998-09-02T00:00:00,17,158,19.4,9.3,2.6,2.2,0.7,0.3,APR 23 2022,"$3,261,480"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1630589,Moses Wright,Mavericks,Forward,6-8,226,USA,1998-12-23T00:00:00,Undrafted,4,3.8,1.3,0.8,0.5,0.0,0.3,APR 23 2022,"$172,821"
201152,Thaddeus Young,Raptors,Forward,6-8,235,USA,1988-06-21T00:00:00,12,1085,29.3,12.8,5.8,1.8,1.4,0.4,APR 23 2022,"$14,190,000"
1629027,Trae Young,Hawks,Guard,6-1,164,USA,1998-09-19T00:00:00,5,280,33.6,25.3,3.9,9.1,0.9,0.2,APR 24 2022,"$8,326,471"
1630209,Omer Yurtseven,Heat,Center,6-11,275,Turkey,1998-06-19T00:00:00,Undrafted,56,12.6,5.3,5.3,0.9,0.3,0.4,APR 24 2022,"$1,489,065"


## Data Cleaning and Preprocessing

There are several steps that you will have to follow. Depending on where you collected the data, some information might be missing.

- Height and weight might need to be converted to the metric system
- Players that have no team assigned should be removed from the dataset
- Players with no contracts (meaning they don't have a salary defined) should be removed from the dataset
- If the "position" data is ambiguous (listed at multiple positions), use the first value
- If the player does not have height or weight data, use the average for its position as its value
- In order to fill the column next_game_date, just consider the date of the next game of each player's team. 
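
The rule about filling missing height or weight with the position average can be sketched with a grouped transform. This is an illustration, not the required solution: `fill_with_position_mean` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd


def fill_with_position_mean(df, column):
    """Return a copy where missing values in `column` are replaced by the mean per POSITION."""
    out = df.copy()
    # transform("mean") returns one value per row: the mean of that row's position group.
    position_mean = out.groupby("POSITION")[column].transform("mean")
    out[column] = out[column].fillna(position_mean)
    return out


# Toy data: the third guard has no height.
toy = pd.DataFrame(
    {
        "POSITION": ["Guard", "Guard", "Guard", "Center"],
        "HEIGHT": [185.0, 195.0, None, 211.0],
    }
)
filled = fill_with_position_mean(toy, "HEIGHT")
```

The missing guard height becomes 190.0, the mean of the two known guard heights, while the original dataframe is left unchanged.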

1- Create a copy of your dataset named "working_df", remove all players with no teams or salary

In [None]:
### Complete in this cell: copy the dataset and drop NaNs in team or salary
def copy_and_delete_nan(players_dataset):
    pass

In [None]:
working_df = copy_and_delete_nan(raw_players_dataset)

2- Cast the Salary, Birthdate and Game Date columns to their corresponding types (int, datetime)
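
As a hedged sketch of this step, the snippet below casts one row in the same raw formats shown in the merged dataset above. It assumes `SALARY` holds strings such as "$2,711,280" and `GAME_DATE` holds strings such as "APR 23 2022"; data from another source may need different handling. The function name `cast_example` is made up.

```python
import pandas as pd


def cast_example(df):
    """Return a copy with SALARY as int and BIRTHDATE / GAME_DATE as datetimes."""
    out = df.copy()
    # "$2,711,280" -> 2711280: remove the dollar sign and thousands separators
    out["SALARY"] = out["SALARY"].str.replace(r"[$,]", "", regex=True).astype("int64")
    # "1999-09-19T00:00:00" is ISO 8601, which pandas parses directly
    out["BIRTHDATE"] = pd.to_datetime(out["BIRTHDATE"])
    # "APR 23 2022" -> "Apr 23 2022" so that it matches the %b month abbreviation
    out["GAME_DATE"] = pd.to_datetime(out["GAME_DATE"].str.title(), format="%b %d %Y")
    return out


raw = pd.DataFrame(
    {
        "SALARY": ["$2,711,280"],
        "BIRTHDATE": ["1999-09-19T00:00:00"],
        "GAME_DATE": ["APR 23 2022"],
    }
)
typed = cast_example(raw)
```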

In [None]:
### Complete in this cell: cast each column to its type
def cast_columns(working_df):
    pass

In [None]:
cast_columns(working_df)

3- Create a function that converts the height column from height in feet and inches to centimeters
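
The conversion itself is simple arithmetic: 1 foot is 12 inches and 1 inch is exactly 2.54 cm. A minimal sketch for a single value in the "feet-inches" format returned by the API (for example "6-8"); the helper name `feet_inches_to_cm` is made up:

```python
def feet_inches_to_cm(height):
    """Convert a 'feet-inches' string such as '6-8' to whole centimeters."""
    feet, inches = height.split("-")
    total_inches = int(feet) * 12 + int(inches)
    return round(total_inches * 2.54)
```

It could then be applied to the whole column with something like `working_df["HEIGHT"].apply(feet_inches_to_cm)`, after dealing with any missing heights.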

In [None]:
### Complete in this cell: convert height column
def convert_height_column(working_df):
    pass

In [None]:
convert_height_column(working_df)

4- Create a function that converts the weight column from pounds to kilograms
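
One pound is defined as exactly 0.45359237 kg, so the conversion is a single multiplication. A small sketch for one value; the helper name `pounds_to_kg` and the rounding to one decimal are choices made for this illustration:

```python
LB_TO_KG = 0.45359237  # exact definition of the international pound in kilograms


def pounds_to_kg(pounds):
    """Convert a weight in pounds (number or numeric string) to kilograms."""
    return round(float(pounds) * LB_TO_KG, 1)
```

Accepting strings as well as numbers is convenient because the weight read from a CSV may arrive as either type.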

In [None]:
### Complete in this cell: convert weight column
def convert_weight_column(working_df):
    pass

In [None]:
convert_weight_column(working_df)

5- Create a function that calculates the age in (years, months, days) and saves it in a new string column, example: "22 years, 5 months, 25 days" 
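
One way to compute such a string with only the standard library is sketched below. The helpers `add_months` and `age_string` are made up for this illustration, and the reference date is passed in explicitly. The result reproduces the AGE values of the sample table when the reference date is 2022-04-10, which the sample values imply.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Shift a date by a number of months, clamping the day to the month's length."""
    total = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def age_string(birth, today):
    """Return the age as 'Y years, M months, D days' (assumes today >= birth)."""
    months_total = (today.year - birth.year) * 12 + (today.month - birth.month)
    # Step back one month if the last monthly anniversary has not been reached yet.
    if add_months(birth, months_total) > today:
        months_total -= 1
    anchor = add_months(birth, months_total)
    years, months = divmod(months_total, 12)
    days = (today - anchor).days
    return f"{years} years, {months} months, {days} days"


print(age_string(date(1986, 2, 22), date(2022, 4, 10)))
```

If `python-dateutil` is available (it is installed alongside pandas), `dateutil.relativedelta.relativedelta(today, birth)` exposes `years`, `months` and `days` attributes and is a common alternative.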

In [None]:
### Complete in this cell: add age column
def add_age_column(working_df):
    pass

In [None]:
add_age_column(working_df)

6- Create a function that takes care of the disambiguation of the "POSITION" column. It should replace every mixed position with the first one listed.
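
In the data collected above, mixed positions are written with a hyphen (for example "Center-Forward"), so keeping the first one is a matter of splitting on it. A minimal sketch, with `first_position` as a made-up helper name:

```python
def first_position(position):
    """Keep only the first listed position, e.g. 'Center-Forward' becomes 'Center'."""
    return position.split("-")[0]
```

The vectorized pandas equivalent would be `working_df["POSITION"].str.split("-").str[0]`.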

In [None]:
### Complete in this cell: disambiguation of the position column
def update_position(working_df):
    pass

In [None]:
update_position(working_df)

7- Check that the working dataset has all the requested columns with their corresponding data types, and save it as a csv named "nba_players_processed_dataset.csv"

In [None]:
working_df.to_csv("nba_players_processed_dataset.csv")

## Analyzing and Visualizing data

Now that we have the data, let's do some work

1- Calculate and print the following metrics:

    - General metrics:
        - Total number of players
        - Number of USA born players
        - Number of foreign players
        - Number of players per position
        - Number of players per team
        - Number of rookies (first year players)
    - Players description
        - Average player age (in years)
        - Youngest player age (years and days, e.g. 18 years and 16 days)
        - Oldest player age (years and days, e.g. 40 years and 160 days)
        - Min and Max players height
        - Average height of players per position
    - Contracts
        - Min and Max salary of all players
        - Mean and Median salary of all players

Bonus: if you can, calculate how many players retired between the end of the 2020-2021 season and the start of the 2021-22 season.
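
Several of these metrics reduce to one-line pandas operations. The sketch below uses only the three players from the sample table in the introduction, so the numbers are illustrative and not the real answers; the variable names are arbitrary.

```python
import pandas as pd

# The three sample players shown in "The dataset" section.
sample = pd.DataFrame(
    {
        "PLAYER_NAME": ["Rajon Rondo", "Tomas Satoransky", "Joe Ingles"],
        "POSITION": ["Guard", "Guard", "Forward"],
        "COUNTRY": ["USA", "Czech Republic", "Australia"],
        "HEIGHT": [185, 201, 203],
        "SALARY": [2641691, 10468119, 14000000],
    }
)

total_players = len(sample)
usa_born = int((sample["COUNTRY"] == "USA").sum())
foreign = total_players - usa_born
players_per_position = sample["POSITION"].value_counts()
mean_height_per_position = sample.groupby("POSITION")["HEIGHT"].mean()
salary_summary = sample["SALARY"].agg(["min", "max", "mean", "median"])

print(total_players, usa_born, foreign)
print(players_per_position)
print(mean_height_per_position)
print(salary_summary)
```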

In [None]:
### Complete in this cell: print general metrics


In [None]:
### Complete in this cell: print players descriptions
    

In [None]:
### Complete in this cell: Contracts


2- Plot the relationship between scoring (points per game) and salary of all players, the players positions should also be visible.

In [None]:
### Complete in this cell:  Relationship between scoring and salary (in millions of dollars)


3- Now plot assists-vs-salary and rebounding-vs-salary

In [None]:
### Complete in this cell: plot assist-vs-salary, rebounding-vs-salary


4- When NBA players enter the league, they have low-value salaries during what is called their "rookie contract". This means that no matter how well the player performs, they can't have a large salary. This can distort our understanding of how much teams value each skill, as a player could score 50 points a game and still earn just a couple of million dollars. So, let's now plot points, assists and rebounding vs salary, but only for players that have more than 4 years of experience (the typical length of a rookie contract)
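
The filtering part of this step can be sketched as a boolean mask on `SEASON_EXP`. The dataframe below is toy data: three rows come from the sample table in the introduction and the "Example Rookie" row is entirely made up. The helper name `beyond_rookie_contract` is hypothetical.

```python
import pandas as pd


def beyond_rookie_contract(df, min_years=4):
    """Keep only the players with more than `min_years` years of experience."""
    return df[df["SEASON_EXP"] > min_years]


toy = pd.DataFrame(
    {
        "PLAYER_NAME": ["Rajon Rondo", "Tomas Satoransky", "Joe Ingles", "Example Rookie"],
        "SEASON_EXP": [15, 5, 7, 1],  # the last row is a made-up player
        "PTS": [9.8, 6.9, 8.6, 12.0],
        "SALARY": [2641691, 10468119, 14000000, 1000000],
    }
)
veterans = beyond_rookie_contract(toy)
```

The filtered dataframe can then be passed to the same plotting code used for the previous plots.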

In [None]:
### Complete in this cell: non rookie contracts


5- Plot the scoring average grouped by position. We want to be able to see the median, quartiles, etc.

In [None]:
### Complete in this cell: Scoring average grouped by position


6- Plot the Height distribution of all players

In [None]:
### Complete in this cell: height distribution


OPTIONAL: Can you find a way to draw a world map and show how many active players per country the NBA has? [Example](https://i.redd.it/8qymui9fnin71.jpg)