In [3]:
# This line imports everything from the datascience module.
from datascience import *


## DataSets: Adding or changing a dataset

## Part 1: Changing a dataset 
<div class="alert alert-block alert-info">
There are two goals in this section:

- Understand the paths to data files that are stored with your notebook
- Be able to change the location and path to a data file
</div>

### Understanding Paths

<div class="alert alert-block alert-success">

If you would like to change the dataset being used in a notebook, your first step is to determine the path to the dataset relative to the notebook you are working with. The second step is to copy that path into the `Table.read_table` function.

For example, we are updating the player_data.csv and salary_data.csv files to reflect 2024 NBA data. Later we will add in the WNBA player data.

Sometimes it is helpful to know where the data files you need are located relative to your notebook. There are two commands we can run:
- `%pwd` -- "print working directory" -- displays your current directory
- `%ls -l` -- "list in long format" -- lists all the files in the current directory

**Note:** You probably do not want to keep these commands in a notebook you distribute to students. 

Run the cells below to observe these commands:

</div>
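
The `%pwd` and `%ls` magics only work inside a notebook. As a rough sketch, the same check can be done in plain Python with the standard library. The helper name `list_csv_files` is made up for this illustration and is not part of the datascience package.

```python
from pathlib import Path


def list_csv_files(directory="."):
    """Return the sorted names of the .csv files directly inside `directory`."""
    return sorted(p.name for p in Path(directory).glob("*.csv"))


# Path.cwd() plays the role of %pwd; list_csv_files() is a narrower %ls.
print(Path.cwd())
print(list_csv_files())
```

If the data files sit next to the notebook, their names should appear in the printed list.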


In [None]:
%pwd

In [None]:
%ls -l

### Changing the path to a dataset

<div class="alert alert-block alert-success">
Hopefully the `%pwd` and `%ls -l` commands showed that you are in the nba-demo directory and that your data files are present.

The paths we need to the 2024 data are:
- player_data_nba_2024.csv
- salary_data_nba_2024.csv

You might want to run the cell below to see the data we currently have. 

Then change the paths below by replacing player_data.csv and salary_data.csv with the appropriate new file names. Once the change is made, run the cell and take a look at the data again.
</div>


In [4]:
nba_player_data = Table.read_table("player_data.csv")
nba_salary_data = Table.read_table("salary_data.csv")

# The show method immediately displays the contents of a table. 
nba_player_data.show(3)
nba_salary_data.show(3)



Name,Age,Team,Games,Rebounds,Assists,Steals,Blocks,Turnovers,Points
James Harden,25,HOU,81,459,565,154,60,321,2217
Chris Paul,29,LAC,82,376,838,156,15,190,1564
Stephen Curry,26,GSW,80,341,619,163,16,249,1900


PlayerName,Salary
Kobe Bryant,23500000
Amar'e Stoudemire,23410988
Joe Johnson,23180790


### Storing datasets in a sub-folder

<div class="alert alert-block alert-success">
There is a big difference! The original data is from the 2014-15 season.

You might want to load and display both the 2014 and 2024 datasets in the same cell to easily see the difference in salaries. 

Before we do that, we can practice moving our datasets into a different folder.

- Create the folder, data, in the nba-demo folder.
- Move the .csv files into the folder, data.
- Run `pwd` and `ls -l` again to see the difference.
</div>
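
The folder can be created and the files moved by hand in the file browser. If you prefer to script it, a minimal sketch with the standard library could look like the following. The function name `move_csv_files` and its parameters are assumptions made for this example.

```python
import shutil
from pathlib import Path


def move_csv_files(project_dir=".", folder_name="data"):
    """Move every .csv file in `project_dir` into `project_dir/folder_name`."""
    project = Path(project_dir)
    target = project / folder_name
    target.mkdir(exist_ok=True)  # create the folder if it is not there yet
    moved = []
    # sorted() collects the matches before any file is moved.
    for csv_file in sorted(project.glob("*.csv")):
        shutil.move(str(csv_file), str(target / csv_file.name))
        moved.append(csv_file.name)
    return moved
```

Calling `move_csv_files()` from the nba-demo directory would move the .csv files found there into `data/` and return their names.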

In [None]:
%pwd

In [None]:
%ls -l

<div class="alert alert-block alert-success">
Now, instead of seeing the csv files you see the folder, "data/". We need to include the folder name, "data/", when we use the `Table.read_table` function.

In the cell below:
- Read in the 2014 and 2024 data via `Table.read_table`.
- Show the first few records of each.
- Be careful! The path to the data files is now: "data/name_of_file.csv"
</div>
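
To avoid retyping the folder name for every file, one option is a small helper that builds the path for you. This is only a sketch: `DATA_FOLDER` and `data_path` are names invented for this example.

```python
from pathlib import Path

DATA_FOLDER = Path("data")  # the sub-folder created in the previous step


def data_path(file_name):
    """Return the notebook-relative path to a file inside the data folder."""
    return (DATA_FOLDER / file_name).as_posix()


print(data_path("salary_data.csv"))  # prints data/salary_data.csv
```

The returned string can be passed directly to `Table.read_table`.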

In [None]:

nba_player_data_2024 = ...
nba_salary_data_2024 = ...

# The show method immediately displays the contents of a table. 
nba_salary_data_2024.show(3)

nba_player_data_2014 = ...
nba_salary_data_2014 = ...

nba_salary_data_2014.show(3)


### Where are we

<div class="alert alert-block alert-success">

In ten years, the salary of the highest-paid player in the NBA jumped by about $28,000,000!

Inflation, OK. But still: 51,000,000 dollars in 2024 was worth about 38,000,000 dollars in 2014.

We know how to read in csv files both from the directory our notebook is currently in as well as from a sub-folder. 

Now, we work on loading the WNBA salary data from a URL (website).

</div>
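
The comparison above can be worked through as simple arithmetic. The sketch below uses the rounded figures quoted in the text, and it assumes the first salary row displayed earlier (23,500,000) is the 2014-15 maximum.

```python
# Rounded figures from the text and the salary table shown earlier.
top_salary_2014 = 23_500_000                  # first row of the 2014-15 salary data
top_salary_2024 = 51_000_000                  # approximate 2024 figure from the text
top_salary_2024_in_2014_dollars = 38_000_000  # inflation-adjusted figure from the text

nominal_jump = top_salary_2024 - top_salary_2014
real_jump = top_salary_2024_in_2014_dollars - top_salary_2014
real_growth = top_salary_2024_in_2014_dollars / top_salary_2014

print(f"Nominal jump: ${nominal_jump:,}")
print(f"Jump in 2014 dollars: ${real_jump:,}")
print(f"Real growth factor: {real_growth:.2f}x")
```

Even after adjusting for inflation, the top salary under these rounded figures is still well above its 2014 level.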

## Part 2: Change the data file to read from a URL
<div class="alert alert-block alert-info">
In this section, we are aiming for you to be able to:

- Load and display data from a URL
- Create your own Code cell
- Create your own Markdown cell

</div>

<div class="alert alert-block alert-success">

It is pretty straightforward to load a dataset from a URL instead of storing the dataset with the notebook itself.

We want to add WNBA player data to this notebook, which includes statistics and salary data for WNBA players. We conveniently put the dataset in a git repository on GitHub. You can navigate to the dataset by going here:
- https://github.com/ucb-ds/demo-datasets/raw/main/wnba_data.csv

***Step 1:***

You can create a "Code" cell below, that will load and show the data from a URL by replacing the path in `Table.read_table` with the URL above.

To create a "Code" cell, move to the top-right of this cell and click the button with the "+" sign underneath it (second to last).

![Image of "Code" button to click to create "Code" cell](code1.png "'Code' button to click to create 'Code' cell")

Copy the contents of a cell earlier in the notebook that loads in player data, then change the path to the WNBA data URL. You may want to change the variable names as well to be descriptive of the dataset we are loading.

</div>
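
Before wiring a URL into `Table.read_table`, it can help to confirm that the URL returns raw CSV text rather than an HTML page. The following is a minimal sketch using only the standard library. `preview_csv` is a hypothetical helper written for this example, not a function of the datascience package, and it reads the whole file into memory, so it is best suited to small datasets.

```python
import csv
import io
from urllib.request import urlopen


def preview_csv(url, n_rows=3):
    """Return the header and the first `n_rows` data rows of a CSV at `url`."""
    with urlopen(url) as response:
        text = response.read().decode("utf-8")
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[0] if rows else []
    return header, rows[1:1 + n_rows]
```

For example, `header, first_rows = preview_csv(url)` with `url` set to the WNBA address above should return the column names and a few records if the link points at the raw file.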

<div class="alert alert-block alert-success">

***Step 2:***

Now, we should add some context for the student regarding this new dataset from the WNBA. We might tell the student which years the dataset covers, where we got it, and set up some questions that motivate exploring the data. For example: talk about pay equity between genders, or ask which statistic is stronger for women than for men.

You can create a "Markdown" cell above the "Code" cell you just made by moving to the top-right of the "Code" cell and clicking the button with the "+" sign above it (third to last).

![Image of "Code" button to click to create "Code" cell](code1.png "'Code' button to click to create 'Code' cell")

Go ahead and type. We can show you some Markdown tricks!
</div>

## Summary
<div class="alert alert-block alert-info">
Here we have practiced various methods of loading datasets into your notebook, displaying those datasets, and creating and editing Code cells.
</div>

# Plots and Packages

We want to create a histogram to explore various parts of the data we have on the NBA and WNBA. The goal in this section is to illustrate how to import packages you may want to use with your notebooks that are not already installed on the JupyterHub you are using.

**Note:** Creating a histogram with the plotly-express package is not as easy as with the datascience package! But I needed a package not installed on the JupyterHub to illustrate how to install packages yourself.

<div class="alert alert-block alert-info">
You will be able to:

- Install a package that may not already be installed
- Import that package into your notebook
- Use the package
</div>


<div class="alert alert-block alert-success">

The **try-except** statement below first attempts to import plotly.express; if that fails, it installs the plotly-express package and then imports it again. If you run the cell again, the system will simply import the package for use; it will not need to install it again.

For you to do:
- In the two spots marked with "...", replace the "..." with the name of the column you would like to summarize as a histogram.
- You should also swap out the data set being used. Here we have "wnba_player_data"; you can take a look at the "nba_player_data" as well.

</div>
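
If you only want to know whether a package is available, without importing or installing anything, a small check with the standard library is one option. This is a sketch; `is_installed` is a name chosen for this example, and it is meant for top-level module names such as `plotly`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the top-level module `module_name` can be found."""
    return importlib.util.find_spec(module_name) is not None


print(is_installed("json"))  # a standard-library module, so this prints True
```

Running `is_installed("plotly")` on your JupyterHub tells you whether the install step will be needed.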

In [None]:
# Try to import plotly.express; if it is not installed yet, install it and import again.
try:
    import plotly.express as px
except ImportError:
    %pip install plotly-express
    import plotly.express as px

# Create a histogram
fig = px.histogram(wnba_player_data.to_df()[...], x="...",
                  title="Distribution of Athletes age")
fig.show()

