# Class 6: Array computations

In this notebook we explore computations on arrays of data. 

In [None]:
import YData

# YData.download.download_class_code(6)   # get class 6 code    

# YData.download.download_class_code(6, True) # get the code with the answers 

There are also similar functions to download the homework:

In [None]:
# YData.download_homework(3)  # downloads the third homework if you have not done so already

If you are using Google Colab, you should install the YData package and mount Google Drive by uncommenting and running the code below.

In [None]:
# !pip install https://github.com/lederman/YData_package/tarball/master
# from google.colab import drive
# drive.mount('/content/drive')

## 0. Warm-up exercises: NBA salaries

Let's do some warm-up exercises by looking at statistics of basketball players in the NBA! The data we will analyze contains information about each player, including their salary from the 2022-2023 season listed in millions of dollars. This table can be found online: https://www.kaggle.com/datasets/jamiewelsh2/nba-player-salaries-2022-23-season

We will load the data as a "pandas DataFrame" which is a data structure we will discuss more in a couple of weeks. We will then convert the data to lists to explore it further. The lists we are creating are:

- `name_list`: A list of the basketball players' names
- `salary_list`: A list of salaries
- `position_list`: A list of the positions each player plays
- `team_list`: A list of which team each player is on
- `points_per_game_list`: A list showing the average number of points each player scored per game


In [None]:
# load the data and display the first 5 rows

import YData
import pandas as pd

YData.download_data("nba_salaries_2022_23_all.csv")
nba = pd.read_csv("nba_salaries_2022_23_all.csv")  # load in the data

nba[["Player Name", "Salary", "Position", "Team", "PTS"]].head()  # show the first 5 rows (the default for head())


In [None]:
# extract the player names, salaries, positions, teams, and points per game as lists

name_list = nba["Player Name"].to_list()
salary_list = nba["Salary"].to_list()
position_list = nba["Position"].to_list()
team_list = nba["Team"].to_list()
points_per_game_list = nba["PTS"].to_list()



### Warm-up exercise 1: Categorical analyses

Can you do the following:
- Calculate the proportion of players who play on the Boston Celtics ("BOS")?

If you finish the other warm-up exercises, you can also try creating a bar plot showing the number of players on the Boston Celtics ("BOS"), New York Knicks ("NYK") and Golden State Warriors ("GSW").

In [None]:
# Proportion of players on the Celtics


In [None]:
# If you finish the other warm-up exercises, you can create a bar plot showing the number of players on "BOS", "NYK", and "GSW"

import matplotlib.pyplot as plt
%matplotlib inline

# Create a list of counts of players on the three teams
team_counts = ...

# Create the team names
team_names = ...


# Create a bar chart of how many players are on each team. Be sure to label your axes!
...


### Warm-up exercise 2: One quantitative variable analyses

Can you do the following:

1. Create a histogram of the player salaries
2. Calculate the mean salary, the median salary, and the standard deviation of salaries
3. Calculate the z-score for the first player's salary (i.e., Stephen Curry's salary)
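
As a reminder of how these statistics fit together, here is a minimal sketch on a small made-up list (`toy_salaries` is hypothetical and is not the NBA data). It assumes the sample standard deviation (`statistics.stdev`); if the class convention is the population standard deviation, `statistics.pstdev` would be used instead.

```python
import statistics

# Hypothetical salaries in millions of dollars (made-up values, not the NBA data)
toy_salaries = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean_salary = statistics.mean(toy_salaries)      # arithmetic mean
median_salary = statistics.median(toy_salaries)  # middle value of the sorted data
sd_salary = statistics.stdev(toy_salaries)       # sample standard deviation (an assumption)

# z-score: how many standard deviations a value lies from the mean
z_first = (toy_salaries[0] - mean_salary) / sd_salary

print(mean_salary, median_salary, sd_salary, z_first)
```

A negative z-score means the value is below the mean, and a positive z-score means it is above the mean.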


In [None]:
# Plot a histogram of NBA salaries
...

In [None]:
import statistics

# mean salary
...

# median salary
...

# standard deviation of salaries
...


In [None]:
# z-score for the first player's salary (i.e., z-score for Stephen Curry's salary)

...


### Warm-up exercise 3: Two-quantitative variables

Can you do the following:

1. Create a scatter plot of salary as a function of the points scored per game
2. Guess what you believe the correlation is between salary and points per game, and then calculate the correlation to see if your guess was close.


In [None]:
...

In [None]:
# guess the correlation and then calculate it
...

## 1. Creating Arrays

Often we want to process data that is all of the same type. For example, we might want to do processing on a data set of numbers (e.g., if we were just analyzing salary data). 

When we have data that is all of the same type, there are faster ways to process data than using a list. In Python, the `numpy` package offers ways to store and process data that is all of the same type using a data structure called an `ndarray`. There are also functions that operate on `ndarrays` that can do computations very efficiently. 
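
To make the "all of the same type" idea concrete, here is a small sketch using made-up values (the names `toy_numbers` and `toy_mixed` are hypothetical). It illustrates that when a list mixes an integer and a string, NumPy stores every element as a string so that the array still has a single type.

```python
import numpy as np

# An array built from integers gets an integer dtype
toy_numbers = np.array([1, 2, 3])
print(toy_numbers.dtype)   # an integer dtype (the exact width depends on the platform)
print(toy_numbers.shape)   # the number of elements along each dimension

# Mixing an integer and a string: every element is converted to a string
toy_mixed = np.array([1, "two"])
print(toy_mixed.dtype)         # a unicode string dtype
print(toy_mixed[0] == "1")     # the integer 1 is now stored as the string '1'
```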

Let's explore this now!

In [None]:
# import the numpy package
import numpy as np

In [None]:
# create an ndarray of numbers
...

In [None]:
# we can get the type of elements in an array by accessing the dtype property
...

In [None]:
# get the size of the array
...

In [None]:
# create an array of strings
...

In [None]:
# get the type in the string array
string_array.dtype      # < little endian byte order, U unicode string, of maximum length 1.

In [None]:
# variation
...      # < little endian byte order, U unicode string, of maximum length 3.

In [None]:
# create a boolean array
...

In [None]:
# get the type in the boolean array
...

In [None]:
# what happens if we make an array from a list of mixed values
mixed_array = ...

In [None]:
# get the dtype 
mixed_array.dtype

In [None]:
# get the 0th element of the mixed_array
...

In [None]:
# get the type of the 0th element
...

In [None]:
# is the 0th element equal to the integer 1? 
...

In [None]:
# is the 0th element equal to the string '1'? 
...

In [None]:
# create sequential numbers 1 to 9
...

## 2. NumPy functions on numerical arrays

The NumPy package has a number of functions that operate very efficiently on numerical ndarrays.
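
As a rough preview of the kinds of operations involved, the sketch below applies a few NumPy functions to a tiny made-up array (`toy_prices` is hypothetical and is not the gas price data). Arithmetic with a single number is applied to every element, while functions such as `np.sum()` and `np.mean()` reduce the array to one value.

```python
import numpy as np

# Hypothetical prices for four weeks (made-up values)
toy_prices = np.array([3.0, 3.5, 3.25, 4.0])

doubled = toy_prices * 2               # multiply every element by 2
with_fee = toy_prices + 0.5            # add a constant to every element

total = np.sum(toy_prices)             # sum of all values
average = np.mean(toy_prices)          # mean of all values
running_total = np.cumsum(toy_prices)  # cumulative sum after each week
weekly_change = np.diff(toy_prices)    # difference between consecutive weeks

print(total, average)
print(running_total)
print(weekly_change)
```

Note that `np.diff()` returns one fewer value than the original array, since there is no previous week to compare the first value against.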

Let's explore these functions by looking at the price of gas!

The data comes from: https://www.eia.gov/opendata/v1/qb.php?category=240692&sdid=PET.EMM_EPM0_PTE_NUS_DPG.W

In [None]:
# If loading the data using pandas_datareader.fred (in the next cell) does not work, you can uncomment this code to load the data from a fixed .csv file instead

##  Download the data - code based on a fixed .csv file
#YData.download.download_data('US_Gasoline_Prices_Weekly.csv')
#import pandas as pd
#gas_data = pd.read_csv("US_Gasoline_Prices_Weekly.csv", parse_dates=[0])  # load in the data
#gas_data.head()
#gas_data_2023 = gas_data[(gas_data['Week'] > '2023-01-01') & (gas_data['Week'] < '2024-01-01')] 
#gas_prices_2023 = gas_data_2023["DollarsPerGallon"].values
#gas_dates_2023 = gas_data_2023["Week"].values

In [None]:
# Read in the price of gas directly from the FRED

from pandas_datareader.fred import FredReader

gas_data = FredReader("GASREGW", start='2019-06-01', end='2024-09-01').read().reset_index() 

gas_data_2023 = gas_data[(gas_data['DATE'] > '2023-01-01') & (gas_data['DATE'] < '2024-01-01')] 

gas_data_2023.head()


In [None]:
# Get an ndarray of the gas prices from each week of 2023
# You can ignore this code for now...

gas_prices_2023 = gas_data_2023["GASREGW"].values
gas_dates_2023 = gas_data_2023["DATE"].values


In [None]:
# prices for all 52 weeks in 2023
...   

In [None]:
# print the prices and the dates
...

In [None]:
# One dollar is currently 141 Yen. What has the price of a gallon of gas been in Yen? 
# What have gas prices been in Euros? 
...

In [None]:
# what if there was a constant tax of $2 on each gallon purchased? 
...

In [None]:
# basic functions of: min, max, etc.
...

In [None]:
# if you bought one gallon each week, what would you pay over the whole year? 
...

In [None]:
# what do you pay on average? 
...

In [None]:
# If you bought one gallon each week, how much would you pay at the end of each of the weeks of the year? 
...

In [None]:
# How much does the gas price go up and down each week? 
...

In [None]:
# plot the gas prices
...

In [None]:
# plot the gas prices better!
...

<br>
<br>
<br>
<p>
<center><img src=https://cdn.quotesgram.com/img/69/59/1803591020-high-gas-prices.jpg></center>

## 3. Boolean arrays

We can easily compare all values in an ndarray to a particular value. The result is an ndarray of Booleans. 

Since Boolean `True` values are treated as 1's, and Boolean `False` values are treated as 0's, this makes it easy to see how many values in an array meet particular conditions. 
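
Here is a minimal sketch of this idea on made-up values (`toy_values` is a hypothetical array). Summing the Boolean array counts the `True` values, and taking its mean gives the proportion of `True` values.

```python
import numpy as np

# Hypothetical values (made up for illustration)
toy_values = np.array([2, 9, 4, 7, 1])

# Comparing an array to a number gives one Boolean per element
is_small = toy_values < 5
print(is_small)

# True counts as 1 and False counts as 0
num_small = np.sum(is_small)    # how many values are less than 5
prop_small = np.mean(is_small)  # what proportion of values are less than 5
print(num_small, prop_small)
```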

In [None]:
# Test all values in an array that are less than 5
my_array = np.array([12, 4, 6, 3, 4, 3, 7, 4])

...


In [None]:
# How many values are less than 5?
...

In [None]:
# In how many (and what proportion of) weeks in 2023 were gas prices below $3.50?
...

### Example: What proportion of movies passed the Bechdel test revisited 

Let's calculate (again) the proportion of movies that passed the Bechdel test, but this time using numpy array computations. 

The code below loads the Bechdel data, and we will focus on the `bechdel` list, which is a list of strings saying whether movies passed ('PASS') or failed ('FAIL') the Bechdel test.



In [None]:
import YData
import pandas as pd

YData.download_data("movies.csv")

movies = pd.read_csv("movies.csv")
col_names_to_keep = ['year', 'imdb', 'title', 'clean_test', 'binary', 'budget',
       'domgross', 'budget_2013', 'domgross_2013', 'decade_code', 'imdb_id',
       'rated', 'imdb_rating', 'runtime',  'imdb_votes']
movies =   movies[col_names_to_keep]

movies.dropna(axis = 0, how = 'any', inplace = True, subset=col_names_to_keep[0:9])


# get lists of data for our data analysis
title = movies["title"].to_list()
bechdel = movies["binary"].to_list()
bechdel_reason = movies["clean_test"].to_list()

domgross_2013 = movies["domgross_2013"].to_list()
budget_2013 = movies["budget_2013"].to_list() 
year = movies["year"].to_list()


bechdel[0:5]


In [None]:

# convert the list to an ndarray
bechdel_array = ...


# create a Boolean array that is True for movies that passed the Bechdel test
passed_booleans = ...

print(passed_booleans[0:5])


# calculate the proportion of movies that passed the Bechdel test
...

# alternatively, we can use the np.mean() function 
...


## 4. Boolean subsetting/indexing/masking

We can also use Boolean arrays to return values in another array. This is referred to as "Boolean subsetting", "Boolean masking" or "Boolean indexing".
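
The sketch below shows the pattern on two small made-up arrays of the same length (`toy_scores` and `toy_groups` are hypothetical names). A Boolean array built from one array is used inside square brackets to pull out the matching values from the other array.

```python
import numpy as np

# Two hypothetical arrays of the same length (made-up values)
toy_scores = np.array([10, 25, 3, 40])
toy_groups = np.array(["a", "b", "a", "b"])

# Boolean array that is True where the group is "a"
in_group_a = toy_groups == "a"

# Keep only the scores where the mask is True
scores_a = toy_scores[in_group_a]
print(scores_a)

# Statistics can then be computed on just the selected values
mean_a = np.mean(scores_a)
print(mean_a)
```

The mask and the array being indexed must have the same length, since each Boolean decides whether the value at the same position is kept.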


In [None]:
# initial array
my_array = np.array([12, 4, 6, 3, 4, 3, 7, 4])

# create Boolean array for values less than 5
boolean_array = ...


# get values of my_array that are less than 5
...


### Example: calculate the average revenue for movies that passed the Bechdel test 

In [None]:
# Calculate the average revenue for movies that passed the Bechdel test 

# create an ndarray of revenues
domgross_2013_array = ...

# use the boolean mask to extract movies that passed the Bechdel test
passed_domgross_2013 = ...

# get the average revenue of movies that passed the Bechdel test
...

## 5. Percentiles

The Pth percentile is the value of a quantitative variable which is greater than P percent of the data. 

We can calculate percentiles using the numpy function `np.percentile()`.

Let's calculate the 25th, 50th, and 75th percentile for the Bechdel movie revenue data.
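
Before turning to the movie data, here is a small sketch of how `np.percentile()` can be called on made-up values (`toy_data` is hypothetical). Passing a list of percentiles returns one value per requested percentile.

```python
import numpy as np

# Hypothetical data: the whole numbers 1 through 9
toy_data = np.arange(1, 10)

# Request several percentiles at once by passing a list
toy_percentiles = np.percentile(toy_data, [25, 50, 75])
print(toy_percentiles)

# The 50th percentile is the same as the median
toy_median = np.median(toy_data)
print(toy_median)
```

By default NumPy interpolates linearly between data points when a percentile falls between two values, so the result is not always a value that appears in the data.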


In [None]:
## Get the 25th, 50th and 75th percentile of movie revenues

bechdel_percentiles = ...


Question: What is another way to calculate the 50th percentile? 


In [None]:
# A: The 50th percentile is the median so we can also calculate it using np.median()

...

Other commonly calculated statistics include:

- Five Number Summary = (minimum, Q1, median, Q3, maximum)
- Range = maximum – minimum
- Interquartile range (IQR) = Q3 – Q1

Where:
- Q1 = 25th percentile
- Q3 = 75th percentile
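
The definitions above can be sketched on a small made-up array as follows (`toy_data` is hypothetical and is not the movie revenue data).

```python
import numpy as np

# Hypothetical data (made-up values)
toy_data = np.array([2, 4, 4, 5, 7, 9, 10, 12, 20])

minimum = np.min(toy_data)
q1 = np.percentile(toy_data, 25)   # Q1 = 25th percentile
median = np.median(toy_data)
q3 = np.percentile(toy_data, 75)   # Q3 = 75th percentile
maximum = np.max(toy_data)

five_number_summary = (minimum, q1, median, q3, maximum)
data_range = maximum - minimum     # Range = maximum - minimum
iqr = q3 - q1                      # IQR = Q3 - Q1

print(five_number_summary)
print(data_range, iqr)
```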

Let's calculate these for the Bechdel revenue data...


In [None]:
# Range

...

In [None]:
# Interquartile range (IQR)

...




In [None]:
# Five number summary

five_num = ...

print(five_num)





### Box plots

A box plot is a graphical display of the five-number summary and consists of:

   1. Drawing a box from Q1  to Q3   

   2. Dividing the box with a line (or dot) drawn at the median

   3. Drawing a line from each quartile to the most extreme data value that is not an outlier

   4. Drawing a dot/asterisk for each outlier data point.
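
The steps above do not say how an outlier is defined. A common convention, and the default used by Matplotlib's `boxplot()`, is to call a point an outlier if it lies more than 1.5 × IQR below Q1 or above Q3. The sketch below assumes that convention and uses made-up values (`toy_data` is hypothetical) to show where the box, whiskers, and outlier dots would fall.

```python
import numpy as np

# Hypothetical data with one unusually large value (made-up values)
toy_data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 30])

# The box goes from Q1 to Q3, with a line at the median
q1 = np.percentile(toy_data, 25)
median = np.median(toy_data)
q3 = np.percentile(toy_data, 75)
iqr = q3 - q1

# Assumed outlier rule: more than 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = (toy_data < lower_fence) | (toy_data > upper_fence)

# Outliers are drawn as individual points
outliers = toy_data[is_outlier]

# Whiskers extend to the most extreme values that are not outliers
whisker_low = np.min(toy_data[~is_outlier])
whisker_high = np.max(toy_data[~is_outlier])

print(q1, median, q3)
print(whisker_low, whisker_high, outliers)
```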


Create a side-by-side boxplot showing the revenue of movies that passed and failed the Bechdel test


In [None]:
# get the movies that failed the Bechdel test
failed_domgross_2013 = ...


# create a side-by-side boxplot showing the revenue of movies that passed and failed the Bechdel test
# Note: in newer versions of Matplotlib, the old boxplot parameter name "labels" was renamed "tick_labels".



## 6. Higher dimensional arrays
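
As a quick orientation, here is a minimal sketch of a two-dimensional array using made-up values (`toy_matrix` is a hypothetical name). It illustrates slicing with a `[rows, columns]` index and how the `axis` argument controls the direction of a sum.

```python
import numpy as np

# Hypothetical matrix with 2 rows and 3 columns (made-up values)
toy_matrix = np.array([[1, 2, 3],
                       [10, 20, 30]])
print(toy_matrix.shape)   # (number of rows, number of columns)

# Slicing uses [rows, columns]: all rows, last two columns
sub_matrix = toy_matrix[0:2, 1:3]

total = np.sum(toy_matrix)                # sum of every value
column_sums = np.sum(toy_matrix, axis=0)  # collapse the rows: one sum per column
row_sums = np.sum(toy_matrix, axis=1)     # collapse the columns: one sum per row

print(sub_matrix)
print(total, column_sums, row_sums)
```

One way to remember the `axis` argument is that it names the dimension that disappears: `axis=0` removes the row dimension, leaving one value per column.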

In [None]:
my_matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
my_matrix

In [None]:
# slicing to get a submatrix 
...

In [None]:
my_matrix2 = my_matrix.copy()  # copy the matrix
...  # set particular index values to 100

my_matrix2

In [None]:
# sum all the values
...

In [None]:
# sum down the rows 
...

In [None]:
# sum across the columns
...

In [None]:
# create a boolean array for all values less than 5
...

In [None]:
# what does the following do? 

face_array = np.zeros([100, 100])  # create a matrix of all 0's 

face_array[21:30, 21:30] = 1  # assign particular regions the value of 1
face_array[21:30, 71:80] = 1
face_array[71:80, 21:80] = 1

#plt.imshow(face_array, cmap = 'gray');
#plt.colorbar();

In [None]:
# convert face_array to a boolean matrix
...

## 7. Image processing

We can use numerical arrays (and NumPy) to do image processing. Let's explore this now.
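
A color image can be stored as a three-dimensional array with shape (rows, columns, color channels). The sketch below builds a tiny made-up 2 × 2 image (`toy_image` is hypothetical, and the darkness threshold of 50 is an arbitrary choice) to illustrate channel swapping, grayscale averaging, and masking without needing an image file.

```python
import numpy as np

# Hypothetical 2 x 2 RGB image: each pixel is [red, green, blue] (made-up values)
toy_image = np.array(
    [[[255, 0, 0], [10, 20, 30]],
     [[0, 0, 255], [200, 200, 200]]],
    dtype=np.uint8,
)
print(toy_image.shape)   # (rows, columns, color channels)

# Reverse the order of the last axis, which swaps the red and blue channels
swapped = toy_image[:, :, ::-1]

# Grayscale: average the three color channels for each pixel
gray = toy_image.mean(axis=2)

# Mask of "dark" pixels (the threshold of 50 is an arbitrary assumption)
dark_mask = gray < 50

# Set every dark pixel to black in a copy of the image
darkened = toy_image.copy()
darkened[dark_mask] = 0

print(gray)
print(dark_mask)
```

The mask has one Boolean per pixel, so its shape is (rows, columns), and assigning through it sets all three color channels of each selected pixel.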

In [None]:
# download an image of a famous Yale alumnus
YData.download.download_image("burns.jpeg")

In [None]:
# load in an image 

from imageio.v3 import imread

I = imread("burns.jpeg")

plt.imshow(I);

In [None]:
# get the type and shape of the image
...

In [None]:
# Let's reverse the red and blue channels

# extract each color channel as a matrix
...

# create new image where color channels will be swapped
rev_rb = ...

# swap channels
...

# convert to ints
...

# display the image
plt.imshow(rev_rb);

In [None]:
# To create a grayscale image - use the average value in all three r, g, b channels

mean_image = ...

plt.imshow(mean_image, cmap='gray');

In [None]:
# Image masking - make all dark pixels even darker (set to a value of 0)

# copy the image and create a darkening mask
darken = ...
darken_mask = ...
print(darken_mask.shape)

# darken the pixels and display the image
...
plt.imshow(darken);

## 8. Tuples

Tuples are a basic data structure in Python that is similar to a list. However, unlike lists, tuples are "immutable", meaning that once we create a tuple, we cannot modify the values in the tuple.

We create tuples by using values in parentheses separated by commas:

`my_tuple = (10, 20, 30)`
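
As a short sketch of these properties (using a hypothetical tuple named `toy_point`), the code below reads a value by index, unpacks the tuple into separate names, and shows that trying to reassign an element raises a `TypeError`.

```python
# A hypothetical tuple of two values
toy_point = (3, 4)

# Indexing works the same way as for lists
first = toy_point[0]

# Tuple unpacking assigns each element to its own name
x, y = toy_point

# Trying to change an element raises a TypeError because tuples are immutable
try:
    toy_point[0] = 10
    was_modified = True
except TypeError:
    was_modified = False

print(first, x, y, was_modified)
```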

Let's explore tuples now... 


In [None]:
# create a tuple

my_tuple = (10, 20, 30)

my_tuple


In [None]:
# we can access elements of the tuple using square brackets (the same as lists)
...

In [None]:
# unlike a list, we can't reassign values in a tuple 
# my_tuple[1] = 50

In [None]:
# We extract values from tuples into regular names using "tuple unpacking"

...

## 9. Dictionaries

Dictionaries allow us to look up values. In particular, we provide a "key" and the dictionary returns a "value". 

We can create dictionaries using the syntax: 

`my_dict = {"key1": 1, "key2": 20}`
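
Here is a minimal sketch of typical dictionary operations using made-up keys and values (`toy_ages`, `toy_keys`, and `toy_values` are hypothetical names). It also shows one way to pair up two lists of the same length with `dict()` and `zip()`.

```python
# A hypothetical dictionary mapping names to ages (made-up values)
toy_ages = {"Ana": 31, "Ben": 27}

# Look up a value by its key using square brackets
ana_age = toy_ages["Ana"]

# Assigning to a new key adds an entry
toy_ages["Cal"] = 45

# zip() pairs up the elements of two lists, and dict() turns the pairs into key/value entries
toy_keys = ["x", "y", "z"]
toy_values = [1, 2, 3]
paired = dict(zip(toy_keys, toy_values))

print(ana_age, toy_ages)
print(paired)
```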


In [None]:
# create a dictionary
my_dict = {"key1": 1, "key2": 20}
my_dict

In [None]:
# we can access elements using square brackets 
...

In [None]:
# values in dictionaries can be lists
my_dict2 = {"a": [1, 2, 3, 4], "b": ["a", "b", "c"], "c": [True, False]}
...


In [None]:
# We can create a dictionary from two lists of the same length using the dict() and zip() functions

my_list = [1, 2, 3]
my_list2 = ["a", "b", "c"]

my_dict3 = ...

print(my_dict3)

...

In [None]:
# create a dictionary between players and their salaries
# player_salaries = dict(zip(player_array, salary_array))

# what is Stephen Curry's salary? 
# player_salaries["Stephen Curry"]

## 10. Pandas 

pandas Series are: One-dimensional ndarrays with axis labels

pandas DataFrames are: Table data
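
To illustrate what "axis labels" means, the sketch below builds a tiny Series by hand (the values, the `"month"` index name, and the `"price"` name are all made up for illustration). Values can be looked up by label with `.loc` or by position with `.iloc`, and `.filter(like=...)` keeps the entries whose index label contains a given piece of text.

```python
import pandas as pd

# Hypothetical Series with string labels as the index (made-up values)
toy_series = pd.Series(
    [4.2, 4.8, 4.1],
    index=["2022-12", "2023-01", "2023-02"],
    name="price",
)
toy_series.index.name = "month"

by_label = toy_series.loc["2023-01"]   # look up by index label
by_position = toy_series.iloc[0]       # look up by position (0 is the first entry)

# Keep only entries whose index label contains "2023"
only_2023 = toy_series.filter(like="2023")

# reset_index() turns the index into a regular column, giving a DataFrame
toy_df = toy_series.reset_index()

print(by_label, by_position, len(only_2023))
print(toy_df)
```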

Let's look at the egg price data...


In [None]:
YData.download.download_data("monthly_egg_prices.csv");
YData.download.download_data("dow.csv");

In [None]:
# reading in a series by parsing the dates, and using .squeeze() to convert to a Series
egg_prices_series = pd.read_csv("monthly_egg_prices.csv", parse_dates=True, date_format="%m/%d/%y", index_col="DATE").squeeze()


# print the type
print(type(egg_prices_series))

# print the shape
print(egg_prices_series.shape)

# print the series
egg_prices_series


In [None]:
# get a value from the Series by an Index name using .loc
...

In [None]:
# get a value from the Series by index number using .iloc
...

In [None]:
# use the .filter() method to get data from dates that contain "2023"
egg_prices_2023 = ...

# print the length 
...


In [None]:
# turn the index back into a column using .reset_index()
egg_prices_df = ...

# get the type
...

# print the values
...

## DataFrames!

The ability to manipulate data in tables is one of the most useful skills in Data Science. 

Pandas is the most popular package in Python for manipulating data tables so we will use this package for manipulating tables in this class. The syntax for Pandas can be a little tricky, so try to be patient if you run into errors, and as always, there should be plenty of help available at office hours and on Ed. 

As an example, let's look at data on the closing price of the [Dow Jones Industrial Average](https://www.marketwatch.com/investing/index/djia), which is a stock market index that tracks the share prices of 30 large, prominent corporations in the US.

The code below loads the DOW data into a Pandas DataFrame and displays the first 5 rows using the `head()` method. 
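
For orientation, here is a small sketch of the DataFrame properties and methods explored in this section, applied to a tiny hand-built table (`toy_table` and its column names and values are made up and are not the DOW data).

```python
import pandas as pd

# Hypothetical table with two columns and three rows (made-up values)
toy_table = pd.DataFrame({"Open": [100.0, 101.5, 99.0], "Volume": [10, 20, 30]})

n_rows, n_cols = toy_table.shape              # (number of rows, number of columns)
column_types = toy_table.dtypes               # the type of each column
column_names = toy_table.columns.to_numpy()   # column names as a NumPy array

first_two = toy_table.head(2)   # head(n) returns the first n rows (5 by default)
last_one = toy_table.tail(1)    # tail(n) returns the last n rows (5 by default)

summary = toy_table.describe()  # descriptive statistics for the numeric columns

print(n_rows, n_cols)
print(column_types)
print(summary)
```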


In [None]:
dow = pd.read_csv("dow.csv", parse_dates=[0], date_format="%m/%d/%y", index_col="Date")

dow.head()

In [None]:
# The head() method returns the first 5 rows. 
# Let's use the tail() method to get the last 5 rows.
# From looking at the output, can you tell what year the data goes back to? 

...

In [None]:
# get the number of rows and columns in a DataFrame using the shape property
...

In [None]:
# get the types of all the columns using .dtypes
...

In [None]:
# get the names of all the columns using .columns
...

# we can convert these names to a NumPy array using the .to_numpy() method
...

In [None]:
# get more info on the data frame using the .info() method
...

In [None]:
# get descriptive statistics on DataFrame using the .describe() method

...