<a href="https://colab.research.google.com/github/tmckim/materials-sp24-colab/blob/main/lec_demos/lec07.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Before you start - Save this notebook!

When you open a new Colab notebook from WebCampus (like you hopefully did for this one), you cannot save changes. So it's best to store the Colab notebook in your personal drive `"File > Save a copy in drive..."` **before** you do anything else.

The file will open in a new tab in your web browser, and it is automatically named something like: "**Copy of lec07.ipynb**". You can rename this to just the title of the assignment "**lec07.ipynb**". Make sure you do keep an informative name (like the name of the assignment) so that you know which files to submit back to WebCampus for grading! More instructions on this are at the end of the notebook.


**Where does the notebook get saved in Google Drive?**

By default, the notebook will be copied to a folder called “Colab Notebooks” at the root (home directory) of your Google Drive. If you use this for other courses or personal code notebooks, I recommend creating a folder for this course and then moving the assignments AFTER you have completed them. <br>

I also recommend giving the folder where you save your notebooks a different name than the folder we create below, which stores the notebook resources you need each time you work through a course notebook. These resources include any data files you will need, links to the images that appear in the notebook, and the files associated with the autograder for answer checking.<br>
You should select a name other than '**NS499-DataSci-course-materials**'. <br>
This folder gets overwritten with each assignment you work on in the course, so you should **NOT** store your notebooks in this folder that we use for course materials! <br><br>For example, you could create a folder called 'NS499-**notebooks**' or something along those lines. 

### We will now do the setup steps as separate cells to help with issues finding files in Google Drive/Colab. <br> If you restart Colab, you must rerun all **5** steps in these cells!

In [None]:
# Step 1
# Setup and add files needed to access gdrive
from google.colab import drive                                   # these lines mount your gdrive to access the files we import below
drive.mount('/content/gdrive', force_remount=True)

In [None]:
# Step 2
# Change directory to the correct location in gdrive (modified way to do this from before)
import os
os.chdir('/content/gdrive/MyDrive/NS499-DataSci-course-materials/')

In [None]:
# Step 3
# Remove the files that were previously there- we will replace with all the old + new ones for this assignment
!rm -r materials-sp24-colab                                        
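
Step 3 prints a "No such file or directory" message if the folder is not there yet (for example, the first time you run the setup). That message is harmless. As an optional alternative, here is a minimal sketch that only removes the folder when it exists. `remove_folder_if_present` is a hypothetical helper name made up for this example, not something the course materials provide.

```python
import os
import shutil

def remove_folder_if_present(path):
    """Delete the folder at `path` if it exists.

    Returns True if a folder was removed and False if there was nothing to remove.
    """
    if os.path.isdir(path):
        shutil.rmtree(path)  # removes the folder and everything inside it
        return True
    return False
```

After Step 2 has changed into the course materials folder, calling `remove_folder_if_present('materials-sp24-colab')` would do the same job as the `!rm -r` line.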

In [None]:
# Step 4
# These lines clone (copy) all the files you will need from where I store the code+data for the course (github)
# Second part of the code copies the files to this location and folder in your own gdrive
!git clone https://github.com/tmckim/materials-sp24-colab '/content/gdrive/My Drive/NS499-DataSci-course-materials/materials-sp24-colab/'

In [None]:
# Step 5
# Change directory into the folder where the resources for this assignment are stored in gdrive (modified way from before)
os.chdir('/content/gdrive/MyDrive/NS499-DataSci-course-materials/materials-sp24-colab/lec_demos/')

In [None]:
# Import packages and other things needed
# Don't change this cell; Just run this cell
# If you restart colab, make sure to run this cell again after the first ones above^

from datascience import *
import numpy as np
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
plt.rcParams["patch.force_edgecolor"] = True

#### Today's Lecture

In today's lecture, you'll learn how to:

1. continue practicing visualizing data
2. review plotting numerical and categorical data
3. use `hist` to plot distributions

## Review from last lecture

## Census Data ##

In [None]:
# Read in the Census data table
full = Table.read_table('nc-est2019-agesex-res.csv')

In [None]:
# Keep only the columns we care about- 2014 and 2019
# What is the output of select?
partial = full.select('SEX', 'AGE', 'POPESTIMATE2014', 'POPESTIMATE2019')

In [None]:
# Make things easier to read
# can relabel column names using relabeled
simple = partial.relabeled(2, '2014').relabeled(3, '2019')
simple

In [None]:
# Remove the age totals- coded as 999
no_999 = simple.where('AGE', are.below(999))
no_999

## Comparing Males and Females

In [None]:
# Let's compare male and female counts per age
males = no_999.where('SEX', 1).drop('SEX')
females = no_999.where('SEX', 2).drop('SEX')

In [None]:
# Create table for 2019
pop_2019 = Table().with_columns(
    'Age', males.column('AGE'),
    'Males', males.column('2019'),
    'Females', females.column('2019')
)
pop_2019

In [None]:
# Plot 2019 data
pop_2019.plot('Age')

In [None]:
# Calculate the percent female for each age
# Start by calculating the total
total = pop_2019.column('Males') + pop_2019.column('Females')
total

In [None]:
# Calculate the percent female for each age
# Now do the division of arrays
pct_female = pop_2019.column('Females') / total * 100
pct_female

In [None]:
# Round to 3 decimal places so that it's easier to read
pct_female = np.round(pct_female, 3)
pct_female
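
To see what the array arithmetic above is doing, here is a small sketch with made-up counts for three ages (these are not the real Census numbers). It uses plain Python lists so each step is spelled out; NumPy arrays do the same thing element by element without the explicit loops.

```python
# Made-up counts for three ages (not the real Census numbers)
males = [100, 200, 150]
females = [100, 300, 50]

# Add the two columns element by element to get the total for each age
totals = [m + f for m, f in zip(males, females)]

# Divide element by element, scale to a percent, and round to 3 decimal places
pct_female = [round(100 * f / t, 3) for f, t in zip(females, totals)]

print(totals)      # [200, 500, 200]
print(pct_female)  # [50.0, 60.0, 25.0]
```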

In [None]:
# Add female percent to our table
pop_2019 = pop_2019.with_column('Percent female', pct_female)
pop_2019

In [None]:
# Plot percent
pop_2019.plot('Age', 'Percent female')

In [None]:
# ^^ Look at the y-axis! Trend is not as dramatic as you might think

pop_2019.plot('Age', 'Percent female')
plt.ylim(0, 100);  # Optional for this course

## Numerical Distributions ##

The most basic tool for visualizing the distribution of numerical data is the histogram.

In [None]:
# Examine the age of the top 200 films
# Show just a preview of the table (top_movies must be read in with Table.read_table before this cell runs)
top_movies.take(np.arange(5))

In [None]:
# Add a column containing the age of each movie in the top_movies table
this_year = 2024
ages = this_year - top_movies.column('Year')
top_movies = top_movies.with_column('Age', ages)

In [None]:
# Show the table
top_movies

In [None]:
# What is the range? Use built-in functions we know
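
One possible way to answer this, sketched with a short made-up list of ages so it runs on its own. The list `ages_example` is invented for illustration and is not the real movie data.

```python
# Made-up ages (not the real top_movies data)
ages_example = [7, 85, 27, 103, 47]

youngest = min(ages_example)    # smallest age
oldest = max(ages_example)      # largest age
age_range = oldest - youngest   # distance between the two

print(youngest, oldest, age_range)  # 7 103 96
```

In the notebook, the same built-in `min` and `max` functions can be applied to `top_movies.column('Age')`.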



Split the "`Age`" column into the following bins

In [None]:
# Create the bins
my_bins = make_array(0, 5, 10, 15, 25, 40, 65, 100)

In [None]:
# Bin the data
binned_data = top_movies.bin('Age', bins = my_bins)
binned_data
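
Here is a rough sketch of what binning does, in plain Python with made-up ages. Each bin includes its left edge and excludes its right edge, except the last bin, which also includes its right edge. That is NumPy's histogram convention, and I am assuming `bin` follows it. `count_in_bins` is a hypothetical helper written for this example; it is not part of the `datascience` library.

```python
from bisect import bisect_right

def count_in_bins(values, edges):
    """Count how many values fall in each bin defined by consecutive edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # values outside all the bins are not counted
        # Index of the bin whose left edge is the largest edge <= v;
        # a value equal to the final edge goes into the last bin
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    return counts

edges_example = [0, 5, 10, 15, 25, 40, 65, 100]        # same edges as my_bins
ages_example = [3, 5, 12, 24, 25, 40, 64, 65, 100]     # made-up ages

print(count_in_bins(ages_example, edges_example))  # [1, 1, 1, 1, 1, 2, 2]
```

Notice that an age of exactly 5 is counted in the [5, 10) bin, not the [0, 5) bin.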

You can also use `np.arange` to create regular bins of a fixed size or even just specify a number.

In [None]:
# Demonstrate other ways to bin the data
my_bins_another_way = top_movies.bin('Age', bins = np.arange(0,126,25))
my_bins_another_way

In [None]:
# Another way: just give the number of equally wide bins
my_bins_another_way = top_movies.bin('Age', bins = 15)
my_bins_another_way

If the `bins` argument isn't used, the default is to produce 10 equally wide bins between the min and max values of the data.
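
A small sketch of both ideas, fixed-size bins and a fixed number of equally wide bins, using plain Python and made-up ages. `equal_width_edges` is a hypothetical helper written for illustration; the library computes its own edges internally.

```python
def equal_width_edges(values, num_bins=10):
    """Return num_bins + 1 equally spaced edges from min(values) to max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    return [lo + i * width for i in range(num_bins + 1)]

# Fixed-size bins: the same edges that np.arange(0, 126, 25) produces
fixed_edges = list(range(0, 126, 25))
print(fixed_edges)  # [0, 25, 50, 75, 100, 125]

# Ten equally wide bins between the smallest and largest made-up age
ages_example = [3, 10, 47, 103]
default_edges = equal_width_edges(ages_example)
print(default_edges)
```

Ten bins need eleven edges, so `default_edges` has 11 entries, starting at the minimum age and ending at the maximum age.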

## Histograms ##

We can construct histograms of numerical variables by calling the `tbl.hist(...)` method with our `bins`.


Make a histogram of `Age` using `my_bins`

In [None]:
# What are my_bins from before?
my_bins

In [None]:
# Reminder of the data with original bins
binned_data

In [None]:
# Calculate percent of data for each age bin
percent = 100 * binned_data.column('Age count') / binned_data.column('Age count').sum()
percent

In [None]:
# Add a column containing what percent of movies are in each bin
binned_data = binned_data.with_column(
    'percent', percent)
binned_data

In [None]:
# Let's make our first histogram!
top_movies.hist('Age', bins = my_bins, unit = 'Year')

In [None]:
# Let's try equally spaced bins instead
top_movies.hist('Age', bins = np.arange(0, 110, 10), unit = 'Year')

In [None]:
# Let's try not specifying any bins
top_movies.hist('Age', unit='Year')

## Height ##

### Question: What is the height of the [40, 65) bin?

In [None]:
# Review the plot to remind ourselves what it looks like based on bins
top_movies.hist('Age', bins=my_bins, unit='Year')
binned_data

In [None]:
# Step 1: Calculate % of movies in the [40, 65) bin
percent_bin = binned_data.where('bin',40).column('percent').item(0)
percent_bin

In [None]:
# Step 2: Calculate the width of the 40-65 bin
width = 65 - 40

In [None]:
# Step 3: Area of rectangle = height * width
#         --> height = percent / width
height = percent_bin / width
height
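
The three steps above use the area principle for histograms: the area of each bar equals the percent of the data in its bin. Written out for the [40, 65) bin, with $p$ standing for the value stored in `percent_bin` (the symbol $p$ is introduced here only for illustration):

$$
\text{height} = \frac{p}{\text{width}} = \frac{p}{65 - 40} = \frac{p}{25} \ \text{percent per year}
$$

Multiplying the height back by the width of 25 years recovers $p$ percent, which is why the unit on the vertical axis is "percent per year".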

### What are the heights of the rest of the bins?

In [None]:
# Get the bin lefts (remove the last bin)
bin_lefts = binned_data.take(np.arange(binned_data.num_rows - 1))
bin_lefts

In [None]:
# Get the bin widths and add that to the table
bin_widths = np.diff(binned_data.column('bin'))
bin_lefts = bin_lefts.with_column('width', bin_widths)
bin_lefts

In [None]:
# Get the bin heights and add to the table
bin_heights = bin_lefts.column('percent') / bin_widths
bin_lefts = bin_lefts.with_column('height', bin_heights)

In [None]:
# Show the full table with bin, count, percent, width, and height
bin_lefts
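
As a check on the table above, here is a minimal sketch of the same calculation in plain Python. It uses the `my_bins` edges but made-up counts (not the real movie counts), and confirms that the bar areas add up to 100 percent.

```python
edges_example = [0, 5, 10, 15, 25, 40, 65, 100]   # same edges as my_bins
counts_example = [10, 20, 30, 40, 50, 30, 20]     # made-up counts, one per bin

total = sum(counts_example)
percents = [100 * c / total for c in counts_example]

# Width of each bin = right edge minus left edge (what np.diff computes)
widths = [right - left for left, right in zip(edges_example, edges_example[1:])]

# Height of each bar = percent in the bin divided by the bin width
heights = [p / w for p, w in zip(percents, widths)]

# Area of each bar = height * width, which gives back the percent
total_area = sum(h * w for h, w in zip(heights, widths))

print(widths)
print(heights)
print(total_area)
```

Wide bins can hold a large percent of the data and still have short bars, because the percent is spread over more years.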

In [None]:
# Plot once more
top_movies.hist('Age', bins = my_bins, unit = 'Year')

### Saving
Remember to save your notebook before closing.
Choose **Save** from the **File** menu (and make sure you've already saved a copy in your drive).