<a href="https://colab.research.google.com/github/tmckim/materials-sp24-colab/blob/main/lec_demos/lec06.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Before you start - Save this notebook!

When you open a new Colab notebook from WebCampus (as you hopefully did for this one), you cannot save changes. So it's best to store the Colab notebook in your personal drive with `"File > Save a copy in drive..."` **before** you do anything else.

The file will open in a new tab in your web browser, and it is automatically named something like "**Copy of lec06.ipynb**". You can rename it to just the title of the assignment, "**lec06.ipynb**". Make sure you keep an informative name (like the name of the assignment) so that you know which files to submit back to WebCampus for grading! More instructions on this are at the end of the notebook.


**Where does the notebook get saved in Google Drive?**

By default, the notebook will be copied to a folder called “Colab Notebooks” at the root (home directory) of your Google Drive. If you use this for other courses or personal code notebooks, I recommend creating a folder for this course and then moving the assignments AFTER you have completed them. <br>

I also recommend giving the folder where you save your notebooks a different name than the course materials folder used in the setup steps below. That folder stores the resources you need each time you work through a course notebook: any data files, links to the images that appear in the notebook, and the files associated with the autograder for answer checking.<br>
You should select a name other than '**NS499-DataSci-course-materials**'. <br>
This folder gets overwritten with each assignment you work on in the course, so you should **NOT** store your notebooks in this folder that we use for course materials! <br><br>For example, you could create a folder called 'NS499-**notebooks**' or something along those lines. 

### We will now do the setup steps as separate cells to help with issues finding files in Google Drive/Colab. <br> If you restart Colab, you must rerun all **5** steps (one per cell)!

In [None]:
# Step 1
# Setup and add files needed to access gdrive
from google.colab import drive                                   # these lines mount your gdrive to access the files we import below
drive.mount('/content/gdrive', force_remount=True)

In [None]:
# Step 2
# Change directory to the correct location in gdrive (modified way to do this from before)
import os
os.chdir('/content/gdrive/MyDrive/NS499-DataSci-course-materials/')

In [None]:
# Step 3
# Remove the files that were previously there- we will replace with all the old + new ones for this assignment
!rm -rf materials-sp24-colab   # -f: no error if the folder is not there yet (e.g. the first time you run this)

In [None]:
# Step 4
# These lines clone (copy) all the files you will need from where I store the code+data for the course (github)
# Second part of the code copies the files to this location and folder in your own gdrive
!git clone https://github.com/tmckim/materials-sp24-colab '/content/gdrive/MyDrive/NS499-DataSci-course-materials/materials-sp24-colab/'

In [None]:
# Step 5
# Change directory into the folder where the resources for this assignment are stored in gdrive (modified way from before)
os.chdir('/content/gdrive/MyDrive/NS499-DataSci-course-materials/materials-sp24-colab/lec_demos/')
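
If Step 2 or Step 5 stops with a `FileNotFoundError`, the most likely causes are that Drive is not mounted or that the folder does not exist yet. Below is a minimal sketch of a check-before-changing helper that uses only the standard library. The name `change_dir_if_exists` is hypothetical and is not part of the course setup.

```python
import os

def change_dir_if_exists(path):
    """Change into `path` if it is an existing folder; return True on success."""
    if os.path.isdir(path):
        os.chdir(path)
        return True
    # Report the problem instead of raising an error
    print("Folder not found:", path)
    return False
```

For example, `change_dir_if_exists('/content/gdrive/MyDrive/NS499-DataSci-course-materials/')` returns `False` and prints a message if that folder is missing, rather than stopping the cell with an error.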

In [None]:
# Import packages and other things needed
# Don't change this cell; Just run this cell
# If you restart colab, make sure to run this cell again after the setup cells above

from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")

#### Today's Lecture

In today's lecture, you'll learn how to:

1. visualize data
2. use `plot` and `scatter` to plot numerical data
3. use `bar` and `barh` to plot categorical and numerical data

## Census Data ##

In [None]:
# Read in the Census data table
full = Table.read_table('nc-est2019-agesex-res.csv')

In [None]:
# Keep only the columns we care about- 2014 and 2019
# What is the output of select?
partial = full.select('SEX', 'AGE', 'POPESTIMATE2014', 'POPESTIMATE2019')

In [None]:
# Make things easier to read
# can relabel column names using relabeled
simple = partial.relabeled(2, '2014').relabeled(3, '2019')
simple

In [None]:
# Remove the age totals- coded as 999
no_999 = simple.where('AGE', are.below(999))
no_999
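
`where('AGE', are.below(999))` keeps only the rows whose `AGE` value is less than 999. As a rough sketch of the same idea in plain NumPy, the example below filters two arrays with a True/False mask. The numbers are made up for illustration and are not the Census values.

```python
import numpy as np

# Made-up example values; 999 marks the "all ages" total row
ages = np.array([0, 1, 2, 999])
population = np.array([100, 110, 120, 330])

keep = ages < 999                  # True for every row except the total
ages_no_999 = ages[keep]           # keep only the rows where the mask is True
population_no_999 = population[keep]

print(ages_no_999)
print(population_no_999)
```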

## Line Plots ##

In [None]:
# Our first chart!
no_999.plot('AGE', '2019')

In [None]:
# Adjusted plot
overall = no_999.where('SEX', 0)
overall.plot('AGE', '2019')

In [None]:
# ^^ That plot should be labeled! Here are 3 ways to label it:

In [None]:
# US Population  <--- Just add a comment

overall.plot('AGE', '2019')

In [None]:
overall.plot('AGE', '2019')
print('US Population')  # <--- Print out what it is

In [None]:
overall.plot('AGE', '2019')
plt.title('US Population');    # <--- Add a title

In [None]:
# Now plot 2014

In [None]:
# Put both years on the same plot
overall.drop('SEX').plot('AGE')

## Males vs Females (Optional)

In [None]:
# Let's compare male and female counts per age
males = no_999.where('SEX', 1).drop('SEX')
females = no_999.where('SEX', 2).drop('SEX')

In [None]:
# Create table for 2019
pop_2019 = Table().with_columns(
    'Age', males.column('AGE'),
    'Males', males.column('2019'),
    'Females', females.column('2019')
)
pop_2019

In [None]:
# Plot 2019 data
pop_2019.plot('Age')

In [None]:
# Calculate the percent female for each age
total = pop_2019.column('Males') + pop_2019.column('Females')
pct_female = pop_2019.column('Females') / total * 100
pct_female

In [None]:
# Round to 3 decimal places so that it's easier to read
pct_female = np.round(pct_female, 3)
pct_female
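
The percent calculation works because array arithmetic is element-wise: each age's female count is divided by that same age's total. Here is a small sketch with made-up counts (not the Census values) that you can check by hand:

```python
import numpy as np

# Made-up counts for three ages (not the Census values)
males = np.array([50, 40, 10])
females = np.array([50, 60, 30])

total = males + females                 # element-wise sum: 100, 100, 40
pct_female = females / total * 100      # element-wise division, then scale to percent
pct_female = np.round(pct_female, 3)    # round each entry to 3 decimal places

print(pct_female)
```

With these numbers the three percentages are 50, 60, and 75.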

In [None]:
# Add female percent to our table
pop_2019 = pop_2019.with_column('Percent female', pct_female)
pop_2019

In [None]:
# Plot percent
pop_2019.plot('Age', 'Percent female')

In [None]:
# ^^ Look at the y-axis! Trend is not as dramatic as you might think

pop_2019.plot('Age', 'Percent female')
plt.ylim(0, 100);  # Optional for this course

## Scatter Plots ##

In [None]:
# Actors and their highest grossing movies
actors = Table.read_table('actors.csv')
actors

In [None]:
# Create scatter plot from slides
actors.scatter('Number of Movies', 'Average per Movie')

In [None]:
# How do we find the highest average per movie?
actors.where('Average per Movie', are.above(400))
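
The cutoff of 400 above comes from reading the scatter plot by eye. Another option, sketched here in plain NumPy, is to ask directly for the position of the largest value with `np.argmax`. The names and values are made up for illustration and are not from `actors.csv`.

```python
import numpy as np

# Made-up values (not from actors.csv)
names = np.array(['Actor A', 'Actor B', 'Actor C'])
average_per_movie = np.array([120.5, 451.8, 98.2])

top_index = np.argmax(average_per_movie)   # position of the largest value
print(names[top_index], average_per_movie[top_index])
```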

In [None]:
# Extra: Scatter of # of movies compared to total gross amount
#actors.scatter('Number of Movies', 'Total Gross')

## Bar Charts ##

In [None]:
# Highest grossing movies as of 2017
top_movies = Table.read_table('top_movies_2017.csv')
top_movies

In [None]:
# Find the top 10 movies
top10_adjusted = top_movies.take(np.arange(10))
top10_adjusted
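
`np.arange(10)` produces the row indices 0 through 9, so `take(np.arange(10))` returns the first ten rows of the table. Those are the "top 10" only if the table is already sorted from highest to lowest. This sketch shows the indexing idea on a made-up array that stands in for an already-sorted column:

```python
import numpy as np

row_indices = np.arange(10)    # array of 0, 1, ..., 9
print(row_indices)

# Made-up values standing in for a column sorted from highest to lowest
gross = np.array([900, 800, 700, 600, 500, 400, 300, 200, 100, 90, 80, 70])
top10 = gross[row_indices]     # the first ten entries
print(top10)
```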

In [None]:
# Convert to millions of dollars for readability
millions = np.round(top10_adjusted.column('Gross (Adjusted)') / 1000000, 3)
top10_adjusted = top10_adjusted.with_column('Millions', millions)
top10_adjusted


In [None]:
# A line plot doesn't make sense here: don't do this!
top10_adjusted.plot('Year', 'Millions')

In [None]:
# Use a bar plot instead
top10_adjusted.barh('Title', 'Millions')

### Saving
Remember to save your notebook before closing.
Choose **Save** from the **File** menu (and make sure you've already saved a copy in your drive).