# Data Visualization: Live Session 1

## Session Overview

+ Course Overview
    - Expectations
    - Syllabus
    - Assignments


+ Introductions
    - Where are you (geographically)?
    - What kind of visualization tools have you used?
    

+ A thrilling review of some essential concepts and packages
    - \*args and \*\*kwargs
    - numpy
    - pandas


+ In-class activity


+ Overview of Assignment 1


+ Questions

## Packages you'll need for this course:
_Note that geopandas can be tricky to install._

_I've posted a requirements.txt file to the files section on 2DU that should build a good venv for this course._
+ numpy
+ pandas
+ matplotlib
+ seaborn
+ geopy
+ folium
+ shapely
+ geopandas
+ yfinance
+ ipykernel 
+ plotly
+ dash

### Adding Virtual Environment as an ipykernel

Opening a Jupyter session will use the default Python3 ipykernel. However, you can add your virtual environment as a kernel to use in Jupyter.

 - Activate the virtual environment you want to add.
 - From your virtual environment directory, run: `pip install ipykernel`
 - Run: `ipython kernel install --user --name=name_of_venv`
 - To uninstall a kernel, run: `jupyter kernelspec uninstall name_of_venv`
 - When you open a Jupyter session you should now be able to select the kernel you've created from your virtual environment.
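
Once the kernel is selected, a quick check from inside the notebook can confirm which interpreter is actually running. This is a minimal sketch using only the standard library; the printed path depends on your machine, and comparing `sys.prefix` with `sys.base_prefix` is one common way to tell whether a venv is active.

```python
import sys

# Path of the Python interpreter backing the current kernel.
print(sys.executable)

# Inside a venv, sys.prefix points at the venv directory while
# sys.base_prefix points at the Python installation it was created from.
in_venv = sys.prefix != sys.base_prefix
print(f"running inside a virtual environment: {in_venv}")
```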

#### Charles Joseph Minard's map of Napoleon's Russian Campaign of 1812
<img src="https://www.researchgate.net/profile/Robert-Strohmaier/publication/277098952/figure/fig5/AS:294635642605574@1447258019690/Charles-Joseph-Minards-visualization-of-Napoleons-Russian-campaign-of-1812-Friendly.png"/>

#### Edward Tufte
<img src="http://prod-upp-image-read.ft.com/6e52bc30-f4d3-11e2-a62e-00144feabdc0" width="400" height="300"/>

In [None]:
from IPython.display import YouTubeVideo
vid = YouTubeVideo("jbkSRLYSojo", width=900, height=500, allow_autoplay=False)
display(vid)

### \*args and \*\*kwargs review
A single \* is the unpacking operator for iterables.

A double \*\* is the unpacking operator for dictionaries.

In [None]:
# suppose we want to create a function to sum two variables
def sum_it(a, b):
    return a + b

In [None]:
# now suppose we want to extend this function so that it can
# sum n numbers.
def sum_it(int_list):
    total = 0
    for i in int_list:
        total += i
    return total

some_ints = [1, 2, 3, 4]
print(sum_it(some_ints))

In [None]:
# This works nicely, but we need to put the elements to sum in a list
# before we call the function on them.

def sum_it(*args):
    total = 0
    for i in args:
        total += i
    return total

In [None]:
# now we're not passing a list. We're passing positional arguments.
# our function takes these arguments and packs them into an iterable object (a tuple)
# called args. args is not a keyword; we could just as well say *numbers.
# The unpacking operator * is the important part.
print(sum_it(1, 2, 3, 4, 5, 6, 18))

In [None]:
# **kwargs is similar to *args, but it accepts keyword arguments instead
# of positional arguments.
def multiple_ops(**kwargs):
    initial_sum = kwargs['add_1'] + kwargs['add_2']
    # divide only when operation='div' is passed; otherwise multiply
    if kwargs.get('operation') == 'div':
        return initial_sum / kwargs['mult_div']
    return initial_sum * kwargs['mult_div']

In [None]:
# here we pass keyword arguments, which the function packs into a dictionary
# (note that we don't write it with curly braces). Think of the keywords as keys
# and the values as values.
multiple_ops(add_1=7, add_2=8, mult_div=2)

In [None]:
multiple_ops(add_1=7, add_2=8, mult_div=2, operation='div')

#### Again, the name kwargs is just a convention. We're really concerned with the ** unpacking operator.

In [None]:
# An example using *args and **kwargs
# This function takes *args and **kwargs as parameters.
# The function loops through all arg values and multiplies them to get iterations.
# We then loop over the range of iterations and, if the iterator is even,
# we print the kwargs values; otherwise we print the kwargs keys.

def do_something(*args, **kwargs):
    iterations = 1
    for arg in args:
        iterations *= arg

    for i in range(iterations):
        if i % 2 == 0:
            print(f'{i} is even')
            print(kwargs.values())
        else:
            print(f'{i} is odd')
            print(kwargs.keys())

do_something(1, 2, 3, 4, 5, foo = 'hello', bar = 'hi')

In [None]:
# what happens if I try to specify the keyword arguments before the positional arguments?
# (this cell intentionally fails with "SyntaxError: positional argument follows keyword argument")
do_something(bar = 'hi', foo = 'hello', 1, 2, 3)

In [None]:
# another function using **kwargs
def p_func(**kwargs):
    for i in kwargs.values():
        print(i)

In [None]:
# Notice the error generated when we call the function on the following input.
p_func('a', 'b', 'c', 'd')

In [None]:
p_func(arg1 = 'a', arg2 = 'b', arg3 = 'c', arg4 = 'b')

#### Writing functions with standard arguments, \*args and \*\*kwargs
Note that the order is critical here.
Standard arguments precede \*args, which precede \*\*kwargs.

Note that the single unpacking operator (\*) can be used on any iterable.
The \*\* unpacking operator can only be used on mappings such as dicts.
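
A short sketch of both points above: the parameter order in a definition, and unpacking at the call site. The function `describe` and its parameter names are hypothetical, chosen only for illustration.

```python
def describe(label, *args, **kwargs):
    """Return a one-line summary of everything that was passed in."""
    # label is a standard positional parameter; args is a tuple of any extra
    # positional arguments; kwargs is a dict of any keyword arguments.
    return f"{label}: args={args}, kwargs={kwargs}"

options = {"color": "red", "width": 2}

# * unpacks the list into positional arguments,
# ** unpacks the dict into keyword arguments.
summary = describe("line", *[1, 2, 3], **options)
print(summary)
# line: args=(1, 2, 3), kwargs={'color': 'red', 'width': 2}
```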

In [None]:
# to get a better sense of what the unpacking operator does
# notice how calling print on the list prints the brackets, commas and quotes (the list itself)
print(['foo', 'bar'])
# calling print on the unpacked list prints just the list content
print(*['foo', 'bar'])

# this is the same as calling print with two arguments...
print('foo', 'bar')

In [None]:
# passing unpacked args to a function with a specified number of arguments
def print_something(a, b):
    print(a, b)

In [None]:
# this allows us to pass in the two required parameters with unpacking
print_something(*['foo', 'bar'])

In [None]:
# here we're unpacking too many values
print_something(*['foo', 'bar', 'norf'])

In [None]:
# another printing function using *args
def print_something(*args):
    hold = ''
    for i in args:
        hold += str(i) + ' '
    print(hold)

# consider passing multiple lists to the above function.
# the unpacking operator in *args will treat these as a tuple of lists.

# we can use unpacking operators on our function inputs as well as in the function definition.
# Here we use multiple unpacking operators to unpack three lists and pass those values
# as *args...they will then be packed into a single tuple and iterated over within the function.
print_something(*[1, 2, 3], *[4, 5, 6], *[7, 8, 9])

# without the additional level of unpacking, each list is passed as a single argument,
# so the function prints the three lists themselves
print_something([1, 2, 3], [4, 5, 6], [7, 8, 9])

### Now a bit of numpy review.

Matplotlib uses the numpy ndarray data structure, so it's important to have a good grasp on these.

In [None]:
import numpy as np

# numpy's array function takes any list, tuple or array-like object and converts it to an ndarray

an_array = np.array([i for i in range(10)])
print(an_array)
print(type(an_array))

In [None]:
# nested arrays are arrays that have arrays as values

# a 0-D array is just a scalar...each value in a 1D array is a 0-D array itself
scalar = np.array(20)

In [None]:
# a 1-D array has scalars as its elements.  Think of a single vector of scalars.
basic_array = np.array([1, 2, 3, 4, 5])
print(basic_array)

In [None]:
# a 2-D array is just an array of arrays. Think of a matrix or table.
two_dim_array = np.array([[1, 2, 3], [4, 5, 6]])
print(two_dim_array)

# we can evaluate the shape attribute of ndarrays as well
print(two_dim_array.shape)

In [None]:
# a 3-D array has 2-D array elements.
three_dim_array = np.array([[[2, 4, 6], [8, 10, 12]], [[14, 16, 18], [20, 22, 24]]])
print(three_dim_array)

In [None]:
# the ndim attribute gives the number of dimensions in an ndarray
print(three_dim_array.ndim)

In [None]:
# Note that ndarrays can have any number of dimensions.
# np.array() takes an optional argument ndmin to specify the minimum number of dimensions.
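
As a small illustration of `ndmin`, the sketch below builds the same three values with and without a minimum dimension count; extra axes of length 1 are prepended as needed. The variable names are made up for this example.

```python
import numpy as np

flat = np.array([1, 2, 3])
padded = np.array([1, 2, 3], ndmin=3)

print(flat.ndim, flat.shape)      # 1 (3,)
print(padded.ndim, padded.shape)  # 3 (1, 1, 3)
```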

### Indexing ndarrays

In [None]:
# to index 2D arrays we use comma-separated indices, one per dimension
# think of the first index as the row and the second index as the column
two_dim = np.array([[1, 2, 3], [4, 5, 6]])

In [None]:
print(two_dim[1, 2])

In [None]:
print(two_dim[0, 0])

In [None]:
# higher dimensional arrays are indexed similarly, the first integer represents the first dimension,
# the second integer represents the second dimension and so on.
three_dim = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(three_dim)

In [None]:
print(three_dim[0, 1, 2]) # prints third element of second array of first array

In [None]:
# above the 0 allows us to access the first 2D array...see below
print(three_dim[0])

In [None]:
# from there on the next two positions are just accessing values within the first 2D array
print(three_dim[0, 1]) # this returns second row from first 2D array

In [None]:
# We can use negative indexing as well.
# Here we access the last element of the last row of the last 2D array.
print(three_dim[-1, -1, -1])

In [None]:
# and here we access the first element of the last row of the first 2D array
print(three_dim[0, -1, 0])

### Slicing ndarrays

Slicing is also similar to what we've experienced with lists and tuples.
We slice with [start:end] or [start:end:step].
As before, omitting the starting index assumes zero,
and omitting the end assumes the length of the array.
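
Since 1D slicing is skipped in the cells that follow, here is a brief sketch of the `[start:end:step]` rules on a 1D array. The array `one_d` is made up for illustration.

```python
import numpy as np

one_d = np.arange(10)      # [0 1 2 3 4 5 6 7 8 9]

print(one_d[2:5])          # [2 3 4]  (end index is exclusive)
print(one_d[:3])           # [0 1 2]  (omitted start defaults to 0)
print(one_d[7:])           # [7 8 9]  (omitted end runs to the last element)
print(one_d[1:8:3])        # [1 4 7]  (every third element from index 1)
```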

In [None]:
# We'll skip an explanation of slicing 1D arrays and jump to higher dimensions
two_dim = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

In [None]:
# Here we're slicing columns 1 up to (but not including) 3 of the second row (row index 1)
print(two_dim[1, 1:3])

In [None]:
# Now we're accessing the last two columns from both rows
# This will return a 2D array
print(two_dim[0:2, 2:4])

In [None]:
# the following does the same as the preceding line
print(two_dim[:, 2:4])

In [None]:
# 3D arrays are sliced similarly
three_dim = np.array([[[1, 2, 3, 4], [5, 6, 7, 8]], [[9, 10, 11, 12], [13, 14, 15, 16]]])
print(three_dim)

In [None]:
# accessing last two elements of last row of second 2D array.
print(three_dim[1, 1, 2:])

In [None]:
# returns a 2D array
print(three_dim[0:2, 1, 1:])

In [None]:
# returns a 3D array
print(three_dim[0:2, 0:2, 1:])

In [None]:
# Data Types
# Numpy arrays are homogeneous, but multiple datatypes are supported.
# Numpy supports byte strings (S), integers (i), floats (f), bools (b) and complex numbers (c)
# and also has some additional data types such as:
# unsigned integer (u), timedelta (m), datetime (M), object (O), unicode string (U)
# These letters are the codes reported by an array's dtype.kind attribute.
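
To see these codes in practice, the sketch below inspects `dtype` and `dtype.kind` for a few small, made-up arrays. The exact integer width printed (for example `int64`) can vary by platform, so no output is shown.

```python
import numpy as np

samples = {
    "integers": np.array([1, 2, 3]),
    "floats": np.array([1.0, 2.5]),
    "bools": np.array([True, False]),
    "unicode strings": np.array(["a", "bc"]),
}

for label, arr in samples.items():
    # dtype is the full type (for example float64);
    # dtype.kind is the one-letter family code.
    print(label, arr.dtype, arr.dtype.kind)
```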

In [None]:
# oftentimes you need to cast an entire array to another type
# numpy offers the method astype() which takes the new type as a parameter and returns a copy.
# datatypes can be specified with a string code or a type object. For example 'f' (float32) or float (float64).

an_array = np.array([1, 2, 3, 4, 5, 11])
an_array = an_array.astype('S1')
print(an_array.dtype)
print(an_array)

In [None]:
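# notice the truncation that occurred in the previous cell when casting to a single-byte string:
# 11 became b'1'. Casting the already truncated array to 'S2' does not restore the lost digit.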

an_array = an_array.astype('S2')
print(an_array.dtype)
print(an_array)

### copy() and view() methods
These concepts are related to the aliasing of variables.

In numpy we can use the copy() method to make copies of arrays.
Changes made to the original or the copy have no impact on the other.
A view() of an array just points to the original array's data, so changes made
to the original or the view will impact the other.

In [None]:
# .copy()
an_array = np.array([1, 2, 3, 4])
a_copy = an_array.copy()

In [None]:
an_array[0] = 99
a_copy[1] = 99

In [None]:
print(an_array)
print(a_copy)

In [None]:
# .view()
an_array = np.array([1, 2, 3, 4])
a_view = an_array.view()

In [None]:
an_array[0] = 99
a_view[1] = 99

In [None]:
print(an_array)
print(a_view)

In [None]:
# We can think about this in terms of ownership of the data.
# A copy owns the data and a view does not.
# Data ownership can be assessed using the base attribute of an ndarray.

an_array = np.array([1, 2, 3, 4])
a_copy = an_array.copy()
a_view = an_array.view()

In [None]:
# If the array owns the data the base attribute will return None
# If not the base attribute returns a reference to the original object

print(a_copy.base)
print(a_view.base)

In [None]:
# if we modify an element of the array returned by the base attribute,
# we are modifying the original array itself.

a_view.base[0] = 99

In [None]:
# now if we print the original array, the copy and the view,
# the original and the view will have been modified by the preceding statement.
# this will have an impact on any views of the base array as well.

print(an_array)
print(a_copy)
print(a_view)
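
Another way to check the copy/view relationship, besides `base`, is `np.shares_memory`. This is a minimal sketch with freshly created arrays; the names below are for illustration only.

```python
import numpy as np

original = np.array([1, 2, 3, 4])
copied = original.copy()
viewed = original.view()

# np.shares_memory reports whether two arrays overlap in memory.
print(np.shares_memory(original, copied))  # False
print(np.shares_memory(original, viewed))  # True
```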

In [None]:
# the shape attribute returns a tuple (of length .ndim) giving the number of elements along each dimension
one_dim = np.array([1, 2, 3, 4])
two_dim = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
three_dim = np.array([[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]])

In [None]:
print(one_dim.shape) # one dimension with 4 elements
print(two_dim.shape) # two dimensions: 2 rows and 4 columns
print(three_dim.shape) # three dimensions: two 2D arrays, each 3 x 4

In [None]:
# reshape()
# The reshape method allows us to change the shape of an array, redistributing its elements across dimensions.
# You can reshape into any shape as long as the total number of elements stays the same.
# for example, a 1D array of length 9 can't be reshaped into a 2D array of shape (2, 5), which needs 10 elements
an_array = np.array([i for i in range(1, 21)])

new_array = an_array.reshape(4, 5)
print(new_array)
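
Two related behaviors are worth seeing once, sketched here with a made-up array `values`: passing `-1` lets numpy infer one dimension, and a shape whose element count does not match raises a `ValueError`.

```python
import numpy as np

values = np.arange(1, 21)          # 20 elements

# -1 asks numpy to infer that dimension: 20 / 4 = 5 columns.
inferred = values.reshape(4, -1)
print(inferred.shape)              # (4, 5)

# 20 elements cannot fill a 3 x 7 array, so numpy raises ValueError.
try:
    values.reshape(3, 7)
except ValueError as err:
    print("reshape failed:", err)
```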

In [None]:
# Note that reshaping returns a view whenever possible (here the base is the original array)
print(an_array.reshape(2, 5, 2).base)

In [None]:
an_array.reshape(2, 5, 2)[0, 0, 0] = 99
print(an_array)

In [None]:
# Note that we can avoid this by creating a copy of the view.
one_dim = np.array([1, 2, 3, 4, 5, 6, 7, 8])
two_dim = one_dim.reshape(4, 2).copy()
two_dim[0, 0] = 99
print(two_dim.base)
print(two_dim)
print(one_dim)

In [None]:
# We can also initialize the values of the array generally

# np.zeros() takes a shape and will initialize an ndarray
print(np.zeros(10))

In [None]:
three_dim_z = np.zeros((2,3,4))
print(three_dim_z)

In [None]:
# np.ones()
print(np.ones(10))

In [None]:
two_dim_ones = np.ones((4,4))
print(two_dim_ones)

In [None]:
# np.random.random()...continuous uniform draws
print(np.random.random(10))

In [None]:
two_dim_rand = np.random.random((4, 4))
print(two_dim_rand)

In [None]:
# np.arange() similar to range()...takes start stop and step values

three_dim = np.arange(1, 51).reshape(5, 2, 5)
print(three_dim)

In [None]:
# There are a number of array methods that provide valuable aggregate information
# .min(), .max(), .sum(), .mean(), .std()
print(two_dim_rand.max())
print(two_dim_rand.min())
print(two_dim_rand.sum())
print(two_dim_rand.mean())
print(two_dim_rand.std())

In [None]:
# get the mean of first column of two_dim_rand
print(two_dim_rand[:,0].mean())

In [None]:
# we can specify the axis of the ndarray to accomplish this as well
# axis=0 aggregates down the rows (one result per column); axis=1 aggregates across the columns (one result per row)
print(two_dim_rand.mean(axis=1)) # collapses the columns, so returns the mean of each row
print(two_dim_rand.mean(axis=0)) # collapses the rows, so returns the mean of each column
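
Because `two_dim_rand` is random, the axis behavior is easier to verify on a small fixed array. This is a sketch with a hypothetical 2 x 3 array `grid`.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(grid.mean(axis=0))  # [2.5 3.5 4.5]  one mean per column
print(grid.mean(axis=1))  # [2. 5.]        one mean per row
```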

In [None]:
# Iterating over ndarrays

# prints elements
for i in np.array([i for i in range(10)]):
    print(i)

In [None]:
# prints rows
for i in np.arange(1,21).reshape(4, 5):
    print(i)

In [None]:
# prints 2-D matrices
np.random.seed(101)

for i in np.random.random((2,5,2)):
    print('element\n', i)

In [None]:
# nditer()
# returns an iterator over the array

np.random.seed(101)

for i in np.nditer(np.random.random((2,5,2))):
    print(i)

In [None]:
np.random.seed(101)
an_iterator = np.nditer(np.random.random((2,5,2)))

In [None]:
next(an_iterator)

In [None]:
# ndenumerate
# similar to the built-in enumerate function, but yields (index tuple, value) pairs

for i, j in np.ndenumerate(np.random.random((2,5,2))):
    print(i, j)

In [None]:
# ravel()
# returns a 1D array representation of a higher-dimensional data structure
np.random.seed(101)

np.ravel(np.random.random((2,5,2)))
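
One detail worth knowing: `ravel()` returns a view when it can, whereas the related `flatten()` method always returns a copy. This is a small sketch with a made-up array `matrix`.

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)

raveled = matrix.ravel()      # a view here, since matrix is contiguous
flattened = matrix.flatten()  # always a new copy

raveled[0] = 99               # also changes matrix
flattened[1] = -1             # leaves matrix untouched

print(matrix)
# [[99  1  2]
#  [ 3  4  5]]
```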

### In-Class: Some Data Cleaning with Pandas

+ Download the two CSVs posted in the files section of 2DU
    - carnegie2021.csv
    - us_census_regions_divisions.csv

+ Read both files into pandas DataFrames (encoding = "ISO-8859-1" for the Carnegie file)

+ Retain only basic2021 values of 15 and 16 (15 = Doctoral Very High Research, 16 = Doctoral High Research)

+ Split the location feature from Carnegie into city and state columns.

+ Merge the files in order to add the Region and Division features from census to Carnegie. 

+ Retain only the following columns:
	*unitid, name, city, state, Region, Division, basic2018, basic2021, pdnfrstaff, facnum, socsc_rsd, hum_rsd, stem_rsd, oth_rsd*

+ Create ranking features for the last six variables (*pdnfrstaff, facnum, socsc_rsd, hum_rsd, stem_rsd, oth_rsd*).

+ Import StandardScaler from sklearn.preprocessing and fit_transform this scaler on the ranked columns, or just implement a z score manually.

+ Make two composite scores according to the following arbitrary weighting scheme:
    - HR_composite: a 50/50 weighted avg of the z scores for pdnfrstaff and facnum
    - RSD_composite: a 40/20/20/20 weighted avg of the z scores for stem_rsd, socsc_rsd, hum_rsd, oth_rsd

+ Access the names of institutions that had a high research classification (16) in 2018 and very high (15) in 2021.

+ What states have the most 'Doctoral Very High Research' institutions?
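
Before working on the real files, it may help to see a few of these steps on toy data. The sketch below is not the course solution: the DataFrame `toy` and every value in it are invented for illustration, and only the column names `location`, `pdnfrstaff` and `facnum` come from the list above. The manual z score uses `ddof=0`, which matches the population standard deviation that `StandardScaler` uses.

```python
import pandas as pd

# Toy data: the values below are invented purely for illustration.
toy = pd.DataFrame({
    "name": ["Univ A", "Univ B", "Univ C"],
    "location": ["Austin, TX", "Miami, FL", "Corona, CA"],
    "pdnfrstaff": [10, 30, 20],
    "facnum": [300, 200, 100],
})

# Split "City, ST" into separate city and state columns.
toy[["city", "state"]] = toy["location"].str.split(", ", expand=True)

# Rank each measure, then convert the ranks to z scores by hand.
for col in ["pdnfrstaff", "facnum"]:
    ranks = toy[col].rank()
    toy[f"{col}_z"] = (ranks - ranks.mean()) / ranks.std(ddof=0)

# 50/50 weighted average of the two z scores.
toy["HR_composite"] = 0.5 * toy["pdnfrstaff_z"] + 0.5 * toy["facnum_z"]
print(toy[["name", "city", "state", "HR_composite"]])
```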


In [15]:
# Download the two csv’s posted in the files section of 2DU
# carnegie2021.csv
# us_census_regions_divisions.csv
import pandas as pd
carnegie2021 = pd.read_csv('carnegie2021.csv', encoding="ISO-8859-1")
us_census_regions_divisions = pd.read_csv('us_census_regions_divisions.csv')

In [16]:
carnegie2021

Unnamed: 0,unitid,name,location,basic2000,basic2005,basic2010,basic2015,basic2018,basic2021,ipug2021,...,satv25,satm25,satcmb25,actcmp25,satacteq25,actfinal,appsf20,admitsf20,pctadmitf20,selindex
0,177834,A T Still University of Health Sciences,"Kirksville, MO",52,25,25,25,25,25,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
1,134811,AI Miami International University of Art and D...,"Miami, FL",56,30,30,30,30,22,19,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
2,429094,AOMA Graduate School of Integrative Medicine,"Austin, TX",53,26,26,26,26,26,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
3,404994,ASA College,"Brooklyn, NY",-2,10,10,10,4,14,5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
4,446127,ATA Career Education,"Spring Hill, FL",-2,-2,-2,-2,10,10,4,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3930,493619,Young Americans College of the Performing Arts,"Corona, CA",-2,-2,-2,-2,-2,12,4,...,,,,,,,,,,
3931,141361,Young Harris College,"Young Harris, GA",40,9,9,9,21,21,10,...,480.0,450.0,930.0,17.0,17.0,17.0,2216.0,1441.0,0.650271,1.0
3932,206695,Youngstown State University,"Youngstown, OH",21,18,18,18,18,18,16,...,480.0,480.0,960.0,18.0,18.0,18.0,8636.0,6039.0,0.699282,1.0
3933,126119,Yuba College,"Marysville, CA",40,3,3,3,7,8,3,...,,,,,,,,,,
