# 0. Load imports 

In [5]:
## imports
import pandas as pd
import numpy as np
import re


# ## print multiple things from same cell
# from IPython.core.interactiveshell import InteractiveShell
# InteractiveShell.ast_node_interactivity = "all"


## load data on 2020 crimes in DC
df = dc_crim_2020 = pd.read_csv("https://opendata.arcgis.com/datasets/f516e0dd7b614b088ad781b0c4002331_2.csv")

# 1. Questions: list comprehension

- In the in-class example, why did we need "course" at the beginning of the list comprehension?
- How did the join syntax work in the example where we pasted together offenses from the same ward?

In [6]:
## toy example

### pool of courses
all_courses = ["QSS20", "QSS17", "GOV10", "GOV4", "CSC1"]


## 1.1 Application 1: filtering to a smaller list

When we might use this: a dataframe has a lot of columns and we want to filter to a smaller set using some pattern.

In [7]:
### pull out ones that contain GOV in the string
gov_c = [course for course in all_courses
        if "GOV" in course]
gov_c # result

['GOV10', 'GOV4']

In [8]:
### showing that "course" is just a placeholder /
### arbitrary iterator name
gov_c_alt = [x for x in all_courses if "GOV" in x]

gov_c == gov_c_alt

True

In [10]:
### what happens if we use the same syntax
### but don't have course at the beginning?
[for course in all_courses if "GOV" in course]

### raises SyntaxError: invalid syntax
### reason: the expression before "for" tells Python what to put in the list
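One way to see why the leading expression is required is to write the comprehension out as an ordinary loop. The sketch below is illustrative only; the names `gov_loop` and `gov_comp` are not from the class example.

```python
all_courses = ["QSS20", "QSS17", "GOV10", "GOV4", "CSC1"]

# long form: the loop has to say what gets appended on each pass
gov_loop = []
for course in all_courses:
    if "GOV" in course:
        gov_loop.append(course)  # this is the leading "course"

# comprehension: same pieces, with the appended value written first
gov_comp = [course for course in all_courses if "GOV" in course]

print(gov_loop == gov_comp)
```

Dropping the leading expression is like writing the loop without the `append` line: Python has no way to know what belongs in the result.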

## 1.2 Application 2: keep all objects in the list but do some transformation

In [11]:
## strip the numbers by keeping the first 3 characters (works because every prefix here is 3 letters)
courses_prefix = [x[:3] for x in all_courses]
courses_prefix # could then find unique elements


['QSS', 'QSS', 'GOV', 'GOV', 'CSC']

In [17]:
## join all the prefixes into one string; the string before .join is the separator
" #:)# ".join(courses_prefix)

'QSS #:)# QSS #:)# GOV #:)# GOV #:)# CSC'
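The same join pattern can be applied within groups, which is roughly what the "paste together offenses from the same ward" example did. This is a minimal sketch on a hypothetical toy dataframe (`toy`) whose column names mirror the DC crime data; it uses a lambda function, which section 2 covers.

```python
import pandas as pd

# hypothetical toy data; column names mirror the DC crime data
toy = pd.DataFrame({
    "WARD": [1, 1, 2],
    "OFFENSE": ["THEFT/OTHER", "ROBBERY", "BURGLARY"],
})

# the separator string comes first; join() glues the list items with it
pasted_list = "; ".join(["THEFT/OTHER", "ROBBERY"])

# same idea within each ward: x holds the OFFENSE values for one ward
pasted_by_ward = toy.groupby("WARD")["OFFENSE"].agg(lambda x: "; ".join(x))
print(pasted_by_ward)
```

Note that `join` is a method of the separator string, not of the list, which is why the separator is written first.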

#### Your turn: Using the original list, add a "dartmouth_" prefix to each course name

## 1.3 Using to help with subsetting columns

In [12]:
## print all columns in the crime report data
df.columns

Index(['X', 'Y', 'CCN', 'REPORT_DAT', 'SHIFT', 'METHOD', 'OFFENSE', 'BLOCK',
       'XBLOCK', 'YBLOCK', 'WARD', 'ANC', 'DISTRICT', 'PSA',
       'NEIGHBORHOOD_CLUSTER', 'BLOCK_GROUP', 'CENSUS_TRACT',
       'VOTING_PRECINCT', 'LATITUDE', 'LONGITUDE', 'BID', 'START_DATE',
       'END_DATE', 'OBJECTID', 'OCTO_RECORD_ID'],
      dtype='object')

Use a list comprehension to filter to columns with "ID" in the name. Then, create a new dataframe called df1 that contains only the columns with "ID" in the name.

In [13]:
id_cols = [col for col in df.columns if "ID" in col]
id_cols

['BID', 'OBJECTID', 'OCTO_RECORD_ID']

In [14]:
## Then, we can filter the data and store the result as df1
df1 = df[id_cols]
df1

| | BID | OBJECTID | OCTO_RECORD_ID |
|---|---|---|---|
| 0 | | 379434829 | |
| 1 | | 379438174 | |
| 2 | | 379438177 | |
| 3 | DOWNTOWN | 379438181 | |
| 4 | GOLDEN TRIANGLE | 379438182 | |
| ... | ... | ... | ... |
| 27928 | | 379952878 | |
| 27929 | | 379952879 | |
| 27930 | | 379952880 | |
| 27931 | | 379952881 | |
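As a point of comparison, pandas also has a built-in `DataFrame.filter` method that selects columns by substring. The sketch below runs both approaches on a small hypothetical dataframe (`toy`) rather than the full crime data, so it does not depend on the download.

```python
import pandas as pd

# hypothetical toy frame with a few of the real column names
toy = pd.DataFrame({
    "BID": ["DOWNTOWN", "GOLDEN TRIANGLE"],
    "OBJECTID": [1, 2],
    "WARD": [2, 6],
})

# list comprehension, as above
id_cols = [col for col in toy.columns if "ID" in col]
by_comprehension = toy[id_cols]

# pandas built-in: keep columns whose name contains the substring
by_filter = toy.filter(like="ID", axis=1)

print(list(by_filter.columns))
```

The list comprehension is more flexible when the condition is not a simple substring match, for example when combining several patterns.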


## 1.4 Saving time and space

Here we compare two ways of creating a list of even numbers.

In [19]:
num_list = np.arange(10000)
num_list

array([   0,    1,    2, ..., 9997, 9998, 9999])

In [2]:
10 % 2

0

In [32]:
%%time
even_nums = []
for i in num_list:
    if (i % 2) == 0:
        even_nums.append(i)

CPU times: user 10.6 ms, sys: 576 µs, total: 11.2 ms
Wall time: 11.2 ms


In [33]:
%%time
even_nums = [i for i in num_list if (i % 2) == 0]

CPU times: user 10.5 ms, sys: 3.08 ms, total: 13.6 ms
Wall time: 17.2 ms
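A single `%%time` run is noisy, and the two timings above are close to each other. A hedged sketch of a steadier comparison uses the standard-library `timeit` module to average over repeated runs. The function names are illustrative, and the third function (a numpy boolean mask) is an extra alternative that was not part of the original comparison. Timings depend on the machine, so none are shown.

```python
import timeit
import numpy as np

num_list = np.arange(10000)

def with_loop():
    even_nums = []
    for i in num_list:
        if (i % 2) == 0:
            even_nums.append(i)
    return even_nums

def with_comprehension():
    return [i for i in num_list if (i % 2) == 0]

def with_mask():
    # numpy alternative: boolean mask, no Python-level loop
    return num_list[num_list % 2 == 0]

# average seconds per call over 20 runs; numbers will differ by machine
for fn in (with_loop, with_comprehension, with_mask):
    print(fn.__name__, timeit.timeit(fn, number=20) / 20)
```

Whatever the timings turn out to be, the comprehension does save space on the page: one line instead of four.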


#### Your turn: Extract all numbers in num_list that end in 7

#### Your turn: Divide each number in num_list by 2

# 2. Questions: lambda functions

Two questions:

- General syntax (see here for a reference: https://www.w3schools.com/python/python_lambda.asp)
- How they work in the context of aggregations

How is a lambda function different from a "normal" user-defined function (one that has the syntax def func_name(arg): etc.)?

- Operates similarly to normal user-defined functions in that it can take any number of arguments
- Operates differently in that it's an "anonymous" function: one defined inline, without a name of its own, and limited to a single expression
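The "anonymous" part can be checked directly. In this small sketch (the names `add_def` and `add_lambda` are illustrative), both functions behave the same when called, but only the `def` version carries its own name.

```python
def add_def(x, y):
    return x + y

add_lambda = lambda x, y: x + y

# both are called the same way and give the same answer
print(add_def(2, 1), add_lambda(2, 1))

# the def version carries its own name; the lambda does not
print(add_def.__name__)     # add_def
print(add_lambda.__name__)  # <lambda>

# a lambda can also be used on the spot without ever being assigned
by_length = sorted(["GOV10", "QSS20", "CSC1"], key=lambda c: len(c))
print(by_length)
```

Using a lambda on the spot, as in the `sorted` call, is the pattern that shows up inside `agg` later in this section.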

In [44]:
def f(x,y):
    return x+y
f(2,1)

3

In [None]:
f = lambda x, y: x+y
f(4,3)

## 2.1 General syntax for lambda functions

In [36]:
### two pools of courses
socsci = ["QSS20", "QSS17", "GOV10"]
natsci = ["BIO2", "PHYS3"]


## generalize some of the steps
## above into a two-arg function
## that takes the course prefix
## and a list of all courses
def filter_courses(prefix, all_courses):
    rel_courses = [c for c in all_courses if prefix in c]
    return rel_courses

### a few applications 
# filter_courses(prefix = "QSS", all_courses = socsci)
# filter_courses(prefix = "QSS", all_courses = natsci)
# filter_courses(prefix = "BIO", all_courses = natsci)

In [None]:
## what's the lambda function version of this
filter_courses_v2 = lambda prefix, all_courses: [c for c in all_courses if prefix in c]
filter_courses_v2(prefix = "BIO", all_courses = natsci)


## 2.2 Using alongside agg

In [50]:
## use lambda to find the modal (most common) block in each ward, two ways

### way 1: subsetting agg syntax
df.groupby("WARD")["BLOCK"].agg(lambda x: x.mode())

### way 2: dictionary agg syntax
df.groupby("WARD").agg({"BLOCK": lambda x: x.mode()})


WARD
1           3100 - 3299 BLOCK OF 14TH STREET NW
2    1300 - 1699 BLOCK OF CONNECTICUT AVENUE NW
3      5300 - 5399 BLOCK OF WISCONSIN AVENUE NW
4          100 - 199 BLOCK OF CARROLL STREET NW
5     900 - 999 BLOCK OF RHODE ISLAND AVENUE NE
6                600 - 699 BLOCK OF H STREET NE
7         934 - 1099 BLOCK OF EASTERN AVENUE NE
8        2300 - 2399 BLOCK OF GOOD HOPE ROAD SE
Name: BLOCK, dtype: object

| WARD | BLOCK |
|---|---|
| 1 | 3100 - 3299 BLOCK OF 14TH STREET NW |
| 2 | 1300 - 1699 BLOCK OF CONNECTICUT AVENUE NW |
| 3 | 5300 - 5399 BLOCK OF WISCONSIN AVENUE NW |
| 4 | 100 - 199 BLOCK OF CARROLL STREET NW |
| 5 | 900 - 999 BLOCK OF RHODE ISLAND AVENUE NE |
| 6 | 600 - 699 BLOCK OF H STREET NE |
| 7 | 934 - 1099 BLOCK OF EASTERN AVENUE NE |
| 8 | 2300 - 2399 BLOCK OF GOOD HOPE ROAD SE |
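Inside `agg`, the lambda is called once per ward, and `x` holds that ward's BLOCK values. One thing to be aware of: `mode()` returns a Series, and it can contain several values when blocks are tied. The sketch below uses a hypothetical toy dataframe (`toy`) with a deliberate tie, and takes the first mode with `.iloc[0]` as one possible way to get a single value per ward; the helper name `first_mode` is illustrative.

```python
import pandas as pd

# hypothetical toy data; column names mirror the DC crime data
toy = pd.DataFrame({
    "WARD": [1, 1, 1, 2, 2],
    "BLOCK": ["A ST", "A ST", "B ST", "C ST", "D ST"],  # ward 2 is a tie
})

# a named function that does the same job as the lambda below
def first_mode(x):
    # x is the BLOCK values for one ward; mode() returns a sorted Series,
    # possibly with several tied values, so keep only the first
    return x.mode().iloc[0]

by_def = toy.groupby("WARD")["BLOCK"].agg(first_mode)
by_lambda = toy.groupby("WARD")["BLOCK"].agg(lambda x: x.mode().iloc[0])
print(by_lambda)
```

The lambda version saves defining a helper that is only used once, which is the main reason lambdas appear so often alongside `agg`.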


#### Question: How would you validate that the groupby is correct? Hint: How would you check results for Ward 1?

#### Your turn: Group by WARD and get the mean and standard deviation (std) of X and Y