# IMT 573 - Lab 4 - Data Integration

### Instructions

Before beginning this assignment, please ensure you have access to a working instance of Jupyter Notebooks with Python 3.

1. First, replace the “YOUR NAME HERE” text in the next cell with your own full name. Any collaborators must also be listed in this cell.

2. Be sure to include well-documented (e.g. commented) code cells, figures, and clearly written text explanations as necessary. Any figures should be clearly labeled and appropriately referenced within the text. Be sure that each visualization adds value to your written explanation; avoid redundancy – you do not need four different visualizations of the same pattern.

3. Collaboration on problem sets and labs is fun, useful, and encouraged. However, each student must turn in an individual write-up in their own words as well as code/work that is their own. Regardless of whether you work with others, what you turn in must be your own work; this includes code and interpretation of results. The names of all collaborators must be listed on each assignment. Do not copy-and-paste from other students’ responses or code - your code should never be on any other student's screen or machine.

4. All materials and resources that you use (with the exception of lecture slides) must be appropriately referenced within your assignment.

Name: Steve Gonzales, Collaborators: None

In this module, we have focused on integrating and cleaning data. In this lab, we'll look at integrating different data sources.

The data we will use comes from the City of Seattle. It consists of police beats in the Seattle area and provides information on their geographic locations. You can learn more about police precincts and beats [here](https://www.seattle.gov/police/about-us/about-policing/precinct-and-patrol-boundaries). We'll use this same dataset in a future problem set. 

The data can be found in the `Police_Beat_and_Precinct_Centerpoints.csv` file.

In [5]:
import os
os.path.abspath("")

'S:\\code\\uw\\IMT573'

In [1]:
import pandas as pd
beats_data = pd.read_csv('Police_Beat_and_Precinct_Centerpoints.csv')
df = beats_data

In [3]:
beats_data.head()

Unnamed: 0,Name,Location 1,Latitude,Longitude
0,B1,"(47.7097756394592, -122.370990523069)",47.70978,-122.37099
1,B2,"(47.6790521901374, -122.391748391741)",47.67905,-122.39175
2,B3,"(47.6812920482227, -122.364236159741)",47.68129,-122.36424
3,C1,"(47.6342500180223, -122.315684762418)",47.63425,-122.31568
4,C2,"(47.6192385752996, -122.313557430551)",47.61924,-122.31356


### Problem 1: Inspection

Inspect the beats data. How many records are there? What are the variables? Is there any missing or seemingly anomalous data?

In [6]:
print(f"There are {len(beats_data):,} total rows in the data set.")
print(f"The variables or columns are {df.columns.tolist()}")

There are 57 total rows in the data set.
The variables or columns are ['Name', 'Location 1', 'Latitude', 'Longitude']


In [7]:
display(df)

Unnamed: 0,Name,Location 1,Latitude,Longitude
0,B1,"(47.7097756394592, -122.370990523069)",47.70978,-122.37099
1,B2,"(47.6790521901374, -122.391748391741)",47.67905,-122.39175
2,B3,"(47.6812920482227, -122.364236159741)",47.68129,-122.36424
3,C1,"(47.6342500180223, -122.315684762418)",47.63425,-122.31568
4,C2,"(47.6192385752996, -122.313557430551)",47.61924,-122.31356
5,C3,"(47.6300792887474, -122.292087128251)",47.63008,-122.29209
6,CITYWIDE,"(47.6210041048652, -122.332993498998)",47.621,-122.33299
7,D1,"(47.6274421308028, -122.345705781837)",47.62744,-122.34571
8,D2,"(47.6256548876049, -122.331370005506)",47.62565,-122.33137
9,D3,"(47.6103493249325, -122.328653706199)",47.61035,-122.32865


At first glance, the data set looks fairly clean: the displayed rows contain no NaN, null, or mismatched values.
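A visual scan only covers the rows that are displayed, so a programmatic check is a useful complement. The following is a minimal sketch, assuming a frame with the same columns as `beats_data`; the helper name `summarize_quality` is hypothetical, and the three sample rows are copied from the output above so the block runs on its own.

```python
import pandas as pd

# Sample rows copied from the output above; in the notebook, pass beats_data instead.
sample = pd.DataFrame({
    "Name": ["B1", "B2", "CITYWIDE"],
    "Location 1": [
        "(47.7097756394592, -122.370990523069)",
        "(47.6790521901374, -122.391748391741)",
        "(47.6210041048652, -122.332993498998)",
    ],
    "Latitude": [47.70978, 47.67905, 47.62100],
    "Longitude": [-122.37099, -122.39175, -122.33299],
})


def summarize_quality(frame):
    """Count missing values per column and duplicated beat names (hypothetical helper)."""
    return {
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_names": int(frame["Name"].duplicated().sum()),
    }


report = summarize_quality(sample)
print(report)
```

Running `summarize_quality(beats_data)` on the full data would confirm or contradict the impression from the first few rows.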

In [16]:
# Take a look at the relationship between 'Location 1' and 'Latitude', 'Longitude'.
# 'Location 1' seems to be derived from Lat and Lon, but I want to make sure every row follows the pattern.
def compare_coordinates(row):
    loc_str = row['Location 1'].strip('()')  # Remove parentheses
    lat_loc, lon_loc = map(float, loc_str.split(','))  # Split and convert what looks to be lat & lon
    return round(lat_loc, 5) == row['Latitude'] and round(lon_loc, 5) == row['Longitude']

df['Coordinates_Match'] = df.apply(compare_coordinates, axis=1)
display(df)

# Print just the rows where the coordinates don't match:
mismatched_df = df[~df['Coordinates_Match']]
if not mismatched_df.empty:
    print("\nRows with Mismatched Coordinates:")
    print(mismatched_df)
else:
    print("\nAll coordinates match.")

Unnamed: 0,Name,Location 1,Latitude,Longitude,Coordinates_Match
0,B1,"(47.7097756394592, -122.370990523069)",47.70978,-122.37099,True
1,B2,"(47.6790521901374, -122.391748391741)",47.67905,-122.39175,True
2,B3,"(47.6812920482227, -122.364236159741)",47.68129,-122.36424,True
3,C1,"(47.6342500180223, -122.315684762418)",47.63425,-122.31568,True
4,C2,"(47.6192385752996, -122.313557430551)",47.61924,-122.31356,True
5,C3,"(47.6300792887474, -122.292087128251)",47.63008,-122.29209,True
6,CITYWIDE,"(47.6210041048652, -122.332993498998)",47.621,-122.33299,True
7,D1,"(47.6274421308028, -122.345705781837)",47.62744,-122.34571,True
8,D2,"(47.6256548876049, -122.331370005506)",47.62565,-122.33137,True
9,D3,"(47.6103493249325, -122.328653706199)",47.61035,-122.32865,True



All coordinates match.


The only other thing of note is that the `Name` column has a few rows with single-character values and one row that says "CITYWIDE".
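One way to surface such rows without scrolling is to flag names that break the usual pattern. This is only a sketch: the letter-plus-digit convention is an assumption inferred from names like `B1` and `C3`, the helper `flag_unusual_names` is hypothetical, and `"H"` is a made-up stand-in for the single-character rows, which are not shown in the output above.

```python
import re

# Assumed convention: a regular beat name is one capital letter followed by one digit.
BEAT_PATTERN = re.compile(r"[A-Z]\d")


def flag_unusual_names(names):
    """Return the names that do not look like a regular beat code."""
    return [name for name in names if not BEAT_PATTERN.fullmatch(name)]


# "B1", "C3", and "CITYWIDE" appear in the output above; "H" is illustrative only.
example_names = ["B1", "C3", "CITYWIDE", "H"]
print(flag_unusual_names(example_names))
```

In the notebook, the same helper could be called as `flag_unusual_names(df['Name'])`.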

### Problem 2: Using an API

We're going to join census data to the beats dataset. To do so, we need to first get census tract information for the beats. 

We'll use the `censusgeocode` package to get census tract information for this task. We have seen how different websites/data sources can have APIs and leverage API keys. Python also has many packages that wrap APIs, and `censusgeocode` is one such package: it interacts with the US Census Bureau's geocoding API.

To start, import the `censusgeocode` package. As always, if the package does not import, you may need to install it first.

In [3]:
!python --version

Python 3.11.6


In [18]:
!pip install censusgeocode

Collecting censusgeocode
  Downloading censusgeocode-0.5.2-py3-none-any.whl.metadata (6.6 kB)
Collecting requests-toolbelt<1,>=0.9.0 (from censusgeocode)
  Downloading requests_toolbelt-0.10.1-py2.py3-none-any.whl.metadata (14 kB)
Downloading censusgeocode-0.5.2-py3-none-any.whl (9.2 kB)
Downloading requests_toolbelt-0.10.1-py2.py3-none-any.whl (54 kB)
Installing collected packages: requests-toolbelt, censusgeocode
  Attempting uninstall: requests-toolbelt
    Found existing installation: requests-toolbelt 1.0.0
    Uninstalling requests-toolbelt-1.0.0:
      Successfully uninstalled requests-toolbelt-1.0.0
Successfully installed censusgeocode-0.5.2 requests-toolbelt-0.10.1



[notice] A new release of pip is available: 24.3.1 -> 25.0
[notice] To update, run: python.exe -m pip install --upgrade pip


In [4]:
import censusgeocode as cg
# FROM python:3.9
# RUN pip install urllib3==1.26.15 requests-toolbelt==0.10.1

ImportError: cannot import name 'appengine' from 'urllib3.contrib' (C:\Users\steve\scoop\apps\python311\current\Lib\site-packages\urllib3\contrib\__init__.py)
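
The error above is consistent with a known incompatibility: `requests-toolbelt` releases before 1.0 import `urllib3.contrib.appengine`, a module that urllib3 2.x no longer ships. The sketch below checks for that combination; the helper names are hypothetical, and the `"2.2.0"` version string is an illustrative assumption rather than a value taken from this environment.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major_version(version):
    """Return the leading integer of a version string such as '1.26.15'."""
    return int(version.split(".")[0])


def has_appengine_conflict(urllib3_version, toolbelt_version):
    """True when urllib3 is 2.x or later and requests-toolbelt is older than 1.0."""
    return major_version(urllib3_version) >= 2 and major_version(toolbelt_version) < 1


# requests-toolbelt 0.10.1 comes from the pip log above; "2.2.0" is an assumed example.
print(has_appengine_conflict("2.2.0", "0.10.1"))
print(has_appengine_conflict("1.26.15", "0.10.1"))
```

In the notebook, `installed_version("urllib3")` and `installed_version("requests-toolbelt")` would supply the real values. If the conflict is present, pinning urllib3 below 2 (for example the `urllib3==1.26.15` pin noted in the cell's comment) is one possible workaround.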

Now, use the [documentation](https://pypi.org/project/censusgeocode/) from the `censusgeocode` package to write a function with the following specifications: 

- the function should accept two arguments - one for longitude and one for latitude (in that order)
- the function should return the census tract number (often coded as `GEOID`) for the inputted latitude and longitude as a string
- the function should be named `get_census_tract`
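
A function meeting these specifications might look like the sketch below. It assumes that `censusgeocode.coordinates(x=..., y=...)` returns a dictionary with a `'Census Tracts'` list whose first entry carries a `'GEOID'` key; check that shape against the package documentation before relying on it. The optional `geocoder` parameter and the `extract_tract_geoid` and `fake_geocoder` helpers are hypothetical additions that let the logic be exercised without a network call; two-argument calls still work as specified.

```python
def extract_tract_geoid(result):
    """Pull the tract GEOID out of a geocoder response (assumed response shape)."""
    tracts = result.get("Census Tracts") or []
    if not tracts:
        return None  # no tract found for this point
    return str(tracts[0]["GEOID"])


def get_census_tract(longitude, latitude, geocoder=None):
    """Return the census tract GEOID for a point as a string, or None if not found."""
    if geocoder is None:
        # Imported lazily so the function can be defined even if the import is broken.
        import censusgeocode as cg
        geocoder = cg.coordinates
    return extract_tract_geoid(geocoder(x=longitude, y=latitude))


def fake_geocoder(x, y):
    # Hypothetical stand-in that mimics the assumed response shape.
    return {"Census Tracts": [{"GEOID": "11001980000"}]}


print(get_census_tract(-77.036543, 38.898691, geocoder=fake_geocoder))
```

Note the argument order: longitude first, then latitude, which maps to `x` and `y` respectively.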

You can find example outputs below to test your function

In [8]:
get_census_tract(-77.036543, 38.898691) #should return '11001980000'
get_census_tract(-73.985428, 40.748817) #should return '36061007600'
get_census_tract(-118.321495, 34.134117) #should return '06037980009'

'06037980009'

### Problem 3: Get census tracts

Now, for each of the beats in the beats dataset, find the associated census tract. Keep this code as you'll use it in a future problem set.
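
A minimal sketch of the per-beat lookup follows. The helper `tracts_for_beats` is hypothetical; it takes the lookup function as a parameter (in the notebook this would be `get_census_tract`) so the loop can be tried offline with a stub. The stub returns a descriptive string instead of a real tract number, and the two sample beats are copied from the rows shown earlier.

```python
def tracts_for_beats(beats, lookup):
    """Map each beat name to its census tract.

    beats  : iterable of (name, latitude, longitude) tuples
    lookup : callable taking (longitude, latitude), e.g. get_census_tract
    """
    tracts = {}
    for name, latitude, longitude in beats:
        try:
            tracts[name] = lookup(longitude, latitude)
        except Exception as error:  # e.g. a network failure; record None and continue
            print(f"Lookup failed for {name}: {error}")
            tracts[name] = None
    return tracts


def stub_lookup(longitude, latitude):
    # Hypothetical stand-in: echoes its arguments instead of calling the API.
    return f"tract-for({longitude}, {latitude})"


# Coordinates copied from the first two rows of the beats data.
sample_beats = [("B1", 47.70978, -122.37099), ("B2", 47.67905, -122.39175)]
result = tracts_for_beats(sample_beats, stub_lookup)
print(result)
```

With the real data, the tuples could come from `beats_data[['Name', 'Latitude', 'Longitude']].itertuples(index=False, name=None)`. Each lookup is a separate web request, so 57 beats may take a little while.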

Census tract codes designate specific geographic areas. Block codes are composed of a state/territory code, followed by a county code, a tract code, and a block code. You can learn more about this [here](https://transition.fcc.gov/form477/Geo/more_about_census_blocks.pdf). Confirm that each of the tracts for the beats data is in the state of Washington (code 53) and King County (the county that contains the city of Seattle, code 033).
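
The state and county check can be sketched by slicing the GEOID string. An 11-digit tract GEOID is laid out as 2 state digits, 3 county digits, and 6 tract digits. The helper names below are hypothetical, and `"53033000000"` uses made-up tract digits purely to show a passing case.

```python
def split_tract_geoid(geoid):
    """Split an 11-digit tract GEOID into state, county, and tract parts."""
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f"Expected an 11-digit tract GEOID, got {geoid!r}")
    return {"state": geoid[:2], "county": geoid[2:5], "tract": geoid[5:]}


def is_king_county_wa(geoid):
    """True when the GEOID starts with state 53 (Washington) and county 033 (King)."""
    parts = split_tract_geoid(geoid)
    return parts["state"] == "53" and parts["county"] == "033"


# '06037980009' is one of the example outputs from Problem 2 (not in Washington).
print(split_tract_geoid("06037980009"))
print(is_king_county_wa("06037980009"))
# Made-up tract digits, used only to illustrate a GEOID that passes the check.
print(is_king_county_wa("53033000000"))
```

Keeping the GEOID as a string matters here: converting it to an integer would drop leading zeros such as the one in state code `06`.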