# Exploring the 311-data.org API

Up to this point I've used data from [MyLA311](https://data.lacity.org/City-Infrastructure-Service-Requests/MyLA311-Service-Request-Data-2021/97z7-y5bt).  It is easy to use, but there is another possible source for the data: H4LA has built [311-data.org](https://311-data.org), which ingests the same 311 data into a searchable portal.  This is a quick analysis to compare the data sets I can get from each.  It is not meant to be rigorous or exhaustive, just a quick look.

I am going to compare a basic data set from each.  I'm not looking at all the API endpoints, just a select few.  Steps:

1. Grab the current processed data set (as of 12/29/2021).  
2. Use the requests library to pull from the API.
3. Compare the two results.


Simple.

In [None]:
%run start.py

import utils
from utils import read_new311_shape, marker_color_map, dt_to_object

import numpy as np

# 1 - MyLA311

I am using the geodataframe generated for 2021. As usual I'm using my helper function to load it and map the column names. Just a shortcut!

I am going to use it to evaluate/compare the two data sets.

In [None]:
%%time
new311_gdf = read_new311_shape('../data/311/clean311-geo.zip/')

Start with some basic sanity checks.

In [None]:
new311_gdf['created_dt'].min()

In [None]:
new311_gdf['created_dt'].max()

So the first and last time stamps seem reasonable.

I know some things about the API so I'm going to add a categorical for month (name).

In [None]:
new311_gdf['month_name'] = new311_gdf['created_dt'].apply(lambda dt: dt.month_name())

Let's look at counts per month.  This will drive some thinking when I get to the API.

In [None]:
new311_gdf['month_name'].value_counts(sort=False)

Here's an example using month_name to get the first and last request times for January.

In [None]:
jan_end_dt = new311_gdf.query(f"month_name == 'January'")['created_dt'].max().strftime('%Y-%m-%d %I:%M %p')

jan_start_dt = new311_gdf.query(f"month_name == 'January'")['created_dt'].min().strftime('%Y-%m-%d %I:%M %p')

print(f"First request in January was {jan_start_dt} and last request was {jan_end_dt}")

Ok, that passes the goofy test.  We have the January 311 requests from the MyLA file.

One last query on the data to look at later:

In [None]:
new311_gdf.query(f"month_name == 'January' and request_type == 'Graffiti Removal'")['nc_name'].value_counts()

# 2 - 311-data.org API

Next we'll look at the API.  This is a quick hack so I'm not doing any error checking.

I've noticed some peculiar behavior at times.  requests.get will return 200 but the JSON seems malformed.  If that happens, executing the get a second time will probably work.

I did a quick look at the API end points and the only one I'll use for starters is requests.

Also note that the limit default is 1000 rows with a max of 100000.  I've set it to 100000.  Now think back to the month_name.value_counts above. 

In [None]:
IFrame("https://dev-api.311-data.org/docs", width=1400, height=800)

In [None]:
url_base = 'https://dev-api.311-data.org/'

top_level_nouns = ['councils', 'regions', 'agencies', 'sources', 'types', 'requests', 'geojson', 'servicerequest']

So this is an extremely hacky way to do this.  We might revisit later time permitting.

In [None]:
r1 = requests.get('https://dev-api.311-data.org/' + top_level_nouns[5] + '?start_date=2021-01-01&end_date=2021-01-15&skip=0&limit=100000')
requests1_df = pd.DataFrame(r1.json())

In [None]:
r2 = requests.get('https://dev-api.311-data.org/' + top_level_nouns[5] + '?start_date=2021-01-16&end_date=2021-01-31&skip=0&limit=100000')
requests2_df = pd.DataFrame(r2.json())
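As a less hacky alternative to gluing the query string together by hand, the same URL can be assembled from a dict of parameters. This is only a sketch: `requests_url` and `URL_BASE` are names I made up for illustration, and the parameter names are just the ones used in the calls above.

```python
from urllib.parse import urlencode

# Same base URL as url_base above
URL_BASE = "https://dev-api.311-data.org/"


def requests_url(start_date, end_date, skip=0, limit=100000):
    """Build a /requests URL from its parameters (hypothetical helper)."""
    params = {
        "start_date": start_date,
        "end_date": end_date,
        "skip": skip,
        "limit": limit,
    }
    # urlencode handles the key=value&key=value joining and any escaping
    return f"{URL_BASE}requests?{urlencode(params)}"


print(requests_url("2021-01-01", "2021-01-15"))
```

The result can be passed to requests.get exactly like the hand-built strings.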

Note if either of these cells throws an exception just rerun.  I've noticed sometimes I get a 200 response, but the payload seems broken.

Not sure what is going on, but I don't want to investigate why right now.
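Rather than rerunning cells by hand, the "try it again" step could be wrapped in a small retry loop. This is a rough sketch, not something I've run against the API: `get_json_with_retry` is a made-up helper, and it assumes the failure shows up as a JSON decoding error (those are subclasses of ValueError). If the exception actually comes from somewhere else this won't catch it.

```python
import time


def get_json_with_retry(fetch, retries=3, wait_seconds=1.0):
    """Call fetch() until its .json() parses, up to `retries` attempts.

    fetch is any zero-argument callable returning a response-like object
    with a .json() method, e.g. lambda: requests.get(url).
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch().json()
        except ValueError as err:  # assumed failure mode: broken JSON payload
            last_error = err
            print(f"attempt {attempt} failed: {err}")
            time.sleep(wait_seconds)
    raise RuntimeError(f"no valid JSON after {retries} attempts") from last_error


# Usage with the requests library would look something like:
#   payload = get_json_with_retry(lambda: requests.get(url))
#   df = pd.DataFrame(payload)
```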

In [None]:
h4la_january_df = pd.concat([requests1_df, requests2_df])
h4la_january_df['createdDate'] = pd.to_datetime(h4la_january_df['createdDate'])

In [None]:
jan_end_dt = h4la_january_df['createdDate'].max().strftime('%Y-%m-%d %I:%M %p')

jan_start_dt = h4la_january_df['createdDate'].min().strftime('%Y-%m-%d %I:%M %p')

print(f"First request in January was {jan_start_dt} and last request was {jan_end_dt}")

Ok, that passes goofy test number two.  Compare the first/last times for the MyLA version and this one.

One last query on the data to look at later:

In [None]:
h4la_january_df.columns

In [None]:
h4la_january_df.query(f"typeName == 'Graffiti'")['councilName'].value_counts()

Note the differences in the data models and value types.  Taking that into consideration the values match up.

# 3 - Comparison

We've already done a back-of-the-envelope comparison.  Let's look at them side by side.

In [None]:
january_df = new311_gdf.query(f"month_name == 'January'").reset_index().drop(columns='index')

In [None]:
myla_info = Output(layout={'border': '1px solid black',
                            'width': '50%'})

h4la_info = Output(layout={'border': '1px solid black',
                            'width': '50%'})

with myla_info:
    display(HTML('<center><b>myla311 info()</b></center>'))
    # info() prints its summary and returns None, so call it directly
    january_df.info()
    
with h4la_info:
    display(HTML('<center><b>h4la info()</b></center>'))
    h4la_january_df.info()
          
HBox([myla_info, h4la_info])            

# 4 - Other API endpoints

I didn't spend much time on any of the other endpoints.  Here are a few pertinent examples.

In [None]:
council_r = requests.get('https://dev-api.311-data.org/councils')

In [None]:
councils = council_r.json()

In [None]:
len(councils)

In [None]:
council_df = pd.DataFrame(councils)

In [None]:
council_df

Not 100% sure where this is coming from.  The NCs I've dealt with include geometries (polygons).  I assume this lat/lon is the centroid but ...

In [None]:
r2 = requests.get('https://dev-api.311-data.org/councils/types/stats')

In [None]:
council_stats = pd.DataFrame(r2.json())
council_stats

Once again, I assume this is something they use in their UI? 

One last endpoint is the geometry endpoint, /geojson.  Let's see what it is...

In [None]:
geo = requests.get('https://dev-api.311-data.org/geojson')

In [None]:
len(geo.json()['features'])

So, it looks like the geojson endpoint returns the geometries I'd expect from the certified Neighborhood Councils on data.lacity.org.  My next step would be to get this in a geodataframe, but ...

That's as far as I'm going to go with this.

My conclusions:

  1. They both seem to have the same overall content.
  2. Column names and values are different so that could complicate analytic code.
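To give a feel for the second point, here is a minimal sketch of translating an API record into the MyLA naming. The column pairs and the Graffiti value pair are only the ones that showed up in the queries above. The mapping is nowhere near complete, and `to_myla_record` is a made-up helper.

```python
# Column pairs inferred from the queries above (API name -> MyLA name)
API_TO_MYLA_COLUMNS = {
    "createdDate": "created_dt",
    "typeName": "request_type",
    "councilName": "nc_name",
}

# The one value difference seen so far
API_TO_MYLA_TYPES = {"Graffiti": "Graffiti Removal"}


def to_myla_record(api_record):
    """Rename keys, then translate the request type if we know a mapping."""
    record = {API_TO_MYLA_COLUMNS.get(key, key): value
              for key, value in api_record.items()}
    if "request_type" in record:
        request_type = record["request_type"]
        record["request_type"] = API_TO_MYLA_TYPES.get(request_type, request_type)
    return record


# "Example NC" is a placeholder value, not a real council name
print(to_myla_record({"typeName": "Graffiti", "councilName": "Example NC"}))
```

On a dataframe the same dicts could be handed to `rename(columns=...)` and `replace(...)`.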

# 5 - Build Yearly Dataset

So now the task is to get the data as one CSV for 2021.  Remember we have the 100000 limit on each API call. How could we automate this?  

  1. The first (easy part) is to construct the queries.  You've seen the components above.  
  2. Using the queries we can call the API for a month's worth of requests, split into two half-month calls so each call stays within the 100K limit.
  3. Iterate on "months" in a year and concat results.
  4. Try these steps and see if we need to add exceptions to the mix.
  
**Note:** I could look into a pagination approach but this seems ...
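For reference, the pagination idea from the note could look roughly like this. It is an untested sketch that assumes skip and limit behave like an offset and a page size, which I haven't verified against the API. `fetch_page` is a hypothetical callable standing in for the actual request.

```python
def fetch_all_pages(fetch_page, limit=100000):
    """Collect rows page by page until a short (or empty) page comes back.

    fetch_page(skip, limit) is a hypothetical callable returning a list of
    rows, e.g. a wrapper around requests.get(...).json().
    """
    rows = []
    skip = 0
    while True:
        page = fetch_page(skip, limit)
        rows.extend(page)
        # A page shorter than the limit means there is nothing left to fetch
        if len(page) < limit:
            break
        skip += limit
    return rows


# Toy stand-in for the API: 25 fake rows served 10 at a time
fake_rows = list(range(25))
all_rows = fetch_all_pages(lambda skip, limit: fake_rows[skip:skip + limit], limit=10)
print(len(all_rows))
```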

## 1 - Build the query

The calendar module comes in handy for this!

In [None]:
import calendar
import datetime

https://stackoverflow.com/questions/36155332/how-to-get-the-first-day-and-last-day-of-current-month-in-python

Remember we need date ranges for two queries for any given month.  The calendar module gives us the number of days in the month, and we split on that.

First the mechanics (i.e. see how it works).

In [None]:
_, num_days = calendar.monthrange(2021, 1)

In [None]:
int(num_days/2) + 1

In [None]:
first_day = datetime.date(2021, 1, 1).strftime('%Y-%m-%d')
end_first_half = datetime.date(2021, 1, num_days // 2).strftime('%Y-%m-%d')
start_second_half = datetime.date(2021, 1, num_days // 2 + 1).strftime('%Y-%m-%d')
last_day = datetime.date(2021, 1, num_days).strftime('%Y-%m-%d')

In [None]:
print(first_day)
print(end_first_half)
print(start_second_half)
print(last_day)

Now build a function to create query strings.

In [None]:
def build_query_for_month(month, year=2021):
    """Return two query URLs covering the first and second half of a month."""
    _, num_days = calendar.monthrange(year, month)
    mid_day = num_days // 2

    # Dates formatted as YYYY-MM-DD for the start_date/end_date parameters
    first_day = datetime.date(year, month, 1).strftime('%Y-%m-%d')
    end_first_half = datetime.date(year, month, mid_day).strftime('%Y-%m-%d')
    start_second_half = datetime.date(year, month, mid_day + 1).strftime('%Y-%m-%d')
    last_day = datetime.date(year, month, num_days).strftime('%Y-%m-%d')

    base = "https://dev-api.311-data.org/requests?"
    paging = "&skip=0&limit=100000"

    q1 = base + f"start_date={first_day}&end_date={end_first_half}" + paging
    q2 = base + f"start_date={start_second_half}&end_date={last_day}" + paging

    return q1, q2

In [None]:
build_query_for_month(1)

In [None]:
q1, q2 = build_query_for_month(6)

In [None]:
print(q1)
print(q2)

So now we have the function we need that converts the month to two query strings.
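A quick sanity check on the half-month split: every day of the year should land in exactly one range. The sketch below redoes the same monthrange arithmetic with made-up helper names (`half_month_ranges`, `covered_days`) and only checks the calendar side. Whether the API treats end_date as inclusive is a separate question I haven't checked here.

```python
import calendar
import datetime


def half_month_ranges(year, month):
    """Return (start, end) date pairs for the two halves of a month."""
    _, num_days = calendar.monthrange(year, month)
    mid_day = num_days // 2
    return [
        (datetime.date(year, month, 1), datetime.date(year, month, mid_day)),
        (datetime.date(year, month, mid_day + 1), datetime.date(year, month, num_days)),
    ]


def covered_days(year):
    """List every day touched by the half-month ranges, in order."""
    days = []
    for month in range(1, 13):
        for start, end in half_month_ranges(year, month):
            day = start
            while day <= end:
                days.append(day)
                day += datetime.timedelta(days=1)
    return days


days_2021 = covered_days(2021)
# Equal numbers mean no day is queried twice
print(len(days_2021), len(set(days_2021)))
```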

## 2 - Get One Month as CSV

So I'm getting intermittent exceptions.  Need to see if I can fix this.

In [None]:
def df_for_month(month, year=2021):
    """Get one month's worth of 311 data.
       The API limit is 100000.  All the months have more than that.
       Use build_query_for_month to construct needed queries.
       Input: month - int
              year - int
       Note: A kludgey exception handler that retries each failed call once.
    """
    q1, q2 = build_query_for_month(month, year)
    try:
        first_half_df = pd.DataFrame(requests.get(q1).json())
    except Exception:
        # Intermittent bad payload: report which half failed and retry once
        print(f"first half {month}")
        first_half_df = pd.DataFrame(requests.get(q1).json())
        
    try:
        second_half_df = pd.DataFrame(requests.get(q2).json())
    except Exception:
        print(f"second half {month}")
        second_half_df = pd.DataFrame(requests.get(q2).json())
    
    month_df = pd.concat([first_half_df, second_half_df])
    month_df['createdDate'] = pd.to_datetime(month_df['createdDate'])
    
    return month_df

Easy peasy.  Here are a couple of examples.

In [None]:
jan_df = df_for_month(1)

In [None]:
len(jan_df)

Recall, from the MyLA data pull:

In [None]:
len(new311_gdf.query(f"month_name == 'January'"))

So, looks reasonable.

## 3 - Combine Monthly CSVs

Once again start with the mechanics.

In [None]:
feb_df = df_for_month(2)

In [None]:
first_two_months_df = pd.concat([jan_df, feb_df])

In [None]:
print(first_two_months_df['createdDate'].max())
print(first_two_months_df['createdDate'].min())

In [None]:
def build_df_for_year(year=2021):
    
    list_of_dfs = [df_for_month(month, year) for month in range(1, 13)]
    
    return pd.concat(list_of_dfs)

In [None]:
annual_df = build_df_for_year()

In [None]:
delta = len(new311_gdf) - len(annual_df)
print(f"difference between myla and api: {delta}")

In [None]:
annual2020_df = build_df_for_year(2020)

In [None]:
len(annual2020_df)

In [None]:
annual2020_df['createdDate'].max()

In [None]:
annual2020_df['createdDate'].min()

In [None]:
#annual2020_df.to_csv('api-call-2020.csv')

In [None]:
#annual_df.to_csv('../data/for-goog/api-call-2021.csv')