# API Data Analysis

This is a quick analysis of the data pulled from myla (data.lacity.org) and H4LA's 311-data.org.

In [None]:
%run start.py

import utils
from utils import read_new311_shape, marker_color_map, dt_to_object, read_ncs

import numpy as np

# 1 - MyLA311

I am using the geodataframe generated for 2021. As usual I'm using my helper function to load it and map the column names. Just a shortcut!

I am going to use it to evaluate/compare the two data sets.

In [None]:
%%time
myla311_gdf = read_new311_shape('../data/311/clean311-geo.zip/')

In [None]:
myla311_gdf['month_name'] = myla311_gdf['created_dt'].apply(lambda dt: dt.month_name())

# 2 - H4LA

This is the csv file I generated in [api-hacks.ipynb](api-hacks.ipynb) for 2021.

It's a csv, so the pandas dtype for createdDate is object. I'll convert it to datetime and add the month_name categorical.

In [None]:
api311_df = pd.read_csv('../data/for-goog/api-call-2021.csv.zip')

In [None]:
api311_df['createdDate'] = pd.to_datetime(api311_df['createdDate'])

In [None]:
api311_df['month_name'] = api311_df['createdDate'].apply(lambda dt: dt.month_name())
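
The conversion above can also be written with the vectorized `.dt` accessor. This is a minimal sketch on made-up timestamps; the toy frame and the `month_order` list are assumptions for illustration and are not part of the notebook. Wrapping the names in an ordered `pd.Categorical` is one way to get month counts listed in calendar order regardless of row order.

```python
import pandas as pd

# Made-up rows standing in for the csv; createdDate starts out as strings (object dtype).
toy = pd.DataFrame(
    {"createdDate": ["2021-06-20 09:00:00", "2021-01-15 08:30:00", "2021-06-02 17:45:00"]}
)
toy["createdDate"] = pd.to_datetime(toy["createdDate"])

# Calendar order for the categories (hypothetical helper list).
month_order = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Vectorized month names, stored as an ordered categorical.
toy["month_name"] = pd.Categorical(
    toy["createdDate"].dt.month_name(), categories=month_order, ordered=True
)

# With a categorical, sort=False keeps the category (calendar) order.
counts = toy["month_name"].value_counts(sort=False)
print(counts[counts > 0])
```

Months with no records show up with a count of zero, which makes gaps easy to spot.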

I will first do the basic look-see.

In [None]:
myla311 = len(myla311_gdf)
api311 = len(api311_df)
print(f"size of myla311: {myla311}")
print(f"size of api311: {api311}")

delta = myla311 - api311
print(f"delta: {delta} ({(delta / myla311):.1%} smaller)")

Honestly, I have no idea which data set I would consider to be ground truth.  Some number of records seem to have been filtered for the api set.

I am going to investigate further.

In [None]:
#myla311_gdf['month_name'].value_counts(sort=False)

In [None]:
#api311_df['month_name'].value_counts(sort=False)

As a starting point I want to look at value counts by month side-by-side.

In [None]:
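# start.py may already import these; repeated here so the cell stands alone
from IPython.display import HTML, display
from ipywidgets import HBox, Output

# One bordered output widget per data set, each taking half the width
myla_vc = Output(layout={'border': '1px solid black',
                         'width': '50%'})

api_vc = Output(layout={'border': '1px solid black',
                        'width': '50%'})

with myla_vc:
    display(HTML('<center><b>myla</b></center>'))
    display(myla311_gdf['month_name'].value_counts(sort=False))

with api_vc:
    display(HTML('<center><b>api</b></center>'))
    display(api311_df['month_name'].value_counts(sort=False))

HBox([myla_vc, api_vc])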

They look pretty similar?  If you look close you'll see some differences in June and July.  We'll see more in a bit.
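
Another way to compare the monthly counts, sketched here on made-up data, is to join the two value counts into one table and add a delta column so the differences do not have to be spotted by eye. The Series names and values below are assumptions for illustration only.

```python
import pandas as pd

# Made-up month_name columns standing in for the two data sets.
myla_months = pd.Series(["January", "June", "June", "July"], name="month_name")
api_months = pd.Series(["January", "June", "July"], name="month_name")

# The dict keys become the column names; months missing on one side become 0.
side_by_side = (
    pd.concat(
        {"myla": myla_months.value_counts(), "api": api_months.value_counts()},
        axis=1,
    )
    .fillna(0)
    .astype(int)
)
side_by_side["delta"] = side_by_side["myla"] - side_by_side["api"]
print(side_by_side)
```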

# 3 - Investigate API Differences

From the initial sizing analysis the api has 24256 fewer records than myla.  Let's investigate.

First I want to know if the unique things are unique. The SRNumber (service request number) should be.

In [None]:
print(len(api311_df))
print(len(api311_df['srnumber'].unique()))

Don't ya just hate data sometimes!! One of the SRNumbers is duplicated in the api dataset. I bet they are the same so ...

Let's find it and see.

In [None]:
the_dup_sr = api311_df.loc[api311_df['srnumber'].duplicated(), :].iloc[0]['srnumber']
the_dup_sr

At least I can find this SRNumber in both of the dataframes.

In [None]:
myla311_gdf.query("SRNumber == @the_dup_sr")

In [None]:
api311_df.query("srnumber == @the_dup_sr").drop('Unnamed: 0', axis=1)

So it's only duplicated in the api data set.  If you're using that df then toss it.
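
Tossing the duplicate could look roughly like this. It is a sketch on a made-up frame: the SR numbers and the `note` column are hypothetical, and keeping the first occurrence is an assumption that only makes sense if the duplicated rows really are the same.

```python
import pandas as pd

# Made-up records; "1-101" appears twice, like the duplicated SRNumber above.
toy_api = pd.DataFrame(
    {
        "srnumber": ["1-100", "1-101", "1-101", "1-102"],
        "note": ["a", "b", "b", "c"],  # hypothetical payload column
    }
)

# keep=False flags every row that shares an srnumber, so both copies can be inspected.
dup_rows = toy_api[toy_api["srnumber"].duplicated(keep=False)]
print(dup_rows)

# Keep the first copy of each srnumber and drop the rest.
deduped = toy_api.drop_duplicates(subset="srnumber", keep="first").reset_index(drop=True)
print(len(toy_api), "->", len(deduped))
```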

Next I want to understand the differences. For that, I need a dataframe that has the records in myla311 that are not in api311. I will use some set hacking with SRNumber for this.

In [None]:
myla_sr_set = set(myla311_gdf['SRNumber'])

api_sr_set = set(api311_df['srnumber'])

diff_sr_set = myla_sr_set.difference(api_sr_set)

So diff_sr_set is the set of SRNumbers in myla and not in api311.  I can use this to build the dataframe.

In [None]:
diff_df = myla311_gdf.query("SRNumber in @diff_sr_set").reset_index(drop=True)

In [None]:
diff_df.info()

This is the subset of records in myla that are not in api based on the SRNumber (a unique ID from the city?).
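
The same subset can be built without the intermediate sets. This is a minimal sketch with made-up SR numbers (the toy frames are assumptions for illustration); it uses a negated `isin` mask instead of `query`.

```python
import pandas as pd

# Made-up stand-ins for the two data sets; note the different column casing.
toy_myla = pd.DataFrame({"SRNumber": ["1-100", "1-101", "1-102", "1-103"]})
toy_api = pd.DataFrame({"srnumber": ["1-100", "1-102"]})

# True for rows whose SRNumber never appears in the api data.
missing_mask = ~toy_myla["SRNumber"].isin(toy_api["srnumber"])
toy_diff = toy_myla[missing_mask].reset_index(drop=True)
print(toy_diff)
```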

I'm going to continue to look at this, but for the types of analysis I do it seems a bit like I'm paving the cowpath!

Remember above I looked at things by month_name?  Check this out.

In [None]:
diff_df['month_name'].value_counts(sort=False)

Interesting or weird?  Couple of things seem a bit strange.  Huge spike in June and July.  And it mysteriously stops at November?

It's still such a small count overall so my reaction is meh.

I suppose it's worthwhile to dig deeper?  (just for fun)

In [None]:
diff_df.columns

In [None]:
diff_df['created_dt'].max()

In [None]:
diff_df['created_dt'].min()

In [None]:
api311_df.columns

I remember hearing about problems with datetimes?

In [None]:
print(diff_df['created_dt'].max())
print(diff_df['created_dt'].min())

print(diff_df['closed_dt'].max())
print(diff_df['closed_dt'].min())

Ok.  These dates are all within range (i.e. 2021).  Note: null values for closed_dt are ok - just means it's not closed.
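
The range check can also be done per row rather than by eyeballing min and max. This is a rough sketch on made-up dates (the toy frame and the 2021 bounds are assumptions for illustration); a null closed_dt is treated as acceptable, matching the note above.

```python
import pandas as pd

# Made-up rows: one open request (null closed_dt) and one closed outside 2021.
toy = pd.DataFrame(
    {
        "created_dt": pd.to_datetime(["2021-01-03", "2021-07-19", "2021-11-30"]),
        "closed_dt": pd.to_datetime(["2021-01-05", None, "2022-01-02"]),
    }
)

start = pd.Timestamp("2021-01-01")
end = pd.Timestamp("2021-12-31 23:59:59")

# between() is inclusive on both ends and returns False for NaT.
created_ok = toy["created_dt"].between(start, end)
# A missing closed_dt just means the request is still open, so allow it.
closed_ok = toy["closed_dt"].isna() | toy["closed_dt"].between(start, end)

print(toy[~(created_ok & closed_ok)])  # rows that fall outside the range
```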

Finally let's look at neighborhood councils (nc and nc_name). Need to check for nulls and "invalid" ids/names.

In [None]:
diff_df['nc'].isnull().sum()

So 214 out of 24257 is about 0.9%. It's a small number of a small number. Sort of irrelevant?
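
For reference, the share of null nc values is a one-line calculation. The sketch below uses only the two counts quoted above; the variable names are made up.

```python
# Counts taken from the text above.
null_nc = 214
diff_total = 24257

share = null_nc / diff_total
print(f"{null_nc} of {diff_total} is {share:.2%}")
```

On the real dataframe, `diff_df['nc'].isnull().mean()` should give the same fraction directly.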

I've already spent way too much time on this, but let's see one last thing. How many of the nc's are "valid" (i.e. they are in the certified nc data from empowerla)?

In [None]:
neighborhoods_gdf = read_ncs()

nc_id_set = set(neighborhoods_gdf['nc_id'])

In [None]:
nc_set = set(diff_df['nc'])

In [None]:
nc_id_set.difference(nc_set)

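So the set of certified ids minus the set of ids in diff_df is empty, which means every certified nc shows up among the missing records. Strictly, that is the reverse of a validity check: to confirm the nc's in diff_df are all valid (as defined by the certified data set from data.lacity.org), the difference has to run the other way, nc_set minus nc_id_set, with the nulls set aside.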
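
The direction of a set difference matters for this kind of check. Here is a small sketch with made-up ids (the values and variable names are assumptions for illustration, not the real nc ids) showing both directions and how null values can be set aside first.

```python
import math

# Made-up certified ids and observed nc values (floats, with one null).
certified_ids = {1, 2, 3}
observed = [1.0, 2.0, 2.0, float("nan"), 99.0]

# Drop nulls before building the set; NaN never compares equal to anything.
observed_ids = {int(v) for v in observed if not math.isnan(v)}

# Observed ids that are NOT certified: this is the validity check.
invalid_ids = observed_ids - certified_ids
# Certified ids that never appear in the data: a coverage check, not validity.
unused_ids = certified_ids - observed_ids

print("invalid:", invalid_ids)
print("unused:", unused_ids)
```

An empty `invalid_ids` is what would show that every nc in the data is certified.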

I'm not going any further.

# 4 - Conclusion 

Use either one!