# In-Class Demo: Analyzing personal location data (as a Bricoleur)

In [53]:
import pandas as pd
import numpy as np

I requested all of my location data from Google...going back to 2013!!!

Let's take a look at it...

It came in a series of JSON files.

_(Take a look at the Takeout file structure on Prof. W-B's machine.)_

In [59]:
# Let's import it using pandas
df2 = pd.read_json("Takeout/Location History/Records.json")

In [60]:
# And look at it
df2

Unnamed: 0,locations
0,"{'latitudeE7': 344220989, 'longitudeE7': -1197..."
1,"{'latitudeE7': 344220834, 'longitudeE7': -1197..."
2,"{'latitudeE7': 344220752, 'longitudeE7': -1197..."
3,"{'latitudeE7': 344221135, 'longitudeE7': -1197..."
4,"{'latitudeE7': 344221136, 'longitudeE7': -1197..."
...,...
1055967,"{'latitudeE7': 460727900, 'longitudeE7': -1183..."
1055968,"{'latitudeE7': 460727900, 'longitudeE7': -1183..."
1055969,"{'latitudeE7': 460727900, 'longitudeE7': -1183..."
1055970,"{'latitudeE7': 460727900, 'longitudeE7': -1183..."


At this point, I weighed my options...
* I could try to parse the JSON myself
* Or (because I'm lazy), I could Google around to see if anyone else has already done this

Which one do you think I went with?

In [61]:
# Used Google Takeout Parser: https://pypi.org/project/google-takeout-parser/
# I had to first install it (using pip)
# Then I ran it, using this sample code from the link above

from google_takeout_parser.models import Location
from google_takeout_parser.path_dispatch import TakeoutParser
locations = list(TakeoutParser("Takeout").parse(filter_type=Location))
len(locations)

[I 230412 21:37:08 path_dispatch:283] Parsing 'Location History/Records.json' using '_parse_location_history'
[W 230412 21:37:14 path_dispatch:300] 'longitudeE7'


1055972
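For comparison, here's roughly what the do-it-yourself route could look like. This is only a sketch: it assumes Records.json holds a top-level `"locations"` list whose entries carry `latitudeE7` and `longitudeE7` (degrees multiplied by 10^7), which matches the values shown above (344220989 becomes 34.4220989). The two records are typed in by hand so the cell runs without the Takeout folder, and `df_manual` is just an illustrative name.

```python
import json
import pandas as pd

# Hand-made stand-in for the contents of Records.json (an assumption about its layout)
raw = '''
{"locations": [
  {"latitudeE7": 344220989, "longitudeE7": -1197236823, "accuracy": 27},
  {"latitudeE7": 344220834, "longitudeE7": -1197236951, "accuracy": 14}
]}
'''

records = json.loads(raw)["locations"]

# One row per record, one column per key
df_manual = pd.json_normalize(records)

# E7 values are degrees * 10**7, so divide to get decimal degrees
df_manual["lat"] = df_manual["latitudeE7"] / 1e7
df_manual["lon"] = df_manual["longitudeE7"] / 1e7

print(df_manual[["lat", "lon", "accuracy"]])
```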

Now that I have the parsed data, I spend some time figuring out what format it's in:

In [98]:
# I see that it's a variable called locations
# First I check the type
type(locations)

list

In [99]:
locations

[Location(lat=34.4220989, lng=-119.7236823, accuracy=27, dt=datetime.datetime(2013, 8, 13, 22, 5, 37, 271000, tzinfo=datetime.timezone.utc)),
 Location(lat=34.4220834, lng=-119.7236951, accuracy=14, dt=datetime.datetime(2013, 8, 13, 22, 5, 54, 4000, tzinfo=datetime.timezone.utc)),
 Location(lat=34.4220752, lng=-119.7237066, accuracy=8, dt=datetime.datetime(2013, 8, 13, 22, 6, 40, 242000, tzinfo=datetime.timezone.utc)),
 Location(lat=34.4221135, lng=-119.723667, accuracy=35, dt=datetime.datetime(2013, 8, 13, 22, 8, 43, 504000, tzinfo=datetime.timezone.utc)),
 Location(lat=34.4221136, lng=-119.7236601, accuracy=20, dt=datetime.datetime(2013, 8, 13, 22, 8, 52, 904000, tzinfo=datetime.timezone.utc)),
 Location(lat=34.4221115, lng=-119.7236544, accuracy=15, dt=datetime.datetime(2013, 8, 13, 22, 8, 55, 750000, tzinfo=datetime.timezone.utc)),
 Location(lat=34.4221026, lng=-119.7236598, accuracy=13, dt=datetime.datetime(2013, 8, 13, 22, 9, 13, 119000, tzinfo=datetime.timezone.utc)),
 Location(

In [63]:
# Then I look at a single element from this list
locations[0]

Location(lat=34.4220989, lng=-119.7236823, accuracy=27, dt=datetime.datetime(2013, 8, 13, 22, 5, 37, 271000, tzinfo=datetime.timezone.utc))

In [87]:
# What type is a single element?
type(locations[0])

google_takeout_parser.models.Location

I have a bit of experience with object-oriented programming in Python, so I know that this is a special location object, which has properties (variables) that I can access. This is where I am using my _past knowledge_ to work with these new materials.

In [88]:
# Can access the datetime
locations[0].dt

datetime.datetime(2013, 8, 13, 22, 5, 37, 271000, tzinfo=datetime.timezone.utc)

In [89]:
# Can access just the latitude
locations[0].lat

34.4220989

Ok, this location object is cool! But to do some data analysis, I want to put it into a pandas DataFrame. I just want the lat/lon and the timestamps.

Here's how I did that:

In [83]:
# Make a blank data frame with the columns I want
df4 = pd.DataFrame(columns=["lat","lon", "timestamp"])

# Loop through locations -- remember, it's a list!
for i in locations:
    # When I first did this, I got an error after ~1 million rows...
    # So I tweaked it to do a try/except, and keep going if there's an error
    try:
        row = pd.DataFrame({'lat':[i.lat], 'lon': [i.lng],'timestamp': [i.dt]})
        df4 = pd.concat([df4, row], ignore_index=True)
    # If it gets an error, just print this (but keep going)
    except Exception:
        print("An exception occurred")

An exception occurred
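A likely reason the loop above is so slow: `pd.concat` copies the whole DataFrame on every pass, so the work grows with each added row. A possible alternative (a sketch, not what I ran) is to collect plain dictionaries in a list and build the DataFrame once at the end. The `Loc` namedtuple and `sample_locations` below are hypothetical stand-ins so the cell runs without the parser installed.

```python
from collections import namedtuple
from datetime import datetime, timezone

import pandas as pd

# Hypothetical stand-in for the parser's Location objects
Loc = namedtuple("Loc", ["lat", "lng", "dt"])
sample_locations = [
    Loc(34.4220989, -119.7236823, datetime(2013, 8, 13, 22, 5, 37, tzinfo=timezone.utc)),
    Loc(34.4220834, -119.7236951, datetime(2013, 8, 13, 22, 5, 54, tzinfo=timezone.utc)),
    "not a location",  # simulates an entry that is not a usable Location
]

rows = []
skipped = 0
for loc in sample_locations:
    try:
        rows.append({"lat": loc.lat, "lon": loc.lng, "timestamp": loc.dt})
    except AttributeError:
        # Entry has no lat/lng/dt attributes, so count it and keep going
        skipped += 1

# Build the DataFrame once, from the full list of rows
df_fast = pd.DataFrame(rows, columns=["lat", "lon", "timestamp"])
print(df_fast)
print("skipped:", skipped)
```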


I'm not going to run this for you now, because it took about 15 minutes (!!!). But here's the result:

In [84]:
df4

Unnamed: 0,lat,lon,timestamp
0,34.422099,-119.723682,2013-08-13 22:05:37.271000+00:00
1,34.422083,-119.723695,2013-08-13 22:05:54.004000+00:00
2,34.422075,-119.723707,2013-08-13 22:06:40.242000+00:00
3,34.422114,-119.723667,2013-08-13 22:08:43.504000+00:00
4,34.422114,-119.723660,2013-08-13 22:08:52.904000+00:00
...,...,...,...
1055966,46.072790,-118.328990,2023-04-07 23:21:25.872000+00:00
1055967,46.072790,-118.328990,2023-04-07 23:23:25.886000+00:00
1055968,46.072790,-118.328990,2023-04-07 23:25:25.899000+00:00
1055969,46.072790,-118.328990,2023-04-07 23:27:25.909000+00:00


In [92]:
# Export as a CSV file
df4[['lat','lon','timestamp']].to_csv("prof_wb_location.csv")

Now, I want to find out: **How much time is between the data points Google collects about my location?**

To do this, we can use a fun little built-in pandas function called [.diff()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.diff.html). (You'll use it in Project 8, hint hint.)

It calculates the difference between the value in a row and the value in the row directly above it.

(Note: Because my data is already ordered chronologically, I don't have to sort -- that may not be the case with _your_ data.)
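Here's a small sketch of what `.diff()` does, using three made-up timestamps that are deliberately out of order, so the sort step matters in this case. The `demo` frame is purely illustrative.

```python
import pandas as pd

# Three made-up timestamps, deliberately out of order
demo = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2023-04-07 23:25:00", "2023-04-07 23:21:00", "2023-04-07 23:23:30"],
        utc=True,
    )
})

# Sort chronologically first, then subtract each row's predecessor
demo = demo.sort_values("timestamp").reset_index(drop=True)
demo["time_delta"] = demo["timestamp"].diff()

print(demo)
```

The first row has nothing above it, so its delta comes out as `NaT` (the datetime version of "missing").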

In [90]:
# Calculate the time deltas using .diff
df4["time_delta"] = df4["timestamp"].diff()

In [91]:
df4

Unnamed: 0,lat,lon,timestamp,time_delta
0,34.422099,-119.723682,2013-08-13 22:05:37.271000+00:00,NaT
1,34.422083,-119.723695,2013-08-13 22:05:54.004000+00:00,0 days 00:00:16.733000
2,34.422075,-119.723707,2013-08-13 22:06:40.242000+00:00,0 days 00:00:46.238000
3,34.422114,-119.723667,2013-08-13 22:08:43.504000+00:00,0 days 00:02:03.262000
4,34.422114,-119.723660,2013-08-13 22:08:52.904000+00:00,0 days 00:00:09.400000
...,...,...,...,...
1055966,46.072790,-118.328990,2023-04-07 23:21:25.872000+00:00,0 days 00:02:00.018000
1055967,46.072790,-118.328990,2023-04-07 23:23:25.886000+00:00,0 days 00:02:00.014000
1055968,46.072790,-118.328990,2023-04-07 23:25:25.899000+00:00,0 days 00:02:00.013000
1055969,46.072790,-118.328990,2023-04-07 23:27:25.909000+00:00,0 days 00:02:00.010000
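To turn a column of time deltas into an actual answer, one option is to convert the gaps to seconds and summarize them. The sketch below uses the first five timestamps from the table above, truncated to whole seconds, as a stand-in for the full column; the `> 60` threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# First five timestamps from the table above, truncated to whole seconds
timestamps = pd.Series(pd.to_datetime(
    [
        "2013-08-13 22:05:37",
        "2013-08-13 22:05:54",
        "2013-08-13 22:06:40",
        "2013-08-13 22:08:43",
        "2013-08-13 22:08:52",
    ],
    utc=True,
))

# Gaps between consecutive points, in seconds (drop the leading NaT)
seconds = timestamps.diff().dt.total_seconds().dropna()

print(seconds.describe())
print("median gap (s):", seconds.median())
print("share of gaps over 60 s:", (seconds > 60).mean())
```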


Now I want to answer: 
* For each of these data points, how far was I from Olin Hall (on Whitman's campus)?
* When was I the furthest from Olin Hall? (And where was I?)
* How many data points has Google logged me within 1 km of Olin Hall (basically, at or right next to Olin Hall)?

First, we have to find the location of Olin Hall. To do this, I went to Google Maps.

In [None]:
# Olin Hall
# lat: 46.0729805, long: -118.3286593

Now, we have to figure out how to calculate distance... any ideas??

Euclidean distance formula: square root((lat2 - lat1)^2 + (lon2 - lon1)^2)
Can we use this?




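No! Latitude and longitude are angles on a (roughly) spherical Earth, not x/y coordinates on a flat plane, so the Euclidean formula gives a distorted answer. So...what do we do?
We have to use something called the [haversine formula](https://community.esri.com/t5/coordinate-reference-systems-blog/distance-on-a-sphere-the-haversine-formula/ba-p/902128). (Let's go back to the slides.)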
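One way to see why plain Euclidean distance on lat/lon misleads: a degree of longitude covers less ground the further you get from the equator. The sketch below assumes a perfectly spherical Earth with a radius of 6367 km, and `km_per_degree_longitude` is a helper name made up for this illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6367  # spherical-Earth approximation (assumption)

def km_per_degree_longitude(lat_deg):
    """Length of one degree of longitude at a given latitude, on a sphere."""
    return EARTH_RADIUS_KM * np.cos(np.radians(lat_deg)) * np.pi / 180

for name, lat in [("equator", 0.0), ("first data point", 34.42), ("Olin Hall", 46.07)]:
    print(f"{name:>16}: {km_per_degree_longitude(lat):.1f} km per degree of longitude")
```

Since one degree east-west is not a fixed number of kilometers, treating degrees like grid units mixes up distances at different latitudes.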

In [93]:
# But how to implement it in Python? Well... I looked at Stack Overflow:
# https://stackoverflow.com/questions/29545704/fast-haversine-approximation-python-pandas

# In that thread, I found this function:

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    Args may be scalars or arrays of equal length (NumPy broadcasting applies).

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c  # 6367 km approximates Earth's radius (the mean radius is about 6371 km)
    return km

In [94]:
# This Stack Overflow thread also had advice on how to implement in pandas
# for millions of rows:
df4['distance'] = haversine_np(df4['lon'],df4['lat'],-118.3286593,46.0729805)

In [95]:
# Now...remember to always do a sanity check!!
# Does it seem right?
# Remember, the units are in km
# We can also double check it with NOAA's distance calculator: https://www.nhc.noaa.gov/gccalc.shtml
df4

Unnamed: 0,lat,lon,timestamp,time_delta,distance
0,34.422099,-119.723682,2013-08-13 22:05:37.271000+00:00,NaT,1300.042288
1,34.422083,-119.723695,2013-08-13 22:05:54.004000+00:00,0 days 00:00:16.733000,1300.044103
2,34.422075,-119.723707,2013-08-13 22:06:40.242000+00:00,0 days 00:00:46.238000,1300.045098
3,34.422114,-119.723667,2013-08-13 22:08:43.504000+00:00,0 days 00:02:03.262000,1300.040555
4,34.422114,-119.723660,2013-08-13 22:08:52.904000+00:00,0 days 00:00:09.400000,1300.040491
...,...,...,...,...,...
1055966,46.072790,-118.328990,2023-04-07 23:21:25.872000+00:00,0 days 00:02:00.018000,0.033161
1055967,46.072790,-118.328990,2023-04-07 23:23:25.886000+00:00,0 days 00:02:00.014000,0.033161
1055968,46.072790,-118.328990,2023-04-07 23:25:25.899000+00:00,0 days 00:02:00.013000,0.033161
1055969,46.072790,-118.328990,2023-04-07 23:27:25.909000+00:00,0 days 00:02:00.010000,0.033161
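Another way to sanity check is to feed the function inputs where the answer is known ahead of time. In this sketch, the haversine function is repeated in compact form so the cell runs on its own. The checks are: a point compared with itself should give 0, a quarter of the way around the equator should give radius times pi/2, and the first data point should land near the 1300 km shown in the table.

```python
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """Same formula as the function above, repeated so this cell runs on its own."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 6367 * 2 * np.arcsin(np.sqrt(a))

# Olin Hall compared with itself
same_point = haversine_np(-118.3286593, 46.0729805, -118.3286593, 46.0729805)

# A quarter of the way around the equator
quarter_equator = haversine_np(0.0, 0.0, 90.0, 0.0)
expected_quarter = 6367 * np.pi / 2

# First logged point vs. Olin Hall
first_point = haversine_np(-119.7236823, 34.4220989, -118.3286593, 46.0729805)

print(same_point, quarter_equator, expected_quarter, first_point)
```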


In [96]:
# Where is the furthest Prof W-B has been from Olin Hall?
df4[df4["distance"] == df4["distance"].max()]

Unnamed: 0,lat,lon,timestamp,time_delta,distance
220647,-26.131608,28.234575,2014-05-06 15:23:29.378000+00:00,0 days 00:01:00.683000,16315.444873
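An alternative way to pull out the furthest row is `idxmax()`, which returns the index label of the largest value. This is a sketch on a tiny frame built from three rows shown above. One difference worth knowing: the boolean-mask version returns every row that ties for the maximum, while `idxmax()` returns only the first.

```python
import pandas as pd

# Tiny frame standing in for df4, using three rows from the tables above
demo = pd.DataFrame({
    "lat": [34.422099, -26.131608, 46.072790],
    "lon": [-119.723682, 28.234575, -118.328990],
    "distance": [1300.042288, 16315.444873, 0.033161],
})

# idxmax() gives the index label of the largest distance; .loc pulls that row
furthest = demo.loc[demo["distance"].idxmax()]
print(furthest)
```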


Where is this location? You'll have to type it into Google Maps to find out...

In [102]:
# How many times did Google log Prof W-B as being within 1 km of Olin Hall? (distance is in km)
df4[df4["distance"] < 1]

Unnamed: 0,lat,lon,timestamp,time_delta,distance
910088,46.067649,-118.318802,2022-03-09 23:22:58.463000+00:00,0 days 00:00:59.933000,0.963656
910095,46.069488,-118.340196,2022-03-09 23:40:11.634000+00:00,0 days 00:01:45.129000,0.970429
910236,46.067072,-118.338252,2022-03-10 02:22:52.372000+00:00,0 days 00:00:18.597000,0.988969
910237,46.067122,-118.338036,2022-03-10 02:23:08+00:00,0 days 00:00:15.628000,0.972877
910238,46.067101,-118.337930,2022-03-10 02:23:22+00:00,0 days 00:00:14,0.968354
...,...,...,...,...,...
1055966,46.072790,-118.328990,2023-04-07 23:21:25.872000+00:00,0 days 00:02:00.018000,0.033161
1055967,46.072790,-118.328990,2023-04-07 23:23:25.886000+00:00,0 days 00:02:00.014000,0.033161
1055968,46.072790,-118.328990,2023-04-07 23:25:25.899000+00:00,0 days 00:02:00.013000,0.033161
1055969,46.072790,-118.328990,2023-04-07 23:27:25.909000+00:00,0 days 00:02:00.010000,0.033161
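The filter above shows the matching rows; to get an actual count, one option is to sum the boolean mask. Because the distance column is in km, `< 1` means within 1 km, and a 1 m cutoff would be `< 0.001`. The sketch below uses a handful of distances, four from the tables above plus one made-up value (0.0004) so the 1 m case has something to find.

```python
import pandas as pd

# Distances in km: four from the tables above, plus one made-up value (0.0004)
distance_km = pd.Series([1300.042288, 0.963656, 0.970429, 0.033161, 0.0004])

# True counts as 1 and False as 0, so summing the mask counts the matches
within_1_km = (distance_km < 1).sum()
within_1_m = (distance_km < 0.001).sum()

print("within 1 km:", within_1_km)
print("within 1 m:", within_1_m)
```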
