# Identifying BSC Customers

**Note: DFI Queries Will Not Work**

**The Data Flow Index server used for this workshop is no longer running.  The workshop materials are left up _as is_ but queries will not run.  If you would like to trial the Data Flow Index please reach out to General System at [https://www.generalsystem.com/contact-us](https://www.generalsystem.com/contact-us).**

This notebook is set up to run for a single BSC (Blank Street Coffee) location, but it can be amended to cover every location, and therefore all BSC customers. To do so, remove the `break` statement in the DFI query loop.


In [None]:
import json
import shutil
from getpass import getpass
from pathlib import Path
from typing import Set

import altair as alt
import geopandas as gpd
import pandas as pd
import urllib3
from dfi import Client
from tqdm.notebook import tqdm

alt.data_transformers.disable_max_rows()

## 0. Utility Functions

In [None]:
def load_location_data(filename: str, url: str) -> gpd.GeoDataFrame:
    """Download the file at `url`, save it to `filename`, and return it as a GeoDataFrame.

    e.g. url = "https://d3ftlhu7xfb8rb.cloudfront.net/blank_street_coffees_callsigns.geoparquet"
    """
    Path(filename).parent.mkdir(parents=True, exist_ok=True)
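    http = urllib3.PoolManager()
    with open(filename, "wb") as out:
        # Stream the response body straight to disk instead of loading it into memory
        r = http.request("GET", url, preload_content=False)
        shutil.copyfileobj(r, out)
        r.release_conn()  # return the connection to the pool once the copy is done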

    return gpd.read_parquet(filename)


def unpack_payload(df: pd.DataFrame) -> pd.DataFrame:
    """Expand selected fields of the JSON `payload` column into their own columns."""
    # Filter out any problem payloads; copy so the assignments below do not write to a view
    df = df[df["payload"].apply(lambda x: isinstance(x, str))].copy()
    payloads = df["payload"].apply(json.loads)  # decode each payload once
    for key in ("route", "transportation_mode", "start_location_id", "end_location_id"):
        df[key] = payloads.apply(lambda p, key=key: p[key])

    return df
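
The payload unpacking can be tried without a DFI connection. The following is a minimal sketch on made-up records: the field names match the ones read by `unpack_payload`, but the values and the `toy_df` frame are assumptions for illustration only. It decodes each JSON string once and joins the decoded fields back onto the rows that had a valid payload.

```python
import json

import pandas as pd

# Toy records (hypothetical values); the payload keys mirror those read by unpack_payload.
toy_df = pd.DataFrame(
    {
        "entity_id": ["a", "b", "c"],
        "payload": [
            json.dumps({"route": 1, "transportation_mode": "dwelling",
                        "start_location_id": 10, "end_location_id": 10}),
            None,  # a problem payload that should be dropped
            json.dumps({"route": 2, "transportation_mode": "walking",
                        "start_location_id": 10, "end_location_id": 11}),
        ],
    }
)

# Keep only string payloads, then decode each JSON string exactly once.
valid_df = toy_df[toy_df["payload"].apply(lambda x: isinstance(x, str))]
parsed = pd.DataFrame(valid_df["payload"].apply(json.loads).tolist(), index=valid_df.index)

# Attach the decoded fields as columns alongside the original ones.
unpacked_df = valid_df.join(parsed)
print(unpacked_df[["entity_id", "route", "transportation_mode"]])
```

Building `parsed` with the same index as `valid_df` keeps the decoded fields aligned with their rows after the invalid payload is dropped.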

## 1. Load OSM & BSC Location Datasets

In [None]:
# Load in OSM building data
osm_gdf = load_location_data("osm_gdf", "https://d3ftlhu7xfb8rb.cloudfront.net/london_nyc_osm.geoparquet")
osm_ids = osm_gdf["osm_id"]

# Load in Blank Street Coffee Location dataset
bsc_gdf = load_location_data(
    "bsc_gdf", "https://d3ftlhu7xfb8rb.cloudfront.net/blank_street_coffee_callsigns.geoparquet"
)
bsc_osm_ids = bsc_gdf["osm_id"]

## 2. Identifying BSC Customers

We say an entity is a customer of BSC if it has dwelled at one or more BSC cafes.  To identify the customers in the dataset, we query each BSC building polygon for the records inside it and collect the unique entity IDs.  Because we only want the entities that dwelled at a location, not the ones that merely passed by, we then pull all the records for each entity and identify their dwells.  Since the data was synthetically generated, each record is labelled as `dwelling`, `walking`, `cycling`, or `driving`, so once we have queried for the entities' records we simply filter for those with the `dwelling` label.
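
As a rough illustration of that filter, here is a minimal sketch on a toy table. The column names follow the ones used in this notebook, while the rows and `toy_bsc_ids` are invented for the example. An entity counts as a customer only if it has a `dwelling` record that starts at a BSC location.

```python
import pandas as pd

# Toy records (hypothetical): each row is one labelled record for an entity.
toy_records = pd.DataFrame(
    {
        "entity_id": ["a", "a", "b", "c", "c"],
        "start_location_id": [10, 10, 10, 99, 10],
        "transportation_mode": ["dwelling", "walking", "walking", "dwelling", "cycling"],
    }
)
toy_bsc_ids = [10]  # hypothetical BSC building IDs

# Build each condition separately: `&` binds more tightly than `==`,
# so combining them inline requires parentheses around the comparison.
at_bsc = toy_records["start_location_id"].isin(toy_bsc_ids)
is_dwelling = toy_records["transportation_mode"] == "dwelling"

toy_customers = toy_records.loc[at_bsc & is_dwelling, "entity_id"].unique()
print(list(toy_customers))  # only "a" dwelled at a BSC location
```

Entity `b` only walked past the shop and entity `c` dwelled somewhere else, so neither is counted.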

In [None]:
# Initialise DFI
token = getpass("Enter your API access token: ")
instance = "sdsc-2-2088"  # sdsc-1-5148
namespace = "gs"
url = "https://api.prod.generalsystem.com"

dfi = Client(token, instance, namespace, url, progress_bar=True)

In [None]:
# Get a list of devices which have pings inside bsc locations
bsc_entities: Set[str] = set()
for _, row in tqdm(bsc_gdf.iterrows(), total=len(bsc_gdf)):
    entities = dfi.get.entities(
        polygon=list(row.geometry.exterior.coords),
    )
    bsc_entities = bsc_entities.union(entities)
    break  # Remove or comment out to run for all BSC locations

bsc_entities = list(bsc_entities)

print(f"There are {len(bsc_entities)} unique devices with data inside Blank Street Coffee: {row.callsign}")
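
Because the DFI server is offline, the loop above cannot be run as is. The sketch below mimics its accumulation logic with a stub: `FAKE_INDEX`, `fake_get_entities`, and `collect_entities` are hypothetical names invented here and are not part of the `dfi` client. It shows how the `break` limits the result to the first location, and how removing it unions the entities across all of them.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for a polygon query: maps a location name to the
# entity IDs "seen" inside it. None of these values come from the workshop data.
FAKE_INDEX: Dict[str, List[str]] = {
    "shop_a": ["e1", "e2"],
    "shop_b": ["e2", "e3"],
}


def fake_get_entities(location: str) -> List[str]:
    """Return the entity IDs recorded inside a location (stub, not a DFI call)."""
    return FAKE_INDEX[location]


def collect_entities(locations: List[str], first_only: bool = True) -> Set[str]:
    """Union the entity IDs across locations, optionally stopping after the first."""
    seen: Set[str] = set()
    for location in locations:
        seen.update(fake_get_entities(location))
        if first_only:
            break  # same effect as the `break` in the notebook loop
    return seen


print(sorted(collect_entities(["shop_a", "shop_b"])))                    # ['e1', 'e2']
print(sorted(collect_entities(["shop_a", "shop_b"], first_only=False)))  # ['e1', 'e2', 'e3']
```

Using a set means an entity seen at several locations, such as `e2` here, is counted once.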

In [None]:
# Get the records for a single device (the second in the list);
# pass `bsc_entities` instead to fetch the records for every device
records_df = dfi.get.records([bsc_entities[1]], add_payload_as_json=True)
records_df = unpack_payload(records_df)

records_df

In [None]:
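# Keep only dwell records that start at a BSC location.
# The comparison needs its own parentheses because `&` binds more tightly than `==`.
records_df = records_df[
    records_df["start_location_id"].isin(bsc_osm_ids) & (records_df["transportation_mode"] == "dwelling")
]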
customers = records_df["entity_id"].unique()

customers

## 3. Profiling Customers

With the customers identified, we can begin the analysis. This section provides a brief overview of the customer data.

In [None]:
agg_df = (
    records_df[records_df["transportation_mode"] == "dwelling"]
    .rename(columns={"entity_id": "customer_id"})
    .groupby(by=["customer_id", "route"], as_index=False)
    .agg(
        start_time=("timestamp", "min"),
        end_time=("timestamp", "max"),
        location_id=("start_location_id", "first"),
    )
)
agg_df
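
The aggregation above can be reproduced on toy data to see what one output row represents. This is a sketch under assumptions: the rows are invented, and the `duration` column is an extra derived field that the notebook itself does not compute.

```python
import pandas as pd

# Toy dwell records (hypothetical): two records on route 1 for customer "a",
# plus single-record dwells for customer "a" (route 2) and customer "b" (route 7).
toy_dwells = pd.DataFrame(
    {
        "customer_id": ["a", "a", "a", "b"],
        "route": [1, 1, 2, 7],
        "timestamp": pd.to_datetime(
            ["2023-01-01 09:00", "2023-01-01 09:20", "2023-01-02 08:00", "2023-01-01 12:00"]
        ),
        "start_location_id": [10, 10, 11, 10],
    }
)

# One row per (customer, route): first and last timestamp plus the location.
toy_visits = toy_dwells.groupby(["customer_id", "route"], as_index=False).agg(
    start_time=("timestamp", "min"),
    end_time=("timestamp", "max"),
    location_id=("start_location_id", "first"),
)

# Derived for illustration only: time between the first and last record of the dwell.
toy_visits["duration"] = toy_visits["end_time"] - toy_visits["start_time"]
print(toy_visits)
```

A dwell made up of a single record has identical start and end times, so its derived duration is zero.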

### How many customers visited BSC locations?

In [None]:
# `entity_id` was renamed to `customer_id` when building agg_df
agg_df["customer_id"].nunique()

### How many customers visited the same BSC shop more than once?

In [None]:
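bsc_dwells_df = agg_df[agg_df["location_id"].isin(bsc_osm_ids)]
# A repeat customer has more than one dwell at the same (customer, location) pair
repeat_customers = bsc_dwells_df[bsc_dwells_df.duplicated(subset=["customer_id", "location_id"], keep=False)]
repeat_customers["customer_id"].nunique()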

### How many customers visited multiple different BSC locations?

In [None]:
customer_location_counts = bsc_dwells_df.groupby("customer_id")["location_id"].nunique()
(customer_location_counts > 1).sum()
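
The last two questions are easy to conflate, so here is a small sketch on invented visits that separates them. `toy_visits` and its values are assumptions for illustration. A customer who returns to the same shop is counted by the first measure, while a customer who visits two different shops once each is counted only by the second.

```python
import pandas as pd

# Toy visits (hypothetical): one row per dwell at a BSC location.
toy_visits = pd.DataFrame(
    {
        "customer_id": ["a", "a", "b", "b", "c"],
        "location_id": [10, 10, 10, 11, 10],
    }
)

# Repeat visits to the same shop: more than one row for a (customer, location) pair.
visits_per_shop = toy_visits.groupby(["customer_id", "location_id"]).size()
same_shop_repeaters = (
    visits_per_shop[visits_per_shop > 1].index.get_level_values("customer_id").unique()
)

# Visits to several shops: more than one distinct location per customer.
shops_per_customer = toy_visits.groupby("customer_id")["location_id"].nunique()
multi_shop_customers = shops_per_customer[shops_per_customer > 1].index

print(list(same_shop_repeaters))   # customer "a" visited shop 10 twice
print(list(multi_shop_customers))  # customer "b" visited shops 10 and 11
```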