This article shows how to process New York City taxi data with Arctern and visualize the results with kepler.gl.

First, load the data:

In [1]:
import pandas as pd

# Column types passed to read_csv
nyc_schema={
    "VendorID":"string",
    "tpep_pickup_datetime":"string",
    "tpep_dropoff_datetime":"string",
    "passenger_count":"int64",
    "trip_distance":"double",
    "pickup_longitude":"double",
    "pickup_latitude":"double",
    "dropoff_longitude":"double",
    "dropoff_latitude":"double",
    "fare_amount":"double",
    "tip_amount":"double",
    "total_amount":"double",
    "buildingid_pickup":"int64",
    "buildingid_dropoff":"int64",
    "buildingtext_pickup":"string",
    "buildingtext_dropoff":"string",
}
nyc_df=pd.read_csv("/tmp/0_2M_nyc_taxi_and_building.csv",
               dtype=nyc_schema,
               date_parser=pd.to_datetime,
               parse_dates=["tpep_pickup_datetime","tpep_dropoff_datetime"])

Display the location of every pick-up point:

In [2]:
import arctern
from keplergl import KeplerGl

pickup_points = arctern.ST_Point(nyc_df.pickup_longitude,nyc_df.pickup_latitude)
KeplerGl(data={"pickup_points": pd.DataFrame(data={'pickup_points':arctern.ST_AsText(pickup_points)})})

KeplerGl(data={'pickup_points':                        pickup_points
0       POINT (-73.993003 40.747594)
1   …

User Guide: https://docs.kepler.gl/docs/keplergl-jupyter


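The resulting map is interactive. It shows that the taxi data contains noise: some pick-up points fall in the water, although every valid point should be on land. These noisy records need to be filtered out.

To analyze taxi trips inside New York City correctly, we filter the data against a map of the city: any point that does not fall inside the map is treated as noise and removed.

First, read the map of New York City. The taxi zones are read from a shapefile, each shape is converted to a GeoJSON string, and the strings are then parsed with Arctern: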

In [3]:
import shapefile
import json
# Read the New York City zone shapes and convert each one to a GeoJSON string
nyc_shape = shapefile.Reader("/tmp/taxi_zones/taxi_zones.shp")
nyc_zone=[ shp.shape.__geo_interface__  for shp in nyc_shape.shapeRecords()]
nyc_zone=[json.dumps(shp) for shp in nyc_zone]
# Parse the GeoJSON strings with Arctern
nyc_zone_series=pd.Series(nyc_zone)
nyc_zone_arctern=arctern.ST_GeomFromGeoJSON(nyc_zone_series)
arctern.ST_AsText(nyc_zone_arctern)

0    MULTIPOLYGON (((-8226155.13045259 4982269.9492...
1    MULTIPOLYGON (((-8243264.85067129 4948597.8364...
2    MULTIPOLYGON (((-8222843.67198779 4950893.7925...
3    MULTIPOLYGON (((-8219461.92460008 4952778.7319...
4    MULTIPOLYGON (((-8238858.86403699 4965915.0243...
dtype: object

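Detect the coordinate reference system of the zone file from its .prj file, then use Arctern to transform the geometries to longitude/latitude coordinates, that is, "EPSG:4326":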

In [4]:
from sridentify import Sridentify
ident = Sridentify()
ident.from_file('/tmp/taxi_zones/taxi_zones.prj')
src_crs = ident.get_epsg()
nyc_arctern_4326 = arctern.ST_Transform(nyc_zone_arctern,f'EPSG:{src_crs}','EPSG:4326')
arctern.ST_AsText(nyc_arctern_4326)

0    MULTIPOLYGON (((-73.8968088322377 40.795808445...
1    MULTIPOLYGON (((-74.0505080640325 40.566422034...
2    MULTIPOLYGON (((-73.8670614947212 40.582087976...
3    MULTIPOLYGON (((-73.8366827410671 40.594946697...
4    MULTIPOLYGON (((-74.0109284126803 40.684491472...
dtype: object
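To make the transform less of a black box, here is a minimal sketch of the underlying math. It assumes the source CRS is spherical Web Mercator (EPSG:3857), which is only a guess based on the magnitude of the projected coordinates printed earlier; the notebook detects the CRS from the .prj file and does not state it. The function name `web_mercator_to_lonlat` is hypothetical and is not part of Arctern.

```python
import math

# Sphere radius used by spherical Web Mercator (EPSG:3857)
EARTH_RADIUS_M = 6378137.0

def web_mercator_to_lonlat(x, y):
    """Convert Web Mercator metres to (longitude, latitude) in degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2.0 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2.0)
    return lon, lat

# First vertex of the first zone as printed before the transform
# (the y value is truncated in the notebook output, so the result is approximate)
lon, lat = web_mercator_to_lonlat(-8226155.13045259, 4982269.9492)
print(lon, lat)
```

If the assumption holds, the printed pair should land close to the first vertex shown after `ST_Transform`.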

Draw the map of New York City from the transformed longitude/latitude coordinates:

In [5]:
KeplerGl(data={"nyc_zones": pd.DataFrame(data={'nyc_zones':arctern.ST_AsText(nyc_arctern_4326)})})

User Guide: https://docs.kepler.gl/docs/keplergl-jupyter


KeplerGl(data={'nyc_zones':                                            nyc_zones
0  MULTIPOLYGON (((-73.896808…

As described above, points that fall outside the city map are treated as noise. We start by filtering the pick-up points against the New York City zone outlines:

In [None]:
# This step can take a while
index_nyc = arctern.sjoin(pickup_points, nyc_arctern_4326, 'within')
is_in_nyc = index_nyc.map(lambda x: x >= 0)
pickup_in_nyc = pickup_points[pd.Series(is_in_nyc)]
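The cell above keeps the points whose join index is non-negative. As an illustration of what a "within" test involves, the following sketch uses plain even-odd ray casting on a single polygon ring. It is not Arctern's implementation, the function `point_in_ring` is hypothetical, and the rectangle is a made-up stand-in for a real taxi zone.

```python
def point_in_ring(lon, lat, ring):
    """Even-odd ray casting: True if (lon, lat) lies inside the closed ring."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Hypothetical rectangle roughly around Manhattan, not a real taxi zone
box = [(-74.03, 40.70), (-73.93, 40.70), (-73.93, 40.88), (-74.03, 40.88)]
points = [(-73.993003, 40.747594), (-73.50, 40.30)]

# Boolean mask in the same spirit as is_in_nyc
mask = [point_in_ring(lon, lat, box) for lon, lat in points]
print(mask)
```

A real spatial join also has to handle multipolygons, holes, and many zones at once, typically with a spatial index, which is one reason the step is slow on large inputs.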

Draw the pick-up points that remain after filtering:

In [None]:
KeplerGl(data={"pickup_points": pd.DataFrame(data={'pickup_points':arctern.ST_AsText(pickup_in_nyc)})})

Filter the drop-off points in the same way:

In [None]:
# This step can take a while
dropoff_points = arctern.ST_Point(nyc_df.dropoff_longitude,nyc_df.dropoff_latitude)
index_nyc = arctern.sjoin(dropoff_points, nyc_arctern_4326, 'within')
is_dropoff_in_nyc = index_nyc.map(lambda x: x >= 0)
dropoff_in_nyc=dropoff_points[is_dropoff_in_nyc]
KeplerGl(data={"drop_points": pd.DataFrame(data={'drop_points':arctern.ST_AsText(dropoff_in_nyc)})})

Using both the pick-up and drop-off results, remove every invalid record from the original data:

In [None]:
# Keep a trip only if both its pick-up and drop-off points are inside the city
is_reasonable = [is_dropoff_in_nyc[idx] & is_in_nyc[idx] for idx in range(0,len(is_in_nyc)) ]
in_nyc_df=nyc_df[pd.Series(is_reasonable)]
in_nyc_df.fare_amount.describe()

After filtering the taxi data against the city outline, many pick-up points still deviate from the roads, and some even fall inside buildings:

In [None]:
import json
from keplergl import KeplerGl
with open("map_config.json", "r") as f:
    config = json.load(f)
KeplerGl(data={"projectioned_point": pd.DataFrame(data={'projectioned_point':arctern.ST_AsText(pickup_in_nyc)})},config=config)

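Records far from any road are also treated as noise (by default, a point more than 100 m from a road counts as far). We remove them by matching the points against the New York City road network. First, load the road network: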

In [None]:
import arctern
nyc_road=pd.read_csv("/tmp/nyc_road.csv", dtype={"roads":"string"}, delimiter='|')
roads=arctern.ST_GeomFromText(nyc_road.roads)

Then filter the pick-up and drop-off points against the road network:

In [None]:
pickup_points = arctern.ST_Point(in_nyc_df.pickup_longitude,in_nyc_df.pickup_latitude)
dropoff_points = arctern.ST_Point(in_nyc_df.dropoff_longitude,in_nyc_df.dropoff_latitude)
is_pickup_near_road = arctern.near_road(roads, pickup_points)
is_dropoff_near_road = arctern.near_road(roads, dropoff_points)
# Keep a trip only if both its pick-up and drop-off points are near a road
is_reasonable = [is_pickup_near_road[idx] & is_dropoff_near_road[idx] for idx in range(0,len(is_dropoff_near_road)) ]
in_nyc_df=in_nyc_df.reset_index()
on_road_nyc_df=in_nyc_df[pd.Series(is_reasonable)]
on_road_nyc_df.head(10)
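The 100 m rule can be sketched in plain Python as a point-to-segment distance check. This is only an illustration under stated assumptions: roads are reduced to straight segments, distances use a local flat-plane approximation with a mean Earth radius, and the helper names (`to_local_xy`, `distance_to_segment_m`, `is_near_road`) and the sample segment are hypothetical. How `arctern.near_road` measures distance internally is not described in this notebook.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius; an assumption for this sketch

def to_local_xy(lon, lat, lon0, lat0):
    """Project degrees to metres on a plane centred at (lon0, lat0)."""
    x = math.radians(lon - lon0) * math.cos(math.radians(lat0)) * EARTH_RADIUS_M
    y = math.radians(lat - lat0) * EARTH_RADIUS_M
    return x, y

def distance_to_segment_m(point, seg_start, seg_end):
    """Approximate distance in metres from a (lon, lat) point to a segment."""
    lon0, lat0 = point
    ax, ay = to_local_xy(seg_start[0], seg_start[1], lon0, lat0)
    bx, by = to_local_xy(seg_end[0], seg_end[1], lon0, lat0)
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(ax, ay)
    # The point is the origin of the local plane; clamp the projection to the segment
    t = max(0.0, min(1.0, -(ax * dx + ay * dy) / seg_len_sq))
    return math.hypot(ax + t * dx, ay + t * dy)

def is_near_road(point, segments, max_distance_m=100.0):
    """True if the point is within max_distance_m of any segment."""
    return any(distance_to_segment_m(point, a, b) <= max_distance_m for a, b in segments)

# Hypothetical east-west road segment along latitude 40.75
road = [((-74.00, 40.75), (-73.98, 40.75))]
near = is_near_road((-73.99, 40.7505), road)   # roughly 56 m north of the road
far = is_near_road((-73.99, 40.7520), road)    # roughly 220 m north of the road
print(near, far)
```

The flat-plane approximation is reasonable at city scale; over longer distances a geodesic calculation would be the safer choice.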

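After removing the records that are far from any road, we snap the remaining pick-up and drop-off points to the nearest road, which produces new pick-up and drop-off points: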

In [None]:
pickup_points = arctern.ST_Point(on_road_nyc_df.pickup_longitude,on_road_nyc_df.pickup_latitude)
projectioned_pickup = arctern.nearest_location_on_road(roads, pickup_points)
dropoff_points = arctern.ST_Point(on_road_nyc_df.dropoff_longitude,on_road_nyc_df.dropoff_latitude)
projectioned_dropoff = arctern.nearest_location_on_road(roads, dropoff_points)
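Snapping a point to a road amounts to finding the closest location on the road geometry. The sketch below shows the idea for straight segments, treating the coordinates as planar, which is a simplification. The functions `snap_to_segment` and `snap_to_roads` are hypothetical and are not Arctern APIs; the coordinates are made up for readability.

```python
def snap_to_segment(point, seg_start, seg_end):
    """Return the closest location to `point` on one segment (planar coordinates)."""
    px, py = point
    ax, ay = seg_start
    bx, by = seg_end
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return (float(ax), float(ay))
    # Fraction along the segment of the orthogonal projection, clamped to [0, 1]
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)

def snap_to_roads(point, segments):
    """Return the closest snapped location over all segments."""
    candidates = [snap_to_segment(point, a, b) for a, b in segments]
    return min(candidates,
               key=lambda c: (c[0] - point[0]) ** 2 + (c[1] - point[1]) ** 2)

# Two hypothetical road segments in an arbitrary planar coordinate system
roads_demo = [((0.0, 0.0), (10.0, 0.0)), ((0.0, 5.0), (0.0, 10.0))]
snapped_a = snap_to_roads((2.5, 1.0), roads_demo)    # lands on the first segment
snapped_b = snap_to_roads((-2.0, 7.0), roads_demo)   # lands on the second segment
print(snapped_a, snapped_b)
```

Real road geometries are polylines, so each road contributes several segments, and a spatial index is normally used to avoid testing every segment for every point.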

Draw the pick-up points after snapping them to the roads:

In [None]:
import json
from keplergl import KeplerGl
with open("map_config.json", "r") as f:
    config = json.load(f)
KeplerGl(data={"projectioned_point": pd.DataFrame(data={'projectioned_point':arctern.ST_AsText(projectioned_pickup)})},config=config)

Draw the drop-off points after snapping them to the roads:

In [None]:
with open("map_config.json", "r") as f:
    config = json.load(f)
KeplerGl(data={"projectioned_point": pd.DataFrame(data={'projectioned_point':arctern.ST_AsText(projectioned_dropoff)})},config=config)

Add the snapped pick-up and drop-off locations to the DataFrame on_road_nyc_df:

In [None]:
on_road_nyc_df.insert(16,'pickup_on_road',projectioned_pickup)
on_road_nyc_df.insert(17,'dropoff_on_road',projectioned_dropoff)
on_road_nyc_df.fare_amount.describe()

Select the trips whose fare is greater than 50 US dollars, and draw their pick-up and drop-off points:

In [None]:
fare_amount_gt_50 = on_road_nyc_df[on_road_nyc_df.fare_amount > 50]
KeplerGl(data={"pickup": pd.DataFrame(data={'pickup':arctern.ST_AsText(fare_amount_gt_50.pickup_on_road)}),
               "dropoff":pd.DataFrame(data={'dropoff':arctern.ST_AsText(fare_amount_gt_50.dropoff_on_road)})
              })

We can also compute the straight-line distance between each pick-up point and its drop-off point:

In [None]:
nyc_distance=arctern.ST_DistanceSphere(on_road_nyc_df.pickup_on_road, on_road_nyc_df.dropoff_on_road)
nyc_distance.index=on_road_nyc_df.index
nyc_distance.describe()
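For intuition about what a spherical distance in metres looks like, here is a minimal haversine sketch. It assumes a mean Earth radius of 6371008.8 m; the radius and formula Arctern actually uses in `ST_DistanceSphere` may differ, and `haversine_m` is a hypothetical helper. The drop-off coordinate in the example is made up.

```python
import math

def haversine_m(lon1, lat1, lon2, lat2, radius_m=6371008.8):
    """Great-circle distance in metres between two lon/lat points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2.0) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2.0) ** 2
    return 2.0 * radius_m * math.asin(math.sqrt(a))

# Pick-up point taken from the notebook output; the drop-off point is hypothetical
trip_m = haversine_m(-73.993003, 40.747594, -73.78, 40.64)
is_long_trip = trip_m > 20e3  # same 20 km threshold used in the notebook
print(trip_m, is_long_trip)
```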

Select the trips whose straight-line distance is greater than 20 km, and draw their pick-up and drop-off points:

In [None]:
nyc_with_distance=pd.DataFrame({"pickup":on_road_nyc_df.pickup_on_road,
                                "dropoff":on_road_nyc_df.dropoff_on_road,
                                "sphere_distance":nyc_distance
                               })

nyc_dist_gt = nyc_with_distance[nyc_with_distance.sphere_distance > 20e3]
KeplerGl(data={"pickup": pd.DataFrame(data={'pickup':arctern.ST_AsText(nyc_dist_gt.pickup)}),
               "dropoff":pd.DataFrame(data={'dropoff':arctern.ST_AsText(nyc_dist_gt.dropoff)})
              })