This article shows how to use Arctern to process New York City taxi data and Kepler.gl to visualize the results.

First, load the data:

In [2]:
import pandas as pd

# Column types for the taxi CSV file
nyc_schema={
    "VendorID":"string",
    "tpep_pickup_datetime":"string",
    "tpep_dropoff_datetime":"string",
    "passenger_count":"int64",
    "trip_distance":"double",
    "pickup_longitude":"double",
    "pickup_latitude":"double",
    "dropoff_longitude":"double",
    "dropoff_latitude":"double",
    "fare_amount":"double",
    "tip_amount":"double",
    "total_amount":"double",
    "buildingid_pickup":"int64",
    "buildingid_dropoff":"int64",
    "buildingtext_pickup":"string",
    "buildingtext_dropoff":"string",
}
# Read the CSV file and parse the pickup and drop-off timestamps
nyc_df=pd.read_csv("/tmp/0_2M_nyc_taxi_and_building.csv",
               dtype=nyc_schema,
               date_parser=pd.to_datetime,
               parse_dates=["tpep_pickup_datetime","tpep_dropoff_datetime"])

Plot the locations of all pickup points:

In [5]:
import arctern
from keplergl import KeplerGl

pickup_points = arctern.ST_Point(nyc_df.pickup_longitude,nyc_df.pickup_latitude)
KeplerGl(data={"pickup_points": pd.DataFrame(data={'pickup_points':arctern.ST_AsText(pickup_points)})})

User Guide: https://docs.kepler.gl/docs/keplergl-jupyter


KeplerGl(data={'pickup_points':                        pickup_points
0       POINT (-73.993003 40.747594)
1   …

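The returned map is interactive. It reveals noise in the taxi data: some pickup points fall in the ocean, whereas every valid point should lie on land. These noisy records need to be filtered out.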

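To analyze taxi trips within New York City correctly, we filter the data against a map of the city: any point that falls outside the map is treated as noise and removed.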

First, read the New York City zone map. The code below reads it from a shapefile, converts each shape to a GeoJSON string, and then parses the GeoJSON with Arctern:

In [3]:
import shapefile
import json
# Read the New York City zone map
nyc_shape = shapefile.Reader("/tmp/taxi_zones/taxi_zones.shp")
nyc_zone=[ shp.shape.__geo_interface__  for shp in nyc_shape.shapeRecords()]
nyc_zone=[json.dumps(shp) for shp in nyc_zone]
# Parse the data with Arctern
nyc_zone_series=pd.Series(nyc_zone)
nyc_zone_arctern=arctern.ST_GeomFromGeoJSON(nyc_zone_series)
arctern.ST_AsText(nyc_zone_arctern)

0    MULTIPOLYGON (((-8226155.13045259 4982269.9492...
1    MULTIPOLYGON (((-8243264.85067129 4948597.8364...
2    MULTIPOLYGON (((-8222843.67198779 4950893.7925...
3    MULTIPOLYGON (((-8219461.92460008 4952778.7319...
4    MULTIPOLYGON (((-8238858.86403699 4965915.0243...
dtype: object
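
To make the conversion step concrete, here is a minimal sketch of what a `__geo_interface__`-style mapping looks like and how `json.dumps` turns it into a GeoJSON string. The square below is invented illustration data, not a real taxi zone, and the variable names are hypothetical.

```python
import json

# Hypothetical geometry shaped like a __geo_interface__ mapping.
# The coordinates are made up for illustration only.
zone = {
    "type": "MultiPolygon",
    "coordinates": [[[[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]]],
}

# Serialize the mapping to a GeoJSON string, the form a GeoJSON parser expects.
zone_geojson = json.dumps(zone)

# Parse it back to confirm the string carries the same geometry.
round_tripped = json.loads(zone_geojson)
```

Each string produced this way describes one geometry, which is why the notebook builds a list of strings before wrapping it in a `pd.Series`.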

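Determine the coordinate reference system of the zone data file, then use Arctern to convert the geometries to longitude/latitude coordinates, i.e. "EPSG:4326":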

In [3]:
from sridentify import Sridentify
ident = Sridentify()
ident.from_file('/tmp/taxi_zones/taxi_zones.prj')
src_crs = ident.get_epsg()
nyc_arctern_4326 = arctern.ST_Transform(nyc_zone_arctern,f'EPSG:{src_crs}','EPSG:4326')
arctern.ST_AsText(nyc_arctern_4326)

The New York City zone map, drawn from the converted longitude/latitude coordinates:

In [5]:
KeplerGl(data={"nyc_zones": pd.DataFrame(data={'nyc_zones':arctern.ST_AsText(nyc_arctern_4326)})})

KeplerGl(data={'nyc_zones':                                            nyc_zones
0  MULTIPOLYGON (((-73.896808…
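
As a rough illustration of what a coordinate transformation to EPSG:4326 does, the sketch below inverts the spherical Web Mercator projection (EPSG:3857) in plain Python. That the source data uses EPSG:3857 is an assumption: the magnitude of the raw coordinates printed earlier suggests it, but the notebook reads the actual code from the `.prj` file and never prints it. The function names are hypothetical, and this is not how Arctern implements `ST_Transform`.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator

def web_mercator_to_lonlat(x, y):
    """Convert Web Mercator metres to (longitude, latitude) in degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2.0 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2.0)
    return lon, lat

def lonlat_to_web_mercator(lon, lat):
    """Forward projection, included to check the inverse by round trip."""
    x = EARTH_RADIUS_M * math.radians(lon)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4.0 + math.radians(lat) / 2.0))
    return x, y

# First vertex of the first zone as printed earlier (the y value is truncated there).
lon, lat = web_mercator_to_lonlat(-8226155.13045259, 4982269.9492)
x_back, y_back = lonlat_to_web_mercator(lon, lat)
```

Under that assumption, the resulting longitude lands close to the first longitude shown in the transformed output above.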

With the zone map in hand, any point that lies outside it is considered noise. We start by filtering the pickup points against the outline of the city:

In [6]:
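# This step is relatively slow.
# Merge all zones into a single geometry, then simplify its outline.
nyc_arctern_one = arctern.ST_Union_Aggr(nyc_arctern_4326)
nyc_arctern_one = arctern.ST_SimplifyPreserveTopology(nyc_arctern_one,0.005)
# Keep only the pickup points that lie within the city outline.
is_in_nyc = arctern.ST_Within(pickup_points,nyc_arctern_one[0])
pickup_in_nyc = pickup_points[pd.Series(is_in_nyc)]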
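
The idea behind this kind of filter can be sketched with a plain-Python ray-casting test. This is an illustration of the general point-in-polygon technique, not Arctern's implementation of `ST_Within`; the function name, the square "city outline", and the sample points are all hypothetical.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the polygon ring.

    ring is a list of (x, y) vertices. Points exactly on the boundary
    may be classified either way by this simple version.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            # Count only crossings to the right of the point
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical outline and points, standing in for the city and the pickups.
square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
points = [(1.0, 1.0), (5.0, 1.0), (2.0, 3.5), (-1.0, 2.0)]

# Boolean mask, analogous to is_in_nyc, and the points that survive it.
mask = [point_in_polygon(px, py, square) for px, py in points]
kept = [p for p, keep in zip(points, mask) if keep]
```

The mask plays the same role as `is_in_nyc`: one boolean per point, used afterwards to select the rows to keep.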

Plot the pickup points that remain after filtering:

In [7]:
KeplerGl(data={"pickup_points": pd.DataFrame(data={'pickup_points':arctern.ST_AsText(pickup_in_nyc)})})

KeplerGl(data={'pickup_points':                        pickup_points
0       POINT (-73.993003 40.747594)
1   …

Filter the drop-off points in the same way:

In [8]:
# This step is relatively slow.
dropoff_points = arctern.ST_Point(nyc_df.dropoff_longitude,nyc_df.dropoff_latitude)
is_dorpoff_in_nyc = arctern.ST_Within(dropoff_points,nyc_arctern_one[0])
dropoff_in_nyc=dropoff_points[is_dorpoff_in_nyc]
KeplerGl(data={"drop_points": pd.DataFrame(data={'drop_points':arctern.ST_AsText(dropoff_in_nyc)})})

KeplerGl(data={'drop_points':                          drop_points
0       POINT (-73.983609 40.760426)
1     …

Using the pickup and drop-off masks, remove every invalid record from the original data:

In [9]:
# A trip is kept only if both its pickup and its drop-off point lie within the city.
is_reasonable = [is_dorpoff_in_nyc[idx] & is_in_nyc[idx] for idx in range(0,len(is_in_nyc)) ]
in_nyc_df=nyc_df[pd.Series(is_reasonable)]
in_nyc_df.fare_amount.describe()

count    192701.000000
mean          9.733092
std           7.080015
min           2.500000
25%           5.700000
50%           7.700000
75%          11.000000
max         175.000000
Name: fare_amount, dtype: float64

Next, select the trips whose fare exceeds 50 dollars and plot their pickup and drop-off points:

In [10]:
fare_amount_gt_50 = in_nyc_df[in_nyc_df.fare_amount > 50]
pickup_50 = arctern.ST_Point(fare_amount_gt_50.pickup_longitude,fare_amount_gt_50.pickup_latitude)
dropoff_50 = arctern.ST_Point(fare_amount_gt_50.dropoff_longitude,fare_amount_gt_50.dropoff_latitude)
KeplerGl(data={"pickup": pd.DataFrame(data={'pickup':arctern.ST_AsText(pickup_50)}),
               "dropoff":pd.DataFrame(data={'dropoff':arctern.ST_AsText(dropoff_50)})
              })

KeplerGl(data={'pickup':                            pickup
0    POINT (-73.983795 40.737956)
1    POINT (-73.7…

We can also compute the straight-line distance between each pickup point and its drop-off point:

In [11]:
nyc_distance=arctern.ST_DistanceSphere(arctern.ST_Point(in_nyc_df.pickup_longitude,
                                                        in_nyc_df.pickup_latitude),
                                       arctern.ST_Point(in_nyc_df.dropoff_longitude,
                                                        in_nyc_df.dropoff_latitude))
nyc_distance.index=in_nyc_df.index
nyc_distance.describe()

count    192701.000000
mean       3134.377530
std        3287.693775
min           0.000000
25%        1224.284192
50%        2086.800052
75%        3744.355960
max       35395.487197
dtype: float64
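
For intuition about a distance measured on a sphere, here is a minimal haversine sketch in plain Python. It is an illustration only: the function name is hypothetical, the mean earth radius of 6371000 m is an assumption, and Arctern's `ST_DistanceSphere` may use a different radius or formula.

```python
import math

def haversine_m(lon1, lat1, lon2, lat2, radius_m=6371000.0):
    """Great-circle distance in metres between two lon/lat points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2.0) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2.0) ** 2
    return 2.0 * radius_m * math.asin(math.sqrt(a))

# Sanity checks: identical points, and one degree of latitude along a meridian.
same_point = haversine_m(-73.993003, 40.747594, -73.993003, 40.747594)
one_degree = haversine_m(0.0, 0.0, 0.0, 1.0)

# The first pickup and drop-off points shown in the outputs above.
first_rows = haversine_m(-73.993003, 40.747594, -73.983609, 40.760426)
```

The result is in metres, the same unit as the statistics above, which is why the next step can compare it against `20e3` for 20 km.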

Select the trips whose straight-line distance exceeds 20 km, and plot their pickup and drop-off points:

In [12]:
nyc_with_distance=pd.DataFrame({"pickup_longitude":in_nyc_df.pickup_longitude,
                                "pickup_latitude":in_nyc_df.pickup_latitude,
                                "dropoff_longitude":in_nyc_df.dropoff_longitude,
                                "dropoff_latitude":in_nyc_df.dropoff_latitude,
                                "sphere_distance":nyc_distance
                               })

nyc_dist_gt = nyc_with_distance[nyc_with_distance.sphere_distance > 20e3]
pickup_gt = arctern.ST_Point(nyc_dist_gt.pickup_longitude,nyc_dist_gt.pickup_latitude)
dropoff_gt = arctern.ST_Point(nyc_dist_gt.dropoff_longitude,nyc_dist_gt.dropoff_latitude)

KeplerGl(data={"pickup": pd.DataFrame(data={'pickup':arctern.ST_AsText(pickup_gt)}),
               "dropoff":pd.DataFrame(data={'dropoff':arctern.ST_AsText(dropoff_gt)})
              })

KeplerGl(data={'pickup':                             pickup
0     POINT (-73.781487 40.644855)
1     POINT (-7…

In [3]:
import arctern
from keplergl import KeplerGl
import json

pickup_points = arctern.ST_Point(nyc_df.pickup_longitude,nyc_df.pickup_latitude)
# Load a saved Kepler.gl map configuration and apply it to the map
with open("/home/xge/xge-zilliz/arctern-bootcamp/nytaxi/nyc.json", "r") as f:
    config = json.load(f)
KeplerGl(data={"pickup_points": pd.DataFrame(data={'pickup_points':arctern.ST_AsText(pickup_points)})},config=config)

KeplerGl(config={'version': 'v1', 'config': {'visState': {'filters': [], 'layers': [{'id': '0cbr95h', 'type': …

In [4]:
import pandas as pd

road_df=pd.read_csv("/tmp/nyc_road.csv", dtype={"roads":"string"}, delimiter='|')
# road_df.head(20)


In [5]:
print(type(road_df.roads))

<class 'pandas.core.series.Series'>


In [None]:
import arctern
# For each pickup point, find the nearest location on the road network
projectioned_point = arctern.nearest_location_on_road(arctern.ST_GeomFromText(road_df.roads), pickup_points)
projectioned_point.head(10)
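
Snapping a point to a road can be pictured as projecting it onto each road segment and keeping the closest result. The sketch below does this in plain Python with planar geometry, which is only an approximation for longitude/latitude data. It is not Arctern's implementation of `nearest_location_on_road`; the function names and the sample road are hypothetical.

```python
def nearest_point_on_segment(px, py, ax, ay, bx, by):
    """Closest point to (px, py) on the segment from (ax, ay) to (bx, by)."""
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return ax, ay  # degenerate segment: both ends coincide
    # Projection parameter along the segment, clamped to its endpoints
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return ax + t * dx, ay + t * dy

def nearest_point_on_polyline(px, py, polyline):
    """Closest point to (px, py) over all segments of a polyline."""
    best, best_d2 = None, float("inf")
    for (ax, ay), (bx, by) in zip(polyline, polyline[1:]):
        qx, qy = nearest_point_on_segment(px, py, ax, ay, bx, by)
        d2 = (px - qx) ** 2 + (py - qy) ** 2
        if d2 < best_d2:
            best, best_d2 = (qx, qy), d2
    return best

# Hypothetical L-shaped road and two points to snap onto it.
road = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0)]
snapped_near_first_leg = nearest_point_on_polyline(1.0, 0.5, road)
snapped_past_the_end = nearest_point_on_polyline(3.0, 3.0, road)
```

Clamping `t` to the range 0 to 1 keeps the result on the segment itself, so a point beyond the end of the road snaps to the road's endpoint.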

In [None]:
with open("/home/xge/xge-zilliz/arctern-bootcamp/nytaxi/nyc.json", "r") as f:
    config = json.load(f)
KeplerGl(data={"projectioned_point": pd.DataFrame(data={'projectioned_point':arctern.ST_AsText(projectioned_point)})},config=config)