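# 1. Extracting OD (Origin-Destination) Information from GPS Data

This notebook contains my own study notes. For the original version, see:
[Taxi GPS Data: Fundamentals of Spatio-Temporal Big Data Processing.ipynb](%B3%F6%D7%E2%B3%B5%20GPS%20%CA%FD%BE%DD%A1%AA%A1%AA%CA%B1%BF%D5%B4%F3%CA%FD%BE%DD%B4%A6%C0%ED%BB%F9%B4%A1.ipynb)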


## 1. Import the data

In [1]:
import pandas as pd

In [2]:
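# Read the data.
# Note: the file has no header row, so with the default settings the first
# record is consumed as the column names (see the output below).
data = pd.read_csv("./data/TaxiData-Sample")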
data.head()

Unnamed: 0,22271,22:54:04,114.167,22.718398999999998,0,0.1
0,22271,18:26:26,114.190598,22.6478,0,4
1,22271,18:35:18,114.201401,22.6497,0,0
2,22271,16:02:46,114.233498,22.725901,0,24
3,22271,21:41:17,114.233597,22.7209,0,19
4,22271,15:27:20,114.234001,22.7225,0,0
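The output above shows the first record (22271, 22:54:04, ...) sitting in the header position, so it is lost once the columns are renamed. A minimal sketch of one way to keep it, using an in-memory sample in the same layout instead of the real file (the variable names `raw`, `columns` and `sample` are illustrative only):

```python
import io
import pandas as pd

# Three records in the same layout as the taxi file (no header row).
raw = io.StringIO(
    "22271,22:54:04,114.167,22.718399,0,0\n"
    "22271,18:26:26,114.190598,22.6478,0,4\n"
    "22271,18:35:18,114.201401,22.6497,0,0\n"
)

columns = ['VehicleNum', 'Time', 'Lng', 'Lat', 'OccupancyStatus', 'Speed']

# header=None tells pandas that the first line is data, not column names;
# names= assigns the column names in the same call.
sample = pd.read_csv(raw, header=None, names=columns)

print(len(sample))
print(sample.columns.tolist())
```

Applied to the real file, this would keep one more row than the counts shown in this notebook.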


In [3]:
# Define column names
data.columns = ['VehicleNum', 'Time', 'Lng', 'Lat', 'OccupancyStatus', 'Speed']
# OccupancyStatus: 1 = with passengers, 0 = without passengers
data.head()

Unnamed: 0,VehicleNum,Time,Lng,Lat,OccupancyStatus,Speed
0,22271,18:26:26,114.190598,22.6478,0,4
1,22271,18:35:18,114.201401,22.6497,0,0
2,22271,16:02:46,114.233498,22.725901,0,24
3,22271,21:41:17,114.233597,22.7209,0,19
4,22271,15:27:20,114.234001,22.7225,0,0


In [4]:
# Check data types
data.dtypes

VehicleNum           int64
Time                object
Lng                float64
Lat                float64
OccupancyStatus      int64
Speed                int64
dtype: object

In [5]:
len(data)

1601306

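## 2. Data cleaning
-   **Check for and handle missing values**:

    -   Determine whether the data contain missing values.
    -   Fill or drop them as needed.
-   **Check data types**:

    -   Make sure each column's dtype suits its data (for example, timestamps should be datetime).
-   **Remove duplicate records**:

    -   Check for duplicate rows and drop them.
-   **Handle outliers**:

    -   Check for anomalous or implausible data points and handle them as appropriate.
-   **Standardize formats**:

    -   Make sure formats are consistent, for example a single datetime format.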


In [6]:
# Count missing values per column
missing_values = data.isnull().sum()

# Check data types
data_types = data.dtypes

# Count duplicate rows
duplicate_rows = data.duplicated().sum()

# Show the results
missing_values, data_types, duplicate_rows

(VehicleNum         0
 Time               0
 Lng                0
 Lat                0
 OccupancyStatus    0
 Speed              0
 dtype: int64,
 VehicleNum           int64
 Time                object
 Lng                float64
 Lat                float64
 OccupancyStatus      int64
 Speed                int64
 dtype: object,
 1)
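The check above reports one duplicate row, but none of the following cells removes it. A minimal sketch of how it could be dropped, shown on a small made-up frame rather than the real data (`sample`, `deduped` and `n_removed` are illustrative names):

```python
import pandas as pd

# Made-up sample with one exact duplicate (rows 0 and 2).
sample = pd.DataFrame({
    'VehicleNum': [22271, 22271, 22271],
    'Time': ['18:26:26', '18:35:18', '18:26:26'],
    'Lng': [114.190598, 114.201401, 114.190598],
    'Lat': [22.6478, 22.6497, 22.6478],
    'OccupancyStatus': [0, 0, 0],
    'Speed': [4, 0, 4],
})

n_before = len(sample)
# keep='first' (the default) retains the first occurrence of each duplicated row.
deduped = sample.drop_duplicates(keep='first').reset_index(drop=True)
n_removed = n_before - len(deduped)
print(n_removed)
```

Applying the same call to `data` would change the row counts reported later in this notebook by the number of duplicates removed.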

In [7]:
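# Convert the time strings to datetime
# Define a date string; according to the data source's website, the data were recorded on 2013-10-22
default_date_str = '2013-10-22 '

# Prepend the default date to each time string and parse the result
# (%m is the month; %M would mean minutes)
data['Time'] = pd.to_datetime(default_date_str + data['Time'], format='%Y-%m-%d %H:%M:%S')

# Check the dtype after conversion
data['Time'].dtypes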

dtype('<M8[ns]')

In [8]:
data.dtypes

VehicleNum                  int64
Time               datetime64[ns]
Lng                       float64
Lat                       float64
OccupancyStatus             int64
Speed                       int64
dtype: object

In [9]:
# Check that the spatial extent of the data is plausible

# Define an approximate longitude/latitude bounding box for Shenzhen
shenzhen_lng_range = (113.5, 114.8)
shenzhen_lat_range = (22.3, 22.9)

# Simple filter: keep only records inside the bounding box
data_shenzhen = data[(data['Lng'] >= shenzhen_lng_range[0]) & (data['Lng'] <= shenzhen_lng_range[1]) &
                     (data['Lat'] >= shenzhen_lat_range[0]) & (data['Lat'] <= shenzhen_lat_range[1])]

# Count how many records were filtered out
out_of_range_values_shenzhen = len(data) - len(data_shenzhen)

data_shenzhen.head(), out_of_range_values_shenzhen


(   VehicleNum                Time         Lng        Lat  OccupancyStatus  \
 0       22271 2013-10-22 18:26:26  114.190598  22.647800                0   
 1       22271 2013-10-22 18:35:18  114.201401  22.649700                0   
 2       22271 2013-10-22 16:02:46  114.233498  22.725901                0   
 3       22271 2013-10-22 21:41:17  114.233597  22.720900                0   
 4       22271 2013-10-22 15:27:20  114.234001  22.722500                0   
 
    Speed  
 0      4  
 1      0  
 2     24  
 3     19  
 4      0  ,
 1704)
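The same bounding-box filter can be written more compactly with `Series.between`, which includes both endpoints by default. A small sketch on made-up coordinates (the frame and variable names are illustrative, not part of the original notebook):

```python
import pandas as pd

# Made-up points: one inside, one with bad longitude, one with bad latitude,
# and one exactly on the corner of the box.
sample = pd.DataFrame({
    'Lng': [114.19, 113.20, 114.05, 114.80],
    'Lat': [22.65, 22.60, 23.10, 22.30],
})
lng_range = (113.5, 114.8)
lat_range = (22.3, 22.9)

# between() is inclusive on both ends by default, matching the >= / <= filter.
in_box = sample['Lng'].between(*lng_range) & sample['Lat'].between(*lat_range)

# .copy() makes the filtered frame independent of the original.
inside = sample[in_box].copy()
n_outside = len(sample) - len(inside)
print(n_outside)
```

Taking a copy here means columns can be added to the filtered frame later without pandas warning about writing to a slice.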

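## 3. Extract trips

The goal is to extract every trip of every vehicle, including the longitude and latitude of the origin and destination and the start and end times. Because 'OccupancyStatus' indicates whether the vehicle is carrying passengers (1 = occupied, 0 = empty), this field can be used to identify where trips start and end.

In general, a trip starts at the moment the vehicle changes from empty to occupied, and ends at the moment it changes back from occupied to empty. We therefore take each point where 'OccupancyStatus' changes from 0 to 1 as a trip start, and each point where it changes from 1 to 0 as a trip end.

The procedure is:

1. Sort the data by vehicle number and time.
2. Identify the trip start and end points for each vehicle.
3. Extract the information for each trip: origin and destination coordinates, and start and end times.

The extracted trips are stored in a new data frame with the following columns:

-   'VehicleNum': vehicle number
-   'StartTime': trip start time
-   'EndTime': trip end time
-   'StartLng': longitude of the trip origin
-   'StartLat': latitude of the trip origin
-   'EndLng': longitude of the trip destination
-   'EndLat': latitude of the trip destination
-   'Speed': speed recorded at the trip's end point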

In [10]:
# 1. Sort the data
data_sorted = data_shenzhen.sort_values(by=['VehicleNum', 'Time'])
data_sorted.head()

Unnamed: 0,VehicleNum,Time,Lng,Lat,OccupancyStatus,Speed
38,22271,2013-10-22 00:00:49,114.266502,22.728201,0,0
396,22271,2013-10-22 00:01:48,114.266502,22.728201,0,0
1412,22271,2013-10-22 00:02:47,114.266502,22.728201,0,0
243,22271,2013-10-22 00:03:46,114.266502,22.728201,0,0
246,22271,2013-10-22 00:04:45,114.268898,22.7295,0,11


In [11]:
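# 2. Prepare a new data frame to store the trip information
# Columns: vehicle number, trip start and end times, origin and destination coordinates, and the speed at the trip's end point.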
trips = pd.DataFrame(columns=['VehicleNum', 'StartTime', 'EndTime', 'StartLng', 'StartLat', 'EndLng', 'EndLat', 'Speed'])

In [12]:
# 3. Iterate over each vehicle's data
for vehicle in data_sorted['VehicleNum'].unique():
    # Create a subset, vehicle_data, for each vehicle
    vehicle_data = data_sorted[data_sorted['VehicleNum'] == vehicle]
    
    # Variables that track the trip currently in progress
    trip_start = None
    trip_start_lng = None
    trip_start_lat = None
    previous_status = None

    # 4. Iterate over the records to detect trips
    for i, row in vehicle_data.iterrows():
        if row['OccupancyStatus'] == 1 and previous_status == 0:
            # Trip starts
            trip_start = row['Time']
            trip_start_lng = row['Lng']
            trip_start_lat = row['Lat']
        
        elif row['OccupancyStatus'] == 0 and trip_start is not None:
            # Trip ends: append it to the trips data frame
            trip_data = pd.DataFrame({
                'VehicleNum': [int(vehicle_data.iloc[0]['VehicleNum'])],
                'StartTime': [trip_start], 
                'EndTime': [row['Time']], 
                'StartLng': [trip_start_lng], 
                'StartLat': [trip_start_lat], 
                'EndLng': [row['Lng']], 
                'EndLat': [row['Lat']],
                'Speed': [row['Speed']]
            })
            trips = pd.concat([trips, trip_data], ignore_index=True)
                
            # Reset the tracking variables
            trip_start = None
            trip_start_lng = None
            trip_start_lat = None
    
        # Update previous_status
        previous_status = row['OccupancyStatus']

# Inspect the result
trips.head()


Unnamed: 0,VehicleNum,StartTime,EndTime,StartLng,StartLat,EndLng,EndLat,Speed
0,22334,2013-10-22 00:07:57,2013-10-22 00:18:16,114.080498,22.554182,114.084915,22.54085,0
1,22334,2013-10-22 00:19:05,2013-10-22 00:44:52,114.084915,22.54085,114.056236,22.633383,0
2,22334,2013-10-22 02:38:52,2013-10-22 02:47:04,114.091637,22.5432,114.093536,22.554382,1
3,22334,2013-10-22 03:58:57,2013-10-22 04:23:07,114.038818,22.553232,114.052216,22.602118,1
4,22334,2013-10-22 06:30:19,2013-10-22 06:41:20,114.03125,22.51955,114.067886,22.521299,0


In [13]:
len(trips)

16757

In [14]:
# Further optimization: collect each vehicle's trips in a list and combine them with
# groupby().apply(), which avoids growing a DataFrame through repeated pd.concat calls

# Function that processes one vehicle's data and returns its trips
def extract_trips(vehicle_data):
    trips_list = []
    trip_start = trip_start_lng = trip_start_lat = None
    previous_status = None  # OccupancyStatus of the previous record

    for _, row in vehicle_data.iterrows():
        # Trip start: previous status is 0 and current status is 1
        if row['OccupancyStatus'] == 1 and previous_status == 0:
            trip_start = row['Time']
            trip_start_lng = row['Lng']
            trip_start_lat = row['Lat']
        # Trip end: current status is 0 and a trip is in progress
        elif row['OccupancyStatus'] == 0 and trip_start is not None:
            trips_list.append({
                'VehicleNum': int(vehicle_data.iloc[0]['VehicleNum']), 
                'StartTime': trip_start, 
                'EndTime': row['Time'], 
                'StartLng': trip_start_lng, 
                'StartLat': trip_start_lat, 
                'EndLng': row['Lng'], 
                'EndLat': row['Lat'],
                'Speed': row['Speed']
            })
            trip_start = trip_start_lng = trip_start_lat = None

        # Update the previous status
        previous_status = row['OccupancyStatus']

    return pd.DataFrame(trips_list)


# Process each vehicle's data with groupby() and apply()
trips = data_sorted.groupby('VehicleNum').apply(extract_trips).reset_index(drop=True)

# Show the first few rows
print(trips.head())

   VehicleNum           StartTime             EndTime    StartLng   StartLat  \
0     22334.0 2013-10-22 00:07:57 2013-10-22 00:18:16  114.080498  22.554182   
1     22334.0 2013-10-22 00:19:05 2013-10-22 00:44:52  114.084915  22.540850   
2     22334.0 2013-10-22 02:38:52 2013-10-22 02:47:04  114.091637  22.543200   
3     22334.0 2013-10-22 03:58:57 2013-10-22 04:23:07  114.038818  22.553232   
4     22334.0 2013-10-22 06:30:19 2013-10-22 06:41:20  114.031250  22.519550   

       EndLng     EndLat  Speed  
0  114.084915  22.540850    0.0  
1  114.056236  22.633383    0.0  
2  114.093536  22.554382    1.0  
3  114.052216  22.602118    1.0  
4  114.067886  22.521299    0.0  



In [15]:
trips

Unnamed: 0,VehicleNum,StartTime,EndTime,StartLng,StartLat,EndLng,EndLat,Speed
0,22334.0,2013-10-22 00:07:57,2013-10-22 00:18:16,114.080498,22.554182,114.084915,22.540850,0.0
1,22334.0,2013-10-22 00:19:05,2013-10-22 00:44:52,114.084915,22.540850,114.056236,22.633383,0.0
2,22334.0,2013-10-22 02:38:52,2013-10-22 02:47:04,114.091637,22.543200,114.093536,22.554382,1.0
3,22334.0,2013-10-22 03:58:57,2013-10-22 04:23:07,114.038818,22.553232,114.052216,22.602118,1.0
4,22334.0,2013-10-22 06:30:19,2013-10-22 06:41:20,114.031250,22.519550,114.067886,22.521299,0.0
...,...,...,...,...,...,...,...,...
16752,36805.0,2013-10-22 22:49:12,2013-10-22 22:50:40,114.114365,22.550632,114.115501,22.557983,15.0
16753,36805.0,2013-10-22 22:52:07,2013-10-22 23:03:12,114.115402,22.558083,114.118484,22.547867,0.0
16754,36805.0,2013-10-22 23:03:45,2013-10-22 23:20:09,114.118484,22.547867,114.133286,22.617750,23.0
16755,36805.0,2013-10-22 23:36:19,2013-10-22 23:43:12,114.112968,22.549601,114.089485,22.538918,23.0


In [16]:
len(trips)

16757
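Both versions above still walk through the records one at a time with `iterrows()`. As a hedged alternative, the same start/end rule can be sketched with vectorized operations: compare each record's status with the previous one inside the same vehicle, number the trips, and join starts to ends. The sample data and the names `TripId`, `starts`, `ends` and `od` are made up for illustration, and the sketch has not been checked against the full dataset.

```python
import pandas as pd

# Made-up GPS records for two vehicles.
sample = pd.DataFrame({
    'VehicleNum': [1, 1, 1, 1, 1, 2, 2, 2],
    'Time': pd.to_datetime([
        '2013-10-22 08:00:00', '2013-10-22 08:01:00', '2013-10-22 08:02:00',
        '2013-10-22 08:03:00', '2013-10-22 08:04:00',
        '2013-10-22 09:00:00', '2013-10-22 09:01:00', '2013-10-22 09:02:00',
    ]),
    'Lng': [114.00, 114.01, 114.02, 114.03, 114.04, 114.10, 114.11, 114.12],
    'Lat': [22.50, 22.51, 22.52, 22.53, 22.54, 22.60, 22.61, 22.62],
    'OccupancyStatus': [0, 1, 1, 0, 0, 1, 1, 0],
})

df = sample.sort_values(['VehicleNum', 'Time']).reset_index(drop=True)

# Previous status within the same vehicle (NaN on each vehicle's first record).
prev = df.groupby('VehicleNum')['OccupancyStatus'].shift()

is_start = (df['OccupancyStatus'] == 1) & (prev == 0)  # 0 -> 1 transition
is_end = (df['OccupancyStatus'] == 0) & (prev == 1)    # 1 -> 0 transition

# Number the trips per vehicle: the counter increases at every detected start.
df['TripId'] = is_start.astype(int).groupby(df['VehicleNum']).cumsum()

cols = ['VehicleNum', 'TripId', 'Time', 'Lng', 'Lat']
starts = df.loc[is_start, cols].rename(
    columns={'Time': 'StartTime', 'Lng': 'StartLng', 'Lat': 'StartLat'})
# TripId == 0 means no start was observed yet for this vehicle, so such ends are skipped.
ends = df.loc[is_end & (df['TripId'] > 0), cols].rename(
    columns={'Time': 'EndTime', 'Lng': 'EndLng', 'Lat': 'EndLat'})

# An inner join keeps only trips with both a start and an end.
od = starts.merge(ends, on=['VehicleNum', 'TripId'])
print(od)
```

In this sample, vehicle 2 is already occupied at its first record, so its start is never observed and it contributes no trip, which is the same behavior as the `previous_status == 0` check in the loops above.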

In [17]:
trips['VehicleNum'].dtype

dtype('float64')

In [18]:
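# Note: this only displays the converted values; assign the result back
# (trips['VehicleNum'] = ...) to make the conversion permanent
trips['VehicleNum'].astype(int)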

0        22334
1        22334
2        22334
3        22334
4        22334
         ...  
16752    36805
16753    36805
16754    36805
16755    36805
16756    36805
Name: VehicleNum, Length: 16757, dtype: int32

In [19]:
# Validate the data
# Select vehicle 22396 from trips
trips_22396 = trips[trips['VehicleNum'] == 22396].sort_values("EndTime")
trips_22396

Unnamed: 0,VehicleNum,StartTime,EndTime,StartLng,StartLat,EndLng,EndLat,Speed
43,22396.0,2013-10-22 00:19:41,2013-10-22 00:23:01,114.013016,22.664818,114.0214,22.663918,25.0
44,22396.0,2013-10-22 00:41:51,2013-10-22 00:43:44,114.021767,22.6402,114.02607,22.640266,1.0
45,22396.0,2013-10-22 00:45:44,2013-10-22 00:47:44,114.028099,22.645082,114.03038,22.650017,2.0
46,22396.0,2013-10-22 01:08:26,2013-10-22 01:16:34,114.034897,22.616301,114.035614,22.646717,42.0
47,22396.0,2013-10-22 01:26:06,2013-10-22 01:34:48,114.046021,22.641251,114.066048,22.636183,2.0
48,22396.0,2013-10-22 01:49:28,2013-10-22 01:52:48,114.02858,22.64575,114.03302,22.640667,1.0
49,22396.0,2013-10-22 02:01:28,2013-10-22 02:13:28,114.029617,22.618883,114.028847,22.650999,49.0
50,22396.0,2013-10-22 02:16:48,2013-10-22 02:25:28,114.023102,22.658817,114.0354,22.690767,54.0
51,22396.0,2013-10-22 02:30:08,2013-10-22 02:40:55,114.025146,22.674534,113.954964,22.686899,59.0
52,22396.0,2013-10-22 04:30:13,2013-10-22 04:38:48,113.888084,22.582767,113.85923,22.612034,81.0


In [20]:
trips.head()

Unnamed: 0,VehicleNum,StartTime,EndTime,StartLng,StartLat,EndLng,EndLat,Speed
0,22334.0,2013-10-22 00:07:57,2013-10-22 00:18:16,114.080498,22.554182,114.084915,22.54085,0.0
1,22334.0,2013-10-22 00:19:05,2013-10-22 00:44:52,114.084915,22.54085,114.056236,22.633383,0.0
2,22334.0,2013-10-22 02:38:52,2013-10-22 02:47:04,114.091637,22.5432,114.093536,22.554382,1.0
3,22334.0,2013-10-22 03:58:57,2013-10-22 04:23:07,114.038818,22.553232,114.052216,22.602118,1.0
4,22334.0,2013-10-22 06:30:19,2013-10-22 06:41:20,114.03125,22.51955,114.067886,22.521299,0.0


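## 4. Clean anomalous trips: duration too short, or identical origin and destination

In [21]:
# Keep an unfiltered copy before cleaning
trips2 = trips.copy()
len(trips)

16757

### 4.1 Compute trip duration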

In [22]:
# Compute trip duration
trips['TripTime'] = (trips['EndTime'] - trips['StartTime']).dt.total_seconds() / 60  # minutes
# Keep trips that last at least 1 minute
trips = trips[trips['TripTime'] >= 1]
len(trips)

15963

In [23]:
# Step 2: identical origin and destination
# Keep trips whose origin and destination differ
trips = trips[(trips['StartLng'] != trips['EndLng']) | (trips['StartLat'] != trips['EndLat'])]
print(len(trips))

15878


## 5. Compute trip distance

In [25]:
# Use the geopy library to compute geodesic distances (WGS-84 ellipsoid by default)
# Install: conda install -c conda-forge geopy
# The geodesic function computes the distance
from geopy.distance import geodesic

In [27]:
# Compute the straight-line origin-to-destination distance of each trip with geopy (km)
trips['TripDistance'] = trips.apply(lambda row: geodesic((row['StartLat'], row['StartLng']), (row['EndLat'], row['EndLng'])).km, 
                                       axis=1)

In [28]:
trips['TripDistance']

0         1.544684
1        10.662886
2         1.253581
3         5.586134
4         3.774126
           ...    
16752     0.822377
16753     1.174873
16754     7.886999
16755     2.689677
16756     3.022583
Name: TripDistance, Length: 15878, dtype: float64
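Calling `geodesic` row by row through `apply` can be slow on large tables. If a spherical approximation is acceptable, the haversine formula can be evaluated on whole columns with NumPy. This is a different technique from geopy's ellipsoidal `geodesic`, so the results differ slightly; the function name `haversine_km` and the radius constant are assumptions made for this sketch.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of the spherical model

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# First pair: the coordinates of trip 0 shown earlier in this notebook.
# Second pair: exactly one degree of latitude apart, as a sanity check.
start_lat = np.array([22.554182, 22.0])
start_lng = np.array([114.080498, 114.0])
end_lat = np.array([22.540850, 23.0])
end_lng = np.array([114.084915, 114.0])

dist = haversine_km(start_lat, start_lng, end_lat, end_lng)
print(dist)
```

The function also accepts pandas Series, so it could be called as `haversine_km(trips['StartLat'], trips['StartLng'], trips['EndLat'], trips['EndLng'])`. For trip 0 the spherical result is close to, but not identical with, the 1.544684 km geodesic value above.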

In [29]:
# Save the data
trips.to_csv("./data/TaxiOD-Clean.csv", index=False, header=True)

In [30]:
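# Alternatively, use the Parquet format (requires the pyarrow or fastparquet package):
# Parquet stores datetime columns efficiently and preserves their dtype.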
trips.to_parquet('./data/TaxiOD-Clean.parquet')
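The practical difference between the two formats shows up when the file is read back. CSV stores datetimes as plain text, so the columns must be parsed again on load. A minimal sketch of that round trip, using an in-memory buffer and a one-row made-up frame instead of the real output file:

```python
import io
import pandas as pd

# One made-up trip with datetime columns.
sample = pd.DataFrame({
    'VehicleNum': [22334],
    'StartTime': pd.to_datetime(['2013-10-22 00:07:57']),
    'EndTime': pd.to_datetime(['2013-10-22 00:18:16']),
})

# Write to an in-memory CSV buffer (stands in for a file on disk).
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Without parse_dates the time columns come back as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# With parse_dates they are converted back to datetime.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=['StartTime', 'EndTime'])

print(plain['StartTime'].dtype)
print(parsed['StartTime'].dtype)
```

For the real CSV file, the same `parse_dates` argument would be passed to `pd.read_csv("./data/TaxiOD-Clean.csv", ...)`.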