# Share of first-time stay points and of stay points visited only once

Both statistics are computed following the definitions in *Understanding predictability and exploration in human mobility*.

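## 1. New locations (Exploration)

For **each user**, take the location sequence sorted by time:

- If the current location **has never appeared in that user's history**,
- then that visit is counted as an **exploration**.

Example (from the paper):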
```text
A B A C B C  
→ 1 1 0 1 0 0
```

Statistic:

$$
\text{Exploration Ratio}
= \frac{\#\text{exploration visits}}{\#\text{total visits}}
$$
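
As a minimal sketch, the flagging rule and the ratio can be checked on the example sequence above. The helper name `exploration_flags` is hypothetical and chosen only for this illustration.

```python
def exploration_flags(locations):
    """Return 1 for a first-time visit and 0 for a return visit."""
    seen = set()
    flags = []
    for loc in locations:
        # A location is an exploration only the first time it appears.
        flags.append(0 if loc in seen else 1)
        seen.add(loc)
    return flags


sequence = ["A", "B", "A", "C", "B", "C"]
flags = exploration_flags(sequence)
exploration_ratio = sum(flags) / len(flags)

print(flags)              # [1, 1, 0, 1, 0, 0]
print(exploration_ratio)  # 0.5
```

Three of the six visits are first-time visits, so the exploration ratio for this sequence is 0.5.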

---

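## 2. Locations visited only once

For **each user**:

- Collect all unique locations the user visited
- Count the **locations with exactly 1 visit**
- Divide by the **total number of unique locations**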

$$
\text{Once-Visited Location Ratio}
= \frac{\#\text{locations visited once}}{\#\text{unique locations}}
$$
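
A small sketch of this ratio on toy sequences, assuming a non-empty visit list. The helper name `once_visited_ratio` and the first sequence are made up for illustration and do not come from the paper or the dataset.

```python
from collections import Counter


def once_visited_ratio(locations):
    """Fraction of unique locations that were visited exactly once."""
    counts = Counter(locations)
    once = sum(1 for c in counts.values() if c == 1)
    return once / len(counts)


# A is visited twice; B, C and D once each: 3 of 4 unique locations.
print(once_visited_ratio(["A", "B", "A", "C", "D"]))       # 0.75

# Every location is visited twice, so no location counts.
print(once_visited_ratio(["A", "B", "A", "C", "B", "C"]))  # 0.0
```

Note that the denominator is the number of unique locations, not the number of visits, so this ratio and the exploration ratio are not directly comparable.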


In [1]:
import pandas as pd

# 1. Load the data
df = pd.read_csv(
    "./Data/Output/all_users_context_combined.csv",
    parse_dates=["stime"]
)

# 2. Sort by user and time (the paper's definitions apply to a time-ordered sequence)
df = df.sort_values(["userID", "stime"])

def compute_user_stats(user_df):
    """
    Compute, for a single user:
    1) the exploration ratio
    2) the ratio of locations visited only once
    """
    locations = user_df["grid"].tolist()

    # ---------- Exploration ----------
    seen = set()
    exploration_flags = []

    for loc in locations:
        if loc in seen:
            exploration_flags.append(0)
        else:
            exploration_flags.append(1)
            seen.add(loc)

    exploration_ratio = sum(exploration_flags) / len(exploration_flags)

    # ---------- Visited only once ----------
    loc_counts = pd.Series(locations).value_counts()
    once_visited_ratio = (loc_counts == 1).sum() / len(loc_counts)

    return pd.Series({
        "exploration_ratio": exploration_ratio,
        "once_visited_location_ratio": once_visited_ratio,
        "num_visits": len(locations),
        "num_unique_locations": len(loc_counts)
    })

# 3. Compute the statistics per user
geolife_user_stats = (
    df.groupby("userID", group_keys=False)
      .apply(compute_user_stats)
      .reset_index()
)





In [2]:
print(geolife_user_stats.head())
geolife_user_stats.to_csv('./Data/Output/GeoLife_ExplorationRatio_OnceVisitedLocationRatio.csv')

   userID  exploration_ratio  once_visited_location_ratio  num_visits  \
0       0           0.104208                     0.480769       499.0   
1       1           0.204380                     0.535714       137.0   
2       2           0.129032                     0.522727       341.0   
3       3           0.090909                     0.531250      1056.0   
4       4           0.066510                     0.447059      1278.0   

   num_unique_locations  
0                  52.0  
1                  28.0  
2                  44.0  
3                  96.0  
4                  85.0  


In [None]:
import pandas as pd

# 1. Load the data
df = pd.read_csv(
    "./Data/MoreUser/all.csv",
    parse_dates=["stime"]
)

# 2. Sort by user and time (the paper's definitions apply to a time-ordered sequence)
df = df.sort_values(["userID", "stime"])

def compute_user_stats(user_df):
    """
    Compute, for a single user:
    1) the exploration ratio
    2) the ratio of locations visited only once
    """
    locations = user_df["grid"].tolist()

    # ---------- Exploration ----------
    seen = set()
    exploration_flags = []

    for loc in locations:
        if loc in seen:
            exploration_flags.append(0)
        else:
            exploration_flags.append(1)
            seen.add(loc)

    exploration_ratio = sum(exploration_flags) / len(exploration_flags)

    # ---------- Visited only once ----------
    loc_counts = pd.Series(locations).value_counts()
    once_visited_ratio = (loc_counts == 1).sum() / len(loc_counts)

    return pd.Series({
        "exploration_ratio": exploration_ratio,
        "once_visited_location_ratio": once_visited_ratio,
        "num_visits": len(locations),
        "num_unique_locations": len(loc_counts)
    })

# 3. Compute the statistics per user
moreuser_stats = (
    df.groupby("userID", group_keys=False)
      .apply(compute_user_stats)
      .reset_index()
)

print(moreuser_stats.head())
moreuser_stats.to_csv('./Data/MoreUser/MoreUser_ExplorationRatio_OnceVisitedLocationRatio.csv')

   userID  exploration_ratio  once_visited_location_ratio  num_visits  \
0       0           0.115594                     0.452830       917.0   
1       1           0.167048                     0.589041       437.0   
2       2           0.062992                     0.416667       381.0   
3       3           0.185874                     0.480000       269.0   
4       4           0.061103                     0.414634       671.0   

   num_unique_locations  
0                 106.0  
1                  73.0  
2                  24.0  
3                  50.0  
4                  41.0  




In [4]:
moreuser_stats.describe()

            userID  exploration_ratio  once_visited_location_ratio  num_visits  num_unique_locations
count       9907.0             9907.0                       9907.0      9907.0                9907.0
mean   4970.096396           0.101445                     0.510757  688.243464             49.810538
std    2870.739555           0.088936                     0.119381  398.824487             33.847936
min            0.0            0.00347                          0.0         4.0                   4.0
25%         2485.5           0.045348                       0.4375       369.0                  28.0
50%         4970.0           0.075472                     0.518519       654.0                  42.0
75%         7454.5           0.126214                     0.590909       961.0                  63.0
max         9943.0                1.0                          1.0      2112.0                 442.0
