ego network
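- What related studies exist, and what methods did they use?
- Which attributes did they use for prediction, and what accuracy did they achieve?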
- You Are Who You Know: Inferring User Profiles in Online Social Networks http://www.leonidzhukov.net/hse/2011/seminar/papers/Inferring-WSDM.pdf
- Predicting user personality by mining social interactions in Facebook
- Inferring user profile properties based upon mobile device usage D Krishnaswamy - US Patent App. 12/369,676, 2010 - Google Patents
- Inferring the location of twitter messages based on user relationships CA Davis Jr, GL Pappa, DRR de Oliveira… - Transactions in …, 2011 - Wiley Online Library
- Inferring Gender from the Content of Tweets: A Region Specific Example. C Fink, J Kopecky, M Morawski - ICWSM, 2012 - aaai.org
- Inferring Latent User Properties from Texts Published in Social Media. S Volkova, Y Bachrach, M Armstrong, V Sharma - AAAI, 2015 - research.microsoft.com
- We know where you live: privacy characterization of foursquare behavior T Pontes, M Vasconcelos, J Almeida… - Proceedings of the …, 2012 - dl.acm.org
- González MC, Hidalgo CA, Barabási AL (2008) Understanding individual human mobility patterns. Nature 453(7196):779–782.
- Song C, Qu Z, Blumm N, Barabási AL (2010) Limits of predictability in human mobility. Science 327(5968):1018–1021.
- python pandas
- awk+shell
- Use Python to clean and organize the data, keeping only the fields needed for now and writing them to a separate file
    # Reduce repeated reads and writes to speed up the computation.
    bigfile = open(path + 'path/to/data.csv')
    chunkSize = 100000000
    chunk = bigfile.readlines(chunkSize)
    while chunk:
        # process chunk
        chunk = bigfile.readlines(chunkSize)
    bigfile.close()
- start to read in data by chunks
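As one possible way to combine the cleaning step with chunked reading, the sketch below uses the `chunksize` argument of `pandas.read_csv` and appends only the wanted columns to a second file. The function name `filter_columns`, the file names, the toy data, and the chunk size are illustrative assumptions, not the code used in these notes.

```python
import os
import tempfile

import pandas as pd


def filter_columns(src, dst, columns, chunksize=100000):
    """Copy only `columns` from src to dst, reading src in chunks."""
    first = True
    for chunk in pd.read_csv(src, usecols=columns, chunksize=chunksize):
        # Write the header once, then append the remaining chunks.
        chunk.to_csv(dst, mode='w' if first else 'a', header=first, index=False)
        first = False


# Tiny demonstration on a throwaway file (hypothetical data).
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, 'raw.csv')
dst = os.path.join(tmpdir, 'clean.csv')
pd.DataFrame({
    'calling_nbr': [1, 2, 3],
    'called_nbr': [2, 3, 1],
    'raw_dur': [10, 20, 30],
    'unused': ['a', 'b', 'c'],
}).to_csv(src, index=False)

filter_columns(src, dst, ['calling_nbr', 'raw_dur'], chunksize=2)
cleaned = pd.read_csv(dst)
```

Reading with `usecols` means the unused columns are never materialized, which is usually where most of the memory saving comes from.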
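The first attempt was to randomly select 5000 users from tb_user_info_201202.txt. In practice, some users who appear in tb_user_info_201202.txt do not appear in the call or SMS records, so this approach does not work. Instead, extract the user relationship data from tb_call_201202.txt and tb_sms_201202.txt, write it to tb_call_users.txt and tb_sms_users.txt respectively, and sample from the users found in those two files.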
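A minimal sketch of sampling from the users that actually occur in the records follows. The layout of tb_call_users.txt is not given in these notes, so the two-column frame below (caller, callee) is an assumption, as are the toy values and the sample size of 3.

```python
import random

import pandas as pd

# Hypothetical stand-in for the contents of tb_call_users.txt.
call_users = pd.DataFrame({
    'calling_nbr': ['a', 'a', 'b', 'c', 'd'],
    'called_nbr':  ['b', 'c', 'c', 'e', 'a'],
})

# Every number that appears on either side of a record is a candidate.
candidates = sorted(set(call_users['calling_nbr']) | set(call_users['called_nbr']))

random.seed(0)  # fixed seed so the sample is reproducible
k = min(3, len(candidates))  # the notes use 5000 on the real data
sampled_users = random.sample(candidates, k)
```

Sampling from the record files guarantees that every sampled user has at least one edge, which the ego-network step needs.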
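Below are the call data (left: all calls; right: within-network calls only):
The number of distinct calling numbers per hour and the number of calls placed per hour follow an allometric scaling law, but the allometric growth is weak. A possible reason is that calls placed from other networks and from landlines to Telecom numbers are not recorded.
Below are the SMS data (left: all messages; right: within-network messages only):
Allometric growth is even weaker in SMS behavior.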
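The allometric exponent discussed above can be estimated roughly as the slope of a log-log regression of calls per hour on distinct callers per hour. The sketch below assumes a record table with an `hour` column derived from the call start time; that column, the helper names `hourly_counts` and `fit_exponent`, and the toy data are all illustrative assumptions.

```python
import numpy as np
import pandas as pd


def hourly_counts(calls):
    """Per hour: number of distinct callers and number of calls."""
    grouped = calls.groupby('hour')['calling_nbr']
    return pd.DataFrame({'n_callers': grouped.nunique(), 'n_calls': grouped.count()})


def fit_exponent(x, y):
    """Slope of the least-squares line through (log x, log y)."""
    slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
    return slope


# Hypothetical records: one row per call.
calls = pd.DataFrame({
    'hour':        [0, 0, 0, 1, 1, 1, 1, 1],
    'calling_nbr': ['a', 'a', 'b', 'a', 'b', 'c', 'c', 'c'],
})
counts = hourly_counts(calls)

# Synthetic check: data generated with exponent 1.2 should give it back.
n = np.array([10.0, 20.0, 40.0, 80.0])
theta = fit_exponent(n, 3.0 * n ** 1.2)
```

On real data one would call `fit_exponent(counts['n_callers'], counts['n_calls'])`; an exponent close to 1 corresponds to the weak allometric growth noted above.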
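In earlier studies of collective attention, we treated the system as a bipartite network of people and content, with people's attention flowing among content items. This network is converted into a flow network whose nodes are content items and whose edge weights are the amount of attention passing between them. Allometric growth is then explained by examining properties of the flow network, such as dissipation. Empirical data show that the dissipation rate (dissipation versus inflow) also obeys a scaling relation.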
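1. A simple conjecture is that dissipation grows sublinearly with inflow: content that receives more inflow retains attention better (dissipates less), so the system as a whole shows the economy of scale we want to observe, namely allometric growth.
2. However, the empirical data further show that superlinear dissipation and allometric growth coexist in some systems, so the picture is not that simple. One remedy is to re-measure dissipation and inflow geometrically: measure each node's flow distance to the source, then observe the system's cumulative dissipation and cumulative inflow as functions of that flow distance. The two obey a scaling relation with exponent b.
3. Even this explanation is unsatisfactory. The data show a further scaling relation with the total dissipation of the flow network, with exponent a. Fitting the two parameters a and b describes the allometric growth rate $\theta$ fairly well: $\theta - a$ and b are linearly related. But if b measures the geometric dissipation rate, why does the allometric growth rate increase as the dissipation rate increases? Where did the economy of scale go?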
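The first conjecture in the list above can be written out as a pair of scaling relations. The symbols $N$, $C$, $T_i$, $D_i$, and $\gamma$ are notation introduced here for illustration only; of these, just $\theta$ appears in the notes.

```latex
% Allometric growth: total activity C against system size N
C \propto N^{\theta}

% Conjecture 1: the dissipation D_i of content i grows sublinearly with its inflow T_i
D_i \propto T_i^{\gamma}, \qquad \gamma < 1

% so the dissipated fraction shrinks as the inflow grows
\frac{D_i}{T_i} \propto T_i^{\gamma - 1} \longrightarrow 0 \quad \text{as } T_i \to \infty
```

Under this reading, the second item corresponds to observing $\gamma > 1$ in some systems that still show allometric growth, which is why the sublinear conjecture alone is not enough.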
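In the phone data, by contrast, people flow among people. Interaction networks are characterized by positive feedback, so the main quantity to focus on is reciprocity.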
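As a hedged illustration of how reciprocity could be measured, the sketch below uses `overall_reciprocity` and `reciprocity` from networkx on a toy directed graph. The graph and its node names are hypothetical; the notes do not say which reciprocity measure was intended.

```python
import networkx as nx

# Hypothetical directed call graph: an edge u -> v means u called v.
D = nx.DiGraph()
D.add_edges_from([('A', 'B'), ('B', 'A'), ('A', 'C')])

# Fraction of directed edges whose reverse edge also exists.
overall = nx.overall_reciprocity(D)

# Reciprocity of a single node: reciprocated ties relative to all its in- and out-edges.
ego_reciprocity = nx.reciprocity(D, 'A')
```

Here two of the three directed edges are reciprocated, so both values come out to 2/3.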
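For each randomly selected user, construct their ego network.
Build one large network G from the data in tb_call_users.txt and another from tb_sms_users.txt, then call the ego_graph function from networkx, the Python complex-network package, to extract ego networks at different scales (levels). For example:

    from networkx import ego_graph

    eg1 = ego_graph(G, n, radius = 1)
    eg2 = ego_graph(G, n, radius = 2)
    eg3 = ego_graph(G, n, radius = 3)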
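To make the effect of `radius` concrete, here is a small self-contained sketch on a toy chain graph. The graph is hypothetical and only stands in for the call or SMS network.

```python
import networkx as nx

# Hypothetical toy network: a chain 0-1-2-3-4.
G = nx.path_graph(5)
n = 0  # the ego

# Number of nodes in the ego network at each radius (the ego itself is included).
sizes = {r: nx.ego_graph(G, n, radius=r).number_of_nodes() for r in (1, 2, 3)}
# sizes == {1: 2, 2: 3, 3: 4}
```

Each extra step of radius adds the nodes that are one hop further from the ego, so on dense real networks the size can grow very quickly between radius 2 and radius 3.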
Serialize the constructed call_graph and sms_graph objects to local disk.

    # import the pickle module
    try:
        import cPickle as pickle
    except ImportError:
        import pickle

    >>> d = dict(name='Bob', age=20, score=88)
    >>> f = open('dump.txt', 'wb')
    >>> pickle.dump(d, f)
    >>> f.close()

- Serialize object d to file f

    >>> f = open('dump.txt', 'rb')
    >>> d = pickle.load(f)
    >>> f.close()
    >>> d
    {'age': 20, 'score': 88, 'name': 'Bob'}

- Deserialize file f back into object d
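The same pattern applies to a graph object. The sketch below round-trips a small networkx graph through pickle; the file name `call_graph.pkl` and the toy edge are assumptions made for the example.

```python
import os
import pickle
import tempfile

import networkx as nx

# Hypothetical stand-in for the real call graph.
call_graph = nx.Graph()
call_graph.add_edge('A', 'B', weight=120)

path = os.path.join(tempfile.mkdtemp(), 'call_graph.pkl')

# Serialize the graph; the with-block closes the file even on error.
with open(path, 'wb') as f:
    pickle.dump(call_graph, f, protocol=pickle.HIGHEST_PROTOCOL)

# Load it back and confirm that edge attributes survive.
with open(path, 'rb') as f:
    restored = pickle.load(f)
```

Node and edge attributes are preserved, so a graph built once from the full month of records can be reloaded without re-parsing the text files.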
- eg3sizes
- ages
- consume_amt
- terminal_price
- degree_assortativity
- average_clustering
- transitivity
- density
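- number of links (related to density)
- degree centrality of the ego in the whole network (degree_centrality)
- attributes of the mobile base-station network (see the MSRA trajectory data mining tutorial)
- add the attributes of the 2-degree network together with those of the 3-degree network
- for call_data, check whether the call start time call_start and the duration call_dur carry useful features
- for sms_data, check whether the send time sms_send carries useful features
- for user_data, consider the user's industry work_type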
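The structural features in the list above could be collected per ego roughly as follows. This is a sketch, not the notes' actual pipeline: the helper name `ego_features` is hypothetical, and Zachary's karate club graph stands in for the call network.

```python
import networkx as nx


def ego_features(G, n, radius=1):
    """Structural features of the ego network of node n."""
    eg = nx.ego_graph(G, n, radius=radius)
    return {
        'size': eg.number_of_nodes(),
        'n_links': eg.number_of_edges(),
        'density': nx.density(eg),
        'average_clustering': nx.average_clustering(eg),
        'transitivity': nx.transitivity(eg),
        'degree_assortativity': nx.degree_assortativity_coefficient(eg),
        # Centrality of the ego measured on the whole graph, not the ego network.
        'degree_centrality': nx.degree_centrality(G)[n],
    }


# Zachary's karate club graph stands in for the call network here.
G = nx.karate_club_graph()
features = ego_features(G, 0, radius=1)
```

For many egos it would be cheaper to compute `nx.degree_centrality(G)` once outside the function and look values up, since it does not depend on the ego.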
Whereas a single call between two individuals during 18 weeks may not carry much information, reciprocal calls of long duration between two users serve as a signature of some work-, family-, leisure-, or service-based relationship. Therefore, to translate the phone log data into a network representation that captures the characteristics of the underlying communication network, we connected two users with an undirected link if there had been at least one reciprocated pair of phone calls between them (i.e., A called B, and B called A) and defined the strength, $w_{AB} = w_{BA}$, as the aggregated duration of calls between users A and B.[1]
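The construction quoted above can be sketched with pandas and networkx as follows. The column names `calling_nbr`, `called_nbr`, and `raw_dur` follow the call table used in these notes; the helper name `reciprocal_call_graph` and the toy records are assumptions.

```python
import networkx as nx
import pandas as pd


def reciprocal_call_graph(calls):
    """Undirected graph with one edge per reciprocated pair of callers."""
    # Total duration for each directed pair (caller -> callee).
    directed = calls.groupby(['calling_nbr', 'called_nbr'])['raw_dur'].sum().to_dict()
    G = nx.Graph()
    for (a, b), dur in directed.items():
        # Keep the pair only if b also called a; ignore self-calls.
        if a != b and (b, a) in directed:
            # Weight is the aggregated duration in both directions.
            G.add_edge(a, b, weight=dur + directed[(b, a)])
    return G


# Hypothetical call records.
calls = pd.DataFrame({
    'calling_nbr': ['A', 'B', 'A', 'A'],
    'called_nbr':  ['B', 'A', 'B', 'C'],
    'raw_dur':     [60, 30, 10, 100],
})
G = reciprocal_call_graph(calls)
```

In the toy data A and B called each other, so they are linked with weight 100, while the one-way call from A to C produces no edge.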
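Select the ACC_NBR and CUST_LEVEL attributes from tb_user, and the 'calling_nbr', 'raw_dur', and 'calling_cell' attributes from tb_call.

    user_data.loc[:, ['ACC_NBR','CUST_LEVEL']].to_csv('acc_nbr_cust_level.csv', index = False)
    call_data.loc[:, ['calling_nbr','raw_dur','calling_cell']].to_csv('calling_nbr_raw_dur_calling_cell.csv', index = False)

Merge the two tables using ACC_NBR and calling_nbr as the join keys. For the first 20000 records of the merged table, compute the relationships among three variables: the number of base stations a user visited, the longest time a user stayed at any one base station, and the user level.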
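One way the merge and the three variables might be computed is sketched below. The notes do not define how stay time is measured, so using the total call duration at a cell as a proxy is an assumption, as are the toy tables and the level codes.

```python
import pandas as pd

# Hypothetical miniature versions of the two exported tables.
users = pd.DataFrame({'ACC_NBR': [1, 2, 3], 'CUST_LEVEL': [10, 10, 20]})
calls = pd.DataFrame({
    'calling_nbr':  [1, 1, 1, 2, 3, 3],
    'raw_dur':      [30, 50, 20, 40, 10, 90],
    'calling_cell': ['c1', 'c2', 'c1', 'c1', 'c3', 'c3'],
})

# Join on the phone number and keep the first 20000 records, as in the notes.
merged = users.merge(calls, left_on='ACC_NBR', right_on='calling_nbr').head(20000)

# Assumed proxy: total call time per user per cell stands for time spent there.
stay = merged.groupby(['ACC_NBR', 'calling_cell'])['raw_dur'].sum().reset_index()

per_user = stay.groupby('ACC_NBR').agg(
    n_cells=('calling_cell', 'nunique'),
    max_stay=('raw_dur', 'max'),
)
per_user = per_user.join(users.set_index('ACC_NBR')['CUST_LEVEL'])

# Mean of each variable by user level.
by_level = per_user.groupby('CUST_LEVEL')[['n_cells', 'max_stay']].mean()
```

The `by_level` table has one row per user level, which is the shape of the per-level averages reported in these notes.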
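User level (x-axis) versus number of base stations visited (y-axis)
Number of base stations visited (x-axis) versus longest stay at any one base station (y-axis)
User level (x-axis) versus longest stay at any one base station (y-axis)
Mean longest stay at any one base station, by user level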
- 242.276923
- 233.663317
- 215.990440
- 173.111985
- 1.076923
- 1.165829
- 1.148821
- 1.121973
Use the mobile call data (tb_call), with other-network (y), landline (g), and abnormal (#) users removed.

    call_data = call_data[~call_data['called_nbr'].str.contains('y')]
    call_data = call_data[~call_data['called_nbr'].str.contains('g')]
    call_data = call_data[~call_data['called_nbr'].str.contains('#')]
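Study users by age group, split into city, county town, and village.
Degree distributions of the call networks for county-town youth, middle-aged, and elderly users and village youth, middle-aged, and elderly users (colors in order: red, blue, green, yellow, magenta, cyan)
Base-station mobility network of village youth who called within-network users
Select all call data for the single day 20120201.
Split users into age groups by city, county town, and village.

    # Select one day of data: 20120201
    call_data_one_day = call_data[call_data['day_id'] == 20120201]
    call_data_one_day.to_csv('call_data_one_day.csv', index = False)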
- dump
call_data_one_day_ccnc = call_data_one_day.loc[:, ['calling_nbr', 'called_nbr', 'raw_dur','calling_cell']]
- select call_data attributes
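    cols = ['ACC_NBR','URBAN_RURAL_ID', 'GENDER', 'AGE', 'CUST_LEVEL', 'IS_3G', 'IS_VIP', 'CONSUME_AMT', 'CALL_FEE', 'INNET_FLOAT_AMT', 'INNET_DUR']
    # Combine the three filters into one mask built on user_data itself,
    # instead of chaining boolean indexers on an already-filtered frame.
    valid = (user_data['CUST_LEVEL'] != -1) & (user_data['URBAN_RURAL_ID'] != -1) & (user_data['IS_VIP'] != -1)
    user_data_URID = user_data.loc[valid, cols].dropna()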
- select user_data attributes
user_data_0 = user_data_URID[user_data_URID['URBAN_RURAL_ID'] == 0]
- city population
user_data_1 = user_data_URID[user_data_URID['URBAN_RURAL_ID'] == 1]
- county-town population
user_data_2 = user_data_URID[user_data_URID['URBAN_RURAL_ID'] == 2]
- village population
    # Youth: 12 < AGE < 25
    user_data_youth_0 = user_data_0[(user_data_0['AGE'] > 12) & (user_data_0['AGE'] < 25)]
    user_data_youth_1 = user_data_1[(user_data_1['AGE'] > 12) & (user_data_1['AGE'] < 25)]
    user_data_youth_2 = user_data_2[(user_data_2['AGE'] > 12) & (user_data_2['AGE'] < 25)]

    # Attach each user's calls for the selected day
    user_call_data_youth_0 = user_data_youth_0.merge(call_data_one_day_ccnc, left_on = 'ACC_NBR', right_on = 'calling_nbr')
    user_call_data_youth_1 = user_data_youth_1.merge(call_data_one_day_ccnc, left_on = 'ACC_NBR', right_on = 'calling_nbr')
    user_call_data_youth_2 = user_data_youth_2.merge(call_data_one_day_ccnc, left_on = 'ACC_NBR', right_on = 'calling_nbr')

- Select the youth groups in the city, county town, and village
    import networkx as nx

    G = nx.Graph()
    groups = [(0, user_data_youth_0, user_call_data_youth_0),
              (1, user_data_youth_1, user_call_data_youth_1),
              (2, user_data_youth_2, user_call_data_youth_2)]
    for urid, users, user_calls in groups:
        for i in user_calls['ACC_NBR'].drop_duplicates():
            attributes = users[users['ACC_NBR'] == i]
            # Cells this user called from, in record order
            node_list = user_calls[user_calls['ACC_NBR'] == i].calling_cell.tolist()
            # nx.add_path replaces Graph.add_path, which newer networkx releases removed
            nx.add_path(G, node_list, URID=urid)
            G.add_nodes_from(node_list, attributes=attributes)

- Build the base-station mobility network
    attr_dict = nx.get_node_attributes(G, 'attributes')
    nodes_0 = []
    nodes_1 = []
    nodes_2 = []
    for key in attr_dict:
        # URBAN_RURAL_ID of the user stored on this cell node
        urid = attr_dict[key]['URBAN_RURAL_ID'].tolist()[0]
        if urid == 0:
            nodes_0.append(key)
        if urid == 1:
            nodes_1.append(key)
        if urid == 2:
            nodes_2.append(key)
    subG_0 = G.subgraph(nodes_0)
    subG_1 = G.subgraph(nodes_1)
    subG_2 = G.subgraph(nodes_2)

- Split into city, county-town, and village subgraphs by URBAN_RURAL_ID
degree_centrality distributions of the 3 subgraphs (0, 1, 2 are colored r, g, b)
Degree distributions of the 3 subgraphs (points for 0, 1, 2 are colored r, g, b)
beta, r2 (for 0, 1, 2 in order)
- -2.29744767208 0.753029337819
- -1.34807829286 0.557218689598
- -1.09562062485 0.744622030593
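The notes do not say how beta and r2 were obtained. One plausible reading is an ordinary least-squares line through the degree distribution on log-log axes, sketched below; the helper name `fit_degree_distribution` and the Barabási-Albert test graph are assumptions.

```python
import networkx as nx
import numpy as np


def fit_degree_distribution(G):
    """Return (beta, r2) of a log-log linear fit to the degree distribution."""
    hist = np.array(nx.degree_histogram(G), dtype=float)
    k = np.nonzero(hist)[0]
    k = k[k > 0]  # log(0) is undefined, so drop isolated nodes
    x = np.log(k)
    y = np.log(hist[k] / hist.sum())
    beta, intercept = np.polyfit(x, y, 1)
    residuals = y - (beta * x + intercept)
    # With an intercept in the model, this equals 1 - SS_res / SS_tot.
    r2 = 1.0 - residuals.var() / y.var()
    return beta, r2


# Hypothetical example graph; the notes apply the fit to subG_0, subG_1, subG_2.
G = nx.barabasi_albert_graph(2000, 2, seed=1)
beta, r2 = fit_degree_distribution(G)
```

A least-squares fit on log-log axes is a crude estimator of a power-law exponent, so differences between the three groups should be read with that in mind.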
Select the call data for the week 20120201-20120207 and apply the same calculation as above.
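Selecting the week might look like the following, assuming `day_id` is stored as an integer in the same form as in the one-day filter. The toy rows and the name `call_data_one_week` are hypothetical.

```python
import pandas as pd

# Hypothetical rows; day_id is an integer such as 20120201.
call_data = pd.DataFrame({
    'day_id': [20120131, 20120201, 20120204, 20120207, 20120208],
    'calling_nbr': ['a', 'b', 'c', 'd', 'e'],
})

# between() is inclusive on both ends by default.
call_data_one_week = call_data[call_data['day_id'].between(20120201, 20120207)]
```

The integer comparison works here only because all seven days fall within the same month.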
Degree distributions of the 3 subgraphs (points for 0, 1, 2 are colored r, g, b)
beta, r2 (for 0, 1, 2 in order)
- -2.5833344515 0.841807053127
- -1.69149012377 0.682239828293
- -0.771546717305 0.701634232687
[1] Onnela, J. P., Saramäki, J., Hyvönen, J., Szabó, G., Lazer, D., Kaski, K., ... & Barabási, A. L. (2007). Structure and tie strengths in mobile communication networks. Proceedings of the National Academy of Sciences, 104(18), 7332-7336.