## We want to be able to do a couple things in this repo:
0. consistency and general sanity checks
1. deduplication (naïve if faster)
2. tokenisation
3. stop-word removal
4. others

In [1]:
# reserved for libraries
import os
import datetime
from tqdm import tqdm
from multiprocessing import pool as P
import pickle
import random
from math import floor
from datetime import timedelta

In [2]:
import datetime

datetime.datetime.strptime('2018-2-2', '%Y-%m-%d')

datetime.datetime(2018, 2, 2, 0, 0)

In [3]:
# data dir
data_dir = "../master_thesis_data/weibo_raw"
save_dir = "../master_thesis_data/weibo_deduplicated"

In [4]:
# finding all files and separating them into date files and tweet (text) files
all_files = os.listdir(data_dir)

all_texts = [file for file in all_files if file.split('.')[0][-5:] == 'texts']
all_dates = [file for file in all_files if file.split('.')[0][-5:] == 'dates']

In [5]:
# how many days of data did we actually get
len(all_texts), len(all_dates), 365 * 3 + 1

(1096, 1096, 1096)

In [6]:
# if they match up
actual_dates_0 = sorted([text.split('_')[1] for text in all_texts])
actual_dates_1 = sorted([date.split('_')[1] for date in all_dates])

matches = [actual_dates_0[i] == actual_dates_1[i] for i in range(len(actual_dates_0))]
sum(matches)

1096

The file counts match the number of days we expected, and the text and date files pair up correctly. Let's use datetime to check whether any dates in the period are missing.

In [7]:
# setting limits
starting_date = datetime.datetime.strptime('2016-4-17', '%Y-%m-%d')
end_date = datetime.datetime.strptime('2019-4-17', '%Y-%m-%d')

In [8]:
# converting to datetime format
actual_dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in actual_dates_0]

In [9]:
# finding which dates are missing
missing_dates = []
while starting_date != end_date + datetime.timedelta(days = 1):
    if starting_date not in actual_dates:
        missing_dates.append(starting_date)
    starting_date += datetime.timedelta(days = 1)

In [10]:
len(missing_dates)

0

OK, so the entire period is covered.
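
The same check can be sketched with a set lookup, which avoids scanning the list on every iteration and leaves the start date unmodified. This is a minimal sketch on toy data; `find_missing_dates` is a hypothetical helper, not something defined in this repo.

```python
from datetime import datetime, timedelta

def find_missing_dates(start, end, observed):
    """Return every day in [start, end] (inclusive) that is absent from `observed`."""
    observed_set = set(observed)  # constant-time membership tests
    n_days = (end - start).days + 1
    expected = (start + timedelta(days=k) for k in range(n_days))
    return [day for day in expected if day not in observed_set]

# toy data (assumed for illustration): a three-day range with the middle day missing
start = datetime(2016, 4, 17)
end = datetime(2016, 4, 19)
observed = [datetime(2016, 4, 17), datetime(2016, 4, 19)]
missing = find_missing_dates(start, end, observed)
print(missing)
```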

## Now let's see how many tweets we managed to collect in total

In [11]:
# counting total number of tweets
tweet_count = 0
for file in all_texts:
    with open(os.path.join(data_dir, file), 'rb') as handle:
        file_holder = pickle.load(handle)
    tweet_count += len(file_holder)

In [12]:
tweet_count, round(tweet_count/len(matches), 0), round(tweet_count * 0.6, 0)

(741389, 676.0, 444833.0)

We collected roughly 740K tweets, or about 676 tweets per day on average. Assuming about 60% of them survive, we can expect around 440K tweets in total after deduplication.

## Prototyping deduplication

We'll prototype on a single day first (the first file in sorted order)

In [13]:
# loading in the sample date's text and dates
with open(os.path.join(data_dir, sorted(all_texts)[0]), 'rb') as handle:
    texts = pickle.load(handle)
with open(os.path.join(data_dir, sorted(all_dates)[0]), 'rb') as handle:
    dates = pickle.load(handle)

In [14]:
sum(['人工智能' in text for text in texts])/len(texts)
# about 99.4% of the tweets mention the search term '人工智能' (artificial intelligence); the ones that don't are most likely retweets

0.9943342776203966

In [15]:
# recording ones we want to keep
Mention_idx = [i for i, text in enumerate(texts) if '人工智能' in text]

In [16]:
# keeping the tweets we want
texts = [texts[i] for i in Mention_idx]

In [17]:
# keeping the dates we want
dates = [dates[i] for i in Mention_idx]

In [18]:
len(dates), len(texts)

(702, 702)
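
Filtering the texts and dates in one pass keeps the two lists aligned by construction. A minimal sketch on toy data; `filter_by_keyword` is a hypothetical helper and the sample strings are made up for illustration.

```python
def filter_by_keyword(texts, dates, keyword):
    """Keep only the (text, date) pairs whose text contains `keyword`."""
    kept = [(text, date) for text, date in zip(texts, dates) if keyword in text]
    if not kept:
        return [], []
    kept_texts, kept_dates = zip(*kept)
    return list(kept_texts), list(kept_dates)

# toy data (assumed): the second item does not mention the keyword
sample_texts = ["人工智能 is everywhere", "unrelated post", "more on 人工智能"]
sample_dates = ["2016-4-17", "2016-4-17", "2016-4-18"]
kept_texts, kept_dates = filter_by_keyword(sample_texts, sample_dates, "人工智能")
print(len(kept_texts), len(kept_dates))
```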

In [19]:
len(set(texts))

# so in terms of unique tweets, we only have 573

573

Now let's take a look at naïve (exact-match) duplicates

In [20]:
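# group the indices of exactly identical tweets into clusters
clusters = []

for i, tweet_0 in tqdm(enumerate(texts)):
    # skip tweets already assigned to an earlier cluster
    if any(i in clus for clus in clusters):
        continue
    clusters.append([i])
    # collect every later tweet with identical text
    for j, tweet_1 in enumerate(texts[i+1:]):
        if tweet_0 == tweet_1:
            clusters[-1].append(i+j+1)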

702it [00:00, 12760.19it/s]


In [21]:
len(clusters)

573
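
For exact matches, the same clusters can also be built in a single pass by using the tweet text as a dictionary key, which avoids the pairwise comparison. A minimal sketch on toy data, with `cluster_exact_duplicates` as a hypothetical helper name:

```python
from collections import defaultdict

def cluster_exact_duplicates(texts):
    """Group the indices of identical strings; clusters are ordered by first occurrence."""
    groups = defaultdict(list)
    for idx, text in enumerate(texts):
        groups[text].append(idx)
    # dicts preserve insertion order (Python 3.7+), so clusters follow first appearance
    return list(groups.values())

# toy data (assumed for illustration)
sample = ["a", "b", "a", "c", "b", "a"]
sample_clusters = cluster_exact_duplicates(sample)
print(sample_clusters)
```

Each string is hashed once, so the cost grows roughly linearly with the number of tweets rather than quadratically.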

In [22]:
# keep only clusters that contain actual duplicates, so we can sample a few and check manually what kind they are
clusters = [clus for clus in clusters if len(clus) > 1]

In [23]:
for idx in random.choice(clusters):
    print(texts[idx])
    print('----------------------')
    print(' ')

//@李开复:明天晚上，我在上海交大等你，想要更深入研究人工智能的同学请踊跃报名哦。
----------------------
 
//@李开复:明天晚上，我在上海交大等你，想要更深入研究人工智能的同学请踊跃报名哦。
----------------------
 
//@李开复:明天晚上，我在上海交大等你，想要更深入研究人工智能的同学请踊跃报名哦。
----------------------
 
//@李开复:明天晚上，我在上海交大等你，想要更深入研究人工智能的同学请踊跃报名哦。
----------------------
 
//@李开复:明天晚上，我在上海交大等你，想要更深入研究人工智能的同学请踊跃报名哦。
----------------------
 
//@李开复:明天晚上，我在上海交大等你，想要更深入研究人工智能的同学请踊跃报名哦。
----------------------
 
//@李开复:明天晚上，我在上海交大等你，想要更深入研究人工智能的同学请踊跃报名哦。
----------------------
 
//@李开复:明天晚上，我在上海交大等你，想要更深入研究人工智能的同学请踊跃报名哦。
----------------------
 
//@李开复:明天晚上，我在上海交大等你，想要更深入研究人工智能的同学请踊跃报名哦。
----------------------
 
//@李开复:明天晚上，我在上海交大等你，想要更深入研究人工智能的同学请踊跃报名哦。
----------------------
 
//@李开复:明天晚上，我在上海交大等你，想要更深入研究人工智能的同学请踊跃报名哦。
----------------------
 
//@李开复:明天晚上，我在上海交大等你，想要更深入研究人工智能的同学请踊跃报名哦。
----------------------
 
//@李开复:明天晚上，我在上海交大等你，想要更深入研究人工智能的同学请踊跃报名哦。
----------------------
 
//@李开复:明天晚上，我在上海交大等你，想要更深入研究人工智能的同学请踊跃报名哦。
----------------------
 
//@李开复:明天晚上，我在上海交大等你，想要更深入研究人工智能的同学请踊跃报名哦。
-----

OK, so let's now go ahead and run a naïve deduplication

In [24]:
texts

['AKA打造的人工智能机器人Musio，有着出色的学习与自然语言处理能力，能用讲话、表情以及姿势等与人交流，每一次互动都会让它变得更智能。它还可以与所有的智能家庭设备连接，并控制与之相连的设备，能关灯、调节恒温器等。',
 'AKA打造的人工智能机器人Musio，有着出色的学习与自然语言处理能力，能用讲话、表情以及姿势等与人交流，每一次互动都会让它变得更智能。它还可以与所有的智能家庭设备连接，并控制与之相连的设备，能关灯、调节恒温器等。',
 '航空公司为什么青睐AI（人工智能）？-科技频道-手机搜狐 O航空公司为什么青睐AI（人工智能）？',
 'AKA打造的人工智能机器人Musio，有着出色的学习与自然语言处理能力，能用讲话、表情以及姿势等与人交流，每一次互动都会让它变得更智能。它还可以与所有的智能家庭设备连接，并控制与之相连的设备，能关灯、调节恒温器等。',
 '【信息科技：五大巨头联合推动人工智能 荐9股】O网页链接',
 'AKA打造的人工智能机器人Musio，有着出色的学习与自然语言处理能力，能用讲话、表情以及姿势等与人交流，每一次互动都会让它变得更智能。它还可以与所有的智能家庭设备连接，并控制与之相连的设备，能关灯、调节恒温器等。',
 '看完了 深恭还是那么美//@会员号是空号的mimo:刚看完录画，还不错。顺说那个人工智能女高中生rinna，是真的人工智能哦，有推特跟line账号的，会回复留言，日本微软开发的（翻过这条新闻 //@系录芥末:马！//@风中劲节_goro酱是小天使: 这个马一个。//@悠幽刨冰: 嗷嗷嗷嗷嗷终于等到了看看',
 '//@李开复: 明天晚上，我在上海交大等你，想要更深入研究人工智能的同学请踊跃报名哦。',
 '【信息科技：五大巨头联合推动人工智能 荐9股】 O网页链接',
 '【《聚焦：人工智能或引发乐视价值井喷》】//@红叶st: 转发微博',
 '横看人脑侧智能，远近高低各不同，不识人脑真面目，只缘脑在此脑中。人工智能超越人脑对于现在来说只能是伪命题，不能以一部虚构的神剧下定论，说白了人脑子还在人脑里，人工智能也是大脑想的。所以现在担心的不是人工智能，反而倒是人。如果现在高度依赖智能网络的社会被别有用心的人控制，那后果……',
 '人工智能(AI)的研发将对人类产生最大的影响。全球的科学家们都在疯狂的研发着人工智能

In [25]:
begin_date = '2018-02-05'
end_date = '2019-02-08'

In [26]:
# converting input dates into datetime format
begin_date = datetime.datetime.strptime(begin_date, '%Y-%m-%d')
end_date = datetime.datetime.strptime(end_date, '%Y-%m-%d')

In [27]:
n_days = (end_date - begin_date).days

In [28]:
window_size = 3

In [29]:
floor(n_days/window_size)+1

123

In [30]:
len(texts), len(dates)

(702, 702)

In [31]:
begin_date

datetime.datetime(2018, 2, 5, 0, 0)

In [32]:
begin_date + timedelta(days = window_size-1)

datetime.datetime(2018, 2, 7, 0, 0)
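
One way to picture the windows: the sketch below splits an inclusive date range into consecutive, non-overlapping windows of `window_size` days and clips the last one at the end date. That layout is an assumption made for illustration (the actual deduplication code may lay out its windows differently), and `date_windows` is a hypothetical helper. Under that assumption the number of windows equals `floor(n_days/window_size) + 1`.

```python
from datetime import datetime, timedelta

def date_windows(begin, end, window_size):
    """Yield (first_day, last_day) pairs covering [begin, end] in consecutive windows."""
    step = timedelta(days=window_size)
    first = begin
    while first <= end:
        # the last window is clipped so it never runs past `end`
        last = min(first + timedelta(days=window_size - 1), end)
        yield first, last
        first += step

windows = list(date_windows(datetime(2018, 2, 5), datetime(2019, 2, 8), 3))
print(len(windows))
print(windows[0])
```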

In [33]:
# for each cluster, keep the tweet with the earliest date and mark the rest for removal
ind_remove = []
for clus in clusters:
    min_date = None
    ind_holder = 0
    for i, ind in enumerate(clus):
        if i == 0:
            min_date = dates[ind]
            ind_holder = ind
        elif dates[ind] < min_date:
            min_date = dates[ind]
            ind_holder = ind
    ind_remove += [ind for ind in clus if ind != ind_holder]

In [34]:
len(ind_remove)

129
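
The keep-the-earliest rule can be written more compactly with `min`. A minimal sketch on toy data; `indices_to_remove` is a hypothetical helper. On ties, `min` returns the first index in the cluster, which matches the strict `<` comparison in the loop above.

```python
from datetime import datetime

def indices_to_remove(clusters, dates):
    """For each cluster keep the index with the earliest date; return all the others."""
    to_remove = []
    for cluster in clusters:
        keep = min(cluster, key=lambda idx: dates[idx])
        to_remove.extend(idx for idx in cluster if idx != keep)
    return to_remove

# toy data (assumed): indices 1 and 2 share the earliest date, so index 1 is kept
toy_dates = [datetime(2016, 4, 18), datetime(2016, 4, 17),
             datetime(2016, 4, 17), datetime(2016, 4, 19)]
toy_clusters = [[0, 1, 2], [3]]
removed = indices_to_remove(toy_clusters, toy_dates)
print(removed)
```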

In [35]:
import time
time.time() - time.time()

0.0

In [36]:
from deduplication import load_and_deduplicate

In [37]:
deduplicator = load_and_deduplicate(data_dir)

In [38]:
begin_date = '2016-4-17'
end_date = '2019-4-17'
window_size = 4
#deduplicator.naive_deduplication(begin_date, end_date, window_size)

In [39]:
deduplicator.reset_holders()

In [40]:
bdate = datetime.datetime.strptime(begin_date, '%Y-%m-%d')
edate = datetime.datetime.strptime(end_date, '%Y-%m-%d')

fnames = deduplicator.get_file_names(bdate, edate)

In [41]:
deduplicator.load_data(fnames)

In [42]:
len(deduplicator.current_dates),len(deduplicator.current_texts)

(1482778, 1482778)

In [43]:
deduplicator.naive_deduplication(begin_date, end_date, window_size)

beginning deduplcation process
deduplication completed for iteration 0
time lapsed so far: 1.5008199214935303
deduplication completed for iteration 1
time lapsed so far: 2.510611057281494
deduplication completed for iteration 2
time lapsed so far: 3.694103717803955
deduplication completed for iteration 3
time lapsed so far: 4.493620872497559
deduplication completed for iteration 4
time lapsed so far: 5.73867392539978
deduplication completed for iteration 5
time lapsed so far: 5.971966028213501
deduplication completed for iteration 6
time lapsed so far: 7.24242377281189
deduplication completed for iteration 7
time lapsed so far: 7.87320876121521
deduplication completed for iteration 8
time lapsed so far: 8.999205827713013
deduplication completed for iteration 9
time lapsed so far: 9.997784852981567
deduplication completed for iteration 10
time lapsed so far: 10.91395902633667
deduplication completed for iteration 11
time lapsed so far: 12.066958904266357
deduplication completed for iter

deduplication completed for iteration 104
time lapsed so far: 116.3903238773346
deduplication completed for iteration 105
time lapsed so far: 118.32860803604126
deduplication completed for iteration 106
time lapsed so far: 119.396488904953
deduplication completed for iteration 107
time lapsed so far: 121.20257997512817
deduplication completed for iteration 108
time lapsed so far: 122.93316793441772
deduplication completed for iteration 109
time lapsed so far: 124.75444674491882
deduplication completed for iteration 110
time lapsed so far: 126.58229207992554
deduplication completed for iteration 111
time lapsed so far: 127.72394299507141
deduplication completed for iteration 112
time lapsed so far: 129.47446298599243
deduplication completed for iteration 113
time lapsed so far: 130.27503895759583
deduplication completed for iteration 114
time lapsed so far: 132.1923930644989
deduplication completed for iteration 115
time lapsed so far: 133.97399497032166
deduplication completed for iter

deduplication completed for iteration 206
time lapsed so far: 254.77560305595398
deduplication completed for iteration 207
time lapsed so far: 256.22439980506897
deduplication completed for iteration 208
time lapsed so far: 257.1017098426819
deduplication completed for iteration 209
time lapsed so far: 257.7472677230835
deduplication completed for iteration 210
time lapsed so far: 258.49242091178894
deduplication completed for iteration 211
time lapsed so far: 258.9972770214081
deduplication completed for iteration 212
time lapsed so far: 259.6318929195404
deduplication completed for iteration 213
time lapsed so far: 260.0516788959503
deduplication completed for iteration 214
time lapsed so far: 261.52520775794983
deduplication completed for iteration 215
time lapsed so far: 262.6419789791107
deduplication completed for iteration 216
time lapsed so far: 264.0982418060303
deduplication completed for iteration 217
time lapsed so far: 265.6918008327484
deduplication completed for iteratio

In [46]:
len(deduplicator.texts), len(deduplicator.dates)

(491369, 491369)

In [47]:
deduplicator.texts

['发现我的微博会自动取消关注,,,也是醉了,,,您这么自说自话地取消真的好吗？问过我吗？我很怕的你知道吗？【本周刚和人探讨过人工智能的觉醒】',
 '@里也嗒 @smdzbagsgabpygzhxtgzdjs 好可怕的人工智能',
 '【盘面热点掘金】民办学校将实施分类管理 鼓励社会力量办教育；能源技术革命创新行动计划发布；棉花期货7个交易日暴涨17%；收储效果逐步显现 稀土价格全线反弹 ；全球人工智能大会周五召开 O【盘面热点掘金】 【订阅后即可查看更多相关个股】',
 '癌症进化论 O网页链接 斯坦福大学发起人工智能百年研究计划 O网页链接 澳洲两岁女孩将接受世界首个3D打印功能性假耳 O网页链接 能分辨肿瘤和健康脑组织的智能手术刀 O网页链接',
 '[出售] iPod //如果你对自媒体的人工智能方向感兴趣, 请与我们私信联系.',
 '人工智能不能代替人类去进化。它只有同，没有和。它没有生和死。它没有精神所附的肉体。除非它的哲学不是人类的哲学。如果它的哲学和人类不同，它也就不会代替人类去进化。但是它可能会像电影里的故事情节，操纵人或者把人当作奴隶。',
 '//@数急://@Copper_PKU://@快乐的孩纸: //@王树森CS://@南大周志华: CS里面人工智能领域民科最多。曾遇到人严肃讨论“机器啥时统治世界”，“统治不了大城市统治没见识村民组成的偏僻小山村总行吧？”。。。三板斧下来教授落荒而逃',
 '最近互联网圈又出现不少说法，说各种机会是风口：有说视频直播的、有说VR/AR的、有说人工智能的……然而按照某句业界老话"猪都能飞起来"去推断，等风口的多半觉得自己是猪。一个想法，不一定对。',
 '刚想到一件事，子承父业、女承母业在现代大都市已经越来越不现实了。性格差异、产业变迁，以及人工智能加速发展，都会导致两代人职业上的差异。那么，问题来了，教育怎么做，专业怎么选?',
 '人工智能的可怕@吃天下贾妮儿 你用的是iPhone吧，可以试试看',
 '未来三十年内，哪些行业的工作人员将被人工智能取代？失业的人类何去何从？ - 回答作者：知乎用户 O网页链接（想看更多？下载知乎 App：知乎）',
 '【短片】当你有一个机器人妈妈（人工智能黑色科幻 ） UP主: Al_Ica #哔哩哔哩动画# L【短片】当你有一个机器人妈妈（人工智能黑色科幻 ）'

In [49]:
set(deduplicator.dates)

{datetime.datetime(2012, 11, 2, 0, 0),
 datetime.datetime(2013, 5, 10, 0, 0),
 datetime.datetime(2015, 3, 10, 0, 0),
 datetime.datetime(2016, 4, 17, 0, 0),
 datetime.datetime(2016, 4, 18, 0, 0),
 datetime.datetime(2016, 4, 19, 0, 0),
 datetime.datetime(2016, 4, 20, 0, 0),
 datetime.datetime(2016, 4, 21, 0, 0),
 datetime.datetime(2016, 4, 22, 0, 0),
 datetime.datetime(2016, 4, 23, 0, 0),
 datetime.datetime(2016, 4, 24, 0, 0),
 datetime.datetime(2016, 4, 25, 0, 0),
 datetime.datetime(2016, 4, 26, 0, 0),
 datetime.datetime(2016, 4, 27, 0, 0),
 datetime.datetime(2016, 4, 29, 0, 0),
 datetime.datetime(2016, 5, 2, 0, 0),
 datetime.datetime(2016, 5, 3, 0, 0),
 datetime.datetime(2016, 5, 4, 0, 0),
 datetime.datetime(2016, 5, 5, 0, 0),
 datetime.datetime(2016, 5, 6, 0, 0),
 datetime.datetime(2016, 5, 7, 0, 0),
 datetime.datetime(2016, 5, 9, 0, 0),
 datetime.datetime(2016, 5, 11, 0, 0),
 datetime.datetime(2016, 5, 12, 0, 0),
 datetime.datetime(2016, 5, 13, 0, 0),
 datetime.datetime(2016, 5, 15, 

In [51]:
deduplicator.save_all(save_dir, save_name = 'weibo_all')