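The Twitter account WeRateDogs rates people's dogs with humorous commentary. The ratings almost always have a denominator of 10. The numerators, however, are usually greater than 10.
The archive contains basic tweet data for the 5000+ tweets posted up to April 1, 2017.
You need to create written documents that include images and export them as PDF files. This can be done in a Jupyter Notebook, but a word processor such as Google Docs (free) or Microsoft Word is a better choice.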

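## Background

Your goal: wrangle the WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is large, but it contains only basic tweet information. Additional gathering, assessing, and cleaning is required before the analyses and visualizations can be "Wow!"-worthy.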

## Data
### Twitter archive (to be enhanced)
The file is tweet_json.txt.
The WeRateDogs Twitter archive contains basic tweet data for 5000+ tweets, but not everything.

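1\. One column of the archive holds each **tweet's text**, which I can use to **extract the rating, the dog's name, and the dog "stage"** (i.e. doggo, floof(er), pupper, and puppo).  
The data was extracted programmatically, but the ratings are not all correct, and neither are the dog names and stages (see below for more on these). These columns need to be assessed and cleaned before they are used for analysis and visualization.  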
- rating
- dog name
- dog stage
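
As a rough illustration of the programmatic extraction described above, the sketch below pulls a rating, a name, and a stage out of a single tweet text with regular expressions. The patterns and the function name `parse_tweet_text` are assumptions made for this example, not the method that produced the archive, and real tweets will need more careful handling.

```python
import re

# Hypothetical patterns, chosen for illustration only.
RATING_RE = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")
NAME_RE = re.compile(r"\b(?:This is|Meet) ([A-Z][a-z]+)\b")
STAGE_RE = re.compile(r"\b(doggo|floofer|pupper|puppo)\b", re.IGNORECASE)


def parse_tweet_text(text):
    """Return (numerator, denominator, name, stage); missing parts are None."""
    rating = RATING_RE.search(text)
    name = NAME_RE.search(text)
    stage = STAGE_RE.search(text)
    numerator = float(rating.group(1)) if rating else None
    denominator = int(rating.group(2)) if rating else None
    return (
        numerator,
        denominator,
        name.group(1) if name else None,
        stage.group(1).lower() if stage else None,
    )


# Sample text taken from the first tweet shown later in this notebook.
sample = ("This is Phineas. He's a mystical boy. Only ever appears "
          "in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU")
print(parse_tweet_text(sample))
```

Only the first rating-like match is used here, so a tweet that contains a date or another fraction before the rating would be misread. That is one reason the extracted columns still need to be assessed.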

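2\. Two columns are missing from the Twitter archive: retweet count and favorite count. These are additional data gathered from the Twitter API.  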
- retweet count
- favorite count

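### Image predictions file
This file records the output of a neural network that classified the dog breed in every image in the Twitter archive.  
The table of image predictions (top three only) includes each tweet ID, the image URL, and the number of the image that the most confident prediction corresponds to (1 to 4, since a tweet can contain up to four images).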

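- tweet_id is the last part of the tweet URL, after "status/" → https://twitter.com/dog_rates/status/889531135344209921
- p1 is the algorithm's #1 prediction for the image in the tweet → golden retriever
- p1_conf is how confident the algorithm is in its #1 prediction → 95%
- p1_dog is whether or not the #1 prediction is a breed of dog → True
- p2 is the algorithm's second most likely prediction → Labrador retriever
- p2_conf is how confident the algorithm is in its #2 prediction → 1%
- p2_dog is whether or not the #2 prediction is a breed of dog → True
- etc.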
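
One way these columns might be used is to keep, for each image, the highest-ranked prediction that is actually a dog breed. The sketch below works on a plain dict standing in for one row of the table. The function name `best_breed`, the `p3` columns, and the example values for the third prediction are assumptions made for illustration.

```python
def best_breed(row):
    """Return (breed, confidence) of the top-ranked prediction that is a dog."""
    for rank in (1, 2, 3):
        if row[f"p{rank}_dog"]:
            return row[f"p{rank}"], row[f"p{rank}_conf"]
    # None of the three predictions is a dog breed.
    return None, None


# Values for p1 and p2 follow the example above; p3 is made up.
example = {
    "p1": "golden_retriever", "p1_conf": 0.95, "p1_dog": True,
    "p2": "Labrador_retriever", "p2_conf": 0.01, "p2_dog": True,
    "p3": "tennis_ball", "p3_conf": 0.005, "p3_dog": False,
}
print(best_breed(example))
```

With a pandas DataFrame, the same function could be applied row by row through `DataFrame.apply(best_breed, axis=1)`.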




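## Key points
Key points to keep in mind when wrangling the data for this project:

- We only want original ratings (no retweets) that have images.
- Fully assessing and cleaning the entire dataset would take an enormous effort, so only a subset of its issues (at least 8 quality issues and 2 tidiness issues) needs to be assessed and cleaned.
- Cleaning includes merging the individual pieces of data, in line with the rules of tidy data.
- Ratings whose numerator exceeds the denominator do not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.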
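
The first key point can be expressed as a small predicate over one parsed line of tweet_json.txt. This is a minimal sketch that assumes a retweet is marked by a `retweeted_status` key and that images appear under `entities.media`, which is how the records look in the extraction cells further down. The function name is made up for this example, and any media entry is counted as an image.

```python
def is_original_with_image(tweet):
    """True if the tweet is not a retweet and carries at least one media entry."""
    if "retweeted_status" in tweet:
        return False
    media = tweet.get("entities", {}).get("media", [])
    return len(media) > 0


# Made-up records with just the keys the predicate looks at.
original = {"entities": {"media": [{"media_url_https": "https://example.com/a.jpg"}]}}
retweet = {"retweeted_status": {}, "entities": {"media": [{}]}}
text_only = {"entities": {}}
print([is_original_with_image(t) for t in (original, retweet, text_only)])
```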


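## Assessing the project data

After gathering each of the pieces of data above, assess them visually and programmatically for quality and tidiness issues. Detect and document at least 8 quality issues and 2 tidiness issues in your wrangle_act.ipynb Jupyter Notebook. To meet the specifications, the issues that fit the project motivation (see the Key points heading above) must be assessed.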

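## Cleaning the project data

Clean each of the issues you documented while assessing. Perform this cleaning in wrangle_act.ipynb. The result should be a high-quality, tidy master pandas DataFrame (or several DataFrames, if appropriate). The issues that fit the project motivation must be assessed.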

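## Storing, analyzing, and visualizing the project data

Store the clean data in a CSV file named twitter_archive_master.csv. If tidiness requires several tables and additional files exist, give those files sensible names. You may also store the cleaned data in a SQLite database (which can be submitted as well, if needed).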
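
The storage step could look roughly like the sketch below, which writes the same rows to a CSV file and to a SQLite table using only the standard library. The two rows reuse values shown in this notebook. The table name `tweets`, the database file name, and the temporary output folder are assumptions made for illustration.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Two rows standing in for the cleaned data.
rows = [
    {"tweet_id": "892420643555336193", "rating_numerator": 13, "name": "Phineas"},
    {"tweet_id": "892177421306343426", "rating_numerator": 13, "name": "Tilly"},
]
fields = ["tweet_id", "rating_numerator", "name"]

out_dir = Path(tempfile.mkdtemp())  # swap for the project folder in practice
csv_path = out_dir / "twitter_archive_master.csv"

# CSV copy
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)

# SQLite copy; tweet_id is kept as TEXT because it is an identifier,
# not a number to compute with.
db_path = out_dir / "twitter_archive_master.db"
conn = sqlite3.connect(db_path)
conn.execute(
    "CREATE TABLE tweets (tweet_id TEXT PRIMARY KEY, "
    "rating_numerator INTEGER, name TEXT)"
)
conn.executemany(
    "INSERT INTO tweets VALUES (:tweet_id, :rating_numerator, :name)", rows
)
conn.commit()
stored = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
conn.close()
print(stored)
```

When the cleaned data lives in a pandas DataFrame, `DataFrame.to_csv` and `DataFrame.to_sql` cover the same two steps.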

Analyze and visualize the cleaned data in the wrangle_act.ipynb Jupyter Notebook. At least 3 insights and 1 visualization must be produced.

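- wrangle_act.ipynb: code for gathering, assessing, cleaning, analyzing, and visualizing the data
- wrangle_report.pdf: documentation of the data wrangling steps (gathering, assessing, and cleaning)
- act_report.pdf: documentation of the observations and analysis of the final data
- twitter_archive_enhanced.csv: the file provided
- image_predictions.tsv: the file downloaded programmatically
- tweet_json.txt: the file built via the API
- twitter_archive_master.csv: the merged and cleaned data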

In [56]:
import numpy as np
import pandas as pd
import csv
import re
import json
from pprint import pprint

Read in the data and make copies
- from csv: "twitter-archive-enhanced.csv"
- from json: tweet_json.txt
- from tsv: image-predictions.tsv

In [62]:
df_twitter_archive_enhanced = pd.read_csv("twitter-archive-enhanced.csv")
# df_twitter_archive_enhanced

In [226]:
df_twt_arch = df_twitter_archive_enhanced.copy()

In [None]:
# Column names of the archive, as strings (bare names would raise NameError)
HEADER = ['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id',
          'timestamp', 'source', 'text', 'retweeted_status_id',
          'retweeted_status_user_id', 'retweeted_status_timestamp',
          'expanded_urls', 'rating_numerator', 'rating_denominator',
          'name', 'doggo', 'floofer', 'pupper', 'puppo']
STAGES = ["doggo", "pupper", "puppo", "blep", "snoot", "floof","floofer"]

In [190]:
with open("tweet_json.txt") as f:
    for index, line in enumerate(f.readlines()):
        if index >= 20:
            break
        dct = json.loads(line)
#         pprint(dct)

In [214]:
df_tweet_json_p1 = pd.DataFrame(columns=['tweet_id', 'jpg_url', 'img_num', 'text', 'retweet_count', 'favorite_count'])
except_list = []
with open("tweet_json.txt") as f:

    for index, line in enumerate(f.readlines()):
        try:
            dct = json.loads(line)
            tweet_id = dct["entities"]["media"][0]['expanded_url'].split(sep='/')[-3]
            retweet_count = dct["retweet_count"]
            favorite_count = dct["favorite_count"]
            text = dct["full_text"]
            for i, d in enumerate(dct["extended_entities"]["media"],start=1):
                jpg_url = d["media_url_https"]
                img_num = i
                df_tweet_json_p1.loc[df_tweet_json_p1.shape[0]] = \
                    {'tweet_id':tweet_id, 
                     'jpg_url':jpg_url, 
                     'img_num':img_num, 
                     'text':text, 
                     'retweet_count':retweet_count, 
                     'favorite_count':favorite_count}
        except KeyError as e:
#             print('except:', index,":\t",e)
            except_list.append(index)
df_tweet_json_p1

Unnamed: 0,tweet_id,jpg_url,img_num,text,retweet_count,favorite_count
0,892420643555336193,https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg,1,This is Phineas. He's a mystical boy. Only eve...,8842,39492
1,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,This is Tilly. She's just checking pup on you....,6480,33786
2,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,This is Archie. He is a rare Norwegian Pouncin...,4301,25445
3,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,This is Darla. She commenced a snooze mid meal...,8925,42863
4,891327558926688256,https://pbs.twimg.com/media/DF6hr6AVYAAZ8G8.jpg,1,This is Franklin. He would like you to stop ca...,9721,41016
5,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,This is Franklin. He would like you to stop ca...,9721,41016
6,891087950875897856,https://pbs.twimg.com/media/DF3HwyEWsAABqE6.jpg,1,Here we have a majestic great white breaching ...,3240,20548
7,890971913173991426,https://pbs.twimg.com/media/DF1eOmZXUAALUcq.jpg,1,Meet Jax. He enjoys ice cream so much he gets ...,2142,12053
8,890729181411237888,https://pbs.twimg.com/media/DFyBag_UQAAhhBC.jpg,1,When you watch your owner call another dog a g...,19548,66596
9,890729181411237888,https://pbs.twimg.com/media/DFyBahAVwAAhUTd.jpg,2,When you watch your owner call another dog a g...,19548,66596
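
Appending to a DataFrame one row at a time with `.loc` works, but it gets slow as the frame grows. A possible alternative is to collect plain dicts first and build the DataFrame once at the end. The sketch below shows the per-tweet part with the same keys as the loop above. The helper name `rows_from_tweet` and the sample record are made up for illustration.

```python
import json


def rows_from_tweet(dct):
    """Return one row (a dict) per image attached to a tweet.

    Raises KeyError when the tweet has no media, like the loop above.
    """
    # The tweet ID is the third-from-last path segment of the expanded URL,
    # e.g. https://twitter.com/dog_rates/status/<id>/photo/1
    tweet_id = dct["entities"]["media"][0]["expanded_url"].split("/")[-3]
    return [
        {
            "tweet_id": tweet_id,
            "jpg_url": media["media_url_https"],
            "img_num": img_num,
            "text": dct["full_text"],
            "retweet_count": dct["retweet_count"],
            "favorite_count": dct["favorite_count"],
        }
        for img_num, media in enumerate(dct["extended_entities"]["media"], start=1)
    ]


# A made-up record with the same shape as one line of tweet_json.txt.
line = json.dumps({
    "full_text": "example text 13/10",
    "retweet_count": 1,
    "favorite_count": 2,
    "entities": {"media": [
        {"expanded_url": "https://twitter.com/dog_rates/status/123/photo/1"}]},
    "extended_entities": {"media": [
        {"media_url_https": "https://pbs.twimg.com/media/a.jpg"},
        {"media_url_https": "https://pbs.twimg.com/media/b.jpg"}]},
})
rows = rows_from_tweet(json.loads(line))
print(rows)
```

In the notebook, the rows from every line could be gathered with `all_rows.extend(rows_from_tweet(dct))` inside the `try` block and turned into a frame with a single `pd.DataFrame(all_rows)` call.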


In [211]:
#             tweet_id = dct["entities"]["media"][0]["id"]
#             retweet_count = dct["retweet_count"]
#             favorite_count = dct["favorite_count"]
#             text = dct["full_text"]
#             for i, d in enumerate(dct["extended_entities"]["media"],start=1):
#                 jpg_url = d["media_url_https"]
#                 img_num = i
# Some records raise an error saying that ['media'] does not exist in ['entities']['media'][0]["id"]
print(len(except_list))
except_list_copy = except_list.copy()
# Next, check where else the tweet_id might be stored

279


In [215]:
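# 279 -> 181
# The errors are of two kinds, 'retweeted_status' and 'media'. Repeat the step above and split the errors into these two kinds. Records that raise 'media' have no images; catch both kinds and discard the 'media' ones.
# Continue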

list_retweeted_status = []
list_media = []

df_tweet_json_p2 =  pd.DataFrame(columns=['tweet_id', 'jpg_url', 'img_num', 'text', 'retweet_count', 'favorite_count'])
# print("len(except_list:\t)", len(except_list))
# print("len(except_list_copy:\t)", len(except_list_copy))
except_list_copy = except_list.copy()
quoted_status_list = []
retweeted_status_list = []
with open("tweet_json.txt") as f:
    for index, line in enumerate(f.readlines()):
#         if index != 31:
#             continue
        if index not in except_list_copy:
            continue
        dct = json.loads(line)
#         pprint(dct)
        try:
            tweet_id = dct["retweeted_status"]["entities"]["media"][0]['expanded_url'].split(sep='/')[-3]
            retweet_count = dct["retweeted_status"]["retweet_count"]
            favorite_count = dct["retweeted_status"]["favorite_count"]
            text = dct["retweeted_status"]["full_text"]
            for i, d in enumerate(dct["retweeted_status"]["extended_entities"]["media"],start=1):
                jpg_url = d["media_url_https"]
                img_num = i

#                 df_tt_new.loc[df_tt_new.shape[0]] = \
                df_tweet_json_p2.loc[df_tweet_json_p2.shape[0]] = \
                    {'tweet_id':tweet_id, 
                     'jpg_url':jpg_url, 
                     'img_num':img_num, 
                     'text':text, 
                     'retweet_count':retweet_count, 
                     'favorite_count':favorite_count}
            except_list_copy.remove(index)
#             print("poped:\t", index)
        except KeyError as e:
#             print('except:', index,":\t",e)
            if e.args[0] == 'retweeted_status':
                list_retweeted_status.append(index)
            if e.args[0] == 'media':
                list_media.append(index)
except_list_copy = [elem for elem in except_list_copy if elem not in list_media]
print("len(except_list:\t)", len(except_list))
print("len(except_list_copy:\t)", len(except_list_copy))
print("len(except_list diff:\t)", len(except_list)-len(except_list_copy), '\n',set(except_list)-set(except_list_copy))
df_tweet_json_p2

len(except_list:	) 279
len(except_list_copy:	) 181
len(except_list diff:	) 98 
 {1039, 535, 31, 543, 555, 565, 571, 574, 67, 580, 72, 73, 586, 591, 592, 595, 596, 601, 602, 90, 96, 609, 100, 612, 108, 122, 130, 135, 651, 652, 653, 144, 666, 156, 668, 162, 674, 679, 168, 177, 689, 179, 691, 182, 191, 192, 201, 219, 227, 742, 746, 750, 244, 247, 761, 770, 269, 270, 790, 283, 286, 299, 300, 814, 304, 306, 307, 822, 825, 316, 829, 324, 837, 337, 856, 354, 363, 881, 379, 383, 394, 403, 417, 933, 422, 939, 428, 431, 435, 447, 452, 459, 472, 476, 482, 1008, 503, 1019}


Unnamed: 0,tweet_id,jpg_url,img_num,text,retweet_count,favorite_count
0,878057613040115712,https://pbs.twimg.com/media/DC98vABUIAA97pz.jpg,1,This is Emmy. She was adopted today. Massive r...,7118,42743
1,878057613040115712,https://pbs.twimg.com/media/DC98vAHVoAAUj8d.jpg,2,This is Emmy. She was adopted today. Massive r...,7118,42743
2,878281511006478336,https://pbs.twimg.com/media/DDBIX9QVYAAohGa.jpg,1,Meet Shadow. In an attempt to reach maximum zo...,1338,7890
3,669000397445533696,https://pbs.twimg.com/media/CUjETvDVAAI8LIy.jpg,1,Meet Terrance. He's being yelled at because he...,6925,22047
4,866334964761202691,https://pbs.twimg.com/media/DAXXDQNXgAAoYQH.jpg,1,This is Coco. At first I thought she was a clo...,15442,54493
5,866334964761202691,https://pbs.twimg.com/media/DAXXDQMXoAQa0no.jpg,2,This is Coco. At first I thought she was a clo...,15442,54493
6,873213775632977920,https://pbs.twimg.com/media/DB5HTBGXUAE0TiK.jpg,1,This is Sierra. She's one precious pupper. Abs...,1656,7435
7,873213775632977920,https://pbs.twimg.com/media/DB5HTBMWsAAdrYH.jpg,2,This is Sierra. She's one precious pupper. Abs...,1656,7435
8,872657584259551233,https://pbs.twimg.com/media/DBxNccsXcAEKKpN.jpg,1,Penelope here is doing me quite a divertir. We...,31,717
9,841077006473256960,https://pbs.twimg.com/media/C6wbE5bXUAAh1Hv.jpg,1,This is Dawn. She's just checking pup on you. ...,5956,24841


In [218]:
except_list_copy_copy = except_list_copy.copy()

In [219]:
# 181 -> 
# Check whether each record contains a 'dog_rates/status/' entry; if not, it has no image file, or the image file does not meet the requirements
df_tmp =  pd.DataFrame(columns=['tweet_id', 'jpg_url', 'img_num', 'text', 'retweet_count', 'favorite_count'])
with open("tweet_json.txt") as f:
    for index, line in enumerate(f.readlines()):
#         if index != 31:
#             continue
        if index not in except_list_copy_copy:
            continue
        if line.find(r'dog_rates/status/') == -1:
            except_list_copy_copy.remove(index)
            continue
print("len(except_list_copy:\t)", len(except_list_copy))
print("len(except_list_copy_copy:\t)", len(except_list_copy_copy))
print("len(except_list_copy diff:\t)", len(except_list_copy)-len(except_list_copy_copy))

len(except_list_copy:	) 181
len(except_list_copy_copy:	) 0
len(except_list_copy diff:	) 181


In [227]:
df_twt_json = pd.concat([df_tweet_json_p1,df_tweet_json_p2],ignore_index=True)
df_twt_json

Unnamed: 0,tweet_id,jpg_url,img_num,text,retweet_count,favorite_count
0,892420643555336193,https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg,1,This is Phineas. He's a mystical boy. Only eve...,8842,39492
1,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,This is Tilly. She's just checking pup on you....,6480,33786
2,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,This is Archie. He is a rare Norwegian Pouncin...,4301,25445
3,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,This is Darla. She commenced a snooze mid meal...,8925,42863
4,891327558926688256,https://pbs.twimg.com/media/DF6hr6AVYAAZ8G8.jpg,1,This is Franklin. He would like you to stop ca...,9721,41016
5,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,This is Franklin. He would like you to stop ca...,9721,41016
6,891087950875897856,https://pbs.twimg.com/media/DF3HwyEWsAABqE6.jpg,1,Here we have a majestic great white breaching ...,3240,20548
7,890971913173991426,https://pbs.twimg.com/media/DF1eOmZXUAALUcq.jpg,1,Meet Jax. He enjoys ice cream so much he gets ...,2142,12053
8,890729181411237888,https://pbs.twimg.com/media/DFyBag_UQAAhhBC.jpg,1,When you watch your owner call another dog a g...,19548,66596
9,890729181411237888,https://pbs.twimg.com/media/DFyBahAVwAAhUTd.jpg,2,When you watch your owner call another dog a g...,19548,66596


None of the remaining records contains the expected image URL, so the extraction is complete. Make a backup and save it.

In [228]:
df_twt_json.to_csv("df_tweet_json.csv")

In [229]:
df_img_pre = pd.read_csv("image-predictions.tsv", sep='\t')                     

In [None]:
# Load the backups
# df_img_pre = pd.read_csv("image-predictions.tsv", sep='\t')
# df_twt_json = pd.read_csv("df_tweet_json.csv", index_col=0)
# df_twt_arch = pd.read_csv("twitter-archive-enhanced.csv")

## Cleaning the data
df_img_pre
df_twt_json
df_twt_arch

In [243]:
# Split df_twt_json: extract name, rating, and stage from text
pd.set_option('display.max_colwidth',200)
pd.set_option('display.max_rows', 100)
df_twt_json['text']

0                                                                This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU
1           This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV
2                            This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB
3                                                                      This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ
4           This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f
5           This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected

In [58]:
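HEADER = ['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id',
          'timestamp', 'source', 'text', 'retweeted_status_id',
          'retweeted_status_user_id', 'retweeted_status_timestamp',
          'expanded_urls', 'rating_numerator', 'rating_denominator',
          'name', 'doggo', 'floofer', 'pupper', 'puppo']
# df is assumed to be the archive DataFrame; the summary below matches its numeric columns
df = df_twt_arch
df.describe()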

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0
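
The summary above shows extremes such as a numerator of 1776 and denominators of 0 and 170, which are worth a manual look. One possible way to shortlist such rows is sketched below. The function name `needs_review` and the `max_ratio` cutoff are arbitrary choices for this example. Numerators that merely exceed the denominator are left alone, in line with the key points.

```python
def needs_review(numerator, denominator, max_ratio=2.0):
    """Flag ratings worth checking by hand; max_ratio is an arbitrary cutoff."""
    # A denominator that is zero or not a multiple of 10 is unusual.
    if denominator <= 0 or denominator % 10 != 0:
        return True
    # A rating far above the denominator may be a misread number.
    return numerator / denominator > max_ratio


for pair in [(13, 10), (1776, 10), (0, 0), (84, 70), (24, 7)]:
    print(pair, needs_review(*pair))
```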


In [60]:
df['name'].value_counts()

None          745
a              55
Charlie        12
Lucy           11
Oliver         11
Cooper         11
Penny          10
Lola           10
Tucker         10
Bo              9
Winston         9
Sadie           8
the             8
Bailey          7
Daisy           7
an              7
Toby            7
Buddy           7
Dave            6
Rusty           6
Leo             6
Bella           6
Scout           6
Oscar           6
Koda            6
Stanley         6
Milo            6
Jax             6
Jack            6
Sammy           5
             ... 
Rudy            1
Eazy            1
Bertson         1
Gunner          1
Pepper          1
Marty           1
Bode            1
Tupawc          1
Kendall         1
Jomathan        1
Leonard         1
Jeb             1
Lorelei         1
Jordy           1
Tonks           1
Christoper      1
Kloey           1
space           1
Julius          1
Rinna           1
Laika           1
Banditt         1
Vince           1
Ralphus         1
Fillup    

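## Quality issues
- Dog names were extracted incorrectly: the values include 'None' and 'a', and one dog is called 'space'. Names that are not capitalized are probably wrong
- The dog stages are incomplete: "blep" and "snoot" also occur
- Some rating_numerator and rating_denominator values were extracted incorrectly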
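
The name issue could be handled with a rule like the one sketched below, which treats 'None' and any value that does not start with a capital letter as missing. The rule and the function name `clean_name` are assumptions based on the value counts above, so a few real names might be lost and a few wrong ones kept.

```python
def clean_name(name):
    """Return the name, or None when it does not look like a real name."""
    if name is None or name == "None":
        return None
    # Assumption: real names start with a capital letter, so words such as
    # 'a', 'an', 'the' or 'space' are treated as extraction mistakes.
    if not name[:1].isupper():
        return None
    return name


print([clean_name(n) for n in ["Charlie", "a", "None", "space", "Bo"]])
```
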
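## Tidiness issues
- The dog stage should not be one-hot encoded (one column per stage)
- rating_denominator is arguably unnecessary: if extracted correctly, it should nearly always be 10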

In [49]:
from collections import namedtuple
headers = ['Symbol','Price','Date','Time','Change','Volume']
Row = namedtuple('Row',headers)

In [29]:
import re
# dir(re)
tt = '"truncated": false, "display_text_range": [0, 85]'
# Replace the JSON literal false with Python's False (whole words only)
false = re.compile(r"\bfalse\b")
print(false.sub('False', tt))

"truncated": False, "display_text_range": [0, 85]


In [None]:
with open("tweet_json.txt") as f:
    for i, line in enumerate(f.readlines()):
        if i >= 1:
            break
        dct = json.loads(line)
        pprint(dct)

In [180]:
try:
    10/0
except ZeroDivisionError as e:
    print(e.args[0] == 'division by zero')
    print(e)

True
division by zero


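## Reporting

Create a 300-600 word written report named wrangle_report.pdf that briefly describes your wrangling efforts. This is framed as an internal document.

Create a written report of at least 250 words named act_report.pdf that communicates your insights and displays the visualizations produced from your wrangled data. This is framed as an external document, like a blog post or magazine article.

You can create these documents in a Jupyter Notebook using the Markdown functionality, then download the notebooks as PDF files (see the figure below). A word processor such as Google Docs or Microsoft Word is a better choice, though.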