# 数据收集

本报告的数据来源于3个不同的维度
* 项目提供的twitter_archive_enhanced.csv，其中包含了待分析的基本数据
* 项目提供的tweet_json.txt，其中包含了每条推特信息的转发数和点赞数等额外信息
* 项目同时提供了一个图像预测文件image-predictions.tsv的下载链接（https://raw.githubusercontent.com/udacity/new-dand-advanced-china/master/%E6%95%B0%E6%8D%AE%E6%B8%85%E6%B4%97/WeRateDogs%E9%A1%B9%E7%9B%AE/image-predictions.tsv）

其中twitter_archive_enhanced.csv和tweet_json.txt直接拷贝到了项目目录下
而image-predictions.tsv会使用RequestAPI的方式从网络中获取


In [1]:
# 先通过RequestAPI下载对应的文件，并且将原始文件存储到data_gathering目录中

import requests
import os
import pandas as pd
import json
import time

url = 'https://raw.githubusercontent.com/udacity/new-dand-advanced-china/master/%E6%95%B0%E6%8D%AE%E6%B8%85%E6%B4%97/WeRateDogs%E9%A1%B9%E7%9B%AE/image-predictions.tsv'
r = requests.get(url)

folder_name = 'data_gathering'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

with open(folder_name + '/image-predictions.tsv', 'wb') as f:
    f.write(r.content)

images_origin = pd.read_csv('data_gathering/image-predictions.tsv', sep = '\t', encoding = 'utf-8')
images = images_origin.copy()
images.info()
print('done')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [2]:
# 读取对应的twitter_archive_enhanced.csv文件
archive_origin = pd.read_csv('twitter-archive-enhanced.csv', encoding = 'utf-8')
archive = archive_origin.copy()

archive['text'][0]

"This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU"

In [3]:
#读取对应的tweet_json.txt文件，并且将数据按照需求，进行获取
#我们在twitter_archive文件中有的tweet才去获取它的扩展信息
with open('tweet_json.txt') as f:
    content = f.readlines()
    
json_list = [json.loads(line) for line in open('tweet_json.txt')]
json_tweet = [{'tweet_id': int(tweet_obj['id']),
        'date_time': pd.to_datetime(tweet_obj['created_at']),
        'favorites': tweet_obj['favorite_count'],
        'retweets': tweet_obj['retweet_count'],
        'user_followers': tweet_obj['user']['followers_count'],
        'user_favourites': tweet_obj['user']['favourites_count']} for tweet_obj in json_list]


json_tweet[9]

{'tweet_id': 890240255349198849,
 'date_time': Timestamp('2017-07-26 15:59:51'),
 'favorites': 32467,
 'retweets': 7684,
 'user_followers': 3768792,
 'user_favourites': 120162}

In [4]:
# 将处理过的字段的json文件转换成csv文件，便于分析
json_tweet = pd.DataFrame(json_tweet, columns = ['tweet_id', 'date_time', 'favorites', 'retweets',
                                               'user_followers', 'user_favourites'])

json_tweet.to_csv('new_tweet_json.txt', encoding = 'utf-8', index=False)
json_tweets_csv = pd.read_csv('new_tweet_json.txt', encoding = 'utf-8')


In [5]:
json_tweets_csv.user_followers.sort_values()

json_tweets_csv.favorites.sort_values()

json_tweets_csv.retweets.sort_values()

288         0
1291        2
271         3
339         3
112         3
29          4
1076        6
54          7
424        10
63         10
2294       14
2335       15
183        17
1519       19
176        20
209        23
2315       23
2185       25
1233       26
185        28
406        30
100        31
608        32
2220       34
2256       34
2255       37
963        37
282        37
2333       37
881        38
        ...  
935     23870
307     23870
526     23959
447     23959
886     24183
1073    24183
651     24370
1621    24370
152     24997
114     27502
621     27586
1762    30797
456     31140
300     31140
1826    31810
166     32589
130     32705
133     32705
866     33230
162     33231
534     40437
446     42045
443     42045
65      45655
410     47958
814     52101
1075    52101
531     56373
257     56373
1035    79116
Name: retweets, Length: 2352, dtype: int64

# 评估数据

在收集到数据之后，我们会通过编程和肉眼分析的方式，来初步定为数据的整洁度和数据质量问题

In [6]:
archive.info()
archive
archive['name'].value_counts()

archive['rating_denominator'].value_counts()
archive['rating_numerator'].value_counts()
archive['score_ratio'] = archive['rating_numerator']/archive['rating_denominator']

#archive.sample(5)

archive[archive['in_reply_to_user_id'].notnull()].sample(4)


archive['rating_numerator'].value_counts()
archive['rating_denominator'].value_counts()

archive.describe()


archive[archive['rating_numerator']==1776]

archive[archive.tweet_id.duplicated()]

archive.rating_numerator.sort_values()

archive

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,score_ratio
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,,1.3
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,,1.3
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,,1.2
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,,1.3
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,,1.2
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,,1.3
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,,1.3
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,,1.3
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,,1.3
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,,1.4


In [7]:
# 代码方式看看image数据集
images
images.info()
images['jpg_url'].value_counts()
images[images['jpg_url'] == 'https://pbs.twimg.com/media/CiibOMzUYAA9Mxz.jpg']

images[images['jpg_url'].duplicated()]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1297,752309394570878976,https://pbs.twimg.com/ext_tw_video_thumb/67535...,1,upright,0.303415,False,golden_retriever,0.181351,True,Brittany_spaniel,0.162084,True
1315,754874841593970688,https://pbs.twimg.com/media/CWza7kpWcAAdYLc.jpg,1,pug,0.272205,True,bull_mastiff,0.251530,True,bath_towel,0.116806,False
1333,757729163776290825,https://pbs.twimg.com/media/CWyD2HGUYAQ1Xa7.jpg,2,cash_machine,0.802333,False,schipperke,0.045519,True,German_shepherd,0.023353,True
1345,759159934323924993,https://pbs.twimg.com/media/CU1zsMSUAAAS0qW.jpg,1,Irish_terrier,0.254856,True,briard,0.227716,True,soft-coated_wheaten_terrier,0.223263,True
1349,759566828574212096,https://pbs.twimg.com/media/CkNjahBXAAQ2kWo.jpg,1,Labrador_retriever,0.967397,True,golden_retriever,0.016641,True,ice_bear,0.014858,False
1364,761371037149827077,https://pbs.twimg.com/tweet_video_thumb/CeBym7...,1,brown_bear,0.713293,False,Indian_elephant,0.172844,False,water_buffalo,0.038902,False
1368,761750502866649088,https://pbs.twimg.com/media/CYLDikFWEAAIy1y.jpg,1,golden_retriever,0.586937,True,Labrador_retriever,0.398260,True,kuvasz,0.005410,True
1387,766078092750233600,https://pbs.twimg.com/media/ChK1tdBWwAQ1flD.jpg,1,toy_poodle,0.420463,True,miniature_poodle,0.132640,True,Chesapeake_Bay_retriever,0.121523,True
1407,770093767776997377,https://pbs.twimg.com/media/CkjMx99UoAM2B1a.jpg,1,golden_retriever,0.843799,True,Labrador_retriever,0.052956,True,kelpie,0.035711,True
1417,771171053431250945,https://pbs.twimg.com/media/CVgdFjNWEAAxmbq.jpg,3,Samoyed,0.978833,True,Pomeranian,0.012763,True,Eskimo_dog,0.001853,True


In [8]:
json_tweets_csv

json_tweets_csv.info()

#json_tweets_csv[json_tweets_csv['tweet_id'].duplicated()]

archive[archive['tweet_id'].duplicated()]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2352 entries, 0 to 2351
Data columns (total 6 columns):
tweet_id           2352 non-null int64
date_time          2352 non-null object
favorites          2352 non-null int64
retweets           2352 non-null int64
user_followers     2352 non-null int64
user_favourites    2352 non-null int64
dtypes: int64(5), object(1)
memory usage: 110.3+ KB


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,score_ratio


# 数据质量

数据质量从下面几个维度考虑：
* 完整性
* 有效性
* 准确性
* 一致性

### archive 表格

* 狗的名字存在一些错误的单词（such,quite,not,very,just,my,his,one,a,an）
* 需要过滤掉转发的记录，只保存原始评分的tweet记录
* 字段属性不正确（retweeted_status_id，retweeted_status_user_id）
* timestamp的数据类型应该为datetime，而不是object
* 对于“空字段”的表示，可以统一。（NaN，None）


### image 表格

* 被分析的图片的url有重复
* 数据缺失，archive中有2356条记录，而images中只有2075条记录


### tweet_json 表格

* 数据只有2352条，与archive中的数据不一致



# 数据整洁度

* 狗的种类，合并成一列，将具体的类型填写其中
* 应该将3个表格的信息进行合并
* 对于图片预测的概率（p1_conf,p2_conf,p3_conf），及预测的狗的种类(p1, p2, p3)进行合并

In [9]:
all_columns = pd.Series(list(archive) + list(images) + list(json_tweets_csv))

all_columns[all_columns.duplicated()]
list(json_tweets_csv)
list(archive)
list(images)

archive.info()
images.info()
json_tweets_csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 18 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
score_ratio                   23

# 清理数据

根据上面分析的质量和整洁度问题，进行数据的清理

### 缺失数据

#### 我们以archive的数据为主数据（2356条），其他的数据都以tweet_id进行合并

##### 定义
 其中获取image和主表的公共部分,并且将他们拼接在一起
 
##### 代码 

In [10]:
# 先将所有的数据列进行合并

df_all = pd.merge(archive, images, how = 'inner', on = ['tweet_id'] )
df_all.to_csv('df_all.csv', encoding = 'utf-8')


#### 测试

In [11]:
df_all.info()
df_all.sample(5)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2075 entries, 0 to 2074
Data columns (total 29 columns):
tweet_id                      2075 non-null int64
in_reply_to_status_id         23 non-null float64
in_reply_to_user_id           23 non-null float64
timestamp                     2075 non-null object
source                        2075 non-null object
text                          2075 non-null object
retweeted_status_id           81 non-null float64
retweeted_status_user_id      81 non-null float64
retweeted_status_timestamp    81 non-null object
expanded_urls                 2075 non-null object
rating_numerator              2075 non-null int64
rating_denominator            2075 non-null int64
name                          2075 non-null object
doggo                         2075 non-null object
floofer                       2075 non-null object
pupper                        2075 non-null object
puppo                         2075 non-null object
score_ratio                   2075 

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,...,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
180,849776966551130114,,,2017-04-06 00:13:11 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Seriously guys? Again? We only rate dogs. Plea...,,,,https://twitter.com/dog_rates/status/849776966...,...,2,Chihuahua,0.292092,True,toy_terrier,0.136852,True,bonnet,0.103111,False
650,772152991789019136,,,2016-09-03 19:23:13 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here's a couple rufferees making sure all the ...,,,,https://twitter.com/dog_rates/status/772152991...,...,2,golden_retriever,0.275318,True,Irish_setter,0.100988,True,vizsla,0.073525,True
236,837471256429613056,,,2017-03-03 01:14:41 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Vincent. He's suave as h*ck. Will be y...,,,,https://twitter.com/dog_rates/status/837471256...,...,1,Norwegian_elkhound,0.976255,True,keeshond,0.01399,True,seat_belt,0.002111,False
1874,669680153564442624,,,2015-11-26 00:52:45 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Shawwn. He's a Turkish Gangrene Robitu...,,,,https://twitter.com/dog_rates/status/669680153...,...,1,dalmatian,0.141257,True,borzoi,0.137744,True,Labrador_retriever,0.103792,True
1400,683391852557561860,,,2016-01-02 20:58:09 +0000,"<a href=""http://twitter.com/download/iphone"" r...","Say hello to Jack (pronounced ""Kevin""). He's a...",,,,https://twitter.com/dog_rates/status/683391852...,...,1,French_bulldog,0.992833,True,Boston_bull,0.004749,True,pug,0.001392,True


#### 定义
   并且将tweet_json详细信息拼接到原始的tweet数据中，保留archive中的所有数据
   
#### 代码   

In [12]:
df_all = pd.merge(df_all, json_tweets_csv, how = 'left', on = ['tweet_id'])

#### 测试

In [13]:
df_all.info()
df_all.sample(5)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2075 entries, 0 to 2074
Data columns (total 34 columns):
tweet_id                      2075 non-null int64
in_reply_to_status_id         23 non-null float64
in_reply_to_user_id           23 non-null float64
timestamp                     2075 non-null object
source                        2075 non-null object
text                          2075 non-null object
retweeted_status_id           81 non-null float64
retweeted_status_user_id      81 non-null float64
retweeted_status_timestamp    81 non-null object
expanded_urls                 2075 non-null object
rating_numerator              2075 non-null int64
rating_denominator            2075 non-null int64
name                          2075 non-null object
doggo                         2075 non-null object
floofer                       2075 non-null object
pupper                        2075 non-null object
puppo                         2075 non-null object
score_ratio                   2075 

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,...,p2_conf,p2_dog,p3,p3_conf,p3_dog,date_time,favorites,retweets,user_followers,user_favourites
80,874057562936811520,,,2017-06-12 00:15:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...","I can't believe this keeps happening. This, is...",,,,https://twitter.com/dog_rates/status/874057562...,...,0.040437,True,Newfoundland,0.028228,True,2017-06-12 00:15:36,23061.0,4107.0,3768812.0,120162.0
1693,673317986296586240,,,2015-12-06 01:48:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Take a moment and appreciate how these two dog...,,,,https://twitter.com/dog_rates/status/673317986...,...,0.079923,True,Rottweiler,0.068594,True,2015-12-06 01:48:12,920.0,292.0,3768961.0,120161.0
661,771004394259247104,,,2016-08-31 15:19:06 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @katieornah: @dog_rates learning a lot at c...,7.710021e+17,1732729000.0,2016-08-31 15:10:07 +0000,https://twitter.com/katieornah/status/77100213...,...,0.052741,False,pop_bottle,0.048821,False,2016-08-31 15:19:06,0.0,252.0,3768926.0,120161.0
1132,704113298707505153,,,2016-02-29 01:17:46 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Stanley. He's an inverted Uzbekistani wat...,,,,https://twitter.com/dog_rates/status/704113298...,...,0.018231,False,sea_lion,0.015861,False,2016-02-29 01:17:46,2022.0,629.0,3768839.0,120161.0
1600,675147105808306176,,,2015-12-11 02:56:28 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you're presenting a group project and the...,,,,https://twitter.com/dog_rates/status/675147105...,...,0.016765,True,flat-coated_retriever,0.010637,True,2015-12-11 02:56:28,1020.0,273.0,3768941.0,120161.0


### 整洁度

#### 去掉主数据中关于转发的所有数据

#### 定义
archive数据中，如果记录中的retweeted_status_id不为空，则说明这条记录是转发，需要将其去除

#### 代码

In [14]:
df_all = df_all[pd.isnull(df_all.retweeted_status_id)]


# 对于没有提供图片分析数据的tweet记录删除掉
df_all = df_all.dropna(subset = ['jpg_url'])

#### 测试

In [15]:
df_all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1994 entries, 0 to 2074
Data columns (total 34 columns):
tweet_id                      1994 non-null int64
in_reply_to_status_id         23 non-null float64
in_reply_to_user_id           23 non-null float64
timestamp                     1994 non-null object
source                        1994 non-null object
text                          1994 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 1994 non-null object
rating_numerator              1994 non-null int64
rating_denominator            1994 non-null int64
name                          1994 non-null object
doggo                         1994 non-null object
floofer                       1994 non-null object
pupper                        1994 non-null object
puppo                         1994 non-null object
score_ratio                   1994 non

####  doggo需要4列进行表示

#### 定义
  * 数据中存在多个狗狗的类型
  * 重新解析text字段，获取完整的狗狗类型
  
#### 代码

In [16]:
df_all['dog_stage'] = df_all['text'].str.lower().str.findall('doggo|floofer|pupper|puppo').apply(lambda x: ','.join(set(x)))
df_all = df_all.sort_values('dog_stage').drop_duplicates('tweet_id', keep = 'last')

melt_column = ['doggo', 'floofer', 'pupper', 'puppo']
df_all = df_all.drop(columns=melt_column)


#### 测试

In [17]:
df_all.info()
df_all.sample()
df_all['dog_stage'].value_counts()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1994 entries, 0 to 330
Data columns (total 31 columns):
tweet_id                      1994 non-null int64
in_reply_to_status_id         23 non-null float64
in_reply_to_user_id           23 non-null float64
timestamp                     1994 non-null object
source                        1994 non-null object
text                          1994 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 1994 non-null object
rating_numerator              1994 non-null int64
rating_denominator            1994 non-null int64
name                          1994 non-null object
score_ratio                   1994 non-null float64
jpg_url                       1994 non-null object
img_num                       1994 non-null int64
p1                            1994 non-null object
p1_conf                       1994 non-

                 1652
pupper            228
doggo              68
puppo              27
pupper,doggo        9
floofer             7
doggo,puppo         2
doggo,floofer       1
Name: dog_stage, dtype: int64

### 数据质量：狗的评分存在解析问题
  * 分子的数值没有正确的提取 
  * 存在多个评分，但是没有全部参考  

#### 定义
  * 从text字段中获取到多个评分
  * 得到每个评分的相对值，例如13/10,分数为1.3分
  * 如果有多个分数时，会计算平均的分数，并且添加一列“score_ratio”
  * 即使存在多个分数时，还是将第一个分数的分子设置在rating_numerator中，分母写入到rating_denominator的列中
  
  
#### 代码  

In [18]:
import re

# 将text重新解析，得到多个评分时，计算其平均分
def transform(row):
    # 根据空格分开 
    x = row['text']
    parts = re.split(r'\s+', x)
    
    # 求平均
    sum = 0
    count = 0
    score_ratio = 0
    for item in parts:
        m = re.match(r'^((?:\d+)(?:\.\d+)?)/(\d+)$', item)
        if m:
            # 将解析出的分数，如13/10，将13的值赋值给rating_numerator，10赋值给rating_denominator
            row['rating_numerator'] = m[1]
            row['rating_denominator'] = m[2]
            sum += float(m.group(1)) / float(m.group(2))
            count += 1
    if(count != 0):
        score_ratio = sum / count
    row['score_ratio'] = score_ratio
    return row
    
# 将处理后的数据，重新赋值给df_all
df_all = df_all.apply(transform, axis =1)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,...,p2_dog,p3,p3_conf,p3_dog,date_time,favorites,retweets,user_followers,user_favourites,dog_stage
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,...,False,banana,0.076110,False,2017-08-01 16:23:56,39492.0,8842.0,3768791.0,120162.0,
1385,684188786104872960,,,2016-01-05 01:44:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...","""Yo Boomer I'm taking a selfie, grab your stic...",,,,https://twitter.com/dog_rates/status/684188786...,...,True,Staffordshire_bullterrier,0.069760,True,2016-01-05 01:44:52,3810.0,1336.0,3768894.0,120161.0,
1384,684195085588783105,,,2016-01-05 02:09:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tino. He really likes corndogs. 9/10 h...,,,,https://twitter.com/dog_rates/status/684195085...,...,True,Boston_bull,0.095981,True,2016-01-05 02:09:54,2096.0,593.0,3768894.0,120161.0,
1380,684241637099323392,,,2016-01-05 05:14:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Obi. He got camera shy. 12/10 https://...,,,,https://twitter.com/dog_rates/status/684241637...,...,False,weasel,0.051280,False,2016-01-05 05:14:53,8956.0,3711.0,3768894.0,120161.0,
1379,684460069371654144,,,2016-01-05 19:42:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Jeph. He's a Western Sagittarius Dookm...,,,,https://twitter.com/dog_rates/status/684460069...,...,True,American_Staffordshire_terrier,0.059471,True,2016-01-05 19:42:51,2163.0,627.0,3768894.0,120161.0,
1378,684481074559381504,,,2016-01-05 21:06:19 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Pippa. She's an Elfin High Feta. Compact ...,,,,https://twitter.com/dog_rates/status/684481074...,...,True,polecat,0.017357,False,2016-01-05 21:06:19,4233.0,1315.0,3768894.0,120161.0,
1377,684538444857667585,6.844811e+17,4.196984e+09,2016-01-06 00:54:18 +0000,"<a href=""http://twitter.com/download/iphone"" r...","After watching this video, we've determined th...",,,,https://twitter.com/dog_rates/status/684538444...,...,False,macaque,0.043325,False,2016-01-06 00:54:18,2902.0,1082.0,3768894.0,120161.0,
1376,684567543613382656,,,2016-01-06 02:49:55 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Bobby. He doesn't give a damn about pe...,,,,https://twitter.com/dog_rates/status/684567543...,...,False,seat_belt,0.209393,False,2016-01-06 02:49:55,3289.0,1410.0,3768894.0,120161.0,
1375,684594889858887680,,,2016-01-06 04:38:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...","""FOR THE LAST TIME I DON'T WANNA PLAY TWISTER ...",,,,https://twitter.com/dog_rates/status/684594889...,...,True,Brittany_spaniel,0.003879,True,2016-01-06 04:38:35,9807.0,3993.0,3768894.0,120161.0,
1374,684800227459624960,,,2016-01-06 18:14:31 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Theodore. He's dapper as hell. Probably o...,,,,https://twitter.com/dog_rates/status/684800227...,...,True,West_Highland_white_terrier,0.120992,True,2016-01-06 18:14:31,2961.0,1114.0,3768894.0,120161.0,


#### 测试

In [20]:
df_all.info()
df_all['score_ratio']

pd.set_option('max_colwidth', 200)
df_all[df_all['score_ratio'] > 2][['text', 'score_ratio', 'rating_numerator', 'rating_denominator']]

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1994 entries, 0 to 330
Data columns (total 31 columns):
tweet_id                      1994 non-null int64
in_reply_to_status_id         23 non-null float64
in_reply_to_user_id           23 non-null float64
timestamp                     1994 non-null object
source                        1994 non-null object
text                          1994 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 1994 non-null object
rating_numerator              1994 non-null int64
rating_denominator            1994 non-null int64
name                          1994 non-null object
score_ratio                   1994 non-null float64
jpg_url                       1994 non-null object
img_num                       1994 non-null int64
p1                            1994 non-null object
p1_conf                       1994 non-

Unnamed: 0,text,score_ratio,rating_numerator,rating_denominator
1797,After so many requests... here you go.\n\nGood dogg. 420/10 https://t.co/yfAAo1gdeY,42.0,420,10
416,Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx,3.428571,24,7
559,"This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS",7.5,75,10
804,This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh,177.6,1776,10
1453,Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10 https://t.co/eNm2S6p9BD,2.6,26,10
615,This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://t.co/QFaUiIHxHq,2.7,27,10


#### 测试

In [277]:
df_all.sample(1)

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,score_ratio,jpg_url,img_num,date_time,favorites,retweets,user_followers,user_favourites,dog_stage,date_month
1286,690400367696297985,2016-01-22 05:07:29,"<a href=""http://twitter.com/download/iphone"" r...",This is Eriq. His friend just reminded him of ...,https://twitter.com/dog_rates/status/690400367...,10,10,Eriq,1.0,https://pbs.twimg.com/media/CZTLeBuWIAAFkeR.jpg,1,2016-01-22 05:07:29,2035.0,509.0,3768875.0,120161.0,,2016-01-01


### 冗余数据处理

#### 定义
   由于这里会分析转发和内容的相关性，如果从这个角度去分析，这里并不会使用到狗狗具体品种，所以考虑将这些数据先删除<br>
   所以只是将一些会使用到数据进行了保留
   
#### 代码   

In [278]:
df_all = df_all.drop(['retweeted_status_id','retweeted_status_user_id', 'retweeted_status_timestamp'], 1)

df_all = df_all.drop(['in_reply_to_status_id','in_reply_to_user_id'], 1)
df_all = df_all.drop(['p1','p1_conf','p2','p2_conf','p3','p3_conf','p2_dog','p3_dog','p1_dog'], 1)


df_all.to_csv('twitter_archive_master.csv', index=False, encoding = 'utf-8')

KeyError: "labels ['retweeted_status_id' 'retweeted_status_user_id'\n 'retweeted_status_timestamp'] not contained in axis"

#### 测试

In [279]:
df_master = pd.read_csv('twitter_archive_master.csv')
df_master.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1994 entries, 0 to 1993
Data columns (total 17 columns):
tweet_id              1994 non-null int64
timestamp             1994 non-null object
source                1994 non-null object
text                  1994 non-null object
expanded_urls         1994 non-null object
rating_numerator      1994 non-null int64
rating_denominator    1994 non-null int64
name                  1994 non-null object
score_ratio           1994 non-null float64
jpg_url               1994 non-null object
img_num               1994 non-null int64
date_time             1994 non-null object
favorites             1994 non-null float64
retweets              1994 non-null float64
user_followers        1994 non-null float64
user_favourites       1994 non-null float64
dog_stage             342 non-null object
dtypes: float64(5), int64(4), object(8)
memory usage: 264.9+ KB


### 新维度统计数据：因为想要分析转发和内容质量之间的关系，所以额外整理了一个转发数据相关的文件

#### 定义
  将数据按照月的维度，进行重新的合并，分别统计了按月的转发量，点赞量，以及一共发了多少tweet记录
  
#### 代码  

In [280]:
df_all['timestamp']=pd.to_datetime(df_all['timestamp'])
df_all['date_month']=pd.to_datetime(df_all['timestamp']).values.astype('datetime64[M]')

month_group = df_all.groupby('date_month')
plot_data_df = pd.DataFrame([], columns=['month','tweet_count', 'retweet_count', 'favourites_count'])

for name,group in month_group:
    nest_dict = pd.DataFrame([name,len(group.tweet_id), sum(group.retweets), sum(group.favorites)]).T
    nest_dict.columns = plot_data_df.columns
    plot_data_df = pd.concat([plot_data_df, nest_dict], ignore_index=True)
    
    
    
plot_data_df.to_csv('twitter_month_data.csv', index=False, encoding = 'utf-8')
 

#### 测试

In [281]:
df_master = pd.read_csv('twitter_month_data.csv')
df_master.info()   

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 4 columns):
month               22 non-null object
tweet_count         22 non-null int64
retweet_count       22 non-null float64
favourites_count    22 non-null float64
dtypes: float64(2), int64(1), object(1)
memory usage: 784.0+ bytes
