# WeRateDogs Data Wrangling Project
![image.jpg](https://media.mnn.com/assets/images/2015/10/dogs-pom-puppy.jpg.1440x960_q100_crop-scale_upscale.jpg)

## Steps for Wrangling:
- Data Gathering
- Data Assessing
- Data Cleaning

# 1. Data Gathering

### 1. Reading `twitter-archive-enhanced.csv` using pandas read_csv

## Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## Tweepy Authentication

In [18]:
import tweepy

consumer_key = '###'
consumer_secret = '###'
access_token = '###'
access_secret = '###'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True, parser=tweepy.parsers.JSONParser(),
                 wait_on_rate_limit_notify = True)

## 1. Reading Tweet Archive Data

In [3]:
archive = pd.read_csv('twitter-archive-enhanced.csv')

In [4]:
archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


### 2. Reading Image Predictions via requests library

First we try to import the content into memory via requests

In [5]:
import requests
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
response

<Response [200]>

Response 200 means the request was successful.
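As a small illustration of what the status code means, the standard library's `http.HTTPStatus` maps numeric codes to their standard phrases. This is only a sketch that does not touch the network; `describe_status` is a hypothetical helper name introduced here, not part of the notebook.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description of an HTTP status code (hypothetical helper)."""
    status = HTTPStatus(code)
    # Codes in the 200-299 range indicate success
    ok = 200 <= code < 300
    return f"{code} {status.phrase} ({'success' if ok else 'not a success'})"


print(describe_status(200))
print(describe_status(404))
```

With `requests`, calling `response.raise_for_status()` right after `requests.get(url)` is another way to stop early when the download fails.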

In [6]:
response.text[:100]

'tweet_id\tjpg_url\timg_num\tp1\tp1_conf\tp1_dog\tp2\tp2_conf\tp2_dog\tp3\tp3_conf\tp3_dog\n666020888022790149\tht'

Reading first 100 characters of the content as above
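Since the response body is already in memory, it could also be parsed directly, without going through a file. The sketch below assumes a short stand-in string (`sample_text`, built from the header above and the first row of the file as displayed later in this notebook) in place of `response.text`. The name `preview_df` is made up for this example.

```python
import io

import pandas as pd

# Stand-in for response.text: the header plus one data row
sample_text = (
    "tweet_id\tjpg_url\timg_num\tp1\tp1_conf\tp1_dog\t"
    "p2\tp2_conf\tp2_dog\tp3\tp3_conf\tp3_dog\n"
    "666020888022790149\thttps://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg\t1\t"
    "Welsh_springer_spaniel\t0.465074\tTrue\tcollie\t0.156665\tTrue\t"
    "Shetland_sheepdog\t0.061428\tTrue\n"
)

# io.StringIO wraps the text so read_csv can treat it like a file
preview_df = pd.read_csv(io.StringIO(sample_text), sep="\t")

print(preview_df.shape)
print(preview_df.loc[0, "p1"])
```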

Writing the content into a folder named `image_predictions`. We create the folder if it isn't already present.

In [7]:
import os
#create a folder for image predictions data
folder_name = 'image_predictions'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

In [8]:
#Write image predictions data to a file in our folder
with open(os.path.join(folder_name,url.split('/')[-1]),mode = 'wb') as file:
    file.write(response.content)
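The check-then-create pattern above can be shortened with the `exist_ok=True` argument of `os.makedirs`. The following is a minimal sketch under stated assumptions: it works inside a temporary directory and writes a made-up `payload` in place of `response.content`, so it leaves nothing behind.

```python
import os
import tempfile

# Work inside a temporary directory so the sketch leaves nothing behind
with tempfile.TemporaryDirectory() as base_dir:
    folder = os.path.join(base_dir, "image_predictions")

    # exist_ok=True makes the call safe to repeat
    os.makedirs(folder, exist_ok=True)
    os.makedirs(folder, exist_ok=True)

    # Stand-in for response.content (bytes), written in binary mode
    payload = b"tweet_id\tjpg_url\n"
    file_path = os.path.join(folder, "image-predictions.tsv")
    with open(file_path, mode="wb") as file:
        file.write(payload)

    written_size = os.path.getsize(file_path)

print(written_size == len(payload))
```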

Now we also load the same content into a dataframe for our analysis

In [9]:
# Read the file from the folder it was written to above
image_predictions_df = pd.read_csv(os.path.join(folder_name, url.split('/')[-1]), sep='\t')

**This is how it looks:**

In [10]:
image_predictions_df.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


### 3. Reading Tweet Data (Retweet and Favorite Counts) via tweepy library

1. Tweepy authentication done
2. Going through each `tweet_id` to extract tweets. (Took about 50 mins)

In [26]:
tweet = api.get_status(archive.tweet_id[0], tweet_mode='extended')
print(tweet['full_text'])

This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU


In [27]:
tweet['retweet_count']

7727

In [29]:
tweet['favorite_count']

36307

In [30]:
tweet

{'created_at': 'Tue Aug 01 16:23:56 +0000 2017',
 'id': 892420643555336193,
 'id_str': '892420643555336193',
 'full_text': "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU",
 'truncated': False,
 'display_text_range': [0, 85],
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [],
  'urls': [],
  'media': [{'id': 892420639486877696,
    'id_str': '892420639486877696',
    'indices': [86, 109],
    'media_url': 'http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
    'media_url_https': 'https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
    'url': 'https://t.co/MgUWQ76dJU',
    'display_url': 'pic.twitter.com/MgUWQ76dJU',
    'expanded_url': 'https://twitter.com/dog_rates/status/892420643555336193/photo/1',
    'type': 'photo',
    'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},
     'medium': {'w': 540, 'h': 528, 'resize': 'fit'},
     'small': {'w': 540, 'h': 528, 'resize': 'fit'},
     'large': {'w': 

### Writing tweets to JSON

In [16]:
import json

In [31]:
# Iterate over tweet_ids to obtain favorite and retweet counts, then append these as dicts to df_list.
df_list = []

# Any tweet_ids the API cannot return are appended to e_list
e_list = []

for tweet_id in archive.tweet_id:
    try:
        page = api.get_status(tweet_id, tweet_mode='extended')
        favorites = page['favorite_count']
        retweet_count = page['retweet_count']
        df_list.append({'tweet_id': int(tweet_id),
                        'favorites': int(favorites),
                        'retweet_count': int(retweet_count)
                       })
        print(tweet_id, favorites, retweet_count)

    except Exception:
        e_list.append(tweet_id)

892420643555336193 36307 7727
892177421306343426 31296 5719
891815181378084864 23571 3785
891689557279858688 39600 7880
891327558926688256 37799 8508
891087950875897856 19041 2848
890971913173991426 11088 1854
890729181411237888 61157 17247
890609185150312448 26207 3925
890240255349198849 29943 6715
890006608113172480 28835 6709
889880896479866881 26210 4565
889665388333682689 45148 9154
889638837579907072 25410 4108
889531135344209921 14213 2060
889278841981685760 23670 4874
888917238123831296 27356 4115
888804989199671297 24044 3881
888554962724278272 18567 3185
888078434458587136 20462 3168
887705289381826560 28428 4922
887517139158093824 43598 10774
887473957103951883 64622 16464
887343217045368832 31635 9571
887101392804085760 28782 5452
886983233522544640 32857 7016
886736880519319552 11268 2939
886680336477933568 21119 4087
886366144734445568 19885 2895
886267009285017600 111 4
886258384151887873 26318 5776
886054160059072513 0 100
885984800019947520 30605 6142
88552894320547020

838083903487373313 17679 3142
837820167694528512 34239 7870
837482249356513284 3857 443
837471256429613056 12846 2279
837110210464448512 16124 2370
836989968035819520 12812 2294
836753516572119041 19395 4655
836677758902222849 12656 2234
836648853927522308 0 556
836397794269200385 0 27604
836380477523124226 14828 3003
836260088725786625 21406 4317
836001077879255040 19167 4282
835685285446955009 0 8169
835574547218894849 17879 3637
835536468978302976 0 1708
835309094223372289 0 21057
835297930240217089 16557 3030
835264098648616962 7778 1708
835246439529840640 2116 73
835172783151792128 26296 5796
835152434251116546 22528 3035
834931633769889797 10867 1652
834786237630337024 21456 5528
834574053763584002 13751 2556
834477809192075265 0 10714
834458053273591808 9678 1694
834209720923721728 20729 4784
834167344700198914 15924 3683
834089966724603904 10118 2151
834086379323871233 13154 2223
833863086058651648 13501 2421
833826103416520705 18161 3801
833732339549220864 0 222
83372290175704

800388270626521089 11419 2882
800188575492947969 0 4048
800141422401830912 15656 2618
800018252395122689 29046 13522
799774291445383169 0 4818
799757965289017345 8590 2218
799422933579902976 8246 1973
799308762079035393 0 5621
799297110730567681 10134 2837
799063482566066176 8260 2525
798933969379225600 13404 4593
798925684722855936 7665 1479
798705661114773508 0 6826
798701998996647937 0 8003
798697898615730177 0 6677
798694562394996736 0 5090
798686750113755136 0 2396
798682547630837760 0 4844
798673117451325440 0 5691
798665375516884993 0 4016
798644042770751489 0 1906
798628517273620480 0 2026
798585098161549313 0 5873
798576900688019456 0 6004
798340744599797760 0 3444
798209839306514432 10670 2624
797971864723324932 11885 3192
797545162159308800 14891 4999
797236660651966464 20415 6768
797165961484890113 229 26
796904159865868288 0 9159
796865951799083009 7873 1963
796759840936919040 12122 3128
796563435802726400 0 7460
796484825502875648 7759 1808
796387464403357696 11267 4267
7

761371037149827077 0 18260
761334018830917632 5245 1465
761292947749015552 4531 1095
761227390836215808 5371 1584
761004547850530816 11442 3517
760893934457552897 3851 988
760656994973933572 6648 1931
760641137271070720 5021 1315
760539183865880579 7645 3666
760521673607086080 4266 1401
760290219849637889 27075 11632
760252756032651264 3972 877
760190180481531904 5748 1824
760153949710192640 0 31
759943073749200896 6001 2097
759923798737051648 14802 5714
759846353224826880 6784 2002
759793422261743616 6022 1937
759557299618865152 4743 1198
759447681597108224 8562 2476
759446261539934208 1689 494
759197388317847553 6128 1967
759159934323924993 0 1167
759099523532779520 14648 4213
759047813560868866 6552 2044
758854675097526272 3589 907
758828659922702336 11322 3873
758740312047005698 5788 1626
758474966123810816 3845 1014
758467244762497024 4833 2240
758405701903519748 5244 1930
758355060040593408 3434 1090
758099635764359168 19254 10097
758041019896193024 2728 380
757741869644341248 69

724046343203856385 2648 554
724004602748780546 4163 1543
723912936180330496 3868 1230
723688335806480385 7643 2930
723673163800948736 3003 883
723179728551723008 5281 1858
722974582966214656 4077 1556
722613351520608256 4948 1628
721503162398597120 4601 1840
721001180231503872 2514 607
720785406564900865 3101 762
720775346191278080 2455 661
720415127506415616 4111 1493
720389942216527872 6345 2492
720340705894408192 2862 973
720059472081784833 3852 1111
720043174954147842 4884 1999
719991154352222208 4816 1731
719704490224398336 4524 1424
719551379208073216 5041 1900
719367763014393856 2791 739
719339463458033665 4410 1229
719332531645071360 3415 946
718971898235854848 3473 1071
718939241951195136 5232 1749
718631497683582976 18775 8003
718613305783398402 2423 478
718540630683709445 2444 1012
718460005985447936 2685 521
718454725339934721 4838 1502
718246886998687744 1915 501
718234618122661888 3848 990
717841801130979328 2458 603
717790033953034240 2872 1107
717537687239008257 5759 18

696488710901260288 2532 1017
696405997980676096 3199 1156
696100768806522880 1931 653
695816827381944320 3010 1166
695794761660297217 3155 766
695767669421768709 1863 746
695629776980148225 4557 2098
695446424020918272 4389 1807
695409464418041856 8664 3569
695314793360662529 3621 1452
695095422348574720 2630 613
695074328191332352 2841 1095
695064344191721472 1601 589
695051054296211456 2660 787
694925794720792577 2692 916
694905863685980160 2745 937
694669722378485760 35768 14372
694356675654983680 1497 287
694352839993344000 2035 615
694342028726001664 1571 482
694329668942569472 2007 501
694206574471057408 4146 2032
694183373896572928 2947 924
694001791655137281 3405 1024
693993230313091072 1874 402
693942351086120961 1739 369
693647888581312512 2682 584
693644216740769793 1343 122
693642232151285760 2533 410
693629975228977152 2441 793
693622659251335168 1537 374
693590843962331137 5054 1955
693582294167244802 1648 254
693486665285931008 1780 634
693280720173801472 3340 1235
69326

679722016581222400 1661 462
679530280114372609 4778 2070
679527802031484928 2625 715
679511351870550016 3349 1256
679503373272485890 3155 1459
679475951516934144 2082 634
679462823135686656 31426 18691
679405845277462528 2359 1196
679158373988876288 21186 7990
679148763231985668 2764 1019
679132435750195208 2954 1143
679111216690831360 5937 2555
679062614270468097 16861 8025
679047485189439488 2236 661
679001094530465792 2798 1214
678991772295516161 2283 1148
678969228704284672 1639 454
678800283649069056 2546 890
678798276842360832 3428 1154
678774928607469569 2792 896
678767140346941444 3496 1345
678764513869611008 1625 468
678755239630127104 7064 3250
678740035362037760 3732 1659
678708137298427904 5548 2402
678675843183484930 2838 1448
678643457146150913 2036 402
678446151570427904 3965 1524
678424312106393600 5422 2518
678410210315247616 4180 1788
678399652199309312 78553 31435
678396796259975168 1581 410
678389028614488064 1861 419
678380236862578688 2416 904
678341075375947776 1

671874878652489728 1211 529
671866342182637568 1083 473
671855973984772097 871 418
671789708968640512 6824 3340
671768281401958400 1151 481
671763349865160704 1618 873
671744970634719232 1307 732
671743150407421952 727 228
671735591348891648 1392 727
671729906628341761 8214 4184
671561002136281088 12360 6964
671550332464455680 878 201
671547767500775424 1299 570
671544874165002241 1898 1010
671542985629241344 1055 542
671538301157904385 905 374
671536543010570240 1134 389
671533943490011136 987 555
671528761649688577 815 248
671520732782923777 1347 510
671518598289059840 916 284
671511350426865664 1545 678
671504605491109889 6756 3417
671497587707535361 887 429
671488513339211776 968 457
671486386088865792 562 188
671485057807351808 737 225
671390180817915904 1399 718
671362598324076544 1067 286
671357843010908160 387 147
671355857343524864 471 114
671347597085433856 925 414
671186162933985280 710 202
671182547775299584 1083 323
671166507850801152 848 331
671163268581498880 1597 1053
6

666373753744588802 174 85
666362758909284353 727 521
666353288456101888 199 66
666345417576210432 274 128
666337882303524864 182 84
666293911632134144 469 321
666287406224695296 138 62
666273097616637952 161 73
666268910803644416 95 32
666104133288665088 13691 6004
666102155909144576 72 11
666099513787052032 142 60
666094000022159362 153 67
666082916733198337 103 41
666073100786774016 297 147
666071193221509120 136 53
666063827256086533 445 198
666058600524156928 105 54
666057090499244032 270 127
666055525042405380 409 221
666051853826850816 1127 778
666050758794694657 124 51
666049248165822465 96 40
666044226329800704 272 132
666033412701032449 112 41
666029285002620928 121 42
666020888022790149 2422 463


In [32]:
# The exception list
e_list

[888202515573088257,
 873697596434513921,
 872668790621863937,
 872261713294495745,
 869988702071779329,
 866816280283807744,
 861769973181624320,
 856602993587888130,
 851953902622658560,
 845459076796616705,
 844704788403113984,
 842892208864923648,
 837366284874571778,
 837012587749474308,
 829374341691346946,
 827228250799742977,
 812747805718642688,
 802247111496568832,
 779123168116150273,
 775096608509886464,
 771004394259247104,
 770743923962707968,
 759566828574212096,
 754011816964026368,
 680055455951884288]

**The tweets above raised an exception while scraping**
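The exception list only keeps the IDs, so the reason for each failure is lost. One possible variation, sketched here under assumptions, stores the error message alongside the ID. `fake_get_status` is a hypothetical stub standing in for the API call, and its error text is invented for illustration; it is not tweepy's actual message.

```python
def fake_get_status(tweet_id):
    """Stand-in for the API call; a hypothetical stub, not part of tweepy."""
    known = {
        892420643555336193: {"favorite_count": 36307, "retweet_count": 7727},
    }
    if tweet_id not in known:
        # Invented error text, used only for this illustration
        raise LookupError("No status found with that ID.")
    return known[tweet_id]


rows = []
failures = {}  # tweet_id -> reason the lookup failed

for tweet_id in [892420643555336193, 888202515573088257]:
    try:
        page = fake_get_status(tweet_id)
    except LookupError as err:
        failures[tweet_id] = str(err)
        continue
    rows.append({"tweet_id": tweet_id,
                 "favorites": page["favorite_count"],
                 "retweet_count": page["retweet_count"]})

print(rows)
print(failures)
```

Keeping the reason makes it easier to tell deleted tweets apart from temporary problems that might succeed on a retry.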

#### Writing our scraped data to tweet_json.txt

This is how the data looks

In [34]:
df_list[:5]

[{'tweet_id': 892420643555336193, 'favorites': 36307, 'retweet_count': 7727},
 {'tweet_id': 892177421306343426, 'favorites': 31296, 'retweet_count': 5719},
 {'tweet_id': 891815181378084864, 'favorites': 23571, 'retweet_count': 3785},
 {'tweet_id': 891689557279858688, 'favorites': 39600, 'retweet_count': 7880},
 {'tweet_id': 891327558926688256, 'favorites': 37799, 'retweet_count': 8508}]

In [37]:
with open('tweet_json.txt', 'w') as file:
    json.dump(df_list, file)
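A quick way to confirm that the file holds what was scraped is to load it back and compare. This sketch assumes two of the records shown above and writes to a temporary directory rather than the working folder.

```python
import json
import os
import tempfile

records = [
    {"tweet_id": 892420643555336193, "favorites": 36307, "retweet_count": 7727},
    {"tweet_id": 892177421306343426, "favorites": 31296, "retweet_count": 5719},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tweet_json.txt")

    # Write the list of dicts as one JSON array
    with open(path, "w") as file:
        json.dump(records, file)

    # Read it back to confirm nothing changed on the way
    with open(path) as file:
        loaded = json.load(file)

print(loaded == records)
```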

#### Also using the same data as a dataframe for further analysis

In [47]:
json_tweets_df = pd.DataFrame(df_list, columns = ['tweet_id', 'favorites', 'retweet_count'])

In [48]:
# Created a csv file in order to download and open without issues in Excel for visual inspection
json_tweets_df.to_csv('tweet_json.csv',encoding='utf-8',index=False)

In [11]:
json_tweets_df = pd.read_csv('tweet_json.csv',encoding='utf-8')

In [12]:
json_tweets_df.head()

Unnamed: 0,tweet_id,favorites,retweet_count
0,892420643555336193,36307,7727
1,892177421306343426,31296,5719
2,891815181378084864,23571,3785
3,891689557279858688,39600,7880
4,891327558926688256,37799,8508


## Gathering is complete here
### Checking all the dataframes once again

In [13]:
archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [14]:
image_predictions_df.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [15]:
json_tweets_df.head()

Unnamed: 0,tweet_id,favorites,retweet_count
0,892420643555336193,36307,7727
1,892177421306343426,31296,5719
2,891815181378084864,23571,3785
3,891689557279858688,39600,7880
4,891327558926688256,37799,8508


# 2. Data Assessing

In [16]:
archive

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,


# Issues found after Visual & Programmatic Assessments

## Quality Issues:
- The `source` column contains HTML anchor markup (`<a href=...>`), which can be removed.
- Rows where `retweeted_status_id` is not null need to be removed because, per the instructions, only tweets that are not retweets should be included.
- The string "None" should be replaced with Python's `None`.
- `doggo`, `floofer`, `pupper` and `puppo` should be replaced with 1s and 0s.
- `timestamp` and `retweeted_status_timestamp` need to be converted to datetime.
- Retweet texts start with "RT @", which can be removed.
- The `name` column has many entries that are not names, such as "a", "an" and "the".
- `in_reply_to_status_id` and `retweeted_status_id` should be converted to int64.
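For the `source` issue, a regular expression can pull the visible text out of the anchor tag. This is a minimal sketch: the notebook output truncates the column, so `sample_source` is an assumed full value, and `strip_anchor` is a hypothetical helper name.

```python
import re

# Assumed sample value: the notebook output truncates this column
sample_source = (
    '<a href="http://twitter.com/download/iphone" '
    'rel="nofollow">Twitter for iPhone</a>'
)


def strip_anchor(html):
    """Return the text between <a ...> and </a>, or the input unchanged."""
    match = re.search(r"<a\b[^>]*>(.*?)</a>", html)
    return match.group(1) if match else html


print(strip_anchor(sample_source))
print(strip_anchor("plain text"))
```

On a DataFrame, the same function could be applied with `archive_clean.source.apply(strip_anchor)`.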

## Tidiness Issues:
- `in_reply_to_status_id` can be dropped: where it is not null, it looks the same as `tweet_id`, since the account is replying within its own thread.
- The whole dataset can be divided into three datasets, `tweet`, `retweet` and `reply`, because they represent three different types of observational units. This will also reduce the number of columns in each.
- All three datasets, `archive`, `json_tweets_df` and `image_predictions_df`, can be combined into a single dataset.
- `doggo`, `floofer`, `pupper` and `puppo` can be combined into one column named `pup_type`.
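The last tidiness point could be sketched as follows, assuming the four stage columns hold either the stage name or the string "None". The toy DataFrame and the `combine_stages` helper are made up for illustration. A row with more than one stage is joined with a comma, which is only one possible choice.

```python
import pandas as pd

stages = ["doggo", "floofer", "pupper", "puppo"]

# Toy data shaped like the archive's stage columns
df = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})


def combine_stages(row):
    """Collect every stage that is set on this row (hypothetical helper)."""
    found = [row[stage] for stage in stages if row[stage] != "None"]
    return ",".join(found) if found else None


# Build the single column, then drop the four originals
df["pup_type"] = df[stages].apply(combine_stages, axis=1)
df = df.drop(columns=stages)

print(df)
```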

### Exploring the data below to arrive at the issues above

In [21]:
archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [23]:
archive.loc[archive.in_reply_to_status_id.notnull()].head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
30,886267009285017600,8.862664e+17,2281182000.0,2017-07-15 16:51:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@NonWhiteHat @MayhewMayhem omg hello tanner yo...,,,,,12,10,,,,,
55,881633300179243008,8.81607e+17,47384430.0,2017-07-02 21:58:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@roushfenway These are good dogs but 17/10 is ...,,,,,17,10,,,,,
64,879674319642796034,8.795538e+17,3105441000.0,2017-06-27 12:14:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@RealKentMurphy 14/10 confirmed,,,,,14,10,,,,,
113,870726314365509632,8.707262e+17,16487760.0,2017-06-02 19:38:25 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@ComplicitOwl @ShopWeRateDogs &gt;10/10 is res...,,,,,10,10,,,,,
148,863427515083354112,8.634256e+17,77596200.0,2017-05-13 16:15:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@Jack_Septic_Eye I'd need a few more pics to p...,,,,,12,10,,,,,


In [24]:
archive.loc[archive.retweeted_status_id.notnull()].head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
19,888202515573088257,,,2017-07-21 01:02:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Canela. She attempted s...,8.87474e+17,4196984000.0,2017-07-19 00:47:34 +0000,https://twitter.com/dog_rates/status/887473957...,13,10,Canela,,,,
32,886054160059072513,,,2017-07-15 02:45:48 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @Athletics: 12/10 #BATP https://t.co/WxwJmv...,8.860537e+17,19607400.0,2017-07-15 02:44:07 +0000,https://twitter.com/dog_rates/status/886053434...,12,10,,,,,
36,885311592912609280,,,2017-07-13 01:35:06 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Lilly. She just paralle...,8.305833e+17,4196984000.0,2017-02-12 01:04:29 +0000,https://twitter.com/dog_rates/status/830583320...,13,10,Lilly,,,,
68,879130579576475649,,,2017-06-26 00:13:58 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Emmy. She was adopted t...,8.780576e+17,4196984000.0,2017-06-23 01:10:23 +0000,https://twitter.com/dog_rates/status/878057613...,14,10,Emmy,,,,
73,878404777348136964,,,2017-06-24 00:09:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Meet Shadow. In an attempt to r...,8.782815e+17,4196984000.0,2017-06-23 16:00:04 +0000,"https://www.gofundme.com/3yd6y1c,https://twitt...",13,10,Shadow,,,,
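The ID columns shown in scientific notation above are stored as float64 because they contain nulls. Tweet IDs of this size do not fit exactly in a float, so a later conversion to int64 may not give back the original ID. The sketch below shows the effect on one `tweet_id` from the archive, using plain Python numbers.

```python
tweet_id = 892420643555336193

# float64 has a 53-bit significand, so integers this large get rounded
as_float = float(tweet_id)
round_tripped = int(as_float)

print(round_tripped == tweet_id)
print(tweet_id > 2 ** 53)
```

One way around this, offered as a suggestion rather than what this notebook does, is to read those columns as strings through the `dtype` parameter of `pd.read_csv`.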


# 3. Cleaning the Data

### First Creating a Copy of DataFrame

In [28]:
archive_clean = archive.copy()

### 1. Converting timestamps to datetime values

In [44]:
# First, remove the trailing " +0000" UTC offset from each timestamp
archive_clean.timestamp = archive_clean.timestamp.str.split('+', expand=True)[0].str.strip()
archive_clean.retweeted_status_timestamp = archive_clean.retweeted_status_timestamp.str.split('+', expand=True)[0].str.strip()

#Convert to datetime
archive_clean.timestamp = pd.to_datetime(archive_clean.timestamp)
archive_clean.retweeted_status_timestamp = pd.to_datetime(archive_clean.retweeted_status_timestamp)
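As an alternative to splitting the string, the `+0000` offset can be parsed rather than discarded. This sketch uses the standard library's `datetime.strptime` with the `%z` directive on one timestamp from the archive; the variable names are made up for the example.

```python
from datetime import datetime

raw = "2017-08-01 16:23:56 +0000"

# %z parses the "+0000" offset, giving a timezone-aware datetime
aware = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S %z")

# Dropping tzinfo gives the naive value that the split approach produces
naive = aware.replace(tzinfo=None)

print(aware.isoformat())
print(naive.isoformat())
```

Keeping the offset records that the times are in UTC, which can matter if they are later compared with local times.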

In [46]:
archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null datetime64[ns]
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null datetime64[ns]
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: datetime

### 2. Separating tweets, retweets and replies
#### But first checking exclusivity of `in_reply_to_status_id` and `retweeted_status_id`

In [48]:
# Checking the replied tweets
replied = archive_clean.loc[archive_clean.in_reply_to_status_id.notnull()]

In [49]:
replied.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
30,886267009285017600,8.862664e+17,2281182000.0,2017-07-15 16:51:35,"<a href=""http://twitter.com/download/iphone"" r...",@NonWhiteHat @MayhewMayhem omg hello tanner yo...,,,NaT,,12,10,,,,,
55,881633300179243008,8.81607e+17,47384430.0,2017-07-02 21:58:53,"<a href=""http://twitter.com/download/iphone"" r...",@roushfenway These are good dogs but 17/10 is ...,,,NaT,,17,10,,,,,
64,879674319642796034,8.795538e+17,3105441000.0,2017-06-27 12:14:36,"<a href=""http://twitter.com/download/iphone"" r...",@RealKentMurphy 14/10 confirmed,,,NaT,,14,10,,,,,
113,870726314365509632,8.707262e+17,16487760.0,2017-06-02 19:38:25,"<a href=""http://twitter.com/download/iphone"" r...",@ComplicitOwl @ShopWeRateDogs &gt;10/10 is res...,,,NaT,,10,10,,,,,
148,863427515083354112,8.634256e+17,77596200.0,2017-05-13 16:15:35,"<a href=""http://twitter.com/download/iphone"" r...",@Jack_Septic_Eye I'd need a few more pics to p...,,,NaT,,12,10,,,,,


In [50]:
replied.loc[replied.retweeted_status_id.notnull()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


**Note: As we can see, when a status is a reply, all retweet-related columns are null. This indicates that replies and retweets are two different sets of observations.**
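The same exclusivity check can be expressed as a count by combining two boolean masks. This is a minimal sketch on a toy DataFrame with made-up values; only the two column names are taken from the archive.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "in_reply_to_status_id": [np.nan, 10.0, np.nan, np.nan],
    "retweeted_status_id": [np.nan, np.nan, 20.0, np.nan],
})

is_reply = toy["in_reply_to_status_id"].notnull()
is_retweet = toy["retweeted_status_id"].notnull()

# Rows that are both a reply and a retweet; zero means the groups are exclusive
overlap = int((is_reply & is_retweet).sum())

# Everything that is neither a reply nor a retweet is an original tweet
originals = toy.loc[~is_reply & ~is_retweet, "tweet_id"].tolist()

print(overlap)
print(originals)
```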

#### Now creating separate DataFrames for each

In [79]:
retweet  = archive_clean.loc[archive_clean.retweeted_status_id.notnull()]

In [80]:
retweet.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
19,888202515573088257,,,2017-07-21 01:02:36,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Canela. She attempted s...,8.87474e+17,4196984000.0,2017-07-19 00:47:34,https://twitter.com/dog_rates/status/887473957...,13,10,Canela,,,,
32,886054160059072513,,,2017-07-15 02:45:48,"<a href=""http://twitter.com/download/iphone"" r...",RT @Athletics: 12/10 #BATP https://t.co/WxwJmv...,8.860537e+17,19607400.0,2017-07-15 02:44:07,https://twitter.com/dog_rates/status/886053434...,12,10,,,,,
36,885311592912609280,,,2017-07-13 01:35:06,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Lilly. She just paralle...,8.305833e+17,4196984000.0,2017-02-12 01:04:29,https://twitter.com/dog_rates/status/830583320...,13,10,Lilly,,,,
68,879130579576475649,,,2017-06-26 00:13:58,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Emmy. She was adopted t...,8.780576e+17,4196984000.0,2017-06-23 01:10:23,https://twitter.com/dog_rates/status/878057613...,14,10,Emmy,,,,
73,878404777348136964,,,2017-06-24 00:09:53,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Meet Shadow. In an attempt to r...,8.782815e+17,4196984000.0,2017-06-23 16:00:04,"https://www.gofundme.com/3yd6y1c,https://twitt...",13,10,Shadow,,,,


In [81]:
tweets_only = archive_clean.loc[archive_clean.retweeted_status_id.isnull() & archive_clean.in_reply_to_status_id.isnull()]

In [82]:
tweets_only.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,NaT,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,NaT,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,NaT,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,NaT,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,NaT,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


Note: `replied` was already created in the previous sub-slide.

### 2. Separating tweets, retweets and replies
#### Testing the exclusivity of datasets

In [83]:
archive_clean.shape[0] - replied.shape[0] - retweet.shape[0] - tweets_only.shape[0]

0

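**The difference is zero, so the three subsets together account for every row. Combined with the empty result of the earlier overlap check, this shows that all three are mutually exclusive and exhaustive.**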
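The same property can also be checked row by row with boolean masks rather than by subtracting row counts. The following is a minimal sketch on a toy frame; it assumes that `replied` was built from `in_reply_to_status_id.notnull()`, which is not shown in this section, and the `toy` data is invented for illustration.

```python
import pandas as pd

# Toy stand-in for archive_clean: one reply, one retweet, two original tweets.
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "in_reply_to_status_id": [10.0, None, None, None],
    "retweeted_status_id": [None, 20.0, None, None],
})

is_reply = toy["in_reply_to_status_id"].notnull()
is_retweet = toy["retweeted_status_id"].notnull()
is_tweet = ~is_reply & ~is_retweet

# Each row should belong to exactly one of the three groups.
membership = is_reply.astype(int) + is_retweet.astype(int) + is_tweet.astype(int)
partition_ok = bool((membership == 1).all())
print(partition_ok)
```

A row that was both a reply and a retweet would get a membership count of 2, so this check points at the offending rows instead of only reporting a non-zero total.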

### 3. Dropping irrelevant variables from each of `replied`, `retweet` and `tweets_only`

In [86]:
replied.drop(['retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'], axis=1, inplace=True)
retweet.drop(['in_reply_to_status_id', 'in_reply_to_user_id'], axis=1, inplace=True)
tweets_only.drop(['retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp', 
           'in_reply_to_status_id', 'in_reply_to_user_id'], axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)
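The warning above appears because `replied`, `retweet` and `tweets_only` were created by filtering `archive_clean`, and pandas cannot tell whether an in-place change is meant to reach the original frame. One common way to avoid it, sketched here on an invented `toy` frame, is to take an explicit `.copy()` when creating the subset and to reassign the result of `drop()` instead of using `inplace=True`.

```python
import pandas as pd

toy = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "in_reply_to_status_id": [None, 10.0, None],
    "retweeted_status_id": [20.0, None, None],
})

# .copy() makes the subset an independent DataFrame rather than a slice of toy.
retweet_toy = toy.loc[toy["retweeted_status_id"].notnull()].copy()

# Reassigning the result of drop() avoids inplace=True on a derived frame.
retweet_toy = retweet_toy.drop(columns=["in_reply_to_status_id"])

print(list(retweet_toy.columns))
print(list(toy.columns))
```

The original frame keeps all of its columns, and later assignments such as `retweet_toy["text"] = ...` act on the independent copy.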


In [92]:
replied.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
30,886267009285017600,8.862664e+17,2281182000.0,2017-07-15 16:51:35,"<a href=""http://twitter.com/download/iphone"" r...",@NonWhiteHat @MayhewMayhem omg hello tanner yo...,,12,10,,,,,
55,881633300179243008,8.81607e+17,47384430.0,2017-07-02 21:58:53,"<a href=""http://twitter.com/download/iphone"" r...",@roushfenway These are good dogs but 17/10 is ...,,17,10,,,,,
64,879674319642796034,8.795538e+17,3105441000.0,2017-06-27 12:14:36,"<a href=""http://twitter.com/download/iphone"" r...",@RealKentMurphy 14/10 confirmed,,14,10,,,,,
113,870726314365509632,8.707262e+17,16487760.0,2017-06-02 19:38:25,"<a href=""http://twitter.com/download/iphone"" r...",@ComplicitOwl @ShopWeRateDogs &gt;10/10 is res...,,10,10,,,,,
148,863427515083354112,8.634256e+17,77596200.0,2017-05-13 16:15:35,"<a href=""http://twitter.com/download/iphone"" r...",@Jack_Septic_Eye I'd need a few more pics to p...,,12,10,,,,,


In [93]:
retweet.head()

Unnamed: 0,tweet_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
19,888202515573088257,2017-07-21 01:02:36,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Canela. She attempted s...,8.87474e+17,4196984000.0,2017-07-19 00:47:34,https://twitter.com/dog_rates/status/887473957...,13,10,Canela,,,,
32,886054160059072513,2017-07-15 02:45:48,"<a href=""http://twitter.com/download/iphone"" r...",RT @Athletics: 12/10 #BATP https://t.co/WxwJmv...,8.860537e+17,19607400.0,2017-07-15 02:44:07,https://twitter.com/dog_rates/status/886053434...,12,10,,,,,
36,885311592912609280,2017-07-13 01:35:06,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Lilly. She just paralle...,8.305833e+17,4196984000.0,2017-02-12 01:04:29,https://twitter.com/dog_rates/status/830583320...,13,10,Lilly,,,,
68,879130579576475649,2017-06-26 00:13:58,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Emmy. She was adopted t...,8.780576e+17,4196984000.0,2017-06-23 01:10:23,https://twitter.com/dog_rates/status/878057613...,14,10,Emmy,,,,
73,878404777348136964,2017-06-24 00:09:53,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Meet Shadow. In an attempt to r...,8.782815e+17,4196984000.0,2017-06-23 16:00:04,"https://www.gofundme.com/3yd6y1c,https://twitt...",13,10,Shadow,,,,


In [94]:
tweets_only.head()

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,2017-08-01 16:23:56,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,2017-08-01 00:17:27,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,2017-07-31 00:18:03,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,2017-07-30 15:58:51,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,2017-07-29 16:00:24,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


### 4. Removing the `RT @user:` prefix from `retweet` text

In [106]:
retweet.text = retweet.text.str.split(' ', n=2, expand=True)[2]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


In [202]:
retweet.text[:5]

19    This is Canela. She attempted some fancy porch...
32                  12/10 #BATP https://t.co/WxwJmvjfxo
36    This is Lilly. She just parallel barked. Kindl...
68    This is Emmy. She was adopted today. Massive r...
73    Meet Shadow. In an attempt to reach maximum zo...
Name: text, dtype: object

The text is much cleaner now!
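Splitting on the first two spaces drops the first two tokens whatever they are. A slightly more defensive alternative, shown here only as a sketch with invented sample strings, is a regular expression that removes the prefix only when the text really starts with `RT @handle:`.

```python
import pandas as pd

# Invented sample strings; the last one has no retweet prefix.
texts = pd.Series([
    "RT @dog_rates: This is a toy example.",
    "RT @Athletics: 12/10 #BATP",
    "No prefix on this one.",
])

# Remove a leading "RT @handle: " only when it is actually present.
cleaned = texts.str.replace(r"^RT @\w+:\s*", "", regex=True)
print(cleaned.tolist())
```

Text without the prefix passes through unchanged, so the same call could be applied to a mixed column without losing words.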

### 5. Improving `source` column

In [114]:
tweets_only.source.str.split('\"', expand=True).head()

Unnamed: 0,0,1,2,3,4
0,<a href=,http://twitter.com/download/iphone,rel=,nofollow,>Twitter for iPhone</a>
1,<a href=,http://twitter.com/download/iphone,rel=,nofollow,>Twitter for iPhone</a>
2,<a href=,http://twitter.com/download/iphone,rel=,nofollow,>Twitter for iPhone</a>
3,<a href=,http://twitter.com/download/iphone,rel=,nofollow,>Twitter for iPhone</a>
4,<a href=,http://twitter.com/download/iphone,rel=,nofollow,>Twitter for iPhone</a>


We only need the URL part (`http://...`) of the string.

In [115]:
tweets_only.source.str.split('\"', expand=True)[1].value_counts()

http://twitter.com/download/iphone              1964
http://vine.co                                    91
http://twitter.com                                31
https://about.twitter.com/products/tweetdeck      11
Name: 1, dtype: int64

These four URLs are the only tweet sources. Let's replace the raw HTML in `source` with the URL in all three DataFrames:

In [116]:
tweets_only.source = tweets_only.source.str.split('\"', expand=True)[1]
retweet.source = retweet.source.str.split('\"', expand=True)[1]
replied.source = replied.source.str.split('\"', expand=True)[1]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


In [117]:
tweets_only.head()

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,2017-08-01 16:23:56,http://twitter.com/download/iphone,This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,2017-08-01 00:17:27,http://twitter.com/download/iphone,This is Tilly. She's just checking pup on you....,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,2017-07-31 00:18:03,http://twitter.com/download/iphone,This is Archie. He is a rare Norwegian Pouncin...,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,2017-07-30 15:58:51,http://twitter.com/download/iphone,This is Darla. She commenced a snooze mid meal...,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,2017-07-29 16:00:24,http://twitter.com/download/iphone,This is Franklin. He would like you to stop ca...,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
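Splitting on the double quote relies on the URL always being the second piece. As an alternative sketch, `str.extract` with named groups can pull out both the URL and the human-readable label in one pass. The second sample row and the column names `url` and `label` are hypothetical and chosen only for illustration.

```python
import pandas as pd

source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    # Hypothetical second client, invented for this example.
    '<a href="http://example.com/client" rel="nofollow">Example Client</a>',
])

# Capture the href value and the anchor text; named groups become column names.
parts = source.str.extract(r'href="(?P<url>[^"]+)"[^>]*>(?P<label>[^<]+)</a>')
print(parts)
```

Keeping the label as well may be convenient for plots, where "Twitter for iPhone" reads better than a URL.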


### 6. Replacing the string "None" with Python's `None`

In [120]:
replied.replace({"None": None}, inplace=True)
tweets_only.replace({"None": None}, inplace=True)
retweet.replace({"None": None}, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  method=method)


In [121]:
tweets_only.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 12 columns):
tweet_id              2097 non-null int64
timestamp             2097 non-null datetime64[ns]
source                2097 non-null object
text                  2097 non-null object
expanded_urls         2094 non-null object
rating_numerator      2097 non-null int64
rating_denominator    2097 non-null int64
name                  1494 non-null object
doggo                 83 non-null object
floofer               10 non-null object
pupper                230 non-null object
puppo                 24 non-null object
dtypes: datetime64[ns](1), int64(3), object(8)
memory usage: 213.0+ KB


In [133]:
index_doggo_notnull = set(tweets_only.loc[tweets_only.doggo.notnull()].index)
index_floofer_notnull = set(tweets_only.loc[tweets_only.floofer.notnull()].index)
index_pupper_notnull = set(tweets_only.loc[tweets_only.pupper.notnull()].index)
index_puppo_notnull = set(tweets_only.loc[tweets_only.puppo.notnull()].index)

In [140]:
print(index_doggo_notnull & index_floofer_notnull)
print(index_doggo_notnull & index_pupper_notnull)
print(index_doggo_notnull & index_puppo_notnull)

{200}
{705, 1063, 460, 531, 1113, 956, 733, 889, 575}
{191}


**doggo, floofer, pupper and puppo are not mutually exclusive: some tweets carry `doggo` together with another stage**
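The set intersections above compare `doggo` against each of the other stages. A more general check, sketched here on an invented `toy` frame, counts how many stage columns are filled in each row, which also catches combinations that do not involve `doggo`.

```python
import pandas as pd

stages = ["doggo", "floofer", "pupper", "puppo"]

# Toy frame: None marks a missing stage, as after the "None" -> None replacement.
toy = pd.DataFrame({
    "doggo":   ["doggo", None, "doggo", None],
    "floofer": [None, None, None, None],
    "pupper":  [None, "pupper", "pupper", None],
    "puppo":   [None, None, None, None],
})

# Count how many stage columns are filled in each row.
stage_count = toy[stages].notna().sum(axis=1)
multi_stage_index = stage_count[stage_count > 1].index.tolist()
print(multi_stage_index)
```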

## 7. Replacing `doggo`, `floofer`, `pupper` and `puppo` with 1s and 0s

In [158]:
tweets_only.puppo.value_counts()

1    24
Name: puppo, dtype: int64

In [159]:
tweets_only.replace({"doggo": 1, "floofer": 1, "pupper": 1, "puppo": 1, None: 0}, inplace=True)
retweet.replace({"doggo": 1, "floofer": 1, "pupper": 1, "puppo": 1, None: 0}, inplace=True)
replied.replace({"doggo": 1, "floofer": 1, "pupper": 1, "puppo": 1, None: 0}, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  method=method)
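Because the mapping is applied to the whole DataFrame, `None: 0` also turns missing values in other columns into `0`; this is why `name` shows `0` as its most frequent value in the next step and why `expanded_urls` shows `0` in `replied`. A column-scoped variant, sketched here on an invented `toy` frame, converts only the four stage columns.

```python
import pandas as pd

stages = ["doggo", "floofer", "pupper", "puppo"]

toy = pd.DataFrame({
    "name": ["Phineas", None],
    "doggo": ["doggo", None],
    "floofer": [None, None],
    "pupper": [None, "pupper"],
    "puppo": [None, None],
})

# Convert only the stage columns; `name` keeps its missing value.
toy[stages] = toy[stages].notna().astype(int)
print(toy)
```

With this approach the later `0: None` repair on `name` would not be needed.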


## 8. Converting ids to nullable `Int64`

In [160]:
replied.in_reply_to_status_id = replied.in_reply_to_status_id.astype('Int64')
retweet.retweeted_status_id = retweet.retweeted_status_id.astype('Int64')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value
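One caveat: these id columns were stored as floats before the cast, and float64 cannot represent every 18-digit integer. The converted ids displayed later (for example 886266357075128320) are exact multiples of 128, the float64 spacing at this magnitude, which suggests the last digits may already have been rounded. The sketch below illustrates the effect with a made-up id and shows one possible workaround, reading the column as text; the CSV content is hypothetical.

```python
import io
import pandas as pd

# Hypothetical two-row CSV; the id value is made up for illustration.
csv_text = (
    "tweet_id,in_reply_to_status_id\n"
    "1,886266357075128321\n"
    "2,\n"
)

# Default parsing: the missing value forces float64, which cannot hold every 18-digit integer.
as_float = pd.read_csv(io.StringIO(csv_text))
round_tripped = int(as_float.loc[0, "in_reply_to_status_id"])

# Reading the column as text keeps every digit.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"in_reply_to_status_id": str})
exact = as_text.loc[0, "in_reply_to_status_id"]

print(round_tripped == 886266357075128321)
print(exact)
```

If the ids are only used as keys for joins or lookups, keeping them as strings is usually sufficient.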


## 9. Improving `name`

In [161]:
tweets_only.name.value_counts()

0            672
Charlie       11
Lucy          11
Oliver        10
Cooper        10
Tucker         9
Penny          9
Lola           8
Winston        8
Sadie          8
Daisy          7
Toby           7
Oscar          6
Stanley        6
Bella          6
Jax            6
Koda           6
Bo             6
Bailey         6
Dave           5
Rusty          5
Chester        5
Milo           5
Buddy          5
Leo            5
Scout          5
Louis          5
Bentley        5
Gary           4
Boomer         4
            ... 
Chef           1
Willow         1
Gustaf         1
Rubio          1
Barclay        1
Timofy         1
Augie          1
Dook           1
Edd            1
Heinrich       1
Millie         1
Maks           1
Hazel          1
Tilly          1
Blipson        1
Opie           1
Pip            1
Ralphson       1
Goose          1
Stuart         1
Sephie         1
Margo          1
Stefan         1
Skye           1
Genevieve      1
Kaia           1
Katie          1
Cilantro      

There are many "a", "an" and "the" entries in the `name` column. They are not real names and are equivalent to None, as is the `0` introduced by the previous replacement.

In [162]:
tweets_only.name.replace({"a": None, "an": None, "the": None, 0: None}, inplace=True)
retweet.name.replace({"a": None, "an": None, "the": None, 0: None}, inplace=True)
replied.name.replace({"a": None, "an": None, "the": None, 0: None}, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)


In [163]:
tweets_only.name.value_counts()

Charlie      11
Lucy         11
Oliver       10
Cooper       10
Penny         9
Tucker        9
Winston       8
Lola          8
Sadie         8
Daisy         7
Toby          7
Bailey        6
Jax           6
Stanley       6
Bella         6
Koda          6
Bo            6
Oscar         6
Rusty         5
Leo           5
Bentley       5
Milo          5
Chester       5
Louis         5
Buddy         5
Scout         5
Dave          5
Archie        4
Bear          4
Chip          4
             ..
Chef          1
Willow        1
Gustaf        1
Rubio         1
Barclay       1
Timofy        1
Augie         1
Dook          1
Edd           1
Heinrich      1
Millie        1
Maks          1
Hazel         1
Tilly         1
Blipson       1
Opie          1
Pip           1
Ralphson      1
Goose         1
Stuart        1
Sephie        1
Margo         1
Stefan        1
Skye          1
Genevieve     1
Kaia          1
Katie         1
Cilantro      1
Glacier       1
Poppy         1
Name: name, Length: 951,

In [164]:
replied.name.value_counts()

Tessa    1
Name: name, dtype: int64

In [165]:
retweet.name.value_counts()

Bo          3
Lola        2
Maddie      2
Jack        2
Sampson     2
Sunny       2
Buddy       2
Arnie       1
Terrance    1
Scout       1
Gidget      1
Stubert     1
Stephan     1
Lorenzo     1
Pipsy       1
Mattie      1
Davey       1
Bungalo     1
Dawn        1
Milo        1
Colby       1
mad         1
Kenny       1
Penny       1
Chompsky    1
Coco        1
George      1
Maximus     1
Tyrone      1
Winston     1
           ..
Peaches     1
Astrid      1
Sierra      1
Philbert    1
Bailey      1
Harper      1
Carl        1
Loki        1
Eve         1
Baloo       1
Butter      1
Rizzy       1
Quinn       1
Rubio       1
Luna        1
Carly       1
Paull       1
Canela      1
Nollie      1
Lilly       1
Reginald    1
quite       1
Gabby       1
Oliver      1
Timison     1
Frankie     1
Chelsea     1
just        1
Pablo       1
Moose       1
Name: name, Length: 107, dtype: int64
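The `retweet` counts above still contain lowercase words such as "mad", "quite" and "just", which the fixed list of "a", "an" and "the" does not cover. A broader rule, offered only as a sketch and assuming that genuine names always start with an uppercase letter, keeps a value only when it matches that pattern.

```python
import pandas as pd

names = pd.Series(["Charlie", "a", "quite", None, "Bo", "just"])

# Keep a value only if it starts with an uppercase letter; everything else becomes missing.
looks_like_name = names.str.match(r"[A-Z]", na=False)
cleaned = names.where(looks_like_name)
print(cleaned.tolist())
```

This assumption would wrongly discard a real name written in lowercase, so the rejected values are worth a quick look before applying it.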

## 11. Combining DataFrames

### First, combining the archive and JSON tweets

In [166]:
replied.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
30,886267009285017600,886266357075128320,2281182000.0,2017-07-15 16:51:35,http://twitter.com/download/iphone,@NonWhiteHat @MayhewMayhem omg hello tanner yo...,0,12,10,,0,0,0,0
55,881633300179243008,881607037314052096,47384430.0,2017-07-02 21:58:53,http://twitter.com/download/iphone,@roushfenway These are good dogs but 17/10 is ...,0,17,10,,0,0,0,0
64,879674319642796034,879553827334172672,3105441000.0,2017-06-27 12:14:36,http://twitter.com/download/iphone,@RealKentMurphy 14/10 confirmed,0,14,10,,0,0,0,0
113,870726314365509632,870726202742493184,16487760.0,2017-06-02 19:38:25,http://twitter.com/download/iphone,@ComplicitOwl @ShopWeRateDogs &gt;10/10 is res...,0,10,10,,0,0,0,0
148,863427515083354112,863425645568774144,77596200.0,2017-05-13 16:15:35,http://twitter.com/download/iphone,@Jack_Septic_Eye I'd need a few more pics to p...,0,12,10,,0,0,0,0


In [167]:
retweet.head()

Unnamed: 0,tweet_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
19,888202515573088257,2017-07-21 01:02:36,http://twitter.com/download/iphone,This is Canela. She attempted some fancy porch...,887473957103951872,4196984000.0,2017-07-19 00:47:34,https://twitter.com/dog_rates/status/887473957...,13,10,Canela,0,0,0,0
32,886054160059072513,2017-07-15 02:45:48,http://twitter.com/download/iphone,12/10 #BATP https://t.co/WxwJmvjfxo,886053734421102592,19607400.0,2017-07-15 02:44:07,https://twitter.com/dog_rates/status/886053434...,12,10,,0,0,0,0
36,885311592912609280,2017-07-13 01:35:06,http://twitter.com/download/iphone,This is Lilly. She just parallel barked. Kindl...,830583320585068544,4196984000.0,2017-02-12 01:04:29,https://twitter.com/dog_rates/status/830583320...,13,10,Lilly,0,0,0,0
68,879130579576475649,2017-06-26 00:13:58,http://twitter.com/download/iphone,This is Emmy. She was adopted today. Massive r...,878057613040115712,4196984000.0,2017-06-23 01:10:23,https://twitter.com/dog_rates/status/878057613...,14,10,Emmy,0,0,0,0
73,878404777348136964,2017-06-24 00:09:53,http://twitter.com/download/iphone,Meet Shadow. In an attempt to reach maximum zo...,878281511006478336,4196984000.0,2017-06-23 16:00:04,"https://www.gofundme.com/3yd6y1c,https://twitt...",13,10,Shadow,0,0,0,0


In [168]:
tweets_only.head()

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,2017-08-01 16:23:56,http://twitter.com/download/iphone,This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,0,0,0,0
1,892177421306343426,2017-08-01 00:17:27,http://twitter.com/download/iphone,This is Tilly. She's just checking pup on you....,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,0,0,0,0
2,891815181378084864,2017-07-31 00:18:03,http://twitter.com/download/iphone,This is Archie. He is a rare Norwegian Pouncin...,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,0,0,0,0
3,891689557279858688,2017-07-30 15:58:51,http://twitter.com/download/iphone,This is Darla. She commenced a snooze mid meal...,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,0,0,0,0
4,891327558926688256,2017-07-29 16:00:24,http://twitter.com/download/iphone,This is Franklin. He would like you to stop ca...,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,0,0,0,0


In [170]:
json_tweets_df.head()

Unnamed: 0,tweet_id,favorites,retweet_count
0,892420643555336193,36307,7727
1,892177421306343426,31296,5719
2,891815181378084864,23571,3785
3,891689557279858688,39600,7880
4,891327558926688256,37799,8508


In [203]:
# Joining archive and json tweets
tweets_merge = pd.merge(left = tweets_only, right=json_tweets_df, how='inner', on='tweet_id')
retweet_merge = pd.merge(left = retweet, right=json_tweets_df, how='inner', on='tweet_id')
replied_merge = pd.merge(left = replied, right=json_tweets_df, how='inner', on='tweet_id')

Here we inner-join `json_tweets_df` onto each of the `tweets_only`, `retweet` and `replied` DataFrames on the `tweet_id` column, so rows without a match in `json_tweets_df` are dropped.

In [204]:
tweets_only.shape

(2097, 12)

In [205]:
tweets_merge.shape

(2090, 14)

In [206]:
retweet.shape

(181, 15)

In [207]:
retweet_merge.shape

(163, 17)

In [208]:
replied.shape

(78, 14)

In [209]:
replied_merge.shape

(78, 16)
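The shapes show that the inner join dropped some rows (2097 to 2090 for tweets, 181 to 163 for retweets). To see which ids had no match, a left join with `indicator=True` can be used. This is a minimal sketch on invented `left` and `right` frames, not the project data.

```python
import pandas as pd

left = pd.DataFrame({"tweet_id": [1, 2, 3], "text": ["a", "b", "c"]})
right = pd.DataFrame({"tweet_id": [1, 3], "favorites": [10, 30], "retweet_count": [1, 3]})

# A left join with indicator=True labels each row by where its key was found.
checked = pd.merge(left, right, how="left", on="tweet_id", indicator=True)
missing_ids = checked.loc[checked["_merge"] == "left_only", "tweet_id"].tolist()
print(missing_ids)
```

Rows marked `left_only` are the ones an inner join would silently discard.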

## `replied_merge` view

In [210]:
replied_merge.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,favorites,retweet_count
0,886267009285017600,886266357075128320,2281182000.0,2017-07-15 16:51:35,http://twitter.com/download/iphone,@NonWhiteHat @MayhewMayhem omg hello tanner yo...,0,12,10,,0,0,0,0,111,4
1,881633300179243008,881607037314052096,47384430.0,2017-07-02 21:58:53,http://twitter.com/download/iphone,@roushfenway These are good dogs but 17/10 is ...,0,17,10,,0,0,0,0,117,7
2,879674319642796034,879553827334172672,3105441000.0,2017-06-27 12:14:36,http://twitter.com/download/iphone,@RealKentMurphy 14/10 confirmed,0,14,10,,0,0,0,0,297,10
3,870726314365509632,870726202742493184,16487760.0,2017-06-02 19:38:25,http://twitter.com/download/iphone,@ComplicitOwl @ShopWeRateDogs &gt;10/10 is res...,0,10,10,,0,0,0,0,115,3
4,863427515083354112,863425645568774144,77596200.0,2017-05-13 16:15:35,http://twitter.com/download/iphone,@Jack_Septic_Eye I'd need a few more pics to p...,0,12,10,,0,0,0,0,2120,91


## `tweets_merge` view

In [211]:
tweets_merge.head()

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,favorites,retweet_count
0,892420643555336193,2017-08-01 16:23:56,http://twitter.com/download/iphone,This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,0,0,0,0,36307,7727
1,892177421306343426,2017-08-01 00:17:27,http://twitter.com/download/iphone,This is Tilly. She's just checking pup on you....,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,0,0,0,0,31296,5719
2,891815181378084864,2017-07-31 00:18:03,http://twitter.com/download/iphone,This is Archie. He is a rare Norwegian Pouncin...,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,0,0,0,0,23571,3785
3,891689557279858688,2017-07-30 15:58:51,http://twitter.com/download/iphone,This is Darla. She commenced a snooze mid meal...,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,0,0,0,0,39600,7880
4,891327558926688256,2017-07-29 16:00:24,http://twitter.com/download/iphone,This is Franklin. He would like you to stop ca...,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,0,0,0,0,37799,8508


## `retweet_merge` view

In [212]:
retweet_merge.head()

Unnamed: 0,tweet_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,favorites,retweet_count
0,886054160059072513,2017-07-15 02:45:48,http://twitter.com/download/iphone,12/10 #BATP https://t.co/WxwJmvjfxo,886053734421102592,19607400.0,2017-07-15 02:44:07,https://twitter.com/dog_rates/status/886053434...,12,10,,0,0,0,0,0,100
1,885311592912609280,2017-07-13 01:35:06,http://twitter.com/download/iphone,This is Lilly. She just parallel barked. Kindl...,830583320585068544,4196984000.0,2017-02-12 01:04:29,https://twitter.com/dog_rates/status/830583320...,13,10,Lilly,0,0,0,0,0,16946
2,879130579576475649,2017-06-26 00:13:58,http://twitter.com/download/iphone,This is Emmy. She was adopted today. Massive r...,878057613040115712,4196984000.0,2017-06-23 01:10:23,https://twitter.com/dog_rates/status/878057613...,14,10,Emmy,0,0,0,0,0,6237
3,878404777348136964,2017-06-24 00:09:53,http://twitter.com/download/iphone,Meet Shadow. In an attempt to reach maximum zo...,878281511006478336,4196984000.0,2017-06-23 16:00:04,"https://www.gofundme.com/3yd6y1c,https://twitt...",13,10,Shadow,0,0,0,0,0,1179
4,878316110768087041,2017-06-23 18:17:33,http://twitter.com/download/iphone,Meet Terrance. He's being yelled at because he...,669000397445533696,4196984000.0,2015-11-24 03:51:38,https://twitter.com/dog_rates/status/669000397...,11,10,Terrance,0,0,0,0,0,6113


### 12. Now combining `image_predictions_df` too

In [213]:
tweets_merge = pd.merge(left = tweets_merge, right=image_predictions_df, how='inner', on='tweet_id')
retweet_merge = pd.merge(left = retweet_merge, right=image_predictions_df, how='inner', on='tweet_id')
replied_merge = pd.merge(left = replied_merge, right=image_predictions_df, how='inner', on='tweet_id')
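These merges assume that each `tweet_id` appears at most once on both sides. As a hedged sketch, the `validate` argument of `pd.merge` can make that assumption explicit; the `tweets` and `predictions` frames below are invented stand-ins.

```python
import pandas as pd

tweets = pd.DataFrame({"tweet_id": [1, 2], "rating_numerator": [13, 12]})
predictions = pd.DataFrame({"tweet_id": [1, 2], "p1": ["Chihuahua", "basset"]})

# validate="one_to_one" raises pandas.errors.MergeError if either side repeats a tweet_id.
merged = pd.merge(tweets, predictions, how="inner", on="tweet_id", validate="one_to_one")
print(merged.shape)

# Repeat one prediction row to show the check firing.
duplicated = pd.concat([predictions, predictions.iloc[[0]]], ignore_index=True)
try:
    pd.merge(tweets, duplicated, how="inner", on="tweet_id", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)
```

Without the check, a duplicated key would quietly multiply rows in the merged result.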

# Final DataFrames

## Tweets Dataset (no retweets or replies)

In [214]:
tweets_merge.head()

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,...,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,892420643555336193,2017-08-01 16:23:56,http://twitter.com/download/iphone,This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,0,0,...,1,orange,0.097049,False,bagel,0.085851,False,banana,0.07611,False
1,892177421306343426,2017-08-01 00:17:27,http://twitter.com/download/iphone,This is Tilly. She's just checking pup on you....,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,0,0,...,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True
2,891815181378084864,2017-07-31 00:18:03,http://twitter.com/download/iphone,This is Archie. He is a rare Norwegian Pouncin...,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,0,0,...,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
3,891689557279858688,2017-07-30 15:58:51,http://twitter.com/download/iphone,This is Darla. She commenced a snooze mid meal...,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,0,0,...,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
4,891327558926688256,2017-07-29 16:00:24,http://twitter.com/download/iphone,This is Franklin. He would like you to stop ca...,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,0,0,...,2,basset,0.555712,True,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True


## Replies Dataset

In [215]:
replied_merge.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,...,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,863079547188785154,667152164079423488,4196984000.0,2017-05-12 17:12:53,http://twitter.com/download/iphone,Ladies and gentlemen... I found Pipsy. He may ...,https://twitter.com/dog_rates/status/863079547...,14,10,,...,1,Lakeland_terrier,0.275242,True,Airedale,0.190569,True,teddy,0.102595,False
1,856526610513747968,855818117272018944,4196984000.0,2017-04-24 15:13:52,http://twitter.com/download/iphone,"THIS IS CHARLIE, MARK. HE DID JUST WANT TO SAY...",https://twitter.com/dog_rates/status/856526610...,14,10,,...,1,Old_English_sheepdog,0.798481,True,Tibetan_terrier,0.060602,True,standard_poodle,0.040722,True
2,844979544864018432,759099523532779520,4196984000.0,2017-03-23 18:29:57,http://twitter.com/download/iphone,PUPDATE: I'm proud to announce that Toby is 23...,https://twitter.com/dog_rates/status/844979544...,13,10,,...,3,tennis_ball,0.999281,False,racket,0.00037,False,Shetland_sheepdog,0.000132,True
3,802265048156610565,733109485275860992,4196984000.0,2016-11-25 21:37:47,http://twitter.com/download/iphone,"Like doggo, like pupper version 2. Both 11/10 ...",https://twitter.com/dog_rates/status/802265048...,11,10,,...,1,Labrador_retriever,0.897162,True,beagle,0.016895,True,Rhodesian_ridgeback,0.012061,True
4,746906459439529985,746885919387574272,4196984000.0,2016-06-26 03:22:31,http://twitter.com/download/iphone,"PUPDATE: can't see any. Even if I could, I cou...",https://twitter.com/dog_rates/status/746906459...,0,10,,...,1,traffic_light,0.470708,False,fountain,0.199776,False,space_shuttle,0.064807,False


## Retweets Dataset

In [216]:
retweet_merge.head()

Unnamed: 0,tweet_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,...,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,885311592912609280,2017-07-13 01:35:06,http://twitter.com/download/iphone,This is Lilly. She just parallel barked. Kindl...,830583320585068544,4196984000.0,2017-02-12 01:04:29,https://twitter.com/dog_rates/status/830583320...,13,10,...,1,Labrador_retriever,0.908703,True,seat_belt,0.057091,False,pug,0.011933,True
1,877611172832227328,2017-06-21 19:36:23,http://twitter.com/download/iphone,@dog_rates the boyfriend and his soaking wet p...,876850772322988032,512804500.0,2017-06-19 17:14:49,https://twitter.com/rachel2195/status/87685077...,14,10,...,1,Irish_setter,0.364729,True,golden_retriever,0.202907,True,Irish_terrier,0.107473,True
2,867072653475098625,2017-05-23 17:40:04,http://twitter.com/download/iphone,these @dog_rates hats are 13/10 bean approved ...,865013420445368320,7.874618e+17,2017-05-18 01:17:25,https://twitter.com/rachaeleasler/status/86501...,13,10,...,1,Blenheim_spaniel,0.352946,True,papillon,0.211766,True,Pekinese,0.112952,True
3,860924035999428608,2017-05-06 18:27:40,http://twitter.com/download/iphone,h*ckin adorable promposal. 13/10 @dog_rates ht...,860914485250469888,363890800.0,2017-05-06 17:49:42,https://twitter.com/tallylott/status/860914485...,13,10,...,2,envelope,0.933016,False,oscilloscope,0.012591,False,paper_towel,0.011178,False
4,851861385021730816,2017-04-11 18:15:55,http://twitter.com/download/iphone,Thanks @dog_rates completed my laptop. 10/10 w...,848289382176100352,341021100.0,2017-04-01 21:42:03,https://twitter.com/eddie_coe98/status/8482893...,10,10,...,1,pencil_box,0.662183,False,purse,0.066505,False,pillow,0.044725,False


In [217]:
tweets_merge.shape

(1964, 25)

In [218]:
retweet_merge.shape

(72, 28)

In [219]:
replied_merge.shape

(23, 27)

## Exporting to Final CSV

In [220]:
tweets_merge.to_csv('twitter_archive_master.csv', index=False)
retweet_merge.to_csv('final_retweets.csv', index=False)
replied_merge.to_csv('final_replies.csv', index=False)
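After exporting, it can be worth reading a file back to confirm that ids and timestamps survive the round trip. The sketch below uses an in-memory buffer and a small invented frame instead of the real files; `parse_dates` is needed because CSV stores timestamps as plain text.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "tweet_id": [892420643555336193, 892177421306343426],
    "timestamp": pd.to_datetime(["2017-08-01 16:23:56", "2017-08-01 00:17:27"]),
    "rating_numerator": [13, 13],
})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["timestamp"])

print(restored["tweet_id"].tolist() == toy["tweet_id"].tolist())
print(restored["timestamp"].dtype)
```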