#   Data wrangling WeRateDogs

#### Bawahab Abdulwahab - Makkah - KSA

## Introduction


The dataset we will wrangle and analyze is the tweet archive of Twitter user **@dog_rates**, also known as **WeRateDogs**.

**WeRateDogs** is a Twitter account that rates people's dogs with a humorous comment about the dog. 

These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." 

WeRateDogs has over **4 million** followers and has received international media coverage.

The Twitter account we used can be found here: [WeRateDogs™](https://twitter.com/dog_rates)


### Steps

- **Gathering Data** Load the data from three sources: (an existing file), (a file downloaded programmatically using the Requests library), (the Twitter API using Tweepy).
- **Assessing Data** (8 quality issues), (2 tidiness issues).
- **Cleaning Data** Clean each of the issues we documented while assessing.
- **Storing** Store the clean DataFrames in files, with the main one named `twitter_archive_master.csv`.
- **Analyzing & Visualizing Data** (3 insights), (1 visualization).
- **Reports** (300-600 word `wrangle_report.pdf` describes our wrangling efforts), (250-word `act_report.pdf` communicates the insights and displays the visualizations like a blog post or magazine article).


## Gathering data

In [1]:
# Import all packages
import pandas as pd
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import dates
import requests
import wptools
from PIL import Image
from functools import reduce
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer

### Twitter archive file

Load the Twitter archive dataset from the file `twitter-archive-enhanced-2.csv`.

In [2]:
# Load dataset twitter archive
archive_tw = pd.read_csv('twitter-archive-enhanced-2.csv')
archive_tw.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [3]:
archive_tw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

### Twitter image

Download the file `image-predictions.tsv` with the Requests library.

In [4]:
# Download the file "image-predictions.tsv" with the Requests library

url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
response = requests.get(url)
# Stop here if the download failed instead of saving an error page
response.raise_for_status()

with open('image-predictions.tsv', mode='wb') as file:
    file.write(response.content)

In [5]:
# Load Dataset image predictions
predict_img = pd.read_csv('image-predictions.tsv', sep='\t' )
predict_img.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [6]:
predict_img.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


### Twitter API And JSON

**After** I asked Twitter for permission to use their API to collect data from the WeRateDogs account, they rejected my request.

**So** the code below shows how the tweets would be fetched through the Twitter API; it is included for reference only.

In [7]:
# Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
# These are hidden to comply with Twitter's API terms and conditions
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

In [8]:
tweet_ids = archive_tw.tweet_id.values
len(tweet_ids)

2356

In [9]:
# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet-json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
            pass
end = timer()
print(end - start)
print(fails_dict)

1: 892420643555336193
Fail
2: 892177421306343426
Fail
3: 891815181378084864
Fail
4: 891689557279858688
Fail
5: 891327558926688256
Fail
6: 891087950875897856
Fail
7: 890971913173991426
Fail
8: 890729181411237888
Fail
9: 890609185150312448
Fail
10: 890240255349198849
Fail
11: 890006608113172480
Fail
12: 889880896479866881
Fail
13: 889665388333682689
Fail
14: 889638837579907072
Fail
15: 889531135344209921
Fail
16: 889278841981685760
Fail
17: 888917238123831296
Fail
18: 888804989199671297
Fail
19: 888554962724278272
Fail
20: 888202515573088257
Fail
21: 888078434458587136
Fail
22: 887705289381826560
Fail
23: 887517139158093824
Fail
24: 887473957103951883
Fail
25: 887343217045368832
Fail
26: 887101392804085760
Fail
27: 886983233522544640
Fail
28: 886736880519319552
Fail
29: 886680336477933568
Fail
30: 886366144734445568
Fail
31: 886267009285017600
Fail
32: 886258384151887873
Fail
33: 886054160059072513
Fail
34: 885984800019947520
Fail
35: 885528943205470208
Fail
36: 885518971528720385
Fail
3

**But now** we still need the extra information **retweet count** and **favorite count**, so we will use the existing Udacity file `tweet_json.txt` to get it.

In [9]:
# Create an empty list and append the parsed JSON of each line in `tweet_json.txt`

tweet_list = []


# Read tweet_json.txt line by line; the with-block closes the file automatically
with open('tweet_json.txt', 'r') as tw_json:
    for line in tw_json:
        tweets = json.loads(line)
        tweet_list.append(tweets)


print(tweet_list[0])

{'created_at': 'Tue Aug 01 16:23:56 +0000 2017', 'id': 892420643555336193, 'id_str': '892420643555336193', 'full_text': "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU", 'truncated': False, 'display_text_range': [0, 85], 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 892420639486877696, 'id_str': '892420639486877696', 'indices': [86, 109], 'media_url': 'http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg', 'url': 'https://t.co/MgUWQ76dJU', 'display_url': 'pic.twitter.com/MgUWQ76dJU', 'expanded_url': 'https://twitter.com/dog_rates/status/892420643555336193/photo/1', 'type': 'photo', 'sizes': {'large': {'w': 540, 'h': 528, 'resize': 'fit'}, 'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'small': {'w': 540, 'h': 528, 'resize': 'fit'}, 'medium': {'w': 540, 'h': 528, 'resize': 'fit'}}}]}, 'extended_entities': {'media

In [10]:
# Create an empty DataFrame to hold the selected tweet fields

tw_api = pd.DataFrame()  

# Add the columns tweet_id, retweet_count and favorite_count from the `tweet_json.txt` data
tw_api['tweet_id'] = list(map(lambda tweet: tweet['id'], tweet_list))
tw_api['retweet_count'] = list(map(lambda tweet: tweet['retweet_count'], tweet_list))
tw_api['favorite_count'] = list(map(lambda tweet: tweet['favorite_count'], tweet_list))


tw_api.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,8853,39467
1,892177421306343426,6514,33819
2,891815181378084864,4328,25461
3,891689557279858688,8964,42908
4,891327558926688256,9774,41048


In [11]:
tw_api.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 3 columns):
tweet_id          2354 non-null int64
retweet_count     2354 non-null int64
favorite_count    2354 non-null int64
dtypes: int64(3)
memory usage: 55.2 KB
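
The same three fields could also be collected in a single pass over the file. The sketch below is a minimal variant under stated assumptions: the helper name `tweets_to_frame` is hypothetical, the two sample lines only mirror the first rows shown above, and each line is assumed to hold one JSON object with `id`, `retweet_count` and `favorite_count` keys.

```python
import io
import json

import pandas as pd


def tweets_to_frame(lines):
    """Build a DataFrame with one row per JSON line (hypothetical helper)."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines instead of failing in json.loads
            continue
        tweet = json.loads(line)
        rows.append({
            'tweet_id': tweet['id'],
            'retweet_count': tweet['retweet_count'],
            'favorite_count': tweet['favorite_count'],
        })
    return pd.DataFrame(rows, columns=['tweet_id', 'retweet_count', 'favorite_count'])


# Sample lines standing in for tweet_json.txt (illustration only)
sample_lines = io.StringIO(
    '{"id": 892420643555336193, "retweet_count": 8853, "favorite_count": 39467}\n'
    '\n'
    '{"id": 892177421306343426, "retweet_count": 6514, "favorite_count": 33819}\n'
)
tw_api_sketch = tweets_to_frame(sample_lines)
print(tw_api_sketch)
```

With the real file, the helper could be called as `tweets_to_frame(open_file)` inside a `with open('tweet_json.txt') as open_file:` block.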


## Assessing Data

### Twitter archive

In [12]:
archive_tw

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,


In [13]:
archive_tw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [14]:
archive_tw['source'].value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64

In [15]:
archive_tw['expanded_urls'].value_counts()

http://www.gofundme.com/bluethewhitehusky,https://twitter.com/dog_rates/status/831650051525054464/photo/1,https://twitter.com/dog_rates/status/831650051525054464/photo/1,https://twitter.com/dog_rates/status/831650051525054464/photo/1,https://twitter.com/dog_rates/status/831650051525054464/photo/1    2
https://twitter.com/dog_rates/status/839549326359670784/photo/1                                                                                                                                                                                                                                              2
https://twitter.com/dog_rates/status/810254108431155201/photo/1                                                                                                                                                                                                                                              2
https://twitter.com/dog_rates/status/681523177663676416/photo/1                            

In [16]:
archive_tw['rating_numerator'].value_counts()

12      558
11      464
10      461
13      351
9       158
8       102
7        55
14       54
5        37
6        32
3        19
4        17
1         9
2         9
420       2
0         2
15        2
75        2
80        1
20        1
24        1
26        1
44        1
50        1
60        1
165       1
84        1
88        1
144       1
182       1
143       1
666       1
960       1
1776      1
17        1
27        1
45        1
99        1
121       1
204       1
Name: rating_numerator, dtype: int64

In [17]:
archive_tw['rating_denominator'].value_counts()

10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64
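
Almost all denominators are 10, so the remaining rows are worth reading by hand. A minimal sketch of that filter is shown here on a small made-up frame rather than on `archive_tw`; the texts and values are invented for illustration.

```python
import pandas as pd

# Made-up rows standing in for archive_tw
sample = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'text': ['made-up text 13/10', 'made-up text 9/11',
             'made-up text 24/7', 'made-up text 12/10'],
    'rating_numerator': [13, 9, 24, 12],
    'rating_denominator': [10, 11, 7, 10],
})

# Keep only the rows whose denominator is not the usual 10
unusual = sample[sample['rating_denominator'] != 10]
print(unusual[['text', 'rating_numerator', 'rating_denominator']])
```

Reading the full `text` of such rows usually shows whether the value is a real rating or another fraction that appears in the tweet.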

In [18]:
archive_tw['name'].value_counts()

None              745
a                  55
Charlie            12
Lucy               11
Oliver             11
Cooper             11
Penny              10
Lola               10
Tucker             10
Bo                  9
Winston             9
the                 8
Sadie               8
Daisy               7
an                  7
Bailey              7
Buddy               7
Toby                7
Dave                6
Jack                6
Scout               6
Oscar               6
Milo                6
Bella               6
Jax                 6
Koda                6
Leo                 6
Stanley             6
Rusty               6
Sammy               5
                 ... 
Norman              1
Sage                1
Arnold              1
Orion               1
Ulysses             1
Shadoe              1
Chadrick            1
Swagger             1
Edgar               1
Jim                 1
Berkeley            1
Donny               1
Trevith             1
Ralph               1
Grizz     

In [19]:
archive_tw['doggo'].value_counts()

None     2259
doggo      97
Name: doggo, dtype: int64

In [20]:
archive_tw['floofer'].value_counts()

None       2346
floofer      10
Name: floofer, dtype: int64

In [21]:
archive_tw['pupper'].value_counts()

None      2099
pupper     257
Name: pupper, dtype: int64

In [22]:
archive_tw['puppo'].value_counts()

None     2326
puppo      30
Name: puppo, dtype: int64

In [23]:
archive_tw.isnull().sum()

tweet_id                         0
in_reply_to_status_id         2278
in_reply_to_user_id           2278
timestamp                        0
source                           0
text                             0
retweeted_status_id           2175
retweeted_status_user_id      2175
retweeted_status_timestamp    2175
expanded_urls                   59
rating_numerator                 0
rating_denominator               0
name                             0
doggo                            0
floofer                          0
pupper                           0
puppo                            0
dtype: int64

### Twitter image

In [24]:
predict_img

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
5,666050758794694657,https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg,1,Bernese_mountain_dog,0.651137,True,English_springer,0.263788,True,Greater_Swiss_Mountain_dog,0.016199,True
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False,mud_turtle,0.045885,False,terrapin,0.017885,False
7,666055525042405380,https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg,1,chow,0.692517,True,Tibetan_mastiff,0.058279,True,fur_coat,0.054449,False
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,1,shopping_cart,0.962465,False,shopping_basket,0.014594,False,golden_retriever,0.007959,True
9,666058600524156928,https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg,1,miniature_poodle,0.201493,True,komondor,0.192305,True,soft-coated_wheaten_terrier,0.082086,True


In [25]:
predict_img.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [26]:
predict_img['p1'].value_counts()

golden_retriever             150
Labrador_retriever           100
Pembroke                      89
Chihuahua                     83
pug                           57
chow                          44
Samoyed                       43
toy_poodle                    39
Pomeranian                    38
cocker_spaniel                30
malamute                      30
French_bulldog                26
miniature_pinscher            23
Chesapeake_Bay_retriever      23
seat_belt                     22
Siberian_husky                20
Staffordshire_bullterrier     20
German_shepherd               20
Cardigan                      19
web_site                      19
Maltese_dog                   18
teddy                         18
beagle                        18
Eskimo_dog                    18
Shetland_sheepdog             18
Shih-Tzu                      17
Rottweiler                    17
Lakeland_terrier              17
Italian_greyhound             16
kuvasz                        16
          

In [27]:
predict_img['p2'].value_counts()

Labrador_retriever                104
golden_retriever                   92
Cardigan                           73
Chihuahua                          44
Pomeranian                         42
French_bulldog                     41
Chesapeake_Bay_retriever           41
toy_poodle                         37
cocker_spaniel                     34
miniature_poodle                   33
Siberian_husky                     33
beagle                             28
Eskimo_dog                         27
Pembroke                           27
collie                             27
kuvasz                             26
Italian_greyhound                  22
American_Staffordshire_terrier     21
Pekinese                           21
malinois                           20
Samoyed                            20
miniature_pinscher                 20
chow                               20
toy_terrier                        20
Norwegian_elkhound                 19
Boston_bull                        19
Staffordshir

In [28]:
predict_img['p3'].value_counts()

Labrador_retriever                79
Chihuahua                         58
golden_retriever                  48
Eskimo_dog                        38
kelpie                            35
kuvasz                            34
Staffordshire_bullterrier         32
chow                              32
cocker_spaniel                    31
beagle                            31
Pomeranian                        29
Pekinese                          29
toy_poodle                        29
Chesapeake_Bay_retriever          27
Pembroke                          27
Great_Pyrenees                    27
malamute                          26
French_bulldog                    26
American_Staffordshire_terrier    24
Cardigan                          23
pug                               23
basenji                           21
toy_terrier                       20
bull_mastiff                      20
Siberian_husky                    19
Boston_bull                       17
Shetland_sheepdog                 17
L

In [29]:
predict_img['p1_dog'].nunique()

2

In [30]:
predict_img['p2_dog'].nunique()

2

In [31]:
predict_img['p3_dog'].nunique()

2

In [32]:
predict_img['p1_conf'].describe()

count    2075.000000
mean        0.594548
std         0.271174
min         0.044333
25%         0.364412
50%         0.588230
75%         0.843855
max         1.000000
Name: p1_conf, dtype: float64

In [33]:
predict_img['p2_conf'].describe()

count    2.075000e+03
mean     1.345886e-01
std      1.006657e-01
min      1.011300e-08
25%      5.388625e-02
50%      1.181810e-01
75%      1.955655e-01
max      4.880140e-01
Name: p2_conf, dtype: float64

In [34]:
predict_img['p3_conf'].describe()

count    2.075000e+03
mean     6.032417e-02
std      5.090593e-02
min      1.740170e-10
25%      1.622240e-02
50%      4.944380e-02
75%      9.180755e-02
max      2.734190e-01
Name: p3_conf, dtype: float64

In [35]:
predict_img.isnull().sum()

tweet_id    0
jpg_url     0
img_num     0
p1          0
p1_conf     0
p1_dog      0
p2          0
p2_conf     0
p2_dog      0
p3          0
p3_conf     0
p3_dog      0
dtype: int64

### Twitter API And JSON

In [36]:
tw_api

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,8853,39467
1,892177421306343426,6514,33819
2,891815181378084864,4328,25461
3,891689557279858688,8964,42908
4,891327558926688256,9774,41048
5,891087950875897856,3261,20562
6,890971913173991426,2158,12041
7,890729181411237888,16716,56848
8,890609185150312448,4429,28226
9,890240255349198849,7711,32467


In [37]:
tw_api.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 3 columns):
tweet_id          2354 non-null int64
retweet_count     2354 non-null int64
favorite_count    2354 non-null int64
dtypes: int64(3)
memory usage: 55.2 KB


In [38]:
tw_api.describe()

Unnamed: 0,tweet_id,retweet_count,favorite_count
count,2354.0,2354.0,2354.0
mean,7.426978e+17,3164.797366,8080.968564
std,6.852812e+16,5284.770364,11814.771334
min,6.660209e+17,0.0,0.0
25%,6.783975e+17,624.5,1415.0
50%,7.194596e+17,1473.5,3603.5
75%,7.993058e+17,3652.0,10122.25
max,8.924206e+17,79515.0,132810.0


In [39]:
tw_api.isnull().sum()

tweet_id          0
retweet_count     0
favorite_count    0
dtype: int64
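
The three frames have different row counts (2356, 2075 and 2354), so some tweet IDs are missing from at least one of them. The sketch below shows one way the overlap could be checked with set operations; the small frames are made-up stand-ins for `archive_tw`, `predict_img` and `tw_api`.

```python
import pandas as pd

# Made-up stand-ins for the three real frames
archive_sample = pd.DataFrame({'tweet_id': [1, 2, 3, 4]})
images_sample = pd.DataFrame({'tweet_id': [1, 2, 4]})
api_sample = pd.DataFrame({'tweet_id': [1, 2, 3]})

archive_ids = set(archive_sample['tweet_id'])
image_ids = set(images_sample['tweet_id'])
api_ids = set(api_sample['tweet_id'])

# IDs in the archive with no image prediction or no API record
missing_images = archive_ids - image_ids
missing_api = archive_ids - api_ids

# IDs present in all three frames, which is what an inner merge would keep
common = archive_ids & image_ids & api_ids
print(len(missing_images), len(missing_api), len(common))
```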

### Quality

#### Twitter archive dataframe

- The **name** column has the value "None", which is actually a null value; we need to convert it to NaN.
- The **rating_numerator** column has invalid values.
- The **rating_numerator** column needs its data type changed.
- The **stages** column has the value "None", which is actually a null value; we need to convert it to NaN.
- The `stages` column needs to be converted from string to category.
- Retweets need to be removed because they are not original tweets.
- Unnecessary columns need to be removed.
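
The retweet item above relies on `retweeted_status_id` being non-null only for retweets (181 rows in the `info()` output). A minimal sketch of that filter follows, on a made-up frame; treating the three `retweeted_status_*` columns as the unnecessary ones is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows: only the second one is a retweet
sample = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'retweeted_status_id': [np.nan, 8.87e17, np.nan],
    'retweeted_status_user_id': [np.nan, 4.19e9, np.nan],
    'retweeted_status_timestamp': [np.nan, '2017-07-19 00:47:34 +0000', np.nan],
})

# Original tweets have no retweeted_status_id
originals = sample[sample['retweeted_status_id'].isna()].copy()

# After filtering, the retweet columns are empty and can be dropped
retweet_cols = ['retweeted_status_id', 'retweeted_status_user_id',
                'retweeted_status_timestamp']
originals = originals.drop(columns=retweet_cols).reset_index(drop=True)
print(originals)
```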

#### Twitter image dataframe

- There are unnecessary underscores **(_)** between words in the columns **p1**, **p2** and **p3**.
- Capitalization is inconsistent in the columns **p1**, **p2** and **p3**.

#### Twitter API And JSON dataframe

- Remove the tweet rows that don't have dog ratings.

### Tidiness

- The dog stage is spread across 4 columns, **doggo**, **floofer**, **pupper** and **puppo**; we need to merge them into one.
- Our three dataframes describe the same observational unit; we need to merge them.
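
The two tidiness steps above can be sketched as follows. This is a rough illustration on made-up frames, not the notebook's actual cleaning code: the column name `stage`, the comma-joined label for tweets with two stages, and the inner join on `tweet_id` are all assumptions.

```python
from functools import reduce

import numpy as np
import pandas as pd

# Made-up stand-ins for the three real frames
archive_sample = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'doggo': ['doggo', 'None', 'None', 'doggo'],
    'floofer': ['None', 'None', 'None', 'None'],
    'pupper': ['None', 'pupper', 'None', 'pupper'],
    'puppo': ['None', 'None', 'None', 'None'],
})
images_sample = pd.DataFrame({'tweet_id': [1, 2, 4], 'p1': ['pug', 'chow', 'collie']})
api_sample = pd.DataFrame({'tweet_id': [1, 2, 3], 'retweet_count': [10, 20, 30]})

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']


def combine_stages(row):
    """Join every stage that is set in a row; NaN when none is set."""
    found = [value for value in row if value != 'None']
    return ', '.join(found) if found else np.nan


# Step 1: collapse the four stage columns into one
archive_sample['stage'] = archive_sample[stage_cols].apply(combine_stages, axis=1)
archive_sample = archive_sample.drop(columns=stage_cols)

# Step 2: merge the three frames on tweet_id, keeping IDs present in all of them
frames = [archive_sample, images_sample, api_sample]
merged = reduce(
    lambda left, right: pd.merge(left, right, on='tweet_id', how='inner'),
    frames,
)
print(merged)
```

An inner join drops tweets that lack an image prediction or API record; a left join would keep them with NaN values instead, so the choice depends on which rows the analysis needs.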

## Cleaning Data

We will make copies of the dataframes for cleaning.

In [40]:
# Copy the dataframes

archive_tw_clean = archive_tw.copy()
predict_img_clean = predict_img.copy()
tw_apia_clean = tw_api.copy()

### Issues

#### Define

We will remove the unnecessary underscores (_) between words in the columns `p1`, `p2` and `p3` of the Twitter image dataframe.

#### Code

In [41]:
# Replace underscores with spaces
predict_img_clean['p1'] = predict_img_clean['p1'].str.replace('_', ' ')
predict_img_clean['p2'] = predict_img_clean['p2'].str.replace('_', ' ')
predict_img_clean['p3'] = predict_img_clean['p3'].str.replace('_', ' ')

#### Test

In [42]:
# Check
predict_img_clean

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh springer spaniel,0.465074,True,collie,0.156665,True,Shetland sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature pinscher,0.074192,True,Rhodesian ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian ridgeback,0.408143,True,redbone,0.360687,True,miniature pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
5,666050758794694657,https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg,1,Bernese mountain dog,0.651137,True,English springer,0.263788,True,Greater Swiss Mountain dog,0.016199,True
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box turtle,0.933012,False,mud turtle,0.045885,False,terrapin,0.017885,False
7,666055525042405380,https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg,1,chow,0.692517,True,Tibetan mastiff,0.058279,True,fur coat,0.054449,False
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,1,shopping cart,0.962465,False,shopping basket,0.014594,False,golden retriever,0.007959,True
9,666058600524156928,https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg,1,miniature poodle,0.201493,True,komondor,0.192305,True,soft-coated wheaten terrier,0.082086,True


### Issues

#### Define

We will convert the uppercase letters in the columns `p1`, `p2` and `p3` of the Twitter image dataframe to lowercase using the `str.lower()` function.

#### Code

In [43]:
# Use str.lower() to convert the columns `p1`, `p2` and `p3` to lowercase


predict_img_clean['p1'] = predict_img_clean['p1'].str.lower()
predict_img_clean['p2'] = predict_img_clean['p2'].str.lower()
predict_img_clean['p3'] = predict_img_clean['p3'].str.lower()

#### Test

In [44]:
# check
predict_img_clean.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,welsh springer spaniel,0.465074,True,collie,0.156665,True,shetland sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature pinscher,0.074192,True,rhodesian ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,german shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,rhodesian ridgeback,0.408143,True,redbone,0.360687,True,miniature pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature pinscher,0.560311,True,rottweiler,0.243682,True,doberman,0.154629,True
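
The two cleaning steps for `p1`, `p2` and `p3` could also be written as one loop, with a programmatic check instead of a visual one. This is a minimal sketch on a made-up frame whose values mirror the first rows of `predict_img`.

```python
import pandas as pd

# Made-up frame mirroring the first prediction rows
sample = pd.DataFrame({
    'p1': ['Welsh_springer_spaniel', 'redbone'],
    'p2': ['collie', 'miniature_pinscher'],
    'p3': ['Shetland_sheepdog', 'Rhodesian_ridgeback'],
})
prediction_cols = ['p1', 'p2', 'p3']

for col in prediction_cols:
    # Replace underscores with spaces, then lowercase the result
    sample[col] = sample[col].str.replace('_', ' ', regex=False).str.lower()

# Programmatic checks: no underscores left and everything is lowercase
has_underscore = any(
    sample[col].str.contains('_', regex=False).any() for col in prediction_cols
)
all_lower = all(
    (sample[col] == sample[col].str.lower()).all() for col in prediction_cols
)
print(has_underscore, all_lower)
```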


### Issues

#### Define

In the Twitter archive dataframe, the `name` column contains the string **None**, which stands for a missing value; we need to convert it to **NaN**.

#### Code

In [45]:
# Replace the string 'None' with NaN; assigning back avoids an in-place edit on a column selection
archive_tw_clean['name'] = archive_tw_clean['name'].replace('None', np.nan)

#### Test

In [46]:
# check
archive_tw_clean['name'].value_counts()

a                 55
Charlie           12
Oliver            11
Lucy              11
Cooper            11
Lola              10
Penny             10
Tucker            10
Winston            9
Bo                 9
the                8
Sadie              8
Buddy              7
an                 7
Toby               7
Daisy              7
Bailey             7
Jack               6
Koda               6
Oscar              6
Bella              6
Leo                6
Milo               6
Rusty              6
Scout              6
Stanley            6
Jax                6
Dave               6
Oakley             5
Phil               5
                  ..
Norman             1
Sage               1
Arnold             1
Orion              1
Ulysses            1
Shadoe             1
Chadrick           1
Swagger            1
Edgar              1
Jim                1
Berkeley           1
Donny              1
Trevith            1
Ralph              1
Grizz              1
Dug                1
Freddery     

In [47]:
# count the null values in `name` after the replacement
archive_tw_clean['name'].isnull().sum()

745

### Issues

#### Define

- We will correct the values in the `rating_numerator` column that were parsed incorrectly from decimal ratings.
- We will change the type of the `rating_numerator` column from **integer** to **float**.
- We will drop the row with index **340**, which is a retweet of the tweet at index **695**.
- We will reset the index after dropping the row.
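
As an alternative to typing each corrected value by hand, the decimal numerators could in principle be pulled straight from the tweet text. The sketch below assumes a toy dataframe `toy_tweets` with shortened texts. It only captures the first `number/number` fragment in each tweet, so tweets with several such fragments would still need a manual look.

```python
import pandas as pd

# Toy tweets: two decimal ratings that were parsed as 5, and one integer rating.
toy_tweets = pd.DataFrame({
    'text': [
        'She is also offering you her favorite monkey. 13.5/10',
        "We'll bump to 9.5/10. Still a menace",
        'This is a good boy. 12/10',
    ],
    'rating_numerator': [5, 5, 12],
})

# Capture an integer or decimal numerator that is followed by "/<denominator>".
extracted = toy_tweets['text'].str.extract(r'(\d+(?:\.\d+)?)/\d+', expand=False)

# Store the numerators as floats so that decimals are preserved.
toy_tweets['rating_numerator'] = extracted.astype(float)

print(toy_tweets['rating_numerator'].tolist())
```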

#### Code

In [48]:
# check
archive_tw_clean['rating_numerator'].value_counts()

12      558
11      464
10      461
13      351
9       158
8       102
7        55
14       54
5        37
6        32
3        19
4        17
1         9
2         9
420       2
0         2
15        2
75        2
80        1
20        1
24        1
26        1
44        1
50        1
60        1
165       1
84        1
88        1
144       1
182       1
143       1
666       1
960       1
1776      1
17        1
27        1
45        1
99        1
121       1
204       1
Name: rating_numerator, dtype: int64

In [49]:
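# show the tweets whose text contains a decimal rating such as 13.5/10
# (the pattern has no capture group, so str.contains does not raise a warning)
with pd.option_context('max_colwidth', 300):
    display(archive_tw_clean[archive_tw_clean['text'].str.contains(r"\d+\.\d*/\d+")][['tweet_id', 'text', 'rating_numerator', 
                                                                                         'rating_denominator']])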



Unnamed: 0,tweet_id,text,rating_numerator,rating_denominator
45,883482846933004288,"This is Bella. She hopes her smile made you smile. If not, she is also offering you her favorite monkey. 13.5/10 https://t.co/qjrljjt948",5,10
340,832215909146226688,"RT @dog_rates: This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wu…",75,10
695,786709082849828864,"This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS",75,10
763,778027034220126208,This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://t.co/QFaUiIHxHq,27,10
1689,681340665377193984,I've been told there's a slight possibility he's checking his mirror. We'll bump to 9.5/10. Still a menace,5,10
1712,680494726643068929,Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10 https://t.co/eNm2S6p9BD,26,10


Some ratings are decimal numbers, so we will change the `rating_numerator` column from integer to float type.

In [50]:
# change numerator from int to float 

archive_tw_clean['rating_numerator'] = archive_tw_clean['rating_numerator'].astype(float)

In [51]:
# replace the wrongly parsed numerators with the correct values

archive_tw_clean.loc[(archive_tw_clean.tweet_id == 883482846933004288), 'rating_numerator'] = 13.5
archive_tw_clean.loc[(archive_tw_clean.tweet_id == 786709082849828864), 'rating_numerator'] = 9.75
archive_tw_clean.loc[(archive_tw_clean.tweet_id == 778027034220126208), 'rating_numerator'] = 11.27
archive_tw_clean.loc[(archive_tw_clean.tweet_id == 681340665377193984), 'rating_numerator'] = 9.5
archive_tw_clean.loc[(archive_tw_clean.tweet_id == 680494726643068929), 'rating_numerator'] = 11.26

Because the row with index 340 (tweet id `832215909146226688`) is a retweet of the tweet with id `786709082849828864`, we will drop it and then reset the index.
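
Dropping by position works as long as the row order never changes. A possible alternative, sketched here on made-up data, is to select the row by its identifier. `toy_archive` and `duplicate_id` are hypothetical names used only for this example.

```python
import pandas as pd

# Toy archive with four tweets; the ids are made up.
toy_archive = pd.DataFrame({
    'tweet_id': [101, 102, 103, 104],
    'rating_numerator': [13.0, 9.75, 12.0, 9.75],
})

# Select the unwanted row by its identifier instead of its position.
duplicate_id = 102  # hypothetical id of the repeated tweet
toy_archive = toy_archive[toy_archive['tweet_id'] != duplicate_id]

# Rebuild a gap-free 0..n-1 index after the row is gone.
toy_archive = toy_archive.reset_index(drop=True)

print(toy_archive)
```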

In [52]:
# remove tweet row with index 340

archive_2_clean = archive_tw_clean.drop(archive_tw_clean.index[340])

In [53]:
# reset the index after deleting the row

archive_2_clean.reset_index(drop=True, inplace=True)
archive_2_clean

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13.0,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13.0,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12.0,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13.0,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12.0,10,Franklin,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13.0,10,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13.0,10,Jax,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13.0,10,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13.0,10,Zoey,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14.0,10,Cassie,doggo,,,


#### Test

In [54]:
# check
archive_2_clean['rating_numerator'].value_counts()

12.00      558
11.00      464
10.00      461
13.00      351
9.00       158
8.00       102
7.00        55
14.00       54
5.00        35
6.00        32
3.00        19
4.00        17
2.00         9
1.00         9
0.00         2
420.00       2
15.00        2
84.00        1
24.00        1
17.00        1
13.50        1
960.00       1
143.00       1
121.00       1
99.00        1
182.00       1
165.00       1
45.00        1
204.00       1
1776.00      1
9.75         1
666.00       1
11.27        1
80.00        1
11.26        1
88.00        1
144.00       1
9.50         1
20.00        1
44.00        1
60.00        1
50.00        1
Name: rating_numerator, dtype: int64

In [55]:
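# show the decimal-rating tweets again to confirm the corrected numerators
# (the pattern has no capture group, so str.contains does not raise a warning)
with pd.option_context('max_colwidth', 300):
    display(archive_2_clean[archive_2_clean['text'].str.contains(r"\d+\.\d*/\d+")][['tweet_id', 'text', 'rating_numerator', 
                                                                                         'rating_denominator']])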



Unnamed: 0,tweet_id,text,rating_numerator,rating_denominator
45,883482846933004288,"This is Bella. She hopes her smile made you smile. If not, she is also offering you her favorite monkey. 13.5/10 https://t.co/qjrljjt948",13.5,10
694,786709082849828864,"This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS",9.75,10
762,778027034220126208,This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://t.co/QFaUiIHxHq,11.27,10
1688,681340665377193984,I've been told there's a slight possibility he's checking his mirror. We'll bump to 9.5/10. Still a menace,9.5,10
1711,680494726643068929,Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10 https://t.co/eNm2S6p9BD,11.26,10


### Issues

#### Define

Remove the retweets, since they are not original tweets.

A row is a retweet if it has a value in `retweeted_status_id`, so we will drop every row where that column is not null.
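
The same filter can be written as a boolean mask that keeps only the rows where `retweeted_status_id` is null. This is a minimal sketch on a toy dataframe with made-up values.

```python
import numpy as np
import pandas as pd

# Toy archive: the second row is a retweet because it carries a retweeted_status_id.
toy_archive = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'retweeted_status_id': [np.nan, 7.86709e17, np.nan],
})

# Keep only the original tweets, i.e. rows with no retweeted_status_id.
originals = toy_archive[toy_archive['retweeted_status_id'].isnull()]

print(originals.shape[0])
```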

#### Code

In [56]:
# remove the retweets

archive_2_clean.drop(archive_2_clean[archive_2_clean['retweeted_status_id'].notnull()].index, inplace=True)
archive_2_clean.shape[0]

2175

#### Test

In [57]:
# check
archive_2_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2354
Data columns (total 17 columns):
tweet_id                      2175 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2175 non-null object
source                        2175 non-null object
text                          2175 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 2117 non-null object
rating_numerator              2175 non-null float64
rating_denominator            2175 non-null int64
name                          1495 non-null object
doggo                         2175 non-null object
floofer                       2175 non-null object
pupper                        2175 non-null object
puppo                         2175 non-null object
dtypes: float64(5), int64(2), object

### Issues

#### Define

The dog stage is spread across four columns, **doggo**, **floofer**, **pupper** and **puppo**. We need to merge them into one column, which we will name **stages**.
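
One way to build such a column without listing every combination is to join whichever stage values are not the `'None'` placeholder. The sketch below assumes a toy dataframe; `join_stages` is a hypothetical helper written for this example.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Toy rows: no stage, a single stage, and two stages at once.
toy_archive = pd.DataFrame({
    'doggo':   ['None', 'doggo', 'doggo'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['None', 'None', 'pupper'],
    'puppo':   ['None', 'None', 'None'],
})

def join_stages(row):
    """Join every stage in the row that is not the 'None' placeholder."""
    found = [value for value in row if value != 'None']
    return ', '.join(found) if found else 'None'

toy_archive['stages'] = toy_archive[stage_cols].apply(join_stages, axis=1)

print(toy_archive['stages'].tolist())
```

Because the helper works on whatever values it finds, it also handles combinations that were not anticipated in advance.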

#### Code

In [58]:
# create a new column `stages` by concatenating the four columns ('doggo', 'floofer', 'pupper', 'puppo')

archive_2_clean['stages'] = archive_2_clean[['doggo', 'floofer','pupper','puppo']].apply(lambda x: ''.join(x), axis=1)

# all four columns are None: the tweet has no stage
archive_2_clean['stages'].replace("NoneNoneNoneNone","None", inplace=True)

# only doggo is set: the stage is doggo
archive_2_clean['stages'].replace("doggoNoneNoneNone","doggo", inplace=True)

# only floofer is set: the stage is floofer
archive_2_clean['stages'].replace("NoneflooferNoneNone","floofer", inplace=True)

# only pupper is set: the stage is pupper
archive_2_clean['stages'].replace("NoneNonepupperNone","pupper", inplace=True)

# only puppo is set: the stage is puppo
archive_2_clean['stages'].replace("NoneNoneNonepuppo","puppo", inplace=True)

# doggo and pupper are both set: the stage is "doggo, pupper"
archive_2_clean['stages'].replace("doggoNonepupperNone","doggo, pupper", inplace=True)

# doggo and floofer are both set: the stage is "doggo, floofer"
archive_2_clean['stages'].replace("doggoflooferNoneNone","doggo, floofer", inplace=True)

# doggo and puppo are both set: the stage is "doggo, puppo"
archive_2_clean['stages'].replace("doggoNoneNonepuppo","doggo, puppo", inplace=True)

#### Test

In [59]:
# check
archive_2_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2354
Data columns (total 18 columns):
tweet_id                      2175 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2175 non-null object
source                        2175 non-null object
text                          2175 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 2117 non-null object
rating_numerator              2175 non-null float64
rating_denominator            2175 non-null int64
name                          1495 non-null object
doggo                         2175 non-null object
floofer                       2175 non-null object
pupper                        2175 non-null object
puppo                         2175 non-null object
stages                        2175 n

In [60]:
# check
archive_2_clean['stages'].value_counts()

None              1831
pupper             224
doggo               75
puppo               24
doggo, pupper       10
floofer              9
doggo, floofer       1
doggo, puppo         1
Name: stages, dtype: int64

### Issues

#### Define

Change the type of the `stages` column in the Twitter archive dataframe from string to category.
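
For a column with only a handful of distinct values, the conversion looks like this on toy data (`toy_stages` is a made-up series for illustration):

```python
import pandas as pd

# Toy stage column with repeated labels.
toy_stages = pd.Series(['pupper', 'doggo', 'pupper', 'puppo'])

# Convert the strings to a categorical type.
as_category = toy_stages.astype('category')

print(as_category.dtype)
print(list(as_category.cat.categories))
```

The categorical type stores each distinct label once, which is why it suits a column such as `stages`.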

#### Code

In [61]:
# convert type from object to category

archive_2_clean['stages'] = archive_2_clean['stages'].astype('category')

#### Test

In [62]:
# check
archive_2_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2354
Data columns (total 18 columns):
tweet_id                      2175 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2175 non-null object
source                        2175 non-null object
text                          2175 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 2117 non-null object
rating_numerator              2175 non-null float64
rating_denominator            2175 non-null int64
name                          1495 non-null object
doggo                         2175 non-null object
floofer                       2175 non-null object
pupper                        2175 non-null object
puppo                         2175 non-null object
stages                        2175 n

### Issues

#### Define

In the Twitter archive dataframe, the **stages** column uses the string **None** where no stage is given. It is actually a null value, so we need to convert it to NaN.
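
Because `stages` is already a categorical column at this point, one alternative is to remove the `'None'` category itself, which turns the values that used it into NaN and leaves no empty category behind. This is a sketch on toy data, not the approach taken in this notebook.

```python
import pandas as pd

# Toy categorical column with the 'None' placeholder as one of its categories.
toy_stages = pd.Series(['None', 'pupper', 'doggo', 'None'], dtype='category')

# Removing a category sets every value that used it to NaN.
cleaned = toy_stages.cat.remove_categories(['None'])

print(cleaned.isnull().sum())
print(list(cleaned.cat.categories))
```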

#### Code

In [63]:
# check
archive_2_clean['stages'].isnull().sum()

0

In [64]:
# replace 'None' with null value
archive_2_clean['stages'].replace('None', np.nan, inplace = True)

#### Test

In [65]:
# check
archive_2_clean['stages'].value_counts()

pupper            224
doggo              75
puppo              24
doggo, pupper      10
floofer             9
doggo, puppo        1
doggo, floofer      1
None                0
Name: stages, dtype: int64

In [66]:
# check
archive_2_clean['stages'].isnull().sum()

1831

### Issues

#### Define

The three dataframes we loaded describe the same observational unit (a tweet), so we will merge them into one master dataframe on `tweet_id`.
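
`pd.merge` performs an inner join by default, so a tweet survives the merge only if its `tweet_id` appears in every dataframe. The sketch below shows this on three toy dataframes whose names and values are made up.

```python
from functools import reduce

import pandas as pd

# Three toy dataframes sharing the tweet_id key; tweet 3 has no image prediction.
toy_archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'rating_numerator': [13.0, 12.0, 11.0]})
toy_images = pd.DataFrame({'tweet_id': [1, 2], 'p1': ['chihuahua', 'basset']})
toy_counts = pd.DataFrame({'tweet_id': [1, 2, 3], 'retweet_count': [10, 20, 30]})

frames = [toy_archive, toy_images, toy_counts]

# Merge the frames pairwise, left to right, on the shared key.
toy_master = reduce(lambda left, right: pd.merge(left, right, on='tweet_id'), frames)

print(toy_master)
```

Tweet 3 is missing from the result because it has no row in `toy_images`, which is the same reason the merged master dataframe can end up with fewer rows than the archive.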

#### Code

In [67]:
# Merge dataframes on 'tweet_id' column

dfs = [archive_2_clean, predict_img_clean, tw_apia_clean]

archive_master = reduce(lambda left,right: pd.merge(left,right,on='tweet_id'), dfs)

#### Test

In [68]:
archive_master.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1994 entries, 0 to 1993
Data columns (total 31 columns):
tweet_id                      1994 non-null int64
in_reply_to_status_id         23 non-null float64
in_reply_to_user_id           23 non-null float64
timestamp                     1994 non-null object
source                        1994 non-null object
text                          1994 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 1994 non-null object
rating_numerator              1994 non-null float64
rating_denominator            1994 non-null int64
name                          1448 non-null object
doggo                         1994 non-null object
floofer                       1994 non-null object
pupper                        1994 non-null object
puppo                         1994 non-null object
stages                        306 no

### Issues

#### Define

We will remove the tweet rows whose image is not predicted to be a dog.

- We will create a function that checks the `p1_dog`, `p2_dog` and `p3_dog` columns for a **True** value, which means the tweet has a dog image, and appends the matching prediction and its confidence to new lists.
- The function stores only the first `True` prediction, checking `p1`, `p2` and `p3` in that order.
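
The "first True wins" rule above can also be expressed without global lists, using `Series.where` and `Series.fillna`. This is only a sketch on toy data; `first_dog_value` is a hypothetical helper, and rows with no dog prediction end up as NaN instead of the string `'Error'`.

```python
import pandas as pd

# Toy predictions: row 0 is a dog in p1, row 1 only in p2, row 2 has no dog at all.
toy_master = pd.DataFrame({
    'p1': ['chihuahua', 'paper towel', 'seat belt'],
    'p1_conf': [0.72, 0.60, 0.50],
    'p1_dog': [True, False, False],
    'p2': ['malamute', 'labrador retriever', 'web site'],
    'p2_conf': [0.08, 0.17, 0.30],
    'p2_dog': [True, True, False],
    'p3': ['kelpie', 'spatula', 'banana'],
    'p3_conf': [0.03, 0.04, 0.10],
    'p3_dog': [True, False, False],
})

def first_dog_value(frame, suffix=''):
    """Return the p1/p2/p3 value (or its *_conf) from the first slot flagged as a dog."""
    # Start with p1 where it is a dog, leaving NaN elsewhere.
    result = frame['p1' + suffix].where(frame['p1_dog'])
    # Fill the remaining gaps from p2, then p3, only where those slots are dogs.
    for slot in ['p2', 'p3']:
        result = result.fillna(frame[slot + suffix].where(frame[slot + '_dog']))
    return result

toy_master['dog_predict'] = first_dog_value(toy_master)
toy_master['dog_conf'] = first_dog_value(toy_master, '_conf')

# Rows with no dog prediction are NaN and can be dropped directly.
toy_master = toy_master.dropna(subset=['dog_predict']).reset_index(drop=True)

print(toy_master[['dog_predict', 'dog_conf']])
```

With this variant the confidence column stays numeric, because no `'Error'` string is ever stored in it.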

#### Code

In [69]:
# predicted breed for each row
predict_list = []
# confidence of that prediction for each row
conf_list = []

# store the first prediction flagged as a dog, or 'Error' if there is none
def dog_images(row):
    
    if row['p1_dog'] == True:
        predict_list.append(row['p1'])
        conf_list.append(row['p1_conf'])
    elif row['p2_dog'] == True:
        predict_list.append(row['p2'])
        conf_list.append(row['p2_conf'])
    elif row['p3_dog'] == True:
        predict_list.append(row['p3'])
        conf_list.append(row['p3_conf'])
    else:
        predict_list.append('Error')
        conf_list.append('Error')

        
archive_master.apply(dog_images, axis=1)
# create new columns
archive_master['dog_predict'] = predict_list
archive_master['dog_conf'] = conf_list

In [70]:
# drop rows with an error

archive_master = archive_master[archive_master['dog_predict'] != 'Error']

In [71]:
# reset the index after dropping the rows

archive_master.reset_index(drop=True, inplace=True)

#### Test

In [72]:
archive_master.head(3)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,...,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,retweet_count,favorite_count,dog_predict,dog_conf
0,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,...,pekinese,0.090647,True,papillon,0.068957,True,6514,33819,chihuahua,0.323581
1,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,...,malamute,0.078253,True,kelpie,0.031379,True,4328,25461,chihuahua,0.716012
2,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,...,labrador retriever,0.168086,True,spatula,0.040836,False,8964,42908,labrador retriever,0.168086


In [73]:
archive_master.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1686 entries, 0 to 1685
Data columns (total 33 columns):
tweet_id                      1686 non-null int64
in_reply_to_status_id         20 non-null float64
in_reply_to_user_id           20 non-null float64
timestamp                     1686 non-null object
source                        1686 non-null object
text                          1686 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 1686 non-null object
rating_numerator              1686 non-null float64
rating_denominator            1686 non-null int64
name                          1267 non-null object
doggo                         1686 non-null object
floofer                       1686 non-null object
pupper                        1686 non-null object
puppo                         1686 non-null object
stages                        260 no

### Issues

#### Define

Remove unnecessary columns.

We will drop the columns that we will not use in the analysis.
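
Dropping a list of columns can be illustrated on a toy dataframe. The column names below mirror a few of the real ones, but the values are made up.

```python
import pandas as pd

# Toy master dataframe with two columns that are not needed for the analysis.
toy_master = pd.DataFrame({
    'tweet_id': [1, 2],
    'source': ['iphone', 'iphone'],
    'img_num': [1, 1],
    'dog_predict': ['chihuahua', 'basset'],
})

# Keeping the unused names in a list makes the drop easy to review and extend.
unused = ['source', 'img_num']
toy_master = toy_master.drop(columns=unused)

print(list(toy_master.columns))
```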

#### Code

In [74]:
# Drop unnecessary columns

archive_master.drop(['retweeted_status_id', 'retweeted_status_user_id','retweeted_status_timestamp', 'source', 'img_num', 
                     'p1', 'p1_conf', 'p1_dog', 'p2', 'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog', 'doggo', 'floofer', 
                     'pupper', 'puppo'], axis= 1, inplace= True)

#### Test

In [75]:
archive_master.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1686 entries, 0 to 1685
Data columns (total 15 columns):
tweet_id                 1686 non-null int64
in_reply_to_status_id    20 non-null float64
in_reply_to_user_id      20 non-null float64
timestamp                1686 non-null object
text                     1686 non-null object
expanded_urls            1686 non-null object
rating_numerator         1686 non-null float64
rating_denominator       1686 non-null int64
name                     1267 non-null object
stages                   260 non-null category
jpg_url                  1686 non-null object
retweet_count            1686 non-null int64
favorite_count           1686 non-null int64
dog_predict              1686 non-null object
dog_conf                 1686 non-null object
dtypes: category(1), float64(3), int64(4), object(7)
memory usage: 186.5+ KB


In [76]:
archive_master

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,stages,jpg_url,retweet_count,favorite_count,dog_predict,dog_conf
0,892177421306343426,,,2017-08-01 00:17:27 +0000,This is Tilly. She's just checking pup on you....,https://twitter.com/dog_rates/status/892177421...,13.0,10,Tilly,,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,6514,33819,chihuahua,0.323581
1,891815181378084864,,,2017-07-31 00:18:03 +0000,This is Archie. He is a rare Norwegian Pouncin...,https://twitter.com/dog_rates/status/891815181...,12.0,10,Archie,,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,4328,25461,chihuahua,0.716012
2,891689557279858688,,,2017-07-30 15:58:51 +0000,This is Darla. She commenced a snooze mid meal...,https://twitter.com/dog_rates/status/891689557...,13.0,10,Darla,,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,8964,42908,labrador retriever,0.168086
3,891327558926688256,,,2017-07-29 16:00:24 +0000,This is Franklin. He would like you to stop ca...,https://twitter.com/dog_rates/status/891327558...,12.0,10,Franklin,,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,9774,41048,basset,0.555712
4,891087950875897856,,,2017-07-29 00:08:17 +0000,Here we have a majestic great white breaching ...,https://twitter.com/dog_rates/status/891087950...,13.0,10,,,https://pbs.twimg.com/media/DF3HwyEWsAABqE6.jpg,3261,20562,chesapeake bay retriever,0.425595
5,890971913173991426,,,2017-07-28 16:27:12 +0000,Meet Jax. He enjoys ice cream so much he gets ...,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13.0,10,Jax,,https://pbs.twimg.com/media/DF1eOmZXUAALUcq.jpg,2158,12041,appenzeller,0.341703
6,890729181411237888,,,2017-07-28 00:22:40 +0000,When you watch your owner call another dog a g...,https://twitter.com/dog_rates/status/890729181...,13.0,10,,,https://pbs.twimg.com/media/DFyBahAVwAAhUTd.jpg,16716,56848,pomeranian,0.566142
7,890609185150312448,,,2017-07-27 16:25:51 +0000,This is Zoey. She doesn't want to be one of th...,https://twitter.com/dog_rates/status/890609185...,13.0,10,Zoey,,https://pbs.twimg.com/media/DFwUU__XcAEpyXI.jpg,4429,28226,irish terrier,0.487574
8,890240255349198849,,,2017-07-26 15:59:51 +0000,This is Cassie. She is a college pup. Studying...,https://twitter.com/dog_rates/status/890240255...,14.0,10,Cassie,doggo,https://pbs.twimg.com/media/DFrEyVuW0AAO3t9.jpg,7711,32467,pembroke,0.511319
9,890006608113172480,,,2017-07-26 00:31:25 +0000,This is Koda. He is a South Australian decksha...,https://twitter.com/dog_rates/status/890006608...,13.0,10,Koda,,https://pbs.twimg.com/media/DFnwSY4WAAAMliS.jpg,7624,31166,samoyed,0.957979


## Storing

We will save the clean dataframe in the `twitter_archive_master.csv` file so that we can run our analysis on it.
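
A CSV file stores plain text, so column types such as `category` are not kept when the file is loaded again. The sketch below shows this with an in-memory buffer in place of a real file; `toy_master` and its values are made up.

```python
import io

import pandas as pd

# Toy master dataframe with a categorical column that contains one null value.
toy_master = pd.DataFrame({
    'tweet_id': [892177421306343426, 891815181378084864],
    'stages': pd.Series(['doggo', None], dtype='category'),
})

# Write to an in-memory buffer instead of a file on disk, then read it back.
buffer = io.StringIO()
toy_master.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.dtypes)
```

After reloading, the ids and the null value are intact, but `stages` is no longer categorical and would need `astype('category')` again if the analysis relies on that type.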

In [77]:
# Save the dataframe
archive_master.to_csv('twitter_archive_master.csv', encoding = 'utf-8', index=False)

## Analyzing & Visualizing Data

Our analysis and visualization of the `twitter_archive_master.csv` dataset are in the file named `Analyzi_Visualiz.ipynb`.