# WeRateDogs Data Analysis

## Table of Contents
<ul>
<li><a href="#intro">Project Motivation</a></li>
    <ul>
    <li><a href="#requirements">Requirements</a></li>
    </ul>
<li><a href="#wrangling">Data Wrangling</a></li>
    <ul>
    <li><a href="#gather">Gather</a></li>
    <li><a href="#assess">Assess</a></li>
        <ul>
        <li><a href="#visual">Visual Assess</a></li>
        <li><a href="#program">Programmatic Assess</a></li>
        <li><a href="#document">Document</a></li>
        </ul>
    <li><a href="#clean">Clean</a></li>
    </ul>
<li><a href="#visuals">Visual Utils Functions</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
    <ul>
    <li><a href="#insight_1">Dogs insight</a></li>
    <li><a href="#insight_2">Dogs2 insight</a></li>
    <li><a href="#insight_3">Dog insight</a></li>
    </ul>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Project Motivation

This project intends to wrangle and analyze data from the Twitter account [`WeRateDogs`](https://twitter.com/dog_rates).

<a id='requirements'></a>
#### Requirements

In [1]:
import pandas as pd
import numpy as np
import requests
import os
import tweepy
import time
import json

<a id='wrangling'></a>
## Data Wrangling
<a id='gather'></a>
### Gather

#### From CSV file

In [2]:
# Gather data from csv
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')

#### From Udacity's server

In [3]:
# Gather data from Udacity's server
server_url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
file_name = server_url.split('/')[-1]  # 'image-predictions.tsv'

response = requests.get(server_url)
response.raise_for_status()  # Stop early if the download failed
with open(file_name, mode='wb') as file:
    file.write(response.content)

image_predictions = pd.read_csv(file_name, sep='\t')

#### Set up Twitter and save data to a JSON file

In [4]:
file_path = 'tweet_json.txt'

if not os.path.isfile(file_path):
    # Set keys to use tweepy
    # os.environ is a mapping, so keys are looked up with [] rather than called
    consumer_key = os.environ['CONSUMER_KEY']
    consumer_secret = os.environ['CONSUMER_SECRET']
    access_token = os.environ['ACCESS_TOKEN']
    access_secret = os.environ['ACCESS_SECRET']

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)

    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

    start = time.time()
    print("Initiating...")

    # Write every tweet on its own line
    with open(file_path, 'w') as file:
        for i, tweet_id in enumerate(twitter_archive.tweet_id):
            try:
                tweet = api.get_status(tweet_id, tweet_mode='extended')
                json.dump(tweet._json, file)
                file.write("\n")
            except Exception as e: 
                print(e)
            finally:
                end = time.time()
                print(f'Tweet: {i+1} | Time elapsed: {end - start}')

#### Read data from JSON

In [5]:
twitter_json = []

with open('tweet_json.txt','r') as file:
    for status in file:
        status = json.loads(status.strip())
        twitter_json.append({'tweet_id': status['id'],
                             'retweet_count': status['retweet_count'],
                             'favorite_count': status['favorite_count']})
        
additional_twitter_data = pd.DataFrame(twitter_json, columns = ['tweet_id', 'retweet_count', 'favorite_count'])

<a id='assess'></a>
### Assess
<a id='visual'></a>
#### Visual Assess

In [7]:
twitter_archive

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,


In [11]:
image_predictions

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
5,666050758794694657,https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg,1,Bernese_mountain_dog,0.651137,True,English_springer,0.263788,True,Greater_Swiss_Mountain_dog,0.016199,True
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False,mud_turtle,0.045885,False,terrapin,0.017885,False
7,666055525042405380,https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg,1,chow,0.692517,True,Tibetan_mastiff,0.058279,True,fur_coat,0.054449,False
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,1,shopping_cart,0.962465,False,shopping_basket,0.014594,False,golden_retriever,0.007959,True
9,666058600524156928,https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg,1,miniature_poodle,0.201493,True,komondor,0.192305,True,soft-coated_wheaten_terrier,0.082086,True


In [6]:
additional_twitter_data

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,7976,37212
1,892177421306343426,5918,32008
2,891815181378084864,3911,24098
3,891689557279858688,8122,40540
4,891327558926688256,8807,38761
5,891087950875897856,2930,19478
6,890971913173991426,1943,11368
7,890729181411237888,17756,62788
8,890609185150312448,4029,26790
9,890240255349198849,6939,30700


<a id='program'></a>
#### Programmatic Assess

In [32]:
twitter_archive.head(2)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,


In [33]:
twitter_archive.tail(2)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,
2355,666020888022790149,,,2015-11-15 22:32:08 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a Japanese Irish Setter. Lost eye...,,,,https://twitter.com/dog_rates/status/666020888...,8,10,,,,,


In [35]:
twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)

In [70]:
twitter_archive.text.apply(lambda x: x[:4] == 'RT @').sum()

181

In [36]:
twitter_archive.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [49]:
twitter_archive.duplicated().sum()

0

In [28]:
image_predictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [29]:
image_predictions.tail()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True
2074,892420643555336193,https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg,1,orange,0.097049,False,bagel,0.085851,False,banana,0.07611,False


In [45]:
image_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [46]:
image_predictions.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [57]:
image_predictions[image_predictions['p1_conf'] <= 0.0444]

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
38,666644823164719104,https://pbs.twimg.com/media/CUBl6IwVAAA9_zT.jpg,1,Ibizan_hound,0.044333,True,Pembroke,0.043209,True,West_Highland_white_terrier,0.038906,True


In [48]:
image_predictions.duplicated().sum()

0

In [7]:
additional_twitter_data.head(3)

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,7976,37212
1,892177421306343426,5918,32008
2,891815181378084864,3911,24098


In [8]:
additional_twitter_data.tail(3)

Unnamed: 0,tweet_id,retweet_count,favorite_count
2329,666033412701032449,43,121
2330,666029285002620928,45,124
2331,666020888022790149,481,2479


In [9]:
additional_twitter_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2332 entries, 0 to 2331
Data columns (total 3 columns):
tweet_id          2332 non-null int64
retweet_count     2332 non-null int64
favorite_count    2332 non-null int64
dtypes: int64(3)
memory usage: 54.7 KB


In [10]:
additional_twitter_data.describe()

Unnamed: 0,tweet_id,retweet_count,favorite_count
count,2332.0,2332.0,2332.0
mean,7.419284e+17,2801.87693,7761.890223
std,6.82207e+16,4742.296776,12048.464149
min,6.660209e+17,1.0,0.0
25%,6.782728e+17,561.75,1347.75
50%,7.183508e+17,1311.5,3379.5
75%,7.986673e+17,3261.75,9520.5
max,8.924206e+17,80496.0,160823.0


In [11]:
additional_twitter_data.duplicated().sum()

0

<a id='document'></a>
### Issues documentation
### Quality

#### `twitter_archive`

- There are `NaN` values for the columns: in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp.
- Missing expanded_urls (2297 instead of 2356).
- There are missing values for name, doggo, floofer, pupper, puppo.
- The data include retweets and replies in addition to original tweets.
- Some dogs have English articles as their names, such as `a`, `an`, `the`; others have ordinary words, for example `quite`.
- Timestamp column has wrong data type.
- Wrong data types for columns: in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id.
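A few of these quality issues can be checked or fixed with short pandas expressions. The sketch below uses a small made-up frame with the archive's column names. The lowercase-name test is only a heuristic, assuming real dog names are capitalized.

```python
import pandas as pd

# Toy rows mimicking the archive's columns; values are illustrative only.
sample = pd.DataFrame({
    "timestamp": ["2017-08-01 16:23:56 +0000", "2015-11-15 23:05:30 +0000"],
    "name": ["Phineas", "a"],
    "retweeted_status_id": [None, 6.661041e17],
})

# Parse the timestamp strings into timezone-aware datetimes.
sample["timestamp"] = pd.to_datetime(sample["timestamp"], utc=True)

# Heuristic (assumption): all-lowercase "names" such as 'a' or 'quite' are not real names.
suspect_names = sample.loc[sample["name"].str.islower(), "name"].tolist()

# Rows with a non-null retweeted_status_id are retweets.
is_retweet = sample["retweeted_status_id"].notna()
```

Filtering with `sample[~is_retweet]` would then keep only the rows that are not retweets.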

#### `image_predictions`

- Breed names are formatted inconsistently: some start with a capital letter and others do not.
- Some predictions are not dog breeds.
- Missing data (2075 instead of 2356).
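One possible way to handle the first two points is sketched below on a toy frame. Lowercase names with spaces are an arbitrary convention chosen for the example, not something the project prescribes.

```python
import pandas as pd

# Toy predictions using breed labels seen in the table above.
preds = pd.DataFrame({
    "p1": ["Welsh_springer_spaniel", "redbone", "box_turtle"],
    "p1_dog": [True, True, False],
})

# Normalize the label style: underscores to spaces, everything lowercase.
preds["p1"] = preds["p1"].str.replace("_", " ", regex=False).str.lower()

# Keep only rows whose top prediction is a dog breed.
dog_only = preds[preds["p1_dog"]]
```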

#### `additional_twitter_data`
- Missing data (2332 instead of 2356).


### Tidiness
#### `twitter_archive`
- The columns doggo, floofer, pupper, puppo are all categories for the dog stage.
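The four stage columns could be collapsed into a single column, as in this sketch. Since `.info()` reports no nulls for these columns, the example assumes missing stages are stored as the placeholder string `'None'`. The column name `dog_stage` is hypothetical.

```python
import pandas as pd

stages = ["doggo", "floofer", "pupper", "puppo"]

# Toy rows; 'None' is the assumed placeholder for "no stage given".
df = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})

def combine_stages(row):
    """Join every stage present in the row; return None when there is none."""
    found = [row[s] for s in stages if row[s] != "None"]
    return ", ".join(found) if found else None

# Build the single stage column, then drop the four original columns.
df["dog_stage"] = df[stages].apply(combine_stages, axis=1)
df = df.drop(columns=stages)
```

Rows that mention more than one stage keep both values, so they can be reviewed by hand instead of being silently overwritten.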

#### `image_predictions`
- The columns p1, p2, p3, p1_conf, p2_conf, p3_conf, p1_dog, p2_dog, p3_dog do not have descriptive names.
- The columns p1, p2 and p3 are breed predictions for the same dog; the first one has the highest confidence and should be enough.
- The prediction for the dog breed is part of the observational unit `tweet`.

#### `additional_twitter_data`
- The columns `retweet_count` and `favorite_count` should be part of `twitter_archive` DataFrame.
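Joining the counts onto the archive by `tweet_id` could look like the sketch below. It uses toy frames, and the left join with `indicator=True` is one option among several. It also shows which tweets have no counts.

```python
import pandas as pd

# Toy stand-ins for twitter_archive and additional_twitter_data.
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [13, 12, 14]})
counts = pd.DataFrame({
    "tweet_id": [1, 3],
    "retweet_count": [7976, 6939],
    "favorite_count": [37212, 30700],
})

# A left join keeps every archive row; _merge marks rows without matching counts.
merged = archive.merge(counts, on="tweet_id", how="left", indicator=True)
missing_ids = merged.loc[merged["_merge"] == "left_only", "tweet_id"].tolist()
```

Using `how="inner"` instead would drop the tweets without counts in the same step.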

---

<a id='clean'></a>
### Clean
In this section, the issues identified above will be fixed.

In [58]:
twitter_clean = twitter_archive.copy()
predictions_clean = image_predictions.copy()

##### Completeness
- There are `NaN` values for the columns: in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp.

#### Define