# Wrangle & Analyze WeRateDogs Data

<hr>

#### Table of Contents:
* [Project Description](#Project-Description)  
* [Notebook Setup](#Notebook-Setup)  
<br/> 
* [Gathering Data](#Gathering-Data)
    * [Enhanced Twitter Archive](#Gather:-Enhanced-Twitter-Archive)  
    * [Image Predictions File](#Gather:-Image-Predictions-File)  
    * [Twitter API File](#Gather:-Twitter-API-File)  
<br/> 
* [Assessing Data](#Assessing-Data)
    * [Twitter Archive Data](#Assess:-Twitter-Archive-Data)  
    * [Image Predictions](#Assess:-Image-Predictions)  
    * [Twitter API Data](#Assess:-Twitter-API-Data)  
<br/> 
* [Cleaning Data](#Cleaning-Data)
    * [Twitter Archive Data](#Clean:-Twitter-Archive-Data)  
    * [Image Predictions](#Clean:-Image-Predictions)  
    * [Twitter API Data](#Clean:-Twitter-API-Data)  
    * [Merge Datasets](#Clean:-Merge-Datasets)  
<br/> 
* [Analyzing Data](#Analyzing-Data)

<hr>

## Project Description

Real-world data rarely comes clean. Using Python and its libraries, I will gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, then clean it. This is called data wrangling. I will document my wrangling efforts in a Jupyter Notebook, plus showcase them through analyses and visualizations using Python.

The dataset that I will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

<p align="center">
  <img src="img/dog-rates-social.jpg" width="600">
</p>

The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain, though, is each tweet's text, which I used to extract the rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." Of the 5000+ tweets, I kept only those with ratings (2356 in total).

<p align="center">
  <img src="img/data.png" width="1300">
</p>

Retweet count and favorite count are two of the notable column omissions. Fortunately, this additional data can be gathered from Twitter's API, which I will do.

## Notebook Setup

Load libraries and set pandas display options.

<hr>

In [262]:
# import libraries
import numpy as np
import pandas as pd
import requests
import json
import matplotlib.pyplot as plt
from stop_words import get_stop_words
import re

# pandas settings
pd.set_option('display.max_colwidth', None)  # None = do not truncate (pandas >= 1.0; older versions used -1)

## Gathering Data

Gather data from various sources and a variety of file formats.

<hr>

* [Enhanced Twitter Archive](#Gather:-Enhanced-Twitter-Archive)  
* [Image Predictions File](#Gather:-Image-Predictions-File)  
* [Twitter API File](#Gather:-Twitter-API-File)


### Gather: Enhanced Twitter Archive

This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017.

In [34]:
# load twitter archive
twitter_arch = pd.read_csv("data/twitter-archive-enhanced.csv")
# use tweet id column as index
twitter_arch.set_index("tweet_id", inplace = True)
# display few lines
twitter_arch.head(3)

| tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 892420643555336193 | | | 2017-08-01 16:23:56 +0000 | `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>` | This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU | | | | https://twitter.com/dog_rates/status/892420643555336193/photo/1 | 13 | 10 | Phineas | | | | |
| 892177421306343426 | | | 2017-08-01 00:17:27 +0000 | `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>` | This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV | | | | https://twitter.com/dog_rates/status/892177421306343426/photo/1 | 13 | 10 | Tilly | | | | |
| 891815181378084864 | | | 2017-07-31 00:18:03 +0000 | `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>` | This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB | | | | https://twitter.com/dog_rates/status/891815181378084864/photo/1 | 12 | 10 | Archie | | | | |


### Gather: Image Predictions File

This file contains the top three dog-breed predictions for each dog image in the WeRateDogs archive. Each row holds the top three predictions, the tweet ID, the image URL, and the number of the image that produced the most confident prediction (numbered 1 to 4, since tweets can have up to four images).

In [35]:
# get file with the image predictions
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
# download first and stop on an HTTP error, so a failed response is never written to disk
predictions = requests.get(url)
predictions.raise_for_status()
with open('data/image-predictions.tsv', 'wb') as file:
    file.write(predictions.content)

# load image predictions
image_pred = pd.read_csv('data/image-predictions.tsv', sep = '\t')
# use tweet id column as index
image_pred.set_index("tweet_id", inplace = True)
# display few lines
image_pred.head(3)

| tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 666020888022790149 | https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg | 1 | Welsh_springer_spaniel | 0.465074 | True | collie | 0.156665 | True | Shetland_sheepdog | 0.061428 | True |
| 666029285002620928 | https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg | 1 | redbone | 0.506826 | True | miniature_pinscher | 0.074192 | True | Rhodesian_ridgeback | 0.07201 | True |
| 666033412701032449 | https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg | 1 | German_shepherd | 0.596461 | True | malinois | 0.138584 | True | bloodhound | 0.116197 | True |


### Gather: Twitter API File

Retweet count and favorite count are two notable omissions from the Twitter archive. Fortunately, this additional data can be gathered from Twitter's API. The Twitter API file contains the tweet ID, favorite count, and retweet count for each tweet.

In [36]:
# load twitter API data
with open('data/tweet-json.txt') as f:
    twitter_api = pd.DataFrame((json.loads(line) for line in f), columns = ['id', 'favorite_count', 'retweet_count'])

# change column names
twitter_api.columns = ['tweet_id', 'favorites', 'retweets']
# use tweet id column as index
twitter_api.set_index('tweet_id', inplace = True)
# display few lines
twitter_api.head(3)

| tweet_id | favorites | retweets |
|---|---|---|
| 892420643555336193 | 39467 | 8853 |
| 892177421306343426 | 33819 | 6514 |
| 891815181378084864 | 25461 | 4328 |



## Assessing Data

Assess data visually and programmatically for quality and tidiness issues using pandas.

<hr>

* [Twitter Archive Data](#Assess:-Twitter-Archive-Data)  
* [Image Predictions](#Assess:-Image-Predictions)  
* [Twitter API Data](#Assess:-Twitter-API-Data)

### Assess: Twitter Archive Data

In [37]:
# display sample of data
twitter_arch.sample(3)

| tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 803638050916102144 | | | 2016-11-29 16:33:36 +0000 | `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>` | Pupper hath acquire enemy. 13/10 https://t.co/ns9qoElfsX | | | | https://twitter.com/dog_rates/status/803638050916102144/video/1 | 13 | 10 | | | | pupper | |
| 728751179681943552 | | | 2016-05-07 00:59:55 +0000 | `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>` | This is Flurpson. He can't believe it's not butter. 10/10 https://t.co/XD3ort1PsE | | | | https://twitter.com/dog_rates/status/728751179681943552/photo/1 | 10 | 10 | Flurpson | | | | |
| 826598365270007810 | | | 2017-02-01 01:09:42 +0000 | `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>` | This is Pawnd... James Pawnd. He's suave af. 13/10 would trust with my life https://t.co/YprN62Z74I | | | | https://twitter.com/dog_rates/status/826598365270007810/photo/1,https://twitter.com/dog_rates/status/826598365270007810/photo/1,https://twitter.com/dog_rates/status/826598365270007810/photo/1 | 13 | 10 | Pawnd | | | | |


In [38]:
# print a summary of a DataFrame
twitter_arch.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2356 entries, 892420643555336193 to 666020888022790149
Data columns (total 16 columns):
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(2), object(10)
memory usa

In [39]:
# check if ids are unique
twitter_arch.index.is_unique

True

In [40]:
# check number of replies
np.isfinite(twitter_arch.in_reply_to_status_id).sum()

78

In [41]:
# check values in sources
twitter_arch.source.value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                        91  
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                     33  
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>    11  
Name: source, dtype: int64

In [114]:
# check quality of text
twitter_arch.text.sample(3)

tweet_id
680221482581123072    This is CeCe. She's patiently waiting for Santa. 10/10 https://t.co/ZJUypFFwvg                                                     
798628517273620480    RT @dog_rates: This a Norwegian Pewterschmidt named Tickles. Ears for days. 12/10 I care deeply for Tickles https://t.co/0aDF62KVP7
792913359805018113    Here is a perfect example of someone who has their priorities in order. 13/10 for both owner and Forrest https://t.co/LRyMrU7Wfq   
Name: text, dtype: object

In [115]:
# check number of retweets
np.isfinite(twitter_arch.retweeted_status_id).sum()

181

In [125]:
# check expanded urls
twitter_arch[~twitter_arch.expanded_urls.str.startswith(('https://twitter.com','http://twitter.com', 'https://vine.co'), na=False)].sample(3)[['text','expanded_urls']]

| tweet_id | text | expanded_urls |
|---|---|---|
| 847842811428974592 | This is Rontu. He is described as a pal, cuddle bug, protector and constant shadow. 12/10, but he needs your help\n\nhttps://t.co/zK4cpKPFfU https://t.co/7Xvoalr798 | https://www.gofundme.com/help-save-rontu,https://twitter.com/dog_rates/status/847842811428974592/photo/1 |
| 674606911342424069 | The 13/10 also takes into account this impeccable yard. Louis is great but the future dad in me can't ignore that luscious green grass | |
| 682808988178739200 | I'm aware that I could've said 20/16, but here at WeRateDogs we are very professional. An inconsistent rating scale is simply irresponsible | |


In [126]:
# check for two or more urls in the expanded urls
twitter_arch[twitter_arch.expanded_urls.str.contains(',', na = False)].expanded_urls.count()

639

In [127]:
# check rating denominator
twitter_arch.rating_denominator.value_counts()

10     2333
11     3   
50     3   
80     2   
20     2   
2      1   
16     1   
40     1   
70     1   
15     1   
90     1   
110    1   
120    1   
130    1   
150    1   
170    1   
7      1   
0      1   
Name: rating_denominator, dtype: int64

In [128]:
# check rating numerator
twitter_arch.rating_numerator.value_counts()

12      558
11      464
10      461
13      351
9       158
8       102
7       55 
14      54 
5       37 
6       32 
3       19 
4       17 
1       9  
2       9  
420     2  
0       2  
15      2  
75      2  
80      1  
20      1  
24      1  
26      1  
44      1  
50      1  
60      1  
165     1  
84      1  
88      1  
144     1  
182     1  
143     1  
666     1  
960     1  
1776    1  
17      1  
27      1  
45      1  
99      1  
121     1  
204     1  
Name: rating_numerator, dtype: int64

In [129]:
# check for any float ratings in the text column
twitter_arch[twitter_arch.text.str.contains(r'\d+\.\d+\/\d+')][['text','rating_denominator', 'rating_numerator']]

| tweet_id | text | rating_denominator | rating_numerator |
|---|---|---|---|
| 883482846933004288 | This is Bella. She hopes her smile made you smile. If not, she is also offering you her favorite monkey. 13.5/10 https://t.co/qjrljjt948 | 10 | 5 |
| 832215909146226688 | RT @dog_rates: This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wu… | 10 | 75 |
| 786709082849828864 | This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS | 10 | 75 |
| 778027034220126208 | This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://t.co/QFaUiIHxHq | 10 | 27 |
| 681340665377193984 | I've been told there's a slight possibility he's checking his mirror. We'll bump to 9.5/10. Still a menace | 10 | 5 |
| 680494726643068929 | Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10 https://t.co/eNm2S6p9BD | 10 | 26 |


In [130]:
# check name of dog
twitter_arch.name.value_counts()

None        745
a           55 
Charlie     12 
Cooper      11 
Lucy        11 
Oliver      11 
Lola        10 
Penny       10 
Tucker      10 
Winston     9  
Bo          9  
Sadie       8  
the         8  
Buddy       7  
Daisy       7  
Bailey      7  
Toby        7  
an          7  
Leo         6  
Oscar       6  
Milo        6  
Rusty       6  
Bella       6  
Stanley     6  
Jack        6  
Koda        6  
Jax         6  
Scout       6  
Dave        6  
Bentley     5  
           ..  
Evy         1  
Rumpole     1  
Sky         1  
Tobi        1  
Newt        1  
Carbon      1  
Harlso      1  
Carll       1  
Laika       1  
Opie        1  
Blue        1  
life        1  
Lucia       1  
Jennifur    1  
Lipton      1  
Ruffles     1  
Brady       1  
Andy        1  
Emma        1  
Puff        1  
Steve       1  
Kramer      1  
Apollo      1  
Arlen       1  
Zeek        1  
Keet        1  
Champ       1  
Billy       1  
Benny       1  
Divine      1  
Name: name, Length: 957, dtype: int64

In [131]:
# check for stop words in dog name
# https://stackoverflow.com/a/5486535/7382214
stop_words = set(get_stop_words('en'))

count = 0
for word in twitter_arch.name:
    if word.lower() in stop_words:
        count += 1
print('Rows with stop words:', count)

Rows with stop words: 83


In [132]:
# check if dogs have more than one category assigned
categories = ['doggo', 'floofer', 'pupper', 'puppo']

for category in categories:
    twitter_arch[category] = twitter_arch[category].apply(lambda x: 0 if x == 'None' else 1)

twitter_arch['number_categories'] = twitter_arch.loc[:,categories].sum(axis = 1)

In [134]:
# dogs categories
twitter_arch['number_categories'].value_counts()

0    1976
1    366 
2    14  
Name: number_categories, dtype: int64

#### Quality & Tidiness Issues

- some of the gathered tweets are replies and should be removed;
- the timestamp has an incorrect datatype: it is an object but should be a datetime;
- source is an HTML element, and only its inner text is needed;
- some rows in the text column begin with 'RT @dog_rates:';
- some rows in the text column have leading and/or trailing whitespace;
- some of the gathered tweets are retweets;
- we have 59 missing expanded urls;
- we have 639 expanded urls which contain more than one url address;
- denominator of some ratings is not 10;
- numerator of some ratings is greater than 10;
- float ratings have been incorrectly read from the text of tweet;
- 'None' in the name column should be converted to NaN;
- we have stop words in the name column;
- dog 'stage' classification (doggo, floofer, pupper or puppo) should be one column;
- some dogs have more than one category assigned;

### Assess: Image Predictions

In [135]:
# display sample of data
image_pred.sample(3)

| tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 724771698126512129 | https://pbs.twimg.com/media/Cg7n_-OU8AA5RR1.jpg | 2 | German_short-haired_pointer | 0.835491 | True | bluetick | 0.058788 | True | English_setter | 0.037208 | True |
| 694329668942569472 | https://pbs.twimg.com/media/CaLBJmOWYAQt44t.jpg | 1 | boxer | 0.99006 | True | bull_mastiff | 0.007436 | True | Saint_Bernard | 0.001617 | True |
| 677328882937298944 | https://pbs.twimg.com/media/CWZbBlAUsAAjRg5.jpg | 1 | water_buffalo | 0.42425 | False | kelpie | 0.029054 | True | Staffordshire_bullterrier | 0.02847 | True |


In [136]:
# print a summary of a DataFrame
image_pred.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2075 entries, 666020888022790149 to 892420643555336193
Data columns (total 11 columns):
jpg_url    2075 non-null object
img_num    2075 non-null int64
p1         2075 non-null object
p1_conf    2075 non-null float64
p1_dog     2075 non-null bool
p2         2075 non-null object
p2_conf    2075 non-null float64
p2_dog     2075 non-null bool
p3         2075 non-null object
p3_conf    2075 non-null float64
p3_dog     2075 non-null bool
dtypes: bool(3), float64(3), int64(1), object(4)
memory usage: 152.0+ KB


In [137]:
# check if ids are unique
image_pred.index.is_unique

True

In [138]:
# check jpg_url
image_pred[~image_pred.jpg_url.str.endswith(('.jpg', '.png'), na = False)].jpg_url.count()

0

In [139]:
# check image number
image_pred.img_num.value_counts()

1    1780
2    198 
3    66  
4    31  
Name: img_num, dtype: int64

In [140]:
# check 1st prediction
image_pred.p1.sample(3)

tweet_id
671109016219725825    basenji    
867051520902168576    Samoyed    
673906403526995968    toilet_seat
Name: p1, dtype: object

In [141]:
# count non-null entries in p1_dog (count() ignores the True/False values; p1_dog.sum() would count the True ones)
image_pred.p1_dog.count()

2075

#### Quality & Tidiness Issues

- the dataset has 2075 entries, while twitter archive dataset has 2356 entries;
- column names are confusing and do not give much information about the content;
- dog breeds contain underscores, and have different case formatting;
- not every top prediction is a dog breed (e.g. water_buffalo), as flagged by the p1_dog column;
- dataset should be merged with the twitter archive dataset;
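
Note that `count()` returns the number of non-null entries, so the 2075 reported above is simply the row count rather than the number of dog predictions. A minimal sketch of the difference, using a made-up frame (`toy_pred` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# toy stand-in for image_pred; the rows are invented for illustration
toy_pred = pd.DataFrame({'p1': ['basenji', 'Samoyed', 'toilet_seat'],
                         'p1_dog': [True, True, False]})

n_rows = toy_pred.p1_dog.count()  # non-null entries: every row is counted
n_dogs = toy_pred.p1_dog.sum()    # True values only: top predictions that are dogs
print(n_rows, n_dogs)
```

On the real data, `image_pred.p1_dog.value_counts()` would give both the True and False counts in one call.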

### Assess: Twitter API Data

In [142]:
# display sample of data
twitter_api.sample(3)

| tweet_id | favorites | retweets |
|---|---|---|
| 820690176645140481 | 13518 | 3716 |
| 677557565589463040 | 2665 | 1322 |
| 890729181411237888 | 56848 | 16716 |


In [143]:
# print a summary of a DataFrame
twitter_api.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2354 entries, 892420643555336193 to 666020888022790149
Data columns (total 2 columns):
favorites    2354 non-null int64
retweets     2354 non-null int64
dtypes: int64(2)
memory usage: 55.2 KB


In [144]:
# check if ids are unique in the Twitter API data
twitter_api.index.is_unique

True

#### Quality & Tidiness Issues

- twitter archive dataset has 2356 entries, while twitter API data has 2354;
- dataset should be merged with the twitter archive dataset;
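
To see which tweets account for the difference in row counts, the two indexes can be compared directly. A small sketch, assuming both frames are indexed by `tweet_id` as above (`toy_arch`, `toy_api`, and the ids are invented for illustration):

```python
import pandas as pd

# toy stand-ins indexed by tweet_id; ids and values are made up
toy_arch = pd.DataFrame({'text': ['a', 'b', 'c']},
                        index=pd.Index([101, 102, 103], name='tweet_id'))
toy_api = pd.DataFrame({'favorites': [5, 7]},
                       index=pd.Index([101, 103], name='tweet_id'))

# ids present in the archive but absent from the API data
missing_ids = toy_arch.index.difference(toy_api.index)
print(list(missing_ids))
```

The same call on the real frames would be `twitter_arch.index.difference(twitter_api.index)`.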

## Cleaning Data

Using pandas, clean the quality and tidiness issues identified in the [Assessing Data](#Assessing-Data) section.

<hr>

* [Twitter Archive Data](#Clean:-Twitter-Archive-Data)  
* [Image Predictions](#Clean:-Image-Predictions)  
* [Twitter API Data](#Clean:-Twitter-API-Data)
* [Merge Datasets](#Clean:-Merge-Datasets)

### Clean: Twitter Archive Data

In [265]:
# create a copy of dataset
twitter_arch_clean = twitter_arch.copy()

In [266]:
# display sample of data
twitter_arch_clean.sample(3)

| tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo | number_categories |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 705898680587526145 | | | 2016-03-04 23:32:15 +0000 | `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>` | Meet Max. He's a Fallopian Cephalopuff. Eyes are magical af. Lil dandruff problem. No big deal 10/10 would still pet https://t.co/c67nUjwmFs | | | | https://twitter.com/dog_rates/status/705898680587526145/photo/1,https://twitter.com/dog_rates/status/705898680587526145/photo/1 | 10 | 10 | Max | 0 | 0 | 0 | 0 | 0 |
| 666353288456101888 | | | 2015-11-16 20:32:58 +0000 | `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>` | Here we have a mixed Asiago from the Galápagos Islands. Only one ear working. Big fan of marijuana carpet. 8/10 https://t.co/tltQ5w9aUO | | | | https://twitter.com/dog_rates/status/666353288456101888/photo/1 | 8 | 10 | | 0 | 0 | 0 | 0 | 0 |
| 823939628516474880 | | | 2017-01-24 17:04:50 +0000 | `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>` | This is Cash. He's officially given pup on today. 12/10 frighteningly relatable https://t.co/m0hrATIEyw | | | | https://twitter.com/dog_rates/status/823939628516474880/photo/1 | 12 | 10 | Cash | 0 | 0 | 0 | 0 | 0 |


#### Define

Some of the gathered tweets are replies and retweets
- remove replies data from the dataset
- remove retweets data from the dataset

#### Code

In [267]:
# display all columns
twitter_arch_clean.columns

Index(['in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp', 'source',
       'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo',
       'number_categories'],
      dtype='object')

In [268]:
# drop unnecessary columns
twitter_arch_clean.drop(['in_reply_to_status_id','in_reply_to_user_id','retweeted_status_id',
           'retweeted_status_user_id','retweeted_status_timestamp'], axis = 1, inplace = True)
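
Dropping the columns alone leaves the reply and retweet rows themselves in the dataset. A minimal sketch of filtering those rows out before the columns are dropped, shown on a toy frame (`toy`, `originals`, and all values are invented for illustration):

```python
import numpy as np
import pandas as pd

# toy stand-in for twitter_arch_clean; ids and values are made up
toy = pd.DataFrame({'in_reply_to_status_id': [np.nan, 11.0, np.nan],
                    'retweeted_status_id': [np.nan, np.nan, 22.0],
                    'text': ['original', 'a reply', 'RT @dog_rates: a retweet']},
                   index=pd.Index([1, 2, 3], name='tweet_id'))

# keep rows that are neither replies nor retweets
is_original = toy.in_reply_to_status_id.isnull() & toy.retweeted_status_id.isnull()
originals = toy[is_original]

# only then drop the reply/retweet columns, which are now entirely empty
originals = originals.drop(['in_reply_to_status_id', 'retweeted_status_id'], axis=1)
print(originals)
```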

In [269]:
# display cleaned dataset
twitter_arch_clean.sample(3)

| tweet_id | timestamp | source | text | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo | number_categories |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 767122157629476866 | 2016-08-20 22:12:29 +0000 | `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>` | This is Rupert. You betrayed him with bath time but he forgives you. Cuddly af 13/10 https://t.co/IEARC2sRzC | https://twitter.com/dog_rates/status/767122157629476866/photo/1,https://twitter.com/dog_rates/status/767122157629476866/photo/1 | 13 | 10 | Rupert | 0 | 0 | 0 | 0 | 0 |
| 689835978131935233 | 2016-01-20 15:44:48 +0000 | `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>` | Meet Fynn &amp; Taco. Fynn is an all-powerful leaf lord and Taco is in the wrong place at the wrong time. 11/10 &amp; 10/10 https://t.co/MuqHPvtL8c | https://twitter.com/dog_rates/status/689835978131935233/photo/1 | 11 | 10 | Fynn | 0 | 0 | 0 | 0 | 0 |
| 756303284449767430 | 2016-07-22 01:42:09 +0000 | `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>` | Pwease accept dis rose on behalf of dog. 11/10 https://t.co/az5BVcIV5I | https://twitter.com/dog_rates/status/756303284449767430/photo/1 | 11 | 10 | | 0 | 0 | 0 | 0 | 0 |


#### Define

The timestamp has an incorrect datatype: it is an object but should be a datetime
* convert to datetime

#### Code

In [270]:
# convert to datetime
twitter_arch_clean.timestamp = pd.to_datetime(twitter_arch_clean.timestamp)

In [271]:
# display dataset types
twitter_arch_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2356 entries, 892420643555336193 to 666020888022790149
Data columns (total 12 columns):
timestamp             2356 non-null datetime64[ns]
source                2356 non-null object
text                  2356 non-null object
expanded_urls         2297 non-null object
rating_numerator      2356 non-null int64
rating_denominator    2356 non-null int64
name                  2356 non-null object
doggo                 2356 non-null int64
floofer               2356 non-null int64
pupper                2356 non-null int64
puppo                 2356 non-null int64
number_categories     2356 non-null int64
dtypes: datetime64[ns](1), int64(7), object(4)
memory usage: 239.3+ KB


#### Define

Source is an HTML element - its text should be extracted
* extract inner text of the HTML elements

#### Code

In [272]:
# extract inner text from HTML
twitter_arch_clean.source = twitter_arch_clean.source.apply(lambda x: re.findall(r'>(.*)<', x)[0])

In [274]:
# display new source
twitter_arch_clean.source.value_counts()

Twitter for iPhone     2221
Vine - Make a Scene    91  
Twitter Web Client     33  
TweetDeck              11  
Name: source, dtype: int64

#### Define

Some rows in the text column begin with 'RT @dog_rates:'. Some rows have leading and/or trailing whitespace
- remove 'RT @dog_rates:'
- strip whitespace

#### Code

In [427]:
# example of tweet
twitter_arch_clean.text[838831947270979586]

"RT @dog_rates: This is Riley. His owner put a donut pillow around him and he loves it so much he won't let anyone take it off. 13/10 https:…"

In [428]:
# remove 'RT @dog_rates:' and strip leading and trailing space
twitter_arch_clean.text = twitter_arch_clean.text.str.replace('RT @dog_rates:', '').str.strip()

In [429]:
# example of tweet after clean up
twitter_arch_clean.text[838831947270979586]

"This is Riley. His owner put a donut pillow around him and he loves it so much he won't let anyone take it off. 13/10 https:…"

#### Define

Denominator of some ratings is not 10. Numerator of some ratings is greater than 10. Float ratings have been incorrectly read from the text of tweet

#### Code
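
One possible approach, sketched here rather than taken from the notebook, is to re-extract the rating from the tweet text with a pattern that allows a decimal numerator. The example uses three tweet texts quoted earlier in this document; the names `texts` and `ratings` are illustrative.

```python
import pandas as pd

# tweet texts quoted earlier in this document (two decimal ratings, one integer rating)
texts = pd.Series([
    "This is Bella. She hopes her smile made you smile. If not, she is also offering you her favorite monkey. 13.5/10 https://t.co/qjrljjt948",
    "I've been told there's a slight possibility he's checking his mirror. We'll bump to 9.5/10. Still a menace",
    "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU",
])

# capture the first "<number>/<number>" pair, allowing a decimal numerator
ratings = texts.str.extract(r'(\d+(?:\.\d+)?)/(\d+)').astype(float)
ratings.columns = ['rating_numerator', 'rating_denominator']
print(ratings)
```

This only takes the first match in each tweet, so texts that mention another fraction before the rating would still need a manual check, as would the rows whose denominator is not 10.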

#### Define

We have stop words in the name column and 'None' values, both of which should be converted to NaN.

#### Code
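
A minimal sketch of one way to do this, assuming a small hand-written stop-word set in place of the full `get_stop_words('en')` list used during assessment (the `names` sample is drawn from the value counts shown earlier; `names_clean` is an illustrative name):

```python
import numpy as np
import pandas as pd

# a few values from the name column's value counts
names = pd.Series(['Charlie', 'None', 'a', 'the', 'an', 'Lucy'])

# assumption: a tiny stand-in for set(get_stop_words('en'))
stop_words = {'a', 'an', 'the'}

# flag the 'None' placeholder and any name that is a stop word
not_a_name = (names == 'None') | names.str.lower().isin(stop_words)
names_clean = names.mask(not_a_name, np.nan)
print(names_clean.tolist())
```

Comparing on the lowercased value mirrors the assessment check; a stricter rule may be needed for other non-names such as 'life'.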

#### Define

Dog 'stage' classification (doggo, floofer, pupper or puppo) should be one column. Some dogs have more than one category assigned

#### Code
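
A possible sketch, assuming the four stage columns already hold the 0/1 indicators created during assessment. Joining multiple stages with a comma is one design choice among several; the `toy` frame and `combine_stages` helper are hypothetical.

```python
import numpy as np
import pandas as pd

categories = ['doggo', 'floofer', 'pupper', 'puppo']

# toy 0/1 indicator rows in the shape produced during assessment; values are made up
toy = pd.DataFrame([[0, 0, 0, 0],
                    [0, 0, 1, 0],
                    [1, 0, 1, 0]],
                   columns=categories,
                   index=pd.Index([1, 2, 3], name='tweet_id'))

def combine_stages(row):
    # join the names of all flagged stages; NaN when none is flagged
    flagged = [category for category in categories if row[category] == 1]
    return ', '.join(flagged) if flagged else np.nan

# build the single stage column, then drop the four indicator columns
toy['stage'] = toy[categories].apply(combine_stages, axis=1)
toy = toy.drop(categories, axis=1)
print(toy)
```

Keeping both names for dogs with two stages preserves the information; those rows could instead be reviewed by hand and assigned a single stage.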

### Clean: Image Predictions

In [241]:
# create a copy of dataset
image_pred_clean = image_pred.copy()

In [242]:
# display sample of data
image_pred_clean.sample(3)

| tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 746369468511756288 | https://pbs.twimg.com/media/ClujESVXEAA4uH8.jpg | 1 | German_shepherd | 0.622957 | True | malinois | 0.338884 | True | wallaby | 0.024161 | False |
| 877736472329191424 | https://pbs.twimg.com/media/DC5YqoQW0AArOLH.jpg | 2 | Chesapeake_Bay_retriever | 0.837956 | True | Labrador_retriever | 0.062034 | True | Weimaraner | 0.040599 | True |
| 796177847564038144 | https://pbs.twimg.com/media/Cwx99rpW8AMk_Ie.jpg | 1 | golden_retriever | 0.600276 | True | Labrador_retriever | 0.140798 | True | seat_belt | 0.087355 | False |


#### Define

Column names are confusing and do not give much information about the content.  
- Change column names to more descriptive ones.

#### Code

In [243]:
# display current labels
image_pred_clean.columns

Index(['jpg_url', 'img_num', 'p1', 'p1_conf', 'p1_dog', 'p2', 'p2_conf',
       'p2_dog', 'p3', 'p3_conf', 'p3_dog'],
      dtype='object')

In [244]:
# change labels
image_pred_clean.columns = ['image_url', 
                            'img_number', 
                            '1st_prediction',
                            '1st_prediction_confidence',
                            '1st_prediction_isdog',
                            '2nd_prediction',
                            '2nd_prediction_confidence',
                            '2nd_prediction_isdog',
                            '3rd_prediction',
                            '3rd_prediction_confidence',
                            '3rd_prediction_isdog']

In [245]:
# display new labels
image_pred_clean.columns

Index(['image_url', 'img_number', '1st_prediction',
       '1st_prediction_confidence', '1st_prediction_isdog', '2nd_prediction',
       '2nd_prediction_confidence', '2nd_prediction_isdog', '3rd_prediction',
       '3rd_prediction_confidence', '3rd_prediction_isdog'],
      dtype='object')

#### Define

Dog breeds contain underscores, and have different case formatting
- Replace underscores with whitespace
- Capitalize the first letter of each word

#### Code

In [246]:
# columns with dog breed
dog_breed_cols = ['1st_prediction', '2nd_prediction', '3rd_prediction']

# remove underscore and capitalize the first letter of each word 
for column in dog_breed_cols:
    image_pred_clean[column] = image_pred_clean[column].str.replace('_', ' ').str.title()

In [247]:
# display sample of changes
image_pred_clean[dog_breed_cols].sample(3)

| tweet_id | 1st_prediction | 2nd_prediction | 3rd_prediction |
|---|---|---|---|
| 800513324630806528 | Pembroke | Cardigan | Chihuahua |
| 680055455951884288 | Samoyed | Great Pyrenees | Pomeranian |
| 747512671126323200 | Cardigan | Malinois | German Shepherd |


#### Define

Not every top prediction is a dog breed
- If the 1st prediction is not a dog breed, use the dog breed from the 2nd or 3rd prediction

#### Code

In [248]:
# build function to determine dog breed
# if no breed detected, set value to NaN

def get_breed(row):
    # return (breed, confidence) of the highest-ranked prediction that is a dog
    for rank in ['1st', '2nd', '3rd']:
        if row[rank + '_prediction_isdog']:
            return row[rank + '_prediction'], row[rank + '_prediction_confidence']
    return np.nan, np.nan

# apply function to dataset; returning values instead of appending to global lists
# keeps the result aligned with the index even if apply calls the function more than once
breeds = image_pred_clean.apply(get_breed, axis = 1, result_type = 'expand')

# create new columns with data
image_pred_clean['breed_predicted'] = breeds[0]
image_pred_clean['prediction_confidence'] = breeds[1]

# drop old columns
image_pred_clean.drop(['1st_prediction',
                       '1st_prediction_confidence',
                       '1st_prediction_isdog',
                       '2nd_prediction',
                       '2nd_prediction_confidence',
                       '2nd_prediction_isdog',
                       '3rd_prediction',
                       '3rd_prediction_confidence',
                       '3rd_prediction_isdog'],
                      axis=1, inplace=True)

# drop rows without dog breed prediction
image_pred_clean.dropna(subset = ['breed_predicted', 'prediction_confidence'], inplace = True)

In [249]:
# display sample of cleaned dataset
image_pred_clean.sample(3)

| tweet_id | image_url | img_number | breed_predicted | prediction_confidence |
|---|---|---|---|---|
| 817056546584727552 | https://pbs.twimg.com/media/C1bEl4zVIAASj7_.jpg | 1 | Kelpie | 0.864415 |
| 778027034220126208 | https://pbs.twimg.com/media/Cswbc2yWcAAVsCJ.jpg | 1 | Clumber | 0.946718 |
| 853760880890318849 | https://pbs.twimg.com/media/C9kq_bbVwAAuRZd.jpg | 1 | Miniature Pinscher | 0.292519 |


This concludes cleaning activities for the Image Predictions dataset. The remaining task is to merge it with the Twitter archive data, which is covered in the [Clean: Merge Datasets](#Clean:-Merge-Datasets) section.

### Clean: Twitter API Data

In [145]:
# display sample of data
twitter_api.sample(3)

| tweet_id | favorites | retweets |
|---|---|---|
| 739544079319588864 | 43694 | 24319 |
| 826204788643753985 | 5361 | 1075 |
| 798665375516884993 | 0 | 4519 |


There is no need to perform cleaning tasks in this data set, except for merging it with the Twitter archive data, which is covered in the next section.

### Clean: Merge Datasets
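
Since all three datasets are indexed by `tweet_id`, they can be combined with index joins. A minimal sketch on toy frames (`arch`, `pred`, `api`, and their values are invented for illustration; the choice of an inner join is an assumption, not the notebook's stated method):

```python
import pandas as pd

# toy stand-ins for the three cleaned frames, all indexed by tweet_id; values are made up
arch = pd.DataFrame({'rating_numerator': [13, 12, 11]},
                    index=pd.Index([1, 2, 3], name='tweet_id'))
pred = pd.DataFrame({'breed_predicted': ['Samoyed', 'Kelpie']},
                    index=pd.Index([1, 3], name='tweet_id'))
api = pd.DataFrame({'favorites': [100, 200, 300], 'retweets': [10, 20, 30]},
                   index=pd.Index([1, 2, 3], name='tweet_id'))

# inner joins keep only the tweets present in every frame
merged = arch.join(pred, how='inner').join(api, how='inner')
print(merged)
```

Using `how='left'` instead would keep every archive tweet and leave NaN where a prediction or API record is missing.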

## Analyzing Data

Analyze and visualize data using matplotlib.

<hr>
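
As an illustration of the kind of plot this section could contain, here is a small sketch that charts favorites against retweets for the six tweets shown in the Twitter API samples above. The `Agg` backend line is only there so the sketch runs without a display; inside the notebook it can be omitted and the figure renders inline.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# favorites/retweets pairs copied from the Twitter API samples shown earlier
sample = pd.DataFrame({'favorites': [39467, 33819, 25461, 13518, 2665, 56848],
                       'retweets': [8853, 6514, 4328, 3716, 1322, 16716]})

# scatter plot of favorites against retweets
fig, ax = plt.subplots()
ax.scatter(sample.retweets, sample.favorites)
ax.set_xlabel('retweets')
ax.set_ylabel('favorites')
ax.set_title('Favorites vs. retweets (six sample tweets)')
```

Six points are far too few to draw conclusions from; the same plot on the merged dataset would be the meaningful version.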