# Wrangle and Analyze Data Project

## Introduction


## Rough notes/steps/thoughts

- Wrangle the dataset by gathering, assessing and cleaning the data
- Twitter archive file: `twitter-archive-enhanced.csv`
- Download `image-predictions.tsv` using requests (this file was generated using a neural network)
- Each WeRateDogs tweet has a tweet ID; use it with the Twitter API (tweepy) and store the returned data in `tweet_json.txt`

In [1]:
import pandas as pd
import numpy as np
import requests
import tweepy
import json
import os
import re
import time
import csv
from datetime import datetime as dt

# Gathering Data

### Twitter Archive Enhanced Dataset
The Twitter archive was downloaded from the course materials page. We can read it into pandas.

In [2]:
twitter_archive = pd.read_csv('./twitter-archive-enhanced.csv')
twitter_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


### Image Predictions Dataset
We can download `image-predictions.tsv` using the requests library. We only want to download it once, so if the file already exists we won't download it again.

In [3]:
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
filename = 'image-predictions.tsv'

# if it doesn't exist, download the file using requests
# (a plain relative path works on every OS; '.\\' is a Windows-only prefix)
if not os.path.exists(filename):
    # send a http get request to obtain the file at the above link
    image_prediction_file = requests.get(url)
    # write the file downloaded with requests 
    with open(filename,'wb') as ip_file:
        ip_file.write(image_prediction_file.content)
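
The download-once check can be wrapped in a small helper. The sketch below is one possible way to do it, not the notebook's own code: `download_if_missing` and its `fetch` parameter are hypothetical names, and `fetch` is passed in so that the caching logic can be tried without a network connection.

```python
import os
import tempfile


def download_if_missing(url, path, fetch):
    """Save fetch(url) to path unless path already exists.

    Returns True if a download happened, False if the cached file was reused.
    `fetch` is any callable that takes a URL and returns the body as bytes.
    """
    if os.path.exists(path):
        return False
    # write to a temporary name first, then rename, so an interrupted
    # download never leaves a half-written file at `path`
    partial_path = path + '.part'
    with open(partial_path, 'wb') as out_file:
        out_file.write(fetch(url))
    os.replace(partial_path, path)
    return True


# offline demonstration with a stand-in fetcher (no real request is made)
calls = []


def fake_fetch(url):
    calls.append(url)
    return b'tweet_id\tjpg_url\n'


demo_path = os.path.join(tempfile.mkdtemp(), 'image-predictions.tsv')
first = download_if_missing('https://example.com/file.tsv', demo_path, fake_fetch)
second = download_if_missing('https://example.com/file.tsv', demo_path, fake_fetch)
```

With the requests library used in this notebook, the fetcher could be `lambda u: requests.get(u).content`.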

In [4]:
# open the file just to check it has been downloaded ok
image_predictions = pd.read_csv(filename, sep='\t')
image_predictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


### Gathering Tweet information using the twitter api Tweepy
We'll use the twitter API to obtain some extra information for each tweet

In [5]:
# this function takes in the API keys and returns the API object that will be used to obtain tweets
def get_twitter_api(consumer_key,consumer_secret, access_token, access_token_secret):
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    # note: wait_on_rate_limit_notify is a Tweepy 3.x argument; it was removed in Tweepy 4
    return tweepy.API(auth, wait_on_rate_limit=True,wait_on_rate_limit_notify=True)

In [6]:
# twitter api keys were saved as a JSON file, so when pushing to github I can use a .gitignore to ensure it doesn't get
# uploaded.
twitter_api = {}
twitter_api_filename = 'twitter_api_keys.json'
with open('./{}'.format(twitter_api_filename)) as twitter_api_keys:
    twitter_api = json.load(twitter_api_keys)   
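
Since the credentials file is kept out of version control, it can be useful to fail early when it is incomplete. The following is a minimal sketch under that assumption: `load_api_keys` and `REQUIRED_KEYS` are hypothetical names, and the key names are the four used in this notebook.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ('CONSUMER_API_KEY', 'CONSUMER_SECRET',
                 'ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')


def load_api_keys(path):
    """Load the credentials file and raise KeyError if a key is missing."""
    with open(path) as key_file:
        keys = json.load(key_file)
    missing = [name for name in REQUIRED_KEYS if name not in keys]
    if missing:
        raise KeyError('missing credentials: {}'.format(', '.join(missing)))
    return keys


# demonstration with placeholder values written to a temporary file
demo_path = os.path.join(tempfile.mkdtemp(), 'twitter_api_keys.json')
with open(demo_path, 'w') as demo_file:
    json.dump({name: 'placeholder' for name in REQUIRED_KEYS}, demo_file)

demo_keys = load_api_keys(demo_path)
```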

In [7]:
#api = get_twitter_api(CONSUMER_API_KEY, CONSUMER_SECRET,ACCESS_TOKEN,ACCESS_TOKEN_SECRET)
api = get_twitter_api(twitter_api['CONSUMER_API_KEY'], 
                      twitter_api['CONSUMER_SECRET'],
                      twitter_api['ACCESS_TOKEN'],
                      twitter_api['ACCESS_TOKEN_SECRET'])

### Using The twitter API to pull Tweet Data (takes a long time)

In [8]:
tweet_json_filename = 'tweet_json.txt'

We want to obtain data for all tweet IDs. There could be some tweet IDs that are in the `image_predictions` file but not in the `twitter_archive` dataframe, so I want to gather all unique tweet IDs across both datasets.

In [9]:
# 2356 unique tweet ids
tweet_ids_twitter_archive = twitter_archive['tweet_id'].unique()
# has 2075 unique
tweet_ids_image_predictions = image_predictions['tweet_id'].unique()

In [10]:
tweet_ids_twitter_archive.shape[0]

2356

In [11]:
tweet_ids_image_predictions.shape[0]

2075

We can find out how many tweet id's from `twitter_archive` are not in `image_predictions`

In [12]:
tid_in_archive_notin_prediction = [t_id for t_id in tweet_ids_twitter_archive if t_id not in tweet_ids_image_predictions]
len(tid_in_archive_notin_prediction)

281

There are 281 tweets missing from `image_predictions` for which we have no prediction data. I also have no means of obtaining it, as the data in the `image_predictions` dataset was generated by a neural network that was not provided, and writing one is outside the scope of this assignment.

This also tells us that all tweet IDs in the `image_predictions` dataframe are in the `twitter_archive` dataframe (as 281 + 2075 = 2356), hence the tweet IDs in `image_predictions` are a subset of those in `twitter_archive`. Therefore, when using the Twitter API to get tweet data, we can use the tweet IDs from `twitter_archive` alone.
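
The subset argument can also be checked directly with set operations. This is an illustrative sketch with made-up IDs, not the real data; the variable names are hypothetical.

```python
# made-up IDs standing in for the two tweet_id columns
archive_ids = {101, 102, 103, 104, 105}
prediction_ids = {102, 104, 105}

# IDs in the archive that have no image prediction
missing_from_predictions = archive_ids - prediction_ids

# True when every predicted ID also appears in the archive
predictions_are_subset = prediction_ids <= archive_ids

# the counting argument from the text: missing + predicted == archive
counts_add_up = (len(missing_from_predictions) + len(prediction_ids)
                 == len(archive_ids))
```

In the notebook, `set(twitter_archive['tweet_id'])` and `set(image_predictions['tweet_id'])` would take the place of the toy sets.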

In [13]:
# obtain all the twitter data

# if the file tweet_json.txt doesn't exist, then obtain the tweets
# (a plain relative path works on every OS; '.\\' is a Windows-only prefix)
if not os.path.exists(tweet_json_filename):
    twitter_data = []
    errors = []
    
    # loop over all the tweet ids in the archive, timing the whole run
    start_time = time.time()
    for index,tweet_id in enumerate(twitter_archive['tweet_id']):
        try:
            tweet_data = api.get_status(tweet_id, tweet_mode='extended')
            tweet_data_json = tweet_data._json

            # store the tweet_id in the JSON record so we know which tweet it corresponds to.
            # The tweet id is a large number held as a numpy int64, which is not JSON serializable
            # ("TypeError: Object of type 'int64' is not JSON serializable"),
            # so convert it to a string
            tweet_data_json['tweet_id'] = str(tweet_id)

            twitter_data.append(tweet_data_json)
        except Exception as e:
            print(e)
            errors.append({'index':index, 'tweet':tweet_id})
    end_time = time.time()

    # write the twitter data list to a txt file
    with open(tweet_json_filename, 'w') as json_tweet_file:
        for tweet in twitter_data:
            json.dump(tweet, json_tweet_file)
            #add a new line, so when we read we can read line by line
            json_tweet_file.write("\n")
        
    # write the errors, may want to return to those later
    with open('errors.txt','w') as error_file:
        # create the writer
        writer = csv.DictWriter(error_file, fieldnames=['index','tweet'])
        for error in errors:
            writer.writerow(error)

There are 6 tweet IDs that returned an exception. Those have been recorded in the `errors` list and written to `errors.txt`.
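
If the failed IDs are to be retried later, it helps to write a header row so the file can be read back by column name. The sketch below assumes the same `index`/`tweet` fields as the `errors` list, uses made-up values, and writes to an in-memory buffer instead of `errors.txt`.

```python
import csv
import io

# made-up failures in the same shape as the notebook's `errors` list
errors = [{'index': 19, 'tweet': 1001}, {'index': 95, 'tweet': 1002}]

buffer = io.StringIO()  # stands in for open('errors.txt', 'w', newline='')
writer = csv.DictWriter(buffer, fieldnames=['index', 'tweet'])
writer.writeheader()    # without this, the file has no column names
writer.writerows(errors)

# reading the file back gives the IDs that could be retried
buffer.seek(0)
retry_ids = [int(row['tweet']) for row in csv.DictReader(buffer)]
```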

### Read the extended tweet data back into a dataframe

In [14]:
# read the data back in from the txt file to check we can load it
extended_tweet_data = []
with open(tweet_json_filename,'r') as tweet_file:
    for line in tweet_file:
        tweet_json = json.loads(line)
        tweet_data = {
            'retweet_count':tweet_json['retweet_count'],
            'favorite_count':tweet_json['favorite_count'],
            'tweet_id':tweet_json['tweet_id']
            }
        extended_tweet_data.append(tweet_data)
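
The one-object-per-line layout used for `tweet_json.txt` can be illustrated in isolation. This is a small sketch with made-up records and an in-memory buffer in place of the real file.

```python
import io
import json

# made-up records; the real ones come from the Twitter API
tweets = [
    {'tweet_id': '1001', 'retweet_count': 5, 'favorite_count': 12},
    {'tweet_id': '1002', 'retweet_count': 0, 'favorite_count': 3},
]

tweet_file = io.StringIO()  # stands in for open('tweet_json.txt', 'w')
for tweet in tweets:
    # one JSON document per line, so the file can be read line by line
    tweet_file.write(json.dumps(tweet) + '\n')

# read it back, skipping any blank lines
tweet_file.seek(0)
loaded = [json.loads(line) for line in tweet_file if line.strip()]
```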

In [15]:
# extract the elements from each tweet into a  dictionary, then build up a pandas dataframe
df_json_tweet = pd.DataFrame(extended_tweet_data, columns=['tweet_id', 'favorite_count','retweet_count'])
df_json_tweet.head()

Unnamed: 0,tweet_id,favorite_count,retweet_count
0,892420643555336193,38884,8614
1,892177421306343426,33294,6328
2,891815181378084864,25090,4200
3,891689557279858688,42249,8726
4,891327558926688256,40396,9497


# Assess

Any observations have been added towards the end of this section, rather than under the 

### Visual Assessments

In [16]:
twitter_archive.sample(10)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1951,673686845050527744,,,2015-12-07 02:13:55 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is George. He's upset that the 4th of Jul...,,,,https://twitter.com/dog_rates/status/673686845...,11,10,George,,,,
1171,720415127506415616,,,2016-04-14 00:55:25 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Garden's coming in nice this year. 10/10 https...,,,,https://twitter.com/dog_rates/status/720415127...,10,10,,,,,
2191,668955713004314625,,,2015-11-24 00:54:05 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a Slovakian Helter Skelter Feta named ...,,,,https://twitter.com/dog_rates/status/668955713...,10,10,a,,,,
213,851591660324737024,,,2017-04-11 00:24:08 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Oh jeez u did me quite the spook little fella....,,,,https://twitter.com/dog_rates/status/851591660...,11,10,,,,,
871,761599872357261312,,,2016-08-05 16:28:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...","This is Sephie. According to this picture, she...",,,,https://twitter.com/dog_rates/status/761599872...,11,10,Sephie,,,,
67,879376492567855104,,,2017-06-26 16:31:08 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Jack AKA Stephen Furry. You're not sco...,,,,https://twitter.com/dog_rates/status/879376492...,12,10,Jack,,,,
2240,667924896115245057,,,2015-11-21 04:37:59 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Jiminy. He has always wanted to be a c...,,,,https://twitter.com/dog_rates/status/667924896...,9,10,Jiminy,,,,
149,863079547188785154,6.671522e+17,4196984000.0,2017-05-12 17:12:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Ladies and gentlemen... I found Pipsy. He may ...,,,,https://twitter.com/dog_rates/status/863079547...,14,10,,,,,
1299,707738799544082433,,,2016-03-10 01:24:13 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...",He's doing his best. 12/10 very impressive tha...,,,,https://vine.co/v/hUvHKYrdb1d,12,10,,,,,
1187,718460005985447936,,,2016-04-08 15:26:28 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Bowie. He's listening for underground squ...,,,,https://twitter.com/dog_rates/status/718460005...,9,10,Bowie,,,,


In [17]:
twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [18]:
twitter_archive.sample(10)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
401,824663926340194305,,,2017-01-26 17:02:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Wilson. Named after the volleyball. He...,,,,https://twitter.com/dog_rates/status/824663926...,13,10,Wilson,,,,
1022,746542875601690625,,,2016-06-25 03:17:46 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...",Here's a golden floofer helping with the groce...,,,,https://vine.co/v/5uZYwqmuDeT,11,10,,,floofer,,
1151,725842289046749185,,,2016-04-29 00:21:01 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Colby. He's currently regretting all t...,,,,https://twitter.com/dog_rates/status/725842289...,12,10,Colby,,,,
1596,686286779679375361,,,2016-01-10 20:41:33 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...",When bae calls your name from across the room....,,,,https://vine.co/v/iMZx6aDbExn,12,10,,,,,
41,884441805382717440,,,2017-07-10 15:58:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...","I present to you, Pup in Hat. Pup in Hat is gr...",,,,https://twitter.com/dog_rates/status/884441805...,14,10,,,,,
667,790337589677002753,,,2016-10-23 23:42:19 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Maggie. She can hear your cells divide. 1...,,,,https://twitter.com/dog_rates/status/790337589...,12,10,Maggie,,,,
808,771770456517009408,,,2016-09-02 18:03:10 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Davey. He'll have your daughter home b...,,,,https://twitter.com/dog_rates/status/771770456...,11,10,Davey,,,,
39,884876753390489601,,,2017-07-11 20:47:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Lola. It's her first time outside. Mus...,,,,https://twitter.com/dog_rates/status/884876753...,13,10,Lola,,,,
176,857746408056729600,,,2017-04-28 00:00:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Thor. He doesn't have finals because he's...,,,,https://twitter.com/dog_rates/status/857746408...,13,10,Thor,,,,
811,771171053431250945,,,2016-09-01 02:21:21 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Frankie. He's wearing b...,6.733201e+17,4196984000.0,2015-12-06 01:56:44 +0000,https://twitter.com/dog_rates/status/673320132...,11,10,Frankie,,,,


In [19]:
twitter_archive.shape

(2356, 17)

In [20]:
twitter_archive[twitter_archive.duplicated()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


The `twitter_archive` dataset has 2356 rows. The tweet ID is also referenced in the `image_predictions` dataset, so we can check whether one table has more tweet IDs than the other.

In [21]:
image_predictions.shape

(2075, 12)

There are more tweet IDs in `twitter_archive` than in `image_predictions`. We can collect the missing tweet IDs in a list.


In [22]:
# will hold all the tweet id's that are missing from image_predictions that are in twitter_archive
missing_tweets_image_predictions = []

# twitter ids
tid_ta = list(twitter_archive['tweet_id'].unique())
tid_ip = list(image_predictions['tweet_id'].unique())

for tw_id in tid_ta:
    if tw_id not in tid_ip:
        missing_tweets_image_predictions.append(tw_id)

In [23]:
len(missing_tweets_image_predictions)

281

There are 281 missing tweet ids from image_predictions

## Assessment of Twitter Archive

Quality (issues with content)

- `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, `retweeted_status_timestamp` all have NaN values
- `expanded_urls` column is also incomplete (missing data)
- `timestamp` should be converted to a datetime object
- `retweeted_status_id` and `retweeted_status_user_id` are NaN unless the tweet is a retweet (this is fine, as we need to remove the 181 retweets anyway; we just want original data)
- `in_reply_to_status_id` and `in_reply_to_user_id` are likewise NaN unless the tweet is a reply
- Some dogs do not have a dog stage assigned (doggo, floofer, pupper or puppo)
- Some dogs have multiple stages assigned (e.g. puppo and doggo)
- Some dog names look like they have been incorrectly extracted from the text. Some names are "a", "an" or "just".
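
The incorrectly extracted names could be flagged programmatically. The sketch below rests on an assumption that is not verified here: genuine names are capitalised, so an all-lowercase entry is a word picked up from the tweet text by mistake. The sample list is illustrative.

```python
# a few names seen in the archive plus the suspicious ones noted above
names = ['Phineas', 'Tilly', 'a', 'an', 'Bowie', 'just', 'a']

# assumption: real names start with a capital letter
suspect_names = sorted({name for name in names if name.islower()})
```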

Tidiness

- `source` column contains HTML tags; we can remove the tags and keep the inner text.
- The dog stages are all in separate columns (`doggo`, `floofer`, `pupper`, `puppo`); these can be melted into a single column
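
One way to picture the single-column layout is to collapse the four stage columns row by row. This is a plain-Python sketch, not the cleaning code itself: `combine_stages` is a hypothetical helper, and it assumes a missing stage is stored as the string `'None'`, as the later cleaning code in this notebook does.

```python
STAGES = ('doggo', 'floofer', 'pupper', 'puppo')


def combine_stages(row):
    """Return the stage names present in a row, joined by commas.

    Returns None when no stage is set. A row with more than one stage
    keeps all of them, so it can still be found and reviewed by hand.
    """
    present = [stage for stage in STAGES if row.get(stage, 'None') != 'None']
    return ','.join(present) if present else None


rows = [
    {'doggo': 'None', 'floofer': 'None', 'pupper': 'None', 'puppo': 'None'},
    {'doggo': 'None', 'floofer': 'floofer', 'pupper': 'None', 'puppo': 'None'},
    {'doggo': 'doggo', 'floofer': 'None', 'pupper': 'pupper', 'puppo': 'None'},
]
combined = [combine_stages(row) for row in rows]
```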

In [24]:
image_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [25]:
image_predictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [26]:
image_predictions[image_predictions.duplicated()]

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog


In [27]:
df_json_tweet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2345 entries, 0 to 2344
Data columns (total 3 columns):
tweet_id          2345 non-null object
favorite_count    2345 non-null int64
retweet_count     2345 non-null int64
dtypes: int64(2), object(1)
memory usage: 55.0+ KB


## Assessment of Image Predictions

- missing 281 tweets

## Assessment of JSON tweets

- `tweet_id` is stored as a string (object) and needs to be int64 to match the tweet IDs in the other tables

# Cleaning

In [28]:
# first take a copy of the dataset as we will be manipulating it
twitter_archive_clean = twitter_archive.copy()

## Delete Retweets, replies and unnecessary columns

### Delete Retweets

#### Define

- Remove the tweets that are retweets, leaving just original tweets, by dropping the rows that have a tweet ID in `retweeted_status_id`. We can also check `retweeted_status_timestamp` and `retweeted_status_user_id` for entries; if they are NaN then the row is an original tweet.

#### Code

In [29]:
# retweeted_status_id can be removed as if it exists then it is a retweet, drop the rows
retweets = twitter_archive_clean[twitter_archive_clean['retweeted_status_id'].isnull()==False]
twitter_archive_clean.drop(retweets.index, inplace=True)
twitter_archive_clean.reset_index(drop=True,inplace=True)

#### Test

- To test that the retweets have all been removed, we can count the rows that have a non-null value in the `retweeted_status_id` column.
- We should also test the other columns that could indicate the row is a retweet, such as `retweeted_status_user_id` and `retweeted_status_timestamp`, just in case `retweeted_status_id` was null for a retweet.

In [30]:
twitter_archive_clean[twitter_archive_clean['retweeted_status_id'].isnull()==False].shape[0]

0

In [31]:
twitter_archive_clean[twitter_archive_clean['retweeted_status_user_id'].isnull()==False].shape[0]

0

In [32]:
twitter_archive_clean[twitter_archive_clean['retweeted_status_timestamp'].isnull()==False].shape[0]

0

In [33]:
twitter_archive_clean.shape[0]

2175

All retweets have been removed, reducing the dataset to 2175 rows

### Delete Replies

#### Define

- Remove tweets that are replies to tweets. We only want original tweets in this study. We can check for replies using the same method as we did for deleting retweets

#### Code

In [34]:
# check for all rows which have an entry for in_reply_to_status_id or in_reply_to_user_id
twitter_archive_clean[twitter_archive_clean['in_reply_to_status_id'].isnull()==False]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
29,886267009285017600,8.862664e+17,2.281182e+09,2017-07-15 16:51:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@NonWhiteHat @MayhewMayhem omg hello tanner yo...,,,,,12,10,,,,,
52,881633300179243008,8.816070e+17,4.738443e+07,2017-07-02 21:58:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@roushfenway These are good dogs but 17/10 is ...,,,,,17,10,,,,,
61,879674319642796034,8.795538e+17,3.105441e+09,2017-06-27 12:14:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@RealKentMurphy 14/10 confirmed,,,,,14,10,,,,,
101,870726314365509632,8.707262e+17,1.648776e+07,2017-06-02 19:38:25 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@ComplicitOwl @ShopWeRateDogs &gt;10/10 is res...,,,,,10,10,,,,,
130,863427515083354112,8.634256e+17,7.759620e+07,2017-05-13 16:15:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@Jack_Septic_Eye I'd need a few more pics to p...,,,,,12,10,,,,,
131,863079547188785154,6.671522e+17,4.196984e+09,2017-05-12 17:12:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Ladies and gentlemen... I found Pipsy. He may ...,,,,https://twitter.com/dog_rates/status/863079547...,14,10,,,,,
156,857214891891077121,8.571567e+17,1.806710e+08,2017-04-26 12:48:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@Marc_IRL pixelated af 12/10,,,,,12,10,,,,,
159,856526610513747968,8.558181e+17,4.196984e+09,2017-04-24 15:13:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...","THIS IS CHARLIE, MARK. HE DID JUST WANT TO SAY...",,,,https://twitter.com/dog_rates/status/856526610...,14,10,,,,,
160,856288084350160898,8.562860e+17,2.792810e+08,2017-04-23 23:26:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@xianmcguire @Jenna_Marbles Kardashians wouldn...,,,,,14,10,,,,,
162,855862651834028034,8.558616e+17,1.943518e+08,2017-04-22 19:15:32 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@dhmontgomery We also gave snoop dogg a 420/10...,,,,,420,10,,,,,


There are 78 rows in the dataset which are replies to a tweet

In [35]:
replies = twitter_archive_clean[twitter_archive_clean['in_reply_to_status_id'].isnull()==False]
twitter_archive_clean.drop(replies.index, inplace=True)
twitter_archive_clean.reset_index(inplace=True, drop=True)

#### Test

- Test the replies have been removed by checking for non null values in both `in_reply_to_status_id` and `in_reply_to_user_id`

In [36]:
# check for all rows which have an entry for in_reply_to_status_id or in_reply_to_user_id
assert (twitter_archive_clean[twitter_archive_clean['in_reply_to_status_id'].isnull()==False].shape[0]==0 
        and 
        twitter_archive_clean[twitter_archive_clean['in_reply_to_user_id'].isnull()==False].shape[0]==0)

Assertion test passed

### Delete columns no longer required

#### Define

No longer need the following columns
- `in_reply_to_status_id`
- `in_reply_to_user_id`
- `retweeted_status_id`
- `retweeted_status_user_id`
- `retweeted_status_timestamp`

#### Code

In [37]:
cols_to_drop = ['in_reply_to_status_id','in_reply_to_user_id','retweeted_status_id','retweeted_status_user_id','retweeted_status_timestamp']
twitter_archive_clean.drop(labels=cols_to_drop, axis=1, inplace=True)
twitter_archive_clean.reset_index(inplace=True, drop=True)

#### Test

- A visual assessment will suffice; however, I've also included an assertion as best practice

In [38]:
twitter_archive_clean.head(1)

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,


In [39]:
# checks the length of the list for 0 (ok), which will be populated if columns that should have been deleted are present.
assert len([col for col in cols_to_drop if col in twitter_archive_clean.columns])==0

## Convert Timestamp from string to datetime object

#### Define

- The `timestamp` column is currently a string object and needs to be converted to a datetime object. This can be done with `datetime.strptime()`, which parses a string according to a format and returns a datetime object

#### Code

In [40]:
# function takes in the datetime string of the cell and returns the datetime object
def convert_to_datetime(dt_str):
    fmt = "%Y-%m-%d %H:%M:%S +0000"
    return dt.strptime(dt_str,fmt)

In [41]:
twitter_archive_clean['timestamp']= twitter_archive_clean['timestamp'].apply(convert_to_datetime)
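
The format string above treats `+0000` as literal text, so the resulting datetimes carry no timezone. As a small illustration on one timestamp from the archive, the sketch below compares that with the `%z` directive, which parses the offset instead of discarding it. The variable names are illustrative.

```python
from datetime import datetime, timedelta

raw = '2017-08-01 16:23:56 +0000'

# the notebook's format: the offset is matched as literal text and dropped
naive = datetime.strptime(raw, '%Y-%m-%d %H:%M:%S +0000')

# alternative: %z parses the offset and keeps it as timezone information
aware = datetime.strptime(raw, '%Y-%m-%d %H:%M:%S %z')

# the parsed offset is zero, i.e. the timestamp is in UTC
is_utc = aware.utcoffset() == timedelta(0)
```

pandas also provides `pd.to_datetime`, which converts a whole column in one call.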

#### Test

- Visual test: checking the info, we should see that the `timestamp` column has dtype `datetime64[ns]`

In [42]:
twitter_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2097 entries, 0 to 2096
Data columns (total 12 columns):
tweet_id              2097 non-null int64
timestamp             2097 non-null datetime64[ns]
source                2097 non-null object
text                  2097 non-null object
expanded_urls         2094 non-null object
rating_numerator      2097 non-null int64
rating_denominator    2097 non-null int64
name                  2097 non-null object
doggo                 2097 non-null object
floofer               2097 non-null object
pupper                2097 non-null object
puppo                 2097 non-null object
dtypes: datetime64[ns](1), int64(3), object(8)
memory usage: 196.7+ KB


## Cleaning source

#### Define

- The `source` column contains a full HTML `<a>` tag. We only need the inner text of the tag (its content), which holds the actual information, so we will extract that text and discard the surrounding markup

#### Code

In [43]:
twitter_archive_clean['source'].value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     1964
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       31
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64

In [44]:
# function that searches for the pattern group below, this extracts the text within an a html tag
def clean_source(html_tag):
    matches = re.search(r'>([\w\s\-]+)<', html_tag)
    if matches is not None:
        return matches.group(1)
    else:
        return "no_url"
    
twitter_archive_clean['source']= twitter_archive_clean['source'].apply(clean_source)
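
To see the extraction at work outside the dataframe, here is a small sketch run on the four distinct `source` values listed above. It uses a slightly more permissive pattern than `clean_source` (any character other than `<`), which is a design choice of this sketch and not the notebook's pattern; `inner_text` is a hypothetical name.

```python
import re

sources = [
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
]

# capture everything between the end of the opening tag and the next '<'
INNER_TEXT = re.compile(r'>([^<]+)<')


def inner_text(html_tag, default='no_url'):
    """Return the text inside the tag, or `default` if nothing matches."""
    match = INNER_TEXT.search(html_tag)
    return match.group(1) if match is not None else default


labels = [inner_text(source) for source in sources]
```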

In [45]:
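# convert the source column to a categorical column
# (astype returns a new Series, so the result has to be assigned back)
twitter_archive_clean['source'] = twitter_archive_clean['source'].astype('category')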
twitter_archive_clean.head()

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,2017-08-01 16:23:56,Twitter for iPhone,This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,2017-08-01 00:17:27,Twitter for iPhone,This is Tilly. She's just checking pup on you....,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,2017-07-31 00:18:03,Twitter for iPhone,This is Archie. He is a rare Norwegian Pouncin...,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,2017-07-30 15:58:51,Twitter for iPhone,This is Darla. She commenced a snooze mid meal...,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,2017-07-29 16:00:24,Twitter for iPhone,This is Franklin. He would like you to stop ca...,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


#### Test

- I can check the value counts of the column

In [46]:
twitter_archive_clean['source'].value_counts()

Twitter for iPhone     1964
Vine - Make a Scene      91
Twitter Web Client       31
TweetDeck                11
Name: source, dtype: int64

## Cleaning expanded_urls - delete column

#### Define

- The `expanded_urls` column is not required, so we can remove it

#### Code

In [47]:
twitter_archive_clean.drop('expanded_urls', axis=1, inplace=True)

#### Test

In [48]:
# test using visual - check columns
twitter_archive_clean.head(1)

Unnamed: 0,tweet_id,timestamp,source,text,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,2017-08-01 16:23:56,Twitter for iPhone,This is Phineas. He's a mystical boy. Only eve...,13,10,Phineas,,,,


In [49]:
assert 'expanded_urls' not in twitter_archive_clean.columns

## Cleaning dog stages

#### Define

- It was noticed that some of the dogs have multiple stages, so a close look at this is important to see if we can determine the right dog stage

#### Code

In [50]:
pd.get_option('display.max_colwidth')

50

In [51]:
# just an option to change as we'll be looking at text and don't want it truncated in the view
pd.set_option('display.max_colwidth',-1)

The below code will take a look at the rows which have been assigned more than one dog stage

In [52]:
# function that checks the dataframe for rows with duplicate dog stages

def get_duplicate_dog_stages():
    # extract just the dog stages
    dog_stages = twitter_archive_clean.loc[:,'doggo':'puppo']
    # assign a numeric value, 0=None and 1 if a dog_stage in the column, the sum in the row direction should not be more than 1
    dog_stages_numbers = dog_stages.applymap(lambda x: 1 if x !='None' else 0)

    # get counts (the number of dog stages)
    dog_stages_numbers['count'] = dog_stages_numbers.apply(sum, axis=1)
    # put all the bad indexes in a separate object
    bad_rows = dog_stages_numbers[dog_stages_numbers['count'] > 1]
    return bad_rows
bad_rows = get_duplicate_dog_stages()
bad_indexes = bad_rows.index

In [53]:
bad_rows

Unnamed: 0,doggo,floofer,pupper,puppo,count
154,1,0,0,1,2
161,1,1,0,0,2
358,1,0,1,0,2
416,1,0,1,0,2
446,1,0,1,0,2
536,1,0,1,0,2
562,1,0,1,0,2
689,1,0,1,0,2
748,1,0,1,0,2
848,1,0,1,0,2


In [54]:
twitter_archive_clean.iloc[bad_indexes,[3,7,8,9,10]]

Unnamed: 0,text,doggo,floofer,pupper,puppo
154,Here's a puppo participating in the #ScienceMarch. Cleverly disguising her own doggo agenda. 13/10 would keep the planet habitable for https://t.co/cMhq16isel,doggo,,,puppo
161,"At first I thought this was a shy doggo, but it's actually a Rare Canadian Floofer Owl. Amateurs would confuse the two. 11/10 only send dogs https://t.co/TXdT3tmuYk",doggo,floofer,,
358,"This is Dido. She's playing the lead role in ""Pupper Stops to Catch Snow Before Resuming Shadow Box with Dried Apple."" 13/10 (IG: didodoggo) https://t.co/m7isZrOBX7",doggo,,pupper,
416,Here we have Burke (pupper) and Dexter (doggo). Pupper wants to be exactly like doggo. Both 12/10 would pet at same time https://t.co/ANBpEYHaho,doggo,,pupper,
446,This is Bones. He's being haunted by another doggo of roughly the same size. 12/10 deep breaths pupper everything's fine https://t.co/55Dqe0SJNj,doggo,,pupper,
536,This is Pinot. He's a sophisticated doggo. You can tell by the hat. Also pointier than your average pupper. Still 10/10 would pet cautiously https://t.co/f2wmLZTPHd,doggo,,pupper,
562,"Pupper butt 1, Doggo 0. Both 12/10 https://t.co/WQvcPEpH2u",doggo,,pupper,
689,"Meet Maggie &amp; Lila. Maggie is the doggo, Lila is the pupper. They are sisters. Both 12/10 would pet at the same time https://t.co/MYwR4DQKll",doggo,,pupper,
748,Please stop sending it pictures that don't even have a doggo or pupper in them. Churlish af. 5/10 neat couch tho https://t.co/u2c9c7qSg8,doggo,,pupper,
848,This is just downright precious af. 12/10 for both pupper and doggo https://t.co/o5J479bZUC,doggo,,pupper,


For some of these rows it would be difficult to determine the dog stage programmatically. Some of them are actually 2 dogs in the same picture.

As there are only 11 rows, I have decided to review them manually. Here is what I will do:
- delete tweets with multiple dogs in the same pic
- if the tweet is about a single dog, correct the dog stage in the main table
- delete any submissions that shouldn't have been sent in

| Index | Text                                                                                                                                                                 | Decision                  | Notes                                         |
|-------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------|-----------------------------------------------|
| 154   | Here's a puppo participating in the #ScienceMarch. Cleverly disguising her own doggo agenda. 13/10 would keep the planet habitable for https://t.co/cMhq16isel       | puppo                     |                                               |
| 161   | At first I thought this was a shy doggo, but it's actually a Rare Canadian Floofer Owl. Amateurs would confuse the two. 11/10 only send dogs https://t.co/TXdT3tmuYk | floofer                   |                                               |
| 358   | This is Dido. She's playing the lead role in "Pupper Stops to Catch Snow Before Resuming Shadow Box with Dried Apple." 13/10 (IG: didodoggo) https://t.co/m7isZrOBX7 | pupper                    | doggo was found in the text with IG:didodoggo |
| 416   | Here we have Burke (pupper) and Dexter (doggo). Pupper wants to be exactly like doggo. Both 12/10 would pet at same time https://t.co/ANBpEYHaho                     | 2 dogs                    | delete                                        |
| 446   | This is Bones. He's being haunted by another doggo of roughly the same size. 12/10 deep breaths pupper everything's fine https://t.co/55Dqe0SJNj                     | pupper                    | tweet actually mentions the dog is a pupper   |
| 536   | This is Pinot. He's a sophisticated doggo. You can tell by the hat. Also pointier than your average pupper. Still 10/10 would pet cautiously https://t.co/f2wmLZTPHd | doggo                     |                                               |
| 562   | Pupper butt 1, Doggo 0. Both 12/10 https://t.co/WQvcPEpH2u                                                                                                           | 2 dogs                    | delete                                        |
| 689   | Meet Maggie &amp; Lila. Maggie is the doggo, Lila is the pupper. They are sisters. Both 12/10 would pet at the same time https://t.co/MYwR4DQKll                     | 2 dogs                    | delete                                        |
| 748   | Please stop sending it pictures that don't even have a doggo or pupper in them. Churlish af. 5/10 neat couch tho https://t.co/u2c9c7qSg8                             | should not have been sent | delete                                        |
| 848   | This is just downright precious af. 12/10 for both pupper and doggo https://t.co/o5J479bZUC                                                                          | 2 dogs                    | delete                                        |
| 897   | Like father (doggo), like son (pupper). Both 12/10 https://t.co/pG2inLaOda                                                                                           | 2 dogs                    | delete                                        |


In [55]:
# delete rows with index 416, 562, 689, 748, 848 and 897; do not reset the index until the cleaning operation is complete
delete_indexes = [416,562,689,748,848,897]
twitter_archive_clean.drop(delete_indexes, inplace=True)

In [56]:
# dictionary that holds what the new dog types will be
cleaned_dog_stages = {
    154 : 'puppo',
    161 : 'floofer',
    358 : 'pupper',
    446 : 'pupper',
    536 : 'doggo'
}

for k,v in cleaned_dog_stages.items():
    # reset the row to all None for all dog stages
    row = twitter_archive_clean.loc[k,'doggo':'puppo'].apply(lambda x: 'None')
    # change the dog stage to the value of the dog stage
    row[v]=v
    # write back into the dataframe
    twitter_archive_clean.loc[k,'doggo':'puppo'] = row
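
The correction loop can also assign through `.loc` directly, without building an intermediate row. The sketch below uses a hypothetical two-row frame (`sample`) and a hypothetical `corrections` dictionary in the same shape as `cleaned_dog_stages`; it is an illustration rather than a replacement for the cell above.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# hypothetical rows that each carry two stages, indexed like the archive rows
sample = pd.DataFrame(
    [['doggo', 'None', 'None', 'puppo'],
     ['doggo', 'floofer', 'None', 'None']],
    columns=stage_cols,
    index=[154, 161],
)

corrections = {154: 'puppo', 161: 'floofer'}

for idx, stage in corrections.items():
    # blank out all four stage columns for this row ...
    sample.loc[idx, stage_cols] = 'None'
    # ... then set only the column that matches the chosen stage
    sample.loc[idx, stage] = stage

print(sample)
```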

In [57]:
# now reset the index
twitter_archive_clean.reset_index(drop=True, inplace=True)

#### Test

To test, we can call the original function again and confirm there are no rows with ambiguous or duplicate dog stages assigned

In [58]:
get_duplicate_dog_stages()

Unnamed: 0,doggo,floofer,pupper,puppo,count


No rows with duplicate dog stages remain

## Melting/Collapsing dog stages

#### Define

- The dog stage information (doggo, floofer, pupper, puppo) is spread across four columns. We can collapse it into a single text column that holds the stage of the dog


#### Code

In [59]:
# function to combine all the dog stages into a single result
def combine_dog_stages(row):
    dog_stage_list = list(row['doggo':'puppo'])
    # filter the list for None string
    dog_stage_result = list(filter(lambda x: x!='None', dog_stage_list))
    if (len(dog_stage_result)==0):
        return 'None'
    else:
        return dog_stage_result[0]

# Enter data into the dog stage
twitter_archive_clean['dog_stage']=twitter_archive_clean.apply(combine_dog_stages, axis=1)
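
A vectorised alternative to the row-wise `apply` is sketched below. It assumes each row has at most one stage set (which the previous cleaning step is meant to guarantee) and that the value stored in a stage column equals the column name. The frame `sample` is hypothetical.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# hypothetical rows: no stage, pupper, doggo
sample = pd.DataFrame(
    [['None', 'None', 'None', 'None'],
     ['None', 'None', 'pupper', 'None'],
     ['doggo', 'None', 'None', 'None']],
    columns=stage_cols,
)

# True where a stage is set
has_stage = sample[stage_cols].ne('None')

# idxmax returns the label of the first True column in each row;
# rows with no stage at all are set back to the string 'None'
dog_stage = has_stage.idxmax(axis=1).where(has_stage.any(axis=1), 'None')
print(dog_stage.tolist())
```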

#### Test

We can test it worked by checking the value counts of the newly created column `dog_stage`

In [60]:
twitter_archive_clean['dog_stage'].value_counts()

None       1761
pupper     223 
doggo      73  
puppo      24  
floofer    10  
Name: dog_stage, dtype: int64

We can also count the number of times each dog stage name appears in the old (wide format) columns and compare those counts against the counts in the new column

In [61]:
dog_stage_list_all = list(twitter_archive_clean['dog_stage'].value_counts().index)
# get rid of the None in the list
dog_stage_list = list(filter(lambda x:x!='None',dog_stage_list_all))

for dog_stage in dog_stage_list:
    assert twitter_archive_clean[dog_stage].value_counts()[dog_stage] == twitter_archive_clean['dog_stage'].value_counts()[dog_stage]

The assertions passed, so we can now delete the old dog stage columns

In [62]:
dog_stage_list

['pupper', 'doggo', 'puppo', 'floofer']

In [63]:
twitter_archive_clean.drop(dog_stage_list, axis=1, inplace=True)

# Image Predictions Cleaning


## Get the highest dog prediction

#### Define

- The image predictions dataframe holds three predictions per image (`p1`, `p2`, `p3`), each with a confidence level and a boolean value that says whether the guess is a dog or not a dog. To clean this dataset I will remove the predictions that guessed the image was not a dog, and then keep only the breed with the highest confidence for each tweet

In [64]:
image_predictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


#### Code

In [65]:
p1_cols = ['tweet_id','p1','p1_conf','p1_dog']
p2_cols = ['tweet_id','p2','p2_conf','p2_dog']
p3_cols = ['tweet_id','p3','p3_conf','p3_dog']

p1_rename_dict = {'p1':'prediction','p1_conf':'confidence','p1_dog':'is_dog'}
p2_rename_dict = {'p2':'prediction','p2_conf':'confidence','p2_dog':'is_dog'}
p3_rename_dict = {'p3':'prediction','p3_conf':'confidence','p3_dog':'is_dog'}

p1_df = image_predictions.loc[:,p1_cols].rename(columns=p1_rename_dict)
p2_df = image_predictions.loc[:,p2_cols].rename(columns=p2_rename_dict)
p3_df = image_predictions.loc[:,p3_cols].rename(columns=p3_rename_dict)

stacked_image_predictions = pd.concat([p1_df,p2_df,p3_df], axis=0).copy()
stacked_image_predictions.reset_index(drop=True, inplace=True)
stacked_image_predictions.head(10)

Unnamed: 0,tweet_id,prediction,confidence,is_dog
0,666020888022790149,Welsh_springer_spaniel,0.465074,True
1,666029285002620928,redbone,0.506826,True
2,666033412701032449,German_shepherd,0.596461,True
3,666044226329800704,Rhodesian_ridgeback,0.408143,True
4,666049248165822465,miniature_pinscher,0.560311,True
5,666050758794694657,Bernese_mountain_dog,0.651137,True
6,666051853826850816,box_turtle,0.933012,False
7,666055525042405380,chow,0.692517,True
8,666057090499244032,shopping_cart,0.962465,False
9,666058600524156928,miniature_poodle,0.201493,True
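
The three near-identical column lists and rename dictionaries can be generated in a loop. A minimal sketch, using a hypothetical two-row frame `preds` laid out like `image_predictions`:

```python
import pandas as pd

# two hypothetical rows in the same wide layout as image_predictions
preds = pd.DataFrame({
    'tweet_id': [1, 2],
    'p1': ['redbone', 'box_turtle'], 'p1_conf': [0.5, 0.9], 'p1_dog': [True, False],
    'p2': ['collie', 'chow'], 'p2_conf': [0.2, 0.05], 'p2_dog': [True, True],
    'p3': ['malinois', 'mud_turtle'], 'p3_conf': [0.1, 0.01], 'p3_dog': [True, False],
})

frames = []
for rank in (1, 2, 3):
    # map e.g. p1 / p1_conf / p1_dog onto the shared column names
    rename_map = {
        f'p{rank}': 'prediction',
        f'p{rank}_conf': 'confidence',
        f'p{rank}_dog': 'is_dog',
    }
    frames.append(preds[['tweet_id', *rename_map]].rename(columns=rename_map))

# ignore_index=True gives a fresh 0..n-1 index, so no reset_index is needed
stacked = pd.concat(frames, ignore_index=True)
print(stacked)
```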


In [66]:
# new stacked table should be 3 x size of the old
assert stacked_image_predictions.shape[0] == 3*image_predictions.shape[0]

In [67]:
# remove any predictions that are not a dog
tweet_not_dog = stacked_image_predictions[stacked_image_predictions['is_dog']==False].index
stacked_image_predictions.drop(tweet_not_dog, inplace=True)
stacked_image_predictions.reset_index(drop=True, inplace=True)
stacked_image_predictions.head(10)

Unnamed: 0,tweet_id,prediction,confidence,is_dog
0,666020888022790149,Welsh_springer_spaniel,0.465074,True
1,666029285002620928,redbone,0.506826,True
2,666033412701032449,German_shepherd,0.596461,True
3,666044226329800704,Rhodesian_ridgeback,0.408143,True
4,666049248165822465,miniature_pinscher,0.560311,True
5,666050758794694657,Bernese_mountain_dog,0.651137,True
6,666055525042405380,chow,0.692517,True
7,666058600524156928,miniature_poodle,0.201493,True
8,666063827256086533,golden_retriever,0.77593,True
9,666071193221509120,Gordon_setter,0.503672,True


In [68]:
# drop the is_dog column, as they're all true now
stacked_image_predictions.drop('is_dog', axis=1, inplace=True)
stacked_image_predictions.head()

Unnamed: 0,tweet_id,prediction,confidence
0,666020888022790149,Welsh_springer_spaniel,0.465074
1,666029285002620928,redbone,0.506826
2,666033412701032449,German_shepherd,0.596461
3,666044226329800704,Rhodesian_ridgeback,0.408143
4,666049248165822465,miniature_pinscher,0.560311


We now have several rows with the same tweet id. We want to group by tweet and select the row with the highest confidence

In [69]:
highest_confidence = stacked_image_predictions.groupby(['tweet_id'])['confidence'].max().reset_index()

In [70]:
# merge this with stacked_image_predictions table 
most_confident_predictions = pd.merge(left=highest_confidence, right=stacked_image_predictions, how='left',on=['tweet_id','confidence'])
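
Merging back on the float `confidence` value returns more than one row for a tweet if two of its predictions tie. One way to avoid both the merge and the tie is `idxmax`, sketched here on a hypothetical frame `stacked`; this is an alternative, not what the cells above do.

```python
import pandas as pd

# hypothetical stacked predictions for two tweets
stacked = pd.DataFrame({
    'tweet_id':   [1, 1, 1, 2, 2],
    'prediction': ['redbone', 'collie', 'malinois', 'chow', 'pug'],
    'confidence': [0.5, 0.2, 0.1, 0.3, 0.6],
})

# index label of the most confident row for each tweet
best_rows = stacked.groupby('tweet_id')['confidence'].idxmax()

# select those rows; exactly one row per tweet_id
most_confident = stacked.loc[best_rows].reset_index(drop=True)
print(most_confident)
```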

#### Test

In [71]:
most_confident_predictions.head()

Unnamed: 0,tweet_id,confidence,prediction
0,666020888022790149,0.465074,Welsh_springer_spaniel
1,666029285002620928,0.506826,redbone
2,666033412701032449,0.596461,German_shepherd
3,666044226329800704,0.408143,Rhodesian_ridgeback
4,666049248165822465,0.560311,miniature_pinscher


## Merging predictions with the twitter archive

#### Define

The separate tables need to be merged into a single dataframe, starting with the predictions and the twitter archive

#### Code

In [72]:
# merge predictions and archive
merged_predictions_archive = pd.merge(left=twitter_archive_clean, right=most_confident_predictions, how='inner', on=['tweet_id']).copy()
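
An inner merge silently drops tweets that appear in only one table. The sketch below shows two optional checks that `pd.merge` supports, `validate` and `indicator`, on hypothetical frames `archive` and `predictions`.

```python
import pandas as pd

# hypothetical tables; tweet 2 has no prediction and tweet 4 is not in the archive
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'name': ['Tilly', 'Archie', 'Darla']})
predictions = pd.DataFrame({'tweet_id': [1, 3, 4], 'prediction': ['Chihuahua', 'basset', 'pug']})

# validate raises an error if tweet_id is duplicated on either side
merged = pd.merge(archive, predictions, how='inner', on='tweet_id', validate='one_to_one')

# an outer merge with indicator=True shows which rows the inner merge dropped
outer = pd.merge(archive, predictions, how='outer', on='tweet_id', indicator=True)
dropped = outer[outer['_merge'] != 'both']
print(dropped)
```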

#### Test

This can be checked visually by looking at the merged dataframe

In [73]:
merged_predictions_archive.head(1)

Unnamed: 0,tweet_id,timestamp,source,text,rating_numerator,rating_denominator,name,dog_stage,confidence,prediction
0,892177421306343426,2017-08-01 00:17:27,Twitter for iPhone,"This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV",13,10,Tilly,,0.323581,Chihuahua


## Merging json tweets with the twitter archive

#### Define

The final merge combines the JSON tweets with the twitter archive. First the `tweet_id` column of the JSON tweets needs to be converted to int64, then the tables can be merged

#### Code

In [74]:
df_json_tweet['tweet_id'] = df_json_tweet['tweet_id'].astype('int64')
df_json_tweet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2345 entries, 0 to 2344
Data columns (total 3 columns):
tweet_id          2345 non-null int64
favorite_count    2345 non-null int64
retweet_count     2345 non-null int64
dtypes: int64(3)
memory usage: 55.0 KB
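
The conversion matters because both merge keys need the same dtype. The sketch below assumes the ids in the JSON table were read in as strings (an assumption, since the original dtype is not shown in this section); `json_tweets` and `archive` are hypothetical stand-ins for `df_json_tweet` and the merged archive.

```python
import pandas as pd

# assumption: ids were stored as strings when the JSON file was parsed
json_tweets = pd.DataFrame({
    'tweet_id': ['892177421306343426', '891815181378084864'],
    'favorite_count': [33294, 25090],
})
archive = pd.DataFrame({'tweet_id': [892177421306343426, 891815181378084864]})

# convert the key so it matches the int64 key in the archive
json_tweets['tweet_id'] = json_tweets['tweet_id'].astype('int64')

merged = archive.merge(json_tweets, on='tweet_id', how='inner')
print(merged)
```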


In [75]:
df_complete = pd.merge(left=merged_predictions_archive, right=df_json_tweet, how='inner', on=['tweet_id']).copy()

#### Test

In [76]:
df_complete.head(1)

Unnamed: 0,tweet_id,timestamp,source,text,rating_numerator,rating_denominator,name,dog_stage,confidence,prediction,favorite_count,retweet_count
0,892177421306343426,2017-08-01 00:17:27,Twitter for iPhone,"This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV",13,10,Tilly,,0.323581,Chihuahua,33294,6328


## Cleaning dog stage

#### Define

Some of the tweets have None/Null for the dog stage instead of a stage (doggo, puppo etc). In this section I will take a closer look by analysing the tweet text. This will not work for every tweet, as some of the text gives no indication of the dog stage, but it should increase the number of tweets with a stage assigned

#### Code

In [77]:
df_complete['dog_stage'].value_counts()

None       1409
pupper     168 
doggo      54  
puppo      22  
floofer    8   
Name: dog_stage, dtype: int64

1409 tweets do not have a dog stage

In [78]:
# just get the rows that don't have a dog stage
no_dog_stage = df_complete.loc[:,['text','dog_stage']].query("dog_stage=='None'")
no_dog_stage.sample(10)

Unnamed: 0,text,dog_stage
1582,Meet Olive. He comes to spot by tree to reminisce of simpler times and truly admire his place in the universe. 11/10 https://t.co/LwrCwlWwPB,
1566,This is Ron. Ron's currently experiencing a brain freeze. Damn it Ron. 8/10 https://t.co/4ilfcR5SlK,
400,This is Daisy. She's here to make your day better. 13/10 mission h*ckin successful https://t.co/PbgvuD0qIL,
881,This is Remington. He was caught off guard by the magical floating cheese. Spooked af. 10/10 deep breaths pup https://t.co/mhPSADiJmZ,
110,"This is Zooey. She's the world's biggest fan of illiterate delivery people. 13/10 not your fault they don't listen, Zooey https://t.co/ixOFQ1tfqE",
1561,Here we have an Azerbaijani Buttermilk named Guss. He sees a demon baby Hitler behind his owner. 10/10 stays alert https://t.co/aeZykWwiJN,
1028,Meet Tupawc. He's actually a Christian rapper. Doesn't even understand the concept of dollar signs. 10/10 great guy https://t.co/mCqgtqLDCW,
1155,"I know we joke around on here, but this is getting really frustrating. We rate dogs. Not T-Rex. Thank you... 8/10 https://t.co/5aFw7SWyxU",
114,HI. MY. NAME. IS. BOOMER. AND. I. WANT. TO. SAY. IT'S. H*CKIN. RIDICULOUS. THAT. DOGS. CAN'T VOTE. ABSOLUTE. CODSWALLUP. THANK. YOU. 13/10 https://t.co/SqKJPwbQ2g,
1571,Meet Otis. He is a Peruvian Quartzite. Pic sponsored by Planters. Ears on point. Killer sunglasses. 10/10 ily Otis https://t.co/tIaaBIMlJN,


A lot of these contain some variation of a dog stage name, e.g. I noticed *Floofapolis* in the following tweet

In [79]:
df_complete.iloc[790]['text']

"This is Neptune. He's a Snowy Swiss Mountain Floofapolis. Cheeky wink. Tongue nifty af. 11/10 would pet so firmly https://t.co/SoZq2Xoopv"

Some of the dog stages are also shortened such as *pup* or pluralised such as *puppers*. I will search for other variations of the dog stage words in all the text and see what matches.

In [80]:
from functools import reduce

dog_stage_variants = ['doggo','pup','floo']

# gets all the words found in the text but reduces the dictionary key value pairs to a single list with just words
def get_words(words_found_dictionary):
    # just get the items of the dictionary
    all_lists = list(words_found_dictionary.values())
    just_words = reduce(lambda x,y:x+y, all_lists)
    return just_words

def contains_dog_words(words_found_dictionary):
    list_of_dog_words_found = get_words(words_found_dictionary)
    if len(list_of_dog_words_found)==0:
        return False
    else:
        return True

# takes in the tweet text, returns a dictionary of all the words that came up
def find_dog_stage_names(tweet_text):

    # dog_stage_variants holds the variations of the dog stages; note that pup will
    # find puppo and pupper, that's fine for now, we'll take a closer look later

    # construct and initialise a dictionary object from the possible dog stages
    dict_dog_stages = {ds:[] for ds in dog_stage_variants}
    
    # for each variation
    for dog_stage in dog_stage_variants:
        # match the words
        matches = re.finditer(dog_stage,tweet_text)
        
        # for each match
        for match in matches:
            # get the start position of the word in the tweet text, returns a tuple, so (start, end)
            start_pos = match.span()[0]
            
            # get the word that is referenced by the start position, take all the words
            # from the start position of the found word, split the sentence and take the
            # first word
            word_found = tweet_text[start_pos:].split(' ')[0]
            dict_dog_stages[dog_stage].append(word_found)
            
    return dict_dog_stages
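
A more compact variant of the search is sketched below. It differs from the function above in three ways, so its results would not match exactly: it ignores case, it only matches at the start of a word, and it captures the word without trailing punctuation. The names `variant_pattern` and `find_stage_words` are hypothetical.

```python
import re

# a variant stem at the start of a word, followed by the rest of the word
variant_pattern = re.compile(r'\b(?:doggo|pup|floo)\w*', flags=re.IGNORECASE)

def find_stage_words(tweet_text):
    # return every matching word, lower-cased
    return [word.lower() for word in variant_pattern.findall(tweet_text)]

text = "This is Neptune. He's a Snowy Swiss Mountain Floofapolis. Cheeky wink."
print(find_stage_words(text))
```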

In [81]:
# scan through all the rows/tweets which did not have a dog_stage assigned
text_results = []
for tweet_text in no_dog_stage['text']:
    # append the result in the dictionary
    text_results.append(find_dog_stage_names(tweet_text))

In [82]:
# how many in the list had a variation of the dog stage name
contains_words_list = list(map(lambda x:contains_dog_words(x), text_results))
all_false = [x for x in contains_words_list if x==False]
all_true = [x for x in contains_words_list if x==True]

message_results = "Out of {} tweets which did not have a dog stage associated, " +\
"{} were found to have some text in the tweet which could be used to determine the dog stage, " +\
"leaving {} that we cannot infer the stage of the dog"
print(message_results.format(no_dog_stage.shape[0], len(all_true), len(all_false)))


Out of 1409 tweets which did not have a dog stage associated, 213 were found to have some text in the tweet which could be used to determine the dog stage, leaving 1196 that we cannot infer the stage of the dog


def get_stage_names_for_tweet(row):
    names = find_dog_stage_names(row['text'])
    new_row = row
    for ds in dog_stage_variants:
        new_row[ds] = names[ds]
    return new_row
no_dog_stage.apply(get_stage_names_for_tweet, axis=1)

In [83]:
from pprint import PrettyPrinter
# reducer
pp = PrettyPrinter()

# This reducer will combine all the names for each key into a single list
reduce_dictionary = reduce(lambda x,y: {ds:x[ds]+y[ds] for ds in dog_stage_variants}, text_results)
# convert the dictionary items into a set, so we don't get repeats
for k,v in reduce_dictionary.items():
    reduce_dictionary[k] = set(v)


pp.pprint(reduce_dictionary)

{'doggo': {'doggos'},
 'floo': {'floof',
          'floof.',
          'floofs',
          'floofy',
          'floor',
          'floor.',
          'flooring.'},
 'pup': {'pup',
         'pup!',
         "pup's",
         'pup,',
         'pup.',
         'puparazzi"',
         'pupared',
         'pupared.',
         'pupholder.',
         'pupkins',
         'pupmost',
         'pupnado.',
         'puppa.',
         'puppalled.',
         'puppared',
         'puppears',
         'puppers',
         'puppers.',
         'puppertunity',
         'pupplause',
         'pupple',
         'pupple.',
         'puppoccino.',
         'puppologize',
         'pupporazzi.',
         'pupporting',
         'pupposes.',
         'puppreciate',
         'puppurchase',
         'pupright',
         'puprises',
         'pups',
         'pups.',
         'pupset',
         'pupset,',
         'pupset.',
         'pupsets',
         'pupsicle.',
         'pupside',
         'puptacular',
      

### Summary of what was found in variations of the dog stages words

__Doggo__

Every match for doggo was the word doggos, so we can convert these easily.

__Floofer__

There are more variants of the word for floofers, such as floof, floofs and floofy. The search also picked up the word floor, which we can ignore.

__Pupper__

Pups is definitely the most difficult, as I'm not sure whether pups refers to puppos or puppers. My guess is puppos, as it sounds like it's for a smaller animal, but I will leave these out. I found someone else who shared my frustration: https://imgur.com/gallery/EaWzw. The only ones I can be sure of for puppers are the following

- puppa
- puppers

__Summary__

I will look for the following words and convert them

- doggos
- floof
- floofs
- floofy
- puppa
- puppers

In [84]:
words_to_match = ['doggos','floof','floofs','floofy','puppa','puppers']
def correct_dog_stages(row):
    if(row['dog_stage']=='None'):
        # try and find one of the words in the list above in the text
        for word_to_match in words_to_match:
            m = re.search(word_to_match, row['text'])
            if m!=None:
                # we have found a word
                found_word = m.group(0)
                start_letter = found_word[0]
                # if it starts with d it's doggo
                # if it starts with p it's a pupper
                # if it starts with f it's a floofer
                if start_letter=='d':
                    row['dog_stage']='doggo'
                elif start_letter=='p':
                    row['dog_stage']='pupper'
                elif start_letter=='f':
                    row['dog_stage']='floofer'
                else:
                    pass
    return row
    
df_complete_a = df_complete.apply(correct_dog_stages, axis=1)
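
The first-letter rule works for this word list, but an explicit lookup table is easier to extend. The following is a sketch only: `word_to_stage`, `stage_pattern` and `infer_dog_stage` are hypothetical names, and unlike the cell above it matches whole words only, so the resulting counts may differ from those reported in the test below.

```python
import re

# each accepted word maps to the stage it stands for
word_to_stage = {
    'doggos': 'doggo',
    'floof': 'floofer',
    'floofs': 'floofer',
    'floofy': 'floofer',
    'puppa': 'pupper',
    'puppers': 'pupper',
}

# \b on both sides so that e.g. 'floor' or 'floofapolis' is not matched
stage_pattern = re.compile(r'\b(' + '|'.join(word_to_stage) + r')\b')

def infer_dog_stage(text):
    # return the stage for the first accepted word, or 'None' if there is none
    m = stage_pattern.search(text)
    return word_to_stage[m.group(1)] if m else 'None'

print(infer_dog_stage("12/10 would suffocate in floof"))
```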


#### Test

How do we know it worked? Well, we couldn't fix all the dog stage names: there was a lot of uncertainty, and to be sure I would have to look at the images myself or use a machine learning algorithm. However, we managed to find some, so we should see a reduction in the number of None/Null values compared with before. Originally we had 1409 tweets that did not have a dog stage. To test, we can query for `dog_stage=='None'` again and do a count

In [85]:
df_complete_a[df_complete_a['dog_stage']=='None'].shape[0]

1367

There are now only 1367 tweets which do not have a dog stage, so we managed to assign 42 stages during the cleaning process

## Cleaning dog names

#### Define

- Some of the dog names are None/Null, but others have a value such as `a` or `an`. This is caused by the parser that detected sentences such as `this is {dog's name}`, e.g. "this is rover", but also matched sentences such as "this is a {dog breed}", so the `a` was mistaken for the name.

#### Code

In [86]:
df_complete_a['name'].value_counts()

None         397
a            46 
Cooper       10 
Charlie      10 
Lucy         10 
Oliver       9  
Tucker       9  
Penny        8  
the          7  
Sadie        7  
Daisy        7  
Winston      7  
Toby         6  
Lola         6  
Jax          6  
Koda         6  
Stanley      5  
Rusty        5  
Leo          5  
Oscar        5  
Bo           5  
Bella        5  
Milo         4  
Bentley      4  
Dave         4  
Brody        4  
Bailey       4  
Jack         4  
Chester      4  
Scout        4  
            ..  
Acro         1  
Cal          1  
Kody         1  
Millie       1  
Ralpher      1  
Mac          1  
Klevin       1  
Nollie       1  
Ito          1  
Ralphie      1  
Bayley       1  
Layla        1  
Jeffri       1  
Blue         1  
Bert         1  
Duddles      1  
Comet        1  
Tango        1  
Scruffers    1  
Sprinkles    1  
Karll        1  
Gerbald      1  
Rizzy        1  
Clarkus      1  
space        1  
Frönq        1  
Hercules     1  
Iroh         1

In [87]:
df_complete_a['name'].value_counts().to_csv('names.csv')

- 397 names are `None`
- 46 names are `a`
- 7 names are `the`
- 4 names are `an`
- 2 names are `just`
- 3 names are `very`
- There are other names such as `unacceptable`, `O` and `all` but I have chosen to ignore those as they only appear once

In [88]:
# we can quickly glimpse through some of the problems in a csv, faster than in jupyter
# df_complete_a.query("name=='None' or name=='an' or name=='a' or name=='the' or name=='just'")['text'].to_csv('problem_names.csv')
missing_incorrect_names = df_complete_a.query("name=='None' or name=='an' or name=='a' or name=='the' or name=='just'")

In [89]:
# will pull names from a pattern
def extract_names(row):
    pattern_list=[
        r'named (\w+)',
        r'name is (\w+)'
    ]
    for pattern in pattern_list:
        m = re.search(pattern, row['text'], re.IGNORECASE)
        if m is not None:
            # if both patterns match, the later one wins
            row['name']=m.group(1)
    return row.copy()

corrected_names = missing_incorrect_names.apply(extract_names, axis=1)
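
Because `(\w+)` accepts any word, the patterns above can also capture a lowercase word that is not a name. A stricter sketch is shown below, under the assumption that dog names start with a capital letter; `name_pattern` and `extract_name` are hypothetical and the results would differ from the cell above.

```python
import re

# the trigger phrase is case-insensitive, but the captured name
# must start with a capital letter
name_pattern = re.compile(r'\b(?i:named|name is)\s+([A-Z]\w+)')

def extract_name(text):
    # return the captured name, or None when nothing matches
    m = name_pattern.search(text)
    return m.group(1) if m else None

print(extract_name("Here we have an Azerbaijani Buttermilk named Guss. He sees a demon."))
```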

Some names have been found using this approach

In [90]:
incorrect_names = ['None','a','just','an']

# managed to find some names
names_found = [name for name in corrected_names['name'].unique().tolist() if name not in incorrect_names]
#corrected_names['name'].unique().tolist()
for name in names_found:
    print(name)

Zoey
the
Sabertooth
Wylie
Kip
Jacob
Rufus
Spork
Hemry
Alfredo
Zeus
Leroi
Berta
Chuk
Guss
Alfonso
Cheryl
Jessiga
Klint
Big
Tickles
Kohl
Daryl
Octaviath
Johm


In [91]:
# now to correct this in the main table
def fix_names(row):
    if(row['name'] in incorrect_names):
        return extract_names(row)
    else:
        return row
df_complete_b = df_complete_a.apply(fix_names, axis=1)

#### Test

One way to check is to compare the value counts of the names before and after the fix, i.e. `df_complete_a` vs `df_complete_b`

In [92]:
old_counts = df_complete_a['name'].value_counts().reset_index().rename(columns={'index':'name', 'name':'counts_old'})
new_counts = df_complete_b['name'].value_counts().reset_index().rename(columns={'index':'name', 'name':'counts_new'})

compare_counts = pd.merge(left=old_counts, right=new_counts, how='inner',on=['name'])
compare_counts['diff'] = compare_counts['counts_old'] - compare_counts['counts_new']
compare_counts[compare_counts['diff'] > 0]

Unnamed: 0,name,counts_old,counts_new,diff
0,,397,391,6
1,a,46,29,17
45,an,4,3,1


The biggest improvement was for rows where the name was `a`: a total of 17 rows were corrected

## Cleaning incorrect submissions

#### Define

Some of the submissions were not dogs. You can often see this in the tweet text, where the owner of WeRateDogs replies humorously, asking users to stop sending pictures that aren't of dogs. We will look for phrases such as

- please only send
- we only rate dogs
- not a dog
- only send in dogs
- guys

We can also look for any submissions that had an unusual denominator

#### Code

In [93]:
# some images aren't dogs
replies_to_bad_submission = ['please only send', 'we only rate dogs', 'not a dog', 'only send in dogs', 'guys']


def find_bad_submission(row):
    for bad_reply in replies_to_bad_submission:
        m = re.search(bad_reply, row['text'], re.IGNORECASE)
        if(m!=None):
            row['bad_reply']=True
        else:
            row['bad_reply']=False
        return row
bad_replies = df_complete_b.apply(find_bad_submission,axis=1).query("bad_reply==True")
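
Note that in `find_bad_submission` the `return row` sits inside the `for` loop, so the function returns after testing only the first phrase, `please only send`. A sketch that tests every phrase is shown below on a hypothetical frame `sample`. Run on the full table it would likely flag more tweets than the count reported in the test that follows, particularly because `guys` is a very broad phrase, so the matches would need reviewing before anything is dropped.

```python
import re
import pandas as pd

replies_to_bad_submission = ['please only send', 'we only rate dogs', 'not a dog', 'only send in dogs', 'guys']

# one alternation pattern; re.escape guards against regex metacharacters
bad_reply_pattern = '|'.join(re.escape(phrase) for phrase in replies_to_bad_submission)

# hypothetical tweets: two bad submissions and one genuine dog
sample = pd.DataFrame({'text': [
    "This is a carrot. We only rate dogs. Please only send in dogs.",
    "This is an Iraqi Speed Kangaroo. It is not a dog.",
    "This is Darla. She commenced a snooze mid meal. 13/10",
]})

# True when any of the phrases appears in the text, ignoring case
sample['bad_reply'] = sample['text'].str.contains(bad_reply_pattern, case=False, regex=True)
print(sample)
```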


#### Test

In [94]:
n_bad_replies = bad_replies.shape[0]
print("algorithm has detected {} bad replies".format(n_bad_replies))

algorithm has detected 16 bad replies


In [95]:
bad_replies

Unnamed: 0,tweet_id,timestamp,source,text,rating_numerator,rating_denominator,name,dog_stage,confidence,prediction,favorite_count,retweet_count,bad_reply
22,887101392804085760,2017-07-18 00:07:08,Twitter for iPhone,This... is a Jubilant Antarctic House Bear. We only rate dogs. Please only send dogs. Thank you... 12/10 would suffocate in floof https://t.co/4Ad1jzJSdp,12,10,,floofer,0.733942,Samoyed,30615,6018,True
38,883117836046086144,2017-07-07 00:17:54,Twitter for iPhone,"Please only send dogs. We don't rate mechanics, no matter how h*ckin good. Thank you... 13/10 would sneak a pat https://t.co/Se5fZ9wp5E",13,10,,,0.949562,golden_retriever,37304,6743,True
71,874057562936811520,2017-06-12 00:15:36,Twitter for iPhone,"I can't believe this keeps happening. This, is a birb taking a bath. We only rate dogs. Please only send dogs. Thank you... 12/10 https://t.co/pwY9PQhtP2",12,10,,,0.832177,flat-coated_retriever,22758,4026,True
133,855459453768019968,2017-04-21 16:33:22,Twitter for iPhone,"Guys, we only rate dogs. This is quite clearly a bulbasaur. Please only send dogs. Thank you... 12/10 human used pet, it's super effective https://t.co/Xc7uj1C64x",12,10,quite,,0.389513,Blenheim_spaniel,31093,8745,True
166,845677943972139009,2017-03-25 16:45:08,Twitter for iPhone,"C'mon guys. Please only send in dogs. We only rate dogs, not Exceptional-Tongued Peruvian Floor Bears. Thank you... 12/10 https://t.co/z30iQLiXNo",12,10,,,0.808681,chow,26637,5218,True
350,809920764300447744,2016-12-17 00:38:52,Twitter for iPhone,"Please only send in dogs. We only rate dogs, not seemingly heartbroken ewoks. Thank you... still 10/10 would console https://t.co/HIraYS1Bzo",10,10,,,0.397163,Norwich_terrier,16922,4402,True
474,781524693396357120,2016-09-29 16:03:01,Twitter for iPhone,Idk why this keeps happening. We only rate dogs. Not Bangladeshi Couch Chipmunks. Please only send dogs... 12/10 https://t.co/ya7bviQUUf,12,10,,,0.003523,Chesapeake_Bay_retriever,22705,6235,True
667,746872823977771008,2016-06-26 01:08:52,Twitter for iPhone,This is a carrot. We only rate dogs. Please only send in dogs. You all really should know this by now ...11/10 https://t.co/9e48aPrBm2,11,10,a,,0.540201,Pembroke,6448,2369,True
671,746369468511756288,2016-06-24 15:48:42,Twitter for iPhone,This is an Iraqi Speed Kangaroo. It is not a dog. Please only send in dogs. I'm very angry with all of you ...9/10 https://t.co/5qpBTTpgUt,9,10,an,,0.622957,German_shepherd,6504,1814,True
708,739544079319588864,2016-06-05 19:47:03,Twitter for iPhone,This... is a Tyrannosaurus rex. We only rate dogs. Please only send in dogs. Thank you ...10/10 https://t.co/zxw8d5g94P,10,10,,,0.967397,Labrador_retriever,42708,23615,True


The algorithm has worked, so we can now remove these rows from the main table.

In [96]:
df_complete_b.shape[0]

1661

In [97]:
df_final = df_complete_b.drop(bad_replies.index)
df_final.reset_index(drop=True, inplace=True)
df_final.head()

Unnamed: 0,tweet_id,timestamp,source,text,rating_numerator,rating_denominator,name,dog_stage,confidence,prediction,favorite_count,retweet_count
0,892177421306343426,2017-08-01 00:17:27,Twitter for iPhone,"This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV",13,10,Tilly,,0.323581,Chihuahua,33294,6328
1,891815181378084864,2017-07-31 00:18:03,Twitter for iPhone,This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB,12,10,Archie,,0.716012,Chihuahua,25090,4200
2,891689557279858688,2017-07-30 15:58:51,Twitter for iPhone,This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ,13,10,Darla,,0.168086,Labrador_retriever,42249,8726
3,891327558926688256,2017-07-29 16:00:24,Twitter for iPhone,"This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f",12,10,Franklin,,0.555712,basset,40396,9497
4,891087950875897856,2017-07-29 00:08:17,Twitter for iPhone,Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWeek https://t.co/kQ04fDDRmh,13,10,,,0.425595,Chesapeake_Bay_retriever,20257,3142
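
The drop-by-index step above can be sanity-checked by comparing row counts before and after. The following is a minimal sketch on toy data: `demo` and `demo_bad` are hypothetical stand-ins for `df_complete_b` and `bad_replies`, and the filter pattern is only an illustration, not the one used earlier in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for df_complete_b; the rows are made up for illustration.
demo = pd.DataFrame({
    "tweet_id": [101, 102, 103, 104],
    "text": [
        "This is Rex. 12/10",
        "We only rate dogs. 11/10",
        "Meet Bo. 13/10",
        "Please only send dogs. 10/10",
    ],
})

# Hypothetical stand-in for bad_replies: a filtered frame keeps the original
# index labels, which is what makes drop(demo_bad.index) work.
demo_bad = demo[demo["text"].str.contains("only rate dogs|only send", case=False)]

rows_before = demo.shape[0]
demo_clean = demo.drop(demo_bad.index).reset_index(drop=True)

# The number of rows removed should equal the number of flagged rows,
# and none of the flagged tweet ids should remain.
assert demo_clean.shape[0] == rows_before - demo_bad.shape[0]
assert not demo_clean["tweet_id"].isin(demo_bad["tweet_id"]).any()
print(demo_clean)
```

The same two assertions could be run against `df_complete_b`, `bad_replies` and `df_final`, assuming `bad_replies` was built by filtering `df_complete_b` without resetting its index.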


In [100]:
df_final[df_final['rating_denominator']<10]

Unnamed: 0,tweet_id,timestamp,source,text,rating_numerator,rating_denominator,name,dog_stage,confidence,prediction,favorite_count,retweet_count
340,810984652412424192,2016-12-19 23:06:23,Twitter for iPhone,Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx,24,7,Sam,,0.871342,golden_retriever,5837,1612
1627,666287406224695296,2015-11-16 16:11:11,Twitter for iPhone,This is an Albanian 3 1/2 legged Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv,1,2,an,,0.857531,Maltese_dog,150,68


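It looks like the row with tweet id `666287406224695296` is also an incorrect entry: its rating was parsed as 1/2 (from "3 1/2 legged") rather than the 9/10 at the end of the text, so we drop it.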

In [105]:
episcopalian = df_final[df_final['tweet_id']==666287406224695296].index
df_final.drop(episcopalian, inplace=True)
df_final.reset_index(drop=True,inplace=True)
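
As an alternative to dropping such rows, the rating could be re-extracted from the text. This is a sketch only, not the method used in this notebook: `extract_rating` is a hypothetical helper, and it assumes the real rating always has a denominator of 10, which does not hold for every WeRateDogs tweet. The first sample text is abridged from the row shown above.

```python
import re

def extract_rating(text, expected_denominator=10):
    """Hypothetical helper: return (numerator, denominator) for the last
    'n/10'-style fraction in the text, or None if there is none."""
    fractions = re.findall(r"(\d+)/(\d+)", text)
    matches = [(int(n), int(d)) for n, d in fractions
               if int(d) == expected_denominator]
    return matches[-1] if matches else None

# Abridged text of tweet 666287406224695296: "1/2" appears before the rating.
episcopalian_text = ("This is an Albanian 3 1/2 legged Episcopalian. "
                     "Loves well-polished hardwood flooring. 9/10")
# Abridged text of tweet 810984652412424192: "24/7" is not a rating at all.
sam_text = "Meet Sam. She smiles 24/7 & secretly aspires to be a reindeer."

print(extract_rating(episcopalian_text))  # (9, 10)
print(extract_rating(sam_text))           # None
```

Rows where the helper returns `None` would still need a manual decision, as with the "24/7" tweet.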

In [106]:
df_final.shape[0]

1644

## Write to csv

#### Define

We write the cleaned data to a CSV file so it can be used for visualisation. Note that the visualisation is implemented in a separate notebook, `wrangle_act_visualisation.ipynb`.

#### Code

In [108]:
df_final.to_csv('twitter_archive_master.csv', encoding='utf-8')
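
Note that `to_csv` writes the dataframe index by default, which is where an `Unnamed: 0` column comes from when the file is read back. The sketch below shows the difference on a one-row toy frame, writing to in-memory buffers instead of disk. Whether to pass `index=False` here is a judgment call that depends on what the visualisation notebook expects.

```python
import io
import pandas as pd

# Toy frame for illustration; column names mirror the master table.
demo = pd.DataFrame({"tweet_id": [892177421306343426], "rating_numerator": [13]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# With index=False, only the data columns are written.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'tweet_id', 'rating_numerator']
print(cols_without_index)  # ['tweet_id', 'rating_numerator']
```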

# Visualisation in another file - `wrangle_act_visualisation.ipynb`