---
# Brief
---
**Your goal**: wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it only contains very basic tweet information. Additional gathering, then assessing and cleaning is required for "Wow!"-worthy analyses and visualizations.


1. The WeRateDogs Twitter archive. I am giving this file to you, so imagine it as a file on hand. Download this file manually by clicking the following link: twitter_archive_enhanced.csv

2. The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

3. Each tweet's retweet count and favorite ("like") count at minimum, and any additional data you find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt. Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count. Note: do not include your Twitter API keys, secrets, and tokens in your project submission.

Detect and document at least:
* **eight quality issues**
* **two tidiness issues**

Produce at least:
* **three insights**
* **one visualization**

* Create a 300-600 word written report called wrangle_report.pdf or wrangle_report.html that briefly describes your wrangling efforts. This is to be framed as an internal document.

* Create a 250-word-minimum written report called act_report.pdf or act_report.html that communicates the insights and displays the visualization(s) produced from your wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example.


## Basic wrangling strategy

1. Gather all datasets, documenting the issues pertinent to each dataset in the process.
2. Merge all datasets.
3. Remove null, duplicate and redundant data.
4. Resolve tidiness issues.
5. Resolve quality issues.
6. Find three insights.
7. Create visualization.


In [1]:
import requests as rq
import pandas as pd
import numpy as np
import io
import json
from random import randrange

---
# Gather data
---

## Dataset 1: Enhanced twitter archive
### Read the enhanced twitter archive into a dataframe

In [2]:
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')

In [3]:
twitter_archive.head(15)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,


In [4]:
twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

## Dataset 2: Image predictions

### Download the tweet image predictions

In [5]:
r = rq.get('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv')
r.status_code

200

In [6]:
r.headers['content-type']

'text/tab-separated-values; charset=utf-8'

In [7]:
r.encoding

'utf-8'
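The next cell reads `image-predictions.tsv` from disk, so the response body has to be saved first. A minimal sketch of that step, assuming the file should sit next to the notebook; `save_bytes` is a hypothetical helper, not part of Requests:

```python
from pathlib import Path


def save_bytes(content: bytes, path: str) -> int:
    """Write raw response bytes to `path` and return the number of bytes written."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target.write_bytes(content)


# In the notebook this would be called with the response fetched above, e.g.:
#   r.raise_for_status()                      # stop early on a 4xx/5xx response
#   save_bytes(r.content, 'image-predictions.tsv')
```

Writing `r.content` (bytes) rather than `r.text` keeps the file byte-for-byte identical to what the server sent.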

#### Load the received .tsv file into a dataframe

In [8]:
# load image predictions dataset from a local file
# (DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 is the equivalent)
image_predictions = pd.read_csv('image-predictions.tsv', sep='\t', index_col=0)

  


In [9]:
# load image predictions dataset from the URL response
# (r.text is the response body decoded with r.encoding)
image_predictions = pd.read_csv(io.StringIO(r.text), sep='\t', index_col=0)

  


In [10]:
image_predictions.head()

tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [11]:
image_predictions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2075 entries, 666020888022790149 to 892420643555336193
Data columns (total 11 columns):
jpg_url    2075 non-null object
img_num    2075 non-null int64
p1         2075 non-null object
p1_conf    2075 non-null float64
p1_dog     2075 non-null bool
p2         2075 non-null object
p2_conf    2075 non-null float64
p2_dog     2075 non-null bool
p3         2075 non-null object
p3_conf    2075 non-null float64
p3_dog     2075 non-null bool
dtypes: bool(3), float64(3), int64(1), object(4)
memory usage: 152.0+ KB


### Merge twitter archive and image predictions

In [12]:
merged_df = pd.merge(twitter_archive,image_predictions,how='inner',on='tweet_id')

In [13]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2075 entries, 0 to 2074
Data columns (total 28 columns):
tweet_id                      2075 non-null int64
in_reply_to_status_id         23 non-null float64
in_reply_to_user_id           23 non-null float64
timestamp                     2075 non-null object
source                        2075 non-null object
text                          2075 non-null object
retweeted_status_id           81 non-null float64
retweeted_status_user_id      81 non-null float64
retweeted_status_timestamp    81 non-null object
expanded_urls                 2075 non-null object
rating_numerator              2075 non-null int64
rating_denominator            2075 non-null int64
name                          2075 non-null object
doggo                         2075 non-null object
floofer                       2075 non-null object
pupper                        2075 non-null object
puppo                         2075 non-null object
jpg_url                       2075 

In [14]:
print ('twitter archive size: ', len(twitter_archive))
print ('image predictions size: ', len(image_predictions))
print ('merged dataframe size: ', len(merged_df))

twitter archive size:  2356
image predictions size:  2075
merged dataframe size:  2075


The merged dataframe has the same number of rows as the image predictions dataframe, which means that every record from `image_predictions` found a match in the archive and was kept.
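Comparing lengths works here, but an inner merge can silently drop rows. One way to see exactly which rows matched is the `indicator` option of `pd.merge`. The sketch below uses two toy frames with made-up IDs, not the real datasets:

```python
import pandas as pd

# toy stand-ins for the archive and the image predictions (IDs are made up)
archive = pd.DataFrame({'tweet_id': [1, 2, 3, 4], 'rating_numerator': [13, 12, 13, 14]})
predictions = pd.DataFrame({'tweet_id': [2, 3, 4], 'p1': ['collie', 'redbone', 'malinois']})

# an outer merge with indicator=True adds a `_merge` column that says
# whether each row came from the left frame, the right frame, or both
check = pd.merge(archive, predictions, how='outer', on='tweet_id', indicator=True)
print(check['_merge'].value_counts())
```

Rows marked `left_only` are archive tweets without a prediction; an absence of `right_only` rows corresponds to the conclusion drawn above.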

## Dataset 3: Twitter via API

### Set up Twitter API via tweepy

In [15]:
import tweepy

TWITTER_CONSUMER_KEY = ''
TWITTER_CONSUMER_SECRET = ''
ACCESS_TOKEN = ''
ACCESS_SECRET = ''

auth = tweepy.OAuthHandler(TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

# return plain dictionaries and pause automatically when the rate limit is hit
api = tweepy.API(auth, parser=tweepy.parsers.JSONParser(), wait_on_rate_limit=True)

### Download WeRateDogs Twitter archive. Takes around 30 mins.

In [None]:
# create empty lists to store the tweets retrieved via the API and the IDs that failed
tweets = []
missing_tweets = []

# use tweet IDs from our dataframe to retrieve the original tweets
for i in merged_df.tweet_id:
    try:
        tweets.append(api.get_status(i, tweet_mode='extended'))
    except Exception:
        # catch Exception rather than using a bare `except:`, which would also swallow KeyboardInterrupt
        missing_tweets.append(i)
        print('tweet #', i, ' could not be located')

# write downloaded tweets to a json file and store it locally
with open('tweets.json', 'w') as outfile:
    json.dump(tweets, outfile)
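The brief asks for one tweet per line in `tweet_json.txt`, whereas the cell above dumps a single JSON array. A minimal sketch of the line-per-tweet layout, assuming each tweet is already a plain `dict` (as returned when `JSONParser` is used); `write_json_lines` and `read_json_lines` are hypothetical helper names:

```python
import json


def write_json_lines(records, path):
    """Write each record as one JSON document per line."""
    with open(path, 'w', encoding='utf-8') as outfile:
        for record in records:
            outfile.write(json.dumps(record) + '\n')


def read_json_lines(path, fields=('id', 'retweet_count', 'favorite_count')):
    """Read the file line by line, keeping only the requested fields."""
    rows = []
    with open(path, encoding='utf-8') as infile:
        for line in infile:
            if line.strip():  # skip blank lines
                tweet = json.loads(line)
                rows.append({field: tweet.get(field) for field in fields})
    return rows
```

Passing the result of `read_json_lines('tweet_json.txt')` to `pd.DataFrame` would then give a dataframe with the three requested columns.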

### Read the downloaded and saved archive from a local .json file

In [16]:
tweets = pd.read_json ('tweets.json')

In [17]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2061 entries, 0 to 2060
Data columns (total 28 columns):
contributors                     0 non-null float64
coordinates                      0 non-null float64
created_at                       2061 non-null datetime64[ns]
display_text_range               2061 non-null object
entities                         2061 non-null object
extended_entities                2061 non-null object
favorite_count                   2061 non-null int64
favorited                        2061 non-null bool
full_text                        2061 non-null object
geo                              0 non-null float64
id                               2061 non-null int64
id_str                           2061 non-null int64
in_reply_to_screen_name          23 non-null object
in_reply_to_status_id            23 non-null float64
in_reply_to_status_id_str        23 non-null float64
in_reply_to_user_id              23 non-null float64
in_reply_to_user_id_str          23 n

In [18]:
tweets.head()

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,...,lang,place,possibly_sensitive,possibly_sensitive_appealable,retweet_count,retweeted,retweeted_status,source,truncated,user
0,,,2017-08-01 16:23:56,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...",36940,False,This is Phineas. He's a mystical boy. Only eve...,,...,en,,False,False,7873,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
1,,,2017-08-01 00:17:27,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892177413194625024, 'id_str'...",31781,False,This is Tilly. She's just checking pup on you....,,...,en,,False,False,5844,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
2,,,2017-07-31 00:18:03,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891815175371796480, 'id_str'...",23952,False,This is Archie. He is a rare Norwegian Pouncin...,,...,en,,False,False,3859,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
3,,,2017-07-30 15:58:51,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891689552724799489, 'id_str'...",40263,False,This is Darla. She commenced a snooze mid meal...,,...,en,,False,False,8042,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
4,,,2017-07-29 16:00:24,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891327551943041024, 'id_str'...",38458,False,This is Franklin. He would like you to stop ca...,,...,en,,False,False,8701,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."


### Merge tweets with two previous datasets

The `tweets` JSON has two representations of the tweet ID: `int` (`id`) and `str` (`id_str`). Twitter documentation recommends using the `str` representation as "some programming languages may have difficulty / silent defects in interpreting [`int` id]".


Let's see whether there is any difference between the records.

In [19]:
# how many tweets have mismatching `id` and `id_str` record
len(tweets[(tweets['id'] - tweets['id_str'].astype(int))!=0])

718

That's quite a bit: more than a third of the records (718 of 2061) have `id` and `id_str` values that disagree.

Let's try merging `tweets` with our previous data on both columns.

In [20]:
tweets['tweet_id'] = tweets['id_str'].astype(int)
we_rate_dogs = pd.merge(merged_df,tweets, how='inner')
len(we_rate_dogs) - len(tweets)

-718

In [21]:
tweets['tweet_id'] = tweets['id']
we_rate_dogs = pd.merge(merged_df,tweets, how='inner')
len(we_rate_dogs) - len(tweets)

0

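Merging on `id` matches every downloaded tweet, while merging on `id_str` loses 718 of them, so the archive IDs agree with the `int` `id` field. Note that `tweets.info()` reports `id_str` as `int64`: `pd.read_json` has already converted the strings to numbers, and that conversion, rather than the `id` field itself, is a likely source of the mismatch. To preserve as much data as possible, let's use the dataset merged on the `int64` representation and replace it with a `str` version when we get to the cleaning stage.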
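One possible explanation for the mismatch, offered as an assumption rather than a confirmed diagnosis: tweet IDs are too large to survive a round trip through a 64-bit float. The sketch below shows the rounding on the first ID in the archive and one way to load records so that `id_str` stays a string; `load_records` is a hypothetical helper:

```python
import json

tweet_id = 892420643555336193

# a float64 has a 53-bit mantissa, so an 18-digit integer is rounded
rounded = int(float(tweet_id))
print(rounded == tweet_id)  # False


def load_records(json_text):
    """Parse a JSON array of tweets with the standard library, which keeps strings as strings."""
    return json.loads(json_text)


sample = '[{"id": 892420643555336193, "id_str": "892420643555336193"}]'
records = load_records(sample)
mismatches = [r for r in records if r['id'] != int(r['id_str'])]
print(len(mismatches))  # 0
```

A dataframe built with `pd.DataFrame(records)` from such a list keeps `id_str` as an `object` column of strings.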

---
# Assess data
---

Let's take a quick overview of the complete merged dataset that we've created.

In [22]:
we_rate_dogs.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,...,is_quote_status,lang,place,possibly_sensitive,possibly_sensitive_appealable,retweet_count,retweeted,retweeted_status,truncated,user
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,...,False,en,,False,False,7873,False,,False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,...,False,en,,False,False,5844,False,,False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,...,False,en,,False,False,3859,False,,False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,...,False,en,,False,False,8042,False,,False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,...,False,en,,False,False,8701,False,,False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."


In [23]:
we_rate_dogs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2061 entries, 0 to 2060
Data columns (total 53 columns):
tweet_id                         2061 non-null int64
in_reply_to_status_id            23 non-null float64
in_reply_to_user_id              23 non-null float64
timestamp                        2061 non-null object
source                           2061 non-null object
text                             2061 non-null object
retweeted_status_id              74 non-null float64
retweeted_status_user_id         74 non-null float64
retweeted_status_timestamp       74 non-null object
expanded_urls                    2061 non-null object
rating_numerator                 2061 non-null int64
rating_denominator               2061 non-null int64
name                             2061 non-null object
doggo                            2061 non-null object
floofer                          2061 non-null object
pupper                           2061 non-null object
puppo                            2061 

How many columns of data do we have after merging all datasets?

In [24]:
print (len(we_rate_dogs.columns))

53


## Duplicate, zero and redundant data
First, let's assess whether there are any columns that contain no data whatsoever or that are irrelevant for the purposes of our analysis. Let's make empty lists, populate them with column names over the course of our assessment, and later drop those columns from the dataframe.

### Duplicate columns
Some of the data is duplicated across columns.

In [25]:
# let's make an empty list to hold column names with duplicate data
duplicate_columns = []

In [26]:
# another example of int and str representations of the same data. Are they identical?
we_rate_dogs.in_reply_to_status_id.equals(we_rate_dogs.in_reply_to_status_id_str)

True

In [27]:
# another example of int and str representations of the same data. Are they identical?
we_rate_dogs.in_reply_to_user_id.equals(we_rate_dogs.in_reply_to_user_id_str)

True

In [28]:
duplicate_columns.extend(['in_reply_to_status_id_str', 'in_reply_to_user_id_str'])

The `timestamp` and `created_at` columns probably contain identical records. To test this, let's subtract one column from the other and see how many unique differences there are.

In [29]:
# we have to remove timezone data to perform arithmetic on timestamps
column_created = we_rate_dogs.created_at.dt.tz_localize(None)
column_timestamp = we_rate_dogs.timestamp.apply(lambda x: pd.Timestamp(x)).dt.tz_localize(None)

(column_created - column_timestamp).unique()

array([0], dtype='timedelta64[ns]')
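The same comparison can be illustrated on a single record with the standard library. The sketch assumes the archive format shown earlier (`2017-08-01 16:23:56 +0000`); `parse_archive_timestamp` is a hypothetical helper:

```python
from datetime import datetime, timedelta


def parse_archive_timestamp(text):
    """Parse an archive timestamp such as '2017-08-01 16:23:56 +0000' and drop the timezone.

    Dropping tzinfo without converting is only safe because the offset is +0000.
    """
    aware = datetime.strptime(text, '%Y-%m-%d %H:%M:%S %z')
    return aware.replace(tzinfo=None)


created_at = datetime(2017, 8, 1, 16, 23, 56)  # `created_at` of the first tweet shown above
timestamp = parse_archive_timestamp('2017-08-01 16:23:56 +0000')
print(created_at - timestamp == timedelta(0))  # True
```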

There's only one unique value and that value is `0`, so we can conclude that both columns contain identical data. Let's add one of them to the `duplicate_columns` list for future removal. As `created_at` is the original field from the tweet JSON and already has the correct data type, let's keep it and remove the `timestamp` column.

In [30]:
duplicate_columns.append('timestamp')

In [31]:
print ('duplicate_columns: ', duplicate_columns)

duplicate_columns:  ['in_reply_to_status_id_str', 'in_reply_to_user_id_str', 'timestamp']


### Zero columns

In [32]:
# let's make an empty list to hold column names with no data in them
zero_columns = []

Summing each column highlights the empty ones: a sum of `0` means the column holds only nulls, zeros or `False` values.

In [33]:
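# sum only numeric and boolean columns; recent pandas versions raise on object columns holding dicts
zeros = we_rate_dogs.sum(axis=0, numeric_only=True) == 0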

In [34]:
zeros.index[zeros]

Index(['contributors', 'coordinates', 'favorited', 'geo', 'is_quote_status',
       'possibly_sensitive', 'possibly_sensitive_appealable', 'retweeted',
       'truncated'],
      dtype='object')

In [35]:
# make list of all zero-value columns
zero_columns.extend(we_rate_dogs[zeros.index[zeros]].columns)

### Redundant columns
Some of the data in this set is not relevant to our future analysis. Let's put it aside.

In [36]:
# let's make an empty list to hold column names with data not pertinent to our analysis
redundant_columns = []

It seems that there is only one record in the 'place' column. What is it?

In [37]:
we_rate_dogs[we_rate_dogs.place.notnull()].place

686    {'id': '7356b662670b2c31', 'url': 'https://api...
Name: place, dtype: object

Let's store this place in a separate variable and remove the column.

In [38]:
# keep only the `place` value of that record rather than the whole row
wrd_place = we_rate_dogs.loc[686, 'place']

Retweets and status replies comprise only a small fraction of tweets for this account. Let's store them in separate dataframes and remove them from our main analysis.

In [39]:
for i in we_rate_dogs.columns:
    c = we_rate_dogs[i].notna()
    c = len(c.index[c])
    if 0<c<len(we_rate_dogs):
        redundant_columns.append(i)
        print (c,' : ', i)

23  :  in_reply_to_status_id
23  :  in_reply_to_user_id
74  :  retweeted_status_id
74  :  retweeted_status_user_id
74  :  retweeted_status_timestamp
23  :  in_reply_to_screen_name
23  :  in_reply_to_status_id_str
23  :  in_reply_to_user_id_str
1  :  place
74  :  retweeted_status


In [40]:
replies_01 = we_rate_dogs[we_rate_dogs.in_reply_to_status_id.notna()]
replies_02 = we_rate_dogs[we_rate_dogs.in_reply_to_user_id.notna()]
retweets_01 = we_rate_dogs[we_rate_dogs.retweeted_status.notna()]
retweets_02 = we_rate_dogs[we_rate_dogs.retweeted_status_id.notna()]
retweets_03 = we_rate_dogs[we_rate_dogs.retweeted_status_user_id.notna()]
retweets_04 = we_rate_dogs[we_rate_dogs.retweeted_status_timestamp.notna()]

In [41]:
redundant_columns

['in_reply_to_status_id',
 'in_reply_to_user_id',
 'retweeted_status_id',
 'retweeted_status_user_id',
 'retweeted_status_timestamp',
 'in_reply_to_screen_name',
 'in_reply_to_status_id_str',
 'in_reply_to_user_id_str',
 'place',
 'retweeted_status']

## Incorrect data types

Some of the columns have the generic `object` type, whereas they would be better represented by more specific types, namely:

* `doggo`: `object` >> `bool`
* `floofer`: `object` >> `bool`
* `pupper`: `object` >> `bool`
* `puppo`: `object` >> `bool`
* `lang`: `object` >> `category`

## Badly formatted data
### `expanded_urls` column
The `expanded_urls` column contains long strings of multiple, often duplicate, URLs. They need to be parsed, de-duplicated, and stored in lists. Let's see what such an entry looks like.

In [42]:
we_rate_dogs.expanded_urls[145]

'https://www.gofundme.com/meeko-needs-heart-surgery,https://twitter.com/dog_rates/status/857393404942143489/photo/1,https://twitter.com/dog_rates/status/857393404942143489/photo/1,https://twitter.com/dog_rates/status/857393404942143489/photo/1,https://twitter.com/dog_rates/status/857393404942143489/photo/1'
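A minimal sketch of the parsing this entry calls for, assuming the URLs themselves contain no commas; `unique_urls` is a hypothetical helper:

```python
def unique_urls(expanded_urls):
    """Split a comma-separated URL string and drop repeats, keeping first-seen order."""
    # dict keys preserve insertion order, so this de-duplicates without reordering
    return list(dict.fromkeys(url.strip() for url in expanded_urls.split(',')))


entry = (
    'https://www.gofundme.com/meeko-needs-heart-surgery,'
    'https://twitter.com/dog_rates/status/857393404942143489/photo/1,'
    'https://twitter.com/dog_rates/status/857393404942143489/photo/1'
)
print(unique_urls(entry))
```

In the notebook, the function could be applied with `wrd.expanded_urls.apply(unique_urls)`.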

### `source` column
The `source` column contains HTML anchor tags wrapping the name of the posting application; once parsed, it is best represented by a `category` type variable.

In [43]:
we_rate_dogs.source[145]

'<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'

### Hashtags

Hashtags are hidden deep inside hierarchical dictionaries. As they are valuable analysis material, they should be unpacked into a separate column.

In [44]:
we_rate_dogs.entities.iloc[10]['hashtags']

[{'text': 'BarkWeek', 'indices': [121, 130]}]
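A minimal sketch of such unpacking, assuming every `entities` value is a dictionary shaped like the one above; `hashtag_texts` is a hypothetical helper:

```python
def hashtag_texts(entities):
    """Return the list of hashtag strings from a tweet's `entities` dictionary."""
    return [tag['text'] for tag in entities.get('hashtags', [])]


sample = {'hashtags': [{'text': 'BarkWeek', 'indices': [121, 130]}], 'symbols': []}
print(hashtag_texts(sample))  # ['BarkWeek']
```

Applied as `wrd.entities.apply(hashtag_texts)`, it would yield one list of hashtags per tweet, empty when there are none.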

---
# Clean data
---
After assessing the data - it's time to clean it.

Let's keep the original in place and work on the copy.

In [45]:
wrd = we_rate_dogs.copy()

## Quality issues

### Redundant data

#### Define
Get rid of retweet and reply data, as it's not relevant to our analysis.

In [46]:
redundant_columns

['in_reply_to_status_id',
 'in_reply_to_user_id',
 'retweeted_status_id',
 'retweeted_status_user_id',
 'retweeted_status_timestamp',
 'in_reply_to_screen_name',
 'in_reply_to_status_id_str',
 'in_reply_to_user_id_str',
 'place',
 'retweeted_status']

#### Code

In [47]:
wrd.drop(columns=redundant_columns, axis=1, inplace=True)

#### Test

In [48]:
wrd.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2061 entries, 0 to 2060
Data columns (total 43 columns):
tweet_id                         2061 non-null int64
timestamp                        2061 non-null object
source                           2061 non-null object
text                             2061 non-null object
expanded_urls                    2061 non-null object
rating_numerator                 2061 non-null int64
rating_denominator               2061 non-null int64
name                             2061 non-null object
doggo                            2061 non-null object
floofer                          2061 non-null object
pupper                           2061 non-null object
puppo                            2061 non-null object
jpg_url                          2061 non-null object
img_num                          2061 non-null int64
p1                               2061 non-null object
p1_conf                          2061 non-null float64
p1_dog                          

### Zero data

#### Define
Columns with zero data are irrelevant for our analysis. Let's just drop them.

In [49]:
zero_columns

['contributors',
 'coordinates',
 'favorited',
 'geo',
 'is_quote_status',
 'possibly_sensitive',
 'possibly_sensitive_appealable',
 'retweeted',
 'truncated']

#### Code

In [50]:
wrd.drop(columns=zero_columns, axis=1, inplace=True)

#### Test

In [51]:
wrd.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2061 entries, 0 to 2060
Data columns (total 34 columns):
tweet_id              2061 non-null int64
timestamp             2061 non-null object
source                2061 non-null object
text                  2061 non-null object
expanded_urls         2061 non-null object
rating_numerator      2061 non-null int64
rating_denominator    2061 non-null int64
name                  2061 non-null object
doggo                 2061 non-null object
floofer               2061 non-null object
pupper                2061 non-null object
puppo                 2061 non-null object
jpg_url               2061 non-null object
img_num               2061 non-null int64
p1                    2061 non-null object
p1_conf               2061 non-null float64
p1_dog                2061 non-null bool
p2                    2061 non-null object
p2_conf               2061 non-null float64
p2_dog                2061 non-null bool
p3                    2061 non-null obj

### Duplicate data

#### Define

In [52]:
duplicate_columns

['in_reply_to_status_id_str', 'in_reply_to_user_id_str', 'timestamp']

#### Code

In [53]:
wrd.drop(columns=duplicate_columns, axis=1, inplace=True)

KeyError: "['in_reply_to_status_id_str' 'in_reply_to_user_id_str'] not found in axis"

As expected, we get an error: the two reply columns were already removed along with the redundant data. Let's remove only `timestamp` now.
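One way to avoid this kind of error is to filter the list down to the columns that still exist before dropping. A minimal sketch with a hypothetical helper, `existing_only`, and a shortened stand-in for `wrd.columns`:

```python
def existing_only(candidates, available):
    """Keep only the candidate column names that are still present, preserving order."""
    available = set(available)
    return [name for name in candidates if name in available]


duplicate_columns = ['in_reply_to_status_id_str', 'in_reply_to_user_id_str', 'timestamp']
remaining = ['tweet_id', 'timestamp', 'source', 'text']  # stand-in for wrd.columns
print(existing_only(duplicate_columns, remaining))  # ['timestamp']
```

With the real dataframe this would read `wrd.drop(columns=existing_only(duplicate_columns, wrd.columns), inplace=True)`. Passing `errors='ignore'` to `DataFrame.drop` has a similar effect.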

In [54]:
wrd.drop(columns='timestamp', axis=1, inplace=True)

#### Test

In [55]:
wrd.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2061 entries, 0 to 2060
Data columns (total 33 columns):
tweet_id              2061 non-null int64
source                2061 non-null object
text                  2061 non-null object
expanded_urls         2061 non-null object
rating_numerator      2061 non-null int64
rating_denominator    2061 non-null int64
name                  2061 non-null object
doggo                 2061 non-null object
floofer               2061 non-null object
pupper                2061 non-null object
puppo                 2061 non-null object
jpg_url               2061 non-null object
img_num               2061 non-null int64
p1                    2061 non-null object
p1_conf               2061 non-null float64
p1_dog                2061 non-null bool
p2                    2061 non-null object
p2_conf               2061 non-null float64
p2_dog                2061 non-null bool
p3                    2061 non-null object
p3_conf               2061 non-null flo

### Dog columns to bools

#### Define
`doggo`, `floofer`, `pupper`, `puppo` columns are better represented as `bool`s.

#### Code

In [56]:
# a stage column holds the string 'None' when the stage is absent;
# any other value means the stage was mentioned in the tweet
for stage in ['doggo', 'floofer', 'pupper', 'puppo']:
    wrd[stage] = wrd[stage] != 'None'

#### Test

In [57]:
wrd.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2061 entries, 0 to 2060
Data columns (total 33 columns):
tweet_id              2061 non-null int64
source                2061 non-null object
text                  2061 non-null object
expanded_urls         2061 non-null object
rating_numerator      2061 non-null int64
rating_denominator    2061 non-null int64
name                  2061 non-null object
doggo                 2061 non-null bool
floofer               2061 non-null bool
pupper                2061 non-null bool
puppo                 2061 non-null bool
jpg_url               2061 non-null object
img_num               2061 non-null int64
p1                    2061 non-null object
p1_conf               2061 non-null float64
p1_dog                2061 non-null bool
p2                    2061 non-null object
p2_conf               2061 non-null float64
p2_dog                2061 non-null bool
p3                    2061 non-null object
p3_conf               2061 non-null float64
p3_

### Lang column to category

#### Define
Convert the `lang` column to the `category` type.

#### Code

In [58]:
wrd.lang = wrd.lang.astype('category')

#### Test

In [67]:
wrd.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2061 entries, 0 to 2060
Data columns (total 33 columns):
tweet_id              2061 non-null int64
source                2061 non-null object
text                  2061 non-null object
expanded_urls         2061 non-null object
rating_numerator      2061 non-null int64
rating_denominator    2061 non-null int64
name                  2061 non-null object
doggo                 2061 non-null bool
floofer               2061 non-null bool
pupper                2061 non-null bool
puppo                 2061 non-null bool
jpg_url               2061 non-null object
img_num               2061 non-null int64
p1                    2061 non-null object
p1_conf               2061 non-null float64
p1_dog                2061 non-null bool
p2                    2061 non-null object
p2_conf               2061 non-null float64
p2_dog                2061 non-null bool
p3                    2061 non-null object
p3_conf               2061 non-null float64
p3_

### Display_text_range into int

#### Define
The `display_text_range` column can be reduced to a single `int`, the end index of the displayed text, rather than a list.

#### Code

In [68]:
wrd['display_text_end'] = wrd.display_text_range.apply(lambda x: x[1]).astype(int)
wrd.drop(columns='display_text_range', inplace=True)

#### Test

In [69]:
wrd.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2061 entries, 0 to 2060
Data columns (total 33 columns):
tweet_id              2061 non-null int64
source                2061 non-null object
text                  2061 non-null object
expanded_urls         2061 non-null object
rating_numerator      2061 non-null int64
rating_denominator    2061 non-null int64
name                  2061 non-null object
doggo                 2061 non-null bool
floofer               2061 non-null bool
pupper                2061 non-null bool
puppo                 2061 non-null bool
jpg_url               2061 non-null object
img_num               2061 non-null int64
p1                    2061 non-null object
p1_conf               2061 non-null float64
p1_dog                2061 non-null bool
p2                    2061 non-null object
p2_conf               2061 non-null float64
p2_dog                2061 non-null bool
p3                    2061 non-null object
p3_conf               2061 non-null float64
p3_

### Parse `source` column and make it categorical variable

#### Define
There is probably a limited set of applications used to post tweets. Let's extract the application name from the anchor tag and store it as a category.

#### Code

In [71]:
wrd.source = wrd.source.apply(lambda x: x.split('>')[1].split('<')[0]).astype('category')

#### Test

In [75]:
wrd.source.unique()

[Twitter for iPhone, Twitter Web Client, TweetDeck]
Categories (3, object): [Twitter for iPhone, Twitter Web Client, TweetDeck]
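The chained `split` calls above work for the three values found here, but they would raise an `IndexError` on a value without a `>`. A slightly more defensive sketch using a regular expression; `source_name` is a hypothetical helper, and falling back to the raw value is an assumption about the desired behaviour:

```python
import re

# text between an opening <a ...> tag and its closing </a>
ANCHOR_TEXT = re.compile(r'<a\b[^>]*>(.*?)</a>')


def source_name(source_html):
    """Return the text inside the anchor tag, or the input unchanged if no tag is found."""
    match = ANCHOR_TEXT.search(source_html)
    return match.group(1) if match else source_html


raw = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
print(source_name(raw))  # Twitter for iPhone
```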

In [76]:
wrd.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2061 entries, 0 to 2060
Data columns (total 33 columns):
tweet_id              2061 non-null int64
source                2061 non-null category
text                  2061 non-null object
expanded_urls         2061 non-null object
rating_numerator      2061 non-null int64
rating_denominator    2061 non-null int64
name                  2061 non-null object
doggo                 2061 non-null bool
floofer               2061 non-null bool
pupper                2061 non-null bool
puppo                 2061 non-null bool
jpg_url               2061 non-null object
img_num               2061 non-null int64
p1                    2061 non-null object
p1_conf               2061 non-null float64
p1_dog                2061 non-null bool
p2                    2061 non-null object
p2_conf               2061 non-null float64
p2_dog                2061 non-null bool
p3                    2061 non-null object
p3_conf               2061 non-null float64
p

### Replace the `id` field with `str` representation of it

#### Define

#### Code

#### Test

### Cast id type columns into int64 values from floats

#### Define

#### Code

#### Test

## Tidiness issues

### Unpack hashtags from dictionaries

#### Define

#### Code

#### Test

### Unpack URLs from dictionaries

#### Define

#### Code

#### Test