
**Your goal**: wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it only contains very basic tweet information. Additional gathering, then assessing and cleaning is required for "Wow!"-worthy analyses and visualizations.


1. The WeRateDogs Twitter archive. I am giving this file to you, so imagine it as a file on hand. Download this file manually by clicking the following link: twitter_archive_enhanced.csv

2. The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

3. Each tweet's retweet count and favorite ("like") count at minimum, and any additional data you find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt. Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count. Note: do not include your Twitter API keys, secrets, and tokens in your project submission.

Detect and document at least
* **eight quality issues**
* **two tidiness issues**

Produce at least
* **three insights**
* **one visualization**

* Create a 300-600 word written report called wrangle_report.pdf or wrangle_report.html that briefly describes your wrangling efforts. This is to be framed as an internal document.

* Create a 250-word-minimum written report called act_report.pdf or act_report.html that communicates the insights and displays the visualization(s) produced from your wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example.


In [1]:
import requests as rq
import pandas as pd
import numpy as np
import io
import json

### Read the enhanced twitter archive into a dataframe

In [2]:
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')

In [3]:
twitter_archive.head(15)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,


In [4]:
twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

### Download the tweet image predictions

In [5]:
r = rq.get('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv')

In [6]:
r.status_code

200

In [7]:
r.headers['content-type']

'text/tab-separated-values; charset=utf-8'

In [8]:
r.encoding

'utf-8'

#### Load the received .tsv file into a dataframe

In [9]:
# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 keeps tweet_id as the index
image_predictions = pd.read_csv(io.StringIO(r.text), sep='\t', index_col=0)


In [10]:
image_predictions.head()

tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [11]:
image_predictions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2075 entries, 666020888022790149 to 892420643555336193
Data columns (total 11 columns):
jpg_url    2075 non-null object
img_num    2075 non-null int64
p1         2075 non-null object
p1_conf    2075 non-null float64
p1_dog     2075 non-null bool
p2         2075 non-null object
p2_conf    2075 non-null float64
p2_dog     2075 non-null bool
p3         2075 non-null object
p3_conf    2075 non-null float64
p3_dog     2075 non-null bool
dtypes: bool(3), float64(3), int64(1), object(4)
memory usage: 152.0+ KB
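
Parsing can be tried offline on a small in-memory TSV before touching the real download. The sketch below uses two rows copied from the output above (columns trimmed for brevity) and, as an assumption about what is convenient for later merges, keeps `tweet_id` as an ordinary column rather than the index. `sample_tsv` and `preview` are illustrative names, not part of the notebook.

```python
import io

import pandas as pd

# Two rows from the predictions file, tab-separated, with only the first six columns
sample_tsv = (
    "tweet_id\tjpg_url\timg_num\tp1\tp1_conf\tp1_dog\n"
    "666020888022790149\thttps://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg\t1\t"
    "Welsh_springer_spaniel\t0.465074\tTrue\n"
    "666029285002620928\thttps://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg\t1\t"
    "redbone\t0.506826\tTrue\n"
)

# sep="\t" tells read_csv that fields are tab-separated
preview = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
print(preview.dtypes)
```

With `tweet_id` left as a column, a later `pd.merge(..., on='tweet_id')` does not depend on how pandas treats index names.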


### Merge twitter archive and image predictions

In [12]:
merged_df = pd.merge(twitter_archive,image_predictions,how='inner',on='tweet_id')

In [13]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2075 entries, 0 to 2074
Data columns (total 28 columns):
tweet_id                      2075 non-null int64
in_reply_to_status_id         23 non-null float64
in_reply_to_user_id           23 non-null float64
timestamp                     2075 non-null object
source                        2075 non-null object
text                          2075 non-null object
retweeted_status_id           81 non-null float64
retweeted_status_user_id      81 non-null float64
retweeted_status_timestamp    81 non-null object
expanded_urls                 2075 non-null object
rating_numerator              2075 non-null int64
rating_denominator            2075 non-null int64
name                          2075 non-null object
doggo                         2075 non-null object
floofer                       2075 non-null object
pupper                        2075 non-null object
puppo                         2075 non-null object
jpg_url                       2075 

In [14]:
merged_df.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,...,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,...,1,orange,0.097049,False,bagel,0.085851,False,banana,0.07611,False
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,...,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,...,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,...,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,...,2,basset,0.555712,True,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True


### Set up Twitter API via tweepy

In [None]:
import tweepy

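# consumer_key, consumer_secret, access_token and access_secret must be defined beforehand;
# they are deliberately left out of this notebook
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

# return raw JSON dictionaries and pause automatically when the rate limit is hit
api = tweepy.API(auth, parser=tweepy.parsers.JSONParser(), wait_on_rate_limit=True)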
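
Since the keys must not appear in the submission, one option is to read them from environment variables. This is a minimal sketch under that assumption: the variable names and the helper `load_twitter_credentials` are hypothetical, and the demonstration uses a stand-in dictionary rather than real secrets.

```python
import os

# Hypothetical variable names; use whatever names your environment defines.
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def load_twitter_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}


# Demonstration with a stand-in mapping instead of the real environment
fake_env = {key: "placeholder" for key in REQUIRED_KEYS}
credentials = load_twitter_credentials(fake_env)
print(sorted(credentials))
```

The returned values could then be passed to `tweepy.OAuthHandler` and `set_access_token` in place of hard-coded strings.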

In [None]:
test_tweet = api.get_status('892420643555336193', tweet_mode='extended')

### Download each tweet's JSON via the Twitter API. Takes around 30 minutes.

In [None]:
# create empty lists to store the tweet dictionaries retrieved via the API
tweets = []
missing_tweets = []

# use the tweet ids from our dataframe to retrieve the original tweets
for i in merged_df.tweet_id:
    try:
        tweets.append(api.get_status(i, tweet_mode='extended'))
    except Exception:
        # the API call failed for this id (e.g. the tweet was deleted); record it and move on
        missing_tweets.append(i)
        print('tweet #', i, ' could not be located')

# write downloaded tweets to a json file and store it locally
with open('tweets.json', 'w') as outfile:
    json.dump(tweets, outfile)

In [None]:
missing_tweets
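
The project brief asks for each tweet's JSON on its own line in tweet_json.txt, which is then read back line by line. A minimal sketch of that line-delimited variant follows; it uses stand-in dictionaries instead of real API responses and writes to a temporary directory, so `sample_tweets`, `path` and `tweet_counts` are illustrative names only.

```python
import json
import os
import tempfile

import pandas as pd

# Stand-ins for API responses; real tweets carry many more fields.
sample_tweets = [
    {"id": 1, "retweet_count": 10, "favorite_count": 100},
    {"id": 2, "retweet_count": 20, "favorite_count": 200},
]

path = os.path.join(tempfile.mkdtemp(), "tweet_json.txt")

# Write one JSON document per line
with open(path, "w", encoding="utf-8") as outfile:
    for tweet in sample_tweets:
        outfile.write(json.dumps(tweet) + "\n")

# Read the file back line by line, keeping only the fields of interest
rows = []
with open(path, encoding="utf-8") as infile:
    for line in infile:
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })

tweet_counts = pd.DataFrame(rows, columns=["tweet_id", "retweet_count", "favorite_count"])
print(tweet_counts)
```

Writing line by line also means a partially completed download leaves a usable file behind.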

### Read the downloaded and saved archive from a local .json file

In [18]:
tweets = pd.read_json('tweets.json')

In [48]:
tweets.head()

Unnamed: 0,created_at,entities,extended_entities,favorite_count,full_text,tweet_id,id_str,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,lang,place,retweet_count,retweeted_status,source,display_text_end,hashtags,url
0,2017-08-01 16:23:56,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...",37246,This is Phineas. He's a mystical boy. Only eve...,892420643555336193,892420643555336192,,,,,,en,,7979,,Twitter for iPhone,85,,https://t.co/MgUWQ76dJU
1,2017-08-01 00:17:27,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892177413194625024, 'id_str'...",32027,This is Tilly. She's just checking pup on you....,892177421306343426,892177421306343424,,,,,,en,,5925,,Twitter for iPhone,138,,https://t.co/0Xxu71qeIV
2,2017-07-31 00:18:03,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891815175371796480, 'id_str'...",24129,This is Archie. He is a rare Norwegian Pouncin...,891815181378084864,891815181378084864,,,,,,en,,3913,,Twitter for iPhone,121,,https://t.co/wUnZnhtVJB
3,2017-07-30 15:58:51,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891689552724799489, 'id_str'...",40568,This is Darla. She commenced a snooze mid meal...,891689557279858688,891689557279858688,,,,,,en,,8132,,Twitter for iPhone,79,,https://t.co/tD36da7qLQ
4,2017-07-29 16:00:24,"{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891327551943041024, 'id_str'...",38800,This is Franklin. He would like you to stop ca...,891327558926688256,891327558926688256,,,,,,en,,8812,,Twitter for iPhone,138,BarkWeek,https://t.co/AtUZn91f7f


#### quality:
* remove all empty columns
* `in_reply_to_status_id` and `in_reply_to_status_id_str` should be strings
* `in_reply_to_user_id` and `in_reply_to_user_id_str` should be strings


#### tidiness:
* `display_text_range` is a list that always starts with `0`
* `source` column should be turned into a categorical variable
* unpack hashtags into a separate column and make it a categorical variable
* unpack tweet URLs into a separate column
* split the `created_at` column into separate date and time


In [20]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2061 entries, 0 to 2060
Data columns (total 28 columns):
contributors                     0 non-null float64
coordinates                      0 non-null float64
created_at                       2061 non-null datetime64[ns]
display_text_range               2061 non-null object
entities                         2061 non-null object
extended_entities                2061 non-null object
favorite_count                   2061 non-null int64
favorited                        2061 non-null bool
full_text                        2061 non-null object
geo                              0 non-null float64
id                               2061 non-null int64
id_str                           2061 non-null int64
in_reply_to_screen_name          23 non-null object
in_reply_to_status_id            23 non-null float64
in_reply_to_status_id_str        23 non-null float64
in_reply_to_user_id              23 non-null float64
in_reply_to_user_id_str          23 n

Let's see if there are any empty columns

In [21]:
tweets.sum(axis=0)

contributors                                                                     0
coordinates                                                                      0
display_text_range               [0, 85, 0, 138, 0, 121, 0, 79, 0, 138, 0, 138,...
favorite_count                                                            16894703
favorited                                                                        0
full_text                        This is Phineas. He's a mystical boy. Only eve...
geo                                                                              0
id                                                             8098880034596373798
id_str                                                         8098880034596372480
in_reply_to_status_id                                                  1.60497e+19
in_reply_to_status_id_str                                              1.60497e+19
in_reply_to_user_id                                                    9.65306e+10
in_r

From the above it is clear that some columns contain no information. Let's remove them.
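
Summing every column works as a quick look, but a more direct check is to ask which columns are entirely null or hold a single constant value. The sketch below does this on a toy frame; `toy`, `all_null` and `constant` are illustrative names. One caveat, stated as an assumption about the real data: columns holding dictionaries or lists are unhashable, so `nunique` would need to skip them or work on a string version.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the tweets dataframe
toy = pd.DataFrame({
    "contributors": [np.nan, np.nan, np.nan],   # entirely empty
    "favorited": [False, False, False],         # one constant value
    "favorite_count": [10, 20, 30],             # informative
})

# Columns where every value is missing
all_null = [col for col in toy.columns if toy[col].isna().all()]

# Columns with at most one distinct non-null value
constant = [col for col in toy.columns if toy[col].nunique(dropna=True) <= 1]

print(all_null)
print(constant)
```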

In [22]:
columns_to_drop = ['contributors','coordinates','favorited','geo','is_quote_status','possibly_sensitive','possibly_sensitive_appealable','retweeted','truncated']
tweets.drop(columns=columns_to_drop, inplace=True)
tweets.head()

Unnamed: 0,created_at,display_text_range,entities,extended_entities,favorite_count,full_text,id,id_str,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,lang,place,retweet_count,retweeted_status,source,user
0,2017-08-01 16:23:56,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...",37246,This is Phineas. He's a mystical boy. Only eve...,892420643555336193,892420643555336192,,,,,,en,,7979,,"<a href=""http://twitter.com/download/iphone"" r...","{'id': 4196983835, 'id_str': '4196983835', 'na..."
1,2017-08-01 00:17:27,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892177413194625024, 'id_str'...",32027,This is Tilly. She's just checking pup on you....,892177421306343426,892177421306343424,,,,,,en,,5925,,"<a href=""http://twitter.com/download/iphone"" r...","{'id': 4196983835, 'id_str': '4196983835', 'na..."
2,2017-07-31 00:18:03,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891815175371796480, 'id_str'...",24129,This is Archie. He is a rare Norwegian Pouncin...,891815181378084864,891815181378084864,,,,,,en,,3913,,"<a href=""http://twitter.com/download/iphone"" r...","{'id': 4196983835, 'id_str': '4196983835', 'na..."
3,2017-07-30 15:58:51,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891689552724799489, 'id_str'...",40568,This is Darla. She commenced a snooze mid meal...,891689557279858688,891689557279858688,,,,,,en,,8132,,"<a href=""http://twitter.com/download/iphone"" r...","{'id': 4196983835, 'id_str': '4196983835', 'na..."
4,2017-07-29 16:00:24,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891327551943041024, 'id_str'...",38800,This is Franklin. He would like you to stop ca...,891327558926688256,891327558926688256,,,,,,en,,8812,,"<a href=""http://twitter.com/download/iphone"" r...","{'id': 4196983835, 'id_str': '4196983835', 'na..."


In [23]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2061 entries, 0 to 2060
Data columns (total 19 columns):
created_at                   2061 non-null datetime64[ns]
display_text_range           2061 non-null object
entities                     2061 non-null object
extended_entities            2061 non-null object
favorite_count               2061 non-null int64
full_text                    2061 non-null object
id                           2061 non-null int64
id_str                       2061 non-null int64
in_reply_to_screen_name      23 non-null object
in_reply_to_status_id        23 non-null float64
in_reply_to_status_id_str    23 non-null float64
in_reply_to_user_id          23 non-null float64
in_reply_to_user_id_str      23 non-null float64
lang                         2061 non-null object
place                        1 non-null object
retweet_count                2061 non-null int64
retweeted_status             74 non-null object
source                       2061 non-null object


### Define
`display_text_range` column can be effectively reduced to a single `int` rather than a list.

In [24]:
# is the assumption that all 'display_text_range' values start with a '0' correct ?
for i in tweets.display_text_range:
    if i[0] > 0:
        print (i)

Nothing was printed, so every value starts with 0. Only the second element carries information, and the leading zero can be dropped.
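
The same check can be written so that it yields a single boolean instead of relying on the absence of printed output. A small sketch on made-up ranges, with `ranges`, `starts` and `ends` as illustrative names:

```python
import pandas as pd

# Stand-in for tweets.display_text_range
ranges = pd.Series([[0, 85], [0, 138], [0, 121]])

# Pull the first and second element of every list
starts = ranges.apply(lambda r: r[0])
ends = ranges.apply(lambda r: r[1])

# True only if every range starts at zero
all_start_at_zero = bool((starts == 0).all())
print(all_start_at_zero)
```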

#### Code

Extract the relevant value, put it into a new column and drop the old column.

In [25]:
tweets['display_text_end'] = tweets.display_text_range.apply(lambda x: x[1]).astype(int)
tweets.drop(columns='display_text_range', inplace=True)

#### Test
Check that our new column is of `int64` type.

In [26]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2061 entries, 0 to 2060
Data columns (total 19 columns):
created_at                   2061 non-null datetime64[ns]
entities                     2061 non-null object
extended_entities            2061 non-null object
favorite_count               2061 non-null int64
full_text                    2061 non-null object
id                           2061 non-null int64
id_str                       2061 non-null int64
in_reply_to_screen_name      23 non-null object
in_reply_to_status_id        23 non-null float64
in_reply_to_status_id_str    23 non-null float64
in_reply_to_user_id          23 non-null float64
in_reply_to_user_id_str      23 non-null float64
lang                         2061 non-null object
place                        1 non-null object
retweet_count                2061 non-null int64
retweeted_status             74 non-null object
source                       2061 non-null object
user                         2061 non-null object


### Define

This is a single Twitter account, so the `user` field should hold the same value in every row.

In [27]:
# how many user ids are there?
tweets.user.apply(lambda x: x['id']).nunique()

1

#### Code
Drop the `user` column from the dataframe.

In [28]:
# For the purpose of our analysis - we can drop the 'user' column
tweets.drop(columns = 'user', inplace=True)

#### Test
Check that the `user` column is gone.

In [29]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2061 entries, 0 to 2060
Data columns (total 18 columns):
created_at                   2061 non-null datetime64[ns]
entities                     2061 non-null object
extended_entities            2061 non-null object
favorite_count               2061 non-null int64
full_text                    2061 non-null object
id                           2061 non-null int64
id_str                       2061 non-null int64
in_reply_to_screen_name      23 non-null object
in_reply_to_status_id        23 non-null float64
in_reply_to_status_id_str    23 non-null float64
in_reply_to_user_id          23 non-null float64
in_reply_to_user_id_str      23 non-null float64
lang                         2061 non-null object
place                        1 non-null object
retweet_count                2061 non-null int64
retweeted_status             74 non-null object
source                       2061 non-null object
display_text_end             2061 non-null int64
d

### Define

There should be a limited set of applications used to post tweets.

In [30]:
# which distinct sources were used to post tweets?
tweets.source.unique()

array(['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
       '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
       '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>'],
      dtype=object)

#### Code
This kind of data is best represented as a categorical variable.

In [31]:
tweets.source = tweets.source.apply(lambda x: x.split('>')[1].split('<')[0]).astype('category')

#### Test
Check unique values in the column.

In [32]:
tweets.source.unique()

[Twitter for iPhone, Twitter Web Client, TweetDeck]
Categories (3, object): [Twitter for iPhone, Twitter Web Client, TweetDeck]
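
An alternative to chained `split` calls is a regular expression that captures the text between the anchor tags. This is only a sketch of that option, run on the three source strings shown above; `raw_source` and `labels` are illustrative names.

```python
import pandas as pd

raw_source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
])

# Capture everything between the first '>' and the following '<'
labels = raw_source.str.extract(r">([^<]+)<", expand=False).astype("category")
print(labels)
```

Rows that do not match the pattern come back as missing values rather than raising, which can make unexpected source formats easier to spot.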

### Define
Hashtags are nested deep inside dictionaries. To be useful for analysis, they need to be extracted into a separate column.

#### Code

Extract hashtags using the `apply` method, put them in a separate column and cast them to the `category` type.

In [33]:
tweets['hashtags'] = tweets.entities.apply(lambda x: x['hashtags'])

In [34]:
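def unpack_hashtags(x):
    # return the text of the first hashtag, or None when the tweet has no hashtags
    try:
        return x[0]['text']
    except (IndexError, KeyError, TypeError):
        return None

tweets.hashtags = tweets.hashtags.apply(unpack_hashtags)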

In [35]:
tweets.hashtags.unique()

array([None, 'BarkWeek', 'Canada150', 'PrideMonthPuppo', 'PrideMonth',
       'dogsatpollingstations', 'ScienceMarch', 'K9VeteransDay',
       'GoodDogs', 'WKCDogShow', 'BellLetsTalk', 'WomensMarch',
       'FinalFur', 'LoveTwitter', 'notallpuppers', 'NoDaysOff'],
      dtype=object)

With so few distinct hashtags, the column is better stored as a `category` type.

In [36]:
tweets.hashtags = tweets.hashtags.astype('category')

#### Test
Check that the `hashtags` column is a `category` type.

In [37]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2061 entries, 0 to 2060
Data columns (total 19 columns):
created_at                   2061 non-null datetime64[ns]
entities                     2061 non-null object
extended_entities            2061 non-null object
favorite_count               2061 non-null int64
full_text                    2061 non-null object
id                           2061 non-null int64
id_str                       2061 non-null int64
in_reply_to_screen_name      23 non-null object
in_reply_to_status_id        23 non-null float64
in_reply_to_status_id_str    23 non-null float64
in_reply_to_user_id          23 non-null float64
in_reply_to_user_id_str      23 non-null float64
lang                         2061 non-null object
place                        1 non-null object
retweet_count                2061 non-null int64
retweeted_status             74 non-null object
source                       2061 non-null category
display_text_end             2061 non-null int64
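
Note that `unpack_hashtags` keeps only the first hashtag of each tweet, so any further hashtags are dropped. If that matters for the analysis, one option is to keep them all as a list first. The sketch below uses made-up `entities` dictionaries shaped like the ones in the output above; the names `all_tags` and `first_tag` are illustrative.

```python
import pandas as pd

# Stand-ins for the 'entities' column: no hashtag, one hashtag, two hashtags
entities = pd.Series([
    {"hashtags": []},
    {"hashtags": [{"text": "BarkWeek", "indices": [129, 138]}]},
    {"hashtags": [{"text": "PrideMonth"}, {"text": "PrideMonthPuppo"}]},
])

# Keep every hashtag text as a list per tweet
all_tags = entities.apply(lambda e: [tag["text"] for tag in e.get("hashtags", [])])

# The first hashtag (or None) can still be derived from the full list
first_tag = all_tags.apply(lambda tags: tags[0] if tags else None)
print(all_tags.tolist())
```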

### Define

As with hashtags, `url`s are nested deep inside dictionaries and need to be extracted into a separate column.

#### Code
Using the `apply` method with a short lambda expression, extract the `url` into a new column.

In [38]:
tweets['url'] = tweets.entities.apply(lambda x: x['media'][0]['url'])

#### Test
Check the new `url` column has the right information.

In [39]:
tweets.head()

Unnamed: 0,created_at,entities,extended_entities,favorite_count,full_text,id,id_str,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,lang,place,retweet_count,retweeted_status,source,display_text_end,hashtags,url
0,2017-08-01 16:23:56,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...",37246,This is Phineas. He's a mystical boy. Only eve...,892420643555336193,892420643555336192,,,,,,en,,7979,,Twitter for iPhone,85,,https://t.co/MgUWQ76dJU
1,2017-08-01 00:17:27,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892177413194625024, 'id_str'...",32027,This is Tilly. She's just checking pup on you....,892177421306343426,892177421306343424,,,,,,en,,5925,,Twitter for iPhone,138,,https://t.co/0Xxu71qeIV
2,2017-07-31 00:18:03,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891815175371796480, 'id_str'...",24129,This is Archie. He is a rare Norwegian Pouncin...,891815181378084864,891815181378084864,,,,,,en,,3913,,Twitter for iPhone,121,,https://t.co/wUnZnhtVJB
3,2017-07-30 15:58:51,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891689552724799489, 'id_str'...",40568,This is Darla. She commenced a snooze mid meal...,891689557279858688,891689557279858688,,,,,,en,,8132,,Twitter for iPhone,79,,https://t.co/tD36da7qLQ
4,2017-07-29 16:00:24,"{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891327551943041024, 'id_str'...",38800,This is Franklin. He would like you to stop ca...,891327558926688256,891327558926688256,,,,,,en,,8812,,Twitter for iPhone,138,BarkWeek,https://t.co/AtUZn91f7f


### Define 

The `tweets` and `merged_df` datasets provide different information on the same set of records. For further analysis they need to be merged.

#### Code

In [40]:
# rename id columns to identical names
tweets.rename(columns={'id':'tweet_id'},inplace=True)

In [41]:
# merge two dataframes on the 'tweet_id' column
we_rate_dogs = pd.merge(merged_df,tweets, how='inner', on='tweet_id')

#### Test
Check the dataframes are merged successfully.

In [42]:
we_rate_dogs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2061 entries, 0 to 2060
Data columns (total 47 columns):
tweet_id                      2061 non-null int64
in_reply_to_status_id_x       23 non-null float64
in_reply_to_user_id_x         23 non-null float64
timestamp                     2061 non-null object
source_x                      2061 non-null object
text                          2061 non-null object
retweeted_status_id           74 non-null float64
retweeted_status_user_id      74 non-null float64
retweeted_status_timestamp    74 non-null object
expanded_urls                 2061 non-null object
rating_numerator              2061 non-null int64
rating_denominator            2061 non-null int64
name                          2061 non-null object
doggo                         2061 non-null object
floofer                       2061 non-null object
pupper                        2061 non-null object
puppo                         2061 non-null object
jpg_url                       2061 
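
The `_x` suffixes in the output above appear because some column names exist in both frames. A sketch on toy frames shows how `suffixes` can give those columns more descriptive names and how `validate` can assert that each `tweet_id` appears once on each side. The frames and the suffix strings are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for merged_df and tweets, sharing 'tweet_id' and 'source'
archive_side = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "source": ["<a>iPhone</a>", "<a>Web</a>", "<a>iPhone</a>"],
})
api_side = pd.DataFrame({
    "tweet_id": [1, 2],
    "source": ["iPhone", "Web"],
    "favorite_count": [100, 200],
})

merged = pd.merge(
    archive_side,
    api_side,
    how="inner",
    on="tweet_id",
    suffixes=("_archive", "_api"),  # replaces the default _x / _y
    validate="one_to_one",          # raises MergeError if tweet_id repeats on either side
)
print(merged.columns.tolist())
```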

### Define
For the purposes of this analysis, the information in some of the columns is irrelevant. In addition, some information was duplicated across the dataframes that were merged. Let's remove all redundant columns.

#### Code
Put all redundant columns into a list and drop them.

In [46]:
redundant_columns = [
'in_reply_to_status_id_x',
'in_reply_to_user_id_x',
'retweeted_status_id',
'retweeted_status_user_id',
'in_reply_to_screen_name',
'in_reply_to_status_id_y',
'in_reply_to_status_id_str',
'in_reply_to_user_id_y',
'in_reply_to_user_id_str',
'retweeted_status',
'retweeted_status_timestamp',
'expanded_urls',
'full_text',
'source_x']

In [47]:
we_rate_dogs.drop(columns=redundant_columns, inplace=True)

#### Test
Check the columns in the dataframe.

In [49]:
we_rate_dogs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2061 entries, 0 to 2060
Data columns (total 34 columns):
tweet_id              2061 non-null int64
timestamp             2061 non-null object
text                  2061 non-null object
rating_numerator      2061 non-null int64
rating_denominator    2061 non-null int64
name                  2061 non-null object
doggo                 2061 non-null object
floofer               2061 non-null object
pupper                2061 non-null object
puppo                 2061 non-null object
jpg_url               2061 non-null object
img_num               2061 non-null int64
p1                    2061 non-null object
p1_conf               2061 non-null float64
p1_dog                2061 non-null bool
p2                    2061 non-null object
p2_conf               2061 non-null float64
p2_dog                2061 non-null bool
p3                    2061 non-null object
p3_conf               2061 non-null float64
p3_dog                2061 non-null bo

### Define

The tweet `timestamp` field should be a datetime.

#### Code

In [50]:
we_rate_dogs.timestamp = pd.to_datetime(we_rate_dogs.timestamp)

#### Test

In [51]:
we_rate_dogs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2061 entries, 0 to 2060
Data columns (total 34 columns):
tweet_id              2061 non-null int64
timestamp             2061 non-null datetime64[ns, UTC]
text                  2061 non-null object
rating_numerator      2061 non-null int64
rating_denominator    2061 non-null int64
name                  2061 non-null object
doggo                 2061 non-null object
floofer               2061 non-null object
pupper                2061 non-null object
puppo                 2061 non-null object
jpg_url               2061 non-null object
img_num               2061 non-null int64
p1                    2061 non-null object
p1_conf               2061 non-null float64
p1_dog                2061 non-null bool
p2                    2061 non-null object
p2_conf               2061 non-null float64
p2_dog                2061 non-null bool
p3                    2061 non-null object
p3_conf               2061 non-null float64
p3_dog                206
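
When the timestamp layout is known, passing an explicit `format` documents the expected input and avoids format inference. A minimal sketch using two timestamps from the archive; the format string is an assumption based on how those values look.

```python
import pandas as pd

raw = pd.Series([
    "2017-08-01 16:23:56 +0000",
    "2017-07-31 00:18:03 +0000",
])

# %z parses the +0000 offset; utc=True returns timezone-aware UTC timestamps
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S %z", utc=True)

# The .dt accessor then exposes date parts for later grouping
print(parsed.dt.hour.tolist())
```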

### Define

The `doggo`, `floofer`, `pupper` and `puppo` columns are better represented as `bool`s.

#### Code

In [53]:
def to_bool(x):
    # the archive stores a missing stage as the string 'None'
    return x != 'None'
    
we_rate_dogs.doggo = we_rate_dogs.doggo.apply(to_bool)
we_rate_dogs.floofer = we_rate_dogs.floofer.apply(to_bool)
we_rate_dogs.pupper = we_rate_dogs.pupper.apply(to_bool)
we_rate_dogs.puppo = we_rate_dogs.puppo.apply(to_bool)

#### Test
Check if the columns have correct data types in them.

In [54]:
we_rate_dogs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2061 entries, 0 to 2060
Data columns (total 34 columns):
tweet_id              2061 non-null int64
timestamp             2061 non-null datetime64[ns, UTC]
text                  2061 non-null object
rating_numerator      2061 non-null int64
rating_denominator    2061 non-null int64
name                  2061 non-null object
doggo                 2061 non-null bool
floofer               2061 non-null bool
pupper                2061 non-null bool
puppo                 2061 non-null bool
jpg_url               2061 non-null object
img_num               2061 non-null int64
p1                    2061 non-null object
p1_conf               2061 non-null float64
p1_dog                2061 non-null bool
p2                    2061 non-null object
p2_conf               2061 non-null float64
p2_dog                2061 non-null bool
p3                    2061 non-null object
p3_conf               2061 non-null float64
p3_dog                2061 non-nu
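
The same conversion can be done for all four columns at once with a vectorized comparison, which also makes it easy to count how many stages each tweet was tagged with. A sketch on a toy frame, with `toy`, `flags` and `stages_per_tweet` as illustrative names:

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Toy rows using the archive's convention of the string 'None' for a missing stage
toy = pd.DataFrame({
    "doggo":   ["None", "doggo", "None"],
    "floofer": ["None", "None", "None"],
    "pupper":  ["None", "None", "pupper"],
    "puppo":   ["None", "None", "None"],
})

# True wherever a stage is present
flags = toy[stage_cols].ne("None")

# Number of stages tagged per tweet; values above 1 would deserve a closer look
stages_per_tweet = flags.sum(axis=1)
print(stages_per_tweet.tolist())
```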

### Define
Convert the `lang` column into a category.

#### Code

In [55]:
we_rate_dogs.lang = we_rate_dogs.lang.astype('category')

#### Test

There is only one record in the 'place' column. What is it?

In [None]:
we_rate_dogs[we_rate_dogs.place.notnull()].place

In [None]:
we_rate_dogs.iloc[686].place

In [None]:
we_rate_dogs.iloc[686].url