
**Your goal**: wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it contains only very basic tweet information. Additional gathering, then assessing and cleaning, are required for "Wow!"-worthy analyses and visualizations.


1. The WeRateDogs Twitter archive. This file is provided to you, so treat it as a file on hand. Download it manually by clicking the following link: twitter_archive_enhanced.csv

2. The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

3. Each tweet's retweet count and favorite ("like") count at minimum, and any additional data you find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt. Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count. Note: do not include your Twitter API keys, secrets, and tokens in your project submission.
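
The "one JSON document per line, then read it back line by line" requirement in item 3 can be sketched as follows. This is a minimal illustration rather than the notebook's own code: the helper names `write_json_lines` and `read_json_lines` are hypothetical, the two sample records are cut-down stand-ins for real API responses, and an in-memory buffer stands in for `tweet_json.txt`.

```python
import io
import json

import pandas as pd

# Cut-down stand-ins for API responses; real tweet JSON has many more fields.
sample_tweets = [
    {"id": 892420643555336193, "retweet_count": 7979, "favorite_count": 37246},
    {"id": 892177421306343426, "retweet_count": 5925, "favorite_count": 32027},
]


def write_json_lines(tweets, handle):
    """Write each tweet's JSON on its own line."""
    for tweet in tweets:
        handle.write(json.dumps(tweet) + "\n")


def read_json_lines(handle):
    """Read the file line by line, keeping only the fields of interest."""
    rows = []
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return pd.DataFrame(rows, columns=["tweet_id", "retweet_count", "favorite_count"])


buffer = io.StringIO()  # stands in for open('tweet_json.txt', 'w')
write_json_lines(sample_tweets, buffer)
buffer.seek(0)
tweet_counts = read_json_lines(buffer)
```

With a real file, the same two helpers would be called inside `with open('tweet_json.txt', ...)` blocks instead of on the buffer.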

Detect and document at least:
* **eight quality issues**
* **two tidiness issues**

Produce at least:
* **three insights**
* **one visualization**

* Create a 300-600 word written report called wrangle_report.pdf or wrangle_report.html that briefly describes your wrangling efforts. This is to be framed as an internal document.

* Create a 250-word-minimum written report called act_report.pdf or act_report.html that communicates the insights and displays the visualization(s) produced from your wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example.


In [1]:
import requests as rq
import pandas as pd
import numpy as np
import io
import json

### Read the enhanced twitter archive into a dataframe

In [2]:
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')

In [3]:
twitter_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [4]:
twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

### Download the tweet image predictions

In [5]:
r = rq.get('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv')

In [6]:
r.status_code

200

In [7]:
r.headers['content-type']

'text/tab-separated-values; charset=utf-8'

In [8]:
r.encoding

'utf-8'

#### Load the received .tsv file into a dataframe

In [9]:
# DataFrame.from_csv is deprecated; read_csv with index_col=0 keeps tweet_id as the index
image_predictions = pd.read_csv(io.StringIO(r.content.decode(r.encoding)), sep='\t', index_col=0)


In [10]:
image_predictions.head()

tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [11]:
image_predictions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2075 entries, 666020888022790149 to 892420643555336193
Data columns (total 11 columns):
jpg_url    2075 non-null object
img_num    2075 non-null int64
p1         2075 non-null object
p1_conf    2075 non-null float64
p1_dog     2075 non-null bool
p2         2075 non-null object
p2_conf    2075 non-null float64
p2_dog     2075 non-null bool
p3         2075 non-null object
p3_conf    2075 non-null float64
p3_dog     2075 non-null bool
dtypes: bool(3), float64(3), int64(1), object(4)
memory usage: 152.0+ KB
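
The download-and-load steps above could also be wrapped in two small helpers, sketched here under some assumptions: `download_text` and `tsv_to_frame` are hypothetical names, the timeout value is arbitrary, and the sample string is a hand-made stand-in for the real file so the parsing step can be tried offline.

```python
import io

import pandas as pd


def download_text(url, timeout=30):
    """Fetch a URL and return the decoded body, raising on HTTP errors."""
    import requests  # imported lazily so the parsing helper also works offline

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly instead of checking status_code by hand
    return response.text


def tsv_to_frame(text):
    """Parse tab-separated text, using the first column (tweet_id) as the index."""
    return pd.read_csv(io.StringIO(text), sep="\t", index_col=0)


# Tiny hand-made sample in the same layout; not the real file.
sample = (
    "tweet_id\tjpg_url\timg_num\n"
    "666020888022790149\thttps://example.invalid/a.jpg\t1\n"
)
preview = tsv_to_frame(sample)
```

For the real data, `tsv_to_frame(download_text(url))` with the URL from the project brief would be the equivalent call.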


### Merge twitter archive and image predictions

In [12]:
merged_df = pd.merge(twitter_archive,image_predictions,how='inner',on='tweet_id')

In [13]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2075 entries, 0 to 2074
Data columns (total 28 columns):
tweet_id                      2075 non-null int64
in_reply_to_status_id         23 non-null float64
in_reply_to_user_id           23 non-null float64
timestamp                     2075 non-null object
source                        2075 non-null object
text                          2075 non-null object
retweeted_status_id           81 non-null float64
retweeted_status_user_id      81 non-null float64
retweeted_status_timestamp    81 non-null object
expanded_urls                 2075 non-null object
rating_numerator              2075 non-null int64
rating_denominator            2075 non-null int64
name                          2075 non-null object
doggo                         2075 non-null object
floofer                       2075 non-null object
pupper                        2075 non-null object
puppo                         2075 non-null object
jpg_url                       2075 

In [14]:
merged_df.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,...,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,...,1,orange,0.097049,False,bagel,0.085851,False,banana,0.07611,False
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,...,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,...,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,...,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,...,2,basset,0.555712,True,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True
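
An inner merge silently drops archive rows that have no image prediction. One way to see which rows are lost, and to guard against duplicate keys, is an outer merge with pandas' `indicator` and `validate` options. The sketch below uses small made-up frames rather than the real data.

```python
import pandas as pd

# Made-up stand-ins for twitter_archive and image_predictions.
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [13, 12, 11]})
predictions = pd.DataFrame({"tweet_id": [2, 3, 4], "p1": ["Chihuahua", "basset", "orange"]})

# validate raises if tweet_id is duplicated on either side;
# indicator adds a _merge column saying where each row came from.
checked = pd.merge(
    archive,
    predictions,
    how="outer",
    on="tweet_id",
    validate="one_to_one",
    indicator=True,
)

only_in_archive = checked.loc[checked["_merge"] == "left_only", "tweet_id"].tolist()
merged = checked[checked["_merge"] == "both"].drop(columns="_merge")
```

Here `merged` holds the same rows an inner merge would produce, while `only_in_archive` lists the IDs that would have been dropped.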


### Set up Twitter API via tweepy

In [16]:
import tweepy

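# consumer_key, consumer_secret, access_token and access_secret must be defined beforehand
# (they are deliberately not included in this notebook)
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)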
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, parser=tweepy.parsers.JSONParser())
api.wait_on_rate_limit = True

In [20]:
test_tweet = api.get_status('892420643555336193', tweet_mode='extended')

In [21]:
# create empty lists to store the tweets retrieved via the API and the IDs that failed
tweets = []
missing_tweets = []

# use the tweet IDs from our dataframe to retrieve the original tweets
for i in merged_df.tweet_id:
    try:
        tweets.append(api.get_status(i, tweet_mode='extended'))
    except Exception:
        # the tweet could not be retrieved (for example, it may have been deleted)
        missing_tweets.append(i)
        print ('tweet #', i, ' could not be located')

# write downloaded tweets to a json file and store it locally
with open('tweets.json', 'w') as outfile:
    json.dump(tweets, outfile)

tweet # 888202515573088257  could not be located
tweet # 873697596434513921  could not be located
tweet # 872261713294495745  could not be located
tweet # 861769973181624320  could not be located
tweet # 851953902622658560  could not be located
tweet # 844704788403113984  could not be located
tweet # 842892208864923648  could not be located
tweet # 837366284874571778  could not be located
tweet # 837012587749474308  could not be located
tweet # 829374341691346946  could not be located
tweet # 802247111496568832  could not be located
tweet # 779123168116150273  could not be located
tweet # 754011816964026368  could not be located
tweet # 680055455951884288  could not be located


In [26]:
missing_tweets

[888202515573088257,
 873697596434513921,
 872261713294495745,
 861769973181624320,
 851953902622658560,
 844704788403113984,
 842892208864923648,
 837366284874571778,
 837012587749474308,
 829374341691346946,
 802247111496568832,
 779123168116150273,
 754011816964026368,
 680055455951884288]
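
The retrieval loop above can be separated from the API client so that it is testable without credentials. The sketch below is an assumption-laden illustration: `collect_tweets` and `fake_fetch` are hypothetical names, and the fake fetcher simply pretends that odd IDs are missing.

```python
def collect_tweets(tweet_ids, fetch):
    """Call fetch(tweet_id) for every ID; return (found, missing)."""
    found, missing = [], []
    for tweet_id in tweet_ids:
        try:
            found.append(fetch(tweet_id))
        except Exception as error:
            # keep the reason alongside the ID so failures can be reviewed later
            missing.append((tweet_id, repr(error)))
    return found, missing


def fake_fetch(tweet_id):
    """Stand-in for an API call: odd IDs are treated as unavailable."""
    if tweet_id % 2:
        raise LookupError("No status found with that ID.")
    return {"id": tweet_id, "retweet_count": 0, "favorite_count": 0}


found, missing = collect_tweets([10, 11, 12], fake_fetch)
```

With the client configured in this notebook, `fetch` could be `lambda i: api.get_status(i, tweet_mode='extended')`.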

### Read the downloaded and saved archive from a local .json file

In [90]:
tweets = pd.read_json ('tweets.json')

In [91]:
tweets.head()

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,...,lang,place,possibly_sensitive,possibly_sensitive_appealable,retweet_count,retweeted,retweeted_status,source,truncated,user
0,,,2017-08-01 16:23:56,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...",37246,False,This is Phineas. He's a mystical boy. Only eve...,,...,en,,False,False,7979,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
1,,,2017-08-01 00:17:27,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892177413194625024, 'id_str'...",32027,False,This is Tilly. She's just checking pup on you....,,...,en,,False,False,5925,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
2,,,2017-07-31 00:18:03,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891815175371796480, 'id_str'...",24129,False,This is Archie. He is a rare Norwegian Pouncin...,,...,en,,False,False,3913,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
3,,,2017-07-30 15:58:51,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891689552724799489, 'id_str'...",40568,False,This is Darla. She commenced a snooze mid meal...,,...,en,,False,False,8132,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
4,,,2017-07-29 16:00:24,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891327551943041024, 'id_str'...",38800,False,This is Franklin. He would like you to stop ca...,,...,en,,False,False,8812,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."


### Let's drop all the empty columns

#### quality:
- remove all empty columns
- in_reply_to_status_id and in_reply_to_status_id_str are stored as floats but should be strings
- in_reply_to_user_id and in_reply_to_user_id_str are stored as floats but should be strings
- remove lang, as it is English-only
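
Storing IDs as strings matters because tweet IDs are larger than what a float64 can represent exactly. In this notebook's `tweets.head()` output, the first tweet has `id` 892420643555336193 but `id_str` 892420643555336192, which is consistent with a round trip through a float. The sketch below reproduces that effect and shows one possible way to keep `id_str` as text, by parsing with the `json` module before building the frame; this is an illustration, not the approach used in the notebook.

```python
import json

import pandas as pd

tweet_id = 892420643555336193  # first tweet ID in the archive

# A float64 has a 53-bit significand, so integers this large get rounded.
rounded = int(float(tweet_id))

# Parsing with json.loads keeps "id_str" as a Python string.
raw = '[{"id": 892420643555336193, "id_str": "892420643555336193"}]'
frame = pd.DataFrame(json.loads(raw))
id_as_text = frame["id_str"].iloc[0]
```

After running this, `rounded` differs from `tweet_id` in its last digit, while `id_as_text` still matches the original ID character for character.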


#### tidiness:
- some columns contain dictionaries, data inside of which is difficult to access
- 'display_text_range' is an array that always starts with '0'
- 'user' field needs to be unpacked into columns
- 'entities' needs to be unpacked into columns
- 'source' column should be turned into a categorical variable
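
The dictionary-valued columns listed above could be unpacked along the following lines. This is a sketch on a made-up two-row frame: it assumes a pandas version that provides `pd.json_normalize` (1.0 or later, which is newer than the version this notebook appears to run on), and the `screen_name` value is illustrative.

```python
import pandas as pd

# Made-up miniature of the tweets frame; real 'user' and 'entities' dicts are much larger.
tweets_demo = pd.DataFrame({
    "id": [1, 2],
    "user": [
        {"id": 4196983835, "screen_name": "dog_rates"},
        {"id": 4196983835, "screen_name": "dog_rates"},
    ],
    "entities": [
        {"hashtags": []},
        {"hashtags": [{"text": "BarkWeek"}]},
    ],
})

# One column per key of the 'user' dictionary, prefixed to avoid name clashes.
user_cols = pd.json_normalize(tweets_demo["user"].tolist()).add_prefix("user_")

# Reduce 'entities' to a plain list of hashtag texts per tweet.
hashtags = tweets_demo["entities"].apply(lambda e: [h["text"] for h in e["hashtags"]])

tidy = pd.concat([tweets_demo[["id"]], user_cols], axis=1).assign(hashtags=hashtags)
```

The result has the columns `id`, `user_id`, `user_screen_name` and `hashtags`, with no nested dictionaries left.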


In [92]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2061 entries, 0 to 2060
Data columns (total 28 columns):
contributors                     0 non-null float64
coordinates                      0 non-null float64
created_at                       2061 non-null datetime64[ns]
display_text_range               2061 non-null object
entities                         2061 non-null object
extended_entities                2061 non-null object
favorite_count                   2061 non-null int64
favorited                        2061 non-null bool
full_text                        2061 non-null object
geo                              0 non-null float64
id                               2061 non-null int64
id_str                           2061 non-null int64
in_reply_to_screen_name          23 non-null object
in_reply_to_status_id            23 non-null float64
in_reply_to_status_id_str        23 non-null float64
in_reply_to_user_id              23 non-null float64
in_reply_to_user_id_str          23 n

Let's see if there are any empty columns

In [93]:
tweets.sum(axis=0)

contributors                                                                     0
coordinates                                                                      0
display_text_range               [0, 85, 0, 138, 0, 121, 0, 79, 0, 138, 0, 138,...
favorite_count                                                            16894703
favorited                                                                        0
full_text                        This is Phineas. He's a mystical boy. Only eve...
geo                                                                              0
id                                                             8098880034596373798
id_str                                                         8098880034596372480
in_reply_to_status_id                                                  1.60497e+19
in_reply_to_status_id_str                                              1.60497e+19
in_reply_to_user_id                                                    9.65306e+10
in_r

From the above it is clear that some columns contain no information. Let's remove them.
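
Summing the columns works here, but a more direct check is to look for columns that are entirely missing or that hold a single constant value. The sketch below runs on a small made-up frame that borrows three column names from the output above.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "geo": [np.nan, np.nan, np.nan],          # entirely missing
    "favorited": [False, False, False],       # present but constant
    "favorite_count": [37246, 32027, 24129],  # informative
})

# Columns in which every value is missing.
all_missing = [c for c in demo.columns if demo[c].isna().all()]

# Columns with at most one distinct non-missing value.
constant = [
    c for c in demo.columns
    if c not in all_missing and demo[c].nunique(dropna=True) <= 1
]
```

On the real `tweets` frame, columns holding dictionaries or lists are not hashable, so they would need to be skipped or converted to strings before calling `nunique`.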

In [94]:
columns_to_drop = ['contributors','coordinates','favorited','geo','is_quote_status','possibly_sensitive','possibly_sensitive_appealable','retweeted','truncated']
tweets.drop(columns=columns_to_drop, inplace=True)
tweets.head()

Unnamed: 0,created_at,display_text_range,entities,extended_entities,favorite_count,full_text,id,id_str,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,lang,place,retweet_count,retweeted_status,source,user
0,2017-08-01 16:23:56,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...",37246,This is Phineas. He's a mystical boy. Only eve...,892420643555336193,892420643555336192,,,,,,en,,7979,,"<a href=""http://twitter.com/download/iphone"" r...","{'id': 4196983835, 'id_str': '4196983835', 'na..."
1,2017-08-01 00:17:27,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892177413194625024, 'id_str'...",32027,This is Tilly. She's just checking pup on you....,892177421306343426,892177421306343424,,,,,,en,,5925,,"<a href=""http://twitter.com/download/iphone"" r...","{'id': 4196983835, 'id_str': '4196983835', 'na..."
2,2017-07-31 00:18:03,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891815175371796480, 'id_str'...",24129,This is Archie. He is a rare Norwegian Pouncin...,891815181378084864,891815181378084864,,,,,,en,,3913,,"<a href=""http://twitter.com/download/iphone"" r...","{'id': 4196983835, 'id_str': '4196983835', 'na..."
3,2017-07-30 15:58:51,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891689552724799489, 'id_str'...",40568,This is Darla. She commenced a snooze mid meal...,891689557279858688,891689557279858688,,,,,,en,,8132,,"<a href=""http://twitter.com/download/iphone"" r...","{'id': 4196983835, 'id_str': '4196983835', 'na..."
4,2017-07-29 16:00:24,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891327551943041024, 'id_str'...",38800,This is Franklin. He would like you to stop ca...,891327558926688256,891327558926688256,,,,,,en,,8812,,"<a href=""http://twitter.com/download/iphone"" r...","{'id': 4196983835, 'id_str': '4196983835', 'na..."


In [95]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2061 entries, 0 to 2060
Data columns (total 19 columns):
created_at                   2061 non-null datetime64[ns]
display_text_range           2061 non-null object
entities                     2061 non-null object
extended_entities            2061 non-null object
favorite_count               2061 non-null int64
full_text                    2061 non-null object
id                           2061 non-null int64
id_str                       2061 non-null int64
in_reply_to_screen_name      23 non-null object
in_reply_to_status_id        23 non-null float64
in_reply_to_status_id_str    23 non-null float64
in_reply_to_user_id          23 non-null float64
in_reply_to_user_id_str      23 non-null float64
lang                         2061 non-null object
place                        1 non-null object
retweet_count                2061 non-null int64
retweeted_status             74 non-null object
source                       2061 non-null object


In [107]:
# is the assumption that all 'display_text_range' values start with 0 correct?
for i in tweets.display_text_range:
    if i[0] > 0:
        print (i)
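
The same check can be done without an explicit loop by expanding the two-element lists into columns. The sketch below uses three ranges copied from the output above; the column names `start` and `end` are made up for the example.

```python
import pandas as pd

ranges = pd.Series([[0, 85], [0, 138], [0, 121]], name="display_text_range")

# Expand each [start, end] list into two columns.
bounds = pd.DataFrame(ranges.tolist(), columns=["start", "end"], index=ranges.index)

all_start_at_zero = bool((bounds["start"] == 0).all())
display_text_end = bounds["end"]
```

If `all_start_at_zero` is true, only the `end` column carries information.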

In [124]:
tweets['display_text_end'] = tweets.display_text_range.apply(lambda x: x[1])

In [125]:
tweets.head()

Unnamed: 0,created_at,display_text_range,entities,extended_entities,favorite_count,full_text,id,id_str,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,lang,place,retweet_count,retweeted_status,source,user,display_text_end
0,2017-08-01 16:23:56,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...",37246,This is Phineas. He's a mystical boy. Only eve...,892420643555336193,892420643555336192,,,,,,en,,7979,,"<a href=""http://twitter.com/download/iphone"" r...","{'id': 4196983835, 'id_str': '4196983835', 'na...",85
1,2017-08-01 00:17:27,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892177413194625024, 'id_str'...",32027,This is Tilly. She's just checking pup on you....,892177421306343426,892177421306343424,,,,,,en,,5925,,"<a href=""http://twitter.com/download/iphone"" r...","{'id': 4196983835, 'id_str': '4196983835', 'na...",138
2,2017-07-31 00:18:03,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891815175371796480, 'id_str'...",24129,This is Archie. He is a rare Norwegian Pouncin...,891815181378084864,891815181378084864,,,,,,en,,3913,,"<a href=""http://twitter.com/download/iphone"" r...","{'id': 4196983835, 'id_str': '4196983835', 'na...",121
3,2017-07-30 15:58:51,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891689552724799489, 'id_str'...",40568,This is Darla. She commenced a snooze mid meal...,891689557279858688,891689557279858688,,,,,,en,,8132,,"<a href=""http://twitter.com/download/iphone"" r...","{'id': 4196983835, 'id_str': '4196983835', 'na...",79
4,2017-07-29 16:00:24,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891327551943041024, 'id_str'...",38800,This is Franklin. He would like you to stop ca...,891327558926688256,891327558926688256,,,,,,en,,8812,,"<a href=""http://twitter.com/download/iphone"" r...","{'id': 4196983835, 'id_str': '4196983835', 'na...",138


In [126]:
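# note: drop() returns a new DataFrame; without assignment or inplace=True, `tweets` itself is unchanged
tweets.drop(columns='display_text_range')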

Unnamed: 0,created_at,entities,extended_entities,favorite_count,full_text,id,id_str,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,lang,place,retweet_count,retweeted_status,source,user,display_text_end
0,2017-08-01 16:23:56,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...",37246,This is Phineas. He's a mystical boy. Only eve...,892420643555336193,892420643555336192,,,,,,en,,7979,,"<a href=""http://twitter.com/download/iphone"" r...","{'id': 4196983835, 'id_str': '4196983835', 'na...",85
1,2017-08-01 00:17:27,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892177413194625024, 'id_str'...",32027,This is Tilly. She's just checking pup on you....,892177421306343426,892177421306343424,,,,,,en,,5925,,"<a href=""http://twitter.com/download/iphone"" r...","{'id': 4196983835, 'id_str': '4196983835', 'na...",138
2,2017-07-31 00:18:03,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891815175371796480, 'id_str'...",24129,This is Archie. He is a rare Norwegian Pouncin...,891815181378084864,891815181378084864,,,,,,en,,3913,,"<a href=""http://twitter.com/download/iphone"" r...","{'id': 4196983835, 'id_str': '4196983835', 'na...",121
3,2017-07-30 15:58:51,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891689552724799489, 'id_str'...",40568,This is Darla. She commenced a snooze mid meal...,891689557279858688,891689557279858688,,,,,,en,,8132,,"<a href=""http://twitter.com/download/iphone"" r...","{'id': 4196983835, 'id_str': '4196983835', 'na...",79
4,2017-07-29 16:00:24,"{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891327551943041024, 'id_str'...",38800,This is Franklin. He would like you to stop ca...,891327558926688256,891327558926688256,,,,,,en,,8812,,"<a href=""http://twitter.com/download/iphone"" r...","{'id': 4196983835, 'id_str': '4196983835', 'na...",138
5,2017-07-29 00:08:17,"{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891087942176911360, 'id_str'...",19493,Here we have a majestic great white breaching ...,891087950875897856,891087950875897856,,,,,,en,,2932,,"<a href=""http://twitter.com/download/iphone"" r...","{'id': 4196983835, 'id_str': '4196983835', 'na...",138
6,2017-07-28 16:27:12,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 890971906207338496, 'id_str'...",11378,Meet Jax. He enjoys ice cream so much he gets ...,890971913173991426,890971913173991424,,,,,,en,,1945,,"<a href=""http://twitter.com/download/iphone"" r...","{'id': 4196983835, 'id_str': '4196983835', 'na...",140
7,2017-07-28 00:22:40,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 890729118844600320, 'id_str'...",62833,When you watch your owner call another dog a g...,890729181411237888,890729181411237888,,,,,,en,,17772,,"<a href=""http://twitter.com/download/iphone"" r...","{'id': 4196983835, 'id_str': '4196983835', 'na...",118
8,2017-07-27 16:25:51,"{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 890609177319665665, 'id_str'...",26821,This is Zoey. She doesn't want to be one of th...,890609185150312448,890609185150312448,,,,,,en,,4030,,"<a href=""http://twitter.com/download/iphone"" r...","{'id': 4196983835, 'id_str': '4196983835', 'na...",122
9,2017-07-26 15:59:51,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 890240245463175168, 'id_str'...",30719,This is Cassie. She is a college pup. Studying...,890240255349198849,890240255349198848,,,,,,en,,6943,,"<a href=""http://twitter.com/download/iphone"" r...","{'id': 4196983835, 'id_str': '4196983835', 'na...",133


In [129]:
# how many user ids are there?
tweets.user.apply(lambda x: x['id']).nunique()

1

In [132]:
# For the purpose of our analysis we can drop the 'user' column
# note: without assignment or inplace=True this only displays the result; `tweets` is not modified
tweets.drop(columns = 'user')

Unnamed: 0,created_at,display_text_range,entities,extended_entities,favorite_count,full_text,id,id_str,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,lang,place,retweet_count,retweeted_status,source,display_text_end
0,2017-08-01 16:23:56,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...",37246,This is Phineas. He's a mystical boy. Only eve...,892420643555336193,892420643555336192,,,,,,en,,7979,,"<a href=""http://twitter.com/download/iphone"" r...",85
1,2017-08-01 00:17:27,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892177413194625024, 'id_str'...",32027,This is Tilly. She's just checking pup on you....,892177421306343426,892177421306343424,,,,,,en,,5925,,"<a href=""http://twitter.com/download/iphone"" r...",138
2,2017-07-31 00:18:03,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891815175371796480, 'id_str'...",24129,This is Archie. He is a rare Norwegian Pouncin...,891815181378084864,891815181378084864,,,,,,en,,3913,,"<a href=""http://twitter.com/download/iphone"" r...",121
3,2017-07-30 15:58:51,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891689552724799489, 'id_str'...",40568,This is Darla. She commenced a snooze mid meal...,891689557279858688,891689557279858688,,,,,,en,,8132,,"<a href=""http://twitter.com/download/iphone"" r...",79
4,2017-07-29 16:00:24,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891327551943041024, 'id_str'...",38800,This is Franklin. He would like you to stop ca...,891327558926688256,891327558926688256,,,,,,en,,8812,,"<a href=""http://twitter.com/download/iphone"" r...",138
5,2017-07-29 00:08:17,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891087942176911360, 'id_str'...",19493,Here we have a majestic great white breaching ...,891087950875897856,891087950875897856,,,,,,en,,2932,,"<a href=""http://twitter.com/download/iphone"" r...",138
6,2017-07-28 16:27:12,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 890971906207338496, 'id_str'...",11378,Meet Jax. He enjoys ice cream so much he gets ...,890971913173991426,890971913173991424,,,,,,en,,1945,,"<a href=""http://twitter.com/download/iphone"" r...",140
7,2017-07-28 00:22:40,"[0, 118]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 890729118844600320, 'id_str'...",62833,When you watch your owner call another dog a g...,890729181411237888,890729181411237888,,,,,,en,,17772,,"<a href=""http://twitter.com/download/iphone"" r...",118
8,2017-07-27 16:25:51,"[0, 122]","{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 890609177319665665, 'id_str'...",26821,This is Zoey. She doesn't want to be one of th...,890609185150312448,890609185150312448,,,,,,en,,4030,,"<a href=""http://twitter.com/download/iphone"" r...",122
9,2017-07-26 15:59:51,"[0, 133]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 890240245463175168, 'id_str'...",30719,This is Cassie. She is a college pup. Studying...,890240255349198849,890240255349198848,,,,,,en,,6943,,"<a href=""http://twitter.com/download/iphone"" r...",133


In [134]:
# how many distinct sources were used to post tweets?
tweets.source.unique()

array(['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
       '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
       '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>'],
      dtype=object)

Since there are only three distinct values, this column is best represented as a categorical variable.

In [153]:
tweets.source = tweets.source.apply(lambda x: x.split('>')[1].split('<')[0]).astype('category')

In [155]:
tweets.source.unique()

[Twitter for iPhone, Twitter Web Client, TweetDeck]
Categories (3, object): [Twitter for iPhone, Twitter Web Client, TweetDeck]
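
An alternative to chaining `split` calls is a single regular expression that captures the link text. The sketch below applies it to the three source strings shown above; the pattern is an assumption about the markup and would return missing values for strings that do not end in `>label</a>`.

```python
import pandas as pd

source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
])

# Capture the text between the opening tag's '>' and the closing '</a>'.
labels = source.str.extract(r">([^<]+)</a>", expand=False).astype("category")
```

Unlike indexing into the result of `split`, `str.extract` does not raise on malformed input, so unexpected values show up as NaN and can be inspected.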

In [163]:
retweeted = tweets[tweets.retweeted_status.notnull()]

In [167]:
# note: `keys` without parentheses displays the bound method rather than calling it
retweeted.retweeted_status.keys

<bound method Series.keys of 32      {'created_at': 'Sun Feb 12 01:04:29 +0000 2017...
67      {'created_at': 'Mon Jun 19 17:14:49 +0000 2017...
107     {'created_at': 'Thu May 18 01:17:25 +0000 2017...
130     {'created_at': 'Sat May 06 17:49:42 +0000 2017...
167     {'created_at': 'Sat Apr 01 21:42:03 +0000 2017...
182     {'created_at': 'Sat Apr 01 00:36:55 +0000 2017...
209     {'created_at': 'Fri Jan 06 17:33:29 +0000 2017...
220     {'created_at': 'Wed Mar 08 01:41:24 +0000 2017...
223     {'created_at': 'Tue Mar 07 00:15:46 +0000 2017...
255     {'created_at': 'Sat Feb 18 01:39:12 +0000 2017...
261     {'created_at': 'Tue Nov 01 03:00:09 +0000 2016...
262     {'created_at': 'Sun Aug 28 16:51:16 +0000 2016...
276     {'created_at': 'Thu Feb 02 01:01:21 +0000 2017...
309     {'created_at': 'Sun Nov 06 01:33:58 +0000 2016...
318     {'created_at': 'Fri Jan 20 00:50:15 +0000 2017...
322     {'created_at': 'Fri Jan 20 17:00:46 +0000 2017...
328     {'created_at': 'Tue Sep 27 02:53:48