# Project: Wrangling and Analyze Data

In [1]:
import pandas as pd
import numpy as np
import requests
import tweepy
import json

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the methods required to gather each data are different.
1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

In [2]:
# Read in the file into a DataFrame
df_archive = pd.read_csv('twitter-archive-enhanced.csv')

In [3]:
# Display the first few records of the DataFrame
df_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [4]:
# Use the get method to request for the data from the url
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)

In [5]:
# Save to a file
with open(url.split('/')[-1], mode='wb') as file:
    file.write(response.content)

In [6]:
# Read file into a DataFrame and view first few records
df_predictions = pd.read_csv('image-predictions.tsv', sep="\t")
df_predictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [7]:
# Using tweetpy create an API object to gather Tweeter data
consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

In [8]:
# Get Tweet IDs from the archive DataFrame
tweet_ids = df_archive.tweet_id.values

In [9]:
# Using the API Object to get tweets from Twitter API
#with open('tweet_json.txt', 'w') as file:
#    for tweet_id in tweet_ids:
#        try:
#            tweet_json = api.get_status(tweet_id, tweet_mood='extended')
        
#            json_data = json.dumps(tweet_json._json)
#           file.write(json_data)
#            file.write('\n')
#        except:
#            pass

In [10]:
# Create a list that will be converted to DataFrame
df_list = []
with open('tweet_json.txt', encoding='utf-8') as file:
    for line in file:
        tweet = json.loads(line)
        tweet_id = tweet['id']
        retweet_count = tweet['retweet_count']
        favorite_count = tweet['favorite_count']
        df_list.append({'tweet_id': tweet_id,
                        'retweet_count': retweet_count,
                        'favorite_count': favorite_count
                       })
df_api = pd.DataFrame(df_list, columns=['tweet_id','retweet_count','favorite_count'])

In [11]:
# Display the first few records of the DataFrame
df_api.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,6996,33771
1,892177421306343426,5297,29295
2,891815181378084864,3476,22029
3,891689557279858688,7210,36864
4,891327558926688256,7750,35249


## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issue**. You must use **both** visual assessment
programmatic assessement to assess the data.

**Note:** pay attention to the following key points when you access the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.



### Assessing archive DataFrame for quality issues

In [12]:
# View the dimensionality of the DataFrame
df_archive.shape

(2356, 17)

In [13]:
# Get the first few rows of the DataFrame
df_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [14]:
# Get concise summary of DataFrame
df_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

> From the summary report, there are errorneous data type (`tweet_id`, `timestamp`, `retweeted_status_timestamp`), retweeted records, missing expanded url (no images) records, x. These will be documented.

In [15]:
# Investigating further for null records in the `expanded_urls` column. 
# These are tweets with no images.
sum(df_archive['expanded_urls'].isnull())

59

> There are tweet with no images.

In [16]:
# Since expanded_urls are urls of the tweet images, having more than one record would mean duplicate records
# Checking for duplicate records using the expanded_urls column
df_archive[df_archive['expanded_urls'].duplicated()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
55,881633300179243008,8.816070e+17,4.738443e+07,2017-07-02 21:58:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@roushfenway These are good dogs but 17/10 is ...,,,,,17,10,,,,,
64,879674319642796034,8.795538e+17,3.105441e+09,2017-06-27 12:14:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@RealKentMurphy 14/10 confirmed,,,,,14,10,,,,,
75,878281511006478336,,,2017-06-23 16:00:04 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Shadow. In an attempt to reach maximum zo...,,,,"https://www.gofundme.com/3yd6y1c,https://twitt...",13,10,Shadow,,,,
76,878057613040115712,,,2017-06-23 01:10:23 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Emmy. She was adopted today. Massive r...,,,,https://twitter.com/dog_rates/status/878057613...,14,10,Emmy,,,,
98,873213775632977920,,,2017-06-09 16:22:42 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Sierra. She's one precious pupper. Abs...,,,,https://www.gofundme.com/help-my-baby-sierra-g...,12,10,Sierra,,,pupper,
113,870726314365509632,8.707262e+17,1.648776e+07,2017-06-02 19:38:25 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@ComplicitOwl @ShopWeRateDogs &gt;10/10 is res...,,,,,10,10,,,,,
126,868552278524837888,,,2017-05-27 19:39:34 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Say hello to Cooper. His expression is the sam...,,,,"https://www.gofundme.com/3ti3nps,https://twitt...",12,10,Cooper,,,,
135,866450705531457537,,,2017-05-22 00:28:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Jamesy. He gives a kiss to every other...,,,,https://twitter.com/dog_rates/status/866450705...,13,10,Jamesy,,,pupper,
136,866334964761202691,,,2017-05-21 16:48:45 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Coco. At first I thought she was a clo...,,,,https://twitter.com/dog_rates/status/866334964...,12,10,Coco,,,,
148,863427515083354112,8.634256e+17,7.759620e+07,2017-05-13 16:15:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@Jack_Septic_Eye I'd need a few more pics to p...,,,,,12,10,,,,,


> Investigating further to confirm for duplicate records in the `expanded_urls` column

In [17]:
# View all the content of the expanded url column for a particular tweet_id (case 1)
with pd.option_context("display.max_colwidth", -1):
    print(df_archive[df_archive['tweet_id'] == 667182792070062081]['expanded_urls'])

2286    https://twitter.com/dog_rates/status/667182792070062081/photo/1
Name: expanded_urls, dtype: object


In [18]:
# Conditionally select the record associated with this expanded url
df_archive[df_archive.expanded_urls == 'https://twitter.com/dog_rates/status/667182792070062081/photo/1']

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
604,798585098161549313,,,2016-11-15 17:54:59 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Timison. He just told a...,6.671828e+17,4196984000.0,2015-11-19 03:29:07 +0000,https://twitter.com/dog_rates/status/667182792...,10,10,Timison,,,,
2286,667182792070062081,,,2015-11-19 03:29:07 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Timison. He just told an awful joke bu...,,,,https://twitter.com/dog_rates/status/667182792...,10,10,Timison,,,,


In [19]:
# View all the content of the expanded url column for a particular tweet_id (case 2)
with pd.option_context("display.max_colwidth", -1):
    print(df_archive[df_archive['tweet_id'] == 839549326359670784]['expanded_urls'])

280    https://twitter.com/dog_rates/status/839549326359670784/photo/1
Name: expanded_urls, dtype: object


In [20]:
# Conditionally select the record associated with this expanded url
df_archive[df_archive.expanded_urls == 'https://twitter.com/dog_rates/status/839549326359670784/photo/1']

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
171,858860390427611136,,,2017-05-01 01:47:28 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Meet Winston. He knows he's a l...,8.395493e+17,4196984000.0,2017-03-08 18:52:12 +0000,https://twitter.com/dog_rates/status/839549326...,12,10,Winston,,,,
280,839549326359670784,,,2017-03-08 18:52:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Winston. He knows he's a little too big f...,,,,https://twitter.com/dog_rates/status/839549326...,12,10,Winston,,,,


In [21]:
# View all the content of the expanded url column for a particular tweet_id (case 3)
with pd.option_context("display.max_colwidth", -1):
    print(df_archive[df_archive['tweet_id'] == 873213775632977920]['expanded_urls'])

98    https://www.gofundme.com/help-my-baby-sierra-get-better,https://twitter.com/dog_rates/status/873213775632977920/photo/1,https://twitter.com/dog_rates/status/873213775632977920/photo/1
Name: expanded_urls, dtype: object


In [22]:
# Conditionally select the record associated with this expanded url
url = 'https://www.gofundme.com/help-my-baby-sierra-get-better,https://twitter.com/dog_rates/status/873213775632977920/photo/1,https://twitter.com/dog_rates/status/873213775632977920/photo/1'
df_archive[df_archive.expanded_urls == url]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
97,873337748698140672,,,2017-06-10 00:35:19 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Sierra. She's one preci...,8.732138e+17,4196984000.0,2017-06-09 16:22:42 +0000,https://www.gofundme.com/help-my-baby-sierra-g...,12,10,Sierra,,,pupper,
98,873213775632977920,,,2017-06-09 16:22:42 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Sierra. She's one precious pupper. Abs...,,,,https://www.gofundme.com/help-my-baby-sierra-g...,12,10,Sierra,,,pupper,


> From the three further investigations above, we can see that these records are the same but one is actually a retweet of the other. So removing the retweet would solve this duplicate issue. This also confirm that there are retweets.

In [23]:
# Selecting a random sample of records
# Horning into the last 5 columns only
df_archive.iloc[:, -5:].sample(5)

Unnamed: 0,name,doggo,floofer,pupper,puppo
1131,Puff,,,,
537,,,,,
283,,,,,
832,Oakley,,,,
653,Remus,,,,


> From the above, we see misrepresentation of missing values in the (dog name and dog stages) columns as None. There are also dog names with single letters and lowercase letters.

In [24]:
# Futher investigation of the name column.
# Select names that are all lowercase letters
lc_names = df_archive[df_archive.name.str.islower()]
lc_names.loc[:, ['tweet_id','text','name']].sample(5)

Unnamed: 0,tweet_id,text,name
2204,668636665813057536,This is an Irish Rigatoni terrier named Berta....,an
1815,676613908052996102,This is the saddest/sweetest/best picture I've...,the
1877,675109292475830276,C'mon guys. We've been over this. We only rate...,a
335,832645525019123713,There's going to be a dog terminal at JFK Airp...,not
801,772581559778025472,Guys this is getting so out of hand. We only r...,a


> It seem names that are lowercase letters are errors. Since the project details state that the name column was extracted from the text column, then investigsting more on the name column by comparing corresponding text column would be a wise course.

In [25]:
# View all the content of the text column for a particular tweet_id (case 1)
with pd.option_context("display.max_colwidth", -1):
    print(df_archive[df_archive['tweet_id'] == 668587383441514497]['text'])

2212    Never forget this vine. You will not stop watching for at least 15 minutes. This is the second coveted.. 13/10 https://t.co/roqIxCvEB3
Name: text, dtype: object


In [26]:
# Check the value in the name column for the same tweet_id
df_archive[df_archive['tweet_id'] == 668587383441514497]['name']

2212    the
Name: name, dtype: object

In [27]:
# View all the content of the text column for a particular tweet_id (case 2)
with pd.option_context("display.max_colwidth", -1):
    print(df_archive[df_archive['tweet_id'] == 666411507551481857]['text'])

2326    This is quite the dog. Gets really excited when not in water. Not very soft tho. Bad at fetch. Can't do tricks. 2/10 https://t.co/aMCTNWO94t
Name: text, dtype: object


In [28]:
# Check the value in the name column for the same tweet_id
df_archive[df_archive['tweet_id'] == 666411507551481857]['name']

2326    quite
Name: name, dtype: object

In [29]:
# View all the content of the text column for a particular tweet_id (case 3)
with pd.option_context("display.max_colwidth", -1):
    print(df_archive[df_archive['tweet_id'] == 666050758794694657]['text'])

2350    This is a truly beautiful English Wilson Staff retriever. Has a nice phone. Privileged. 10/10 would trade lives with https://t.co/fvIbQfHjIe
Name: text, dtype: object


In [30]:
# Check the value in the name column for the same tweet_id
df_archive[df_archive['tweet_id'] == 666050758794694657]['name']

2350    a
Name: name, dtype: object

> From the three further investigation, we can see that the algorithm used to extract the names made few mistakes. It seems after a few checks that all the lowercase names are erroneous.

In [224]:
# View all the content of the text column for a particular tweet_id (case 3)
with pd.option_context("display.max_colwidth", -1):
    print(df_archive[df_archive['tweet_id'] == 699072405256409088]['text'])

1408    ERMAHGERD 12/10 please enjoy https://t.co/7WrAWKdBac
Name: text, dtype: object


In [225]:
# View all the content of the text column for a particular tweet_id (case 3)
with pd.option_context("display.max_colwidth", -1):
    print(df_archive[df_archive['tweet_id'] == 838952994649550848]['text'])

283    SHE MISPLACED HER HOOMAN 13/10 MISTAKES HAPPEN https://t.co/ngAxYLVYHP
Name: text, dtype: object


> There are some records with uppercase values in the text column

In [31]:
# Investigating for records beyond August 1st, 2017
len(df_archive[df_archive['timestamp'] > '2017-08-01'])

2

> There are records beyound August 1st, 2017

### Accessing predictions DataFrame for quality issues

In [32]:
# View the dimensionality of the DataFrame
df_predictions.shape

(2075, 12)

In [33]:
# Get the first few rows of the DataFrame
df_predictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [34]:
# Get concise summary of the DataFrame
df_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


> The tweet id column has int64 data type while it will never be used to perform any calculation, thus incorrect data type issue

In [35]:
# Get descriptive statistics of the DataFrame
df_predictions.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


> The maximum value and the minimum value for the `img_num` column seems to be very close. Let's investigate further into that column

In [36]:
# Count unique records
df_predictions.img_num.value_counts()

1    1780
2     198
3      66
4      31
Name: img_num, dtype: int64

In [37]:
# Confirm the unique records
df_predictions.img_num.unique()

array([1, 4, 2, 3])

> There are only a limited number of values in the `img_num` column, thus a more fitting data type should be choosen.

In [38]:
# Checking for duplicate records
df_predictions[df_predictions.jpg_url.duplicated()]

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1297,752309394570878976,https://pbs.twimg.com/ext_tw_video_thumb/67535...,1,upright,0.303415,False,golden_retriever,0.181351,True,Brittany_spaniel,0.162084,True
1315,754874841593970688,https://pbs.twimg.com/media/CWza7kpWcAAdYLc.jpg,1,pug,0.272205,True,bull_mastiff,0.251530,True,bath_towel,0.116806,False
1333,757729163776290825,https://pbs.twimg.com/media/CWyD2HGUYAQ1Xa7.jpg,2,cash_machine,0.802333,False,schipperke,0.045519,True,German_shepherd,0.023353,True
1345,759159934323924993,https://pbs.twimg.com/media/CU1zsMSUAAAS0qW.jpg,1,Irish_terrier,0.254856,True,briard,0.227716,True,soft-coated_wheaten_terrier,0.223263,True
1349,759566828574212096,https://pbs.twimg.com/media/CkNjahBXAAQ2kWo.jpg,1,Labrador_retriever,0.967397,True,golden_retriever,0.016641,True,ice_bear,0.014858,False
1364,761371037149827077,https://pbs.twimg.com/tweet_video_thumb/CeBym7...,1,brown_bear,0.713293,False,Indian_elephant,0.172844,False,water_buffalo,0.038902,False
1368,761750502866649088,https://pbs.twimg.com/media/CYLDikFWEAAIy1y.jpg,1,golden_retriever,0.586937,True,Labrador_retriever,0.398260,True,kuvasz,0.005410,True
1387,766078092750233600,https://pbs.twimg.com/media/ChK1tdBWwAQ1flD.jpg,1,toy_poodle,0.420463,True,miniature_poodle,0.132640,True,Chesapeake_Bay_retriever,0.121523,True
1407,770093767776997377,https://pbs.twimg.com/media/CkjMx99UoAM2B1a.jpg,1,golden_retriever,0.843799,True,Labrador_retriever,0.052956,True,kelpie,0.035711,True
1417,771171053431250945,https://pbs.twimg.com/media/CVgdFjNWEAAxmbq.jpg,3,Samoyed,0.978833,True,Pomeranian,0.012763,True,Eskimo_dog,0.001853,True


In [39]:
# Investigating deeper for duplicates
# Conditionally select records for one of the jpg_url
df_predictions[df_predictions['jpg_url'] == 'https://pbs.twimg.com/media/CWza7kpWcAAdYLc.jpg']

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
591,679158373988876288,https://pbs.twimg.com/media/CWza7kpWcAAdYLc.jpg,1,pug,0.272205,True,bull_mastiff,0.25153,True,bath_towel,0.116806,False
1315,754874841593970688,https://pbs.twimg.com/media/CWza7kpWcAAdYLc.jpg,1,pug,0.272205,True,bull_mastiff,0.25153,True,bath_towel,0.116806,False


In [40]:
# Conditionally select records for another jpg_url
df_predictions[df_predictions['jpg_url'] == 'https://pbs.twimg.com/media/DFDw2tyUQAAAFke.jpg']

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
2051,887473957103951883,https://pbs.twimg.com/media/DFDw2tyUQAAAFke.jpg,2,Pembroke,0.809197,True,Rhodesian_ridgeback,0.05495,True,beagle,0.038915,True
2055,888202515573088257,https://pbs.twimg.com/media/DFDw2tyUQAAAFke.jpg,2,Pembroke,0.809197,True,Rhodesian_ridgeback,0.05495,True,beagle,0.038915,True


In [41]:
# Conditionally select records for another jpg_url
df_predictions[df_predictions['jpg_url'] == 'https://pbs.twimg.com/media/Cq9guJ5WgAADfpF.jpg']

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1405,769940425801170949,https://pbs.twimg.com/media/Cq9guJ5WgAADfpF.jpg,1,miniature_pinscher,0.796313,True,Chihuahua,0.155413,True,Staffordshire_bullterrier,0.030943,True
1803,832040443403784192,https://pbs.twimg.com/media/Cq9guJ5WgAADfpF.jpg,1,miniature_pinscher,0.796313,True,Chihuahua,0.155413,True,Staffordshire_bullterrier,0.030943,True


> We can confirm that these are truly duplicate records. They have the exact same information.

In [42]:
# View a random sample of DataFrame
df_predictions.sample(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
677,683481228088049664,https://pbs.twimg.com/media/CXw2jSpWMAAad6V.jpg,1,keeshond,0.508951,True,chow,0.442016,True,German_shepherd,0.013206,True
2024,881666595344535552,https://pbs.twimg.com/media/DDxPFwbWAAEbVVR.jpg,1,Saluki,0.529012,True,Afghan_hound,0.250003,True,golden_retriever,0.160739,True
300,671488513339211776,https://pbs.twimg.com/media/CVGbPgrWIAAQ1fB.jpg,1,hermit_crab,0.528761,False,snail,0.185644,False,shower_curtain,0.066361,False
1891,849051919805034497,https://pbs.twimg.com/media/C8hwNxbXYAAwyVG.jpg,1,fountain,0.997509,False,American_black_bear,0.001413,False,sundial,0.000681,False
1910,853299958564483072,https://pbs.twimg.com/media/C9eHyF7XgAAOxPM.jpg,1,grille,0.65228,False,beach_wagon,0.112846,False,convertible,0.086252,False


> The values in columns `p1`, `p2`, `p3` are inconsitent and there are inapproprate use of underbars.

### Accessing api DataFrame for quality issues

In [43]:
# Check the dimensionality of the DataFrame
df_api.shape

(873, 3)

In [44]:
# Get the first few rows of the DataFrame
df_api.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,6996,33771
1,892177421306343426,5297,29295
2,891815181378084864,3476,22029
3,891689557279858688,7210,36864
4,891327558926688256,7750,35249


In [45]:
# Get concise summary of the DataFrame
df_api.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 873 entries, 0 to 872
Data columns (total 3 columns):
tweet_id          873 non-null int64
retweet_count     873 non-null int64
favorite_count    873 non-null int64
dtypes: int64(3)
memory usage: 20.5 KB


In [46]:
# Get descriptive statistics of the DataFrame
df_api.describe()

Unnamed: 0,tweet_id,retweet_count,favorite_count
count,873.0,873.0,873.0
mean,8.201381e+17,4346.10882,12978.088202
std,3.687956e+16,4892.892371,13592.427774
min,7.588287e+17,1.0,0.0
25%,7.895992e+17,1804.0,4995.0
50%,8.185365e+17,2968.0,9906.0
75%,8.482121e+17,5089.0,18250.0
max,8.924206e+17,51603.0,123901.0


> From the two reports above, we see that `tweet_id` columns is an int64 data type but it will never be used to perform any calculation. Thus, it should be considered as incorrect data type.

### Asessing Archive DataFrame for tidyness issues

In [47]:
# Display the first few records
df_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


> Looking at the `doggo`, `floofer`, `pupper`, `puppo` columns, we see that they are actually values not variable names. So we will have a new column that will take all four of them as values.

### Assessing predictions DataFrame for tidyness issues

In [48]:
# Display the first few records
df_predictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


> This table after being cleaned has no tidyness issues. It meets all the three requirements for a tidy data. 
    * Each variable forms a column
    * Each observation forms a row
    * Each type of observational unit forms a table

### Assessing API DataFrame for tidyness issues

In [49]:
# Display first few records
df_api.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,6996,33771
1,892177421306343426,5297,29295
2,891815181378084864,3476,22029
3,891689557279858688,7210,36864
4,891327558926688256,7750,35249


> The DataFrame does not form an observational unit on its own while it satisfies the other two requirements. It would make sense for it to be joined with the archive table.

### Quality issues

####  ~ archive DataFrame
1. incorrect data type for `tweet_id`, `timestamp`, `retweet_status_timestamp` columns

2. tweet records with no images

3. some records are retweets

4. null values represnted as `None` in dog name and dog stage columns

5. dog names starting with small letters are erroneous

6. some values in the text column are uppercase

7. records beyond August 1st, 2017

#### ~ predictions DataFrame
8. incorrect data type for `tweet_id`, `img_num` columns

9. duplicate `jpg_url` records

10. inappropriate use of underbars and inconsistent cases in p1, p2, p3 columns

#### ~ api DataFrame
11. incorrect data type for `tweet_id` column

### Tidiness issues
1. the columns `doggo`, `floofer`, `pupper`, `puppo` are values not variable names

2. data from the Tweeter API is the same observational unit with the archive data frame

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [50]:
# Make copies of original pieces of data
archive_copy = df_archive.copy()
predictions_copy = df_predictions.copy()
api_copy = df_api.copy()

## Quality Issues

Here, all the 11 quality issues found in the above documentation will be cleaned

### Issue #1: 

Incorrect data type for `tweet_id`, `timestamp` and `retweet_status_timestamp` columns in archive DataFrame

#### Define
- Convert `tweet_id` column to object data type using `astype` method.
- Convert `timestamp` and `retweet_status_timestamp` columns to datetime data type using pandas `to_datetime` method.

#### Code

In [51]:
# Convert data type to object
archive_copy.tweet_id = archive_copy['tweet_id'].astype(str)

In [52]:
# Convert data type to datetime
archive_copy.timestamp = pd.to_datetime(archive_copy.timestamp)
archive_copy.retweeted_status_timestamp = pd.to_datetime(archive_copy.retweeted_status_timestamp)

#### Test

In [53]:
# Get concise summary report
archive_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null object
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null datetime64[ns]
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null datetime64[ns]
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: datetim

### Issue #2:

Tweet records with no images

#### Define:
- Select records in the `expanded_urls` columns with `NaN` value

#### Code

In [54]:
# Conditionally select records
archive_copy = archive_copy[archive_copy['expanded_urls'].notnull()]

#### Test

In [55]:
# Check for isnull records on the expanded url column
sum(archive_copy.expanded_urls.isnull())

0

In [56]:
# Check the dimension of the dataframe
archive_copy.shape

(2297, 17)

### Issue #3:

Some records are retweets

#### Define
- Select all records with `NaN` value in the `tweeted_status_id` column

#### Code

In [57]:
# Select only records that have null values
archive_copy = archive_copy[archive_copy['retweeted_status_id'].isnull()]

#### Test

In [58]:
sum(archive_copy.retweeted_status_id.notnull())

0

In [59]:
# Check the dimension of the dataframe
archive_copy.shape

(2117, 17)

### Issue #4

Null values represnted as None in dog name and dog stage columns

#### Define
- Replace all `None` values in the `name`, `doggo`, `floofer`, `pupper`, `puppo` columns with `NaN`

In [60]:
archive_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2117 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2117 non-null object
in_reply_to_status_id         23 non-null float64
in_reply_to_user_id           23 non-null float64
timestamp                     2117 non-null datetime64[ns]
source                        2117 non-null object
text                          2117 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null datetime64[ns]
expanded_urls                 2117 non-null object
rating_numerator              2117 non-null int64
rating_denominator            2117 non-null int64
name                          2117 non-null object
doggo                         2117 non-null object
floofer                       2117 non-null object
pupper                        2117 non-null object
puppo                         2117 non-null object
dtypes: datetime64[ns

#### Code

In [61]:
# Replace each column with correct value
archive_copy.name = archive_copy.name.replace('None', np.nan)
archive_copy.doggo = archive_copy.doggo.replace('None', np.nan)
archive_copy.floofer = archive_copy.floofer.replace('None', np.nan)
archive_copy.pupper = archive_copy.pupper.replace('None', np.nan)
archive_copy.puppo = archive_copy.puppo.replace('None', np.nan)

#### Test

In [62]:
archive_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2117 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2117 non-null object
in_reply_to_status_id         23 non-null float64
in_reply_to_user_id           23 non-null float64
timestamp                     2117 non-null datetime64[ns]
source                        2117 non-null object
text                          2117 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null datetime64[ns]
expanded_urls                 2117 non-null object
rating_numerator              2117 non-null int64
rating_denominator            2117 non-null int64
name                          1495 non-null object
doggo                         84 non-null object
floofer                       10 non-null object
pupper                        232 non-null object
puppo                         24 non-null object
dtypes: datetime64[ns](2), f

In [75]:
# View a sample of the dataframe horning down into the last five columns
archive_copy.iloc[:, -5:].sample(5)

Unnamed: 0,name,doggo,floofer,pupper,puppo
1358,Charlie,,,,
1565,Kilo,,,,
2352,a,,,,
106,Lassie,,,,
516,Sam,,,,


In [76]:
# Select records with value None in name column
archive_copy[archive_copy['name'] == 'None']

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


In [77]:
# Select records with value floofer in name column
archive_copy[archive_copy['floofer'] == 'None']

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


In [78]:
# Select records with value None in puppo column
archive_copy[archive_copy['puppo'] == 'None']

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


### Issue #5

Dog names starting with small letters are erroneous. So here, all lowercase names will be replaced with `NaN` because they were erroneiusly picked by the algorithm.

#### Define
- Replace all lowercase names with `NaN`

#### Code

In [97]:
# Replace all lowercase values in the name column with `NaN`
archive_copy.name = archive_copy.name.replace(r'\b[a-z]', np.nan, regex=True)

#### Test

In [110]:
# Display a random sample of values in the name column 
archive_copy[archive_copy.name.notnull()].sample(5)['name']

452         Bear
1061    Theodore
1013     Atticus
1824     Griffin
92           Jed
Name: name, dtype: object

In [113]:
# Select records with value 'a' in name column
archive_copy[archive_copy['name'] == 'a']

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


In [114]:
# Select records with value 'the' in name column
archive_copy[archive_copy['name'] == 'the']

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


In [111]:
# Get concise summary of dataframe
archive_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2117 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2117 non-null object
in_reply_to_status_id         23 non-null float64
in_reply_to_user_id           23 non-null float64
timestamp                     2117 non-null datetime64[ns]
source                        2117 non-null object
text                          2117 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null datetime64[ns]
expanded_urls                 2117 non-null object
rating_numerator              2117 non-null int64
rating_denominator            2117 non-null int64
name                          1391 non-null object
doggo                         84 non-null object
floofer                       10 non-null object
pupper                        232 non-null object
puppo                         24 non-null object
dtypes: datetime64[ns](2), f

### Issue #6

Some values in the text column are uppercase.

#### Define
- Change the cases from uppercase to capitalize case.

#### Code

In [318]:
archive_copy.loc[:, ['tweet_id','text']].sample(5)

Unnamed: 0,tweet_id,text
2106,670468609693655041,This is Edd. He's a Czechoslovakian Googolplex...
1583,687102708889812993,Army of water dogs here. None of them know whe...
1815,676613908052996102,This is the saddest/sweetest/best picture I've...
2275,667453023279554560,Meet Cupcake. I would do unspeakable things fo...
487,814153002265309185,This is Oshie. He's ready to party. Bought tha...


In [222]:
# View all the content of the text column for a particular tweet_id
with pd.option_context("display.max_colwidth", -1):
    print(archive_copy[archive_copy['tweet_id'] == '699072405256409088']['text'])

1408    ERMAHGERD 12/10 please enjoy https://t.co/7WrAWKdBac
Name: text, dtype: object


In [223]:
# View all the content of the text column for a particular tweet_id
with pd.option_context("display.max_colwidth", -1):
    print(archive_copy[archive_copy['tweet_id'] == '838952994649550848']['text'])

283    SHE MISPLACED HER HOOMAN 13/10 MISTAKES HAPPEN https://t.co/ngAxYLVYHP
Name: text, dtype: object


In [304]:
# Change all values of the text column to capitalize case

# def capitalize_text(tweet):
#     if tweet['text'].str.extract(r'([A-Z])'):
#         return tweet['text'].str.capitalize()
#     else:
#         return tweet['text']
    
def cap_txt(tweet):
    return tweet['text'].capitalize()

archive_copy['text'] = archive_copy.apply(cap_txt, axis=1)

#### Test

In [319]:
archive_copy3.loc[:, ['tweet_id','text']].sample(5)

Unnamed: 0,tweet_id,text
572,801285448605831168,Oh h*ck 10/10 https://t.co/bc69rrw559
1090,737678689543020544,We only rate dogs. pls stop sending in non-can...
2073,670995969505435648,Yea i can't handle this job anymore your dogs ...
1396,699788877217865730,Say hello to nala. she's a freckled high brusc...
2002,672481316919734272,Say hello to jazz. she should be on the cover ...


In [306]:
# View all the content of the text column for a particular tweet_id
with pd.option_context("display.max_colwidth", -1):
    print(archive_copy[archive_copy['tweet_id'] == '838952994649550848']['text'])

283    She misplaced her hooman 13/10 mistakes happen https://t.co/ngaxylvyhp
Name: text, dtype: object


In [307]:
archive_copy[archive_copy['tweet_id'] == '699072405256409088']['text']

1408    Ermahgerd 12/10 please enjoy https://t.co/7wra...
Name: text, dtype: object

### Issue #7

There are records beyond August 1st, 2017

#### Define
- records beyond August 1st, 2017

#### Code

In [320]:
# Select records that have timestap < August 1st, 2017
archive_copy = archive_copy[archive_copy['timestamp'] < '2017-08-01']

#### Test

In [322]:
# Select records beyond August 1st, 2017
len(archive_copy[archive_copy['timestamp'] > '2017-08-01'])

0

### Issue #8

Incorrect data type for `tweet_id`, `img_num` columns in the predictions DataFrame

#### Define
- Convert `tweet_id` column to object data type using `astype` method.
- Convert `img_num` column to category data type using pandas `astype` method.

#### Code

In [369]:
# Convert data type to object
predictions_copy.tweet_id = predictions_copy['tweet_id'].astype(str)

In [370]:
# Convert data type to category
predictions_copy.img_num = predictions_copy['img_num'].astype('category')

#### Test

In [371]:
# Get concise summary of the DataFrame
predictions_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null object
jpg_url     2075 non-null object
img_num     2075 non-null category
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), category(1), float64(3), object(5)
memory usage: 138.1+ KB


### Issue #9

Duplicate `jpg_url` records in the predictions DataFrame

#### Define
- Remove the duplicated records from the DataFrame using the `drop_duplicates` method

#### Code

In [372]:
# Remove rows that are duplicates using the `jpg_url` column
predictions_copy.drop_duplicates(subset=['jpg_url'], inplace=True)

#### Test

In [373]:
# Checking for duplicate records
predictions_copy[predictions_copy.jpg_url.duplicated()]

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog


In [374]:
# Get concise summary of the DataFrame
predictions_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2009 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2009 non-null object
jpg_url     2009 non-null object
img_num     2009 non-null category
p1          2009 non-null object
p1_conf     2009 non-null float64
p1_dog      2009 non-null bool
p2          2009 non-null object
p2_conf     2009 non-null float64
p2_dog      2009 non-null bool
p3          2009 non-null object
p3_conf     2009 non-null float64
p3_dog      2009 non-null bool
dtypes: bool(3), category(1), float64(3), object(5)
memory usage: 149.3+ KB


### Issue #10

Inappropriate use of underbars and inconsistent cases in `p1`, `p2`, `p3` columns in predictions DataFrame

#### Define
- Change all values to lower case
- Replace all `_` with spaces

#### Code

In [375]:
# Change all values of the p1, p2, p3 columns to capitalize case
predictions_copy['p1'] = predictions_copy['p1'].str.lower()
predictions_copy['p2'] = predictions_copy['p2'].str.lower()
predictions_copy['p3'] = predictions_copy['p3'].str.lower()

In [387]:
# Replace all `_` with spaces
predictions_copy.p1 = predictions_copy.p1.str.replace('_', ' ')
predictions_copy.p2 = predictions_copy.p2.str.replace('_', ' ')
predictions_copy.p3 = predictions_copy.p3.str.replace('_', ' ')

#### Test

In [394]:
# Display random sample of dataframe
predictions_copy.loc[:, ['p1','p2','p3']].sample(5)

Unnamed: 0,p1,p2,p3
1715,standard poodle,toy poodle,tibetan terrier
676,italian greyhound,whippet,german short-haired pointer
606,hog,hay,lumbermill
1800,bloodhound,great dane,leonberg
1457,cocker spaniel,golden retriever,otterhound


### Issue #11

Incorrect data type for `tweet_id` column in api dataframe

#### Define
- Convert `tweet_id` column to object data type using `astype` method.

#### Code

In [395]:
# Convert data type to object
api_copy.tweet_id = api_copy['tweet_id'].astype(str)

#### Test

In [396]:
# Get concise summary of the DataFrame
api_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 873 entries, 0 to 872
Data columns (total 3 columns):
tweet_id          873 non-null object
retweet_count     873 non-null int64
favorite_count    873 non-null int64
dtypes: int64(2), object(1)
memory usage: 20.5+ KB


## Tidiness Issue

Here, all the tidiness issues documented above will be cleaned

### Issue #1

The columns `doggo`, `floofer`, `pupper`, `puppo` are are values not variable names

#### Define
- Unpivot the columns `doggo`, `floofer`, `pupper`, `puppo` into values of a new column `dog_stage` using the `melt` method

#### Code

In [411]:
id_vars = list(archive_copy)
id_vars

['expanded_urls',
 'in_reply_to_status_id',
 'in_reply_to_user_id',
 'name',
 'rating_denominator',
 'rating_numerator',
 'retweeted_status_id',
 'retweeted_status_timestamp',
 'retweeted_status_user_id',
 'source',
 'text',
 'timestamp',
 'tweet_id',
 'dog_stage',
 'stage']

In [408]:
archive_copy = pd.melt(archive_copy, id_vars=id_vars, var_name='dog_stage', value_name='stage')

#### Test

In [409]:
archive_copy

Unnamed: 0,expanded_urls,in_reply_to_status_id,in_reply_to_user_id,name,rating_denominator,rating_numerator,retweeted_status_id,retweeted_status_timestamp,retweeted_status_user_id,source,text,timestamp,tweet_id,dog_stage,stage
0,https://twitter.com/dog_rates/status/891815181...,,,Archie,10,12,,NaT,,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,2017-07-31 00:18:03,891815181378084864,dog_stage,doggo
1,https://twitter.com/dog_rates/status/891689557...,,,Darla,10,13,,NaT,,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,2017-07-30 15:58:51,891689557279858688,dog_stage,doggo
2,https://twitter.com/dog_rates/status/891327558...,,,Franklin,10,12,,NaT,,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,2017-07-29 16:00:24,891327558926688256,dog_stage,doggo
3,https://twitter.com/dog_rates/status/891087950...,,,,10,13,,NaT,,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,2017-07-29 00:08:17,891087950875897856,dog_stage,doggo
4,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",,,Jax,10,13,,NaT,,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,2017-07-28 16:27:12,890971913173991426,dog_stage,doggo
5,https://twitter.com/dog_rates/status/890729181...,,,,10,13,,NaT,,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,2017-07-28 00:22:40,890729181411237888,dog_stage,doggo
6,https://twitter.com/dog_rates/status/890609185...,,,Zoey,10,13,,NaT,,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,2017-07-27 16:25:51,890609185150312448,dog_stage,doggo
7,https://twitter.com/dog_rates/status/890240255...,,,Cassie,10,14,,NaT,,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,2017-07-26 15:59:51,890240255349198849,dog_stage,doggo
8,https://twitter.com/dog_rates/status/890006608...,,,Koda,10,13,,NaT,,"<a href=""http://twitter.com/download/iphone"" r...",This is Koda. He is a South Australian decksha...,2017-07-26 00:31:25,890006608113172480,dog_stage,doggo
9,https://twitter.com/dog_rates/status/889880896...,,,Bruno,10,13,,NaT,,"<a href=""http://twitter.com/download/iphone"" r...",This is Bruno. He is a service shark. Only get...,2017-07-25 16:11:53,889880896479866881,dog_stage,doggo


### Issue #2

#### Define
- data from the Tweeter API is the same observational unit with the archive data frame

#### Code

#### Test

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Insights:
1.

2.

3.

### Visualization