In [2]:
import pandas as pd
import requests
import os
import glob
import json
import numpy

## Data Gathering

Download the WeRateDogs *[Twitter archive](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv)* by clicking on the download link.

Use the `pd.read_csv` method to load the data into a Pandas Dataframe

In [3]:
archives_df = pd.read_csv('twitter-archive-enhanced.csv')

Use the Requests library to download the tweet image prediction file (image_predictions.tsv)

In [4]:
#Make a new folder to hold downloaded image-predictions tsv file

folder = 'Predictions'
if not os.path.exists(folder):
    os.makedirs(folder)

In [6]:
#Download file using the requests library 

url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)

In [7]:
#write the contents of the response into a file in the Predictions folder

with open(os.path.join(folder, url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)


Confirm the contents of the `Predictions` folder:

In [8]:
os.listdir('Predictions')

['image-predictions.tsv']

Now that we have confirmed that the image-predictions file has been successfully created, we can proceed with reading its contents and loading them into a dataframe

In [9]:
#Read Data from the Image Predictions file into a dataframe

predictions_df = pd.read_csv('Predictions\image-predictions.tsv', sep='\t')

 

We are going to download additional data to supplement our datasets from this **[link](https://video.udacity-data.com/topher/2018/November/5be5fb7d_tweet-json/tweet-json.txt)**. I opted for a direct download because my request for elavated access to twitter's API was rejected.   

Once you've downloaded the tweet-json text file move it into your working directory. 

In [10]:
#Create an empty list that will be used to hold the dictionary objects from the Json file

json_list = []

Read JSON Data from the `tweet-json.txt` Text File

In [11]:
'''
 > Use glob module to loop through and find text files in your working directory
 > Employ a for loop together with .readline() to read lines in the text file
 > Parse the String returned by the readline method into a Python dictionary with the json.loads method
 > Create a python dictionary that will be loaded into the empty Json_list list
 > Confirm if dictionary is present in list, if not append to list, else skip. This will ensure we dont append duplicates
'''

for tweet_file in glob.glob('*.txt'):
    with open(tweet_file, encoding='utf-8') as file:
        for line in file:
            txt_obj = file.readline()
            tweet_obj = json.loads(txt_obj)
            tweet_dict = {'tweet_id' : tweet_obj['id'], 'created_at' : tweet_obj['created_at'], 
                          'retweet_count' : tweet_obj['retweet_count'],'favorite_count' : tweet_obj['favorite_count'],
                          'followers_count' : tweet_obj['user']['followers_count']}
            if tweet_dict not in json_list:
                json_list.append(tweet_dict)

Let's check whether the objects loaded into our json_list were of the Python *`dict`* datatype  

In [12]:
#Type of the zero-indexed object in the json_list list

assert(type(json_list[0]) == dict)

Optionally, we can use the *`isinstance`* function that checks and returns *`True`* if the specified object is a subclass or an instance of the indicated type

In [13]:
#Type of last-indexed object in the json_list

isinstance(json_list[-1], dict)

True

We'll use the *`from_dict`* method to construct a dataframe from our dictionaries in the json_list list. 

In [14]:

additional_tweets_df = pd.DataFrame.from_dict(json_list)

Check if the tweet_ids in the newly created additional_tweets_df are in the archives dataframe.   

In [15]:
#counts the number of items in the additional_tweets_df, tweet_id column and checks if item is present in archive

items = 0
for item in set(list(additional_tweets_df.tweet_id)):
    if item in list(archives_df.tweet_id):
        items += 1

In [16]:
'''
    compare the count against the length of the additional tweets dataframe
    if count is equal to length of additional df, then all tweets in new df were in the archived df 
'''

items == len(list(additional_tweets_df.tweet_id))

True

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issue**. You must use **both** visual assessment
programmatic assessement to assess the data.

**Note:** pay attention to the following key points when you access the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.

Data quality issues: Data that has quality issues have issues with content like missing, duplicate, or incorrect data. This is called dirty data.
Lack of tidiness: Data that has specific structural issues that slow you down when cleaning and analyzing, visualizing, or modeling your data later.


### Quality issues
1. archives_df - duplicate tweets inform of retweets

2. archives_df - tweet_id is an integer instead of a string object

3. archives_df - *timestamp* is a string object 

4. 

5. archives_df - missing *names* represented as None 

6. archives_df - *names* extracted incorrectly in some cases i.e 'a', 'the', 'an'

6. archives_df - *in_reply_to_user_id*, *in_reply_to_status_id*, *retweeted_status_timestamp*, *retweeted_status_id* and *retweeted_status_user_id* columns have null values  

7. archive_df - *rating_denominator* is more than 10 in some cases 

8. archive_df - 

### Tidiness Issues

1. The *doggo*, *floofer*, *pupper* and *puppo* identifiers should unpivoted into a single column

2. The *created_at* column in the additional_tweets_df should be dropped, it is already present in the archives_df as *timestamp*. 

let's make a visual inspection of our first dataframe

In [32]:
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_row', None)

In [34]:
archives_df.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
669,790227638568808452,,,2016-10-23 16:25:25 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is Leela. She's a Fetty Woof. Lost eye while saving a baby from an avalanche. 11/10 true h*ckin hero https://t.co/2lBg3…,7.626999e+17,4196984000.0,2016-08-08 17:19:51 +0000,https://twitter.com/dog_rates/status/762699858130116608/photo/1,11,10,Leela,,,,
1889,674774481756377088,,,2015-12-10 02:15:47 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Superpup. His head isn't proportional to his body. Has yet to serve any justice. 11/10 maybe one day pupper https://t.co/gxIFgg8ktm,,,,https://twitter.com/dog_rates/status/674774481756377088/photo/1,11,10,Superpup,,,pupper,
1516,690959652130045952,,,2016-01-23 18:09:53 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This golden is happy to refute the soft mouth egg test. Not a fan of sweeping generalizations. 11/10 #notallpuppers https://t.co/DgXYBDMM3E,,,,"https://twitter.com/dog_rates/status/690959652130045952/photo/1,https://twitter.com/dog_rates/status/690959652130045952/photo/1,https://twitter.com/dog_rates/status/690959652130045952/photo/1,https://twitter.com/dog_rates/status/690959652130045952/photo/1",11,10,,,,,
1380,701214700881756160,,,2016-02-21 01:19:47 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","I'm not sure what's happening here, but it's pretty spectacular. 12/10 for both https://t.co/JKXh0NbBNL",,,,https://twitter.com/dog_rates/status/701214700881756160/photo/1,12,10,,,,,
121,869596645499047938,,,2017-05-30 16:49:31 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Scout. He just graduated. Officially a doggo now. Have fun with taxes and losing sight of your ambitions. 12/10 would throw cap for https://t.co/DsA2hwXAJo,,,,"https://twitter.com/dog_rates/status/869596645499047938/photo/1,https://twitter.com/dog_rates/status/869596645499047938/photo/1",12,10,Scout,doggo,,,


In [53]:
archives_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

Check for duplicates in the *`tweet_id`* column

In [20]:
archives_df['tweet_id'].duplicated().sum()

0

There are **no** duplicates in the archives_df dataframe. 

In [21]:
archives_df.duplicated().sum()

0

Let us examine the *name* column

In [38]:
archives_df.name.value_counts()

None              745
a                  55
Charlie            12
Cooper             11
Lucy               11
Oliver             11
Tucker             10
Penny              10
Lola               10
Winston             9
Bo                  9
Sadie               8
the                 8
Daisy               7
Buddy               7
Toby                7
an                  7
Bailey              7
Leo                 6
Oscar               6
Stanley             6
Rusty               6
Bella               6
Dave                6
Jack                6
Koda                6
Scout               6
Jax                 6
Milo                6
Louis               5
Phil                5
Sammy               5
Gus                 5
Chester             5
Alfie               5
Sunny               5
Bentley             5
very                5
Oakley              5
Finn                5
George              5
Larry               5
Clarence            4
Loki                4
Walter              4
Reggie    

In [57]:
archives_df[archives_df.name == 'a'].sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2350,666050758794694657,,,2015-11-16 00:30:50 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is a truly beautiful English Wilson Staff retriever. Has a nice phone. Privileged. 10/10 would trade lives with https://t.co/fvIbQfHjIe,,,,https://twitter.com/dog_rates/status/666050758794694657/photo/1,10,10,a,,,,
1017,746872823977771008,,,2016-06-26 01:08:52 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is a carrot. We only rate dogs. Please only send in dogs. You all really should know this by now ...11/10 https://t.co/9e48aPrBm2,,,,"https://twitter.com/dog_rates/status/746872823977771008/photo/1,https://twitter.com/dog_rates/status/746872823977771008/photo/1",11,10,a,,,,
2125,670361874861563904,,,2015-11-27 22:01:40 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is a Rich Mahogany Seltzer named Cherokee. Just got destroyed by a snowball. Isn't very happy about it. 9/10 https://t.co/98ZBi6o4dj,,,,https://twitter.com/dog_rates/status/670361874861563904/photo/1,9,10,a,,,,
1351,704054845121142784,,,2016-02-28 21:25:30 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here is a whole flock of puppers. 60/50 I'll take the lot https://t.co/9dpcw6MdWa,,,,https://twitter.com/dog_rates/status/704054845121142784/photo/1,60,50,a,,,,
649,792913359805018113,,,2016-10-31 02:17:31 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here is a perfect example of someone who has their priorities in order. 13/10 for both owner and Forrest https://t.co/LRyMrU7Wfq,,,,"https://twitter.com/dog_rates/status/792913359805018113/photo/1,https://twitter.com/dog_rates/status/792913359805018113/photo/1,https://twitter.com/dog_rates/status/792913359805018113/photo/1,https://twitter.com/dog_rates/status/792913359805018113/photo/1",13,10,a,,,,


In [59]:
archives_df[archives_df.name == 'an'].sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
759,778396591732486144,,,2016-09-21 00:53:04 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is an East African Chalupa Seal. We only rate dogs. Please only send in dogs. Thank you... 10/10 https://t.co/iHe6liLwWR,7.030419e+17,4196984000.0,2016-02-26 02:20:37 +0000,"https://twitter.com/dog_rates/status/703041949650034688/photo/1,https://twitter.com/dog_rates/status/703041949650034688/photo/1",10,10,an,,,,
1025,746369468511756288,,,2016-06-24 15:48:42 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is an Iraqi Speed Kangaroo. It is not a dog. Please only send in dogs. I'm very angry with all of you ...9/10 https://t.co/5qpBTTpgUt,,,,https://twitter.com/dog_rates/status/746369468511756288/photo/1,9,10,an,,,,
2333,666337882303524864,,,2015-11-16 19:31:45 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is an extremely rare horned Parthenon. Not amused. Wears shoes. Overall very nice. 9/10 would pet aggressively https://t.co/QpRjllzWAL,,,,https://twitter.com/dog_rates/status/666337882303524864/photo/1,9,10,an,,,,
2349,666051853826850816,,,2015-11-16 00:35:11 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is an odd dog. Hard on the outside but loving on the inside. Petting still fun. Doesn't play catch well. 2/10 https://t.co/v5A4vzSDdc,,,,https://twitter.com/dog_rates/status/666051853826850816/photo/1,2,10,an,,,,
2335,666287406224695296,,,2015-11-16 16:11:11 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is an Albanian 3 1/2 legged Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv,,,,https://twitter.com/dog_rates/status/666287406224695296/photo/1,1,2,an,,,,


Let's count the number of null value in the *in_reply_to_status_id* column

In [45]:
sum(archives_df.in_reply_to_status_id.isnull())

2278

Number of null values in the in_reply_to_user_id

In [46]:
sum(archives_df.in_reply_to_user_id.isnull())

2278

Number of null values in the *retweeted_status_id*

In [47]:
sum(archives_df.retweeted_status_id.isnull())

2175

Number of null values in the *retweeted_status_user_id*

In [48]:
sum(archives_df.retweeted_status_user_id.isnull())

2175

No. of unique values in the *rating_denominator* column with their respective counts

In [61]:
archives_df.rating_denominator.value_counts()

10     2333
11        3
50        3
20        2
80        2
70        1
7         1
15        1
150       1
170       1
0         1
90        1
40        1
130       1
110       1
16        1
120       1
2         1
Name: rating_denominator, dtype: int64

In [51]:
predictions_df.sample(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
355,672594978741354496,https://pbs.twimg.com/media/CVWJkJXWsAInlZl.jpg,1,Great_Pyrenees,0.755945,True,Old_English_sheepdog,0.082337,True,Afghan_hound,0.027037,True
170,668992363537309700,https://pbs.twimg.com/media/CUi9ARGWUAEyWqo.jpg,1,lynx,0.287506,False,tabby,0.206048,False,koala,0.081419,False
608,680070545539371008,https://pbs.twimg.com/media/CW-dU34WQAANBGy.jpg,1,earthstar,0.127701,False,Shih-Tzu,0.121811,True,bubble,0.11782,False
1940,860924035999428608,https://pbs.twimg.com/media/C_KVJjDXsAEUCWn.jpg,2,envelope,0.933016,False,oscilloscope,0.012591,False,paper_towel,0.011178,False
696,684481074559381504,https://pbs.twimg.com/media/CX_D6AJWwAAnBIw.jpg,1,Chihuahua,0.93781,True,Pomeranian,0.020307,True,polecat,0.017357,False


In [52]:
predictions_df.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [28]:
predictions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [29]:
predictions_df.tweet_id.duplicated().sum()

0

There are no duplicates in the predictions dataFrame

In [63]:
additional_tweets_df.sample(10)

Unnamed: 0,tweet_id,created_at,retweet_count,favorite_count,followers_count
708,698342080612007937,Sat Feb 13 03:05:01 +0000 2016,1074,2485,3200950
636,709179584944730112,Mon Mar 14 00:49:23 +0000 2016,763,2213,3200947
457,756939218950160384,Sat Jul 23 19:49:07 +0000 2016,2295,7342,3200943
564,728986383096946689,Sat May 07 16:34:32 +0000 2016,917,3460,3200944
655,706904523814649856,Mon Mar 07 18:09:06 +0000 2016,8830,15961,3200947
547,736225175608430592,Fri May 27 15:58:54 +0000 2016,3115,8901,3200943
401,772117678702071809,Sat Sep 03 17:02:54 +0000 2016,848,4165,3200905
1171,666063827256086533,Mon Nov 16 01:22:45 +0000 2015,232,496,3201018
69,865006731092295680,Thu May 18 00:50:50 +0000 2017,8209,29063,3200891
975,673662677122719744,Mon Dec 07 00:37:52 +0000 2015,397,1339,3201004


In [64]:
additional_tweets_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1177 entries, 0 to 1176
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   tweet_id         1177 non-null   int64 
 1   created_at       1177 non-null   object
 2   retweet_count    1177 non-null   int64 
 3   favorite_count   1177 non-null   int64 
 4   followers_count  1177 non-null   int64 
dtypes: int64(4), object(1)
memory usage: 46.1+ KB


## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

# Make copies of original pieces of data

In [67]:
archives_clean = archives_df.copy()

### Issue #1: archives_df - duplicate tweets inform of retweets

#### Define:
* Delete retweets that can be identified with rows that have a value in *retweeted_status_id* column

#### Code

## Test

### Issue #2:

#### Define

#### Code

#### Test: 

## Storing Data

Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least three (3) insights and one (1) visualization.

### Insights:
1.

2.

3.

### Visualization

In [None]:
## References 

