In [351]:
import pandas as pd
import requests
import os
import glob
import json
import numpy

## Data Gathering

Download the WeRateDogs *[Twitter archive](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv)* by clicking on the download link.

Use the `pd.read_csv` method to load the data into a Pandas Dataframe

In [322]:
archives_df = pd.read_csv('twitter-archive-enhanced.csv')

Use the Requests library to download the tweet image prediction file (image_predictions.tsv)

In [9]:
#Make a new folder to hold downloaded image-predictions tsv file
folder = 'Predictions'
if not os.path.exists(folder):
    os.makedirs(folder)

In [14]:
#Download file using the requests library 
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)

In [17]:
#write the contents of the response into a file in the Predictions folder
with open(os.path.join(folder, url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)


Confirm the contents of the `Predictions` folder:

In [185]:
os.listdir('Predictions')

['image-predictions.tsv']

Now that we have confirmed that the image-predictions file has been successfully created, we can proceed with reading its contents and loading them into a dataframe

In [323]:
#Read Data from the Image Predictions file into a dataframe
predictions_df = pd.read_csv('Predictions\image-predictions.tsv', sep='\t')

 

We are going to download additional data to supplement our datasets from this **[link](https://video.udacity-data.com/topher/2018/November/5be5fb7d_tweet-json/tweet-json.txt)**. I opted for a direct download because my request for elavated access to twitter's API was rejected.   

Once you've downloaded the tweet-json text file move it into your working directory. 

In [275]:
#Create an empty list that will be used to hold the dictionary objects from the Json file
json_list = []

Read JSON Data from the Text File

In [282]:
# Use glob module to loop through and find text files in your working directory
# Employ a for loop together with .readline() to read lines in the text file
# Parse the String returned by the readine method into a Python dictionary with the json.loads method
# Create a python dictionary that will be loaded into the empty Json_list list
# Confirm if dictionary is present in list, if not append to list, else skip. This will ensure we dont append duplicates   
for tweet_file in glob.glob('*.txt'):
    with open(tweet_file, encoding='utf-8') as file:
        for line in file:
            txt_obj = file.readline()
            tweet_obj = json.loads(txt_obj)
            tweet_dict = {'tweet_id' : tweet_obj['id'], 'created_at' : tweet_obj['created_at'], 'retweet_count' : tweet_obj['retweet_count'],'favorite_count' : tweet_obj['favorite_count'],'followers_count' : tweet_obj['user']['followers_count']}
            if tweet_dict not in json_list:
                json_list.append(tweet_dict)

Let's check whether the objects loaded into our json_list were of the Python *`dict`* datatype  

In [293]:
assert(type(json_list[0]) == dict)

Optionally, we can use the *`isinstance`* function that checks and returns *`True`* if the specified object is a subclass or an instance of the indicated type

In [298]:
isinstance(json_list[-1], dict)

True

We'll use the *`from_dict`* method to construct a dataframe from our dictionaries in the json_list list. 

In [314]:
additional_tweets_df = pd.DataFrame.from_dict(json_list)

Check if the tweet_ids in the newly created additional_tweets_df are in the archives dataframe.   

In [359]:
#counts the number of items in the additional_tweets_df, tweet_id column column and checks if item is present in archive  
items = 0
for item in set(list(additional_tweets_df.tweet_id)):
    if item in list(archives_df.tweet_id):
        items += 1

In [360]:
# compare the count against the lenght of the additional tweets dataframe
items == len(list(additional_tweets_df.tweet_id))

True

## Assessing Data

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issue**. You must use **both** visual assessment
programmatic assessement to assess the data.

**Note:** pay attention to the following key points when you access the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.

Data quality issues: Data that has quality issues have issues with content like missing, duplicate, or incorrect data. This is called dirty data.
Lack of tidiness: Data that has specific structural issues that slow you down when cleaning and analyzing, visualizing, or modeling your data later.


### Quality issues
1. archives_df - tweet_id is an integer instead of a string. /////////

2. archives_df - timestamp is a string object 

3. archives_df - some 

4. archives_df - duplicates of tweets inform of tweets

5. 

6. 

7.

8.

In [422]:
archives_df.sample(10)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1714,680440374763077632,,,2015-12-25 17:30:01 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Merry Christmas. My gift to you is this tiny u...,,,,https://twitter.com/dog_rates/status/680440374...,11,10,,,,,
1257,710296729921429505,,,2016-03-17 02:48:31 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...",This is Oshie. 12/10 please enjoy (vid by @cat...,,,,https://vine.co/v/iw9hUFAMerV,12,10,Oshie,,,,
324,834086379323871233,,,2017-02-21 17:04:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Lipton. He's a West Romanian Snuggle P...,,,,https://twitter.com/dog_rates/status/834086379...,12,10,Lipton,,,,
484,814638523311648768,,,2016-12-30 01:05:33 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Olivia. She's a passionate advocate of...,,,,https://twitter.com/dog_rates/status/814638523...,12,10,Olivia,,,,
1386,700518061187723268,,,2016-02-19 03:11:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Vincent. He's the man your girl is wit...,,,,https://twitter.com/dog_rates/status/700518061...,10,10,Vincent,,,,
816,770787852854652928,,,2016-08-31 00:58:39 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Winston. His tongue has gone rogue. Do...,,,,https://twitter.com/dog_rates/status/770787852...,10,10,Winston,,,,
132,866816280283807744,,,2017-05-23 00:41:20 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Jamesy. He gives a kiss...,8.664507e+17,4196984000.0,2017-05-22 00:28:40 +0000,https://twitter.com/dog_rates/status/866450705...,13,10,Jamesy,,,pupper,
779,775842724423557120,,,2016-09-13 23:44:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Blue. He was having an average day unt...,,,,https://twitter.com/dog_rates/status/775842724...,12,10,Blue,,,,
1434,697270446429966336,,,2016-02-10 04:06:43 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Bentley. He got stuck on his 3rd homew...,,,,https://twitter.com/dog_rates/status/697270446...,10,10,Bentley,,,,
991,748699167502000129,,,2016-07-01 02:06:06 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Tripp. He's being eaten by a sherk and do...,,,,https://twitter.com/dog_rates/status/748699167...,11,10,Tripp,,,,


In [421]:
archives_df[archives_df.retweeted_status_id.notna()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
19,888202515573088257,,,2017-07-21 01:02:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Canela. She attempted s...,8.874740e+17,4.196984e+09,2017-07-19 00:47:34 +0000,https://twitter.com/dog_rates/status/887473957...,13,10,Canela,,,,
32,886054160059072513,,,2017-07-15 02:45:48 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @Athletics: 12/10 #BATP https://t.co/WxwJmv...,8.860537e+17,1.960740e+07,2017-07-15 02:44:07 +0000,https://twitter.com/dog_rates/status/886053434...,12,10,,,,,
36,885311592912609280,,,2017-07-13 01:35:06 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Lilly. She just paralle...,8.305833e+17,4.196984e+09,2017-02-12 01:04:29 +0000,https://twitter.com/dog_rates/status/830583320...,13,10,Lilly,,,,
68,879130579576475649,,,2017-06-26 00:13:58 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Emmy. She was adopted t...,8.780576e+17,4.196984e+09,2017-06-23 01:10:23 +0000,https://twitter.com/dog_rates/status/878057613...,14,10,Emmy,,,,
73,878404777348136964,,,2017-06-24 00:09:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Meet Shadow. In an attempt to r...,8.782815e+17,4.196984e+09,2017-06-23 16:00:04 +0000,"https://www.gofundme.com/3yd6y1c,https://twitt...",13,10,Shadow,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1023,746521445350707200,,,2016-06-25 01:52:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Shaggy. He knows exactl...,6.678667e+17,4.196984e+09,2015-11-21 00:46:50 +0000,https://twitter.com/dog_rates/status/667866724...,10,10,Shaggy,,,,
1043,743835915802583040,,,2016-06-17 16:01:16 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Extremely intelligent dog here....,6.671383e+17,4.196984e+09,2015-11-19 00:32:12 +0000,https://twitter.com/dog_rates/status/667138269...,10,10,,,,,
1242,711998809858043904,,,2016-03-21 19:31:59 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @twitter: @dog_rates Awesome Tweet! 12/10. ...,7.119983e+17,7.832140e+05,2016-03-21 19:29:52 +0000,https://twitter.com/twitter/status/71199827977...,12,10,,,,,
2259,667550904950915073,,,2015-11-20 03:51:52 +0000,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",RT @dogratingrating: Exceptional talent. Origi...,6.675487e+17,4.296832e+09,2015-11-20 03:43:06 +0000,https://twitter.com/dogratingrating/status/667...,12,10,,,,,


In [378]:
archives_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

Check for dupicates in the *`tweet_id`* column

In [388]:
archives_df['tweet_id'].duplicated().sum()

0

There are no duplicates in the archives_df dataframe. 

In [416]:
archives_df.duplicated().sum()

0

In [417]:
predictions_df.sample(3)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1120,726828223124897792,https://pbs.twimg.com/media/ChY2aHyWMAAbNQE.jpg,1,miniature_pinscher,0.255327,True,Border_terrier,0.181279,True,Labrador_retriever,0.125185,True
435,674394782723014656,https://pbs.twimg.com/media/CVvueeeWwAUcQLR.jpg,1,toilet_tissue,0.134983,False,mosquito_net,0.133247,False,Lakeland_terrier,0.109099,True
606,679877062409191424,https://pbs.twimg.com/media/CW9olDsUsAA0XSf.jpg,1,hog,0.809466,False,hay,0.060178,False,lumbermill,0.016483,False


In [391]:
predictions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [393]:
predictions_df.tweet_id.duplicated().sum()

0

There are no duplicates in the predictions dataFrame

In [418]:
additional_tweets_df.head()

Unnamed: 0,tweet_id,created_at,retweet_count,favorite_count,followers_count
0,892177421306343426,Tue Aug 01 00:17:27 +0000 2017,6514,33819,3200889
1,891689557279858688,Sun Jul 30 15:58:51 +0000 2017,8964,42908,3200889
2,891087950875897856,Sat Jul 29 00:08:17 +0000 2017,3261,20562,3200889
3,890729181411237888,Fri Jul 28 00:22:40 +0000 2017,16716,56848,3200889
4,890240255349198849,Wed Jul 26 15:59:51 +0000 2017,7711,32467,3200889


In [419]:
additional_tweets_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1177 entries, 0 to 1176
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   tweet_id         1177 non-null   int64 
 1   created_at       1177 non-null   object
 2   retweet_count    1177 non-null   int64 
 3   favorite_count   1177 non-null   int64 
 4   followers_count  1177 non-null   int64 
dtypes: int64(4), object(1)
memory usage: 46.1+ KB


### Tidiness issues
1.

2.

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

# Make copies of original pieces of data

### Issue #1:

#### Define:

#### Code

## Test

### Issue #2:

#### Define

#### Code

#### Test: 

## Storing Data

Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least three (3) insights and one (1) visualization.

### Insights:
1.

2.

3.

### Visualization

In [None]:
## References 

