# Project: Wrangling and Analyze Data

## Table of Contents
<ul>
    <li><a href="#gathering">I. Data Gathering</a></li>
</ul>

In [4]:
pip install tweepy

Note: you may need to restart the kernel to use updated packages.


In [5]:
import pandas as pd
import numpy as np
import tweepy
import json
import requests

<a id ='gathering'></a>
## I.  Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the methods required to gather each piece of data are different.

### <font color=green> 1. Twitter Archive Enhanced (.csv)</font>
Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

In [6]:
df_twittarch = pd.read_csv("twitter-archive-enhanced.csv")

### <font color = 'green'>2. Image Prediction (.tsv)</font>
Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [12]:
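url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
image_requests = requests.get(url, allow_redirects=True)
image_requests.raise_for_status()  # stop here if the download failed
# write the response body to disk; the with-block closes the file
with open('image-predictions.tsv', 'wb') as file:
    file.write(image_requests.content)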

df_imagepredictions = pd.read_csv('image-predictions.tsv',sep = '\t')
df_imagepredictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


### <font color = 'green'>3. Tweet JSON (.JSON)</font>
Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)


### ~~<font color='green'>3a. Twitter API</font>~~

In [7]:
#consumer_key = '*****'
#consumer_secret = '****'
#access_token = '****'
#access_secret = '****'

#auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
#auth.set_access_token(access_token, access_secret)

#api = tweepy.API(auth, wait_on_rate_limit=True)

#tweet_ids = df_twittarch.tweet_id.values
#len(tweet_ids)

In [8]:
# set a function for tweet extraction
# file already created so no need to execute to continue the notebook
#from timeit import default_timer as timer  # timer() is used below to time the run
#def tweet_extraction():
#    count = 0
#    fails_dict = {}
#    start = timer()
#    with open('tweet_json.txt', 'w') as outfile:
#        for tweet_id in tweet_ids:
#            count += 1
#            print(str(count) + ": " + str(tweet_id))
#            try:
#                tweet = api.get_status(tweet_id, tweet_mode='extended')
#                print("Success")
#                json.dump(tweet._json, outfile)
#                outfile.write('\n')
#            except tweepy.TweepError as e:  # tweepy 3.x name; 4.x renamed it to tweepy.TweepyException
#                print("Fail")
#                fails_dict[tweet_id] = e
#    end = timer()
#    print(end - start)
#    print(fails_dict)

### <font color='green'>3b. Udacity File</font>

In [7]:
df_tweet_json = pd.read_json("tweet-json.json", lines=True)

# Create a list of dictionaries from this file:

twitter_json = []
with open('tweet_json.txt', 'r') as file:
    for line in file:
        tweet = json.loads(line)
        
        tweet_id = tweet['id']
        retweet_count = tweet['retweet_count']
        favorite_count = tweet['favorite_count']
        
        twitter_json.append({'tweet_id' : tweet_id,
                         'retweet_count' : retweet_count,
                         'favorite_count' : favorite_count})
        
# Put data into a more manageable dataframe
df_twitter_json = pd.DataFrame(twitter_json)

<b><u>use UDACITY dataset</u></b> https://knowledge.udacity.com/questions/855395

><font color='red'> Created three new dataframes: </font>
>- <font color='red'>df_twittarch</font>
>- <font color='red'>df_imagepredictions</font>
>- <font color='red'>df_twitter_json</font>


## II. Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
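
The first key point (original ratings only, with images) can be sketched as a filter. This is a minimal illustration on toy frames, not the notebook's own cleaning step: the helper name `original_tweets_with_images` is hypothetical, and it assumes a retweet is any row with a non-null `retweeted_status_id` and that "has an image" means the `tweet_id` appears in the image predictions table.

```python
import pandas as pd

# Toy stand-in for the archive; only the columns needed for the filter.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "retweeted_status_id": [None, 99.0, None, None],
})
# Toy stand-in for the image predictions table.
images = pd.DataFrame({"tweet_id": [1, 2, 4]})


def original_tweets_with_images(archive, images):
    """Keep rows that are not retweets and whose tweet_id has an image prediction."""
    is_original = archive["retweeted_status_id"].isna()
    has_image = archive["tweet_id"].isin(images["tweet_id"])
    return archive[is_original & has_image]


kept = original_tweets_with_images(archive, images)
```

In the toy data, tweet 2 is dropped as a retweet and tweet 3 is dropped because it has no image prediction.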

### A. Visual Assessment - Qualitatively evaluate the structure of the datasets
#### 1. df_twittarch

In [8]:
df_twittarch.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [None]:
df_twittarch.text.sample(20)

><font color='red'>Response: At a quick glance, the dog stage columns should be combined into one column.</font>
#### 2. df_imagepredictions

In [None]:
df_imagepredictions.head()

><font color='red'>Response: </font>
#### 3. df_tweet_json

In [None]:
df_tweet_json.head()

><font color='red'>Response: </font>
### B. Programmatic Assessment - Use code-based tools to assess the datasets
#### 1. df_twittarch

In [None]:
df_twittarch.info()

In [None]:
df_twittarch.isnull().values.any()

In [None]:
sum(df_twittarch['tweet_id'].duplicated())

><font color='red'>Response: </font>
#### 2. df_imagepredictions

In [None]:
df_imagepredictions.info()

In [None]:
df_imagepredictions.isnull().values.any()

In [None]:
sum(df_imagepredictions['tweet_id'].duplicated())

In [None]:
df_imagepredictions['p1'].value_counts().head()

In [None]:
df_imagepredictions['p2'].value_counts().head()

In [None]:
df_imagepredictions['p3'].value_counts().head()

><font color='red'>Response: </font>
#### 3. df_tweet_json

In [None]:
df_tweet_json.info()

In [None]:
df_tweet_json.isnull().values.any()

In [None]:
sum(df_tweet_json['id'].duplicated())  # the raw tweet JSON has no 'tweet_id' column yet; tweets are keyed by 'id'

><font color='red'>Response: </font>
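
The same three checks (size, null values, duplicated IDs) are repeated for each dataframe above. One way to gather them is a small helper; the sketch below runs on a toy frame, and the name `quick_assess` and its return format are hypothetical, not part of the project template.

```python
import pandas as pd


def quick_assess(df, id_col):
    """Return a few headline quality numbers for one dataframe."""
    has_null = df.isnull().any()  # one boolean per column
    return {
        "rows": len(df),
        "columns_with_nulls": [col for col, flag in has_null.items() if flag],
        "duplicate_ids": int(df[id_col].duplicated().sum()),
    }


# Toy frame with one missing name and one repeated ID.
toy = pd.DataFrame({"tweet_id": [1, 2, 2], "name": ["Phineas", None, "Tilly"]})
summary = quick_assess(toy, "tweet_id")
```

Calling it once per dataframe keeps the assessment cells short and makes the results easy to compare side by side.
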
### Quality issues

1.  Missing data. Incomplete or missing data can lead to biased or inaccurate results.  <i>Table: example</i> Missing data for dog breed predictions. A few tweets deleted / missing API data.

2.  Duplicate data. Multiple entries of the same data can skew results and lead to inaccurate insights. <i>Table: example</i> Only want original ratings.
    - 181 values in retweeted_status_id and retweeted_status_user_id

3.  Inconsistent data. When the same data is recorded in different formats, it can be challenging to analyze and draw meaningful conclusions.  
    - <b>df_twittarch</b> dog names include unneeded articles (e.g., a, an, the); want to extract just the dogs' names

4.  Outliers. Extreme values in the data that do not fit the pattern of the rest of the data can distort the analysis and results.  <i>Table: example</i>

5.  Incorrect data type. Data that has been entered incorrectly, such as typos or errors in the data formatting, can lead to incorrect conclusions. 
    - <b>df_twittarch</b> columns of wrong data types: tweet_id, in_reply_to_status_id, retweeted_status_id, retweeted_status_user_id, timestamp

6.  Erroneous data. Data that is not relevant to the research question or analysis being conducted.
    - <b>df_twittarch</b> has several columns that aren't needed: in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp

7.  Incomplete data. When some data values are missing from a variable or subset of the dataset. <i>Table: example</i>

8.  Inaccurate data. Data that has been measured or recorded inaccurately can lead to misleading insights and conclusions. <i>Table: example</i>
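
For quality issue 3 (articles stored in place of dog names), one possible check is sketched below on a toy Series. It rests on a labeled assumption: a genuine name starts with an uppercase letter, so any all-lowercase entry is treated as an extraction error. That rule is a heuristic for illustration, not something the dataset guarantees.

```python
import pandas as pd

# Toy stand-in for the archive's name column.
names = pd.Series(["Phineas", "a", "Tilly", "the", "an", "None"])

# Assumption: all-lowercase entries ("a", "an", "the", ...) are not real names.
not_a_name = names.str.islower()

# mask() replaces the flagged entries with a missing value and keeps the rest.
cleaned = names.mask(not_a_name)
```

Entries such as the literal string "None" are not caught by this rule and would need their own handling.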

### Tidiness issues
1.  Wide format data (values stored as column headers). 
    - In the <b>df_twittarch</b> dataframe, the column headers 'doggo', 'floofer', 'pupper', and 'puppo' are values of a single variable and belong in one column, 'stage'. 

2.  Nested data <i>Table: example</i>
    - <b>df_twittarch</b>: some columns hold multiple entries in a single cell (like a tuple), e.g. expanded_urls

3.  Inconsistent data  <i>Table: example</i>
    - <b>image_clean</b>: breed names in p1, p2, p3 will use all lowercase for consistency
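
Tidiness issue 1 could be addressed along the following lines. This is a sketch on a toy frame, not the notebook's final code: the helper `collapse_stages` is hypothetical, it treats both missing values and the literal string "None" as "no stage", and it joins multiple stages with a comma when a row has more than one.

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Toy rows: one stage, no stage, and two stages.
toy = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "doggo": ["doggo", None, "doggo"],
    "floofer": [None, None, None],
    "pupper": [None, None, "pupper"],
    "puppo": [None, None, None],
})


def collapse_stages(df, cols):
    """Join the non-missing stage labels of each row into one 'stage' column."""
    labels = df[cols].where(df[cols] != "None")  # treat the string "None" as missing
    stage = labels.apply(lambda row: ",".join(row.dropna()), axis=1)
    out = df.drop(columns=cols)
    out["stage"] = stage.where(stage != "")  # rows without a stage become missing
    return out


tidy = collapse_stages(toy, stage_cols)
```

Keeping two-stage rows as a joined label makes them easy to find later, which may help when deciding what to do with tweets that rate more than one dog.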

## III. Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [13]:
# Make copies of original pieces of data (names match the cleaning cells below)
archive_clean = df_twittarch.copy()
image_clean = df_imagepredictions.copy()
tweet_clean = df_tweet_json.copy()

### Issue #1:  Removing doubles

#### Define: some tweets have two dogs being rated -- drop them.

##### Code

##### Test

### Issue #2:  Cleanup and minimize columns

#### Define: Drop columns that are unnecessary or have missing values: remove columns with missing values using dropna(), and use drop() to remove an entire column.

##### Code

In [None]:
archive_clean.dropna(axis='columns',how='any', inplace=True)
archive_clean.drop(columns='source', inplace=True)

In [None]:
#Enumerate and melt dog stages together

##### Test

In [None]:
archive_clean.info()

### Issue #3:  "Rename columns"

#### Define:  Rename <b>id_str</b> to <b>tweet_id</b> using the rename() function so the dataframes share the same key.

##### Code

In [None]:
tweet_clean.rename(index=str, columns={'id_str': 'tweet_id'}, inplace=True)

##### Test

In [None]:
tweet_clean.info()

### Issue #4:  Addressing data types

#### Define:  image_clean: tweet_id dtype "string"; archive_clean: timestamp dtype "datetime"; archive_clean: tweet_id dtype "string"

##### Code

In [None]:
image_clean['tweet_id'] = image_clean['tweet_id'].astype('str')
archive_clean['timestamp'] = pd.to_datetime(archive_clean['timestamp'])
archive_clean['tweet_id'] = archive_clean['tweet_id'].astype('str')

##### Test

In [None]:
image_clean.info()
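
One reason to store tweet IDs as strings is that they are identifiers rather than quantities, and an 18-digit ID does not fit exactly in a float. The short check below illustrates this with the first `tweet_id` shown in the archive preview; it is an illustration of the motivation, not part of the cleaning code.

```python
tweet_id = 892420643555336193  # first tweet_id shown in the archive preview

# float64 has a 53-bit significand, so an ID this large is rounded when
# stored as a float (as happens in columns that also contain NaN).
round_tripped = int(float(tweet_id))
survives_as_float = round_tripped == tweet_id

# A string keeps every digit.
survives_as_str = int(str(tweet_id)) == tweet_id
```

This is why ID columns that pandas loads as float, such as the reply and retweet status IDs listed under the data type issue, cannot be trusted digit for digit.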

### Issue #5:  Cleaning up archive_clean['text'] into a more readable format

#### Define:  Remove leftover HTML entities and newlines from the tweet text (&amp;amp;, \n), and strip trailing URLs

##### Code

In [None]:
archive_clean['text'] = archive_clean.text.str.replace("&amp;", "&")
archive_clean['text'] = archive_clean.text.str.replace("\n", " ")
archive_clean['text'] = archive_clean.text.str.replace(r"http\S+", "", regex=True)  # regex=True is required in recent pandas
archive_clean['text'] = archive_clean.text.str.strip()

##### Test

In [None]:
archive_clean[archive_clean.text.str.contains("&amp;", regex=False)]  # expect no rows
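
The effect of the four clean-up steps can be checked on a single made-up tweet. The sketch below wraps them in a hypothetical helper, `clean_text`; the sample text and its shortened URL are invented for illustration and do not come from the archive.

```python
import pandas as pd


def clean_text(text):
    """Apply the four clean-up steps to a Series of tweet texts."""
    return (
        text.str.replace("&amp;", "&", regex=False)    # HTML entity -> ampersand
            .str.replace("\n", " ", regex=False)       # newline -> space
            .str.replace(r"http\S+", "", regex=True)   # drop URLs
            .str.strip()                               # trim leftover whitespace
    )


# Invented sample containing all three kinds of residue.
sample = pd.Series(["Meet Bo &amp; Ty.\n13/10 https://t.co/abc123"])
cleaned = clean_text(sample)
```

Passing `regex` explicitly on every call avoids depending on the default, which differs between pandas versions.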

### Issue #5:  Making breed names in image_clean consistent

#### Define:  Convert the breed names in p1, p2, and p3 to all lowercase using str.lower()

##### Code

In [None]:
image_clean['p1'] = image_clean['p1'].str.lower()
image_clean['p2'] = image_clean['p2'].str.lower()
image_clean['p3'] = image_clean['p3'].str.lower()

##### Test

In [None]:
image_clean[['p1', 'p2', 'p3']].head()  # breed names should now be lowercase

### Issue #6:  MERGE FOR THE ONE DATASET TO RULE THEM ALL

#### Define:  Merge all the dataframes together, on tweet_id, with the following columns

Data columns (total 23 columns):
<ul>
<li>tweet_id          2075 non-null object</li>
<li>timestamp         2075 non-null datetime64[ns]</li>
<li>text              2075 non-null object</li>
<li>rate_num          2075 non-null float64</li>
<li>rate_denom        2075 non-null int64</li>
<li>name              2075 non-null object</li>
<li>doggo             2075 non-null object</li>
<li>floof             2075 non-null object</li>
<li>pupper            2075 non-null object</li>
<li>puppo             2075 non-null object</li>
<li>retweet_count     2072 non-null float64</li>
<li>favorite_count    2072 non-null float64</li>
<li>jpg_url           1949 non-null object</li>
<li>img_num           1949 non-null float64</li>
<li>p1                1949 non-null object</li>
<li>p1_conf           1949 non-null float64</li>
<li>p1_dog            1949 non-null object</li>
<li>p2                1949 non-null object</li>
<li>p2_conf           1949 non-null float64</li>
<li>p2_dog            1949 non-null object</li>
<li>p3                1949 non-null object</li>
<li>p3_conf           1949 non-null float64</li>
<li>p3_dog            1949 non-null object</li>
</ul>
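
A merge in the spirit of the Define statement might look like the sketch below, shown on toy frames. Using left joins that start from the archive is an assumption, chosen because the listing above shows fewer non-null values for the count and image columns than for the archive columns. The frame names `archive`, `counts`, and `images` are placeholders for the cleaned dataframes.

```python
import pandas as pd

# Toy stand-ins; tweet_id is a string in all three so the keys compare equal.
archive = pd.DataFrame({"tweet_id": ["1", "2", "3"],
                        "name": ["Phineas", "Tilly", "Archie"]})
counts = pd.DataFrame({"tweet_id": ["1", "2"],
                       "retweet_count": [10, 20],
                       "favorite_count": [100, 200]})
images = pd.DataFrame({"tweet_id": ["1"], "p1": ["redbone"]})

# Left joins keep every archive row; unmatched rows get missing values.
master = (
    archive.merge(counts, on="tweet_id", how="left")
           .merge(images, on="tweet_id", how="left")
)
```

The join only lines up when `tweet_id` has the same dtype in every frame, which is one practical reason for the string conversion in Issue #4.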


##### Code

In [None]:
image_clean['p1'] = image_clean['p1'].str.lower()
image_clean['p2'] = image_clean['p2'].str.lower()
image_clean['p3'] = image_clean['p3'].str.lower()

##### Test

In [None]:
image_clean.info()

### Issue #7:  CHECK IT AGAIN

#### Define:  Create a new copy before working; remove rows with null values using dropna().

##### Code

In [None]:
image_clean['p1'] = image_clean['p1'].str.lower()
image_clean['p2'] = image_clean['p2'].str.lower()
image_clean['p3'] = image_clean['p3'].str.lower()

##### Test

In [None]:
image_clean.info()

## IV. Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
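
Saving and reloading the master file could be sketched as follows. The toy `master` frame stands in for the real merged dataframe, and the file is written to a temporary directory only so that the example leaves nothing behind; in the notebook the path would simply be "twitter_archive_master.csv".

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the merged master dataframe.
master = pd.DataFrame({"tweet_id": ["892420643555336193"],
                       "rating_numerator": [13]})

path = os.path.join(tempfile.mkdtemp(), "twitter_archive_master.csv")

# index=False keeps the row index out of the file, so no extra
# unnamed column appears when the file is read back.
master.to_csv(path, index=False)

# Reading tweet_id back as a string stops pandas from parsing it as an integer.
restored = pd.read_csv(path, dtype={"tweet_id": str})
```

Column dtypes are not stored in a CSV, so conversions such as the string IDs and datetime timestamps have to be reapplied on load.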

## V. Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Insights:
1.

2.

3.

### Visualization