# Data Wrangling Project

In this report, my goal is to gather, assess, clean, and store the Twitter data from the tweet archive of the Twitter account WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs and adds a text comment about each dog. Udacity provided the basic tweet data from the WeRateDogs Twitter archive. To study interesting relationships, I gathered additional data from the Twitter API and the image predictions file. I then assessed, cleaned, and stored this data. This report shows my data wrangling steps.

## 1. Gather

In [1]:
import pandas as pd
import requests
import json
import numpy as np
%matplotlib inline

#### Gather twitter-archive-enhanced.csv

In [2]:
twitterarchiveenhanced = pd.read_csv('twitter-archive-enhanced.csv')

#### Gather image-predictions.tsv

In [3]:
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
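response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
# Save the file under its original name, image-predictions.tsv
with open(url.split('/')[-1], 'wb') as file:
    file.write(response.content)
imagepredict = pd.read_csv('image-predictions.tsv', sep='\t')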

#### Gather twitter json txt file

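The Twitter API data for each tweet in the Twitter archive is stored as JSON in a text file, one tweet per line. The cells below download that file, tweet-json.txt, and read it into a dataframe.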

In [4]:
url = "https://s3.amazonaws.com/video.udacity-data.com/topher/2018/November/5be5fb7d_tweet-json/tweet-json.txt"
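response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
# Save the file under its original name, tweet-json.txt
with open(url.split('/')[-1], 'wb') as file:
    file.write(response.content)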

In [5]:
# Each line of tweet-json.txt holds one tweet as a JSON object
status_list = []
with open('tweet-json.txt', encoding='utf-8') as file:
    for line in file:
        status_list.append(json.loads(line))

In [6]:
tweetjson = pd.DataFrame(status_list)

In [7]:
tweetjson = tweetjson[['id', 'favorite_count', 'retweet_count']]
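As an alternative to the loop above, pandas can read line-delimited JSON directly with `pd.read_json(..., lines=True)`. The sketch below is not the code used in this notebook; `load_tweet_counts` is a hypothetical helper, and the two sample rows are made up to stand in for tweet-json.txt.

```python
import json
import os
import tempfile

import pandas as pd


def load_tweet_counts(path):
    """Read a file with one JSON object per line and keep three columns."""
    frame = pd.read_json(path, lines=True)
    return frame[['id', 'favorite_count', 'retweet_count']]


# Made-up rows standing in for tweet-json.txt (assumption: one object per line).
sample_rows = [
    {'id': 1, 'favorite_count': 10, 'retweet_count': 2, 'lang': 'en'},
    {'id': 2, 'favorite_count': 30, 'retweet_count': 5, 'lang': 'en'},
]
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    for row in sample_rows:
        tmp.write(json.dumps(row) + '\n')
    sample_path = tmp.name

sample_counts = load_tweet_counts(sample_path)
os.remove(sample_path)
print(sample_counts)
```

With the real file, the call would be `load_tweet_counts('tweet-json.txt')`.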

#### Conclusion: I was able to gather the data from 3 different sources successfully.

## 2. Assess

I assessed the 3 dataframes both visually and programmatically.

I found the following issues:

### Quality Issues
twitterarchiveenhanced dataframe:
- Missing values in expanded_urls
- "None" in the name column appears to stand for a missing name
- "a" in the name column does not appear to be a valid name
- "the" in the name column does not appear to be a valid name
- "an" in the name column does not appear to be a valid name
- "quite" in the name column does not appear to be a valid name. It was taken from the tweet text for tweet_id 859196978902773760.
- tweet_id 666287406224695296 has the text: "This is an Albanian 3 1/2 legged  Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv". The rating is recorded as 1/2 instead of 9/10.
- tweet_id 832088576586297345 has the text: "@docmisterio account started on 11/15/15". The rating is recorded as 11/15 instead of NaN.
- tweet_id 835246439529840640 has the text: "@jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho". The rating is recorded as 960/0 instead of 13/10.
- Retweets are present. I only want to use original tweets.
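Several of the issues above can be surfaced programmatically. The following is a minimal sketch on made-up rows, not the assessment code actually run for this report. `summarize_quality` is a hypothetical helper, and the `retweeted_status_id` column name is an assumption about the archive's layout.

```python
import pandas as pd

# Made-up rows that imitate the archive; the real frame is twitterarchiveenhanced.
sample = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4, 5],
    'name': ['Charlie', 'a', 'None', 'quite', 'Bella'],
    'expanded_urls': ['https://example.com/1', None, 'https://example.com/3',
                      'https://example.com/4', None],
    # Assumed column: holds an id only when the row is a retweet.
    'retweeted_status_id': [None, None, 9.0, None, None],
})


def summarize_quality(df):
    """Count a few of the quality problems listed above."""
    # Entries such as 'a', 'the', 'an', 'quite' are entirely lowercase.
    lowercase_names = df['name'].str.islower()
    return {
        'missing_expanded_urls': int(df['expanded_urls'].isnull().sum()),
        'name_is_None': int((df['name'] == 'None').sum()),
        'lowercase_names': sorted(df.loc[lowercase_names, 'name'].unique()),
        'retweets': int(df['retweeted_status_id'].notnull().sum()),
    }


summary = summarize_quality(sample)
print(summary)
```

Checking for all-lowercase names is a heuristic: it flags candidates for visual review and does not prove that an entry is wrong.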

### Tidiness Issues
twitterarchiveenhanced dataframe: 
- doggo, floofer, pupper, puppo should form one 'dog_stage' column since they represent one variable
- column "text" holds both the tweet text and its URL. These should be in separate columns: text and text_url
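To illustrate the second tidiness issue, the sketch below separates a trailing URL from the tweet text. It assumes the URL always sits at the end of the text, which may not hold for every tweet; `split_text_and_url` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd

# Matches a URL at the very end of the text, with optional surrounding spaces.
URL_PATTERN = r'\s*(https?://\S+)\s*$'


def split_text_and_url(df):
    """Move a trailing URL out of 'text' into a new 'text_url' column."""
    out = df.copy()
    # Only the last URL is captured; rows without a URL get NaN.
    out['text_url'] = out['text'].str.extract(URL_PATTERN, expand=False)
    out['text'] = out['text'].str.replace(URL_PATTERN, '', regex=True).str.strip()
    return out


sample = pd.DataFrame({
    'text': ['This is Bo. 13/10 https://t.co/abc123', 'No link here 12/10'],
})
split = split_text_and_url(sample)
print(split)
```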


#### Conclusion: I found 10 quality issues and 2 tidiness issues.


## 3. Clean

I first tried to clean missing values, then tidiness issues, and finally quality issues. For each issue, I first defined how to clean it, then coded the cleaning step, and then tested whether the cleaning worked. Below are some examples of my steps:

In [8]:
df_clean = twitterarchiveenhanced.copy()

### a) Missing Values
I could not fill the missing values in the expanded_urls and name columns.

### b) Tidiness

#### doggo, floofer, pupper, puppo should form one 'dog_stage' column since they represent one variable

#### Define
- Combine the *doggo*, *floofer*, *pupper*, *puppo* columns into a single *dog_stage* column

#### Code
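The code cell for this step is not shown in the report. One way to implement the definition above is sketched here. It assumes an absent stage is stored as the string 'None' and that a tweet naming two stages should keep both, joined by a comma. `combine_dog_stages` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd

STAGE_COLUMNS = ['doggo', 'floofer', 'pupper', 'puppo']


def combine_dog_stages(df):
    """Collapse the four stage columns into one 'dog_stage' column."""
    out = df.copy()
    # Keep only the stages that are present (assumption: absent == 'None').
    joined = out[STAGE_COLUMNS].apply(
        lambda row: ','.join(value for value in row if value != 'None'),
        axis=1,
    )
    # Rows with no stage at all become NaN instead of an empty string.
    out['dog_stage'] = joined.where(joined != '')
    return out.drop(columns=STAGE_COLUMNS)


sample = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo': ['doggo', 'None', 'doggo'],
    'floofer': ['None', 'None', 'None'],
    'pupper': ['None', 'None', 'pupper'],
    'puppo': ['None', 'None', 'None'],
})
combined = combine_dog_stages(sample)
print(combined)
```

On the notebook's data the call would be `df_clean = combine_dog_stages(df_clean)`.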

#### Test
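The test cell is not shown in the report either. A check along the following lines would confirm that the definition above was met: the *dog_stage* column exists and the four original columns are gone. `dog_stage_is_tidy` is a hypothetical helper and both frames are made up.

```python
import pandas as pd

OLD_STAGE_COLUMNS = {'doggo', 'floofer', 'pupper', 'puppo'}


def dog_stage_is_tidy(df):
    """True when dog_stage exists and the four old columns are gone."""
    return 'dog_stage' in df.columns and OLD_STAGE_COLUMNS.isdisjoint(df.columns)


# Made-up frames: one already cleaned, one still in the original layout.
cleaned = pd.DataFrame({'tweet_id': [1, 2], 'dog_stage': ['pupper', None]})
uncleaned = pd.DataFrame({
    'tweet_id': [1],
    'doggo': ['None'],
    'floofer': ['None'],
    'pupper': ['pupper'],
    'puppo': ['None'],
})

print(dog_stage_is_tidy(cleaned))
print(dog_stage_is_tidy(uncleaned))
# Distribution of stages, including rows without a stage.
print(cleaned['dog_stage'].value_counts(dropna=False))
```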

### c) Quality

#### twitterarchiveenhanced dataframe: tweet_id 666287406224695296 has the text: "This is an Albanian 3 1/2 legged Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv". The rating is recorded as 1/2 instead of 9/10.


#### Define
- Change the rating for tweet_id 666287406224695296 from 1/2 to 9/10

#### Code
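The code cell for this step is not shown in the report. A minimal sketch of the correction follows. It assumes the rating is stored in two columns named `rating_numerator` and `rating_denominator`; `set_rating` is a hypothetical helper and the second sample row is made up.

```python
import pandas as pd


def set_rating(df, tweet_id, numerator, denominator):
    """Return a copy of df with the rating of one tweet overwritten."""
    out = df.copy()
    mask = out['tweet_id'] == tweet_id
    # Assumed column names for the two parts of the rating.
    out.loc[mask, 'rating_numerator'] = numerator
    out.loc[mask, 'rating_denominator'] = denominator
    return out


sample = pd.DataFrame({
    'tweet_id': [666287406224695296, 1],
    'rating_numerator': [1, 12],
    'rating_denominator': [2, 10],
})
fixed = set_rating(sample, 666287406224695296, 9, 10)
print(fixed)
```

On the notebook's data the call would be `df_clean = set_rating(df_clean, 666287406224695296, 9, 10)`.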

#### Test
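The test cell is not shown in the report. A check like the one below would confirm the definition above by looking up the rating for the tweet. The column names `rating_numerator` and `rating_denominator` are assumptions, `rating_of` is a hypothetical helper, and the frame is made up to represent the state after cleaning.

```python
import pandas as pd


def rating_of(df, tweet_id):
    """Return (numerator, denominator) for the first row with this tweet_id."""
    row = df.loc[df['tweet_id'] == tweet_id].iloc[0]
    return int(row['rating_numerator']), int(row['rating_denominator'])


# Made-up frame standing in for df_clean after the rating was corrected.
cleaned = pd.DataFrame({
    'tweet_id': [666287406224695296, 1],
    'rating_numerator': [9, 12],
    'rating_denominator': [10, 10],
})

assert rating_of(cleaned, 666287406224695296) == (9, 10)
print(rating_of(cleaned, 666287406224695296))
```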

#### Conclusion: I was able to clean all the tidiness issues and all the quality issues successfully. I tested the results of my cleaning and they look correct.

## 4. Store

In [9]:
# Store the clean dataframe in a csv file; index=False keeps the row index out of the file
df_clean.to_csv('twitter_archive_master.csv', index=False)

In [10]:
# jupyter nbconvert wrangle_report.ipynb --TagRemovePreprocessor.remove_input_tags='{"remove_input_tags"}' --TagRemovePreprocessor.remove_all_outputs_tags='{"remove_all_outputs_tags"}'
# Source: https://stackoverflow.com/questions/31517194/how-to-hide-one-specific-cell-input-or-output-in-ipython-notebook

#### References:
- Udacity Project Overview
- https://stackoverflow.com/questions/47889565/reading-json-objects-from-text-file-into-pandas
- https://stackoverflow.com/questions/31517194/how-to-hide-one-specific-cell-input-or-output-in-ipython-notebook