# Wrangle Report for Project 5: Dog Analysis
Below, I summarize the data wrangling process undertaken for this project.

## Table of Contents
<ul>
<li><a href='#sources'>Sources</a></li>
<li><a href='#gather'>Gathering Data</a></li>
<li><a href='#assess'>Assessing Data</a></li>
<li><a href='#clean'>Cleaning Data</a></li>
</ul>

<a id='sources'></a>
## Sources
In this project, I worked with tweets from the WeRateDogs Twitter account. The data came from a combination of the following sources:

- Archived Twitter Data
- Image Predictions for Tweets
- Additional Tweet Metadata

<a id='gather'></a>
## Gathering Data

In this stage, I converted each of the sources above into a Pandas dataframe to make manipulation easier during the following stages. The steps for each source were:
#### Archived Twitter Data
- Used the `pd.read_csv` function to convert the on-hand file into a dataframe
#### Image Predictions for Tweets
- Pulled the file from Udacity's hosting location using the _requests_ library and wrote it to this directory
- Read the .tsv file in as a dataframe using pd.read_csv with the separator denoted as tabs
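
The two steps above can be sketched as follows. This is a minimal sketch rather than the exact notebook code: the helper names `download_file` and `read_predictions` are hypothetical, and the URL and file name are left as arguments because neither is given here. The sample column names in the demo are made up for illustration.

```python
import io

import pandas as pd


def download_file(url, path):
    """Fetch `url` with requests and write the raw bytes to `path`."""
    # Imported lazily so the reader below works without requests installed.
    import requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early if the download failed
    with open(path, "wb") as f:
        f.write(response.content)


def read_predictions(path_or_buffer):
    """Read a tab-separated file (or file-like object) into a dataframe."""
    return pd.read_csv(path_or_buffer, sep="\t")


# Tiny in-memory stand-in for the .tsv file (illustrative columns only).
sample = io.StringIO("tweet_id\tp1\n1\tgolden_retriever\n")
df_demo = read_predictions(sample)
```

In practice, `download_file` would be called once with the hosting URL, and `read_predictions` would then be pointed at the saved file.
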
#### Additional Tweet Metadata
- Set up a Twitter developer account
- Used the Twitter API via the _tweepy_ library to access data for each tweet from Twitter
- Dumped the json data using the _json_ library and wrote to a .txt file in this directory
- Read the .txt file in as a dataframe using pd.read_json
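
A rough sketch of the dump-and-read steps is shown below. It assumes `api` is an already authenticated `tweepy.API` object and that each tweet is written as one JSON object per line. The function names are hypothetical, and catching a bare `Exception` is a simplification (tweepy raises its own error classes, whose names differ between versions). IDs that fail are collected in `lost_tweets`.

```python
import json

import pandas as pd


def dump_tweets(api, tweet_ids, path):
    """Write one JSON object per line; return the IDs that could not be fetched."""
    lost_tweets = []
    with open(path, "w", encoding="utf-8") as f:
        for tweet_id in tweet_ids:
            try:
                # Assumed call: fetch a single tweet with its full text.
                status = api.get_status(tweet_id, tweet_mode="extended")
            except Exception:  # simplification, see note above
                lost_tweets.append(tweet_id)
                continue
            # `_json` holds the raw dictionary returned by the API.
            f.write(json.dumps(status._json) + "\n")
    return lost_tweets


def read_tweets(path):
    """Read the line-delimited JSON file back into a dataframe."""
    return pd.read_json(path, lines=True)
```

Writing one object per line keeps the file readable with `pd.read_json(..., lines=True)` and means a failed tweet does not corrupt the rest of the dump.
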

<a id='assess'></a>
## Assessing Data

#### Issues

##### Quality

1. Tweet id is referred to inconsistently across tables
2. There are many empty columns, as well as columns that aren't needed for this analysis

`df_archive`

3. Archived data contains 181 retweets, which are invalid because only original tweets are wanted
4. `rating_denominator` values should all be 10, but some are very far off, mostly because the wrong numbers were scraped from the text
5. `rating_numerator` values have a similar problem to the denominator
6. `timestamp` is object type and should be datetime type
7. `source` is invalid because it contains the whole hyperlink HTML tag. This should be a simpler categorical value, preferably a one-word string
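
The checks behind issues 3 to 7 might look like the sketch below. It runs on a small made-up dataframe, not the real archive, and it assumes that retweets are identifiable through a `retweeted_status_id` column (that column name and the sample `source` string are assumptions for illustration).

```python
import pandas as pd

# Toy stand-in for df_archive; all values are invented.
df_archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [None, 99.0, None],
    "rating_numerator": [13, 12, 24],
    "rating_denominator": [10, 10, 7],
    "timestamp": ["2017-08-01 16:23:56 +0000"] * 3,
    "source": ['<a href="http://twitter.com/download/iphone" '
               'rel="nofollow">Twitter for iPhone</a>'] * 3,
})

# Issue 3: rows with a retweeted status ID are retweets.
n_retweets = df_archive["retweeted_status_id"].notna().sum()

# Issue 4: ratings whose denominator is not 10.
odd_denominators = df_archive[df_archive["rating_denominator"] != 10]

# Issue 6: the raw column is text; parsing it yields a datetime column.
parsed_timestamp = pd.to_datetime(df_archive["timestamp"])

# Issue 7: pull the visible link text out of the anchor tag.
source_label = df_archive["source"].str.extract(r">([^<]+)<", expand=False)
```

The extracted label is still a short phrase rather than a single word, so a further mapping step would be needed to reach the one-word format described above.
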

`df_meta`

8. Data is incomplete because some tweets could not be pulled by the API (their IDs were stored in `lost_tweets` and can also be found by comparing against `df_archive`)
9. `id` does not match `id_str` for some tweets; `id` is the value that matches the ID used in the other two tables
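
Both `df_meta` issues can be checked with simple comparisons. The sketch below uses invented toy data and assumes the archive's ID column is called `tweet_id` (the name is not stated here).

```python
import pandas as pd

# Toy stand-ins; all values are invented.
df_archive = pd.DataFrame({"tweet_id": [101, 102, 103]})
df_meta = pd.DataFrame({"id": [101, 102], "id_str": ["101", "100"]})

# Issue 8: IDs present in the archive but absent from the API data.
missing_ids = sorted(set(df_archive["tweet_id"]) - set(df_meta["id"]))

# Issue 9: rows where the numeric id and its string form disagree.
mismatch = df_meta[df_meta["id"].astype(str) != df_meta["id_str"].astype(str)]
```
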

##### Tidiness
1. Data is split across three dataframes and three files, but only has two main observational units, i.e., tweet data and tweet image predictions
2. Duplicated information (tweet id in 4 columns, text in both `df_archive` and `df_meta`)
3. In `df_archive`, the dog "stage" descriptions should be one column instead of four, since they are mostly exclusive categorical data
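
One possible way to collapse the four stage columns into one is sketched below. The column names (`doggo`, `floofer`, `pupper`, `puppo`) and the use of the string `"None"` as the missing marker are assumptions for illustration, as is the choice to join multiple stages with a comma for the rows that are not exclusive.

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]  # assumed column names

# Toy stand-in for df_archive; all values are invented.
df_archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})


def combine_stages(row):
    """Return the stage name(s) flagged in a row, or None if there are none."""
    found = [col for col in stage_cols if row[col] == col]
    return ", ".join(found) if found else None


df_archive["stage"] = df_archive[stage_cols].apply(combine_stages, axis=1)
df_tidy = df_archive.drop(columns=stage_cols)
```

Keeping the rare multi-stage rows as a joined string avoids silently discarding information, while still leaving a single categorical column.
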

<a id ='clean'></a>
## Cleaning Data