# Wrangle Report for Project 5: Dog Analysis
Below, I summarize the data wrangling process undertaken for this project.

## Table of Contents
<ul>
<li><a href='#sources'>Sources</a></li>
<li><a href='#gather'>Gathering Data</a></li>
<li><a href='#assess'>Assessing Data</a></li>
<li><a href='#clean'>Cleaning Data</a></li>
</ul>

<a id='sources'></a>
## Sources
In this project, I worked with tweets from the WeRateDogs Twitter account. The data came from a combination of the following sources:

- Archived Twitter Data
- Image Predictions for Tweets
- Additional Tweet Metadata

<a id='gather'></a>
## Gathering Data

In this stage, I converted each of the sources above into a pandas dataframe, which allowed easier, more fluid manipulation during the following stages. I did this with the following steps for each source:
#### Archived Twitter Data
- Used the `pd.read_csv` function to convert the on-hand file into a dataframe
#### Image Predictions for Tweets
- Pulled the file from Udacity's hosting location using the _requests_ library and wrote it to this directory
- Read the .tsv file in as a dataframe using pd.read_csv with the separator denoted as tabs
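
The download-and-read steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper name `download_file`, the sample rows, and the column names are placeholders I made up, and the real hosting URL is not reproduced here. The download helper is defined but not called, so the sketch runs offline against a small in-memory sample.

```python
import io

import pandas as pd


def download_file(url: str, path: str) -> None:
    """Hypothetical helper: fetch `url` and write the raw bytes to `path`."""
    import requests  # imported here so the offline demo below runs without it

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    with open(path, "wb") as f:
        f.write(response.content)


# Offline stand-in for the downloaded file: a tiny tab-separated sample
# with made-up column names and values.
sample_tsv = "tweet_id\tp1\tp1_conf\n1\tgolden_retriever\t0.95\n2\tpug\t0.80\n"

# sep="\t" tells read_csv that the columns are separated by tabs, not commas.
predictions = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
print(predictions)
```
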
#### Additional Tweet Metadata
- Set up a Twitter developer account
- Used the Twitter API via the _tweepy_ library to access data for each tweet from Twitter
- Dumped the JSON data using the _json_ library and wrote it to a .txt file in this directory
- Read the .txt file in as a dataframe using pd.read_json
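
A rough sketch of the dump-and-read steps above, assuming the .txt file holds one JSON object per line (which is what `lines=True` expects). The API call itself is omitted; the list `tweets`, its field names, and the file name are made-up stand-ins for the per-tweet JSON returned through _tweepy_.

```python
import json
import os
import tempfile

import pandas as pd

# Made-up stand-ins for the JSON payloads returned by the API.
tweets = [
    {"id": 101, "retweet_count": 5, "favorite_count": 20},
    {"id": 102, "retweet_count": 0, "favorite_count": 3},
]

# Write one JSON object per line to a .txt file (temporary directory here).
path = os.path.join(tempfile.mkdtemp(), "tweet_json.txt")
with open(path, "w", encoding="utf-8") as f:
    for tweet in tweets:
        f.write(json.dumps(tweet) + "\n")

# lines=True parses each line of the file as a separate JSON record.
metadata = pd.read_json(path, lines=True)
print(metadata)
```
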

<a id='assess'></a>
## Assessing Data
I used a variety of pandas methods such as `df.head()`, `df.sample()`, and indexing to visually assess the data and investigate specific issues. I complemented these with programmatic methods such as `df.info()`, `Series.value_counts()`, and `df.query()`. I then recorded each issue in a numbered list under either _Quality_ or *Tidiness*.
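
To illustrate the kind of checks described above, here is a small sketch on a made-up dataframe. The column names, values, and the threshold in the query are assumptions for demonstration only, not the project's data.

```python
import pandas as pd

# Made-up sample standing in for the archive dataframe.
archive = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3, 4],
        "name": ["Charlie", "a", "Lucy", "a"],
        "rating_numerator": [12, 13, 420, 11],
        "rating_denominator": [10, 10, 10, 10],
    }
)

# Visual assessment: look at the first rows and a random sample.
print(archive.head())
print(archive.sample(2, random_state=0))

# Programmatic assessment: dtypes and non-null counts per column.
archive.info()

# How often does each name occur? Repeated odd values hint at parsing issues.
name_counts = archive["name"].value_counts()
print(name_counts)

# Pull out rows with suspiciously large numerators (threshold is arbitrary).
suspicious = archive.query("rating_numerator > 20")
print(suspicious)
```
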

<a id='clean'></a>
## Cleaning Data
In this stage, I addressed each of the problems listed in the **Issues** section, breaking the process for each one down into three steps: *Describe*, *Code*, and *Test*. The goal was to have tidy, quality data to work with in the rest of the data analysis process.
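
As an illustration of the three-step pattern, here is a minimal sketch for a hypothetical issue (an identifier column stored as an integer). The issue, dataframe, and column names are invented for the example and are not taken from the project's issue list.

```python
import pandas as pd

# Made-up raw data for the example.
raw = pd.DataFrame({"tweet_id": [101, 102], "rating_numerator": [12, 13]})

# Describe: tweet_id is an identifier, not a quantity, so store it as a string.

# Code: work on a copy so the raw data stays available for comparison.
clean = raw.copy()
clean["tweet_id"] = clean["tweet_id"].astype(str)

# Test: confirm the fix took effect and the raw data is untouched.
assert clean["tweet_id"].tolist() == ["101", "102"]
assert raw["tweet_id"].tolist() == [101, 102]
```
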

#### Key Notes
##### Iteration
I iterated a few times because I found additional issues in the data while performing cleaning steps. For instance, it did not occur to me at first that `source` had any problems at all. Only later, when I examined the metadata from the API more closely, did I realize that this column holds the _type_ of source of the tweet itself, and that it is categorical data because there are only a few possible sources.
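
One way to treat a column like `source` as categorical is pandas' `category` dtype. The sketch below assumes that approach and uses made-up source labels; it is not necessarily the exact fix applied in the notebook.

```python
import pandas as pd

# Made-up labels standing in for the few possible tweet sources.
tweets = pd.DataFrame(
    {"source": ["Twitter for iPhone", "Twitter Web Client", "Twitter for iPhone"]}
)

# A small number of distinct values suggests categorical data.
print(tweets["source"].nunique())

# Convert the column to the category dtype.
tweets["source"] = tweets["source"].astype("category")
print(tweets["source"].cat.categories)
```
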
##### Fixes
The majority of fixes were accomplished using built-in pandas methods such as `.drop()`, `.rename()`, and `.merge()`. I applied them in an order that kept the code simple, rather than in the numerical order of the issues as written. For example, I fixed the tidiness issue of having three dataframes early on, which simplified the remaining fixes by leaving fewer columns and tables to deal with.
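
A minimal sketch of how `.rename()`, `.drop()`, and `.merge()` can combine three dataframes into one. The dataframes, column names, the dropped column, and the choice of inner joins are all assumptions for illustration.

```python
import pandas as pd

# Made-up stand-ins for the three source dataframes.
archive = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3],
        "text": ["good dog", "great dog", "best dog"],
        "in_reply_to_status_id": [None, None, None],
    }
)
predictions = pd.DataFrame({"tweet_id": [1, 2], "p1": ["pug", "beagle"]})
metadata = pd.DataFrame({"id": [1, 2, 3], "retweet_count": [5, 0, 2]})

# Rename the key column so all three frames share the same join key.
metadata = metadata.rename(columns={"id": "tweet_id"})

# Drop a column that is not needed for the analysis.
archive = archive.drop(columns=["in_reply_to_status_id"])

# Merge into a single table; inner joins keep only tweets present in all three.
combined = archive.merge(predictions, on="tweet_id", how="inner").merge(
    metadata, on="tweet_id", how="inner"
)
print(combined)
```

Merging early means each later fix touches one table instead of three.
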
##### Dropping Outliers
I dropped rows to fix _Quality_ issues \#4 and \#5. These rows technically contained valid data, but for various reasons their ratings fell far outside the bulk of the rating distribution. They would significantly skew the data and make visualizations harder to read because of scale. Fixing these data points would also require either complex reprocessing of the tweet text or manually inspecting and correcting each one. With almost 30 rows affected, this did not seem worth doing within the scope of this project.
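
The row-dropping step might look something like the sketch below. The sample values and both outlier conditions (a denominator other than 10, a numerator above 20) are assumptions chosen for illustration; the actual criteria behind issues \#4 and \#5 are recorded in the notebook.

```python
import pandas as pd

# Made-up ratings, including a few values far outside the usual range.
ratings = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3, 4, 5],
        "rating_numerator": [12, 13, 1776, 84, 11],
        "rating_denominator": [10, 10, 10, 70, 10],
    }
)

# Flag rows whose rating is far from the typical "x out of 10" pattern.
is_outlier = (ratings["rating_denominator"] != 10) | (
    ratings["rating_numerator"] > 20
)

# Drop the flagged rows by index label.
cleaned = ratings.drop(ratings[is_outlier].index)
print(cleaned)
```
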