# Data Wrangle Report

### by Travis Gillespie

## Table of Contents
- [Introduction](#intro)
- [Gather Data](#gather)
- [Assess Data](#assess)
- [Clean Data](#clean)

<a id='intro'></a>
## Introduction

The goal of this project is to wrangle data from the tweet archive of the Twitter account _WeRateDogs_, which rates dogs with a humorous comment. This project allowed me to apply what I learned throughout the data wrangling section, plus other skills obtained through the certification program. Some of the skills I applied are outlined in the following sections.

<a id='gather'></a>
## Gathering Data

This project required me to retrieve three datasets using various methods.

* **Twitter Archive**: Udacity provided this data in a CSV file titled *twitter_archive_enhanced*.
----

* **Tweet Image Predictions**: This data is hosted on Udacity's servers. I downloaded a TSV file titled *image_predictions* programmatically using the Requests library and a URL provided by Udacity.

----
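As a minimal sketch of the programmatic download described above, the function below saves the body of an HTTP response to disk with Requests. The function name `download_file`, the injectable `get` parameter, and the example URL are illustrative assumptions; the actual URL provided by Udacity is not reproduced here.

```python
from pathlib import Path

import requests


def download_file(url, dest, get=requests.get, timeout=30):
    """Download `url` and write the raw response body to `dest`.

    `get` defaults to `requests.get`; it is a parameter only so the
    function can be exercised without network access (an assumption
    made for this sketch, not part of the original project).
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    dest = Path(dest)
    dest.write_bytes(response.content)
    return dest


# Hypothetical usage; replace the placeholder with the real URL:
# download_file("https://example.com/image_predictions.tsv",
#               "image_predictions.tsv")
```

Writing `response.content` as bytes keeps the file exactly as served, so it can later be read with a tab separator.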

* **Twitter API**: 
I queried the Twitter API for each tweet ID found in the Twitter archive, using Python's Tweepy library. This process takes about 15 minutes to run, so I created a progress bar to confirm that the process is not cut off at any point (_e.g. by the rate limit_). I also added a time delay between queries, which should keep the process from reaching the rate limit. The data was returned in JSON format and stored in a TXT file titled *tweet_json*. I then read the *tweet_json* file into my project as a pandas DataFrame.
----
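The query loop described above could look roughly like the sketch below. It does not call Tweepy directly: `fetch_tweet` is a hypothetical callable standing in for whatever Tweepy call returns a tweet as a dict, and the function names, the default delay, and the plain-text progress indicator are likewise assumptions.

```python
import json
import time

import pandas as pd


def collect_tweets(tweet_ids, fetch_tweet, out_path, delay_seconds=0.5):
    """Write one JSON object per line for every tweet that can be fetched.

    `fetch_tweet` is a hypothetical callable: tweet ID in, dict out.
    Returns the list of IDs that could not be fetched.
    """
    failed_ids = []
    total = len(tweet_ids)
    with open(out_path, "w", encoding="utf-8") as out_file:
        for position, tweet_id in enumerate(tweet_ids, start=1):
            try:
                tweet = fetch_tweet(tweet_id)
                out_file.write(json.dumps(tweet) + "\n")
            except Exception:  # e.g. a deleted tweet; narrow this in real code
                failed_ids.append(tweet_id)
            # Minimal text progress indicator, overwritten in place.
            print(f"\rProcessed {position}/{total} tweets", end="", flush=True)
            time.sleep(delay_seconds)  # pause between requests
    print()
    return failed_ids


def load_tweets(path):
    """Read the line-delimited JSON file into a pandas DataFrame."""
    with open(path, encoding="utf-8") as in_file:
        records = [json.loads(line) for line in in_file if line.strip()]
    return pd.DataFrame(records)
```

Storing one JSON object per line means a run that is interrupted partway still leaves a readable file, and the failed IDs can be retried separately.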

* **Backup Copies**: 
Copies of the original pieces of data were made just before the *Assessing Data* section.
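One way to make such backups in pandas is `DataFrame.copy()`, sketched below on a tiny made-up table (the values and the variable names `archive` and `archive_clean` are illustrative). Plain assignment would only create a second name for the same object, so changes made during cleaning would also alter the original.

```python
import pandas as pd

# Toy stand-in for one of the gathered tables.
archive = pd.DataFrame({"tweet_id": [1, 2], "name": ["Bella", "a"]})

# An independent copy: edits to it leave `archive` untouched.
archive_clean = archive.copy()
archive_clean.loc[archive_clean["name"] == "a", "name"] = None

print(archive["name"].tolist())        # original values are unchanged
print(archive_clean["name"].tolist())  # cleaned copy has the name removed
```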

<a id='assess'></a>
## Assessing Data

After retrieving each of the three datasets, I assessed the data. I started with the _head()_ method to get an idea of what the data looked like, followed by _info()_ to determine the size and shape of each table. This also helped me quickly determine whether there were any missing values. I noted potential issues and how I might go about fixing them within the _Quality_ and _Tidiness_ sections, and added anchor links to each item that could be fixed.
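The assessment calls mentioned above can be sketched on a small made-up table as follows. The column names and values are illustrative, and the explicit `isna().sum()` count is an addition that makes the missing values reported by `info()` easier to read off.

```python
import pandas as pd

# Toy table with one missing name, standing in for a gathered dataset.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "name": ["Bella", "a", None],
    "rating_numerator": [13, 12, 11],
})

print(archive.head())  # first rows: a quick look at the values
archive.info()         # row count, column dtypes, non-null counts

# Missing values per column, as an explicit count.
missing_per_column = archive.isna().sum()
print(missing_per_column)
```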

<a id='clean'></a>
## Cleaning Data

In this section I joined the tables, then fixed the quality and tidiness issues previously mentioned. In this project, I found that dropping unwanted columns from a dataframe was more efficient than creating a new dataframe with the desired columns, and it provided two advantages:
1. There were fewer columns to drop (12) than there were to add (21) to a new dataframe.
2. Dropping columns maintains the datatypes for each column within the dataframe, whereas creating a new dataframe would require me to convert the date and time columns from an object to a datetime again.
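A minimal sketch of the join-then-drop approach, on made-up tables, is shown below. The column names (`timestamp`, `source`, `p1`), the inner join on `tweet_id`, and the choice of column to drop are assumptions made for illustration, not the project's actual schema.

```python
import pandas as pd

archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "timestamp": ["2017-08-01 16:23:56", "2017-08-01 00:17:27",
                  "2017-07-31 00:18:03"],
    "source": ["iPhone", "iPhone", "Web"],
})
predictions = pd.DataFrame({
    "tweet_id": [1, 2],
    "p1": ["chihuahua", "pug"],
})

# Convert once, before joining and trimming.
archive["timestamp"] = pd.to_datetime(archive["timestamp"])

# Join the tables on the shared key (inner join assumed here).
merged = archive.merge(predictions, on="tweet_id", how="inner")

# Drop the unwanted column; the remaining columns keep their dtypes.
trimmed = merged.drop(columns=["source"])
print(trimmed.dtypes)
```

After the drop, `timestamp` is still a datetime column, so no second conversion is needed.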

Some of the steps within the cleaning process were fairly straightforward, such as removing null values (_e.g. missing images, retweets with no data_), checking for duplicates, and renaming columns. However, I had to get creative with more challenging problems, as listed below:

1. Some dogs had very unusual names (_e.g. '', 'a', 'my', etc._). I grouped all names by their length in characters and sorted them. This allowed me to easily spot the most peculiar values, and I then dropped the rows containing those values from the dataset.
2. I placed the developmental stage values (columns _'pupper', 'puppo', 'doggo', 'floofer'_) into one column. I had to make adjustments for rows with multiple entries, which I decided to code with the label _multi-entry_.
3. Some of the rows within the _text_ column contained decimals in reference to rating_numerator and rating_denominator. I used a regular expression to append these values to a list. I then re-ran the division calculation and added the result to the dataframe as a new column.
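The three fixes above can be sketched as small helper functions. This is an illustration under stated assumptions rather than the project's actual code: the helper names are hypothetical, each stage column is assumed to hold its own name (e.g. `'pupper'`) when the stage applies, and the regular expression takes the first `number/number` pattern it finds in the text.

```python
import re

import pandas as pd

STAGE_COLUMNS = ["doggo", "floofer", "pupper", "puppo"]
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def names_by_length(df, column="name"):
    """Return the unique names with their lengths, shortest first."""
    names = df[column].dropna().drop_duplicates()
    table = pd.DataFrame({"name": names, "length": names.str.len()})
    return table.sort_values(["length", "name"]).reset_index(drop=True)


def combine_stages(row):
    """Collapse the four stage columns of one row into a single label."""
    # Assumption: a cell equals its column name when that stage applies.
    stages = [col for col in STAGE_COLUMNS if row[col] == col]
    if not stages:
        return None
    return stages[0] if len(stages) == 1 else "multi-entry"


def extract_rating(text):
    """Return numerator / denominator from the first rating in `text`."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    numerator, denominator = float(match.group(1)), float(match.group(2))
    return numerator / denominator if denominator else None


# Hypothetical usage on a dataframe `df` with these columns:
# df["stage"] = df.apply(combine_stages, axis=1)
# df["rating"] = df["text"].apply(extract_rating)
```

Because the pattern matches the first fraction-like string, text containing something like a date or "24/7" before the rating would need a stricter rule.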

Finally, I wrote the clean data to a CSV file labeled *twitter_archive_master*.