# Reporting: wrangle_report

## Introduction

This report details the steps taken to gather, assess, and clean data for the Twitter account WeRateDogs. The first set of data came from two separate sources, which together described each tweet sent from the account: the tweet text, timestamp, retweet count, and favorite count. The second set of data contained neural network predictions for the images in the tweets.

## Gathering

To obtain the first set of Twitter data, a `.csv` file containing the Twitter archive for the account was imported with pandas' `read_csv()`. The second set of Twitter data was pulled directly from Twitter's API and read into Python with `read_json()`. The `.tsv` file containing the image prediction data for the images in the WeRateDogs tweets was also imported with `read_csv()`.
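The three imports described above can be sketched as follows. This is a minimal illustration, not the exact code used for the report: the in-memory buffers stand in for the real files, the column names are hypothetical, and the API data is assumed to be stored as one JSON object per line.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for the three files; in practice a file
# path would be passed to each reader instead.
archive_csv = io.StringIO("tweet_id,text\n1,This is Bella\n2,Meet Max\n")
api_json = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 50}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 20}\n'
)
predictions_tsv = io.StringIO("tweet_id\tp1\n1\tgolden_retriever\n2\tPug\n")

# Comma-separated archive file.
archive = pd.read_csv(archive_csv)

# API data, assumed here to be line-delimited JSON (one tweet per line).
tweet_data = pd.read_json(api_json, lines=True)

# Tab-separated image predictions: read_csv handles this with sep="\t".
predictions = pd.read_csv(predictions_tsv, sep="\t")
```

Passing `sep="\t"` is what lets `read_csv()` handle the tab-separated prediction file.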

## Assessing

The quality and tidiness issues found during assessment are listed below:

### Quality Issues

1. There are retweets in the dataset

2. The Twitter archive timestamp is a string, not a datetime

3. The image prediction columns p1, p2, and p3 are strings, not categories

4. All tweet ids are ints, but they are not useful mathematically

5. The Twitter archive source column has HTML in it

6. The source column seems categorical, but is a string

7. Some dog breeds in the image predictions are lowercase and some are uppercase

8. Some of the dog names in tweet_data are not names ("None", "the", "a")


### Tidiness Issues

1. doggo, floofer, pupper, and puppo are columns rather than values in a single column

2. The tweet archive and tweet data are separate tables

## Cleaning

The cleaning process involved multiple steps to resolve all of the issues. The first step changed the data types of several columns. The timestamp was converted from a string to a datetime, and the image prediction columns containing dog breeds and other objects were converted to categories. The source column, which records where each tweet was sent from, was converted from a string to a category. Finally, the tweet ids were converted from ints to strings, as the id numbers are only for reference and are not useful in any mathematical sense.
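A minimal sketch of these type conversions, using a small made-up frame. The column names `tweet_id`, `timestamp`, and `source` and the sample values are assumptions for illustration only.

```python
import pandas as pd

# Made-up sample rows; the real archive has many more columns.
df = pd.DataFrame({
    "tweet_id": [892420643555336193, 892177421306343426],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-08-01 00:17:27 +0000"],
    "source": ["Twitter for iPhone", "Twitter Web Client"],
})

# String -> datetime.
df["timestamp"] = pd.to_datetime(df["timestamp"])

# String -> category for a column with few distinct values.
df["source"] = df["source"].astype("category")

# Int -> string, since ids are labels rather than quantities.
df["tweet_id"] = df["tweet_id"].astype(str)
```

The same `astype("category")` call would apply to the image prediction columns.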

Another set of cleaning steps removed unneeded information. Retweets were not useful to this analysis and were removed from the dataset. The source column contained unneeded HTML tags, so a regular expression was used to extract the wanted text from inside the tag. Some values in the dog name column were not names; these were replaced with NaN. It is possible that some of the non-name text was missed, but the bulk was removed, and further cleaning can be done if more issues are found.
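These three steps might look like the sketch below. It rests on assumptions not stated in this report: that retweets can be identified by a non-null `retweeted_status_id` column, that the source value is a single anchor tag, and that the non-names are exactly the three examples listed in the assessment. The regular expression is one possible pattern, not necessarily the one used.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; column names are assumptions.
df = pd.DataFrame({
    "retweeted_status_id": [np.nan, 8.87e17, np.nan],
    "source": [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
        '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
    ],
    "name": ["Bella", "a", "None"],
})

# Keep only original tweets (rows with no retweet id).
df = df[df["retweeted_status_id"].isna()].copy()

# Pull out the text between the opening and closing anchor tags.
df["source"] = df["source"].str.extract(r">([^<]+)<", expand=False)

# Replace known non-names with NaN; this list is illustrative, not exhaustive.
not_names = ["None", "the", "a"]
df["name"] = df["name"].mask(df["name"].isin(not_names))
```

Extending `not_names` is the natural place to handle any further non-name text found later.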

The last quality issue was that the dog breeds and other objects in the image predictions had inconsistent capitalization: some values contained uppercase letters and others were all lowercase. To make the data consistent, all of the columns containing this kind of data were converted to lowercase.
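A short sketch of the lowercase conversion, assuming the prediction columns are named `p1`, `p2`, and `p3` as in the assessment list. The sample values are made up.

```python
import pandas as pd

# Made-up predictions with mixed capitalization.
predictions = pd.DataFrame({
    "p1": ["Golden_retriever", "pug"],
    "p2": ["Labrador_retriever", "Chihuahua"],
    "p3": ["cocker_spaniel", "Pekinese"],
})

for col in ["p1", "p2", "p3"]:
    # .str.lower() does not return a categorical column, so the
    # category conversion is applied after lowercasing.
    predictions[col] = predictions[col].str.lower().astype("category")
```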

The remainder of the cleaning involved tidiness issues. The first tidiness issue was caused by each dog stage having its own column. The stages better serve the dataset as values of a single “dog_stage” variable, so the four columns were condensed into one column representing that variable. The final tidiness issue was that the tweet data and tweet archive tables both contained data about each tweet; they were therefore combined into a single table.
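Both tidiness fixes can be sketched as below. Several details are assumptions for illustration: that a missing stage is stored as the string "None", that the tables share a `tweet_id` key, and that a tweet tagged with more than one stage should keep all of them joined by commas. The report does not say how such cases were handled.

```python
import numpy as np
import pandas as pd

# Made-up archive rows with one column per dog stage.
archive = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})
tweet_data = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "retweet_count": [10, 3, 7],
    "favorite_count": [50, 20, 30],
})

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Treat the "None" placeholder as missing, then join whatever stages remain
# in each row; rows with no stage become NaN.
stages = archive[stage_cols].replace("None", np.nan)
archive["dog_stage"] = stages.apply(
    lambda row: ",".join(row.dropna()) or np.nan, axis=1
)
archive = archive.drop(columns=stage_cols)

# Combine the two per-tweet tables on their shared key.
tweets = archive.merge(tweet_data, on="tweet_id", how="inner")
```

An inner join keeps only tweets present in both tables; a left join would be the alternative if archive rows without API data should be retained.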

## Summary

The Twitter data for the account WeRateDogs was gathered from multiple sources, assessed, and then cleaned of all identified quality and tidiness issues. While analysis may reveal issues that need further cleaning, every issue found so far has been addressed and the data is ready for analysis.