# Wrangle & Analyze WeRateDogs Data

<hr>

## Wrangle Report

### Gathering Data

Data for the project was gathered from three sources:

* Enhanced Twitter Archive - This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017. The data is loaded from the CSV file into a pandas DataFrame with 2356 rows and 16 columns. The tweet ID column is used as the index.  
<br/>
* Image Predictions File - This file contains the top three dog breed predictions for each dog image from the WeRateDogs Enhanced Twitter Archive. The table contains the top three predictions, the tweet ID, the image URL, and the number of the image that corresponds to the most confident prediction (numbered 1 to 4, since tweets can have up to four images). The data is downloaded programmatically from its URL into a TSV file, which is then loaded into a pandas DataFrame with 2075 rows and 11 columns. The tweet ID column is used as the index.  
<br/>
* Twitter API File - Retweet count and favorite count are two notable omissions from the Twitter data archive. Fortunately, this additional data can be gathered from Twitter's API. The Twitter API file contains the tweet ID, favorite count and retweet count. The data is loaded from the TXT file into a pandas DataFrame with 2354 rows and 2 columns. The tweet ID column is used as the index.
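
The loading steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the in-memory buffers stand in for the real files, their contents are made up, and the column names other than the tweet ID are assumptions. The API file is assumed to hold one JSON object per line.

```python
import io
import json

import pandas as pd

# Tiny in-memory stand-ins for the real files (made-up contents).
archive_csv = io.StringIO(
    "tweet_id,timestamp,text\n"
    "1001,2017-08-01 16:23:56 +0000,This is Phineas. 13/10\n"
    "1002,2017-08-01 00:17:27 +0000,This is Tilly. 13/10\n"
)
predictions_tsv = io.StringIO(
    "tweet_id\tjpg_url\timg_num\tp1\tp1_conf\tp1_dog\n"
    "1001\thttps://example.com/a.jpg\t1\tgolden_retriever\t0.95\tTrue\n"
)
api_txt = io.StringIO(
    '{"id": 1001, "retweet_count": 10, "favorite_count": 100}\n'
    '{"id": 1002, "retweet_count": 20, "favorite_count": 200}\n'
)

# CSV archive: the tweet ID becomes the index.
archive = pd.read_csv(archive_csv, index_col="tweet_id")

# TSV predictions: same reader, tab separator.
predictions = pd.read_csv(predictions_tsv, sep="\t", index_col="tweet_id")

# API file (assumed format: one JSON object per line).
records = [json.loads(line) for line in api_txt if line.strip()]
api = (
    pd.DataFrame(records)
    .rename(columns={"id": "tweet_id"})
    .set_index("tweet_id")[["retweet_count", "favorite_count"]]
)
```

With real data, the buffers would be replaced by file paths; the download of the TSV file is not shown here.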


### Assessing Data

After gathering, the data is assessed for tidiness and quality:

* Enhanced Twitter Archive - As a first step, a sample of the data is assessed visually and a summary of data types and non-null values is displayed. This makes it possible to identify columns with an incorrect data type and/or null values. Then, IDs are checked for duplicates. Next, the number of tweets that are replies or retweets is calculated - these tweets will not be used in the analysis. The source column is checked by displaying all existing values, while the quality of the text column is assessed visually by displaying a sample. Expanded URLs are first assessed visually and then checked programmatically for cells that contain two or more URLs. The rating denominator is assessed visually by displaying a sample of data, and ratings with a denominator greater than 10 are then printed out for further investigation. The rating numerator is also assessed visually. Based on the visual assessment of the rating columns, the text column is checked programmatically for float ratings. The dog name column is assessed visually and then checked programmatically for stop words. The dog category (stage) columns are checked for the number of values. As the last step, all tweets are checked for dogs with more than one category (stage) assigned.

        Quality and tidiness issues detected:
            - some of the gathered tweets are replies and should be removed;
            - the timestamp has an incorrect data type - it is an object, should be datetime;
            - source is an HTML element - its text should be extracted;
            - some rows in the text column begin with 'RT @dog_rates:';
            - some rows in the text column have leading and/or trailing whitespace;
            - some of the gathered tweets are retweets;
            - we have 59 missing expanded urls;
            - we have 639 expanded urls which contain more than one url address;
            - denominator of some ratings is not 10;
            - numerator of some ratings is greater than 10 (does not need to be cleaned);
            - float ratings have been incorrectly read from the text of tweet;
            - 'None' in the name column should be converted to NaN;
            - we have stop words in the name column;
            - dog 'stage' classification (doggo, floofer, pupper or puppo) should be one column;
            - some dogs have more than one category assigned;  
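
A few of the programmatic checks described above might look like the sketch below. The sample rows are made up, the `'None'` placeholder in the stage columns is an assumption, and treating lowercase names as stop words is a heuristic chosen for illustration rather than the check used in the project.

```python
import pandas as pd

# Made-up sample rows; only two of the four stage columns are shown.
archive = pd.DataFrame(
    {
        "text": [
            "This is Bella. 13.5/10 would pet",
            "Here is a very good dog. 12/10",
            "Happy 4/20 from the squad! 13/10 for all",
        ],
        "rating_numerator": [5, 12, 4],
        "rating_denominator": [10, 10, 20],
        "name": ["Bella", "a", "None"],
        "doggo": ["None", "doggo", "None"],
        "pupper": ["None", "pupper", "None"],
    },
    index=pd.Index([1001, 1002, 1003], name="tweet_id"),
)

# Number of duplicated tweet IDs in the index.
duplicate_ids = int(archive.index.duplicated().sum())

# Ratings whose denominator is greater than 10.
odd_denominators = archive[archive["rating_denominator"] > 10]

# Tweets whose text contains a decimal rating such as 13.5/10.
float_ratings = archive[archive["text"].str.contains(r"\d+\.\d+/\d+", regex=True)]

# Heuristic (assumption): lowercase "names" are ordinary words, not dog names.
suspect_names = archive[archive["name"].str.islower()]

# Rows with more than one stage column filled in.
stage_cols = ["doggo", "pupper"]
multi_stage = archive[(archive[stage_cols] != "None").sum(axis=1) > 1]
```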

<br/>

* Image Predictions - As a first step, a sample of the data is assessed visually and a summary of data types and non-null values is displayed. This makes it possible to identify columns with an incorrect data type and/or null values. Then, IDs are checked for duplicates. Next, the jpg_url column is checked to confirm that it contains only jpg and png images. As the last step, the 1st prediction is checked to see how many images have been classified as dog images.

        Quality and tidiness issues detected:
            - the dataset has 2075 entries, while Enhanced Twitter Archive dataset has 2356 entries;
            - column names are confusing and do not give much information about the content;
            - dog breeds contain underscores, and have different case formatting;
            - not all of the 2075 images have been classified as dog images for the top prediction;
            - dataset should be merged with the Enhanced Twitter Archive dataset;  
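
The two checks specific to this dataset can be sketched as follows. Only `jpg_url` is named in the text above; `p1` and `p1_dog` are hypothetical column names for the top prediction and its "is a dog" flag, and the rows are made up.

```python
import pandas as pd

# Made-up sample rows with hypothetical prediction columns.
predictions = pd.DataFrame(
    {
        "jpg_url": [
            "https://example.com/a.jpg",
            "https://example.com/b.png",
            "https://example.com/c.jpg",
        ],
        "p1": ["golden_retriever", "seat_belt", "Pembroke"],
        "p1_dog": [True, False, True],
    },
    index=pd.Index([1001, 1002, 1003], name="tweet_id"),
)

# Every image URL should end in .jpg or .png (case-insensitive).
valid_ext = predictions["jpg_url"].str.contains(r"\.(?:jpg|png)$", case=False)
bad_urls = predictions[~valid_ext]

# Number of rows whose top prediction is flagged as a dog breed.
n_dogs = int(predictions["p1_dog"].sum())
```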
            
<br/>

* Twitter API Data - As a first step, a sample of the data is assessed visually and a summary of data types and non-null values is displayed. This makes it possible to identify columns with an incorrect data type and/or null values. Then, IDs are checked for duplicates.

        Quality and tidiness issues detected:
            - the dataset has 2354 entries, while the Enhanced Twitter Archive dataset has 2356;
            - the dataset should be merged with the Enhanced Twitter Archive dataset;

### Cleaning Data

The quality and tidiness issues identified in the Assessing Data section are addressed using pandas:

* Enhanced Twitter Archive - As a first step, a copy of the dataset is created for use throughout the cleaning exercise. As some of the gathered tweets are replies or retweets, we remove them together with their associated columns. Next, we fix the timestamp, which is stored as an object, by converting it to datetime. The source column is cleaned by extracting the inner text from the HTML element. As some rows in the text column have leading and/or trailing whitespace, we strip it. There are 639 expanded URLs that contain more than one URL address and 59 missing expanded URLs, so we rebuild the links from the tweet ID. Float ratings, which were incorrectly read from the tweet text, are extracted again, this time correctly. The denominator of some ratings is not 10, and the numerator of some ratings is greater than 10. Numerators greater than the denominator do not need to be cleaned; however, we introduce a normalized rating that is used for plots. As the name column contains stop words and 'None' values, we replace stop words with the correct name and 'None' with NaN. The dog 'stage' classification (doggo, floofer, pupper or puppo), which was spread across four separate columns, is merged into one column.
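
Several of these steps can be illustrated with the sketch below. It is not the project's actual code: the rows are made up, only two of the four stage columns are shown, the URL pattern is an assumption, and the normalized rating is assumed to be the numerator divided by the denominator.

```python
import numpy as np
import pandas as pd

archive = pd.DataFrame(
    {
        "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
        "source": [
            '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
            '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
        ],
        "text": ["  This is Bella. 13.5/10 would pet ", "Here is a pupper. 12/10"],
        "name": ["Bella", "None"],
        "doggo": ["None", "None"],
        "pupper": ["None", "pupper"],
    },
    index=pd.Index([1001, 1002], name="tweet_id"),
)
clean = archive.copy()

# Object -> timezone-aware datetime.
clean["timestamp"] = pd.to_datetime(clean["timestamp"], utc=True)

# Keep only the inner text of the HTML anchor element.
clean["source"] = clean["source"].str.extract(r">([^<]+)<", expand=False)

# Strip leading and trailing whitespace.
clean["text"] = clean["text"].str.strip()

# Rebuild one link per tweet from its ID (URL pattern is an assumption).
clean["expanded_urls"] = "https://twitter.com/dog_rates/status/" + clean.index.astype(str)

# Re-extract ratings from the text so decimals such as 13.5/10 survive.
ratings = clean["text"].str.extract(r"(\d+(?:\.\d+)?)/(\d+)").astype(float)
clean["rating_numerator"] = ratings[0]
clean["rating_denominator"] = ratings[1]
clean["rating_normalized"] = ratings[0] / ratings[1]  # assumed definition

# 'None' placeholders become real missing values.
clean["name"] = clean["name"].replace("None", np.nan)

# Collapse the stage columns into one; multiple stages are comma-separated.
stage_cols = ["doggo", "pupper"]
clean["stage"] = (
    clean[stage_cols]
    .apply(lambda row: ", ".join(v for v in row if v != "None"), axis=1)
    .replace("", np.nan)
)
clean = clean.drop(columns=stage_cols)
```

Working on a copy keeps the original DataFrame available for comparison after each step.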

<br/>

* Image Predictions - As a first step, a copy of the dataset is created for use throughout the cleaning exercise. As some of the column names are confusing and say little about the content, we rename them. Then we clean the dog breeds: underscores are replaced with spaces and the first letter is capitalized, which gives consistent formatting. Since not every image has been classified as a dog image by the top (1st) prediction, we use the dog breed from the 2nd or 3rd prediction for the remaining rows.
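
The breed fallback and formatting could be sketched like this. The column names `p1`..`p3` and `p1_dog`..`p3_dog` are hypothetical, the rows are made up, and the new `breed` column is introduced only for illustration.

```python
import pandas as pd

predictions = pd.DataFrame(
    {
        "p1": ["golden_retriever", "seat_belt", "web_site"],
        "p1_dog": [True, False, False],
        "p2": ["Labrador_retriever", "pug", "desk"],
        "p2_dog": [True, True, False],
        "p3": ["kuvasz", "toy_poodle", "lamp"],
        "p3_dog": [True, True, False],
    },
    index=pd.Index([1001, 1002, 1003], name="tweet_id"),
)
clean = predictions.copy()

# Take the first prediction flagged as a dog, falling back from p1 to p2 to p3.
breed = clean["p1"].where(clean["p1_dog"])
breed = breed.fillna(clean["p2"].where(clean["p2_dog"]))
breed = breed.fillna(clean["p3"].where(clean["p3_dog"]))

# Underscores -> spaces, first letter upper case, the rest lower case.
clean["breed"] = breed.str.replace("_", " ").str.capitalize()
```

Rows where none of the three predictions is a dog keep a missing value in `breed`.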

<br/>

* Twitter API Data - No cleaning is needed for this dataset.

As the last step of the cleaning process, we merge all datasets into one and export the result to a CSV file.
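
A minimal sketch of the merge and export, assuming all three DataFrames share the tweet ID index. The left join is an assumption (the join type is not stated above), the rows are made up, and an in-memory buffer replaces the output file.

```python
import io

import pandas as pd

ids = pd.Index([1001, 1002, 1003], name="tweet_id")
archive = pd.DataFrame({"name": ["Bella", "Rex", "Milo"]}, index=ids)
predictions = pd.DataFrame(
    {"breed": ["Pug", "Kuvasz"]},
    index=pd.Index([1001, 1003], name="tweet_id"),
)
api = pd.DataFrame(
    {"retweet_count": [10, 20, 30], "favorite_count": [100, 200, 300]},
    index=ids,
)

# Left joins on the shared index keep every archive row (assumed join type).
master = archive.join(predictions, how="left").join(api, how="left")

# Export; a real run would pass a file path instead of a buffer.
buffer = io.StringIO()
master.to_csv(buffer)
```

Tweets without an image prediction end up with a missing `breed` rather than being dropped; an inner join would drop them instead.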
    