# **@dog_rates Dog Ratings: Data Wrangling Steps**

The following data wrangling steps were performed:

## 1) Data Gathering
The data was gathered from three different sources and loaded into pandas dataframes:
-  A CSV file already on hand.
-  The web, using the Requests library.
-  The Twitter API, using the Tweepy library.
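
The three gathering routes can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names, the tab-separated format of the downloaded file, and the selected tweet fields are assumptions, and `api` is assumed to be an already authenticated `tweepy.API` object.

```python
import io

import pandas as pd


def load_archive(csv_source):
    """Read the on-hand CSV (a path or file-like object) into a dataframe."""
    return pd.read_csv(csv_source)


def download_table(url, sep="\t"):
    """Download a delimited file with Requests and load it into a dataframe.

    The tab separator is an assumption about the hosted file.
    """
    import requests  # third-party; imported here so the other helpers work without it

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return pd.read_csv(io.BytesIO(response.content), sep=sep)


def fetch_tweets(api, tweet_ids):
    """Query the Twitter API for each tweet id and keep a few fields.

    `api` is assumed to be an authenticated tweepy.API instance. Deleted
    tweets raise an error that a real run would need to handle.
    """
    rows = []
    for tweet_id in tweet_ids:
        data = api.get_status(tweet_id, tweet_mode="extended")._json
        rows.append({
            "tweet_id": data["id"],
            "hashtags": data["entities"]["hashtags"],
            "user_mentions": data["entities"]["user_mentions"],
        })
    return pd.DataFrame(rows)
```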

## 2) Data Assessment
After gathering and loading the data, it was assessed for quality and tidiness issues.
Some of the issues detected in the different dataframes included the following:

### a) Data Quality Issues

**twitter_archive table**

1) Some rows in the table are retweets. Only original ratings (non-retweets) are of interest for this project.

2) Erroneous datatype: the timestamp column is stored as object (string) instead of datetime.

3) Null values in the name column are represented as the string 'None' instead of the pandas null representation, NaN. There are also invalid names that need to be removed, such as 'a' (55 occurrences), 'the' (8 occurrences), 'an' (7 occurrences) and 'very'.

4) Some tweets rate more than one dog in the same picture (dog squads). Plurals such as 'puppers' were not considered for these cases. These tweets also seem to use a different rating system in which the numerator and denominator are each multiplied by the number of dogs in the picture. They should be adjusted to the common doggo rating system, where the rating denominator is 10, for easier analysis.

5) Some date and time expressions such as 24/7, 9/11 and 4/20 were accidentally interpreted as rating values. '50/50 split' was also misinterpreted as a dog rating.

6) Some tweet texts contain more than one expression in the rating format. The first occurrence was automatically taken as the rating, while in some cases the second occurrence was the actual rating.

7) Some ratings do not reflect the good ol' doggo rating system we are familiar with and are meant to be funny or satirical, such as one tweet rating a dog 666/10 and another rating a 'Snoop doggo' 420/10. To be fair to the other good doggos in the dataset, these outlier ratings are removed.

8) Many URLs in the expanded_urls column are duplicated.

9) The source device in the source column is enclosed in HTML tags.
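
Several of the issues listed for this table can be surfaced programmatically. The sketch below is one possible way to do it and is not taken from the notebook: the helper name `assess_archive` is hypothetical, and the column names `retweeted_status_id` and `rating_denominator` are assumptions about the archive's schema.

```python
import pandas as pd


def assess_archive(df):
    """Return a small report of counts for the quality issues above."""
    names = df["name"].dropna()
    return {
        # Rows that carry a retweeted status id are retweets (assumed column).
        "retweet_rows": int(df["retweeted_status_id"].notna().sum()),
        # Anything other than a datetime dtype means the column needs converting.
        "timestamp_dtype": str(df["timestamp"].dtype),
        # Missing names stored as the literal string 'None'.
        "names_stored_as_None": int((names == "None").sum()),
        # All-lowercase "names" such as 'a', 'an', 'the' are likely not names.
        "lowercase_names": names[names.str.islower()].value_counts().to_dict(),
        # Denominators other than 10 point at multi-dog or misparsed ratings.
        "denominator_not_10": int((df["rating_denominator"] != 10).sum()),
        "duplicated_urls": int(df["expanded_urls"].dropna().duplicated().sum()),
        "source_with_html": int(
            df["source"].str.contains("<a ", regex=False, na=False).sum()
        ),
    }
```

A report like this only flags candidates. Cases such as a second rating appearing later in the tweet text still need a visual check of the text itself.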

**image_predictions table**  

10) Some of the tweets are not related to dogs.
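
One way to separate dog from non-dog tweets is to keep only rows where at least one prediction is a dog breed. This is a sketch under assumptions: the boolean columns `p1_dog`, `p2_dog` and `p3_dog` are assumed to exist in the image_predictions table, and the helper name is hypothetical.

```python
import pandas as pd

# Assumed boolean columns: whether each of the top three predictions is a dog breed.
DOG_FLAGS = ["p1_dog", "p2_dog", "p3_dog"]


def dog_predictions_only(predictions):
    """Keep rows where at least one of the predictions is a dog breed."""
    return predictions[predictions[DOG_FLAGS].any(axis=1)]
```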

**api_data_df table**

11) Null hashtags and user_mentions are represented as empty Python lists instead of a pandas-recognized null representation.
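
A possible fix for this issue is to replace each empty list with NaN so that pandas null checks such as `isna()` work as expected. The helper name below is hypothetical; the column names come from the issue description.

```python
import numpy as np
import pandas as pd


def empty_lists_to_nan(df, columns=("hashtags", "user_mentions")):
    """Replace empty Python lists with NaN in the given columns."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].apply(
            lambda value: np.nan if isinstance(value, list) and not value else value
        )
    return out
```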


### b) Tidiness Issues

1) The doggo, floofer, pupper and puppo columns in the twitter_archive table should be represented as values of a single dog_stage column.

2) The three tables (api_data_df, twitter_archive and image_predictions) should be combined into one master table with the relevant data from all three.
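
Both tidiness fixes can be sketched in a few lines of pandas. The code below is illustrative only: it assumes each stage column holds either the stage name or the string 'None', that all three tables share a `tweet_id` key, and that an inner join is the desired way to combine them. The function names are hypothetical.

```python
import numpy as np
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]


def add_dog_stage(df):
    """Collapse the four stage columns into one dog_stage column."""

    def join_stages(row):
        # Keep only real stage names; 'None' and missing values are skipped.
        present = [row[c] for c in STAGES if isinstance(row[c], str) and row[c] != "None"]
        return ", ".join(present) if present else np.nan

    out = df.copy()
    out["dog_stage"] = out.apply(join_stages, axis=1)
    return out.drop(columns=STAGES)


def build_master(archive, api_data, predictions):
    """Inner-join the three tables on tweet_id (assumed shared key)."""
    return (
        archive.merge(api_data, on="tweet_id", how="inner")
        .merge(predictions, on="tweet_id", how="inner")
    )
```

Rows that mention two stages end up with a combined value such as `doggo, pupper`, which keeps the information visible instead of silently picking one.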


## 3) Data Cleaning

The data quality and tidiness issues identified and documented in the previous step were then cleaned.  
The final product of the cleaning steps was a single master dataframe, which was stored as a CSV file.
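
As a rough illustration of what such cleaning steps can look like for the twitter_archive table, the sketch below drops retweets, converts the timestamp, nulls out invalid names, strips the HTML around the source, and rescales ratings to a denominator of 10. It is not the notebook's actual code: the column names `retweeted_status_id`, `rating_numerator` and `rating_denominator`, the `max_rating` cutoff of 20, and the output file name are all assumptions.

```python
import pandas as pd

# Invalid "names" mentioned in the assessment, plus the 'None' placeholder.
INVALID_NAMES = {"None", "a", "an", "the", "very"}


def clean_archive(df, max_rating=20):
    """Apply a subset of the documented fixes; max_rating is a hypothetical cutoff."""
    # Keep original tweets only (assumed column: retweeted_status_id).
    out = df[df["retweeted_status_id"].isna()].copy()
    out["timestamp"] = pd.to_datetime(out["timestamp"])
    # Invalid names become NaN.
    out["name"] = out["name"].mask(out["name"].isin(INVALID_NAMES))
    # Keep only the text between the opening and closing HTML tags.
    out["source"] = out["source"].str.extract(r">([^<]+)<", expand=False)

    # Denominators that are not positive multiples of 10 (e.g. 24/7) are dropped.
    valid = (out["rating_denominator"] > 0) & (out["rating_denominator"] % 10 == 0)
    out = out[valid].copy()
    # Rescale multi-dog ratings such as 84/70 to the common x/10 system.
    out["rating_numerator"] = out["rating_numerator"] * 10 / out["rating_denominator"]
    out["rating_denominator"] = 10
    # Remove satirical outliers such as 666/10.
    return out[out["rating_numerator"] <= max_rating]


# Hypothetical usage; the file name is an assumption:
# clean_archive(twitter_archive).to_csv("twitter_archive_master.csv", index=False)
```

Rule-based fixes like these cannot catch everything. A misread such as '50/50 split' has a denominator that is a multiple of 10, so it would still need to be corrected by hand.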
