# Data Wrangling Project - WeRateDogs Dataset

### Table of contents
- [Project Detail](#detail)
- [Gathering Data](#gather)
- [Assessing Data](#assess)
    - [Data issues](#issues)
- [Cleaning Data](#clean)
    - [Cleaning summary](#summary)
- [Storing Data](#store)
- [Conclusion](#conclusion)

## Project Detail <a class="anchor" id="detail"></a>

This project consists of analyzing three different types of files and wrangling their data. To complete this process, we will gather, assess, clean and store the data for visualization and analysis. The dataset comes from the Twitter account WeRateDogs, which rates people's dogs with a lot of humor, showing funny pictures and comments and a unique rating system. The purpose of this task is to put into practice what I learned in the Data Wrangling lesson of the Udacity Data Analyst Nanodegree.

## Gathering Data <a class="anchor" id="gather"></a>
In this step, data will be collected from three sources and loaded into dataframes. The data will be obtained as follows:
1. **Twitter archive file**: the *twitter-archive-enhanced.csv* file was provided by Udacity on the Data Wrangling project website ([**download link**](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv)). The ***df_tweets*** dataframe will be created with the pandas *read_csv* function.
2. **Tweet image predictions**: the *image-predictions.tsv* file will be downloaded programmatically using the requests library and the URL where Udacity hosts the file (*https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv*). This file contains breed classifications based on AI predictions for the dog images. The ***df_predictions*** dataframe will be created with the pandas *read_csv* function and the argument *sep='\t'*, since the file is tab-separated.
3. **Twitter API JSON**: the *tweet_json.txt* file will be created by querying the Twitter API for each tweet ID in the first file using Python's Tweepy library and storing each tweet's JSON data in this text file. With Python's *json* library, the text file will be read line by line and a pandas dataframe (***df_tweets_api***) will be created with the tweet ID, retweet count, favorite count and retweeted flag.
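
The line-by-line reading described in item 3 could look roughly like the sketch below. It is only an illustration: the helper name `tweets_json_to_df`, the JSON keys (`id`, `retweet_count`, `favorite_count`, `retweeted`) and the two sample lines are assumptions standing in for the real *tweet_json.txt* content.

```python
import io
import json

import pandas as pd


def tweets_json_to_df(lines):
    """Build a dataframe from an iterable of JSON strings, one tweet per line.

    The key names used here are assumed; adjust them to the stored JSON.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
            "retweeted": tweet["retweeted"],
        })
    columns = ["tweet_id", "retweet_count", "favorite_count", "retweeted"]
    return pd.DataFrame(rows, columns=columns)


# Made-up sample that stands in for the content of tweet_json.txt
sample_file = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 25, "retweeted": false}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 7, "retweeted": false}\n'
)
df_tweets_api = tweets_json_to_df(sample_file)
```

With the real file, the same helper can be given an open file handle instead of the in-memory sample, since a file object also yields one line per iteration.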

## Assessing Data <a class="anchor" id="assess"></a>
Once the three dataframes are obtained, I will assess them visually and programmatically. For the former, I will use the Jupyter Notebook to print the three dataframes, changing the display parameters with *pd.set_option()* to show the entire dataframe. For the latter, I will use pandas methods such as *info*, *duplicated*, *unique* and *value_counts*.
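
As a minimal sketch of both kinds of assessment, the snippet below widens the display options and runs the programmatic checks on a tiny made-up table. The sample rows and the column names `jpg_url` and `p1` are placeholders for illustration, not the real data.

```python
import pandas as pd

# Visual assessment: show every row and column when a dataframe is displayed
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

# Made-up stand-in for df_predictions
df_predictions = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "jpg_url": ["a.jpg", "b.jpg", "a.jpg"],
    "p1": ["pug", "pug", "beagle"],
})

# Programmatic assessment
df_predictions.info()                                            # dtypes and non-null counts
n_duplicated_urls = df_predictions["jpg_url"].duplicated().sum() # repeated image URLs
breed_counts = df_predictions["p1"].value_counts()               # frequency of each value
unique_breeds = df_predictions["p1"].unique()                    # distinct values
```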

Then the problems encountered will be cataloged as quality issues and tidiness issues.
1. Quality: issues with content. Low quality data is also known as dirty data;
2. Tidiness: issues with structure that prevent easy analysis. Untidy data is also known as messy data.

### Data issues <a class="anchor" id="issues"></a>

#### Quality
1. The URL is repeated in some rows of the expanded_urls column of the df_tweets table;
2. ID columns stored as float values in the df_tweets table;
3. Rating numerator in df_tweets stored as int instead of float, although some ratings have decimal values;
4. Rating numerator in df_tweets with wrong values where the rating has decimals;
5. Wrong classification for IDs 835246439529840640 and 666287406224695296 in df_tweets table;
6. Retweets are present in the df_tweets table; only original ratings should be kept;
7. 66 duplicated jpg_url in df_predictions table;
8. Multiple predictions per image in the df_predictions table, but only the first correct one is needed;
9. Some columns won't be used for analysis.

#### Tidiness
10. Dog stage spread across multiple columns (doggo, floofer, pupper and puppo);
11. All tables should be combined into a single dataset.

## Cleaning Data <a class="anchor" id="clean"></a>
In this step, the issues found in Assessing Data will be corrected following the process of defining the problem, coding the correction and testing the code. First, to avoid losing data during the correction process, we will create copies of all dataframes, named after the originals without the *df* prefix and with the word *clean* at the end. Whenever a mistake is made, the copy can be recreated without losing the original data.
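
The copy step can be sketched as follows. The two sample rows are made up, and `tweets_clean` simply follows the naming rule described above.

```python
import pandas as pd

# Made-up stand-in for the original dataframe
df_tweets = pd.DataFrame({"tweet_id": [1, 2], "rating_numerator": [12, 13]})

# Work on a copy so the original stays intact
tweets_clean = df_tweets.copy()

# Simulate a cleaning mistake on the copy
tweets_clean.loc[0, "rating_numerator"] = 99

# The original dataframe is unaffected by the change
original_value = df_tweets.loc[0, "rating_numerator"]

# Restart the cleaning from the untouched original
tweets_clean = df_tweets.copy()
```

Using *copy()* rather than a plain assignment matters here: `tweets_clean = df_tweets` would only create a second name for the same object, so a mistake would also change the original.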

### Cleaning summary <a class="anchor" id="summary"></a>

#### Quality
1. Split the URLs on the comma and keep only the Twitter URL;
2. Change the type of the ID columns to integer;
3. Change type of rating_numerator column to float;
4. Correct the decimal values of rating_numerator column;
5. Correct the classification of IDs 835246439529840640 and 666287406224695296;
6. Delete rows with retweeted_status_user_id;
7. Drop duplicated values of jpg_url;
8. Create prediction and confidence columns in the dataframe with the first true prediction found, searching from p1 to p3;
9. Drop columns not used for analysis.
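
Item 8 is the least obvious of these steps, so here is one possible sketch of it. It assumes the predictions table has columns named `p1`, `p1_conf`, `p1_dog` (and the same for `p2` and `p3`), where the `_dog` flag says whether the prediction is a dog breed; the sample rows and the helper name `first_dog_prediction` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for predictions_clean (column layout is an assumption)
predictions_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "p1": ["pug", "seat_belt", "web_site"],
    "p1_conf": [0.9, 0.5, 0.6],
    "p1_dog": [True, False, False],
    "p2": ["beagle", "chihuahua", "book"],
    "p2_conf": [0.05, 0.3, 0.2],
    "p2_dog": [True, True, False],
    "p3": ["collie", "pug", "lamp"],
    "p3_conf": [0.01, 0.1, 0.1],
    "p3_dog": [True, True, False],
})


def first_dog_prediction(row):
    """Return the first prediction flagged as a dog, searching p1, p2, p3."""
    for n in (1, 2, 3):
        if row[f"p{n}_dog"]:
            return pd.Series({"prediction": row[f"p{n}"],
                              "confidence": row[f"p{n}_conf"]})
    # None of the three predictions is a dog
    return pd.Series({"prediction": np.nan, "confidence": np.nan})


predictions_clean[["prediction", "confidence"]] = predictions_clean.apply(
    first_dog_prediction, axis=1
)
```

Rows where none of the three predictions is a dog end up with missing values in both new columns, which makes them easy to filter out later.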

#### Tidiness
10. Group all the columns that represent the dog stage into a single column;
11. Merge all dataframes into one. An inner join is used so that the merged dataframe only contains tweets present in all tables.
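
The two tidiness steps could be sketched like this. The sample rows, the new column name `stage`, the join key `tweet_id` and the use of the string `"None"` for a missing stage are all assumptions made for illustration.

```python
import pandas as pd

# Made-up stand-ins for the three cleaned dataframes
tweets_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})
predictions_clean = pd.DataFrame({"tweet_id": [1, 2], "prediction": ["pug", "beagle"]})
tweets_api_clean = pd.DataFrame({"tweet_id": [1, 2, 3], "retweet_count": [10, 3, 5]})

stage_cols = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(row):
    """Join the stages present in a row; return None when there is none."""
    stages = [row[col] for col in stage_cols if row[col] != "None"]
    return ", ".join(stages) if stages else None


# Tidiness 10: one 'stage' column instead of four
tweets_clean["stage"] = tweets_clean[stage_cols].apply(combine_stages, axis=1)
tweets_clean = tweets_clean.drop(columns=stage_cols)

# Tidiness 11: inner joins keep only tweets found in every table
master = (
    tweets_clean
    .merge(predictions_clean, on="tweet_id", how="inner")
    .merge(tweets_api_clean, on="tweet_id", how="inner")
)
```

In this sample, tweet 3 has no row in the predictions table, so the inner join drops it from the merged result.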

## Storing Data <a class="anchor" id="store"></a>
Use the pandas *to_csv* method to create a new CSV file (*twitter_archive_master.csv*) that stores the result of the data wrangling process.
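
A minimal sketch of this step, using a made-up two-row dataframe and a temporary directory in place of the project folder:

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the merged dataframe
master = pd.DataFrame({"tweet_id": [1, 2], "prediction": ["pug", "beagle"]})

# A temporary directory is used here only to keep the example self-contained
out_path = os.path.join(tempfile.mkdtemp(), "twitter_archive_master.csv")

# index=False avoids writing the row index as an extra unnamed column
master.to_csv(out_path, index=False)

# Read the file back to check what was stored
reloaded = pd.read_csv(out_path)
```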

## Conclusion <a class="anchor" id="conclusion"></a>
Data wrangling is a core skill for a data analyst. Mastering the Python language and the tools used in this project is highly recommended for working with large amounts of data from different sources. It is hard to imagine doing all this work manually, and technology is here to help us.