## Data Wrangling Report

### Introduction

In this project, I gathered data from the WeRateDogs Twitter archive. The goal for this project was to wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations.

The wrangling tasks completed in this project are:
* Data gathering
* Assessing data
* Cleaning data

### Data Gathering

Data for the project was gathered from three sources, as explained below.

#### 1. Enhanced Twitter Archive

This archive contains basic tweet data (tweet ID, timestamp, text, etc.) from 2015, when the WeRateDogs account was created, up to August 1, 2017. It was provided by Udacity as a CSV file and contains 2000+ tweets, along with each dog's rating, name, and "stage".

#### 2. Tweet Image Predictions Dataset

The file is hosted on Udacity's servers and was downloaded programmatically, using the Requests library, from [image_predictions.tsv](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv). This dataset contains dog breed predictions (from a neural network classifier) for every dog image in the WeRateDogs Twitter archive.
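A minimal sketch of the programmatic download, assuming the URL above is still reachable. The helper name `download_file`, the timeout value, and the local filename are illustrative and not taken from the project notebook.

```python
from pathlib import Path

import requests

# URL of the image predictions file, as given in this report.
PREDICTIONS_URL = (
    "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
    "599fd2ad_image-predictions/image-predictions.tsv"
)


def download_file(url: str, dest: Path, timeout: float = 30.0) -> Path:
    """Download `url` and write the response body to `dest` (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors instead of saving an error page
    dest.write_bytes(response.content)
    return dest


# Usage (performs a real network request):
# download_file(PREDICTIONS_URL, Path("image_predictions.tsv"))
```

The saved file is tab-separated, so it can then be loaded with `pandas.read_csv(..., sep="\t")`.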

#### 3. Twitter API (tweepy)

This data resides on Twitter and can be pulled through the Twitter API using the tweepy library. I used the API to query additional data (in JSON format) and saved it to a file named `tweet_json.txt`. This file holds the favorite and retweet counts for each tweet ID in the WeRateDogs Twitter archive, which are crucial for the dog rating analysis.
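The downloaded file can be reduced to the fields needed for analysis roughly as follows. This is a sketch that assumes `tweet_json.txt` stores one tweet's JSON object per line (an assumption about the file layout) and that each object carries the Twitter API fields `id`, `retweet_count`, and `favorite_count`. The function name `parse_tweet_lines` is hypothetical.

```python
import json
from typing import Dict, Iterable, List


def parse_tweet_lines(lines: Iterable[str]) -> List[Dict[str, int]]:
    """Keep only the ID and the two counts from each JSON line."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        records.append(
            {
                "tweet_id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            }
        )
    return records


# Usage (assumes tweet_json.txt exists in the working directory):
# with open("tweet_json.txt", encoding="utf-8") as fh:
#     api_records = parse_tweet_lines(fh)
```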

### Assessing Data

Below are the steps taken in assessing the data.

#### Enhanced Twitter Archive

* As a first step, a sample of data was assessed visually and a summary of data types and non-null values was displayed. This allowed us to identify columns with incorrect data type and/or null values.
* Then, IDs were checked for duplicates.
* Next, the number of tweets which are replies and retweets was assessed.
* Expanded URLs were first assessed visually and then checked programmatically for entries containing more than one URL.
* The `name` column was assessed programmatically for anomalies and data inconsistency.
* Then, all tweets were checked for dogs with more than one growth stage assigned.
* Rating denominators and numerators were assessed visually by displaying a sample of data. Based on that assessment of the rating columns, the `text` column was then checked programmatically for any float ratings.
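Several of the programmatic checks above can be sketched with pandas. The column names `retweeted_status_id` and `in_reply_to_status_id` are assumptions about the archive's schema, the helper name is hypothetical, and the sample rows are made up for illustration.

```python
import pandas as pd


def assess_archive(archive: pd.DataFrame) -> dict:
    """Return a few programmatic assessment counts for the archive."""
    return {
        # Tweet IDs that appear more than once.
        "duplicate_ids": int(archive["tweet_id"].duplicated().sum()),
        # Rows flagged as retweets or replies (assumed column names).
        "retweets": int(archive["retweeted_status_id"].notna().sum()),
        "replies": int(archive["in_reply_to_status_id"].notna().sum()),
        # Tweets whose text contains a decimal rating such as 9.75/10.
        "float_ratings": int(archive["text"].str.contains(r"\d+\.\d+/\d+").sum()),
    }


# Made-up sample rows, only to show the shape of the input.
sample = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3, 3],
        "retweeted_status_id": [None, 99.0, None, None],
        "in_reply_to_status_id": [None, None, 42.0, None],
        "text": ["13/10 good boy", "RT 12/10", "9.75/10 close", "11/10"],
    }
)
print(assess_archive(sample))
```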

##### Observations from Enhanced Twitter Archive Assessment

##### Quality issues

1. Dataset contains retweets

2. The `name` column contains "None" and some stopwords like 'a', 'an', etc.

3. Some dogs are not classified as one of "doggo", "floofer", "pupper" or "puppo".

4. The `source` column contains HTML markup rather than the plain source name

5. Some expanded URL entries contain more than one URL

6. The `timestamp` column has the wrong data type

7. Some rating numerators are wrong

##### Tidiness issues
1. The columns `doggo`, `floofer`, `pupper` and `puppo` all represent the dog's stage and should be in one column

#### Tweet Image Predictions Dataset

* A sample of data was assessed visually and a summary of data types and non-null values was displayed. This made it possible to identify columns with an incorrect data type and/or null values.
* Then, the `jpg_url` column was checked for duplicates.
* Lastly, the first prediction was checked to see how many images were correctly classified as dog images.
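The duplicate and top-prediction checks might look like the sketch below. The boolean column `p1_dog` (whether the first prediction is a dog breed) is an assumed column name, and the sample rows are invented.

```python
import pandas as pd


def assess_predictions(preds: pd.DataFrame) -> dict:
    """Count duplicated image URLs and dog / non-dog top predictions."""
    return {
        "duplicate_images": int(preds["jpg_url"].duplicated().sum()),
        # `p1_dog` is assumed to be True when the top prediction is a dog breed.
        "top_prediction_is_dog": int(preds["p1_dog"].sum()),
        "top_prediction_not_dog": int((~preds["p1_dog"]).sum()),
    }


# Invented rows for illustration only.
preds_sample = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3],
        "jpg_url": ["a.jpg", "b.jpg", "a.jpg"],
        "p1_dog": [True, False, True],
    }
)
print(assess_predictions(preds_sample))
```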

##### Observations from Image Predictions Dataset Assessment

##### Quality issues

1. The dataset contains 66 duplicated images/retweets

2. For some pictures, the top prediction was not a dog

3. The breed prediction columns use inconsistent letter case, and underscores separate the words of a breed name

##### Tidiness issues

1. The dataset contains `tweet_id`. Thus, it should be merged with the Twitter Archive dataset.

#### Twitter API Dataset

* A summary of data types and non-null values in the dataset was checked.
* Then, the API data was checked for retweets.

##### Observations from Twitter API Dataset Assessment

##### Tidiness issues

1. `display_text_range` contains two variables (the lower and upper bounds of the text range)

2. The dataset contains `tweet_id`. Thus, it should be merged with the Twitter Archive dataset.

### Data Cleaning

The quality and tidiness issues identified in the Assessing Data section were cleaned using pandas, regular expressions, and some custom modules.

#### Twitter Archive Dataset

* First, a copy of the dataset was created for use throughout the cleaning exercise.
* Then, retweets and replies were removed from the dataset, and the columns holding retweet and reply information were dropped.
* Names that are stopwords or `None` were replaced with NaN.
* The dog stage classification (`doggo`, `floofer`, `pupper`, `puppo`), which was spread across four separate columns, was merged into one column.
* The dog stage was extracted from the `text` column.
* The `source` column, which contained HTML, was redefined by extracting the source name from the HTML.
* Some tweet URLs contained more than one link, so correct links were built from the tweet ID.
* Next, the `timestamp` column, which had an incorrect data type, was converted to a datetime object.
* Lastly, the rating numerators were re-extracted from the `text` column and cleaned appropriately.
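A condensed sketch of several of the cleaning steps above, not the exact code used in the project. It assumes the retweet and reply columns are named `retweeted_status_id` and `in_reply_to_status_id`, that absent stages are stored as the string "None", and that a short hypothetical list of stopwords is enough. The sample rows are invented.

```python
import numpy as np
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]
NOT_NAMES = {"None", "a", "an", "the"}  # hypothetical list of non-names


def clean_archive(archive: pd.DataFrame) -> pd.DataFrame:
    # Keep original tweets only, working on a copy (assumed column names).
    original = archive["retweeted_status_id"].isna() & archive["in_reply_to_status_id"].isna()
    df = archive[original].copy()
    df = df.drop(columns=["retweeted_status_id", "in_reply_to_status_id"])

    # Replace stopwords and "None" in the name column with NaN.
    df["name"] = df["name"].mask(df["name"].isin(NOT_NAMES))

    # Merge the four stage columns into one (assumes "None" marks an absent stage).
    parts = df[STAGES].replace("None", "")
    df["stage"] = parts.apply(lambda row: ", ".join(s for s in row if s), axis=1)
    df["stage"] = df["stage"].mask(df["stage"] == "")
    df = df.drop(columns=STAGES)

    # Keep only the link text of the HTML anchor in the source column.
    df["source"] = df["source"].str.extract(r">([^<]+)<", expand=False)

    # Convert the timestamp strings to datetime values.
    df["timestamp"] = pd.to_datetime(df["timestamp"])

    # Re-extract the numerator from the text, allowing decimals such as 13.5/10.
    numerators = df["text"].str.extract(r"(\d+(?:\.\d+)?)/\d+", expand=False)
    df["rating_numerator"] = numerators.astype(float)
    return df


# Invented rows for illustration only.
sample = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3],
        "retweeted_status_id": [np.nan, 99.0, np.nan],
        "in_reply_to_status_id": [np.nan, np.nan, np.nan],
        "timestamp": [
            "2017-08-01 16:23:56 +0000",
            "2017-07-31 00:18:03 +0000",
            "2017-07-30 15:58:51 +0000",
        ],
        "source": ['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'] * 3,
        "text": ["This is Bella. 13.5/10 doggo", "RT 12/10", "Here is a pupper 11/10"],
        "name": ["Bella", "None", "a"],
        "doggo": ["doggo", "None", "None"],
        "floofer": ["None"] * 3,
        "pupper": ["None", "None", "pupper"],
        "puppo": ["None"] * 3,
    }
)
cleaned = clean_archive(sample)
print(cleaned[["tweet_id", "name", "stage", "source", "rating_numerator"]])
```

The first regex match in the text is taken as the rating, so tweets that contain several fractions would still need a manual check.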

#### Tweet Image Predictions Dataset

* First, a copy of the dataset was created for use throughout the cleaning exercise.
* Then, the 66 duplicated images were dropped from the dataset.
* For the pictures where the top prediction was not a dog, the 2nd or 3rd prediction was used to obtain the dog breed.
* Underscores were then replaced with whitespace in the `breed` column, and the first letter of each word was capitalized to make it human readable.
* Finally, the cleaned version of this dataset was merged with the Twitter Archive dataset using `tweet_id`.
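The fallback to the 2nd or 3rd prediction and the breed-name formatting can be sketched in plain Python. The function name `pick_breed` and the `(label, is_dog)` input shape are hypothetical and chosen only for this illustration.

```python
from typing import List, Optional, Tuple


def pick_breed(predictions: List[Tuple[str, bool]]) -> Optional[str]:
    """Return the first prediction flagged as a dog, formatted for reading.

    `predictions` is ordered from most to least confident; each item is
    (label, is_dog). Returns None when no prediction is a dog.
    """
    for label, is_dog in predictions:
        if is_dog:
            # Replace underscores with spaces and capitalize each word.
            return label.replace("_", " ").title()
    return None


print(pick_breed([("paper_towel", False), ("golden_retriever", True), ("Labrador_retriever", True)]))
```

Note that `str.title()` also lowercases the remaining letters of each word, which is what normalizes the inconsistent letter case.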

#### Twitter API

* First, a copy of the dataset was created for use throughout the cleaning exercise.
* The `display_text_range` column was split into two separate columns: `lower_text_range` and `upper_text_range`.
* Since the dataset contains a `tweet_id` column, it was then merged with the Twitter Archive dataset.
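A sketch of the split and merge, assuming `display_text_range` holds a two-element list per tweet and that an inner join on `tweet_id` is wanted (the join type is an assumption). The sample frames are invented.

```python
import pandas as pd


def split_text_range(api: pd.DataFrame) -> pd.DataFrame:
    """Split `display_text_range` into lower and upper bound columns."""
    df = api.copy()
    bounds = pd.DataFrame(
        df["display_text_range"].tolist(),
        index=df.index,
        columns=["lower_text_range", "upper_text_range"],
    )
    return pd.concat([df.drop(columns="display_text_range"), bounds], axis=1)


# Invented rows for illustration only.
api = pd.DataFrame(
    {
        "tweet_id": [1, 2],
        "display_text_range": [[0, 85], [0, 138]],
        "retweet_count": [5, 7],
        "favorite_count": [10, 20],
    }
)
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "name": ["Bella", "Tucker", "Rex"]})

# Inner join keeps only tweets present in both datasets (assumed choice).
merged = archive.merge(split_text_range(api), on="tweet_id", how="inner")
print(merged)
```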

#### Storing Data
Before further analysis, the cleaned, consolidated dataset was saved to a CSV file named `twitter_archive_master.csv` and to a SQLite database file.
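A minimal sketch of the storage step using pandas and the standard-library `sqlite3` module. The helper name, the database filename, and the table name are assumptions; only the CSV filename comes from this report.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def store_master(df: pd.DataFrame, csv_path: str, db_path: str,
                 table: str = "twitter_archive_master") -> None:
    """Save the cleaned dataset to a CSV file and to a SQLite table."""
    df.to_csv(csv_path, index=False)
    with closing(sqlite3.connect(db_path)) as conn:
        # Overwrite the table if it already exists (assumed behaviour).
        df.to_sql(table, conn, if_exists="replace", index=False)
        conn.commit()


# Usage (hypothetical database filename):
# store_master(master_df, "twitter_archive_master.csv", "twitter_archive_master.db")
```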