### Project 4 - Wrangle Report - Twitter Data

### Section I - Data Gathering

The Udacity team provided the project instructions; the main tasks are listed below.

### Data Sources
* Two files were provided: "twitter-archive-enhanced.csv" and "image_predictions.tsv".
* A Twitter developer account was created to gather data from the Twitter API and store it in a text file called tweet_json.txt.

### Dataframes

* twitter_data: used to load the file "twitter-archive-enhanced.csv".

* img_pred: used to load the file "image_predictions.tsv"

* api_data: used to store the data gathered from the API.
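
The loading step can be sketched as follows. This is a minimal sketch, not the project's actual code: it assumes tweet_json.txt holds one JSON tweet object per line, the helper `load_api_data` is hypothetical, and in-memory buffers stand in for the real files so the example runs on its own.

```python
import io
import json

import pandas as pd


def load_api_data(lines):
    """Build api_data from an iterable of JSON strings (one tweet per line, assumed)."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "favorites": tweet["favorite_count"],
            "retweets": tweet["retweet_count"],
            "user_followers": tweet["user"]["followers_count"],
            "user_favourites": tweet["user"]["favourites_count"],
            "date_time": tweet["created_at"],
        })
    return pd.DataFrame(rows)


# In-memory stand-ins for the three real files (made-up sample rows).
archive_csv = io.StringIO("tweet_id,text\n1,This is Bella. 12/10\n")
predictions_tsv = io.StringIO(
    "tweet_id\tjpg_url\tp1\n1\thttps://example.com/a.jpg\tgolden_retriever\n"
)
sample_tweet = {
    "id": 1,
    "favorite_count": 10,
    "retweet_count": 3,
    "created_at": "Tue Aug 01 16:23:56 +0000 2017",
    "user": {"followers_count": 100, "favourites_count": 5},
}
tweet_json = io.StringIO(json.dumps(sample_tweet) + "\n")

twitter_data = pd.read_csv(archive_csv)            # real file: "twitter-archive-enhanced.csv"
img_pred = pd.read_csv(predictions_tsv, sep="\t")  # real file: "image_predictions.tsv"
api_data = load_api_data(tweet_json)               # real file: open("tweet_json.txt")
```

With the real files, the buffers would be replaced by the file names shown in the comments.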

### Section II - Data Assessing

Each dataframe is described below.

### 1 - Enhanced Twitter Archive
The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain, though, is each tweet's text, which was used to extract the rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." Of the 5000+ tweets, only those with ratings were kept (there are 2356).

* tweet_id: the unique identifier of each tweet
* in_reply_to_status_id: if the tweet is a reply, the ID of the tweet it replies to
* in_reply_to_user_id: if the tweet is a reply, the ID of the user who posted the tweet it replies to
* timestamp: date and time the tweet was created, in Excel-friendly format
* source: the source used to post the tweet, given as a web link
* text: the text of the tweet
* retweeted_status_id: if the tweet is a retweet, the ID of the original tweet
* retweeted_status_user_id: if the tweet is a retweet, the ID of the user who posted the original tweet
* retweeted_status_timestamp: if the tweet is a retweet, the date and time the original tweet was created, in Excel-friendly format
* expanded_urls: expanded version of the URL; the URL entered by the user and displayed in Twitter. Note that the user-entered URL may itself be a shortened URL, e.g. from bit.ly.
* rating_numerator: the numerator of the rating given by the user
* rating_denominator: the denominator (reference value) of the rating given by the user
* name: the dog's name
* doggo, floofer, pupper, puppo: the stage of the dog

#### Quality Observations
* not all tweets could be classified as doggo, floofer, pupper or puppo; for those tweets all four stage columns contain "None"
* the source column contains unnecessary HTML code
* "None" was found in the name column
* tweet_id should be a str
* the timestamp column should contain datetime objects
* the name column contains wrong names like "None", "Bo", "a", "the", "an"
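
Two of these observations can be checked or addressed with short pandas expressions. The sketch below is illustrative only: the sample rows are made up, the column name `source_label` is hypothetical, and it assumes the source values are single HTML anchor tags.

```python
import pandas as pd

# Made-up sample rows in the shape of the archive's source and name columns.
twitter_data = pd.DataFrame({
    "source": [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    ],
    "name": ["Bella", "a"],
})

# Keep only the visible text between the opening and closing anchor tags.
twitter_data["source_label"] = twitter_data["source"].str.extract(
    r">([^<]+)</a>", expand=False
)

# Flag suspicious names: the "None" placeholder or anything starting in lowercase.
suspicious = (twitter_data["name"] == "None") | twitter_data["name"].str.match(
    r"[a-z]", na=False
)
print(twitter_data.loc[suspicious, "name"].tolist())
```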

### 2 - Image Prediction
The WeRateDogs Twitter archive was run through a neural network that can classify breeds of dogs. The result is a table of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images).

* tweet_id: the last part of the tweet URL after "status/"
* jpg_url: image link or URL
* img_num: image number
* p1: the algorithm's #1 prediction for the image in the tweet
* p1_conf: how confident the algorithm is in its #1 prediction
* p1_dog: whether or not the #1 prediction is a breed of dog
* p2: the algorithm's #2 prediction for the image in the tweet
* p2_conf: how confident the algorithm is in its #2 prediction
* p2_dog: whether or not the #2 prediction is a breed of dog
* p3: the algorithm's #3 prediction for the image in the tweet
* p3_conf: how confident the algorithm is in its #3 prediction
* p3_dog: whether or not the #3 prediction is a breed of dog

#### Quality Observations
* the predictions are sometimes lowercase, sometimes uppercase
* there is an underscore instead of a whitespace between the words
* there are rows with no prediction of a dog (neither in 1, 2 nor 3)
* the tweet_id column should be a string
* data contains retweets
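
The label formatting and the rows without any dog prediction could be handled roughly as below. This is a sketch on made-up rows; using title case as the common format is an assumption, since the report does not say which case was chosen.

```python
import pandas as pd

# Made-up sample rows in the shape of image_predictions.tsv.
img_pred = pd.DataFrame({
    "p1": ["golden_retriever", "Labrador_retriever", "paper_towel"],
    "p1_dog": [True, True, False],
    "p2": ["Chihuahua", "beagle", "bath_towel"],
    "p2_dog": [True, True, False],
    "p3": ["pug", "basset", "shower_curtain"],
    "p3_dog": [True, True, False],
})

# Replace underscores with spaces and apply one consistent case (title case, assumed).
for col in ["p1", "p2", "p3"]:
    img_pred[col] = img_pred[col].str.replace("_", " ", regex=False).str.title()

# Rows where none of the three predictions is a dog breed.
no_dog = ~(img_pred["p1_dog"] | img_pred["p2_dog"] | img_pred["p3_dog"])
print(img_pred[no_dog])
```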

### 3 - Twitter API
For this dataset, Tweepy was used to query Twitter's API for additional data beyond the data included in the WeRateDogs Twitter archive.

* tweet_id: the unique identifier of each tweet
* favorites: the number of times the tweet was favorited
* retweets: the number of times the tweet was retweeted
* user_followers: the number of followers of the user
* user_favourites: the number of tweets the user has favourited
* date_time: date and time the tweet was created

#### Quality Observations
* none

### Section III - Data Cleaning

The steps below show 8 data quality issues and 2 data tidiness issues.

#### Quality Issues
* Quality Issue # 1: In the dataframe twitter_data, only 2075 tweet IDs have images.
* Quality Issue # 2: In the dataframe twitter_data, some rows have wrong or multiple URLs.
* Quality Issue # 3: In the dataframe twitter_data, some words that are not names are listed as dog names (such as "a", "an", "by", etc.).
* Quality Issue # 4: In the dataframe twitter_data, some tweets are retweets of other tweets.
* Quality Issue # 5: In the dataframe img_pred, some of the predictions are not dog related.
* Quality Issue # 6: In the dataframe twitter_data, the columns in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id and retweeted_status_timestamp have several rows with missing values.
* Quality Issue # 7: In the dataframe twitter_data, the column timestamp has the data type object instead of datetime64.
* Quality Issue # 8: In the dataframes twitter_data and img_pred, the column tweet_id has the data type int instead of string.

#### Solutions
* Quality Issue # 1: Remove the tweets without image predictions.
* Quality Issue # 2: Valid URLs will be extracted using regex and copied into a new column.
* Quality Issue # 3: All names that start with a lowercase letter will be treated as missing dog names.
* Quality Issue # 4: Remove the tweets that have a tweet ID in the column retweeted_status_id.
* Quality Issue # 5: If the prediction is not a dog, write "Not a dog".
* Quality Issue # 6: These columns contain several rows with missing values, so they will be removed.
* Quality Issue # 7: Change the data type of the column timestamp to datetime64.
* Quality Issue # 8: Change the data type of the column tweet_id from int to string in all 3 dataframes.
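
Several of these solutions map directly onto pandas operations. The sketch below applies Issues 3 to 8 to made-up rows. It is not the project's actual code, and replacing invalid names with the archive's "None" placeholder is an assumption about how missing names are represented.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; the third row is a retweet.
twitter_data = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "timestamp": [
        "2017-08-01 16:23:56 +0000",
        "2017-07-31 00:18:03 +0000",
        "2017-07-30 15:58:51 +0000",
    ],
    "name": ["Phineas", "a", "Tilly"],
    "in_reply_to_status_id": [np.nan, np.nan, np.nan],
    "in_reply_to_user_id": [np.nan, np.nan, np.nan],
    "retweeted_status_id": [np.nan, np.nan, 8.0],
    "retweeted_status_user_id": [np.nan, np.nan, 9.0],
    "retweeted_status_timestamp": [None, None, "2017-07-29 00:00:00 +0000"],
})
img_pred = pd.DataFrame({"tweet_id": [1, 2], "p1": ["pug", "bagel"], "p1_dog": [True, False]})

clean = twitter_data.copy()

# Issue 3: names starting with a lowercase letter are not dog names.
lowercase = clean["name"].str.match(r"[a-z]", na=False)
clean.loc[lowercase, "name"] = "None"

# Issue 4: keep only rows that are not retweets.
clean = clean[clean["retweeted_status_id"].isnull()].copy()

# Issue 6: drop the reply and retweet columns.
clean = clean.drop(columns=[
    "in_reply_to_status_id", "in_reply_to_user_id", "retweeted_status_id",
    "retweeted_status_user_id", "retweeted_status_timestamp",
])

# Issue 7: convert timestamp from object to a datetime type.
clean["timestamp"] = pd.to_datetime(clean["timestamp"])

# Issue 8: store tweet_id as a string in every dataframe.
clean["tweet_id"] = clean["tweet_id"].astype(str)
img_pred["tweet_id"] = img_pred["tweet_id"].astype(str)

# Issue 5: label non-dog predictions (shown for p1 only).
img_pred.loc[~img_pred["p1_dog"], "p1"] = "Not a dog"
```

Converting tweet_id to a string before any merge keeps the key type identical across dataframes.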

#### Tidiness Issues
* Tidiness Issue # 1: All three dataframes can be combined into one single dataframe.
* Tidiness Issue # 2: The dog stage columns in twitter_data can be arranged into a single column.

#### Solutions
* Tidiness Issue # 1: A new dataframe called master_file will be created from the 3 dataframes, joined on the column tweet_id.
* Tidiness Issue # 2: Remove the 4 stage columns and put all the data in a single column called dogo_stage.
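
Both tidiness steps can be sketched on made-up rows as follows. Two choices here are assumptions rather than details from the report: tweets with more than one stage are joined with a comma, and the dataframes are combined with inner joins so that only tweets present in all three remain.

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Made-up sample rows; tweet_id is already a string in all three dataframes.
twitter_data = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})
img_pred = pd.DataFrame({"tweet_id": ["1", "2"], "p1": ["Pug", "Beagle"]})
api_data = pd.DataFrame({"tweet_id": ["1", "2", "3"], "favorites": [10, 20, 30]})


def combine_stages(row):
    """Collect the stage labels of one row, ignoring the "None" placeholder."""
    stages = [row[col] for col in stage_cols if row[col] != "None"]
    return ", ".join(stages) if stages else "None"


# Collapse the four stage columns into the single column dogo_stage.
twitter_data["dogo_stage"] = twitter_data[stage_cols].apply(combine_stages, axis=1)
twitter_data = twitter_data.drop(columns=stage_cols)

# Combine the three dataframes on tweet_id.
master_file = (
    twitter_data
    .merge(img_pred, on="tweet_id", how="inner")
    .merge(api_data, on="tweet_id", how="inner")
)
print(master_file)
```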

### Section IV - Data Store

The data was stored in a CSV file called "twitter_archive_master.csv" with 1985 rows and 23 columns.
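
The storing step can be sketched as below. An in-memory buffer stands in for the real file so the example runs on its own, and the single row is made up. Reading the file back with tweet_id declared as a string is a suggestion, not something the report states, since pandas would otherwise infer an integer type.

```python
import io

import pandas as pd

# Made-up single-row stand-in for the cleaned master dataframe.
master_file = pd.DataFrame({
    "tweet_id": ["892420643555336193"],
    "dogo_stage": ["None"],
})

# Real file: master_file.to_csv("twitter_archive_master.csv", index=False)
buffer = io.StringIO()
master_file.to_csv(buffer, index=False)

# Read the data back, keeping tweet_id as a string.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"tweet_id": str}, keep_default_na=False)
print(restored.shape)
```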
