# WeRateDogs - Data Wrangling

## Gather Data

Udacity's instructions for gathering data from the different sources made collection straightforward.
* I downloaded the 'twitter-archive-enhanced.csv' file as instructed.
* I created a Twitter developer account to obtain access credentials, then collected the tweet data through the API and stored it in a JSON file.
* I then downloaded the image predictions file in TSV format.

After collecting the data, I loaded each source into its own dataframe:
* df -> 'twitter-archive-enhanced.csv'
* tweets_df -> 'tweets_json.txt' collected using API
* image_predictions_df -> image predictions file in tsv format
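
The loading step can be sketched as follows. This is a minimal sketch, not the exact notebook code: the helper name `load_sources` is hypothetical, and it assumes that 'tweets_json.txt' holds one JSON object per line, which the write-up does not state.

```python
import json

import pandas as pd


def load_sources(archive_path, tweets_path, predictions_path):
    """Load the three raw sources into separate DataFrames."""
    # The archive is a regular comma-separated file.
    df = pd.read_csv(archive_path)

    # Assumption: the API output was written as one JSON object per line.
    with open(tweets_path, encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh if line.strip()]
    tweets_df = pd.DataFrame(records)

    # The image predictions file is tab-separated.
    image_predictions_df = pd.read_csv(predictions_path, sep="\t")

    return df, tweets_df, image_predictions_df
```

Keeping the three sources separate at this stage makes it easier to assess each one on its own before joining them.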

## Assessing Data

I visually assessed the data in the 3 dataframes to decide how to organize and structure it.

### 1) df 

### Columns in df ->

* name: the dog's name
* timestamp: Date and time the tweet was created, in Excel-friendly format.
* source: the web link as source
* text: the text of the tweet
* tweet_id: the unique identifier of each tweet
* retweeted_status_id: if the tweet is a retweet, the ID of the original tweet
* retweeted_status_user_id: if the tweet is a retweet, the ID of the user who posted the original tweet
* retweeted_status_timestamp: if the tweet is a retweet, the date and time the original tweet was created, in Excel-friendly format.
* expanded_urls: Expanded version of url1; URL entered by user and displayed in Twitter. Note that the user-entered URL may itself be a shortened URL, e.g. from bit.ly.
* rating_numerator: the rating given by the user
* rating_denominator: the reference value for the rating given by the user
* in_reply_to_status_id: if the tweet is a reply, the ID of the tweet being replied to
* in_reply_to_user_id: if the tweet is a reply, the ID of the user being replied to
* doggo, floofer, pupper, puppo: the stage of the dog

### Quality Issues

* in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp and expanded_urls have missing values.
* The name column contains words that are not dog names: 'infuriating', 'just', 'life', 'light', 'mad', 'my', 'not', 'officially', 'old', 'one', 'quite', 'space', 'such', 'the', 'this', 'unacceptable', 'very'.
* Missing values in doggo, floofer, pupper and puppo are recorded as None instead of NaN.
* The text column is truncated to 50 characters when displayed.
* timestamp and retweeted_status_timestamp should be of datetime type.
* in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id and retweeted_status_user_id should be integers.
* Some rating_numerator and rating_denominator values are inconsistent.
* Tweet ID 835246439529840640 has a rating_denominator of 0.
* Some rows are retweets and should be removed.
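
Several of these checks can also be done programmatically. The sketch below is illustrative only: `assess_archive` is a hypothetical helper, the lowercase-name test is a heuristic I am assuming (real names start with a capital letter), and it assumes the missing stage values are stored as the string 'None'.

```python
import pandas as pd

STAGE_COLS = ["doggo", "floofer", "pupper", "puppo"]


def assess_archive(df):
    """Return a few summary checks for the archive DataFrame."""
    # Number of missing values in each column.
    missing = df.isnull().sum()

    # Heuristic (assumption): entries written entirely in lowercase,
    # such as 'just' or 'the', are probably not real dog names.
    names = df["name"].dropna().astype(str)
    suspicious_names = sorted(names[names.str.islower()].unique())

    # A denominator of zero cannot be part of a valid rating.
    zero_mask = df["rating_denominator"] == 0
    zero_denominator_ids = df.loc[zero_mask, "tweet_id"].tolist()

    # Assumption: missing stages are stored as the string 'None'.
    none_strings = (df[STAGE_COLS] == "None").sum()

    return missing, suspicious_names, zero_denominator_ids, none_strings
```

The heuristic will not catch every invalid name, so the result is best treated as a list of candidates to review by eye.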

### Tidiness Issues
* The dataframes need to be joined into one final dataframe because some columns are present in more than 1 table.
* Dog stages are found in multiple columns.

### 2) tweets_df 

### Columns in tweets_df ->

* retweets: the number of retweets of the tweet
* tweet_id: the unique identifier of each tweet
* followers: the number of followers
* friends: the number of friends
* favorites: the number of favorites the tweet received

### Quality Issues 
* Information is missing for 14 tweet IDs.
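
One way to find which tweets are missing is to compare the IDs in the archive with the IDs returned by the API. This is a minimal sketch with a hypothetical helper name, assuming both dataframes have a tweet_id column as listed above.

```python
import pandas as pd


def missing_tweet_ids(df, tweets_df):
    """Return archive tweet IDs that have no row in the API data."""
    # Compare IDs as strings so that integer and text IDs still match.
    archive_ids = df["tweet_id"].astype(str)
    api_ids = set(tweets_df["tweet_id"].astype(str))
    return archive_ids[~archive_ids.isin(api_ids)].tolist()
```

The length of the returned list gives the number of tweets for which no API data could be collected.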

### Tidiness Issues
* The retweets and favorites columns should be joined with the df table.

### 3) image_predictions_df

### Columns in image_predictions_df ->

* p1: the algorithm's #1 prediction for the image in the tweet
* p1_conf: how confident the algorithm is in its #1 prediction
* p1_dog: whether or not the #1 prediction is a breed of dog
* p2: the algorithm's second most likely prediction
* p2_conf: how confident the algorithm is in its #2 prediction
* p2_dog: whether or not the #2 prediction is a breed of dog
* p3: the algorithm's #3 prediction for the image in the tweet
* p3_conf: how confident the algorithm is in its #3 prediction
* p3_dog: whether or not the #3 prediction is a breed of dog
* tweet_id: tweet_id is the last part of the tweet URL after "status/"
* jpg_url: Image link or URL
* img_num: Image number

### Quality Issues
* Only 2075 tweet IDs have images.

### Tidiness Issues
* All the prediction outputs of the algorithm have to be joined with df, because the rest of the tweet information is already found there.

# Cleaning 

### Join the 3 dataframes into 1 ->

* Converted the datatype of "tweet_id" to string/object.
* Combined the 3 dataframes into one using tweet_id.
* Combined the dog stage columns into 1 column.
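
The three steps above can be sketched as below. This is one possible approach and not necessarily the one used in the notebook: `combine_sources` and `join_stages` are hypothetical names, the inner joins are an assumption (they keep only tweets present in all three sources), and tweets with several stages get a comma-separated value so that no extra rows are created.

```python
import pandas as pd

STAGE_COLS = ["doggo", "floofer", "pupper", "puppo"]


def join_stages(values):
    """Join the stage names found in one row, ignoring 'None' and NaN."""
    found = [v for v in values if isinstance(v, str) and v != "None"]
    return ",".join(found) if found else None


def combine_sources(df, tweets_df, image_predictions_df):
    """Merge the three DataFrames on tweet_id and build one dog_stage column."""
    frames = []
    for frame in (df, tweets_df, image_predictions_df):
        frame = frame.copy()
        # Store the key as text in every table so the merge keys match.
        frame["tweet_id"] = frame["tweet_id"].astype(str)
        frames.append(frame)
    archive, tweets, predictions = frames

    # Assumption: inner joins, so only tweets found in all sources remain.
    merged = archive.merge(tweets, on="tweet_id", how="inner")
    merged = merged.merge(predictions, on="tweet_id", how="inner")

    # Collapse the four stage columns into a single dog_stage column.
    merged["dog_stage"] = [
        join_stages(row) for row in merged[STAGE_COLS].to_numpy()
    ]
    return merged.drop(columns=STAGE_COLS)
```

An alternative is to reshape the stage columns with `melt`, which produces one row per stage and therefore repeats tweets that carry more than one stage.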

### Further cleaning of data ->
* The row count increased after combining, which revealed 331 duplicated rows. These duplicates may have occurred because some tweets are tagged with more than one dog_stage.
* Converted in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id and retweeted_status_user_id to string.
* Converted timestamp and retweeted_status_timestamp to datetime format.
* Increased the display width of the text column so that the full tweet text is shown.
* Replaced all inappropriate dog names with None.
* rating_numerator and rating_denominator have some inconsistent values, and one tweet has a denominator of 0. Since the ratings cannot be corrected with the data provided, I left them unaltered.
* removed retweeted_status_timestamp with null values
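
A rough sketch of these cleaning steps is shown below. It is written under several assumptions that the notes above do not confirm: `clean_master` and `to_id_string` are hypothetical names, duplicated rows are dropped, a retweet is taken to be any row with a non-null retweeted_status_id, and the display width of 300 characters is an arbitrary choice.

```python
import pandas as pd

ID_COLS = [
    "in_reply_to_status_id",
    "in_reply_to_user_id",
    "retweeted_status_id",
    "retweeted_status_user_id",
]
INVALID_NAMES = {
    "infuriating", "just", "life", "light", "mad", "my", "not",
    "officially", "old", "one", "quite", "space", "such", "the",
    "this", "unacceptable", "very",
}

# Show longer text in notebook output (300 is an arbitrary width).
pd.set_option("display.max_colwidth", 300)


def to_id_string(value):
    """Convert an ID to text, keeping missing values as None."""
    return None if pd.isna(value) else str(int(value))


def clean_master(master):
    """Apply the cleaning steps to a copy of the combined DataFrame."""
    # Assumption: exact duplicate rows are dropped.
    master = master.drop_duplicates()

    # Assumption: rows with a retweeted_status_id are retweets.
    master = master[master["retweeted_status_id"].isna()].copy()

    # Parse both timestamp columns as datetimes.
    for col in ("timestamp", "retweeted_status_timestamp"):
        master[col] = pd.to_datetime(master[col])

    # Replace words that are not dog names with None.
    master.loc[master["name"].isin(INVALID_NAMES), "name"] = None

    # Store the ID columns as text rather than floats.
    for col in ID_COLS:
        master[col] = master[col].map(to_id_string)

    return master
```

Note that IDs read from the CSV as floats may already have lost precision, so converting them to text only changes how they are stored.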

# Storing the cleaned dataframe

Stored the data as a csv file 'twitter_archive_master.csv' for further analysis. 
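
A minimal sketch of the storing step, with hypothetical helper names. Reading the file back with tweet_id as text is my own suggestion, so that the IDs keep the string type chosen during cleaning.

```python
import pandas as pd


def store_master(master, path="twitter_archive_master.csv"):
    """Write the cleaned DataFrame to a CSV file."""
    # index=False keeps the row index out of the file.
    master.to_csv(path, index=False, encoding="utf-8")


def load_master(path="twitter_archive_master.csv"):
    """Read the stored file back, keeping tweet_id as text."""
    return pd.read_csv(path, dtype={"tweet_id": str})
```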