# WeRateDogs Wrangling Report

## Step 1 - Gathering Data

The data needed for this investigation was contained in 3 different files. The first file, 'twitter-archive-enhanced.csv', was given to us for this investigation and was imported with the pandas 'read_csv' method. The second file was downloaded programmatically from a provided URL. The third file was a bit more complicated. I attempted to download it in the manner laid out in the project documentation, but that documentation appears to need updating. I registered for a Twitter developer account and obtained all the required credentials. When I used them in the provided code to obtain a .txt file from which more data could be gathered, I was presented with the error below.

- Forbidden: 403 Forbidden 453 - You currently have Essential access which includes access to Twitter API v2 endpoints only. If you need access to this  endpoint, you’ll need to apply for Elevated access via the Developer Portal. You can learn more here: https://developer.twitter.com/en/docs/twitter-api/getting-started/about-twitter-api#v2-access-level

After talking with my university course instructor, I decided to use the .txt file that was provided in case there were issues obtaining the information in the manner laid out, or in case someone was working offline. I was able to change the file format to JSON and extract the information needed to create the dataframe containing the tweet_id (which will be used to join/merge rows with other dataframes), favorite_count, and retweet_count.
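The extraction step can be sketched roughly as follows. This is a minimal illustration, not the exact code used in the project: it assumes the .txt file holds one JSON object per line and that each object carries the keys `id`, `favorite_count`, and `retweet_count`. The function name `tweets_to_dataframe` and the file name in the usage comment are hypothetical.

```python
import json

import pandas as pd


def tweets_to_dataframe(lines):
    """Build a dataframe of tweet_id, favorite_count, retweet_count.

    Assumes each non-empty line is one JSON-encoded tweet (an assumption
    about the provided file, not something the report states).
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],  # assumed key name in the raw JSON
            "favorite_count": tweet["favorite_count"],
            "retweet_count": tweet["retweet_count"],
        })
    return pd.DataFrame(
        rows, columns=["tweet_id", "favorite_count", "retweet_count"]
    )


# Usage (hypothetical file name):
# with open("tweet_json.txt", encoding="utf-8") as f:
#     retweets_and_favs = tweets_to_dataframe(f)
```

The sketch names the identifier column tweet_id right away. Keeping the raw `id` name and renaming it later, before merging, works equally well.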

## Step 2 - Data Assessment

All 3 files were examined programmatically, specifically with the '.shape' attribute and the '.info()', '.duplicated()', '.describe()', and '.sample()' methods. My 'retweets_and_favs_counts' dataframe was written to both .txt format and .csv format. All 3 files were visually assessed in Microsoft Excel in their .csv format. The quality and tidiness issues I chose to address, so that insights could be gathered from the cleaned data, are as follows:
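A small helper along these lines could run the same programmatic checks on each dataframe. It is only a sketch: the function name `assess` and its parameters are hypothetical and not taken from the project notebook.

```python
import pandas as pd


def assess(df, name, n=5, seed=0):
    """Print a quick programmatic assessment and return a few summary facts."""
    print(f"--- {name} ---")
    print("shape:", df.shape)
    df.info()  # column dtypes and non-null counts
    print(df.describe())  # summary statistics
    print(df.sample(min(n, len(df)), random_state=seed))  # random rows to eyeball
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
    }
```

Returning the counts as a dictionary makes it easy to compare the three dataframes side by side after printing the details.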

### Quality

- A number of names appear to be incorrect. Valid names start with a capital letter, but numerous entries begin with a lowercase letter and are not names at all (e.g. 'a', 'an', 'very', 'such', 'just', 'by', and more). These entries need to be replaced with 'None' as we do not know the intended name.
- The "source" column in the Twitter Archive dataframe is full of duplicates and is not useful for this investigation, so it can be removed.
- The 'tweet_id' column in each of the twitter_archive_cleaning_df, image_predictions, and retweet_and_favs_counts dataframes is stored as an integer. As no mathematical calculations will be performed on this column, string format would be more appropriate. The same goes for the 'id' column in the retweets_and_favs_df that we created.
- The timestamp column has trailing 0's that serve no purpose and make the column unnecessarily wide. It is also stored as a string and needs to be converted to datetime format.
- The columns associated with retweets ('retweeted_status_id', 'retweeted_status_user_id', and 'retweeted_status_timestamp') are not needed for this investigation and can be removed.
- The columns associated with replies ('in_reply_to_status_id' and 'in_reply_to_user_id') are not needed for this investigation and can be removed.
- The rating_denominator column needs to have a standardized value of 10. Virtually every value in that column is 10, apart from a few outliers below and above 10. I will remove the outlier rows provided they are few and their removal will not drastically affect my investigation.
- The rating_numerator column needs to have outlier values removed. Virtually every value in the numerator column is between 10 and 20 and for the sake of this investigation I will assume that is the intended system/scale. I will remove the outlier rows provided they are few and their removal will not drastically affect my investigation.
- The expanded_urls column adds no value to my investigation and should be dropped.
- The 3 prediction columns in the image_predictions dataframe have some glaring inconsistencies: when a dog breed prediction has more than one word, the words are separated by underscores, and some words are capitalized while others are not. For searching purposes it would be more intuitive to have no underscores and a consistent case, so I will remove the underscores and make all words in those prediction columns lowercase.
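Several of the quality fixes listed above could be sketched as follows. This is a hedged illustration rather than the project's actual code: the function names are hypothetical, the lowercase-first-letter rule is used as a stand-in for "not a real name", and the sample timestamp format in the comment is an assumption about the archive.

```python
import pandas as pd


def clean_names(names):
    """Replace entries that start with a lowercase letter with 'None'."""
    names = names.astype(str)
    starts_lower = names.str.match(r"^[a-z]")
    return names.where(~starts_lower, "None")


def normalize_predictions(preds):
    """Swap underscores for spaces and lowercase every prediction."""
    return preds.str.replace("_", " ", regex=False).str.lower()


def fix_types(df):
    """Store tweet_id as str and timestamp as datetime.

    Assumes timestamps look like '2017-08-01 16:23:56 +0000'.
    """
    out = df.copy()
    out["tweet_id"] = out["tweet_id"].astype(str)
    out["timestamp"] = pd.to_datetime(out["timestamp"], utc=True)
    return out
```

`normalize_predictions` takes a single column, so it would be applied once per prediction column in the image_predictions dataframe.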

### Tidiness 

- There are 4 columns describing the 'stage' of the dog: 'doggo', 'floofer', 'pupper', and 'puppo'. This can be cleaned up by having a single 'stage' column with the correct descriptor.
- We have 3 dataframes that are all referencing the same data. It would be prudent to merge these dataframes into one master dataset after cleaning has been done.
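Collapsing the four stage columns into one might look like the sketch below. It assumes each stage column holds either its own name (for example 'doggo' in the doggo column) or the string 'None'. The report does not say how rows with more than one stage were handled, so taking the first match in column order is purely an assumption of this example, and the function names are hypothetical.

```python
import pandas as pd

STAGE_COLUMNS = ["doggo", "floofer", "pupper", "puppo"]


def pick_stage(row):
    """Return the first stage flagged in the row, or 'None' if there is none."""
    for col in STAGE_COLUMNS:
        if row[col] == col:
            return col
    return "None"


def consolidate_stage(df):
    """Add a single 'stage' column and drop the four original stage columns."""
    out = df.copy()
    out["stage"] = out[STAGE_COLUMNS].apply(pick_stage, axis=1)
    return out.drop(columns=STAGE_COLUMNS)
```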

## Step 3 - Cleaning Data

- I programmatically replaced all invalid names (for example 'a', 'an', 'by', 'just', 'quite', etc.) with 'None'. 'None' was already used when no name for the dog was given, and as we cannot be sure what the name was intended to be, I chose to change them to 'None'.
- Numerous columns were superfluous to my investigation and only cluttered the dataframe(s) with information that was not useful, so I dropped them. Those columns were 'in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp', and 'source'.
- It was necessary to change the datatypes for 'tweet_id', 'retweet_count', 'favorite_count', and 'timestamp'. The 'tweet_id' column was type int, and as no mathematical calculations were going to be performed on that column, type str was more appropriate. The 'retweet_count' and 'favorite_count' columns were initially type float, but as each of these counts is discrete and not continuous, type int is more appropriate. The 'timestamp' column was initially type str and was changed to 'datetime64'. Although I did not perform any investigation on this column, many insights could be gathered from it, so correcting the datatype and keeping the column seemed important to me. The 'expanded_urls' column added no value to my investigation and was removed.
- Each tweet rates each dog in a fractional format. Essentially all the ratings followed the 'x/10' or 'x out of 10' format. It made sense to find the denominators that were not 10 and either eliminate them or change them to 10 for the sake of consistency and standardization down the column. As there were only 23 denominator values that were not 10, it was simpler to filter them out, since removing just 23 rows would not affect my investigation.
- As stated above, all ratings are in a fractional format. There were some outliers in the numerator column. Over 99.5% of numerator ratings are between 10 and 20 (inclusive). As there were only 11 rows with a value of either 0 or above 20, removing those rows would not significantly affect my investigation, so they were filtered out.
- The image_predictions file had inconsistencies in some columns, which I standardized. Many of the 'prediction' columns had string values that contained underscores as well as intermittent use of capital letters. I removed the underscores and changed all values in those columns to lowercase using '.lower'.
- In the twitter_archive file there was a column for each 'stage' of dog: 'doggo', 'floofer', 'pupper', and 'puppo'. Only one column is needed to contain these values. I consolidated the columns into one 'stage' column that contains one of the above-mentioned values or 'None'.
- This investigation is based on analyzing 3 different files pertaining to different elements of the same whole. Once the quality issues were addressed, it made sense to combine the 3 dataframes into one, joining/merging on 'tweet_id' so the master copy contained only rows that appeared in each of the previous dataframes. When combining the dataframes I became aware that in the 'retweets_and_favs' dataframe the 'tweet_id' was listed simply as 'id'. I rectified this issue by changing the column name on the cleaning copy I created before merging the dataframes together. All three dataframes in their cleaned form were merged into 'twitter_archive_master' as requested in the project guidelines.
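The rating filters and the final merge described above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the function names are hypothetical, the numerator rule follows the cleaning description literally (drop rows equal to 0 or above 20), and the inner joins assume every dataframe identifies tweets by 'tweet_id' once 'id' has been renamed.

```python
import pandas as pd


def filter_ratings(df):
    """Keep rows with a denominator of 10 and a numerator that is not 0 or above 20."""
    keep = (
        (df["rating_denominator"] == 10)
        & (df["rating_numerator"] != 0)
        & (df["rating_numerator"] <= 20)
    )
    return df[keep].copy()


def merge_master(archive, predictions, counts):
    """Inner-join the three dataframes on tweet_id (stored as str)."""
    counts = counts.rename(columns={"id": "tweet_id"})
    frames = [
        frame.assign(tweet_id=frame["tweet_id"].astype(str))
        for frame in (archive, predictions, counts)
    ]
    master = frames[0].merge(frames[1], on="tweet_id", how="inner")
    return master.merge(frames[2], on="tweet_id", how="inner")
```

An inner join keeps only the tweets present in all three sources, which matches the goal of a master copy containing only rows that appear in each dataframe.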