## Reporting: wrangle_report
* Create a **300-600 word written report** called "wrangle_report.pdf" or "wrangle_report.html" that briefly describes your wrangling efforts. This is to be framed as an internal document.

## Step 1. Gathering Data

In this step, data is gathered from three different sources and stored in dataframes:
  - `twitter_archive_df`: read directly from the local file twitter-archive-enhanced.csv using the pandas read_csv method
  - `predictions_df`: downloaded from the given URL using the requests library
  - `count_df`: gathered from the Twitter API using the tweepy library, with the results stored in twitter-json.txt
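
As a rough illustration of the third source, the sketch below parses tweet JSON into a dataframe shaped like `count_df`. It assumes twitter-json.txt holds one JSON object per line with `id`, `retweet_count`, and `favorite_count` keys; those key names and the helper name `load_counts` are assumptions, not details confirmed by this report. The sample reads from an in-memory buffer with invented values instead of the real file.

```python
import io
import json

import pandas as pd


def load_counts(lines):
    """Build a count dataframe from an iterable of JSON lines (hypothetical helper)."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": str(tweet["id"]),  # keep IDs as strings from the start
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return pd.DataFrame(rows, columns=["tweet_id", "retweet_count", "favorite_count"])


# Invented sample standing in for the contents of twitter-json.txt.
sample = io.StringIO(
    '{"id": 1001, "retweet_count": 10, "favorite_count": 25}\n'
    '\n'
    '{"id": 1002, "retweet_count": 3, "favorite_count": 7}\n'
)
count_df = load_counts(sample)
```

With the real file, the same function could be called on an open file handle, for example `with open("twitter-json.txt") as f: count_df = load_counts(f)`.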

## Step 2. Assessing Data

In this step, both visual and programmatic assessment are used to find at least 8 quality issues and at least 2 tidiness issues.
The assessment surfaced the following issues.

### Quality 
| SNo. | Table | Issue |
| :--- | :--- | :--- |
| 1. | `twitter_archive_df` | Incorrect datatypes (tweet_id, in_reply_to_status_id, in_reply_to_user_id, timestamp, source, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp) |
| 2. | `twitter_archive_df` | Source contains HTML markup |
| 3. | `twitter_archive_df` | name, doggo, floofer, pupper, puppo use None in place of NaN |
| 4. | `twitter_archive_df` | There are invalid dog names such as all, my, not, a, an, the, by, such, etc. |
| 5. | `twitter_archive_df` | Retweets are present; they should be deleted so that only original tweets remain |
| 6. | `twitter_archive_df` | Some tweet IDs have no image predictions in the image prediction table and should be removed |
| 7. | `twitter_archive_df` | Inaccurate values of rating_numerator |
| 8. | `predictions_df` | Incorrect datatypes (tweet_id, p1, p2, p3) |
| 9. | `predictions_df` | Predictions with p1_dog, p2_dog, p3_dog set to False are not dog types and should be removed |
| 10. | `count_df` | Incorrect datatype (tweet_id) |

### Tidiness
| SNo. | Issue | 
| :--- | :--- | 
| 1. | doggo, floofer, pupper, puppo should be in one column | 
| 2. | Split text column into two separate columns (tweet text and tweet_url) | 
| 3. | Combine p1, p2, and p3 into prediction_type and confidence level columns |
| 4. | Join all three tables |
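
Programmatic checks of the kind that surface several of the quality issues above might look like the following sketch. The dataframe is a small invented stand-in for `twitter_archive_df`, and the specific checks are illustrative assumptions rather than the exact ones used in the project.

```python
import pandas as pd

# Toy stand-in for twitter_archive_df; all values are invented for illustration.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "name": ["Charlie", "a", "None", "the"],
    "retweeted_status_id": [None, 99.0, None, None],
})

# Quality issue 1: IDs were loaded as numbers instead of strings.
id_is_numeric = pd.api.types.is_numeric_dtype(archive["tweet_id"])

# Quality issue 3: the string "None" is used where a missing value is meant.
none_placeholders = int((archive["name"] == "None").sum())

# Quality issue 4: names starting with a lowercase letter are probably not names.
suspect_names = archive.loc[archive["name"].str.match(r"^[a-z]"), "name"].tolist()

# Quality issue 5: a populated retweeted_status_id marks a retweet.
retweet_rows = int(archive["retweeted_status_id"].notna().sum())
```

Calls such as `archive.info()` and `archive.sample(5)` are the usual complement to checks like these during visual assessment.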

## Step 3: Cleaning the Data

Once all the issues were documented, I followed the Define, Code, and Test workflow to clean the data, using the knowledge I gained from the previous lessons.

### Quality Issues:
1. Incorrect datatypes in twitter_archive_df (tweet_id, in_reply_to_status_id, in_reply_to_user_id, timestamp, source, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp): IDs (such as tweet_id and in_reply_to_status_id) should be treated as strings, and timestamp fields should be of datetime type.
2. The source field in twitter_archive_df contained HTML markup, which is of no use during analysis and visualization.
3. Five columns (name, doggo, floofer, pupper, puppo) in twitter_archive_df used None in place of NaN: None is treated as a valid non-null value in calculations, so I changed every None to NaN.
4. Many tweets had invalid dog names such as a, an, and the, so I removed all names starting with a lowercase letter.
5. Deleted the retweets so that only original tweets remain.
6. Removed the tweet IDs from the archive table that were not present in the image prediction file.
7. Incorrect numerators were captured for tweets with decimal-point ratings.
8. Incorrect datatypes in predictions_df (tweet_id, p1, p2, p3).
9. Removed the p1_dog, p2_dog, p3_dog entries which were set to False, as these are not dog types.
10. Incorrect datatype for tweet_id in count_df.
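
A minimal sketch of how quality fixes 1 through 7 could be coded with pandas is shown below. The sample rows are invented, `prediction_ids` is a hypothetical stand-in for the tweet IDs found in `predictions_df`, and the regular expressions are assumptions about the text format rather than the exact patterns used in the project.

```python
import numpy as np
import pandas as pd

link = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'

# Toy stand-in for twitter_archive_df; all values are invented.
archive = pd.DataFrame({
    "tweet_id": [10, 11, 12, 13],
    "timestamp": ["2017-08-01 16:23:56 +0000"] * 4,
    "source": [link] * 4,
    "name": ["Bella", "a", "None", "Logan"],
    "retweeted_status_id": [np.nan, np.nan, 5.0, np.nan],
    "text": ["This is Bella. 13.5/10", "Here is a dog 12/10",
             "RT old tweet 11/10", "This is Logan 9.75/10"],
    "rating_numerator": [5, 12, 11, 75],
})
prediction_ids = {"10", "11", "13"}  # hypothetical IDs present in predictions_df

clean = archive.copy()

# 1. IDs become strings and timestamps become datetimes.
clean["tweet_id"] = clean["tweet_id"].astype(str)
clean["timestamp"] = pd.to_datetime(clean["timestamp"])

# 2. Keep only the text between the anchor tags of source.
clean["source"] = clean["source"].str.extract(r">([^<]+)<", expand=False)

# 3 and 4. Turn the "None" placeholder and lowercase pseudo-names into NaN.
clean["name"] = clean["name"].replace("None", np.nan)
clean.loc[clean["name"].str.match(r"^[a-z]", na=False), "name"] = np.nan

# 5 and 6. Drop retweets, then drop tweets without an image prediction.
clean = clean[clean["retweeted_status_id"].isna()]
clean = clean[clean["tweet_id"].isin(prediction_ids)].copy()

# 7. Re-extract the numerator from the text so decimal ratings survive.
clean["rating_numerator"] = (
    clean["text"].str.extract(r"(\d+(?:\.\d+)?)/\d+", expand=False).astype(float)
)
```

In the Define, Code, and Test workflow, each of these steps would be followed by a check, for example asserting that no remaining name starts with a lowercase letter.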


### Tidiness Issues:
1. Merged doggo, floofer, pupper, puppo in one column called dog_stage.
2. Split the text column of the archive table into two separate columns (tweet text and tweet_url).
3. Combined p1, p1_conf, p1_dog, p2, p2_conf, p2_dog, p3, p3_conf, p3_dog into prediction_type and confidence_level columns, as these two columns convey all the information needed.
4. Joined all three dataframes into 1 dataframe called twitter_archive_master_df.
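
The sketch below illustrates tidiness steps 1, 3, and 4 on invented data. It assumes the stage columns already use NaN for missing values, and it assumes that step 3 keeps the highest-ranked prediction flagged as a dog; that selection rule, the column name `confidence_level`, and the helper names are one plausible reading, not a confirmed description of the project code.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the cleaned tables; all values are invented.
archive = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "doggo": ["doggo", np.nan, "doggo"],
    "floofer": [np.nan, np.nan, np.nan],
    "pupper": [np.nan, "pupper", "pupper"],
    "puppo": [np.nan, np.nan, np.nan],
})
predictions = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "p1": ["golden_retriever", "web_site", "seat_belt"],
    "p1_conf": [0.90, 0.80, 0.50],
    "p1_dog": [True, False, False],
    "p2": ["labrador_retriever", "pug", "shoe"],
    "p2_conf": [0.05, 0.10, 0.30],
    "p2_dog": [True, True, False],
    "p3": ["kuvasz", "chow", "teddy"],
    "p3_conf": [0.01, 0.05, 0.10],
    "p3_dog": [True, True, False],
})
counts = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "retweet_count": [100, 40, 7],
    "favorite_count": [500, 90, 12],
})

# Step 1: collapse the four stage columns into a single dog_stage column.
stage_cols = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(row):
    stages = [value for value in row if pd.notna(value)]
    return ", ".join(stages) if stages else np.nan


archive["dog_stage"] = archive[stage_cols].apply(combine_stages, axis=1)
archive = archive.drop(columns=stage_cols)


# Step 3: keep the first prediction flagged as a dog (assumed rule).
def first_dog_prediction(row):
    for n in (1, 2, 3):
        if row[f"p{n}_dog"]:
            return row[f"p{n}"], row[f"p{n}_conf"]
    return np.nan, np.nan


picked = [first_dog_prediction(row) for _, row in predictions.iterrows()]
tidy_predictions = pd.DataFrame(picked, columns=["prediction_type", "confidence_level"])
tidy_predictions.insert(0, "tweet_id", predictions["tweet_id"].to_numpy())
tidy_predictions = tidy_predictions.dropna(subset=["prediction_type"])

# Step 4: inner joins keep only tweets present in all three tables.
twitter_archive_master_df = (
    archive.merge(tidy_predictions, on="tweet_id", how="inner")
           .merge(counts, on="tweet_id", how="inner")
)
```

An inner join is used here so that tweets without a dog prediction drop out of the master table; a left join would keep them with missing prediction values instead.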

## Step 4. Storing Data
Once all the above steps were done, I stored the final wrangled data in twitter_archive_master.csv for future use.
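
The storing step can be sketched as a CSV round trip. The example writes to an in-memory buffer with invented rows so that it runs anywhere; in the project the target would be twitter_archive_master.csv. Passing `dtype` on read is a suggestion made here, not something the report states: without it, pandas would load tweet_id as an integer again.

```python
import io

import pandas as pd

# Invented rows standing in for twitter_archive_master_df.
master = pd.DataFrame({
    "tweet_id": ["100000000000000001", "100000000000000002"],
    "rating_numerator": [13.0, 12.5],
})

# Write without the index so no extra unnamed column appears on reload.
buffer = io.StringIO()
master.to_csv(buffer, index=False)

# Reload, forcing tweet_id back to string.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"tweet_id": str})
```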

## Step 5: Analyzing and Visualizing Data
Using twitter_archive_master.csv, I performed some basic analysis:
1. Tweet sources: more than 95% of the tweets were posted from Twitter for iPhone
2. 65% of the tweets are about dogs in the pupper stage
3. More than 25% of the tweets are about the top 5 most popular dogs combined

And visualizations like
1. Popularity vs Ratings
2. Favourite vs TimeStamp
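
Percentage figures like the ones in the analysis list can be computed with `value_counts(normalize=True)`. The sketch below uses a tiny invented dataframe, so its numbers are purely illustrative and do not reproduce the results reported above.

```python
import pandas as pd

# Invented rows standing in for twitter_archive_master.csv.
master = pd.DataFrame({
    "source": ["Twitter for iPhone"] * 3 + ["Twitter Web Client"],
    "dog_stage": ["pupper", "pupper", "doggo", None],
})

# Share of tweets per source, in percent.
source_share = master["source"].value_counts(normalize=True) * 100

# Share per dog stage; rows with no stage are excluded by default.
stage_share = master["dog_stage"].value_counts(normalize=True) * 100
```

Whether missing stages are excluded from the denominator changes the percentage noticeably, so the choice is worth stating next to any reported figure.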