

# Wrangle Report

## Introduction

This report documents the process of gathering, assessing, and cleaning data for the WeRateDogs Twitter archive. The data sources include "twitter_archive_enhanced.csv" and "image_predictions.tsv," provided by Udacity, as well as additional data collected via the Tweepy API.

## Chapter 1: Gathering the Data

The data collection for the WeRateDogs Twitter archive involved three main sources:

1. **Twitter Archive CSV**: The dataset "twitter_archive_enhanced.csv" was downloaded directly from Udacity. It includes details about each tweet, such as tweet ID, text, rating, and the dog's stage.

2. **Image Predictions TSV**: The file "image_predictions.tsv," also provided by Udacity, contains predictions generated by a machine learning model about the images included in the tweets. This file offers insights into the predicted breed of each dog in the images.

3. **Twitter Data via Tweepy API**: Additional data was collected using the Tweepy API. This dataset includes details such as favorite count, tweet ID, language, retweet count, and whether the tweet was a retweet.
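
The three loading steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the in-memory sample rows, the selected column names, and the one-JSON-object-per-line layout for the API data are all assumptions made so the example runs without the real files or Twitter credentials. It does not call Tweepy itself.

```python
import io
import json

import pandas as pd

# Tiny in-memory stand-ins for the real files (hypothetical sample rows).
archive_csv = io.StringIO(
    "tweet_id,text,rating_numerator,rating_denominator\n"
    "1001,This is Bella. 13/10,13,10\n"
    "1002,Meet Max. 12/10,12,10\n"
)
predictions_tsv = io.StringIO(
    "tweet_id\tp1\tp1_conf\tp1_dog\n"
    "1001\tgolden_retriever\t0.95\tTrue\n"
    "1002\tpug\t0.80\tTrue\n"
)
# Assumed layout: one JSON object per line, as tweets fetched via an API might be saved.
tweet_json_lines = io.StringIO(
    '{"id": 1001, "favorite_count": 250, "retweet_count": 40, "lang": "en"}\n'
    '{"id": 1002, "favorite_count": 90, "retweet_count": 12, "lang": "en"}\n'
)

# With the real files, pass the file path instead of the StringIO object.
archive = pd.read_csv(archive_csv)
predictions = pd.read_csv(predictions_tsv, sep="\t")  # TSV needs a tab separator

# Parse each saved tweet and keep only the fields of interest.
records = [json.loads(line) for line in tweet_json_lines]
api_data = pd.DataFrame(records)[["id", "favorite_count", "retweet_count", "lang"]]
api_data = api_data.rename(columns={"id": "tweet_id"})  # align the key with the other tables
```

Renaming `id` to `tweet_id` at load time gives all three tables the same key column, which simplifies a later merge.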

## Chapter 2: Assessing the Data

The assessment phase aimed to evaluate the quality and structure of the gathered datasets to ensure their suitability for analysis. Key methods used for assessment included:

1. **`.head()` Method**: This method was used to visually inspect the first few rows of each dataset. It provided a quick check of the data's structure and content, helping identify obvious issues.

2. **`.info()` Method**: This method was used to summarize each dataset programmatically. The summary lists the number of entries, column names, data types, and non-null counts, which made it easier to spot missing values and incorrect data types.

These methods were instrumental in pinpointing key issues that required attention during the data cleaning phase.
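
As a rough sketch of this assessment step, the example below runs both methods on a small, hypothetical DataFrame (the rows and column names are illustrative assumptions, not the real archive). The output of `.info()` is captured in a buffer only so it can be inspected as a string; calling `df.info()` directly prints the same summary.

```python
import io

import pandas as pd

# Hypothetical sample resembling a slice of the tweet archive.
df = pd.DataFrame({
    "tweet_id": [1001, 1002, 1003],
    "timestamp": ["2017-08-01 16:23:56", "2017-08-01 00:17:27", "2017-07-31 00:18:03"],
    "name": ["Bella", None, "Max"],
})

# Visual assessment: look at the first rows.
first_rows = df.head(2)
print(first_rows)

# Programmatic assessment: column names, dtypes, and non-null counts.
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```

In this sample, the summary would show `timestamp` stored as text rather than as a datetime, and `name` with fewer non-null entries than the other columns, which are the kinds of issues the cleaning phase then addresses.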

## Chapter 3: Cleaning the Data

The cleaning phase involved addressing the issues identified during the assessment phase. Prior to cleaning, copies of the original datasets were made to preserve the initial data. The following methods were used to clean the data:

1. **`.replace()` Method**: This method was used to correct errors and standardize data by replacing specific values with new ones.

2. **`.isna()` Method**: Used to check for missing values, this method helped identify entries with missing data.

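3. **`.dtypes` Attribute**: This attribute (`.dtypes` on a DataFrame, `.dtype` on a single column) was used to verify that the data types of columns were appropriate for analysis. It is accessed without parentheses because it is not a method.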

4. **`pd.to_datetime()` Function**: This function was used to convert columns to datetime format, ensuring that date and time data were properly typed for analysis.

5. **`.drop()` Method**: This method allowed for the removal of specific rows or columns that were deemed unnecessary or problematic.

6. **`.merge()` Method**: This method combined multiple datasets into a single dataset based on a common column or index, facilitating a more comprehensive analysis.
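
The cleaning steps listed above might be chained together along these lines. This is a sketch under stated assumptions: the sample rows, the `"None"` placeholder in `name`, the dropped `source` column, and the inner join on `tweet_id` are illustrative choices, not a record of the exact operations performed in the project.

```python
import numpy as np
import pandas as pd

# Hypothetical sample tables standing in for the gathered datasets.
archive = pd.DataFrame({
    "tweet_id": [1001, 1002, 1003],
    "timestamp": ["2017-08-01 16:23:56", "2017-08-01 00:17:27", "2017-07-31 00:18:03"],
    "name": ["Bella", "None", "Max"],
    "source": ["iphone", "iphone", "web"],
})
api_data = pd.DataFrame({
    "tweet_id": [1001, 1002],
    "favorite_count": [250, 90],
})

# Work on copies so the original data stays untouched.
archive_clean = archive.copy()
api_clean = api_data.copy()

# .replace(): turn the placeholder string "None" into a real missing value.
archive_clean["name"] = archive_clean["name"].replace("None", np.nan)

# .isna(): count the missing values that are now visible.
missing_names = archive_clean["name"].isna().sum()

# pd.to_datetime(): convert text timestamps to datetime values.
archive_clean["timestamp"] = pd.to_datetime(archive_clean["timestamp"])

# .dtypes: confirm the conversion took effect.
print(archive_clean.dtypes)

# .drop(): remove a column assumed to be unnecessary for the analysis.
archive_clean = archive_clean.drop(columns=["source"])

# .merge(): combine the tables on their shared key.
merged = archive_clean.merge(api_clean, on="tweet_id", how="inner")
```

An inner join keeps only tweets present in both tables, so the third sample tweet, which has no API data, is dropped; `how="left"` would keep it with missing counts instead.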

## Conclusion

This report outlined the process of gathering, assessing, and cleaning the WeRateDogs Twitter archive data. The data came from three primary sources: "twitter_archive_enhanced.csv," "image_predictions.tsv," and the Tweepy API.

Key methods for assessment included `.head()` and `.info()`, which were used to inspect and summarize the data. For cleaning, methods such as `.replace()` for corrections, `.drop()` for removing unnecessary data, and `.merge()` for combining datasets were employed.

With the data now cleaned and organized, it is ready for analysis.

