The objective of this data wrangling project is to gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, and then clean it. For this project I used the Python programming language and several of its libraries, which are listed later in this document.
The datasets used for this project come from the WeRateDogs Twitter account. I also performed some analysis and visualization on the data. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.
In the course of this project, I gathered the datasets, assessed them, and cleaned them. This section is divided into:
- Gathering of data
- Assess data
- Cleaning of data
Gathering of data - In this step, I gathered three datasets from different sources: the WeRateDogs Twitter archive CSV file, the tweet image predictions TSV file, and additional data from the Twitter API.
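As an illustration of the gathering step, here is a minimal sketch of loading three sources of these kinds into pandas DataFrames. It is not the exact code used in the project: the function name `load_sources`, the column names, and the sample rows are hypothetical, and the API data is assumed to have already been saved as one JSON object per line rather than fetched live.

```python
import io
import json

import pandas as pd


def load_sources(archive_csv, predictions_tsv, api_json_lines):
    """Load a CSV file, a TSV file, and JSON lines into three DataFrames."""
    archive = pd.read_csv(archive_csv)
    # A TSV file is read with the same function, using a tab separator
    predictions = pd.read_csv(predictions_tsv, sep="\t")
    # Assumption: one JSON object per line, saved earlier from the API
    records = [json.loads(line) for line in api_json_lines if line.strip()]
    api_data = pd.DataFrame(
        records, columns=["id", "retweet_count", "favorite_count"]
    )
    return archive, predictions, api_data


# Tiny in-memory stand-ins for the three files (made-up rows)
archive_file = io.StringIO(
    "tweet_id,text\n1,Good dog 12/10\n2,Very good dog 13/10\n"
)
predictions_file = io.StringIO("tweet_id\tp1\n1\tgolden_retriever\n2\tpug\n")
api_file = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 20}\n'
    '{"id": 2, "retweet_count": 7, "favorite_count": 31}\n'
)

archive, predictions, api_data = load_sources(
    archive_file, predictions_file, api_file
)
print(archive.shape, predictions.shape, api_data.shape)
```

In practice, file paths can be passed in place of the in-memory buffers for the CSV and TSV sources, and an open file handle for the JSON lines.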
Assess data - After gathering the data from their sources, I assessed them visually and programmatically to identify quality and tidiness issues.
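A programmatic assessment usually comes down to a handful of pandas checks. The sketch below shows the kind of checks that might be run; the helper `assess`, the column names, and the sample values are hypothetical and only serve to show missing values and duplicates being detected.

```python
import pandas as pd


def assess(df):
    """Return a few programmatic quality checks for a DataFrame."""
    return {
        "rows": len(df),
        # Number of missing values in each column
        "missing_per_column": df.isnull().sum().to_dict(),
        # Number of rows that exactly repeat an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # Data type of each column, as text
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


# Made-up sample with one duplicated row and one missing value
sample = pd.DataFrame(
    {
        "tweet_id": [1, 2, 2, 3],
        "rating_denominator": [10, 10, 10, None],
    }
)

report = assess(sample)
for check, result in report.items():
    print(check, result)
```

Visual assessment complements this: scrolling through the rows in a notebook or spreadsheet often reveals issues, such as odd names or malformed ratings, that summary counts do not.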
Cleaning of data - Having assessed the data visually and programmatically, I proceeded to clean it using the Python programming language and some libraries. The libraries I used are:
- Pandas
- NumPy
- Matplotlib.pyplot
- Seaborn
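To make the cleaning step concrete, here is a minimal sketch of a few typical cleaning operations with pandas: working on a copy, dropping duplicate rows, correcting data types, and merging tables on a shared key. The DataFrames, column names, and values are made up for illustration and are not taken from the project's data or its actual cleaning code.

```python
import pandas as pd

# Made-up stand-ins for two of the gathered tables
archive = pd.DataFrame(
    {
        "tweet_id": [1, 2, 2, 3],
        "timestamp": [
            "2017-01-01 10:00:00",
            "2017-01-02 11:30:00",
            "2017-01-02 11:30:00",
            "2017-01-03 09:15:00",
        ],
    }
)
api_data = pd.DataFrame(
    {"tweet_id": [1, 2, 3], "favorite_count": [20, 31, 12]}
)

# Work on copies so the original data stays available for comparison
clean = archive.copy()
api_clean = api_data.copy()

# Quality fix: remove rows that exactly repeat an earlier row
clean = clean.drop_duplicates().reset_index(drop=True)

# Quality fix: store timestamps as datetimes and IDs as strings
clean["timestamp"] = pd.to_datetime(clean["timestamp"])
clean["tweet_id"] = clean["tweet_id"].astype(str)
api_clean["tweet_id"] = api_clean["tweet_id"].astype(str)

# Tidiness fix: combine the tables that describe the same tweets
merged = clean.merge(api_clean, on="tweet_id", how="left")
print(merged)
```

Re-running the assessment checks after each fix is a simple way to confirm that the issue is actually gone.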