## WeRateDogs Twitter Data Wrangling Steps

The data wrangling process for this project can be broken into three steps: **gathering, assessing and cleaning.**

### 1. Gathering
Data for this project came from three different sources:
- The original Twitter archive in CSV format, downloaded manually
- Additional tweet data gathered from the WeRateDogs Twitter account using Twitter's API and the Python Tweepy library
- The tweet image predictions TSV file, hosted on Udacity's servers and downloaded programmatically using the Requests library
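
Data gathered through an API is commonly stored as one JSON object per tweet. The sketch below shows one way such records could be loaded into a DataFrame. The helper `tweets_to_frame`, the sample records, and the field names (`id`, `retweet_count`, `favorite_count`) are assumptions for illustration and are not taken from the project code.

```python
import json

import pandas as pd


def tweets_to_frame(lines):
    """Parse an iterable of JSON strings (one tweet per line) into a DataFrame.

    Hypothetical helper: the field names below are assumed, not confirmed.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet.get("retweet_count"),
            "favorite_count": tweet.get("favorite_count"),
        })
    return pd.DataFrame(rows, columns=["tweet_id", "retweet_count", "favorite_count"])


# Made-up sample records standing in for the saved API output
sample = [
    '{"id": 1, "retweet_count": 5, "favorite_count": 10}',
    '',
    '{"id": 2, "retweet_count": 0, "favorite_count": 3}',
]
df_tweets = tweets_to_frame(sample)
```

In practice the same function could be fed an open file handle, since iterating over a text file yields one line at a time.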

### 2. Assessing
A number of criteria were set for assessing the data:
- Only original ratings are wanted (no retweets)
- Only tweets that have images are wanted

In addition, at least 8 quality issues and 2 tidiness issues had to be detected, both visually and programmatically, and then documented.
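
As a rough illustration of programmatic assessment, the checks below count retweets, duplicated IDs, and the spread of denominator values. The small DataFrame is made-up sample data; only the column names `retweeted_status_id` and `rating_denominator` follow the issues listed in the cleaning section, and `tweet_id` is an assumed key column.

```python
import numpy as np
import pandas as pd

# Made-up sample data standing in for the archive
df = pd.DataFrame({
    "tweet_id": [1, 2, 3, 3],
    "retweeted_status_id": [np.nan, 99.0, np.nan, np.nan],
    "rating_denominator": [10, 10, 50, 50],
})

# A non-null retweeted_status_id marks a retweet
n_retweets = df["retweeted_status_id"].notnull().sum()

# Repeated tweet IDs point to duplicated rows
n_duplicate_ids = df["tweet_id"].duplicated().sum()

# Frequency of each denominator value; anything other than 10 is suspect
denominator_counts = df["rating_denominator"].value_counts()
```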

### 3. Cleaning
- **Issue 1.** df_tweet_archive contained tweets without images. Removed them with pandas dropna.
- **Issue 2.** The timestamp column was in the wrong format. Fixed this with the parse_dates option of pandas.read_csv.
- **Issue 3.** Retweets duplicated data, as indicated by 181 non-null values in retweeted_status_id. Used pandas isnull on the retweeted_status_id column to filter them out.
- **Issue 4.** Not all dogs had a name. Handled this with a regex pattern and replace.
- **Issue 5.** The text column in df_archive_copy contained both the tweet text and a hyperlink at the end. Used a lambda function to split on 'https' and then slice.
- **Issue 6.** Multiple denominator values were greater than or less than 10. Replaced these with 10 using boolean logic and the NumPy where function.
- **Issue 7.** Several records contained more than one dog stage, e.g. pupper and doggo for the same dog. Replaced these with np.nan.
- **Issue 8.** Dropped unwanted columns.
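
A minimal sketch of how Issues 3, 5 and 6 might look in pandas is given below. The sample rows are invented, and the extra `.strip()` call is an addition for tidiness rather than something stated above.

```python
import numpy as np
import pandas as pd

# Invented sample rows standing in for the archive
df = pd.DataFrame({
    "text": [
        "Good dog 13/10 https://t.co/abc",
        "RT old tweet 12/10 https://t.co/def",
        "Big litter 84/70",
    ],
    "retweeted_status_id": [np.nan, 123.0, np.nan],
    "rating_denominator": [10, 10, 70],
})

# Issue 3: keep only original tweets (retweeted_status_id is null)
df = df[df["retweeted_status_id"].isnull()].copy()

# Issue 5: split on 'https' and keep the part before the link
df["text"] = df["text"].apply(lambda t: t.split("https")[0].strip())

# Issue 6: replace any denominator that is not 10 with 10
df["rating_denominator"] = np.where(
    df["rating_denominator"] != 10, 10, df["rating_denominator"]
)
```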

**Tidiness issues included:**
- Created function to combine four dog stage columns into one called stage
- Merged df_archive and df_tweet_copy into one dataframe
- Converted df_images from wide format to long using pandas wide_to_long
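
The first two tidiness steps could be sketched as follows. This assumes the four stage columns are named `doggo`, `floofer`, `pupper` and `puppo`, that a populated cell holds the stage name itself, and that the two frames share a `tweet_id` key. The helper `combine_stages`, the sample data, and the choice of an inner join are illustrative assumptions, not the project's actual code.

```python
import numpy as np
import pandas as pd

STAGE_COLUMNS = ["doggo", "floofer", "pupper", "puppo"]  # assumed column names


def combine_stages(row):
    """Return the single dog stage for a row, or NaN if none or several are set."""
    found = [col for col in STAGE_COLUMNS if row[col] == col]
    if len(found) == 1:
        return found[0]
    return np.nan  # no stage, or more than one stage (see Issue 7)


# Invented sample data
df_archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "doggo", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})
df_tweet_copy = pd.DataFrame({"tweet_id": [1, 3], "favorite_count": [10, 3]})

# Collapse the four stage columns into one and drop the originals
df_archive["stage"] = df_archive.apply(combine_stages, axis=1)
df_archive = df_archive.drop(columns=STAGE_COLUMNS)

# Merge the two frames on the shared key
df_merged = df_archive.merge(df_tweet_copy, on="tweet_id", how="inner")
```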



## Resources

- [Reading and Writing JSON to a File in Python](https://stackabuse.com/reading-and-writing-json-to-a-file-in-python/)
- [Replace value based on condition with np.where](https://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.where.html)
- [How to strip html tags from a string in Python](https://medium.com/@jorlugaqui/how-to-strip-html-tags-from-a-string-in-python-7cb81a2bbf44)
- [matplotlib documentation](https://matplotlib.org/3.1.1/tutorials/index.html)
- [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)
- **Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, by Wes McKinney**

