# Data Wrangling Report for the @WeRateDogs Project
<br><br>


## Introduction:
My data wrangling process covered the following three steps, in this order:
- Gathering Data
- Assessing Data
- Cleaning Data

### Gathering Data:
For this part, I had to acquire data from three different sources and load them into three pandas dataframes:<br>
1) The Twitter Archive was imported from a <i>CSV</i> file, 'twitter-archive-enhanced.csv'.<br>
2) The image prediction file, 'image-predictions.tsv', was downloaded programmatically with the <i>requests</i> Python library.<br>
3) The tweets in the Archive were queried from the Twitter API with the "tweepy" wrapper, and I saved each JSON response as a JSON string in a text file called 'tweet_json.txt'.<br>I then read this text file line by line, appended each JSON object to a list of dictionaries, and finally loaded this list into a dataframe.<br>Note: I saved each tweet id that could not be queried via "tweepy" in a "tweetid_errors.txt" log file for later verification.
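
The read-back step described in 3) could look roughly like this. It is a minimal sketch, not the notebook's exact code: the helper name `read_tweet_json`, the in-memory sample, and the fields `id`, `retweet_count` and `favorite_count` are assumptions made for illustration.

```python
import io
import json

def read_tweet_json(lines):
    """Parse one JSON object per non-empty line into a list of dictionaries."""
    records = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records

# Hypothetical two-line sample standing in for 'tweet_json.txt'.
sample_file = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}\n'
)
tweets = read_tweet_json(sample_file)
print(len(tweets), tweets[0]["retweet_count"])
```

With the real file, the same helper would be called on an open file handle, and the resulting list could be passed to `pd.DataFrame(tweets)`.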

### Assessing Data:
I assessed each dataframe visually and programmatically and documented the quality and tidiness issues found along the way, building up a "cleaning plan".<br>
I also included the key points from the project details, which specify necessary conditions for the records. I came up with the following list of issues (11 quality issues and 5 tidiness issues):<br>

#### Quality Issues in the Archive Dataframe:

- 'timestamp' should be a datetime object.
- Some tweets are obsolete and must be removed (they could not be fetched from the Twitter API via tweepy).
- Some tweets are retweets (181 non-null 'retweeted_status_id' values) and must be removed.
- There are missing values in the 'expanded_urls' column.
- The ratings were parsed from the text with a method that led to accuracy issues.
- The data type of the rating numerator should be float, not integer.
- Some ratings deliberately have inaccurate ('fantasy') values. I decided to keep only tweets with a rating between 0 and 14 out of 10 (there is enough fantasy in this already!).
- Some dog names were parsed incorrectly, e.g. "a".
- Some tweets do not have an image. All the records from the Archive (2356 entries) should also be in the Prediction dataframe (2075 entries).
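
Several of these issues can be spotted programmatically. The following is a minimal sketch on a made-up miniature of the Archive dataframe; the sample values are invented, and treating all-lowercase names as parsing errors is an assumption, not a rule taken from the project.

```python
import pandas as pd

# Hypothetical miniature of the Archive dataframe; values are made up.
df_archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "retweeted_status_id": [None, 99.0, None, None],
    "expanded_urls": ["https://example.com/1", None, "https://example.com/3", None],
    "name": ["Bella", "a", "None", "the"],
})

# Retweets are the rows with a non-null 'retweeted_status_id'.
n_retweets = df_archive["retweeted_status_id"].notna().sum()

# Rows without any URL.
n_missing_urls = df_archive["expanded_urls"].isna().sum()

# Assumption: names written entirely in lowercase are likely parsing errors.
suspect_names = df_archive.loc[df_archive["name"].str.islower(), "name"].tolist()

print(n_retweets, n_missing_urls, suspect_names)
```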

#### Quality Issue in the tweepy Dataframe:
- There are Retweets and Quotes in the tweepy Dataframe. I need to make sure that these are consistent with the Retweets/Quotes from the Archive.

#### Quality Issue in the Prediction DataFrame:
- Some tweets do not represent dogs.

#### Tidiness Issues in the Archive
- The rating for each record is kept in two columns, 'rating_numerator' and 'rating_denominator'.
- The values of the "dog stage" variable are spread across four columns: "doggo", "floofer", "pupper", "puppo".

#### Tidiness Issue in the tweepy Dataframe
- The favorite and retweet counts for each tweet should be part of the Archive dataframe.

#### Tidiness Issues in the Prediction Dataframe:
- There are several possible values for the predicted breed. We want to keep only the most plausible one.
- The "best" predicted breed column should be added to our master dataframe.

### Cleaning Data:


For this part, I first made a copy of each dataframe to clean, and then proceeded according to the following plan:<br>

I) I first addressed the issues that prevented the three dataframes from "matching" each other. This means keeping only the tweets in the Archive that were not deleted and are also in the Tweepy dataframe. In addition, each tweet from the Archive and Tweepy dataframes should have a picture and be in the Prediction dataframe.<br>
II) I then addressed the issues where data was missing, whenever possible.<br>
III) I then addressed the tidiness issues, whenever possible.<br>
IV) I finally worked on the remaining quality issues.<br><br>
I fixed each issue by following the three usual steps:<br>
1) Define<br>
2) Code<br>
3) Test<br>
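
As an illustration of the Define / Code / Test cycle, here is a minimal sketch for two of the quality issues (retweets and the 'timestamp' type). The sample rows are hypothetical, and the exact statements used in the notebook may differ.

```python
import pandas as pd

# Hypothetical sample; the real Archive has many more columns and rows.
df_archive_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "timestamp": ["2017-08-01 16:23:56 +0000",
                  "2017-08-01 00:17:27 +0000",
                  "2017-07-31 00:18:03 +0000"],
    "retweeted_status_id": [None, 42.0, None],
})

# Define: remove retweets and convert 'timestamp' to a datetime type.

# Code
is_original = df_archive_clean["retweeted_status_id"].isna()
df_archive_clean = df_archive_clean[is_original].copy()
df_archive_clean["timestamp"] = pd.to_datetime(df_archive_clean["timestamp"])

# Test
assert df_archive_clean["retweeted_status_id"].notna().sum() == 0
assert pd.api.types.is_datetime64_any_dtype(df_archive_clean["timestamp"])
```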

#### Main Challenges encountered while cleaning the data and how I solved them:

1) To merge specific columns from the Tweepy and Prediction dataframes into the Archive, I used the pandas merge function, joining the dataframes on the tweet id like so:

```python
# Left join: keep every Archive row and add the matching Tweepy columns
df_archive = pd.merge(df_archive, df_tweepy, on='tweet_id', how='left')
```
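
To carry over only specific columns, the right-hand dataframe can be narrowed before merging. This is a minimal sketch on made-up data; apart from 'tweet_id', the column names (`retweet_count`, `favorite_count`, `lang`) are assumptions.

```python
import pandas as pd

# Hypothetical samples; column names other than 'tweet_id' are assumptions.
df_archive = pd.DataFrame({"tweet_id": [1, 2, 3], "text": ["a", "b", "c"]})
df_tweepy = pd.DataFrame({
    "tweet_id": [1, 2],
    "retweet_count": [10, 3],
    "favorite_count": [25, 7],
    "lang": ["en", "en"],
})

# Select only the columns to carry over, then left-join on the tweet id.
counts = df_tweepy[["tweet_id", "retweet_count", "favorite_count"]]
df_merged = pd.merge(df_archive, counts, on="tweet_id", how="left")

print(df_merged)
```

With `how='left'`, tweet 3 stays in the result with missing counts; `how='inner'` would drop it instead, which is one way to keep only tweets present in both dataframes.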

2) To parse the "dog stage" from the text column, I wrote the following function, which I applied to each row of the Archive dataframe:<br>

```python
dogs_stages = ['doggo', 'floofer', 'pupper', 'puppo']

def parse_dog_stages(text):
    """Return the dog stage found in a tweet text, 'Multiple', or 'None'."""
    # Collect every known stage that appears in the text
    stages = [stage for stage in dogs_stages if stage in text]
    if len(stages) > 1:
        return 'Multiple'
    if stages:
        return stages[0].capitalize()
    return 'None'
```
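
A short usage sketch follows, with the function repeated in compact form so that the snippet runs on its own. The tweet texts are made up for illustration.

```python
DOG_STAGES = ["doggo", "floofer", "pupper", "puppo"]

def parse_dog_stages(text):
    """Return the stage named in the text, 'Multiple', or 'None'."""
    found = [stage for stage in DOG_STAGES if stage in text]
    if len(found) > 1:
        return "Multiple"
    return found[0].capitalize() if found else "None"

# Made-up tweet texts for illustration.
texts = [
    "This is Max. He is a very good pupper. 12/10",
    "Here we have a doggo and a pupper. 11/10",
    "Meet Luna. 13/10",
]
stages = [parse_dog_stages(t) for t in texts]
print(stages)  # ['Pupper', 'Multiple', 'None']
```

On a dataframe, the same function could be applied with `df['text'].apply(parse_dog_stages)`. Note that the matching is case-sensitive, so a text containing "Doggo" would not be detected unless it is lowercased first.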

3) To keep only one breed variable from the Prediction dataframe, I wrote a function that returns, for each record, the breed with the highest prediction confidence among the predictions in which a dog was recognized in the picture:

```python
def get_best_predicted_breed(row):
    """Return the dog breed with the highest prediction confidence, or None."""
    # Predictions are ordered by confidence (p1_conf > p2_conf > p3_conf),
    # so the first prediction flagged as a dog is the most confident one.
    for n in (1, 2, 3):
        if row[f'p{n}_dog']:
            return row[f'p{n}']
    # None of the three predictions is a dog
    return None
```
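
The selection logic can be checked on a few hand-made rows. This is a minimal sketch: the rows are plain dictionaries with invented values, and the function is repeated in compact form (returning `None` when no prediction is a dog, which is an assumption about the desired behavior).

```python
def get_best_predicted_breed(row):
    """Return the most confident prediction flagged as a dog, else None."""
    for n in (1, 2, 3):  # p1 has the highest confidence, p3 the lowest
        if row[f"p{n}_dog"]:
            return row[f"p{n}"]
    return None

# Made-up rows using the column names of the Prediction dataframe.
rows = [
    {"p1": "golden_retriever", "p1_conf": 0.80, "p1_dog": True,
     "p2": "labrador", "p2_conf": 0.10, "p2_dog": True,
     "p3": "tennis_ball", "p3_conf": 0.05, "p3_dog": False},
    {"p1": "paper_towel", "p1_conf": 0.60, "p1_dog": False,
     "p2": "pug", "p2_conf": 0.30, "p2_dog": True,
     "p3": "chihuahua", "p3_conf": 0.05, "p3_dog": True},
    {"p1": "seat_belt", "p1_conf": 0.50, "p1_dog": False,
     "p2": "banana", "p2_conf": 0.20, "p2_dog": False,
     "p3": "web_site", "p3_conf": 0.10, "p3_dog": False},
]
breeds = [get_best_predicted_breed(r) for r in rows]
print(breeds)  # ['golden_retriever', 'pug', None]
```

On a dataframe, the function could be applied row by row with `df.apply(get_best_predicted_breed, axis=1)`.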

4) To parse the ratings from the text column and keep the result in a dataframe, I used the following regular expression:

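```python
# Capture the numerator of 'x/10' ratings, including decimals such as 13.5;
# the named group 'num' becomes the column name of the resulting dataframe
res = df_archive_clean.text.str.extract(r'(?P<num>\d{1,2}\.?\d*)/10')
```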
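
To see what this pattern captures, it can be tried with Python's standard `re` module on a few sample strings. This is a minimal sketch: the texts are made up, the helper `parse_rating` is hypothetical, and the final filter mirrors the decision to keep ratings between 0 and 14.

```python
import re

RATING_PATTERN = re.compile(r"(?P<num>\d{1,2}\.?\d*)/10")

def parse_rating(text):
    """Return the numerator of the first 'x/10' rating as a float, or None."""
    match = RATING_PATTERN.search(text)
    return float(match.group("num")) if match else None

# Made-up tweet texts for illustration.
texts = [
    "This is Bella. 13.5/10 would pet again",
    "Solid 12/10 pupper",
    "No rating here",
    "Happy 4th of July! 1776/10",
]
ratings = [parse_rating(t) for t in texts]

# Keep only ratings in the accepted range of 0 to 14.
kept = [r for r in ratings if r is not None and 0 <= r <= 14]

print(ratings)  # [13.5, 12.0, None, 1776.0]
print(kept)     # [13.5, 12.0]
```

Because `\d*` accepts any number of trailing digits, very large numerators such as 1776 are still captured, so the range filter remains necessary.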

### Storing the Data

I finally stored the cleaned master dataset in a <i>CSV</i> file, 'twitter_archive_master.csv'.
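
The storing step could look roughly like the following. It is a minimal sketch with a made-up two-row dataframe, written to an in-memory buffer rather than to disk so that it runs anywhere; the column names are assumptions.

```python
import io
import pandas as pd

# Hypothetical stand-in for the cleaned master dataframe.
df_master = pd.DataFrame({
    "tweet_id": [1, 2],
    "timestamp": pd.to_datetime(["2017-08-01 16:23:56", "2017-07-31 00:18:03"]),
    "rating_numerator": [13.5, 12.0],
})

# Write without the index column, then read the CSV text back.
buffer = io.StringIO()
df_master.to_csv(buffer, index=False)
buffer.seek(0)
df_back = pd.read_csv(buffer, parse_dates=["timestamp"])

print(df_back.dtypes)
```

With the real data, the buffer would be replaced by the file name 'twitter_archive_master.csv'. A CSV file does not store data types, so datetime columns have to be parsed again when the file is loaded, for example with `parse_dates`.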