# Wrangling Report

##### By Wessam Alhallak

## Table of Contents
- [Introduction](#intro)
- [The Data](#data)
- [Gathering data](#gather)
- [Assessing data](#assess)
   - [Quality](#quality)
   - [Tidiness](#tidiness)
- [Cleaning data](#clean)
- [Storing](#storing)
- [Analyzing and Visualizing](#Analysing)
   - [Insight one & visualization](#one)
   - [Insight two](#two)
   - [Insight three & visualization](#three)
   - [Insight four & visualization](#four)

<a id='intro'></a>
## Introduction

This project wrangles WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it contains only very basic tweet information. Additional gathering, followed by assessing and cleaning, is required for "Wow!"-worthy analyses and visualizations.

<a id="data"></a>
## The Data
### Enhanced Twitter Archive
The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain though: each tweet's text, which I used to extract rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." Of the 5000+ tweets, I have filtered for tweets with ratings only (there are 2356).

### Additional Data via the Twitter API

Back to the basic-ness of Twitter archives: retweet count and favorite count are two of the notable column omissions. Fortunately, this additional data can be gathered by anyone from Twitter's API. Well, "anyone" who has access to data for the 3000 most recent tweets, at least. But you, because you have the WeRateDogs Twitter archive and specifically the tweet IDs within it, can gather this data for all 5000+. And guess what? You're going to query Twitter's API to gather this valuable data.


### Image Predictions File
One more cool thing: I ran every image in the WeRateDogs Twitter archive through a neural network that can classify breeds of dogs. The results: a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images).


<a id="gather"> </a>
## Gathering Data

### Reading the CSV from the local file system
Using the pandas CSV reader (pd.read_csv).
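
A minimal sketch of this step. The in-memory sample and its two rows are made up so the snippet runs on its own, and the file name in the comment is an assumption, since the report does not name the file.

```python
import io

import pandas as pd

# Two made-up rows standing in for the archive file. The column names
# (tweet_id, timestamp, text) are the ones used later in this report.
sample_csv = io.StringIO(
    "tweet_id,timestamp,text\n"
    "1001,2017-08-01 16:23:56 +0000,Good dog. 13/10\n"
    "1002,2017-08-01 00:17:27 +0000,Solid pupper. 12/10\n"
)

# With the real data, pass a file path instead of the buffer, for example
# pd.read_csv("twitter-archive-enhanced.csv") (file name is an assumption).
df_csv = pd.read_csv(sample_csv)
print(df_csv.shape)  # → (2, 3)
```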

### Reading the image data file
Using the requests package for Python.

### Saving the image data file
Using pandas or any other standard file reading and writing method.
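
The download and save steps could look roughly like the sketch below. The helper names `download_file` and `save_bytes`, the tab-separated layout, and the sample columns `jpg_url` and `img_num` are assumptions for illustration; the demonstration runs offline on made-up bytes rather than calling a real URL.

```python
import os
import tempfile

import pandas as pd


def save_bytes(content, path):
    """Write raw bytes to disk exactly as received."""
    with open(path, "wb") as f:
        f.write(content)


def download_file(url, path, timeout=30):
    """Fetch `url` with requests and store the response body at `path`."""
    # Imported here so the offline demo below runs even where requests
    # is not installed.
    import requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    save_bytes(response.content, path)


# Offline demonstration with made-up, tab-separated content.
sample = b"tweet_id\tjpg_url\timg_num\tp1\n1001\thttps://example.com/a.jpg\t1\tpug\n"
path = os.path.join(tempfile.mkdtemp(), "image_predictions.tsv")
save_bytes(sample, path)

df_img = pd.read_csv(path, sep="\t")
print(df_img.shape)  # → (1, 4)
```

Writing the response bytes unchanged keeps the saved file identical to what the server sent, so it can be re-read later without downloading again.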

## Preparing the API credentials from a JSON file ##

- twitter_api.py: This is the Twitter API code to gather some of the required data for the project. Read the code and comments, understand how the code works, then copy and paste it into your notebook.
- tweet_json.txt: This is the resulting data from twitter_api.py. You can proceed with the following part of "Gathering Data for this Project" on the Project Details page: "Then read this tweet_json.txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count."
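
Reading tweet_json.txt line by line can be sketched as follows. It assumes each line holds one JSON object with the fields `id`, `retweet_count`, and `favorite_count`; the helper name `read_tweet_json` and the two sample tweets are made up for illustration.

```python
import json
import os
import tempfile

import pandas as pd


def read_tweet_json(path):
    """Build a DataFrame from a file with one JSON tweet object per line."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            tweet = json.loads(line)
            rows.append({
                "tweet_id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            })
    return pd.DataFrame(rows, columns=["tweet_id", "retweet_count", "favorite_count"])


# Offline demonstration: write two made-up tweets, then read them back.
sample_tweets = [
    {"id": 1001, "retweet_count": 5, "favorite_count": 20},
    {"id": 1002, "retweet_count": 7, "favorite_count": 31},
]
path = os.path.join(tempfile.mkdtemp(), "tweet_json.txt")
with open(path, "w", encoding="utf-8") as f:
    for tweet in sample_tweets:
        f.write(json.dumps(tweet) + "\n")

df_api = read_tweet_json(path)
print(len(df_api))  # → 2
```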



<a id="assess"></a>
## Assessing
 

### Programmatically

### Code

- Some of the code I used to assess the data:

        df_csv.head(50)
        df_img.head()
        df_img.info()
        df_api.head()
        df_csv.info()
        df_csv.doggo.unique()
        df_csv.floofer.unique()
        df_csv.puppo.unique()
        df_csv.pupper.unique()
        df_csv[df_csv.expanded_urls.isna()].info()
        df_csv[df_csv.tweet_id.duplicated()]
        # scatter_matrix lives in pd.plotting in current pandas versions
        pd.plotting.scatter_matrix(df_csv, figsize=(25, 25))
        df_img.p3.unique()

<a id="quality"></a>

## Assessing Results 

### Quality: 
 
#### df_csv
1 - timestamp needs to be changed from string to datetime<br/>
2 - Dummy variables (dog stage) need to be cleaned into int (boolean)<br/>
3 - in_reply_to_status_id should be changed to string type<br/>
4 - retweeted_status_user_id should be changed to string type<br/>
5 - rating_numerator and rating_denominator should be combined into one float variable<br/>
#### df_img
IDs should be changed to string type
#### df_api

IDs should be changed to string type
 

<a id="tidiness"></a>
 

### Tidiness

1 - Dog stages need to be combined into one column<br/>
2 - df_api and df_csv represent the same observations, so it would be better to merge them,
but we should be careful about null values and the difference in row counts between the tables<br/>
3 - The merged dataframe and df_img represent the same observations, so it would be better to merge them,
but we should be careful about null values and the difference in row counts between the tables

<a id="clean"></a>
## Cleaning Data

### Quality 

### df_csv
First, we copy the dataframe to a new dataframe (df_csv_clean) so that the original stays intact and our work remains consistent.

#### Define

Change timestamp to datetime type

#### Code

    df_csv_clean["timestamp"]=pd.to_datetime(df_csv_clean.timestamp)

#### Test

    df_csv_clean.info()

#### Define

Change tweet_id to string type


#### Code

    df_csv_clean["tweet_id"]=df_csv_clean["tweet_id"].astype(str)


#### Test

    df_csv_clean.head()
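
One common motivation for storing IDs as strings, shown as a small sketch: 18-digit IDs are larger than what a float64 can represent exactly, so an ID column that pandas reads as float (for example because it contains missing values) can silently lose digits. The ID below is made up for illustration.

```python
import pandas as pd

tweet_id = 123456789012345679  # made-up 18-digit ID

# float64 carries 53 bits of precision, so this value does not survive
# a round trip through float.
as_float = float(tweet_id)
print(int(as_float) == tweet_id)  # → False

# Converting the integer column to strings keeps every digit.
ids = pd.Series([tweet_id]).astype(str)
print(ids[0])  # → 123456789012345679
```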

#### Define 

Drop retweeted_status_id, retweeted_status_user_id, in_reply_to_status_id, in_reply_to_user_id, and retweeted_status_timestamp

#### Code

    df_csv_clean.drop(["retweeted_status_id",
                        "retweeted_status_user_id",
                        "in_reply_to_status_id",
                        "in_reply_to_user_id",
                        "retweeted_status_timestamp"],
                        axis=1,inplace=True)

#### Test

    df_csv_clean.info()

#### Define
The current pipeline captures incorrect values when rating numerators contain decimals.
It would also be better to combine the numerator and denominator into one column, since together they describe one measure.


#### Code

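    # Inspect the rows whose numerator is above 10
    df_csv_clean[df_csv_clean.rating_numerator > 10].info()
    # Re-extract both parts of the rating, allowing a decimal numerator such as 13.5/10
    rating = df_csv_clean.text.str.extract(r'((?:\d+\.)?\d+)\/(\d+)', expand=True)
    df_csv_clean.rating_numerator = rating[0].astype(float)
    df_csv_clean.rating_denominator = rating[1].astype(float)
    # Drop rows with a zero denominator; .copy() avoids chained-assignment warnings later
    df_csv_clean = df_csv_clean[df_csv_clean["rating_denominator"] != 0].copy()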

#### Test 

    df_csv_clean.info()
    df_csv_clean.describe()
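
A small sketch of how the extraction pattern behaves on made-up tweet texts. It also shows one possible way to combine both parts into a single column, as the Define step suggests; the column name `rating` is an assumption, not something the cleaning code above creates.

```python
import pandas as pd

# Made-up tweet texts, one with a decimal numerator.
texts = pd.Series([
    "Good dog. 13.5/10 would pet",
    "Solid pupper. 9/10",
])

# Optional integer part plus dot, then digits, a slash, and the denominator.
pattern = r"((?:\d+\.)?\d+)/(\d+)"
ratings = texts.str.extract(pattern, expand=True).astype(float)
ratings.columns = ["rating_numerator", "rating_denominator"]

# Hypothetical combined measure: numerator divided by denominator.
ratings["rating"] = ratings["rating_numerator"] / ratings["rating_denominator"]
print(ratings["rating_numerator"].tolist())  # → [13.5, 9.0]
```

Note that the pattern takes the first fraction-like match in the text, so a tweet that mentions another fraction before the rating would still need manual review.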

### df_img
Copy df_img to a new dataframe


#### Define
Change tweet_id to string type

#### Code

    df_img_clean["tweet_id"]=df_img_clean["tweet_id"].astype(str)

#### Test

    df_img_clean.info()
    df_img_clean.describe()

### df_api  ###
Copy df_api into a separate dataframe for cleaning

    df_api_clean=df_api.copy()

#### Define
Change tweet_id to string type

#### Code

    df_api_clean["tweet_id"]=df_api_clean["tweet_id"].astype(str)

## Cleaning
### Tidiness

#### df_csv_clean 
#### Define

Dog stages need to be combined into one column

    df_csv_clean.loc[(df_csv_clean[['doggo', 'floofer', 'pupper', 'puppo']] != 'None'
                 ).sum(axis=1) > 1]

#### Code

    # Replace the "None" placeholder with an empty string in every stage column
    stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']
    df_csv_clean[stage_cols] = df_csv_clean[stage_cols].replace("None", "")
    # Concatenate the four columns into one
    df_csv_clean['stage'] = df_csv_clean.doggo + df_csv_clean.floofer + df_csv_clean.pupper + df_csv_clean.puppo
    # Separate multi-stage values with a comma
    df_csv_clean.loc[df_csv_clean.stage == 'doggopupper', 'stage'] = 'doggo,pupper'
    df_csv_clean.loc[df_csv_clean.stage == 'doggopuppo', 'stage'] = 'doggo,puppo'
    df_csv_clean.loc[df_csv_clean.stage == 'doggofloofer', 'stage'] = 'doggo,floofer'
    # Rows without any stage get the "None" label back
    df_csv_clean['stage'] = df_csv_clean['stage'].replace("", "None")

#### Test

    df_csv_clean.stage.value_counts()
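
As an alternative sketch, the same combination can be written without listing each two-stage pair by hand. This is not the code used in the report; the helper `combine_stages` and the toy rows are made up for illustration.

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Toy data in the same shape as the archive's stage columns.
df = pd.DataFrame({
    "doggo":   ["doggo", "None", "doggo", "None"],
    "floofer": ["None", "None", "None", "None"],
    "pupper":  ["None", "pupper", "pupper", "None"],
    "puppo":   ["None", "None", "None", "None"],
})


def combine_stages(row):
    """Join every stage present in the row with commas, or return "None"."""
    stages = [value for value in row if value != "None"]
    return ",".join(stages) if stages else "None"


df["stage"] = df[stage_cols].apply(combine_stages, axis=1)
print(df["stage"].tolist())  # → ['doggo', 'pupper', 'doggo,pupper', 'None']
```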

#### Define
Drop the dummy variables

#### Code

    df_csv_clean.drop(['doggo', 'floofer', 'pupper', 'puppo'],axis=1,inplace=True)

#### Test

    df_csv_clean.info()

### df_api, df_csv
#### Define ##

These two tables represent the same observations, so it would be better to merge them, but we should be careful about null values and the difference in row counts between the tables.


#### Code
We use merge to join the two dataframes

    merged_1=df_csv_clean.merge(df_api_clean,how="inner",on="tweet_id")

#### Test

    merged_1.describe()
    merged_1.head()
    merged_1.info()
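
One way to check the difference in row counts before relying on an inner merge is an outer merge with `indicator=True`, sketched below on made-up tables. The variable names `left`, `right`, and `check` are illustrative and not part of the report's code.

```python
import pandas as pd

# Toy tables that share only some tweet IDs.
left = pd.DataFrame({"tweet_id": ["1", "2", "3"], "text": ["a", "b", "c"]})
right = pd.DataFrame({"tweet_id": ["2", "3", "4"], "retweet_count": [5, 7, 9]})

# The _merge column records whether each row came from both tables or only one.
check = left.merge(right, how="outer", on="tweet_id", indicator=True)
counts = check["_merge"].value_counts()
print(counts)

# validate raises an error if tweet_id is duplicated on either side.
inner = left.merge(right, how="inner", on="tweet_id", validate="one_to_one")
print(len(inner))  # → 2
```

Rows marked `left_only` or `right_only` are the ones an inner merge drops, which makes the row count difference explicit.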

### df_img, merged_1
#### Define
df_img represents the same observations as merged_1, so it would be better to merge them,
but we should be careful about null values and the difference in row counts between the tables.

#### Code

    merged_2=merged_1.merge(df_img_clean,how="inner",on="tweet_id")

#### Test 

    merged_2.info()
    merged_2.describe()
    merged_2.head()

<a id="storing"></a>
## Saving Data
### Saving as CSV in the local file system

    df=merged_2.copy()
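
Writing the final dataframe to disk could look like the sketch below. The file name `twitter_archive_master.csv` and the toy rows are assumptions for illustration. The read-back step shows two details worth checking when the file is loaded again: `tweet_id` must be requested as a string, and the literal "None" stage label may otherwise be parsed as a missing value, depending on the pandas version.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the merged dataframe.
df = pd.DataFrame({"tweet_id": ["1001", "1002"], "stage": ["doggo", "None"]})

# Assumed output file name, written to a temporary folder for this demo.
path = os.path.join(tempfile.mkdtemp(), "twitter_archive_master.csv")
df.to_csv(path, index=False)  # index=False avoids an extra unnamed column

# Read back with tweet_id as string and without default NA parsing,
# so the "None" label stays a plain string.
restored = pd.read_csv(path, dtype={"tweet_id": str}, keep_default_na=False)
print(restored["stage"].tolist())  # → ['doggo', 'None']
```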
