# Project: Wrangle and Analyze Data

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the methods required to gather each piece of data are different.
1. Directly download the WeRateDogs Twitter archive data (twitter-archive-enhanced.csv)

In [None]:
# Install Libraries for the first time

# !pip install tweepy
# !pip install configparser

In [None]:
# Import the necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import os
import time
import tweepy
import configparser
import json
import re
import seaborn as sns
%matplotlib inline
config = configparser.ConfigParser()
config.read('config.ini')

In [None]:
# Load the tweet archive into a DataFrame

df_twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')

2. Use the Requests library to download the tweet image predictions (image_predictions.tsv)

In [None]:
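# Download the image prediction file using the Requests library

img_url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
img_request = requests.get(img_url, allow_redirects=True)
img_request.raise_for_status()  # stop here if the download failed

# Write the response body to disk; the with-block closes the file
with open('image_predictions.tsv', 'wb') as file:
    file.write(img_request.content)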

In [None]:
# Using Pandas, we convert the tsv file into a Dataframe

df_image_predictions = pd.read_csv('image_predictions.tsv', sep='\t')
df_image_predictions.head()

3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

**The Tweepy code below is commented out. To run the cells that call the Twitter API, first open the config.ini file and enter your own Twitter API credentials.**

In [None]:
# # To keep my Twitter API keys out of the notebook, I read them from config.ini with configparser.

# consumer_key = config['twitter']['consumer_key']
# consumer_secret = config['twitter']['consumer_secret']
# access_token = config['twitter']['access_token']
# access_secret = config['twitter']['access_secret']
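
For reference, a `config.ini` that the lookups above would accept could look like the sketch below. The section and key names come from the code; the placeholder values are made up, and `read_string` stands in for `config.read('config.ini')` so the example runs without a file on disk.

```python
import configparser

# Hypothetical contents of config.ini; replace the placeholders with real keys.
SAMPLE_INI = """
[twitter]
consumer_key = YOUR_CONSUMER_KEY
consumer_secret = YOUR_CONSUMER_SECRET
access_token = YOUR_ACCESS_TOKEN
access_secret = YOUR_ACCESS_SECRET
"""

sample_config = configparser.ConfigParser()
sample_config.read_string(SAMPLE_INI)

# Fail early with a readable message if a key is missing.
required = ['consumer_key', 'consumer_secret', 'access_token', 'access_secret']
missing = [key for key in required if key not in sample_config['twitter']]
if missing:
    raise KeyError(f"config.ini is missing: {missing}")

consumer_key = sample_config['twitter']['consumer_key']
```

Checking for missing keys up front gives a clearer error than a bare `KeyError` halfway through authentication.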

In [None]:
# # Pass the API keys read by configparser to Tweepy

# auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
# auth.set_access_token(access_token, access_secret)
# api = tweepy.API(auth)

In [None]:
# # Pull a tweet from Twitter to test if the API connection is working.

# api.get_status(df_twitter_archive.tweet_id[1000]).text

In [None]:
# # Declare an empty list to hold the tweet objects collected from Twitter

# tweet_list = []

# # Declare a list to hold the IDs that could not be fetched

# tweet_failure = []

# # Loop through the tweet id column and fetch the data from Twitter

# for tweet_id in df_twitter_archive.tweet_id:
#     try:
#         tweet_list.append(api.get_status(tweet_id, tweet_mode='extended'))
#     except Exception as e:
#         tweet_failure.append(tweet_id)
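
The loop above records only the failing IDs. A minimal sketch of the same pattern that also keeps the reason for each failure is shown below. `fetch_all` and `fake_fetch` are hypothetical names introduced for illustration, and `fake_fetch` replaces the real API call so the example runs offline.

```python
def fetch_all(tweet_ids, fetch):
    """Call fetch(tweet_id) for every ID and return (results, failures)."""
    results, failures = [], {}
    for tweet_id in tweet_ids:
        try:
            results.append(fetch(tweet_id))
        except Exception as error:
            # Keep the reason for the failure, not just the ID
            failures[tweet_id] = repr(error)
    return results, failures


# Hypothetical stand-in for the API call: odd IDs fail, even IDs succeed.
def fake_fetch(tweet_id):
    if tweet_id % 2:
        raise LookupError('No status found with that ID.')
    return {'id': tweet_id}


tweets, failed = fetch_all([10, 11, 12], fake_fetch)
print(len(tweets), 'fetched;', len(failed), 'failed')
```

With Tweepy, the `fetch` argument could be `lambda i: api.get_status(i, tweet_mode='extended')`. Creating the client with `tweepy.API(auth, wait_on_rate_limit=True)` is one way to have Tweepy pause when the rate limit is reached rather than raise an error.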

In [None]:
# # Print the number of tweets fetched and the number of IDs that failed

# print("Tweets fetched:", len(tweet_list))
# print("Tweets not found:", len(tweet_failure))

In [None]:
# # Convert each Tweepy Status object to its raw JSON dictionary
# tweet_dicts = []
# for i in tweet_list:
#     tweet_dicts.append(i._json)

# # Write the list of dictionaries to a text file as one JSON array

# with open('tweet_json.txt', 'w') as file:
#     file.write(json.dumps(tweet_dicts, indent=4))

In [None]:
# # View the json file to identify what to extract

# tweet_dicts

In [None]:
# Extract relevant data from the JSON file and store it in a list

tweets_list = []
with open('tweet_json.txt', encoding='utf-8') as json_file:
    all_meta_data = json.load(json_file)

for tweet_dictionary in all_meta_data:
    tweet_id = tweet_dictionary['id']
    tweet_text = tweet_dictionary['full_text']
    # Keep everything from the first 'https' onward as the tweet URL
    only_url = tweet_text[tweet_text.find('https'):]
    favorite_count = tweet_dictionary['favorite_count']
    retweet_count = tweet_dictionary['retweet_count']
    # The source is an HTML anchor; keep only the link text (the device name)
    whole_source = tweet_dictionary['source']
    source = whole_source[whole_source.find('rel="nofollow">') + 15:-4]
    # Retweets carry a 'retweeted_status' key; original tweets do not
    if 'retweeted_status' in tweet_dictionary:
        retweeted_status = 'This is a retweet'
        url = 'This is a retweet'
    else:
        retweeted_status = 'Original tweet'
        url = only_url

    tweets_list.append({'tweet_id': str(tweet_id),
                        'favorite_count': int(favorite_count),
                        'retweet_count': int(retweet_count),
                        'url': url,
                        'source': source,
                        'retweeted_status': retweeted_status,
                        })

# Build the DataFrame once, after the loop has collected every tweet
df_counts = pd.DataFrame(tweets_list, columns=['tweet_id', 'favorite_count', 'retweet_count',
                                               'source', 'retweeted_status', 'url'])
df_counts.info()
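
The device name above is cut out of the `source` field with fixed offsets (`+ 15` and `-4`), which depends on the anchor tag always having the same shape. A regex-based sketch that reads the link text directly is shown below. `device_from_source` is a hypothetical helper, and the sample string only illustrates the anchor format the slicing code assumes.

```python
import re


def device_from_source(source_html):
    """Return the link text of a `source` anchor tag, or the input unchanged."""
    match = re.search(r'>([^<]*)</a>', source_html)
    return match.group(1) if match else source_html


# Illustrative value in the format the slicing code expects
sample = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
print(device_from_source(sample))
```

Returning the input unchanged when nothing matches keeps unexpected values visible in the data instead of silently truncating them.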

In [None]:
df_counts.sample(10)

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.



### Archive Dataframe Analysis
In this section, the different dataframes will be analyzed for **Completeness**, **Validity**, **Accuracy** and **Consistency**.

#### Visual Assessment
Every dataframe is first assessed visually for quality and tidiness issues. I use `sample(50)` to select 50 random rows from each dataframe to inspect.

In [None]:
df_twitter_archive.sample(50)

In [None]:
df_image_predictions.sample(50)

In [None]:
df_counts.sample(50)

The approach for displaying an image from a URL follows https://stackoverflow.com/questions/11854847/how-can-i-display-an-image-from-a-file-in-jupyter-notebook

In [None]:
# testing the image url

from IPython.display import Image
Image(url = df_image_predictions.jpg_url[380])

#### Programmatic Assessment
Here I use pandas functions and methods to assess the data for quality and tidiness issues.

In [None]:
df_twitter_archive.info()

In [None]:
df_image_predictions.info()

In [None]:
df_counts.info()

In [None]:
df_twitter_archive.describe()

In [None]:
df_counts.describe()

In [None]:
df_image_predictions.describe()

In [None]:
df_twitter_archive.name.value_counts()

In [None]:
df_image_predictions.p1.value_counts()

In [None]:
df_twitter_archive[df_twitter_archive.tweet_id.duplicated()]

In [None]:
df_counts[df_counts.tweet_id.duplicated()]

In [None]:
df_image_predictions[df_image_predictions.tweet_id.duplicated()]
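
Beyond the `info`, `describe`, and `duplicated` calls above, a few targeted checks can turn visual impressions into counts. The sketch below runs on a toy frame with invented values. The column names match the archive, but which checks matter is an assumption based on the issues listed in the next section.

```python
import pandas as pd

# Toy stand-in for df_twitter_archive; the values are invented.
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3, 3],
    'name': ['Bella', 'a', 'None', 'the'],
    'rating_denominator': [10, 10, 50, 10],
})

# Number of repeated tweet IDs
duplicate_ids = toy.tweet_id.duplicated().sum()

# Names that are entirely lowercase are probably ordinary words, not names
lowercase_names = toy.loc[toy.name.str.islower(), 'name'].tolist()

# Ratings whose denominator is not the usual 10
odd_denominators = (toy.rating_denominator != 10).sum()

print(duplicate_ids, lowercase_names, odd_denominators)
```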

### Quality issues
1. Add decimals to Rating Numerator

2. Fix Dog Ratings

3. Change Timestamp to Correct Date time format

4. Remove every Retweet

5. Fix Errors with Dog Names

6. Some records have multiple dog stages

7. Drop Unwanted columns

8. Fix Erroneous Datatypes in most columns

### Tidiness issues
1. Merge all 3 Dataframes to be one master Dataframe

2. Creating a dog breed column using data from image prediction

3. Remove tweets with no image

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing.

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [None]:
# Make copies of original pieces of data
twitter_archive_clean = df_twitter_archive.copy()
image_predictions_clean = df_image_predictions.copy()
tweets_meta_count = df_counts.copy()

### Quality Issue 1: Add decimals to Rating Numerator

#### Define
- Convert rating numerator and denominator to float from int

#### Code

In [None]:
twitter_archive_clean.rating_numerator = twitter_archive_clean.rating_numerator.astype(float)
twitter_archive_clean.rating_denominator = twitter_archive_clean.rating_denominator.astype(float)

#### Test

In [None]:
twitter_archive_clean.info()

### Quality Issue 2: Fix Dog rating system

#### Define
- Find rows in twitter_archive_clean whose text contains a decimal rating that was not extracted properly, and correct the numerator

#### Code

In [None]:
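# Show tweets whose text contains a decimal rating such as 9.75/10
# (no capture group in the pattern, so str.contains does not emit a warning)
with pd.option_context('max_colwidth', 200):
    display(twitter_archive_clean[twitter_archive_clean.text.str.contains(r"\d+\.\d*/\d+")]
            [['tweet_id', 'text', 'rating_numerator', 'rating_denominator']])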

In [None]:
# Set the correct decimal numerators for the tweets listed above
twitter_archive_clean.loc[(twitter_archive_clean.tweet_id == 883482846933004288), 'rating_numerator'] = 13.5
twitter_archive_clean.loc[(twitter_archive_clean.tweet_id == 832215909146226688), 'rating_numerator'] = 9.75
twitter_archive_clean.loc[(twitter_archive_clean.tweet_id == 786709082849828864), 'rating_numerator'] = 9.75
twitter_archive_clean.loc[(twitter_archive_clean.tweet_id == 778027034220126208), 'rating_numerator'] = 11.27
twitter_archive_clean.loc[(twitter_archive_clean.tweet_id == 681340665377193984), 'rating_numerator'] = 9.5
twitter_archive_clean.loc[(twitter_archive_clean.tweet_id == 680494726643068929), 'rating_numerator'] = 11.26
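
The corrections above are entered by hand for each tweet ID. As a sketch of how the same values could be re-extracted from the text instead, the helper below parses the first `numerator/denominator` pattern, allowing a decimal numerator. `extract_rating` is a hypothetical helper and the sample sentences are invented. It assumes the first match is the rating, which would be wrong for text such as "24/7".

```python
import re

# One or more digits, an optional decimal part, a slash, then the denominator
RATING = re.compile(r'(\d+(?:\.\d+)?)/(\d+)')


def extract_rating(text):
    """Return (numerator, denominator) as floats for the first rating, else None."""
    match = RATING.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


print(extract_rating('He deserves 9.75/10 for effort'))
print(extract_rating('A solid 13/10 would pet'))
```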

#### Test

In [None]:
# Re-run the same filter to confirm the numerators now match the text
with pd.option_context('max_colwidth', 200):
    display(twitter_archive_clean[twitter_archive_clean.text.str.contains(r"\d+\.\d*/\d+")]
            [['tweet_id', 'text', 'rating_numerator', 'rating_denominator']])

In [None]:
image_predictions_clean[(image_predictions_clean.tweet_id == 786709082849828864)].index

In [None]:
# Logan's rating was 75/10; it is now fixed to 9.75/10
twitter_archive_clean.loc[(twitter_archive_clean.tweet_id == 786709082849828864),'rating_numerator':'rating_denominator']

In [None]:
# Logan's cute image
from IPython.display import Image
Image(url = image_predictions_clean.jpg_url[1515])

### Quality Issue 3: Change Timestamp to Correct Datetime Object

#### Define
- Removing time zone from the Timestamp column
- Change it to a datetime object

#### Code

In [None]:
twitter_archive_clean['timestamp'] = twitter_archive_clean['timestamp'].str.slice(start=0, stop=-6)

In [None]:
twitter_archive_clean['timestamp'] = pd.to_datetime(twitter_archive_clean['timestamp'], format = "%Y-%m-%d %H:%M:%S")
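
Slicing off the last six characters works as long as every timestamp ends in an offset of exactly that width. An alternative sketch that parses the offset and then drops the time zone is shown below. The sample strings are invented and only follow the `YYYY-MM-DD HH:MM:SS +0000` layout that the slicing step implies.

```python
import pandas as pd

# Invented timestamps in the layout the slicing step assumes
raw = pd.Series(['2017-08-01 16:23:56 +0000', '2015-11-15 22:32:08 +0000'])

# Parse with the offset, normalise to UTC, then drop the time zone
parsed = pd.to_datetime(raw, utc=True).dt.tz_localize(None)
print(parsed.dtype)
```

Parsing the offset first means a timestamp with a different offset would be converted to UTC rather than silently mislabeled.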

#### Test

In [None]:
twitter_archive_clean.info()

In [None]:
twitter_archive_clean.sample(5)

### Quality Issue 4: Remove every Retweet

#### Define
- Delete every retweet by keeping only the rows where the `retweeted_status_id` column is NaN

#### Code

In [None]:
# .copy() makes the filtered frame independent, so later column assignments do not warn
twitter_archive_clean = twitter_archive_clean[pd.isnull(twitter_archive_clean.retweeted_status_id)].copy()

#### Test

In [None]:
twitter_archive_clean.info()

### Quality Issue 5: Fix Errors in Dog Name Column

#### Define
- Use a regex to find entries that start with a lowercase letter (these are not real names) and replace them with None

#### Code

In [None]:
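# regex=True is required: since pandas 2.0, str.replace treats the pattern as a literal string by default
twitter_archive_clean.name = twitter_archive_clean.name.str.replace('^[a-z]+', 'None', regex=True)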
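
A small illustration of the regex replacement on invented names is given below. Since pandas 2.0, `Series.str.replace` defaults to `regex=False`, so the flag is passed explicitly here. The trailing `$` anchor is an addition in this sketch so that only entries made up entirely of lowercase letters are replaced.

```python
import pandas as pd

# Invented sample of the name column
names = pd.Series(['Charlie', 'a', 'the', 'None', 'Bo'])

# Replace entries made up entirely of lowercase letters with the 'None' placeholder
cleaned = names.str.replace(r'^[a-z]+$', 'None', regex=True)
print(cleaned.tolist())
```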

#### Test

In [None]:
twitter_archive_clean.name.value_counts()

### Quality Issue 6: Some records have multiple dog stages

#### Define
- Concatenate the four dog stage columns into one column
- Create a custom function that counts the values in that column and returns the dog stage

#### Code

In [None]:
twitter_archive_clean['all_stage'] = twitter_archive_clean.doggo + twitter_archive_clean.floofer + twitter_archive_clean.pupper + twitter_archive_clean.puppo
twitter_archive_clean['all_stage'].value_counts()

In [None]:
# Function to check the new all stage column for dog stages

def stages_check(archive):
    # all_stage joins four values; only two 'None' left means two stages are set
    if archive['all_stage'].count('None') == 2:
        return 'Multiple'  # more than one dog stage
    else:
        if archive['all_stage'].count('doggo') == 1:
            return 'Doggo'
        elif archive['all_stage'].count('floofer') == 1:
            return 'Floofer'
        elif archive['all_stage'].count('pupper') == 1:
            return 'Pupper'
        elif archive['all_stage'].count('puppo') == 1:
            return 'Puppo'
        else:
            return 'None'

twitter_archive_clean['dog_stages'] = twitter_archive_clean.apply(stages_check, axis=1)
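
The row-wise `apply` above can also be expressed with column operations. The sketch below is one possible vectorized variant and not the method used in this notebook. `combine_stages` is a hypothetical helper and the toy frame is invented, assuming each stage column holds either its own name or the string `'None'`.

```python
import pandas as pd

STAGE_COLS = ['doggo', 'floofer', 'pupper', 'puppo']


def combine_stages(df):
    """Return one label per row: a stage name, 'Multiple', or 'None'."""
    present = df[STAGE_COLS].ne('None')        # True where a stage is set
    n_stages = present.sum(axis=1)             # number of stages per row
    # Column name of the first stage that is set, capitalized
    first = present.astype(int).idxmax(axis=1).str.capitalize()
    # Keep the name only when exactly one stage is set
    label = first.where(n_stages == 1, 'None')
    # Rows with two or more stages become 'Multiple'
    return label.where(n_stages < 2, 'Multiple')


toy = pd.DataFrame({
    'doggo':   ['doggo', 'None', 'doggo', 'None'],
    'floofer': ['None', 'None', 'None', 'None'],
    'pupper':  ['None', 'None', 'pupper', 'None'],
    'puppo':   ['None', 'None', 'None', 'puppo'],
})
labels = combine_stages(toy)
print(labels.tolist())
```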

#### Test

In [None]:
twitter_archive_clean.dog_stages.value_counts()

### Quality Issue 7: Drop Unwanted columns

#### Define
- Use the `drop` method to remove the columns that are not needed for this analysis from the different tables

#### Code

In [None]:
twitter_archive_clean.columns

In [None]:
image_predictions_clean.columns

In [None]:
tweets_meta_count.columns

In [None]:
twitter_archive_clean.drop(['in_reply_to_status_id', 'in_reply_to_user_id','source',
                            'retweeted_status_id', 'retweeted_status_user_id','retweeted_status_timestamp',
                            'expanded_urls','doggo', 'floofer', 'pupper', 'puppo','all_stage'], axis = 1, inplace=True)

In [None]:
image_predictions_clean.drop(['img_num'], axis = 1, inplace=True)

#### Test

In [None]:
twitter_archive_clean.info()

In [None]:
image_predictions_clean.info()

### Tidiness Issue 1: Merge all 3 Dataframes to be one master Dataframe

#### Define
- Use Reduce to merge all dataframes on tweet_id

#### Code

In [None]:
# merge cannot join an int column with an object column, so convert tweets_meta_count.tweet_id from object to int

tweets_meta_count.tweet_id = pd.to_numeric(tweets_meta_count.tweet_id)

In [None]:
# https://stackoverflow.com/questions/44327999/python-pandas-merge-multiple-dataframes/44338256

from functools import reduce
merged_df = [twitter_archive_clean, image_predictions_clean, tweets_meta_count]
twitter_dogs = reduce(lambda left, right:  pd.merge(left, right, on = 'tweet_id'), merged_df)
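
`pd.merge` performs an inner join by default, so the reduction above keeps only the tweets that appear in all three dataframes. This is also what removes tweets without an image prediction. The sketch below shows the effect on three invented toy frames. The `validate='one_to_one'` argument is an optional addition that raises an error if a `tweet_id` is repeated in either side of a merge.

```python
from functools import reduce

import pandas as pd

# Invented toy frames; only tweet 2 is present in all three
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'name': ['Bo', 'Max', 'Lola']})
images = pd.DataFrame({'tweet_id': [2, 3], 'jpg_url': ['u2', 'u3']})
counts = pd.DataFrame({'tweet_id': [1, 2], 'favorite_count': [5, 7]})

merged = reduce(
    lambda left, right: pd.merge(left, right, on='tweet_id',
                                 how='inner', validate='one_to_one'),
    [archive, images, counts],
)
print(merged)
```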

#### Test

In [None]:
twitter_dogs.info()

### Tidiness Issue 2: Creating a dog breed column using data from image prediction

#### Define
- Use `iterrows` with the `.at` accessor to take the breed from the first prediction flagged as a dog (p1; otherwise p2 or p3, but only when the rating numerator is at least 10)

#### Code

In [None]:
# Create the column with bracket syntax; attribute assignment does not add a new column
twitter_dogs['dog_breed'] = 'None'

for i, row in twitter_dogs.iterrows():
    if row.p1_dog:
        twitter_dogs.at[i, 'dog_breed']= row.p1
    elif row.p2_dog and row.rating_numerator >= 10:
        twitter_dogs.at[i, 'dog_breed']= row.p2
    elif row.p3_dog and row.rating_numerator >= 10:
        twitter_dogs.at[i, 'dog_breed']= row.p3
    else:
        twitter_dogs.at[i, 'dog_breed']='None'
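
The same selection rule can be written without a Python loop by using `numpy.select`, which picks the first matching condition for each row. The sketch below mirrors the loop's conditions on an invented toy frame. It is an illustration of the technique, not the code this notebook runs.

```python
import numpy as np
import pandas as pd

# Invented predictions; column names follow the image prediction table
toy = pd.DataFrame({
    'p1': ['pug', 'sofa', 'lamp', 'desk'],
    'p1_dog': [True, False, False, False],
    'p2': ['beagle', 'collie', 'chow', 'vase'],
    'p2_dog': [True, True, True, False],
    'p3': ['husky', 'boxer', 'corgi', 'cup'],
    'p3_dog': [True, True, True, False],
    'rating_numerator': [12.0, 11.0, 8.0, 13.0],
})

high = toy.rating_numerator >= 10

# Conditions are checked in order; the first True one decides the breed
conditions = [toy.p1_dog, toy.p2_dog & high, toy.p3_dog & high]
choices = [toy.p1, toy.p2, toy.p3]
toy['dog_breed'] = np.select(conditions, choices, default='None')
print(toy.dog_breed.tolist())
```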

#### Test

In [None]:
twitter_dogs['dog_breed'].value_counts()

#### Define
- We no longer need the image prediction columns (or retweeted_status), so we drop them

#### Code

In [None]:
twitter_dogs.columns

In [None]:
twitter_dogs.drop(['p1', 'p1_conf','p1_dog', 'p2', 'p2_conf', 'p2_dog', 'p3','retweeted_status', 'p3_conf', 'p3_dog'],
            axis= 1, inplace=True)

#### Test

In [None]:
twitter_dogs.info()

### Quality Issue 8: Fix Erroneous Datatypes in most columns

#### Define
- Convert tweet id to string
- Convert source, dog stage, and dog breed to category

#### Code

In [None]:
twitter_dogs.tweet_id = twitter_dogs.tweet_id.astype(str)
twitter_dogs.source = twitter_dogs.source.astype("category")
twitter_dogs.dog_stages = twitter_dogs.dog_stages.astype("category")
twitter_dogs['dog_breed'] = twitter_dogs['dog_breed'].astype("category")

#### Test

In [None]:
twitter_dogs.info()

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

In [None]:
# Use pandas to save the twitter_dogs dataframe to a new csv file

twitter_dogs.to_csv('twitter_archive_master.csv',encoding='utf-8', index=False)

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

In [None]:
# Restore the intended datatypes while reading the CSV back in
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

twitter_archive_master = pd.read_csv('twitter_archive_master.csv', parse_dates=['timestamp'],
                                     dtype={'tweet_id': str, 'source': 'category', 'dog_stages': 'category',
                                            'dog_breed': 'category'})
twitter_archive_master.info()

### Insights:
1. Most popular Dog Breeds

2. Most used Twitter source for tweeting

3. Popular Dog Names

4. Correlation between Retweets, Favorites and Ratings

### Visualization

#### Most Popular Dog Breed

In [None]:
# Function to reduce code repetition

def plot_param(xlabel, ylabel, title):
    plt.xlabel(xlabel, fontsize=15)
    plt.ylabel(ylabel, fontsize=15)
    plt.title(title,fontsize= 20)
    plt.show()
    return None

In [None]:
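# [1:21] skips the most frequent value and keeps the next 20
twitter_archive_master.dog_breed.value_counts()[1:21].plot(kind='bar',figsize=(15,10), color='#808080')
# Breeds are on the x-axis of a vertical bar chart; counts are on the y-axis
plot_param('Dog Breeds','Count','Top 20 Most Rated Dog Breeds')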


#### Most used Twitter Source

In [None]:
twitter_archive_master.source.value_counts().plot(kind='bar',figsize=(15,10), color='#808080')
plt.xticks(rotation=0)
# Sources are on the x-axis; tweet counts are on the y-axis
plot_param('Source','Number of tweets','Most Used Twitter Sources by WeRateDogs')


#### Most Popular Dog Names

In [None]:
twitter_archive_master.name.value_counts()[1:21].plot(kind='bar', figsize=(15,10), color='#808080')
plt.xticks(rotation=0)
plot_param('Names','Number of Dogs','Most Common Dog Names')

#### Correlation between Retweets, Favorites and Ratings

In [None]:
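# Select the columns by name (positions 3, 8 and 9) so the choice survives column reordering
corelation = twitter_archive_master[['rating_numerator', 'favorite_count', 'retweet_count']]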

In [None]:
corelation.corr()
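
`DataFrame.corr` computes the Pearson correlation coefficient by default. As a sketch of what each cell of that table represents, the function below computes the coefficient for two plain Python sequences. `pearson` is a hypothetical helper written for illustration and the input numbers are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of the products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented counts: the second list is exactly twice the first
print(pearson([1, 2, 3], [2, 4, 6]))
```

A value near 1 means the two columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.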

In [None]:
# https://medium.com/@szabo.bibor/how-to-create-a-seaborn-correlation-heatmap-in-python-834c0686b88e

plt.figure(figsize=(15,10))
heatmap = sns.heatmap(corelation.corr(), vmin=-1, vmax=1, annot=True, linewidths=.5)
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':20})

In [None]:
# Scatter plot showing the relationship between favorite count and retweet count
# A single color is used; a two-color tuple does not match the number of points
corelation.plot(x='favorite_count', y='retweet_count', kind='scatter', figsize=(15,10), color='blue');
plot_param('Favorite Count','Retweet count','Favorite and Retweet')