# Introduction
- This dataset is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

In [None]:
%pip install tweepy

# Importing Libraries

In [None]:
import os 
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np 
import requests as rt 
import tweepy
import json 

# 1.0 Gathering Data 

### 1.1 Gather_from_csvfile 

In [None]:
#reading_data 
twitter_archive_enhanced = pd.read_csv("../input/weratedogs-twitter-archive/twitter-archive-enhanced.csv")

### 1.2 Gather_from_the Internet

In [None]:
# Create the download folder if it does not exist yet
folder_name = 'image_predictions_file'
os.makedirs(folder_name, exist_ok=True)

url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
response = rt.get(url)
response.raise_for_status()  # stop here if the download failed

In [None]:
# Save the downloaded content under its original file name
with open(os.path.join(folder_name,
                      url.split('/')[-1]) , mode='wb') as file:
    file.write(response.content)
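
The download-and-save step above can be wrapped in small helpers so the file name is derived from the URL in one place. This is a minimal sketch, not the notebook's own code: `filename_from_url` and `save_bytes` are hypothetical names, and `fake_content` is a stand-in for `response.content` so the example runs without network access.

```python
import os
import tempfile


def filename_from_url(url):
    """Return the last path segment of a URL."""
    return url.rstrip('/').split('/')[-1]


def save_bytes(content, folder, url):
    """Write raw bytes to folder/<file name taken from url> and return the path."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, filename_from_url(url))
    with open(path, mode='wb') as file:
        file.write(content)
    return path


url = ("https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
       "599fd2ad_image-predictions/image-predictions.tsv")

# Stand-in for response.content (assumption: any bytes object works here).
fake_content = b"tweet_id\tjpg_url\n"
saved_path = save_bytes(fake_content, tempfile.mkdtemp(), url)
print(os.path.basename(saved_path))
```

With a real request, calling `response.raise_for_status()` before saving avoids writing an HTTP error page to disk.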

In [None]:
#reading file 
image_predictions = pd.read_csv('../input/weratedogs-twitter-archive/image_predictions.tsv' , sep = "\t")

### 1.3 Gather_from_API

In [None]:
# This cell needs Twitter API credentials, so it is not run here (do not run this cell).
# The data from this API is loaded from a saved JSON file instead.
"""
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer

# Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
# These are hidden to comply with Twitter's API terms and conditions
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

# NOTE TO STUDENT WITH MOBILE VERIFICATION ISSUES:
# df_1 is a DataFrame with the twitter_archive_enhanced.csv file. You may have to
# change line 17 to match the name of your DataFrame with twitter_archive_enhanced.csv
# NOTE TO REVIEWER: this student had mobile verification issues so the following
# Twitter API code was sent to this student from a Udacity instructor
# Tweet IDs for which to gather additional data via Twitter's API
tweet_ids = df_1.tweet_id.values
len(tweet_ids)

# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:  # renamed to tweepy.TweepyException in tweepy 4.0+
            print("Fail")
            fails_dict[tweet_id] = e
end = timer()
print(end - start)
print(fails_dict)
"""

### 1.4 Gather_from_json-file

In [None]:
# Read the JSON file line by line; each line holds one tweet
df_list = []

with open('../input/weratedogs-twitter-archive/tweet_json.txt') as file:
    for line in file:
        df_list.append(json.loads(line))

In [None]:
# Build a DataFrame that keeps only the columns of interest
tweet_data = pd.DataFrame(df_list , columns = ['id' , 'retweet_count' , 'favorite_count'])
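
The file read above stores one JSON object per line, and passing `columns` to `pd.DataFrame` keeps only the named keys. The round trip can be sketched as follows; the records in `sample_tweets` are hypothetical and an in-memory buffer stands in for `tweet_json.txt`.

```python
import io
import json

import pandas as pd

# Hypothetical records standing in for the tweet JSON returned by the API.
sample_tweets = [
    {"id": 1, "retweet_count": 10, "favorite_count": 25, "lang": "en"},
    {"id": 2, "retweet_count": 3, "favorite_count": 7, "lang": "en"},
]

# Write one JSON object per line, as the API cell does with json.dump + '\n'.
buffer = io.StringIO()
for tweet in sample_tweets:
    json.dump(tweet, buffer)
    buffer.write('\n')

# Read the lines back, skipping blank ones, and keep three columns only.
buffer.seek(0)
records = [json.loads(line) for line in buffer if line.strip()]
sample_df = pd.DataFrame(records, columns=['id', 'retweet_count', 'favorite_count'])
print(sample_df.shape)
```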

# 2.0 Assessing_Data 

### 2.1 Assess_'twitter-archive-data'

In [None]:
twitter_archive_enhanced.head()

In [None]:
twitter_archive_enhanced.sample(10)

In [None]:
twitter_archive_enhanced.info()

In [None]:
twitter_archive_enhanced.name.value_counts()

In [None]:
twitter_archive_enhanced.text[1]

In [None]:
twitter_archive_enhanced.expanded_urls.value_counts()

In [None]:
type(twitter_archive_enhanced.timestamp[0])

In [None]:
twitter_archive_enhanced.duplicated().sum()

In [None]:
twitter_archive_enhanced.rating_denominator.value_counts()

In [None]:
twitter_archive_enhanced.rating_numerator.value_counts()

### 2.2 Assess_image-predictions

In [None]:
image_predictions.head()

In [None]:
image_predictions.sample(10)

In [None]:
image_predictions.info()

In [None]:
type(image_predictions.jpg_url[0])

In [None]:
image_predictions.duplicated().sum()

### 2.3 Assess_tweet-data

In [None]:
tweet_data.head()

In [None]:
tweet_data.sample(10)

In [None]:
tweet_data.info()

In [None]:
tweet_data.duplicated().sum()

## 1) Quality
 
#### A- twitter_archive_enhanced
- tweet_id should be "string" not "int"
- some values in the rating_denominator column are not 10
- some values in the rating_numerator column are less than 10
- some values in the rating_numerator column are zero
- timestamp should be "datetime" not "str"
- retweeted_status_id should be removed because we are only interested in original tweets
- retweeted_status_user_id should be removed because we are only interested in original tweets
- retweeted_status_timestamp should be removed because we are only interested in original tweets
- Nulls are represented as the string "None" in the name column
- some expanded_urls rows hold 2 URLs and we only need the Twitter link
- 59 missing values in the expanded_urls column
- some values in the rating_numerator column are unusually high
- strange names like ('a' , 'an') in the name column

#### B- image-predictions
- tweet_id should be "string" not "int" 
- values in the p1, p2 and p3 columns mix upper-case and lower-case first letters
- some tweet IDs are missing: there are 2075 rows instead of 2353

#### C- tweet_data
- id column name should be "tweet_id" instead of "id" 
- tweet_id should be "string" not "int" 
- missing IDs (2075, 2324)

### Total Quality issues : 19
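
Several of the quality issues listed above can be counted programmatically rather than read off `value_counts()` output. The following is a rough sketch under stated assumptions: `toy_archive` is a hypothetical four-row miniature of the archive, and the check names in `checks` are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature of the archive; values are illustrative only.
toy_archive = pd.DataFrame({
    'rating_numerator': [13, 0, 9, 1776],
    'rating_denominator': [10, 10, 70, 10],
    'name': ['Phineas', 'None', 'a', 'Tilly'],
    'expanded_urls': ['https://twitter.com/x/1', None,
                      'https://twitter.com/x/3', None],
})

# Each check counts the rows that show one of the listed issues.
checks = {
    'denominator_not_10': int((toy_archive['rating_denominator'] != 10).sum()),
    'numerator_below_10': int((toy_archive['rating_numerator'] < 10).sum()),
    'numerator_zero': int((toy_archive['rating_numerator'] == 0).sum()),
    'name_is_None_string': int((toy_archive['name'] == 'None').sum()),
    'name_all_lowercase': int(toy_archive['name'].str.islower().sum()),
    'missing_expanded_urls': int(toy_archive['expanded_urls'].isnull().sum()),
}
print(checks)
```

Running the same expressions against `twitter_archive_enhanced` would give the real counts.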

## 2) Tidiness

#### A- twitter_archive_enhanced
- doggo, floofer, pupper, and puppo should be in one column not 4 

#### B- image-predictions
- the "conf" abbreviation in the column names (p1_conf, p2_conf, p3_conf) should be spelled out

#### C- tweet_data
- no issues

#### For A , B and C 
- combining the three dataframes in one dataframe 

### Total Tidiness issues : 3

## Making copies of all datasets

In [None]:
clean_twitter_archive = twitter_archive_enhanced.copy()
clean_image_predictions = image_predictions.copy()
clean_tweet_data = tweet_data.copy()

# 3.0 Cleaning_Data

## issue_1
- Quality : id column name should be "tweet_id" instead of "id"

#### .Define
- rename the "id" column to "tweet_id" using the rename method

#### .Code 

In [None]:
clean_tweet_data.rename(columns={'id': 'tweet_id'}, inplace=True)

#### .Test

In [None]:
clean_tweet_data.head()

## issue_2
- tidiness : combining the three dataframes in one dataframe 

#### .Define
- merging tables with merge method 

#### .Code 

In [None]:
clean_twitter_archive = pd.merge(clean_twitter_archive , clean_image_predictions , on= "tweet_id" , how="left") 
clean_twitter_archive = pd.merge(clean_twitter_archive , clean_tweet_data , on= "tweet_id" , how="left")
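
A left merge keeps every archive row and fills the other table's columns with NaN where no match exists. One way to see which rows went unmatched is the `indicator` option of `pd.merge`, sketched below on hypothetical three-row tables (`archive` and `predictions` are illustrative names and values, not the real data). The `validate` option additionally raises an error if `tweet_id` is duplicated on either side.

```python
import pandas as pd

# Hypothetical miniature tables; IDs and values are illustrative only.
archive = pd.DataFrame({'tweet_id': [1, 2, 3],
                        'text': ['a dog', 'a pupper', 'a doggo']})
predictions = pd.DataFrame({'tweet_id': [1, 3], 'p1': ['pug', 'corgi']})

merged = pd.merge(archive, predictions, on='tweet_id', how='left',
                  validate='one_to_one', indicator=True)

# '_merge' is 'both' where a prediction exists and 'left_only' where it does not.
unmatched_ids = merged.loc[merged['_merge'] == 'left_only', 'tweet_id'].tolist()
print(unmatched_ids)  # [2]
```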

#### .Test

In [None]:
clean_twitter_archive

## issue_3
- Quality :  tweet_id should be "string" not "int"

#### .Define 
- convert the column type with the astype method

#### .Code 

In [None]:
clean_twitter_archive['tweet_id'] = clean_twitter_archive['tweet_id'].astype(str)

#### .Test

In [None]:
type(clean_twitter_archive['tweet_id'][0])

## issue_4
- tidiness : doggo, floofer, pupper, and puppo should be in one column not 4 

#### .Define
- extract the dog stage from the text into a new column 'Dog_stage', then drop the four original columns

#### .Code 

In [None]:
clean_twitter_archive['Dog_stage'] = clean_twitter_archive['text'].str.extract(r'(doggo|floofer|pupper|puppo)', expand=False)
clean_twitter_archive = clean_twitter_archive.drop(columns=['doggo','floofer','pupper','puppo'])
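
Note that `str.extract` returns only the first match, so a tweet that mentions two stages is recorded under the first one. The sketch below, on hypothetical tweet texts, shows the extraction and one possible way to spot multi-stage tweets with `str.findall`; the multi-stage check is an addition for illustration and not part of the cleaning step above.

```python
import pandas as pd

# Hypothetical tweet texts, for illustration only.
texts = pd.Series([
    'This is Bo. He is a doggo.',
    'Meet Fig, a very fluffy pupper.',
    'Here we have a doggo and a pupper.',
    'No stage mentioned here.',
])

pattern = r'(doggo|floofer|pupper|puppo)'

# expand=False returns a Series; rows without a match become NaN.
first_stage = texts.str.extract(pattern, expand=False)

# findall keeps every mention, so multi-stage tweets can be flagged.
all_stages = texts.str.findall(pattern)
multi_stage = all_stages.str.len() > 1

print(first_stage.tolist())
print(multi_stage.tolist())
```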

#### .Test

In [None]:
clean_twitter_archive['Dog_stage'].value_counts()

## issue_5
- Quality : some values in the rating_denominator column are not 10

#### .Define
- replace every denominator value in the range 0 to 999 with 10 using the replace method

#### .Code 

In [None]:
clean_twitter_archive.rating_denominator = clean_twitter_archive.rating_denominator.replace( range(0,1000) , 10 )     

#### .Test

In [None]:
clean_twitter_archive.rating_denominator.value_counts()

## issues_6,7
- Quality : some values in the rating_numerator column are less than 10
- Quality : some values in the rating_numerator column are zero

#### .Define
- add 10 to these values to correct them

#### .Code 

In [None]:
# Use a single .loc assignment instead of chained indexing; note that <= 10 also shifts ratings of exactly 10
low_ratings = clean_twitter_archive['rating_numerator'] <= 10
clean_twitter_archive.loc[low_ratings, 'rating_numerator'] += 10
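
Assigning through chained indexing (`df.col[mask] = ...`) can trigger pandas' `SettingWithCopyWarning` and may write to a temporary copy. A masked update through `.loc` avoids this. A minimal sketch on hypothetical ratings:

```python
import pandas as pd

# Hypothetical ratings, for illustration only.
ratings = pd.DataFrame({'rating_numerator': [0, 7, 10, 12, 13]})

# Build the boolean mask once, then assign through .loc in a single step.
low = ratings['rating_numerator'] <= 10
ratings.loc[low, 'rating_numerator'] += 10

print(ratings['rating_numerator'].tolist())  # [10, 17, 20, 12, 13]
```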

#### .Test

In [None]:
clean_twitter_archive.rating_numerator.value_counts()

In [None]:
clean_twitter_archive.query('rating_numerator <= 10 ').rating_numerator.sum()

## issue_8
- Quality : timestamp should be "datetime" not "str"

#### .Define
- convert the column type using to_datetime method

#### .Code 

In [None]:
clean_twitter_archive['timestamp']= pd.to_datetime(clean_twitter_archive['timestamp'])
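
Once the column holds real timestamps, date parts become available through the `.dt` accessor. The sketch below assumes the strings carry a `+0000` UTC offset; the two sample values in `raw` are hypothetical.

```python
import pandas as pd

# Hypothetical timestamps; the '+0000' offset style is an assumption.
raw = pd.Series(['2017-08-01 16:23:56 +0000', '2015-11-15 22:32:08 +0000'])
parsed = pd.to_datetime(raw)

# Components such as year or weekday can now be read directly.
years = parsed.dt.year.tolist()
weekdays = parsed.dt.day_name().tolist()
print(years)
print(weekdays)
```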

#### .Test

In [None]:
type(clean_twitter_archive['timestamp'][0])

## issues_9,10,11
- Quality :
  - retweeted_status_id should be removed
  - retweeted_status_user_id should be removed
  - retweeted_status_timestamp should be removed

#### .Define
- remove these columns using the drop method

#### .Code 

In [None]:
clean_twitter_archive = clean_twitter_archive.drop(columns=['retweeted_status_id' , 'retweeted_status_user_id' , 'retweeted_status_timestamp'])

#### .Test

In [None]:
clean_twitter_archive.info()

## issue_12
- Quality : Nulls are represented as the string "None" in the name column

#### .Define
- replace these values with nulls using the replace method and numpy's nan

#### .Code 

In [None]:
# Replace on the cleaned copy rather than on the original dataframe
clean_twitter_archive['name'] = clean_twitter_archive['name'].replace("None", np.nan)
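
The assessment also lists strange names such as 'a' and 'an', which the replacement above does not touch. One possible extension, sketched here on hypothetical values, treats all-lowercase words as missing too. The rule "lowercase means not a name" is an assumption, not something the notebook establishes.

```python
import numpy as np
import pandas as pd

# Hypothetical name values, for illustration only.
names = pd.Series(['Phineas', 'None', 'a', 'an', 'Tilly'])

# Step 1: the literal string "None" becomes a real missing value.
cleaned = names.replace('None', np.nan)

# Step 2 (assumed rule): all-lowercase words such as 'a' and 'an' are not
# names; na=False makes already-missing rows evaluate to False.
not_a_name = cleaned.str.match(r'^[a-z]+$', na=False)
cleaned = cleaned.mask(not_a_name)

print(cleaned.isnull().sum())  # 3
```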

#### .Test

In [None]:
clean_twitter_archive.name.isnull().sum()

## issue_13
- Quality : values in the p1, p2 and p3 columns mix upper-case and lower-case first letters

#### .Define
- convert the values to title case with the str.title method

#### .Code 

In [None]:
clean_twitter_archive.p1 = clean_twitter_archive.p1.str.title()
clean_twitter_archive.p2 = clean_twitter_archive.p2.str.title()
clean_twitter_archive.p3 = clean_twitter_archive.p3.str.title()

#### .Test


In [None]:
clean_twitter_archive.head()

## issue_14
- Tidiness : the "conf" abbreviation in the column names (p1_conf, p2_conf, p3_conf) should be spelled out

#### .Define
- edit the column names using the rename method

#### .Code 


In [None]:
clean_twitter_archive.rename(columns={'p1_conf': 'p1_confident',
                                      'p2_conf': 'p2_confident',
                                      'p3_conf': 'p3_confident'}, inplace=True)

#### .Test


In [None]:
clean_twitter_archive.head()

# 3.0 Storing_Data

In [None]:
# Save next to the notebook instead of a machine-specific absolute path
clean_twitter_archive.to_csv('twitter_archive_master.csv', index=False)
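
CSV files do not store column types, so the string conversion of `tweet_id` is lost when the file is read back unless a `dtype` is given. A small round-trip sketch, using an in-memory buffer and a hypothetical one-row frame:

```python
import io

import pandas as pd

# Hypothetical 18-digit ID, for illustration only.
master = pd.DataFrame({'tweet_id': ['892420643555336193'],
                       'rating_numerator': [13]})

buffer = io.StringIO()
master.to_csv(buffer, index=False)

# Without a dtype hint the ID column is inferred as an integer again.
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Passing dtype keeps the IDs as text.
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={'tweet_id': str})

print(inferred['tweet_id'].dtype, as_text['tweet_id'].dtype)
```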

# 4.0 Analyzing, and Visualizing_Data

In [None]:
clean_twitter_archive

### 4.1 Dog_stage ratios

In [None]:
stage_counts = clean_twitter_archive['Dog_stage'].value_counts()
plt.figure(figsize=(8,5))
plt.title("Number of tweets per dog stage", fontsize=14)
plt.bar(x=stage_counts.index, height=stage_counts.values)

## insights 
- pupper is the most common stage
- floofer is the least common stage

### 4.2 The success rate of the algorithms

In [None]:
clean_twitter_archive['p1_dog'].value_counts()

In [None]:
# Counts copied from the value_counts() output above: 1532 True, 543 False
total = (1532+543)
sucsses_p1 = (1532/total)*100
sucsses_p1
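
The percentage above relies on counts typed in by hand. It could instead be computed from the column itself, as in this sketch. The `p1_dog` series here is hypothetical, with `None` mimicking rows that the left merge left without a prediction.

```python
import pandas as pd

# Hypothetical flags; None mimics rows left unmatched by the left merge.
p1_dog = pd.Series([True, True, False, True, None], dtype=object)

# Drop missing rows first; the mean of booleans is then the share of True.
flags = p1_dog.dropna().astype(bool)
success_rate = flags.mean() * 100

print(round(success_rate, 1))  # 75.0
```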

In [None]:
p1_counts = clean_twitter_archive['p1_dog'].value_counts()
plt.figure(figsize=(8,5))
plt.title("p1_dog: number of True and False predictions", fontsize=14)
plt.bar(x=p1_counts.index.astype(str), height=p1_counts.values)

In [None]:
clean_twitter_archive['p2_dog'].value_counts()

In [None]:
# Counts copied from the value_counts() output above: 1553 True, 522 False
total = (1553+522)
sucsses_p2 = (1553/total)*100
sucsses_p2

In [None]:
p2_counts = clean_twitter_archive['p2_dog'].value_counts()
plt.figure(figsize=(8,5))
plt.title("p2_dog: number of True and False predictions", fontsize=14)
plt.bar(x=p2_counts.index.astype(str), height=p2_counts.values)

In [None]:
clean_twitter_archive['p3_dog'].value_counts()

In [None]:
# Counts copied from the value_counts() output above: 1499 True, 576 False
total = (1499+576)
sucsses_p3 = (1499/total)*100
sucsses_p3

In [None]:
p3_counts = clean_twitter_archive['p3_dog'].value_counts()
plt.figure(figsize=(8,5))
plt.title("p3_dog: number of True and False predictions", fontsize=14)
plt.bar(x=p3_counts.index.astype(str), height=p3_counts.values)

In [None]:
plt.pie([sucsses_p1,sucsses_p2,sucsses_p3],labels=('p1','p2','p3'), explode = [0.1,0.2,0.1],autopct ='%1.1f%%')
plt.show()

## insights 
- p2_algorithm is the most successful algorithm
- p3_algorithm is the least successful algorithm