# Python Script
## Project - Write script for fetching tweets from any Twitter handle

In this notebook, I fetch tweets from a Twitter handle and dump them into a *JSON* file. I then show the most useful information about each tweet in tabular format:

- The text of the tweet.
- Date and time of the tweet.
- The number of favorites/likes.
- The number of retweets.
- The number of images in the tweet, or None if the tweet has no image.

I will be using **tweepy**, an easy-to-use Python library for accessing the Twitter API.

**Note** - I have left some mistakes as they are, to show what kinds of logic or syntax errors you may face while doing this project.

### Getting the required keys 
The first thing we need is the consumer key, consumer secret, access token and access token secret from the Twitter developer portal. The API uses these keys for authentication.
Complete the following steps to get the keys:
- Log in to the Twitter developer section.
- Go to “Create an App”.
- Fill in the details of the application.
- You will be provided with a consumer key and consumer secret.
- Click the button named "Create my access token" to generate your access token.

To access the Twitter API we will use `OAuth`, which provides the authentication needed for fetching data. We create an `OAuthHandler` instance, pass it our consumer key and secret, and then set the access token on it. For more information about `OAuth`, read the [Documentation](https://pythonhosted.org/tweepy/auth_tutorial.html).

In [1]:
import tweepy

# Variables containing keys and tokens used to access Twitter's API.
# Replace each placeholder string with your own credential.

consumer_API_key = "YOUR_CONSUMER_API_KEY"
consumer_API_secret_key = "YOUR_CONSUMER_API_SECRET_KEY"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"

# Creating an object of class 'tweepy.OAuthHandler'
ath = tweepy.OAuthHandler(consumer_API_key, consumer_API_secret_key)
ath.set_access_token(access_token, access_token_secret)

# We can send this object equipped with keys and tokens to tweepy.API.
api = tweepy.API(ath)
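
Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch, assuming the four credentials have been exported as environment variables, they could be read like this. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_twitter_credentials` are hypothetical, chosen only for illustration.

```python
import os

# Hypothetical environment variable names; use whatever names you exported.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_twitter_credentials(environ=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

The returned values can then be passed to `tweepy.OAuthHandler` and `set_access_token` in place of literal strings.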

## Fetching tweets and dumping them to a **.json** file
Now we can use this **api** variable to perform operations on Twitter. We will fetch the tweets using [Cursor](http://docs.tweepy.org/en/v3.5.0/cursor_tutorial.html), a class in the **tweepy** library that handles pagination. `Cursor` is given two things: [api.user_timeline](http://docs.tweepy.org/en/v3.5.0/api.html#API.user_timeline) (the method that returns statuses, i.e. tweets) and the `id` (Twitter handle) that is passed on to that method.
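
Conceptually, a cursor hides paging: the API returns tweets one page at a time, and the cursor keeps requesting pages until none are left while yielding items one by one. The toy generator below only illustrates that idea with a made-up `fetch_page` function and fake data; it is not tweepy's implementation and makes no network calls.

```python
# Fake timeline of ten tweet ids, newest first (made up for illustration)
FAKE_TIMELINE = list(range(10, 0, -1))


def fetch_page(max_id=None, count=4):
    """Hypothetical stand-in for one API call: up to `count` ids <= max_id."""
    eligible = [t for t in FAKE_TIMELINE if max_id is None or t <= max_id]
    return eligible[:count]


def iterate_items(count=4):
    """Yield every item by requesting page after page until a page is empty."""
    max_id = None
    while True:
        page = fetch_page(max_id=max_id, count=count)
        if not page:
            return
        for item in page:
            yield item
        # Ask for items strictly older than the last one seen
        max_id = page[-1] - 1


all_items = list(iterate_items())
print(len(all_items))
```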


### This block contains a logical error
Here I fetched the tweets and dumped each one into the **json** file directly from the for loop. The error is that the **json** objects are written one after another with no `,` between them and no enclosing list, so the file as a whole is not valid JSON. I corrected this by first collecting all the data in a list and then dumping the list into the **json** file.

In [2]:
import json
with open('data.json', 'w') as outfile:
    for tweets in tweepy.Cursor(api.user_timeline, id="midasIIITD").items():
        json.dump(tweets._json, outfile)
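
To see why a file written this way cannot be parsed, here is a small sketch that reproduces the problem in memory. The two sample records are made up for illustration and stand in for `tweets._json`.

```python
import io
import json

# Made-up records standing in for tweets._json
records = [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]

# Dumping inside the loop writes the objects back to back
buffer = io.StringIO()
for record in records:
    json.dump(record, buffer)
concatenated = buffer.getvalue()
print(concatenated)

# Parsing the result fails because it is not a single JSON document
try:
    json.loads(concatenated)
    parsed_ok = True
except json.JSONDecodeError as err:
    parsed_ok = False
    print("Invalid JSON:", err.msg)

# Dumping the whole list at once produces a valid JSON array
as_array = json.dumps(records)
print(json.loads(as_array) == records)
```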

### Corrected code
Here we make a list named **tw**, loop over the fetched tweets and append each one to the list. Dumping the list as a whole writes a JSON array, so the `,` separators between the **json** objects are added automatically. The call also passes `tweet_mode='extended'`, which returns the untruncated text in the `full_text` field.

In [3]:
import json
# Making a list for storing tweets
tw = []

# Iterating over the timeline to store each tweet in the list
for tweet in tweepy.Cursor(api.user_timeline, id="midasIIITD", tweet_mode='extended').items():
    tw.append(tweet._json)

# Printing one json object to check everything is good
print(json.dumps(tw[1]))

# Dumping the list into a '.json' file
with open('data1.json', 'w') as outfile:
    json.dump(tw, outfile)

{"created_at": "Sun Mar 24 18:44:01 +0000 2019", "id": 1109888617302753280, "id_str": "1109888617302753280", "full_text": "The last date for submitting a solution for the @midasIIITD internship task is 26th March midnight. We will not accept solutions submitted after the deadline. \nThus, if you have not submitted your solution yet then kindly do so before the deadline. \n#Summer #Research #Internship", "truncated": false, "display_text_range": [0, 279], "entities": {"hashtags": [{"text": "Summer", "indices": [250, 257]}, {"text": "Research", "indices": [258, 267]}, {"text": "Internship", "indices": [268, 279]}], "symbols": [], "user_mentions": [{"screen_name": "midasIIITD", "name": "MIDAS IIITD", "id": 1021355762575073281, "id_str": "1021355762575073281", "indices": [48, 59]}], "urls": []}, "source": "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>", "in_reply_to_status_id": null, "in_reply_to_status_id_str": null, "in_reply_to_user_id": null, "in_reply_to_user_
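
An alternative to collecting everything in a list is the JSON Lines layout, where each object is written on its own line and the file is read back one line at a time. The following is a minimal sketch of that layout using made-up records and an in-memory buffer; it is not what this notebook does, which stores a single JSON array in `data1.json`.

```python
import io
import json

# Made-up records standing in for the tweet objects
records = [{"id": 1, "full_text": "first"}, {"id": 2, "full_text": "second"}]

# Write one JSON object per line
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + "\n")

# Read the objects back, one line at a time
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]
print(loaded == records)
```

pandas can read a file in this layout with `read_json(..., lines=True)`.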

## Parsing the JSON file and displaying data in tabular format
Now we will parse the json objects and use the [pandas](https://pandas.pydata.org/pandas-docs/stable/) library to display the data in tabular format. To do so, we convert the json objects into a pandas DataFrame using the [read_json](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html) function, which converts a JSON string to a pandas object.

After converting, I look over the dataframe and perform some operations on it to get a better understanding. Some of the cells have errors; I have marked them by putting the word *Error* above them. I have removed many of the errors to keep the notebook tidy.

In [4]:
import pandas
df = pandas.read_json('data1.json')
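
As a small illustration of what `read_json` does with a list of objects, the sketch below loads two made-up records from an in-memory buffer instead of `data1.json`. Each key becomes a column, and a key that is missing from a record shows up as `NaN`.

```python
import io
import json

import pandas

# Made-up records; the second one has no 'extended_entities' key
sample = [
    {"full_text": "tweet with image", "favorite_count": 8,
     "extended_entities": {"media": [{"id": 1}, {"id": 2}]}},
    {"full_text": "tweet without image", "favorite_count": 3},
]

sample_df = pandas.read_json(io.StringIO(json.dumps(sample)))
print(sample_df.columns.tolist())
print(sample_df["extended_entities"].isna().tolist())
```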

In [5]:
df[:5]

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,...,quoted_status,quoted_status_id,quoted_status_id_str,quoted_status_permalink,retweet_count,retweeted,retweeted_status,source,truncated,user
0,,,2019-03-25 13:01:57,"[0, 212]","{'hashtags': [{'text': 'MIDAS', 'indices': [16...","{'media': [{'id': 1110164861739163649, 'id_str...",8,False,Congratulations @midasIIITD students Simra Sha...,,...,,,,,1,False,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",False,"{'id': 1021355762575073281, 'id_str': '1021355..."
1,,,2019-03-24 18:44:01,"[0, 279]","{'hashtags': [{'text': 'Summer', 'indices': [2...",,8,False,The last date for submitting a solution for th...,,...,,,,,3,False,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",False,"{'id': 1021355762575073281, 'id_str': '1021355..."
2,,,2019-03-24 18:26:02,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...",,0,False,RT @IIITDelhi: @IIITDelhi invites application ...,,...,,,,,4,False,{'created_at': 'Mon Mar 18 06:42:56 +0000 2019...,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",False,"{'id': 1021355762575073281, 'id_str': '1021355..."
3,,,2019-03-24 11:34:27,"[0, 212]","{'hashtags': [], 'symbols': [], 'user_mentions...",,4,False,One more week is left to submit the workshop p...,,...,,,,,0,False,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",False,"{'id': 1021355762575073281, 'id_str': '1021355..."
4,,,2019-03-24 06:23:37,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...",,0,False,RT @IEEEBigMM19: We are honored to have Dr. Ch...,,...,,,,,5,False,{'created_at': 'Sat Mar 23 05:17:50 +0000 2019...,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",False,"{'id': 1021355762575073281, 'id_str': '1021355..."


In [6]:
data = df[['full_text',
   'created_at',
   'favorite_count',
   'retweet_count',
   'extended_entities']]
data[:5]

Unnamed: 0,full_text,created_at,favorite_count,retweet_count,extended_entities
0,Congratulations @midasIIITD students Simra Sha...,2019-03-25 13:01:57,8,1,"{'media': [{'id': 1110164861739163649, 'id_str..."
1,The last date for submitting a solution for th...,2019-03-24 18:44:01,8,3,
2,RT @IIITDelhi: @IIITDelhi invites application ...,2019-03-24 18:26:02,0,4,
3,One more week is left to submit the workshop p...,2019-03-24 11:34:27,4,0,
4,RT @IEEEBigMM19: We are honored to have Dr. Ch...,2019-03-24 06:23:37,0,5,


In [7]:
# Randomly picked one post to check how many images it contains
for i in range(len(df)):
    if df.iloc[i]['full_text'].startswith('At @midasIIITD, we not only'):
        print(i)
        break

30


In [8]:
# Verifying the number of images
len(df.iloc[30]['extended_entities']['media'])

3

### Error
I tried to print the number of images in every tweet, but found that tweets with no images have *NaN* in their 'extended_entities' key. We cannot perform the following operation on them because *NaN* is a float and has no `media` key inside it.

In [9]:
for i in range(len(df)):
    print(len(df.iloc[i]['extended_entities']['media']))

2


TypeError: 'float' object is not subscriptable
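
The failure above can be reproduced without the dataframe: `NaN` is a float, and indexing a float raises the same `TypeError`. One possible guard, sketched below with a hypothetical helper `count_media`, is to check that the value is a dict before looking up `media`.

```python
import math


def count_media(extended_entities):
    """Return the number of media items, or None if the value is not a dict (e.g. NaN)."""
    if isinstance(extended_entities, dict):
        return len(extended_entities.get("media", []))
    return None


# Made-up values: one tweet with two images, one with NaN in the column
with_images = {"media": [{"id": 1}, {"id": 2}]}
without_images = math.nan

print(count_media(with_images))
print(count_media(without_images))

# Indexing NaN directly reproduces the error from the cell above
try:
    without_images["media"]
except TypeError as err:
    print("TypeError:", err)
```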

In [10]:
# Trying out something, but it didn't work (logical error):
# NaN is replaced by the string '{media[]}', which is still not a dict with a 'media' key
import numpy as np
df1 = df["extended_entities"].replace(np.nan, '{media[]}', regex = True)
df1.to_frame(name=None)[:10]

Unnamed: 0,extended_entities
0,"{'media': [{'id': 1110164861739163649, 'id_str..."
1,{media[]}
2,{media[]}
3,{media[]}
4,{media[]}
5,{media[]}
6,{media[]}
7,{media[]}
8,{media[]}
9,{media[]}


## Calculating number of images
Now we will calculate the number of images present in a tweet. After looking over the JSON objects to find the key that holds a tweet's image content, I found that details about images are stored in the key named `extended_entities`. It contains a `media` key holding a list with one entry per image, each entry carrying information about that image. More about this is available at this [link](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/extended-entities-object.html).

From the cells above, I learned that I need to handle the cells holding `NaN`. So I built a boolean Series that is `True` if media is present in the tweet and `False` otherwise, using the [notna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.notna.html) function of pandas on the `extended_entities` column.

Then I created a list that stores the number of images, or the string `"None"` if no image is present in the tweet, and looped over the dataframe to fill in the count of images for each tweet.

In [11]:
boolList = df['extended_entities'].notna()
linkCount = []
print('length of Boolean list :', len(boolList))
for i in range(len(df)):
    if boolList[i]:
        # operations to perform if the tweet contains an image
        linkCount.append(len(df.iloc[i]['extended_entities']['media']))
    else:
        # operations to perform if the tweet doesn't contain an image
        linkCount.append("None")

print(linkCount)

length of Boolean list : 302
[2, 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 1, 'None', 'None', 1, 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 1, 'None', 'None', 3, 'None', 'None', 'None', 'None', 1, 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 1, 2, 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 1, 1, 'None', 1, 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 1, 'None', 'None', 4, 'None', 1, 'None', 'None', 'None', 'None', 'None', 'None', 1, 1, 'None', 'None', 'None', 'None', 'None', 'None', 1, 2, 'None', 'None', 'None', 'None', 'None', 'None', 1, 'None', 'None', 'None', 'None', 'None', 3, 4, 'None', 'None', 'None', 'None', 'None', 'None', 1, 'None', 'None', 'None', 'None', 'None', 2, 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None
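
The same counts can also be computed without an explicit index loop. The sketch below is one possible alternative, not the approach used in this notebook: it walks the `extended_entities` column directly and stores the result in a nullable integer column (`Int64`), so tweets without images show `<NA>` rather than the string `"None"`. The sample dataframe is made up for illustration.

```python
import math

import pandas

# Made-up dataframe: the second tweet has NaN in 'extended_entities'
sample_df = pandas.DataFrame({
    "full_text": ["two images", "no image", "one image"],
    "extended_entities": [
        {"media": [{"id": 1}, {"id": 2}]},
        math.nan,
        {"media": [{"id": 3}]},
    ],
})

# Count media entries where the cell is a dict, otherwise record a missing value
counts = [
    len(value["media"]) if isinstance(value, dict) else None
    for value in sample_df["extended_entities"]
]

sample_df["image_count"] = pandas.Series(counts, dtype="Int64", index=sample_df.index)
print(sample_df[["full_text", "image_count"]])
```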

## Displaying the useful content of tweets

In [12]:
# Adding the image count column
df['image_count'] = linkCount

# Displaying the final data in tabular form
df[['full_text','created_at','favorite_count','retweet_count','image_count']][:20]

Unnamed: 0,full_text,created_at,favorite_count,retweet_count,image_count
0,Congratulations @midasIIITD students Simra Sha...,2019-03-25 13:01:57,8,1,2.0
1,The last date for submitting a solution for th...,2019-03-24 18:44:01,8,3,
2,RT @IIITDelhi: @IIITDelhi invites application ...,2019-03-24 18:26:02,0,4,
3,One more week is left to submit the workshop p...,2019-03-24 11:34:27,4,0,
4,RT @IEEEBigMM19: We are honored to have Dr. Ch...,2019-03-24 06:23:37,0,5,
5,RT @IEEEBigMM19: Distinguished researchers Dr....,2019-03-24 06:23:14,0,3,
6,@IEEEBigMM19 is also available on Facebook now...,2019-03-20 08:19:24,1,1,
7,RT @IEEEBigMM19: BigMM 2019 : IEEE BigMM 2019 ...,2019-03-20 02:40:07,0,5,
8,BigMM 2019 : IEEE BigMM 2019 – Call for Worksh...,2019-03-18 02:27:47,6,3,
9,"Congratulations @midasIIITD team, Rohan, Prady...",2019-03-17 14:22:04,15,4,
