# Dealing with data: Twitter API

Last lesson, we started work on a Twitter API program that converts between IDs and usernames for us. Today, we'll continue with this example and take a look at the data structures used.

## What is a data structure?

The best definition I could find is essentially:

> "A data structure is a specialized format for organizing and storing data."

This is quite a general term, and covers things like:

* lists
* dicts
* objects
* strings
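
As a quick illustration, here is roughly how each of these looks in Python. The `User` class and every value below are made up for this sketch; they are not part of the Twitter program.

```python
# Hypothetical values, just to show each kind of structure.
ids = [101, 102, 103]                  # list: an ordered collection
profile = {'id': 101, 'name': 'tom'}   # dict: look up a value by its key
handle = 'tom'                         # string: a sequence of characters


class User:
    """A minimal object that bundles data under named attributes."""

    def __init__(self, user_id, name):
        self.user_id = user_id
        self.name = name


user = User(101, 'tom')

print(ids[0], profile['name'], handle.upper(), user.name)
```

Each structure answers a different question: a list keeps things in order, a dict finds things by name, and an object groups related data (and behaviour) together.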


Let's look at our program from last week, and see what Twitter returns to us.

In [None]:
import tweepy
import json

CONSUMER_KEY = 'n4om9qk8X3EKzlmfVBU3n4K3b'
CONSUMER_SECRET = 'ol5Ftaog6CnzebaJENibnxrg9vNdz4rgtnmxZ70RvNwaUYv9v3'
ACCESS_TOKEN = '3637250534-dpI1Sz8T6Yfk2UbMyGSzfwTe6kXYEXJPrwBs5qF'
ACCESS_TOKEN_SECRET = '7z6TIplUYGQtL9ZWDQ4scvs5cwFKMq5SaQpZ0Nx8nau'
# 'xl'

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

api = tweepy.API(auth)
user = api.get_user('potus')
print(user._json)

## Serde?

Serde stands for **Ser**ialisation-**De**serialisation. This can be a complicated thing to explain, but essentially it means:

* Q: I have a data structure in one language, and I need to pass it to another one. How will the second language understand it?
* A: Use a universal standard as the format to pass between the languages. Each language needs an "encoder/decoder" in order to understand it.

The most popular of these formats is JSON (JavaScript Object Notation). You could also consider CSV as another serde format.

In [14]:
import json

person = {
    'id': 123,
    'name': 'tom'
}
print('this is the python dict:', person)
data = json.dumps(person)
print('this is the dict as JSON', data)

this is the python dict: {'id': 123, 'name': 'tom'}
this is the dict as JSON {"id": 123, "name": "tom"}


In [15]:
print('now lets create a python dict from the JSON', json.loads(data))

now lets create a python dict from the JSON {'id': 123, 'name': 'tom'}
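
One thing worth knowing: the round trip is not always lossless. JSON has no tuple type, and object keys must be strings, so some Python values come back in a slightly different shape. A small sketch, with made-up data:

```python
import json

# A dict with an integer key and a tuple value (illustrative data only).
original = {1: 'one', 'point': (3, 4)}

# Serialise to JSON text, then deserialise back into Python.
restored = json.loads(json.dumps(original))

print(restored)              # the key is now a string, the tuple is now a list
print(restored == original)  # so the two dicts are no longer equal
```

The simple `person` dict above survives unchanged because it only uses string keys, a number and a string, all of which JSON represents directly.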


In [16]:
print('you can pretty-print JSON nicely:\n', json.dumps(person, indent=4))

you can pretty-print JSON nicely:
 {
    "id": 123,
    "name": "tom"
}


## What's the difference between JSON and CSV? Which is better?

A: Neither is better; each one is more appropriate in certain situations.

In [17]:
person = {
    'id': 123,
    'age': 45,
    'name': 'tom',
    'gender': 'fluid'
}

In [18]:
# Let's print that as JSON
print(json.dumps(person))

{"id": 123, "age": 45, "name": "tom", "gender": "fluid"}


In [23]:
# Let's print that as a CSV-style row (using '|' as the delimiter instead of a comma)
# CSV is flat, so every value is converted to a string and placed in one row
row = [str(val) for val in person.values()]

print('|'.join(person.keys()))  # header line
print('|'.join(row))            # data line

id|age|name|gender
123|45|tom|fluid
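
Joining strings by hand works for simple values, but it breaks as soon as a value contains the delimiter itself. As a hedged alternative, Python's standard `csv` module handles quoting for us and can also read the row back. The sketch below keeps `|` as the delimiter and writes to an in-memory buffer rather than a file:

```python
import csv
import io

person = {'id': 123, 'age': 45, 'name': 'tom', 'gender': 'fluid'}

# Serialise: write a header line and one data row into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=list(person.keys()),
    delimiter='|',
    lineterminator='\n',
)
writer.writeheader()
writer.writerow(person)
text = buffer.getvalue()
print(text)

# Deserialise: read the same text back into a list of dict-like rows.
rows = list(csv.DictReader(io.StringIO(text), delimiter='|'))
print(rows[0])
```

Note that every value comes back as a string: CSV carries no type information, whereas JSON kept `123` as a number.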


## Now let's revisit our Twitter example

What would be our preferred output format, and why?

In [None]:
import tweepy
import json

CONSUMER_KEY = 'n4om9qk8X3EKzlmfVBU3n4K3b'
CONSUMER_SECRET = 'ol5Ftaog6CnzebaJENibnxrg9vNdz4rgtnmxZ70RvNwaUYv9v3'
ACCESS_TOKEN = '3637250534-dpI1Sz8T6Yfk2UbMyGSzfwTe6kXYEXJPrwBs5qF'
ACCESS_TOKEN_SECRET = '7z6TIplUYGQtL9ZWDQ4scvs5cwFKMq5SaQpZ0Nx8nauxl'
# 'xl'

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

api = tweepy.API(auth)
user = api.get_user('potus')
data = user._json

print(json.dumps(data, indent=4))

In [None]:
print(json.dumps(data['entities'], indent=4))
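
Nested values like `entities` are where the two formats really differ: JSON can represent them as they are, while a CSV row needs them flattened first. Here is a minimal sketch of one way to do that. The `flatten` helper and the `sample` dict are hypothetical; `sample` only imitates the nested shape and is not real Twitter output.

```python
def flatten(nested, parent_key='', sep='.'):
    """Turn a nested dict into a flat dict with dotted keys (hypothetical helper)."""
    flat = {}
    for key, value in nested.items():
        full_key = f'{parent_key}{sep}{key}' if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the key path along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Made-up data with a nested structure, for illustration only.
sample = {
    'id': 123,
    'name': 'tom',
    'entities': {
        'url': {'display': 'example.com'},
        'description': {'length': 0},
    },
}

flat = flatten(sample)
print('|'.join(flat.keys()))
print('|'.join(str(val) for val in flat.values()))
```

This sketch only recurses into dicts; lists inside the data would still need a decision of their own (for example, one row per item), which is part of why JSON is often the more natural fit for API responses.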