# Files and formats

So far we have worked with flat files and CSV files. Today we look more closely at JSON and web APIs.

Pandas has a number of methods for reading tabular data as a DataFrame object. 

        read_csv
        read_fwf
        read_clipboard
        read_excel
        read_hdf
        read_html
        read_json
        read_msgpack (removed in pandas 1.0)
        read_pickle
        read_sas
        read_sql
        read_stata
        read_feather
        
So far we have mostly used `read_csv`, `read_fwf` and `read_excel`. We have seen that data can be very messy and that it often takes several parameters to read a file cleanly. Sometimes the fields are separated by a variable amount of whitespace. In these cases we can pass a regular expression as the delimiter to `read_csv`.

        data = pd.read_csv('example.txt', sep=r'\s+')
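
To see the regular-expression delimiter at work without a file on disk, here is a minimal sketch that reads from an in-memory string. The sample text and its column names are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample: columns separated by a varying number of spaces.
raw = (
    "name   height  weight\n"
    "anna     1.70      65\n"
    "bob  1.82   80\n"
)

# A raw string (r'...') keeps the backslash of the regular expression intact.
data = pd.read_csv(io.StringIO(raw), sep=r"\s+")
print(data)
```

Any run of one or more whitespace characters is treated as a single separator, so the uneven spacing does not produce empty columns.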
        
Other handy parameters of `read_csv` are

        na_values
        skiprows
        sep
        nrows
        chunksize
        skipfooter
        encoding
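
The sketch below shows how a few of these `read_csv` parameters behave on a small, made-up text (the content and the `missing` marker are assumptions for illustration). Note that the footer parameter is spelled `skipfooter` in current pandas and needs the Python parsing engine.

```python
import io
import pandas as pd

# Hypothetical messy export: a comment line on top and a summary line at the bottom.
raw = (
    "# exported by a hypothetical instrument\n"
    "sample;value\n"
    "a;1.5\n"
    "b;missing\n"
    "c;3.0\n"
    "d;4.5\n"
    "total;9.0\n"
)

data = pd.read_csv(
    io.StringIO(raw),
    sep=";",
    skiprows=1,             # skip the comment line before the header
    na_values=["missing"],  # treat this marker as NaN
    skipfooter=1,           # drop the summary line at the end
    engine="python",        # skipfooter is only supported by the Python engine
)

# nrows reads only the first rows after the header.
first_rows = pd.read_csv(io.StringIO(raw), sep=";", skiprows=1, nrows=2)

# chunksize returns an iterator of DataFrames instead of one big DataFrame.
chunk_sizes = [
    len(chunk)
    for chunk in pd.read_csv(io.StringIO(raw), sep=";", skiprows=1, chunksize=3)
]
print(data)
print(chunk_sizes)
```

Reading in chunks is mainly useful for files that do not fit in memory, because each chunk can be processed and discarded before the next one is read.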
        
    
      

## JSON Data

https://www.youtube.com/watch?v=EfEm0g-bMPc

JSON has become one of the standard formats for exchanging data over HTTP between web browsers and other applications. JSON is very nearly valid Python code: it uses basic types such as objects (dictionaries), arrays (lists), strings, numbers and booleans. So far we have used `json.loads()` and `json.dumps()` to convert between JSON strings and Python objects; their counterparts `json.load()` and `json.dump()` read from and write to files. JSON data can also be converted to a dataframe with the built-in pandas function `pd.read_json`.
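
As a small illustration of how JSON types map to Python types, the sketch below parses a made-up record (not taken from the food data) and serialises it again.

```python
import json

# Hypothetical JSON text; the keys and values are invented for illustration.
text = (
    '{"id": 1008, "description": "Cheese, caraway", '
    '"tags": [], "organic": false, "manufacturer": null}'
)

# object -> dict, array -> list, false -> False, null -> None
record = json.loads(text)
print(type(record), record["organic"], record["manufacturer"])

# dumps turns the Python object back into a JSON string
back = json.dumps(record)
print(back)
```

The differences from Python literals (`false`, `null`, double quotes only) are why JSON is "very nearly" but not exactly valid Python.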

In [None]:
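import json
# open the file in a context manager so it is closed after reading
with open('data/food.json') as f:
    db = json.load(f)
print(db[0:1])
# each entry is a dict, so we select one of them and inspect its keys
db[0].keys()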

In [None]:
import pandas as pd
df = pd.read_json('data/food.json')
df.head()

The disadvantage of `pd.read_json` is that nested data ends up in columns whose cells contain dictionaries. It is better to extract the nutrient information into a separate dataframe together with the `id` number, so that the two can be combined afterwards. Below, the `nutrients` cell of the first record is selected and turned into a dataframe.

In [None]:
nutrients = pd.DataFrame(db[0]['nutrients'])
nutrients.head(10)

To do this for the entire dataset we loop through the records, put each `nutrients` cell in a dataframe and add the `id` column for identification. If we collect these dataframes in a list and then concatenate them, we get one single dataframe with the nutrients of all records.

In [None]:
nutrients = []
for rec in db:
    fnuts = pd.DataFrame(rec['nutrients'])
    fnuts['id'] = rec['id']
    nutrients.append(fnuts)

nutrients = pd.concat(nutrients, ignore_index=True)
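
As an alternative to the explicit loop, pandas offers `pd.json_normalize`, which can flatten a nested list and carry along fields from the parent record. The sketch below uses a few made-up records shaped like the entries in `db`; whether it gives exactly the same result on the real file depends on the structure of that file.

```python
import pandas as pd

# Hypothetical records with the same nesting as the food data.
records = [
    {"id": 1, "description": "food A", "nutrients": [
        {"description": "Protein", "group": "Composition", "units": "g", "value": 25.0},
        {"description": "Total lipid (fat)", "group": "Composition", "units": "g", "value": 29.0},
    ]},
    {"id": 2, "description": "food B", "nutrients": [
        {"description": "Protein", "group": "Composition", "units": "g", "value": 3.0},
    ]},
]

# record_path points at the nested list; meta lists parent fields to repeat per row.
flat = pd.json_normalize(records, record_path="nutrients", meta=["id"])
print(flat)
```

Only `id` is passed as `meta` here, because the parent's `description` would clash with the nutrient's `description` column.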

In [None]:
nutrients[50:60]

In [None]:
#check for duplicates
print(len(nutrients))
nutrients.duplicated().sum()

In [None]:
nutrients = nutrients.drop_duplicates()

In [None]:
print(len(nutrients))
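
The effect of `duplicated` and `drop_duplicates` is easiest to see on a small frame. The data below is invented for illustration.

```python
import pandas as pd

# Hypothetical frame in which the second row repeats the first one exactly.
frame = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "nutrient": ["Protein", "Protein", "Protein", "Fat"],
    "value": [25.0, 25.0, 3.0, 1.0],
})

# duplicated() marks rows that repeat an earlier row in all columns.
n_dupes = int(frame.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each fully identical row.
deduped = frame.drop_duplicates()

# With subset, only the listed columns decide what counts as a duplicate.
one_per_id = frame.drop_duplicates(subset=["id"], keep="first")

print(n_dupes, len(deduped), len(one_per_id))
```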

Since we put the nutrient information in a separate dataframe, we can drop it from the original dataframe.

In [None]:
columns_to_keep = ['description',
                   'group',
                   'id',
                   'manufacturer']
df = df[columns_to_keep]
df.head()

Now that both dataframes are reduced to the columns we need, we can easily merge them. First we rename the columns that occur in both dataframes.

In [None]:
df = df.rename(columns={'description': 'food', 'group': 'food_group'})

In [None]:
df.head()

In [None]:
nutrients = nutrients.rename(columns = {'description':'nutrients', 'group': 'nutrient_group'})
nutrients.head()

In [None]:
ndata = pd.merge(nutrients, df, on='id', how='outer')
ndata.head(20)
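
With `how='outer'`, rows without a partner in the other dataframe are kept and filled with `NaN`. The sketch below uses two small made-up frames and the `indicator` option to show where each row came from.

```python
import pandas as pd

# Hypothetical frames: id 3 has no nutrients, id 4 has no food description.
foods = pd.DataFrame({"id": [1, 2, 3], "food": ["A", "B", "C"]})
nutr = pd.DataFrame({
    "id": [1, 1, 2, 4],
    "nutrients": ["Protein", "Fat", "Protein", "Protein"],
    "value": [25.0, 29.0, 3.0, 7.0],
})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'.
outer = pd.merge(nutr, foods, on="id", how="outer", indicator=True)
inner = pd.merge(nutr, foods, on="id", how="inner")

print(outer)
print(len(outer), len(inner))
```

Comparing the outer and inner result is a quick way to check how many records fail to match on `id`.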

With the merged data we can conduct any analysis we like, for example the median value per nutrient and food group.

In [None]:
%matplotlib notebook
result = ndata.groupby(['nutrients', 'food_group'])['value'].quantile(0.5)
result['Total lipid (fat)'].sort_values().plot(kind='barh')
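
The grouping step can be followed on a small made-up frame (the values are invented, the column names follow the merged data). Grouping by two columns gives a Series with a two-level index, and selecting a nutrient by name returns the values per food group.

```python
import pandas as pd

# Hypothetical miniature version of the merged data.
ndata_demo = pd.DataFrame({
    "nutrients": ["Protein", "Protein", "Protein",
                  "Total lipid (fat)", "Total lipid (fat)", "Total lipid (fat)"],
    "food_group": ["Dairy", "Dairy", "Beef", "Dairy", "Beef", "Beef"],
    "value": [10.0, 20.0, 30.0, 5.0, 15.0, 25.0],
})

# quantile(0.5) is the median of each (nutrient, food group) combination.
result_demo = ndata_demo.groupby(["nutrients", "food_group"])["value"].quantile(0.5)

# Indexing with the first level leaves a Series indexed by food group.
fat = result_demo["Total lipid (fat)"].sort_values()
print(fat)
```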

## Retrieving data through APIs

API stands for Application Programming Interface. It is the interface that allows software applications to communicate with one another. An API is a software-to-software interface, not a user interface. With APIs, applications talk to each other without any user knowledge or intervention.


An example is the Twitter API. It is a web-based JSON API that allows developers to interact with Twitter data programmatically. Being web-based, it must be accessed by making requests over the Internet to services that Twitter hosts. With a web-based API such as Twitter's, your application sends an HTTP request, just like a web browser does. But instead of the response being delivered as a webpage for human understanding, it is returned in a format that applications can easily parse. Various formats exist for this purpose, and Twitter uses a popular and easy-to-use format called JSON.
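
The request and response cycle can be tried out without any credentials. The sketch below is not the Twitter API: it starts a tiny stand-in server on the local machine (the handler name `DemoHandler`, the `/tweet` path and the payload are all made up) and then fetches and parses its JSON response with the standard library.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a web API: answers every GET with JSON."""

    def do_GET(self):
        body = json.dumps({"user": "example", "text": "hello"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the notebook output free of request logs


# Port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client side: send an HTTP request and parse the JSON in the response.
with urlopen(f"http://127.0.0.1:{port}/tweet") as response:
    content_type = response.headers["Content-Type"]
    payload = json.loads(response.read().decode("utf-8"))

server.shutdown()
server.server_close()

print(content_type, payload)
```

A real web API works the same way from the client's point of view, except that the URL points to a remote host and requests usually have to be authenticated.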

In order to access the Twitter Streaming API, we need four pieces of information from Twitter: API key, API secret, access token and access token secret. If you go to https://apps.twitter.com/ and log in with your Twitter credentials, you can create a new app and obtain these credentials for yourself.

For the Twitter API we need the tweepy library, see https://tweepy.readthedocs.io/en/latest/

In [None]:
#source: http://adilmoujahid.com/posts/2014/07/twitter-analytics/
#Import the necessary classes from the tweepy library
#(StreamListener exists in tweepy 3.x; it was merged into Stream in tweepy 4.0)
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

#Variables that contain the user credentials to access the Twitter API
#(replace the placeholders with your own credentials)
access_token = 'YOUR_ACCESS_TOKEN'
access_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'
consumer_key = 'YOUR_API_KEY'
consumer_secret = 'YOUR_API_SECRET'


#This is a basic listener that prints each tweet; uncomment the two lines below to also append it to a JSON file
class StdOutListener(StreamListener):

    def on_data(self, data):
#        with open('data/out.json', 'a') as f:
#            f.write(data)
        print(data)
        return True

    def on_error(self, status):
        print(status)


if __name__ == '__main__':

    #This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)

    #This line filters the Twitter stream to capture tweets that contain the keywords
    stream.filter(track=['aerobic', 'anaerobic'])

Since the tweets are stored as JSON, one object per line, we can process the data accordingly.

In [None]:
import json
tweets_data_path = 'data/result2.json'

# the file holds one JSON object per line; blank lines are skipped
tweets_data = []
with open(tweets_data_path, "r") as tweets_file:
    for line in tweets_file:
        if line.strip():
            tweets_data.append(json.loads(line))
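
Files written from a stream often contain blank lines or a truncated last line, and `json.loads` raises an error on both. The sketch below shows one tolerant way to read such line-delimited JSON; the sample text is made up and the choice to count and skip bad lines is an assumption, not something the original file requires.

```python
import io
import json
import pandas as pd

# Hypothetical stream output: a blank line and a line that was cut off.
raw = (
    '{"id": 1, "text": "aerobic training"}\n'
    '\n'
    '{"id": 2, "text": "anaerobic threshold"}\n'
    '{"id": 3, "te'
)

records = []
skipped = 0
for line in io.StringIO(raw):
    line = line.strip()
    if not line:
        continue  # ignore blank lines
    try:
        records.append(json.loads(line))
    except json.JSONDecodeError:
        skipped += 1  # count lines that are not valid JSON

frame = pd.DataFrame(records)
print(frame)
print("skipped:", skipped)
```

For a clean file with one object per line, `pd.read_json(path, lines=True)` reads it into a dataframe directly.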

    

In [None]:
print(len(tweets_data))

In [None]:
import pandas as pd
tweets = pd.DataFrame(tweets_data)
tweets

In [None]:
tweets

### Some useful regex methods

In [None]:
def extract_link(text):
    import re
    regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
    match = re.search(regex, text)
    if match:
        return match.group()
    return ''

In [None]:
tweets['link'] = tweets['text'].apply(extract_link)
tweets
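
The same link extraction can be written with the vectorised string methods of pandas. This is a sketch on a made-up frame; it reuses the regular expression from `extract_link`, wrapped in a capture group because `str.extract` returns the content of the first group.

```python
import pandas as pd

# Hypothetical tweets for illustration.
demo = pd.DataFrame({"text": [
    "read this https://example.com/a now",
    "no link here",
    "see www.example.org",
]})

# Same pattern as in extract_link, with parentheses added as capture group.
pattern = r'(https?://[^\s<>"]+|www\.[^\s<>"]+)'

# expand=False returns a Series; rows without a match become NaN, replaced by ''.
demo["link"] = demo["text"].str.extract(pattern, expand=False).fillna("")
print(demo)
```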

In [None]:
def word_in_text(word, text):
    import re
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)
    if match:
        return True
    return False

In [None]:
tweets['sport'] = tweets['text'].apply(lambda tweet: word_in_text('sport', tweet))
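
A plain substring search also matches inside longer words. The sketch below, on made-up texts, compares a case-insensitive substring test with a regular expression that uses word boundaries (`\b`); which of the two is appropriate depends on the question being asked.

```python
import pandas as pd

# Hypothetical tweets for illustration.
demo = pd.DataFrame({"text": [
    "Sport is fun",
    "I love public transport",
    "aerobic training",
]})

# Substring test, ignoring case: also matches 'transport'.
demo["substring"] = demo["text"].str.contains("sport", case=False, regex=False)

# Word-boundary test: only matches 'sport' as a separate word.
demo["whole_word"] = demo["text"].str.contains(r"\bsport\b", case=False, regex=True)
print(demo)
```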

In [None]:
tweets

## Challenge

Try to read the SQL file in the data directory. More information can be found at https://github.com/fenna/twitter_analysis
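
As a starting point, here is one possible approach, assuming the file is a plain SQL dump with `CREATE TABLE` and `INSERT` statements that SQLite understands. The table name, columns and rows below are invented; the real file may use another SQL dialect and will then need a different database driver.

```python
import sqlite3
import pandas as pd

# Hypothetical dump; in practice this text would be read from the .sql file.
dump = """
CREATE TABLE tweets (id INTEGER PRIMARY KEY, text TEXT);
INSERT INTO tweets VALUES (1, 'aerobic training');
INSERT INTO tweets VALUES (2, 'anaerobic threshold');
"""

# Load the dump into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript(dump)

# Query the database and get the result as a DataFrame.
tweets_sql = pd.read_sql_query("SELECT * FROM tweets", conn)
conn.close()
print(tweets_sql)
```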