# Google Public Books API Data Extraction
Data is pulled from the API endpoint https://www.googleapis.com/books/v1/volumes?q=isbn:
<br>It takes an ISBN as input and returns the response in JSON format.
Note: The API is limited to 100 calls per minute per user, so threading will not speed things up. Calls made faster than this return an error message in the JSON response, so be sure to filter those error rows out of the dataframe. This notebook already takes care of that. This run uses only 100 ISBNs.
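
One way to stay under the limit is to space the calls out on the client side. The sketch below is only an illustration and not what this notebook does: `throttled_fetch`, `fetch`, and `min_interval` are hypothetical names, and the 0.6 second spacing simply corresponds to 100 calls per 60 seconds.

```python
import time


def throttled_fetch(isbns, fetch, min_interval=0.6,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch(isbn) for each ISBN, starting consecutive calls at least
    min_interval seconds apart (0.6 s is roughly 100 calls per minute).

    fetch is any callable returning the parsed JSON dict for one ISBN,
    for example: lambda isbn: requests.get(url + isbn).json()
    sleep and clock are injectable so the pacing can be tested offline.
    """
    results = {}
    last_call = None
    for isbn in isbns:
        if last_call is not None:
            # Wait out whatever is left of the minimum interval.
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results[isbn] = fetch(isbn)
    return results
```

Passing the request function in as `fetch` keeps the pacing logic separate from the HTTP call, so the same helper can be tried with a stub before spending real API quota.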

In [11]:
import requests
import pandas as pd
from tqdm import tqdm
import pickle as pkl
book = pd.read_csv('/kaggle/input/book-recommendation-dataset/Books.csv')
book_api = {}
url = 'https://www.googleapis.com/books/v1/volumes?q=isbn:'
checkpoint_path = '/kaggle/working/book_api_data.pkl'
# Change the slice below to increase or decrease the sample size.
for i, isbn in tqdm(enumerate(book.ISBN.values[:100])):
    response = requests.get(url + isbn)
    book_api[isbn] = response.json()
    # Save a checkpoint every 1000 API calls to limit data loss if the run is interrupted.
    if i != 0 and i % 1000 == 0:
        with open(checkpoint_path, 'wb') as file:
            pkl.dump(book_api, file)
# Save once more at the end so that runs shorter than 1000 calls are also written to disk.
with open(checkpoint_path, 'wb') as file:
    pkl.dump(book_api, file)

100it [00:22,  4.46it/s]


In [16]:
book_api_data.reset_index()

Unnamed: 0,index,kind,totalItems,items,error
0,0195153448,books#volumes,1,"[{'kind': 'books#volume', 'id': 'KyLfwAEACAAJ'...",
1,0002005018,books#volumes,1,"[{'kind': 'books#volume', 'id': 'yfx0vgEACAAJ'...",
2,0060973129,books#volumes,1,"[{'kind': 'books#volume', 'id': '_LufAAAAMAAJ'...",
3,0374157065,books#volumes,1,"[{'kind': 'books#volume', 'id': 'GkthXOZv17kC'...",
4,0393045218,books#volumes,1,"[{'kind': 'books#volume', 'id': '5OujQgAACAAJ'...",
...,...,...,...,...,...
95,0671867156,books#volumes,1,"[{'kind': 'books#volume', 'id': 'uf-3AL4tDFgC'...",
96,0312252617,books#volumes,1,"[{'kind': 'books#volume', 'id': 'F4UDC_QrW58C'...",
97,0312261594,books#volumes,1,"[{'kind': 'books#volume', 'id': 'bZdSPgAACAAJ'...",
98,0316748641,books#volumes,1,"[{'kind': 'books#volume', 'id': 'I3s5PwAACAAJ'...",


In [17]:
book_api_data = pd.DataFrame(book_api)
book_api_data = book_api_data.T.reset_index()
# Keep only rows without an error message and with book data available from the API.
# ISBNs whose call returned an error message can be requested again by re-running the code above for them.
books_api_1 = book_api_data[(book_api_data.error.isna()) & (book_api_data['items'].notna())]
books_api_2 = books_api_1.drop_duplicates(subset='index')
books_api_2 = books_api_2.set_index('index')
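
The re-run mentioned in the comments could look roughly like the sketch below. It assumes `book_api` is the dict of raw responses built earlier and that a failed call can be recognised by an `error` key in its JSON, as the `error` column suggests. `retry_errors`, `fetch`, and `max_rounds` are hypothetical names.

```python
def retry_errors(book_api, fetch, max_rounds=3):
    """Re-request every ISBN whose stored response has an 'error' key.

    book_api: dict mapping ISBN -> parsed JSON response (updated in place).
    fetch: callable taking an ISBN and returning a parsed JSON dict.
    Returns the list of ISBNs that still have an error after max_rounds.
    """
    failed = [isbn for isbn, resp in book_api.items() if "error" in resp]
    for _ in range(max_rounds):
        if not failed:
            break
        for isbn in failed:
            book_api[isbn] = fetch(isbn)
        # Keep only the ISBNs that failed again for the next round.
        failed = [isbn for isbn in failed if "error" in book_api[isbn]]
    return failed
```

In practice the retries would also need to respect the rate limit, for example by pausing between rounds, otherwise the same errors are likely to come back.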

The following fields are extracted from the Google Books API response:
* title: the title of the book.
* authors: the names of the book's authors (may include more than one author).
* publishedDate: the publication date of the book.
* language: the language of the book.
* categories (genres): the categories associated with the book (as assigned by Google).
* averageRating: the average rating of the book, out of 5.
* maturityRating: whether the content of the book is rated MATURE or NOT_MATURE.
* pageCount: the number of pages in the book.
* ratingsCount (voters): the number of users who rated the book (extracted as category).
* ISBN: the unique identifier for each book.
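
To make the field lookup concrete, the sketch below pulls these fields from a single response. The sample dict is a hand-written, made-up stand-in shaped like the `items[0]['volumeInfo']` structure this notebook reads. It is not real API output, and `extract_fields` is a hypothetical helper.

```python
FIELDS = ["title", "authors", "publishedDate", "language", "pageCount",
          "averageRating", "ratingsCount", "maturityRating", "categories"]


def extract_fields(isbn, response, fields=FIELDS):
    """Return one flat record for an ISBN; fields missing from the response become None."""
    volume_info = response["items"][0]["volumeInfo"]
    record = {"ISBN": isbn}
    for field in fields:
        record[field] = volume_info.get(field)
    return record


# Made-up response, for illustration only.
sample = {
    "kind": "books#volumes",
    "totalItems": 1,
    "items": [{"volumeInfo": {"title": "Example Title",
                              "authors": ["A. Author", "B. Author"],
                              "language": "en",
                              "pageCount": 123}}],
}
record = extract_fields("0000000000", sample)
print(record["authors"])        # the list of author names
print(record["averageRating"])  # None, because the sample has no rating
```

Using `.get()` instead of direct indexing is what lets books without ratings or categories still produce a row, with the gaps showing up later as missing values.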

In [18]:
field_list = ['ISBN', 'title', 'authors', 'publishedDate', 'language', 'pageCount', 'averageRating', 'ratingsCount', 'maturityRating', 'categories']
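# One empty list per output column
book_dict = {fld: [] for fld in field_list}

for i in tqdm(books_api_2.index):
    # Only the first item returned for each ISBN is used.
    volume_info = books_api_2.loc[i, 'items'][0]['volumeInfo']
    for field in field_list:
        if field == 'ISBN':
            book_dict[field].append(i)
        else:
            # .get() returns None when the response lacks this field
            book_dict[field].append(volume_info.get(field))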

100%|██████████| 90/90 [00:00<00:00, 5235.75it/s]


In [19]:
# Convert the extracted JSON fields to a DataFrame and save it as CSV
book_df = pd.DataFrame(book_dict)
book_df.to_csv('/kaggle/working/book_api_df.csv', index=False)
book_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   ISBN            90 non-null     object 
 1   title           90 non-null     object 
 2   authors         90 non-null     object 
 3   publishedDate   90 non-null     object 
 4   language        90 non-null     object 
 5   pageCount       89 non-null     float64
 6   averageRating   63 non-null     float64
 7   ratingsCount    63 non-null     float64
 8   maturityRating  90 non-null     object 
 9   categories      87 non-null     object 
dtypes: float64(3), object(7)
memory usage: 7.2+ KB
