#### FILES:
- **ratings.csv** contains ratings sorted by time. Ratings range from one to five. Both book IDs and user IDs are contiguous: books are numbered 1-10000 and users 1-53424.
- **toread.csv** provides the IDs of the books marked "to read" by each user, as user_id,book_id pairs, sorted by time.
- **books.csv** has metadata for each book (goodreads IDs, authors, title, average rating, etc.). The metadata has been extracted from goodreads XML files.
- **book_tags.csv** contains tags/shelves/genres assigned by users to books. Tags in this file are represented by their IDs. Each book has multiple tag_id values. The field "count" denotes ‘user records’ (the number of users who tagged the given goodreads_book_id with that tag_id).

#### QUESTIONS:
1. How many books do not have an original title [books.csv] ? N
2. How many unique books are present in the dataset ? Evaluate based on the 'book_id' [books.csv]  N
3. How many unique users are present in the dataset [ratings.csv] ? N
4. Which book (title) has the maximum number of ratings based on ‘work_ratings_count’  [books.csv] ? S
5. Which tag_id  is the most frequently used ie. mapped with the highest number of books [book_tags.csv]  ? (In case of more than one tag, mention the tag id with the least numerical value) N
6. Which book (title) has the most number of counts of tags given by the user [book_tags.csv,books.csv]  ? S
7. Which book (goodreads_book_id) is marked as to-read by most users [books.csv,toread.csv] ? N
8. Which is the least used tag, i.e. mapped with the lowest number of books [book_tags.csv]   ? (In case of more than one tag, mention the tag id with the least numerical value)  N
9. Which book (title) has the minimum ‘average_rating’  [books.csv] ? S
10. Which book (goodreads_book_id) has the least number of count of tags given by the user  [book_tags.csv,books.csv] ? N
11. How many tags are there in the dataset [book_tags.csv]  ? N
12. What is the average rating of all the books in the dataset based on ‘average_rating’  [books.csv]  ? N
13. Find the number of books published in the year ‘2000’ based on the ‘original_publication_year’ [books.csv] ? N
14. Predict sentiment using Textblob. How many positive titles (title) are there [books.csv] ? (cut-off >0) N
(3 marks)
15. Plot a bar chart in Flourish with top 20 unique tags  in descending order of ‘user records’ (the number of users tagged the given tag_id with the goodreads_book_id) [book_tags.csv] and share the published link.
(2 marks)
16. Bucket the average_rating of books into 6 buckets [0,1,2,3,4,5] with 0.5 decimal rounding (eg: average_rating 3.5 to 4.4 will fall in bucket 4). Plot bar graph in Flourish to show total number of books in each rating bucket. [books.csv] and share the published link.
(2 marks)
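
Questions 15 and 16 ask for charts built in Flourish, so the only work done in pandas is preparing the two small tables to upload. The following is a minimal sketch of that preparation on toy data; the frames `book_tags_demo` and `books_demo` and all values in them are made up for illustration and are not taken from the real files. For question 16 it assumes "0.5 decimal rounding" means rounding half up, which is why `floor(x + 0.5)` is used instead of Python's built-in `round()` (which rounds halves to the nearest even number).

```python
import numpy as np
import pandas as pd

# Toy stand-ins for book_tags.csv and books.csv (illustrative values only).
book_tags_demo = pd.DataFrame({
    "goodreads_book_id": [1, 1, 2, 2, 3],
    "tag_id": [10, 20, 10, 30, 20],
    "count": [5, 7, 3, 1, 2],
})
books_demo = pd.DataFrame({"average_rating": [3.49, 3.50, 4.44, 4.50, 2.75]})

# Question 15: total user records per tag, largest first, keeping the top 20.
top_tags = book_tags_demo.groupby("tag_id")["count"].sum().nlargest(20)

# Question 16: round half up to an integer bucket (3.50 -> 4, 4.44 -> 4, 4.50 -> 5).
buckets = np.floor(books_demo["average_rating"] + 0.5).astype(int)

# Count books per bucket and make sure all six buckets 0..5 appear, even if empty.
bucket_counts = buckets.value_counts().reindex(range(6), fill_value=0)

print(top_tags)
print(bucket_counts)
```

Either result can be written out with `to_csv()` and uploaded to Flourish as the data sheet for a bar chart.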

In [284]:
# Here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are read from the "Dataset" directory under the current working directory
# The loop below lists every file found there

import os
# print(os.getcwd())
for dirname, _, filenames in os.walk(os.getcwd()+'/Dataset'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/Users/sanjeesi/Documents/Workspace/Data Science/Data Science IITM/TDS/Projects/Project2/Dataset/book_tags.csv
/Users/sanjeesi/Documents/Workspace/Data Science/Data Science IITM/TDS/Projects/Project2/Dataset/ratings.csv
/Users/sanjeesi/Documents/Workspace/Data Science/Data Science IITM/TDS/Projects/Project2/Dataset/toread.csv
/Users/sanjeesi/Documents/Workspace/Data Science/Data Science IITM/TDS/Projects/Project2/Dataset/books.csv


### Load datasets

In [285]:
books = pd.read_csv('Dataset/books.csv')
books.head()

Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url,NonEnglish
0,61,22557272,22557272,41107568,14,1594633665.0,9781595000000.0,Paula Hawkins,2015.0,The Girl on the Train,...,1226485,93600,27773,73897,273817,488447,362551,https://images.gr-assets.com/books/1490903702m...,https://images.gr-assets.com/books/1490903702s...,0
1,106,9418327,9418327,14302659,48,,,Tina Fey,2011.0,Bossypants,...,609260,35142,14842,31761,129390,230080,203187,https://images.gr-assets.com/books/1481509554m...,https://images.gr-assets.com/books/1481509554s...,0
2,124,7937843,7937843,9585076,151,316098337.0,9780316000000.0,Emma Donoghue,2010.0,Room,...,556327,42254,11020,26079,99831,217995,201402,https://images.gr-assets.com/books/1344265419m...,https://images.gr-assets.com/books/1344265419s...,0
3,141,18007564,18007564,21825181,148,804139024.0,9780804000000.0,Andy Weir,2012.0,The Martian,...,529702,61298,4114,10856,49200,173861,291671,https://images.gr-assets.com/books/1413706054m...,https://images.gr-assets.com/books/1413706054s...,0
4,143,18143977,18143977,25491300,139,1476746583.0,9781477000000.0,Anthony Doerr,2014.0,,...,547827,53413,6209,14527,61020,185239,280832,https://images.gr-assets.com/books/1451445646m...,https://images.gr-assets.com/books/1451445646s...,0


In [286]:
ratings = pd.read_csv('Dataset/ratings.csv')
ratings.head()

Unnamed: 0,user_id,book_id,rating
0,4,200,4
1,4,373,4
2,24,7770,4
3,32,373,5
4,40,659,3


In [287]:
toread = pd.read_csv('Dataset/toread.csv')
toread.head()

Unnamed: 0,user_id,book_id
0,112,217
1,162,200
2,256,1514
3,332,2637
4,360,217


In [288]:
book_tags = pd.read_csv('Dataset/book_tags.csv')
book_tags.head()

Unnamed: 0,goodreads_book_id,tag_id,count
0,2371,30574,3162
1,2371,14552,238
2,2371,21773,183
3,2371,21689,160
4,2371,8717,154


1. How many books do not have an original title [books.csv] ? N

In [289]:
books.isna().sum()

book_id                       0
goodreads_book_id             0
best_book_id                  0
work_id                       0
books_count                   0
isbn                         79
isbn13                       77
authors                       0
original_publication_year     0
original_title               71
title                         0
language_code                51
average_rating                0
ratings_count                 0
work_ratings_count            0
work_text_reviews_count       0
ratings_1                     0
ratings_2                     0
ratings_3                     0
ratings_4                     0
ratings_5                     0
image_url                     0
small_image_url               0
NonEnglish                    0
dtype: int64

In [290]:
int(books['original_title'].isna().sum())

71

In [291]:
# Remove books with original_title = NaN
removedBooks = books[books['original_title'].isna()][['book_id', 'goodreads_book_id']]
books = books[books['original_title'].notna()]

In [292]:
# Remove removedBooks from other datasets
ratings = ratings[~ratings['book_id'].isin(removedBooks['book_id'])]
toread = toread[~toread['book_id'].isin(removedBooks['book_id'])]
book_tags = book_tags[~book_tags['goodreads_book_id'].isin(removedBooks['goodreads_book_id'])]

2. How many unique books are present in the dataset ? Evaluate based on the 'book_id' [books.csv]  N

In [293]:
books.nunique()

book_id                      486
goodreads_book_id            486
best_book_id                 486
work_id                      486
books_count                  116
isbn                         436
isbn13                       438
authors                      407
original_publication_year     58
original_title               486
title                        485
language_code                  5
average_rating               113
ratings_count                483
work_ratings_count           486
work_text_reviews_count      473
ratings_1                    410
ratings_2                    456
ratings_3                    471
ratings_4                    480
ratings_5                    476
image_url                    434
small_image_url              434
NonEnglish                     1
dtype: int64

3. How many unique users are present in the dataset [ratings.csv] ? N

In [294]:
ratings.head()

Unnamed: 0,user_id,book_id,rating
0,4,200,4
1,4,373,4
2,24,7770,4
3,32,373,5
4,40,659,3


In [295]:
ratings.nunique()

user_id    43944
book_id      486
rating         5
dtype: int64

4. Which book (title) has the maximum number of ratings based on ‘work_ratings_count’  [books.csv] ? S

In [296]:
# idxmax() returns an index label, so use label-based .loc rather than positional .iloc
books.loc[books['work_ratings_count'].idxmax(), 'title']

'The Girl on the Train'
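
`idxmax()` returns an index label, not a row position, and the two stop matching once rows have been filtered out of a DataFrame. The sketch below illustrates the difference on a toy frame; the name `demo` and its titles and counts are invented for the example.

```python
import pandas as pd

# Toy frame; titles and counts are illustrative, not taken from books.csv.
demo = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "work_ratings_count": [10, 5, 20, 40],
    "original_title": ["A", None, "C", "D"],
})

# Dropping a row keeps the old index labels: 0, 2, 3.
demo = demo[demo["original_title"].notna()]

label = demo["work_ratings_count"].idxmax()  # index label of the row with the max
top_title = demo.loc[label, "title"]         # label-based lookup

print(label, top_title)
```

In this toy case `demo.iloc[label]` would raise an `IndexError`, because only three rows remain and position 3 does not exist. Calling `reset_index(drop=True)` after filtering is another way to keep labels and positions aligned.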

5. Which tag_id  is the most frequently used ie. mapped with the highest number of books [book_tags.csv]  ? (In case of more than one tag, mention the tag id with the least numerical value) N

In [316]:
book_tags.head()

Unnamed: 0,goodreads_book_id,tag_id,count
0,2371,30574,3162
1,2371,14552,238
2,2371,21773,183
3,2371,21689,160
4,2371,8717,154


In [298]:
book_tags['tag_id'].mode()

0      829
1    30574
Name: tag_id, dtype: int64
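
`mode()` returns every value tied for the highest frequency, in ascending order, so the tie-break rule in the question (smallest tag id) can be applied by taking the minimum. A minimal sketch on a toy Series, whose values only mimic the shape of the output above:

```python
import pandas as pd

# Toy tag assignments; each entry stands for one (book, tag) row.
tags_demo = pd.Series([829, 30574, 829, 30574, 11], name="tag_id")

# All tag ids tied for the highest number of occurrences.
tied = tags_demo.mode()

# Tie-break: the smallest tag id among the most frequent ones.
most_used_tag = int(tied.min())

print(tied.tolist(), most_used_tag)
```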

6. Which book (title) has the most number of counts of tags given by the user [book_tags.csv,books.csv]  ? S

In [299]:
book_tags.groupby(['goodreads_book_id']).sum().sort_values(['count'], ascending=False)
# books[books['goodreads_book_id']==11235712]['title']

goodreads_book_id,tag_id,count
11235712,1733817,558626
9418327,1368578,387679
7937843,1467012,344959
16101128,1677935,330142
20910157,1186704,329692
...,...,...
15818278,1735668,1404
10843036,1775798,1078
525488,1690899,932
10374638,1396587,921


In [300]:
books[books['goodreads_book_id']==book_tags.groupby(['goodreads_book_id'])['count'].sum().idxmax()]['title']

11    Cinder (The Lunar Chronicles, #1)
Name: title, dtype: object
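
An alternative way to get from tag counts to a title is to aggregate per book and then merge with the book metadata, which keeps the totals and titles side by side for inspection. This is a sketch on toy data; `book_tags_demo`, `books_demo` and their values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for book_tags.csv and books.csv (illustrative values only).
book_tags_demo = pd.DataFrame({
    "goodreads_book_id": [100, 100, 200, 300],
    "tag_id": [1, 2, 1, 3],
    "count": [50, 25, 60, 10],
})
books_demo = pd.DataFrame({
    "goodreads_book_id": [100, 200, 300],
    "title": ["Book X", "Book Y", "Book Z"],
})

# Sum only the "count" column per book, then attach titles via the shared key.
totals = (
    book_tags_demo.groupby("goodreads_book_id", as_index=False)["count"]
    .sum()
    .merge(books_demo, on="goodreads_book_id", how="inner")
    .sort_values("count", ascending=False)
)

# After sorting, the first row (by position) is the most-tagged book.
most_tagged_title = totals.iloc[0]["title"]

print(totals)
print(most_tagged_title)
```

Selecting `["count"]` before `sum()` avoids adding up `tag_id` values, which have no meaning as a total. The inner merge silently drops any tagged book that has no row in the metadata table.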

7. Which book (goodreads_book_id) is marked as to-read by most users [books.csv,toread.csv] ? N

In [301]:
toread.head()

Unnamed: 0,user_id,book_id
0,112,217
1,162,200
2,256,1514
3,332,2637
4,360,217


In [302]:
toread.nunique()

user_id    25417
book_id      486
dtype: int64

In [303]:
books[books['book_id'] == toread['book_id'].mode()[0]]['goodreads_book_id']

0    22557272
Name: goodreads_book_id, dtype: int64
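
The to-read file is keyed by `book_id`, while the question asks for `goodreads_book_id`, so the answer needs a count per book followed by a lookup in the books table. The sketch below shows one way to do this that also surfaces ties and counts each user once per book; the toy frames and IDs are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for toread.csv and books.csv (illustrative values only).
toread_demo = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "book_id": [7, 7, 9, 9, 8],
})
books_demo = pd.DataFrame({
    "book_id": [7, 8, 9],
    "goodreads_book_id": [7001, 8001, 9001],
})

# Number of distinct users who marked each book as to-read.
users_per_book = toread_demo.groupby("book_id")["user_id"].nunique()

# Keep every book that reaches the maximum, so ties stay visible.
tied_books = users_per_book[users_per_book == users_per_book.max()].index

# Translate book_id to goodreads_book_id through the books table.
answer = books_demo.loc[
    books_demo["book_id"].isin(tied_books), "goodreads_book_id"
].tolist()

print(answer)
```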

8. Which is the least used tag, i.e. mapped with the lowest number of books [book_tags.csv]   ? (In case of more than one tag, mention the tag id with the least numerical value)  N

In [304]:
book_tags['tag_id'].value_counts().sort_index()

46       1
47       1
73       1
134      6
156      1
        ..
33966    1
33968    1
34125    1
34126    1
34247    1
Name: tag_id, Length: 4264, dtype: int64
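
The listing above shows counts per tag but does not pick a single answer. A minimal sketch of applying the question's rule (fewest books, then smallest tag id) on toy data follows; `book_tags_demo` is a hypothetical frame whose ids only resemble the ones above.

```python
import pandas as pd

# Toy stand-in for book_tags.csv (illustrative values only).
book_tags_demo = pd.DataFrame({
    "goodreads_book_id": [1, 1, 2, 2, 3, 3],
    "tag_id": [134, 46, 134, 47, 134, 73],
})

# Number of distinct books each tag is mapped to.
books_per_tag = book_tags_demo.groupby("tag_id")["goodreads_book_id"].nunique()

# All tags with the lowest book count, then the smallest tag id among them.
least_used = books_per_tag[books_per_tag == books_per_tag.min()]
least_used_tag = int(least_used.index.min())

print(least_used_tag)
```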

9. Which book (title) has the minimum ‘average_rating’  [books.csv] ? S

In [305]:
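# idxmin() returns an index label, so use .loc (rows were dropped above); re-run to refresh the output below
books.loc[books['average_rating'].idxmin(), 'title']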

'Beautiful Day'

10. Which book (goodreads_book_id) has the least number of count of tags given by the user  [book_tags.csv] ? N

In [320]:
book_tags.groupby(['goodreads_book_id'])['count'].sum().idxmin()

25801299

11. How many tags are there in the dataset [book_tags.csv]  ? N

In [307]:
book_tags.nunique()

goodreads_book_id     488
tag_id               4264
count                1762
dtype: int64

12. What is the average rating of all the books in the dataset based on ‘average_rating’  [books.csv]  ? N

In [308]:
books['average_rating'].mean()

3.9595679012345677

13. Find the number of books published in the year ‘2000’ based on the ‘original_publication_year’ [books.csv] ? N

In [309]:
books[books['original_publication_year']==2000]

Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url,NonEnglish
109,2087,5190,5190,1070015,37,345435168.0,9780345000000.0,Elizabeth Berg,2000.0,Open House,...,49330,1417,714,3699,16410,18155,10352,https://s.gr-assets.com/assets/nophoto/book/11...,https://s.gr-assets.com/assets/nophoto/book/50...,0
442,8138,27864391,27864391,83062,62,,9781475000000.0,David Ebershoff,2000.0,The Danish Girl,...,13984,1700,239,927,4006,5842,2970,https://images.gr-assets.com/books/1451790312m...,https://images.gr-assets.com/books/1451790312s...,0
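
The cell above displays the matching rows; the count itself can be read off directly by summing the boolean mask. A small sketch on toy data, with `books_demo` and its values made up for the example:

```python
import pandas as pd

# Toy stand-in for books.csv; the year column is float, as in the real file.
books_demo = pd.DataFrame({
    "title": ["P", "Q", "R", "S"],
    "original_publication_year": [2000.0, 2011.0, 2000.0, 1999.0],
})

# True counts as 1 when summed, so this is the number of matching rows.
published_2000 = int((books_demo["original_publication_year"] == 2000).sum())

print(published_2000)
```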


14. Predict sentiment using Textblob. How many positive titles (title) are there [books.csv] ? (cut-off >0) N

In [310]:
from textblob import TextBlob

In [311]:
def getSentiment(text):
    # Cut-off > 0: a polarity of exactly 0 (neutral) is grouped with "Negative"
    return "Positive" if TextBlob(text).sentiment.polarity > 0 else "Negative"
# getSentiment("Wild: From Lost to Found on the Pacific Crest")

In [312]:
books['Sentiment'] = books['title'].apply(getSentiment)

In [313]:
len(books[books['Sentiment'] == "Positive"])

84
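
With a cut-off of > 0, titles whose polarity is exactly 0 are not counted as positive. If the neutral titles need to be told apart from negative ones, a three-way label is one option. The sketch below is independent of TextBlob: `label_polarity` is a hypothetical helper, and the scores are made-up numbers standing in for `TextBlob(title).sentiment.polarity`, which lies in the range [-1, 1].

```python
def label_polarity(polarity: float) -> str:
    """Map a polarity score in [-1, 1] to a three-way sentiment label."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"


# Hypothetical polarity scores, one per title (not computed from real titles).
scores = [0.5, 0.0, -0.3, 0.8, 0.0]
labels = [label_polarity(s) for s in scores]

# The count of positive titles is the same as with the two-way labelling.
positive_count = labels.count("Positive")

print(labels, positive_count)
```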