# Capstone 1: Collaborative Filtering Based Book Recommendation Engine

# Project Summary:

## Introduction

Recommendation engines are at the foundation of nearly every major tech company that provides retail, video-on-demand, or music streaming services, and they have redefined the way we shop, search for an old friend, and find new music or places to go. From finding the best product on the market to choosing songs for a drive, recommender systems are everywhere. A recommender system filters the vast amount of information in the user and item databases down to an individual's preferences. For example, Amazon uses one to suggest products to customers, and Spotify uses one to decide which song to play next for a user.
Book reading apps like Goodreads have personally helped me find books I couldn't put down and get back into the habit of reading regularly. While many datasets for movies (Netflix, MovieLens) and songs have been explored to understand how recommendation engines work for those applications and where they can be improved, book recommendation engines have been explored relatively less.
The primary goal of this project is to develop a collaborative filtering book recommendation model, using the Goodreads dataset, that can suggest to readers what to read next. Additionally, data wrangling and exploratory data analysis are used to draw insights about users' reading preferences (e.g. how they like to tag, what ratings they usually provide) and current trends in the book market (e.g. book categories in demand, successful authors).

## Key Business Insights

> **Understanding User Behavior**

- When the tag counts of the generalized tag_names were ranked, the top 10 tag names show that users prefer separate shelves for books they marked as favorite, read in a particular year (e.g. read in 1990, Childhood Books), owned or borrowed from a library, or read in a different format (e.g. ebook/audiobook). The other shelving preferences among the top 10 tag_names were book categories such as 'Fiction' and 'Young - Adult'.

- The count plot of user-provided ratings shows that users are more likely to rate a book 4 or higher. As the tag counts for books marked as favorite are also high (shown previously), it seems that users are more likely to rate and shelve a book when they like it.

- Users use a wide variety of names even when tagging books in the same category. For example, Science Fiction and Fantasy books appear under tags such as 'scifi-fantasy' and 'science-fiction-fantasy'.

> **Factors to Consider for a Book's Rating**

- Exploratory Data Analysis (EDA) shows that the top 15 books by tag_count as readers' favorite are not the same as the top 15 books ranked by user ratings. Also, while the average rating count for the top 15 books marked as readers' favorite (2191465) is significantly higher than the average rating count of all books (23833), the average rating count for the top 15 books ranked by rating (18198) is below that average. Both the favorite and the top-rated 15 books have higher average ratings (4.26 and 4.74, respectively) than the average rating of all books (4.01). These statistics suggest that the average rating alone is not enough to rank books for recommendation. An ideal metric should also consider how many times the book has been marked as favorite and the total number of ratings it received, in addition to its average rating.

- 9 of the top 15 favorite books are most frequently tagged in the Young - Adult category. The other popular categories in the top 15 favorite books are science fiction and fantasy, romance, historical fiction, or fiction in general. The Harry Potter books (ranked 2, 3, 4, 6, 7) have also been frequently tagged as children/childhood books. A quick look at the publication dates of these books reveals that most of the books under the Young Adult and Childhood categories were actually published at least 10 years ago. Therefore, they were probably the favorite books of many adult readers when they were young. This highlights that the year of publication and the dates of ratings can also impact a book's ranking and should be factored into the performance metric. To determine whether the books are equally liked by the current generation of young readers, one can check whether the average number of positive ratings received by a book per year has decreased or increased since its year of publication. As the datasets used in this project do not provide the dates when the books were rated, it was not possible to implement this scheme in the recommendation framework.
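
One way to combine the average rating with the number of ratings, as the first point above suggests, is a damped (Bayesian-style) weighted rating. The sketch below is only an illustration and not the metric used in this project: the function name `weighted_rating` and the choice of the damping constant `m` are assumptions, and the figures plugged in are the group averages quoted above, treated as if they were single books.

```python
def weighted_rating(avg_rating, n_ratings, global_mean, m):
    """Shrink a book's average rating toward the global mean.

    m is an assumed tuning constant: the number of ratings at which the
    book's own average and the global mean receive equal weight.
    """
    weight = n_ratings / (n_ratings + m)
    return weight * avg_rating + (1 - weight) * global_mean


GLOBAL_MEAN = 4.01   # average rating of all books (from the EDA above)
M = 23833            # assumption: use the average rating count of all books

# Group averages quoted above, used here as if they were single books
favorite_like = weighted_rating(4.26, 2191465, GLOBAL_MEAN, M)
top_rated_like = weighted_rating(4.74, 18198, GLOBAL_MEAN, M)

print(round(favorite_like, 3), round(top_rated_like, 3))
```

A book with few ratings is pulled strongly toward the global mean, while a heavily rated book keeps a score close to its own average. A favorite-tag count could be added as a further term in a similar way.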

> **Book Categories**

- Based on the tag_counts of different book categories, 'Fiction' dominates as the most popular category for users of all age groups (i.e. Adult and Young Adult readers). Besides fiction in general, tags related to 'Science Fiction and Fantasy' seem to be used more frequently than other categories in both the adult and young adult sections. Some other popular categories are Crime & Mystery, Historical Fiction etc. Based on these findings, it seems that the demand for different kinds of fiction is higher than for books based on actual events/facts (i.e. History or Science). The market seems to agree with these conclusions, as about 43% of the books in the dataset are Fiction, with Non - Fiction (20.5%), Young Adult (8.3%) and Science Fiction and Fantasy (5.73%) as the other prevailing categories.

- Does this finding indicate that a new fiction book has a higher chance of getting a good rating than a new history book? The answer is probably no. When the average ratings of different book categories were compared, readers did not show a bias towards rating a particular category higher than the others. The average rating in every category is close to the average rating of all the books (4.01), and ratings mostly range from 3.25 to 4.75. Higher variability exists in the ratings of categories that have more books in the market.

> **Authors in Demand**

- JK Rowling seems to be everyone's favorite author, with 4 of her books among the top 15 favorite books. However, when authors were ranked by the number of books they wrote and the average ratings their books received, JK Rowling did not make it into the top 10. Stephen King seems to be the most successful author, with 44 books in the market and an average rating of 3.9. Other successful authors considering both ratings and number of books are Dean Koontz, John Grisham, Nora Roberts and Jodi Picoult. This suggests that an ideal metric to evaluate an author's demand in the market should include the number of books the author wrote, the ratings the books received, the number of books that have been marked as favorite, and the tag counts as favorite for each book.

> **Rating Counts per Book and Per User**

- All the users in the dataset have rated at least 19 books, and the most active users rated 200 books. 80% of the users rated at least 100 books.
- All the books in the dataset received at least 8 ratings. When books were ranked by rating_counts, the top 10 books each received more than 10,000 ratings. The CDF plot of the ratings per book showed that only ~20% of the books received more than 5000 ratings.
- As the number of books in the dataset (10,000) is smaller than the number of users (53,424), sparsity is less likely to be an issue for ML modeling with this dataset.

# Data Wrangling: Outline

- **Import Packages** 
    
- **Import Datasets**
    
- **Lets Take a Look at All the Datasets**
    
- **Comments on the Raw Datasets**
    
- **Data Wrangling**    

    - Identify Connections between Different Dataset and Merge Them as Needed
        - ratings.csv & books.csv
        - tags.csv,book_tags.csv & books.csv
        - Few Observations
        - Drop the Duplicates
        - Merge the Datasets (tags.csv, book_tags.csv & books.csv)
    - Lets Try to Clean This Combined Dataset 
        - Remove Non - English Tag Names
        - Remove Barely Used Tag Names 
        - Define Some Generalized Categories by Clustering User - Provided Tag Names
            - Tag for Favorite Books
            - Tag for Children Books
            - Tag for Owned Books
            - Find and Group All Young - Adult Books 
            - Tags for Adult Readers
                - Lets Group Tags Relevant to Fiction & Non - Fiction
                - Lets Group Tags for Audio Books and Ebooks 
                - Lets Group Tags for Book Reads in a Particular Year 
                - Lets Group Other Popular Tags to Science, History, Women, Crime & Mystery, Science & Fantasy 
                
- **Export Tidy Dataset to CSV for EDA**

# Import Packages 

In [1]:
%matplotlib inline
#Import package for pandas dataframe
import pandas as pd
# Import the regular expression module
import re

# Import Dataset

In [2]:
r = pd.read_csv( 'ratings.csv' ) # ratings for different books
b = pd.read_csv( 'books.csv' ) # list of books and necessary information about the books
t = pd.read_csv( 'tags.csv')  # Tag id and Tag names  used by readers to shelve a book
bt = pd.read_csv( 'book_tags.csv') #Records of tags received by different books 
tr = pd.read_csv( 'to_read.csv')# books marked by users as to read 


# Lets Take a Look at All the Datasets

## ratings.csv

In [3]:
# size of the dataset 
len(r)

5976479

In [4]:
# What kind of information does it have 
r.head()

Unnamed: 0,user_id,book_id,rating
0,1,258,5
1,2,4081,4
2,2,260,5
3,2,9296,5
4,2,2318,3


In [5]:
# Check for data type, null entries
r.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5976479 entries, 0 to 5976478
Data columns (total 3 columns):
user_id    int64
book_id    int64
rating     int64
dtypes: int64(3)
memory usage: 136.8 MB


In [6]:
# Any null entries
r.isnull().values.any()

False

In [7]:
# Check for duplicates
r.loc[r.duplicated(keep = False),:]

Unnamed: 0,user_id,book_id,rating


## books.csv

In [8]:
# size of the dataset 
len(b)

10000

In [9]:
# What kind of information does it have 
b.head()

Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
3,4,2657,2657,3275794,487,61120081,9780061000000.0,Harper Lee,1960.0,To Kill a Mockingbird,...,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
4,5,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,...,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...


In [10]:
# Check for data type, null entries
b.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 23 columns):
book_id                      10000 non-null int64
goodreads_book_id            10000 non-null int64
best_book_id                 10000 non-null int64
work_id                      10000 non-null int64
books_count                  10000 non-null int64
isbn                         9300 non-null object
isbn13                       9415 non-null float64
authors                      10000 non-null object
original_publication_year    9979 non-null float64
original_title               9415 non-null object
title                        10000 non-null object
language_code                8916 non-null object
average_rating               10000 non-null float64
ratings_count                10000 non-null int64
work_ratings_count           10000 non-null int64
work_text_reviews_count      10000 non-null int64
ratings_1                    10000 non-null int64
ratings_2                    10000 n

In [11]:
# Check for duplicates
b.loc[b.duplicated(keep = False),:]

Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url


## tags.csv

In [12]:
# size of the dataset 
len(t)

34252

In [13]:
# What kind of information does it have 
t.head()

Unnamed: 0,tag_id,tag_name
0,0,-
1,1,--1-
2,2,--10-
3,3,--12-
4,4,--122-


In [14]:
# Check for data type, null entries
t.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34252 entries, 0 to 34251
Data columns (total 2 columns):
tag_id      34252 non-null int64
tag_name    34252 non-null object
dtypes: int64(1), object(1)
memory usage: 535.3+ KB


In [15]:
# Check for duplicates
t.loc[t.duplicated(keep = False),:]

Unnamed: 0,tag_id,tag_name


## book_tags.csv

In [16]:
# size of the dataset 
len(bt)

999912

In [17]:
# What kind of information does it have 
bt.head()

Unnamed: 0,goodreads_book_id,tag_id,count
0,1,30574,167697
1,1,11305,37174
2,1,11557,34173
3,1,8717,12986
4,1,33114,12716


In [18]:
# Check for data type, null entries
bt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999912 entries, 0 to 999911
Data columns (total 3 columns):
goodreads_book_id    999912 non-null int64
tag_id               999912 non-null int64
count                999912 non-null int64
dtypes: int64(3)
memory usage: 22.9 MB


In [19]:
# Check for duplicates
bt.loc[bt.duplicated(keep = False),:]

Unnamed: 0,goodreads_book_id,tag_id,count
159370,22369,25148,4
159371,22369,25148,4
265127,52629,10094,1
265128,52629,10094,1
265139,52629,2928,1
265140,52629,2928,1
265154,52629,13272,1
265155,52629,13272,1
265186,52629,13322,1
265187,52629,13322,1


## to_read.csv

In [20]:
# size of the dataset 
len(tr)

912705

In [21]:
# What kind of information does it have 
tr.head()

Unnamed: 0,user_id,book_id
0,9,8
1,15,398
2,15,275
3,37,7173
4,34,380


In [22]:
# Check for data type, null entries
tr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 912705 entries, 0 to 912704
Data columns (total 2 columns):
user_id    912705 non-null int64
book_id    912705 non-null int64
dtypes: int64(2)
memory usage: 13.9 MB


In [23]:
# Check for duplicates
tr.loc[tr.duplicated(keep = False),:]

Unnamed: 0,user_id,book_id


## Comments on the Raw Dataset

> **ratings.csv** (r) is clean; it does not have any missing/null entries or duplicate rows.<br>
> **books.csv** (b) has some entries missing in some of the columns (i.e. isbn, isbn13, original_publication_year, original_title, language_code).<br>
> **tags.csv** (t) will require further investigation, as the tag_names are not clear from the initial inspection. However, the dataset does not have any duplicate/missing entries.<br>
> **book_tags.csv** (bt) has duplicate rows.<br>
> **to_read.csv** (tr) looks clean.<br>

# Data Wrangling

**Outline for Data Wrangling**

- 5.1 Identify Connections between Different Datasets and Merge Them as Needed
    - ratings.csv & books.csv
    - tags.csv, book_tags.csv & books.csv
    - Few Observations
    - Drop the Duplicates
    - Merge the Datasets (tags.csv, book_tags.csv & books.csv)
- 5.2 Lets Try to Clean This Combined Dataset
    - Remove Non - English Tag Names
    - Remove Barely Used Tag Names
    - Define Some Generalized Categories by Clustering User - Provided Tag Names
        - Tag for Favorite Books
        - Tag for Children Books
        - Tag for Owned Books
        - Find and Group All Young - Adult Books
        - Tags for Adult Readers
            - Lets Group Tags Relevant to Fiction & Non - Fiction
            - Lets Group Tags for Audio Books and Ebooks
            - Lets Group Tags for Book Reads in a Particular Year
            - Lets Group Other Popular Tags to Science, History, Women, Crime & Mystery, Science & Fantasy

## Identify Connections between Different Datasets and  Merge Them as Needed

###  'ratings.csv' & 'books.csv':

> Ratings are sorted chronologically, oldest first.

In [24]:
# Size of the dataset
len(r)

5976479

In [25]:
# Size of the dataset
len(b)

10000

> **There are two things to note after looking at both the ratings.csv and books.csv datasets in section 3.**
1. Both datasets have a common column 'book_id'. So it needs to be checked whether this column connects the two datasets (i.e. does the 'book_id' column in ratings.csv contain the same book_ids as the 'book_id' column in books.csv?).
2. Since len(r) >> len(b), there is possibly more than one rating per book if the 'book_id' column is common to the two datasets.

In [26]:
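# Let's check whether the 'book_id' columns contain the same elements
if set(b.book_id) == set(r.book_id):
    print("The book_id column contains same elements, so it can be used to connect the two datasets")
else:
    print("The book_id columns differ, so they cannot be used directly to connect the two datasets")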

The book_id column contains same elements, so it can be used to connect the two datasets


In [27]:
# Lets check the number of unique elements for all features in ratings.csv dataset
r.nunique()

user_id    53424
book_id    10000
rating         5
dtype: int64

> There are 53424 users rating 10000 books in 5 categories from 1-5, so we can conclude that each book received multiple ratings
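
The counts above also give a rough sense of how full the user-by-book rating matrix is. The short calculation below is a sketch that only reuses the numbers printed earlier (`len(r)` and `r.nunique()`); the variable names are illustrative.

```python
n_ratings = 5976479   # len(r)
n_users = 53424       # unique user_id values in r
n_books = 10000       # unique book_id values in r

# Fraction of all possible (user, book) pairs that actually have a rating
density = n_ratings / (n_users * n_books)
ratings_per_book = n_ratings / n_books
ratings_per_user = n_ratings / n_users

print(f"density: {density:.4f}")
print(f"average ratings per book: {ratings_per_book:.1f}")
print(f"average ratings per user: {ratings_per_user:.1f}")
```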

###  'tags.csv', 'book_tags.csv' & 'books.csv':

> From the observation of these three dataframes, it seems that the column 'tag_id' is common to the dataframes for tags.csv and book_tags.csv. Also, the column 'goodreads_book_id' is common to the dataframes for book_tags.csv and books.csv.

> Let's take a look at more rows in tags.csv, as the few rows observed earlier with t.head() did not make it clear what kind of information the 'tag_name' column contains.

In [28]:
t.tail(10)

Unnamed: 0,tag_id,tag_name
34242,34242,漫画
34243,34243,골든
34244,34244,﹏moonplus-reader﹏
34245,34245,ﺭﺿﻮﻯ-عاشور
34246,34246,ﻳﻮﺳﻒ-زيدان
34247,34247,Ｃhildrens
34248,34248,Ｆａｖｏｒｉｔｅｓ
34249,34249,Ｍａｎｇａ
34250,34250,ＳＥＲＩＥＳ
34251,34251,ｆａｖｏｕｒｉｔｅｓ


### Few Observations

> Looking at some names [e.g. Favorites, Manga, SERIES etc.] in the column 'tag_name', it seems that the dataset provides the different shelf names corresponding to the different tag_ids. <br>
Also, some of the tag_names are not in English. It will be a good idea to process them later in a data cleaning step. <br>
> Since tag_id is a common column between the 'book_tags.csv' and 'tags.csv' datasets, we can merge the two datasets on tag_id.

### Drop Duplicates
From our data inspection in section 4, we know that book_tags.csv has duplicates. Let's remove the duplicates before merging the datasets.

In [29]:
bt = bt.drop_duplicates()
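
To make the effect of this step concrete, the sketch below rebuilds a few of the duplicated rows reported by the duplicate check on book_tags.csv in a small, hypothetical frame (`sample_bt`), adds the first row of `bt.head()` as a non-duplicated row, and drops the repeats.

```python
import pandas as pd

# Rows copied from the outputs shown earlier for book_tags.csv
sample_bt = pd.DataFrame({
    'goodreads_book_id': [22369, 22369, 52629, 52629, 1],
    'tag_id':            [25148, 25148, 10094, 10094, 30574],
    'count':             [4, 4, 1, 1, 167697],
})

# keep=False flags every member of a duplicated group
n_flagged = int(sample_bt.duplicated(keep=False).sum())

# drop_duplicates keeps the first row of each group by default
deduped = sample_bt.drop_duplicates()

print(n_flagged, len(sample_bt), len(deduped))
```

Only rows that are identical in every column are removed. If the same book and tag pair ever appeared with two different counts, `drop_duplicates(subset=['goodreads_book_id', 'tag_id'])` would be needed instead.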

### Merge the Datasets

In [30]:
bt = bt.merge( t, on = 'tag_id' )

In [31]:
bt.head()

Unnamed: 0,goodreads_book_id,tag_id,count,tag_name
0,1,30574,167697,to-read
1,2,30574,24549,to-read
2,3,30574,496107,to-read
3,5,30574,11909,to-read
4,6,30574,298,to-read


In [32]:
tag_table = bt.merge( b[[ 'goodreads_book_id', 'title']], on = 'goodreads_book_id' )
tag_table.head()

Unnamed: 0,goodreads_book_id,tag_id,count,tag_name,title
0,1,30574,167697,to-read,Harry Potter and the Half-Blood Prince (Harry ...
1,1,11305,37174,fantasy,Harry Potter and the Half-Blood Prince (Harry ...
2,1,11557,34173,favorites,Harry Potter and the Half-Blood Prince (Harry ...
3,1,8717,12986,currently-reading,Harry Potter and the Half-Blood Prince (Harry ...
4,1,33114,12716,young-adult,Harry Potter and the Half-Blood Prince (Harry ...


The combined table above contains all the tag_names that have been used by different users to categorize the books. In the next step, we will perform some data cleaning to identify the most frequently occurring tag_names that can be used to categorize the books.
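
Both merges above use the default inner join, so any row without a match in the other table is dropped silently. The sketch below shows one way to check for that on tiny, made-up frames: `validate` and `indicator` are existing `DataFrame.merge` parameters, but the frame contents (`sample_bt`, `sample_t`) and the unmatched tag_id 99999 are hypothetical.

```python
import pandas as pd

sample_bt = pd.DataFrame({
    'goodreads_book_id': [1, 1, 2],
    'tag_id':            [30574, 11305, 99999],   # 99999 has no name below
    'count':             [167697, 37174, 5],
})
sample_t = pd.DataFrame({
    'tag_id':   [30574, 11305],
    'tag_name': ['to-read', 'fantasy'],
})

# validate raises an error if tag_id is not unique in sample_t;
# indicator adds a _merge column showing where each row came from
checked = sample_bt.merge(sample_t, on='tag_id', how='left',
                          validate='many_to_one', indicator=True)

# Rows that found no tag name in sample_t
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched[['goodreads_book_id', 'tag_id']])
```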

## Lets Try to Clean This Combined Dataset

### Remove Non - English Tag_names & Non - English Titles
> From the tags.csv table, it seemed that there are a lot of non-English tag_names that are not useful for extracting any meaningful information. So getting rid of those rows from the combined dataset will help to simplify the dataset.

In [33]:
# Check the number of unique tag_names in the dataset prior to filtering out non-ascii elements
tag_table.tag_name.nunique()

34252

In [34]:
# Let's filter out rows whose tag_name contains non-ascii characters
non_ascii_mask = tag_table['tag_name'].str.contains(r'[^\x00-\x7F]+')
non_ascii_tags = tag_table[non_ascii_mask]
tag_table = tag_table[~non_ascii_mask]

#Lets check the size after applying the filter 
tag_table.tag_name.nunique()

32963
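
Some of the tags removed by this filter, such as `Ｆａｖｏｒｉｔｅｓ` in the `t.tail(10)` output above, are ordinary English words typed with full-width characters. As an optional alternative (not part of the original workflow), Unicode NFKC normalization can fold those into plain ASCII before the filter is applied. The sketch uses only the standard library; the helper name `is_ascii_after_nfkc` is illustrative.

```python
import re
import unicodedata

NON_ASCII = re.compile(r'[^\x00-\x7F]')


def is_ascii_after_nfkc(tag):
    """Return (normalized_tag, flag), where flag is True when the
    normalized tag contains no non-ASCII characters."""
    normalized = unicodedata.normalize('NFKC', tag)
    return normalized, NON_ASCII.search(normalized) is None


# Tags taken from the t.tail(10) output above
for tag in ['Ｆａｖｏｒｉｔｅｓ', 'Ｃhildrens', '漫画', '골든']:
    print(tag, '->', is_ascii_after_nfkc(tag))
```

Tags in other scripts stay non-ASCII after normalization and would still be filtered out.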

In [35]:
# Let's filter out rows whose title contains non-ascii characters
non_ascii_mask = tag_table['title'].str.contains(r'[^\x00-\x7F]+')
non_ascii_title = tag_table[non_ascii_mask]
tag_table = tag_table[~non_ascii_mask]

#Lets check the size after applying the filter 
tag_table.title.nunique()

9814

### Remove Barely Used Tag_Names

In [36]:
# lets check the statistics for frequency of these tags in the book_id dataset
tag_table.tag_name.value_counts(dropna=False).describe()

count    32405.000000
mean        30.288196
std        281.439575
min          1.000000
25%          1.000000
50%          2.000000
75%          5.000000
max       9834.000000
Name: tag_name, dtype: float64

> Based on the information above, there are now 32405 tag_names to categorize a total of roughly 10000 books, but 75% of the tag_names are attached to 5 or fewer books. These tag_names probably represent customized categories made by individual users and will not be useful for clustering the books by popular categories. So let's try to identify tags that are widely used to shelve the books. To identify such tags, we can group the dataframe ('tag_table') by tag_id and count the number of books that each tag_id covers.

In [37]:
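# Helper (not used below): return the tag name if it is in tag_list, else None
def popular_tags(tagname, tag_list):
    if tagname in tag_list:
        return tagname
    return None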

tag_table_shortened = tag_table[['goodreads_book_id','tag_id','tag_name']]
books_per_tag = tag_table_shortened.groupby(['tag_id','tag_name']).count().sort_values(by='goodreads_book_id',ascending = False)
books_per_tag.rename(columns = {'goodreads_book_id':'Number of Books'}, inplace = True)
books_per_tag.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Number of Books
tag_id,tag_name,Unnamed: 2_level_1
30574,to-read,9834
11557,favorites,9734
22743,owned,9711
5207,books-i-own,9660
8717,currently-reading,9631


> It seems that the widely used tags (to-read, favorites, owned etc.) cover almost all the books. From this table, let's identify and eliminate tags that have been used for 300 or fewer books, as each of them covers only a negligible portion (at most 3%) of the 10000 books in the collection.

In [38]:
tags_to_keep = books_per_tag.index[books_per_tag['Number of Books']>300].tolist()
tag_id_to_keep, tag_names_to_keep = zip(*tags_to_keep)
len(set(tag_names_to_keep))
len(set(tag_id_to_keep))

500

In [39]:
# Define a function that returns True when a value is in the given list
def Filtered(value, List):
    return value in List
# Apply the function defined above to make a boolean mask based on tag_id_to_keep 
Filter = tag_table.tag_id.apply(lambda x: Filtered(x, tag_id_to_keep))
# Use the boolean mask to filter data
tag_table = tag_table[Filter]
# Check the number of unique values in each column after filtering
tag_table.nunique()

goodreads_book_id    9850
tag_id                500
count                9305
tag_name              500
title                9814
dtype: int64
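
The same filter can also be written without a helper function. As a sketch on a small, made-up frame (the rows and the `ids_to_keep` tuple are hypothetical), pandas' `Series.isin` builds the boolean mask directly and is usually faster than `apply` on large tables.

```python
import pandas as pd

sample = pd.DataFrame({
    'tag_id':   [30574, 11557, 12345, 30574],
    'tag_name': ['to-read', 'favorites', 'rarely-used-tag', 'to-read'],
})
ids_to_keep = (30574, 11557)   # stands in for tag_id_to_keep

# Boolean mask: True where tag_id is one of the ids to keep
mask = sample['tag_id'].isin(ids_to_keep)
filtered = sample[mask]

print(filtered)
```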

## Define Some Generalized Categories by Clustering User Provided Tag_Names

### Tag for Favorite Books

Let's look at some of the remaining tag_names to check if we can do further processing on the dataset.

In [40]:
tag_table.tag_name[50:100]

55                my-favorites
56                      own-it
57             childrens-books
58                     library
59                       audio
60         young-adult-fiction
61                       novel
62                        2005
63               scifi-fantasy
65                       faves
66             favorite-series
67                read-in-2015
68                 made-me-cry
69                    juvenile
70          shelfari-favorites
71                      kindle
72                       youth
73                     romance
74                   favourite
75                      to-buy
76                read-in-2014
77                  to-re-read
79         childhood-favorites
80                  kids-books
81                       ebook
83                contemporary
84             read-in-english
85                      5-star
86               coming-of-age
87     science-fiction-fantasy
88                read-in-2017
89                     england
90      

In [41]:
prog = re.compile(r'\w*[-]?[Ff]av\w*')  # raw string avoids invalid-escape warnings for \w
Favorites = []
for row in tag_table.tag_name[50:200]:
    result = prog.match(row)
    if bool(result) and row not in Favorites:
        Favorites.append(row)

In [42]:
Favorites

['my-favorites',
 'faves',
 'favorite-series',
 'shelfari-favorites',
 'favourite',
 'childhood-favorites',
 'favs',
 'favorites',
 'favourites',
 'favorite']

> It looks like users have shelved their favorite books with tag_names such as ['my-favorites','faves','favorite-series', 'shelfari-favorites', 'favourite', 'childhood-favorites','favourite-books','favs'] etc. We will merge all of these into one common category, Favorite. <br>

> As a first step, let's define functions to find custom categories that follow a similar pattern (e.g. strings containing the word 'fav') and replace them with a common category (e.g. Favorite) in the 'tag_name' column.

In [43]:
def identify_custom_categories (tagname, pattern):
    prog = re.compile(pattern)
    matched_tags = []
    for row in tagname:
        result = prog.match(row)
        if bool(result) and row not in matched_tags:
            matched_tags.append(row)
    return matched_tags

def replace_custom_categories(tag_name, custom_names, preferred_category_name):
    if tag_name in custom_names:
        tag_name = preferred_category_name
    return tag_name

In [44]:
# Identify custom tag_names for favorite
pattern_for_Favorites = r'\w*[-]?\w*[-]?[Ff]av\w*'
custom_tags_for_favorites = identify_custom_categories(tag_table.tag_name, str(pattern_for_Favorites))
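
To see how the two helpers behave, and why the pattern above allows for a second hyphenated prefix, the sketch below runs them on a handful of tag names taken from the outputs in this section. The helpers are re-declared in slightly condensed form so the snippet runs on its own; `sample_tags` is an illustrative list.

```python
import re


def identify_custom_categories(tagname, pattern):
    """Collect the unique tags whose beginning matches the pattern."""
    prog = re.compile(pattern)
    matched_tags = []
    for row in tagname:
        if prog.match(row) and row not in matched_tags:
            matched_tags.append(row)
    return matched_tags


def replace_custom_categories(tag_name, custom_names, preferred_category_name):
    """Swap a custom tag for the preferred category name, else keep it."""
    return preferred_category_name if tag_name in custom_names else tag_name


sample_tags = ['my-favorites', 'faves', 'romance', 'all-time-favorites', 'faves']

# Pattern with one optional hyphenated prefix (used in the first exploration)
narrow = identify_custom_categories(sample_tags, r'\w*[-]?[Ff]av\w*')
# Pattern with two optional hyphenated prefixes (used on the full column)
wide = identify_custom_categories(sample_tags, r'\w*[-]?\w*[-]?[Ff]av\w*')

renamed = [replace_custom_categories(t, wide, 'Favorite') for t in sample_tags]
print(narrow)
print(wide)
print(renamed)
```

Because `re.match` anchors at the start of the string and `\w` does not match a hyphen, the narrower pattern misses `all-time-favorites`, which has two hyphenated prefixes before `fav`.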


In [45]:
# Rename the custom tag_names as Favorite
preferred_name = 'Favorite'
tag_table.tag_name = tag_table.tag_name.apply(lambda x: replace_custom_categories (x,custom_tags_for_favorites,preferred_name))

# Save all the names for favorites in a dictionary to help keyword search later in the project
KW_CategorY_Repos  = {}
KW_CategorY_Repos ['Category'] = ['Favorite']
KW_CategorY_Repos ['Possible Search KW'] = []
KW_CategorY_Repos ['Possible Search KW'].append(custom_tags_for_favorites)

In [46]:
# Check the size after filtering
tag_table.tag_name.nunique()
KW_CategorY_Repos

{'Category': ['Favorite'],
 'Possible Search KW': [['favorites',
   'favourites',
   'all-time-favorites',
   'favorite-books',
   'favorite',
   'my-favorites',
   'faves',
   'favorite-series',
   'shelfari-favorites',
   'favourite',
   'childhood-favorites',
   'favs',
   'favorite-authors',
   'favorite-author']]}
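
The repository above stores categories and keyword lists as two parallel lists. For the keyword search mentioned in the code comment, a flat lookup from each custom tag to its category may be easier to query. The following is a sketch under that assumption; `build_keyword_lookup` is a hypothetical helper, and the sample repository holds only a few of the keywords shown above.

```python
def build_keyword_lookup(repos):
    """Map every custom tag name to its generalized category."""
    lookup = {}
    for category, keywords in zip(repos['Category'], repos['Possible Search KW']):
        for keyword in keywords:
            lookup[keyword] = category
    return lookup


# Shortened copy of the structure printed above
sample_repos = {
    'Category': ['Favorite'],
    'Possible Search KW': [['favorites', 'favourites', 'faves', 'favs']],
}

lookup = build_keyword_lookup(sample_repos)
print(lookup.get('faves'))      # category for a known custom tag
print(lookup.get('romance'))    # None, because it is not a stored keyword
```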

In [47]:
# Check if Replacement Worked
tag_table.tag_name [50:100]

55                    Favorite
56                      own-it
57             childrens-books
58                     library
59                       audio
60         young-adult-fiction
61                       novel
62                        2005
63               scifi-fantasy
65                    Favorite
66                    Favorite
67                read-in-2015
68                 made-me-cry
69                    juvenile
70                    Favorite
71                      kindle
72                       youth
73                     romance
74                    Favorite
75                      to-buy
76                read-in-2014
77                  to-re-read
79                    Favorite
80                  kids-books
81                       ebook
83                contemporary
84             read-in-english
85                      5-star
86               coming-of-age
87     science-fiction-fantasy
88                read-in-2017
89                     england
90      

### Tag for Children Books

Let's group tag names such as kids-books and children-s-literature into one single category named Children Books.

In [48]:
# Identify custom tag_names with children
pattern_for_Children = r'\w*[-]?[cC]hildren\w*'
custom_tags_for_Children = identify_custom_categories(tag_table.tag_name, str(pattern_for_Children))

# Identify custom tag_names with childhood
pattern_for_Childhood = r'\w*[-]?[cC]hildhood\w*'
custom_tags_for_Childhood = identify_custom_categories(tag_table.tag_name, str(pattern_for_Childhood))

# Identify custom tag_names with kids
pattern_for_Kids = r'\w*[-]?[kK]id\w*'
custom_tags_for_Kids = identify_custom_categories(tag_table.tag_name, str(pattern_for_Kids))

custom_tags_for_Childre_and_Kids = custom_tags_for_Kids + custom_tags_for_Children + custom_tags_for_Childhood

print(custom_tags_for_Childre_and_Kids)

['kids', 'kids-books', 'kid-lit', 'kid-books', 'childrens', 'children', 'children-s', 'children-s-books', 'childrens-books', 'children-s-literature', 'children-s-lit', 'childrens-lit', 'children-young-adult', 'children-s-fiction', 'childrens-literature', 'childrens-fiction', 'children-books', 'children-ya', 'children-s-book', 'childhood', 'childhood-books', 'childhood-reads', 'my-childhood']


In [49]:
# Rename the custom tag_names as Children Books
preferred_name = 'Children Books'
tag_table.tag_name = tag_table.tag_name.apply(lambda x: replace_custom_categories (x,custom_tags_for_Childre_and_Kids
, preferred_name))

# Save all the names for children books in the dictionary to help keyword search later in the project
KW_CategorY_Repos ['Category'].append('Children Books')
KW_CategorY_Repos ['Possible Search KW'].append(custom_tags_for_Childre_and_Kids)

In [50]:
tag_table.tag_name.nunique()

465

In [51]:
#Check if Replacement Worked
tag_table.tag_name [50:100]

55                    Favorite
56                      own-it
57              Children Books
58                     library
59                       audio
60         young-adult-fiction
61                       novel
62                        2005
63               scifi-fantasy
65                    Favorite
66                    Favorite
67                read-in-2015
68                 made-me-cry
69                    juvenile
70                    Favorite
71                      kindle
72                       youth
73                     romance
74                    Favorite
75                      to-buy
76                read-in-2014
77                  to-re-read
79                    Favorite
80              Children Books
81                       ebook
83                contemporary
84             read-in-english
85                      5-star
86               coming-of-age
87     science-fiction-fantasy
88                read-in-2017
89                     england
90      

### Tag for Owned Books

In [52]:
## Tag for Owned Books
pattern_for_Owned = r'\w*[-]?\w*[-]?[oO]wn\w*'  # raw string avoids invalid-escape warnings for \w
custom_tags_for_Owned = identify_custom_categories(tag_table.tag_name, str(pattern_for_Owned))
custom_tags_for_Owned 

['books-i-own',
 'owned',
 'owned-books',
 'i-own',
 'own-it',
 'own-to-read',
 'owned-to-read',
 'books-owned',
 'i-own-it',
 'to-read-owned',
 'to-read-own']

In [53]:
# Rename the custom tag_names as Owned Books
preferred_name = 'Owned Books'
tag_table.tag_name = tag_table.tag_name.apply(lambda x: replace_custom_categories (x,custom_tags_for_Owned
, preferred_name))

In [54]:
#Lets check the tag_names after applying the filter 
tag_table.tag_name.nunique()

455
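
The helpers `identify_custom_categories` and `replace_custom_categories` are defined earlier in the notebook and are not reproduced here. As a rough illustration of the identify-then-rename step, the sketch below uses two hypothetical stand-ins (`find_matching_tags` and `rename_tags`, names invented for this example) on a toy Series. It assumes the pattern is matched from the start of each tag with `re.match`, which may differ from what the real helpers do.

```python
import re
import pandas as pd

def find_matching_tags(tag_names, pattern):
    """Hypothetical stand-in: unique tag names that match `pattern` from the start."""
    regex = re.compile(pattern)
    return [tag for tag in tag_names.unique() if regex.match(tag)]

def rename_tags(tag_names, old_names, new_name):
    """Hypothetical stand-in: replace every tag found in `old_names` with `new_name`."""
    return tag_names.where(~tag_names.isin(old_names), new_name)

# Toy data, not taken from the Goodreads tag table
toy_tags = pd.Series(['owned', 'books-i-own', 'own-it', 'library', 'owned'])
pattern_for_owned = r'\w*[-]?\w*[-]?[oO]wn\w*'

owned_tags = find_matching_tags(toy_tags, pattern_for_owned)
renamed = rename_tags(toy_tags, owned_tags, 'Owned Books')
print(owned_tags)
print(renamed.tolist())
```

Tags that do not match (here `library`) are left untouched, so the rename can be applied repeatedly, one category at a time.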

## Find All Young Adult Books (Fiction, Non-Fiction & Other Categories)

In [55]:
## Tag for Young Adult Books
pattern_for_YA = '\w*[-]?[Yy]oung\w*'
searchfor = ['ya', 'juvenile','teen']
custom_tags_for_YA1 = list(tag_table.tag_name[tag_table.tag_name.str.contains('|'.join(searchfor))])
custom_tags_for_YA2 = identify_custom_categories(tag_table.tag_name, str(pattern_for_YA))
custom_tags_for_YA = custom_tags_for_YA1 + custom_tags_for_YA2

> It looks like there is a broad range of young adult tags with different subcategories. Let's try to group them. First, let's get all the unique tags for young adult books.

In [56]:
set(custom_tags_for_YA)

{'juvenile',
 'juvenile-fiction',
 'teen',
 'teen-fiction',
 'ya',
 'ya-books',
 'ya-contemporary',
 'ya-fantasy',
 'ya-fiction',
 'ya-lit',
 'ya-paranormal',
 'ya-romance',
 'young-adult',
 'young-adult-fantasy',
 'young-adult-fiction',
 'youngadult'}

In [57]:
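# Let's group the different young adult tags.
# custom_tags_for_YA keeps one entry per matching row, so the lengths printed below
# count tag occurrences rather than unique tag names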
Custom_Tag_YoungAdult =  [x for x in custom_tags_for_YA if x in ['ya','ya-books','ya-lit', 'young-adult','youngadult', 'juvenile', 'ya-contemporary','teen']]
Custom_Tag_YoungAdult_Fantasy =  [x for x in custom_tags_for_YA if x in ['ya-fantasy','young-adult-fantasy']]
Custom_Tag_YoungAdult_Fiction =  [x for x in custom_tags_for_YA if x in ['ya-fiction','young-adult-fiction','juvenile-fiction', 'teen-fiction']]
Custom_Tag_YoungAdult_Romance =  [x for x in custom_tags_for_YA if x in ['ya-romance']]
Custom_Tag_YoungAdult_Paranormal =  [x for x in custom_tags_for_YA if x in ['ya-paranormal']]
print('Number of tags in each category is as follows -',
      'YoungAdult:', len(Custom_Tag_YoungAdult),
      'YoungAdult_Fantasy:', len(Custom_Tag_YoungAdult_Fantasy),
      'YoungAdult_Fiction:', len(Custom_Tag_YoungAdult_Fiction),
      'YoungAdult_Romance:', len(Custom_Tag_YoungAdult_Romance),
      'YoungAdult_Paranormal:', len(Custom_Tag_YoungAdult_Paranormal))
                                                              

Number of tags in each category is as follows - YoungAdult: 8022 YoungAdult_Fantasy: 1008 YoungAdult_Fiction: 2962 YoungAdult_Romance: 477 YoungAdult_Paranormal: 356
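
The counts above are much larger than the 16 unique tag names listed earlier because `custom_tags_for_YA` keeps one entry per matching row. A small sketch on toy data (not the Goodreads table) shows the difference between counting occurrences and counting unique tag names:

```python
import pandas as pd

# Toy tag column: one row per (book, tag) pair, so popular tags repeat
toy_tags = pd.Series(['ya', 'ya', 'teen', 'ya-fantasy', 'romance', 'ya'])

matches = list(toy_tags[toy_tags.str.contains('ya|teen')])
n_occurrences = len(matches)   # counts every matching row
n_unique = len(set(matches))   # counts distinct tag names only
print(n_occurrences, n_unique)
```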


In [58]:
# Rename the custom tag_names for Young Adult Books
preferred_name = 'Young-Adult'
tag_table.tag_name = tag_table.tag_name.apply(lambda x: replace_custom_categories (x,Custom_Tag_YoungAdult,preferred_name))

# Save the tag names merged into 'Young-Adult' in a dictionary to help keyword search later in the project
KW_CategorY_Repos ['Category'].append( 'Young-Adult')
KW_CategorY_Repos ['Possible Search KW'].append(Custom_Tag_YoungAdult)

In [59]:
preferred_name = 'Young-Adult-Fantasy'
tag_table.tag_name = tag_table.tag_name.apply(lambda x: replace_custom_categories (x,Custom_Tag_YoungAdult_Fantasy
, preferred_name))

preferred_name = 'Young-Adult-Fiction'
tag_table.tag_name = tag_table.tag_name.apply(lambda x: replace_custom_categories (x,Custom_Tag_YoungAdult_Fiction
, preferred_name))

preferred_name = 'Young-Adult-Romance'
tag_table.tag_name = tag_table.tag_name.apply(lambda x: replace_custom_categories (x,Custom_Tag_YoungAdult_Romance
, preferred_name))

preferred_name = 'Young-Adult-Paranormal'
tag_table.tag_name = tag_table.tag_name.apply(lambda x: replace_custom_categories (x,Custom_Tag_YoungAdult_Paranormal
, preferred_name))


# Save the tag names merged into each Young-Adult subcategory in a dictionary to help keyword search later in the project
KW_CategorY_Repos ['Category'].append('Young-Adult-Fantasy')
KW_CategorY_Repos ['Possible Search KW'].append(Custom_Tag_YoungAdult_Fantasy)

KW_CategorY_Repos ['Category'].append('Young-Adult-Fiction')
KW_CategorY_Repos ['Possible Search KW'].append(Custom_Tag_YoungAdult_Fiction)

KW_CategorY_Repos ['Category'].append('Young-Adult-Romance')
KW_CategorY_Repos ['Possible Search KW'].append(Custom_Tag_YoungAdult_Romance)

KW_CategorY_Repos ['Category'].append('Young-Adult-Paranormal')
KW_CategorY_Repos ['Possible Search KW'].append(Custom_Tag_YoungAdult_Paranormal)

In [60]:
#Lets check the tag_names after applying the filter 
tag_table.tag_name.nunique()

444

## Tag for Adult Readers

> Let's take a look at different segments of the 'tag_table' dataframe and then search for tag_names used by adult readers.

In [61]:
# Identify custom tag_names with the word fiction
pattern_for_Fiction = '\w*[-]?[fF]iction\w*'
custom_tags_for_Fiction = identify_custom_categories(tag_table.tag_name, str(pattern_for_Fiction))


# Identify some keywords that stood out while observing the dataframe and then do a search
searchfor = ['history', 'History','sci','mystery','Mystery','crime','Crime','Women','women','feminism','girl']
custom_tags_for_other_Adult_Books = tag_table.tag_name[tag_table.tag_name.str.contains('|'.join(searchfor))]
custom_tags_for_Adult_Books = list(custom_tags_for_other_Adult_Books) + custom_tags_for_Fiction
set(custom_tags_for_Adult_Books)

{'adult-fiction',
 'american-history',
 'classic-fiction',
 'contemporary-fiction',
 'crime',
 'crime-fiction',
 'crime-mystery',
 'crime-mystery-thriller',
 'crime-thriller',
 'fantasy-fiction',
 'fantasy-sci-fi',
 'fantasy-science-fiction',
 'fantasy-scifi',
 'feminism',
 'fiction',
 'fiction-fantasy',
 'fiction-general',
 'fiction-historical',
 'fiction-to-read',
 'general-fiction',
 'historic-fiction',
 'historical-fiction',
 'history',
 'literary-fiction',
 'modern-fiction',
 'murder-mystery',
 'mystery',
 'mystery-crime',
 'mystery-detective',
 'mystery-series',
 'mystery-suspense',
 'mystery-suspense-thriller',
 'mystery-thriller',
 'mystery-thriller-suspense',
 'mystery-thrillers',
 'non-fiction',
 'non-fiction-to-read',
 'nonfiction',
 'realistic-fiction',
 'sci-fi',
 'sci-fi-and-fantasy',
 'sci-fi-fantasy',
 'science',
 'science-fiction',
 'science-fiction-and-fantasy',
 'science-fiction-fantasy',
 'scifi',
 'scifi-fantasy',
 'speculative-fiction',
 'thriller-mystery',
 'wome

### Let's group tags relevant to 'fiction' and 'non-fiction'

In [62]:
# Identify custom tag_names for Adult Fiction and Non Fiction
pattern_for_Fiction = '\w*[-]?[fF]iction\w*'
custom_tags_for_Fiction = identify_custom_categories(tag_table.tag_name, str(pattern_for_Fiction))
custom_tags_for_Fiction

['fiction',
 'science-fiction-fantasy',
 'fantasy-fiction',
 'fiction-fantasy',
 'contemporary-fiction',
 'science-fiction',
 'adult-fiction',
 'speculative-fiction',
 'science-fiction-and-fantasy',
 'nonfiction',
 'non-fiction',
 'non-fiction-to-read',
 'realistic-fiction',
 'general-fiction',
 'historical-fiction',
 'literary-fiction',
 'fiction-historical',
 'fiction-to-read',
 'modern-fiction',
 'fiction-general',
 'classic-fiction',
 'crime-fiction',
 'historic-fiction',
 'womens-fiction']

In [63]:
#Lets group the fiction and non fiction books
Custom_Tag_Fiction =  [x for x in custom_tags_for_Adult_Books if x in ['fiction','contemporary-fiction','adult-fiction',
                                                              'speculative-fiction','realistic-fiction', 
                                                              'general-fiction',  'literary-fiction','fiction-to-read', 
                                                              'modern-fiction',
                                                             'fiction-general', 'classic-fiction']]

Custom_Tag_NonFiction =  [x for x in custom_tags_for_Adult_Books if x in ['nonfiction','non-fiction','non-fiction-to-read']]
Custom_Tag_Historical_Fiction =  [x for x in custom_tags_for_Adult_Books if x in ['historical-fiction','fiction-historical',
                                                                                  'historic-fiction']]                                                      

In [64]:
# Rename the custom tag_names for Fiction, Non-Fiction and Historical Fiction books
preferred_name = 'Fiction'
tag_table.tag_name = tag_table.tag_name.apply(lambda x: replace_custom_categories (x,Custom_Tag_Fiction
, preferred_name))

preferred_name = 'Non-Fiction'
tag_table.tag_name = tag_table.tag_name.apply(lambda x: replace_custom_categories (x,Custom_Tag_NonFiction
, preferred_name))

preferred_name = 'Historical Fiction'
tag_table.tag_name = tag_table.tag_name.apply(lambda x: replace_custom_categories (x,Custom_Tag_Historical_Fiction
, preferred_name))


# Save the tag names merged into each fiction category in a dictionary to help keyword search later in the project
KW_CategorY_Repos ['Category'].append( 'Fiction')
KW_CategorY_Repos ['Possible Search KW'].append(Custom_Tag_Fiction)

KW_CategorY_Repos ['Category'].append( 'Non-Fiction')
KW_CategorY_Repos ['Possible Search KW'].append(Custom_Tag_NonFiction)

KW_CategorY_Repos ['Category'].append( 'Historical Fiction')
KW_CategorY_Repos ['Possible Search KW'].append(Custom_Tag_Historical_Fiction)

In [65]:
#Lets check the tag_names after applying the filter 
tag_table.tag_name.nunique()

430

### Let's group tags for Audiobooks and Ebooks

In [66]:
# Identify custom tag_names for Audiobooks
pattern_for_Audio_Books = r'\w*[-]?[Aa]udi\w*'
custom_tags_for_Audio_Books = identify_custom_categories(tag_table.tag_name, str(pattern_for_Audio_Books))
custom_tags_for_Audio_Books

['audiobook', 'audiobooks', 'audio', 'audio-books', 'audible', 'audio-book']

In [67]:
# Rename the custom tags for Audio Books
preferred_name = 'Audio Books'
tag_table.tag_name = tag_table.tag_name.apply(lambda x: replace_custom_categories (x,custom_tags_for_Audio_Books
, preferred_name))

In [68]:
# Identify custom tag_names containing the word 'book' (the Ebooks tags are among them)
pattern_for_Books = r'\w*[-]?[bB]ook\w*'
custom_tags_for_Books = identify_custom_categories(tag_table.tag_name, str(pattern_for_Books))
set(custom_tags_for_Books )

{'1001-books',
 '1001-books-to-read',
 '1001-books-to-read-before-you-die',
 '1001-books-you-must-read-before-you',
 '2013-books',
 '2014-books',
 '2015-books',
 '2016-books',
 '2017-books',
 'book',
 'book-boyfriend',
 'book-boyfriends',
 'book-club',
 'book-club-books',
 'book-club-reads',
 'book-group',
 'bookclub',
 'books',
 'books-i-have',
 'books-read-in-2015',
 'books-read-in-2016',
 'books-to-buy',
 'bookshelf',
 'chapter-books',
 'comic-books',
 'e-book',
 'e-books',
 'ebook',
 'ebooks',
 'kindle-books',
 'library-book',
 'library-books',
 'my-books',
 'my-bookshelf',
 'picture-book',
 'picture-books',
 'school-books'}

In [69]:
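# Identify custom tag_names for Ebooks
# Note: the substring 'e-book' also matches 'picture-book(s)' (see the output below),
# so those tags end up in this list too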
searchfor = ['e-books', 'ebooks','ebook','e-book','kindle-books','kindle']
custom_tags_for_Ebooks = list(tag_table.tag_name[tag_table.tag_name.str.contains('|'.join(searchfor))])
set(custom_tags_for_Ebooks)

{'e-book',
 'e-books',
 'ebook',
 'ebooks',
 'kindle',
 'kindle-books',
 'my-ebooks',
 'on-kindle',
 'on-my-kindle',
 'picture-book',
 'picture-books'}
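
Because `str.contains` looks for the pattern anywhere in the tag, the keyword `e-book` also matches `picture-book` and `picture-books` in the output above. One possible way to tighten the search, shown here on toy data with a hypothetical `strict_pattern`, is to require `e-book` to start the tag or follow a hyphen:

```python
import pandas as pd

toy_tags = pd.Series(['ebook', 'e-books', 'my-ebooks', 'on-kindle',
                      'picture-book', 'picture-books', 'library-book'])

# Same keyword list as in the cell above
loose_pattern = '|'.join(['e-books', 'ebooks', 'ebook', 'e-book', 'kindle-books', 'kindle'])
# Hypothetical stricter pattern: 'e-book' must start the tag or follow a hyphen
strict_pattern = r'ebook|kindle|(?:^|-)e-book'

loose_matches = toy_tags[toy_tags.str.contains(loose_pattern)].tolist()
strict_matches = toy_tags[toy_tags.str.contains(strict_pattern)].tolist()
print(loose_matches)
print(strict_matches)
```

Applying a stricter pattern to the real table would change which tags are merged into 'Ebooks', so the counts reported later would need to be recomputed.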

In [70]:
# Rename the custom tags for Ebooks
preferred_name = 'Ebooks'
tag_table.tag_name = tag_table.tag_name.apply(lambda x: replace_custom_categories (x,custom_tags_for_Ebooks
,preferred_name))

In [71]:
#Lets check the tag_names after applying the filter 
tag_table.tag_name[650:750]

762                                funny
763                                scifi
764                               comedy
765                               humour
766                                   sf
767                                adult
768                           1001-books
769                            book-club
771                                space
772                              Fiction
774                               satire
775                                 1001
776                           literature
777                              science
778                                  fun
779                   sci-fi-and-fantasy
780                              Fiction
781                         20th-century
782                             humorous
783                            abandoned
784    1001-books-to-read-before-you-die
785                          Audio Books
786                               Ebooks
787                          Audio Books
788             

### Let's Group Tags for Books Read in a Particular Year

In [72]:
# Identify custom tag_names that contain a 4-digit number
pattern_for_4Digits = r'\w*[-]?\w*[-]?\d{4}'
custom_tags_for_4Digits = identify_custom_categories(tag_table.tag_name, str(pattern_for_4Digits))
set(custom_tags_for_4Digits)

{'1001',
 '1001-books',
 '1001-books-to-read',
 '1001-books-to-read-before-you-die',
 '1001-books-you-must-read-before-you',
 '1001-import',
 '1001-to-read',
 '1990s',
 '2000s',
 '2005',
 '2006',
 '2011-reads',
 '2012-reads',
 '2013-books',
 '2013-read',
 '2013-reads',
 '2014-books',
 '2014-read',
 '2014-reads',
 '2015-books',
 '2015-read',
 '2015-reading-challenge',
 '2015-reads',
 '2016-books',
 '2016-read',
 '2016-reading-challenge',
 '2016-reads',
 '2017-books',
 '2017-reading-challenge',
 '2017-reads',
 'read-2010',
 'read-2011',
 'read-2012',
 'read-2013',
 'read-2014',
 'read-2015',
 'read-2016',
 'read-2017',
 'read-in-2008',
 'read-in-2009',
 'read-in-2010',
 'read-in-2011',
 'read-in-2012',
 'read-in-2013',
 'read-in-2014',
 'read-in-2015',
 'read-in-2016',
 'read-in-2017'}

In [73]:
# Identify custom tag_names for books read in a particular year/time
pattern_for_Year = '\w*[-]?\w*[-]?[2]\d{3}'
custom_tags_for_Year = identify_custom_categories(tag_table.tag_name, str(pattern_for_Year))
custom_tags_for_Year.append('1990s')

In [74]:
# Rename the custom tags
preferred_name = 'Books Read By Year'
tag_table.tag_name = tag_table.tag_name.apply(lambda x: replace_custom_categories (x,custom_tags_for_Year
, preferred_name))

In [75]:
#Lets check the tag_names after applying the filter 
tag_table.tag_name.nunique()

375
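
The four-digit search also picks up tags such as `1001-books`, which is why the year pattern is restricted to numbers starting with 2 and `1990s` is appended by hand. As an alternative sketch (an assumption, not the method used in this notebook), a year can be pulled out with a regular expression limited to 19xx and 20xx; `extract_year` is a hypothetical helper:

```python
import re

# Four digits starting with 19 or 20, not embedded in a longer number
YEAR_RE = re.compile(r'(?<!\d)(?:19|20)\d{2}(?!\d)')

def extract_year(tag):
    """Hypothetical helper: return the year found in a tag, or None."""
    found = YEAR_RE.search(tag)
    return int(found.group(0)) if found else None

toy_tags = ['read-in-2015', '2014-reads', '1990s', '1001-books', 'library']
years = {tag: extract_year(tag) for tag in toy_tags}
print(years)
```

Keeping the extracted year alongside the merged 'Books Read By Year' label would make it possible to look at reading activity per year later on.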

### Let's group other popular tag_names related to Science, History, Books for Women, Mystery & Crime, and Science Fiction & Fantasy

#### Let's group tags relevant to 'Science'

In [76]:
#Lets group books on science
Custom_Tag_Science = identify_custom_categories(tag_table.tag_name, str('[Ss]cience$'))
Custom_Tag_Science

['science']

In [77]:
#Lets rename books on science
preferred_name = 'Science'
tag_table.tag_name = tag_table.tag_name.apply(lambda x: replace_custom_categories (x,Custom_Tag_Science
, preferred_name))

# Save the tag names merged into 'Science' in a dictionary to help keyword search later in the project
KW_CategorY_Repos ['Category'].append( 'Science')
KW_CategorY_Repos ['Possible Search KW'].append(Custom_Tag_Science)

In [78]:
#Lets check the tag_names after applying the filter 
tag_table.tag_name.nunique()

375

#### Let's group tags relevant to 'History'

In [79]:
#Lets group books on History
Custom_Tag_History = identify_custom_categories(tag_table.tag_name, str('[Hh]istory\w*'))
Custom_Tag_History

['history']

In [80]:
#Lets rename books on History
preferred_name = 'History'
tag_table.tag_name = tag_table.tag_name.apply(lambda x: replace_custom_categories (x,Custom_Tag_History
, preferred_name))

# Save the tag names merged into 'History' in a dictionary to help keyword search later in the project
KW_CategorY_Repos ['Category'].append( 'History')
KW_CategorY_Repos ['Possible Search KW'].append(Custom_Tag_History)

In [81]:
#Lets check the tag_names after applying the filter 
tag_table.tag_name.nunique()

375

> It seems that the tags for History and Science books are well defined: each search matched a single tag name ('history' and 'science'), and the number of unique tag names stayed at 375, so users did not choose other names to shelve these books.

#### Let's group tags relevant to 'Women'


In [82]:
# Identify the custom tags
searchfor = ['Women','women','feminism']
custom_tags_for__Women_Books = list(tag_table.tag_name[tag_table.tag_name.str.contains('|'.join(searchfor))])

In [83]:
# Rename custom tags
preferred_name = 'Women Book List'
tag_table.tag_name = tag_table.tag_name.apply(lambda x: replace_custom_categories (x,custom_tags_for__Women_Books
, preferred_name))

# Save the tag names merged into 'Women Book List' in a dictionary to help keyword search later in the project
KW_CategorY_Repos ['Category'].append( 'Women Book List')
KW_CategorY_Repos ['Possible Search KW'].append(custom_tags_for__Women_Books)

#### Let's group tags relevant to 'Crime & Mystery'

In [84]:
#Lets group books on Crime and Mystery
searchfor = ['[mM]ystery$', '[cC]rime$']
custom_tags_for_Mystery_Crime = list(tag_table.tag_name[tag_table.tag_name.str.contains('|'.join(searchfor))])
set(custom_tags_for_Mystery_Crime)

{'crime',
 'crime-mystery',
 'murder-mystery',
 'mystery',
 'mystery-crime',
 'thriller-mystery'}

In [85]:
# Lets Rename
preferred_name = 'Crime & Mystery'
tag_table.tag_name = tag_table.tag_name.apply(lambda x: replace_custom_categories (x,custom_tags_for_Mystery_Crime
, preferred_name))

# Save the tag names merged into 'Crime & Mystery' in a dictionary to help keyword search later in the project
KW_CategorY_Repos ['Category'].append(  'Crime & Mystery')
KW_CategorY_Repos ['Possible Search KW'].append(custom_tags_for_Mystery_Crime)

In [86]:
# Lets check the tag_names after applying the filter 
tag_table.tag_name.nunique()

365

#### Let's group tags relevant to 'Science Fiction & Fantasy'

In [87]:
# Let's group books on Science Fiction & Fantasy
# (?!.*word) is a negative lookahead: the match fails if 'word' appears later in the tag
searchfor = [r'\w*[-]?[sS]ci(?!.*ence)\w*', r'^\w*(?!.*Young-Adult).[fF]antasy$',
             'science-fiction']
custom_tags_for_Science_Fantasy = list(tag_table.tag_name[tag_table.tag_name.str.contains('|'.join(searchfor))])

set(custom_tags_for_Science_Fantasy)

{'dark-fantasy',
 'epic-fantasy',
 'fantasy-sci-fi',
 'fantasy-science-fiction',
 'fantasy-scifi',
 'fiction-fantasy',
 'high-fantasy',
 'paranormal-fantasy',
 'sci-fi',
 'sci-fi-and-fantasy',
 'sci-fi-fantasy',
 'science-fiction',
 'science-fiction-and-fantasy',
 'science-fiction-fantasy',
 'scifi',
 'scifi-fantasy',
 'sf-fantasy',
 'urban-fantasy'}
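
The pattern list above relies on negative lookaheads. A minimal sketch with Python's `re` module on toy tags shows which names the combined pattern selects; `re.search` is used here on the assumption that it mirrors how `str.contains` evaluates a regular expression:

```python
import re

searchfor = [r'\w*[-]?[sS]ci(?!.*ence)\w*',        # 'sci' with no 'ence' later in the tag
             r'^\w*(?!.*Young-Adult).[fF]antasy$',  # '<word><any char>fantasy'
             'science-fiction']                     # explicit keyword
combined = re.compile('|'.join(searchfor))

# Toy tags, chosen to exercise each alternative
toy_tags = ['sci-fi', 'scifi-fantasy', 'science', 'science-fiction',
            'epic-fantasy', 'fantasy', 'Young-Adult-Fantasy']
selected = [tag for tag in toy_tags if combined.search(tag)]
print(selected)
```

On this toy list the plain tags `science` and `fantasy` are left alone, and `Young-Adult-Fantasy` is skipped because `\w*` cannot span its hyphens.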

In [88]:
# Lets Rename
preferred_name = 'Science Fiction & Fantasy'
tag_table.tag_name = tag_table.tag_name.apply(lambda x: replace_custom_categories (x,custom_tags_for_Science_Fantasy
, preferred_name))

# Save the tag names merged into 'Science Fiction & Fantasy' in a dictionary to help keyword search later in the project
KW_CategorY_Repos ['Category'].append('Science Fiction & Fantasy')
KW_CategorY_Repos ['Possible Search KW'].append(custom_tags_for_Science_Fantasy)

In [89]:
#Lets check the tag_names after applying the filter 
tag_table.tag_name.nunique()

348

> Now, we can group the dataframe by tag_name to get an idea of how frequently users like to use such categories

In [90]:
tag_table_shortened = tag_table[['tag_id','tag_name','goodreads_book_id']]
Frequency_per_tag = tag_table_shortened.groupby(['tag_name']).count().sort_values(by='goodreads_book_id',ascending = False)
Frequency_per_tag.rename(columns = {'goodreads_book_id':'Frequency'}, inplace = True)
Frequency_per_tag.head(10)

tag_name,tag_id,Frequency
Books Read By Year,59812,59812
Owned Books,44351,44351
Ebooks,37090,37090
Favorite,35534,35534
Audio Books,32747,32747
Fiction,26252,26252
Science Fiction & Fantasy,20111,20111
Children Books,18314,18314
Young-Adult,12035,12035
to-read,9834,9834
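
As a small illustration of the frequency count above, the sketch below builds a toy tag table (made-up rows) and counts rows per `tag_name`; `value_counts` is shown as an equivalent shortcut.

```python
import pandas as pd

toy = pd.DataFrame({
    'tag_name': ['Favorite', 'Fiction', 'Favorite', 'Ebooks', 'Favorite', 'Fiction'],
    'goodreads_book_id': [1, 1, 2, 2, 3, 3],
})

# Rows per tag, most frequent first (same idea as Frequency_per_tag)
frequency = (toy.groupby('tag_name')['goodreads_book_id']
                .count()
                .sort_values(ascending=False)
                .rename('Frequency'))

# value_counts gives the same numbers in one call
shortcut = toy['tag_name'].value_counts()
print(frequency)
```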


> Let's get an idea of the number of books per category. When we merged different tags into a single category, the tag rows for a given book in that category were added together. For example, if a book was tagged as both 'Fav' and 'myFav', it has a tag frequency of 2 once those tags are merged into the single tag 'Favorite'. So we have to drop duplicates to identify the books per tag.

In [91]:
tag_table_shortened = tag_table[['tag_name','goodreads_book_id']]
tag_table_shortened.loc[tag_table_shortened.duplicated (keep = False),:]
books_per_tag = tag_table_shortened.drop_duplicates()
books_per_tag.head(20)

Unnamed: 0,tag_name,goodreads_book_id
0,to-read,1
1,fantasy,1
2,Favorite,1
3,currently-reading,1
4,Young-Adult,1
5,Fiction,1
7,Owned Books,1
10,series,1
12,magic,1
13,Children Books,1
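
To make the double counting concrete, this sketch uses a toy table in which book 1 was shelved under two tags that were both merged into `Favorite`. Dropping duplicate (tag, book) pairs, or counting distinct book IDs per tag, gives the number of books rather than the number of tag rows:

```python
import pandas as pd

# Book 1 appears twice under 'Favorite' after two original tags were merged
toy = pd.DataFrame({
    'tag_name': ['Favorite', 'Favorite', 'Favorite', 'Fiction'],
    'goodreads_book_id': [1, 1, 2, 1],
})

rows_per_tag = toy.groupby('tag_name').size()
book_count_per_tag = toy.drop_duplicates().groupby('tag_name').size()
distinct_books = toy.groupby('tag_name')['goodreads_book_id'].nunique()
print(rows_per_tag['Favorite'], book_count_per_tag['Favorite'])
```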


In [92]:
tag_table.head()

Unnamed: 0,goodreads_book_id,tag_id,count,tag_name,title
0,1,30574,167697,to-read,Harry Potter and the Half-Blood Prince (Harry ...
1,1,11305,37174,fantasy,Harry Potter and the Half-Blood Prince (Harry ...
2,1,11557,34173,Favorite,Harry Potter and the Half-Blood Prince (Harry ...
3,1,8717,12986,currently-reading,Harry Potter and the Half-Blood Prince (Harry ...
4,1,33114,12716,Young-Adult,Harry Potter and the Half-Blood Prince (Harry ...


In [93]:
# Make a dataframe for category keyword search
KW_Repository = pd.DataFrame.from_dict(KW_CategorY_Repos)
KW_Repository.head()

Unnamed: 0,Category,Possible Search KW
0,Favorite,"[favorites, favourites, all-time-favorites, fa..."
1,Children Books,"[kids, kids-books, kid-lit, kid-books, childre..."
2,Young-Adult,"[ya, teen, juvenile, ya, teen, juvenile, ya-bo..."
3,Young-Adult-Fantasy,"[ya-fantasy, ya-fantasy, ya-fantasy, ya-fantas..."
4,Young-Adult-Fiction,"[ya-fiction, ya-fiction, juvenile-fiction, ya-..."
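
The keyword lists shown above contain repeated entries (for example `ya, teen, juvenile, ya, teen, juvenile`), most likely because they were built from per-row matches. A possible clean-up, sketched on a toy dictionary with the same two keys, removes duplicates while preserving order before the DataFrame is built:

```python
import pandas as pd

# Toy repository with the same keys as KW_CategorY_Repos
toy_repos = {
    'Category': ['Young-Adult', 'Young-Adult-Fantasy'],
    'Possible Search KW': [['ya', 'teen', 'juvenile', 'ya', 'teen'],
                           ['ya-fantasy', 'ya-fantasy', 'young-adult-fantasy']],
}

# dict.fromkeys keeps the first occurrence of each keyword, in order
toy_repos['Possible Search KW'] = [list(dict.fromkeys(keywords))
                                   for keywords in toy_repos['Possible Search KW']]

toy_repository = pd.DataFrame.from_dict(toy_repos)
print(toy_repository)
```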


# Export Tidy Dataset to CSV for EDA

By using data wrangling we have reduced the number of unique tags from 32963 to 348. Let's export the clean dataset to CSV format for exploratory data analysis.

In [96]:
KW_Repository.to_csv('Category_KW_Respository.csv', encoding = 'utf-8')
tag_table.to_csv('Tidy_Tag_Table.csv', encoding = 'utf-8')
Frequency_per_tag.to_csv('Tidy_Data_for_Tag_Frequency.csv', encoding = 'utf-8')
books_per_tag.to_csv('TIdy_Data_for_Books_Per_Tag.csv',encoding = 'utf-8')
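
When these files are read back with `read_csv`, the saved index comes back as an extra `Unnamed: 0` column unless `index_col=0` is passed or the file was written with `index=False`. A minimal sketch of the round trip, using an in-memory buffer and made-up rows instead of the real files:

```python
import io
import pandas as pd

toy = pd.DataFrame({'tag_name': ['Favorite', 'Fiction'],
                    'goodreads_book_id': [1, 1]})

# Default to_csv writes the index as an unnamed first column
with_index = toy.to_csv()
reloaded_default = pd.read_csv(io.StringIO(with_index))
reloaded_indexed = pd.read_csv(io.StringIO(with_index), index_col=0)

# Writing without the index avoids the extra column altogether
without_index = toy.to_csv(index=False)
reloaded_clean = pd.read_csv(io.StringIO(without_index))
print(reloaded_default.columns.tolist())
```

Either option works for the EDA notebook; the choice only needs to be consistent between the export and the later import.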

## A Few Comments

- Users like to have subcategories within a category (such as historical fiction, children-nonfiction, young-adult-fiction). We can use this to provide some built-in tags.
- Users also prefer to shelve by specific series or author (J.K. Rowling and Harry Potter are the most frequent).
- After defining all the categories, we can see which category received the most 5-star ratings.
- For highly rated books, we can look at customized tag names within a given category and then check which type of book readers like most (for example, whether young adults rate fiction higher than historical non-fiction).
- What is the age group of different readers? (Count the books in the adult, kids and YA categories.)