# A different approach to book recommendations using KNN (K-Nearest Neighbors)

Most of the heavy lifting here was combining and cleaning the book data. Leaving the steps here for my future self.

First, load the previously cleaned data with the `book_id`, `title`, `rating`, and `user_id` fields:
```
import pandas as pd

ratings = pd.read_csv("../../datasets/Book_ratings_clean.csv")
ratings.head()

id	book_id	title	rating	user_id
0	10	0829814000	Wonderful Worship in Smaller Churches	19.40	AZ0IOBU20TBOP
1	11	0829814000	Wonderful Worship in Smaller Churches	19.40	A373VVEU6Z9M0N
2	12	0829814000	Wonderful Worship in Smaller Churches	19.40	AGKGOH65VTRR4
3	13	0829814000	Wonderful Worship in Smaller Churches	19.40	A3OQWLU31BU1Y
4	14	0595344550	Whispers of the Wicked Saints	10.95	A3Q12RK71N74LB

```
The problem with this data is that it doesn't include the book metadata. The metadata, in turn, doesn't contain the `book_id`, so the two have to be matched on title. Not too bad.
```
book_ids_titles = ratings[['book_id', 'title']]
book_ids_titles = book_ids_titles.drop_duplicates()

bookMetaData = pd.read_csv("../../datasets/books_data.csv")
# drop_duplicates() returns a new frame, so the result has to be assigned back
bookMetaData = bookMetaData.drop_duplicates()
bookMetaData.rename(columns={'Title':'title'}, inplace=True)

bookMetaData = bookMetaData.merge(book_ids_titles.set_index('title'), how='left', on='title')
bookMetaData = bookMetaData.dropna(subset=['book_id'])
bookMetaData.head()
bookMetaData.to_csv('../datasets/book_metadata_with_ids.csv')
```
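Matching on title can go wrong quietly: some metadata rows may find no `book_id`, and a title shared by several `book_id`s duplicates the metadata row. A minimal sketch of how that could be checked, using made-up toy frames (the ids and titles below are hypothetical, not from the real data set) and the `indicator=True` option of `merge`:

```python
import pandas as pd

# Toy stand-ins for the real frames; ids and titles are invented.
book_ids_titles = pd.DataFrame({
    "book_id": ["0000000001", "0000000002", "0000000003"],
    "title": ["Book A", "Book B", "Book B"],  # two ids share one title
})
meta = pd.DataFrame({"title": ["Book A", "Book B", "Book C"]})

# indicator=True adds a _merge column recording where each row came from.
merged = meta.merge(book_ids_titles, how="left", on="title", indicator=True)

# Metadata rows whose title matched no book_id.
unmatched = merged[merged["_merge"] == "left_only"]
# Rows whose title appears more than once after the merge.
ambiguous = merged[merged.duplicated("title", keep=False)]

print(len(unmatched), "metadata rows had no book_id")
print(len(ambiguous), "rows come from titles with several book_ids")
```

Running a check like this before `dropna` shows how many rows the title match loses or duplicates.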

Once that is done, aggregate the ratings per book to get the number of ratings (`size`) and the mean rating (`mean`).
```
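import numpy as np

# String aggregation names are the idiomatic pandas spelling of np.size / np.mean here
bookRatings = ratings.groupby('book_id').agg({'rating': ['size', 'mean']})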
bookRatings.head()

	rating
size	mean
book_id		
0002554232	3	24.32
0004332466	1	8.95
0005993814	1	35.00
0006482724	1	13.99
0007104022	1	18.99

bookNumRatings = pd.DataFrame(bookRatings['rating']['size'])
bookMeanRatings = pd.DataFrame(bookRatings['rating']['mean'])
bookNormalizedNumRatings = bookNumRatings.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))
bookNormalizedNumRatings.head()

	size
book_id	
0002554232	0.001369
0004332466	0.000000
0005993814	0.000000
0006482724	0.000000
0007104022	0.000000

bookMetaData = bookMetaData.merge(bookNormalizedNumRatings, how='left', on='book_id') 
bookMetaData = bookMetaData.merge(bookMeanRatings, how='left', on='book_id')  

```
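The aggregation and min-max normalization above could also be written with named aggregation, which avoids the two-level column index. This is only a sketch on toy data; the column names `ratingSize` and `sizeNorm` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy ratings; the values are invented.
ratings = pd.DataFrame({
    "book_id": ["a", "a", "a", "b", "c", "c"],
    "rating": [20.0, 25.0, 30.0, 9.0, 10.0, 14.0],
})

# Named aggregation gives flat column names instead of ('rating', 'size').
stats = ratings.groupby("book_id")["rating"].agg(ratingSize="size", mean="mean")

# Min-max normalization: the least-rated book maps to 0, the most-rated to 1.
low = stats["ratingSize"].min()
high = stats["ratingSize"].max()
stats["sizeNorm"] = (stats["ratingSize"] - low) / (high - low)

print(stats)
```

If every book had the same number of ratings, `high - low` would be zero, so real code would need a guard for that case.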
Next, pull out only the relevant columns into their own dataframe and export it to CSV so this doesn't need to be done again.
```
focusedBookData = bookMetaData[['book_id', 'title', 'size', 'mean', 'authors', 'publisher', 'publishedDate', 'categories']]
focusedBookData.to_csv('../../datasets/focused_book_meta_data_clean.csv')

```

In [1]:
import pandas as pd

ratings = pd.read_csv("../../datasets/focused_book_meta_data_clean.csv")
ratings.head()

Unnamed: 0.1,Unnamed: 0,book_id,title,size,mean,authors,publisher,publishedDate,categories
0,0,829814000,Wonderful Worship in Smaller Churches,0.002053,19.4,['David R. Ray'],,2000,['Religion']
1,1,595344550,Whispers of the Wicked Saints,0.021218,10.95,['Veronica Haddon'],iUniverse,2005-02,['Fiction']
2,2,253338352,"Nation Dance: Religion, Identity and Cultural ...",0.0,39.95,['Edward Long'],,2003-03-01,
3,3,802841899,The Church of Christ: A Biblical Ecclesiology ...,0.002053,25.97,['Everett Ferguson'],Wm. B. Eerdmans Publishing,1996,['Religion']
4,4,895554224,Saint Hyacinth of Poland,0.0,13.95,['Mary Fabyan Windeatt'],Tan Books & Pub,2009-01-01,['Biography & Autobiography']
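The `book_id` values in this output have lost the leading zero they had earlier (`829814000` here, `0829814000` in the ratings file), which is what happens when pandas parses an ID column as integers. The later lookups use keys like `'0895554224'`, so the IDs need to stay strings. A minimal sketch, assuming integer parsing is the cause, using an in-memory CSV rather than the real file:

```python
import io
import pandas as pd

# Two rows taken from the ratings output earlier in these notes.
csv_text = (
    "book_id,title\n"
    "0829814000,Wonderful Worship in Smaller Churches\n"
    "0595344550,Whispers of the Wicked Saints\n"
)

# Default parsing infers integers and drops the leading zeros.
as_int = pd.read_csv(io.StringIO(csv_text))
# Forcing the column to str keeps the IDs exactly as written.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"book_id": str})

print(as_int["book_id"].tolist())
print(as_str["book_id"].tolist())
```

Passing `index=False` to `to_csv` when exporting would also avoid the `Unnamed: 0` columns that show up on re-import.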


This focused data set contains 47,984 rows. The rest of this notebook works with it as a dict keyed by `book_id`.

To compute the distance between two books based on how similar their categories are, and their mean rating, the categorical `categories` column first has to be converted into numerical values.

In [2]:
categories = ratings.categories.unique()
print(categories)
print(len(categories))
cat_dict = {}
i = 0
for category in categories:
    cat_dict[category] = [i]
    i+=1

["['Religion']" "['Fiction']" nan ... "['Agricultural diversification']"
 "['Arthur']" "['Colombo (Sri Lanka)']"]
2225


In [3]:
ratings['categories'] = ratings['categories'].map(cat_dict)

Further improvements for later: find the unique categories by splitting on '&'. For example, `['Biography & Autobiography']` would become an array such as `[1, 2]`. That way something like `[Science Fiction & Historical Fiction]` could be compared with books that are strictly `Science Fiction` or strictly `Historical Fiction`.

Another potential improvement would be to cast authors to numerical data too.
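The splitting idea could be sketched as below. This is a hypothetical helper, not part of the notebook: it assumes the `categories` column holds strings that look like Python list literals (as in the output above) and that missing values arrive as float NaN. It builds a multi-hot vector per book instead of a list of integer codes.

```python
import ast

def parse_categories(raw):
    """Turn "['Biography & Autobiography']" into ['Biography', 'Autobiography']."""
    if not isinstance(raw, str):       # missing values arrive as float NaN
        return []
    labels = ast.literal_eval(raw)     # the CSV stores a Python list literal
    parts = []
    for label in labels:
        parts.extend(piece.strip() for piece in label.split("&"))
    return [piece for piece in parts if piece]

raw_values = ["['Religion']", "['Biography & Autobiography']", float("nan"), "['Fiction']"]
parsed = [parse_categories(raw) for raw in raw_values]

# One position per distinct label, in a fixed (sorted) order.
vocab = sorted({label for labels in parsed for label in labels})
index = {label: i for i, label in enumerate(vocab)}

def multi_hot(labels):
    """Return a 0/1 vector with a 1 for every label the book carries."""
    vec = [0] * len(vocab)
    for label in labels:
        vec[index[label]] = 1
    return vec

vectors = [multi_hot(labels) for labels in parsed]
print(vocab)
print(vectors)
```

Books with no category end up as an all-zero vector, for which cosine distance is undefined, so they would need separate handling.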

In [4]:
# being more explicit with the name here
ratings.rename(columns={'size':'ratingSize'}, inplace=True)

In [5]:
book_dict = {}
for index, row in ratings.iterrows():
    book_dict[row['book_id']] = (row['title'], row['ratingSize'], row['mean'], row['authors'], row['publisher'], row['publishedDate'], row['categories'])

In [6]:
book_dict['0895554224']

('Saint Hyacinth of Poland',
 0.0,
 13.95,
 "['Mary Fabyan Windeatt']",
 'Tan Books & Pub',
 '2009-01-01',
 [3])

Now, following the tutorial for KNN, compute the distance between two given books:

In [7]:
from scipy import spatial

def ComputeDistance(a, b):
    categoriesA = a[6]
    categoriesB = b[6]
    genreDistance = spatial.distance.cosine(categoriesA, categoriesB)
    popularityA = a[1]
    popularityB = b[1]
    popularityDistance = abs(popularityA - popularityB)
    return genreDistance + popularityDistance
    
ComputeDistance(book_dict['1594567263'], book_dict['0891010947'])

0.4038329911019849

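The higher the distance, the less similar the books are. A classic fiction book and a book on the Civil War are subjectively very different, yet the result above is exactly the difference between their normalized popularity scores (0.4045 - 0.0007 ≈ 0.4038). The genre term contributed nothing: the cosine distance between two single-element vectors with positive values, such as `[2]` and `[8]`, is always 0, so the integer category codes never separate genres.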

In [16]:
print(book_dict['0939495805'])
print(book_dict['0891010947'])


('The Picture of Dorian Gray', 0.4045174537987679, 26.0, "['Óscar Wilde']", nan, '2016-01-24', [2])
('The Civil War Recollections of General Ellis Spear', 0.000684462696783, 28.0, "['Ellis Spear']", nan, '1997', [8])
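One thing worth checking in `ComputeDistance`: each book's category is a single-element list such as `[2]` or `[8]`, and cosine distance only measures the angle between vectors, not their magnitude. The sketch below re-implements the cosine formula in plain Python (an assumption for illustration; it follows the same definition as `spatial.distance.cosine`) and contrasts integer codes with one-hot vectors. The one-hot vectors are a hypothetical alternative encoding, not what the notebook uses.

```python
import math

def cosine_distance(u, v):
    """1 - (u . v) / (|u| * |v|), the usual cosine distance definition."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Integer category codes as single-element vectors: same direction, so distance 0.
label_distance = cosine_distance([2], [8])

# One-hot vectors over four hypothetical categories.
one_hot_same = cosine_distance([0, 0, 1, 0], [0, 0, 1, 0])
one_hot_diff = cosine_distance([0, 0, 1, 0], [0, 1, 0, 0])

print(label_distance, one_hot_same, one_hot_diff)
```

With single integer codes the genre term is 0 for any two positive codes (and undefined for code 0, where the norm is zero), so only the popularity difference ends up driving the total distance.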


Next, following the tutorial, compute the distance between the given book and every other book in the data set, sort by distance, and print the K nearest neighbors. Compare that list with the one generated by the previous attempt in `/MachineLearning/BookPrediction.ipynb`, reproduced here:

```
Alice's Adventures in Wonderland and Through the Looking Glass (Classic Collection (Brilliance Audio))    1.000
The Picture of Dorian Gray                                                                                1.000
The Red Badge of Courage (Lake Illustrated Classics, Collection 1)                                        1.000
Wuthering Heights                                                                                         1.000
the Picture of Dorian Gray                                                                                1.000
A Christmas Carol (Classic Fiction)                                                                       1.000
A Christmas Carol, in Prose: Being a Ghost Story of Christmas (Collected Works of Charles Dickens)        1.000
The Picture of Dorian Gray (The Classic Collection)                                                       0.998
```

In [17]:
import operator

def getNeighbors(bookID, K):
    distances = []
    for book in book_dict:
        if (book != bookID):
            dist = ComputeDistance(book_dict[bookID], book_dict[book])
            distances.append((book, dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(K):
        neighbors.append(distances[x][0])
    return neighbors

K = 10
avgRating = 0
neighbors = getNeighbors('0939495805', K)
print(neighbors)
for neighbor in neighbors:
    avgRating += book_dict[neighbor][2]
    print (book_dict[neighbor][0] + " " + str(book_dict[neighbor][2]))
    
avgRating /= K

['1557424470', '1594567263', '1597370037', '1597370045', '1597370061', '0679766758', '0748608370', '0895260786', '0785263705', '0808598104']
The Picture of Dorian Gray 18.96
the Picture of Dorian Gray 12.99
The Picture of Dorian Gray (The Classic Collection) 25.04
The Picture of Dorian Gray (Classic Collection (Brilliance Audio)) 87.25
The Picture of Dorian Gray (Classic Collection (Brilliance Audio)) 39.25
Push: A Novel 7.06
Treasure Island 54.0
America Alone: The End of the World as We Know It 18.46
Blue Like Jazz: Nonreligious Thoughts on Christian Spirituality 11.35
The Killer Angels (Turtleback School & Library Binding Edition) 13.8
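The neighbor search above sorts every distance and then takes the first K. A possible alternative, sketched here on a made-up toy catalogue, is `heapq.nsmallest`, which keeps only the K best candidates. The function and variable names (`get_neighbors`, `toy_books`, `toy_distance`) are hypothetical, and the toy distance uses popularity only.

```python
import heapq

# Toy catalogue: book_id -> (title, normalized popularity). Values are invented.
toy_books = {
    "b1": ("Book One", 0.40),
    "b2": ("Book Two", 0.42),
    "b3": ("Book Three", 0.30),
    "b4": ("Book Four", 0.39),
    "b5": ("Book Five", 0.50),
}

def toy_distance(a, b):
    """Stand-in distance: absolute popularity difference."""
    return abs(a[1] - b[1])

def get_neighbors(book_id, k, books, distance):
    """Return the ids of the k books closest to book_id."""
    target = books[book_id]
    candidates = (
        (distance(target, other), other_id)
        for other_id, other in books.items()
        if other_id != book_id
    )
    # nsmallest returns at most k items, already ordered by distance.
    return [other_id for _, other_id in heapq.nsmallest(k, candidates)]

nearest = get_neighbors("b1", 2, toy_books, toy_distance)
print(nearest)
```

Unlike indexing `distances[x]` in a loop, this returns a shorter list instead of raising an `IndexError` when K is larger than the number of other books.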


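Definitely some different selections this way. Beyond five editions of the same title, the neighbors are unrelated books that simply have a similar number of ratings. Subjectively, as a user I would be happier with the previous recommendations. While we were at it, we computed the average rating of the 10 nearest neighbors to the selected book: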

In [18]:
avgRating

28.816000000000003

Compare with the book's actual mean rating:

In [20]:
book_dict['0939495805']

('The Picture of Dorian Gray',
 0.4045174537987679,
 26.0,
 "['Óscar Wilde']",
 nan,
 '2016-01-24',
 [2])

A neighbor average of 28.82 against an actual mean of 26.0. Could be better.

## Switching to Rating Size

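Just to see if that has any impact on the selections, print and average the normalized rating size (index 1 of each tuple) instead of the mean rating (index 2).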

In [21]:
def getNeighbors(bookID, K):
    distances = []
    for book in book_dict:
        if (book != bookID):
            dist = ComputeDistance(book_dict[bookID], book_dict[book])
            distances.append((book, dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(K):
        neighbors.append(distances[x][0])
    return neighbors

K = 10
avgRating = 0
neighbors = getNeighbors('0939495805', K)
print(neighbors)
for neighbor in neighbors:
    avgRating += book_dict[neighbor][1]
    print (book_dict[neighbor][0] + " " + str(book_dict[neighbor][1]))
    
avgRating /= K

['1557424470', '1594567263', '1597370037', '1597370045', '1597370061', '0679766758', '0748608370', '0895260786', '0785263705', '0808598104']
The Picture of Dorian Gray 0.4045174537987679
the Picture of Dorian Gray 0.4045174537987679
The Picture of Dorian Gray (The Classic Collection) 0.4045174537987679
The Picture of Dorian Gray (Classic Collection (Brilliance Audio)) 0.4045174537987679
The Picture of Dorian Gray (Classic Collection (Brilliance Audio)) 0.4045174537987679
Push: A Novel 0.4045174537987679
Treasure Island 0.4024640657084189
America Alone: The End of the World as We Know It 0.4106776180698152
Blue Like Jazz: Nonreligious Thoughts on Christian Spirituality 0.4223134839151266
The Killer Angels (Turtleback School & Library Binding Edition) 0.4229979466119096


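No change. The neighbor list is identical, as expected: `getNeighbors` and `ComputeDistance` are the same as before, and only the field that is printed and averaged switched from the mean rating to the rating size.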

Further testing that can be done:

The choice of 10 for K was arbitrary. What effect do different K values have on the results?
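One way to probe the effect of K is to compare the neighbors' average rating with the book's actual mean for several K values. The sketch below does this for a single target on invented toy data; the names and numbers are hypothetical, and a real evaluation would average the error over many target books rather than one.

```python
# Toy data: book_id -> (normalized popularity, mean rating). Values are invented.
toy_books = {
    "target": (0.40, 26.0),
    "n1": (0.41, 24.0),
    "n2": (0.38, 30.0),
    "n3": (0.45, 20.0),
    "n4": (0.10, 50.0),
}

def neighbor_average(book_id, k, books):
    """Average mean rating of the k books closest in popularity to book_id."""
    target_popularity = books[book_id][0]
    others = sorted(
        (abs(target_popularity - popularity), rating)
        for other_id, (popularity, rating) in books.items()
        if other_id != book_id
    )
    nearest = others[:k]
    return sum(rating for _, rating in nearest) / len(nearest)

actual = toy_books["target"][1]
# Absolute error between the neighbor average and the actual mean, per K.
errors = {k: abs(neighbor_average("target", k, toy_books) - actual) for k in (1, 2, 3, 4)}
best_k = min(errors, key=errors.get)

print(errors)
print("best K on this toy example:", best_k)
```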

The distance metric was also somewhat arbitrary: it is just the cosine distance between the genres added to the difference between the normalized popularity scores. Can that be improved? Dig further into the `spatial.distance` module used in `ComputeDistance` to look for alternatives.
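As one possible direction, the genre term could use Jaccard distance over sets of genre labels, with explicit weights on each term. This is a swapped-in technique, not what the notebook or the tutorial does; the dict layout, function names, and weights below are all assumptions for illustration.

```python
def jaccard_distance(labels_a, labels_b):
    """1 - |A ∩ B| / |A ∪ B| over two collections of genre labels."""
    a, b = set(labels_a), set(labels_b)
    if not a and not b:
        # Assumption: two books with no genre info are not treated as similar.
        return 1.0
    return 1.0 - len(a & b) / len(a | b)

def combined_distance(book_a, book_b, genre_weight=1.0, popularity_weight=1.0):
    """Weighted sum of genre distance and popularity difference."""
    genre = jaccard_distance(book_a["genres"], book_b["genres"])
    popularity = abs(book_a["popularity"] - book_b["popularity"])
    return genre_weight * genre + popularity_weight * popularity

# Invented example books.
novel = {"genres": ["Fiction"], "popularity": 0.5}
historical_novel = {"genres": ["Fiction", "History"], "popularity": 0.25}

equal_weights = combined_distance(novel, historical_novel)
genre_heavy = combined_distance(novel, historical_novel, genre_weight=2.0)

print(equal_weights, genre_heavy)
```

Raising `genre_weight` makes genre mismatches count for more than popularity gaps, which is one knob to try when the neighbors look too popularity-driven.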