## Personalized Recommendations for Your Next Read

Description: The project uses the three Book-Crossing datasets: BX-Users (anonymized user IDs with demographic data), BX-Books (book details and cover-image URLs from Amazon Web Services), and BX-Book-Ratings (explicit and implicit book ratings).

### Summary

[The dataset](http://www2.informatik.uni-freiburg.de/~cziegler/BX/) is sourced from the Institut für Informatik, Universität Freiburg.


### Packages

In [1]:
# %conda install -c conda-forge evalml-core
# !pip install sktime

'''
catboost>=1.1.1, which is not installed.
imbalanced-learn>=0.9.1, which is not installed.
ipywidgets<8.0.5,>=7.5, which is not installed.
kaleido>=0.1.0, which is not installed.
lightgbm>=2.3.1, which is not installed.
lime>=0.2.0.1, which is not installed.
plotly>=5.0.0, which is not installed.
pmdarima>=1.8.5, which is not installed.
seaborn>=0.11.1, which is not installed.
pyproject
vowpalwabbit>=8.11.0, which is not installed.
xgboost>=1.7.0, which is not installed.
sktime==0.17.0

'''


### Importing Libraries

In [2]:
import pandas as pd

### Data Cleaning and Splitting

#### Users Dataset

In [3]:
# Read csv data
users = pd.read_csv("assets/input/book_recommendation/users.csv", sep=";", quotechar='"', encoding="latin-1")

In [4]:
# Fill NaN values in the "Location" column with the string "undefined"
users["Location"] = users["Location"].fillna("undefined")

In [5]:
# Split the "Location" column using a comma (,) as the separator
location_split = users["Location"].str.split(",", expand=True)

# Assign the split values to the "City," "State/Province," and "Country" columns
users["City"] = location_split[0].str.strip()  # Strip leading/trailing spaces
users["State/Province"] = location_split[1].str.strip()
users["Country"] = location_split[2].str.strip()

# Drop the original "Location" column as it's no longer needed
users.drop("Location", axis=1, inplace=True)
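The split above assumes every location has exactly three comma-separated parts. As a minimal sketch, the helper below (the name `split_location` and the `missing` parameter are hypothetical, not part of the original notebook) pads short locations and maps empty or `"n/a"` fields to `"undefined"`, so each row always yields three values. Like the cell above, it keeps only the first three fields.

```python
def split_location(location, missing="undefined"):
    """Split 'city, state, country' into exactly three cleaned fields."""
    # Strip whitespace around each comma-separated part
    parts = [part.strip() for part in location.split(",")]
    # Treat empty strings and the "n/a" marker as missing (an assumption)
    parts = [part if part and part != "n/a" else missing for part in parts]
    # Pad short locations so there are always at least three fields
    parts += [missing] * (3 - len(parts))
    # Keep only the first three fields, mirroring the cell above
    return tuple(parts[:3])


print(split_location("nyc, new york, usa"))
print(split_location("undefined"))
print(split_location("porto, , portugal"))
```

One possible use is `users["Location"].apply(split_location)`, followed by unpacking the tuples into the three columns.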

In [6]:
# Replace "n/a" placeholders with "undefined" across the DataFrame
users.replace("n/a", "undefined", inplace=True)
users

Unnamed: 0,User-ID,Age,City,State/Province,Country
0,1,,nyc,new york,usa
1,2,18.0,stockton,california,usa
2,3,,moscow,yukon territory,russia
3,4,17.0,porto,v.n.gaia,portugal
4,5,,farnborough,hants,united kingdom
...,...,...,...,...,...
278853,278854,,portland,oregon,usa
278854,278855,50.0,tacoma,washington,united kingdom
278855,278856,,brampton,ontario,canada
278856,278857,,knoxville,tennessee,usa


In [7]:
# Fill NaN values in the "Age" column with the median age
median_age = users['Age'].median()
users["Age"] = users["Age"].fillna(median_age)

#### Books Dataset

In [8]:
# Read csv data
books = pd.read_csv("assets/input/book_recommendation/books.csv", encoding="ISO-8859-1", sep=";", on_bad_lines="skip")


In [9]:
# Remove rows with NaN values
books.dropna(inplace=True)

# Drop the "Image-URL-S", "Image-URL-M", and "Image-URL-L" columns
books.drop(["Image-URL-S", "Image-URL-M", "Image-URL-L"], axis=1, inplace=True)

#### Book Ratings

In [10]:
# Read csv data
book_ratings = pd.read_csv("assets/input/book_recommendation/book_ratings.csv", encoding="ISO-8859-1", sep=";", on_bad_lines="skip")

In [11]:
# Merge users and book_ratings dataframes on "User-ID"
merged_df = pd.merge(users, book_ratings, on='User-ID', how='inner')

# Merge the resulting dataframe with the books dataframe on "ISBN"
recommendation = pd.merge(merged_df, books, on='ISBN', how='inner')

In [12]:
# Add a new column to hold the "Yes" or "No" values
recommendation['Is-Liked?'] = recommendation['Book-Rating'].apply(lambda x: 'Yes' if x >= 7 else 'No')
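The rule above also labels a rating of 0 as "No". In the Book-Crossing data, 0 marks an implicit rating rather than a low explicit score, so it may be worth treating those rows separately. The sketch below is one hedged alternative, not the notebook's method: the function name `label_rating` and returning `None` for implicit ratings are assumptions made for illustration.

```python
def label_rating(rating, threshold=7):
    """Return "Yes"/"No" for explicit ratings and None for implicit ones."""
    # A rating of 0 is implicit: the user interacted with the book
    # but gave no score, so it carries no like/dislike signal here
    if rating == 0:
        return None
    # Explicit ratings run from 1 to 10; the threshold of 7 matches the cell above
    return "Yes" if rating >= threshold else "No"


for rating in (0, 5, 7, 10):
    print(rating, label_rating(rating))
```

Rows labeled `None` could then be dropped before training, at the cost of a smaller dataset.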

In [13]:
recommendation

Unnamed: 0,User-ID,Age,City,State/Province,Country,ISBN,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher,Is-Liked?
0,2,18.0,stockton,california,usa,0195153448,0,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,No
1,8,32.0,timmins,ontario,canada,0002005018,5,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,No
2,11400,49.0,ottawa,ontario,canada,0002005018,0,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,No
3,11676,32.0,undefined,undefined,undefined,0002005018,8,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,Yes
4,41385,32.0,sudbury,ontario,canada,0002005018,0,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,No
...,...,...,...,...,...,...,...,...,...,...,...,...
1031124,278851,33.0,dallas,texas,usa,0743203763,0,As Hogan Said . . . : The 389 Best Things Anyo...,Randy Voorhees,2000,Simon &amp; Schuster,No
1031125,278851,33.0,dallas,texas,usa,0767907566,5,All Elevations Unknown: An Adventure in the He...,Sam Lightner,2001,Broadway Books,No
1031126,278851,33.0,dallas,texas,usa,0884159221,7,Why stop?: A guide to Texas historical roadsid...,Claude Dooley,1985,Lone Star Books,Yes
1031127,278851,33.0,dallas,texas,usa,0912333022,7,The Are You Being Served? Stories: 'Camping In...,Jeremy Lloyd,1997,Kqed Books,Yes


In [14]:
import evalml.preprocessing as epp
from evalml import AutoMLSearch


In [15]:
recommendation.drop(columns="Is-Liked?")

Unnamed: 0,User-ID,Age,City,State/Province,Country,ISBN,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher
0,2,18.0,stockton,california,usa,0195153448,0,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,8,32.0,timmins,ontario,canada,0002005018,5,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,11400,49.0,ottawa,ontario,canada,0002005018,0,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
3,11676,32.0,undefined,undefined,undefined,0002005018,8,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
4,41385,32.0,sudbury,ontario,canada,0002005018,0,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
...,...,...,...,...,...,...,...,...,...,...,...
1031124,278851,33.0,dallas,texas,usa,0743203763,0,As Hogan Said . . . : The 389 Best Things Anyo...,Randy Voorhees,2000,Simon &amp; Schuster
1031125,278851,33.0,dallas,texas,usa,0767907566,5,All Elevations Unknown: An Adventure in the He...,Sam Lightner,2001,Broadway Books
1031126,278851,33.0,dallas,texas,usa,0884159221,7,Why stop?: A guide to Texas historical roadsid...,Claude Dooley,1985,Lone Star Books
1031127,278851,33.0,dallas,texas,usa,0912333022,7,The Are You Being Served? Stories: 'Camping In...,Jeremy Lloyd,1997,Kqed Books


In [16]:
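# Drop the target and "Book-Rating": the target is derived directly from the
# rating, so keeping the rating as a feature would leak the label
X_train, X_holdout, y_train, y_holdout = epp.split_data(
    recommendation.drop(columns=["Is-Liked?", "Book-Rating"]),
    recommendation["Is-Liked?"],
    problem_type="binary",
    test_size=0.2
)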

In [17]:
automl = AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    problem_type="binary",
    objective="f1",
    max_batches=1,
    verbose=True,
)

AutoMLSearch will use mean CV score to rank pipelines.


Removing columns ['ISBN', 'Book-Author'] because they are of 'Unknown' type
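The search above ranks pipelines by the F1 objective. For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it by hand on a few made-up labels. The function name `f1_score` and the choice of "Yes" as the positive class are assumptions for illustration, and this is not how evalml implements the objective internally.

```python
def f1_score(y_true, y_pred, positive="Yes"):
    """Compute F1 for one positive class from two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives, and false negatives
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # With no true positives, precision and recall are both 0 (or undefined)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for this example only
y_true = ["Yes", "Yes", "No", "No", "Yes"]
y_pred = ["Yes", "No", "No", "Yes", "Yes"]
print(f1_score(y_true, y_pred))
```

Because F1 ignores true negatives, it suits cases where the "Yes" class matters more than the overall accuracy.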


In [None]:
automl.search(interactive_plot=False)