# Recommender Systems 0 - Data Preparation

### Book-Crossing Dataset
Link to dataset files: http://www2.informatik.uni-freiburg.de/~cziegler/BX/

Collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community with kind permission from Ron Hornbaker, CTO of Humankind Systems. Contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books.

The Book-Crossing dataset comprises 3 tables.
#### BX-Users
Contains the users. Note that user IDs (`User-ID`) have been anonymized and map to integers. Demographic data is provided (`Location`, `Age`) if available. Otherwise, these fields contain NULL-values.
#### BX-Books
Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (`Book-Title`, `Book-Author`, `Year-Of-Publication`, `Publisher`), obtained from Amazon Web Services. Note that in case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavours (`Image-URL-S`, `Image-URL-M`, `Image-URL-L`), i.e., small, medium, large. These URLs point to the Amazon web site.
#### BX-Book-Ratings
Contains the book rating information. Ratings (`Book-Rating`) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.
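
Because a rating of 0 marks an implicit interaction rather than a low score, the two kinds of rating are usually separated before averaging. A minimal sketch on a hypothetical toy frame (`toy_ratings`, `explicit` and `implicit` are illustrative names, not part of the dataset files):

```python
import pandas as pd

# Toy frame using the BX-Book-Ratings column names described above.
toy_ratings = pd.DataFrame({
    "User-ID": [276725, 276726, 276727, 276729, 276729],
    "ISBN": ["034545104X", "0155061224", "0446520802", "052165615X", "0521795028"],
    "Book-Rating": [0, 5, 0, 3, 6],
})

# Explicit ratings lie on the 1-10 scale; a 0 marks an implicit interaction.
explicit = toy_ratings[toy_ratings["Book-Rating"] > 0]
implicit = toy_ratings[toy_ratings["Book-Rating"] == 0]

print(len(explicit), len(implicit))    # 3 explicit rows, 2 implicit rows
print(explicit["Book-Rating"].mean())  # mean over explicit ratings only
```

Averaging over all five rows instead would pull the mean down, since the zeros are not scores.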

### Import libraries

In [1]:
import pandas as pd
import numpy as np

### Load Data

In [2]:
# Load books
books = pd.read_csv('data/BX-Books.csv', sep=';', on_bad_lines="skip", encoding="latin-1",
                    low_memory=False)
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS',
                 'imageUrlM', 'imageUrlL']

# Load users
users = pd.read_csv('data/BX-Users.csv', sep=';', on_bad_lines="skip", encoding="latin-1",
                    low_memory=False)
users.columns = ['userID', 'Location', 'Age']

# Load ratings
ratings = pd.read_csv('data/BX-Book-Ratings.csv', sep=';', on_bad_lines="skip", encoding="latin-1",
                      low_memory=False)
ratings.columns = ['userID', 'ISBN', 'bookRating']

### Examine data

In [3]:
# Change display setting to display full text in columns
pd.set_option('display.max_colwidth', None)

#### books

In [4]:
books.shape

(271360, 8)

In [5]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0002005018.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0002005018.01.LZZZZZZZ.jpg
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0060973129.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0060973129.01.LZZZZZZZ.jpg
3,0374157065,Flu: The Story of the Great Influenza Pandemic of 1918 and the Search for the Virus That Caused It,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0374157065.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0374157065.01.LZZZZZZZ.jpg
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0393045218.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0393045218.01.LZZZZZZZ.jpg


In [6]:
books.isnull().sum()

ISBN                 0
bookTitle            0
bookAuthor           1
yearOfPublication    0
publisher            2
imageUrlS            0
imageUrlM            0
imageUrlL            3
dtype: int64

#### users

In [7]:
users.shape

(278858, 3)

In [8]:
users.head()

Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [9]:
users.isnull().sum()

userID           0
Location         0
Age         110762
dtype: int64
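
The raw count is easier to judge as a share of rows. A small sketch, using a hypothetical toy frame (`toy_users`) for the per-column share and the two counts printed above for the full table:

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as the users table.
toy_users = pd.DataFrame({
    "userID": [1, 2, 3, 4, 5],
    "Age": [np.nan, 18.0, np.nan, 17.0, np.nan],
})

# isnull().mean() gives the fraction of missing values per column.
missing_share = toy_users.isnull().mean()
print(missing_share)

# Same idea with the counts reported above: missing Age rows / total users.
print(f"{110762 / 278858:.1%}")
```

Roughly 40% of users have no age, so whatever imputation is chosen later affects a large part of the table.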

#### ratings

In [10]:
ratings.shape

(1149780, 3)

In [11]:
ratings.head()

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [12]:
ratings.isnull().sum()

userID        0
ISBN          0
bookRating    0
dtype: int64

### Clean data

#### Clean books

In [13]:
# Drop unnecessary columns
books_tmp = books.drop(['imageUrlS', 'imageUrlM', 'imageUrlL'], axis=1)

In [14]:
# Drop rows where bookAuthor or publisher are null
books_tmp = books_tmp.dropna(subset=['bookAuthor', 'publisher'])

In [15]:
# Check the unique values of yearOfPublication
books_tmp['yearOfPublication'].unique()

array(['2002', '2001', '1991', '1999', '2000', '1993', '1996', '1988',
       '2004', '1998', '1994', '2003', '1997', '1983', '1979', '1995',
       '1982', '1985', '1992', '1986', '1978', '1980', '1952', '1987',
       '1990', '1981', '1989', '1984', '0', '1968', '1961', '1958',
       '1974', '1976', '1971', '1977', '1975', '1965', '1941', '1970',
       '1962', '1973', '1972', '1960', '1966', '1920', '1956', '1959',
       '1953', '1951', '1942', '1963', '1964', '1969', '1954', '1950',
       '1967', '2005', '1957', '1940', '1937', '1955', '1946', '1936',
       '1930', '2011', '1925', '1948', '1943', '1947', '1945', '1923',
       '2020', '1939', '1926', '1938', '2030', '1911', '1904', '1949',
       '1932', '1928', '1929', '1927', '1931', '1914', '2050', '1934',
       '1910', '1933', '1902', '1924', '1921', '1900', '2038', '2026',
       '1944', '1917', '1901', '2010', '1908', '1906', '1935', '1806',
       '2021', '2012', '2006', 'DK Publishing Inc', 'Gallimard', '1909',
       

In [16]:
# Investigate the rows having 'DK Publishing Inc' as yearOfPublication
books_tmp.loc[(books_tmp.yearOfPublication == 'DK Publishing Inc'),:]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Began (Level 4: Proficient Readers)\"";Michael Teitelbaum""",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.01.THUMBZZZ.jpg
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Books Come to Life (Level 4: Proficient Readers)\"";James Buckley""",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.01.THUMBZZZ.jpg


In [17]:
# From above, the author was parsed into bookTitle, shifting the remaining fields one column to the left.
# Correct the affected rows manually:

#  ISBN '078946697X'
books_tmp.loc[books_tmp.ISBN == '078946697X','yearOfPublication'] = 2000
books_tmp.loc[books_tmp.ISBN == '078946697X','bookAuthor'] = "Michael Teitelbaum"
books_tmp.loc[books_tmp.ISBN == '078946697X','publisher'] = "DK Publishing Inc"
books_tmp.loc[books_tmp.ISBN == '078946697X','bookTitle'] = "DK Readers: Creating the X-Men, How It All Began (Level 4: Proficient Readers)"

#  ISBN '0789466953'
books_tmp.loc[books_tmp.ISBN == '0789466953','yearOfPublication'] = 2000
books_tmp.loc[books_tmp.ISBN == '0789466953','bookAuthor'] = "James Buckley"
books_tmp.loc[books_tmp.ISBN == '0789466953','publisher'] = "DK Publishing Inc"
books_tmp.loc[books_tmp.ISBN == '0789466953','bookTitle'] = "DK Readers: Creating the X-Men, How Comic Books Come to Life (Level 4: Proficient Readers)"


In [18]:
# Recheck books
books_tmp.loc[(books_tmp.ISBN == '078946697X') | (books_tmp.ISBN == '0789466953'),:]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Began (Level 4: Proficient Readers)",Michael Teitelbaum,2000,DK Publishing Inc
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Books Come to Life (Level 4: Proficient Readers)",James Buckley,2000,DK Publishing Inc


In [19]:
# Investigate the rows having 'Gallimard' as yearOfPublication
books_tmp.loc[books_tmp.yearOfPublication == 'Gallimard',:]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
220731,2070426769,"Peuple du ciel, suivi de 'Les Bergers\"";Jean-Marie Gustave Le ClÃ?Â©zio""",2003,Gallimard,http://images.amazon.com/images/P/2070426769.01.THUMBZZZ.jpg


In [20]:
# From above, the author was parsed into bookTitle, shifting the remaining fields one column to the left.
# Correct the affected row manually:

#  ISBN '2070426769'
books_tmp.loc[books_tmp.ISBN == '2070426769','yearOfPublication'] = 2003
# Author name written out in full; the raw file holds a mis-encoded 'é'
books_tmp.loc[books_tmp.ISBN == '2070426769','bookAuthor'] = "Jean-Marie Gustave Le Clézio"
books_tmp.loc[books_tmp.ISBN == '2070426769','publisher'] = "Gallimard"
books_tmp.loc[books_tmp.ISBN == '2070426769','bookTitle'] = "Peuple du ciel, suivi de 'Les Bergers"

In [21]:
# Recheck book
books_tmp.loc[books_tmp.ISBN == '2070426769',:]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
220731,2070426769,"Peuple du ciel, suivi de 'Les Bergers",Jean-Marie Gustave Le Clézio,2003,Gallimard


In [22]:
print(f'Original books shape: {books.shape}')
print(f'Cleaned books shape:  {books_tmp.shape}')

Original books shape: (271360, 8)
Cleaned books shape:  (271357, 5)
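
Note: the manual corrections above assign integer years into a column that otherwise holds strings, and the list of unique values also contains `'0'` and years after the 2004 crawl. One possible way to normalise the column is sketched below on a toy series. The cutoff `MAX_YEAR` is an assumption made for illustration, not something the dataset documentation specifies.

```python
import pandas as pd

# Toy column mimicking the mixed values seen in yearOfPublication.
years = pd.Series(["2002", "0", 2000, "2050", "Gallimard", "1806"])

# Coerce everything to numbers; non-numeric entries become NaN.
years_num = pd.to_numeric(years, errors="coerce")

# Assumption: the crawl dates from 2004, so 0 and later years are treated as unknown.
MAX_YEAR = 2004  # hypothetical cutoff
years_num = years_num.where((years_num > 0) & (years_num <= MAX_YEAR))

print(years_num.tolist())
```

Whether implausible years should become NaN, be imputed, or be kept as-is depends on how the column is used downstream.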


In [23]:
# Save cleaned books to file
books_tmp.to_csv('data/BX-Books_cleaned.csv', index=False)
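
When the cleaned file is read back, ISBNs that consist only of digits can lose their leading zeros if pandas infers a numeric type. The sketch below shows the effect on a hypothetical two-row frame, using an in-memory buffer instead of a file:

```python
import io
import pandas as pd

toy_books = pd.DataFrame({
    "ISBN": ["0195153448", "0002005018"],
    "bookTitle": ["Classical Mythology", "Clara Callan"],
})

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
toy_books.to_csv(buffer, index=False)

# Default type inference turns the all-digit ISBNs into integers.
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Forcing the column to str keeps the identifiers intact.
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={"ISBN": str})

print(inferred["ISBN"].tolist())  # [195153448, 2005018]
print(as_text["ISBN"].tolist())   # ['0195153448', '0002005018']
```

In the full file the column is likely read as text anyway, since some ISBNs end in `X`, but passing `dtype` explicitly removes the dependence on inference.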

#### Clean users

In [24]:
# Review Age values
print(sorted(users.Age.unique()))

[nan, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0, 88.0, 89.0, 90.0, 91.0, 92.0, 93.0, 94.0, 95.0, 96.0, 97.0, 98.0, 99.0, 100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0, 108.0, 109.0, 110.0, 111.0, 113.0, 114.0, 115.0, 116.0, 118.0, 119.0, 123.0, 124.0, 127.0, 128.0, 132.0, 133.0, 136.0, 137.0, 138.0, 140.0, 141.0, 143.0, 146.0, 147.0, 148.0, 151.0, 152.0, 156.0, 157.0, 159.0, 162.0, 168.0, 172.0, 175.0, 183.0, 186.0, 189.0, 199.0, 200.0, 201.0, 204.0, 207.0, 208.0, 209.0, 210.0, 212.0, 219.0, 220.0, 223.0, 226.0

Note: The Age column contains missing and implausible entries (e.g. NaN, 0, and values above 100).

In [25]:
users_tmp = users.copy(deep=True)

# Ages below 5 or above 100 are implausible for book raters, so replace them with NaN
users_tmp.loc[(users_tmp.Age > 100) | (users_tmp.Age < 5), 'Age'] = np.nan

In [26]:
# Replace NaNs with the mean of the remaining valid ages
users_tmp.Age = users_tmp.Age.fillna(users_tmp.Age.mean())

In [27]:
# Cast to int (this truncates the fractional part of the imputed mean)
users_tmp.Age = users_tmp.Age.astype(np.int32)

In [28]:
# Recheck Age values
print(sorted(users_tmp.Age.unique()))

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]
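
Note: mean imputation assigns the same age to every user without a valid value, and the integer cast truncates it. The sketch below replays the steps on a hypothetical toy series and compares them with median imputation, which is an alternative suggested here and not what the notebook uses.

```python
import numpy as np
import pandas as pd

# Toy ages: three valid values, one missing, one too low, one too high.
age = pd.Series([18.0, 17.0, 62.0, np.nan, 0.0, 226.0])

# Same rule as above: ages outside 5-100 are treated as missing.
age = age.where(age.between(5, 100))

# Mean of the valid ages is 97 / 3 = 32.33..., which the cast truncates to 32.
mean_filled = age.fillna(age.mean()).astype(np.int32)

# Alternative (an assumption, not the notebook's choice): impute the median.
median_filled = age.fillna(age.median()).astype(np.int32)

print(mean_filled.tolist())    # [18, 17, 62, 32, 32, 32]
print(median_filled.tolist())  # [18, 17, 62, 18, 18, 18]
```

The median is less sensitive to a few older users, which may matter when a large share of the column is imputed.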


In [29]:
# Review Location
users_tmp['Location'].head(1)

0    nyc, new york, usa
Name: Location, dtype: object

In [30]:
# Expand Location into city, state and country
users_tmp['Location'].str.split(',', expand=True).head(1)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,nyc,new york,usa,,,,,,


In [31]:
# Create new column 'country'
users_tmp['country'] = users_tmp['Location'].str.split(',', expand=True)[2].str.lstrip()
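
The split above produces up to nine columns, so the country is not guaranteed to sit at position 2. A hedged alternative is to take the last comma-separated token instead. The toy locations below are illustrative (the third one is made up), and locations ending in a comma would still yield an empty string.

```python
import pandas as pd

location = pd.Series([
    "nyc, new york, usa",
    "porto, v.n.gaia, portugal",
    "springfield, greene county, missouri, usa",  # hypothetical four-part location
])

# Fixed position: the third token, as in the cell above.
by_position = location.str.split(",", expand=True)[2].str.strip()

# Alternative: always take the last token, whatever the number of commas.
by_last = location.str.rsplit(",", n=1).str[-1].str.strip()

print(by_position.tolist())  # ['usa', 'portugal', 'missouri']
print(by_last.tolist())      # ['usa', 'portugal', 'usa']
```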

In [32]:
# Generate list of countries with more than 1000 users (value_counts counts user rows, not ratings)
country_list = users_tmp['country'].value_counts().loc[lambda x: (x>1000)].keys()
country_list = [x for x in country_list if x != '']
print(country_list)

['usa', 'canada', 'united kingdom', 'germany', 'spain', 'australia', 'italy', 'france', 'portugal', 'new zealand', 'netherlands', 'switzerland', 'brazil', 'china', 'sweden', 'india', 'austria', 'malaysia', 'argentina']


In [33]:
# Keep ONLY users from the countries in country_list
users_tmp = users_tmp[users_tmp.country.isin(country_list)]
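
The threshold filter can be checked on a small frame before applying it to the full table. A sketch with hypothetical toy data and a lowered threshold (`MIN_USERS` is an illustrative name; the notebook uses 1000):

```python
import pandas as pd

toy_users = pd.DataFrame({
    "userID": [1, 2, 3, 4, 5, 6],
    "country": ["usa", "usa", "usa", "canada", "canada", ""],
})

MIN_USERS = 1  # hypothetical threshold for the toy data

# Count users per country, keep those above the threshold, drop the empty label.
counts = toy_users["country"].value_counts()
keep = [c for c in counts[counts > MIN_USERS].index if c != ""]

filtered = toy_users[toy_users["country"].isin(keep)]
print(keep)           # ['usa', 'canada']
print(len(filtered))  # 5
```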

In [34]:
print(f'Original users shape: {users.shape}')
print(f'Cleaned users shape:  {users_tmp.shape}')

Original users shape: (278858, 3)
Cleaned users shape:  (255809, 4)


In [35]:
# Save cleaned users to file
users_tmp.to_csv('data/BX-Users_cleaned.csv', index=False)

#### Clean ratings

In [36]:
# ratings dataset should only have ratings for books which exist in the books dataset
ratings_tmp = ratings[ratings.ISBN.isin(books_tmp.ISBN)]

In [37]:
# ratings dataset should only have ratings for users which exist in the users dataset
ratings_tmp = ratings_tmp[ratings_tmp.userID.isin(users_tmp.userID)]

In [38]:
print(f'Original ratings shape: {ratings.shape}')
print(f'Cleaned ratings shape:  {ratings_tmp.shape}')

Original ratings shape: (1149780, 3)
Cleaned ratings shape:  (976837, 3)
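
To see how much each filter contributes to the drop in rows, the two membership tests can be kept as separate boolean masks. A minimal sketch on hypothetical toy frames (all names prefixed with `toy_` are made up for illustration):

```python
import pandas as pd

toy_books = pd.DataFrame({"ISBN": ["034545104X", "0155061224"]})
toy_users = pd.DataFrame({"userID": [276725, 276726]})
toy_ratings = pd.DataFrame({
    "userID": [276725, 276726, 276727, 276725],
    "ISBN": ["034545104X", "0155061224", "034545104X", "0446520802"],
    "bookRating": [0, 5, 0, 7],
})

# One mask per condition, so each can be counted on its own.
known_book = toy_ratings["ISBN"].isin(toy_books["ISBN"])
known_user = toy_ratings["userID"].isin(toy_users["userID"])

cleaned = toy_ratings[known_book & known_user]

print(f"unknown book: {(~known_book).sum()}")  # 1
print(f"unknown user: {(~known_user).sum()}")  # 1
print(f"kept:         {len(cleaned)}")         # 2
```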


In [39]:
# Save cleaned ratings to file
ratings_tmp.to_csv('data/BX-Book-Ratings_cleaned.csv', index=False)