<h1 style="background-color:Gray;color: white;font-family:sans-serif;font-size:200%;text-align:center">The World Belongs to Those Who Read</h1>

In [None]:
from PIL import Image
import os
Image.open("../input/booksbooksbooks/library-1666702_1920.jpg")

<h2 style="background-color:Gray;color:white;font-family:sans-serif;font-size:150%;text-align:center">Table of Contents</h2>

* [1. Introduction](#1)
* [2. Libraries](#2)
* [3. Data Understanding](#3)
    * [3.1 Missing Values](#3.1)
    * [3.2 Duplicates](#3.2)
    * [3.3 Distributions](#3.3)
* [4. Data Analysis](#4) 
    * [4.1 Which Authors Write the Most Bestsellers?](#4.1)
    * [4.2 Which Genre Dominates which Year?](#4.2)
    * [4.3 How does the Mean Price Change over the Years?](#4.3)
    * [4.4 What's the Mean Price in each Genre?](#4.4)
    * [4.5 Which Books have the Most Reviews?](#4.5)
    * [4.6 Do Genres Differ in the Number of Reviews?](#4.6)
    * [4.7 Which Books have the Highest User Rating?](#4.7)
    * [4.8 How does the User Rating Change over the Years?](#4.8)
    * [4.9 Does a Higher Rating Lead to a Higher Price?](#4.9)
    * [4.10 Which Words make a Bestseller's Title?](#4.10)
* [5. Models](#5) 
    * [5.1 Preprocessing](#5.1)
    * [5.2 Choice of a Classification Metric](#5.2)
    * [5.3 Correlations](#5.3)
    * [5.4 What's the Genre of a Book?](#5.4)
    * [5.5 What's the Worth (Price) of a Book?](#5.5)
    * [5.6 How Popular is a Book?](#5.6)
* [6. Conclusion](#6) 
* [7. Evaluation](#7) 
 

<a id="1"></a>
<h2 style="background-color:Gray;color:white;font-family:sans-serif;font-size:150%;text-align:center">Introduction</h2>

In the following, we examine what influences a bestseller's genre, price and user rating. Afterwards, we build models to predict these features. The analysis is based on a dataset of Amazon's Top 50 bestselling books from 2009 to 2019. The books have been categorized into fiction and non-fiction using Goodreads, an American social cataloging website that allows people to search its database of books, annotations, quotes, and reviews.

With the data analysis the following questions will be answered and visualized:
* Which Authors Write the Most Bestsellers?
* Which Genre Dominates which Year?
* How does the Mean Price Change over the Years?
* What's the Mean Price in each Genre?
* Which Books have the Most Reviews?
* Do Genres Differ in the Number of Reviews?
* Which Books have the Highest User Rating?
* How does the User Rating Change over the Years?
* Does a Higher Rating Lead to a Higher Price?
* Which Words make a Bestseller's Title?


With the models the following questions will be answered:
* What's the genre of a book? This is a real-life use case. I recently talked to a data scientist at a bookselling company who is currently working on models to determine the genre of their books.
* What's the worth (price) of a book? Clearly this is an important question for authors, booksellers as well as customers.
* How popular is a book? This allows authors and booksellers to find out the preferences of readers and to improve their work.

<a id="2"></a>
<h2 style="background-color:Gray;color:white;font-family:sans-serif;font-size:150%;text-align:center">Libraries</h2>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import re

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from wordcloud import WordCloud, STOPWORDS

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression, Ridge, Lasso
from xgboost import XGBClassifier, XGBRegressor

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import r2_score, mean_squared_error

import warnings
warnings.filterwarnings('ignore')



import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

<a id="3"></a>
<h2 style="background-color:Gray;color:white;font-family:sans-serif;font-size:150%;text-align:center">Data Understanding</h2>

Since these are bestsellers, the Amazon user ratings range only from 3.3 to 4.9 out of 5. The number of reviews written on Amazon, however, varies considerably between the books. The prices (as of 13/10/2020) are integers between 0 and 105. The years range from 2009 to 2019. The genre only distinguishes between 'Fiction' and 'Non Fiction'.

In [None]:
bestsellers = pd.read_csv('/kaggle/input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv')
bestsellers.head()

In [None]:
bestsellers.info()

In [None]:
bestsellers.rename(columns={'User Rating': 'User_Rating'}, inplace=True)
bestsellers.describe()

<a id="3.1"></a>
<h3 style="background-color:Gray;color:white;font-family:sans-serif;font-size:120%;text-align:center">Missing Values</h3>

There are twelve entries with a price of zero. Since 'To Kill a Mockingbird' cost 7 in 2019 and the 'Diary of a Wimpy Kid' series normally has higher prices as well, we assume that these are missing values, except for 'The Constitution of the United States'.

In [None]:
bestsellers[bestsellers['Price'] == 0].sort_values('Author')

Since there are just a few missing values, we can make detailed adjustments. Let's check whether the authors with missing prices have written other bestsellers. If so, we can derive the price from their other books. If not, we will use the mean price of the genre in that year.

In [None]:
bestsellers[bestsellers['Author'].isin(['Alice Schertle', 
                                        'Harper Lee', 
                                        'Jeff Kinney', 
                                        'RH Disney', 
                                        'Stephenie Meyer'])].sort_values('Author')

* Because Alice Schertle and RH Disney have not written any other bestsellers in this period, we estimate their prices by the mean price of fiction in 2014 (excluding the other missing values).
* For Harper Lee's 'To Kill a Mockingbird' we take the price of the book in 2019.
* For Jeff Kinney's 'Diary of a Wimpy Kid' books we use the mean price of the previous and the following book of the series (rounded up).
* The price of 'The Short Second Life of Bree Tanner: An Eclipse Novella (Twilight Saga)' is estimated by the mean of Stephenie Meyer's other two 'Eclipse' books.
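
The fallback rule for authors without other bestsellers (mean price of the genre in that year, ignoring the zero entries) can be sketched on a toy frame. The data below is made up for illustration and only the column names follow the dataset; this is a minimal sketch of the idea, not the cell used in this notebook.

```python
import pandas as pd

# Toy data (made up); a price of 0 stands for a missing value.
toy = pd.DataFrame({
    'Name':  ['A', 'B', 'C', 'D', 'E'],
    'Genre': ['Fiction', 'Fiction', 'Fiction', 'Non Fiction', 'Non Fiction'],
    'Year':  [2014, 2014, 2014, 2014, 2014],
    'Price': [10.0, 0.0, 20.0, 8.0, 12.0],
})

# Mean price per (Genre, Year), computed without the zero entries.
group_means = (toy[toy['Price'] > 0]
               .groupby(['Genre', 'Year'])['Price']
               .mean()
               .rename('Group_Mean')
               .reset_index())

# Attach the group mean to every row and use it where the price is zero.
filled = toy.merge(group_means, on=['Genre', 'Year'], how='left')
filled['Price'] = filled['Price'].where(filled['Price'] > 0, filled['Group_Mean'])
filled = filled.drop(columns='Group_Mean')

print(filled)
```

With only a handful of missing values, the manual adjustments used here remain easier to check by eye; a rule like this becomes more useful when there are many gaps.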

In [None]:
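# Mean price for fiction books in 2014 (zero prices excluded)
bestsellers['Price'] = bestsellers['Price'].astype(float)  # allow non-integer imputed prices
bestsellers_2014 = bestsellers[bestsellers['Year'] == 2014]
price_fiction_2014 = bestsellers_2014[bestsellers_2014['Price'] > 0].groupby('Genre')['Price'].mean()['Fiction']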
# 'Little Blue Truck'
bestsellers.loc[219, 'Price'] = price_fiction_2014
# Disney's Frozen
bestsellers.loc[116, 'Price'] = price_fiction_2014
bestsellers.loc[193, 'Price'] = price_fiction_2014


# To Kill a Mockingbird
bestsellers.loc[bestsellers.Name == 'To Kill a Mockingbird', 'Price'] = 7

# Wimpy Kid
# 'The Getaway' is part 12 of the series. Part 11 costs 20 and part 13 costs 8, so the mean is 14.
bestsellers.loc[381, 'Price'] = 14
# Book 8: book 7 costs 7 and book 9 ('The Long Haul') costs 22; the mean of 14.5 is rounded up to 15.
bestsellers.loc[71, 'Price'] = 15
# Change the price of book 6.
bestsellers.loc[42, 'Price'] = 10

#The Short Second Life of Bree Tanner
bestsellers.loc[461, 'Price'] = 13

In [None]:
#Check
bestsellers[bestsellers['Price'] == 0].sort_values('Author')

<a id="3.2"></a>
<h3 style="background-color:Gray;color:white;font-family:sans-serif;font-size:120%;text-align:center">Duplicates</h3>

Many books have been bestsellers in more than one year. For such 'duplicate' books, all columns except the year have the same values. In particular, the number of reviews and the user rating represent the totals over all years.

In the following, we will work with a dataframe of unique bestsellers unless otherwise stated. In case of duplicates the book is assigned to the first year it has been a bestseller.
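
Note that `drop_duplicates` keeps the first occurrence in row order, so "first year" only holds if the rows of each book are ordered by year. A minimal sketch on made-up data, assuming the rows might not be ordered that way, sorts by year explicitly before dropping duplicates:

```python
import pandas as pd

# Toy data (made up): book 'X' was a bestseller in three years.
toy = pd.DataFrame({
    'Name':   ['X', 'Y', 'X', 'X'],
    'Author': ['Ann', 'Bob', 'Ann', 'Ann'],
    'Year':   [2016, 2012, 2014, 2015],
})

# Sort by year first so that keep='first' really keeps the earliest year,
# then restore the original row order of the remaining rows.
first_year = (toy.sort_values('Year')
                 .drop_duplicates(subset=['Name', 'Author'], keep='first')
                 .sort_index())

print(first_year)
```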

In [None]:
bestsellers[bestsellers.duplicated(subset=['Name', 'Author'], keep=False)].sort_values('Name')

In [None]:
# Build a second dataframe with unique bestsellers.
# In case of duplicates the book is assigned to the first year it has been a bestseller.

unique_bestsellers = bestsellers.drop_duplicates(subset=['Name', 'Author'])

<a id="3.3"></a>
<h3 style="background-color:Gray;color:white;font-family:sans-serif;font-size:120%;text-align:center">Distributions</h3>

Let's take a look at the distributions of the features:
* The user ratings lie predominantly between 4.6 and 4.8 out of 5.
* Most books have fewer than 10,000 reviews.
* The prices lie mainly between 0 and 20.
* Slightly more bestsellers are non-fiction (54%) than fiction.

In [None]:
# Take a look at the distributions
def distribution_plot(col, boundaries=(0, 100)):
    """
    Description: Plots a histogram of a column of unique_bestsellers to show its distribution.
    
    Arguments:
        col: name of the column to plot
        boundaries: (min, max) range that should be plotted
    
    Returns:
       None (the plot is shown)
    """
    
    plt.figure(figsize=(4,2))
    unique_bestsellers[col].hist(range=boundaries, bins=20, color='lightsalmon', edgecolor='palevioletred', 
                       linewidth=1)  
    plt.grid(False)
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.title('Distribution of the ' + col)
    plt.show()
    

distribution_plot('User_Rating', boundaries=(3.3, 5))
distribution_plot('Reviews', boundaries=(37, 87841))
distribution_plot('Price', boundaries=(0, 105)) 

In [None]:
genre_distribution = unique_bestsellers['Genre'].value_counts()
# Take the labels from the index so they always match the counts
plt.pie(genre_distribution, labels=genre_distribution.index, autopct='%1.2f', startangle=90, 
           colors=['lightsalmon', 'palevioletred'])
_ = plt.title('Distribution of the Genre')

<a id="4"></a>
<h2 style="background-color:Gray;color:white;font-family:sans-serif;font-size:150%;text-align:center">Data Analysis</h2>

<a id="4.1"></a>
<h3 style="background-color:Gray;color:white;font-family:sans-serif;font-size:120%;text-align:center">Which Authors Write the Most Bestsellers?</h3>

There are some authors who wrote several bestsellers, often as part of a book series. Therefore the author will be a very important feature for the later models.

In [None]:
# Which Authors Write the Most Bestsellers?

books_per_author = unique_bestsellers.groupby(['Author']).count().Name.sort_values(ascending=False)

plt.figure(figsize=(8,5))
books_per_author.iloc[:12].plot(kind='barh', color=['purple', 'palevioletred', 'salmon', 'lightsalmon'])
plt.title('12 Authors with the Most Bestsellers')
plt.gca().invert_yaxis()
plt.xlabel('Number of Bestsellers')
_ = plt.ylabel('Author')

With twelve bestsellers (in eleven years!) Jeff Kinney is the unchallenged top author with the series 'Diary of a Wimpy Kid'.

In [None]:
unique_bestsellers[unique_bestsellers['Author'] == 'Jeff Kinney']

In [None]:
Image.open("../input/booksbooksbooks/WimpyKids.jpg")

<a id="4.2"></a>
<h3 style="background-color:Gray;color:white;font-family:sans-serif;font-size:120%;text-align:center">Which Genre Dominates Which Year?</h3>

In this case 'duplicates' are kept in the data because they represent the readers' taste in each year. Even though non-fiction is represented more often over the years, fiction dominated in 2014, followed by a strong non-fiction year in 2015.

In [None]:
fiction = bestsellers[bestsellers['Genre'] == 'Fiction'].groupby(['Year']).count().Genre / 50
non_fiction = bestsellers[bestsellers['Genre'] == 'Non Fiction'].groupby(['Year']).count().Genre / 50

plt.figure(figsize=(8,5))
fiction.plot(kind='bar', color='palevioletred')
non_fiction.plot(kind='bar', bottom=fiction, color='lightsalmon')
plt.title('Which Genre Dominates Which Year?')
plt.xlabel('Year')
plt.ylabel('Proportion of the Total Number of Bestsellers')
plt.legend(('Fiction', 'Non Fiction'), loc='upper left', bbox_to_anchor=(1,1), ncol=1)
_ = plt.xticks(rotation=45)

<a id="4.3"></a>
<h3 style="background-color:Gray;color:white;font-family:sans-serif;font-size:120%;text-align:center">How does the Mean Price Change over the Years?</h3>

In [None]:
# Select the numeric column before aggregating (the frame also contains text columns)
price_per_year = bestsellers.groupby('Year')['Price'].mean()

plt.figure(figsize=(8,5))
price_per_year.plot(kind='line', color='palevioletred')
plt.title('Development of the Mean Price')
plt.xlabel('Year')
_ = plt.ylabel('Mean Price')

There is a downward trend in the mean price per year. Bestsellers are getting cheaper. 

<a id="4.4"></a>
<h3 style="background-color:Gray;color:white;font-family:sans-serif;font-size:120%;text-align:center">What's the Mean Price in each Genre?</h3>

Non-fiction books are about 13% more expensive than fiction books.

In [None]:
# Calculate the mean price per genre

price_per_genre = unique_bestsellers.groupby('Genre')['Price'].mean()

plt.figure(figsize=(8,5))
abc = price_per_genre.plot(kind='bar', color=['palevioletred', 'lightsalmon'])
plt.title('Mean Price in each Genre')
plt.xlabel('Genre')
plt.ylabel('Mean Price')
_ = plt.xticks(rotation=0)

<a id="4.5"></a>
<h3 style="background-color:Gray;color:white;font-family:sans-serif;font-size:120%;text-align:center">Which Books have the Most Reviews?</h3>

The number of reviews ranges widely from 37 to 87,841. It could be an indicator of the number of copies sold or of how strongly a book affects people emotionally.

In [None]:
# Cosmetic fix for the plot: shorten a very long title
# (assign the result back instead of using inplace on a column selection)
unique_bestsellers['Name'] = unique_bestsellers['Name'].replace(
    {'Fifty Shades of Grey: Book One of the Fifty Shades Trilogy (Fifty Shades of Grey Series)': 
     'Book One of the Fifty Shades Trilogy'})

# Search for the books with the highest number of reviews
best_reviews = unique_bestsellers[['Name','Reviews']].groupby('Name').sum().sort_values('Reviews', ascending=False)

best_reviews.iloc[:10].plot(kind='barh', color=['salmon', 'lightsalmon'])
plt.gcf().set_size_inches(8, 5)
plt.title('10 Books with the Most Reviews')
plt.gca().invert_yaxis()
plt.xlabel('Number of Reviews')
_ = plt.ylabel('Book')

By far the most reviews were given to 'Where the Crawdads Sing' by Delia Owens (user rating 4.8) and 'The Girl on the Train' by Paula Hawkins (user rating 4.1). In 2016 a movie adaptation of 'The Girl on the Train' came out. However, it could not convince the audience as much as the book.

In [None]:
Image.open("../input/booksbooksbooks/crawdads.jpg")

In [None]:
Image.open("../input/booksbooksbooks/girlonthetrain.jpg")

<a id="4.6"></a>
<h3 style="background-color:Gray;color:white;font-family:sans-serif;font-size:120%;text-align:center">Do Genres Differ in the Number of Reviews?</h3>

With 2,097,771 reviews, fiction received about 56% more reviews than non-fiction with 1,341,918 reviews.
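
The percentage follows directly from the two totals; a short check of the arithmetic:

```python
fiction_reviews = 2_097_771
non_fiction_reviews = 1_341_918

# Relative difference of fiction compared with non-fiction, in percent
relative_difference = (fiction_reviews / non_fiction_reviews - 1) * 100
print('{:.1f}% more reviews for fiction'.format(relative_difference))
```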

In [None]:
reviews_per_genre = unique_bestsellers.groupby('Genre')['Reviews'].sum()

plt.figure(figsize=(8,5))
abc = reviews_per_genre.plot(kind='bar', color=['palevioletred', 'lightsalmon'])
plt.title('Number of Reviews in each Genre')
plt.xlabel('Genre')
plt.ylabel('Number of Reviews')
_ = plt.xticks(rotation=0)

<a id="4.7"></a>
<h3 style="background-color:Gray;color:white;font-family:sans-serif;font-size:120%;text-align:center">Which Books have the Highest User Rating?</h3>

28 books share the highest rating that occurs in the data, 4.9.

In [None]:
# Books with the highest occurring rating
unique_bestsellers[unique_bestsellers['User_Rating'] == 4.9].sort_values('Reviews', ascending=False)

Dav Pilkey tops the list with six books that received the best user rating, followed by J.K. Rowling with four books.

<a id="4.8"></a>
<h3 style="background-color:Gray;color:white;font-family:sans-serif;font-size:120%;text-align:center">How does the User Rating Change over the Years?</h3>

Again, when looking at several years we have to keep the 'duplicates' in the data to avoid distortions, for example because of a disproportionate genre distribution.

In [None]:
rating_per_year = bestsellers.groupby('Year')['User_Rating'].mean()

plt.figure(figsize=(8,5))
rating_per_year.plot(kind='line', color='palevioletred')
plt.title('Development of the Mean Rating')
plt.xlabel('Year')
_ = plt.ylabel('Mean Rating')

Since 2012 we see an upward trend in the mean user rating from 4.5 to 4.7 out of 5.

<a id="4.9"></a>
<h3 style="background-color:Gray;color:white;font-family:sans-serif;font-size:120%;text-align:center">Does a Higher Rating Lead to a Higher Price?</h3>

There is no clear relationship between the user rating and the price. 

In [None]:
# Does a Higher Rating Lead to a Higher Price?

ratings_reviews = unique_bestsellers.groupby('User_Rating')['Price'].mean()

plt.figure(figsize=(8,5))
ratings_reviews.plot(kind='line', color='palevioletred')
plt.title('Does a Higher Rating Lead to a Higher Price?')
plt.xlabel('Rating')
_ = plt.ylabel('Mean Price')

<a id="4.10"></a>
<h3 style="background-color:Gray;color:white;font-family:sans-serif;font-size:120%;text-align:center">Which Words make a Bestseller's Title?</h3>

Since some authors have been very successful with whole book series, some very specific words like 'Harry Potter', 'Wimpy Kid', 'Dog Man' or 'Fifty Shade(s)' show up.

In [None]:
def tokenize(text):
    """
    Description: This function processes texts in order to create a wordcloud. 
    
    Arguments:
        text: String
    
    Returns:
        clean_tokens: lists of normalized, tokenized and lemmatized words of the text without stopwords
    """
    
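    # normalize case and remove punctuation
    # (A-Z instead of A-z: the range A-z would also keep characters such as [ \ ] ^ _ and `)
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text.lower())
    
    #tokenize text
    tokens = word_tokenize(text) 
    
    #lemmatize and remove stopwords
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))  # build the set once instead of once per token
    clean_tokens = []
    for tok in tokens:
        if tok not in stop_words:
            # lemmatize as a verb first, then as a noun (the default)
            clean_tok = lemmatizer.lemmatize(lemmatizer.lemmatize(tok, pos='v'))
            clean_tokens.append(clean_tok)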
    
    return clean_tokens

In [None]:
def word_list(lists):
    """
    Description: This function reformats separate lists of words into one list of words.
    
    Arguments:
        lists: separate lists of words
    
    Returns:
        list_of_words: list of words of all lists
    """
    
    list_of_words = []
    
    # 'words' instead of 'list' to avoid shadowing the built-in list type
    for words in lists:
        for word in words:
            list_of_words.append(word)
    return list_of_words


In [None]:
# Plot a wordcloud of tokenized book titles
stopwords_cloud = set(STOPWORDS)
stopwords_cloud.update(['book', 'novel'])

text = ' '.join([word for word in word_list(unique_bestsellers['Name'].apply(tokenize))])
reading_woman = np.array(Image.open("../input/booksbooksbooks/book-1296329_1280_2.png"))
cloud = WordCloud(stopwords=stopwords_cloud, 
                  background_color='white', 
                  max_words=75, 
                  mask=reading_woman, 
                  contour_width=3, 
                  contour_color='lightsalmon').generate(text)
plt.figure(figsize=(20, 10))
plt.axis("off")
_ = plt.imshow(cloud)

<a id="5"></a>
<h2 style="background-color:Gray;color:white;font-family:sans-serif;font-size:150%;text-align:center">Models</h2>

<a id="5.1"></a>
<h3 style="background-color:Gray;color:white;font-family:sans-serif;font-size:120%;text-align:center">Preprocessing</h3>

We have already handled missing values and duplicates in the "Data Understanding" section, which is also important for achieving good results with the models.

For the genre and user rating predictions, books with a price of zero would have distorted the relationship between the price and the genre/user rating and could have led to false classifications/ratings. For the pricing model, those books could have led to lower price predictions.

In all cases, duplicates would have led to seemingly better models, since part of the test data would correspond to information already known from the training data.

Hence, we have already excluded possible sources of error with these adjustments.
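
The effect of duplicates on a random train/test split can be illustrated with a small pure-Python sketch. The titles and the split helper `shared_titles` are made up for this illustration; the notebook itself uses `train_test_split`.

```python
import random

# Toy data (made up): 10 titles, each a bestseller in 3 different years.
rows = [('Book {}'.format(i), year) for i in range(10) for year in (2010, 2011, 2012)]

def shared_titles(data, test_size, seed=42):
    """Randomly split data and count test rows whose title also occurs in the training part."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    test, train = shuffled[:test_size], shuffled[test_size:]
    train_titles = {title for title, _ in train}
    return sum(title in train_titles for title, _ in test)

# With duplicates: some test titles are already known from the training data.
leaked_with_duplicates = shared_titles(rows, test_size=4)

# Without duplicates: only the first year of each title is kept.
unique_rows = [('Book {}'.format(i), 2010) for i in range(10)]
leaked_without_duplicates = shared_titles(unique_rows, test_size=4)

print(leaked_with_duplicates, leaked_without_duplicates)
```

With three copies per title and four test rows, at least one test row must have a copy in the training part, whatever the seed. After deduplication no title can appear on both sides.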

The models can only handle numerical data. Therefore we have to convert the categorical data in 'Author' and 'Genre' as well as the strings in the book titles.

In [None]:
# Process categorical data for modeling
unique_bestsellers_preprocessed = pd.get_dummies(unique_bestsellers.drop(['Name'], axis=1),
                                                 drop_first=True)

In [None]:
# Process book titles with NLP methods for modeling
tfidf = TfidfVectorizer()
transformed_names = tfidf.fit_transform(unique_bestsellers['Name'])
# get_feature_names() was removed in newer scikit-learn versions; use get_feature_names_out()
transformed_names_df = pd.DataFrame(transformed_names.toarray(), columns=tfidf.get_feature_names_out())
unique_bestsellers_preprocessed = pd.concat([unique_bestsellers_preprocessed.reset_index(drop=True),
                                             transformed_names_df.reset_index(drop=True)], axis=1)

<a id="5.2"></a>
<h3 style="background-color:Gray;color:white;font-family:sans-serif;font-size:120%;text-align:center">Choice of a Classification Metric</h3>

Accuracy is the basic classification metric: the proportion of correct predictions among all cases examined. It is easy to understand but not always useful. For example, if we want to find spam mails, we could just say 'no' all the time and be 98% accurate (depending on the account). If a target class is very sparse (like spam mails), models can have a high accuracy without being valuable.

In our case the genres are relatively balanced, so accuracy would be an option. Nevertheless, we will print a classification report showing the precision, recall and F1-score for each classifier.

Precision is the proportion of true positives among the predicted positives. It is a valid choice of evaluation metric when we want to be very sure of our positive predictions. The disadvantage of being that careful is that the number of false negatives increases. Therefore precision alone is unsuitable for our application, as no genre is more valuable than the other.

Recall is the proportion of actual positives that are correctly classified. It is a valid choice of evaluation metric when we want to capture as many positives as possible. The problem: recall is 1 if we predict 1 for all examples.

The F1-score is a metric that captures the trade-off between precision and recall. It is a number between 0 and 1 and is the harmonic mean of precision and recall. If either the precision or the recall is low, the F1-score is low as well. Since we want a model with both good precision and good recall, we will use the F1-score.
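
The spam example can be made concrete with a small pure-Python sketch. The labels are made up, the helper `binary_metrics` is a hypothetical name, and treating an undefined ratio as 0 is a convention chosen for this illustration.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Made-up spam example: 2 spam mails (label 1) among 100 mails.
y_true = [1, 1] + [0] * 98

always_no = [0] * 100    # a "classifier" that never flags spam
always_yes = [1] * 100   # a "classifier" that flags everything as spam

metrics_no = binary_metrics(y_true, always_no)
metrics_yes = binary_metrics(y_true, always_yes)
print(metrics_no)
print(metrics_yes)
```

Always answering 'no' reaches a high accuracy but a recall and F1-score of zero, while always answering 'yes' reaches a recall of 1 with a very low precision and F1-score. Neither trivial strategy gets a good F1-score.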

<a id="5.3"></a>
<h3 style="background-color:Gray;color:white;font-family:sans-serif;font-size:120%;text-align:center">Correlations</h3>

Let's also take a look at the heatmap to get a better feeling for correlations of the features before modeling.

In [None]:

# Plot heatmap of correlations
heatmap_data = unique_bestsellers_preprocessed[['User_Rating', 'Reviews', 'Price', 'Year', 
                                                'Genre_Non Fiction']]
plt.figure(figsize=(10,10))
sns.heatmap(heatmap_data.corr(), square=True, annot=True)

The strongest negative correlation is between year and genre with -0.28. The strongest positive correlations are between year and user rating as well as between year and reviews with 0.22.

<a id="5.4"></a>
<h3 style="background-color:Gray;color:white;font-family:sans-serif;font-size:120%;text-align:center">What's the Genre of a Book?</h3>

Because it is pointless to predict the genre of books that are already known, the model is trained on a data set in which books that appear in several years are included only once. We use GridSearchCV to decide between logistic regression and XGBoost. We choose logistic regression because it is a fast and simple classification method and XGBoost because it achieves excellent results for many classification problems.
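
`GridSearchCV` ranks candidates by the estimator's default score, which is accuracy for classifiers, unless `scoring` is set. The following sketch shows how a search could select by F1-score instead. It uses synthetic stand-in data and a small, made-up parameter grid, and it wraps the whole pipeline in the search so that the scaler is refit in every fold, which differs from the arrangement used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not the bestseller data).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Scaler and classifier form one pipeline that is tuned as a whole.
demo_pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])
demo_grid = {'clf__C': [0.1, 1.0, 10.0],
             'clf__class_weight': ['balanced', None]}

# Select the best candidate by F1-score instead of the default accuracy.
search = GridSearchCV(demo_pipeline, param_grid=demo_grid, scoring='f1', cv=5)
search.fit(X_tr, y_tr)

test_f1 = f1_score(y_te, search.predict(X_te))
print(search.best_params_)
print('Test F1: {:.2f}'.format(test_f1))
```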

In [None]:
# Build the train and test data sets
X = unique_bestsellers_preprocessed.drop('Genre_Non Fiction', axis=1)
y = unique_bestsellers_preprocessed['Genre_Non Fiction']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
def scale_gridsearch(estimators, parameters, classifier):
    """
    Description: This function runs a pipeline of a scaler and GridSearchCV with different estimators 
                 and prints the results. It uses the global X_train, X_test, y_train and y_test.
    
    Arguments:
        estimators: list of estimators to be fine-tuned and tested with GridSearchCV
        parameters: list of parameter grids for the fine-tuning of the estimators
        classifier: boolean, True for a classification problem and False for a regression problem

    Returns:
        None
    """
    
    for estimator, param in zip(estimators, parameters):
        pipeline = Pipeline([
            ('scaler', MinMaxScaler()),
            ('cv', GridSearchCV(estimator, param_grid=param, cv=10))
        ])
    
        pipeline.fit(X_train, y_train)
        y_pred = pipeline.predict(X_test)
        
        print(pipeline['cv'].best_estimator_)
        print(pipeline['cv'].best_params_)
        
        if classifier:
            print(classification_report(y_test, y_pred))
        else: 
            rmse = np.sqrt(mean_squared_error(y_test,y_pred)) 
            print('Testdata Root Mean Squared Error: {}'.format(rmse))
            
            
classifiers = [LogisticRegression(), XGBClassifier()]

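# Note: not every penalty/solver combination is valid; GridSearchCV scores failed fits as NaN by default.
# Recent scikit-learn versions expect None instead of 'none', and class_weight must be
# a dict object, 'balanced' or None (the string 'dict' itself is not a documented value).
clf_parameters = [{'penalty': ['l1', 'l2', 'elasticnet', 'none'],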
                   'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
                   'class_weight': ['dict', 'balanced', None]},
                  
                  {'booster': ['gbtree', 'gblinear', 'dart']}
                 ]


scale_gridsearch(estimators=classifiers, parameters=clf_parameters, classifier=True)

The result changes slightly with each run. Sometimes one classifier dominates, sometimes the other, and sometimes they achieve the same results. All in all, XGBoost with a gblinear booster and logistic regression with class_weight 'dict' and a newton-cg/lbfgs solver without penalty are about equally good according to their F1-scores.

<a id="5.5"></a>
<h3 style="background-color:Gray;color:white;font-family:sans-serif;font-size:120%;text-align:center">What's the Worth (Price) of a Book?</h3>

In [None]:
# Build the train and test data sets
X = unique_bestsellers_preprocessed.drop('Price', axis=1)
y = unique_bestsellers_preprocessed['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
regressors = [Ridge(), Lasso(), RandomForestRegressor(), XGBRegressor()]

reg_params = [{'alpha': [0.1, 0.5, 1, 5, 7, 10],
               'tol': [0.05, 0.1, 0.5]},
             
              {'alpha': [0.05, 0.1, 0.5, 1, 5, 10, 20],
               'max_iter': [100, 200, 300, 400, 500, 750]},
             
              {'max_depth': [20, 25, 30, 35], 
               'min_samples_split': [2, 3, 4]},
         
              {'booster': ['gbtree', 'gblinear', 'dart']}]

scale_gridsearch(estimators=regressors, parameters=reg_params, classifier=False)

Lasso regression shows the smallest root mean squared error, 12.96, with an alpha of 0.1 and a maximum of 100 iterations.
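
A root mean squared error is easier to judge next to a naive baseline, such as always predicting the mean price of the training data. The notebook does not report such a baseline, so the following is only a sketch with made-up prices that shows how it could be computed.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up prices, not taken from the dataset.
train_prices = [8, 12, 15, 9, 11, 20, 14]
test_prices = [10, 13, 30, 7]

# Baseline: always predict the mean price of the training data.
baseline = sum(train_prices) / len(train_prices)
baseline_rmse = rmse(test_prices, [baseline] * len(test_prices))
print('Baseline RMSE: {:.2f}'.format(baseline_rmse))
```

A model is only useful for pricing if its error on the test data is clearly below that of such a baseline.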

<a id="5.6"></a>
<h3 style="background-color:Gray;color:white;font-family:sans-serif;font-size:120%;text-align:center">How Popular is a Book?</h3>

To predict the user rating we try the same estimators as for the price prediction (ridge regression, lasso regression, random forest and XGBoost).

In [None]:
# Build the train and test data sets
X = unique_bestsellers_preprocessed.drop('User_Rating', axis=1)
y = unique_bestsellers_preprocessed['User_Rating']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
scale_gridsearch(estimators=regressors, parameters=reg_params, classifier=False)

In this case ridge regression with an alpha of 5 and a tol of 0.05 is the right choice because of its small root mean squared error of 0.18.

<a id="6"></a>
<h2 style="background-color:Gray;color:white;font-family:sans-serif;font-size:150%;text-align:center">Conclusion</h2>

We took the following steps:
* Data exploration: The features do not use the full range since the data is about bestsellers. For example there are no user ratings under 3.3 and the mean number of reviews is high.
* Data cleaning: There were just twelve missing values, which could be handled manually, and a kind of duplicate where bestsellers were successful in several years. In order to create unique data we had to decide which year to assign each book to and chose its first year as a bestseller. For some parts of the data analysis we had to look at the whole data set and for others at the unique data. The models were trained on the unique data only.
* Data analysis: We asked some questions to get to know the data and visualized the answers. The visualization of the reading girl in 'Which Words make a Bestseller's Title?' is not perfect because the contour line is not continuous.
* Further preprocessing: The non-numerical data had to be converted so that it could be used in the models.
* Classification metric: We chose the F1-score.
* Models: Finally, we built three different models using GridSearchCV to choose the best estimator and its parameters. For the genre, XGBoost with a gblinear booster and logistic regression with class_weight 'dict' and a newton-cg/lbfgs solver without penalty were about equally good. For the price, Lasso regression with an alpha of 0.1 and a maximum of 100 iterations was the best solution. For the user rating, ridge regression with an alpha of 5 and a tol of 0.05 would be the right choice. An interesting aspect of the project is that GridSearchCV is put into a pipeline to compare different estimators while fine-tuning their parameters at the same time. We have seen that XGBoost is not always the right choice; the best model still has to be found for each specific case.


<a id="7"></a>
<h2 style="background-color:Gray;color:white;font-family:sans-serif;font-size:150%;text-align:center">Evaluation</h2>

These models can only be used to a very limited extent. Since they were trained on data about bestsellers, they can only be used for predictions concerning bestsellers. For example, they would never predict a user rating under 3 or adequate prices for collectibles.

Furthermore, it can be unsatisfactory to distinguish only between fiction and non-fiction. In reality, the genre would have to be divided in much more detail. The accuracy of about 87% may be too low for corporate goals. In this case, more data would have to be used, ideally with additional features, to improve the model.

How could we improve our results?
* We could extend our data to more years or more features, or divide the genre in more detail.
* More data cleaning could be done. For example, the same book title or author is sometimes written differently.
* We could fine-tune more parameters in GridSearchCV.