# Predicting the Genre of Books from Summaries


## Initial Imports

In [149]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer,TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics,linear_model
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
%matplotlib inline

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
!git clone https://github.com/MQCOMP6200-2020-S1/portfolio-reubenbf.git

Cloning into 'portfolio-reubenbf'...
remote: Enumerating objects: 50, done.
remote: Counting objects: 100% (50/50), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 50 (delta 18), reused 31 (delta 8), pack-reused 0
Unpacking objects: 100% (50/50), done.


In [4]:
%cd portfolio-reubenbf/

/content/portfolio-reubenbf


## Data Preparation

We'll use a set of book summaries from the [CMU Book Summaries Corpus](http://www.cs.cmu.edu/~dbamman/booksummaries.html) in this experiment. The corpus contains 16,559 summaries and includes metadata about the genre of each book, taken from Freebase. Each book can have more than one genre, and there are 227 genres listed in total. To simplify the problem of genre prediction, we select a small number of target genres that occur frequently in the collection and keep only the books with these genre labels, assigning one genre label per book.

The first task is to read the data. It is made available in tab-separated format but has no column headings. We can use `read_csv` to read this but we need to set the separator to `\t` (tab) and supply the column names.  The names come from the [ReadMe](data/booksummaries/README.txt) file.

In [5]:
names = ['wid', 'fid', 'title', 'author', 'date', 'genres', 'summary']

books = pd.read_csv("data/booksummaries/booksummaries.txt", sep="\t", header=None, names=names, keep_default_na=False)
books.head()

,wid,fid,title,author,date,genres,summary
0,620,/m/0hhy,Animal Farm,George Orwell,1945-08-17,"{""/m/016lj8"": ""Roman \u00e0 clef"", ""/m/06nbt"":...","Old Major, the old boar on the Manor Farm, ca..."
1,843,/m/0k36,A Clockwork Orange,Anthony Burgess,1962,"{""/m/06n90"": ""Science Fiction"", ""/m/0l67h"": ""N...","Alex, a teenager living in near-future Englan..."
2,986,/m/0ldx,The Plague,Albert Camus,1947,"{""/m/02m4t"": ""Existentialism"", ""/m/02xlf"": ""Fi...",The text of The Plague is divided into five p...
3,1756,/m/0sww,An Enquiry Concerning Human Understanding,David Hume,,,The argument of the Enquiry proceeds by a ser...
4,2080,/m/0wkt,A Fire Upon the Deep,Vernor Vinge,,"{""/m/03lrw"": ""Hard science fiction"", ""/m/06n90...",The novel posits that space around the Milky ...
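The `genres` column shown above holds a JSON object that maps Freebase IDs to genre names, or an empty string when no genre is recorded. As a minimal sketch, the names could be extracted as follows; `parse_genres` and the second entry of `sample` are hypothetical and used only for illustration.

```python
import json

def parse_genres(raw):
    """Return the genre names stored in one raw `genres` cell."""
    # Books without genre metadata have an empty string in this column
    # (the file is read with keep_default_na=False).
    if not raw:
        return []
    # Each non-empty cell is a JSON object: Freebase ID -> genre name.
    return list(json.loads(raw).values())

# Illustrative input; "/m/x1" and "Example Genre" are made up.
sample = '{"/m/06n90": "Science Fiction", "/m/x1": "Example Genre"}'
print(parse_genres(sample))  # ['Science Fiction', 'Example Genre']
print(parse_genres(""))      # []
```

Applied to the real data, something like `books['genres'].apply(parse_genres)` would give one list of genre names per book.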


We next filter the data so that only our target genre labels are included, and we assign each text to just one of the genre labels. A text can carry two or more of these labels (e.g. Science Fiction and Fantasy); in that case the loop below keeps the label that appears last in `target_genres`, because each later match overwrites the earlier one.

In [0]:
target_genres = ["Children's literature",
                 'Science Fiction',
                 'Novel',
                 'Fantasy',
                 'Mystery']

# create a Series of empty strings the same length as the list of books
genre = pd.Series(np.repeat("", books.shape[0]))
# look for each target genre and set the corresponding entries in the genre series to the genre label
for g in target_genres:
    genre[books['genres'].str.contains(g)] = g

# add this to the book dataframe and then select only those rows that have a genre label
# drop some useless columns
books['genre'] = genre
genre_books = books[genre!=''].drop(['genres', 'fid', 'wid'], axis=1)

genre_books.reset_index(drop=True, inplace=True)
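To see how the loop above resolves books with several target genres, here is a small sketch on made-up rows. The `toy` frame and its genre strings are hypothetical; `regex=False` is added here so that the match is a plain substring test.

```python
import pandas as pd

# Hypothetical raw genre strings, not taken from the corpus.
toy = pd.DataFrame({"genres": [
    '{"a": "Science Fiction", "b": "Fantasy"}',
    '{"c": "Mystery", "d": "Novel"}',
    '{"e": "Poetry"}',
]})
order = ["Children's literature", "Science Fiction", "Novel", "Fantasy", "Mystery"]

label = pd.Series([""] * len(toy))
for g in order:
    # Later genres in `order` overwrite earlier matches.
    label[toy["genres"].str.contains(g, regex=False)] = g

print(label.tolist())  # ['Fantasy', 'Mystery', '']
```

The first row matches both Science Fiction and Fantasy and ends up as Fantasy, the second ends up as Mystery, and the third matches no target genre, so it would be dropped by the `genre != ''` filter.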

In [7]:
# check how many books we have in each genre category
genre_books.groupby('genre').count()

genre,title,author,date,summary
Children's literature,1092,1092,1092,1092
Fantasy,2311,2311,2311,2311
Mystery,1396,1396,1396,1396
Novel,2258,2258,2258,2258
Science Fiction,1897,1897,1897,1897


## Modelling

We take the data and build a predictive model to classify the books into one of the five target genres.  We extract suitable features from the texts and select a suitable model to classify them. We build two models and compare the results to choose the better one.

### Label Encoder

We use the `LabelEncoder` from `sklearn.preprocessing`, which encodes the target labels as integers between 0 and `n_classes - 1`.

In [0]:
le = LabelEncoder()
le.fit(list(genre_books['genre'].values))

# create a new column 'genre_no' holding the encoded genre labels
genre_books['genre_no'] = le.transform(list(genre_books['genre']))

Create `target_names`, the list of genre names in the order used by the encoder.

In [0]:
target_names = list(le.inverse_transform([0,1,2,3,4]))
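`LabelEncoder` sorts the class names before numbering them, so the integer codes follow alphabetical order. The sketch below uses a short hand-written label list (not the real column) to show the resulting mapping.

```python
from sklearn.preprocessing import LabelEncoder

# Hand-written labels for illustration; duplicates and order do not matter.
labels = ["Novel", "Fantasy", "Mystery", "Fantasy",
          "Children's literature", "Science Fiction"]

enc = LabelEncoder().fit(labels)

# classes_ holds the sorted class names; position i is the code for that name.
codes = {name: int(code)
         for name, code in zip(enc.classes_, enc.transform(enc.classes_))}
print(codes)
# {"Children's literature": 0, 'Fantasy': 1, 'Mystery': 2, 'Novel': 3, 'Science Fiction': 4}
```

With this mapping, a filter such as `genre_no == 2` selects the books labelled Mystery, and `le.classes_` gives the same names as `le.inverse_transform([0,1,2,3,4])`.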

In [193]:
genre_books[genre_books['genre_no']==2].summary.loc[10]

' The novel is told in epistolary format, as a series of letters, diary entries, ships\' log entries, and so forth. The main writers of these items are also the novel\'s protagonists. The story is occasionally supplemented with newspaper clippings that relate events not directly witnessed by the story\'s characters. The tale begins with Jonathan Harker, a newly qualified English solicitor, journeying by train and carriage from England to Count Dracula\'s crumbling, remote castle (situated in the Carpathian Mountains on the border of Transylvania, Bukovina and Moldavia). The purpose of his mission is to provide legal support to Dracula for a real estate transaction overseen by Harker\'s employer, Peter Hawkins, of Exeter in England. At first enticed by Dracula\'s gracious manner, Harker soon discovers that he has become a prisoner in the castle. He also begins to see disquieting facets of Dracula\'s nocturnal life. One night while searching for a way out of the castle, and against Dracu

### TfidfVectorizer

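`TfidfVectorizer` transforms text into feature vectors that can be used as input to an estimator. We adjust the vectorizer in four ways: we use the English stop words from the NLTK package, we change the token regular expression so that only words of at least three letters are kept, we ignore words that occur in fewer than five summaries (`min_df=5`), and we cap the vocabulary at 20,000 features.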

In [0]:
vectorizer = TfidfVectorizer(max_features=20000,token_pattern=r'(?u)\b[A-Za-z]{3,}\b',stop_words=stopwords.words('english'),min_df=5)
X = vectorizer.fit_transform(genre_books.summary).toarray()

In [207]:
X.shape

(8954, 20000)
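The effect of the custom `token_pattern` can be checked with the standard `re` module alone. This is a rough sketch on a made-up sentence; it lowercases the text first, as `TfidfVectorizer` does by default, and leaves out the stop-word removal that the vectorizer applies afterwards.

```python
import re

# Same pattern as passed to TfidfVectorizer above.
token_pattern = re.compile(r"(?u)\b[A-Za-z]{3,}\b")

# Made-up sentence for illustration.
sentence = "In 1897, Dr. Van Helsing met the Count at 9 pm."

# Numbers and words shorter than three letters never become tokens.
tokens = token_pattern.findall(sentence.lower())
print(tokens)  # ['van', 'helsing', 'met', 'the', 'count']
```

In the real vectorizer, `the` would then be dropped because it is in the NLTK stop-word list.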

Finally, we split the data into training and test sets for the estimators, holding out 20% of the books for testing.

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, genre_books['genre_no'], test_size=0.2)
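The call above does not set `random_state`, so every run of the cell produces a different split. One possible variation, sketched here on toy arrays rather than the real features, fixes the seed and stratifies on the labels so that class proportions are preserved; the seed value `0` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in data: 20 samples, 2 features, two classes in a 3:1 ratio.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 15 + [1] * 5)

split_a = train_test_split(X_toy, y_toy, test_size=0.2,
                           random_state=0, stratify=y_toy)
split_b = train_test_split(X_toy, y_toy, test_size=0.2,
                           random_state=0, stratify=y_toy)

# The same seed reproduces the same test labels.
print(np.array_equal(split_a[3], split_b[3]))  # True
# Stratification keeps the 3:1 class ratio in the 4-sample test set.
print(np.bincount(split_a[3]))  # [3 1]
```

With a fixed split, classification reports from different models refer to the same test books and can be compared directly.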

### MultinomialNB Model



In [209]:
clf = MultinomialNB(alpha=0.1)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print(metrics.classification_report(y_test, pred, target_names=target_names))

                       precision    recall  f1-score   support

Children's literature       0.64      0.44      0.52       214
              Fantasy       0.76      0.72      0.74       443
              Mystery       0.78      0.66      0.72       302
                Novel       0.63      0.77      0.69       473
      Science Fiction       0.70      0.77      0.74       359

             accuracy                           0.70      1791
            macro avg       0.70      0.67      0.68      1791
         weighted avg       0.70      0.70      0.70      1791



### Logistic Regression Model

In [184]:
clf = linear_model.LogisticRegression(solver= 'sag',max_iter=250,random_state=450)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print(metrics.classification_report(y_test, pred, target_names=target_names))

                       precision    recall  f1-score   support

Children's literature       0.69      0.50      0.58       201
              Fantasy       0.77      0.76      0.77       459
              Mystery       0.83      0.65      0.73       275
                Novel       0.65      0.79      0.71       471
      Science Fiction       0.75      0.77      0.76       385

             accuracy                           0.73      1791
            macro avg       0.74      0.70      0.71      1791
         weighted avg       0.73      0.73      0.72      1791



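Logistic regression scores slightly higher than multinomial naive Bayes here (accuracy 0.73 vs 0.70, macro-averaged F1 0.71 vs 0.68). However, the per-class support counts differ between the two reports, which indicates that the models were evaluated on different random splits (`train_test_split` is called without a `random_state`), so the comparison is only approximate. Hyperparameter tuning on a fixed split is still needed before choosing between them.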
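One way to compare the two classifiers on identical data is to wrap each in a pipeline with its own vectorizer and score both with the same cross-validation folds. The following is a minimal sketch on a hypothetical toy corpus, not the notebook's actual procedure; the documents, the fold count, and the default `TfidfVectorizer` settings are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the book summaries.
docs = [
    "a wizard and a dragon guard the enchanted castle",
    "the elf queen casts a spell on the dark sword",
    "a young mage finds a magic ring in the forest",
    "dragons and wizards fight over the enchanted throne",
    "the sorcerer hides a magic sword in the castle",
    "an elf and a mage cross the haunted forest",
    "the detective questions a suspect about the murder",
    "a body is found and the inspector hunts for clues",
    "the police detective studies the murder weapon",
    "an inspector interviews every suspect in the village",
    "clues at the crime scene point to the butler",
    "the detective solves the crime with a single clue",
]
labels = ["Fantasy"] * 6 + ["Mystery"] * 6

# A seeded splitter gives both models exactly the same folds.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
models = {
    "MultinomialNB": MultinomialNB(alpha=0.1),
    "LogisticRegression": LogisticRegression(max_iter=250),
}

scores = {}
for name, model in models.items():
    # The vectorizer is refitted on each training fold only.
    pipe = make_pipeline(TfidfVectorizer(), model)
    scores[name] = cross_val_score(pipe, docs, labels, cv=cv).mean()
    print(f"{name}: mean accuracy {scores[name]:.2f}")
```

Putting the vectorizer inside the pipeline also means its vocabulary and IDF weights are learned from the training folds only, so the held-out documents do not influence the features.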

## Predicting Genre using our proposed model



In [0]:
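text = ['haunting tale of the drowing kid']
# reuse the fitted vectorizer so the features match those seen in training
s = vectorizer.transform(text)
# clf is the most recently fitted model (logistic regression, in notebook order)
d = clf.predict(s)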

Use the encoder's `inverse_transform` method to recover the genre name from the predicted code.

In [187]:
le.inverse_transform(d)

array(['Novel'], dtype='<U21')