# Introduction
There are many ways to learn computer science: you can watch videos, read books, or do both. But how do we know whether the study material is any good? Can data science help us analyze whether a book is really good or not? In this notebook, we'll look at the top 270 computer science and programming books to find out what makes a good one.

In [None]:
# import necessary libraries

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from wordcloud import WordCloud

import re

In [None]:
# read data
data = pd.read_csv("../input/top-270-rated-computer-science-programing-books/prog_book.csv")

In [None]:
# looking at the shape of the dataset
data.shape

In [None]:
# prints first five rows in the dataset
data.head()

In [None]:
data.info()

The output shows that there are no null values in the dataset. There is one problem to fix, though: the Reviews column has the type object even though its values are integers. We first remove the "," thousands separators from the numbers and then convert them to integers.

In [None]:
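# clean reviews: keep only digits and dots, which removes the "," separators
# (raw string avoids invalid-escape warnings for \d)
data['Reviews'] = data['Reviews'].apply(lambda x: re.sub(r"[^\d.]", "", x))

# convert the cleaned strings to integers
data['Reviews'] = data['Reviews'].astype("int64")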
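
As a quick sanity check, the conversion can be tried on a few made-up values before running it on the whole column. The sketch below is only an illustration: `sample_reviews` is a hypothetical series (not taken from the dataset), and it uses `str.replace` with `pd.to_numeric` as one possible alternative to the regex approach above.

```python
import pandas as pd

# hypothetical sample values, formatted like the Reviews column
sample_reviews = pd.Series(["3,829", "1,406", "0"])

# drop the thousands separator, then parse the strings as numbers
cleaned_reviews = pd.to_numeric(sample_reviews.str.replace(",", "", regex=False))

print(cleaned_reviews.tolist())
print(cleaned_reviews.dtype)
```

If a value cannot be parsed, `pd.to_numeric` raises an error by default, which makes unexpected formats easy to spot.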

# Does book type affect its rating?
The first question is whether the type of a book (e.g., paperback or ebook) affects its rating. First, we check for outliers in the ratings of each book type, and then we compare the average rating of each type.

In [None]:
plt.figure(figsize=(15,8))
sns.boxplot(x="Type", y="Rating", data=data)
plt.title("Type vs. Rating")
plt.xlabel("Type")
plt.ylabel("Rating")
plt.show()

This is interesting: there are a few outliers, and there seems to be only one book of the type "Boxed Set - Hardcover". Let us confirm this by checking the value counts for each type of book.

In [None]:
data['Type'].value_counts()

This confirms that there is only one book of the type Boxed Set - Hardcover, and it shows that the dataset contains a large number of paperback books. Maybe students of computer science and programming prefer paperbacks over ebooks, or maybe it's just a bias introduced during data collection. Let us look at the average rating for each book type to find out more!

In [None]:
plt.figure(figsize=(10,4))
data.groupby("Type")['Rating'].mean().sort_values().plot(kind="bar", rot=0)
plt.title("Average rating with respect to book type")
plt.xlabel("Type")
plt.ylabel("Rating")
plt.show()

This tells another story: even though paperback books outnumber ebooks, ebooks have a higher average rating than paperbacks. This could be due to an outlier in the ebook ratings (check the box plot). Since there is only one book of the type Boxed Set - Hardcover, we can't really say whether that is the best choice for a publisher choosing a format for a computer science and programming book.
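
Because an average can hide how many books it is based on, it may help to look at the count and the median next to the mean. The following is a minimal sketch on a hypothetical mini-dataset (the values are made up; only the column names `Type` and `Rating` match the notebook).

```python
import pandas as pd

# hypothetical mini-dataset using the same column names as the notebook
sample = pd.DataFrame({
    "Type": ["Paperback", "Paperback", "Paperback", "ebook", "ebook",
             "Boxed Set - Hardcover"],
    "Rating": [4.0, 4.2, 3.9, 4.5, 4.1, 4.4],
})

# count, mean and median per type: a mean backed by one book says very little
rating_summary = sample.groupby("Type")["Rating"].agg(["count", "mean", "median"])

print(rating_summary.sort_values("mean"))
```

The median is less sensitive to a single outlier than the mean, so a large gap between the two is a hint to go back to the box plot.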

# What really impacts the price of a book?
Does the type of book have an impact on its price? One reasonable assumption is that hardcover books cost more than paperbacks, simply because of the materials used to make them. Let us check that by comparing the average price of each book type.

In [None]:
plt.figure(figsize=(10,4))
data.groupby("Type")['Price'].mean().sort_values().plot(kind="bar", rot=0)
plt.title("Average price with respect to book type")
plt.xlabel("Type")
plt.ylabel("Price")
plt.show()

This matches our assumption: in this dataset, as in the real world, hardcover books cost more than paperbacks. What's interesting is that the average price of an ebook is higher than that of a paperback! What could be the reason? Maybe ebooks have clearer code formatting, or maybe many well-known authors in computer science prefer writing ebooks. There can be various reasons. But our main question is still unanswered: what really impacts the price of a book? Let's look at the correlation matrix for this dataset to find out!

In [None]:
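# correlation matrix of the numeric columns only
# (pandas 2.0+ raises an error if text columns are included)
data.corr(numeric_only=True)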

This shows that there is almost no correlation between price and rating, but there is a moderate correlation between price and the number of pages in a book. That is intuitive: as the number of pages increases, the cost of manufacturing increases, and so does the price. But what about ebooks?! Maybe the author's effort is also taken into consideration? Let's look at a regression plot to visualise the relationship between price and number of pages.

In [None]:
# relationship between price and number of pages
plt.figure(figsize=(12,6))
sns.regplot(x="Number_Of_Pages", y="Price", data=data)
plt.title("Relationship between Price and Number of pages")
plt.show()

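There is a clear positive linear relationship between price and the number of pages! Hence, we can conclude that price tends to rise with the number of pages, although a correlation alone does not prove that the page count causes the higher price.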
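
To put numbers on such a relationship rather than judging it by eye, one option is to fit a least-squares line and compute the Pearson correlation coefficient. The sketch below uses NumPy on hypothetical page counts and prices (made-up values, not taken from the dataset), so the printed figures say nothing about the real books.

```python
import numpy as np

# hypothetical page counts and prices, not taken from the dataset
pages = np.array([150, 300, 450, 600, 900], dtype=float)
price = np.array([25, 40, 48, 70, 95], dtype=float)

# least-squares line: price is approximated by slope * pages + intercept
slope, intercept = np.polyfit(pages, price, deg=1)

# Pearson correlation coefficient between the two variables
r = np.corrcoef(pages, price)[0, 1]

print(f"slope={slope:.4f}, intercept={intercept:.2f}, r={r:.3f}")
```

The slope can be read as the average price change per additional page, and `r` as the strength of the linear association.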

# How does the description of a book differ in terms of rating?
To check how book descriptions differ with rating, we can create wordclouds. First, let us check the distribution of ratings using a histogram and hope that it's not skewed!

In [None]:
plt.figure(figsize=(12,6))
plt.hist(data['Rating'])
plt.title("Distribution of book ratings")
plt.xlabel("Rating")
plt.ylabel("Count")
plt.show()

Yes! The distribution is roughly normal (not exactly, but close enough for our purpose). Looking at the histogram, we can divide the books into two categories: a high_rating category for books with a rating > 4.3 and a low_rating category for books with a rating < 3.85. This leads to ~50 books in each category. Before we do this, we need to clean the text.
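
The size of each category can be checked directly by counting the ratings on either side of the thresholds. This is a small sketch with a hypothetical `ratings` series (made-up values); on the real data the same comparisons would be applied to `data['Rating']`.

```python
import pandas as pd

# hypothetical ratings, not taken from the dataset
ratings = pd.Series([3.5, 3.8, 3.84, 4.0, 4.1, 4.3, 4.35, 4.5])

# same thresholds as in the text
high_count = int((ratings > 4.3).sum())
low_count = int((ratings < 3.85).sum())
middle_count = len(ratings) - high_count - low_count

print(high_count, low_count, middle_count)
```

Note that both comparisons are strict, so a book rated exactly 4.3 lands in neither category.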

In [None]:
# looking at description
data['Description'][:5]

In [None]:
# cleaning text
# reference: https://www.kaggle.com/parulpandey/getting-started-with-nlp-a-general-intro
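def clean_text(text):
    """Make text lowercase and remove text in square brackets, newlines and words containing numbers."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)   # drop anything inside square brackets
    text = re.sub(r'\n', ' ', text)       # replace newlines with a space so words do not merge
    text = re.sub(r'\w*\d\w*', '', text)  # drop words that contain digits
    return text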

In [None]:
# apply to data
data['Description'] = data['Description'].apply(clean_text)

data['Description'].head()
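
If punctuation should be stripped as well, the cleaning function could be extended along the following lines. This is only a sketch: `clean_text_strict` and `sample_description` are hypothetical names introduced here for illustration, and the wordclouds in this notebook do not depend on it.

```python
import re
import string

def clean_text_strict(text):
    """Hypothetical variant of clean_text that also strips punctuation."""
    text = text.lower()
    text = re.sub(r"\[.*?\]", "", text)                            # text in square brackets
    text = text.replace("\n", " ")                                 # newlines become spaces
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)  # punctuation characters
    text = re.sub(r"\w*\d\w*", "", text)                           # words containing digits
    return re.sub(r"\s+", " ", text).strip()                       # collapse leftover whitespace

# made-up description used only to exercise the function
sample_description = "Learn C++ [2nd ed.] in 21 days!\nFast."
print(clean_text_strict(sample_description))
```

One side effect worth knowing: stripping punctuation turns names such as "C++" into "c", so whether this step helps depends on the vocabulary you care about.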

Now, we are ready to create wordclouds!

In [None]:
# creating word cloud for high and low rated books

# checking rating greater than 4.3 (check histogram)
high_rating_desc = data[data['Rating'] > 4.3]['Description']

# checking rating lower than 3.85 (to compare word-clouds)
low_rating_desc = data[data['Rating'] < 3.85]['Description']

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=[26,8])

wordcloud1 = WordCloud(background_color='white', width=600,
                      height=400).generate(" ".join(high_rating_desc))

ax1.imshow(wordcloud1)
ax1.axis('off')
ax1.set_title("Books with rating greater than 4.3", fontsize=36)

wordcloud2 = WordCloud(background_color='white', width=600,
                      height=400).generate(" ".join(low_rating_desc))

ax2.imshow(wordcloud2)
ax2.axis('off')
ax2.set_title("Books with rating lower than 3.85", fontsize=36)

plt.show()

Looking at the differences, higher rated books seem to feature the words *algorithms, language and code* in their descriptions. That is the key differentiator. The two interesting words here are algorithms and language: maybe, with the rise of tech companies and more and more people wanting to crack interviews, the top computer science and programming books focus on algorithms and languages. You can also spot the word python in the wordcloud for higher rated books (see the bottom-right corner). This may be because Python is considered an easy language to learn and is becoming very popular with the rise of artificial intelligence.
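
A wordcloud is read by eye, so it can be useful to back it up with plain word counts. The sketch below counts the most frequent words in two groups of descriptions using only the standard library. The descriptions are made up, and `STOPWORDS` and `top_words` are hypothetical names for this illustration (the stopword list here is much shorter than the one WordCloud applies by default).

```python
import re
from collections import Counter

# small illustrative stopword list; an assumption, not WordCloud's own list
STOPWORDS = {"a", "and", "for", "in", "the", "to"}

def top_words(descriptions, n=3):
    """Return the n most frequent non-stopword tokens across the descriptions."""
    counts = Counter()
    for text in descriptions:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)

# hypothetical descriptions, not taken from the dataset
high_rated = ["learn algorithms and code in python", "algorithms for the working programmer"]
low_rated = ["a guide to the web", "web design for beginners"]

print(top_words(high_rated))
print(top_words(low_rated))
```

On the real data, the two inputs would be `high_rating_desc` and `low_rating_desc`, and comparing the two lists shows which words are specific to one group rather than common to both.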

# Description vs. number of pages
Do you think the number of pages in a book is related to its description? What are the most common words in the descriptions of books with 200 pages or fewer? Let's find out by making a wordcloud!

In [None]:
# creating word cloud for pages greater than 200 and less than or equal to 200

# checking pages > 200
pages_greater_than_200 = data[data['Number_Of_Pages'] > 200]['Description']

# checking pages <= 200 (to compare word-clouds)
pages_less_than_200 = data[data['Number_Of_Pages'] <= 200]['Description']

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=[26,8])

wordcloud1 = WordCloud(background_color='white', width=600,
                      height=400).generate(" ".join(pages_greater_than_200))

ax1.imshow(wordcloud1)
ax1.axis('off')
ax1.set_title("Books with more than 200 pages", fontsize=36)

wordcloud2 = WordCloud(background_color='white', width=600,
                      height=400).generate(" ".join(pages_less_than_200))

ax2.imshow(wordcloud2)
ax2.axis('off')
ax2.set_title("Books with less than or equal to 200 pages", fontsize=36)

plt.show()

Again, the most common word in the descriptions of books with 200 pages or fewer is *python*! Is it really that easy to explain Python in 200 pages or fewer?! The irony is that both *simple* and *complex* appear in the wordcloud for these shorter books. Maybe the books with the word *complex* in their description just have really hard problems to solve? That's only an assumption; you are free to dive deeper into this.
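
One way to dive deeper is to measure how often a given word appears in each page bucket instead of relying on the wordcloud alone. The sketch below computes the share of descriptions that mention "python" for short and long books. The mini-dataset is hypothetical (made-up rows; only the column names match the notebook).

```python
import pandas as pd

# hypothetical mini-dataset using the same column names as the notebook
sample = pd.DataFrame({
    "Number_Of_Pages": [120, 180, 150, 450, 600],
    "Description": [
        "a short python primer",
        "python tricks for busy people",
        "regular expressions in a nutshell",
        "python for data analysis",
        "compilers: principles and techniques",
    ],
})

# label each book by page bucket, mirroring the <= 200 / > 200 split
bucket = sample["Number_Of_Pages"].le(200).map({True: "<= 200 pages", False: "> 200 pages"})

# True where the description contains "python" as a whole word
mentions_python = sample["Description"].str.contains(r"\bpython\b", case=False, regex=True)

# mean of a boolean column = share of descriptions mentioning the word
python_share = mentions_python.groupby(bucket).mean()
print(python_share)
```

The same pattern works for any other word, such as *simple* or *complex*, by changing the regular expression.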

# Conclusion
Here we are, at the end of a journey through numbers (and don't forget text!). Some actionable insights into what could make a good computer science and programming book:
1. Paperback or ebook could be the best book type when we look at price and ratings.
2. It is very important to find the right balance between price and number of pages, as they are positively correlated.
3. Having the right words in your book description is also a key factor in making a good computer science book.

If it does not work out, just write a book on *Python* having *less than or equal to 200 pages*!

To those of you who made it this far, thank you so much for reading this kernel; it means a lot! I hope you enjoyed it.