<h1>books.toscrape analysis</h1>
<p>books.toscrape.com is a web scraping sandbox whose catalogue of books is meant to be scraped. although the data is fictitious, this analysis demonstrates the techniques that can be applied to it and the conclusions that can be drawn from it. for this notebook, we treat books.toscrape's data as if it came from an early Amazon-style online book retailer and explore which factors contribute to a book's stock level and star rating.</p>

<h2>setting up</h2>

In [136]:
#import statements
import pandas as pd
import numpy as np

In [137]:
#configuring dataframe 
df = pd.read_json('books.json', orient='index').rename_axis('title').reset_index()
df

Unnamed: 0,title,upc,category,stars,price,availability,stock,href,description
0,A Light in the Attic,a897fe39b1053632,poetry,3,£51.77,in stock,22,a-light-in-the-attic_1000/index.html,It's hard to imagine a world without A Light i...
1,Tipping the Velvet,90fa61229261140a,historical fiction,1,£53.74,in stock,20,tipping-the-velvet_999/index.html,"""Erotic and absorbing...Written with starling ..."
2,Soumission,6957f44c3847a760,fiction,1,£50.10,in stock,20,soumission_998/index.html,"Dans une France assez proche de la nôtre, un h..."
3,Sharp Objects,e00eb4fd7b871a48,mystery,4,£47.82,in stock,20,sharp-objects_997/index.html,"WICKED above her hipbone, GIRL across her hear..."
4,Sapiens: A Brief History of Humankind,4165285e1663650f,history,5,£54.23,in stock,20,sapiens-a-brief-history-of-humankind_996/index...,From a renowned historian comes a groundbreaki...
...,...,...,...,...,...,...,...,...,...
994,Alice in Wonderland (Alice's Adventures in Won...,cd2a2a70dd5d176d,classics,1,£55.53,in stock,1,alice-in-wonderland-alices-adventures-in-wonde...,
995,"Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)",bfd5e1701c862ac3,sequential art,4,£57.06,in stock,1,ajin-demi-human-volume-1-ajin-demi-human-1_4/i...,High school student Kei Nagai is struck dead i...
996,A Spy's Devotion (The Regency Spies of London #1),19fec36a1dfb4c16,historical fiction,5,£16.97,in stock,1,a-spys-devotion-the-regency-spies-of-london-1_...,"In England’s Regency era, manners and elegance..."
997,1st to Die (Women's Murder Club #1),f684a82adc49f011,mystery,1,£53.98,in stock,1,1st-to-die-womens-murder-club-1_2/index.html,"James Patterson, bestselling author of the Ale..."


In [138]:
#converting price column to numbers to conduct numerical analysis
df['price'] = df.price.str.replace('£', '', regex=False).astype(float)
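
the conversion above assumes every price is a clean string with a single '£' in front. a minimal sketch of a more forgiving version, assuming prices might carry stray characters or be missing; the sample values are made up for illustration and are not taken from books.json:

```python
import pandas as pd

# made-up sample prices (hypothetical, not from books.json)
sample = pd.Series(['£51.77', '£53.74', 'Â£50.10', None, 'n/a'])

# keep only digits and the decimal point, then convert;
# anything that cannot be parsed becomes NaN instead of raising
cleaned = sample.str.replace(r'[^0-9.]', '', regex=True)
prices = pd.to_numeric(cleaned, errors='coerce')
print(prices.tolist())
```

whether silently turning bad values into NaN is preferable to failing loudly depends on how much the scraped data is trusted.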

<h2>exploratory analysis</h2>
<ul>
<li>what are the top 5 categories in stock?</li>
<li>how many books are in each star rating?</li>
<li>what is the max price? what is the min price?</li>
</ul>

In [139]:
#top 5 categories
df.category.value_counts().head(5)

default           152
nonfiction        110
sequential art     75
add a comment      67
fiction            65
Name: category, dtype: int64

In [140]:
#number of books at each star rating
df.stars.value_counts()

1    226
3    203
2    196
5    195
4    179
Name: stars, dtype: int64

In [141]:
#max price of book
df.price.max()

59.99

In [142]:
#min price of book
df.price.min()

10.0

<h2>drilling down into highly rated books (4 and 5 star ratings)</h2>
after a quick overview of the data, we want to better understand the characteristics of highly rated books so that books.toscrape.com can optimize its selection.
<ul>
<li>what 5 categories show up the most in 4 and 5 star ratings?</li>
<li>what is the average price of books with 4 and 5 star ratings?</li>
<li>what are the 5 most frequently used words of each description (excluding prepositions/connecting words)?</li>
<li>is there a pattern to the themes discussed in highly rated books?</li>
</ul>

In [143]:
#filtering down the data frame to 4 and 5 star ratings
#hrb = highly rated books
hrb = df[(df.stars == 5) | (df.stars == 4)]
hrb

Unnamed: 0,title,upc,category,stars,price,availability,stock,href,description
3,Sharp Objects,e00eb4fd7b871a48,mystery,4,47.82,in stock,20,sharp-objects_997/index.html,"WICKED above her hipbone, GIRL across her hear..."
4,Sapiens: A Brief History of Humankind,4165285e1663650f,history,5,54.23,in stock,20,sapiens-a-brief-history-of-humankind_996/index...,From a renowned historian comes a groundbreaki...
6,The Dirty Little Secrets of Getting Your Dream...,2597b5a345f45e1b,business,4,33.34,in stock,19,the-dirty-little-secrets-of-getting-your-dream...,Drawing on his extensive experience evaluating...
8,The Boys in the Boat: Nine Americans and Their...,e10e1e165dc8be4a,default,4,22.60,in stock,19,the-boys-in-the-boat-nine-americans-and-their-...,For readers of Laura Hillenbrand's Seabiscuit ...
11,Shakespeare's Sonnets,30a7f60cd76ca58c,poetry,4,20.66,in stock,19,shakespeares-sonnets_989/index.html,This book is an important and complete collect...
...,...,...,...,...,...,...,...,...,...
990,Bounty (Colorado Mountain #7),abc0b15f2c907ff0,romance,4,37.26,in stock,1,bounty-colorado-mountain-7_9/index.html,Justice Lonesome has enjoyed a life of bounty....
992,"Bleach, Vol. 1: Strawberry and the Soul Reaper...",099fae4a0705d63b,sequential art,5,34.65,in stock,1,bleach-vol-1-strawberry-and-the-soul-reapers-b...,"Hot-tempered 15-year-old Ichigo Kurosaki, the ..."
995,"Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)",bfd5e1701c862ac3,sequential art,4,57.06,in stock,1,ajin-demi-human-volume-1-ajin-demi-human-1_4/i...,High school student Kei Nagai is struck dead i...
996,A Spy's Devotion (The Regency Spies of London #1),19fec36a1dfb4c16,historical fiction,5,16.97,in stock,1,a-spys-devotion-the-regency-spies-of-london-1_...,"In England’s Regency era, manners and elegance..."


In [144]:
#top 5 categories among highly rated books
hrb.category.value_counts().head(5)

default           52
nonfiction        38
young adult       30
sequential art    29
fiction           27
Name: category, dtype: int64

In [145]:
#average price of highly rated books
hrb.price.mean().round(2)

35.69
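
the average above is easier to interpret next to the averages for the other star ratings. a minimal sketch of that comparison using groupby, run on made-up rows rather than the real data; in the notebook the same call would presumably be made on df:

```python
import pandas as pd

# made-up rows for illustration (hypothetical, not from books.json)
books = pd.DataFrame({
    'stars': [1, 1, 4, 4, 5, 5],
    'price': [20.0, 30.0, 10.0, 50.0, 40.0, 60.0],
})

# mean price and number of books at each star rating
by_stars = books.groupby('stars')['price'].agg(['mean', 'count']).round(2)
print(by_stars)
```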

In [146]:
#import nlp library
#note: word_tokenize and stopwords rely on NLTK data packages fetched with nltk.download()
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [147]:
hrb.description[3]

'WICKED above her hipbone, GIRL across her heart Words are like a road map to reporter Camille Preaker’s troubled past. Fresh from a brief stay at a psych hospital, Camille’s first assignment from the second-rate daily paper where she works brings her reluctantly back to her hometown to cover the murders of two preteen girls. NASTY on her kneecap, BABYDOLL on her leg Since WICKED above her hipbone, GIRL across her heart Words are like a road map to reporter Camille Preaker’s troubled past. Fresh from a brief stay at a psych hospital, Camille’s first assignment from the second-rate daily paper where she works brings her reluctantly back to her hometown to cover the murders of two preteen girls. NASTY on her kneecap, BABYDOLL on her leg Since she left town eight years ago, Camille has hardly spoken to her neurotic, hypochondriac mother or to the half-sister she barely knows: a beautiful thirteen-year-old with an eerie grip on the town. Now, installed again in her family’s Victorian mansi

In [148]:
#tokenize sentences into a list of words - nltk library 
#eliminate the stop words - list comprehension?
#collect the 5 most popular words from each description and save them to a dictionary 
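
the three steps planned above can be sketched end to end on toy data. this is only an illustration under stated assumptions: the descriptions and the stopword set are made up, and a simple regular expression stands in for nltk's word_tokenize so the sketch runs without any NLTK data:

```python
import re
from collections import Counter

# hypothetical stand-ins: two made-up descriptions and a tiny stopword set
descriptions = {
    'book a': 'The sea, the sea and the ship. A ship on the sea at night.',
    'book b': 'A city of glass and a city of stone; glass towers over the city.',
}
stop_words = {'the', 'a', 'and', 'of', 'on', 'at', 'over'}

def top_words(text, n=5):
    """Return up to n of the most frequent non-stopword tokens in text."""
    # crude tokenizer: runs of letters and apostrophes (stand-in for word_tokenize)
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return [word for word, _ in counts.most_common(n)]

# one list of top words per title, saved to a dictionary
top_by_title = {title: top_words(text) for title, text in descriptions.items()}
print(top_by_title)
```

a description with fewer than five distinct words simply yields a shorter list.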

In [169]:
sw = stopwords.words('english')
#extra tokens to drop: punctuation, numbers and fragments left over by the tokenizer
sw.extend([',','.',':','"','?','*','^','@','!','like','#','1','70,000','6','25','...','’','(','shapiro','author','1599',')',"'ve",'‘',"''","'","'t",'11','400','48','23','e.g','e.g.',';','``','—','“','”',"'re","'s","n't",'227','w','10','12','50','200','xkcd','b','sh','2,000,000','60','34','17','10-25','125','7','5','4','3','2','6,500','53','84','88','40','15','$','45,000','24','--','f','/of','sp','h','500','1-4','600','1-5','80','68','13','1,775','109','v-j','22','27','6.0','300','13-18','70','71','72','73','74','75','76','77','78','79','1-6','100','1,000','&'])
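
listing every number and punctuation mark by hand is hard to keep complete. one possible alternative, sketched here on made-up tokens and not the approach this notebook takes, is to keep only purely alphabetic tokens with str.isalpha():

```python
# made-up tokens in the style of word_tokenize output (hypothetical sample)
tokens = ['since', 'she', 'left', 'town', '8', 'years', 'ago', ',', '70,000',
          '``', 'camille', "'s", 'mother', '...', '1-4']
stop_words = {'since', 'she'}  # tiny hypothetical stopword set

# keep purely alphabetic tokens that are not stopwords
kept = [t for t in tokens if t.isalpha() and t not in stop_words]
print(kept)
```

the trade-off is that hyphenated words such as 'second-rate' are dropped as well, so this filter is blunter than a curated list.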

In [170]:
#tokenization of descriptions and eliminating stopwords
#assign returns a new dataframe, which avoids pandas' SettingWithCopyWarning on the filtered slice
hrb = hrb.assign(words=hrb.description.str.lower().apply(word_tokenize).apply(lambda x: [word for word in x if word not in sw]))


In [171]:
#count how often each word appears across all highly rated descriptions
from collections import Counter
words_dict = Counter(word for words in hrb.words for word in words)


In [172]:
words_dict.keys()



In [173]:
max(words_dict,key=words_dict.get)

'one'
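
max returns only the single most frequent word, while the question posed earlier asks for the top five. a minimal sketch of ranking a word-count dictionary with Counter.most_common; the counts here are made up and only stand in for words_dict:

```python
from collections import Counter

# made-up counts standing in for words_dict (hypothetical values)
word_counts = {'one': 12, 'life': 9, 'world': 8, 'new': 7, 'love': 5, 'story': 3}

# the five highest counts, most frequent first
top_five = Counter(word_counts).most_common(5)
print(top_five)
```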