# EDA of Books 

## Steps 

1. [Importing the libraries](#1-importing-the-libraries)
2. [Importing the dataset](#2-importing-the-dataset)
3. [Data Cleaning](#3-data-cleaning)
4. [Exploratory Data Analysis](#4-exploratory-data-analysis)
5. [Feature Engineering](#5-feature-engineering)
6. [Saving the dataset](#6-saving-the-dataset)

### 1. Importing the libraries

In [170]:
%load_ext autoreload
%autoreload 2
import os 
import pymongo
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import datetime as dt
import time
from pandarallel import pandarallel

import math
from sklearn.pipeline import Pipeline


warnings.filterwarnings('ignore')

pandarallel.initialize()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


### 2. Importing the dataset




In [171]:
# Set the environment variables before starting the notebook:

# export MONGO_URI=''
# export MONGO_DB='Goodreads'
# export MONGO_COLLECTION='BookReviews'

# Read the connection settings from the environment so that credentials
# are never hard-coded in the notebook.
MONGO_URI = os.environ['MONGO_URI']
MONGO_DB = os.environ['MONGO_DB']
MONGO_COLLECTION = os.environ['MONGO_COLLECTION']
print(MONGO_DB, MONGO_COLLECTION)

Goodreads BookReviews


In [172]:
pipeline = [
    {
        '$set': {
            'reviews': {
                '$objectToArray': '$reviews'
            }
        }
    }, {
        '$unwind': {
            'path': '$reviews', 
            'preserveNullAndEmptyArrays': True
        }
    }, {
        '$set': {
            'ratings': {
                '$objectToArray': '$ratings'
            }
        }
    }, {
        '$unset': '_id'
    }
]
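To make the reshaping done by the pipeline easier to follow, here is a rough pure-Python sketch of what `$objectToArray` and `$unwind` do to a single document. The helper names `object_to_array` and `flatten_book` and the sample document are hypothetical and only mirror the row shapes shown in this notebook; this is an illustration, not how MongoDB implements these stages.

```python
def object_to_array(obj):
    """Mimic $objectToArray for a plain dict: {'a': 1} -> [{'k': 'a', 'v': 1}]."""
    return [{'k': k, 'v': v} for k, v in obj.items()]


def flatten_book(doc):
    """Reshape one book document roughly the way the pipeline does."""
    ratings = object_to_array(doc.get('ratings', {}))
    reviews = object_to_array(doc.get('reviews', {}))
    # With preserveNullAndEmptyArrays=True, a book without reviews is kept
    # as a single row that has no 'reviews' field.
    if not reviews:
        return [{'book_id': doc['book_id'], 'ratings': ratings}]
    # Otherwise $unwind emits one row per review; ratings repeat on every row.
    return [{'book_id': doc['book_id'], 'reviews': r, 'ratings': ratings}
            for r in reviews]


# Hypothetical document, shaped like the data in this notebook
sample = {
    'book_id': 'example-book',
    'reviews': {
        '101': {'user_id': 'u1', 'text': 'Great'},
        '102': {'user_id': 'u2', 'text': 'Okay'},
    },
    'ratings': {'5 star': '10', '4 star': '3'},
}
rows = flatten_book(sample)
```

Each review becomes its own row with a `{'k': review_id, 'v': {...}}` value, which is why the ratings are duplicated on every review row of a book.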
    

In [173]:
# Connect to Mongo 
client = pymongo.MongoClient(MONGO_URI)
db = client[MONGO_DB]
collection = db[MONGO_COLLECTION]

ret_raw = collection.aggregate(pipeline)
df_raw = pd.DataFrame(ret_raw)

### 3. Data Cleaning


In [174]:
df_raw.isna().sum()

df_raw.head()

Unnamed: 0,book_id,reviews,ratings
0,77203.The_Kite_Runner,"{'k': '8947952', 'v': {'user_name': 'Linda', '...","[{'k': '5 star', 'v': '1,582,498'}, {'k': 'Rat..."
1,77203.The_Kite_Runner,"{'k': '1305882067', 'v': {'user_name': 'فرشاد'...","[{'k': '5 star', 'v': '1,582,498'}, {'k': 'Rat..."
2,77203.The_Kite_Runner,"{'k': '22703379', 'v': {'user_name': 'J.G. Kee...","[{'k': '5 star', 'v': '1,582,498'}, {'k': 'Rat..."
3,77203.The_Kite_Runner,"{'k': '9020638', 'v': {'user_name': 'Britta', ...","[{'k': '5 star', 'v': '1,582,498'}, {'k': 'Rat..."
4,77203.The_Kite_Runner,"{'k': '1338106', 'v': {'user_name': 'Chris', '...","[{'k': '5 star', 'v': '1,582,498'}, {'k': 'Rat..."


In [175]:
# Keep one row per book_id (drop duplicate books)
df_ratings = df_raw.groupby('book_id').first().reset_index()
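One subtlety worth noting: `groupby(...).first()` takes the first non-null value of each column within a group, so it can combine values from different rows, whereas `drop_duplicates` keeps the first row unchanged. The toy frame below is made up purely to show the difference; which behaviour is preferable for this dataset is a judgment call.

```python
import numpy as np
import pandas as pd

# Made-up data: the first row of book 'a' has no review
toy = pd.DataFrame({
    'book_id': ['a', 'a', 'b'],
    'reviews': [np.nan, 'r2', 'r3'],
    'ratings': ['x', 'x', 'y'],
})

# first() skips nulls column by column within each group
by_group = toy.groupby('book_id').first().reset_index()

# drop_duplicates keeps the first row as it is, nulls included
by_row = toy.drop_duplicates(subset='book_id', keep='first').reset_index(drop=True)
```

Here `by_group` reports `'r2'` as the review for book `'a'`, while `by_row` keeps the missing value from the first row.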

In [176]:
df_ratings

Unnamed: 0,book_id,reviews,ratings
0,1.Harry_Potter_and_the_Half_Blood_Prince,"{'k': '683662307', 'v': {'user_name': 'Jayson'...","[{'k': '5 star', 'v': '2,059,562'}, {'k': 'Rat..."
1,10008056-journal-64,"{'k': '1321408528', 'v': {'user_name': 'Jayson...","[{'k': '5 star', 'v': '7,330'}, {'k': 'Ratings..."
2,10014677-front-row-center,"{'k': '443433180', 'v': {'user_name': 'Trish J...","[{'k': '5 star', 'v': '25'}, {'k': 'Ratings_Co..."
3,100237.Monkey,"{'k': '2151417869', 'v': {'user_name': '°°°·.°...","[{'k': '5 star', 'v': '2,366'}, {'k': 'Ratings..."
4,1002427.Teddy_Bears_1_to_10,"{'k': '12529237', 'v': {'user_name': 'Shala Ho...","[{'k': '5 star', 'v': '3'}, {'k': 'Ratings_Cou..."
...,...,...,...
4467,996483.10_Fat_Turkeys,"{'k': '2275927619', 'v': {'user_name': 'Donna ...","[{'k': '5 star', 'v': '337'}, {'k': 'Ratings_C..."
4468,99664.The_Painted_Veil,"{'k': '2453989852', 'v': {'user_name': 'Jim Fo...","[{'k': '5 star', 'v': '12,621'}, {'k': 'Rating..."
4469,9969571-ready-player-one,"{'k': '200552364', 'v': {'user_name': 'Kemper'...","[{'k': '5 star', 'v': '537,735'}, {'k': 'Ratin..."
4470,9975313-the-humming-room,"{'k': '193495728', 'v': {'user_name': 'Misty',...","[{'k': '5 star', 'v': '1,574'}, {'k': 'Ratings..."


In [144]:
%%time

# Let's take a look at one row.

one_item = df_raw.iloc[0]
# Let's unpack the reviews.

# one_item['reviews'].values()
 # dict_values(['8947952', {'user_name': 'Linda', 'user_id': '613434-linda', 'text': 'Finished this boo
# Each review value is a dict whose fields include:
# 1. the reviewer's user_name
# 2. the reviewer's user_id
# 3. the review text

# Goal: take the user_id and the text and put them into a dataframe.


one_item

CPU times: user 98 µs, sys: 11 µs, total: 109 µs
Wall time: 110 µs


book_id                                77203.The_Kite_Runner
reviews    {'k': '8947952', 'v': {'user_name': 'Linda', '...
ratings    [{'k': '5 star', 'v': '1,582,498'}, {'k': 'Rat...
Name: 0, dtype: object

In [162]:

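def kv_list_to_dict(pairs):
    # $objectToArray yields a list of {'k': ..., 'v': ...} dicts; rebuild the plain dict.
    # Values that are already dicts are returned unchanged, so re-running this cell
    # no longer raises "TypeError: string indices must be integers".
    if isinstance(pairs, dict):
        return pairs
    return {d['k']: d['v'] for d in pairs}

# Preview the conversion on the first row
dict_data = kv_list_to_dict(df_raw['ratings'].iloc[0])

# Replace the column in the original df
df_raw['ratings'] = df_raw['ratings'].parallel_apply(kv_list_to_dict)
df_raw['ratings']
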

0         {'5 star': '1,582,498', 'Ratings_Count': '2,93...
1         {'5 star': '1,582,498', 'Ratings_Count': '2,93...
2         {'5 star': '1,582,498', 'Ratings_Count': '2,93...
3         {'5 star': '1,582,498', 'Ratings_Count': '2,93...
4         {'5 star': '1,582,498', 'Ratings_Count': '2,93...
                                ...                        
114608    {'5 star': '11', 'Ratings_Count': '38', 'Total...
114609    {'5 star': '11', 'Ratings_Count': '38', 'Total...
114610    {'5 star': '7', 'Ratings_Count': '19', 'Total_...
114611    {'5 star': '1', 'Ratings_Count': '6', 'Total_R...
114612    {'5 star': '1', 'Ratings_Count': '2', 'Total_R...
Name: ratings, Length: 114613, dtype: object

In [161]:
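def generate_dataframe(df_raw):
    for ratings in df_raw['ratings']:
        # Accept both the raw list of {'k', 'v'} pairs and an already-converted dict;
        # iterating over a dict yields its string keys, which caused the TypeError.
        dict_data = ratings if isinstance(ratings, dict) else {d['k']: d['v'] for d in ratings}
        yield pd.DataFrame(dict_data, index=[0])

df_ratings = pd.concat(generate_dataframe(df_raw))
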

In [148]:
df_ratings.reset_index()

Unnamed: 0,index,5 star,Ratings_Count,Total_Review_Count,4 star,3 star,2 star,1 star
0,0,1582498,2935312,90233,918930,308702,79972,45210
1,0,1582498,2935312,90233,918930,308702,79972,45210
2,0,1582498,2935312,90233,918930,308702,79972,45210
3,0,1582498,2935312,90233,918930,308702,79972,45210
4,0,1582498,2935312,90233,918930,308702,79972,45210
...,...,...,...,...,...,...,...,...
114608,0,11,38,5,17,6,3,1
114609,0,11,38,5,17,6,3,1
114610,0,7,19,1 review,7,4,1,0
114611,0,1,6,0,0,4,1,0
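The rating counts arrive as strings (for example `'1,582,498'` in the raw data, or `'1 review'` in `Total_Review_Count` above), so they need to be parsed before any numeric analysis. A minimal sketch follows, assuming every value starts with a number that may contain thousands separators; the helper name `to_count` and the toy frame are hypothetical.

```python
import re
import pandas as pd


def to_count(value):
    """Parse '1,582,498' or '1 review' into an int; return None if no leading number."""
    match = re.match(r'\s*(\d[\d,]*)', str(value))
    if match is None:
        return None
    return int(match.group(1).replace(',', ''))


# Made-up values in the same formats seen in this notebook
toy = pd.DataFrame({
    '5 star': ['1,582,498', '7'],
    'Total_Review_Count': ['90,233', '1 review'],
})

# Apply the parser to every cell, column by column
clean = toy.apply(lambda col: col.map(to_count))
```

Values without a leading number come back as `None`, which makes unparseable entries easy to spot with `isna()` afterwards.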


In [149]:
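def unpack_reviews(book_id, row):
    # After $objectToArray and $unwind, row['reviews'] is {'k': review_id, 'v': {...}};
    # rows for books without reviews do not hold a dict and are skipped.
    reviews = row['reviews']
    if isinstance(reviews, dict):
        for review in reviews.values():
            if isinstance(review, dict):
                yield book_id, review['user_id'], review['text']



In [150]:
dfs = []
for _, row in df_raw.iterrows():
    # Carry book_id with every review so that skipped rows cannot shift the alignment
    reviews_df = pd.DataFrame(unpack_reviews(row['book_id'], row), columns=['book_id', 'user_id', 'text'])
    dfs.append(reviews_df)

# Concatenate the dataframes
df_reviews = pd.concat(dfs, ignore_index=True)


In [151]:
# How can all three dataframes be joined without losing any data?

# df_reviews
# df_ratings
# df_raw

# book_id is already attached to each review, so no positional join is needed
# (joining on the index would misalign rows whenever a row was skipped above).
df_reviews_2 = df_reviews[['user_id', 'text', 'book_id']]
df_reviews_2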

Unnamed: 0,user_id,text,book_id
0,613434-linda,Finished this book about a month ago but it's ...,77203.The_Kite_Runner
1,31207039,"In 2012, when I was Mathematics teacher at a p...",77203.The_Kite_Runner
2,84023-j-g-keely,This is the sort of book White America reads t...,77203.The_Kite_Runner
3,616569-britta,"""For you, a thousand times over.""""Children are...",77203.The_Kite_Runner
4,91373-chris,\nDue to the large number of negative comments...,77203.The_Kite_Runner
...,...,...,...
114456,1395652-sean,Gives alternatives to start exploring what you...,38743182-silo
114457,966963-kevin,"It was interesting, though it was less practic...",38743182-silo
114458,1895489-maryann,"If I was building a house from the ground, I t...",38743182-silo
114459,27453018-eric-gittins,Informative . Some content does not apply in o...,38743182-silo
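On the question of joining the frames without losing data, one possible approach is a key-based outer merge on `book_id` with `validate` and `indicator`, which checks the expected cardinality and flags rows that exist on only one side. The sketch below uses small made-up frames; it assumes the ratings frame has exactly one row per `book_id`, which would have to be ensured first (for example by de-duplicating).

```python
import pandas as pd

# Made-up stand-ins for the reviews and per-book ratings frames
reviews = pd.DataFrame({
    'book_id': ['a', 'a', 'b'],
    'user_id': ['u1', 'u2', 'u3'],
    'text': ['Great', 'Okay', 'Meh'],
})
ratings = pd.DataFrame({
    'book_id': ['a', 'b', 'c'],
    '5 star': [10, 2, 0],
})

merged = reviews.merge(
    ratings,
    on='book_id',
    how='outer',              # keep books that have ratings but no reviews
    validate='many_to_one',   # raise if ratings has duplicate book_id values
    indicator=True,           # adds a '_merge' column: both / left_only / right_only
)

# Books with ratings but no reviews show up as 'right_only'
unmatched = merged[merged['_merge'] != 'both']
```

Inspecting `unmatched` shows exactly which rows had no counterpart, so nothing is dropped silently.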


### 4. Exploratory Data Analysis


### 5. Feature Engineering

### 6. Saving the dataset