**Modules**

In [6]:
import pandas as pd
import numpy as np
import json
import gzip
import gc

### Selecting Books With `ratings_count > 15`

**Function to parse the specific fields of the JSON data**

In [7]:
def parse_fields(line):
    data = json.loads(line)
    return {
        'isbn' : data['isbn'],
        'text_reviews_count' : data['text_reviews_count'],
        'series' : data['series'],
        'country_code' : data['country_code'],
        'language_code' : data['language_code'],
        'asin' : data['asin'],
        'is_ebook' : data['is_ebook'],
        'average_rating' : data['average_rating'],
        'kindle_asin' : data['kindle_asin'],
        'similar_books' : data['similar_books'],
        'description' : data['description'],
        'format' : data['format'],
        'link' : data['link'],
        'authors' : data['authors'],
        'publisher' : data['publisher'],
        'num_pages' : data['num_pages'],
        'publication_day' : data['publication_day'],
        'isbn13' : data['isbn13'],
        'publication_month' : data['publication_month'],
        'publication_year' : data['publication_year'],
        'url' : data['url'],
        'image_url' : data['image_url'],
        'book_id' : data['book_id'],
        'ratings_count' : data['ratings_count'],
        'title_without_series' : data['title_without_series']
    }
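The explicit dictionary above copies 25 keys one by one. A more compact equivalent is sketched below, assuming every record carries all 25 keys; `FIELDS` and `parse_fields_compact` are hypothetical names introduced only for this illustration.

```python
import json

# The 25 keys copied by parse_fields, in the same order.
FIELDS = (
    "isbn", "text_reviews_count", "series", "country_code", "language_code",
    "asin", "is_ebook", "average_rating", "kindle_asin", "similar_books",
    "description", "format", "link", "authors", "publisher", "num_pages",
    "publication_day", "isbn13", "publication_month", "publication_year",
    "url", "image_url", "book_id", "ratings_count", "title_without_series",
)

def parse_fields_compact(line):
    """Hypothetical variant of parse_fields: same keys, built from FIELDS."""
    data = json.loads(line)
    # data[name] raises KeyError on a missing key, just like the explicit dict.
    return {name: data[name] for name in FIELDS}

# Made-up record with one extra key that should be ignored.
sample = json.dumps({**{name: "" for name in FIELDS}, "title": "ignored"})
record = parse_fields_compact(sample)
```

Keeping the key list in one place makes it easier to add or remove a field later without touching the function body.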

**We go through every book in the dataset line by line, using an infinite loop that stops at the end of the file**
- Only the books with `ratings_count > 15` are parsed and kept

In [8]:
books = []

with gzip.open("../Initial/books.json.gz") as f:
    while True:
        # reading the line
        line = f.readline()

        # we will break the infinite loop when we reach the end of the dataset file
        if not line:
            break
        
        # parsing the line
        fields = parse_fields(line)
        
        # trying to convert ratings_count into integer
        try:
            ratings_count = int(fields["ratings_count"])
        except ValueError:
            continue

        # keep only the books that have more than 15 ratings
        if ratings_count > 15:
            books.append(fields)
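The same filter can also be sketched as a generator that iterates over the file object directly instead of calling `readline()` in a `while True` loop. This is a minimal sketch on a made-up, in-memory sample: `iter_popular` is a hypothetical helper, and it skips the field-selection step done by `parse_fields`.

```python
import gzip
import io
import json

def iter_popular(lines, min_ratings=15):
    """Yield parsed records whose ratings_count is an integer above min_ratings."""
    for line in lines:
        record = json.loads(line)
        try:
            count = int(record["ratings_count"])
        except ValueError:
            continue  # e.g. an empty string instead of a number
        if count > min_ratings:
            yield record

# Hypothetical three-line sample, gzip-compressed in memory.
raw = "\n".join(json.dumps(r) for r in [
    {"book_id": "1", "ratings_count": "140"},
    {"book_id": "2", "ratings_count": ""},
    {"book_id": "3", "ratings_count": "15"},
]).encode("utf-8")
buffer = io.BytesIO(gzip.compress(raw))

with gzip.open(buffer) as f:
    kept = [r["book_id"] for r in iter_popular(f)]
```

Only the first sample record passes: the second has no usable count, and the third has exactly 15 ratings, which is not strictly greater than 15.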

**Total number of books**

In [9]:
len(books)

1308957

**Insight**
- There are 1,308,957 books left after filtering on the criterion `ratings_count > 15`

In [10]:
books[0]

{'isbn': '',
 'text_reviews_count': '7',
 'series': ['189911'],
 'country_code': 'US',
 'language_code': 'eng',
 'asin': 'B00071IKUY',
 'is_ebook': 'false',
 'average_rating': '4.03',
 'kindle_asin': '',
 'similar_books': ['19997',
  '828466',
  '1569323',
  '425389',
  '1176674',
  '262740',
  '3743837',
  '880461',
  '2292726',
  '1883810',
  '1808197',
  '625150',
  '1988046',
  '390170',
  '2620131',
  '383106',
  '1597281'],
 'description': 'Omnibus book club edition containing the Ladies of Madrigyn and the Witches of Wenshar.',
 'format': 'Hardcover',
 'link': 'https://www.goodreads.com/book/show/7327624-the-unschooled-wizard',
 'authors': [{'author_id': '10333', 'role': ''}],
 'publisher': 'Nelson Doubleday, Inc.',
 'num_pages': '600',
 'publication_day': '',
 'isbn13': '',
 'publication_month': '',
 'publication_year': '1987',
 'url': 'https://www.goodreads.com/book/show/7327624-the-unschooled-wizard',
 'image_url': 'https://images.gr-assets.com/books/1304100136m/7327624.jpg',

**Creating a DataFrame from the list of dictionaries**

In [11]:
items = pd.DataFrame.from_dict(books)

**Rows and Columns of the DataFrame**

In [12]:
print(f"Rows: {items.shape[0]}")
print(f"Columns: {items.shape[1]}")

Rows: 1308957
Columns: 25


**We don't need the `books` object anymore**
- Deleting the name from the namespace and forcing a garbage-collection pass (if required)
- The `del` statement only removes the name `books` from the namespace; the list itself is freed once nothing else references it
- `gc.collect()` forces a collection pass, which reclaims objects that are kept alive only by reference cycles

In [13]:
del books

In [14]:
gc.collect()

0
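The difference between `del` and `gc.collect()` can be seen in a small experiment. The sketch below assumes CPython (reference counting plus a cycle collector); `Node` is a hypothetical helper class, and the automatic collector is switched off temporarily so that it cannot interfere with the demonstration.

```python
import gc
import weakref

class Node:
    """Tiny helper class (hypothetical) that can point at another Node."""
    def __init__(self):
        self.other = None

gc.disable()  # keep the automatic collector out of the way for the demo
try:
    # Case 1: no cycle. Dropping the only name frees the object right away.
    plain = Node()
    plain_ref = weakref.ref(plain)
    del plain
    plain_freed_by_del = plain_ref() is None

    # Case 2: two objects that reference each other survive `del`.
    a, b = Node(), Node()
    a.other, b.other = b, a
    cycle_ref = weakref.ref(a)
    del a, b
    cycle_alive_after_del = cycle_ref() is not None

    gc.collect()  # the cycle detector reclaims the pair
    cycle_freed_by_collect = cycle_ref() is None
finally:
    gc.enable()
```

A return value of `0` from `gc.collect()`, as in the cell above, means the pass found no unreachable objects to reclaim.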

### Creating Modified Title to Minimize Search Space

**Removing every character except A-Z, a-z, 0-9 and space from `title_without_series` and saving the result in a new `mod_title` column**

In [15]:
items["mod_title"] = items["title_without_series"].str.replace("[^a-zA-Z0-9 ]", "", regex=True)

**Make `mod_title` lower case**

In [16]:
items["mod_title"] = items["mod_title"].str.lower()

**Collapsing runs of whitespace into a single space**

In [17]:
items["mod_title"] = items["mod_title"].str.replace(r"\s+", " ", regex=True)

**Keeping only the records where `mod_title` is not empty (length > 0)**

In [18]:
items = items[items["mod_title"].str.len() > 0]

In [19]:
len(items) 

1302659

**Insight**
- We can observe a slight reduction in count, which is expected
  -  `(1308957 -> 1302659)`
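The cleanup steps can be tried on a handful of titles without pandas. The sketch below is a plain-Python mirror of the three `.str` calls followed by the length filter; `normalize_title` is a hypothetical helper and the sample titles are chosen for illustration.

```python
import re

def normalize_title(title):
    """Plain-Python mirror of the three `.str` steps applied to `mod_title`."""
    kept = re.sub(r"[^a-zA-Z0-9 ]", "", title)  # drop all but ASCII letters, digits, spaces
    lowered = kept.lower()
    return re.sub(r"\s+", " ", lowered)         # collapse runs of whitespace

titles = ["The Devil's Notebook", "Best   Friends Forever", "3''", "?!", "- -"]
normalized = [normalize_title(t) for t in titles]

# Same rule as the notebook: keep only non-empty results.
survivors = [m for m in normalized if len(m) > 0]
```

Note that `"- -"` collapses to a single space, which still has length 1 and therefore passes the `len > 0` filter; only `"?!"` is removed here.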

#### `mod_title` Column Analysis - Important For Search Engine

**Calculate the ratio of `len(title_without_series)` to `len(mod_title)`**
- This gives us a measure of how much of the title was removed by the cleanup

In [20]:
items["actualT_modT_ratio"] = items["title_without_series"].str.len() / items["mod_title"].str.len()
items.head()

Unnamed: 0,isbn,text_reviews_count,series,country_code,language_code,asin,is_ebook,average_rating,kindle_asin,similar_books,...,isbn13,publication_month,publication_year,url,image_url,book_id,ratings_count,title_without_series,mod_title,actualT_modT_ratio
0,,7,[189911],US,eng,B00071IKUY,False,4.03,,"[19997, 828466, 1569323, 425389, 1176674, 2627...",...,,,1987,https://www.goodreads.com/book/show/7327624-th...,https://images.gr-assets.com/books/1304100136m...,7327624,140,"The Unschooled Wizard (Sun Wolf and Starhawk, ...",the unschooled wizard sun wolf and starhawk 12,1.108696
1,743294297.0,3282,[],US,eng,,False,3.49,B002ENBLOK,"[6604176, 6054190, 2285777, 82641, 7569453, 70...",...,9780743294294.0,7.0,2009,https://www.goodreads.com/book/show/6066819-be...,https://s.gr-assets.com/assets/nophoto/book/11...,6066819,51184,Best Friends Forever,best friends forever,1.0
2,1599150603.0,7,[],US,,,False,4.13,B00DU10PUG,[],...,9781599150604.0,9.0,2006,https://www.goodreads.com/book/show/287141.The...,https://s.gr-assets.com/assets/nophoto/book/11...,287141,46,The Aeneid for Boys and Girls,the aeneid for boys and girls,1.0
3,1934876569.0,6,[151854],US,,,False,4.22,,"[948696, 439885, 274955, 12978730, 372986, 216...",...,9781934876565.0,3.0,2009,https://www.goodreads.com/book/show/6066812-al...,https://images.gr-assets.com/books/1316637798m...,6066812,98,All's Fairy in Love and War (Avalon: Web of Ma...,alls fairy in love and war avalon web of magic 8,1.125
4,922915113.0,39,[],US,,,False,3.81,B00AFYVB8Q,"[287151, 1104760, 1172822, 440292, 287082, 630...",...,9780922915118.0,4.0,2000,https://www.goodreads.com/book/show/287149.The...,https://images.gr-assets.com/books/1328768789m...,287149,986,The Devil's Notebook,the devils notebook,1.052632
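As a worked example of the ratio, the sketch below recomputes it for three title pairs that appear in this notebook's outputs; `title_ratio` is a hypothetical helper written for the illustration.

```python
def title_ratio(title, mod_title):
    """Length of the original title divided by the length of its cleaned form."""
    return len(title) / len(mod_title)

examples = {
    "Emma": "emma",
    "The Devil's Notebook": "the devils notebook",
    "3''": "3",
}
ratios = {title: round(title_ratio(title, mod), 6) for title, mod in examples.items()}
```

A ratio of 1.0 means nothing was removed, a value slightly above 1 means a few punctuation characters were dropped, and a large value means most of the title was lost.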


**Looking at `actualT_modT_ratio` for very short titles, `len(title_without_series) < 5`**

In [21]:
items.loc[items["title_without_series"].str.len() < 5]

Unnamed: 0,isbn,text_reviews_count,series,country_code,language_code,asin,is_ebook,average_rating,kindle_asin,similar_books,...,isbn13,publication_month,publication_year,url,image_url,book_id,ratings_count,title_without_series,mod_title,actualT_modT_ratio
21,1597371289,8,[],US,eng,,false,3.99,B0083Z3O8Y,"[31242, 374380, 20564, 383206, 7891, 6335178, ...",...,9781597371285,9,2005,https://www.goodreads.com/book/show/3209316-emma,https://s.gr-assets.com/assets/nophoto/book/11...,3209316,42,Emma,emma,1.0
226,,7,[297710],US,eng,,false,3.81,,"[356205, 266753, 379260, 155159, 45684, 888887...",...,,,,https://www.goodreads.com/book/show/17794787-legs,https://images.gr-assets.com/books/1365842445m...,17794787,38,Legs,legs,1.0
346,,5,[],US,,,false,3.33,,[],...,9789949301317,,2011,https://www.goodreads.com/book/show/13325505-7x7,https://s.gr-assets.com/assets/nophoto/book/11...,13325505,94,7x7,7x7,1.0
349,,2,[],US,,B001EL6RSS,true,3.38,B001EL6RSS,"[1618765, 12934562, 130, 61125, 38524, 7714891...",...,,,,https://www.goodreads.com/book/show/19222476-send,https://s.gr-assets.com/assets/nophoto/book/11...,19222476,20,Send,send,1.0
436,,5,[],US,eng,,false,3.55,,[],...,,,,https://www.goodreads.com/book/show/18042275-skin,https://images.gr-assets.com/books/1469652552m...,18042275,71,Skin,skin,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1307639,,16,[],US,,B01F293HAS,true,4.07,,"[26312997, 25897916, 25897871, 27311743, 24611...",...,,,,https://www.goodreads.com/book/show/30125838-grit,https://s.gr-assets.com/assets/nophoto/book/11...,30125838,106,Grit,grit,1.0
1307948,0375759603,28,[],US,,,false,3.29,B000FC1I6S,"[3433301, 118192, 124893, 1001309, 1136604, 22...",...,9780375759604,,2002,https://www.goodreads.com/book/show/904266.Fury,https://images.gr-assets.com/books/1408924518m...,904266,224,Fury,fury,1.0
1307959,0547728247,3,[],US,eng,,true,4.10,,"[219107, 76740, 9682235, 18952, 60146, 845501,...",...,9780547728247,4,2012,https://www.goodreads.com/book/show/16555672-ubik,https://images.gr-assets.com/books/1359642037m...,16555672,16,Ubik,ubik,1.0
1308509,1626391912,18,[],US,eng,,false,3.75,B00ND8SUT4,[],...,9781626391918,9,2014,https://www.goodreads.com/book/show/20702686-jolt,https://images.gr-assets.com/books/1399836028m...,20702686,143,Jolt,jolt,1.0


**Looking at the statistics of the `actualT_modT_ratio` column where `len(title_without_series) < 5`**

In [22]:
items.loc[items["title_without_series"].str.len() < 5, "actualT_modT_ratio"].describe()

count    8253.000000
mean        1.032736
std         0.165538
min         1.000000
25%         1.000000
50%         1.000000
75%         1.000000
max         4.000000
Name: actualT_modT_ratio, dtype: float64

In [23]:
np.percentile(items.loc[items["title_without_series"].str.len() < 5, "actualT_modT_ratio"].values,94)

1.0

In [24]:
np.percentile(items.loc[items["title_without_series"].str.len() < 5, "actualT_modT_ratio"].values,97)

1.3333333333333333

In [25]:
np.percentile(items.loc[items["title_without_series"].str.len() < 5, "actualT_modT_ratio"].values,98)

1.5

In [26]:
np.percentile(items.loc[items["title_without_series"].str.len() < 5, "actualT_modT_ratio"].values,99)

2.0

**Insight**
- About 99% of the rows matching this criterion have `actualT_modT_ratio <= 2`
- So we can take `2` as our threshold when `len(title_without_series) < 5`
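A percentile reading can be cross-checked from the other direction by counting the share of values at or below a candidate threshold. The sketch below uses made-up ratios, not values from the dataset, and `share_at_or_below` is a hypothetical helper.

```python
def share_at_or_below(values, threshold):
    """Fraction of values that are <= threshold."""
    return sum(v <= threshold for v in values) / len(values)

# Hypothetical ratios, NOT taken from the dataset: 97 exact matches and 3 others.
toy_ratios = [1.0] * 97 + [1.5, 3.0, 4.0]
covered = share_at_or_below(toy_ratios, 2)
```

If the share comes out close to the percentile that was read off, the threshold covers the intended part of the distribution.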

**With these parameters, looking at the rows where `len(title_without_series) < 5` and `actualT_modT_ratio > 2`**

In [27]:
temp_df = items.loc[items["title_without_series"].str.len() < 5]
temp_df.loc[temp_df["actualT_modT_ratio"] > 2].sample(10)

Unnamed: 0,isbn,text_reviews_count,series,country_code,language_code,asin,is_ebook,average_rating,kindle_asin,similar_books,...,isbn13,publication_month,publication_year,url,image_url,book_id,ratings_count,title_without_series,mod_title,actualT_modT_ratio
918636,,11,[],US,pes,,False,4.23,,[],...,9786009545001.0,,2015,https://www.goodreads.com/book/show/28930830,https://images.gr-assets.com/books/1454571341m...,28930830,17,٧ جن,,4.0
270075,9758518089,2,"[179007, 179001, 179011, 889174]",US,tur,,False,4.23,,"[291693, 978999, 362055, 445761, 362053, 52958...",...,9789758518081.0,12.0,2001,https://www.goodreads.com/book/show/19478121-g,https://images.gr-assets.com/books/1387111296m...,19478121,19,Göç,g,3.0
103179,9953215073,12,[],US,ara,,False,3.64,,[],...,,6.0,2011,https://www.goodreads.com/book/show/13610925,https://images.gr-assets.com/books/1334997051m...,13610925,47,و ..,,4.0
1095472,,14,[243751],US,ara,,False,3.59,,"[6037124, 9553204, 6445626]",...,,7.0,2010,https://www.goodreads.com/book/show/6279485,https://images.gr-assets.com/books/1381864902m...,6279485,399,هم !,,4.0
113279,,37,[],US,,,False,3.91,,"[12907023, 2334969, 10872535, 6652169, 522888,...",...,,3.0,2014,https://www.goodreads.com/book/show/21460333-o,https://images.gr-assets.com/books/1394794711m...,21460333,267,Đảo,o,3.0
983859,275602595X,11,[],US,fre,,False,4.01,,[],...,9782756025957.0,9.0,2011,https://www.goodreads.com/book/show/12873782-3,https://images.gr-assets.com/books/1332325362m...,12873782,123,3'',3,3.0
658456,0801846684,24,[],US,eng,,False,4.27,,"[385120, 409241, 150347, 385672, 405100, 33094...",...,9780801846687.0,10.0,1993,https://www.goodreads.com/book/show/375182._A_,https://s.gr-assets.com/assets/nophoto/book/11...,375182,570,“A”,a,3.0
92498,,2,[],US,fin,,False,3.73,,[],...,9789529887545.0,,2008,https://www.goodreads.com/book/show/6272309-j,https://s.gr-assets.com/assets/nophoto/book/11...,6272309,59,Jää,j,3.0
380971,9866562131,1,[],US,zho,,False,3.96,,[],...,9789866562136.0,1.0,2009,https://www.goodreads.com/book/show/9967288,https://s.gr-assets.com/assets/nophoto/book/11...,9967288,20,樂園 下,,4.0
75566,,8,[],US,ara,,False,3.62,,[],...,,12.0,2014,https://www.goodreads.com/book/show/24012899-3,https://images.gr-assets.com/books/1419263497m...,24012899,24,3فاز,3,4.0


**We can get the statistics of `mod_title` length when (`actualT_modT_ratio > 2`)**

In [28]:
items.loc[(items["actualT_modT_ratio"] > 2), "mod_title"].str.len().describe()

count    42709.000000
mean         1.512655
std          2.431791
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max         45.000000
Name: mod_title, dtype: float64

In [29]:
for i in range(85,100):
    print(i, "=>" ,np.percentile(items.loc[(items["actualT_modT_ratio"] > 2), "mod_title"].str.len(),i))

85 => 1.0
86 => 1.0
87 => 1.0
88 => 1.0
89 => 2.0
90 => 2.0
91 => 2.0
92 => 2.0
93 => 2.0
94 => 2.0
95 => 3.0
96 => 4.0
97 => 6.0
98 => 10.0
99 => 15.0


**Insight**
- We cannot use `actualT_modT_ratio > 2` alone as our criterion, because we might lose records whose `mod_title` is still longer than 4 characters

**We will add a `mod_title` length threshold as well and also adjust the value of `actualT_modT_ratio`**
- Testing out a few values based on the observations above

In [48]:
items.loc[(items["actualT_modT_ratio"] > 2.3) & (items["mod_title"].str.len() > 6)]

Unnamed: 0,isbn,text_reviews_count,series,country_code,language_code,asin,is_ebook,average_rating,kindle_asin,similar_books,...,isbn13,publication_month,publication_year,url,image_url,book_id,ratings_count,title_without_series,mod_title,actualT_modT_ratio
727,,33,[691556],US,gre,,false,3.88,,"[18102876, 18052948, 21350894, 18135589, 24909...",...,9789603648482,4,2015,https://www.goodreads.com/book/show/25204099,https://images.gr-assets.com/books/1427144494m...,25204099,158,"Α μπε μπα μπλομ (Helen Grace, #1)",helen grace 1,2.357143
4253,,5,[],US,rus,,false,3.33,,[],...,9785916571431,,2011,https://www.goodreads.com/book/show/10143804--...,https://s.gr-assets.com/assets/nophoto/book/11...,10143804,31,Железный человек есть в каждом. От кресла бизн...,ironman,8.250000
4620,,1,[769261],US,ben,,false,3.41,,[],...,,,,https://www.goodreads.com/book/show/25454679,https://images.gr-assets.com/books/1430375234m...,25454679,17,"আক্রান্ত দূতাবাস (Masud Rana, #281)",masud rana 281,2.333333
5034,,1,[],US,ara,,false,4.08,,[],...,,,2002,https://www.goodreads.com/book/show/9678144-19...,https://images.gr-assets.com/books/1289496475m...,9678144,25,ظفار، الصراع السياسي والعسكري في الخليج العربي...,19701976,6.222222
6048,,10,[273034],US,gre,,false,3.87,,"[15704485, 13596809, 13612739, 15739921, 16070...",...,9789601646046,10,2012,https://www.goodreads.com/book/show/16049765,https://images.gr-assets.com/books/1348839415m...,16049765,154,Πενήντα πιο σκοτεινές αποχρώσεις του γκρι (Fif...,fifty shades 2,4.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1301558,9542813007,4,[],US,--,,false,3.41,,[],...,9789542813002,5,2013,https://www.goodreads.com/book/show/17924291-t...,https://images.gr-assets.com/books/1368519586m...,17924291,58,Където се раждат ангелите | The eye of the sky,the eye of the sky,2.421053
1305982,,3,"[787876, 773429, 483372, 755577]",US,kat,,false,4.12,,"[7552385, 12973980, 20924715, 131332, 16028305...",...,9789941467288,,2016,https://www.goodreads.com/book/show/31699289---ii,https://images.gr-assets.com/books/1472466757m...,31699289,17,მაძიებელი - ჯადოქრის პირველი კანონი II (ჭეშმარ...,ii 1 1,10.428571
1307757,,2,[],US,tha,,false,3.73,,"[5202, 281285, 639374, 953345, 1319826, 287505...",...,,,,https://www.goodreads.com/book/show/9570864-th...,https://images.gr-assets.com/books/1288012440m...,9570864,21,เดอะ รีดเดอร์ / The Reader,the reader,2.363636
1307790,,2,[772452],US,ben,,false,4.42,,[],...,,,,https://www.goodreads.com/book/show/25254196,https://images.gr-assets.com/books/1427775619m...,25254196,33,"অপারেশন তেলআবিব (Saimum, #1)",saimum 1,3.111111


In [49]:
items.loc[(items["actualT_modT_ratio"] > 2.3) & (items["mod_title"].str.len() < 7), "mod_title"].str.len().describe()

count    41523.000000
mean         1.150591
std          0.576530
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max          6.000000
Name: mod_title, dtype: float64

In [50]:
for i in range(85,100):
    print(i, "=>" ,np.percentile(items.loc[(items["actualT_modT_ratio"] > 2.3) & (items["mod_title"].str.len() < 7), "mod_title"].str.len(),i))

85 => 1.0
86 => 1.0
87 => 1.0
88 => 1.0
89 => 1.0
90 => 1.0
91 => 2.0
92 => 2.0
93 => 2.0
94 => 2.0
95 => 2.0
96 => 2.0
97 => 3.0
98 => 3.0
99 => 4.0


In [55]:
for i in range(0,10):
    print(i, "=>" ,np.percentile(items.loc[(items["actualT_modT_ratio"] > 2.4) & (items["mod_title"].str.len() < 7), "actualT_modT_ratio"],i))

0 => 2.5
1 => 4.5
2 => 6.2
3 => 7.0
4 => 8.0
5 => 8.0
6 => 9.0
7 => 9.0
8 => 9.0
9 => 9.0


**Insight**
- We have fixed the thresholds for the rows to drop
  - `actualT_modT_ratio > 2.3` and `len(mod_title) < 7`
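The combined rule can be written as a small predicate and checked on individual title pairs. This is a plain-Python sketch of the thresholds stated above; `should_drop` and the two constants are hypothetical names, and the sample pairs are taken from titles shown in this notebook.

```python
RATIO_THRESHOLD = 2.3  # ratio above which the cleanup removed most of the title
MAX_MOD_LEN = 7        # mod_title must be shorter than this to be dropped

def should_drop(title, mod_title):
    """True when most of the title was removed and only a short remainder is left."""
    ratio = len(title) / len(mod_title)
    return ratio > RATIO_THRESHOLD and len(mod_title) < MAX_MOD_LEN

rows = [
    ("Emma", "emma"),  # nothing removed, kept
    ("3''", "3"),      # ratio 3.0 with a one-character remainder, dropped
    # high ratio, but the remainder is still a searchable English title, kept
    ("Където се раждат ангелите | The eye of the sky", " the eye of the sky"),
]
decisions = [should_drop(title, mod) for title, mod in rows]
```

Requiring both conditions keeps bilingual titles whose English part survives the cleanup, while removing rows where only a stray character or two is left.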

In [None]:
# execution stop point

# 10/0

**Index of the rows to delete**

In [56]:
index_to_del = items.loc[(items["actualT_modT_ratio"] > 2.3) & (items["mod_title"].str.len() < 7)].index
index_to_del

Index([     28,      37,      38,      43,      65,      72,      85,     108,
           118,     175,
       ...
       1308680, 1308686, 1308691, 1308697, 1308789, 1308850, 1308854, 1308881,
       1308931, 1308940],
      dtype='int64', length=41523)

**Dropping the rows**

In [57]:
items.drop(index=index_to_del, inplace=True)

**Dropping the `actualT_modT_ratio` column**

In [58]:
items.drop(columns="actualT_modT_ratio", inplace=True)

In [59]:
len(items)

1261136

In [None]:
# execution stop point

# 10/0

**Insight**
- We can observe a further slight reduction in count, which is expected
  -  `(1,308,957 -> 1,302,659 -> 1,261,136)`

**Exporting the Data as Compressed JSON After Reducing Search Space**

In [60]:
with gzip.open('../Processed/books_p0.json.gz', 'wt', encoding='utf-8') as file:
    file.write(items.to_json(orient='records', lines=True))

In [None]:
# execution stop point

# 10/0

### Testing The Exported File

In [None]:
# testing the file => reading the gzip file in a streaming fashion (first line only)

with gzip.open('../Processed/books_p0.json.gz') as file:
    line = file.readline()

line

b'{"isbn":"","text_reviews_count":"7","series":["189911"],"country_code":"US","language_code":"eng","asin":"B00071IKUY","is_ebook":"false","average_rating":"4.03","kindle_asin":"","similar_books":["19997","828466","1569323","425389","1176674","262740","3743837","880461","2292726","1883810","1808197","625150","1988046","390170","2620131","383106","1597281"],"description":"Omnibus book club edition containing the Ladies of Madrigyn and the Witches of Wenshar.","format":"Hardcover","link":"https:\\/\\/www.goodreads.com\\/book\\/show\\/7327624-the-unschooled-wizard","authors":[{"author_id":"10333","role":""}],"publisher":"Nelson Doubleday, Inc.","num_pages":"600","publication_day":"","isbn13":"","publication_month":"","publication_year":"1987","url":"https:\\/\\/www.goodreads.com\\/book\\/show\\/7327624-the-unschooled-wizard","image_url":"https:\\/\\/images.gr-assets.com\\/books\\/1304100136m\\/7327624.jpg","book_id":"7327624","ratings_count":"140","title_without_series":"The Unschooled Wi

In [None]:
json.loads(line)

{'isbn': '',
 'text_reviews_count': '7',
 'series': ['189911'],
 'country_code': 'US',
 'language_code': 'eng',
 'asin': 'B00071IKUY',
 'is_ebook': 'false',
 'average_rating': '4.03',
 'kindle_asin': '',
 'similar_books': ['19997',
  '828466',
  '1569323',
  '425389',
  '1176674',
  '262740',
  '3743837',
  '880461',
  '2292726',
  '1883810',
  '1808197',
  '625150',
  '1988046',
  '390170',
  '2620131',
  '383106',
  '1597281'],
 'description': 'Omnibus book club edition containing the Ladies of Madrigyn and the Witches of Wenshar.',
 'format': 'Hardcover',
 'link': 'https://www.goodreads.com/book/show/7327624-the-unschooled-wizard',
 'authors': [{'author_id': '10333', 'role': ''}],
 'publisher': 'Nelson Doubleday, Inc.',
 'num_pages': '600',
 'publication_day': '',
 'isbn13': '',
 'publication_month': '',
 'publication_year': '1987',
 'url': 'https://www.goodreads.com/book/show/7327624-the-unschooled-wizard',
 'image_url': 'https://images.gr-assets.com/books/1304100136m/7327624.jpg',
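The export and the read-back check can be reproduced end to end with the standard library. The sketch below is a minimal round trip on made-up records written to a temporary file; the path and the records are hypothetical and unrelated to `books_p0.json.gz`.

```python
import gzip
import json
import os
import tempfile

# Hypothetical records standing in for rows of `items`.
records = [
    {"book_id": "1", "mod_title": "emma"},
    {"book_id": "2", "mod_title": "the devils notebook"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json.gz")  # throwaway path, not the notebook's file

    # One JSON object per line, written through gzip in text mode.
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")

    # Read the lines back lazily and parse each one.
    with gzip.open(path, "rt", encoding="utf-8") as src:
        restored = [json.loads(line) for line in src]
```

Opening the file in text mode (`"rt"`) yields `str` lines rather than the `bytes` shown in the output above; `json.loads` accepts either.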