# Synopsis:
This dataset consists of two sets of files: one contains numerical ratings and the other contains textual ratings (reviews). I am going to combine the two into a new rating, which is then used to pick the best book of every year.

In [None]:
import numpy as np
import pandas as pd
import os
import glob
pd.options.mode.chained_assignment = None

Reading the CSV files whose names start with '**book**' and concatenating them

In [None]:
input_dir = '../input/goodreads-book-datasets-10m'
all_data = []
for each_file in os.listdir(input_dir):
    # Keep only the files whose names start with 'book'
    if each_file.startswith('book'):
        print(each_file)
        df = pd.read_csv(os.path.join(input_dir, each_file),
                         usecols=['Name', 'Rating', 'PublishYear', 'Authors'])
        all_data.append(df)

data = pd.concat(all_data, axis=0)
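
The same loading step can also be sketched with `glob`, which is imported above but not otherwise used. The helper name `load_book_files` and the `book*.csv` pattern are assumptions made for illustration, not part of the dataset's documentation.

```python
import glob
import os

import pandas as pd

COLUMNS = ['Name', 'Rating', 'PublishYear', 'Authors']


def load_book_files(directory, pattern='book*.csv'):
    """Read every CSV in `directory` matching `pattern` into one dataframe.

    Hypothetical helper: the pattern is an assumption about the file names.
    """
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    frames = [pd.read_csv(path, usecols=COLUMNS) for path in paths]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, axis=0, ignore_index=True)
```

Note that `pd.concat` raises a `ValueError` when no file matches, so an empty directory fails loudly rather than returning an empty dataframe.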

In [None]:
data.head()

Checking for any null values in the dataset

In [None]:
data.isna().any()

In [None]:
data.sort_values('PublishYear', ascending=False).head()

Reading the '**User_rating**' file, which contains the textual reviews/ratings of the books

In [None]:
data2 = pd.read_csv('../input/goodreads-book-datasets-10m/user_rating_0_to_1000.csv',
                    usecols=['Name', 'Rating'])

In [None]:
data2.head()

In [None]:
len(data2)

In [None]:
data2.isna().any()

Merging the two datasets on the book name (`Name`)

In [None]:
data_merge = pd.merge(data, data2, on = 'Name', how = 'right')
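
Both inputs have a `Rating` column, so the merge renames them with pandas' default suffixes: `Rating_x` comes from the left frame (numerical) and `Rating_y` from the right frame (textual). A minimal sketch with made-up rows (the book names and values below are invented for illustration):

```python
import pandas as pd

# Toy stand-ins for `data` (numerical ratings) and `data2` (textual ratings)
books = pd.DataFrame({
    'Name': ['Book A', 'Book B'],
    'Rating': [4.2, 3.9],
    'PublishYear': [2001, 2005],
})
reviews = pd.DataFrame({
    'Name': ['Book A', 'Book C'],
    'Rating': ['really liked it', 'it was ok'],
})

# how='right' keeps every row of `reviews`; books without a match get NaN
merged = pd.merge(books, reviews, on='Name', how='right')
print(merged)
```

Here 'Book C' has a review but no entry in `books`, so its `Rating_x` and `PublishYear` are null. That is where the null values in the merged dataset come from.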

In [None]:
data_merge.head()

Below we can see that there are null values present in the dataset, which we will get rid of.

In [None]:
data_merge.isna().any()

In [None]:
data_merge.dropna(inplace = True)

In [None]:
data_merge.isna().any()

In [None]:
data_merge.head()

Getting rid of duplicates

In [None]:
data_merge.drop_duplicates(subset= ['Name'],inplace = True)
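
By default `drop_duplicates` keeps the first row for each `Name`, so when a book has several reviews, only the first textual rating survives. A small sketch with invented rows shows this:

```python
import pandas as pd

# Invented sample: 'Book A' was reviewed twice
sample = pd.DataFrame({
    'Name': ['Book A', 'Book A', 'Book B'],
    'Rating_y': ['liked it', 'it was amazing', 'it was ok'],
})

# keep='first' is the default: the later 'it was amazing' row is dropped
deduped = sample.drop_duplicates(subset=['Name'])
print(deduped)
```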

In [None]:
data_merge.head()

In [None]:
data_merge.duplicated().any()

In [None]:
data_merge.PublishYear.max()

In [None]:
data_merge.PublishYear.min()

Now I'm going to limit the data to books published from 1990 to 2020

In [None]:
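# between() is inclusive on both ends; .copy() makes new_data independent of data_merge
new_data = data_merge[data_merge['PublishYear'].between(1990, 2020)].copy()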

In [None]:
new_data.head()

In [None]:
new_data.Rating_y.unique()

The cell above lists the distinct textual ratings in the dataset. Based on them, I create my own numerical score below

In [None]:
new_data['Rating_new'] = np.where(
    new_data['Rating_y'].isin(['it was amazing', 'really liked it']), 4.5,
    np.where(new_data['Rating_y'] == 'liked it', 3.8,
             np.where(new_data['Rating_y'] == 'it was ok', 3.5, 2.0)))
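
The same scoring can be sketched as a dictionary lookup, which keeps the label-to-score table in one place. The names `SCORES` and `score_reviews` are hypothetical. The values are the ones used above, and any label not in the table falls back to 2.0.

```python
import pandas as pd

# Label-to-score table taken from the scoring above
SCORES = {
    'it was amazing': 4.5,
    'really liked it': 4.5,
    'liked it': 3.8,
    'it was ok': 3.5,
}


def score_reviews(labels, default=2.0):
    """Map textual ratings to scores; unknown labels get `default`."""
    return labels.map(SCORES).fillna(default)


sample = pd.Series(['it was amazing', 'liked it', 'did not like it', 'it was ok'])
scores = score_reviews(sample)
print(scores.tolist())
```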

In [None]:
new_data.head()

In [None]:
new_data.info()

Calculating the mean by considering both the existing rating and my new rating

In [None]:
new_data['Rating_mean'] = ((new_data['Rating_x'] + new_data['Rating_new'])/2).round(2).astype(float)

In [None]:
new_data.head()

In [None]:
new_data.Rating_mean.max()

In [None]:
new_data.drop(['Rating_x', 'Rating_y', 'Rating_new'], axis = 1, inplace = True)

In [None]:
new_data = new_data.sort_values(by = ['PublishYear', 'Rating_mean'], ascending = [False, False])

In [None]:
new_data.head()

The dataset is now sorted in descending order of PublishYear and Rating_mean, so the first row of each year is that year's highest-rated book. The loop below picks that row for every year and builds a dataframe from the results.

In [None]:
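best_rows = []
for year in range(1990, 2021):
    books_in_year = new_data[new_data.PublishYear == year]
    # Skip years with no books; .iloc[0] would raise IndexError on an empty frame
    if books_in_year.empty:
        continue
    # Rows are sorted by Rating_mean descending, so the first one is the best
    best_rows.append(books_in_year.iloc[0])

# Build the dataframe once, after the loop
new_data2 = pd.DataFrame(best_rows, columns=['Name', 'PublishYear', 'Authors', 'Rating_mean'])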

In [None]:
new_data2 = new_data2.reset_index(drop=True)
new_data2.PublishYear = new_data2.PublishYear.astype(int)
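
As an alternative to sorting and looping, the best book per year can be sketched with `groupby` and `idxmax`. This is not the approach used above. The helper name `best_per_year` is hypothetical, and the sketch assumes the dataframe has a unique index and no null `Rating_mean` values. When two books tie, `idxmax` picks the first occurrence, which may differ from the sorted-loop result.

```python
import pandas as pd


def best_per_year(df):
    """Return the highest-rated row for each PublishYear, ordered by year."""
    # idxmax gives the index label of the top Rating_mean within each year
    best_idx = df.groupby('PublishYear')['Rating_mean'].idxmax()
    return df.loc[best_idx].reset_index(drop=True)


# Invented sample rows for illustration
sample = pd.DataFrame({
    'Name': ['Book A', 'Book B', 'Book C'],
    'PublishYear': [2000, 2000, 2001],
    'Authors': ['Author X', 'Author Y', 'Author Z'],
    'Rating_mean': [4.1, 4.3, 3.9],
})
best = best_per_year(sample)
print(best)
```

Years with no books are simply absent from the result, so no special handling for empty years is needed.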

**Below are the best books of every year as per my analysis**

In [None]:
new_data2