# Preprocessing Testing Data for Regression
Here we will clean up the testing data so that we can evaluate our regression models.

The features we require for the regression are (1) Country Frequency, (2) ISBN, and (3) Book Popularity.

To calculate these features, the ISBNs, ratings, and countries in the testing data must be clean.

In [43]:
# let's first load the relevant libraries and files
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import normalized_mutual_info_score
from textblob import Word
import json
import re
from nltk.corpus import stopwords
from math import log

nbooks_df = pd.read_csv("BX-NewBooks.csv")
nratings_df = pd.read_csv("BX-NewBooksRatings.csv")
nusers_df = pd.read_csv("BX-NewBooksUsers.csv")
world_countries_s = (pd.read_csv("World-Countries.csv")).iloc[:, 0]

## Cleaning

### Clean ISBNs
Here, we'll remove any records that have an invalid ISBN.

In [44]:
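# Check that the ISBNs are valid
def validateISBN(isbn: str):
    '''
    returns True if the isbn is a valid ISBN-10 and False otherwise
    '''
    if len(isbn) != 10:
        return False
    total = 0  # weighted checksum (named to avoid shadowing the built-in sum)
    for i in range(9):
        # the first nine characters must all be digits
        if not isbn[i].isdigit():
            return False
        total += int(isbn[i]) * (10 - i)
    # the check character is either a digit or 'X' (which stands for 10)
    if isbn[9] == 'X':
        total += 10
    elif isbn[9].isdigit():
        total += int(isbn[9])
    else:
        # any other check character makes the ISBN invalid
        return False
    return (total % 11 == 0)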
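
As a quick sanity check of the checksum rule, here is a minimal standalone sketch. The helper name `isbn10_weighted_sum` is hypothetical and not part of the notebook, and it assumes well-formed input (ten characters, with `X` only allowed as the last one). It weights the characters from 10 down to 1; an ISBN-10 passes when the total is divisible by 11.

```python
def isbn10_weighted_sum(isbn):
    """Weighted sum used by the ISBN-10 check (weights run from 10 down to 1)."""
    total = 0
    for position, char in enumerate(isbn):
        # 'X' stands for the value 10 and may only appear as the check character
        value = 10 if char == "X" else int(char)
        total += value * (10 - position)
    return total

# two ISBNs that pass the check, and one with its last digit altered
for isbn in ["0306406152", "080442957X", "0306406153"]:
    total = isbn10_weighted_sum(isbn)
    print(isbn, total, total % 11 == 0)
```

The third string differs from the first only in its check digit, which is enough to break divisibility by 11.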


#### Cleaning ISBNs of nbooks_df

In [45]:
# Flag the records with invalid ISBNs in the new books csv
invalid_isbn_condition = (nbooks_df["ISBN"].apply(validateISBN)==False)

# Remove the records with invalid isbns 
nbooks_df = nbooks_df.drop(nbooks_df[invalid_isbn_condition].index)

#### Cleaning ISBNs of nratings_df

In [46]:
# Flag the records with invalid ISBNs in the new ratings csv
invalid_isbn_condition = (nratings_df["ISBN"].apply(validateISBN)==False)

# Remove the records with invalid isbns 
nratings_df = nratings_df.drop(nratings_df[invalid_isbn_condition].index)

### Clean ratings in nratings_df

In [47]:
# let's check if any ratings are missing
print(f"There are {sum(nratings_df['Book-Rating'].isna())} missing ratings")

There are 0 missing ratings


In [48]:
# now let's see if any ratings are outside the range [0, 10]
out_of_range_condition = ((nratings_df['Book-Rating']>10) | 
                          (nratings_df['Book-Rating']<0))
print(f"There are {len(nratings_df[out_of_range_condition])} "
      f"ratings out of the range [0, 10]")

There are 0 ratings out of the range [0, 10]


There doesn't seem to be anything obviously wrong with the ratings data, so we won't clean it any further.

### Clean country in nusers_df

In [49]:
STOP_WORDS = set(stopwords.words("english")) # a set of english stop words

Some Helper Functions:

In [50]:
def strip_stopwords(phrase):
      '''
      Removes english stop words from a given phrase
      Note: phrase must be in all lowercase
      '''
      # split phrase up into separate words so we can detect stop words
      words = phrase.split()
      # filter out the stop words 
      important_words = [word for word in words if word not in STOP_WORDS]
      #return the important words
      return " ".join(important_words)
      
def reformat_word(word):
    '''
    Reformat a word or phrase: lowercase everything, strip all punctuation
    (ampersands included), remove English stop words and strip surrounding
    white space
    '''
    # make everything lowercase
    final_word = word.lower()
    
    # Remove punctuation
    # Regex pattern that matches every character that is neither a word
    # character (letter, digit, underscore) nor white space
    punctuation_rule = r'[^\w\s]'
    # removing punctuation
    final_word = re.sub(punctuation_rule, '', final_word)
    
    # Remove stop words
    final_word = strip_stopwords(final_word)

    # Strip leading and trailing white spaces
    final_word = final_word.strip()
    
    # done reformatting!
    return final_word

def spell_correct_country(phrase):
      '''
      Tries to spell correct country words using textblob.
      Since textblob is not very good at spell checking phrases, we'll have
      to split up country names into individual words if they're longer
      than one word
      
      Will return a valid country word if it can, otherwise it will return NaN
      '''
      # reuse the country list loaded at the top of the notebook instead of
      # re-reading the csv on every call
      world_countries = set(world_countries_s)
      words = phrase.split()
      if (len(words)) == 1:
            # we only need to correct one word
            
            # here are some potential corrections
            correction_options = [x[0] for x in Word(words[0]).spellcheck()]
            # make sure the corrected words are in the format we want
            correction_options = [reformat_word(x) for x in correction_options]
            
            # see if any of the correction options are in our countries
            for option in correction_options:
                  if option in world_countries:
                        # one of the correction options is a valid country
                        # return this option
                        return option
            # we checked every option and couldn't find a valid correction :(
            return np.nan
      else:
            # we need to correct each word individually
            corrected_words = []
            for word in words:
                  # find the best spelling correction of the word
                  corrected_word = Word(word).correct()
                  # reformat to lowercases, removed punctuation
                  # and add it to the corrected words list
                  corrected_words.append(reformat_word(corrected_word))
            # now we recombine all the corrected words
            final_correction = ' '.join(corrected_words)
            # check if the final corrected word is a valid country and return
            if final_correction in world_countries:
                  # yes! the correction is a valid country
                  return final_correction
            else:
                  # after everything, we still couldn't obtain a valid country
                  return np.nan
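
To show what the helpers above do to a raw entry, here is a minimal standalone sketch. It uses only the standard library: `STOP_WORDS_DEMO` and `COUNTRIES_DEMO` are small hypothetical stand-ins for the NLTK stop word list and `World-Countries.csv`, and the candidate lists are made up rather than produced by textblob.

```python
import re

# Hypothetical stand-ins for the notebook's stop word list and country list
STOP_WORDS_DEMO = {"the", "of", "and"}
COUNTRIES_DEMO = {"netherlands", "usa", "trinidad tobago"}

def normalize(text):
    """Lowercase, drop punctuation, drop stop words, collapse white space."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    words = [word for word in text.split() if word not in STOP_WORDS_DEMO]
    return " ".join(words)

def first_valid(candidates):
    """Return the first candidate that normalizes to a known country, else None."""
    for candidate in candidates:
        normalized = normalize(candidate)
        if normalized in COUNTRIES_DEMO:
            return normalized
    # only give up after every candidate has been checked
    return None

for raw in [" U.S.A. ", "Trinidad & Tobago", "The Netherlands"]:
    print(repr(raw), "->", repr(normalize(raw)))
print(first_valid(["Nederland", "The Netherlands"]))
```

Note that the fallback value is returned only after the loop finishes, so a later candidate can still match when the first one does not.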

The actual function to clean the data:

In [51]:
def clean_countries(user_data):
      '''
      cleans up the country feature in a user_data dataframe and returns the 
      cleaned dataframe
      '''
      # Load a dictionary of common country acronyms and alternative spellings, 
      # compiled after a quick observation of the data
      with open("popular_alternative_country_names.json", "r") as file:
            alt_country_names = json.load(file)

      # Add a column to the dataframe recording whether each record's country
      # entry has been cleaned
      user_data["Clean-Complete"] = False

      # here's how we'll call the records with country entries yet to clean 
      uncleaned_condition = user_data["Clean-Complete"] == False

      # MARK ANY RECORDS WE DON'T NEED TO FIX

      # Any country entries that are NaN will be considered complete. We won't
      # try to impute the country here as it only adds confusion to our analysis
      # and won't make the data easier to work with
      user_data.loc[user_data["User-Country"].isna(),"Clean-Complete"] = True
      # reupdate the uncleaned_condition
      uncleaned_condition = user_data["Clean-Complete"] == False

      # REFORMAT REMAINING RECORDS TO BE EASIER TO WORK WITH

      # We'll now reformat the country entries: lowercase everything and strip
      # all punctuation (ampersands included), stop words and surrounding
      # white space
      user_data.loc[uncleaned_condition, "User-Country"] = user_data.loc[
                  uncleaned_condition, "User-Country"].apply(str).apply(
                                                            reformat_word)

      # Check the country entries to see if they're a valid country. If yes, 
      # change the "Clean-Complete" tag to True
      valid_country_condition = user_data["User-Country"].isin(world_countries_s)
      user_data.loc[uncleaned_condition & valid_country_condition, 
                    "Clean-Complete"] = True
      # reupdate the uncleaned_condition
      uncleaned_condition = user_data["Clean-Complete"] == False

      ## CONVERTING ACRONYMS TO FULL COUNTRY NAMES

      # Find and convert as many acronym country entries into the formal country name
      # Here's how we'll index for acronymed entries
      acronym_detected_condition = user_data["User-Country"].apply(lambda x: x in 
                                                                  alt_country_names)
      # Now we convert the acronym
      user_data.loc[acronym_detected_condition, "User-Country"] = user_data.loc[
            acronym_detected_condition, "User-Country"].apply(
                  lambda x: alt_country_names[x])
      # Tag the record as "Clean-Complete"=True
      user_data.loc[acronym_detected_condition, "Clean-Complete"] = True
      # reupdate the uncleaned_condition
      uncleaned_condition = user_data["Clean-Complete"] == False

      ## SPELL CHECK TIME! (Last step!)
      # try to spell check all the remaining entries
      user_data.loc[uncleaned_condition, "User-Country"] = user_data.loc[
            uncleaned_condition, "User-Country"].apply(spell_correct_country)

      # Remove the "Clean-Complete" column, we don't need it anymore
      user_data = user_data.drop("Clean-Complete", axis=1)
      
      # Return the cleaned country data
      return user_data
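
The staged logic in `clean_countries` (skip missing entries, accept exact matches, map known alternative names, then fall back to spell correction) can be sketched for a single entry as follows. This is only an illustration: `VALID_COUNTRIES`, `ALT_NAMES` and `resolve_country` are hypothetical names, the lookup tables are tiny made-up samples, and the fallback is a placeholder rather than the textblob-based `spell_correct_country`.

```python
# Hypothetical lookup tables; the notebook loads its own from
# World-Countries.csv and popular_alternative_country_names.json
VALID_COUNTRIES = {"united states", "united kingdom", "australia"}
ALT_NAMES = {"usa": "united states", "uk": "united kingdom"}

def resolve_country(entry, fallback=lambda text: None):
    """Return (country, stage) for an entry that has already been reformatted."""
    if entry is None:
        # missing entries are left alone rather than imputed
        return None, "missing"
    if entry in VALID_COUNTRIES:
        return entry, "exact"
    if entry in ALT_NAMES:
        return ALT_NAMES[entry], "alias"
    # last resort: whatever the fallback (e.g. a spell checker) can recover
    return fallback(entry), "fallback"

for raw in [None, "australia", "uk", "austrlia"]:
    print(raw, "->", resolve_country(raw))
```

Returning the stage alongside the result is a design choice made for this sketch; it makes it easy to count how many entries each stage resolved.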

Okay, now we'll actually clean the countries.

In [52]:
nusers_df = clean_countries(nusers_df)

## Computing New Features

### Compute Book Popularity

In [53]:
def popularity(mean_rating, total_ratings, mean_review_freq):
    '''
    returns the popularity metric of a book given the mean rating of that book, 
    the total number of ratings that book has, and the mean number of ratings
    per book in the dataset
    '''
    ## we chose a base-20 logarithm because we like its slow growth
    return mean_rating * log(total_ratings+mean_review_freq)/log(20)  
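
To get a feel for how the metric behaves, here is a small standalone sketch that repeats the formula from the cell above. The input values (a mean rating of 8.0 and a dataset averaging 3 ratings per book) are made up for illustration.

```python
from math import log

def popularity(mean_rating, total_ratings, mean_review_freq):
    # same formula as above: the mean rating scaled by a base-20 logarithm
    return mean_rating * log(total_ratings + mean_review_freq) / log(20)

# hypothetical book with a mean rating of 8.0
few = popularity(8.0, 3, 3)        # 3 + 3 = 6, so the log factor is below 1
neutral = popularity(8.0, 17, 3)   # 17 + 3 = 20, so the log factor is 1
many = popularity(8.0, 300, 3)     # 100 times more ratings than the first case

print(few, neutral, many)
```

The score equals the mean rating when `total_ratings + mean_review_freq` is 20, and even with 100 times more ratings the score stays below twice the mean rating, which is the slow growth the comment refers to.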

Next, we figure out the mean_review_freq we'll use to calculate the popularity of each book.

In [54]:
# find the number of ratings per isbn
num_ratings_per_isbn = nratings_df.groupby("ISBN")["Book-Rating"].size()

# convert to a dataframe
num_ratings_per_isbn_df = num_ratings_per_isbn.reset_index()
num_ratings_per_isbn_df.columns = ["ISBN", "Num-Ratings"]

# the unique "Num-Ratings" entries are:
unique_numratings_entries = num_ratings_per_isbn_df['Num-Ratings'].unique()

print(f"There is {len(unique_numratings_entries)} unique frequency value for " 
      "the number of ratings a certain book has")
print(f"Hence, every book has {unique_numratings_entries} ratings")

There is 1 unique frequency value for the number of ratings a certain book has
Hence, every book has [3] ratings


Since each book has the same number of ratings, the dataset is probably not an accurate representation of how many people actually read and review each book. It appears to have been deliberately assembled so that there are exactly 3 ratings per book.

Because of this, the popularity score of each book may not accurately reflect how popular the book is. We will therefore not use the new_data as a testing set and will stop cleaning it here.
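
To make the concern concrete, here is a short worked calculation. It assumes that `mean_review_freq` would be computed from this same dataset (the notebook stops before computing it), in which case it also equals 3. Writing $\bar{r}$ for a book's mean rating:

$$
\text{popularity} = \bar{r} \cdot \frac{\ln(3 + 3)}{\ln 20} \approx 0.598\,\bar{r}
$$

Whatever value is used for `mean_review_freq`, the logarithmic factor is identical for every book when each book has exactly 3 ratings, so the score is only a rescaled mean rating and says nothing about how many ratings a book attracts.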