# Data Preparation
### In this notebook we will prepare our data for our search function to use.
### Currently we have data stored in two different CSV files.
<ul>
<li>AnimeList.csv</li>
<li>Anime_data.csv</li>
</ul>
### Producing analysis results from multiple data sources for every incoming request can be computationally expensive, so we will prepare the data once and save it in an easily searchable structure.

In [68]:
# Importing the needed modules...
import pandas as pd
from collections import defaultdict
from os import getcwd

## Define Paths to data files

In [69]:
PATH_AnimeList   = f"{getcwd()}/Data_Sets/AnimeList.csv"
PATH_Anime_data  = f"{getcwd()}/Data_Sets/Anime_data.csv"

## Data Engineering
<ul>
    <li>Getting the data into dataframes.</li>
    <li>Converting the data to a single dictionary.</li>
</ul>

In [70]:
"""
    Read data from AnimeList.csv
"""
df_animelist            = pd.read_csv(PATH_AnimeList)
animelist_table_columns = df_animelist.columns.tolist()
print(f"COLUMNS : {animelist_table_columns}")

COLUMNS : ['anime_id', 'title', 'title_english', 'title_japanese', 'title_synonyms', 'image_url', 'type', 'source', 'episodes', 'status', 'airing', 'aired_string', 'aired', 'duration', 'rating', 'score', 'scored_by', 'rank', 'popularity', 'members', 'favorites', 'background', 'premiered', 'broadcast', 'related', 'producer', 'licensor', 'studio', 'genre', 'opening_theme', 'ending_theme']


In [71]:

"""
    Read data from Anime_data.csv
"""
df_anime_data        = pd.read_csv(PATH_Anime_data)
anime_data_table_columns = df_anime_data.columns.tolist()
print(f"COLUMNS : {anime_data_table_columns}")

COLUMNS : ['anime_id', 'name', 'genre', 'type', 'episodes', 'rating', 'members']


<ul><li>anime_id is a column common to both tables, so we will use it as the primary search key as well as the sort key.</li>
<li>A user will typically search for an anime by its title, so we will create a global secondary index that maps each title to its anime_id and lets us search the datastore by title.
The index takes some extra space, but this is small compared to the size of the original data,
and it makes title searches faster and more efficient.</li></ul>
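
The idea can be sketched with two plain dictionaries. This is a minimal illustration on made-up records, not the notebook's actual data; `primary_store`, `title_index`, and `find_by_title` are hypothetical names used only here.

```python
# Toy records standing in for the real data (values are illustrative only).
primary_store = {
    1: {"genre": "Action, Sci-Fi"},
    5: {"genre": "Drama, Mystery"},
}

# Secondary index: maps each title to the anime_id used as the primary key.
title_index = {"Title A": 1, "Title B": 5}


def find_by_title(title):
    """Return the record stored for `title`, or None if the title is unknown."""
    anime_id = title_index.get(title)
    if anime_id is None:
        return None
    return primary_store.get(anime_id)


result = find_by_title("Title B")
missing = find_by_title("Unknown title")
```

A title search then costs two dictionary lookups, regardless of how many records the store holds.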

In [72]:
print(f"It is {df_animelist['anime_id'].is_unique}  that the column 'anime_id' has unique values for all entries in Animelist dataframe.")
print(f"It is {df_anime_data['anime_id'].is_unique}  that the column 'anime_id' has unique values for all entries in Anime_data dataframe.")

# Sort AnimeList dataframe on the basis of anime_id as anime_id is unique for all entries...
df_animelist_sorted = df_animelist.sort_values(by=['anime_id'])

# Sort Anime_data dataframe on the basis of anime_id as anime_id is unique for all entries...
df_anime_data_sorted  = df_anime_data.sort_values(by=['anime_id'])

It is True  that the column 'anime_id' has unique values for all entries in Animelist dataframe.
It is True  that the column 'anime_id' has unique values for all entries in Anime_data dataframe.


In [73]:
# from animelist dataframe...
animelist_Ids    = df_animelist_sorted["anime_id"].tolist()
animelist_Titles = df_animelist_sorted["title"].tolist()
animelist_Genres = df_animelist_sorted["genre"].tolist()

In [74]:
animelistDict         = {}
global_secondaryIndex = {}
for idx, animelist_Id  in enumerate(animelist_Ids):
    animelistDict[animelist_Id] = {
        "genre" : animelist_Genres[idx],
    }
    
    global_secondaryIndex[animelist_Titles[idx]] = animelist_Id 

In [75]:

# Delete variables that are no longer needed but still hold a large amount of data.
del animelist_Ids 
del animelist_Titles
del animelist_Genres

In [76]:
# Sanity check: an empty list means anime_id 1 matches no more than one row.
df_animelist['episodes'][df_animelist['anime_id']==1].tolist()[1:]

[]

In [77]:
# Finally, add the remaining fields to animelistDict.
# Add the member rating from Anime_data.csv.
for idx in animelistDict:
    matches = df_anime_data.loc[df_anime_data['anime_id'] == idx, 'rating'].tolist()
    # Fall back to an empty string when the anime has no entry in Anime_data.csv.
    animelistDict[idx]["member_rating"] = matches[0] if matches else ''
# Add the number of episodes from AnimeList.csv.
# Every idx was taken from this file, so a matching row always exists.
for idx in animelistDict:
    animelistDict[idx]["episodes"] = df_animelist.loc[df_animelist['anime_id'] == idx, 'episodes'].tolist()[0]
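
The loops above filter an entire dataframe once per anime_id. One possible alternative, sketched here on made-up values, is to build an id-to-value dictionary once and then look up each id directly. The lists stand in for what `df[...].tolist()` would return for each column; `rating_by_id`, `episodes_by_id`, and `records` are hypothetical names, not part of the notebook.

```python
# Column values as plain lists (illustrative stand-ins for df[...].tolist()).
rating_ids = [1, 5]
ratings = [8.82, 8.4]
detail_ids = [1, 5, 6]
episodes = [26, 1, 26]

# Build each lookup table once instead of filtering per id.
rating_by_id = dict(zip(rating_ids, ratings))
episodes_by_id = dict(zip(detail_ids, episodes))

# Records keyed by anime_id, as in the dictionary built earlier.
records = {1: {}, 5: {}, 6: {}}
for anime_id, record in records.items():
    # Use an empty string when no rating is available for this id.
    record["member_rating"] = rating_by_id.get(anime_id, "")
    record["episodes"] = episodes_by_id[anime_id]
```

With this layout the enrichment step touches each row once, which may matter when the CSV files are large.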

In [78]:
animelistDict

{1: {'genre': 'Action, Adventure, Comedy, Drama, Sci-Fi, Space',
  'member_rating': 8.82,
  'episodes': 26},
 5: {'genre': 'Action, Space, Drama, Mystery, Sci-Fi',
  'member_rating': 8.4,
  'episodes': 1},
 6: {'genre': 'Action, Sci-Fi, Adventure, Comedy, Drama, Shounen',
  'member_rating': 8.32,
  'episodes': 26},
 7: {'genre': 'Action, Magic, Police, Supernatural, Drama, Mystery',
  'member_rating': 7.36,
  'episodes': 26},
 8: {'genre': 'Adventure, Fantasy, Shounen, Supernatural',
  'member_rating': 7.06,
  'episodes': 52},
 15: {'genre': 'Action, Sports, Comedy, Shounen',
  'member_rating': 8.08,
  'episodes': 145},
 16: {'genre': 'Comedy, Drama, Josei, Romance, Slice of Life',
  'member_rating': 8.18,
  'episodes': 24},
 17: {'genre': 'Slice of Life, Comedy, Sports, Shounen',
  'member_rating': 7.74,
  'episodes': 52},
 18: {'genre': 'Action, Cars, Sports, Drama, Seinen',
  'member_rating': 8.24,
  'episodes': 24},
 19: {'genre': 'Drama, Horror, Mystery, Police, Psychological, Sei

In [79]:
import json
print("[INFO] Writing anime Data into the disk...")
with open('Data_Sets/dataFinal.json', 'w') as fp:
    json.dump(animelistDict, fp, sort_keys=True, indent=4)
print("[INFO] Writing Global Secondary Index Data into the disk...")
with open('Data_Sets/dataFinal_GIS.json', 'w') as fp:
    json.dump(global_secondaryIndex, fp, sort_keys=True, indent=4)

[INFO] Writing anime Data into the disk...
[INFO] Writing Global Secondary Index Data into the disk...


### Our datastore is now ready. Each search is a dictionary lookup rather than a scan across the CSV files, so it can handle a high inflow of requests.
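
A search function reading these files back could look roughly like the sketch below. It is an assumption about how the saved data might be consumed, not the project's actual search code: the records, temporary file names, and the `search` helper are made up for illustration. One detail worth noting is that JSON object keys are always strings, so an integer anime_id used as a key comes back as a string after a round trip.

```python
import json
import os
import tempfile

# Illustrative stand-ins for the two dictionaries written to disk.
data = {1: {"genre": "Action", "episodes": 26}}
index = {"Some Title": 1}

with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "data.json")
    index_path = os.path.join(tmp, "index.json")
    with open(data_path, "w") as fp:
        json.dump(data, fp)
    with open(index_path, "w") as fp:
        json.dump(index, fp)
    # Read both files back, as a search service would at start-up.
    with open(data_path) as fp:
        loaded_data = json.load(fp)
    with open(index_path) as fp:
        loaded_index = json.load(fp)


def search(title):
    """Look up a title in the index, then fetch its record by id."""
    anime_id = loaded_index.get(title)
    if anime_id is None:
        return None
    # Keys were converted to strings by JSON, so convert the id to match.
    return loaded_data.get(str(anime_id))


found = search("Some Title")
```

Converting the id with `str()` at lookup time is one option; converting the keys back to integers right after loading would work as well.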