Purpose: to analyse the most and least popular comic series on mangaplanet.com.
Data is from 5/17/22.

In [2]:
# import libraries

import numpy as np
import pandas as pd

In [72]:
# Read file

manga_df = pd.read_csv('manga_planet_data.csv')




In [73]:
# Renames the unnamed column and adds 1 to each value to represent the series ranking.

manga_df.rename(columns={'Unnamed: 0': 'rank'}, inplace=True)

manga_df['rank'] = manga_df['rank'] + 1


# Cleaning Years section

The year column contains incorrect values introduced during data scraping. In some rows the values were shifted one column to the right: the publisher column is blank, the year column holds the publisher name, and the rating column holds the year. These rows had no rating value to begin with.
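To make the shift concrete, here is a minimal sketch on a made-up three-row frame. The titles, publisher names and values are invented for illustration and are not taken from the scraped file. It flags rows whose `year` is present but is not a four-digit number, then moves the values back one column in a single pass. This is an alternative to the list-based approach used in the cells below, and it assumes every genuine year is exactly four digits.

```python
import numpy as np
import pandas as pd

# Invented sample: the second and third rows are shifted one column to the right.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "publisher": ["Pub One", np.nan, np.nan],
    "year": ["2015", "Kiss", "Big Comic"],
    "rating": ["4.5", "2010", "1998-2001"],
})

# A row is treated as shifted when its year is present but not exactly four digits.
shifted = toy["year"].notna() & ~toy["year"].str.fullmatch(r"\d{4}", na=False)

# Move each value one column to the left; the real rating is unknown.
toy.loc[shifted, "publisher"] = toy.loc[shifted, "year"]
toy.loc[shifted, "year"] = toy.loc[shifted, "rating"].str[:4]
toy.loc[shifted, "rating"] = np.nan

print(toy)
```

Writing through `.loc` on the frame itself means no separate `update` step is needed in this sketch.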

In [74]:
# Finds years of length 4
year_df = manga_df[manga_df['year'].str.len() == 4].copy() #used to find list of incorrect years that are of length 4


In [75]:
# Four-character publisher names that ended up in the year column, followed by the fix for those rows
bad_year = ['Wink','SPA!','REX!','Puff','Moae','Moca','Mink','mimi','Lynx','Laza','Lala','LaLa','Kiss','Judy','Itan','Ikki','Gust','Gush','Garo','Eden',
            'Drap','CREA','Ciel','Ciao','Buka','Baby','Aria','Amie']
bad_year_df = year_df.loc[year_df['year'].isin(bad_year)].copy()
# Shift the values back: year -> publisher, first 4 characters of rating -> year
bad_year_df.loc[:,"publisher"]=bad_year_df['year']
bad_year_df.loc[:,"year"] = bad_year_df['rating'].str[:4]
# The real rating is unknown for these rows
bad_year_df['rating'] = np.nan
good_year_df = bad_year_df

In [76]:
# Replaces NaN values with the string 'NAN' because update skips NaN values and would otherwise leave the old ratings in place. 'NAN' is then converted back to np.nan in the original dataframe.

good_year_df.replace(np.nan,'NAN', inplace=True)
manga_df.update(good_year_df)
manga_df.replace('NAN',np.nan, inplace=True)
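The sentinel step above is needed because `DataFrame.update` only copies non-NaN values from the other frame. The sketch below shows the difference on a tiny invented frame; the column names mirror this notebook, but the values are made up.

```python
import numpy as np
import pandas as pd

# Invented data: row 0 has the year sitting in the rating column.
original = pd.DataFrame({"publisher": [np.nan, "Pub One"],
                         "rating": ["2010", "4.5"]})
fixed = pd.DataFrame({"publisher": ["Kiss"], "rating": [np.nan]},
                     index=[0], dtype=object)

# Plain update: the NaN rating in `fixed` is skipped, so '2010' survives.
plain = original.copy()
plain.update(fixed)

# Sentinel round trip: NaN -> 'NAN' -> update -> back to NaN.
with_sentinel = original.copy()
with_sentinel.update(fixed.fillna("NAN"))
with_sentinel = with_sentinel.replace("NAN", np.nan)

print(plain)
print(with_sentinel)
```

The round trip only works if no genuine value in the frame equals the sentinel string.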


In [77]:
# Finds incorrect year values whose length is not equal to 4 and that are not null.

year_df = manga_df[manga_df['year'].str.len() != 4].copy()
year_df = year_df.loc[~year_df['year'].isnull()]
year_df['publisher']= year_df['year']
year_df['year'] = year_df['rating'].str[:4]
year_df['rating'] = 'NAN'
year_df.replace(np.nan,'NAN', inplace=True)

In [78]:
# update original

manga_df.update(year_df)
manga_df.replace('NAN',np.nan, inplace=True)

In [79]:
# Save dataframe with fixed columns to csv
manga_df.to_csv('manga_planet_data_cleaned_v1.csv', index = False)

## Part two: Cleaning the volume column

The volume column holds the number of volumes and chapters in a series as a single string, generally of the form 'vol: 2 ch: 10'.
My goal is to separate the volumes and chapters and convert them to numerical values for use in later data analysis. One issue is that some series have only volume or only chapter information. A possible fix was to compute the average number of chapters per volume and use that ratio to fill in the missing information. I chose to leave those values as np.nan instead, because many series are digital only and have no physical volumes, while other series have more chapters than volumes because the chapters are published separately. Another issue is that some counts are given in the form 3+, where the + indicates that the series has at least 3 chapters. As many series receive new chapters frequently, I felt it was fine to remove the + sign rather than add extra chapters to account for an unspecified number of missing chapters.
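As a rough sketch of the plan described above, the snippet below parses a handful of hypothetical strings in one step with regular expressions. The sample strings are assumptions; the real file may format them differently, and the cells below use a split-based approach instead.

```python
import numpy as np
import pandas as pd

# Hypothetical raw strings; the real file may format them differently.
raw = pd.Series(["vol: 2 ch: 10", "ch: 3+", "vol: 5", "One Shot", None])

# Capture the first run of digits that follows each label, ignoring case.
# A trailing '+' is dropped because only the digits are captured.
volume = raw.str.extract(r"(?i)vol\D*(\d+)", expand=False).astype(float)
chapter = raw.str.extract(r"(?i)ch\D*(\d+)", expand=False).astype(float)

# A one shot is treated as 0 volumes and 1 chapter.
one_shot = raw.str.fullmatch("One Shot", na=False)
volume = volume.mask(one_shot, 0.0)
chapter = chapter.mask(one_shot, 1.0)

parsed = pd.DataFrame({"raw": raw, "volume": volume, "chapter": chapter})
print(parsed)
```

Entries with only one of the two counts keep np.nan for the other, matching the choice to leave missing information unfilled.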

In [3]:
# Here to reload dataframe if an error occurs.
manga_df = pd.read_csv('manga_planet_data_cleaned_v1.csv')

In [4]:
# The column is split later; the volume information comes first in the string.
manga_df.rename(columns={'latest chapter':'volume'}, inplace=True)


In [5]:
# Split volumes and chapters on the ';' delimiter and put the chapters in a new chapter column.
manga_df[['volume', 'chapter']] = manga_df['volume'].str.split(';', expand=True)


In [6]:
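# Some series did not have a volume, so that information is copied to the chapter column. The plan is to make all values numerical. A one-chapter series, known as a one shot, is given a length of 1.
# Results are assigned back instead of using inplace=True on a column, which newer pandas versions may not apply to the original frame.
# The replacements are strings ('0' and '1') so that the later .str steps still read them.
manga_df['chapter'] = manga_df['chapter'].fillna(manga_df['volume'])
manga_df['volume'] = manga_df['volume'].replace('One Shot', '0')
manga_df['chapter'] = manga_df['chapter'].replace('One Shot', '1')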

In [7]:
# Set chapter counts found in the volume column to NaN, and volume counts found in the chapter column to NaN. Setting them to 0 is also possible, but I did that in my analysis work.
manga_df.loc[manga_df['volume'].str.contains('C', na=False), "volume"] = np.nan
manga_df.loc[manga_df['chapter'].str.contains('V', na=False),'chapter'] = np.nan

In [8]:
# Remove nonnumeric values.

manga_df['volume'] = manga_df['volume'].str.extract(r'(\d+)', expand=False)
manga_df['chapter'] = manga_df['chapter'].str.extract(r'(\d+)', expand=False)
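One detail worth keeping in mind for this step: the `.str` accessor returns a missing value for any element that is not a string, so a plain integer stored in an object column is lost by `str.extract`. The small sketch below, on invented values, shows the behaviour and one way around it.

```python
import pandas as pd

# Object column mixing strings with a plain integer (invented values).
mixed = pd.Series(["Ch 12+", 1, "Vol 3"], dtype=object)

# The integer element is not a string, so extract yields a missing value for it.
digits = mixed.str.extract(r"(\d+)", expand=False)
print(digits.tolist())

# Casting to str first keeps the integer's digits.
as_text = mixed.astype(str).str.extract(r"(\d+)", expand=False)
print(as_text.tolist())
```

Note that `astype(str)` also turns missing values into text on an object column, so it fits best when the column has no NaN entries or when they are handled separately.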

In [9]:
# Convert volume and chapter to float and check the column dtypes


manga_df['volume'] = manga_df['volume'].astype(float)
manga_df['chapter'] = manga_df['chapter'].astype(float)
manga_df.dtypes

rank           float64
title           object
description     object
volume         float64
publisher       object
year           float64
rating         float64
tags            object
chapter        float64
dtype: object

In [10]:
# Replaces nan publishers with string Undefined

manga_df['publisher'] = manga_df['publisher'].fillna('Undefined')

In [12]:
# Save to fully clean csv file 
manga_df.to_csv('manga_planet_data_fully_cleaned_v1.csv', index=False)