<a href="https://colab.research.google.com/github/quicksilverri/fanfic-popularuty-prediction/blob/main/fanfic_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fanfic Popularity Prediction

Articles: 
1. https://mobius-project.eu/predicting-content-popularity-in-fanfiction-communities/
2. https://medium.com/@vkalkunte/what-makes-a-long-fanfic-predicting-word-count-of-fanfiction-from-ao3-c4e468758e56


How to deal with list values in pandas: https://towardsdatascience.com/dealing-with-list-values-in-pandas-dataframes-a177e534f173
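Columns such as `fandoms`, `characters` and `freeforms` come out of `read_csv` as strings that only look like lists. A minimal sketch of one common way to handle them, using a made-up toy frame rather than the real dataset (the `toy` frame and its values are illustrative assumptions; cells holding NaN would need to be filled before parsing):

```python
import ast

import pandas as pd

# Toy frame mimicking list columns after read_csv: every cell is a plain string
toy = pd.DataFrame({
    'title': ['A', 'B'],
    'characters': ["['Eddie Brock', 'Venom']", "['Venom']"],
})

# Parse the string representation back into real Python lists
toy['characters'] = toy['characters'].apply(ast.literal_eval)

# One row per (fanfic, character) pair, which makes counting tags easy
exploded = toy.explode('characters')
character_counts = exploded['characters'].value_counts()
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for scraped text.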


## Imports

In [1]:
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns


%matplotlib inline

In [5]:
df = pd.read_csv('https://raw.githubusercontent.com/quicksilverri/fanfic-popularuty-prediction/main/11640fanfics.csv', index_col=0)

In [7]:
df.sample()

Unnamed: 0,title,author,fandoms,written,total,words,hits,comments,bookmarks,collections,lang,kudos,date,characters,parings,freeforms,rating,category,completion,warnings
508,"Hey, Gunther",1211algeo,"['Spider-Man - All Media Types', 'Spider-Man (...",1,1.0,3864.0,701,3.0,2.0,,English,54.0,15 Sep 2021,"['Otto Octavius', 'Doctor Octopus', 'Doc Ock',...","['Otto Octavius/Reader', 'Doc Ock/Reader', 'Do...","['angsty fluff', 'ah yes another time with the...",Teen And Up Audiences,"['F/M, Gen']",Complete Work,['No Archive Warnings Apply']


## Preprocess data

In [8]:
# Counts that are missing from the scraped data are treated as zero
count_cols = ['comments', 'bookmarks', 'collections', 'kudos']
df[count_cols] = df[count_cols].fillna(0)

In [9]:
df['date'] = df['date'].apply(pd.to_datetime)

In [None]:
# ratings = {
#     'Not Rated': -1,
#     'General Audiences': 0, 
#     'Teen And Up Audiences': 1, 
#     'Mature': 2, 
#     'Explicit': 3, 
# }
# replace_rating = lambda x: ratings[x] if x not in [-1, 0, 1, 2, 3] else x
# df['rating'] = df['rating'].apply(replace_rating)

In [None]:
# completion = {'Work in Progress': 0, 'Complete Work': 1}
# replace_completion = lambda x: completion[x] if x not in [0, 1] else x
# df['completion'] = df['completion'].apply(replace_completion)

In [10]:
# The total number of chapters is unlikely to be useful, so we drop it
df = df.drop(columns='total')

In [25]:
# With 'total' dropped, there is only one chapter-count column left,
# so rename 'written' to the clearer 'chapters'

df = df.rename(columns={'written': 'chapters'})

In [26]:
# It's unlikely that this dataset contains any duplicates,
# but I'll check just in case
df.duplicated().unique()

array([False])

In [27]:
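# Keep only fanfics longer than 100 words (rows with missing word counts are dropped too);
# .copy() avoids SettingWithCopyWarning when columns are assigned later
df = df[df['words'] > 100].copy()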

In [24]:
# It appears that when there is only one chapter in a fanfic,
# the number of chapters doesn't appear on the search page
# (it is recorded as 0), so we set it to 1 manually

df['chapters'] = df['chapters'].replace(0, 1)

In [20]:
# Derived features: average chapter length and kudos-to-hits ratio
df['words_per_chapter'] = df['words'] / df['chapters']
df['kudos_per_hit'] = df['kudos'] / df['hits']

In [21]:
# Drop the top 2.5% of kudos_per_hit values as outliers
df = df[df['kudos_per_hit'] < df['kudos_per_hit'].quantile(0.975)]
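
To make the quantile filter above concrete, here is a small sketch on invented numbers (the `toy` frame is an assumption for illustration, not data from the notebook). It computes the same ratio and keeps only rows strictly below the 97.5th percentile:

```python
import pandas as pd

# Invented numbers, chosen so that one row has an unusually high ratio
toy = pd.DataFrame({
    'kudos': [5, 10, 0, 40, 90],
    'hits': [100, 100, 50, 200, 100],
})
toy['kudos_per_hit'] = toy['kudos'] / toy['hits']

# pandas interpolates linearly between the two highest values by default
cutoff = toy['kudos_per_hit'].quantile(0.975)

# Strict comparison: only the row with the extreme ratio (90 / 100) is removed
trimmed = toy[toy['kudos_per_hit'] < cutoff]
```

Because the comparison is strict, any row whose ratio is NaN is dropped by this filter as well.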

In [29]:
df.sample()

Unnamed: 0,title,author,fandoms,chapters,words,hits,comments,bookmarks,collections,lang,...,date,characters,parings,freeforms,rating,category,completion,warnings,words_per_chapter,kudos_per_hit
10100,"Let The Devil In, Chapter 10 - Eddie Brock/Ven...",ACourtofSnakesandStars,"['Eddie Brock - Fandom', 'Venom - Fandom', 'To...",1,4559.0,428,0.0,3.0,0.0,English,...,2022-04-04,"['Eddie Brock', 'Venom']","['Eddie Brock x Reader - Relationship', 'Venom...","['Violence', 'Mention of Death', 'blood knives...",Mature,['F/M'],Complete Work,"['Creator Chose Not To Use Archive Warnings', ...",4559.0,0.091121


## Building a model

I'm going to try a few different models and evaluate each of them to figure out which one works best for this task.
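
One possible way to set up such a comparison is sketched below. It assumes scikit-learn is available and uses synthetic regression data as a stand-in, so the candidate models, the features and the R² metric here are illustrative choices, not the ones this notebook necessarily ends up with:

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): NOT the fanfic dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Hypothetical candidates; the mean-predicting baseline shows what "no skill" looks like
candidates = {
    'baseline (mean)': DummyRegressor(strategy='mean'),
    'linear regression': LinearRegression(),
    'random forest': RandomForestRegressor(n_estimators=50, random_state=0),
}

# Mean R^2 over 5 cross-validation folds for each candidate
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring='r2').mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)
```

Scoring every candidate with the same folds and the same metric keeps the comparison fair, and the baseline makes it obvious when a model is not learning anything useful.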