![Image of a house made of books](bookhouse.jpg)

## Data Cleaning Notebook

### Import Libraries

In [3]:
# imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import warnings
# Suppress all warnings
warnings.filterwarnings("ignore")
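
A global `"ignore"` filter also hides deprecation notices that can matter later in the notebook. As a minimal sketch of a narrower alternative, the standard-library `warnings.catch_warnings()` context manager silences warnings only inside one block. The `noisy()` function below is hypothetical and exists only to trigger a warning.

```python
import warnings

def noisy():
    # Hypothetical function that emits a warning, for illustration only
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42

# Suppress warnings only inside this block; filters are restored on exit
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy()

# In a separate block, record warnings instead of hiding them
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy()
```

The first block returns the value silently, while the second captures the warning in `caught` so it can be inspected.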

### Read in Datasets

In [4]:
# load datasets
fantasy = pd.read_csv('../data/final-fantasy-data.csv')
horror = pd.read_csv('../data/final-horror-data.csv')

### Data Overview, Cleaning and Concatenation

#### Fantasy Data Analysis

In [5]:
# fantasy head
fantasy.head()

Unnamed: 0,created_utc,title,self_text,subreddit
0,1511335000.0,Help the fight for net neutrality!,,Fantasy
1,1519578000.0,"Author Terry Goodkind shames his own cover, ar...",,Fantasy
2,1606914000.0,Elliot Page Will Continue to Star in 'Umbrella...,,Fantasy
3,1630597000.0,The Wheel of Time - Official Teaser Trailer,,Fantasy
4,1612284000.0,"GRRM latest update: ""I wrote hundreds and hund...",,Fantasy
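
The `created_utc` values look like Unix timestamps in seconds. Assuming that interpretation holds, a sketch of converting them to readable datetimes with `pd.to_datetime` could look like this. The two values are copied from the output above, and `created_dt` is a hypothetical column name.

```python
import pandas as pd

# Two timestamps copied from the head() output above
sample = pd.DataFrame({"created_utc": [1511335000.0, 1630597000.0]})

# Assumption: the floats are seconds since the Unix epoch, in UTC
sample["created_dt"] = pd.to_datetime(sample["created_utc"], unit="s", utc=True)

# Extract the year to sanity-check the conversion
years = sample["created_dt"].dt.year.tolist()
```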


In [46]:
fantasy.sample(10)

Unnamed: 0,created_utc,title,self_text,subreddit
1161,1712370000.0,Looking for recommendations for fantasy books ...,I have noticed that I always enjoy parts of bo...,Fantasy
565,1479301000.0,Hey r/Fantasy! I'm novelist and DOCTOR STRANGE...,Hola all. C. Robert Cargill here. You might kn...,Fantasy
1513,1711749000.0,Animal Companions,What are some recs where the MC has an animal ...,Fantasy
1283,1712123000.0,"Okay, let's see if this works. Seeking gay MAL...",I'm going to drop some preface to this one bef...,Fantasy
576,1602660000.0,I completed The Wheel of Time,I am at a loss for words. I mourn now for the ...,Fantasy
860,1675597000.0,"I just want to say, there is grimdark, and the...",I'm on the last book of the Liveships and I'm ...,Fantasy
1155,1712388000.0,Vampire Chronicles Recommended Stopping Point?,Was curious if anyone had recommendations on g...,Fantasy
1596,1711585000.0,Words and concepts with an origin on Earth in ...,I was reading Dragonoak: The Complete History ...,Fantasy
1428,1711910000.0,Must watch 70s to 99s Fantasy/SciFi Movie sugg...,Hello Everybody! Me and a couple of buddies wa...,Fantasy
794,1553781000.0,"""So You Want To Write A Medieval Europe-based ...",,Fantasy


In [47]:
# drop fantasy duplicates
fantasy.drop_duplicates(inplace=True)

In [48]:
# check duplicates were dropped
fantasy.duplicated().sum()

0

In [49]:
fantasy.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1973 entries, 0 to 1974
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   created_utc  1973 non-null   float64
 1   title        1973 non-null   object 
 2   self_text    1243 non-null   object 
 3   subreddit    1973 non-null   object 
dtypes: float64(1), object(3)
memory usage: 77.1+ KB
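
The index above runs from 0 to 1974 for 1973 rows because `drop_duplicates` keeps the original labels of the surviving rows. If a contiguous index is wanted, one option (not applied in this notebook) is `reset_index(drop=True)`, sketched here on a toy frame.

```python
import pandas as pd

# Toy frame with one exact duplicate row
toy = pd.DataFrame({"title": ["a", "b", "b", "c"]})

# Dropping duplicates leaves a gap in the index labels
deduped = toy.drop_duplicates()

# reset_index(drop=True) renumbers the rows and discards the old labels
renumbered = deduped.reset_index(drop=True)
```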


In [50]:
# how many nan values in fantasy
fantasy.isnull().sum()

created_utc      0
title            0
self_text      730
subreddit        0
dtype: int64

In [51]:
# replace all nan values in self_text with "no_text"
fantasy['self_text'] = fantasy['self_text'].fillna("no_text")

In [52]:
fantasy.isna().sum()

created_utc    0
title          0
self_text      0
subreddit      0
dtype: int64

#### Horror Data Analysis

In [6]:
# horror head
horror.head()

Unnamed: 0,created_utc,title,self_text,subreddit
0,1615921000.0,"Hello fellow Horror readers, I've compiled a s...",A lot of people get their first introduction i...,horrorlit
1,1614796000.0,"Who else is using the “too disturbing, never a...","Show of hands? Both please, I’d like you to ke...",horrorlit
2,1588627000.0,The Horror Section is coming back to Barnes & ...,Some of you may remember my recent post regard...,horrorlit
3,1614701000.0,"What book is so disturbing, you would never re...",Saw a variation of this post on r/AskReddit an...,horrorlit
4,1598800000.0,Can I just say... This is the most welcoming l...,"I've spent this Summer (successfully, I'm plea...",horrorlit


In [54]:
horror.sample(10)

Unnamed: 0,created_utc,title,self_text,subreddit
18,1711661000.0,Male horror authors and sexually assaulting fe...,Recently I have reignited my passion for readi...,horrorlit
393,1676044000.0,Books that hide their supernatural premises un...,"Without being too spoilery, of course, what ar...",horrorlit
188,1602164000.0,Just read Clown in a Cornfield and it's exactl...,There was a clown in a cornfield trying to kil...,horrorlit
431,1610917000.0,"Just finished ""The Fisherman"" by John Langan",I absolutely devoured this novel in one weeken...,horrorlit
1876,1709820000.0,I've been translating the 800s AD Chinese stor...,"I'm a big fan of Kwaidan, and as a professiona...",horrorlit
215,1622250000.0,Am I alone in preferring short stories over no...,I love reading novels and I've read a lot of t...,horrorlit
65,1629237000.0,Sensor by Junji Ito Is Out Today!,I know I'm not the only Junji Ito fan around h...,horrorlit
182,1699648000.0,What's the creepiest cold open in a book you'v...,I'm reading through Echo by Thomas Olde Heuvel...,horrorlit
1501,1711113000.0,Exploring the Nature of Fear,"I really love horror books and movies, and I w...",horrorlit
1510,1711076000.0,Anyone else read Cuckoo by Gretchen Felker-Mar...,I got my hands on an ARC copy and I'm dying to...,horrorlit


In [7]:
horror.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1978 entries, 0 to 1977
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   created_utc  1978 non-null   float64
 1   title        1978 non-null   object 
 2   self_text    1868 non-null   object 
 3   subreddit    1978 non-null   object 
dtypes: float64(1), object(3)
memory usage: 61.9+ KB


In [8]:
# are there duplicate values
horror.duplicated().sum()

23

In [9]:
# drop all duplicates
horror.drop_duplicates(inplace=True)

In [10]:
# double check duplicates were dropped
horror.duplicated().sum()

0

In [11]:
horror.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1955 entries, 0 to 1977
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   created_utc  1955 non-null   float64
 1   title        1955 non-null   object 
 2   self_text    1845 non-null   object 
 3   subreddit    1955 non-null   object 
dtypes: float64(1), object(3)
memory usage: 76.4+ KB


In [58]:
# how many nan values in horror
horror.isnull().sum()

created_utc      0
title            0
self_text      110
subreddit        0
dtype: int64

In [59]:
# replace all nan values in self_text with "no_text"
horror['self_text'] = horror['self_text'].fillna("no_text")

In [60]:
# double check all nans are replaced
horror.isna().sum()

created_utc    0
title          0
self_text      0
subreddit      0
dtype: int64
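
Both subreddits go through the same two steps: drop exact duplicates, then fill missing `self_text`. As a sketch only, those steps could be wrapped in one function. `clean_posts` and its `placeholder` parameter are hypothetical names, not part of the notebook.

```python
import pandas as pd

def clean_posts(df: pd.DataFrame, placeholder: str = "no_text") -> pd.DataFrame:
    """Hypothetical helper: drop exact duplicate rows and fill missing self_text."""
    # Work on a copy so the caller's frame is left untouched
    cleaned = df.drop_duplicates().copy()
    cleaned["self_text"] = cleaned["self_text"].fillna(placeholder)
    return cleaned

# Toy data: one duplicated row and one post without body text
toy = pd.DataFrame({
    "title": ["t1", "t1", "t2"],
    "self_text": ["body", "body", None],
    "subreddit": ["horrorlit"] * 3,
})
cleaned = clean_posts(toy)
```

Returning a new frame instead of modifying in place keeps the raw data available for comparison.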

#### File Concatenation

In [61]:
# combine both files into 1
reddits = pd.concat([fantasy, horror], axis=0)

In [62]:
# inspect the new merged file
reddits.sample(10)

Unnamed: 0,created_utc,title,self_text,subreddit
1958,1710894000.0,I'm loving Shogun on Disney. Any fantasy equiv...,Specifically thinking of series where the hero...,Fantasy
636,1688314000.0,Turns off the Carrie audiobook with two and a ...,"Not really, but it would be nice",horrorlit
1403,1711939000.0,2023 Bingo Card - My first fully filled card!,"It's my third year participating, and I'm happ...",Fantasy
1755,1710257000.0,Cannibal books,"Hi all, I need some cannibal book recommend. p...",horrorlit
684,1670264000.0,What is the dumbest/most ridiculous horror nov...,I saw a thread about the dumbest haunted house...,horrorlit
1960,1710889000.0,Animated Adaptions,I have been watching a lot of Animated shows r...,Fantasy
1450,1711323000.0,Books where the characters are stuck in one place,Any horror novel that has the characters trapp...,horrorlit
1081,1712558000.0,Japanese Fantasy Recommendations!!,I have been really meaning to read more fantas...,Fantasy
788,1523208000.0,Raven Altar by Sandara,no_text,Fantasy
1170,1712159000.0,The 'something is crawling under my skin' trope,"Please recommend me books like The troop, or T...",horrorlit


In [63]:
reddits.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3928 entries, 0 to 1977
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   created_utc  3928 non-null   float64
 1   title        3928 non-null   object 
 2   self_text    3928 non-null   object 
 3   subreddit    3928 non-null   object 
dtypes: float64(1), object(3)
memory usage: 153.4+ KB


In [64]:
reddits.shape

(3928, 4)
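
The combined frame has 3928 rows but an index that only runs from 0 to 1977, so index labels repeat across the two sources. A small sketch on toy frames shows the effect, along with `ignore_index=True` as one way to get unique labels (an option, not what the notebook does).

```python
import pandas as pd

a = pd.DataFrame({"title": ["f1", "f2"]})
b = pd.DataFrame({"title": ["h1", "h2"]})

# Default concat keeps each frame's labels, so 0 and 1 appear twice
stacked = pd.concat([a, b], axis=0)
has_repeats = stacked.index.duplicated().any()

# ignore_index=True assigns a fresh 0..n-1 index
fresh = pd.concat([a, b], axis=0, ignore_index=True)
```

With repeated labels, a lookup such as `stacked.loc[0]` returns two rows instead of one.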

In [65]:
reddits.isna().sum()

created_utc    0
title          0
self_text      0
subreddit      0
dtype: int64

In [66]:
# dictionary to map subreddit column
# Fantasy = 0, horrorlit = 1
subreddit_mapping = {'Fantasy': 0, 'horrorlit': 1}
reddits['subreddit'] = reddits['subreddit'].map(subreddit_mapping)

In [67]:
# verify mapping
reddits['subreddit'].value_counts()

subreddit
0    1973
1    1955
Name: count, dtype: int64
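
`Series.map` with a dictionary turns any value missing from the dictionary into `NaN`, which `value_counts()` leaves out by default. A sketch of a stricter check lists the unmapped labels explicitly. The lowercase `"fantasy"` entry is made up to trigger the case.

```python
import pandas as pd

subreddit_mapping = {"Fantasy": 0, "horrorlit": 1}

# The last label differs in case and is not a key in the mapping
labels = pd.Series(["Fantasy", "horrorlit", "fantasy"])
mapped = labels.map(subreddit_mapping)

# Collect the original labels that failed to map
unmapped = labels[mapped.isna()].unique().tolist()
```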

**Create a new CSV of the combined and cleaned data**

In [68]:
reddits.to_csv('/content/drive/MyDrive/DSML/GA_DS/Projects/project_3/data/subreddits.csv')

In [69]:
sub = pd.read_csv('/content/drive/MyDrive/DSML/GA_DS/Projects/project_3/data/subreddits.csv')

In [70]:
sub.head()

Unnamed: 0.1,Unnamed: 0,created_utc,title,self_text,subreddit
0,0,1511335000.0,Help the fight for net neutrality!,no_text,0
1,1,1519578000.0,"Author Terry Goodkind shames his own cover, ar...",no_text,0
2,2,1606914000.0,Elliot Page Will Continue to Star in 'Umbrella...,no_text,0
3,3,1630597000.0,The Wheel of Time - Official Teaser Trailer,no_text,0
4,4,1612284000.0,"GRRM latest update: ""I wrote hundreds and hund...",no_text,0
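
The `Unnamed: 0` column in `sub` is consistent with `to_csv` writing the index to the file by default, so `read_csv` reads it back as an unnamed column. A sketch of the round trip, using an in-memory buffer instead of the Drive path, shows how `index=False` avoids this.

```python
import io
import pandas as pd

toy = pd.DataFrame({"title": ["a", "b"], "subreddit": [0, 1]})

# Default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
```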
