**Project 4 Notebook 1**


**Data Acquisition**

Using the Google Chrome browser extension "Web Scraper", I scraped stories and their metadata from Fanfiction.net. I searched for Hunger Games stories, filtering for stories rated T that feature Katniss Everdeen (the search form has four character fields, and I entered Katniss Everdeen in all four). When I opened the .csv files in Excel, some of the stories were split across several cells. I later learned that Excel has a limit of 32,767 characters per cell, so when a field contains more characters than this, the remainder spills into cells in the following rows. This is a limitation of Excel, not of the .csv format, so it should not affect loading the .csv files into a pandas DataFrame.
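
As a quick check of that claim, here is a minimal sketch that writes a single CSV field longer than Excel's cell limit and reads it back with pandas in one piece. The column names match the scraped data, but the one-row dataset and its 42,000-character text are made up for illustration.

```python
import io

import pandas as pd

EXCEL_CELL_LIMIT = 32_767  # Excel's per-cell character limit

# Hypothetical one-row dataset whose story text exceeds the Excel limit.
long_story = "Katniss" * 6_000  # 42,000 characters
original = pd.DataFrame({"story_title": ["Example"], "story_text": [long_story]})

# Round-trip through CSV text held in memory instead of a file on disk.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The long field comes back as a single value in a single row.
story_length = len(reloaded.loc[0, "story_text"])
print(story_length > EXCEL_CELL_LIMIT, reloaded.shape)
```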

**Preprocessing issues**

On Tuesday 2/23/21, I decided to redo the preprocessing but keep the capital letters. Because so many of the names in the stories are slight variations on modern American English names (e.g. Peeta/Peter, Katniss/Katherine) or don't exist in modern American English at all, I thought it was important to keep the capitalization so that the POS tagger would recognize these words as proper nouns.
On 2/24/21, I observed that keeping the capitalization meant that capitalized stop words were not being removed. I also decided not to do part-of-speech tagging, because the tagger does not recognize some names as nouns if they are not capitalized (e.g. Peeta, Katniss, Haymitch). Instead, I will lowercase the text, remove numbers and punctuation, build n-grams, remove stop words, and then proceed to vectorization and topic modeling. This happens in Notebook 2.
Later, when I couldn't get stop word removal to work on the quadgrams, I decided to tokenize by single word and then use stemming to try to reduce the number of distinct words.

In [94]:
import numpy as np
import nltk
import pandas as pd

The data was scraped in two batches and saved in .csv files. I read in the two files, created Pandas DataFrames, and then joined the two DataFrames using append.

In [61]:
data = pd.read_csv('Project-4-data/fanfiction-katniss1_pre_page_69.csv')
data.head()

Unnamed: 0,web-scraper-order,web-scraper-start-url,story_link,story_link-href,story_title,author_id,author_id-href,story_info,story_text,previous_pages,previous_pages-href
0,1613673749-43454,https://www.fanfiction.net/book/Hunger-Games/?...,Spring Break - Everlark Style,https://www.fanfiction.net/s/11858172/1/Spring...,Spring Break - Everlark Style,xerxia31,https://www.fanfiction.net/u/5705988/xerxia31,"Rated: Fiction T - English - Katniss E., Prim ...","For the Tumblr Everlark fic exchange, Spring e...",« Prev,https://www.fanfiction.net/book/Hunger-Games/?...
1,1613674516-43761,https://www.fanfiction.net/book/Hunger-Games/?...,Reading The Signs,https://www.fanfiction.net/s/11032211/1/Readin...,Reading The Signs,notalone91,https://www.fanfiction.net/u/2695835/notalone91,Rated: Fiction T - English - Angst/Drama - Kat...,A/N: Hey all. It's been ages since I posted a ...,« Prev,https://www.fanfiction.net/book/Hunger-Games/?...
2,1613674143-43610,https://www.fanfiction.net/book/Hunger-Games/?...,A Chronicle of Lies,https://www.fanfiction.net/s/11535436/1/A-Chro...,A Chronicle of Lies,whitelilly0989,https://www.fanfiction.net/u/1296268/whitelill...,Rated: Fiction T - English - Drama - Katniss E...,Author NotesI wrote this a long time ago. This...,« Prev,https://www.fanfiction.net/book/Hunger-Games/?...
3,1613675615-44227,https://www.fanfiction.net/book/Hunger-Games/?...,Watch Me Fall Apart,https://www.fanfiction.net/s/10203191/1/Watch-...,Watch Me Fall Apart,somethingtobelieve,https://www.fanfiction.net/u/5581135/something...,Rated: Fiction T - English - Angst/Romance - K...,"Written for Prompts in Panem, Day 2 - Marigold...",« Prev,https://www.fanfiction.net/book/Hunger-Games/?...
4,1613675915-44351,https://www.fanfiction.net/book/Hunger-Games/?...,100 Channels and Nothing On,https://www.fanfiction.net/s/9999075/1/100-Cha...,100 Channels and Nothing On,Izzy Samson,https://www.fanfiction.net/u/4073375/Izzy-Samson,Rated: Fiction T - English - Hurt/Comfort - Ka...,Thanks to my wonderful betas Court81981 and Ki...,« Prev,https://www.fanfiction.net/book/Hunger-Games/?...


In [62]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1725 entries, 0 to 1724
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   web-scraper-order      1725 non-null   object
 1   web-scraper-start-url  1725 non-null   object
 2   story_link             1725 non-null   object
 3   story_link-href        1725 non-null   object
 4   story_title            1725 non-null   object
 5   author_id              1725 non-null   object
 6   author_id-href         1725 non-null   object
 7   story_info             1725 non-null   object
 8   story_text             1725 non-null   object
 9   previous_pages         1700 non-null   object
 10  previous_pages-href    1700 non-null   object
dtypes: object(11)
memory usage: 148.4+ KB


In [63]:
data2 = pd.read_csv('Project-4-data/fanfiction-katniss1_p69-end_complete.csv')
data.head()  # note: this previews `data` again; `data2.head()` would preview the second batch

Unnamed: 0,web-scraper-order,web-scraper-start-url,story_link,story_link-href,story_title,author_id,author_id-href,story_info,story_text,previous_pages,previous_pages-href
0,1613673749-43454,https://www.fanfiction.net/book/Hunger-Games/?...,Spring Break - Everlark Style,https://www.fanfiction.net/s/11858172/1/Spring...,Spring Break - Everlark Style,xerxia31,https://www.fanfiction.net/u/5705988/xerxia31,"Rated: Fiction T - English - Katniss E., Prim ...","For the Tumblr Everlark fic exchange, Spring e...",« Prev,https://www.fanfiction.net/book/Hunger-Games/?...
1,1613674516-43761,https://www.fanfiction.net/book/Hunger-Games/?...,Reading The Signs,https://www.fanfiction.net/s/11032211/1/Readin...,Reading The Signs,notalone91,https://www.fanfiction.net/u/2695835/notalone91,Rated: Fiction T - English - Angst/Drama - Kat...,A/N: Hey all. It's been ages since I posted a ...,« Prev,https://www.fanfiction.net/book/Hunger-Games/?...
2,1613674143-43610,https://www.fanfiction.net/book/Hunger-Games/?...,A Chronicle of Lies,https://www.fanfiction.net/s/11535436/1/A-Chro...,A Chronicle of Lies,whitelilly0989,https://www.fanfiction.net/u/1296268/whitelill...,Rated: Fiction T - English - Drama - Katniss E...,Author NotesI wrote this a long time ago. This...,« Prev,https://www.fanfiction.net/book/Hunger-Games/?...
3,1613675615-44227,https://www.fanfiction.net/book/Hunger-Games/?...,Watch Me Fall Apart,https://www.fanfiction.net/s/10203191/1/Watch-...,Watch Me Fall Apart,somethingtobelieve,https://www.fanfiction.net/u/5581135/something...,Rated: Fiction T - English - Angst/Romance - K...,"Written for Prompts in Panem, Day 2 - Marigold...",« Prev,https://www.fanfiction.net/book/Hunger-Games/?...
4,1613675915-44351,https://www.fanfiction.net/book/Hunger-Games/?...,100 Channels and Nothing On,https://www.fanfiction.net/s/9999075/1/100-Cha...,100 Channels and Nothing On,Izzy Samson,https://www.fanfiction.net/u/4073375/Izzy-Samson,Rated: Fiction T - English - Hurt/Comfort - Ka...,Thanks to my wonderful betas Court81981 and Ki...,« Prev,https://www.fanfiction.net/book/Hunger-Games/?...


In [64]:
data2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1718 entries, 0 to 1717
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   web-scraper-order      1718 non-null   object
 1   web-scraper-start-url  1718 non-null   object
 2   story_link             1718 non-null   object
 3   story_link-href        1718 non-null   object
 4   story_title            1718 non-null   object
 5   author_id              1718 non-null   object
 6   author_id-href         1718 non-null   object
 7   story_info             1718 non-null   object
 8   story_text             1718 non-null   object
 9   next_pages             1693 non-null   object
 10  next_pages-href        1693 non-null   object
dtypes: object(11)
memory usage: 147.8+ KB


Append the dataframes to make a dataframe with the complete dataset.

In [65]:
katniss = pd.concat([data, data2])  # same result as data.append(data2), which was removed in pandas 2.0
katniss.head()

Unnamed: 0,web-scraper-order,web-scraper-start-url,story_link,story_link-href,story_title,author_id,author_id-href,story_info,story_text,previous_pages,previous_pages-href,next_pages,next_pages-href
0,1613673749-43454,https://www.fanfiction.net/book/Hunger-Games/?...,Spring Break - Everlark Style,https://www.fanfiction.net/s/11858172/1/Spring...,Spring Break - Everlark Style,xerxia31,https://www.fanfiction.net/u/5705988/xerxia31,"Rated: Fiction T - English - Katniss E., Prim ...","For the Tumblr Everlark fic exchange, Spring e...",« Prev,https://www.fanfiction.net/book/Hunger-Games/?...,,
1,1613674516-43761,https://www.fanfiction.net/book/Hunger-Games/?...,Reading The Signs,https://www.fanfiction.net/s/11032211/1/Readin...,Reading The Signs,notalone91,https://www.fanfiction.net/u/2695835/notalone91,Rated: Fiction T - English - Angst/Drama - Kat...,A/N: Hey all. It's been ages since I posted a ...,« Prev,https://www.fanfiction.net/book/Hunger-Games/?...,,
2,1613674143-43610,https://www.fanfiction.net/book/Hunger-Games/?...,A Chronicle of Lies,https://www.fanfiction.net/s/11535436/1/A-Chro...,A Chronicle of Lies,whitelilly0989,https://www.fanfiction.net/u/1296268/whitelill...,Rated: Fiction T - English - Drama - Katniss E...,Author NotesI wrote this a long time ago. This...,« Prev,https://www.fanfiction.net/book/Hunger-Games/?...,,
3,1613675615-44227,https://www.fanfiction.net/book/Hunger-Games/?...,Watch Me Fall Apart,https://www.fanfiction.net/s/10203191/1/Watch-...,Watch Me Fall Apart,somethingtobelieve,https://www.fanfiction.net/u/5581135/something...,Rated: Fiction T - English - Angst/Romance - K...,"Written for Prompts in Panem, Day 2 - Marigold...",« Prev,https://www.fanfiction.net/book/Hunger-Games/?...,,
4,1613675915-44351,https://www.fanfiction.net/book/Hunger-Games/?...,100 Channels and Nothing On,https://www.fanfiction.net/s/9999075/1/100-Cha...,100 Channels and Nothing On,Izzy Samson,https://www.fanfiction.net/u/4073375/Izzy-Samson,Rated: Fiction T - English - Hurt/Comfort - Ka...,Thanks to my wonderful betas Court81981 and Ki...,« Prev,https://www.fanfiction.net/book/Hunger-Games/?...,,


In [66]:
katniss.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3443 entries, 0 to 1717
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   web-scraper-order      3443 non-null   object
 1   web-scraper-start-url  3443 non-null   object
 2   story_link             3443 non-null   object
 3   story_link-href        3443 non-null   object
 4   story_title            3443 non-null   object
 5   author_id              3443 non-null   object
 6   author_id-href         3443 non-null   object
 7   story_info             3443 non-null   object
 8   story_text             3443 non-null   object
 9   previous_pages         1700 non-null   object
 10  previous_pages-href    1700 non-null   object
 11  next_pages             1693 non-null   object
 12  next_pages-href        1693 non-null   object
dtypes: object(13)
memory usage: 376.6+ KB
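
The `Int64Index: 3443 entries, 0 to 1717` line shows that the combined frame keeps each batch's own row labels, so labels repeat across the two batches. Below is a minimal sketch of the same join using `pd.concat` (the replacement for `DataFrame.append` in pandas 2.0 and later) with `ignore_index=True` to get unique labels. The two small frames and their values are made-up stand-ins for the scraped batches.

```python
import pandas as pd

# Made-up stand-ins for the two scraped batches; each has one column the other lacks.
batch1 = pd.DataFrame({"story_title": ["A", "B"], "previous_pages": ["prev", None]})
batch2 = pd.DataFrame({"story_title": ["C"], "next_pages": ["next"]})

# Default behaviour: each batch keeps its own row labels, so label 0 appears twice.
kept_labels = pd.concat([batch1, batch2])

# ignore_index=True renumbers the rows 0..n-1 in the combined frame.
fresh_labels = pd.concat([batch1, batch2], ignore_index=True)

print(kept_labels.index.tolist())
print(fresh_labels.index.tolist())
print(fresh_labels.columns.tolist())
```

Unique labels matter later on, because pandas aligns on the index whenever a Series is assigned back into a DataFrame column.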


Remove the pagination columns, which are not needed for the analysis.

In [67]:
##The "previous_pages" and "next_pages" columns (and their "-href" links) can be deleted.
##They are pagination links that the scraping extension added.
katniss.drop(["previous_pages", "previous_pages-href",
             "next_pages", "next_pages-href"], axis=1, inplace=True )

In [68]:
katniss.head()

Unnamed: 0,web-scraper-order,web-scraper-start-url,story_link,story_link-href,story_title,author_id,author_id-href,story_info,story_text
0,1613673749-43454,https://www.fanfiction.net/book/Hunger-Games/?...,Spring Break - Everlark Style,https://www.fanfiction.net/s/11858172/1/Spring...,Spring Break - Everlark Style,xerxia31,https://www.fanfiction.net/u/5705988/xerxia31,"Rated: Fiction T - English - Katniss E., Prim ...","For the Tumblr Everlark fic exchange, Spring e..."
1,1613674516-43761,https://www.fanfiction.net/book/Hunger-Games/?...,Reading The Signs,https://www.fanfiction.net/s/11032211/1/Readin...,Reading The Signs,notalone91,https://www.fanfiction.net/u/2695835/notalone91,Rated: Fiction T - English - Angst/Drama - Kat...,A/N: Hey all. It's been ages since I posted a ...
2,1613674143-43610,https://www.fanfiction.net/book/Hunger-Games/?...,A Chronicle of Lies,https://www.fanfiction.net/s/11535436/1/A-Chro...,A Chronicle of Lies,whitelilly0989,https://www.fanfiction.net/u/1296268/whitelill...,Rated: Fiction T - English - Drama - Katniss E...,Author NotesI wrote this a long time ago. This...
3,1613675615-44227,https://www.fanfiction.net/book/Hunger-Games/?...,Watch Me Fall Apart,https://www.fanfiction.net/s/10203191/1/Watch-...,Watch Me Fall Apart,somethingtobelieve,https://www.fanfiction.net/u/5581135/something...,Rated: Fiction T - English - Angst/Romance - K...,"Written for Prompts in Panem, Day 2 - Marigold..."
4,1613675915-44351,https://www.fanfiction.net/book/Hunger-Games/?...,100 Channels and Nothing On,https://www.fanfiction.net/s/9999075/1/100-Cha...,100 Channels and Nothing On,Izzy Samson,https://www.fanfiction.net/u/4073375/Izzy-Samson,Rated: Fiction T - English - Hurt/Comfort - Ka...,Thanks to my wonderful betas Court81981 and Ki...


In [69]:
katniss.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3443 entries, 0 to 1717
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   web-scraper-order      3443 non-null   object
 1   web-scraper-start-url  3443 non-null   object
 2   story_link             3443 non-null   object
 3   story_link-href        3443 non-null   object
 4   story_title            3443 non-null   object
 5   author_id              3443 non-null   object
 6   author_id-href         3443 non-null   object
 7   story_info             3443 non-null   object
 8   story_text             3443 non-null   object
dtypes: object(9)
memory usage: 269.0+ KB


In [71]:
# Remove tokens that contain digits and replace punctuation with a space.
# 2/22: also lowercased the text (punc_lower).
# 2/23: decided to keep capital letters (punc_remove), which is what this cell applies.
# 2/24: decided to lowercase again and not tag parts of speech, as the POS tagger will not
# recognize some names as nouns (e.g. Katniss, Peeta, Haymitch). Capitalized stop words
# were not being removed, which creates its own mess.
import re
import string

alphanumeric = lambda x: re.sub(r'\w*\d\w*', ' ', x)  # drop any token that contains a digit
punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())  # 2/22 version
punc_remove = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x)  # 2/23 version
# Map over katniss (not data) so that the second batch is cleaned too. Reading from data can
# align on the repeated row labels and copy first-batch text into the second batch.
katniss['story_text'] = katniss.story_text.map(alphanumeric).map(punc_remove)
katniss.head()

Unnamed: 0,web-scraper-order,web-scraper-start-url,story_link,story_link-href,story_title,author_id,author_id-href,story_info,story_text
0,1613673749-43454,https://www.fanfiction.net/book/Hunger-Games/?...,Spring Break - Everlark Style,https://www.fanfiction.net/s/11858172/1/Spring...,Spring Break - Everlark Style,xerxia31,https://www.fanfiction.net/u/5705988/xerxia31,"Rated: Fiction T - English - Katniss E., Prim ...",For the Tumblr Everlark fic exchange Spring e...
1,1613674516-43761,https://www.fanfiction.net/book/Hunger-Games/?...,Reading The Signs,https://www.fanfiction.net/s/11032211/1/Readin...,Reading The Signs,notalone91,https://www.fanfiction.net/u/2695835/notalone91,Rated: Fiction T - English - Angst/Drama - Kat...,A N Hey all It s been ages since I posted a ...
2,1613674143-43610,https://www.fanfiction.net/book/Hunger-Games/?...,A Chronicle of Lies,https://www.fanfiction.net/s/11535436/1/A-Chro...,A Chronicle of Lies,whitelilly0989,https://www.fanfiction.net/u/1296268/whitelill...,Rated: Fiction T - English - Drama - Katniss E...,Author NotesI wrote this a long time ago This...
3,1613675615-44227,https://www.fanfiction.net/book/Hunger-Games/?...,Watch Me Fall Apart,https://www.fanfiction.net/s/10203191/1/Watch-...,Watch Me Fall Apart,somethingtobelieve,https://www.fanfiction.net/u/5581135/something...,Rated: Fiction T - English - Angst/Romance - K...,Written for Prompts in Panem Day Marigold...
4,1613675915-44351,https://www.fanfiction.net/book/Hunger-Games/?...,100 Channels and Nothing On,https://www.fanfiction.net/s/9999075/1/100-Cha...,100 Channels and Nothing On,Izzy Samson,https://www.fanfiction.net/u/4073375/Izzy-Samson,Rated: Fiction T - English - Hurt/Comfort - Ka...,Thanks to my wonderful betas and Written ...
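
The output above shows the effect of the two substitutions: "Court81981" disappears entirely and "Day 2 - Marigold" becomes "Day Marigold", because any token containing a digit is replaced as a whole. The sketch below reproduces this on a made-up sample string, with the two lambdas written as named functions (the function names are my own). It also shows that curly quotation marks survive, since `string.punctuation` only covers ASCII punctuation.

```python
import re
import string


def drop_digit_tokens(text):
    """Replace every run of word characters that contains a digit with a space."""
    return re.sub(r"\w*\d\w*", " ", text)


def drop_ascii_punctuation(text):
    """Replace each ASCII punctuation character with a space."""
    return re.sub("[%s]" % re.escape(string.punctuation), " ", text)


# Made-up sample that mixes digits, ASCII punctuation and curly quotes.
sample = "Day 2 - Marigold, thanks to Court81981 “and” Katniss!"
cleaned = drop_ascii_punctuation(drop_digit_tokens(sample))

# Splitting on whitespace hides the extra spaces left behind by the substitutions.
tokens = cleaned.split()
print(tokens)
```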


In [72]:
katniss.to_csv('katniss-no-punc-num.csv')
##save this to a .csv file
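
The `Unnamed: 0` column in the tables above is what a saved row index looks like after the file is read back in. A minimal sketch, using a made-up two-row frame and an in-memory buffer in place of a file, of how `index=False` avoids that extra column:

```python
import io

import pandas as pd

# Made-up frame standing in for the cleaned stories.
stories = pd.DataFrame({"story_title": ["A", "B"], "story_text": ["one", "two"]})


def round_trip(frame, **to_csv_kwargs):
    """Write the frame to CSV text in memory and read it back with pandas."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)


with_index = round_trip(stories)                  # default: the row index is written too
without_index = round_trip(stories, index=False)  # the row index is left out

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

An alternative with the same effect is to keep the default `to_csv` call and pass `index_col=0` to `pd.read_csv` when reloading.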

In [95]:
import re
import string

In [96]:
#import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/amysillman/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [97]:
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
stop=stopwords.words('english')
#import texthero as hero
#set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/amysillman/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [20]:
#this does not work. It separated all of the story_text into single letters!
#Good thing I saved the last iteration as a .csv. I'll have to load it and figure out what I did wrong.
#katniss['story_text_without_stopwords'] = katniss['story_text'].apply(lambda x: [item for item in x if item not in stop])
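
The single letters appear because iterating over a Python string yields one character at a time, so the list comprehension filters characters rather than words. A small sketch of the difference, using a made-up sentence and a tiny stand-in stop list instead of the NLTK one:

```python
# Tiny stand-in for the NLTK English stop word list (an assumption for illustration).
stop_words = {"a", "the"}

text = "a cat sat"

# Iterating over the raw string walks through characters, spaces included.
from_string = [item for item in text if item not in stop_words]

# Splitting into words first makes each item a whole word.
from_tokens = [item for item in text.split() if item not in stop_words]

print(from_string)
print(from_tokens)
```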

In [75]:
katniss.head()

Unnamed: 0,web-scraper-order,web-scraper-start-url,story_link,story_link-href,story_title,author_id,author_id-href,story_info,story_text
0,1613673749-43454,https://www.fanfiction.net/book/Hunger-Games/?...,Spring Break - Everlark Style,https://www.fanfiction.net/s/11858172/1/Spring...,Spring Break - Everlark Style,xerxia31,https://www.fanfiction.net/u/5705988/xerxia31,"Rated: Fiction T - English - Katniss E., Prim ...",For the Tumblr Everlark fic exchange Spring e...
1,1613674516-43761,https://www.fanfiction.net/book/Hunger-Games/?...,Reading The Signs,https://www.fanfiction.net/s/11032211/1/Readin...,Reading The Signs,notalone91,https://www.fanfiction.net/u/2695835/notalone91,Rated: Fiction T - English - Angst/Drama - Kat...,A N Hey all It s been ages since I posted a ...
2,1613674143-43610,https://www.fanfiction.net/book/Hunger-Games/?...,A Chronicle of Lies,https://www.fanfiction.net/s/11535436/1/A-Chro...,A Chronicle of Lies,whitelilly0989,https://www.fanfiction.net/u/1296268/whitelill...,Rated: Fiction T - English - Drama - Katniss E...,Author NotesI wrote this a long time ago This...
3,1613675615-44227,https://www.fanfiction.net/book/Hunger-Games/?...,Watch Me Fall Apart,https://www.fanfiction.net/s/10203191/1/Watch-...,Watch Me Fall Apart,somethingtobelieve,https://www.fanfiction.net/u/5581135/something...,Rated: Fiction T - English - Angst/Romance - K...,Written for Prompts in Panem Day Marigold...
4,1613675915-44351,https://www.fanfiction.net/book/Hunger-Games/?...,100 Channels and Nothing On,https://www.fanfiction.net/s/9999075/1/100-Cha...,100 Channels and Nothing On,Izzy Samson,https://www.fanfiction.net/u/4073375/Izzy-Samson,Rated: Fiction T - English - Hurt/Comfort - Ka...,Thanks to my wonderful betas and Written ...


In [23]:
##katniss=pd.read_csv('Project-4-data/katniss-no-capitals.csv')

In [24]:
katniss.head()
#ok the story_text is ok. Whew! Now to figure out how to take out the stop words.
#The reason it did that is because I didn't tokenize by word first.
#I need to tokenize the text by words before taking out the stop words. It needs to see the text in units of words.

Unnamed: 0.1,Unnamed: 0,web-scraper-order,web-scraper-start-url,story_link,story_link-href,story_title,author_id,author_id-href,story_info,story_text
0,0,1613673749-43454,https://www.fanfiction.net/book/Hunger-Games/?...,Spring Break - Everlark Style,https://www.fanfiction.net/s/11858172/1/Spring...,Spring Break - Everlark Style,xerxia31,https://www.fanfiction.net/u/5705988/xerxia31,"Rated: Fiction T - English - Katniss E., Prim ...",for the tumblr everlark fic exchange spring e...
1,1,1613674516-43761,https://www.fanfiction.net/book/Hunger-Games/?...,Reading The Signs,https://www.fanfiction.net/s/11032211/1/Readin...,Reading The Signs,notalone91,https://www.fanfiction.net/u/2695835/notalone91,Rated: Fiction T - English - Angst/Drama - Kat...,a n hey all it s been ages since i posted a ...
2,2,1613674143-43610,https://www.fanfiction.net/book/Hunger-Games/?...,A Chronicle of Lies,https://www.fanfiction.net/s/11535436/1/A-Chro...,A Chronicle of Lies,whitelilly0989,https://www.fanfiction.net/u/1296268/whitelill...,Rated: Fiction T - English - Drama - Katniss E...,author notesi wrote this a long time ago this...
3,3,1613675615-44227,https://www.fanfiction.net/book/Hunger-Games/?...,Watch Me Fall Apart,https://www.fanfiction.net/s/10203191/1/Watch-...,Watch Me Fall Apart,somethingtobelieve,https://www.fanfiction.net/u/5581135/something...,Rated: Fiction T - English - Angst/Romance - K...,written for prompts in panem day marigold...
4,4,1613675915-44351,https://www.fanfiction.net/book/Hunger-Games/?...,100 Channels and Nothing On,https://www.fanfiction.net/s/9999075/1/100-Cha...,100 Channels and Nothing On,Izzy Samson,https://www.fanfiction.net/u/4073375/Izzy-Samson,Rated: Fiction T - English - Hurt/Comfort - Ka...,thanks to my wonderful betas and written ...


In [98]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/amysillman/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [77]:
#Before tokenizing by word (word_tokenize was imported earlier in the notebook),
#deal with the quotation marks that are apparently still in the story texts.
#These need to come out and be replaced by white space.
#Note: str.strip only removes quotation marks at the start and end of each text,
#not the ones inside it.

katniss['story_text'] = katniss['story_text'].str.strip(to_strip='"')

In [78]:
katniss.head()

Unnamed: 0,web-scraper-order,web-scraper-start-url,story_link,story_link-href,story_title,author_id,author_id-href,story_info,story_text
0,1613673749-43454,https://www.fanfiction.net/book/Hunger-Games/?...,Spring Break - Everlark Style,https://www.fanfiction.net/s/11858172/1/Spring...,Spring Break - Everlark Style,xerxia31,https://www.fanfiction.net/u/5705988/xerxia31,"Rated: Fiction T - English - Katniss E., Prim ...",For the Tumblr Everlark fic exchange Spring e...
1,1613674516-43761,https://www.fanfiction.net/book/Hunger-Games/?...,Reading The Signs,https://www.fanfiction.net/s/11032211/1/Readin...,Reading The Signs,notalone91,https://www.fanfiction.net/u/2695835/notalone91,Rated: Fiction T - English - Angst/Drama - Kat...,A N Hey all It s been ages since I posted a ...
2,1613674143-43610,https://www.fanfiction.net/book/Hunger-Games/?...,A Chronicle of Lies,https://www.fanfiction.net/s/11535436/1/A-Chro...,A Chronicle of Lies,whitelilly0989,https://www.fanfiction.net/u/1296268/whitelill...,Rated: Fiction T - English - Drama - Katniss E...,Author NotesI wrote this a long time ago This...
3,1613675615-44227,https://www.fanfiction.net/book/Hunger-Games/?...,Watch Me Fall Apart,https://www.fanfiction.net/s/10203191/1/Watch-...,Watch Me Fall Apart,somethingtobelieve,https://www.fanfiction.net/u/5581135/something...,Rated: Fiction T - English - Angst/Romance - K...,Written for Prompts in Panem Day Marigold...
4,1613675915-44351,https://www.fanfiction.net/book/Hunger-Games/?...,100 Channels and Nothing On,https://www.fanfiction.net/s/9999075/1/100-Cha...,100 Channels and Nothing On,Izzy Samson,https://www.fanfiction.net/u/4073375/Izzy-Samson,Rated: Fiction T - English - Hurt/Comfort - Ka...,Thanks to my wonderful betas and Written ...
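
One possible reason quotation marks remain is that `str.strip` only trims the listed characters from the two ends of each string, and the straight `"` is a different character from the curly `“` and `”`. This is an assumption about the data, not something confirmed by the output above. A sketch on made-up text that replaces all three characters everywhere in the string:

```python
import pandas as pd

# Made-up story text with a straight quote at the start and curly quotes elsewhere.
texts = pd.Series(['"She said “run” and left”', "no quotes here"])

# strip: removes straight quotes at the ends only; curly quotes are untouched.
stripped = texts.str.strip('"')

# replace with a character class: swaps every straight or curly double quote for a space.
cleaned = texts.str.replace(r'["“”]', " ", regex=True)

print(stripped[0])
print(cleaned[0])
```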


In [79]:
katniss.to_csv('katniss-no-num-punc-quote.csv')

In [80]:
#Still seems to be quotation marks at the end of some of the story texts.
#Will try to tokenize anyway. Getting an error that it expected a "string or bytes-like object"
#need to force the column to a str data type.
###katniss['story_text_wtokenized'] = word_tokenize(katniss['story_text_no_quotes'])
katniss.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3443 entries, 0 to 1717
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   web-scraper-order      3443 non-null   object
 1   web-scraper-start-url  3443 non-null   object
 2   story_link             3443 non-null   object
 3   story_link-href        3443 non-null   object
 4   story_title            3443 non-null   object
 5   author_id              3443 non-null   object
 6   author_id-href         3443 non-null   object
 7   story_info             3443 non-null   object
 8   story_text             3443 non-null   object
dtypes: object(9)
memory usage: 269.0+ KB


In [81]:
#Getting an error that it expected a "string or bytes-like object"
#need to force the column to a str data type.
katniss['story_text']=katniss['story_text'].astype(str)
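
The "expected string or bytes-like object" error is what regular-expression functions raise when they are handed a whole Series instead of one string, so the likely fix is to apply the tokenizer to each row rather than to change the dtype. The sketch below shows this with a simple regex tokenizer as a stand-in for `word_tokenize` (an assumption, so that it runs without NLTK data) on two made-up texts:

```python
import re

import pandas as pd


def simple_tokenize(text):
    """Stand-in tokenizer: return the runs of word characters in one string."""
    return re.findall(r"\w+", text)


stories = pd.Series(["Katniss ran", "Peeta baked bread"])

# Passing the whole Series fails, because the regex engine expects a single string.
try:
    simple_tokenize(stories)
    raised = False
except TypeError:
    raised = True

# Applying the function row by row hands it one string at a time.
tokens = stories.apply(simple_tokenize)

print(raised)
print(tokens.tolist())
```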

In [82]:
katniss.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3443 entries, 0 to 1717
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   web-scraper-order      3443 non-null   object
 1   web-scraper-start-url  3443 non-null   object
 2   story_link             3443 non-null   object
 3   story_link-href        3443 non-null   object
 4   story_title            3443 non-null   object
 5   author_id              3443 non-null   object
 6   author_id-href         3443 non-null   object
 7   story_info             3443 non-null   object
 8   story_text             3443 non-null   object
dtypes: object(9)
memory usage: 269.0+ KB


In [83]:
katniss.head()

Unnamed: 0,web-scraper-order,web-scraper-start-url,story_link,story_link-href,story_title,author_id,author_id-href,story_info,story_text
0,1613673749-43454,https://www.fanfiction.net/book/Hunger-Games/?...,Spring Break - Everlark Style,https://www.fanfiction.net/s/11858172/1/Spring...,Spring Break - Everlark Style,xerxia31,https://www.fanfiction.net/u/5705988/xerxia31,"Rated: Fiction T - English - Katniss E., Prim ...",For the Tumblr Everlark fic exchange Spring e...
1,1613674516-43761,https://www.fanfiction.net/book/Hunger-Games/?...,Reading The Signs,https://www.fanfiction.net/s/11032211/1/Readin...,Reading The Signs,notalone91,https://www.fanfiction.net/u/2695835/notalone91,Rated: Fiction T - English - Angst/Drama - Kat...,A N Hey all It s been ages since I posted a ...
2,1613674143-43610,https://www.fanfiction.net/book/Hunger-Games/?...,A Chronicle of Lies,https://www.fanfiction.net/s/11535436/1/A-Chro...,A Chronicle of Lies,whitelilly0989,https://www.fanfiction.net/u/1296268/whitelill...,Rated: Fiction T - English - Drama - Katniss E...,Author NotesI wrote this a long time ago This...
3,1613675615-44227,https://www.fanfiction.net/book/Hunger-Games/?...,Watch Me Fall Apart,https://www.fanfiction.net/s/10203191/1/Watch-...,Watch Me Fall Apart,somethingtobelieve,https://www.fanfiction.net/u/5581135/something...,Rated: Fiction T - English - Angst/Romance - K...,Written for Prompts in Panem Day Marigold...
4,1613675915-44351,https://www.fanfiction.net/book/Hunger-Games/?...,100 Channels and Nothing On,https://www.fanfiction.net/s/9999075/1/100-Cha...,100 Channels and Nothing On,Izzy Samson,https://www.fanfiction.net/u/4073375/Izzy-Samson,Rated: Fiction T - English - Hurt/Comfort - Ka...,Thanks to my wonderful betas and Written ...


In [84]:
#tokenize by word
katniss['story_text'] = katniss['story_text'].apply(word_tokenize)

In [85]:
katniss.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3443 entries, 0 to 1717
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   web-scraper-order      3443 non-null   object
 1   web-scraper-start-url  3443 non-null   object
 2   story_link             3443 non-null   object
 3   story_link-href        3443 non-null   object
 4   story_title            3443 non-null   object
 5   author_id              3443 non-null   object
 6   author_id-href         3443 non-null   object
 7   story_info             3443 non-null   object
 8   story_text             3443 non-null   object
dtypes: object(9)
memory usage: 269.0+ KB


In [86]:
katniss.head()

Unnamed: 0,web-scraper-order,web-scraper-start-url,story_link,story_link-href,story_title,author_id,author_id-href,story_info,story_text
0,1613673749-43454,https://www.fanfiction.net/book/Hunger-Games/?...,Spring Break - Everlark Style,https://www.fanfiction.net/s/11858172/1/Spring...,Spring Break - Everlark Style,xerxia31,https://www.fanfiction.net/u/5705988/xerxia31,"Rated: Fiction T - English - Katniss E., Prim ...","[For, the, Tumblr, Everlark, fic, exchange, Sp..."
1,1613674516-43761,https://www.fanfiction.net/book/Hunger-Games/?...,Reading The Signs,https://www.fanfiction.net/s/11032211/1/Readin...,Reading The Signs,notalone91,https://www.fanfiction.net/u/2695835/notalone91,Rated: Fiction T - English - Angst/Drama - Kat...,"[A, N, Hey, all, It, s, been, ages, since, I, ..."
2,1613674143-43610,https://www.fanfiction.net/book/Hunger-Games/?...,A Chronicle of Lies,https://www.fanfiction.net/s/11535436/1/A-Chro...,A Chronicle of Lies,whitelilly0989,https://www.fanfiction.net/u/1296268/whitelill...,Rated: Fiction T - English - Drama - Katniss E...,"[Author, NotesI, wrote, this, a, long, time, a..."
3,1613675615-44227,https://www.fanfiction.net/book/Hunger-Games/?...,Watch Me Fall Apart,https://www.fanfiction.net/s/10203191/1/Watch-...,Watch Me Fall Apart,somethingtobelieve,https://www.fanfiction.net/u/5581135/something...,Rated: Fiction T - English - Angst/Romance - K...,"[Written, for, Prompts, in, Panem, Day, Marigo..."
4,1613675915-44351,https://www.fanfiction.net/book/Hunger-Games/?...,100 Channels and Nothing On,https://www.fanfiction.net/s/9999075/1/100-Cha...,100 Channels and Nothing On,Izzy Samson,https://www.fanfiction.net/u/4073375/Izzy-Samson,Rated: Fiction T - English - Hurt/Comfort - Ka...,"[Thanks, to, my, wonderful, betas, and, Writte..."


In [87]:
katniss.to_csv('katniss-word-tokenized-wcap-new.csv')
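
One thing to keep in mind when saving tokenized text: a CSV file stores each token list as its printed text, so the column comes back as strings rather than lists. A minimal sketch with a made-up two-row frame and an in-memory buffer, using `ast.literal_eval` to turn the strings back into lists:

```python
import ast
import io

import pandas as pd

# Made-up frame with a column of token lists, like story_text after tokenizing.
tokenized = pd.DataFrame({"story_text": [["For", "the", "Tumblr"], ["A", "N", "Hey"]]})

buffer = io.StringIO()
tokenized.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After the round trip each cell is a string such as "['For', 'the', 'Tumblr']".
first_cell = reloaded.loc[0, "story_text"]

# literal_eval safely parses that text back into a Python list.
restored = reloaded["story_text"].apply(ast.literal_eval)

print(type(first_cell).__name__, restored[0])
```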

To save space, I can delete a couple of columns, 'story_text' and 'story_text_no_quotes',
using:
 
>katniss.drop(["story_text", "story_text_no_quotes"], axis=1, inplace=True )

In [52]:
#katniss.to_csv('katniss-word-tokenized_only.csv')

In [53]:
katniss.head()

Unnamed: 0.1,Unnamed: 0,web-scraper-order,web-scraper-start-url,story_link,story_link-href,story_title,author_id,author_id-href,story_info,story_text_wtokenized
0,0,1613673749-43454,https://www.fanfiction.net/book/Hunger-Games/?...,Spring Break - Everlark Style,https://www.fanfiction.net/s/11858172/1/Spring...,Spring Break - Everlark Style,xerxia31,https://www.fanfiction.net/u/5705988/xerxia31,"Rated: Fiction T - English - Katniss E., Prim ...","[for, the, tumblr, everlark, fic, exchange, sp..."
1,1,1613674516-43761,https://www.fanfiction.net/book/Hunger-Games/?...,Reading The Signs,https://www.fanfiction.net/s/11032211/1/Readin...,Reading The Signs,notalone91,https://www.fanfiction.net/u/2695835/notalone91,Rated: Fiction T - English - Angst/Drama - Kat...,"[a, n, hey, all, it, s, been, ages, since, i, ..."
2,2,1613674143-43610,https://www.fanfiction.net/book/Hunger-Games/?...,A Chronicle of Lies,https://www.fanfiction.net/s/11535436/1/A-Chro...,A Chronicle of Lies,whitelilly0989,https://www.fanfiction.net/u/1296268/whitelill...,Rated: Fiction T - English - Drama - Katniss E...,"[author, notesi, wrote, this, a, long, time, a..."
3,3,1613675615-44227,https://www.fanfiction.net/book/Hunger-Games/?...,Watch Me Fall Apart,https://www.fanfiction.net/s/10203191/1/Watch-...,Watch Me Fall Apart,somethingtobelieve,https://www.fanfiction.net/u/5581135/something...,Rated: Fiction T - English - Angst/Romance - K...,"[written, for, prompts, in, panem, day, marigo..."
4,4,1613675915-44351,https://www.fanfiction.net/book/Hunger-Games/?...,100 Channels and Nothing On,https://www.fanfiction.net/s/9999075/1/100-Cha...,100 Channels and Nothing On,Izzy Samson,https://www.fanfiction.net/u/4073375/Izzy-Samson,Rated: Fiction T - English - Hurt/Comfort - Ka...,"[thanks, to, my, wonderful, betas, and, writte..."


In [88]:
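# Now I can try to take out the stop words. The comparison is case-sensitive,
# so capitalized stop words (e.g. 'For', 'It') are not removed.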
katniss['story_text_without_stopwords'] = katniss['story_text'].apply(lambda x: [item for item in x if item not in stop])

In [89]:
katniss.head()

Unnamed: 0,web-scraper-order,web-scraper-start-url,story_link,story_link-href,story_title,author_id,author_id-href,story_info,story_text,story_text_without_stopwords
0,1613673749-43454,https://www.fanfiction.net/book/Hunger-Games/?...,Spring Break - Everlark Style,https://www.fanfiction.net/s/11858172/1/Spring...,Spring Break - Everlark Style,xerxia31,https://www.fanfiction.net/u/5705988/xerxia31,"Rated: Fiction T - English - Katniss E., Prim ...","[For, the, Tumblr, Everlark, fic, exchange, Sp...","[For, Tumblr, Everlark, fic, exchange, Spring,..."
1,1613674516-43761,https://www.fanfiction.net/book/Hunger-Games/?...,Reading The Signs,https://www.fanfiction.net/s/11032211/1/Readin...,Reading The Signs,notalone91,https://www.fanfiction.net/u/2695835/notalone91,Rated: Fiction T - English - Angst/Drama - Kat...,"[A, N, Hey, all, It, s, been, ages, since, I, ...","[A, N, Hey, It, ages, since, I, posted, story,..."
2,1613674143-43610,https://www.fanfiction.net/book/Hunger-Games/?...,A Chronicle of Lies,https://www.fanfiction.net/s/11535436/1/A-Chro...,A Chronicle of Lies,whitelilly0989,https://www.fanfiction.net/u/1296268/whitelill...,Rated: Fiction T - English - Drama - Katniss E...,"[Author, NotesI, wrote, this, a, long, time, a...","[Author, NotesI, wrote, long, time, ago, This,..."
3,1613675615-44227,https://www.fanfiction.net/book/Hunger-Games/?...,Watch Me Fall Apart,https://www.fanfiction.net/s/10203191/1/Watch-...,Watch Me Fall Apart,somethingtobelieve,https://www.fanfiction.net/u/5581135/something...,Rated: Fiction T - English - Angst/Romance - K...,"[Written, for, Prompts, in, Panem, Day, Marigo...","[Written, Prompts, Panem, Day, Marigold, Cruel..."
4,1613675915-44351,https://www.fanfiction.net/book/Hunger-Games/?...,100 Channels and Nothing On,https://www.fanfiction.net/s/9999075/1/100-Cha...,100 Channels and Nothing On,Izzy Samson,https://www.fanfiction.net/u/4073375/Izzy-Samson,Rated: Fiction T - English - Hurt/Comfort - Ka...,"[Thanks, to, my, wonderful, betas, and, Writte...","[Thanks, wonderful, betas, Written, Holiday, G..."
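The output above still contains capitalized stop words such as 'For' and 'It', because the membership test is case-sensitive. One possible workaround, sketched here under assumptions, is to compare the lowercase form of each token against the stop word set while keeping the original capitalization of the tokens that survive. The stop_words set and the remove_stop_words helper are hypothetical names for this illustration, not the notebook's own stop list.

```python
import pandas as pd

# Hypothetical, tiny stop word set for illustration only;
# the notebook's own `stop` collection is defined elsewhere.
stop_words = {"for", "the", "a", "it", "i", "to", "my", "and", "s", "all", "been"}

def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercase form is a stop word; keep the casing of the rest."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

# Two short token lists modeled on the rows shown above.
df = pd.DataFrame({"story_text": [
    ["For", "the", "Tumblr", "Everlark", "fic", "exchange"],
    ["It", "s", "been", "ages", "since", "I", "posted"],
]})

df["story_text_without_stopwords"] = df["story_text"].apply(
    lambda toks: remove_stop_words(toks, stop_words)
)
filtered = df["story_text_without_stopwords"].tolist()
```

Using a set rather than a list for the stop words also makes each lookup constant time, which matters when every token of every story is checked.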


In [90]:
# Super! It worked! Save it as a .csv file.
katniss.to_csv('katniss-wtok-no-stops-wcaps.csv')

In [91]:
# To save space, I'll delete 'story_text', the column that still has the stop words.
katniss.drop(["story_text"], axis=1, inplace=True)

In [92]:
katniss.head()

Unnamed: 0,web-scraper-order,web-scraper-start-url,story_link,story_link-href,story_title,author_id,author_id-href,story_info,story_text_without_stopwords
0,1613673749-43454,https://www.fanfiction.net/book/Hunger-Games/?...,Spring Break - Everlark Style,https://www.fanfiction.net/s/11858172/1/Spring...,Spring Break - Everlark Style,xerxia31,https://www.fanfiction.net/u/5705988/xerxia31,"Rated: Fiction T - English - Katniss E., Prim ...","[For, Tumblr, Everlark, fic, exchange, Spring,..."
1,1613674516-43761,https://www.fanfiction.net/book/Hunger-Games/?...,Reading The Signs,https://www.fanfiction.net/s/11032211/1/Readin...,Reading The Signs,notalone91,https://www.fanfiction.net/u/2695835/notalone91,Rated: Fiction T - English - Angst/Drama - Kat...,"[A, N, Hey, It, ages, since, I, posted, story,..."
2,1613674143-43610,https://www.fanfiction.net/book/Hunger-Games/?...,A Chronicle of Lies,https://www.fanfiction.net/s/11535436/1/A-Chro...,A Chronicle of Lies,whitelilly0989,https://www.fanfiction.net/u/1296268/whitelill...,Rated: Fiction T - English - Drama - Katniss E...,"[Author, NotesI, wrote, long, time, ago, This,..."
3,1613675615-44227,https://www.fanfiction.net/book/Hunger-Games/?...,Watch Me Fall Apart,https://www.fanfiction.net/s/10203191/1/Watch-...,Watch Me Fall Apart,somethingtobelieve,https://www.fanfiction.net/u/5581135/something...,Rated: Fiction T - English - Angst/Romance - K...,"[Written, Prompts, Panem, Day, Marigold, Cruel..."
4,1613675915-44351,https://www.fanfiction.net/book/Hunger-Games/?...,100 Channels and Nothing On,https://www.fanfiction.net/s/9999075/1/100-Cha...,100 Channels and Nothing On,Izzy Samson,https://www.fanfiction.net/u/4073375/Izzy-Samson,Rated: Fiction T - English - Hurt/Comfort - Ka...,"[Thanks, wonderful, betas, Written, Holiday, G..."


In [93]:
katniss.to_csv('katniss-nostops-wcaps-only.csv')
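Two things are worth keeping in mind when these files are read back in. First, to_csv writes the row index by default, which is where the 'Unnamed: 0' columns in the earlier output come from. Second, a column of token lists is stored in the .csv as text, so it comes back as strings unless it is parsed. The sketch below shows one way to handle both, using an in-memory buffer and made-up rows instead of the real files.

```python
import ast
import io

import pandas as pd

# Toy stand-in for the katniss DataFrame; the values are made up.
df = pd.DataFrame({
    "story_title": ["Story A", "Story B"],
    "story_text_without_stopwords": [["Tumblr", "Everlark"], ["ages", "since", "posted"]],
})

# index=False keeps the row index out of the file, so no 'Unnamed: 0'
# column appears when the file is read back.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Each token list was written as its string form, e.g. "['Tumblr', 'Everlark']".
# ast.literal_eval turns that string back into a Python list.
restored = pd.read_csv(
    buffer,
    converters={"story_text_without_stopwords": ast.literal_eval},
)
restored_columns = list(restored.columns)
first_tokens = restored.loc[0, "story_text_without_stopwords"]
```

With a real file, the buffer would be replaced by the file name in both the to_csv and read_csv calls.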