# Removing cases of Case: 
## A script to transform Neuromancer into readable content

I have been told that *Neuromancer* by William Gibson is one of the best and most foundational science fiction books there is. I've wanted to read it for a while, but now that I've started, I find myself unable to get through it. Part of it is Gibson's depiction of female characters, but I've trudged through misogynistic novels before. No. My main problem is this: in dialogue, characters *constantly* address other characters by name when they talk to them. And by constantly, I mean almost every other sentence. Example:

>"Get your coffee, __Case__," Molly said. "You’re okay, but you’re not going anywhere 'til Armitage has his 
say." She sat cross legged on a silk futon and began to fieldstrip the fletcher without bothering to look at 
it. Twin mirrors tracking as he crossed to the table and refilled his cup. <br><p></p>
"Too young to remember the war, aren’t you, __Case__?" Armitage ran a large hand back through his 
cropped brown hair. A heavy gold bracelet flashed on his wrist. "Leningrad, Kiev, Siberia. We invented 
you in Siberia, __Case__." <br><p></p>
"What’s that supposed to mean?" <br><p></p>
"Screaming Fist, __Case__. You’ve heard the name." <br><p></p>
"Some kind of run, wasn’t it? Tried to burn this Russian nexus with virus programs. Yeah, I heard about 
it. And nobody got out." <br><p></p>
He sensed abrupt tension. Armitage walked to the window and looked out over Tokyo Bay. "That isn’t 
true. One unit made it back to Helsinki, __Case__." 

PEOPLE. DON'T. TALK. LIKE. THIS.

So, in order for me to get through this book, I am simply going to take out all the cases of Case that don't need to be there. This will be a task of importing messy data, parsing through it to take out all viable Cases, and exporting it to a readable format. 

Copy of Neuromancer in text format: https://archive.org/stream/NeuromancerWilliamGibson/Neuromancer%20-%20William%20Gibson_djvu.txt

In [1]:
# import text
# format text into a form I can work with
# define conditions for which to remove "Case"
    # when "Case" (WITH THE FIRST LETTER CAPITALIZED) occurs between "" 
    #       and isn't the only word between "" --> figure out how to define open and closed ""
    # be sure to remove all white space + comma before. Almost always occurs at end of sentence? Can check.
    # other names to consider: Ratz, Molly. 
# find cases of these and replace accordingly
# transform text into reasonable format and export
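
A rough sketch of the "between quotes, and not the only word" condition from the plan above. It assumes the text uses straight double quotes and that they are balanced; the helper names `dialogue_spans` and `is_sole_word` and the sample string are made up for illustration.

```python
import re

# One stretch of dialogue: an opening straight quote, a run of non-quote
# characters, and the closing quote (assumes quotes are balanced).
DIALOGUE = re.compile(r'"([^"]*)"')

def dialogue_spans(text):
    """Return the contents of every quoted span in text."""
    return DIALOGUE.findall(text)

def is_sole_word(name, span):
    """True when the span is just the name plus punctuation, e.g. 'Case.'"""
    return span.strip(' ,.?!') == name

# Made-up sample: one bare "Case," one removable address, one narration use.
sample = '"Case," she said. "Get your coffee, Case." He poured Case a cup.'
spans = dialogue_spans(sample)
removable = [s for s in spans if 'Case' in s and not is_sole_word('Case', s)]
print(spans)
print(removable)
```

Only the second span qualifies here: the first is the name on its own, and the third mention sits in narration rather than dialogue.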

In [199]:
# importing stuff
import urllib.request  # urlopen lives in the request submodule, which must be imported explicitly
import pandas as pd
from bs4 import BeautifulSoup
import re
# from ebooklib import epub
import spacy
import en_core_web_sm
import numpy as np

In [3]:
# import page content
target_url = "https://archive.org/stream/NeuromancerWilliamGibson/Neuromancer%20-%20William%20Gibson_djvu.txt"
txt = urllib.request.urlopen(target_url).read()

In [8]:
# make a nice soup
soup = BeautifulSoup(txt, 'html.parser')

In [32]:
# get full text (ft)
# found that text began with "pre" by scrolling through it 
ft = soup.find('pre')
strft = ft.string

Alright, now that I've figured out how to remove cases of Case (see Appendix), I need to go back a few steps. "Case" isn't the only character in Neuromancer, and I bet anything Gibson does the weird dialogue thing with other characters, too. That means I have to account for the other characters' names as well. Only, I haven't read the book yet, so I don't know who the other characters are! 

I'm going to use Named Entity Recognition to try to identify all of the important characters, so I can apply these same rules to them, too. 

In [189]:
# load model

#nlp = spacy.load("en_core_web_sm")
nlp = en_core_web_sm.load()

# apply model to text
doc = nlp(strft)

In [304]:
# create dict of NER entities and their labels

doc_dict = {}

# extract labels into a dict
for ent in doc.ents: # .ents gives the entities found by NER (Named Entity Recognition)
    doc_dict[ent.text] = ent.label_

# transform to df
doc_df = pd.DataFrame(list(doc_dict.items()),columns = ['Word','Label'])

# get series with just 'people' labels
people = doc_df.Word[doc_df.Label=='PERSON'].reset_index(drop=True)

# get the number of times that name appears in the text
count = []
for i in range(0,len(people)):
    name = people[i]
    count.append(len(re.findall(re.escape(name), strft)))  # escape so the name is matched literally
    
# create df with names and counts
people_df = pd.DataFrame()
people_df['Names'] = people
people_df['Count'] = count

# only want the most important characters
# sort by count
people_df = people_df.sort_values(by='Count', ascending=False)

# only considering characters that are mentioned more than 10 times
people_df = people_df[people_df.Count>10].reset_index(drop=True)
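
One caveat with counting this way: a plain substring count also matches short names inside longer words, so "Dix" is counted inside "Dixie" and "Bet" inside "Between". A minimal sketch of a whole-word count, using a made-up sample and a hypothetical helper `count_names` (not part of the notebook above):

```python
import re
from collections import Counter

def count_names(names, text):
    """Count whole-word, case-sensitive occurrences of each name in text."""
    counts = Counter()
    for name in names:
        # re.escape guards against regex metacharacters in a name;
        # \b on both sides keeps "Dix" from matching inside "Dixie".
        pattern = r'\b' + re.escape(name) + r'\b'
        counts[name] = len(re.findall(pattern, text))
    return counts

# Made-up sample text, not taken from the book.
sample = "Dixie laughed. Dix was gone. Between Case and Molly, Case bet more."
counts = count_names(["Dix", "Dixie", "Case", "Bet"], sample)
print(counts.most_common())
```

Whether whole-word counts are the right choice depends on the name: nicknames like "Moll" would then be counted separately from "Molly", which may or may not be what you want.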

In [305]:
# spacy isn't perfect, so going to take out some that are obviously wrong
bad = [5,8,9,18,22,23,26,33]
people_df = people_df.drop(index=bad).reset_index(drop=True)

In [308]:
# list of names to check for
names = people_df.Names
names

0         Case
1         Moll
2        Molly
3      Maelcum
4         Finn
5       Winter
6         Jane
7     Flatline
8        Corto
9        Linda
10     Tessier
11       Peter
12         Dix
13       Kuang
14       Smith
15       Aerol
16        Chin
17      Ninsei
18         Ver
19       Dixie
20       Braun
21        Zone
22       Julie
23       Bruce
24      Pierre
25         Bet
Name: Names, dtype: object

Now that I have my list, I'm going to comb through the text for every bad instance of these names.

In [373]:
comma_replace = r','
period_replace = r'.'
question_replace = r'?'
beginning_replace = r'"'  

# remove_superfulous removes all unnecessary instances of a
# character addressing another character by name in dialogue
# args: 
#     name - str of a name   
#     txt - str body of text to parse through
# returns: str of txt with those instances removed
def remove_superfulous(name, txt):
    # escape the name so it is always matched literally
    word = re.escape(name)
    comma = r', {word},'.format(word=word)
    period = r', {word}\.'.format(word=word)
    question = r', {word}\?'.format(word=word)
    # name opening a line of dialogue; the [^"] leaves a lone "Case." alone
    beginning = r'"{word}[,.][^"]'.format(word=word)

    # re.sub returns a new string, so each result has to be assigned back
    txt = re.sub(comma, comma_replace, txt)
    txt = re.sub(period, period_replace, txt)
    txt = re.sub(question, question_replace, txt)
    txt = re.sub(beginning, beginning_replace, txt)
    
    # get rid of extra backslashes
    txt = re.sub(r'\\', r'',txt)
    
    return txt

In [374]:
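# applying function to create new text "new_ft"
# start from the original text and feed each result into the next pass,
# so the removals for every name accumulate
new_ft = strft
for name in names:
    new_ft = remove_superfulous(name, new_ft)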

Here's the sample from before. So much better, right?

In [391]:
print(new_ft[57603:58451])


"Get your coffee," Molly said. "You’re okay, but you’re not going anywhere 'til Armitage has his 
say." She sat cross legged on a silk futon and began to fieldstrip the fletcher without bothering to look at 
it. Twin mirrors tracking as he crossed to the table and refilled his cup. 

"Too young to remember the war, aren’t you?" Armitage ran a large hand back through his 
cropped brown hair. A heavy gold bracelet flashed on his wrist. "Leningrad, Kiev, Siberia. We invented 
you in Siberia." 

"What’s that supposed to mean?" 

"Screaming Fist. You’ve heard the name." 

"Some kind of run, wasn’t it? Tried to burn this Russian nexus with virus programs. Yeah, I heard about 
it. And nobody got out." 

He sensed abrupt tension. Armitage walked to the window and looked out over Tokyo Bay. "That isn’t 
true. One unit made it back to Helsinki."


Now I've got to export it. My plan is to save it as a text file and then use pandoc to save it as an epub. 

In [392]:
# create a metadata text file for pandoc
metadata = '''
---
title:
- type: main
  text: Neuromancer (modified)
creator:
- role: author
  text: William Gibson
identifier:
- scheme: ISBN-10
  text: "ISBN: 0-441-56958-7"
publisher:  Ace Science Fiction edition
rights: Copyright © 1984 by William Gibson
css: base.css
...
'''

# write as UTF-8 so the © survives regardless of the platform default
with open('metadata.txt', 'w', encoding='utf-8') as outfile:
    outfile.write(metadata)

In [393]:
# create a md file of full text for pandoc
# write as UTF-8 so the curly apostrophes survive regardless of the platform default
with open('fulltextmodified', 'w', encoding='utf-8') as outfile:
    outfile.write(new_ft)
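
The pandoc step itself isn't shown in this notebook. Here is one way it might look from Python, as a sketch only: it assumes pandoc is installed and on the PATH, that the two files written above are in the working directory, and that `base.css` (named in the metadata) exists. The output name `neuromancer_modified.epub` and the helper `build_pandoc_command` are made up. The block only builds and prints the command; the commented-out line is what would actually run it.

```python
import subprocess

def build_pandoc_command(text_path, metadata_path, epub_path):
    """Build the argument list for a text-to-EPUB conversion with pandoc."""
    return [
        "pandoc",
        text_path,
        "--from=markdown",                    # the text file has no extension
        "--metadata-file=" + metadata_path,   # YAML metadata written earlier
        "-o", epub_path,                      # pandoc infers EPUB from .epub
    ]

cmd = build_pandoc_command(
    "fulltextmodified", "metadata.txt", "neuromancer_modified.epub"
)
print(" ".join(cmd))

# To actually run the conversion (requires pandoc to be installed):
# subprocess.run(cmd, check=True)
```

Passing `metadata.txt` as an additional input file instead of via `--metadata-file` should work too, since the file is a YAML block delimited by `---` and `...`.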

# APPENDIX

Case will always be surrounded by two punctuation marks if it's unnecessary:
- "Case, --> r'"'
- "Case. --> r'"'
- ~~, Case." --> r'\."'~~
- ~~, Case," --> r',"'~~
- ~~, Case?" --> r'?"'~~
- , Case, --> r','
- , Case. --> r'\.'
- , Case? --> r'\?'
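
The rules that aren't struck through can be tried out on a small made-up sample. This is a sketch rather than the exact function used in the notebook: `strip_address` is a hypothetical name, the three `, Case<punct>` rules are folded into one pattern with a capture group, and the opening rule is narrowed to a name followed by a comma or period and a space.

```python
import re

def strip_address(name, text):
    """Remove a name used as a direct address, per the rules listed above."""
    n = re.escape(name)
    # ', Case,' -> ','   ', Case.' -> '.'   ', Case?' -> '?'
    text = re.sub(r', ' + n + r'([,.?])', r'\1', text)
    # '"Case, ' or '"Case. ' opening a line of dialogue -> '"'
    # A bare "Case," has no space before the closing quote, so it is kept.
    text = re.sub(r'"' + n + r'[,.] ', '"', text)
    return text

# Made-up sample covering each rule plus one bare "Case," that should stay.
sample = (
    '"Screaming Fist, Case. You have heard the name." '
    '"Case, get your coffee." '
    '"Case," she said. '
    '"Too young, Case?"'
)
result = strip_address("Case", sample)
print(result)
```

Note that the opening rule leaves the next word in lowercase ("get your coffee"), so a pass to re-capitalize the first letter of the sentence might be worth adding.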