Refactor and Cleanup #28
base: master
Conversation
Sourcery refactored master branch
I see that the majority of the changes here are mainly code cleanup. I feel that a few small changes could speed up the operations even more.
Before:

def remove_duplicates_from_file(infile_path, outfile_path="temp.000000000.bopscrk"):
    lines_seen = set()  # holds lines already seen
    outfile = open(outfile_path, "w")
    infile = open(infile_path, "r")
    for line in infile:
        if line not in lines_seen:  # not a duplicate
            outfile.write(line)
            lines_seen.add(line)
    outfile.close()

After:

def remove_duplicates_from_file(infile_path, outfile_path="temp.000000000.bopscrk"):
    lines_seen = set()  # holds lines already seen
    with open(outfile_path, "w") as outfile:
        infile = open(infile_path, "r")
        for line in infile:
            if line not in lines_seen:  # not a duplicate
                outfile.write(line)
                lines_seen.add(line)
I would recommend sorting the data read here from infile_path before executing the loop; this removes the need for a set(), reducing the memory footprint of this function. The duplicate check then becomes a simple equality comparison between two strings, touching only those memory locations, whereas a set has to probe its contents to verify that the value does not already exist (many loops, even if they are C-level loops!). I use sorted here, but if the file is already sorted that can be removed.
def remove_duplicates_from_file(infile_path, outfile_path="temp.000000000.bopscrk"):
    last_line = ''  # holds the previous line seen
    with open(outfile_path, "w") as outfile, open(infile_path, "r") as infile:
        for line in sorted(infile):
            if line != last_line:  # not a duplicate
                outfile.write(line)
                last_line = line
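A hypothetical call, just to show the intended usage (the file names here are made up):

remove_duplicates_from_file("wordlist.txt", "wordlist.unique.txt")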
Before:

def remove_by_lengths(wordlist, min_length, max_length):
    '''expect a list, return a new list with the values between min and max length provided'''
    new_wordlist = []
    for word in wordlist:
        #if (len(str(word)) < min_length) or (len(str(word)) > max_length): wordlist.remove(word)
        if (len(str(word)) >= min_length) and (len(str(word)) <= max_length): new_wordlist.append(str(word))
    return new_wordlist

After:

def remove_by_lengths(wordlist, min_length, max_length):
    '''expect a list, return a new list with the values between min and max length provided'''
    return [
        str(word)
        for word in wordlist
        if (len(str(word)) >= min_length) and (len(str(word)) <= max_length)
    ]
I would make use of the filter function in Python. It may make this slightly cleaner and can remove some additional calculations.
def length_filter(word, min_length, max_length):
    # compute the length once instead of calling len(str(word)) twice
    word_length = len(str(word))
    return min_length <= word_length <= max_length

def remove_by_lengths(wordlist, min_length, max_length):
    '''expect a list, return a new list with the values between min and max length provided'''
    # list() keeps the documented return type; in Python 3, filter returns a lazy iterator
    return list(filter(lambda word: length_filter(word, min_length, max_length), wordlist))
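For example, with a hypothetical word list (note that, unlike the list comprehension above, filter keeps the original items rather than converting them with str):

words = ['a', 'hello', 'passwords', 1234]
print(remove_by_lengths(words, 4, 8))  # ['hello', 1234]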
@@ -69,18 +64,15 @@ def case_transforms(word):
def leet_transforms(word):
    new_wordlist = []
Changing this to a set makes the list unique, so there is no need to remove the duplicates at the end; this applies to almost all the other generate and transform functions. A sketch of the idea is below.
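A minimal sketch of that idea applied to leet_transforms, assuming a simple one-to-one substitution table (hypothetical; the project's actual leet charset may differ):

LEET_CHARSET = {'a': '4', 'e': '3', 'i': '1', 'o': '0'}  # hypothetical substitutions

def leet_transforms(word):
    new_words = set()  # a set deduplicates as it goes, so no cleanup pass is needed
    for char, leet in LEET_CHARSET.items():
        candidate = word.replace(char, leet)
        if candidate != word:
            new_words.add(candidate)
    return new_words

print(leet_transforms('password'))  # {'p4ssword', 'passw0rd'} (order may vary)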