Les Misérables

An analysis of Victor Hugo's Les Misérables using different natural language processing models

Data mission at Wild Code School (2 days)

Data preparation (cleaning and tokenizing)
NLP models to visualize the most frequent words (with and without character names)

Data preparation

The preparation package retrieves the full text of Les Misérables in its original French from gutenberg.org and cleans it. Everything that is not part of the main text (such as the list of chapters) is removed, and the main text is combined into a single string. mis_2_prep.py removes all French stopwords and tokenizes the text using spaCy. It also creates a version of the text and tokens without the first and last names of the most prominent characters.
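The filtering step described above can be sketched roughly as follows. This is a minimal illustration and not the code in mis_2_prep.py: it swaps spaCy for a plain regular-expression tokenizer so that it runs with the standard library alone, and the stopword set, the name set, and the function names (`tokenize`, `prepare_tokens`) are all hypothetical placeholders.

```python
import re

# Tiny illustrative subset (assumption); the project uses spaCy's French stopwords.
FRENCH_STOPWORDS = {
    "le", "la", "les", "un", "une", "de", "du", "des", "et", "il", "elle",
    "que", "qui", "dans", "en", "ne", "pas", "se", "ce",
}

# Hypothetical name list; the README does not say which names are removed.
CHARACTER_NAMES = {"jean", "valjean", "cosette", "marius", "javert", "fantine"}

# Runs of letters only (accented letters included, digits and underscores excluded).
TOKEN_PATTERN = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [match.group(0).lower() for match in TOKEN_PATTERN.finditer(text)]


def prepare_tokens(text, drop_names=False):
    """Remove stopwords and single letters, and optionally character names."""
    tokens = [
        t for t in tokenize(text)
        if len(t) > 1 and t not in FRENCH_STOPWORDS
    ]
    if drop_names:
        tokens = [t for t in tokens if t not in CHARACTER_NAMES]
    return tokens


sample = "Jean Valjean regarda la porte et le jardin."
print(prepare_tokens(sample))
# ['jean', 'valjean', 'regarda', 'porte', 'jardin']
print(prepare_tokens(sample, drop_names=True))
# ['regarda', 'porte', 'jardin']
```

Keeping name removal behind a flag yields both token versions from the same cleaned text, which is what the with/without comparison needs. The single-letter filter is a crude way to drop elided articles such as the "l" in "l'homme" after the regex splits on the apostrophe.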

NLP

The nlp files visualize the most common words using different models, both with and without character names:

frequency count of tokens
frequency count of stemmed tokens
frequency count of lemmatized tokens
word cloud of token frequency
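The three frequency counts in the list above differ only in how each token is normalized before counting. The sketch below shows that idea with the standard library; it is an assumption about the general approach, not the code in the nlp files. `top_words` and `crude_singular` are hypothetical names, and `crude_singular` is a toy stand-in, not a real stemmer or lemmatizer.

```python
from collections import Counter


def top_words(tokens, n=10, normalize=None):
    """Return the n most frequent tokens, optionally normalizing each one first."""
    if normalize is not None:
        tokens = (normalize(t) for t in tokens)
    return Counter(tokens).most_common(n)


def crude_singular(token):
    # Toy stand-in for a stemmer or lemmatizer: strips a trailing "s".
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


tokens = ["homme", "hommes", "rue", "homme", "jour", "rue", "homme"]

print(top_words(tokens, n=2))
# [('homme', 3), ('rue', 2)]
print(top_words(tokens, n=2, normalize=crude_singular))
# [('homme', 4), ('rue', 2)]
```

Passing a real stemming or lemmatizing function as `normalize` merges inflected forms in the same way, which is why the top 10 can change between the raw, stemmed, and lemmatized counts. The (word, count) pairs can also be turned into a dict of frequencies, the kind of input that word cloud libraries typically accept.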

Folders and files

images
preparation
README.md
nlp_1_frequencies.py
nlp_2_stemming.py
nlp_3_lemmatizing.py
nlp_4_wordclouds.py

Top 10 words with and without character names

Top 10 words (excluding character names) using lemmatization

s-bau/les-miserables