This project analyzes two literary works to compare their content and style. The analysis covers text cleaning, exploration, feature engineering, vectorization, and comparison to derive insights into the themes, vocabulary, and narrative styles of the texts. It requires:
- Python 3.x
- NLTK (Natural Language Toolkit)
- scikit-learn
- WordCloud
- Matplotlib
- Installation: Ensure Python 3.x is installed on your system, then install the required Python packages using pip:

  ```bash
  pip install nltk scikit-learn matplotlib wordcloud
  ```
- NLTK Data: Download the necessary NLTK datasets:

  ```python
  import nltk

  nltk.download('punkt')                       # tokenizer models
  nltk.download('stopwords')                   # stopword lists
  nltk.download('wordnet')                     # lemmatizer data
  nltk.download('averaged_perceptron_tagger')  # POS tagger model
  ```
- `document_similarity.ipynb`: The main notebook containing all the code for the project.
- `pandp.txt` and `sands.txt`: Text files containing the literary works to be analyzed. Note that the notebook works for any pair of documents.
- `report.pdf`: The project report.
- Ensure all dependencies are installed and NLTK data is downloaded.
- Place the text files in the same directory as `document_similarity.ipynb`.
- Run the cells in `document_similarity.ipynb` sequentially.
- The notebook will output the results of the analysis, including Word Clouds, Dispersion Plots, and the cosine similarity scores.
The first step prepares the text data (see the sketch after this list) by:
- Separating the main content from surrounding metadata.
- Removing extra whitespace and non-alphabet characters.
- Converting all characters to lowercase.
- Removing stopwords.
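A minimal sketch of this cleaning step, assuming the raw text has already been loaded as a string; the helper name `clean_text` and the exact regular expressions are illustrative, not necessarily the notebook's code:

```python
import re

from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))

def clean_text(raw: str) -> str:
    """Strip non-alphabet characters, normalize whitespace and case, drop stopwords."""
    text = re.sub(r'[^A-Za-z\s]', ' ', raw)    # remove non-alphabet characters
    text = re.sub(r'\s+', ' ', text).strip()   # collapse extra whitespace
    text = text.lower()                        # convert to lowercase
    return ' '.join(w for w in text.split() if w not in STOPWORDS)
```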
We use NLTK's word tokenizer to break down the text into individual words (tokens).
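For example, assuming `cleaned_text` holds the output of the cleaning step above:

```python
from nltk.tokenize import word_tokenize

tokens = word_tokenize(cleaned_text)  # list of word tokens
print(tokens[:10])                    # inspect the first few tokens
```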
This phase creates Word Clouds to visually explore the dominant words in each document, and Dispersion Plots to show how selected words are distributed across the text.
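A sketch of both plots, assuming `tokens` comes from the tokenization step; the words passed to the dispersion plot are purely illustrative:

```python
import matplotlib.pyplot as plt
from nltk.text import Text
from wordcloud import WordCloud

# Word Cloud of the most frequent words
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(tokens))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

# Dispersion Plot: positions of selected words across the document
# (pick words relevant to the texts being compared)
Text(tokens).dispersion_plot(['love', 'marriage', 'family'])
```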
We apply two feature engineering tasks to each document (sketched below):
- POS tagging followed by lemmatization
- N-grams
Each document therefore has two sets of features, one per task.
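A minimal sketch of both tasks, assuming `tokens` from the tokenization step; the tag-mapping helper `to_wordnet_pos` is an illustrative assumption:

```python
from nltk import ngrams, pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def to_wordnet_pos(treebank_tag: str) -> str:
    """Map Penn Treebank tags to the WordNet POS categories the lemmatizer expects."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()

# Feature set 1: POS tagging followed by lemmatization
lemmas = [lemmatizer.lemmatize(word, to_wordnet_pos(tag))
          for word, tag in pos_tag(tokens)]

# Feature set 2: n-grams (bigrams shown here)
bigrams = [' '.join(gram) for gram in ngrams(tokens, 2)]
```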
Using TF-IDF (Term Frequency-Inverse Document Frequency), we vectorize the features generated in the previous step to numerically represent our text data.
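For instance, with scikit-learn's `TfidfVectorizer`; here `features_doc1` and `features_doc2` are assumed to be the token lists produced for each document by one of the tasks above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [' '.join(features_doc1), ' '.join(features_doc2)]  # one string per document
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)  # shape: (2, vocabulary size)
```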
We calculate the cosine similarity between the documents using both sets of features to understand their similarity in terms of content and style.
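The pairwise score can then be computed with scikit-learn, continuing from the TF-IDF sketch above:

```python
from sklearn.metrics.pairwise import cosine_similarity

score = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0, 0]
print(f'Cosine similarity: {score:.4f}')  # 1.0 = identical feature vectors
```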