In [1]:
import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/thakrav/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/thakrav/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/thakrav/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/thakrav/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [2]:
from utils import (
    big_data_dict,
    create_dir,
    extract_images,
    extract_paragraphs,
    random_select_dict,
    styled_print,
)


In [3]:
%load_ext autoreload
%autoreload 2

In [4]:
book_file_path = "./data/fire-and-blood.docx"

## Extract Raw Data From the Docx File

Microsoft Word's Word Count tool reports the following statistics for the book:

| Property                 | Count     |
| ------------------------ | --------- |
| Pages                    | 703       |
| Words                    | 256,032   |
| Paragraphs               | 3,197     |
| Lines                    | 23,849    |
| Graphics                 | 71        |
| Characters (no spaces)   | 1,189,352 |
| Characters (with spaces) | 1,441,453 |

We used Microsoft Word's Advanced Find feature (search for `^g`) to count the graphical objects in the document.

The goal of this part of the code is to extract a comparable amount of data from the document.

**Note:** Our numbers may not match these statistics exactly, but we try to get as close to them as possible.
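
To compare extracted data against the table above, a few counts can be computed from the paragraph text. The following is a minimal sketch, assuming paragraphs are available as an `{id: text}` mapping; `text_stats` is a hypothetical helper, not part of `utils`, and its whitespace-based word count will not necessarily match Word's own counting rules.

```python
def text_stats(paragraphs):
    """Return simple counts for a {paragraph_id: text} mapping.

    Assumption: a "word" is any whitespace-separated token, which may
    differ from how Microsoft Word counts words.
    """
    texts = list(paragraphs.values())
    return {
        "paragraphs": len(texts),
        "words": sum(len(t.split()) for t in texts),
        # Drop all whitespace before counting characters.
        "characters_no_space": sum(len("".join(t.split())) for t in texts),
        "characters_with_space": sum(len(t) for t in texts),
    }


# Tiny illustrative input, not data from the book.
demo = {0: "For six days", 1: "I am frightened."}
print(text_stats(demo))
```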

### Extract Images

In [5]:
image_dir = create_dir("./data", "images")
image_file_paths = extract_images(book_file_path, image_dir, verbose=0)
styled_print(f"Found Total {len(image_file_paths)} Images from the Book {book_file_path}", header=True)

[1m› [4mcreating directory ... ./data/images[0m
[1m› [4mExtracting Images from ./data/fire-and-blood.docx[0m
[1m› [4mFound Total 112 Images from the Book ./data/fire-and-blood.docx[0m


In [6]:
for i, file_path in enumerate(image_file_paths):
    big_data_dict["book-images"].append(
        {
            "id": i,
            "file": file_path
        }
    )

**Observations**
- We extracted a total of 112 images from the book, compared with the 71 graphics reported by Microsoft Word.
- We manually checked every extracted file and confirmed that none of them is a non-image object.
- Our extraction approach recovered all of the correct images, and it is more complete than using Microsoft Word's Advanced Find option to count images.
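
For context, a `.docx` file is a ZIP archive, and embedded images are stored under `word/media/` inside it. The sketch below shows one way images could be pulled out with the standard library alone. It is an illustration under that assumption, not the implementation of `utils.extract_images`; the name `extract_docx_media` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_docx_media(docx_path, out_dir):
    """Copy every file under word/media/ in a .docx archive into out_dir.

    Hypothetical helper: returns the list of written file paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in sorted(archive.namelist()):
            # Skip directory entries; keep only real media files.
            if name.startswith("word/media/") and not name.endswith("/"):
                target = out_dir / Path(name).name
                target.write_bytes(archive.read(name))
                saved.append(str(target))
    return saved
```

Called as `extract_docx_media("./data/fire-and-blood.docx", "./data/images")`, this would return one path per media file in the archive. Because it copies every file in `word/media/`, a manual check like the one described above is still worthwhile.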

### Extract Paragraphs

In [7]:
paragraphs = extract_paragraphs(book_file_path, min_char_count=1)
styled_print(f"Found Total {len(paragraphs)} Paragraphs from the Book {book_file_path}", header=True)

[1m› [4mExtracting Paragraphs from ./data/fire-and-blood.docx[0m
[1m› [4mFound Total 3168 Paragraphs from the Book ./data/fire-and-blood.docx[0m

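As a rough illustration of what paragraph extraction with a minimum length can look like, the sketch below reads `word/document.xml` from the archive and joins the text runs (`w:t`) of each paragraph element (`w:p`). It is an assumption-laden sketch and not the code behind `utils.extract_paragraphs`: `read_docx_paragraphs` is a hypothetical name, and keying the result by the paragraph's position in the document is an assumed design choice.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_paragraphs(docx_path, min_char_count=1):
    """Return {position: text} for paragraphs with enough characters.

    Hypothetical helper; `position` is the index of the <w:p> element
    in document order, so skipped paragraphs leave gaps in the keys.
    """
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = {}
    for index, p in enumerate(root.iter(f"{W_NS}p")):
        # A paragraph's text is spread over one or more <w:t> runs.
        text = "".join(t.text or "" for t in p.iter(f"{W_NS}t")).strip()
        if len(text) >= min_char_count:
            paragraphs[index] = text
    return paragraphs
```

With `min_char_count=1`, empty paragraphs are dropped, which is one reason an extracted count can be lower than the paragraph count Word reports.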

In [8]:
styled_print(f"Some Sample Paragraphs from the Book {book_file_path}", header=True)
sampled_paragraphs = random_select_dict(paragraphs, 5)
for key, val in sampled_paragraphs.items():
    styled_print(f"{key} - {val}")

[1m› [4mSome Sample Paragraphs from the Book ./data/fire-and-blood.docx[0m
    3431 - In that same hour, Lady Baela Targaryen was being spirited away to safety by agents of Lord Larys the Clubfoot. Tom Tangletongue was surprised in the castle yards as he was leaving the stables, and beheaded forthwith. “He died as he had lived, stammering,” says Mushroom. His father Tom Tanglebeard was absent from the castle, but they found him in a tavern on Eel Alley. When he protested that he was “just a simple fisherman, come to have an ale,” his captors drowned him in a cask of same.
    3657 - for three days and refused to say where she had been when she returned.
    1957 - In the Vale, however, her sister Daella was not doing near as well. After a year and a half of marriage, a different sort of message arrived at the Red Keep by raven. It was very short, and written in Daella’s own uncertain hand. “I am with child,” it said. “Mother, please come. I am frightened.”
    3499 - For six days Ki
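
Random sampling from a dictionary, as done in the cell above, can be sketched as follows. This is a hypothetical stand-in for `random_select_dict`, whose actual implementation is not shown here; the `seed` parameter is an addition for reproducibility.

```python
import random


def sample_dict(items, k, seed=None):
    """Return a new dict with up to k randomly chosen entries of `items`.

    Hypothetical helper; passing a seed makes the selection repeatable.
    """
    rng = random.Random(seed)
    # Never ask for more keys than the dictionary holds.
    keys = rng.sample(list(items), min(k, len(items)))
    return {key: items[key] for key in keys}
```

For example, `sample_dict(paragraphs, 5, seed=42)` would return the same five paragraphs on every run, which makes spot checks easier to reproduce.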

In [9]:
for key, value in paragraphs.items():
    big_data_dict["paragraphs"].append(
        {
            "id": key,
            "text": value
        }
    )

### Clean Paragraphs

#### Remove Punctuation
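
The sampled paragraphs contain typographic quotes and apostrophes (“ ” ’), which are not part of `string.punctuation`. One possible approach, shown here as a minimal sketch rather than the notebook's chosen method, is to also drop any character whose Unicode category starts with `P`. The function name `remove_punctuation` is hypothetical.

```python
import string
import unicodedata


def remove_punctuation(text):
    """Strip ASCII and Unicode punctuation, then collapse whitespace."""
    kept = [
        ch
        for ch in text
        if ch not in string.punctuation
        # Unicode categories Pc, Pd, Ps, Pe, Pi, Pf, Po cover curly quotes etc.
        and not unicodedata.category(ch).startswith("P")
    ]
    return " ".join("".join(kept).split())


print(remove_punctuation("“I am with child,” it said."))
```

Note that this also removes apostrophes, so "Daella’s" becomes "Daellas"; whether that is acceptable depends on the later tokenization step.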

In [None]:
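# Draft layout of big_data_dict; all values below are placeholders.
big_data_dict = {
    "paragraphs": [
        {
            "id": "paragraph",
            "text": "sdfadf",
        }
    ],
    "sentences": [
        {
            "id": "sentence",
            "paragraph_id": "",
            "sentiment": None,  # placeholder, no value in the draft
            "tokens": [],  # assumed key name; the draft only reads "toke"
        }
    ],
    "topics": [],  # placeholder, structure not yet defined
}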