# **Extracting Data**

**Choosing the Book**: Since my birth month is July (7), I will use Book 7 as per the given instructions.

Extracting `file1.txt`: My birth date is the 7th, so I will extract 10 pages starting from page 7 of Book 7 (pages 7-16) and save them in `file1.txt`.

Extracting `file2.txt`: My birth year is 2002, which corresponds to page 102. I will extract 10 pages starting from page 102 of Book 7 (pages 102-111) and save them in `file2.txt`.
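The page arithmetic above can be sketched in isolation. This is a minimal illustration, assuming pages are numbered from 1 in the text while PDF libraries index them from 0. The helper name `page_indices` and its `total_pages` parameter are hypothetical and not part of the notebook.

```python
def page_indices(start_page, num_pages=10, total_pages=None):
    """Return zero-based indices for num_pages pages starting at 1-based start_page."""
    if start_page < 1:
        raise ValueError("start_page is 1-based and must be >= 1")
    last_page = start_page + num_pages - 1
    # Optionally clamp to the document length so the range never overruns
    if total_pages is not None:
        last_page = min(last_page, total_pages)
    return list(range(start_page - 1, last_page))

file1_indices = page_indices(7)    # pages 7-16
file2_indices = page_indices(102)  # pages 102-111
print(file1_indices[0] + 1, file1_indices[-1] + 1)
print(file2_indices[0] + 1, file2_indices[-1] + 1)
```

Keeping the 1-based to 0-based conversion in one place makes an off-by-one error easier to spot than when it is buried inside the extraction call.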

In [1]:
!pip install pyspellchecker
!pip install PyPDF2
!pip install fpdf


Collecting pyspellchecker
  Downloading pyspellchecker-0.8.2-py3-none-any.whl.metadata (9.4 kB)
Downloading pyspellchecker-0.8.2-py3-none-any.whl (7.1 MB)
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.8.2
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1
Collecting fpdf
  Downloading fpdf-1.7.2.tar.gz (39 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: fpdf
  Building wheel for fpdf (setup.py) ... done
  Created wheel for fpdf: filename=fpdf-1.7.2-py2.py3-none-any.whl size=40704 sha256=5221

In [2]:
from PyPDF2 import PdfReader  # extract text from PDFs
import re  # handle text processing with regular expressions
import pandas as pd  # data manipulation and storage
from collections import Counter  # count word occurrences
from spellchecker import SpellChecker  # detect misspelled or non-English words
from fpdf import FPDF  # create PDF reports
import matplotlib.pyplot as plt  # generate visualizations


In [3]:
# file paths
PDF_FILE = "/content/Harry_Potter_(www.ztcprep.com).pdf"
OUTPUT_FILE1 = "file1.txt"
OUTPUT_FILE2 = "file2.txt"

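# Birth details for book and page selection (July 7, 2002)
BIRTH_MONTH, BIRTH_DATE, BIRTH_YEAR = 7, 7, 2002
BOOK_ID = 7  # Note: not used below; page numbers are counted from the start of PDF_FILE
START_PAGE1 = BIRTH_DATE  # pages 7-16
START_PAGE2 = 102  # pages 102-111 (from birth year 2002)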

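def extract_pages(pdf_path, start_page, num_pages=10):
    """Extract text from num_pages pages of a PDF, starting at 1-based start_page."""
    reader = PdfReader(pdf_path)
    # Clamp the range so it never runs past the last page of the PDF
    last_page = min(start_page + num_pages - 1, len(reader.pages))
    # reader.pages is zero-indexed, hence p - 1; fall back to "" for pages with no text
    return "\n".join(
        reader.pages[p - 1].extract_text() or ""
        for p in range(start_page, last_page + 1)
    )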

# Extract and save text
with open(OUTPUT_FILE1, "w", encoding="utf-8") as f1, open(OUTPUT_FILE2, "w", encoding="utf-8") as f2:
    f1.write(extract_pages(PDF_FILE, START_PAGE1))
    f2.write(extract_pages(PDF_FILE, START_PAGE2))

print(f"Text extraction complete: {OUTPUT_FILE1}, {OUTPUT_FILE2}")


Text extraction complete: file1.txt, file2.txt


**Q1**: Write Python code and use MapReduce to count occurrences of each word in the first text file (file1.txt). How many times is each word repeated?

In [4]:
TEXT_FILE = "/content/file1.txt"
OUTPUT_CSV = "word_count.csv"

def extract_words(text):
    return re.findall(r'\b\w+\b', text.lower())

# Read text from file
with open(TEXT_FILE, "r", encoding="utf-8") as file:
    content = file.read()

# Count word occurrences
word_freq = Counter(extract_words(content))

# Convert to DataFrame and sort
df = pd.DataFrame(word_freq.items(), columns=["Word", "Count"]).sort_values(by="Count", ascending=False)

# Save results to CSV
df.to_csv(OUTPUT_CSV, index=False)

# Display output
print("\nWord Frequency Analysis from file1.txt:")
print(df.to_string(index=False))



Word Frequency Analysis from file1.txt:
        Word  Count
         the     79
          he     79
           a     44
         and     39
          to     35
          of     34
         was     34
     dursley     33
          it     32
           t     30
         his     28
          in     25
        that     25
          mr     23
          as     20
          on     20
        have     16
         had     16
      people     13
          at     13
        been     12
        didn     12
         mrs     11
         com     10
         www     10
         all     10
       harry     10
     ztcprep     10
      potter      9
        them      9
         but      9
        were      9
           s      9
        owls      9
        into      9
        said      8
          be      8
           d      8
         cat      8
        they      8
        back      7
       about      7
          if      7
       there      7
         for      7
           i      7
         him      7
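The cell above counts words with `Counter`, which folds the whole job into one call. As a rough sketch of how the same count could be written as explicit map, shuffle, and reduce steps, the example below uses the same `\b\w+\b` pattern. The function names and the two sample lines are illustrative assumptions, not part of the notebook or its data.

```python
import re
from collections import defaultdict
from functools import reduce

def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line."""
    return [(word, 1) for word in re.findall(r"\b\w+\b", line.lower())]

def shuffle(pairs):
    """Shuffle step: group the emitted counts by word."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_counts(groups):
    """Reduce step: sum the grouped counts for each word."""
    return {word: reduce(lambda a, b: a + b, counts) for word, counts in groups.items()}

def word_count(lines):
    pairs = [pair for line in lines for pair in map_words(line)]
    return reduce_counts(shuffle(pairs))

# Hypothetical sample input, standing in for the lines of file1.txt
sample = ["Mr. Dursley was the director", "Mrs. Dursley was thin"]
counts = word_count(sample)
print(sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])))
```

To run it on the extracted text, the lines of `file1.txt` could be passed in place of `sample`, for example `word_count(open("file1.txt", encoding="utf-8"))`.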

**Q2**: From the second text file (file2.txt), write Python code and use MapReduce to count how many times non-English words (names, places, spells, etc.) were used. List those words and how many times each was repeated.
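One way to frame this task as map and reduce steps is sketched below. It is an illustration under stated assumptions: `KNOWN_WORDS` is a tiny hypothetical stand-in for a real dictionary, the function names are invented for the example, and the `[a-z]+` pattern matches letters only, so numbers are skipped (unlike `\w+`, which also matches digits).

```python
import re
from collections import Counter
from functools import reduce

# Hypothetical stand-in for a real English dictionary
KNOWN_WORDS = {"the", "said", "to", "went", "and", "with"}

def map_unknown(line, is_known):
    """Map step: emit (word, 1) for each alphabetic word the dictionary does not know."""
    words = re.findall(r"[a-z]+", line.lower())
    return [(word, 1) for word in words if not is_known(word)]

def reduce_pairs(pairs):
    """Reduce step: fold the (word, 1) pairs into per-word totals."""
    def add(acc, pair):
        word, n = pair
        acc[word] += n
        return acc
    return reduce(add, pairs, Counter())

def count_unknown(lines, is_known):
    pairs = [pair for line in lines for pair in map_unknown(line, is_known)]
    return reduce_pairs(pairs)

# Hypothetical sample input, standing in for the lines of file2.txt
sample = ["Hagrid went to Gringotts", "said Hagrid"]
result = count_unknown(sample, lambda w: w in KNOWN_WORDS)
print(result.most_common())
```

With `pyspellchecker` installed, the predicate could instead be `lambda w: w in spell_checker` for a `SpellChecker()` instance, which relies on the same membership test as the cell that follows.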



In [5]:
FILE_PATH = "/content/file2.txt"
OUTPUT_FILE = "non_english_words.csv"

# Initialize spell checker
spell_checker = SpellChecker()

def get_words(text):
    return re.findall(r'\b\w+\b', text.lower())

# Read text from file
with open(FILE_PATH, "r", encoding="utf-8") as file:
    content = file.read()

# Extract words and filter non-English words
words = get_words(content)
non_english = [word for word in words if word not in spell_checker]

# Count occurrences
word_counts = Counter(non_english)

# Convert to DataFrame and save
df = pd.DataFrame(word_counts.items(), columns=["Non-English Word", "Count"]).sort_values(by="Count", ascending=False)
df.to_csv(OUTPUT_FILE, index=False)

# Display results
print("\nIdentified Non-English Words from file2.txt:")
print(df.to_string(index=False))



Identified Non-English Words from file2.txt:
Non-English Word  Count
          hagrid     29
             ter     23
             yeh     13
             www     10
         ztcprep     10
              ll      7
       gringotts      7
            didn      6
           ernon      5
              ap      3
            stuf      3
              ve      3
          izards      2
             eah      2
            hadn      2
           knuts      2
           albus      2
            wasn      2
          gettin      2
          wouldn      1
              mm      1
             teh      1
              69      1
              64      1
              70      1
            cept      1
       deliverin      1
       everythin      1
          pposed      1
       mentionin      1
              71      1
         guardin      1
         fetchin      1
              66      1
           payin      1
         shouldn      1
          muggle      1
            goin      1
         dumbled  