# extract text from docx files

First, we need a library that can parse docx files. python-docx is one such library:

https://python-docx.readthedocs.io/en/latest/user/documents.html

To install packages, I use pip.

For users of Anaconda, see https://anaconda.org/conda-forge/python-docx

In [1]:
!pip install python-docx



Once the library is installed, we can import it into the Python environment.

In [2]:
from docx import Document
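
Note that the package is installed under the name `python-docx` but imported under the name `docx`. As a minimal sketch, the check below uses only the standard library to report whether the `docx` module can be found in the current environment. The helper name `docx_available` is made up for this illustration and is not part of python-docx.

```python
import importlib.util


def docx_available():
    """Return True if the `docx` module (provided by python-docx) can be found."""
    # find_spec returns None when no top-level module with this name exists.
    return importlib.util.find_spec("docx") is not None


if docx_available():
    print("python-docx is installed")
else:
    print("python-docx is missing; install it with: pip install python-docx")
```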

What files are available to be parsed?

One method is to use `ls` from the command line on Mac or Linux; `dir` is the Windows equivalent.

In [3]:
!ls essays/*.docx

'essays/week1_50 Years Data Science Summary.docx'
'essays/week1_a History of Data Science.docx'
 essays/week1_Assignment1.docx
'essays/week1_a Very Short History Of Data Science_1.docx'
'essays/week1_A Very Short History Of Data Science.docx'
'essays/week1_Data 601- Summary of The History of Data Science .docx'
'essays/week1_Data Wrangling Chap 2.docx'
'essays/week1_essay_2019-08-30-18-51-59_History of Data Science.docx'
'essays/week1_essay_2019-08-30-21-06-35_A very short history on data science.docx'
'essays/week1_essay_2019-09-03-17-43-38_50 years of Data Science.docx'
'essays/week1_reading Summary.docx'
'essays/week1_summary-50 years of data science.docx'
 essays/week1_summary.docx
'essays/week1_ Summary.docx'
'essays/week2_Data601-Reading Assignment_2.docx'
'essays/week2_Data Wrangling with Python page 17 to 40.docx'
'essays/week2_Lists and Dictionaries Summary.docx'
'essays/week2_summary-Data Wrangling with Python  ch2 p17 to 40.docx'
'essays/week2 summary .docx'
 essays/Week2_su
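
Since `ls` and `dir` depend on the operating system, the same listing can also be produced from Python. The sketch below uses `pathlib` from the standard library and assumes the files sit directly inside a folder named `essays`, as in the cell above. The helper name `list_docx_files` is hypothetical.

```python
from pathlib import Path


def list_docx_files(folder):
    """Return the .docx files directly inside `folder`, sorted by path."""
    folder = Path(folder)
    if not folder.is_dir():
        # A missing folder is treated as "no files" rather than an error.
        return []
    return sorted(folder.glob("*.docx"))


# "essays" is the folder used in this notebook.
for path in list_docx_files("essays"):
    print(path)
```

The function returns `Path` objects; wrap one in `str()` if a plain string is needed.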

Here we will manually specify one file. In later notes we will see how to perform this selection in Python.

In [4]:
document = Document('essays/week1_50 Years Data Science Summary.docx')

In [5]:
type(document)

docx.document.Document

In [6]:
document.paragraphs

[<docx.text.paragraph.Paragraph at 0x7fe18d6212b0>,
 <docx.text.paragraph.Paragraph at 0x7fe18d6215f8>,
 <docx.text.paragraph.Paragraph at 0x7fe18d621668>,
 <docx.text.paragraph.Paragraph at 0x7fe18d621630>,
 <docx.text.paragraph.Paragraph at 0x7fe18d6216a0>,
 <docx.text.paragraph.Paragraph at 0x7fe18d6216d8>,
 <docx.text.paragraph.Paragraph at 0x7fe18d621710>,
 <docx.text.paragraph.Paragraph at 0x7fe18d621748>,
 <docx.text.paragraph.Paragraph at 0x7fe18d621780>,
 <docx.text.paragraph.Paragraph at 0x7fe18d6217b8>]

In [7]:
type(document.paragraphs)

list

In [8]:
document.paragraphs[0]

<docx.text.paragraph.Paragraph at 0x7fe18d621a20>

Each `Paragraph` object has a `text` attribute; see https://python-docx.readthedocs.io/en/latest/user/text.html

In [9]:
document.paragraphs[0].text

'Data Science 601'

In [10]:
document.paragraphs[1].text

'In recent years, data science programs have been proliferating. On September 2015, the University of Michigan announced a $100 million Data Science Initiative (DSI). In their definition of DSI they use words, such as “processing”, “analysis”, and “interpretation of vast amount of data”.'
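
To see where this text comes from, it helps to know that a docx file is a ZIP archive whose main body is stored as XML in `word/document.xml`. The sketch below reads the top-level paragraphs using only the standard library. It is a rough illustration and not how python-docx is implemented: it ignores tabs, line breaks, tables, headers, and footers. The function name `paragraph_texts` is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used in the main document part of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraph_texts(path):
    """Return the text of each top-level paragraph in the body of a .docx file."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    body = root.find(W_NS + "body")
    if body is None:
        return []
    texts = []
    for para in body.findall(W_NS + "p"):
        # A paragraph's text is split across <w:t> elements; join them in order.
        texts.append("".join(node.text or "" for node in para.iter(W_NS + "t")))
    return texts
```

Calling `paragraph_texts` on one of the essay files should give a list of strings comparable to reading `.text` from each entry of `document.paragraphs`, although the two may differ for documents that contain tabs or line breaks.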

citation: https://stackoverflow.com/questions/25228106/how-to-extract-text-from-an-existing-docx-file-using-python-docx

In [11]:
# enumerate(..., start=1) numbers the paragraphs from 1
for indx, para in enumerate(document.paragraphs, start=1):
    # skip empty paragraphs
    if len(para.text) > 0:
        print("\n  paragraph", indx, "is")
        print(para.text)


  paragraph 1 is
Data Science 601

  paragraph 2 is
In recent years, data science programs have been proliferating. On September 2015, the University of Michigan announced a $100 million Data Science Initiative (DSI). In their definition of DSI they use words, such as “processing”, “analysis”, and “interpretation of vast amount of data”.

  paragraph 3 is
Many statisticians are puzzled by this new discipline which seems to claim to do the same tasks that have been part of their daily work for decades. Also, as large as the UM initiative was statisticians had an insignificant presence which left them marginalized and confused. 

  paragraph 4 is
When searching the web definition of data science and statistics; while different words are used they clearly overlap. Also, to the argument of data, statisticians have been using large amounts of data of all types for decades. So, for statisticians their profession is just being nicely package as the new and shining Data Science. Many statisti

## create a function that gets the text from a document

First, summarize what we've done:

In [12]:
document = Document('essays/week1_50 Years Data Science Summary.docx')
# enumerate(..., start=1) numbers the paragraphs from 1
for indx, para in enumerate(document.paragraphs, start=1):
    # skip empty paragraphs
    if len(para.text) > 0:
        print("\n  paragraph", indx, "is")
        print(para.text)


  paragraph 1 is
Data Science 601

  paragraph 2 is
In recent years, data science programs have been proliferating. On September 2015, the University of Michigan announced a $100 million Data Science Initiative (DSI). In their definition of DSI they use words, such as “processing”, “analysis”, and “interpretation of vast amount of data”.

  paragraph 3 is
Many statisticians are puzzled by this new discipline which seems to claim to do the same tasks that have been part of their daily work for decades. Also, as large as the UM initiative was statisticians had an insignificant presence which left them marginalized and confused. 

  paragraph 4 is
When searching the web definition of data science and statistics; while different words are used they clearly overlap. Also, to the argument of data, statisticians have been using large amounts of data of all types for decades. So, for statisticians their profession is just being nicely package as the new and shining Data Science. Many statisti

In [14]:
def docx_to_dict(name_of_file):
    """Return a dict mapping 1-based paragraph numbers to paragraph text.

    Empty paragraphs are skipped, so the keys may have gaps.
    """
    docx_dict = {}
    document = Document(name_of_file)
    for indx, para in enumerate(document.paragraphs, start=1):
        if len(para.text) > 0:
            docx_dict[indx] = para.text
    return docx_dict

In [15]:
docx_to_dict('essays/week1_50 Years Data Science Summary.docx')

{1: 'Data Science 601',
 2: 'In recent years, data science programs have been proliferating. On September 2015, the University of Michigan announced a $100 million Data Science Initiative (DSI). In their definition of DSI they use words, such as “processing”, “analysis”, and “interpretation of vast amount of data”.',
 3: 'Many statisticians are puzzled by this new discipline which seems to claim to do the same tasks that have been part of their daily work for decades. Also, as large as the UM initiative was statisticians had an insignificant presence which left them marginalized and confused. ',
 4: 'When searching the web definition of data science and statistics; while different words are used they clearly overlap. Also, to the argument of data, statisticians have been using large amounts of data of all types for decades. So, for statisticians their profession is just being nicely package as the new and shining Data Science. Many statistics organizations are asking, among other quest

In [16]:
docx_dict = docx_to_dict('essays/week1_50 Years Data Science Summary.docx')

In [17]:
docx_dict

{1: 'Data Science 601',
 2: 'In recent years, data science programs have been proliferating. On September 2015, the University of Michigan announced a $100 million Data Science Initiative (DSI). In their definition of DSI they use words, such as “processing”, “analysis”, and “interpretation of vast amount of data”.',
 3: 'Many statisticians are puzzled by this new discipline which seems to claim to do the same tasks that have been part of their daily work for decades. Also, as large as the UM initiative was statisticians had an insignificant presence which left them marginalized and confused. ',
 4: 'When searching the web definition of data science and statistics; while different words are used they clearly overlap. Also, to the argument of data, statisticians have been using large amounts of data of all types for decades. So, for statisticians their profession is just being nicely package as the new and shining Data Science. Many statistics organizations are asking, among other quest
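
With the paragraphs stored in a dictionary keyed by paragraph number, one possible next step is to join them back into a single string, for example to count words. This is a minimal sketch: the helper name `join_paragraphs` is hypothetical, and `sample` is a shortened stand-in with the same shape as `docx_dict`.

```python
def join_paragraphs(paragraphs_by_number):
    """Join paragraph texts into one string, ordered by paragraph number."""
    ordered_keys = sorted(paragraphs_by_number)
    return "\n".join(paragraphs_by_number[key] for key in ordered_keys)


# Shortened sample in the same shape as the dictionary returned by docx_to_dict.
sample = {
    1: "Data Science 601",
    2: "In recent years, data science programs have been proliferating.",
}
full_text = join_paragraphs(sample)
print(full_text)
print(len(full_text.split()), "words")
```

Sorting the keys makes the order explicit instead of relying on the order in which entries were added to the dictionary.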