## Scraping Content from Documents

[This is a folder](https://drive.google.com/file/d/1a_JlM2k_An8CT0MRLYjX6gzYvNm5t3cl/view?usp=share_link) that contains more than two dozen files.

Using the lesson on collecting content from documents, please do the following using Python:

* Analyze ONLY the .txt files (but do not physically remove the other files from this folder).

* Output a CSV file that has 4 columns: year, cognition_related, medical_condition, care_hours

* In the cognition_related column, enter ```True``` if the condition is related to Dementia or Alzheimer's disease, and ```False``` if it is not.

* In the medical_condition column, enter either "Dementia", "Alzheimer's" or "Not Specified", depending on the case.

* In the care_hours column, enter either "half-day" for 12-hour care, "full-day" for 24-hour care or "Not Specified".

* Export the CSV to your downloads.


In [4]:
# import libraries 
import pandas as pd
import glob

In [5]:
## glob or select documents to analyze
## NOTE: sorted() orders the names alphabetically; because the file numbers are zero-padded, that is also numerical order
path = "project-docs/*.txt"
myfiles = sorted(glob.glob(path))
myfiles

['project-docs/decision_01.txt',
 'project-docs/decision_02.txt',
 'project-docs/decision_03.txt',
 'project-docs/decision_04.txt',
 'project-docs/decision_05.txt',
 'project-docs/decision_06.txt',
 'project-docs/decision_07.txt',
 'project-docs/decision_08.txt',
 'project-docs/decision_09.txt',
 'project-docs/decision_10.txt',
 'project-docs/decision_11.txt',
 'project-docs/decision_12.txt',
 'project-docs/decision_13.txt',
 'project-docs/decision_14.txt',
 'project-docs/decision_15.txt',
 'project-docs/decision_16.txt',
 'project-docs/decision_17.txt',
 'project-docs/decision_18.txt',
 'project-docs/decision_19.txt',
 'project-docs/decision_20.txt']
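
The list above comes out in numerical order only because the file numbers are zero-padded: `sorted()` compares strings character by character. A minimal sketch, assuming hypothetical unpadded names such as `decision_2.txt` (not the files in `project-docs`) and a hypothetical helper `numeric_key`, of how the two orderings can differ:

```python
import re

def numeric_key(path):
    """Sort key: the last run of digits in the path, as an integer (hypothetical helper)."""
    numbers = re.findall(r"\d+", path)
    # Fall back to 0 when a name has no digits, then break ties by the name itself
    return (int(numbers[-1]) if numbers else 0, path)

# Hypothetical unpadded file names, not the ones in project-docs
unpadded = ["docs/decision_10.txt", "docs/decision_2.txt", "docs/decision_1.txt"]

print(sorted(unpadded))                   # plain string sort puts 10 before 2
print(sorted(unpadded, key=numeric_key))  # numeric sort: 1, 2, 10
```

With zero-padded names like the ones in this folder, plain `sorted()` is enough and no custom key is needed.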

In [6]:
#### use conditional statements to capture content and place in a list
decisions_list = []
for myfile in myfiles:
  with open(myfile, "r") as textfile:
    all_lines = textfile.readlines()
    ## capture the year from the first line
    year = all_lines[0].replace("Year: ", "").replace("\n", "")
    ## capture the condition from the third line
    condition = all_lines[2]
    if "determined to suffer from Dementia" in condition:
      condition = "Dementia"
      cog_related = True
    elif "determined to suffer from Alzheimer" in condition:
      condition = "Alzheimer's"
      cog_related = True
    else:
      condition = "Not specified"
      cog_related = False
    ## capture care hours from the last line
    care_hours = all_lines[-1]
    if "24-hour" in care_hours:
      care_hours = "full-day"
    elif "12-hour" in care_hours:
      care_hours = "half-day"
    else:
      care_hours = "Not specified"
    ## place all variables into temp dictionary
    care_dict = {"year": year, "cognition_related": cog_related, 
                "condition": condition, "care_hours": care_hours,
                "source": myfile} 

    ## append temp dictionary to list so we can hold on to info outside this loop
    decisions_list.append(care_dict)
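
The loop above depends on each file having the year on its first line, the condition on its third line and the care hours on its last line. As a sketch only, here is a variant that scans every line for the same phrases, so a file with an extra blank line would still parse. The helper name `parse_decision` and the sample text are invented for illustration; the real decision files may be worded differently.

```python
def parse_decision(lines):
    """Return the same fields as the loop above, scanning every line
    instead of relying on fixed positions (hypothetical helper)."""
    year = "Not specified"
    condition, cog_related = "Not specified", False
    care_hours = "Not specified"
    for line in lines:
        text = line.strip()
        if text.startswith("Year: "):
            year = text.replace("Year: ", "")
        if "determined to suffer from Dementia" in text:
            condition, cog_related = "Dementia", True
        elif "determined to suffer from Alzheimer" in text:
            condition, cog_related = "Alzheimer's", True
        if "24-hour" in text:
            care_hours = "full-day"
        elif "12-hour" in text:
            care_hours = "half-day"
    return {"year": year, "cognition_related": cog_related,
            "condition": condition, "care_hours": care_hours}

# Invented sample text; the real decision files may be worded differently
sample = [
    "Year: 2016\n",
    "\n",
    "The patient was determined to suffer from Dementia.\n",
    "The panel approved 24-hour care.\n",
]
print(parse_decision(sample))
```

Scanning every line is more forgiving about layout, but it can also match a phrase in an unexpected place, so the spot check matters either way.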


In [7]:
## call decisions_list
decisions_list

[{'year': '2016',
  'cognition_related': True,
  'condition': 'Dementia',
  'care_hours': 'full-day',
  'source': 'project-docs/decision_01.txt'},
 {'year': '2017',
  'cognition_related': False,
  'condition': 'Not specified',
  'care_hours': 'Not specified',
  'source': 'project-docs/decision_02.txt'},
 {'year': '2021',
  'cognition_related': True,
  'condition': "Alzheimer's",
  'care_hours': 'full-day',
  'source': 'project-docs/decision_03.txt'},
 {'year': '2021',
  'cognition_related': False,
  'condition': 'Not specified',
  'care_hours': 'Not specified',
  'source': 'project-docs/decision_04.txt'},
 {'year': '2018',
  'cognition_related': True,
  'condition': 'Dementia',
  'care_hours': 'half-day',
  'source': 'project-docs/decision_05.txt'},
 {'year': '2021',
  'cognition_related': True,
  'condition': 'Dementia',
  'care_hours': 'half-day',
  'source': 'project-docs/decision_06.txt'},
 {'year': '2019',
  'cognition_related': False,
  'condition': 'Not specified',
  'care_hours

In [8]:
## convert to dataframe
df = pd.DataFrame(decisions_list)
df

| | year | cognition_related | condition | care_hours | source |
|---|---|---|---|---|---|
| 0 | 2016 | True | Dementia | full-day | project-docs/decision_01.txt |
| 1 | 2017 | False | Not specified | Not specified | project-docs/decision_02.txt |
| 2 | 2021 | True | Alzheimer's | full-day | project-docs/decision_03.txt |
| 3 | 2021 | False | Not specified | Not specified | project-docs/decision_04.txt |
| 4 | 2018 | True | Dementia | half-day | project-docs/decision_05.txt |
| 5 | 2021 | True | Dementia | half-day | project-docs/decision_06.txt |
| 6 | 2019 | False | Not specified | Not specified | project-docs/decision_07.txt |
| 7 | 2016 | True | Dementia | full-day | project-docs/decision_08.txt |
| 8 | 2014 | True | Dementia | full-day | project-docs/decision_09.txt |
| 9 | 2016 | True | Dementia | full-day | project-docs/decision_10.txt |
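
The assignment asks for four columns named year, cognition_related, medical_condition and care_hours, while the dataframe here has five columns and calls the third one `condition`. One possible way to produce the requested layout is sketched below. The names `df_demo`, `required` and `df_out` are placeholders, and the two rows are copied from the output shown earlier to keep the example self-contained.

```python
import pandas as pd

# Two rows copied from the output above, standing in for the full decisions_list
rows = [
    {"year": "2016", "cognition_related": True, "condition": "Dementia",
     "care_hours": "full-day", "source": "project-docs/decision_01.txt"},
    {"year": "2017", "cognition_related": False, "condition": "Not specified",
     "care_hours": "Not specified", "source": "project-docs/decision_02.txt"},
]
df_demo = pd.DataFrame(rows)

# Column names taken from the assignment text
required = ["year", "cognition_related", "medical_condition", "care_hours"]

# Rename the condition column, then keep only the four requested columns
df_out = df_demo.rename(columns={"condition": "medical_condition"})[required]
print(df_out)
```

Keeping the `source` column in the working dataframe is still useful for spot checks; it only needs to be dropped from the copy that gets exported.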


## Remember to do a spot check to confirm that the data was captured accurately.
#### This is just 20 rows, so you could check them all, but if there were 2,000 rows, I would spot check 5 and ask a colleague to spot check 20.
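
A spot check can be partly scripted. The sketch below, which uses a small stand-in dataframe `df_demo` built from rows in the output shown earlier, draws a reproducible random sample and prints each sampled row next to its source file so the values can be compared by hand. The sample size and `random_state` are arbitrary choices.

```python
import pandas as pd

# Rows copied from the output above, standing in for the full dataframe
df_demo = pd.DataFrame([
    {"year": "2016", "cognition_related": True, "condition": "Dementia",
     "care_hours": "full-day", "source": "project-docs/decision_01.txt"},
    {"year": "2017", "cognition_related": False, "condition": "Not specified",
     "care_hours": "Not specified", "source": "project-docs/decision_02.txt"},
    {"year": "2021", "cognition_related": True, "condition": "Alzheimer's",
     "care_hours": "full-day", "source": "project-docs/decision_03.txt"},
    {"year": "2018", "cognition_related": True, "condition": "Dementia",
     "care_hours": "half-day", "source": "project-docs/decision_05.txt"},
])

# Draw a reproducible random sample of rows to compare against the files by hand
spot = df_demo.sample(n=2, random_state=42)
for row in spot.itertuples(index=False):
    print(row.source, "->", row.year, row.condition, row.care_hours)

# Sanity check: count each label so unexpected values stand out
print(df_demo["care_hours"].value_counts())
```

Open each printed file and confirm that the year, condition and care hours match what the row says.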

In [10]:
## export to csv (a bare filename saves to the current working directory)

filename = "care_decision.csv"
df.to_csv(filename, encoding="UTF-8", index=False)
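
The assignment asks for the CSV to land in your downloads, whereas a bare filename is written to the current working directory. A minimal sketch of one way to target a downloads folder follows. It assumes the folder is `~/Downloads`, which is common but not guaranteed on every machine, and `export_csv` is a hypothetical helper name.

```python
from pathlib import Path

import pandas as pd

def export_csv(df, filename, folder=None):
    """Write df to folder/filename and return the full path (hypothetical helper).

    folder defaults to ~/Downloads, which is an assumption about the machine.
    """
    folder = Path(folder) if folder is not None else Path.home() / "Downloads"
    folder.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    target = folder / filename
    df.to_csv(target, encoding="UTF-8", index=False)
    return target

# Usage, with df being the dataframe built above:
# export_csv(df, "care_decision.csv")
```

Returning the path makes it easy to print where the file ended up.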