# Parsing raw unstructured text files

Environment: Python 3.6.5 and Jupyter notebook

Libraries used in the task:
* re (for regular expressions)
* json (for transforming data to JSON format)

## 1. Introduction
This notebook extracts data from a huge unstructured text file containing more than 30,000 job postings. The job postings have details about the job like job title, job description, job location, required qualifications, application procedure, application deadline, etc. The extracted data is transformed to XML and JSON formats.  

Regular expressions were written for the section keys of the job postings, that is, the labels that introduce the job ID, title, description, location, and so on. Each expression is designed to match every variant of a section key that appears in the data. The start and end indices of the section keys in a job posting were retrieved with the re library, and the text of each section was then extracted by string slicing. The extracted data for every job posting was stored in a dictionary keyed by section name.

The extracted data was transformed to JSON format with the json library and to XML format with plain Python: loops, conditional statements, and string/list methods. Finally, the transformed data was written to files that can be used for downstream analysis.



## 2.  Import libraries 

In [1]:
# Code to import libraries that are needed for this assessment:
import re
import json

## 3. Examining and loading the data file

In [2]:
with open('raw_job_postings_data.dat','r') as file:
    file_contents = file.read()  # the with block closes the file automatically

**Checking** that the data file has been loaded correctly:

In [3]:
print(len(file_contents))
file_contents[:500]

72397697


'ID: 69290\nJOB RESPONSIBILITIES:\n - Develop online/ mobile games working with the team very closely (being\na team player, not a solo);\n- Work with Designers and Illustrators on artwork and design\nimplementation into the games;\n- Define specifications of game features together with Product Managers;\n- Develop and architect different types of frameworks and toolsets;\n- Constantly learn and grow your skills.\nDEADLINES: 25 January 2014\nLOCATION: Yerevan, Armenia\nabout_company:\n Arplan LLC is an archi'

When we examine the data file, we can observe that the job postings are separated by a line of hyphens followed by a newline (30 hyphens, which is what the regex `-{30}\n` below matches). We can make the task more efficient by splitting the data file so that each element contains the information about one job posting. **Splitting the text file into a list of job postings:**

In [4]:
job_listings = re.split(r'-{30}\n',file_contents)
job_listings.pop() # remove last empty element
print(len(job_listings))

30676


There are **30,676 job postings** in the data file.  
Let us examine the patterns of a job posting more closely before we begin the extraction. 

In [5]:
job_listings[25]

"ID: 39618\ndeadline: 31 August 2007\n_LOCS: Yerevan, Armenia\nABOUT:\n Square One Restaurants is a chain of restaurants.\njob_desc: The Customer Service Representative will be responsible\nfor the communication and clients' support by phone and in the office.\nqualifications:\n - University degree in Computer Sciences or a related field;\n- 3-5 years of work experience in database design, development and\noptimization technology;\n- Excellent knowledge of OOP, T-SQL, PL-SQL, C#, ASP.net;\n- Good knowledge of Armenian and Russian languages, knowledge of\ntechnical English language;\n- Problem-solving and decision-making skills;\n- Good time management and organizational skills;\n- Knowledge of accounting is a plus.\nREMUNERATION/\nSALARY: Competitive, based on work experience and\neducational background.\ntitle: Assistant to Executive Director\nPROCEDURE: All interested and qualified candidates are\nwelcome to send their CV to: info@... . Please indicate the\nposition title in the subj

The job posting has several sections, each introduced by a section key followed by a colon, for example 'ID:', '_LOCS:', 'job_desc:', etc. The section keys are not spelled consistently throughout the data file, and they do not appear in the same order in every job posting. The list of sections is as follows:
* ID
* title
* location
* job_descriptions
* job_responsibilities
* required_qualifications
* salary
* application_procedure
* start_date
* application_deadline
* about_company

We need to write a regular expression for each section key. **The general approach is to observe the different keys for a section and write a regular expression to match all possible keys for the section.**  


**Note that the regex for each section key only searches at the start of a line** (with the re.M flag, "^" matches at the start of every line), because the keys appear only at the beginning of a line.  
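The effect of the multiline flag can be seen on a toy string (the posting text below is made up for illustration). Without `re.M`, `^` anchors only at the very start of the string, so a key on a later line is not found:

```python
import re

# Made-up posting text, for illustration only.
toy = "ID: 1\nTITLE: Chef\n"

# Without re.M, ^ matches only at the start of the whole string.
no_flag = re.search(r"^TITLE: ", toy)

# With re.M, ^ also matches immediately after every newline.
with_flag = re.search(r"^TITLE: ", toy, re.M)

print(no_flag)                  # None
print(repr(with_flag.group()))  # 'TITLE: '
```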

Writing regular expressions for each section key:   
I. Regex for **ID**:<br>
The listing must start with "ID: ", followed by at least one digit and a newline.

In [6]:
# ID Regex Pattern
pattern_ID = re.compile(r"^(ID: )\d+\n")  # raw string, so \d is not treated as a string escape
m = pattern_ID.search(job_listings[436])
m.group(1)

'ID: '

II. Regex for **Title**:  
  
The possible keys are TITLES, TITLE, JOB TITLE, title, _TTL, JOB_T. The approach taken to write the regex is:  
1. Write a regex to accommodate the first 4 keys, which contain "JOB" and "TITLE": "JOB " can be optional at the beginning, followed by "TITLE | title", and "S" can be optional at the end
2. Modify the expression to accommodate the key "_TTL" by making an underscore optional at the beginning and allowing "TTL" in the "TITLE" part of the expression; that part of the expression would now be "TITLE | title | TTL"
3. Modify the expression to accommodate the key "JOB_T" by splitting "JOB " into "JOB" and " " and adding "_T" as optional after "JOB"  

Finally, the regex begins with an underscore as optional, followed by "JOB" as optional, followed by "_T" as optional, a space as optional, followed by "TITLE | title | TTL" as optional and finally "S" as optional.

In [7]:
#title [TITLES,TITLE,JOB TITLE,title,_TTL,JOB_T]
#1. pattern_title = re.compile("^(?:JOB )?(?:TITLE|title)[Ss]?: (.*?)\n",re.MULTILINE)
#2. pattern_title = re.compile("^(?:JOB )?(?:TITLE|title|_TTL)[Ss]?: (.*?)\n",re.MULTILINE)
#3. pattern_title = re.compile("^_?(?:JOB)?(?:_T)? ?(?:TITLE|title|TTL)?[Ss]?: (.*?)\n",re.MULTILINE)
pattern_title = re.compile("^(_?(?:JOB)?(?:_T)? ?(?:TITLE|title|TTL)?S?: ).*?\n",re.M)
m = pattern_title.search(job_listings[436])
m.group(1)

'JOB TITLE: '

III. Regex for **Location**:

The possible keys are LOCATION, LOCATIONS, _LOCS, JOB_LOC.  

The regex approach was quite straightforward, and the final expression was found within one iteration. The regex here is to begin with "JOB" as optional, followed by "_" as optional, followed by "LOC" (mandatory), followed by "ATION" as optional and finally "S" as optional.

In [8]:
#Location:[LOCATION,LOCATIONS,_LOCS,JOB_LOC]
#pattern_location = re.compile("^_?LOC(?:ATION)?S?: (.*?)\n,",re.MULTILINE)
pattern_location = re.compile("^((?:JOB)?_?LOC(?:ATION)?S?: ).*?\n",re.M)
m = pattern_location.search(job_listings[2])
m.group(1)

'_LOCS: '

IV. Regex for **Job Descriptions**: <br> 

The possible keys are JOB_DESC, JOB DESCRIPTION, job_desc, _description, DESCRIPTION.

The regex approach was quite straightforward, and the final expression was found within one iteration. The regex here is to begin with "JOB | job" as optional, followed by "_" or " " as optional, followed by "DESC | desc" (mandatory), followed by "RIPTION | ription" as optional.

In [9]:
#Job_Description[JOB_DESC, JOB DESCRIPTION,job_desc,_description,DESCRIPTION,]
pattern_job_desc = re.compile(r"^((?:JOB|job)?[_ ]?(?:DESC|desc)(?:RIPTION|ription)?:\s).*?\n",re.M)
m = pattern_job_desc.search(job_listings[19])
m.group(1)

'_description: '

V. Regex for **Job Responsibilities**:  
    
The possible keys are JOB_RESPS, RESPONSIBILITY, RESP, JOB RESPONSIBILITIES, responsibilities. The approach taken to write the regex is:  
1. Write a regex to accommodate the first 3 keys: "JOB_" can be optional at the beginning, followed by "RESP" (mandatory), and "ONSIBILITY | S" can be optional at the end
2. Modify the expression to accommodate the key "JOB RESPONSIBILITIES" by splitting "JOB_" into "JOB" and "_" (or a space), both optional at the beginning, and splitting the "ONSIBILITY | S" part of the expression into "ONSIBILIT" and "Y | IES"
3. Modify the expression to accommodate the key "responsibilities" by allowing either lower or upper case wherever "responsibilities" is present in the expression

Finally, the regex begins with "JOB | job" as optional, followed by "_" or " " as optional, "RESP | resp" (mandatory), followed by "ONSIBILIT | onsibilit" as optional and finally "Y | IES | ies | y | S" as optional. Note that I am adding "JOB | job" at the beginning in case I missed a key with lower case "job".

In [10]:
#[JOB RESPONSIBILITIES,JOB_RESPS,RESPONSIBILITY,RESP,responsibilities,]
pattern_job_resp = re.compile(r"^((?:JOB|job)?[_ ]?(?:RESP|resp)(?:ONSIBILIT|onsibilit)?(?:Y|IES|ies|y|S)?:\s).*?\n",re.M)
m = pattern_job_resp.search(job_listings[498])
m.group(1)

'RESPONSIBILITY:\n'

VI. Regex for **required qualifications**:

The possible keys are qualifications, QUALIFICATION, REQUIRED QUALIFICATIONS, QUALIFS, REQ_QUAL.

The approach taken to form the regex is:
- Create a regex for the first 3 keys: start with "REQUIRED" as optional, then a space as optional, followed by "QUALIFICATION | qualification" (mandatory) and "S | s" as optional
- Modify the regex to accommodate the last 2 keys by splitting "REQUIRED" into "REQ" as optional, followed by an underscore as optional and "UIRED" as optional; splitting "QUALIFICATION | qualification" into "QUAL | qual" (mandatory), followed by "IF | IFICATION | ification" as optional and finally "S | s" as optional.

Finally, the regex begins with "REQ" as optional, followed by an underscore as optional and "UIRED" as optional, a space as optional, "QUAL | qual" (mandatory), followed by "IF | IFICATION | ification" as optional and finally "S | s" as optional.

In [11]:
#[REQ_QUALS,qualifications,QUALIFICATION,REQUIRED QUALIFICATIONS,QUALIFS]
pattern_qual = re.compile(r"^((?:REQ)?[_]?(?:UIRED)? ?(?:QUAL|qual)(?:IF|IFICATION|ification)?[Ss]?:\s).*?\n",re.M)
m = pattern_qual.search(job_listings[498])
m.group(1)

'qualifications:\n'

VII. Regex for **Job Salary**: 

The possible keys are JOB_SAL, SALARY, salary, remuneration, REMUNERATION.  
I wasn't able to come up with a regex similar to other sections' regex because this section has very varied keys. The regex here is to have the line start with any of the above section keys.

In [12]:
#Salary [JOB_SAL,SALARY,salary,remuneration,REMUNERATION]
pattern_sal = re.compile(r"^((?:JOB_SAL|SALARY|salary|REMUNERATION|remuneration):\s).*?\n",re.M)
m = pattern_sal.search(job_listings[0])
m.group(1)

'JOB_SAL: '

VIII. Regex for **Application procedure**: 

The possible keys are PROCEDURES, procedures, JOB_PROC, JOB_PROCS, PROCEDURE, procedure.  

This regex is quite straightforward. The regex here is to begin with "JOB" as optional, then underscore as optional, followed by "PROC | proc" (mandatory), "EDURE | edure" as optional, and finally "S | s" as optional.

In [13]:
#[PROCEDURES,procedures,JOB_PROC,JOB_PROCS,PROCEDURE, procedure]
pattern_proc = re.compile(r"^((?:JOB)?[_]?(?:PROC|proc)(?:EDURE|edure)?[Ss]?:\s).*?\n",re.M)
m = pattern_proc.search(job_listings[0])
m.group(1)

'PROCEDURES: '

IX. Regex for **Start Date**: 

The possible keys are start_date, DATE_START, START DATE, START_DA, DATES.  

Creating this regex is also straightforward. The regex begins with "START | start" as optional, then "underscore | space" as optional, followed by "DA | da" (mandatory), "TE | te" as optional, "_START" as optional and finally "S" as optional.

In [14]:
#[start_date,DATE_START,START DATE,START_DA,DATES]
pattern_start_date = re.compile(r"^((?:START|start)?[_ ]?(?:DA|da)(?:TE|te)?(?:_START)?[S]?:\s).*?\n",re.M)
m = pattern_start_date.search(job_listings[1300])
m.group(1)

'DATES: '

X. Regex for **Application Deadline**: 

The possible keys are DEADLINES, DEAD_LINE, deadline, APPLICATION_DEADL, Application_DL.  
The approach taken to form the regex is:
- Create a regex for the first 4 keys: start with "APPLICATION" as optional, then underscore as optional, followed by "DEAD | dead" (mandatory), underscore as optional, "LINE | line" as optional and finally "S | s" as optional
- Modify the regex to accommodate the last key by adding "| DL" to the mandatory portion; the mandatory portion would now be "DEAD | dead | DL". Split the optional "LINE | line" into "L | l" as optional and "INE | ine" as optional

Finally, the regex here is to start with 'APPLICATION' as optional, then underscore as optional, followed by "DEAD | dead | DL" (mandatory), underscore as optional, "L | l" as optional, "INE | ine" as optional and finally "S | s" as optional.

In [15]:
#[DEADLINES,DEAD_LINE,deadline,APPLICATION_DEADL,Application_DL]
pattern_deadline = re.compile(r"^((?:APPLICATION)?[_]?(?:DEAD|dead|DL)[_]?[Ll]?(?:INE|ine)?[Ss]?:\s).*?\n",re.M)
m = pattern_deadline.search(job_listings[1173])
m.group(1)

'APPLICATION_DEADL: '

XI. Regex for **information about the company**:  
  
The possible keys are about_company, ABOUT COMPANY, ABOUT, COMPANYS_INFO, _info.  
The approach taken to form the regex is:
- Create a regex for the first 3 keys: start with "ABOUT | about", then "underscore | space" as optional, followed by "COMPANY | company" as optional
- Modify the regex to accommodate the last 2 keys by appending "S" as optional, an underscore as optional and "INFO | info" as optional. Change the "ABOUT | about" to optional.  

Finally, the regex here is to start with "ABOUT | about" as optional, then "underscore | space" as optional, followed by "COMPANY | company" as optional, "S" as optional, underscore as optional and finally "INFO | info" as optional. 

In [16]:
#[about_company,ABOUT COMPANY,ABOUT,COMPANYS_INFO,_info,]
pattern_info = re.compile(r"^((?:ABOUT|about)?[_ ]?(?:COMPANY|company)?S?[_]?(?:INFO|info)?:\s).*?\n",re.M)
m = pattern_info.search(job_listings[2])
m.group(1)

'ABOUT COMPANY:\n'

Storing all the compiled regular expressions in a dictionary:

In [17]:
regex_dict = {
    'ID' : pattern_ID,
    'title' : pattern_title,
    'location' : pattern_location,
    'job_descriptions' : pattern_job_desc,
    'job_responsibilities' : pattern_job_resp,
    'required_qualifications' : pattern_qual,
    'salary' : pattern_sal,
    'application_procedure' : pattern_proc,
    'start_date' : pattern_start_date,
    'application_deadline' : pattern_deadline,
    'about_company' : pattern_info
            }
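Before running the extraction, it can be worth checking each pattern against the list of keys it is meant to cover. The sketch below is one way to do that; `uncovered_keys` is a hypothetical helper (not part of the notebook), the sample lines are made up, and the two patterns are copied from the cells above. It assumes the keys are spelled in the data exactly as they are listed in the text.

```python
import re

def uncovered_keys(pattern, keys):
    """Return the keys that `pattern` does not recognise at the start of a line."""
    missed = []
    for key in keys:
        sample = key + ": some text\n"  # hypothetical one-line section
        match = pattern.search(sample)
        if match is None or match.group(1) != key + ": ":
            missed.append(key)
    return missed

# Patterns copied from the cells above.
pattern_title = re.compile(r"^(_?(?:JOB)?(?:_T)? ?(?:TITLE|title|TTL)?S?: ).*?\n", re.M)
pattern_deadline = re.compile(
    r"^((?:APPLICATION)?[_]?(?:DEAD|dead|DL)[_]?[Ll]?(?:INE|ine)?[Ss]?:\s).*?\n", re.M)

title_missed = uncovered_keys(
    pattern_title, ["TITLES", "TITLE", "JOB TITLE", "title", "_TTL", "JOB_T"])
deadline_missed = uncovered_keys(
    pattern_deadline,
    ["DEADLINES", "DEAD_LINE", "deadline", "APPLICATION_DEADL", "Application_DL"])

print(title_missed)     # []
print(deadline_missed)  # ['Application_DL']
```

Under that assumption, the mixed-case key `Application_DL` is not matched, because the pattern only allows the upper-case prefix `APPLICATION`. Whether this matters depends on how the key is actually spelled in the raw file, which is worth confirming before changing the pattern.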

While writing the regex for section keys, I noticed that a few keys are preceded by a **garbage** line such as OPEN TO/, REMUNERATION/, START DATE/, etc. Below is a **regular expression** to catch these lines so that we can **exclude** them from the extracts:  
The regex starts with a newline, followed by a capital letter, followed by word characters or spaces, and ends with "/". In a few cases there are multiple newlines before the garbage, as in "\n\n\nOPEN TO/"; the expression accommodates these with \n{1,4}, and the re.S flag lets "." match across newlines. The final expression is in the code below.

In [18]:
extract_garbage_rx = re.compile(r"(.*?)\n{1,4}[A-Z](?:[\w ])+?/$",re.S)
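As a quick check of the garbage pattern, the sketch below applies it to a made-up extract that ends with the stray first half of a two-line key. Only the trailing line is removed; the earlier newlines inside the extract are kept because the lazy group grows until the rest of the pattern can match at the end of the string.

```python
import re

extract_garbage_rx = re.compile(r"(.*?)\n{1,4}[A-Z](?:[\w ])+?/$", re.S)

# Made-up extract ending with a stray "REMUNERATION/" line.
extract = "- Fluency in English;\n- Knowledge of accounting is a plus.\nREMUNERATION/"

match = extract_garbage_rx.search(extract)
cleaned = match.group(1) if match else extract

print(repr(cleaned))
# '- Fluency in English;\n- Knowledge of accounting is a plus.'
```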

## 4. Extracting the data

Now that all the regular expressions are written, it is time to begin extracting the data. I am using a function to parse each job listing, extract the relevant data for each section and **store the data in a dictionary** with the section name as the key. For example, for job listing 1, the dictionary would be 'ID' :  '69290', 'title' : 'IT Reporting System Administration Senior Specialist',... and so on.  

The **extraction methodology** for a listing is:  
1. Match the section keys. If there is no match for a particular section, put "N/A" to indicate a missing section. Some listings do not have some sections. For example, the first listing with ID '69290' doesn't contain the 'job description' section
2. Retrieve the exact section keys, for example '_TTL' for 'title'
3. Retrieve the start indices of all the section keys in a listing
4. Sort the start indices of the section keys to get the order of the sections in the listing
5. Use string slicing to extract the relevant data for that section. For slicing, we need the start and stop index for the extract; the start index will be the start index of the section key + length(exact section key) and the stop index will be the start index of the next section - 1
6. Use the garbage regex defined above to check whether the extract contains garbage like OPEN TO/, START DATE/, etc. If garbage is present, remove it from the extract
7. Store the clean extract for the section with its section key in a dictionary
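The index arithmetic in the slicing step can be illustrated on a toy listing. This is a minimal sketch: the listing text and the two simplified patterns are made up for illustration and are much narrower than the real section-key patterns.

```python
import re

# Hypothetical listing and simplified key patterns, for illustration only.
toy = "ID: 1\nTITLE: Chef\nLOCATION: Yerevan\n"
patterns = {
    "title": re.compile(r"^(TITLE: )", re.M),
    "location": re.compile(r"^(LOCATION: )", re.M),
}

# Map each key's start index to (section name, exact key text).
found = {}
for name, rx in patterns.items():
    m = rx.search(toy)
    if m:
        found[m.start(1)] = (name, m.group(1))

starts = sorted(found)  # order of the sections in the listing
sections = {}
for i, start in enumerate(starts):
    name, wording = found[start]
    begin = start + len(wording)
    # Stop one character before the next key (dropping the newline),
    # or one character before the end of the listing for the last section.
    end = starts[i + 1] - 1 if i + 1 < len(starts) else len(toy) - 1
    sections[name] = toy[begin:end]

print(sections)  # {'title': 'Chef', 'location': 'Yerevan'}
```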

In [19]:
def listing_parser(listing):
    sec_index = {}
    sec_wording = {}
    sec_data = {}
    sec_missing = []
    for key, pattern in regex_dict.items():
        # Matching section keys
        match = pattern.search(listing)
        if match:
            # Retrieving the exact key and start index of the key
            sec_index[match.start(1)] = key
            sec_wording[match.start(1)] = match.group(1)
        else:
            sec_missing.append(key)
    # Ordering the start indices of the keys
    ordered_sec_index = sorted(sec_index)
    # Extracting the relevant data using string slicing
    for i in range(len(ordered_sec_index)):
        extract_start_index = ordered_sec_index[i] + len(sec_wording[ordered_sec_index[i]])
        if i == len(ordered_sec_index) - 1: # Special case for the last section key
            extract_stop_index = len(listing) - 1
        else:
            extract_stop_index = ordered_sec_index[i + 1] - 1
        # Slicing to get the extract
        extract = listing[extract_start_index:extract_stop_index]
        
        # Removing garbage like OPEN TO/, REMUNERATION/, START DATE/, etc. at the end of a few extracts
        if len(extract) > 0 and extract[-1] == "/":
            match = extract_garbage_rx.search(extract)
            if match:
                extract = match.group(1)
        
        sec_data[sec_index[ordered_sec_index[i]]] = extract
    # Adding N/A for missing sections in a listing
    for section in sec_missing:
        sec_data[section] = 'N/A'
    return sec_data

**Checking the function** we defined above:

In [20]:
listing_parser(job_listings[0])

{'ID': '69290',
 'job_responsibilities': ' - Develop online/ mobile games working with the team very closely (being\na team player, not a solo);\n- Work with Designers and Illustrators on artwork and design\nimplementation into the games;\n- Define specifications of game features together with Product Managers;\n- Develop and architect different types of frameworks and toolsets;\n- Constantly learn and grow your skills.',
 'application_deadline': '25 January 2014',
 'location': 'Yerevan, Armenia',
 'about_company': ' Arplan LLC is an architectural consulting company\nworking on international projects.',
 'required_qualifications': ' - Minimum of Masters degree in education, training and/or training\nmethodology; \n- Minimum of ten years work experience as a trainer, curriculum\ndeveloper, or workforce development administrator;\n- Fluency in English language;\n- Fluency in Armenian or Russian is preferred.',
 'salary': 'Competitive, based on previous experience and\nprofessional skills

**Running the parser function for all the listings**:

In [21]:
full_extracted_data = []
for listing in job_listings:
    each_extract = listing_parser(listing)
    full_extracted_data.append(each_extract)

In [22]:
type(full_extracted_data) # list of dictionaries

list

**For some sections, the information is in the form of a list**; for example, the responsibilities section contains a list of responsibilities. Therefore, it is useful to **convert the extracted information for these sections into a list.** 

In [23]:
sub_sections_dict = {
    'job_responsibilities' : 'responsibility',
    'required_qualifications' : 'qualification',
    'job_descriptions' : 'description'}

For splitting the extracted information into lists, we need to identify the **splitting pattern**. After observing the data, I came up with the following 2 possible patterns:  
**Pattern I**. Section Key:   
1. Text:   
    - Sub-bullet text     
    - Sub-bullet text  
2. Text:  
     - Sub-bullet text  
     - ...
  
For example,  
RESPONSIBILITY:  
   1. Publicity/Program activities:  
        - Organize country-wide Internet conferences/thematic web chats  
   2. Administration:  
        - Process weekly and monthly site reports and produce regular feedback to the staff  

**Pattern II**. Section Key:  
      - Sub-bullet text  
      - Sub-bullet text  
       ....  
For example,  
   RESPONSIBILITY:  
     - Work with the product team to define functional requirements;  
    - Produce customer and other third party facing product documentation.  

Below, I am splitting the extracted information for an example for each of the patterns:  
**Pattern I**

In [24]:
# Pattern I
extract_split = re.split(r" ?\w\)(.*?:)\n",full_extracted_data[4]['job_responsibilities'])
sub_section_data = []
if len(extract_split) > 1:
    for i in range(len(extract_split)):
        if extract_split[i] == '' or extract_split[i][0] == '-':
            continue
        elif extract_split[i][-1] == ':':
            join = extract_split[i] + '\n' + extract_split[i + 1]
            sub_section_data.append(join)
        else:
            sub_section_data.append(extract_split[i])
sub_section_data

[' Publicity/Program activities:\n- Organize country-wide Internet conferences/thematic web chats \n- Draft and/or supervise various sub-projects (Volunteer and Intern\nPrograms, interaction with Peace Corps Volunteers and local NGOs, user\nsurveys, etc.)\n- Prepare weekly program news for submission to regional management and\nUS Department of State\n- Prepare weekly news briefs (in English and Armenian) for the local\nprogram website\n- Support TC and CC with distance learning project as needed \n',
 ' Administration:\n- Process weekly and monthly site reports and produce regular feedback\nto the staff\n- Assist IATP Country Coordinator in writing country reports\n- Distribute various administrative/program announcements to field\nstaff\n- Answer various field requests or forward them to the appropriate\nmanager\nOther duties as assigned.']

The splitting worked for Pattern I. Next, splitting an example of Pattern II:  
**Pattern II**

In [25]:
sub_split_2_rx = re.split(";?\n-",full_extracted_data[0]['job_responsibilities'])
sub_split_2_rx

[' - Develop online/ mobile games working with the team very closely (being\na team player, not a solo)',
 ' Work with Designers and Illustrators on artwork and design\nimplementation into the games',
 ' Define specifications of game features together with Product Managers',
 ' Develop and architect different types of frameworks and toolsets',
 ' Constantly learn and grow your skills.']

Implementing the **splitting for all the listings**:
The steps for this are - 
- Check if the section key is of one of the sections for which splitting is required. Implement the remaining steps only for the sections with possible sub-sections
- If the information follows pattern 1, split based on pattern 1; else split based on pattern 2. If the information doesn't fall under one of the 2 patterns, have one element in the list (sub-section) with the entire extracted information for that section
- Overwrite the section's entry in the full_extracted_data list with the updated information after splitting

In [26]:
# Adding sub-sections for responsibilities, qualifications and job description
# Splitting extracted information for these sections
for listing in full_extracted_data:
    for key, extract in listing.items():
        if key in sub_sections_dict.keys():
            sub_sec_data_dict = {}
            sub_section_data = []
            if key == 'job_descriptions':
                s = extract.replace('\n',' ')
                sub_section_data.append(s)
            else:
                # Pattern 1
                extract_split = re.split(r" ?\w\)(.*?:)\n",str(extract))
                if len(extract_split) > 1:
                    for i in range(len(extract_split)):
                        if extract_split[i] == '' or extract_split[i][0] == '-':
                            continue
                        elif extract_split[i][-1] == ':':
                            join = extract_split[i] + '\n' + extract_split[i + 1]
                            join = join.replace('\n',' ') # replace newlines with spaces so wrapped words are not fused
                            sub_section_data.append(join)
                        else:
                            s = extract_split[i].replace('\n',' ') # replace newlines with spaces
                            sub_section_data.append(s)
                else: # Pattern 2
                    extract_split_2 = re.split(";?\n-",str(extract))
                    if len(extract_split_2) > 1:
                        for i in range(len(extract_split_2)):
                            if i == 0:
                                if extract_split_2[0][1] == '-':
                                    first_element_extract = extract_split_2[0][2:]
                                    first_element_extract = first_element_extract.replace('\n',' ') # remove newlines 
                                    sub_section_data.append(first_element_extract)
                                else:
                                    s = extract_split_2[i].replace('\n',' ') # remove newlines
                                    sub_section_data.append(s)
                            else:
                                s = extract_split_2[i].replace('\n',' ') # remove newlines
                                sub_section_data.append(s)
                    else: # None of the patterns
                        s = extract.replace('\n',' ')
                        sub_section_data.append(s)
            # Overwriting the information to the full_extracted_data dictionary
            sub_sec_data_dict[sub_sections_dict[key]] = sub_section_data
            listing[key] = sub_sec_data_dict
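The Pattern II branch can also be written as a small standalone helper, which makes it easier to test in isolation. This is a sketch only: `split_dash_bullets` is a hypothetical name, the sample text is made up, and the leading dash is removed with a regex rather than by index.

```python
import re

def split_dash_bullets(extract):
    """Split a '-'-bulleted block into a list of single-line items."""
    items = re.split(r";?\n-", extract)
    # re.split leaves the dash of the first bullet in place, so drop it here.
    items[0] = re.sub(r"^\s*-", "", items[0])
    # Join wrapped lines with a space so that words are not fused together.
    return [item.replace("\n", " ") for item in items]

sample = " - Develop games (being\na team player);\n- Work with Designers;\n- Learn."
bullets = split_dash_bullets(sample)

print(bullets)
# [' Develop games (being a team player)', ' Work with Designers', ' Learn.']
```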

**Removing newlines from the extracted information** for all the sections. Note that for the sections with sub-sections, newlines have already been removed.

In [27]:
for listing in full_extracted_data:
    for key, extract in listing.items():
        if key not in sub_sections_dict.keys():
            extract = extract.replace('\n',' ') # remove newlines
            listing[key] = extract

In [28]:
print(full_extracted_data[0]) # quality check

{'ID': '69290', 'job_responsibilities': {'responsibility': [' Develop online/ mobile games working with the team very closely (being a team player, not a solo)', ' Work with Designers and Illustrators on artwork and design implementation into the games', ' Define specifications of game features together with Product Managers', ' Develop and architect different types of frameworks and toolsets', ' Constantly learn and grow your skills.']}, 'application_deadline': '25 January 2014', 'location': 'Yerevan, Armenia', 'about_company': ' Arplan LLC is an architectural consulting company working on international projects.', 'required_qualifications': {'qualification': [' Minimum of Masters degree in education, training and/or training methodology; ', ' Minimum of ten years work experience as a trainer, curriculum developer, or workforce development administrator', ' Fluency in English language', ' Fluency in Armenian or Russian is preferred.']}, 'salary': 'Competitive, based on previous experi

## 5. Writing the extracted information to a JSON file

The extracted data was transformed into JSON format using the **json.dump** method of the json library. A dictionary is built around the full_extracted_data list to get the output in the desired format; the 'listings' and 'listing' keys are added. The JSON dictionary is now a dictionary of dictionaries, with the list of listings at the innermost level. It is passed to json.dump to write the extracted information to a JSON file.

In [29]:
# Making the JSON dictionary
json_dict = {'listings' : {'listing' : full_extracted_data}}

In [30]:
with open('job_postings.json', 'w') as output_file:  
    json.dump(json_dict, output_file,indent=4)  # the with block closes the file automatically
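A simple way to gain confidence in the written JSON is to read it back and compare it with the original structure. The sketch below does this round trip on a tiny made-up stand-in for the extracted data and uses an in-memory buffer in place of the output file.

```python
import io
import json

# Tiny stand-in for full_extracted_data (values are made up).
sample = [{"ID": "1", "title": "Chef",
           "job_responsibilities": {"responsibility": ["Cook", "Clean"]}}]
json_dict = {"listings": {"listing": sample}}

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
json.dump(json_dict, buffer, indent=4)
buffer.seek(0)
loaded = json.load(buffer)

first = loaded["listings"]["listing"][0]
print(first["job_responsibilities"]["responsibility"])  # ['Cook', 'Clean']
```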

## 6. Writing the extracted information to an XML file

The extracted information has been transformed to XML format using string/list methods in Python.  
  
Define the **initial line and the root tag** in the XML tree:

In [31]:
xml_output = ['<?xml version="1.0" encoding="UTF-8"?>\n'] #initial line declaring the text encoding format
xml_output.append('<listings>\n') #root start tag
xml_output

['<?xml version="1.0" encoding="UTF-8"?>\n', '<listings>\n']

**Replacing '&' and '<'** in the extracted information with **'&amp;amp;' and '&amp;lt;'** respectively:  

This is done because the XML parser will interpret the special meaning of these characters. For example, the parser will interpret "<" as the start of a new element.

In [32]:
for listing in full_extracted_data:
    for key, extract in listing.items():
        if key not in sub_sections_dict.keys():
            # Escape '&' first; otherwise the '&' in '&lt;' would be escaped again
            extract = extract.replace('&','&amp;')
            extract = extract.replace('<','&lt;')
            listing[key] = extract
        else:
            for subsection, content in extract.items():
                xml_content = []
                for item in content:
                    item = item.replace('&','&amp;')
                    item = item.replace('<','&lt;')
                    xml_content.append(item)
                extract[subsection] = xml_content
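The order of the two replacements matters. The sketch below uses a made-up string to show what happens in each order, and also shows `xml.sax.saxutils.escape` from the standard library, which escapes `&`, `<` and `>` and could be used instead of the manual replacements.

```python
from xml.sax.saxutils import escape

text = "R&D < 5 years"

# Ampersand first, then '<': each special character is escaped exactly once.
correct = text.replace("&", "&amp;").replace("<", "&lt;")

# '<' first: the '&' introduced by '&lt;' is then escaped a second time.
double_escaped = text.replace("<", "&lt;").replace("&", "&amp;")

# Standard library helper, shown here as an alternative.
via_stdlib = escape(text)

print(correct)         # R&amp;D &lt; 5 years
print(double_escaped)  # R&amp;D &amp;lt; 5 years
print(via_stdlib)      # R&amp;D &lt; 5 years
```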

Defining a function to convert the passed extract content of section/sub-section to XML format by adding the passed number of tabs, start tag, extract content and the end tag:

In [33]:
def writing_extract(key,extract,no_of_tabs):
    ext_output = []
    ext_output.append('\t'*no_of_tabs + '<' + key + '>' + extract + '</' + key + '>\n')
    return ext_output

writing_extract('title', full_extracted_data[0]['title'],2) # example

['\t\t<title>IT Reporting System Administration Senior Specialist</title>\n']

Defining a **function to transform the extract of the listing to XML format**. It iterates through the extract dictionary, and uses the 'writing_extract' function defined above to transform the extract for the section/sub-section to XML Format. The function returns the output string with the extracted listing in XML style. 

In [34]:
def xml_transformation(listing_extract):
    output = []  # list of XML lines, joined once at the end
    # adding elements corresponding to the sections of the listing 
    for key, extract in listing_extract.items():
        if key == 'ID':
            output.append('\t<listing id="' + extract + '">\n')
        elif key in sub_sections_dict.keys():
            output.append('\t\t<'+ key + '>\n')
            # 3 tabs for sub-sections
            for subsection, content in extract.items():
                for item in content:
                    output.extend(writing_extract(subsection,item,3))
            output.append('\t\t</'+ key + '>\n')
        else: # 2 tabs for sections
            output.extend(writing_extract(key,extract,2))
    output.append('\t</listing>\n')
    output_str = ''.join(output)
    return output_str

**Running the function defined above for all listings:**

In [35]:
for listing_extract in full_extracted_data:
    xml_output.append(xml_transformation(listing_extract))

In [36]:
xml_output[:3] # quality check

['<?xml version="1.0" encoding="UTF-8"?>\n',
 '<listings>\n',
 '\t<listing id="69290">\n\t\t<job_responsibilities>\n\t\t\t<responsibility> Develop online/ mobile games working with the team very closely (being a team player, not a solo)</responsibility>\n\t\t\t<responsibility> Work with Designers and Illustrators on artwork and design implementation into the games</responsibility>\n\t\t\t<responsibility> Define specifications of game features together with Product Managers</responsibility>\n\t\t\t<responsibility> Develop and architect different types of frameworks and toolsets</responsibility>\n\t\t\t<responsibility> Constantly learn and grow your skills.</responsibility>\n\t\t</job_responsibilities>\n\t\t<application_deadline>25 January 2014</application_deadline>\n\t\t<location>Yerevan, Armenia</location>\n\t\t<about_company> Arplan LLC is an architectural consulting company working on international projects.</about_company>\n\t\t<required_qualifications>\n\t\t\t<qualification> Minim

- **Closing the root tag** 'listings'
- Converting the list with xml outputs of each listing to a string

In [37]:
xml_output.append('</listings>\n')
xml_output_str = ''.join(xml_output)

**Writing** the xml output string to a **file**:

In [38]:
with open('job_postings.xml','w') as output_file:
    output_file.write(xml_output_str)  # the with block closes the file automatically
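Because the XML is assembled by hand, it is worth confirming that the result is well-formed. One option is to parse it with `xml.etree.ElementTree` from the standard library, which raises `ParseError` on malformed markup. The sketch below uses a small hand-written document in the same shape as the generated file; its content is made up.

```python
import xml.etree.ElementTree as ET

# Small hand-written document in the same shape as the generated file.
xml_sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<listings>\n'
    '\t<listing id="1">\n'
    '\t\t<title>R&amp;D Engineer</title>\n'
    '\t</listing>\n'
    '</listings>\n'
)

# fromstring raises ET.ParseError if the markup is not well-formed.
root = ET.fromstring(xml_sample.encode("utf-8"))
listing = root.find("listing")

print(root.tag)                    # listings
print(listing.get("id"))           # 1
print(listing.find("title").text)  # R&D Engineer
```

For the real output, `ET.parse('job_postings.xml')` would perform the same check on the file written above.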

## 7. Summary

The task demonstrated the process of extracting data from large unstructured text files and transforming the data to XML and JSON format. The extraction was done by identifying the patterns of the section keys and writing corresponding efficient regular expressions. The data transformation to XML format was done manually, helping us get a very good grasp of the XML structure. The XML transformation also required us to identify the patterns in the data to split certain sections like responsibilities to sub-sections.  

The other learnings from this task, not necessarily specific to the process of data wrangling, were:
- The general approach to extracting data is to identify patterns in the data and then use appropriate methods like regex, splitting, etc. to complete the data extraction. 
-  When parsing a huge file, write code that is as efficient as possible, keeping in mind the time and space complexity of the extraction. In this task, the following steps kept the run time under a minute: 
    1. Compiling all regex patterns once and reusing the compiled objects, because the same regular expressions are applied to every job listing
    2. Avoiding repeated string concatenation for large strings: each concatenation can copy the whole string, which costs O(n), so building a long string piece by piece in a loop can become quadratic. Collecting the pieces in a list with the amortized O(1) list.append() method and calling str.join() once at the end avoids this
- Parsing a huge file can seem very daunting at the beginning. The key is to break down the process to several steps; solve the smaller steps and combine the solutions of the smaller steps to get the desired results / output
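The string-building point can be sketched as follows. The fragments are made up and stand in for the per-listing XML strings; both approaches produce the same text, but the second one joins the pieces only once.

```python
# Made-up fragments standing in for per-listing XML strings.
fragments = ['<listing id="%d"/>\n' % i for i in range(5)]

# Repeated concatenation: every += may copy the string built so far.
concatenated = ""
for fragment in fragments:
    concatenated += fragment

# Collect the pieces in a list and join once at the end.
parts = []
for fragment in fragments:
    parts.append(fragment)
joined = "".join(parts)

print(joined == concatenated)  # True
```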


## References

1. *XML Syntax*. Retrieved from https://www.w3schools.com/xml/xml_syntax.asp
2. The Python Standard Library. *Regular expression operations documentation*: `re.search`. Retrieved from https://docs.python.org/3/library/re.html
3. The Python Standard Library. *JSON Encoder and Decoder*: `json.dump`. Retrieved from https://docs.python.org/3/library/json.html