# MPs' Registered Interests

This is a project to scrape and process MP registered interests data which is not currently available on the parliamentary API.

This information is presented as a list of XML files at https://www.theyworkforyou.com/pwdata/scrapedxml/regmem/ which are indexed by date.

The data is somewhat messy (not too bad, not perfect) and a new register is added every month.

I want to design a process/app that will gather all of the data from a specified date onwards and refactor it into a reasonably consistent JSON format. I will then use a Node.js app to seed this to a MongoDB database on Atlas, which will be the back end for a simple web and mobile app in Flutter that people can use to find information that is publicly available and pertinent but actually quite inaccessible.

For this notebook, I have written code which will scrape the XML data from a specified date, save it to JSON, and then refactor this into a more consistent/workable JSON. On each run it checks whether any of the files are already present and skips them, so new records can be added without redoing the old ones.
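
The skip-if-present check can be sketched on its own as below. This is a minimal illustration, not the code used in the cells that follow: the function name `files_still_to_process` is made up for the example, and it assumes (as in the XML to JSON step) that a source file and its output share the same name apart from the extension.

```python
from pathlib import Path


def files_still_to_process(source_dir, target_dir):
    """Return names of files in source_dir with no same-named output in target_dir."""
    # Compare on the name without its extension, e.g. "regmem2015-01-06"
    done = {path.stem for path in Path(target_dir).iterdir()}
    return sorted(
        path.name for path in Path(source_dir).iterdir() if path.stem not in done
    )
```

Calling it with the XML directory as the source and the JSON directory as the target would list only the registers that have not been converted yet.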

I realise that I have divided the process up into steps that could be merged (i.e. I could go straight from XML to the refactored JSON) but I had to do a lot of inspection of the original JSON to figure out how to clean it up a bit, and also I quite enjoy how clear the different steps are.

It should be noted that running these scripts will download data from the internet onto your computer. 

Set up - 
    If you want to run this notebook and see it working (which you may not want to do, considering I am hopefully going to have a site which displays all the info nicely) then I recommend that you create a new directory (call it what you will - mps_reg_interests_data maybe?) with this notebook in it. Then in that directory you need four other empty directories:
    - 'regInterestsDataXML'
    - 'refactoredXML'
    - 'regInterestsDataJSON'
    - 'regInterestsJsonRefactored'
Hopefully these are self-explanatory. They will be populated with the data at different stages.
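
If you would rather not create the directories by hand, something like the sketch below should do it. The directory names are the ones the code cells in this notebook read from and write to (including `refactoredXML`, which the cleaning step writes to); the function name `make_data_dirs` is just made up for this example.

```python
from pathlib import Path

# Directory names taken from the paths used in the code cells of this notebook
DATA_DIRS = [
    "regInterestsDataXML",
    "refactoredXML",
    "regInterestsDataJSON",
    "regInterestsJsonRefactored",
]


def make_data_dirs(base="."):
    """Create each working directory under base if it does not already exist."""
    created = []
    for name in DATA_DIRS:
        path = Path(base) / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Running `make_data_dirs()` from the directory that holds the notebook is safe to repeat, since existing directories are left alone.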

NOTE/WARNING 
    There is a lot of data - be careful when you pick a start date: if you pick 2010 or something, that is 10 * 12 times 5 times circa 400 MPs times many interests, and it might take ages to run on your computer.


# STEP ONE - Packages
The first step is to get the necessary packages:

In [83]:
import requests
import xmltodict
from bs4 import BeautifulSoup as bs
import xml.etree.ElementTree as ET
import os
from xml.dom import minidom
import json
import lxml
import html5lib

# STEP TWO - Get XML files since...

IMPORTANT:
    This line in the cell below:
```python
if year >= 2015:
```

is where you specify the year from which you want the data.
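
As a rough sketch of the same filter pulled out into its own function, assuming the filenames follow the `regmemYYYY-MM-DD.xml` pattern seen in the output further down. The names `register_date` and `wanted` are hypothetical and not part of the notebook's code.

```python
import re
from datetime import date

# Matches names such as "regmem2015-01-06.xml"
REGISTER_NAME = re.compile(r"^regmem(\d{4})-(\d{2})-(\d{2})\.xml$")


def register_date(filename):
    """Return the date encoded in a register filename, or None if it does not match."""
    match = REGISTER_NAME.match(filename)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


def wanted(filename, start_year):
    """True if filename is a register dated in start_year or later."""
    found = register_date(filename)
    return found is not None and found.year >= start_year
```

Parsing the whole date rather than slicing out the year would also make it possible to filter from a specific month or day later on.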


In [84]:

# scrapes the reg interests index page, checks whether we already have each register and, if not, downloads it
URL = "https://www.theyworkforyou.com/pwdata/scrapedxml/regmem/"
page = requests.get(URL)
# NOTE: I have since realised I could maybe have used lxml for the parser in the step below
soup = bs(page.content, 'html.parser')
links = soup.find_all('a')

# make sure to have created a directory for your XML files to be saved and then specify its path here
fileList = os.listdir("./regInterestsDataXML/")

# this loop goes through the links on the source page and checks that they are in the time period we want (for initial saving) and that they are new (for updating); if they satisfy both, it downloads them
for link in links:
    fileList = os.listdir("./regInterestsDataXML/")
    starterURL = "https://www.theyworkforyou.com/pwdata/scrapedxml/regmem/"
    if link.text.startswith("regmem"):
        year = int(link.text[6:10])
        # specify from which year here
        if year >= 2015:
            print("DEALING WITH", f"{year}:", link.text)
            if link.text not in fileList:
                fullUrl = starterURL + link.text
                r = requests.get(fullUrl)
                root = ET.fromstring(r.text)
                tree = ET.ElementTree(root)
                # at this point I tried to directly convert it to json/dict but it was proving too much of a faff, so I happily just decided to save it and deal with the files
                print("WRITING", link.text, "to",  f"./regInterestsDataXML/{link.text}")
                tree.write(f"./regInterestsDataXML/{link.text}")
                print("written!")
                # this next part opens the XML file just created. I got a bit tangled with different XML parsers, and it was easier (and in a way more transparent, although admittedly it takes up more room) to save the original in one place, open it, edit it and save the edited version elsewhere, and then convert that to JSON
                print("CLEANING OBSTRUCTIVE TAGS")

                with open(f"./regInterestsDataXML/{link.text}", "r") as file:
                    soup = bs(file, "lxml")

                brs = soup.find_all("br")
                spans = soup.find_all("span")
                texts = soup.find_all("#text")
                ems = soup.find_all("em")
                strongs = soup.find_all("strong")

                for br in brs:
                    br.replace_with(" ")
                for span in spans:
                    span.unwrap()
                if len(texts) != 0:
                    for text in texts:
                        text.unwrap()
                for em in ems:
                    em.unwrap()
                for strong in strongs:
                    strong.unwrap()
                # save the cleaned version separately, ready for JSON conversion
                with open(f"./refactoredXML/{link.text}", "w") as f:
                    f.write(str(soup))
            else:
                print(f"You already have {link.text} downloaded")
print("ALL DONE")

DEALING WITH 2015: regmem2015-01-06.xml
WRITING regmem2015-01-06.xml to ./regInterestsDataXML/regmem2015-01-06.xml
written!
CLEANING OBSTRUCTIVE TAGS
DEALING WITH 2015: regmem2015-01-26.xml
WRITING regmem2015-01-26.xml to ./regInterestsDataXML/regmem2015-01-26.xml
written!
CLEANING OBSTRUCTIVE TAGS
DEALING WITH 2015: regmem2015-02-09.xml
WRITING regmem2015-02-09.xml to ./regInterestsDataXML/regmem2015-02-09.xml
written!
CLEANING OBSTRUCTIVE TAGS
DEALING WITH 2015: regmem2015-02-23.xml
WRITING regmem2015-02-23.xml to ./regInterestsDataXML/regmem2015-02-23.xml
written!
CLEANING OBSTRUCTIVE TAGS
DEALING WITH 2015: regmem2015-03-09.xml
WRITING regmem2015-03-09.xml to ./regInterestsDataXML/regmem2015-03-09.xml
written!
CLEANING OBSTRUCTIVE TAGS
DEALING WITH 2015: regmem2015-03-30.xml
WRITING regmem2015-03-30.xml to ./regInterestsDataXML/regmem2015-03-30.xml
written!
CLEANING OBSTRUCTIVE TAGS
DEALING WITH 2015: regmem2015-06-08.xml
WRITING regmem2015-06-08.xml to ./regInterestsDataXML/regmem

The above section of code, then, accesses the XML resource and downloads any new registers to the directory "regInterestsDataXML" or, if it is the first time running it, all of the registers from the chosen start year. It will print:
```
"You already have regmem2020-12-07.xml downloaded"
```
if you already have any of them. 

EDIT - Once I got the whole process working, I realised that I could have made my life a LOT easier if, at this point, I had taken more time to be more selective with my XML scraping. Basically the main pain with this code turned out to be the line break (`<br>`) tags, which caused all sorts of faffy problems when parsed into JSON. I was too deep in the sorting-it-all-out phase to step back, think about it and just remove those tags at this stage (there are various ways). And once I'd written something that worked for the XML I had, I did not want to go back and change it, as that would mean rewriting every stage.

EDIT 2 - I DID THE ABOVE REFACTOR - so now the above code gets the XML, saves it in its original form, and then opens it, removes the fucking annoying br tags that were making life hell before, "unwraps" all the other stupid tags that also made life hell, and re-saves it in the 'refactoredXML' dir ready for JSON conversion, summarising and DB!
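
For illustration, here is the same idea (turn `<br>` into a space, keep the text of inline tags but drop the tags themselves) written with the standard library's ElementTree instead of BeautifulSoup. This is a swapped-in technique, not what the cell above does, and the helper names are made up; it assumes each interest sits in an `<item>` element, as the later JSON suggests.

```python
import xml.etree.ElementTree as ET


def _raw_text(elem):
    """Concatenate all text under elem, treating <br> as a single space."""
    parts = [elem.text or ""]
    for child in elem:
        # A <br> becomes a space; any other tag is "unwrapped" to its text
        parts.append(" " if child.tag == "br" else _raw_text(child))
        # The tail is the text that follows the child inside its parent
        parts.append(child.tail or "")
    return "".join(parts)


def flatten_item(elem):
    """Return the element's text as one line with whitespace normalised."""
    return " ".join(_raw_text(elem).split())
```

Flattening to plain text before the JSON conversion means every item ends up as a simple string, which is what the refactoring step wants.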


# STEP THREE - Convert
The section of code below will access each refactored XML file just created, make a JSON object of it and save it into the JSON directory you created at the start. It will then be much easier to refactor into a format for uploading to the DB.

In [85]:
XMLfiles = os.listdir("./refactoredXML/")

XMLfilesNoType = []

for singleFile in XMLfiles:
    XMLfilesNoType.append(singleFile.split('.')[0])

JSONfiles = os.listdir("./regInterestsDataJSON/") 

JSONfilesNoType = []

for singleFile in JSONfiles:
    JSONfilesNoType.append(singleFile.split('.')[0])

for XMLfile in XMLfilesNoType:
    if XMLfile not in JSONfilesNoType:
        print(f"CONVERTING {XMLfile} to JSON")
        
        with open(f"./refactoredXML/{XMLfile}.xml") as xml_file:
            data_dict = xmltodict.parse(xml_file.read())
        json_data = json.dumps(data_dict)
        # the with blocks close the files for us, so no explicit close() is needed
        with open(f"./regInterestsDataJSON/{XMLfile}.json", "w") as json_file:
            print("WRITING JSON")
            json_file.write(json_data)
    else:
        print("JSON FILE ALREADY CREATED")

print("ALL DONE")
    #  FOR CHANGING INTO DICT
        #  nicerDictData = json.loads(json_data)
        #     print(nicerDictData["publicwhip"])


CONVERTING regmem2019-02-18 to JSON
WRITING JSON
CONVERTING regmem2015-06-08 to JSON
WRITING JSON
CONVERTING regmem2019-07-01 to JSON
WRITING JSON
CONVERTING regmem2019-07-15 to JSON
WRITING JSON
CONVERTING regmem2019-07-29 to JSON
WRITING JSON
CONVERTING regmem2017-02-06 to JSON
WRITING JSON
CONVERTING regmem2020-06-08 to JSON
WRITING JSON
CONVERTING regmem2020-06-22 to JSON
WRITING JSON
CONVERTING regmem2020-04-27 to JSON
WRITING JSON
CONVERTING regmem2018-04-30 to JSON
WRITING JSON
CONVERTING regmem2019-05-07 to JSON
WRITING JSON
CONVERTING regmem2018-03-05 to JSON
WRITING JSON
CONVERTING regmem2016-06-06 to JSON
WRITING JSON
CONVERTING regmem2017-07-31 to JSON
WRITING JSON
CONVERTING regmem2018-06-18 to JSON
WRITING JSON
CONVERTING regmem2019-09-02 to JSON
WRITING JSON
CONVERTING regmem2019-09-16 to JSON
WRITING JSON
CONVERTING regmem2015-01-06 to JSON
WRITING JSON
CONVERTING regmem2020-03-02 to JSON
WRITING JSON
CONVERTING regmem2020-03-16 to JSON
WRITING JSON
CONVERTING regmem201

# STEP FOUR - Refactoring

This next code is messy. I know. It was my first time writing anything with Python, and working with the unwieldy JSON object I got from the XML. I was new to the try/except pattern and I got carried away. However, I am happy to say that, given the ludicrous structure of the JSON, the code below works for the majority of the thousands and thousands of different, very inconsistent entries. I know it's not pretty, and boy is there a lot of refactoring that could be done. But my main aim was to get the data out and up on a DB, and it serves its purpose for that.

EDIT - due to my nice XML refactor this is now not so messy (only a tiny bit) 

So the code below goes through each of the XML->JSON files and refactors them to the following structure:
```
{"Some MP's Name": {"Registered Date of Interest": {"Category of Interest": [Array of Interests in that Category for that date]}}}
```

Which looks like:
```
{"Diane Abbott": {"2019-01-07": {"Miscellaneous": ["Since December 2015, a trustee of the Diane Abbott Foundation, which works to excel and improve education. (Registered 26 October 2016)"]}}}
```
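
One way to get to that structure without nested try/except is to normalise the "one thing or a list of things" shapes first. The sketch below is a hedged alternative, not the code in the cell that follows: `as_list` and `refactor_member` are hypothetical helpers, and the input shape (`@membername`, `category`, `@name`, `record`, `item`) is assumed from the keys that cell uses. It does not handle the "Nil" case, where a member has a record but no category.

```python
def as_list(value):
    """Wrap a lone value in a list; pass lists through; treat None as empty."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def refactor_member(member, date):
    """Return {name: {date: {category: [items]}}} for one member dict."""
    categories = {}
    for category in as_list(member.get("category")):
        items = []
        for record in as_list(category.get("record")):
            # Skip anything that is not a dict rather than guessing its shape
            if isinstance(record, dict):
                items.extend(as_list(record.get("item")))
        categories[category["@name"]] = items
    return {member["@membername"]: {date: categories}}
```

Because every level goes through `as_list`, a single category, a single record and a single item all come out in the same list-based shape as their plural versions.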



In [93]:
JSONfiles = os.listdir("./regInterestsDataJSON/")

refactoredJSONfiles = os.listdir("./regInterestsJsonRefactored/")

refactoredFileNames = []

for fileName in refactoredJSONfiles:
    refactoredFileNames.append(fileName.split(".")[0])

for JSONfile in JSONfiles:
    JSONfilenamesplit  = JSONfile.split('.')[0][6:16]

    if JSONfilenamesplit in refactoredFileNames:
        print(f"{JSONfile} has already been refactored and will be skipped")
    else:
        print(f"*****NOT SKIPING {JSONfile}********")
        masterArray = []
        # print(JSONfile)

        # below this code is what to do for each JSON file ie for every monthly instance of members interests
        with open(f'./regInterestsDataJSON/{JSONfile}') as json_file: 
            # load in json
            data = json.load(json_file) 
            # trim to bit i need
            trimmedData = data['html']['body']['publicwhip']['regmem']
            # print(trimmedData[0])
            # extract date to use for attaching to each member's dict
            date = trimmedData[0]['@date']
            # iterate through each member in the interests data
            for member in trimmedData:
                memberName = member["@membername"]
                # create a fresh dict with structure {'name of member': {"date of register": {"category of interest": [items in that category]}}}

                newDict = {memberName: {date: {}}}
                try:

                    if type(member["category"]) == list:
                        for category in member["category"]:
                            if type(category["record"]["item"]) == str:
                                newDict[memberName][date][category["@name"]] = [category["record"]["item"]]
                            else:
                                newDict[memberName][date][category["@name"]] = category["record"]["item"]
                    
                    if type(member["category"]) == dict:
                        if type(member["category"]["record"]["item"]) == str:
                                newDict[memberName][date][member["category"]["@name"]] = [member["category"]["record"]["item"]]
                        else:
                            newDict[memberName][date][member["category"]["@name"]] = member["category"]["record"]["item"]
                except:
                    try:
                        if (member["record"]["item"] == "Nil"):
                            newDict[memberName][date]["No records held"] = "Nil"
                    except:
                        try:
                            if type(member["category"]) == list:
                                for category in member["category"]:
                                    if type(category["record"]) == list:
                                        catName = category["@name"]
                                        items = []
                                        for record in category["record"]:
                                            if type(record["item"]) == list:
                                                items = items + record["item"]
                                            if type(record["item"]) == str:
                                                items.append(record["item"])
                                        newDict[memberName][date][catName] = items
                        
                        
                        except:
                            print("******************STILL ERRORRR**********")
                            print(member)
                
                finally:
                    masterArray.append(newDict)

            # merge all members into one dict and write the file once per register (not once per member)
            masterDictionary = {}
            for i in masterArray:
                masterDictionary.update(i)

            with open(f"./regInterestsJsonRefactored/{date}.json", "w") as entireJSON:
                json.dump(masterDictionary, entireJSON)
                        
print("ALLL DONE!!!!!!!")

*****NOT SKIPING regmem2018-03-05.json********
*****NOT SKIPING regmem2018-10-15.json********
*****NOT SKIPING regmem2019-01-07.json********
*****NOT SKIPING regmem2019-09-30.json********
*****NOT SKIPING regmem2019-06-03.json********
*****NOT SKIPING regmem2020-04-27.json********
*****NOT SKIPING regmem2018-01-22.json********
*****NOT SKIPING regmem2017-10-09.json********
*****NOT SKIPING regmem2020-05-26.json********
*****NOT SKIPING regmem2018-11-19.json********
*****NOT SKIPING regmem2017-08-29.json********
*****NOT SKIPING regmem2020-11-09.json********
*****NOT SKIPING regmem2017-01-09.json********
*****NOT SKIPING regmem2020-07-20.json********
*****NOT SKIPING regmem2020-09-28.json********
*****NOT SKIPING regmem2018-02-05.json********
*****NOT SKIPING regmem2020-10-12.json********
*****NOT SKIPING regmem2020-03-02.json********
*****NOT SKIPING regmem2017-04-10.json********
*****NOT SKIPING regmem2018-04-16.json********
*****NOT SKIPING regmem2020-05-11.json********
*****NOT SKIP

To recap: Step Three looks at the XML files and the JSON files, and where there is no JSON version of an XML file (i.e. a new register has been downloaded) it creates one. The JSON is much easier to deal with.
Step Four then reads in each of the JSONs one at a time, converts it to a dictionary, rearranges it into the structure shown above, cleans it up and re-saves it in a format that is easier to work with and to combine into one large JSON file. Note that at this point I could push to a database, and I still might actually do this once I have tidied it all up.


# STEP FIVE - SUMMARY FILE AND NEXT STEP DB
The following code goes through all of the reformatted JSONs, extracts the names of any MPs that are mentioned, and writes to a file, for each MP, the list of years for which we have registered interests and the categories they fall under. This file will become the basis for a collection on MongoDB. For the purposes of this project I need two collections in my MongoDB database: one of MP documents, each holding a name, a unique ID and the years and interest categories I have info for, and one with millions of "interest" documents that detail those interests.


In [94]:
reformattedJSONfiles = os.listdir("./regInterestsJsonRefactored/")
datesAndNamesTally = {}

for JSONfile in reformattedJSONfiles:
    print(f"in {JSONfile}")
    with open(f'./regInterestsJsonRefactored/{JSONfile}') as json_file: 
        fileData = json.load(json_file) 
        membersInTally = list(datesAndNamesTally.keys())
        fileMembers = list(fileData.keys())
        date = JSONfile.split(".")[0]
        for member in fileMembers:
            categoriesForSingleFile = list(fileData[member][date].keys())
            if member not in membersInTally:
                datesAndNamesTally[member] = {"datesSummary": [], "categorySummary": categoriesForSingleFile}
                datesAndNamesTally[member]["datesSummary"].append(date)
            if member in membersInTally:
                category_list_1 = categoriesForSingleFile
                category_list_2 = datesAndNamesTally[member]["categorySummary"]
                category_set_1 = set(category_list_1)
                category_set_2 = set(category_list_2)
                category_list_2_items_not_in_category_list_1 = list(category_set_2 - category_set_1)
                combined_category_list = category_list_1 + category_list_2_items_not_in_category_list_1
                datesAndNamesTally[member]["categorySummary"] = combined_category_list
                datesAndNamesTally[member]["datesSummary"].append(date)
    print(f"finished with {JSONfile}")

# reduce each member's dates to just the distinct years they appear in
for key in datesAndNamesTally:
    years = {date[0:4] for date in datesAndNamesTally[key]["datesSummary"]}
    datesAndNamesTally[key]["datesSummary"] = sorted(years)

with open("./datesAndNamesTally.JSON", "w") as entireJSON:
    json.dump(datesAndNamesTally, entireJSON)
    
print("ALL DONE")
# this commented out bit is for creating a massive JSON compiling all the info 

# reformattedJSONfiles = os.listdir("./regInterestsJsonRefactored/")
# masterDict = {}

# for JSONfile in reformattedJSONfiles:
#     print(f"in {JSONfile}")
#     with open(f'./regInterestsJsonRefactored/{JSONfile}') as json_file: 
        
#         fileData = json.load(json_file) 
#         masterMembers = list(masterDict.keys())
#         fileMembers = list(fileData.keys())
#         date = JSONfile.split(".")[0]
#         print(f"DATE: {date}")
#         for member in fileMembers:
#             if member not in masterMembers:
#                 masterDict[member] = {date: fileData[member][date]}
#             if member in masterMembers:
#                 if date in masterDict[member].keys():
#                     continue
#                 else:
#                     masterDict[member][f"{date}"] = fileData[member][date]
#     print("finished with JSONfile")


# with open(f"./masterDict.json", "w") as entireJSON:
#     json.dump(masterDict, entireJSON)


in 2019-01-21.json
finished with 2019-01-21.json
in 2019-09-16.json
finished with 2019-09-16.json
in 2017-07-31.json
finished with 2017-07-31.json
in 2018-03-19.json
finished with 2018-03-19.json
in 2017-10-23.json
finished with 2017-10-23.json
in 2018-01-08.json
finished with 2018-01-08.json
in 2021-02-01.json
finished with 2021-02-01.json
in 2018-10-29.json
finished with 2018-10-29.json
in 2017-07-07.json
finished with 2017-07-07.json
in 2021-04-12.json
finished with 2021-04-12.json
in 2019-07-29.json
finished with 2019-07-29.json
in 2018-12-03.json
finished with 2018-12-03.json
in 2021-03-01.json
finished with 2021-03-01.json
in 2020-07-06.json
finished with 2020-07-06.json
in 2018-07-16.json
finished with 2018-07-16.json
in 2019-10-21.json
finished with 2019-10-21.json
in 2018-08-13.json
finished with 2018-08-13.json
in 2018-04-30.json
finished with 2018-04-30.json
in 2017-01-23.json
finished with 2017-01-23.json
in 2020-11-23.json
finished with 2020-11-23.json
in 2018-02-19.json
f
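
The tally built in the cell above could also be written with sets, which take care of the de-duplication. This is only a sketch under assumptions: `build_tally` is a made-up name, and it takes the already-loaded file data as a dict keyed by register date rather than reading from disk. The output keys `datesSummary` and `categorySummary` match the ones used above.

```python
def build_tally(registers):
    """registers maps a register date ("YYYY-MM-DD") to that file's refactored data."""
    tally = {}
    for register_date, file_data in registers.items():
        for member, by_date in file_data.items():
            entry = tally.setdefault(member, {"years": set(), "categories": set()})
            entry["years"].add(register_date[:4])
            entry["categories"].update(by_date[register_date].keys())
    # Sets are not JSON serialisable, so convert to sorted lists at the end
    return {
        member: {
            "datesSummary": sorted(entry["years"]),
            "categorySummary": sorted(entry["categories"]),
        }
        for member, entry in tally.items()
    }
```

Sorting at the end also means the saved file comes out in the same order on every run, which makes it easier to compare between runs.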


The next stage is to use this information to create a new collection on MongoDB. It will be a collection of documents with the structure:
```javascript
_id: int 
memberName: str
yearsOfRecordsHeld: array
```
e.g.
```javascript
_id: 92891279878218
memberName: "Diane Abbott"
yearsOfRecordsHeld: [2019, 2020, 2021]
```
This will require some work understanding MongoDB.
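
As a starting point, the tally file could be turned into documents of that shape with something like the sketch below. It only builds plain dicts and does not talk to the database; `member_documents` is a hypothetical name, and `_id` is left out on the assumption that MongoDB will assign one on insert.

```python
def member_documents(tally):
    """Turn the tally dict into a list of member documents, one per MP."""
    documents = []
    for name, summary in sorted(tally.items()):
        documents.append(
            {
                "memberName": name,
                # Years are stored as strings in the tally; the example uses numbers
                "yearsOfRecordsHeld": sorted(
                    int(year) for year in summary["datesSummary"]
                ),
            }
        )
    return documents
```

The resulting list could be dumped to JSON for the seeding script to read.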

I have just watched a video on MongoDB schema design and I think I am going to use the "one to squillions" system, where the member document doesn't actually reference the interests, but each interest document references the member (by name, and, if I can figure it out, by the ID that Mongo assigns it).
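
To make the "one to squillions" idea concrete, here is a rough sketch that flattens one refactored register file into interest documents, each carrying the member's name. The field names (`memberName`, `registerDate`, `category`, `interest`) are assumptions for illustration, not a settled schema, and referencing by Mongo ID is not covered.

```python
def interest_documents(file_data):
    """Yield one flat document per interest from a refactored register file."""
    for member_name, by_date in file_data.items():
        for register_date, categories in by_date.items():
            for category, items in categories.items():
                # "No records held" is stored as a bare string, so wrap it
                if isinstance(items, str):
                    items = [items]
                for item in items:
                    yield {
                        "memberName": member_name,
                        "registerDate": register_date,
                        "category": category,
                        "interest": item,
                    }
```

Each yielded dict stands alone, so the member document never needs to hold a growing list of interests.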

I have now seeded the basic member data onto MongoDB. What I now need to do is check out the refactoring situation: run all of that again, check the errors, and see if I can fix them. If it's too much of a faff I might just leave it.

Otherwise I will fix it and regenerate the reformatted JSON, and then I will look into seeding THOSE files onto MongoDB. Also to do: hook the two folders together (the JS seeding code hooked up to the folders in here).


I now have my MongoDB set up in place and will work to use the folders and files here to seed the DB.

Ideally once it is uploaded I will write a new list JSON which will track what is uploaded. 
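
A minimal sketch of what that tracking list might look like, assuming it is a JSON file holding a list of uploaded filenames. The function names and the manifest filename are hypothetical; nothing here uploads anything.

```python
import json
from pathlib import Path


def load_uploaded(manifest_path):
    """Return the set of filenames recorded as uploaded (empty if no manifest yet)."""
    path = Path(manifest_path)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))


def record_uploaded(manifest_path, filename):
    """Add filename to the manifest and save it as a sorted JSON list."""
    uploaded = load_uploaded(manifest_path)
    uploaded.add(filename)
    Path(manifest_path).write_text(json.dumps(sorted(uploaded)))
    return uploaded
```

The seeding step could then skip any file already in `load_uploaded("uploaded.json")`, in the same spirit as the skip checks earlier in the notebook.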

