## Section 1: Aggregating Bill Data

This section aggregates data from three sources:
- Sponsor information scraped from the LEGISinfo database (https://www.parl.ca/legisinfo/en/bill/) in 'raw/sponsors.csv'
- Information on the members of parliament from https://www.ourcommons.ca/members/ in 'raw/mp.csv'
- Cleaned bill data from https://www.parl.ca/legisinfo/en/bills?parlsession=all in 'processed/bills_processed.csv'

In [3]:
# Import Python libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
# Read data files
sponsor_data = pd.read_csv('../data/raw/sponsors.csv')
mp_data = pd.read_csv('../data/raw/mp.csv')
bill_data = pd.read_csv('../data/processed/bills_processed.csv')

In [5]:
# Create helper function to print missing values by column
def printNull(df):
    output = "\n".join([
        "-----------", 
        "{shape}",
        "-----------",
        "{missing}"
    ]).format(
        shape = df.shape, 
        missing = df.isna().sum().to_string()
    )
    print(output)

### Part 1a: Reformatting Sponsor Data

In [6]:
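# Split each SponsorInfo string into its four newline-separated attributes
# (n is passed by keyword; recent pandas versions no longer accept it positionally)
sponsor_data[['Id', 'Name', 'Title', 'Constituency']] = sponsor_data['SponsorInfo'].str.split('\n', n = 3, expand = True)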
sponsor_data.drop(columns = ['SponsorInfo'], inplace = True)

for col in sponsor_data.columns:
    sponsor_data[col] = sponsor_data[col].apply(lambda x: x.split(':')[1].strip())
    
# Reformat names column to remove middle initials
def removeInitials(text):
    names = text.split(' ')
    names = [x for x in names if '.' not in x]
    return ' '.join(names)
        
sponsor_data['Name'] = sponsor_data['Name'].apply(removeInitials)

# Rename columns to specify bill sponsorship
sponsor_data.rename(
    columns = {
        'Name': 'SponsorName',
        'Title': 'SponsorTitle'
    }, 
    inplace = True
)

In [7]:
printNull(sponsor_data)

-----------
(6761, 4)
-----------
Id              0
SponsorName     0
SponsorTitle    0
Constituency    0
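
The parsing steps in Part 1a can be illustrated on a single record. This is a minimal sketch: the sample string and its `Key: value` layout are assumptions made for illustration, and `remove_initials` is a hypothetical stand-in for `removeInitials`. Unlike the cell above, it splits on the first colon only (`n=1`), so a value that itself contains a colon is not truncated.

```python
import pandas as pd

# Hypothetical raw record; the real layout of 'SponsorInfo' is assumed here
sample = pd.DataFrame({
    'SponsorInfo': [
        'Id: 44-1/C-1\nName: Jane Q. Doe\nTitle: Member of Parliament\nConstituency: Example Riding'
    ]
})

# One column per newline-separated field
parts = sample['SponsorInfo'].str.split('\n', n=3, expand=True)
parts.columns = ['Id', 'Name', 'Title', 'Constituency']

# Keep only the text after the first colon in each field
for col in parts.columns:
    parts[col] = parts[col].str.split(':', n=1).str[1].str.strip()

def remove_initials(text):
    """Drop every token that contains a period, such as the initial 'Q.'."""
    return ' '.join(tok for tok in text.split(' ') if '.' not in tok)

parts['Name'] = parts['Name'].apply(remove_initials)
print(parts.iloc[0].to_dict())
```

Note that the period test also removes tokens such as `Jr.` or `St.`, which may or may not be desirable when names are later used as merge keys.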


### Part 1b: Reformatting MP Data

In [8]:
# Merge first and last name columns
mp_data['Name'] = mp_data['First Name'] + ' ' + mp_data['Last Name']

# Selecting only the political affiliation and name columns
mp_data = mp_data[['Political Affiliation', 'Name']]

In [9]:
printNull(mp_data)

-----------
(1109, 2)
-----------
Political Affiliation    2
Name                     0


### Part 1c: Merging and Exporting

In [10]:
# Merge sponsor data with MP data, matching on the sponsor's name
data = pd.merge(sponsor_data, mp_data, left_on = ['SponsorName'], right_on = ['Name'], how = 'left')

In [11]:
# Get data overview
printNull(data)
display(data.head(3))

-----------
(6763, 6)
-----------
Id                          0
SponsorName                 0
SponsorTitle                0
Constituency                0
Political Affiliation    1537
Name                     1527


| | Id | SponsorName | SponsorTitle | Constituency | Political Affiliation | Name |
|---|---|---|---|---|---|---|
| 0 | 44-1/S-1 | Yuen Pau Woo | Senator | | | |
| 1 | 43-2/S-1 | Marc Gold | Senator | | | |
| 2 | 43-1/S-1 | Joseph Day | Senator | | | |
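
The merged frame has 6763 rows, two more than the 6761 rows of `sponsor_data`. A left merge adds rows when a key occurs more than once in the right-hand frame, so repeated names in `mp_data` are one plausible cause. This is an inference, not something the notebook confirms. The sketch below uses toy frames with made-up names to show one way to check for repeated keys and for sponsors with no match.

```python
import pandas as pd

# Toy frames with hypothetical names; not the real data
sponsors = pd.DataFrame({'SponsorName': ['Ann Lee', 'Bob Roy', 'Cy Fox']})
mps = pd.DataFrame({
    'Name': ['Ann Lee', 'Ann Lee', 'Bob Roy'],
    'Political Affiliation': ['Party A', 'Party B', 'Party A'],
})

# Names that occur more than once on the right side duplicate left rows
dup_names = mps.loc[mps['Name'].duplicated(keep=False), 'Name'].unique()

# indicator=True adds a '_merge' column recording where each row came from
merged = pd.merge(
    sponsors, mps,
    left_on='SponsorName', right_on='Name',
    how='left', indicator=True,
)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'SponsorName']

print(list(dup_names))
print(len(sponsors), len(merged))
print(list(unmatched))
```

Passing `validate='many_to_one'` to `pd.merge` would instead raise an error as soon as the right-hand keys are not unique, which can be a useful guard before exporting.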


In [12]:
# Merge the sponsor/MP data with the bill data on the bill Id
data = pd.merge(data, bill_data, on = ['Id'], how = 'left')

# Drop redundant name column
data.drop(columns = 'Name', inplace = True)

In [13]:
# Get data overview
printNull(data)
display(data.head(3))

-----------
(6763, 18)
-----------
Id                          0
SponsorName                 0
SponsorTitle                0
Constituency                0
Political Affiliation    1537
Code                        0
Title                       0
LatestStageName             0
ParliamentNumber            0
SessionNumber               0
BillType                    0
ReceivedRoyalAssent         0
Ongoing                     0
ReadingsPassed              0
BillOrigin                  0
FirstStageDate            975
LastStageDate             975
TimeAlive                 975


| | Id | SponsorName | SponsorTitle | Constituency | Political Affiliation | Code | Title | LatestStageName | ParliamentNumber | SessionNumber | BillType | ReceivedRoyalAssent | Ongoing | ReadingsPassed | BillOrigin | FirstStageDate | LastStageDate | TimeAlive |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44-1/S-1 | Yuen Pau Woo | Senator | | | S-1 | An Act relating to railways | First reading in the Senate | 44 | 1 | Senate Public Bill | False | True | 1 | IsSenateBill | 2021-11-22 | 2021-11-22 | 0 days |
| 1 | 43-2/S-1 | Marc Gold | Senator | | | S-1 | An Act relating to railways | First reading in the Senate | 43 | 2 | Senate Public Bill | False | False | 1 | IsSenateBill | 2020-09-22 | 2020-09-22 | 0 days |
| 2 | 43-1/S-1 | Joseph Day | Senator | | | S-1 | An Act relating to railways | First reading in the Senate | 43 | 1 | Senate Public Bill | False | False | 1 | IsSenateBill | 2019-12-04 | 2019-12-04 | 0 days |


In [14]:
data.to_csv('../data/cleaned/bill_data.csv', index = False)

## Section 2: Aggregating Parliament Session Data

This section aggregates data from two sources:
- Parliament information from https://en.wikipedia.org/wiki/List_of_Canadian_federal_parliaments in 'raw/parliaments.csv'
- Parliament session information from https://lop.parl.ca/sites/ParlInfo/default/en_CA/Parliament/parliamentsSessions in 'raw/sessions.csv'

### Part 2a: Reformatting Wikipedia Data

In [15]:
parliaments = pd.read_csv('../data/raw/parliaments.csv')

# Clean up the raw table exported from Wikipedia

parliaments.drop(columns = ['Diagram'], inplace = True)
# Drop rows 0-37 plus rows 39 and 41, keeping the 35th Parliament onward
parliaments.drop(list(range(0, 38)) + [39, 41], axis = 0, inplace = True)

parliaments.columns = [
    'ParliamentNumber', 
    'Duration',
    'GoverningParty',
    'Seats',
    'OfficialOpposition',
    'ThirdParties'
]

# Getting ParliamentNumber

parliaments['ParliamentNumber'] = parliaments['ParliamentNumber'].apply(lambda x: int(x[:2]))

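# Split GoverningParty into the party name (first line) and the prime minister

parliaments[['Party', 'PrimeMinister']] = parliaments['GoverningParty'].str.split('\n', n = 1, expand = True)
# From the remaining text, keep the first line, drop its leading token, and cut at the em dash
parliaments['PrimeMinister'] = parliaments['PrimeMinister'].apply(lambda x: x.split('\n')[0].split(' ' , 1)[1].split('\u2014')[0])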
parliaments.drop(columns = 'GoverningParty', inplace = True)

# Create Minority Column

parliaments['Minority'] = parliaments['Seats'].str.contains('minority')

# Drop unnecessary columns

parliaments.drop(columns = ['Seats', 'Duration', 'OfficialOpposition', 'ThirdParties'], inplace = True)

# Check data

display(parliaments.head())

| | ParliamentNumber | Party | PrimeMinister | Minority |
|---|---|---|---|---|
| 38 | 35 | Liberal Party | Jean Chrétien | False |
| 40 | 36 | Liberal Party | Jean Chrétien | False |
| 42 | 37 | Liberal Party | Jean Chrétien | False |
| 43 | 38 | Liberal Party | Paul Martin | True |
| 44 | 39 | Conservative Party | Stephen Harper | True |
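
The chained splits that produce `PrimeMinister` are easier to follow when written out step by step. The sketch below mirrors that chain in a hypothetical helper, `parse_prime_minister`. The sample text is invented, since the exact layout of the Wikipedia cell is not shown in this notebook.

```python
def parse_prime_minister(cell):
    """Mirror the chained splits used for the PrimeMinister column."""
    first_line = cell.split('\n')[0]              # keep only the first line
    without_prefix = first_line.split(' ', 1)[1]  # drop the first space-delimited token
    return without_prefix.split('\u2014')[0]      # cut at the first em dash (U+2014)

# Hypothetical cell text; the real cell contents are an assumption
sample = 'PM Jean Chrétien\u2014example trailing text\nsecond line'
print(parse_prime_minister(sample))
```

Because each step indexes into the result of a split, a cell without a space on its first line would raise an `IndexError`, so the approach depends on every remaining row following the same layout.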


### Part 2b: Reformatting Parliament Session Data

In [16]:
sessions = pd.read_csv('../data/raw/sessions.csv')

# Extract the parliament number from each 'Parliament:' header row and
# carry it forward to the session rows that follow it
parliament_numbers = []
def getParliamentNumber(text):
    if 'Parliament:' in text:
        text = text.split(' ', 1)[-1]
        parliament_numbers.append(text)
        
    else:
        text = parliament_numbers[-1]
        
    return text

sessions['ParliamentNumber'] = sessions['Session'].apply(getParliamentNumber)

# Remove rows without a start date

sessions.dropna(how = 'any', subset = ['Start Date'], inplace = True)

# Rename columns

sessions.rename(columns = {
    'Session': 'SessionNumber',
    'Start Date': 'StartDate',
    'End Date': 'EndDate',
    'Senate Sitings': 'SenateSittings',
    'House of Commons Sittings': 'HouseSittings'
}, inplace = True)

# Change datatype of ParliamentNumber and SessionNumber
# (assign whole columns so the new dtype replaces the old one)

sessions[['ParliamentNumber', 'SessionNumber']] = sessions[['ParliamentNumber', 'SessionNumber']].astype(int)

# Filter by ParliamentNumber, then reverse the row order

sessions = sessions.loc[sessions['ParliamentNumber'].astype(int) > 34, :]
sessions = sessions.iloc[::-1]

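# Cast the last two columns to int. Because ParliamentNumber was appended last,
# iloc[:, -2:] selects HouseSittings and ParliamentNumber; SenateSittings stays float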

sessions.iloc[:, -2:] = sessions.iloc[:, -2:].astype(int)

# Check data

display(sessions.head())

| | SessionNumber | StartDate | EndDate | Duration | SenateSittings | HouseSittings | ParliamentNumber |
|---|---|---|---|---|---|---|---|
| 28 | 1 | 1994-01-17 | 1996-02-02 | 746 days | 133.0 | 278 | 35 |
| 27 | 2 | 1996-02-27 | 1997-04-27 | 425 days | 96.0 | 164 | 35 |
| 25 | 1 | 1997-09-22 | 1999-09-18 | 726 days | 158.0 | 243 | 36 |
| 24 | 2 | 1999-10-12 | 2000-10-22 | 376 days | 84.0 | 133 | 36 |
| 22 | 1 | 2001-01-29 | 2002-09-16 | 595 days | 124.0 | 211 | 37 |
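
`getParliamentNumber` relies on a module-level list and on rows being processed in order. The same header-then-sessions pattern can also be handled with a forward fill. The sketch below is an alternative, not the notebook's method, and it runs on invented rows: the `Parliament: N` header format and the dates are assumptions.

```python
import pandas as pd

# Hypothetical rows: a 'Parliament:' header row followed by its sessions
toy = pd.DataFrame({
    'Session': ['Parliament: 44', '1', 'Parliament: 43', '2', '1'],
    'Start Date': [None, '2000-01-01', None, '1999-06-01', '1999-01-01'],
})

is_header = toy['Session'].str.contains('Parliament:')

# Take the number from header rows, then carry it down to the session rows
toy['ParliamentNumber'] = (
    toy['Session'].where(is_header).str.split(' ', n=1).str[-1].ffill()
)

# Header rows have no start date, so dropping them leaves only sessions
toy = toy.dropna(subset=['Start Date']).astype(
    {'ParliamentNumber': int, 'Session': int}
)
print(toy[['Session', 'ParliamentNumber']].values.tolist())
```

This version keeps no state outside the frame, so re-running the cell or reordering the steps cannot leave stale values behind.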


### Part 2c: Merging and Exporting

In [17]:
parliaments = pd.merge(parliaments, sessions, on = ['ParliamentNumber'], how = 'right')
parliaments.to_csv('../data/cleaned/parliament_data.csv', index = False)