# Using pyCSPro

pyCSPro is a simple Python library built around two main classes.
The first is the DictionaryParser class, which parses a CSPro dictionary and also provides ancillary functions: it returns the labels of record columns (these can replace the default column names, which are the name attributes of items and can therefore be cryptic) and the labels of values (these can replace values such as 1 and 2 with their respective labels, such as 'Male' and 'Female').
The second is the CaseParser class, which uses the parsed dictionary to parse cases from a CSPro data file.

## Install the package

In [None]:
!pip install --user pycspro
!pip install --user pandas

## Parse a dictionary

This example parses the sample dictionary that ships with CSPro. It can also be downloaded from this repository:

https://github.com/CSProDevelopment/examples

In [None]:
from pycspro import DictionaryParser
import json

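# Read the dictionary file and close it as soon as its contents are loaded
with open('dictionary/Census Dictionary.dcf', 'r') as dictionary_file:
    raw_dictionary = dictionary_file.read()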
dictionary_parser = DictionaryParser(raw_dictionary)
parsed_dictionary = dictionary_parser.parse()
print(json.dumps(parsed_dictionary, indent=4))

## Use parsed dictionary to parse cases

We pull the cases out of the CSPro example data file. The example contains a single record type, so a newline (\n) appears only at the end of each case, and we can split the file's contents on it to get individual cases. If the file contained multiple record types, the records within a case would also be separated by newlines, and splitting on newlines alone would no longer yield whole cases.
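
Splitting on `'\n'` leaves an empty string at the end of the list when the file ends with a newline. The following is a minimal sketch of a helper that drops such blank entries, assuming every case occupies exactly one line as in the example file. The name `split_cases` is hypothetical and not part of pyCSPro.

```python
def split_cases(raw_text):
    """Split a single-record-type data file into one string per case.

    Hypothetical helper, not part of pyCSPro. It assumes every case
    occupies exactly one line, as in the example file.
    """
    # Blank lines are dropped; kept lines are returned unmodified so
    # that fixed-width column positions are preserved.
    return [line for line in raw_text.splitlines() if line.strip()]


sample = "0101 first case\n0102 second case\n\n"
print(split_cases(sample))  # ['0101 first case', '0102 second case']
```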

The case parser accepts a list of cases. The list can hold a single case or as many as 100k.
For large files, a good approach is to pass in batches of about 50k cases: convert the dictionary returned for each batch into a pandas DataFrame, then append it to the DataFrame built from the previous batches.

In [None]:
import pandas as pd
from pycspro import CaseParser

with open('data/Popstan Census.dat', 'r') as data_file:
    raw_cases = data_file.read()
cases = raw_cases.split('\n')
case_parser = CaseParser(parsed_dictionary)
parsed_cases = case_parser.parse(cases[:10])
dfs = {}
for table_name, table in parsed_cases.items():
    dfs[table_name] = pd.DataFrame.from_dict(table)
    print(table_name)
    display(dfs[table_name])
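
The batching approach described above can be sketched with a small generator. This is only an illustration: `iter_batches` is a hypothetical helper, not part of pyCSPro, and the default batch size simply mirrors the 50k figure suggested earlier.

```python
def iter_batches(cases, batch_size=50_000):
    """Yield consecutive slices of `cases`, each at most `batch_size` long.

    Hypothetical helper, not part of pyCSPro.
    """
    for start in range(0, len(cases), batch_size):
        yield cases[start:start + batch_size]


# Tiny demonstration with placeholder strings instead of real cases.
demo = [f"case-{i}" for i in range(7)]
batch_lengths = [len(batch) for batch in iter_batches(demo, batch_size=3)]
print(batch_lengths)  # [3, 3, 1]
```

With the objects from the cell above, each batch would be passed to `case_parser.parse(batch)` and each returned table converted with `pd.DataFrame.from_dict`. Collecting the per-table frames in a list and combining them once with `pd.concat(frames, ignore_index=True)` is generally cheaper than appending inside the loop.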

## Get selected columns only

Records can have upwards of a hundred columns, and you will probably not be interested in all of them. In that case, you can pass a cutting mask to the CaseParser class when instantiating it.

The parser will then cut and return only the listed columns. The cutting_mask parameter is a dictionary whose keys are record names (the level name in the case of the main table) and whose values are lists of column names, which are the item names.

In [None]:
import pandas as pd
from pycspro import CaseParser

with open('data/Popstan Census.dat', 'r') as data_file:
    raw_cases = data_file.read()
cases = raw_cases.split('\n')
cutting_mask = {
    'QUEST': ['PROVINCE', 'DISTRICT'],
    'PERSON': ['P03_SEX', 'P04_AGE', 'P11_LITERACY', 'P15_OCC'],
    'HOUSING': ['H01_TYPE', 'H05_ROOMS', 'H07_RENT', 'H08_TOILET', 'H13_PERSONS']
}
case_parser = CaseParser(parsed_dictionary, cutting_mask)
parsed_cases = case_parser.parse(cases[:10])
dfs = {}
for table_name, table in parsed_cases.items():
    dfs[table_name] = pd.DataFrame.from_dict(table)
    print(table_name)
    display(dfs[table_name])
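
A misspelled item name in the mask is easy to miss. The following is a minimal sketch of a check that can be run before parsing, assuming the valid column names per record are available as a plain dictionary (for instance, built from the keys returned by `dictionary_parser.get_column_labels(record_name)`). `find_unknown_columns` is a hypothetical helper and not part of pyCSPro. How pyCSPro itself reacts to an unknown name is not covered here.

```python
def find_unknown_columns(cutting_mask, known_columns):
    """Return {record_name: [names missing from known_columns]}.

    Hypothetical helper, not part of pyCSPro.
    cutting_mask:  {record_name: [column_name, ...]}
    known_columns: {record_name: iterable of valid column names}
    """
    unknown = {}
    for record_name, columns in cutting_mask.items():
        valid = set(known_columns.get(record_name, ()))
        missing = [name for name in columns if name not in valid]
        if missing:
            unknown[record_name] = missing
    return unknown


# Illustrative data: 'H05_ROOM' is a deliberate typo for 'H05_ROOMS'.
known = {'HOUSING': ['H01_TYPE', 'H05_ROOMS', 'H07_RENT']}
mask = {'HOUSING': ['H01_TYPE', 'H05_ROOM']}
print(find_unknown_columns(mask, known))  # {'HOUSING': ['H05_ROOM']}
```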

## Changing column labels

In [6]:
housing = dfs['HOUSING']
housing.rename(columns = dictionary_parser.get_column_labels('HOUSING'))

{'H01_TYPE': 'Type of housing',
 'H02_WALL': 'Wall type',
 'H03_ROOF': 'Roof type',
 'H04_FLOOR': 'Floor type',
 'H05_ROOMS': 'Number of rooms',
 'H06_TENURE': 'Tenure',
 'H07_RENT': 'Amount of rent paid',
 'H08_TOILET': 'Type of toilet facilities',
 'H09_BATH': 'Type of bathing facilities',
 'H10_WATER': 'Source of water',
 'H11_LIGHT': 'Fuel for lighting',
 'H12_FUEL': 'Fuel for cooking',
 'H13_PERSONS': 'Number of persons in the household'}

## Changing value labels

In [7]:
person = dfs['PERSON']
person.replace(dictionary_parser.get_value_labels('PERSON'))

{'LINE': {},
 'P02_REL': {1: 'Head',
  2: 'Spouse',
  3: 'Child',
  4: 'Parent',
  5: 'Other relative',
  6: 'Nonrelative',
  9: 'Not Reported'},
 'P03_SEX': {1: 'Male', 2: 'Female'},
 'P04_AGE': {},
 'P05_MS': {1: 'Married',
  2: 'Divorced',
  3: 'Separated',
  4: 'Widowed',
  5: 'Never Married'},
 'P06_MOTHER': {1: 'Yes', 2: 'No', 3: "Don't know"},
 'P07_BIRTH': {21: 'Endar', 22: 'Victoria', 0: 'Not Reported'},
 'P08_RES95': {21: 'Endar',
  22: 'Victoria',
  0: 'Unknown',
  "'  '": 'Not Applicable'},
 'P09_ATTEND': {1: 'Yes', 2: 'No', 9: 'Not Reported', "' '": 'Not Applicable'},
 'P10_HIGH_GR': {0: 'None', 99: 'Not Reported', "'  '": 'Not Applicable'},
 'P11_LITERACY': {1: 'Literate',
  2: 'Illiterate',
  9: 'Not Reported',
  "' '": 'Not Applicable'},
 'P12_WORKING': {1: 'Yes',
  2: 'No',
  9: 'Not Reported',
  "' '": 'Not Applicable'},
 'P13_LOOKING': {1: 'Yes',
  2: 'No',
  9: 'Not Reported',
  "' '": 'Not Applicable'},
 'P14_WHY_NOT': {1: 'Had job',
  2: 'Believed job not availabl