## Lab 1. Pandas

### Structured Data Manipulation and Data Wrangling

## Part 1. Data Transformation and Group By Analysis

__Data__: Toronto Parking Tickets dataset `Parking_Tickets_Toronto2020.csv` (original data source https://www.toronto.ca/city-government/data-research-maps/open-data/) describes parking infractions in the City of Toronto issued between April and August 2020.  

Data dictionary:

- date_of_infraction (int) the date of parking violation as YYYYMMDD
- infraction (str) description of the parking violation
- fine (int) parking ticket amount
- address (str) nearest house number and street to the location where the ticket was issued
- province (str) province or state of origin of the car's license plate


```python
import pandas as pd
import numpy as np
import datetime as dt
```

#### 1.1- Data Import and Inspection

1. Import CSV data into a pandas data frame
2. Inspect the data frame:
    - how many rows and columns are there?
    - what data types are there?
    - describe the numerical and object columns
    - what is the number of unique values in each column?
3. Print the first 5 rows of the columns containing character strings ('object' data type)
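
The inspection steps above can be sketched as follows. This is a minimal illustration, not the lab solution: the three sample rows are made up to mimic the data dictionary, and `tickets` is a hypothetical variable name. In the lab you would read the real file with `pd.read_csv("Parking_Tickets_Toronto2020.csv")` instead.

```python
import io
import pandas as pd

# Hypothetical sample rows standing in for Parking_Tickets_Toronto2020.csv
csv_text = """date_of_infraction,infraction,fine,address,province
20200401,PARK ON PRIVATE PROPERTY,30,12 KING ST W,ON
20200402,PARK PROHIBITED TIME NO PERMIT,50,100 QUEEN ST E,ON
20200402,PARK ON PRIVATE PROPERTY,30,12 KING ST W,QC
"""
tickets = pd.read_csv(io.StringIO(csv_text))

# Number of rows and columns
n_rows, n_cols = tickets.shape
print(n_rows, n_cols)

# Data type of each column
print(tickets.dtypes)

# Summary statistics: numerical columns first, then the non-numerical ones
print(tickets.describe())
print(tickets.describe(exclude="number"))

# Number of unique values in each column
n_unique = tickets.nunique()
print(n_unique)

# First 5 rows of the string columns; excluding numeric columns selects them
# regardless of whether pandas labels the strings as 'object' or 'str'
object_head = tickets.select_dtypes(exclude="number").head(5)
print(object_head)
```

Selecting with `exclude="number"` is a design choice for this sketch; with a pandas version that stores strings as 'object', `select_dtypes(include="object")` gives the same columns.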

#### 1.2- Data Wrangling

1. Convert the column `"date_of_infraction"` into the date-time format
    - use the `apply` method and the `strptime` function from the `datetime` module to create a new column `"date"` containing the dates of infractions expressed as date-time objects
    - drop the original `"date_of_infraction"` column from the data frame
    - create two new columns `"month"` and `"week_day"` containing the month and day of the week extracted from the `"date"` column. Hint: use the `apply` method and the `strftime` function. Resource: https://strftime.org/
2. Create a new column `"street"` by extracting street name from the "address" column
    - Hint: the addresses always begin with a house number followed by a space followed by the street name
    - List the top 10 most frequent street names
    - How many unique combinations of address and street are there?
3. Convert the `"week_day"` and `"month"` columns into the Categorical data type
    - also make sure that your categories are properly ordered
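
One possible way to approach the wrangling steps above, shown on a tiny made-up data frame. The sample values and the variable names `month_order` and `day_order` are assumptions for illustration, and the `%B` and `%A` format codes assume an English locale.

```python
import datetime as dt
import pandas as pd

# Hypothetical sample data with the same column layout as the lab data
tickets = pd.DataFrame({
    "date_of_infraction": [20200401, 20200704, 20200815],
    "address": ["12 KING ST W", "100 QUEEN ST E", "7 KING ST W"],
})

# int -> str -> datetime, parsed with the YYYYMMDD pattern
tickets["date"] = tickets["date_of_infraction"].apply(
    lambda d: dt.datetime.strptime(str(d), "%Y%m%d")
)
tickets = tickets.drop(columns="date_of_infraction")

# Full month name (%B) and full weekday name (%A)
tickets["month"] = tickets["date"].apply(lambda d: d.strftime("%B"))
tickets["week_day"] = tickets["date"].apply(lambda d: d.strftime("%A"))

# Split once on the first space: house number | street name
tickets["street"] = tickets["address"].str.split(" ", n=1).str[1]

# Most frequent street names and number of unique (address, street) pairs
top_streets = tickets["street"].value_counts().head(10)
n_pairs = tickets[["address", "street"]].drop_duplicates().shape[0]

# Ordered categories so that sorting and grouping follow calendar order
month_order = ["April", "May", "June", "July", "August"]
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
tickets["month"] = pd.Categorical(
    tickets["month"], categories=month_order, ordered=True
)
tickets["week_day"] = pd.Categorical(
    tickets["week_day"], categories=day_order, ordered=True
)
```

The month list covers April to August only because the dataset description limits the tickets to that period; starting the week on Monday is an arbitrary choice.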

#### 1.3- Subsetting and GroupBy Analysis

1. Find the top 5 most frequent infraction categories and top 10 most frequently occurring streets
2. Build a subset of the original data frame where the infractions and streets are those you identified in step 1
    - How many rows does the subset data frame contain?
3. Using the subset data frame from step 2 and the `groupby` method:
    - compute the mean fine for each month
    - compute the mean fine for each week day
    - find the provinces whose parking tickets add up to a total of more than 10000
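
A minimal sketch of the subsetting and group-by pattern described above. The data are invented and deliberately tiny, so the sketch keeps the top 2 categories and uses a threshold of 100; in the lab these would be `head(5)`, `head(10)` and 10000. The name `threshold` is hypothetical.

```python
import pandas as pd

# Hypothetical sample data
tickets = pd.DataFrame({
    "infraction": ["A", "A", "B", "C", "A", "B"],
    "street": ["KING ST W", "QUEEN ST E", "KING ST W",
               "BAY ST", "KING ST W", "BAY ST"],
    "fine": [30, 50, 100, 30, 30, 150],
    "month": ["April", "April", "May", "May", "June", "June"],
    "province": ["ON", "ON", "QC", "ON", "ON", "QC"],
})

# Most frequent categories (top 2 here because the sample is so small)
top_infractions = tickets["infraction"].value_counts().head(2).index
top_streets = tickets["street"].value_counts().head(2).index

# Keep rows whose infraction AND street are both among the top categories
mask = (tickets["infraction"].isin(top_infractions)
        & tickets["street"].isin(top_streets))
subset = tickets[mask]
print(len(subset))

# Mean fine per month
mean_by_month = subset.groupby("month")["fine"].mean()

# Total fines per province, filtered by a threshold
threshold = 100  # assumption for the toy data; the lab asks for 10000
total_by_province = subset.groupby("province")["fine"].sum()
big_provinces = total_by_province[total_by_province > threshold]
print(big_provinces)
```

If `"month"` or `"week_day"` is stored as a Categorical, passing `observed=True` to `groupby` keeps unused categories out of the result.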

## Part 2. Merging and Missing Data

#### Data:

- `debt_public.csv`
    - this data table contains the following columns:
        - Country
        - gross_debt_per_GDP (gross government debt as percent of GDP)
        - net_debt_per_GDP (net government debt as percent of GDP)


- `gdp_by_country.csv`
    - gross domestic product estimates from three independent sources (IMF, WB, CIA) and the year of the estimate


- `continents.csv`
    - a table of countries and continents


#### 2.1- Import and Inspect Data

1. Import data from the following sources: `debt_public.csv`, `gdp_by_country.csv`, `continents.csv` into Pandas data frames
2. Inspect the data frames:
    - Preview a few sample rows
    - Preview and inspect descriptive statistics for the numerical and string columns
3. Do any of the three data frames contain missing data (`NaN`)?
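
As a rough illustration of the missing-data check, the sketch below uses two small made-up data frames in place of the CSV files. The `frames` dictionary and the sample values are assumptions; only the column names come from the data description.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for two of the imported data frames
debt = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "gross_debt_per_GDP": [50.0, np.nan, 120.0],
    "net_debt_per_GDP": [30.0, np.nan, np.nan],
})
continents = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Continent": ["Europe", "Asia", "Africa"],
})
frames = {"debt": debt, "continents": continents}

# True if the frame contains at least one NaN anywhere
has_missing = {name: bool(df.isna().any().any()) for name, df in frames.items()}
print(has_missing)

# Number of NaN values per column of one frame
missing_per_column = debt.isna().sum()
print(missing_per_column)
```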

#### 2.2- Data Transformation

1. gdp data:
    - remove all columns except `"Country"` and `"CIA_Estimate"`
    - rename the `"CIA_Estimate"` column to `"GDP"`
    - add a `"Continent"` column by merging the `gdp` and `continents` data frames on the `"Country"` column


2. public debt data:
    - remove rows where `"gross_debt_per_GDP"` is missing (`NaN`)
    - check for the number of missing values in each column
    - merge the `pDebt` and `gdp` data frames into a master data frame `df_debt`
    - add a calculated column `"gross_pub_debt"` for the absolute values of gross public debt using the GDP values and the values of gross debt expressed as percentage of GDP
    - add another calculated column `"debt_bin"` by binning "gross_pub_debt" into two bins: "Low" and "High" separated by the median value of `"gross_pub_debt"`
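
The transformation steps above might look roughly like this on made-up data. The column `"IMF_Estimate"`, all sample values, and the choice of `how="left"` for both merges are assumptions for illustration; the real files may call for a different join type.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the three CSV files
gdp = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "IMF_Estimate": [110.0, 190.0, 310.0, 390.0],  # assumed column name
    "CIA_Estimate": [100.0, 200.0, 300.0, 400.0],
})
continents = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Continent": ["Europe", "Asia", "Africa", "Asia"],
})
pDebt = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "gross_debt_per_GDP": [50.0, np.nan, 100.0, 25.0],
    "net_debt_per_GDP": [30.0, 10.0, np.nan, 20.0],
})

# Keep two columns, rename, and attach the continent of each country
gdp = gdp[["Country", "CIA_Estimate"]].rename(columns={"CIA_Estimate": "GDP"})
gdp = gdp.merge(continents, on="Country", how="left")

# Drop rows without gross debt, then count what is still missing
pDebt = pDebt.dropna(subset=["gross_debt_per_GDP"])
print(pDebt.isna().sum())

# Master data frame
df_debt = pDebt.merge(gdp, on="Country", how="left")

# Debt is given as a percentage of GDP, so divide by 100
df_debt["gross_pub_debt"] = df_debt["GDP"] * df_debt["gross_debt_per_GDP"] / 100

# Two bins split at the median; values equal to the median fall into "Low"
median_debt = df_debt["gross_pub_debt"].median()
df_debt["debt_bin"] = pd.cut(
    df_debt["gross_pub_debt"],
    bins=[-np.inf, median_debt, np.inf],
    labels=["Low", "High"],
)
print(df_debt)
```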

#### 2.3- Impute Missing Values
- verify again which columns have missing values in the master data frame `df_debt`
- does the number of missing values justify using `.dropna()` (removing entire rows containing missing values)?
- decide which value to pass to `fillna`. One possibility is to fill the NaNs with the mean of the non-missing values. This can work if the distribution is reasonably symmetrical, i.e., the mean and median are close to each other
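
The check described above could be sketched as follows. The sample values are invented, and the rule "mean and median differ by more than 10% of the standard deviation" is an arbitrary, hypothetical cut-off chosen only to make the decision explicit; falling back to the median for a skewed column is one common alternative, not the only option.

```python
import numpy as np
import pandas as pd

# Hypothetical column with missing values and one large outlier
df_debt = pd.DataFrame({
    "net_debt_per_GDP": [10.0, 20.0, np.nan, 30.0, np.nan, 140.0],
})
col = df_debt["net_debt_per_GDP"]

# Share of missing values: a large share argues against dropna()
missing_share = col.isna().mean()
print(missing_share)

# Compare mean and median of the non-missing values (NaNs are skipped)
mean_value = col.mean()
median_value = col.median()
print(mean_value, median_value)

# Assumed rule of thumb: treat the distribution as skewed when the gap
# between mean and median exceeds 10% of the standard deviation
is_skewed = abs(mean_value - median_value) > 0.1 * col.std()
fill_value = median_value if is_skewed else mean_value

df_debt["net_debt_per_GDP"] = col.fillna(fill_value)
print(df_debt)
```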