# ICT 782 - Day 3 Notes

# File Input and Output

## Importing data with pandas

Real data arrives in many formats. While numerous Python packages exist for reading and writing data, pandas offers convenient functions for dealing with some of the most common data formats in use today. These formats include `csv`, `json`, `html`, Microsoft Excel, `hdf5` and other binary formats, and `SQL` databases. Today we'll look at how pandas reads each of these into pandas objects.

In [1]:
import numpy as np
import pandas as pd

## Dealing with zipped files

Many datasets are compressed into zip files for easier transfer between users. Some of the most common zip file extensions are `.zip` and `.gz`.

To unzip a `.zip` file, use the syntax:
```
import zipfile

# ZipFile (not the built-in open) understands the archive format.
with zipfile.ZipFile('<file_name>.zip', 'r') as archive:
    archive.extractall('<target_directory>')
```

To unzip a `.gz` file, use the syntax (from the official documentation):
```
import gzip

with gzip.open('<file_name>.gz', 'rb') as f:
    file_content = f.read()
```

For other zip file extensions, you may have to use another external package. Check [PyPI](https://pypi.org/) or Anaconda/Environments for relevant packages.
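
The two snippets above can be tried end to end without any existing data files. The sketch below is only an illustration: the file names (`demo.zip`, `demo.csv`, `demo.csv.gz`) and the small `csv` text are made up for the example, and everything is written to a temporary directory that is deleted afterwards.

```python
import gzip
import tempfile
import zipfile
from pathlib import Path

# Illustrative content only; not one of the course data files.
csv_text = 'city,population\nChicago,2718782\n'

# Work in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_dir = Path(tmp_dir)

    # Build a small .zip archive containing one csv file.
    zip_path = tmp_dir / 'demo.zip'
    with zipfile.ZipFile(zip_path, 'w') as archive:
        archive.writestr('demo.csv', csv_text)

    # Extract every member of the archive into a target directory.
    target = tmp_dir / 'unzipped'
    with zipfile.ZipFile(zip_path, 'r') as archive:
        member_names = archive.namelist()
        archive.extractall(target)
    unzipped_text = (target / 'demo.csv').read_text()

    # Write and read back a .gz file; 'wt'/'rt' work with text, not bytes.
    gz_path = tmp_dir / 'demo.csv.gz'
    with gzip.open(gz_path, 'wt') as f:
        f.write(csv_text)
    with gzip.open(gz_path, 'rt') as f:
        gz_text = f.read()

print(member_names)
print(gz_text)
```

Note that a `.zip` archive can hold many files, while a `.gz` file wraps a single compressed stream, which is why only the first needs an extraction step.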

## Reading `csv` files

Python comes pre-packaged with `csv`, a simple module for reading and writing data stored in `.csv` files. The pandas package improves on this task with the easy-to-use function `pd.read_csv()`.

To show how this function works, we'll read in the `facebook.csv` file.

In [7]:
fb = pd.read_csv('facebook.csv')
fb

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Gain
0,5/17/2013,26.400000,26.600000,26.200001,26.250000,26.250000,29462700,-0.150000
1,5/20/2013,26.180000,26.190001,25.690001,25.760000,25.760000,42402900,-0.420000
2,5/21/2013,25.870001,26.080000,25.590000,25.660000,25.660000,26261300,-0.210001
3,5/22/2013,25.650000,25.850000,24.920000,25.160000,25.160000,45314500,-0.490000
4,5/23/2013,24.799999,25.530001,24.770000,25.059999,25.059999,37663100,0.260000
...,...,...,...,...,...,...,...,...
1254,5/10/2018,183.149994,186.130005,182.500000,185.529999,185.529999,21071400,2.380005
1255,5/11/2018,184.850006,188.320007,184.179993,186.990005,186.990005,21207800,2.139999
1256,5/14/2018,187.710007,187.860001,186.199997,186.639999,186.639999,15646700,-1.070008
1257,5/15/2018,184.880005,185.289993,183.199997,184.320007,184.320007,15429400,-0.559998


This method of reading in data is simple. Since the data is stored in a tabular format, the transfer from `csv` to `DataFrame` is smooth. Many websites provide `csv` data, though some will zip the `csv` files to reduce the space they take up on disk.
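
A compressed `csv` file does not always have to be unzipped by hand first. As a minimal sketch, assuming a gzip-compressed file whose name ends in `.gz`, `pd.read_csv()` can read it directly because its `compression` argument defaults to `'infer'`. The file name `prices.csv.gz` and the two rows of data are made up for this example.

```python
import gzip
import tempfile
from pathlib import Path

import pandas as pd

# Two illustrative rows in the same layout as the stock data.
csv_text = 'Date,Close\n5/17/2013,26.25\n5/20/2013,25.76\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = Path(tmp_dir) / 'prices.csv.gz'

    # Create the compressed file we want to read.
    with gzip.open(gz_path, 'wt') as f:
        f.write(csv_text)

    # compression='infer' is the default, so the .gz extension is enough.
    prices = pd.read_csv(gz_path)

print(prices)
```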

### Keyworded arguments

There are many keyworded arguments for the `pd.read_csv` function. Rather than list them here, we can view them by calling `help` on the `pd.read_csv` function.

In [19]:
help(pd.read_csv)

Help on function read_csv in module pandas.io.parsers:

read_csv(filepath_or_buffer: Union[str, pathlib.Path, IO[~AnyStr]], sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)
    Read a comma-separated values (csv) file into Dat

In [22]:
pd.read_csv('facebook.csv', converters = {'Volume': lambda x: int(x)/1000000})

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Gain
0,5/17/2013,26.400000,26.600000,26.200001,26.250000,26.250000,29.4627,-0.150000
1,5/20/2013,26.180000,26.190001,25.690001,25.760000,25.760000,42.4029,-0.420000
2,5/21/2013,25.870001,26.080000,25.590000,25.660000,25.660000,26.2613,-0.210001
3,5/22/2013,25.650000,25.850000,24.920000,25.160000,25.160000,45.3145,-0.490000
4,5/23/2013,24.799999,25.530001,24.770000,25.059999,25.059999,37.6631,0.260000
...,...,...,...,...,...,...,...,...
1254,5/10/2018,183.149994,186.130005,182.500000,185.529999,185.529999,21.0714,2.380005
1255,5/11/2018,184.850006,188.320007,184.179993,186.990005,186.990005,21.2078,2.139999
1256,5/14/2018,187.710007,187.860001,186.199997,186.639999,186.639999,15.6467,-1.070008
1257,5/15/2018,184.880005,185.289993,183.199997,184.320007,184.320007,15.4294,-0.559998
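
A few of the other keyworded arguments listed by `help` are worth trying together. The sketch below reads from an in-memory string instead of `facebook.csv` (the three rows are copied from the output above, with fewer columns), and uses `usecols`, `parse_dates`, `index_col`, and `nrows` to select columns, convert the dates, set the index, and limit the number of rows.

```python
from io import StringIO

import pandas as pd

# A small stand-in for facebook.csv, held in memory.
csv_text = """Date,Open,Close,Volume
5/17/2013,26.40,26.25,29462700
5/20/2013,26.18,25.76,42402900
5/21/2013,25.87,25.66,26261300
"""

sample = pd.read_csv(
    StringIO(csv_text),
    usecols=['Date', 'Close', 'Volume'],  # keep only these columns
    parse_dates=['Date'],                 # convert date strings to datetimes
    index_col='Date',                     # use the dates as the row index
    nrows=2,                              # read only the first two data rows
)
print(sample)
```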


### Writing to `csv`

Writing to a `csv` file is also simple. We'll create a `DataFrame` object and save it as a `csv` file.

In [23]:
dat1 = pd.DataFrame({'cities': ['New York','Los Angeles','Chicago','Houston','Philadelphia'],
                     'population': [8405837, 3884307, 2718782, 2195914, 1553165], 
                     'budget (billions)': [73.0, 8.1, 7.3, 5.1, 3.95]})
dat1

Unnamed: 0,cities,population,budget (billions)
0,New York,8405837,73.0
1,Los Angeles,3884307,8.1
2,Chicago,2718782,7.3
3,Houston,2195914,5.1
4,Philadelphia,1553165,3.95


In [24]:
dat1.to_csv('city_budgets.csv')

By examining the saved file, we see that the index was also saved in the first column. To suppress the additional column, we can specify `index = False` in the `to_csv()` method.

In [25]:
dat1.to_csv('city_budgets.csv', index = False)
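
The effect of `index = False` can also be checked without opening the saved file. When no file name is given, `to_csv()` returns the `csv` text as a string, so we can compare the header lines directly. This sketch uses a shortened version of the cities data.

```python
import pandas as pd

cities = pd.DataFrame({'cities': ['Chicago', 'Houston'],
                       'population': [2718782, 2195914]})

# With no path argument, to_csv() returns the csv text as a string.
with_index = cities.to_csv()
without_index = cities.to_csv(index=False)

# The first line of each string is the header row.
header_with_index = with_index.splitlines()[0]
header_without_index = without_index.splitlines()[0]

print(header_with_index)     # starts with an empty name for the index column
print(header_without_index)
```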

## `json` data

JavaScript Object Notation (`json`) is a text-based method for storing data. This format is used for sending data between your local browser and web-based applications via HyperText Transfer Protocol (HTTP). The structure of `json` is very similar to a Python dictionary.

Reading `json` data is done in pandas with the `read_json()` function. By default, data is read from `json` into `DataFrame`s, so to read into a `Series`, specify the `typ = 'series'` keyworded argument.

For very long `json` files, we can read data line by line by passing in `lines = True` and then specifying `chunksize = <number of lines to read per iteration>`. 
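
Both options can be sketched with small in-memory strings. The field names `id` and `value` and all of the values below are made up for illustration. With `lines = True` and a `chunksize`, `read_json()` returns an iterator of `DataFrame`s rather than a single `DataFrame`.

```python
from io import StringIO

import pandas as pd

# Five records, one json object per line (illustrative values).
json_lines = '\n'.join([
    '{"id": 1, "value": 0.5}',
    '{"id": 2, "value": 1.5}',
    '{"id": 3, "value": 2.5}',
    '{"id": 4, "value": 3.5}',
    '{"id": 5, "value": 4.5}',
])

# Read two lines per iteration; each chunk is a DataFrame.
reader = pd.read_json(StringIO(json_lines), lines=True, chunksize=2)
chunk_sizes = [len(chunk) for chunk in reader]

# Read a flat json object into a Series instead of a DataFrame.
ser = pd.read_json(StringIO('{"a": 1, "b": 2, "c": 3}'), typ='series')

print(chunk_sizes)
print(ser)
```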

Let's read in the file called `nursing.json`. This file was downloaded from the Statistics Canada website. The data presented is from Table 13-10-0101-01, 'Public nursing and residential care facilities, summary statistics (x 1,000,000)'.

In [26]:
nursing = pd.read_json('nursing.json')
nursing

Unnamed: 0,Summary statistics,2016,2017,2018
0,Operating revenue,15676.2,16341.2,16833.4
1,Operating expenses,15779.8,16391.7,16887.9
2,"Salaries, wages, commissions and benefits",11806.4,12120.9,12508.1
3,Operating surplus or deficit,-103.6,-50.5,-54.5


In the above import, we can see that the data for `Operating revenue`, `Operating expenses`, and `Salaries, wages,...` were all imported as strings (the numbers contain commas). We can specify the `dtype = True` argument in the `read_json()` function, but since our strings contain commas the column data won't be converted properly.

We'll discuss techniques for correcting this on Day 4. For now, we'll remove the commas and convert to `float` with `lambda` functions.

In [27]:
for year in ['2016','2017','2018']:
    nursing[year] = nursing[year].apply(lambda x: float(x.replace(',','')))
nursing

Unnamed: 0,Summary statistics,2016,2017,2018
0,Operating revenue,15676.2,16341.2,16833.4
1,Operating expenses,15779.8,16391.7,16887.9
2,"Salaries, wages, commissions and benefits",11806.4,12120.9,12508.1
3,Operating surplus or deficit,-103.6,-50.5,-54.5
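
The same cleaning step can also be written with pandas string methods instead of a `lambda` function. This is only an alternative sketch: the small `DataFrame` below is a stand-in for `nursing`, with the comma-formatted strings typed in by hand.

```python
import pandas as pd

# Stand-in for the imported data: numbers stored as strings with commas.
raw = pd.DataFrame({
    'Summary statistics': ['Operating revenue', 'Operating surplus or deficit'],
    '2016': ['15,676.2', '-103.6'],
    '2017': ['16,341.2', '-50.5'],
})

cleaned = raw.copy()
for year in ['2016', '2017']:
    # Remove the thousands separators, then convert the whole column at once.
    cleaned[year] = cleaned[year].str.replace(',', '', regex=False).astype(float)

print(cleaned.dtypes)
```

Working on whole columns with `.str` methods avoids calling a Python function once per value, which tends to matter more as the data grows.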


### Writing to `json`

Given a pandas `Series` or `DataFrame`, we can easily write to `json` using the `to_json()` method. In this short example, we'll create a `DataFrame` of random values and write it to `json`.

In [28]:
example = pd.DataFrame(np.random.rand(100).reshape((10,10)), 
                       columns = ['a','b','c','d','e','f','g','h','i','j'])
example.to_json()

'{"a":{"0":0.5234960389,"1":0.8520347639,"2":0.8595491143,"3":0.8915860416,"4":0.2327143377,"5":0.6446345935,"6":0.572704724,"7":0.0766537717,"8":0.5966176903,"9":0.3523439338},"b":{"0":0.6892342972,"1":0.7092552603,"2":0.0080373426,"3":0.8196908059,"4":0.7550002273,"5":0.6939031016,"6":0.5316128312,"7":0.3940676496,"8":0.111337293,"9":0.1641365425},"c":{"0":0.4276146299,"1":0.4751076021,"2":0.2889375733,"3":0.1078478819,"4":0.8486200314,"5":0.2766525307,"6":0.6058707533,"7":0.5237181399,"8":0.121053083,"9":0.7529997803},"d":{"0":0.1571526273,"1":0.1578460936,"2":0.2472127739,"3":0.8009482672,"4":0.1243131651,"5":0.2024482903,"6":0.9625527382,"7":0.9438182711,"8":0.1310181507,"9":0.7647452449},"e":{"0":0.170035969,"1":0.8251505454,"2":0.5686237742,"3":0.8737494607,"4":0.1389203803,"5":0.9563413449,"6":0.7574542045,"7":0.7801268622,"8":0.4871930304,"9":0.4568276074},"f":{"0":0.1141999473,"1":0.2810570952,"2":0.8604056612,"3":0.7407547996,"4":0.6654534037,"5":0.8677001925,"6":0.684470757

In [29]:
with open('example_json.json', 'w') as file:
    file.write(example.to_json())
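
As a shortcut, `to_json()` also accepts a file path, in which case it writes the file itself. The sketch below uses a tiny made-up `DataFrame` and a temporary directory, and also shows the `orient = 'records'` layout, which stores one `json` object per row instead of one per column.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

small = pd.DataFrame({'a': [1, 2], 'b': [0.5, 1.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'small.json'

    # Passing a path makes to_json() write the file directly.
    small.to_json(str(path))

    # Load the file with the standard json module to inspect its layout.
    with open(path) as f:
        by_column = json.load(f)

# orient='records' produces a list with one object per row.
records = json.loads(small.to_json(orient='records'))

print(by_column)
print(records)
```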

## `html` data

Most modern websites are created with a combination of the HyperText Markup Language (`html`) and Cascading Style Sheets (`css` - not discussed here). When we navigate to a website, our web browser receives a copy of the `html` code for the site's web pages. On these pages are various `html` structures, including tables. We can read these tables using Python and pandas to glean data from the web. This technique is commonly called **web scraping**.

Just from the description above, you should be questioning the ethics of this practice. What we are saying is that it is possible to navigate around the web and take data from other people's web pages. If we own the websites we take data from, then there is no problem. **However, web scraping raises serious ethical and legal considerations, and the need for caution cannot be overstated.**

We will discuss these ethical and legal issues and considerations below. For now, let's look at how to read `html` tables from web pages.

We'll use pandas to read a single table from Wikipedia. The Wikipedia article is called *World population*, and we're looking for the first table, which is titled 'Population by continent'.

**Note:** I have read Wikipedia's policy on Database downloads and, since we are retrieving data at runtime, we are not violating their terms of service.

In [30]:
url = 'https://en.wikipedia.org/wiki/World_population'

data3 = pd.read_html(url, match = 'Population by continent')
data3

[               Continent Density(inhabitants/km2)  \
 0                   Asia                     96.4   
 1                 Africa                     36.7   
 2                 Europe                     72.9   
 3  North America[note 2]                     22.9   
 4          South America                     22.8   
 5                Oceania                      4.5   
 6             Antarctica           0.0003(varies)   
 
                          Population(millions)  \
 0                                        4436   
 1                                        1216   
 2                                         738   
 3                                         579   
 4                                         422   
 5                                        39.9   
 6  0.004 in summer(non-permanent, varies)[16]   
 
                                Most populous country  \
 0                      1,382,300,000[note 1] – China   
 1                             0186,987,000 – Nige

Note that `pd.read_html()` returns the `html` tables found at the specified url as a list of `DataFrame`s, even when a `match` criterion is specified. We access the `DataFrame` object by list index.

In [31]:
data3 = data3[0]
data3

Unnamed: 0,Continent,Density(inhabitants/km2),Population(millions),Most populous country,Most populous city (metropolitan area)
0,Asia,96.4,4436,"1,382,300,000[note 1] – China","35,676,000/13,634,685 – Greater Tokyo Area/Tok..."
1,Africa,36.7,1216,"0186,987,000 – Nigeria","21,000,000 – Lagos"
2,Europe,72.9,738,"0145,939,000 – Russia;approx. 112 million in E...","16,855,000/12,506,468 – Moscow metropolitan ar..."
3,North America[note 2],22.9,579,"0324,991,600 – United States","23,723,696/8,537,673 – New York Metropolitan A..."
4,South America,22.8,422,"0209,567,000 – Brazil","27,640,577/11,316,149 – Metro Area/São Paulo City"
5,Oceania,4.5,39.9,"0024,458,800 – Australia","5,005,400 – Sydney"
6,Antarctica,0.0003(varies),"0.004 in summer(non-permanent, varies)[16]",N/A[note 3],"1,200 (non-permanent, varies) – McMurdo Station"


In [32]:
data3.keys()

Index(['Continent', 'Density(inhabitants/km2)', 'Population(millions)',
       'Most populous country', 'Most populous city (metropolitan area)'],
      dtype='object')

### Writing to `html`

We may always write a `Series` or `DataFrame` object to `html` using the `to_html()` method. The result is a Python string with the data formatted into an `html` table. This string may then be saved to a file with the `.html` extension.

In [33]:
data = pd.DataFrame(np.random.rand(16).reshape((4,4)))
text = data.to_html()

In [34]:
with open('random_data.html', 'w') as file:
    file.write(text)

### The ethics and legality of web scraping

The first thing to keep in mind whenever you are considering scraping data from websites is that **you don't own the data.** This means that, although we *can* extract data from websites, there are many times when we *shouldn't*. Most websites that host data will have Terms of Service with this issue specifically addressed.

Furthermore, many programmers create *web crawlers* or *bots*. These automated scripts navigate through the web and scrape data at intervals. Often, the bots query websites multiple times per second, reducing the capacity of web servers to provide their services effectively. If a bot queries a site too many times, the server may block the IP address of the bot. In some cases, an entire IP range is blocked.

Multiple legal cases have ensued from these kinds of practices, some of which are detailed [in this article](https://jaxenter.com/data-scraping-cases-165385.html) (spoiler alert: the owner of the data has a better case).

Here are some general rules of thumb when trying to ethically and legally obtain data from the web:
- **Read the Terms of Service.**
- If the website has an Application Programming Interface (API), use it.
- Don't assume you can use the data you are thinking of obtaining through web scraping (**read the Terms of Service!**).
- Don't make an unreasonable number of scraping requests.
- **Read the Terms of Service.**
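
Many websites also publish a `robots.txt` file that tells automated scripts which pages they may request and how long to wait between requests. It does not replace the Terms of Service, but checking it is a reasonable habit. The sketch below uses the standard library's `urllib.robotparser` on a made-up `robots.txt`; the bot name, the `example.com` URLs, and the stand-in `fetch` function are all hypothetical. For a real site, we would point the parser at the site's own file with `set_url()` and `read()` instead.

```python
import time
from urllib import robotparser

# A made-up robots.txt; a real one is downloaded from the website itself.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

agent = 'ict782-notes-bot'  # hypothetical bot name
urls = ['https://example.com/tables/population.html',
        'https://example.com/private/members.html']

# Keep only the pages that the rules allow this agent to request.
allowed = [url for url in urls if parser.can_fetch(agent, url)]

# Respect the requested delay; fall back to one second if none is given.
delay = parser.crawl_delay(agent) or 1


def fetch_politely(urls, fetch, pause):
    """Call fetch(url) for each url, sleeping between consecutive requests."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(pause)
        results.append(fetch(url))
    return results


# Stand-in for a real download function such as pd.read_html.
pages = fetch_politely(allowed, fetch=lambda url: '<html for {}>'.format(url), pause=delay)
print(allowed, delay)
```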

## Reading Excel files

To read files created in Microsoft Excel, we have two options: 
1. Create an `ExcelFile` object for the Excel workbook, then read individual worksheets with its `parse()` method (the worksheet names are stored as a list in the `sheet_names` attribute).
2. Read in the Excel workbook directly and specify the worksheet name. By default, only the first worksheet is read.

We'll explore both of these techniques. The file `facebook.xlsx` contains two worksheets. The first worksheet, titled '2013-2016', contains daily Facebook stock quotes between 17 May 2013 and 16 May 2016. The second worksheet ('2016-2020') contains daily Facebook stock quotes between 17 May 2016 and 19 Feb 2020.

First, we'll create an `ExcelFile` object, list its worksheet names, and read one worksheet with `parse()`.

In [35]:
fb = pd.ExcelFile('facebook.xlsx')

In [36]:
fb.sheet_names

['2013-2016', '2016-2020', 'wrong_type']

In [37]:
fb_2013_2016 = fb.parse(sheet_name = '2013-2016')
fb_2013_2016

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Gain
0,2013-05-17,26.400000,26.600000,26.200001,26.250000,26.250000,29462700,-0.150000
1,2013-05-20,26.180000,26.190001,25.690001,25.760000,25.760000,42402900,-0.420000
2,2013-05-21,25.870001,26.080000,25.590000,25.660000,25.660000,26261300,-0.210001
3,2013-05-22,25.650000,25.850000,24.920000,25.160000,25.160000,45314500,-0.490000
4,2013-05-23,24.799999,25.530001,24.770000,25.059999,25.059999,37663100,0.260000
...,...,...,...,...,...,...,...,...
750,2016-05-10,119.620003,120.500000,119.000000,120.500000,120.500000,22803700,0.879997
751,2016-05-11,120.410004,121.080002,119.419998,119.519997,119.519997,22038400,-0.890007
752,2016-05-12,119.980003,120.839996,118.900002,120.279999,120.279999,22035500,0.299996
753,2016-05-13,120.379997,120.639999,119.680000,119.809998,119.809998,18124300,-0.569999


In [38]:
facebook = pd.read_excel('facebook.xlsx', sheet_name = '2016-2020')
facebook

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2016-05-17,118.820000,119.010002,117.199997,117.349998,117.349998,21328600
1,2016-05-18,116.800003,118.269997,116.730003,117.650002,117.650002,21642300
2,2016-05-19,117.050003,117.489998,115.879997,116.809998,116.809998,20544100
3,2016-05-20,116.959999,117.989998,116.949997,117.349998,117.349998,18944800
4,2016-05-23,117.419998,117.599998,115.940002,115.970001,115.970001,20441000
...,...,...,...,...,...,...,...
941,2020-02-12,207.850006,211.220001,207.399994,210.759995,210.759995,13813700
942,2020-02-13,209.520004,214.330002,209.179993,213.139999,213.139999,15396600
943,2020-02-14,214.000000,214.929993,212.649994,214.179993,214.179993,10741700
944,2020-02-18,213.550003,217.979996,213.399994,217.800003,217.800003,15609200


### Converting imported column data

It often happens that the Excel columns are in the wrong data format. For example, we may have a column of strings that look like dates. To read this kind of data, we can specify the keyworded argument `parse_dates = ['date_strings']`.

In general, we can specify the keyworded argument `converters = {}`, where the dictionary maps each column name to a function that is applied to the values in that column.

Returning to the Facebook data, suppose we want to convert the `Volume` column to float. We could do that by specifying:
```
dtype = {<col_name> : <type>}
```
or:
```
converters = {'Volume': float}
```

Let's see it in action.

In [39]:
facebook = pd.read_excel('facebook.xlsx', dtype = {'Volume': float})
facebook

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Gain
0,2013-05-17,26.400000,26.600000,26.200001,26.250000,26.250000,29462700.0,-0.150000
1,2013-05-20,26.180000,26.190001,25.690001,25.760000,25.760000,42402900.0,-0.420000
2,2013-05-21,25.870001,26.080000,25.590000,25.660000,25.660000,26261300.0,-0.210001
3,2013-05-22,25.650000,25.850000,24.920000,25.160000,25.160000,45314500.0,-0.490000
4,2013-05-23,24.799999,25.530001,24.770000,25.059999,25.059999,37663100.0,0.260000
...,...,...,...,...,...,...,...,...
750,2016-05-10,119.620003,120.500000,119.000000,120.500000,120.500000,22803700.0,0.879997
751,2016-05-11,120.410004,121.080002,119.419998,119.519997,119.519997,22038400.0,-0.890007
752,2016-05-12,119.980003,120.839996,118.900002,120.279999,120.279999,22035500.0,0.299996
753,2016-05-13,120.379997,120.639999,119.680000,119.809998,119.809998,18124300.0,-0.569999


In [40]:
facebook = pd.read_excel('facebook.xlsx', converters = {'Volume': float})
facebook

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Gain
0,2013-05-17,26.400000,26.600000,26.200001,26.250000,26.250000,29462700.0,-0.150000
1,2013-05-20,26.180000,26.190001,25.690001,25.760000,25.760000,42402900.0,-0.420000
2,2013-05-21,25.870001,26.080000,25.590000,25.660000,25.660000,26261300.0,-0.210001
3,2013-05-22,25.650000,25.850000,24.920000,25.160000,25.160000,45314500.0,-0.490000
4,2013-05-23,24.799999,25.530001,24.770000,25.059999,25.059999,37663100.0,0.260000
...,...,...,...,...,...,...,...,...
750,2016-05-10,119.620003,120.500000,119.000000,120.500000,120.500000,22803700.0,0.879997
751,2016-05-11,120.410004,121.080002,119.419998,119.519997,119.519997,22038400.0,-0.890007
752,2016-05-12,119.980003,120.839996,118.900002,120.279999,120.279999,22035500.0,0.299996
753,2016-05-13,120.379997,120.639999,119.680000,119.809998,119.809998,18124300.0,-0.569999


From here, we can start to get creative. For example, instead of specifying a type as the value in our `converters` dictionary, we could specify a `lambda` expression. Here's an example of converting the `Volume` column to hexadecimal.

In [43]:
fb = pd.read_excel('facebook.xlsx', sheet_name = '2013-2016', converters = {'Volume': lambda x: hex(x)})
fb

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Gain
0,2013-05-17,26.400000,26.600000,26.200001,26.250000,26.250000,0x1c190ac,-0.150000
1,2013-05-20,26.180000,26.190001,25.690001,25.760000,25.760000,0x2870454,-0.420000
2,2013-05-21,25.870001,26.080000,25.590000,25.660000,25.660000,0x190b734,-0.210001
3,2013-05-22,25.650000,25.850000,24.920000,25.160000,25.160000,0x2b371c4,-0.490000
4,2013-05-23,24.799999,25.530001,24.770000,25.059999,25.059999,0x23eb17c,0.260000
...,...,...,...,...,...,...,...,...
750,2016-05-10,119.620003,120.500000,119.000000,120.500000,120.500000,0x15bf4f4,0.879997
751,2016-05-11,120.410004,121.080002,119.419998,119.519997,119.519997,0x1504780,-0.890007
752,2016-05-12,119.980003,120.839996,118.900002,120.279999,120.279999,0x1503c2c,0.299996
753,2016-05-13,120.379997,120.639999,119.680000,119.809998,119.809998,0x1148e0c,-0.569999
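
The difference between `dtype` and `converters` can be tried without an Excel file, since `pd.read_csv()` accepts the same two keyworded arguments. The sketch below uses two rows of the Facebook data held in a string. One caveat, based on the examples in these notes: in `read_csv` the converter receives each value as a string (hence the `int(x)`), whereas the `read_excel` example above passed the cell values straight to `hex`.

```python
from io import StringIO

import pandas as pd

csv_text = """Date,Close,Volume
2013-05-17,26.25,29462700
2013-05-20,25.76,42402900
"""

# dtype: ask pandas to store the whole column as a given type.
as_float = pd.read_csv(StringIO(csv_text), dtype={'Volume': float})

# converters: apply an arbitrary function to each value in the column.
in_millions = pd.read_csv(
    StringIO(csv_text),
    converters={'Volume': lambda x: int(x) / 1000000},
)

print(as_float['Volume'].tolist())
print(in_millions['Volume'].tolist())
```

As a rule of thumb, `dtype` is enough for a plain type change, while `converters` is the tool when each value needs some computation on the way in.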


In [44]:
# Full list of keyworded arguments.
help(pd.read_excel)

Help on function read_excel in module pandas.io.excel._base:

read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, squeeze=False, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, keep_default_na=True, verbose=False, parse_dates=False, date_parser=None, thousands=None, comment=None, skip_footer=0, skipfooter=0, convert_float=True, mangle_dupe_cols=True, **kwds)
    Read an Excel file into a pandas DataFrame.
    
    Support both `xls` and `xlsx` file extensions from a local filesystem or URL.
    Support an option to read a single sheet or a list of sheets.
    
    Parameters
    ----------
    io : str, ExcelFile, xlrd.Book, path object or file-like object
        Any valid string path is acceptable. The string could be a URL. Valid
        URL schemes include http, ftp, s3, and file. For file URLs, a host is
        expected. A local file could be: ``file://localhost/path/to/table.xl

### Write to Excel

We can also output data to Excel files using the `to_excel()` method of the `DataFrame` object. To demonstrate this, we'll create a new `DataFrame` from the Facebook data (by copying the original data) consisting of only the `Volume` and `Gain` columns. We'll add on columns for `Gain**2` and `Gain**3`, and then save the resulting `DataFrame` to an Excel file.

In [45]:
fb_g = facebook[['Volume','Gain']].copy()
fb_g

Unnamed: 0,Volume,Gain
0,29462700.0,-0.150000
1,42402900.0,-0.420000
2,26261300.0,-0.210001
3,45314500.0,-0.490000
4,37663100.0,0.260000
...,...,...
750,22803700.0,0.879997
751,22038400.0,-0.890007
752,22035500.0,0.299996
753,18124300.0,-0.569999


In [46]:
fb_g['Gain**2'] = facebook['Gain']**2
fb_g['Gain**3'] = facebook['Gain']**3
fb_g

Unnamed: 0,Volume,Gain,Gain**2,Gain**3
0,29462700.0,-0.150000,0.022500,-0.003375
1,42402900.0,-0.420000,0.176400,-0.074088
2,26261300.0,-0.210001,0.044100,-0.009261
3,45314500.0,-0.490000,0.240100,-0.117649
4,37663100.0,0.260000,0.067600,0.017576
...,...,...,...,...
750,22803700.0,0.879997,0.774395,0.681465
751,22038400.0,-0.890007,0.792112,-0.704986
752,22035500.0,0.299996,0.089998,0.026999
753,18124300.0,-0.569999,0.324899,-0.185192


In [47]:
fb_g.to_excel('facebook_gains.xlsx')

## Reading from the clipboard

Everybody has likely copy/pasted something using a computer at some time. On a Windows/Linux PC, the user can select something with the keyboard or mouse and then copy the selection by pressing `Ctrl` + `C`. On a Mac, copying is done with `command` + `C`. These two keystrokes save the selection to the 'clipboard', a temporary storage space in the computer's memory.

Using pandas, we can access data contained in the clipboard and read it into Python. This is done with the simple function `pd.read_clipboard()`.

To illustrate this, I'll navigate to the Royal Bank of Canada (RBC) website and look at their [mortgage rates](https://www.rbcroyalbank.com/mortgages/mortgage-rates.html). We'll copy the table at that web address to the clipboard and then read it into Python.

In [48]:
data5 = pd.read_clipboard()
data5

Unnamed: 0,Term,Special Offers,APR
0,2 Year Fixed,3.040%,3.100%
1,5 Year Fixed,3.090%,3.120%
2,5 Year Variable,RBC Prime Rate - 0.600% (3.350%),3.380%


While this is a manual way to obtain data, it is simple and effective!
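
Since the clipboard contents depend on what was last copied, the cell above cannot be re-run reliably. As I understand the pandas documentation, `pd.read_clipboard()` reads the clipboard text and passes it to `pd.read_csv()`, so we can imitate it with a string. The sketch below assumes the copied table arrived as tab-separated text (an assumption about how the browser copies tables) and uses two rows from the output above, then converts the `APR` column to numbers.

```python
from io import StringIO

import pandas as pd

# Stand-in for the clipboard contents: tab-separated text.
clipboard_text = (
    'Term\tSpecial Offers\tAPR\n'
    '2 Year Fixed\t3.040%\t3.100%\n'
    '5 Year Fixed\t3.090%\t3.120%\n'
)

rates = pd.read_csv(StringIO(clipboard_text), sep='\t')

# Strip the percent signs so that the rates can be used in calculations.
rates['APR'] = rates['APR'].str.rstrip('%').astype(float)
print(rates)
```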

## `hdf5` data

The Hierarchical Data Format, or `hdf5`, was specifically designed to store huge amounts of data. It is often used in meteorology and geology for storing 3D data. File extensions related to `hdf5` are `.hdf`, `.h4`, `.h5`, `.hdf4`, `.hdf5`, `.he2`, and `.he5`. One of the main benefits of `hdf5` is how data is organized, which is very similar to how a filesystem organizes folders and files.

Behind the scenes, when pandas loads data from `hdf5` it is using the PyTables package, which comes with Anaconda Distribution. There is a wide range of functionality for `hdf5` with pandas. We'll only cover the most basic functions here, but if you're interested you can check out the [official documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#hdf5-pytables).

First, we'll make a new `hdf5` file.

In [49]:
store = pd.HDFStore('store.h5', mode = 'w')
store

<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

With our `hdf5` file created, we can now add data to it. Let's add a `DataFrame` of random numbers.

In [50]:
tab = pd.DataFrame(np.random.rand(100).reshape((10,10)))
store['tab'] = tab

The `DataFrame` above is now saved in the `hdf5` file in the same way that a file is stored in a directory. We can access our `DataFrame` by specifying its name, similar to how we would access a dictionary value by specifying its key.

In [51]:
store['tab']

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.099085,0.880313,0.868356,0.077682,0.516491,0.348102,0.927215,0.879138,0.4875,0.684574
1,0.604251,0.242577,0.368504,0.662979,0.288606,0.749507,0.895846,0.140755,0.054564,0.43243
2,0.086902,0.152969,0.670646,0.499511,0.538223,0.548401,0.311891,0.777406,0.809114,0.196571
3,0.07297,0.671817,0.841057,0.486297,0.095402,0.108656,0.921034,0.744264,0.062493,0.278823
4,0.549992,0.807169,0.861014,0.547611,0.226705,0.505531,0.344951,0.40326,0.054537,0.690699
5,0.529947,0.899889,0.028189,0.436676,0.926668,0.117461,0.79377,0.296493,0.48781,0.311897
6,0.930604,0.217079,0.17668,0.796255,0.742811,0.396073,0.504651,0.47298,0.94939,0.851314
7,0.011418,0.412072,0.322582,0.790515,0.518275,0.991522,0.883076,0.750539,0.712671,0.145354
8,0.575972,0.442255,0.872041,0.867145,0.069681,0.432039,0.874137,0.953014,0.523563,0.925607
9,0.007747,0.207172,0.75145,0.101848,0.562954,0.907759,0.523321,0.906072,0.630652,0.217937


We can add more `DataFrame`s to our `hdf5` file and recall them as long as the file is open. To view the names of `DataFrame`s within the file, we can call the `groups()` method.

In [52]:
tab2 = pd.DataFrame({'A': np.arange(15),
                     'B': np.random.rand(15)})
store['tab2'] = tab2

In [53]:
store.groups()

[/tab (Group) ''
   children := ['axis0' (Array), 'axis1' (Array), 'block0_values' (Array), 'block0_items' (Array)],
 /tab2 (Group) ''
   children := ['axis0' (Array), 'axis1' (Array), 'block0_values' (Array), 'block0_items' (Array), 'block1_values' (Array), 'block1_items' (Array)]]

If we want to see all of the data stored in our `hdf5` file, we use the `walk()` method. The following code comes directly from the official documentation.

In [54]:
for (path, subgroups, subkeys) in store.walk():
    for subgroup in subgroups:
        print('GROUP: {}/{}'.format(path, subgroup))
    
    for subkey in subkeys:
        key = '/'.join([path, subkey])
        print('KEY: {}'.format(key))
        print(store.get(key))

KEY: /tab
          0         1         2         3         4         5         6  \
0  0.099085  0.880313  0.868356  0.077682  0.516491  0.348102  0.927215   
1  0.604251  0.242577  0.368504  0.662979  0.288606  0.749507  0.895846   
2  0.086902  0.152969  0.670646  0.499511  0.538223  0.548401  0.311891   
3  0.072970  0.671817  0.841057  0.486297  0.095402  0.108656  0.921034   
4  0.549992  0.807169  0.861014  0.547611  0.226705  0.505531  0.344951   
5  0.529947  0.899889  0.028189  0.436676  0.926668  0.117461  0.793770   
6  0.930604  0.217079  0.176680  0.796255  0.742811  0.396073  0.504651   
7  0.011418  0.412072  0.322582  0.790515  0.518275  0.991522  0.883076   
8  0.575972  0.442255  0.872041  0.867145  0.069681  0.432039  0.874137   
9  0.007747  0.207172  0.751450  0.101848  0.562954  0.907759  0.523321   

          7         8         9  
0  0.879138  0.487500  0.684574  
1  0.140755  0.054564  0.432430  
2  0.777406  0.809114  0.196571  
3  0.744264  0.062493  0.278

When we are finished with our `hdf5` file, we should remember to `close()` it.

In [55]:
store.close()

To read in an `hdf5` file, we use the `pd.read_hdf` function. If there is more than one dataset in the `hdf5` file, we need to specify which dataset we are extracting with the `key` keyworded argument.

In [59]:
store = pd.read_hdf('store.h5', key = '/tab')
store

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.099085,0.880313,0.868356,0.077682,0.516491,0.348102,0.927215,0.879138,0.4875,0.684574
1,0.604251,0.242577,0.368504,0.662979,0.288606,0.749507,0.895846,0.140755,0.054564,0.43243
2,0.086902,0.152969,0.670646,0.499511,0.538223,0.548401,0.311891,0.777406,0.809114,0.196571
3,0.07297,0.671817,0.841057,0.486297,0.095402,0.108656,0.921034,0.744264,0.062493,0.278823
4,0.549992,0.807169,0.861014,0.547611,0.226705,0.505531,0.344951,0.40326,0.054537,0.690699
5,0.529947,0.899889,0.028189,0.436676,0.926668,0.117461,0.79377,0.296493,0.48781,0.311897
6,0.930604,0.217079,0.17668,0.796255,0.742811,0.396073,0.504651,0.47298,0.94939,0.851314
7,0.011418,0.412072,0.322582,0.790515,0.518275,0.991522,0.883076,0.750539,0.712671,0.145354
8,0.575972,0.442255,0.872041,0.867145,0.069681,0.432039,0.874137,0.953014,0.523563,0.925607
9,0.007747,0.207172,0.75145,0.101848,0.562954,0.907759,0.523321,0.906072,0.630652,0.217937


In [60]:
# Full list of keyworded arguments.
help(pd.read_hdf)

Help on function read_hdf in module pandas.io.pytables:

read_hdf(path_or_buf, key=None, mode='r', **kwargs)
    Read from the store, close it if we opened it.
    
    Retrieve pandas object stored in file, optionally based on where
    criteria
    
    Parameters
    ----------
    path_or_buf : str, path object, pandas.HDFStore or file-like object
        Any valid string path is acceptable. The string could be a URL. Valid
        URL schemes include http, ftp, s3, and file. For file URLs, a host is
        expected. A local file could be: ``file://localhost/path/to/table.h5``.
    
        If you want to pass in a path object, pandas accepts any
        ``os.PathLike``.
    
        Alternatively, pandas accepts an open :class:`pandas.HDFStore` object.
    
        By file-like object, we refer to objects with a ``read()`` method,
        such as a file handler (e.g. via builtin ``open`` function)
        or ``StringIO``.
    
        .. versionadded:: 0.19.0 support for pathlib,

## Using `SQL` with pandas

Dealing with `SQL` databases and queries is also possible with pandas. Anaconda distribution comes with SQLAlchemy, which provides database abstraction. Python also comes standard with SQLite, which we'll use as our driver library.

The function for reading `SQL` tables and running `SQL` queries is, unsurprisingly, `pd.read_sql()`. This is really a convenience wrapper around two functions, `pd.read_sql_query()` and `pd.read_sql_table()`, and pandas chooses between them based on the first argument (a `SQL` query or a table name). In this brief section, we'll demonstrate a somewhat artificial example where we will first create a `SQL` database in memory and then read it back.

In [61]:
# Creating a local SQLite database engine

from sqlalchemy import create_engine

engine = create_engine('sqlite:///:memory:')

# If you want to use an actual connection, edit the next two lines as necessary.
# with engine.connect() as connection, connection.begin():
#     data = pd.read_sql_table('data', connection)

In [62]:
data6 = pd.DataFrame({'date': ['2-2-2020','2-3-2020','2-5-2020','2-15-2020','2-16-2020'],
                      'rate': [16.7, 14.2, 15.4, 17.7, 19.1],
                      'soft': [True, False, False, True, False]
                     })
data6

Unnamed: 0,date,rate,soft
0,2-2-2020,16.7,True
1,2-3-2020,14.2,False
2,2-5-2020,15.4,False
3,2-15-2020,17.7,True
4,2-16-2020,19.1,False


In [63]:
# Save the DataFrame to our SQL 'database'

data6.to_sql('data', engine)

Now that we have a `SQL` 'database', we can use standard `SQL` queries on it.

In [64]:
pd.read_sql('SELECT rate from data WHERE rate > 15', engine)

Unnamed: 0,rate
0,16.7
1,15.4
2,17.7
3,19.1


We can also read in an entire table of the 'database' as a `DataFrame` object by passing the table name instead of a query.

In [65]:
data_sql = pd.read_sql('data', engine, index_col = 'index')
data_sql

Unnamed: 0_level_0,date,rate,soft
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,2-2-2020,16.7,True
1,2-3-2020,14.2,False
2,2-5-2020,15.4,False
3,2-15-2020,17.7,True
4,2-16-2020,19.1,False


If you want to use a different database URI, you can specify it when you use the `create_engine()` function from SQLAlchemy.

For examples of this usage, check out the [official documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#engine-connection-examples).
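As a rough sketch of the same round trip without SQLAlchemy, the example below uses a plain `sqlite3` connection from the standard library and a parameterized query. The table contents and the names `conn`, `rates`, and `high` are made up for illustration; the `?` placeholder style is specific to SQLite.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database through the standard-library driver.
conn = sqlite3.connect(':memory:')

# Illustrative data only (a shortened version of the table used above).
rates = pd.DataFrame({'date': ['2-2-2020', '2-3-2020', '2-5-2020'],
                      'rate': [16.7, 14.2, 15.4]})
rates.to_sql('data', conn, index = False)

# Pass the threshold as a parameter instead of pasting it into the query string.
high = pd.read_sql_query('SELECT date, rate FROM data WHERE rate > ?',
                         conn, params = (15,))
conn.close()

print(high)
```

Passing values through `params` keeps the query text fixed, which is generally safer than building the query with string formatting.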

In [66]:
# Full list of keyworded arguments.
help(pd.read_sql)

Help on function read_sql in module pandas.io.sql:

read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None)
    Read SQL query or database table into a DataFrame.
    
    This function is a convenience wrapper around ``read_sql_table`` and
    ``read_sql_query`` (for backward compatibility). It will delegate
    to the specific function depending on the provided input. A SQL query
    will be routed to ``read_sql_query``, while a database table name will
    be routed to ``read_sql_table``. Note that the delegated function might
    have more specific notes about their functionality not listed here.
    
    Parameters
    ----------
    sql : string or SQLAlchemy Selectable (select or text object)
        SQL query to be executed or a table name.
    con : SQLAlchemy connectable (engine/connection) or database string URI
        or DBAPI2 connection (fallback mode)
    
        Using SQLAlchemy makes it possible to use any DB

# Missing data

As stated before, real data is often incomplete. Think of the census taker who walks door-to-door in the middle of the afternoon, trying to collect data from every household in a given neighborhood. It is not possible to collect data from every single household, since not every household has a willing occupant during the time the census taker comes to visit. Data can be incomplete for any number of reasons, each of which results in **missing data**. In a pandas `Series` or `DataFrame`, missing values are represented by `NaN` (not-a-number).

The significance of missing data is relative. If we are working with a massive dataset consisting of millions of values spanning multiple columns and only a few rows contain missing values, then we can likely drop the missing values and continue with our intended analysis. Conversely, if we have a modest dataset and more than half the columns contain missing values, then we will need a more strategic approach to missing data.

Given a dataset with missing values, our options essentially reduce to the following three choices:
1. Drop rows with missing values.
2. Fill missing values with values present in the dataset.
3. Use more sophisticated techniques to generate reasonable synthetic data.

In this section, we'll explore all three options.

## Option 1 - Dropping rows with missing values

Before removing missing values, we can view their locations using the `.isna()` method. This returns a boolean mask with the same shape as the data, with `True` in positions where a `NaN` value occurs.

In [67]:
miss1 = pd.DataFrame({'col1': np.random.rand(10), 
                      'col2': [10.1, 9.8, None, 8.8, 10.2, None, 11.0, 9.9, None, 8.1]})
miss1

Unnamed: 0,col1,col2
0,0.917361,10.1
1,0.195088,9.8
2,0.816589,
3,0.793534,8.8
4,0.465535,10.2
5,0.860016,
6,0.249353,11.0
7,0.08583,9.9
8,0.337488,
9,0.306436,8.1


In [68]:
miss1.isna()

Unnamed: 0,col1,col2
0,False,False
1,False,False
2,False,True
3,False,False
4,False,False
5,False,True
6,False,False
7,False,False
8,False,True
9,False,False


In [69]:
miss1.isna().sum()

col1    0
col2    3
dtype: int64
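Because `True` counts as 1, the same mask can also give the *fraction* of missing values per column, or pick out the affected rows. This is a small sketch on a made-up frame; `demo`, `missing_fraction`, and `rows_with_missing` are illustrative names.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'col1': [0.5, 0.1, 0.9, 0.3],
                     'col2': [10.1, np.nan, 8.8, np.nan]})

# Mean of a boolean column = share of True values = share of missing entries.
missing_fraction = demo.isna().mean()

# Keep only the rows that have at least one missing value.
rows_with_missing = demo[demo.isna().any(axis = 1)]

print(missing_fraction)
print(rows_with_missing)
```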

To remove the rows that contain `NaN`, we use the `.dropna()` method. **Note:** this returns a new `DataFrame` rather than deleting rows from the original, so you do not lose data that you might need later.

In [70]:
miss1.dropna()

Unnamed: 0,col1,col2
0,0.917361,10.1
1,0.195088,9.8
3,0.793534,8.8
4,0.465535,10.2
6,0.249353,11.0
7,0.08583,9.9
9,0.306436,8.1


By default, the `.dropna()` method drops *rows* with missing data. We can specify the keyworded argument `axis = 1` to drop *columns* with missing data (default is `axis = 0`).

In [71]:
miss1.dropna(axis = 1)

Unnamed: 0,col1
0,0.917361
1,0.195088
2,0.816589
3,0.793534
4,0.465535
5,0.860016
6,0.249353
7,0.08583
8,0.337488
9,0.306436


Another option is to only drop rows in which *all* of the values are `NaN`. For this option, use the keyworded argument `how = 'all'`.

In [72]:
miss1.dropna(how = 'all')

Unnamed: 0,col1,col2
0,0.917361,10.1
1,0.195088,9.8
2,0.816589,
3,0.793534,8.8
4,0.465535,10.2
5,0.860016,
6,0.249353,11.0
7,0.08583,9.9
8,0.337488,
9,0.306436,8.1


In [73]:
# Set the value by row and column label in one .loc call (avoids chained assignment).
miss1.loc[2, 'col1'] = np.nan
miss1.dropna(how = 'all')

Unnamed: 0,col1,col2
0,0.917361,10.1
1,0.195088,9.8
3,0.793534,8.8
4,0.465535,10.2
5,0.860016,
6,0.249353,11.0
7,0.08583,9.9
8,0.337488,
9,0.306436,8.1


There is a less harsh way to deal with rows that only contain *some* missing values. Suppose we want to keep only the rows with at least a given number of non-missing values. We can specify this using the keyworded argument `thresh = <minimum number of non-missing values>`.

In [74]:
miss1['col3'] = [None, 2.2, None, 1.2, 5.4, None, None, 1.1, 2.2, 1.2]
miss1

Unnamed: 0,col1,col2,col3
0,0.917361,10.1,
1,0.195088,9.8,2.2
2,,,
3,0.793534,8.8,1.2
4,0.465535,10.2,5.4
5,0.860016,,
6,0.249353,11.0,
7,0.08583,9.9,1.1
8,0.337488,,2.2
9,0.306436,8.1,1.2


In [75]:
# Only keep rows with at least 2 non-missing values.
miss1.dropna(thresh = 2)

Unnamed: 0,col1,col2,col3
0,0.917361,10.1,
1,0.195088,9.8,2.2
3,0.793534,8.8,1.2
4,0.465535,10.2,5.4
6,0.249353,11.0,
7,0.08583,9.9,1.1
8,0.337488,,2.2
9,0.306436,8.1,1.2


Keep in mind, your data might not have conveniently placed `NaN` values where missing values occur. I have seen a dataset where missing values were coded as any of the values `[96, 97, 98, 99]`. Different data collection strategies result in different ways of coding missing values. Luckily for us, we can use the `.replace()` method to find and replace these variably coded missing values with `NaN`.

In [76]:
miss2 = pd.DataFrame({'col1': np.random.rand(10), 
                      'col2': [10.1, 9.8, 99, 8.8, 10.2, 96, 11.0, 9.9, 97, 8.1]})
miss2

Unnamed: 0,col1,col2
0,0.986065,10.1
1,0.155589,9.8
2,0.493741,99.0
3,0.194466,8.8
4,0.16426,10.2
5,0.724393,96.0
6,0.554712,11.0
7,0.992355,9.9
8,0.05729,97.0
9,0.874637,8.1


In [77]:
# Replace with np.nan (a real missing value), not the string 'NaN'.
miss2.replace([96., 97., 98., 99.], np.nan)

Unnamed: 0,col1,col2
0,0.986065,10.1
1,0.155589,9.8
2,0.493741,
3,0.194466,8.8
4,0.16426,10.2
5,0.724393,
6,0.554712,11.0
7,0.992355,9.9
8,0.05729,
9,0.874637,8.1


In general, the syntax for the `.replace()` method is:
```
pandas.DataFrame.replace(<values to replace>, <replacement value>)
```

You can also pass in a dictionary consisting of `key: value` pairs where the `key`s correspond to the missing values, and the `value`s correspond to the replacement values.
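A minimal sketch of the dictionary form, on invented data: a flat dictionary applies to every column, while a nested dictionary keyed by column name restricts the replacement to that column. The frame `survey` and the code list are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey where 96-99 are missing-value codes for 'score',
# but 96 and 97 are legitimate ages.
survey = pd.DataFrame({'age': [34, 97, 51, 96],
                       'score': [10.1, 99, 8.8, 96]})

codes = {96: np.nan, 97: np.nan, 98: np.nan, 99: np.nan}

# Flat dictionary: replaces the codes wherever they occur.
all_cols = survey.replace(codes)

# Nested dictionary: only look in the 'score' column.
score_only = survey.replace({'score': codes})

print(all_cols)
print(score_only)
```

The nested form is the safer choice when a code value could be a legitimate observation in some other column.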

## Option 2 - Filling missing values

There are many cases where we would prefer filling in the missing data to losing it altogether. The simplest method for filling a missing value is to replace `NaN` with the value immediately above or below it. For these tasks, we have the `.fillna()` method. 

If we instead pass `.fillna()` a single value, every `NaN` is replaced with that value, which is equivalent to using `.replace(np.nan, <value>)`.

To fill using the value immediately *above* the missing value, use `.fillna(method = 'ffill')`. To fill using the value immediately *below* the missing value, use `.fillna(method = 'bfill')`. The shortcut methods `.ffill()` and `.bfill()` do the same thing, and recent pandas releases deprecate the `method` keyword in their favor.

In [78]:
miss1

Unnamed: 0,col1,col2,col3
0,0.917361,10.1,
1,0.195088,9.8,2.2
2,,,
3,0.793534,8.8,1.2
4,0.465535,10.2,5.4
5,0.860016,,
6,0.249353,11.0,
7,0.08583,9.9,1.1
8,0.337488,,2.2
9,0.306436,8.1,1.2


In [79]:
miss1.fillna(method = 'ffill')

Unnamed: 0,col1,col2,col3
0,0.917361,10.1,
1,0.195088,9.8,2.2
2,0.195088,9.8,2.2
3,0.793534,8.8,1.2
4,0.465535,10.2,5.4
5,0.860016,10.2,5.4
6,0.249353,11.0,5.4
7,0.08583,9.9,1.1
8,0.337488,9.9,2.2
9,0.306436,8.1,1.2


In [80]:
miss1.fillna(method = 'bfill')

Unnamed: 0,col1,col2,col3
0,0.917361,10.1,2.2
1,0.195088,9.8,2.2
2,0.793534,8.8,1.2
3,0.793534,8.8,1.2
4,0.465535,10.2,5.4
5,0.860016,11.0,1.1
6,0.249353,11.0,1.1
7,0.08583,9.9,1.1
8,0.337488,8.1,2.2
9,0.306436,8.1,1.2


Each of these calls returns a modified copy of the `DataFrame`. We can change the `DataFrame` itself by passing in the argument `inplace = True`.

In [81]:
miss1.fillna(method = 'ffill', inplace = True)
miss1

Unnamed: 0,col1,col2,col3
0,0.917361,10.1,
1,0.195088,9.8,2.2
2,0.195088,9.8,2.2
3,0.793534,8.8,1.2
4,0.465535,10.2,5.4
5,0.860016,10.2,5.4
6,0.249353,11.0,5.4
7,0.08583,9.9,1.1
8,0.337488,9.9,2.2
9,0.306436,8.1,1.2
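Notice that row 0 of `col3` is still missing, since a forward fill has nothing above the first row to copy. One possible workaround, sketched here on a small made-up column with the `.ffill()` and `.bfill()` methods, is to forward fill and then backward fill whatever remains.

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({'col3': [np.nan, 2.2, np.nan, 1.2]})

forward = gaps.ffill()        # leading NaN stays, row 2 takes the value above
backward = gaps.bfill()       # row 0 and row 2 take the value below
both = gaps.ffill().bfill()   # forward fill first, then fill the leading gap

print(forward)
print(backward)
print(both)
```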


## Option 3 - Generate synthetic data

### Assumptions about the missing data

When collecting data for clinical trials, there are 3 mechanisms by which missing data is generated. 

1. **Missing completely at random (MCAR)** - This is the best-case scenario. In this case, the probability that a value is missing is unrelated to any value in the data, whether recorded or missing, and is also independent of the data collection method. If we have MCAR missing data, then we may assume that the incomplete data accurately represents the population of interest. Unfortunately, this is not a realistic assumption in most cases.


2. **Missing at random (MAR)** - This happens when the probability of a missing value is related to some other recorded value, but has nothing to do with the missing value itself. For example, maybe we send out a health survey and find that only men didn't respond to the survey question about daily caloric intake. In this case, there is a connection between the missing value (daily caloric intake) and another recorded value (gender). However, it is impossible to know if daily caloric intake is missing **solely** because of gender. Therefore, MAR must be assumed. In this case, we can use the complete data to predict the missing values (Jakobsen et al., 2017).


3. **Missing not at random (MNAR)** - This is when the probability of a missing value depends on the value that is missing. For example, the daily caloric intake was not reported by men because of their daily caloric intake. In other words, the mechanism explaining the missing values is dependent upon the missing values themselves.
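To make the three mechanisms concrete, here is a small simulation sketch. Everything in it is invented for illustration: the `survey` data, the probabilities, and the 2500-calorie cutoff. It removes values from a complete column in three different ways and compares the means of what is left.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical complete survey data.
survey = pd.DataFrame({'gender': rng.choice(['M', 'F'], size = n),
                       'calories': rng.normal(2200, 400, size = n)})

# MCAR: every value has the same 20% chance of being missing.
mcar = survey['calories'].mask(rng.random(n) < 0.2)

# MAR: the chance of being missing depends only on the recorded gender.
p_mar = np.where(survey['gender'] == 'M', 0.4, 0.0)
mar = survey['calories'].mask(rng.random(n) < p_mar)

# MNAR: the chance of being missing depends on the calorie value itself.
p_mnar = np.where(survey['calories'] > 2500, 0.6, 0.0)
mnar = survey['calories'].mask(rng.random(n) < p_mnar)

print('complete:', survey['calories'].mean())
print('MCAR    :', mcar.mean())
print('MAR     :', mar.mean())
print('MNAR    :', mnar.mean())
```

In the MNAR case only high values are removed, so the mean of the remaining data is pulled downward, and nothing in the observed data reveals why.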

### Simple imputation - **CAUTION**

This technique involves replacing missing values with other values from the dataset, such as with the `.fillna()` method described above. Alternatively, the column mean is often used for replacing missing values in a given column. Though the attractiveness of these techniques is in their simplicity, they introduce great potential for bias into statistical analyses down the line (Jakobsen et al., 2017). Therefore, these techniques must be used with caution.

For interest's sake, here is an example of simple mean imputation. For missing values in `col2`, we replace `NaN` with the mean of `col2`.

In [4]:
miss1 = pd.DataFrame({'col1': np.random.rand(10), 
                      'col2': [10.1, 9.8, None, 8.8, 10.2, None, 11.0, 9.9, None, 8.1], 
                      'col3': [None, 2.2, None, 1.2, 5.4, None, None, 1.1, 2.2, 1.2]})
miss1

Unnamed: 0,col1,col2,col3
0,0.382673,10.1,
1,0.551442,9.8,2.2
2,0.024158,,
3,0.217955,8.8,1.2
4,0.821808,10.2,5.4
5,0.467585,,
6,0.389902,11.0,
7,0.7531,9.9,1.1
8,0.69095,,2.2
9,0.350315,8.1,1.2


In [83]:
miss1['col2'].fillna(value = miss1['col2'].mean())

0    10.1
1     9.8
2     9.7
3     8.8
4    10.2
5     9.7
6    11.0
7     9.9
8     9.7
9     8.1
Name: col2, dtype: float64

In [84]:
miss1['col3'].fillna(value = miss1['col3'].mean())

0    2.216667
1    2.200000
2    2.216667
3    1.200000
4    5.400000
5    2.216667
6    2.216667
7    1.100000
8    2.200000
9    1.200000
Name: col3, dtype: float64
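One way to see the bias mentioned above is to compare the spread of a column before and after mean imputation. The sketch below reuses the `col2` values: the mean is unchanged, but the standard deviation shrinks, because every filled value sits exactly at the mean.

```python
import numpy as np
import pandas as pd

col2 = pd.Series([10.1, 9.8, np.nan, 8.8, 10.2, np.nan, 11.0, 9.9, np.nan, 8.1])
filled = col2.fillna(col2.mean())

# pandas skips NaN by default when computing mean and std.
print('mean before/after:', col2.mean(), filled.mean())
print('std  before/after:', col2.std(), filled.std())
```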

### Multiple imputation

For the purposes of this course, multiple imputation means generating potential replacement values from a statistical distribution several times, analyzing the resulting datasets (using simple imputation methods), and then computing summary statistics for the variable to be replaced.

An excellent flowchart for when to use multiple imputation is presented in (Jakobsen et al., 2017), and is reproduced here under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).

![image1](https://media.springernature.com/full/springer-static/image/art%3A10.1186%2Fs12874-017-0442-1/MediaObjects/12874_2017_442_Fig1_HTML.gif?as=webp)

You may be familiar with different multiple imputation methods. I'll briefly describe some common methods here. These methods are all available in the Python package **Scikit-Learn (sklearn)**.

- Iterative Imputation - take the column `y` containing missing values and fit a regressor for the complete values in `y` based on the other columns `X`. Use the predicted values from the regression for the missing values of `y`. Repeat this `max_iter` times, using the final predicted values as the replacement values.
- $k$-Nearest Neighbors (kNN) - take the column `y` containing missing values. For each row with a missing value in `y`, find the $k$ rows with a recorded value of `y` that are 'closest' to it, measured by the Euclidean distance over the columns that both rows have recorded. The missing value is replaced by the average of `y` over these $k$-nearest neighbors.
- Multiple Imputation by Chained Equations (MICE) - similar to Iterative Imputation, but each imputed value is drawn at random from the regressor's predictive distribution rather than set to its point prediction, so repeated runs give different plausible datasets. This is done in sklearn by specifying the `sample_posterior = True` keyworded argument.

We will return to kNN in future weeks. Let's see examples of each of these methods. We'll specify the `random_state = 0` keyworded argument to set the random seed. This is so we can get consistent results each time we run this notebook.

In [2]:
import sklearn

sklearn.__version__

'0.22.2.post1'

In [5]:
# Since the IterativeImputer is experimental as of 24 Feb 2020, we need to enable it.
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer, KNNImputer

miss1

Unnamed: 0,col1,col2,col3
0,0.382673,10.1,
1,0.551442,9.8,2.2
2,0.024158,,
3,0.217955,8.8,1.2
4,0.821808,10.2,5.4
5,0.467585,,
6,0.389902,11.0,
7,0.7531,9.9,1.1
8,0.69095,,2.2
9,0.350315,8.1,1.2


In [7]:
# Simple mean imputation

# We'll first replace 'NaN' with np.nan
miss1 = miss1.replace('NaN', np.nan)

imp1 = SimpleImputer(missing_values = np.nan, strategy = 'mean')

# This step finds the imputed values
imp1.fit(miss1)

# This step applies the imputed values
miss1_simple_mean = pd.DataFrame(imp1.transform(miss1))
miss1_simple_mean

Unnamed: 0,0,1,2
0,0.382673,10.1,2.216667
1,0.551442,9.8,2.2
2,0.024158,9.7,2.216667
3,0.217955,8.8,1.2
4,0.821808,10.2,5.4
5,0.467585,9.7,2.216667
6,0.389902,11.0,2.216667
7,0.7531,9.9,1.1
8,0.69095,9.7,2.2
9,0.350315,8.1,1.2


In [13]:
# Iterative Imputation - single regressor
imp2 = IterativeImputer(max_iter = 5, random_state = 0)

# Find the imputed values
imp2.fit(miss1)

# Fill with imputed values
miss1_iterative_imput = imp2.transform(miss1)
miss1_iterative_imput = pd.DataFrame(miss1_iterative_imput)
miss1_iterative_imput

Unnamed: 0,0,1,2
0,0.382673,10.1,2.671694
1,0.551442,9.8,2.2
2,0.024158,9.623901,2.230813
3,0.217955,8.8,1.2
4,0.821808,10.2,5.4
5,0.467585,9.685527,2.393244
6,0.389902,11.0,3.327181
7,0.7531,9.9,1.1
8,0.69095,9.642237,2.2
9,0.350315,8.1,1.2


In [16]:
miss1_iterative_imput.describe()

Unnamed: 0,0,1,2
count,10.0,10.0,10.0
mean,0.464989,9.685166,2.392293
std,0.247159,0.783662,1.274169
min,0.024158,8.1,1.1
25%,0.358404,9.628485,1.45
50%,0.428743,9.742764,2.215407
75%,0.656073,10.05,2.602082
max,0.821808,11.0,5.4


In [18]:
# Iterative Imputation with posterior sampling (MICE-style)
imp3 = IterativeImputer(max_iter = 5, random_state = 0, sample_posterior = True)

# Find the imputed values
imp3.fit(miss1)

# Fill with imputed values
miss1_MICE = imp3.transform(miss1)
miss1_MICE = pd.DataFrame(miss1_MICE)
miss1_MICE

Unnamed: 0,0,1,2
0,0.382673,10.1,8.375514
1,0.551442,9.8,2.2
2,0.024158,8.488781,-5.381555
3,0.217955,8.8,1.2
4,0.821808,10.2,5.4
5,0.467585,8.635158,6.868344
6,0.389902,11.0,-5.271444
7,0.7531,9.9,1.1
8,0.69095,9.389835,2.2
9,0.350315,8.1,1.2


In [19]:
miss1_MICE.describe()

Unnamed: 0,0,1,2
count,10.0,10.0,10.0
mean,0.464989,9.441377,1.789086
std,0.247159,0.915805,4.533754
min,0.024158,8.1,-5.381555
25%,0.358404,8.676369,1.125
50%,0.428743,9.594918,1.7
75%,0.656073,10.05,4.6
max,0.821808,11.0,8.375514
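The table above comes from a single random draw. As a rough sketch of the 'multiple' part of multiple imputation, we can repeat the posterior-sampling imputer with several seeds and summarize the results across runs. The number of runs `m = 5`, the frame `toy`, and the choice to pool only the mean of `col3` are assumptions made for illustration, not a full pooling procedure.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
toy = pd.DataFrame({'col1': rng.random(10),
                    'col2': [10.1, 9.8, None, 8.8, 10.2, None, 11.0, 9.9, None, 8.1],
                    'col3': [None, 2.2, None, 1.2, 5.4, None, None, 1.1, 2.2, 1.2]})

m = 5  # number of imputed datasets (illustrative choice)
col3_means = []
for seed in range(m):
    imputer = IterativeImputer(max_iter = 5, sample_posterior = True,
                               random_state = seed)
    completed = pd.DataFrame(imputer.fit_transform(toy), columns = toy.columns)
    col3_means.append(completed['col3'].mean())

# Summarize the statistic of interest across the m completed datasets.
pooled_mean = np.mean(col3_means)
between_var = np.var(col3_means, ddof = 1)

print('per-run means:', col3_means)
print('pooled mean  :', pooled_mean)
print('between-run variance:', between_var)
```

The between-run variance gives a sense of how much uncertainty the missing values add, which a single imputed dataset hides.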


In [20]:
# k-Nearest Neighbors
# Use a new name so the MICE imputer (imp3) above is not overwritten.
imp4 = KNNImputer(n_neighbors = 3, weights = 'uniform')

miss1_knn = imp4.fit_transform(miss1)
miss1_knn = pd.DataFrame(miss1_knn)
miss1_knn

Unnamed: 0,0,1,2
0,0.382673,10.1,1.833333
1,0.551442,9.8,2.2
2,0.024158,9.0,1.533333
3,0.217955,8.8,1.2
4,0.821808,10.2,5.4
5,0.467585,10.3,1.866667
6,0.389902,11.0,2.9
7,0.7531,9.9,1.1
8,0.69095,10.3,2.2
9,0.350315,8.1,1.2


In [21]:
miss1_knn.describe()

Unnamed: 0,0,1,2
count,10.0,10.0,10.0
mean,0.464989,9.75,2.143333
std,0.247159,0.863134,1.273573
min,0.024158,8.1,1.1
25%,0.358404,9.2,1.283333
50%,0.428743,10.0,1.85
75%,0.656073,10.275,2.2
max,0.821808,11.0,5.4
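To see where a kNN-imputed value comes from, the sketch below recomputes one by hand on a tiny invented array `X`. It assumes that `KNNImputer` measures closeness with the NaN-aware Euclidean distance provided by `nan_euclidean_distances`, and that only rows with a recorded value in the target column can act as neighbors.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# Row 0 is missing its third column; the other rows are complete.
X = np.array([[1.0, 2.0, np.nan],
              [1.1, 2.1, 3.0],
              [0.8, 1.9, 5.0],
              [5.0, 6.0, 9.0],
              [1.2, 2.2, 4.0]])

imputed = KNNImputer(n_neighbors = 2, weights = 'uniform').fit_transform(X)

# By hand: distances from row 0 to the rows that have column 2 recorded.
donors = X[1:]
dist = nan_euclidean_distances(X[[0]], donors)[0]
nearest = np.argsort(dist)[:2]           # indices of the 2 closest donor rows
by_hand = donors[nearest, 2].mean()      # average their column-2 values

print(imputed[0, 2], by_hand)
```

The two closest rows here are the second and third rows of `X`, so the filled value is the average of 3.0 and 5.0.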


In [24]:
print('{}: Iterative\n{}: MICE\n{}: kNN'.format(miss1_iterative_imput.var(),miss1_MICE.var(),miss1_knn.var()))

0    0.061088
1    0.614126
2    1.623506
dtype: float64: Iterative
0     0.061088
1     0.838698
2    20.554929
dtype: float64: MICE
0    0.061088
1    0.745000
2    1.621988
dtype: float64: kNN


Another visual from (Jakobsen et al., 2017) gives more direction about when to use different imputation techniques. The 'Chained equations' referred to are from the MICE method.

**Note:** The 'Monotonic imputation' is done for 'Monotonic missing' data, which is not discussed here. We also won't discuss the Markov chain Monte Carlo method here.

![image3](https://media.springernature.com/full/springer-static/image/art%3A10.1186%2Fs12874-017-0442-1/MediaObjects/12874_2017_442_Fig2_HTML.gif?as=webp)

# Summary

* Python's standard library has convenient modules (`zipfile` and `gzip`) for unzipping `.zip` and `.gz` files.
* pandas features functions for importing and exporting `csv`, `json`, `html`, Microsoft Excel, clipboard, `hdf5`, and `SQL` data.
* Read the **Terms of Service** of any website from which you plan to scrape `html` data.
* The three options for dealing with missing values in a dataset are:
    1. Drop rows with missing values.
    2. Fill in missing values with adjacent values.
    3. Generate synthetic data based on other values.
* **Imputed** data is data generated from other values in the dataset.
* There are 3 mechanisms that lead to missing data:
    1. **Missing completely at random (MCAR).**
    2. **Missing at random (MAR).**
    3. **Missing not at random (MNAR).**
* If neither MCAR nor MNAR is plausible, we may use multiple imputation to generate replacements for missing values.

# *Exercises*

1. Navigate to the EuroStat [Greenhouse Gas Emissions](https://ec.europa.eu/eurostat/databrowser/view/sdg_13_10/default/table?lang=en) dataset. Download the data as an `.xlsx` file and read the data into Python using pandas. Compute the summary statistics for the data.

2. Navigate to the US Census Bureau dataset for [Government Units: US and State](https://data.census.gov/cedsci/table?tid=GOVSTIMESERIES.CG00ORG01&hidePreview=true&y=). Download the data as a `.csv` file and load it into Python using pandas. From this dataset, create a second dataset for only the year 2017.

3. Perform 3 methods of imputation on the `Amount` column of the 'Government Units: US and State' dataset. Save the results to new datasets.

4. Save all of the above datasets and results as groups in an `hdf5` file.

# References

Jakobsen, J. C., Gluud, C., Wetterslev, J., & Winkel, P. (2017). When and how should multiple imputation be used for handling missing data in randomised clinical trials – a practical guide with flowcharts. BMC Medical Research Methodology, 17(1), 162.

Statistics Canada. Table 13-10-0101-01. Public nursing and residential care facilities, summary statistics (x 1,000,000).
DOI: https://doi.org/10.25318/1310010101-eng