## Unit 2:  Homework Option III -- Working With Large Files

This homework gives you practice opening, sampling, and cleaning data whose file size might stretch, or even exceed, your available RAM.

**What you'll turn in:**

You'll make a pull request with a folder (titled as your name) that contains the following:

 - for section I, a file called `titanic`, stored in a binary `feather` or `parquet` file format, that reduces the `titanic` dataset to its smallest possible memory footprint without altering any of its values.
 - for section II, simply turn in this notebook with the questions answered.
 - for section III, a script called `chunking.py` that, when run, connects to an S3 bucket, streams a file through memory, cleans it, and writes it out to a `.csv` file, without ever loading the entire file into RAM.

### Section I: Downcasting the Titanic Dataset

**What You'll Learn:** The basics of managing memory within files, and how to use binary file formats such as `feather` or `parquet`, which preserve a dataset's data types after it has been saved to disk.

**What You'll Turn In:** A file called `titanic`, which contains the memory-reduced form of the file.

#### Downcasting

`Downcasting` is the task of reducing the memory footprint of different columns in your dataset so they take up less RAM when you load them in.  

Most software used for handling data makes use of your available RAM to process its tasks.  If the size of your file neatly fits into the available RAM that your computer has then this is fine.  If it's significantly larger (no laptop is going to have 1TB of RAM for example), then you won't be able to load in the file and work with it.

Pandas works this way, and therefore the amount of working RAM you have available to use is going to function as a limit for what file sizes you can work with.  

##### A Quick Intro to Data Types in Pandas & Numpy

Numbers come in different flavors in pandas and numpy.  At the simplest level you have integers (whole numbers) and floating point numbers (numbers with decimals).  

However, numbers use different sized containers to store their values.  They are as follows:

 - **64 bit:** Can store 2<sup>64</sup> distinct values, which is 18446744073709551616
 - **32 bit:** Can store 2<sup>32</sup> distinct values, which is 4294967296
 - **16 bit:** Can store 2<sup>16</sup> distinct values, which is 65536
 - **8 bit:**  Can store 2<sup>8</sup> distinct values, which is 256
 
**Integers** come in all four of the above sizes, while **floats** are most commonly 32 or 64 bits (NumPy also offers a 16 bit float, but it is rarely used with pandas).

You can see the whole range of numeric data types here:  https://docs.scipy.org/doc/numpy/user/basics.types.html
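
The sizes and limits listed above can be checked directly with NumPy's `np.iinfo`, which reports the smallest and largest value each integer type can hold. A minimal sketch:

```python
import numpy as np

# Signed and unsigned integer types, smallest to largest
for dtype in (np.int8, np.uint8, np.int16, np.uint16,
              np.int32, np.uint32, np.int64, np.uint64):
    info = np.iinfo(dtype)
    # itemsize is the number of bytes every single value occupies
    print(f"{np.dtype(dtype).name:>6}: {np.dtype(dtype).itemsize} bytes, "
          f"{info.min} to {info.max}")

# Floats: float32 and float64 are the common choices; float16 also exists
for dtype in (np.float16, np.float32, np.float64):
    print(f"{np.dtype(dtype).name:>7}: {np.dtype(dtype).itemsize} bytes")
```

Note that a signed type splits its values between negative and positive numbers, so `int8` tops out at 127 rather than 255.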

The important detail here is that a number encoded as 64 bit will *always* use the same amount of memory, even if the value itself is much smaller.  

So, if you have a column of 0's and 1's, an 8 bit encoding will work perfectly fine (since the values are less than 2<sup>8</sup>), and a 64 bit encoding will take up 8x as much memory as it needs to.

An important detail about how Pandas works is that *numbers are encoded as 64 bit by default* when you load data in.  This is good for making sure values aren't corrupted, but bad for optimizing memory with large files.

**Methods Used For Managing Memory In Pandas:**

 - `df.memory_usage()`, returns the memory usage, in bytes, of whatever is selected.
 - `pd.Series.astype()`, allows you to change the data type of a column from one type to another.
 - `df.info()`, prints the data type of every column, along with the total memory usage of the dataframe.
 - `df.dtypes` / `pd.Series.dtype`, attributes (no parentheses) that return the data type of everything selected.
 
Let's take a look at how these items work.  Run the following cells to get a quick demonstration of what they do.

In [2]:
# load in the titanic dataset here -- use a different url if need be to
# load it in
import pandas as pd
import numpy as np
df = pd.read_csv('./data/titanic.csv')

In [4]:
# check the info of your dataset -- notice the 64 bit numbers
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [5]:
# We can also check the memory usage of each column
df.memory_usage()

Index           128
PassengerId    7128
Survived       7128
Pclass         7128
Name           7128
Sex            7128
Age            7128
SibSp          7128
Parch          7128
Ticket         7128
Fare           7128
Cabin          7128
Embarked       7128
dtype: int64

In [6]:
# now let's see what happens when we adjust a column's data type
df['Survived'].astype(np.int8).memory_usage()

1019

As you can see, an 8 bit number takes up about 1/8 as much memory as a 64 bit number.  Permanently changing something's data type is simple.

In [7]:
# this will change the data type of a column to something else
df['Survived'] = df['Survived'].astype(np.int8)

In [8]:
# and now we can see its memory footprint is permanently smaller
df.memory_usage()

Index           128
PassengerId    7128
Survived        891
Pclass         7128
Name           7128
Sex            7128
Age            7128
SibSp          7128
Parch          7128
Ticket         7128
Fare           7128
Cabin          7128
Embarked       7128
dtype: int64

Notice however, that if you make a column's data type *smaller* than its values require, the original values will be corrupted.  For example, the Passenger ID column has values as large as 891, which is more than an 8 bit integer can hold (`np.int8` tops out at 127).  Notice what happens when you make the change:

In [9]:
# the values at the end of the series should be 889, 890, 891, etc,
# but they wrap around because they don't fit in 8 bits
df['PassengerId'].astype(np.int8)

0        1
1        2
2        3
3        4
4        5
      ... 
886    119
887    120
888    121
889    122
890    123
Name: PassengerId, Length: 891, dtype: int8

So clearly, getting things 'just right' is important.  Notice also the difference between **signed** and **unsigned** data types.  A signed type can store negative values; an unsigned type cannot, and uses that room to store larger positive values instead.

So, a datatype of `np.uint8` can accept ranges of 0 - 255, whereas `np.int8` accepts values from -128 to 127.
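
One way to avoid the wrap-around shown above is to check a column's actual range before choosing a type, or to let `pd.to_numeric` pick the smallest type that fits through its `downcast` argument. The sketch below uses a made-up series standing in for `PassengerId` (values 1 through 891) rather than the real file:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the PassengerId column: values 1 through 891
ids = pd.Series(np.arange(1, 892), dtype=np.int64)

# Check the real range against the limits of the candidate type
limits = np.iinfo(np.int8)
fits_int8 = limits.min <= ids.min() and ids.max() <= limits.max
print(fits_int8)  # False, because 891 is outside -128..127

# Let pandas choose the smallest integer type that holds every value
signed = pd.to_numeric(ids, downcast='integer')
unsigned = pd.to_numeric(ids, downcast='unsigned')
print(signed.dtype, unsigned.dtype)

# The values themselves are unchanged by the downcast
print((signed == ids).all())
```

Because `downcast` only inspects the values it is given, it is worth re-checking the result whenever it is applied to a sample rather than the full column.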

##### Categorical Data in Pandas and Numpy

Text based data in Pandas and Numpy has two different varieties:

 - **object** (`np.object_`): this is the data type that `df.info()` reports for the text columns above.  Each value is essentially a Python string, and the type is also used to store data that doesn't have any other characteristic (integer, float, bool, etc).
 - **category**: this is a special data type built specifically for Pandas, to handle text data that has a small number of repeating values, like the `Sex` column in our dataset.  When appropriately used, it can drastically reduce the memory footprint of text based data.  You can read more about it here:  https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
 
See below for a demonstration.

In [10]:
# the Sex column currently uses the same memory as a 64 bit number
df['Sex'].memory_usage()

7256

In [11]:
# if you change it to a category, it uses about the same amount
# as an 8 bit number
df['Sex'].astype('category').memory_usage()

1115

Please note that using categories only has the desired effect when there are repeating values.  Make sure to check the memory footprint before making the change!
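
To see why repeated values matter, the sketch below compares two made-up columns: one with only two distinct values (similar to `Sex`) and one where every value is unique (similar to `Name`). It passes `deep=True` to `memory_usage()` so that the strings themselves are counted, not just the pointers to them; exact byte counts will vary by pandas version.

```python
import pandas as pd

n = 10_000

# Few distinct values repeated many times (similar to the Sex column)
low_card = pd.Series(['male', 'female'] * (n // 2), dtype=object)
# Every value distinct (similar to the Name column)
high_card = pd.Series([f'passenger_{i}' for i in range(n)], dtype=object)

results = {}
for label, s in [('low cardinality', low_card), ('high cardinality', high_card)]:
    as_object = s.memory_usage(deep=True)
    as_category = s.astype('category').memory_usage(deep=True)
    results[label] = (as_object, as_category)
    print(f"{label}: object={as_object} bytes, category={as_category} bytes")
```

The low cardinality column should shrink dramatically, while the high cardinality column gains little or nothing, since a category still has to store every distinct string once.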

#### Section I Task:  Clean Up the Titanic Dataset

For the first part of your homework assignment, your task is to reduce the memory footprint of the titanic dataset as much as possible without altering any values, and export it to a binary file format, `feather` (preferably) or `parquet`, which will retain all of its data types when it's loaded back in.

You can read more about the feather file format here:  https://blog.rstudio.com/2016/03/29/feather/.

The file `titanic` should be turned in inside a folder with your name on it, and it will be inspected to make sure it was downcasted in the most appropriate manner.

You can use the `to_feather()` and `read_feather()` methods to export and load your files, or the `to_parquet()` and `read_parquet()` methods, respectively.

**Note:** It is best to use feather files if you can.  They reliably retain the `category` data type after the file has been saved, which other formats may not do depending on the library and version used.

However, you will likely need an additional library to get it working, since pandas often cannot write `feather` files right out of the box.  The library traditionally used for this is, appropriately, called `feather`.  To install it, go to Anaconda Prompt/Terminal and type in `conda install feather-format`.  (Recent versions of pandas read and write feather files through the `pyarrow` package, which `feather-format` installs as a dependency.)

The library's homepage can be found here:  https://pypi.org/project/feather-format/

This library isn't always well supported, and there's a possibility that you might have an exceedingly difficult time getting it to work.  The purpose of this homework is not to have you go down rabbit holes getting an obscure library to work.  If after 30 minutes to 1 hour you still don't have the library installed, feel free to use the `parquet` file format, which can be a little easier to work with.

You can find more about it here:  https://arrow.apache.org/docs/python/parquet.html
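
As a rough sketch of what the round trip looks like, the example below saves a tiny made-up dataframe (not the real titanic data) to a temporary feather file and reads it back to compare data types. It assumes `pyarrow` is installed and prints a message instead of failing if it is not; the file name `example.feather` is only illustrative.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small hypothetical frame standing in for the downcast titanic data
df = pd.DataFrame({
    'Survived': np.array([0, 1, 1, 0], dtype=np.int8),
    'Embarked': pd.Series(['S', 'C', 'S', 'Q'], dtype='category'),
})

restored = None
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.feather')
    try:
        df.to_feather(path)               # requires the pyarrow package
        restored = pd.read_feather(path)
    except ImportError as err:
        print(f'feather support is not installed: {err}')

if restored is not None:
    # Side-by-side comparison of the dtypes before and after saving
    print(pd.concat([df.dtypes, restored.dtypes], axis=1, keys=['before', 'after']))
```

The same pattern works with `to_parquet()` and `read_parquet()`, which is a convenient way to confirm which data types survive in whichever format you end up using.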

### Section II: Working With Larger File Sizes

Section I was your warmup to get the hang of how to reduce the memory of your data and get it loaded into a more advanced file format that can be reused for other projects.

This section will extend what you just did, but add in two additional wrinkles:

 - The file will be much larger -- approximately 2 million rows
 - You'll find out what types of data types you should be working with.....*without* reading in the entire file in the first place.

Imagine this hypothetical scenario:  there's a 10 GB file that you need to work on, and loading it into pandas raises a `MemoryError`.  You're almost certain you could reduce it to something much smaller, except you can't even read it in to figure out what to do next.

These types of chicken & egg problems are fairly common, and a popular antidote to them is to sample in a portion of your dataframe.  There are two primary ways to do this if you're reading in `.csv` files:

 - `nrows`   : tells pandas how many rows to read in from the original file
 - `skiprows`: tells pandas which rows to *skip* from the original file
 
For both of these you can just manually read in x number of rows relatively easily, but doing so right from the beginning or end has some problems.  Mainly, many datasets don't have consistent values from beginning to end.  

For example, if you have a dataset with 10 years of sales info, it's very possible values being recorded are very different at varying time segments.  Often new columns are added to datasets in the middle of their collection, and all values for all times before that are simply `null`, so just reading in the first 5000 or 10000 might not give you a consistent picture of what to expect.

For this reason, it's good to randomly sample a file before you try to read it in entirely.

To do this, you can combine `lambda` functions (remember those?) with the `skiprows` argument, which accepts a function as its value.

Here's the basic idea:

In [19]:
import random

# random.random() will generate a random value between 0 and 1
# so this will return True 50% of the time
random.random() > .50

True

Using this same logic, we can pass this into a lambda function like so, to read in 30% of the titanic dataset.

In [22]:
# use read_csv with a lambda function
df = pd.read_csv('./data/titanic.csv', skiprows=lambda x: x > 0 and random.random() > 0.3)
# and notice the size of the dataframe that was read in
df.shape

(273, 12)

Here, inside the `lambda` function, `x` represents the index of the row being read in.  `x > 0` ensures the first row, which holds the headers, is never skipped, and `random.random() > 0.3` returns `True` (skip this row) about 70% of the time, hence the results that we get.

The idea is that you can read in a very small fraction of a very large file if you want to investigate its most important properties.  
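
The sketch below repeats the sampling idea on a made-up 1,000 row CSV held in memory, so it can be run without any files on disk. The seed, the column names, and the 10% fraction are all illustrative choices.

```python
import io
import random

import pandas as pd

# Build a hypothetical 1,000-row CSV in memory so the example runs anywhere
csv_text = 'row_id,value\n' + ''.join(f'{i},{i * 2}\n' for i in range(1000))

random.seed(42)   # fixed seed so the same rows are sampled on every run
fraction = 0.1    # keep roughly 10% of the data rows

sample = pd.read_csv(
    io.StringIO(csv_text),
    # Row 0 is the header and is always kept; every other row is
    # skipped with probability 1 - fraction
    skiprows=lambda i: i > 0 and random.random() > fraction,
)

print(sample.shape)             # roughly 100 rows, spread across the file
print(sample['row_id'].min(), sample['row_id'].max())
```

Because rows are chosen independently, the sample size is only approximately 10% of the file, but the rows are drawn from the beginning, middle, and end alike.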

Now, this raises the question: how do you specify the data types you want a column to have before you read it in?  

There is a very useful argument in `read_csv` called `dtype` that accepts a dictionary, where you can list column labels as keys, and their corresponding data type as a value.

So for example, if we wanted to change the `Embarked` column and the `Survived` column to `category` and `np.int8`, we could do so in the following way.

In [None]:
# this dictionary contains the columns we want to change
dtypes = {
    # column label: new data type
    'Embarked': 'category',
    'Survived': np.int8
}

# and now we'll re-read in the .csv file, and use the dtypes dict
df = pd.read_csv('./data/titanic.csv', dtype=dtypes)

In [None]:
# and we can see that the data type of the columns are in fact changed
df.info()
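
A `dtype` dictionary like the one above can also be drafted from a sampled dataframe instead of being typed out by hand. The helper below, `suggest_dtypes`, is hypothetical (it is not part of pandas), and the 20-value cutoff for categories is an arbitrary assumption. Treat its output as a starting point only: a sample can miss the largest values or the null values in a column, and `float32` keeps fewer significant digits than `float64`.

```python
import pandas as pd


def suggest_dtypes(sample, max_categories=20):
    """Hypothetical helper: propose a dtype dict from a sampled DataFrame."""
    suggestions = {}
    for col in sample.columns:
        s = sample[col]
        if pd.api.types.is_bool_dtype(s):
            continue  # booleans already use one byte per value
        if pd.api.types.is_integer_dtype(s):
            # smallest signed integer type that holds the sampled values
            suggestions[col] = pd.to_numeric(s, downcast='integer').dtype
        elif pd.api.types.is_float_dtype(s):
            # pandas will not downcast floats below float32
            suggestions[col] = pd.to_numeric(s, downcast='float').dtype
        elif s.nunique() <= max_categories:
            suggestions[col] = 'category'
    return suggestions


# Made-up sample with a few titanic-like columns
sample = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Fare': [7.25, 71.5, 8.0, 53.0],
    'Embarked': ['S', 'C', 'S', 'S'],
})

dtypes = suggest_dtypes(sample)
print(dtypes)
```

The resulting dictionary can be passed straight to the `dtype` argument of `read_csv`, after checking each suggestion against what you know about the full file.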

#### Section II Task: Sample, Clean Up, And Read in the taxi.csv file

This portion of section II contains the task that you will be graded on.  You can simply answer the prompts inside this notebook and turn it in.  No additional files are necessary.  The file is located in an S3 bucket at this location:  `https://dat-data.s3.amazonaws.com/taxi.csv`

It records information about every taxi ride given by a particular company for approximately 1 year.

**Part I:** Randomly sample in 10% of the taxi.csv file

In [23]:
# your answer here
df = pd.read_csv('https://dat-data.s3.amazonaws.com/taxi.csv', skiprows=lambda x: x > 0 and random.random() > 0.1)

**Part II:** Go ahead and do the appropriate exploratory data analysis to figure out the most appropriate data type for each column.

In [45]:
# your answer here
df.memory_usage()

Index               128
TRIP_ID         1369240
CALL_TYPE       1369240
ORIGIN_CALL     1369240
ORIGIN_STAND    1369240
TAXI_ID         1369240
TIMESTAMP       1369240
DAY_TYPE        1369240
MISSING_DATA     171155
dtype: int64

**Part III**: Create a dictionary that contains the key/value pairs for each column that needs to be changed to a different data type, and then read in the file.

In [46]:
# your answer here
dtypes = {
    'CALL_TYPE': 'category',
    'DAY_TYPE': 'category',
    'TIMESTAMP': np.int32,
    'TAXI_ID': np.int32,
    'ORIGIN_STAND': np.float32,
    'ORIGIN_CALL': np.float32,
}

# and now we'll re-read in the .csv file, and use the dtypes dict
df = pd.read_csv('https://dat-data.s3.amazonaws.com/taxi.csv', dtype=dtypes)

**Part IV:** Confirm that each column has the appropriate data type

In [48]:
# your answer here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1710670 entries, 0 to 1710669
Data columns (total 8 columns):
TRIP_ID         int64
CALL_TYPE       category
ORIGIN_CALL     float32
ORIGIN_STAND    float32
TAXI_ID         int32
TIMESTAMP       int32
DAY_TYPE        category
MISSING_DATA    bool
dtypes: bool(1), category(2), float32(2), int32(2), int64(1)
memory usage: 44.0 MB


### Section III (Optional): File Streaming

File streaming is a way to get around memory limitations in pandas.  When you stream in a file, you spoon feed a portion of it into memory, and when you're finished, load in the next portion, and so on until there's nothing left. 

It's less convenient than regular file I/O, but it removes the memory limit you would otherwise face when working with a file, because the file is never loaded all at once. 

In this section of the homework assignment, you'll be tasked with performing basic cleaning operations on a file......without ever having to load it into memory.  This means what you accomplish in this section you could perform on a file of any size.

##### Basic Introduction to File Streams

When it's available, you can specify a file stream by using the `chunksize` argument, which specifies that you only read in so many lines at a time.

Notice how it works:

In [3]:
# when we read in the csv file, we'll set chunksize to 250
df = pd.read_csv('./data/titanic.csv', chunksize=250)

# notice that df is NOT a df.....it's a file stream
type(df)

pandas.io.parsers.TextFileReader

In [4]:
# and now if we want, we can 'chunk' in 250 rows at a time
df.get_chunk()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
245,246,0,1,"Minahan, Dr. William Edward",male,44.0,2,0,19928,90.0000,C78,Q
246,247,0,3,"Lindahl, Miss. Agda Thorilda Viktoria",female,25.0,0,0,347071,7.7750,,S
247,248,1,2,"Hamalainen, Mrs. William (Anna)",female,24.0,0,2,250649,14.5000,,S
248,249,1,1,"Beckwith, Mr. Richard Leonard",male,37.0,1,1,11751,52.5542,D35,S


In [5]:
# and this would be the next 250 rows
df.get_chunk()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
250,251,0,3,"Reed, Mr. James George",male,,0,0,362316,7.2500,,S
251,252,0,3,"Strom, Mrs. Wilhelm (Elna Matilda Persson)",female,29.0,1,1,347054,10.4625,G6,S
252,253,0,1,"Stead, Mr. William Thomas",male,62.0,0,0,113514,26.5500,C87,S
253,254,0,3,"Lobb, Mr. William Arthur",male,30.0,1,0,A/5. 3336,16.1000,,S
254,255,0,3,"Rosblom, Mrs. Viktor (Helena Wilhelmina)",female,41.0,0,2,370129,20.2125,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
495,496,0,3,"Yousseff, Mr. Gerious",male,,0,0,2627,14.4583,,C
496,497,1,1,"Eustis, Miss. Elizabeth Mussey",female,54.0,1,0,36947,78.2667,D20,C
497,498,0,3,"Shellard, Mr. Frederick William",male,,0,0,C.A. 6212,15.1000,,S
498,499,0,1,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.5500,C22 C26,S


With this basic syntax, you can lazily read in bits and pieces of a file at a time, without having to worry about being able to fit the entire file into RAM at once.  This is useful for doing some basic exploratory data analysis and peeking into a file, even if it's unreasonably large.

However, what if you wanted some way to go through the *entire* file, and make bulk changes to it?  

You can do that as well, by looping through the file stream. Here's a simple example, that prints off the total memory footprint for each chunk being read in:

In [7]:
# this basically reads: for every chunk in the file stream
for chunk in pd.read_csv('./data/titanic.csv', chunksize=200):
    # do this to each chunk
    print(chunk.memory_usage().sum())

19328
19332
19332
19332
8868


Notice what we're doing here.  `chunk` is the portion of the file that you're reading in and operating on.  `pd.read_csv(.....)` in this case is the iterator that you're looping over, not the entire file itself.

What's a little difficult to get used to is that `chunk` isn't some stand alone file that you play with once you've loaded it in.  It's just a portion of a loop that you pass some commands to.
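
Since each chunk disappears once the loop moves on, anything you want to know about the whole file has to be accumulated in variables that live outside the loop. The sketch below shows that pattern on a made-up ten row CSV held in memory; the column name `fare` is illustrative.

```python
import io

import pandas as pd

# Hypothetical 10-row CSV held in memory: the values 1 through 10
csv_text = 'fare\n' + '\n'.join(str(v) for v in range(1, 11)) + '\n'

# Running totals live outside the loop and survive from chunk to chunk
total_rows = 0
fare_sum = 0.0

# Only one 4-row chunk is held in memory at any time
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total_rows += len(chunk)
    fare_sum += chunk['fare'].sum()

# The overall mean is computed from the totals, not from per-chunk means
print(total_rows, fare_sum / total_rows)
```

Averaging the per-chunk means instead would give the wrong answer whenever the last chunk is shorter than the others, which is why the sum and the count are tracked separately.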

To finish this section of the homework, you'll need to write functions that allow you to stream in data, and perform specific operations on your data set, without having it entirely loaded into memory, or being able to see it.

These functions should be written in a file called `chunking.py`, and each of the following two functions should be able to be called from an IDE to observe the results.

**Function I**

**Name:** `probe_df`

**Arguments:** 
 - `file_path`, str, required.  Location of file to read in.
 - `chunksize`, int, optional, default value is 1000.  Size of the chunk to use when streaming in the file.
 
**Returns:** a dictionary encoded in the following way: 
 - each key is the name of a column within your dataset
 - the value for each key is another dictionary with the following key/value pairs:
   - `null values`: number of null values for that column
   - `dtype`: data type for that column
   - `avg_val`: average value for that column ( if numeric, otherwise don't include )
   
**Note:** The `chunksize` argument can be used with a variety of file types, but you can just assume that you'll be reading a csv file, and nothing more.

**Function II**

**Name:** `write_df`

**Arguments:**

 - `file_path_read`   : str, required; location of file to read in.
 - `file_path_write`  : str, required; location of the file to write the new file out to
 - `chunksize`        : int, optional, default value is 1000; size of the chunk to use when streaming in the file.
 - `missing_vals`: dict, optional; accepts a dictionary as an argument with key/value pairs that list the column in the dataset(key) as well as the value to fill missing values with for that column(value).  The values in this dictionary will be used to fill missing values in the file at the location in `file_path_read`.
 
**Returns:**

This function will **not** return a value in the terminal.  What it **will** do is write a new file to the location specified in `file_path_write`.  

So, for example if you call `write_df('file/path/to/stream', 'file/path/to/write')`, a new file will appear in the location at `file/path/to/write` as the function is being called.

The big idea behind this function is that you'll be able to do data cleaning operations on a file you've never actually seen in your terminal before.

**Hint:** Pandas has a `to_csv()` method, with the option of appending lines to the end of it -- this is good to use when looping through the file stream.

**Note:** We will check to see that column headers are not added multiple times!
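
The append pattern from the hint can be sketched as follows. This is only one possible shape, shown on a made-up five row CSV with an illustrative output name (`cleaned.csv`); it is not the required implementation of `write_df`. It uses the `mode` and `header` arguments of `to_csv()` so the header is written exactly once.

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical input with two missing scores
csv_text = 'id,score\n1,10\n2,\n3,30\n4,\n5,50\n'
out_path = os.path.join(tempfile.mkdtemp(), 'cleaned.csv')

for i, chunk in enumerate(pd.read_csv(io.StringIO(csv_text), chunksize=2)):
    # Fill missing values column by column using a dict
    chunk = chunk.fillna({'score': 0})
    chunk.to_csv(
        out_path,
        mode='w' if i == 0 else 'a',  # start a fresh file, then append
        header=(i == 0),              # write column names for the first chunk only
        index=False,
    )

# Read the small result back in to confirm what was written
result = pd.read_csv(out_path)
print(result)
```

Writing with `mode='w'` on the first chunk also means that re-running the script replaces an old output file instead of appending to it.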

**To Test:** We will first call `probe_df` on the `taxi` dataset in its original location to get a dictionary with its missing values.  We will then use that value as a basis for the `missing_vals` argument in the `write_df` function, and use those to fill in its missing values.

If both of these functions work as intended, they will allow us to fill in the dataset's missing values without having looked at it, and you will receive full marks.