### 1 - Loading a Large Dataset
This is the first time I've tried loading a decently sized dataset. For reference, uncompressed, the training dataset is 4.08 GB, while the test set is 3.53 GB. When I tried to load the data with a plain pd.read_csv() call, I don't think I even got a memory error; my computer simply froze. I did a Google search and found a quick fix on [GitHub from user lsilva6851](https://github.com/pandas-dev/pandas/issues/16537). While I finally managed to load the whole dataset, everything was noticeably slower, so I decided to load only part of the dataset to see if I could optimize the memory usage. Thanks to [Shurti_Iyyer's Kaggle kernel](https://www.kaggle.com/shrutimechlearn/large-data-loading-trick-with-ms-malware-data) and [Josh Devlin's post on Dataquest](https://www.dataquest.io/blog/pandas-big-data/) for the guidance.

#### 1.1 - Loading Libraries and Data

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Too slow.  Need to optimize dtypes first
# chunksize = 100000
# chunks = []
# for chunk in pd.read_csv('../input/train.csv', chunksize=chunksize, low_memory=False):
#     chunks.append(chunk)
# df = pd.concat(chunks, axis=0)

When you pass the chunksize argument to read_csv, pandas returns a TextFileReader iterator rather than a DataFrame. The simplest way to pull rows out of it is the get_chunk() function. get_chunk() already returns a DataFrame, so passing its result into pd.DataFrame() is harmless but not required. The number passed to get_chunk() is how many rows to read; with no argument it reads chunksize rows.

In [None]:
subset = pd.read_csv('../input/train.csv', chunksize=10000)

In [None]:
type(subset)

In [None]:
subset_df = pd.DataFrame(subset.get_chunk(10000))

In [None]:
subset_df.head()
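
As a small self-contained illustration of the chunked reader, the sketch below uses a made-up in-memory CSV in place of train.csv (the column names and values are assumptions for the example only). It shows that get_chunk() hands back a regular DataFrame, and that nrows is another way to read only the first rows.

```python
import io
import pandas as pd

# Hypothetical stand-in for train.csv: 10 rows, 2 columns.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

# With chunksize, read_csv returns an iterator instead of a DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)
first_chunk = reader.get_chunk(3)   # first 3 rows, already a DataFrame
second_chunk = reader.get_chunk()   # next `chunksize` (4) rows

# nrows reads only the first rows without creating an iterator.
head_df = pd.read_csv(io.StringIO(csv_text), nrows=3)

print(type(reader).__name__, type(first_chunk).__name__)
print(len(first_chunk), len(second_chunk), len(head_df))
```

If all you need is a sample from the top of the file, nrows is probably the shorter route; the iterator is more useful when you want to keep reading further chunks.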

#### 1.2 - Quick Peek at Memory Usage
Normally at this point, I'd look at the column names and see if the competition has identified any key columns.  However, in this post, I'd like to stick with just loading the dataset and optimizing the memory usage.

According to Josh, the info() function only gives an approximate memory usage number by default, because for object columns it does not count the strings themselves. To get a more accurate number, set the argument memory_usage='deep'. In this case, it is the difference between pandas displaying 6.3+ MB and 22.4 MB. In addition, there are 53 numeric columns (float and int) and 30 object columns, for a total of 83 columns and 10,000 rows.


In [None]:
subset_df.info(memory_usage='deep')
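
To make the shallow versus deep difference concrete, here is a minimal sketch on a toy DataFrame (the column names and contents are made up for illustration). The numeric column reports the same size either way, while the object column grows once the Python strings are counted.

```python
import pandas as pd

# Toy frame: one numeric column and one column of Python strings (object dtype).
toy = pd.DataFrame({
    "n": pd.Series(range(1000), dtype="int64"),
    "s": pd.Series(["some fairly long string value"] * 1000, dtype=object),
})

shallow = toy.memory_usage(deep=False)  # object column: 8-byte pointers only
deep = toy.memory_usage(deep=True)      # object column: pointers plus the strings

print(shallow)
print(deep)
```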

Josh has a great visual and explanation of how each data type is stored, which I won't get into. But it's the basis of why we should examine how much memory each data type is using. First, find the mean memory usage for each data type.

In [None]:
for dtype in ['float64', 'int64', 'object']:
    print('Average memory usage for {}: {} MB'.format(dtype, format(subset_df.select_dtypes([dtype]).memory_usage(deep = True).mean()/1024**2,'.2f')))

Right away, you should notice that object columns use about 8 times more memory on average than floats and ints. Floats and ints have similar average memory usage, but there are almost twice as many float columns as int columns.

### 2 - Optimizing Memory Usage
Since we'll be looking at memory usage a lot, I wrote a function to get a detailed list of memory usage for each data type as well as total memory usage.

In [None]:
def get_memoryUsage(df):
    # get_dtype_counts() was removed in pandas 1.0; collect the dtype names directly
    dtype_lst = sorted({dtype.name for dtype in df.dtypes})
    for dtype in dtype_lst:
        print('Total memory usage for {}: {} MB'.format(dtype, format(df.select_dtypes([dtype]).memory_usage(deep = True).sum()/1024**2,'.5f')))
    
    print('\n' + 'Total Memory Usage: {} MB'.format(format(df.memory_usage(deep=True).sum()/1024**2, '.2f')))

In [None]:
get_memoryUsage(subset_df)
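
If you would rather get the per-dtype totals back as a Series instead of printed text, one possible alternative is to group the per-column memory usage by dtype name. This is only a sketch; memory_by_dtype is a hypothetical helper name, and the toy DataFrame is made up for the example.

```python
import pandas as pd

def memory_by_dtype(df):
    """Return deep memory usage in MB, summed per dtype name (hypothetical helper)."""
    per_column = df.memory_usage(deep=True, index=False)  # bytes per column
    dtype_names = df.dtypes.astype(str)                   # e.g. 'int64', 'category'
    return per_column.groupby(dtype_names).sum() / 1024**2

toy = pd.DataFrame({
    "a": pd.Series([1, 2, 3], dtype="int64"),
    "b": pd.Series([0.5, 1.5, 2.5], dtype="float64"),
    "c": pd.Series(["x", "y", "z"], dtype=object),
})
usage = memory_by_dtype(toy)
print(usage)
print("Total: {:.5f} MB".format(usage.sum()))
```

Unlike the printing function above, this version leaves out the index's own memory (index=False), so its totals can be slightly lower.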

Now that we have a baseline memory usage, we can start trying to optimize memory usage by changing each column's data type to the smallest possible.  For example, if a column is int64 but all the data in that column could be converted to int8, then convert the column to int8.

#### 2.1 - Downcasting Numeric Features
In pandas, downcasting means converting a numeric column to a smaller data type that can still hold all of its values; I believe the term means something else in other programming languages. I've written a function to downcast integer features to unsigned ints and float features to smaller floats. An unsigned int uses the same amount of memory as a signed int of the same width, but because it only represents zero and positive values, it can hold values about twice as large. Thus, if everything in a feature is non-negative, the column may fit in a narrower type, which saves memory. If the feature has negative values, downcast='unsigned' leaves its data type unchanged.

In [None]:
def downcast_Numeric(df):
    for col in df.select_dtypes(['int64']):
        df[col] = pd.to_numeric(df[col], downcast = 'unsigned')
    for col in df.select_dtypes(['float64']):
        df[col] = pd.to_numeric(df[col], downcast = 'float')

In [None]:
downcast_Numeric(subset_df)

In [None]:
get_memoryUsage(subset_df)

As you can see, the floats turned into float32 and there are now two types of unsigned ints.  In addition, the memory usage for numeric features decreased by a whopping 90+%, but the total memory usage only decreased by ~11%.
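
To see how pd.to_numeric picks the target type, here is a small sketch on made-up Series (the values are chosen only to sit on either side of the uint8 limit of 255). It also shows downcast='integer', which the function above does not use, as one option for columns that contain negative values.

```python
import pandas as pd

small_pos = pd.Series([0, 17, 255], dtype="int64")
larger_pos = pd.Series([0, 17, 256], dtype="int64")
with_neg = pd.Series([-1, 17, 255], dtype="int64")
floats = pd.Series([0.5, 1.25, 2.0], dtype="float64")

print(pd.to_numeric(small_pos, downcast="unsigned").dtype)   # uint8
print(pd.to_numeric(larger_pos, downcast="unsigned").dtype)  # uint16
print(pd.to_numeric(with_neg, downcast="unsigned").dtype)    # int64 (unchanged)
print(pd.to_numeric(with_neg, downcast="integer").dtype)     # int16
print(pd.to_numeric(floats, downcast="float").dtype)         # float32
```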

#### 2.2 - Downcasting Objects
Pandas already has an optimized way to shrink object columns: converting them to categorical values.  The caveat is that if there are too many unique values in a feature, converting to categorical could actually increase memory usage.  A quote from the [documentation](http://pandas.pydata.org/pandas-docs/stable/categorical.html#gotchas): "If the number of categories approaches the length of the data, the Categorical will use nearly the same or more memory than an equivalent object dtype representation."

Using the describe function on all object features, we can see a snapshot of how many unique values there are in each column.  The first column, MachineIdentifier, would be a terrible column to change into a categorical type, as every row is a unique value.  However, the next few, ProductName, EngineVersion, and AppVersion, are great examples to convert, having only 2, 36, and 58 unique values out of 10,000 rows.

In [None]:
subset_df.select_dtypes(['object']).describe()

Josh suggests only converting columns where fewer than 50% of the values are unique.

In [None]:
def downcast_Obj(df):
    for col in df.select_dtypes(['object']):
        if df[col].nunique() < len(df[col])/2:
            df[col] = df[col].astype('category')

In [None]:
downcast_Obj(subset_df)

In [None]:
subset_df.info(memory_usage='deep')

In [None]:
get_memoryUsage(subset_df)

We see that all but one of the object features were converted to category, the one holdout being MachineIdentifier.  In addition, the total memory usage has decreased to 2.98 MB, about an 85% decrease in memory usage.
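
The effect of cardinality can be sketched with two made-up columns of 10,000 strings each, one with only two distinct values and one where every value is unique (a stand-in for something like MachineIdentifier). The exact numbers depend on your pandas version, so none are quoted here.

```python
import pandas as pd

n = 10_000
few_unique = pd.Series(["alpha", "beta"] * (n // 2), dtype=object)
all_unique = pd.Series([f"id_{i}" for i in range(n)], dtype=object)

def deep_kb(s):
    """Deep memory usage of a Series in KB, excluding its index."""
    return s.memory_usage(deep=True, index=False) / 1024

for name, s in [("few_unique", few_unique), ("all_unique", all_unique)]:
    as_object = deep_kb(s)
    as_category = deep_kb(s.astype("category"))
    print(f"{name}: {as_object:.0f} KB as object, {as_category:.0f} KB as category")
```

The low-cardinality column should shrink dramatically, because each row becomes a small integer code. The all-unique column typically does not benefit, since every string still has to be stored once as a category in addition to the codes.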

### 3 - Loading the Whole Data Set
Now that we have a good idea of what data type each feature should be for memory efficiency, we can use this knowledge to load the full data set.  pandas' read_csv() function takes an argument, dtype, which accepts a dictionary mapping each feature name to the data type you want it loaded as.  If I don't set GeoNameIdentifier to a float, I get the error 'Integer column has NA values'.  NumPy integer types can't represent null values, but floats can (missing values become np.nan).  After loading the whole data set, I can dive into that deeper and hopefully convert it to a more appropriate data type.
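
The integer/NA problem can be reproduced on a tiny made-up CSV (the column names and values below are assumptions for illustration). The sketch shows the error with a NumPy integer dtype, the float workaround used in this post, and pandas' nullable integer dtype ('UInt8', with a capital U) as another possible option that I have not tried on the full data set.

```python
import io
import pandas as pd

# Made-up two-column CSV; the second row has no value for 'geo'.
csv_text = "machine,geo\nA,35\nB,\nC,171\n"

# A NumPy integer dtype cannot hold the missing value, so read_csv raises.
try:
    pd.read_csv(io.StringIO(csv_text), dtype={"geo": "uint8"})
    int_error = None
except ValueError as err:
    int_error = str(err)
print(int_error)

# Workaround 1: load as float, where the gap becomes NaN.
as_float = pd.read_csv(io.StringIO(csv_text), dtype={"geo": "float32"})

# Workaround 2: pandas' nullable integer dtype, where the gap becomes <NA>.
as_nullable = pd.read_csv(io.StringIO(csv_text), dtype={"geo": "UInt8"})

print(as_float["geo"].tolist())
print(as_nullable["geo"].tolist())
```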

In [None]:
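dtype_dict = {}
for col in subset_df.columns:
    col_dtype = subset_df[col].dtype
    # Pass 'category' by name: a CategoricalDtype taken from the 10,000-row
    # sample would turn any value the sample never saw into NaN.
    dtype_dict[col] = 'category' if col_dtype.name == 'category' else col_dtype

# GeoNameIdentifier has missing values, so it cannot be loaded as an integer.
dtype_dict['GeoNameIdentifier'] = np.dtype(np.float32)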
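
One thing to watch when reusing dtypes learned from a sample: a categorical dtype object carries the list of categories it saw, and read_csv treats values outside that list as missing. The sketch below uses made-up product values to show the difference between passing the sample's dtype object and passing the name 'category'.

```python
import io
import pandas as pd

# Hypothetical sample in which only one product value appears.
sample = pd.Series(["prodA", "prodA"], dtype=object).astype("category")

# Hypothetical full file: 'prodB' never appeared in the sample.
csv_text = "ProductName\nprodA\nprodB\n"

# Reusing the sample's dtype fixes the categories to what the sample saw.
fixed = pd.read_csv(io.StringIO(csv_text), dtype={"ProductName": sample.dtype})

# Passing the name 'category' lets pandas infer categories from the data read.
inferred = pd.read_csv(io.StringIO(csv_text), dtype={"ProductName": "category"})

print(fixed["ProductName"].isna().sum())             # unseen value became NaN
print(list(inferred["ProductName"].cat.categories))  # both values kept
```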

In [None]:
train_df = pd.read_csv('../input/train.csv', dtype=dtype_dict)

In [None]:
get_memoryUsage(train_df)

In [None]:
test_df = pd.read_csv('../input/test.csv', dtype = dtype_dict)

In [None]:
get_memoryUsage(test_df)

After loading the training and test sets, the total memory usage is 2442.11 MB and 2142.24 MB respectively, or roughly 2.4 GB and 2.1 GB.  Compare these numbers to the uncompressed file sizes of 4.08 GB and 3.53 GB: that is about a 40% reduction.