## Getting Started with cuDF

## Loading data into a GPU DataFrame (GDF)

In [None]:
import cudf

### Loading data into a Pandas DataFrame

It's easy to load almost any sort of data (json, csv, etc) into a Pandas DataFrame. Ex (csv import from disk):

In [3]:
import pandas

# read data from csv file into pandas dataframe
df = pandas.read_csv('data/ipums/ipums_easy.csv.gz', compression='gzip')

[Read more on using a Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)

### Converting a Pandas DataFrame to a GDF

In [4]:
from cudf.dataframe.dataframe import DataFrame

# convert the panda dataframe into a gpu dataframe
gdf = DataFrame.from_pandas(df)

## Working with the GDF
See the [pygdf documentation](http://pygdf.readthedocs.io/en/latest/index.html) for more.

### Take a look at the columns and their data types

In [5]:
# print the columns and their datatypes in this gdf
gdf.dtypes()

RECTYPE          int64
YEAR             int64
DATANUM          int64
SERIAL           int64
NUMPREC          int64
SUBSAMP          int64
HHWT             int64
HHTYPE           int64
REPWT          float64
CLUSTER          int64
ADJUST         float64
CPI99          float64
REGION           int64
STATEICP         int64
STATEFIP         int64
COUNTY         float64
COUNTYFIPS     float64
METRO          float64
METAREA        float64
METAREAD       float64
MET2013        float64
MET2013ERR     float64
CITY           float64
CITYERR        float64
CITYPOP        float64
PUMA           float64
PUMARES2MIG    float64
STRATA           int64
PUMASUPR       float64
CONSPUMA       float64
                ...   
REPWTP51       float64
REPWTP52       float64
REPWTP53       float64
REPWTP54       float64
REPWTP55       float64
REPWTP56       float64
REPWTP57       float64
REPWTP58       float64
REPWTP59       float64
REPWTP60       float64
REPWTP61       float64
REPWTP62       float64
REPWTP63   

### Slice the cuDF dataframe

Woah! This GDF has a lot of columns, let's make it more manageable...

In [6]:
# only select certain columns (and overwrite the gdf)
gdf = gdf.loc[0:, [
    'INCEARN', 'PERWT', 'ADJUST', 'STATEICP', 'ROOMS', 'BEDROOMS',
     'PHONE', 'VEHICLES', 'RACE', 'SEX', 'AGE', 'VETSTAT'
]]

# show the first 5 records of each column
print(gdf.head(5))

  INCEARN PERWT   ADJUST STATEICP ROOMS BEDROOMS PHONE VEHICLES RACE  SEX  AGE VETSTAT
0    4000   618 1.018516       21     7        4     2        3    1    2   66       1
1   36700   684 1.018516       21     7        4     2        3    1    1   40       1
2   54000   618 1.018516       49     5        4     2        3    1    1   51       2
3     900   609 1.018516       49     5        4     2        3    1    2   48       1
4    2000   621 1.018516       49     5        4     2        3    1    1   19       1

### Modify data types

In [7]:
gdf.dtypes()

INCEARN       int64
PERWT         int64
ADJUST      float64
STATEICP      int64
ROOMS         int64
BEDROOMS      int64
PHONE         int64
VEHICLES      int64
RACE          int64
SEX           int64
AGE           int64
VETSTAT       int64
dtype: object

Looks like `INCEARN` and `PERWT` are integers when they should be floats. Let's fix that...

In [8]:
import numpy as np

# force float64 instead of int64
gdf['INCEARN'] = gdf['INCEARN'].astype(np.float64)
gdf['PERWT'] = gdf['PERWT'].astype(np.float64)

# take another look
gdf.dtypes()

INCEARN     float64
PERWT       float64
ADJUST      float64
STATEICP      int64
ROOMS         int64
BEDROOMS      int64
PHONE         int64
VEHICLES      int64
RACE          int64
SEX           int64
AGE           int64
VETSTAT       int64
dtype: object

### Manipulate data with a user-defined function (UDF)

`INCEARN` is not a true representation of income earned. Let's adjust it by multiplying it by the `ADJUST` constant.

In [9]:
# define a function to adjust the incearn var
# so it more accurately represents income earned
adjust = gdf['ADJUST'][0]
def adjust_incearn(incearn):
    return adjust * incearn;

# apply it to the 'population' column
gdf['INCEARN'] = gdf['INCEARN'].applymap(adjust_incearn)

# drop the ADJUST column
gdf.drop_column('ADJUST')

# compute the mean
gdf['INCEARN'].mean()

18637.0999154208

### Sort the data

In [10]:
# sort the gdf by the INCEARN column
gdf = gdf.sort_values(by='INCEARN')
# reset the index so we can use loc slicing later
gdf = gdf.reset_index()
print(gdf.head(5))

        INCEARN PERWT STATEICP ROOMS BEDROOMS PHONE VEHICLES RACE  SEX  AGE VETSTAT
0 -10184.141484 538.0       53     4        3     2        2    1    1   35       1
1 -10184.141484 614.0       71     7        4     2        2    5    2   57       1
2 -10184.141484 511.0       45     9        5     2        2    1    2   48       1
3 -10184.141484 453.0       45     9        5     2        2    1    1   57       1
4 -10184.141484 593.0       53     9        5     2        5    1    2   55       1

Looks like we have some negative income values. Let's filter those out...

### Filter the data

In [11]:
# how many records do we have?
print("{} = Original # of records".format(len(gdf)))

# filter out
gdf = gdf.query('INCEARN >= 0')

# how many records do we have left?
print("{} = New # of records".format(len(gdf)))

# sanity check...
print(gdf.head(5))

10000 = Original # of records
9985 = New # of records


  INCEARN PERWT STATEICP ROOMS BEDROOMS PHONE VEHICLES RACE  SEX  AGE VETSTAT
15     0.0 559.0       49     5        4     2        3    1    2   17       1
16     0.0 589.0       43     8        4     2        3    1    1   21       1
17     0.0 617.0       43     5        3     2        1    1    2   66       1
18     0.0 574.0       43     6        4     2        1    1    1   80       2
19     0.0 616.0       43     6        4     2        1    1    2   72       1

### One hot encode categorical columns

In [12]:
# define the categorical columns
cat_cols = set(['STATEICP', 'RACE', 'SEX', 'VETSTAT'])
# store the unique values for each category column
uniques = {}

# iterate through each categorical column and one-hot
# encode it using the unique values it has
for k in cat_cols:
    uniques[k] = gdf[k].unique_k(k=1000)
    cats = uniques[k][1:]  # drop first
    gdf = gdf.one_hot_encoding(k, prefix=k, cats=cats)
    del gdf[k]
    
# we should see many more columns since the categorical
# columns will get expanded due to one-hot encoding
gdf.dtypes()

INCEARN        float64
PERWT          float64
ROOMS            int64
BEDROOMS         int64
PHONE            int64
VEHICLES         int64
AGE              int64
SEX_2          float64
STATEICP_2     float64
STATEICP_3     float64
STATEICP_4     float64
STATEICP_5     float64
STATEICP_6     float64
STATEICP_11    float64
STATEICP_12    float64
STATEICP_13    float64
STATEICP_14    float64
STATEICP_21    float64
STATEICP_22    float64
STATEICP_23    float64
STATEICP_24    float64
STATEICP_25    float64
STATEICP_31    float64
STATEICP_32    float64
STATEICP_33    float64
STATEICP_34    float64
STATEICP_35    float64
STATEICP_36    float64
STATEICP_37    float64
STATEICP_40    float64
                ...   
STATEICP_49    float64
STATEICP_51    float64
STATEICP_52    float64
STATEICP_53    float64
STATEICP_54    float64
STATEICP_56    float64
STATEICP_61    float64
STATEICP_62    float64
STATEICP_63    float64
STATEICP_64    float64
STATEICP_65    float64
STATEICP_66    float64
STATEICP_67

### Split the data into training, validation, and test sets

In [13]:
# enforce float64 data type on ALL columns
for k in gdf.columns:
    gdf[k] = gdf[k].astype(np.float64)

# set the fractions for training and validation
fractions = {
    "train": 0.8,
    "valid": 0.2
}

# validation splitpoint
splitpoint = int(len(gdf) * fractions["train"])
print('splitpoint: {} of {} is {}'.format(fractions["train"], len(gdf), splitpoint))

# break the gdf up into training, validation, and test sets
gdfs = {
    "train": gdf.loc[:splitpoint],
    "valid": gdf.loc[splitpoint:]
}
print('gdfs["train"] has {} rows'.format(len(gdfs["train"])))
print('gdfs["valid"] has {} rows'.format(len(gdfs["valid"])))

splitpoint: 0.8 of 9985 is 7988
gdfs["train"] has 7974 rows
gdfs["valid"] has 2012 rows


### Turn the GDFs into matrices

In [14]:
# produce gpu matrices (to input to ml libraries, etc)
# this step should not be necessary in the near future
# (should be able to use gdf as input)
matrices = {
    "train": {
        "x": gdfs["train"].as_gpu_matrix(columns=gdf.columns[1:]),
        "y": gdfs["train"].as_gpu_matrix(columns=[gdf.columns[0]])
    },
    "valid": {
        "x": gdfs["valid"].as_gpu_matrix(columns=gdf.columns[1:]),
        "y": gdfs["valid"].as_gpu_matrix(columns=[gdf.columns[0]])
    }
}

# check the matrix shapes (sanity check)
print('matrices["train"]["x"] shape:', matrices["train"]["x"].shape)
print('matrices["train"]["y"] shape:', matrices["train"]["y"].shape)
print('matrices["valid"]["x"] shape:', matrices["valid"]["x"].shape)
print('matrices["valid"]["y"] shape:', matrices["valid"]["y"].shape)

matrices["train"]["x"] shape: (7974, 67)
matrices["train"]["y"] shape: (7974, 1)
matrices["valid"]["x"] shape: (2012, 67)
matrices["valid"]["y"] shape: (2012, 1)
