# Getting Started with cuDF


## Using cuDF with the California Housing Dataset

The goal of this exercise is to compute the average number of rooms and bedrooms per household for the properties inside a coordinate bounding box, using only the _cuDF_ library.

### Importing as a cuDF DataFrame


In [None]:
## Import the cuDF Library:
import cudf

The data we're going to use is the `data/housing.csv` file. This file contains data on housing blocks in the state of California. Let's examine this data further. 

In [None]:
## We can read this as a cuDF dataframe by using:
californiaDF = cudf.read_csv('data/housing.csv')

## Visualize DataFrame
print(californiaDF)

## Examine Shape of DataFrame
print("\nDataframe is of dimensions: " + str(californiaDF.shape))

The output of the code block above shows the dimensions of the dataframe we created. We can list the column names by printing `californiaDF.columns.values`.

In [None]:
## Visualizing Column Names
print(californiaDF.columns.values)

We can also load the data with pandas first and then convert it to a cuDF dataframe with the `cudf.DataFrame.from_pandas()` function.

In [None]:
## Import Pandas
import pandas

## Read data as pandas dataframe
californiaDF = pandas.read_csv('data/housing.csv')

## Convert to cuDF dataframe
californiaDF = cudf.DataFrame.from_pandas(californiaDF)

## Visualize DataFrame
print(californiaDF)

## Examine Shape of DataFrame
print("\nDataframe is of dimensions: " + str(californiaDF.shape))

This dataframe should be identical to the one created earlier!

### Selection

Our first manipulation task is to keep every row of the data, but only the `longitude`, `latitude`, `total_rooms`, `total_bedrooms` and `households` columns.

In [None]:
## Select all rows, but only the longitude, latitude, total_rooms, total_bedrooms and households columns

householdDF = californiaDF.loc[:,['longitude', 'latitude', 'total_rooms','total_bedrooms','households']]

print(householdDF)

### Filter queries

Our next task is to visualize only the housing blocks within a certain longitude and latitude bounding box. This box can be defined by two `longitude`,`latitude` pairs; one pair representing the lower left of the box and one pair representing the top right. 

Let's focus on the Mountain View area, where we define:
* Lower left coordinates of the bounding box: `latitude` = 37.36472345, `longitude` = -122.12830693
* Top right coordinates of the bounding box: `latitude` = 37.40657584, `longitude` = -122.06162184

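Before running the query on the GPU, it can help to spell out what "inside the box" means. The following is a minimal plain-Python sketch (no cuDF involved); `in_bounding_box` is a hypothetical helper written for this illustration, and the two test points are made up rather than taken from `data/housing.csv`.

```python
# Plain-Python sketch of the bounding-box test (no GPU or cuDF required).
# in_bounding_box is a hypothetical helper, not part of cuDF.

def in_bounding_box(longitude, latitude, lower_left, top_right):
    """Return True if the point lies inside the box (edges included).

    lower_left and top_right are (longitude, latitude) pairs.
    """
    min_long, min_lat = lower_left
    max_long, max_lat = top_right
    return (min_long <= longitude <= max_long) and (min_lat <= latitude <= max_lat)


# Mountain View box from the text above
LOWER_LEFT = (-122.12830693, 37.36472345)
TOP_RIGHT = (-122.06162184, 37.40657584)

# Illustrative points (made up for this sketch, not rows from housing.csv)
print(in_bounding_box(-122.08, 37.39, LOWER_LEFT, TOP_RIGHT))  # inside the box
print(in_bounding_box(-121.95, 37.35, LOWER_LEFT, TOP_RIGHT))  # east of the box
```

A point is kept only when its longitude lies between the two longitudes and its latitude lies between the two latitudes; the dataframe filter applies the same four comparisons to every row.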

In [None]:
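## Running Queries on cuDF
## Keep only rows whose longitude and latitude both fall inside the bounding box
filteredDF = householdDF.query("(longitude >= -122.12830693) and (longitude <= -122.06162184) and (latitude >= 37.36472345) and (latitude <= 37.40657584)")
print(filteredDF)

## Count the number of housing blocks inside the bounding box
print("Number of housing blocks in bounding box: " + str(len(filteredDF)))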

As good practice, we should do some preliminary cleaning before computing anything. Let's replace any missing (null) values in the dataframe with `0`, using `filteredDF.fillna()`.

_This is not necessary for this dataset, as it is already clean, but it is good practice regardless._

_Note that if the data did contain missing values, this step would change the final results slightly._
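
To see why filling missing values with `0` can shift the results, here is a small plain-Python sketch. The numbers are hypothetical and only illustrate the arithmetic; dataframe reductions such as `mean()` typically skip missing values by default, which corresponds to the first option below.

```python
# Hypothetical bedroom counts for four blocks; None marks a missing value.
total_bedrooms = [200, None, 400, 300]

# Option 1: skip the missing entry when averaging
present = [v for v in total_bedrooms if v is not None]
mean_skipping = sum(present) / len(present)

# Option 2: replace the missing entry with 0 before averaging
filled = [0 if v is None else v for v in total_bedrooms]
mean_filled = sum(filled) / len(filled)

print(mean_skipping)  # 900 / 3 = 300.0
print(mean_filled)    # 900 / 4 = 225.0
```

Filling with `0` keeps the block in the count while adding nothing to the total, so the average is pulled down.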

In [None]:
filteredDF = filteredDF.fillna(0)

print(filteredDF)

We should now have a dataframe with housing data over the Mountain View area. Our next step is to average the number of total_rooms and total_bedrooms over the total households. 

### Operations

We need to find the average bedrooms and rooms for each of the households within a geographic area.

In [None]:
## Average values of certain columns
print("Avg. Households per block in given bounding box: " + str(filteredDF['households'].mean()))
print("Avg. Total Bedrooms per block in given bounding box: " + str(filteredDF['total_bedrooms'].mean()))
print("Avg. Total Rooms per block in given bounding box: " + str(filteredDF['total_rooms'].mean()))

print("\n----------------\n")

avgBedroomsHousehold = filteredDF['total_bedrooms'].sum()/filteredDF['households'].sum()
avgRoomsHousehold = filteredDF['total_rooms'].sum()/filteredDF['households'].sum()

print("Avg. Bedrooms per household in given bounding box: " + str(avgBedroomsHousehold))
print("Avg. Rooms per household in given bounding box: " + str(avgRoomsHousehold))

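The per-household figures above divide one column sum by another, which is not the same as averaging each block's own ratio. The plain-Python sketch below uses two hypothetical blocks (not rows from `data/housing.csv`) to show the difference.

```python
# Hypothetical per-block figures (not taken from housing.csv)
blocks = [
    {"total_rooms": 1000, "households": 200},  # 5 rooms per household
    {"total_rooms": 300, "households": 100},   # 3 rooms per household
]

# Ratio of sums: every household counts equally
total_rooms = sum(b["total_rooms"] for b in blocks)
total_households = sum(b["households"] for b in blocks)
rooms_per_household = total_rooms / total_households

# Mean of per-block ratios: every block counts equally
mean_of_block_ratios = sum(b["total_rooms"] / b["households"] for b in blocks) / len(blocks)

print(rooms_per_household)   # 1300 / 300, roughly 4.33
print(mean_of_block_ratios)  # (5 + 3) / 2 = 4.0
```

The ratio of sums weights larger blocks more heavily, which is usually what "average rooms per household" is meant to express.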

Congrats! You successfully generated the average number of bedrooms and rooms per household in a given area in California!

If you want to make this easier, I suggest we create a function that automates this process. Let's do this quickly!

In [None]:
## Function combining all the processes above 

# Function input variables:
# csvPath - Path to CSV housing data file
# long1 - Lower left longitude coordinate 
# lat1 - Lower left latitude coordinate
# long2 - Upper right longitude coordinate
# lat2 - Upper right latitude coordinate

def HouseHoldAnalysis(csvPath, long1, lat1, long2, lat2):
  ## Data Input
  californiaDF = cudf.read_csv(csvPath)
  print("\n Initial Dataframe is of dimensions: " + str(californiaDF.shape) +"\n")
  
  ## Selection
  householdDF = californiaDF.loc[:,['longitude', 'latitude', 'total_rooms','total_bedrooms','households']]
  
  ## Query
  filteredDF = householdDF.query("(longitude >= "+str(long1)+") and (longitude <= "+str(long2)+") and (latitude >= "+str(lat1)+") and (latitude <= "+str(lat2)+")")
  
  
  ## Average values of certain columns
  print("Filtered Dataframe is of dimensions: " + str(filteredDF.shape) +"\n")
  print("Avg. Households per block in given bounding box: " + str(filteredDF['households'].mean()))
  print("Avg. Total Bedrooms per block in given bounding box: " + str(filteredDF['total_bedrooms'].mean()))
  print("Avg. Total Rooms per block in given bounding box: " + str(filteredDF['total_rooms'].mean()))

  print("\n----------------\n")

  avgBedroomsHousehold = filteredDF['total_bedrooms'].sum()/filteredDF['households'].sum()
  avgRoomsHousehold = filteredDF['total_rooms'].sum()/filteredDF['households'].sum()

  print("Avg. Bedrooms per household in given bounding box: " + str(avgBedroomsHousehold))
  print("Avg. Rooms per household in given bounding box: " + str(avgRoomsHousehold))

  return(avgBedroomsHousehold, avgRoomsHousehold)
  

# Here we go! 

You can now call this function with any geographical bounding box, and it will return the average bedrooms per household and average rooms per household in that area!

This function we created can also work on datasets for different areas that follow the same structure as the `data/housing.csv` file. For example, this means that if you have a similar data-set for New York, then you can calculate the average bedrooms and rooms in geographic boxes that you define!
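
Because the function takes longitude first and latitude second, it is easy to pass the coordinates in the wrong order. The following is a minimal sketch of a sanity check you could run before calling it; `check_bounding_box` is a hypothetical helper, not part of cuDF or of the function above.

```python
def check_bounding_box(long1, lat1, long2, lat2):
    """Raise ValueError unless (long1, lat1) is a valid lower-left corner
    and (long2, lat2) a valid upper-right corner; otherwise return the values."""
    if not (-180 <= long1 <= 180 and -180 <= long2 <= 180):
        raise ValueError("longitude must be between -180 and 180")
    if not (-90 <= lat1 <= 90 and -90 <= lat2 <= 90):
        raise ValueError("latitude must be between -90 and 90")
    if long1 > long2 or lat1 > lat2:
        raise ValueError("lower-left corner must not exceed upper-right corner")
    return long1, lat1, long2, lat2


# Mountain View box in the order the function expects: longitude, then latitude
box = check_bounding_box(-122.12830693, 37.36472345, -122.06162184, 37.40657584)
print(box)

# Latitude and longitude swapped by mistake: -122.12830693 is not a valid latitude
try:
    check_bounding_box(37.36472345, -122.12830693, 37.40657584, -122.06162184)
except ValueError as err:
    print("Rejected:", err)
```

The validated values can then be passed straight on, for example with `HouseHoldAnalysis('data/housing.csv', *box)`.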

In [None]:
## Mountain View Area (Same example as above)
HouseHoldAnalysis('data/housing.csv', -122.12830693, 37.36472345, -122.06162184, 37.40657584)

Go on, try some more areas in California! If you need help finding `latitude` and `longitude` values, try [mapcoordinates.net](https://www.mapcoordinates.net/en).

In [None]:
## Santa Clara Area (New Example!)
HouseHoldAnalysis('data/housing.csv', -121.97046831, 37.3316244, -121.92549303, 37.36519458)

# Next Steps 

## cuDF Guide

I recommend using the [cuDF documentation guide](https://rapidsai.github.io/projects/cudf/en/latest/index.html) for a deeper understanding of GPU dataframe usage. 

## GPU Data Science

If you wish to learn more about running Data Science Projects on the GPU, I recommend you check out the [full documentation for RAPIDS.](https://docs.rapids.ai/api)

### Check out the RAPIDS notebooks repos for more examples:

* https://github.com/rapidsai/notebooks
* https://github.com/rapidsai/notebooks-contrib