# Environment Sanity Check #

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Check the output of `!nvidia-smi` to make sure you've been allocated a Tesla T4.

# Setup:

1. Install the most recent Miniconda release compatible with Google Colab's Python install (3.6.7)
2. Install RAPIDS libraries
3. Set necessary environment variables
4. Copy RAPIDS .so files into current working directory, a workaround for conda/colab interactions

In [0]:
!nvidia-smi

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
device_name = pynvml.nvmlDeviceGetName(handle)

# no T4 allocated: stop here and explain how to request another instance
if device_name != b'Tesla T4':
  raise Exception("""Unfortunately this instance does not have a T4 GPU.
    
    Please make sure you've configured Colab to request a GPU instance type.
    
    Sometimes Colab allocates a Tesla K80 instead of a T4.

    If you get a K80 GPU, try Runtime -> Reset all runtimes...""")
  
# got a T4, good to go 
else:
  print('Woo! You got the right kind of GPU!')

  # install miniconda
  !wget -c https://repo.continuum.io/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh
  !chmod +x Miniconda3-4.5.4-Linux-x86_64.sh
  !bash ./Miniconda3-4.5.4-Linux-x86_64.sh -b -f -p /usr/local

  # install RAPIDS packages
  !conda install -q -y --prefix /usr/local -c conda-forge \
    -c rapidsai-nightly/label/cuda10.0 -c nvidia/label/cuda10.0 \
    cudf cuml

  # set environment vars
  import sys, os, shutil
  sys.path.append('/usr/local/lib/python3.6/site-packages/')
  os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
  os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'

  # copy .so files to current working dir
  for fn in ['libcudf.so', 'librmm.so']:
    shutil.copy('/usr/local/lib/'+fn, os.getcwd())
  
  # check whether the miniconda and RAPIDS install worked on the first try
  try:
    # imports for examples
    import pandas as pd
    import cudf  # testing cudf only (0.8)
    import cuml
    import io, requests
    print('GOOD TO GO')
  # probably missing cudf, let's try again
  except Exception:
    print('IMPORT FAILURE, RERUNNING MINICONDA AND RAPIDS INSTALLATION')
    # install miniconda
    !wget -c https://repo.continuum.io/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh
    !chmod +x Miniconda3-4.5.4-Linux-x86_64.sh
    !bash ./Miniconda3-4.5.4-Linux-x86_64.sh -b -f -p /usr/local

    # install RAPIDS packages
    !conda install -q -y --prefix /usr/local -c conda-forge \
      -c rapidsai-nightly/label/cuda10.0 -c nvidia/label/cuda10.0 \
      cudf cuml

    # set environment vars
    import sys, os, shutil
    sys.path.append('/usr/local/lib/python3.6/site-packages/')
    os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
    os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'

    # copy .so files to current working dir
    for fn in ['libcudf.so', 'librmm.so']:
      shutil.copy('/usr/local/lib/'+fn, os.getcwd())

    # imports for examples
    import pandas as pd
    import cudf  # testing cudf only
    import cuml
    import io, requests
    print('GOOD TO GO')

Sun Jul 21 23:31:28 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   49C    P8    16W /  70W |      0MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

# Zillow Comp Conversion
Repo: https://github.com/eswar3/Zillow-prediction-models

In [0]:
# Info on how to get your api key (kaggle.json) here: https://github.com/Kaggle/kaggle-api#api-credentials
!pip install kaggle
!mkdir -p /root/.kaggle
# write your API credentials (fill in your own username and key)
!echo '{"username":"warobson","key":""}' > /root/.kaggle/kaggle.json
!chmod 600 /root/.kaggle/kaggle.json
# !kaggle datasets download
!kaggle competitions download -c zillow-prize-1

# unzip kaggle data
!unzip -q "/content/sample_submission.csv.zip"
!unzip -q "/content/train_2016_v2.csv.zip"
!unzip -q "/content/properties_2016.csv.zip"
!unzip -q "/content/train_2017.csv.zip"
!unzip -q "/content/properties_2017.csv.zip"
# display content folder contents
!ls "/content/"

Collecting kaggle
  Downloading https://files.pythonhosted.org/packages/f4/de/4f22073f3afa618976ee0721b0deb72b5cde2782057e04a815a6828b53f9/kaggle-1.5.4.tar.gz (54kB)
    100% |████████████████████████████████| 61kB 4.3MB/s 
Collecting python-dateutil (from kaggle)
  Downloading https://files.pythonhosted.org/packages/41/17/c62faccbfbd163c7f57f3844689e3a78bae1f403648a6afb1d0866d87fbb/python_dateutil-2.8.0-py2.py3-none-any.whl (226kB)
    100% |████████████████████████████████| 235kB 12.1MB/s 
Collecting python-slugify (from kaggle)
  Downloading https://files.pythonhosted.org/packages/c1/19/c3cf1dc65e89aa999f85a4a3a4924ccac765a6964b405d487b7b7c8bb39f/python-slugify-3.0.2.tar.gz
Collecting text-unidecode==1.2 (from python-slugify->kaggle)
  Downloading https://files.pythonhosted.org/packages/79/42/d717cc2b4520fb09e45b344b1b0b4e81aa672001dd128c180fabc655c341/text_unidecode-1.2-py2.py3-none-any.whl (77kB)
    100% |████████████████████████████████| 81kB 36.7MB/s 

## [Zillow Prediction Model](https://github.com/eswar3/Zillow-prediction-models/blob/master/Step%202a-Approach1.ipynb)

    In this approach, the properties data (about 3 million records) and the transaction data (90k records) are merged before any missing values are imputed

# Import Data

    Importing properties_2016, which has data on about 3 million unique house properties with 58 attributes

    Importing the transaction data, which has 90k records of the properties sold in 2016

    Merging the data sets on parcelid
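
A minimal sketch of what a left merge on `parcelid` does, using pandas as a stand-in for cudf so it runs without a GPU. The toy frames and their values are invented for illustration and are not taken from the Zillow files.

```python
import pandas as pd  # stand-in for cudf in this sketch

# Toy data; the values are invented for illustration.
transactions = pd.DataFrame({"parcelid": [1, 2, 2],
                             "logerror": [0.01, -0.02, 0.03]})
properties = pd.DataFrame({"parcelid": [1, 2, 3],
                           "bedroomcnt": [3.0, 2.0, 4.0]})

# Left join: every transaction row is kept and picks up its property
# attributes; properties with no transaction (parcelid 3) do not appear.
merged = transactions.merge(properties, how="left", on="parcelid")
print(merged)
```

Because the transaction table is on the left, the merged frame has one row per sale rather than one row per property.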



In [0]:
# import 2016 properties
prop2016 = cudf.read_csv('/content/properties_2016.csv')
# peek display
print(prop2016.head())
# import train 2016  data
train2016 = cudf.read_csv('/content/train_2016_v2.csv',parse_dates=["transactiondate"])
# peek display
print(train2016.head())

   parcelid  airconditioningtypeid  architecturalstyletypeid  basementsqft  bathroomcnt  bedroomcnt  buildingclasstypeid ...  censustractandblock
0  10754147                                                                         0.0         0.0                      ...                     
1  10759547                                                                         0.0         0.0                      ...                     
2  10843547                                                                         0.0         0.0                      ...                     
3  10859147                                                                         0.0         0.0                  3.0 ...                     
4  10879947                                                                         0.0         0.0                  4.0 ...                     
[50 more columns]
   parcelid              logerror         transactiondate
0  11016594                0.0276 2016-01-01T00:

#### Merging the data and renaming attributes to meaningful names

In [0]:
# merge 2016 train and property dataframes by parcel id
df_train = train2016.merge(prop2016, how='left', on='parcelid')
# add column indicating month of transaction
df_train['transaction_month'] = df_train['transactiondate'].dt.month
# rename columns for general English understandability
df_train=df_train.rename(columns={"bathroomcnt": "total_bath",
                                  "fullbathcnt": "full_bath",
                                  "threequarterbathnbr": "half_bath",
                                  "yardbuildingsqft17": "patio_sqft",
                                  "yardbuildingsqft26":"storage_sqft",
                                  "decktypeid": "deck_flag",
                                  "pooltypeid7": "pool_with_spa_tub_no", 
                                  "pooltypeid2": "pool_with_spa_tub_yes",
                                  "hashottuborspa": "has_hottub_or_spa", 
                                  "pooltypeid10": "just_hottub_or_spa",
                                  "calculatedfinishedsquarefeet": "total_finished_living_area_sqft", 
                                  "finishedsquarefeet12": "finished_living_area_sqft",
                                  "lotsizesquarefeet": "lot_area_sqft",
                                  "finishedsquarefeet50": "finished_living_area_entryfloor_sqft1",
                                  "finishedfloor1squarefeet": "finished_living_area_entryfloor_sqft2",
                                  "finishedsquarefeet6": "base_unfinished_and_finished_area_sqft",
                                  "finishedsquarefeet15": "total_area_sqft",
                                  "finishedsquarefeet13": "preimeter_living_area_sqft",
                                  "taxvaluedollarcnt":"total_parcel_tax",
                                  "landtaxvaluedollarcnt":"land_tax",
                                  "taxamount":"total_property_tax_2016",
                                  "structuretaxvaluedollarcnt":"structure_tax",
                                  "garagetotalsqft":"garage_sqft",
                                  "fireplacecnt":"fireplace_count",
                                  "buildingqualitytypeid":"building_quality_id",
                                  "heatingorsystemtypeid":"heating_system_id",
                                  "airconditioningtypeid":"ac_id",
                                  "storytypeid": "basement_flag",
                                  "poolsizesum": "pool_sqft"})
# what's it look like?
print(df_train.head())

   parcelid             logerror         transactiondate  ac_id  architecturalstyletypeid  basementsqft  total_bath ...  transaction_month
0  11827818               0.0402 2016-03-15T00:00:00.000                                                        4.0 ...                  3
1  12123024               0.0296 2016-03-15T00:00:00.000                                                        3.0 ...                  3
2  13867327               0.0344 2016-03-15T00:00:00.000                                                        2.0 ...                  3
3  12681894                0.006 2016-03-15T00:00:00.000                                                        3.0 ...                  3
4  12848541  0.06949999999999999 2016-03-15T00:00:00.000    1.0                                                 4.0 ...                  3
[53 more columns]



### Dealing with Attributes That Have Missing Values
*   poolcnt is a binary variable, hence replace all NULL values with zero
*   pool_with_spa_tub_no & pool_with_spa_tub_yes are also binary variables, hence replace all NULL values with zero
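
One thing worth keeping in mind when replacing nulls: in pandas, `fillna` returns a new column and leaves the original untouched unless the result is assigned back. The sketch below shows this on an invented toy column; it uses pandas as a stand-in, and it is an assumption here that cudf behaves the same way.

```python
import numpy as np
import pandas as pd  # stand-in for cudf in this sketch

# Toy column; values are invented for illustration.
toy = pd.DataFrame({"poolcnt": [1.0, np.nan, np.nan]})

# Calling fillna without assigning discards the result.
toy["poolcnt"].fillna(0)
nulls_before = int(toy["poolcnt"].isnull().sum())

# Assigning the result back actually replaces the nulls.
toy["poolcnt"] = toy["poolcnt"].fillna(0)
nulls_after = int(toy["poolcnt"].isnull().sum())

print(nulls_before, nulls_after)
```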

In [0]:
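# replace missing pool values with zero; fillna returns a new column, so assign it back
df_train['poolcnt'] = df_train['poolcnt'].fillna(0)
df_train['pool_with_spa_tub_no'] = df_train['pool_with_spa_tub_no'].fillna(0)
df_train['pool_with_spa_tub_yes'] = df_train['pool_with_spa_tub_yes'].fillna(0)
# rows with a pool and a hot tub/spa where just_hottub_or_spa is still null
df_train.loc[(df_train.poolcnt==1) & (df_train.has_hottub_or_spa==1) & (df_train.just_hottub_or_spa.isna())]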

<cudf.DataFrame ncols=61 nrows=1204 >

### Fixing contradictions in pool related variables
*   When a pool is present and it has a tub/spa, then just_hottub_or_spa = 0
*   When there is no pool but there is a tub/spa, then just_hottub_or_spa = 1
*   As they are binary variables, convert NaNs to zero
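
The three rules above can be sketched on a toy frame as follows. This uses pandas as a stand-in for cudf, the rows are invented, and the `spa_only` step is one possible reading of the second rule rather than code from the original notebook.

```python
import numpy as np
import pandas as pd  # stand-in for cudf in this sketch

# One toy row per rule; values are invented for illustration.
toy = pd.DataFrame({
    "poolcnt":            [1.0, 0.0, 0.0],
    "has_hottub_or_spa":  [1.0, 1.0, np.nan],
    "just_hottub_or_spa": [np.nan, np.nan, np.nan],
})

# Rule 1: pool plus tub/spa means the tub/spa is not standalone.
pool_and_spa = (toy.poolcnt == 1) & (toy.has_hottub_or_spa == 1) & toy.just_hottub_or_spa.isnull()
toy.loc[pool_and_spa, "just_hottub_or_spa"] = 0

# Rule 2: no pool but a tub/spa means the tub/spa is standalone.
spa_only = (toy.poolcnt == 0) & (toy.has_hottub_or_spa == 1)
toy.loc[spa_only, "just_hottub_or_spa"] = 1

# Rule 3: both flags missing means neither is present.
neither = toy.has_hottub_or_spa.isnull() & toy.just_hottub_or_spa.isnull()
toy.loc[neither, ["has_hottub_or_spa", "just_hottub_or_spa"]] = 0

print(toy)
```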

In [0]:
#when poolcnt=1 & has_hottub_or_spa=1 & just_hottub_or_spa is null then just_hottub_or_spa =0
#when poolcnt=0, has_hottub_or_spa=1, just_hottub_or_spa =1

df_train.loc[ (df_train.poolcnt==1) & (df_train.has_hottub_or_spa==1) & (df_train.just_hottub_or_spa.isnull()),'just_hottub_or_spa']=0
             
#when has_hottub_or_spa and just_hottub_or_spa are both null, set both to zero

df_train.loc[ (df_train.has_hottub_or_spa.isnull()) & (df_train.just_hottub_or_spa.isnull()),['has_hottub_or_spa','just_hottub_or_spa']]=0


*   When there is no pool, set the pool size to zero instead of NaN

In [0]:
df_train.loc[ df_train.poolcnt==0,'pool_sqft']=0
print(df_train.pool_sqft.isnull().sum())

*   basement_flag only takes the values 7 and null, hence convert it to a binary variable with values zero and 1
*   When basement_flag is zero, set basementsqft to zero as well

In [0]:
df_train.loc[df_train.basement_flag.isnull(),'basementsqft']=0
df_train.loc[df_train.basement_flag.isnull(),'basement_flag']=0
df_train.loc[df_train.basement_flag==7,'basement_flag']=1

*   There seems to be an inconsistency between fireplaceflag and fireplace_count; let's fix it
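
A small sketch of the reconciliation on invented rows, with pandas standing in for cudf: when both fields are missing, assume no fireplace, and when the count is positive, force the flag to True. A row where the flag is True but the count is unknown is left as is.

```python
import numpy as np
import pandas as pd  # stand-in for cudf in this sketch

# Toy rows; values are invented for illustration.
toy = pd.DataFrame({
    "fireplace_count": [2.0, np.nan, np.nan],
    "fireplaceflag":   [None, None, True],
})

# Both missing: assume there is no fireplace.
both_missing = toy.fireplace_count.isnull() & toy.fireplaceflag.isnull()
toy.loc[both_missing, "fireplaceflag"] = False

# Flag says no fireplace and count is missing: the count is zero.
toy.loc[toy.fireplace_count.isnull() & (toy.fireplaceflag == False), "fireplace_count"] = 0

# A positive count always implies the flag.
toy.loc[toy.fireplace_count > 0, "fireplaceflag"] = True

print(toy)
```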

In [0]:
#df_train.fireplaceflag.isnull().sum()
#df_train.fireplace_count.isnull().sum()
df_train.loc[(df_train.fireplace_count.isnull()) & (df_train.fireplaceflag.isnull()),'fireplaceflag'] = False
df_train.loc[(df_train.fireplace_count.isnull()) & (df_train.fireplaceflag==False),'fireplace_count'] = 0
df_train.loc[df_train['fireplace_count']>0,'fireplaceflag']= True
print("after",df_train.fireplace_count.isnull().sum())
#print("after",df_train.fireplace_count.value_counts())

*   Dropping the transaction date column, as it doesn't have any correlation with the target variable

In [0]:
df_train=df_train.drop('transactiondate',axis=1)

*   Garage count and garage size have the same number of missing values. Let's assume this is because both variables are NA when a property has no garage
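
The assumption above can be sketched like this, with pandas standing in for cudf and invented rows. The second step treats a garage that holds cars but has zero area as a missing size rather than a real zero.

```python
import numpy as np
import pandas as pd  # stand-in for cudf in this sketch

# Toy rows; values are invented for illustration.
toy = pd.DataFrame({
    "garagecarcnt": [2.0, np.nan, 1.0],
    "garage_sqft":  [400.0, np.nan, 0.0],
})

# Both missing: assume the property has no garage.
both_missing = toy.garage_sqft.isnull() & toy.garagecarcnt.isnull()
toy.loc[both_missing, ["garagecarcnt", "garage_sqft"]] = 0

# Cars but zero area is contradictory, so mark the area as unknown.
contradictory = (toy.garagecarcnt > 0) & (toy.garage_sqft == 0)
toy.loc[contradictory, "garage_sqft"] = np.nan

print(toy)
```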

In [0]:
import numpy as np  # needed for np.nan below; not imported earlier in this notebook

df_train.loc[df_train.garage_sqft.isnull() & df_train.garagecarcnt.isnull(),['garagecarcnt','garage_sqft']]=0
df_train.loc[(df_train.garagecarcnt>0) & (df_train.garage_sqft==0),'garage_sqft']=np.nan
print("after",df_train.garagecarcnt.isnull().sum())
#print("after",df_train.garagecarcnt.value_counts())
print("after",df_train.garage_sqft.isnull().sum())
#print("after",df_train.garage_sqft.value_counts())

*   total_bath & calculatedbathnbr are duplicates, and calculatedbathnbr has more nulls, hence we will drop it
*   if full_bath and half_bath are both null and total_bath is 0, treat total_bath as missing (NaN)
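
A sketch of the bathroom clean-up on invented rows, using pandas as a stand-in for cudf. A total of 0 with no supporting counts is treated as unknown, and when the full-bath count already equals the total there can be no partial baths.

```python
import numpy as np
import pandas as pd  # stand-in for cudf in this sketch

# Toy rows; values are invented for illustration.
toy = pd.DataFrame({
    "total_bath": [0.0, 2.0, 2.5],
    "full_bath":  [np.nan, 2.0, 2.0],
    "half_bath":  [np.nan, np.nan, 1.0],
})

# total_bath of 0 with no full/half counts is really a missing value.
unknown = toy.full_bath.isnull() & toy.half_bath.isnull() & (toy.total_bath == 0)
toy.loc[unknown, "total_bath"] = np.nan

# When full baths account for the whole total, half_bath must be 0.
toy.loc[toy.full_bath == toy.total_bath, "half_bath"] = 0

print(toy)
```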

In [0]:
#total_bath & calculatedbathnbr are duplicates , and calculatedbathnbr has more nulls, hence drop it
df_train=df_train.drop('calculatedbathnbr',axis=1)

# when full_bath and half_bath are null and total_bath is 0, treat total_bath as missing
df_train.loc[(df_train.full_bath.isnull()) & (df_train.half_bath.isnull()) & (df_train.total_bath==0),'total_bath']=np.nan


# when full_bath=total_bath, half_bath=0 

df_train.loc[(df_train.full_bath==df_train.total_bath) ,'half_bath']=0

# sometimes total_bath is present but full_bath and half_bath are null
# sometimes all 3 are null

print(df_train.total_bath.isnull().sum())
print(df_train.half_bath.isnull().sum())
print(df_train.full_bath.isnull().sum())

* Assume that null values in the patio and shed variables mean there is no patio or shed in the yard

In [0]:
#yardbuildingsqft17-patio in yard
#yardbuildingsqft26- storage shed in yard
df_train.loc[df_train.patio_sqft.isnull() ,'patio_sqft']=0
df_train.loc[df_train.storage_sqft.isnull() ,'storage_sqft']=0
print(df_train.patio_sqft.isnull().sum())
print(df_train.storage_sqft.isnull().sum())

### Encoding FIPS codes with their respective county names
* 6037- LA
* 6059- Orange_County
* 6111- Ventura
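
The code-to-name mapping above can also be expressed as a dictionary lookup. This is a sketch in pandas (a stand-in for cudf) on invented rows; the `county` column name and the `FIPS_TO_COUNTY` constant are hypothetical, and writing to a new column avoids mixing numbers and strings in `fips`.

```python
import pandas as pd  # stand-in for cudf in this sketch

# Mapping taken from the list above.
FIPS_TO_COUNTY = {6037: "LA", 6059: "Orange_County", 6111: "Ventura"}

# Toy rows; values are invented for illustration.
toy = pd.DataFrame({"fips": [6037, 6111, 6059, 6037]})

# Look up each code; codes missing from the mapping would become NaN.
toy["county"] = toy["fips"].map(FIPS_TO_COUNTY)

print(toy["county"].value_counts())
```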

In [0]:
df_train.loc[df_train.fips==6037 ,'fips']="LA"
df_train.loc[df_train.fips==6059 ,'fips']="Orange_County"
df_train.loc[df_train.fips==6111 ,'fips']="Ventura"

print(df_train.fips.isnull().sum())
print(df_train.fips.value_counts())

### Scaling down the latitude and longitude
*    KNN imputation takes more time because of the huge numbers; moreover, standardizing gives better results on most algorithms
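
For reference, dividing by a constant and standardizing are two different operations. The sketch below contrasts them on invented latitude-like values, using pandas as a stand-in for cudf; the z-score variant is shown only for comparison and is an assumption about what "standardizing" means here.

```python
import pandas as pd  # stand-in for cudf in this sketch

# Toy values in the same large-integer style; invented for illustration.
toy = pd.DataFrame({"latitude": [34144442.0, 33668120.0, 34281260.0]})

# Scaling down: only the magnitude changes, the spread keeps its shape.
scaled = toy["latitude"].divide(100000)

# Standardizing (z-score): zero mean and unit standard deviation.
standardized = (toy["latitude"] - toy["latitude"].mean()) / toy["latitude"].std()

print(scaled.tolist())
print(standardized.tolist())
```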

In [0]:
df_train['latitude']=df_train['latitude'].divide(100000)
df_train['longitude']=df_train['longitude'].divide(100000)

* deck_flag has only 2 values, 66 or null; convert it into a binary flag
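
When a column holds a single non-null code, as described above, the conversion can be written as a one-liner. This is a sketch in pandas (a stand-in for cudf) on invented rows, and it assumes 66 really is the only non-null value.

```python
import numpy as np
import pandas as pd  # stand-in for cudf in this sketch

# Toy rows; values are invented for illustration.
toy = pd.DataFrame({"deck_flag": [66.0, np.nan, 66.0]})

# Any non-null value becomes 1, null becomes 0.
toy["deck_flag"] = toy["deck_flag"].notnull().astype("int8")

print(toy["deck_flag"].tolist())
```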

In [0]:
df_train.loc[df_train.deck_flag==66 ,'deck_flag']=1
df_train.loc[df_train.deck_flag.isnull() ,'deck_flag']=0

print(df_train.deck_flag.isnull().sum())

### Imputing unit count based on property land type (Mode Imputation)
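
As a rough sketch of group-wise mode imputation, the snippet below fills each missing `unitcnt` with the most common value for its land use type. It uses pandas as a stand-in for cudf, the rows are invented, and it assumes every land use type has at least one non-null `unitcnt`.

```python
import numpy as np
import pandas as pd  # stand-in for cudf in this sketch

# Toy rows; values are invented for illustration.
toy = pd.DataFrame({
    "propertylandusetypeid": [261, 261, 261, 246, 246, 246],
    "unitcnt":               [1.0, 1.0, np.nan, 2.0, np.nan, 2.0],
})

# Most frequent unit count per land use type (mode ignores NaN).
# Assumes each group has at least one non-null value.
mode_by_type = toy.groupby("propertylandusetypeid")["unitcnt"].agg(
    lambda s: s.mode().iloc[0]
)

# Fill only the missing entries with their group's mode.
toy["unitcnt"] = toy["unitcnt"].fillna(toy["propertylandusetypeid"].map(mode_by_type))

print(toy)
```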

In [0]:
#numberofstories & unitcnt &roomcnt

df_train.loc[df_train.roomcnt==0 ,'roomcnt']=np.nan

print(df_train.numberofstories.isnull().sum())
print(df_train.roomcnt.isnull().sum())
print(df_train.unitcnt.isnull().sum())

# propertylandusetypeid and unitcnt are related

#246 -Duplex (2 Units, Any Combination)
#247 -Triplex (3 Units, Any Combination)
#248 -Quadruplex (4 Units, Any Combination)
#260 -Residential General
#261 -Single Family Residential
#263 -Mobile Home
#264 -Townhouse
#266 -Condominium
#267 -Cooperative
#269 -Planned Unit Development
#275 -Residential Common Area 
#31 - Commercial/Office/Residential Mixed Use
#47 -Store/Office (Mixed Use)
#265 -Cluster Home

# unit count implied by each land use type (265, Cluster Home, is left unchanged)
unitcnt_by_landuse = {31: 2, 47: 2, 246: 2, 247: 3, 248: 4,
                      260: 1, 261: 1, 263: 1, 264: 1, 266: 1, 267: 1, 269: 1, 275: 1}
for landuse_id, units in unitcnt_by_landuse.items():
  df_train.loc[(df_train.propertylandusetypeid==landuse_id) & (df_train.unitcnt.isnull()),'unitcnt']=units

#typeconstructiontypeid (based on location and year of building)
print(df_train.typeconstructiontypeid.isnull().sum())
print(df_train.propertylandusetypeid.isnull().sum())

* "preimeter_living_area_sqft" and "total_finished_living_area_sqft" have the same values except that "preimeter_living_area_sqft" has more duplicates
* "total_area_sqft" and "total_finished_living_area_sqft" have the same values except that "total_area_sqft" has more duplicates
* "total_finished_living_area_sqft" and "finished_living_area_sqft" have the same values except that "finished_living_area_sqft" has more duplicates
* "base_unfinished_and_finished_area_sqft" and "total_finished_living_area_sqft" have the same values except that "base_unfinished_and_finished_area_sqft" has more duplicates
    * let's drop them all
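
One way such redundancy could be checked before dropping a column is sketched below. It uses pandas as a stand-in for cudf with invented rows, and it assumes "redundant" means the two columns agree wherever both are present while the candidate column has at least as many nulls.

```python
import numpy as np
import pandas as pd  # stand-in for cudf in this sketch

# Toy rows; values are invented for illustration.
toy = pd.DataFrame({
    "total_finished_living_area_sqft": [1200.0, 1500.0, 900.0, 2000.0],
    "finished_living_area_sqft":       [1200.0, np.nan, 900.0, np.nan],
})

# Compare only the rows where both columns have a value.
both = toy.dropna()
agree = bool((both["total_finished_living_area_sqft"]
              == both["finished_living_area_sqft"]).all())

# How many more nulls the candidate column has than the one we keep.
extra_nulls = int(toy["finished_living_area_sqft"].isnull().sum()
                  - toy["total_finished_living_area_sqft"].isnull().sum())

# Drop the candidate only if it adds no information.
if agree and extra_nulls >= 0:
    toy = toy.drop("finished_living_area_sqft", axis=1)

print(list(toy.columns))
```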

In [0]:
df_train=df_train.drop('preimeter_living_area_sqft', axis=1)
df_train=df_train.drop('total_area_sqft', axis=1)
df_train=df_train.drop('finished_living_area_sqft', axis=1)
df_train=df_train.drop('base_unfinished_and_finished_area_sqft', axis=1)

#"calculatedfinishedsquarefeet": "total_finished_living_area_sqft",
#"finishedsquarefeet12": "finished_living_area_sqft",
#"lotsizesquarefeet": "lot_area_sqft",
#"finishedsquarefeet50": "finished_living_area_entryfloor_sqft1",
#"finishedfloor1squarefeet": "finished_living_area_entryfloor_sqft2",
#"finishedsquarefeet6": "base_unfinished_and_finished_area_sqft",
#"finishedsquarefeet15": "total_area_sqft",
#"finishedsquarefeet13": "preimeter_living_area_sqft"