<a href="https://colab.research.google.com/github/rajuiit/TuriCreatewithSFramesInstall-in-colab/blob/master/turicreate_sframes_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Turi Create SFrames

https://github.com/apple/turicreate/blob/master/userguide/sframe/sframe-intro.md

SFrames are the primary data structure for extracting data from other sources for use in Turi Create.

They are similar to pandas DataFrames but do not need to be loaded into RAM as a whole, so they are not constrained by the memory of the machine running the code. This makes the SFrame a scalable data structure. Each column is immutable, and the SFrame supports out-of-core processing.
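
To make "out-of-core" concrete, here is a minimal sketch using only the Python standard library. It is not how SFrame is implemented; it only illustrates processing a table one row at a time so that the whole data set never has to sit in memory. The function name `streaming_mean` and the sample rows are assumptions made for this illustration.

```python
import csv
import io

def streaming_mean(lines, column):
    """Average one numeric column while holding only a single row in memory."""
    reader = csv.DictReader(lines)
    total = 0.0
    count = 0
    for row in reader:          # rows are read lazily, one at a time
        total += float(row[column])
        count += 1
    return total / count if count else float("nan")

# Hypothetical in-memory stand-in for a large CSV file on disk.
sample = io.StringIO("song_id,year\nA,2003\nB,1995\nC,2006\n")
mean_year = streaming_mean(sample, "year")
print(mean_year)
```

Passing an open file handle instead of the `io.StringIO` object would apply the same loop to a file of any size.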

SFrames can extract data from the following sources:

*   CSV
*   JSON
*   SQL databases

## Turi Create and GPU Setup

In [0]:
!apt install libnvrtc8.0
!pip uninstall -y mxnet-cu80 && pip install mxnet-cu80==1.1.0
!pip install turicreate

## Google Drive Access

You will be asked to click a link that generates a secret key for accessing your Google Drive.

Copy the secret key and paste it into the space provided in the notebook.

In [0]:
import os.path
from google.colab import drive

# mount Google Drive to /content/drive/My Drive/
if os.path.isdir("/content/drive/My Drive"):
  print("Google Drive already mounted")
else:
  drive.mount('/content/drive')

## Fetch Data

In [0]:
import os.path
import urllib.request
import tarfile
import zipfile
import gzip
from shutil import copy

def fetch_remote_datafile(filename, remote_url):
  if os.path.isfile("./" + filename):
    print("already have " + filename + " in workspace")
    return
  print("fetching " + filename + " from " + remote_url + "...")
  urllib.request.urlretrieve(remote_url, "./" + filename)

def cache_datafile_in_drive(filename):
  if not os.path.isfile("./" + filename):
    print("cannot cache " + filename + ", it is not in workspace")
    return
  
  data_drive_path = "/content/drive/My Drive/Colab Notebooks/data/"
  # create the cache folder on first use so the copy below does not fail on a missing directory
  os.makedirs(data_drive_path, exist_ok=True)
  if os.path.isfile(data_drive_path + filename):
    print(filename + " has already been stored in Google Drive")
  else:
    print("copying " + filename + " to " + data_drive_path)
    copy("./" + filename, data_drive_path)
  

def load_datafile_from_drive(filename, remote_url=None):
  data_drive_path = "/content/drive/My Drive/Colab Notebooks/data/"
  if os.path.isfile("./" + filename):
    print("already have " + filename + " in workspace")
  elif os.path.isfile(data_drive_path + filename):
    print("have " + filename + " in Google Drive, copying to workspace...")
    copy(data_drive_path + filename, ".")
  elif remote_url is not None:
    fetch_remote_datafile(filename, remote_url)
  else:
    print("error: you need to manually download " + filename + " and put in drive")
    
def extract_datafile(filename, expected_extract_artifact=None):
  if expected_extract_artifact is not None and (os.path.isfile(expected_extract_artifact) or os.path.isdir(expected_extract_artifact)):
    print("files in " + filename + " have already been extracted")
  elif not os.path.isfile("./" + filename):
    print("error: cannot extract " + filename + ", it is not in the workspace")
  else:
    extension = filename.split('.')[-1]
    if extension == "zip":
      print("extracting " + filename + "...")
      # the context manager closes the archive even if extraction fails
      with zipfile.ZipFile(filename) as z:
        for name in z.namelist():
          print("    extracting file", name)
          z.extract(name, "./")
    elif extension == "gz":
      print("extracting " + filename + "...")
      if filename.split('.')[-2] == "tar":
        with tarfile.open(filename) as tar:
          tar.extractall()
      else:
        # plain .gz file: decompress to the same name minus the .gz suffix
        with gzip.open(filename, 'rb') as data_zip_file:
          data = data_zip_file.read()
        with open('.'.join(filename.split('.')[0:-1]), 'wb') as extracted_file:
          extracted_file.write(data)
    elif extension == "tar":
      print("extracting " + filename + "...")
      with tarfile.open(filename) as tar:
        tar.extractall()
    elif extension == "csv":
      print("do not need to extract csv")
    else:
      print("cannot extract " + filename)
      
def load_cache_extract_datafile(filename, expected_extract_artifact=None, remote_url=None):
  load_datafile_from_drive(filename, remote_url)
  extract_datafile(filename, expected_extract_artifact)
  cache_datafile_in_drive(filename)
  

In [0]:
load_cache_extract_datafile("song_data.csv.zip", "song_data.csv", "https://static.turi.com/datasets/millionsong/song_data.csv")

already have song_data.csv.zip in workspace
files in song_data.csv.zip have already been extracted
song_data.csv.zip has already been stored in Google Drive


In [0]:
load_cache_extract_datafile("10000.txt.zip", "10000.txt", "https://static.turi.com/datasets/millionsong/10000.txt")

already have 10000.txt.zip in workspace
files in 10000.txt.zip have already been extracted
10000.txt.zip has already been stored in Google Drive


In [0]:
load_cache_extract_datafile("loc-gowalla_totalCheckins.txt.gz", "loc-gowalla_totalCheckins.txt", "https://snap.stanford.edu/data/loc-gowalla_totalCheckins.txt.gz")

already have loc-gowalla_totalCheckins.txt.gz in workspace
files in loc-gowalla_totalCheckins.txt.gz have already been extracted
loc-gowalla_totalCheckins.txt.gz has already been stored in Google Drive
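
The extraction helper above chooses what to do purely from the file name. The sketch below mirrors those extension checks without touching any files, to show how each of the three data files is classified. The function name `describe_extraction` and its return strings are assumptions made for this illustration.

```python
def describe_extraction(filename):
    """Mirror the extension checks used by the extraction helper above."""
    parts = filename.split('.')
    extension = parts[-1]
    if extension == "zip":
        return "zip archive"
    if extension == "gz":
        # "name.tar.gz" is a compressed tarball; any other ".gz" is a single gzipped file
        if parts[-2] == "tar":
            return "gzipped tarball"
        return "gzip file -> " + '.'.join(parts[:-1])
    if extension == "tar":
        return "tarball"
    if extension == "csv":
        return "plain csv, nothing to extract"
    return "unsupported"

for name in ["song_data.csv.zip", "10000.txt.zip", "loc-gowalla_totalCheckins.txt.gz"]:
    print(name, "->", describe_extraction(name))
```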


## Setup Turi Create

In [0]:
import mxnet as mx
import turicreate as tc

In [0]:
# Use all GPUs (default)
tc.config.set_num_gpus(-1)

# Use only 1 GPU
#tc.config.set_num_gpus(1)

# Use CPU
#tc.config.set_num_gpus(0)

## Sample Data

The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.

https://labrosa.ee.columbia.edu/millionsong/

The first table contains metadata about each song in the database. Here's how we load it into an SFrame:

In [0]:
songs = tc.SFrame.read_csv("./song_data.csv")

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [0]:
songs.head()

song_id,title,release,artist_name,year
SOQMMHC12AB0180CB8,Silent Night,Monster Ballads X-Mas,Faster Pussy cat,2003
SOVFVAK12A8C1350D9,Tanssi vaan,Karkuteillä,Karkkiautomaatti,1995
SOGTUKN12AB017F4F1,No One Could Ever,Butter,Hudson Mohawke,2006
SOBNYVR12A8C13558C,Si Vos Querés,De Culo,Yerba Brava,2003
SOHSBXH12A8C13B0DF,Tangle Of Aspens,Rene Ablaze Presents Winter Sessions ...,Der Mystic,0
SOZVAPQ12A8C13B63C,"Symphony No. 1 G minor ""Sinfonie ...",Berwald: Symphonies Nos. 1/2/3/4 ...,David Montgomery,0
SOQVRHI12A6D4FB2D7,We Have Got Love,Strictly The Best Vol. 34,Sasha / Turbulence,0
SOEYRFT12AB018936C,2 Da Beat Ch'yall,Da Bomb,Kris Kross,1993
SOPMIYT12A6D4F851E,Goodbye,Danny Boy,Joseph Locke,0
SOJCFMH12A8C13B0C2,Mama_ mama can't you see ? ...,March to cadence with the US marines ...,The Sun Harbor's Chorus- Documentary Recordings ...,0


No options are needed for the simplest case, because the SFrame parser infers the column types. There are, however, many options you may need to specify when importing a CSV file. Some of the more common ones come into play when we load the usage data of users listening to these songs online:

In [0]:
usage_data = tc.SFrame.read_csv("./10000.txt",
                                header=False,
                                delimiter='\t',
                                column_type_hints={'X3':int})

In [0]:
usage_data.head()

X1,X2,X3
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOAKIMP12A8C130995,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBBMDR12A8C13253B,2
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBXHDL12A81C204C0,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBYHAJ12A6701BF1D,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODACBL12A8C13C273,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODDNQT12A6D4F5F7E,5
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODXRTY12AB0180F3B,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOFGUAY12AB017B0A8,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOFRQTD12A81C233C0,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOHQWYZ12A6D4FA701,1


The header and delimiter options are needed because this particular CSV file does not provide column names in its first line, and its values are separated by tabs, not commas. The column_type_hints option tells the SFrame CSV parser to read column X3 as an integer instead of inferring its type, which it does by default. For a full list of options when parsing CSV files, see the Turi Create [API Reference](https://apple.github.io/turicreate/docs/api/generated/turicreate.SFrame.read_csv.html#turicreate.SFrame.read_csv).
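
As a rough analogue of what those three options do, the sketch below parses two rows in the same layout with Python's standard `csv` module. It is not the SFrame parser. The user ids are made-up placeholders, and the `X1`, `X2`, `X3` names are generated by hand to match the default names shown in the output above.

```python
import csv
import io

# Two rows in the same layout as 10000.txt: tab-separated, no header line.
# The user ids are placeholders, not real values from the data set.
raw = "user_a\tSOAKIMP12A8C130995\t1\nuser_a\tSOBBMDR12A8C13253B\t2\n"

rows = []
for fields in csv.reader(io.StringIO(raw), delimiter='\t'):   # like delimiter='\t'
    # With no header line, name the columns X1, X2, ... by position.
    row = {"X%d" % (i + 1): value for i, value in enumerate(fields)}
    row["X3"] = int(row["X3"])                                # like column_type_hints={'X3': int}
    rows.append(row)

print(rows[0])
```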

Here we might want to rename columns from the default names:

In [0]:
# rename returns a new SFrame by default, so assign the result back
usage_data = usage_data.rename({'X1':'user_id', 'X2':'song_id', 'X3':'listen_count'})
usage_data.head()

user_id,song_id,listen_count
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOAKIMP12A8C130995,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBBMDR12A8C13253B,2
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBXHDL12A81C204C0,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBYHAJ12A6701BF1D,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODACBL12A8C13C273,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODDNQT12A6D4F5F7E,5
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODXRTY12AB0180F3B,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOFGUAY12AB017B0A8,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOFRQTD12A81C233C0,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOHQWYZ12A6D4FA701,1


SFrames can be saved as a CSV file or in the SFrame binary format. An SFrame saved in binary format loads very quickly, because the original file never has to be parsed again. Binary is the default format, and we supply the name of a directory to be created, which will hold the binary files:

In [0]:
usage_data.save('./music_usage_data.sframe')

Loading is then very fast:

In [0]:
same_usage_data = tc.load_sframe('./music_usage_data.sframe')

## Data Types

An SFrame is made up of columns, each holding values of a single type. The following data types are supported:

*   int (signed 64-bit integer)
*   float (double-precision floating point)
*   str (string)
*   array.array (1-D array of doubles)
*   list (arbitrary list of elements)
*   dict (arbitrary dictionary of elements)
*   datetime.datetime (datetime with microsecond precision)
*   image (image)
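
For orientation, the sketch below builds one plain Python value for each of the types listed above. It uses only the standard library, so the image type, which needs Turi Create's own image class, is left out. The sample values are assumptions chosen for illustration.

```python
import array
import datetime

# One example value per column type listed above (image omitted).
examples = {
    "int": 2003,
    "float": 0.0785,
    "str": "Silent Night",
    "array.array": array.array('d', [1.0, 2.5, 4.0]),   # 'd' = C double
    "list": [1, "two", 3.0],
    "dict": {"artist": "Kris Kross", "year": 1993},
    "datetime.datetime": datetime.datetime(2010, 10, 19, 23, 55, 27),
}

for name, value in examples.items():
    print(name, "->", type(value).__name__, value)
```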

## Memory Intensive Example

https://blog.usejournal.com/python-for-big-data-computation-on-a-single-computer-c232046df3c3

The data for our experiment comes from the now-defunct Gowalla social networking site. Two data sets from this site are available from the SNAP page linked below. We will look at the larger one, which contains the event log of “check-ins” by Gowalla’s users at a set of locations. This data set contains 6.44 million records, each describing a single check-in in just a few columns, of which we will keep only three: user_id, location_id and checkin_ts (the second-resolution timestamp of the check-in event).

https://snap.stanford.edu/data/loc-gowalla.html

### The problem and its (theoretical) solution

We will use Turi Create to attack what could be termed the “stalker-stalkee detection problem” on this data set. In this problem, we are asked to identify pairs of users (E, R) that maximize the ‘stalking measure between E and R’. The stalking measure between E and R is defined as the number of distinct locations where there was ever a check-in by user E (the stalkEE) followed by a check-in by user R (the stalkER).

The first step is to index the check-ins by location_id (remember that in pandas a single key value can refer to more than one row). This makes the following computation easier.

Then comes the tricky part: for each location we want to consider all pairs of check-ins where the timestamp of the first user's check-in strictly precedes that of the second user's. So we generate chin_ps, an SFrame containing all pairs of check-ins for the same location, and then filter it to enforce the conditions just described, producing pairs_filtered.
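
The grouping, pairing and filtering steps described above can be sketched on a handful of made-up check-ins in plain Python. This is a toy illustration of the stalking measure as defined in the text, not the Turi Create solution. All user names, locations and timestamps are invented for the example.

```python
from collections import defaultdict
from itertools import permutations

# Toy check-ins: (user_id, location_id, timestamp). All values are made up.
toy_checkins = [
    ("alice", "cafe",   1),
    ("bob",   "cafe",   2),
    ("alice", "museum", 5),
    ("bob",   "museum", 7),
    ("bob",   "park",   3),
    ("alice", "park",   4),
    ("carol", "cafe",   9),
]

# Step 1: group check-ins by location.
by_location = defaultdict(list)
for user, location, ts in toy_checkins:
    by_location[location].append((user, ts))

# Step 2: per location, keep ordered pairs of different users
# where the first check-in strictly precedes the second.
followed = set()   # distinct (stalkee, stalker, location) triples
for location, visits in by_location.items():
    for (ee, ts_ee), (er, ts_er) in permutations(visits, 2):
        if ee != er and ts_ee < ts_er:
            followed.add((ee, er, location))

# Step 3: count distinct locations per (stalkee, stalker) pair.
measure = defaultdict(int)
for ee, er, _ in followed:
    measure[(ee, er)] += 1

# Highest stalking measure first; ties broken alphabetically.
ranked = sorted(measure.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)
```

Here bob follows alice at two distinct locations (the cafe and the museum), so the pair (alice, bob) has the highest measure. The notebook's own filter further down is stricter than this sketch: it also requires the second check-in to fall within one day of the first.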

However, running a naïve pandas solution on a laptop or PC with a typical amount of RAM (say 16 GB) results in a MemoryError exception. With Turi Create and SFrames we do not have this problem.

In [0]:
checkins = ( tc.SFrame.read_csv( 'loc-gowalla_totalCheckins.txt',                  
                                 delimiter='\t', header=False )
                .rename( {'X1': 'user_id', 'X2' : 'checkin_ts',
                          'X3': 'lat', 'X4' : 'lon',
                          'X5': 'location_id'} )
  [["user_id", "location_id", "checkin_ts"]] )

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str,float,float,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [0]:
checkins.head()

user_id,location_id,checkin_ts
0,22847,2010-10-19T23:55:27Z
0,420315,2010-10-18T22:17:43Z
0,316637,2010-10-17T23:42:03Z
0,16516,2010-10-17T19:26:05Z
0,5535878,2010-10-16T18:50:42Z
0,15372,2010-10-12T23:58:03Z
0,21714,2010-10-12T22:02:11Z
0,420315,2010-10-12T19:44:40Z
0,153505,2010-10-12T15:57:20Z
0,420315,2010-10-12T15:19:03Z


Next, generate the pairs of check-ins that satisfy the conditions of our detection algorithms.

In [0]:
import datetime
import dateutil.parser

In [0]:
chin_ps = ( checkins.join(checkins, on='location_id').rename( {'checkin_ts': 'checkin_ts_ee', 'checkin_ts.1': 'checkin_ts_er', 'user_id': 'stalkee' , 'user_id.1': 'stalker' } ) )

In [0]:
chin_ps['time_diff'] = (chin_ps['checkin_ts_er'].apply(dateutil.parser.parse) - chin_ps['checkin_ts_ee'].apply(dateutil.parser.parse)) / 86400

In [0]:
# pairs_filtered = chin_ps[ (chin_ps['checkin_ts_ee'] < chin_ps['checkin_ts_er']) & (chin_ps['stalkee'] != chin_ps['stalker']) ]
pairs_filtered = chin_ps[ (chin_ps['time_diff'] > 0.0) & (chin_ps['time_diff'] < 1.0) & (chin_ps['stalkee'] != chin_ps['stalker']) ]
pairs_filtered.head()

stalkee,location_id,checkin_ts_ee,stalker,checkin_ts_er,time_diff
7,420315,2010-10-18T20:24:42Z,0,2010-10-18T22:17:43Z,0.0784837962963
7,420315,2010-10-18T15:08:58Z,0,2010-10-18T22:17:43Z,0.297743055556
31,420315,2010-10-18T14:00:53Z,0,2010-10-18T22:17:43Z,0.345023148148
66,420315,2010-10-18T18:59:11Z,0,2010-10-18T22:17:43Z,0.13787037037
327,420315,2010-10-18T21:21:12Z,0,2010-10-18T22:17:43Z,0.0392476851852
327,420315,2010-10-18T14:05:59Z,0,2010-10-18T22:17:43Z,0.341481481481
342,420315,2010-10-18T14:10:40Z,0,2010-10-18T22:17:43Z,0.338229166667
350,420315,2010-10-18T19:28:34Z,0,2010-10-18T22:17:43Z,0.117465277778
456,420315,2010-10-18T16:00:08Z,0,2010-10-18T22:17:43Z,0.262210648148
515,420315,2010-10-18T11:42:06Z,0,2010-10-18T22:17:43Z,0.441400462963
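
The time_diff column is the gap between the two check-ins, expressed in days. The first row of the output above can be checked by hand with the standard library. This sketch uses `datetime.strptime` rather than `dateutil` so that it has no third-party dependency, and the helper name `days_between` is an assumption made for the illustration.

```python
from datetime import datetime

def days_between(earlier, later, fmt="%Y-%m-%dT%H:%M:%SZ"):
    """Difference between two timestamp strings, in days."""
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    return delta.total_seconds() / 86400   # 86400 seconds per day

# Timestamps taken from the first row of the output above.
diff = days_between("2010-10-18T20:24:42Z", "2010-10-18T22:17:43Z")
print(diff)
```

The two check-ins are 1 hour, 53 minutes and 1 second apart, which is 6781 seconds, or about 0.0785 days. This agrees with the value shown in the first row.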


In [0]:
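import turicreate.aggregate as agg

final_result = ( pairs_filtered[['stalkee', 'stalker', 'location_id']]
                    .unique()
                    .groupby( ['stalkee', 'stalker'], {"location_count": agg.COUNT() })
                    .topk( 'location_count', k=5 ) )
# force evaluation of the lazy pipeline; called separately so final_result keeps the SFrame
final_result.materialize()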

In [0]:
print( final_result )

## Articles, Repositories, etc

*   https://medium.com/@nilotic2/a-guide-to-turi-create-a72f53f26721
*   https://blog.usejournal.com/python-for-big-data-computation-on-a-single-computer-c232046df3c3
*   https://github.com/onmyway133/Avengers