# Learning Dask With Python Distributed Computing 

## Authors

* **Soumil Nitin Shah** 


## Soumil Nitin Shah 
Bachelor's in Electronic Engineering |
Master's in Electrical Engineering |
Master's in Computer Engineering

* Website: https://soumilshah.herokuapp.com
* GitHub: https://github.com/soumilshah1995
* LinkedIn: https://www.linkedin.com/in/shah-soumil/
* Blog: https://soumilshah1995.blogspot.com/
* YouTube: https://www.youtube.com/channel/UC_eOodxvwS_H7x2uLQa-svw?view_as=subscriber
* Facebook Page: https://www.facebook.com/soumilshah1995/
* Email: shahsoumil519@gmail.com
* Projects: https://soumilshah.herokuapp.com/project


Hello! I’m Soumil Nitin Shah, a software and hardware developer based in New York City. I have completed my Bachelor's in Electronic Engineering and a double Master’s in Computer and Electrical Engineering. I develop Python-based cross-platform desktop applications, web pages, software, REST APIs, databases, and much more. I have more than 2 years of experience in Python.

###### Step 1: Installation of Dask 
* !pip install dask
* !pip install cloudpickle 
* !pip install "dask[dataframe]"
* !pip install "dask[complete]"

In [1]:
!pip show dask

Name: dask
Version: 2020.12.0
Summary: Parallel PyData with Task Scheduling
Home-page: https://github.com/dask/dask/
Author: None
Author-email: None
License: BSD
Location: c:\python38\lib\site-packages
Requires: pyyaml
Required-by: swifter, distributed


In [1]:
# Define the Imports 
try:
    import os
    import json
    import math
    import dask
    from dask.distributed import Client
    import dask.dataframe as df
    import dask.multiprocessing
except Exception as e:
    print("Some Modules are Missing : {} ".format(e))

In [2]:
size = os.path.getsize("netflix_titles.csv") / math.pow(1024,3)
print("Size in GB : {} ".format(size))

Size in GB : 0.0022451020777225494 


In [3]:
client = Client(n_workers=3, threads_per_worker=1, processes=False, memory_limit='2GB')

In [4]:
client

Client: Scheduler: inproc://192.168.1.7/6556/1, Dashboard: http://192.168.1.7:8787/status
Cluster: Workers: 3, Cores: 3, Memory: 6.00 GB


In [5]:
# Note: this rebinds the name df from the dask.dataframe module to the loaded DataFrame
df = df.read_csv("netflix_titles.csv")

In [6]:
df.head(2)

|   | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 81145628 | Movie | Norm of the North: King Sized Adventure | Richard Finn, Tim Maltby | Alan Marriott, Andrew Toth, Brian Dobson, Cole... | United States, India, South Korea, China | September 9, 2019 | 2019 | TV-PG | 90 min | Children & Family Movies, Comedies | Before planning an awesome wedding for his gra... |
| 1 | 80117401 | Movie | Jandino: Whatever it Takes |  | Jandino Asporaat | United Kingdom | September 9, 2016 | 2016 | TV-MA | 94 min | Stand-Up Comedy | Jandino Asporaat riffs on the challenges of ra... |


In [7]:
df.tail(2)

|   | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6232 | 70281022 | TV Show | A Young Doctor's Notebook and Other Stories |  | Daniel Radcliffe, Jon Hamm, Adam Godley, Chris... | United Kingdom |  | 2013 | TV-MA | 2 Seasons | British TV Shows, TV Comedies, TV Dramas | Set during the Russian Revolution, this comic ... |
| 6233 | 70153404 | TV Show | Friends |  | Jennifer Aniston, Courteney Cox, Lisa Kudrow, ... | United States |  | 2003 | TV-14 | 10 Seasons | Classic & Cult TV, TV Comedies | This hit sitcom follows the merry misadventure... |


In [8]:
df.shape

(Delayed('int-f1852546-9cb5-452f-ac18-1fbc43c9e8d2'), 12)

#### Selecting a column or Columns

In [9]:
df.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

In [12]:
df["show_id"].head(3)

0    81145628
1    80117401
2    70234439
Name: show_id, dtype: int64

In [10]:
df[["show_id", "title"]].head(3)

|   | show_id | title |
|---|---|---|
| 0 | 81145628 | Norm of the North: King Sized Adventure |
| 1 | 80117401 | Jandino: Whatever it Takes |
| 2 | 70234439 | Transformers Prime |


#### Apply Function

In [11]:
def toupper(x):
    return x.upper()

In [13]:
df["title"].map(toupper).compute()

0           NORM OF THE NORTH: KING SIZED ADVENTURE
1                        JANDINO: WHATEVER IT TAKES
2                                TRANSFORMERS PRIME
3                  TRANSFORMERS: ROBOTS IN DISGUISE
4                                      #REALITYHIGH
                           ...                     
6229                                   RED VS. BLUE
6230                                          MARON
6231         LITTLE BABY BUM: NURSERY RHYME FRIENDS
6232    A YOUNG DOCTOR'S NOTEBOOK AND OTHER STORIES
6233                                        FRIENDS
Name: title, Length: 6234, dtype: object

#### Apply Function on Distributed Cluster

In [16]:
A = client.map(toupper , df["title"])

In [17]:
Titles = [ result.result()  for result in A]

In [18]:
Titles[0]

'NORM OF THE NORTH: KING SIZED ADVENTURE'
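
`client.map` submits one task per input element and returns a list of futures; `client.gather` collects all results in one call instead of calling `.result()` on each future. A minimal sketch, assuming a plain Python list of hypothetical titles and a small in-process cluster (the cluster settings here are illustrative, not the ones used above):

```python
from dask.distributed import Client

# Hypothetical sample inputs; any iterable of values works.
titles = ["Friends", "Maron", "Red vs. Blue"]

def toupper(x):
    return x.upper()

# processes=False keeps the workers as threads inside this process;
# dashboard_address=None skips starting the diagnostics dashboard.
with Client(n_workers=2, threads_per_worker=1,
            processes=False, dashboard_address=None) as client:
    futures = client.map(toupper, titles)   # one future per title
    upper_titles = client.gather(futures)   # block until all are done

print(upper_titles)
```

Because each element becomes its own task, this pattern suits a modest number of expensive calls; for column-wide transformations the DataFrame methods are likely the better fit.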

In [27]:
# Equivalent to the map call above, using attribute access (df.title) instead of df["title"]
df.title.map(toupper).compute()

0           NORM OF THE NORTH: KING SIZED ADVENTURE
1                        JANDINO: WHATEVER IT TAKES
2                                TRANSFORMERS PRIME
3                  TRANSFORMERS: ROBOTS IN DISGUISE
4                                      #REALITYHIGH
                           ...                     
6229                                   RED VS. BLUE
6230                                          MARON
6231         LITTLE BABY BUM: NURSERY RHYME FRIENDS
6232    A YOUNG DOCTOR'S NOTEBOOK AND OTHER STORIES
6233                                        FRIENDS
Name: title, Length: 6234, dtype: object

#### Basics of Dask

In [29]:
from time import sleep

def inc(x):
    sleep(1)
    return x + 1

def add(x, y):
    sleep(1)
    return x + y

In [30]:

%%time
# This takes three seconds to run because we call each
# function sequentially, one after the other

x = inc(1)
y = inc(2)
z = add(x, y)

Wall time: 3 s


###### Parallelize with the dask.delayed decorator

In [32]:
# This runs immediately, all it does is build a graph
from dask import delayed

x = delayed(inc)(1)
y = delayed(inc)(2)
z = delayed(add)(x, y)

In [33]:
%%time
# This actually runs the computation; the two inc calls execute in parallel
# (on the Client's workers when a Client is active, otherwise on a local thread pool)

z.compute()

Wall time: 2.01 s
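
The run takes about two seconds rather than three because the two `inc` calls do not depend on each other and can run at the same time, while `add` has to wait for both. When several lazy values are needed, `dask.compute` evaluates them in one pass over the shared graph. A minimal sketch, assuming versions of `inc` and `add` without the `sleep` calls and an explicitly chosen threaded scheduler:

```python
import dask
from dask import delayed

def inc(x):
    return x + 1

def add(x, y):
    return x + y

# Building the graph is instant; nothing has run yet.
x = delayed(inc)(1)
y = delayed(inc)(2)
z = delayed(add)(x, y)

# One call evaluates all three; x and y are computed once and reused for z.
x_val, y_val, z_val = dask.compute(x, y, z, scheduler="threads")

print(x_val, y_val, z_val)
```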


##### Exercise: Parallelize a for loop


In [35]:
data = [1, 2, 3, 4, 5, 6, 7, 8]


In [44]:

%%time

# Sequential code
results = []
for x in data:
    y = inc(x)
    results.append(y)
    
total = sum(results)

Wall time: 8 s


In [45]:
%%time

results = []

for x in data:
    y = delayed(inc)(x)
    results.append(y)
    
total = delayed(sum)(results)
print("Before computing:", total)  # Let's see what type of thing total is
result = total.compute()
print("After computing :", result)  # After it's computed

Before computing: Delayed('sum-fa5bfbcb-3a3e-42da-8a31-b44f96537b03')
After computing : 44
Wall time: 3.02 s
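
The drop from 8 s to roughly 3 s is consistent with a simple back-of-the-envelope model: eight one-second tasks spread over three single-threaded workers need three rounds. The sketch below encodes that estimate; it assumes the computation ran on the three workers created earlier, that each worker runs one task at a time, and that the final `sum` is negligible. The helper name `estimated_wall_time` is hypothetical.

```python
import math

def estimated_wall_time(n_tasks, n_workers, seconds_per_task=1.0):
    """Rough lower bound: tasks run in rounds of n_workers at a time."""
    rounds = math.ceil(n_tasks / n_workers)
    return rounds * seconds_per_task

sequential = estimated_wall_time(8, 1)  # one task after another
parallel = estimated_wall_time(8, 3)    # three workers, three rounds

print(sequential, parallel)
```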


##### Lazy execution


In [50]:


from time import sleep

@delayed
def inc(x):
    sleep(1)
    return x + 1

@delayed
def add(x, y):
    sleep(1)
    return x + y

In [51]:
%%time

# this looks like ordinary code

x = inc(15)
y = inc(30)

total = add(x, y)
total.compute()




Wall time: 2.01 s
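
With the `@delayed` decorator, calling a function no longer runs it; the call returns a `Delayed` object that records what should happen. A small sketch that makes the laziness visible, assuming versions of the functions without `sleep` and an explicitly chosen threaded scheduler:

```python
from dask import delayed
from dask.delayed import Delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(x, y):
    return x + y

# These calls only build a task graph.
total = add(inc(15), inc(30))
is_lazy = isinstance(total, Delayed)

# The work happens here: (15 + 1) + (30 + 1).
result = total.compute(scheduler="threads")

print(is_lazy, result)
```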


In [14]:
!git clone https://github.com/soumilshah1995/Stackoverflow-issue-.git
    

Cloning into 'Stackoverflow-issue-'...
