# Dask dataframe

![](https://avatars3.githubusercontent.com/u/17131925?s=400&v=4)

Blog post on Dask DataFrame: [Visit here](https://inblog.in/Data-manipulation-with-Dask-dataframe-kt6Z5irDVg)

**A Dask DataFrame is a large parallel DataFrame composed of many smaller pandas DataFrames, split along the index. These pandas DataFrames may live on disk for larger-than-memory computing on a single machine, or on many different machines in a cluster. One Dask DataFrame operation triggers many operations on the constituent pandas DataFrames.**

In [None]:
import warnings
warnings.filterwarnings(action="ignore")

In [None]:
import dask.dataframe as dd
import dask.array as da
import pandas as pd
import numpy as np

dask_df = dd.read_csv('/kaggle/input/us-accidents/US_Accidents_June20.csv')

In [None]:
# Checking dataframe partitions
dask_df.map_partitions(type).compute()

In [None]:
# Checking length of each partition
dask_df.map_partitions(len).compute()

In [None]:
# Accessing one of the partitions
dask_df.partitions[1].compute()

# Getting top 5 rows of the dataframe

In [None]:
dask_df.head()

# Getting last 5 rows

In [None]:
dask_df.tail()

# Checking dtypes 

In [None]:
dask_df.dtypes

# Getting summary statistics (count, mean, std, and the five-number summary)

In [None]:
dask_df.describe().compute()

# Setting the ID column as the index

In [None]:
dask_df = dask_df.set_index('ID')
dask_df

# Appending new columns

**Severity shows the impact on traffic duration. So let's append a new column that is True when the traffic delay is very long (Severity equal to 4).**

In [None]:
dask_df['long_delay'] = dask_df['Severity']==4
dask_df.head()

# Changing datatypes of columns

In [None]:
dask_df['long_delay'].dtype                                  # output: bool
dask_df['long_delay'] = dask_df['long_delay'].astype('int')
dask_df['long_delay'].dtype                                  # output: int64

# Getting unique elements

In [None]:
dask_df['Severity'].unique().compute()

'''
Output: 
0 3
1 2
2 4
3 1
'''

# Count of each unique element

In [None]:
dask_df['Severity'].value_counts().compute()

'''
Output: 
2   2373210
3   998913
4   112320
1   29174
'''

# Accessing the data of data frame

###  1.  Access particular row

In [None]:
dask_df.loc['A-3'].compute()

###  2.  Access particular column

In [None]:
dask_df.loc[:,'City'].compute()

###   3.  Access particular row and column

In [None]:
dask_df.loc['A-100','State'].compute()

'''
Output:
ID
A-100   OH
Name:  State, dtype:  object
'''

### 4. Accessing a range of rows 

In [None]:
dask_df.loc['A-5':'A-9']

# Conditional search

In [None]:
dask_df[dask_df['Start_Lat']== 39.865147].compute()

# Multiple conditional search

In [None]:
dask_df[(dask_df['Start_Lng'] == -86.779770) & (dask_df['Start_Lat'] == 36.194839)].compute()

# Getting the number of null values in each column

In [None]:
dask_df.isna().sum(axis=0).compute()

# Filling Null values

In [None]:
dask_df['Wind_Speed(mph)'].isnull().sum().compute()  # output 454609

In [None]:
dask_df['Wind_Speed(mph)'] = dask_df['Wind_Speed(mph)'].fillna(10)

In [None]:
dask_df['Wind_Speed(mph)'].isnull().sum().compute()  # output 0

# Dropping rows with null values

In [None]:
dask_df = dask_df.dropna(subset=['Zipcode'])

# Dropping columns


In [None]:
print(any(dask_df.columns=='long_delay'))       # output True
dask_df = dask_df.drop('long_delay',axis=1)
print(any(dask_df.columns=='long_delay'))       # output False  

# Groupby method

In [None]:
byState = dask_df.groupby('State')

In [None]:
byState['Temperature(F)'].mean().compute()