NYCTaxi
------

[Download link](http://www.andresmh.com/nyctaxitrips/)

Taxi trips taken in 2013, released under a FOIA request.  Around 20 GB of uncompressed CSV.

**Try the following:**

*  Use `dask.dataframe` with pandas-style queries
*  Store in HDF5 both with and without categoricals, measure the size of the file and query times
*  Set the index by one of the date-time columns and store in castra (also using categoricals).  Perform range queries and measure speed.  What size and complexity of query can you perform while still having an "interactive" experience?
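
The size and timing measurements suggested above can be taken with two small helpers. This is a minimal sketch using only the standard library: `file_size_mb` and `timed` are hypothetical helper names, and the demo uses a throwaway temporary file in place of a real HDF5 store.

```python
import os
import tempfile
import time


def file_size_mb(path):
    """Return the on-disk size of `path` in megabytes (10**6 bytes)."""
    return os.path.getsize(path) / 1e6


def timed(func, *args, **kwargs):
    """Call `func` and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Demo on a throwaway file; with real data, pass the path of the stored file instead
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 2_000_000)  # exactly 2,000,000 bytes
demo_path = tmp.name

size_mb = file_size_mb(demo_path)
total, seconds = timed(sum, range(1000))  # stand-in for a real query
print(f"size: {size_mb} MB, query took {seconds:.6f} s")
os.remove(demo_path)
```

With a store such as `taxi_data.h5`, `file_size_mb('taxi_data.h5')` reports the size of the file on disk, which can differ noticeably from the in-memory size that `memory_usage` reports.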

## Trim the data down to 10,000 rows to make it easier to manage

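In [None]:
```python
import dask.dataframe as dd

# Lazily read the CSV and parse the pickup timestamp so it can be filtered by date
taxi_data = dd.read_csv(r'G:\My Drive\Sem 1 năm 4\Thay An Big Data\Lab\Lab 8\trip_data_1.csv', parse_dates=['pickup_datetime'])
# Pandas-style boolean filter: keep trips picked up during 2013
result = taxi_data[(taxi_data['pickup_datetime'] >= '2013-01-01') & (taxi_data['pickup_datetime'] < '2014-01-01')]
```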

In [None]:
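%%time
# Write the uncategorized data to HDF5; %%time reports how long the whole cell takes
taxi_data.to_hdf('taxi_data.h5', '/data', format='table', mode='w', append=True, data_columns=True)

# memory_usage() reports the in-memory size of `result`, not the size of the file on disk
print(f"HDF5 (No Categoricals) Size: {result.memory_usage(deep=True).sum().compute() / 1e6} MB")
print(f"HDF5 (No Categoricals) Query Time: ")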

HDF5 (No Categoricals) Size: 4.264446 MB
HDF5 (No Categoricals) Query Time: 
CPU times: total: 93.8 ms
Wall time: 398 ms


In [None]:
%%time
# Convert the listed columns to categoricals (this scans the data to find the categories)
result = result.categorize(columns=['medallion', 'hack_license', 'vendor_id', 'rate_code', 'store_and_fwd_flag', 'pickup_datetime', 'dropoff_datetime', 'passenger_count', 'trip_time_in_secs', 'trip_distance', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude'])
# Write the categorized data to a separate HDF5 file
result.to_hdf('taxi_data_with_categoricals.h5', '/data', format='table', mode='w', append=True, data_columns=True)

# memory_usage() reports the in-memory size of `result`, not the size of the file on disk
print(f"HDF5 (With Categoricals) Size: {result.memory_usage(deep=True).sum().compute() / 1e6} MB")
print(f"HDF5 (With Categoricals) Query Time: ")

HDF5 (With Categoricals) Size: 2.773716 MB
HDF5 (With Categoricals) Query Time: 
CPU times: total: 93.8 ms
Wall time: 906 ms


### The `castra` module is deprecated and no longer maintained

As an alternative, you can use the Parquet format with `dask` for on-disk dataframe storage.

In [None]:
import dask.dataframe as dd

# Set the index by the date-time column
df = taxi_data.set_index('pickup_datetime')

# Store in parquet
df.to_parquet('parquet_data')

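# Perform range queries on the datetime index
df = dd.read_parquet('parquet_data')
# 'start_date' and 'end_date' are placeholders; substitute real timestamps such as '2013-01-01'
result = df.loc['start_date':'end_date']
# Displaying the result only shows the lazy structure; call result.compute() to materialize rows
result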

| | medallion | hack_license | vendor_id | rate_code | store_and_fwd_flag | dropoff_datetime | passenger_count | trip_time_in_secs | trip_distance | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| npartitions=1 | | | | | | | | | | | | | |
| | object | object | object | int64 | object | object | int64 | int64 | float64 | float64 | float64 | float64 | float64 |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |


GitHub Archive
----------------

[Download link](https://www.githubarchive.org/)

Every public GitHub event for the last few years, stored as gzip-compressed, line-delimited JSON data.  Watch out: the schema switches at the 2014-2015 transition.

**Try the following:**

*  Use `dask.bag` to inspect the data
*  Drill down using functions like `pluck` and `filter`
*  Find who the most popular committers were in 2015
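
For checking results on a single file, the same count can be done in plain Python. The sketch below reads gzip-compressed, line-delimited JSON and tallies `PushEvent` actors with `collections.Counter`. It assumes the schema switch shows up in the `actor` field (a plain login string before 2015, an object with a `login` key from 2015 onwards); verify that against your own files. `actor_login`, `top_pushers`, and the demo records are hypothetical.

```python
import gzip
import json
import os
import tempfile
from collections import Counter


def actor_login(event):
    """Return the actor's login for either assumed schema."""
    actor = event.get("actor")
    if isinstance(actor, dict):  # assumed 2015+ shape: {"login": ...}
        return actor.get("login")
    return actor  # assumed earlier shape: plain string (or None)


def top_pushers(path, k=10):
    """Count PushEvent actors in one gzip line-delimited JSON file."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            event = json.loads(line)
            if event.get("type") == "PushEvent":
                login = actor_login(event)
                if login:
                    counts[login] += 1
    return counts.most_common(k)


# Synthetic demo file standing in for a real archive download
demo_events = [
    {"type": "PushEvent", "actor": {"login": "alice"}},
    {"type": "PushEvent", "actor": "alice"},
    {"type": "PushEvent", "actor": {"login": "bob"}},
    {"type": "WatchEvent", "actor": {"login": "carol"}},  # ignored: not a push
]
demo_path = os.path.join(tempfile.mkdtemp(), "events.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as out:
    for event in demo_events:
        out.write(json.dumps(event) + "\n")

top = top_pushers(demo_path)
print(top)
```

A helper like `actor_login` can also be passed to `dask.bag`'s `map` so that one pipeline handles files from both sides of the schema change.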

In [None]:
import dask.bag as db
import json
from collections import Counter

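# Raw string so that backslashes in the Windows path (e.g. `\2015`) are not read as escape sequences
github_data = db.read_text(r'G:\My Drive\Sem 1 năm 4\Thay An Big Data\Lab\Lab 8\2015-01-01-15.json\2015-01-01-15.json').map(json.loads)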

In [None]:
github_data = db.read_text('/content/2015-01-01-15.json').map(json.loads)



# Drill down using functions like pluck and filter: keep only push events
commits = github_data.filter(lambda record: record.get('type') == 'PushEvent')

# Find who the most popular committers were in 2015
commits_2015 = commits.filter(lambda record: record['created_at'].startswith('2015'))
popular_committers = commits_2015.pluck('actor').pluck('login').frequencies().topk(10, key=lambda x: x[1])

# Displaying the bag only shows the lazy graph; call popular_committers.compute() to get the top 10
popular_committers

dask.bag<topk-aggregate, npartitions=1>

Reddit Comments
-----------------

[Download link](https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/)

Every publicly available Reddit comment, available as a large torrent.

**Try the following:**

*  Use `dask.bag` to inspect the data
*  Combine `dask.bag` with `nltk` or `gensim` to perform textual analysis on the data
*  Reproduce the work of [Daniel Rodriguez](https://extrapolations.dev/blog/2015/07/reproduceit-reddit-word-count-dask/) and see if you can improve upon his speeds when analyzing this data.
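
When the comments arrive as a CSV file rather than raw text, it helps to pull out the comment column before counting words, so that other fields do not end up in the tally. The sketch below is a standard-library stand-in for that step. The column name `comment`, the tiny `STOP_WORDS` set (a placeholder for a full list such as the one in `nltk`), and the sample rows are all assumptions made for illustration.

```python
import csv
import io
import re
from collections import Counter

# Tiny stand-in for a real stop-word list
STOP_WORDS = {"the", "a", "an", "is", "it", "to", "and", "of"}


def word_counts(csv_file, column="comment"):
    """Count non-stop words found in one column of an open CSV file."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # Lowercase, then keep runs of letters and apostrophes as words
        for word in re.findall(r"[a-z']+", row[column].lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts


# Synthetic two-row sample in place of the real data
sample = io.StringIO('count,comment\n10,"Thanks, it is great"\n7,Thanks to the mods\n')
counts = word_counts(sample)
print(counts.most_common(2))
```

The same per-row logic can be applied partition by partition with `dask.bag` once it works on a small sample.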

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
import dask.bag as db
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

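# Load the data (raw string so backslashes in the Windows path are kept literally)
b = db.read_text(r'G:\My Drive\Sem 1 năm 4\Thay An Big Data\Lab\Lab 8\Reddit Comment.csv')

# Tokenize each line; every line is a whole CSV row, so non-comment fields are tokenized too
tokenized_comments = b.map(word_tokenize)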

# Remove stop words
stop_words = set(stopwords.words('english'))
filtered_comments = tokenized_comments.map(lambda comment: [word for word in comment if word.casefold() not in stop_words])

# Perform word count
word_count = filtered_comments.flatten().frequencies().compute()

print(word_count)

[('count', 1), (',', 41), ('comment', 1), ('avg_score', 1), ('count_subs', 1), ('count_authors', 1), ('example_id', 1), ('6056', 1), ('Thanks', 1), ('!', 2), (',1.808.790.956,132,5920', 1), ('r/pcmasterrace', 1), ('/comments/34tnkh/c/cqymdpy', 1), ('5887', 1), ('Yes,56.868.377.856,131,5731', 1), ('r/AdviceAnimals', 1), ('/comments/37s8vv/c/crpkuqv', 1), ('5441', 1), ('Yes.,87.958.409.805,129,5293', 1), ('r/movies', 1), ('/comments/36mruc/c/crfzgtq', 1), ('4668', 1), ('lol,33.695.471.736,121,4443', 1), ('r/2007scape', 1), ('/comments/34y3as/c/cqz4syu', 1), ('4256', 1), (':', 2), ('(', 2), (',102.876.656.485,121,4145', 1), ('r/AskReddit', 3), ('/comments/35owvx/c/cr70qla', 1), ('3852', 1), ('No.,38.500.449.796,127,3738', 1), ('r/MMA', 1), ('/comments/36kokn/c/crese9p', 1), ('3531', 1), ('F,62.622.771.182,106,3357', 1), ('r/gaming', 1), ('/comments/35dxln/c/cr3mr06', 1), ('3466', 1), ('No,35.924.608.652,124,3353', 1), ('r/PS4', 1), ('/comments/359xxn/c/cr3h8c7', 1), ('3386', 1), ('Thank',

NYC 311
---------

[Download link](https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9)

All 311 service requests since 2010 in New York City.

In [None]:
import pandas as pd

data = pd.read_csv(r'G:\My Drive\Sem 1 năm 4\Thay An Big Data\Lab\Lab 8\NYC 311.csv')
data

| Unnamed: 0 | Unique Key | Created Date | Closed Date | Agency | Agency Name | Complaint Type | Descriptor | Location Type | Incident Zip | Incident Address | ... | Vehicle Type | Taxi Company Borough | Taxi Pick Up Location | Bridge Highway Name | Bridge Highway Direction | Road Ramp | Bridge Highway Segment | Latitude | Longitude | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59990471 | 01/11/2024 12:00:00 PM | | DSNY | Department of Sanitation | Derelict Vehicles | Derelict Vehicles | Street | 10467 | 343 EAST 209 STREET | ... | | | | | | | | 4.087.663.618.204.300 | -7.387.407.522.979.460 | (40.87663618204302°, -73.87407522979463°) |
| 1 | 59988966 | 01/11/2024 12:00:00 PM | | DSNY | Department of Sanitation | Derelict Vehicles | Derelict Vehicles | Street | 10467 | 3228 DECATUR AVENUE | ... | | | | | | | | 4.087.454.575.900.520 | -7.387.502.656.053.260 | (40.87454575900527°, -73.87502656053266°) |
| 2 | 59983031 | 01/11/2024 12:00:00 PM | | DSNY | Department of Sanitation | Derelict Vehicles | Derelict Vehicles | Street | 11206 | 202 MAUJER STREET | ... | | | | | | | | 4.071.110.604.333.750 | -7.394.152.606.103.560 | (40.71110604333756°, -73.94152606103569°) |
| 3 | 59983034 | 01/11/2024 12:00:00 PM | | DSNY | Department of Sanitation | Derelict Vehicles | Derelict Vehicles | Street | 10455 | 673 BECK STREET | ... | | | | | | | | 408.148.077.988.588 | -7.390.113.194.581.160 | (40.8148077988588°, -73.90113194581166°) |
| 4 | 59983033 | 01/11/2024 12:00:00 PM | | DSNY | Department of Sanitation | Derelict Vehicles | Derelict Vehicles | Street | 10455 | 673 BECK STREET | ... | | | | | | | | 408.148.077.988.588 | -7.390.113.194.581.160 | (40.8148077988588°, -73.90113194581166°) |
| 5 | 59988973 | 01/11/2024 12:00:00 PM | | DSNY | Department of Sanitation | Derelict Vehicles | Derelict Vehicles | Street | 11206 | 195 MAUJER STREET | ... | | | | | | | | 4.071.102.149.850.240 | -7.394.260.464.288.810 | (40.71102149850243°, -73.94260464288811°) |
| 6 | 59987452 | 01/11/2024 12:00:00 PM | | DSNY | Department of Sanitation | Derelict Vehicles | Derelict Vehicles | Street | 10455 | 673 BECK STREET | ... | | | | | | | | 408.148.077.988.588 | -7.390.113.194.581.160 | (40.8148077988588°, -73.90113194581166°) |
| 7 | 59988776 | 01/11/2024 01:49:23 AM | | NYPD | New York City Police Department | Blocked Driveway | No Access | Street/Sidewalk | 11367 | 136-14 61 ROAD | ... | | | | | | | | 40.742.158.749.056.800 | -7.382.956.299.071.490 | (40.742158749056856°, -73.82956299071492°) |
| 8 | 59988894 | 01/11/2024 01:48:44 AM | | NYPD | New York City Police Department | Illegal Parking | Blocked Hydrant | Street/Sidewalk | 11356 | 122-07 22 AVENUE | ... | | | | | | | | 4.078.015.802.264.800 | -7.384.584.792.528.790 | (40.78015802264805°, -73.84584792528797°) |
| 9 | 59984397 | 01/11/2024 01:48:00 AM | | NYPD | New York City Police Department | Illegal Parking | Blocked Hydrant | Street/Sidewalk | 10461 | 2567 POPLAR STREET | ... | | | | | | | | 4.084.443.106.593.460 | -7.384.776.635.760.520 | (40.84443106593468°, -73.84776635760521°) |


European Centre for Medium Range Weather Forecasts
----------------------------------------------------------

[Download script](https://gist.github.com/mrocklin/26d8323f9a8a6a75fce0)

Download historical global weather data from the ECMWF.

**Try the following:**

*  What is the variance in temperature over time?
*  What areas experienced the largest temperature swings in the last month relative to their previous history?
*  Plot the temperature of the earth as a function of latitude and then as a function of longitude
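
Before running against the full download, the first two questions can be prototyped on a handful of readings. The sketch below uses only the standard library and synthetic data. It takes "variance over time" to mean the variance across locations at each date, and measures a location's "swing" as the ratio of its standard deviation in the last month to its standard deviation over the earlier history. Both interpretations, and all of the values, are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import pstdev, pvariance

# (date, latitude, longitude, temperature in kelvin): synthetic readings
readings = [
    ("2014-11-01", 0.0, 0.0, 300.0), ("2014-11-15", 0.0, 0.0, 301.0),
    ("2014-12-01", 0.0, 0.0, 300.0), ("2014-12-15", 0.0, 0.0, 302.0),
    ("2014-11-01", 60.0, 10.0, 270.0), ("2014-11-15", 60.0, 10.0, 274.0),
    ("2014-12-01", 60.0, 10.0, 260.0), ("2014-12-15", 60.0, 10.0, 280.0),
]

# Variance across locations for each date
by_date = defaultdict(list)
for date, lat, lon, temp in readings:
    by_date[date].append(temp)
variance_over_time = {date: pvariance(temps) for date, temps in sorted(by_date.items())}

# Split each location's readings into the last month and the earlier history
last_month = "2014-12"
recent, history = defaultdict(list), defaultdict(list)
for date, lat, lon, temp in readings:
    bucket = recent if date.startswith(last_month) else history
    bucket[(lat, lon)].append(temp)

# Ratio > 1 means the last month was more variable than the earlier history
# (a location whose history is perfectly flat would need special handling)
relative_swing = {loc: pstdev(recent[loc]) / pstdev(history[loc]) for loc in recent}
largest = max(relative_swing, key=relative_swing.get)
print(variance_over_time)
print(largest, relative_swing[largest])
```

Once the logic looks right, the same grouping can be expressed with `groupby` on a dask dataframe.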

In [None]:
import pandas as pd
from ecmwfapi import ECMWFDataServer
import dask.dataframe as dd
import matplotlib.pyplot as plt

In [None]:
dates = pd.date_range('2014-01-01', '2014-12-31', freq='D')
dates = [str(d).split()[0] for d in dates]

server = ECMWFDataServer()

for date in dates:
    server.retrieve({
      'stream'    : "oper",
      'levtype'   : "sfc",
      'param'     : "165.128/166.128/167.128",
      'dataset'   : "interim",
      'step'      : "00",
      'grid'      : "0.25/0.25",
      'time'      : "00/06/12/18",
      'date'      : date,
      'type'      : "an",
      'class'     : "ei",
      'target'    : date + ".nc3",
      'format'    : "netcdf" })

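# dask.dataframe has no NetCDF reader, so open the files with xarray (which is backed by dask)
# and convert the result to a dask dataframe. The loop above writes the files to the working directory.
# The NetCDF variable names `time` and `t2m` are assumptions; check them against the downloaded files.
import xarray as xr

weather_data = (
    xr.open_mfdataset('*.nc3', combine='by_coords')
    .to_dask_dataframe()
    .rename(columns={'time': 'timestamp', 't2m': 'temperature'})
)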

2024-01-12 23:31:51 ECMWF API python library 1.6.3
2024-01-12 23:31:51 ECMWF API at https://api.ecmwf.int/v1
2024-01-12 23:31:52 Welcome Application anonymous
2024-01-12 23:31:53 
2024-01-12 23:31:53 You are accessing ECMWF data services as an anonymous user. For
2024-01-12 23:31:53 improved quality of service you should consider registering an account
2024-01-12 23:31:53 with ECMWF.
2024-01-12 23:31:53 


APIException: "ecmwf.API error 1: User 'anonymous' has not access to datasets/interim. Please accept the terms and conditions at http://apps.ecmwf.int/datasets/licences/general"

In [None]:
variance_over_time = weather_data.groupby('timestamp')['temperature'].var().compute()
variance_over_time.plot(title='Variance in Temperature Over Time')
plt.xlabel('Timestamp')
plt.ylabel('Temperature Variance')
plt.show()

NameError: name 'weather_data' is not defined

In [None]:
weather_data = weather_data.set_index('timestamp')
# "Last month" is taken to be December 2014, the final month of the downloaded range
last_month_data = weather_data.loc['2014-12-01':'2014-12-31']
temperature_swings_last_month = last_month_data.groupby(['latitude', 'longitude'])['temperature'].std().compute()
largest_temperature_swings = temperature_swings_last_month.nlargest(5)

print(largest_temperature_swings)

In [None]:
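# Dask dataframes have no .plot accessor: sample a fraction and convert to pandas before plotting
weather_data.sample(frac=0.01).compute().plot.scatter(x='latitude', y='temperature', alpha=0.2)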
plt.title('Temperature vs Latitude')
plt.xlabel('Latitude')
plt.ylabel('Temperature')
plt.show()

In [None]:
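# Dask dataframes have no .plot accessor: sample a fraction and convert to pandas before plotting
weather_data.sample(frac=0.01).compute().plot.scatter(x='longitude', y='temperature', alpha=0.2)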
plt.title('Temperature vs Longitude')
plt.xlabel('Longitude')
plt.ylabel('Temperature')
plt.show()