# Create Data

This code generates a large synthetic time series dataset with Dask and saves it in Parquet format. Note that `to_parquet` writes a directory named `big_file.parquet` containing one file per partition, not a single file. The dataset has one row per second from 2000 through 2020, roughly 660 million rows in total, and may take a few minutes to create. Run this cell first; while it processes, we'll learn about Dask DataFrames.

In [None]:
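import dask.datasets

# One row per second over 21 years, split into 7-day partitions
ddf = dask.datasets.timeseries(
    start="2000-01-01",
    end="2020-12-31",
    freq="1s",
    partition_freq="7d",
    seed=42,  # fixed seed so the data is reproducible
)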
ddf.to_parquet("big_file.parquet")
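
As a back-of-the-envelope check on the dataset size, the sketch below counts the seconds between the two dates used above. It assumes one row per second and 7-day partitions; the exact row count Dask produces may be slightly lower, depending on how it handles a trailing partial partition.

```python
from datetime import datetime

start = datetime(2000, 1, 1)
end = datetime(2020, 12, 31)
seconds_per_partition = 7 * 24 * 60 * 60  # partition_freq="7d" at freq="1s"

# One row per second between the two timestamps (upper bound on the row count)
total_seconds = int((end - start).total_seconds())

# Number of complete 7-day partitions that fit in the range
full_partitions = total_seconds // seconds_per_partition

print(f"upper bound on rows: {total_seconds:,}")
print(f"complete 7-day partitions: {full_partitions:,}")
print(f"rows per partition: {seconds_per_partition:,}")
```

Under these assumptions, each partition holds about 600,000 rows, which is why the write is spread over more than a thousand Parquet files.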

Now, let's create a folder **"flights_data"** containing 500 CSV files with 1,000 rows each. Each file holds randomly generated flight information with the following columns:

- **flight_id**: A unique identifier for each flight, assigned sequentially so that IDs do not repeat across files.
- **origin**: The airport where the flight originates, randomly chosen from a list of airports (JFK, LGA, EWR).
- **destination**: The airport where the flight lands, randomly chosen from the same list of airports. It is drawn independently of the origin, so the two can coincide.
- **airline**: The airline operating the flight, randomly chosen from a list of airlines (Delta, United, American).
- **status**: The status of the flight, indicating whether it is 'On Time', 'Delayed', or 'Cancelled', randomly assigned.
- **delay_minutes**: The number of minutes the flight is delayed, randomly assigned a value between 0 and 120.
- **num_passengers**: The number of passengers on the flight, randomly assigned a value between 50 and 300.
- **distance**: The distance the flight will travel, randomly assigned a value between 100 and 5000 miles.
- **flight_duration**: The duration of the flight in minutes, randomly assigned a value between 30 and 600 minutes.


In [1]:
import os
import pandas as pd
import numpy as np

# Create the flights_data directory if it doesn't exist
os.makedirs('flights_data', exist_ok=True)

# Define the columns for the flight data
columns = ['flight_id', 'origin', 'destination', 'airline', 'status', 'delay_minutes', 'num_passengers', 'distance', 'flight_duration']

# Define some sample data for the columns
airports = ['JFK', 'LGA', 'EWR']
airlines = ['Delta', 'United', 'American']
statuses = ['On Time', 'Delayed', 'Cancelled']

# Generate and save multiple CSV files
num_files = 500
num_rows_per_file = 1000

for i in range(num_files):
    # Generate random flight data
    data = {
        'flight_id': np.arange(i * num_rows_per_file, (i + 1) * num_rows_per_file),
        'origin': np.random.choice(airports, num_rows_per_file),
        'destination': np.random.choice(airports, num_rows_per_file),
        'airline': np.random.choice(airlines, num_rows_per_file),
        'status': np.random.choice(statuses, num_rows_per_file),
        'delay_minutes': np.random.randint(0, 121, num_rows_per_file),
        'num_passengers': np.random.randint(50, 301, num_rows_per_file),
        'distance': np.random.randint(100, 5001, num_rows_per_file),
        'flight_duration': np.random.randint(30, 601, num_rows_per_file)
    }
    
    # Build the DataFrame, using the column order defined above
    df = pd.DataFrame(data, columns=columns)
    
    # Save the DataFrame to a CSV file
    df.to_csv(f'flights_data/flights_data_{i+1}.csv', index=False)

print("CSV files created in the 'flights_data' folder.")

CSV files created in the 'flights_data' folder.
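
The cell above draws from NumPy's global random state, so every run produces different files. If repeatable files are needed, one option is a seeded generator per file. This is a minimal sketch of that variation, not the notebook's method: `make_flights` is a hypothetical helper, the seed value 42 is borrowed from the timeseries cell, and only a few of the columns are shown.

```python
import numpy as np
import pandas as pd

airports = ["JFK", "LGA", "EWR"]

def make_flights(file_index, num_rows, seed=42):
    """Build one file's worth of rows (hypothetical helper, for illustration)."""
    # Seeding with [seed, file_index] gives each file its own repeatable stream
    rng = np.random.default_rng([seed, file_index])
    return pd.DataFrame({
        "flight_id": np.arange(file_index * num_rows, (file_index + 1) * num_rows),
        "origin": rng.choice(airports, num_rows),
        "destination": rng.choice(airports, num_rows),
        # integers() excludes the upper bound by default, like np.random.randint
        "delay_minutes": rng.integers(0, 121, num_rows),
    })

first = make_flights(0, 1000)
again = make_flights(0, 1000)   # same file index and seed
other = make_flights(1, 1000)   # different file index

print(first.equals(again))
print(first.head())
```

Calling the helper twice with the same file index returns identical rows, while a different index yields a different, equally repeatable draw.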
