# NB01 - Data Collection
**Author:** Thea West

**Description:**

This notebook walks through the process of collecting, cleaning, and compiling weather data from five cities into one CSV file. The data is sourced from [Open Meteo](https://open-meteo.com/en/docs/historical-weather-api), and the JSON data for each city is converted to a Pandas dataframe. A column is added for the city name, the dates are converted to datetime format, and a boolean `rainy_day` column is added. Finally, the five dataframes are concatenated into one and saved to a CSV file in the data folder.


In [None]:
import pandas as pd
import requests
import json

## Data Collection

I will compare London to four other capital cities in the general geographic area: Dublin, Paris, Brussels, and Amsterdam. All of these cities are within about 350 miles of London and have a relatively similar climate. As far as I am aware, these cities do not have quite the same reputation for rain as London.

I will compare the raininess of these cities for the year 2024 as this is recent data which spans all seasons throughout the year. 

I will use rain sum (millimeters of rain in a day) and precipitation hours (number of hours it rained in a day). These two measurements cover two possible measures of raininess: the volume of rain that fell and the amount of time it rained. I will compare both across the cities to see if London is truly rainier.

Start by setting the parameters for which data to fetch:

In [2]:
url_endpoint = "https://archive-api.open-meteo.com/v1/archive"
params = {
    "latitude": [51.5085, 48.8534, 50.8505, 52.374, 53.3331],
    "longitude": [-0.1257, 2.3488, 4.3488, 4.8897, -6.2489],
    "start_date": "2024-01-01",
    "end_date": "2024-12-31",
    "daily": "rain_sum,precipitation_hours",
    "timezone": ["Europe/London", "Europe/Berlin", "Europe/Berlin", "Europe/Berlin", "Europe/London"]
}
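To see how these list-valued parameters are sent, the request can be prepared locally without contacting the server. This is a minimal sketch that relies on the `requests` behavior of repeating a key once per list element. The shortened two-city `params` is for illustration only.

```python
import requests

url_endpoint = "https://archive-api.open-meteo.com/v1/archive"

# Shortened to two cities (London, Paris) so the printed URL stays readable.
params = {
    "latitude": [51.5085, 48.8534],
    "longitude": [-0.1257, 2.3488],
    "start_date": "2024-01-01",
    "end_date": "2024-12-31",
    "daily": "rain_sum,precipitation_hours",
}

# Build the request locally; nothing is sent over the network.
prepared = requests.Request("GET", url_endpoint, params=params).prepare()
print(prepared.url)
```

Each latitude and longitude appears as its own `latitude=...` or `longitude=...` pair in the query string, in the order given in the lists.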

Fetch the data and check if it was successful:

In [3]:
response = requests.get(url_endpoint, params=params)

if response.status_code == 200:
    print("Success")
else:
    print(f"Failure! Status code: {response.status_code}")

Success
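The status check above could also be wrapped in a small helper that stops on failure rather than printing a message. This is a sketch, not part of the original notebook. `fetch_json` and its `http` parameter are hypothetical names, and the 30-second timeout is an arbitrary assumption.

```python
import requests


def fetch_json(url, params, http=requests, timeout=30):
    """Return the decoded JSON body, raising if the request fails.

    `http` is a hypothetical hook: any object with a requests-style `get`
    method (the `requests` module itself, or a `requests.Session`).
    """
    response = http.get(url, params=params, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of continuing.
    response.raise_for_status()
    return response.json()
```

With this helper, the fetch and parse steps would reduce to `raw_data = fetch_json(url_endpoint, params)`.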


Parse the response body from JSON into Python objects:

In [4]:
raw_data = response.json()

## Convert to Pandas Dataframe and Clean

Convert the JSON data to a pandas dataframe, with one row per location:

In [34]:
lat_long_df = pd.DataFrame(raw_data)

Add a column specifying the city:

In [35]:
# This relies on the cities being returned in the same order they were requested.
# If cities are added or reordered in `params`, this list must be updated to match.
cities = ["London", "Paris", "Brussels", "Amsterdam", "Dublin"]
lat_long_df = lat_long_df.assign(city=cities)
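One way to make this ordering less fragile is to keep each city's name, coordinates, and timezone in a single record and derive both the request lists and the `cities` list from it. This is a sketch only. `CITY_RECORDS` and `column` are hypothetical names, and it still assumes the API returns locations in request order.

```python
# Hypothetical structure: one record per city, so a name and its
# coordinates cannot drift out of step. Values are the ones used above.
CITY_RECORDS = [
    {"city": "London", "latitude": 51.5085, "longitude": -0.1257, "timezone": "Europe/London"},
    {"city": "Paris", "latitude": 48.8534, "longitude": 2.3488, "timezone": "Europe/Berlin"},
    {"city": "Brussels", "latitude": 50.8505, "longitude": 4.3488, "timezone": "Europe/Berlin"},
    {"city": "Amsterdam", "latitude": 52.374, "longitude": 4.8897, "timezone": "Europe/Berlin"},
    {"city": "Dublin", "latitude": 53.3331, "longitude": -6.2489, "timezone": "Europe/London"},
]


def column(records, key):
    """Collect one field from every record, preserving record order."""
    return [record[key] for record in records]


cities = column(CITY_RECORDS, "city")
params = {
    "latitude": column(CITY_RECORDS, "latitude"),
    "longitude": column(CITY_RECORDS, "longitude"),
    "start_date": "2024-01-01",
    "end_date": "2024-12-31",
    "daily": "rain_sum,precipitation_hours",
    "timezone": column(CITY_RECORDS, "timezone"),
}
```

Adding or reordering a city then means editing one record, and the name list follows automatically.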

Expand each city's nested `daily` data into one row per day:

In [None]:
# list to store each 'daily' df
list_of_daily_dfs = []
# iterate through each row in the lat_long_df
for _, row in lat_long_df.iterrows():
    # make a df of just this city's daily data
    temp = pd.DataFrame(row['daily'])
    # tag it with city 
    temp['city'] = row['city']
    list_of_daily_dfs.append(temp)

# stitch all five cities’ daily dfs into one big df
df = pd.concat(list_of_daily_dfs, ignore_index=True)
df

Unnamed: 0,time,rain_sum,precipitation_hours,city
0,2024-01-01,8.3,8.0,London
1,2024-01-02,9.6,11.0,London
2,2024-01-03,2.3,6.0,London
3,2024-01-04,27.7,8.0,London
4,2024-01-05,4.6,8.0,London
...,...,...,...,...
1825,2024-12-27,0.0,0.0,Dublin
1826,2024-12-28,0.0,0.0,Dublin
1827,2024-12-29,0.0,0.0,Dublin
1828,2024-12-30,0.0,0.0,Dublin
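The flattening step can be tried on a small hand-written sample. This sketch assumes each location in the response carries a `daily` dictionary of equal-length lists, which is what the loop above relies on. All values and city names in the sample are made up.

```python
import pandas as pd

# Made-up sample mimicking the assumed response shape: one dict per
# location, each with a `daily` dict of parallel lists.
sample_raw_data = [
    {
        "latitude": 51.5,
        "longitude": -0.1,
        "daily": {
            "time": ["2024-01-01", "2024-01-02"],
            "rain_sum": [1.0, 0.0],
            "precipitation_hours": [2.0, 0.0],
        },
    },
    {
        "latitude": 48.9,
        "longitude": 2.3,
        "daily": {
            "time": ["2024-01-01", "2024-01-02"],
            "rain_sum": [0.5, 3.0],
            "precipitation_hours": [1.0, 5.0],
        },
    },
]
sample_cities = ["City A", "City B"]  # placeholder names, in request order

frames = []
for city, location in zip(sample_cities, sample_raw_data):
    frame = pd.DataFrame(location["daily"])  # one row per day
    frame["city"] = city                     # tag every row with its city
    frames.append(frame)

sample_df = pd.concat(frames, ignore_index=True)

# Sanity check: every city should contribute the same number of days.
days_per_city = sample_df.groupby("city").size()
assert days_per_city.nunique() == 1
print(sample_df.shape)
```

For the real 2024 data, the same check should give 366 rows per city, since 2024 is a leap year, for 1,830 rows in total.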


Change the items in the `time` column to datetime objects:

In [37]:
df['time'] = pd.to_datetime(df['time'], format='ISO8601')
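A short sketch of what the conversion enables, using a made-up three-row sample. It passes an explicit `%Y-%m-%d` format rather than `'ISO8601'`, on the assumption that the dates are plain `YYYY-MM-DD` strings as shown in the output above.

```python
import pandas as pd

# Made-up sample with the same "YYYY-MM-DD" strings as the `time` column.
sample = pd.DataFrame({"time": ["2024-01-01", "2024-02-29", "2024-12-31"]})
sample["time"] = pd.to_datetime(sample["time"], format="%Y-%m-%d")

print(sample["time"].dtype)

# With a datetime column, calendar fields are available via the .dt accessor.
sample["month"] = sample["time"].dt.month
print(sample["month"].tolist())
```

Once the column holds datetimes, grouping by month or season becomes a one-line operation.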

For initial analysis of the data I may want a count of "rainy days." I will define a "rainy day" as having either at least 2.5 mm of rainfall or at least 4 hours of rain. 

Create a column which defines each day as "rainy" (True) or "not rainy" (False):

In [41]:
df = df.assign(rainy_day=(df['rain_sum'] >= 2.5) | (df['precipitation_hours'] >= 4.0))
df

Unnamed: 0,time,rain_sum,precipitation_hours,city,rainy_day
0,2024-01-01,8.3,8.0,London,True
1,2024-01-02,9.6,11.0,London,True
2,2024-01-03,2.3,6.0,London,True
3,2024-01-04,27.7,8.0,London,True
4,2024-01-05,4.6,8.0,London,True
...,...,...,...,...,...
1825,2024-12-27,0.0,0.0,Dublin,False
1826,2024-12-28,0.0,0.0,Dublin,False
1827,2024-12-29,0.0,0.0,Dublin,False
1828,2024-12-30,0.0,0.0,Dublin,False
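The rule can also be written as a plain function to make the "either/or" logic explicit. This is a sketch for illustration. `is_rainy_day` and the two constant names are hypothetical, and the thresholds are the ones defined above.

```python
RAIN_MM_THRESHOLD = 2.5      # at least this many millimeters of rain
RAIN_HOURS_THRESHOLD = 4.0   # or at least this many hours of rain


def is_rainy_day(rain_sum_mm, precipitation_hours):
    """A day is rainy if it meets either threshold."""
    return (
        rain_sum_mm >= RAIN_MM_THRESHOLD
        or precipitation_hours >= RAIN_HOURS_THRESHOLD
    )


# Two rows taken from the output above.
print(is_rainy_day(2.3, 6.0))  # London, 2024-01-03
print(is_rainy_day(0.0, 0.0))  # Dublin, 2024-12-27
```

The London row shows why both measures matter: 2.3 mm falls below the volume threshold, but six hours of rain still qualifies the day as rainy.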


## Save Dataframe as a CSV

Save the cleaned dataframe to `data/rain_data_2024.csv` (the `data` folder must already exist):

In [None]:
csv_path = "data/rain_data_2024.csv"
df.to_csv(csv_path, index=False)
print(f"Clean data saved to {csv_path}")

Clean data saved to data/rain_data_2024.csv
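Two details are worth keeping in mind when saving and reloading: `to_csv` does not create missing folders, and a CSV stores dates as text. The sketch below shows one way to handle both, using a temporary folder and a made-up two-row sample rather than the real data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up sample standing in for the cleaned dataframe.
sample = pd.DataFrame({
    "time": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "rain_sum": [1.0, 0.0],
    "city": ["City A", "City A"],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "data" / "rain_data_sample.csv"
    # Create the parent folder first, since to_csv will not do it.
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    sample.to_csv(csv_path, index=False)

    # parse_dates restores the datetime dtype when reading back.
    reloaded = pd.read_csv(csv_path, parse_dates=["time"])

print(reloaded.dtypes)
```

Any later notebook that reads the CSV would need the same `parse_dates=["time"]` argument, or a separate `pd.to_datetime` call, to get datetimes back.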
