# NB01 - Data Collection
**Author:** Thea West

**Description:**

This notebook walks through collecting, cleaning, and compiling weather data from five cities into one CSV file. The data is sourced from [Open Meteo](https://open-meteo.com/en/docs/historical-weather-api), and the JSON response for each city is converted to a pandas DataFrame. A column is added for the city name, and the dates are converted to datetime format. Finally, the five DataFrames are concatenated into one and saved to a CSV file in the data folder.


In [2]:
import pandas as pd
import requests
import json
import os

## Data Collection

I will compare London to four other capital cities in the general geographic area: Dublin, Paris, Brussels, and Amsterdam. All of these cities are within about 350 miles of London and have a relatively similar climate. As far as I am aware, these cities do not have the same reputation for rain as London.

I will compare the raininess of these cities for the year 2024 as this is recent data which spans all seasons throughout the year. 

I will use rain sum (millimeters of rain in a day) and precipitation hours (number of hours it rained in a day). These two measurements cover two possible definitions of raininess: the actual volume of rainfall and the amount of time it rains. I will compare London with the other cities on both measurements to see if it is truly rainier. If it is only rainier by one measurement, that may point to how the general public perceives raininess and, by extension, how London got its reputation.

Start by setting the parameters for which data to fetch for each city:

In [3]:
url_endpoint = "https://archive-api.open-meteo.com/v1/archive"
london_params = {
    "latitude": 51.5085,
    "longitude": -0.1257,
    "start_date": "2024-01-01",
    "end_date": "2024-12-31",
    "daily": "rain_sum,precipitation_hours",
    "timezone": "Europe/London"
}
paris_params = {
    "latitude": 48.8534,
    "longitude": 2.3488,
    "start_date": "2024-01-01",
    "end_date": "2024-12-31",
    "daily": "rain_sum,precipitation_hours",
    "timezone": "Europe/Berlin"
}
brussels_params = {
    "latitude": 50.8505,
    "longitude": 4.3488,
    "start_date": "2024-01-01",
    "end_date": "2024-12-31",
    "daily": "rain_sum,precipitation_hours",
    "timezone": "Europe/Berlin"
}
amsterdam_params = {
    "latitude": 52.374,
    "longitude": 4.8897,
    "start_date": "2024-01-01",
    "end_date": "2024-12-31",
    "daily": "rain_sum,precipitation_hours",
    "timezone": "Europe/Berlin"
}
dublin_params = {
    "latitude": 53.3331,
    "longitude": -6.2489,
    "start_date": "2024-01-01",
    "end_date": "2024-12-31",
    "daily": "rain_sum,precipitation_hours",
    "timezone": "Europe/London"
}
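
The five dictionaries above differ only in their coordinates and timezone, so they could also be generated from one small table. The following is a minimal sketch of that idea, not the code this notebook runs; `COMMON_PARAMS`, `CITY_SETTINGS`, and `build_params` are hypothetical names introduced for illustration.

```python
# Hypothetical helper: build one params dict per city from shared settings.
COMMON_PARAMS = {
    "start_date": "2024-01-01",
    "end_date": "2024-12-31",
    "daily": "rain_sum,precipitation_hours",
}

# (latitude, longitude, timezone), copied from the dictionaries above
CITY_SETTINGS = {
    "London": (51.5085, -0.1257, "Europe/London"),
    "Paris": (48.8534, 2.3488, "Europe/Berlin"),
    "Brussels": (50.8505, 4.3488, "Europe/Berlin"),
    "Amsterdam": (52.374, 4.8897, "Europe/Berlin"),
    "Dublin": (53.3331, -6.2489, "Europe/London"),
}


def build_params(city):
    """Return the request parameters for one city."""
    latitude, longitude, timezone = CITY_SETTINGS[city]
    return {
        "latitude": latitude,
        "longitude": longitude,
        **COMMON_PARAMS,
        "timezone": timezone,
    }


# One params dict per city, keyed by city name
all_params = {city: build_params(city) for city in CITY_SETTINGS}
```

With this layout, changing the date range or the daily variables only requires editing `COMMON_PARAMS` once.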

Fetch the data and check if it was successful:

In [4]:
response_london = requests.get(url_endpoint, params=london_params)
response_paris = requests.get(url_endpoint, params=paris_params)
response_brussels = requests.get(url_endpoint, params=brussels_params)
response_amsterdam = requests.get(url_endpoint, params=amsterdam_params)
response_dublin = requests.get(url_endpoint, params=dublin_params)

cities = [["London", response_london], ["Paris", response_paris], ["Brussels", response_brussels], ["Amsterdam", response_amsterdam], ["Dublin", response_dublin]]
for city in cities:
    if city[1].status_code == 200:
        print(f"{city[0]}: Success")
    else:
        print(f"{city[0]}: Failure! Status code: {city[1].status_code}")

London: Success
Paris: Success
Brussels: Success
Amsterdam: Success
Dublin: Success
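
The loop above reports a failure but lets the notebook continue. If stopping on the first failed request is preferable, one possible sketch is shown below. `fetch_daily` is a hypothetical helper, not part of this notebook; it takes the request function as an argument (for example `requests.get`) and relies on the response object's `raise_for_status()` and `json()` methods.

```python
def fetch_daily(get, url, params, timeout=30):
    """Fetch one city's data and return the parsed JSON as a dict.

    `get` is any callable with the same interface as `requests.get`.
    """
    response = get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raises an exception on 4xx/5xx responses
    return response.json()
```

In this notebook it might be called as `fetch_daily(requests.get, url_endpoint, london_params)`. The 30-second timeout is an arbitrary assumption.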


Parse the JSON body of each response and store the result:

In [5]:
raw_data_london = response_london.json()
raw_data_paris = response_paris.json()
raw_data_brussels = response_brussels.json()
raw_data_amsterdam = response_amsterdam.json()
raw_data_dublin = response_dublin.json()
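
Before converting, it can help to confirm that each parsed response has the expected shape. The sketch below assumes that the `daily` entry is a dictionary of equal-length lists, one per requested variable plus `time`, which is what the DataFrame conversion in the next section relies on. `check_daily_block` is a hypothetical helper and `sample` is a hand-written stand-in, not a real API response.

```python
def check_daily_block(raw_data, expected_keys=("time", "rain_sum", "precipitation_hours")):
    """Return the number of days if the 'daily' block is consistent, else raise."""
    daily = raw_data["daily"]
    missing = [key for key in expected_keys if key not in daily]
    if missing:
        raise KeyError(f"Missing daily variables: {missing}")
    lengths = {len(daily[key]) for key in expected_keys}
    if len(lengths) != 1:
        raise ValueError(f"Daily lists have different lengths: {lengths}")
    return lengths.pop()


# Hand-written sample in the assumed shape, using the first two London rows
# from the table shown later in this notebook
sample = {
    "daily": {
        "time": ["2024-01-01", "2024-01-02"],
        "rain_sum": [8.3, 9.6],
        "precipitation_hours": [8.0, 11.0],
    }
}
n_days = check_daily_block(sample)
```

Applied to the real data, the same call would be `check_daily_block(raw_data_london)`, and so on for each city.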

## Convert to Pandas Dataframe and Clean

Convert the `daily` entry of each parsed response to a pandas DataFrame:

In [6]:
df_london = pd.DataFrame(raw_data_london['daily'])
df_paris = pd.DataFrame(raw_data_paris['daily'])
df_brussels = pd.DataFrame(raw_data_brussels['daily'])
df_amsterdam = pd.DataFrame(raw_data_amsterdam['daily'])
df_dublin = pd.DataFrame(raw_data_dublin['daily'])

Add a column to each dataframe specifying the city:

In [21]:
df_london['city'] = "London"
df_paris['city'] = "Paris"
df_brussels['city'] = "Brussels"
df_amsterdam['city'] = "Amsterdam"
df_dublin['city'] = "Dublin"

Combine the dataframes into one and reset the index:

In [22]:
df = pd.concat([df_london, df_paris, df_brussels, df_amsterdam, df_dublin]).reset_index(drop=True)
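
A quick sanity check on the combined DataFrame is to compare its length with the number of rows the request should produce. The sketch below computes that number, assuming the API returns exactly one row per calendar day for each city; the variable names are illustrative only.

```python
from datetime import date

start = date(2024, 1, 1)
end = date(2024, 12, 31)

# Both endpoints are included, and 2024 is a leap year
days_in_range = (end - start).days + 1
n_cities = 5
expected_rows = days_in_range * n_cities
```

In the notebook, `len(df) == expected_rows` could then be checked right after the concatenation.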

Change the items in the `time` column to datetime objects:

In [23]:
df['time'] = pd.to_datetime(df['time'], format='ISO8601')

For initial analysis of the data I may want a count of "rainy days." I will define a "rainy day" as having either at least 2.5 mm of rainfall or at least 4 hours of rain. 

Create a column which defines each day as "rainy" (True) or "not rainy" (False):

In [24]:
# A day is "rainy" if it meets either threshold (evaluated for all rows at once)
df['rainy_day'] = (df['rain_sum'] >= 2.5) | (df['precipitation_hours'] >= 4.0)
df

Unnamed: 0,time,rain_sum,precipitation_hours,city,rainy_day
0,2024-01-01,8.3,8.0,London,True
1,2024-01-02,9.6,11.0,London,True
2,2024-01-03,2.3,6.0,London,True
3,2024-01-04,27.7,8.0,London,True
4,2024-01-05,4.6,8.0,London,True
...,...,...,...,...,...
1825,2024-12-27,0.0,0.0,Dublin,False
1826,2024-12-28,0.0,0.0,Dublin,False
1827,2024-12-29,0.0,0.0,Dublin,False
1828,2024-12-30,0.0,0.0,Dublin,False
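
The rainy-day rule can also be written as a small standalone function, which makes the two thresholds easy to test on individual values. This is only an illustrative sketch; `is_rainy_day` and the constant names are hypothetical, and the example values are taken from rows in the table above.

```python
RAIN_SUM_THRESHOLD_MM = 2.5          # at least 2.5 mm of rainfall
PRECIPITATION_HOURS_THRESHOLD = 4.0  # or at least 4 hours of rain


def is_rainy_day(rain_sum, precipitation_hours):
    """Return True if either rainy-day threshold is met."""
    return (
        rain_sum >= RAIN_SUM_THRESHOLD_MM
        or precipitation_hours >= PRECIPITATION_HOURS_THRESHOLD
    )


# (rain_sum, precipitation_hours) pairs from the table above
examples = [(2.3, 6.0), (27.7, 8.0), (0.0, 0.0)]
results = [is_rainy_day(rain, hours) for rain, hours in examples]
```

The first pair (2.3 mm over 6 hours) is below the volume threshold but still counts as rainy because of the duration threshold, matching the `True` in the table for London on 2024-01-03.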


## Save Dataframe as a CSV

Two functions from the `os` module are used to save the CSV to the data folder.

```python
os.makedirs("data", exist_ok = True)
```
This line creates the data folder. Passing `exist_ok=True` means that if the folder already exists, it is used as-is instead of raising an error.


```python
csv_path = os.path.join("data", "rain_data_2024.csv")
```
This line joins the "data" folder name and the file name with the operating system's path separator. On macOS and Linux, `csv_path` is now `data/rain_data_2024.csv`.

The CSV is created and saved in the proper location in the code block below:

In [25]:
os.makedirs("data", exist_ok = True)

csv_path = os.path.join("data", "rain_data_2024.csv")
df.to_csv(csv_path, index=False)
print(f"Clean data saved to {csv_path}")

Clean data saved to data/rain_data_2024.csv
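
The same folder-and-path logic can be expressed with the standard library's `pathlib` module instead of `os`. The sketch below is an alternative, not the approach this notebook uses; `prepare_csv_path` is a hypothetical helper, and it is demonstrated in a temporary directory so that no real files are touched.

```python
import tempfile
from pathlib import Path


def prepare_csv_path(base_dir, folder="data", filename="rain_data_2024.csv"):
    """Create `folder` inside `base_dir` if needed and return the CSV path."""
    data_dir = Path(base_dir) / folder
    data_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return data_dir / filename


# Demonstrated in a temporary directory so the real data folder is untouched
with tempfile.TemporaryDirectory() as tmp:
    csv_path = prepare_csv_path(tmp)
    folder_created = csv_path.parent.is_dir()
    relative_path = csv_path.relative_to(tmp).as_posix()
```

In the notebook, `df.to_csv(prepare_csv_path("."), index=False)` would write to the same location as the cell above, since `to_csv` accepts path objects as well as strings.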
