# NB01 - Data Collection
**Author:** Thea West

**Description:**

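This notebook downloads daily rain data for 2024 for London, Dublin, Paris, Brussels, and Amsterdam from the Open-Meteo archive API, combines it into one pandas dataframe, and saves it as a CSV.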


## Imports

In [2]:
import pandas as pd
import requests
import json

## Data Collection

I will compare London to four other capital cities in the general geographic area: Dublin, Paris, Brussels, and Amsterdam. All of these cities are within about 350 miles of London and have a relatively similar climate. As far as I am aware, these cities do not have the same reputation for rain as London.

I will compare the raininess of these cities for the year 2024 as this is recent data which spans all seasons throughout the year. 

I will use rain sum (millimeters of rain in a day) and precipitation hours (number of hours it rained in a day). Together they cover two possible measures of raininess: the actual volume of rainfall and the amount of time it rains. I will compare both measures across the cities to see if London is truly rainier. If it is only rainier by one measure, that may point to how the general public perceives raininess and, by extension, how London got its reputation.
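
To illustrate why the two measures can disagree, here is a minimal sketch with made-up values for two hypothetical cities, `a` and `b`. The numbers are invented for illustration and are not from the Open-Meteo data.

```python
import pandas as pd

# Made-up daily values for two hypothetical cities (not real data).
toy = pd.DataFrame({
    "city": ["a", "a", "a", "b", "b", "b"],
    "rain_sum": [10.0, 0.0, 5.0, 2.0, 2.0, 2.0],            # mm per day
    "precipitation_hours": [2.0, 0.0, 1.0, 6.0, 5.0, 7.0],  # hours per day
})

# Total volume and total rainy hours per city.
totals = toy.groupby("city")[["rain_sum", "precipitation_hours"]].sum()
print(totals)
```

In this toy example city `a` has more rain by volume while city `b` has more rainy hours, which is the kind of split the comparison is meant to detect.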

Start by setting the parameters for which data to fetch for each city:

In [3]:
url_endpoint = "https://archive-api.open-meteo.com/v1/archive"
london_params = {
    "latitude": 51.5085,
    "longitude": -0.1257,
    "start_date": "2024-01-01",
    "end_date": "2024-12-31",
    "daily": "rain_sum,precipitation_hours",
    "timezone": "Europe/London"
}
paris_params = {
    "latitude": 48.8534,
    "longitude": 2.3488,
    "start_date": "2024-01-01",
    "end_date": "2024-12-31",
    "daily": "rain_sum,precipitation_hours",
    "timezone": "Europe/Berlin"
}
brussels_params = {
    "latitude": 50.8505,
    "longitude": 4.3488,
    "start_date": "2024-01-01",
    "end_date": "2024-12-31",
    "daily": "rain_sum,precipitation_hours",
    "timezone": "Europe/Berlin"
}
amsterdam_params = {
    "latitude": 52.374,
    "longitude": 4.8897,
    "start_date": "2024-01-01",
    "end_date": "2024-12-31",
    "daily": "rain_sum,precipitation_hours",
    "timezone": "Europe/Berlin"
}
dublin_params = {
    "latitude": 53.3331,
    "longitude": -6.2489,
    "start_date": "2024-01-01",
    "end_date": "2024-12-31",
    "daily": "rain_sum,precipitation_hours",
    "timezone": "Europe/London"
}
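
Since only the coordinates and timezone differ between cities, the same parameters could also be generated from one table. This is a sketch of an alternative, not the approach used in this notebook; `CITIES`, `build_params`, and `params_by_city` are hypothetical names introduced here.

```python
# Coordinates and timezones copied from the dictionaries above.
CITIES = {
    "london": (51.5085, -0.1257, "Europe/London"),
    "paris": (48.8534, 2.3488, "Europe/Berlin"),
    "brussels": (50.8505, 4.3488, "Europe/Berlin"),
    "amsterdam": (52.374, 4.8897, "Europe/Berlin"),
    "dublin": (53.3331, -6.2489, "Europe/London"),
}


def build_params(latitude, longitude, timezone):
    """Hypothetical helper: return the query parameters for one city."""
    return {
        "latitude": latitude,
        "longitude": longitude,
        "start_date": "2024-01-01",
        "end_date": "2024-12-31",
        "daily": "rain_sum,precipitation_hours",
        "timezone": timezone,
    }


# One parameter dictionary per city, keyed by city name.
params_by_city = {city: build_params(*spec) for city, spec in CITIES.items()}
```

Keeping the shared values in one place means a change to the date range or the requested variables only has to be made once.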

Fetch the data and check if it was successful:

In [4]:
response_london = requests.get(url_endpoint, params=london_params)
response_paris = requests.get(url_endpoint, params=paris_params)
response_brussels = requests.get(url_endpoint, params=brussels_params)
response_amsterdam = requests.get(url_endpoint, params=amsterdam_params)
response_dublin = requests.get(url_endpoint, params=dublin_params)

cities = [
    ("London", response_london),
    ("Paris", response_paris),
    ("Brussels", response_brussels),
    ("Amsterdam", response_amsterdam),
    ("Dublin", response_dublin),
]
for name, response in cities:
    if response.status_code == 200:
        print(f"{name}: Success")
    else:
        print(f"{name}: Failure! Status code: {response.status_code}")

London: Success
Paris: Success
Brussels: Success
Amsterdam: Success
Dublin: Success
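
The fetch-and-check step could also be written as one loop that collects failures by city. The sketch below is an assumption-laden alternative: `fetch_all` is a hypothetical helper, and the `get` argument stands in for `requests.get` so the function can be tried without network access.

```python
def fetch_all(url, params_by_city, get, timeout=30):
    """Hypothetical helper: request every city and split results by status.

    `get` is any callable with the same keyword arguments as requests.get
    (params, timeout) that returns an object with a `status_code` attribute.
    """
    responses, failures = {}, {}
    for city, params in params_by_city.items():
        response = get(url, params=params, timeout=timeout)
        if response.status_code == 200:
            responses[city] = response
        else:
            # Keep the status code so the failure can be reported later.
            failures[city] = response.status_code
    return responses, failures
```

In this notebook it would be called as `fetch_all(url_endpoint, {"London": london_params, ...}, requests.get)`, and an empty `failures` dictionary would correspond to the all-success output above.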


Parse each response body from JSON into a Python dictionary:

In [5]:
raw_data_london = response_london.json()
raw_data_paris = response_paris.json()
raw_data_brussels = response_brussels.json()
raw_data_amsterdam = response_amsterdam.json()
raw_data_dublin = response_dublin.json()
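
Before converting, it can help to confirm that the `daily` block of a response has the expected fields and that they all have the same length. This is a minimal sketch: `check_daily` is a hypothetical helper and `sample` is a made-up payload that only mimics the shape of the `daily` part of a response.

```python
def check_daily(raw, fields=("time", "rain_sum", "precipitation_hours")):
    """Hypothetical helper: return the number of days if `daily` is consistent."""
    daily = raw["daily"]
    missing = [name for name in fields if name not in daily]
    if missing:
        raise KeyError(f"missing daily fields: {missing}")
    lengths = {len(daily[name]) for name in fields}
    if len(lengths) != 1:
        raise ValueError(f"daily fields differ in length: {sorted(lengths)}")
    return lengths.pop()


# Made-up payload with two days of data (not real measurements).
sample = {
    "daily": {
        "time": ["2024-01-01", "2024-01-02"],
        "rain_sum": [1.2, 0.0],
        "precipitation_hours": [3.0, 0.0],
    }
}
n_days = check_daily(sample)
```

For a full-year 2024 request, the returned count would be expected to be 366, since 2024 is a leap year.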

## Convert to Pandas Dataframe and Clean

Convert the `daily` block of each response to a pandas dataframe:

In [6]:
df_london = pd.DataFrame(raw_data_london['daily'])
df_paris = pd.DataFrame(raw_data_paris['daily'])
df_brussels = pd.DataFrame(raw_data_brussels['daily'])
df_amsterdam = pd.DataFrame(raw_data_amsterdam['daily'])
df_dublin = pd.DataFrame(raw_data_dublin['daily'])

Add a column to each dataframe specifying the city:

In [7]:
df_london['city'] = "london"
df_paris['city'] = "paris"
df_brussels['city'] = "brussels"
df_amsterdam['city'] = "amsterdam"
df_dublin['city'] = "dublin"

Combine the dataframes into one and reset the index:

In [8]:
df = pd.concat([df_london, df_paris, df_brussels, df_amsterdam, df_dublin]).reset_index(drop=True)
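
The convert, label, and combine steps can also be expressed as a single loop. The sketch below uses made-up stand-ins for the per-city `daily` dictionaries; `daily_by_city`, `frames`, and `combined` are hypothetical names, and `ignore_index=True` is used in place of `.reset_index(drop=True)`.

```python
import pandas as pd

# Made-up stand-ins for the per-city "daily" dictionaries (not real data).
daily_by_city = {
    "london": {"time": ["2024-01-01", "2024-01-02"], "rain_sum": [1.0, 0.0]},
    "paris": {"time": ["2024-01-01", "2024-01-02"], "rain_sum": [0.5, 2.0]},
}

# Build one dataframe per city and tag each row with the city name.
frames = [
    pd.DataFrame(daily).assign(city=city)
    for city, daily in daily_by_city.items()
]

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
combined = pd.concat(frames, ignore_index=True)
```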

Change the items in the `time` column to datetime objects:

In [9]:
df['time'] = pd.to_datetime(df['time'], format='ISO8601')
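
A quick way to confirm the conversion worked is to check the column's dtype and try a datetime accessor. This sketch runs on a toy dataframe and assumes date-only strings such as `2024-01-01`, so it uses an explicit `%Y-%m-%d` format rather than `'ISO8601'`.

```python
import pandas as pd

# Toy column of date strings (assumed date-only, like "2024-01-01").
toy = pd.DataFrame({"time": ["2024-01-01", "2024-02-29", "2024-12-31"]})
toy["time"] = pd.to_datetime(toy["time"], format="%Y-%m-%d")

# True once the column holds datetimes rather than strings.
is_datetime = pd.api.types.is_datetime64_any_dtype(toy["time"])

# The .dt accessor only works on datetime columns.
months = toy["time"].dt.month.tolist()
```

With a datetime column, later steps such as grouping by month or season can use `.dt` directly instead of parsing strings.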

## Save Dataframe as a CSV

The `os` module is used to create the data folder and to build the path where the CSV is saved.

```python
os.makedirs("data", exist_ok=True)
```
This line creates the data folder. Passing `exist_ok=True` means that if the folder already exists it is used as is instead of raising an error.


```python
csv_path = os.path.join("data", "rain_data_2024.csv")
```
This line joins the "data" folder name and the file name using the operating system's path separator. On macOS and Linux, `csv_path` is now `data/rain_data_2024.csv`.

In [16]:
import os

os.makedirs("data", exist_ok=True)

csv_path = os.path.join("data", "rain_data_2024.csv")
df.to_csv(csv_path, index=True)
print(f"Clean data saved to {csv_path}")

Clean data saved to data/rain_data_2024.csv
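
To check that the file can be loaded again in a later notebook, the save can be followed by a read-back. This is a minimal sketch on a toy dataframe written to a temporary folder, so it does not touch the real `data` folder; `toy` and `loaded` are hypothetical names.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe with the same kinds of columns as the real one (made-up values).
toy = pd.DataFrame({
    "time": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "rain_sum": [1.0, 0.0],
    "city": ["london", "london"],
})

with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "data")
    os.makedirs(folder, exist_ok=True)
    csv_path = os.path.join(folder, "rain_data_2024.csv")
    toy.to_csv(csv_path, index=True)

    # index_col=0 reads the saved index back as the index, not as a column;
    # parse_dates turns the "time" strings back into datetimes.
    loaded = pd.read_csv(csv_path, index_col=0, parse_dates=["time"])
```

Because the CSV is written with `index=True`, reading it without `index_col=0` would produce an extra unnamed column holding the old index.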

